This assignment reads in a dataset, describes its variables, and then demonstrates at least two basic data-wrangling operations.
The dataset used for this assignment is workshop_masterlist. It holds data on faculty who have participated in training offered by my office, and is shared here with permission and with identifiers removed.
The variables and their intended data types are:
variable | data type |
---|---|
id | char |
campus | char |
workshop | char |
workshop_status_id* | char |
workshop_date | date |
semester | char |
workshop_year | char |
*The variable workshop_status_id was intentionally left as a numeric id in the CSV, so that I could practice using mutate to recode it into readable labels.
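#load the tidyverse (provides read_csv, as_tibble, and the dplyr verbs used below)
library(tidyverse)
#read in dataset from csv and save as object
wm_csv <- read_csv("workshop_masterlist_2022-02-08.csv")
#create tibble (read_csv already returns a tibble, so this is effectively a no-op)
workshop_masterlist <- as_tibble(wm_csv)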
#create new column workshop_status
wm <- workshop_masterlist %>%
  mutate(workshop_status = case_when(
    workshop_status_id == 1 ~ "Pass",
    workshop_status_id == 2 ~ "No Pass",
    workshop_status_id == 3 ~ "Withdraw",
    workshop_status_id == 4 ~ "No Show",
    workshop_status_id == 5 ~ "Audit"
  ))
#check variable types
str(wm)
tibble [5,844 x 8] (S3: tbl_df/tbl/data.frame)
 $ id                : num [1:5844] 3955 3956 3957 3958 3959 ...
 $ campus            : chr [1:5844] "Brooklyn" "Brooklyn" "York" "Brooklyn" ...
 $ workshop          : chr [1:5844] "OTE" "OTE" "OTE" "OTE" ...
 $ workshop_status_id: num [1:5844] 1 2 1 1 1 1 1 2 1 4 ...
 $ workshop_date     : Date[1:5844], format: "2020-07-09" ...
 $ semester          : chr [1:5844] "Summer 2020" "Summer 2020" "Summer 2020" "Summer 2020" ...
 $ workshop_year     : num [1:5844] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
 $ workshop_status   : chr [1:5844] "Pass" "No Pass" "Pass" "Pass" ...
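One caveat with the case_when recode above: any row that matches none of the conditions gets NA, so a status id outside 1 to 5 would silently lose its label. The following is a minimal sketch of how that could be checked, using a small made-up tibble (the toy values, including the id 9, are assumptions and do not come from the dataset).

```r
library(tibble)
library(dplyr)

# Toy data standing in for workshop_masterlist; the values are made up
toy <- tibble(workshop_status_id = c(1, 2, 4, 9))

toy_recoded <- toy %>%
  mutate(workshop_status = case_when(
    workshop_status_id == 1 ~ "Pass",
    workshop_status_id == 2 ~ "No Pass",
    workshop_status_id == 3 ~ "Withdraw",
    workshop_status_id == 4 ~ "No Show",
    workshop_status_id == 5 ~ "Audit"
  ))

# Tabulate ids against labels; an NA label flags an id with no matching condition
toy_recoded %>% count(workshop_status_id, workshop_status)

# Number of rows whose id was not covered by any condition
unmapped <- sum(is.na(toy_recoded$workshop_status))
unmapped
```

Running the same count() on wm would presumably show one row per id and label pair, with no NA labels if every id falls in 1 to 5.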
#select desired columns in new order, and change incorrect variable types
wm_tidy <- wm %>%
  select(id:workshop, workshop_status, workshop_year) %>%
  mutate(id = as.character(id),
         workshop_year = as.character(workshop_year))
#check variable types
str(wm_tidy)
tibble [5,844 x 5] (S3: tbl_df/tbl/data.frame)
 $ id             : chr [1:5844] "3955" "3956" "3957" "3958" ...
 $ campus         : chr [1:5844] "Brooklyn" "Brooklyn" "York" "Brooklyn" ...
 $ workshop       : chr [1:5844] "OTE" "OTE" "OTE" "OTE" ...
 $ workshop_status: chr [1:5844] "Pass" "No Pass" "Pass" "Pass" ...
 $ workshop_year  : chr [1:5844] "2020" "2020" "2020" "2020" ...
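The as.character() conversions above fix the types after import. An alternative, sketched here, is to declare the types when reading the file via the col_types argument of read_csv. The two-row CSV text is made up so the example runs without the real file, and wrapping it in I() to mark it as literal data assumes readr 2.0 or later.

```r
library(readr)

# Made-up two-row CSV with the same column names as the dataset
csv_text <- I("id,campus,workshop,workshop_status_id,workshop_date,semester,workshop_year
3955,Brooklyn,OTE,1,2020-07-09,Summer 2020,2020
3957,York,OTE,1,2020-07-09,Summer 2020,2020")

toy_csv <- read_csv(
  csv_text,
  col_types = cols(
    id = col_character(),            # ids are labels, not quantities
    workshop_year = col_character(), # treat the year as a category
    workshop_date = col_date(),      # ISO 8601 dates parse with the default format
    .default = col_guess()           # let readr guess the remaining columns
  )
)

str(toy_csv)
```

With the real file, the same col_types specification could be passed alongside the file name in place of csv_text.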
#filter for all participants who passed, and arrange by campus and then year
wm_tidy %>%
  filter(workshop_status == "Pass") %>%
  arrange(campus, workshop_year)
# A tibble: 4,740 x 5
   id    campus workshop workshop_status workshop_year
   <chr> <chr>  <chr>    <chr>           <chr>
 1 239   Baruch PTO      Pass            2011
 2 364   Baruch PTO      Pass            2011
 3 430   Baruch PTO      Pass            2011
 4 636   Baruch PTO      Pass            2011
 5 684   Baruch PTO      Pass            2011
 6 864   Baruch PTO      Pass            2011
 7 959   Baruch PTO      Pass            2011
 8 1048  Baruch PTO      Pass            2011
 9 1061  Baruch PTO      Pass            2011
10 1063  Baruch PTO      Pass            2011
# ... with 4,730 more rows
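The filtered result is still one row per participant. If a summary per campus and year is more useful, the same filter can feed into count(). This is a minimal sketch on made-up rows that mirror the columns of wm_tidy; the toy values and the column name n_passed are assumptions.

```r
library(tibble)
library(dplyr)

# Made-up rows with the same columns as wm_tidy
toy_tidy <- tribble(
  ~id, ~campus,  ~workshop, ~workshop_status, ~workshop_year,
  "1", "Baruch", "PTO",     "Pass",           "2011",
  "2", "Baruch", "PTO",     "Pass",           "2011",
  "3", "Baruch", "OTE",     "No Pass",        "2020",
  "4", "York",   "OTE",     "Pass",           "2020"
)

# Keep passing rows, then count them per campus and year
pass_counts <- toy_tidy %>%
  filter(workshop_status == "Pass") %>%
  count(campus, workshop_year, name = "n_passed")

pass_counts
```

count() returns its groups in sorted order, so a separate arrange(campus, workshop_year) step is not needed here.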
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Collazo (2022, Feb. 9). Data Analytics and Computational Social Science: HW2. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomlcollazohw2/
BibTeX citation
@misc{collazo2022hw2,
  author = {Collazo, Laura},
  title = {Data Analytics and Computational Social Science: HW2},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomlcollazohw2/},
  year = {2022}
}