This assignment identifies the dataset I will be using for the final project, reads it into R, cleans it, and identifies potential research questions it can help me answer.
The dataset I’ve chosen for my final project is the Program for International Student Assessment (PISA) 2018 student data. It’s a large dataset containing 1,119 variables and 612,004 observations from 80 countries.
Because the dataset is so large, I ran out of available memory when trying to knit my file. After many hours of troubleshooting, I decided the best way to reduce the memory needed was to write a CSV containing only the variables I want to examine and then read that file back in.
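An alternative to the write-then-read workaround is to keep only the needed columns at read time. The following is a minimal sketch, assuming the full data were available as a CSV (an assumption; this post does not state the original file format) and assuming readr 2.0 or later for the `col_select` argument. The temporary file and the `UNUSED_VAR` column are hypothetical stand-ins.

```r
library(readr)

# Hypothetical stand-in for the full PISA file: a tiny CSV with one extra column
full_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    CNT        = c("ESP", "USA", "ESP"),
    ST001D01T  = c(10, 9, 10),
    ST004D01T  = c(2, 1, 1),
    UNUSED_VAR = c(1, 2, 3)
  ),
  full_path
)

# Keep only the columns of interest when reading; UNUSED_VAR is dropped
pisa_subset <- read_csv(
  full_path,
  col_select = c(CNT, ST001D01T, ST004D01T),
  show_col_types = FALSE
)

names(pisa_subset)
```

Whether this saves enough memory for a file of this size would need to be checked against the real data.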
Before moving on to cleaning the data, I spent a couple of hours examining the codebook to decide which variables I wanted to explore. It took so long because every variable sounded interesting! Once I finally made my decision, I created a mini-codebook in Excel with just the details for these variables. In this file, I used Excel's CONCATENATE function to quickly prepare the code for the select() statement below.
# select only desired variables and filter country for Spain
pisa_smaller <- pisa %>%
  select(c(CNT,
           ST001D01T,
           ST004D01T,
           ST197Q01HA,
           ST197Q02HA,
           ST197Q04HA,
           ST197Q07HA,
           ST197Q08HA,
           ST197Q09HA,
           ST197Q12HA,
           ST220Q01HA,
           ST220Q02HA,
           ST220Q03HA,
           ST220Q04HA,
           ST177Q01HA,
           ST019AQ01T,
           ST021Q01TA)) %>%
  filter(CNT == "ESP")
#check work
head(pisa_smaller)
#write csv
write_csv(pisa_smaller, "pisa_smaller_2022-2-20.csv")
#read in condensed version of pisa data & examine
pisa <- read_csv("pisa_smaller_2022-2-20.csv")
head(pisa)
# A tibble: 6 x 17
CNT ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ESP 10 2 4 4 4
2 ESP 9 1 3 2 3
3 ESP 10 2 4 3 3
4 ESP 8 2 2 1 3
5 ESP 10 1 NA NA NA
6 ESP 10 1 4 2 3
# ... with 11 more variables: ST197Q07HA <dbl>, ST197Q08HA <dbl>,
# ST197Q09HA <dbl>, ST197Q12HA <dbl>, ST220Q01HA <dbl>,
# ST220Q02HA <dbl>, ST220Q03HA <dbl>, ST220Q04HA <dbl>,
# ST177Q01HA <dbl>, ST019AQ01T <dbl>, ST021Q01TA <dbl>
tail(pisa)
# A tibble: 6 x 17
CNT ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ESP 10 1 4 4 4
2 ESP 9 2 3 3 3
3 ESP 10 2 4 4 4
4 ESP 9 2 2 2 2
5 ESP 8 2 3 3 3
6 ESP 9 1 2 2 2
# ... with 11 more variables: ST197Q07HA <dbl>, ST197Q08HA <dbl>,
# ST197Q09HA <dbl>, ST197Q12HA <dbl>, ST220Q01HA <dbl>,
# ST220Q02HA <dbl>, ST220Q03HA <dbl>, ST220Q04HA <dbl>,
# ST177Q01HA <dbl>, ST019AQ01T <dbl>, ST021Q01TA <dbl>
At this point I created new variable names and added them to my codebook. Then I again used CONCATENATE to quickly create the code I pasted into rename(). I believe a join could have been used here instead; however, I stuck with this simpler approach for now, as I’m still trying to get the basics down.
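One codebook-driven alternative to pasting generated code is to pass a named character vector to rename(). This is a minimal sketch, assuming dplyr 1.0 or later; the `lookup` vector and the `toy` tibble are hypothetical and cover only three of the seventeen variables.

```r
library(dplyr)

# Hypothetical mini-codebook: new name on the left, original PISA code on the right
lookup <- c(
  country = "CNT",
  grade   = "ST001D01T",
  gender  = "ST004D01T"
)

# Toy data standing in for the Spain subset
toy <- tibble(CNT = "ESP", ST001D01T = 10, ST004D01T = 2)

# all_of() applies every new = old pair in the lookup in a single call
toy_renamed <- toy %>%
  rename(all_of(lookup))

names(toy_renamed)
```

If the mini-codebook were saved as a CSV with one column of new names and one of original codes (hypothetical column names `new` and `old`), the lookup could be built with `setNames(codebook$old, codebook$new)` rather than typed by hand.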
#rename variables
pisa_tidy <- pisa %>%
  rename(country = CNT,
         grade = ST001D01T,
         gender = ST004D01T,
         informed_climate_change = ST197Q01HA,
         informed_global_health = ST197Q02HA,
         informed_migration = ST197Q04HA,
         informed_international_conflict = ST197Q07HA,
         informed_world_hunger = ST197Q08HA,
         informed_poverty_causes = ST197Q09HA,
         informed_gender_equality = ST197Q12HA,
         countries_family = ST220Q01HA,
         countries_school = ST220Q02HA,
         countries_neighbourhood = ST220Q03HA,
         countries_friends = ST220Q04HA,
         language_self = ST177Q01HA,
         country_born = ST019AQ01T,
         country_arrival_age = ST021Q01TA) %>%
  # remove NAs - the only variable it makes sense to keep NA is country_arrival_age
  filter(complete.cases(.[-17])) %>%
  # recode values
  mutate(country = recode(country, ESP = "Spain")) %>%
  mutate(gender = recode(gender, `1` = "Female", `2` = "Male")) %>%
  mutate(country_born = recode(country_born, `1` = "Spain", `2` = "other")) %>%
  mutate(informed_climate_change = recode(informed_climate_change,
    `1` = "I have never heard of this",
    `2` = "I have heard about this but I would not be able to explain what it is really about",
    `3` = "I know something about this and could explain the general issue",
    `4` = "I am familiar with this and I would be able to explain this well")) %>%
  mutate(informed_global_health = recode(informed_global_health,
    `1` = "I have never heard of this",
    `2` = "I have heard about this but I would not be able to explain what it is really about",
    `3` = "I know something about this and could explain the general issue",
    `4` = "I am familiar with this and I would be able to explain this well")) %>%
  mutate(informed_migration = recode(informed_migration,
    `1` = "I have never heard of this",
    `2` = "I have heard about this but I would not be able to explain what it is really about",
    `3` = "I know something about this and could explain the general issue",
    `4` = "I am familiar with this and I would be able to explain this well")) %>%
  mutate(informed_international_conflict = recode(informed_international_conflict,
    `1` = "I have never heard of this",
    `2` = "I have heard about this but I would not be able to explain what it is really about",
    `3` = "I know something about this and could explain the general issue",
    `4` = "I am familiar with this and I would be able to explain this well")) %>%
  mutate(informed_world_hunger = recode(informed_world_hunger,
    `1` = "I have never heard of this",
    `2` = "I have heard about this but I would not be able to explain what it is really about",
    `3` = "I know something about this and could explain the general issue",
    `4` = "I am familiar with this and I would be able to explain this well")) %>%
  mutate(informed_poverty_causes = recode(informed_poverty_causes,
    `1` = "I have never heard of this",
    `2` = "I have heard about this but I would not be able to explain what it is really about",
    `3` = "I know something about this and could explain the general issue",
    `4` = "I am familiar with this and I would be able to explain this well")) %>%
  mutate(informed_gender_equality = recode(informed_gender_equality,
    `1` = "I have never heard of this",
    `2` = "I have heard about this but I would not be able to explain what it is really about",
    `3` = "I know something about this and could explain the general issue",
    `4` = "I am familiar with this and I would be able to explain this well")) %>%
  mutate(countries_family = recode(countries_family, `1` = "Yes", `2` = "No")) %>%
  mutate(countries_school = recode(countries_school, `1` = "Yes", `2` = "No")) %>%
  mutate(countries_neighbourhood = recode(countries_neighbourhood, `1` = "Yes", `2` = "No")) %>%
  mutate(countries_friends = recode(countries_friends, `1` = "Yes", `2` = "No")) %>%
  mutate(language_self = recode(language_self, `1` = "One", `2` = "Two", `3` = "Three", `4` = "Four")) %>%
  mutate(country_arrival_age = recode(country_arrival_age,
    `1` = "Age 0 - 1",
    `2` = "Age 1",
    `3` = "Age 2",
    `4` = "Age 3",
    `5` = "Age 4",
    `6` = "Age 5",
    `7` = "Age 6",
    `8` = "Age 7",
    `9` = "Age 8",
    `10` = "Age 9",
    `11` = "Age 10",
    `12` = "Age 11",
    `13` = "Age 12",
    `14` = "Age 13",
    `15` = "Age 14",
    `16` = "Age 15",
    `17` = "Age 16")) %>%
  mutate(grade = recode(grade,
    `7` = "Grade 7",
    `8` = "Grade 8",
    `9` = "Grade 9",
    `10` = "Grade 10",
    `11` = "Grade 11",
    `12` = "Grade 12",
    `13` = "Grade 13"))
To make the above code easier to read, a better approach would have been to use mutate() with across() and case_when(). After many attempts, I was unable to get this to work, but I plan to come back to it so I can learn how to use this approach going forward.
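For reference, here is a minimal sketch of what the mutate(), across(), and case_when() combination could look like, assuming dplyr 1.0 or later. The `toy` tibble and its values are hypothetical; only the column names and response labels are taken from the recoding code above.

```r
library(dplyr)

# Hypothetical toy data using a few of the renamed columns
toy <- tibble(
  informed_climate_change = c(4, 3, NA),
  informed_migration      = c(1, 2, 4),
  countries_family        = c(1, 2, 1)
)

toy_labelled <- toy %>%
  mutate(
    # Apply the same four-level labelling to every informed_* column
    across(starts_with("informed_"), ~ case_when(
      .x == 1 ~ "I have never heard of this",
      .x == 2 ~ "I have heard about this but I would not be able to explain what it is really about",
      .x == 3 ~ "I know something about this and could explain the general issue",
      .x == 4 ~ "I am familiar with this and I would be able to explain this well"
    )),
    # Apply the Yes/No labelling to every countries_* column
    across(starts_with("countries_"), ~ case_when(
      .x == 1 ~ "Yes",
      .x == 2 ~ "No"
    ))
  )

toy_labelled
```

Values that match none of the conditions, including NA, are left as NA, so missing responses stay missing after recoding.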
#check work
pisa_tidy
# A tibble: 26,573 x 17
country grade gender informed_climate_change informed_global~
<chr> <chr> <chr> <chr> <chr>
1 Spain Grade 10 Male I am familiar with this a~ I am familiar w~
2 Spain Grade 9 Female I know something about th~ I have heard ab~
3 Spain Grade 10 Male I am familiar with this a~ I know somethin~
4 Spain Grade 8 Male I have heard about this b~ I have never he~
5 Spain Grade 10 Female I am familiar with this a~ I have heard ab~
6 Spain Grade 9 Male I know something about th~ I have heard ab~
7 Spain Grade 10 Male I know something about th~ I know somethin~
8 Spain Grade 10 Female I have heard about this b~ I have heard ab~
9 Spain Grade 10 Male I have heard about this b~ I know somethin~
10 Spain Grade 10 Male I know something about th~ I know somethin~
# ... with 26,563 more rows, and 12 more variables:
# informed_migration <chr>, informed_international_conflict <chr>,
# informed_world_hunger <chr>, informed_poverty_causes <chr>,
# informed_gender_equality <chr>, countries_family <chr>,
# countries_school <chr>, countries_neighbourhood <chr>,
# countries_friends <chr>, language_self <chr>, country_born <chr>,
# country_arrival_age <chr>
My original plan with this data was to examine differences in how students from Spain and the United States responded to certain variables. However, after examining the data in more detail I realized students from the USA did not respond to the variables I was interested in.
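My new plan is to examine whether students in Spain feel better informed on seven topics (all character variables): climate change, global health, migration, international conflict, world hunger, causes of poverty, and gender equality.
Depending on (all character variables):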
Essentially, I am curious whether exposure to other cultures and languages increases the likelihood that a student living in Spain feels better informed on the above topics. I would ultimately love to expand this to all countries that responded to the selected variables, to see whether what I find in Spain holds true everywhere. I think this would be a much more interesting research question! However, I believe my current R skill level calls for a much simpler analysis at this time. I did leave the variable “country”
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Collazo (2022, Feb. 23). Data Analytics and Computational Social Science: HW3. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomlcollazohw3/
BibTeX citation
@misc{collazo2022hw3, author = {Collazo, Laura}, title = {Data Analytics and Computational Social Science: HW3}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomlcollazohw3/}, year = {2022} }