Data Analytics and Computational Social Science: HW3

Laura Collazo

Dataset

The dataset I’ve chosen for my final project is the Program for International Student Assessment (PISA) 2018 student data. It’s a large dataset containing 1,119 variables and 612,004 observations from 80 countries.

Because of the size of this dataset, I was unable to knit my file: R ran out of available memory. After many hours of troubleshooting, I decided the best way to reduce the memory needed was to write a csv containing only the variables I want to examine and then read that file back in.
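A minimal sketch of that workaround, using a small made-up tibble and a temporary file in place of the real PISA data. The object names `big`, `small`, `tmp`, and `reloaded`, and the `EXTRA_VAR` column, are assumptions for illustration only.

```r
library(dplyr)
library(readr)

# Toy stand-in for the full dataset (values are made up)
big <- tibble(
  CNT       = c("ESP", "ESP", "USA"),
  ST001D01T = c(10, 9, 10),
  EXTRA_VAR = c(1, 2, 3)  # a column that is not needed
)

# Keep only the columns and rows of interest
small <- big %>%
  select(CNT, ST001D01T) %>%
  filter(CNT == "ESP")

# Write the reduced data to disk, then read it back in
tmp <- tempfile(fileext = ".csv")
write_csv(small, tmp)
reloaded <- read_csv(tmp, show_col_types = FALSE)

# Remove the large object so its memory can be reclaimed
rm(big)
```

One trade-off worth keeping in mind: a csv stores only the raw values, so any value labels attached when the SAS file was imported are likely not carried over.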

Read in/clean dataset

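#load packages: haven provides read_sas(); tidyverse provides %>%, select(), filter(), write_csv() and read_csv()

library(haven)

library(tidyverse)

#read in SAS & examine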

pisa <- read_sas("cy07_msu_stu_qqq.sas7bdat", "CY07MSU_FMT_STU_QQQ.SAS7BCAT", encoding = NULL, .name_repair = "unique")

head(pisa)

tail(pisa)

unique(pisa[c("CNT")])

Before moving on to cleaning the data, I spent a couple of hours examining the codebook to decide which variables I wanted to explore. It took so long because every variable sounded interesting! Once I finally made my decision, I created a mini-codebook in Excel with just the details for these variables. In this file, I used CONCATENATE to quickly prepare the code for the select() statement below.

# select only desired variables and filter country for Spain

pisa_smaller <- pisa %>% 
  
select(c(CNT,
ST001D01T,
ST004D01T,
ST197Q01HA,
ST197Q02HA,
ST197Q04HA,
ST197Q07HA,
ST197Q08HA,
ST197Q09HA,
ST197Q12HA,
ST220Q01HA,
ST220Q02HA,
ST220Q03HA,
ST220Q04HA,
ST177Q01HA,
ST019AQ01T,
ST021Q01TA)) %>%
  
filter(CNT == "ESP")
  
#check work
 
head(pisa_smaller)
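As an alternative to pasting each name into select(), the variable names could be kept in a character vector and passed through all_of(). This is only a sketch on a toy tibble; `keep_vars`, `toy`, and the `OTHER` column are hypothetical names.

```r
library(dplyr)

# Toy data with one column that should be dropped
toy <- tibble(
  CNT       = c("ESP", "FRA"),
  ST001D01T = c(10, 9),
  ST004D01T = c(2, 1),
  OTHER     = c(0, 0)
)

# Variable names of interest, e.g. copied from a codebook column
keep_vars <- c("CNT", "ST001D01T", "ST004D01T")

# all_of() selects exactly these columns and errors if one is missing
toy_smaller <- toy %>%
  select(all_of(keep_vars)) %>%
  filter(CNT == "ESP")
```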

#write csv

write_csv(pisa_smaller, "pisa_smaller_2022-2-20.csv")

#read in condensed version of pisa data & examine

pisa <- read_csv("pisa_smaller_2022-2-20.csv")

head(pisa)

# A tibble: 6 x 17
  CNT   ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
  <chr>     <dbl>     <dbl>      <dbl>      <dbl>      <dbl>
1 ESP          10         2          4          4          4
2 ESP           9         1          3          2          3
3 ESP          10         2          4          3          3
4 ESP           8         2          2          1          3
5 ESP          10         1         NA         NA         NA
6 ESP          10         1          4          2          3
# ... with 11 more variables: ST197Q07HA <dbl>, ST197Q08HA <dbl>,
#   ST197Q09HA <dbl>, ST197Q12HA <dbl>, ST220Q01HA <dbl>,
#   ST220Q02HA <dbl>, ST220Q03HA <dbl>, ST220Q04HA <dbl>,
#   ST177Q01HA <dbl>, ST019AQ01T <dbl>, ST021Q01TA <dbl>

tail(pisa)

# A tibble: 6 x 17
  CNT   ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
  <chr>     <dbl>     <dbl>      <dbl>      <dbl>      <dbl>
1 ESP          10         1          4          4          4
2 ESP           9         2          3          3          3
3 ESP          10         2          4          4          4
4 ESP           9         2          2          2          2
5 ESP           8         2          3          3          3
6 ESP           9         1          2          2          2
# ... with 11 more variables: ST197Q07HA <dbl>, ST197Q08HA <dbl>,
#   ST197Q09HA <dbl>, ST197Q12HA <dbl>, ST220Q01HA <dbl>,
#   ST220Q02HA <dbl>, ST220Q03HA <dbl>, ST220Q04HA <dbl>,
#   ST177Q01HA <dbl>, ST019AQ01T <dbl>, ST021Q01TA <dbl>

At this point I created new variable names and added them to my codebook. Then, I again used CONCATENATE in Excel to quickly create the code I pasted into rename(). I believe a join could have been used here instead; however, I stuck with this simpler approach for now as I’m still trying to get the basics down.
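One way a codebook could drive the renaming directly, without a join, is a named lookup vector passed to rename() through all_of(). The sketch below assumes a hypothetical two-column codebook (`old_name`, `new_name`) and a one-row toy tibble; it is not the approach used in this post.

```r
library(dplyr)

# Hypothetical mini-codebook: one row per variable
codebook <- tibble(
  old_name = c("CNT", "ST001D01T", "ST004D01T"),
  new_name = c("country", "grade", "gender")
)

toy <- tibble(CNT = "ESP", ST001D01T = 10, ST004D01T = 2)

# Named vector of the form c(new_name = "old_name")
lookup <- setNames(codebook$old_name, codebook$new_name)

# rename() accepts the named vector when wrapped in all_of()
toy_renamed <- toy %>%
  rename(all_of(lookup))
```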

#rename variables

pisa_tidy <- pisa %>%
  
rename(country=CNT,
grade=ST001D01T,
gender=ST004D01T,
informed_climate_change=ST197Q01HA,
informed_global_health=ST197Q02HA,
informed_migration=ST197Q04HA,
informed_international_conflict=ST197Q07HA,
informed_world_hunger=ST197Q08HA,
informed_poverty_causes=ST197Q09HA,
informed_gender_equality=ST197Q12HA,
countries_family=ST220Q01HA,
countries_school=ST220Q02HA,
countries_neighbourhood=ST220Q03HA,
countries_friends=ST220Q04HA,
language_self=ST177Q01HA,
country_born=ST019AQ01T,
country_arrival_age=ST021Q01TA) %>%

#remove NAs- the only variable it makes sense to keep NA is country_arrival_age

filter(complete.cases(.[-17])) %>%

#recode values
  
mutate(country = recode(country, ESP = "Spain")) %>%
  
mutate(gender = recode(gender, `1` = "Female" , `2` = "Male")) %>%
  
mutate(country_born = recode(country_born, `1` = "Spain", `2` = "other")) %>%
  
mutate(informed_climate_change = recode(informed_climate_change, 
      `1` = "I have never heard of this", 
      `2` = "I have heard about this but I would not be able to explain what it is really about",
      `3` = "I know something about this and could explain the general issue", 
      `4` = "I am familiar with this and I would be able to explain this well")) %>%

mutate(informed_global_health = recode(informed_global_health, 
      `1` = "I have never heard of this", 
      `2` = "I have heard about this but I would not be able to explain what it is really about", 
      `3` = "I know something about this and could explain the general issue", 
      `4` = "I am familiar with this and I would be able to explain this well"))%>%

mutate(informed_migration = recode(informed_migration, 
      `1` = "I have never heard of this", 
      `2` = "I have heard about this but I would not be able to explain what it is really about", 
      `3` = "I know something about this and could explain the general issue", 
      `4` = "I am familiar with this and I would be able to explain this well"))%>%

mutate(informed_international_conflict = recode(informed_international_conflict,
      `1` = "I have never heard of this", 
      `2` = "I have heard about this but I would not be able to explain what it is really about", 
      `3` = "I know something about this and could explain the general issue", 
      `4` = "I am familiar with this and I would be able to explain this well"))%>%

mutate(informed_world_hunger = recode(informed_world_hunger, 
      `1` = "I have never heard of this", 
      `2` = "I have heard about this but I would not be able to explain what it is really about", 
      `3` = "I know something about this and could explain the general issue", 
      `4` = "I am familiar with this and I would be able to explain this well"))%>%

mutate(informed_poverty_causes = recode(informed_poverty_causes, 
      `1` = "I have never heard of this", 
      `2` = "I have heard about this but I would not be able to explain what it is really about", 
      `3` = "I know something about this and could explain the general issue", 
      `4` = "I am familiar with this and I would be able to explain this well"))%>%

mutate(informed_gender_equality = recode(informed_gender_equality, 
      `1` = "I have never heard of this", 
      `2` = "I have heard about this but I would not be able to explain what it is really about", 
      `3` = "I know something about this and could explain the general issue", 
      `4` = "I am familiar with this and I would be able to explain this well"))%>%

mutate(countries_family = recode(countries_family, `1` = "Yes", `2` = "No")) %>%

mutate(countries_school = recode(countries_school, `1` = "Yes", `2` = "No")) %>%
  
mutate(countries_neighbourhood = recode(countries_neighbourhood, `1` = "Yes", `2` = "No")) %>%
  
mutate(countries_friends = recode(countries_friends, `1` = "Yes", `2` = "No")) %>%
  
mutate(language_self = recode(language_self, `1` = "One", `2` = "Two", `3` = "Three", `4` = "Four")) %>%
  
mutate(country_arrival_age = recode(country_arrival_age, 
`1` = "Age 0 - 1",
`2` = "Age 1",
`3` = "Age 2",
`4` = "Age 3",
`5` = "Age 4",
`6` = "Age 5",
`7` = "Age 6",
`8` = "Age 7",
`9` = "Age 8",
`10` = "Age 9",
`11` = "Age 10",
`12` = "Age 11",
`13` = "Age 12",
`14` = "Age 13",
`15` = "Age 14",
`16` = "Age 15",
`17` = "Age 16"
 )) %>%
  
mutate(grade = recode(grade,
`7` = "Grade 7",
`8` = "Grade 8",
`9` = "Grade 9",
`10` = "Grade 10",
`11` = "Grade 11",
`12` = "Grade 12",
`13` = "Grade 13"))

To make the above code easier to read, a better approach would have been to use mutate() with across() and case_when(). After many attempts, I was unable to get this to work, but I plan to come back to it so I can use this approach going forward.
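For reference, here is a minimal sketch of how across() and case_when() might be combined for the repeated recodes. It runs on a toy tibble rather than the PISA data, and the helper `label_informed()` is a hypothetical name introduced for illustration.

```r
library(dplyr)

# Toy data with the same coding scheme as the real columns
toy <- tibble(
  informed_climate_change = c(1, 4, NA),
  informed_migration      = c(2, 3, 1),
  countries_family        = c(1, 2, 2)
)

# Hypothetical helper: maps codes 1-4 to the response labels.
# Values that match no condition (including NA) stay NA.
label_informed <- function(x) {
  case_when(
    x == 1 ~ "I have never heard of this",
    x == 2 ~ "I have heard about this but I would not be able to explain what it is really about",
    x == 3 ~ "I know something about this and could explain the general issue",
    x == 4 ~ "I am familiar with this and I would be able to explain this well"
  )
}

toy_labelled <- toy %>%
  mutate(
    # apply the same labels to every column starting with "informed_"
    across(starts_with("informed_"), label_informed),
    # apply Yes/No labels to every column starting with "countries_"
    across(starts_with("countries_"), ~ recode(.x, `1` = "Yes", `2` = "No"))
  )
```

Because the columns share name prefixes, starts_with() picks up all seven informed_ variables and all four countries_ variables in two lines each.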

#check work
pisa_tidy

# A tibble: 26,573 x 17
   country grade    gender informed_climate_change    informed_global~
   <chr>   <chr>    <chr>  <chr>                      <chr>           
 1 Spain   Grade 10 Male   I am familiar with this a~ I am familiar w~
 2 Spain   Grade 9  Female I know something about th~ I have heard ab~
 3 Spain   Grade 10 Male   I am familiar with this a~ I know somethin~
 4 Spain   Grade 8  Male   I have heard about this b~ I have never he~
 5 Spain   Grade 10 Female I am familiar with this a~ I have heard ab~
 6 Spain   Grade 9  Male   I know something about th~ I have heard ab~
 7 Spain   Grade 10 Male   I know something about th~ I know somethin~
 8 Spain   Grade 10 Female I have heard about this b~ I have heard ab~
 9 Spain   Grade 10 Male   I have heard about this b~ I know somethin~
10 Spain   Grade 10 Male   I know something about th~ I know somethin~
# ... with 26,563 more rows, and 12 more variables:
#   informed_migration <chr>, informed_international_conflict <chr>,
#   informed_world_hunger <chr>, informed_poverty_causes <chr>,
#   informed_gender_equality <chr>, countries_family <chr>,
#   countries_school <chr>, countries_neighbourhood <chr>,
#   countries_friends <chr>, language_self <chr>, country_born <chr>,
#   country_arrival_age <chr>
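A note on the NA filter in the cleaning pipeline: complete.cases(.[-17]) excludes the last column by position, which would silently change meaning if the column order changed. A possible alternative, sketched here on toy data, is tidyr's drop_na() with the column excluded by name.

```r
library(dplyr)
library(tidyr)

# Toy data: NA is expected in country_arrival_age for students born in Spain
toy <- tibble(
  gender              = c(1, NA, 2),
  country_born        = c(1, 2, 2),
  country_arrival_age = c(NA, 5, NA)
)

# Drop rows with NA in any column except country_arrival_age
toy_complete <- toy %>%
  drop_na(-country_arrival_age)
```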

Potential research questions

My original plan with this data was to compare how students from Spain and the United States responded to certain questions. However, after examining the data in more detail, I realized students from the USA did not respond to the variables I was interested in.
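A check of this kind could look roughly like the sketch below, which computes the share of missing responses per country for each ST197 item. The data here are made up; only the column names follow the PISA variables used earlier.

```r
library(dplyr)

# Toy data: made-up responses for two countries
toy <- tibble(
  CNT        = c("ESP", "ESP", "USA", "USA"),
  ST197Q01HA = c(4, NA, NA, NA),
  ST197Q02HA = c(3, 2, NA, NA)
)

# Share of missing responses per country for each ST197 item
missing_by_country <- toy %>%
  group_by(CNT) %>%
  summarise(across(starts_with("ST197"), ~ mean(is.na(.x))))
```

A share of 1 for a country means no student there has a recorded response to that item.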

My new plan is to examine whether students in Spain feel they are better informed on seven different topics (all character variables):

How informed are you about the following topics? Climate change and global warming (informed_climate_change)
How informed are you about the following topics? Global health (e.g. epidemics) (informed_global_health)
How informed are you about the following topics? Migration (movement of people) (informed_migration)
How informed are you about the following topics? International conflicts (informed_international_conflict)
How informed are you about the following topics? Hunger or malnutrition in different parts of the world (informed_world_hunger)
How informed are you about the following topics? Causes of poverty (informed_poverty_causes)
How informed are you about the following topics? Equality between men and women in different parts of the world (informed_gender_equality)

Depending on the following (all character variables):

Whether they were born in Spain or another country (country_born)
What their arrival age in Spain was if they were born in another country (country_arrival_age)
How many languages they speak well enough to converse with others (language_self)
If they have contact with people from other countries:
- in their family (countries_family)
- at school (countries_school)
- in their neighborhood (countries_neighbourhood)
- in their circle of friends (countries_friends)

Essentially, I am curious whether exposure to other cultures and languages increases the likelihood that a student living in Spain feels better informed on the above topics. I would ultimately love to expand this to look at all countries that responded to the selected variables to see if what I find in Spain holds true everywhere. I think this would be a much more interesting research question! However, I believe my current R skill level requires a much simpler analysis at this time. I did leave the variable “country” in my dataset, though, in case I reach the point where I feel good about expanding my research. The variables “grade” and “gender” were also left in for the same reason.
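As a first descriptive step toward this question, one could tabulate the share of each response within each exposure group. The sketch below uses made-up data and shortened placeholder labels ("high", "low") instead of the full response text; it is a descriptive table only, not a test of the research question.

```r
library(dplyr)

# Toy data with placeholder response labels
toy <- tibble(
  country_born       = c("Spain", "Spain", "Spain", "other", "other"),
  informed_migration = c("high", "low", "low", "high", "high")
)

# Count each response within each group, then convert to within-group shares
response_share <- toy %>%
  count(country_born, informed_migration) %>%
  group_by(country_born) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
```

The same pattern would apply to any of the exposure variables (language_self, countries_family, and so on) paired with any of the seven informed_ variables.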
