Data Analytics and Computational Social Science: HW6

Laura Collazo

Introduction

A year ago my family made our third international move and my kids, ages 12 and 14, began learning a third language. This experience, coupled with my own research interests, often has me wondering how exposure to different languages and cultures affects an individual.

In this analysis I explore whether there is a positive correlation between the number of languages a student in Spain speaks and how well informed they feel about climate change, global health, migration, international conflict, world hunger, causes of poverty, and gender inequality.

Data

The data for this analysis come from the Organisation for Economic Co-operation and Development’s 2018 Programme for International Student Assessment (PISA), which “measures 15-year-olds’ ability to use their reading, mathematics and science knowledge and skills to meet real-life challenges.” The complete 2018 dataset is made up of several questionnaires given to students, teachers, principals and parents; this analysis focuses on the student questionnaire. It’s a large dataset containing 1,119 variables and 612,004 observations from 80 countries. I relied on the dataset’s codebook to understand the data and to rename variables and values.

Read in Dataset

The dataset was originally read in from a very large SAS file. Because my computer did not have enough memory, the size of the dataset made the data difficult to work with. As a workaround, a csv containing a limited number of variables was written out and then read back in. Further discussion of this can be found in the Reflection section.

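#load packages used throughout (haven is not attached by library(tidyverse))

library(tidyverse)
library(haven)

#read in SAS file

pisa <- read_sas("cy07_msu_stu_qqq.sas7bdat", "CY07MSU_FMT_STU_QQQ.SAS7BCAT", encoding = NULL, .name_repair = "unique")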

#determine how many countries are in dataset

unique(pisa[c("CNT")])

# select desired variables and filter country for Spain

pisa_smaller <- pisa %>% 
  
select(c(CNT,ST001D01T,ST004D01T,ST197Q01HA,ST197Q02HA,ST197Q04HA,ST197Q07HA,ST197Q08HA,ST197Q09HA,ST197Q12HA,
         ST220Q01HA,ST220Q02HA,ST220Q03HA,ST220Q04HA,ST177Q01HA,ST019AQ01T,ST021Q01TA)) %>%
  
filter(CNT == "ESP")

#write csv

write_csv(pisa_smaller, "pisa_smaller_2022-2-20.csv")

Tidy data

To examine the research question, the dataset was filtered to responses from students living in Spain and reduced to eight character variables: the number of languages a student speaks well enough to converse with others (language_self) and how informed the student feels on each of the following topics:

Climate change and global warming (informed_climate_change)
Global health (e.g. epidemics) (informed_global_health)
Migration (movement of people) (informed_migration)
International conflicts (informed_international_conflict)
Hunger or malnutrition in different parts of the world (informed_world_hunger)
Causes of poverty (informed_poverty_causes)
Equality between men and women in different parts of the world (informed_gender_equality)

By filtering for students living in Spain, and removing NAs for the selected variables, the final number of observations used for this analysis totals 28,022 students. The tibble containing this data can be viewed in the appendix.

#read in csv

pisa <- read_csv("pisa_smaller_2022-2-20.csv")

#remove additional variables

pisa_tidy <- pisa %>%
  
select(-c("ST001D01T", "ST004D01T", "ST220Q01HA", "ST220Q02HA", "ST220Q03HA", "ST220Q04HA", "ST019AQ01T", "ST021Q01TA")) %>%
  
#rename variables
  
rename(country=CNT,
informed_climate_change=ST197Q01HA,
informed_global_health=ST197Q02HA,
informed_migration=ST197Q04HA,
informed_international_conflict=ST197Q07HA,
informed_world_hunger=ST197Q08HA,
informed_poverty_causes=ST197Q09HA,
informed_gender_equality=ST197Q12HA,
language_self=ST177Q01HA) %>%

#remove NAs

drop_na() %>%

#recode values
  
mutate(country = recode(country, ESP = "Spain")) %>%
  
mutate(informed_climate_change = recode(informed_climate_change, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_global_health = recode(informed_global_health, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_migration = recode(informed_migration, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_international_conflict = recode(informed_international_conflict,
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_world_hunger = recode(informed_world_hunger, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_poverty_causes = recode(informed_poverty_causes, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_gender_equality = recode(informed_gender_equality, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%
  
mutate(language_self = recode(language_self, 
      `1` = "One", 
      `2` = "Two", 
      `3` = "Three", 
      `4` = "Four +"))
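As a cross-check on the recoding above, the same code-to-label mapping can be sketched in base R by turning the numeric codes into factors whose levels are already in questionnaire order. This is only a minimal sketch on made-up data: the data frame `toy` and its values are hypothetical, and only the 1 to 4 coding and the labels are taken from the code above. Storing the answers as factors is an alternative to the character columns used in this analysis, not what the analysis itself does.

```r
# Hypothetical toy data standing in for two questionnaire columns (not PISA data)
toy <- data.frame(
  informed_migration = c(3, 1, 4, 2, 3),
  language_self      = c(2, 1, 4, 3, 2)
)

# Labels copied from the recode() calls above, in questionnaire order
informed_labels <- c("Not informed", "Not well informed", "Informed", "Well informed")
language_labels <- c("One", "Two", "Three", "Four +")

# factor() maps each numeric code to its label and fixes the level order in one step
toy$informed_migration <- factor(toy$informed_migration, levels = 1:4, labels = informed_labels)
toy$language_self      <- factor(toy$language_self, levels = 1:4, labels = language_labels)

levels(toy$language_self)
as.character(toy$informed_migration)
```

Because the level order is stored with the column, tables and plots built from such a factor would follow questionnaire order without a later reordering step.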

Examine Data

To gain an initial understanding of the selected variables, count and percent were calculated for each (see appendix) and univariate plots were created. The question asked of participants is included before each plot.
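The count and percent calculation itself lives in the appendix, so the following is only a sketch of how such a summary could be produced in base R, not the code used for this analysis. The responses are invented, and the object names `counts`, `percents` and `summary_table` are hypothetical.

```r
# Hypothetical responses; in the analysis the real values are in pisa_tidy$language_self
language_self <- factor(
  c("Two", "Three", "One", "Two", "Four +", "Three", "Two", "Three"),
  levels = c("One", "Two", "Three", "Four +")
)

counts   <- table(language_self)                # number of students per answer
percents <- round(prop.table(counts) * 100, 1)  # share of all students, in percent

# Collect both statistics in one small table
summary_table <- data.frame(
  language_self = names(counts),
  count         = as.vector(counts),
  percent       = as.vector(percents)
)
summary_table
```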

How many languages do you speak well enough to converse with others?

# plot for language_self

  ggplot(pisa_tidy, aes(x = fct_relevel(language_self, "One", "Two", "Three", "Four +"))) + 
  geom_bar (fill = "turquoise3", color = "black") +
  labs(x = "Languages Spoken",
       y = "Count",
       title = "Number of languages students speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text( size = 9),
          axis.text.y = element_text( size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How informed are you about the following topics? Climate change and global warming

#plot for informed_climate_change

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_climate_change, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3",color = "black") +
  labs(x = "Climate Change",
       y = "Count",
       title = "How informed students feel on climate change and global warming", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How informed are you about the following topics? Global health (e.g. epidemics)

#plot for informed_global_health

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_global_health, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3",color = "black") +
  labs(x = "Global Health",
       y = "Count",
       title = "How informed students feel on global health", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How informed are you about the following topics? Migration (movement of people)

#plot for informed_migration

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_migration, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3", color = "black") +
  labs(x = "Migration",
       y = "Count",
       title = "How informed students feel on migration", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How informed are you about the following topics? International conflicts

#plot for informed_international_conflict

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_international_conflict, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3", color = "black") +
  labs(x = "International Conflict",
       y = "Count",
       title = "How informed students feel on international conflicts", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How informed are you about the following topics? Hunger or malnutrition in different parts of the world

#plot for informed_world_hunger

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_world_hunger, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3", color = "black") +
  labs(x = "World Hunger",
       y = "Count",
       title = "How informed students feel on world hunger or malnutrition", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How informed are you about the following topics? Causes of poverty

#plot for informed_poverty_causes

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_poverty_causes, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3", color = "black") +
  labs(x = "Causes of Poverty",
       y = "Count",
       title = "How informed students feel on causes of poverty", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How informed are you about the following topics? Equality between men and women in different parts of the world

#plot for informed_gender_equality

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_gender_equality, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3", color = "black") +
  labs(x = "Gender Equality",
       y = "Count",
       title = "How informed students feel on gender equality", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

This examination shows that 85.7 percent of students speak two or more languages: 38.3 percent speak two, 37 percent speak three, and 10.4 percent speak four or more. For six of the seven topics, the most common response was “informed.” Gender equality is the exception, with more students responding that they felt well informed than just informed. The table below provides a summary of how students responded by percent.

Topic	Not informed	Not well informed	Informed
Climate change	1.93%	14.3%	59.3%
Global health	1.67%	25.9%	58.2%
Migration	1.61%	19.7%	59.2%
International conflict	2.62%	29.8%	49.1%
World hunger	1.23%	15.5%	58.7%
Causes of poverty	1.4%	18.7%	54.8%
Gender equality	1.3%	5.96%	41.2%
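The table lists three of the four answer options. As a rough check on the gender equality observation, the remaining “well informed” share can be estimated as the remainder to 100 percent. This is a sketch under one assumption: that the four answer options are exhaustive once NAs are dropped. Because the inputs are already rounded, the results are approximate, and the column name `well_informed_implied` is made up for this illustration.

```r
# Percentages copied from the table above
responses <- data.frame(
  topic = c("Climate change", "Global health", "Migration", "International conflict",
            "World hunger", "Causes of poverty", "Gender equality"),
  not_informed      = c(1.93, 1.67, 1.61, 2.62, 1.23, 1.40, 1.30),
  not_well_informed = c(14.3, 25.9, 19.7, 29.8, 15.5, 18.7, 5.96),
  informed          = c(59.3, 58.2, 59.2, 49.1, 58.7, 54.8, 41.2)
)

# Assumption: the share not shown in the table is whatever remains to 100 percent
responses$well_informed_implied <- round(
  100 - (responses$not_informed + responses$not_well_informed + responses$informed), 1
)

# Topics where the implied "well informed" share exceeds the "informed" share
responses$topic[responses$well_informed_implied > responses$informed]
```

Under this assumption, gender equality is the only topic returned, with an implied “well informed” share of roughly 51.5 percent against 41.2 percent “informed.”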

Visualizations

To explore my research question, seven bivariate plots with a facet grid were created using the variable language_self and each of the seven variables which ask how informed students feel about a certain topic. A breakdown of the summary statistics, count and percent, calculated for each combination of variables can be viewed in the appendix.
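In the code for these plots, the percent for each combination is calculated within each language group: summarise() drops the last grouping variable by default, so sum(count) is taken over students who speak the same number of languages. The same within-group percentages can be sketched in base R with a two-way table. The data below are invented for illustration and the object names are hypothetical.

```r
# Hypothetical toy responses (not PISA data)
toy <- data.frame(
  language_self = c("One", "One", "Two", "Two", "Two", "Two", "Three", "Three", "Four +", "Four +"),
  informed_climate_change = c("Not well informed", "Informed", "Informed", "Informed",
                              "Well informed", "Not well informed", "Informed", "Well informed",
                              "Not well informed", "Informed")
)

# Fix the level order so rows and columns follow questionnaire order
toy$language_self <- factor(toy$language_self, levels = c("One", "Two", "Three", "Four +"))
toy$informed_climate_change <- factor(
  toy$informed_climate_change,
  levels = c("Not informed", "Not well informed", "Informed", "Well informed")
)

# Two-way counts: one row per language group, one column per answer
counts <- table(language = toy$language_self, informed = toy$informed_climate_change)

# margin = 1 divides each row by its own total, so every language group sums to 100
row_percent <- prop.table(counts, margin = 1) * 100
round(row_percent, 1)
```

Reading across a row gives the distribution of answers among students who speak that many languages, which is what each facet of the plots compares.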

Through these initial plots, it’s observed that in all instances the more languages a student speaks, the more well informed they feel on the given topic. Looking a bit closer, another trend stands out among students who responded that they do not feel well informed on the topic. Here, in all cases, there is an uptick among students speaking four or more languages rather than a continued downward trend. This observation led to a second round of visualizations, which combined students who responded “not informed” and “not well informed” into one new value, “not informed,” and students who responded “informed” and “well informed” into one new value, “informed.” In these new visualizations, a slight drop is now observed in how informed students speaking four or more languages feel on three of the seven topics: climate change, world hunger, and gender equality.
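The collapsing step described above can be sketched in base R with a named lookup vector. This is a minimal illustration on invented answers, assuming the four labels used elsewhere in this analysis; `collapse_map` and `informed_2` are hypothetical names.

```r
# Hypothetical answers on the original four-point scale
informed <- c("Not informed", "Not well informed", "Informed", "Well informed", "Informed")

# Named lookup vector: each original answer points to its collapsed category
collapse_map <- c(
  "Not informed"      = "Not informed",
  "Not well informed" = "Not informed",
  "Informed"          = "Informed",
  "Well informed"     = "Informed"
)

# Index the lookup by the original answers, then store the result as a two-level factor
informed_2 <- factor(unname(collapse_map[informed]), levels = c("Not informed", "Informed"))
table(informed_2)
```

Collapsing trades detail for a simpler comparison: the two lower and two upper answers can no longer be told apart, which is why both sets of plots are shown.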

#create new object to calculate percent for language_self & informed_climate_change

language_climate_change <- select(pisa_tidy, "language_self", "informed_climate_change") %>%
  group_by(language_self, informed_climate_change) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_climate_change, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

#create plot with facet_grid

ggplot(language_climate_change,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_climate_change, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on climate change\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_climate_change, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create new object to combine "not informed" and "not well informed" to become "not informed" and "informed" and "well informed" to become "informed"

language_climate_change_2 <- pisa_tidy%>%
  mutate(informed_climate_change = recode(informed_climate_change, 
      `Not informed` = "Not informed", 
      `Not well informed` = "Not informed",
      `Informed` = "Informed", 
      `Well informed` = "Informed")) %>%
  group_by(language_self, informed_climate_change) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_climate_change, 
                         levels = c("Not informed", "Informed")))

#create plot with facet_grid

ggplot(language_climate_change_2,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_climate_change, 
                         levels = c("Not informed", "Informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on climate change\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_climate_change, levels = c("Not informed", "Informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create new object to calculate percent for language_self & informed_global_health

language_global_health <- select(pisa_tidy, "language_self", "informed_global_health") %>%
  group_by(language_self, informed_global_health) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_global_health, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

#create plot with facet_grid

ggplot(language_global_health,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_global_health, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on global health\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_global_health, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create new object to combine "not informed" and "not well informed" to become "not informed" and "informed" and "well informed" to become "informed"

language_global_health_2 <- pisa_tidy%>%
  mutate(informed_global_health = recode(informed_global_health, 
      `Not informed` = "Not informed", 
      `Not well informed` = "Not informed",
      `Informed` = "Informed", 
      `Well informed` = "Informed")) %>%
  group_by(language_self, informed_global_health) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_global_health, 
                         levels = c("Not informed", "Informed")))

#create plot with facet_grid

ggplot(language_global_health_2,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_global_health, 
                         levels = c("Not informed", "Informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on global health\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_global_health, levels = c("Not informed", "Informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create new object to calculate percent for language_self & informed_migration

language_migration <- select(pisa_tidy, "language_self", "informed_migration") %>%
  group_by(language_self, informed_migration) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_migration, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

#create plot with facet_grid

ggplot(language_migration,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_migration, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on migration\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_migration, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create new object to combine "not informed" and "not well informed" to become "not informed" and "informed" and "well informed" to become "informed"

language_migration_2 <- pisa_tidy%>%
  mutate(informed_migration = recode(informed_migration, 
      `Not informed` = "Not informed", 
      `Not well informed` = "Not informed",
      `Informed` = "Informed", 
      `Well informed` = "Informed")) %>%
  group_by(language_self, informed_migration) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_migration, 
                         levels = c("Not informed", "Informed")))

#create plot with facet_grid

ggplot(language_migration_2,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_migration, 
                         levels = c("Not informed", "Informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on migration\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_migration, levels = c("Not informed", "Informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create new object to calculate percent for language_self & informed_international_conflict

language_international_conflict <- select(pisa_tidy, "language_self", "informed_international_conflict") %>%
  group_by(language_self, informed_international_conflict) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_international_conflict, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

#create plot with facet_grid

ggplot(language_international_conflict,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_international_conflict, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on international conflict\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_international_conflict, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create new object to combine "not informed" and "not well informed" to become "not informed" and "informed" and "well informed" to become "informed"

language_international_conflict_2 <- pisa_tidy%>%
  mutate(informed_international_conflict = recode(informed_international_conflict,
      `Not informed` = "Not informed", 
      `Not well informed` = "Not informed",
      `Informed` = "Informed", 
      `Well informed` = "Informed")) %>%
  group_by(language_self, informed_international_conflict) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_international_conflict, 
                         levels = c("Not informed", "Informed")))

#create plot with facet_grid

ggplot(language_international_conflict_2,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_international_conflict, 
                         levels = c("Not informed", "Informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on international conflict\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_international_conflict, levels = c("Not informed", "Informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create new object to calculate percent for language_self & informed_world_hunger

language_world_hunger <- select(pisa_tidy, "language_self", "informed_world_hunger") %>%
  group_by(language_self, informed_world_hunger) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_world_hunger, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

#create plot with facet_grid

ggplot(language_world_hunger,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_world_hunger, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on world hunger\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_world_hunger, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create new object to combine "not informed" and "not well informed" to become "not informed" and "informed" and "well informed" to become "informed"

language_world_hunger_2 <- pisa_tidy%>%
  mutate(informed_world_hunger = recode(informed_world_hunger, 
      `Not informed` = "Not informed", 
      `Not well informed` = "Not informed",
      `Informed` = "Informed", 
      `Well informed` = "Informed")) %>%
  group_by(language_self, informed_world_hunger) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_world_hunger, 
                         levels = c("Not informed", "Informed")))

#create plot with facet_grid

ggplot(language_world_hunger_2,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_world_hunger, 
                         levels = c("Not informed", "Informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on world hunger\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_world_hunger, levels = c("Not informed", "Informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create new object to calculate percent for language_self & informed_poverty_causes

language_poverty_causes <- select(pisa_tidy, "language_self", "informed_poverty_causes") %>%
  group_by(language_self, informed_poverty_causes) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_poverty_causes, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

#create plot with facet_grid

ggplot(language_poverty_causes,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_poverty_causes, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on causes of poverty\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_poverty_causes, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create new object to combine "not informed" and "not well informed" to become "not informed" and "informed" and "well informed" to become "informed"

language_poverty_causes_2 <- pisa_tidy%>%
  mutate(informed_poverty_causes = recode(informed_poverty_causes, 
      `Not informed` = "Not informed", 
      `Not well informed` = "Not informed",
      `Informed` = "Informed", 
      `Well informed` = "Informed")) %>%
  group_by(language_self, informed_poverty_causes) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_poverty_causes, 
                         levels = c("Not informed", "Informed")))

#create plot with facet_grid

ggplot(language_poverty_causes_2,
       aes(x = factor(language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent,
           fill = factor(informed_poverty_causes, levels = c("Not informed", "Informed")))) +
  geom_bar(stat = "identity", color = "black", position = "dodge") +
  labs(y = "Percent", fill = NULL, x = "Languages Spoken",
       title = "How informed students feel on causes of poverty\nby number of languages they speak",
       subtitle = "Spain, 2018",
       caption = "Source: PISA 2018 Student Questionnaire Database") +
  facet_grid(~factor(informed_poverty_causes, levels = c("Not informed", "Informed"))) +
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 9, angle = 30),
        axis.text.y = element_text(size = 10),
        text = element_text(size = 11),
        plot.caption = element_text(hjust = 0))

#create new object to calculate percent for language_self & informed_gender_equality

language_gender_equality <- select(pisa_tidy, "language_self", "informed_gender_equality") %>%
  group_by(language_self, informed_gender_equality) %>%
  #keep the grouping by language_self so percentages sum to 100 within each language group
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_gender_equality,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

#create plot with facet_grid

ggplot(language_gender_equality,
       aes(x = factor(language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent,
           fill = factor(informed_gender_equality,
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
  geom_bar(stat = "identity", color = "black", position = "dodge") +
  labs(y = "Percent", fill = NULL, x = "Languages Spoken",
       title = "How informed students feel on gender equality\nby number of languages they speak",
       subtitle = "Spain, 2018",
       caption = "Source: PISA 2018 Student Questionnaire Database") +
  facet_grid(~factor(informed_gender_equality,
                     levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 9, angle = 30),
        axis.text.y = element_text(size = 10),
        text = element_text(size = 11),
        plot.caption = element_text(hjust = 0))

#create new object that collapses "Not informed" and "Not well informed" into "Not informed", and "Informed" and "Well informed" into "Informed"

language_gender_equality_2 <- pisa_tidy %>%
  mutate(informed_gender_equality = recode(informed_gender_equality,
    `Not informed` = "Not informed",
    `Not well informed` = "Not informed",
    `Informed` = "Informed",
    `Well informed` = "Informed")) %>%
  group_by(language_self, informed_gender_equality) %>%
  #keep the grouping by language_self so percentages sum to 100 within each language group
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_gender_equality, levels = c("Not informed", "Informed")))

#create plot with facet_grid

ggplot(language_gender_equality_2,
       aes(x = factor(language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent,
           fill = factor(informed_gender_equality, levels = c("Not informed", "Informed")))) +
  geom_bar(stat = "identity", color = "black", position = "dodge") +
  labs(y = "Percent", fill = NULL, x = "Languages Spoken",
       title = "How informed students feel on gender equality\nby number of languages they speak",
       subtitle = "Spain, 2018",
       caption = "Source: PISA 2018 Student Questionnaire Database") +
  facet_grid(~factor(informed_gender_equality, levels = c("Not informed", "Informed"))) +
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 9, angle = 30),
        axis.text.y = element_text(size = 10),
        text = element_text(size = 11),
        plot.caption = element_text(hjust = 0))

Reflection

My experience with this project was unique in that I had no previous R programming knowledge to fall back on as I worked through this analysis. I learned a lot, but in hindsight there are many things I wish I had done differently or had known a little more about before getting started.

The first challenge I ran into was finding a dataset that coincided with my research interests. I found many datasets through UNESCO, and was especially interested in the World Inequality Database on Education (WIDE), but I could not find corresponding codebooks to interpret the raw data. After reaching out to the class Slack group for suggestions, it dawned on me that I was thinking too broadly and needed to narrow things down. With this in mind, I realized the WIDE dataset was actually a compilation of data from many other datasets. I started looking these up one by one and through this process came across the Programme for International Student Assessment (PISA).

The PISA datasets are available as SAS files, so, not having a way to first examine the data, I jumped straight to reading it into RStudio. I received an error that the file was too large and after some troubleshooting found that switching from 32-bit to 64-bit R allowed me to read in the data. It was then I realized just how large the dataset was, with 1,119 variables and 612,004 observations from 80 countries. This felt like far too much data to examine in a small R window, so I wrote out a csv of the data. I then spent a couple of hours examining variables and trying to narrow down which ones I was interested in analyzing. This was especially challenging because I wanted to dig into every single variable! I finally decided I couldn’t go wrong for this assignment and landed on analyzing whether exposure to other cultures/languages (seven variables were chosen to examine) increased the likelihood that a student feels better informed on seven different topics, and then I planned to compare this between the United States and Spain.

Even though I had finally chosen variables, I still felt overwhelmed by the size of the dataset and knowing how to go about cleaning it. I ended up creating my own mini-codebook that just contained details of the variables I had selected, which helped the data feel a bit more manageable. I also used the concatenate function in Excel to quickly prepare a list of the variables for my R script including how to rename them and their corresponding values.

The next step in the tidying process was dropping NAs, a step that drastically changed my initial research question. Students in the United States did not answer the questions I had selected as my variables! By this point I had already put a significant amount of time into this project, so I decided the best course of action was to adjust my research question. This was disappointing as I was excited about observing differences between countries. It was also not an ideal choice because in “real life” a research question is set in advance.
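One way to catch this earlier would be to check, country by country, how many students actually answered an item before building an analysis around it. The following is a minimal sketch in base R on an invented toy data frame. The column names CNT and ST197Q01HA come from the dataset, but the country labels and response values are assumptions for illustration only.

```r
# Toy stand-in for the student questionnaire; all values are invented.
toy_pisa <- data.frame(
  CNT = c("Spain", "Spain", "Spain", "United States", "United States"),
  ST197Q01HA = c(1, 3, NA, NA, NA)
)

# Share of students in each country with a non-missing answer to the item.
response_rate <- tapply(!is.na(toy_pisa$ST197Q01HA), toy_pisa$CNT, mean)
print(response_rate)

# Countries where nobody answered cannot be used for this question.
unusable <- names(response_rate)[response_rate == 0]
print(unusable)
```

Running a check like this on each candidate variable, before any tidying, would show which country comparisons the data can support.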

When I finished tidying my data it was time to knit and submit HW3, but the file was too large and wouldn’t knit. I tried multiple options found through internet searches to make this work and reached out on the class Slack channel for input. However, the suggestions I tried did not work and some made my whole computer crash. There were a lot of tears at this point thinking I would need to start from scratch after so many hours of work. I decided to take a few days off from the assignment so I could think clearly about how to move forward. During this time I realized I had already written out the csv of the whole dataset, so a potential workaround would be to eliminate the variables/observations I wouldn’t be using in my analysis, and then read this back into R to create a smaller file. Success!
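The workaround described above can be sketched roughly as follows. This is an illustration in base R on an invented toy data frame, not the exact steps I ran. The column UNUSED_VAR and all values are hypothetical, and the real file is far larger.

```r
# Toy stand-in for the full dataset; all values are invented.
toy_full <- data.frame(
  CNT = c("Spain", "Spain", "France", "Italy"),
  ST001D01T = c(10, 10, 9, 10),
  ST004D01T = c(1, 2, 1, 2),
  UNUSED_VAR = c("a", "b", "c", "d")  # hypothetical column not needed later
)

# Keep only the rows and columns needed for the analysis.
keep_cols <- c("CNT", "ST001D01T", "ST004D01T")
toy_small <- toy_full[toy_full$CNT == "Spain", keep_cols]

# Write the reduced data to disk, then read it back into a new object.
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_small, csv_path, row.names = FALSE)
pisa_smaller <- read.csv(csv_path)

# Release the memory held by the large object once the small file exists.
rm(toy_full)
invisible(gc())

dim(pisa_smaller)
```

Because the knitted document only needs to read the small csv, the large SAS file never has to be loaded during knitting.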

Next up was learning to use ggplot to create visualizations. Since all variables used in this analysis are characters, it took me a while to figure out that counts, percentages, etc. must first be calculated before creating plots. What eventually aided my understanding was reading the sections of Data Visualization with R on univariate and bivariate plots, which break down how to use them with both categorical and quantitative variables.
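To illustrate why the counting step comes first, here is a small base R sketch that turns two character columns into the percentages a bar chart needs. It is an alternative to the group_by and summarise approach used in this analysis, and the toy values are invented.

```r
# Toy stand-in for pisa_tidy; all values are invented.
toy_tidy <- data.frame(
  language_self = c("One", "One", "One", "Two", "Two", "Three"),
  informed_poverty_causes = c("Informed", "Not informed", "Informed",
                              "Well informed", "Informed", "Well informed")
)

# Step 1: count each combination of the two categorical variables.
counts <- table(toy_tidy$language_self, toy_tidy$informed_poverty_causes)

# Step 2: convert counts to percentages within each language group (rows).
percents <- prop.table(counts, margin = 1) * 100

# Step 3: reshape to long format, one row per combination, ready for plotting.
# Combinations that never occur appear with a percent of 0.
plot_data <- as.data.frame(percents, responseName = "percent")
names(plot_data)[1:2] <- c("language_self", "informed_poverty_causes")
print(plot_data)
```

Only after these steps is there a numeric column (percent) to map to the y-axis.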

Once I was successful in creating a plot, I realized my research question was way too broad, at least for this assignment. To explore my question, I would have needed to create at least 49 bivariate plots to analyze each of the seven variables against another seven. I believe the use of functions would have made this doable, but as this is outside of my current R knowledge I chose to again modify my research question. This brought my question to its final form, “Is there a positive correlation between the number of languages a student in Spain speaks and how well informed they feel on climate change, global health, migration, international conflict, world hunger, causes of poverty, and gender inequality?”
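For reference, a function-based approach might look something like the sketch below. It is not what I ran for this assignment. The helper names summarise_informed and plot_informed are hypothetical, the toy data are invented, and it assumes the dplyr and ggplot2 packages are installed.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for pisa_tidy; all values are invented.
toy_tidy <- data.frame(
  language_self = c("One", "One", "Two", "Two", "Three", "Four +"),
  informed_poverty_causes = c("Not informed", "Informed", "Informed",
                              "Well informed", "Informed", "Well informed"),
  informed_gender_equality = c("Informed", "Informed", "Not well informed",
                               "Well informed", "Well informed", "Informed")
)

language_levels <- c("One", "Two", "Three", "Four +")
informed_levels <- c("Not informed", "Not well informed", "Informed", "Well informed")

# Hypothetical helper: percent of each response within each language group.
# `topic` is the name of an informed_* column, given as a string.
summarise_informed <- function(data, topic) {
  data %>%
    mutate(language_self = factor(language_self, levels = language_levels),
           response = factor(.data[[topic]], levels = informed_levels)) %>%
    count(language_self, response, name = "count") %>%
    group_by(language_self) %>%
    mutate(percent = count / sum(count) * 100) %>%
    ungroup()
}

# Hypothetical helper: the same faceted bar chart for any topic.
plot_informed <- function(data, topic, topic_label) {
  ggplot(summarise_informed(data, topic),
         aes(x = language_self, y = percent, fill = response)) +
    geom_col(color = "black", position = "dodge") +
    facet_grid(~response) +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken",
         title = paste0("How informed students feel on ", topic_label,
                        "\nby number of languages they speak")) +
    theme_linedraw()
}

# One plot per topic, without copying and pasting the pipeline.
topics <- c(informed_poverty_causes = "causes of poverty",
            informed_gender_equality = "gender equality")
plots <- lapply(names(topics), function(t) plot_informed(toy_tidy, t, topics[[t]]))
```

With helpers like these, each additional topic is one more entry in the topics vector rather than another copied block of code.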

To summarize, here’s what I would have done differently for this assignment if I could go back in time:

Chosen a smaller dataset so my computer could have handled the size adequately
Solidified a narrower research question from the beginning
Examined my dataset thoroughly to make sure it could answer my research question
Chosen data that also contained numerical data, so I could have experimented with other types of visualizations and summary statistics

Conclusion

The plots show a positive correlation between the number of languages spoken and how well informed students feel on these seven topics, yet this analysis alone is not enough to conclude that the more languages a student speaks, the better informed they are on these topics. There are many other variables which could be at play, such as:

Where does the student live (urban, suburb, rural)?
Does the student attend a public, semi-private or private school?
How informed do the student’s parents feel on these topics?
What are the parents’ education levels?
What is the family’s income level?

I hypothesize that students attending private schools in urban areas speak more languages, and I imagine these variables, combined with family income and/or parents’ level of education, could be what impacts how informed a student is on certain topics rather than primarily the number of languages the student speaks. The original dataset may include some of these variables, and there is also a separate dataset of parent responses which may contain some or all of them. However, the size of the datasets and the limited memory available on my computer prevent me from exploring things further at this time.

It’s also important to note that students are responding to how informed they feel on a topic, and no quantifiable evidence is provided to demonstrate if a student is as informed or uninformed as they believe themselves to be. It’s possible that the more languages a student speaks the more confident they feel in their knowledge of certain topics. An exam would be needed to determine if a student’s perception of knowledge and their actual knowledge align.
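The correlation noted at the start of this conclusion was judged from the plots alone. One way to put a number on it, which was not part of this analysis, would be a Spearman rank correlation between the two ordered variables. The sketch below uses invented toy data and base R only. It ignores survey weights and the other variables listed above, so it is a starting point rather than a complete test.

```r
language_levels <- c("One", "Two", "Three", "Four +")
informed_levels <- c("Not informed", "Not well informed", "Informed", "Well informed")

# Toy stand-in for pisa_tidy; all values are invented.
toy_tidy <- data.frame(
  language_self = c("One", "One", "One", "Two", "Two", "Two",
                    "Three", "Three", "Four +", "Four +"),
  informed_poverty_causes = c("Not informed", "Not well informed", "Informed",
                              "Not well informed", "Informed", "Informed",
                              "Informed", "Well informed",
                              "Informed", "Well informed")
)

# Convert the ordered categories to integer ranks (1 = lowest category).
language_rank <- as.integer(factor(toy_tidy$language_self, levels = language_levels))
informed_rank <- as.integer(factor(toy_tidy$informed_poverty_causes,
                                   levels = informed_levels))

# exact = FALSE because the many tied ranks rule out an exact p-value.
result <- cor.test(language_rank, informed_rank, method = "spearman", exact = FALSE)
print(result$estimate)
print(result$p.value)
```

A positive estimate would be consistent with the pattern in the plots, but it would still say nothing about why the two variables move together.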

HW6 questions

What is missing from my final project, and what do I hope to accomplish between now and submission time?

I’m not sure I have anything missing from my final project, but before I submit, I definitely plan to create a better introduction. Since this won’t take much time, and we are already halfway through the semester, my plan for the second half of the semester is to informally work through all the steps of this assignment with a new dataset and research question. This will allow me to put into practice the things I wish I had done differently the first time around, and I’m sure it will provide plenty of opportunities to work through new challenges.

Bibliography

Kabacoff, R. (2020). Data visualization with R. Quantitative Analysis Center, Wesleyan University.
https://rkabacoff.github.io/datavis/index.html

Programme for International Student Assessment. (2020). Student questionnaire data files (PISA 2018 Database) [Dataset and codebook]. Organisation for Economic Co-operation and Development. https://www.oecd.org/pisa/data/2018database/

RStudio Team. (2022). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA.
http://www.rstudio.com/

Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
