Data Analytics and Computational Social Science: HW5

Laura Collazo

Dataset

The dataset I’ve chosen for my final project is the Program for International Student Assessment (PISA) 2018 student data. It’s a large dataset containing 1,119 variables and 612,004 observations of 15-year-old students from 80 countries.

Research Question

In my analysis of this data, I examine how informed students in Spain feel they are on 7 different topics (all character variables):

Climate change and global warming (informed_climate_change)
Global health (e.g. epidemics) (informed_global_health)
Migration (movement of people) (informed_migration)
International conflicts (informed_international_conflict)
Hunger or malnutrition in different parts of the world (informed_world_hunger)
Causes of poverty (informed_poverty_causes)
Equality between men and women in different parts of the world (informed_gender_equality)

I compare these responses by one additional character variable:

How many languages the student speaks well enough to converse with others (language_self)

Essentially, I am curious whether there is a positive correlation between the number of languages a student speaks and how well informed the student feels on the seven topics listed above.

Read in Dataset

The dataset originally read in was a very large SAS file. However, my computer’s memory was not sufficient to knit the file. As a workaround, I wrote out a csv containing a limited number of variables and then read this back in.

#load packages

library(tidyverse)
library(haven)

#read in SAS file & examine data

pisa <- read_sas("cy07_msu_stu_qqq.sas7bdat", "CY07MSU_FMT_STU_QQQ.SAS7BCAT", encoding = NULL, .name_repair = "unique")

pisa

tail(pisa)

unique(pisa[c("CNT")])

# select only desired variables and filter country for Spain

pisa_smaller <- pisa %>%
  select(c(CNT,
           ST001D01T,
           ST004D01T,
           ST197Q01HA,
           ST197Q02HA,
           ST197Q04HA,
           ST197Q07HA,
           ST197Q08HA,
           ST197Q09HA,
           ST197Q12HA,
           ST220Q01HA,
           ST220Q02HA,
           ST220Q03HA,
           ST220Q04HA,
           ST177Q01HA,
           ST019AQ01T,
           ST021Q01TA)) %>%
  filter(CNT == "ESP")
  
#check work
 
pisa_smaller

#write csv

write_csv(pisa_smaller, "pisa_smaller_2022-2-20.csv")
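As a possible alternative to reading the full file first, haven's read_sas() accepts a col_select argument, so only the needed columns are kept when the file is read. The sketch below is an assumption about how that could look here, not what was done for this assignment: read_pisa_subset() and keep_cols are hypothetical names, and whether this lowers memory use enough to knit on a given machine is untested.

```r
library(haven)
library(dplyr)

# Column names copied from the select() call above
keep_cols <- c("CNT", "ST001D01T", "ST004D01T",
               "ST197Q01HA", "ST197Q02HA", "ST197Q04HA", "ST197Q07HA",
               "ST197Q08HA", "ST197Q09HA", "ST197Q12HA",
               "ST220Q01HA", "ST220Q02HA", "ST220Q03HA", "ST220Q04HA",
               "ST177Q01HA", "ST019AQ01T", "ST021Q01TA")

# Hypothetical helper: read only the listed columns, then keep one country
read_pisa_subset <- function(data_file, catalog_file, cols, country = "ESP") {
  read_sas(data_file, catalog_file, col_select = tidyselect::all_of(cols)) %>%
    filter(CNT == country)
}

# Only runs when the SAS file is present in the working directory
sas_file <- "cy07_msu_stu_qqq.sas7bdat"
if (file.exists(sas_file)) {
  pisa_smaller <- read_pisa_subset(sas_file, "CY07MSU_FMT_STU_QQQ.SAS7BCAT", keep_cols)
}
```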

#read in csv & examine data

pisa <- read_csv("pisa_smaller_2022-2-20.csv")

pisa

# A tibble: 35,943 x 17
   CNT   ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
   <chr>     <dbl>     <dbl>      <dbl>      <dbl>      <dbl>
 1 ESP          10         2          4          4          4
 2 ESP           9         1          3          2          3
 3 ESP          10         2          4          3          3
 4 ESP           8         2          2          1          3
 5 ESP          10         1         NA         NA         NA
 6 ESP          10         1          4          2          3
 7 ESP           9         1         NA         NA         NA
 8 ESP           9         2          3          2          2
 9 ESP           9         2         NA         NA         NA
10 ESP          10         2          3          3          3
# ... with 35,933 more rows, and 11 more variables: ST197Q07HA <dbl>,
#   ST197Q08HA <dbl>, ST197Q09HA <dbl>, ST197Q12HA <dbl>,
#   ST220Q01HA <dbl>, ST220Q02HA <dbl>, ST220Q03HA <dbl>,
#   ST220Q04HA <dbl>, ST177Q01HA <dbl>, ST019AQ01T <dbl>,
#   ST021Q01TA <dbl>

tail(pisa)

# A tibble: 6 x 17
  CNT   ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
  <chr>     <dbl>     <dbl>      <dbl>      <dbl>      <dbl>
1 ESP          10         1          4          4          4
2 ESP           9         2          3          3          3
3 ESP          10         2          4          4          4
4 ESP           9         2          2          2          2
5 ESP           8         2          3          3          3
6 ESP           9         1          2          2          2
# ... with 11 more variables: ST197Q07HA <dbl>, ST197Q08HA <dbl>,
#   ST197Q09HA <dbl>, ST197Q12HA <dbl>, ST220Q01HA <dbl>,
#   ST220Q02HA <dbl>, ST220Q03HA <dbl>, ST220Q04HA <dbl>,
#   ST177Q01HA <dbl>, ST019AQ01T <dbl>, ST021Q01TA <dbl>

#remove additional variables not needed to answer research question

pisa_tidy <- pisa %>%
  
select(-c("ST001D01T", "ST004D01T", "ST220Q01HA", "ST220Q02HA", "ST220Q03HA", "ST220Q04HA", "ST019AQ01T", "ST021Q01TA")) %>%
  
#rename variables
  
rename(country=CNT,
informed_climate_change=ST197Q01HA,
informed_global_health=ST197Q02HA,
informed_migration=ST197Q04HA,
informed_international_conflict=ST197Q07HA,
informed_world_hunger=ST197Q08HA,
informed_poverty_causes=ST197Q09HA,
informed_gender_equality=ST197Q12HA,
language_self=ST177Q01HA) %>%

#remove NAs

drop_na() %>%

#recode values
  
mutate(country = recode(country, ESP = "Spain")) %>%
  
mutate(informed_climate_change = recode(informed_climate_change, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_global_health = recode(informed_global_health, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_migration = recode(informed_migration, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_international_conflict = recode(informed_international_conflict,
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_world_hunger = recode(informed_world_hunger, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_poverty_causes = recode(informed_poverty_causes, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_gender_equality = recode(informed_gender_equality, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%
  
mutate(language_self = recode(language_self, 
      `1` = "One", 
      `2` = "Two", 
      `3` = "Three", 
      `4` = "Four +"))

#examine
  
pisa_tidy

# A tibble: 28,022 x 9
   country informed_climate_change informed_global_h~ informed_migrat~
   <chr>   <chr>                   <chr>              <chr>           
 1 Spain   Well informed           Well informed      Well informed   
 2 Spain   Informed                Not well informed  Informed        
 3 Spain   Well informed           Informed           Informed        
 4 Spain   Not well informed       Not informed       Informed        
 5 Spain   Well informed           Not well informed  Informed        
 6 Spain   Informed                Not well informed  Not well inform~
 7 Spain   Informed                Informed           Informed        
 8 Spain   Not well informed       Not well informed  Informed        
 9 Spain   Not well informed       Informed           Informed        
10 Spain   Informed                Informed           Informed        
# ... with 28,012 more rows, and 5 more variables:
#   informed_international_conflict <chr>,
#   informed_world_hunger <chr>, informed_poverty_causes <chr>,
#   informed_gender_equality <chr>, language_self <chr>
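The seven informed_ columns share the same four response codes, so the repeated recode() calls could be condensed with across(). The sketch below uses a small invented tibble (toy) rather than the PISA data, and it stores the results as factors instead of character columns, which is a departure from the code above; the upside is that the level order is then fixed once.

```r
library(dplyr)

# Invented stand-in for the renamed PISA columns (values are illustrative only)
toy <- tibble(
  informed_climate_change = c(4, 3, 1),
  informed_migration      = c(2, 3, 4),
  language_self           = c(1, 2, 4)
)

informed_labels <- c("Not informed", "Not well informed", "Informed", "Well informed")
language_labels <- c("One", "Two", "Three", "Four +")

toy_recoded <- toy %>%
  mutate(
    # apply the same code-to-label mapping to every informed_ column
    across(starts_with("informed_"),
           ~ factor(.x, levels = 1:4, labels = informed_labels)),
    language_self = factor(language_self, levels = 1:4, labels = language_labels)
  )

toy_recoded
```

With factors, later calls such as arrange() and ggplot() would follow the level order without repeating the levels vector each time.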

Univariate Visualizations

For each variable, I first use group_by() to view the percent of responses in a tibble, and then create a univariate plot of the counts. These plots don’t directly answer my research question, but they do provide a general overview of each individual variable. The question asked of each participant is included before each variable’s percent calculation and plot.
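The count-and-percent step is the same for every variable, so it could be wrapped in a small function. This is only a sketch: percent_table() is a hypothetical helper, the data in toy are invented, and it uses count() in place of group_by() plus summarise().

```r
library(dplyr)

# Hypothetical helper: count each response and convert counts to percent
percent_table <- function(data, var) {
  data %>%
    count({{ var }}, name = "count") %>%
    mutate(percent = count / sum(count) * 100)
}

# Invented responses for illustration
toy <- tibble(informed = c("Informed", "Informed", "Well informed", "Not informed"))

result <- percent_table(toy, informed)
result
```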

How informed are you about the following topics? Climate change and global warming

#calculate percent for informed_climate_change

select(pisa_tidy, "informed_climate_change") %>%
  group_by(informed_climate_change) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(informed_climate_change,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

# A tibble: 4 x 3
  informed_climate_change count percent
  <chr>                   <int>   <dbl>
1 Not informed              541    1.93
2 Not well informed        4000   14.3 
3 Informed                16623   59.3 
4 Well informed            6858   24.5

#plot for informed_climate_change

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_climate_change, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3",color = "black") +
  labs(x = "Climate Change",
       y = "Count",
       title = "How informed students feel on climate change", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))
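The same bar chart is drawn for each of the seven topics, so the plotting code could be collected into one function. The version below is a sketch under assumptions: plot_informed() is a hypothetical name, the column is passed as a string, and toy is invented data standing in for pisa_tidy.

```r
library(ggplot2)

informed_levels <- c("Not informed", "Not well informed", "Informed", "Well informed")

# Hypothetical helper: one count bar chart per topic, styled like the plots in this report
plot_informed <- function(data, var, topic) {
  ggplot(data, aes(x = factor(.data[[var]], levels = informed_levels))) +
    geom_bar(fill = "turquoise3", color = "black") +
    labs(x = topic,
         y = "Count",
         title = paste("How informed students feel on", tolower(topic)),
         subtitle = "Spain, 2018",
         caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))
}

# Invented data for illustration
toy <- data.frame(informed_climate_change = c("Informed", "Well informed", "Informed"))

p <- plot_informed(toy, "informed_climate_change", "Climate Change")
```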

How informed are you about the following topics? Global health (e.g. epidemics)

#calculate percent for informed_global_health

select(pisa_tidy, "informed_global_health") %>%
  group_by(informed_global_health) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(informed_global_health,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

# A tibble: 4 x 3
  informed_global_health count percent
  <chr>                  <int>   <dbl>
1 Not informed             467    1.67
2 Not well informed       7249   25.9 
3 Informed               16315   58.2 
4 Well informed           3991   14.2

#plot for informed_global_health

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_global_health, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3",color = "black") +
  labs(x = "Global Health",
       y = "Count",
       title = "How informed students feel on global health", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How informed are you about the following topics? Migration (movement of people)

#calculate percent for informed_migration

select(pisa_tidy, "informed_migration") %>%
  group_by(informed_migration) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(informed_migration,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

# A tibble: 4 x 3
  informed_migration count percent
  <chr>              <int>   <dbl>
1 Not informed         450    1.61
2 Not well informed   5532   19.7 
3 Informed           16583   59.2 
4 Well informed       5457   19.5

#plot for informed_migration

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_migration, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3", color = "black") +
  labs(x = "Migration",
       y = "Count",
       title = "How informed students feel on migration", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How informed are you about the following topics? International conflicts

#calculate percent for informed_international_conflict

select(pisa_tidy, "informed_international_conflict") %>%
  group_by(informed_international_conflict) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(informed_international_conflict,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

# A tibble: 4 x 3
  informed_international_conflict count percent
  <chr>                           <int>   <dbl>
1 Not informed                      733    2.62
2 Not well informed                8349   29.8 
3 Informed                        13758   49.1 
4 Well informed                    5182   18.5

#plot for informed_international_conflict

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_international_conflict, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3", color = "black") +
  labs(x = "International Conflict",
       y = "Count",
       title = "How informed students feel on international conflict", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How informed are you about the following topics? Hunger or malnutrition in different parts of the world

#calculate percent for informed_world_hunger

select(pisa_tidy, "informed_world_hunger") %>%
  group_by(informed_world_hunger) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(informed_world_hunger,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

# A tibble: 4 x 3
  informed_world_hunger count percent
  <chr>                 <int>   <dbl>
1 Not informed            345    1.23
2 Not well informed      4331   15.5 
3 Informed              16459   58.7 
4 Well informed          6887   24.6

#plot for informed_world_hunger

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_world_hunger, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3", color = "black") +
  labs(x = "World Hunger",
       y = "Count",
       title = "How informed students feel on world hunger", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How informed are you about the following topics? Causes of poverty

#calculate percent for informed_poverty_causes

select(pisa_tidy, "informed_poverty_causes") %>%
  group_by(informed_poverty_causes) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(informed_poverty_causes,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

# A tibble: 4 x 3
  informed_poverty_causes count percent
  <chr>                   <int>   <dbl>
1 Not informed              391    1.40
2 Not well informed        5230   18.7 
3 Informed                15367   54.8 
4 Well informed            7034   25.1

#plot for informed_poverty_causes

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_poverty_causes, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3", color = "black") +
  labs(x = "Causes of Poverty",
       y = "Count",
       title = "How informed students feel on causes of poverty", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How informed are you about the following topics? Equality between men and women in different parts of the world

#calculate percent for informed_gender_equality

select(pisa_tidy, "informed_gender_equality") %>%
  group_by(informed_gender_equality) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(informed_gender_equality,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

# A tibble: 4 x 3
  informed_gender_equality count percent
  <chr>                    <int>   <dbl>
1 Not informed               365    1.30
2 Not well informed         1671    5.96
3 Informed                 11550   41.2 
4 Well informed            14436   51.5

#plot for informed_gender_equality

  ggplot(pisa_tidy, aes(x = fct_relevel(informed_gender_equality, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
  geom_bar (fill = "turquoise3", color = "black") +
  labs(x = "Gender Equality",
       y = "Count",
       title = "How informed students feel on gender equality", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How many languages do you speak well enough to converse with others?

#calculate percent for language_self

select(pisa_tidy, "language_self") %>%
  group_by(language_self) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")))

# A tibble: 4 x 3
  language_self count percent
  <chr>         <int>   <dbl>
1 One            3996    14.3
2 Two           10732    38.3
3 Three         10370    37.0
4 Four +         2924    10.4

# plot for language_self

  ggplot(pisa_tidy, aes(x = fct_relevel(language_self, "One", "Two", "Three", "Four +"))) + 
  geom_bar (fill = "turquoise3", color = "black") +
  labs(x = "Languages Spoken",
       y = "Count",
       title = "Number of languages students speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text( size = 9),
          axis.text.y = element_text( size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

Bivariate Visualizations

To explore my research question directly, I calculate the percent of each response within each language group and then create two bivariate visualizations for each pairing, one of which uses facet_grid().

#calculate percent for language_self & informed_climate_change

language_climate_change <- select(pisa_tidy, "language_self", "informed_climate_change") %>%
  group_by(language_self, informed_climate_change) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_climate_change,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_climate_change

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_climate_change count percent
   <chr>         <chr>                   <int>   <dbl>
 1 One           Not informed              171    4.28
 2 One           Not well informed         957   23.9 
 3 One           Informed                 2275   56.9 
 4 One           Well informed             593   14.8 
 5 Two           Not informed              142    1.32
 6 Two           Not well informed        1518   14.1 
 7 Two           Informed                 6740   62.8 
 8 Two           Well informed            2332   21.7 
 9 Three         Not informed              134    1.29
10 Three         Not well informed        1196   11.5 
11 Three         Informed                 6071   58.5 
12 Three         Well informed            2969   28.6 
13 Four +        Not informed               94    3.21
14 Four +        Not well informed         329   11.3 
15 Four +        Informed                 1537   52.6 
16 Four +        Well informed             964   33.0
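The percent column above is calculated within each language group, not across the whole sample, because summarise() leaves the result grouped by language_self. As a quick check, the sketch below re-enters the counts for the "One" and "Two" groups from the table above and recomputes the percentages; counts, within_language and group_totals are names introduced only for this illustration.

```r
library(dplyr)

# Counts copied from the "One" and "Two" rows of the table above
counts <- tibble(
  language_self = rep(c("One", "Two"), each = 4),
  informed_climate_change = rep(c("Not informed", "Not well informed",
                                  "Informed", "Well informed"), times = 2),
  count = c(171, 957, 2275, 593, 142, 1518, 6740, 2332)
)

# Percent of each response within its language group
within_language <- counts %>%
  group_by(language_self) %>%
  mutate(percent = count / sum(count) * 100) %>%
  ungroup()

# Each language group's percentages should add up to 100
group_totals <- within_language %>%
  group_by(language_self) %>%
  summarise(total = sum(percent))

within_language
group_totals
```

For example, 171 of the 3,996 students who speak one language chose "Not informed", which is about 4.28 percent.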

#create plot

ggplot(language_climate_change, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, 
           fill = factor(informed_climate_change, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", color = "black", position = "fill") + 
  labs(y = "Proportion", fill = NULL, x = "Languages Spoken", title = "How informed students feel on climate change\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))
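One detail of the stacked plot: position = "fill" rescales every bar to a total height of 1, so its y-axis shows proportions from 0 to 1 even though the percent column holds values up to 100. The sketch below checks this on a small invented data frame (toy); layer_data() returns the values ggplot2 actually draws.

```r
library(ggplot2)

# Invented within-group percentages for illustration
toy <- data.frame(
  language_self = rep(c("One", "Two"), each = 2),
  informed = rep(c("Informed", "Well informed"), times = 2),
  percent = c(75, 25, 60, 40)
)

p <- ggplot(toy, aes(x = language_self, y = percent, fill = informed)) +
  geom_bar(stat = "identity", color = "black", position = "fill")

# Inspect the computed bar positions; the tallest stacked value is 1, not 100
built <- layer_data(p)
max(built$ymax)
```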

#create plot with facet_grid

ggplot(language_climate_change,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_climate_change, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on climate change\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_climate_change, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#calculate percent for language_self & informed_global_health

language_global_health <- select(pisa_tidy, "language_self", "informed_global_health") %>%
  group_by(language_self, informed_global_health) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_global_health,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_global_health

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_global_health count percent
   <chr>         <chr>                  <int>   <dbl>
 1 One           Not informed             126    3.15
 2 One           Not well informed       1303   32.6 
 3 One           Informed                2147   53.7 
 4 One           Well informed            420   10.5 
 5 Two           Not informed             139    1.30
 6 Two           Not well informed       2894   27.0 
 7 Two           Informed                6361   59.3 
 8 Two           Well informed           1338   12.5 
 9 Three         Not informed             133    1.28
10 Three         Not well informed       2450   23.6 
11 Three         Informed                6204   59.8 
12 Three         Well informed           1583   15.3 
13 Four +        Not informed              69    2.36
14 Four +        Not well informed        602   20.6 
15 Four +        Informed                1603   54.8 
16 Four +        Well informed            650   22.2

#create plot

ggplot(language_global_health, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, 
           fill = factor(informed_global_health, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", color = "black", position = "fill") + 
  labs(y = "Proportion", fill = NULL, x = "Languages Spoken", title = "How informed students feel on global health\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create plot with facet_grid

ggplot(language_global_health,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_global_health, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on global health\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_global_health, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#calculate percent for language_self & informed_migration

language_migration <- select(pisa_tidy, "language_self", "informed_migration") %>%
  group_by(language_self, informed_migration) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_migration,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_migration

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_migration count percent
   <chr>         <chr>              <int>   <dbl>
 1 One           Not informed         134    3.35
 2 One           Not well informed    948   23.7 
 3 One           Informed            2301   57.6 
 4 One           Well informed        613   15.3 
 5 Two           Not informed         113    1.05
 6 Two           Not well informed   2248   20.9 
 7 Two           Informed            6548   61.0 
 8 Two           Well informed       1823   17.0 
 9 Three         Not informed         122    1.18
10 Three         Not well informed   1881   18.1 
11 Three         Informed            6196   59.7 
12 Three         Well informed       2171   20.9 
13 Four +        Not informed          81    2.77
14 Four +        Not well informed    455   15.6 
15 Four +        Informed            1538   52.6 
16 Four +        Well informed        850   29.1

#create plot

ggplot(language_migration, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, 
           fill = factor(informed_migration, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", color = "black", position = "fill") + 
  labs(y = "Proportion", fill = NULL, x = "Languages Spoken", title = "How informed students feel on migration\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create plot with facet_grid

ggplot(language_migration,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_migration, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on migration\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_migration, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#calculate percent for language_self & informed_international_conflict

language_international_conflict <- select(pisa_tidy, "language_self", "informed_international_conflict") %>%
  group_by(language_self, informed_international_conflict) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_international_conflict,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_international_conflict

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_international_conflict count percent
   <chr>         <chr>                           <int>   <dbl>
 1 One           Not informed                      195    4.88
 2 One           Not well informed                1480   37.0 
 3 One           Informed                         1791   44.8 
 4 One           Well informed                     530   13.3 
 5 Two           Not informed                      226    2.11
 6 Two           Not well informed                3319   30.9 
 7 Two           Informed                         5477   51.0 
 8 Two           Well informed                    1710   15.9 
 9 Three         Not informed                      222    2.14
10 Three         Not well informed                2870   27.7 
11 Three         Informed                         5157   49.7 
12 Three         Well informed                    2121   20.5 
13 Four +        Not informed                       90    3.08
14 Four +        Not well informed                 680   23.3 
15 Four +        Informed                         1333   45.6 
16 Four +        Well informed                     821   28.1

#create plot

ggplot(language_international_conflict, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, 
           fill = factor(informed_international_conflict, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", color = "black", position = "fill") + 
  labs(y = "Proportion", fill = NULL, x = "Languages Spoken", title = "How informed students feel on international conflict\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create plot with facet_grid

ggplot(language_international_conflict,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_international_conflict, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on international conflict\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_international_conflict, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#calculate percent for language_self & informed_world_hunger

language_world_hunger <- select(pisa_tidy, "language_self", "informed_world_hunger") %>%
  group_by(language_self, informed_world_hunger) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_world_hunger,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_world_hunger

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_world_hunger count percent
   <chr>         <chr>                 <int>   <dbl>
 1 One           Not informed             97   2.43 
 2 One           Not well informed       780  19.5  
 3 One           Informed               2326  58.2  
 4 One           Well informed           793  19.8  
 5 Two           Not informed             92   0.857
 6 Two           Not well informed      1717  16.0  
 7 Two           Informed               6514  60.7  
 8 Two           Well informed          2409  22.4  
 9 Three         Not informed             95   0.916
10 Three         Not well informed      1453  14.0  
11 Three         Informed               6091  58.7  
12 Three         Well informed          2731  26.3  
13 Four +        Not informed             61   2.09 
14 Four +        Not well informed       381  13.0  
15 Four +        Informed               1528  52.3  
16 Four +        Well informed           954  32.6

#create plot

ggplot(language_world_hunger, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, 
           fill = factor(informed_world_hunger, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", color = "black", position = "fill") + 
  labs(y = "Proportion", fill = NULL, x = "Languages Spoken", title = "How informed students feel on world hunger\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create plot with facet_grid

ggplot(language_world_hunger,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_world_hunger, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on world hunger\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_world_hunger, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#calculate percent for language_self & informed_poverty_causes

language_poverty_causes <- select(pisa_tidy, "language_self", "informed_poverty_causes") %>%
  group_by(language_self, informed_poverty_causes) %>%
  # drop_last keeps the grouping by language_self, so percent is computed within each language group
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_poverty_causes,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_poverty_causes

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_poverty_causes count percent
   <chr>         <chr>                   <int>   <dbl>
 1 One           Not informed              108   2.70 
 2 One           Not well informed         894  22.4  
 3 One           Informed                 2169  54.3  
 4 One           Well informed             825  20.6  
 5 Two           Not informed              106   0.988
 6 Two           Not well informed        2113  19.7  
 7 Two           Informed                 6104  56.9  
 8 Two           Well informed            2409  22.4  
 9 Three         Not informed              115   1.11 
10 Three         Not well informed        1789  17.3  
11 Three         Informed                 5664  54.6  
12 Three         Well informed            2802  27.0  
13 Four +        Not informed               62   2.12 
14 Four +        Not well informed         434  14.8  
15 Four +        Informed                 1430  48.9  
16 Four +        Well informed             998  34.1

#create plot

ggplot(language_poverty_causes, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, 
           fill = factor(informed_poverty_causes, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", color = "black", position = "fill") + 
  labs(y = "Proportion", fill = NULL, x = "Languages Spoken", title = "How informed students feel on causes of poverty\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create plot with facet_grid

ggplot(language_poverty_causes,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_poverty_causes, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on causes of poverty\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_poverty_causes, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#calculate percent for language_self & informed_gender_equality

language_gender_equality <- select(pisa_tidy, "language_self", "informed_gender_equality") %>%
  group_by(language_self, informed_gender_equality) %>%
  # drop_last keeps the grouping by language_self, so percent is computed within each language group
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_gender_equality,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_gender_equality

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_gender_equality count percent
   <chr>         <chr>                    <int>   <dbl>
 1 One           Not informed               123   3.08 
 2 One           Not well informed          393   9.83 
 3 One           Informed                  1846  46.2  
 4 One           Well informed             1634  40.9  
 5 Two           Not informed                80   0.745
 6 Two           Not well informed          634   5.91 
 7 Two           Informed                  4734  44.1  
 8 Two           Well informed             5284  49.2  
 9 Three         Not informed                90   0.868
10 Three         Not well informed          490   4.73 
11 Three         Informed                  4004  38.6  
12 Three         Well informed             5786  55.8  
13 Four +        Not informed                72   2.46 
14 Four +        Not well informed          154   5.27 
15 Four +        Informed                   966  33.0  
16 Four +        Well informed             1732  59.2

#create plot

ggplot(language_gender_equality, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, 
           fill = factor(informed_gender_equality, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", color = "black", position = "fill") + 
  labs(y = "Proportion", fill = NULL, x = "Languages Spoken", title = "How informed students feel on gender equality\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

#create plot with facet_grid

ggplot(language_gender_equality,
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, fill = factor(informed_gender_equality, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
    geom_bar(stat = "identity", color = "black", position = "dodge") +
    labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on gender equality\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
    facet_grid(~factor(informed_gender_equality, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9, angle = 30),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))
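
The percent tables above all follow the same steps: group by language_self and one topic, count, and convert the counts to a percent within each language group. A minimal sketch of how that could be wrapped in one function is shown here. The helper name percent_by_language and the toy data frame are hypothetical and only stand in for pisa_tidy, which is too large to reproduce.

```r
library(dplyr)

language_levels <- c("One", "Two", "Three", "Four +")
informed_levels <- c("Not informed", "Not well informed", "Informed", "Well informed")

# Hypothetical helper (not part of the original analysis):
# percent of each response within each language group, for one topic column
percent_by_language <- function(data, topic) {
  data %>%
    group_by(across(all_of(c("language_self", topic)))) %>%
    summarise(count = n(), .groups = "drop_last") %>%  # still grouped by language_self
    mutate(percent = count / sum(count) * 100) %>%
    ungroup() %>%
    arrange(factor(language_self, levels = language_levels),
            factor(.data[[topic]], levels = informed_levels))
}

# Toy data standing in for pisa_tidy (made-up rows, for illustration only)
toy <- data.frame(
  language_self = c("One", "One", "One", "Two", "Two", "Four +"),
  informed_gender_equality = c("Informed", "Informed", "Well informed",
                               "Not informed", "Informed", "Well informed")
)

toy_percent <- percent_by_language(toy, "informed_gender_equality")
toy_percent
```

With the real data, the call would look like percent_by_language(pisa_tidy, "informed_gender_equality"). Unlike the pipelines above, this sketch ungroups the result before returning it.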

Reflection

There appears to be a positive correlation between the number of languages spoken and how well informed students feel they are on these seven topics. However, this analysis alone is not enough to conclude that the more languages you speak, the better informed you are on these topics. Many other variables could be at play.
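
One way to put a number on this relationship is a rank correlation, since both variables are ordered categories. The sketch below is not part of the original analysis. It rebuilds one row per student from the language_gender_equality counts printed earlier and computes Spearman's rho, assuming the category order used in the plots. It only describes the association and says nothing about cause.

```r
language_levels <- c("One", "Two", "Three", "Four +")
informed_levels <- c("Not informed", "Not well informed", "Informed", "Well informed")

# Counts copied from the language_gender_equality table above
gender_equality_counts <- data.frame(
  language_self = rep(language_levels, each = 4),
  informed = rep(informed_levels, times = 4),
  count = c(123, 393, 1846, 1634,
            80, 634, 4734, 5284,
            90, 490, 4004, 5786,
            72, 154, 966, 1732)
)

# Expand the counts back to one row per student
row_index <- rep(seq_len(nrow(gender_equality_counts)), gender_equality_counts$count)
students <- gender_equality_counts[row_index, ]

# Convert the ordered categories to integer ranks (1 = lowest)
language_rank <- as.integer(factor(students$language_self, levels = language_levels))
informed_rank <- as.integer(factor(students$informed, levels = informed_levels))

# Spearman's rank correlation between languages spoken and how informed students feel
rho <- cor(language_rank, informed_rank, method = "spearman")
rho
```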

Things I would want to explore further are:

Where does the student live (urban or rural area)?
Does the student attend a public, semi-private or private school?
How informed do the student’s parents feel on these topics?
What are the parents’ education levels?
What is the family’s income level?

I hypothesize that students attending private schools in urban areas speak more languages. If this is the case, these variables, combined with family income and/or parents’ level of education, could be what affects how informed a student is on certain topics, rather than primarily the number of languages the student speaks. The original dataset may include some of these variables, and there is also a separate dataset of parent responses to a questionnaire which may contain some or all of them. However, the size of the datasets combined with the limited memory available on my computer prevents me from exploring this further at this time.

One thing I did notice in reviewing the data was that in three of the seven topics (climate change, world hunger and gender equality), students who speak four or more languages felt slightly less informed overall (when combining informed and well informed) than students speaking three languages. It could be helpful to show this clearly in my analysis, as a naive reader may not pick up on this small detail. In general, I do believe a naive reader would be able to understand my graphs. As I’m still learning, I definitely welcome feedback that would say otherwise, though!
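
To make that comparison explicit, the combined share of "Informed" and "Well informed" can be computed for each language group. This is a small sketch using the counts copied from the language_world_hunger table printed earlier. The column name percent_informed is my own and does not appear elsewhere in the analysis.

```r
library(dplyr)

# Counts copied from the language_world_hunger table above
language_world_hunger <- data.frame(
  language_self = rep(c("One", "Two", "Three", "Four +"), each = 4),
  informed_world_hunger = rep(c("Not informed", "Not well informed",
                                "Informed", "Well informed"), times = 4),
  count = c(97, 780, 2326, 793,
            92, 1717, 6514, 2409,
            95, 1453, 6091, 2731,
            61, 381, 1528, 954)
)

# Share of students in each language group answering "Informed" or "Well informed"
combined_informed <- language_world_hunger %>%
  group_by(language_self) %>%
  summarise(
    percent_informed = sum(count[informed_world_hunger %in% c("Informed", "Well informed")]) /
      sum(count) * 100
  ) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")))

combined_informed
```

The same steps could be repeated for the other topics to check where the "Four +" group falls below the "Three" group.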

Bibliography

Programme for International Student Assessment. (2020). Student questionnaire data files (PISA 2018 Database) [Dataset and codebook]. Organisation for Economic Co-operation and Development. https://www.oecd.org/pisa/data/2018database/

RStudio Team. (2022). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA. http://www.rstudio.com/

Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
