Data Analytics and Computational Social Science: HW4

Laura Collazo

Dataset

The dataset I’ve chosen for my final project is the Program for International Student Assessment (PISA) 2018 student data. It’s a large dataset containing 1,119 variables and 612,004 observations of 15-year-old students from 80 countries.

Read in Dataset

Note: The dataset originally read in was a very large SAS file. However, my computer’s memory was not sufficient to knit the file. As a workaround, I had to write a csv containing a limited number of variables and then read this back in.

#load packages

library(tidyverse)
library(haven)

#read in SAS & examine

pisa <- read_sas("cy07_msu_stu_qqq.sas7bdat", "CY07MSU_FMT_STU_QQQ.SAS7BCAT", encoding = NULL, .name_repair = "unique")

head(pisa)

tail(pisa)

unique(pisa[c("CNT")])

# select only desired variables and filter country for Spain

pisa_smaller <- pisa %>% 
  
select(c(CNT,
ST001D01T,
ST004D01T,
ST197Q01HA,
ST197Q02HA,
ST197Q04HA,
ST197Q07HA,
ST197Q08HA,
ST197Q09HA,
ST197Q12HA,
ST220Q01HA,
ST220Q02HA,
ST220Q03HA,
ST220Q04HA,
ST177Q01HA,
ST019AQ01T,
ST021Q01TA)) %>%
  
filter(CNT == "ESP")
  
#check work
 
head(pisa_smaller)

#write csv

write_csv(pisa_smaller, "pisa_smaller_2022-2-20.csv")

pisa <- read_csv("pisa_smaller_2022-2-20.csv")

head(pisa)

# A tibble: 6 x 17
  CNT   ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
  <chr>     <dbl>     <dbl>      <dbl>      <dbl>      <dbl>
1 ESP          10         2          4          4          4
2 ESP           9         1          3          2          3
3 ESP          10         2          4          3          3
4 ESP           8         2          2          1          3
5 ESP          10         1         NA         NA         NA
6 ESP          10         1          4          2          3
# ... with 11 more variables: ST197Q07HA <dbl>, ST197Q08HA <dbl>,
#   ST197Q09HA <dbl>, ST197Q12HA <dbl>, ST220Q01HA <dbl>,
#   ST220Q02HA <dbl>, ST220Q03HA <dbl>, ST220Q04HA <dbl>,
#   ST177Q01HA <dbl>, ST019AQ01T <dbl>, ST021Q01TA <dbl>

tail(pisa)

# A tibble: 6 x 17
  CNT   ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
  <chr>     <dbl>     <dbl>      <dbl>      <dbl>      <dbl>
1 ESP          10         1          4          4          4
2 ESP           9         2          3          3          3
3 ESP          10         2          4          4          4
4 ESP           9         2          2          2          2
5 ESP           8         2          3          3          3
6 ESP           9         1          2          2          2
# ... with 11 more variables: ST197Q07HA <dbl>, ST197Q08HA <dbl>,
#   ST197Q09HA <dbl>, ST197Q12HA <dbl>, ST220Q01HA <dbl>,
#   ST220Q02HA <dbl>, ST220Q03HA <dbl>, ST220Q04HA <dbl>,
#   ST177Q01HA <dbl>, ST019AQ01T <dbl>, ST021Q01TA <dbl>
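The memory problem described in the note above might also be eased by reading only the needed columns from the SAS file in the first place. The following is a minimal sketch, not what was actually run for this homework: it assumes an installed version of haven whose read_sas() supports the col_select argument, and the helper name read_pisa_subset and the vector keep_vars are made up for illustration. The read is guarded so it only runs when the two files are present.

```r
library(dplyr)

# Hypothetical helper: load only the listed columns from the SAS file
# (relies on the col_select argument of haven::read_sas)
read_pisa_subset <- function(data_file, catalog_file, vars) {
  haven::read_sas(data_file, catalog_file, col_select = all_of(vars))
}

# Same 17 variables that are kept in the select() call above
keep_vars <- c("CNT", "ST001D01T", "ST004D01T",
               "ST197Q01HA", "ST197Q02HA", "ST197Q04HA", "ST197Q07HA",
               "ST197Q08HA", "ST197Q09HA", "ST197Q12HA",
               "ST220Q01HA", "ST220Q02HA", "ST220Q03HA", "ST220Q04HA",
               "ST177Q01HA", "ST019AQ01T", "ST021Q01TA")

data_file <- "cy07_msu_stu_qqq.sas7bdat"
catalog_file <- "CY07MSU_FMT_STU_QQQ.SAS7BCAT"

# Only attempt the read when both files are available locally
if (file.exists(data_file) && file.exists(catalog_file)) {
  pisa_spain <- read_pisa_subset(data_file, catalog_file, keep_vars) %>%
    filter(CNT == "ESP")
}
```

All 612,004 rows are still read before filtering, but with 17 columns instead of 1,119 the object should be far smaller.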

#remove unneeded variables

pisa_tidy <- pisa %>%
  
select(-c("ST001D01T", "ST004D01T", "ST220Q01HA", "ST220Q02HA", "ST220Q03HA", "ST220Q04HA", "ST019AQ01T", "ST021Q01TA")) %>%
  
#rename variables
  
rename(country=CNT,
informed_climate_change=ST197Q01HA,
informed_global_health=ST197Q02HA,
informed_migration=ST197Q04HA,
informed_international_conflict=ST197Q07HA,
informed_world_hunger=ST197Q08HA,
informed_poverty_causes=ST197Q09HA,
informed_gender_equality=ST197Q12HA,
language_self=ST177Q01HA) %>%

#remove NAs

drop_na() %>%

#recode values (I still need to come back to this and learn to use across() to recode all variables beginning with "informed_")
  
mutate(country = recode(country, ESP = "Spain")) %>%
  
mutate(informed_climate_change = recode(informed_climate_change, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_global_health = recode(informed_global_health, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_migration = recode(informed_migration, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_international_conflict = recode(informed_international_conflict,
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_world_hunger = recode(informed_world_hunger, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_poverty_causes = recode(informed_poverty_causes, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%

mutate(informed_gender_equality = recode(informed_gender_equality, 
      `1` = "Not informed", 
      `2` = "Not well informed",
      `3` = "Informed", 
      `4` = "Well informed")) %>%
  
mutate(language_self = recode(language_self, `1` = "One", `2` = "Two", `3` = "Three", `4` = "Four or more"))
  
pisa_tidy

# A tibble: 28,022 x 9
   country informed_climate_change informed_global_h~ informed_migrat~
   <chr>   <chr>                   <chr>              <chr>           
 1 Spain   Well informed           Well informed      Well informed   
 2 Spain   Informed                Not well informed  Informed        
 3 Spain   Well informed           Informed           Informed        
 4 Spain   Not well informed       Not informed       Informed        
 5 Spain   Well informed           Not well informed  Informed        
 6 Spain   Informed                Not well informed  Not well inform~
 7 Spain   Informed                Informed           Informed        
 8 Spain   Not well informed       Not well informed  Informed        
 9 Spain   Not well informed       Informed           Informed        
10 Spain   Informed                Informed           Informed        
# ... with 28,012 more rows, and 5 more variables:
#   informed_international_conflict <chr>,
#   informed_world_hunger <chr>, informed_poverty_causes <chr>,
#   informed_gender_equality <chr>, language_self <chr>
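The comment in the pipeline above mentions wanting to use across() to recode every variable beginning with "informed_" in one step. Here is a minimal sketch of how that might look. The toy tibble and the helper recode_informed are made up for illustration; the labels are the same ones used in the mutate() calls above.

```r
library(dplyr)

# Hypothetical helper: one shared recode for the four-point "informed" scale
recode_informed <- function(x) {
  recode(x,
         `1` = "Not informed",
         `2` = "Not well informed",
         `3` = "Informed",
         `4` = "Well informed")
}

# Toy data standing in for the renamed PISA columns (values are made up)
toy <- tibble(informed_climate_change = c(4, 3, 1),
              informed_migration = c(2, 3, 4),
              language_self = c(1, 2, 4))

# Apply the helper to every column whose name starts with "informed_"
toy_recoded <- toy %>%
  mutate(across(starts_with("informed_"), recode_informed))

toy_recoded
```

In the real pipeline, the single mutate(across(...)) line would stand in for the seven separate mutate() calls, while language_self would keep its own recode because its labels differ.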

Research Question

The more I work with this data, the more my research question has narrowed. As I’m now a little less in panic mode working in R, I can also hear my undergrad professors stressing the importance of paring research questions down to be very specific. This is hard as I want to explore all the things, but I’m remembering there is plenty to dig into even in specific questions.

As of now, my research question is whether students in Spain feel better informed on seven different topics depending on how many languages they speak well enough to converse with someone else.

How informed are you about the following topics? Climate change and global warming (informed_climate_change)
How informed are you about the following topics? Global health (e.g. epidemics) (informed_global_health)
How informed are you about the following topics? Migration (movement of people) (informed_migration)
How informed are you about the following topics? International conflicts (informed_international_conflict)
How informed are you about the following topics? Hunger or malnutrition in different parts of the world (informed_world_hunger)
How informed are you about the following topics? Causes of poverty (informed_poverty_causes)
How informed are you about the following topics? Equality between men and women in different parts of the world (informed_gender_equality)
How many languages do you speak well enough to converse with others? (language_self)

I would love to expand this to look at all countries that responded to these variables to see if what I observe in Spain holds true elsewhere. I think this would be a much more interesting research question! Right now I’m trying to focus on the basics of R, so I want to keep things on the simpler side. I did leave the variable “country” in my dataset, though, so if/when my comfort level with R increases, I can dig deeper.

Univariate Visualizations

My dataset consists entirely of character variables, so I have created a frequency table for each as well as a univariate plot showing counts. These plots don’t directly answer my research question, but they do provide a general overview of each individual variable before they are grouped to explore my research question. The question asked of each participant is included before each variable’s frequency table and plot.

How informed are you about the following topics? Climate change and global warming

#frequency of informed_climate_change

select(pisa_tidy, "informed_climate_change") %>%
  group_by(informed_climate_change) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

# A tibble: 4 x 3
  informed_climate_change count frequency
  <chr>                   <int>     <dbl>
1 Not informed              541      1.93
2 Not well informed        4000     14.3 
3 Well informed            6858     24.5 
4 Informed                16623     59.3

#plot for informed_climate_change

ggplot(pisa_tidy, aes(x = fct_relevel(informed_climate_change, "Not informed", "Not well informed", "Informed", "Well informed"))) +
  geom_bar(fill = "turquoise3", color = "black") +
  labs(x = "Climate Change",
       y = "Count",
       title = "Participants by how informed on climate change")

How informed are you about the following topics? Global health (e.g. epidemics)

#frequency of informed_global_health

select(pisa_tidy, "informed_global_health") %>%
  group_by(informed_global_health) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

# A tibble: 4 x 3
  informed_global_health count frequency
  <chr>                  <int>     <dbl>
1 Not informed             467      1.67
2 Well informed           3991     14.2 
3 Not well informed       7249     25.9 
4 Informed               16315     58.2

#plot for informed_global_health

ggplot(pisa_tidy, aes(x = fct_relevel(informed_global_health, "Not informed", "Not well informed", "Informed", "Well informed"))) +
  geom_bar(fill = "turquoise3", color = "black") +
  labs(x = "Global Health",
       y = "Count",
       title = "Participants by how informed on global health")

How informed are you about the following topics? Migration (movement of people)

#frequency of informed_migration

select(pisa_tidy, "informed_migration") %>%
  group_by(informed_migration) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

# A tibble: 4 x 3
  informed_migration count frequency
  <chr>              <int>     <dbl>
1 Not informed         450      1.61
2 Well informed       5457     19.5 
3 Not well informed   5532     19.7 
4 Informed           16583     59.2

#plot for informed_migration

ggplot(pisa_tidy, aes(x = fct_relevel(informed_migration, "Not informed", "Not well informed", "Informed", "Well informed"))) +
  geom_bar(fill = "turquoise3", color = "black") +
  labs(x = "Migration",
       y = "Count",
       title = "Participants by how informed on migration")

How informed are you about the following topics? International conflicts

#frequency of informed_international_conflict

select(pisa_tidy, "informed_international_conflict") %>%
  group_by(informed_international_conflict) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

# A tibble: 4 x 3
  informed_international_conflict count frequency
  <chr>                           <int>     <dbl>
1 Not informed                      733      2.62
2 Well informed                    5182     18.5 
3 Not well informed                8349     29.8 
4 Informed                        13758     49.1

#plot for informed_international_conflict

ggplot(pisa_tidy, aes(x = fct_relevel(informed_international_conflict, "Not informed", "Not well informed", "Informed", "Well informed"))) +
  geom_bar(fill = "turquoise3", color = "black") +
  labs(x = "International Conflict",
       y = "Count",
       title = "Participants by how informed on international conflict")

How informed are you about the following topics? Hunger or malnutrition in different parts of the world

#frequency of informed_world_hunger

select(pisa_tidy, "informed_world_hunger") %>%
  group_by(informed_world_hunger) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

# A tibble: 4 x 3
  informed_world_hunger count frequency
  <chr>                 <int>     <dbl>
1 Not informed            345      1.23
2 Not well informed      4331     15.5 
3 Well informed          6887     24.6 
4 Informed              16459     58.7

#plot for informed_world_hunger

ggplot(pisa_tidy, aes(x = fct_relevel(informed_world_hunger, "Not informed", "Not well informed", "Informed", "Well informed"))) +
  geom_bar(fill = "turquoise3", color = "black") +
  labs(x = "World Hunger",
       y = "Count",
       title = "Participants by how informed on world hunger")

How informed are you about the following topics? Causes of poverty

#frequency of informed_poverty_causes

select(pisa_tidy,"informed_poverty_causes") %>%
  group_by(informed_poverty_causes) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

# A tibble: 4 x 3
  informed_poverty_causes count frequency
  <chr>                   <int>     <dbl>
1 Not informed              391      1.40
2 Not well informed        5230     18.7 
3 Well informed            7034     25.1 
4 Informed                15367     54.8

#plot for informed_poverty_causes

ggplot(pisa_tidy, aes(x = fct_relevel(informed_poverty_causes, "Not informed", "Not well informed", "Informed", "Well informed"))) +
  geom_bar(fill = "turquoise3", color = "black") +
  labs(x = "Poverty Causes",
       y = "Count",
       title = "Participants by how informed on poverty causes")

How informed are you about the following topics? Equality between men and women in different parts of the world

#frequency of informed_gender_equality

select(pisa_tidy, "informed_gender_equality") %>%
  group_by(informed_gender_equality) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

# A tibble: 4 x 3
  informed_gender_equality count frequency
  <chr>                    <int>     <dbl>
1 Not informed               365      1.30
2 Not well informed         1671      5.96
3 Informed                 11550     41.2 
4 Well informed            14436     51.5

#plot for informed_gender_equality

ggplot(pisa_tidy, aes(x = fct_relevel(informed_gender_equality, "Not informed", "Not well informed", "Informed", "Well informed"))) +
  geom_bar(fill = "turquoise3", color = "black") +
  labs(x = "Gender Equality",
       y = "Count",
       title = "Participants by how informed on gender equality")

How many languages do you speak well enough to converse with others?

#frequency of language_self

select(pisa_tidy, "language_self") %>%
  group_by(language_self) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

# A tibble: 4 x 3
  language_self count frequency
  <chr>         <int>     <dbl>
1 Four or more   2924      10.4
2 One            3996      14.3
3 Three         10370      37.0
4 Two           10732      38.3

# plot for language_self

ggplot(pisa_tidy, aes(x = fct_relevel(language_self, "One", "Two", "Three", "Four or more"))) +
  geom_bar(fill = "turquoise3", color = "black") +
  labs(x = "Languages Spoken",
       y = "Count",
       title = "Participants by languages spoken")
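The eight frequency tables above all repeat the same steps, so they could probably be wrapped in a small helper. This is only a sketch: the function name freq_table and the toy tibble are made up for illustration, and count() is used as a shortcut for group_by() followed by summarise(count = n()).

```r
library(dplyr)

# Hypothetical helper: count each value of one column and add its percentage
freq_table <- function(data, var) {
  data %>%
    count({{ var }}, name = "count") %>%
    mutate(frequency = count / sum(count) * 100) %>%
    arrange(count)
}

# Toy data standing in for pisa_tidy (values are made up)
toy <- tibble(language_self = c("One", "Two", "Two", "Three"))

toy_freq <- freq_table(toy, language_self)
toy_freq
```

With the real data, a call such as freq_table(pisa_tidy, informed_climate_change) should reproduce the first table above.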

Thoughts on Univariate Visualizations

After viewing these visualizations, it’s clear that most students feel informed or well informed on each topic. Gender equality stands out as the only topic where more students responded that they feel well informed (51.5%) than just informed (41.2%). Living in Spain, I know there has been a huge push in recent years to increase education on gender equality in the country. Although it doesn’t fit my current research question, it would be interesting to explore this variable over time using PISA datasets from previous years.

When it comes to the number of languages spoken, I wish the dataset included a follow-up question on which languages. Many regions of Spain have a regional language, so children are educated in both this language and Spanish. However, 37% of students responded that they speak three languages and 10.4% that they speak four or more, so even taking into account that speaking two languages is normal for many students in Spain, 47.4% have learned additional languages well enough to converse with others.

Bivariate Visualizations

There are seven initial groupings I need to create to explore my research question. Therefore, the following visualizations look at how informed a student in Spain feels they are on a specific topic by how many languages they speak well enough to converse with someone.

#language_self & informed_climate_change

language_climate_change <- select(pisa_tidy, "language_self", "informed_climate_change") %>%
  group_by(language_self, informed_climate_change) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

language_climate_change

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_climate_change count frequency
   <chr>         <chr>                   <int>     <dbl>
 1 Four or more  Not informed               94      3.21
 2 Three         Not informed              134      1.29
 3 Two           Not informed              142      1.32
 4 One           Not informed              171      4.28
 5 Four or more  Not well informed         329     11.3 
 6 One           Well informed             593     14.8 
 7 One           Not well informed         957     23.9 
 8 Four or more  Well informed             964     33.0 
 9 Three         Not well informed        1196     11.5 
10 Two           Not well informed        1518     14.1 
11 Four or more  Informed                 1537     52.6 
12 One           Informed                 2275     56.9 
13 Two           Well informed            2332     21.7 
14 Three         Well informed            2969     28.6 
15 Three         Informed                 6071     58.5 
16 Two           Informed                 6740     62.8

ggplot(language_climate_change, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
           y = frequency, 
           fill = factor(informed_climate_change, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", position = "fill") + 
  labs(y = "Frequency", fill = "Climate Change", x = "Languages Spoken", title = "Informed on climate change by number of languages\nspoken")

#language_self & informed_global_health

language_global_health <- select(pisa_tidy, "language_self", "informed_global_health") %>%
  group_by(language_self, informed_global_health) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

language_global_health

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_global_health count frequency
   <chr>         <chr>                  <int>     <dbl>
 1 Four or more  Not informed              69      2.36
 2 One           Not informed             126      3.15
 3 Three         Not informed             133      1.28
 4 Two           Not informed             139      1.30
 5 One           Well informed            420     10.5 
 6 Four or more  Not well informed        602     20.6 
 7 Four or more  Well informed            650     22.2 
 8 One           Not well informed       1303     32.6 
 9 Two           Well informed           1338     12.5 
10 Three         Well informed           1583     15.3 
11 Four or more  Informed                1603     54.8 
12 One           Informed                2147     53.7 
13 Three         Not well informed       2450     23.6 
14 Two           Not well informed       2894     27.0 
15 Three         Informed                6204     59.8 
16 Two           Informed                6361     59.3

ggplot(language_global_health, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
           y = frequency, 
           fill = factor(informed_global_health, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", position = "fill") + 
  labs(y = "Frequency", fill = "Global Health", x = "Languages Spoken", title = "Informed on global health by number of languages spoken")

#language_self & informed_migration

language_migration <- select(pisa_tidy, "language_self", "informed_migration") %>%
  group_by(language_self, informed_migration) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(informed_migration)

language_migration

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_migration count frequency
   <chr>         <chr>              <int>     <dbl>
 1 Four or more  Informed            1538     52.6 
 2 One           Informed            2301     57.6 
 3 Three         Informed            6196     59.7 
 4 Two           Informed            6548     61.0 
 5 Four or more  Not informed          81      2.77
 6 One           Not informed         134      3.35
 7 Three         Not informed         122      1.18
 8 Two           Not informed         113      1.05
 9 Four or more  Not well informed    455     15.6 
10 One           Not well informed    948     23.7 
11 Three         Not well informed   1881     18.1 
12 Two           Not well informed   2248     20.9 
13 Four or more  Well informed        850     29.1 
14 One           Well informed        613     15.3 
15 Three         Well informed       2171     20.9 
16 Two           Well informed       1823     17.0

ggplot(language_migration, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
           y = frequency, 
           fill = factor(informed_migration, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", position = "fill") + 
  labs(y = "Frequency", fill = "Migration", x = "Languages Spoken", title = "Informed on migration by number of languages spoken")

#language_self & informed_international_conflict

language_international_conflict <- select(pisa_tidy, "language_self", "informed_international_conflict") %>%
  group_by(language_self, informed_international_conflict) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

language_international_conflict

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_international_conflict count frequency
   <chr>         <chr>                           <int>     <dbl>
 1 Four or more  Not informed                       90      3.08
 2 One           Not informed                      195      4.88
 3 Three         Not informed                      222      2.14
 4 Two           Not informed                      226      2.11
 5 One           Well informed                     530     13.3 
 6 Four or more  Not well informed                 680     23.3 
 7 Four or more  Well informed                     821     28.1 
 8 Four or more  Informed                         1333     45.6 
 9 One           Not well informed                1480     37.0 
10 Two           Well informed                    1710     15.9 
11 One           Informed                         1791     44.8 
12 Three         Well informed                    2121     20.5 
13 Three         Not well informed                2870     27.7 
14 Two           Not well informed                3319     30.9 
15 Three         Informed                         5157     49.7 
16 Two           Informed                         5477     51.0

ggplot(language_international_conflict, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
           y = frequency, 
           fill = factor(informed_international_conflict, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", position = "fill") + 
  labs(y = "Frequency", fill = "International Conflict", x = "Languages Spoken", title = "Informed on international conflict by number of languages spoken")

#language_self & informed_world_hunger

language_world_hunger <- select(pisa_tidy, "language_self", "informed_world_hunger") %>%
  group_by(language_self, informed_world_hunger) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

language_world_hunger

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_world_hunger count frequency
   <chr>         <chr>                 <int>     <dbl>
 1 Four or more  Not informed             61     2.09 
 2 Two           Not informed             92     0.857
 3 Three         Not informed             95     0.916
 4 One           Not informed             97     2.43 
 5 Four or more  Not well informed       381    13.0  
 6 One           Not well informed       780    19.5  
 7 One           Well informed           793    19.8  
 8 Four or more  Well informed           954    32.6  
 9 Three         Not well informed      1453    14.0  
10 Four or more  Informed               1528    52.3  
11 Two           Not well informed      1717    16.0  
12 One           Informed               2326    58.2  
13 Two           Well informed          2409    22.4  
14 Three         Well informed          2731    26.3  
15 Three         Informed               6091    58.7  
16 Two           Informed               6514    60.7

ggplot(language_world_hunger, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
           y = frequency, 
           fill = factor(informed_world_hunger, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", position = "fill") + 
  labs(y = "Frequency", fill = "World Hunger", x = "Languages Spoken", title = "Informed on world hunger by number of languages spoken")

#language_self & informed_poverty_causes

language_poverty_causes <- select(pisa_tidy, "language_self", "informed_poverty_causes") %>%
  group_by(language_self, informed_poverty_causes) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

language_poverty_causes

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_poverty_causes count frequency
   <chr>         <chr>                   <int>     <dbl>
 1 Four or more  Not informed               62     2.12 
 2 Two           Not informed              106     0.988
 3 One           Not informed              108     2.70 
 4 Three         Not informed              115     1.11 
 5 Four or more  Not well informed         434    14.8  
 6 One           Well informed             825    20.6  
 7 One           Not well informed         894    22.4  
 8 Four or more  Well informed             998    34.1  
 9 Four or more  Informed                 1430    48.9  
10 Three         Not well informed        1789    17.3  
11 Two           Not well informed        2113    19.7  
12 One           Informed                 2169    54.3  
13 Two           Well informed            2409    22.4  
14 Three         Well informed            2802    27.0  
15 Three         Informed                 5664    54.6  
16 Two           Informed                 6104    56.9

ggplot(language_poverty_causes, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
           y = frequency, 
           fill = factor(informed_poverty_causes, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", position = "fill") + 
  labs(y = "Frequency", fill = "Poverty Causes", x = "Languages Spoken", title = "Informed on poverty causes by number of languages spoken")

#language_self & informed_gender_equality

language_gender_equality <- select(pisa_tidy, "language_self", "informed_gender_equality") %>%
  group_by(language_self, informed_gender_equality) %>%
  summarise(count = n()) %>%
  mutate(frequency = count/sum(count) * 100) %>%
  arrange(count)

language_gender_equality

# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_gender_equality count frequency
   <chr>         <chr>                    <int>     <dbl>
 1 Four or more  Not informed                72     2.46 
 2 Two           Not informed                80     0.745
 3 Three         Not informed                90     0.868
 4 One           Not informed               123     3.08 
 5 Four or more  Not well informed          154     5.27 
 6 One           Not well informed          393     9.83 
 7 Three         Not well informed          490     4.73 
 8 Two           Not well informed          634     5.91 
 9 Four or more  Informed                   966    33.0  
10 One           Well informed             1634    40.9  
11 Four or more  Well informed             1732    59.2  
12 One           Informed                  1846    46.2  
13 Three         Informed                  4004    38.6  
14 Two           Informed                  4734    44.1  
15 Two           Well informed             5284    49.2  
16 Three         Well informed             5786    55.8

ggplot(language_gender_equality, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
           y = frequency, 
           fill = factor(informed_gender_equality, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", position = "fill") + 
  labs(y = "Frequency", fill = "Gender Equality", x = "Languages Spoken", title = "Informed on gender equality by number of languages spoken")

Thoughts on Bivariate Visualizations

In each of the above visualizations, students who speak more languages generally report feeling better informed about the given topic. Interestingly, when informed and well informed are combined, in three of the seven topics (climate change, world hunger and gender equality) students who speak four or more languages felt slightly less informed overall than students who speak three languages. Calculating this difference would be a helpful statistic to include in my analysis.
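As a rough sketch of how that difference could be calculated, the code below combines "Informed" and "Well informed" into one share per language group for climate change. The counts are copied from the language_climate_change table above; the names informed_share, share_three and share_four are made up for illustration.

```r
library(dplyr)
library(tibble)

# Counts copied from the language_climate_change table above
language_climate_change <- tribble(
  ~language_self, ~informed_climate_change, ~count,
  "One",          "Not informed",            171,
  "One",          "Not well informed",       957,
  "One",          "Informed",               2275,
  "One",          "Well informed",           593,
  "Two",          "Not informed",            142,
  "Two",          "Not well informed",      1518,
  "Two",          "Informed",               6740,
  "Two",          "Well informed",          2332,
  "Three",        "Not informed",            134,
  "Three",        "Not well informed",      1196,
  "Three",        "Informed",               6071,
  "Three",        "Well informed",          2969,
  "Four or more", "Not informed",             94,
  "Four or more", "Not well informed",       329,
  "Four or more", "Informed",               1537,
  "Four or more", "Well informed",           964
)

# Percentage of each language group answering "Informed" or "Well informed"
informed_share <- language_climate_change %>%
  group_by(language_self) %>%
  summarise(
    informed_or_well =
      sum(count[informed_climate_change %in% c("Informed", "Well informed")]) /
      sum(count) * 100
  )

# Difference in percentage points: three languages minus four or more
share_three <- informed_share$informed_or_well[informed_share$language_self == "Three"]
share_four <- informed_share$informed_or_well[informed_share$language_self == "Four or more"]
share_three - share_four
```

For climate change the gap works out to roughly 1.6 percentage points (about 87.2% versus 85.5%). The same steps could be repeated for the other six topics.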

Improving Visualizations

I would like to add percentages to my visualizations, as this would help both me and a “naive viewer” understand what they are seeing without having to read the tibbles that display this information. I also don’t love the titles of each visualization and believe they could be improved.
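One possible way to add the percentages is a geom_text() layer that uses the same "fill" positioning as the bars. This is only a sketch: the percentages are copied from the climate change table above for the "One" and "Two" groups, the names plot_data and labelled_plot are made up, and geom_col() is used as the equivalent of geom_bar(stat = "identity").

```r
library(ggplot2)
library(tibble)

# Within-group percentages copied from the climate change table above
# (only two language groups, to keep the sketch short)
plot_data <- tribble(
  ~language_self, ~informed_climate_change, ~frequency,
  "One",          "Not informed",             4.28,
  "One",          "Not well informed",       23.9,
  "One",          "Informed",                56.9,
  "One",          "Well informed",           14.8,
  "Two",          "Not informed",             1.32,
  "Two",          "Not well informed",       14.1,
  "Two",          "Informed",                62.8,
  "Two",          "Well informed",           21.7
)

# Order the response levels from least to most informed
plot_data$informed_climate_change <- factor(
  plot_data$informed_climate_change,
  levels = c("Not informed", "Not well informed", "Informed", "Well informed")
)

labelled_plot <- ggplot(plot_data,
                        aes(x = language_self,
                            y = frequency,
                            fill = informed_climate_change)) +
  geom_col(position = "fill") +
  # Place each label in the middle of its bar segment
  geom_text(aes(label = paste0(round(frequency, 1), "%"),
                group = informed_climate_change),
            position = position_fill(vjust = 0.5),
            size = 3) +
  labs(x = "Languages Spoken", y = "Proportion", fill = "Climate Change")

labelled_plot
```

Labels on very thin segments, such as "Not informed", may overlap their neighbours, so those might need to be dropped or moved.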

I imagine there are other styles of visualizations that could also be created. I still feel pretty uncertain on plotting, though, so poco a poco (little by little).

Unanswered Questions

Although it’s clear there is a positive correlation between the number of languages spoken and how informed students feel they are on these seven topics (variables), I don’t believe this analysis alone is enough to conclude that the more languages you speak, the better informed you are on these topics. There are many other variables that could be at play.

Things I would want to explore further are:

Where does the student live (urban or rural area)?
Does the student attend a public, semi-private or private school?
How informed do the student’s parents feel on these topics?
What are the parents’ education levels?
What is the family’s income level?

For instance, I hypothesize that students attending private schools in urban areas speak more languages. If this is the case, these variables combined with family income and/or parents’ level of education, rather than the number of languages the student speaks, could be what affects how informed a student is on certain topics. I will have to do some investigating in the original dataset to see if any of these other variables are available. I believe there is also a parent questionnaire, so it’s possible I could join datasets to explore things further. A challenge with this could be the size of the datasets, though, and the limited memory available on my computer to work with them.
