DACSS 601 Final Project

This research analysis explores whether there is a relationship between the number of languages a student in Spain speaks and how well informed they feel on climate change, global health, migration, international conflict, world hunger, causes of poverty, and gender inequality.

Laura Collazo
2022-05-04

Introduction

A year ago my family made our third international move and my kids, ages 12 and 14, began learning a third language. This reality, coupled with my general research interests, often leads me to consider the impact that exposure to different languages, and therefore cultures, has on an individual. In most of the world multilingualism is commonplace, yet in the United States the idea of speaking a language other than English is not embraced. Sadly, this is often rooted in misconceptions around immigration and cultural diversity (Kroll and Dussias, 2017). The prevalence of this misinformation leads me to wonder whether speaking more languages aids in knowledge of important topics, such as immigration, and whether Americans therefore find themselves less informed in part as a result of their monolingualism.

In this analysis I explore whether there is a relationship between the number of languages a student in Spain speaks and how well informed they feel on climate change, global health, migration, international conflict, world hunger, causes of poverty, and gender inequality. This is not quite what I had hoped to explore, as my original goal was to compare responses between students in Spain and the United States. The reasons for this change are explained in my reflection towards the end of this analysis.

An important piece of information to keep in mind when reading this analysis is that, apart from Spanish (more properly called Castellano, to distinguish it from the varieties spoken in Central and South America), many regional languages exist in Spain and are taught in local schools. The primary regional languages are Catalan, spoken in Catalonia and the Balearic Islands; Galician, in Galicia; Euskara, in the Basque Country and parts of Navarre; and Valencian, in Valencia. Two less common languages are Aranese, spoken in the northeastern part of Spain, and Extremaduran, spoken in the western region of Extremadura. There are also some endangered minority languages in Spain, including Aragonese, Asturian and Leonese (Luna, 2017). It is therefore not uncommon for students in Spain to speak at least two languages: Castellano and their regional language.

Data

The data for this analysis comes from the Organisation for Economic Co-operation and Development’s 2018 Programme for International Student Assessment (PISA), which “measures 15-year-olds’ ability to use their reading, mathematics and science knowledge and skills to meet real-life challenges” (OECD, n.d.). The complete 2018 dataset is made up of many questionnaires given to students, teachers, principals and parents; this analysis uses data from the student questionnaire. It alone is a large dataset containing 1,119 variables and 612,004 observations from 80 countries. I relied heavily on the codebook for this dataset to understand the raw data and to rename variables and values.

Read in Dataset

The 2018 PISA student dataset is a very large SAS file, and my computer did not have enough memory to work with it directly. As a workaround, a csv containing a limited number of variables was written out and then read back into RStudio. Further discussion of this can be found in the Reflection section.

Show code
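#load packages (haven for read_sas, tidyverse for the pipe, dplyr and readr)

library(haven)
library(tidyverse)

#read in SAS file

pisa <- read_sas("cy07_msu_stu_qqq.sas7bdat", "CY07MSU_FMT_STU_QQQ.SAS7BCAT", encoding = NULL, .name_repair = "unique")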

#determine how many countries are in dataset

unique(pisa[c("CNT")])

# select desired variables and filter country for Spain

pisa_smaller <- pisa %>%
  select(c(CNT, ST001D01T, ST004D01T, ST197Q01HA, ST197Q02HA, ST197Q04HA, ST197Q07HA, ST197Q08HA, ST197Q09HA, ST197Q12HA,
           ST220Q01HA, ST220Q02HA, ST220Q03HA, ST220Q04HA, ST177Q01HA, ST019AQ01T, ST021Q01TA)) %>%
  filter(CNT == "ESP")

#write csv

write_csv(pisa_smaller, "pisa_smaller_2022-2-20.csv")
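
The memory workaround described above might also be approached by never loading the unwanted columns in the first place. The following is only a sketch, assuming a version of haven whose read_sas() supports the col_select and n_max arguments. It uses the small iris.sas7bdat example file that ships with haven as a stand-in for the PISA file, and the choice of the first two columns is purely illustrative.

```r
library(haven)

# Sample SAS file bundled with haven; a stand-in for the much larger PISA file.
path <- system.file("examples", "iris.sas7bdat", package = "haven")

# Read a single row to learn the column names without loading everything.
all_names <- names(read_sas(path, n_max = 1))

# Illustrative subset of columns to keep (here simply the first two).
keep <- all_names[1:2]

# col_select limits which columns are read into memory.
iris_small <- read_sas(path, col_select = tidyselect::all_of(keep))

ncol(iris_small)
```

For the PISA file, keep would instead hold the variable codes selected above (CNT, ST197Q01HA, and so on). I have not run this against the full dataset, so whether it avoids the memory problem there is an open question.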

Tidy data

To examine my research question, the dataset was filtered to include responses from students living in Spain along with eight character variables: how many languages students speak well enough to converse with others (language_self) and how informed the student feels on each of seven topics (climate change, global health, migration, international conflict, world hunger, causes of poverty, and gender equality).

After filtering for students living in Spain and removing observations with NAs in any of the selected variables (7,921 observations were removed), the final number of observations used for this analysis totals 28,022 students.
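
The number of removed observations is the difference between the row counts before and after dropping NAs (35,943 - 28,022 = 7,921 in this analysis). A minimal sketch of that check, using a small invented tibble rather than the PISA data:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the Spain subset; the values are invented for illustration.
toy <- tibble(
  informed_migration = c(3, NA, 4, 2, NA),
  language_self      = c(2, 1, NA, 3, NA)
)

# Keep only rows with no missing values in any column.
toy_complete <- drop_na(toy)

# Rows removed because at least one selected variable was missing.
n_removed <- nrow(toy) - nrow(toy_complete)
n_removed
```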

Show code
#read in csv

pisa <- read_csv("pisa_smaller_2022-2-20.csv")

pisa
# A tibble: 35,943 x 17
   CNT   ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
   <chr>     <dbl>     <dbl>      <dbl>      <dbl>      <dbl>
 1 ESP          10         2          4          4          4
 2 ESP           9         1          3          2          3
 3 ESP          10         2          4          3          3
 4 ESP           8         2          2          1          3
 5 ESP          10         1         NA         NA         NA
 6 ESP          10         1          4          2          3
 7 ESP           9         1         NA         NA         NA
 8 ESP           9         2          3          2          2
 9 ESP           9         2         NA         NA         NA
10 ESP          10         2          3          3          3
# ... with 35,933 more rows, and 11 more variables: ST197Q07HA <dbl>,
#   ST197Q08HA <dbl>, ST197Q09HA <dbl>, ST197Q12HA <dbl>,
#   ST220Q01HA <dbl>, ST220Q02HA <dbl>, ST220Q03HA <dbl>,
#   ST220Q04HA <dbl>, ST177Q01HA <dbl>, ST019AQ01T <dbl>,
#   ST021Q01TA <dbl>
Show code
#remove additional variables

pisa_tidy <- pisa %>%
  
select(-c("ST001D01T", "ST004D01T", "ST220Q01HA", "ST220Q02HA", "ST220Q03HA", "ST220Q04HA", "ST019AQ01T", "ST021Q01TA")) %>%
  
#rename variables
  
rename(country=CNT,
informed_climate_change=ST197Q01HA,
informed_global_health=ST197Q02HA,
informed_migration=ST197Q04HA,
informed_international_conflict=ST197Q07HA,
informed_world_hunger=ST197Q08HA,
informed_poverty_causes=ST197Q09HA,
informed_gender_equality=ST197Q12HA,
language_self=ST177Q01HA) %>%

#remove NAs

drop_na %>%

#recode values
  
mutate(country = recode(country, ESP = "Spain")) %>%
  
#apply the same recoding to all seven "informed" variables

mutate(across(starts_with("informed_"), ~ recode(.x,
      `1` = "Not informed",
      `2` = "Not well informed",
      `3` = "Informed",
      `4` = "Well informed"))) %>%
  
mutate(language_self = recode(language_self, 
      `1` = "One", 
      `2` = "Two", 
      `3` = "Three", 
      `4` = "Four +"))

#examine

pisa_tidy
# A tibble: 28,022 x 9
   country informed_climate_change informed_global_h~ informed_migrat~
   <chr>   <chr>                   <chr>              <chr>           
 1 Spain   Well informed           Well informed      Well informed   
 2 Spain   Informed                Not well informed  Informed        
 3 Spain   Well informed           Informed           Informed        
 4 Spain   Not well informed       Not informed       Informed        
 5 Spain   Well informed           Not well informed  Informed        
 6 Spain   Informed                Not well informed  Not well inform~
 7 Spain   Informed                Informed           Informed        
 8 Spain   Not well informed       Not well informed  Informed        
 9 Spain   Not well informed       Informed           Informed        
10 Spain   Informed                Informed           Informed        
# ... with 28,012 more rows, and 5 more variables:
#   informed_international_conflict <chr>,
#   informed_world_hunger <chr>, informed_poverty_causes <chr>,
#   informed_gender_equality <chr>, language_self <chr>
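
Because the recoded variables are character columns, the intended response order has to be restated wherever a table or plot is made. One alternative, sketched here with invented rows and illustrative object names (toy_tidy, informed_levels, language_levels), would be to store the order once by converting the columns to factors:

```r
library(dplyr)

informed_levels <- c("Not informed", "Not well informed", "Informed", "Well informed")
language_levels <- c("One", "Two", "Three", "Four +")

# Toy rows standing in for pisa_tidy; the values are invented.
toy_tidy <- tibble(
  informed_climate_change = c("Well informed", "Informed", "Not informed"),
  informed_migration      = c("Informed", "Not well informed", "Informed"),
  language_self           = c("Two", "Four +", "One")
)

# Convert every "informed" column and language_self to factors with fixed levels.
toy_factored <- toy_tidy %>%
  mutate(
    across(starts_with("informed_"), ~ factor(.x, levels = informed_levels)),
    language_self = factor(language_self, levels = language_levels)
  )

# count() now follows the questionnaire order rather than alphabetical order.
count(toy_factored, language_self)
```

The rest of this analysis keeps the character columns and reorders them where needed, so this is a design alternative rather than what was done here.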

Examine Data

To gain an initial understanding of all variables, count and percent were calculated and univariate plots were created. Functions were created to view these statistics for the “informed” variables and to plot how informed a student feels on a given topic. The question asked of participants has been included for each variable.

Examining the variables shows that 85.7 percent of students speak two or more languages well enough to converse with someone else: 38.3 percent speak two, 37.0 percent speak three, and 10.4 percent speak four or more. For six of the seven topics, the most common response was “Informed” (between 49.1 and 59.3 percent of students). Gender equality is the exception, with 51.5 percent of students responding that they feel well informed. The table below summarizes how students responded, by percent.

Topic                    Not informed   Not well informed   Informed   Well informed
Climate change           1.93%          14.3%               59.3%      24.5%
Global health            1.67%          25.9%               58.2%      14.2%
Migration                1.61%          19.7%               59.2%      19.5%
International conflict   2.62%          29.8%               49.1%      18.5%
World hunger             1.23%          15.5%               58.7%      24.6%
Causes of poverty        1.40%          18.7%               54.8%      25.1%
Gender equality          1.30%          5.96%               41.2%      51.5%
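
A wide summary like the one above could also be assembled in a single pipeline instead of being copied from the individual outputs. The sketch below assumes tidyr is available and uses a tiny invented tibble (toy_tidy) with two of the seven topics, so its percentages are not the PISA results.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for pisa_tidy with two topics; the values are invented.
toy_tidy <- tibble(
  informed_climate_change = c("Informed", "Well informed", "Informed", "Not informed"),
  informed_migration      = c("Informed", "Informed", "Not well informed", "Informed")
)

topic_summary <- toy_tidy %>%
  # One row per student-topic pair.
  pivot_longer(starts_with("informed_"), names_to = "topic", values_to = "response") %>%
  count(topic, response) %>%
  # Percent of responses within each topic.
  group_by(topic) %>%
  mutate(percent = round(n / sum(n) * 100, 1)) %>%
  ungroup() %>%
  select(-n) %>%
  # One column per response; responses nobody gave are filled with 0.
  pivot_wider(names_from = response, values_from = percent, values_fill = 0)

topic_summary
```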

Univariate count and percent function for “informed” variables

Show code
#create function

uni_summary_stats <- function(mydata, myxvar) {
  select(mydata, {{myxvar}}) %>%
    group_by({{myxvar}}) %>%
    summarise(count = n()) %>%
    mutate(percent = count/sum(count) * 100) %>%
    arrange(fct_relevel({{myxvar}}, "Not informed", "Not well informed", "Informed", "Well informed"))
}

Univariate plot function for “informed” variables

Show code
#create function

uniplot<- function(mycol, myxlab, mytitle) {
  ggplot(pisa_tidy, aes(x = fct_relevel({{mycol}}, "Not informed", "Not well informed", "Informed", "Well informed"))) + 
   geom_bar (fill = "turquoise3",color = "black") +
    labs(x = myxlab,
         y = "Count",
         title = mytitle, 
         subtitle = "Spain, 2018", 
         caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))
}

How many languages do you speak well enough to converse with others?

Show code
#calculate count and percent for language_self

select(pisa_tidy, "language_self") %>%
  group_by(language_self) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(fct_relevel(language_self, "One", "Two", "Three", "Four +"))
# A tibble: 4 x 3
  language_self count percent
  <chr>         <int>   <dbl>
1 One            3996    14.3
2 Two           10732    38.3
3 Three         10370    37.0
4 Four +         2924    10.4
Show code
#plot for language_self

  ggplot(pisa_tidy, aes(x = fct_relevel(language_self, "One", "Two", "Three", "Four +"))) + 
  geom_bar (fill = "turquoise3", color = "black") +
  labs(x = "Languages Spoken",
       y = "Count",
       title = "Number of languages students speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
    theme_linedraw() +
    theme(axis.text.x = element_text( size = 9),
          axis.text.y = element_text( size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0))

How informed are you about the following topics? Climate change and global warming

Show code
#calculate count and percent for informed_climate_change

uni_summary_stats(pisa_tidy, informed_climate_change)
# A tibble: 4 x 3
  informed_climate_change count percent
  <chr>                   <int>   <dbl>
1 Not informed              541    1.93
2 Not well informed        4000   14.3 
3 Informed                16623   59.3 
4 Well informed            6858   24.5 
Show code
#plot for informed_climate_change

uniplot(informed_climate_change, "Climate Change", "How informed students feel on climate change")

How informed are you about the following topics? Global health (e.g. epidemics)

Show code
#calculate count and percent for informed_global_health

uni_summary_stats(pisa_tidy, informed_global_health)
# A tibble: 4 x 3
  informed_global_health count percent
  <chr>                  <int>   <dbl>
1 Not informed             467    1.67
2 Not well informed       7249   25.9 
3 Informed               16315   58.2 
4 Well informed           3991   14.2 
Show code
#plot for informed_global_health

uniplot(informed_global_health, "Global Health", "How informed students feel on global health")

How informed are you about the following topics? Migration (movement of people)

Show code
#calculate count and percent for informed_migration

uni_summary_stats(pisa_tidy, informed_migration)
# A tibble: 4 x 3
  informed_migration count percent
  <chr>              <int>   <dbl>
1 Not informed         450    1.61
2 Not well informed   5532   19.7 
3 Informed           16583   59.2 
4 Well informed       5457   19.5 
Show code
#plot for informed_migration

uniplot(informed_migration, "Migration", "How informed students feel on migration")

How informed are you about the following topics? International conflicts

Show code
#calculate count and percent for informed_international_conflict

uni_summary_stats(pisa_tidy, informed_international_conflict)
# A tibble: 4 x 3
  informed_international_conflict count percent
  <chr>                           <int>   <dbl>
1 Not informed                      733    2.62
2 Not well informed                8349   29.8 
3 Informed                        13758   49.1 
4 Well informed                    5182   18.5 
Show code
#plot for informed_international_conflict

uniplot(informed_international_conflict, "International Conflict", "How informed students feel on international conflicts")

How informed are you about the following topics? Hunger or malnutrition in different parts of the world

Show code
#calculate count and percent for informed_world_hunger

uni_summary_stats(pisa_tidy, informed_world_hunger)
# A tibble: 4 x 3
  informed_world_hunger count percent
  <chr>                 <int>   <dbl>
1 Not informed            345    1.23
2 Not well informed      4331   15.5 
3 Informed              16459   58.7 
4 Well informed          6887   24.6 
Show code
#plot for informed_world_hunger

uniplot(informed_world_hunger, "World Hunger", "How informed students feel on world hunger")

How informed are you about the following topics? Causes of poverty

Show code
#calculate count and percent for informed_poverty_causes

uni_summary_stats(pisa_tidy, informed_poverty_causes)
# A tibble: 4 x 3
  informed_poverty_causes count percent
  <chr>                   <int>   <dbl>
1 Not informed              391    1.40
2 Not well informed        5230   18.7 
3 Informed                15367   54.8 
4 Well informed            7034   25.1 
Show code
#plot for informed_poverty_causes

uniplot(informed_poverty_causes, "Poverty Causes", "How informed students feel on causes of poverty")

How informed are you about the following topics? Equality between men and women in different parts of the world

Show code
#calculate count and percent for informed_gender_equality

uni_summary_stats(pisa_tidy, informed_gender_equality)
# A tibble: 4 x 3
  informed_gender_equality count percent
  <chr>                    <int>   <dbl>
1 Not informed               365    1.30
2 Not well informed         1671    5.96
3 Informed                 11550   41.2 
4 Well informed            14436   51.5 
Show code
#plot for informed_gender_equality

uniplot(informed_gender_equality, "Gender Equality", "How informed students feel on gender equality")

Visualizations

To explore my research question, seven bivariate plots were initially created using the variable language_self and each of the seven variables which ask how informed students feel about a certain topic. A second round of seven visualizations was then created in which students who responded “Not informed” or “Not well informed” were combined into one value, “Not informed”, and students who responded “Informed” or “Well informed” were combined into one value, “Informed.” A function, biplot, was created to aid in the creation of these plots.
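
The count and percent tables behind these plots follow the same steps for every topic, so they could be produced by one helper. The function below, bi_summary_stats, is hypothetical (its name, arguments and the generic response column are my own illustration), and it is shown on a small invented tibble rather than pisa_tidy.

```r
library(dplyr)

# Hypothetical helper: count and percent of responses within each language group.
# collapse = TRUE merges the four responses into "Not informed" and "Informed".
bi_summary_stats <- function(mydata, myvar, collapse = FALSE) {
  out <- mydata %>%
    select(language_self, response = {{ myvar }})

  if (collapse) {
    out <- out %>%
      mutate(response = recode(response,
        "Not well informed" = "Not informed",
        "Well informed"     = "Informed"))
  }

  out %>%
    count(language_self, response, name = "count") %>%
    group_by(language_self) %>%
    mutate(percent = count / sum(count) * 100) %>%
    ungroup()
}

# Toy stand-in for pisa_tidy; the values are invented.
toy_tidy <- tibble(
  language_self      = c("One", "One", "Two", "Two", "Two", "Two"),
  informed_migration = c("Not informed", "Well informed",
                         "Informed", "Well informed", "Not well informed", "Informed")
)

bi_summary_stats(toy_tidy, informed_migration, collapse = TRUE)
```

With this helper the topic column is always named response, so any plotting code would need to use that name for the fill variable. The analysis below instead builds each table separately and keeps the original variable names.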

Bivariate plot function

Show code
#create function

biplot<-function(mydata, myfillvar, mytitle) {
  ggplot(mydata, 
       aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent, 
           fill = factor(.data[[myfillvar]], 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) + 
  geom_bar(stat = "identity", position = "fill") + 
  labs(y = "Frequency", 
       x = "Languages Spoken", 
       title = mytitle,
       subtitle = "Spain, 2018", 
       caption = "Source: PISA 2018 Student Questionnaire Database") +
  theme_linedraw() +
    theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0),
          legend.title = element_blank()) +
   scale_y_continuous(breaks = seq(0, 1, by = .1))
  
}

How informed students feel on climate change by number of languages they speak

Show code
#create new object to calculate count and percent for language_self & informed_climate_change

language_climate_change <- select(pisa_tidy, "language_self", "informed_climate_change") %>%
  group_by(language_self, informed_climate_change) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_climate_change, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_climate_change
# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_climate_change count percent
   <chr>         <chr>                   <int>   <dbl>
 1 One           Not informed              171    4.28
 2 One           Not well informed         957   23.9 
 3 One           Informed                 2275   56.9 
 4 One           Well informed             593   14.8 
 5 Two           Not informed              142    1.32
 6 Two           Not well informed        1518   14.1 
 7 Two           Informed                 6740   62.8 
 8 Two           Well informed            2332   21.7 
 9 Three         Not informed              134    1.29
10 Three         Not well informed        1196   11.5 
11 Three         Informed                 6071   58.5 
12 Three         Well informed            2969   28.6 
13 Four +        Not informed               94    3.21
14 Four +        Not well informed         329   11.3 
15 Four +        Informed                 1537   52.6 
16 Four +        Well informed             964   33.0 
Show code
#create plot for language_climate_change

biplot(language_climate_change, "informed_climate_change", "How informed students feel on climate change\nby number of languages they speak")
Show code
#create new object to combine responses, and calculate count and percent

language_climate_change_2 <- pisa_tidy%>%
  mutate(informed_climate_change = recode(informed_climate_change, 
      `Not informed` = "Not informed", 
      `Not well informed` = "Not informed",
      `Informed` = "Informed", 
      `Well informed` = "Informed")) %>%
  group_by(language_self, informed_climate_change) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_climate_change, 
                         levels = c("Not informed", "Informed")))

language_climate_change_2
# A tibble: 8 x 4
# Groups:   language_self [4]
  language_self informed_climate_change count percent
  <chr>         <chr>                   <int>   <dbl>
1 One           Not informed             1128    28.2
2 One           Informed                 2868    71.8
3 Two           Not informed             1660    15.5
4 Two           Informed                 9072    84.5
5 Three         Not informed             1330    12.8
6 Three         Informed                 9040    87.2
7 Four +        Not informed              423    14.5
8 Four +        Informed                 2501    85.5
Show code
#create plot for language_climate_change_2

biplot(language_climate_change_2, "informed_climate_change", "How informed students feel on climate change\nby number of languages they speak")

How informed students feel on global health by number of languages they speak

Show code
#create new object to calculate count and percent for language_self & informed_global_health

language_global_health <- select(pisa_tidy, "language_self", "informed_global_health") %>%
  group_by(language_self, informed_global_health) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_global_health, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_global_health
# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_global_health count percent
   <chr>         <chr>                  <int>   <dbl>
 1 One           Not informed             126    3.15
 2 One           Not well informed       1303   32.6 
 3 One           Informed                2147   53.7 
 4 One           Well informed            420   10.5 
 5 Two           Not informed             139    1.30
 6 Two           Not well informed       2894   27.0 
 7 Two           Informed                6361   59.3 
 8 Two           Well informed           1338   12.5 
 9 Three         Not informed             133    1.28
10 Three         Not well informed       2450   23.6 
11 Three         Informed                6204   59.8 
12 Three         Well informed           1583   15.3 
13 Four +        Not informed              69    2.36
14 Four +        Not well informed        602   20.6 
15 Four +        Informed                1603   54.8 
16 Four +        Well informed            650   22.2 
Show code
#create plot for language_global_health

biplot(language_global_health, "informed_global_health", "How informed students feel on global health\nby number of languages they speak" )
Show code
#create new object to combine responses, and calculate count and percent

language_global_health_2 <- pisa_tidy%>%
  mutate(informed_global_health = recode(informed_global_health, 
      `Not informed` = "Not informed", 
      `Not well informed` = "Not informed",
      `Informed` = "Informed", 
      `Well informed` = "Informed")) %>%
  group_by(language_self, informed_global_health) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_global_health, 
                         levels = c("Not informed", "Informed")))

language_global_health_2
# A tibble: 8 x 4
# Groups:   language_self [4]
  language_self informed_global_health count percent
  <chr>         <chr>                  <int>   <dbl>
1 One           Not informed            1429    35.8
2 One           Informed                2567    64.2
3 Two           Not informed            3033    28.3
4 Two           Informed                7699    71.7
5 Three         Not informed            2583    24.9
6 Three         Informed                7787    75.1
7 Four +        Not informed             671    22.9
8 Four +        Informed                2253    77.1
Show code
#create plot for language_global_health_2

biplot(language_global_health_2, "informed_global_health", "How informed students feel on global health\nby number of languages they speak" )

How informed students feel on migration by number of languages they speak

Show code
#create new object to calculate count and percent for language_self & informed_migration

language_migration <- select(pisa_tidy, "language_self", "informed_migration") %>%
  group_by(language_self, informed_migration) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_migration, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_migration
# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_migration count percent
   <chr>         <chr>              <int>   <dbl>
 1 One           Not informed         134    3.35
 2 One           Not well informed    948   23.7 
 3 One           Informed            2301   57.6 
 4 One           Well informed        613   15.3 
 5 Two           Not informed         113    1.05
 6 Two           Not well informed   2248   20.9 
 7 Two           Informed            6548   61.0 
 8 Two           Well informed       1823   17.0 
 9 Three         Not informed         122    1.18
10 Three         Not well informed   1881   18.1 
11 Three         Informed            6196   59.7 
12 Three         Well informed       2171   20.9 
13 Four +        Not informed          81    2.77
14 Four +        Not well informed    455   15.6 
15 Four +        Informed            1538   52.6 
16 Four +        Well informed        850   29.1 
Show code
#create plot for language_migration

biplot(language_migration, "informed_migration","How informed students feel on migration\nby number of languages they speak")
Show code
#create new object to combine responses, and calculate count and percent

language_migration_2 <- pisa_tidy%>%
  mutate(informed_migration = recode(informed_migration, 
      `Not informed` = "Not informed", 
      `Not well informed` = "Not informed",
      `Informed` = "Informed", 
      `Well informed` = "Informed")) %>%
  group_by(language_self, informed_migration) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_migration, 
                         levels = c("Not informed", "Informed")))

language_migration_2
# A tibble: 8 x 4
# Groups:   language_self [4]
  language_self informed_migration count percent
  <chr>         <chr>              <int>   <dbl>
1 One           Not informed        1082    27.1
2 One           Informed            2914    72.9
3 Two           Not informed        2361    22.0
4 Two           Informed            8371    78.0
5 Three         Not informed        2003    19.3
6 Three         Informed            8367    80.7
7 Four +        Not informed         536    18.3
8 Four +        Informed            2388    81.7
Show code
#create plot for language_migration_2

biplot(language_migration_2, "informed_migration","How informed students feel on migration\nby number of languages they speak")

How informed students feel on international conflicts by number of languages they speak

Show code
#create new object to calculate count and percent for language_self & informed_international_conflict

language_international_conflict <- select(pisa_tidy, "language_self", "informed_international_conflict") %>%
  group_by(language_self, informed_international_conflict) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_international_conflict, 
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_international_conflict
# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_international_conflict count percent
   <chr>         <chr>                           <int>   <dbl>
 1 One           Not informed                      195    4.88
 2 One           Not well informed                1480   37.0 
 3 One           Informed                         1791   44.8 
 4 One           Well informed                     530   13.3 
 5 Two           Not informed                      226    2.11
 6 Two           Not well informed                3319   30.9 
 7 Two           Informed                         5477   51.0 
 8 Two           Well informed                    1710   15.9 
 9 Three         Not informed                      222    2.14
10 Three         Not well informed                2870   27.7 
11 Three         Informed                         5157   49.7 
12 Three         Well informed                    2121   20.5 
13 Four +        Not informed                       90    3.08
14 Four +        Not well informed                 680   23.3 
15 Four +        Informed                         1333   45.6 
16 Four +        Well informed                     821   28.1 
Show code
#create plot for language_international_conflict

biplot(language_international_conflict, "informed_international_conflict", "How informed students feel on international conflict\nby number of languages they speak")
Show code
#create new object to combine responses, and calculate count and percent

language_international_conflict_2 <- pisa_tidy%>%
  mutate(informed_international_conflict = recode(informed_international_conflict,
      `Not informed` = "Not informed", 
      `Not well informed` = "Not informed",
      `Informed` = "Informed", 
      `Well informed` = "Informed")) %>%
  group_by(language_self, informed_international_conflict) %>%
summarise(count = n()) %>%
  mutate(percent = count/sum(count) * 100) %>%
  arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_international_conflict, 
                         levels = c("Not informed", "Informed")))

language_international_conflict_2
# A tibble: 8 x 4
# Groups:   language_self [4]
  language_self informed_international_conflict count percent
  <chr>         <chr>                           <int>   <dbl>
1 One           Not informed                     1675    41.9
2 One           Informed                         2321    58.1
3 Two           Not informed                     3545    33.0
4 Two           Informed                         7187    67.0
5 Three         Not informed                     3092    29.8
6 Three         Informed                         7278    70.2
7 Four +        Not informed                      770    26.3
8 Four +        Informed                         2154    73.7
Show code
#create plot for language_international_conflict_2

biplot(language_international_conflict_2, "informed_international_conflict", "How informed students feel on international conflict\nby number of languages they speak")

How informed students feel on world hunger by number of languages they speak

Show code
#create new object to calculate count and percent for language_self & informed_world_hunger

language_world_hunger <- select(pisa_tidy, "language_self", "informed_world_hunger") %>%
  group_by(language_self, informed_world_hunger) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_world_hunger,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_world_hunger
# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_world_hunger count percent
   <chr>         <chr>                 <int>   <dbl>
 1 One           Not informed             97   2.43 
 2 One           Not well informed       780  19.5  
 3 One           Informed               2326  58.2  
 4 One           Well informed           793  19.8  
 5 Two           Not informed             92   0.857
 6 Two           Not well informed      1717  16.0  
 7 Two           Informed               6514  60.7  
 8 Two           Well informed          2409  22.4  
 9 Three         Not informed             95   0.916
10 Three         Not well informed      1453  14.0  
11 Three         Informed               6091  58.7  
12 Three         Well informed          2731  26.3  
13 Four +        Not informed             61   2.09 
14 Four +        Not well informed       381  13.0  
15 Four +        Informed               1528  52.3  
16 Four +        Well informed           954  32.6  
Show code
#create plot for language_world_hunger

biplot(language_world_hunger, "informed_world_hunger", "How informed students feel on world hunger\nby number of languages they speak")
Show code
#create new object to combine responses, and calculate count and percent

language_world_hunger_2 <- pisa_tidy %>%
  mutate(informed_world_hunger = recode(informed_world_hunger,
    `Not informed` = "Not informed",
    `Not well informed` = "Not informed",
    `Informed` = "Informed",
    `Well informed` = "Informed")) %>%
  group_by(language_self, informed_world_hunger) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_world_hunger, levels = c("Not informed", "Informed")))

language_world_hunger_2
# A tibble: 8 x 4
# Groups:   language_self [4]
  language_self informed_world_hunger count percent
  <chr>         <chr>                 <int>   <dbl>
1 One           Not informed            877    21.9
2 One           Informed               3119    78.1
3 Two           Not informed           1809    16.9
4 Two           Informed               8923    83.1
5 Three         Not informed           1548    14.9
6 Three         Informed               8822    85.1
7 Four +        Not informed            442    15.1
8 Four +        Informed               2482    84.9
Show code
#create plot for language_world_hunger_2

biplot(language_world_hunger_2, "informed_world_hunger", "How informed students feel on world hunger\nby number of languages they speak")

How informed students feel on causes of poverty by number of languages they speak

Show code
#create new object to calculate count and percent for language_self & informed_poverty_causes

language_poverty_causes <- select(pisa_tidy, "language_self", "informed_poverty_causes") %>%
  group_by(language_self, informed_poverty_causes) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_poverty_causes,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_poverty_causes
# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_poverty_causes count percent
   <chr>         <chr>                   <int>   <dbl>
 1 One           Not informed              108   2.70 
 2 One           Not well informed         894  22.4  
 3 One           Informed                 2169  54.3  
 4 One           Well informed             825  20.6  
 5 Two           Not informed              106   0.988
 6 Two           Not well informed        2113  19.7  
 7 Two           Informed                 6104  56.9  
 8 Two           Well informed            2409  22.4  
 9 Three         Not informed              115   1.11 
10 Three         Not well informed        1789  17.3  
11 Three         Informed                 5664  54.6  
12 Three         Well informed            2802  27.0  
13 Four +        Not informed               62   2.12 
14 Four +        Not well informed         434  14.8  
15 Four +        Informed                 1430  48.9  
16 Four +        Well informed             998  34.1  
Show code
#create plot for language_poverty_causes

biplot(language_poverty_causes, "informed_poverty_causes", "How informed students feel on causes of poverty\nby number of languages they speak")
Show code
#create new object to combine responses, and calculate count and percent

language_poverty_causes_2 <- pisa_tidy %>%
  mutate(informed_poverty_causes = recode(informed_poverty_causes,
    `Not informed` = "Not informed",
    `Not well informed` = "Not informed",
    `Informed` = "Informed",
    `Well informed` = "Informed")) %>%
  group_by(language_self, informed_poverty_causes) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_poverty_causes, levels = c("Not informed", "Informed")))

language_poverty_causes_2
# A tibble: 8 x 4
# Groups:   language_self [4]
  language_self informed_poverty_causes count percent
  <chr>         <chr>                   <int>   <dbl>
1 One           Not informed             1002    25.1
2 One           Informed                 2994    74.9
3 Two           Not informed             2219    20.7
4 Two           Informed                 8513    79.3
5 Three         Not informed             1904    18.4
6 Three         Informed                 8466    81.6
7 Four +        Not informed              496    17.0
8 Four +        Informed                 2428    83.0
Show code
#create plot for language_poverty_causes_2

biplot(language_poverty_causes_2, "informed_poverty_causes", "How informed students feel on causes of poverty\nby number of languages they speak")

How informed students feel on gender equality by number of languages they speak

Show code
#create new object to calculate count and percent for language_self & informed_gender_equality

language_gender_equality <- select(pisa_tidy, "language_self", "informed_gender_equality") %>%
  group_by(language_self, informed_gender_equality) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_gender_equality,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_gender_equality
# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self informed_gender_equality count percent
   <chr>         <chr>                    <int>   <dbl>
 1 One           Not informed               123   3.08 
 2 One           Not well informed          393   9.83 
 3 One           Informed                  1846  46.2  
 4 One           Well informed             1634  40.9  
 5 Two           Not informed                80   0.745
 6 Two           Not well informed          634   5.91 
 7 Two           Informed                  4734  44.1  
 8 Two           Well informed             5284  49.2  
 9 Three         Not informed                90   0.868
10 Three         Not well informed          490   4.73 
11 Three         Informed                  4004  38.6  
12 Three         Well informed             5786  55.8  
13 Four +        Not informed                72   2.46 
14 Four +        Not well informed          154   5.27 
15 Four +        Informed                   966  33.0  
16 Four +        Well informed             1732  59.2  
Show code
#create plot for language_gender_equality

biplot(language_gender_equality, "informed_gender_equality", "How informed students feel on gender equality\nby number of languages they speak")
Show code
#create new object to combine responses, and calculate count and percent

language_gender_equality_2 <- pisa_tidy %>%
  mutate(informed_gender_equality = recode(informed_gender_equality,
    `Not informed` = "Not informed",
    `Not well informed` = "Not informed",
    `Informed` = "Informed",
    `Well informed` = "Informed")) %>%
  group_by(language_self, informed_gender_equality) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(informed_gender_equality, levels = c("Not informed", "Informed")))

language_gender_equality_2
# A tibble: 8 x 4
# Groups:   language_self [4]
  language_self informed_gender_equality count percent
  <chr>         <chr>                    <int>   <dbl>
1 One           Not informed               516   12.9 
2 One           Informed                  3480   87.1 
3 Two           Not informed               714    6.65
4 Two           Informed                 10018   93.3 
5 Three         Not informed               580    5.59
6 Three         Informed                  9790   94.4 
7 Four +        Not informed               226    7.73
8 Four +        Informed                  2698   92.3 
Show code
#create plot for language_gender_equality_2

biplot(language_gender_equality_2, "informed_gender_equality", "How informed students feel on gender equality\nby number of languages they speak")

In the initial plots with four “informed” values, it is seen in all instances that the more languages a student speaks, the more well informed the student feels on the given topic. Looking a bit closer, another trend stands out among students who responded that they feel “Not well informed.” Across all variables, students who speak four or more languages did not follow the downward trend consistently observed between one, two, and three languages. This made me curious about what would be revealed more easily if the four informed responses were collapsed to just two, “Not informed” and “Informed.” In these new visualizations, a slight overall drop is observed in how informed students speaking four or more languages feel on three of the seven topics: climate change, world hunger, and gender equality.

Visualizations using mode

To wrap up the visualizations for this analysis, I looked at how informed students feel as a whole across all seven “informed” variables by the number of languages they speak. I worked very hard to compute the mode using just R, but in the end I was unable to determine how to rework the function to return all modes when more than one exists. What it does return is the first mode found for each row; this is problematic because it skews the data toward whichever tied response appears first.

Show code
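#create function to find mode
#note: which.max() returns only the first maximum, so when several values tie
#for most frequent, only the one that appears first in v is returned

getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}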

#create new object with mode

pisa_informed_mode <- pisa_tidy %>%
  rowwise() %>%
  mutate(informed_mode = getmode(c_across(starts_with("informed_")))) %>%
  select(c(language_self, informed_mode))

pisa_informed_mode
# A tibble: 28,022 x 2
# Rowwise: 
   language_self informed_mode    
   <chr>         <chr>            
 1 Three         Well informed    
 2 Three         Informed         
 3 Three         Informed         
 4 Two           Not well informed
 5 Four +        Well informed    
 6 Two           Not well informed
 7 Three         Informed         
 8 Three         Informed         
 9 Four +        Informed         
10 Three         Informed         
# ... with 28,012 more rows
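
For reference, one possible way to return every tied mode in base R is sketched below. The function name get_all_modes and the example responses are hypothetical and are not part of the original analysis; the sketch only assumes that a row's responses can be passed in as a character vector.

```r
# Hypothetical helper (not part of the original script): returns every value
# tied for the highest frequency instead of only the first one.
get_all_modes <- function(v) {
  v <- v[!is.na(v)]                     # ignore missing responses
  if (length(v) == 0) return(v)         # nothing to count
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))   # frequency of each unique value
  uniqv[counts == max(counts)]          # keep all values tied for the maximum
}

# Made-up responses for one student across seven topics
responses <- c("Informed", "Well informed", "Informed", "Well informed",
               "Not informed", "Informed", "Well informed")

modes <- get_all_modes(responses)
modes

# A row has a clear tendency only when exactly one mode is returned
has_single_mode <- length(modes) == 1
has_single_mode
```

If a helper like this were used inside rowwise() and mutate(), the result would likely need to be wrapped in list(), since a row can return more than one value.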

Although not ideal, as a workaround for creating visualizations with the mode, I used Excel to solve for all modes using the extract I had previously written out from the PISA dataset. I then read this in and removed all rows with two or more modes so as to include only students with a clear tendency to respond in one way. This eliminated 2,032 students from the original 28,022 students used in the previous visualizations.

Show code
#read in Excel file with modes

pisa_mode <- read_xlsx("pisa_mode_2022-4-8.xlsx") %>%
  select(c(language_self, mode, mode_2, mode_3))

pisa_mode
# A tibble: 28,022 x 4
   language_self mode              mode_2            mode_3  
   <chr>         <chr>             <chr>             <chr>   
 1 Three         Well informed     <NA>              <NA>    
 2 Three         Informed          <NA>              <NA>    
 3 Three         Informed          <NA>              <NA>    
 4 Four +        Not informed      Not well informed Informed
 5 Four +        Well informed     Informed          <NA>    
 6 Two           Not well informed <NA>              <NA>    
 7 Three         Informed          <NA>              <NA>    
 8 Three         Informed          <NA>              <NA>    
 9 Four +        Informed          <NA>              <NA>    
10 Three         Informed          <NA>              <NA>    
# ... with 28,012 more rows
Show code
#remove observations that have more than one mode; 2032 observations were removed

pisa_tidy_mode <- pisa_mode %>%
  filter(is.na(mode_2)) %>%
  select(c(language_self, mode))

pisa_tidy_mode
# A tibble: 25,990 x 2
   language_self mode             
   <chr>         <chr>            
 1 Three         Well informed    
 2 Three         Informed         
 3 Three         Informed         
 4 Two           Not well informed
 5 Three         Informed         
 6 Three         Informed         
 7 Four +        Informed         
 8 Three         Informed         
 9 Two           Informed         
10 Three         Informed         
# ... with 25,980 more rows
Show code
#calculate count and percent

language_informed_mode <- select(pisa_tidy_mode, "language_self", "mode") %>%
  group_by(language_self, mode) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(mode,
                 levels = c("Not informed", "Not well informed", "Informed", "Well informed")))

language_informed_mode
# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self mode              count percent
   <chr>         <chr>             <int>   <dbl>
 1 One           Not informed         98   2.66 
 2 One           Not well informed   679  18.5  
 3 One           Informed           2275  61.8  
 4 One           Well informed       628  17.1  
 5 Two           Not informed         67   0.673
 6 Two           Not well informed  1220  12.3  
 7 Two           Informed           6729  67.6  
 8 Two           Well informed      1942  19.5  
 9 Three         Not informed         69   0.715
10 Three         Not well informed  1046  10.8  
11 Three         Informed           6173  64.0  
12 Three         Well informed      2364  24.5  
13 Four +        Not informed         53   1.96 
14 Four +        Not well informed   246   9.11 
15 Four +        Informed           1466  54.3  
16 Four +        Well informed       935  34.6  
Show code
#create plot

biplot(language_informed_mode, "mode", "How informed students feel on various topics\nby number of languages they speak")
Show code
#create new object to combine responses, and calculate count and percent

language_informed_mode_2 <- pisa_tidy_mode %>%
  mutate(mode = recode(mode,
    `Not informed` = "Not informed",
    `Not well informed` = "Not informed",
    `Informed` = "Informed",
    `Well informed` = "Informed")) %>%
  group_by(language_self, mode) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(mode, levels = c("Not informed", "Informed")))

language_informed_mode_2
# A tibble: 8 x 4
# Groups:   language_self [4]
  language_self mode         count percent
  <chr>         <chr>        <int>   <dbl>
1 One           Not informed   777    21.1
2 One           Informed      2903    78.9
3 Two           Not informed  1287    12.9
4 Two           Informed      8671    87.1
5 Three         Not informed  1115    11.6
6 Three         Informed      8537    88.4
7 Four +        Not informed   299    11.1
8 Four +        Informed      2401    88.9
Show code
#create plot

biplot(language_informed_mode_2, "mode", "How informed students feel on various topics\nby number of languages they speak")

In these visualizations, it becomes clear that students feel more well informed across the board as the number of languages spoken increases. The biggest jump for “Well informed” occurs between three languages and four or more. When “Informed” and “Well informed” are combined into one value, the biggest jump in those who feel “Informed” occurs between one and two languages. With these combined values, the general level of how informed a student feels still increases with the number of languages spoken, although the difference between three and four or more languages is much less pronounced than when “Well informed” is viewed separately from “Informed.”

Reflection

This assignment was unique in that it felt a bit more like “on the job training” than learning how to use R first and then applying it to an assignment. I appreciated this approach as it forced me to learn quickly, yet there are things I wish I had done differently or had known a little more about before getting started.

The primary challenge I came up against was the size of my chosen dataset and the capabilities of my computer. The PISA datasets are available as SAS files, so, not having a way to first examine the data, I jumped straight to reading the 2018 student data into RStudio. In doing so, I received an error that the file was too large, but after some troubleshooting I found that switching from 32-bit to 64-bit resolved the issue. It was then I realized just how large the dataset was, with 1,119 variables and 612,004 observations from 80 countries. This felt like way too much data to examine in a small R window, so I wrote out a csv to examine the data in Excel. It took a couple of hours to examine all of the variables and narrow down which ones I was interested in analyzing. I eventually landed on analyzing whether exposure to other cultures/languages (seven variables were chosen to represent this) increased the likelihood that a student feels better informed on seven different topics, and then comparing responses between the United States and Spain.

Tidying the data was the next step in the process, and with that came dropping NAs. This turned out to be something that drastically changed my initial research goals, as students in the United States did not answer the “informed” topics I had chosen as variables! By this point I had already put a significant amount of time into this project, so I decided the best course of action was to adjust my research question. This was disappointing as my main goal was to observe differences between countries, and it also made me feel uneasy as I’d been taught in undergrad that you stick with your research question. I’ll come back to this point momentarily.

When I’d finished tidying my data, it was time to knit and submit Homework Three. However, my “fix” for dealing with such a large dataset was only temporary, as my file was too large to knit. I tried multiple options found through internet searches to make this work and reached out on the class Slack channel for input. However, the suggestions I tried did not work and some even made my computer crash. There were a lot of tears at this point thinking I would need to start from scratch after so many hours of work, so I decided to take a few days off from the assignment so I could think clearly about how to move forward. This proved to be fruitful, and is something I’ve filed away for future projects. It’s okay to step away to gain clarity and reassess next steps! It was during this time I realized I had already written out the csv of the whole dataset, so a workaround would be to eliminate the variables/observations I wouldn’t be using in my analysis, and then read this back into R to create a smaller file. As evident in my script above, this is the approach I went with.

Next up was learning to use ggplot to create visualizations. Since all variables used in this analysis are characters, it took me a while to figure out that counts, percentages, etc. must be calculated before creating plots. What eventually aided my understanding was reading the sections of Data Visualizations with R on univariate and bivariate plots, which break down how to use them with both categorical and quantitative variables.

Once I was successful in creating a plot, I realized my research question was way too broad, at least for this assignment. To explore my question, I would have needed to create at least 49 bivariate plots to analyze each of the seven variables against another seven. I know now that functions would have made this doable, but at the time I had no idea how to go about creating one. I also didn’t want to get so caught up in trying to answer my research question that I didn’t have a chance to focus on the process of learning R. This led to another modification of my research question and brought it to its final form.

I did eventually learn how to write functions and this was a game changer! The function for the univariate plots came together pretty quickly, but it took a few weeks to create one for the bivariate plots. The missing piece of the puzzle came from the article Programming with dplyr. It explained the difference between data-variables and env-variables, and that when an env-variable is a character vector holding a column name, you need to index into the .data pronoun with double brackets, as in .data[[var]] (Wickham, François, Henry, & Müller, n.d.). Being able to use this function was incredibly helpful, as I had decided to switch all of my bivariate plots back to the stacked bar chart I had used originally, before adding in facet_grid for Homework Five. I also wanted to make some styling changes to make the visualizations easier to read. Without this function I would have needed to adjust 16 plots manually, which would have been very time consuming and prone to errors.
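
As an illustration of the .data[[var]] pattern described above, the sketch below wraps the repeated count-and-percent step in a function. The function name count_percent and the toy data are hypothetical and are not part of the original script; the column names follow those used earlier in this analysis.

```r
library(dplyr)

# Hypothetical helper: counts and within-language percents for any column
# whose name is passed in as a string (an env-variable holding a column name).
count_percent <- function(data, var) {
  data %>%
    group_by(language_self, .data[[var]]) %>%          # index into .data with [[ ]]
    summarise(count = n(), .groups = "drop_last") %>%  # stay grouped by language_self
    mutate(percent = count / sum(count) * 100) %>%     # percent within each language group
    ungroup()
}

# Made-up data standing in for pisa_tidy
toy <- tibble(
  language_self = c("One", "One", "Two", "Two", "Two"),
  informed_world_hunger = c("Informed", "Not informed",
                            "Informed", "Informed", "Not informed")
)

result <- count_percent(toy, "informed_world_hunger")
result
```

Because the column name is passed as a string, the same call could be repeated for each of the seven “informed” variables without copying the pipeline.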

Towards the end of working on this analysis, I read Chapter 7: Exploratory Data Analysis in R for Data Science. I wish this chapter had been recommended early on in the class! It helped me understand this assignment was much different than writing a traditional research paper where you develop a hypothesis and then seek out data and/or conduct a study to see if it holds true. I especially appreciated the quote included in this chapter by John Tukey which says, “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” Knowing this sooner would have eliminated a lot of angst as I struggled with feeling I was doing something wrong by modifying the question I had originally set out to explore.

There is still much I need to learn to feel comfortable and confident using R. I would have valued feedback on my work throughout the course to know if I was on the right track and implementing best practices. This was in part made difficult by not being able to attend the synchronous classes, yet I’m incredibly grateful for having the option to participate asynchronously and I’d choose it again even knowing the downsides of participating this way. I also wish I had had the opportunity to discuss challenges I ran into, like how to change my function to solve for mode, with a tutor, so I could have increased my understanding rather than feeling like I had hit a wall and was limited in my ability to grow my skills. Challenges and all, it’s encouraging to look back to the beginning of the semester, when I had zero knowledge of R, and realize how far I’ve come. I know one day soon I’ll be able to reflect back on the end of my first semester in DACSS and celebrate my continued forward progress.

Conclusion

While my analysis does show a positive relationship between the number of languages spoken and how well informed students feel on these seven topics, it does not establish whether this relationship is statistically significant. And even if it did, that would not be enough to conclude that students who speak more languages are innately better informed. There are many other variables which could be at play such as:

Had I been able to explore things further, I would hypothesize that students attending private schools in cities speak more languages, and that higher family income and/or parents’ level of education, rather than the number of languages the student speaks, could be what primarily impacts how informed a student is on certain topics. Data on these variables may have been available among the many questionnaires that make up the full 2018 PISA dataset; however, my limited computer memory prevented me from joining data and embarking on a richer exploratory analysis.

It is also important to note that students are responding to how informed they feel on a topic, and no quantifiable evidence is provided to demonstrate if a student is as informed or uninformed as they believe themselves to be. An exam would need to be provided to determine if a student’s perception of knowledge and their actual knowledge align.
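
The question of significance raised above could be approached, as a first step, with a chi-squared test of independence. The sketch below is not part of the original analysis. It uses the counts from the collapsed mode table earlier in this post and assumes a simple unweighted test, which ignores PISA's sampling design and survey weights, so any result should be read as a rough indication only.

```r
# Counts copied from the collapsed mode table (language_informed_mode_2)
# Rows: number of languages spoken; columns: collapsed response
counts <- matrix(
  c( 777, 2903,
    1287, 8671,
    1115, 8537,
     299, 2401),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    language_self = c("One", "Two", "Three", "Four +"),
    mode = c("Not informed", "Informed")
  )
)

# Unweighted chi-squared test of independence (an assumption of this sketch)
test <- chisq.test(counts)
test$statistic   # chi-squared statistic
test$parameter   # degrees of freedom
test$p.value     # p-value
```

Even a small p-value here would only indicate an association; it would not address the other variables discussed above.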

In conclusion, this analysis of whether there is a relationship between the number of languages a student in Spain speaks and how well informed they feel on climate change, global health, migration, international conflict, world hunger, causes of poverty, and gender inequality doesn’t tell us much on its own. A relationship clearly exists, but the analysis ends with many more questions than answers. It did provide a straightforward question to use throughout this process of learning how to conduct data analysis in R, though, meaning it served its purpose for this assignment well.

Bibliography

Kabacoff, R. (2020). Data visualizations with R. Quantitative Analysis Center, Wesleyan University. https://rkabacoff.github.io/datavis/index.html

Kroll, J.F. & Dussias, P.E. (2017). The benefits of multilingualism to the personal and professional development of residents of the US. Foreign Language Annals, 50(2), 248-259. https://doi.org/10.1111/flan.12271

Luna, M.Z. (2020). Languages in Spain: How many languages are spoken in Spain. Homeschool Spanish Academy. https://www.spanish.academy/blog/languages-in-spain-how-many-languages-are-spoken-in-spain/

OECD (n.d.). What is PISA? PISA: Programme for International Student Assessment. https://www.oecd.org/pisa/

Programme for International Student Assessment (2020). Student questionnaire data files (PISA 2018 Database) [Dataset and codebook]. Organisation for Economic Co-operation and Development. https://www.oecd.org/pisa/data/2018database/

RStudio Team (2022). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA, http://www.rstudio.com/.

Wickham, H. & Bryan, J. (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl

Wickham, H., François, R., Henry, L., & Müller, K. (n.d.). Programming with dplyr. dplyr. https://dplyr.tidyverse.org/articles/programming.html

Wickham, H. & Grolemund, G. (n.d.). R for data science [eBook edition]. O’Reilly. https://r4ds.had.co.nz/index.html

Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Collazo (2022, May 4). Data Analytics and Computational Social Science: DACSS 601 Final Project. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomlcollazodacss601final/

BibTeX citation

@misc{collazo2022dacss,
  author = {Collazo, Laura},
  title = {Data Analytics and Computational Social Science: DACSS 601 Final Project},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomlcollazodacss601final/},
  year = {2022}
}