This research analysis explores whether there is a relationship between the number of languages a student in Spain speaks and how well informed they feel about climate change, global health, migration, international conflict, world hunger, causes of poverty, and gender inequality.
A year ago my family made our third international move and my kids, ages 12 and 14, began learning a third language. This reality, coupled with my general research interests, often leads me to consider the impact that exposure to different languages, and therefore cultures, has on an individual. In most of the world multilingualism is commonplace, yet in the United States the idea of speaking a language other than English is not embraced. Sadly, this is often rooted in misconceptions around immigration and cultural diversity (Kroll and Dussias, 2017). The prevalence of this misinformation leads me to wonder whether speaking more languages aids in knowledge of important topics, such as immigration, and whether Americans therefore find themselves less informed in part as a result of their monolingualism.
In this analysis I explore whether there is a relationship between the number of languages a student in Spain speaks and how well informed they feel about climate change, global health, migration, international conflict, world hunger, causes of poverty, and gender inequality. This does not quite line up with what I hoped to explore, as my original goal was to compare responses between students in Spain and the United States. The reasons for this change are explained in my reflection towards the end of this analysis.
An important piece of information to keep in mind when observing the data in this analysis is that, apart from Spanish (more properly called Castellano, to separate the language from dialects spoken in Central and South America), many regional languages exist in Spain and are taught in local schools. The primary regional languages are Catalan, spoken in Catalonia and the Balearic Islands; Galician, spoken in Galicia; Euskara, spoken in the Basque Country and parts of Navarre; and Valencian, spoken in Valencia. Two less common languages are Aranese, spoken in the northeastern part of Spain, and Extremaduran, spoken in the western region of Extremadura. There are also some endangered minority languages in Spain, including Aragonese, Asturian and Leonese (Luna, 2017). It is therefore not uncommon for students in Spain to be able to speak at least two languages: Castellano and their regional language.
The data for this analysis come from the Organisation for Economic Co-operation and Development’s 2018 Programme for International Student Assessment (PISA), which “measures 15-year-olds’ ability to use their reading, mathematics and science knowledge and skills to meet real-life challenges” (OECD, n.d.). The complete 2018 dataset is made up of several questionnaires given to students, teachers, principals and parents; this analysis uses data from the student questionnaire. It alone is a large dataset containing 1,119 variables and 612,004 observations from 80 countries. I relied heavily on the codebook for this dataset to understand the raw data and to rename variables and values.
The 2018 PISA student dataset is a very large SAS file, and my computer did not have enough memory to work with it directly. As a workaround, I wrote out a csv containing a limited number of variables and then read it back into RStudio. Further discussion of this can be found in the Reflection section.
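#load packages (haven for read_sas; tidyverse for dplyr, readr, ggplot2, forcats)
library(haven)
library(tidyverse)
#read in SAS file
pisa <- read_sas("cy07_msu_stu_qqq.sas7bdat", "CY07MSU_FMT_STU_QQQ.SAS7BCAT", encoding = NULL, .name_repair = "unique")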
#determine how many countries are in dataset
unique(pisa[c("CNT")])
# select desired variables and filter country for Spain
pisa_smaller <- pisa %>%
select(c(CNT,ST001D01T,ST004D01T,ST197Q01HA,ST197Q02HA,ST197Q04HA,ST197Q07HA,ST197Q08HA,ST197Q09HA,ST197Q12HA,
ST220Q01HA,ST220Q02HA,ST220Q03HA,ST220Q04HA,ST177Q01HA,ST019AQ01T,ST021Q01TA)) %>%
filter(CNT == "ESP")
#write csv
write_csv(pisa_smaller, "pisa_smaller_2022-2-20.csv")
To examine my research question, the dataset was filtered to include responses from students living in Spain along with eight character variables: how many languages students speak well enough to converse with others (language_self) and how informed the student feels on each of seven topics (climate change, global health, migration, international conflict, world hunger, causes of poverty, and gender equality).
After filtering for students living in Spain and removing observations with NAs in the selected variables (7,921 observations were removed), the final number of observations used for this analysis totals 28,022 students.
#read in csv
pisa <- read_csv("pisa_smaller_2022-2-20.csv")
pisa
# A tibble: 35,943 x 17
CNT ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ESP 10 2 4 4 4
2 ESP 9 1 3 2 3
3 ESP 10 2 4 3 3
4 ESP 8 2 2 1 3
5 ESP 10 1 NA NA NA
6 ESP 10 1 4 2 3
7 ESP 9 1 NA NA NA
8 ESP 9 2 3 2 2
9 ESP 9 2 NA NA NA
10 ESP 10 2 3 3 3
# ... with 35,933 more rows, and 11 more variables: ST197Q07HA <dbl>,
# ST197Q08HA <dbl>, ST197Q09HA <dbl>, ST197Q12HA <dbl>,
# ST220Q01HA <dbl>, ST220Q02HA <dbl>, ST220Q03HA <dbl>,
# ST220Q04HA <dbl>, ST177Q01HA <dbl>, ST019AQ01T <dbl>,
# ST021Q01TA <dbl>
#remove additional variables
pisa_tidy <- pisa %>%
select(-c("ST001D01T", "ST004D01T", "ST220Q01HA", "ST220Q02HA", "ST220Q03HA", "ST220Q04HA", "ST019AQ01T", "ST021Q01TA")) %>%
#rename variables
rename(country=CNT,
informed_climate_change=ST197Q01HA,
informed_global_health=ST197Q02HA,
informed_migration=ST197Q04HA,
informed_international_conflict=ST197Q07HA,
informed_world_hunger=ST197Q08HA,
informed_poverty_causes=ST197Q09HA,
informed_gender_equality=ST197Q12HA,
language_self=ST177Q01HA) %>%
#remove NAs
drop_na() %>%
#recode values
mutate(country = recode(country, ESP = "Spain")) %>%
mutate(informed_climate_change = recode(informed_climate_change,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_global_health = recode(informed_global_health,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_migration = recode(informed_migration,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_international_conflict = recode(informed_international_conflict,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_world_hunger = recode(informed_world_hunger,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_poverty_causes = recode(informed_poverty_causes,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_gender_equality = recode(informed_gender_equality,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(language_self = recode(language_self,
`1` = "One",
`2` = "Two",
`3` = "Three",
`4` = "Four +"))
#examine
pisa_tidy
# A tibble: 28,022 x 9
country informed_climate_change informed_global_h~ informed_migrat~
<chr> <chr> <chr> <chr>
1 Spain Well informed Well informed Well informed
2 Spain Informed Not well informed Informed
3 Spain Well informed Informed Informed
4 Spain Not well informed Not informed Informed
5 Spain Well informed Not well informed Informed
6 Spain Informed Not well informed Not well inform~
7 Spain Informed Informed Informed
8 Spain Not well informed Not well informed Informed
9 Spain Not well informed Informed Informed
10 Spain Informed Informed Informed
# ... with 28,012 more rows, and 5 more variables:
# informed_international_conflict <chr>,
# informed_world_hunger <chr>, informed_poverty_causes <chr>,
# informed_gender_equality <chr>, language_self <chr>
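The seven “informed” variables above all share the same four-value mapping, so the repeated recoding could be condensed into a single step. The following is a minimal sketch on a toy data frame, not the code used for this analysis: `toy` and `informed_labels` are hypothetical names, the sketch swaps `recode()` for a lookup vector indexed by the numeric code, and it assumes dplyr 1.0 or later for `across()`.

```r
library(dplyr)

# Hypothetical toy data standing in for the raw PISA extract (codes 1-4)
toy <- data.frame(
  informed_climate_change = c(1, 4, 3),
  informed_migration = c(2, 3, 4)
)

# Hypothetical lookup vector: position 1-4 matches the questionnaire code
informed_labels <- c("Not informed", "Not well informed", "Informed", "Well informed")

# Apply the same mapping to every column whose name starts with "informed_"
toy_recoded <- toy %>%
  mutate(across(starts_with("informed_"), ~ informed_labels[.x]))

toy_recoded
```

Because the mapping lives in one place, a change to a label would only need to be made once.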
To gain an initial understanding of all variables, count and percent were calculated and univariate plots created. Functions were created to view these statistics of “informed” variables and to plot how informed a student feels on a given topic. The question asked of participants has been included for each variable.
Examining the variables shows that 85.7 percent of students speak two or more languages well enough to converse with someone else: 38.3 percent speak two, 37 percent speak three and 10.4 percent speak four or more. Across the seven topics students were asked about, “Informed” is the most common response for every topic except gender equality, where slightly more than half of students (51.5 percent) responded that they feel well informed. The table below provides a summary of how students responded by percent.
Topic | Not informed | Not well informed | Informed | Well informed |
---|---|---|---|---|
Climate change | 1.93% | 14.3% | 59.3% | 24.5% |
Global health | 1.67% | 25.9% | 58.2% | 14.2% |
Migration | 1.61% | 19.7% | 59.2% | 19.5% |
International conflict | 2.62% | 29.8% | 49.1% | 18.5% |
World hunger | 1.23% | 15.5% | 58.7% | 24.6% |
Causes of poverty | 1.4% | 18.7% | 54.8% | 25.1% |
Gender equality | 1.3% | 5.96% | 41.2% | 51.5% |
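The helper `uni_summary_stats()` that produces these counts and percentages is called later in this write-up, but its definition is not shown. A minimal sketch of what such a helper might look like follows; the body is an assumption reconstructed from the output columns (`count`, `percent`), and `toy` is hypothetical data standing in for `pisa_tidy`.

```r
library(dplyr)

# Hypothetical reconstruction: count and percent for one column of a data frame
uni_summary_stats <- function(mydata, mycol) {
  mydata %>%
    count({{ mycol }}, name = "count") %>%       # one row per distinct value
    mutate(percent = count / sum(count) * 100)   # share of all observations
}

# Hypothetical toy data standing in for pisa_tidy
toy <- data.frame(
  informed_topic = c("Informed", "Informed", "Well informed", "Not informed")
)

toy_stats <- uni_summary_stats(toy, informed_topic)
toy_stats
```

Note that `count()` sorts character values alphabetically, so an additional `arrange()` on a factor with the desired level order would be needed to list the responses from “Not informed” to “Well informed”.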
#create function
uniplot<- function(mycol, myxlab, mytitle) {
ggplot(pisa_tidy, aes(x = fct_relevel({{mycol}}, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3",color = "black") +
labs(x = myxlab,
y = "Count",
title = mytitle,
subtitle = "Spain, 2018",
caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
}
# A tibble: 4 x 3
language_self count percent
<chr> <int> <dbl>
1 One 3996 14.3
2 Two 10732 38.3
3 Three 10370 37.0
4 Four + 2924 10.4
#plot for language_self
ggplot(pisa_tidy, aes(x = fct_relevel(language_self, "One", "Two", "Three", "Four +"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "Languages Spoken",
y = "Count",
title = "Number of languages students speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text( size = 9),
axis.text.y = element_text( size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate count and percent for informed_climate_change
uni_summary_stats(pisa_tidy, informed_climate_change)
# A tibble: 4 x 3
informed_climate_change count percent
<chr> <int> <dbl>
1 Not informed 541 1.93
2 Not well informed 4000 14.3
3 Informed 16623 59.3
4 Well informed 6858 24.5
#plot for informed_climate_change
uniplot(informed_climate_change, "Climate Change", "How informed students feel on climate change")
#calculate count and percent for informed_global_health
uni_summary_stats(pisa_tidy, informed_global_health)
# A tibble: 4 x 3
informed_global_health count percent
<chr> <int> <dbl>
1 Not informed 467 1.67
2 Not well informed 7249 25.9
3 Informed 16315 58.2
4 Well informed 3991 14.2
#plot for informed_global_health
uniplot(informed_global_health, "Global Health", "How informed students feel on global health")
#calculate count and percent for informed_migration
uni_summary_stats(pisa_tidy, informed_migration)
# A tibble: 4 x 3
informed_migration count percent
<chr> <int> <dbl>
1 Not informed 450 1.61
2 Not well informed 5532 19.7
3 Informed 16583 59.2
4 Well informed 5457 19.5
#plot for informed_migration
uniplot(informed_migration, "Migration", "How informed students feel on migration")
#calculate count and percent for informed_international_conflict
uni_summary_stats(pisa_tidy, informed_international_conflict)
# A tibble: 4 x 3
informed_international_conflict count percent
<chr> <int> <dbl>
1 Not informed 733 2.62
2 Not well informed 8349 29.8
3 Informed 13758 49.1
4 Well informed 5182 18.5
#plot for informed_international_conflict
uniplot(informed_international_conflict, "International Conflict", "How informed students feel on international conflicts")
#calculate count and percent for informed_world_hunger
uni_summary_stats(pisa_tidy, informed_world_hunger)
# A tibble: 4 x 3
informed_world_hunger count percent
<chr> <int> <dbl>
1 Not informed 345 1.23
2 Not well informed 4331 15.5
3 Informed 16459 58.7
4 Well informed 6887 24.6
#plot for informed_world_hunger
uniplot(informed_world_hunger, "World Hunger", "How informed students feel on world hunger")
#calculate count and percent for informed_poverty_causes
uni_summary_stats(pisa_tidy, informed_poverty_causes)
# A tibble: 4 x 3
informed_poverty_causes count percent
<chr> <int> <dbl>
1 Not informed 391 1.40
2 Not well informed 5230 18.7
3 Informed 15367 54.8
4 Well informed 7034 25.1
#plot for informed_poverty_causes
uniplot(informed_poverty_causes, "Poverty Causes", "How informed students feel on causes of poverty")
#calculate count and percent for informed_gender_equality
uni_summary_stats(pisa_tidy, informed_gender_equality)
# A tibble: 4 x 3
informed_gender_equality count percent
<chr> <int> <dbl>
1 Not informed 365 1.30
2 Not well informed 1671 5.96
3 Informed 11550 41.2
4 Well informed 14436 51.5
#plot for informed_gender_equality
uniplot(informed_gender_equality, "Gender Equality", "How informed students feel on gender equality")
To explore my research question, seven bivariate plots were initially created using the variable language_self and each of the seven variables that ask how informed students feel about a certain topic. A second round of seven visualizations was then created, which combined students who responded “Not informed” and “Not well informed” into one value, “Not informed”, and students who responded “Informed” and “Well informed” into one value, “Informed.” A function, biplot, was created to aid in the creation of these plots.
#create function
biplot<-function(mydata, myfillvar, mytitle) {
ggplot(mydata,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
y = percent,
fill = factor(.data[[myfillvar]],
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", position = "fill") +
labs(y = "Frequency",
x = "Languages Spoken",
title = mytitle,
subtitle = "Spain, 2018",
caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0),
legend.title = element_blank()) +
scale_y_continuous(breaks = seq(0, 1, by = .1))
}
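The count-and-percent step that precedes each plot follows the same pattern for every topic: count each combination of language group and response, then divide by the total for that language group. A minimal sketch of how that step could be wrapped in one helper is given here; `bi_summary_stats` and `toy` are hypothetical names, not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper: count and within-language-group percent for one topic
bi_summary_stats <- function(mydata, mycol) {
  mydata %>%
    count(language_self, {{ mycol }}, name = "count") %>%
    group_by(language_self) %>%                    # percent is relative to each language group
    mutate(percent = count / sum(count) * 100) %>%
    ungroup()
}

# Hypothetical toy data standing in for pisa_tidy
toy <- data.frame(
  language_self = c("One", "One", "Two", "Two", "Two", "Two"),
  informed_topic = c("Informed", "Not informed", "Informed",
                     "Informed", "Informed", "Not informed")
)

toy_summary <- bi_summary_stats(toy, informed_topic)
toy_summary
```

Grouping by `language_self` explicitly makes it clear that the percentages sum to 100 within each language group, which is what the stacked bars display.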
#create new object to calculate count and percent for language_self & informed_climate_change
language_climate_change <- select(pisa_tidy, "language_self", "informed_climate_change") %>%
group_by(language_self, informed_climate_change) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_climate_change,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_climate_change
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_climate_change count percent
<chr> <chr> <int> <dbl>
1 One Not informed 171 4.28
2 One Not well informed 957 23.9
3 One Informed 2275 56.9
4 One Well informed 593 14.8
5 Two Not informed 142 1.32
6 Two Not well informed 1518 14.1
7 Two Informed 6740 62.8
8 Two Well informed 2332 21.7
9 Three Not informed 134 1.29
10 Three Not well informed 1196 11.5
11 Three Informed 6071 58.5
12 Three Well informed 2969 28.6
13 Four + Not informed 94 3.21
14 Four + Not well informed 329 11.3
15 Four + Informed 1537 52.6
16 Four + Well informed 964 33.0
#create plot for language_climate_change
biplot(language_climate_change, "informed_climate_change", "How informed students feel on climate change\nby number of languages they speak")
#create new object to combine responses, and calculate count and percent
language_climate_change_2 <- pisa_tidy%>%
mutate(informed_climate_change = recode(informed_climate_change,
`Not informed` = "Not informed",
`Not well informed` = "Not informed",
`Informed` = "Informed",
`Well informed` = "Informed")) %>%
group_by(language_self, informed_climate_change) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_climate_change,
levels = c("Not informed", "Informed")))
language_climate_change_2
# A tibble: 8 x 4
# Groups: language_self [4]
language_self informed_climate_change count percent
<chr> <chr> <int> <dbl>
1 One Not informed 1128 28.2
2 One Informed 2868 71.8
3 Two Not informed 1660 15.5
4 Two Informed 9072 84.5
5 Three Not informed 1330 12.8
6 Three Informed 9040 87.2
7 Four + Not informed 423 14.5
8 Four + Informed 2501 85.5
#create plot for language_climate_change_2
biplot(language_climate_change_2, "informed_climate_change", "How informed students feel on climate change\nby number of languages they speak")
#create new object to calculate count and percent for language_self & informed_global_health
language_global_health <- select(pisa_tidy, "language_self", "informed_global_health") %>%
group_by(language_self, informed_global_health) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_global_health,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_global_health
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_global_health count percent
<chr> <chr> <int> <dbl>
1 One Not informed 126 3.15
2 One Not well informed 1303 32.6
3 One Informed 2147 53.7
4 One Well informed 420 10.5
5 Two Not informed 139 1.30
6 Two Not well informed 2894 27.0
7 Two Informed 6361 59.3
8 Two Well informed 1338 12.5
9 Three Not informed 133 1.28
10 Three Not well informed 2450 23.6
11 Three Informed 6204 59.8
12 Three Well informed 1583 15.3
13 Four + Not informed 69 2.36
14 Four + Not well informed 602 20.6
15 Four + Informed 1603 54.8
16 Four + Well informed 650 22.2
#create plot for language_global_health
biplot(language_global_health, "informed_global_health", "How informed students feel on global health\nby number of languages they speak" )
#create new object to combine responses, and calculate count and percent
language_global_health_2 <- pisa_tidy%>%
mutate(informed_global_health = recode(informed_global_health,
`Not informed` = "Not informed",
`Not well informed` = "Not informed",
`Informed` = "Informed",
`Well informed` = "Informed")) %>%
group_by(language_self, informed_global_health) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_global_health,
levels = c("Not informed", "Informed")))
language_global_health_2
# A tibble: 8 x 4
# Groups: language_self [4]
language_self informed_global_health count percent
<chr> <chr> <int> <dbl>
1 One Not informed 1429 35.8
2 One Informed 2567 64.2
3 Two Not informed 3033 28.3
4 Two Informed 7699 71.7
5 Three Not informed 2583 24.9
6 Three Informed 7787 75.1
7 Four + Not informed 671 22.9
8 Four + Informed 2253 77.1
#create plot for language_global_health_2
biplot(language_global_health_2, "informed_global_health", "How informed students feel on global health\nby number of languages they speak" )
#create new object to calculate count and percent for language_self & informed_migration
language_migration <- select(pisa_tidy, "language_self", "informed_migration") %>%
group_by(language_self, informed_migration) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_migration,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_migration
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_migration count percent
<chr> <chr> <int> <dbl>
1 One Not informed 134 3.35
2 One Not well informed 948 23.7
3 One Informed 2301 57.6
4 One Well informed 613 15.3
5 Two Not informed 113 1.05
6 Two Not well informed 2248 20.9
7 Two Informed 6548 61.0
8 Two Well informed 1823 17.0
9 Three Not informed 122 1.18
10 Three Not well informed 1881 18.1
11 Three Informed 6196 59.7
12 Three Well informed 2171 20.9
13 Four + Not informed 81 2.77
14 Four + Not well informed 455 15.6
15 Four + Informed 1538 52.6
16 Four + Well informed 850 29.1
#create plot for language_migration
biplot(language_migration, "informed_migration","How informed students feel on migration\nby number of languages they speak")
#create new object to combine responses, and calculate count and percent
language_migration_2 <- pisa_tidy%>%
mutate(informed_migration = recode(informed_migration,
`Not informed` = "Not informed",
`Not well informed` = "Not informed",
`Informed` = "Informed",
`Well informed` = "Informed")) %>%
group_by(language_self, informed_migration) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_migration,
levels = c("Not informed", "Informed")))
language_migration_2
# A tibble: 8 x 4
# Groups: language_self [4]
language_self informed_migration count percent
<chr> <chr> <int> <dbl>
1 One Not informed 1082 27.1
2 One Informed 2914 72.9
3 Two Not informed 2361 22.0
4 Two Informed 8371 78.0
5 Three Not informed 2003 19.3
6 Three Informed 8367 80.7
7 Four + Not informed 536 18.3
8 Four + Informed 2388 81.7
#create plot for language_migration_2
biplot(language_migration_2, "informed_migration","How informed students feel on migration\nby number of languages they speak")
#create new object to calculate count and percent for language_self & informed_international_conflict
language_international_conflict <- select(pisa_tidy, "language_self", "informed_international_conflict") %>%
group_by(language_self, informed_international_conflict) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_international_conflict,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_international_conflict
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_international_conflict count percent
<chr> <chr> <int> <dbl>
1 One Not informed 195 4.88
2 One Not well informed 1480 37.0
3 One Informed 1791 44.8
4 One Well informed 530 13.3
5 Two Not informed 226 2.11
6 Two Not well informed 3319 30.9
7 Two Informed 5477 51.0
8 Two Well informed 1710 15.9
9 Three Not informed 222 2.14
10 Three Not well informed 2870 27.7
11 Three Informed 5157 49.7
12 Three Well informed 2121 20.5
13 Four + Not informed 90 3.08
14 Four + Not well informed 680 23.3
15 Four + Informed 1333 45.6
16 Four + Well informed 821 28.1
#create plot for language_international_conflict
biplot(language_international_conflict, "informed_international_conflict", "How informed students feel on international conflict\nby number of languages they speak")
#create new object to combine responses, and calculate count and percent
language_international_conflict_2 <- pisa_tidy%>%
mutate(informed_international_conflict = recode(informed_international_conflict,
`Not informed` = "Not informed",
`Not well informed` = "Not informed",
`Informed` = "Informed",
`Well informed` = "Informed")) %>%
group_by(language_self, informed_international_conflict) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_international_conflict,
levels = c("Not informed", "Informed")))
language_international_conflict_2
# A tibble: 8 x 4
# Groups: language_self [4]
language_self informed_international_conflict count percent
<chr> <chr> <int> <dbl>
1 One Not informed 1675 41.9
2 One Informed 2321 58.1
3 Two Not informed 3545 33.0
4 Two Informed 7187 67.0
5 Three Not informed 3092 29.8
6 Three Informed 7278 70.2
7 Four + Not informed 770 26.3
8 Four + Informed 2154 73.7
#create plot for language_international_conflict_2
biplot(language_international_conflict_2, "informed_international_conflict", "How informed students feel on international conflict\nby number of languages they speak")
#create new object to calculate count and percent for language_self & informed_world_hunger
language_world_hunger <- select(pisa_tidy, "language_self", "informed_world_hunger") %>%
group_by(language_self, informed_world_hunger) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_world_hunger,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_world_hunger
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_world_hunger count percent
<chr> <chr> <int> <dbl>
1 One Not informed 97 2.43
2 One Not well informed 780 19.5
3 One Informed 2326 58.2
4 One Well informed 793 19.8
5 Two Not informed 92 0.857
6 Two Not well informed 1717 16.0
7 Two Informed 6514 60.7
8 Two Well informed 2409 22.4
9 Three Not informed 95 0.916
10 Three Not well informed 1453 14.0
11 Three Informed 6091 58.7
12 Three Well informed 2731 26.3
13 Four + Not informed 61 2.09
14 Four + Not well informed 381 13.0
15 Four + Informed 1528 52.3
16 Four + Well informed 954 32.6
#create plot for language_world_hunger
biplot(language_world_hunger, "informed_world_hunger", "How informed students feel on world hunger\nby number of languages they speak")
#create new object to combine responses, and calculate count and percent
language_world_hunger_2 <- pisa_tidy%>%
mutate(informed_world_hunger = recode(informed_world_hunger,
`Not informed` = "Not informed",
`Not well informed` = "Not informed",
`Informed` = "Informed",
`Well informed` = "Informed")) %>%
group_by(language_self, informed_world_hunger) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_world_hunger,
levels = c("Not informed", "Informed")))
language_world_hunger_2
# A tibble: 8 x 4
# Groups: language_self [4]
language_self informed_world_hunger count percent
<chr> <chr> <int> <dbl>
1 One Not informed 877 21.9
2 One Informed 3119 78.1
3 Two Not informed 1809 16.9
4 Two Informed 8923 83.1
5 Three Not informed 1548 14.9
6 Three Informed 8822 85.1
7 Four + Not informed 442 15.1
8 Four + Informed 2482 84.9
#create plot for language_world_hunger_2
biplot(language_world_hunger_2, "informed_world_hunger", "How informed students feel on world hunger\nby number of languages they speak")
#create new object to calculate count and percent for language_self & informed_poverty_causes
language_poverty_causes <- select(pisa_tidy, "language_self", "informed_poverty_causes") %>%
group_by(language_self, informed_poverty_causes) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_poverty_causes,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_poverty_causes
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_poverty_causes count percent
<chr> <chr> <int> <dbl>
1 One Not informed 108 2.70
2 One Not well informed 894 22.4
3 One Informed 2169 54.3
4 One Well informed 825 20.6
5 Two Not informed 106 0.988
6 Two Not well informed 2113 19.7
7 Two Informed 6104 56.9
8 Two Well informed 2409 22.4
9 Three Not informed 115 1.11
10 Three Not well informed 1789 17.3
11 Three Informed 5664 54.6
12 Three Well informed 2802 27.0
13 Four + Not informed 62 2.12
14 Four + Not well informed 434 14.8
15 Four + Informed 1430 48.9
16 Four + Well informed 998 34.1
#create plot for language_poverty_causes
biplot(language_poverty_causes, "informed_poverty_causes", "How informed students feel on causes of poverty\nby number of languages they speak")
#create new object to combine responses, and calculate count and percent
language_poverty_causes_2 <- pisa_tidy%>%
mutate(informed_poverty_causes = recode(informed_poverty_causes,
`Not informed` = "Not informed",
`Not well informed` = "Not informed",
`Informed` = "Informed",
`Well informed` = "Informed")) %>%
group_by(language_self, informed_poverty_causes) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_poverty_causes,
levels = c("Not informed", "Informed")))
language_poverty_causes_2
# A tibble: 8 x 4
# Groups: language_self [4]
language_self informed_poverty_causes count percent
<chr> <chr> <int> <dbl>
1 One Not informed 1002 25.1
2 One Informed 2994 74.9
3 Two Not informed 2219 20.7
4 Two Informed 8513 79.3
5 Three Not informed 1904 18.4
6 Three Informed 8466 81.6
7 Four + Not informed 496 17.0
8 Four + Informed 2428 83.0
#create plot for language_poverty_causes_2
biplot(language_poverty_causes_2, "informed_poverty_causes", "How informed students feel on causes of poverty\nby number of languages they speak")
#create new object to calculate count and percent for language_self & informed_gender_equality
language_gender_equality <- select(pisa_tidy, "language_self", "informed_gender_equality") %>%
group_by(language_self, informed_gender_equality) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_gender_equality,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_gender_equality
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_gender_equality count percent
<chr> <chr> <int> <dbl>
1 One Not informed 123 3.08
2 One Not well informed 393 9.83
3 One Informed 1846 46.2
4 One Well informed 1634 40.9
5 Two Not informed 80 0.745
6 Two Not well informed 634 5.91
7 Two Informed 4734 44.1
8 Two Well informed 5284 49.2
9 Three Not informed 90 0.868
10 Three Not well informed 490 4.73
11 Three Informed 4004 38.6
12 Three Well informed 5786 55.8
13 Four + Not informed 72 2.46
14 Four + Not well informed 154 5.27
15 Four + Informed 966 33.0
16 Four + Well informed 1732 59.2
#create plot for language_gender_equality
biplot(language_gender_equality, "informed_gender_equality", "How informed students feel on gender equality\nby number of languages they speak")
#create new object to combine responses, and calculate count and percent
language_gender_equality_2 <- pisa_tidy%>%
mutate(informed_gender_equality = recode(informed_gender_equality,
`Not informed` = "Not informed",
`Not well informed` = "Not informed",
`Informed` = "Informed",
`Well informed` = "Informed")) %>%
group_by(language_self, informed_gender_equality) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_gender_equality,
levels = c("Not informed", "Informed")))
language_gender_equality_2
# A tibble: 8 x 4
# Groups: language_self [4]
language_self informed_gender_equality count percent
<chr> <chr> <int> <dbl>
1 One Not informed 516 12.9
2 One Informed 3480 87.1
3 Two Not informed 714 6.65
4 Two Informed 10018 93.3
5 Three Not informed 580 5.59
6 Three Informed 9790 94.4
7 Four + Not informed 226 7.73
8 Four + Informed 2698 92.3
#create plot for language_gender_equality_2
biplot(language_gender_equality_2, "informed_gender_equality", "How informed students feel on gender equality\nby number of languages they speak")
In the initial plots with four “informed” values, the share of students who feel “Well informed” rises with each additional language spoken, for every topic. Looking a bit closer, another pattern stands out among students who responded that they feel “Not informed.” Across all variables, this share falls sharply from one language to two and stays low at three, but rises again among students who speak four or more languages. This observation made me curious about what would be more easily revealed if the four informed responses were collapsed down to just two, “Not informed” and “Informed.” In these new visualizations, students speaking four or more languages feel slightly less informed than students speaking three on three of the seven topics: climate change, world hunger, and gender equality.
To wrap up visualizations for this analysis, I observed how informed students feel as a whole across all seven “informed” variables by the number of languages they speak. I worked very hard to compute mode using just R, but in the end, I was unable to determine how to rework the function to return all modes when multiple existed. What it does return is the first mode available for each row, however this is problematic as it skews the data.
#create function to find mode
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
#create new object with mode
pisa_informed_mode <- pisa_tidy %>%
  rowwise() %>%
  mutate(informed_mode = getmode(c_across(starts_with("informed_")))) %>%
  select(c(language_self, informed_mode))
pisa_informed_mode
# A tibble: 28,022 x 2
# Rowwise:
   language_self informed_mode
   <chr>         <chr>
 1 Three         Well informed
 2 Three         Informed
 3 Three         Informed
 4 Two           Not well informed
 5 Four +        Well informed
 6 Two           Not well informed
 7 Three         Informed
 8 Three         Informed
 9 Four +        Informed
10 Three         Informed
# ... with 28,012 more rows
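For reference, one possible way to return every tied mode in base R is sketched below. This is not the approach used in this analysis; get_all_modes and the sample responses are hypothetical and made up purely for illustration.

```r
# Hypothetical helper (not from the original script): return every value
# tied for the highest frequency, rather than only the first one found.
get_all_modes <- function(v) {
  v <- v[!is.na(v)]                 # ignore missing responses
  if (length(v) == 0) {
    return(character(0))            # no responses, so no mode
  }
  counts <- table(v)                # frequency of each distinct value
  names(counts)[as.vector(counts) == max(counts)]
}

# Made-up responses for one student across seven topics
responses <- c("Informed", "Well informed", "Informed", "Well informed",
               "Not well informed", "Informed", "Well informed")

# Both tied values are returned instead of only the first
get_all_modes(responses)

# A student shows a single clear tendency only when exactly one mode exists
length(get_all_modes(responses)) == 1
```

Inside the rowwise() pipeline, a check such as length(get_all_modes(c_across(starts_with("informed_")))) == 1 could then flag students with one clear mode, assuming the same column naming as in the code above.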
Although not ideal, as a workaround to create visualizations with mode, I used Excel to solve for all modes using the extract I had previously written out from the PISA dataset. I then read this in, and removed all rows with two or more modes so as to only include students with a clear tendency to respond in one way. This eliminated 2,032 students from the original 28,022 students used in the previous visualizations.
# A tibble: 28,022 x 4
   language_self mode              mode_2            mode_3
   <chr>         <chr>             <chr>             <chr>
 1 Three         Well informed     <NA>              <NA>
 2 Three         Informed          <NA>              <NA>
 3 Three         Informed          <NA>              <NA>
 4 Four +        Not informed      Not well informed Informed
 5 Four +        Well informed     Informed          <NA>
 6 Two           Not well informed <NA>              <NA>
 7 Three         Informed          <NA>              <NA>
 8 Three         Informed          <NA>              <NA>
 9 Four +        Informed          <NA>              <NA>
10 Three         Informed          <NA>              <NA>
# ... with 28,012 more rows
# A tibble: 25,990 x 2
   language_self mode
   <chr>         <chr>
 1 Three         Well informed
 2 Three         Informed
 3 Three         Informed
 4 Two           Not well informed
 5 Three         Informed
 6 Three         Informed
 7 Four +        Informed
 8 Three         Informed
 9 Two           Informed
10 Three         Informed
# ... with 25,980 more rows
#calculate count and percent
language_informed_mode <- select(pisa_tidy_mode, "language_self", "mode") %>%
  group_by(language_self, mode) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(mode, levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_informed_mode
# A tibble: 16 x 4
# Groups:   language_self [4]
   language_self mode              count percent
   <chr>         <chr>             <int>   <dbl>
 1 One           Not informed         98   2.66
 2 One           Not well informed   679  18.5
 3 One           Informed           2275  61.8
 4 One           Well informed       628  17.1
 5 Two           Not informed         67   0.673
 6 Two           Not well informed  1220  12.3
 7 Two           Informed           6729  67.6
 8 Two           Well informed      1942  19.5
 9 Three         Not informed         69   0.715
10 Three         Not well informed  1046  10.8
11 Three         Informed           6173  64.0
12 Three         Well informed      2364  24.5
13 Four +        Not informed         53   1.96
14 Four +        Not well informed   246   9.11
15 Four +        Informed           1466  54.3
16 Four +        Well informed       935  34.6
#create plot
biplot(language_informed_mode, "mode", "How informed students feel on various topics\nby number of languages they speak")
#create new object to combine responses, and calculate count and percent
language_informed_mode_2 <- pisa_tidy_mode %>%
  mutate(mode = recode(mode,
                       `Not informed` = "Not informed",
                       `Not well informed` = "Not informed",
                       `Informed` = "Informed",
                       `Well informed` = "Informed")) %>%
  group_by(language_self, mode) %>%
  summarise(count = n()) %>%
  mutate(percent = count / sum(count) * 100) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")),
          factor(mode, levels = c("Not informed", "Informed")))
language_informed_mode_2
# A tibble: 8 x 4
# Groups:   language_self [4]
  language_self mode         count percent
  <chr>         <chr>        <int>   <dbl>
1 One           Not informed   777    21.1
2 One           Informed      2903    78.9
3 Two           Not informed  1287    12.9
4 Two           Informed      8671    87.1
5 Three         Not informed  1115    11.6
6 Three         Informed      8537    88.4
7 Four +        Not informed   299    11.1
8 Four +        Informed      2401    88.9
#create plot
biplot(language_informed_mode_2, "mode", "How informed students feel on various topics\nby number of languages they speak")
In these visualizations, it becomes clear students feel better informed across the board as the number of languages spoken increases. The biggest jump for “Well informed” occurs between speaking three and four or more languages. When “Informed” and “Well informed” are combined into one value, the biggest jump in students who feel “Informed” occurs between speaking one and two languages. With these combined values, the general level of how informed a student feels still increases with the number of languages spoken, although the difference between three and four or more languages is much less pronounced than when “Well informed” is viewed separately from “Informed.”
This assignment was unique in that it felt a bit more like “on the job training” rather than learning how to use R first and then applying it to an assignment. I appreciated this approach as it forced me to learn quickly, yet there are things I wish I had done differently or had known a little bit more about before getting started.
The primary challenge I came up against was the size of my chosen dataset and the capabilities of my computer. The PISA datasets are available as SAS files, so, not having a way to first examine the data, I jumped straight to reading the 2018 student data into RStudio. In doing so, I received an error that the file was too large, but after some troubleshooting I found switching from 32-bit to 64-bit resolved the issue. It was then I realized just how large the dataset was, with 1,119 variables and 612,004 observations from 80 countries. This felt like way too much data to examine in a small R window, so I wrote out a csv to examine the data in Excel. It took a couple of hours to examine all of the variables and try to narrow down which ones I was interested in analyzing. I eventually landed on analyzing if exposure to other cultures/languages (seven variables were chosen to represent this) increased the likelihood that a student feels better informed on seven different topics, and then comparing responses between the United States and Spain.
Tidying the data was the next step in the process, and with that came dropping NAs. This turned out to be something that drastically changed my initial research goals as students in the United States did not answer the “informed” topics I had chosen as variables! By this point I had already put a significant amount of time into this project, so I decided the best course of action was to adjust my research question. This was disappointing as my main goal was to observe differences between countries, and it also made me feel uneasy as I’d been taught in undergrad you stick with your research question. I’ll come back to this point momentarily.
When I’d finished tidying my data, it was time to knit and submit Homework Three. However, my “fix” to dealing with such a large dataset was only temporary as my file was too large to knit. I tried multiple options found through internet searches to make this work and reached out on the class Slack channel for input. However, the suggestions I tried did not work and some even made my computer crash. There were a lot of tears at this point thinking I would need to start from scratch after so many hours of work, so I decided to take a few days off from the assignment so I could think clearly about how to move forward. This proved to be fruitful, and is something I’ve filed away for future projects. It’s okay to step away to gain clarity and reassess next steps! It was during this time I realized I had already written out the csv of the whole dataset, so a workaround would be to eliminate the variables/observations I wouldn’t be using in my analysis, and then read this back into R to create a smaller file. As evident in my script above, this is the approach I went with.
Next up was learning to use ggplot to create visualizations. Since all variables used in this analysis are character variables, it took me a while to figure out that counts, percentages, etc. must first be calculated before creating plots. What eventually ended up aiding my understanding was reading the sections of Data Visualizations with R on univariate and bivariate plots, which break down how to use them with both categorical and quantitative variables.
Once I was successful in creating a plot, I realized my research question was way too broad, at least for this assignment. To explore my question, I would have needed to create at least 49 bivariate plots to analyze each of the seven variables against another seven. I know now functions would have made this doable, but at the time I had no idea how to go about creating one. I also didn’t want to get so caught up in trying to answer my research question that I didn’t have a chance to focus on the process of learning R. This led to another modification of my research question and brought it to its final form.
I did eventually learn how to write functions and this was a game changer! The function for the univariate plots came together pretty quickly, but it took a few weeks to create one for the bivariate plots. What ended up being the missing piece to the puzzle came from the article Programming with dplyr. This explained the difference between data-variables and env-variables, and that env-variables which are character vectors need to be indexed with .data and double brackets to look something like .data[[var]] (Wickham, François, Henry, & Müller, n.d.). Being able to use this function was incredibly helpful as I had decided to switch all of my bivariate plots back to a stacked bar chart, as I had used originally before adding in facet_grid for Homework Five. I also wanted to make some styling changes to ease how the viewer observed the visualizations. Without this function I would have needed to manually adjust 16 plots. This would have been very time consuming and prone to errors.
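As a minimal sketch of the .data[[var]] pattern described above (not the actual plotting function from this analysis), the hypothetical helper below takes a column name as a character string and computes the percent of students in each language group who feel informed. The function name, the sample data, and the column informed_migration are assumptions for illustration.

```r
library(dplyr)

# Made-up sample data; column names mirror the style used in this analysis
sample_students <- tibble(
  language_self = c("One", "One", "Two", "Two"),
  informed_migration = c("Informed", "Not informed", "Well informed", "Informed")
)

# Hypothetical helper: `var` is an env-variable holding a column name as a
# character string, so the column is looked up with .data[[var]].
# `language_self` is a data-variable and can be written directly.
percent_informed <- function(data, var) {
  data %>%
    group_by(language_self) %>%
    summarise(
      percent_informed = mean(.data[[var]] %in% c("Informed", "Well informed")) * 100
    )
}

# The same function can be reused for any of the "informed" columns
percent_informed(sample_students, "informed_migration")
```

Because the column name is passed as a string, the same helper could be called in a loop over all seven topic columns rather than being rewritten for each one.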
Towards the end of working on this analysis, I read Chapter 7: Exploratory Data Analysis in R for Data Science. I wish this chapter had been recommended early on in the class! It helped me understand this assignment was much different than writing a traditional research paper where you develop a hypothesis and then seek out data and/or conduct a study to see if it holds true. I especially appreciated the quote included in this chapter by John Tukey which says, “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” Knowing this sooner would have eliminated a lot of angst as I struggled with feeling I was doing something wrong by modifying the question I had originally set out to explore.
There is still much I need to learn to feel comfortable and confident using R. I would have valued feedback on my work throughout the course to know if I was on the right track and implementing best practices. This was in part made difficult by not being able to attend the synchronous classes, yet I’m incredibly grateful for having the option to participate asynchronously and I’d choose it again even knowing the downsides of participating this way. I also wish I had had the opportunity to discuss challenges I ran into, like how to change my function to solve for mode, with a tutor so I could have increased my understanding rather than feeling like I reached a wall and was limited in the ability to grow my skills. Challenges and all, it’s encouraging to look back to the beginning of the semester, when I had zero knowledge of R, and realize how far I’ve come. I know one day soon I’ll be able to reflect back to the end of my first semester in DACSS and celebrate my continued forward progress.
While my analysis does show there is a positive relationship between the number of languages spoken and how well informed students feel on these seven topics, it does not establish whether this relationship is statistically significant. And even if it did, it would not be enough to conclude that students who speak more languages are innately better informed. There are many other variables which could be at play.
Had I been able to explore things further, I would hypothesize that students attending private schools in urban cities speak more languages and that higher family income and/or parents’ level of education could be what impacts how informed a student is on certain topics and not primarily the number of languages the student speaks. Data on these variables may have been available among the many questionnaires that make up the full 2018 PISA dataset; however, my limited computer memory prohibited me from joining data and embarking on a richer exploratory analysis.
It is also important to note that students are responding to how informed they feel on a topic, and no quantifiable evidence is provided to demonstrate if a student is as informed or uninformed as they believe themselves to be. An exam would need to be provided to determine if a student’s perception of knowledge and their actual knowledge align.
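The question of statistical significance is left open in this analysis. As a rough sketch of how it might be checked, a chi-squared test of independence could be run on the collapsed mode counts reported earlier. This test is an illustrative addition rather than part of the original analysis, and the object names are hypothetical. It would only indicate whether the response distribution differs across language groups, and it says nothing about the other variables that could be at play.

```r
# Counts copied from the collapsed mode table (language_informed_mode_2).
# Students with more than one mode were already excluded from these counts.
informed_counts <- matrix(
  c( 777, 2903,
    1287, 8671,
    1115, 8537,
     299, 2401),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    languages = c("One", "Two", "Three", "Four +"),
    response  = c("Not informed", "Informed")
  )
)

# Chi-squared test of independence between languages spoken and response
chi_result <- chisq.test(informed_counts)
chi_result

# Expected counts under independence, useful for seeing which groups deviate
chi_result$expected
```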
In conclusion, this analysis of whether there is a relationship between the number of languages a student in Spain speaks and how well informed they feel on climate change, global health, migration, international conflict, world hunger, causes of poverty, and gender inequality doesn’t tell us much on its own. It’s clear a relationship exists, but the analysis ends with many more questions than answers. It did provide a straightforward question to use throughout this process of learning how to conduct data analysis in R, though, meaning it served its purpose for this assignment well.
Kabacoff, R. (2020). Data visualizations with R. Quantitative Analysis Center, Wesleyan University. https://rkabacoff.github.io/datavis/index.html
Kroll, J.F. & Dussias, P.E. (2017). The benefits of multilingualism to the personal and professional development of residents of the US. Foreign Language Annals, 50(2), 248-259. https://doi.org/10.1111/flan.12271
Luna, M.Z. (2020). Languages in Spain: how many languages are Spoken in Spain. Homeschool Spanish Academy. https://www.spanish.academy/blog/languages-in-spain-how-many-languages-are-spoken-in-spain/
OECD (n.d.). What is PISA?. PISA: Programme for International Student Assessment. https://www.oecd.org/pisa/
Programme for International Student Assessment (2020). Student questionnaire data files (PISA 2018 Database) [Dataset and codebook]. Organisation for Economic Co-operation and Development. https://www.oecd.org/pisa/data/2018database/
RStudio Team (2022). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA, http://www.rstudio.com/.
Wickham, H. & Bryan, J. (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl
Wickham, H., François, R., Henry, L., & Müller, K. (n.d.). Programming with dplyr. dplyr. https://dplyr.tidyverse.org/articles/programming.html
Wickham, H. & Grolemund, G. (n.d.). R for data science [eBook edition]. O’Reilly. https://r4ds.had.co.nz/index.html
Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Collazo (2022, May 4). Data Analytics and Computational Social Science: DACSS 601 Final Project. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomlcollazodacss601final/
BibTeX citation
@misc{collazo2022dacss,
  author = {Collazo, Laura},
  title = {Data Analytics and Computational Social Science: DACSS 601 Final Project},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomlcollazodacss601final/},
  year = {2022}
}