This assignment builds on my previous homework by improving my visualizations and adding new ones that use facet_grid(). It also reflects on where I’m at in my analysis, which explores whether there is a positive correlation between the number of languages a student speaks and how well informed they feel on seven different topics.
The dataset I’ve chosen for my final project is the Programme for International Student Assessment (PISA) 2018 student data. It’s a large dataset containing 1,119 variables and 612,004 observations of 15-year-old students from 80 countries.
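In my analysis of this data, I examine how informed students in Spain feel they are on seven topics (all character variables after recoding): climate change, global health, migration, international conflict, world hunger, causes of poverty, and gender equality.
I compare these by one additional character variable: the number of languages the student speaks.
Essentially, I am curious whether there is a positive correlation between the number of languages a student speaks and how well informed the student feels on the seven topics listed above.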
The dataset originally read in was a very large SAS file. However, my computer’s memory was not sufficient to knit the file. As a workaround, I wrote out a csv containing a limited number of variables and then read this back in.
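One possible alternative to the CSV round trip is to ask haven to load only the needed columns. The sketch below assumes the installed version of haven supports the `col_select` argument of `read_sas()`; the helper name `read_pisa_subset` and the vector `wanted_vars` are illustrative and not part of the original analysis. Whether this is enough to fit within a given machine's memory is not guaranteed.

```r
# Variables used in the research question (codes taken from the select() call in this post)
wanted_vars <- c("CNT", "ST197Q01HA", "ST197Q02HA", "ST197Q04HA",
                 "ST197Q07HA", "ST197Q08HA", "ST197Q09HA", "ST197Q12HA",
                 "ST177Q01HA")

# Hypothetical helper: read only the wanted columns from the SAS file,
# so the remaining variables are not loaded into the data frame
read_pisa_subset <- function(data_file, catalog_file = NULL, vars = wanted_vars) {
  haven::read_sas(data_file, catalog_file,
                  col_select = dplyr::all_of(vars))
}

# Usage (only runs if the file is present in the working directory)
if (file.exists("cy07_msu_stu_qqq.sas7bdat")) {
  pisa_subset <- read_pisa_subset("cy07_msu_stu_qqq.sas7bdat",
                                  "CY07MSU_FMT_STU_QQQ.SAS7BCAT")
}
```

The country filter for Spain would still be applied after reading, as in the code that follows.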
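#load packages used throughout (haven for read_sas, tidyverse for dplyr/readr/ggplot2/forcats)
library(tidyverse)
library(haven)
#read in SAS file & examine data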
pisa <- read_sas("cy07_msu_stu_qqq.sas7bdat", "CY07MSU_FMT_STU_QQQ.SAS7BCAT", encoding = NULL, .name_repair = "unique")
pisa
tail(pisa)
unique(pisa[c("CNT")])
# select only desired variables and filter country for Spain
pisa_smaller <- pisa %>%
select(c(CNT,
ST001D01T,
ST004D01T,
ST197Q01HA,
ST197Q02HA,
ST197Q04HA,
ST197Q07HA,
ST197Q08HA,
ST197Q09HA,
ST197Q12HA,
ST220Q01HA,
ST220Q02HA,
ST220Q03HA,
ST220Q04HA,
ST177Q01HA,
ST019AQ01T,
ST021Q01TA)) %>%
filter(CNT == "ESP")
#check work
pisa_smaller
#write csv
write_csv(pisa_smaller, "pisa_smaller_2022-2-20.csv")
#read in csv & examine data
pisa <- read_csv("pisa_smaller_2022-2-20.csv")
pisa
# A tibble: 35,943 x 17
CNT ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ESP 10 2 4 4 4
2 ESP 9 1 3 2 3
3 ESP 10 2 4 3 3
4 ESP 8 2 2 1 3
5 ESP 10 1 NA NA NA
6 ESP 10 1 4 2 3
7 ESP 9 1 NA NA NA
8 ESP 9 2 3 2 2
9 ESP 9 2 NA NA NA
10 ESP 10 2 3 3 3
# ... with 35,933 more rows, and 11 more variables: ST197Q07HA <dbl>,
# ST197Q08HA <dbl>, ST197Q09HA <dbl>, ST197Q12HA <dbl>,
# ST220Q01HA <dbl>, ST220Q02HA <dbl>, ST220Q03HA <dbl>,
# ST220Q04HA <dbl>, ST177Q01HA <dbl>, ST019AQ01T <dbl>,
# ST021Q01TA <dbl>
tail(pisa)
# A tibble: 6 x 17
CNT ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ESP 10 1 4 4 4
2 ESP 9 2 3 3 3
3 ESP 10 2 4 4 4
4 ESP 9 2 2 2 2
5 ESP 8 2 3 3 3
6 ESP 9 1 2 2 2
# ... with 11 more variables: ST197Q07HA <dbl>, ST197Q08HA <dbl>,
# ST197Q09HA <dbl>, ST197Q12HA <dbl>, ST220Q01HA <dbl>,
# ST220Q02HA <dbl>, ST220Q03HA <dbl>, ST220Q04HA <dbl>,
# ST177Q01HA <dbl>, ST019AQ01T <dbl>, ST021Q01TA <dbl>
#remove additional variables not needed to answer research question
pisa_tidy <- pisa %>%
select(-c("ST001D01T", "ST004D01T", "ST220Q01HA", "ST220Q02HA", "ST220Q03HA", "ST220Q04HA", "ST019AQ01T", "ST021Q01TA")) %>%
#rename variables
rename(country=CNT,
informed_climate_change=ST197Q01HA,
informed_global_health=ST197Q02HA,
informed_migration=ST197Q04HA,
informed_international_conflict=ST197Q07HA,
informed_world_hunger=ST197Q08HA,
informed_poverty_causes=ST197Q09HA,
informed_gender_equality=ST197Q12HA,
language_self=ST177Q01HA) %>%
#remove NAs
drop_na() %>%
#recode values
mutate(country = recode(country, ESP = "Spain")) %>%
mutate(informed_climate_change = recode(informed_climate_change,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_global_health = recode(informed_global_health,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_migration = recode(informed_migration,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_international_conflict = recode(informed_international_conflict,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_world_hunger = recode(informed_world_hunger,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_poverty_causes = recode(informed_poverty_causes,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_gender_equality = recode(informed_gender_equality,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(language_self = recode(language_self,
`1` = "One",
`2` = "Two",
`3` = "Three",
`4` = "Four +"))
#examine
pisa_tidy
# A tibble: 28,022 x 9
country informed_climate_change informed_global_h~ informed_migrat~
<chr> <chr> <chr> <chr>
1 Spain Well informed Well informed Well informed
2 Spain Informed Not well informed Informed
3 Spain Well informed Informed Informed
4 Spain Not well informed Not informed Informed
5 Spain Well informed Not well informed Informed
6 Spain Informed Not well informed Not well inform~
7 Spain Informed Informed Informed
8 Spain Not well informed Not well informed Informed
9 Spain Not well informed Informed Informed
10 Spain Informed Informed Informed
# ... with 28,012 more rows, and 5 more variables:
# informed_international_conflict <chr>,
# informed_world_hunger <chr>, informed_poverty_causes <chr>,
# informed_gender_equality <chr>, language_self <chr>
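The seven `informed_*` recodes above all apply the same four labels, so they could be collapsed into a single step. The following is a minimal sketch on a small made-up tibble (`toy` is hypothetical data, not PISA responses), using `across()` with `factor()` in place of `recode()`. Storing the answers as factors with ordered levels is a swapped-in choice, not what the code above does; one side effect is that later plots would not need `fct_relevel()`.

```r
library(dplyr)

informed_labels <- c("Not informed", "Not well informed", "Informed", "Well informed")

# Made-up responses coded 1-4, mimicking the structure of the PISA variables
toy <- tibble(
  informed_climate_change = c(4, 3, 2, 1),
  informed_migration      = c(3, 3, 4, 2),
  language_self           = c(1, 2, 3, 4)
)

toy_tidy <- toy %>%
  # apply the same labels to every column whose name starts with "informed_"
  mutate(across(starts_with("informed_"),
                ~ factor(.x, levels = 1:4, labels = informed_labels))) %>%
  # label the language variable separately because its categories differ
  mutate(language_self = factor(language_self, levels = 1:4,
                                labels = c("One", "Two", "Three", "Four +")))

toy_tidy
```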
I have created univariate plots for each variable which show count, in addition to using group_by() to first view percent in a tibble. These plots don’t directly answer my research question, but they do provide a general overview of each individual variable. The question asked of each participant has been included before each variable’s percent calculation and plot.
#calculate percent for informed_climate_change
select(pisa_tidy, "informed_climate_change") %>%
group_by(informed_climate_change) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor(informed_climate_change,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
# A tibble: 4 x 3
informed_climate_change count percent
<chr> <int> <dbl>
1 Not informed 541 1.93
2 Not well informed 4000 14.3
3 Informed 16623 59.3
4 Well informed 6858 24.5
#plot for informed_climate_change
ggplot(pisa_tidy, aes(x = fct_relevel(informed_climate_change, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3",color = "black") +
labs(x = "Climate Change",
y = "Count",
title = "How informed students feel on climate change", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate percent for informed_global_health
select(pisa_tidy, "informed_global_health") %>%
group_by(informed_global_health) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor(informed_global_health,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
# A tibble: 4 x 3
informed_global_health count percent
<chr> <int> <dbl>
1 Not informed 467 1.67
2 Not well informed 7249 25.9
3 Informed 16315 58.2
4 Well informed 3991 14.2
#plot for informed_global_health
ggplot(pisa_tidy, aes(x = fct_relevel(informed_global_health, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3",color = "black") +
labs(x = "Global Health",
y = "Count",
title = "How informed students feel on global health", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate percent for informed_migration
select(pisa_tidy, "informed_migration") %>%
group_by(informed_migration) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor(informed_migration,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
# A tibble: 4 x 3
informed_migration count percent
<chr> <int> <dbl>
1 Not informed 450 1.61
2 Not well informed 5532 19.7
3 Informed 16583 59.2
4 Well informed 5457 19.5
#plot for informed_migration
ggplot(pisa_tidy, aes(x = fct_relevel(informed_migration, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "Migration",
y = "Count",
title = "How informed students feel on migration", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate percent for informed_international_conflict
select(pisa_tidy, "informed_international_conflict") %>%
group_by(informed_international_conflict) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor(informed_international_conflict,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
# A tibble: 4 x 3
informed_international_conflict count percent
<chr> <int> <dbl>
1 Not informed 733 2.62
2 Not well informed 8349 29.8
3 Informed 13758 49.1
4 Well informed 5182 18.5
#plot for informed_international_conflict
ggplot(pisa_tidy, aes(x = fct_relevel(informed_international_conflict, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "International Conflict",
y = "Count",
title = "How informed students feel on international conflict", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate percent for informed_world_hunger
select(pisa_tidy, "informed_world_hunger") %>%
group_by(informed_world_hunger) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor(informed_world_hunger,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
# A tibble: 4 x 3
informed_world_hunger count percent
<chr> <int> <dbl>
1 Not informed 345 1.23
2 Not well informed 4331 15.5
3 Informed 16459 58.7
4 Well informed 6887 24.6
#plot for informed_world_hunger
ggplot(pisa_tidy, aes(x = fct_relevel(informed_world_hunger, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "World Hunger",
y = "Count",
title = "How informed students feel on world hunger", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate percent for informed_poverty_causes
select(pisa_tidy,"informed_poverty_causes") %>%
group_by(informed_poverty_causes) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor(informed_poverty_causes,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
# A tibble: 4 x 3
informed_poverty_causes count percent
<chr> <int> <dbl>
1 Not informed 391 1.40
2 Not well informed 5230 18.7
3 Informed 15367 54.8
4 Well informed 7034 25.1
#plot for informed_poverty_causes
ggplot(pisa_tidy, aes(x = fct_relevel(informed_poverty_causes, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "Causes of Poverty",
y = "Count",
title = "How informed students feel on causes of poverty", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate percent for informed_gender_equality
select(pisa_tidy, "informed_gender_equality") %>%
group_by(informed_gender_equality) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor(informed_gender_equality,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
# A tibble: 4 x 3
informed_gender_equality count percent
<chr> <int> <dbl>
1 Not informed 365 1.30
2 Not well informed 1671 5.96
3 Informed 11550 41.2
4 Well informed 14436 51.5
#plot for informed_gender_equality
ggplot(pisa_tidy, aes(x = fct_relevel(informed_gender_equality, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "Gender Equality",
y = "Count",
title = "How informed students feel on gender equality", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate percent for language_self
select(pisa_tidy, "language_self") %>%
group_by(language_self) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")))
# A tibble: 4 x 3
language_self count percent
<chr> <int> <dbl>
1 One 3996 14.3
2 Two 10732 38.3
3 Three 10370 37.0
4 Four + 2924 10.4
# plot for language_self
ggplot(pisa_tidy, aes(x = fct_relevel(language_self, "One", "Two", "Three", "Four +"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "Languages Spoken",
y = "Count",
title = "Number of languages students speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text( size = 9),
axis.text.y = element_text( size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
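The percent tables above all follow one pattern: count each answer, then divide by the total. That pattern could be wrapped in a small helper, sketched here on made-up data. The function name `percent_table` and the tibble `toy` are hypothetical, and the sketch assumes a dplyr version that supports the `{{ }}` embrace operator.

```r
library(dplyr)

# Hypothetical helper: count each answer and convert the counts to percents
percent_table <- function(data, var) {
  data %>%
    count({{ var }}, name = "count") %>%
    mutate(percent = count / sum(count) * 100)
}

# Made-up example data (not PISA responses)
toy <- tibble(
  informed_migration = c("Informed", "Informed", "Informed", "Well informed")
)

toy_percent <- percent_table(toy, informed_migration)
toy_percent
```

With the real data, a call such as `percent_table(pisa_tidy, informed_migration)` would stand in for the select/group_by/summarise/mutate chain, with `arrange()` still needed to order the answer categories.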
To directly explore my research question I have calculated percent and then created two bivariate visualizations for each grouping, one of which uses facet_grid().
#language_self & informed_climate_change
language_climate_change <- select(pisa_tidy, "language_self", "informed_climate_change") %>%
group_by(language_self, informed_climate_change) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_climate_change,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_climate_change
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_climate_change count percent
<chr> <chr> <int> <dbl>
1 One Not informed 171 4.28
2 One Not well informed 957 23.9
3 One Informed 2275 56.9
4 One Well informed 593 14.8
5 Two Not informed 142 1.32
6 Two Not well informed 1518 14.1
7 Two Informed 6740 62.8
8 Two Well informed 2332 21.7
9 Three Not informed 134 1.29
10 Three Not well informed 1196 11.5
11 Three Informed 6071 58.5
12 Three Well informed 2969 28.6
13 Four + Not informed 94 3.21
14 Four + Not well informed 329 11.3
15 Four + Informed 1537 52.6
16 Four + Well informed 964 33.0
#create plot
ggplot(language_climate_change,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
y = percent,
fill = factor(informed_climate_change,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", color = "black", position = "fill") +
labs(y = "Frequency", fill = NULL, x = "Languages Spoken", title = "How informed students feel on climate change\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#create plot with facet_grid
ggplot(language_climate_change,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
y = percent, fill = factor(informed_climate_change,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on climate change\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
facet_grid(~factor(informed_climate_change, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9, angle = 30),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate percent for language_self & informed_global_health
language_global_health <- select(pisa_tidy, "language_self", "informed_global_health") %>%
group_by(language_self, informed_global_health) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_global_health,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_global_health
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_global_health count percent
<chr> <chr> <int> <dbl>
1 One Not informed 126 3.15
2 One Not well informed 1303 32.6
3 One Informed 2147 53.7
4 One Well informed 420 10.5
5 Two Not informed 139 1.30
6 Two Not well informed 2894 27.0
7 Two Informed 6361 59.3
8 Two Well informed 1338 12.5
9 Three Not informed 133 1.28
10 Three Not well informed 2450 23.6
11 Three Informed 6204 59.8
12 Three Well informed 1583 15.3
13 Four + Not informed 69 2.36
14 Four + Not well informed 602 20.6
15 Four + Informed 1603 54.8
16 Four + Well informed 650 22.2
#create plot
ggplot(language_global_health,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
y = percent,
fill = factor(informed_global_health,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", color = "black", position = "fill") +
labs(y = "Frequency", fill = NULL, x = "Languages Spoken", title = "How informed students feel on global health\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#create plot with facet_grid
ggplot(language_global_health,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
y = percent, fill = factor(informed_global_health,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on global health\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
facet_grid(~factor(informed_global_health, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9, angle = 30),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate percent for language_self & informed_migration
language_migration <- select(pisa_tidy, "language_self", "informed_migration") %>%
group_by(language_self, informed_migration) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_migration,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_migration
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_migration count percent
<chr> <chr> <int> <dbl>
1 One Not informed 134 3.35
2 One Not well informed 948 23.7
3 One Informed 2301 57.6
4 One Well informed 613 15.3
5 Two Not informed 113 1.05
6 Two Not well informed 2248 20.9
7 Two Informed 6548 61.0
8 Two Well informed 1823 17.0
9 Three Not informed 122 1.18
10 Three Not well informed 1881 18.1
11 Three Informed 6196 59.7
12 Three Well informed 2171 20.9
13 Four + Not informed 81 2.77
14 Four + Not well informed 455 15.6
15 Four + Informed 1538 52.6
16 Four + Well informed 850 29.1
#create plot
ggplot(language_migration,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
y = percent,
fill = factor(informed_migration,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", color = "black", position = "fill") +
labs(y = "Frequency", fill = NULL, x = "Languages Spoken", title = "How informed students feel on migration\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#create plot with facet_grid
ggplot(language_migration,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
y = percent, fill = factor(informed_migration,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on migration\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
facet_grid(~factor(informed_migration, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9, angle = 30),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate percent for language_self & informed_international_conflict
language_international_conflict <- select(pisa_tidy, "language_self", "informed_international_conflict") %>%
group_by(language_self, informed_international_conflict) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_international_conflict,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_international_conflict
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_international_conflict count percent
<chr> <chr> <int> <dbl>
1 One Not informed 195 4.88
2 One Not well informed 1480 37.0
3 One Informed 1791 44.8
4 One Well informed 530 13.3
5 Two Not informed 226 2.11
6 Two Not well informed 3319 30.9
7 Two Informed 5477 51.0
8 Two Well informed 1710 15.9
9 Three Not informed 222 2.14
10 Three Not well informed 2870 27.7
11 Three Informed 5157 49.7
12 Three Well informed 2121 20.5
13 Four + Not informed 90 3.08
14 Four + Not well informed 680 23.3
15 Four + Informed 1333 45.6
16 Four + Well informed 821 28.1
#create plot
ggplot(language_international_conflict,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
y = percent,
fill = factor(informed_international_conflict,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", color = "black", position = "fill") +
labs(y = "Frequency", fill = NULL, x = "Languages Spoken", title = "How informed students feel on international conflict\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#create plot with facet_grid
ggplot(language_international_conflict,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
y = percent, fill = factor(informed_international_conflict,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on international conflict\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
facet_grid(~factor(informed_international_conflict, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9, angle = 30),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate percent for language_self & informed_world_hunger
language_world_hunger <- select(pisa_tidy, "language_self", "informed_world_hunger") %>%
group_by(language_self, informed_world_hunger) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_world_hunger,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_world_hunger
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_world_hunger count percent
<chr> <chr> <int> <dbl>
1 One Not informed 97 2.43
2 One Not well informed 780 19.5
3 One Informed 2326 58.2
4 One Well informed 793 19.8
5 Two Not informed 92 0.857
6 Two Not well informed 1717 16.0
7 Two Informed 6514 60.7
8 Two Well informed 2409 22.4
9 Three Not informed 95 0.916
10 Three Not well informed 1453 14.0
11 Three Informed 6091 58.7
12 Three Well informed 2731 26.3
13 Four + Not informed 61 2.09
14 Four + Not well informed 381 13.0
15 Four + Informed 1528 52.3
16 Four + Well informed 954 32.6
#create plot
ggplot(language_world_hunger,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
y = percent,
fill = factor(informed_world_hunger,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", color = "black", position = "fill") +
labs(y = "Frequency", fill = NULL, x = "Languages Spoken", title = "How informed students feel on world hunger\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#create plot with facet_grid
ggplot(language_world_hunger,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
y = percent, fill = factor(informed_world_hunger,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on world hunger\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
facet_grid(~factor(informed_world_hunger, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9, angle = 30),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate percent for language_self & informed_poverty_causes
language_poverty_causes <- select(pisa_tidy, "language_self", "informed_poverty_causes") %>%
group_by(language_self, informed_poverty_causes) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_poverty_causes,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_poverty_causes
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_poverty_causes count percent
<chr> <chr> <int> <dbl>
1 One Not informed 108 2.70
2 One Not well informed 894 22.4
3 One Informed 2169 54.3
4 One Well informed 825 20.6
5 Two Not informed 106 0.988
6 Two Not well informed 2113 19.7
7 Two Informed 6104 56.9
8 Two Well informed 2409 22.4
9 Three Not informed 115 1.11
10 Three Not well informed 1789 17.3
11 Three Informed 5664 54.6
12 Three Well informed 2802 27.0
13 Four + Not informed 62 2.12
14 Four + Not well informed 434 14.8
15 Four + Informed 1430 48.9
16 Four + Well informed 998 34.1
#create plot
ggplot(language_poverty_causes,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
y = percent,
fill = factor(informed_poverty_causes,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", color = "black", position = "fill") +
labs(y = "Frequency", fill = NULL, x = "Languages Spoken", title = "How informed students feel on causes of poverty\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#create plot with facet_grid
ggplot(language_poverty_causes,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four +")),
y = percent, fill = factor(informed_poverty_causes,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
labs(y = "Percent", fill = NULL, x = "Languages Spoken", title = "How informed students feel on causes of poverty\nby number of languages they speak", subtitle = "Spain, 2018", caption = "Source: PISA 2018 Student Questionnaire Database")+
facet_grid(~factor(informed_poverty_causes, levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
theme_linedraw() +
theme(axis.text.x = element_text(size = 9, angle = 30),
axis.text.y = element_text(size = 10),
text = element_text(size = 11),
plot.caption = element_text(hjust = 0))
#calculate percent for language_self & informed_gender_equality
language_gender_equality <- select(pisa_tidy, "language_self", "informed_gender_equality") %>%
group_by(language_self, informed_gender_equality) %>%
summarise(count = n()) %>%
mutate(percent = count/sum(count) * 100) %>%
arrange(factor (language_self, levels = c("One", "Two", "Three", "Four +")), factor(informed_gender_equality,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))
language_gender_equality
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_gender_equality count percent
<chr> <chr> <int> <dbl>
1 One Not informed 123 3.08
2 One Not well informed 393 9.83
3 One Informed 1846 46.2
4 One Well informed 1634 40.9
5 Two Not informed 80 0.745
6 Two Not well informed 634 5.91
7 Two Informed 4734 44.1
8 Two Well informed 5284 49.2
9 Three Not informed 90 0.868
10 Three Not well informed 490 4.73
11 Three Informed 4004 38.6
12 Three Well informed 5786 55.8
13 Four + Not informed 72 2.46
14 Four + Not well informed 154 5.27
15 Four + Informed 966 33.0
16 Four + Well informed 1732 59.2
#create plot
ggplot(language_gender_equality,
       aes(x = factor(language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent,
           fill = factor(informed_gender_equality,
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
  geom_bar(stat = "identity", color = "black", position = "fill") +
  labs(y = "Frequency", fill = NULL, x = "Languages Spoken",
       title = "How informed students feel on gender equality\nby number of languages they speak",
       subtitle = "Spain, 2018",
       caption = "Source: PISA 2018 Student Questionnaire Database") +
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 9),
        axis.text.y = element_text(size = 10),
        text = element_text(size = 11),
        plot.caption = element_text(hjust = 0))
#create plot with facet_grid
ggplot(language_gender_equality,
       aes(x = factor(language_self, levels = c("One", "Two", "Three", "Four +")),
           y = percent,
           fill = factor(informed_gender_equality,
                         levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
  geom_bar(stat = "identity", color = "black", position = "dodge") +
  labs(y = "Percent", fill = NULL, x = "Languages Spoken",
       title = "How informed students feel on gender equality\nby number of languages they speak",
       subtitle = "Spain, 2018",
       caption = "Source: PISA 2018 Student Questionnaire Database") +
  facet_grid(~factor(informed_gender_equality,
                     levels = c("Not informed", "Not well informed", "Informed", "Well informed"))) +
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 9, angle = 30),
        axis.text.y = element_text(size = 10),
        text = element_text(size = 11),
        plot.caption = element_text(hjust = 0))
There is a clear positive correlation between the number of languages students speak and how well informed they feel on these seven topics. Yet this analysis alone is not enough to conclude that speaking more languages makes a student better informed on these topics; many other variables could be at play.
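As a rough check on one topic, the direction of this relationship can be summarised with a rank correlation. The sketch below is only an illustration: it rebuilds the gender equality counts from the table printed above, codes each category by its position in the ordered levels (1 to 4), and computes Spearman's rho. The choice of Spearman's rho and the names `counts`, `lang_rank`, `info_rank` and `rho` are my own assumptions, not part of the analysis above.

```r
# Ordered category levels, as used in the plots
lang_levels <- c("One", "Two", "Three", "Four +")
info_levels <- c("Not informed", "Not well informed", "Informed", "Well informed")

# Counts copied from the language_gender_equality table printed above
counts <- data.frame(
  language_self = rep(lang_levels, each = 4),
  informed_gender_equality = rep(info_levels, times = 4),
  count = c(123, 393, 1846, 1634,
            80, 634, 4734, 5284,
            90, 490, 4004, 5786,
            72, 154, 966, 1732)
)

# Expand the counts to one value per student,
# coding each category by its position in the ordered levels (1 to 4)
lang_rank <- rep(match(counts$language_self, lang_levels), times = counts$count)
info_rank <- rep(match(counts$informed_gender_equality, info_levels), times = counts$count)

# Spearman's rho only uses the ordering of the categories, not their spacing
# (assumption: an illustrative measure, not one used in the original analysis)
rho <- cor(lang_rank, info_rank, method = "spearman")
rho
```

A positive value of `rho` would agree with the pattern in the plots for this topic, but it describes association only and says nothing about why the two variables move together.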
Things I would want to explore further are:
I hypothesize that students attending private schools in cities speak more languages. If this is the case, these variables, combined with family income and/or parents’ level of education, could be what affects how informed a student feels on certain topics, rather than primarily the number of languages the student speaks. The original dataset may include some of these variables, and there is also a separate dataset of parent responses to a questionnaire which may contain some or all of them. However, the size of the datasets combined with the limited memory available on my computer prevents me from exploring this further at this time.
One thing I did notice in reviewing the data was that in three of the seven topics (climate change, world hunger and gender equality), students who speak four or more languages felt slightly less informed overall (when combining informed and well informed) than students who speak three languages. It could be helpful to show this clearly in my analysis, as a naive reader may not pick up on this small detail. In general, I do believe a naive reader would be able to understand my graphs. As I’m still learning, I definitely welcome feedback that would say otherwise, though!
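The three-versus-four-plus comparison can be made explicit by summing the "Informed" and "Well informed" counts within each language group. The sketch below does this for gender equality only, using the counts from the table printed earlier; the names `combined_informed` and `percent_informed` are assumptions I introduce for illustration.

```r
library(dplyr)

# Counts copied from the language_gender_equality table printed earlier
language_gender_equality <- data.frame(
  language_self = rep(c("One", "Two", "Three", "Four +"), each = 4),
  informed_gender_equality = rep(
    c("Not informed", "Not well informed", "Informed", "Well informed"),
    times = 4
  ),
  count = c(123, 393, 1846, 1634,
            80, 634, 4734, 5284,
            90, 490, 4004, 5786,
            72, 154, 966, 1732)
)

# Share of students in each language group who feel "Informed" or "Well informed"
combined_informed <- language_gender_equality %>%
  group_by(language_self) %>%
  summarise(
    percent_informed =
      sum(count[informed_gender_equality %in% c("Informed", "Well informed")]) /
      sum(count) * 100
  ) %>%
  arrange(factor(language_self, levels = c("One", "Two", "Three", "Four +")))

combined_informed
```

For gender equality this gives about 94.4% for students speaking three languages and about 92.3% for students speaking four or more, which is the small dip described above. The same summary could be repeated for the other topics to confirm which ones show it.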
Programme for International Student Assessment. (2020). Student questionnaire data files (PISA 2018 Database) [Dataset and codebook]. Organisation for Economic Co-operation and Development. https://www.oecd.org/pisa/data/2018database/
RStudio Team. (2022). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA. http://www.rstudio.com/
Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Collazo (2022, March 2). Data Analytics and Computational Social Science: HW5. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomlcollazohw5/
BibTeX citation
@misc{collazo2022hw5, author = {Collazo, Laura}, title = {Data Analytics and Computational Social Science: HW5}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomlcollazohw5/}, year = {2022} }