This assignment includes descriptive statistics and visualizations for the variables I’ve selected for my final project.
The dataset I’ve chosen for my final project is the Program for International Student Assessment (PISA) 2018 student data. It’s a large dataset containing 1,119 variables and 612,004 observations of 15-year-old students from 80 countries.
Note: The dataset originally read in was a very large SAS file. However, my computer’s memory was not sufficient to knit the file. As a workaround, I had to write a csv containing a limited number of variables and then read this back in.
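#load packages (tidyverse for dplyr/ggplot2/readr, haven for read_sas)
library(tidyverse)
library(haven)
#read in SAS & examine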
pisa <- read_sas("cy07_msu_stu_qqq.sas7bdat", "CY07MSU_FMT_STU_QQQ.SAS7BCAT", encoding = NULL, .name_repair = "unique")
head(pisa)
tail(pisa)
unique(pisa[c("CNT")])
# select only desired variables and filter country for Spain
pisa_smaller <- pisa %>%
select(c(CNT,
ST001D01T,
ST004D01T,
ST197Q01HA,
ST197Q02HA,
ST197Q04HA,
ST197Q07HA,
ST197Q08HA,
ST197Q09HA,
ST197Q12HA,
ST220Q01HA,
ST220Q02HA,
ST220Q03HA,
ST220Q04HA,
ST177Q01HA,
ST019AQ01T,
ST021Q01TA)) %>%
filter(CNT == "ESP")
#check work
head(pisa_smaller)
#write csv
write_csv(pisa_smaller, "pisa_smaller_2022-2-20.csv")
#read the smaller csv back in & examine
pisa <- read_csv("pisa_smaller_2022-2-20.csv")
head(pisa)
# A tibble: 6 x 17
CNT ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ESP 10 2 4 4 4
2 ESP 9 1 3 2 3
3 ESP 10 2 4 3 3
4 ESP 8 2 2 1 3
5 ESP 10 1 NA NA NA
6 ESP 10 1 4 2 3
# ... with 11 more variables: ST197Q07HA <dbl>, ST197Q08HA <dbl>,
# ST197Q09HA <dbl>, ST197Q12HA <dbl>, ST220Q01HA <dbl>,
# ST220Q02HA <dbl>, ST220Q03HA <dbl>, ST220Q04HA <dbl>,
# ST177Q01HA <dbl>, ST019AQ01T <dbl>, ST021Q01TA <dbl>
tail(pisa)
# A tibble: 6 x 17
CNT ST001D01T ST004D01T ST197Q01HA ST197Q02HA ST197Q04HA
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ESP 10 1 4 4 4
2 ESP 9 2 3 3 3
3 ESP 10 2 4 4 4
4 ESP 9 2 2 2 2
5 ESP 8 2 3 3 3
6 ESP 9 1 2 2 2
# ... with 11 more variables: ST197Q07HA <dbl>, ST197Q08HA <dbl>,
# ST197Q09HA <dbl>, ST197Q12HA <dbl>, ST220Q01HA <dbl>,
# ST220Q02HA <dbl>, ST220Q03HA <dbl>, ST220Q04HA <dbl>,
# ST177Q01HA <dbl>, ST019AQ01T <dbl>, ST021Q01TA <dbl>
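A possible alternative to the csv workaround described in the note above: haven's read_sas() has a col_select argument, so only the needed columns are read from the SAS file in the first place, which should lower memory use. This is only a sketch. The helper name read_pisa_columns and the short wanted vector are made up for illustration, and it simply returns NULL if the file is not in the working directory.

```r
library(haven)
library(dplyr)

# Hypothetical helper: read only the listed columns from a SAS file.
# Returns NULL (with a message) when the file cannot be found.
read_pisa_columns <- function(data_file, columns, catalog_file = NULL) {
  if (!file.exists(data_file)) {
    message("File not found: ", data_file)
    return(NULL)
  }
  # col_select takes a tidyselect expression; all_of() errors on missing names
  read_sas(data_file,
           catalog_file = catalog_file,
           col_select = all_of(columns))
}

# A few of the variables used in this post (extend as needed)
wanted <- c("CNT", "ST197Q01HA", "ST177Q01HA")

pisa_subset <- read_pisa_columns("cy07_msu_stu_qqq.sas7bdat", wanted)
```

The rows for Spain could then be kept with filter(CNT == "ESP") exactly as in the code above.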
#remove unneeded variables
pisa_tidy <- pisa %>%
select(-c("ST001D01T", "ST004D01T", "ST220Q01HA", "ST220Q02HA", "ST220Q03HA", "ST220Q04HA", "ST019AQ01T", "ST021Q01TA")) %>%
#rename variables
rename(country=CNT,
informed_climate_change=ST197Q01HA,
informed_global_health=ST197Q02HA,
informed_migration=ST197Q04HA,
informed_international_conflict=ST197Q07HA,
informed_world_hunger=ST197Q08HA,
informed_poverty_causes=ST197Q09HA,
informed_gender_equality=ST197Q12HA,
language_self=ST177Q01HA) %>%
#remove NAs
drop_na() %>%
#recode values (I still need to come back to this and learn to use across() to recode all variables beginning with "informed_")
mutate(country = recode(country, ESP = "Spain")) %>%
mutate(informed_climate_change = recode(informed_climate_change,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_global_health = recode(informed_global_health,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_migration = recode(informed_migration,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_international_conflict = recode(informed_international_conflict,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_world_hunger = recode(informed_world_hunger,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_poverty_causes = recode(informed_poverty_causes,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(informed_gender_equality = recode(informed_gender_equality,
`1` = "Not informed",
`2` = "Not well informed",
`3` = "Informed",
`4` = "Well informed")) %>%
mutate(language_self = recode(language_self, `1` = "One", `2` = "Two", `3` = "Three", `4` = "Four or more"))
pisa_tidy
# A tibble: 28,022 x 9
country informed_climate_change informed_global_h~ informed_migrat~
<chr> <chr> <chr> <chr>
1 Spain Well informed Well informed Well informed
2 Spain Informed Not well informed Informed
3 Spain Well informed Informed Informed
4 Spain Not well informed Not informed Informed
5 Spain Well informed Not well informed Informed
6 Spain Informed Not well informed Not well inform~
7 Spain Informed Informed Informed
8 Spain Not well informed Not well informed Informed
9 Spain Not well informed Informed Informed
10 Spain Informed Informed Informed
# ... with 28,012 more rows, and 5 more variables:
# informed_international_conflict <chr>,
# informed_world_hunger <chr>, informed_poverty_causes <chr>,
# informed_gender_equality <chr>, language_self <chr>
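The across() idea mentioned in the recode comment above could look something like the following. This is a minimal sketch on made-up toy data (the toy tibble and its values are invented, not taken from PISA); it applies one recode() to every column whose name starts with "informed_" and leaves the other columns alone.

```r
library(dplyr)

# Toy stand-in for the data before recoding (values are made up)
toy <- tibble(
  informed_climate_change = c(1, 2, 3, 4),
  informed_migration      = c(4, 3, 2, 1),
  language_self           = c(1, 2, 3, 4)
)

# Recode every "informed_" column in a single mutate() call
toy_recoded <- toy %>%
  mutate(across(starts_with("informed_"),
                ~ recode(.x,
                         `1` = "Not informed",
                         `2` = "Not well informed",
                         `3` = "Informed",
                         `4` = "Well informed")))

toy_recoded
```

In the real pipeline this single mutate() would stand in for the seven separate mutate() calls on the "informed_" variables, while language_self keeps its own recode.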
The more I work with this data, the more my research question has narrowed. As I’m now a little less in panic mode working in R, I can also hear my undergrad professors stressing the importance of paring research questions down to be very specific. This is hard as I want to explore all the things, but I’m remembering there is plenty to dig into even in specific questions.
As of now, my research question is whether students in Spain feel they are better informed on seven different topics depending on how many languages they speak well enough to converse with someone else.
I would love to expand this to look at all countries that responded to these variables to see if what I observe in Spain holds true elsewhere. I think this would be a much more interesting research question! Right now I’m trying to focus on the basics of R, so I want to keep things on the simpler side. I did leave the variable “country” in my dataset, though, so if/when my comfort level with R increases, I can dig deeper.
My dataset consists entirely of character variables, so I have created frequencies for each as well as univariate plots showing counts. These plots don’t directly answer my research question, but they do provide a general overview of each individual variable before the variables are grouped to explore it. The question asked of each participant has been included before each variable’s frequency and plot.
#frequency of informed_climate_change
select(pisa_tidy, "informed_climate_change") %>%
group_by(informed_climate_change) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
# A tibble: 4 x 3
informed_climate_change count frequency
<chr> <int> <dbl>
1 Not informed 541 1.93
2 Not well informed 4000 14.3
3 Well informed 6858 24.5
4 Informed 16623 59.3
#plot for informed_climate_change
ggplot(pisa_tidy, aes(x = fct_relevel(informed_climate_change, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "Climate Change",
y = "Count",
title = "Participants by how informed on climate change")
#frequency of informed_global_health
select(pisa_tidy, "informed_global_health") %>%
group_by(informed_global_health) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
# A tibble: 4 x 3
informed_global_health count frequency
<chr> <int> <dbl>
1 Not informed 467 1.67
2 Well informed 3991 14.2
3 Not well informed 7249 25.9
4 Informed 16315 58.2
#plot for informed_global_health
ggplot(pisa_tidy, aes(x = fct_relevel(informed_global_health, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "Global Health",
y = "Count",
title = "Participants by how informed on global health")
#frequency of informed_migration
select(pisa_tidy, "informed_migration") %>%
group_by(informed_migration) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
# A tibble: 4 x 3
informed_migration count frequency
<chr> <int> <dbl>
1 Not informed 450 1.61
2 Well informed 5457 19.5
3 Not well informed 5532 19.7
4 Informed 16583 59.2
#plot for informed_migration
ggplot(pisa_tidy, aes(x = fct_relevel(informed_migration, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "Migration",
y = "Count",
title = "Participants by how informed on migration")
#frequency of informed_international_conflict
select(pisa_tidy, "informed_international_conflict") %>%
group_by(informed_international_conflict) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
# A tibble: 4 x 3
informed_international_conflict count frequency
<chr> <int> <dbl>
1 Not informed 733 2.62
2 Well informed 5182 18.5
3 Not well informed 8349 29.8
4 Informed 13758 49.1
#plot for informed_international_conflict
ggplot(pisa_tidy, aes(x = fct_relevel(informed_international_conflict, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "International Conflict",
y = "Count",
title = "Participants by how informed on international conflict")
#frequency of informed_world_hunger
select(pisa_tidy, "informed_world_hunger") %>%
group_by(informed_world_hunger) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
# A tibble: 4 x 3
informed_world_hunger count frequency
<chr> <int> <dbl>
1 Not informed 345 1.23
2 Not well informed 4331 15.5
3 Well informed 6887 24.6
4 Informed 16459 58.7
#plot for informed_world_hunger
ggplot(pisa_tidy, aes(x = fct_relevel(informed_world_hunger, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "World Hunger",
y = "Count",
title = "Participants by how informed on world hunger")
#frequency of informed_poverty_causes
select(pisa_tidy,"informed_poverty_causes") %>%
group_by(informed_poverty_causes) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
# A tibble: 4 x 3
informed_poverty_causes count frequency
<chr> <int> <dbl>
1 Not informed 391 1.40
2 Not well informed 5230 18.7
3 Well informed 7034 25.1
4 Informed 15367 54.8
#plot for informed_poverty_causes
ggplot(pisa_tidy, aes(x = fct_relevel(informed_poverty_causes, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "Poverty Causes",
y = "Count",
title = "Participants by how informed on poverty causes")
#frequency of informed_gender_equality
select(pisa_tidy, "informed_gender_equality") %>%
group_by(informed_gender_equality) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
# A tibble: 4 x 3
informed_gender_equality count frequency
<chr> <int> <dbl>
1 Not informed 365 1.30
2 Not well informed 1671 5.96
3 Informed 11550 41.2
4 Well informed 14436 51.5
#plot for informed_gender_equality
ggplot(pisa_tidy, aes(x = fct_relevel(informed_gender_equality, "Not informed", "Not well informed", "Informed", "Well informed"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "Gender Equality",
y = "Count",
title = "Participants by how informed on gender equality")
#frequency of language_self
select(pisa_tidy, "language_self") %>%
group_by(language_self) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
# A tibble: 4 x 3
language_self count frequency
<chr> <int> <dbl>
1 Four or more 2924 10.4
2 One 3996 14.3
3 Three 10370 37.0
4 Two 10732 38.3
# plot for language_self
ggplot(pisa_tidy, aes(x = fct_relevel(language_self, "One", "Two", "Three", "Four or more"))) +
geom_bar (fill = "turquoise3", color = "black") +
labs(x = "Languages Spoken",
y = "Count",
title = "Participants by languages spoken")
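The frequency tables above repeat the same steps for each variable. One way to get them all at once is to reshape the data to long format first. This is a sketch on a small made-up tibble (toy and its values are invented); with the real data, pisa_tidy minus the country column would take its place.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for pisa_tidy (values are made up)
toy <- tibble(
  informed_climate_change = c("Informed", "Informed", "Well informed", "Not informed"),
  informed_migration      = c("Informed", "Not well informed", "Informed", "Informed")
)

# One row per (variable, response) pair, with count and percentage within each variable
all_frequencies <- toy %>%
  pivot_longer(everything(), names_to = "variable", values_to = "response") %>%
  count(variable, response, name = "count") %>%
  group_by(variable) %>%
  mutate(frequency = count / sum(count) * 100) %>%
  ungroup()

all_frequencies
```

The same long table could also feed a single ggplot with facet_wrap(~ variable) instead of eight separate plots.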
After viewing these visualizations, it’s clear that the majority of students feel informed on each topic. The one topic that stood out is gender equality, as this is the only one where more students responded that they feel well informed than just informed. Living in Spain, I know there has been a huge push in recent years to increase education on gender equality in the country. Although it doesn’t fit with my current research question, it would be interesting to explore this variable over time using PISA datasets from previous years.
When it comes to the number of languages spoken, I wish the dataset included a follow-up question on which languages. Many regions of Spain have a regional language, so children are educated in both this language and Spanish. However, 37% of students responded that they speak three languages and 10.4% speak four or more languages, so even taking into account that speaking two languages is normal for many students in Spain, 47.4% have learned additional languages well enough to converse with others.
There are seven initial groupings I need to create to explore my research question. Therefore, the following visualizations look at how informed a student in Spain feels they are on a specific topic by how many languages they speak well enough to converse with someone.
#language_self & informed_climate_change
language_climate_change <- select(pisa_tidy, "language_self", "informed_climate_change") %>%
group_by(language_self, informed_climate_change) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
language_climate_change
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_climate_change count frequency
<chr> <chr> <int> <dbl>
1 Four or more Not informed 94 3.21
2 Three Not informed 134 1.29
3 Two Not informed 142 1.32
4 One Not informed 171 4.28
5 Four or more Not well informed 329 11.3
6 One Well informed 593 14.8
7 One Not well informed 957 23.9
8 Four or more Well informed 964 33.0
9 Three Not well informed 1196 11.5
10 Two Not well informed 1518 14.1
11 Four or more Informed 1537 52.6
12 One Informed 2275 56.9
13 Two Well informed 2332 21.7
14 Three Well informed 2969 28.6
15 Three Informed 6071 58.5
16 Two Informed 6740 62.8
ggplot(language_climate_change,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
y = frequency,
fill = factor(informed_climate_change,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", position = "fill") +
labs(y = "Frequency", fill = "Climate Change", x = "Languages Spoken", title = "Informed on climate change by number of languages spoken")
#language_self & informed_global_health
language_global_health <- select(pisa_tidy, "language_self", "informed_global_health") %>%
group_by(language_self, informed_global_health) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
language_global_health
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_global_health count frequency
<chr> <chr> <int> <dbl>
1 Four or more Not informed 69 2.36
2 One Not informed 126 3.15
3 Three Not informed 133 1.28
4 Two Not informed 139 1.30
5 One Well informed 420 10.5
6 Four or more Not well informed 602 20.6
7 Four or more Well informed 650 22.2
8 One Not well informed 1303 32.6
9 Two Well informed 1338 12.5
10 Three Well informed 1583 15.3
11 Four or more Informed 1603 54.8
12 One Informed 2147 53.7
13 Three Not well informed 2450 23.6
14 Two Not well informed 2894 27.0
15 Three Informed 6204 59.8
16 Two Informed 6361 59.3
ggplot(language_global_health,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
y = frequency,
fill = factor(informed_global_health,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", position = "fill") +
labs(y = "Frequency", fill = "Global Health", x = "Languages Spoken", title = "Informed on global health by number of languages spoken")
#language_self & informed_migration
language_migration <- select(pisa_tidy, "language_self", "informed_migration") %>%
group_by(language_self, informed_migration) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(informed_migration)
language_migration
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_migration count frequency
<chr> <chr> <int> <dbl>
1 Four or more Informed 1538 52.6
2 One Informed 2301 57.6
3 Three Informed 6196 59.7
4 Two Informed 6548 61.0
5 Four or more Not informed 81 2.77
6 One Not informed 134 3.35
7 Three Not informed 122 1.18
8 Two Not informed 113 1.05
9 Four or more Not well informed 455 15.6
10 One Not well informed 948 23.7
11 Three Not well informed 1881 18.1
12 Two Not well informed 2248 20.9
13 Four or more Well informed 850 29.1
14 One Well informed 613 15.3
15 Three Well informed 2171 20.9
16 Two Well informed 1823 17.0
ggplot(language_migration,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
y = frequency,
fill = factor(informed_migration,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", position = "fill") +
labs(y = "Frequency", fill = "Migration", x = "Languages Spoken", title = "Informed on migration by number of languages spoken")
#language_self & informed_international_conflict
language_international_conflict <- select(pisa_tidy, "language_self", "informed_international_conflict") %>%
group_by(language_self, informed_international_conflict) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
language_international_conflict
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_international_conflict count frequency
<chr> <chr> <int> <dbl>
1 Four or more Not informed 90 3.08
2 One Not informed 195 4.88
3 Three Not informed 222 2.14
4 Two Not informed 226 2.11
5 One Well informed 530 13.3
6 Four or more Not well informed 680 23.3
7 Four or more Well informed 821 28.1
8 Four or more Informed 1333 45.6
9 One Not well informed 1480 37.0
10 Two Well informed 1710 15.9
11 One Informed 1791 44.8
12 Three Well informed 2121 20.5
13 Three Not well informed 2870 27.7
14 Two Not well informed 3319 30.9
15 Three Informed 5157 49.7
16 Two Informed 5477 51.0
ggplot(language_international_conflict,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
y = frequency,
fill = factor(informed_international_conflict,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", position = "fill") +
labs(y = "Frequency", fill = "International Conflict", x = "Languages Spoken", title = "Informed on international conflict by number of languages spoken")
#language_self & informed_world_hunger
language_world_hunger <- select(pisa_tidy, "language_self", "informed_world_hunger") %>%
group_by(language_self, informed_world_hunger) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
language_world_hunger
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_world_hunger count frequency
<chr> <chr> <int> <dbl>
1 Four or more Not informed 61 2.09
2 Two Not informed 92 0.857
3 Three Not informed 95 0.916
4 One Not informed 97 2.43
5 Four or more Not well informed 381 13.0
6 One Not well informed 780 19.5
7 One Well informed 793 19.8
8 Four or more Well informed 954 32.6
9 Three Not well informed 1453 14.0
10 Four or more Informed 1528 52.3
11 Two Not well informed 1717 16.0
12 One Informed 2326 58.2
13 Two Well informed 2409 22.4
14 Three Well informed 2731 26.3
15 Three Informed 6091 58.7
16 Two Informed 6514 60.7
ggplot(language_world_hunger,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
y = frequency,
fill = factor(informed_world_hunger,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", position = "fill") +
labs(y = "Frequency", fill = "World Hunger", x = "Languages Spoken", title = "Informed on world hunger by number of languages spoken")
#language_self & informed_poverty_causes
language_poverty_causes <- select(pisa_tidy, "language_self", "informed_poverty_causes") %>%
group_by(language_self, informed_poverty_causes) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
language_poverty_causes
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_poverty_causes count frequency
<chr> <chr> <int> <dbl>
1 Four or more Not informed 62 2.12
2 Two Not informed 106 0.988
3 One Not informed 108 2.70
4 Three Not informed 115 1.11
5 Four or more Not well informed 434 14.8
6 One Well informed 825 20.6
7 One Not well informed 894 22.4
8 Four or more Well informed 998 34.1
9 Four or more Informed 1430 48.9
10 Three Not well informed 1789 17.3
11 Two Not well informed 2113 19.7
12 One Informed 2169 54.3
13 Two Well informed 2409 22.4
14 Three Well informed 2802 27.0
15 Three Informed 5664 54.6
16 Two Informed 6104 56.9
ggplot(language_poverty_causes,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
y = frequency,
fill = factor(informed_poverty_causes,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", position = "fill") +
labs(y = "Frequency", fill = "Poverty Causes", x = "Languages Spoken", title = "Informed on poverty causes by number of languages spoken")
#language_self & informed_gender_equality
language_gender_equality <- select(pisa_tidy, "language_self", "informed_gender_equality") %>%
group_by(language_self, informed_gender_equality) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count) * 100) %>%
arrange(count)
language_gender_equality
# A tibble: 16 x 4
# Groups: language_self [4]
language_self informed_gender_equality count frequency
<chr> <chr> <int> <dbl>
1 Four or more Not informed 72 2.46
2 Two Not informed 80 0.745
3 Three Not informed 90 0.868
4 One Not informed 123 3.08
5 Four or more Not well informed 154 5.27
6 One Not well informed 393 9.83
7 Three Not well informed 490 4.73
8 Two Not well informed 634 5.91
9 Four or more Informed 966 33.0
10 One Well informed 1634 40.9
11 Four or more Well informed 1732 59.2
12 One Informed 1846 46.2
13 Three Informed 4004 38.6
14 Two Informed 4734 44.1
15 Two Well informed 5284 49.2
16 Three Well informed 5786 55.8
ggplot(language_gender_equality,
aes(x = factor (language_self, levels = c("One", "Two", "Three", "Four or more")),
y = frequency,
fill = factor(informed_gender_equality,
levels = c("Not informed", "Not well informed", "Informed", "Well informed")))) +
geom_bar(stat = "identity", position = "fill") +
labs(y = "Frequency", fill = "Gender Equality", x = "Languages Spoken", title = "Informed on gender equality by number of languages spoken")
In each of the above visualizations, it’s evident that the more languages a student speaks, the more likely they are to feel well informed about the given topic. Interestingly, if you look at informed and well informed together, in three of the seven topics (climate change, world hunger and gender equality) students who speak four or more languages felt slightly less informed overall than students who speak three languages. Calculating this difference would be a helpful statistic to include in my analysis.
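As a rough sketch of that calculation, the counts below are copied from the climate change tibble printed earlier. The names combined, informed_or_better and difference are just illustrative choices.

```r
library(dplyr)

# Counts copied from the language_climate_change output above
language_climate_change <- tribble(
  ~language_self, ~informed_climate_change, ~count,
  "One",          "Not informed",            171,
  "One",          "Not well informed",       957,
  "One",          "Informed",               2275,
  "One",          "Well informed",           593,
  "Two",          "Not informed",            142,
  "Two",          "Not well informed",      1518,
  "Two",          "Informed",               6740,
  "Two",          "Well informed",          2332,
  "Three",        "Not informed",            134,
  "Three",        "Not well informed",      1196,
  "Three",        "Informed",               6071,
  "Three",        "Well informed",          2969,
  "Four or more", "Not informed",             94,
  "Four or more", "Not well informed",       329,
  "Four or more", "Informed",               1537,
  "Four or more", "Well informed",           964
)

# Share of each language group answering "Informed" or "Well informed"
combined <- language_climate_change %>%
  group_by(language_self) %>%
  summarise(
    informed_or_better =
      sum(count[informed_climate_change %in% c("Informed", "Well informed")]) /
      sum(count) * 100
  )

# Gap in percentage points between three and four or more languages
difference <- combined$informed_or_better[combined$language_self == "Three"] -
  combined$informed_or_better[combined$language_self == "Four or more"]

combined
difference
```

With these counts the gap on climate change comes out at roughly 1.6 percentage points in favor of students who speak three languages. The same steps could be repeated for the other six topics.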
I would like to add percentages to my visualizations, as this would help both me and a “naive viewer” understand what we are seeing without having to read the tibbles that display this information. I also don’t love the titles of each visualization and believe they could be improved.
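One way the percentages might be added is with geom_text() and position_fill(), which places a label in the middle of each stacked segment. This is only a sketch on made-up counts (toy_counts and its values are invented), not the final plot.

```r
library(dplyr)
library(ggplot2)

# Toy counts standing in for one of the language_* tibbles (values are made up)
toy_counts <- tribble(
  ~language_self, ~informed,       ~count,
  "One",          "Not informed",   10,
  "One",          "Informed",       30,
  "One",          "Well informed",  10,
  "Two",          "Not informed",    5,
  "Two",          "Informed",       25,
  "Two",          "Well informed",  20
)

# Share of each response within a language group (0 to 1)
plot_data <- toy_counts %>%
  group_by(language_self) %>%
  mutate(share = count / sum(count)) %>%
  ungroup() %>%
  mutate(informed = factor(informed,
                           levels = c("Not informed", "Informed", "Well informed")))

p <- ggplot(plot_data, aes(x = language_self, y = share, fill = informed)) +
  geom_col(position = "fill") +
  # vjust = 0.5 centers each label inside its segment
  geom_text(aes(label = scales::percent(share, accuracy = 0.1)),
            position = position_fill(vjust = 0.5)) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Languages Spoken", y = "Share of students", fill = "Response")

p
```

Because fill is set in the main aes(), the text layer is grouped the same way as the bars, so each label lines up with its segment.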
I imagine there are other styles of visualizations that could also be created. I still feel pretty uncertain on plotting, though, so poco a poco (little by little).
Although it’s clear there is a positive correlation between the number of languages spoken and how informed students feel they are on these seven topics (variables), I don’t believe this analysis alone is enough to conclude that the more languages you speak, the better informed you are on these topics. There are many other variables that could be at play.
There are several things I would want to explore further. For instance, I hypothesize that students attending private schools in cities would speak more languages. If this is the case, these variables, combined with family income and/or parents’ level of education, could be what impacts how informed a student is on certain topics, rather than the number of languages the student speaks. I will have to do some investigating in the original dataset to see if any of these other variables are available. I believe there is also a parent questionnaire, so it’s possible I could join datasets to explore things further. A challenge with this could be the size of the datasets, though, and the limited memory available on my computer to work with them.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Collazo (2022, Feb. 23). Data Analytics and Computational Social Science: HW4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomlcollazohw4/
BibTeX citation
@misc{collazo2022hw4,
  author = {Collazo, Laura},
  title = {Data Analytics and Computational Social Science: HW4},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomlcollazohw4/},
  year = {2022}
}