Computing descriptive statistics and creating visualizations
spec_tbl_df [1,000 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ...1 : num [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
$ gender : chr [1:1000] "female" "female" "female" "male" ...
$ race_ethnicity_group : chr [1:1000] "group B" "group C" "group B" "group A" ...
$ parent_highest_education: chr [1:1000] "bachelor's degree" "some college" "master's degree" "associate's degree" ...
$ lunch : chr [1:1000] "standard" "standard" "standard" "free/reduced" ...
$ test_preparation_course : chr [1:1000] "none" "completed" "none" "none" ...
$ math_marks : num [1:1000] 72 69 90 47 76 71 88 40 64 38 ...
$ reading_marks : num [1:1000] 72 90 95 57 78 83 95 43 64 60 ...
$ writing_marks : num [1:1000] 74 88 93 44 75 78 92 39 67 50 ...
$ total_marks : num [1:1000] 218 247 278 148 229 232 275 122 195 148 ...
$ mean_marks : num [1:1000] 72.7 82.3 92.7 49.3 76.3 ...
$ grade : chr [1:1000] "C" "B" "A" "F" ...
- attr(*, "spec")=
.. cols(
.. ...1 = col_double(),
.. gender = col_character(),
.. race_ethnicity_group = col_character(),
.. parent_highest_education = col_character(),
.. lunch = col_character(),
.. test_preparation_course = col_character(),
.. math_marks = col_double(),
.. reading_marks = col_double(),
.. writing_marks = col_double(),
.. total_marks = col_double(),
.. mean_marks = col_double(),
.. grade = col_character()
.. )
- attr(*, "problems")=<externalptr>
Mean of the numeric columns:
student_final %>%
summarise_if(is.numeric, mean)
# A tibble: 1 × 6
...1 math_marks reading_marks writing_marks total_marks mean_marks
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 500. 66.1 69.2 68.1 203. 67.8
Median of the numeric columns:
student_final %>%
summarise_if(is.numeric, median)
# A tibble: 1 × 6
...1 math_marks reading_marks writing_marks total_marks mean_marks
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 500. 66 70 69 205 68.3
Standard deviation of the numeric columns:
student_final %>%
summarise_if(is.numeric, sd)
# A tibble: 1 × 6
...1 math_marks reading_marks writing_marks total_marks mean_marks
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 289. 15.2 14.6 15.2 42.8 14.3
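The three summaries above could also be computed in a single call. The following is a minimal sketch, assuming dplyr 1.0.0 or later (needed for across()). It uses only the first ten rows of the mark columns shown in the str() output, so its numbers will not match the full-data results. The names toy and idx are hypothetical. Dropping the row-index column is a suggestion rather than part of the original analysis, since the mean or standard deviation of a row index is not a meaningful statistic.

```r
library(dplyr)

# Toy data: first ten rows of three mark columns, copied from the str() output.
toy <- tibble(
  idx           = 1:10,  # hypothetical stand-in for the `...1` row-index column
  math_marks    = c(72, 69, 90, 47, 76, 71, 88, 40, 64, 38),
  reading_marks = c(72, 90, 95, 57, 78, 83, 95, 43, 64, 60),
  writing_marks = c(74, 88, 93, 44, 75, 78, 92, 39, 67, 50)
)

# Drop the index, then compute mean, median, and sd for every numeric column.
marks_summary <- toy %>%
  select(-idx) %>%
  summarise(across(where(is.numeric),
                   list(mean = mean, median = median, sd = sd)))

print(marks_summary)
```

The result has one column per variable and statistic, named in the pattern math_marks_mean, math_marks_median, and so on.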
Frequency of gender:
# A tibble: 2 × 2
gender count
<chr> <int>
1 female 518
2 male 482
Frequency of ethnicity group of students:
# A tibble: 5 × 2
race_ethnicity_group count
<chr> <int>
1 group A 89
2 group B 190
3 group C 319
4 group D 262
5 group E 140
Frequency of the highest education obtained by parents:
# A tibble: 6 × 2
parent_highest_education count
<chr> <int>
1 associate's degree 222
2 bachelor's degree 118
3 high school 196
4 master's degree 59
5 some college 226
6 some high school 179
Frequency of the type of lunch:
# A tibble: 2 × 2
lunch count
<chr> <int>
1 free/reduced 355
2 standard 645
Frequency of test preparation course completion:
# A tibble: 2 × 2
test_preparation_course count
<chr> <int>
1 completed 358
2 none 642
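The code that produced the frequency tables is not shown. One way such tables could be produced is with dplyr's count(). This is a sketch on a four-row toy data frame built from the first values in the str() output, and it is not necessarily the code the tables above came from. The names toy, gender_freq, and lunch_freq are hypothetical.

```r
library(dplyr)

# Toy data: the first four values of two categorical columns from the str() output.
toy <- tibble(
  gender = c("female", "female", "female", "male"),
  lunch  = c("standard", "standard", "standard", "free/reduced")
)

# count() groups by the given column and tallies rows; `name` sets the tally column's name.
gender_freq <- toy %>% count(gender, name = "count")
lunch_freq  <- toy %>% count(lunch, name = "count")

print(gender_freq)
print(lunch_freq)
```

The same pattern applies to race_ethnicity_group, parent_highest_education, and test_preparation_course.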
Calculating the mean, median, and standard deviation of the marks obtained in Maths, grouped by gender:
student_final %>%
  group_by(gender) %>%
  summarise(
    gender_count = n(),
    mean_math_marks = mean(math_marks),
    median_math_marks = median(math_marks),
    sd_math_marks = sd(math_marks)
  )
# A tibble: 2 × 5
gender gender_count mean_math_marks median_math_marks sd_math_marks
<chr> <int> <dbl> <dbl> <dbl>
1 female 518 63.6 65 15.5
2 male 482 68.7 69 14.4
Calculating the mean, median, and standard deviation of the marks obtained in Maths, grouped by ethnic group:
student_final %>%
  group_by(race_ethnicity_group) %>%
  summarise(
    race_ethnicity_group_count = n(),
    mean_math_marks = mean(math_marks),
    median_math_marks = median(math_marks),
    sd_math_marks = sd(math_marks)
  )
# A tibble: 5 × 5
race_ethnicity_gr… race_ethnicity_… mean_math_marks median_math_mar…
<chr> <int> <dbl> <dbl>
1 group A 89 61.6 61
2 group B 190 63.5 63
3 group C 319 64.5 65
4 group D 262 67.4 69
5 group E 140 73.8 74.5
# … with 1 more variable: sd_math_marks <dbl>
Calculating the mean, median, and standard deviation of the marks obtained in Maths, grouped by gender and test preparation course completion:
student_final %>%
  group_by(gender, test_preparation_course) %>%
  summarise(
    count_gender = n(),
    mean_math_marks = mean(math_marks),
    median_math_marks = median(math_marks),
    sd_math_marks = sd(math_marks)
  )
# A tibble: 4 × 6
# Groups: gender [2]
gender test_preparation_course count_gender mean_math_marks
<chr> <chr> <int> <dbl>
1 female completed 184 67.2
2 female none 334 61.7
3 male completed 174 72.3
4 male none 308 66.7
# … with 2 more variables: median_math_marks <dbl>,
# sd_math_marks <dbl>
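The "# Groups: gender [2]" line in the output above appears because summarise() by default removes only the last grouping variable, so the result is still grouped by gender. If an ungrouped result is wanted, one option is the .groups argument. The sketch below assumes dplyr 1.0.0 or later and uses a four-row toy data frame taken from the first values in the str() output. The names toy and math_by_group are hypothetical.

```r
library(dplyr)

# Toy data: the first four rows of three columns from the str() output.
toy <- tibble(
  gender = c("female", "female", "female", "male"),
  test_preparation_course = c("none", "completed", "none", "none"),
  math_marks = c(72, 69, 90, 47)
)

math_by_group <- toy %>%
  group_by(gender, test_preparation_course) %>%
  summarise(
    count = n(),
    mean_math_marks = mean(math_marks),
    .groups = "drop"  # return an ungrouped tibble instead of one grouped by gender
  )

print(math_by_group)
```

Whether to drop the grouping depends on what comes next: keeping it is convenient for further per-gender calculations, while dropping it avoids surprises in later steps.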
ggplot(student_final, aes(x = race_ethnicity_group)) +
  geom_bar() +
  labs(x = "Ethnic group",
       y = "Frequency",
       title = "Students by ethnicity group") +
  coord_flip()
This is a univariate plot of a categorical variable (race_ethnicity_group). It is a horizontal bar chart that shows the number of students in each ethnic group. Group C is the largest, with 319 of the 1,000 students, although no single group holds a majority.
ggplot(student_final, aes(x = mean_marks)) +
  geom_histogram(fill = "red", color = "white") +
  labs(title = "Average Marks Distribution",
       x = "mean_marks", y = "Frequency")
This is a univariate plot of a continuous variable (mean_marks). It is a histogram, and the distribution suggests that the most common average scores fall roughly in the 65-70 range, which is consistent with the mean of 67.8 and the median of 68.3 reported above.
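When neither bins nor binwidth is supplied, geom_histogram() falls back to 30 bins, so where the peak appears depends partly on the binning. The sketch below sets an explicit bin width of 5 marks so that bin edges fall on multiples of 5. It uses only the first ten total_marks values from the str() output and assumes mean_marks equals total_marks / 3, which matches the values shown there. The name toy and the choices of binwidth and boundary are illustrative, not taken from the original analysis.

```r
library(ggplot2)

# Toy data: the first ten total_marks values from the str() output.
toy <- data.frame(
  total_marks = c(218, 247, 278, 148, 229, 232, 275, 122, 195, 148)
)
# Assumption: mean_marks is the total over the three subjects divided by 3.
toy$mean_marks <- toy$total_marks / 3

# binwidth = 5 with boundary = 0 puts bin edges at 0, 5, 10, ..., 65, 70, ...
p <- ggplot(toy, aes(x = mean_marks)) +
  geom_histogram(binwidth = 5, boundary = 0, fill = "red", color = "white") +
  labs(title = "Average Marks Distribution",
       x = "Mean marks", y = "Frequency")

print(p)
```

With bins aligned this way, a statement such as "most students scored between 65 and 70" can be read directly from a single bar.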
ggplot(student_final,
       aes(x = gender,
           y = mean_marks)) +
  geom_boxplot() +
  labs(title = "Average marks distribution by gender")
This is a bivariate plot of a categorical variable (gender) and a continuous variable (mean_marks). It is a box plot that shows summaries such as the median and the quartiles of mean_marks for each gender.
ggplot(student_final,
       aes(x = reading_marks,
           y = writing_marks)) +
  geom_point()
This is a bivariate plot of two continuous variables, reading_marks and writing_marks. Students who performed well in reading tended to perform well in writing as well.
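The relationship seen in the scatter plot can be quantified with a Pearson correlation coefficient. The sketch below uses base R's cor() on the first ten reading and writing values from the str() output, so it only illustrates the calculation. The coefficient for the full 1,000 rows is not reported in this post and may differ. The names toy and r are hypothetical.

```r
# Toy data: the first ten rows of the two columns, copied from the str() output.
toy <- data.frame(
  reading_marks = c(72, 90, 95, 57, 78, 83, 95, 43, 64, 60),
  writing_marks = c(74, 88, 93, 44, 75, 78, 92, 39, 67, 50)
)

# Pearson correlation: values near 1 indicate a strong positive linear relationship.
r <- cor(toy$reading_marks, toy$writing_marks)
print(r)
```

On the full data the equivalent call would be cor(student_final$reading_marks, student_final$writing_marks).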
Although univariate plots have the benefit of not sacrificing any specific information, when there are many data points it can be difficult to get an overall picture of a variable’s characteristics, and they do not allow a straightforward comparison of one variable with another. Bivariate plots, on the other hand, can be made aesthetically pleasing for the reader. Both kinds of plot shown here warrant further improvement.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Dhal (2022, May 19). Data Analytics and Computational Social Science: HW4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscompdhal27homework4/
BibTeX citation
@misc{dhal2022hw4,
  author = {Dhal, Pragyanta},
  title = {Data Analytics and Computational Social Science: HW4},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscompdhal27homework4/},
  year = {2022}
}