Comprehensive report
Students’ academic performance is affected by several factors, including students’ learning skills, parental background, peer influence, teachers’ quality, and learning infrastructure, among others. Many teachers believe that analyzing student testing data can boost performance, but research suggests otherwise. Analyzing this data set can give us insights into the correlations between the different factors that affect a student’s performance.
library(tidyverse)  # provides read_csv(), the dplyr verbs, gather() and ggplot2
student <- read_csv("./data.csv")
The dataset is loaded using read_csv(). It has 1000 rows and 8 columns: gender, race/ethnicity, parental level of education, lunch, test preparation course, math score, reading score and writing score. Below is a glimpse of the dataset:
str(student)
spec_tbl_df [1,000 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ gender : chr [1:1000] "female" "female" "female" "male" ...
$ race/ethnicity : chr [1:1000] "group B" "group C" "group B" "group A" ...
$ parental level of education: chr [1:1000] "bachelor's degree" "some college" "master's degree" "associate's degree" ...
$ lunch : chr [1:1000] "standard" "standard" "standard" "free/reduced" ...
$ test preparation course : chr [1:1000] "none" "completed" "none" "none" ...
$ math score : num [1:1000] 72 69 90 47 76 71 88 40 64 38 ...
$ reading score : num [1:1000] 72 90 95 57 78 83 95 43 64 60 ...
$ writing score : num [1:1000] 74 88 93 44 75 78 92 39 67 50 ...
- attr(*, "spec")=
.. cols(
.. gender = col_character(),
.. `race/ethnicity` = col_character(),
.. `parental level of education` = col_character(),
.. lunch = col_character(),
.. `test preparation course` = col_character(),
.. `math score` = col_double(),
.. `reading score` = col_double(),
.. `writing score` = col_double()
.. )
- attr(*, "problems")=<externalptr>
Check whether the data contains missing values (NAs):
gender race/ethnicity
0 0
parental level of education lunch
0 0
test preparation course math score
0 0
reading score writing score
0 0
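The code for the missing-value check is not shown above, and the later output uses renamed columns stored as factors. A minimal sketch of how both steps could be done, assuming dplyr is available. The `student_raw` tibble is a hypothetical three-row stand-in for data.csv, the new column names are taken from the str() output further down, and the ordering of the education levels is an assumption based on the later summary output.

```r
library(dplyr)

# Hypothetical stand-in for a few rows of data.csv (original column names).
student_raw <- tibble(
  gender = c("female", "female", "male"),
  `race/ethnicity` = c("group B", "group C", "group A"),
  `parental level of education` = c("bachelor's degree", "some college",
                                    "associate's degree"),
  lunch = c("standard", "standard", "free/reduced"),
  `test preparation course` = c("none", "completed", "none"),
  `math score` = c(72, 69, 47),
  `reading score` = c(72, 90, 57),
  `writing score` = c(74, 88, 44)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(student_raw))

# Assumed ordering of the education levels, lowest to highest.
education_levels <- c("some high school", "high school", "some college",
                      "associate's degree", "bachelor's degree",
                      "master's degree")

student_clean <- student_raw %>%
  # Replace names containing spaces or slashes with syntactic names.
  rename(
    race_ethnicity_group = `race/ethnicity`,
    parent_highest_education = `parental level of education`,
    test_preparation_course = `test preparation course`,
    math_marks = `math score`,
    reading_marks = `reading score`,
    writing_marks = `writing score`
  ) %>%
  # Give parental education an explicit level order.
  mutate(parent_highest_education = factor(parent_highest_education,
                                           levels = education_levels)) %>%
  # Convert the remaining character columns to factors.
  mutate(across(where(is.character), as.factor))
```

Setting the levels explicitly keeps the education categories in a meaningful order in tables and plots instead of the default alphabetical order.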
# Assumes the score columns have already been renamed to math_marks,
# reading_marks and writing_marks.
student$total_marks <- student$math_marks + student$reading_marks + student$writing_marks
student$mean_marks <- round(student$total_marks / 3, 2)

# Assign a letter grade from the mean marks and store it as a factor.
student <- student %>%
  mutate(grade = case_when(
    mean_marks >= 90 & mean_marks <= 100 ~ "A",
    mean_marks >= 80 & mean_marks < 90 ~ "B",
    mean_marks >= 70 & mean_marks < 80 ~ "C",
    mean_marks >= 60 & mean_marks < 70 ~ "D",
    mean_marks >= 50 & mean_marks < 60 ~ "E",
    mean_marks < 50 ~ "F"
  ) %>% as.factor())
Let’s have a look at our data again:
str(student)
tibble [1,000 × 11] (S3: tbl_df/tbl/data.frame)
$ gender : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
$ race_ethnicity_group : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
$ parent_highest_education: Factor w/ 6 levels "some high school",..: 5 3 6 4 3 4 3 3 2 2 ...
$ lunch : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
$ test_preparation_course : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
$ math_marks : num [1:1000] 72 69 90 47 76 71 88 40 64 38 ...
$ reading_marks : num [1:1000] 72 90 95 57 78 83 95 43 64 60 ...
$ writing_marks : num [1:1000] 74 88 93 44 75 78 92 39 67 50 ...
$ total_marks : num [1:1000] 218 247 278 148 229 232 275 122 195 148 ...
$ mean_marks : num [1:1000] 72.7 82.3 92.7 49.3 76.3 ...
$ grade : Factor w/ 6 levels "A","B","C","D",..: 3 2 1 6 3 3 1 6 4 6 ...
summary(student)
gender race_ethnicity_group parent_highest_education
female:518 group A: 89 some high school :179
male :482 group B:190 high school :196
group C:319 some college :226
group D:262 associate's degree:222
group E:140 bachelor's degree :118
master's degree : 59
lunch test_preparation_course math_marks
free/reduced:355 completed:358 Min. : 0.00
standard :645 none :642 1st Qu.: 57.00
Median : 66.00
Mean : 66.09
3rd Qu.: 77.00
Max. :100.00
reading_marks writing_marks total_marks mean_marks
Min. : 17.00 Min. : 10.00 Min. : 27.0 Min. : 9.00
1st Qu.: 59.00 1st Qu.: 57.75 1st Qu.:175.0 1st Qu.: 58.33
Median : 70.00 Median : 69.00 Median :205.0 Median : 68.33
Mean : 69.17 Mean : 68.05 Mean :203.3 Mean : 67.77
3rd Qu.: 79.00 3rd Qu.: 79.00 3rd Qu.:233.0 3rd Qu.: 77.67
Max. :100.00 Max. :100.00 Max. :300.0 Max. :100.00
grade
A: 52
B:146
C:261
D:256
E:182
F:103
In order to understand the data and become familiar with the distribution of each variable in the data set, descriptive statistics are computed for each column.
Mean of the numeric columns:
student %>%
  summarise_if(is.numeric, mean)
# A tibble: 1 × 5
math_marks reading_marks writing_marks total_marks mean_marks
<dbl> <dbl> <dbl> <dbl> <dbl>
1 66.1 69.2 68.1 203. 67.8
Median of the numeric columns:
student %>%
  summarise_if(is.numeric, median)
# A tibble: 1 × 5
math_marks reading_marks writing_marks total_marks mean_marks
<dbl> <dbl> <dbl> <dbl> <dbl>
1 66 70 69 205 68.3
Standard deviation of the numeric columns:
student %>%
  summarise_if(is.numeric, sd)
# A tibble: 1 × 5
math_marks reading_marks writing_marks total_marks mean_marks
<dbl> <dbl> <dbl> <dbl> <dbl>
1 15.2 14.6 15.2 42.8 14.3
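The three summaries above could also be computed in a single call. A minimal sketch, assuming a dplyr version that provides across() (1.0.0 or later); the `marks` tibble is a hypothetical stand-in that holds only the first five rows shown in the str() output, not the full data set.

```r
library(dplyr)

# Hypothetical stand-in: first five rows of the three score columns.
marks <- tibble(
  math_marks = c(72, 69, 90, 47, 76),
  reading_marks = c(72, 90, 95, 57, 78),
  writing_marks = c(74, 88, 93, 44, 75)
)

# Apply mean, median and sd to every numeric column in one pass.
# The result has one column per (column, statistic) pair,
# for example math_marks_mean or reading_marks_sd.
marks_summary <- marks %>%
  summarise(across(
    where(is.numeric),
    list(mean = mean, median = median, sd = sd),
    .names = "{.col}_{.fn}"
  ))
```

This keeps the statistics side by side in one row, at the cost of wider output than the three separate tables.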
Frequency of gender:
# A tibble: 2 × 2
gender count
<fct> <int>
1 female 518
2 male 482
Frequency of ethnicity group of students:
# A tibble: 5 × 2
race_ethnicity_group count
<fct> <int>
1 group A 89
2 group B 190
3 group C 319
4 group D 262
5 group E 140
Frequency of the highest education obtained by parents:
# A tibble: 6 × 2
parent_highest_education count
<fct> <int>
1 some high school 179
2 high school 196
3 some college 226
4 associate's degree 222
5 bachelor's degree 118
6 master's degree 59
Frequency of the type of lunch:
# A tibble: 2 × 2
lunch count
<fct> <int>
1 free/reduced 355
2 standard 645
Frequency of completion of the test preparation course:
# A tibble: 2 × 2
test_preparation_course count
<fct> <int>
1 completed 358
2 none 642
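The code behind the frequency tables is not shown above. One way such tables could be produced is with dplyr's count(), sketched below; the `students_demo` tibble is a hypothetical stand-in built from the first five rows of the str() output.

```r
library(dplyr)

# Hypothetical stand-in: gender and lunch for the first five students.
students_demo <- tibble(
  gender = factor(c("female", "female", "female", "male", "male")),
  lunch = factor(c("standard", "standard", "standard",
                   "free/reduced", "standard"))
)

# count() groups by the given column and tallies the rows per level;
# name = "count" sets the name of the tally column.
gender_freq <- students_demo %>% count(gender, name = "count")
lunch_freq <- students_demo %>% count(lunch, name = "count")
```

The same call works for any of the categorical columns by swapping the column name.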
Calculating mean, median and standard deviation of the marks obtained in maths, grouped by gender:
student %>%
  group_by(gender) %>%
  summarise(gender_count = n(),
            mean_math_marks = mean(math_marks),
            median_math_marks = median(math_marks),
            sd_math_marks = sd(math_marks))
# A tibble: 2 × 5
gender gender_count mean_math_marks median_math_marks sd_math_marks
<fct> <int> <dbl> <dbl> <dbl>
1 female 518 63.6 65 15.5
2 male 482 68.7 69 14.4
Calculating mean, median and standard deviation of the marks obtained in maths, grouped by ethnic group:
student %>%
  group_by(race_ethnicity_group) %>%
  summarise(race_ethnicity_group_count = n(),
            mean_math_marks = mean(math_marks),
            median_math_marks = median(math_marks),
            sd_math_marks = sd(math_marks))
# A tibble: 5 × 5
race_ethnicity_gr… race_ethnicity_… mean_math_marks median_math_mar…
<fct> <int> <dbl> <dbl>
1 group A 89 61.6 61
2 group B 190 63.5 63
3 group C 319 64.5 65
4 group D 262 67.4 69
5 group E 140 73.8 74.5
# … with 1 more variable: sd_math_marks <dbl>
Calculating mean, median and standard deviation of the marks obtained in maths, grouped by gender and completion of the test preparation course:
student %>%
  group_by(gender, test_preparation_course) %>%
  summarise(count_gender = n(),
            mean_math_marks = mean(math_marks),
            median_math_marks = median(math_marks),
            sd_math_marks = sd(math_marks))
# A tibble: 4 × 6
# Groups: gender [2]
gender test_preparation_course count_gender mean_math_marks
<fct> <fct> <int> <dbl>
1 female completed 184 67.2
2 female none 334 61.7
3 male completed 174 72.3
4 male none 308 66.7
# … with 2 more variables: median_math_marks <dbl>,
# sd_math_marks <dbl>
ggplot(student, aes(x = grade, fill = gender)) +
  geom_bar() +
  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_text(stat = "count", aes(label = after_stat(count)), position = position_stack(vjust = 0.5)) +
  labs(title = "Grade distribution", x = "Grades", y = "No. of Students")
Observation - The majority of the students obtained grades C and D, and these grades are spread almost equally across both genders. More female students secured grade A, whereas more male students failed and secured grade F.
ggplot(student,
       aes(x = reading_marks,
           y = writing_marks,
           color = gender)) +
  geom_point()
Observation - Students who performed well in reading also tended to perform well in writing.
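The relationship seen in the scatter plot could also be quantified with a correlation coefficient. A minimal sketch using base R's cor(); the two vectors hold only the first ten values shown in the str() output, so the result describes those ten rows and not the full data set.

```r
# First ten values of each column, copied from the str() output.
reading_marks <- c(72, 90, 95, 57, 78, 83, 95, 43, 64, 60)
writing_marks <- c(74, 88, 93, 44, 75, 78, 92, 39, 67, 50)

# Pearson correlation (the default method of cor()).
# Values close to 1 indicate a strong positive linear relationship.
reading_writing_cor <- cor(reading_marks, writing_marks)
```

On the full data the same call would be cor(student$reading_marks, student$writing_marks).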
student %>%
  group_by(race_ethnicity_group) %>%
  summarize(freq = n(),
            mean = mean(total_marks),
            sd = sd(total_marks),
            se = sd / sqrt(freq)) %>%
  ggplot(aes(x = race_ethnicity_group,
             y = mean,
             color = race_ethnicity_group)) +
  geom_errorbar(aes(ymin = mean - se,
                    ymax = mean + se)) +
  geom_point() +
  labs(title = "Visualizing uncertainty around estimation of total marks by ethnic group",
       y = "mean of total marks")
Observation - Students who belong to ethnic group E performed noticeably better, on average, than students in the other groups.
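The error bars in the plot show one standard error around each group mean, which on its own is not a significance test. A sketch of how the group differences could be checked formally, using a one-way ANOVA followed by Tukey's HSD from base R. The data here are simulated: the group means and the standard deviation passed to rnorm() are arbitrary placeholders, not values from the data set.

```r
set.seed(1)

# Simulated stand-in data: 30 students per group, placeholder means.
groups <- paste("group", LETTERS[1:5])
sim <- data.frame(
  race_ethnicity_group = factor(rep(groups, each = 30)),
  total_marks = rnorm(150,
                      mean = rep(c(190, 195, 200, 205, 220), each = 30),
                      sd = 40)
)

# One-way ANOVA: does mean total_marks differ between the groups?
fit <- aov(total_marks ~ race_ethnicity_group, data = sim)

# Tukey's HSD: pairwise differences with adjusted p-values.
tukey <- TukeyHSD(fit)
pairwise <- tukey$race_ethnicity_group
```

Each row of `pairwise` is one pair of groups, with the estimated difference, its confidence interval and the adjusted p-value.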
ggplot(student, aes(x = race_ethnicity_group, y = mean_marks, fill = test_preparation_course)) +
  geom_col(position = "dodge") +
  facet_wrap(~lunch) +
  labs(title = "Scores by Ethnic Background for Free/Reduced and Standard Lunch",
       x = "Ethnic Background",
       y = "Average Score") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
Observation - The majority of students who took the test preparation course performed better than those who did not. However, for students who took standard lunch and belong to groups D and E, the test preparation course didn’t make any significant difference.
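One caveat about this kind of chart: geom_col() draws one bar per row, so when it receives one row per student, the bars within each dodge position are drawn on top of each other and the visible height is the largest mean_marks in that group, not the average. A sketch of an alternative that aggregates first, assuming dplyr and ggplot2; the `demo` tibble and the `avg_mean_marks` column are hypothetical and only illustrate the shape of the data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in: two students per group and course status.
demo <- tibble(
  race_ethnicity_group = rep(c("group A", "group B"), each = 4),
  lunch = "standard",
  test_preparation_course = rep(c("completed", "completed", "none", "none"),
                                times = 2),
  mean_marks = c(60, 80, 50, 70, 90, 70, 65, 75)
)

# Aggregate to one row per bar before plotting.
avg_marks <- demo %>%
  group_by(race_ethnicity_group, lunch, test_preparation_course) %>%
  summarise(avg_mean_marks = mean(mean_marks), .groups = "drop")

# Each bar now shows the group average.
p <- ggplot(avg_marks, aes(x = race_ethnicity_group, y = avg_mean_marks,
                           fill = test_preparation_course)) +
  geom_col(position = "dodge") +
  facet_wrap(~lunch) +
  labs(x = "Ethnic Background", y = "Average Score")
```

With the aggregated data the y axis label "Average Score" matches what the bars show.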
student %>%
  select(gender, math_marks, reading_marks, writing_marks) %>%
  gather(key, value, -gender) %>%
  ggplot(aes(x = gender, y = value, fill = gender)) +
  geom_boxplot() +
  facet_grid(~key) +
  labs(title = "Marks by Gender", x = "Gender", y = "Marks")
Observation - Female students generally performed better in reading and writing, whereas male students generally performed better in maths.
student %>%
  select(parent_highest_education, math_marks, reading_marks, writing_marks) %>%
  gather(key, value, -parent_highest_education) %>%
  ggplot(aes(x = parent_highest_education, y = value, fill = parent_highest_education)) +
  geom_boxplot() +
  facet_grid(~key) +
  labs(title = "Marks distribution as per parent highest education level",
       x = "parent_highest_education", y = "Marks") +
  theme(panel.spacing = unit(1, "lines")) +
  coord_flip()
Observation - The plot shows that students whose parents’ highest education level is a master’s degree performed better in reading, writing and maths.
student %>%
  select(lunch, math_marks, reading_marks, writing_marks) %>%
  gather(key, value, -lunch) %>%
  ggplot(aes(x = lunch, y = value, fill = lunch)) +
  geom_boxplot() +
  facet_grid(~key) +
  labs(title = "Marks distribution of students according to their lunch type",
       x = "lunch", y = "Marks") +
  coord_flip()
Observation - The plot above shows that students with the ‘standard’ lunch type perform better than students who receive free/reduced lunch. If free/reduced lunch is taken as an indicator of a lower-middle-class or low-income family, this suggests that students from such families score lower than their counterparts.
student %>%
  group_by(lunch, parent_highest_education) %>%
  summarize(freq = n(),
            mean = mean(total_marks),
            sd = sd(total_marks),
            se = sd / sqrt(freq)) %>%
  ggplot(aes(x = lunch,
             y = mean,
             color = lunch)) +
  geom_errorbar(aes(ymin = mean - se,
                    ymax = mean + se)) +
  geom_point() +
  labs(title = "Uncertainty around estimation of total marks by lunch and parental education",
       y = "mean of total marks") +
  facet_grid(cols = vars(parent_highest_education)) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1),
        strip.text = element_text(size = 7))
Observation - The plot shows that students who receive free/reduced lunch and whose parents’ education level is “some high school” scored very low marks, whereas students who receive standard lunch and whose parents have a master’s degree scored the highest marks.
student %>%
  group_by(parent_highest_education, test_preparation_course) %>%
  summarize(freq = n(),
            mean = mean(total_marks),
            sd = sd(total_marks),
            se = sd / sqrt(freq)) %>%
  ggplot(aes(x = parent_highest_education, y = mean, fill = test_preparation_course)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(title = "Total Marks for different parental education and Test Course",
       x = "Parental Education", y = "Average Score") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
Observation - As the parents’ highest educational level goes down from master’s degree to high school, the gap between the marks obtained by students who completed the test preparation course and those who did not increases.
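The gap described in this observation could also be computed directly instead of being read off the bars. A minimal sketch, assuming dplyr and tidyr; the `demo` tibble holds hypothetical marks chosen only for illustration, and the `gap` column is a name introduced here.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in: two students per education level and course status.
demo <- tibble(
  parent_highest_education = rep(c("high school", "master's degree"), each = 4),
  test_preparation_course = rep(c("completed", "completed", "none", "none"),
                                times = 2),
  total_marks = c(210, 230, 180, 200, 240, 250, 230, 240)
)

gap_by_education <- demo %>%
  group_by(parent_highest_education, test_preparation_course) %>%
  summarise(mean_total = mean(total_marks), .groups = "drop") %>%
  # One column per course status, so the two means sit side by side.
  pivot_wider(names_from = test_preparation_course, values_from = mean_total) %>%
  # Gap between students who completed the course and those who did not.
  mutate(gap = completed - none)
```

Sorting the result by `gap` would show at a glance for which education levels the course is associated with the largest difference.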
Rigorous data analysis techniques were deployed on the dataset and the conclusions drawn are -
Throughout the project, I faced numerous challenges. They seemed difficult at first but eventually became easy to grasp, thanks to the tutorials provided. I picked a dataset that I, as a student, felt closely related to, so this dataset on the performance of students across exams seemed like a natural fit. It didn’t take much time to familiarize myself with the columns of the dataset, although I found a few column names ambiguous. Starting with the data wrangling stage, I was able to apply most of the techniques, including checking for missing (NA) values, renaming ambiguous columns and adding new columns along the way for better analysis.
Afterwards, I applied descriptive statistical measures to the continuous and categorical columns. This gave me a rough idea of the performance of the students, which led to a few potential research questions. After this stage, I selected a few continuous and categorical columns and performed univariate and bivariate analysis. I then refined these analyses and proceeded with multivariate analysis by applying techniques like grouping and faceting. At this point, I was able to answer almost all my research questions. I also realized that there are a few questions I would have liked to analyze but could not answer because the dataset lacks the necessary information. For example, I wanted to analyze the number of hours spent on the test preparation course so that I could examine the relationship between the number of hours dedicated by students and the marks they received.
Lastly, I compiled all my work and formatted it so that it appears aesthetically attractive and user-friendly. Overall, it was a memorable journey filled with a lot of challenges as well as learnings along the way. One of the most challenging tasks was to shortlist the key features of the dataset and present them in a user-friendly way, so that a reader does not feel overwhelmed by an abundance of information.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Dhal (2022, May 19). Data Analytics and Computational Social Science: Final Report. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscompdhal27finalreport/
BibTeX citation
@misc{dhal2022final,
  author = {Dhal, Pragyanta},
  title = {Data Analytics and Computational Social Science: Final Report},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscompdhal27finalreport/},
  year = {2022}
}