Data Analytics and Computational Social Science: HW6

Pragyanta Dhal

Introduction:

Students’ academic performance is affected by several factors, including learning skills, parental background, peer influence, teacher quality, and learning infrastructure. Many teachers believe that analyzing student testing data can boost performance, but research suggests otherwise. Analyzing this data set can give us insights into the correlations between the different factors that affect a student’s performance.

Load libraries

library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)

Read CSV data

student <- read_csv("./data.csv")

The dataset is loaded using read_csv(). It has 1000 rows and 8 columns: gender, race/ethnicity, parental level of education, lunch, test preparation course, math score, reading score, and writing score. Below is a glimpse of the dataset:

str(student)

spec_tbl_df [1,000 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ gender                     : chr [1:1000] "female" "female" "female" "male" ...
 $ race/ethnicity             : chr [1:1000] "group B" "group C" "group B" "group A" ...
 $ parental level of education: chr [1:1000] "bachelor's degree" "some college" "master's degree" "associate's degree" ...
 $ lunch                      : chr [1:1000] "standard" "standard" "standard" "free/reduced" ...
 $ test preparation course    : chr [1:1000] "none" "completed" "none" "none" ...
 $ math score                 : num [1:1000] 72 69 90 47 76 71 88 40 64 38 ...
 $ reading score              : num [1:1000] 72 90 95 57 78 83 95 43 64 60 ...
 $ writing score              : num [1:1000] 74 88 93 44 75 78 92 39 67 50 ...
 - attr(*, "spec")=
  .. cols(
  ..   gender = col_character(),
  ..   `race/ethnicity` = col_character(),
  ..   `parental level of education` = col_character(),
  ..   lunch = col_character(),
  ..   `test preparation course` = col_character(),
  ..   `math score` = col_double(),
  ..   `reading score` = col_double(),
  ..   `writing score` = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Data wrangling and cleaning

Check if the data contains missing values or NAs

sapply(student, function(x) sum(is.na(x)))

                     gender              race/ethnicity 
                          0                           0 
parental level of education                       lunch 
                          0                           0 
    test preparation course                  math score 
                          0                           0 
              reading score               writing score 
                          0                           0

Renaming the columns for better understanding

colnames(student)[2] <- "race_ethnicity_group"
colnames(student)[3] <- "parent_highest_education"
colnames(student)[5] <- "test_preparation_course"
colnames(student)[6] <- "math_marks"
colnames(student)[7] <- "reading_marks"
colnames(student)[8] <- "writing_marks"
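Renaming by position works here, but it silently renames the wrong column if the column order in the CSV ever changes. A minimal sketch of a name-based alternative with dplyr's rename() follows; the toy data frame and its values are made up for illustration and only mimic three of the original columns.

```r
library(dplyr)

# Toy data frame with some of the original column names (values are made up).
# check.names = FALSE keeps names such as "race/ethnicity" unchanged.
toy <- data.frame(
  gender = c("female", "male"),
  `race/ethnicity` = c("group B", "group A"),
  `math score` = c(72, 47),
  check.names = FALSE
)

# rename(new_name = old_name): columns are matched by name, not by position
toy <- toy %>%
  rename(
    race_ethnicity_group = `race/ethnicity`,
    math_marks = `math score`
  )

names(toy)
```

Because each column is matched by its old name, the call fails with an error instead of renaming the wrong column if a name is missing.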

Converting a few columns from character to factor

student$gender <- as.factor(student$gender)
student$race_ethnicity_group <- as.factor(student$race_ethnicity_group)
student$lunch <- as.factor(student$lunch)
student$test_preparation_course <- as.factor(student$test_preparation_course)

Creating new columns for the total and average score

student$total_marks <- student$math_marks + student$reading_marks + student$writing_marks
student$mean_marks <- round(student$total_marks / 3, 2)

Allotting grades based on the average score

student <- student %>% 
     mutate(grade = case_when(
         mean_marks >= 90 & mean_marks <= 100 ~ "A",
         mean_marks >= 80 & mean_marks < 90 ~ "B",
         mean_marks >= 70 & mean_marks < 80 ~ "C",
         mean_marks >= 60 & mean_marks < 70  ~ "D",
         mean_marks >= 50 & mean_marks < 60  ~ "E",
         mean_marks < 50 ~ "F"
     )%>% as.factor()
     )
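The same binning can also be sketched with base R's cut(), which takes the grade boundaries as a single vector of breaks. This is an alternative, not the method used above. The averages below are made up and include the boundary values 60 and 100.

```r
# Made-up averages, including the boundary values 60 and 100
mean_marks <- c(95, 85, 49.33, 60, 100, 0)

grade <- cut(
  mean_marks,
  breaks = c(-Inf, 50, 60, 70, 80, 90, Inf),
  labels = c("F", "E", "D", "C", "B", "A"),
  right = FALSE  # intervals are [lower, upper), so 60 falls in D, not E
)

as.character(grade)
```

One difference to keep in mind: cut() orders the factor levels from F to A (the order of the breaks), whereas as.factor() on the case_when() result orders them alphabetically from A to F.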

Defining the order of levels in parent’s highest education

student$parent_highest_education <- 
     student$parent_highest_education %>%
     factor(levels = c("some high school","high school", "some college" ,
                       "associate's degree","bachelor's degree", "master's degree")
     )

Let’s have a look at our data again:

str(student)

tibble [1,000 × 11] (S3: tbl_df/tbl/data.frame)
 $ gender                  : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
 $ race_ethnicity_group    : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
 $ parent_highest_education: Factor w/ 6 levels "some high school",..: 5 3 6 4 3 4 3 3 2 2 ...
 $ lunch                   : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
 $ test_preparation_course : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
 $ math_marks              : num [1:1000] 72 69 90 47 76 71 88 40 64 38 ...
 $ reading_marks           : num [1:1000] 72 90 95 57 78 83 95 43 64 60 ...
 $ writing_marks           : num [1:1000] 74 88 93 44 75 78 92 39 67 50 ...
 $ total_marks             : num [1:1000] 218 247 278 148 229 232 275 122 195 148 ...
 $ mean_marks              : num [1:1000] 72.7 82.3 92.7 49.3 76.3 ...
 $ grade                   : Factor w/ 6 levels "A","B","C","D",..: 3 2 1 6 3 3 1 6 4 6 ...

summary(student)

    gender    race_ethnicity_group       parent_highest_education
 female:518   group A: 89          some high school  :179        
 male  :482   group B:190          high school       :196        
              group C:319          some college      :226        
              group D:262          associate's degree:222        
              group E:140          bachelor's degree :118        
                                   master's degree   : 59        
          lunch     test_preparation_course   math_marks    
 free/reduced:355   completed:358           Min.   :  0.00  
 standard    :645   none     :642           1st Qu.: 57.00  
                                            Median : 66.00  
                                            Mean   : 66.09  
                                            3rd Qu.: 77.00  
                                            Max.   :100.00  
 reading_marks    writing_marks     total_marks      mean_marks    
 Min.   : 17.00   Min.   : 10.00   Min.   : 27.0   Min.   :  9.00  
 1st Qu.: 59.00   1st Qu.: 57.75   1st Qu.:175.0   1st Qu.: 58.33  
 Median : 70.00   Median : 69.00   Median :205.0   Median : 68.33  
 Mean   : 69.17   Mean   : 68.05   Mean   :203.3   Mean   : 67.77  
 3rd Qu.: 79.00   3rd Qu.: 79.00   3rd Qu.:233.0   3rd Qu.: 77.67  
 Max.   :100.00   Max.   :100.00   Max.   :300.0   Max.   :100.00  
 grade  
 A: 52  
 B:146  
 C:261  
 D:256  
 E:182  
 F:103

Analyzing data before visualizations

To understand the summary statistics of the data and become familiar with the distribution of each variable in the data set, I first compute some descriptive statistics.

Descriptive statistics of numerical columns

Mean of the numeric columns:

student %>% 
    summarise_if(is.numeric, mean)

# A tibble: 1 × 5
  math_marks reading_marks writing_marks total_marks mean_marks
       <dbl>         <dbl>         <dbl>       <dbl>      <dbl>
1       66.1          69.2          68.1        203.       67.8

Median of the numeric columns:

student %>% 
    summarise_if(is.numeric, median)

# A tibble: 1 × 5
  math_marks reading_marks writing_marks total_marks mean_marks
       <dbl>         <dbl>         <dbl>       <dbl>      <dbl>
1         66            70            69         205       68.3

Standard deviation of the numeric columns:

student %>% 
    summarise_if(is.numeric, sd)

# A tibble: 1 × 5
  math_marks reading_marks writing_marks total_marks mean_marks
       <dbl>         <dbl>         <dbl>       <dbl>      <dbl>
1       15.2          14.6          15.2        42.8       14.3
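The three summaries above can also be computed in one call with across(), which applies a named list of functions to every selected column. The sketch below uses a small made-up data frame, not the student data.

```r
library(dplyr)

# Small made-up data frame for illustration
toy <- data.frame(
  gender = c("female", "female", "male", "male"),
  math_marks = c(72, 69, 47, 76),
  reading_marks = c(72, 90, 57, 78)
)

# One output column per (input column, function) pair,
# named <column>_<function>, e.g. math_marks_mean
stats <- toy %>%
  summarise(across(where(is.numeric),
                   list(mean = mean, median = median, sd = sd)))

names(stats)
```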

Descriptive statistics of categorical columns

Frequency of gender

student %>%
     group_by(gender) %>%
     summarise(count = n())

# A tibble: 2 × 2
  gender count
  <fct>  <int>
1 female   518
2 male     482

Frequency of ethnicity group of students:

student %>%
     group_by(race_ethnicity_group) %>%
     summarise(count = n())

# A tibble: 5 × 2
  race_ethnicity_group count
  <fct>                <int>
1 group A                 89
2 group B                190
3 group C                319
4 group D                262
5 group E                140

Frequency of the highest education obtained by parents:

student %>%
     group_by(parent_highest_education) %>%
     summarise(count = n())

# A tibble: 6 × 2
  parent_highest_education count
  <fct>                    <int>
1 some high school           179
2 high school                196
3 some college               226
4 associate's degree         222
5 bachelor's degree          118
6 master's degree             59

Frequency of the type of lunch:

student %>%
     group_by(lunch) %>%
     summarise(count = n())

# A tibble: 2 × 2
  lunch        count
  <fct>        <int>
1 free/reduced   355
2 standard       645

Frequency of completion of the test preparation course:

student %>%
     group_by(test_preparation_course) %>%
     summarise(count = n())

# A tibble: 2 × 2
  test_preparation_course count
  <fct>                   <int>
1 completed                 358
2 none                      642

Descriptive statistics of relevant grouping

Calculating the mean, median and standard deviation of the marks obtained in math, grouped by gender:

student %>%
     group_by(gender) %>%
     summarise(gender_count = n(), mean_math_marks = mean(math_marks), median_math_marks = median(math_marks), sd_math_marks = sd(math_marks))

# A tibble: 2 × 5
  gender gender_count mean_math_marks median_math_marks sd_math_marks
  <fct>         <int>           <dbl>             <dbl>         <dbl>
1 female          518            63.6                65          15.5
2 male            482            68.7                69          14.4

Calculating the mean, median and standard deviation of the marks obtained in math, grouped by ethnic group:

student %>%
     group_by(race_ethnicity_group) %>%
     summarise(race_ethnicity_group_count = n(), mean_math_marks = mean(math_marks), median_math_marks = median(math_marks), sd_math_marks = sd(math_marks))

# A tibble: 5 × 5
  race_ethnicity_gr… race_ethnicity_… mean_math_marks median_math_mar…
  <fct>                         <int>           <dbl>            <dbl>
1 group A                          89            61.6             61  
2 group B                         190            63.5             63  
3 group C                         319            64.5             65  
4 group D                         262            67.4             69  
5 group E                         140            73.8             74.5
# … with 1 more variable: sd_math_marks <dbl>

Calculating the mean, median and standard deviation of the marks obtained in math, grouped by gender and completion of the test preparation course:

student %>%
  group_by(gender, test_preparation_course) %>%
  summarise(count_gender= n(), mean_math_marks = mean(math_marks), median_math_marks = median(math_marks), sd_math_marks = sd(math_marks))

# A tibble: 4 × 6
# Groups:   gender [2]
  gender test_preparation_course count_gender mean_math_marks
  <fct>  <fct>                          <int>           <dbl>
1 female completed                        184            67.2
2 female none                             334            61.7
3 male   completed                        174            72.3
4 male   none                             308            66.7
# … with 2 more variables: median_math_marks <dbl>,
#   sd_math_marks <dbl>

Moving on to Visualizations

1. Overall grade distribution

ggplot(student, aes( x= grade, fill = gender)) + 
    geom_bar() + 
    geom_text(stat = "count", aes(label = after_stat(count)), position = position_stack(vjust = 0.5)) +
    labs(title = "Grade distribution", x = "Grades", y = "Number of students")

Observation: Most students obtained grades C and D, and these grades are split almost equally between male and female students. More female students secured grade A, whereas more male students secured grade F, the failing grade.

2. Reading Marks vs Writing Marks

ggplot(student, 
        aes(x = reading_marks, 
            y = writing_marks, color=gender)) +
     geom_point()

Observation: Students who performed well in reading also tended to perform well in writing.
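The strength of this relationship can be quantified with the Pearson correlation coefficient. As a minimal sketch, the snippet below uses only the first ten reading and writing values shown in the str() output earlier, so the result describes those ten rows and not the full data set.

```r
# First ten values of reading_marks and writing_marks from the str() output
reading_marks <- c(72, 90, 95, 57, 78, 83, 95, 43, 64, 60)
writing_marks <- c(74, 88, 93, 44, 75, 78, 92, 39, 67, 50)

# Pearson correlation: values close to 1 indicate a strong positive linear relationship
r <- cor(reading_marks, writing_marks)
r
```

On the full data the equivalent call would be cor(student$reading_marks, student$writing_marks).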

3. Plotting uncertainty in the total score estimate by ethnic group

student %>%
     group_by(race_ethnicity_group) %>%
     summarize(freq = n(),
               mean = mean(total_marks),
               sd = sd(total_marks),
               se = sd / sqrt(freq)) %>%
     ggplot(aes(x = race_ethnicity_group, 
                y = mean,
                color = race_ethnicity_group)) +
     geom_errorbar(aes(ymin = mean - se, 
                       ymax = mean + se)) +
     geom_point() + labs(title = "Visualizing uncertainty around estimation of total marks by ethnic group", y = "mean of total marks")

Observation: Students in ethnic group E had a noticeably higher mean total score than students in the other groups. The error bars show the mean plus or minus one standard error.
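The error bars in this plot span one standard error on either side of the mean. A rough sketch of how the standard error relates to an approximate 95% confidence interval is given below. It assumes a t-based interval and uses only the first ten total_marks values from the str() output, so the numbers are illustrative and do not describe any ethnic group.

```r
# First ten total_marks values from the str() output (illustrative sample only)
total_marks <- c(218, 247, 278, 148, 229, 232, 275, 122, 195, 148)

n  <- length(total_marks)
m  <- mean(total_marks)
se <- sd(total_marks) / sqrt(n)   # standard error of the mean

# Critical value of the t distribution for a two-sided 95% interval
t_crit <- qt(0.975, df = n - 1)

ci <- c(lower = m - t_crit * se, upper = m + t_crit * se)
ci
```

Since the critical value is larger than 1, such an interval is wider than the mean plus or minus one standard error drawn above.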

4. Scores for Free/Reduced and Standard Lunch by Ethnic Background

 ggplot(student, aes(x= race_ethnicity_group, y = mean_marks, fill = test_preparation_course)) +
     geom_col(position = "dodge") + 
     facet_wrap(~lunch)+
     labs(title="Scores by Ethnic Background for Free/Reduced and Standard Lunch", 
          x ="Ethnic Background", 
          y ="Average Score") +
     theme(axis.text.x = element_text(angle = 60, hjust = 1))

Observation: In most groups, students who took the test preparation course performed better than those who did not. However, for students with standard lunch in groups D and E, the test preparation course made no noticeable difference.
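One caveat about this plot: geom_col() draws one bar per row of the data, so with one row per student the bars in each dodged slot appear to be drawn on top of each other, and the visible height would then be the highest individual score, not the group average. A hedged sketch of an alternative that computes the group averages first is shown below. The toy data frame and the avg_marks column name are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up data: three students in each of two groups
toy <- data.frame(
  race_ethnicity_group = c("group A", "group A", "group A",
                           "group B", "group B", "group B"),
  lunch = "standard",
  test_preparation_course = c("completed", "none", "none",
                              "completed", "completed", "none"),
  mean_marks = c(80, 60, 70, 90, 70, 50)
)

# Aggregate to one row per bar before plotting
group_means <- toy %>%
  group_by(race_ethnicity_group, lunch, test_preparation_course) %>%
  summarise(avg_marks = mean(mean_marks), .groups = "drop")

p <- ggplot(group_means,
            aes(x = race_ethnicity_group, y = avg_marks,
                fill = test_preparation_course)) +
  geom_col(position = "dodge") +
  facet_wrap(~lunch) +
  labs(x = "Ethnic Background", y = "Average Score")
```

With the data aggregated this way, each bar height is the mean of mean_marks for that combination of group, lunch type and course completion.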

5. All Marks distribution across gender

student%>% 
     select(gender, math_marks, reading_marks, writing_marks)%>%
     gather(key, value, -gender)%>%
     ggplot( aes(x=gender, y = value , fill = gender )) +
     geom_boxplot()+ 
     facet_grid(~key)+
     labs(title ="Marks by Gender", x= "Gender", y ="Marks")

Observation: Female students generally performed better in reading and writing, whereas male students generally performed better in math.
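The reshaping step in this plot uses gather(), which still works but is superseded in tidyr by pivot_longer(). A minimal sketch of the equivalent call on a made-up two-row data frame:

```r
library(tidyr)

# Made-up data frame with the same column layout as the selection above
toy <- data.frame(
  gender = c("female", "male"),
  math_marks = c(72, 47),
  reading_marks = c(72, 57),
  writing_marks = c(74, 44)
)

# One row per (student, subject): the subject name goes to "key",
# the score goes to "value", matching the names used in the plot
long <- pivot_longer(toy, cols = -gender,
                     names_to = "key", values_to = "value")
long
```

The result has the same key and value columns as the gather() output, so the rest of the ggplot code would not need to change.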

6. Marks distribution and parental education level

 student%>% 
     select(parent_highest_education, math_marks, reading_marks, writing_marks)%>%
     gather(key, value, -parent_highest_education)%>%
     ggplot( aes(x=parent_highest_education, y = value , fill = parent_highest_education )) +
     geom_boxplot()+ 
     facet_grid(~key)+
     labs(title ="Marks distribution as per parent highest education level", x= "parent_highest_education", y ="Marks") +
     theme(panel.spacing = unit(1, "lines")) +
     coord_flip()

Observation: The plot shows that students whose parents’ highest education level is a master’s degree performed better in reading, writing and math.

Things missing from the final report and future work

I intend to generate and explore further visualizations to understand the reasons behind the differences in students’ test scores. I will also format the final report properly.
