HW3

Data Wrangling

Pragyanta Dhal
2022-05-11

Libraries
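The library chunk is not shown in the rendered post. The code below calls read_csv(), %>%, mutate() and case_when(), so a minimal setup might look like the following. This is an assumption based on those calls, not the author's original chunk, and it presumes the readr and dplyr packages are installed.

```r
# Assumed setup: the original library chunk is not shown in the post.
library(readr)  # read_csv()
library(dplyr)  # %>%, mutate(), case_when()
```

Loading library(tidyverse) instead would attach both packages at once.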

Data Loading & cleaning

student <- read_csv("data.csv")
str(student)
spec_tbl_df [1,000 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ gender                     : chr [1:1000] "female" "female" "female" "male" ...
 $ race/ethnicity             : chr [1:1000] "group B" "group C" "group B" "group A" ...
 $ parental level of education: chr [1:1000] "bachelor's degree" "some college" "master's degree" "associate's degree" ...
 $ lunch                      : chr [1:1000] "standard" "standard" "standard" "free/reduced" ...
 $ test preparation course    : chr [1:1000] "none" "completed" "none" "none" ...
 $ math score                 : num [1:1000] 72 69 90 47 76 71 88 40 64 38 ...
 $ reading score              : num [1:1000] 72 90 95 57 78 83 95 43 64 60 ...
 $ writing score              : num [1:1000] 74 88 93 44 75 78 92 39 67 50 ...
 - attr(*, "spec")=
  .. cols(
  ..   gender = col_character(),
  ..   `race/ethnicity` = col_character(),
  ..   `parental level of education` = col_character(),
  ..   lunch = col_character(),
  ..   `test preparation course` = col_character(),
  ..   `math score` = col_double(),
  ..   `reading score` = col_double(),
  ..   `writing score` = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Check if there are any “NA” values

sapply(student, function(x) sum(is.na(x)))
                     gender              race/ethnicity 
                          0                           0 
parental level of education                       lunch 
                          0                           0 
    test preparation course                  math score 
                          0                           0 
              reading score               writing score 
                          0                           0 

Every column reports zero missing values, so no rows need to be dropped or imputed.
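A shorter way to run the same check is sketched below with base R. The data frame `scores` is a hypothetical four-row sample: the values come from the str() output above, with one NA added by hand purely to show what a column with missing data would report.

```r
# Hypothetical sample: values taken from the str() output,
# with one NA inserted by hand for illustration.
scores <- data.frame(
  math_score    = c(72, 69, 90, 47),
  reading_score = c(72, 90, NA, 57)
)

na_counts <- colSums(is.na(scores))  # number of missing values per column
has_na    <- anyNA(scores)           # TRUE if any cell is missing

print(na_counts)
print(has_na)
```

colSums(is.na(x)) gives the same per-column counts as the sapply() call, and anyNA() answers the yes/no question in one step.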

Renaming the columns

colnames(student)[2] <- "race_ethnicity_group"
colnames(student)[3] <- "parent_highest_education"
colnames(student)[5] <- "test_preparation_course"
colnames(student)[6] <- "math_marks"
colnames(student)[7] <- "reading_marks"
colnames(student)[8] <- "writing_marks"
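Renaming by position works only as long as the column order never changes. As a sketch of a more defensive alternative, dplyr's rename() maps each new name to the old one explicitly. The tibble `raw` below is a hypothetical two-row sample that reuses three of the original column names.

```r
library(dplyr)

# Hypothetical sample carrying three of the original column names.
raw <- tibble(
  `race/ethnicity`              = c("group B", "group C"),
  `parental level of education` = c("bachelor's degree", "some college"),
  `math score`                  = c(72, 69)
)

# rename() uses new_name = old_name; backticks are needed for names
# that contain spaces or slashes.
renamed <- raw %>%
  rename(
    race_ethnicity_group     = `race/ethnicity`,
    parent_highest_education = `parental level of education`,
    math_marks               = `math score`
  )

print(names(renamed))
```

If a source column is missing or misspelled, rename() stops with an error instead of silently renaming the wrong column.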

Converting a few columns from character to factor

student$gender <- as.factor(student$gender)
student$race_ethnicity_group <- as.factor(student$race_ethnicity_group)
student$lunch <- as.factor(student$lunch)
student$test_preparation_course <- as.factor(student$test_preparation_course)
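The four conversions can also be written as a single step. The sketch below assumes dplyr is loaded and uses a hypothetical sample `demo` with only two of the categorical columns.

```r
library(dplyr)

# Hypothetical sample with two categorical columns and one numeric column.
demo <- tibble(
  gender     = c("female", "female", "female", "male"),
  lunch      = c("standard", "standard", "standard", "free/reduced"),
  math_marks = c(72, 69, 90, 47)
)

# across() applies as.factor to each listed column and leaves the rest alone.
demo <- demo %>%
  mutate(across(c(gender, lunch), as.factor))

str(demo)
```

On the full dataset the column list would be c(gender, race_ethnicity_group, lunch, test_preparation_course).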

Creating two new columns: total score and average score

student$total_marks <- student$math_marks + student$reading_marks + student$writing_marks
student$mean_marks <- round(student$total_marks / 3, 2)
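As a worked check of the arithmetic, the sketch below recomputes both columns for the first four students shown in the str() output, using base R's rowSums() and rowMeans(). The data frame `marks` and the vector `score_cols` are names introduced here for illustration.

```r
# First four rows of the three score columns, as shown in the str() output.
marks <- data.frame(
  math_marks    = c(72, 69, 90, 47),
  reading_marks = c(72, 90, 95, 57),
  writing_marks = c(74, 88, 93, 44)
)

score_cols <- c("math_marks", "reading_marks", "writing_marks")

# Row-wise total and mean over the three score columns only.
marks$total_marks <- rowSums(marks[, score_cols])
marks$mean_marks  <- round(rowMeans(marks[, score_cols]), 2)

print(marks)
```

The totals come out as 218, 247, 278 and 148, which agrees with the total_marks values in the later str() output.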

Allotting grades based on the average score

student <- student %>% 
     mutate(grade = case_when(
         mean_marks >= 90 & mean_marks <= 100 ~ "A",
         mean_marks >= 80 & mean_marks < 90 ~ "B",
         mean_marks >= 70 & mean_marks < 80 ~ "C",
         mean_marks >= 60 & mean_marks < 70  ~ "D",
         mean_marks >= 50 & mean_marks < 60  ~ "E",
         mean_marks < 50 ~ "F"
     )%>% as.factor()
     )
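The same banding can be sketched with base R's cut(), which turns the six conditions into one vector of break points. This is an alternative, not the method used above, and it differs in two ways: the factor levels are ordered F to A rather than alphabetically, and a value above 100 would be graded "A" instead of NA. The sample values are the first four rounded means plus the boundary cases 50 and 100.

```r
# First four mean_marks values plus two boundary cases.
mean_marks <- c(72.67, 82.33, 92.67, 49.33, 50, 100)

grade <- cut(
  mean_marks,
  breaks = c(-Inf, 50, 60, 70, 80, 90, Inf),
  labels = c("F", "E", "D", "C", "B", "A"),
  right  = FALSE  # intervals are [lower, upper), matching the >= and < rules
)

print(data.frame(mean_marks, grade))
```

With right = FALSE a mean of exactly 50 falls in band "E" and exactly 90 falls in band "A", the same as in the case_when() version.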

Defining the order of levels in parent’s highest education

student$parent_highest_education <- 
     student$parent_highest_education %>%
     factor(levels = c("some high school","high school", "some college" ,
                       "associate's degree","bachelor's degree", "master's degree")
     )
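factor() with an explicit levels argument fixes the display order, but the result is still an unordered factor. If comparisons such as "at least an associate's degree" are needed later, one option is to add ordered = TRUE, as in this sketch. The vector `edu` holds the first four values from the str() output, and `at_least_associate` is a name introduced here.

```r
edu_levels <- c("some high school", "high school", "some college",
                "associate's degree", "bachelor's degree", "master's degree")

# First four values of the column, as shown in the str() output.
edu <- factor(
  c("bachelor's degree", "some college", "master's degree", "associate's degree"),
  levels  = edu_levels,
  ordered = TRUE
)

# Ordered factors support <, <=, > and >= against a level name.
at_least_associate <- edu >= "associate's degree"
print(at_least_associate)
```

Note that some modelling functions treat ordered factors differently from unordered ones (for example, by using polynomial contrasts), so this choice is worth making deliberately.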

Let's look at the data again:

str(student)
tibble [1,000 × 11] (S3: tbl_df/tbl/data.frame)
 $ gender                  : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
 $ race_ethnicity_group    : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
 $ parent_highest_education: Factor w/ 6 levels "some high school",..: 5 3 6 4 3 4 3 3 2 2 ...
 $ lunch                   : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
 $ test_preparation_course : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
 $ math_marks              : num [1:1000] 72 69 90 47 76 71 88 40 64 38 ...
 $ reading_marks           : num [1:1000] 72 90 95 57 78 83 95 43 64 60 ...
 $ writing_marks           : num [1:1000] 74 88 93 44 75 78 92 39 67 50 ...
 $ total_marks             : num [1:1000] 218 247 278 148 229 232 275 122 195 148 ...
 $ mean_marks              : num [1:1000] 72.7 82.3 92.7 49.3 76.3 ...
 $ grade                   : Factor w/ 6 levels "A","B","C","D",..: 3 2 1 6 3 3 1 6 4 6 ...
summary(student)
    gender    race_ethnicity_group       parent_highest_education
 female:518   group A: 89          some high school  :179        
 male  :482   group B:190          high school       :196        
              group C:319          some college      :226        
              group D:262          associate's degree:222        
              group E:140          bachelor's degree :118        
                                   master's degree   : 59        
          lunch     test_preparation_course   math_marks    
 free/reduced:355   completed:358           Min.   :  0.00  
 standard    :645   none     :642           1st Qu.: 57.00  
                                            Median : 66.00  
                                            Mean   : 66.09  
                                            3rd Qu.: 77.00  
                                            Max.   :100.00  
 reading_marks    writing_marks     total_marks      mean_marks    
 Min.   : 17.00   Min.   : 10.00   Min.   : 27.0   Min.   :  9.00  
 1st Qu.: 59.00   1st Qu.: 57.75   1st Qu.:175.0   1st Qu.: 58.33  
 Median : 70.00   Median : 69.00   Median :205.0   Median : 68.33  
 Mean   : 69.17   Mean   : 68.05   Mean   :203.3   Mean   : 67.77  
 3rd Qu.: 79.00   3rd Qu.: 79.00   3rd Qu.:233.0   3rd Qu.: 77.67  
 Max.   :100.00   Max.   :100.00   Max.   :300.0   Max.   :100.00  
 grade  
 A: 52  
 B:146  
 C:261  
 D:256  
 E:182  
 F:103  
write.csv(student, "./student_final_data.csv")
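By default write.csv() also writes the row names as an extra, unnamed first column. If that column is not wanted, row.names = FALSE suppresses it. The sketch below uses a hypothetical two-row data frame and a temporary file so that it does not overwrite anything.

```r
# Hypothetical two-row sample; the file goes to a temporary location.
demo <- data.frame(
  gender     = factor(c("female", "male")),
  math_marks = c(72, 47)
)

csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)  # no row-number column

reloaded <- read.csv(csv_path)
print(names(reloaded))
```

A CSV file stores only text, so factor columns and their level order have to be rebuilt after reading the file back. saveRDS() and readRDS() are one way to keep them intact.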

Potential Research Questions

  1. Which gender performs better on average?

  2. How do students who completed the test preparation course perform compared with those who did not?

  3. How much does the parents' highest education level affect their child's performance?
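A first pass at questions 1 and 2 could be a grouped summary, sketched below with dplyr. The tibble `sample_rows` contains only the first ten students shown in the str() output, so the numbers it produces illustrate the mechanics and say nothing reliable about the full 1,000 rows. The names `by_gender` and `by_prep` are introduced here.

```r
library(dplyr)

# First ten rows of three columns, decoded from the str() output above.
sample_rows <- tibble(
  gender = c("female", "female", "female", "male", "male",
             "female", "female", "male", "male", "female"),
  test_preparation_course = c("none", "completed", "none", "none", "none",
                              "none", "completed", "none", "completed", "none"),
  total_marks = c(218, 247, 278, 148, 229, 232, 275, 122, 195, 148)
)

# Question 1: group size and mean total score by gender.
by_gender <- sample_rows %>%
  group_by(gender) %>%
  summarise(n = n(), mean_total = mean(total_marks))

# Question 2: the same summary by test preparation status.
by_prep <- sample_rows %>%
  group_by(test_preparation_course) %>%
  summarise(n = n(), mean_total = mean(total_marks))

print(by_gender)
print(by_prep)
```

Running the same two pipelines on the full student data frame, and grouping by parent_highest_education for question 3, would give the comparisons the questions ask for.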

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Dhal (2022, May 19). Data Analytics and Computational Social Science: HW3. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscompdhal27hw3/

BibTeX citation

@misc{dhal2022hw3,
  author = {Dhal, Pragyanta},
  title = {Data Analytics and Computational Social Science: HW3},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscompdhal27hw3/},
  year = {2022}
}