HW4

Computing descriptive statistics and creating visualizations

Pragyanta Dhal
2022-05-11

Load libraries
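
The chunk that loads the packages is not shown in the rendered report. As a minimal sketch, assuming the analysis relies only on the packages whose functions appear below (read_csv from readr, the pipe and summarise verbs from dplyr, and ggplot from ggplot2), the setup could look like this:

```r
# Assumed setup: the original chunk is not shown, so this list is inferred
# from the functions used later in the report.
library(readr)    # read_csv()
library(dplyr)    # %>%, group_by(), summarise(), summarise_if()
library(ggplot2)  # ggplot(), geom_bar(), geom_histogram(), ...
```

Loading library(tidyverse) instead would attach all three packages at once.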

Loading the cleaned data

student_final <- read_csv("./student_final_data.csv")
str(student_final)
spec_tbl_df [1,000 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ...1                    : num [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
 $ gender                  : chr [1:1000] "female" "female" "female" "male" ...
 $ race_ethnicity_group    : chr [1:1000] "group B" "group C" "group B" "group A" ...
 $ parent_highest_education: chr [1:1000] "bachelor's degree" "some college" "master's degree" "associate's degree" ...
 $ lunch                   : chr [1:1000] "standard" "standard" "standard" "free/reduced" ...
 $ test_preparation_course : chr [1:1000] "none" "completed" "none" "none" ...
 $ math_marks              : num [1:1000] 72 69 90 47 76 71 88 40 64 38 ...
 $ reading_marks           : num [1:1000] 72 90 95 57 78 83 95 43 64 60 ...
 $ writing_marks           : num [1:1000] 74 88 93 44 75 78 92 39 67 50 ...
 $ total_marks             : num [1:1000] 218 247 278 148 229 232 275 122 195 148 ...
 $ mean_marks              : num [1:1000] 72.7 82.3 92.7 49.3 76.3 ...
 $ grade                   : chr [1:1000] "C" "B" "A" "F" ...
 - attr(*, "spec")=
  .. cols(
  ..   ...1 = col_double(),
  ..   gender = col_character(),
  ..   race_ethnicity_group = col_character(),
  ..   parent_highest_education = col_character(),
  ..   lunch = col_character(),
  ..   test_preparation_course = col_character(),
  ..   math_marks = col_double(),
  ..   reading_marks = col_double(),
  ..   writing_marks = col_double(),
  ..   total_marks = col_double(),
  ..   mean_marks = col_double(),
  ..   grade = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
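
The first column, ...1, is a row index rather than a measured variable. It most likely comes from row names that were written to the CSV when the cleaned data was saved, although the report does not confirm this. The sketch below uses a small made-up data frame (not the real file) to show how such a column arises and one way to drop it before summarising:

```r
library(readr)
library(dplyr)

# Hypothetical toy data; the real student_final_data.csv is not reproduced here
toy <- data.frame(gender = c("female", "male"), math_marks = c(72, 47))

# write.csv() stores row names in an unnamed first column by default
path <- tempfile(fileext = ".csv")
write.csv(toy, path)

# read_csv() repairs the empty header name to `...1`
raw <- read_csv(path, show_col_types = FALSE)
names(raw)

# Drop the index column so it does not leak into numeric summaries
clean <- raw %>% select(-any_of("...1"))
names(clean)
```

Dropping the column up front keeps the index out of summaries such as summarise_if(is.numeric, mean), where its mean of about 500 carries no information about the students.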

Descriptive statistics

Descriptive statistics of numerical columns

Mean of the numeric columns:

student_final %>% 
    summarise_if(is.numeric, mean)
# A tibble: 1 × 6
   ...1 math_marks reading_marks writing_marks total_marks mean_marks
  <dbl>      <dbl>         <dbl>         <dbl>       <dbl>      <dbl>
1  500.       66.1          69.2          68.1        203.       67.8

Median of the numeric columns:

student_final %>% 
    summarise_if(is.numeric, median)
# A tibble: 1 × 6
   ...1 math_marks reading_marks writing_marks total_marks mean_marks
  <dbl>      <dbl>         <dbl>         <dbl>       <dbl>      <dbl>
1  500.         66            70            69         205       68.3

Standard deviation of the numeric columns:

student_final %>% 
    summarise_if(is.numeric, sd)
# A tibble: 1 × 6
   ...1 math_marks reading_marks writing_marks total_marks mean_marks
  <dbl>      <dbl>         <dbl>         <dbl>       <dbl>      <dbl>
1  289.       15.2          14.6          15.2        42.8       14.3
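
The three summaries above can also be computed in a single call. The following is a sketch rather than the method used in this report: it swaps summarise_if() for dplyr's across(), selects only the columns ending in "_marks" so the row index is left out, and runs on a made-up toy tibble in which row_id stands in for ...1.

```r
library(dplyr)

# Toy stand-in for student_final (values are made up for illustration)
toy <- tibble(
  row_id        = 1:4,  # plays the role of the `...1` index column
  gender        = c("female", "male", "female", "male"),
  math_marks    = c(70, 60, 90, 40),
  reading_marks = c(80, 50, 95, 45)
)

# One row, one column per (variable, statistic) pair
marks_summary <- toy %>%
  summarise(across(
    ends_with("_marks"),
    list(mean = mean, median = median, sd = sd),
    .names = "{.col}_{.fn}"
  ))

marks_summary
```

The .names pattern yields column names such as math_marks_mean and reading_marks_sd, which keeps the statistic visible in the output.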

Descriptive statistics of categorical columns

Frequency of gender:

student_final %>%
     group_by(gender) %>%
     summarise(count = n())
# A tibble: 2 × 2
  gender count
  <chr>  <int>
1 female   518
2 male     482

Frequency of ethnicity group of students:

student_final %>%
     group_by(race_ethnicity_group) %>%
     summarise(count = n())
# A tibble: 5 × 2
  race_ethnicity_group count
  <chr>                <int>
1 group A                 89
2 group B                190
3 group C                319
4 group D                262
5 group E                140

Frequency of the highest education obtained by parents:

student_final %>%
     group_by(parent_highest_education) %>%
     summarise(count = n())
# A tibble: 6 × 2
  parent_highest_education count
  <chr>                    <int>
1 associate's degree         222
2 bachelor's degree          118
3 high school                196
4 master's degree             59
5 some college               226
6 some high school           179

Frequency of the type of lunch:

student_final %>%
     group_by(lunch) %>%
     summarise(count = n())
# A tibble: 2 × 2
  lunch        count
  <chr>        <int>
1 free/reduced   355
2 standard       645

Frequency of completion of the test preparation course:

student_final %>%
     group_by(test_preparation_course) %>%
     summarise(count = n())
# A tibble: 2 × 2
  test_preparation_course count
  <chr>                   <int>
1 completed                 358
2 none                      642

Descriptive statistics of relevant grouping

Calculating the mean, median and standard deviation of the marks obtained in maths, grouped by gender:

student_final %>%
     group_by(gender) %>%
     summarise(
          gender_count = n(),
          mean_math_marks = mean(math_marks),
          median_math_marks = median(math_marks),
          sd_math_marks = sd(math_marks)
     )
# A tibble: 2 × 5
  gender gender_count mean_math_marks median_math_marks sd_math_marks
  <chr>         <int>           <dbl>             <dbl>         <dbl>
1 female          518            63.6                65          15.5
2 male            482            68.7                69          14.4

Calculating the mean, median and standard deviation of the marks obtained in maths, grouped by ethnic group:

student_final %>%
     group_by(race_ethnicity_group) %>%
     summarise(
          race_ethnicity_group_count = n(),
          mean_math_marks = mean(math_marks),
          median_math_marks = median(math_marks),
          sd_math_marks = sd(math_marks)
     )
# A tibble: 5 × 5
  race_ethnicity_gr… race_ethnicity_… mean_math_marks median_math_mar…
  <chr>                         <int>           <dbl>            <dbl>
1 group A                          89            61.6             61  
2 group B                         190            63.5             63  
3 group C                         319            64.5             65  
4 group D                         262            67.4             69  
5 group E                         140            73.8             74.5
# … with 1 more variable: sd_math_marks <dbl>

Calculating the mean, median and standard deviation of the marks obtained in maths, grouped by gender and completion of the test preparation course:

student_final %>%
  group_by(gender, test_preparation_course) %>%
  summarise(
    count_gender = n(),
    mean_math_marks = mean(math_marks),
    median_math_marks = median(math_marks),
    sd_math_marks = sd(math_marks)
  )
# A tibble: 4 × 6
# Groups:   gender [2]
  gender test_preparation_course count_gender mean_math_marks
  <chr>  <chr>                          <int>           <dbl>
1 female completed                        184            67.2
2 female none                             334            61.7
3 male   completed                        174            72.3
4 male   none                             308            66.7
# … with 2 more variables: median_math_marks <dbl>,
#   sd_math_marks <dbl>
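
Two of the printed tables above hide columns because the console is too narrow, and the last result is still grouped by gender. The sketch below, run on made-up data, shows one way to handle both: .groups = "drop" returns an ungrouped tibble, and print(width = Inf) shows every column.

```r
library(dplyr)

# Toy stand-in for student_final (values are made up for illustration)
toy <- tibble(
  gender = c("female", "female", "male", "male", "female", "male"),
  test_preparation_course = c("completed", "none", "completed", "none", "none", "none"),
  math_marks = c(70, 60, 75, 65, 62, 55)
)

math_by_group <- toy %>%
  group_by(gender, test_preparation_course) %>%
  summarise(
    count = n(),
    mean_math_marks = mean(math_marks),
    median_math_marks = median(math_marks),
    sd_math_marks = sd(math_marks),  # NA for groups with a single row
    .groups = "drop"                 # return an ungrouped tibble
  )

# width = Inf prints all columns instead of truncating them
print(math_by_group, width = Inf)
```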

Visualizations

Univariate Analysis

Frequency of students across ethnic groups

ggplot(student_final, aes(x = race_ethnicity_group)) + 
     geom_bar() +
     labs(x = "Ethnic group",
          y = "Frequency",
          title = "Students by ethnicity group") +
     coord_flip()

This is a univariate plot of a categorical variable (race_ethnicity_group). It is a bar chart flipped to a horizontal orientation that shows the frequency of students across ethnic groups. Group C is the largest group (319 of the 1,000 students), followed by group D (262); group A is the smallest (89).

Frequency of mean marks

ggplot(student_final, aes(x = mean_marks)) +
     geom_histogram(fill = "red", color = "white") + 
     labs(title = "Average Marks Distribution",
          x = "mean_marks", y = "Frequency")

This is a univariate plot of a continuous variable (mean_marks). It is a histogram, and the distribution suggests that average scores are most concentrated around 65-70 marks, which is consistent with the overall mean (67.8) and median (68.3) of mean_marks.
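
A statement about a specific 5-mark band is easier to check when the bins line up with that band. By default geom_histogram() uses 30 bins, whose edges need not fall on multiples of 5. The following sketch, on simulated data rather than the real marks, sets binwidth = 5 with bin edges at multiples of 5 and tabulates the same bands with cut():

```r
library(ggplot2)

# Simulated stand-in for mean_marks (not the real data), clamped to 0-100
set.seed(1)
toy <- data.frame(mean_marks = pmin(100, pmax(0, rnorm(200, mean = 68, sd = 14))))

# 5-mark bins whose edges fall on 0, 5, 10, ..., 100
p <- ggplot(toy, aes(x = mean_marks)) +
  geom_histogram(binwidth = 5, boundary = 0, fill = "red", color = "white") +
  labs(title = "Average Marks Distribution",
       x = "mean_marks", y = "Frequency")

# Count of students per band, and the band with the highest count
band_counts <- table(cut(toy$mean_marks, breaks = seq(0, 100, by = 5),
                         include.lowest = TRUE))
band_counts[which.max(band_counts)]
```

Running the same tabulation on student_final$mean_marks would give the exact count in each band to quote alongside the plot.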

Bivariate Analysis

Gender vs Mean Marks

ggplot(student_final, 
        aes(x = gender, 
            y = mean_marks)) +
     geom_boxplot() +
     labs(title = "Average marks distribution by gender")

This is a bivariate plot of a categorical variable (gender) and a continuous variable (mean_marks). It is a box plot that summarises the distribution of each group through its median and quartiles.
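
The summaries a box plot draws can also be computed as numbers, which makes the comparison between groups explicit. This is a minimal sketch on made-up values, not the real marks:

```r
library(dplyr)

# Toy stand-in for student_final (values are made up for illustration)
toy <- tibble(
  gender     = rep(c("female", "male"), each = 5),
  mean_marks = c(60, 65, 70, 75, 80, 50, 60, 65, 70, 90)
)

box_stats <- toy %>%
  group_by(gender) %>%
  summarise(
    q1     = quantile(mean_marks, 0.25, names = FALSE),  # lower quartile
    median = median(mean_marks),
    q3     = quantile(mean_marks, 0.75, names = FALSE),  # upper quartile
    iqr    = IQR(mean_marks),                            # q3 - q1
    .groups = "drop"
  )

box_stats
```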

Reading Marks vs Writing Marks

ggplot(student_final, 
        aes(x = reading_marks, 
            y = writing_marks)) +
     geom_point()

This is a bivariate plot of two continuous variables, reading_marks and writing_marks. The points suggest a positive relationship: students who perform well in reading tend to perform well in writing as well.
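
The strength of this relationship can be quantified with a correlation coefficient. The sketch below uses only the first ten rows of reading_marks and writing_marks as printed in the str() output above, so the value it gives is illustrative and is not the correlation for all 1,000 students:

```r
# First ten values of each column, copied from the str() output
reading_marks <- c(72, 90, 95, 57, 78, 83, 95, 43, 64, 60)
writing_marks <- c(74, 88, 93, 44, 75, 78, 92, 39, 67, 50)

# Pearson correlation (the default method of cor())
r <- cor(reading_marks, writing_marks)
r
```

On the full data the equivalent call would be cor(student_final$reading_marks, student_final$writing_marks).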

Limitations

Univariate plots have the benefit of preserving the detail of a single variable, but with a large number of data points it can be hard to get an overall picture of that variable’s qualities, and they do not allow a straightforward comparison of one variable with another. The bivariate plots do allow such comparisons, but the ones shown here are basic and could be made clearer and more aesthetically pleasing for the reader. Both kinds of plot warrant further improvement.
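
As one possible refinement, sketched under assumptions rather than taken from this report, the scatter plot could use semi-transparent points to reduce overplotting and a fitted line to make the trend easier to read. The data here is the first ten rows printed in the str() output; the labels and the alpha value are illustrative choices.

```r
library(ggplot2)

# First ten rows of the two columns, copied from the str() output
toy <- data.frame(
  reading_marks = c(72, 90, 95, 57, 78, 83, 95, 43, 64, 60),
  writing_marks = c(74, 88, 93, 44, 75, 78, 92, 39, 67, 50)
)

p <- ggplot(toy, aes(x = reading_marks, y = writing_marks)) +
  geom_point(alpha = 0.4) +                                  # transparency shows dense regions
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # straight-line trend
  labs(title = "Reading marks vs writing marks",
       x = "Reading marks", y = "Writing marks")

p
```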

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Dhal (2022, May 19). Data Analytics and Computational Social Science: HW4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscompdhal27homework4/

BibTeX citation

@misc{dhal2022hw4,
  author = {Dhal, Pragyanta},
  title = {Data Analytics and Computational Social Science: HW4},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscompdhal27homework4/},
  year = {2022}
}