Data Analytics and Computational Social Science: MADT_Homework 4

Meredith Derian-Toth

#Reading in my Data
library(readr)
WA_Edu_Improvment_2019 <- read_csv("Washington_School_Improvement_Framework__WSIF__Stacked_2017_-_2019_Runs2.csv")

Question One Instructions:

This should include mean, median, and standard deviation for numerical variables, and frequencies for categorical variables.

In addition to overall means, medians, and SDs, use group_by() and summarise() to compute mean/median/SD for any relevant groupings. For example, if you were interested in how state relates to income, you would compute mean income for all states combined, and then you would compute mean income for each individual state in the US.

My Plan

The following data represents attendance broken out by three variables: the “Student Group” variable shows student race, the “School Type” variable shows the different types of schools in the dataset, and the “Grad FourYear Rate” variable shows the percentage of students who graduated from high school within four years.

I am interested in the relationship between each variable and attendance rate. One limitation of this dataset is its lack of suspension and dropout data; research has shown a relationship between suspensions, attendance, and dropout rates.

Preparing my Data for summary data

The following code renames the variables I am using and filters out the text values so that only the numeric values are aggregated.

#In preparation for summary and descriptive data of the variables mentioned above, I will need to rename them to remove the space in their name as well as filter out the text. 
library("tidyverse")
WA_Edu_Improvment_2019 <- rename(WA_Edu_Improvment_2019, Attendance_Rate2="RegularAttendance Rate", Four_Year_Grad_Rate="Grad FourYear Rate", School_Type="School Type")

WA_Edu_Improvment_2019 <- rename(WA_Edu_Improvment_2019, District_Name="District Name")

WA_Edu_Improvment_2019 <- rename(WA_Edu_Improvment_2019, Student_Group="Student Group")
                                 
#And the next step is to filter anything that contains "Suppress" or "N" from the variable. 
library("stringr")
WA_Edu_Improvment_2019<-WA_Edu_Improvment_2019%>%
  filter(!str_detect(Attendance_Rate2,"Suppress|N"))%>%
  filter(!str_detect(Four_Year_Grad_Rate,"Suppress|N"))
view(WA_Edu_Improvment_2019)

library("dplyr")

WA_Edu_Improvment_2019<-WA_Edu_Improvment_2019%>%
  mutate(Attendance_Rate2 = parse_number(Attendance_Rate2),
         Attendance_Rate2 = ifelse(Attendance_Rate2>1,Attendance_Rate2/100,Attendance_Rate2))

WA_Edu_Improvment_2019<-WA_Edu_Improvment_2019%>%
  mutate(Four_Year_Grad_Rate = parse_number(Four_Year_Grad_Rate),
         Four_Year_Grad_Rate = ifelse(Four_Year_Grad_Rate>1,Four_Year_Grad_Rate/100,Four_Year_Grad_Rate))

View(WA_Edu_Improvment_2019)
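The cleaning steps above can be sketched on a small made-up data frame. The toy values and the helper name clean_rate() are assumptions for illustration only; they are not taken from the WSIF file.

```r
library(dplyr)
library(stringr)
library(readr)

# Made-up values that mimic a column mixing text and numbers (assumption)
toy <- tibble(
  Attendance_Rate2 = c("0.85", "Suppressed: N<10", "92%", "NULL", "0.40")
)

# Hypothetical helper: parse the number, then rescale percentages to proportions
clean_rate <- function(x) {
  value <- parse_number(x)
  ifelse(value > 1, value / 100, value)
}

toy_clean <- toy %>%
  # Drops any value containing "Suppress" or a capital "N" (this also removes "NULL")
  filter(!str_detect(Attendance_Rate2, "Suppress|N")) %>%
  mutate(Attendance_Rate2 = clean_rate(Attendance_Rate2))

toy_clean
```

Two things worth keeping in mind with this approach: the pattern "N" matches any value that contains a capital N, and a percentage below 1 (for example "0.5%") would be left as 0.5 rather than rescaled, because the rescaling only applies to values greater than 1.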

Summarizing my variables

#summarizing with mean
  summarize(WA_Edu_Improvment_2019, mean_Attendance=mean(Attendance_Rate2), mean_Grad=mean(Four_Year_Grad_Rate))

# A tibble: 1 × 2
  mean_Attendance mean_Grad
            <dbl>     <dbl>
1           0.682     0.735

#summarizing with min and max
  summarize(WA_Edu_Improvment_2019, min_Attendance=min(Attendance_Rate2), min_Grad=min(Four_Year_Grad_Rate))

# A tibble: 1 × 2
  min_Attendance min_Grad
           <dbl>    <dbl>
1          0.053   0.0202

  summarize(WA_Edu_Improvment_2019, max_Attendance=max(Attendance_Rate2), max_Grad=max(Four_Year_Grad_Rate))

# A tibble: 1 × 2
  max_Attendance max_Grad
           <dbl>    <dbl>
1          0.990    0.990
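Question One also asks for the median and standard deviation, which the summaries above do not include yet. Below is a minimal sketch of one way to get all three at once. It uses a made-up data frame called toy in place of WA_Edu_Improvment_2019 (the values are invented), and na.rm = TRUE is an assumption in case any missing values remain after cleaning.

```r
library(dplyr)

# Invented values standing in for the real columns (assumption)
toy <- tibble(
  Attendance_Rate2    = c(0.50, 0.70, 0.90, NA),
  Four_Year_Grad_Rate = c(0.60, 0.80, 1.00, 0.80)
)

# across() applies each summary function to each column;
# result columns are named <column>_<function>, e.g. Attendance_Rate2_mean
overall <- toy %>%
  summarize(across(
    c(Attendance_Rate2, Four_Year_Grad_Rate),
    list(
      mean   = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      sd     = ~ sd(.x, na.rm = TRUE)
    )
  ))

overall
```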

#Note to self - use group_by function for Question one

#Here I am going to summarize the variables, grouping them by district. "District Name" was already renamed above to remove the space.

WA_Edu_Improvment_2019%>%
  group_by(District_Name)%>%
  select(Attendance_Rate2, Four_Year_Grad_Rate)%>%
  summarize(mean_Attendance=mean(Attendance_Rate2), mean_Grad=mean(Four_Year_Grad_Rate))

# A tibble: 249 × 3
   District_Name                     mean_Attendance mean_Grad
   <chr>                                       <dbl>     <dbl>
 1 Aberdeen School District                    0.529     0.629
 2 Adna School District                        0.848     0.940
 3 Anacortes School District                   0.648     0.721
 4 Arlington School District                   0.619     0.610
 5 Asotin-Anatone School District              0.808     0.934
 6 Auburn School District                      0.624     0.661
 7 Bainbridge Island School District           0.777     0.874
 8 Battle Ground School District               0.827     0.671
 9 Bellevue School District                    0.847     0.838
10 Bellingham School District                  0.658     0.701
# … with 239 more rows

#Here I am going to summarize the variables, grouping them by student group. "Student Group" was already renamed above to remove the space.

WA_Edu_Improvment_2019%>%
  group_by(Student_Group)%>%
  select(Attendance_Rate2, Four_Year_Grad_Rate)%>%
  summarize(mean_Attendance=mean(Attendance_Rate2), mean_Grad=mean(Four_Year_Grad_Rate))

# A tibble: 11 × 3
   Student_Group                           mean_Attendance mean_Grad
   <chr>                                             <dbl>     <dbl>
 1 All Students                                      0.699     0.752
 2 American Indian/ Alaskan Native                   0.502     0.637
 3 Asian                                             0.845     0.869
 4 Black/ African American                           0.675     0.774
 5 English Language Learners                         0.669     0.639
 6 Hispanic/ Latino of any race(s)                   0.671     0.741
 7 Low-Income                                        0.642     0.720
 8 Native Hawaiian/ Other Pacific Islander           0.554     0.736
 9 Students with Disabilities                        0.631     0.628
10 Two or More Races                                 0.710     0.817
11 White                                             0.725     0.784
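The grouped summaries above report means only, while the instructions also ask for medians, SDs, and frequencies for categorical variables. The sketch below shows one way these could be added. The data frame toy and its values, including the School_Type codes "P" and "A", are invented for illustration.

```r
library(dplyr)

# Invented rows standing in for the real data (assumption)
toy <- tibble(
  Student_Group    = c("Asian", "Asian", "White", "White", "White"),
  School_Type      = c("P", "A", "P", "P", "A"),
  Attendance_Rate2 = c(0.80, 0.90, 0.60, 0.70, 0.80)
)

# Mean, median, and SD of attendance for each student group, plus the row count
by_group <- toy %>%
  group_by(Student_Group) %>%
  summarize(
    n                 = n(),
    mean_Attendance   = mean(Attendance_Rate2, na.rm = TRUE),
    median_Attendance = median(Attendance_Rate2, na.rm = TRUE),
    sd_Attendance     = sd(Attendance_Rate2, na.rm = TRUE)
  )

# Frequencies for a categorical variable, most common first
school_type_counts <- toy %>%
  count(School_Type, sort = TRUE)

by_group
school_type_counts
```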

Question Two Instructions

Create at least two visualizations using your final project dataset. • The visualizations should use the ggplot2 package. • At least one visualization should be univariate, and at least one should be bivariate.

WA_Edu_Improvment_2019%>%
  filter(str_detect(School_Type,"Alternative"))%>%
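  # Drop White and the groups that are not race categories (including English Language Learners)
  filter(!str_detect(Student_Group,"White|Low-Income|Students with Disabilities|All Students|English Language Learners"))%>%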
  ggplot(aes(Attendance_Rate2)) +
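  # Put the histogram on the density scale so the overlaid density curve is comparable
  geom_histogram(aes(y = after_stat(density))) +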
  geom_density(alpha=0.2,fill="red") +
  labs(title = "Attendance Rate Distribution for Students who Identify as a Person of Color", x="Attendance Rate")

WA_Edu_Improvment_2019%>%
  filter(str_detect(School_Type,"Alternative"))%>%
  filter(str_detect(Student_Group,"White"))%>%
  ggplot(aes(Attendance_Rate2)) +
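  # Put the histogram on the density scale so the overlaid density curve is comparable
  geom_histogram(aes(y = after_stat(density))) +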
  geom_density(alpha=0.2,fill="red") +
  labs(title = "Attendance Rate Distribution for Students who Identify as White", x="Attendance Rate")

WA_Edu_Improvment_2019%>%
  filter(str_detect(School_Type,"Alternative"))%>%
  filter(str_detect(Student_Group,"Hispanic/ Latino of any race"))%>%
  ggplot(aes(Attendance_Rate2)) +
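  # Put the histogram on the density scale so the overlaid density curve is comparable
  geom_histogram(aes(y = after_stat(density))) +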
  geom_density(alpha=0.2,fill="red") +
  labs(title = "Attendance Rate Distribution For Students of any Race that Identify as Hispanic or Latino", x="Attendance Rate")

The three visualizations above explore differences in the attendance distribution by student race and ethnicity, using only the schools whose type contains “Alternative”. They are broken out as attendance distributions for students of color, for white students, and for Hispanic students. The visualizations attempt to answer whether there is a clear difference in average attendance rate between students of color and white students. I think we can conclude here that students of color generally have a lower attendance rate. What we cannot conclude from these visualizations, however, is why.

Questions left to answer: I would like to know what is preventing groups of students from attending school. To understand why average attendance rates differ between students of color and white students, we need to know more about the barriers that might be keeping these populations from attending school.

WA_Edu_Improvment_2019%>%
ggplot(aes(School_Type, Attendance_Rate2)) +
  geom_boxplot() +
  labs(title = "Attendance Rate by School Type", y = "Attendance Rate", x = "School Type") +
  theme(axis.text.x = element_text(angle = 45))

The visualization above shows the distribution of attendance rate by school type. With this visualization I am attempting to answer whether one type of school has a clearly higher attendance rate than the others. From it we can conclude that the virtual or out-of-district school type tends to have a higher attendance rate.

Questions left unanswered: This graph does not show the count of students in each school type. A stacked bar graph would help take this into account.
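A stacked bar graph of this kind could be sketched as follows. The data frame toy is invented, and the sketch assumes each row is one school-by-student-group record, so geom_bar() counts rows rather than students. Counting students would need an enrollment column, which I have not confirmed exists in this dataset.

```r
library(dplyr)
library(ggplot2)

# Invented rows; the School_Type codes "P" and "A" are placeholders (assumption)
toy <- tibble(
  School_Type   = c("P", "P", "P", "A", "A"),
  Student_Group = c("Asian", "White", "White", "Asian", "White")
)

# geom_bar() counts rows per School_Type; fill stacks the bars by Student_Group
p <- ggplot(toy, aes(School_Type, fill = Student_Group)) +
  geom_bar() +
  labs(title = "Number of Records by School Type",
       x = "School Type", y = "Number of records", fill = "Student Group")

p
```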

WA_Edu_Improvment_2019%>%
  filter(str_detect(School_Type,"Alternative"))%>%
  # aes() maps attendance rate to x and graduation rate to y, so the labels follow that order
  ggplot(aes(Attendance_Rate2, Four_Year_Grad_Rate)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Four-Year Graduation Rate by Attendance Rate", x = "Attendance Rate", y = "Four-Year Graduation Rate")

The visualization above is a scatterplot of graduation rate against attendance rate for schools whose type contains “Alternative”. The point of this visualization is to attempt to answer whether there is a relationship between graduation rate and attendance rate.

Questions left to answer: How can we explain the dip around a 50% attendance rate? Why would schools with a much lower attendance rate have a higher graduation rate than schools with a 50% attendance rate?

Overall, these sets of visualizations give a broad overview of what might be affecting attendance rate. I think it would be helpful to run a multiple regression to see whether the chosen variables form a predictive model for attendance rate. I would also like to learn more about these students' lives. The next analysis will compare the attendance rate of low-income students with that of students who are not low-income.
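As a rough sketch of what the multiple regression might look like, the code below fits a linear model with lm() on simulated data. Everything in toy is made up, including the built-in positive relationship and the School_Type codes, so the output says nothing about the real dataset. Because attendance rate is bounded between 0 and 1, a linear model is only a first pass.

```r
set.seed(1)

# Simulated predictors (assumption: invented values, placeholder school type codes)
toy <- data.frame(
  Four_Year_Grad_Rate = runif(60, 0.3, 1),
  School_Type         = rep(c("P", "A", "S"), each = 20)
)

# Simulated outcome with a positive relationship deliberately built in
toy$Attendance_Rate2 <- 0.3 + 0.5 * toy$Four_Year_Grad_Rate + rnorm(60, sd = 0.05)

# Multiple regression: attendance rate predicted by graduation rate and school type
fit <- lm(Attendance_Rate2 ~ Four_Year_Grad_Rate + School_Type, data = toy)

summary(fit)
```

With the real data, Student_Group could be added to the formula in the same way as School_Type.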
