Data Analytics and Computational Social Science: MADT_Homework 3

Meredith Derian-Toth

Research Interest

My research interest is in schools and districts that are lower performing. I want to know more about the learning environment, as well as the community environment, that may affect student learning and student attendance. The variables in this data set provide only some of this information: student performance data, student attendance data, and some demographic data.

library(readr)
WA_Edu_Improvment_2019 <- read_csv("WA_School_Improvment_2017_-_2019_Runs.csv")
View(WA_Edu_Improvment_2019)
dim(WA_Edu_Improvment_2019)

[1] 72850    56

head(WA_Edu_Improvment_2019)

# A tibble: 6 × 56
  `ESD Organization Id` `ESD Name`    `District Orga…` `District Code`
                  <dbl> <chr>                    <dbl>           <dbl>
1                100006 Puget Sound …           100229           17001
2                100003 Educational …           100278            6037
3                100009 Northwest Ed…           100142           31025
4                100009 Northwest Ed…           100159           31006
5                100009 Northwest Ed…           100142           31025
6                100007 Educational …           100195           11001
# … with 52 more variables: `District Name` <chr>,
#   `School Code` <dbl>, `School Name` <chr>,
#   `School Organization Id` <dbl>, `School Type` <chr>,
#   `Student Group` <chr>, `Proficiency ELA Numerator` <chr>,
#   `Proficiency ELA Denominator` <dbl>,
#   `Proficiency ELA Rate` <chr>, `Proficiency ELA Decile` <dbl>,
#   `Proficiency Math Numerator` <chr>, …

Variables and Research Questions

There are many variables in this data set. The ones I will focus on are: District Code and School Code, which are identifier codes stored as discrete integers; Student Group, a categorical text variable that identifies students by race, ethnicity, low-income status, or English language learner status; and Proficiency ELA Rate, Proficiency Math Rate, Regular Attendance Rate, and Grade FourYear Rate, which are all continuous variables. As the output above shows for the ELA rate, a rate can be read in as text (<chr>), so it has to be converted to a number before plotting.

Potential research questions are:

1. Do schools with low attendance rates also have low 4-year graduation rates?
2. Do schools with low attendance rates also perform lower on ELA or Math proficiency assessments?
3. Is there a relationship between schools' student group populations and their attendance rate and/or academic proficiency scores?
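As a first pass at question 1, one option is to correlate the two rates at the school level. The sketch below uses a small made-up data frame (the `schools` object, its column names, and its values are hypothetical and are not taken from the Washington data) and assumes both rates have already been converted to numbers.

```r
# Hypothetical toy data; the values are invented for illustration only
schools <- data.frame(
  attendance_rate = c(0.95, 0.90, 0.82, 0.75, NA),
  grad_rate       = c(0.93, 0.88, 0.80, 0.70, 0.85)
)

# Pearson correlation, skipping any school with a missing rate
r <- cor(schools$attendance_rate, schools$grad_rate, use = "complete.obs")
r
```

A value of `r` close to 1 would suggest that low attendance and low graduation rates tend to go together, though a correlation alone does not show why.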

Preparing the Data for Visualizations

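#Here I am renaming the variable to remove spaces
library("tidyverse")
WA_Edu_Improvment_2019 <- rename(WA_Edu_Improvment_2019, Attendance_Rate2="RegularAttendance Rate")
#Note: a name placed before the comma is looked up as a row name, not a column name,
#which is why the output below is a single row of NAs.
#WA_Edu_Improvment_2019["Attendance_Rate2"] (no comma) would select the renamed column.
head(WA_Edu_Improvment_2019['Attendance_Rate2',])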

# A tibble: 1 × 56
  `ESD Organization Id` `ESD Name` `District Organiz…` `District Code`
                  <dbl> <chr>                    <dbl>           <dbl>
1                    NA <NA>                        NA              NA
# … with 52 more variables: `District Name` <chr>,
#   `School Code` <dbl>, `School Name` <chr>,
#   `School Organization Id` <dbl>, `School Type` <chr>,
#   `Student Group` <chr>, `Proficiency ELA Numerator` <chr>,
#   `Proficiency ELA Denominator` <dbl>,
#   `Proficiency ELA Rate` <chr>, `Proficiency ELA Decile` <dbl>,
#   `Proficiency Math Numerator` <chr>, …
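
The single row of NAs above comes from how the square brackets are read. A minimal sketch of the difference, using an invented `toy` data frame in place of the real table (a base data frame is assumed here; the tibble output above shows the same pattern):

```r
# Hypothetical toy data standing in for the real table
toy <- data.frame(
  Attendance_Rate2 = c("95%", "Suppressed", "0.88"),
  School_Code      = c(101, 102, 103)
)

# A name BEFORE the comma is looked up as a row name; no row has that
# name, so the result is one row filled with NA
by_row <- toy["Attendance_Rate2", ]

# A name with NO comma selects the column of that name
by_col <- head(toy["Attendance_Rate2"])
by_col
```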

#The next step is to filter out any rows where the variable contains "Suppress" or "N".
library("stringr")
WA_Edu_Improvment_2019<-WA_Edu_Improvment_2019%>%
  filter(!str_detect(Attendance_Rate2,"Suppress|N"))
View(WA_Edu_Improvment_2019)
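
It can help to check what the pattern "Suppress|N" actually matches before relying on it. The sketch below uses invented example strings (the real column may contain different values) and base R's `grepl()`, which flags the same non-missing strings as `str_detect()` for this pattern. It also shows a stricter alternative, offered only as one possible approach, that keeps values that look like numbers.

```r
# Invented example values; the real column may contain other strings
x <- c("95%", "0.88", "Suppressed: N<10", "N", "No data", "Not applicable", NA)

# TRUE wherever the value contains "Suppress" or a capital "N" anywhere
flagged <- grepl("Suppress|N", x)

# Values that survive the filter. grepl() treats NA as "no match", whereas
# filter(!str_detect(...)) drops NA rows because the test itself is NA,
# so the NA is removed explicitly here to mirror that behavior.
kept <- x[!flagged & !is.na(x)]
kept

# Stricter alternative: keep only digits and dots with an optional "%"
numeric_like <- grepl("^[0-9.]+%?$", x)
x[numeric_like]
```

Both approaches keep the same two values in this toy vector. They would differ on a text value with no capital "N" in it (for example "missing"), which the first pattern lets through.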

Visualizations

#Here I have created a boxplot of the attendance rate data. I am unsure whether parse_number() is in the correct spot, since the x axis is showing up as text; however, I can't seem to get it to work in any other part of the code.
ggplot(WA_Edu_Improvment_2019, aes(parse_number(Attendance_Rate2))) +
  geom_boxplot()
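
If the "text" on the x axis is the axis title reading parse_number(Attendance_Rate2), one possible fix is to convert the column once with mutate() and then set the title with labs(). This is only a sketch: the `toy` tibble and the `Attendance_Rate_Num` column name are hypothetical, and the values are invented.

```r
library(tidyverse)

# Invented toy data standing in for the filtered table
toy <- tibble(Attendance_Rate2 = c("95%", "88%", "91%", "79%"))

# Convert once, up front, into a new numeric column (hypothetical name)
toy <- toy %>%
  mutate(Attendance_Rate_Num = parse_number(Attendance_Rate2))

# Plot the numeric column and give the axis an explicit title
p <- ggplot(toy, aes(x = Attendance_Rate_Num)) +
  geom_boxplot() +
  labs(x = "Regular attendance rate (%)")
p
```

Converting up front also means later plots can reuse the numeric column instead of calling parse_number() inside every aes().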

#Here I want to see the relationship between ELA Performance Data and Attendance Rate Data so I am preparing my data to create a scatter plot. 

#First I have to filter out the text data from the ELA data. To do that I will follow the same steps as the attendance data. I will rename the variable to remove spaces and then filter out the text. 

#Here I am renaming the variable to remove spaces
library("tidyverse")
WA_Edu_Improvment_2019 <- rename(WA_Edu_Improvment_2019, ELA_Proficiency_Rate="Proficiency ELA Rate")

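#Note: as before, the comma makes this a row lookup, so the output below is one row of NAs.
#WA_Edu_Improvment_2019["ELA_Proficiency_Rate"] (no comma) would select the renamed column.
head(WA_Edu_Improvment_2019['ELA_Proficiency_Rate',])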

# A tibble: 1 × 56
  `ESD Organization Id` `ESD Name` `District Organiz…` `District Code`
                  <dbl> <chr>                    <dbl>           <dbl>
1                    NA <NA>                        NA              NA
# … with 52 more variables: `District Name` <chr>,
#   `School Code` <dbl>, `School Name` <chr>,
#   `School Organization Id` <dbl>, `School Type` <chr>,
#   `Student Group` <chr>, `Proficiency ELA Numerator` <chr>,
#   `Proficiency ELA Denominator` <dbl>, ELA_Proficiency_Rate <chr>,
#   `Proficiency ELA Decile` <dbl>,
#   `Proficiency Math Numerator` <chr>, …

#The next step is to filter out any rows where the variable contains "Suppress" or "N".
library("stringr")
WA_Edu_Improvment_2019<-WA_Edu_Improvment_2019%>%
  filter(!str_detect(ELA_Proficiency_Rate,"Suppress|N"))
View(WA_Edu_Improvment_2019)

#Though I do see one problem here, some of the data is in "%" format and some of the data is in decimal format. This will have to be fixed.
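
One way the mixed formats might be reconciled is sketched below. It rests on an assumption that should be checked against the real data: values written with a "%" sign are on a 0-100 scale, and values without one are already proportions between 0 and 1. The example strings and variable names are invented.

```r
# Invented example values mixing the two formats
rate_chr <- c("95%", "0.88", "72.5%", "0.4")

# ASSUMPTION: a "%" sign means the value is on a 0-100 scale
is_percent <- grepl("%", rate_chr, fixed = TRUE)

# Strip the sign and convert to numeric
rate_num <- as.numeric(sub("%", "", rate_chr, fixed = TRUE))

# Rescale only the percent values so everything is a proportion
rate_prop <- ifelse(is_percent, rate_num / 100, rate_num)
rate_prop
```

With every value on the same scale, the scatter plot axes compare like with like.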

# Here is the scatter plot!

ggplot(WA_Edu_Improvment_2019, aes(parse_number(Attendance_Rate2), parse_number(ELA_Proficiency_Rate))) +
  geom_point() +
  geom_smooth()

#Now I want to see the attendance rate for the different student groups. First I will rename the "Student Group" variable to remove the space. Then I will use it as the x axis in my box plot.
library("tidyverse")
WA_Edu_Improvment_2019 <- rename(WA_Edu_Improvment_2019, Student_Group="Student Group")

WA_Edu_Improvment_2019%>%
  #Drop the "All Students", "English Language Learners", and "Low-Income" groups
  filter(!str_detect(Student_Group,"All Students|English Language Learners|Low-Income"))%>%
  ggplot(aes(Student_Group, parse_number(Attendance_Rate2))) +
  geom_boxplot() +
  #Student_Group is on the x axis and the attendance rate is on the y axis
  labs(title = "Attendance Rate by Student Group", x = "Student Group", y = "Attendance Rate") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
