Data Analytics and Computational Social Science: HW4

TLamkin

INTRODUCTION

The Massachusetts Department of Elementary and Secondary Education publishes data related to teacher pay rates. These are made available at https://www.doe.mass.edu/SchDistrictData.html.

My goal was to examine data by providing a statistical overview and visualizations.

Checking the Environment

I like to verify that my working directory and library paths are the same and that I have the correct packages installed.

# Verify library path

.libPaths()

[1] "C:/Users/theresa/Documents/R/win-library/4.1"
[2] "C:/Program Files/R/R-4.1.2/library"

# set the working directory to be the same as the library path

setwd("C:/Users/theresa/Documents/R/win-library/4.1")

# verify the working directory

getwd()

[1] "C:/Users/theresa/Documents/R/win-library/4.1"

# Installing Tidyverse and readxl packages with explicitly defining the URL of where it lives. This is to get around a Mirror error. 

# install.packages("tidyverse", repos = "http://cran.us.r-project.org")
# install.packages("readxl", repos = "http://cran.us.r-project.org")

# load the necessary libraries for the processing

library(tidyverse)
library(dplyr)
library(readxl)
library(readr)
library(stringr)
library(ggplot2)
library(quantreg)

Loading the Data

Load in the full dataset, skipping the header rows and cleaning the column names to populate spaces.

preliminary_school_data_hs <-read_csv("c:/users/theresa/Documents/DACSS Local/DataSets/preliminary-school-ppx.csv", skip = 3) 

# To create easier field names, replace the spaces with a dash in the original column names.

names(preliminary_school_data_hs) <-str_replace_all(names(preliminary_school_data_hs), c(" "="-"))

Tidy-ing the Numeric data:

The teacher data included the 2018-2019 salary information for all school levels pre-k - 12. The data also included many other operational expenditures for each school.

The financial data for this analysis needed to be reduced to only relevant data: District Name School Name Represented Grades % Economically Disadvantaged Student Body Teacher Salary Guidance Dept Salary

The first step was to reduce the data to the relevant fields and clean any non-numeric values.

Missing data was filled with the median for the field.

# Reduce the data set to the relevant columns 

preliminary_school_data_g12 <- preliminary_school_data_hs %>%
   select(District...2,`School-Name`, `Grade-Level`, `%-Econ-Disadv`, `Teachers-Per-100-Students`,`Avg-Teacher-Salary`,) 

preliminary_school_data_g12

# A tibble: 3,603 x 6
   District...2 `School-Name`            `Grade-Level` `%-Econ-Disadv`
   <chr>        <chr>                    <chr>         <chr>          
 1 Abington     Abington Early Educatio~ PK            30.7%          
 2 Abington     Woodsdale Elementary Sc~ 3-4           23.9%          
 3 Abington     Beaver Brook Elementary  K-2           22.7%          
 4 Abington     Abington Middle School   5-8           25.9%          
 5 Abington     Abington High            9-12          23.3%          
 6 Acushnet     Acushnet Elementary Sch~ PK-4          26.3%          
 7 Acushnet     Albert F Ford Middle Sc~ 5-8           25.1%          
 8 Agawam       Agawam Early Childhood ~ PK            33.1%          
 9 Agawam       Clifford M Granger       K-4           37.9%          
10 Agawam       Benjamin J Phelps        K-4           28.7%          
# ... with 3,593 more rows, and 2 more variables:
#   `Teachers-Per-100-Students` <chr>, `Avg-Teacher-Salary` <chr>

# Convert the numbers from character to numeric

preliminary_school_data_g12$`Avg-Teacher-Salary`         <-          parse_number(preliminary_school_data_g12$`Avg-Teacher-Salary`) 
preliminary_school_data_g12$`Teachers-Per-100-Students`  <-  parse_number(preliminary_school_data_g12$`Teachers-Per-100-Students`) 
preliminary_school_data_g12$`%-Econ-Disadv`              <-  parse_number(preliminary_school_data_g12$`%-Econ-Disadv`) 
preliminary_school_data_g12

# A tibble: 3,603 x 6
   District...2 `School-Name`            `Grade-Level` `%-Econ-Disadv`
   <chr>        <chr>                    <chr>                   <dbl>
 1 Abington     Abington Early Educatio~ PK                       30.7
 2 Abington     Woodsdale Elementary Sc~ 3-4                      23.9
 3 Abington     Beaver Brook Elementary  K-2                      22.7
 4 Abington     Abington Middle School   5-8                      25.9
 5 Abington     Abington High            9-12                     23.3
 6 Acushnet     Acushnet Elementary Sch~ PK-4                     26.3
 7 Acushnet     Albert F Ford Middle Sc~ 5-8                      25.1
 8 Agawam       Agawam Early Childhood ~ PK                       33.1
 9 Agawam       Clifford M Granger       K-4                      37.9
10 Agawam       Benjamin J Phelps        K-4                      28.7
# ... with 3,593 more rows, and 2 more variables:
#   `Teachers-Per-100-Students` <dbl>, `Avg-Teacher-Salary` <dbl>

# Globally replace any value that wasn't provided (was a non-numeric), replace it with the median of all the other values in that instance. 

preliminary_school_data_g12 <- preliminary_school_data_g12 %>%
    mutate_if(is.numeric, function(x) ifelse(is.na(x), median(x, na.rm = T), x))

preliminary_school_data_g12

# A tibble: 3,603 x 6
   District...2 `School-Name`            `Grade-Level` `%-Econ-Disadv`
   <chr>        <chr>                    <chr>                   <dbl>
 1 Abington     Abington Early Educatio~ PK                       30.7
 2 Abington     Woodsdale Elementary Sc~ 3-4                      23.9
 3 Abington     Beaver Brook Elementary  K-2                      22.7
 4 Abington     Abington Middle School   5-8                      25.9
 5 Abington     Abington High            9-12                     23.3
 6 Acushnet     Acushnet Elementary Sch~ PK-4                     26.3
 7 Acushnet     Albert F Ford Middle Sc~ 5-8                      25.1
 8 Agawam       Agawam Early Childhood ~ PK                       33.1
 9 Agawam       Clifford M Granger       K-4                      37.9
10 Agawam       Benjamin J Phelps        K-4                      28.7
# ... with 3,593 more rows, and 2 more variables:
#   `Teachers-Per-100-Students` <dbl>, `Avg-Teacher-Salary` <dbl>

Descriptive Statistics:

Establishing some simple descriptive statistics for:

Average Salary
Teacher to Student Ratio
% Economically Disadvantaged

median(preliminary_school_data_g12$`Avg-Teacher-Salary`,na.rm = T)

[1] 79185

mean(preliminary_school_data_g12$`Avg-Teacher-Salary`, na.rm = T)

[1] 79050.83

sd(preliminary_school_data_g12$`Avg-Teacher-Salary`, na.rm = T)

[1] 8414.818

median(preliminary_school_data_g12$`Teachers-Per-100-Students`,na.rm = T)

[1] 7.9

mean(preliminary_school_data_g12$`Teachers-Per-100-Students`, na.rm = T)

[1] 8.115626

sd(preliminary_school_data_g12$`Teachers-Per-100-Students`, na.rm = T)

[1] 2.019188

median(preliminary_school_data_g12$`%-Econ-Disadv`,na.rm = T)

[1] 28.9

mean(preliminary_school_data_g12$`%-Econ-Disadv`, na.rm = T)

[1] 31.6776

sd(preliminary_school_data_g12$`%-Econ-Disadv`, na.rm = T)

[1] 17.14309

Grouping and Looking for Trends:

Attempt to determine if there is any correlation between the salaries given, the % of disadvantaged children and the teacher/student ratios.

Graph 1: It would appear that, when looking at data grouped by grade, the teacher salary peaks when the ratio of student to teacher is around 7-8 teachers per 100 students.

Graph 2: It would appear that, when looking at the data grouped by district, the salaries decrease dramatically as the ratio of teachers to students increases. That’s a bit surprising because I would think if the community has a greater number of teachers/100 students, they would also be a wealthier community.

Graph 3: It would appear that, when looking at the data grouped by district, the teacher salaries are higher when the % of economic disadvantaged students is lower. This would make sense since the communities with the higher number of disadvantaged students are not communities of wealth. This appears to negatively impact their ability to compensate their teachers.

Graph 4: It would appear that, when looking at the data grouped by district, the student to teacher ratio does not appear to correlate to the % of disadvantaged students.

Graph 5: Not sure what this graph tells us.

# Group the dataset by grade level and summarize that data by the mean of teacher salaries, median for % econ disadvanaged, and the mean teacher/student ratio


by_grade <- preliminary_school_data_g12 %>%
  group_by(`Grade-Level`)
summary_data <- summarize(by_grade, 
                          ratio_teach_stud = mean(`Teachers-Per-100-Students`),
                          avg_dis= median(`%-Econ-Disadv`),
                          avg_salary = mean(`Avg-Teacher-Salary`))
summary_data

# A tibble: 64 x 4
   `Grade-Level` ratio_teach_stud avg_dis avg_salary
   <chr>                    <dbl>   <dbl>      <dbl>
 1 1-12                     13.6     70.6     77074.
 2 1-2                       9.6     11.5     89152 
 3 1-3                       7.72    13.6     77828.
 4 1-4                       6.98    67.8     74716.
 5 1-5                       8.65    36.7     80028.
 6 1-6                       7       35.8     77110 
 7 11-12                     9.9     60.1     83307 
 8 12                       16.2     60.2     63876.
 9 2-11                     25.1     79.3     74759 
10 2-12                     15.6     68.7     85217 
# ... with 54 more rows

# Graph 1: Attempt to plot that to find anything interesting

ggplot(data = summary_data, mapping = aes(x = ratio_teach_stud, y = avg_salary)) +
  geom_point(aes(size = avg_dis), alpha = 1/3) +
  geom_smooth(se = FALSE)

# Do the same by district instead of by grade

by_district <- preliminary_school_data_g12 %>%
  group_by(`District...2`)
summary_dist <- summarize(by_district, 
                          ratio_med = median(`Teachers-Per-100-Students`),
                          avg_dis_2 = median(`%-Econ-Disadv`),
                          avg_salary_2 = mean(`Avg-Teacher-Salary`)) 

# Graph 2: 

ggplot(data = summary_dist, mapping = aes(x = `ratio_med`, y = avg_salary_2)) +
  geom_point(aes(size = ratio_med), alpha = 1/3) +
  geom_smooth(se = FALSE)

# Graph 3:
ggplot(data = summary_dist) + 
  geom_quantile(mapping = aes(x = `avg_dis_2`, y = avg_salary_2), alpha = 1/3)

# Graph 4: 
ggplot(data = summary_dist) + 
  geom_jitter(mapping = aes(x = log(`avg_dis_2`), y = log(`ratio_med`)), alpha = 1/3)

# Graph 5: 
ggplot(data = summary_dist) + 
  geom_violin(mapping = aes(x = `avg_dis_2`, y = avg_salary_2), alpha = 1/3)

preliminary_school_data_g12

# A tibble: 3,603 x 6
   District...2 `School-Name`            `Grade-Level` `%-Econ-Disadv`
   <chr>        <chr>                    <chr>                   <dbl>
 1 Abington     Abington Early Educatio~ PK                       30.7
 2 Abington     Woodsdale Elementary Sc~ 3-4                      23.9
 3 Abington     Beaver Brook Elementary  K-2                      22.7
 4 Abington     Abington Middle School   5-8                      25.9
 5 Abington     Abington High            9-12                     23.3
 6 Acushnet     Acushnet Elementary Sch~ PK-4                     26.3
 7 Acushnet     Albert F Ford Middle Sc~ 5-8                      25.1
 8 Agawam       Agawam Early Childhood ~ PK                       33.1
 9 Agawam       Clifford M Granger       K-4                      37.9
10 Agawam       Benjamin J Phelps        K-4                      28.7
# ... with 3,593 more rows, and 2 more variables:
#   `Teachers-Per-100-Students` <dbl>, `Avg-Teacher-Salary` <dbl>

Comment on this article Share:

HW4