Statistical Review and Visualization of School Compensations Data from 20218-19
The Massachusetts Department of Elementary and Secondary Education publishes data related to teacher pay rates. These are made available at https://www.doe.mass.edu/SchDistrictData.html.
My goal was to examine data by providing a statistical overview and visualizations.
I like to verify that my working directory and library paths are the same and that I have the correct packages installed.
# Verify library path
.libPaths()
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
[2] "C:/Program Files/R/R-4.1.2/library"
# set the working directory to be the same as the library path
setwd("C:/Users/theresa/Documents/R/win-library/4.1")
# verify the working directory
getwd()
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
# Installing Tidyverse and readxl packages with explicitly defining the URL of where it lives. This is to get around a Mirror error.
# install.packages("tidyverse", repos = "http://cran.us.r-project.org")
# install.packages("readxl", repos = "http://cran.us.r-project.org")
# load the necessary libraries for the processing
library(tidyverse)
library(dplyr)
library(readxl)
library(readr)
library(stringr)
library(ggplot2)
library(quantreg)
Load in the full dataset, skipping the header rows and cleaning the column names to populate spaces.
preliminary_school_data_hs <-read_csv("c:/users/theresa/Documents/DACSS Local/DataSets/preliminary-school-ppx.csv", skip = 3)
# To create easier field names, replace the spaces with a dash in the original column names.
names(preliminary_school_data_hs) <-str_replace_all(names(preliminary_school_data_hs), c(" "="-"))
The teacher data included the 2018-2019 salary information for all school levels pre-k - 12. The data also included many other operational expenditures for each school.
The financial data for this analysis needed to be reduced to only relevant data: District Name School Name Represented Grades % Economically Disadvantaged Student Body Teacher Salary Guidance Dept Salary
The first step was to reduce the data to the relevant fields and clean any non-numeric values.
Missing data was filled with the median for the field.
# Reduce the data set to the relevant columns
preliminary_school_data_g12 <- preliminary_school_data_hs %>%
select(District...2,`School-Name`, `Grade-Level`, `%-Econ-Disadv`, `Teachers-Per-100-Students`,`Avg-Teacher-Salary`,)
preliminary_school_data_g12
# A tibble: 3,603 x 6
District...2 `School-Name` `Grade-Level` `%-Econ-Disadv`
<chr> <chr> <chr> <chr>
1 Abington Abington Early Educatio~ PK 30.7%
2 Abington Woodsdale Elementary Sc~ 3-4 23.9%
3 Abington Beaver Brook Elementary K-2 22.7%
4 Abington Abington Middle School 5-8 25.9%
5 Abington Abington High 9-12 23.3%
6 Acushnet Acushnet Elementary Sch~ PK-4 26.3%
7 Acushnet Albert F Ford Middle Sc~ 5-8 25.1%
8 Agawam Agawam Early Childhood ~ PK 33.1%
9 Agawam Clifford M Granger K-4 37.9%
10 Agawam Benjamin J Phelps K-4 28.7%
# ... with 3,593 more rows, and 2 more variables:
# `Teachers-Per-100-Students` <chr>, `Avg-Teacher-Salary` <chr>
# Convert the numbers from character to numeric
preliminary_school_data_g12$`Avg-Teacher-Salary` <- parse_number(preliminary_school_data_g12$`Avg-Teacher-Salary`)
preliminary_school_data_g12$`Teachers-Per-100-Students` <- parse_number(preliminary_school_data_g12$`Teachers-Per-100-Students`)
preliminary_school_data_g12$`%-Econ-Disadv` <- parse_number(preliminary_school_data_g12$`%-Econ-Disadv`)
preliminary_school_data_g12
# A tibble: 3,603 x 6
District...2 `School-Name` `Grade-Level` `%-Econ-Disadv`
<chr> <chr> <chr> <dbl>
1 Abington Abington Early Educatio~ PK 30.7
2 Abington Woodsdale Elementary Sc~ 3-4 23.9
3 Abington Beaver Brook Elementary K-2 22.7
4 Abington Abington Middle School 5-8 25.9
5 Abington Abington High 9-12 23.3
6 Acushnet Acushnet Elementary Sch~ PK-4 26.3
7 Acushnet Albert F Ford Middle Sc~ 5-8 25.1
8 Agawam Agawam Early Childhood ~ PK 33.1
9 Agawam Clifford M Granger K-4 37.9
10 Agawam Benjamin J Phelps K-4 28.7
# ... with 3,593 more rows, and 2 more variables:
# `Teachers-Per-100-Students` <dbl>, `Avg-Teacher-Salary` <dbl>
# Globally replace any value that wasn't provided (was a non-numeric), replace it with the median of all the other values in that instance.
preliminary_school_data_g12 <- preliminary_school_data_g12 %>%
mutate_if(is.numeric, function(x) ifelse(is.na(x), median(x, na.rm = T), x))
preliminary_school_data_g12
# A tibble: 3,603 x 6
District...2 `School-Name` `Grade-Level` `%-Econ-Disadv`
<chr> <chr> <chr> <dbl>
1 Abington Abington Early Educatio~ PK 30.7
2 Abington Woodsdale Elementary Sc~ 3-4 23.9
3 Abington Beaver Brook Elementary K-2 22.7
4 Abington Abington Middle School 5-8 25.9
5 Abington Abington High 9-12 23.3
6 Acushnet Acushnet Elementary Sch~ PK-4 26.3
7 Acushnet Albert F Ford Middle Sc~ 5-8 25.1
8 Agawam Agawam Early Childhood ~ PK 33.1
9 Agawam Clifford M Granger K-4 37.9
10 Agawam Benjamin J Phelps K-4 28.7
# ... with 3,593 more rows, and 2 more variables:
# `Teachers-Per-100-Students` <dbl>, `Avg-Teacher-Salary` <dbl>
Establishing some simple descriptive statistics for:
Average Salary
Teacher to Student Ratio
% Economically Disadvantaged
median(preliminary_school_data_g12$`Avg-Teacher-Salary`,na.rm = T)
[1] 79185
mean(preliminary_school_data_g12$`Avg-Teacher-Salary`, na.rm = T)
[1] 79050.83
sd(preliminary_school_data_g12$`Avg-Teacher-Salary`, na.rm = T)
[1] 8414.818
median(preliminary_school_data_g12$`Teachers-Per-100-Students`,na.rm = T)
[1] 7.9
mean(preliminary_school_data_g12$`Teachers-Per-100-Students`, na.rm = T)
[1] 8.115626
sd(preliminary_school_data_g12$`Teachers-Per-100-Students`, na.rm = T)
[1] 2.019188
median(preliminary_school_data_g12$`%-Econ-Disadv`,na.rm = T)
[1] 28.9
mean(preliminary_school_data_g12$`%-Econ-Disadv`, na.rm = T)
[1] 31.6776
sd(preliminary_school_data_g12$`%-Econ-Disadv`, na.rm = T)
[1] 17.14309
Attempt to determine if there is any correlation between the salaries given, the % of disadvantaged children and the teacher/student ratios.
Graph 1: It would appear that, when looking at data grouped by grade, the teacher salary peaks when the ratio of student to teacher is around 7-8 teachers per 100 students.
Graph 2: It would appear that, when looking at the data grouped by district, the salaries decrease dramatically as the ratio of teachers to students increases. That’s a bit surprising because I would think if the community has a greater number of teachers/100 students, they would also be a wealthier community.
Graph 3: It would appear that, when looking at the data grouped by district, the teacher salaries are higher when the % of economic disadvantaged students is lower. This would make sense since the communities with the higher number of disadvantaged students are not communities of wealth. This appears to negatively impact their ability to compensate their teachers.
Graph 4: It would appear that, when looking at the data grouped by district, the student to teacher ratio does not appear to correlate to the % of disadvantaged students.
Graph 5: Not sure what this graph tells us.
# Group the dataset by grade level and summarize that data by the mean of teacher salaries, median for % econ disadvanaged, and the mean teacher/student ratio
by_grade <- preliminary_school_data_g12 %>%
group_by(`Grade-Level`)
summary_data <- summarize(by_grade,
ratio_teach_stud = mean(`Teachers-Per-100-Students`),
avg_dis= median(`%-Econ-Disadv`),
avg_salary = mean(`Avg-Teacher-Salary`))
summary_data
# A tibble: 64 x 4
`Grade-Level` ratio_teach_stud avg_dis avg_salary
<chr> <dbl> <dbl> <dbl>
1 1-12 13.6 70.6 77074.
2 1-2 9.6 11.5 89152
3 1-3 7.72 13.6 77828.
4 1-4 6.98 67.8 74716.
5 1-5 8.65 36.7 80028.
6 1-6 7 35.8 77110
7 11-12 9.9 60.1 83307
8 12 16.2 60.2 63876.
9 2-11 25.1 79.3 74759
10 2-12 15.6 68.7 85217
# ... with 54 more rows
# Graph 1: Attempt to plot that to find anything interesting
ggplot(data = summary_data, mapping = aes(x = ratio_teach_stud, y = avg_salary)) +
geom_point(aes(size = avg_dis), alpha = 1/3) +
geom_smooth(se = FALSE)
# Do the same by district instead of by grade
by_district <- preliminary_school_data_g12 %>%
group_by(`District...2`)
summary_dist <- summarize(by_district,
ratio_med = median(`Teachers-Per-100-Students`),
avg_dis_2 = median(`%-Econ-Disadv`),
avg_salary_2 = mean(`Avg-Teacher-Salary`))
# Graph 2:
ggplot(data = summary_dist, mapping = aes(x = `ratio_med`, y = avg_salary_2)) +
geom_point(aes(size = ratio_med), alpha = 1/3) +
geom_smooth(se = FALSE)
# Graph 3:
ggplot(data = summary_dist) +
geom_quantile(mapping = aes(x = `avg_dis_2`, y = avg_salary_2), alpha = 1/3)
# Graph 4:
ggplot(data = summary_dist) +
geom_jitter(mapping = aes(x = log(`avg_dis_2`), y = log(`ratio_med`)), alpha = 1/3)
# Graph 5:
ggplot(data = summary_dist) +
geom_violin(mapping = aes(x = `avg_dis_2`, y = avg_salary_2), alpha = 1/3)
preliminary_school_data_g12
# A tibble: 3,603 x 6
District...2 `School-Name` `Grade-Level` `%-Econ-Disadv`
<chr> <chr> <chr> <dbl>
1 Abington Abington Early Educatio~ PK 30.7
2 Abington Woodsdale Elementary Sc~ 3-4 23.9
3 Abington Beaver Brook Elementary K-2 22.7
4 Abington Abington Middle School 5-8 25.9
5 Abington Abington High 9-12 23.3
6 Acushnet Acushnet Elementary Sch~ PK-4 26.3
7 Acushnet Albert F Ford Middle Sc~ 5-8 25.1
8 Agawam Agawam Early Childhood ~ PK 33.1
9 Agawam Clifford M Granger K-4 37.9
10 Agawam Benjamin J Phelps K-4 28.7
# ... with 3,593 more rows, and 2 more variables:
# `Teachers-Per-100-Students` <dbl>, `Avg-Teacher-Salary` <dbl>
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
TLamkin (2022, Feb. 23). Data Analytics and Computational Social Science: HW4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin869562/
BibTeX citation
@misc{tlamkin2022hw4, author = {TLamkin, }, title = {Data Analytics and Computational Social Science: HW4}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin869562/}, year = {2022} }