Code
library(tidyverse)
library(readxl)
::opts_chunk$set(echo = TRUE) knitr
Matthew O’Neill
December 1, 2022
For this Homework I decided to use the dataset I plan to work with for the final project. It contains extensive information about all schools in the state from the year 2017. For this assignment, I plan to explore the dataset, build a narrative around it, and begin to identify research questions which could be solved, at least in part, with this data.
# A tibble: 10 × 302
School Co…¹ Schoo…² Schoo…³ Funct…⁴ Conta…⁵ Addre…⁶ Addre…⁷ Town State Zip
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 00010505 Abingt… Public… Princi… Teresa… 201 Gl… <NA> Abin… MA 2351
2 00010003 Beaver… Public… Princi… Cather… 1 Ralp… <NA> Abin… MA 2351
3 00010002 Center… Public… Princi… Lora M… 201 Gl… <NA> Abin… MA 2351
4 00010405 Frolio… Public… Princi… Matthe… 201 Gl… <NA> Abin… MA 2351
5 00010015 Woodsd… Public… Princi… Jonath… 128 Ch… <NA> Abin… MA 2351
6 00030025 Acushn… Public… Princi… Susan … 800 Mi… <NA> Acus… MA 2743
7 00030305 Albert… Public… Princi… Michel… 708 Mi… <NA> Acus… MA 2743
8 00050003 Agawam… Public… Princi… Robin … 108 Pe… <NA> Agaw… MA 1001
9 00050505 Agawam… Public… Princi… Thomas… 760 Co… <NA> Agaw… MA 1001
10 00050405 Agawam… Public… Princi… Norman… 1305 S… Suite 2 Feed… MA 1030
# … with 292 more variables: Phone <chr>, Fax <chr>, Grade <chr>,
# `District Name` <chr>, `District Code` <chr>, PK_Enrollment <dbl>,
# K_Enrollment <dbl>, `1_Enrollment` <dbl>, `2_Enrollment` <dbl>,
# `3_Enrollment` <dbl>, `4_Enrollment` <dbl>, `5_Enrollment` <dbl>,
# `6_Enrollment` <dbl>, `7_Enrollment` <dbl>, `8_Enrollment` <dbl>,
# `9_Enrollment` <dbl>, `10_Enrollment` <dbl>, `11_Enrollment` <dbl>,
# `12_Enrollment` <dbl>, SP_Enrollment <dbl>, TOTAL_Enrollment <dbl>, …
The first step is to read in the data. I printed out the first 10 lines of data to look through and see what we have to work with.
The dataset has 302 columns, which is great. We might not need all of them though, specifically the MCAS scores for different grades accross different skill areas. We also don’t necessarily need all of the information about each specific school and breakdowns by grades.
There are a handful of steps I’d like to take to clean the data into the most useable state. To start, I will reduce the 301 columns just the columns which I am interested in. Simultaneously, I will restrict my dataset to just schools which report graduation data, which will remove schools which did not report and remove primary and middle schools, since the focus our of research will be high schools and graduation related data,
# A tibble: 376 × 53
School …¹ Schoo…² Zip Grade Distr…³ Distr…⁴ 9_Enr…⁵ 10_En…⁶ 11_En…⁷ 12_En…⁸
<chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 00010505 Abingt… 2351 09,1… Abingt… 000100… 124 109 123 92
2 00050505 Agawam… 1001 09,1… Agawam 000500… 299 309 293 315
3 00070505 Amesbu… 1913 09,1… Amesbu… 000700… 147 138 145 163
4 00070515 Amesbu… 1913 09,1… Amesbu… 000700… 7 2 11 11
5 00090505 Andove… 1810 09,1… Andover 000900… 446 459 421 462
6 00100505 Arling… 2476 09,1… Arling… 001000… 332 350 312 295
7 00140505 Ashlan… 1721 09,1… Ashland 001400… 202 172 183 187
8 00160515 Attleb… 2703 09,1… Attleb… 001600… 0 14 35 14
9 00160505 Attleb… 2703 09,1… Attleb… 001600… 449 401 419 392
10 00170505 Auburn… 1501 PK,0… Auburn 001700… 174 197 175 157
# … with 366 more rows, 43 more variables: SP_Enrollment <dbl>,
# TOTAL_Enrollment <dbl>, `First Language Not English` <dbl>,
# `English Language Learner` <dbl>, `Students With Disabilities` <dbl>,
# `High Needs` <dbl>, `Economically Disadvantaged` <dbl>,
# `% African American` <dbl>, `% Asian` <dbl>, `% Hispanic` <dbl>,
# `% White` <dbl>, `% Native American` <dbl>,
# `% Native Hawaiian, Pacific Islander` <dbl>, …
Next, let’s mutate the data to get totals for each racial group. It will be useful to have the specific totals for each group on hand as we will be breaking them up into districts with different levels of spending, teach salary, and class size.
hs <- mutate(hs,
`African American Students` = round((`% African American`/100)*`Number of Students`),
`Asian Students` = round((`% Asian`/100)*`Number of Students`),
`Hispanic Students` = round((`% Hispanic`/100)*`Number of Students`),
`White Students` = round((`% White`/100)*`Number of Students`),
`Native American Students` = round((`% Native American`/100)*`Number of Students`),
`Native Hawaiian, Pacific Islander` = round((`% Native Hawaiian, Pacific Islander`/100)*`Number of Students`),
`Multi-Race, Non-Hispanic` = round((`% Multi-Race, Non-Hispanic`/100)*`Number of Students`),
`Male Students` = round((`% Males`/100)*`Number of Students`),
`Female Students` = round((`% Females`/100)*`Number of Students`)
)
hs
# A tibble: 376 × 62
School …¹ Schoo…² Zip Grade Distr…³ Distr…⁴ 9_Enr…⁵ 10_En…⁶ 11_En…⁷ 12_En…⁸
<chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 00010505 Abingt… 2351 09,1… Abingt… 000100… 124 109 123 92
2 00050505 Agawam… 1001 09,1… Agawam 000500… 299 309 293 315
3 00070505 Amesbu… 1913 09,1… Amesbu… 000700… 147 138 145 163
4 00070515 Amesbu… 1913 09,1… Amesbu… 000700… 7 2 11 11
5 00090505 Andove… 1810 09,1… Andover 000900… 446 459 421 462
6 00100505 Arling… 2476 09,1… Arling… 001000… 332 350 312 295
7 00140505 Ashlan… 1721 09,1… Ashland 001400… 202 172 183 187
8 00160515 Attleb… 2703 09,1… Attleb… 001600… 0 14 35 14
9 00160505 Attleb… 2703 09,1… Attleb… 001600… 449 401 419 392
10 00170505 Auburn… 1501 PK,0… Auburn 001700… 174 197 175 157
# … with 366 more rows, 52 more variables: SP_Enrollment <dbl>,
# TOTAL_Enrollment <dbl>, `First Language Not English` <dbl>,
# `English Language Learner` <dbl>, `Students With Disabilities` <dbl>,
# `High Needs` <dbl>, `Economically Disadvantaged` <dbl>,
# `% African American` <dbl>, `% Asian` <dbl>, `% Hispanic` <dbl>,
# `% White` <dbl>, `% Native American` <dbl>,
# `% Native Hawaiian, Pacific Islander` <dbl>, …
The data is a lot easier to work with now. In the future, we may want to add a variable breaking up salary and expenditures into brackets to see socioeconomic makeup on schools in each bracket, but for now, the data is clean.
This is a comprehensive dataset of every state funded public school in Massachusetts for the fiscal year 2017. Each row of data represents a school and each school has a large variety of useful data ranging for finances, socioeconomic makeup, test scores, enrollment data, and high school outcomes. Below is a breakdown of that data.
It’s important to discuss some limitations that this dataset has.
Although it is very compehensive, it is relatively old. However, the questions I envision myself attemtping to research are not very time sensitive and could still produce useful results.
The data is also specific to Massachusetts Public Schools, which means there will be relatively fewer schools with extreme class sizes, like you might find in an area with very high population density. There will also be relatively higher average salaries and expenditures accross the board, since Massachusetts is one of the wealthiest states in the country.
That said, value can still be extracted from researching this data as it can inform local education decision making.
I chose this dataset in particular because at one time, I planned to become a high school Math teacher. I was very interested in the impact that salary and class size had on the academic outcomes of students, and how much access students from different socioeconomic backgrounds had to higher quality education.
Two formal research questions I plan to answer in the final project are:
---
title: "Homework 2"
author: "Matthew O'Neill"
desription: "MA School Distict Data Exploration"
date: "12/01/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw1
- Matthew O'Neill
- MA_Public_Schools_2017
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
For this Homework I decided to use the dataset I plan to work with for the final project. It contains extensive information about all schools in the state from the year 2017. For this assignment, I plan to explore the dataset, build a narrative around it, and begin to identify research questions which could be solved, at least in part, with this data.
## Reading in Data
```{r}
data<- read_csv("_data/MA_Public_Schools_2017.csv",show_col_types = FALSE)
head(data,10)
```
The first step is to read in the data. I printed out the first 10 lines of data to look through and see what we have to work with.
The dataset has 302 columns, which is great. We might not need all of them though, specifically the MCAS scores for different grades accross different skill areas. We also don't necessarily need all of the information about each specific school and breakdowns by grades.
## Cleaning the Data
There are a handful of steps I'd like to take to clean the data into the most useable state. To start, I will reduce the 301 columns just the columns which I am interested in. Simultaneously, I will restrict my dataset to just schools which report graduation data, which will remove schools which did not report and remove primary and middle schools, since the focus our of research will be high schools and graduation related data,
```{r}
keeps <- c(1,2,10,13,14,15,26,27,28,29,30,31,32,34,36,38,40,42,43,44,45,46,47,48,49,50,51,53,55,62,64,68,70,71,72,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97)
hs <- data[keeps]%>%
subset(! is.na(data$`% Graduated`))
hs
```
Next, let's mutate the data to get totals for each racial group. It will be useful to have the specific totals for each group on hand as we will be breaking them up into districts with different levels of spending, teach salary, and class size.
```{r}
hs <- mutate(hs,
`African American Students` = round((`% African American`/100)*`Number of Students`),
`Asian Students` = round((`% Asian`/100)*`Number of Students`),
`Hispanic Students` = round((`% Hispanic`/100)*`Number of Students`),
`White Students` = round((`% White`/100)*`Number of Students`),
`Native American Students` = round((`% Native American`/100)*`Number of Students`),
`Native Hawaiian, Pacific Islander` = round((`% Native Hawaiian, Pacific Islander`/100)*`Number of Students`),
`Multi-Race, Non-Hispanic` = round((`% Multi-Race, Non-Hispanic`/100)*`Number of Students`),
`Male Students` = round((`% Males`/100)*`Number of Students`),
`Female Students` = round((`% Females`/100)*`Number of Students`)
)
hs
```
The data is a lot easier to work with now. In the future, we may want to add a variable breaking up salary and expenditures into brackets to see socioeconomic makeup on schools in each bracket, but for now, the data is clean.
## Narrative
This is a comprehensive dataset of every state funded public school in Massachusetts for the fiscal year 2017. Each row of data represents a school and each school has a large variety of useful data ranging for finances, socioeconomic makeup, test scores, enrollment data, and high school outcomes. Below is a breakdown of that data.
* Financial Data
+ Teacher Salaries
+ Total Expenditures
+ Expenditures Per Pupil
* Socioeconomic Makeup of Student Body
+ Percentage of students of different racial background
+ Percentage of economically disadvantaged students
+ Percentage of Disabled studnets
+ Percentage Male vs Female Students
* Enrollment Data
+ Enrollment by Grade
+ Average Class Size
* Test Scores
+ SAT Scores w/ Subject Breakdown
+ MCAS Scores w/ Subject Breakdown
+ AP Scores and Number of AP Test Takers
* High School Outcome
+ Graduation Rate
+ Dropout Rate
+ Rate of Secondary Education
It's important to discuss some limitations that this dataset has.
Although it is very compehensive, it is relatively old. However, the questions I envision myself attemtping to research are not very time sensitive and could still produce useful results.
The data is also specific to Massachusetts Public Schools, which means there will be relatively fewer schools with extreme class sizes, like you might find in an area with very high population density. There will also be relatively higher average salaries and expenditures accross the board, since Massachusetts is one of the wealthiest states in the country.
That said, value can still be extracted from researching this data as it can inform local education decision making.
## Potential Research Questions
I chose this dataset in particular because at one time, I planned to become a high school Math teacher. I was very interested in the impact that salary and class size had on the academic outcomes of students, and how much access students from different socioeconomic backgrounds had to higher quality education.
Two formal research questions I plan to answer in the final project are:
* How does average class size and average distict expenditure impact graduation rates and rates of enrollment in higher education?
* Do studnets in different socioeconomic backgrounds have differing levels of access to smaller class sizes and high expenditure?