DACSS 601: Data Science Fundamentals - FALL 2022

Homework 2

hw1
Matthew O’Neill
MA_Public_Schools_2017
Author

Matthew O’Neill

Published

December 1, 2022

Code
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE)

Introduction

For this homework, I decided to use the dataset I plan to work with for the final project. It contains extensive information about every public school in Massachusetts for 2017. For this assignment, I plan to explore the dataset, build a narrative around it, and begin to identify research questions that could be answered, at least in part, with this data.

Reading in Data

Code
data <- read_csv("_data/MA_Public_Schools_2017.csv", show_col_types = FALSE)

head(data, 10)
# A tibble: 10 × 302
   School Co…¹ Schoo…² Schoo…³ Funct…⁴ Conta…⁵ Addre…⁶ Addre…⁷ Town  State   Zip
   <chr>       <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr> <chr> <dbl>
 1 00010505    Abingt… Public… Princi… Teresa… 201 Gl… <NA>    Abin… MA     2351
 2 00010003    Beaver… Public… Princi… Cather… 1 Ralp… <NA>    Abin… MA     2351
 3 00010002    Center… Public… Princi… Lora M… 201 Gl… <NA>    Abin… MA     2351
 4 00010405    Frolio… Public… Princi… Matthe… 201 Gl… <NA>    Abin… MA     2351
 5 00010015    Woodsd… Public… Princi… Jonath… 128 Ch… <NA>    Abin… MA     2351
 6 00030025    Acushn… Public… Princi… Susan … 800 Mi… <NA>    Acus… MA     2743
 7 00030305    Albert… Public… Princi… Michel… 708 Mi… <NA>    Acus… MA     2743
 8 00050003    Agawam… Public… Princi… Robin … 108 Pe… <NA>    Agaw… MA     1001
 9 00050505    Agawam… Public… Princi… Thomas… 760 Co… <NA>    Agaw… MA     1001
10 00050405    Agawam… Public… Princi… Norman… 1305 S… Suite 2 Feed… MA     1030
# … with 292 more variables: Phone <chr>, Fax <chr>, Grade <chr>,
#   `District Name` <chr>, `District Code` <chr>, PK_Enrollment <dbl>,
#   K_Enrollment <dbl>, `1_Enrollment` <dbl>, `2_Enrollment` <dbl>,
#   `3_Enrollment` <dbl>, `4_Enrollment` <dbl>, `5_Enrollment` <dbl>,
#   `6_Enrollment` <dbl>, `7_Enrollment` <dbl>, `8_Enrollment` <dbl>,
#   `9_Enrollment` <dbl>, `10_Enrollment` <dbl>, `11_Enrollment` <dbl>,
#   `12_Enrollment` <dbl>, SP_Enrollment <dbl>, TOTAL_Enrollment <dbl>, …

The first step is to read in the data. I printed the first 10 rows to look through and see what we have to work with.

The dataset has 302 columns, which gives us a lot to work with. We might not need all of them, though, in particular the MCAS scores for different grades across different skill areas. We also don’t necessarily need all of the details about each specific school or the breakdowns by grade.
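One way to see which columns could be dropped is to search the column names by pattern. The sketch below uses a small made-up tibble in place of `data`; the MCAS-style column names in it are hypothetical stand-ins, not the real headers from the file.

```r
library(tibble)
library(stringr)

# Toy stand-in for `data`; column names and values are illustrative only.
toy <- tibble(
  `School Code` = c("00010505", "00010003"),
  TOTAL_Enrollment = c(448, 322),
  MCAS_3rdGrade_Math = c(NA, 61),
  MCAS_10thGrade_English = c(93, NA)
)

# Number of rows, then number of columns.
dim(toy)

# Names of every column that mentions MCAS.
mcas_cols <- str_subset(names(toy), "MCAS")
mcas_cols

# Share of all columns that are MCAS-related.
mcas_share <- length(mcas_cols) / ncol(toy)
mcas_share
```

Running the same `str_subset()` call on `names(data)` would list the real MCAS columns, assuming their names contain the string "MCAS".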

Cleaning the Data

There are a handful of steps I’d like to take to get the data into its most usable state. To start, I will reduce the 302 columns to just the columns I am interested in. At the same time, I will restrict the dataset to schools that report graduation data. This removes schools that did not report as well as primary and middle schools, since the focus of our research will be high schools and graduation-related data.

Code
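# Positions of the columns to keep
keeps <- c(1,2,10,13,14,15,26,27,28,29,30,31,32,34,36,38,40,42,43,44,45,46,47,48,49,50,51,53,55,62,64,68,70,71,72,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97)

# Keep only schools that report a graduation rate, then select the columns of interest
hs <- data %>%
  filter(!is.na(`% Graduated`)) %>%
  select(all_of(keeps))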

hs
# A tibble: 376 × 53
   School …¹ Schoo…²   Zip Grade Distr…³ Distr…⁴ 9_Enr…⁵ 10_En…⁶ 11_En…⁷ 12_En…⁸
   <chr>     <chr>   <dbl> <chr> <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>
 1 00010505  Abingt…  2351 09,1… Abingt… 000100…     124     109     123      92
 2 00050505  Agawam…  1001 09,1… Agawam  000500…     299     309     293     315
 3 00070505  Amesbu…  1913 09,1… Amesbu… 000700…     147     138     145     163
 4 00070515  Amesbu…  1913 09,1… Amesbu… 000700…       7       2      11      11
 5 00090505  Andove…  1810 09,1… Andover 000900…     446     459     421     462
 6 00100505  Arling…  2476 09,1… Arling… 001000…     332     350     312     295
 7 00140505  Ashlan…  1721 09,1… Ashland 001400…     202     172     183     187
 8 00160515  Attleb…  2703 09,1… Attleb… 001600…       0      14      35      14
 9 00160505  Attleb…  2703 09,1… Attleb… 001600…     449     401     419     392
10 00170505  Auburn…  1501 PK,0… Auburn  001700…     174     197     175     157
# … with 366 more rows, 43 more variables: SP_Enrollment <dbl>,
#   TOTAL_Enrollment <dbl>, `First Language Not English` <dbl>,
#   `English Language Learner` <dbl>, `Students With Disabilities` <dbl>,
#   `High Needs` <dbl>, `Economically Disadvantaged` <dbl>,
#   `% African American` <dbl>, `% Asian` <dbl>, `% Hispanic` <dbl>,
#   `% White` <dbl>, `% Native American` <dbl>,
#   `% Native Hawaiian, Pacific Islander` <dbl>, …
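Selecting by position works, but a vector like `keeps` is hard to audit and breaks silently if the column order ever changes. A possible alternative is to select by name. The sketch below shows the idea on a made-up tibble; `keep_cols` and the toy values are assumptions for illustration, and only `School Code` and `% Graduated` are names taken from the real data.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `data`; values are made up for illustration.
toy <- tibble(
  `School Code` = c("A", "B", "C"),
  `School Name` = c("Elementary X", "High School Y", "High School Z"),
  Fax = c("1", "2", "3"),
  `% Graduated` = c(NA, 91.2, 78.5)
)

# Naming the columns documents the choice in the code itself.
keep_cols <- c("School Code", "School Name", "% Graduated")

toy_hs <- toy %>%
  filter(!is.na(`% Graduated`)) %>%  # drop schools with no graduation data
  select(all_of(keep_cols))          # all_of() errors if a named column is missing

toy_hs
```

The trade-off is a longer vector to type for 53 columns, in exchange for code that states which variables the analysis depends on.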

Next, let’s mutate the data to get student counts for each racial group and for male and female students. It will be useful to have these counts on hand, as we will be grouping schools by districts with different levels of spending, teacher salary, and class size.

Code
# Convert each percentage into an approximate student count
hs <- mutate(hs,
  `African American Students` = round((`% African American`/100)*`Number of Students`),
  `Asian Students` = round((`% Asian`/100)*`Number of Students`),
  `Hispanic Students` = round((`% Hispanic`/100)*`Number of Students`),
  `White Students` = round((`% White`/100)*`Number of Students`),
  `Native American Students` = round((`% Native American`/100)*`Number of Students`),
  `Native Hawaiian, Pacific Islander` = round((`% Native Hawaiian, Pacific Islander`/100)*`Number of Students`),
  `Multi-Race, Non-Hispanic` = round((`% Multi-Race, Non-Hispanic`/100)*`Number of Students`),
  `Male Students` = round((`% Males`/100)*`Number of Students`),
  `Female Students` = round((`% Females`/100)*`Number of Students`)
)

hs
# A tibble: 376 × 62
   School …¹ Schoo…²   Zip Grade Distr…³ Distr…⁴ 9_Enr…⁵ 10_En…⁶ 11_En…⁷ 12_En…⁸
   <chr>     <chr>   <dbl> <chr> <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>
 1 00010505  Abingt…  2351 09,1… Abingt… 000100…     124     109     123      92
 2 00050505  Agawam…  1001 09,1… Agawam  000500…     299     309     293     315
 3 00070505  Amesbu…  1913 09,1… Amesbu… 000700…     147     138     145     163
 4 00070515  Amesbu…  1913 09,1… Amesbu… 000700…       7       2      11      11
 5 00090505  Andove…  1810 09,1… Andover 000900…     446     459     421     462
 6 00100505  Arling…  2476 09,1… Arling… 001000…     332     350     312     295
 7 00140505  Ashlan…  1721 09,1… Ashland 001400…     202     172     183     187
 8 00160515  Attleb…  2703 09,1… Attleb… 001600…       0      14      35      14
 9 00160505  Attleb…  2703 09,1… Attleb… 001600…     449     401     419     392
10 00170505  Auburn…  1501 PK,0… Auburn  001700…     174     197     175     157
# … with 366 more rows, 52 more variables: SP_Enrollment <dbl>,
#   TOTAL_Enrollment <dbl>, `First Language Not English` <dbl>,
#   `English Language Learner` <dbl>, `Students With Disabilities` <dbl>,
#   `High Needs` <dbl>, `Economically Disadvantaged` <dbl>,
#   `% African American` <dbl>, `% Asian` <dbl>, `% Hispanic` <dbl>,
#   `% White` <dbl>, `% Native American` <dbl>,
#   `% Native Hawaiian, Pacific Islander` <dbl>, …
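Because each new column applies the same formula to a different percentage column, the repetition could also be expressed with `across()`. This is a sketch on made-up numbers, not the code used above. It names the new columns with a `(count)` suffix rather than the names chosen in this post, and it would convert every column starting with `% `, so on the real data the selection would need to be narrowed to the demographic columns.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `hs`; values are made up for illustration.
toy <- tibble(
  `Number of Students` = c(200, 1000),
  `% Asian` = c(10, 2.5),
  `% White` = c(60.4, 80)
)

toy_counts <- toy %>%
  mutate(across(
    starts_with("% "),                         # every percentage column
    ~ round(.x / 100 * `Number of Students`),  # percentage -> head count
    .names = "{.col} (count)"                  # e.g. "% Asian (count)"
  ))

toy_counts
```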

The data is a lot easier to work with now. In the future, we may want to add a variable that breaks salary and expenditures into brackets to see the socioeconomic makeup of schools in each bracket, but for now, the data is clean.
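As a rough idea of what such a bracket variable might look like, the sketch below splits a spending column into three equal-sized groups with `cut()` and `quantile()`. The column name `per_pupil`, the tercile split, and the values are all assumptions for illustration; the real column name and a sensible number of brackets would need to be checked against the data.

```r
library(dplyr)
library(tibble)

# Toy data; `per_pupil` is a hypothetical spending column with made-up values.
toy <- tibble(
  school = c("A", "B", "C", "D", "E", "F"),
  per_pupil = c(11000, 12500, 14000, 15500, 18000, 22000)
)

toy_brackets <- toy %>%
  mutate(
    spending_bracket = cut(
      per_pupil,
      breaks = quantile(per_pupil, probs = c(0, 1/3, 2/3, 1)),  # tercile cut points
      labels = c("Low", "Middle", "High"),
      include.lowest = TRUE  # keep the minimum value in the first bracket
    )
  )

# Number of schools in each bracket.
count(toy_brackets, spending_bracket)
```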

Narrative

This is a comprehensive dataset of every state-funded public school in Massachusetts for fiscal year 2017. Each row represents a school, and each school has a wide variety of useful data, ranging from finances and socioeconomic makeup to test scores, enrollment, and high school outcomes. Below is a breakdown of that data.

  • Financial Data
    • Teacher Salaries
    • Total Expenditures
    • Expenditures Per Pupil
  • Socioeconomic Makeup of Student Body
    • Percentage of students of different racial backgrounds
    • Percentage of economically disadvantaged students
    • Percentage of students with disabilities
    • Percentage Male vs Female Students
  • Enrollment Data
    • Enrollment by Grade
    • Average Class Size
  • Test Scores
    • SAT Scores w/ Subject Breakdown
    • MCAS Scores w/ Subject Breakdown
    • AP Scores and Number of AP Test Takers
  • High School Outcome
    • Graduation Rate
    • Dropout Rate
    • Rate of Enrollment in Higher Education

It’s important to discuss some limitations that this dataset has.

Although it is very comprehensive, it is relatively old. However, the questions I envision myself attempting to research are not very time-sensitive, so the data could still produce useful results.

The data is also specific to Massachusetts public schools, which means there will be relatively few schools with extreme class sizes, like you might find in an area with very high population density. Average salaries and expenditures will also be relatively high across the board, since Massachusetts is one of the wealthiest states in the country.

That said, value can still be extracted from researching this data, as it can inform local education decision-making.

Potential Research Questions

I chose this dataset in particular because, at one time, I planned to become a high school math teacher. I was very interested in the impact that salary and class size have on students’ academic outcomes, and in how much access students from different socioeconomic backgrounds have to higher-quality education.

Two formal research questions I plan to answer in the final project are:

  • How do average class size and average district expenditure affect graduation rates and rates of enrollment in higher education?
  • Do students from different socioeconomic backgrounds have differing levels of access to smaller class sizes and higher expenditures?
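A first pass at the first question could look something like the sketch below, which checks correlations and fits a simple linear model. Everything here is an assumption for illustration: the column names `class_size`, `per_pupil`, and `pct_graduated` are hypothetical, and the values are made up, so the output says nothing about Massachusetts schools. A model like this would also describe association rather than causal impact.

```r
library(dplyr)
library(tibble)

# Made-up values; column names are hypothetical stand-ins for the real ones.
toy <- tibble(
  class_size = c(12, 15, 18, 20, 22, 25),
  per_pupil = c(21000, 18000, 16000, 15000, 14000, 12000),
  pct_graduated = c(95, 93, 90, 88, 85, 80)
)

# Pairwise correlation of each predictor with the graduation rate.
cors <- toy %>%
  summarise(
    cor_class_size = cor(class_size, pct_graduated),
    cor_per_pupil = cor(per_pupil, pct_graduated)
  )
cors

# Linear model with both predictors; coefficients are per unit of each predictor.
fit <- lm(pct_graduated ~ class_size + per_pupil, data = toy)
coef(fit)
```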