DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Homework #2 Naughton

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Description of the Data
    • Homework 2: Massachusetts Public School Enrollment
    • Potential Research Questions

Homework #2 Naughton

  • Show All Code
  • Hide All Code

  • View Source
Courtney Naughton
Author

Courtney Naughton

Published

October 11, 2022

Code
library(tidyverse)
library(tibble)
library(dplyr)
library(stringr)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Description of the Data

Homework 2: Massachusetts Public School Enrollment

This dataset is enrollment of students at Massachusetts Public Schools from 2015 - 2021.

The dataset has 70 columns and 70,138 rows. The variables include RowID, State, County, District, and school name. There are state IDs for each of these areas. There are columns that describe the school type as well as the low and high grade for each school. Other important variables are the year, term, and grade. There are also demographics broken into gender (male, female, non-binary, gender unknown) and race (white, black, hispanic, native_american, ative_american_alaskan_native, asian, native_hawaiian, asian_pacific_islander, native_hawaiian_pacific_islander, multiracial, unknown_race) and other details such as ELL, Homeless, low income, free and reduced, disability, section 504. This information is given twice - the columns are repeated with int at the end - I believe this is converting to integer form. Those variables are all double, but the columns with int are characters. I want to get rid of columns “district_state_type,” “county_state_id,” “term,” and “school_state_type” because they are all N/A. I will also take out the demographic columns that don’t end in int because I would prefer the other form.

Code
enrollment <- read_csv("_data/enrollments_Naughton.csv")
Code
enrollment_new <- select(enrollment, 
                         rowid, 
                         state, 
                         county, 
                         admin_level, 
                         district_nces_id, 
                         district_state_id, 
                         district, 
                         school, 
                         ccd_school_type, 
                         ccd_charter_school, 
                         ccd_school_lvl, 
                         ccd_school_low_grade, 
                         ccd_school_high_grade, 
                         year, 
                         grade, 
                         ends_with("_int") )
enrollment_new

I first want to look at Boston’s enrollment. When I did this, I noticed that in 2021, enrollment for schools were listed as “Boston - School Name” so I will need to take out “Boston -” so that the name of the school is consistent with the other years.

Code
enrollment_Boston8 <- enrollment_new  %>%
  filter( district =="Boston", grade=="grade_8") %>% 
  select(school, year, grade, male_int, female_int, non_binary_int, total_int) %>% 
  mutate(school = str_remove_all(school, "Boston - "))%>%
   mutate(school = str_remove_all(school, "Boston: "))%>%
  rename("Total" = "total_int")%>%
   drop_na(school)%>%
arrange(school, year)


enrollment_Boston8

Next, I graphed the enrollment during the 8th grade of Boston schools whose total was 100 or more in 2015. This was 14 schools:

Code
total_Boston8 <-select(enrollment_Boston8, "school",year, Total) %>% 
  filter(school%in% c("Boston Latin",
                      "Boston Latin Academy", 
                      "Clarence R Edwards Middle",
                      "Curley K-8 School", 
                      "Edison K-8", 
                      "James P Timilty Middle", 
                      "John W McCormack", 
                      "Lilla G. Frederick Middle School", 
                      "Mario Umana Academy", 
                      "O'Bryant School Math/Science", 
                      "TechBoston Academy", 
                      "Washington Irving Middle"))
total_Boston8
Code
ggplot(total_Boston8, aes(year, Total)) + 
  geom_line(aes(group = school)) +
  geom_point(aes(colour = school))

Code
ggplot(total_Boston8, aes(year, Total)) + 
  geom_line(aes(group = school)) +
  geom_point(aes(colour = school),  
             show.legend = FALSE) +
  facet_wrap(~ school, nrow = 4)

Potential Research Questions

Some potential research questions for this could be:

  • Which schools should be closed or combined?

  • Which schools need more seats available to them?

  • Which schools and districts need more resources (teachers, staff, classroom materials, technology, school lunches, programs, etc)?

  • Which areas of the state are growing?

  • Which high schools will grow or shrink by looking at cross grade data?

Source Code
---
title: "Homework #2 Naughton"
author: "Courtney Naughton"
desription: "Homework Two: Reading in Data"
date: "10/11/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
    df-print: paged
categories:
  - Courtney Naughton
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(tibble)
library(dplyr)
library(stringr)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Description of the Data
### Homework 2: Massachusetts Public School Enrollment

This dataset is enrollment of students at Massachusetts Public Schools from 2015 - 2021. 

The dataset has 70 columns and 70,138 rows. The variables include RowID, State, County, District, and school name. There are state IDs for each of these areas.  There are columns that describe the school type as well as the low and high grade for each school. Other important variables are the year, term, and grade. There are also demographics broken into gender (male, female, non-binary, gender unknown) and race (white, black, hispanic, native_american, ative_american_alaskan_native, asian, native_hawaiian, asian_pacific_islander, native_hawaiian_pacific_islander, multiracial, unknown_race) and other details such as ELL, Homeless, low income, free and reduced, disability, section 504. This information is given twice - the columns are repeated with int at the end - I believe this is converting to integer form. Those variables are all double, but the columns with int are characters.
I want to get rid of columns "district_state_type," "county_state_id," "term," and "school_state_type" because they are all N/A. I will also take out the demographic columns that don't end in int because I would prefer the other form.

```{r}
enrollment <- read_csv("_data/enrollments_Naughton.csv")
```

```{r}
#| label: Select Data 
enrollment_new <- select(enrollment, 
                         rowid, 
                         state, 
                         county, 
                         admin_level, 
                         district_nces_id, 
                         district_state_id, 
                         district, 
                         school, 
                         ccd_school_type, 
                         ccd_charter_school, 
                         ccd_school_lvl, 
                         ccd_school_low_grade, 
                         ccd_school_high_grade, 
                         year, 
                         grade, 
                         ends_with("_int") )
enrollment_new
```

I first want to look at Boston's enrollment. When I did this, I noticed that in 2021, enrollment for schools were listed as "Boston - School Name" so I will need to take out "Boston - " so that the name of the school is consistent with the other years.

```{r}
#| label: Boston 8th Grade Data
enrollment_Boston8 <- enrollment_new  %>%
  filter( district =="Boston", grade=="grade_8") %>% 
  select(school, year, grade, male_int, female_int, non_binary_int, total_int) %>% 
  mutate(school = str_remove_all(school, "Boston - "))%>%
   mutate(school = str_remove_all(school, "Boston: "))%>%
  rename("Total" = "total_int")%>%
   drop_na(school)%>%
arrange(school, year)


enrollment_Boston8

```
Next, I graphed the enrollment during the 8th grade of Boston schools whose total was 100 or more in 2015. This was 14 schools:
```{r}
#| label: 8th Grade Boston data, certain schools

total_Boston8 <-select(enrollment_Boston8, "school",year, Total) %>% 
  filter(school%in% c("Boston Latin",
                      "Boston Latin Academy", 
                      "Clarence R Edwards Middle",
                      "Curley K-8 School", 
                      "Edison K-8", 
                      "James P Timilty Middle", 
                      "John W McCormack", 
                      "Lilla G. Frederick Middle School", 
                      "Mario Umana Academy", 
                      "O'Bryant School Math/Science", 
                      "TechBoston Academy", 
                      "Washington Irving Middle"))
total_Boston8

ggplot(total_Boston8, aes(year, Total)) + 
  geom_line(aes(group = school)) +
  geom_point(aes(colour = school))

ggplot(total_Boston8, aes(year, Total)) + 
  geom_line(aes(group = school)) +
  geom_point(aes(colour = school),  
             show.legend = FALSE) +
  facet_wrap(~ school, nrow = 4)


```
### Potential Research Questions
Some potential research questions for this could be: 

 * Which schools should be closed or combined?
 
 * Which schools need more seats available to them? 
 
 * Which schools and districts need more resources (teachers, staff, classroom materials, technology, school lunches, programs, etc)?
 
 * Which areas of the state are growing?
 
 * Which high schools will grow or shrink by looking at cross grade data?