HW#3 - Transforming the US public schools data set

Using the public school characteristic data set collected in the 2017-2018 school year

Brittany Kenison
10-03-2021

Review

We ended the last blog post by looking at the data fields provided. The information appears to capture school location and contact information, student enrollment by grade, broken down by race/ethnicity and gender, and number of students by grade.

Here is a recap of all fields:

Show code
#read data from public school characteristics file
psc <- read_csv("../../_data/Public_School_Characteristics_2017-18.csv")

#display all column names
colnames(psc)
 [1] "X"                "Y"                "OBJECTID"        
 [4] "NCESSCH"          "NMCNTY"           "SURVYEAR"        
 [7] "STABR"            "LEAID"            "ST_LEAID"        
[10] "LEA_NAME"         "SCH_NAME"         "LSTREET1"        
[13] "LSTREET2"         "LSTREET3"         "LCITY"           
[16] "LSTATE"           "LZIP"             "LZIP4"           
[19] "PHONE"            "GSLO"             "GSHI"            
[22] "VIRTUAL"          "TOTFRL"           "FRELCH"          
[25] "REDLCH"           "PK"               "KG"              
[28] "G01"              "G02"              "G03"             
[31] "G04"              "G05"              "G06"             
[34] "G07"              "G08"              "G09"             
[37] "G10"              "G11"              "G12"             
[40] "G13"              "TOTAL"            "MEMBER"          
[43] "AM"               "HI"               "BL"              
[46] "WH"               "HP"               "TR"              
[49] "FTE"              "LATCOD"           "LONCOD"          
[52] "ULOCALE"          "STUTERATIO"       "STITLEI"         
[55] "AMALM"            "AMALF"            "ASALM"           
[58] "ASALF"            "HIALM"            "HIALF"           
[61] "BLALM"            "BLALF"            "WHALM"           
[64] "WHALF"            "HPALM"            "HPALF"           
[67] "TRALM"            "TRALF"            "TOTMENROL"       
[70] "TOTFENROL"        "STATUS"           "UG"              
[73] "AE"               "SCHOOL_TYPE_TEXT" "SY_STATUS_TEXT"  
[76] "SCHOOL_LEVEL"     "AS"               "CHARTER_TEXT"    
[79] "MAGNET_TEXT"     

What should this data represent?

Based on the data points, I’ve outlined my research interests into three categories:

Selecting data

Based on the defined interests above, we can narrow down the data set to following fields:

Show code
#update table by selecting fields to be used for analysis
psc <- psc%>%
  select (LEA_NAME, SCH_NAME, LSTATE, FRELCH, REDLCH, ULOCALE, STUTERATIO, FTE, PK:G13) 

#display column names of updated table
colnames(psc)
 [1] "LEA_NAME"   "SCH_NAME"   "LSTATE"     "FRELCH"     "REDLCH"    
 [6] "ULOCALE"    "STUTERATIO" "FTE"        "PK"         "KG"        
[11] "G01"        "G02"        "G03"        "G04"        "G05"       
[16] "G06"        "G07"        "G08"        "G09"        "G10"       
[21] "G11"        "G12"        "G13"       

Make columns readable

Next, let’s rename the columns into a more readable format:

Show code
#rename columns into a readable format
psc <- psc %>%
  rename(District = LEA_NAME, School = SCH_NAME, State = LSTATE, Free_Lunch = FRELCH, Reduced_Lunch = REDLCH, Rural_Urban_Status  = ULOCALE, Student_Teacher_Ratio = STUTERATIO, Teacher_counts = FTE)

#display updated column names
colnames(psc)
 [1] "District"              "School"               
 [3] "State"                 "Free_Lunch"           
 [5] "Reduced_Lunch"         "Rural_Urban_Status"   
 [7] "Student_Teacher_Ratio" "Teacher_counts"       
 [9] "PK"                    "KG"                   
[11] "G01"                   "G02"                  
[13] "G03"                   "G04"                  
[15] "G05"                   "G06"                  
[17] "G07"                   "G08"                  
[19] "G09"                   "G10"                  
[21] "G11"                   "G12"                  
[23] "G13"                  

Divide and conquor

Now that we have distilled (get it?) the data down to only the necessary information, there is one final step to wrangle (I’m sorry) the data: create two tables.

Why two tables? The counts of students by grade need to be collapsed into one column to represent the grade variable. If we pivot within the same table, other numeric values will be repeated (for example, the free and reduced lunch counts) for each grade and school. This could cause mathematical errors later on if we are not careful.

The first table will reflect one row per school with other characteristic information, sorted alphabetically by State, District, then School:

Show code
##create a new table (psc_char) with school level characteristic data with one row per school
psc_char <- psc%>%
select (State, District, School, Free_Lunch, Reduced_Lunch, Rural_Urban_Status, Student_Teacher_Ratio, Teacher_counts)%>%
  arrange(State, District, School)

#display a subset of the results
paged_table(psc_char, options=list(rows.print = 10, max.print = 100))

In the second table, each one row will represent one grade per school and the number of students enrolled:

Show code
##create a second table with the counts of students by grade per school.
##pivot grade columns into a single variable and store the values in new column.
psc_grade <- psc %>%
  pivot_longer(PK:G13, names_to = "Grade", values_to= "Grade_Counts") %>%
  select(State, School, Grade, Grade_Counts)%>%
  arrange(State, School, Grade, Grade_Counts)

#display a subset of the results
knitr::kable(head(psc_grade), "simple", caption = "Student counts by grade")
Table 1: Student counts by grade
State School Grade Grade_Counts
AK Abbott Loop Elementary G01 33
AK Abbott Loop Elementary G02 38
AK Abbott Loop Elementary G03 53
AK Abbott Loop Elementary G04 38
AK Abbott Loop Elementary G05 53
AK Abbott Loop Elementary G06 52

Summarizing

With our data sets ready, we can begin our analysis in the three research interest areas.

State school counts

To find the six states/territories with the lowest number of schools, we group by schools, count the number of rows and leave the sorting default:

Show code
#counts of schools by state using psc_char table since it represents a single school per row
#Sort from lowest to highest count
state_counts <- psc_char %>%
  group_by(State) %>%
  summarise(count = n()) %>%
  arrange(count)

#display results
kbl(head(state_counts[,1:2]), "simple", caption = "Six states with lowest number of schools")
Table 2: Six states with lowest number of schools
State count
AS 28
VI 28
GU 41
DC 227
DE 229
HI 294

To find the six states and territories with the highest number of schools, we use the same code as above, except we add the descending to the sorting function:

Show code
#counts of schools by state/territory using psc_char table since it represents a single school per row
#sort by highest to lowest counts
state_counts <- psc_char %>%
  group_by(State) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

#display results
kbl(head(state_counts[,1:2]), "simple", caption = "Six states with the highest number of schools")
Table 3: Six states with the highest number of schools
State count
CA 10325
TX 9320
NY 4808
FL 4377
IL 4245
MI 3736

Urban/Rural schools

In the characteristics table, each school has a ‘locale’ classification to indicate if it is urban, suburban, or rural. To find the total by category, we will simply apply the group by function.

Show code
#find the number of schools by the locale type (aka, rural/urban)
locale_type <- psc_char %>%
  group_by(Rural_Urban_Status)

#display results
table(select(locale_type, Rural_Urban_Status))

     11-City: Large   12-City: Mid-size      13-City: Small 
              14851                5876                6635 
   21-Suburb: Large 22-Suburb: Mid-size    23-Suburb: Small 
              26772                3305                2053 
    31-Town: Fringe    32-Town: Distant     33-Town: Remote 
               2963                6266                4138 
   41-Rural: Fringe   42-Rural: Distant    43-Rural: Remote 
              11179               10279                6412 

The locale type combines a main type and a sub-type. I am not interested in the sub-types, so let’s group them by the main category by using the case when function:

Show code
#group sub-types to report the number of schools by the locale type 
localeGroup <- psc_char %>%
  mutate(LocaleGroup = case_when(
    Rural_Urban_Status == "41-Rural: Fringe" | Rural_Urban_Status == "42-Rural: Distant" | Rural_Urban_Status == "43-Rural: Remote" ~ "Rural", 
    Rural_Urban_Status == "31-Town: Fringe" | Rural_Urban_Status == "32-Town: Distant" | Rural_Urban_Status == "33-Town: Remote" ~ "Town",
    Rural_Urban_Status == "21-Suburb: Large" | Rural_Urban_Status == "21-Suburb: Large" | Rural_Urban_Status == "21-Suburb: Large" ~ "Suburb",
    Rural_Urban_Status == "11-City: Large" | Rural_Urban_Status == "12-City: Mid-size" | Rural_Urban_Status == "13-City: Small" ~ "City"
      ))
#display results
table(select(localeGroup, LocaleGroup))

  City  Rural Suburb   Town 
 27362  27870  26772  13367 

Student counts by grade

These are the total number of students (all US States and territories) per grade

Show code
#count the number of students for each grade
t_gradeCount <- psc_grade %>%
  group_by(Grade) %>%
  summarise(GradeCount = sum(Grade_Counts))

knitr::kable(t_gradeCount)
Grade GradeCount
G01 NA
G02 NA
G03 NA
G04 NA
G05 NA
G06 NA
G07 NA
G08 NA
G09 NA
G10 NA
G11 NA
G12 NA
G13 NA
KG NA
PK NA

Oops, I forgot to remove the rows with missing data for grade counts. This makes sense as most schools do not teach all grades.

Here is the corrected version:

Show code
#filter out rows where grade counts are null
#used same code as above to group grade counts by grade
t_gradeCountCorrect <- psc_grade %>%
  filter(!is.na(Grade_Counts))%>%
  group_by(Grade) %>%
  summarise(GradeCount = sum(Grade_Counts))
  
#display results
  kable(t_gradeCountCorrect)
Grade GradeCount
G01 3696986
G02 3711657
G03 3816092
G04 3887554
G05 3905346
G06 3854703
G07 3805136
G08 3798140
G09 4018522
G10 3854951
G11 3695166
G12 3640287
G13 1
KG 3710321
PK 1255504

The End

Thanks for reading!

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Kenison (2021, Oct. 3). DACSS 601 Fall 2021: HW#3 - Transforming the US public schools data set. Retrieved from https://mrolfe.github.io/DACSS601Fall21/posts/2021-10-03-hw3-brittany-kenison/

BibTeX citation

@misc{kenison2021hw#3,
  author = {Kenison, Brittany},
  title = {DACSS 601 Fall 2021: HW#3 - Transforming the US public schools data set},
  url = {https://mrolfe.github.io/DACSS601Fall21/posts/2021-10-03-hw3-brittany-kenison/},
  year = {2021}
}