DACSS 601 Fall 2021: HW#3 - Transforming the US public schools data set

Brittany Kenison

Review

We ended the last blog post by looking at the data fields provided. The information appears to capture school location and contact information, student enrollment by grade, broken down by race/ethnicity and gender, and number of students by grade.

Here is a recap of all fields:

Show code

#read data from public school characteristics file
psc <- read_csv("../../_data/Public_School_Characteristics_2017-18.csv")

#display all column names
colnames(psc)

 [1] "X"                "Y"                "OBJECTID"        
 [4] "NCESSCH"          "NMCNTY"           "SURVYEAR"        
 [7] "STABR"            "LEAID"            "ST_LEAID"        
[10] "LEA_NAME"         "SCH_NAME"         "LSTREET1"        
[13] "LSTREET2"         "LSTREET3"         "LCITY"           
[16] "LSTATE"           "LZIP"             "LZIP4"           
[19] "PHONE"            "GSLO"             "GSHI"            
[22] "VIRTUAL"          "TOTFRL"           "FRELCH"          
[25] "REDLCH"           "PK"               "KG"              
[28] "G01"              "G02"              "G03"             
[31] "G04"              "G05"              "G06"             
[34] "G07"              "G08"              "G09"             
[37] "G10"              "G11"              "G12"             
[40] "G13"              "TOTAL"            "MEMBER"          
[43] "AM"               "HI"               "BL"              
[46] "WH"               "HP"               "TR"              
[49] "FTE"              "LATCOD"           "LONCOD"          
[52] "ULOCALE"          "STUTERATIO"       "STITLEI"         
[55] "AMALM"            "AMALF"            "ASALM"           
[58] "ASALF"            "HIALM"            "HIALF"           
[61] "BLALM"            "BLALF"            "WHALM"           
[64] "WHALF"            "HPALM"            "HPALF"           
[67] "TRALM"            "TRALF"            "TOTMENROL"       
[70] "TOTFENROL"        "STATUS"           "UG"              
[73] "AE"               "SCHOOL_TYPE_TEXT" "SY_STATUS_TEXT"  
[76] "SCHOOL_LEVEL"     "AS"               "CHARTER_TEXT"    
[79] "MAGNET_TEXT"

What should this data represent?

Based on the data points, I’ve outlined my research interests into three categories:

number of schools by state
counts of schools by rural/urban location
number of students by grade

Selecting data

Based on the defined interests above, we can narrow down the data set to following fields:

Show code

#update table by selecting fields to be used for analysis
psc <- psc%>%
  select (LEA_NAME, SCH_NAME, LSTATE, FRELCH, REDLCH, ULOCALE, STUTERATIO, FTE, PK:G13) 

#display column names of updated table
colnames(psc)

 [1] "LEA_NAME"   "SCH_NAME"   "LSTATE"     "FRELCH"     "REDLCH"    
 [6] "ULOCALE"    "STUTERATIO" "FTE"        "PK"         "KG"        
[11] "G01"        "G02"        "G03"        "G04"        "G05"       
[16] "G06"        "G07"        "G08"        "G09"        "G10"       
[21] "G11"        "G12"        "G13"

Make columns readable

Next, let’s rename the columns into a more readable format:

Show code

#rename columns into a readable format
psc <- psc %>%
  rename(District = LEA_NAME, School = SCH_NAME, State = LSTATE, Free_Lunch = FRELCH, Reduced_Lunch = REDLCH, Rural_Urban_Status  = ULOCALE, Student_Teacher_Ratio = STUTERATIO, Teacher_counts = FTE)

#display updated column names
colnames(psc)

 [1] "District"              "School"               
 [3] "State"                 "Free_Lunch"           
 [5] "Reduced_Lunch"         "Rural_Urban_Status"   
 [7] "Student_Teacher_Ratio" "Teacher_counts"       
 [9] "PK"                    "KG"                   
[11] "G01"                   "G02"                  
[13] "G03"                   "G04"                  
[15] "G05"                   "G06"                  
[17] "G07"                   "G08"                  
[19] "G09"                   "G10"                  
[21] "G11"                   "G12"                  
[23] "G13"

Divide and conquor

Now that we have distilled (get it?) the data down to only the necessary information, there is one final step to wrangle (I’m sorry) the data: create two tables.

Why two tables? The counts of students by grade need to be collapsed into one column to represent the grade variable. If we pivot within the same table, other numeric values will be repeated (for example, the free and reduced lunch counts) for each grade and school. This could cause mathematical errors later on if we are not careful.

The first table will reflect one row per school with other characteristic information, sorted alphabetically by State, District, then School:

Show code

##create a new table (psc_char) with school level characteristic data with one row per school
psc_char <- psc%>%
select (State, District, School, Free_Lunch, Reduced_Lunch, Rural_Urban_Status, Student_Teacher_Ratio, Teacher_counts)%>%
  arrange(State, District, School)

#display a subset of the results
paged_table(psc_char, options=list(rows.print = 10, max.print = 100))

In the second table, each one row will represent one grade per school and the number of students enrolled:

Show code

##create a second table with the counts of students by grade per school.
##pivot grade columns into a single variable and store the values in new column.
psc_grade <- psc %>%
  pivot_longer(PK:G13, names_to = "Grade", values_to= "Grade_Counts") %>%
  select(State, School, Grade, Grade_Counts)%>%
  arrange(State, School, Grade, Grade_Counts)

#display a subset of the results
knitr::kable(head(psc_grade), "simple", caption = "Student counts by grade")

Table 1: Student counts by grade
State	School	Grade	Grade_Counts
AK	Abbott Loop Elementary	G01	33
AK	Abbott Loop Elementary	G02	38
AK	Abbott Loop Elementary	G03	53
AK	Abbott Loop Elementary	G04	38
AK	Abbott Loop Elementary	G05	53
AK	Abbott Loop Elementary	G06	52

Summarizing

With our data sets ready, we can begin our analysis in the three research interest areas.

State school counts

To find the six states/territories with the lowest number of schools, we group by schools, count the number of rows and leave the sorting default:

Show code

#counts of schools by state using psc_char table since it represents a single school per row
#Sort from lowest to highest count
state_counts <- psc_char %>%
  group_by(State) %>%
  summarise(count = n()) %>%
  arrange(count)

#display results
kbl(head(state_counts[,1:2]), "simple", caption = "Six states with lowest number of schools")

Table 2: Six states with lowest number of schools
State	count
AS	28
VI	28
GU	41
DC	227
DE	229
HI	294

To find the six states and territories with the highest number of schools, we use the same code as above, except we add the descending to the sorting function:

Show code

#counts of schools by state/territory using psc_char table since it represents a single school per row
#sort by highest to lowest counts
state_counts <- psc_char %>%
  group_by(State) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

#display results
kbl(head(state_counts[,1:2]), "simple", caption = "Six states with the highest number of schools")

Table 3: Six states with the highest number of schools
State	count
CA	10325
TX	9320
NY	4808
FL	4377
IL	4245
MI	3736

Urban/Rural schools

In the characteristics table, each school has a ‘locale’ classification to indicate if it is urban, suburban, or rural. To find the total by category, we will simply apply the group by function.

Show code

#find the number of schools by the locale type (aka, rural/urban)
locale_type <- psc_char %>%
  group_by(Rural_Urban_Status)

#display results
table(select(locale_type, Rural_Urban_Status))


     11-City: Large   12-City: Mid-size      13-City: Small 
              14851                5876                6635 
   21-Suburb: Large 22-Suburb: Mid-size    23-Suburb: Small 
              26772                3305                2053 
    31-Town: Fringe    32-Town: Distant     33-Town: Remote 
               2963                6266                4138 
   41-Rural: Fringe   42-Rural: Distant    43-Rural: Remote 
              11179               10279                6412

The locale type combines a main type and a sub-type. I am not interested in the sub-types, so let’s group them by the main category by using the case when function:

Show code

#group sub-types to report the number of schools by the locale type 
localeGroup <- psc_char %>%
  mutate(LocaleGroup = case_when(
    Rural_Urban_Status == "41-Rural: Fringe" | Rural_Urban_Status == "42-Rural: Distant" | Rural_Urban_Status == "43-Rural: Remote" ~ "Rural", 
    Rural_Urban_Status == "31-Town: Fringe" | Rural_Urban_Status == "32-Town: Distant" | Rural_Urban_Status == "33-Town: Remote" ~ "Town",
    Rural_Urban_Status == "21-Suburb: Large" | Rural_Urban_Status == "21-Suburb: Large" | Rural_Urban_Status == "21-Suburb: Large" ~ "Suburb",
    Rural_Urban_Status == "11-City: Large" | Rural_Urban_Status == "12-City: Mid-size" | Rural_Urban_Status == "13-City: Small" ~ "City"
      ))
#display results
table(select(localeGroup, LocaleGroup))


  City  Rural Suburb   Town 
 27362  27870  26772  13367

Student counts by grade

These are the total number of students (all US States and territories) per grade

Show code

#count the number of students for each grade
t_gradeCount <- psc_grade %>%
  group_by(Grade) %>%
  summarise(GradeCount = sum(Grade_Counts))

knitr::kable(t_gradeCount)

Grade	GradeCount
G01	NA
G02	NA
G03	NA
G04	NA
G05	NA
G06	NA
G07	NA
G08	NA
G09	NA
G10	NA
G11	NA
G12	NA
G13	NA
KG	NA
PK	NA

Oops, I forgot to remove the rows with missing data for grade counts. This makes sense as most schools do not teach all grades.

Here is the corrected version:

Show code

#filter out rows where grade counts are null
#used same code as above to group grade counts by grade
t_gradeCountCorrect <- psc_grade %>%
  filter(!is.na(Grade_Counts))%>%
  group_by(Grade) %>%
  summarise(GradeCount = sum(Grade_Counts))
  
#display results
  kable(t_gradeCountCorrect)

Grade	GradeCount
G01	3696986
G02	3711657
G03	3816092
G04	3887554
G05	3905346
G06	3854703
G07	3805136
G08	3798140
G09	4018522
G10	3854951
G11	3695166
G12	3640287
G13	1
KG	3710321
PK	1255504

The End

Thanks for reading!

Comment on this article Share:

HW#3 - Transforming the US public schools data set

Review

What should this data represent?

Selecting data

Make columns readable

Divide and conquor

Summarizing

State school counts

Urban/Rural schools

Student counts by grade

The End

Reuse

Citation

Grade	GradeCount
G01	NA
G02	NA
G03	NA
G04	NA
G05	NA
G06	NA
G07	NA
G08	NA
G09	NA
G10	NA
G11	NA
G12	NA
G13	NA
KG	NA
PK	NA

Grade	GradeCount
G01	NA
G02	NA
G03	NA
G04	NA
G05	NA
G06	NA
G07	NA
G08	NA
G09	NA
G10	NA
G11	NA
G12	NA
G13	NA
KG	NA
PK	NA

Grade	GradeCount
G01	NA
G02	NA
G03	NA
G04	NA
G05	NA
G06	NA
G07	NA
G08	NA
G09	NA
G10	NA
G11	NA
G12	NA
G13	NA
KG	NA
PK	NA