Using the public school characteristic data set collected in the 2017-2018 school year
We ended the last blog post by looking at the data fields provided. The information appears to capture school location and contact information, student enrollment by grade, broken down by race/ethnicity and gender, and number of students by grade.
Here is a recap of all fields:
[1] "X" "Y" "OBJECTID"
[4] "NCESSCH" "NMCNTY" "SURVYEAR"
[7] "STABR" "LEAID" "ST_LEAID"
[10] "LEA_NAME" "SCH_NAME" "LSTREET1"
[13] "LSTREET2" "LSTREET3" "LCITY"
[16] "LSTATE" "LZIP" "LZIP4"
[19] "PHONE" "GSLO" "GSHI"
[22] "VIRTUAL" "TOTFRL" "FRELCH"
[25] "REDLCH" "PK" "KG"
[28] "G01" "G02" "G03"
[31] "G04" "G05" "G06"
[34] "G07" "G08" "G09"
[37] "G10" "G11" "G12"
[40] "G13" "TOTAL" "MEMBER"
[43] "AM" "HI" "BL"
[46] "WH" "HP" "TR"
[49] "FTE" "LATCOD" "LONCOD"
[52] "ULOCALE" "STUTERATIO" "STITLEI"
[55] "AMALM" "AMALF" "ASALM"
[58] "ASALF" "HIALM" "HIALF"
[61] "BLALM" "BLALF" "WHALM"
[64] "WHALF" "HPALM" "HPALF"
[67] "TRALM" "TRALF" "TOTMENROL"
[70] "TOTFENROL" "STATUS" "UG"
[73] "AE" "SCHOOL_TYPE_TEXT" "SY_STATUS_TEXT"
[76] "SCHOOL_LEVEL" "AS" "CHARTER_TEXT"
[79] "MAGNET_TEXT"
Based on the data points, I’ve outlined my research interests into three categories:
number of schools by state
counts of schools by rural/urban location
number of students by grade
Based on the defined interests above, we can narrow down the data set to following fields:
#update table by selecting fields to be used for analysis
psc <- psc%>%
select (LEA_NAME, SCH_NAME, LSTATE, FRELCH, REDLCH, ULOCALE, STUTERATIO, FTE, PK:G13)
#display column names of updated table
colnames(psc)
[1] "LEA_NAME" "SCH_NAME" "LSTATE" "FRELCH" "REDLCH"
[6] "ULOCALE" "STUTERATIO" "FTE" "PK" "KG"
[11] "G01" "G02" "G03" "G04" "G05"
[16] "G06" "G07" "G08" "G09" "G10"
[21] "G11" "G12" "G13"
Next, let’s rename the columns into a more readable format:
#rename columns into a readable format
psc <- psc %>%
rename(District = LEA_NAME, School = SCH_NAME, State = LSTATE, Free_Lunch = FRELCH, Reduced_Lunch = REDLCH, Rural_Urban_Status = ULOCALE, Student_Teacher_Ratio = STUTERATIO, Teacher_counts = FTE)
#display updated column names
colnames(psc)
[1] "District" "School"
[3] "State" "Free_Lunch"
[5] "Reduced_Lunch" "Rural_Urban_Status"
[7] "Student_Teacher_Ratio" "Teacher_counts"
[9] "PK" "KG"
[11] "G01" "G02"
[13] "G03" "G04"
[15] "G05" "G06"
[17] "G07" "G08"
[19] "G09" "G10"
[21] "G11" "G12"
[23] "G13"
Now that we have distilled (get it?) the data down to only the necessary information, there is one final step to wrangle (I’m sorry) the data: create two tables.
Why two tables? The counts of students by grade need to be collapsed into one column to represent the grade variable. If we pivot within the same table, other numeric values will be repeated (for example, the free and reduced lunch counts) for each grade and school. This could cause mathematical errors later on if we are not careful.
The first table will reflect one row per school with other characteristic information, sorted alphabetically by State, District, then School:
##create a new table (psc_char) with school level characteristic data with one row per school
psc_char <- psc%>%
select (State, District, School, Free_Lunch, Reduced_Lunch, Rural_Urban_Status, Student_Teacher_Ratio, Teacher_counts)%>%
arrange(State, District, School)
#display a subset of the results
paged_table(psc_char, options=list(rows.print = 10, max.print = 100))
In the second table, each one row will represent one grade per school and the number of students enrolled:
##create a second table with the counts of students by grade per school.
##pivot grade columns into a single variable and store the values in new column.
psc_grade <- psc %>%
pivot_longer(PK:G13, names_to = "Grade", values_to= "Grade_Counts") %>%
select(State, School, Grade, Grade_Counts)%>%
arrange(State, School, Grade, Grade_Counts)
#display a subset of the results
knitr::kable(head(psc_grade), "simple", caption = "Student counts by grade")
State | School | Grade | Grade_Counts |
---|---|---|---|
AK | Abbott Loop Elementary | G01 | 33 |
AK | Abbott Loop Elementary | G02 | 38 |
AK | Abbott Loop Elementary | G03 | 53 |
AK | Abbott Loop Elementary | G04 | 38 |
AK | Abbott Loop Elementary | G05 | 53 |
AK | Abbott Loop Elementary | G06 | 52 |
With our data sets ready, we can begin our analysis in the three research interest areas.
To find the six states/territories with the lowest number of schools, we group by schools, count the number of rows and leave the sorting default:
#counts of schools by state using psc_char table since it represents a single school per row
#Sort from lowest to highest count
state_counts <- psc_char %>%
group_by(State) %>%
summarise(count = n()) %>%
arrange(count)
#display results
kbl(head(state_counts[,1:2]), "simple", caption = "Six states with lowest number of schools")
State | count |
---|---|
AS | 28 |
VI | 28 |
GU | 41 |
DC | 227 |
DE | 229 |
HI | 294 |
To find the six states and territories with the highest number of schools, we use the same code as above, except we add the descending to the sorting function:
#counts of schools by state/territory using psc_char table since it represents a single school per row
#sort by highest to lowest counts
state_counts <- psc_char %>%
group_by(State) %>%
summarise(count = n()) %>%
arrange(desc(count))
#display results
kbl(head(state_counts[,1:2]), "simple", caption = "Six states with the highest number of schools")
State | count |
---|---|
CA | 10325 |
TX | 9320 |
NY | 4808 |
FL | 4377 |
IL | 4245 |
MI | 3736 |
In the characteristics table, each school has a ‘locale’ classification to indicate if it is urban, suburban, or rural. To find the total by category, we will simply apply the group by function.
#find the number of schools by the locale type (aka, rural/urban)
locale_type <- psc_char %>%
group_by(Rural_Urban_Status)
#display results
table(select(locale_type, Rural_Urban_Status))
11-City: Large 12-City: Mid-size 13-City: Small
14851 5876 6635
21-Suburb: Large 22-Suburb: Mid-size 23-Suburb: Small
26772 3305 2053
31-Town: Fringe 32-Town: Distant 33-Town: Remote
2963 6266 4138
41-Rural: Fringe 42-Rural: Distant 43-Rural: Remote
11179 10279 6412
The locale type combines a main type and a sub-type. I am not interested in the sub-types, so let’s group them by the main category by using the case when function:
#group sub-types to report the number of schools by the locale type
localeGroup <- psc_char %>%
mutate(LocaleGroup = case_when(
Rural_Urban_Status == "41-Rural: Fringe" | Rural_Urban_Status == "42-Rural: Distant" | Rural_Urban_Status == "43-Rural: Remote" ~ "Rural",
Rural_Urban_Status == "31-Town: Fringe" | Rural_Urban_Status == "32-Town: Distant" | Rural_Urban_Status == "33-Town: Remote" ~ "Town",
Rural_Urban_Status == "21-Suburb: Large" | Rural_Urban_Status == "21-Suburb: Large" | Rural_Urban_Status == "21-Suburb: Large" ~ "Suburb",
Rural_Urban_Status == "11-City: Large" | Rural_Urban_Status == "12-City: Mid-size" | Rural_Urban_Status == "13-City: Small" ~ "City"
))
#display results
table(select(localeGroup, LocaleGroup))
City Rural Suburb Town
27362 27870 26772 13367
These are the total number of students (all US States and territories) per grade
Grade | GradeCount |
---|---|
G01 | NA |
G02 | NA |
G03 | NA |
G04 | NA |
G05 | NA |
G06 | NA |
G07 | NA |
G08 | NA |
G09 | NA |
G10 | NA |
G11 | NA |
G12 | NA |
G13 | NA |
KG | NA |
PK | NA |
Oops, I forgot to remove the rows with missing data for grade counts. This makes sense as most schools do not teach all grades.
Here is the corrected version:
Grade | GradeCount |
---|---|
G01 | 3696986 |
G02 | 3711657 |
G03 | 3816092 |
G04 | 3887554 |
G05 | 3905346 |
G06 | 3854703 |
G07 | 3805136 |
G08 | 3798140 |
G09 | 4018522 |
G10 | 3854951 |
G11 | 3695166 |
G12 | 3640287 |
G13 | 1 |
KG | 3710321 |
PK | 1255504 |
Thanks for reading!
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Kenison (2021, Oct. 3). DACSS 601 Fall 2021: HW#3 - Transforming the US public schools data set. Retrieved from https://mrolfe.github.io/DACSS601Fall21/posts/2021-10-03-hw3-brittany-kenison/
BibTeX citation
@misc{kenison2021hw#3, author = {Kenison, Brittany}, title = {DACSS 601 Fall 2021: HW#3 - Transforming the US public schools data set}, url = {https://mrolfe.github.io/DACSS601Fall21/posts/2021-10-03-hw3-brittany-kenison/}, year = {2021} }