Code
library(tidyverse)
library(ggplot2)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Theresa Szczepanski
October 5, 2022
On the read in, I deleted:
duplicates of the latitute/longitudinal coordinates X
, Y
SURVYEAR
since we are only examining 2017-2018 survey
I thought I should delete: aggregate information that could be replicated: TOTFRL
, TOTMENR
TOTFENR
, TOTAL
, Member
; HOWEVER, inspection of the median, range, and distribution of numeric variables in the summary indicates there are several mis-entries, (for example: student to teacher ratio: STUTERATIO
has a min = 0, med = 15.3, and max=22350.
Some of the aggregate categories might help me check for mis-entries.
On the read in, I factored the ordinal variables:
GSHI
, GSLO
, SCHOOL_LEVEL
, and ULOCALE
#Work done to determine what to filter/recode on read in
# PublicSchools_2017<-read_csv("_data/Public_School_Characteristics_2017-18.csv")%>%
# select(-c("X", "Y","OBJECTID" ,"SURVYEAR"))
#Aggregate variables I would have filtered if I wasn't concerned about mis-entries:
#"TOTFRL", "TOTMENROL", "TOTFENROL", "MEMBER", "TOTAL"
# Identify Levels for Factoring Ordinal Variables
# #ULOCALE
# PublicSchools_2017%>%
# select(ULOCALE)%>%
# unique()
# #GSLO
# PublicSchools_2017%>%
# select(GSLO)%>%
# unique()
# #GSLHI
# PublicSchools_2017%>%
# select(GSHI)%>%
# unique()
# #SCHOOL_LEVEL
# PublicSchools_2017%>%
# select(SCHOOL_LEVEL)%>%
# unique()
#Recode all ordinal variable as factors
PublicSchools_2017<-read_csv("_data/Public_School_Characteristics_2017-18.csv")%>%
select(-c("X", "Y","OBJECTID" ,"SURVYEAR")) %>%
mutate(ULOCALE = recode_factor(ULOCALE,
"11-City: Large" = "City: Large",
"12-City: Mid-size" = "City: Mid-size",
"13-City: Small" = "City: Small",
"21-Suburb: Large"= "Suburb: Large",
"22-Suburb: Mid-size"= "Suburb: Mid-size",
"23-Suburb: Small" = "Suburb: Small",
"31-Town: Fringe" = "Town: Fringe",
"32-Town: Distant" = "Town: Distant",
"33-Town: Remote" = "Town: Remote",
"41-Rural: Fringe" = "Rural: Fringe",
"42-Rural: Distant" = "Rural: Distant",
"43-Rural: Remote" = "Rural: Remote",
.ordered = TRUE))%>%
mutate(SCHOOL_LEVEL = recode_factor(SCHOOL_LEVEL,
"Prekindergarten" = "Prekindergarten",
"Elementary" = "Elementary",
"Middle" = "Middle",
"Secondary"= "Secondary",
"High"= "High",
"Ungraded" = "Ungraded",
"Other" = "Other",
"Not Applicable" = "Not Applicable",
"Not Reported" = "Not Reported",
.ordered = TRUE))%>%
mutate(GSLO = recode_factor(GSLO,
"PK" = "PK",
"KG" = "KG",
"01" = "01",
"02" = "02",
"03" = "03",
"04" = "04",
"05" = "05",
"M" = "M",
"06" = "06",
"07" = "07",
"08" = "08",
"09" = "09",
"10" = "10",
"11" = "11",
"12" = "12",
"AE" = "AE",
"UG" = "UG",
"N" = "N",
.ordered = TRUE))%>%
mutate(GSHI = recode_factor(GSHI,
"PK" = "PK",
"KG" = "KG",
"01" = "01",
"02" = "02",
"03" = "03",
"04" = "04",
"05" = "05",
"M" = "M",
"06" = "06",
"07" = "07",
"08" = "08",
"09" = "09",
"10" = "10",
"11" = "11",
"12" = "12",
"13" = "13",
"AE" = "AE",
"UG" = "UG",
"N" = "N",
.ordered = TRUE))
PublicSchools_2017
The PublicSchools_2017
data frame consists of data from selected questions from the 2017-208 National Teachers and Principals Survey conducted by the United States Census Board and is “a system of related questionnaires that provide descriptive data on the context of public and private elementary and secondary education in addition to giving local, state, and national policymakers a variety of statistics on the condition of education in the United States.”
Our data frame consists of a subset of the items surveyed from 100729 schools across the United States. The 75 variables contain information from the following categories:
Geographic Location of the School
Characteristics of the School design:
Demographic Characteristics of the student body:
Socioeconomic Characteristics of the student body:
What are the following variables?
G13
AS
UG
: Ungraded (School level)AE
: Adult Education (School level)FTE
STATUS
Why did the original Member
have 2944 distinct values while total
had 2944?
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NCESSCH [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
NMCNTY [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
STABR [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LEAID [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ST_LEAID [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LEA_NAME [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SCH_NAME [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LSTREET1 [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LSTREET2 [character] |
|
|
100137 (99.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LSTREET3 [logical] |
|
100729 (100.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LCITY [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LSTATE [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LZIP [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LZIP4 [character] |
|
|
41502 (41.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PHONE [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GSLO [ordered, factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GSHI [ordered, factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
VIRTUAL [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TOTFRL [numeric] |
|
1906 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
FRELCH [numeric] |
|
1765 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
REDLCH [numeric] |
|
399 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PK [numeric] |
|
468 distinct values | 64621 (64.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
KG [numeric] |
|
393 distinct values | 43684 (43.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
G01 [numeric] |
|
353 distinct values | 43333 (43.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
G02 [numeric] |
|
345 distinct values | 43268 (43.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
G03 [numeric] |
|
358 distinct values | 43253 (42.9%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
G04 [numeric] |
|
382 distinct values | 43470 (43.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
G05 [numeric] |
|
494 distinct values | 44673 (44.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
G06 [numeric] |
|
641 distinct values | 58585 (58.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
G07 [numeric] |
|
687 distinct values | 63682 (63.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
G08 [numeric] |
|
700 distinct values | 63449 (63.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
G09 [numeric] |
|
987 distinct values | 68499 (68.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
G10 [numeric] |
|
945 distinct values | 68706 (68.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
G11 [numeric] |
|
914 distinct values | 68720 (68.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
G12 [numeric] |
|
891 distinct values | 68814 (68.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
G13 [logical] |
|
|
100692 (100.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TOTAL [numeric] |
|
2945 distinct values | 2229 (2.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
MEMBER [numeric] |
|
2944 distinct values | 2229 (2.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AM [numeric] |
|
424 distinct values | 20609 (20.5%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HI [numeric] |
|
1745 distinct values | 3852 (3.8%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BL [numeric] |
|
1166 distinct values | 8325 (8.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
WH [numeric] |
|
1839 distinct values | 3993 (4.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HP [numeric] |
|
305 distinct values | 30008 (29.8%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TR [numeric] |
|
307 distinct values | 7137 (7.1%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
FTE [numeric] |
|
10066 distinct values | 5233 (5.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LATCOD [numeric] |
|
96746 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LONCOD [numeric] |
|
96911 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ULOCALE [ordered, factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
STUTERATIO [numeric] |
|
3854 distinct values | 6835 (6.8%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
STITLEI [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AMALM [numeric] |
|
268 distinct values | 26365 (26.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AMALF [numeric] |
|
263 distinct values | 26708 (26.5%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ASALM [numeric] |
|
522 distinct values | 16162 (16.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ASALF [numeric] |
|
495 distinct values | 16080 (16.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HIALM [numeric] |
|
1073 distinct values | 4774 (4.7%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HIALF [numeric] |
|
1047 distinct values | 5121 (5.1%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BLALM [numeric] |
|
687 distinct values | 10801 (10.7%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BLALF [numeric] |
|
693 distinct values | 11485 (11.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
WHALM [numeric] |
|
1046 distinct values | 4502 (4.5%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
WHALF [numeric] |
|
1030 distinct values | 4682 (4.6%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HPALM [numeric] |
|
210 distinct values | 34182 (33.9%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HPALF [numeric] |
|
212 distinct values | 34563 (34.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TRALM [numeric] |
|
174 distinct values | 9200 (9.1%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TRALF [numeric] |
|
183 distinct values | 9477 (9.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TOTMENROL [numeric] |
|
1691 distinct values | 2296 (2.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TOTFENROL [numeric] |
|
1646 distinct values | 2362 (2.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
STATUS [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
UG [numeric] |
|
217 distinct values | 88689 (88.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AE [logical] |
|
|
100665 (99.9%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SCHOOL_TYPE_TEXT [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SY_STATUS_TEXT [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SCHOOL_LEVEL [ordered, factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AS [numeric] |
|
850 distinct values | 12717 (12.6%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CHARTER_TEXT [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
MAGNET_TEXT [character] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21
Because we have survey data, we will have a relatively wide data frame, and will have to make use of select
and group by
when making summaries or visualizations.
The ULOCALE
variable needed to be recoded as an ordinal variable with levels in order to have the bars appear in the appropriate order for our visualization.
Upon closer inspection, it turns out that there are several numeric variables with data mis-entered:
The number of students with Free or Reduced lunch cannot be negative
Student to Teacher Ratio cannot exceed 100 (yet there entries in 600 and even 2350)
How should these values be recoded, so we can still use the information for a given school but not throw off our summary statistics or visual representations?
The min, median, max values are suspicious for several of the numeric entries. If I had more time, I would consider each variable, what I know about it in context, and take advantage of mean/sd or median and IQR to replace likely mis-entries with N/A
I used the code below to remove the most extreme cases from our calculations based on the logical bounds of a ratio and count of students as well as knowledge of classroom sizes in the US.
I chose to visualize the ULOCALE
using a geom_bar
since it was an ordinal variable. Before, creating the bar chart, I factored and ordered the values for each of the urbanization classifications from the survey. Because the variable names were rather long, I “flipped” the orientation of the chart to horizontal in order to make the names easier to read.
It is striking to see how there are many more Large Suburban Schools in our country, relative to the other urbanization levels.
# Bar Chart School Level
ggplot(PublicSchools_2017, aes(ULOCALE)) +
geom_bar(fill="#404080", color="#e8ecef", alpha=0.9) +
#geom_bar(stat="identity", width=2) +
scale_fill_manual("legend", values = c("City: Large" = "blue",
"City: Mid-Size" = "blue",
"City: Small" = "blue")) +
theme_minimal() +
labs(title = "Urbanization Level US Teacher and Principals Survey SY2017-2018",
y = "Number of Schools", x = "Urbanization Level") +
coord_flip()
Are most schools classified as suburban? To answer this question, I collapsed the Rural, City, Town, and Suburban categories and created a second chart. While, there are still the most suburban schools represented, one can see that the majority of US public schools are either located in Suburban, City, or Rural regions.
Urbanization <-PublicSchools_2017%>%
select(ULOCALE)%>%
mutate(UrbBroad = ifelse(str_detect(ULOCALE,"Rural"),
"Rural",
ifelse(str_detect(ULOCALE, "Town"),"Town",
ifelse(str_detect(ULOCALE, "Suburb"),"Suburb",
ifelse(str_detect(ULOCALE, "City"),"City",
ULOCALE)))))%>%
mutate(UrbBroad = recode_factor(UrbBroad,
"Rural" = "Rural",
"Town" = "Town",
"Suburb" = "Suburb",
"City" = "City",
.ordered = TRUE))
Urbanization
# Bar Broader Urbanization Level
ggplot(Urbanization, aes(UrbBroad)) +
geom_bar(fill="#404080", color="#e8ecef", alpha=0.9) +
#geom_bar(stat="identity", width=2) +
theme_minimal() +
labs(title = "Urbanization Level US Teacher and Principals Survey SY2017-2018",
y = "Number of Schools", x = "Urbanization Level")
I decided to use a histogram to to visualize the distribution of the student to teacher ratio in schools across the country. From this we can have an idea of the distribution of class sizes across the United States. We see a symmetric distribution with most frequent ratio being between 15-16 students per teacher.
One might consider if the distribution of the student to teacher ratio is different based on the urbanization level of a school, i.e. “Are students in City Schools more likely to be in crowded classrooms?”. I would like to produce a more advanced plot, where I see 4 histograms side by side where I group this data by Rural/Town/Suburban/City Urban level.
Here we can see that all urban levels seem to have the most schools with a student to teacher between 14-16 BUT urban and suburban schools show a left skew, and have a higher proportion of schools with a student to teacher ratio greater than 16.
Urban_Ratio <-PublicSchools_2017%>%
select(ULOCALE, STUTERATIO)%>%
mutate(UrbBroad = ifelse(str_detect(ULOCALE,"Rural"),
"Rural",
ifelse(str_detect(ULOCALE, "Town"),"Town",
ifelse(str_detect(ULOCALE, "Suburb"),"Suburb",
ifelse(str_detect(ULOCALE, "City"),"City",
ULOCALE)))))%>%
mutate(UrbBroad = recode_factor(UrbBroad,
"Rural" = "Rural",
"Town" = "Town",
"Suburb" = "Suburb",
"City" = "City",
.ordered = TRUE))
Urban_Ratio
How do I change the colors of individual bars in my first chart so that all of the “City Bars” were one color, “Suburban Bars” another, and “Rural Bars” another.
When I originally tried to create this histogram, I noticed that the bin-sizes were off. It turns out that there are some mis-entries in the STUTERATIO
column. Notably: a school with a STUTERATIO of 677.36 and another with a ratio of 22350. These are clearly mis-entries. What would be the best way to re-code these? I used some knowledge of percents and classroom size to filter the values.
If I don’t have background knowledge, should I use the Median/IQR, or iteratively use the Mean/3SD to remove outliers?
(leaving this link here for future reference) R Graph Gallery
I chose to visualize the percentage of students qualifying for either free or reduced lunch at a school against the student to teacher ratio. I was curious to see if there appears to be an association between socio-economic status of the student body of a schoo and the student to teacher ratio. When examining the values of FRELCH
and REDLCH
in the summary, I realized there were several apparent mis-entries. Before producing the scatter plot, I first attempted to filter out data mis-entries.
# Comparing Student Teacher Ratio to Free Reduced Lunch Ratio and removing
FRED_Lunch_Teach <-PublicSchools_2017%>%
select(FRELCH, STUTERATIO, REDLCH, TOTAL )%>%
mutate(Ratio_FRED_LCH = (FRELCH + REDLCH)/TOTAL*100)%>%
mutate(Ratio_FRED_LCH = replace(Ratio_FRED_LCH, which(Ratio_FRED_LCH>75), NA))%>%
mutate(Ratio_FRED_LCH = replace(Ratio_FRED_LCH, which(Ratio_FRED_LCH<0), NA))
FRED_Lunch_Teach
ggplot(FRED_Lunch_Teach, aes(x=Ratio_FRED_LCH, y=STUTERATIO)) +
geom_point(size = .05, color="#69b3a2")+
geom_smooth(method="lm",color="grey", size =.5 )+
labs(title = "Free or Reduced Lunch Ratio vs. Student Teacher Ratio US Public Schools 2017-2018", y = "Student Teacher Ratio",
x = "Free/Reduced Lunch Ratio") +
theme_minimal()
I also produced a scatter plot of Hispanic Male pop vs. Hispanic Female pop. in schools. This is not interesting from an analysis perspective; I performed this as a sanity check for myself to see if my graph showed a strong linear correlation.
# Comparing Number of Hispanic Male Students to Hispanic Female Students
ggplot(PublicSchools_2017, aes(x=HPALM, y=HPALF)) +
geom_point(size=.65, color="#69b3a2" )+
geom_smooth(method="lm",color="grey", size =.5 )+
labs(title = "Hispanic Males vs. Hispanic Females Population US Public Schools 2017-2018", y = "Hispanic Female Students",
x = "Hispanic Male Students") +
theme_minimal()
Since this data set had latitude and longitudinal data for the geographic locations of the schools, it would be great to practice making a map with the info.
How do I get the mapping visualization to work?
---
title: "Challenge 5"
author: "Theresa Szczepanski"
description: "Introduction to Visualization"
date: "10/05/2022"
format:
html:
toc: true
code-copy: true
code-tools: true
df-print: paged
code-fold: true
categories:
- Theresa_Szczepanski
- challenge_5
- public_schools
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
::: panel-tabset
## Public School Characteristics ⭐⭐⭐⭐
::: panel-tabset
### Read in the Data
On the read in, I deleted:
- duplicates of the latitute/longitudinal coordinates `X`, `Y`
- `SURVYEAR` since we are only examining 2017-2018 survey
- __I thought I should delete__: aggregate information that could be
replicated: `TOTFRL`, `TOTMENR` `TOTFENR`, `TOTAL`, `Member`;
__HOWEVER__, inspection of the median, range, and distribution of numeric
variables in the summary indicates there are several mis-entries, (for example:
student to teacher ratio: `STUTERATIO` has a min = 0, med = 15.3, and max=22350.
- Some of the aggregate categories might help me check for mis-entries.
On the read in, I factored the ordinal variables:
- `GSHI`, `GSLO`, `SCHOOL_LEVEL`, and `ULOCALE`
```{r}
#Work done to determine what to filter/recode on read in
# PublicSchools_2017<-read_csv("_data/Public_School_Characteristics_2017-18.csv")%>%
# select(-c("X", "Y","OBJECTID" ,"SURVYEAR"))
#Aggregate variables I would have filtered if I wasn't concerned about mis-entries:
#"TOTFRL", "TOTMENROL", "TOTFENROL", "MEMBER", "TOTAL"
# Identify Levels for Factoring Ordinal Variables
# #ULOCALE
# PublicSchools_2017%>%
# select(ULOCALE)%>%
# unique()
# #GSLO
# PublicSchools_2017%>%
# select(GSLO)%>%
# unique()
# #GSLHI
# PublicSchools_2017%>%
# select(GSHI)%>%
# unique()
# #SCHOOL_LEVEL
# PublicSchools_2017%>%
# select(SCHOOL_LEVEL)%>%
# unique()
#Recode all ordinal variable as factors
PublicSchools_2017<-read_csv("_data/Public_School_Characteristics_2017-18.csv")%>%
select(-c("X", "Y","OBJECTID" ,"SURVYEAR")) %>%
mutate(ULOCALE = recode_factor(ULOCALE,
"11-City: Large" = "City: Large",
"12-City: Mid-size" = "City: Mid-size",
"13-City: Small" = "City: Small",
"21-Suburb: Large"= "Suburb: Large",
"22-Suburb: Mid-size"= "Suburb: Mid-size",
"23-Suburb: Small" = "Suburb: Small",
"31-Town: Fringe" = "Town: Fringe",
"32-Town: Distant" = "Town: Distant",
"33-Town: Remote" = "Town: Remote",
"41-Rural: Fringe" = "Rural: Fringe",
"42-Rural: Distant" = "Rural: Distant",
"43-Rural: Remote" = "Rural: Remote",
.ordered = TRUE))%>%
mutate(SCHOOL_LEVEL = recode_factor(SCHOOL_LEVEL,
"Prekindergarten" = "Prekindergarten",
"Elementary" = "Elementary",
"Middle" = "Middle",
"Secondary"= "Secondary",
"High"= "High",
"Ungraded" = "Ungraded",
"Other" = "Other",
"Not Applicable" = "Not Applicable",
"Not Reported" = "Not Reported",
.ordered = TRUE))%>%
mutate(GSLO = recode_factor(GSLO,
"PK" = "PK",
"KG" = "KG",
"01" = "01",
"02" = "02",
"03" = "03",
"04" = "04",
"05" = "05",
"M" = "M",
"06" = "06",
"07" = "07",
"08" = "08",
"09" = "09",
"10" = "10",
"11" = "11",
"12" = "12",
"AE" = "AE",
"UG" = "UG",
"N" = "N",
.ordered = TRUE))%>%
mutate(GSHI = recode_factor(GSHI,
"PK" = "PK",
"KG" = "KG",
"01" = "01",
"02" = "02",
"03" = "03",
"04" = "04",
"05" = "05",
"M" = "M",
"06" = "06",
"07" = "07",
"08" = "08",
"09" = "09",
"10" = "10",
"11" = "11",
"12" = "12",
"13" = "13",
"AE" = "AE",
"UG" = "UG",
"N" = "N",
.ordered = TRUE))
PublicSchools_2017
```
::: panel-tabset
### Briefly describe the data
The `PublicSchools_2017` data frame consists of data from selected questions
from the [2017-208 National Teachers and Principals Survey](https://nces.ed.gov/surveys/ntps/question1718.asp)
conducted by the United States Census Board and is "a system of related
questionnaires that provide descriptive data on the context of public and
private elementary and secondary education in addition to giving local, state,
and national policymakers a variety of statistics on the condition of education
in the United States."
Our data frame consists of a subset of the items surveyed from 100729 schools across
the United States. The 75 variables contain information from the following categories:
Geographic Location of the School
- State, town, and address
- Level of Urbanization (rural, town, city, etc.)
Characteristics of the School design:
- Charter, Magnet, Traditional Public,
- Virtual/non
- Highest and Lowest Grade levels served and number of students per grade level.
- Level of School: Elementary, Middle, Secondary, Adult Ed., etc.
- Type of School: Alternative, Regular school, Special education school, or
Vocational school
- Status of the school when surveyed (new, change of leadership, operational, etc.)
- Student to Teacher Ratio
- If the school has Title 1 status
Demographic Characteristics of the student body:
- Number of students of given ethnic backgrounds by gender (M/F only)
Socioeconomic Characteristics of the student body:
- Number of students qualifying for free or reduced lunch.
## Questions for Further Review
What are the following variables?
- `G13`
- `AS`
- `UG`: Ungraded (School level)
- `AE`: Adult Education (School level)
- `FTE`
- `STATUS`
Why did the original `Member` have 2944 distinct values while `total` had 2944?
### Data Summary
```{r}
# examine the summary to decide how to best set up our data frame
print(summarytools::dfSummary(PublicSchools_2017,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
```
:::
### Tidy Data (MUCH WORK LEFT HERE)
Because we have survey data, we will have a relatively wide data frame, and will
have to make use of `select` and `group by` when making summaries or
visualizations.
The `ULOCALE` variable needed to be recoded as an ordinal variable with levels
in order to have the bars appear in the appropriate order for our visualization.
Upon closer inspection, it turns out that there are several numeric variables with
data mis-entered:
- The number of students with Free or Reduced lunch cannot be negative
- Student to Teacher Ratio cannot exceed 100 (yet there entries in 600 and even 2350)
- How should these values be recoded, so we can still use the information for a given
school but not throw off our summary statistics or visual representations?
- The min, median, max values are suspicious for several of the numeric entries.
If I had more time, I would consider each variable, what I know about it in context,
and take advantage of mean/sd or median and IQR to replace likely mis-entries with N/A
- I used the code below to remove the most extreme cases from our calculations based
on the logical bounds of a ratio and count of students as well as knowledge of
classroom sizes in the US.
```{r}
PublicSchools_2017<-PublicSchools_2017%>%
mutate(FRELCH = replace(FRELCH, which(FRELCH<0), NA))%>%
mutate(REDLCH = replace(REDLCH, which(REDLCH<0), NA))%>%
mutate(STUTERATIO = replace(STUTERATIO, which(STUTERATIO>45), NA))%>%
mutate(STUTERATIO = replace(STUTERATIO, which(STUTERATIO<1), NA))
```
### Univariate Visualizations
I chose to visualize the `ULOCALE` using a `geom_bar` since it was an ordinal
variable. Before, creating the bar chart, I factored and ordered the values
for each of the urbanization classifications from the survey. Because the variable
names were rather long, I "flipped" the orientation of the chart to horizontal in
order to make the names easier to read.
It is striking to see how there are many more __Large Suburban Schools__
in our country, relative to the other urbanization levels.
```{r}
# Bar Chart School Level
ggplot(PublicSchools_2017, aes(ULOCALE)) +
geom_bar(fill="#404080", color="#e8ecef", alpha=0.9) +
#geom_bar(stat="identity", width=2) +
scale_fill_manual("legend", values = c("City: Large" = "blue",
"City: Mid-Size" = "blue",
"City: Small" = "blue")) +
theme_minimal() +
labs(title = "Urbanization Level US Teacher and Principals Survey SY2017-2018",
y = "Number of Schools", x = "Urbanization Level") +
coord_flip()
# Perhaps histogram of urban status
```
Are most schools classified as suburban? To answer this question, I collapsed the
Rural, City, Town, and Suburban categories and created a second chart. While, there
are still the most suburban schools represented, one can see that the majority of
US public schools are either located in Suburban, City, or Rural regions.
```{r}
Urbanization <-PublicSchools_2017%>%
select(ULOCALE)%>%
mutate(UrbBroad = ifelse(str_detect(ULOCALE,"Rural"),
"Rural",
ifelse(str_detect(ULOCALE, "Town"),"Town",
ifelse(str_detect(ULOCALE, "Suburb"),"Suburb",
ifelse(str_detect(ULOCALE, "City"),"City",
ULOCALE)))))%>%
mutate(UrbBroad = recode_factor(UrbBroad,
"Rural" = "Rural",
"Town" = "Town",
"Suburb" = "Suburb",
"City" = "City",
.ordered = TRUE))
Urbanization
# Bar Broader Urbanization Level
ggplot(Urbanization, aes(UrbBroad)) +
geom_bar(fill="#404080", color="#e8ecef", alpha=0.9) +
#geom_bar(stat="identity", width=2) +
theme_minimal() +
labs(title = "Urbanization Level US Teacher and Principals Survey SY2017-2018",
y = "Number of Schools", x = "Urbanization Level")
#coord_flip()
```
I decided to use a histogram to to visualize the distribution of the student to
teacher ratio in schools across the country. From this we
can have an idea of the distribution of class sizes across the United States. We
see a symmetric distribution with most frequent ratio being between 15-16 students
per teacher.
```{r}
ggplot(PublicSchools_2017, aes(x = STUTERATIO)) +
geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.9) +
theme_minimal() +
labs(title = "Student to Teacher Ratio US Teacher and Principals Survey SY2017-2018",
y = "Number of Schools", x = "Student to Teacher Ratio")
```
One might consider if the distribution of the student to teacher ratio is different
based on the urbanization level of a school, i.e. "Are students in City Schools
more likely to be in crowded classrooms?". I would like to produce a more advanced plot, where I see 4 histograms side by side where I group this data by Rural/Town/Suburban/City Urban level.
Here we can see that all urban levels seem to have the most schools with
a student to teacher between 14-16 BUT urban and suburban schools show a left skew,
and have a higher proportion of schools with a student to teacher ratio greater
than 16.
```{r}
Urban_Ratio <-PublicSchools_2017%>%
select(ULOCALE, STUTERATIO)%>%
mutate(UrbBroad = ifelse(str_detect(ULOCALE,"Rural"),
"Rural",
ifelse(str_detect(ULOCALE, "Town"),"Town",
ifelse(str_detect(ULOCALE, "Suburb"),"Suburb",
ifelse(str_detect(ULOCALE, "City"),"City",
ULOCALE)))))%>%
mutate(UrbBroad = recode_factor(UrbBroad,
"Rural" = "Rural",
"Town" = "Town",
"Suburb" = "Suburb",
"City" = "City",
.ordered = TRUE))
Urban_Ratio
ggplot(Urban_Ratio, aes(x = STUTERATIO)) +
geom_histogram( fill="#69b3a2", color="#e9ecef", alpha=0.9) +
labs(title = "Student to Teacher Ratio by Urbanization Level", y = "Number of Schools",
x = "Student to Teacher Ratio") +
theme_minimal() +
facet_wrap(vars(UrbBroad))
```
## Questions
How do I change the colors of individual bars in my first chart so that all of the
"City Bars" were one color, "Suburban Bars" another, and "Rural Bars" another.
When I originally tried to create this histogram, I noticed that the bin-sizes
were off. It turns out that there are some mis-entries in the `STUTERATIO` column.
Notably: a school with a STUTERATIO of 677.36 and another with a ratio of 22350.
These are clearly mis-entries. What would be the best way to re-code these? I used
some knowledge of percents and classroom size to filter the values.
If I don't have background knowledge, should I use the Median/IQR, or iteratively
use the Mean/3SD to remove outliers?
(leaving this link here for future reference)
[R Graph Gallery](https://r-graph-gallery.com/)
### Bivariate Visualization(s)
I chose to visualize the percentage of students qualifying for either free or
reduced lunch at a school against the student to teacher ratio. I was curious to
see if there appears to be an association between socio-economic status of the
student body of a schoo and the student to teacher ratio. When examining the values
of `FRELCH` and `REDLCH` in the summary, I realized there were several apparent
mis-entries. Before producing the scatter plot, I first attempted to filter out
data mis-entries.
```{r}
# Comparing Student Teacher Ratio to Free Reduced Lunch Ratio and removing
FRED_Lunch_Teach <-PublicSchools_2017%>%
select(FRELCH, STUTERATIO, REDLCH, TOTAL )%>%
mutate(Ratio_FRED_LCH = (FRELCH + REDLCH)/TOTAL*100)%>%
mutate(Ratio_FRED_LCH = replace(Ratio_FRED_LCH, which(Ratio_FRED_LCH>75), NA))%>%
mutate(Ratio_FRED_LCH = replace(Ratio_FRED_LCH, which(Ratio_FRED_LCH<0), NA))
FRED_Lunch_Teach
ggplot(FRED_Lunch_Teach, aes(x=Ratio_FRED_LCH, y=STUTERATIO)) +
geom_point(size = .05, color="#69b3a2")+
geom_smooth(method="lm",color="grey", size =.5 )+
labs(title = "Free or Reduced Lunch Ratio vs. Student Teacher Ratio US Public Schools 2017-2018", y = "Student Teacher Ratio",
x = "Free/Reduced Lunch Ratio") +
theme_minimal()
```
I also produced a scatter plot of Hispanic Male pop vs. Hispanic Female pop. in schools. This is not interesting from an analysis perspective; I performed this as a sanity check for myself to see if my graph showed a strong linear correlation.
```{r}
# Comparing Number of Hispanic Male Students to Hispanic Female Students
ggplot(PublicSchools_2017, aes(x=HPALM, y=HPALF)) +
geom_point(size=.65, color="#69b3a2" )+
geom_smooth(method="lm",color="grey", size =.5 )+
labs(title = "Hispanic Males vs. Hispanic Females Population US Public Schools 2017-2018", y = "Hispanic Female Students",
x = "Hispanic Male Students") +
theme_minimal()
```
```{r}
# testing cross tabs
#xtabs(~ GSLO + GSHI, PublicSchools_2017)
```
### Fun Mapping Visualization
Since this data set had latitude and longitudinal data for the geographic locations
of the schools, it would be great to practice making a map with the info.
```{r}
#PublicSchools_2017MAP<- st_as_sf(PublicSchools_2017, coords = c("LONCOD", "LATCOD"), crs = 4326)
```
## Questions
How do I get the mapping visualization to work?
:::
:::