Challenge 5 Submission

challenge_5
public_schools
Introduction to Visualization
Author

Suyash Bhagwat

Published

June 14, 2023

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. create at least two univariate visualizations
     • try to make them “publication” ready
     • Explain why you choose the specific graph type
  5. Create at least one bivariate visualization
     • try to make them “publication” ready
     • Explain why you choose the specific graph type

R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • Public School Characteristics ⭐⭐⭐⭐

Ans: The data set used for this challenge is given in the code below:

data_pub_school <- read_csv("_data/Public_School_Characteristics_2017-18.csv")
data_pub_school
glimpse(data_pub_school)
Rows: 100,729
Columns: 79
$ X                <dbl> -149.3578, -156.7542, -151.0701, -151.2791, -151.2323…
$ Y                <dbl> 61.62714, 71.30034, 60.49144, 60.56828, 60.56700, 56.…
$ OBJECTID         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ NCESSCH          <chr> "020051000480", "020061000470", "020039000448", "0200…
$ NMCNTY           <chr> "Matanuska-Susitna Borough", "North Slope Borough", "…
$ SURVYEAR         <chr> "2017-2018", "2017-2018", "2017-2018", "2017-2018", "…
$ STABR            <chr> "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK",…
$ LEAID            <chr> "0200510", "0200610", "0200390", "0200390", "0200390"…
$ ST_LEAID         <chr> "AK-33", "AK-36", "AK-24", "AK-24", "AK-24", "AK-44",…
$ LEA_NAME         <chr> "Matanuska-Susitna Borough School District", "North S…
$ SCH_NAME         <chr> "John Shaw Elementary", "Kiita Learning Community", "…
$ LSTREET1         <chr> "3750 E Paradise Ln", "5246 Karluk St", "158 E Park A…
$ LSTREET2         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ LSTREET3         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ LCITY            <chr> "Wasilla", "Utqiagvik", "Soldotna", "Kenai", "Kenai",…
$ LSTATE           <chr> "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK",…
$ LZIP             <chr> "99654", "99723", "99669", "99611", "99611", "99950",…
$ LZIP4            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "0069", NA, N…
$ PHONE            <chr> "(907)352-0500", "(907)852-9677", "(907)260-9221", "(…
$ GSLO             <chr> "PK", "09", "KG", "KG", "07", "PK", "KG", "PK", "PK",…
$ GSHI             <chr> "05", "12", "06", "05", "12", "12", "12", "12", "12",…
$ VIRTUAL          <chr> "Not a virtual school", "Not a virtual school", "Not …
$ TOTFRL           <dbl> 183, 27, 43, 69, -9, 17, 3, 3, 3, -9, 3, 3, -2, 3, 53…
$ FRELCH           <dbl> 158, 27, 23, 50, -9, 17, -1, -1, -1, -9, -1, -1, -2, …
$ REDLCH           <dbl> 25, 0, 20, 19, -9, 0, -1, -1, -1, -9, -1, -1, -2, -1,…
$ PK               <dbl> 30, NA, NA, NA, NA, 0, NA, 42, 0, 0, 11, 14, NA, 9, 0…
$ KG               <dbl> 81, NA, 23, 40, NA, 0, 2, 40, 4, 0, 2, 5, NA, 2, 32, …
$ G01              <dbl> 63, NA, 23, 43, NA, 3, 2, 44, 0, 0, 6, 10, NA, 6, 30,…
$ G02              <dbl> 80, NA, 27, 42, NA, 1, 1, 56, 3, 1, 7, 9, NA, 5, 36, …
$ G03              <dbl> 62, NA, 22, 46, NA, 2, 1, 59, 1, 1, 5, 13, NA, 6, 33,…
$ G04              <dbl> 58, NA, 25, 46, NA, 2, 1, 61, 1, 0, 8, 4, NA, 7, 31, …
$ G05              <dbl> 73, NA, 28, 43, NA, 2, 0, 59, 0, 1, 6, 13, NA, 10, 26…
$ G06              <dbl> NA, NA, 19, NA, NA, 1, 0, 54, 2, 0, 11, 8, NA, 6, 29,…
$ G07              <dbl> NA, NA, NA, NA, 0, 5, 0, 55, 0, 1, 7, 8, NA, 4, NA, 3…
$ G08              <dbl> NA, NA, NA, NA, 1, 1, 0, 74, 0, 1, 2, 7, NA, 1, NA, 2…
$ G09              <dbl> NA, 0, NA, NA, 1, 0, 1, 47, 0, 1, 8, 5, NA, 1, NA, 26…
$ G10              <dbl> NA, 3, NA, NA, 2, 0, 2, 51, 0, 0, 6, 10, NA, 1, NA, 3…
$ G11              <dbl> NA, 7, NA, NA, 1, 0, 0, 48, 1, 0, 8, 5, NA, 3, NA, 36…
$ G12              <dbl> NA, 20, NA, NA, 0, 1, 1, 47, 1, 0, 7, 9, NA, 0, NA, 2…
$ G13              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TOTAL            <dbl> 447, 30, 167, 260, 5, 18, 11, 737, 13, 6, 94, 120, NA…
$ MEMBER           <dbl> 447, 30, 167, 260, 5, 18, 11, 737, 13, 6, 94, 120, NA…
$ AM               <dbl> 50, 27, 8, 16, 0, 0, 2, 53, 11, 4, 84, 79, NA, 60, 23…
$ HI               <dbl> 12, 0, 5, 14, 0, 1, 0, 76, 0, 0, 2, 2, NA, 1, 30, 21,…
$ BL               <dbl> 5, 0, 0, 3, 1, 0, 5, 39, 0, 0, 0, 0, NA, 0, 2, 0, 0, …
$ WH               <dbl> 351, 0, 136, 168, 3, 13, 4, 443, 1, 2, 6, 13, NA, 0, …
$ HP               <dbl> 2, 1, 0, 0, 1, 0, 0, 8, 0, 0, 0, 10, NA, 0, 13, 8, 0,…
$ TR               <dbl> 23, 2, 15, 56, 0, 4, 0, 97, 1, 0, 0, 5, NA, 0, 3, 0, …
$ FTE              <dbl> 24.90, 3.00, 10.35, 16.75, 0.67, 1.90, 0.00, 5.79, 1.…
$ LATCOD           <dbl> 61.62714, 71.30034, 60.49144, 60.56828, 60.56700, 56.…
$ LONCOD           <dbl> -149.3579, -156.7542, -151.0702, -151.2791, -151.2323…
$ ULOCALE          <chr> "41-Rural: Fringe", "33-Town: Remote", "33-Town: Remo…
$ STUTERATIO       <dbl> 17.95, 10.00, 16.14, 15.52, 7.46, 9.47, NA, 127.29, 6…
$ STITLEI          <chr> "Yes", "Not Applicable", "Not Applicable", "Not Appli…
$ AMALM            <dbl> 33, 16, 4, 10, 0, 0, 1, 21, 6, 2, 41, 39, NA, 33, 11,…
$ AMALF            <dbl> 17, 11, 4, 6, 0, 0, 1, 32, 5, 2, 43, 40, NA, 27, 12, …
$ ASALM            <dbl> 1, 0, 0, 1, 0, 0, 0, 13, 0, 0, 1, 5, NA, 0, 52, 52, 0…
$ ASALF            <dbl> 3, 0, 3, 2, 0, 0, 0, 8, 0, 0, 1, 6, NA, 0, 38, 41, 0,…
$ HIALM            <dbl> 10, 0, 2, 6, 0, 1, 0, 33, 0, 0, 1, 2, NA, 1, 14, 12, …
$ HIALF            <dbl> 2, 0, 3, 8, 0, 0, 0, 43, 0, 0, 1, 0, NA, 0, 16, 9, 0,…
$ BLALM            <dbl> 3, 0, 0, 3, 0, 0, 3, 20, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
$ BLALF            <dbl> 2, 0, 0, 0, 1, 0, 2, 19, 0, 0, 0, 0, NA, 0, 2, 0, 0, …
$ WHALM            <dbl> 193, 0, 58, 82, 1, 5, 1, 221, 0, 1, 3, 7, NA, 0, 26, …
$ WHALF            <dbl> 158, 0, 78, 86, 2, 8, 3, 222, 1, 1, 3, 6, NA, 0, 30, …
$ HPALM            <dbl> 0, 1, 0, 0, 0, 0, 0, 4, 0, 0, 0, 5, NA, 0, 7, 4, 0, 0…
$ HPALF            <dbl> 2, 0, 0, 0, 1, 0, 0, 4, 0, 0, 0, 5, NA, 0, 6, 4, 0, 0…
$ TRALM            <dbl> 11, 1, 7, 26, 0, 4, 0, 48, 1, 0, 0, 3, NA, 0, 1, 0, 0…
$ TRALF            <dbl> 12, 1, 8, 30, 0, 0, 0, 49, 0, 0, 0, 2, NA, 0, 2, 0, 1…
$ TOTMENROL        <dbl> 251, 18, 71, 128, 1, 10, 5, 360, 7, 3, 46, 61, NA, 34…
$ TOTFENROL        <dbl> 196, 12, 96, 132, 4, 8, 6, 377, 6, 3, 48, 59, NA, 27,…
$ STATUS           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 1, 1, 1,…
$ UG               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ AE               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ SCHOOL_TYPE_TEXT <chr> "Regular school", "Alternative/other school", "Regula…
$ SY_STATUS_TEXT   <chr> "Currently operational", "Currently operational", "Cu…
$ SCHOOL_LEVEL     <chr> "Elementary", "High", "Elementary", "Elementary", "Hi…
$ AS               <dbl> 4, 0, 3, 3, 0, 0, 0, 21, 0, 0, 2, 11, NA, 0, 90, 93, …
$ CHARTER_TEXT     <chr> "No", "No", "Yes", "Yes", "No", "No", "No", "No", "No…
$ MAGNET_TEXT      <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…

Briefly describe the data

Ans: The Public_School_Characteristics_2017-18.csv data set describes US public schools of different levels (e.g. elementary, middle, and high schools) for the 2017-2018 school year. It is read in as a tibble of 100,729 rows × 79 columns. The glimpse() output above lists every column together with its data type.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

Ans: For the purpose of this challenge (data visualization), the data is already in a tidy format. The only changes needed are filtering out the ‘Not Reported’ and ‘Not Applicable’ values of the SCHOOL_LEVEL column and creating the total_student_enrollment column, which is the sum of the male and female enrollment.

# Keep a row only if SCHOOL_LEVEL differs from both placeholder labels (& rather than |)
data_pub_school <- filter(data_pub_school, SCHOOL_LEVEL != "Not Reported" & SCHOOL_LEVEL != "Not Applicable")
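
As a sanity check on this filtering step, a small hand-made tibble can show how the two conditions need to be combined. The values below are illustrative and not rows from the file. Joining two `!=` tests with `|` keeps every non-missing row, because any value differs from at least one of the two labels, while `&` removes both placeholders.

```r
library(dplyr)

# Toy data: illustrative values only, not rows from the real file
toy <- tibble(SCHOOL_LEVEL = c("Elementary", "High", "Not Reported", "Not Applicable"))

# `|` version: every value differs from at least one label, so nothing is dropped
kept_or <- filter(toy, SCHOOL_LEVEL != "Not Reported" | SCHOOL_LEVEL != "Not Applicable")

# `&` version: a row must differ from both labels to be kept
kept_and <- filter(toy, SCHOOL_LEVEL != "Not Reported" & SCHOOL_LEVEL != "Not Applicable")

nrow(kept_or)   # 4
nrow(kept_and)  # 2
```

On the real data, comparing nrow(data_pub_school) before and after the filter gives the same kind of check.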

Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

Ans: Yes, I need to create the total_student_enrollment column, which is the sum of the male (TOTMENROL) and female (TOTFENROL) student enrollment.

data_pub_school <- mutate(data_pub_school, total_student_enrollment = TOTMENROL + TOTFENROL)
data_pub_school
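
One possible sanity check for the new column, sketched on the first four rows shown in the glimpse() output above, is to compare the sum with the existing TOTAL column. That TOTAL holds the overall enrollment is an assumption based on those values, not on the data documentation.

```r
library(dplyr)

# First four rows of the relevant columns, copied from the glimpse() output
first_rows <- tibble(
  TOTMENROL = c(251, 18, 71, 128),
  TOTFENROL = c(196, 12, 96, 132),
  TOTAL     = c(447, 30, 167, 260)
)

checked <- first_rows %>%
  mutate(
    total_student_enrollment = TOTMENROL + TOTFENROL,
    # TRUE where the computed sum agrees with the reported total
    matches_total = total_student_enrollment == TOTAL
  )

all(checked$matches_total)  # TRUE for these four rows
```

The same comparison on the full tibble would need na.rm = TRUE in all(), since some schools have NA enrollment.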

Univariate Visualizations

Ans: Given below is a bar chart of the school count by school level. I chose a bar chart because it effectively visualizes univariate categorical data. The chart shows which school level is most common; for our data, the most common SCHOOL_LEVEL is ‘Elementary’.

ggplot(data_pub_school, aes(SCHOOL_LEVEL)) +
  geom_bar(fill = "steelblue") +
  labs(x = "School Level", y = "School Count", title = "School Count vs School Level") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

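The second univariate plot is a histogram of the total student enrollment with a density curve overlaid. I chose a histogram because it effectively visualizes univariate continuous/numerical data. The x-axis is limited to 0 to 2,500 students, so schools with larger enrollment are not shown.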

ggplot(data_pub_school, aes(total_student_enrollment)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(bins = 100, aes(y = after_stat(density)), alpha = 0.5) +
  geom_density(alpha = 0.1, fill = "red") +
  labs(title = "Histogram of Total Student Enrollment with Density Plot", x = "Total Student Enrollment", y = "Density") +
  xlim(0, 2500)
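
To illustrate what the density scale on the y-axis means, here is a small sketch using base R's hist() on simulated values. The numbers are made up, and hist() stands in for geom_histogram() only to expose the bin heights. On the density scale, bar height times bin width sums to 1, which is what lets the histogram and the density curve share one axis.

```r
set.seed(1)
# Simulated enrollment-like values; purely illustrative
x <- rexp(1000, rate = 1 / 500)

# plot = FALSE returns the bin breaks and heights without drawing
h <- hist(x, breaks = 30, plot = FALSE)

# Area of all bars on the density scale: height * width, summed over bins
bar_area <- sum(h$density * diff(h$breaks))
bar_area  # 1, up to floating-point error
```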

Bivariate Visualization(s)

Ans: Given below is a box plot of total student enrollment by school level. I chose a box plot because it effectively visualizes bivariate data where x is a categorical variable and y is a numerical/continuous variable. A scatter plot works best when both x and y are numerical, so it is not a good fit here.

ggplot(data_pub_school, aes(SCHOOL_LEVEL, total_student_enrollment)) +
  geom_boxplot() +
  labs(x = "School Level", y = "Total Student Enrollment", title = "Total Student Enrollment vs School Level") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))