Challenge 5 Solutions

challenge_5
railroads
cereal
air_bnb
pathogen_cost
australian_marriage
public_schools
usa_hh
Introduction to Visualization
Author

Meredith Rolfe

Published

August 23, 2022

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read, clean, and tidy data and then…
  2. create at least two univariate visualizations
     • try to make them “publication” ready
     • explain why you chose the specific graph type
  3. create at least one bivariate visualization
     • try to make it “publication” ready
     • explain why you chose the specific graph type

There is even an R Graph Gallery book to use that summarizes information from the website!

The cereal dataset includes sodium and sugar content for 20 popular cereals, along with an indicator of cereal category (A, B, or C), but we are not sure what that variable corresponds to.

cereal<-read_csv("_data/cereal.csv")

Univariate Visualizations

I am interested in the distribution of sodium and sugar content in cereals, so let’s start by checking out a simple histogram, binned into approximately 25 mg ranges. I do this by setting the number of bins to the range of the variable (max minus min) divided by 25, or about 14 bins.

ggplot(cereal, aes(x=Sodium)) +
  geom_histogram(bins=14)
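
As a rough sketch of the bin arithmetic described above: the number of bins is the range of the variable divided by the desired bin width, rounded up. The sodium values here are hypothetical stand-ins rather than the actual cereal data, and `bin_width` is a name introduced only for this illustration.

```r
# Hypothetical sodium values in mg (NOT the real cereal data)
sodium <- c(0, 70, 125, 140, 180, 200, 210, 290, 340)

# Desired width of each bin in mg
bin_width <- 25

# Number of bins = (max - min) / width, rounded up
n_bins <- ceiling(diff(range(sodium)) / bin_width)
n_bins
```

An alternative that avoids the arithmetic is to pass `binwidth = 25` to `geom_histogram()` instead of `bins`.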

It looks like there are some outliers, while most cereals are clumped together between 100 and 200 mg. Unfortunately, we can’t automatically label outliers. There is a commonly used trick to add in labels, but I can never get it to work for a single boxplot. So I use it for grouped data in the example below, and cheat by using the car package to label the outliers for the single boxplot. Maybe one of you can find a better way!

car::Boxplot(cereal$Sodium, 
        data=cereal, 
        id=list(labels=cereal$Cereal),
        cex=0.2)
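Error in loadNamespace(x): there is no package called 'car'

The chunk above failed to render because the car package was not installed; running install.packages("car") first should resolve it. The grouped version below flags outliers with a small helper function.
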
# Flag values more than 1.5 * IQR beyond the first or third quartile
is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

cereal %>%
  # rescale sugar so both variables can share one y axis
  mutate(Sugar = Sugar * 15) %>%
  pivot_longer(cols=c(Sodium, Sugar),
               names_to = "content",
               values_to = "value")%>%
  group_by(content)%>%
  # keep the cereal name only for outliers; everything else gets NA
  mutate(outlier = if_else(is_outlier(value), Cereal, NA_character_)) %>%
  ggplot(., aes(x = content, y = value, color=factor(content))) +
  geom_boxplot(outlier.shape = NA) +
  theme(legend.position = "none") +
  geom_text(aes(label = outlier), na.rm = TRUE, show.legend = FALSE) +
  # the secondary axis undoes the rescaling; sugar is measured in grams
  scale_y_continuous("Milligrams (Sodium)", 
    sec.axis = sec_axis(~ . /15, name = "Grams (Sugar)")
  )
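
To make the outlier rule concrete, here is a small sketch that applies the same 1.5 × IQR test to a hypothetical vector. The helper is repeated so the block runs on its own; `values` and `flags` are made up for illustration and are not taken from the cereal data.

```r
# Same rule as above: outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
is_outlier <- function(x) {
  lower <- quantile(x, 0.25) - 1.5 * IQR(x)
  upper <- quantile(x, 0.75) + 1.5 * IQR(x)
  x < lower | x > upper
}

# Hypothetical values: four close together and one far away
values <- c(1, 2, 3, 4, 100)

# Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7
flags <- is_outlier(values)
flags
```

Only the last value falls outside the fences, so only it would receive a label in the boxplot.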

How about sugar? We can set the number of bins so that each bin covers about 2 grams of sugar, which gives 9 bins.

ggplot(cereal, aes(x=Sugar)) +
  geom_histogram(bins=9)

It looks like cereals are more closely grouped with respect to sugar content - and a boxplot indicates no true outliers.

ggplot(cereal, aes(y = Sugar)) +
    geom_boxplot()

Bivariate Visualization(s)

Are cereals high in sodium low in sugar, or vice versa? To answer this question, let’s check out a scatterplot.

ggplot(cereal, aes(y=Sugar, x=Sodium)) +
  geom_point()

It doesn’t look like there is a systematic relationship. However, this might be different if we added in the types A and C. Also, Raisin Bran seems to be high in both!

This dataset includes the total number of cases and total estimated cost for the top 15 pathogens in 2018.

pathogen<-readxl::read_excel(
  "_data/Total_cost_for_top_15_pathogens_2018.xlsx",
  skip=5, 
  n_max=16, 
  col_names = c("pathogens", "Cases", "Cost"))

pathogen

Univariate Visualizations

Let’s check out the distribution of cost and number of cases. There are only 15 observations, even fewer than the number of cereals, and the data are highly skewed. Will the same sorts of visualizations work?

# Histogram of Cases
ggplot(pathogen, aes(x=Cases)) +
  geom_histogram()

# Histogram of (Logged) Cases
ggplot(pathogen, aes(x=Cases)) +
  geom_histogram()+
  scale_x_continuous(trans = "log10")

# Boxplot of Cases
ggplot(pathogen, aes(x=Cases)) +
  geom_boxplot()

# Boxplot of (Logged) Cases
ggplot(pathogen, aes(x=Cases)) +
  geom_boxplot()+
  scale_x_continuous(trans = "log10")

[Figures: Histogram of Cases; Histogram of (Logged) Cases; Boxplot of Cases; Boxplot of (Logged) Cases]

The histogram isn’t ideal. We can see the single outlier, but it is hard to get a grasp on the number of cases for pathogens with lower case counts. Rescaling the number of cases with a log (or some other scaling function) helps: as the logged versions show, logging the x axis is much more revealing.
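
A quick sketch of why the log scale helps with skewed counts: on the raw scale one huge value stretches the axis, while on the log10 scale the values spread out more evenly. The numbers below are hypothetical and are not the pathogen data.

```r
# Hypothetical, highly skewed case counts (NOT the real pathogen data)
cases <- c(12, 150, 2400, 31000, 5500000)

# On the raw scale the largest value dwarfs the rest
raw_ratio <- max(cases) / min(cases)

# On the log10 scale the same values span only a few units
logged <- log10(cases)
log_span <- diff(range(logged))

raw_ratio
log_span
```

This is the same transformation that `scale_x_continuous(trans = "log10")` applies to the axis, without changing the underlying data.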

What happens when we graph costs?

# Histogram of Cost
ggplot(pathogen, aes(x=Cost)) +
  geom_histogram()

# Histogram of (Logged) Cost
ggplot(pathogen, aes(x=Cost)) +
  geom_histogram()+
  scale_x_continuous(trans = "log10")

[Figures: Histogram of Cost; Histogram of (Logged) Cost]

Bivariate Visualization(s)

Given what we saw above, let’s try a logged and unlogged scatterplot for Cases vs Costs.

# Relationship between Cases and Total Cost of Pathogens
ggplot(pathogen, aes(x=Cases, y=Cost, label=pathogens)) +
  geom_point() +
  scale_x_continuous(labels = scales::comma)+
  geom_text()

# Logged Relationship between Cases and Total Cost
ggplot(pathogen, aes(x=Cases, y=Cost, label=pathogens)) +
  geom_point()+
  scale_x_continuous(trans = "log10", labels = scales::comma)+
  scale_y_continuous(trans = "log10", labels = scales::comma)+
  ggrepel::geom_label_repel()

[Figures: Relationship between Cases and Total Cost of Pathogens; Logged Relationship between Cases and Total Cost]

In 2017, Australia conducted a postal survey to gauge citizens’ opinions towards same sex marriage: “Should the law be changed to allow same-sex couples to marry?” The table provided by the Australian Bureau of Statistics includes estimates of the proportion of citizens choosing to 1) vote yes, 2) vote no, 3) vote in an unclear way, or 4) fail to vote. These results are aggregated by Federal Electoral District, and each district is nested within one of 8 overarching Electoral Divisions. See Challenge 3 for more details.

vote_orig <- readxl::read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",
           sheet="Table 2",
           skip=7,
           col_names = c("District", "Yes", "del", "No", rep("del", 6), "Illegible", "del", "No Response", rep("del", 3)))%>%
  select(!starts_with("del"))%>%
  drop_na(District)%>%
  filter(!str_detect(District, "(Total)"))%>%
  filter(!str_starts(District, "\\("))

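# Rows ending in "Divisions" are headers: copy the header into a new
# Division column, then fill it down over the districts that follow
vote<- vote_orig%>%
  mutate(Division = case_when(
    str_ends(District, "Divisions") ~ District,
    TRUE ~ NA_character_ ))%>%
  fill(Division, .direction = "down")
# Drop the header rows and the national total, keeping only districts
vote<- filter(vote,!str_detect(District, "Division|Australia"))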
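
The header-row trick above can be hard to picture, so here is a minimal sketch on a toy table. The rows are hypothetical and only mimic the layout (a division header followed by its districts); `toy` and `toy_filled` are names introduced for this illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical rows: each "... Divisions" header is followed by its districts
toy <- data.frame(
  District = c("New South Wales Divisions", "Banks", "Barton",
               "Victoria Divisions", "Aston"),
  Yes = c(NA, 10, 20, NA, 30),
  stringsAsFactors = FALSE
)

toy_filled <- toy %>%
  # mark header rows, leave everything else NA
  mutate(Division = case_when(
    str_ends(District, "Divisions") ~ District,
    TRUE ~ NA_character_)) %>%
  # carry each header down to the districts below it
  fill(Division, .direction = "down") %>%
  # drop the header rows themselves
  filter(!str_detect(District, "Division|Australia"))

toy_filled
```

Each remaining district row now carries the name of the division it belongs to.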

vote_long <- vote%>%
  pivot_longer(
    cols = Yes:`No Response`,
    names_to = "Response",
    values_to = "Count"
  )
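
As a small illustration of what the pivot does, the sketch below reshapes a hypothetical two-district table from one column per response into one row per district and response. The counts are made up, and `toy` and `toy_long` are names used only here.

```r
library(tidyr)

# Hypothetical wide table: one column per response type
toy <- data.frame(
  District = c("Banks", "Barton"),
  Yes = c(10, 20),
  No = c(5, 8),
  stringsAsFactors = FALSE
)
toy$`No Response` <- c(3, 4)

# One row per district/response pair, with the count in a single column
toy_long <- pivot_longer(
  toy,
  cols = Yes:`No Response`,
  names_to = "Response",
  values_to = "Count"
)

toy_long
```

Two districts with three response columns become six rows, which is the shape ggplot2 expects when mapping Response to an axis or fill.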

Univariate Visualization(s)

I think I will start out by graphing the overall proportion of Australian citizens who voted yes, no, etc. That requires me to recreate the proportions information we discarded when we read in the data!

vote_long%>%
  group_by(Response)%>%
  summarise(Count = sum(Count))%>%
  ggplot(., aes(x=Response, y=Count))+
  geom_bar(stat="identity")

Hm, I see a few issues. I would like to reorder the Yes and No folks (who voted) and clearly distinguish them from No Response. Plus maybe label the bars with the % vote (or total numbers?) and the axis with the other value.

vote_long%>%
  # put the actual votes first, with No Response last
  mutate(Response = as_factor(Response),
         Response = fct_relevel(Response, "Yes", "No", "Illegible"))%>%
  group_by(Response)%>%
  summarise(Count = sum(Count))%>%
  ungroup()%>%
  # share of all eligible citizens in each response category
  mutate(perc = Count/sum(Count))%>%
  ggplot(., aes(y=perc, x=Response))+
  geom_bar(stat="identity", alpha=.75) +
  scale_y_continuous(name= "Percent of Citizens", 
                     labels = scales::percent) +
  geom_text(aes(label = Count), size=3, vjust=-.5)
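
The proportion step is simple arithmetic: each response total divided by the sum of all totals. A sketch with hypothetical totals (not the actual survey counts) shows the calculation in base R.

```r
# Hypothetical response totals (NOT the real survey results)
counts <- c(Yes = 60, No = 30, Illegible = 1, `No Response` = 9)

# Share of each response among all eligible citizens
perc <- counts / sum(counts)

# Rounded percentages for display
round(100 * perc, 1)
```

Because the denominator includes non-respondents, these shares describe all eligible citizens rather than only those who returned a survey.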

Bivariate Visualization(s)

This is a new data set from Airbnb, let’s check it out.

airb<-read_csv("_data/AB_NYC_2019.csv")
print(summarytools::dfSummary(airb,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

airb

Dimensions: 48895 x 16
Duplicates: 0
Variable [type]: Stats / Values with Freqs (% of Valid); Missing
id [numeric]: Mean (sd): 19017143 (10983108); min ≤ med ≤ max: 2539 ≤ 19677284 ≤ 36487245; IQR (CV): 19680234 (0.6); 48895 distinct values; Missing: 0 (0.0%)
name [character]: 1. Hillside Hotel: 18 (0.0%); 2. Home away from home: 17 (0.0%); 3. New york Multi-unit build: 16 (0.0%); 4. Brooklyn Apartment: 12 (0.0%); 5. Loft Suite @ The Box Hous: 11 (0.0%); 6. Private Room: 11 (0.0%); 7. Artsy Private BR in Fort: 10 (0.0%); 8. Private room: 10 (0.0%); 9. Beautiful Brooklyn Browns: 8 (0.0%); 10. Cozy Brooklyn Apartment: 8 (0.0%); [ 47884 others ]: 48758 (99.8%); Missing: 16 (0.0%)
host_id [numeric]: Mean (sd): 67620011 (78610967); min ≤ med ≤ max: 2438 ≤ 30793816 ≤ 274321313; IQR (CV): 99612390 (1.2); 37457 distinct values; Missing: 0 (0.0%)
host_name [character]: 1. Michael: 417 (0.9%); 2. David: 403 (0.8%); 3. Sonder (NYC): 327 (0.7%); 4. John: 294 (0.6%); 5. Alex: 279 (0.6%); 6. Blueground: 232 (0.5%); 7. Sarah: 227 (0.5%); 8. Daniel: 226 (0.5%); 9. Jessica: 205 (0.4%); 10. Maria: 204 (0.4%); [ 11442 others ]: 46060 (94.2%); Missing: 21 (0.0%)
neighbourhood_group [character]: 1. Bronx: 1091 (2.2%); 2. Brooklyn: 20104 (41.1%); 3. Manhattan: 21661 (44.3%); 4. Queens: 5666 (11.6%); 5. Staten Island: 373 (0.8%); Missing: 0 (0.0%)
neighbourhood [character]: 1. Williamsburg: 3920 (8.0%); 2. Bedford-Stuyvesant: 3714 (7.6%); 3. Harlem: 2658 (5.4%); 4. Bushwick: 2465 (5.0%); 5. Upper West Side: 1971 (4.0%); 6. Hell's Kitchen: 1958 (4.0%); 7. East Village: 1853 (3.8%); 8. Upper East Side: 1798 (3.7%); 9. Crown Heights: 1564 (3.2%); 10. Midtown: 1545 (3.2%); [ 211 others ]: 25449 (52.0%); Missing: 0 (0.0%)
latitude [numeric]: Mean (sd): 40.7 (0.1); min ≤ med ≤ max: 40.5 ≤ 40.7 ≤ 40.9; IQR (CV): 0.1 (0); 19048 distinct values; Missing: 0 (0.0%)
longitude [numeric]: Mean (sd): -74 (0); min ≤ med ≤ max: -74.2 ≤ -74 ≤ -73.7; IQR (CV): 0 (0); 14718 distinct values; Missing: 0 (0.0%)
room_type [character]: 1. Entire home/apt: 25409 (52.0%); 2. Private room: 22326 (45.7%); 3. Shared room: 1160 (2.4%); Missing: 0 (0.0%)
price [numeric]: Mean (sd): 152.7 (240.2); min ≤ med ≤ max: 0 ≤ 106 ≤ 10000; IQR (CV): 106 (1.6); 674 distinct values; Missing: 0 (0.0%)
minimum_nights [numeric]: Mean (sd): 7 (20.5); min ≤ med ≤ max: 1 ≤ 3 ≤ 1250; IQR (CV): 4 (2.9); 109 distinct values; Missing: 0 (0.0%)
number_of_reviews [numeric]: Mean (sd): 23.3 (44.6); min ≤ med ≤ max: 0 ≤ 5 ≤ 629; IQR (CV): 23 (1.9); 394 distinct values; Missing: 0 (0.0%)
last_review [Date]: min: 2011-03-28; med: 2019-05-19; max: 2019-07-08; range: 8y 3m 10d; 1764 distinct values; Missing: 10052 (20.6%)
reviews_per_month [numeric]: Mean (sd): 1.4 (1.7); min ≤ med ≤ max: 0 ≤ 0.7 ≤ 58.5; IQR (CV): 1.8 (1.2); 937 distinct values; Missing: 10052 (20.6%)
calculated_host_listings_count [numeric]: Mean (sd): 7.1 (33); min ≤ med ≤ max: 1 ≤ 1 ≤ 327; IQR (CV): 1 (4.6); 47 distinct values; Missing: 0 (0.0%)
availability_365 [numeric]: Mean (sd): 112.8 (131.6); min ≤ med ≤ max: 0 ≤ 45 ≤ 365; IQR (CV): 227 (1.2); 366 distinct values; Missing: 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-28

Univariate Visualizations

Bivariate Visualization(s)

The railroad data contain 2931 county-level aggregated counts of the number of railroad employees in 2012. Counties are embedded within States, and all 50 states plus Canada, overseas addresses in Asia and Europe, and Washington, DC are represented. See challenges 1 and 2 for more information.

railroad<-readxl::read_excel("_data/StateCounty2012.xls",
                     skip = 4,
                     col_names= c("state", "delete",  "county",
                                  "delete", "employees"))%>%
  select(!contains("delete"))%>%
  filter(!str_detect(state, "Total"))

railroad<-head(railroad, -2)%>%
  mutate(county = ifelse(state=="CANADA", "CANADA", county))

Let’s create some numerical variables that we can visualize!

# Add state-level totals to every county row
railroad<- railroad%>%
  group_by(state)%>%
  mutate(state_employees = sum(employees),
         state_counties = n_distinct(county))
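
To show what the grouped mutate produces, here is a minimal sketch on a hypothetical three-row table. The states, counties, and employee counts are invented, and `toy` and `toy_sum` are names used only for this illustration.

```r
library(dplyr)

# Hypothetical county-level rows (NOT the real railroad data)
toy <- data.frame(
  state = c("AA", "AA", "BB"),
  county = c("X", "Y", "Z"),
  employees = c(2, 3, 10),
  stringsAsFactors = FALSE
)

toy_sum <- toy %>%
  group_by(state) %>%
  # every row keeps its county, and gains its state's totals
  mutate(state_employees = sum(employees),
         state_counties = n_distinct(county)) %>%
  ungroup()

toy_sum
```

Unlike `summarise()`, `mutate()` keeps one row per county, so the state totals repeat on each county row. The `ungroup()` at the end is an addition here so later operations are not silently grouped.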

Univariate Visualizations

Bivariate Visualization(s)

This is another new dataset.

schools<-read_csv("_data/Public_School_Characteristics_2017-18.csv")
print(summarytools::dfSummary(schools,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

schools

Dimensions: 100729 x 79
Duplicates: 0
Variable [type]: Stats / Values with Freqs (% of Valid); Missing
X [numeric]: Mean (sd): -92.9 (16.9); min ≤ med ≤ max: -176.6 ≤ -89.3 ≤ 144.9; IQR (CV): 20.2 (-0.2); 97136 distinct values; Missing: 0 (0.0%)
Y [numeric]: Mean (sd): 37.8 (5.8); min ≤ med ≤ max: -14.3 ≤ 38.8 ≤ 71.3; IQR (CV): 7.7 (0.2); 97136 distinct values; Missing: 0 (0.0%)
OBJECTID [numeric]: Mean (sd): 50365 (29078.1); min ≤ med ≤ max: 1 ≤ 50365 ≤ 100729; IQR (CV): 50364 (0.6); 100729 distinct values; Missing: 0 (0.0%)
NCESSCH [character]: 1. 010000500870: 1 (0.0%); 2. 010000500871: 1 (0.0%); 3. 010000500879: 1 (0.0%); 4. 010000500889: 1 (0.0%); 5. 010000501616: 1 (0.0%); 6. 010000502150: 1 (0.0%); 7. 010000600193: 1 (0.0%); 8. 010000600872: 1 (0.0%); 9. 010000600876: 1 (0.0%); 10. 010000600877: 1 (0.0%); [ 100719 others ]: 100719 (100.0%); Missing: 0 (0.0%)
NMCNTY [character]: 1. Los Angeles County: 2264 (2.2%); 2. Cook County: 1388 (1.4%); 3. Maricopa County: 1256 (1.2%); 4. Harris County: 1142 (1.1%); 5. Orange County: 1074 (1.1%); 6. Jefferson County: 980 (1.0%); 7. Montgomery County: 888 (0.9%); 8. Washington County: 848 (0.8%); 9. Wayne County: 817 (0.8%); 10. Dallas County: 814 (0.8%); [ 1949 others ]: 89258 (88.6%); Missing: 0 (0.0%)
SURVYEAR [character]: 1. 2017-2018: 100729 (100.0%); Missing: 0 (0.0%)
STABR [character]: 1. CA: 10323 (10.2%); 2. TX: 9320 (9.3%); 3. NY: 4808 (4.8%); 4. FL: 4375 (4.3%); 5. IL: 4245 (4.2%); 6. MI: 3734 (3.7%); 7. OH: 3610 (3.6%); 8. PA: 2990 (3.0%); 9. NC: 2691 (2.7%); 10. NJ: 2595 (2.6%); [ 46 others ]: 52038 (51.7%); Missing: 0 (0.0%)
LEAID [character]: 1. 7200030: 1121 (1.1%); 2. 0622710: 1009 (1.0%); 3. 1709930: 655 (0.7%); 4. 1200390: 537 (0.5%); 5. 3200060: 381 (0.4%); 6. 1200180: 336 (0.3%); 7. 1200870: 320 (0.3%); 8. 1500030: 294 (0.3%); 9. 4823640: 284 (0.3%); 10. 1201500: 268 (0.3%); [ 17451 others ]: 95524 (94.8%); Missing: 0 (0.0%)
ST_LEAID [character]: 1. PR-01: 1121 (1.1%); 2. CA-1964733: 1009 (1.0%); 3. IL-15-016-2990-25: 655 (0.7%); 4. FL-13: 537 (0.5%); 5. NV-02: 381 (0.4%); 6. FL-06: 336 (0.3%); 7. FL-29: 320 (0.3%); 8. HI-001: 294 (0.3%); 9. TX-101912: 284 (0.3%); 10. FL-50: 268 (0.3%); [ 17451 others ]: 95524 (94.8%); Missing: 0 (0.0%)
LEA_NAME [character]: 1. PUERTO RICO DEPARTMENT OF: 1121 (1.1%); 2. Los Angeles Unified: 1009 (1.0%); 3. City of Chicago SD 299: 655 (0.7%); 4. DADE: 537 (0.5%); 5. CLARK COUNTY SCHOOL DISTR: 381 (0.4%); 6. BROWARD: 336 (0.3%); 7. HILLSBOROUGH: 320 (0.3%); 8. Hawaii Department of Educ: 294 (0.3%); 9. HOUSTON ISD: 284 (0.3%); 10. PALM BEACH: 268 (0.3%); [ 17147 others ]: 95524 (94.8%); Missing: 0 (0.0%)
SCH_NAME [character]: 1. Lincoln Elementary School: 64 (0.1%); 2. Lincoln Elementary: 61 (0.1%); 3. Jefferson Elementary: 53 (0.1%); 4. Washington Elementary: 49 (0.0%); 5. Washington Elementary Sch: 46 (0.0%); 6. Central Elementary School: 42 (0.0%); 7. Jefferson Elementary Scho: 33 (0.0%); 8. Lincoln Elem School: 33 (0.0%); 9. Central High School: 32 (0.0%); 10. Roosevelt Elementary: 32 (0.0%); [ 88366 others ]: 100284 (99.6%); Missing: 0 (0.0%)
LSTREET1 [character]: 1. 6420 E. Broadway Blvd. Su: 33 (0.0%); 2. Box DOE: 28 (0.0%); 3. 2405 FAIRVIEW SCHOOL RD: 22 (0.0%); 4. 1820 XENIUM LN N: 19 (0.0%); 5. Main St: 13 (0.0%); 6. 335 ALTERNATIVE LN: 12 (0.0%); 7. 2101 N TWYMAN RD: 11 (0.0%); 8. 720 9TH AVE: 11 (0.0%); 9. 50 Moreland Rd.: 10 (0.0%); 10. 951 W Snowflake Blvd: 10 (0.0%); [ 92384 others ]: 100560 (99.8%); Missing: 0 (0.0%)
LSTREET2 [character]: 1. Suite B: 8 (1.4%); 2. Ste. 100: 7 (1.2%); 3. P.O. Box 1497: 6 (1.0%); 4. Suite A: 6 (1.0%); 5. Suite 200: 5 (0.8%); 6. Building B: 4 (0.7%); 7. Ste. 102: 4 (0.7%); 8. Ste. A: 4 (0.7%); 9. Suite 1: 4 (0.7%); 10. SUITE 111 HART: 4 (0.7%); [ 482 others ]: 540 (91.2%); Missing: 100137 (99.4%)
LSTREET3 [logical]: All NA's; Missing: 100729 (100.0%)
LCITY [character]: 1. HOUSTON: 783 (0.8%); 2. Chicago: 664 (0.7%); 3. Los Angeles: 577 (0.6%); 4. BROOKLYN: 569 (0.6%); 5. SAN ANTONIO: 520 (0.5%); 6. Phoenix: 446 (0.4%); 7. BRONX: 441 (0.4%); 8. DALLAS: 378 (0.4%); 9. NEW YORK: 359 (0.4%); 10. Tucson: 330 (0.3%); [ 14624 others ]: 95662 (95.0%); Missing: 0 (0.0%)
LSTATE [character]: 1. CA: 10325 (10.3%); 2. TX: 9320 (9.3%); 3. NY: 4808 (4.8%); 4. FL: 4377 (4.3%); 5. IL: 4245 (4.2%); 6. MI: 3736 (3.7%); 7. OH: 3610 (3.6%); 8. PA: 2990 (3.0%); 9. NC: 2693 (2.7%); 10. NJ: 2595 (2.6%); [ 45 others ]: 52030 (51.7%); Missing: 0 (0.0%)
LZIP [character]: 1. 85710: 53 (0.1%); 2. 10456: 45 (0.0%); 3. 85364: 44 (0.0%); 4. 78521: 43 (0.0%); 5. 78572: 42 (0.0%); 6. 78577: 41 (0.0%); 7. 00731: 39 (0.0%); 8. 10457: 37 (0.0%); 9. 78539: 37 (0.0%); 10. 60623: 36 (0.0%); [ 22526 others ]: 100312 (99.6%); Missing: 0 (0.0%)
LZIP4 [character]: 1. 8888: 899 (1.5%); 2. 1199: 113 (0.2%); 3. 1299: 111 (0.2%); 4. 9801: 106 (0.2%); 5. 2099: 104 (0.2%); 6. 1399: 101 (0.2%); 7. 1699: 100 (0.2%); 8. 1599: 99 (0.2%); 9. 1499: 94 (0.2%); 10. 1899: 89 (0.2%); [ 8615 others ]: 57411 (96.9%); Missing: 41502 (41.2%)
PHONE [character]: 1. (505)880-3744: 141 (0.1%); 2. (520)225-6060: 63 (0.1%); 3. (505)721-1051: 36 (0.0%); 4. (480)461-4000: 35 (0.0%); 5. (972)316-3663: 34 (0.0%); 6. (505)527-5800: 33 (0.0%); 7. (520)745-4588: 33 (0.0%); 8. (480)497-3300: 29 (0.0%); 9. (623)445-5000: 28 (0.0%); 10. (480)484-6100: 27 (0.0%); [ 91818 others ]: 100270 (99.5%); Missing: 0 (0.0%)
GSLO [character]: 1. PK: 31179 (31.0%); 2. KG: 23839 (23.7%); 3. 09: 16627 (16.5%); 4. 06: 12912 (12.8%); 5. 07: 5441 (5.4%); 6. 05: 2578 (2.6%); 7. 03: 1581 (1.6%); 8. 04: 1165 (1.2%); 9. M: 1113 (1.1%); 10. 01: 964 (1.0%); [ 8 others ]: 3330 (3.3%); Missing: 0 (0.0%)
GSHI [character]: 1. 05: 28039 (27.8%); 2. 12: 26443 (26.3%); 3. 08: 21860 (21.7%); 4. 06: 10873 (10.8%); 5. 04: 3938 (3.9%); 6. 02: 1591 (1.6%); 7. 03: 1446 (1.4%); 8. PK: 1430 (1.4%); 9. M: 1113 (1.1%); 10. N: 796 (0.8%); [ 9 others ]: 3200 (3.2%); Missing: 0 (0.0%)
VIRTUAL [character]: 1. A virtual school: 656 (0.7%); 2. Missing: 183 (0.2%); 3. Not a virtual school: 99049 (98.3%); 4. Not Applicable: 841 (0.8%); Missing: 0 (0.0%)
TOTFRL [numeric]: Mean (sd): 249.4 (275.2); min ≤ med ≤ max: -9 ≤ 178 ≤ 9626; IQR (CV): 297 (1.1); 1906 distinct values; Missing: 0 (0.0%)
FRELCH [numeric]: Mean (sd): 221.6 (253.9); min ≤ med ≤ max: -9 ≤ 149 ≤ 7581; IQR (CV): 272 (1.1); 1765 distinct values; Missing: 0 (0.0%)
REDLCH [numeric]: Mean (sd): 26 (36.9); min ≤ med ≤ max: -9 ≤ 16 ≤ 2045; IQR (CV): 37 (1.4); 399 distinct values; Missing: 0 (0.0%)
PK [numeric]: Mean (sd): 34.8 (53.5); min ≤ med ≤ max: 0 ≤ 22 ≤ 1912; IQR (CV): 43 (1.5); 468 distinct values; Missing: 64621 (64.2%)
KG [numeric]: Mean (sd): 65 (46.9); min ≤ med ≤ max: 0 ≤ 62 ≤ 948; IQR (CV): 57 (0.7); 393 distinct values; Missing: 43684 (43.4%)
G01 [numeric]: Mean (sd): 64.4 (44.8); min ≤ med ≤ max: 0 ≤ 62 ≤ 1408; IQR (CV): 56 (0.7); 353 distinct values; Missing: 43333 (43.0%)
G02 [numeric]: Mean (sd): 64.6 (44.4); min ≤ med ≤ max: 0 ≤ 63 ≤ 688; IQR (CV): 56 (0.7); 345 distinct values; Missing: 43268 (43.0%)
G03 [numeric]: Mean (sd): 66.4 (46.3); min ≤ med ≤ max: 0 ≤ 64 ≤ 783; IQR (CV): 59 (0.7); 358 distinct values; Missing: 43253 (42.9%)
G04 [numeric]: Mean (sd): 67.9 (48.7); min ≤ med ≤ max: 0 ≤ 65 ≤ 877; IQR (CV): 61 (0.7); 382 distinct values; Missing: 43470 (43.2%)
G05 [numeric]: Mean (sd): 69.7 (56.7); min ≤ med ≤ max: 0 ≤ 64 ≤ 985; IQR (CV): 65 (0.8); 494 distinct values; Missing: 44673 (44.3%)
G06 [numeric]: Mean (sd): 91.5 (108.4); min ≤ med ≤ max: 0 ≤ 56 ≤ 1155; IQR (CV): 111 (1.2); 641 distinct values; Missing: 58585 (58.2%)
G07 [numeric]: Mean (sd): 102.7 (126.2); min ≤ med ≤ max: 0 ≤ 52 ≤ 1439; IQR (CV): 153 (1.2); 687 distinct values; Missing: 63682 (63.2%)
G08 [numeric]: Mean (sd): 101.9 (127.1); min ≤ med ≤ max: 0 ≤ 50 ≤ 1608; IQR (CV): 152 (1.2); 700 distinct values; Missing: 63449 (63.0%)
G09 [numeric]: Mean (sd): 124.7 (185.8); min ≤ med ≤ max: 0 ≤ 40 ≤ 2799; IQR (CV): 166 (1.5); 987 distinct values; Missing: 68499 (68.0%)
G10 [numeric]: Mean (sd): 120.4 (178.1); min ≤ med ≤ max: 0 ≤ 39 ≤ 1837; IQR (CV): 157 (1.5); 945 distinct values; Missing: 68706 (68.2%)
G11 [numeric]: Mean (sd): 115.4 (170.1); min ≤ med ≤ max: 0 ≤ 40 ≤ 1719; IQR (CV): 149 (1.5); 914 distinct values; Missing: 68720 (68.2%)
G12 [numeric]: Mean (sd): 114.1 (165.5); min ≤ med ≤ max: 0 ≤ 43 ≤ 2580; IQR (CV): 150 (1.5); 891 distinct values; Missing: 68814 (68.3%)
G13 [logical]: 1. FALSE: 36 (97.3%); 2. TRUE: 1 (2.7%); Missing: 100692 (100.0%)
TOTAL [numeric]: Mean (sd): 515.7 (450.2); min ≤ med ≤ max: 0 ≤ 434 ≤ 14286; IQR (CV): 408 (0.9); 2945 distinct values; Missing: 2229 (2.2%)
MEMBER [numeric]: Mean (sd): 515.6 (449.9); min ≤ med ≤ max: 0 ≤ 434 ≤ 14286; IQR (CV): 408 (0.9); 2944 distinct values; Missing: 2229 (2.2%)
AM [numeric]: Mean (sd): 6.7 (30.3); min ≤ med ≤ max: 0 ≤ 1 ≤ 1395; IQR (CV): 4 (4.5); 424 distinct values; Missing: 20609 (20.5%)
HI [numeric]: Mean (sd): 142.5 (240.6); min ≤ med ≤ max: 0 ≤ 49 ≤ 4677; IQR (CV): 160 (1.7); 1745 distinct values; Missing: 3852 (3.8%)
BL [numeric]: Mean (sd): 83 (151.4); min ≤ med ≤ max: 0 ≤ 19 ≤ 5088; IQR (CV): 90 (1.8); 1166 distinct values; Missing: 8325 (8.3%)
WH [numeric]: Mean (sd): 247.9 (275.1); min ≤ med ≤ max: 0 ≤ 182 ≤ 8146; IQR (CV): 312 (1.1); 1839 distinct values; Missing: 3993 (4.0%)
HP [numeric]: Mean (sd): 3.1 (24.7); min ≤ med ≤ max: 0 ≤ 0 ≤ 1394; IQR (CV): 2 (8); 305 distinct values; Missing: 30008 (29.8%)
TR [numeric]: Mean (sd): 20.7 (27.3); min ≤ med ≤ max: 0 ≤ 12 ≤ 1228; IQR (CV): 24 (1.3); 307 distinct values; Missing: 7137 (7.1%)
FTE [numeric]: Mean (sd): 32.6 (25.6); min ≤ med ≤ max: 0 ≤ 27.6 ≤ 1419; IQR (CV): 24 (0.8); 10066 distinct values; Missing: 5233 (5.2%)
LATCOD [numeric]: Mean (sd): 37.8 (5.8); min ≤ med ≤ max: -14.3 ≤ 38.8 ≤ 71.3; IQR (CV): 7.7 (0.2); 96746 distinct values; Missing: 0 (0.0%)
LONCOD [numeric]: Mean (sd): -92.9 (16.9); min ≤ med ≤ max: -176.6 ≤ -89.3 ≤ 144.9; IQR (CV): 20.2 (-0.2); 96911 distinct values; Missing: 0 (0.0%)
ULOCALE [character]: 1. 21-Suburb: Large: 26772 (26.6%); 2. 11-City: Large: 14851 (14.7%); 3. 41-Rural: Fringe: 11179 (11.1%); 4. 42-Rural: Distant: 10279 (10.2%); 5. 13-City: Small: 6635 (6.6%); 6. 43-Rural: Remote: 6412 (6.4%); 7. 32-Town: Distant: 6266 (6.2%); 8. 12-City: Mid-size: 5876 (5.8%); 9. 33-Town: Remote: 4138 (4.1%); 10. 22-Suburb: Mid-size: 3305 (3.3%); [ 2 others ]: 5016 (5.0%); Missing: 0 (0.0%)
STUTERATIO [numeric]: Mean (sd): 16.9 (85.7); min ≤ med ≤ max: 0 ≤ 15.3 ≤ 22350; IQR (CV): 5.3 (5.1); 3854 distinct values; Missing: 6835 (6.8%)
STITLEI [character]: 1. Missing: 864 (0.9%); 2. No: 14596 (14.5%); 3. Not Applicable: 29199 (29.0%); 4. Yes: 56070 (55.7%); Missing: 0 (0.0%)
AMALM [numeric]: Mean (sd): 3.7 (16.1); min ≤ med ≤ max: 0 ≤ 1 ≤ 743; IQR (CV): 2 (4.4); 268 distinct values; Missing: 26365 (26.2%)
AMALF [numeric]: Mean (sd): 3.6 (15.5); min ≤ med ≤ max: 0 ≤ 1 ≤ 652; IQR (CV): 2 (4.4); 263 distinct values; Missing: 26708 (26.5%)
ASALM [numeric]: Mean (sd): 15.9 (45.2); min ≤ med ≤ max: 0 ≤ 3 ≤ 1997; IQR (CV): 11 (2.8); 522 distinct values; Missing: 16162 (16.0%)
ASALF [numeric]: Mean (sd): 15.1 (42.5); min ≤ med ≤ max: 0 ≤ 3 ≤ 1532; IQR (CV): 11 (2.8); 495 distinct values; Missing: 16080 (16.0%)
HIALM [numeric]: Mean (sd): 73.7 (123.5); min ≤ med ≤ max: 0 ≤ 25 ≤ 2292; IQR (CV): 83 (1.7); 1073 distinct values; Missing: 4774 (4.7%)
HIALF [numeric]: Mean (sd): 70.5 (118.7); min ≤ med ≤ max: 0 ≤ 24 ≤ 2461; IQR (CV): 79 (1.7); 1047 distinct values; Missing: 5121 (5.1%)
BLALM [numeric]: Mean (sd): 43.5 (77.3); min ≤ med ≤ max: 0 ≤ 11 ≤ 2473; IQR (CV): 48 (1.8); 687 distinct values; Missing: 10801 (10.7%)
BLALF [numeric]: Mean (sd): 42.1 (76.8); min ≤ med ≤ max: 0 ≤ 10 ≤ 2615; IQR (CV): 46 (1.8); 693 distinct values; Missing: 11485 (11.4%)
WHALM [numeric]: Mean (sd): 128.6 (140.5); min ≤ med ≤ max: 0 ≤ 95 ≤ 3854; IQR (CV): 160 (1.1); 1046 distinct values; Missing: 4502 (4.5%)
WHALF [numeric]: Mean (sd): 120.8 (135.6); min ≤ med ≤ max: 0 ≤ 88 ≤ 4292; IQR (CV): 152 (1.1); 1030 distinct values; Missing: 4682 (4.6%)
HPALM [numeric]: Mean (sd): 1.7 (13.4); min ≤ med ≤ max: 0 ≤ 0 ≤ 751; IQR (CV): 1 (7.9); 210 distinct values; Missing: 34182 (33.9%)
HPALF [numeric]: Mean (sd): 1.6 (12.2); min ≤ med ≤ max: 0 ≤ 0 ≤ 643; IQR (CV): 1 (7.7); 212 distinct values; Missing: 34563 (34.3%)
TRALM [numeric]: Mean (sd): 10.8 (13.9); min ≤ med ≤ max: 0 ≤ 6 ≤ 512; IQR (CV): 13 (1.3); 174 distinct values; Missing: 9200 (9.1%)
TRALF [numeric]: Mean (sd): 10.5 (14); min ≤ med ≤ max: 0 ≤ 6 ≤ 716; IQR (CV): 12 (1.3); 183 distinct values; Missing: 9477 (9.4%)
TOTMENROL [numeric]: Mean (sd): 264.9 (229); min ≤ med ≤ max: 0 ≤ 224 ≤ 6890; IQR (CV): 210 (0.9); 1691 distinct values; Missing: 2296 (2.3%)
TOTFENROL [numeric]: Mean (sd): 251.1 (222.8); min ≤ med ≤ max: 0 ≤ 211 ≤ 7396; IQR (CV): 200 (0.9); 1646 distinct values; Missing: 2362 (2.3%)
STATUS [numeric]: Mean (sd): 1.1 (0.6); min ≤ med ≤ max: 1 ≤ 1 ≤ 8; IQR (CV): 0 (0.5); values 1: 98557 (97.8%); 3: 1103 (1.1%); 4: 77 (0.1%); 5: 110 (0.1%); 6: 500 (0.5%); 7: 341 (0.3%); 8: 41 (0.0%); Missing: 0 (0.0%)
UG [numeric]: Mean (sd): 11.2 (33.6); min ≤ med ≤ max: 0 ≤ 2 ≤ 1017; IQR (CV): 10 (3); 217 distinct values; Missing: 88689 (88.0%)
AE [logical]: 1. FALSE: 60 (93.8%); 2. TRUE: 4 (6.2%); Missing: 100665 (99.9%)
SCHOOL_TYPE_TEXT [character]: 1. Alternative/other school: 5531 (5.5%); 2. Regular school: 91737 (91.1%); 3. Special education school: 1948 (1.9%); 4. Vocational school: 1513 (1.5%); Missing: 0 (0.0%)
SY_STATUS_TEXT [character]: 1. Currently operational: 98557 (97.8%); 2. New school: 1103 (1.1%); 3. School has changed agency: 110 (0.1%); 4. School has reopened: 41 (0.0%); 5. School temporarily closed: 500 (0.5%); 6. School to be operational: 341 (0.3%); 7. School was operational bu: 77 (0.1%); Missing: 0 (0.0%)
SCHOOL_LEVEL [character]: 1. Adult Education: 28 (0.0%); 2. Elementary: 53287 (52.9%); 3. High: 22977 (22.8%); 4. Middle: 16506 (16.4%); 5. Not Applicable: 796 (0.8%); 6. Not Reported: 1113 (1.1%); 7. Other: 3824 (3.8%); 8. Prekindergarten: 1430 (1.4%); 9. Secondary: 602 (0.6%); 10. Ungraded: 166 (0.2%); Missing: 0 (0.0%)
AS [numeric]: Mean (sd): 29.8 (85.8); min ≤ med ≤ max: 0 ≤ 5 ≤ 3529; IQR (CV): 21 (2.9); 850 distinct values; Missing: 12717 (12.6%)
CHARTER_TEXT [character]: 1. No: 87007 (86.4%); 2. Not Applicable: 6387 (6.3%); 3. Yes: 7335 (7.3%); Missing: 0 (0.0%)
MAGNET_TEXT [character]: 1. Missing: 6256 (6.2%); 2. No: 77531 (77.0%); 3. Not Applicable: 13520 (13.4%); 4. Yes: 3422 (3.4%); Missing: 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-28

Univariate Visualizations

Bivariate Visualization(s)

Univariate Visualizations

Bivariate Visualization(s)