Data Analytics and Computational Social Science: Shaye Hallee - DACSS 601 HW02

Shaye Hallee

Introducing the data

Here, we’ll be looking at data about disabled populations in US counties. Specifically, we’re using Subject Table S1810 from the 2019 1-year population estimates from the American Community Survey, an ongoing demographics survey run by the U.S. Census Bureau. This table includes county populations and disabled populations broken down across different demographics.¹

We’re going to answer the following questions:

Which US county has the highest disabled population (by count)?
Which county has the lowest disabled population (by count)?
Which counties have a much higher than average disabled population (by percentage)?
Which counties have a much lower than average disabled population (by percentage)?

Table S1810 is incredibly large, so we’ll pull out the following columns:

Variable	Class	Description
County	`chr` (text)	county name
State	`chr` (text)	state name
cty_ni_pop	`dbl` (numerical)	total estimated 2019 county population of noninstitutionalized civilians
cty_ni_dis_pop	`dbl` (numerical)	estimated 2019 county population of disabled, noninstitutionalized civilians
cty_pct_disabled	`dbl` (numerical)	disabled population as a percentage of the total county population

“Noninstitutionalized civilians” means people who aren’t in the armed forces and don’t live in institutions like prisons, hospitals, or nursing homes.² The two excluded groups (service members and people living in institutions) usually rely on their respective institutions to meet their support and access needs, and they usually have higher disabled populations. Surveys like the ACS are mostly used to plan community resources, so they exclude these groups on the assumption that they won’t be interacting with the communities around them.³

I might have to find more thorough data if I plan to use demographics information in future projects.

Reading in the data

Let’s read the data in, free it from an unnecessary row, and put it all in a tibble.

library(tidyverse)
library(knitr)

# Read the raw CSV export
data <- read.csv("ACS_ST_1Y_2019_Disability_County/data_all.csv", encoding = "UTF-8")
# Drop the first row, which isn't county data
data <- data[-1, ]
data <- as_tibble(data)

Let’s make sure it’s a tibble of about the expected size.

class(data)

[1] "tbl_df"     "tbl"        "data.frame"

dim(data)

[1] 840 416

Done!
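
To see why one row might need to go, here’s a minimal sketch on made-up text rather than the real file. It assumes the export carries a second line of human-readable labels under the variable codes; that layout, and the `GEO_ID` column, are assumptions for illustration, and only the two county names and populations come from this post.

```r
library(tibble)

# Hypothetical stand-in for the download: line 1 holds variable codes,
# line 2 holds descriptive labels (assumed to be the "unnecessary row").
raw_text <- 'GEO_ID,NAME,S1810_C01_001E
id,Geographic Area Name,Estimate!!Total
0500000US01003,"Baldwin County, Alabama",220911
0500000US01015,"Calhoun County, Alabama",111075'

toy <- read.csv(text = raw_text, stringsAsFactors = FALSE)
toy <- as_tibble(toy[-1, ])   # drop the label row

dim(toy)                      # 2 rows, 3 columns
class(toy$S1810_C01_001E)     # "character": the label row forced text parsing
```

If the real file looks like this, it would also explain why the population columns arrive as text and need converting to numbers later on.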

Cleaning up the data

Right now, our tibble has a lot of very cool data that we won’t be using, and the column names aren’t human-friendly.

Let’s extract the right columns and give them (marginally) friendlier names. We’ll use dplyr::select for that.

data <- select(data,
               NAME,
               cty_ni_pop = S1810_C01_001E,
               cty_ni_dis_pop = S1810_C02_001E,
               cty_pct_disabled = S1810_C03_001E)

For kicks, let’s separate the “NAME” column into “County” and “State.”

data <- separate(data, NAME, c("County", "State"), sep = ", ")
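
Here’s a small sketch of what that split does, using a toy tibble built from three names that appear in this data set. The `toy` object is just for illustration.

```r
library(tidyr)
library(tibble)

# Toy rows in the same "County, State" shape as the NAME column.
toy <- tibble(NAME = c("Baldwin County, Alabama",
                       "Alexandria city, Virginia",
                       "San Juan Municipio, Puerto Rico"))

# Split on comma-plus-space; NAME is replaced by the two new columns.
toy <- separate(toy, NAME, into = c("County", "State"), sep = ", ")
toy
```

This works because every name here contains exactly one ", ". With tidyr’s defaults, a name containing a second ", " should trigger a warning and lose the extra piece, so it’s worth watching for warnings on the full data.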

Let’s turn the appropriate columns into numerical values. This is kind of sloppy, but it’s just three columns in a script we probably won’t use again. Famous last words, I know.

data$cty_ni_pop <- as.numeric(data$cty_ni_pop)
data$cty_ni_dis_pop <- as.numeric(data$cty_ni_dis_pop)
data$cty_pct_disabled <- as.numeric(data$cty_pct_disabled)
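
If this script ever does get reused, one less repetitive option is `dplyr::mutate` with `across`. This is a sketch on a toy tibble (two rows borrowed from the table in this post, stored as text), assuming dplyr 1.0 or later; it isn’t what the post actually ran.

```r
library(dplyr)
library(tibble)

# Toy version of the table, with the numeric columns still stored as text.
toy <- tibble(
  County           = c("Baldwin County", "Calhoun County"),
  cty_ni_pop       = c("220911", "111075"),
  cty_ni_dis_pop   = c("31901", "22269"),
  cty_pct_disabled = c("14.4", "20.0")
)

# Convert every column whose name starts with "cty_" in one step.
toy <- mutate(toy, across(starts_with("cty_"), as.numeric))

sapply(toy, class)   # County stays character; the cty_ columns become numeric
```

Selecting by the `cty_` prefix only works because of how the columns were renamed above; listing the columns explicitly inside `across(c(...))` would be the safer choice if the names change.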

Here’s what the data looks like now:

kable(head(data))

County	State	cty_ni_pop	cty_ni_dis_pop	cty_pct_disabled
Baldwin County	Alabama	220911	31901	14.4
Calhoun County	Alabama	111075	22269	20.0
Cullman County	Alabama	82841	14480	17.5
DeKalb County	Alabama	70392	7583	10.8
Elmore County	Alabama	75409	9707	12.9
Etowah County	Alabama	101470	15944	15.7

kable(tail(data))

County	State	cty_ni_pop	cty_ni_dis_pop	cty_pct_disabled
Mayagüez Municipio	Puerto Rico	71018	15705	22.1
Ponce Municipio	Puerto Rico	129198	28785	22.3
San Juan Municipio	Puerto Rico	313915	60014	19.1
Toa Alta Municipio	Puerto Rico	71897	6140	8.5
Toa Baja Municipio	Puerto Rico	73735	16284	22.1
Trujillo Alto Municipio	Puerto Rico	63312	15870	25.1

Finding cool things in the data

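First, let’s find the average disabled population across the counties in our data, as a percentage of each county’s total population. This is an unweighted mean, so every county counts equally regardless of its size.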

mean_pct_disabled <- mean(data$cty_pct_disabled)
mean_pct_disabled

[1] 13.70964
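
It’s worth knowing that taking the mean of county percentages isn’t the same as dividing total disabled population by total population. As a rough illustration, here’s a sketch using only the six Alabama rows from the `head()` output above; the variable names are made up for this example.

```r
# First six rows of the cleaned table (values copied from the head() output).
pop     <- c(220911, 111075, 82841, 70392, 75409, 101470)
dis_pop <- c(31901, 22269, 14480, 7583, 9707, 15944)
pct     <- c(14.4, 20.0, 17.5, 10.8, 12.9, 15.7)

unweighted <- mean(pct)                      # every county counts equally
pooled     <- 100 * sum(dis_pop) / sum(pop)  # every person counts equally

round(c(unweighted = unweighted, pooled = pooled), 1)   # about 15.2 and 15.4
```

The two numbers are close here, but they can drift apart when very large and very small counties have different disability rates.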

We’ll use dplyr::filter to answer the questions from the intro.

Which US county has the highest disabled population (by count)?

highest_disabled_pop <- filter(data, cty_ni_dis_pop == max(cty_ni_dis_pop))
kable(highest_disabled_pop)

County	State	cty_ni_pop	cty_ni_dis_pop	cty_pct_disabled
Los Angeles County	California	9964081	984931	9.9

Which county has the lowest disabled population (by count)?

lowest_disabled_pop <- filter(data, cty_ni_dis_pop == min(cty_ni_dis_pop))
kable(lowest_disabled_pop)

County	State	cty_ni_pop	cty_ni_dis_pop	cty_pct_disabled
Walker County	Texas	61093	4947	8.1
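
For the highest and lowest rows, `dplyr::slice_max` and `dplyr::slice_min` are another way to say the same thing. Here’s a sketch on a three-row toy tibble with values taken from the tables in this post, assuming dplyr 1.0 or later.

```r
library(dplyr)
library(tibble)

# Three rows copied from the tables in this post.
toy <- tibble(
  County         = c("Los Angeles County", "Walker County", "Baldwin County"),
  State          = c("California", "Texas", "Alabama"),
  cty_ni_dis_pop = c(984931, 4947, 31901)
)

# By default, ties are kept, so these behave like the filter() calls above.
highest <- slice_max(toy, cty_ni_dis_pop, n = 1)
lowest  <- slice_min(toy, cty_ni_dis_pop, n = 1)

highest$County   # "Los Angeles County"
lowest$County    # "Walker County"
```

Bumping `n` up is an easy way to get a top-five or bottom-five list instead of a single county.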

Which counties have a much higher than average disabled population (by percentage)? Let’s arbitrarily use 1.75 times the mean as our threshold.

hi_disability <- filter(data, cty_pct_disabled >= 1.75 * mean_pct_disabled)
kable(hi_disability)

County	State	cty_ni_pop	cty_ni_dis_pop	cty_pct_disabled
Talladega County	Alabama	76722	20102	26.2
Walker County	Alabama	62896	17381	27.6
Charlotte County	Florida	186002	45368	24.4
Walker County	Georgia	68199	16593	24.3
Raleigh County	West Virginia	70907	17130	24.2
Bayamón Municipio	Puerto Rico	164521	43015	26.1
Caguas Municipio	Puerto Rico	124149	32099	25.9
Guaynabo Municipio	Puerto Rico	83119	19948	24.0
Trujillo Alto Municipio	Puerto Rico	63312	15870	25.1

Which counties have a much lower than average disabled population (by percentage)? Let’s (also arbitrarily) use 0.5 times the mean as our threshold.

lo_disability <- filter(data, cty_pct_disabled <= 0.5 * mean_pct_disabled)
kable(lo_disability)

County	State	cty_ni_pop	cty_ni_dis_pop	cty_pct_disabled
Gwinnett County	Georgia	930955	63740	6.8
Carver County	Minnesota	104708	6822	6.5
Fort Bend County	Texas	806384	53265	6.6
Arlington County	Virginia	231652	14506	6.3
Loudoun County	Virginia	411654	23713	5.8
Alexandria city	Virginia	155298	10181	6.6
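
To make the two cutoffs concrete, here’s a sketch that computes them from the mean reported above and labels a few counties in a single pass with `dplyr::case_when`. The `band` column and the `toy` tibble are made up for this example; the percentages come from the tables in this post.

```r
library(dplyr)
library(tibble)

mean_pct <- 13.70964          # mean reported earlier in this post
hi_cut   <- 1.75 * mean_pct   # about 23.99
lo_cut   <- 0.5 * mean_pct    # about 6.85

toy <- tibble(
  County           = c("Talladega County", "Baldwin County", "Loudoun County"),
  cty_pct_disabled = c(26.2, 14.4, 5.8)
)

# Label each county by which side of the (arbitrary) thresholds it falls on.
toy <- mutate(toy, band = case_when(
  cty_pct_disabled >= hi_cut ~ "much higher",
  cty_pct_disabled <= lo_cut ~ "much lower",
  TRUE                       ~ "near average"
))
toy
```

Keeping the labels in one column makes it easy to count counties per band or to carry the grouping into later comparisons.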

What’s next?

None of this tells us anything particularly interesting without looking at some complementary data. I’d be interested in looking at other data from Subject Table S1810 to see if there are correlations with race, overall population size, age, or type of disability. Other data sets are out there with data on poverty, food access, urbanization, and lots of other information, and it’ll be very cool to check out some of that data.

¹ U.S. Census Bureau, 2019 American Community Survey 1-Year Estimates, https://data.census.gov/cedsci/table?t=Disability&tid=ACSST1Y2019.S1810
² U.S. Census Bureau, American Community Survey and Puerto Rico Community Survey 2019 Code List, https://www2.census.gov/programs-surveys/acs/tech_docs/code_lists/2019_ACS_Code_Lists.pdf
³ Brault, M. (2008). Disability Status and the Characteristics of People in Group Quarters: A Brief Analysis of Disability Prevalence Among the Civilian Noninstitutionalized and Total Populations in the American Community Survey. U.S. Census Bureau “Working Papers”. https://www.census.gov/library/working-papers/2008/demo/brault-01.html
