Code
library(tidyverse)
library(readxl)
library(here)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Linus Jen
June 29, 2023
The United States contains over 330 million people over about 3.8 million squared mile. However, the distribution of people across this vast country is not even. As a result, each region has different demographics than other states, leading to different cultures, economic and political beliefs, ideologies, and societal components.
For this homework assignment and final project, I want to dive deeper into US state and county level data to explore how distributions in population, education level, and poverty all relate to one another. All data was collected by the ifically data that relates back to population, education level, and poverty estimates. All data was pulled from the US Department of Agriculture via the census. Although data does exist for 2021, not all datasets have this set of information, and thus we will be exploring the collected results from 2020. Three datasets have been added to the GitHub repo for the class, and the three files are Linus_US_Education.xlsx
, Linus_US_PopulationEstimates.xlsx
, and Linus_US_PovertyEstimates.xlsx
.
As mentioned previously, there are 3 files of interest that explore population, education, and poverty. To connect all the datasets together, the Federal Information Processing Standard (a five digit number representing a unique ID for states and counties), or FIPS for short, is used as a key. To remove state totals, simply exclude observations where the FIPS value ends in “000”, such as “01000”, “02000”, etc. We also are only interested in US counties, and remove all FIPS codes greater than or equal to 72000. Also, I will treat Washington D.C. as a county rather than a state, as it is duplicated in our datasets.
This dataset contains population information for US territories, states, and counties. Although there’s data from the 1990s to 2021, as mentioned before, we will only look at 2020 census data so that it can be used with other datasets. Another interesting variable is the rural-urban continuum code that was given for each county in 2013. From the US Department of Agriculture, these continuum codes are split into metropolitan (1-3) and non-metropolitan counties (4-9), leading to 9 distinct groups. Codes 1-3 represent counties in metro areas of 1 million or more people, 250K - 1 million people, or less than 250K, respectively. Codes 4 and 5 represent counties with an urban population of 20K or more, adjacent or not adjacent to a metro area, respectively. Codes 6 and 7 represent counties with an urban population of 2.5K - 20K, adjacent or not adjacent to a metro area, respectively. Finally, codes 8 and 9 represent counties with a completely rural or less than 2.5K urban population, adjacent or not adjacent to a metro area. The continuum codes provide a straightforward and guided way to group counties together for future analyses.
pop <- read_excel(here("posts", "_data", "Linus_US_data", "Linus_US_PopulationEstimates.xlsx"), skip=4) %>%
select(matches("FIPS|State|Area|Code|2020")) %>%
rename(FIPS=1, State=2, Area_Name=3, Cont_Code=4, pop_2020=5) %>%
filter(FIPS < 72000 & FIPS != 11000) %>%
mutate(region_type = case_when(FIPS == "00000" ~ "national",
str_detect(FIPS, "0{3}$") ~ "state",
TRUE ~ "county"))
# View data
pop
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
State [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Area_Name [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Cont_Code [numeric] |
|
|
57 (1.8%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
pop_2020 [numeric] |
|
3136 distinct values | 6 (0.2%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
region_type [character] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.3.0)
2023-07-02
The tables above provide a quick glimpse into the dataset. Fortunately, the data is already tidy, as each row represents data at a region level (national, state, or county/city, based on the zip code). It’s interesting to note that although the data is supposedly at the county level, counties have different names in different states. For example, Alaska has many regions that are called boroughs or areas, and Louisiana’s counties are called parishes! Strangely enough, Virginia also has a few zip codes dedicated to a specific city.
As for our data, thankfully it’s clean, but there is some missing data in the population (especially concerning given that the government “should” know where people live, for better or for worse). Let’s investigate this further.
The first table above contains all regions where the population was not provided. Four of the six regions are in Alaska, where it’s quite difficult to the population and is either desolate or controlled by native Alaskans. The last two regions, however, are cities in Virginia, which don’t necessarily fit into the classifications of “county”. I’m not sure why this is the case, but a quick Google search showed that these populations are very low, so I’ll assume it’s because the number was too low to track (or some government inefficiency lost the postal information somewhere).
The second table shows where there isn’t a continuum code. Regions that are not counties (IE states and the US) do not have a continuum code. The remaining counties without a continuum code all are mostly in Alaska or are really small, which explains why they might not have a continuum code.
I plan to keep data with missing information, even though I will use a full join to connect all the datasets together.
Next, let’s look at county and state level data and summaries.
# Population summary statistics on the county level
pop %>%
filter(region_type == "county") %>% # Select only
summarise(min_pop = min(pop_2020, na.rm=TRUE),
q1_pop = quantile(pop_2020, 0.25, 1, na.rm=TRUE),
mean_pop = mean(pop_2020, na.rm=TRUE),
med_pop = median(pop_2020, na.rm=TRUE),
q3_pop = quantile(pop_2020, 0.75, 1, na.rm=TRUE),
max_pop = max(pop_2020, na.rm=TRUE))
# Graph data at the state
pop %>%
filter(region_type == "county" & pop_2020 < 250000) %>%
ggplot(aes(pop_2020)) +
geom_histogram(fill="blue", alpha=0.5) +
theme_minimal() +
labs(title="Graph TODO: Distribution of US County Populations",
x="Population of a County",
y="Number of Counties",
subtitle="For county populations under 250,000")
At the county level, there’s huge variance in terms of the population of a county, from a meager 64 people to a whopping 10 million people. A majority of counties have populations between 10K and 68K, and the number of counties with populations of 75K drops significantly.
# Population summary statistics on the state level
pop %>%
filter(region_type == "state") %>% # Select only
summarise(min_pop = min(pop_2020, na.rm=TRUE),
q1_pop = quantile(pop_2020, 0.25, 1, na.rm=TRUE),
mean_pop = mean(pop_2020, na.rm=TRUE),
med_pop = median(pop_2020, na.rm=TRUE),
q3_pop = quantile(pop_2020, 0.75, 1, na.rm=TRUE),
max_pop = max(pop_2020, na.rm=TRUE))
Like the county populations, state populations show high variance, with populations between 577K and 39.5 million residents. Most state populations are between 1.8 million and 7.5 million people, with the number of states with populations above 8 million people dropping significantly, as seen in the histogram above.
Lastly for this section, it might be helpful to see the distribution of the continuum codes for counties, as this is a great way to compare all the consolidated data.
The table above shows the number and proportion of counties who fall within each continuum code. Surprisingly, the data isn’t incredibly skewed towards any particular code. The counts for metropolitan counties (codes 1-3) are pretty close to one another. For non-metropolitan counties, it’s surprising to see that codes 6 and 7 (counties with populations between 2.5K and 20K people in counties adjacent or nonadjacent to a metro area, respectively) have a greater number of counties that fall into that grouping. Code 5 (having an urban population of 20K or greater, not adjacent to a metro area) is by far the smallest group, and this makes sense, as counties with a large population not next to a metropolitan area isn’t very common in the US, and this supports that belief.
For future analyses, we could group counties into metropolitan (codes 1-3) vs. non-metropolitan codes (4-9), compare codes within those two groups against one another, or compare non-metropolitan counties adjacent to (codes 4, 6, and 8) and non-adjacent to metro areas (codes 5, 7, and 9).
The education dataset is very similar to that of the population dataset, in which each observation is a zip code representing a county, state or the country. There is data from 1970 to 2021, with the 2017-2021 data all consolidated into one value. Different levels of education are tracked (didn’t receive their high school diploma, received only their high school diploma, some bachelors or associates degree level education, or received a higher degree or above) as well as both the number and percentage of residents from a particular region that falls into each category.
# Pull in data
edu <- read_excel(here("posts", "_data", "Linus_US_data", "Linus_US_Education.xlsx"), skip=3) %>%
select(matches("21|FIPS|State|name")) %>%
rename(FIPS=1, State=2, Area_Name=3, no_HS=4, HS_only=5, some_college=6, bach_plus=7, perc_no_HS=8, perc_HS_only=9, perc_some_college=10, perc_bach_plus=11) %>%
filter(FIPS < 72000 & FIPS != 11000) %>%
mutate(region_type = case_when(FIPS == "00000" ~ "national",
str_detect(FIPS, "0{3}$") ~ "state",
TRUE ~ "county"))
# dfSummary
print(summarytools::dfSummary(edu,
varnumbers=FALSE,
plain.ascii=FALSE,
style="grid",
graph.magnif = 0.70,
valid.col=FALSE),
method="render",
table.classes="table-condensed")
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
State [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Area_Name [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
no_HS [numeric] |
|
2610 distinct values | 11 (0.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HS_only [numeric] |
|
2949 distinct values | 11 (0.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
some_college [numeric] |
|
2932 distinct values | 11 (0.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
bach_plus [numeric] |
|
2805 distinct values | 11 (0.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
perc_no_HS [numeric] |
|
3193 distinct values | 11 (0.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
perc_HS_only [numeric] |
|
3194 distinct values | 11 (0.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
perc_some_college [numeric] |
|
3191 distinct values | 11 (0.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
perc_bach_plus [numeric] |
|
3193 distinct values | 11 (0.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
region_type [character] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.3.0)
2023-07-02
The tables above provide a quick glance into how the data is formatted. Some formatting will need to be done to make the data tidy. There are 11 instances where there is missing data - let’s look into this now.
We see that for 11 regions, no education data has been recorded. Many of these are Alaskan counties, which, as mentioned before, have very small populations and may be home to native Alaskans. Bedford city and Clifton Forge city are again listed with missing data, supporting the hypothesis that both cities didn’t report some crucial information, or because these cities are connected to a county, that information has already been shared.
The following code is run to make our data “tidy”.
# Make tidy
# Made tidy
edu_totals <- edu %>%
select(c("FIPS", contains("perc"))) %>%
pivot_longer(cols=contains("perc"),
names_to="edu_level",
values_to="edu_perc") %>%
mutate(edu_level=str_remove(edu_level, "perc_"))
edu_percs <- edu %>%
select(c("FIPS", !contains("perc"))) %>%
pivot_longer(cols=no_HS:bach_plus,
names_to="edu_level",
values_to="edu_total")
edu_tidy = edu_percs %>% inner_join(edu_totals,
by=c("FIPS", "edu_level")) %>%
mutate(edu_level=factor(edu_level, levels=c("no_HS", "HS_only", "some_college", "bach_plus")))
# View tidy data
edu_tidy
[1] "County level education summary statistics for residents, by county totals"
edu_tidy %>%
filter(region_type == "county") %>% # Remove states
group_by(edu_level) %>%
summarise(min_tot = min(edu_total, na.rm=TRUE),
q1_tot = quantile(edu_total, 0.25, 1, na.rm=TRUE),
mean_tot = mean(edu_total, na.rm=TRUE),
med_tot = median(edu_total, na.rm=TRUE),
q3_tot = quantile(edu_total, 0.75, 1, na.rm=TRUE),
max_tot = max(edu_total, na.rm=TRUE))
[1] "County level education summary statistics for residents, by county percentages"
edu_tidy %>%
filter(region_type == "county") %>% # Select only
group_by(edu_level) %>%
summarise(min_perc = min(edu_perc, na.rm=TRUE),
q1_perc = quantile(edu_perc, 0.25, 1, na.rm=TRUE),
mean_perc = mean(edu_perc, na.rm=TRUE),
med_perc = median(edu_perc, na.rm=TRUE),
q3_perc = quantile(edu_perc, 0.75, 1, na.rm=TRUE),
max_perc = max(edu_perc, na.rm=TRUE))
# Graph proportions at county level
edu %>%
select(matches("FIPS|perc|region")) %>%
filter(region_type == "county") %>% # Remove states
pivot_longer(cols=contains("perc"), names_to="perc_type", values_to="percent") %>%
mutate(perc_type = factor(perc_type, levels=c("perc_no_HS", "perc_HS_only", "perc_some_college", "perc_bach_plus"))) %>%
ggplot(aes(x=perc_type, y=percent/100, color=perc_type, alpha = 0.8)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.2) +
theme_minimal() +
theme(legend.position="none",
axis.text.x = element_text(angle=0)) +
labs(title="Graph TODO: Distribution of Education Received for US Counties",
x="Education Level",
y="Percentage of People within a County", fill="Education Level") +
scale_x_discrete(labels=c("No HS diploma", "HS Diploma", "Some higher edu", "Received higher degree")) +
scale_y_continuous(labels=scales::percent)
Graph TODO above shows the distributions of people who fall into each education category based on the county they live in. Because the populations for each county vary so drastically (as seen in the previous dataset), the percentages for people who fall into each category for each county. Education differs greatly across counties, as each category has quite distinct peaks and and variability. People who have only received a high school diploma or some higher education make up the majority of the all counties’ population. Getting a college degree or higher is still somewhat rare, as we see that the peak is around 20% of a county’s population. Lastly, there’s clearly some room for improvement in terms of improving education for all, as ~10% of people in counties don’t have their high school diploma (based on the median), with about 25% of all counties having more than 15% of their population not even getting their high school diploma.
The tables above graph TODO add more insight with regards to educational levels. There are instances where some counties don’t have any residents who reported getting any form of higher education. The fact that some counties had 50% or more their residents not even receiving a high school diploma is startling. Although the group of people who have received only a high school diploma
[1] "State level education summary statistics for residents, by state totals"
edu_tidy %>%
filter(region_type == "state") %>% # Select states
group_by(edu_level) %>%
summarise(min_tot = min(edu_total, na.rm=TRUE),
q1_tot = quantile(edu_total, 0.25, 1, na.rm=TRUE),
mean_tot = mean(edu_total, na.rm=TRUE),
med_tot = median(edu_total, na.rm=TRUE),
q3_tot = quantile(edu_total, 0.75, 1, na.rm=TRUE),
max_tot = max(edu_total, na.rm=TRUE))
[1] "State level education summary statistics for residents, by percentages"
edu_tidy %>%
filter(region_type == "state") %>% # Select states
group_by(edu_level) %>%
summarise(min_perc = min(edu_perc, na.rm=TRUE),
q1_perc = quantile(edu_perc, 0.25, 1, na.rm=TRUE),
mean_perc = mean(edu_perc, na.rm=TRUE),
med_perc = median(edu_perc, na.rm=TRUE),
q3_perc = quantile(edu_perc, 0.75, 1, na.rm=TRUE),
max_perc = max(edu_perc, na.rm=TRUE))
# Graph proportions at state level
edu %>%
select(matches("FIPS|perc|region")) %>%
filter(region_type == "state") %>% # Remove states
pivot_longer(cols=contains("perc"), names_to="perc_type", values_to="percent") %>%
mutate(perc_type = factor(perc_type, levels=c("perc_no_HS", "perc_HS_only", "perc_some_college", "perc_bach_plus"))) %>%
ggplot(aes(x=perc_type, y=percent/100, color=perc_type, alpha = 0.8)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.2) +
theme_minimal() +
theme(legend.position="none",
axis.text.x = element_text(angle=0)) +
labs(title="Graph TODO: Distribution of Education Received for US States",
x="Education Level",
y="Percentage of People within a State", fill="Education Level") +
scale_x_discrete(labels=c("No HS diploma", "HS Diploma", "Some higher edu", "Received higher degree")) +
scale_y_continuous(labels=scales::percent)
The state level data paints a slightly different picture. People tend to be pretty educated, as those who have at least some college education (if not more) have a higher summary statistics across the board, compared to those with only a high school diploma or lower. This is also reflected in the table showing the summary statistics for state percentages, as the categories some_college
and bach_plus
are greater than the two categories showing only some high school education. This suggests that at the county level, those with some higher education tend to live in very specific counties, rather than equally dispersing across the state. This makes sense, as most college educated folks tend to move to urban, metropolitan areas, while rural counties make up the majority of the US. Similarly, because there are many counties with a small number of people (who may be less educated), these counties would inflate the proportion of people with only a high school diploma or less.
This poverty dataset contains poverty-related statistics for 2020. Poverty, as explained on the census.gov website, is determined by first summing up a family’s total income over a given year, and then comparing that with a “poverty threshold” that is calculated by the government. This poverty threshold incorporates the size of a family and age of each member and Consumer Price Index for All Urban Consumers (CPI-U), but does not vary by geographical location.
Although this dataset contains a lot of variables (such as 90% confidence intervals for different poverty groups), we will only be looking at the total number and proportion of people living in poverty for a specific region (county, state, or nation), as well as the number and proportion of people within a certain age group (aged 0-17, 0-4, and 5-17) who are estimated to be living in poverty. This data is especially valuable and investigating poverty for minors, as theoretically investing more in minors could help reduce poverty later on.
pov <- read_excel(here("posts", "_data", "Linus_US_data", "Linus_US_PovertyEstimates.xlsx"), sheet="Poverty Data 2020", skip=4) %>%
select(c(matches("FIPS|POV|MED"), "Stabr", "Area_name")) %>%
rename(FIPS=1, pov_all=2, perc_pov_all=3, pov_0_17=4, perc_pov_0_17=5, pov_5_17=6, perc_pov_5_17=7, med_inc=8, pov_0_4=9, perc_pov_0_4=10, state_acr=11, area_name=12) %>%
filter(FIPS < 72000 & FIPS != 11000) %>%
mutate(region_type = case_when(FIPS == "00000" ~ "national",
str_detect(FIPS, "0{3}$") ~ "state",
TRUE ~ "county"))
# dfSummary
print(summarytools::dfSummary(pov,
varnumbers=FALSE,
plain.ascii=FALSE,
style="grid",
graph.magnif = 0.70,
valid.col=FALSE),
method="render",
table.classes="table-condensed")
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
pov_all [numeric] |
|
2762 distinct values | 1 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
perc_pov_all [numeric] |
|
286 distinct values | 1 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
pov_0_17 [numeric] |
|
2161 distinct values | 1 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
perc_pov_0_17 [numeric] |
|
400 distinct values | 1 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
pov_5_17 [numeric] |
|
1927 distinct values | 1 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
perc_pov_5_17 [numeric] |
|
385 distinct values | 1 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
med_inc [numeric] |
|
3089 distinct values | 1 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
pov_0_4 [numeric] |
|
51 distinct values | 3143 (98.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
perc_pov_0_4 [numeric] |
|
45 distinct values | 3143 (98.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
state_acr [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
area_name [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
region_type [character] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.3.0)
2023-07-02
The data above is shown at the zip code level, and includes the number and percent of people living in poverty for all residents, or those of a specific age bracket. One concerning factor is that there is a significant number of missing data for the number and proportion of people aged 0-4 living in poverty. This is probably due to the fact that this age group is the hardest to keep track of (as they aren’t of age to attend school yet). We will remove these two columns due to the high levels of missingness.
Let’s check the missing data.
We see for Kalawao County in Hawaii, surprisingly, all values pertaining to poverty are missing. Because this is only one county amongst about 3,850 counties, this isn’t a big deal. I’ll leave it in so that during the join, we know it was missing information.
Lastly, our data needs to be made “tidy”, as percents and numbers should each be their own column.
# Tidy data
pov_percs <- pov %>%
select(!contains("perc")) %>%
pivot_longer(cols=pov_all:pov_5_17,
names_to="age_group",
values_to="pov_total")
pov_totals <- pov %>%
select(c("FIPS", contains("perc"))) %>%
pivot_longer(cols=perc_pov_all:perc_pov_5_17,
names_to="age_group",
values_to="pov_perc") %>%
mutate(age_group=str_remove(age_group, "perc_"))
pov_tidy <- pov_percs %>% inner_join(pov_totals, by=c("FIPS", "age_group")) %>%
mutate(age_group=factor(age_group, levels=c("pov_all", "pov_0_17", "pov_5_17", "pov_0_4", "med_")))
# Show tidy data
pov_tidy
[1] "County level summary statistics for residents living in poverty, total number"
pov_tidy %>%
filter(region_type == "county") %>% # Remove states
group_by(age_group) %>%
summarise(min_tot = min(pov_total, na.rm=TRUE),
q1_tot = quantile(pov_total, 0.25, 1, na.rm=TRUE),
mean_tot = mean(pov_total, na.rm=TRUE),
med_tot = median(pov_total, na.rm=TRUE),
q3_tot = quantile(pov_total, 0.75, 1, na.rm=TRUE),
max_tot = max(pov_total, na.rm=TRUE))
[1] "County level summary statistics for residents living in poverty by percent"
pov_tidy %>%
filter(region_type == "county") %>% # Remove states
group_by(age_group) %>%
summarise(min_tot = min(pov_perc, na.rm=TRUE),
q1_tot = quantile(pov_perc, 0.25, 1, na.rm=TRUE),
mean_tot = mean(pov_perc, na.rm=TRUE),
med_tot = median(pov_perc, na.rm=TRUE),
q3_tot = quantile(pov_perc, 0.75, 1, na.rm=TRUE),
max_tot = max(pov_perc, na.rm=TRUE))
# Graph data
pov %>%
select(matches("FIPS|perc|region")) %>%
filter(region_type=="county") %>% # Remove states
pivot_longer(cols=contains("perc"), names_to="perc_type", values_to="percent") %>%
mutate(perc_type = factor(perc_type, levels=c("perc_pov_all", "perc_pov_0_17", "perc_pov_5_17", "perc_pov_0_4"))) %>%
ggplot(aes(x=perc_type, y=percent/100, color=perc_type, alpha=0.8)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.2) +
theme_minimal() +
theme(legend.position="none") +
labs(title="Graph TODO: Poverty Breakdown Across Age Groups for US Counties",
subtitle="Based on proportion of people who fall within each category per county",
x="Proportion of People Living in Poverty in a County (%)",
y="Number of Counties") +
scale_x_discrete(labels=c("All residents", "Residents aged 0-17", "Residents aged 5-17", "Residents aged 0-4")) +
scale_y_continuous(labels=scales::percent)
From graph TODO and tables above above, the poverty rate for all residents across counties is about 13% - much higher than I was expecting. However, it’s shocking to see that there are some counties were over 10% of the county’s population is living in poverty. Equally shocking is how the proportion of minors (aged 0-17 or 5-17) living in poverty is actually higher than that of all residents. It could be because those living in poverty tend to have large families (as discussed on WorldVision), or that the count of adults living in poverty is underrepresented, as adults living in poverty are more likely to be houseless and thus not responsible to fill out the census survey, while children living in poverty (hopefully) are living with parents and are more likely to be included in census data. In general though, it’s scary and quite eye-opening that about 18% of all minors are living in poverty.
[1] "State level summary statistics for residents living in poverty, total number"
pov_tidy %>%
filter(region_type == "state") %>% # Remove states
group_by(age_group) %>%
summarise(min_tot = min(pov_total, na.rm=TRUE),
q1_tot = quantile(pov_total, 0.25, 1, na.rm=TRUE),
mean_tot = mean(pov_total, na.rm=TRUE),
med_tot = median(pov_total, na.rm=TRUE),
q3_tot = quantile(pov_total, 0.75, 1, na.rm=TRUE),
max_tot = max(pov_total, na.rm=TRUE))
[1] "State level summary statistics for residents living in poverty by percent"
pov_tidy %>%
filter(region_type == "state") %>% # Remove states
group_by(age_group) %>%
summarise(min_tot = min(pov_perc, na.rm=TRUE),
q1_tot = quantile(pov_perc, 0.25, 1, na.rm=TRUE),
mean_tot = mean(pov_perc, na.rm=TRUE),
med_tot = median(pov_perc, na.rm=TRUE),
q3_tot = quantile(pov_perc, 0.75, 1, na.rm=TRUE),
max_tot = max(pov_perc, na.rm=TRUE))
# Graph data
pov %>%
select(matches("FIPS|perc|region")) %>%
filter(region_type=="state") %>% # Remove states
pivot_longer(cols=contains("perc"), names_to="perc_type", values_to="percent") %>%
mutate(perc_type = factor(perc_type, levels=c("perc_pov_all", "perc_pov_0_17", "perc_pov_5_17", "perc_pov_0_4"))) %>%
ggplot(aes(x=perc_type, y=percent/100, color=perc_type, alpha=0.8)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.2) +
theme_minimal() +
theme(legend.position="none") +
labs(title="Graph TODO: Poverty Breakdown Across Age Groups for US States",
subtitle="Based on proportion of people who fall within each category per state",
x="Proportion of People Living in Poverty in a County (%)",
y="Number of Counties") +
scale_x_discrete(labels=c("All residents", "Residents aged 0-17", "Residents aged 5-17", "Residents aged 0-4")) +
scale_y_continuous(labels=scales::percent)
At the state level, the percentage of people living in poverty has dropped somewhat, most likely again not because of the change in numbers, but because of the reduction of counties and inflated numbers. However, we again see that the proportion of all people living in poverty is slightly below that of minors (11.6% vs. ~14%), and that there are some states where over 20% of minors (aged either 0-17 or 5-17) are living at or below poverty.
With data in hand, let us look into how population size, poverty rates, and education levels all relate to one another at both the county level. (For the final project, I will also look into relationships at the state level). Our research questions are as follows:
How does the population size of a county relate to educational attainment for residents?
How does the population size of a county connect with the number and proportion of people living in poverty?
How does level of education for residents correlate with poverty in these regions?
Continuum Codes
will provide deeper insight as to how these features relate to how urban / rural that county might be, incorporating other features such as population size and city-like features into this analysis.First, we’ll look at the totals for each county. Because we’re comparing two different populations, I expect to see a positive trend between population and number of people in each education group.
# Because each dataset uses different types of "tidy" data, will manually join tables for each RQ
pop_edu_tidy = pop %>%
inner_join(edu_tidy %>%
select(c(FIPS, edu_level, edu_total, edu_perc)), by = "FIPS")
# Let's look into edu_total vs. pop_2020, faceted by edu_level
pop_edu_tidy %>%
filter(region_type=="county") %>%
ggplot(aes(x = pop_2020, y = edu_total, color = edu_level)) +
geom_point(alpha = 0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_wrap(~edu_level, labeller =
labeller(edu_level = c("no_HS" = "No HS Diploma",
"HS_only" = "Received only a HS Diploma",
"some_college" = "Taken Some College or Associates Classes",
"bach_plus" = "Received Bachelors Degree or More")
)) +
labs(title="Graph TODO: Comparing Population with Level of Education of US County Residents",
caption="From 2020 US census data",
x = "Number of Residents in a County",
y = "Number of Residents who Fall into that Education Group") +
theme(legend.position = "none")
As expected, when plotting population vs. education level, we see that the number of people who fall into each education category increases as population increases. It’s interesting to note that the rate of increase is mostly linear for all groups, though the rate of increase varies, most evident when looking at the number of people who have received a bachelors degree or more compared to all the other groups. Another interesting point to note is that the errors / noise differs across all groups. Because this graph doesn’t tell us much about how the percentage of people who fall into each group changes, we’ll next look into the percentage of people who fall into each category.
# Let's look into edu_total vs. pop_2020, faceted by edu_level
pop_edu_tidy %>%
filter(region_type=="county" & pop_2020 <= 2500000) %>%
ggplot(aes(x = pop_2020, y = edu_perc / 100, color = edu_level)) +
geom_point(alpha = 0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_wrap(~edu_level, labeller =
labeller(edu_level = c("no_HS" = "No HS Diploma",
"HS_only" = "Received only a HS Diploma",
"some_college" = "Taken Some College or Associates Classes",
"bach_plus" = "Received Bachelors Degree or More")
)) +
labs(title="Graph TODO: Comparing Population with Level of Education of US County Residents",
subtitle="Excluding populations > 2.5 million",
caption="From 2020 US census data",
x = "Number of Residents in a County",
y = "Percentage of Residents that Fall into that Education Group") +
theme(legend.position = "none") +
scale_y_continuous(labels=scales::percent)
Eerily enough, for the most part, when looking at changes in the percentage of people who fall into each category, the results change dramatically. for counties with a low population, there tends to be very large fluctuations in how educated its residents are, as represented by more dispersed points at lower populations. However, as the population rises, educational levels slightly increase, as evident by the increase in the percentage of people who get a bachelors degree or more and the slight decline in number of people who only receive a high school diploma. However, we do see that overall, as the population increases, the percentage of people who not received their high school diploma does increase ever so slightly.
Lastly, let’s investigate how metropolitan vs. non-metropolitan counties compare. We’ll only look at the percentage breakdowns for education, as the raw number will most likely be positive (as seen in graph TODO)
pop_edu_tidy %>%
filter(region_type == "county" & pop_2020 <= 2500000) %>%
mutate(metro = ifelse(Cont_Code %in% c(1, 2, 3), TRUE, FALSE)) %>%
ggplot(aes(x=pop_2020, y=edu_perc/100, color=edu_level)) +
geom_point(alpha=0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_grid(vars(edu_level), vars(metro), scales = "free_x",
labeller = labeller(edu_level = c("no_HS" = "No HS Diploma",
"HS_only" = "Only HS Diploma",
"some_college" = "Some College",
"bach_plus" = "College Diploma +"),
metro = c("FALSE" = "Non-Metropolitan Counties", "TRUE" = "Metropolitan Counties"))) +
labs(title="Graph TODO: Comparing Population with Level of Education of US County Residents",
subtitle="For county populations under 2.5 million, splitting by if a county is a metropolitan",
caption="From 2020 US census data",
x = "Number of Residents in a County",
y = "Percentage of Residents that Fall into that Education Group") +
theme(legend.position = "none") +
scale_y_continuous(labels=scales::percent)
When breaking down education further by whether the county is a metropolitan or not, new trends emerge. Focusing on the metropolitan counties, for all education groups except for having a college diploma, the trend lines are slightly U-shaped (concave up), meaning that for low populations, the percentage of people who are not college educated is slightly higher, and for larger populations, the percentage actually increases. This is surprising to me because I would’ve thought that larger cities are filled with more college-educated students, but this isn’t always the case, as shown here. For non-metropolitan counties, the trend lines are mostly flat, meaning that regardless of population size for counties, on average, the percentage of people who fall into each category are about the same. The percentage of college-educated residents does seem increase ever so slightly as the population for a county increases in non-metropolitan counties.
We’ll first look into the totals for both variables, and then look into the percentages of people living in poverty. As mentioned previously, we expect to see a positive relationship between the county population and number of people living in poverty.
# First, join data together
pop_pov_tidy = pop %>%
inner_join(pov_tidy %>%
select(c(FIPS, age_group, pov_total, pov_perc)),
by = "FIPS")
# Next, graph pov_total vs. pop_2020, faceted by age_group
pop_pov_tidy %>%
filter(region_type=="county" & pop_2020 <= 2500000) %>%
ggplot(aes(x=pop_2020, y=pov_total, color=age_group)) +
geom_point(alpha=0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_wrap(~age_group,
labeller = labeller(age_group = c("pov_all" = "All Ages",
"pov_0_17" = "Minors aged 0-17",
"pov_5_17" = "Minors aged 5-17"))) +
labs(title="Graph TODO: Comparing Population with Poverty of US County Residents",
subtitle = "For counties with less than 2.5 million residents, split by age group",
caption="From 2020 US census data",
x = "Number of Residents in a County",
y = "Number of Residents who Fall into that Education Group") +
theme(legend.position = "none",
axis.text.x = element_text(angle=45, hjust=1))
Unsurprisingly, we see that as the population rises in a county, the number of people living in poverty for all age groups increases. Let’s look at the percentage of people who fall into each category instead.
# Graph pov_perc vs. pop_2020, faceted by age_group
pop_pov_tidy %>%
filter(region_type=="county" & pop_2020 <= 2500000) %>%
ggplot(aes(x=pop_2020, y=pov_perc/100, color=age_group)) +
geom_point(alpha=0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_wrap(~age_group,
labeller = labeller(age_group = c("pov_all" = "All Ages",
"pov_0_17" = "Minors aged 0-17",
"pov_5_17" = "Minors aged 5-17"))) +
labs(title="Graph TODO: Comparing Population with Poverty of US County Residents",
subtitle = "For counties with less than 2.5 million residents, split by Age Group",
caption="From 2020 US census data",
x = "Number of Residents in a County",
y = "Percentage of Residents who Fall into that Education Group") +
theme(legend.position = "none",
axis.text.x = element_text(angle=45, hjust=1)) +
scale_y_continuous(labels=scales::percent)
Again, we see that trends tend to follow a general U-shape, where for counties with very low populations, the percentage of people who are living in poverty is slightly higher, while counties with middling populations (about 250K to 1 million residents tend to see a decline in poverty rates. After the 1 million mark, however, the percentage of people living in poverty tends to increase. It’s interesting that for larger counties, as the population increases, poverty also increases, showing that even big cities with lots of people and wealth can still have a significant proportion of people struggling to make ends meet.
Lastly, let’s see how metropolitan vs. non-metropolitan counties vary in terms of proportion of people living in poverty.
pop_pov_tidy %>%
filter(region_type=="county" & pop_2020 <= 2500000) %>%
mutate(metro = ifelse(Cont_Code %in% c(1, 2, 3), TRUE, FALSE)) %>%
ggplot(aes(x=pop_2020, y=pov_perc/100, color=age_group)) +
geom_point(alpha=0.75) +
geom_line(alpha=0.5, color="black", linewidth=1.25, method="loess", stat="smooth") +
facet_grid(vars(age_group), vars(metro), scale="free_x",
labeller = labeller(age_group = c("pov_all" = "All Ages",
"pov_0_17" = "Minors aged 0-17",
"pov_5_17" = "Minors aged 5-17"),
metro = c("FALSE" = "Non-Metropolitan Counties",
"TRUE" = "Metropolitan Counties"))) +
labs(title="Graph TODO: Comparing Population with Poverty of US County Residents",
subtitle = "Populations <2.5 million residents, split by age group and metropolitan status",
caption="From 2020 US census data",
x = "Number of Residents in a County",
y = "Percentage of Residents who Fall into that Education Group") +
theme(legend.position = "none",
axis.text.x = element_text(angle=45, hjust=1)) +
scale_y_continuous(labels=scales::percent)
# Check numbers
pop_pov_tidy %>%
filter(region_type == "county") %>% # Remove states
mutate(metro = ifelse(Cont_Code %in% c(1, 2, 3), TRUE, FALSE)) %>%
group_by(metro, age_group) %>%
summarise(min_tot = min(pov_perc, na.rm=TRUE),
q1_tot = quantile(pov_perc, 0.25, 1, na.rm=TRUE),
mean_tot = mean(pov_perc, na.rm=TRUE),
med_tot = median(pov_perc, na.rm=TRUE),
q3_tot = quantile(pov_perc, 0.75, 1, na.rm=TRUE),
max_tot = max(pov_perc, na.rm=TRUE))
By splitting by whether a county is a metropolitan or not, we now see that slightly opposite trends. Counties that are metros show similar trends as our previous findings, where poverty rates tend to be higher for counties with very low populations, drop slightly for counties with middling populations, and increase again slightly for counties with large populations above ~1 million people. However, metropolitan counties tend to see an increase in poverty rates for counties with very low populations. Over time (after about a population size of 20K), these trend lines tend to decline. Additionally, counties that aren’t metropolitans also seem to have a greater proportion of people living in poverty, as from the table above, for all age groups, the proportion of people living in poverty is higher by about 2-4%.
Lastly, we’ll look into how level of education correlates to poverty counties. For visualization purposes,
# Join tables together
df_full = pop_edu_tidy %>%
inner_join(pov_tidy %>%
select(c(FIPS, age_group, pov_total, pov_perc)),
by = "FIPS")
# First, let's see how the totals relate to one another
df_full %>%
filter(region_type == "county" &
pop_2020 <= 2500000 &
age_group == "pov_all") %>%
ggplot(aes(x=pov_total, y=edu_total, color=edu_level)) +
geom_point(alpha = 0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_wrap(edu_level ~ .,
scale="free_x",
labeller = labeller(edu_level = c("no_HS" = "No HS Diploma",
"HS_only" = "Only HS Diploma",
"some_college" = "Some College",
"bach_plus" = "College Diploma +"))) +
labs(title = "Graph TODO: Comparing Poverty with Highest Education Achieved",
subtitle="For county populations under 2.5 million, splitting by highest education achieved",
caption="From 2020 US census data",
x="Number of People Living in Poverty",
y="Number of People a Specific Education Level") +
theme(legend.position="none")
As expected, as the number of people living in poverty increases, the number of residents who fall into each category for education also increases. This follows the logic of the previous sections, as the greater the population the greater the number of people living in poverty, as well as the greater the number of people who fit into each category. However, some surprising relationships are still present. Firstly, we see that the relationship between number of people living in poverty and those who have achieved a high school diploma or less is positive and roughly linear. However, the relationship between the number of people living in poverty and the the number of people who have gotten some higher education isn’t linear, but rather shows diminishing changes. One observation could be that people living in poverty tend to not have college educations, so counties with high levels of poverty probably also aren’t “well-educated” or living with these folks.
Next, let’s compare percentages for these two groups to get a better sense of how they might relate to one another.
# Next, compare percentages
df_full %>%
filter(region_type == "county" &
pop_2020 <= 2500000 &
age_group == "pov_all") %>%
ggplot(aes(x=pov_perc/100, y=edu_perc/100, color=edu_level)) +
geom_point(alpha = 0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_wrap(edu_level ~ .,
scale="free_x",
labeller = labeller(edu_level = c("no_HS" = "No HS Diploma",
"HS_only" = "Only HS Diploma",
"some_college" = "Some College",
"bach_plus" = "College Diploma +"))) +
labs(title = "Graph TODO: Comparing Poverty with Highest Education Achieved",
subtitle="For county populations under 2.5 million, splitting by highest education achieved",
caption="From 2020 US census data",
x="Percentage of People Living in Poverty",
y="Percentage of People a Specific Education Level") +
theme(legend.position="none") +
scale_x_continuous(labels=scales::percent) +
scale_y_continuous(labels=scales::percent)
By comparing the poverty rates with the educational achievements of residents in US counties, it’s pretty clear that there’s a straightforward trend here. In general, as the percentage of people living in poverty increases, the percentage of people who have gotten only a high school diploma or less education increases. On the other hand, greater poverty rates correlate to a lower proportion of people with a college diploma or more. Intuitively, this makes sense, as people with college degrees tend to earn more money than those who have only a high school diploma. Because poverty is based on one’s income, it logically makes sense that communities with high poverty tend to have more people who aren’t college educated. It is interesting that the trend line for the proportion of people who have some college experience is flat, regardless of the poverty rate.
This report only shows a snapshot of the information provided. In future iterations, it would be interesting to see how demographics have shifted in the United States. Specifically, visualizing some time series graphs and visualizing changes in relationships between populous counties and shifts in education or poverty would be helpful.
Similarly, showcasing this data on a map could be beneficial, as plotting major cities to connect county and state data could prove beneficial.
For this homework submission, I only focused on county-level comparisons. For the final project, I would also be interested in seeing how these variables correlate with one another at the state level, as the EDA sections comparing county level data with state level data showed significant differences.
Like many other datasets having to deal with census information, we assume that the data provided is correct. Because census information is filled out and submitted by households, it’s very likely that some at-risk groups (such as houseless people, those moving between places, or people working many hours and unable to respond to their mail) are not included in this data. As a result, metrics such as education levels or poverty might be better than reality.
Lastly, it’s important to note the distinction between correlation and causation. Although our data might show interesting links between variables, because each county and state have their own legislation, programs, systems, etc., there are many confounding variables not included in this analysis that might explain why we see specific trends.
Some things I can work on for my graphs are:
Improving the readability for the x-axes
Name graphs for easy reference
Not related to graphing, but ensure that all definitions for key terms are included in the writeup (such as poverty, Continuum Codes, etc.).
Grolemund, G., & Wickham, H. (2017). R for Data Science. O’Reilly Media.
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Rupnarain, K. (2020a, July 13). Why do the poor have large families?. World Vision Canada. https://www.worldvision.ca/stories/why-do-the-poor-have-large-families
Rural-urban continuum codes. USDA ERS - Rural-Urban Continuum Codes. (n.d.). https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/
U.S. and World Population Clock. United States Census Bureau. (n.d.). https://www.census.gov/popclock/
US Census Bureau. (2021, December 16). State area measurements and internal point coordinates. Census.gov. https://www.census.gov/geographies/reference-files/2010/geo/state-area.html
US Census Bureau. (2023, January 30). How the Census Bureau Measures Poverty. Census.gov. https://www.census.gov/topics/income-poverty/poverty/guidance/poverty-measures.html
---
title: "Homework 3: US State and County Data on Population, Education, and Poverty"
author: "Linus Jen"
description: "Reading in Data"
date: "6/29/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
df-print: paged
categories:
- hw3
- Linus Jen
---
```{r setup}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
library(here)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
# Introduction
The United States contains over 330 million people over about 3.8 million squared mile. However, the distribution of people across this vast country is not even. As a result, each region has different demographics than other states, leading to different cultures, economic and political beliefs, ideologies, and societal components.
For this homework assignment and final project, I want to dive deeper into US state and county level data to explore how distributions in population, education level, and poverty all relate to one another. All data was collected by the ifically data that relates back to population, education level, and poverty estimates. All data was pulled from the [US Department of Agriculture](https://www.ers.usda.gov/data-products/county-level-data-sets/) via the census. Although data does exist for 2021, not all datasets have this set of information, and thus we will be exploring the collected results from 2020. Three datasets have been added to the [GitHub repo](https://github.com/DACSS/DACSS_601_Summer2023_Sec1/tree/submission/posts/_data) for the class, and the three files are `Linus_US_Education.xlsx`, `Linus_US_PopulationEstimates.xlsx`, and `Linus_US_PovertyEstimates.xlsx`.
# Data Ingestion and Exploration
As mentioned previously, there are 3 files of interest that explore population, education, and poverty. To connect all the datasets together, the Federal Information Processing Standard (a five digit number representing a unique ID for states and counties), or FIPS for short, is used as a key. To remove state totals, simply exclude observations where the FIPS value ends in "000", such as "01000", "02000", etc. We also are only interested in US counties, and remove all FIPS codes greater than or equal to 72000. Also, I will treat Washington D.C. as a county rather than a state, as it is duplicated in our datasets.
## Population
This dataset contains population information for US territories, states, and counties. Although there's data from the 1990s to 2021, as mentioned before, we will only look at 2020 census data so that it can be used with other datasets. Another interesting variable is the rural-urban continuum code that was given for each county in 2013. From the [US Department of Agriculture](https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/), these continuum codes are split into metropolitan (1-3) and non-metropolitan counties (4-9), leading to 9 distinct groups. Codes 1-3 represent counties in metro areas of 1 million or more people, 250K - 1 million people, or less than 250K, respectively. Codes 4 and 5 represent counties with an urban population of 20K or more, adjacent or not adjacent to a metro area, respectively. Codes 6 and 7 represent counties with an urban population of 2.5K - 20K, adjacent or not adjacent to a metro area, respectively. Finally, codes 8 and 9 represent counties with a completely rural or less than 2.5K urban population, adjacent or not adjacent to a metro area. The continuum codes provide a straightforward and guided way to group counties together for future analyses.
```{r}
pop <- read_excel(here("posts", "_data", "Linus_US_data", "Linus_US_PopulationEstimates.xlsx"), skip=4) %>%
select(matches("FIPS|State|Area|Code|2020")) %>%
rename(FIPS=1, State=2, Area_Name=3, Cont_Code=4, pop_2020=5) %>%
filter(FIPS < 72000 & FIPS != 11000) %>%
mutate(region_type = case_when(FIPS == "00000" ~ "national",
str_detect(FIPS, "0{3}$") ~ "state",
TRUE ~ "county"))
# View data
pop
# dfSummary
print(summarytools::dfSummary(pop,
varnumbers=FALSE,
plain.ascii=FALSE,
style="grid",
graph.magnif = 0.70,
valid.col=FALSE),
method="render",
table.classes="table-condensed")
```
The tables above provide a quick glimpse into the dataset. Fortunately, the data is already tidy, as each row represents data at a region level (national, state, or county/city, based on the zip code). It's interesting to note that although the data is supposedly at the county level, counties have different names in different states. For example, Alaska has many regions that are called boroughs or areas, and Louisiana's counties are called parishes! Strangely enough, Virginia also has a few zip codes dedicated to a specific city.
As for our data, thankfully it's clean, but there is some missing data in the population (especially concerning given that the government "should" know where people live, for better or for worse). Let's investigate this further.
```{r}
# Missing data
pop %>% filter(is.na(pop_2020))
pop %>% filter(is.na(Cont_Code))
```
The first table above contains all regions where the population was not provided. Four of the six regions are in Alaska, where it's quite difficult to the population and is either desolate or controlled by native Alaskans. The last two regions, however, are cities in Virginia, which don't necessarily fit into the classifications of "county". I'm not sure why this is the case, but a quick Google search showed that these populations are very low, so I'll assume it's because the number was too low to track (or some government inefficiency lost the postal information somewhere).
The second table shows where there isn't a continuum code. Regions that are not counties (IE states and the US) do not have a continuum code. The remaining counties without a continuum code all are mostly in Alaska or are really small, which explains why they might not have a continuum code.
I plan to keep data with missing information, even though I will use a full join to connect all the datasets together.
Next, let's look at county and state level data and summaries.
### County Population
```{r}
# Population summary statistics on the county level
pop %>%
filter(region_type == "county") %>% # Select only
summarise(min_pop = min(pop_2020, na.rm=TRUE),
q1_pop = quantile(pop_2020, 0.25, 1, na.rm=TRUE),
mean_pop = mean(pop_2020, na.rm=TRUE),
med_pop = median(pop_2020, na.rm=TRUE),
q3_pop = quantile(pop_2020, 0.75, 1, na.rm=TRUE),
max_pop = max(pop_2020, na.rm=TRUE))
# Graph data at the state
pop %>%
filter(region_type == "county" & pop_2020 < 250000) %>%
ggplot(aes(pop_2020)) +
geom_histogram(fill="blue", alpha=0.5) +
theme_minimal() +
labs(title="Graph TODO: Distribution of US County Populations",
x="Population of a County",
y="Number of Counties",
subtitle="For county populations under 250,000")
```
At the county level, there's huge variance in terms of the population of a county, from a meager 64 people to a whopping 10 million people. A majority of counties have populations between 10K and 68K, and the number of counties with populations of 75K drops significantly.
### State Population
```{r}
# Population summary statistics on the state level
pop %>%
filter(region_type == "state") %>% # Select only
summarise(min_pop = min(pop_2020, na.rm=TRUE),
q1_pop = quantile(pop_2020, 0.25, 1, na.rm=TRUE),
mean_pop = mean(pop_2020, na.rm=TRUE),
med_pop = median(pop_2020, na.rm=TRUE),
q3_pop = quantile(pop_2020, 0.75, 1, na.rm=TRUE),
max_pop = max(pop_2020, na.rm=TRUE))
# Graph population at the state level
pop %>%
filter(region_type == "state") %>%
ggplot(aes(pop_2020)) +
geom_histogram(fill="blue", alpha=0.5) +
theme_minimal() +
labs(title="Graph TODO: Distribution of US State Populations",
x="Population of a State",
y="Number of States")
```
Like the county populations, state populations show high variance, with populations between 577K and 39.5 million residents. Most state populations are between 1.8 million and 7.5 million people, with the number of states with populations above 8 million people dropping significantly, as seen in the histogram above.
### Continuum Codes
Lastly for this section, it might be helpful to see the distribution of the continuum codes for counties, as this is a great way to compare all the consolidated data.
```{r}
pop %>%
group_by(Cont_Code) %>%
filter(!is.na(Cont_Code)) %>%
summarise(count = n()) %>%
mutate(freq = round(count / sum(count), 3)) %>%
arrange(Cont_Code)
```
The table above shows the number and proportion of counties who fall within each continuum code. Surprisingly, the data isn't incredibly skewed towards any particular code. The counts for metropolitan counties (codes 1-3) are pretty close to one another. For non-metropolitan counties, it's surprising to see that codes 6 and 7 (counties with populations between 2.5K and 20K people in counties adjacent or nonadjacent to a metro area, respectively) have a greater number of counties that fall into that grouping. Code 5 (having an urban population of 20K or greater, not adjacent to a metro area) is by far the smallest group, and this makes sense, as counties with a large population not next to a metropolitan area isn't very common in the US, and this supports that belief.
For future analyses, we could group counties into metropolitan (codes 1-3) vs. non-metropolitan codes (4-9), compare codes within those two groups against one another, or compare non-metropolitan counties adjacent to (codes 4, 6, and 8) and non-adjacent to metro areas (codes 5, 7, and 9).
## Education
The education dataset is very similar to that of the population dataset, in which each observation is a zip code representing a county, state or the country. There is data from 1970 to 2021, with the 2017-2021 data all consolidated into one value. Different levels of education are tracked (didn't receive their high school diploma, received only their high school diploma, some bachelors or associates degree level education, or received a higher degree or above) as well as both the number and percentage of residents from a particular region that falls into each category.
```{r}
# Pull in data
edu <- read_excel(here("posts", "_data", "Linus_US_data", "Linus_US_Education.xlsx"), skip=3) %>%
select(matches("21|FIPS|State|name")) %>%
rename(FIPS=1, State=2, Area_Name=3, no_HS=4, HS_only=5, some_college=6, bach_plus=7, perc_no_HS=8, perc_HS_only=9, perc_some_college=10, perc_bach_plus=11) %>%
filter(FIPS < 72000 & FIPS != 11000) %>%
mutate(region_type = case_when(FIPS == "00000" ~ "national",
str_detect(FIPS, "0{3}$") ~ "state",
TRUE ~ "county"))
# dfSummary
print(summarytools::dfSummary(edu,
varnumbers=FALSE,
plain.ascii=FALSE,
style="grid",
graph.magnif = 0.70,
valid.col=FALSE),
method="render",
table.classes="table-condensed")
# View structure of dataset
edu
```
The tables above provide a quick glance into how the data is formatted. Some formatting will need to be done to make the data tidy. There are 11 instances where there is missing data - let's look into this now.
```{r}
edu %>%
filter(is.na(no_HS))
```
We see that for 11 regions, no education data has been recorded. Many of these are Alaskan counties, which, as mentioned before, have very small populations and may be home to native Alaskans. Bedford city and Clifton Forge city are again listed with missing data, supporting the hypothesis that both cities didn't report some crucial information, or because these cities are connected to a county, that information has already been shared.
The following code is run to make our data "tidy".
```{r}
# Make tidy
# Made tidy
edu_totals <- edu %>%
select(c("FIPS", contains("perc"))) %>%
pivot_longer(cols=contains("perc"),
names_to="edu_level",
values_to="edu_perc") %>%
mutate(edu_level=str_remove(edu_level, "perc_"))
edu_percs <- edu %>%
select(c("FIPS", !contains("perc"))) %>%
pivot_longer(cols=no_HS:bach_plus,
names_to="edu_level",
values_to="edu_total")
edu_tidy = edu_percs %>% inner_join(edu_totals,
by=c("FIPS", "edu_level")) %>%
mutate(edu_level=factor(edu_level, levels=c("no_HS", "HS_only", "some_college", "bach_plus")))
# View tidy data
edu_tidy
# Keep env clean
rm(edu_percs)
rm(edu_totals)
```
### Education by County
```{r}
# Get summary statistics for each education category on county totals
print("County level education summary statistics for residents, by county totals")
edu_tidy %>%
filter(region_type == "county") %>% # Remove states
group_by(edu_level) %>%
summarise(min_tot = min(edu_total, na.rm=TRUE),
q1_tot = quantile(edu_total, 0.25, 1, na.rm=TRUE),
mean_tot = mean(edu_total, na.rm=TRUE),
med_tot = median(edu_total, na.rm=TRUE),
q3_tot = quantile(edu_total, 0.75, 1, na.rm=TRUE),
max_tot = max(edu_total, na.rm=TRUE))
# Get summary statistics for each education category on county percentages
print("County level education summary statistics for residents, by county percentages")
edu_tidy %>%
filter(region_type == "county") %>% # Select only
group_by(edu_level) %>%
summarise(min_perc = min(edu_perc, na.rm=TRUE),
q1_perc = quantile(edu_perc, 0.25, 1, na.rm=TRUE),
mean_perc = mean(edu_perc, na.rm=TRUE),
med_perc = median(edu_perc, na.rm=TRUE),
q3_perc = quantile(edu_perc, 0.75, 1, na.rm=TRUE),
max_perc = max(edu_perc, na.rm=TRUE))
# Graph proportions at county level
edu %>%
select(matches("FIPS|perc|region")) %>%
filter(region_type == "county") %>% # Remove states
pivot_longer(cols=contains("perc"), names_to="perc_type", values_to="percent") %>%
mutate(perc_type = factor(perc_type, levels=c("perc_no_HS", "perc_HS_only", "perc_some_college", "perc_bach_plus"))) %>%
ggplot(aes(x=perc_type, y=percent/100, color=perc_type, alpha = 0.8)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.2) +
theme_minimal() +
theme(legend.position="none",
axis.text.x = element_text(angle=0)) +
labs(title="Graph TODO: Distribution of Education Received for US Counties",
x="Education Level",
y="Percentage of People within a County", fill="Education Level") +
scale_x_discrete(labels=c("No HS diploma", "HS Diploma", "Some higher edu", "Received higher degree")) +
scale_y_continuous(labels=scales::percent)
```
Graph TODO above shows the distributions of people who fall into each education category based on the county they live in. Because the populations for each county vary so drastically (as seen in the previous dataset), the percentages for people who fall into each category for each county. Education differs greatly across counties, as each category has quite distinct peaks and and variability. People who have only received a high school diploma or some higher education make up the majority of the all counties' population. Getting a college degree or higher is still somewhat rare, as we see that the peak is around 20% of a county's population. Lastly, there's clearly some room for improvement in terms of improving education for all, as ~10% of people in counties don't have their high school diploma (based on the median), with about 25% of all counties having more than 15% of their population not even getting their high school diploma.
The tables above graph TODO add more insight with regards to educational levels. There are instances where some counties don't have any residents who reported getting any form of higher education. The fact that some counties had 50% or more their residents not even receiving a high school diploma is startling. Although the group of people who have received only a high school diploma
### Summary Statistics for State Level Education Data
```{r}
# Get summary statistics for each education category on state totals
print("State level education summary statistics for residents, by state totals")
edu_tidy %>%
filter(region_type == "state") %>% # Select states
group_by(edu_level) %>%
summarise(min_tot = min(edu_total, na.rm=TRUE),
q1_tot = quantile(edu_total, 0.25, 1, na.rm=TRUE),
mean_tot = mean(edu_total, na.rm=TRUE),
med_tot = median(edu_total, na.rm=TRUE),
q3_tot = quantile(edu_total, 0.75, 1, na.rm=TRUE),
max_tot = max(edu_total, na.rm=TRUE))
# Get summary statistics for each education category on state percentages
print("State level education summary statistics for residents, by percentages")
edu_tidy %>%
filter(region_type == "state") %>% # Select states
group_by(edu_level) %>%
summarise(min_perc = min(edu_perc, na.rm=TRUE),
q1_perc = quantile(edu_perc, 0.25, 1, na.rm=TRUE),
mean_perc = mean(edu_perc, na.rm=TRUE),
med_perc = median(edu_perc, na.rm=TRUE),
q3_perc = quantile(edu_perc, 0.75, 1, na.rm=TRUE),
max_perc = max(edu_perc, na.rm=TRUE))
# Graph proportions at state level
edu %>%
select(matches("FIPS|perc|region")) %>%
filter(region_type == "state") %>% # Remove states
pivot_longer(cols=contains("perc"), names_to="perc_type", values_to="percent") %>%
mutate(perc_type = factor(perc_type, levels=c("perc_no_HS", "perc_HS_only", "perc_some_college", "perc_bach_plus"))) %>%
ggplot(aes(x=perc_type, y=percent/100, color=perc_type, alpha = 0.8)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.2) +
theme_minimal() +
theme(legend.position="none",
axis.text.x = element_text(angle=0)) +
labs(title="Graph TODO: Distribution of Education Received for US States",
x="Education Level",
y="Percentage of People within a State", fill="Education Level") +
scale_x_discrete(labels=c("No HS diploma", "HS Diploma", "Some higher edu", "Received higher degree")) +
scale_y_continuous(labels=scales::percent)
```
The state level data paints a slightly different picture. People tend to be pretty educated, as those who have at least some college education (if not more) have a higher summary statistics across the board, compared to those with only a high school diploma or lower. This is also reflected in the table showing the summary statistics for state percentages, as the categories `some_college` and `bach_plus` are greater than the two categories showing only some high school education. This suggests that at the county level, those with some higher education tend to live in very specific counties, rather than equally dispersing across the state. This makes sense, as most college educated folks tend to move to urban, metropolitan areas, while rural counties make up the majority of the US. Similarly, because there are many counties with a small number of people (who may be less educated), these counties would inflate the proportion of people with only a high school diploma or less.
## Poverty
This poverty dataset contains poverty-related statistics for 2020. Poverty, as explained on the [census.gov](https://www.census.gov/topics/income-poverty/poverty/guidance/poverty-measures.html) website, is determined by first summing up a family's total income over a given year, and then comparing that with a "poverty threshold" that is calculated by the government. This poverty threshold incorporates the size of a family and age of each member and Consumer Price Index for All Urban Consumers (CPI-U), but does not vary by geographical location.
Although this dataset contains a lot of variables (such as 90% confidence intervals for different poverty groups), we will only be looking at the total number and proportion of people living in poverty for a specific region (county, state, or nation), as well as the number and proportion of people within a certain age group (aged 0-17, 0-4, and 5-17) who are estimated to be living in poverty. This data is especially valuable and investigating poverty for minors, as theoretically investing more in minors could help reduce poverty later on.
```{r}
pov <- read_excel(here("posts", "_data", "Linus_US_data", "Linus_US_PovertyEstimates.xlsx"), sheet="Poverty Data 2020", skip=4) %>%
select(c(matches("FIPS|POV|MED"), "Stabr", "Area_name")) %>%
rename(FIPS=1, pov_all=2, perc_pov_all=3, pov_0_17=4, perc_pov_0_17=5, pov_5_17=6, perc_pov_5_17=7, med_inc=8, pov_0_4=9, perc_pov_0_4=10, state_acr=11, area_name=12) %>%
filter(FIPS < 72000 & FIPS != 11000) %>%
mutate(region_type = case_when(FIPS == "00000" ~ "national",
str_detect(FIPS, "0{3}$") ~ "state",
TRUE ~ "county"))
# dfSummary
print(summarytools::dfSummary(pov,
varnumbers=FALSE,
plain.ascii=FALSE,
style="grid",
graph.magnif = 0.70,
valid.col=FALSE),
method="render",
table.classes="table-condensed")
# Remove `pov_0_4` and `perc_pov_0_4`
pov = pov %>%
select(-c(pov_0_4, perc_pov_0_4))
# Show data
pov
```
The data above is shown at the zip code level, and includes the number and percent of people living in poverty for all residents, or those of a specific age bracket. One concerning factor is that there is a significant number of missing data for the number and proportion of people aged 0-4 living in poverty. This is probably due to the fact that this age group is the hardest to keep track of (as they aren't of age to attend school yet). We will remove these two columns due to the high levels of missingness.
Let's check the missing data.
```{r}
pov %>% filter(is.na(med_inc))
```
We see for Kalawao County in Hawaii, surprisingly, all values pertaining to poverty are missing. Because this is only one county amongst about 3,850 counties, this isn't a big deal. I'll leave it in so that during the join, we know it was missing information.
Lastly, our data needs to be made "tidy", as percents and numbers should each be their own column.
```{r}
# Tidy data
pov_percs <- pov %>%
select(!contains("perc")) %>%
pivot_longer(cols=pov_all:pov_5_17,
names_to="age_group",
values_to="pov_total")
pov_totals <- pov %>%
select(c("FIPS", contains("perc"))) %>%
pivot_longer(cols=perc_pov_all:perc_pov_5_17,
names_to="age_group",
values_to="pov_perc") %>%
mutate(age_group=str_remove(age_group, "perc_"))
pov_tidy <- pov_percs %>% inner_join(pov_totals, by=c("FIPS", "age_group")) %>%
mutate(age_group=factor(age_group, levels=c("pov_all", "pov_0_17", "pov_5_17", "pov_0_4", "med_")))
# Show tidy data
pov_tidy
# Clean env
rm(pov_percs)
rm(pov_totals)
```
### Poverty at the County Level
```{r}
# County totals
print("County level summary statistics for residents living in poverty, total number")
pov_tidy %>%
filter(region_type == "county") %>% # Remove states
group_by(age_group) %>%
summarise(min_tot = min(pov_total, na.rm=TRUE),
q1_tot = quantile(pov_total, 0.25, 1, na.rm=TRUE),
mean_tot = mean(pov_total, na.rm=TRUE),
med_tot = median(pov_total, na.rm=TRUE),
q3_tot = quantile(pov_total, 0.75, 1, na.rm=TRUE),
max_tot = max(pov_total, na.rm=TRUE))
print("County level summary statistics for residents living in poverty by percent")
pov_tidy %>%
filter(region_type == "county") %>% # Remove states
group_by(age_group) %>%
summarise(min_tot = min(pov_perc, na.rm=TRUE),
q1_tot = quantile(pov_perc, 0.25, 1, na.rm=TRUE),
mean_tot = mean(pov_perc, na.rm=TRUE),
med_tot = median(pov_perc, na.rm=TRUE),
q3_tot = quantile(pov_perc, 0.75, 1, na.rm=TRUE),
max_tot = max(pov_perc, na.rm=TRUE))
# Graph data
pov %>%
select(matches("FIPS|perc|region")) %>%
filter(region_type=="county") %>% # Remove states
pivot_longer(cols=contains("perc"), names_to="perc_type", values_to="percent") %>%
mutate(perc_type = factor(perc_type, levels=c("perc_pov_all", "perc_pov_0_17", "perc_pov_5_17", "perc_pov_0_4"))) %>%
ggplot(aes(x=perc_type, y=percent/100, color=perc_type, alpha=0.8)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.2) +
theme_minimal() +
theme(legend.position="none") +
labs(title="Graph TODO: Poverty Breakdown Across Age Groups for US Counties",
subtitle="Based on proportion of people who fall within each category per county",
x="Proportion of People Living in Poverty in a County (%)",
y="Number of Counties") +
scale_x_discrete(labels=c("All residents", "Residents aged 0-17", "Residents aged 5-17", "Residents aged 0-4")) +
scale_y_continuous(labels=scales::percent)
```
From graph TODO and tables above above, the poverty rate for all residents across counties is about 13% - much higher than I was expecting. However, it's shocking to see that there are some counties were over 10% of the county's population is living in poverty. Equally shocking is how the proportion of minors (aged 0-17 or 5-17) living in poverty is actually higher than that of all residents. It could be because those living in poverty tend to have large families (as discussed on [WorldVision](https://www.worldvision.ca/stories/why-do-the-poor-have-large-families)), or that the count of adults living in poverty is underrepresented, as adults living in poverty are more likely to be houseless and thus not responsible to fill out the census survey, while children living in poverty (hopefully) are living with parents and are more likely to be included in census data. In general though, it's scary and quite eye-opening that about 18% of all minors are living in poverty.
### Poverty at the State Level
```{r}
# State totals
print("State level summary statistics for residents living in poverty, total number")
pov_tidy %>%
filter(region_type == "state") %>% # Remove states
group_by(age_group) %>%
summarise(min_tot = min(pov_total, na.rm=TRUE),
q1_tot = quantile(pov_total, 0.25, 1, na.rm=TRUE),
mean_tot = mean(pov_total, na.rm=TRUE),
med_tot = median(pov_total, na.rm=TRUE),
q3_tot = quantile(pov_total, 0.75, 1, na.rm=TRUE),
max_tot = max(pov_total, na.rm=TRUE))
print("State level summary statistics for residents living in poverty by percent")
pov_tidy %>%
filter(region_type == "state") %>% # Remove states
group_by(age_group) %>%
summarise(min_tot = min(pov_perc, na.rm=TRUE),
q1_tot = quantile(pov_perc, 0.25, 1, na.rm=TRUE),
mean_tot = mean(pov_perc, na.rm=TRUE),
med_tot = median(pov_perc, na.rm=TRUE),
q3_tot = quantile(pov_perc, 0.75, 1, na.rm=TRUE),
max_tot = max(pov_perc, na.rm=TRUE))
# Graph data
pov %>%
select(matches("FIPS|perc|region")) %>%
filter(region_type=="state") %>% # Remove states
pivot_longer(cols=contains("perc"), names_to="perc_type", values_to="percent") %>%
mutate(perc_type = factor(perc_type, levels=c("perc_pov_all", "perc_pov_0_17", "perc_pov_5_17", "perc_pov_0_4"))) %>%
ggplot(aes(x=perc_type, y=percent/100, color=perc_type, alpha=0.8)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.2) +
theme_minimal() +
theme(legend.position="none") +
labs(title="Graph TODO: Poverty Breakdown Across Age Groups for US States",
subtitle="Based on proportion of people who fall within each category per state",
x="Proportion of People Living in Poverty in a County (%)",
y="Number of Counties") +
scale_x_discrete(labels=c("All residents", "Residents aged 0-17", "Residents aged 5-17", "Residents aged 0-4")) +
scale_y_continuous(labels=scales::percent)
```
At the state level, the percentage of people living in poverty has dropped somewhat, most likely again not because of the change in numbers, but because of the reduction of counties and inflated numbers. However, we again see that the proportion of all people living in poverty is slightly below that of minors (11.6% vs. ~14%), and that there are some states where over 20% of minors (aged either 0-17 or 5-17) are living at or below poverty.
# Research Questions
With data in hand, let us look into how population size, poverty rates, and education levels all relate to one another at both the county level. (For the final project, I will also look into relationships at the state level). Our research questions are as follows:
1. How does the population size of a county relate to educational attainment for residents?
2. How does the population size of a county connect with the number and proportion of people living in poverty?
3. How does level of education for residents correlate with poverty in these regions?
* In addition to the questions above, faceting and aggregating by the `Continuum Codes` will provide deeper insight as to how these features relate to how urban / rural that county might be, incorporating other features such as population size and city-like features into this analysis.
# Analysis
## RQ1: How does population size of counties relate to the highest level of education attained for residents?
First, we'll look at the totals for each county. Because we're comparing two different populations, I expect to see a positive trend between population and number of people in each education group.
```{r}
# Because each dataset uses different types of "tidy" data, will manually join tables for each RQ
pop_edu_tidy = pop %>%
inner_join(edu_tidy %>%
select(c(FIPS, edu_level, edu_total, edu_perc)), by = "FIPS")
# Let's look into edu_total vs. pop_2020, faceted by edu_level
pop_edu_tidy %>%
filter(region_type=="county") %>%
ggplot(aes(x = pop_2020, y = edu_total, color = edu_level)) +
geom_point(alpha = 0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_wrap(~edu_level, labeller =
labeller(edu_level = c("no_HS" = "No HS Diploma",
"HS_only" = "Received only a HS Diploma",
"some_college" = "Taken Some College or Associates Classes",
"bach_plus" = "Received Bachelors Degree or More")
)) +
labs(title="Graph TODO: Comparing Population with Level of Education of US County Residents",
caption="From 2020 US census data",
x = "Number of Residents in a County",
y = "Number of Residents who Fall into that Education Group") +
theme(legend.position = "none")
```
As expected, when plotting population vs. education level, we see that the number of people who fall into each education category increases as population increases. It's interesting to note that the rate of increase is mostly linear for all groups, though the rate of increase varies, most evident when looking at the number of people who have received a bachelors degree or more compared to all the other groups. Another interesting point to note is that the errors / noise differs across all groups. Because this graph doesn't tell us much about how the percentage of people who fall into each group changes, we'll next look into the percentage of people who fall into each category.
```{r}
# Let's look into edu_total vs. pop_2020, faceted by edu_level
pop_edu_tidy %>%
filter(region_type=="county" & pop_2020 <= 2500000) %>%
ggplot(aes(x = pop_2020, y = edu_perc / 100, color = edu_level)) +
geom_point(alpha = 0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_wrap(~edu_level, labeller =
labeller(edu_level = c("no_HS" = "No HS Diploma",
"HS_only" = "Received only a HS Diploma",
"some_college" = "Taken Some College or Associates Classes",
"bach_plus" = "Received Bachelors Degree or More")
)) +
labs(title="Graph TODO: Comparing Population with Level of Education of US County Residents",
subtitle="Excluding populations > 2.5 million",
caption="From 2020 US census data",
x = "Number of Residents in a County",
y = "Percentage of Residents that Fall into that Education Group") +
theme(legend.position = "none") +
scale_y_continuous(labels=scales::percent)
```
Eerily enough, for the most part, when looking at changes in the percentage of people who fall into each category, the results change dramatically. for counties with a low population, there tends to be very large fluctuations in how educated its residents are, as represented by more dispersed points at lower populations. However, as the population rises, educational levels slightly increase, as evident by the increase in the percentage of people who get a bachelors degree or more and the slight decline in number of people who only receive a high school diploma. However, we do see that overall, as the population increases, the percentage of people who not received their high school diploma does increase ever so slightly.
Lastly, let's investigate how metropolitan vs. non-metropolitan counties compare. We'll only look at the percentage breakdowns for education, as the raw number will most likely be positive (as seen in graph TODO)
```{r}
pop_edu_tidy %>%
filter(region_type == "county" & pop_2020 <= 2500000) %>%
mutate(metro = ifelse(Cont_Code %in% c(1, 2, 3), TRUE, FALSE)) %>%
ggplot(aes(x=pop_2020, y=edu_perc/100, color=edu_level)) +
geom_point(alpha=0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_grid(vars(edu_level), vars(metro), scales = "free_x",
labeller = labeller(edu_level = c("no_HS" = "No HS Diploma",
"HS_only" = "Only HS Diploma",
"some_college" = "Some College",
"bach_plus" = "College Diploma +"),
metro = c("FALSE" = "Non-Metropolitan Counties", "TRUE" = "Metropolitan Counties"))) +
labs(title="Graph TODO: Comparing Population with Level of Education of US County Residents",
subtitle="For county populations under 2.5 million, splitting by if a county is a metropolitan",
caption="From 2020 US census data",
x = "Number of Residents in a County",
y = "Percentage of Residents that Fall into that Education Group") +
theme(legend.position = "none") +
scale_y_continuous(labels=scales::percent)
```
When breaking down education further by whether the county is a metropolitan or not, new trends emerge. Focusing on the metropolitan counties, for all education groups except for having a college diploma, the trend lines are slightly U-shaped (concave up), meaning that for low populations, the percentage of people who are not college educated is slightly higher, and for larger populations, the percentage actually increases. This is surprising to me because I would've thought that larger cities are filled with more college-educated students, but this isn't always the case, as shown here. For non-metropolitan counties, the trend lines are mostly flat, meaning that regardless of population size for counties, on average, the percentage of people who fall into each category are about the same. The percentage of college-educated residents does seem increase ever so slightly as the population for a county increases in non-metropolitan counties.
## RQ2: How does the population size of a county connect with the number and proportion of people living in poverty?
We'll first look into the totals for both variables, and then look into the percentages of people living in poverty. As mentioned previously, we expect to see a positive relationship between the county population and number of people living in poverty.
```{r}
# First, join data together
pop_pov_tidy = pop %>%
inner_join(pov_tidy %>%
select(c(FIPS, age_group, pov_total, pov_perc)),
by = "FIPS")
# Next, graph pov_total vs. pop_2020, faceted by age_group
pop_pov_tidy %>%
filter(region_type=="county" & pop_2020 <= 2500000) %>%
ggplot(aes(x=pop_2020, y=pov_total, color=age_group)) +
geom_point(alpha=0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_wrap(~age_group,
labeller = labeller(age_group = c("pov_all" = "All Ages",
"pov_0_17" = "Minors aged 0-17",
"pov_5_17" = "Minors aged 5-17"))) +
labs(title="Graph TODO: Comparing Population with Poverty of US County Residents",
subtitle = "For counties with less than 2.5 million residents, split by age group",
caption="From 2020 US census data",
x = "Number of Residents in a County",
y = "Number of Residents who Fall into that Education Group") +
theme(legend.position = "none",
axis.text.x = element_text(angle=45, hjust=1))
```
Unsurprisingly, we see that as the population rises in a county, the number of people living in poverty for all age groups increases. Let's look at the percentage of people who fall into each category instead.
```{r}
# Graph pov_perc vs. pop_2020, faceted by age_group
pop_pov_tidy %>%
filter(region_type=="county" & pop_2020 <= 2500000) %>%
ggplot(aes(x=pop_2020, y=pov_perc/100, color=age_group)) +
geom_point(alpha=0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_wrap(~age_group,
labeller = labeller(age_group = c("pov_all" = "All Ages",
"pov_0_17" = "Minors aged 0-17",
"pov_5_17" = "Minors aged 5-17"))) +
labs(title="Graph TODO: Comparing Population with Poverty of US County Residents",
subtitle = "For counties with less than 2.5 million residents, split by Age Group",
caption="From 2020 US census data",
x = "Number of Residents in a County",
y = "Percentage of Residents who Fall into that Education Group") +
theme(legend.position = "none",
axis.text.x = element_text(angle=45, hjust=1)) +
scale_y_continuous(labels=scales::percent)
```
Again, we see that trends tend to follow a general U-shape, where for counties with very low populations, the percentage of people who are living in poverty is slightly higher, while counties with middling populations (about 250K to 1 million residents tend to see a decline in poverty rates. After the 1 million mark, however, the percentage of people living in poverty tends to increase. It's interesting that for larger counties, as the population increases, poverty also increases, showing that even big cities with lots of people and wealth can still have a significant proportion of people struggling to make ends meet.
Lastly, let's see how metropolitan vs. non-metropolitan counties vary in terms of proportion of people living in poverty.
```{r}
pop_pov_tidy %>%
filter(region_type=="county" & pop_2020 <= 2500000) %>%
mutate(metro = ifelse(Cont_Code %in% c(1, 2, 3), TRUE, FALSE)) %>%
ggplot(aes(x=pop_2020, y=pov_perc/100, color=age_group)) +
geom_point(alpha=0.75) +
geom_line(alpha=0.5, color="black", linewidth=1.25, method="loess", stat="smooth") +
facet_grid(vars(age_group), vars(metro), scale="free_x",
labeller = labeller(age_group = c("pov_all" = "All Ages",
"pov_0_17" = "Minors aged 0-17",
"pov_5_17" = "Minors aged 5-17"),
metro = c("FALSE" = "Non-Metropolitan Counties",
"TRUE" = "Metropolitan Counties"))) +
labs(title="Graph TODO: Comparing Population with Poverty of US County Residents",
subtitle = "Populations <2.5 million residents, split by age group and metropolitan status",
caption="From 2020 US census data",
x = "Number of Residents in a County",
y = "Percentage of Residents who Fall into that Education Group") +
theme(legend.position = "none",
axis.text.x = element_text(angle=45, hjust=1)) +
scale_y_continuous(labels=scales::percent)
# Check numbers
pop_pov_tidy %>%
filter(region_type == "county") %>% # Remove states
mutate(metro = ifelse(Cont_Code %in% c(1, 2, 3), TRUE, FALSE)) %>%
group_by(metro, age_group) %>%
summarise(min_tot = min(pov_perc, na.rm=TRUE),
q1_tot = quantile(pov_perc, 0.25, 1, na.rm=TRUE),
mean_tot = mean(pov_perc, na.rm=TRUE),
med_tot = median(pov_perc, na.rm=TRUE),
q3_tot = quantile(pov_perc, 0.75, 1, na.rm=TRUE),
max_tot = max(pov_perc, na.rm=TRUE))
```
By splitting by whether a county is a metropolitan or not, we now see that slightly opposite trends. Counties that are metros show similar trends as our previous findings, where poverty rates tend to be higher for counties with very low populations, drop slightly for counties with middling populations, and increase again slightly for counties with large populations above ~1 million people. However, metropolitan counties tend to see an increase in poverty rates for counties with very low populations. Over time (after about a population size of 20K), these trend lines tend to decline. Additionally, counties that aren't metropolitans also seem to have a greater proportion of people living in poverty, as from the table above, for all age groups, the proportion of people living in poverty is higher by about 2-4%.
## RQ3: How does level of education for residents correlate with poverty in these regions?
Lastly, we'll look into how level of education correlates to poverty counties. For visualization purposes,
```{r}
# Join tables together
df_full = pop_edu_tidy %>%
inner_join(pov_tidy %>%
select(c(FIPS, age_group, pov_total, pov_perc)),
by = "FIPS")
# First, let's see how the totals relate to one another
df_full %>%
filter(region_type == "county" &
pop_2020 <= 2500000 &
age_group == "pov_all") %>%
ggplot(aes(x=pov_total, y=edu_total, color=edu_level)) +
geom_point(alpha = 0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_wrap(edu_level ~ .,
scale="free_x",
labeller = labeller(edu_level = c("no_HS" = "No HS Diploma",
"HS_only" = "Only HS Diploma",
"some_college" = "Some College",
"bach_plus" = "College Diploma +"))) +
labs(title = "Graph TODO: Comparing Poverty with Highest Education Achieved",
subtitle="For county populations under 2.5 million, splitting by highest education achieved",
caption="From 2020 US census data",
x="Number of People Living in Poverty",
y="Number of People a Specific Education Level") +
theme(legend.position="none")
```
As expected, as the number of people living in poverty increases, the number of residents who fall into each category for education also increases. This follows the logic of the previous sections, as the greater the population the greater the number of people living in poverty, as well as the greater the number of people who fit into each category. However, some surprising relationships are still present. Firstly, we see that the relationship between number of people living in poverty and those who have achieved a high school diploma or less is positive and roughly linear. However, the relationship between the number of people living in poverty and the the number of people who have gotten some higher education isn't linear, but rather shows diminishing changes. One observation could be that people living in poverty tend to not have college educations, so counties with high levels of poverty probably also aren't "well-educated" or living with these folks.
Next, let's compare percentages for these two groups to get a better sense of how they might relate to one another.
```{r}
# Next, compare percentages
df_full %>%
filter(region_type == "county" &
pop_2020 <= 2500000 &
age_group == "pov_all") %>%
ggplot(aes(x=pov_perc/100, y=edu_perc/100, color=edu_level)) +
geom_point(alpha = 0.75) +
geom_line(alpha = 0.5, stat="smooth", linewidth=1.25, method="loess", color="black") +
facet_wrap(edu_level ~ .,
scale="free_x",
labeller = labeller(edu_level = c("no_HS" = "No HS Diploma",
"HS_only" = "Only HS Diploma",
"some_college" = "Some College",
"bach_plus" = "College Diploma +"))) +
labs(title = "Graph TODO: Comparing Poverty with Highest Education Achieved",
subtitle="For county populations under 2.5 million, splitting by highest education achieved",
caption="From 2020 US census data",
x="Percentage of People Living in Poverty",
y="Percentage of People a Specific Education Level") +
theme(legend.position="none") +
scale_x_continuous(labels=scales::percent) +
scale_y_continuous(labels=scales::percent)
```
By comparing the poverty rates with the educational achievements of residents in US counties, it's pretty clear that there's a straightforward trend here. In general, as the percentage of people living in poverty increases, the percentage of people who have gotten only a high school diploma or less education increases. On the other hand, greater poverty rates correlate to a lower proportion of people with a college diploma or more. Intuitively, this makes sense, as people with college degrees tend to earn more money than those who have only a high school diploma. Because poverty is based on one's income, it logically makes sense that communities with high poverty tend to have more people who aren't college educated. It is interesting that the trend line for the proportion of people who have some college experience is flat, regardless of the poverty rate.
# Limitations
This report only shows a snapshot of the information provided. In future iterations, it would be interesting to see how demographics have shifted in the United States. Specifically, visualizing some time series graphs and visualizing changes in relationships between populous counties and shifts in education or poverty would be helpful.
Similarly, showcasing this data on a map could be beneficial, as plotting major cities to connect county and state data could prove beneficial.
For this homework submission, I only focused on county-level comparisons. For the final project, I would also be interested in seeing how these variables correlate with one another at the state level, as the EDA sections comparing county level data with state level data showed significant differences.
Like many other datasets having to deal with census information, we assume that the data provided is correct. Because census information is filled out and submitted by households, it's very likely that some at-risk groups (such as houseless people, those moving between places, or people working many hours and unable to respond to their mail) are not included in this data. As a result, metrics such as education levels or poverty might be better than reality.
Lastly, it's important to note the distinction between correlation and causation. Although our data might show interesting links between variables, because each county and state have their own legislation, programs, systems, etc., there are many confounding variables not included in this analysis that might explain why we see specific trends.
Some things I can work on for my graphs are:
1. Improving the readability for the x-axes
2. Name graphs for easy reference
3. Not related to graphing, but ensure that all definitions for key terms are included in the writeup (such as poverty, Continuum Codes, etc.).
# References
Grolemund, G., & Wickham, H. (2017). R for Data Science. O’Reilly Media.
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Rupnarain, K. (2020a, July 13). Why do the poor have large families?. World Vision Canada. https://www.worldvision.ca/stories/why-do-the-poor-have-large-families
Rural-urban continuum codes. USDA ERS - Rural-Urban Continuum Codes. (n.d.). https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/
U.S. and World Population Clock. United States Census Bureau. (n.d.). https://www.census.gov/popclock/
US Census Bureau. (2021, December 16). State area measurements and internal point coordinates. Census.gov. https://www.census.gov/geographies/reference-files/2010/geo/state-area.html
US Census Bureau. (2023, January 30). How the Census Bureau Measures Poverty. Census.gov. https://www.census.gov/topics/income-poverty/poverty/guidance/poverty-measures.html