The United States has over 330 million people living on about 3.8 million square miles of land, but people are not distributed evenly across this vast country. As a result, each region has different demographics, leading to different cultures, economic and political beliefs, ideologies, and societal components.

For this homework assignment and final project, I want to dive deeper into US state and county level data to explore how distributions in population, education level, and poverty all relate to one another. All data was pulled from the US Department of Agriculture via the US census. Although data does exist for 2021, not all datasets have this set of information, and thus we will be exploring the collected results from 2020. Three datasets have been added to the GitHub repo for the class, and the three files are Linus_US_Education.xlsx, Linus_US_PopulationEstimates.xlsx, and Linus_US_PovertyEstimates.xlsx.

Data Ingestion and Exploration

As mentioned previously, there are 3 files of interest that explore population, education, and poverty. To connect the datasets, the Federal Information Processing Standards (FIPS) code, a five-digit number that uniquely identifies states and counties, is used as a key. To remove national and state totals, simply exclude observations where the FIPS value ends in “000”, such as “01000”, “02000”, etc. We are also only interested in US counties, so we remove all FIPS codes greater than or equal to 72000 (such as those for Puerto Rico). Also, I will treat Washington D.C. as a county rather than a state.
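The filtering rules above can be sketched in R as follows. This is only an illustration on a small made-up data frame: the column names `fips` and `area` are assumptions rather than the names used in the actual files, and FIPS codes are kept as character strings so that leading zeros survive.

```r
library(dplyr)

# Toy stand-in for one of the three files (column names are assumed).
toy <- data.frame(
  fips = c("00000", "01000", "01001", "11000", "11001", "72000", "72001"),
  area = c("United States", "Alabama", "Autauga County",
           "District of Columbia (total)", "District of Columbia",
           "Puerto Rico", "Adjuntas Municipio"),
  stringsAsFactors = FALSE
)

counties <- toy %>%
  filter(
    !endsWith(fips, "000"),      # drop national and state totals
    as.integer(fips) < 72000     # drop codes >= 72000 (e.g. Puerto Rico)
  )
```

In this toy example only 01001 and 11001 remain. Washington D.C. is kept because its county-level code (11001) does not end in "000".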


The population dataset contains population information for US territories, states, and counties. Although there’s data from the 1990s to 2021, as mentioned before, we will only look at 2020 census data so that it can be used with the other datasets. Another interesting variable is the rural-urban continuum code that was assigned to each county in 2013. According to the US Department of Agriculture, these continuum codes are split into metropolitan (1-3) and non-metropolitan (4-9) counties, leading to 9 distinct groups. Codes 1-3 represent counties in metro areas of 1 million or more people, 250K - 1 million people, or fewer than 250K people, respectively. Codes 4 and 5 represent counties with an urban population of 20K or more, adjacent or not adjacent to a metro area, respectively. Codes 6 and 7 represent counties with an urban population of 2.5K - 20K, adjacent or not adjacent to a metro area, respectively. Finally, codes 8 and 9 represent counties that are completely rural or have an urban population of less than 2.5K, adjacent or not adjacent to a metro area, respectively. The continuum codes provide a straightforward and guided way to group counties together for future analysis.

The tables above provide a quick glimpse into the dataset. Fortunately, the data is already tidy, as each row represents data at a region level (national, state, or county/city, based on the FIPS code). It’s interesting to note that although the data is nominally at the county level, county equivalents go by different names in different states. For example, Alaska has many regions that are called boroughs or census areas, and Louisiana’s counties are called parishes! Virginia also has a number of independent cities that have their own FIPS codes.

As for our data, thankfully it’s clean, but there is some missing data in the population (especially concerning given that the government “should” know where people live, for better or for worse). Let’s investigate this further.

At the county level, there’s huge variance in population, from a meager 64 people to a whopping 10 million people. Most counties have populations between 10K and 68K, and the number of counties drops significantly above roughly 75K, as evident in Graph 1. However, there are still a fair number of counties with populations above 100,000.

State Population

Like the county populations, state populations show high variance, ranging from 577K to 39.5 million residents. Most state populations are between 1.8 million and 7.5 million people, with the number of states dropping significantly above 8 million people, as seen in Graph 2 above.

Continuum Codes

Lastly for this section, it might be helpful to see the distribution of the continuum codes for counties, as this is a great way to compare all the consolidated data.

The table above shows the number and proportion of counties that fall within each continuum code. Surprisingly, the data isn’t heavily skewed towards any particular code. The counts for metropolitan counties (codes 1-3) are pretty close to one another. For non-metropolitan counties, it’s surprising to see that codes 6 and 7 (counties with urban populations between 2.5K and 20K people, adjacent or not adjacent to a metro area, respectively) contain the most counties. Code 5 (having an urban population of 20K or greater, not adjacent to a metro area) is by far the smallest group. This makes sense, as counties with a large urban population that are not next to a metropolitan area aren’t very common in the US.
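A count-and-proportion table like the one described above could be produced along these lines. The data here is made up for illustration, and the column name `rucc` is an assumption.

```r
library(dplyr)

# Made-up continuum codes for eight hypothetical counties (not the real data).
toy <- data.frame(rucc = c(1, 1, 2, 3, 6, 6, 6, 9))

code_table <- toy %>%
  count(rucc, name = "n_counties") %>%            # counties per code
  mutate(prop = n_counties / sum(n_counties))     # share of all counties
```

Each row of `code_table` holds one continuum code, the number of counties with that code, and their share of the total.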

For future analyses, we could group counties into metropolitan (codes 1-3) vs. non-metropolitan codes (4-9), compare codes within those two groups against one another, or compare non-metropolitan counties adjacent to (codes 4, 6, and 8) and non-adjacent to metro areas (codes 5, 7, and 9).
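The groupings suggested above can be derived directly from the continuum code. The following is a minimal sketch, assuming the code is stored in a numeric column called `rucc` (a hypothetical name); the labels are also illustrative.

```r
library(dplyr)

# One row per continuum code, just to show the mapping.
codes <- data.frame(rucc = 1:9)

grouped <- codes %>%
  mutate(
    # Codes 1-3 are metropolitan, 4-9 are non-metropolitan.
    metro = if_else(rucc <= 3, "Metropolitan", "Non-metropolitan"),
    # Adjacency only applies to non-metropolitan codes.
    adjacency = case_when(
      rucc <= 3            ~ NA_character_,
      rucc %in% c(4, 6, 8) ~ "Adjacent to metro",
      TRUE                 ~ "Not adjacent to metro"
    )
  )
```

Either new column can then be used for grouping or faceting in later comparisons.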


The education dataset is very similar to the population dataset, in that each observation is a FIPS code representing a county, a state, or the country. There is data from 1970 to 2021, with the 2017-2021 data consolidated into one value. Four levels of education are tracked (didn’t receive a high school diploma, received only a high school diploma, completed some college or an associate’s degree, or received a bachelor’s degree or higher), along with both the number and percentage of residents of a region who fall into each category.

# View structure of dataset

The tables above provide a quick glance into how the data is formatted. Some formatting will need to be done to make the data tidy. There are 11 instances where there is missing data - let’s look into this now.

edu %>%

We see that for 11 regions, no education data has been recorded. Many of these are Alaskan county equivalents, which, as mentioned before, have very small populations and may be home to Alaska Natives. Bedford city and Clifton Forge city are again listed with missing data, supporting the hypothesis that either both cities didn’t report some crucial information or, because these cities are connected to a county, their information has already been reported under that county.
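A missing-data check of this kind might look like the sketch below. It is not the original code: the data frame, FIPS codes, column names, and values are all made up for illustration.

```r
library(dplyr)

# Hypothetical education rows; NA marks values that were never recorded.
edu_toy <- data.frame(
  fips          = c("99001", "99002", "99003"),
  pct_no_hs     = c(10, NA, NA),
  pct_bach_plus = c(25, NA, 30),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(edu_toy))

# Rows with at least one missing value.
missing_rows <- edu_toy %>%
  filter(if_any(everything(), is.na))
```

Here `missing_rows` keeps the two regions with gaps, which can then be inspected by name.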

The following code is run to make our data “tidy”.
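As a rough sketch of what such a tidying step could look like (not the original code), the wide table can be reshaped so that each row is one region and education level, with the count and the percentage in their own columns. The category names `some_college` and `bach_plus` appear later in this report; the `n_`/`pct_` prefixes, the other column names, and all values are assumptions.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide layout: one count and one percentage column per level.
edu_wide <- data.frame(
  fips             = c("99001", "99002"),
  n_no_hs          = c(100, 40),
  n_hs_only        = c(300, 80),
  n_some_college   = c(350, 60),
  n_bach_plus      = c(250, 20),
  pct_no_hs        = c(10, 20),
  pct_hs_only      = c(30, 40),
  pct_some_college = c(35, 30),
  pct_bach_plus    = c(25, 10),
  stringsAsFactors = FALSE
)

edu_tidy <- edu_wide %>%
  pivot_longer(
    cols = -fips,
    # ".value" keeps "n" and "pct" as columns; the rest becomes the level.
    names_to = c(".value", "education"),
    names_pattern = "^(n|pct)_(.*)$"
  )
```

The result has one row per region and education level, with columns `fips`, `education`, `n`, and `pct`.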

Education by County

Graph 3 above shows the distribution of the share of residents in each education category across counties. Because county populations vary so drastically (as seen in the previous dataset), we compare the percentage of people in each category rather than the total count. Education differs greatly across counties, as each category has a quite distinct peak and variability. People who have only received a high school diploma or some higher education make up the majority of counties’ populations. Getting a college degree or higher is still somewhat rare, as the peak is around 20% of a county’s population. Lastly, there’s clearly room to improve education for all, as in the median county ~10% of people don’t have their high school diploma, and in about 25% of all counties more than 15% of the population lacks one.

The tables above graph 3 add more insight with regard to educational levels. There are instances where some counties don’t have any residents who reported getting any form of higher education. The fact that some counties had 50% or more of their residents not even receiving a high school diploma is startling. Although the group of people who have received only a high school diploma.

Summary Statistics for State Level Education Data

The state level data paints a slightly different picture. Americans tend to be pretty educated, as those who have at least some college education (if not more) have higher summary statistics across the board compared to those with only a high school diploma or less. This is also reflected in the table showing the summary statistics for state percentages, as the categories some_college and bach_plus are greater than the two categories covering a high school education or less. This suggests that at the county level, those with some higher education tend to live in very specific counties, rather than being evenly dispersed across the state. Because the many counties with low populations and low education rates outnumber the few counties with high populations or high education rates, county level data might skew the percentages to make residents look less educated. However, when we aggregate at the state level, populations are all combined into one group (the state), telling us that at least from this perspective, Americans are more educated than what was depicted earlier. These findings are also reflected in Graph 4 above: the median percentages for people who have received some higher education and for those who have received a bachelor’s degree or higher are both greater than the median percentages for people who have only a high school diploma or didn’t finish high school.
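The aggregation effect described above can be illustrated with a small made-up example: one large, highly educated county and two small, less educated ones. The numbers and column names are hypothetical and only meant to show the mechanism, not to reproduce the actual data.

```r
library(dplyr)

# Three hypothetical counties in one state (values are made up).
toy <- data.frame(
  county      = c("Large metro", "Small rural A", "Small rural B"),
  population  = c(1000000, 10000, 20000),
  n_bach_plus = c(400000, 1000, 3000),
  stringsAsFactors = FALSE
)

comparison <- toy %>%
  mutate(pct_bach_plus = 100 * n_bach_plus / population) %>%
  summarise(
    # Every county counts equally, regardless of size.
    mean_of_county_pcts = mean(pct_bach_plus),
    # Residents are pooled first, so large counties dominate.
    state_level_pct = 100 * sum(n_bach_plus) / sum(population)
  )
```

Averaging the three county percentages (40%, 10%, and 15%) gives about 21.7%, while pooling all residents gives about 39.2%, because the large county holds most of the state's population.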


This poverty dataset contains poverty-related statistics for 2020. Poverty, as explained on the website, is determined by first summing up a family’s total income over a given year, and then comparing that with a “poverty threshold” that is calculated by the government. This poverty threshold incorporates the size of a family, the age of each member, and the Consumer Price Index for All Urban Consumers (CPI-U), but does not vary by geographical location.

Although this dataset contains a lot of variables (such as 90% confidence intervals for different poverty groups), we will only be looking at the total number and proportion of people living in poverty for a specific region (county, state, or nation), as well as the number and proportion of people within a certain age group (aged 0-17, 0-4, and 5-17) who are estimated to be living in poverty. This data is especially valuable for investigating poverty among minors, as theoretically investing more in minors could help reduce poverty later on.

The data above is shown at the FIPS code level, and includes the number and percent of people living in poverty for all residents and for specific age brackets. One concerning factor is that there is a significant amount of missing data for the number and proportion of people aged 0-4 living in poverty. This is probably due to the fact that this age group is the hardest to keep track of (as they aren’t of age to attend school yet). We will remove these two columns due to the high levels of missingness.
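One way to quantify the missingness and drop the two affected columns is sketched below. The data frame, FIPS codes, column names, and values are all hypothetical; in particular, the `_0_4` suffix is an assumed naming convention for the columns covering ages 0-4.

```r
library(dplyr)

# Made-up poverty rows; the age 0-4 columns are mostly missing.
pov_toy <- data.frame(
  fips         = c("99001", "99002", "99003", "99004"),
  pct_pov_all  = c(12.0, 18.5, 9.5, 15.0),
  pct_pov_0_17 = c(16.0, 25.0, 11.0, 21.0),
  n_pov_0_4    = c(NA, NA, NA, 150),
  pct_pov_0_4  = c(NA, NA, NA, 20.0),
  stringsAsFactors = FALSE
)

# Share of missing values in each column.
missing_share <- colMeans(is.na(pov_toy))

# Drop the two columns for ages 0-4.
pov_trimmed <- pov_toy %>%
  select(-ends_with("_0_4"))
```

In this toy data, 75% of the age 0-4 values are missing, and `pov_trimmed` keeps only the remaining three columns.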

Let’s check the missing data.

pov %>% filter(

We see that for Kalawao County in Hawaii, surprisingly, all values pertaining to poverty are missing. Because this is only one county among more than 3,100 counties, this isn’t a big deal. I’ll leave it in so that during the join, we know it was missing information.

Lastly, our data needs to be made “tidy”, as percents and numbers should each be their own column.

Poverty at the County Level

From graph 5 and the tables above, the poverty rate for all residents across counties is about 13% - much higher than I was expecting. It’s also shocking to see that there are some counties where over 20% of the county’s population is living in poverty. Equally shocking is how the proportion of minors (aged 0-17 or 5-17) living in poverty is actually higher than that of all residents. It could be because adults living in poverty tend to have larger families (as discussed by World Vision), or because adults living in poverty are undercounted, as they are more likely to be houseless and thus not have access to the census survey, while children living in poverty (hopefully) are living with parents and are more likely to be included in census data. In general though, it’s scary and quite eye-opening that about 18% of all minors are living in poverty when data is broken up at the county level.

Poverty at the State Level

At the state level, the percentage of people living in poverty drops somewhat, most likely not because the underlying numbers changed, but because aggregation removes the effect of the many counties with high poverty rates that inflate the county-level percentages. However, we still see that poverty rates for minors (aged 0-17 or aged 5-17) are higher than poverty rates for all residents of states (14.84% and 13.93% vs. 11.62%, respectively, when comparing the medians), and that there are some states where over 20% of minors (aged either 0-17 or 5-17) are living at or below the poverty threshold. The lower numbers are reflected in graph 6: compared to graph 5, the medians and quartiles for all groups are a lot lower. One point to add, however, is that although the percentages are generally lower, the minimums at the state level are higher than those at the county level. This again might be one of the differences that are hidden away when aggregating data at different levels.

Research Questions

With data in hand, let us look into how population size, poverty rates, and education levels all relate to one another at both the county and state levels. Our research questions are as follows:

  1. How does the population size of a county relate to educational attainment for residents?

  2. How does the population size of a county connect with the number and proportion of people living in poverty?

  3. How does level of education for residents correlate with poverty in these regions?

  • In addition to the questions above, faceting and aggregating by the Continuum Codes will provide deeper insight as to how these features relate to how urban / rural that county might be, incorporating other features such as population size and city-like features into this analysis.
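Answering these questions requires the three datasets to be combined on the FIPS key. A minimal sketch of such a join is shown below; the data frames, FIPS codes, column names, and values are made up, and `left_join` is one reasonable choice here because it keeps every county from the population table even when another table has no matching row.

```r
library(dplyr)

# Hypothetical, already-filtered county tables (values are made up).
pop_toy <- data.frame(fips = c("99001", "99002", "99003"),
                      population = c(1000, 200, 50),
                      stringsAsFactors = FALSE)
edu_toy <- data.frame(fips = c("99001", "99002"),
                      pct_bach_plus = c(25, 10),
                      stringsAsFactors = FALSE)
pov_toy <- data.frame(fips = c("99001", "99002", "99003"),
                      pct_pov_all = c(12, 20, NA),
                      stringsAsFactors = FALSE)

combined <- pop_toy %>%
  left_join(edu_toy, by = "fips") %>%   # counties without education data get NA
  left_join(pov_toy, by = "fips")
```

Counties that are missing from one of the tables remain in `combined` with `NA` in the corresponding columns, so the gaps stay visible after the join.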


RQ1: How does population size of counties relate to the highest level of education attained for residents?

RQ1 - County Level

First, we’ll look at the totals for each county. Because both variables are counts of people, I expect to see a positive trend between population and the number of people in each education group.

As expected, when plotting population vs. education level, we see that the number of people who fall into each education category increases as population increases. Two trends are clear from graph 7 above: 1) education groups are very close to one another when populations are low, but these groups diverge and seem to be somewhat distinct as the population increases, and 2) there seems to be a linear relationship between the population size and rate at which the number of people within each education group increases.

Graph 7 compares county population with the highest education level achieved by residents, for counties with fewer than 3 million residents. 3 million was chosen because there were very few counties with populations over 3 million, and these data points gave no new insight into the trends. From graph 7, it’s clear that the education groups are fairly distinct, with people without a high school diploma making up the smallest number of residents, and those with at least a bachelors degree at the top. Only the “HS Diploma” and “Some Bachelors / Associates” groups overlap.

Graph 8 separates these groups up, and gives more insight as to the relationship between county population and highest education achieved. The number of people who don’t get a high school diploma increases the least as population increases, while the number of people who get at least a bachelors degree tends to increase the most. Those who receive only a high school diploma and those who take only some college or associates classes have about the same rate of increase, which falls in line given that the difference in these two groups might be a semester of classes (like 3 months of class).
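The "rate of increase" for each group can be thought of as the slope of a straight line fitted to that group's counts against county population. The sketch below shows one way to compute such slopes per group; it uses made-up, perfectly linear toy data and assumed column names, so the numbers say nothing about the real counties.

```r
library(dplyr)

# Made-up tidy data: three hypothetical counties, two education groups.
toy <- data.frame(
  population = rep(c(10000, 50000, 100000), times = 2),
  education  = rep(c("no_hs", "bach_plus"), each = 3),
  n_people   = c(1000, 5000, 10000,    # no_hs: 10% of each county
                 2500, 12500, 25000),  # bach_plus: 25% of each county
  stringsAsFactors = FALSE
)

slopes <- toy %>%
  group_by(education) %>%
  # Fit a simple linear model per group and keep the population coefficient.
  summarise(slope = unname(coef(lm(n_people ~ population))[2]))
```

A larger slope means the group grows faster as population increases; in this toy data the slopes are 0.10 for `no_hs` and 0.25 for `bach_plus`.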

Because this graph doesn’t tell us much about how the percentage of people who fall into each group changes, we’ll next look into the percentage of people who fall into each category.

Lastly, let’s investigate how metropolitan vs. non-metropolitan counties compare. We’ll only look at the percentage breakdowns for education, as the relationship for the raw numbers will most likely be positive (as seen in graph 11).

When breaking down education further by whether the county is metropolitan or not, new trends emerge. Focusing on the metropolitan counties, for all education groups except those with a college degree, the trend lines are slightly U-shaped (concave up), meaning that the percentage of people who are not college educated is slightly higher for low populations, dips for mid-sized populations, and rises again for larger populations. This is surprising to me because I would’ve thought that larger cities are filled with more college-educated residents, but this isn’t always the case, as shown here. For non-metropolitan counties, the trend lines are mostly flat, meaning that regardless of county population size, on average, the percentage of people who fall into each category is about the same. The percentage of college-educated residents does seem to increase ever so slightly as county population increases in non-metropolitan counties.

RQ1 - State Level

For this section, we’ll look at the state level data to compare state population with highest education achieved. First, let’s look at

Graph 13 gives a better picture of how these trends might change. Unlike graph 8, graph 13 shows that in general, the rate of change between the number of residents in a state and the number of residents who fall into each group is about the same for all groups except for those who have not received a high school diploma. The trend lines all seem to be somewhat linear, though for larger populations (populations > 20 million), there seems to be a shift in trends, as there are more people without a high school diploma and fewer people with only a high school diploma than expected.

Next, let’s look at how the percentage of each education group varies per population.

Graph 14 above shows the relationship between the percentage of residents in each education group and state population. Like the county equivalent in graph 9, we see firstly that those without a high school diploma make up the smallest percentage of residents, but as the population size increases, the percentage increases as well. Next, there is a lot of overlap between the remaining three groups, but typically, we see that those with at least a bachelors degree make up the largest share of residents, followed by those who have received some college education, and lastly those who have received only their high school diploma. However, for state populations under 10 million, those who have received a high school diploma and those who took some higher level classes make up about the same percentages, as they overlap greatly in graph 14.

RQ2: How does the population size of a county connect with the number and proportion of people living in poverty?

RQ2 - County Level

We’ll first look into the totals for both variables, and then look into the percentages of people living in poverty. As mentioned previously, we expect to see a positive relationship between the county population and number of people living in poverty.

Unsurprisingly, we see that as the population rises in a county, the number of people living in poverty increases for all age groups. We also see that there are three distinct groups in graph 15, as each group contains another (i.e., minors are included in the All Ages group, and those aged 5-17 are part of the group aged 0-17).

Let’s look at the percentage of people who fall into each category instead.

Again, in both graphs 16 and 17, we see that trends tend to follow a general U-shape, where for counties with very low populations, the percentage of people who are living in poverty is slightly higher, while counties with middling populations (about 250K to 1 million residents) tend to see a decline in poverty rates. After the 1 million mark, however, the percentage of people living in poverty tends to increase. It’s interesting that for larger counties, as the population increases, poverty also increases, showing that even big cities with lots of people and wealth can still have a significant proportion of people struggling to make ends meet. Although graph 16 shows plenty of overlap between all three age groups, from graph 17, we see that minors tend to have slightly higher poverty rates, supporting our findings from the EDA section.

Lastly, let’s see how metropolitan vs. non-metropolitan counties vary in terms of proportion of people living in poverty.

By splitting by whether a county is metropolitan or not, we now see slightly opposite trends, as depicted in graph 18. Metropolitan counties show similar trends to our previous findings, where poverty rates tend to be higher for counties with very low populations, drop slightly for counties with middling populations, and increase again slightly for counties with large populations above ~1 million people. However, non-metropolitan counties tend to see an increase in poverty rates among counties with very low populations. As population grows (past about 20K), these trend lines tend to decline. Additionally, non-metropolitan counties also seem to have a greater proportion of people living in poverty: from the table above, for all age groups, the proportion of people living in poverty is higher by about 2-4 percentage points.

RQ2 - State Level

Now, looking at the state level, how does population and poverty relate to one another?

Just like graph 15, in graph 19 above, we see that each age group is quite separate from one another, and all seem to follow a linear relationship between state population and number of residents living in poverty. The rate of increase differs between each group, but this is most likely because wider age groups envelop more people, and thus the increase in poverty can be attributed to the larger population.

How does the poverty rate vary when looking at the percentage of people in a state living in poverty?

In graph 20 above, we get very similar findings to the county-level equivalent seen in graph 16. States with low populations (< 5 million residents) tend to have greater variation in the percentage of people living in poverty for all age groups, especially for minors. However, unlike graph 16, it’s a lot clearer that poverty rates for minors are slightly higher than the poverty rates for all age groups, matching our findings from our initial EDA for poverty rates at the state level.

RQ3: How does level of education for residents correlate with poverty in these regions?

Lastly, we’ll look into how level of education correlates to poverty rates. For visualization purposes, we will only use the poverty rates and values linked to residents of all ages.

RQ3 - County Level

As expected, as the number of people living in poverty increases, the number of residents who fall into each education category also increases, as evident in graphs 21 and 22. This follows the logic of the previous sections: the greater the population, the greater the number of people living in poverty, as well as the greater the number of people who fit into each category. However, some surprising relationships are still present. Firstly, we see that just like the population vs. education plot (graph 7), the education level groups are quite distinct. In general, we see that those without a high school diploma tend to make up the smallest number of people within a county, and those with a bachelors degree tend to make up the largest share of a county’s population, with the other two groups mixed in between. However, the relationship between the two variables is quite different, depending on the education group. From graph 22, we clearly see a strong linear trend between the number of people living in poverty and the number of people who didn’t get their HS diploma. Although the populations for people who have received their HS diploma and those who have taken some higher level courses are roughly the same, graph 21 does show that there is a slightly stronger positive relationship (as evident by the steeper slope and distinction in colors at larger populations) between the number of people living in poverty and those who have taken some college courses. We see the greatest variability for those who have received a degree in higher education.

These relationships are by no means surprising. As we found in research question 2, there is a strong relationship between the population size of a county and the number of people living in poverty. Therefore, we would again expect to see a positive linear relationship between the number of people living in poverty and the number of people with various educational backgrounds, as evident in graph 21.

Next, let’s compare percentages for these two groups to get a better sense of how they might relate to one another.

RQ3: State Level

First, we’ll look into the state-level data to investigate the relationship between the number of people living in poverty and the highest education achieved by residents in the state.

Lastly, we’ll look at how the different percentages between education and poverty are related.

Key Findings and Conclusion

For this paper, we set out to find how state and county populations relate to poverty rates and the education of their residents. Our findings show that there indeed are relationships between all these variables, but we cannot make claims as to why these instances occur. The following sections give a brief summary about our discoveries.


At both the state and county levels, in general, smaller populations tend to see more variation in how educated residents are. Although the group of residents who have not received a high school diploma tends to be the smallest, we see that as the number of residents increases, the percentage of people who fall into this category actually increases slightly. On the other hand, although people who have received at least a bachelors degree generally make up the largest share of the population, in small counties this number and percentage is often very small, showing the disparity in education levels between those who live in big counties and those who live in rural or less populated regions of the US.


In general, poverty rates for minors are slightly higher than the group encompassing all ages. When looking at state-level data, poverty rates tend to be slightly lower, supporting the belief that some counties with small populations and high poverty rates inflate the distributions when looking at the county level. There is also greater variation in poverty rates for lower populations (for both counties and states), but the rates tend to stabilize for medium to large counties.

Education and Poverty

Unsurprisingly, how educated a population is and the number of people living in poverty have a strong relationship. There is a strong positive linear relationship between poverty rates and the number of people who don’t have a high school diploma. Similarly, there’s a negative relationship between poverty rates and the number of people who have at least a bachelors degree.


This report only shows a snapshot of the information provided. In future iterations, it would be interesting to see how demographics have shifted in the United States. Specifically, visualizing some time series graphs and visualizing changes in relationships between populous counties and shifts in education or poverty would be helpful.

Similarly, showcasing this data on a map could be beneficial, as plotting major cities would help connect the county and state data.

Like many other analyses that deal with census information, we assume that the data provided is correct. Because census information is filled out and submitted by households, it’s very likely that some at-risk groups (such as houseless people, those moving between places, or people working many hours and unable to respond to their mail) are not included in this data. As a result, metrics such as education levels or poverty might look better than reality.

Lastly, it’s important to note the distinction between correlation and causation. Although our data might show interesting links between variables, because each county and state has its own legislation, programs, systems, etc., there are many confounding variables not included in this analysis that might explain why we see specific trends.


