Data Analytics and Computational Social Science: Final Project Updated

Eris Dodds

Introduction

Employment rates across U.S. states in 2012 may have been affected by a number of factors. These factors could act both within and between states, and include state population, government policies, and developments in companies and organizations, all of which reflect trends of that period. The current project is a review of employment rates across the U.S. during the year 2012. The intent of the project is to present a descriptive rather than correlational view of employment rates. Further investigations might benefit from the inclusion of additional data such as 1) state population data by county for 2012, 2) demographic data for those employed in 2012, 3) political trends in policy making at both the federal and state level, and 4) data on companies and organizations.

Data

The data were loaded from a government database on employment rates across states and counties for 2012. The raw data covers each state across the U.S.A., as well as the entirety of Canada. Each state appears in multiple rows, each row being a different county in that state. The total for each state is also given in the data set, as its own row. The data set therefore holds two types of quantities: 1) individual rates of employment by county within each state, and 2) the total rate of employment for each state (i.e. the sum of all county values within that state).

Modifying Data

Throughout the course of this project, I’ve had to modify the data to examine different aspects of employment by state, which will be discussed later. To start, however, I needed to clean the initial data set (as is) to remove NA values, and to replace the automatically numbered column names with descriptive ones.

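library(tidyverse)  # dplyr, tidyr, ggplot2, and tibble are all used below

# State-level view: keep the state and total columns, drop rows with missing values
state <- select(StateCounty2012, "...2", "...6")
state <- drop_na(state)
colnames(state) <- c("State", "Total")
state <- state[-c(1), ]  # drop the first row
# County-level view: keep the state, county, and total columns
state_numbers2 <- select(StateCounty2012, "...2", "...4", "...6")
state_numbers2 <- drop_na(state_numbers2)
colnames(state_numbers2) <- c("State", "County", "Total")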

Creating new Datasets

As mentioned above, the current dataset offers a few unique representations of employment rates across U.S. states by county. Specifically, it represents employment data for each state in numerous rows. Each row is the data for a different county in that state. There is an additional row that gives the total for the state (the sum of the county values). My goal was to isolate the rows with totaled values from the county data. As a result, I have created six objects:

state_totals: State-level totals, isolated from the county rows.

state_numbers2: Individual county rates within each state.

statetotals_highs: The 5 highest state totals in the data set.

statetotals_low: The 5 lowest state totals in the data set.

statenumbers_low: County data for 5 states with lowest employment rate.

statenumbers_high: County data for 5 states with highest employment rate.
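
The step that builds state_totals is not shown in this write-up, and the subsets below are picked by row position. As a minimal sketch of one alternative, assuming total rows can be recognized by the word "Total" in the State column (as in the printed output, e.g. "CA Total"), the totals can be split off by pattern and ranked by value. The toy data frame reuses values from the output printed below; `toy`, `totals`, `counties`, `lowest`, and `highest` are illustrative names, not objects from the project.

```r
# Toy stand-in for the cleaned two-column data; Total is stored as text,
# as it is in the project's tibbles.
toy <- data.frame(
  State = c("AK", "AK", "AK Total", "CA", "CA", "CA Total",
            "HI Total", "NY Total", "TX Total", "VT Total"),
  Total = c("7.0", "2.0", "103", "346.0", "9.0", "13137",
            "4", "17050", "19839", "259"),
  stringsAsFactors = FALSE
)

# Assumption: a row is a state total if its State value contains "Total".
is_total <- grepl("Total", toy$State, fixed = TRUE)
totals   <- toy[is_total, ]
counties <- toy[!is_total, ]

# Convert to numeric so that sorting is by value, not alphabetical.
totals$Total <- as.numeric(totals$Total)

# Rank by value instead of hard-coding row positions.
ranked  <- totals[order(totals$Total), ]
lowest  <- head(ranked, 2)
highest <- tail(ranked, 2)
```

Selecting by value rather than by row index keeps the subsets correct if rows are added or removed upstream.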

statetotals_low<-state_totals[c(1, 4, 14, 2, 49),]
statetotals_highs<-state_totals[c(32, 7, 17, 46, 37),]
statenumbers_high<-state_numbers2[c(164:218, 645:747, 1722:1810, 1883:1943, 2397:2617),]
statenumbers_low<-state_numbers2[c(1, 3:8, 76, 507:509, 2735:2748),]

statetotals_low

# A tibble: 5 × 2
  State     Total
  <chr>     <chr>
1 AE Total1 2    
2 AP Total1 1    
3 HI Total  4    
4 AK Total  103  
5 VT Total  259

statetotals_highs

# A tibble: 5 × 2
  State    Total
  <chr>    <chr>
1 NE Total 13176
2 CA Total 13137
3 IL Total 19131
4 TX Total 19839
5 NY Total 17050

statenumbers_high

# A tibble: 529 × 3
   State County       Total
   <chr> <chr>        <chr>
 1 CA    ALAMEDA      346.0
 2 CA    AMADOR       9.0  
 3 CA    BUTTE        69.0 
 4 CA    CALAVERAS    30.0 
 5 CA    COLUSA       2.0  
 6 CA    CONTRA COSTA 348.0
 7 CA    EL DORADO    103.0
 8 CA    FRESNO       341.0
 9 CA    GLENN        4.0  
10 CA    HUMBOLDT     2.0  
# … with 519 more rows

statenumbers_low

# A tibble: 25 × 3
   State County               Total
   <chr> <chr>                <chr>
 1 STATE COUNTY               TOTAL
 2 AK    ANCHORAGE            7.0  
 3 AK    FAIRBANKS NORTH STAR 2.0  
 4 AK    JUNEAU               3.0  
 5 AK    MATANUSKA-SUSITNA    2.0  
 6 AK    SITKA                1.0  
 7 AK    SKAGWAY MUNICIPALITY 88.0 
 8 AP    APO                  1.0  
 9 HI    HAWAII               1.0  
10 HI    HONOLULU             2.0  
# … with 15 more rows

ggplot(data = statetotals_low) + geom_bar(mapping = aes(x = State, y = Total, fill = State), stat = "identity") + labs(title = "Low Employment Rate States", y = "Rate", x = "State")

ggplot(data = statetotals_highs) + geom_bar(mapping = aes(x = State, y = Total, fill = State), stat = "identity") + labs(title = "High Employment Rate States", y = "Rate", x = "State")

ggplot(data = statenumbers_high) + stat_count(mapping = aes(x = State, fill = State)) + labs(title = "High Employment Rate States", y = "Number of county rows", x = "State")

ggplot(data = statenumbers_low) + stat_count(mapping = aes(x = State, fill = State)) + labs(title = "Low Employment Rate States", y = "Number of county rows", x = "State")
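
The printed tibbles show Total stored as character (`<chr>`), in which case ggplot2 treats the bar heights as categories rather than numbers. A hedged sketch of the same bar chart with the column converted first is given here; `low` and `p` are illustrative names, and the values are copied from the statetotals_low output above.

```r
library(ggplot2)

# Values copied from the statetotals_low output; Total starts as text.
low <- data.frame(
  State = c("AE Total1", "AP Total1", "HI Total", "AK Total", "VT Total"),
  Total = c("2", "1", "4", "103", "259"),
  stringsAsFactors = FALSE
)

# Convert to numeric so the y axis is a continuous scale.
low$Total <- as.numeric(low$Total)

# geom_col() is equivalent to geom_bar(stat = "identity");
# reorder() sorts the bars by value (an optional presentation choice).
p <- ggplot(low, aes(x = reorder(State, Total), y = Total, fill = State)) +
  geom_col() +
  labs(title = "Low Employment Rate States", y = "Rate", x = "State")
```

Printing `p` renders the chart.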

library(maps)
states<-map_data("state")
state_totals<-state_totals[-c(1, 2, 4, 14),]
colnames(state_totals)<-c("state", "rates")
# Strip the trailing "1" marker, then the "Total" suffix, building on the previous result
state_totals$region <- gsub("1", "", as.character(state_totals$state))
state_totals$region <- trimws(gsub("Total", "", state_totals$region))

Creating Maps

Next, I want to make a map display that shows the employment rates across the US. The dataset that I have uploaded will need to be manipulated substantially, then merged with the maps data for the US for plotting.

mapdata<-left_join(states, state_totals, by="region")
mapdata<-mapdata[order(mapdata$region), ]

mapdata$rates<- sub("\\.\\d+$", "", mapdata$rates)
mapdata$rates = as.numeric(as.character(mapdata$rates))
view(mapdata)

Cleaning Map Data

In order to merge the state employment data with the maps data, I needed to rename the column labeled “State” to “region”, which would match the “region” column in the “states” map data, allowing me to merge the two data sets together via that shared column. The ‘region’ column in the “states” data holds lower-case full state names, while the newly dubbed “region” column in state_totals held abbreviated state names. Full state names were then added to the state_totals data set, while the older columns with abbreviated state names were removed. This allowed me to merge the “states” data, which includes coordinates for mapping the U.S., with the state_totals data, which includes rates of employment for each state. The data sets were merged into ‘mapdata’ using left_join(). It was also found, during this process, that R had been reading the ‘rates’ column as character rather than numeric. as.numeric() was used to convert the column to numeric.
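
The code that adds full state names is not shown here. One way to do it, sketched under the assumption that the abbreviations are standard two-letter postal codes, is to look them up in base R's built-in `state.abb` and `state.name` vectors. The data frame `totals` and the `abbr` and `idx` names are illustrative; the values come from the output printed earlier.

```r
# Toy stand-in for state_totals, using values from the earlier output.
totals <- data.frame(
  state = c("CA Total", "NY Total", "TX Total", "VT Total"),
  rates = c("13137", "17050", "19839", "259"),
  stringsAsFactors = FALSE
)

# Strip the "Total" suffix and surrounding whitespace to leave the abbreviation.
totals$abbr <- trimws(sub("Total", "", totals$state, fixed = TRUE))

# state.abb and state.name are built-in vectors covering the 50 states,
# stored in the same order, so match() gives the lookup position.
idx <- match(totals$abbr, state.abb)
totals$region <- tolower(state.name[idx])

# The rates column arrives as text; convert before using it as a fill value.
totals$rates <- as.numeric(totals$rates)

# Abbreviations with no match would need handling before the join.
unmatched <- totals[is.na(idx), ]
```

Codes that are not among the 50 states, such as the military codes AE and AP, are not in `state.abb` and would show up in `unmatched`.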

ggplot(mapdata, aes(x = long, y = lat, group = group, fill = rates)) + geom_polygon(color = 'gray') + coord_map('polyconic') + labs(title = "2012 U.S. Rates of Employment") + scale_fill_continuous(name = "Employment Rates", labels = scales::comma)

Review

This looks great, and echoes some of the information from the previous graphs, showing which states have the lowest and highest rates of employment. One possible driving factor may be state population influencing rates of employment. To understand this relationship visually, the choroplethr and choroplethrMaps packages were installed, which include datasets for population across the U.S. in 2012. The county_choropleth() function was used to map these values across the U.S. This data set includes Alaska and Hawaii, but they will not be considered for the purpose of this review.

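library(choroplethr)      # df_pop_county and county_choropleth()
library(choroplethrMaps)  # county map data used by county_choropleth()
data(df_pop_county)
county_choropleth(df_pop_county)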

Further Analyses

A final assessment was to take a sample state, specifically California, to look for overlap between population and rates of employment in 2012. Doing this means building a county-level map of California that shows rates of employment by county. The “county” data set was imported with map_data(), and subset() was used to single out California data from the “mapdata” data set and the “county” data set. The newly made datasets were then joined and passed to ggplot to get the following graph.

county <- map_data("county")  # county-level polygons, as described above
CA_data <- subset(mapdata, region == "california")
CA_county <- subset(county, region == "california")
ca_data<-left_join(CA_county, CA_data, by="subregion")
colnames(ca_data)<-c("long","lat","group", "rate","region","subregion","long2","lat2","group2","rate2","5","6")
ca_base<-ggplot(data = CA_data, mapping = aes(x = long, y = lat, group = group)) + coord_fixed(1.3) + geom_polygon(color = "black", fill = "gray")
ditch_the_axis <- theme( axis.text = element_blank(), axis.line = element_blank(), axis.ticks = element_blank(), panel.border = element_blank(), panel.grid = element_blank(), axis.title = element_blank())
gg1<-ca_base+geom_polygon(data = ca_data, aes(fill = rate), color = "white") + geom_polygon(color = "black", fill = NA) + theme_bw() + ditch_the_axis + ggtitle("Employment Rate California 2012")
gg1
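
Because CA_data is a subset of the state-level mapdata, it appears to carry a single state total rather than one value per county. A hedged sketch of an alternative is to attach the county rows from state_numbers2 to the county polygons by lower-cased county name. It assumes county names match the map's subregion values once lower-cased. The polygon coordinates are placeholders shaped like map_data("county") output, not real boundaries; the three county totals are copied from the statenumbers_high output above. `ca_counts`, `ca_polygons`, and `ca_map` are illustrative names.

```r
# County totals copied from the statenumbers_high output (stored as text there).
ca_counts <- data.frame(
  State  = c("CA", "CA", "CA"),
  County = c("ALAMEDA", "AMADOR", "BUTTE"),
  Total  = c("346.0", "9.0", "69.0"),
  stringsAsFactors = FALSE
)
ca_counts$subregion <- tolower(ca_counts$County)  # match the map's naming
ca_counts$rate      <- as.numeric(ca_counts$Total)

# Placeholder rows standing in for subset(map_data("county"), region == "california").
# The long/lat values are made up for illustration only.
ca_polygons <- data.frame(
  long      = c(-122.0, -121.9, -120.7, -120.6, -121.6, -121.5),
  lat       = c(37.6, 37.7, 38.4, 38.5, 39.6, 39.7),
  group     = c(1, 1, 2, 2, 3, 3),
  order     = 1:6,
  region    = "california",
  subregion = c("alameda", "alameda", "amador", "amador", "butte", "butte"),
  stringsAsFactors = FALSE
)

# Left join: keep every polygon row and attach the county value where names match.
ca_map <- merge(ca_polygons, ca_counts[, c("subregion", "rate")],
                by = "subregion", all.x = TRUE)

# merge() can reorder rows; polygons must be drawn in their original order.
ca_map <- ca_map[order(ca_map$order), ]
```

With real polygons, `ca_map` could be passed to `geom_polygon()` with `aes(x = long, y = lat, group = group, fill = rate)`.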

Results and Conclusions

Overall, it appears that TX, IL, CA, NY, and NE were the states with the highest rates of employment in 2012. The bar graphs include totals from military locations, as well as Alaska and Hawaii. The final map does not include these locations. A few possible explanations for why these states had high employment rates could be 1) the population of the states, 2) the types of businesses and organizations in those states, which create the opportunity for more jobs, 3) employment policies enacted at the state or federal level in 2012, and 4) the demographics of those employed. A visual review of state population seems to show an inconsistent relationship between population and employment rates, though this may not uniformly be the case across states. Specifically, comparing the employment map to the population map shows that densely populated areas are not consistently the most highly employed. In some states, employment rates may be correlated with population; in other states, not. Further analyses would be needed to draw these conclusions.
