Data Analytics and Computational Social Science: HW5_IanKennedy

Ian Kennedy

For my final project I’ll be using a data set from the Research Lab (FFRC) I’ve been working with for the past few years. In short, the FFRC, in coordination with the USDA Forest Service, implements an annual survey (TPO) which attempts to track timber procurement and lumber production at mills throughout a host of states. I’ll be running analyses to see how shifting the baseline for minimum volume procured would impact the number of mills and total volume of procured timber included in the sample frame. Generally speaking, this sort of analysis will attempt to answer the following research questions:

Does it make sense to exclude small-production/hobby mills from future survey attempts?
If yes, what is the minimum volume that mills should have procured to be included in survey efforts?
What impact would shifting the baseline, regardless of minimum volume utilized, have on the total volume and number of mills that are surveyed?

There will likely be further questions that will need to be analyzed, however these questions outline the main priorities of data analysis for this project.

Initial Results

HW3_Data <- read_excel(path = "C:/Users/kenne/Documents/R_Workspace/TPO_Sample.xlsx")


#Calculate the sum of all Mill Volumes
HW3_Data %>%
  filter(is.na(MILL_STATUS_CD)| MILL_STATUS_CD == 2| MILL_STATUS_CD == 1) %>%
  select(TOT_BF)%>%
  sum(na.rm = FALSE) -> TOT_Volume


#Calculate Total # of mills
HW3_Data %>%
    filter(is.na(MILL_STATUS_CD)| MILL_STATUS_CD == 2| MILL_STATUS_CD == 1)%>%
  nrow() -> TOT_Mills
TOT_Mills <- as.numeric(unlist(TOT_Mills[1]))

#Calculate the sum omitted if omitting mills below 20,000 BF (not including blanks)
HW3_Data %>%
  mutate(Volume_Code = case_when(TOT_BF > 20000 ~ 2, 
                                TOT_BF <= 20000 & TOT_BF > 0 ~ 1,
                                 TOT_BF == 0 ~ 0))%>%
  filter(Volume_Code == 1, is.na(MILL_STATUS_CD)| MILL_STATUS_CD == 2| MILL_STATUS_CD == 1) %>%
  select(TOT_BF) %>%
  sum(na.rm = TRUE) -> TOT_Volume_Omit

  
#Calculate number of mills if omitting mills below 20,000 BF (not including blanks)
HW3_Data %>%
  mutate(Volume_Code = case_when(TOT_BF > 20000 ~ 2, 
                                 TOT_BF <= 20000 & TOT_BF > 0 ~ 1,
                                 TOT_BF == 0 ~ 0))%>%
  filter(Volume_Code == 1, is.na(MILL_STATUS_CD)|MILL_STATUS_CD == 2|MILL_STATUS_CD == 1) %>%
  select(TOT_BF) %>%
  nrow -> TOT_Mills_Omit
TOT_Mills_Omit <- as.numeric(unlist(TOT_Mills_Omit[1]))

#Filter mills with BF > 100,000,000
HW3_Data %>%
  mutate(High_BF = case_when(TOT_BF >= 100000000 ~ 2,
                             TOT_BF < 100000000 ~ 1)) %>%
  filter(High_BF == 2, is.na(MILL_STATUS_CD)|MILL_STATUS_CD == 2|MILL_STATUS_CD == 1)%>%
  arrange(MILL_STATECD)%>%
  select(MILL_STATECD, MILL_NAME, TOT_BF, MILL_STATUS_CD) %>%
  nrow -> TOT_Mills_High_Vol
TOT_Mills_High_Vol <- as.numeric(unlist(TOT_Mills_High_Vol[1]))

Mill_Count_If_Omit <- TOT_Mills - TOT_Mills_Omit
Percent_Omit <- 100*(TOT_Mills_Omit/TOT_Mills)

Total # of mills in sample = 3033
Total volume in sample = 4.0246278^{11} BF
Total # of mills with volume > 100,000,000 BF = 83
Total # of mills omitted if using a 20,000 BF cut-off = 427
Total % of mills omitted if using a 20,000 BF cut-off = 14.0784702%
Total volume omitted if using a 20,000 BF cut-off = 3.5953829^{6} BF
Total # of mills in sample if using a 20,000 BF cut-off = 2606

Histogram Outlines and Creation of BF_plot

#Convert State Codes to Strings and Create BF_plot

BF_plot <- data.frame(HW3_Data)%>%
mutate(MILL_STATE = case_when(MILL_STATECD == 9 ~ "CT",
                                MILL_STATECD == 10 ~ "DE",
                                MILL_STATECD == 17 ~ "IL", 
                                MILL_STATECD == 18 ~ "IN",
                                MILL_STATECD == 19 ~ "IA",
                                MILL_STATECD == 20 ~ "KS",
                                MILL_STATECD == 23 ~ "ME",
                                MILL_STATECD == 24 ~ "MD",
                                MILL_STATECD == 25 ~ "MA",
                                MILL_STATECD == 26 ~ "MI",
                                MILL_STATECD == 27 ~ "MN",
                                MILL_STATECD == 29 ~ "MO",
                                MILL_STATECD == 31 ~ "NE",
                                MILL_STATECD == 33 ~ "NH",
                                MILL_STATECD == 34 ~ "NJ",
                                MILL_STATECD == 36 ~ "NY",
                                MILL_STATECD == 38 ~ "ND",
                                MILL_STATECD == 39 ~ "OH",
                                MILL_STATECD == 42 ~ "PA",
                                MILL_STATECD == 44 ~ "RI",
                                MILL_STATECD == 46 ~ "SD",
                                MILL_STATECD == 50 ~ "VT",
                                MILL_STATECD == 54 ~ "WV",
                                MILL_STATECD == 55 ~ "WI")) %>%
  filter(is.na(MILL_STATUS_CD)|MILL_STATUS_CD == 2|MILL_STATUS_CD == 1)

Histogram Plots & Associated Code

These histograms display the logged values of reported volumes for mills across the Northern United States. Both are univariate, utilizing the columns “TOT_MCF_LOG” and “TOT_BF_LOG” respectively. These plots will help to answer questions regarding the presence of reported mill volumes that are far outside of a normal range. A “normal” log MCF value would be < 4.2, while a “normal” log BF value would be < 8. In other words, any mills with reported log volumes above these thresholds will need to be reviewed and amended by USDA Forest Service Staff.

In taking a quick glance at the Logged BF, it is clear that there are at least 83 mills with reported volumes far outside of a “normal” range. Thus, these data sets will need review from partners within the USDA Forest Service.

A naive viewer would have trouble interpreting these plots given the units at hand. Without specifying what MCF (Thousand Cubic Feet) and BF (Board Feet) refer to, a naive viewer would likely have little clue as to what the histograms are displaying. However, because this information will be utilized by those within the Forest Service, defining these units is not necessary.

#Create a histogram of Mill Volumes (Logged MTC)

MCF_DF <- data.frame(BF_plot['TOT_MCF_LOG'])
MCF_List <- unclass(MCF_DF)
MCF_Vect <- as.vector(unlist(MCF_List[[1]]))

Hist_Breaks_MCF <- c(seq(from = -3, to = 8, by = .5))

hist.default(MCF_Vect, xlim = c(-3, 8), ylim = c(0, 600), breaks = Hist_Breaks_MCF, labels = TRUE, ylab = 'Count', xlab = 'Log(10) of MCF', main = 'Log(10) of MCF Histogram')

#Create a histogram of Mill Volumes (Logged BF)

BF_DF <- data.frame(BF_plot['TOT_BF_LOG'])
BF_List <- unclass(BF_DF)
BF_Vect <- as.vector(unlist(BF_List[[1]]))


Hist_Breaks_BF <- c(seq(from = 0, to = 12, by = 1))

hist.default(BF_Vect, xlim = c(0, 12), ylim = c(0, 1200), breaks = Hist_Breaks_BF, labels = TRUE, ylab = 'Count', xlab = 'Log(10) of BF', main = 'Log(10) of BF Histogram')

State Level Plots

The scatter plots/box plots below utilize a host of variables, most of which required some sort of tidying prior to visualization.

In terms of how a naive viewer would interpret these, the same issues as with the above histograms arise. Namely, because some of the plots are visualizing logged BF (Board Feet), someone lacking insight into the lumber industry would likely be left clueless if attempting to draw meaningful conclusions.

Log(10) BF Plots

These plots visualize reported mill volumes’ on a state-by-state basis. By breaking these volumes down by state, our FS partners are able to hone-in on states with the highest number of volume-outliers (i.e. Log(10) BF > 8). From the plots, it is clear that there are a few states that will need to be reviewed, these include ME, MI, & MN, among others.

#Create State-by-State boxplots and scatter plots for Logged BF Volume

ggplot(BF_plot, aes(x = MILL_STATE, y = TOT_BF_LOG, color = MILL_STATE)) + 
  geom_point() +
  scale_y_continuous(breaks = Hist_Breaks_BF, limits = c(0,12)) + 
  labs(title = "Log(10) of BF by State", y = "Log(10) of BF", x = "State") +
  theme(text = element_text(family = "serif"))

ggplot(BF_plot, aes(x = MILL_STATE, y = TOT_BF_LOG, color = MILL_STATE)) + 
  geom_boxplot() +
  scale_y_continuous(breaks = Hist_Breaks_BF, limits = c(0,12)) + 
  labs(title = "Log(10) of BF by State", y = "Log(10) of BF", x = "State") +
  theme(text = element_text(family = "serif"))

Omitted Mill/Volume Plots

These plots visualize the number of mills and aggregate volume (on a state-by-state basis) that would be removed if a minimum volume of 20,000 BF was set as a baseline for survey inclusion. States without omissions were not included in the plots.

#Create Bar Charts for Omitted Mills/Volumes by State

Omitted_Mills <- BF_plot %>%
  group_by(MILL_STATE) %>%
  filter(TOT_BF_LOG < 4.30103) %>%
  summarize("Omitted_Mill_Count" = n(), "Volume_Omitted" = sum(TOT_BF))


ggplot(Omitted_Mills, aes(x = MILL_STATE, y = Omitted_Mill_Count)) + 
  geom_col(width = 0.5) + 
  geom_text(aes(label = Omitted_Mill_Count), nudge_y = 2.5) + 
  scale_y_continuous(breaks = c(seq(from = 0, to = 60, by = 5))) + 
  labs(title = "Omitted Mills by State", y = "Omitted Mills", x = "State") +
  theme(text = element_text(family = "serif"))

ggplot(Omitted_Mills, aes(x = MILL_STATE, y = Volume_Omitted)) + 
  geom_col(width = 0.5) + 
  geom_text(size = 3.5, aes(label = round(Volume_Omitted)), nudge_y = 25000) + 
  scale_y_continuous(breaks = c(seq(from = 0, to = 1000000, by = 50000))) + 
  labs(title = "Omitted Mills' Volume by State", y = "Omitted Mill Volumes", x = "State") +
  theme(text = element_text(family = "serif"))

High Volume Plot

This bar chart depicts the number of volume-outliers (i.e. Log(10) BF > 8) on a state-by-state basis. States with no volume-outliers were not included in the plot.

#Create data frames for High Volume Mills

High_Vol_by_State <- BF_plot %>%
  group_by(MILL_STATE) %>%
  filter(TOT_BF_LOG >= 8) %>%
  summarize("High_Volume_Mills" = n())



#Create Bar Chart for High Volume Mills by State

ggplot(High_Vol_by_State, aes(x = MILL_STATE, y = High_Volume_Mills)) + 
  geom_col() + 
  geom_text(aes(label = High_Volume_Mills), nudge_y = 2.5) + 
  scale_y_continuous(breaks = c(seq(from = 0, to = 65, by = 5))) + 
  labs(title = "High Volume Mills (Log(10) BF > 8) by State", y = "High Volume Mills", x = "State") +
  theme(text = element_text(family = "serif"))

High_Vol_by_State_Breakdown <- BF_plot %>%
  group_by(MILL_STATE) %>%
  filter(TOT_BF_LOG >= 8) %>%
  select(MILL_NBR, MILL_STATECD:MTC, TOT_MCF, MILL_LAT:MILL_STATE)

Mill Type Plot

This bar chart depicts a breakdown of Mill “Types” on a state-by-state basis.

#Create a Bar Chart depicting Mill Type by State

MTC_summary <- BF_plot %>%
 mutate(MTC_Tidy = case_when(MTC == 10 ~ "Sawmill",
                         MTC == 21 ~ "Sawmill",
                         MTC == 22 ~ "Sawmill",
                         MTC == 23 ~ "Sawmill",
                         MTC == 24 ~ "Sawmill",
                         MTC == 25 ~ "Sawmill",
                         MTC == 26 ~ "Sawmill",
                         MTC == 20 ~ "Veneer Mill",
                         MTC == 30 ~ "Pulp/Composite Mill",
                         MTC == 40 ~ "Pulp/Composite Mill",
                         MTC == 50 ~ "Bioenergy Mill",
                         MTC == 60 ~ "House/Cabin Mill",
                         MTC == 70 ~ "Post/Pole/Piling Mill",
                         MTC == 80 ~ "Post/Pole/Piling Mill",
                         MTC == 90 ~ "Post/Pole/Piling Mill",
                         MTC == 100 ~ "Residential Firewood Mill",
                         MTC == 110 ~ "Other/Misc Mill",
                         MTC == 111 ~ "Other/Misc Mill",
                         MTC == 117 ~ "Concentration Yard",
                         is.na(MTC) ~ "Not Provided"))%>%
  
  group_by(MILL_STATE, MTC_Tidy) %>%
  arrange(as.factor(MTC_Tidy)) %>%
  summarize("Mill_Type_Count" = n()) %>%
  group_by(MILL_STATE)%>%
  mutate(State_Count = sum(Mill_Type_Count))




#Create bar chart of Mill Type Counts by State
  
MTC_Tidy_Factor_Levels <- c("Sawmill", "Veneer Mill", "Pulp/Composite Mill", "Bioenergy Mill", "House/Cabin Mill", "Post/Pole/Piling Mill", "Residential Firewood Mill", "Other/Misc Mill", "Concentration Yard", "Not Provided")


ggplot(MTC_summary, aes(x = MILL_STATE, y = Mill_Type_Count, fill = factor(MTC_Tidy, levels = MTC_Tidy_Factor_Levels))) +
  geom_col(position = 'stack', width = .8) + 
  labs(title = "Mill Type Counts by State", y = "Mill Type Count", x = "State") +
  scale_y_continuous(limits = c(0, 500), breaks = c(seq(from = 0, to = 500, by = 25))) +
  scale_fill_discrete(name = "Mill Types") +
  theme(text = element_text(family = "serif"))

Faceted Mill Type & Log(10) BF Plots

These plots are faceted (by State) versions of the Mill Type and Log(10) of BF plots above.

# Create Bar Charts and Box Plots using facet_wrap()

ggplot(MTC_summary, aes(0, Mill_Type_Count, fill = factor(MTC_Tidy, levels = MTC_Tidy_Factor_Levels))) +
  geom_col() + 
  labs(title = "Mill Type Counts", y = "Mill Type Count") +
  scale_fill_discrete(name = "Mill Types") +
  facet_wrap(facets = vars(MILL_STATE), scales = "free_y" ,ncol = 8) +
  theme(text = element_text(family = "serif"), axis.title.x = element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank())

ggplot(BF_plot, aes(0, y = TOT_BF_LOG, color = MILL_STATE)) + 
  geom_boxplot() +
  labs(title = "Log(10) of BF by State", y = "Log(10) of BF") +
  facet_wrap(facets = vars(MILL_STATE), ncol = 8, scales = "free") +
  theme(text = element_text(family = "serif"), axis.title.x = element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank())

Dplyr Functions

Because dplyr functions weren’t utilized within the above code chunks, the following chunk uses dplyr functions to measure a few meaningful values on a state-by-state basis.

#Calculate means, medians, and standard deviations of logged volumes by state
TPO_summary <- BF_plot %>%
  select(MILL_STATECD, MILL_STATE, MTC, TOT_MCF_LOG, TOT_BF_LOG) %>%
  group_by(MILL_STATE) %>%
  summarize(meanBF = mean(TOT_BF_LOG, na.rm = TRUE), 
            medianBF = median(TOT_BF_LOG, na.rm = TRUE), 
            meanMCF = mean(TOT_MCF_LOG, na.rm = TRUE), 
            medianMCF = median(TOT_MCF_LOG, na.rm = TRUE), 
            St_Dev = sd(TOT_BF_LOG, na.rm = TRUE), 
            NumberofMills = n()) 
TPO_summary

# A tibble: 24 x 7
   MILL_STATE meanBF medianBF meanMCF medianMCF St_Dev NumberofMills
   <chr>       <dbl>    <dbl>   <dbl>     <dbl>  <dbl>         <int>
 1 CT           5.59     5.44   1.79      1.64   0.820            22
 2 DE           5.10     4.80   1.30      1      1.01             15
 3 IA           4.94     4.55   1.14      0.749  1.11             70
 4 IL           4.78     4.46   0.980     0.660  1.06            136
 5 IN           5.76     6.16   1.96      2.36   1.18            147
 6 KS           4.58     4.30   0.782     0.500  0.991            48
 7 MA           4.59     4.00   0.790     0.199  0.906           120
 8 MD           5.51     5.67   1.70      1.87   1.38             43
 9 ME           5.60     4.81   1.80      1.01   1.28            162
10 MI           5.71     5.71   1.91      1.91   1.08            354
# ... with 14 more rows

Conclusions & Next Steps

What is missing (if anything) in your analysis process so far?
- The major piece of information missing from this analysis is how to handle high-volume-outliers. Specifically, because this analysis is attempting to measure the impact of setting a volume-based cut-off for survey inclusion, corrections to these (likely) incorrect volumes are necessary prior to gauging impact on the overall volume included in the sample. In other words, while we can adequately measure the number and % of mills that would be removed if using a cut-off volume, we cannot do accurately measure the impact on overall sample volume.
What conclusions can you make about your research questions at this point?
- There are two major conclusions that can be made at this point. First, a 20,000 BF cut-off would have little impact on overall sample volume, regardless of the impact of the volume-outliers. Namely, if all mills reporting <20,000 BF were removed, only ~3.6 million BF of volume would be lost (across all states), a volume that is dwarfed by many individual mills. Second, the high-volume-outliers seem to be aggregated in just a few states, namely Minnesota. With this information in mind, our FS partners can (hopefully) make volume corrections to the database from which this information is pulled.
What do you think a naive reader would need to fully understand your graphs?
- A naive reader would have immense difficulty attempting to understand these graphs, as they are industry-specific, and require some sort of background in forestry and/or timber/lumber procurement. To grasp the information being presented, a reader would first need to become familiar with volume units related to the lumber industry (i.e. BF/MCF). If familiar with the units, it would also be helpful to have a basic understanding of the major players in the timber/lumber industry on a state-by-state basis. Regardless, these products are not intended for a broad audience…Rather they will be utilized by our partners at the USDA Forest Service, all of whom have ample experience with the units at play.
Is there anything you want to answer with your dataset, but can’t?
- Yes, as mentioned above, I would love to accurately measure the impact of creating a volume cut-off on overall sample volume, though this will have to wait until the high-volume-outliers are amended to reflect reasonable volume measures.

Comment on this article Share:

HW5_IanKennedy

Initial Results

Histogram Outlines and Creation of BF_plot

Histogram Plots & Associated Code

State Level Plots

Log(10) BF Plots

Omitted Mill/Volume Plots

High Volume Plot

Mill Type Plot

Faceted Mill Type & Log(10) BF Plots

Dplyr Functions

Conclusions & Next Steps

Reuse

Citation

HW5_IanKennedy

Project Outline & Related Questions

Initial Results

Histogram Outlines and Creation of BF_plot

Histogram Plots & Associated Code

State Level Plots

Log(10) BF Plots

Omitted Mill/Volume Plots

High Volume Plot

Mill Type Plot

Faceted Mill Type & Log(10) BF Plots

Dplyr Functions

Conclusions & Next Steps

Reuse

Citation