HW5_IanKennedy

HW5 for DACSS 601.

Ian Kennedy
2022-02-26

Project Outline & Related Questions

For my final project I’ll be using a data set from the Research Lab (FFRC) I’ve been working with for the past few years. In short, the FFRC, in coordination with the USDA Forest Service, implements an annual survey (TPO) which attempts to track timber procurement and lumber production at mills throughout a host of states. I’ll be running analyses to see how shifting the baseline for minimum volume procured would impact the number of mills and total volume of procured timber included in the sample frame. Generally speaking, this sort of analysis will attempt to answer the following research questions:

  1. Does it make sense to exclude small-production/hobby mills from future survey attempts?
  2. If yes, what is the minimum volume that mills should have procured to be included in survey efforts?
  3. What impact would shifting the baseline, regardless of minimum volume utilized, have on the total volume and number of mills that are surveyed?

There will likely be further questions that will need to be analyzed, however these questions outline the main priorities of data analysis for this project.

Initial Results

HW3_Data <- read_excel(path = "C:/Users/kenne/Documents/R_Workspace/TPO_Sample.xlsx")


#Calculate the sum of all Mill Volumes
HW3_Data %>%
  filter(is.na(MILL_STATUS_CD)| MILL_STATUS_CD == 2| MILL_STATUS_CD == 1) %>%
  select(TOT_BF)%>%
  sum(na.rm = FALSE) -> TOT_Volume


#Calculate Total # of mills
HW3_Data %>%
    filter(is.na(MILL_STATUS_CD)| MILL_STATUS_CD == 2| MILL_STATUS_CD == 1)%>%
  nrow() -> TOT_Mills
TOT_Mills <- as.numeric(unlist(TOT_Mills[1]))

#Calculate the sum omitted if omitting mills below 20,000 BF (not including blanks)
HW3_Data %>%
  mutate(Volume_Code = case_when(TOT_BF > 20000 ~ 2, 
                                TOT_BF <= 20000 & TOT_BF > 0 ~ 1,
                                 TOT_BF == 0 ~ 0))%>%
  filter(Volume_Code == 1, is.na(MILL_STATUS_CD)| MILL_STATUS_CD == 2| MILL_STATUS_CD == 1) %>%
  select(TOT_BF) %>%
  sum(na.rm = TRUE) -> TOT_Volume_Omit

  
#Calculate number of mills if omitting mills below 20,000 BF (not including blanks)
HW3_Data %>%
  mutate(Volume_Code = case_when(TOT_BF > 20000 ~ 2, 
                                 TOT_BF <= 20000 & TOT_BF > 0 ~ 1,
                                 TOT_BF == 0 ~ 0))%>%
  filter(Volume_Code == 1, is.na(MILL_STATUS_CD)|MILL_STATUS_CD == 2|MILL_STATUS_CD == 1) %>%
  select(TOT_BF) %>%
  nrow -> TOT_Mills_Omit
TOT_Mills_Omit <- as.numeric(unlist(TOT_Mills_Omit[1]))

#Filter mills with BF > 100,000,000
HW3_Data %>%
  mutate(High_BF = case_when(TOT_BF >= 100000000 ~ 2,
                             TOT_BF < 100000000 ~ 1)) %>%
  filter(High_BF == 2, is.na(MILL_STATUS_CD)|MILL_STATUS_CD == 2|MILL_STATUS_CD == 1)%>%
  arrange(MILL_STATECD)%>%
  select(MILL_STATECD, MILL_NAME, TOT_BF, MILL_STATUS_CD) %>%
  nrow -> TOT_Mills_High_Vol
TOT_Mills_High_Vol <- as.numeric(unlist(TOT_Mills_High_Vol[1]))

Mill_Count_If_Omit <- TOT_Mills - TOT_Mills_Omit
Percent_Omit <- 100*(TOT_Mills_Omit/TOT_Mills)

Histogram Outlines and Creation of BF_plot

#Convert State Codes to Strings and Create BF_plot

BF_plot <- data.frame(HW3_Data)%>%
mutate(MILL_STATE = case_when(MILL_STATECD == 9 ~ "CT",
                                MILL_STATECD == 10 ~ "DE",
                                MILL_STATECD == 17 ~ "IL", 
                                MILL_STATECD == 18 ~ "IN",
                                MILL_STATECD == 19 ~ "IA",
                                MILL_STATECD == 20 ~ "KS",
                                MILL_STATECD == 23 ~ "ME",
                                MILL_STATECD == 24 ~ "MD",
                                MILL_STATECD == 25 ~ "MA",
                                MILL_STATECD == 26 ~ "MI",
                                MILL_STATECD == 27 ~ "MN",
                                MILL_STATECD == 29 ~ "MO",
                                MILL_STATECD == 31 ~ "NE",
                                MILL_STATECD == 33 ~ "NH",
                                MILL_STATECD == 34 ~ "NJ",
                                MILL_STATECD == 36 ~ "NY",
                                MILL_STATECD == 38 ~ "ND",
                                MILL_STATECD == 39 ~ "OH",
                                MILL_STATECD == 42 ~ "PA",
                                MILL_STATECD == 44 ~ "RI",
                                MILL_STATECD == 46 ~ "SD",
                                MILL_STATECD == 50 ~ "VT",
                                MILL_STATECD == 54 ~ "WV",
                                MILL_STATECD == 55 ~ "WI")) %>%
  filter(is.na(MILL_STATUS_CD)|MILL_STATUS_CD == 2|MILL_STATUS_CD == 1)

Histogram Plots & Associated Code

These histograms display the logged values of reported volumes for mills across the Northern United States. Both are univariate, utilizing the columns “TOT_MCF_LOG” and “TOT_BF_LOG” respectively. These plots will help to answer questions regarding the presence of reported mill volumes that are far outside of a normal range. A “normal” log MCF value would be < 4.2, while a “normal” log BF value would be < 8. In other words, any mills with reported log volumes above these thresholds will need to be reviewed and amended by USDA Forest Service Staff.

In taking a quick glance at the Logged BF, it is clear that there are at least 83 mills with reported volumes far outside of a “normal” range. Thus, these data sets will need review from partners within the USDA Forest Service.

A naive viewer would have trouble interpreting these plots given the units at hand. Without specifying what MCF (Thousand Cubic Feet) and BF (Board Feet) refer to, a naive viewer would likely have little clue as to what the histograms are displaying. However, because this information will be utilized by those within the Forest Service, defining these units is not necessary.

#Create a histogram of Mill Volumes (Logged MTC)

MCF_DF <- data.frame(BF_plot['TOT_MCF_LOG'])
MCF_List <- unclass(MCF_DF)
MCF_Vect <- as.vector(unlist(MCF_List[[1]]))

Hist_Breaks_MCF <- c(seq(from = -3, to = 8, by = .5))

hist.default(MCF_Vect, xlim = c(-3, 8), ylim = c(0, 600), breaks = Hist_Breaks_MCF, labels = TRUE, ylab = 'Count', xlab = 'Log(10) of MCF', main = 'Log(10) of MCF Histogram')
#Create a histogram of Mill Volumes (Logged BF)

BF_DF <- data.frame(BF_plot['TOT_BF_LOG'])
BF_List <- unclass(BF_DF)
BF_Vect <- as.vector(unlist(BF_List[[1]]))


Hist_Breaks_BF <- c(seq(from = 0, to = 12, by = 1))

hist.default(BF_Vect, xlim = c(0, 12), ylim = c(0, 1200), breaks = Hist_Breaks_BF, labels = TRUE, ylab = 'Count', xlab = 'Log(10) of BF', main = 'Log(10) of BF Histogram') 

State Level Plots

The scatter plots/box plots below utilize a host of variables, most of which required some sort of tidying prior to visualization.

In terms of how a naive viewer would interpret these, the same issues as with the above histograms arise. Namely, because some of the plots are visualizing logged BF (Board Feet), someone lacking insight into the lumber industry would likely be left clueless if attempting to draw meaningful conclusions.

Log(10) BF Plots

These plots visualize reported mill volumes’ on a state-by-state basis. By breaking these volumes down by state, our FS partners are able to hone-in on states with the highest number of volume-outliers (i.e. Log(10) BF > 8). From the plots, it is clear that there are a few states that will need to be reviewed, these include ME, MI, & MN, among others.

#Create State-by-State boxplots and scatter plots for Logged BF Volume

ggplot(BF_plot, aes(x = MILL_STATE, y = TOT_BF_LOG, color = MILL_STATE)) + 
  geom_point() +
  scale_y_continuous(breaks = Hist_Breaks_BF, limits = c(0,12)) + 
  labs(title = "Log(10) of BF by State", y = "Log(10) of BF", x = "State") +
  theme(text = element_text(family = "serif")) 
ggplot(BF_plot, aes(x = MILL_STATE, y = TOT_BF_LOG, color = MILL_STATE)) + 
  geom_boxplot() +
  scale_y_continuous(breaks = Hist_Breaks_BF, limits = c(0,12)) + 
  labs(title = "Log(10) of BF by State", y = "Log(10) of BF", x = "State") +
  theme(text = element_text(family = "serif"))

Omitted Mill/Volume Plots

These plots visualize the number of mills and aggregate volume (on a state-by-state basis) that would be removed if a minimum volume of 20,000 BF was set as a baseline for survey inclusion. States without omissions were not included in the plots.

#Create Bar Charts for Omitted Mills/Volumes by State

Omitted_Mills <- BF_plot %>%
  group_by(MILL_STATE) %>%
  filter(TOT_BF_LOG < 4.30103) %>%
  summarize("Omitted_Mill_Count" = n(), "Volume_Omitted" = sum(TOT_BF))


ggplot(Omitted_Mills, aes(x = MILL_STATE, y = Omitted_Mill_Count)) + 
  geom_col(width = 0.5) + 
  geom_text(aes(label = Omitted_Mill_Count), nudge_y = 2.5) + 
  scale_y_continuous(breaks = c(seq(from = 0, to = 60, by = 5))) + 
  labs(title = "Omitted Mills by State", y = "Omitted Mills", x = "State") +
  theme(text = element_text(family = "serif"))
ggplot(Omitted_Mills, aes(x = MILL_STATE, y = Volume_Omitted)) + 
  geom_col(width = 0.5) + 
  geom_text(size = 3.5, aes(label = round(Volume_Omitted)), nudge_y = 25000) + 
  scale_y_continuous(breaks = c(seq(from = 0, to = 1000000, by = 50000))) + 
  labs(title = "Omitted Mills' Volume by State", y = "Omitted Mill Volumes", x = "State") +
  theme(text = element_text(family = "serif"))

High Volume Plot

This bar chart depicts the number of volume-outliers (i.e. Log(10) BF > 8) on a state-by-state basis. States with no volume-outliers were not included in the plot.

#Create data frames for High Volume Mills

High_Vol_by_State <- BF_plot %>%
  group_by(MILL_STATE) %>%
  filter(TOT_BF_LOG >= 8) %>%
  summarize("High_Volume_Mills" = n())



#Create Bar Chart for High Volume Mills by State

ggplot(High_Vol_by_State, aes(x = MILL_STATE, y = High_Volume_Mills)) + 
  geom_col() + 
  geom_text(aes(label = High_Volume_Mills), nudge_y = 2.5) + 
  scale_y_continuous(breaks = c(seq(from = 0, to = 65, by = 5))) + 
  labs(title = "High Volume Mills (Log(10) BF > 8) by State", y = "High Volume Mills", x = "State") +
  theme(text = element_text(family = "serif"))
High_Vol_by_State_Breakdown <- BF_plot %>%
  group_by(MILL_STATE) %>%
  filter(TOT_BF_LOG >= 8) %>%
  select(MILL_NBR, MILL_STATECD:MTC, TOT_MCF, MILL_LAT:MILL_STATE)

Mill Type Plot

This bar chart depicts a breakdown of Mill “Types” on a state-by-state basis.

#Create a Bar Chart depicting Mill Type by State

MTC_summary <- BF_plot %>%
 mutate(MTC_Tidy = case_when(MTC == 10 ~ "Sawmill",
                         MTC == 21 ~ "Sawmill",
                         MTC == 22 ~ "Sawmill",
                         MTC == 23 ~ "Sawmill",
                         MTC == 24 ~ "Sawmill",
                         MTC == 25 ~ "Sawmill",
                         MTC == 26 ~ "Sawmill",
                         MTC == 20 ~ "Veneer Mill",
                         MTC == 30 ~ "Pulp/Composite Mill",
                         MTC == 40 ~ "Pulp/Composite Mill",
                         MTC == 50 ~ "Bioenergy Mill",
                         MTC == 60 ~ "House/Cabin Mill",
                         MTC == 70 ~ "Post/Pole/Piling Mill",
                         MTC == 80 ~ "Post/Pole/Piling Mill",
                         MTC == 90 ~ "Post/Pole/Piling Mill",
                         MTC == 100 ~ "Residential Firewood Mill",
                         MTC == 110 ~ "Other/Misc Mill",
                         MTC == 111 ~ "Other/Misc Mill",
                         MTC == 117 ~ "Concentration Yard",
                         is.na(MTC) ~ "Not Provided"))%>%
  
  group_by(MILL_STATE, MTC_Tidy) %>%
  arrange(as.factor(MTC_Tidy)) %>%
  summarize("Mill_Type_Count" = n()) %>%
  group_by(MILL_STATE)%>%
  mutate(State_Count = sum(Mill_Type_Count))




#Create bar chart of Mill Type Counts by State
  
MTC_Tidy_Factor_Levels <- c("Sawmill", "Veneer Mill", "Pulp/Composite Mill", "Bioenergy Mill", "House/Cabin Mill", "Post/Pole/Piling Mill", "Residential Firewood Mill", "Other/Misc Mill", "Concentration Yard", "Not Provided")


ggplot(MTC_summary, aes(x = MILL_STATE, y = Mill_Type_Count, fill = factor(MTC_Tidy, levels = MTC_Tidy_Factor_Levels))) +
  geom_col(position = 'stack', width = .8) + 
  labs(title = "Mill Type Counts by State", y = "Mill Type Count", x = "State") +
  scale_y_continuous(limits = c(0, 500), breaks = c(seq(from = 0, to = 500, by = 25))) +
  scale_fill_discrete(name = "Mill Types") +
  theme(text = element_text(family = "serif"))

Faceted Mill Type & Log(10) BF Plots

These plots are faceted (by State) versions of the Mill Type and Log(10) of BF plots above.

# Create Bar Charts and Box Plots using facet_wrap()

ggplot(MTC_summary, aes(0, Mill_Type_Count, fill = factor(MTC_Tidy, levels = MTC_Tidy_Factor_Levels))) +
  geom_col() + 
  labs(title = "Mill Type Counts", y = "Mill Type Count") +
  scale_fill_discrete(name = "Mill Types") +
  facet_wrap(facets = vars(MILL_STATE), scales = "free_y" ,ncol = 8) +
  theme(text = element_text(family = "serif"), axis.title.x = element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank())
ggplot(BF_plot, aes(0, y = TOT_BF_LOG, color = MILL_STATE)) + 
  geom_boxplot() +
  labs(title = "Log(10) of BF by State", y = "Log(10) of BF") +
  facet_wrap(facets = vars(MILL_STATE), ncol = 8, scales = "free") +
  theme(text = element_text(family = "serif"), axis.title.x = element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank())

Dplyr Functions

Because dplyr functions weren’t utilized within the above code chunks, the following chunk uses dplyr functions to measure a few meaningful values on a state-by-state basis.

#Calculate means, medians, and standard deviations of logged volumes by state
TPO_summary <- BF_plot %>%
  select(MILL_STATECD, MILL_STATE, MTC, TOT_MCF_LOG, TOT_BF_LOG) %>%
  group_by(MILL_STATE) %>%
  summarize(meanBF = mean(TOT_BF_LOG, na.rm = TRUE), 
            medianBF = median(TOT_BF_LOG, na.rm = TRUE), 
            meanMCF = mean(TOT_MCF_LOG, na.rm = TRUE), 
            medianMCF = median(TOT_MCF_LOG, na.rm = TRUE), 
            St_Dev = sd(TOT_BF_LOG, na.rm = TRUE), 
            NumberofMills = n()) 
TPO_summary
# A tibble: 24 x 7
   MILL_STATE meanBF medianBF meanMCF medianMCF St_Dev NumberofMills
   <chr>       <dbl>    <dbl>   <dbl>     <dbl>  <dbl>         <int>
 1 CT           5.59     5.44   1.79      1.64   0.820            22
 2 DE           5.10     4.80   1.30      1      1.01             15
 3 IA           4.94     4.55   1.14      0.749  1.11             70
 4 IL           4.78     4.46   0.980     0.660  1.06            136
 5 IN           5.76     6.16   1.96      2.36   1.18            147
 6 KS           4.58     4.30   0.782     0.500  0.991            48
 7 MA           4.59     4.00   0.790     0.199  0.906           120
 8 MD           5.51     5.67   1.70      1.87   1.38             43
 9 ME           5.60     4.81   1.80      1.01   1.28            162
10 MI           5.71     5.71   1.91      1.91   1.08            354
# ... with 14 more rows

Conclusions & Next Steps

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Kennedy (2022, Feb. 27). Data Analytics and Computational Social Science: HW5_IanKennedy. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httprpubscomikennedy040hw5/

BibTeX citation

@misc{kennedy2022hw5_iankennedy,
  author = {Kennedy, Ian},
  title = {Data Analytics and Computational Social Science: HW5_IanKennedy},
  url = {https://github.com/DACSS/dacss_course_website/posts/httprpubscomikennedy040hw5/},
  year = {2022}
}