Data Analytics and Computational Social Science: HW4_IanKennedy

Ian Kennedy

For my final project I’ll be using a data set from the Research Lab (FFRC) I’ve been working with for the past few years. In short, the FFRC, in coordination with the USDA Forest Service, implements an annual survey (TPO) which attempts to track timber procurement and lumber production at mills throughout the USA. I’ll be running analyses to see how shifting the baseline for minimum volume procured (TOT_MCF) would impact the number of mills and total volume of procured timber included in the sample frame. I’ve also added a new column (Volume_Code) which will be updated based on the volume that is landed on. Generally speaking, this sort of analysis will attempt to answer the following research questions:

Does it make sense to exclude small-production/hobby mills from future survey attempts?
If yes, what is the minimum volume (TOT_MCF) that mills should have procured to be included in survey efforts?
What impact would shifting the baseline (regardless of minimum volume utilized) have on the total volume and number of mills that are surveyed?

There will likely be further questions that will need to be analyzed, however these questions outline the main priorities of data analysis for this project.

TPO_Data <- read_excel(path = "C:/Users/kenne/Documents/R_Workspace/TPO_Sample.xlsx")


#Calculate the sum of all Mill Volumes
TPO_Data %>%
  select(TOT_BF)%>%
  sum(na.rm = TRUE) -> TOT_Volume


#Calculate Total # of mills
TPO_Data %>%
  nrow() -> TOT_Mills
TOT_Mills <- as.numeric(unlist(TOT_Mills[1]))

#Calculate the sum omitted if omitting mills below 20,000 BF (not including blanks)
TPO_Data %>%
  mutate(Volume_Code = case_when(TOT_BF > 20000 ~ 2, 
                                TOT_BF <= 20000 & TOT_BF > 0 ~ 1,
                                 TOT_BF == 0 ~ 0))%>%
  filter(Volume_Code == 1) %>%
  select(TOT_BF) %>%
  sum(na.rm = TRUE) -> TOT_Volume_Omit

  
#Calculate number of mills if omitting mills below 20,000 BF (not including blanks)
TPO_Data %>%
  mutate(Volume_Code = case_when(TOT_BF > 20000 ~ 2, 
                                 TOT_BF <= 20000 & TOT_BF > 0 ~ 1,
                                 TOT_BF == 0 ~ 0))%>%
  filter(Volume_Code == 1) %>%
  select(TOT_BF) %>%
  nrow -> TOT_Mills_Omit
TOT_Mills_Omit <- as.numeric(unlist(TOT_Mills_Omit[1]))

#Filter mills with BF > 100,000,000
TPO_Data %>%
  mutate(High_BF = case_when(TOT_BF >= 100000000 ~ 2,
                             TOT_BF < 100000000 ~ 1)) %>%
  filter(High_BF == 2)%>%
  arrange(MILL_STATECD)%>%
  select(MILL_STATECD, MILL_NAME, TOT_BF) %>%
  nrow -> TOT_Mills_High_Vol
TOT_Mills_High_Vol <- as.numeric(unlist(TOT_Mills_High_Vol[1]))

Mill_Count_If_Omit <- TOT_Mills - TOT_Mills_Omit


# Print findings from initial analyses

print(paste("Total # of mills in sample =", TOT_Mills))

[1] "Total # of mills in sample = 3321"

print(paste("Total volume in sample =", TOT_Volume))

[1] "Total volume in sample = 402526008972.986"

print(paste("Total # of mills with volume > 100,000,000 BF =", TOT_Mills_High_Vol))

[1] "Total # of mills with volume > 100,000,000 BF = 83"

print(paste("Total # of mills omitted if using a 20,000 BF cut-off =", TOT_Mills_Omit))

[1] "Total # of mills omitted if using a 20,000 BF cut-off = 453"

print(paste("Total volume omitted if using a 20,000 BF cut-off =", TOT_Volume_Omit))

[1] "Total volume omitted if using a 20,000 BF cut-off = 3794201.62380122"

print(paste("Total # of mills in sample if using a 20,000 BF cut-off =", Mill_Count_If_Omit))

[1] "Total # of mills in sample if using a 20,000 BF cut-off = 2868"

These histograms display the logged values of reported volumes for mills across the Northern United States. Both are univariate, utilizing the columns “TOT_MCF_LOG” and “TOT_BF_LOG” respectively. These plots will help to answer questions regarding the presence of reported mill volumes that are far outside of a normal range. A “normal” log MCF value would be < 4.2, while a “normal” log BF value would be < 8. In other words, any mills with reported log volumes above these thresholds will need to be reviewed and amended by USDA Forest Service Staff.

In taking a quick glance at the Logged BF, it is clear that there are at least 83 mills with reported volumes far outside of a “normal” range. Thus, these data sets will need review from partners within the USDA Forest Service.

There are a host of limitations that these plots present. Namely, the lack of volumes for a handful of mills make the analysis incomplete. Additionally, because the corrections to reported volumes are done on a state-by-state basis, a plot that includes the information regarding the state in which each mill is located would be preferred to the raw histogram below.

On top of these limitations, a naive viewer would have trouble interpreting these plots given the units at hand. Without specifying what MCF (thousand cubic feet) and BF (board feet) refer to, a naive viewer would likely have little clue as to what the histograms are displaying. However, because this information will be utilized by those within the Forest Service, defining these units is not necessary.

#Create a histogram of Mill Volumes (Logged MTC)

MCF_DF <- data.frame(TPO_Data['TOT_MCF_LOG'])
MCF_List <- unclass(MCF_DF)
MCF_Vect <- as.vector(unlist(MCF_List[[1]]))

Hist_Breaks_MCF <- c(seq(from = -3, to = 8, by = .5))

hist.default(MCF_Vect, xlim = c(-3, 8), ylim = c(0, 600), breaks = Hist_Breaks_MCF, labels = TRUE, ylab = 'Count', xlab = 'Log of MCF', main = 'Log of MCF Histogram')

#Create a histogram of Mill Volumes (Logged BF)

BF_DF <- data.frame(TPO_Data['TOT_BF_LOG'])
BF_List <- unclass(BF_DF)
BF_Vect <- as.vector(unlist(BF_List[[1]]))


Hist_Breaks_BF <- c(seq(from = 0, to = 12, by = 1))

hist.default(BF_Vect, xlim = c(0, 12), ylim = c(0, 1200), breaks = Hist_Breaks_BF, labels = TRUE, ylab = 'Count', xlab = 'Log of BF', main = 'Log of BF Histogram')

The scatter plot below utilizes two variables, ‘TOT_BF_LOG’ (this was also used for one of the histograms above) and ‘MILL_STATE’, which was created using the mutate() function, in reference to an existing variable, ‘MILL_STATECD’.

This scatter plot attempts to highlight states that are housing mills with egregiously high volume measures. By breaking these volumes down by state, our FS partners are able to hone-in on states with the highest number of volume-outliers. From the scatter plot, it is clear that there are a few states that will need to be reviewed, these include ME, MI, & MN, among others.

In a similar sense to the histograms above, there are a host of mills with missing volumes, thus making this analysis incomplete. These volumes will need to be provided by FS staff before finalizing the analysis. Additionally, while this visualization provides a state-by-state breakdown, it could be improved by providing the count of mills/observations (within each state), that would be considered to have egregiously high volumes.

In terms of how a naive viewer would interpret this, the same issues as with the above histograms arise. Namely, because this plot is visualizing logged BF (Board Feet), someone lacking insight into the lumber industry would likely be left clueless if attempting to draw meaningful conclusions.

To improve this scatter plot, including the aforementioned count of mills reporting volumes outside of the “normal” range (by state) would make the analysis more well-rounded. On top of this, if mills lacking a reported volume are amended, this analysis could be re-run to include all mills present in our survey sample.

#Create State-by-State histograms and scatter plots for Logged BF Volume

BF_plot <- data.frame(TPO_Data) %>%
  mutate(MILL_STATE = case_when(MILL_STATECD == 9 ~ "CT",
                                MILL_STATECD == 10 ~ "DE",
                                MILL_STATECD == 17 ~ "IL", 
                                MILL_STATECD == 18 ~ "IN",
                                MILL_STATECD == 19 ~ "IA",
                                MILL_STATECD == 20 ~ "KS",
                                MILL_STATECD == 23 ~ "ME",
                                MILL_STATECD == 24 ~ "MD",
                                MILL_STATECD == 25 ~ "MA",
                                MILL_STATECD == 26 ~ "MI",
                                MILL_STATECD == 27 ~ "MN",
                                MILL_STATECD == 29 ~ "MO",
                                MILL_STATECD == 31 ~ "NE",
                                MILL_STATECD == 33 ~ "NH",
                                MILL_STATECD == 34 ~ "NJ",
                                MILL_STATECD == 36 ~ "NY",
                                MILL_STATECD == 38 ~ "ND",
                                MILL_STATECD == 39 ~ "OH",
                                MILL_STATECD == 42 ~ "PA",
                                MILL_STATECD == 44 ~ "RI",
                                MILL_STATECD == 46 ~ "SD",
                                MILL_STATECD == 50 ~ "VT",
                                MILL_STATECD == 54 ~ "WV",
                                MILL_STATECD == 55 ~ "WI"))

ggplot(BF_plot, aes(x = MILL_STATE, y = TOT_BF_LOG, color = MILL_STATE)) + geom_point() +
scale_y_continuous(name = "Log of BF", breaks = Hist_Breaks_BF, limits = c(0,12)) + 
scale_x_discrete(name = "State") +
ggtitle("Reported Mill Volumes (Log of BF) by State")

Because dplyr functions weren’t utilized within the above code chunks, the following chunk uses dplyr functions to measure a few meaningful values on a state-by-state basis.

#Calculate means, medians, and standard deviations of logged volumes by state
TPO_summary <- BF_plot %>%
  select(MILL_STATECD, MILL_STATE, MTC, TOT_MCF_LOG, TOT_BF_LOG) %>%
  group_by(MILL_STATE) %>%
  summarize(meanBF = mean(TOT_BF_LOG, na.rm = TRUE), 
            medianBF = median(TOT_BF_LOG, na.rm = TRUE), 
            meanMCF = mean(TOT_MCF_LOG, na.rm = TRUE), 
            medianMCF = median(TOT_MCF_LOG, na.rm = TRUE), 
            St_Dev = sd(TOT_BF_LOG, na.rm = TRUE), 
            NumberofMills = n())

TPO_summary

# A tibble: 24 x 7
   MILL_STATE meanBF medianBF meanMCF medianMCF St_Dev NumberofMills
   <chr>       <dbl>    <dbl>   <dbl>     <dbl>  <dbl>         <int>
 1 CT           5.59     5.44   1.79      1.64   0.820            22
 2 DE           4.87     4.51   1.06      0.711  1.00             19
 3 IA           4.94     4.55   1.14      0.749  1.11             78
 4 IL           4.78     4.46   0.977     0.661  1.06            147
 5 IN           5.67     6.13   1.86      2.33   1.25            189
 6 KS           4.58     4.30   0.782     0.500  0.991            48
 7 MA           4.59     4.00   0.790     0.199  0.906           120
 8 MD           5.48     5.32   1.68      1.52   1.36             48
 9 ME           5.60     4.81   1.80      1.01   1.28            162
10 MI           5.70     5.70   1.90      1.90   1.08            407
# ... with 14 more rows

Comment on this article Share:

HW4_IanKennedy

Reuse

Citation