TPO Mill Size Analysis

Final Project for DACSS 601

Ian Kennedy
2022-05-05

1 Abstract

The Family Forest Research Center (FFRC), in coordination with the USDA Forest Service (FS), implements the Timber Products Output (TPO) Survey on an annual basis. The survey tracks timber procurement at mills across a host of states in the northern US. Tracking timber procurement is vital, as it provides a lens through which to highlight geographic trends in forest management and conservation. With this procurement information, state agencies may adapt their policies to mitigate deforestation and preserve species biodiversity.

Because implementation resources are limited, the FFRC is interested in introducing a volume-based threshold for survey inclusion. For such a threshold to be established, a definition of commercial status must first be settled on. Thus, this analysis outlines the impacts of introducing a volume-based threshold on two variables tied to the survey’s sample frame: aggregate volume and total number of mills. In short, the threshold that is utilized will differentiate commercial mills from non-commercial, or ‘hobby’, mills.

To gauge potential applicability, thresholds from 1,000 to 50,000 Board Feet (BF) were analyzed in 5,000 BF increments (1,000, 5,000, 10,000, etc.). While there is no existing conventional threshold for commercial status, these thresholds were established through interviews with FS staff, each of whom has participated in TPO implementation for over a decade. Initial findings suggest that, at the uppermost threshold (50,000 BF), the reduction in aggregate procurement volume is minuscule (~12,000,000 BF), while the reduction in total number of mills is substantial (~22%). With a reduction of this magnitude, the FFRC could hone surveying efforts towards commercial entities, enhancing the likelihood of garnering responses from the mills that are most consequential in terms of impacts on forest conservation.

2 Introduction

For this project I will be using the TPO sample frame, which has been utilized for the past four survey iterations (2018-2021). As mentioned above, findings from this project will assist in answering questions surrounding the introduction of a volume-based threshold relating to survey inclusion. Specifically, through this analysis I will attempt to answer the following questions:

  1. In an effort to hone surveying efforts towards commercial mills, should low-production/hobby mills be excluded from future survey attempts?
  2. If yes, what is the minimum volume (i.e. omission threshold) that mills will need to have procured to be considered a ‘commercial’ mill?
  3. What impact does introducing a potential omission threshold have on the total volume and number of mills included in the sample frame?

3 Background

Before delving into the project, a bit of background on the TPO Survey is necessary. While the TPO Survey is conducted annually, most mills within the sample do not receive a survey every year. Rather, each year ‘filtering’ of the sample frame is conducted prior to survey dissemination. This ‘filtering’ is done on a state-by-state basis and uses a volume-based weighting system to pull a certain number of mills for the given year. In other words, high-volume mills are more likely to be surveyed than low-volume mills within the same state. Thus, it is possible that a mill could go several years without being surveyed or, conversely, be surveyed every year. On top of the state-level ‘filtering’, differences in state-level mill counts also play a role. Mills located in states with low mill counts (e.g., Rhode Island, New Jersey, and Delaware) are more likely to be surveyed than similar-volume mills in states with far higher mill counts (e.g., Michigan and Pennsylvania). This ensures that every state within the sample has at least a few mills included in survey attempts each year.
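
The exact weighting procedure is not part of this analysis, but as a rough illustration of the idea (using a made-up toy data frame, not the actual sample frame), mills might be drawn with selection probability proportional to reported volume:

#Hypothetical sketch of volume-weighted selection (not the exact FS procedure).
#Toy data: five made-up mills; higher-volume mills are more likely to be drawn.
mills <- data.frame(MILL_ID = 1:5,
                    TOT_BF  = c(2000, 15000, 80000, 500000, 25000000))

set.seed(601)
mills[sample(nrow(mills), size = 2, prob = mills$TOT_BF), ]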

The ‘filtering’ discussed above does not impact the process through which an omission threshold analysis is conducted. However, it does present one caveat that must be considered before a threshold is put to use, which is discussed in greater detail in the ‘Data’ section below.

4 Data

The TPO sample frame, created in 2018, has been used for the past four survey iterations (2018-2021). Each row (observation) in the file represents a single mill, with columns highlighting a variety of information including name, address, basic contact information, reported (or estimated) volume, mill type, etc. The variables (columns) of highest interest are the volume and mill type columns. Volume units for timber products are diverse, as mills use differing units based on geographical location and product type. To adequately assign thresholds, all volumes were converted to MCF (thousand cubic feet). Throughout this analysis, units of MCF and BF (Board Feet) will be utilized; a conversion factor of 6,033 BF/MCF was used to convert from MCF to BF. Readers unfamiliar with timber-based volume units should reference this link.
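
As a quick illustration of the conversion factor (using hypothetical volumes, not values from the sample frame):

#Convert hypothetical MCF volumes to BF with the 6,033 BF/MCF factor noted above
BF_per_MCF <- 6033
mcf_volumes <- c(0.5, 10, 250)           #hypothetical mill volumes in MCF
bf_volumes  <- mcf_volumes * BF_per_MCF  #3,016.5   60,330   1,508,250 BF
bf_volumes / BF_per_MCF                  #divide to convert back to MCF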

Now on to the aforementioned caveat. Because ‘filtering’ allows for the possibility that a mill is not surveyed for multiple years, many mills’ reported volumes are simply estimates based on nearby mills of similar size or, worse, are rollover data from the last time they did respond. On top of this, with a response rate of roughly 50%, even a mill chosen for survey inclusion every year may not respond. Thus, prior to utilization of any omission threshold, further analysis of potentially omitted mills is necessary. This issue is outside the scope of this analysis, as it will need to be addressed manually by Forest Service staff, but it is important to highlight.

4.1 Reading in Data & Setting up ‘State_Code_Conversion’

The following lines of code are used to read in the data and to set up the ‘State_Code_Conversion’ function that will be utilized throughout the analysis. Each mill carries a ‘MILL_STATECD’ variable, the numeric code for the state in which the mill is located; the ‘State_Code_Conversion’ function uses it to create a new column, ‘MILL_STATE’, containing the two-letter abbreviation for the respective state.

Show code
#Packages used throughout the analysis
library(readxl)      #read_excel()
library(readr)       #read_csv()
library(dplyr)
library(tidyr)       #fill()
library(ggplot2)
library(ggthemes)    #theme_fivethirtyeight()
library(scales)
library(forcats)     #fct_infreq(), fct_rev()
library(knitr)       #kable()
library(kableExtra)  #kable_styling(), column_spec(), remove_column(), row_spec()
library(reactable)
library(htmltools)   #h2(), h3()

TPO_Data <- read_excel(path = "C:/Users/kenne/Documents/R_Workspace/TPO_Sample.xlsx")

State_Code_Conversion <- function(Data){
   Data %>%
    mutate(MILL_STATE = case_when(MILL_STATECD == 9 ~ "CT",
                                MILL_STATECD == 10 ~ "DE",
                                MILL_STATECD == 17 ~ "IL", 
                                MILL_STATECD == 18 ~ "IN",
                                MILL_STATECD == 19 ~ "IA",
                                MILL_STATECD == 20 ~ "KS",
                                MILL_STATECD == 23 ~ "ME",
                                MILL_STATECD == 24 ~ "MD",
                                MILL_STATECD == 25 ~ "MA",
                                MILL_STATECD == 26 ~ "MI",
                                MILL_STATECD == 27 ~ "MN",
                                MILL_STATECD == 29 ~ "MO",
                                MILL_STATECD == 31 ~ "NE",
                                MILL_STATECD == 33 ~ "NH",
                                MILL_STATECD == 34 ~ "NJ",
                                MILL_STATECD == 36 ~ "NY",
                                MILL_STATECD == 38 ~ "ND",
                                MILL_STATECD == 39 ~ "OH",
                                MILL_STATECD == 42 ~ "PA",
                                MILL_STATECD == 44 ~ "RI",
                                MILL_STATECD == 46 ~ "SD",
                                MILL_STATECD == 50 ~ "VT",
                                MILL_STATECD == 54 ~ "WV",
                                MILL_STATECD == 55 ~ "WI"))
}

4.2 Calculating Initial Results

The next few lines of code use the data set to calculate a few meaningful values. The first chunk finds the sum of all mill volumes, stored in a variable named ‘TOT_Volume’, and then applies the ‘State_Code_Conversion’ function to the dataframe. The second chunk finds the total number of active/new mills in the sample, stored in a variable named ‘TOT_Mills’. The last chunk counts the mills with a reported volume >= 100,000,000 BF, storing the result in a variable named ‘TOT_Mills_High_Vol’.

Show code
#Calculate the sum of all Mill Volumes

TPO_Data %>%
  select(TOT_BF)%>%
  sum(na.rm = FALSE) -> TOT_Volume

TPO_Data <- State_Code_Conversion(TPO_Data)

#Calculate Total # of mills

TPO_Data %>%
    nrow() -> TOT_Mills
TOT_Mills <- as.numeric(unlist(TOT_Mills))

#Filter mills with BF > 100,000,000

TPO_Data %>%
  filter(TOT_BF >= 100000000)%>%
  nrow -> TOT_Mills_High_Vol
TOT_Mills_High_Vol <- as.numeric(unlist(TOT_Mills_High_Vol))

4.2.1 Printing Initial Results

4.3 Omission Threshold Functions and Setup

The chunk below starts the process of creating omission thresholds by first adding a new column to the dataframe named ‘Volume_Code’. With thresholds starting at 1,000 BF and then running from 5,000 BF to 50,000 BF in 5,000 BF increments, the ‘Volume_Code’ variable takes a value equal to 1/10,000th of the threshold under which the mill falls. In other words, any mill with a reported volume between 5,000 and 10,000 BF receives a ‘1’, whereas any mill with a reported volume between 10,000 and 15,000 BF receives a ‘1.5’. Any mill above the final threshold value (50,000 BF) receives a ‘99’, while mills reporting 0 BF or leaving the field blank receive a ‘0’.

From here, the ‘Threshold_Data’ dataframe is created by filtering for mills whose ‘Volume_Code’ value is not 99 or 0.

Next, two more functions are established, ‘Threshold_Mills’ and ‘Threshold_Volume’, each measuring a different value. Both take two arguments, ‘Data’ and ‘Volumes’: ‘Data’ is simply the data set to be utilized, while ‘Volumes’ refers to a vector of threshold volume codes. ‘Threshold_Mills’ calculates the number of mills omitted at a given threshold, whereas ‘Threshold_Volume’ calculates the aggregate volume omitted at that same threshold.

Show code
#Create Omission Thresholds
  #If changing thresholds, make sure to update Function_Volumes.
  #If new thresholds are not 1/10000th of top-bounding threshold, make sure to update for-in loop.

TPO_Data <- TPO_Data %>%
  mutate(Volume_Code = case_when(TOT_BF >= 50000 ~ 99,
                                 TOT_BF < 50000 & TOT_BF >= 45000 ~ 5,
                                 TOT_BF < 45000 & TOT_BF >= 40000 ~ 4.5,
                                 TOT_BF < 40000 & TOT_BF >= 35000 ~ 4,
                                 TOT_BF < 35000 & TOT_BF >= 30000 ~ 3.5,
                                 TOT_BF < 30000 & TOT_BF >= 25000 ~ 3,
                                 TOT_BF < 25000 & TOT_BF >= 20000 ~ 2.5,
                                 TOT_BF < 20000 & TOT_BF >= 15000 ~ 2,
                                 TOT_BF < 15000 & TOT_BF >= 10000 ~ 1.5,
                                 TOT_BF < 10000 & TOT_BF >= 5000 ~ 1,
                                 TOT_BF < 5000 & TOT_BF >= 1000 ~ 0.5, #>= closes the gap at exactly 1,000 BF
                                 TOT_BF < 1000 & TOT_BF > 0 ~ 0.1,
                                 TOT_BF == 0 ~ 0,
                                 is.na(TOT_BF) ~ 0)) #Blank volumes coded 0, per the text above

Threshold_Data <- TPO_Data %>%
  filter(Volume_Code != 99 & Volume_Code !=0)

#Create Functions for Mill Count & Volume Omission Data

Threshold_Mills <- function(Data, Volumes){
  #Count the mills falling under the supplied threshold code(s)
  Data %>% 
    filter(Volume_Code %in% Volumes) %>%
    nrow() %>%
    as.numeric()
}

Threshold_Volume <- function(Data, Volumes){
  #Sum the BF volume of the mills falling under the supplied threshold code(s)
  Data %>%
    filter(Volume_Code %in% Volumes) %>%
    select(TOT_BF) %>%
    sum(na.rm = TRUE)
}

4.3.1 For-In Loop

The code snippet below first establishes ‘Function_Volumes’, a sorted vector of the unique values remaining in the ‘Volume_Code’ column (99s and 0s were previously removed).

From there, a for-in loop is used to cycle through each value within ‘Function_Volumes’. Each iteration outputs two numeric variables, ‘TOT_Mill_Omit_[Threshold]’ and ‘TOT_Volume_Omit_[Threshold]’ (e.g., ‘TOT_Mill_Omit_10000’); the ‘Function_Volumes[i]*10000’ piece multiplies the volume code by 10,000 so that each variable is named for the threshold in question. The values of these variables are calculated with the respective function (Threshold_Mills or Threshold_Volume) at each iteration. By specifying a volume argument of ‘Function_Volumes[1:i]’ for each, quantities are aggregated from previous iterations. In other words, a mill reporting 3,000 BF will not only be omitted at a 5,000 BF threshold, but also at a 10,000, 15,000, 20,000, etc. BF threshold.

Show code
#For-Loop for Mill & Volume Omission Data
Function_Volumes <- Threshold_Data$Volume_Code %>%
  unique() %>%
  sort()


for (i in 1:length(Function_Volumes)){
  assign(paste0("TOT_Mill_Omit_", Function_Volumes[i]*10000), value = Threshold_Mills(Threshold_Data, Function_Volumes[1:i]))
  assign(paste0("TOT_Volume_Omit_", Function_Volumes[i]*10000), value = Threshold_Volume(Threshold_Data, Function_Volumes[1:i]))
}

4.3.2 Vector Creation

Following the for-in loop, 11 ‘TOT_Mill_Omit_[Threshold]’ and 11 ‘TOT_Volume_Omit_[Threshold]’ variables exist in the environment. From there, the mill count variables are gathered into a vector named ‘Omit_Mill_Count_Vector’ and the omitted volume variables into a vector named ‘Omit_Volume_Vector’. Additionally, two vectors of length 11, containing the total sample volume (‘TOT_Volume’) and the total number of mills within the sample (‘TOT_Mills’), are also created.

Next, the ‘Omit_DF’ dataframe is created using the 4 vectors established above, with rows named based on the threshold in question. Also, a new column (‘Percent_Omitted’) is added by dividing the Omitted Mill Count (‘Omit_Mill_Count_Vector’) by the total Mill Count in the sample (‘TOT_Mill_Vector’), and multiplying by 100.

Using the ‘Omit_DF’ dataframe, an omission threshold table is created using the kable() function, removing the columns relating to Total Volume and Total Mill Count (4 & 5).

Again using the ‘Omit_DF’ dataframe, a scatter/line plot is drawn with ‘Percent_Omitted’ on the x-axis and ‘Omit_Volume_Vector’ on the y-axis. Before plotting, the ‘Omit_Vec’ vector is created and bound to ‘Omit_DF’ using the cbind() function. The plot’s code contains a few other specifications, all of which are included to clean up the aesthetics of the plot.

Finally, all of the now-unneeded vectors are removed from the local environment.

Show code
# Create Vectors for Omitted Mill Counts and Volumes
# Need to change these vectors if changing omission thresholds

Omit_Mill_Count_Vector <- c(TOT_Mill_Omit_1000, TOT_Mill_Omit_5000, TOT_Mill_Omit_10000, TOT_Mill_Omit_15000, 
                            TOT_Mill_Omit_20000, TOT_Mill_Omit_25000, TOT_Mill_Omit_30000, TOT_Mill_Omit_35000, 
                            TOT_Mill_Omit_40000, TOT_Mill_Omit_45000, TOT_Mill_Omit_50000)

Omit_Volume_Vector <- c(TOT_Volume_Omit_1000,TOT_Volume_Omit_5000, TOT_Volume_Omit_10000, TOT_Volume_Omit_15000, 
                        TOT_Volume_Omit_20000, TOT_Volume_Omit_25000, TOT_Volume_Omit_30000, TOT_Volume_Omit_35000, 
                        TOT_Volume_Omit_40000, TOT_Volume_Omit_45000, TOT_Volume_Omit_50000)

TOT_Volume_Vector <- c(TOT_Volume) %>%
  rep.int(times = length(Omit_Volume_Vector))

TOT_Mill_Vector <- c(TOT_Mills) %>%
  rep.int(times = length(Omit_Volume_Vector))

# Convert Vectors to Dataframe

Omit_DF <- data.frame(Omit_Mill_Count_Vector, Omit_Volume_Vector, TOT_Volume_Vector, TOT_Mill_Vector, 
                      row.names = c('1,000 BF cut-off','5,000 BF cut-off', '10,000 BF cut-off', '15,000 BF cut-off',
                                    '20,000 BF cut-off', '25,000 BF cut-off', '30,000 BF cut-off', '35,000 BF cut-off',
                                    '40,000 BF cut-off', '45,000 BF cut-off', '50,000 BF cut-off')) %>%
  mutate(Percent_Omitted = Omit_Mill_Count_Vector/TOT_Mill_Vector*100)

# Produce Omission Table and Plot

kable(Omit_DF, digits = 3, align = "ccccc", 
      col.names = c("Omitted Mill Count", "Omitted Volume (BF)", "Total Volume (BF) in Sample", 
                    "Total Mill Count in Sample", "Percent of Mills Omitted"), 
      caption = "Omissions by Varying Volume Baselines", 
      format.args = list(big.mark = ",", scientific = FALSE)) %>%
  kable_styling(font_size = 16) %>%
  column_spec(column = 1, bold = TRUE) %>%
  remove_column(columns = c(4,5))
Table 1: Omissions by Varying Volume Baselines
| Threshold | Omitted Mill Count | Omitted Volume (BF) | Percent of Mills Omitted |
|---|---|---|---|
| 1,000 BF cut-off | 13 | 5,372.033 | 0.391 |
| 5,000 BF cut-off | 66 | 139,867.886 | 1.987 |
| 10,000 BF cut-off | 227 | 1,163,217.797 | 6.835 |
| 15,000 BF cut-off | 419 | 3,218,548.896 | 12.617 |
| 20,000 BF cut-off | 453 | 3,794,201.624 | 13.640 |
| 25,000 BF cut-off | 563 | 6,131,755.948 | 16.953 |
| 30,000 BF cut-off | 628 | 7,822,716.833 | 18.910 |
| 35,000 BF cut-off | 679 | 9,448,236.934 | 20.446 |
| 40,000 BF cut-off | 692 | 9,936,233.774 | 20.837 |
| 45,000 BF cut-off | 714 | 10,862,499.457 | 21.500 |
| 50,000 BF cut-off | 731 | 11,668,039.434 | 22.011 |
Show code
Omit_Vec <- Function_Volumes*10000
Omit_DF <- cbind(Omit_DF, Omit_Vec)

Omit_DF %>%
  ggplot(aes(Percent_Omitted, Omit_Volume_Vector, color = factor(Omit_Vec))) +
  geom_point(size = 3) +
  geom_smooth(size = .8, se = FALSE, color = 'black') +
  scale_x_continuous(breaks = c(seq(0,24,2)), limits = c(0,24)) +
  scale_y_continuous(labels = scales::comma, limits = c(0,12000000), breaks = c(seq(0,12000000,1000000))) +
  scale_color_discrete(name = "Omission Threshold (BF)") +
  labs(title = "Omitted Volumes (BF) & Mill Counts at Varying Thresholds", caption = "Figure 1. Omission Threshold Plot") +
  geom_text(aes(label = Omit_Mill_Count_Vector), size =  5,nudge_y = 300000, nudge_x = -.3) +
  theme_fivethirtyeight(base_size = 20, base_family = 'serif')+
  theme(axis.title = element_text(family = 'serif', size = 20)) + ylab('Volume (BF) Omitted') + xlab('% of Mills Omitted')
Show code
# Remove Unneeded Vectors

rm(Omit_Mill_Count_Vector, Omit_Volume_Vector, TOT_Volume_Vector, TOT_Mill_Vector, Omit_Vec, i, Function_Volumes)

rm(TOT_Mill_Omit_1000,TOT_Mill_Omit_5000, TOT_Mill_Omit_10000, TOT_Mill_Omit_15000, TOT_Mill_Omit_20000,
   TOT_Mill_Omit_25000, TOT_Mill_Omit_30000, TOT_Mill_Omit_35000, 
   TOT_Mill_Omit_40000, TOT_Mill_Omit_45000, TOT_Mill_Omit_50000)

rm(TOT_Volume_Omit_1000, TOT_Volume_Omit_5000, TOT_Volume_Omit_10000, TOT_Volume_Omit_15000, TOT_Volume_Omit_20000,
   TOT_Volume_Omit_25000, TOT_Volume_Omit_30000, TOT_Volume_Omit_35000, 
   TOT_Volume_Omit_40000, TOT_Volume_Omit_45000, TOT_Volume_Omit_50000)

4.4 State-Level Omission Thresholds

4.4.1 Creation of State-Level Omission Function

In the following chunk, the ‘State_Omit_Count_Levels’ dataframe is created by selecting the columns of interest from ‘Threshold_Data’, grouping by state (‘MILL_STATECD’) and omission threshold (‘Volume_Code’) to find the number of mills omitted at each threshold, converting the ‘Volume_Code’ variable to BF units by multiplying it by 10,000, and then using the ‘State_Code_Conversion’ function to convert state codes to their respective state abbreviations.

With aggregate omission threshold data (across all states), the next step in this analysis is to produce similar plots and tables at the state level. To do this, I first created a function, ‘Sum_Threshold_Volumes_by_State’, which is broken down in further detail below:

- The function requires one argument, 'Data', which represents the data set to be utilized within the call.
- For each value in the 'Omission Threshold (BF)' column, the previous 'Volume Omitted (BF)' value is added to the current value.
- Also within the function are a few 'next' calls, specifying that the first index should be skipped and that, any time the for-in loop moves to a new state, the first value for that state should also be skipped.
    - These 'next' calls hinge on the data being grouped by 'State' and 'Omission Threshold (BF)'; without that grouping the function would not work as anticipated.
Show code
# Find State-by-State Omitted Volume Totals

State_Omit_Count_Levels <- Threshold_Data %>%
  select(MILL_STATECD, MTC:TOT_BF_LOG, Volume_Code) %>%
  group_by(MILL_STATECD, Volume_Code) %>%
  summarize(Mill_Count = n()) %>%
  mutate(Volume_Code = Volume_Code*10000)
State_Omit_Count_Levels <- State_Code_Conversion(State_Omit_Count_Levels)

# Function for summing Omitted Volume by Threshold

Sum_Threshold_Volumes_by_State <- function(Data){
  for (v in 1:length(Data$`Omission Threshold (BF)`)){
    if (v == 1){
      #Skip the first row of the data set
      next
    } else if (Data$State[v] != Data$State[v-1]){
      #Skip the first row of each new state
      next
    } else if (Data$`Omission Threshold (BF)`[v] > Data$`Omission Threshold (BF)`[v-1]){
      #Carry the previous threshold's omitted volume forward into the current threshold
      Data$`Volume Omitted (BF)`[v] + Data$`Volume Omitted (BF)`[v-1] -> Data$`Volume Omitted (BF)`[v]
    } else{
      break
    }
  }
  Data
}

4.4.2 Tidying of State-Level Data Frame

The next step in the workflow is to create a state-level dataframe containing omission threshold mill counts and omitted volumes. I first use the ‘State_Code_Conversion’ function to convert state codes to their respective abbreviations within the ‘Threshold_Data’ dataframe. Next, I create a dataframe containing total reported/estimated volumes for each state (‘State_Volumes’). From there I create the ‘Omission_Data’ dataframe by grouping by state and omission threshold, summing volume (‘TOT_BF’) for each grouping, creating the ‘Omission Threshold (BF)’ column and dropping the ‘Volume_Code’ column, and finally joining the state volume totals (‘State_Volumes’) to the dataframe with a left_join() call. The resulting dataframe is then used as the input to the ‘Sum_Threshold_Volumes_by_State’ function discussed above.

Following usage of the function, a new variable (‘% of Volume Omitted’) is added by dividing the omitted volume by the total state volume, and columns are reordered.

At this point the dataframe contains state-level omitted volume data, though it does not yet contain omitted mill counts. Thus the ‘State_Omit_Count_Levels’ dataframe (created in the previous chunk) is also joined to ‘Omission_Data’, again using a left_join() call.

Finally, a ‘state_Count’ variable is added by summing omitted mill counts, grouped by state. This variable represents the total number of mills omitted at the uppermost volume threshold applied to the state in question.

Show code
#Specify Column Order 

col_order <- c("State", "Omission Threshold (BF)", "Volume Omitted (BF)",
               "Total State Volume (BF)", "% of Volume Omitted")

Threshold_Data <- State_Code_Conversion(Threshold_Data)

# Find Total State Volumes

State_Volumes <- TPO_Data %>%
  select('MILL_STATE', 'TOT_BF', 'TOT_BF_LOG') %>%
  group_by(MILL_STATE) %>%
  summarize(State_Volume = sum(TOT_BF))

# Create Omitted Volume by State DF

Omission_Data <- Threshold_Data %>%
  group_by(MILL_STATE, Volume_Code) %>%
  summarize('Volume Omitted (BF)' = sum(TOT_BF)) %>%
  mutate('Omission Threshold (BF)' = Volume_Code*10000) %>%
  select(-Volume_Code) %>%
  left_join(State_Volumes, by = 'MILL_STATE') %>%
  rename('Total State Volume (BF)' = 'State_Volume', 'State' = 'MILL_STATE') %>%
  ungroup()

# Sum Volumes by State/Threshold & Create % field comparing Omitted Threshold Volume to Total State Volume
Omission_Data <- Omission_Data %>%
  Sum_Threshold_Volumes_by_State() %>%
  mutate('% of Volume Omitted' = `Volume Omitted (BF)`/`Total State Volume (BF)` * 100)

# Reorder columns
Omission_Data <- Omission_Data[, col_order]
rm(col_order)

# Join Omission Data with Omitted Mill Counts
Omission_Data <- Omission_Data %>%
  left_join(State_Omit_Count_Levels, by = c('State' = 'MILL_STATE', 'Omission Threshold (BF)' = 'Volume_Code')) 

Omission_Data <- Omission_Data %>%
  group_by(State) %>%
  mutate(state_Count = sum(Mill_Count)) %>%
  select(-MILL_STATECD)

4.4.3 Aggregating State Omission Volumes and Mill Counts

Because the ‘Omission_Data’ dataframe is only populated for thresholds (rows) at which a mill has actually been omitted, measures of volume and mill count at each threshold must again be aggregated, this time on a state-by-state basis. For example, suppose a state contains only a few mills reporting under 50,000 BF (the uppermost threshold), each of which falls within the 10,000-15,000 BF range (categorized under the 15,000 BF threshold). Only the 15,000 BF threshold row would be present for that state, leading to inaccurate visualization. As mentioned above, a mill within the 10,000-15,000 BF range will not only be omitted under a 15,000 BF threshold, but also under a 20,000 BF, 25,000 BF, 30,000 BF, etc. threshold.

To fix this problem, a blank template of the ‘Omission_Data’ dataframe is created and joined to the existing Omission_Data. This template, named ‘TestDF’, contains all of the same columns, though has rows for each of the 11 respective omission thresholds for each state. To create the template, I first create a vector (‘States’) containing all of the states present in the original data set. Additionally, using the ‘States’ vector, I create a new variable named ‘State_Length’ representing the number of states present in the vector. One more vector, ‘Thresholds’, is created by taking the unique values housed under ‘Volume_Code’ (within ‘Threshold_Data’), sorting in an ascending manner, and multiplying each value by 10,000.

With the three vectors/variables established, the ‘States’ vector is replicated by the length of ‘Thresholds’ and sorted in ascending (alphabetical) order. Once the ‘States’ vector has been lengthened, the same is done to the ‘Thresholds’ vector, this time using ‘State_Length’ as the ‘times’ argument. At this point, each of the mentioned vectors is of length 264, which matches the length of the desired template: 24 (States) * 11 (Thresholds) = 264.

Now that the vectors are of proper length, the last step prior to joining is to create the blank template (‘TestDF’), utilizing the ‘Columns’ vector to properly name variables. Finally, the ‘Omission_Data’ is joined to the ‘TestDF’ template, and a fill() call is used to aggregate relevant values from one threshold to the next (within each state). Once ‘Omission_Data’ is amended, unneeded vectors are removed, and all NA values are changed to zeros.

Up to this point the dataframe houses aggregated omitted volume measures, though the same is not true of mill counts on a state-by-state basis. Thus, one more function (SumTest) is created and used on the ‘Omission_Data’ dataframe. The comments within the function provide reasoning behind the code. In short, the function aggregates mill counts up to the largest threshold established within each state, and is quite similar to the ‘Sum_Threshold_Volumes_by_State’ function utilized previously.

Show code
#Create Omission Template, with NAs for all empty fields.
States <- unique(TPO_Data$MILL_STATE) 
State_Length = as.numeric(length(States))

Thresholds <- unique(Threshold_Data$Volume_Code) %>%
  sort(decreasing = FALSE)*10000

States <- rep.int(States, times = length(Thresholds))

States <- States %>%
  sort(decreasing = FALSE)

Thresholds <- Thresholds %>%
  rep.int(times = State_Length)

Columns <- c("State", "Omission Threshold (BF)")

#Create Omission Template, named TestDF
TestDF <- data.frame("State" = States, "Omission Threshold (BF)" = Thresholds)%>%
  group_by(State)

colnames(TestDF) <- Columns

#Join Omission_Data to TestDF, save as Omission_Data
Omission_Data <- TestDF %>%
  left_join(Omission_Data, by = c('State', "Omission Threshold (BF)")) 

#Aggregate Volume/Mill Count by threshold/state
Omission_Data <- Omission_Data %>%
  group_by(State) %>%
  fill(c("Volume Omitted (BF)", "Total State Volume (BF)", "% of Volume Omitted", "Mill_Count", "state_Count"), .direction = "down") 

#Remove unneeded vectors/DFs
rm(Columns, States, Thresholds, State_Length, TestDF)

#Change NA values to 0
Omission_Data[is.na(Omission_Data)] = 0

SumTest <- function(Data){
  for (v in 1:length(Data$Mill_Count)){
    #Pass the first row
    if (v == 1){
    next
    #Pass to the next row if passing to new state
    } else if (Data$State[v] != Data$State[v-1]){
      next
    #If zero & previous row not zero, pull value from previous row
    } else if (Data$Mill_Count[v] == 0 & Data$Mill_Count[v-1] != 0){
      Data$Mill_Count[v-1] -> Data$Mill_Count[v]
      next
    #If not zero, previous row not zero, & volume omitted is greater than the previous row (threshold), 
    #add mill count from previous row to current row's mill count
    } else if (Data$Mill_Count[v] != 0 & Data$Mill_Count[v-1] != 0 & Data$`Volume Omitted (BF)`[v] > Data$`Volume Omitted (BF)`[v-1]){
      Data$Mill_Count[v] + Data$Mill_Count[v-1] -> Data$Mill_Count[v]
      next
    #If not zero, previous row not zero, & volume omitted is equal to the previous row (threshold), pull mill count from previous row
    } else if (Data$Mill_Count[v] != 0 & Data$Mill_Count[v-1] != 0 & Data$`Volume Omitted (BF)`[v] == Data$`Volume Omitted (BF)`[v-1]){
      Data$Mill_Count[v-1] -> Data$Mill_Count[v]
      next
    #Else pass to next row
    } else{
      next
    }
  }
  Data
}

Omission_Data <- Omission_Data %>%
  SumTest() 

5 Visualization

5.1 State-Level Omission Threshold Plots, Table, & Code

Using the ‘Omission_Data’ dataframe, a few plots are created along with a table depicting the information within the dataframe. Additionally, click this link to view an interactive omission threshold web map created with Tableau.

Show code
# Omission Plots 
Omission_Data %>%
  ggplot(aes(State,`Volume Omitted (BF)`, color = factor(`Omission Threshold (BF)`))) + 
  geom_point(size = 2) +
  scale_y_continuous(labels = scales::comma, limits = c(0,1700000), breaks = c(seq(0,1700000,100000))) +
  scale_color_discrete(name = "Omission Threshold (BF)") +
  labs(title = "Omitted Volumes (BF) at Varying Thresholds by State", y = "Omitted Volume (BF)", x = "State",
       caption = "Figure 2. Omitted Volumes (BF) at Varying Thresholds by State") +
  theme_fivethirtyeight(base_size = 20, base_family = 'serif') +
  theme(axis.title = element_text(family = 'serif', size = 20)) + ylab('Volume (BF) Omitted') + xlab('State')
Show code
Omission_Data %>%
  ggplot(aes(0, `Volume Omitted (BF)`, fill = factor(`Omission Threshold (BF)`))) + 
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_discrete(name = "Omission Threshold (BF)") +
  facet_wrap(facets = vars(State), scales = 'free_y', ncol = 6) +
  labs(title = "Omitted Volumes (BF) at Varying Thresholds by State", y = "Omitted Volume (BF)", x = "State", 
       caption = "Figure 3. Omitted Volumes (BF) at Varying Thresholds by State") +
  theme_fivethirtyeight(base_size = 20, base_family = 'serif') +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), axis.title = element_text(family = 'serif', size = 20)) + 
  ylab('Volume (BF) Omitted')
Show code
# Omission Table Creation
## Will need to amend rowStyle distinctions if outlier entries are corrected
Omission_Table <- reactable(Omission_Data[,c(1:3,5:6)], groupBy = "State",  defaultPageSize = 24, outlined = TRUE, 
                            resizable = TRUE, wrap = TRUE, highlight = TRUE, theme = reactableTheme(
                           cellStyle = list(display = "flex", flexDirection = "column", justifyContent = "center"), 
                           headerStyle = list(display = "flex", flexDirection = "column", justifyContent = "center")),

rowStyle = JS("function(rowInfo) {if (rowInfo.row['State'] == 'IN' || rowInfo.row['State'] == 'KS'|| rowInfo.row['State'] == 'ME'||
rowInfo.row['State'] == 'MD'|| rowInfo.row['State'] == 'MI'|| rowInfo.row['State'] == 'MN'|| 
rowInfo.row['State'] == 'OH'|| rowInfo.row['State'] == 'PA'|| rowInfo.row['State'] == 'WI') {
return { backgroundColor: 'yellow' }
}
}"),
                           columns = list(
                            Mill_Count = colDef(name = "# of Mills Omitted", align = "center"),
                            State = colDef(align = "center"), 
                            'Omission Threshold (BF)' = colDef(align = "center", format = colFormat(separators = TRUE)),
                            'Volume Omitted (BF)' = colDef(align = "center", format = colFormat(separators = TRUE, digits = 2)),
                            '% of Volume Omitted' = colDef(align = "center", format = colFormat(digits = 3))))

Omission_Table_Titled <- htmlwidgets::prependContent(Omission_Table, 
h2(class = "title", "Omitted Volumes (BF) at Varying Thresholds by State", 
   style = "text-align: center"), 
h3(class = "caption", "States highlighted in yellow contain at least one mill with a reported volume >= 100,000,000 BF", 
   style = "text-align: center; font-size: 85%; font-weight: normal"))

Omission_Table_Titled

[Interactive table: Omitted Volumes (BF) at Varying Thresholds by State. States highlighted in yellow contain at least one mill with a reported volume >= 100,000,000 BF.]

5.2 Histograms of Log(10) Volume

5.2.1 Histogram Code & Output

The code below produces two histograms of reported volumes, using Logged MCF and Logged BF. Both are univariate, utilizing the columns “TOT_MCF_LOG” and “TOT_BF_LOG” respectively. These plots help to answer questions regarding the presence of reported mill volumes that are far outside of a normal range. A “normal” log MCF value would be < 4.2, while a “normal” log BF value would be < 8.
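
The two cut-offs describe the same underlying volume; a quick check using the 6,033 BF/MCF conversion factor from the Data section:

#The two "normal" cut-offs above both correspond to roughly 100,000,000 BF
log10(100000000)        # = 8    -> the "normal" log BF cut-off
log10(100000000 / 6033) # ~ 4.2  -> the "normal" log MCF cut-off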

A quick glance at the logged BF plot makes it clear that at least 83 mills have reported volumes far outside of the “normal” range.

Show code
#Create a histogram of Mill Volumes (Logged MCF)

Hist_Breaks_MCF <- c(seq(from = -3, to = 8, by = .5))

ggplot(TPO_Data, aes(TOT_MCF_LOG, label = ..count..), xlim = c(0, 8), ylim = c(0, 600)) + 
  geom_histogram(breaks = Hist_Breaks_MCF, bins = 16) + 
  scale_x_continuous(breaks= Hist_Breaks_MCF) +
  labs(title = "Histogram of Log(10) MCF",
       caption = "Figure 4. Sample-Size Histogram of Log(10) MCF (Closed/OOB & Dismantled Mills not included)") + 
  geom_text(stat="bin", nudge_y = 10, size=5, breaks = Hist_Breaks_MCF) +
  theme(text = element_text(family = "serif")) +
  theme_fivethirtyeight(base_size = 20, base_family = 'serif') +
  theme(axis.title = element_text(family = 'serif', size = 20)) + ylab('Mill Count') + xlab('Log(10) MCF')
Show code
#Create a histogram of Mill Volumes (Logged BF)

Hist_Breaks_BF <- c(seq(from = 0, to = 12, by = 1))

ggplot(TPO_Data, aes(TOT_BF_LOG, label = ..count..), xlim = c(0, 12), ylim = c(0, 1200)) + 
  geom_histogram(breaks = Hist_Breaks_BF, binwidth = 1) + 
  scale_x_continuous(breaks= Hist_Breaks_BF) +
  labs(title = "Histogram of Log(10) BF",
       caption = "Figure 5. Sample-Size Histogram of Log(10) BF (Closed/OOB & Dismantled Mills not included)") + 
  geom_text(stat="bin", nudge_y = 17,size=5,breaks = Hist_Breaks_BF) +
  theme_fivethirtyeight(base_size = 20, base_family = 'serif') +
  theme(axis.title = element_text(family = 'serif', size = 20)) + ylab('Mill Count') + xlab('Log(10) BF')

5.3 State Level Volume Plots

The scatter, box, and violin plots below utilize a host of variables, most of which required some sort of tidying prior to visualization.

5.3.1 Log(10) BF Volume Scatter, Box, and Violin Plots

These plots visualize reported mill volumes on a state-by-state basis. By breaking these volumes down by state, our FS partners are able to hone in on the states with the largest number of volume-outliers (i.e., Log(10) BF > 8). From the plots, it is clear that a few states will need to be reviewed; these include ME, MI, and MN, among others.

Show code
#Create State-by-State boxplots and scatter plots for Logged BF Volume

ggplot(TPO_Data, aes(x = MILL_STATE, y = TOT_BF_LOG, color = MILL_STATE)) + 
  geom_point(size = 1.2) +
  scale_y_continuous(breaks = Hist_Breaks_BF, limits = c(0,12)) + 
  labs(title = "Log(10) of BF by State", y = "Log(10) of BF", x = "State",
       caption = "Figure 6. Scatter Plot of Reported Mill Volumes (Log10 of BF) by State") +
  theme_fivethirtyeight(base_size = 20, base_family = 'serif') +
  theme(axis.title = element_text(family = 'serif', size = 20), legend.position = "none") + ylab('Log(10) BF') + xlab('State')
Show code

ggplot(TPO_Data, aes(x = MILL_STATE, y = TOT_BF_LOG, color = MILL_STATE)) + 
  geom_boxplot(outlier.size = 1.5, lwd=1.2) +
  scale_y_continuous(breaks = Hist_Breaks_BF, limits = c(0,12)) + 
  labs(title = "Log(10) of BF by State", y = "Log(10) of BF", x = "State",
       caption = "Figure 7. Boxplot of Reported Mill Volumes (Log10 of BF) by State") +
  theme_fivethirtyeight(base_size = 20, base_family = 'serif') +
  theme(axis.title = element_text(family = 'serif', size = 20), legend.position = "none") + ylab('Log(10) BF') + xlab('State')
Show code

ggplot(TPO_Data, aes(0, y = TOT_BF_LOG, color = MILL_STATE)) + 
  geom_violin(width = 1, scale = 'count') + 
  geom_jitter(width = 0.5, size = 1) +
  scale_y_continuous(breaks = c(seq(0,12,2)), limits = c(0,12)) + 
  labs(title = "Log(10) of BF by State", caption = "Figure 8. Violin Plot of Reported Mill Volumes (Log10 of BF) by State") +  
  theme_fivethirtyeight(base_size = 20, base_family = 'serif') + 
  theme(axis.title = element_text(family = 'serif', size = 20), legend.position = "none", axis.text.x = element_blank(), 
        panel.grid.major.x = element_blank()) + ylab('Log(10) BF') + xlab('State') +
  facet_wrap(facets = vars(MILL_STATE), ncol = 8)

5.3.2 Volume-Outliers Plot

This bar chart depicts the number of volume-outliers (i.e. Log(10) BF > 8) on a state-by-state basis. States with no volume-outliers were not included in the plot.

Show code
#Create Bar Chart for High Volume Mills by State

TPO_Data %>%
  group_by(MILL_STATE) %>%
  filter(TOT_BF_LOG >= 8) %>%
  summarize("High_Volume_Mills" = n()) %>%
  ggplot(aes(x = MILL_STATE, y = High_Volume_Mills)) + 
    geom_col() + 
    geom_text(aes(label = High_Volume_Mills), size = 6 ,nudge_y = 1.5) + 
    scale_y_continuous(breaks = c(seq(from = 0, to = 65, by = 5))) + 
    theme_fivethirtyeight(base_size = 20, base_family = 'serif') +
    theme(axis.title = element_text(family = 'serif', size = 20), panel.grid.major.x = element_blank()) + 
    ylab('High Volume Mills') + 
    xlab('State') +
    labs(title = "High Volume Mills (Log(10) BF > 8) by State", caption = "Figure 9. High Volume Mills (>100,000,000 BF) by State")

5.3.3 Mill Type Plot

This bar chart depicts a breakdown of Mill “Types” on a state-by-state basis.

Show code
#Create MTC_Summary

MTC_summary <- TPO_Data %>%
 mutate(MTC_Tidy = case_when(MTC %in% c(10,21,22,23,24,25,26) ~ "Sawmill",
                         MTC == 20 ~ "Veneer Mill",
                         MTC %in% c(30,40) ~ "Pulp/Composite Mill",
                         MTC == 50 ~ "Bioenergy Mill",
                         MTC == 60 ~ "House/Cabin Mill",
                         MTC %in% c(70,80,90) ~ "Post/Pole/Piling Mill",
                         MTC == 100 ~ "Residential Firewood Mill",
                         MTC %in% c(110,111) ~ "Other/Misc Mill",
                         MTC == 117 ~ "Concentration Yard",
                         is.na(MTC) ~ "Not Provided"))%>%
  group_by(MILL_STATE, MTC_Tidy) %>%
  arrange(as.factor(MTC_Tidy)) %>%
  summarize("Mill_Type_Count" = n())

#Create bar chart of Mill Type Counts by State
  
MTC_Tidy_Factor_Levels <- c("Sawmill", "Veneer Mill", "Pulp/Composite Mill", "Bioenergy Mill", "House/Cabin Mill", "Post/Pole/Piling Mill",
                            "Residential Firewood Mill", "Other/Misc Mill", "Concentration Yard", "Not Provided")

ggplot(MTC_summary, aes(x = MILL_STATE, y = Mill_Type_Count, 
                        fill = fct_rev(fct_infreq(factor(MTC_Tidy, levels = MTC_Tidy_Factor_Levels))))) +
  geom_col(position = 'stack', width = .8) + 
  labs(title = "Mill Type Counts by State", caption = "Figure 10. Mill Types by State") +
  scale_y_continuous(limits = c(0, 425), breaks = c(seq(from = 0, to = 425, by = 25))) +
  scale_fill_discrete(name = "Mill Type") +
  theme_fivethirtyeight(base_size = 20, base_family = 'serif') +
  theme(axis.title = element_text(family = 'serif', size = 20)) + ylab('Mill Count') + xlab('State') +
  guides(fill = guide_legend(title = "Mill Type", reverse = T))

6 Summary Statistics

6.1 State Level Means, Medians, & St. Deviations

The following chunk uses dplyr functions to measure a few meaningful values on a state-by-state basis.

Show code
#Calculate means, medians, and standard deviations of logged volumes by state

TPO_summary <- TPO_Data %>%
  select(MILL_STATECD, MILL_STATE, MTC, TOT_MCF_LOG, TOT_BF_LOG) %>%
  group_by(MILL_STATE) %>%
  summarize(meanBF = mean(TOT_BF_LOG, na.rm = TRUE), 
            medianBF = median(TOT_BF_LOG, na.rm = TRUE), 
            meanMCF = mean(TOT_MCF_LOG, na.rm = TRUE), 
            medianMCF = median(TOT_MCF_LOG, na.rm = TRUE), 
            St_Dev = sd(TOT_BF_LOG, na.rm = TRUE), 
            NumberofMills = n()) 

#Create table for means, medians, and standard deviations

kable(TPO_summary, digits = 4, align = "ccccccc", col.names = c("State", "Mean Log(10) BF", "Median Log(10) BF", "Mean Log(10) MCF", 
                                                                "Median Log(10) MCF", "Standard Deviation", "Mill Count"), 
      caption = "Volume-based means, medians, and standard deviations for all mills (omissions not removed) by State. 
      States highlighted in yellow contain at least one mill with a reported volume >= 100,000,000 BF") %>%
  kable_styling(font_size = 16) %>%
  row_spec(c(5,6,8,9:11,18,19,23), background = "yellow")
Table 2: Volume-based means, medians, and standard deviations for all mills (omissions not removed) by State. States highlighted in yellow contain at least one mill with a reported volume >= 100,000,000 BF
| State | Mean Log(10) BF | Median Log(10) BF | Mean Log(10) MCF | Median Log(10) MCF | Standard Deviation | Mill Count |
|---|---|---|---|---|---|---|
| CT | 5.5946 | 5.4378 | 1.7930 | 1.6362 | 0.8204 | 22 |
| DE | 4.8660 | 4.5127 | 1.0644 | 0.7111 | 1.0010 | 19 |
| IA | 4.9398 | 4.5509 | 1.1382 | 0.7492 | 1.1110 | 78 |
| IL | 4.7787 | 4.4627 | 0.9771 | 0.6611 | 1.0594 | 147 |
| IN | 5.6661 | 6.1285 | 1.8644 | 2.3269 | 1.2481 | 189 |
| KS | 4.5832 | 4.3013 | 0.7816 | 0.4997 | 0.9912 | 48 |
| MA | 4.5912 | 4.0003 | 0.7896 | 0.1987 | 0.9057 | 120 |
| MD | 5.4849 | 5.3181 | 1.6833 | 1.5165 | 1.3600 | 48 |
| ME | 5.5991 | 4.8144 | 1.7974 | 1.0128 | 1.2836 | 162 |
| MI | 5.7038 | 5.7041 | 1.9022 | 1.9025 | 1.0820 | 407 |
| MN | 6.2659 | 6.0003 | 2.4643 | 2.1987 | 1.7164 | 338 |
| MO | 5.6701 | 5.8416 | 1.8685 | 2.0400 | 0.9099 | 463 |
| ND | 4.1555 | 3.8454 | 0.3539 | 0.0438 | 0.9736 | 5 |
| NE | 4.4718 | 4.3277 | 0.6702 | 0.5261 | 0.9500 | 60 |
| NH | 5.8028 | 5.9649 | 2.0011 | 2.1633 | 1.0444 | 45 |
| NJ | 4.0423 | 4.0003 | 0.2407 | 0.1987 | 0.8522 | 23 |
| NY | 5.7856 | 5.7784 | 1.9840 | 1.9768 | 0.8405 | 139 |
| OH | 5.5961 | 5.8753 | 1.7945 | 2.0737 | 1.0604 | 219 |
| PA | 5.9552 | 6.1510 | 2.1536 | 2.3494 | 0.7750 | 406 |
| RI | 4.9757 | 4.6501 | 1.1741 | 0.8485 | 0.8998 | 4 |
| SD | 5.8746 | 5.7797 | 2.0729 | 1.9780 | 0.9865 | 26 |
| VT | 5.4019 | 5.2720 | 1.6002 | 1.4704 | 1.0931 | 61 |
| WI | 5.8532 | 5.9688 | 2.0516 | 2.1672 | 1.0782 | 226 |
| WV | 6.5522 | 6.6467 | 2.7506 | 2.8451 | 0.8693 | 66 |

7 Comparison of Sample-Reported & 2020 Survey Response Volumes

The final step in this project is to compare the sample-reported volumes with survey-response volumes from the most recent iteration (2020, though surveyed in 2021). While much of the ‘tidying’ code is shown in the chunks below, some of the code could not be shown given the confidential nature of individual survey responses. The hidden sections include syntax to fix erroneous state and volume-unit entries; these fixes were made manually by parsing through the ‘TPO_Data_2020’ dataframe.

To begin this process, the 2020 survey data is read in and saved as a dataframe named ‘TPO_Data_2020’. In a hidden chunk that follows, a few erroneous state entries are fixed for individual mills.

Show code
# Read 2020 TPO Data & rename problematic State entries

TPO_Data_2020 <- read_csv(file = "C:\\Users\\kenne\\Documents\\R_Workspace\\2020_TPO_Data.csv", 
                          col_select = c("MILL_NAME":"MILL_ZIP_CD", "MILL_TYPE_CD":"WOOD_PROCESSED_CD", "MILL_OUTPUT_CAPACITY_ANNUAL", 
                                         "MILL_OUTPUT_CAP_ANNUAL_UNIT_CD", "AMOUNT":"RWAMT_UNIT_MEASURE_CD_OTHERTXT", "TPOID", 
                                         "SW_LBSPERMBF":"HW_LBSPERCORD")) %>%
  select(-'MILL_PHONE')
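
The hidden state-correction chunk cannot be reproduced here, but the kind of fix it applies might look roughly like the sketch below; the mill name is entirely made up, and the real chunk corrects several specific entries.

#Hypothetical illustration of the hidden state corrections (real mill names are confidential)
TPO_Data_2020 <- TPO_Data_2020 %>%
  mutate(MILL_STATE = case_when(MILL_NAME == "EXAMPLE MILL LLC" ~ "NY",
                                TRUE ~ MILL_STATE))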

The analysis continues below. First, the TPO sample frame is joined to the 2020 dataframe using the unique ID provided for each mill. Using the joined dataframe, certain columns are selected, the dataframe is arranged by state, and the ‘MILL_STATE’ column is edited to match the state provided within the sample. Within the same pipe, the dataframe is further filtered to include only mills that are active/NA, or those that reported being idle/closed/dismantled (‘MILL_STATUS_CD’ = 3, 4, or 5) and also provided a volume (‘AMOUNT’). Finally, the ‘Amount_Unit’ column is created based on entries in the ‘RWAMT_UNIT_MEASURE_CD’ column, and the dataframe is once again filtered to mills whose ‘Amount_Unit’ column is populated or that provided an estimate of annual volume in the ‘MILL_OUTPUT_CAPACITY_ANNUAL’ column.

Show code
# Create DF for Closed/Dismantled Responses in 2020

Closed_Dismantled_2020 <- TPO_Data_2020 %>%
  filter(MILL_STATUS_CD == 4 | MILL_STATUS_CD == 5) %>%
  arrange(desc(WOOD_PROCESSED_CD), desc(AMOUNT)) 

# Join Sample data to 2020 Data

TPO_Data_2020 <- TPO_Data_2020 %>%
  left_join(TPO_Data, by = c("TPOID"="TPOID_2020"), suffix = c("","_Sample")) %>%
  select(-c(MILL_LAT:SURVEY_YEAR))
  
# Tidy the 2020 Data, still need to parse different Volume Units

TPO_Data_2020 <- TPO_Data_2020 %>%
  select(MILL_NAME:TPOID, MILL_NBR:MILL_STATECD, MTC:TOT_BF_LOG, MILL_STATE_Sample) %>%
  arrange(MILL_STATE) %>%
  mutate(MILL_STATE = case_when(MILL_STATE != MILL_STATE_Sample ~ MILL_STATE_Sample,
                                 MILL_STATE == MILL_STATE_Sample ~ MILL_STATE)) %>%
  filter(MILL_STATUS_CD == 2 | is.na(MILL_STATUS_CD) | ((MILL_STATUS_CD == 3 | MILL_STATUS_CD == 4 | MILL_STATUS_CD == 5)
                                                        & !is.na(AMOUNT))) %>%
  mutate(Amount_Unit = case_when(RWAMT_UNIT_MEASURE_CD == 1 ~ "BF Doyle",
                                 RWAMT_UNIT_MEASURE_CD == 2 ~ "BF Scribner",
                                 RWAMT_UNIT_MEASURE_CD == 5 ~ "BF 1/4 Inch",
                                 RWAMT_UNIT_MEASURE_CD == 6 ~ "BF LT",
                                 RWAMT_UNIT_MEASURE_CD == 11 ~ "MBF Doyle",
                                 RWAMT_UNIT_MEASURE_CD == 12 ~ "MBF Scribner",
                                 RWAMT_UNIT_MEASURE_CD == 15 ~ "MBF 1/4 Inch",
                                 RWAMT_UNIT_MEASURE_CD == 16 ~ "MBF LT",
                                 RWAMT_UNIT_MEASURE_CD == 21 ~ "Standard Cord",
                                 RWAMT_UNIT_MEASURE_CD == 22 ~ "Lake States Cord",
                                 RWAMT_UNIT_MEASURE_CD == 31 ~ "Green Tons",
                                 RWAMT_UNIT_MEASURE_CD == 61 ~ "Pieces",
                                 RWAMT_UNIT_MEASURE_CD == 99 ~ RWAMT_UNIT_MEASURE_CD_OTHERTXT,
                                 is.na(RWAMT_UNIT_MEASURE_CD) ~ RWAMT_UNIT_MEASURE_CD_OTHERTXT)) %>%
  filter(!is.na(Amount_Unit) | !is.na(MILL_OUTPUT_CAPACITY_ANNUAL)) %>%
  arrange(AMOUNT)

In another hidden chunk, the ‘Amount_Unit_Tidy’ column was created by aggregating units of similar value: all BF measurements were coded as ‘BF’, MBF measurements as ‘MBF’, and MMBF measurements as ‘MMBF’. In the chunk below, the dataframe is filtered to include only mills reporting board-footage units (BF/MBF/MMBF) that also reported a volume (‘AMOUNT’) or provided an annual volume estimate in the ‘MILL_OUTPUT_CAPACITY_ANNUAL’ column. Then the ‘AMOUNT’ column is converted to a numeric format, and the ‘AMOUNT_BF’ column is created to convert volumes based on the unit provided.
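
That hidden aggregation step cannot be shown, but a rough sketch of it, assuming the unit labels created in the previous chunk and ignoring the confidential free-text entries, might look like:

#Rough sketch of the hidden unit-aggregation step (the real chunk also handles
#confidential free-text unit entries)
TPO_Data_2020_Tidy <- TPO_Data_2020 %>%
  mutate(Amount_Unit_Tidy = case_when(
    Amount_Unit %in% c("BF Doyle", "BF Scribner", "BF 1/4 Inch", "BF LT") ~ "BF",
    Amount_Unit %in% c("MBF Doyle", "MBF Scribner", "MBF 1/4 Inch", "MBF LT") ~ "MBF",
    Amount_Unit == "MMBF" ~ "MMBF",
    TRUE ~ Amount_Unit))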

Show code
# Create DF for units in BF, MBF, & MMBF

TPO_Data_2020_BF <- TPO_Data_2020_Tidy %>%
  filter(Amount_Unit_Tidy %in% c('BF', 'MBF', 'MMBF'), !is.na(AMOUNT) | !is.na(MILL_OUTPUT_CAPACITY_ANNUAL)) 

# Convert Amounts to numeric

TPO_Data_2020_BF$AMOUNT <- as.numeric(as.character(TPO_Data_2020_BF$AMOUNT)) 


# Convert MBF/MMBF to BF

TPO_Data_2020_BF <- TPO_Data_2020_BF %>%
  mutate(AMOUNT_BF = case_when(Amount_Unit_Tidy == 'BF' ~ AMOUNT,
                            Amount_Unit_Tidy == 'MBF' ~ AMOUNT*1000, 
                            Amount_Unit_Tidy == 'MMBF' ~ AMOUNT*1000000))  

In a hidden chunk above, a few erroneous volume entries were manually corrected. The chunk was hidden because it references a few mills’ names directly within the code.
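
That hidden step presumably also produces the ‘AMOUNT_BF_Tidy’ column and the ‘Edited_Mills’ vector referenced in the plots below; a hypothetical sketch, with a made-up mill name and correction, might look like:

#Hypothetical sketch of the hidden volume corrections (real mill names are confidential)
Edited_Mills <- c("EXAMPLE MILL LLC")  #made-up name; the real vector lists the corrected mills

TPO_Data_2020_BF <- TPO_Data_2020_BF %>%
  mutate(AMOUNT_BF_Tidy = if_else(MILL_NAME %in% Edited_Mills,
                                  AMOUNT_BF / 1000,  #e.g., correcting a volume entered in BF but coded as MBF
                                  AMOUNT_BF))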

The final chunks below create scatter/line plots to compare 2020 survey response volumes with those reported in the sample-frame. The first three plots each represent a range of 2020 response volumes (< 1,000,000 BF, 1,000,000 - 10,000,000 BF, and > 10,000,000 BF), while the final plot shows the full range of 2020 response volumes. Based on previous filtering, only mills reporting volumes on a board footage (BF, MBF, or MMBF) scale were included for these visualizations.

Show code
TPO_Data_2020_BF %>%
  filter(AMOUNT_BF_Tidy < 1000000) %>%
  ggplot(aes(TOT_BF, AMOUNT_BF_Tidy)) +
  geom_point(aes(color = MILL_NAME %in% Edited_Mills)) +
  geom_smooth(se = FALSE) +
  geom_abline() +
  scale_x_continuous(labels = scales::comma,limits = c(0,1000000), breaks = c(seq(0,1000000,100000))) +
  scale_y_continuous(labels = scales::comma,limits = c(0,1000000), breaks = c(seq(0,1000000,100000))) +
  theme_fivethirtyeight(base_size = 20, base_family = 'serif') +
  theme(axis.title = element_text(family = 'serif', size = 20)) + ylab('2020 Survey Response Volume (BF)') + 
  xlab('Sample-Reported Volume (BF)') +
  labs(title = "2020 Response Volumes vs. Sample-Reported Volumes (BF)", 
       caption = "Figure 11. 2020 Response Volumes vs. Sample-Reported Volumes for Response Volumes < 1,000,000 BF") +
  scale_color_discrete(name = "Manual Volume Modification?", labels = c("No", "Yes"))
Show code

TPO_Data_2020_BF %>%
  filter(AMOUNT_BF_Tidy >= 1000000 & AMOUNT_BF_Tidy < 10000000) %>%
  ggplot(aes(TOT_BF, AMOUNT_BF_Tidy)) +
  geom_point(aes(color = MILL_NAME %in% Edited_Mills)) + 
  geom_smooth(se = FALSE) +
  geom_abline() +
  scale_x_continuous(labels = scales::comma, limits = c(0,10000000), breaks = c(seq(0,100000000,1000000))) +
  scale_y_continuous(labels = scales::comma, limits = c(0,10000000), breaks = c(seq(1000000,100000000,1000000))) +
  theme_fivethirtyeight(base_size = 20, base_family = 'serif') +
  theme(axis.title = element_text(family = 'serif', size = 20)) + ylab('2020 Survey Response Volume (BF)') + 
  xlab('Sample-Reported Volume (BF)') +
  labs(title = "2020 Response Volumes vs. Sample-Reported Volumes (BF)", 
       caption = "Figure 12. 2020 Response Volumes vs. Sample-Reported Volumes for Response Volumes, 1,000,000 BF - 10,000,000 BF") +
  scale_color_discrete(name = "Manual Volume Modification?", labels = c("No", "Yes"))
Show code

TPO_Data_2020_BF %>%
  filter(AMOUNT_BF_Tidy >= 10000000) %>%
  ggplot(aes(TOT_BF, AMOUNT_BF_Tidy)) +
  geom_point(aes(color = MILL_NAME %in% Edited_Mills)) + 
  geom_smooth(se = FALSE) +
  geom_abline() +
  scale_x_continuous(labels = scales::comma, breaks = c(seq(0, 80000000, 10000000))) +
  scale_y_continuous(labels = scales::comma, limits = c(0, 100000000), breaks = c(seq(0, 80000000, 10000000))) +
  theme_fivethirtyeight(base_size = 20, base_family = 'serif') +
  theme(axis.title = element_text(family = 'serif', size = 20)) + ylab('2020 Survey Response Volume (BF)') + 
  xlab('Sample-Reported Volume (BF)') +
  labs(title = "2020 Response Volumes vs. Sample-Reported Volumes (BF)", 
       caption = "Figure 13. 2020 Response Volumes vs. Sample-Reported Volumes for Response Volumes >= 10,000,000 BF") +
  scale_color_discrete(name = "Manual Volume Modification?", labels = c("No", "Yes"))
Show code


# Full Sample Comparison

TPO_Data_2020_BF %>%
  filter(MILL_NAME != "SMITH FOREST PRODUCTS" & MILL_NAME != "BERG REINVIGORATIONS LLC" & MILL_NAME != "INDIANA VENEERS CORP") %>%
  ggplot(aes(TOT_BF, AMOUNT_BF_Tidy)) +
  geom_point(aes(color = MILL_NAME %in% Edited_Mills)) + 
  geom_smooth(se = FALSE) +
  geom_abline() +
  scale_x_continuous(labels = scales::comma, limits = c(0, 100000000), breaks = c(seq(0, 80000000, 10000000))) +
  scale_y_continuous(labels = scales::comma, limits = c(0, 100000000), breaks = c(seq(0, 80000000, 10000000))) +
  theme_fivethirtyeight(base_size = 20, base_family = 'serif') +
  theme(axis.title = element_text(family = 'serif', size = 20)) + ylab('2020 Survey Response Volume (BF)') + 
  xlab('Sample-Reported Volume (BF)') +
  labs(title = "2020 Response Volumes vs. Sample-Reported Volumes (BF)", 
       caption = "Figure 14. 2020 Response Volumes vs. Sample-Reported Volumes for Full sample") +
  scale_color_discrete(name = "Manual Volume Modification?", labels = c("No", "Yes"))
Show code

rm(TPO_Data_2020, TPO_Data_2020_BF, TPO_Data_2020_Tidy, Closed_Dismantled_2020, Edited_Mills)

8 Conclusions

All in all, the results from this analysis indicate that introducing an omission threshold within the 1,000 - 50,000 BF range would be effective in not only reducing the overall sample size but also maintaining aggregate volume. At the uppermost threshold (50,000 BF), roughly 22% of mills are omitted, whereas only ~12,000,000 BF of volume is lost. For reference, there are approximately 269 individual mills with reported volumes greater than 12,000,000 BF. Such a large reduction in sample size would enhance the likelihood that a ‘commercial’ mill is surveyed in a given year, allowing for greater response rates within this subset.

The major piece of information missing from this analysis is how to handle high-volume outliers. Because this analysis measures the impact of introducing a volume-based threshold for survey inclusion, corrections to these inaccurate volumes are necessary before the impact on aggregate sample volume can be gauged. In other words, while we can adequately measure the number and percentage of mills that would be removed under an omission threshold, we cannot accurately measure the impact on aggregate sample volume.

Assuming outliers are addressed, measures surrounding percentage (or proportion) of total volume omitted at varying thresholds can also be presented.

9 Bibliography

United States, Forest Inventory & Analysis - Northern Research Station (NRS-FIA). Forest Inventory and Analysis National Program - Program Features, USDA Forest Service, 3 May 2021. https://www.fia.fs.fed.us/program-features/tpo/. Accessed 30 Jan. 2022.

R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Wickham, Hadley, and Garrett Grolemund. R For Data Science: Import, Tidy, Transform, Visualize and Model Data. O’Reilly, 2017.

Allaire, et al. (2018, Sept. 10). Distill for R Markdown. Retrieved from https://rstudio.github.io/distill

“Distill Basics.” Distill for R Markdown, The Distill Template Authors & RStudio, Inc., 2018, https://rstudio.github.io/distill/basics.html.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Kennedy (2022, May 11). Data Analytics and Computational Social Science: TPO Mill Size Analysis. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httprpubscomikennedy040finalproject/

BibTeX citation

@misc{kennedy2022tpo,
  author = {Kennedy, Ian},
  title = {Data Analytics and Computational Social Science: TPO Mill Size Analysis},
  url = {https://github.com/DACSS/dacss_course_website/posts/httprpubscomikennedy040finalproject/},
  year = {2022}
}