HW6 - TPO Sample Size Analysis

HW6 for DACSS 601.

Ian Kennedy
2022-03-21

Introduction

For my final project I am using a data set from the research lab (FFRC) I’ve been working with for the past few years. In short, the Family Forest Research Center (FFRC), in coordination with the USDA Forest Service, implements the annual Timber Products Output (TPO) survey, which attempts to track timber procurement and lumber production at mills throughout a host of states in the north/northeastern US. Findings from this analysis will assist in answering questions surrounding the introduction of a volume-based threshold relating to survey inclusion. Specifically, through this analysis I will attempt to answer the following questions:

  1. In an effort to hone surveying efforts towards commercial mills, should small-production/hobby mills be excluded from future survey attempts?
  2. If yes, what is the minimum volume (i.e. omission threshold) that mills will need to have procured to be included in survey efforts?
  3. What impact does shifting a potential omission threshold have on the total volume and number of mills that are surveyed?

Before delving into the project, a bit of background on the TPO Survey is necessary. While the TPO Survey is conducted annually, most mills within the sample do not receive a survey every year. Rather, each year ‘filtering’ of the sample frame is conducted prior to survey dissemination. This ‘filtering’ is done on a state-by-state basis, and uses a volume-based weighting system to pull a certain number of mills for the given year. In other words, high-volume mills are more likely to be surveyed than low-volume mills within the same state. Thus, it is possible that a mill could go several years without being surveyed, or, conversely, be surveyed each and every year. On top of the state-level ‘filtering’, differences in state-level mill counts also play a role. Mills located in states with low mill counts (e.g. Rhode Island, New Jersey, Delaware) are more likely to be surveyed than similar-volume mills in states with far higher mill counts (e.g. Michigan, Pennsylvania). This ensures that every state within the sample has at least a few mills included in survey attempts each and every year.
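
To make the weighting idea concrete, the sketch below shows one way a volume-weighted, per-state draw could be expressed in R. It is illustrative only and is not the actual FFRC/Forest Service selection procedure; the ‘frame’ data frame, its columns, and the draw size are hypothetical stand-ins.

# Illustrative sketch of a volume-weighted draw (NOT the actual TPO
# 'filtering' procedure); all names and values below are hypothetical
library(dplyr)

frame <- data.frame(
    mill_id = 1:6,
    state = c("RI", "RI", "MI", "MI", "MI", "MI"),
    mcf = c(5, 120, 3, 40, 900, 15000)
)

set.seed(1)
frame %>%
    group_by(state) %>%
    # higher-volume mills get a proportionally higher chance of selection
    slice_sample(n = 2, weight_by = mcf) %>%
    ungroup()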

The ‘filtering’ discussed above does not impact the process through which an omission threshold analysis is conducted. However it does present one caveat that must be considered prior to usage of a threshold, which will be discussed in greater detail in the ‘Data’ section below.

Data

For this project I am utilizing the TPO sample frame, created in 2018, which has been used for survey efforts from 2018-2021. Each row (observation) in the file represents a single mill, with columns highlighting a variety of information including name, address, basic contact information, reported (or estimated) volume, mill type, etc. The variables (columns) of highest interest are the volume/mill type columns. Volume units for timber products are diverse, with mills using differing units based on geographical location and product type. To adequately ‘filter’ for survey inclusion, all volumes are converted to MCF (thousand cubic feet). Throughout this analysis, units of MCF and BF (Board Feet) will be utilized; a conversion factor of 6,033 BF/MCF was used to convert from MCF to BF. Readers unfamiliar with timber-based volume units should reference this link for a better understanding.
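
As a quick illustration of that conversion (the helper names below are my own and are not part of the TPO workflow):

# Convert between MCF and BF using the 6,033 BF/MCF factor noted above
BF_per_MCF <- 6033
MCF_to_BF <- function(mcf) mcf * BF_per_MCF
BF_to_MCF <- function(bf) bf / BF_per_MCF

MCF_to_BF(10)  # 10 MCF is roughly 60,330 BF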

Now on to the aforementioned caveat. Because ‘filtering’ allows for the possibility that a mill is not surveyed for multiple years, many mills’ reported volumes are simply estimates based on nearby mills of similar size, or worse, are rollover data from the last time they did respond. On top of this, with a response rate of ~50%, even if a mill is chosen for survey inclusion every year, it may not respond. Thus, prior to utilizing any omission thresholds, further analysis of potentially omitted mills is necessary. This issue is outside the scope of this analysis, as it will need to be addressed manually by Forest Service staff, but it is important to highlight.

Reading in Data & Setting up ‘State_Code_Conversion’

The following lines of code are used to read in the dataset and to set up the ‘State_Code_Conversion’ function that will be utilized throughout the analysis. Because each mill record contains a variable, ‘MILL_STATECD’, which is simply the numeric state code for the state in which the mill is located, the ‘State_Code_Conversion’ function creates a new column, ‘MILL_STATE’, containing the two-letter abbreviation for the respective state.

# Load the packages used throughout this analysis
library(readxl)      # read_excel()
library(tidyverse)   # dplyr, ggplot2, etc.
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), pack_rows(), row_spec(), etc.
library(ggthemes)    # theme_fivethirtyeight()

TPO_Sample <- read_excel(path = "C:/Users/kenne/Documents/R_Workspace/TPO_Sample.xlsx")

State_Code_Conversion <- function(Data) {
    Data %>%
        mutate(MILL_STATE = case_when(
            MILL_STATECD == 9 ~ "CT",
            MILL_STATECD == 10 ~ "DE",
            MILL_STATECD == 17 ~ "IL",
            MILL_STATECD == 18 ~ "IN",
            MILL_STATECD == 19 ~ "IA",
            MILL_STATECD == 20 ~ "KS",
            MILL_STATECD == 23 ~ "ME",
            MILL_STATECD == 24 ~ "MD",
            MILL_STATECD == 25 ~ "MA",
            MILL_STATECD == 26 ~ "MI",
            MILL_STATECD == 27 ~ "MN",
            MILL_STATECD == 29 ~ "MO",
            MILL_STATECD == 31 ~ "NE",
            MILL_STATECD == 33 ~ "NH",
            MILL_STATECD == 34 ~ "NJ",
            MILL_STATECD == 36 ~ "NY",
            MILL_STATECD == 38 ~ "ND",
            MILL_STATECD == 39 ~ "OH",
            MILL_STATECD == 42 ~ "PA",
            MILL_STATECD == 44 ~ "RI",
            MILL_STATECD == 46 ~ "SD",
            MILL_STATECD == 50 ~ "VT",
            MILL_STATECD == 54 ~ "WV",
            MILL_STATECD == 55 ~ "WI"
        ))
}

Calculating Initial Results

The next few lines of code use the dataset to calculate a few meaningful values. For each of these calls, ‘MILL_STATUS_CD’ is filtered to only include observations listed as 1 (new), 2 (active), or NA (blank). All other values, which represent closed, idle, or out of business operations, were excluded from these measurements.

The first chunk finds the sum of all mill volumes, stored in a variable named ‘TOT_Volume’. The second chunk finds the total number of active/new mills in the sample, stored in a variable named ‘TOT_Mills’. The last chunk creates a new column named ‘High_BF’, which reads 2 if the mill has a reported volume of at least 100,000,000 BF, and 1 for all others. Within the pipe, the number of mills with a reported/estimated volume >= 100,000,000 BF is stored in a variable named ‘TOT_Mills_High_Vol’.

# Calculate the sum of all Mill Volumes

TPO_Sample %>%
    filter(is.na(MILL_STATUS_CD) | MILL_STATUS_CD == 2 | MILL_STATUS_CD == 1) %>%
    select(TOT_BF) %>%
    sum(na.rm = FALSE) -> TOT_Volume

TPO_Sample <- State_Code_Conversion(TPO_Sample)

# Calculate Total # of mills

TPO_Sample %>%
    filter(is.na(MILL_STATUS_CD) | MILL_STATUS_CD == 2 | MILL_STATUS_CD == 1) %>%
    nrow() -> TOT_Mills
TOT_Mills <- as.numeric(unlist(TOT_Mills))

# Filter mills with BF > 100,000,000

TPO_Sample %>%
    mutate(High_BF = case_when(
        TOT_BF >= 100000000 ~ 2,
        TOT_BF < 100000000 ~ 1
    )) %>%
    filter(High_BF == 2, is.na(MILL_STATUS_CD) | MILL_STATUS_CD == 2 | MILL_STATUS_CD == 1) %>%
    arrange(MILL_STATECD) %>%
    select(MILL_STATECD, MILL_NAME, TOT_BF, MILL_STATUS_CD) %>%
    nrow() -> TOT_Mills_High_Vol
TOT_Mills_High_Vol <- as.numeric(unlist(TOT_Mills_High_Vol))

Printing Initial Results
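
The original post printed the three quantities calculated above inline; those values are not reproduced here. A minimal sketch of an equivalent printing chunk:

# Print the initial results calculated above
cat("Total sample volume (BF):", format(TOT_Volume, big.mark = ","), "\n")
cat("Total active/new mills:", TOT_Mills, "\n")
cat("Mills at or above 100,000,000 BF:", TOT_Mills_High_Vol, "\n")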

Omission Threshold Functions and Setup

The chunk below jump starts the process of creating omission thresholds, first creating a new column in the dataset named ‘Volume_Code’. With thresholds incrementing by 5,000 BF, the ‘Volume_Code’ variable takes the value of 1/10,000th of its upper-bounding threshold. In other words, any mill with a reported volume between 5,000-10,000 receives a ‘1’, whereas any mill with a reported volume between 10,000-15,000 receives a ‘1.5’. Any mills at or above the final threshold value (50,000 BF) receive a ‘99’, while mills reporting 0 BF/blanks receive a ‘0’.

From here, the ‘Threshold_Data’ df is created by filtering for mills whose ‘Volume_Code’ value is not 99 or 0.

Next, two more functions are established, ‘Threshold_Mills’ and ‘Threshold_Volume’, each measuring a different value. Both take two arguments, ‘Data’ and ‘Volumes’: ‘Data’ is simply the dataset to be utilized, while ‘Volumes’ refers to a vector of threshold codes (‘Volume_Code’ values).

# Create Omission Thresholds.
# If changing thresholds, make sure to update Function_Volumes.
# If new thresholds are not 1/10000th of the top-bounding threshold,
# make sure to update the for-in loop.

TPO_Sample <- TPO_Sample %>%
    mutate(Volume_Code = case_when(
        TOT_BF >= 50000 ~ 99,
        TOT_BF < 50000 & TOT_BF >= 45000 ~ 5,
        TOT_BF < 45000 & TOT_BF >= 40000 ~ 4.5,
        TOT_BF < 40000 & TOT_BF >= 35000 ~ 4,
        TOT_BF < 35000 & TOT_BF >= 30000 ~ 3.5,
        TOT_BF < 30000 & TOT_BF >= 25000 ~ 3,
        TOT_BF < 25000 & TOT_BF >= 20000 ~ 2.5,
        TOT_BF < 20000 & TOT_BF >= 15000 ~ 2,
        TOT_BF < 15000 & TOT_BF >= 10000 ~ 1.5,
        TOT_BF < 10000 & TOT_BF >= 5000 ~ 1,
        TOT_BF < 5000 & TOT_BF > 0 ~ 0.5,
        TOT_BF == 0 ~ 0
    )) %>%
    filter(is.na(MILL_STATUS_CD) | MILL_STATUS_CD == 2 | MILL_STATUS_CD == 1)


Threshold_Data <- TPO_Sample %>%
    filter(Volume_Code != 99 & Volume_Code != 0)


# Create Functions for Mill Count &
# Volume Omission Data

Threshold_Mills <- function(Data, Volumes) {
    Data %>%
        filter(Volume_Code %in% Volumes) %>%
        select(TOT_BF) %>%
        nrow -> placeholder
    placeholder <- as.numeric(unlist(placeholder))
}

Threshold_Volume <- function(Data, Volumes) {
    Data %>%
        filter(Volume_Code %in% Volumes) %>%
        select(TOT_BF) %>%
        sum(na.rm = TRUE)
}

For-In Loop

The code snippet below first establishes ‘Function_Volumes’, a vector containing the possible ‘Volume_Code’ values remaining after 99s and 0s were removed (0.5 through 5, in increments of 0.5).

From there, a for-in loop is used to cycle through each value within ‘Function_Volumes’. With each iteration, two numeric variables are output, ‘TOT_Mill_Omit_[Threshold]’ and ‘TOT_Volume_Omit_[Threshold]’. The ‘Function_Volumes[i]*10000’ piece multiplies the code by 10,000 so that each variable is named for the threshold in question. The values of these variables are calculated by calling the respective function (Threshold_Mills or Threshold_Volume) on each iteration. By specifying a volume argument of ‘Function_Volumes[1:i]’, quantities are aggregated across all lower thresholds as well. In other words, a mill reporting 3,000 BF will not only be omitted at a 5,000 BF threshold, but also at a 10,000, 15,000, 20,000, etc. BF threshold.

# For-Loop for Mill & Volume Omission
# Data

Function_Volumes <- c(seq(0.5, 5, by = 0.5))

for (i in 1:length(Function_Volumes)) {
    assign(paste0("TOT_Mill_Omit_", Function_Volumes[i] *
        10000), value = Threshold_Mills(Threshold_Data,
        Function_Volumes[1:i]))
    assign(paste0("TOT_Volume_Omit_", Function_Volumes[i] *
        10000), value = Threshold_Volume(Threshold_Data,
        Function_Volumes[1:i]))
}
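
As an aside, the same ten cumulative counts and volumes could be collected directly into vectors rather than into individually assign()-ed variables. The sketch below is an equivalent alternative (the ‘_Alt’ names are mine), not the approach used in the rest of this post:

# Alternative: build the omission vectors directly from the functions above
Omit_Mill_Count_Alt <- sapply(seq_along(Function_Volumes),
    function(i) Threshold_Mills(Threshold_Data, Function_Volumes[1:i]))
Omit_Volume_Alt <- sapply(seq_along(Function_Volumes),
    function(i) Threshold_Volume(Threshold_Data, Function_Volumes[1:i]))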

Vector Creation

Following the for-in loop, there are 10 each of the ‘TOT_Mill_Omit_[Threshold]’ and ‘TOT_Volume_Omit_[Threshold]’ variables. From there, the Mill Count variables are gathered into a vector named ‘Omit_Mill_Count_Vector’ and the Omitted Volume variables into a vector named ‘Omit_Volume_Vector’. Additionally, two vectors of length 10, containing the total sample volume (‘TOT_Volume’) and the number of mills within the sample (‘TOT_Mills’), are also created.

Next, the ‘Omit_DF’ df is created using the 4 vectors established above, with rows named based on the threshold in question. Also, a new column (‘Percent_Omitted’) is added by dividing the Omitted Mill Count (‘Omit_Mill_Count_Vector’) by the total Mill Count in the sample (‘TOT_Mill_Vector’), and multiplying by 100.

Using the ‘Omit_DF’ df, an omission threshold table is created using the kable() function, removing the columns relating to Total Volume and Total Mill Count (4 & 5).

Again using the ‘Omit_DF’ df, a scatter/line plot is drawn based on ‘Percent_Omitted’ (x) and ‘Omit_Volume_Vector’ (y). Before doing so, the ‘Omit_Vec’ vector is created and bound to ‘Omit_DF’ using the cbind() function. The plot’s code contains a few other specifications, all of which are included to clean up the aesthetics/look of the plot.

Finally, all of the now-unneeded vectors are removed from the local environment.

# Create Vectors for Omitted Mill
# Counts and Volumes

Omit_Mill_Count_Vector <- c(TOT_Mill_Omit_5000,
    TOT_Mill_Omit_10000, TOT_Mill_Omit_15000,
    TOT_Mill_Omit_20000, TOT_Mill_Omit_25000,
    TOT_Mill_Omit_30000, TOT_Mill_Omit_35000,
    TOT_Mill_Omit_40000, TOT_Mill_Omit_45000,
    TOT_Mill_Omit_50000)

Omit_Volume_Vector <- c(TOT_Volume_Omit_5000,
    TOT_Volume_Omit_10000, TOT_Volume_Omit_15000,
    TOT_Volume_Omit_20000, TOT_Volume_Omit_25000,
    TOT_Volume_Omit_30000, TOT_Volume_Omit_35000,
    TOT_Volume_Omit_40000, TOT_Volume_Omit_45000,
    TOT_Volume_Omit_50000)

TOT_Volume_Vector <- rep(TOT_Volume, 10)

TOT_Mill_Vector <- rep(TOT_Mills, 10)



# Convert Vectors to Dataframe

Omit_DF <- data.frame(Omit_Mill_Count_Vector,
    Omit_Volume_Vector, TOT_Volume_Vector,
    TOT_Mill_Vector, row.names = c("5,000 BF cut-off",
        "10,000 BF cut-off", "15,000 BF cut-off",
        "20,000 BF cut-off", "25,000 BF cut-off",
        "30,000 BF cut-off", "35,000 BF cut-off",
        "40,000 BF cut-off", "45,000 BF cut-off",
        "50,000 BF cut-off")) %>%
    mutate(Percent_Omitted = Omit_Mill_Count_Vector/TOT_Mill_Vector *
        100)



# Produce Omission Table and Plot

kable(Omit_DF, digits = 3, align = "ccccc",
    col.names = c("Omitted Mill Count", "Omitted Volume (BF)",
        "Total Volume (BF) in Sample", "Total Mill Count in Sample",
        "Percent of Mills Omitted"), caption = "Omissions by Varying Volume Baselines",
    format.args = list(big.mark = ",", scientific = FALSE)) %>%
    kable_styling(font_size = 16) %>%
    column_spec(column = 1, bold = TRUE) %>%
    remove_column(columns = c(4, 5))
Table 1: Omissions by Varying Volume Baselines
Omitted Mill Count Omitted Volume (BF) Percent of Mills Omitted
5,000 BF cut-off 61 131,815.2 2.011
10,000 BF cut-off 212 1,090,339.7 6.990
15,000 BF cut-off 395 3,050,295.8 13.023
20,000 BF cut-off 427 3,595,382.9 14.078
25,000 BF cut-off 534 5,869,227.3 17.606
30,000 BF cut-off 599 7,560,188.1 19.749
35,000 BF cut-off 646 9,058,803.1 21.299
40,000 BF cut-off 657 9,474,340.4 21.662
45,000 BF cut-off 678 10,359,047.5 22.354
50,000 BF cut-off 694 11,116,999.6 22.882
Omit_Vec <- c(seq(5000, 50000, by = 5000))
Omit_DF <- cbind(Omit_DF, Omit_Vec)

Omit_DF %>%
    ggplot(aes(Percent_Omitted, Omit_Volume_Vector,
        color = factor(Omit_Vec))) + geom_point(size = 6) +
    geom_smooth(size = 0.8, se = FALSE, color = "black") +
    scale_x_continuous(breaks = c(seq(0,
        24, 2)), limits = c(0, 24)) + scale_y_continuous(labels = scales::comma,
    limits = c(0, 12000000), breaks = c(seq(0,
        12000000, 1000000))) + scale_color_discrete(name = "Omission Threshold (BF)") +
    labs(title = "Omitted Volumes (BF) & Mill Counts at Varying Thresholds",
        caption = "Figure 1. Omission Threshold Plot") +
    geom_text(aes(label = Omit_Mill_Count_Vector),
        size = 10, nudge_y = 400000) + theme_fivethirtyeight(base_size = 40,
    base_family = "serif") + theme(axis.title = element_text(family = "serif",
    size = 40)) + ylab("Volume (BF) Omitted") +
    xlab("% of Mills Omitted")
# Remove Unneeded Vectors

rm(Omit_Mill_Count_Vector, Omit_Volume_Vector,
    TOT_Volume_Vector, TOT_Mill_Vector, Omit_Vec,
    i, Function_Volumes)

rm(TOT_Mill_Omit_5000, TOT_Mill_Omit_10000,
    TOT_Mill_Omit_15000, TOT_Mill_Omit_20000,
    TOT_Mill_Omit_25000, TOT_Mill_Omit_30000,
    TOT_Mill_Omit_35000, TOT_Mill_Omit_40000,
    TOT_Mill_Omit_45000, TOT_Mill_Omit_50000)

rm(TOT_Volume_Omit_5000, TOT_Volume_Omit_10000,
    TOT_Volume_Omit_15000, TOT_Volume_Omit_20000,
    TOT_Volume_Omit_25000, TOT_Volume_Omit_30000,
    TOT_Volume_Omit_35000, TOT_Volume_Omit_40000,
    TOT_Volume_Omit_45000, TOT_Volume_Omit_50000)

State-Level Omission Thresholds

Creation of State-Level Omission Function

With aggregate omission threshold data (across all states), the next step in this analysis is to produce similar plots and tables at the state level. To do this, I first created a function, ‘Sum_Threshold_Volumes_by_State’, which is broken down in further detail below:

- The function requires one argument, 'Data', which represents the dataset to be utilized within the call. 
- For each value in the 'Omission Threshold (BF)' column (present in the 'Omission_Data' df created in the next chunk), the previous 'Volume Omitted (BF)' value is added to the current one. Also within the function are a few 'next' calls, specifying that the first index should be skipped, and that anytime the for-in loop moves to a new state, the first value in that state should also be skipped.
    - These 'next' calls hinge on the fact that the data is grouped/sorted by 'State' & 'Omission Threshold (BF)'; without that ordering, the function would not work as anticipated.

# Function for summing Omitted Volume
# by Threshold

Sum_Threshold_Volumes_by_State <- function(Data) {
    for (v in 1:length(Data$`Omission Threshold (BF)`)) {
        if (v == 1) {
            next
        } else if (Data$State[v] != Data$State[v - 1]) {
            next
        } else if (Data$`Omission Threshold (BF)`[v] > Data$`Omission Threshold (BF)`[v - 1]) {
            Data$`Volume Omitted (BF)`[v] <- Data$`Volume Omitted (BF)`[v] +
                Data$`Volume Omitted (BF)`[v - 1]
        } else {
            break
        }
    }
    Data
}
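
Because the rows passed to this function are already ordered by threshold within each state, the same running total could also be computed with a grouped cumsum(). The version below is an equivalent sketch under that ordering assumption (the ‘_Alt’ name is mine); it is shown for reference only and is not used elsewhere in the analysis.

# Equivalent alternative: grouped cumulative sum (assumes rows are sorted
# by 'Omission Threshold (BF)' within each 'State')
Sum_Threshold_Volumes_by_State_Alt <- function(Data) {
    Data %>%
        group_by(State) %>%
        arrange(`Omission Threshold (BF)`, .by_group = TRUE) %>%
        mutate(`Volume Omitted (BF)` = cumsum(`Volume Omitted (BF)`)) %>%
        ungroup()
}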

Tidying of State-Level Data Frame

The next step in the analysis is to create a state-level df containing omission threshold mill counts and omitted volumes. I first used the ‘State_Code_Conversion’ function to convert state codes to the respective state abbreviation within the ‘Threshold_Data’ df. Next, I created a df containing total reported/estimated volumes for each state (‘State_Volumes’). From there I created the ‘Omission_Data’ df by grouping by State & omission threshold, summing volume (‘TOT_BF’) for each grouping, creating the ‘Omission Threshold (BF)’ column/dropping the ‘Volume_Code’ column, and finally joining the state volume totals (‘State_Volumes’) to the df with a left_join() call. The df is then used as an input to the ‘Sum_Threshold_Volumes_by_State’ function discussed above.

Following usage of the function, a new variable (‘% of Volume Omitted’) is added by dividing the omitted volume by the total state volume, and columns are reordered.

At this point the df contains state-level omitted volume data, though it does not contain omitted mill count information. Thus, the ‘State_Omit_Count_Levels’ df is created by selecting the columns of interest from ‘Threshold_Data’, grouping by State (‘MILL_STATECD’) and Omission Threshold (‘Volume_Code’) to find the # of mills omitted at each threshold, converting the ‘Volume_Code’ variable to BF units by multiplying the value by 10,000, and then using the ‘State_Code_Conversion’ function to convert state codes to the respective state abbreviations. Once created, this df is also joined to ‘Omission_Data’, again using a left_join() call.

Finally, a ‘State_Count’ variable is added by summing omitted mill counts, grouped by state. This variable represents the total number of mills omitted at the upper-most volume threshold applied to the state in question.

Threshold_Data <- State_Code_Conversion(Threshold_Data)

# Find Total State Volumes
State_Volumes <- TPO_Sample %>%
    select("MILL_STATE", "TOT_BF", "TOT_BF_LOG") %>%
    group_by(MILL_STATE) %>%
    summarize(State_Volume = sum(TOT_BF))

# Create Omitted Volume by State DF
Omission_Data <- Threshold_Data %>%
    group_by(MILL_STATE, Volume_Code) %>%
    summarize(`Volume Omitted (BF)` = sum(TOT_BF)) %>%
    mutate(`Omission Threshold (BF)` = Volume_Code *
        10000) %>%
    select(-Volume_Code) %>%
    left_join(State_Volumes, by = "MILL_STATE") %>%
    rename(`Total State Volume (BF)` = "State_Volume",
        State = "MILL_STATE") %>%
    ungroup()

# Sum Volumes by State/Threshold
Omission_Data <- Sum_Threshold_Volumes_by_State(Omission_Data)

# Create % field comparing Omitted
# Threshold Volume to Total State
# Volume
Omission_Data <- Omission_Data %>%
    mutate(`% of Volume Omitted` = `Volume Omitted (BF)`/`Total State Volume (BF)` *
        100)

# Reorder columns
col_order <- c("State", "Omission Threshold (BF)",
    "Volume Omitted (BF)", "Total State Volume (BF)",
    "% of Volume Omitted")

Omission_Data <- Omission_Data[, col_order]
rm(col_order)

# Find State-by-State Omitted Mill
# Counts
State_Omit_Count_Levels <- Threshold_Data %>%
    select(MILL_STATECD, MTC:TOT_BF_LOG,
        Volume_Code) %>%
    group_by(MILL_STATECD, Volume_Code) %>%
    summarize(Mill_Count = n()) %>%
    mutate(Volume_Code = Volume_Code * 10000)
State_Omit_Count_Levels <- State_Code_Conversion(State_Omit_Count_Levels)

# Join Omission Data with Omitted Mill
# Counts
Omission_Data <- Omission_Data %>%
    left_join(State_Omit_Count_Levels, by = c(State = "MILL_STATE",
        `Omission Threshold (BF)` = "Volume_Code"))

Omission_Data <- Omission_Data %>%
    group_by(State) %>%
    mutate(State_Count = sum(Mill_Count)) %>%
    select(-MILL_STATECD)

Visualization

State-Level Omission Threshold Plots, Table, & Code

Using the ‘Omission_Data’ df, a few plots are created, along with a table depicting the information within the df.

Omission_Data %>%
    ggplot(aes(State, `Volume Omitted (BF)`,
        color = factor(`Omission Threshold (BF)`))) +
    geom_point(size = 6) + scale_y_continuous(labels = scales::comma,
    limits = c(0, 1400000), breaks = c(seq(0,
        1400000, 100000))) + scale_color_discrete(name = "Omission Threshold (BF)") +
    labs(title = "Omitted Volumes (BF) at Varying Thresholds by State",
        y = "Omitted Volume (BF)", x = "State",
        caption = "Figure 2. Omitted Volumes (BF) at Varying Thresholds by State") +
    theme_fivethirtyeight(base_size = 40,
        base_family = "serif") + theme(axis.title = element_text(family = "serif",
    size = 40)) + ylab("Volume (BF) Omitted") +
    xlab("State")


Omission_Data %>%
    ggplot(aes(0, `Volume Omitted (BF)`,
        fill = factor(`Omission Threshold (BF)`))) +
    geom_col(position = "dodge") + scale_y_continuous(labels = scales::comma) +
    scale_fill_discrete(name = "Omission Threshold (BF)") +
    facet_wrap(facets = vars(State), scales = "free_y",
        ncol = 6) + labs(title = "Omitted Volumes (BF) at Varying Thresholds by State",
    y = "Omitted Volume (BF)", x = "State",
    caption = "Figure 3. Omitted Volumes (BF) at Varying Thresholds by State") +
    theme_fivethirtyeight(base_size = 40,
        base_family = "serif") + theme(axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), axis.title = element_text(family = "serif",
        size = 40)) + ylab("Volume (BF) Omitted")



kable(Omission_Data, digits = 3, format.args = list(big.mark = ",",
    scientific = FALSE), col.names = c("State",
    "Cut-off (BF)", "Volume Omitted (BF)",
    "Total State Volume (BF)", "% of Volume Omitted",
    "# of Mills Omitted", "# of Mills Omitted Statewide"),
    align = "ccccccc", caption = "Omitted Volumes (BF) at Varying Thresholds by State. States highlighted in yellow contain at least one mill with a reported volume >= 100,000,000 BF") %>%
    kable_styling(font_size = 16) %>%
    pack_rows(index = table(Omission_Data$State),
        group_label = Omission_Data$State_Count) %>%
    remove_column(1) %>%
    row_spec(c(19:37, 44:63, 94:107, 115:124),
        background = "yellow")
Table 2: Omitted Volumes (BF) at Varying Thresholds by State. States highlighted in yellow contain at least one mill with a reported volume >= 100,000,000 BF
Cut-off (BF) Volume Omitted (BF) Total State Volume (BF) % of Volume Omitted # of Mills Omitted # of Mills Omitted Statewide
CT
30,000 25,015.350 45,521,763.7 0.055 1 2
45,000 65,039.910 45,521,763.7 0.143 1 2
DE
10,000 7,398.933 10,921,463.5 0.068 1 6
15,000 47,423.493 10,921,463.5 0.434 4 6
35,000 79,987.779 10,921,463.5 0.732 1 6
IA
5,000 11,389.783 88,530,183.9 0.013 3 37
10,000 119,050.783 88,530,183.9 0.134 17 37
15,000 175,159.903 88,530,183.9 0.198 5 37
20,000 193,094.488 88,530,183.9 0.218 1 37
35,000 478,079.488 88,530,183.9 0.540 9 37
40,000 517,994.060 88,530,183.9 0.585 1 37
45,000 558,661.499 88,530,183.9 0.631 1 37
IL
5,000 10,476.345 104,778,116.4 0.010 4 70
10,000 338,027.805 104,778,116.4 0.323 51 70
15,000 420,757.289 104,778,116.4 0.402 6 70
20,000 489,612.933 104,778,116.4 0.467 4 70
30,000 600,154.872 104,778,116.4 0.573 4 70
45,000 643,462.002 104,778,116.4 0.614 1 70
IN
5,000 17,415.750 900,044,946.7 0.002 8 35
10,000 55,793.730 900,044,946.7 0.006 5 35
15,000 193,599.810 900,044,946.7 0.022 13 35
20,000 210,952.230 900,044,946.7 0.023 1 35
25,000 232,674.420 900,044,946.7 0.026 1 35
30,000 261,742.890 900,044,946.7 0.029 1 35
35,000 294,041.190 900,044,946.7 0.033 1 35
40,000 367,503.990 900,044,946.7 0.041 2 35
45,000 455,596.020 900,044,946.7 0.051 2 35
50,000 505,183.410 900,044,946.7 0.056 1 35
KS
5,000 6,257.004 269,798,257.0 0.002 3 38
10,000 43,343.052 269,798,257.0 0.016 5 38
15,000 116,349.876 269,798,257.0 0.043 6 38
20,000 152,618.967 269,798,257.0 0.057 2 38
25,000 477,020.559 269,798,257.0 0.177 16 38
30,000 530,667.402 269,798,257.0 0.197 2 38
35,000 593,908.740 269,798,257.0 0.220 2 38
40,000 630,051.171 269,798,257.0 0.234 1 38
50,000 675,294.123 269,798,257.0 0.250 1 38
MA
10,000 105,064.470 51,750,642.2 0.203 21 74
15,000 550,337.700 51,750,642.2 1.063 44 74
20,000 565,346.910 51,750,642.2 1.092 1 74
25,000 625,383.750 51,750,642.2 1.208 3 74
30,000 725,445.150 51,750,642.2 1.402 4 74
50,000 770,472.780 51,750,642.2 1.489 1 74
MD
15,000 170,104.380 295,237,208.9 0.058 17 17
ME
25,000 1,033,545.600 2,714,122,890.9 0.038 48 48
MI
5,000 5,947.472 2,896,285,457.3 0.000 6 65
10,000 33,283.729 2,896,285,457.3 0.001 5 65
15,000 195,781.350 2,896,285,457.3 0.007 13 65
20,000 311,496.291 2,896,285,457.3 0.011 7 65
25,000 379,106.914 2,896,285,457.3 0.013 3 65
30,000 839,654.701 2,896,285,457.3 0.029 18 65
35,000 1,064,817.021 2,896,285,457.3 0.037 7 65
40,000 1,102,526.020 2,896,285,457.3 0.038 1 65
45,000 1,188,387.703 2,896,285,457.3 0.041 2 65
50,000 1,331,864.668 2,896,285,457.3 0.046 3 65
MN
5,000 7,662.930 388,607,516,082.7 0.000 4 44
10,000 43,001.070 388,607,516,082.7 0.000 6 44
15,000 160,224.900 388,607,516,082.7 0.000 11 44
20,000 177,197.340 388,607,516,082.7 0.000 1 44
25,000 427,667.490 388,607,516,082.7 0.000 12 44
30,000 554,897.460 388,607,516,082.7 0.000 5 44
35,000 622,280.580 388,607,516,082.7 0.000 2 44
45,000 749,447.220 388,607,516,082.7 0.000 3 44
MO
5,000 31,531.199 926,909,967.1 0.003 13 65
10,000 81,129.507 926,909,967.1 0.009 6 65
15,000 173,256.641 926,909,967.1 0.019 8 65
20,000 226,925.909 926,909,967.1 0.024 3 65
25,000 314,351.169 926,909,967.1 0.034 4 65
30,000 592,894.043 926,909,967.1 0.064 10 65
35,000 944,090.248 926,909,967.1 0.102 11 65
40,000 1,055,821.906 926,909,967.1 0.114 3 65
45,000 1,185,157.599 926,909,967.1 0.128 3 65
50,000 1,376,301.516 926,909,967.1 0.148 4 65
ND
5,000 6,003.684 678,416.3 0.885 2 4
10,000 13,007.982 678,416.3 1.917 1 4
20,000 28,017.192 678,416.3 4.130 1 4
NE
5,000 14,480.770 11,939,191.0 0.121 7 34
10,000 93,039.080 11,939,191.0 0.779 10 34
15,000 128,524.533 11,939,191.0 1.076 3 34
20,000 178,596.635 11,939,191.0 1.496 3 34
25,000 225,441.015 11,939,191.0 1.888 2 34
30,000 251,072.566 11,939,191.0 2.103 1 34
35,000 380,004.466 11,939,191.0 3.183 4 34
45,000 422,115.750 11,939,191.0 3.536 1 34
50,000 566,411.146 11,939,191.0 4.744 3 34
NH
10,000 6,333.000 208,167,555.2 0.003 1 7
15,000 16,339.140 208,167,555.2 0.008 1 7
30,000 146,798.940 208,167,555.2 0.071 5 7
NJ
5,000 4,323.252 1,179,267.8 0.367 4 18
10,000 10,113.070 1,179,267.8 0.858 1 18
15,000 120,180.610 1,179,267.8 10.191 11 18
20,000 156,500.365 1,179,267.8 13.271 2 18
NY
30,000 150,092.100 350,965,360.5 0.043 6 6
OH
5,000 3,883.763 775,600,293.8 0.001 2 51
10,000 61,819.313 775,600,293.8 0.008 10 51
15,000 302,967.287 775,600,293.8 0.039 24 51
20,000 332,985.707 775,600,293.8 0.043 2 51
25,000 414,035.441 775,600,293.8 0.053 4 51
30,000 514,096.841 775,600,293.8 0.066 4 51
35,000 574,133.681 775,600,293.8 0.074 2 51
45,000 656,184.029 775,600,293.8 0.085 2 51
50,000 701,211.659 775,600,293.8 0.090 1 51
PA
10,000 10,006.140 1,689,273,242.9 0.001 2 13
15,000 57,492.744 1,689,273,242.9 0.003 4 13
25,000 120,031.119 1,689,273,242.9 0.007 3 13
40,000 157,982.572 1,689,273,242.9 0.009 1 13
45,000 281,058.094 1,689,273,242.9 0.017 3 13
RI
25,000 20,012.280 2,111,168.9 0.948 1 2
40,000 59,910.180 2,111,168.9 2.838 1 2
SD
35,000 64,890.157 92,285,937.3 0.070 2 2
VT
5,000 500.307 138,292,273.1 0.000 1 15
10,000 31,519.341 138,292,273.1 0.023 5 15
15,000 91,556.181 138,292,273.1 0.066 6 15
35,000 181,611.441 138,292,273.1 0.131 3 15
WI
5,000 11,942.903 1,505,096,118.8 0.001 4 38
10,000 38,408.656 1,505,096,118.8 0.003 4 38
15,000 117,231.960 1,505,096,118.8 0.008 7 38
20,000 189,122.010 1,505,096,118.8 0.013 4 38
25,000 407,309.332 1,505,096,118.8 0.027 10 38
30,000 457,340.032 1,505,096,118.8 0.030 2 38
35,000 525,191.794 1,505,096,118.8 0.035 2 38
40,000 563,919.328 1,505,096,118.8 0.037 1 38
45,000 646,934.072 1,505,096,118.8 0.043 2 38
50,000 741,084.281 1,505,096,118.8 0.049 2 38
WV
30,000 50,030.700 775,772,253.0 0.006 2 3
35,000 80,049.120 775,772,253.0 0.010 1 3

Histograms of Log(10) Volume

Creation of Data Frames

Below, the ‘BF_Plot’, ‘High_Vol_by_State’, and ‘High_Vol_by_State_Table’ dfs are created using a few dplyr functions in relation to the original dataset.

# Convert State Codes to Strings and
# Create BF_Plot

BF_Plot <- TPO_Sample %>%
    State_Code_Conversion() %>%
    filter(is.na(MILL_STATUS_CD) | MILL_STATUS_CD ==
        2 | MILL_STATUS_CD == 1)

# Create data frame for High Volume
# Mills

High_Vol_by_State <- BF_Plot %>%
    group_by(MILL_STATE) %>%
    filter(TOT_BF_LOG >= 8) %>%
    summarize(High_Volume_Mills = n())

High_Vol_by_State_Table <- BF_Plot %>%
    filter(TOT_BF_LOG >= 8) %>%
    select(MILL_NBR:MTC, TOT_MCF:TOT_BF_LOG,
        MILL_STATUS_CD, MILL_STATE) %>%
    arrange(MILL_STATECD, desc(TOT_BF_LOG))

Histogram Code & Output

The code below produces two histograms of reported volumes, using Logged MCF and Logged BF. Both are univariate, utilizing the columns “TOT_MCF_LOG” and “TOT_BF_LOG” respectively. These plots help to answer questions regarding the presence of reported mill volumes that are far outside of a normal range. A “normal” log MCF value would be < 4.2, while a “normal” log BF value would be < 8.

In taking a quick glance at the Logged BF plot, it is clear that there are at least 83 mills with reported volumes far outside of a “normal” range.
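
That count can also be checked directly against the data; a quick sketch using the existing ‘TOT_BF_LOG’ column:

# Count mills whose Log(10) BF volume is at or above 8 (i.e. >= 100,000,000 BF)
sum(BF_Plot$TOT_BF_LOG >= 8, na.rm = TRUE)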

# Create a histogram of Mill Volumes
# (Logged MCF)

Hist_Breaks_MCF <- c(seq(from = -3, to = 8,
    by = 0.5))

# hist(unlist(as.vector(unclass(BF_Plot['TOT_MCF_LOG']))),
# xlim = c(-3, 8), ylim = c(0, 600),
# breaks = Hist_Breaks_MCF, labels =
# TRUE, ylab = 'Count', xlab = 'Log(10)
# of MCF', main = 'Log(10) of MCF
# Histogram') %>% axis(side = 1, at =
# Hist_Breaks_MCF, labels =
# Hist_Breaks_MCF)

ggplot(BF_Plot, aes(TOT_MCF_LOG, label = ..count..),
    xlim = c(0, 8), ylim = c(0, 600)) + geom_histogram(breaks = Hist_Breaks_MCF,
    bins = 16) + scale_x_continuous(breaks = Hist_Breaks_MCF) +
    labs(title = "Histogram of Log(10) MCF",
        caption = "Figure 4. Sample-Size Histogram of Log(10) MCF") +
    geom_text(stat = "bin", nudge_y = 5,
        size = 10, breaks = Hist_Breaks_MCF) +
    theme(text = element_text(family = "serif",
        size = 40)) + theme_fivethirtyeight(base_size = 40,
    base_family = "serif") + theme(axis.title = element_text(family = "serif",
    size = 40)) + ylab("Mill Count") + xlab("Log(10) MCF")
# Create a histogram of Mill Volumes
# (Logged BF)

Hist_Breaks_BF <- c(seq(from = 0, to = 12,
    by = 1))

# hist(unlist(as.vector(unclass(BF_Plot['TOT_BF_LOG']))),
# xlim = c(0, 12), ylim = c(0, 1200),
# breaks = Hist_Breaks_BF,labels =
# TRUE, ylab = 'Count', xlab = 'Log(10)
# of BF', main = 'Log(10) of BF
# Histogram') %>% axis(side = 1, at =
# Hist_Breaks_BF, labels =
# Hist_Breaks_BF)

ggplot(BF_Plot, aes(TOT_BF_LOG, label = ..count..),
    xlim = c(0, 12), ylim = c(0, 1200)) +
    geom_histogram(breaks = Hist_Breaks_BF,
        binwidth = 1) + scale_x_continuous(breaks = Hist_Breaks_BF) +
    labs(title = "Histogram of Log(10) BF",
        caption = "Figure 5. Sample-Size Histogram of Log(10) BF") +
    geom_text(stat = "bin", nudge_y = 10,
        size = 10, breaks = Hist_Breaks_BF) +
    theme_fivethirtyeight(base_size = 40,
        base_family = "serif") + theme(axis.title = element_text(family = "serif",
    size = 40)) + ylab("Mill Count") + xlab("Log(10) BF")

State Level Volume Plots

The scatter plots/box plots below utilize a host of variables, most of which required some sort of tidying prior to visualization.

Log(10) BF Volume Box & Scatter Plots

These plots visualize reported mill volumes on a state-by-state basis. By breaking these volumes down by state, our FS partners are able to hone in on states with the highest number of volume outliers (i.e. Log(10) BF > 8). From the plots, it is clear that there are a few states that will need to be reviewed; these include ME, MI, & MN, among others.

# Create State-by-State boxplots and
# scatter plots for Logged BF Volume

ggplot(BF_Plot, aes(x = MILL_STATE, y = TOT_BF_LOG,
    color = MILL_STATE)) + geom_point(size = 4) +
    scale_y_continuous(breaks = Hist_Breaks_BF,
        limits = c(0, 12)) + scale_color_discrete(name = "State") +
    labs(title = "Log(10) of BF by State",
        y = "Log(10) of BF", x = "State",
        caption = "Figure 6. Reported Mill Volumes (Log10 of BF) by State") +
    theme(text = element_text(family = "serif",
        size = 40))


ggplot(BF_Plot, aes(x = MILL_STATE, y = TOT_BF_LOG,
    color = MILL_STATE)) + geom_boxplot(outlier.size = 4,
    lwd = 2) + scale_y_continuous(breaks = Hist_Breaks_BF,
    limits = c(0, 12)) + scale_color_discrete(name = "State") +
    labs(title = "Log(10) of BF by State",
        y = "Log(10) of BF", x = "State",
        caption = "Figure 7. Boxplot of Reported Mill Volumes (Log10 of BF) by State") +
    theme(text = element_text(family = "serif",
        size = 40))

High Volume Plot

This bar chart depicts the number of volume-outliers (i.e. Log(10) BF > 8) on a state-by-state basis. States with no volume-outliers were not included in the plot.

# Create Bar Chart for High Volume
# Mills by State

ggplot(High_Vol_by_State, aes(x = MILL_STATE,
    y = High_Volume_Mills)) + geom_col() +
    geom_text(aes(label = High_Volume_Mills),
        size = 14, nudge_y = 1) + scale_y_continuous(breaks = c(seq(from = 0,
    to = 65, by = 5))) + theme_fivethirtyeight(base_size = 40,
    base_family = "serif") + theme(axis.title = element_text(family = "serif",
    size = 40)) + ylab("High Volume Mills") +
    xlab("State") + labs(title = "High Volume Mills (Log(10) BF > 8) by State",
    caption = "Figure 8. High Volume Mills (>100,000,000 BF) by State")

Mill Type Plot

This bar chart depicts a breakdown of Mill “Types” on a state-by-state basis.

# Create MTC_Summary

MTC_summary <- BF_Plot %>%
    mutate(MTC_Tidy = case_when(
        MTC %in% c(10, 21, 22, 23, 24, 25, 26) ~ "Sawmill",
        MTC == 20 ~ "Veneer Mill",
        MTC %in% c(30, 40) ~ "Pulp/Composite Mill",
        MTC == 50 ~ "Bioenergy Mill",
        MTC == 60 ~ "House/Cabin Mill",
        MTC %in% c(70, 80, 90) ~ "Post/Pole/Piling Mill",
        MTC == 100 ~ "Residential Firewood Mill",
        MTC %in% c(110, 111) ~ "Other/Misc Mill",
        MTC == 117 ~ "Concentration Yard",
        is.na(MTC) ~ "Not Provided"
    )) %>%
    group_by(MILL_STATE, MTC_Tidy) %>%
    arrange(as.factor(MTC_Tidy)) %>%
    summarize(Mill_Type_Count = n())



# Create bar chart of Mill Type Counts
# by State

MTC_Tidy_Factor_Levels <- c("Sawmill", "Veneer Mill",
    "Pulp/Composite Mill", "Bioenergy Mill",
    "House/Cabin Mill", "Post/Pole/Piling Mill",
    "Residential Firewood Mill", "Other/Misc Mill",
    "Concentration Yard", "Not Provided")


ggplot(MTC_summary, aes(x = MILL_STATE, y = Mill_Type_Count,
    fill = factor(MTC_Tidy, levels = MTC_Tidy_Factor_Levels))) +
    geom_col(position = "stack", width = 0.8) +
    labs(title = "Mill Type Counts by State",
        caption = "Figure 9. Mill Type Counts by State") +
    scale_y_continuous(limits = c(0, 425),
        breaks = c(seq(from = 0, to = 425,
            by = 25))) + scale_fill_discrete(name = "Mill Type") +
    theme_fivethirtyeight(base_size = 40,
        base_family = "serif") + theme(axis.title = element_text(family = "serif",
    size = 40)) + ylab("Mill Count") + xlab("State")

Summary Statistics

State Level Means, Medians, & St. Deviations

The following chunk uses dplyr functions to measure a few meaningful values on a state-by-state basis.

# Calculate means, medians, and
# standard deviations of logged volumes
# by state

TPO_summary <- BF_Plot %>%
    select(MILL_STATECD, MILL_STATE, MTC,
        TOT_MCF_LOG, TOT_BF_LOG) %>%
    group_by(MILL_STATE) %>%
    summarize(meanBF = mean(TOT_BF_LOG, na.rm = TRUE),
        medianBF = median(TOT_BF_LOG, na.rm = TRUE),
        meanMCF = mean(TOT_MCF_LOG, na.rm = TRUE),
        medianMCF = median(TOT_MCF_LOG, na.rm = TRUE),
        St_Dev = sd(TOT_BF_LOG, na.rm = TRUE),
        NumberofMills = n())

# Create table for means, medians, and
# standard deviations

kable(TPO_summary, digits = 4, align = "ccccccc",
    col.names = c("State", "Mean Log(10) BF",
        "Median Log(10) BF", "Mean Log(10) MCF",
        "Median Log(10) MCF", "Standard Deviation",
        "Mill Count"), caption = "Volume-based means, medians, and standard deviations for all mills (omissions not removed) by State. States highlighted in yellow contain at least one mill with a reported volume >= 100,000,000 BF") %>%
    kable_styling(font_size = 16) %>%
    row_spec(c(5, 6, 8, 9:11, 18, 19, 23),
        background = "yellow")
Table 3: Volume-based means, medians, and standard deviations for all mills (omissions not removed) by State. States highlighted in yellow contain at least one mill with a reported volume >= 100,000,000 BF
State Mean Log(10) BF Median Log(10) BF Mean Log(10) MCF Median Log(10) MCF Standard Deviation Mill Count
CT 5.5946 5.4378 1.7930 1.6362 0.8204 22
DE 5.0969 4.8016 1.2953 1.0000 1.0085 15
IA 4.9398 4.5509 1.1382 0.7492 1.1110 70
IL 4.7821 4.4616 0.9805 0.6600 1.0647 136
IN 5.7586 6.1617 1.9570 2.3601 1.1764 147
KS 4.5832 4.3013 0.7816 0.4997 0.9912 48
MA 4.5912 4.0003 0.7896 0.1987 0.9057 120
MD 5.5059 5.6701 1.7043 1.8685 1.3804 43
ME 5.5991 4.8144 1.7974 1.0128 1.2836 162
MI 5.7111 5.7119 1.9095 1.9103 1.0814 354
MN 6.2930 6.0091 2.4914 2.2075 1.7166 246
MO 5.7202 5.8943 1.9186 2.0927 0.8908 419
ND 4.1555 3.8454 0.3539 0.0438 0.9736 5
NE 4.4404 4.3277 0.6388 0.5261 0.9163 46
NH 5.8028 5.9649 2.0011 2.1633 1.0444 43
NJ 4.0423 4.0003 0.2407 0.1987 0.8522 23
NY 5.7856 5.7784 1.9840 1.9768 0.8405 139
OH 5.5961 5.8753 1.7945 2.0737 1.0604 219
PA 5.9581 6.1510 2.1565 2.3494 0.7736 402
RI 4.9757 4.6501 1.1741 0.8485 0.8998 4
SD 5.8746 5.7797 2.0729 1.9780 0.9865 17
VT 5.4019 5.2720 1.6002 1.4704 1.0931 61
WI 5.8532 5.9688 2.0516 2.1672 1.0782 226
WV 6.5522 6.6467 2.7506 2.8451 0.8693 66

Conclusions & Next Steps
