HW5 for DACSS 601.
For my final project I’ll be using a data set from the Research Lab (FFRC) I’ve been working with for the past few years. In short, the FFRC, in coordination with the USDA Forest Service, implements an annual survey (TPO) which attempts to track timber procurement and lumber production at mills throughout a host of states. I’ll be running analyses to see how shifting the baseline for minimum volume procured would impact the number of mills and total volume of procured timber included in the sample frame. Generally speaking, this sort of analysis will attempt to answer the following research questions:
There will likely be further questions that will need to be analyzed, however these questions outline the main priorities of data analysis for this project.
HW3_Data <- read_excel(path = "C:/Users/kenne/Documents/R_Workspace/TPO_Sample.xlsx")
#Calculate the sum of all Mill Volumes
HW3_Data %>%
filter(is.na(MILL_STATUS_CD)| MILL_STATUS_CD == 2| MILL_STATUS_CD == 1) %>%
select(TOT_BF)%>%
sum(na.rm = FALSE) -> TOT_Volume
#Calculate Total # of mills
HW3_Data %>%
filter(is.na(MILL_STATUS_CD)| MILL_STATUS_CD == 2| MILL_STATUS_CD == 1)%>%
nrow() -> TOT_Mills
TOT_Mills <- as.numeric(unlist(TOT_Mills[1]))
#Calculate the sum omitted if omitting mills below 20,000 BF (not including blanks)
HW3_Data %>%
mutate(Volume_Code = case_when(TOT_BF > 20000 ~ 2,
TOT_BF <= 20000 & TOT_BF > 0 ~ 1,
TOT_BF == 0 ~ 0))%>%
filter(Volume_Code == 1, is.na(MILL_STATUS_CD)| MILL_STATUS_CD == 2| MILL_STATUS_CD == 1) %>%
select(TOT_BF) %>%
sum(na.rm = TRUE) -> TOT_Volume_Omit
#Calculate number of mills if omitting mills below 20,000 BF (not including blanks)
HW3_Data %>%
mutate(Volume_Code = case_when(TOT_BF > 20000 ~ 2,
TOT_BF <= 20000 & TOT_BF > 0 ~ 1,
TOT_BF == 0 ~ 0))%>%
filter(Volume_Code == 1, is.na(MILL_STATUS_CD)|MILL_STATUS_CD == 2|MILL_STATUS_CD == 1) %>%
select(TOT_BF) %>%
nrow -> TOT_Mills_Omit
TOT_Mills_Omit <- as.numeric(unlist(TOT_Mills_Omit[1]))
#Filter mills with BF > 100,000,000
HW3_Data %>%
mutate(High_BF = case_when(TOT_BF >= 100000000 ~ 2,
TOT_BF < 100000000 ~ 1)) %>%
filter(High_BF == 2, is.na(MILL_STATUS_CD)|MILL_STATUS_CD == 2|MILL_STATUS_CD == 1)%>%
arrange(MILL_STATECD)%>%
select(MILL_STATECD, MILL_NAME, TOT_BF, MILL_STATUS_CD) %>%
nrow -> TOT_Mills_High_Vol
TOT_Mills_High_Vol <- as.numeric(unlist(TOT_Mills_High_Vol[1]))
Mill_Count_If_Omit <- TOT_Mills - TOT_Mills_Omit
Percent_Omit <- 100*(TOT_Mills_Omit/TOT_Mills)
#Convert State Codes to Strings and Create BF_plot
BF_plot <- data.frame(HW3_Data)%>%
mutate(MILL_STATE = case_when(MILL_STATECD == 9 ~ "CT",
MILL_STATECD == 10 ~ "DE",
MILL_STATECD == 17 ~ "IL",
MILL_STATECD == 18 ~ "IN",
MILL_STATECD == 19 ~ "IA",
MILL_STATECD == 20 ~ "KS",
MILL_STATECD == 23 ~ "ME",
MILL_STATECD == 24 ~ "MD",
MILL_STATECD == 25 ~ "MA",
MILL_STATECD == 26 ~ "MI",
MILL_STATECD == 27 ~ "MN",
MILL_STATECD == 29 ~ "MO",
MILL_STATECD == 31 ~ "NE",
MILL_STATECD == 33 ~ "NH",
MILL_STATECD == 34 ~ "NJ",
MILL_STATECD == 36 ~ "NY",
MILL_STATECD == 38 ~ "ND",
MILL_STATECD == 39 ~ "OH",
MILL_STATECD == 42 ~ "PA",
MILL_STATECD == 44 ~ "RI",
MILL_STATECD == 46 ~ "SD",
MILL_STATECD == 50 ~ "VT",
MILL_STATECD == 54 ~ "WV",
MILL_STATECD == 55 ~ "WI")) %>%
filter(is.na(MILL_STATUS_CD)|MILL_STATUS_CD == 2|MILL_STATUS_CD == 1)
These histograms display the logged values of reported volumes for mills across the Northern United States. Both are univariate, utilizing the columns “TOT_MCF_LOG” and “TOT_BF_LOG” respectively. These plots will help to answer questions regarding the presence of reported mill volumes that are far outside of a normal range. A “normal” log MCF value would be < 4.2, while a “normal” log BF value would be < 8. In other words, any mills with reported log volumes above these thresholds will need to be reviewed and amended by USDA Forest Service Staff.
In taking a quick glance at the Logged BF, it is clear that there are at least 83 mills with reported volumes far outside of a “normal” range. Thus, these data sets will need review from partners within the USDA Forest Service.
A naive viewer would have trouble interpreting these plots given the units at hand. Without specifying what MCF (Thousand Cubic Feet) and BF (Board Feet) refer to, a naive viewer would likely have little clue as to what the histograms are displaying. However, because this information will be utilized by those within the Forest Service, defining these units is not necessary.
#Create a histogram of Mill Volumes (Logged MTC)
MCF_DF <- data.frame(BF_plot['TOT_MCF_LOG'])
MCF_List <- unclass(MCF_DF)
MCF_Vect <- as.vector(unlist(MCF_List[[1]]))
Hist_Breaks_MCF <- c(seq(from = -3, to = 8, by = .5))
hist.default(MCF_Vect, xlim = c(-3, 8), ylim = c(0, 600), breaks = Hist_Breaks_MCF, labels = TRUE, ylab = 'Count', xlab = 'Log(10) of MCF', main = 'Log(10) of MCF Histogram')
#Create a histogram of Mill Volumes (Logged BF)
BF_DF <- data.frame(BF_plot['TOT_BF_LOG'])
BF_List <- unclass(BF_DF)
BF_Vect <- as.vector(unlist(BF_List[[1]]))
Hist_Breaks_BF <- c(seq(from = 0, to = 12, by = 1))
hist.default(BF_Vect, xlim = c(0, 12), ylim = c(0, 1200), breaks = Hist_Breaks_BF, labels = TRUE, ylab = 'Count', xlab = 'Log(10) of BF', main = 'Log(10) of BF Histogram')
The scatter plots/box plots below utilize a host of variables, most of which required some sort of tidying prior to visualization.
In terms of how a naive viewer would interpret these, the same issues as with the above histograms arise. Namely, because some of the plots are visualizing logged BF (Board Feet), someone lacking insight into the lumber industry would likely be left clueless if attempting to draw meaningful conclusions.
These plots visualize reported mill volumes’ on a state-by-state basis. By breaking these volumes down by state, our FS partners are able to hone-in on states with the highest number of volume-outliers (i.e. Log(10) BF > 8). From the plots, it is clear that there are a few states that will need to be reviewed, these include ME, MI, & MN, among others.
#Create State-by-State boxplots and scatter plots for Logged BF Volume
ggplot(BF_plot, aes(x = MILL_STATE, y = TOT_BF_LOG, color = MILL_STATE)) +
geom_point() +
scale_y_continuous(breaks = Hist_Breaks_BF, limits = c(0,12)) +
labs(title = "Log(10) of BF by State", y = "Log(10) of BF", x = "State") +
theme(text = element_text(family = "serif"))
ggplot(BF_plot, aes(x = MILL_STATE, y = TOT_BF_LOG, color = MILL_STATE)) +
geom_boxplot() +
scale_y_continuous(breaks = Hist_Breaks_BF, limits = c(0,12)) +
labs(title = "Log(10) of BF by State", y = "Log(10) of BF", x = "State") +
theme(text = element_text(family = "serif"))
These plots visualize the number of mills and aggregate volume (on a state-by-state basis) that would be removed if a minimum volume of 20,000 BF was set as a baseline for survey inclusion. States without omissions were not included in the plots.
#Create Bar Charts for Omitted Mills/Volumes by State
Omitted_Mills <- BF_plot %>%
group_by(MILL_STATE) %>%
filter(TOT_BF_LOG < 4.30103) %>%
summarize("Omitted_Mill_Count" = n(), "Volume_Omitted" = sum(TOT_BF))
ggplot(Omitted_Mills, aes(x = MILL_STATE, y = Omitted_Mill_Count)) +
geom_col(width = 0.5) +
geom_text(aes(label = Omitted_Mill_Count), nudge_y = 2.5) +
scale_y_continuous(breaks = c(seq(from = 0, to = 60, by = 5))) +
labs(title = "Omitted Mills by State", y = "Omitted Mills", x = "State") +
theme(text = element_text(family = "serif"))
ggplot(Omitted_Mills, aes(x = MILL_STATE, y = Volume_Omitted)) +
geom_col(width = 0.5) +
geom_text(size = 3.5, aes(label = round(Volume_Omitted)), nudge_y = 25000) +
scale_y_continuous(breaks = c(seq(from = 0, to = 1000000, by = 50000))) +
labs(title = "Omitted Mills' Volume by State", y = "Omitted Mill Volumes", x = "State") +
theme(text = element_text(family = "serif"))
This bar chart depicts the number of volume-outliers (i.e. Log(10) BF > 8) on a state-by-state basis. States with no volume-outliers were not included in the plot.
#Create data frames for High Volume Mills
High_Vol_by_State <- BF_plot %>%
group_by(MILL_STATE) %>%
filter(TOT_BF_LOG >= 8) %>%
summarize("High_Volume_Mills" = n())
#Create Bar Chart for High Volume Mills by State
ggplot(High_Vol_by_State, aes(x = MILL_STATE, y = High_Volume_Mills)) +
geom_col() +
geom_text(aes(label = High_Volume_Mills), nudge_y = 2.5) +
scale_y_continuous(breaks = c(seq(from = 0, to = 65, by = 5))) +
labs(title = "High Volume Mills (Log(10) BF > 8) by State", y = "High Volume Mills", x = "State") +
theme(text = element_text(family = "serif"))
This bar chart depicts a breakdown of Mill “Types” on a state-by-state basis.
#Create a Bar Chart depicting Mill Type by State
MTC_summary <- BF_plot %>%
mutate(MTC_Tidy = case_when(MTC == 10 ~ "Sawmill",
MTC == 21 ~ "Sawmill",
MTC == 22 ~ "Sawmill",
MTC == 23 ~ "Sawmill",
MTC == 24 ~ "Sawmill",
MTC == 25 ~ "Sawmill",
MTC == 26 ~ "Sawmill",
MTC == 20 ~ "Veneer Mill",
MTC == 30 ~ "Pulp/Composite Mill",
MTC == 40 ~ "Pulp/Composite Mill",
MTC == 50 ~ "Bioenergy Mill",
MTC == 60 ~ "House/Cabin Mill",
MTC == 70 ~ "Post/Pole/Piling Mill",
MTC == 80 ~ "Post/Pole/Piling Mill",
MTC == 90 ~ "Post/Pole/Piling Mill",
MTC == 100 ~ "Residential Firewood Mill",
MTC == 110 ~ "Other/Misc Mill",
MTC == 111 ~ "Other/Misc Mill",
MTC == 117 ~ "Concentration Yard",
is.na(MTC) ~ "Not Provided"))%>%
group_by(MILL_STATE, MTC_Tidy) %>%
arrange(as.factor(MTC_Tidy)) %>%
summarize("Mill_Type_Count" = n()) %>%
group_by(MILL_STATE)%>%
mutate(State_Count = sum(Mill_Type_Count))
#Create bar chart of Mill Type Counts by State
MTC_Tidy_Factor_Levels <- c("Sawmill", "Veneer Mill", "Pulp/Composite Mill", "Bioenergy Mill", "House/Cabin Mill", "Post/Pole/Piling Mill", "Residential Firewood Mill", "Other/Misc Mill", "Concentration Yard", "Not Provided")
ggplot(MTC_summary, aes(x = MILL_STATE, y = Mill_Type_Count, fill = factor(MTC_Tidy, levels = MTC_Tidy_Factor_Levels))) +
geom_col(position = 'stack', width = .8) +
labs(title = "Mill Type Counts by State", y = "Mill Type Count", x = "State") +
scale_y_continuous(limits = c(0, 500), breaks = c(seq(from = 0, to = 500, by = 25))) +
scale_fill_discrete(name = "Mill Types") +
theme(text = element_text(family = "serif"))
These plots are faceted (by State) versions of the Mill Type and Log(10) of BF plots above.
# Create Bar Charts and Box Plots using facet_wrap()
ggplot(MTC_summary, aes(0, Mill_Type_Count, fill = factor(MTC_Tidy, levels = MTC_Tidy_Factor_Levels))) +
geom_col() +
labs(title = "Mill Type Counts", y = "Mill Type Count") +
scale_fill_discrete(name = "Mill Types") +
facet_wrap(facets = vars(MILL_STATE), scales = "free_y" ,ncol = 8) +
theme(text = element_text(family = "serif"), axis.title.x = element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank())
ggplot(BF_plot, aes(0, y = TOT_BF_LOG, color = MILL_STATE)) +
geom_boxplot() +
labs(title = "Log(10) of BF by State", y = "Log(10) of BF") +
facet_wrap(facets = vars(MILL_STATE), ncol = 8, scales = "free") +
theme(text = element_text(family = "serif"), axis.title.x = element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank())
Because dplyr functions weren’t utilized within the above code chunks, the following chunk uses dplyr functions to measure a few meaningful values on a state-by-state basis.
#Calculate means, medians, and standard deviations of logged volumes by state
TPO_summary <- BF_plot %>%
select(MILL_STATECD, MILL_STATE, MTC, TOT_MCF_LOG, TOT_BF_LOG) %>%
group_by(MILL_STATE) %>%
summarize(meanBF = mean(TOT_BF_LOG, na.rm = TRUE),
medianBF = median(TOT_BF_LOG, na.rm = TRUE),
meanMCF = mean(TOT_MCF_LOG, na.rm = TRUE),
medianMCF = median(TOT_MCF_LOG, na.rm = TRUE),
St_Dev = sd(TOT_BF_LOG, na.rm = TRUE),
NumberofMills = n())
TPO_summary
# A tibble: 24 x 7
MILL_STATE meanBF medianBF meanMCF medianMCF St_Dev NumberofMills
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 CT 5.59 5.44 1.79 1.64 0.820 22
2 DE 5.10 4.80 1.30 1 1.01 15
3 IA 4.94 4.55 1.14 0.749 1.11 70
4 IL 4.78 4.46 0.980 0.660 1.06 136
5 IN 5.76 6.16 1.96 2.36 1.18 147
6 KS 4.58 4.30 0.782 0.500 0.991 48
7 MA 4.59 4.00 0.790 0.199 0.906 120
8 MD 5.51 5.67 1.70 1.87 1.38 43
9 ME 5.60 4.81 1.80 1.01 1.28 162
10 MI 5.71 5.71 1.91 1.91 1.08 354
# ... with 14 more rows
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Kennedy (2022, Feb. 27). Data Analytics and Computational Social Science: HW5_IanKennedy. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httprpubscomikennedy040hw5/
BibTeX citation
@misc{kennedy2022hw5_iankennedy, author = {Kennedy, Ian}, title = {Data Analytics and Computational Social Science: HW5_IanKennedy}, url = {https://github.com/DACSS/dacss_course_website/posts/httprpubscomikennedy040hw5/}, year = {2022} }