Manipulating 19 Years of Natural Gas Data
The hypothesis going into this analysis was that the price of natural gas fluctuates with production.
The goal of this analysis was to:
1 Examine the fluctuation in commercial and residential natural gas prices over the course of the average year and compare that to the monthly production.
2 Determine if there exists a relationship between the monthly domestic gas production and those prices.
3 Examine the fluctuation of the same over the course of the last 19 years (as opposed to the average by month).
The prediction is that the price change will be inversely proportional to the production and that production has increased steadily over the last 19 years.
This analysis was done using a variety of R packages including:
* STRINGR
* TIDYVERSE
* DPLYR
* GGPLOT2
* MATRIXSTATS
I like to verify that my working directory and library paths are the same and that I have the correct packages installed.
# Verify library path
.libPaths()
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
[2] "C:/Program Files/R/R-4.1.2/library"
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
# Installing Tidyverse and readxl packages with explicitly defining the URL of where it lives. This is to get around a Mirror error. The packages are now already installed and therefore commented out for efficiency. However, should this be run in a new environment, these should be re-installed.
# install.packages("tidyverse", repos = "http://cran.us.r-project.org")
# install.packages("readxl", repos = "http://cran.us.r-project.org")
# load the necessary libraries for the processing
# install.packages("matrixStats", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(dplyr)
library(readxl)
library(readr)
library(stringr)
library(ggplot2)
library(quantreg)
library(lubridate)
library(matrixStats)
library(kableExtra)
library(rmarkdown)
starting_year = params$starting_year
relevant_sheet = params$relevant_sheet
ending_year = params$ending_year
The processing of the data has been broken info functions. Those functions perform the following tasks:
Production data manipulation
Price data manipulation
2 types of data pivoting
production_manipulation <- function(dataset) {
print("Using production manipulation function")
dataset <- production_data %>%
select(!contains('U.S.')) %>%
select(!contains('Idaho')) %>%
rename_with(~stringr::str_replace(.,' Natural Gas Gross Withdrawals','')) %>%
mutate(production_year_data = year(`Date`)) %>%
relocate(production_year_data) %>%
mutate(production_month = month(`Date`, label = TRUE)) %>%
relocate(production_month) %>%
filter(production_year_data>starting_year, production_year_data<ending_year) %>%
rowwise() %>%
mutate(total_national_prod_inbill = sum(c(`Alaska (MMcf)`:`Wyoming (MMcf)`))) %>%
mutate(total_national_prod_inbill = total_national_prod_inbill / 1000000000)
head(dataset)
return(dataset)
}
price_manipulation <- function(dataset) {
print("Using price manipulation function")
dataset <- price_data %>%
select(1, 6:9) %>%
select(!contains('Total')) %>%
mutate(`Date` = as.POSIXct(`Date`,optional = FALSE)) %>%
mutate(year_data = year(`Date`)) %>%
relocate(year_data) %>%
mutate(month_data = month(`Date`, label = TRUE)) %>%
relocate(month_data) %>%
filter(year_data>starting_year, year_data<ending_year) %>%
rename(price_res_deliv = `U.S. Price of Natural Gas Delivered to Residential Consumers (Dollars per Thousand Cubic Feet)`, price_comm = `U.S. Price of Natural Gas Sold to Commercial Consumers (Dollars per Thousand Cubic Feet)`)
head(dataset)
return(dataset)
}
make_data_long_by_year <- function(orig_dataset) {
print("Using pivot longer for price data")
dataset <- orig_dataset %>%
pivot_longer(
cols = price_res_deliv:price_comm,
names_to = c("category", "type_use"),
names_sep = "_",
values_to = "price") %>%
relocate(`type_use`) %>%
relocate(`price`) %>%
relocate(`category`)
return(dataset)
}
make_data_long_by_state <- function(orig_dataset) {
print("Using pivot longer for state data")
dataset <- orig_dataset %>%
pivot_longer(
cols = `Alaska (MMcf)`:`total_national_prod_inbill`,
names_to = c("state"),
values_to = "production") %>%
relocate(`production`) %>%
mutate(reduced_production = production / 10000) %>%
relocate(reduced_production) %>%
relocate(category) %>%
relocate(type_use) %>%
relocate(`state`) %>%
relocate(`Date`) %>%
relocate(month_data) %>%
relocate(year_data)
return(dataset)
}
There are 2 datasets containing US natural gas data as provided by the U.S. Energy Information Administration (http://www.eia.gov/oil_gas/natural_gas/data_publications/natural_gas_monthly/ngm.html).
Both workbooks:
* Contain monthly data starting in 1973
* Contain the relevant data on the tab labeled Data 1 (as identified in the constant defined in YAML as "relevant_sheet")
* Have varying levels of missing data for the first 30 years.
Due to the inconsistency in the data pre-2003, I am only using the last 19 years of data starting in 2003 (as identified using the constant defined in YAML as “starting_year”)
# getting info about all excel sheets
path1 <- "c:/users/theresa/Documents/DACSS Local/DataSets/US-natural-gas-production.xls"
path2 <- "c:/users/theresa/Documents/DACSS Local/DataSets/US-natural-gas-price.xls"
Production data details contain 24 columns of interest:
* Data date (month year format)
* State gas withdrawal levels for all drilling states (Idaho has only a few years of data and is therefore eliminated)
Date is broken down into 2 fields called month_data and year_data.
# Sheet names for the PRODUCTION data file
sheet_names <- readxl::excel_sheets(path1)
# Importing the production data into a datasheet
production_data <- readxl::read_excel(path1, sheet = relevant_sheet, skip = 2)
# Performing the necessary manipulation of that datasheet into a modified version
production_data_mod <- production_manipulation(production_data)
[1] "Using production manipulation function"
# Show a sample of the modified PRODUCTION data
paged_table(head(production_data_mod,100))
Price data details contain 3 columns of interest:
* Data date (month year format)
* Residential prices (in dollars per 1000 cubi feet)
* Commercial prices (again in dollars per 1000 cubic feet)
The date field is broken down into 2 new fields called month_data and year_data
# Sheet names for the PRICE data file
sheet_names <- readxl::excel_sheets(path1)
# Importing the price data into a datasheet
price_data <- readxl::read_excel(path2, sheet = relevant_sheet, skip = 2)
# Performing the necessary manipulation of that datasheet into a modified version
price_data_mod <- price_manipulation(price_data)
[1] "Using price manipulation function"
# Show a sample of the modified PRICE data
paged_table(head(price_data_mod,100))
Data for every month between 2003 and 2021 is provided on both the price and production datasets. This processing merges the data together so that for each month, the resulting dataset contains both price and production data.
# Merge the datasets together - using BASE R
# price_production_data <- merge(price_data_mod, production_data_mod, by.x = 'Date', by.y = 'Date', all.x = TRUE, all.y = TRUE) %>%
# select(-6) %>%
# select(-`production_year_data`)
# Merge the datasets together - using JOIN IN DPLYR
price_production_data <- left_join(price_data_mod, production_data_mod, by = 'Date')
# add a column for the largest producer for that month. I would prefer to get the largest producer's name (column name).
price_production_data <- price_production_data %>%
mutate(largest_producer = rowMaxs(as.matrix((price_production_data[,c(8:24)]))))
# Show a sample of the modified MERGED data
paged_table(head(price_production_data,100))
# Show a resulting column names
# colnames(price_production_data)
Keeping the data in its wide format, doing some summary by month and year.
Determine the average residential & commertial prices, as well as the average production by state for each year.
Determine the average residential & commercial prices, as well as the average production by state for each month over the 20 years.
In an attempt to make the data tidy, flip price type into a USE_TYPE and then further tidy by flipping location into a STATE column.
Divide production by 10,000 to get a reduced version that will graph more easily with price
# Pivot the price data to give instances by type
changed_price_production_data <- make_data_long_by_year(price_production_data)
[1] "Using pivot longer for price data"
# Pivot the production data to give instance by state
changed_v2_price_production_data <- make_data_long_by_state(changed_price_production_data)
[1] "Using pivot longer for state data"
# Information about the changed_price_production_data dataset
# colnames(changed_price_production_data)
paged_table(head(changed_price_production_data,100))
# Information about the changed_v2_price_production_data dataset
# colnames(changed_v2_price_production_data)
paged_table(head(changed_v2_price_production_data,1000))
Group and summarize by month, year, and state
The datasets contain the means for the reduced value for production and the overall price (not grouped by residential / commercial. It would have been much better to group and average price with those categories but the double grouping stymied me).
by_month_v2_summary_data <- changed_v2_price_production_data %>%
group_by(month_data) %>%
summarise(production_avg = mean(reduced_production, na.rm=TRUE), price = mean(price, na.rm = TRUE))
by_state_v2_summary_data <- changed_v2_price_production_data %>%
group_by(state) %>%
summarise(production_avg = mean(reduced_production,na.rm =TRUE), price = mean(price, na.rm = TRUE))
by_year_v2_summary_data <- changed_v2_price_production_data %>%
filter(state == 'total_national_prod_inbill') %>%
group_by(year_data) %>%
summarise(production_avg = mean(production,na.rm =TRUE), price = mean(price, na.rm = TRUE))
Looking at the averages for each month accross a year, there is a noticable trend between the cost of gas, both commercial and residential, and the national production levels.
In the summer months, production is typically lower and prices rise.
# Globally replace any value that wasn't provided (was a non-numeric), replace it with the median of all the other values in that instance.
by_month_summary_data %>%
select(month_data, price_res_deliv, price_comm, total_national_prod_inbill) %>%
ggplot(aes(x=month_data,y=price_res_deliv, size=total_national_prod_inbill, color=month_data)) +
geom_point() +
geom_line()+
geom_point(aes(x=month_data, y=price_comm, size=total_national_prod_inbill, color=month_data)) +
geom_line() +
labs(title="Average Price / Production by Month over 19 Years", y="Delivery Price", x="Month", subtitle = "The size of the data point represents the production level")
The same concept clarified with a line graph.
by_month_summary_data %>%
select(month_data, price_res_deliv, price_comm, total_national_prod_inbill) %>%
ggplot(aes(x=month_data,y=price_res_deliv, group=1, color="orange")) +
geom_line() +
geom_line(aes(x=month_data, y=price_comm, group = 1, color="blue")) +
geom_line(aes(x=month_data, y=total_national_prod_inbill, group=1, color="red")) +
labs(title="Average Price / Production by Month over 19 Years", y="Delivery Price/Production Volume", subtitle = "The production levels are represented in billions")
Looking at the averages by year for 2003 - 2021, there is less correlation between cost and the national production levels.
Production has increased dramatically over the years of 2009-2021; however, price levels have been relatively stable.
by_year_summary_data %>%
select(year_data, price_res_deliv, price_comm, total_national_prod_inbill) %>%
ggplot(aes(x=year_data,y=price_res_deliv, color="orange")) +
# geom_point() +
geom_line()+
# geom_point(aes(x=year_data, y=price_comm, size=total_national_prod_inbill, color=year_data)) +
geom_line(aes(x=year_data, y=price_comm, color="blue")) +
geom_line(aes(x=year_data, y=total_national_prod_inbill, color="red")) +
labs(title="Average Price / production over 19 Years", y="Delivery Price", x="Month", subtitle = "Annual production does not dictate price trend")
Commercial and residential gas prices fluctuated in relative synchronicity between 2003 and 2010. From 2010 - 2021, the residential prices consistently have been increasing at a higher rate than commercial prices.
ggplot(by_year_summary_data, aes(x = year_data, y = price_res_deliv)) +
geom_smooth(aes(y = price_res_deliv, x = year_data), color = "blue") +
geom_smooth(aes(x = year_data, y = price_comm), color="red") +
labs(title="Average Price by Year over 20 Years", y="Delivery Price (red = commercial, blue = residential)", x="Month")
#ggplot(by_month_v2_summary_data, aes(x = month_data, y = production_avg, fill = month_data)) +
# geom_col(aes(y = production_avg, x = month_data), color = "blue") +
# geom_col(aes(x = month_data, y = price), color="red") +
# theme_light() +
# labs(title="Average Monthly Price v Prod over 20 Years", y="Price and Production", x="Month")
ggplot(by_state_v2_summary_data, aes(x = state, y = production_avg)) +
geom_col(aes(y = production_avg, x = state)) +
theme_classic() +
labs(title="Average Production by State over 20 Years", y="Production") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
# print(by_month_v2_summary_data)
#ggplot(by_month_v2_summary_data, aes(x = month_data, y = production)) +
# geom_smooth(aes(x = month_data, y = price), color="red") +
# geom_smooth(aes(x = month_data, y = production), color="blue") +
# facet_wrap(var(month_data)) +
# labs(title="Average Price by Month over 20 Years", y="Delivery Price", x="Month")
# ggplot(by_month_summary_data, aes(x = month_data)) +
# geom_line(aes(x = month_data, y = price_res_deliv)) +
# geom_line(aes(x = month_data, y = price_comm))
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
TLamkin (2022, April 11). Data Analytics and Computational Social Science: HW6. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin887327/
BibTeX citation
@misc{tlamkin2022hw6, author = {TLamkin, }, title = {Data Analytics and Computational Social Science: HW6}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin887327/}, year = {2022} }