HW6

Manipulating 19 Years of Natural Gas Data

TLamkin
2022-04-06

INTRODUCTION

The hypothesis going into this analysis was that the price of natural gas fluctuates with production.

The goal of this analysis was to:

1 Examine the fluctuation in commercial and residential natural gas prices over the course of the average year and compare that to the monthly production.  
2 Determine if there exists a relationship between the monthly domestic gas production and those prices.
3 Examine the fluctuation of the same over the course of the last 19 years (as opposed to the average by month). 

The prediction is that the price change will be inversely proportional to the production and that production has increased steadily over the last 19 years.

This analysis was done using a variety of R packages including:

* STRINGR
* TIDYVERSE
* DPLYR
* GGPLOT2
* MATRIXSTATS

Checking the Environment

I like to verify that my working directory and library paths are the same and that I have the correct packages installed.

Show code
# Verify library path

.libPaths()
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
[2] "C:/Program Files/R/R-4.1.2/library"          
Show code
# set the working directory to be the same as the library path

setwd("C:/Users/theresa/Documents/R/win-library/4.1")

# verify the working directory

getwd()
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
Show code
# Installing Tidyverse and readxl packages with explicitly defining the URL of where it lives. This is to get around a Mirror error.  The packages are now already installed and therefore commented out for efficiency.  However, should this be run in a new environment, these should be re-installed. 

# install.packages("tidyverse", repos = "http://cran.us.r-project.org")
# install.packages("readxl", repos = "http://cran.us.r-project.org")

# load the necessary libraries for the processing
# install.packages("matrixStats", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(dplyr)
library(readxl)
library(readr)
library(stringr)
library(ggplot2)
library(quantreg)
library(lubridate)
library(matrixStats)
library(kableExtra)
library(rmarkdown)

starting_year = params$starting_year
relevant_sheet = params$relevant_sheet
ending_year = params$ending_year

FUNCTIONS

The processing of the data has been broken info functions. Those functions perform the following tasks:

 Production data manipulation
 Price data manipulation
 2 types of data pivoting
Show code
production_manipulation <- function(dataset) {
   
  print("Using production manipulation function")
  
  dataset <- production_data %>%
    select(!contains('U.S.')) %>%
    select(!contains('Idaho')) %>%
    rename_with(~stringr::str_replace(.,' Natural Gas Gross Withdrawals','')) %>%
    mutate(production_year_data = year(`Date`)) %>%
    relocate(production_year_data) %>%
    mutate(production_month = month(`Date`, label = TRUE)) %>%
      relocate(production_month) %>%
    filter(production_year_data>starting_year, production_year_data<ending_year) %>%
    rowwise() %>%
    mutate(total_national_prod_inbill = sum(c(`Alaska (MMcf)`:`Wyoming (MMcf)`))) %>%
    mutate(total_national_prod_inbill = total_national_prod_inbill / 1000000000)
    head(dataset)
    return(dataset)
}

price_manipulation <- function(dataset) {
  
  print("Using price manipulation function")
  
  dataset <- price_data %>%
    select(1, 6:9) %>%
    select(!contains('Total')) %>%
    mutate(`Date` = as.POSIXct(`Date`,optional = FALSE)) %>%
    mutate(year_data = year(`Date`)) %>%
    relocate(year_data) %>%
    mutate(month_data = month(`Date`, label = TRUE)) %>%
      relocate(month_data) %>%
    filter(year_data>starting_year, year_data<ending_year) %>%
    rename(price_res_deliv = `U.S. Price of Natural Gas Delivered to Residential Consumers (Dollars per Thousand Cubic Feet)`, price_comm = `U.S. Price of Natural Gas Sold to Commercial Consumers (Dollars per Thousand Cubic Feet)`)
  head(dataset)
    return(dataset)
  
  
}


make_data_long_by_year <- function(orig_dataset) {
 
  print("Using pivot longer for price data")
  
   dataset <- orig_dataset %>%
  pivot_longer(
    cols = price_res_deliv:price_comm,
    names_to = c("category", "type_use"),
    names_sep = "_",
    values_to = "price") %>%
  relocate(`type_use`) %>%
  relocate(`price`) %>%
  relocate(`category`)
  return(dataset)
  
}

make_data_long_by_state <- function(orig_dataset) {
  
  print("Using pivot longer for state data")
  
  dataset <- orig_dataset %>%
pivot_longer(
    cols = `Alaska (MMcf)`:`total_national_prod_inbill`,
    names_to = c("state"),
    values_to = "production") %>%
  relocate(`production`) %>%
  mutate(reduced_production = production / 10000) %>%
  relocate(reduced_production) %>%
  relocate(category) %>%
  relocate(type_use) %>%
  relocate(`state`) %>%
  relocate(`Date`) %>%
  relocate(month_data) %>%
  relocate(year_data) 
  return(dataset)
  
}

Loading the Data

There are 2 datasets containing US natural gas data as provided by the U.S. Energy Information Administration (http://www.eia.gov/oil_gas/natural_gas/data_publications/natural_gas_monthly/ngm.html).

Both workbooks:

* Contain monthly data starting in 1973
* Contain the relevant data on the tab labeled Data 1 (as identified in the constant defined in YAML as "relevant_sheet")
* Have varying levels of missing data for the first 30 years.  

Due to the inconsistency in the data pre-2003, I am only using the last 19 years of data starting in 2003 (as identified using the constant defined in YAML as “starting_year”)

Show code
# getting info about all excel sheets

  path1 <- "c:/users/theresa/Documents/DACSS Local/DataSets/US-natural-gas-production.xls"
  path2 <- "c:/users/theresa/Documents/DACSS Local/DataSets/US-natural-gas-price.xls"

Production Data

Production data details contain 24 columns of interest:

* Data date (month year format) 
* State gas withdrawal levels for all drilling states (Idaho has only a few years of data and is therefore eliminated)

Date is broken down into 2 fields called month_data and year_data.

Show code
# Sheet names for the PRODUCTION data file 
  sheet_names <- readxl::excel_sheets(path1)

 # Importing the production data into a datasheet

  production_data <- readxl::read_excel(path1, sheet = relevant_sheet, skip = 2) 
  
# Performing the necessary manipulation of that datasheet into a modified version
 
  production_data_mod <- production_manipulation(production_data)
[1] "Using production manipulation function"
Show code
# Show a sample of the modified PRODUCTION data
  paged_table(head(production_data_mod,100))

Price Data

Price data details contain 3 columns of interest:

* Data date (month year format)
* Residential prices (in dollars per 1000 cubi feet)
* Commercial prices (again in dollars per 1000 cubic feet)

The date field is broken down into 2 new fields called month_data and year_data

Show code
# Sheet names for the PRICE data file 
  sheet_names <- readxl::excel_sheets(path1)
  
# Importing the price data into a datasheet
  
  price_data <- readxl::read_excel(path2, sheet = relevant_sheet, skip = 2) 
  
# Performing the necessary manipulation of that datasheet into a modified version
  
  price_data_mod <- price_manipulation(price_data)
[1] "Using price manipulation function"
Show code
# Show a sample of the modified PRICE data
  paged_table(head(price_data_mod,100))

Merge Datasets Together

Data for every month between 2003 and 2021 is provided on both the price and production datasets. This processing merges the data together so that for each month, the resulting dataset contains both price and production data.

Show code
# Merge the datasets together - using BASE R
  
#  price_production_data <- merge(price_data_mod, production_data_mod, by.x = 'Date', by.y = 'Date', all.x = TRUE, all.y = TRUE) %>%
#    select(-6) %>%
#    select(-`production_year_data`) 
  
# Merge the datasets together - using JOIN IN DPLYR
  
  price_production_data <- left_join(price_data_mod, production_data_mod, by = 'Date')

# add a column for the largest producer for that month. I would prefer to get the largest producer's name (column name). 
  price_production_data <- price_production_data %>%
    mutate(largest_producer = rowMaxs(as.matrix((price_production_data[,c(8:24)]))))
    
# Show a sample of the modified MERGED data
 paged_table(head(price_production_data,100))
Show code
# Show a resulting column names
#  colnames(price_production_data)

Summary Sets

Keeping the data in its wide format, doing some summary by month and year.

Summarize by Year

Determine the average residential & commertial prices, as well as the average production by state for each year.

Show code
by_year_summary_data <- price_production_data %>%
 group_by(year_data) %>%
 summarise(across(price_res_deliv:total_national_prod_inbill,~mean(.x, na.rm = TRUE))) %>%
  select(-4)

# Information about the resulting by_year_summary_data
  paged_table(head(by_year_summary_data,100))

Summarize by Month

Determine the average residential & commercial prices, as well as the average production by state for each month over the 20 years.

Show code
by_month_summary_data <- price_production_data %>%
 group_by(month_data) %>%
 summarise(across(price_res_deliv:total_national_prod_inbill,~mean(.x, na.rm = TRUE))) %>%
  select(-4)
  

  # Information about the resulting by_month_summary_data
  paged_table(head(by_month_summary_data,100))

Pivot:

In an attempt to make the data tidy, flip price type into a USE_TYPE and then further tidy by flipping location into a STATE column.

Divide production by 10,000 to get a reduced version that will graph more easily with price

Show code
# Pivot the price data to give instances by type

changed_price_production_data <- make_data_long_by_year(price_production_data)
[1] "Using pivot longer for price data"
Show code
# Pivot the production data to give instance by state

changed_v2_price_production_data <- make_data_long_by_state(changed_price_production_data)
[1] "Using pivot longer for state data"
Show code
# Information about the changed_price_production_data dataset
# colnames(changed_price_production_data)
paged_table(head(changed_price_production_data,100))
Show code
# Information about the changed_v2_price_production_data dataset
# colnames(changed_v2_price_production_data)
paged_table(head(changed_v2_price_production_data,1000))

Summarize the Pivoted Data:

Group and summarize by month, year, and state

The datasets contain the means for the reduced value for production and the overall price (not grouped by residential / commercial. It would have been much better to group and average price with those categories but the double grouping stymied me).

Show code
by_month_v2_summary_data <- changed_v2_price_production_data %>%
 group_by(month_data) %>%
 summarise(production_avg = mean(reduced_production, na.rm=TRUE), price = mean(price, na.rm = TRUE)) 

by_state_v2_summary_data <- changed_v2_price_production_data %>%
group_by(state) %>%
summarise(production_avg = mean(reduced_production,na.rm =TRUE), price = mean(price, na.rm = TRUE)) 
  
by_year_v2_summary_data <- changed_v2_price_production_data %>%
 filter(state == 'total_national_prod_inbill') %>%
 group_by(year_data) %>%
 summarise(production_avg = mean(production,na.rm =TRUE), price = mean(price, na.rm = TRUE))

Graphs:

Monthly Averages

Looking at the averages for each month accross a year, there is a noticable trend between the cost of gas, both commercial and residential, and the national production levels.

In the summer months, production is typically lower and prices rise.

Show code
# Globally replace any value that wasn't provided (was a non-numeric), replace it with the median of all the other values in that instance. 

by_month_summary_data %>% 
  select(month_data, price_res_deliv, price_comm, total_national_prod_inbill) %>%
  ggplot(aes(x=month_data,y=price_res_deliv,  size=total_national_prod_inbill, color=month_data)) +
  geom_point() +
  geom_line()+
  geom_point(aes(x=month_data, y=price_comm, size=total_national_prod_inbill, color=month_data)) +
  geom_line() +
  labs(title="Average Price / Production by Month over 19 Years", y="Delivery Price", x="Month", subtitle = "The size of the data point represents the production level")

Presented Differently

The same concept clarified with a line graph.

Show code
by_month_summary_data %>% 
  select(month_data, price_res_deliv, price_comm, total_national_prod_inbill) %>%
  ggplot(aes(x=month_data,y=price_res_deliv, group=1, color="orange")) +
  geom_line() +
  geom_line(aes(x=month_data, y=price_comm, group = 1, color="blue")) +
  geom_line(aes(x=month_data, y=total_national_prod_inbill, group=1, color="red")) +
   labs(title="Average Price / Production by Month over 19 Years", y="Delivery Price/Production Volume", subtitle = "The production levels are represented in billions")

Yearly Averages

Looking at the averages by year for 2003 - 2021, there is less correlation between cost and the national production levels.

Production has increased dramatically over the years of 2009-2021; however, price levels have been relatively stable.

Show code
by_year_summary_data %>% 
  select(year_data, price_res_deliv, price_comm, total_national_prod_inbill) %>%
  ggplot(aes(x=year_data,y=price_res_deliv, color="orange")) +
 # geom_point() +
  geom_line()+
  # geom_point(aes(x=year_data, y=price_comm, size=total_national_prod_inbill, color=year_data)) +
  geom_line(aes(x=year_data, y=price_comm, color="blue")) +
  geom_line(aes(x=year_data, y=total_national_prod_inbill, color="red")) +
   labs(title="Average Price / production over 19 Years", y="Delivery Price", x="Month", subtitle = "Annual production does not dictate price trend")

Annual Gas Prices: Residential v. Commercial

Commercial and residential gas prices fluctuated in relative synchronicity between 2003 and 2010. From 2010 - 2021, the residential prices consistently have been increasing at a higher rate than commercial prices.

Show code
ggplot(by_year_summary_data, aes(x = year_data, y = price_res_deliv)) + 
  geom_smooth(aes(y = price_res_deliv, x = year_data), color = "blue")  +
  geom_smooth(aes(x = year_data, y = price_comm), color="red") + 
  labs(title="Average Price by Year over 20 Years", y="Delivery Price (red = commercial, blue = residential)", x="Month")

Show code
#ggplot(by_month_v2_summary_data, aes(x = month_data, y = production_avg, fill = month_data)) + 
#  geom_col(aes(y = production_avg, x = month_data), color = "blue")  +
#  geom_col(aes(x = month_data, y = price), color="red") +
#  theme_light() +
#  labs(title="Average Monthly Price v Prod over 20 Years", y="Price and Production", x="Month")

Average Production over 19 years by State

Show code
ggplot(by_state_v2_summary_data, aes(x = state, y = production_avg)) + 
  geom_col(aes(y = production_avg, x = state))  +
  theme_classic() +
  labs(title="Average Production by State over 20 Years", y="Production") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
Show code
# print(by_month_v2_summary_data)

#ggplot(by_month_v2_summary_data, aes(x = month_data, y = production)) + 
# geom_smooth(aes(x = month_data, y = price), color="red") + 
#  geom_smooth(aes(x = month_data, y = production), color="blue") +
# facet_wrap(var(month_data)) +
#  labs(title="Average Price by Month over 20 Years", y="Delivery Price", x="Month")


# ggplot(by_month_summary_data, aes(x = month_data)) + 
#    geom_line(aes(x = month_data, y = price_res_deliv)) + 
#     geom_line(aes(x = month_data, y = price_comm))  

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

TLamkin (2022, April 11). Data Analytics and Computational Social Science: HW6. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin887327/

BibTeX citation

@misc{tlamkin2022hw6,
  author = {TLamkin, },
  title = {Data Analytics and Computational Social Science: HW6},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin887327/},
  year = {2022}
}