Manipulating 20 Years of Natural Gas Data
This homework demonstrates the tidy-ing and manipulation of 20 years of natural gas production and price data using a variety of packages including:
The goal of this analysis was to: - Examine the fluctuation in natural gas prices over the course of the average year and how that fluctuation varies between commercial and residential uses.
- Determine if there exists a relationship between the monthly domestic gas production and the price.
I like to verify that my working directory and library paths are the same and that I have the correct packages installed.
# Verify library path
.libPaths()
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
[2] "C:/Program Files/R/R-4.1.2/library"
# set the working directory to be the same as the library path
setwd("C:/Users/theresa/Documents/R/win-library/4.1")
# verify the working directory
getwd()
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
# Installing Tidyverse and readxl packages with explicitly defining the URL of where it lives. This is to get around a Mirror error. The packages are now already installed and therefore commented out for efficiency. However, should this be run in a new environment, these should be re-installed.
# install.packages("tidyverse", repos = "http://cran.us.r-project.org")
# install.packages("readxl", repos = "http://cran.us.r-project.org")
# load the necessary libraries for the processing
# install.packages("matrixStats", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(dplyr)
library(readxl)
library(readr)
library(stringr)
library(ggplot2)
library(quantreg)
library(lubridate)
library(matrixStats)
There are 2 datasets containing US natural gas data as provided by the U.S. Energy Information Administration (http://www.eia.gov/oil_gas/natural_gas/data_publications/natural_gas_monthly/ngm.html).
Both workbooks:
Price data details contain 3 rows of interest:
The date field is broken down into 2 new fields called month_data and year_data
Production data details contain 24 rows of interest:
Again, date is broken down into 2 fields called month_data and year_data.
# getting info about all excel sheets
path1 <- "c:/users/theresa/Documents/DACSS Local/DataSets/US-natural-gas-production.xls"
path2 <- "c:/users/theresa/Documents/DACSS Local/DataSets/US-natural-gas-price.xls"
# Sheet names for the PRODUCTION data file
sheet_names <- readxl::excel_sheets(path1)
print(sheet_names)
[1] "Contents" "Data 1" "Data 2" "Data 3" "Data 4"
# Sheet names for the PRICE data file
sheet_names <- readxl::excel_sheets(path1)
print(sheet_names)
[1] "Contents" "Data 1" "Data 2" "Data 3" "Data 4"
relevant_sheet = "Data 1"
starting_year = 2002
# Importing the price data into a datasheet
price_data <- readxl::read_excel(path2, sheet = relevant_sheet, skip = 2)
# Performing the necessary manipulation of that datasheet into a modified version
price_data_mod <- price_data %>%
select(1, 6:9) %>%
select(!contains('Total')) %>%
# mutate(`Date` = as.POSIXct(`Date`)) %>%
mutate(year_data = year(`Date`)) %>%
relocate(year_data) %>%
mutate(month_data = month(`Date`, label = TRUE)) %>%
relocate(month_data) %>%
filter(year_data>starting_year) %>%
rename(price_res_deliv = `U.S. Price of Natural Gas Delivered to Residential Consumers (Dollars per Thousand Cubic Feet)`, price_comm = `U.S. Price of Natural Gas Sold to Commercial Consumers (Dollars per Thousand Cubic Feet)`)
# Importing the production data into a datasheet
production_data <- readxl::read_excel(path1, sheet = relevant_sheet, skip = 2)
# Performing the necessary manipulation of that datasheet into a modified version
production_data_mod <- production_data %>%
select(!contains('U.S.')) %>%
select(!contains('Idaho')) %>%
rename_with(~stringr::str_replace(.,' Natural Gas Gross Withdrawals','')) %>%
mutate(production_year_data = year(`Date`)) %>%
relocate(production_year_data) %>%
mutate(production_month = month(`Date`, label = TRUE)) %>%
relocate(production_month) %>%
filter(production_year_data>starting_year)
# Show a sample of the modified PRODUCTION data
head(production_data_mod)
# A tibble: 6 x 20
production_month production_year_data Date
<ord> <dbl> <dttm>
1 Jan 2003 2003-01-15 00:00:00
2 Feb 2003 2003-02-15 00:00:00
3 Mar 2003 2003-03-15 00:00:00
4 Apr 2003 2003-04-15 00:00:00
5 May 2003 2003-05-15 00:00:00
6 Jun 2003 2003-06-15 00:00:00
# ... with 17 more variables: `Alaska (MMcf)` <dbl>,
# `Arkansas (MMcf)` <dbl>, `California (MMcf)` <dbl>,
# `Colorado (MMcf)` <dbl>,
# `Federal Offshore--Gulf of Mexico (MMcf)` <dbl>,
# `Kansas (MMcf)` <dbl>, `Louisiana (MMcf)` <dbl>,
# `Montana (MMcf)` <dbl>, `New Mexico (MMcf)` <dbl>,
# `North Dakota (MMcf)` <dbl>, `Ohio (MMcf)` <dbl>, ...
# Show a sample of the modified PRICE data
head(price_data_mod)
# A tibble: 6 x 5
month_data year_data Date price_res_deliv price_comm
<ord> <dbl> <dttm> <dbl> <dbl>
1 Jan 2003 2003-01-15 00:00:00 8.18 7.48
2 Feb 2003 2003-02-15 00:00:00 8.58 7.98
3 Mar 2003 2003-03-15 00:00:00 9.77 9.2
4 Apr 2003 2003-04-15 00:00:00 10.2 8.97
5 May 2003 2003-05-15 00:00:00 10.8 8.71
6 Jun 2003 2003-06-15 00:00:00 12.1 9
# Merge the datasets together
price_production_data <- merge(price_data_mod, production_data_mod, by.x = 'Date', by.y = 'Date', all.x = TRUE, all.y = TRUE) %>%
select(-6) %>%
select(-`production_year_data`)
# add a column for the largest producer for that month. I would prefer to get the largest producer's name (column name).
price_production_data <- price_production_data %>%
mutate(largest_producer = rowMaxs(as.matrix((price_production_data[,c(6:22)]))))
# Show a sample of the modified MERGED data
head(price_production_data)
Date month_data year_data price_res_deliv price_comm
1 2003-01-15 Jan 2003 8.18 7.48
2 2003-02-15 Feb 2003 8.58 7.98
3 2003-03-15 Mar 2003 9.77 9.20
4 2003-04-15 Apr 2003 10.18 8.97
5 2003-05-15 May 2003 10.79 8.71
6 2003-06-15 Jun 2003 12.08 9.00
Alaska (MMcf) Arkansas (MMcf) California (MMcf) Colorado (MMcf)
1 319683.8 14417.42 32679.53 86913.57
2 302602.0 13149.66 29324.73 78600.27
3 321353.4 13977.18 32170.33 86212.13
4 292888.6 13807.69 30642.15 83280.26
5 268966.9 14454.52 32509.44 86320.84
6 278916.6 13953.11 31001.41 83389.13
Federal Offshore--Gulf of Mexico (MMcf) Kansas (MMcf)
1 381172 36707.96
2 350913 32729.53
3 398140 36441.45
4 388074 35425.78
5 393109 36430.46
6 370801 35813.18
Louisiana (MMcf) Montana (MMcf) New Mexico (MMcf)
1 116474.9 7549.683 134374.7
2 107305.8 6693.286 123955.2
3 121044.6 7246.902 141622.0
4 117070.6 6965.165 133134.6
5 119967.4 6938.378 138228.0
6 115378.3 6937.862 130816.4
North Dakota (MMcf) Ohio (MMcf) Oklahoma (MMcf) Pennsylvania (MMcf)
1 4821 8305.159 126172.8 14177.22
2 4305 7418.616 115436.1 12660.73
3 4757 7992.273 135222.3 13642.01
4 4502 7665.553 135370.0 13083.13
5 4688 7720.430 129062.0 13177.00
6 4776 7409.804 131943.3 12645.65
Texas (MMcf) Utah (MMcf) West Virginia (MMcf) Wyoming (MMcf)
1 474681.9 24588.66 16644.61 161160.4
2 432716.0 22485.69 15959.33 145024.8
3 490444.8 24849.91 16111.51 159736.7
4 471949.8 23698.42 15010.96 151053.9
5 492743.3 24123.12 15178.24 143493.7
6 476846.1 23546.51 15349.56 146507.2
largest_producer
1 474681.9
2 432716.0
3 490444.8
4 471949.8
5 492743.3
6 476846.1
# Show a resulting column names
colnames(price_production_data)
[1] "Date"
[2] "month_data"
[3] "year_data"
[4] "price_res_deliv"
[5] "price_comm"
[6] "Alaska (MMcf)"
[7] "Arkansas (MMcf)"
[8] "California (MMcf)"
[9] "Colorado (MMcf)"
[10] "Federal Offshore--Gulf of Mexico (MMcf)"
[11] "Kansas (MMcf)"
[12] "Louisiana (MMcf)"
[13] "Montana (MMcf)"
[14] "New Mexico (MMcf)"
[15] "North Dakota (MMcf)"
[16] "Ohio (MMcf)"
[17] "Oklahoma (MMcf)"
[18] "Pennsylvania (MMcf)"
[19] "Texas (MMcf)"
[20] "Utah (MMcf)"
[21] "West Virginia (MMcf)"
[22] "Wyoming (MMcf)"
[23] "largest_producer"
Keeping the data in its wide format, doing some summary by month and year.
by_year_summary_data <- price_production_data %>%
group_by(year_data) %>%
summarise(across(price_res_deliv:largest_producer,~mean(.x, na.rm = TRUE)))
by_month_summary_data <- price_production_data %>%
group_by(month_data) %>%
summarise(across(price_res_deliv:largest_producer,~mean(.x, na.rm = TRUE)))
# Information about the resulting by_year_summary_data
print(by_year_summary_data)
# A tibble: 20 x 21
year_data price_res_deliv price_comm `Alaska (MMcf)`
<dbl> <dbl> <dbl> <dbl>
1 2003 10.6 8.51 298192.
2 2004 11.6 9.46 303674.
3 2005 13.7 11.5 303579.
4 2006 14.2 11.6 267146.
5 2007 14.2 11.3 289941.
6 2008 15.8 12.8 284657.
7 2009 12.9 9.79 276032.
8 2010 12.9 9.53 266425.
9 2011 12.6 9.09 263577.
10 2012 12.0 8.2 263733.
11 2013 12.1 8.35 267946.
12 2014 13.0 9.18 264045.
13 2015 12.3 8.02 264608.
14 2016 12.2 7.53 269183.
15 2017 13.0 8.10 270897.
16 2018 12.8 8.02 271226.
17 2019 12.7 7.81 270863.
18 2020 12.6 7.73 285776.
19 2021 14.9 9.24 290505.
20 2022 NaN NaN NaN
# ... with 17 more variables: `Arkansas (MMcf)` <dbl>,
# `California (MMcf)` <dbl>, `Colorado (MMcf)` <dbl>,
# `Federal Offshore--Gulf of Mexico (MMcf)` <dbl>,
# `Kansas (MMcf)` <dbl>, `Louisiana (MMcf)` <dbl>,
# `Montana (MMcf)` <dbl>, `New Mexico (MMcf)` <dbl>,
# `North Dakota (MMcf)` <dbl>, `Ohio (MMcf)` <dbl>,
# `Oklahoma (MMcf)` <dbl>, `Pennsylvania (MMcf)` <dbl>, ...
# Information about the resulting by_month_summary_data
print(by_month_summary_data)
# A tibble: 12 x 21
month_data price_res_deliv price_comm `Alaska (MMcf)`
<ord> <dbl> <dbl> <dbl>
1 Jan 10.2 8.89 302773.
2 Feb 10.2 8.87 281116.
3 Mar 10.6 9.02 304457.
4 Apr 11.4 9.05 281049.
5 May 13.0 9.29 268982.
6 Jun 15.2 9.60 254432.
7 Jul 16.5 9.75 238781.
8 Aug 17.0 9.66 239779.
9 Sep 16.2 9.54 265292.
10 Oct 13.2 9.17 288291.
11 Nov 11.2 9.11 294088.
12 Dec 10.6 9.09 310649.
# ... with 17 more variables: `Arkansas (MMcf)` <dbl>,
# `California (MMcf)` <dbl>, `Colorado (MMcf)` <dbl>,
# `Federal Offshore--Gulf of Mexico (MMcf)` <dbl>,
# `Kansas (MMcf)` <dbl>, `Louisiana (MMcf)` <dbl>,
# `Montana (MMcf)` <dbl>, `New Mexico (MMcf)` <dbl>,
# `North Dakota (MMcf)` <dbl>, `Ohio (MMcf)` <dbl>,
# `Oklahoma (MMcf)` <dbl>, `Pennsylvania (MMcf)` <dbl>, ...
In an attempt to make the data tidy, flip price type into a USE_TYPE and then further tidy by flipping location into a STATE column.
Divide production by 10,000 to get a reduced version that will graph more easily with price
changed_price_production_data <- price_production_data %>%
pivot_longer(
cols = price_res_deliv:price_comm,
names_to = c("category", "type_use"),
names_sep = "_",
values_to = "price") %>%
relocate(`type_use`) %>%
relocate(`price`) %>%
relocate(`category`)
changed_v2_price_production_data <- changed_price_production_data %>%
pivot_longer(
cols = `Alaska (MMcf)`:`Wyoming (MMcf)`,
names_to = c("state"),
#names_sep = "_",
values_to = "production") %>%
relocate(`production`) %>%
mutate(distilled_production = production / 10000) %>%
relocate(distilled_production) %>%
relocate(category) %>%
relocate(type_use) %>%
relocate(`state`) %>%
relocate(`Date`) %>%
relocate(month_data) %>%
relocate(year_data)
# Information about the changed_price_production_data dataset
colnames(changed_price_production_data)
[1] "category"
[2] "price"
[3] "type_use"
[4] "Date"
[5] "month_data"
[6] "year_data"
[7] "Alaska (MMcf)"
[8] "Arkansas (MMcf)"
[9] "California (MMcf)"
[10] "Colorado (MMcf)"
[11] "Federal Offshore--Gulf of Mexico (MMcf)"
[12] "Kansas (MMcf)"
[13] "Louisiana (MMcf)"
[14] "Montana (MMcf)"
[15] "New Mexico (MMcf)"
[16] "North Dakota (MMcf)"
[17] "Ohio (MMcf)"
[18] "Oklahoma (MMcf)"
[19] "Pennsylvania (MMcf)"
[20] "Texas (MMcf)"
[21] "Utah (MMcf)"
[22] "West Virginia (MMcf)"
[23] "Wyoming (MMcf)"
[24] "largest_producer"
changed_price_production_data
# A tibble: 458 x 24
category price type_use Date month_data year_data
<chr> <dbl> <chr> <dttm> <ord> <dbl>
1 price 8.18 res 2003-01-15 00:00:00 Jan 2003
2 price 7.48 comm 2003-01-15 00:00:00 Jan 2003
3 price 8.58 res 2003-02-15 00:00:00 Feb 2003
4 price 7.98 comm 2003-02-15 00:00:00 Feb 2003
5 price 9.77 res 2003-03-15 00:00:00 Mar 2003
6 price 9.2 comm 2003-03-15 00:00:00 Mar 2003
7 price 10.2 res 2003-04-15 00:00:00 Apr 2003
8 price 8.97 comm 2003-04-15 00:00:00 Apr 2003
9 price 10.8 res 2003-05-15 00:00:00 May 2003
10 price 8.71 comm 2003-05-15 00:00:00 May 2003
# ... with 448 more rows, and 18 more variables:
# `Alaska (MMcf)` <dbl>, `Arkansas (MMcf)` <dbl>,
# `California (MMcf)` <dbl>, `Colorado (MMcf)` <dbl>,
# `Federal Offshore--Gulf of Mexico (MMcf)` <dbl>,
# `Kansas (MMcf)` <dbl>, `Louisiana (MMcf)` <dbl>,
# `Montana (MMcf)` <dbl>, `New Mexico (MMcf)` <dbl>,
# `North Dakota (MMcf)` <dbl>, `Ohio (MMcf)` <dbl>, ...
# Information about the changed_v2_price_production_data dataset
colnames(changed_v2_price_production_data)
[1] "year_data" "month_data"
[3] "Date" "state"
[5] "type_use" "category"
[7] "distilled_production" "production"
[9] "price" "largest_producer"
changed_v2_price_production_data
# A tibble: 7,786 x 10
year_data month_data Date state type_use category
<dbl> <ord> <dttm> <chr> <chr> <chr>
1 2003 Jan 2003-01-15 00:00:00 Alaska ~ res price
2 2003 Jan 2003-01-15 00:00:00 Arkansa~ res price
3 2003 Jan 2003-01-15 00:00:00 Califor~ res price
4 2003 Jan 2003-01-15 00:00:00 Colorad~ res price
5 2003 Jan 2003-01-15 00:00:00 Federal~ res price
6 2003 Jan 2003-01-15 00:00:00 Kansas ~ res price
7 2003 Jan 2003-01-15 00:00:00 Louisia~ res price
8 2003 Jan 2003-01-15 00:00:00 Montana~ res price
9 2003 Jan 2003-01-15 00:00:00 New Mex~ res price
10 2003 Jan 2003-01-15 00:00:00 North D~ res price
# ... with 7,776 more rows, and 4 more variables:
# distilled_production <dbl>, production <dbl>, price <dbl>,
# largest_producer <dbl>
Group and summarize by month, year, and state
The datasets contain the means for the reduced value for production and the overall price (not grouped by residential / commercial. It would have been much better to group and average price with those categories but the double grouping stymied me).
by_month_v2_summary_data <- changed_v2_price_production_data %>%
group_by(month_data) %>%
summarise(production_avg = mean(distilled_production, na.rm=TRUE), price = mean(price, na.rm = TRUE))
by_state_v2_summary_data <- changed_v2_price_production_data %>%
group_by(state) %>%
summarise(production_avg = mean(distilled_production,na.rm =TRUE), price = mean(price, na.rm = TRUE))
by_year_v2_summary_data <- changed_v2_price_production_data %>%
group_by(year_data) %>%
summarise(production_avg = mean(distilled_production,na.rm =TRUE), price = mean(price, na.rm = TRUE))
# Globally replace any value that wasn't provided (was a non-numeric), replace it with the median of all the other values in that instance.
ggplot(by_month_summary_data, aes(x = month_data, y = price_res_deliv)) +
geom_point(aes(y = price_res_deliv, x = month_data), color = "blue") +
geom_point(aes(x = month_data, y = price_comm), color="red") +
theme_linedraw()+
labs(title="Average Price by Month over 20 Years (red = commercial, blue = residential)", y="Delivery Price", x="Month")
ggplot(by_year_summary_data, aes(x = year_data, y = price_res_deliv)) +
geom_smooth(aes(y = price_res_deliv, x = year_data), color = "blue") +
geom_smooth(aes(x = year_data, y = price_comm), color="red") +
labs(title="Average Price by Year over 20 Years", y="Delivery Price (red = commercial, blue = residential)", x="Month")
ggplot(by_month_v2_summary_data, aes(x = month_data, y = production_avg, fill = month_data)) +
geom_col(aes(y = production_avg, x = month_data), color = "blue") +
geom_col(aes(x = month_data, y = price), color="red") +
theme_light() +
labs(title="Average Monthly Price v Prod over 20 Years", y="Price and Production", x="Month")
ggplot(by_state_v2_summary_data, aes(x = state, y = production_avg)) +
geom_col(aes(y = production_avg, x = state)) +
theme_classic() +
labs(title="Average Production by State over 20 Years", y="Production") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
# print(by_month_v2_summary_data)
#ggplot(by_month_v2_summary_data, aes(x = month_data, y = production)) +
# geom_smooth(aes(x = month_data, y = price), color="red") +
# geom_smooth(aes(x = month_data, y = production), color="blue") +
# facet_wrap(var(month_data)) +
# labs(title="Average Price by Month over 20 Years", y="Delivery Price", x="Month")
# ggplot(by_month_summary_data, aes(x = month_data)) +
# geom_line(aes(x = month_data, y = price_res_deliv)) +
# geom_line(aes(x = month_data, y = price_comm))
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
TLamkin (2022, March 23). Data Analytics and Computational Social Science: HW6. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin881487/
BibTeX citation
@misc{tlamkin2022hw6, author = {TLamkin, }, title = {Data Analytics and Computational Social Science: HW6}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin881487/}, year = {2022} }