HW 2

hw2

emissions

Reading in Data

Author

Paarth Tandon

Published

January 4, 2023

library(tidyverse)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in Data

I will be using an emissions dataset for my homework. I downloaded it from this Kaggle page.

set.seed(42)
# read in the data using readr
emissions <- read_csv("_data/emissions.csv")
# sample a few data points
emissions[sample(nrow(emissions), 10), ]

Country	ISO 3166-1 alpha-3	Year	Total	Coal	Oil	Gas	Cement	Flaring	Other	Per Capita
Viet Nam	VNM	1962	9.337664	7.122816	1.985888	0	0.22896	0	NA	0.270391
Sweden	SWE	1774	0.000000	NA	NA	NA	NA	NA	NA	NA
Maldives	MDV	1769	0.000000	NA	NA	NA	NA	NA	NA	NA
Burundi	BDI	1871	0.000000	NA	NA	NA	NA	NA	NA	NA
South Sudan	SSD	1989	0.312751	0.000000	0.312751	0	0.00000	0	NA	0.066449
Russia	RUS	1821	0.000000	NA	NA	NA	NA	NA	NA	NA
Equatorial Guinea	GNQ	1897	0.000000	NA	NA	NA	NA	NA	NA	NA
Wallis and Futuna Islands	WLF	1882	0.000000	NA	NA	NA	NA	NA	NA	NA
British Virgin Islands	VGB	1833	0.000000	NA	NA	NA	NA	NA	NA	NA
Montenegro	MNE	1751	0.000000	NA	NA	NA	NA	NA	NA	NA

Clean Data

The poster of this dataset claims that it ranges from 2002 to 2022, but for some reason the data includes many samples from before that range (going all the way back to 1750?!?). I am going to assume they claimed 2002 because that is when the data becomes accurate and complete. Because of this, I will drop any samples before 2002.

set.seed(42)

# remove old samples
emissions_clean <- filter(emissions, Year >= 2002)

# sample a few data points
emissions_clean[sample(nrow(emissions_clean), 10), ]

Country	ISO 3166-1 alpha-3	Year	Total	Coal	Oil	Gas	Cement	Flaring	Other	Per Capita
Mauritania	MRT	2010	2.044512	0.000000	2.044512	0.000000	0.000000	0.000000	NA	0.597905
Taiwan	TWN	2010	270.148000	157.439274	73.381644	30.711082	8.105000	0.000000	0.511000	11.703289
Lithuania	LTU	2010	13.946613	0.842852	6.848402	5.663580	0.289045	0.257944	0.044790	4.442985
Denmark	DNK	2019	30.955444	3.588381	19.698810	6.016661	1.129199	0.194717	0.327676	5.340941
Eritrea	ERI	2013	0.620806	0.000000	0.531280	0.000000	0.089526	0.000000	NA	0.188330
Burkina Faso	BFA	2015	3.714985	0.000000	3.191869	0.000000	0.523116	0.000000	NA	0.198471
Kenya	KEN	2018	18.169173	1.838957	13.993658	0.000000	2.336557	0.000000	NA	0.363723
St. Kitts-Nevis-Anguilla	KNA	2012	0.000000	NA	NA	NA	NA	NA	NA	0.000000
Barbados	BRB	2017	1.206408	0.000000	1.066224	0.043968	0.070568	0.025648	NA	4.321146
Turks and Caicos Islands	TCA	2002	0.150224	0.000000	0.150224	0.000000	0.000000	0.000000	NA	7.293135
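
A quick sanity check on the filter is to look at the year range before and after it. The sketch below uses a small made-up data frame (`toy`, with hypothetical countries and values) in place of `emissions`, and base R indexing in place of `filter()` so that it runs without any packages:

```r
# Toy stand-in for the emissions data (countries and values are made up)
toy <- data.frame(
  Country = c("A", "A", "B", "B"),
  Year    = c(1774, 2010, 1989, 2018),
  Total   = c(0.0, 52.0, 5.0, 18.2)
)

# Year range before filtering
print(range(toy$Year))

# Same condition as the cleaning step: keep 2002 and later
toy_clean <- toy[toy$Year >= 2002, ]

# Year range after filtering, and how many rows were dropped
print(range(toy_clean$Year))
print(nrow(toy) - nrow(toy_clean))
```

On the real data the same two `range()` calls would show whether anything before 2002 survived the filter.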

As you can see, there are many missing data points. This is most likely due to lack of reporting in those categories that year. I do not think that there is a blanket way to deal with this. It will be dealt with case by case when doing analysis on the data. One may ask: Why not set it to zero? Well, there are issues with doing that. For example, is it really fair to say a country had zero emissions in Coal just because they did not report it?
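
To put a number on "many missing data points", one option is to count the NA values in each column. This is only a sketch: `toy` is a made-up stand-in for `emissions_clean`, and the column names are a subset of the real ones.

```r
# Toy stand-in for emissions_clean (values are made up for illustration)
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Year    = c(2010, 2010, 2011, 2011),
  Total   = c(2.0, 270.1, 0.0, 13.9),
  Coal    = c(0.0, 157.4, NA, 0.8),
  Other   = c(NA, 0.5, NA, NA)
)

# Number and share of missing values in each column
na_count <- colSums(is.na(toy))
na_share <- colMeans(is.na(toy))

na_summary <- data.frame(
  column   = names(na_count),
  n_miss   = as.integer(na_count),
  pct_miss = round(100 * as.numeric(na_share), 1)
)
print(na_summary)
```

Running the same summary per country or per year would show where the reporting gaps are concentrated.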

Another way to deal with the missing data problem is to sample from the overall distribution in that category. For example, if we wanted to fill in the emissions due to Coal for KNA in 2012, we could fit a Gaussian distribution to all the observed data in the Coal category and sample from it. This method could also be problematic when doing analysis, as it may misrepresent the data point, but it is probably better than setting it to zero. Of course, the Gaussian distribution could also be replaced with other distributions, or even a simple mean/median.
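
As a rough sketch of this idea, the code below fills missing Coal values either by drawing from a fitted Gaussian or by using the median. The `coal` vector is made up, and clipping the draws at zero is my own assumption (a Gaussian can produce negative values, which make no sense for emissions).

```r
set.seed(42)

# Toy Coal column (made-up values); NA marks an unreported entry
coal <- c(0.0, 157.4, 0.8, 3.6, NA, 1.8, NA)

# Fit a Gaussian to the observed values only
mu    <- mean(coal, na.rm = TRUE)
sigma <- sd(coal, na.rm = TRUE)

missing <- is.na(coal)

# Option 1: draw each missing value from the fitted Gaussian.
# Assumption: negative draws are clipped to zero.
coal_gauss <- coal
coal_gauss[missing] <- pmax(rnorm(sum(missing), mean = mu, sd = sigma), 0)

# Option 2: replace each missing value with the observed median
coal_median <- coal
coal_median[missing] <- median(coal, na.rm = TRUE)

print(coal_gauss)
print(coal_median)
```

With data this skewed, the Gaussian draws can land far from most observed values, which is one reason the median may be the safer default.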

Another common method is to drop rows containing missing data. This ranges in strictness. For example, we could drop every row that contains any missing value, or we could drop only rows with more than a certain number of missing values. I don't believe that I will be able to apply this strategy, as it would remove far too much crucial data.
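
The two levels of strictness can be sketched as follows. The data frame is made up, and `max_missing` is a hypothetical threshold chosen only for illustration.

```r
# Toy stand-in for emissions_clean (values are made up for illustration)
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Coal    = c(0.0, 157.4, NA, 0.8),
  Oil     = c(2.0, 73.4, NA, 6.8),
  Other   = c(NA, 0.5, NA, NA)
)

# Strict: keep only rows with no missing values at all
strict <- toy[complete.cases(toy), ]

# Lenient: keep rows with at most max_missing missing values
# (max_missing is a hypothetical threshold)
max_missing <- 1
lenient <- toy[rowSums(is.na(toy)) <= max_missing, ]

print(nrow(strict))
print(nrow(lenient))
```

Even in this tiny example the strict rule keeps a single row, because the Other column is almost always missing.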

In the end, it is difficult to deal with missing data, and each strategy will have downsides. It may be worth comparing different methods to see how they affect the analysis. Nothing will be perfect, and we can only try our best to replace the data. But hey, that's the point of stats ¯\_(ツ)_/¯: estimation with reasoning.

Data Narrative

Carbon Emissions

This dataset presents us with carbon emission reports. You may be wondering: What are carbon emissions? According to the European Union: “Carbon dioxide emissions or CO2 emissions are emissions stemming from the burning of fossil fuels and the manufacture of cement; they include carbon dioxide produced during consumption of solid, liquid, and gas fuels as well as gas flaring.”

Our data matches this definition closely, reporting emissions from coal, oil, gas, cement, flaring, and other sources. The data is presented per year and per country and, after the cleaning step above, ranges from 2002 to 2022.

Variables Explained

Country <chr>: Country from where emissions were reported
ISO 3166-1 alpha-3 <chr>: 3 character code for the country
Year <dbl>: The year the emissions were reported
Total <dbl>: Total emissions measured in kt
Coal <dbl>: Emission from burning coal measured in kt
Oil <dbl>: Emission from burning oil measured in kt
Gas <dbl>: Emission from burning gas measured in kt
Cement <dbl>: Emission from cement manufacturing measured in kt
Flaring <dbl>: Emission from flaring measured in kt
- a flare is a gas combustion device used in places such as petroleum refineries, chemical plants and natural gas processing plants, oil or gas extraction sites having oil wells, gas wells, offshore oil and gas rigs and landfills. (Wikipedia)
Other <dbl>: Emission from other causes measured in kt
Per Capita <dbl>: Total emissions per person

Potential Research Questions

In the most recent year, how do countries compare in emissions?
- Breakdown based on emission type and continent
- Compare countries using null hypothesis testing
Have countries changed over time?
- Breakdown based on emission type and continent
- Compare countries using null hypothesis testing
How do the different emission types compare overall?
- Over time and in the most recent year
- Compare using null hypothesis testing
What about reporting?
- Which countries fail to report the most and how has it changed?
- How about overall reporting?
- Have there been significant changes in any specific types?
Can we predict future emissions based on this data?
- Regression models…
- Probably need some sort of basis expansion (polynomial/trigonometric) if I want to use linear regression.
- Probably limited by data size for high-parameter model types.
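
As a first sketch of the regression idea, a polynomial basis expansion can be passed straight to `lm()` through `poly()`. The yearly totals below are made up and noise-free (an exact quadratic for one hypothetical country), so the fit is perfect here; real data would not behave this nicely.

```r
# Toy yearly totals for a single hypothetical country (made-up values)
years  <- 2002:2021
totals <- 100 + 3 * (years - 2002) - 0.1 * (years - 2002)^2

toy <- data.frame(Year = years, Total = totals)

# Linear regression on a degree-2 polynomial basis of Year
fit <- lm(Total ~ poly(Year, 2), data = toy)

# Predict the next three years
future <- data.frame(Year = 2022:2024)
future$Predicted <- predict(fit, newdata = future)
print(future)
```

The model stays linear in its coefficients, so everything that applies to ordinary linear regression still applies. Only the inputs have been expanded.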