Data Analytics and Computational Social Science: Homework 5

Erica Laidler

Introduction

Climate change is perhaps the most alarming and important topic of our generation. It has already resulted in a wide array of cascading effects, including more frequent and extreme storms, heat waves, dry spells, higher sea levels, and warming oceans (WWF). As time goes on, unless we collectively take serious systemic measures to slow the effects, climate change will continue to wreak havoc on animal habitats and our communities. In this project, I focus on extreme weather patterns, specifically erratic precipitation in the United States. I set out to determine which state tends to experience the most inconsistent weather, and find that Alaska, Washington, Oregon, and Louisiana had the greatest ranges between their highest and lowest recorded precipitation measurements in 2021. I then focus on Louisiana and test whether its June precipitation has changed since 1895; a two-sided t-test does not find a statistically significant difference between 1895-1910 and 2000-2021.

Data

The dataset I will be using contains detailed climate data from the National Centers for Environmental Information (NCEI). More specifically, it contains information on precipitation in the United States. It can be found at: https://www.ncei.noaa.gov/data/climdiv/ (data set ‘pcpncy’). The data set contains records of the total precipitation which fell in each division of each state over the course of every month from 1895 to 2021.

Tidying the Data

Before importing into R, I download the data set as a .txt file and then use an external application to convert it into a .csv file.

Structuring the data set into a more useful, easily comprehensible form requires significant tidying. In the original data set, the row names are coded. Certain digits of the row names represent the state, metric (temperature or precipitation), and year which are associated with that particular row.

climate <- read.csv("climate.csv")
print(head(climate))

  X01001011895 X7.03 X2.96 X8.36 X3.53 X3.96 X5.40 X3.92 X3.36 X0.73
1   1001011896  5.86  5.42  5.54  3.98  3.77  6.24  4.38  2.57  0.82
2   1001011897  3.27  6.63 10.94  4.35  0.81  1.57  3.96  5.02  0.87
3   1001011898  2.33  2.07  2.60  4.56  0.54  3.13  5.80  6.02  1.51
4   1001011899  5.80  6.94  3.35  2.22  2.93  2.31  6.80  2.90  0.63
5   1001011900  3.18  9.07  5.77  7.14  1.63  7.36  3.35  3.85  4.74
6   1001011901  5.20  4.39  6.35  4.61  5.44  2.24  2.79  5.58  3.75
  X2.03 X1.44 X3.66
1  1.66  2.89  1.94
2  0.75  1.84  4.38
3  3.21  6.66  3.91
4  3.02  1.98  5.25
5  5.92  4.09  4.89
6  1.01  2.07  7.55

As depicted above, the data set is not immediately interpretable. According to the code book, the first one or two digits of the row names represent the state ID (the leading zero of a single-digit ID is dropped when the ID is read as a number). The next three digits represent the division within the state. The next two digits represent the metric (precipitation or a temperature measure), and the last four represent the year. In turn, each of the 12 remaining columns represents a month and contains the relevant weather measurement for that row. The elements in the data frame are all of type double, because they refer to precipitation measurements (in inches).

Ultimately, I want the final form of the data set to contain 16 columns. The first 12 should remain the 12 months of the year, and the next four should be state, division, metric, and year. This format makes the information about each row much clearer, and further analysis of the data set will not require extracting subsets of the digits of the row names. For instance, if I want to analyze Maryland precipitation in April, I can refer to the categorical column ‘state’ and subset based on the Maryland state ID, as opposed to referring to the cumbersome coded row names.

First, I replace the column names with the months of the year. Also, in the original data set, the first row is misplaced and sits where the column names should be. Thus I rename the columns, and then move the first row to where it belongs.

#Change the column names to represent the months of the year:

climate2 <- climate
colnames(climate2) = c("a", "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")

climate3 <- climate2
library(tidyverse)
climate3 <- climate3 %>% add_row("a" = 01001011895, "January" = 7.03, "February" = 2.96, "March" = 8.36, "April" = 3.53, "May" = 3.96, "June"  = 5.40, "July" = 3.92, "August" = 3.36, "September" = 0.73, "October" = 2.03, "November" = 1.44, "December" = 3.66, .before = 1)
head(climate3)

           a January February March April  May June July August
1 1001011895    7.03     2.96  8.36  3.53 3.96 5.40 3.92   3.36
2 1001011896    5.86     5.42  5.54  3.98 3.77 6.24 4.38   2.57
3 1001011897    3.27     6.63 10.94  4.35 0.81 1.57 3.96   5.02
4 1001011898    2.33     2.07  2.60  4.56 0.54 3.13 5.80   6.02
5 1001011899    5.80     6.94  3.35  2.22 2.93 2.31 6.80   2.90
6 1001011900    3.18     9.07  5.77  7.14 1.63 7.36 3.35   3.85
  September October November December
1      0.73    2.03     1.44     3.66
2      0.82    1.66     2.89     1.94
3      0.87    0.75     1.84     4.38
4      1.51    3.21     6.66     3.91
5      0.63    3.02     1.98     5.25
6      4.74    5.92     4.09     4.89

Now the data set has column names, and the first row was placed where it belongs, as shown above.
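The renaming and row re-insertion above could likely be avoided at import time. The following is a minimal sketch, assuming the converted file has no header line and 13 comma-separated fields per row. The two sample rows are copied from the output above, and the `text` argument stands in for the real climate.csv; the name `climate_raw` is only illustrative.

```r
# Two sample rows copied from the output above; in practice, pass the
# file name "climate.csv" instead of the text argument.
sample_rows <- "01001011895,7.03,2.96,8.36,3.53,3.96,5.40,3.92,3.36,0.73,2.03,1.44,3.66
01001011896,5.86,5.42,5.54,3.98,3.77,6.24,4.38,2.57,0.82,1.66,2.89,1.94"

# header = FALSE keeps the first record as data rather than column names.
# Reading the ID as character preserves its leading zero.
climate_raw <- read.csv(
  text = sample_rows,
  header = FALSE,
  col.names = c("a", month.name),
  colClasses = c("character", rep("numeric", 12))
)

print(climate_raw)
```

Keeping the ID as a character string also means every ID has the same width, which may simplify the digit extraction later on.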

My next goal is to convert the data set, such that instead of relying on coded row names, it contains columns that represent the state, division, metric, and year. I perform this data tidying process below.

#Change the row names:

#(1) Establish the row names.
climate4 <- climate3
rownames(climate4) = climate4$a

#(2) Eliminate the now unnecessary column a.
climate5 <- climate4[,-(1),drop=FALSE] 

#Create function that counts the number of digits in a number.
nDigits <- function(x) nchar(trunc(abs(x)))

#In the code below, I count how many row names have 10 digits and how many have 11.
#The state ID's at the beginning have either 1 or 2 digits, 
#depending on whether the ID is single- or double-digit.
sum_10 = 0
sum_11 = 0
for(i in 1:nrow(climate5)){
  if (nDigits(as.numeric(rownames(climate5)[i])) == 11){
    sum_11 = sum_11 + 1
  }
  if (nDigits(as.numeric(rownames(climate5)[i])) == 10){
    sum_10 = sum_10 + 1
  }
}
print(paste0("Number of single-digit state ID's: ", sum_10))

[1] "Number of single-digit state ID's: 66048"

print(paste0("Number of double-digit state ID's: ", sum_11))

[1] "Number of double-digit state ID's: 334588"

#Confirm that all of the state ID's are either 1 or 2 digits: 
(sum_11 + sum_10) == nrow(climate5)

[1] TRUE

#Now we have confirmed that the state ID's either have 1 or 2 digits, and hence that
#the row names have either 10 or 11 digits total. We need to deal with both of these
#cases.

#Create 4 columns: state, division, metric, and year based on the codings in the row
#names, and taking into account the cases when the row names are either 10 or 11
#digits.

#Contains the number of digits for each row name.
name_num = nDigits(as.numeric(rownames(climate5)))

climate_try <- climate5 %>%
mutate(state = ifelse(name_num == 11, as.numeric(substr(rownames(climate5), 1,2)), as.numeric(substr(rownames(climate5), 1,1))),
       division = ifelse(name_num == 11, as.numeric(substr(rownames(climate5), 3,5)), as.numeric(substr(rownames(climate5), 2,4))),
       metric = ifelse(name_num == 11, as.numeric(substr(rownames(climate5), 6,7)), as.numeric(substr(rownames(climate5), 5,6))),
       year = ifelse(name_num == 11, as.numeric(substr(rownames(climate5), 8,11)), as.numeric(substr(rownames(climate5), 7,10))))

climate6 <- climate_try

#Add another column to explicitly describe what metric number codings 1, 2, 27, and
#28 refer to.
climate7 <- climate6 %>%
  mutate(metric_name = case_when(
    metric == 1 ~ "Precipitation",
    metric == 2 ~ "Average Temperature",
    metric == 27 ~ "Maximum Temperature",
    metric == 28 ~ "Minimum Temperature"
  ))

#Rename row names to simply be ID numbers from 1 to 400,636. 
climate8 <- climate7
rownames(climate8) = 1:nrow(climate8)
head(climate8)

  January February March April  May June July August September
1    7.03     2.96  8.36  3.53 3.96 5.40 3.92   3.36      0.73
2    5.86     5.42  5.54  3.98 3.77 6.24 4.38   2.57      0.82
3    3.27     6.63 10.94  4.35 0.81 1.57 3.96   5.02      0.87
4    2.33     2.07  2.60  4.56 0.54 3.13 5.80   6.02      1.51
5    5.80     6.94  3.35  2.22 2.93 2.31 6.80   2.90      0.63
6    3.18     9.07  5.77  7.14 1.63 7.36 3.35   3.85      4.74
  October November December state division metric year   metric_name
1    2.03     1.44     3.66     1        1      1 1895 Precipitation
2    1.66     2.89     1.94     1        1      1 1896 Precipitation
3    0.75     1.84     4.38     1        1      1 1897 Precipitation
4    3.21     6.66     3.91     1        1      1 1898 Precipitation
5    3.02     1.98     5.25     1        1      1 1899 Precipitation
6    5.92     4.09     4.89     1        1      1 1900 Precipitation

Now the data frame is tidied. The row names are simply ID numbers from 1 to 400,636, and there are new columns for state, division, metric, year, and metric name. The first 12 columns of each row contain the precipitation or temperature values from January to December for a specific state, division, and year. The metric column gives the numeric code of the metric being reported in that row. Because that code is not self-explanatory, I include an additional column, metric_name, which states whether the metric represented in that row is precipitation, average temperature, maximum temperature, or minimum temperature. I choose to keep the columns state and division coded, which means that it will be useful to have the code book available during further analysis of the data set.
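The two-case logic for 10- and 11-digit identifiers can also be collapsed into a single case by padding every identifier to 11 characters first. The following is only a sketch under that assumption; the two IDs are hypothetical examples that follow the layout described above (state, division, metric, year), and `padded` and `parsed` are illustrative names.

```r
# Hypothetical IDs: one with 10 digits (single-digit state) and one with 11.
ids <- c(1001011895, 16001012021)

# Pad to a fixed width of 11 characters so each field sits at the same position.
padded <- sprintf("%011.0f", ids)

# Fixed positions: state (1-2), division (3-5), metric (6-7), year (8-11).
parsed <- data.frame(
  state    = as.numeric(substr(padded, 1, 2)),
  division = as.numeric(substr(padded, 3, 5)),
  metric   = as.numeric(substr(padded, 6, 7)),
  year     = as.numeric(substr(padded, 8, 11))
)

print(parsed)
```

With a fixed width, one `substr()` call per field is enough and no `ifelse()` branch is needed.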

I make some final edits upon further exploration of the data. First, I observe that all the measurements in this particular data set are precipitation measurements: every value in column 15 (‘metric’) equals 1, the code for precipitation, as shown below:

sum6 <- sum(((climate8[,15] != 1)))
print(paste0("Number of non-precipitation measurements: ", sum6))

[1] "Number of non-precipitation measurements: 0"

For this reason, I eliminate the metric and metric_name columns, as they provide no extra information.

climate9 <- subset(climate8, select = -c(metric, metric_name))

I also remove the rows for 2022. This is because upon exploration of the dataset, I noticed that these rows contained the value -9.99 for a lot of their entries. 2022 was the only year for which this happened. There is no explicit mention of this on the website where the data set originated, but my guess is that the data was collected mid-way through 2022, meaning that they left a lot of entries with negative default values. Below I remove all rows pertaining to 2022:

climate_new <- subset(climate9, year != 2022)
climate <- climate_new

#Check to see if there are any more negative values in the 12 month columns.
sum(climate[, 1:12] < 0)

[1] 0
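An alternative to dropping whole rows is to mark the placeholder entries as missing. The following is a minimal sketch on made-up data, assuming (as guessed above) that -9.99 is a placeholder for months that had not yet been recorded; `toy`, `month_cols`, and `toy_na` are illustrative names.

```r
# Toy data: the second row mimics an incomplete year with a -9.99 placeholder.
toy <- data.frame(
  January  = c(3.10, 2.45),
  February = c(4.20, -9.99),
  year     = c(2021, 2022)
)

month_cols <- c("January", "February")

# Count negative entries across the month columns only.
n_negative <- sum(toy[, month_cols] < 0)

# Instead of dropping rows, mark the placeholder as missing.
toy_na <- toy
toy_na[month_cols][toy_na[month_cols] == -9.99] <- NA

print(n_negative)
print(toy_na)
```

Functions such as `mean()` and `max()` would then need `na.rm = TRUE` to skip the missing months.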

Understanding the Data

The first goal is to determine which state had the largest range of precipitation in 2021, that is, the greatest difference between the measurement for its highest-precipitation month and the measurement for its lowest-precipitation month.

I start by finding the maximum measurement values for each state in 2021. I loop over all the regional measurements in a particular state in 2021, and find the highest measurement. At the end I will have a vector with 49 values. Each value will represent the highest precipitation measurement for a particular state.

#FINDING MAXIMUM VALUES:

#Initialize the list 'highest', with one element per month.
highest <- vector("list", 12)

#In the loop below, we will fill in 'highest'. It contains 12 vectors, which
#represent the 12 months in 2021. In each vector, there are 49 values for
#the 49 states. Each of the 49 values represents the highest precipitation measurement
#recorded in the state that month, out of all the divisions in the state.

#For instance, the first vector in 'highest' is 'January'. The first value in January
#is associated with state #1 (Alabama) and has a measurement of 3.19. This means that
#the maximum precipitation value recorded in Alabama (state #1) during January 2021
#was 3.19.

climate_t <- climate

library(tidyverse)
climate10 <- c()

for(i in 1:12){
  #For month i, keep the 2021 row with the largest value in each state.
  climate10 <- climate_t %>%
    group_by(state) %>%
    arrange(desc(climate_t[,i])) %>%
    filter(year == 2021) %>%
    slice(1) %>%
    ungroup() %>%
    select(state, all_of(i))
  highest[[i]] <- climate10[[2]]
}

#Now that we have found the highest precipitation measurement for each month for
#every state, I want to find the maximum value of the whole year for every state.

#The vector max_val will contain 49 values, each of which is the highest
#precipitation measurement recorded in one of the 49 states over the course of 2021.
max_val <- numeric(49)
for (i in 1:49){
  #Take the largest of state i's 12 monthly maxima.
  max_val[i] = max(sapply(highest, function(month) month[i]))
}
max_val

 [1] 13.74  8.93 11.29 19.30  8.83 11.37  6.41 15.39 13.87  8.85  9.03
[12]  9.54 13.02 10.02 10.35 20.16 11.07  9.81 12.66 10.39  9.58 15.30
[23] 10.95  7.22  8.28  6.04 15.57 10.35  4.81 10.87 13.01  6.00  8.02
[34] 10.89 20.89 11.69  9.65 10.57  7.14 13.40 17.00  5.34 13.19 10.33
[45] 27.09  8.00 10.72  5.69 28.78

The max_val vector contains the highest recorded precipitation measurement for each of the 49 states in the year 2021.

Now I repeat the process for the minimum value.

#FINDING MINIMUM VALUES:

#Initialize the list 'lowest', with one element per month.
lowest <- vector("list", 12)

#Below I complete 'lowest', which has the same structure as 'highest' in the
#chunk of code above, except that it represents the minimum values
#rather than the maximum values.

climate_t <- climate

library(tidyverse)
climate10 <- c()

for(i in 1:12){
  #For month i, keep the 2021 row with the smallest value in each state.
  climate10 <- climate_t %>%
    group_by(state) %>%
    arrange(climate_t[,i]) %>%
    filter(year == 2021) %>%
    slice(1) %>%
    ungroup() %>%
    select(state, all_of(i))
  lowest[[i]] <- climate10[[2]]
}

#Now I create the vector min_val, which is similar in structure to max_val in the
#code chunk above, except now it contains the minimum values.
min_val <- numeric(49)
for (i in 1:49){
  #Take the smallest of state i's 12 monthly minima.
  min_val[i] = min(sapply(lowest, function(month) month[i]))
}
min_val

 [1] 0.26 0.00 0.64 0.00 0.00 1.44 0.75 0.14 0.20 0.02 0.24 0.72 0.19
[14] 0.00 1.34 0.37 0.88 0.60 1.20 0.53 0.09 0.57 0.27 0.03 0.05 0.00
[27] 1.14 0.75 0.00 0.79 0.22 0.02 0.96 0.00 0.00 0.74 1.57 0.32 0.02
[40] 0.82 0.00 0.01 0.86 0.16 0.00 0.66 0.39 0.11 0.15

The min_val vector contains the lowest recorded precipitation measurement for each of the 49 states in the year 2021.
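The per-month loops above can also be expressed as one grouped summary. The following is a minimal sketch on toy data (the values and the two-month layout are made up for illustration), assuming the same column layout as the tidied data frame; `state_range` and `precip` are illustrative names.

```r
library(dplyr)
library(tidyr)

# Toy data with the same layout as the tidied data frame (two months only).
toy <- data.frame(
  January  = c(1.0, 4.0, 0.5, 2.0),
  February = c(3.0, 0.2, 6.0, 1.5),
  state    = c(1, 1, 2, 2),
  division = c(1, 3, 1, 3),
  year     = 2021
)

state_range <- toy %>%
  filter(year == 2021) %>%
  # Reshape to one row per state, division, and month.
  pivot_longer(c(January, February), names_to = "month", values_to = "precip") %>%
  group_by(state) %>%
  summarize(max_val = max(precip), min_val = min(precip)) %>%
  mutate(range = max_val - min_val) %>%
  arrange(desc(range))

print(state_range)
```

Because the result keeps the state ID next to each value, the ranking no longer depends on the position of a state within a vector.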

It is interesting to note a few features. For one, according to the data set, Alaska (state #49) has the highest precipitation measurement out of all the states in the US in 2021, at 28.78 inches. This is followed by Washington (27.09 in.), Oregon (20.89 in.), and Louisiana (20.16 in.).

print(paste0("Highest precipitation measurement in 2021: ", max(max_val)))

[1] "Highest precipitation measurement in 2021: 28.78"

print(paste0("State ID associated with highest precipitation measurement: ", which.max(max_val)))

[1] "State ID associated with highest precipitation measurement: 49"

print(paste0("Second highest precipitation measurement - Washington: ", sort(max_val,partial=48)[48]))

[1] "Second highest precipitation measurement - Washington: 27.09"

print(paste0("Third highest precipitation measurement - Oregon: ",
 sort(max_val,partial=47)[47]))

[1] "Third highest precipitation measurement - Oregon: 20.89"

print(paste0("Fourth highest precipitation measurement - Louisiana: ",sort(max_val,partial=46)[46]))

[1] "Fourth highest precipitation measurement - Louisiana: 20.16"

We also see that 10 states share the lowest precipitation measurement in 2021, at 0 inches. It is possible for there to be no rain at all during some months. The 10 states that have precipitation measurements of 0 at some point in 2021 are Arizona, California, Colorado, Kansas, Nevada, New Mexico, Oklahoma, Oregon, Texas, and Washington.

print(paste0("Lowest precipitation measurement in 2021: ", min(min_val)))

[1] "Lowest precipitation measurement in 2021: 0"

print(paste0("Number of states which have a precipitation measurement of 0: ", sum(min_val == 0)))

[1] "Number of states which have a precipitation measurement of 0: 10"

#state ID's for states that had precipitation measurements of 0 in 2021
zero <- which(min_val == 0)
print("State ID's associated with precipitation measurements of 0 in 2021: ")

[1] "State ID's associated with precipitation measurements of 0 in 2021: "

print(zero)

 [1]  2  4  5 14 26 29 34 35 41 45

I now specifically address the research question. I want to find the state that had the highest range in precipitation, meaning that their rainy season was the most drastically different from their dry season.

print(paste0("Largest range in precipitation: ", max(max_val-min_val)))

[1] "Largest range in precipitation: 28.63"

print(paste0("State associated with largest range in precipitation: ", which.max(max_val-min_val)))

[1] "State associated with largest range in precipitation: 49"

range <- max_val - min_val

print(paste0("Second highest range in precipitation : ", sort(range, partial=48)[48]))

[1] "Second highest range in precipitation : 27.09"

print(paste0("Third highest range in precipitation : ", sort(range, partial=47)[47]))

[1] "Third highest range in precipitation : 20.89"

print(paste0("Fourth highest range in precipitation : ", sort(range, partial=46)[46]))

[1] "Fourth highest range in precipitation : 19.79"

In the code above, I found that Alaska (state #49) had the largest range in precipitation, meaning that the difference between its highest and lowest recorded precipitation measurements was the greatest out of all the states during 2021, at 28.63 inches.

This was followed by Washington (27.09 in.), Oregon (20.89 in.), and Louisiana (19.79 in.).

All three of the states with the largest range in precipitation, especially Alaska and Oregon, have very high annual rates of snowfall, which may well be contributing to their precipitation measurements. Thus, to isolate rainfall as a variable of interest, I choose to focus on Louisiana, which has a warm climate and tends to receive less than an inch of snow every year (Snow Climatology).

As mentioned, Louisiana has the fourth largest range in precipitation. As it turns out, Louisiana has a significant rainy season. In fact, WorldAtlas.com explains that Louisiana tends to be the second rainiest state in the entire United States. Hawaii is the rainiest, but according to the code book, Hawaii is the only state which is not represented in the data set. Overall, these findings are encouraging because they suggest that my analysis is reasonable.

It is also interesting to obtain the mean of the June measurements across all the divisions in Louisiana. I compute it below:

climate_s <- climate

climate_s %>%
  filter(state == 16) %>%
  summarize(mean = mean(June))

      mean
1 4.829548

The mean June precipitation value for Louisiana (over all the divisions and all the years in the data set) was 4.8295 inches. The highest measurement recorded in Louisiana in 2021, around 20.16 inches, is roughly four times this mean. This suggests that there is a wide range in the amount of precipitation over the state, and that there are high-lying outliers in Louisiana.
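The summary above filters on state only, so it averages over every year in the data frame. The following is a sketch, on made-up values, of how the same summary could be restricted to a single year by adding the year to the filter; `toy`, `mean_all`, and `mean_2021` are illustrative names.

```r
library(dplyr)

# Made-up June values for two divisions in each of two years.
toy <- data.frame(
  state = c(16, 16, 16, 16),
  year  = c(2020, 2020, 2021, 2021),
  June  = c(4.0, 6.0, 8.0, 10.0)
)

# Mean over every year in the data.
mean_all <- toy %>%
  filter(state == 16) %>%
  summarize(mean = mean(June))

# Mean restricted to 2021.
mean_2021 <- toy %>%
  filter(state == 16, year == 2021) %>%
  summarize(mean = mean(June))

print(mean_all)
print(mean_2021)
```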

Based on the fact that Louisiana has the fourth highest range in precipitation over the course of 2021, I wanted to see if there are particular months during which the rain tends to be much more extreme. I plot the data for June and November in Louisiana, from 2000 to 2021, below.

Plot A. Comparison of Rainfall in Louisiana in June and November from 2000 to 2021

#Create a data frame by subsetting on Louisiana November data from 2000 to 2021.
nov_louis <- climate9 %>%
  group_by(year, state)%>%
  filter(state == 16, year %in% 2000:2021) %>%
  select('November') 
nov_louis

# A tibble: 1,408 × 3
# Groups:   year, state [22]
    year state November
   <dbl> <dbl>    <dbl>
 1  2000    16    12.0 
 2  2001    16     3.77
 3  2002    16     5.78
 4  2003    16     4.76
 5  2004    16     8.37
 6  2005    16     3.47
 7  2006    16     2.01
 8  2007    16     5.92
 9  2008    16     2.98
10  2009    16     2.63
# … with 1,398 more rows

nov_louis$month = "November"
nov_louis$precipitation = nov_louis$November
nov_louis <- subset(nov_louis, select = c('month', 'precipitation'))

#Create a data frame by subsetting on Louisiana June data from 2000 to 2021.
june_louis <- climate9 %>%
  group_by(year, state)%>%
  filter(state == 16, year %in% 2000:2021) %>%
  select('June') 

june_louis$month = "June"
june_louis$precipitation = june_louis$June
june_louis <- subset(june_louis, select = c('month', 'precipitation'))

#Bind the November and June data frames together.
june_nov_df <- rbind(june_louis, nov_louis)

#Assign colors.
colors = c(June="pink", November="lightblue")

#Plot the data as 2 histograms laid over each other, with a legend.
plotA = ggplot(june_nov_df, aes(precipitation, fill=month)) +
         theme_minimal() +
         geom_bar(stat = "count") +
         scale_fill_manual(values=colors) + ylim(0, 10) +
         labs(title = "Louisiana Precipitation in June and November (2000-2021)", x = "Precipitation Measurements")
plotA

It does appear that overall there tended to be more Louisiana precipitation (including more extremely high precipitation measurements) in June than in November.

Plot A is the product of two simple histograms layered over one another. One histogram (in light blue) represents November precipitation measurements in Louisiana from 2000 to 2021. The other histogram (in pink) represents June precipitation measurements in Louisiana from 2000 to 2021.

I was curious to know whether there would be a visible difference in the amount of precipitation recorded in June versus in November. June is historically considered the rainiest month in Louisiana, especially in certain parts of the state (Weather & Climate). Overall, though there is not an extreme difference, the June values appear to have a wider range and a greater right skew, with more values that lie above the mean and more large outliers. Previous exploration suggested that Louisiana was among the states with the greatest variation in the amount of precipitation recorded during 2021 (it had the fourth largest range). Thus, it makes sense to see a visible difference between the precipitation recorded during June and that recorded during November in Louisiana.
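For comparison, the following is a sketch of how two binned histograms could be overlaid with `geom_histogram()`. It uses simulated values rather than the NOAA data, and the bin width of 1 inch and the transparency are arbitrary choices for illustration; `sim` and `plot_sketch` are illustrative names.

```r
library(ggplot2)

set.seed(1)
# Simulated values, not the NOAA data: two right-skewed samples.
sim <- data.frame(
  month = rep(c("June", "November"), each = 200),
  precipitation = c(rgamma(200, shape = 4, scale = 1.4),
                    rgamma(200, shape = 4, scale = 1.1))
)

colors <- c(June = "pink", November = "lightblue")

# position = "identity" overlays the two histograms instead of stacking them;
# alpha makes the overlapping region visible.
plot_sketch <- ggplot(sim, aes(precipitation, fill = month)) +
  theme_minimal() +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.6) +
  scale_fill_manual(values = colors) +
  labs(x = "Precipitation (inches)", y = "Count")

plot_sketch
```

Binning groups nearby measurements together, so the y-axis counts how many measurements fall in each one-inch interval rather than how many share an exact value.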

The other research question addresses whether Louisiana's precipitation patterns have changed in the years since 2000.

Plot B. Rainfall in Louisiana in June from 2000 to 2021

#Mean June precipitation in Louisiana for each year, selected by year
#rather than by position.
june_louis <- climate9 %>%
  filter(state == 16, year %in% 2000:2021) %>%
  group_by(year) %>%
  summarize(mean = mean(June))

june_louis_v <- june_louis$mean

year <- as.factor(june_louis$year)

june_louis_df <- data.frame(year, june_louis_v)

plotB <- ggplot(june_louis_df, aes(x= year, y = june_louis_v)) + geom_bar(stat = "identity") + labs(x= "Year", y = "Mean Precipitation value in June", title = "June Precipitation Measurements in Louisiana from 2000 to 2021") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = -1)) 
plotB

In this plot, I incorporate two variables: precipitation in Louisiana, and time. More specifically, in the plot above, the x-axis represents the year, while the y-axis represents June precipitation measurements collected in that year. The question I was curious to answer was whether or not there were significant changes in the June precipitation trends from 2000 to 2021. Based on a general analysis of the plot, it is possible that Louisiana has had less rain in recent years. However, more analysis would have to be conducted to verify if this was a significant difference.

I will also depict June data in Louisiana from 1895 to 1910, and then 2000 to 2021.

#Mean June precipitation in Louisiana for 1895-1910 and 2000-2021, selected
#by year rather than by position.
june_louis <- climate9 %>%
  filter(state == 16, year %in% c(1895:1910, 2000:2021)) %>%
  group_by(year) %>%
  summarize(mean = mean(June))

june_louis_v <- june_louis$mean

year_f <- as.factor(june_louis$year)

june_louis_df <- data.frame(year_f, june_louis_v)

plotC <- ggplot(june_louis_df, aes(x= year_f, y = june_louis_v)) + geom_bar(stat = "identity") + labs(x= "Year", y = "Mean Precipitation value in June", title = "June Precipitation Measurements in Louisiana from 1895 to 1910 and 2000 to 2021") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = -1)) 
plotC

Overall, the precipitation patterns in Louisiana seem to be abnormally high from around 2002 to 2005. Aside from that, they seem to fluctuate within a similar range. Below, I display the same information slightly differently, in the form of a facet wrap plot.

june_louis <- climate9 %>%
  filter(state == 16) %>%
  group_by(year) %>%
  summarize(mean = mean(June))

#Select each era by year rather than by position.
june_louis_e1 <- filter(june_louis, year %in% 1895:1910)
june_louis_e2 <- filter(june_louis, year %in% 2000:2021)

year_f1 <- as.factor(june_louis_e1$year)
year_f2 <- as.factor(june_louis_e2$year)

june_louis_v1 <- june_louis_e1$mean
june_louis_v2 <- june_louis_e2$mean

june_louis_df1 <- data.frame(year = year_f1, precipitation = june_louis_v1, category = "Era 1")
june_louis_df2 <- data.frame(year = year_f2, precipitation = june_louis_v2, category = "Era 2")

june_louis_df <- rbind(june_louis_df1, june_louis_df2)

ggplot(data = june_louis_df, aes(year, precipitation)) + geom_bar(stat = "identity") + labs(x= "Year", y = "Mean Precipitation Value in June", title = "June Precipitation Measurements in Louisiana Across Eras") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = -1)) + facet_wrap(~ category, scales = "free")

t.test(june_louis_v1, june_louis_v2, alternative = "two.sided", var.equal = FALSE)


    Welch Two Sample t-test

data:  june_louis_v1 and june_louis_v2
t = -0.8082, df = 35.57, p-value = 0.4243
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.335535  1.004912
sample estimates:
mean of x mean of y 
 4.968701  5.634013

Based on this two-sided t-test, there is not a great enough difference in means to conclude that June precipitation in Louisiana changed significantly between the late 1800s and the early 2000s.
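To make the test statistic more concrete, the following sketch computes the Welch t statistic and its degrees of freedom by hand and compares them with `t.test()`. The two samples are simulated, not the Louisiana means, and the names `era1` and `era2` are illustrative.

```r
set.seed(42)
# Simulated yearly means, not the NOAA data.
era1 <- rnorm(16, mean = 5.0, sd = 2)
era2 <- rnorm(22, mean = 5.6, sd = 2)

# Estimated variance of each sample mean.
v1 <- var(era1) / length(era1)
v2 <- var(era2) / length(era2)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_stat <- (mean(era1) - mean(era2)) / sqrt(v1 + v2)
df <- (v1 + v2)^2 / (v1^2 / (length(era1) - 1) + v2^2 / (length(era2) - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

builtin <- t.test(era1, era2, alternative = "two.sided", var.equal = FALSE)

print(c(t = t_stat, df = df, p = p_value))
print(builtin)
```

Because the two variances are not pooled, the degrees of freedom are generally not a whole number, which is why the output above reports df = 35.57.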

Questions:

– What is missing (if anything) in your analysis process so far?

Before this homework, I felt that I had not fully addressed my second research question (Has weather changed significantly over time?). I feel that my facet wrap plot, in conjunction with the two-sided t-test, helps resolve this question.

– What conclusions can you make about your research questions at this point?

I have established the four states with the highest precipitation measurements in 2021, the ten states with the lowest precipitation measurements (0 inches), and the four with the greatest range between their highest and lowest precipitation measurements. I also analyzed trends in precipitation in June, the rainiest month in Louisiana, finding that there is not statistically significant evidence to conclude that the precipitation patterns changed between 1895-1910 (Era 1) and 2000-2021 (Era 2).

– What do you think a naive reader would need to fully understand your graphs?

I think my plots would be fairly comprehensible to the naive reader. It’s possible that they would need more explanation about the comparison histogram plot (Plot A) because the count y variable might not be immediately obvious to those without a statistics background. However, I think the labeling in my plots is explicit and does not require previous contextual knowledge.

– Is there anything you want to answer with your dataset, but can’t?

Precipitation generally includes both rain and snow. I would be interested to know if my findings would change if I removed snow from the equation.

Reflection (to be completed in Homework 6)

Conclusion

References

“Climate and Average Weather Year Round in Alabama.” Weatherspark.com, https://weatherspark.com/y/20416/Average-Weather-in-Alabama-New-York-United-States-Year-Round#:~:text=The%20chance%20of%20wet%20days,least%200.04%20inches%20of%20precipitation.

Nag, Oishimaya Sen. “The 10 Wettest States in the United States of America.” WorldAtlas, WorldAtlas, 24 Apr. 2019, https://www.worldatlas.com/articles/the-10-wettest-states-in-the-united-states-of-america.html.

Data Set: https://www.ncei.noaa.gov/data/climdiv/

Data Set Documentation/Code Book: https://www.ncei.noaa.gov/data/climdiv/access/county-readme.txt

Snow Climatology: https://www.weather.gov/lix/snowcli
