Data Analytics and Computational Social Science: Homework 4

Erica Laidler

PROBLEM 1) Read in your dataset, and compute descriptive statistics for each of your variables using dplyr.

-In addition to overall means, medians, and SDs, use group_by() and summarise() to compute mean/median/SD for any relevant groupings. For example, if you were interested in how state relates to income, you would compute mean income for all states combined, and then you would compute mean income for each individual state in the US.

First I imported the data frame I tidied in Homework 3. During this process, I noticed that there was a mistake in the indexing of some of the recoded columns, which was causing there to be many missing values where there shouldn’t have been any. I fixed this before importing the data.

climate8 <- read.csv("climate8.csv")

#Tidying - By default, write.csv() saves the row names as an extra first column, which read.csv() then names 'X'. I delete that column below.
climate9 <- subset(climate8, select = -c(X))

#First 6 lines of imported dataset:
head(climate9)

  January February March April  May June July August September
1    7.03     2.96  8.36  3.53 3.96 5.40 3.92   3.36      0.73
2    5.86     5.42  5.54  3.98 3.77 6.24 4.38   2.57      0.82
3    3.27     6.63 10.94  4.35 0.81 1.57 3.96   5.02      0.87
4    2.33     2.07  2.60  4.56 0.54 3.13 5.80   6.02      1.51
5    5.80     6.94  3.35  2.22 2.93 2.31 6.80   2.90      0.63
6    3.18     9.07  5.77  7.14 1.63 7.36 3.35   3.85      4.74
  October November December state division metric year   metric_name
1    2.03     1.44     3.66     1        1      1 1895 Precipitation
2    1.66     2.89     1.94     1        1      1 1896 Precipitation
3    0.75     1.84     4.38     1        1      1 1897 Precipitation
4    3.21     6.66     3.91     1        1      1 1898 Precipitation
5    3.02     1.98     5.25     1        1      1 1899 Precipitation
6    5.92     4.09     4.89     1        1      1 1900 Precipitation
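The extra 'X' column appears because write.csv() writes row names by default. A minimal sketch, using a small made-up data frame and temporary files (both are assumptions for illustration, not the real climate data), suggests that passing row.names = FALSE avoids the column in the first place:

```r
# Toy data frame standing in for the tidied climate data (hypothetical values).
toy <- data.frame(January = c(7.03, 5.86), state = c(1, 1))

# Default behaviour: row names are written as an unnamed first column,
# which read.csv() then labels "X".
path_default <- tempfile(fileext = ".csv")
write.csv(toy, path_default)
with_x <- read.csv(path_default)
names(with_x)

# Suppressing row names on write means there is nothing to delete later.
path_clean <- tempfile(fileext = ".csv")
write.csv(toy, path_clean, row.names = FALSE)
without_x <- read.csv(path_clean)
names(without_x)
```

With this option set when saving, the subset() step that drops 'X' would no longer be needed.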

I also had to perform a second edit, which was to remove the rows for 2022. This is because upon exploration of the dataset, I noticed that these rows contained the value -9.99 for a lot of their entries. 2022 was the only year for which this happened. There is no explicit mention of this on the website where the data set originated, but my guess is that the data was collected mid-way through 2022, meaning that they left a lot of entries with negative default values. Below I remove all rows pertaining to 2022:

climate9 <-subset(climate9, year != "2022")

Afterwards, I checked whether any other entries in the 12 month columns are negative, but there are none, as shown below. The years represented in the data set now range from 1895 to 2021.

sum(climate9[, 1:12] < 0)

[1] 0
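When checking for sentinel values, it helps to remember that in R `df[1:12, ]` selects the first twelve rows, while `df[, 1:12]` selects the first twelve columns. The sketch below uses a made-up three-row data frame (an assumption for illustration) to count negative entries across whole columns and to list the years affected:

```r
# Toy data: two month columns and a year column, with one -9.99 sentinel
# placed in row 3 (hypothetical values).
toy <- data.frame(
  January  = c(2.10, 3.40, -9.99),
  February = c(1.20, 0.80, 2.50),
  year     = c(2020, 2021, 2022)
)

month_cols <- c("January", "February")

# Count negative entries over every row of the month columns.
n_negative <- sum(toy[, month_cols] < 0)
n_negative

# Restricting the check to the first two rows would miss the sentinel in row 3.
n_negative_first_rows <- sum(toy[1:2, month_cols] < 0)
n_negative_first_rows

# List the years that contain any negative month value.
bad_years <- unique(toy$year[rowSums(toy[, month_cols] < 0) > 0])
bad_years
```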

Next I gathered descriptive statistics. Since all of the relevant variables (January - December) are numeric, I found the relevant mean, median, and standard deviation values.

The structure of this data frame means that summary statistics computed over an entire column are not very informative on their own. As a reminder about the structure of my data set: the first 12 columns of each row contain the precipitation values from January to December for a specific state, division, and year. The next 5 columns indicate which state, division, year, and metric are relevant to that row. For instance, row 1 has state = 1, division = 1, metric_name = “Precipitation”, and year = 1895. Thus, based on the coding in https://www.ncei.noaa.gov/data/climdiv/access/county-readme.txt, row 1 gives the precipitation measurement for each month in division 1 of Alabama (state 1) during the year 1895.

During this exploration stage, I found that this particular data set did not have a mix of precipitation and temperature measurements; it only had precipitation measurements. For this reason, I deleted the columns ‘metric’ and ‘metric_name’, whose purpose had been to indicate whether the measurements in a particular row were precipitation measurements or temperature measurements. These two columns were now redundant, since every row described precipitation. I remove the columns and print the resulting data set below:

climate9 = subset(climate9, select = -c(metric, metric_name))
head(climate9)

  January February March April  May June July August September
1    7.03     2.96  8.36  3.53 3.96 5.40 3.92   3.36      0.73
2    5.86     5.42  5.54  3.98 3.77 6.24 4.38   2.57      0.82
3    3.27     6.63 10.94  4.35 0.81 1.57 3.96   5.02      0.87
4    2.33     2.07  2.60  4.56 0.54 3.13 5.80   6.02      1.51
5    5.80     6.94  3.35  2.22 2.93 2.31 6.80   2.90      0.63
6    3.18     9.07  5.77  7.14 1.63 7.36 3.35   3.85      4.74
  October November December state division year
1    2.03     1.44     3.66     1        1 1895
2    1.66     2.89     1.94     1        1 1896
3    0.75     1.84     4.38     1        1 1897
4    3.21     6.66     3.91     1        1 1898
5    3.02     1.98     5.25     1        1 1899
6    5.92     4.09     4.89     1        1 1900

As shown in the updated version above, the data frame now has 15 columns: 12 to represent the 12 months of the year, as well as a ‘state’ column to represent the state index number, a ‘division’ column to represent the division index number, and a ‘year’ column.

Now I will perform the summary statistics for the precipitation amounts grouped by month, over all locations and years.

library(tidyr)

jan_mean = mean(climate9$January)
feb_mean= mean(climate9$February)
march_mean= mean(climate9$March)
april_mean = mean(climate9$April)
may_mean = mean(climate9$May)
june_mean = mean(climate9$June)
july_mean = mean(climate9$July)
august_mean = mean(climate9$August)
september_mean = mean(climate9$September)
october_mean = mean(climate9$October)
november_mean = mean(climate9$November)
december_mean = mean(climate9$December)

print(paste0("January mean: ", jan_mean))

[1] "January mean: 2.76676613022925"

print(paste0("February mean: ", feb_mean))

[1] "February mean: 2.61202146420493"

print(paste0("March mean: ", march_mean))

[1] "March mean: 3.22113839280105"

print(paste0("April mean: ", april_mean))

[1] "April mean: 3.26160367699035"

print(paste0("May mean: ", may_mean))

[1] "May mean: 3.71442041363625"

print(paste0("June mean: ", june_mean))

[1] "June mean: 3.796537223993"

print(paste0("July mean: ", july_mean))

[1] "July mean: 3.71906965803687"

print(paste0("August mean: ", august_mean))

[1] "August mean: 3.4630312025942"

print(paste0("September mean: ", september_mean))

[1] "September mean: 3.21877104093343"

print(paste0("October mean: ", october_mean))

[1] "October mean: 2.78647797856094"

print(paste0("November mean: ", november_mean))

[1] "November mean: 2.70991109411597"

print(paste0("December mean: ", december_mean))

[1] "December mean: 2.88990905637502"

Each of the 12 outputs above depicts the average amount of precipitation for a particular month over all years and locations. For example, the first output (January mean) results from taking the average of all the January precipitation measurements from every state and every year from 1895 to 2021. The overall mean precipitation values from January through December are approximately as follows: 2.77, 2.61, 3.22, 3.26, 3.71, 3.80, 3.72, 3.46, 3.22, 2.79, 2.71, and 2.89.

I will repeat that process for median and standard deviation.

library(tidyr)

jan_median = median(climate9$January)
feb_median= median(climate9$February)
march_median= median(climate9$March)
april_median = median(climate9$April)
may_median = median(climate9$May)
june_median = median(climate9$June)
july_median = median(climate9$July)
august_median = median(climate9$August)
september_median = median(climate9$September)
october_median = median(climate9$October)
november_median = median(climate9$November)
december_median = median(climate9$December)

print(paste0("January median: ", jan_median))

[1] "January median: 2.11"

print(paste0("February median: ", feb_median))

[1] "February median: 2"

print(paste0("March median: ", march_median))

[1] "March median: 2.7"

print(paste0("April median: ", april_median))

[1] "April median: 2.9"

print(paste0("May median: ", may_median))

[1] "May median: 3.39"

print(paste0("June median: ", june_median))

[1] "June median: 3.53"

print(paste0("July median: ", july_median))

[1] "July median: 3.42"

print(paste0("August median: ", august_median))

[1] "August median: 3.09"

print(paste0("September median: ", september_median))

[1] "September median: 2.71"

print(paste0("October median: ", october_median))

[1] "October median: 2.27"

print(paste0("November median: ", november_median))

[1] "November median: 2.17"

print(paste0("December median: ", december_median))

[1] "December median: 2.3"

The median precipitation values for January to December, over all years and states, are: 2.11, 2, 2.7, 2.9, 3.39, 3.53, 3.42, 3.09, 2.71, 2.27, 2.17, and 2.3.

library(tidyr)

jan_sd = sd(climate9$January)
feb_sd= sd(climate9$February)
march_sd= sd(climate9$March)
april_sd = sd(climate9$April)
may_sd = sd(climate9$May)
june_sd = sd(climate9$June)
july_sd = sd(climate9$July)
august_sd = sd(climate9$August)
september_sd = sd(climate9$September)
october_sd = sd(climate9$October)
november_sd = sd(climate9$November)
december_sd = sd(climate9$December)

print(paste0("January standard deviation: ", jan_sd))

[1] "January standard deviation: 2.64838002974412"

print(paste0("February standard deviation: ", feb_sd))

[1] "February standard deviation: 2.35186875375411"

print(paste0("March standard deviation: ", march_sd))

[1] "March standard deviation: 2.48454091248144"

print(paste0("April standard deviation: ", april_sd))

[1] "April standard deviation: 2.14537191847582"

print(paste0("May standard deviation: ", may_sd))

[1] "May standard deviation: 2.24603361135791"

print(paste0("June standard deviation: ", june_sd))

[1] "June standard deviation: 2.28669848935868"

print(paste0("July standard deviation: ", july_sd))

[1] "July standard deviation: 2.38458296538313"

print(paste0("August standard deviation: ", august_sd))

[1] "August standard deviation: 2.34171092390367"

print(paste0("September standard deviation: ", september_sd))

[1] "September standard deviation: 2.48487877425299"

print(paste0("October standard deviation: ", october_sd))

[1] "October standard deviation: 2.38512358267892"

print(paste0("November standard deviation: ", november_sd))

[1] "November standard deviation: 2.47045362571965"

print(paste0("December standard deviation: ", december_sd))

[1] "December standard deviation: 2.66260835474644"

The standard deviations of the precipitation values for January to December, over all years and states, are approximately: 2.65, 2.35, 2.48, 2.15, 2.25, 2.29, 2.38, 2.34, 2.48, 2.39, 2.47, and 2.66.
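Since the assignment asks for dplyr, the twelve means, medians, and standard deviations could also be computed in a single call. This is a sketch on a made-up data frame with only two month columns; `toy` and its values are assumptions, and the same pattern should extend to `January:December` in `climate9`:

```r
library(dplyr)

# Toy data with two month columns and a grouping column (hypothetical values).
toy <- data.frame(
  January  = c(1, 2, 3, 6),
  February = c(2, 2, 4, 8),
  state    = c(1, 1, 2, 2)
)

# One row overall: mean, median, and SD for each month column.
overall <- toy %>%
  summarise(across(January:February,
                   list(mean = mean, median = median, sd = sd)))
overall

# The same statistics computed separately for each state.
by_state <- toy %>%
  group_by(state) %>%
  summarise(across(January:February,
                   list(mean = mean, median = median, sd = sd)))
by_state
```

The result columns are named by combining the month and the statistic, for example `January_mean` and `February_sd`.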

However, as mentioned, summarizing by month may not be the optimal way of obtaining relevant information. Instead, I will try to find descriptive statistics that may be helpful for answering some of my potential research questions.

For instance, I am curious to know which state had the widest range of precipitation throughout 2021, meaning that the difference between the measurement for its highest-precipitation month and its lowest-precipitation month was the greatest out of all the states in 2021.

This is not a very simple goal. First I will start with finding the maximum measurement values for each state in 2021. This means that I will loop over all the division measurements in a particular state in 2021, and find the highest measurement. At the end I will have a vector with 49 values. Each value will represent the highest precipitation measurement for a particular state. There are 49 states represented in the data (all but Hawaii).

#FINDING MAXIMUM VALUES:

#Initialize the list 'highest'.
highest <- list()

#In the loop below, we fill in the list 'highest'. The list contains 12 vectors, one per month, and each vector holds 49 values, one per state. Each of the 49 values is meant to represent the highest precipitation measurement recorded in that state during that month, out of all the divisions in the state.

#For instance, the first list in 'highest' is 'January'. The first value in January is 3.19, which means that the maximum precipitation value recorded in Alabama (state #1) during January 2021 was 3.19.

library(tidyverse)
climate10 <- c()
for(i in c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')){
climate10 <- climate9 %>%
  group_by(state) %>%
  filter(year == 2021) %>%
  arrange(desc(i)) %>%
  slice(1) %>%
    ungroup %>%
  select(state, i) 
highest[i] = climate10[,2]
}

#Now that we have found the highest precipitation measurement for each month for every state, I want to find the maximum out of all these values for each state.

#The vector max_val will contain 49 values, each of which contains the highest precipitation measurement recorded in each of the 49 states in 2021.

max_val <- numeric(49)
for (i in 1:49){
max_val[i] = max(highest$'January'[i], highest$'February'[i], highest$'March'[i], highest$'April'[i], highest$'May'[i], highest$'June'[i], highest$'July'[i], highest$'August'[i], highest$'September'[i], highest$'October'[i], highest$'November'[i], highest$'December'[i])
}
max_val

 [1]  9.03  3.36 10.53  6.54  4.09  9.38  5.61 13.26  9.65  1.77  7.98
[12]  7.30  5.85  9.48  8.95 14.61  8.61  7.25 10.82  5.51  3.53 13.28
[23]  9.07  2.38  8.28  1.94 12.15  7.09  2.39  8.25  6.27  4.47  7.47
[34]  7.81  3.63 10.55  7.19  8.19  3.40  9.07  9.98  2.60  7.35  6.13
[45]  1.97  5.36  7.60  1.64  6.16
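One caveat about the loop above: as far as I understand dplyr's evaluation rules, `i` holds a column name as a character string, so `arrange(desc(i))` sorts on a constant rather than on that column's values, and `slice(1)` then returns the first row of each state rather than the row with the largest value. If so, the figures printed above would describe each state's first listed division and would be worth re-checking. The sketch below shows one way to look a column up by name with the `.data` pronoun, on a made-up data frame (`toy` and `col` are assumptions for illustration):

```r
library(dplyr)

# Toy data: two states with two divisions each (hypothetical values).
toy <- data.frame(
  state    = c(1, 1, 2, 2),
  division = c(1, 2, 1, 2),
  January  = c(3.19, 5.00, 1.10, 0.40)
)

col <- "January"

# Sorting on the string itself leaves the rows in their original order,
# so slice(1) picks the first division of each state.
first_rows <- toy %>%
  group_by(state) %>%
  arrange(desc(col)) %>%
  slice(1) %>%
  ungroup()

# .data[[col]] looks the column up by name, so slice_max() returns the
# row with the largest January value in each state.
max_rows <- toy %>%
  group_by(state) %>%
  slice_max(.data[[col]], n = 1, with_ties = FALSE) %>%
  ungroup()

first_rows$January
max_rows$January
```

The same idea applies to the minimum by using `slice_min()` in place of `slice_max()`.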

Now the process must be repeated for the minimum value.

#FINDING MINIMUM VALUES:

#Initialize the list 'lowest'.
lowest <- list()

#Below I fill in the list 'lowest', which has the same structure as the list 'highest' in the chunk of code above, except that it represents the minimum values rather than the maximum values.

climate10 <- c()
for(i in c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')){
climate10 <- climate9 %>%
  group_by(state) %>%
  filter(year == 2021) %>%
  arrange(i) %>%
  slice(1) %>%
    ungroup %>%
  select(state, i) 
lowest[i] = climate10[,2]
}

#Now I create the vector min_val, which is similar in structure to max_val in the code chunk above, except now it contains the minimum values.
min_val <- numeric(49)
for (i in 1:49){
min_val[i] = min(lowest$'January'[i], lowest$'February'[i], lowest$'March'[i], lowest$'April'[i], lowest$'May'[i], lowest$'June'[i], lowest$'July'[i], lowest$'August'[i], lowest$'September'[i], lowest$'October'[i], lowest$'November'[i], lowest$'December'[i])
}
min_val

 [1] 1.24 0.08 1.24 0.00 0.07 1.57 0.75 0.82 1.45 0.21 0.64 1.13 0.50
[14] 0.51 2.69 0.88 0.89 0.90 1.70 0.71 0.29 1.28 0.57 0.26 0.16 0.02
[27] 1.71 0.81 0.13 1.63 0.48 0.03 1.63 0.94 0.26 0.88 1.87 0.79 0.22
[40] 1.22 0.36 0.05 0.96 1.11 0.01 1.43 0.81 0.37 2.64

It is interesting to note a few features. For one, according to the data set, Louisiana (state #16) has the highest precipitation measurement out of all the states in the US in 2021, at 14.61 inches.

max(max_val)

[1] 14.61

which.max(max_val)

[1] 16

We also see that California (state #4) contains the lowest recorded precipitation value out of all the states, at 0 inches, as shown below. It is possible for there to be no rain at all during some months. California is the only state whose minimum value for 2021 was 0; every other state's minimum was above 0.

min(min_val)

[1] 0

which.min(min_val)

[1] 4

sum(min_val == 0)

[1] 1

Now I will address the research question I mentioned earlier. I wanted to find the state that had the highest range in precipitation, meaning that their rainy season was the most drastically different from their dry season.

max(max_val-min_val)

[1] 13.73

which.max(max_val-min_val)

[1] 16

In the code above, I found that state #16 had the highest range in precipitation: the difference between its highest and lowest recorded precipitation measurements in 2021 was 13.73 inches, the greatest of all the states. According to the code book in https://www.ncei.noaa.gov/data/climdiv/access/county-readme.txt, state #16 is Louisiana. This is consistent with outside sources. On WeatherSpark.com, an article states that Louisiana has a significant rainy season, and WorldAtlas.com explains that Louisiana tends to be the second rainiest state in the entire United States. Hawaii is the first, but according to the code book Hawaii is the only state which is not represented in the data set. Overall, these findings are encouraging because they suggest that my analysis is reasonable.
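The maximum, minimum, and range per state can also be obtained without loops by reshaping the month columns into long form. The following is a sketch on made-up values; `toy`, `precip`, `spread`, and the two-month column subset are assumptions for illustration:

```r
library(dplyr)
library(tidyr)

# Toy data: two states, two divisions each, two months (hypothetical values).
toy <- data.frame(
  state    = c(1, 1, 2, 2),
  year     = c(2021, 2021, 2021, 2021),
  January  = c(3.0, 5.0, 1.0, 0.5),
  February = c(2.0, 9.0, 1.5, 0.0)
)

state_range <- toy %>%
  filter(year == 2021) %>%
  # Reshape so that each row is one state-division-month measurement.
  pivot_longer(January:February, names_to = "month", values_to = "precip") %>%
  group_by(state) %>%
  summarise(max_val = max(precip),
            min_val = min(precip),
            spread  = max_val - min_val) %>%
  # Put the state with the widest range first.
  arrange(desc(spread))

state_range
```

Because the state number stays in the table as a column, the top row identifies the state directly instead of relying on a position in a vector.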

Depending on the research goal, it could also be helpful to know the means of certain subsets of the data. For instance, suppose you want the mean of the June measurements over all the divisions in a particular state. I compute this for Louisiana below:

climate9 %>%
  filter(state == 16) %>%
  select('June') %>%
  summarize(mean = mean(June))

      mean
1 4.829548

The code above filters on state but not on year, so 4.8295 inches is the mean June precipitation for Louisiana over all divisions and all years from 1895 to 2021. The highest Louisiana measurement found earlier for 2021, 14.61 inches, is roughly three times this mean, which suggests that precipitation varies widely across the state and that there are high outliers in Louisiana during its rainier months.
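To restrict a mean like this to a single year, the year condition has to be part of the filter. A minimal sketch on made-up values (`toy` is an assumption for illustration) contrasts the all-years mean with the 2021-only mean:

```r
library(dplyr)

# Toy data: state 16 in two years, plus one row for another state
# (hypothetical values).
toy <- data.frame(
  state = c(16, 16, 16, 16, 20),
  year  = c(2020, 2020, 2021, 2021, 2021),
  June  = c(4.0, 6.0, 3.0, 5.0, 2.0)
)

# Mean over all years for state 16.
all_years <- toy %>%
  filter(state == 16) %>%
  summarise(mean = mean(June))

# Mean for state 16 restricted to 2021.
only_2021 <- toy %>%
  filter(state == 16, year == 2021) %>%
  summarise(mean = mean(June))

all_years
only_2021
```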

PROBLEM 2)

Create at least two visualizations using your final project dataset.

-The visualizations should use the ggplot2 package.

-At least one visualization should be univariate, and at least one should be bivariate.

Plot 2.1A) Rainfall in Michigan in August (univariate)

aug_mich <- climate9 %>%
  group_by(year, state)%>%
  filter(state == '20') %>%
  select('August') 

aug_mich <- aug_mich$August

aug_mich_df <- data.frame(seq_along(aug_mich), aug_mich)
plot2.1A <- ggplot(aug_mich_df, aes(aug_mich)) + geom_bar(stat = "count") + labs(x = "Precipitation Measurement from Divisions in Michigan", title = "August Precipitation in Michigan From 1895 to 2021") + theme_minimal()
plot2.1A

The plot above is a histogram showing the distribution of all the August precipitation measurements taken in Michigan for each division over the course of the years from 1895 to 2021. As seen, the distribution is roughly normal, with a mean hovering around 3 inches.
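Because precipitation is a continuous measurement, `geom_bar(stat = "count")` draws one bar for each distinct value. A binned histogram is an alternative. The sketch below uses simulated values and an arbitrarily chosen bin width of 0.5 inches, both of which are assumptions rather than properties of the real data:

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for August precipitation measurements (not real data).
sim <- data.frame(precip = round(rgamma(500, shape = 4, rate = 1.3), 2))

hist_plot <- ggplot(sim, aes(x = precip)) +
  # Group the measurements into 0.5-inch bins starting at 0.
  geom_histogram(binwidth = 0.5, boundary = 0) +
  labs(x = "Precipitation (inches)", y = "Count",
       title = "Simulated precipitation, 0.5-inch bins") +
  theme_minimal()

hist_plot
```

The bin width is a judgment call: wider bins smooth the shape, while narrower bins show more detail and more noise.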

Plot 2.1B Rainfall in Louisiana in June from 2000 to 2021 (univariate)

june_louis <- climate9 %>%
  #Keep the Louisiana (state 16) rows for 2000 through 2021
  filter(state == 16, year >= 2000, year <= 2021) %>%
  select(June)

june_louis <- june_louis$June

june_louis_df <- data.frame(seq_along(june_louis), june_louis)
plot2.1B <- ggplot(june_louis_df, aes(june_louis)) + geom_bar(stat = "count") + labs(x = "Precipitation Measurement from Divisions in Louisiana", title = "June Precipitation in Louisiana From 2000 to 2021") + theme_minimal()
plot2.1B

The plot above is similar to Plot 2.1A, except that it is a histogram for the precipitation measurements from all the divisions in Louisiana from 2000 to 2021. The data is less normally distributed than in the first plot, as it is skewed right with a long tail, which suggests that there may be outliers in Louisiana (certain divisions and years with extremely high precipitation measurements). This makes sense as the previous data exploration has suggested that Louisiana precipitation may vary widely in different divisions, and may have high outliers.

Plot 2.1C) Rainfall in Louisiana in November from 2000 to 2021 (univariate)

nov_louis <- climate9 %>%
  #Keep the Louisiana (state 16) rows for 2000 through 2021
  filter(state == 16, year >= 2000, year <= 2021) %>%
  select(November)

nov_louis <- nov_louis$November

nov_louis_df <- data.frame(seq_along(nov_louis), nov_louis)
plot2.1C <- ggplot(nov_louis_df, aes(nov_louis)) + geom_bar(stat = "count") + labs(x = "Precipitation Measurement from Divisions in Louisiana", title = "November Precipitation in Louisiana From 2000 to 2021")
plot2.1C

Plot 2.1D. Comparison of Rainfall in Louisiana in June and November from 2000 to 2021 (univariate plots) (no legend)

plot2.1D <- ggplot() + geom_bar(data = june_louis_df, aes(june_louis), stat = "count", fill = "pink")+ geom_bar(data = nov_louis_df, aes(nov_louis), stat = "count", fill = "lightblue") + labs(x = "Precipitation Measurements in Louisiana From 2000 to 2021", title = "Louisiana Precipitation in June and November") + theme_minimal()
plot2.1D

The method above did not produce a legend, so I tried to recreate it a different way:

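Plot 2.1E. Comparison of Rainfall in Louisiana in June and November from 2000 to 2021 (univariate, with legend)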

nov_louis <- climate9 %>%
  filter(state == 16, year >= 2000, year <= 2021) %>%
  #Label each measurement with its month so the two sets can be combined into one data frame
  transmute(month = "November", precipitation = November)


june_louis <- climate9 %>%
  filter(state == 16, year >= 2000, year <= 2021) %>%
  transmute(month = "June", precipitation = June)


june_nov_df <- rbind(june_louis, nov_louis)

colors = c(June="pink", November="lightblue")

plot2.1E = ggplot(june_nov_df, aes(precipitation, fill=month)) +
         theme_minimal() +
         geom_bar(stat = "count") +
         scale_fill_manual(values=colors) + ylim(0, 10) +
         labs(title = "Louisiana Precipitation in June and November (2000-2021)", x = "Precipitation Measurements")
plot2.1E

It does appear that overall there tended to be more Louisiana precipitation (including more extremely high precipitation measurements) in June than in November.
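Mapping `fill` to a month column is what produces the legend. By default, `geom_bar()` and `geom_histogram()` stack the groups on top of one another; to overlay them instead, `position = "identity"` with some transparency can be used. This is a sketch on simulated values (the data, bin width, and transparency level are all assumptions):

```r
library(ggplot2)
library(tidyr)

set.seed(2)
# Simulated June and November values in wide form (not real data).
wide <- data.frame(
  June     = round(rgamma(200, shape = 4, rate = 0.8), 2),
  November = round(rgamma(200, shape = 4, rate = 1.0), 2)
)

# Long form: one row per measurement, with the month as a column.
long <- pivot_longer(wide, cols = c(June, November),
                     names_to = "month", values_to = "precipitation")

overlay <- ggplot(long, aes(x = precipitation, fill = month)) +
  # position = "identity" overlays the two distributions instead of stacking them.
  geom_histogram(binwidth = 1, boundary = 0, position = "identity", alpha = 0.5) +
  scale_fill_manual(values = c(June = "pink", November = "lightblue")) +
  labs(x = "Precipitation (inches)", y = "Count") +
  theme_minimal()

overlay
```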

Plot 2.2 Mean Louisiana Precipitation Values Over All Divisions in June from 1895 to 2021 (bivariate)

(note: This plot was too visually cluttered to provide much insight, so I repeated it with a smaller subset of years afterwards).

june_louis <- climate9 %>%
  group_by(year, state)%>%
  filter(state == 16) %>%
  select('year', 'June') %>%
  summarize(mean = mean(June)) %>%
  select(mean)

june_louis_v <- june_louis$mean

year <- as.factor(1895:2021)

june_louis_df <- data.frame(year, june_louis_v)

plot2.2A <- ggplot(june_louis_df, aes(x= year, y = june_louis_v)) + geom_bar(stat = "identity") + labs(x= "Year", y = "Mean Precipitation value in June", title = "June Precipitation Measurements in Louisiana from 1895 to 2021") + theme_minimal()
plot2.2A

As stated, the plot above was too visually cluttered, so I tried again on a smaller subset of the data:

Plot 2.2B. Mean Louisiana Precipitation Values Over All Divisions in June from 2000 to 2021 (bivariate)

#Filter to 2000-2021 before averaging. Taking the first 22 entries of the full 1895-2021 vector would select 1895-1916 instead.
june_louis_df <- climate9 %>%
  filter(state == 16, year >= 2000, year <= 2021) %>%
  group_by(year) %>%
  summarize(mean = mean(June)) %>%
  mutate(year = as.factor(year))

plot2.2B <- ggplot(june_louis_df, aes(x= year, y = mean)) + geom_bar(stat = "identity") + labs(x= "Year", y = "Mean Precipitation value in June", title = "June Precipitation Measurements in Louisiana from 2000 to 2021") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = -1)) 
plot2.2B
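Keeping `year` as a column in the summarised table means the axis labels come from the data itself rather than from a separately built vector. A line plot is another way to show a yearly trend. This is a sketch on made-up values (`toy` and `mean_june` are assumptions for illustration):

```r
library(dplyr)
library(ggplot2)

# Toy data: one state, three years, two divisions per year (hypothetical values).
toy <- data.frame(
  state = 16,
  year  = rep(2000:2002, each = 2),
  June  = c(4, 6, 3, 5, 7, 9)
)

# One mean per year, with the year kept alongside its mean.
yearly <- toy %>%
  filter(state == 16, year >= 2000) %>%
  group_by(year) %>%
  summarise(mean_june = mean(June))

trend <- ggplot(yearly, aes(x = year, y = mean_june)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Mean June precipitation (inches)") +
  theme_minimal()

trend
```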

PROBLEM 3) Explain each visualization. Each explanation should include:

What variable(s) you are visualizing?
What question(s) you are attempting to answer with the visualization?
What conclusions you can make from the visualization?

In this problem, I will explain the most complete and useful of the plots above: 2.1A, 2.1E, and 2.2B.

Explanation of Plot 2.1A:

plot2.1A

In univariate Plot 2.1A, I visualize the variable ‘precipitation’. For this plot I used only a subset of data, specifically the precipitation measurements for Michigan. This plot displays the general distribution of the Michigan precipitation data from 1895 to 2021, including the measurements from all of the divisions in Michigan during this time. The x-axis represents the precipitation measurements, and the y-axis represents ‘count’, or the number of times each measurement value was recorded during the time period.

I was interested to know whether there would be any discernible trends or shapes in the distribution, and whether the data would appear to be close to normally distributed. In plot 2.1A, the data does appear to be close to normally distributed, with a mean that hovers around 3. This aligns with our expectations, as the mean of all the Michigan data is about 3.138. Knowing that the distribution may be normal is useful and suggests it can be used in certain models, like regression models.

Explanation of Plot 2.1E:

plot2.1E

2.1E combines two simple histograms in a single plot, with the fill color indicating the month. One histogram (in light blue) represents November precipitation measurements in Louisiana from 2000 to 2021. The other histogram (in pink) represents June precipitation measurements in Louisiana from 2000 to 2021.

I was curious to know whether there would be a visible difference in the amount of precipitation recorded in June versus in November over the course of the years from 2000 to 2021. Overall, though there is not an extremely obvious difference, the June values appear to have a wider range and a greater right skew, with far more values which lie above the mean and more large outliers. Previous exploration suggested that Louisiana was the state for which there was the greatest variation in the amount of precipitation which fell during different months of the year. Thus, it makes sense to see that there was a visible difference in the trends of precipitation recorded during June versus that recorded in November in Louisiana.

Explanation of Plot 2.2B:

plot2.2B

In bivariate plot 2.2B, I wanted to incorporate two variables: precipitation in Louisiana, and time. More specifically, in the plot above, the x-axis represents the year, while the y-axis represents June precipitation measurements collected in that year. The question I was curious to answer was whether or not there were significant changes in the June precipitation trends from 2000 to 2021. Based on a general analysis of the plot, it is possible that Louisiana has had less rain in recent years. However, more analysis would have to be conducted to verify if this was a significant difference.

PROBLEM 4)

-What questions are left unanswered with your visualizations?

What about the visualizations may be unclear to a naive viewer?
How could you improve the visualizations for the final project?

I think the visualizations are pretty clear to observers now that I improved the 2.1D plot to include a legend. However, I have a few questions left unanswered. One question is this: It seems that some divisions in Louisiana are particularly rainy during the month of June. If I find out which divisions these are, and subset the data just to these divisions, will there be a more noticeable (visible) difference in the precipitation trends between June and November?

Another question is: Has there been a change in precipitation in Louisiana during the June months from 1895-1900 as compared to 2000-2021? Is this difference statistically significant?
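One possible way to approach the significance question is a two-sample test on the yearly June means from each period. The sketch below runs a Welch t-test and a Wilcoxon rank-sum test on simulated numbers. The values, the period sizes of 6 and 22 years, and the choice of tests are all assumptions, and nothing here is a result from the real data:

```r
set.seed(3)
# Simulated yearly June means for two periods (not real data).
early <- rnorm(6,  mean = 5.0, sd = 2)   # stand-in for 1895-1900
late  <- rnorm(22, mean = 4.5, sd = 2)   # stand-in for 2000-2021

# t.test() uses the Welch version by default, which does not assume
# equal variances in the two periods.
welch <- t.test(early, late)
welch$p.value

# A rank-based alternative that does not assume normality.
rank_test <- wilcox.test(early, late)
rank_test$p.value
```

With only six years in the early period, either test would have limited power, so a longer early window might be worth considering.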

References:

“Climate and Average Weather Year Round in Alabama.” Weatherspark.com, https://weatherspark.com/y/20416/Average-Weather-in-Alabama-New-York-United-States-Year-Round#:~:text=The%20chance%20of%20wet%20days,least%200.04%20inches%20of%20precipitation.

Nag, Oishimaya Sen. “The 10 Wettest States in the United States of America.” WorldAtlas, WorldAtlas, 24 Apr. 2019, https://www.worldatlas.com/articles/the-10-wettest-states-in-the-united-states-of-america.html.
