Homework 2: Diving into Data on Atlantic Hurricanes from 1920-2020

hw2

atlantic_hurricanes

hunter_major

Homework 2 Assignment on Atlantic Hurricanes Data

Author

Hunter Major

Published

June 12, 2023

Code

# loading packages into R Studio

#| label: setup
#| warning: false
#| message: false

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

library(dplyr)
library(readr)
library(readxl)
library(tidyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Homework 2 Overview

For this homework, your goal is to read in a more complicated dataset. Please use the category tag “hw2” as well as a tag for the dataset you choose to use.

Read in a dataset. It’s strongly recommended that you choose a dataset you’re considering using for the final project. If you decide to use one of the datasets we have provided, please use a challenging dataset - check with us if you are not sure.

Clean the data as needed using dplyr and related tidyverse packages. Provide a narrative about the data set (look it up if you aren’t sure what you have got) and the variables in your dataset, including what type of data each variable is. The goal of this step is to communicate in a visually appealing way to non-experts - not to replicate r-code. Identify potential research questions that your dataset can help answer.

Code

# reading in the dataset

atlantic_hurricanes <- 
read_csv("_mysampledatasets/atlantic_hurricanes.csv")


atlantic_hurricanes

Narrative About The Data

Code

summary(atlantic_hurricanes)

      ...1           Name             Duration          Wind speed       
 Min.   :  0.0   Length:458         Length:458         Length:458        
 1st Qu.:114.2   Class :character   Class :character   Class :character  
 Median :228.5   Mode  :character   Mode  :character   Mode  :character  
 Mean   :228.5                                                           
 3rd Qu.:342.8                                                           
 Max.   :457.0                                                           
   Pressure         Areas affected        Deaths             Damage         
 Length:458         Length:458         Length:458         Length:458        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
     REf               Category    
 Length:458         Min.   :1.000  
 Class :character   1st Qu.:1.000  
 Mode  :character   Median :2.000  
                    Mean   :2.037  
                    3rd Qu.:3.000  
                    Max.   :5.000

Code

dim(atlantic_hurricanes)

[1] 458  10

This dataset provides insights into Atlantic hurricanes, hurricanes that developed in the Atlantic Ocean area, across a 100 year time period, from 1920 to 2020. This dataset lists hurricanes that fall under the Category 1, Category 2, Category 3, and Category 5 classifications; therefore, it doesn’t include Category 4 hurricanes nor does it include (storms that didn’t develop beyond) tropical storms and tropical depressions. 458 hurricanes or observations/rows are included in this dataset. There are 10 variables. Variables in the original version of the data are:

…1 or X (or the list number/ID of the hurricane as entered into the dataset, mostly used for organizational, data entry purposes)
the name of the hurricane (character value),
the duration of the hurricane/the dates that it occurred (numeric/character value?),
the wind speed of the hurricane (in miles per hour and kilometers per hour) (numeric value),
the pressure of the hurricane (in atmospheric pressure-hPa and in inch of mercury-inHg) (numeric value),
the number of deaths caused by the hurricane (numeric value),
the amount of damage in US dollars caused by the hurricane (numeric value),
the category of the hurricane (Cat 1, 2, 3, or 5) (number representing a character value?),
the numerically assigned references/footnotes that provide further information about the hurricane (number representing a character value?.

**Using the summary () function, most of the values for each variable are listed as character values because they do include words, abbreviations, units of measurement, other symbols/punctuation in them currently, in/after the process of tidying the dataset, I hope to make sure that all of the variables that I’ve listed as having numeric values in paratheses above will be then ‘understood’/interpreted by R as having numeric values. However, because I am a beginner in R–I don’t know if I’ll be able to accomplish all of what I perceive has to be cleaned within the suite of Homework 2 alone–as I will likely be working with this dataset (outside of the challenge assignments) all the way through to the final project, but I’ll try my best:)

Data Cleaning

1. Removing REf Column and the …1 or X Column

I am removing the REf column because I am unclear on what it represents, and I do not believe it will be useful for my purposes in tidying and working towards analyzing the snapshot this dataset provides into Atlantic hurricanes more broadly. I believe REf is potentially referring to listed/numbered footnote references in the original study. I sourced this dataset from Kaggle, https://www.kaggle.com/datasets/valery2042/hurricanes, and do not currently have access to the original study. The only information provided about sources on the dataset’s Kaggle page is, “I scraped Wikipedia pages of Atlantic hurricanes of Categoris 1,2,3 and 5 using pandas/html” (Liamtsau, 2022). I am removing the ..1/X column as well because this is simply the list/ID number of each hurricane as it is entered into the dataset, and since R Studio maintains its own list/ID number on the far left of the table I believe the X variable is no longer necessary. Also, since the first number in the X column is 0 for the first hurricane listed instead of 1, this can be confusing for some readers whose numbering convention starts with 1. Scrolling all the way to the end of the table (page 46 for me), we can see that the last value listed in the X column is 457, which is a slight mismatch from the 458 rows/observation values, which represented the total number of hurricanes included in the study, that R computed the dataset to have.

Code

# remove column named REf and the X Column

atlantic_hurricanes2 <- atlantic_hurricanes %>% select(-c(...1,REf))
  

atlantic_hurricanes2

Looks like the REf and X (or …1) columns were successfully removed! There should now be 8 columns.

2. Separate Wind.speed into Wind.speed.mph and Wind.speed.kmh and Pressure into Pressure.hPa and Pressure.inHg

In the current version of the dataset, within the Wind.speed column, values for each hurricane’s wind speed are provided in miles per hour (mph) and kilometers per hour (km/h) in the same cell. Likewise, values for each hurricane’s pressure are provided in hPa (atmospheric pressure) and inHg (inch of Mercury). I would like to separate those values, so each unit of measurement for the wind speed and pressure, respectively has their own distinct columns.

Code

# separate the Wind.speed column into Wind.speed.mph and Wind.speed.kmh

atlantic_hurricanes3 <- separate(atlantic_hurricanes2, `Wind speed`, into = c("Wind.speed.mph", "Wind.speed.kmh"), sep = "\\(")

atlantic_hurricanes3

Code

# separate Pressure column into Pressure.hPa and Pressure.inHg

atlantic_hurricanes4 <- separate(atlantic_hurricanes3, Pressure, into = c("Pressure.hPa", "Pressure.inHg"), sep = " ")

atlantic_hurricanes4

Looks like each unit of measurement for a hurricane’s wind speed (Wind.speed.mph and Wind.speed.kmh) and a hurricane’s pressure (Pressure.hPa and Pressure.inHg) now have their own distinct columns!

3. Removing measurement unit abbreviations and unneeded parentheses from values in the Wind.speed.mph, Wind.speed.kmh, Pressure.hPa, and Pressure.inHg columns

I would like to remove the measurement unit abbreviations and unneeded parentheses from values in the Wind.speed.mph, Wind.speed.kmh, Pressure.hPa, and Pressure.inHg columns so that only the numbers/numeric values remain. Once R reads these columns as have numeric values, I’ll be able to run summary statistics and other relevant numeric related functions using them that’ll provide useful information to analyze.

Code

# removing "mph" from the end of values in the Wind.speed.mph column

atlantic_hurricanes5 <- mutate(atlantic_hurricanes4, Wind.speed.mph = as.numeric(str_extract(Wind.speed.mph,pattern="[:digit:]")))

atlantic_hurricanes5

Code

# removing "km/h)" from the end of values in the Wind.speed.kmh column

atlantic_hurricanes6 <- mutate(atlantic_hurricanes5, Wind.speed.kmh = as.numeric(str_extract(Wind.speed.kmh,pattern = "[:digit:]")))

atlantic_hurricanes6

3a. Running into a Small Issue within the data cleaning task 3

**In essence, it looks like this attempt to remove the unneeded abbreviations and parentheses is working; however, I need to troubleshoot this further because upon looking at the ‘before’ and ‘after’ of the values in these respective columns, I found that every digit after the first digit in a cell is getting removed also–> for example, in the first row, 90 mph has changed to 9 and 150 km/h has changed to 1. How can I fix this?

What Remains to Be Cleaned

While I did remove the REf and X variables and separated the Wind.speed and Pressure columns into columns with distinct units of measurement, I’m aware that there are additional items to be cleaned.

Going forward, I want to keep the information within the Duration column but potentially rename it to Date and create a separate Duration column that transforms the date timeline/range for each hurricane into a number of days. I know some of the datasets we’ve used in class for the challenges have dealt with aspects of tidying date-related data (like separating or grouping months and years), but I’m not exactly sure yet on how to go about transforming a date range into number of days within R Studio without going through and making manual changes to each row myself. In addition, though properly fixing this is in progress (I still have to remove the phrase “mph” from the values in the Wind.speed.mph column, remove the parentheses and “km/h” from the Wind.speed.kmh values, remove the “hPa” phrase from the Pressure.hPa column, remove the parentheses and “inHg” from the Pressure.inHg column–so that R can classify those variables as numeric values that I can then use to compute summary statistics with.) Furthermore, I recognized that in the long-run, it might be more convenient to pick just one unit of measurement for the Wind speed and Pressure variables versus having two for each…this is something I will consider amending as I work with this dataset further.

The Areas.affected column is comprehensive but not consistent…sometimes global regions are listed, sometimes countries, sometimes island clusters, sometimes U.S. states, sometimes U.S. regions, sometimes parts within U.S. states (i.e., South Texas), parts of a global region (i.e., Northeastern Caribbean), then there’s also the category of “No land areas” (which I intuitively understand as a hurricane that remained in open water and did not make landfall), etc. I can try my best more neatly organize this by creating separate columns with different area classifications, but this will make the dataset much wider and as hinted at previously, there are overlaps and inconsistencies between the types of Areas.affected listed. I like the succinct-ness of one Areas.affected column, but at the same time, I recognize that how it currently exists is not as tidy as it can be.

Similarly, the Damage column also provides values that can be made more tidy (i.e., $1.3 billion can be changed to 1,300,000,000) but there are also inconsistencies between in the ways Damage is characterized throughout the column. Sometimes it’s in dollars (sometimes with million, billion, or thousand written out, sometimes it’s listed as $450,000), assumingly USD, sometimes the cells contain the words “Minimal” “None” “Extensive” “Heavy” “Minor” “Unknown” “Millions” “Moderate”, sometimes the cells are left blank altogether, sometimes a range/modified by a logical operator (i.e., > $100 thousand), etc.

I’m open to advice and assistance on how to best approach further tidying the Areas.affected column and Damage column in particular. I suppose I could filter the Areas.affected column to just list countries or just list U.S. states, for example–but then that means deleting area phrases like “Gulf Coast United States” which is within the country of the United States and is comprised of states like Texas, Louisiana, Mississippi, Alabama, Florida. Likewise, I could filter the Damage column so that it just lists dollar amounts and lists them with zeroes at the end vs words like “million” “thousand,” to make the variable have numeric values but then that also takes out all of the more qualitative descriptors of Damage, which do serve their own purpose/provide their own insights or summary.

Potential Research Questions This Dataset Could Help Answer

Do duration, wind speed, pressure, and hurricane category classification have any correlation with deaths and damage costs caused by hurricanes? If so, how so? Which areas are most susceptible to the most severe and devastating hurricanes in the century this dataset covers? On the other hand, there are some limitations to this dataset. Something I noticed this dataset doesn’t answer is, at what point in the duration of the hurricane were these measures of wind speed and pressure taken? While I’m assuming it is a listing of the highest intensity recorded wind speed and pressure throughout the course of the storm, this isn’t clear. Moreover, for the hurricanes that impacted multiple areas, the windspeed and pressure was likely not the exact same measure in each successive location the hurricane made landfall in. While insights can be deduced from this dataset, especially after additional, forthcoming stages of cleaning the data, studying each storm and each area individually will provide additional, more full and nuanced context(s) of systems, resources, lived experiences, etc that influenced and/or were influenced by the hurricane(s).

References

Liamtsau, V. (2022, August 5). Atlantic hurricanes (data cleaning challenge). Kaggle. https://www.kaggle.com/datasets/valery2042/hurricanes