DACSS 601: Data Science Fundamentals - FALL 2022

Homework 2 Abby Balint

hw2
abby_balint
SharkAttackKaggle
Author

Abby Balint

Published

October 11, 2022

Code
library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Homework 2 Intro

For Homework 2 I found a dataset on Kaggle about shark attacks. I read it in below with read_csv and name it “shark” to make it easy to work with. I then describe the dataset and explain what I do to tidy and mutate it to answer potential questions or points of analysis. The dataset I downloaded can be found here: https://www.kaggle.com/datasets/mysarahmadbhat/shark-attacks

My goal is to be able to describe some high level statistics about shark attacks in the USA since 2000.

Code
shark <- read_csv("_data/shark attacks_abbybalintHW2.csv")
head(shark)
# A tibble: 6 × 22
  Case Numbe…¹ Date   Year Type  Country Area  Locat…² Activ…³ Name  Sex   Age  
  <chr>        <chr> <dbl> <chr> <chr>   <chr> <chr>   <chr>   <chr> <chr> <chr>
1 2017.06.11   6/11…  2017 Unpr… AUSTRA… West… Point … Body b… Paul… M     48   
2 2017.06.10.b 6/10…  2017 Unpr… AUSTRA… Vict… Flinde… Surfing fema… F     <NA> 
3 2017.06.10.a 6/10…  2017 Unpr… USA     Flor… Ponce … Surfing Brya… M     19   
4 2017.06.07.R Repo…  2017 Unpr… UNITED… Sout… Bantha… Surfing Rich… M     30   
5 2017.06.04   6/4/…  2017 Unpr… USA     Flor… Middle… Spearf… Park… M     <NA> 
6 2017.06.02   6/2/…  2017 Unpr… BAHAMAS New … Athol … Snorke… Tiff… F     32   
# … with 11 more variables: Injury <chr>, `Fatal (Y/N)` <chr>, Time <chr>,
#   Species <chr>, `Investigator or Source` <chr>, pdf <chr>,
#   `href formula` <chr>, href <chr>, `Case Number...20` <chr>,
#   `Case Number...21` <chr>, `original order` <dbl>, and abbreviated variable
#   names ¹​`Case Number...1`, ²​Location, ³​Activity

Description of the data

This dataset is from Kaggle and appears to be an aggregate of a few different data sources, as it runs from the late 1800’s all the way to 2017 and includes over 6,000 individual shark attacks. The data spans several countries. A variable titled “Investigator or Source” describes where each individual attack was originally cited from. There are 22 variables; however, several of them are different ways of describing a date or case number, so it makes sense to choose the column that best represents the date and remove some of the others. The other columns describe each shark attack with variables like location, activity, demographics of the victim, injury type, and shark species.
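
One way to judge which date column is the most usable is to check how many of its values parse cleanly. The sketch below is not part of the homework code: it uses a few made-up values (the string "unknown" is a hypothetical stand-in for any non-date entry) and assumes the month/day/year pattern seen in the printed Date column.

```r
library(lubridate)

# Toy values; "unknown" is a hypothetical non-date entry, not taken from the file
dates_raw <- c("6/11/17", "6/10/17", "unknown", NA)

# mdy() returns NA for strings it cannot read as month/day/year;
# quiet = TRUE suppresses the parsing warning
dates_parsed <- mdy(dates_raw, quiet = TRUE)

# Share of values that parse as dates
share_parsed <- mean(!is.na(dates_parsed))
share_parsed
```

Running the same check on the real Date column would show how much of it could be converted before relying on it for ordering.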

Tidying the data

  1. I am selecting only the columns I want to work with and dropping the columns that reference the source of the data. I can look those up individually in the original dataset if needed, but they won’t necessarily be useful for my analysis.
Code
sharknew <- shark %>%
  select("Date":"Species", "original order")
Code
head(sharknew)
# A tibble: 6 × 15
  Date         Year Type  Country Area  Locat…¹ Activ…² Name  Sex   Age   Injury
  <chr>       <dbl> <chr> <chr>   <chr> <chr>   <chr>   <chr> <chr> <chr> <chr> 
1 6/11/17      2017 Unpr… AUSTRA… West… Point … Body b… Paul… M     48    No in…
2 6/10/17      2017 Unpr… AUSTRA… Vict… Flinde… Surfing fema… F     <NA>  No in…
3 6/10/17      2017 Unpr… USA     Flor… Ponce … Surfing Brya… M     19    Lacer…
4 Reported 0…  2017 Unpr… UNITED… Sout… Bantha… Surfing Rich… M     30    Bruis…
5 6/4/17       2017 Unpr… USA     Flor… Middle… Spearf… Park… M     <NA>  Lacer…
6 6/2/17       2017 Unpr… BAHAMAS New … Athol … Snorke… Tiff… F     32    Right…
# … with 4 more variables: `Fatal (Y/N)` <chr>, Time <chr>, Species <chr>,
#   `original order` <dbl>, and abbreviated variable names ¹​Location, ²​Activity
  2. Here I am filtering the data down to US-only rows in or after the year 2000 to narrow the scope of my analysis. I am also sorting by “original order” because, upon reviewing the data, this is the variable that actually determines the chronological order in which the attacks were reported. All other time and date columns are missing various data or are in incomplete formats. I kept the Date column because, for the rows that do have values, the formatting is pretty standardized.
Code
sharknew %>%
  filter(`Country` == "USA", `Year` >= 2000) %>%
  arrange(`original order`)
# A tibble: 963 × 15
   Date     Year Type     Country Area  Locat…¹ Activ…² Name  Sex   Age   Injury
   <chr>   <dbl> <chr>    <chr>   <chr> <chr>   <chr>   <chr> <chr> <chr> <chr> 
 1 6/22/05  2000 Boat     USA     Flor… Boca G… Fishin… boat… M     <NA>  No in…
 2 2/21/00  2000 Unprovo… USA     Flor… Rivier… <NA>    male  M     27    Right…
 3 3/1/00   2000 Unprovo… USA     Loui… Midnig… Spearf… Kurt… M     39    No in…
 4 3/24/00  2000 Unprovo… USA     Flor… Florid… Surfing Barr… M     37    Left …
 5 3/26/00  2000 Unprovo… USA     Flor… Juno B… Boogie… Heat… F     14    Right…
 6 3/31/00  2000 Unprovo… USA     Flor… Santa … Fishing Dave… M     <NA>  No In…
 7 4/9/00   2000 Unprovo… USA     Flor… Munici… Boogie… teen  M     <NA>  Punct…
 8 4/14/00  2000 Unprovo… USA     Flor… On the… Walking Adam… M     34    Left …
 9 6/2/00   2000 Unprovo… USA     Flor… 27th A… Snorke… Bria… M     13    Right…
10 6/9/00   2000 Unprovo… USA     Alab… Gulf S… Swimmi… Chuc… M     44    Right…
# … with 953 more rows, 4 more variables: `Fatal (Y/N)` <chr>, Time <chr>,
#   Species <chr>, `original order` <dbl>, and abbreviated variable names
#   ¹​Location, ²​Activity
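
One detail worth checking in a filter like this is whether the year is compared as a number or as text. The toy vector below is made up for illustration and only shows how the two comparisons can differ when a column contains very small year values.

```r
# Hypothetical year values, including some implausibly small ones
years <- c(5, 500, 1999, 2000, 2017)

# Comparing against the string "2000" makes R compare text, character by character,
# so 5 and 500 slip through because "5" sorts after "2"
as_text <- years[years >= "2000"]

# Comparing against the number 2000 keeps only the intended years
as_number <- years[years >= 2000]

as_text
as_number
```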
  3. I saw that the Location column could be used to break out counties, since most values list a description of the beach, park, etc., followed by a county after the comma. This is particularly useful because I am looking at US shark attacks only, and most US locations have a county associated with them (rows without a comma end up with a missing County). I also renamed the “Area” variable to “State” because I am looking at the US only.
Code
sharknew %>%
  filter(`Country` == "USA", `Year` >= 2000) %>%
  arrange(`original order`) %>%
  separate(col=`Location`, into=c("Description" , "County"), sep = ",") %>%
  rename(State = `Area`)
# A tibble: 963 × 16
   Date     Year Type     Country State Descr…¹ County Activ…² Name  Sex   Age  
   <chr>   <dbl> <chr>    <chr>   <chr> <chr>   <chr>  <chr>   <chr> <chr> <chr>
 1 6/22/05  2000 Boat     USA     Flor… Boca G… " Lee… Fishin… boat… M     <NA> 
 2 2/21/00  2000 Unprovo… USA     Flor… Rivier… " Pal… <NA>    male  M     27   
 3 3/1/00   2000 Unprovo… USA     Loui… Midnig…  <NA>  Spearf… Kurt… M     39   
 4 3/24/00  2000 Unprovo… USA     Flor… Florid… " Bre… Surfing Barr… M     37   
 5 3/26/00  2000 Unprovo… USA     Flor… Juno B… " Pal… Boogie… Heat… F     14   
 6 3/31/00  2000 Unprovo… USA     Flor… Santa …  <NA>  Fishing Dave… M     <NA> 
 7 4/9/00   2000 Unprovo… USA     Flor… Munici… " Riv… Boogie… teen  M     <NA> 
 8 4/14/00  2000 Unprovo… USA     Flor… On the… " Vol… Walking Adam… M     34   
 9 6/2/00   2000 Unprovo… USA     Flor… 27th A… " New… Snorke… Bria… M     13   
10 6/9/00   2000 Unprovo… USA     Alab… Gulf S… " Bal… Swimmi… Chuc… M     44   
# … with 953 more rows, 5 more variables: Injury <chr>, `Fatal (Y/N)` <chr>,
#   Time <chr>, Species <chr>, `original order` <dbl>, and abbreviated variable
#   names ¹​Description, ²​Activity
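
The County values printed above keep the space that followed the comma. A possible refinement, sketched here on made-up Location strings rather than the real data, is to split on the comma plus any trailing whitespace and to state explicitly what should happen when a value has no comma or more than one.

```r
library(dplyr)
library(tidyr)

# Hypothetical Location values following the "description, county" pattern
toy <- tibble(Location = c("Beach A, County X",
                           "Reef B",
                           "Pier C, Town D, County X"))

out <- toy %>%
  separate(col = Location, into = c("Description", "County"),
           sep = ",\\s*",     # split on the comma and drop the spaces after it
           extra = "merge",   # extra commas stay inside County instead of being dropped
           fill = "right")    # values without a comma get NA for County

out
```

With `extra = "merge"` the third row keeps "Town D, County X" together, so whether that is the desired County value is a judgment call that depends on the real data.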
  4. Since there isn’t much in this dataset to mutate or pivot, I thought it would be interesting to recode the granular ages into age groups for analysis.
Code
sharknew %>%
  filter(`Country` == "USA", `Year` >= 2000) %>%
  arrange(`original order`) %>%
  separate(col=`Location`, into=c("Description" , "County"), sep = ",") %>%
  rename(State = `Area`) %>%
  # Age is stored as text, so convert it to a number before comparing
  mutate(`AgeRanges` = dplyr::case_when(
    between(as.numeric(`Age`), 18, 24) ~ "18-24",
    between(as.numeric(`Age`), 25, 34) ~ "25-34",
    between(as.numeric(`Age`), 35, 44) ~ "35-44",
    between(as.numeric(`Age`), 45, 54) ~ "45-54",
    as.numeric(`Age`) >= 55  ~ "55+" ))
# A tibble: 963 × 17
   Date     Year Type     Country State Descr…¹ County Activ…² Name  Sex   Age  
   <chr>   <dbl> <chr>    <chr>   <chr> <chr>   <chr>  <chr>   <chr> <chr> <chr>
 1 6/22/05  2000 Boat     USA     Flor… Boca G… " Lee… Fishin… boat… M     <NA> 
 2 2/21/00  2000 Unprovo… USA     Flor… Rivier… " Pal… <NA>    male  M     27   
 3 3/1/00   2000 Unprovo… USA     Loui… Midnig…  <NA>  Spearf… Kurt… M     39   
 4 3/24/00  2000 Unprovo… USA     Flor… Florid… " Bre… Surfing Barr… M     37   
 5 3/26/00  2000 Unprovo… USA     Flor… Juno B… " Pal… Boogie… Heat… F     14   
 6 3/31/00  2000 Unprovo… USA     Flor… Santa …  <NA>  Fishing Dave… M     <NA> 
 7 4/9/00   2000 Unprovo… USA     Flor… Munici… " Riv… Boogie… teen  M     <NA> 
 8 4/14/00  2000 Unprovo… USA     Flor… On the… " Vol… Walking Adam… M     34   
 9 6/2/00   2000 Unprovo… USA     Flor… 27th A… " New… Snorke… Bria… M     13   
10 6/9/00   2000 Unprovo… USA     Alab… Gulf S… " Bal… Swimmi… Chuc… M     44   
# … with 953 more rows, 6 more variables: Injury <chr>, `Fatal (Y/N)` <chr>,
#   Time <chr>, Species <chr>, `original order` <dbl>, AgeRanges <chr>, and
#   abbreviated variable names ¹​Description, ²​Activity
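
Because Age is read in as text, it helps to see how text comparison behaves and what a numeric binning could look like. The sketch below uses made-up age values, and the "Under 18" bucket is an assumption added for illustration; it is not one of the groups defined in this post.

```r
# Hypothetical raw ages, including non-numeric entries
ages_raw <- c("14", "27", "9", "teen", NA, "60")

# Text comparison looks at the first character, so "9" counts as at least "55"
text_check <- "9" >= "55"

# Convert to numbers first; entries such as "teen" become NA
ages_num <- suppressWarnings(as.numeric(ages_raw))

# cut() assigns each age to a right-closed interval: up to 17, 18-24, 25-34, ...
age_group <- cut(ages_num,
                 breaks = c(-Inf, 17, 24, 34, 44, 54, Inf),
                 labels = c("Under 18", "18-24", "25-34", "35-44", "45-54", "55+"))

table(age_group, useNA = "ifany")
```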
Code
# Note: this copies sharknew unchanged; the tidying steps above were printed, not saved
sharkfinal <-  sharknew
  5. Now I can plot a few things as examples of where I could start when analyzing patterns of shark attacks in the US since 2000.
Code
sharknew %>%
  filter(`Country` == "USA", `Year` >= 2000) %>%
  arrange(`original order`) %>%
  separate(col=`Location`, into=c("Description" , "County"), sep = ",") %>%
  rename(State = `Area`) %>%
  # Age is stored as text, so convert it to a number before comparing
  mutate(`AgeRanges` = dplyr::case_when(
    between(as.numeric(`Age`), 18, 24) ~ "18-24",
    between(as.numeric(`Age`), 25, 34) ~ "25-34",
    between(as.numeric(`Age`), 35, 44) ~ "35-44",
    between(as.numeric(`Age`), 45, 54) ~ "45-54",
    as.numeric(`Age`) >= 55  ~ "55+" )) %>%
  ggplot(aes(`Type`)) + geom_bar()

Code
sharknew %>%
  filter(`Country` == "USA", `Year` >= 2000) %>%
  arrange(`original order`) %>%
  separate(col=`Location`, into=c("Description" , "County"), sep = ",") %>%
  rename(State = `Area`) %>%
  # Age is stored as text, so convert it to a number before comparing
  mutate(`AgeRanges` = dplyr::case_when(
    between(as.numeric(`Age`), 18, 24) ~ "18-24",
    between(as.numeric(`Age`), 25, 34) ~ "25-34",
    between(as.numeric(`Age`), 35, 44) ~ "35-44",
    between(as.numeric(`Age`), 45, 54) ~ "45-54",
    as.numeric(`Age`) >= 55  ~ "55+" )) %>%
  ggplot(aes(`Fatal (Y/N)`)) + geom_bar()
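
The bar charts could be paired with simple summary tables to get at the high-level statistics mentioned in the intro. This is only a sketch on a made-up tibble: the rows and the "Provoked" value are hypothetical stand-ins for the filtered US data, not results from the dataset.

```r
library(dplyr)

# Hypothetical rows standing in for the filtered US data
toy <- tibble(
  Type = c("Unprovoked", "Unprovoked", "Boat", "Provoked", "Unprovoked"),
  `Fatal (Y/N)` = c("N", "Y", "N", "N", NA)
)

# Number of attacks per type, largest first
type_counts <- toy %>% count(Type, sort = TRUE)

# Share of fatal attacks among rows where the outcome is recorded
fatal_summary <- toy %>%
  summarise(n_known = sum(!is.na(`Fatal (Y/N)`)),
            share_fatal = mean(`Fatal (Y/N)` == "Y", na.rm = TRUE))

type_counts
fatal_summary
```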
