Final Project Assignment#1: Teresa Lardo

Bigfoot Reports

Project & Data Description

Author

Teresa Lardo

Published

April 12, 2023

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Part 1. Introduction

Dataset

For my final project, I want to work with a data set on Bigfoot reports (“reports” includes direct sightings of or encounters with a creature purported to be Bigfoot, vocalizations thought to be those of a Sasquatch, and discoveries of Bigfoot-like footprints). This data set describes just over 5,000 Bigfoot reports within the continental United States dated from November 1869 to November 2021. The data is largely self-reported, though some of the earlier reports are taken from newspaper accounts.

The report data comes from the Bigfoot Field Researchers Organization (BFRO), and supplemental data on weather and environmental conditions was added from the Dark Sky API. The geocoded and weather-enhanced data set I’m using comes courtesy of Tim Renner.

Questions

With this data set, I want to explore the relationships between types of reports in different locations and environmental conditions over time:

How has the number of Bigfoot reports in different states & geographic regions fluctuated over time? (For example, have sightings in New England increased notably since the 1970s? Was there a spike in reports from Texas in the mid-90s?)
Do sightings or reports of vocalizations correspond strongly with a specific moon phase?
Are Class A reports more likely to correspond with clear weather conditions, and do Class B reports correspond with foggy weather or low visibility?

Part 2. Describe the data set

library(readr)
library(dplyr)
bfro <- read_csv("TeresaLardo_FinalProjectData/bfro_reports_geocoded.csv", col_types = cols(number = col_skip()))

# Rearrange order of columns for my personal sanity
bfro <- bfro %>% 
  select(index, date, title, state, county, classification, everything())

head(bfro)

When reading in the CSV, I skip the number column, which simply records the number of the report in the BFRO system.

dim(bfro)

[1] 5021   28

colnames(bfro)

 [1] "index"              "date"               "title"             
 [4] "state"              "county"             "classification"    
 [7] "observed"           "location_details"   "season"            
[10] "latitude"           "longitude"          "geohash"           
[13] "temperature_high"   "temperature_mid"    "temperature_low"   
[16] "dew_point"          "humidity"           "cloud_cover"       
[19] "moon_phase"         "precip_intensity"   "precip_probability"
[22] "precip_type"        "pressure"           "summary"           
[25] "uv_index"           "visibility"         "wind_bearing"      
[28] "wind_speed"

unique(bfro$season)

[1] "Summer"  "Fall"    "Spring"  "Winter"  "Unknown"

length(unique(bfro$state))

[1] 49

unique(bfro$classification)

[1] "Class B" "Class A" "Class C"

Summary Statistics

The dataset includes many variables about environmental conditions for the reports which have specific dates. I would like to concentrate on the following variables: cloud cover, visibility, temperature (high, low, and mid), and moon phase.

library(summarytools)
bfro_stats <- bfro %>%
  select(moon_phase, cloud_cover, temperature_high, temperature_mid, temperature_low, visibility) %>%
  drop_na()

descr(bfro_stats)

Descriptive Statistics  
bfro_stats  
N: 2859  

                    cloud_cover   moon_phase   temperature_high   temperature_low   temperature_mid
----------------- ------------- ------------ ------------------ ----------------- -----------------
             Mean          0.43         0.50              67.14             48.83             57.99
          Std.Dev          0.33         0.29              17.95             16.12             16.54
              Min          0.00         0.00              -0.62            -22.78             -8.46
               Q1          0.12         0.25              55.17             37.60             46.80
           Median          0.39         0.49              70.01             49.70             59.58
               Q3          0.72         0.75              81.23             61.10             70.66
              Max          1.00         1.00             106.51             84.34             94.03
              MAD          0.43         0.37              18.71             17.45             17.69
              IQR          0.60         0.50              26.06             23.49             23.86
               CV          0.77         0.58               0.27              0.33              0.29
         Skewness          0.26         0.00              -0.57             -0.42             -0.52
      SE.Skewness          0.05         0.05               0.05              0.05              0.05
         Kurtosis         -1.26        -1.20              -0.16             -0.02             -0.06
          N.Valid       2859.00      2859.00            2859.00           2859.00           2859.00
        Pct.Valid        100.00       100.00             100.00            100.00            100.00

Table: Table continues below

 

                    visibility
----------------- ------------
             Mean         8.53
          Std.Dev         2.01
              Min         0.74
               Q1         7.71
           Median         9.46
               Q3        10.00
              Max        10.00
              MAD         0.80
              IQR         2.29
               CV         0.24
         Skewness        -1.70
      SE.Skewness         0.05
         Kurtosis         2.46
          N.Valid      2859.00
        Pct.Valid       100.00
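The descr() output above reports N: 2859, while dim(bfro) showed 5021 rows, so drop_na() removed 2162 reports that were missing at least one of the selected variables. A minimal sketch of how the missing values could be counted per column is below; toy_reports is a made-up stand-in for the real data, and its values are illustrative only.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for a few environmental columns of bfro
toy_reports <- tibble(
  moon_phase  = c(0.25, NA, 0.80, 0.50),
  cloud_cover = c(0.10, NA, NA, 0.90),
  visibility  = c(10, NA, 9.5, 8)
)

# Number of missing values in each column
missing_counts <- toy_reports %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Number of rows that survive drop_na(), i.e. rows with no missing values
complete_rows <- toy_reports %>%
  drop_na() %>%
  nrow()

missing_counts
complete_rows
```

Running the same summarise() call on the selected columns of bfro should show which variables account for most of the dropped rows.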

What’s In This Dataset?

A case in this dataset is a Bigfoot report, including sightings, vocalizations, and footprints. There are over 5,000 cases in this set, and the dataset includes:

descriptions of the events of the encounter,
the date & season of the encounter,
the location (including state, county, latitude, longitude, geohash, and details describing the specific location such as “near the summit of Mt. Mitchell” or “north of Highway 285”),
title of the report,
classification of the sighting (relating to the circumstantial potential for misinterpretation of the observation: Class A denotes a very low potential for misinterpretation, Class B denotes a greater potential for misinterpretation or misidentification, such as in the case of sounds heard but no clear view of a creature, and Class C denotes a high potential for inaccuracy due to being second-hand reports or having untraceable sources),
and environmental conditions for reports with specified dates, including temperature (high, low, mid), dew point, humidity, cloud cover, moon phase, precipitation (type, probability & intensity), atmospheric pressure, UV index, wind bearing & wind speed, visibility, and a textual summary of the weather conditions of the day in the report’s location.

I plan to focus on the environmental variables of temperature, cloud cover, visibility, and moon phase.

Part 3. The Tentative Plan for Visualization

To explore the question of changes in the number of reports by state and region over time, I plan to mutate the data to create a “Region” variable to better chunk together the geographic data, and then create a time series visualization for the different regions. From there, I can home in on any regional spikes in activity and investigate the activity at the state level (for states in that region). I also want to look into the details of creating a choropleth map and see if I can integrate an animated time series element into that. I have created some maps on ArcGIS with this data set before, and I think a choropleth map would be a great way to show which areas “light up” with activity within a given timespan.
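As a rough sketch of the regional time series idea, reports could be counted per region per year and drawn as one line per region. The toy_reports data below is made up for illustration, and the sketch assumes the date column has been parsed as a Date and that a Region column already exists.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Hypothetical toy data: a handful of invented reports with a date and a Region
toy_reports <- tibble(
  date   = as.Date(c("1975-06-01", "1975-08-15", "1996-10-31",
                     "1996-11-02", "2004-07-04")),
  Region = c("New England", "Pacific", "West South Central",
             "West South Central", "Pacific")
)

# Count reports per Region per year; undated reports cannot be placed in time
reports_by_year <- toy_reports %>%
  filter(!is.na(date)) %>%
  mutate(year = year(date)) %>%
  count(Region, year, name = "n_reports")

# One line per Region
region_plot <- ggplot(reports_by_year,
                      aes(x = year, y = n_reports, color = Region)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Number of reports")

region_plot
```

With nine regions on one panel the lines may overlap heavily, so adding facet_wrap(~ Region) is one option worth trying on the real data.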

To explore the question of moon phase correspondence, I can start with a basic scatterplot. I would like to use string searching to detect which reports are visual vs. auditory and categorize the reports of vocalizations separately from the visual sightings, and use color to distinguish these two categories on the moon phase scatterplot.
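One possible way to do the string searching is with stringr::str_detect() inside case_when(). The keyword patterns below are my own first guesses rather than a tested classification scheme, and the toy_reports rows are invented examples rather than real BFRO text.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: invented report text with a moon phase value
toy_reports <- tibble(
  observed = c("I saw a tall, hairy figure cross the road.",
               "We heard loud howls and screams from the ridge.",
               "Found large tracks near the creek."),
  moon_phase = c(0.10, 0.52, 0.98)
)

# Assumed keyword lists; these would need tuning against the real reports
visual_pattern   <- regex("\\b(saw|seen|watched)\\b", ignore_case = TRUE)
auditory_pattern <- regex("\\b(heard|howls?|screams?|vocalizations?)\\b",
                          ignore_case = TRUE)

# case_when() uses the first matching condition, so a report that mentions
# both seeing and hearing something is labeled "Visual" here
toy_reports <- toy_reports %>%
  mutate(report_type = case_when(
    str_detect(observed, visual_pattern)   ~ "Visual",
    str_detect(observed, auditory_pattern) ~ "Auditory",
    TRUE                                   ~ "Other"
  ))

toy_reports %>% select(report_type, moon_phase)
```

The resulting report_type column could then be mapped to color in the moon phase plot.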

To explore the question of Class A & B reports and relative visibility, I can again start with a basic scatterplot where Class A & B reports are distinguished by color. This should give a quick sense of any trends by level of visibility for both classifications. I would also like to dig into the weather details in the “Summary” variable, such as searching for words like “fog” and “foggy” versus “clear,” and see which classes of reports show up most for reports with these weather descriptions. This will also likely be taken from the subset of visual sightings instead of from all reports.
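A small sketch of the weather keyword search might look like the following. The toy_reports rows are invented, and treating any summary containing “fog” as foggy and any containing “clear” as clear is a simplifying assumption.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: invented classifications and weather summaries
toy_reports <- tibble(
  classification = c("Class A", "Class B", "Class B", "Class A", "Class C"),
  summary = c("Clear throughout the day.",
              "Foggy in the morning.",
              "Fog overnight.",
              "Mostly cloudy throughout the day.",
              NA)
)

# Keep Class A and B reports that have a weather summary, label each summary,
# and count the labels within each class
weather_by_class <- toy_reports %>%
  filter(classification %in% c("Class A", "Class B"), !is.na(summary)) %>%
  mutate(weather = case_when(
    str_detect(summary, regex("fog", ignore_case = TRUE))   ~ "Fog",
    str_detect(summary, regex("clear", ignore_case = TRUE)) ~ "Clear",
    TRUE                                                    ~ "Other"
  )) %>%
  count(classification, weather)

weather_by_class
```

Matching on “fog” also catches “foggy”, so a single pattern covers both words.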

Creating a Region Column

# Creating separate vectors for each region
Pacific <- c("California", "Oregon", "Washington", "Alaska")
Mountain <- c("Nevada", "Arizona", "New Mexico", "Colorado", "Utah", "Idaho", "Montana", "Wyoming")
West_North_Central <- c("Minnesota", "North Dakota", "South Dakota", "Iowa", "Nebraska", "Kansas", "Missouri")
West_South_Central <- c("Texas", "Oklahoma", "Louisiana", "Arkansas")
East_North_Central <- c("Ohio", "Wisconsin", "Michigan", "Illinois", "Indiana")
East_South_Central <- c("Alabama", "Kentucky", "Tennessee", "Mississippi")
South_Atlantic <- c("Florida", "Georgia", "South Carolina", "North Carolina", "Virginia", "West Virginia", "Maryland", "Delaware")
Mid_Atlantic <- c("Pennsylvania", "New York", "New Jersey")
New_England <- c("Connecticut", "Rhode Island", "Massachusetts", "Vermont", "New Hampshire", "Maine")

# Mutating new column using the vectors above
bfro <- bfro %>%
  mutate(
    Region = case_when(state %in% Pacific ~ "Pacific",
                       state %in% Mountain ~ "Mountain",
                       state %in% West_North_Central ~ "West North Central",
                       state %in% West_South_Central ~ "West South Central",
                       state %in% East_North_Central ~ "East North Central",
                       state %in% East_South_Central ~ "East South Central",
                       state %in% South_Atlantic ~ "South Atlantic",
                       state %in% Mid_Atlantic ~ "Mid-Atlantic",
                       state %in% New_England ~ "New England")
  )

Let’s do a quick sanity check on that new column:

# Select the state and Region columns from the dataset and look at a sample with head()
bfro %>% 
  select(state, Region) %>%
  head(15)

Okay, good - this sample of the first 15 values of the state & Region columns indicates that these states have been categorized into their correct regions.
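Since head() only shows the first rows, a fuller check could list every distinct state and Region pair and flag any state left without a region (case_when() returns NA when no condition matches). The sketch below uses a made-up toy_reports table; the "Not A State" row is a deliberately fake value that shows what an unmatched state would look like.

```r
library(dplyr)

# Hypothetical toy data; "Not A State" is a fake value standing in for any
# state name that appears in none of the region vectors
toy_reports <- tibble(
  state  = c("Ohio", "Ohio", "Texas", "Not A State"),
  Region = c("East North Central", "East North Central",
             "West South Central", NA)
)

# One row per state/Region pair, for reading through the whole mapping
state_regions <- toy_reports %>%
  distinct(state, Region) %>%
  arrange(Region, state)

# States that were not assigned to any region
unassigned_states <- toy_reports %>%
  filter(is.na(Region)) %>%
  distinct(state) %>%
  pull(state)

state_regions
unassigned_states
```

On the real data, an empty unassigned_states vector would suggest that every state in bfro was matched by one of the region vectors.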