library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Final Project Assignment #1: Teresa Lardo
Part 1. Introduction
Dataset
For my final project, I want to work with a data set on Bigfoot reports (“reports” includes direct sightings of or encounters with a creature purported to be Bigfoot, vocalizations thought to be those of a Sasquatch, and discoveries of Bigfoot-like footprints). This data set describes just over 5,000 Bigfoot reports within the continental United States, dated from November 1869 to November 2021. The data on Bigfoot sightings is largely self-reported, though some of the earlier reports are taken from newspaper accounts.
The report data comes from the Bigfoot Field Researchers Organization (BFRO), and supplemental data on weather and environmental conditions was added from the Dark Sky API. The geocoded and weather-enhanced data set I’m using comes courtesy of Tim Renner.
Questions
With this data set, I want to explore the relationships between types of reports in different locations and environmental conditions over time:
- How has the number of Bigfoot reports in different states & geographic regions fluctuated over time? (For example, have sightings in New England increased notably since the 1970s? Was there a spike in reports from Texas in the mid-90s?)
- Do sightings or reports of vocalizations correspond strongly with a specific moon phase?
- Are Class A reports more likely to correspond with clear weather conditions, and do Class B reports correspond with foggy weather or low visibility?
Part 2. Describe the data set
library(readr)
library(dplyr)
bfro <- read_csv("TeresaLardo_FinalProjectData/bfro_reports_geocoded.csv", col_types = cols(number = col_skip()))
# Rearrange order of columns for my personal sanity
bfro <- bfro %>%
  select(index, date, title, state, county, classification, everything())
head(bfro)
When reading in the CSV of the data set, I opted to skip the number variable, which simply records the number of the report in the BFRO system.
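As a minimal sketch of how skipping a column at read time behaves, the example below parses a tiny made-up CSV (the rows and values are invented for illustration and are not taken from the BFRO file) with the same col_types argument used above. It assumes readr 2.0 or later, where I() marks a string as literal data.

```r
library(readr)

# Made-up two-row CSV standing in for the real file; only the column names matter here.
toy_csv <- I("index,number,state\n1,1001,Ohio\n2,1002,Texas\n")

# col_skip() drops the named column while parsing; all other columns are guessed as usual.
toy <- read_csv(toy_csv, col_types = cols(number = col_skip()))

# Column names that survive the read
names(toy)
```

Skipping the column at read time means it never enters the data frame, so no later select() is needed to remove it.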
dim(bfro)
[1] 5021 28
colnames(bfro)
[1] "index" "date" "title"
[4] "state" "county" "classification"
[7] "observed" "location_details" "season"
[10] "latitude" "longitude" "geohash"
[13] "temperature_high" "temperature_mid" "temperature_low"
[16] "dew_point" "humidity" "cloud_cover"
[19] "moon_phase" "precip_intensity" "precip_probability"
[22] "precip_type" "pressure" "summary"
[25] "uv_index" "visibility" "wind_bearing"
[28] "wind_speed"
unique(bfro$season)
[1] "Summer" "Fall" "Spring" "Winter" "Unknown"
length(unique(bfro$state))
[1] 49
unique(bfro$classification)
[1] "Class B" "Class A" "Class C"
Summary Statistics
The dataset includes many variables about environmental conditions for the reports which have specific dates. I would like to concentrate on the following variables: cloud cover, visibility, temperature (high, low, and mid), and moon phase.
library(summarytools)
bfro_stats <- bfro %>%
  select(moon_phase, cloud_cover, temperature_high, temperature_mid, temperature_low, visibility) %>%
  drop_na()
descr(bfro_stats)
Descriptive Statistics
bfro_stats
N: 2859
                    cloud_cover   moon_phase   temperature_high   temperature_low   temperature_mid
----------------- ------------- ------------ ------------------ ----------------- -----------------
             Mean          0.43         0.50              67.14             48.83             57.99
          Std.Dev          0.33         0.29              17.95             16.12             16.54
              Min          0.00         0.00              -0.62            -22.78             -8.46
               Q1          0.12         0.25              55.17             37.60             46.80
           Median          0.39         0.49              70.01             49.70             59.58
               Q3          0.72         0.75              81.23             61.10             70.66
              Max          1.00         1.00             106.51             84.34             94.03
              MAD          0.43         0.37              18.71             17.45             17.69
              IQR          0.60         0.50              26.06             23.49             23.86
               CV          0.77         0.58               0.27              0.33              0.29
         Skewness          0.26         0.00              -0.57             -0.42             -0.52
      SE.Skewness          0.05         0.05               0.05              0.05              0.05
         Kurtosis         -1.26        -1.20              -0.16             -0.02             -0.06
          N.Valid       2859.00      2859.00            2859.00           2859.00           2859.00
        Pct.Valid        100.00       100.00             100.00            100.00            100.00
Table: Table continues below
                    visibility
----------------- ------------
             Mean         8.53
          Std.Dev         2.01
              Min         0.74
               Q1         7.71
           Median         9.46
               Q3        10.00
              Max        10.00
              MAD         0.80
              IQR         2.29
               CV         0.24
         Skewness        -1.70
      SE.Skewness         0.05
         Kurtosis         2.46
          N.Valid      2859.00
        Pct.Valid       100.00
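The N of 2859 in this table is well below the 5021 rows reported by dim(bfro), because drop_na() keeps only rows that are complete on every selected variable. The sketch below, on a small made-up tibble (the values are invented for illustration), shows one way to count how many rows are dropped and which columns contribute the missing values.

```r
library(dplyr)
library(tidyr)

# Made-up rows: two of the four are missing at least one weather value.
toy <- tibble(
  moon_phase  = c(0.25, NA, 0.50, 0.75),
  cloud_cover = c(0.10, 0.40, NA, 0.90),
  visibility  = c(10, 8.5, 9.2, 4.1)
)

# Keep only rows with no missing values in any column
toy_complete <- toy %>% drop_na()

# Rows kept vs. rows dropped by the complete-case filter
n_kept    <- nrow(toy_complete)
n_dropped <- nrow(toy) - nrow(toy_complete)

# Number of missing values in each column
na_per_column <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```

Running the same counts on the real selection would show whether the missing weather values are concentrated in one variable or spread across all of them.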
What’s In This Dataset?
A case in this dataset is a Bigfoot report, including sightings, vocalizations, and footprints. There are over 5,000 cases in this set, and the dataset includes:
- descriptions of the events of the encounter,
- the date & season of the encounter,
- the location (including state, county, latitude, longitude, geohash, and details describing the specific location such as “near the summit of Mt. Mitchell” or “north of Highway 285”),
- title of the report,
- classification of the sighting, which relates to the circumstantial potential for misinterpretation of the observation (Class A denotes a very low potential for misinterpretation, Class B denotes a greater potential for misinterpretation or misidentification, such as in the case of sounds heard but no clear view of a creature, and Class C denotes a high potential for inaccuracy due to being second-hand reports or having untraceable sources),
- and environmental conditions for reports with specified dates, including temperature (high, low, mid), dew point, humidity, cloud cover, moon phase, precipitation (type, probability & intensity), atmospheric pressure, UV index, wind bearing & wind speed, visibility, and a textual summary of the weather conditions of the day in the report’s location.
I plan to focus on the environmental variables of temperature, cloud cover, visibility, and moon phase.
Part 3. The Tentative Plan for Visualization
To explore how the number of reports by state and region has changed over time, I plan to mutate the data to create a “Region” variable that groups the states geographically, and then create a time series visualization for the different regions. From there, I can home in on any regional spikes in activity and investigate them at the state level (for the states in that region). I also want to look into the details of creating a choropleth map and see if I can integrate an animated time series element into it. I have created some maps in ArcGIS with this data set before, and I think a choropleth map would be a great way to show which areas “light up” with activity within a given timespan.
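One possible starting point for the regional time series is a table of report counts per region and decade. The sketch below uses a small made-up tibble (the dates and regions are invented for illustration) and assumes the real data has a Date-typed date column and the Region column described above; grouping by decade rather than by year is also just an assumption.

```r
library(dplyr)

# Made-up reports; the real data would supply date and Region.
toy <- tibble(
  date   = as.Date(c("1975-06-01", "1978-10-12", "1994-07-04",
                     "1995-08-20", "1996-01-15")),
  Region = c("New England", "New England", "West South Central",
             "West South Central", "Pacific")
)

reports_by_decade <- toy %>%
  filter(!is.na(date)) %>%                          # undated reports cannot be placed in time
  mutate(year   = as.integer(format(date, "%Y")),   # extract the calendar year
         decade = (year %/% 10) * 10) %>%           # e.g. 1975 becomes 1970
  count(Region, decade, name = "reports")           # one row per region and decade
```

A table in this shape could then be passed to ggplot2's geom_line() with decade on the x-axis and one line per region.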
To explore the question of moon phase correspondence, I can start with a basic scatterplot. I would like to use string searching to detect which reports are visual vs. auditory and categorize the reports of vocalizations separately from the visual sightings, and use color to distinguish these two categories on the moon phase scatterplot.
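The string-searching step could be sketched roughly as follows. The example text is made up, the keyword lists are illustrative assumptions that would need tuning against real reports, and it assumes the narrative text lives in the observed column.

```r
library(dplyr)
library(stringr)

# Made-up report text; the real narratives are much longer.
toy <- tibble(
  observed = c(
    "I saw a tall, hairy figure cross the road.",
    "We heard loud howls and wood knocks after midnight.",
    "Found large tracks along the creek bed."
  ),
  moon_phase = c(0.12, 0.50, 0.93)
)

# Keyword patterns are assumptions for illustration, matched as whole words.
visual_pattern   <- regex("\\b(saw|seen|watched|looked)\\b", ignore_case = TRUE)
auditory_pattern <- regex("\\b(heard|howls?|screams?|knocks?|vocalizations?)\\b",
                          ignore_case = TRUE)

toy <- toy %>%
  mutate(report_type = case_when(
    str_detect(observed, visual_pattern)   ~ "visual",    # checked first
    str_detect(observed, auditory_pattern) ~ "auditory",
    TRUE                                   ~ "other"      # e.g. footprint-only reports
  ))
```

Because case_when() uses the first matching condition, a report that mentions both seeing and hearing something is labeled visual here; a separate "both" category may be worth adding if such reports turn out to be common.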
To explore the question of Class A & B reports and relative visibility, I can again start with a basic scatterplot where Class A & B reports are distinguished by color. This should give a quick sense of any trends by level of visibility for both classifications. I would also like to dig into the weather details in the “Summary” variable, such as searching for words like “fog” and “foggy” versus “clear,” and see which classes of reports show up most for reports with these weather descriptions. This will also likely be taken from the subset of visual sightings instead of from all reports.
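A rough sketch of the weather-keyword comparison is shown below on made-up rows (the summary strings are invented for illustration). The two patterns are assumptions and would likely need extending once the real wording in the summary column is inspected.

```r
library(dplyr)
library(stringr)

# Made-up classifications and weather summaries.
toy <- tibble(
  classification = c("Class A", "Class B", "Class B", "Class A", "Class B"),
  summary = c(
    "Clear throughout the day.",
    "Foggy in the morning.",
    "Fog overnight.",
    "Partly cloudy until evening.",
    "Clear throughout the day."
  )
)

weather_by_class <- toy %>%
  mutate(weather = case_when(
    str_detect(summary, regex("\\bfog(gy)?\\b", ignore_case = TRUE)) ~ "fog",
    str_detect(summary, regex("\\bclear\\b", ignore_case = TRUE))    ~ "clear",
    TRUE                                                             ~ "other"
  )) %>%
  count(classification, weather)   # reports per class and weather keyword
```

Comparing proportions within each class, rather than raw counts, would keep the larger class from dominating the comparison.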
Creating a Region Column
# Creating separate vectors for each region
Pacific <- c("California", "Oregon", "Washington", "Alaska")
Mountain <- c("Nevada", "Arizona", "New Mexico", "Colorado", "Utah", "Idaho", "Montana", "Wyoming")
West_North_Central <- c("Minnesota", "North Dakota", "South Dakota", "Iowa", "Nebraska", "Kansas", "Missouri")
West_South_Central <- c("Texas", "Oklahoma", "Louisiana", "Arkansas")
East_North_Central <- c("Ohio", "Wisconsin", "Michigan", "Illinois", "Indiana")
East_South_Central <- c("Alabama", "Kentucky", "Tennessee", "Mississippi")
South_Atlantic <- c("Florida", "Georgia", "South Carolina", "North Carolina", "Virginia", "West Virginia", "Maryland", "Delaware")
Mid_Atlantic <- c("Pennsylvania", "New York", "New Jersey")
New_England <- c("Connecticut", "Rhode Island", "Massachusetts", "Vermont", "New Hampshire", "Maine")
# Mutating new column using the vectors above
bfro <- bfro %>%
  mutate(
    Region = case_when(state %in% Pacific ~ "Pacific",
                       state %in% Mountain ~ "Mountain",
                       state %in% West_North_Central ~ "West North Central",
                       state %in% West_South_Central ~ "West South Central",
                       state %in% East_North_Central ~ "East North Central",
                       state %in% East_South_Central ~ "East South Central",
                       state %in% South_Atlantic ~ "South Atlantic",
                       state %in% Mid_Atlantic ~ "Mid-Atlantic",
                       state %in% New_England ~ "New England")
  )
Let’s do a quick sanity check on that new column:
# Select the state and Region columns from the dataset and look at a sample with head()
bfro %>%
  select(state, Region) %>%
  head(15)
Okay, good - this sample of the first 15 values of the state & Region columns indicates that the states have been categorized into their correct regions.
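Since head() only shows the first few rows, a fuller check could list every distinct state and Region pair and flag any state that matched no region vector (case_when() returns NA when no condition matches). The sketch below uses a made-up tibble and only two of the region vectors; the unmatched state name is included deliberately for illustration and says nothing about the real data.

```r
library(dplyr)

# Two of the region vectors from above, repeated so the example is self-contained.
Pacific     <- c("California", "Oregon", "Washington", "Alaska")
New_England <- c("Connecticut", "Rhode Island", "Massachusetts",
                 "Vermont", "New Hampshire", "Maine")

# Made-up states; "Hawaii" is deliberately absent from both vectors.
toy <- tibble(state = c("Oregon", "Maine", "Oregon", "Hawaii")) %>%
  mutate(Region = case_when(state %in% Pacific     ~ "Pacific",
                            state %in% New_England ~ "New England"))

# One row per state, with its assigned region
state_region <- toy %>%
  distinct(state, Region) %>%
  arrange(Region, state)

# States that fell through every condition and received NA
unmatched <- toy %>%
  filter(is.na(Region)) %>%
  distinct(state)
```

An empty unmatched table on the real data would confirm that every state in the data set appears in exactly one of the nine vectors or none needs attention.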