library(tidyverse)
library(summarytools)
library(lubridate)
library(ggplot2)
library(dplyr)
library(hrbrthemes)
library(scales)
library(sf)
library(plotly)
library(timeDate)
library(forcats)
library(hrbrthemes)
library(stringr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Final Project Assignment#2: Sue-Ellen Duffy
Part 1. Introduction
- Data Introduction
The dataset was retrieved from the Massachusetts Department of Transportation data extraction service. I requested publicly available crash data for Boston municipality for the specific years of 2012-2022 and received a CSV file.
This dataset includes 45,980 individual crashes in Boston from 2012-2022 which are represented by each row in the dataset. The dataset includes crash information including date, time, weather and lighting conditions and the severity of each crash. In regards to location, the dataset includes the travel direction of the vehicles, proximity to certain landmarks such as an exit or roadway intersection and the coordinates (latitude and longitude) of the crash point. There are insert number of different data points per crash with some NAs or Unreported or Missing information.
Post-pandemic car ownership and commuting by car has increased in an astonishing way - not to help the matter, train commutes are slower than ever. This dataset cannot delve deep enough to understand the scope of this transit issue and traffic data is not publicly available. This dataset will at least allow me to analyze city crashes that may yield some understanding about the implications of an increase in car commuting to the city. Insert some data from Inrix Potentially mention Vision Zero Boston
Questions
Are crashes in Boston increasing overtime? What type if any are increasing over time?
Are there any correlations to time of day, date, or weather conditions that implicate a higher severity of crash and or a higher prevalence/count of crash.
How did the pandemic affect Boston crashes?
Part 2. Describe the data set
<- read_csv("Sue-EllenDuffy_FinalProjectData/Crash_Details_2012-2022.csv",
mydataog skip = 2)
<- mydataog mydata
$Crash_Date <- as.Date(mydata$Crash_Date, "%d-%b-%Y")
mydata$Weekday <-wday(mydata$Crash_Date, label = TRUE, abbr = FALSE)
mydata$Month <- month(mydata$Crash_Date, label = TRUE, abbr = TRUE)
mydata$Year <- year(mydata$Crash_Date) mydata
<-mydata %>%
mydatamutate(Crash_Severity = recode(Crash_Severity, `Property damage only (none injured)` = "Property damage",
`Unknown` = "Unknown/Not Reported", `Not Reported` = "Unknown/Not Reported"))
<-mydata%>%
mydatamutate(Crash_Hour=hour(Crash_Time))
<- mydata %>%
mydata mutate(Crash_Timegroup = case_when(Crash_Hour>=6 & Crash_Hour<=9 ~ "Morning", Crash_Hour>=10 & Crash_Hour<= 13 ~ "Midday", Crash_Hour>=14 & Crash_Hour<=17 ~ "Afternoon", Crash_Hour>=18 & Crash_Hour<=21 ~ "Evening", Crash_Hour>= 22 & Crash_Hour >=23 ~ "Late Night", Crash_Hour>= 1 & Crash_Hour >=0 ~ "Late Night", Crash_Hour== "0" ~ "Late Night",Crash_Hour>=2 & Crash_Hour<=5 ~ "Overnight"))
<- mydata %>%
mydatagroup_by(Crash_Date) %>%
mutate(Crash_Count = n())
<-mydata%>%
mydatamutate(Crash_Countgroup =
case_when(Crash_Count>=1 & Crash_Count<= 5 ~ "1-5",
>=6 & Crash_Count<=10 ~ "6-10",
Crash_Count>=11 & Crash_Count<=15 ~ "11-15",
Crash_Count>= 16 & Crash_Count<=20 ~ "16-20",
Crash_Count>=21 & Crash_Count<=30 ~ "21-30",
Crash_Count>=31 & Crash_Count<=50 ~ "31-37")) Crash_Count
dim(mydata)
[1] 45980 32
head(mydata)
Columns/Data Description - from original data
Crash_Date - Date occurrence of crash (year, month, and day)
Crash_Time - Time occurrence of crash (hour, min, and sec)
Crash_Severity - Indicates the severity of a crash based on the most severe injury to any person based on 3 levels - Fatal injury,
Non-fatal injury, Property damage only (no injury) - and either Unknown or Not Reported
Maximum_Injury_Severity_Reported - Reported injury if both fatal and non-fatal will be categorized “Fatal injury” as it is the most severe reported injury
Number_of_Vehicles - Number of vehicles involved in crash occurrence
Total_Nonfatal_Injuries -
Total_Fatal_Injuries -
Weather_Condition - The prevailing and secondary (if applicable) atmospheric conditions at the time of crash
Non_Motorist_Type - The type of non-motorist
Columns/Data Description - added for analysis
Crash_Hour - Time occurrence of crash (hour only)
Crash_Timegroup - Time occurrence of crash (time intervals - fill this in)
Crash_Count - Number of crash occurrences per day
Crash_Countgroup - Number of crash occurrences per day (defined by group intervals - fill this in)
Date and Time Tidying
Sanity Check - Any Crashes reported more than once? No. Duplicate was run and the row counts remained the same. > mydata[!duplicated(mydata$Crash_Number), ] # A tibble: 45,980 × 28 >
Part 3. Visualization
Crash Data by Severity
There were 45,980 reported from 2012 to 2022. Of those crashes the majority are reported as resulting in property damage only with no injuries reported. Ranking second is non-fatal injuries, and finally a small portion are rated as fatal injuries. 4,002 crashes, while reported to MassDot are missing report data on crash severity and are reported as “Unknown” or “Not Reported. For the purpose of this report, I have combined this category into”Unknown/Not Reported”. These “Unknown/Not Reported” crashes make up 8.7% of the crash data but we do not know the crash severity and therefore cannot use this data during certain analysis. In the following data I will note on each graph if I have removed these missing data from analysis.
<- table(mydata$City_Town_Name, mydata$Crash_Severity)
tabseverities tabseverities
Fatal injury Non-fatal injury Property damage Unknown/Not Reported
BOSTON 248 12804 28926 4002
ggplot(mydata, aes(x=Crash_Severity, fill = Crash_Severity)) + geom_histogram(stat = "count")
Crash Data by Date/Time
Overall Data
<- ggplot(mydata, aes(Crash_Date, stat="count")) + geom_bar()
allplotggplotly(allplot)
ggplot(mydata, aes(Crash_Date, y=Crash_Count, group = Crash_Severity)) + geom_point()
Crash Data by Day
Boston Averages 13 crashes per day (from 2012-2022).
#Crash Count per Day
<- mydata %>%
mydatacountsall group_by(City_Town_Name) %>%
summarize("min" = min(Crash_Count, na.rm = TRUE),
"max" = max(Crash_Count, na.rm = TRUE),
"mean" = mean(Crash_Count, na.rm = TRUE),
"median" = median(Crash_Count, na.rm = TRUE),
"standard_deviation" = sd(Crash_Count, na.rm = TRUE)) %>%
arrange(City_Town_Name)
print(mydatacountsall)
# A tibble: 1 × 6
City_Town_Name min max mean median standard_deviation
<chr> <int> <int> <dbl> <dbl> <dbl>
1 BOSTON 1 37 13.2 13 4.68
<- c("1-5", "6-10", "11-15", "16-20", "21-30", "31-37")
position_Countgroup ggplot(mydata, aes(Crash_Countgroup, stat="count")) + geom_bar() + scale_x_discrete(limit =position_Countgroup)
Crash Data by Day of the Week
Overall, Boston sees the most crashes on Friday.
%>%
mydata ggplot(aes(Weekday, fill = Weekday)) +
geom_bar( stat = "count") + theme(axis.text.x = element_text(angle = 90)) + theme(legend.position = "none")
When subsetted for Crash Severity, however Boston sees the most fatal accidents on Wednesday and Saturday.
%>%
mydata filter(Crash_Severity != "Unknown/Not Reported") %>%
ggplot(aes(Weekday, fill = Weekday)) +
geom_bar( stat = "count") + facet_wrap ( ~ Crash_Severity, scales = "free_y") + theme(axis.text.x = element_text(angle = 90)) + theme(legend.position = "none")
Crash Data by Month
Overall, October is the month with the highest Crash Ratings over 2012-2022. Let’s look at how each crash severity rating factors into overall highest crash ratings.
What could we learn about October? Are there significant events that happen in October? Is it all happening around Halloween? Or does weather condition have something to do with the increase in accidents?
#Crash Count per Day by Month
<- mydata %>%
mydatamonthsum group_by(Month) %>%
summarize("min" = min(Crash_Count, na.rm = TRUE),
"max" = max(Crash_Count, na.rm = TRUE),
"mean" = mean(Crash_Count, na.rm = TRUE),
"median" = median(Crash_Count, na.rm = TRUE),
"standard_deviation" = sd(Crash_Count, na.rm = TRUE))%>%
arrange(Month)
%>%
mydatamonthsum print(n=12)
# A tibble: 12 × 6
Month min max mean median standard_deviation
<ord> <int> <int> <dbl> <dbl> <dbl>
1 Jan 1 26 12.7 12 4.56
2 Feb 1 37 12.6 12 4.74
3 Mar 1 27 12.4 12 4.48
4 Apr 1 26 12.2 12 4.17
5 May 1 28 13.2 13 4.50
6 Jun 1 26 13.5 13 4.35
7 Jul 1 32 13.2 13 4.60
8 Aug 3 22 12.5 12 3.64
9 Sep 2 28 13.4 13 4.27
10 Oct 2 34 15.1 14 5.74
11 Nov 1 31 13.9 13 5.40
12 Dec 1 25 13.3 13 4.38
ggplot(mydata, aes(Month, stat="count")) + geom_bar()
We can now see that the month with the highest Fatal injury is reported in July, with dips in spring, winter, and October while Non-fatal injuries and Property damage peak in October. The peak for Unknown/Not Reported crashes is in June and July.
ggplot(mydata, aes(Month, stat="count", fill = Crash_Severity)) + geom_bar() + facet_wrap( ~ Crash_Severity, scales = "free_y") + labs(title = "Crash Severity by Month (with free Y axis)") + theme(legend.position = "none")
Crash Data by Year
Overall we see that 2017 was the year with the most Crashes from 2012-2022. The data shows 2017 as a peak with a sharp drop off in 2020 (due to the pandemic and reduced traffic flow) and a climb back up in 2021 almost matching crash reports from 2019. Of course this data shows that post-pandemic rates of car crashes has gone back up, but it is not as high as crashes in 2017. What happened in 2017 and were these crashes somehow different from crashes in 2016?
#Crash Count per Day by Year
<- mydata %>%
mydatasum group_by(Year) %>%
summarize("min" = min(Crash_Count, na.rm = TRUE),
"max" = max(Crash_Count, na.rm = TRUE),
"mean" = mean(Crash_Count, na.rm = TRUE),
"median" = median(Crash_Count, na.rm = TRUE),
"standard_deviation" = sd(Crash_Count, na.rm = TRUE)) %>%
arrange(Year)
%>%
mydatasum print(n=11)
# A tibble: 11 × 6
Year min max mean median standard_deviation
<dbl> <int> <int> <dbl> <dbl> <dbl>
1 2012 1 26 12.5 12 4.41
2 2013 1 31 12.4 12 4.61
3 2014 2 29 12.8 12 4.78
4 2015 1 32 13.2 13 4.59
5 2016 1 34 13.2 13 4.44
6 2017 6 37 15.8 15 4.72
7 2018 1 28 14.6 14 4.44
8 2019 4 24 13.3 13 3.92
9 2020 1 29 10.6 10 4.45
10 2021 1 28 13.1 13 4.70
11 2022 2 26 12.0 12 4.37
ggplot(mydata, aes(year(Crash_Date))) + geom_histogram(binwidth = .50) + scale_x_continuous(breaks=pretty_breaks())
When yearly crash data is subset to look at Crash Severity we see different trends across the years. Fatal injury peaked in 2021 with 2016 close behind. Non-fatal injury peaked in 2017. Unknown/Not Reported crashes peaked in 2012 and in 2017. I wonder if in 2012 there was not an established standard to filling out reports and if in 2017 there were just more reports than could be fully processed.
ggplot(mydata, aes(Year, stat="count", fill = Crash_Severity)) + geom_bar() + scale_x_continuous(breaks=pretty_breaks()) +facet_wrap( ~ Crash_Severity, scales = "free_y" ) + labs(title = "Crash Severity by Year (with free Y axis)") + theme(legend.position = "none")
Crash Data by Time of Day
<- c("Morning", "Midday", "Afternoon", "Evening", "Late Night")
positions
ggplot(mydata, aes(Crash_Time)) + geom_histogram()
ggplot(mydata, aes(Crash_Time, stat="count", fill = Crash_Severity)) + geom_bar()+ theme(legend.position = "bottom")
ggplot(mydata, aes(Crash_Time, stat="count", fill = Crash_Severity)) + geom_bar()+ theme(legend.position = "bottom") + facet_wrap ( ~ Crash_Severity, scales = "free_y")
ggplot(mydata, aes(Crash_Timegroup, stat="count", fill = Crash_Severity)) + geom_bar() + scale_x_discrete(limits = positions)
ggplot(mydata, aes(Crash_Timegroup, stat="count", fill = Crash_Severity)) + geom_bar() + facet_wrap ( ~ Crash_Severity, scales = "free_y") + scale_x_discrete(limits = positions) + theme(legend.position = "none")
Crash Data by Manner of Collision
To no one’s surprise, we see that Rear-ends are the manner of collision that occurs most often in Boston. Following in second, third, and fourth respectively are: Single vehicle crashes,Angle crashes, and Sideswipe, same direction crashes.
<- c("Rear-end", "Single vehicle crash", "Angle", "Sideswipe, same direction", "Not reported", "Head-on", "Sideswipe, opposite direction", "Unknown", "Front to Rear", "Front to Front", "Rear-to-rear", "Rear to Side", "NA")
positions2
%>%
mydata ggplot(., aes(x= Manner_of_Collision, stat = "count", fill = Manner_of_Collision)) +
geom_bar() + labs(title = "Manner of Collision from 2012-2022") + coord_flip() + theme(legend.position = "none") + scale_x_discrete(limits = positions2)
%>%
mydata ggplot(aes(x= Manner_of_Collision, stat = "count", fill = Manner_of_Collision)) +
geom_bar() + labs(title = "Manner of Collision from 2012-2022, with free-x axis") + coord_flip() + theme(legend.position = "none") + facet_wrap ( ~ Crash_Severity, scales = "free_x") + scale_x_discrete(limits = positions2)
%>%
mydata filter(Crash_Severity == "Fatal injury") %>%
ggplot(aes(x= Manner_of_Collision, stat = "count", fill = Manner_of_Collision)) +
geom_bar() + labs(title = "Manner of Collision for Fatal Crashes from 2012-2022") + coord_flip() + theme(legend.position = "none")+ scale_x_discrete(limits = positions2)
Part 4. Handling Unknown, Not Reported, NA or Missing data
Unknown and Not Reported incidents
All 46 reporting inconsistencies occurred between May 2018 and October 2019:
Crash Severity “Non-fatal injury” has 7 reports of “no injury” and just under 300 counts of “possible” or “suspected” injuries.
Crash Severity “Not Reported” has 42 reports of maximum injury reports across 6 categories that suggest a possible injury.
Crash Severity “Property damage” has 11 reports of a maximum injury reported as being “not reported”.
Crash Severity “Unknown” has 7 reports of non-fatal injury as either “Non-incapacitating” or “Possible”.
$Crash_Date <- as.Date(mydataog$Crash_Date, "%d-%b-%Y")
mydataog$Year <- year(mydataog$Crash_Date)
mydataog
<- mydataog %>%
mydataog mutate(Crash_Severity = recode(Crash_Severity, `Property damage only (none injured)` = "Property damage"))
<- mydataog %>%
mydatainconsistencies filter(between(Crash_Date, as.Date('2018-05-01'), as.Date('2019-10-31')))
<- table(mydatainconsistencies$Maximum_Injury_Severity_Reported, mydatainconsistencies$Crash_Severity) # Table for 2018-05-01:2019-10-31
tabinconsistencies tabinconsistencies
Fatal injury Non-fatal injury
Deceased not caused by crash 0 3
Fatal injury (K) 30 0
No Apparent Injury (O) 0 0
No injury 0 1
Non-fatal injury - Incapacitating 0 142
Non-fatal injury - Non-incapacitating 0 686
Non-fatal injury - Possible 0 455
Not Applicable 0 0
Not reported 0 0
Possible Injury (C) 0 238
Suspected Minor Injury (B) 0 273
Suspected Serious Injury (A) 0 18
Unknown 0 0
Not Reported Property damage Unknown
Deceased not caused by crash 1 0 0
Fatal injury (K) 0 0 0
No Apparent Injury (O) 20 1786 1
No injury 31 2766 3
Non-fatal injury - Incapacitating 3 0 0
Non-fatal injury - Non-incapacitating 14 0 3
Non-fatal injury - Possible 17 0 4
Not Applicable 56 0 2
Not reported 326 0 101
Possible Injury (C) 3 0 0
Suspected Minor Injury (B) 3 0 0
Suspected Serious Injury (A) 2 0 0
Unknown 0 0 21
ggplot(mydataog, aes(x=Year, fill = Crash_Severity)) + geom_bar() + facet_wrap( ~ Crash_Severity, scales = "free_y") + theme(legend.position = "none") + scale_x_continuous(breaks=pretty_breaks())
ggplot(mydata, aes(x=Year, fill = Crash_Severity)) + geom_bar() + facet_wrap( ~ Crash_Severity, scales = "free_y") + theme(legend.position = "none") + scale_x_continuous(breaks=pretty_breaks())
%>%
mydata filter(Crash_Severity == "Unknown/Not Reported") %>%
filter(between(Crash_Date, as.Date('2018-05-01'), as.Date('2019-10-31'))) %>%
ggplot(., aes(x=Maximum_Injury_Severity_Reported, stat="count", fill = Maximum_Injury_Severity_Reported)) + geom_bar() + coord_flip() + facet_wrap (~ Crash_Severity, scales = "free_x") + theme(legend.position = "none") +labs(title = str_wrap("Inconsistencies in Reported Crash Severity Between May 1st, 2018 and October 31st, 2019", width = 50))
Missing data/NAs and outliers? And why do you choose this way to deal with NAs?
*should I fix the 20 data that report unknown or not reported for crash severity and then list either “Non-fatal injury - Incapacitating” or “Non-fatal injury - Non-incapacitating” in Maximum_Injury_Severity_Reported?*
sapply(mydata, function(x) sum(is.na(x)))
Crash_Number
0
City_Town_Name
0
Crash_Date
0
Crash_Time
0
Crash_Severity
0
Maximum_Injury_Severity_Reported
0
Number_of_Vehicles
0
Total_Nonfatal_Injuries
0
Total_Fatal_Injuries
0
Manner_of_Collision
5
Vehicle_Action_Prior_to_Crash
8
Vehicle_Travel_Directions
4
Most_Harmful_Events
2639
Vehicle_Configuration
213
Road_Surface_Condition
368
Ambient_Light
1
Weather_Condition
0
At_Roadway_Intersection
33040
Distance_From_Nearest_Roadway_Intersection
309
Distance_From_Nearest_Milemarker
43298
Distance_From_Nearest_Exit
36521
Distance_From_Nearest_Landmark
35704
Non_Motorist_Type
43462
X_Cooordinate
1934
Y_Cooordinate
1934
Weekday
0
Month
0
Year
0
Crash_Hour
0
Crash_Timegroup
0
Crash_Count
0
Crash_Countgroup
0