Final Project: Sue-Ellen Duffy

final_project
Sue-Ellen Duffy
Boston Crash Trends 2012-2022
Author

Sue-Ellen Duffy

Published

May 20, 2023

Code
library(tidyverse)
library(summarytools)
library(lubridate)
library(ggplot2)
library(dplyr)
library(hrbrthemes)
library(scales)
library(sf)
library(treemap)
library(treemapify)
library(timeDate)
library(forcats)
library(hrbrthemes)
library(stringr)
library(knitr)
library(kableExtra)
library(GGally)
library(patchwork)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Introduction

The dataset was retrieved from the Massachusetts Department of Transportation data extraction service. I requested publicly available crash data for Boston municipality for the specific years of 2012-2022 and received a CSV file. Car crashes are reported to the Registry of Motor Vehicle and from there the Department of Transportation makes data available to the public.

This dataset includes 45,980 individual crashes in Boston from 2012-2022 which are represented by each row in the dataset. The dataset includes crash information including date, time, weather and lighting conditions and the severity of each crash. In regards to location, the dataset includes the travel direction of the vehicles, proximity to certain landmarks such as an exit or roadway intersection and the coordinates (latitude and longitude) of the crash point.

Post-pandemic car ownership and commuting by car has increased in an astonishing way - not to help the matter, train commutes are slower than ever. This dataset cannot delve deep enough to understand the scope of this transit issue and traffic data is not publicly available. This dataset will at least allow me to analyze city crashes that may yield some understanding about the implications of an increase in car commuting to the city. In a 2022 report by Inrix, Boston ranked 4th on the Global Traffic Scorecard, but not in the good way.

Questions

  1. Are crashes in Boston increasing overtime? What type if any are increasing over time?

  2. Are there any correlations to time of day, date, or road surface conditions that implicate a higher severity of crash and or more crashes overall?

  3. How did the pandemic affect Boston crashes?

Data Description

  • Crash Date - Date occurrence of crash (year, month, and day)

  • Crash Time - Time occurrence of crash (hour, min, and sec)

  • Crash Severity - Indicates the severity of a crash based on the most severe injury to any person based on 3 levels - Fatal injury, Non-fatal injury, Property damage only (no injury) - and either Unknown or Not Reported

  • Maximum Injury Severity Reported - Reported injury if both fatal and non-fatal will be categorized “Fatal injury” as it is the most severe reported injury

  • Non_Motorist_Type - The type of non-motorist

  • Crash Hour - [added for analysis] Time occurrence of crash (hour only)

  • Crash Timegroup - [added for analysis] Time occurrence of crash (defined by time intervals: (Overnight = 2AM-5:59AM; Morning = 6AM-9:59AM; Midday = 10AM-1:59PM; Afternoon = 2PM-5:59PM; Evening = 6PM-9:59PM; Late Night = 10PM-1:59AM)

  • Crash Count - [added for analysis] Number of crash occurrences per day

  • Crash Countgroup - [added for analysis] Number of crash occurrences per day (defined by group intervals: (1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40)

Data Cleaning:

Code
mydataog <- read_csv("Sue-EllenDuffy_FinalProjectData/Crash_Details_2012-2022.csv", 
    skip = 2)
mydata <- mydataog
Code
mydataog$Crash_Date <- as.Date(mydataog$Crash_Date, "%d-%b-%Y")
mydataog$Year <- year(mydataog$Crash_Date)
mydata$Crash_Date <- as.Date(mydata$Crash_Date, "%d-%b-%Y")
mydata$Weekday <-wday(mydata$Crash_Date, label = TRUE, abbr = FALSE)
mydata$Month <- month(mydata$Crash_Date, label = TRUE, abbr = TRUE)
mydata$Year <- year(mydata$Crash_Date)
mydata$Day <- day(mydata$Crash_Date)
Code
mydata<-mydata %>%
  mutate(Crash_Severity = recode(Crash_Severity, `Property damage only (none injured)` = "Property damage",
 `Unknown` = "Unknown/Not Reported", `Not Reported` = "Unknown/Not Reported"))
Code
mydata<-mydata%>%
  mutate(Crash_Hour=hour(Crash_Time))
mydata <- mydata %>%
    mutate(Crash_Timegroup = 
             case_when(Crash_Hour>=2 & Crash_Hour<=5 ~ "Overnight", 
                       Crash_Hour>=6 & Crash_Hour<=9 ~ "Morning", 
                       Crash_Hour>=10 & Crash_Hour<= 13 ~ "Midday", 
                       Crash_Hour>=14 & Crash_Hour<=17 ~ "Afternoon", 
                       Crash_Hour>=18 & Crash_Hour<=21 ~ "Evening", 
                       Crash_Hour>= 22 & Crash_Hour >=23 ~ "Late Night", Crash_Hour>= 1 & Crash_Hour >=0 ~ "Late Night", Crash_Hour== "0" ~ "Late Night"))
Code
mydata<- mydata %>%
   group_by(Crash_Date) %>%
    mutate(Crash_Count = n())
Code
mydata<-mydata%>%
    mutate(Crash_Countgroup = 
             case_when(Crash_Count>=1 & Crash_Count<= 5 ~ "1-5", 
                       Crash_Count>=6 & Crash_Count<=10 ~ "6-10", 
                       Crash_Count>=11 & Crash_Count<=15 ~ "11-15",  
                       Crash_Count>= 16 & Crash_Count<=20 ~ "16-20",
                       Crash_Count>=21 & Crash_Count<=25 ~ "21-25",
                       Crash_Count>= 26 & Crash_Count<=30 ~ "26-30",
                       Crash_Count>=31 & Crash_Count<=35 ~ "31-35",
                       Crash_Count>=36 & Crash_Count<=40 ~ "36-40"))
Code
positionroad <- c("Wet", "Snow", "Water standing, moving)", "Slush", "Sand, mud, dirt, oil, gravel", "NA", "Unknown", "Not reported", "Ice", "Dry")
positions <- c("Overnight", "Morning", "Midday", "Afternoon", "Evening", "Late Night")
positions2 <- c("Rear-end", "Single vehicle crash", "Angle", "Sideswipe, same direction", "Not reported", "Head-on", "Sideswipe, opposite direction", "Unknown", "Front to Rear", "Front to Front", "Rear-to-rear", "Rear to Side", "NA")
position_Countgroup <- c("1-5", "6-10", "11-15", "16-20", "21-25", "26-30", "31-35", "36-40")

Handling Unknown, Not Reported, NA or Missing data

Unknown/Not Reported Data: There were 45,980 crashes reported from 2012 to 2022. Of those crashes 4,002 (8.7%) were reported with a Crash Severity of “Unknown” or “Not Reported” the majority are reported as resulting in property damage only with no injuries reported. Ranking second is non-fatal injuries, and finally a small portion are rated as fatal injuries.

  • For the purpose of this report, and for the category of “Crash Severity” only, I have combined these into “Unknown/Not Reported”.

Why are some crashes reported as Unknown or Not Reported? Possibilities include: + internal clerical errors

  • reports must be completed within 5 days of crash (leading to rushed answers)

  • decision fatigue from having to fill out the report alone post accident

  • data missing due to personal choice in managing insurance claims

Unknown/Not Reported Over Time: The category “Unknown” was used more so in later years, with a peak in 2021. Whereas “Not Reported” had peaks in 2012 and 2017 and very low numbers in 2020-2022.

Code
pUnknown <- mydataog %>%
  filter(Crash_Severity == "Unknown") %>%
ggplot(., aes(x=Year)) + geom_bar() + facet_wrap( ~ Crash_Severity, ) + theme(legend.position = "none") + scale_x_continuous(breaks=pretty_breaks()) +
     coord_cartesian(ylim = c(0, 700))
pNotReported <- mydataog %>%
  filter(Crash_Severity == "Not Reported") %>%
ggplot(., aes(x=Year)) + geom_bar() + facet_wrap( ~ Crash_Severity, ) + theme(legend.position = "none") + scale_x_continuous(breaks=pretty_breaks())

ppUN <- (pUnknown + pNotReported)
ppUN + plot_annotation(title = "Reports of 'Unknown' and 'Not Reported' for category Crash Severity")

Unknown/Not Reported Data Inconsistencies: Crashes that were reported under Crash Severity as being “Unknown” or “Not Reported” had 50 inconsistently reported injuries in another column of the data. All 50 reporting inconsistencies occurred between May 2018 and October 2019:

  • Crash Severity “Not Reported” has 42 reports of maximum injury reports across 6 categories that suggest a possible injury.

  • Crash Severity “Unknown” has 7 reports of non-fatal injury as either “Non-incapacitating” or “Possible” and 1 report of “No apparent injury”.

Code
mydataog <- mydataog %>%
mutate(Crash_Severity = recode(Crash_Severity, `Property damage only (none injured)` = "Property damage"))
mydatainconsistencies <- mydataog %>% 
  filter(between(Crash_Date, as.Date('2018-05-01'), as.Date('2019-10-31'))) 

Missing data/NAs and outliers:

Road Surface Condition has an NA count of n=368. For the purpose of this report I am considering this sort of NA to be comparable with “Not Reported” or “Unknown” and leave the data in the dataset as I am only looking at a few indicators of Road Surface Condition and not analyzing it in depth.

Manner of Collision also has an NA count of n=5. I will leave that data alone as well.

Overview of Crash Data in Boston 2012-2022

Code
ggplot(mydata, aes(Crash_Date, stat="count")) + geom_bar() + labs(x="Crash Date", y="Crashes per Day", title = "Crash Data from 2012-2022, Boston")

Boston averaged 13 crashes per day (from 2012-2022).

Code
#Crash Count per Day
mydatacountsall <- mydata %>% 
  group_by(City_Town_Name) %>% 
  summarize("min" = min(Crash_Count, na.rm = TRUE),
            "max" = max(Crash_Count, na.rm = TRUE),
            "mean" = mean(Crash_Count, na.rm = TRUE), 
            "median" = median(Crash_Count, na.rm = TRUE),
            "standard deviation" = sd(Crash_Count, na.rm = TRUE)) %>%
  arrange(City_Town_Name)
mydatacountsall%>%
  kbl() %>%
  kable_classic()
City_Town_Name min max mean median standard deviation
BOSTON 1 37 13.21274 13 4.682742

By viewing daily crash counts in intervals of 5 we can see that on most days there are 11-15 crashes in Boston, with some days having 6-10 or 16-20 crashes.

Very few days have 5 or less crashes.

Code
ggplot(mydata, aes(Crash_Countgroup, stat="count")) + geom_bar() + scale_x_discrete(limit =position_Countgroup) + labs(title = "Crashes by Daily Crash Count Groups", x = "Count Groups")

The following graph shows the density of crashes with crash counts per day. As opposed to the bargraph charting all crashes by total counts in the Overview, this graph lets us see outliers not only in the top portion of the graph with the most counts, but also in the lowest counts. We can now see the pockets of time when there were, for instance, more than 6 crashes per day everyday, like in 2017, or the time when in 2020 there were many instances of having only a 1-3 crashes per day.

Code
ggplot(mydata, aes(Crash_Date, y=Crash_Count, group = Crash_Severity)) + geom_point() + geom_jitter() + labs(title = "All Crashes by Date", y = "Crash counts per day", x = "Date")

Boston saw the most crashes on Fridays (2012-2022).

Code
mydata %>%
ggplot(aes(Weekday, fill = Weekday)) +
  geom_bar( stat = "count") + theme(axis.text.x = element_text(angle = 90)) + theme(legend.position = "none") + labs(title = "Days of the Week & Crash Count", x = "")

When subsetted for Crash Severity, however Boston sees the most fatal accidents on Wednesday and Saturday. (The following graph shows Crash Severity: Fatal injury, Non-fatal injury, and Property damage. It does not include Unknown/Not Reported)

Code
mydata %>%
  filter(Crash_Severity != "Unknown/Not Reported") %>%
ggplot(aes(Weekday, fill = Weekday)) +
  geom_bar( stat = "count") + facet_wrap ( ~ Crash_Severity, scales = "free_y") + theme(axis.text.x = element_text(angle = 90)) + theme(legend.position = "none") + labs(title = "Days of the Week & Crash Severity", subtitle ="with free Y-axis", x = "") 

October is the month with the most crashes and the highest average crash rate in Boston (2012-2022), averaging 15 crashes per day.

Buckle up, this tab has a longer analysis section.

Code
#Crash Count per Day by Month
mydatamonthsum <- mydata %>% 
  group_by(Month) %>% 
  summarize("min" = min(Crash_Count, na.rm = TRUE),
            "max" = max(Crash_Count, na.rm = TRUE),
            "mean" = mean(Crash_Count, na.rm = TRUE), 
            "median" = median(Crash_Count, na.rm = TRUE),
            "standard_deviation" = sd(Crash_Count, na.rm = TRUE))%>% 
  arrange(Month)
mydatamonthsum %>%
  kbl(caption = "Crash Statistic by Month") %>%
  kable_classic() %>%
  row_spec(10, bold = T, color = "black", background = "orange") 
Crash Statistic by Month
Month min max mean median standard_deviation
Jan 1 26 12.72292 12 4.561773
Feb 1 37 12.60901 12 4.740875
Mar 1 27 12.38372 12 4.475690
Apr 1 26 12.23283 12 4.174287
May 1 28 13.22847 13 4.504422
Jun 1 26 13.50594 13 4.349584
Jul 1 32 13.16479 13 4.600828
Aug 3 22 12.46228 12 3.640509
Sep 2 28 13.42582 13 4.272076
Oct 2 34 15.13079 14 5.736485
Nov 1 31 13.87772 13 5.403354
Dec 1 25 13.28206 13 4.378017
Code
ggplot(mydata, aes(Month, stat="count")) + geom_bar() + labs(title = "All Crashes by Month")

What could we learn about October?

+ Do road conditions have something to do with the increase in accidents?

+ Are there different types of accidents in October? 

+ Is it all happening around Halloween? 

Road Surface Conditions

When we subset the data for October by road surface condition we see that “dry” is the predominant condition, however we see “Wet” with a large portion of data as well.

Code
wet1 <- mydata %>%
  filter(Month == "Oct") %>%
ggplot(., aes(Road_Surface_Condition, stat = "count", fill = (Road_Surface_Condition == "Wet"))) + geom_bar() + coord_flip() + labs(x = "Road Surface", y = "", title = str_wrap("October Crashes by Road Surface Condition", width = 25)) + scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+ theme(legend.position = "none") + scale_fill_manual(values=c( "#999999", "#0000FF"))
wet2 <- mydata %>%
  filter(Road_Surface_Condition == "Wet") %>%
  ggplot(., aes(Road_Surface_Condition, stat = "count", fill = Month)) + geom_bar() + facet_wrap( ~ Month,) + labs(x = "", title = str_wrap("All Crashes with Road Surface Condition 'Wet' by Month", width = 25)) + theme(legend.position = "none") + scale_fill_manual(values= c("#B7DFF6", "#B7DFF6", "#B7DFF6", "#B7DFF6", "#B7DFF6", "#B7DFF6", "#B7DFF6", "#B7DFF6", "#B7DFF6", "#0000FF", "#B7DFF6", "#B7DFF6"))

wet1 + wet2

While October has the most crashes with “Wet” road surface conditions reported, October also has the most crashes overall. In order to understand whether this road surface condition “Wet” was a factor in making October the month with the highest crash rating, I looked at the percentages of crashes in each month that occurred under “Wet” road surface conditions. October and December both have 25% of crashes (per their respective months) with “Wet” road surface conditions reported.

Code
mydata %>%
  ggplot(., aes(Month, stat = "count", fill = (Road_Surface_Condition == "Wet"))) + geom_bar(position = "fill") + labs(title = "Crashes with Road Surface Condition 'Wet' by Month", y = "percent", x = "Month", fill = "Wet") + scale_fill_manual(values=c("#999999", "#B7DFF6", "#999999"))

Code
road<-mydata%>%
  select(Month, Road_Surface_Condition)%>%
 group_by(Month,Road_Surface_Condition)%>%
  count()

road<-pivot_wider(road,names_from=Road_Surface_Condition,values_from = n)
road <- mutate_all(road, ~replace_na(.,0))

road <- road %>%
  mutate(total = sum(c_across(Dry:Wet))) %>%
  ungroup() %>%
  mutate(across(Dry:"NA", ~ ./ total)) %>%
  mutate(across(Dry:"NA", ~ .*100))

road %>%
  kbl(digits = 1, caption = "Road Conditions of Crashes (2012-2022) by Month") %>%
  kable_classic() %>%
    column_spec(11, bold = T) %>%
  row_spec(c(10, 12), color = "black", background = "#B7DFF6") 
Road Conditions of Crashes (2012-2022) by Month
Month Dry Ice Not reported Other Sand, mud, dirt, oil, gravel Slush Snow Unknown Water (standing, moving) Wet NA total
Jan 68.8 2.9 3.0 0.3 0.7 0.7 6.6 0.5 0.1 16.4 0.0 3739
Feb 66.3 2.8 3.4 0.1 0.5 1.3 8.3 0.2 0.1 17.0 0.0 3416
Mar 77.4 1.3 2.7 0.1 0.2 0.3 2.8 0.5 0.1 14.6 0.1 3622
Apr 79.8 0.1 2.2 0.0 0.0 0.1 0.4 0.4 0.2 16.8 0.1 3433
May 81.5 0.0 2.0 0.1 0.0 0.0 0.0 0.2 0.1 16.1 0.6 3891
Jun 83.2 0.0 2.9 0.1 0.1 0.0 0.1 0.5 0.1 13.2 2.0 3878
Jul 86.6 0.0 2.6 0.1 0.2 0.0 0.0 0.4 0.1 10.1 6.6 3667
Aug 87.2 0.0 2.0 0.1 0.1 0.0 0.0 0.5 0.1 10.1 0.4 3841
Sep 82.1 0.0 2.4 0.0 0.1 0.0 0.0 0.4 0.0 15.0 0.0 3950
Oct 71.4 0.1 2.4 0.0 0.0 0.0 0.1 0.3 0.1 25.5 0.0 4404
Nov 73.2 0.2 2.7 0.1 0.0 0.1 0.4 0.5 0.1 22.9 0.0 3859
Dec 66.1 2.0 2.5 0.1 0.2 0.4 3.0 0.3 0.1 25.5 0.1 3912

Non-Motorist Involvement

October is the month with the most instances of Non-motorist type involvement in crashes with 253 reports.

Code
nm<-mydata%>%
  select(Month, Non_Motorist_Type)%>%
 group_by(Month,Non_Motorist_Type)%>%
  count()

nm<-pivot_wider(nm,names_from=Non_Motorist_Type,values_from = n)
nm <- mutate_all(nm, ~replace_na(.,0))
nm <- nm %>%
  select(., -"NA")
nm <-nm %>%
  mutate(total = sum(c_across("P1: Pedestrian" : "P2: Cyclist / P3: Other / P4: Other"))) %>%
  ungroup() 

nm <- nm %>%  mutate(across("P1: Pedestrian" : "P2: Cyclist / P3: Other / P4: Other" , ~ ./ total)) %>%
  mutate(across("P1: Pedestrian" : "P2: Cyclist / P3: Other / P4: Other", ~ .*100))

#total instances of crashes with Non-Motorists reported by Month
ggplot(nm, aes(x=Month, y = total)) + geom_point() + labs(title = "Crashes Involving Non-Motorists")

However proportional to the total number of crashes per Month, we see that July and September have the highest % of crashes involving non-motorist types.

Code
mydata %>%
  ggplot(., aes(Month, stat = "count", fill = (Non_Motorist_Type != "NA"))) + geom_bar(position = "fill") + labs(title = "Percent Crashes per Month Involving Non-Motorists", y = "percent", fill = "Non-Motorist Involved")

Crash Severity

Let’s look at how each crash severity rating factors into this month analysis.

We can now see that the month with the highest Fatal injury is reported in July, with dips in spring, winter, and October while Non-fatal injuries and Property damage peak in October. The peak for Unknown/Not Reported crashes is in May and June.

Code
ggplot(mydata, aes(Month, stat="count", fill = Crash_Severity)) + geom_bar() + facet_wrap( ~ Crash_Severity, scales = "free_y") + labs(title = "Crash Severity by Month", subtitle = "with free y-axis") + theme(legend.position = "none")

Dates within October

I can’t call Halloween out on this one.

October 4th has the most counts of crashes in the month, which happens to be National Taco Day and World Animal Day.

Code
mydata %>%
  filter(Month == "Oct") %>%
  ggplot(., aes(x =Day, stat = "count", fill = (Day == "31"))) + geom_bar() + scale_fill_manual(values=c("#999999", "orange"))+ labs(title = "Crash Counts by Day in October",  x = "Days in October", fill = "Halloween")

2017 was the year with the most crashes in Boston (2012-2022).

Code
#Crash Count per Day by Year
mydatasum <- mydata %>% 
  group_by(Year) %>% 
  summarize("min" = min(Crash_Count, na.rm = TRUE),
            "max" = max(Crash_Count, na.rm = TRUE),
            "mean" = mean(Crash_Count, na.rm = TRUE), 
            "median" = median(Crash_Count, na.rm = TRUE),
            "standard deviation" = sd(Crash_Count, na.rm = TRUE)) %>% 
  arrange(Year)

mydatasum %>%
  kbl(caption = "Crash Statistic by Year") %>%
  kable_classic() %>%
  row_spec(6, bold = T) 
Crash Statistic by Year
Year min max mean median standard deviation
2012 1 26 12.52212 12 4.407276
2013 1 31 12.44976 12 4.607583
2014 2 29 12.79468 12 4.780628
2015 1 32 13.18238 13 4.592712
2016 1 34 13.21089 13 4.439938
2017 6 37 15.78167 15 4.718541
2018 1 28 14.61683 14 4.435525
2019 4 24 13.34472 13 3.918186
2020 1 29 10.59297 10 4.454502
2021 1 28 13.10707 13 4.704753
2022 2 26 11.97473 12 4.372023

The data shows 2017 as a peak with a sharp drop off in 2020 (due to the pandemic and reduced traffic flow) then a climb back up in 2021 almost matching crash reports from 2019. Of course this data shows that post-pandemic rates of car crashes have gone back up, but it is not as high as crashes in 2017.

Code
ggplot(mydata, aes(year(Crash_Date))) + geom_histogram(binwidth = .50) + scale_x_continuous(breaks=pretty_breaks()) + labs(title = "All Crashes by Year", x = "Year",)

What happened in 2017 and were these crashes somehow different from crashes in other years?

Road Surface Conditions

2017 had the most incidents of crash where “Ice” was the indicated road surface condition. Not the most significant data point considering its low n(79) but interesting none the less.

Code
datayear <- mydata
datayear = table(datayear$Year, datayear$Road_Surface_Condition) 
datayear %>%
  kbl(caption = "Road Conditions of Crashes (2012-2022) by Year" ) %>%
  kable_classic() %>%
    column_spec(3, bold = T) %>%
  row_spec(6, color = "black", background = "#B7DFF6")
Road Conditions of Crashes (2012-2022) by Year
Dry Ice Not reported Other Sand, mud, dirt, oil, gravel Slush Snow Unknown Water (standing, moving) Wet
2012 2814 13 240 3 11 6 28 17 5 817
2013 2912 25 196 4 12 14 90 3 1 694
2014 2878 28 239 3 10 10 71 10 6 768
2015 3188 47 140 3 14 24 133 9 2 660
2016 3451 15 103 4 2 5 69 9 2 655
2017 4133 79 79 5 6 10 99 50 5 813
2018 3737 30 42 3 3 6 61 39 4 849
2019 3221 33 20 1 9 7 88 14 4 663
2020 2460 16 24 4 2 5 37 10 4 568
2021 3295 6 46 1 1 5 47 9 4 733
2022 3006 47 38 3 6 10 66 6 1 576

Number of Vehicles Involved

2017 had the most 2 car accidents and had 1 instance each of a 12 car and 18 car crash. Both of these high number crashes were in fact Ice related.

  • While a crash may involve multiple cars, the counts recorded throughout the data and this analysis are on individual crashes (regardless of how many cars are involved in the crash)
Code
#Year and Number of Vehicles
datayearvehicles <- mydata
datayearvehicles = table(datayearvehicles$Year, datayearvehicles$Number_of_Vehicles) 
datayearvehicles %>%
  kbl(caption = "Number of Vehicles Involved in Crashes (2012-2022) by Year" ) %>%
  kable_classic() %>%
    column_spec(3:4, bold = T) %>%
  row_spec(6, color = "black", background = "#B7DFF6")
Number of Vehicles Involved in Crashes (2012-2022) by Year
0 1 2 3 4 5 6 7 8 9 10 11 12 18
2012 1 1109 2366 372 73 27 4 2 0 1 0 0 0 0
2013 0 1047 2354 422 98 19 8 0 2 0 0 1 0 0
2014 0 1059 2383 463 92 22 3 0 0 1 0 0 0 0
2015 0 981 2637 452 127 20 5 0 0 0 0 0 0 0
2016 0 933 2726 501 125 22 5 2 0 0 1 0 0 0
2017 0 1131 3383 581 140 32 9 1 1 1 0 0 1 1
2018 0 951 3121 526 143 19 9 4 2 1 0 0 0 0
2019 0 825 2914 530 108 30 10 4 0 0 0 0 0 0
2020 0 860 1829 331 77 23 6 3 0 1 0 0 0 0
2021 0 948 2567 484 106 28 10 3 0 0 1 0 0 0
2022 0 855 2367 423 81 26 4 1 0 2 0 0 0 0

Non-Motorists Involved

2017 is the year with the most crashes involving Non-Motorists with 378 crashes.

Code
#Year and Non-motorist Type
nmy<-mydata%>%
  select(Year, Non_Motorist_Type)%>%
 group_by(Year,Non_Motorist_Type)%>%
  count()

nmy<-pivot_wider(nmy,names_from=Non_Motorist_Type,values_from = n)
nmy <- mutate_all(nmy, ~replace_na(.,0))
nmy <- nmy %>%
  select(., -"NA")
nmy <-nmy %>%
  mutate(total = sum(c_across("P1: Cyclist" : "P9: Pedestrian"))) %>%
  ungroup() 
nmy <- nmy %>%  mutate(across("P1: Cyclist" : "P9: Pedestrian" , ~ ./ total)) %>%
  mutate(across("P1: Cyclist" : "P9: Pedestrian", ~ .*100))

ggplot(nmy, aes(x=Year, y = total)) + geom_point() +  scale_x_continuous(breaks=pretty_breaks()) + labs(title = "Crashes Involving Non-Motorist by Year")

However, again, while it has the most instances of Non-motorist involvement, 2017 has just about the same percent of crashes as 2012 and 2013, with around 7% of all crashes in each of those years involving non-motorists

Code
mydata %>%
  ggplot(., aes(Year, stat = "count", fill = (Non_Motorist_Type != "NA"))) + geom_bar(position = "fill") + scale_x_continuous(breaks=pretty_breaks()) + labs(title = "Percent Crashes per Year Involving Non-Motorists", y = "percent", fill = "Non-Motorist Involved")

Most crashes occur in the afternoon, between 3pm and 5pm, in Boston (2012-2022).

Code
ggplot(mydata, aes(Crash_Hour)) + geom_histogram(stat="count") + labs(title = "Crashes by Hour of Crash", x = "Hour of Crash")

While Non-fatal, Property damage, Unknown/Not Reported all follow the same pattern of peaks in the afternoon, crashes resulting in Fatal injury peak at late night.

Time occurrence of crash (defined by time intervals: (Overnight = 2AM-5:59AM; Morning = 6AM-9:59AM; Midday = 10AM-1:59PM; Afternoon = 2PM-5:59PM; Evening = 6PM-9:59PM; Late Night = 10PM-1:59AM)

Code
tgroup <- ggplot(mydata, aes(Crash_Timegroup, stat="count", fill = Crash_Severity)) + geom_bar() + scale_x_discrete(limits = positions) + labs(title = "Crashes by Time Group", x = "Time Group", fill = "Crash Severity") + theme(legend.position = "none")
csgroup <- ggplot(mydata, aes(Crash_Timegroup, stat="count", fill = Crash_Severity)) + geom_bar() + facet_wrap ( ~ Crash_Severity, scales = "free_y")  + scale_x_discrete(limits = positions) + theme(legend.position = "none") + labs(x = "Time Group", subtitle = "with free y-axis", title = str_wrap("Time Group and Crash Severity", width = 40)) + theme(axis.text.x = element_text(angle = 90))
tgroup + csgroup

Instances of 2 car crashes have been on an overall trend upwards in Boston, whereas instances of 1 car crashes are decreasing slightly.

Code
vehicles <-mydata%>%
  select(Year, Number_of_Vehicles)%>%
 group_by(Year,Number_of_Vehicles)%>%
  count()

vehicles<-pivot_wider(vehicles,names_from=Number_of_Vehicles,values_from = n)
vehicles <- mutate_all(vehicles, ~replace_na(.,0))

vehicles2<- vehicles%>%
  pivot_longer(!Year, names_to = "Number_of_Vehicles", values_to = "count")

vehicles2 %>%
  filter(Number_of_Vehicles == 1 | Number_of_Vehicles == 2) %>%
  ggplot(., aes(x=Year, y=`count`, color = Number_of_Vehicles, group=interaction(Number_of_Vehicles))) + 
  geom_point(size=3) +
  geom_line()  +  
  scale_x_continuous(breaks=pretty_breaks()) + 
  labs(title = "Comparisson of 1 Vehicle and 2 Vehicle Crashes from 2012-2022", color = "Number of Vehicles")

To the surprise of no one (who has driven in the city of Boston), we see that Rear-ends are the manner of collision that occurs most often in Boston.

Following in second, third, and fourth respectively are: Single vehicle crashes, Angle crashes, and Sideswipe, same direction crashes.

Code
mydata %>%
  ggplot(., aes(x= Manner_of_Collision, stat = "count", fill = Manner_of_Collision)) + 
  geom_bar()  + coord_flip() + theme(legend.position = "none") + scale_x_discrete(limits = positions2) +  labs(title = "Manner of Collision of All Crashes", x = "Manner of Collision")

When we subset this data for Crash Severity, we see that fatal injuries occur with the most frequency in single vehicle crashes. The second major difference we see for crashes with fatal injuries is a high proportion of “Not Reported” for manner of collision.

Code
mydata %>%
  ggplot(aes(x= Manner_of_Collision, stat = "count", fill = Manner_of_Collision)) + 
  geom_bar() + coord_flip() + theme(legend.position = "none") + facet_wrap ( ~ Crash_Severity, scales = "free_x") + scale_x_discrete(limits = positions2) +  labs(title = "Manner of Collision of All Crashes by Crash Severity", subtitle = "with free x-axis", x = "Manner of Collision")

Code
mydata %>%
  filter(Crash_Severity == "Fatal injury") %>%
  ggplot(aes(x= Manner_of_Collision, stat = "count", fill = Manner_of_Collision)) + 
  geom_bar() + coord_flip() + theme(legend.position = "none") + facet_wrap ( ~ Crash_Severity, scales = "free_x") + scale_x_discrete(limits = positions2) +  labs(title = "Manner of Collision of Fatal Crashes", x = "Manner of Collision")

In regards to Crash Severity, the vast majority of crashes in Boston only involve property damage. The lowest proportion, with 0.05% of all crashes, result in fatal injury.

Code
mydataall <- mydata %>%
  select(Crash_Severity)%>%
  group_by(Crash_Severity) %>%
  count()

ggplot(mydataall, aes(area= n, fill= Crash_Severity, label = paste0(Crash_Severity, "\n", n))) + 
  geom_treemap() + 
  labs(title = "Crashes in Boston by Severity 2012-2022") + 
  scale_fill_discrete(name = "Crash Severity") +
  geom_treemap_text(colour = "black",
                    place = "centre") + theme(legend.position = "bottom") 

Code
tabseverities <- table(mydata$City_Town_Name, mydata$Crash_Severity)
kable(tabseverities)
Fatal injury Non-fatal injury Property damage Unknown/Not Reported
BOSTON 248 12804 28926 4002

Final Conclusion and Takeaways

Overall instances of crashes in Boston are changing over time. Two major trends happened: crashes increased from 2012 to 2017, and crashes drastically decreased in 2020. I did not see what I was expecting to see in regards to crashes post-pandemic. As a resident of Boston, I would have assumed accidents had reached their peak in 2022.

  • Day of the week with the most crashes: Friday

  • Month with the most crashes: October

  • Year with the most crashes: 2017

  • 63% of accidents in the city lead to property damage only

What happened in 2017 I still am unclear about, there are certain trends I can parse out in the data but I am left with more questions. Were gas prices really inexpensive in 2017? Were there policies put in place in 2018 to mitigate factors that led to the high rates of crash in 2017?

Next steps

Coordinates! The coordinate data would be fascinating to study. Do sideswipes happen on tight 2 way sidestreets? Do crashes involving pedestrians happen near train stations? Do most rear-ends happen at traffic lights? (Personal anecdote: I got hit by a truck at a red light because the driver was “looking at his gps, and the company truck has bad brakes”)

Number of Vehicles per Crash This was a striking last minute discovery and I’d love to analyze this more. I did not know how to do analysis in R but off screen (on excel) I was able to do some preliminary calculations showing that one car:two car crashes were changing from 1:2 to 1:3 ratio over time and that’s interesting. Does that shed light onto the changing proximity of cars, meaning that more cars, packed in tighter, might mean more cars inevitably involved in a single accident?

Operator Error (utilizing Person Level crash data) MassDot provides person level data involved in each crash. This would have been far too much for me to join person level and crash level data together in this one assignment. However, I would be curious to see what trends have transpired over time. Are more drivers crashing because they are distracted? Are the ages of drivers involved in accidents going up? (I did hear recently that many younger folks are waiting longer to drive, but I have no data to back that up…yet!)

Takeaways

As my first project in R, I bit off quite a lot. I had intended to analyze the coordinates of crashes, however after realizing I needed to understand a fair bit more about the data (and coordinates), I chose to analyze trends over time. This project was hard to start and just as hard to finish, for different reasons. At first I didn’t know how this dataset could yield anything interesting and I thought I’d be bored. Now, as I wrap up the last few words of this report I feel the urge to make “just one more graph!”

References

  • IMPACT. Data Extract. 2023. https://apps.impact.dot.state.ma.us/cdp/home

  • INRIX. INRIX 2022 Global Score Card. 2023. http://inrix.com/scorecard/

  • October Holidays & National Days. National Today. 2023. https://nationaltoday.com/october-holidays/