Data Analytics and Computational Social Science: Final Paper

Apoorva Hungund

Introduction

Use of automobiles for transport has become a major part of everyday life. Despite the many advancements and safety laws initiated, vehicle crashes are one of the biggest causes of fatalities and injuries worldwide. Previous research on this topic has found that around 65% of vehicle-related fatalities happen to in-vehicle occupants (WHO, 2009). This same report also estimated that by 2030, road traffic injuries would be the 5th leading cause of death around the world. It is also essential to consider the financial losses caused due to vehicle fatalities. A report by NHTSA states that there was an increase trend in the fatalities, which continued int 2021 (NHTSA, 2021), and while the number of crashes and traffic injuries declined overall, fatal crashes increased by 6.8%, highest since 2007.

There is extensive research being conducted in this field and one of the most efficient ways is to introduce new laws and policies aimed at reduced crash rates and injuries. To actually understand the effect and importance of such laws, we will be exploring one of the initial datasets collected on automobile crashes. The dataset, Seatbelts, explores information about automobile fatalities before and after a seatbelt law was enforced. My research mainly focuses on Human Factors in Driving and Safety while Driving. The database most relevant to my study is the ‘Seatbelts’ data, which provides data to measure differences in driving deaths over the years. The main aim was to understand change in behavior after the introduction of seatbelt legislation on January 31st 1983. The dataset was officially commissioned by Department of Transport in 1984 and covers a time period of 15 years (1969 - 1984). Since it’s almost 40 years old, we’ll have to clean up the data and make it a bit more easy to read.The dataset includes entries from 1969 to 1984, with number of car and van fatalities listed for each month.

We will be visualizing analyzing two main parts of the data - DriversKilled and VanKilled.

Research Questions

My analysis will work on understanding if:

Is there a change in car driving deaths once the legislation was established?
Is there a change in van driving once the legislation was established?

Hypothesis

Hypothesis 1: There is a reduction in car driving deaths after the legislation was established.

Hypothesis 2: There is a eduction in car driving deaths after the legislation was established.

Data

Data Selection

First, we’ll set up packages and libraries need for this project.

Show code

knitr::opts_chunk$set(echo = FALSE)
library(dplyr)
library(plyr)
library(magrittr)
library(mosaic)
library(tidyverse)
library(wesanderson)
library(ggpubr)
library(rmarkdown)

Next, we’ll load the dataset and convert the time series into a more readable and understandable format.

Show code

data(Seatbelts)
Seatbelts <- data.frame(years=floor(time(Seatbelts)),months=factor(cycle(Seatbelts),labels=month.abb), Seatbelts)
head(Seatbelts)

  years months DriversKilled drivers front rear   kms PetrolPrice
1  1969    Jan           107    1687   867  269  9059   0.1029718
2  1969    Feb            97    1508   825  265  7685   0.1023630
3  1969    Mar           102    1507   806  319  9963   0.1020625
4  1969    Apr            87    1385   814  407 10955   0.1008733
5  1969    May           119    1632   991  454 11823   0.1010197
6  1969    Jun           106    1511   945  427 12391   0.1005812
  VanKilled law
1        12   0
2         6   0
3        12   0
4         8   0
5        10   0
6        13   0

Next, we’ll work on seperating the dataset into two major ones based on when the law was active vs inactive.

Show code

Law_Active<-subset(Seatbelts, law == 1, select = c(years:law))
Law_Inactive<-subset(Seatbelts, law == 0, select = c(years:law))

Data Tidying and Exploration

We’ll now explore the datasets and calculate some basic statistics for both the dataframes.

Show code

summary(Law_Inactive)

     years          months   DriversKilled      drivers    
 Min.   :1969   Jan    :15   Min.   : 79.0   Min.   :1309  
 1st Qu.:1972   Feb    :14   1st Qu.:108.0   1st Qu.:1511  
 Median :1976   Mar    :14   Median :121.0   Median :1653  
 Mean   :1976   Apr    :14   Mean   :125.9   Mean   :1718  
 3rd Qu.:1979   May    :14   3rd Qu.:140.0   3rd Qu.:1926  
 Max.   :1983   Jun    :14   Max.   :198.0   Max.   :2654  
                (Other):84                                 
     front             rear            kms         PetrolPrice     
 Min.   : 567.0   Min.   :224.0   Min.   : 7685   Min.   :0.08118  
 1st Qu.: 767.0   1st Qu.:344.0   1st Qu.:12387   1st Qu.:0.09078  
 Median : 860.0   Median :401.0   Median :14455   Median :0.10273  
 Mean   : 873.5   Mean   :400.3   Mean   :14463   Mean   :0.10187  
 3rd Qu.: 986.0   3rd Qu.:454.0   3rd Qu.:16585   3rd Qu.:0.11132  
 Max.   :1299.0   Max.   :646.0   Max.   :21040   Max.   :0.13303  
                                                                   
   VanKilled           law   
 Min.   : 2.000   Min.   :0  
 1st Qu.: 7.000   1st Qu.:0  
 Median :10.000   Median :0  
 Mean   : 9.586   Mean   :0  
 3rd Qu.:13.000   3rd Qu.:0  
 Max.   :17.000   Max.   :0

Show code

dim(Law_Inactive)

[1] 169  10

Show code

summary(Law_Active)

     years          months   DriversKilled      drivers    
 Min.   :1983   Feb    : 2   Min.   : 60.0   Min.   :1057  
 1st Qu.:1983   Mar    : 2   1st Qu.: 85.0   1st Qu.:1171  
 Median :1984   Apr    : 2   Median : 92.0   Median :1282  
 Mean   :1984   May    : 2   Mean   :100.3   Mean   :1322  
 3rd Qu.:1984   Jun    : 2   3rd Qu.:119.0   3rd Qu.:1464  
 Max.   :1984   Jul    : 2   Max.   :154.0   Max.   :1763  
                (Other):11                                 
     front            rear            kms         PetrolPrice    
 Min.   :426.0   Min.   :296.0   Min.   :15511   Min.   :0.1131  
 1st Qu.:516.0   1st Qu.:347.0   1st Qu.:17971   1st Qu.:0.1148  
 Median :585.0   Median :408.0   Median :19162   Median :0.1161  
 Mean   :571.0   Mean   :407.7   Mean   :18890   Mean   :0.1165  
 3rd Qu.:629.5   3rd Qu.:471.5   3rd Qu.:19952   3rd Qu.:0.1180  
 Max.   :721.0   Max.   :521.0   Max.   :21626   Max.   :0.1201  
                                                                 
   VanKilled          law   
 Min.   :2.000   Min.   :1  
 1st Qu.:3.500   1st Qu.:1  
 Median :5.000   Median :1  
 Mean   :5.174   Mean   :1  
 3rd Qu.:7.000   3rd Qu.:1  
 Max.   :8.000   Max.   :1

Show code

dim(Law_Active)

[1] 23 10

We’ll also remove any variables we don’t need.

Show code

Law_Inactive <- Law_Inactive[ -c(7,8) ]
paged_table(Law_Inactive)

Show code

Law_Active <- Law_Active[ -c(7,8) ]
paged_table(Law_Active)

Our final datasets have the following columns:

Years - Year of the record
Months - Month of the record
DriversKilled - Total car drivers killed
Drivers - Total Drivers
Front - Number of Front end accidents
Rear - Number of rear end accidents
VanKilled - Total van drivers killed
Law - Status of Law (Active, Inactive)

Visualizations

We’ll start with some basic boxplots on the average fatalities, but they may not best indicate the trends in these variables. In order to understand that, I would use a scatter plot or maybe a bar plot.

Show code

Seatbelts$law<-as.factor(Seatbelts$law)
Seatbelts$DriversKilled<-as.numeric(Seatbelts$DriversKilled)
Seatbelts$VanKilled<-as.numeric(Seatbelts$VanKilled)

Seatbelts<-Seatbelts %>% 
  mutate_at("law", str_replace, "0", "Inactive")

Seatbelts<-Seatbelts %>% 
  mutate_at("law", str_replace, "1", "Active")

stats_law_DK<-Seatbelts %>%
  group_by(law) %>%
  dplyr::summarize(min = min(DriversKilled),
            median = median(DriversKilled),
            mean = mean(DriversKilled),
            sd = sd(DriversKilled),
            max = max(DriversKilled))
paged_table(stats_law_DK)

Show code

stats_law_VK<-Seatbelts %>%
  group_by(law) %>%
  dplyr::summarize(min = min(VanKilled),
            median = median(VanKilled),
            mean = mean(VanKilled),
            sd = sd(VanKilled),
            max = max(VanKilled))
paged_table(stats_law_VK)

Show code

##Grouping by years and calculating average accidents

ggplot(Seatbelts, aes(x=factor(law), y =DriversKilled, fill = law)) +
  geom_boxplot() +
  scale_fill_manual(values = wes_palette("Darjeeling2"))+
  ylab ("Car Drivers Killed")+
  xlab("Law Status")

Show code

ggplot(Seatbelts, aes(x=factor(law), y =VanKilled, fill = law)) +
  geom_boxplot() +
  scale_fill_manual(values = wes_palette("Cavalcanti1"))+
  ylab ("Van Drivers Killed")+
  xlab("Law Status")

Show code

Avg_DK<- Seatbelts %>%
  group_by(law) %>%
  dplyr::summarise(mean = mean(DriversKilled))

ggplot(Avg_DK, aes(x = law, y = mean, fill=law)) +
  geom_bar(stat="identity", position=position_dodge())+
  scale_fill_manual(values = wes_palette("Darjeeling2"))+
  xlab("Status of Law") +
  ylab("Average Number of Drivers Killed")+
  scale_y_continuous(limits = c(0,130), breaks = c(0,20,40,60,80,100,120))+
  theme(legend.position = "none")

Show code

Avg_VK<- Seatbelts %>%
  group_by(law) %>%
  dplyr::summarise(mean = mean(VanKilled))

ggplot(Avg_VK, aes(x = law, y = mean, fill=law)) +
  geom_bar(stat="identity", position=position_dodge())+
  scale_fill_manual(values = wes_palette("Cavalcanti1"))+
  xlab("Status of Law") +
  ylab("Average Number of Van Drivers Killed")+
  scale_y_continuous(limits = c(0,20), breaks = c(0,5,10,15,20))+
  theme(legend.position = "none")

Show code

law_means_DK <- ddply(Seatbelts, "law", summarise, mean_DK = mean(DriversKilled))
ggplot(Seatbelts, aes(x=years, y=DriversKilled, color=law)) +
  geom_point()+
  stat_smooth(method = 'lm')+
  geom_hline(data=law_means_DK, aes(yintercept=mean_DK, color=law), 
             linetype="dashed")+
  scale_color_manual(values=wes_palette("Darjeeling2"))+
  xlab("Years") +
  ylab("Number of Drivers Killed")+
  scale_y_continuous(limits = c(0,200), breaks = c(0,20,40,60,80,100,120,140,160,180,200))+
  scale_x_continuous(limits = c(1969,1984), breaks = c(1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984))+
  theme(legend.position = "right")

Show code

law_means_VK <- ddply(Seatbelts, "law", summarise, mean_VK = mean(VanKilled))
ggplot(Seatbelts, aes(x=years, y=VanKilled, color=law)) +
  geom_point()+
  stat_smooth(method = 'lm')+
  geom_hline(data=law_means_VK, aes(yintercept=mean_VK, color=law), 
             linetype="dashed")+
  scale_color_manual(values=wes_palette("GrandBudapest1"))+
  xlab("Years") +
  ylab("Number of Van Drivers Killed")+
  scale_y_continuous(limits = c(0,20), breaks = c(0,5,10,15,20))+
  scale_x_continuous(limits = c(1969,1984), breaks = c(1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984))+
  theme(legend.position = "right")

Based on the plots, we can get a good idea of the average number of drivers and van drivers killed when the law is inactive and active. It’s clear that there are fewer fatalities when the law is active, which is to be expected. This does provide us with the insight that once the law was activated, the number of drivers and van drivers killed on average over the years reduced comapred to years when the seatbelt law is active. However, these are only averages. This doesn’t give us an idea of the trend in fatalities over the years.

Show code

data<-read_csv("All_Data.csv")

data$CONDITION<-as.factor(data$CONDITION)
data$VALUE<-as.numeric(data$VALUE)
data$YEAR<-as.factor(data$YEAR)
data$VARIABLE <- as.factor(data$VARIABLE)

L_AC<-subset(data,CONDITION=="ACTIVE", select = c("YEAR","VARIABLE", "VALUE"))
L_INAC<-subset(data,CONDITION=="INACTIVE", select = c("YEAR","VARIABLE", "VALUE"))

a<-ggplot(data=L_AC, aes(x=YEAR, y=VALUE, fill=YEAR)) +
  geom_bar(stat="identity", position=position_dodge())+
  facet_wrap(~VARIABLE,nrow = 1, ncol = 2)+
  ggtitle("Seatbelt Law - Active")+
  scale_fill_manual(values = wes_palette("Chevalier1"))+
  scale_x_discrete(name ="Years")+
  scale_y_continuous(name = "Values", limits = c(0,110))+
  theme(legend.position = "none")

b<-ggplot(data=L_INAC, aes(x=YEAR, y=VALUE, fill=YEAR)) +
  geom_bar(stat="identity", position=position_dodge())+
  facet_wrap(~VARIABLE,nrow = 1, ncol = 2)+
  ggtitle("Seatbelt Law - Inactive")+
  scale_x_discrete(name ="Years")+
  scale_y_continuous(name = "Values", limits = c(0,150))+
  theme(legend.position = "none")

ggpubr::ggarrange(a,b,
          ncol = 1, nrow = 2)

This gives a much better idea of the fatalities trend over the years and seperated by the status of the law. The different plots all seem to indicate a difference in averages when the law is active compared to inactive. In order to test our hypothesis and state whether there is an effect of the law, we will perform a statistical test.

Data Analysis

We will now move on to statistical testing to check for differences in Cars and Van Drivers Killed before and after the law was introduced. Since we’re comparing the average fatalities over years, I would suggest we use T-Tests to compare the means and understadn if there is a significant difference.

Show code

data_ttest<-read.csv2(file = "ttest_data.csv", sep = ",")

data_ttest$ACTIVE<-as.numeric(data_ttest$ACTIVE)
data_ttest$INACTIVE<-as.numeric(data_ttest$INACTIVE)

cd<-subset(data_ttest, VARIABLE=="CAR_DRIVERS", select = c("YEAR", "ACTIVE", "INACTIVE"))
t.test(cd$ACTIVE, cd$INACTIVE, na.rm = TRUE)


    Welch Two Sample t-test

data:  cd$ACTIVE and cd$INACTIVE
t = 7.1257, df = 6.0308, p-value = 0.0003757
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 16.65575 34.04414
sample estimates:
mean of x mean of y 
 125.5164  100.1665

Show code

vd<-subset(data_ttest, VARIABLE=="VAN_DRIVERS", select = c("YEAR", "ACTIVE", "INACTIVE"))
t.test(vd$ACTIVE, vd$INACTIVE, na.rm = TRUE)


    Welch Two Sample t-test

data:  vd$ACTIVE and vd$INACTIVE
t = 8.3989, df = 14.66, p-value = 5.557e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 3.223193 5.421456
sample estimates:
mean of x mean of y 
 9.488824  5.166500

The t-tests state that there is a difference in the fatalities before and after the law was established.

Reflections

This was a really fun class and a good experience of using R to actually understand and how it applies to real world data analysis. I had to adapt quickly to complete the assignments, but I think, overall it helped a lot. I have extensively used R before, but I have not taken classes that actively used R, so it was a good experience.

My thesis research focuses on driving behavior, so I automatically looked for a dataset that was related to my work. I thought this may make it easier for me to interpret the data better, especially given the literature reviews I’ve conducted in the topic. So I decided to go with the Seatbelts dataset. This is a very old, but very well collected dataset, as there are no NAs and all information has been maintained correctly. I think one of the biggest challenges was figuring out how to actual convert the date and year information given into a readable format. Once I figured out using the time() function, it was relatively easier to follow. Next, I figured out what variables I was interested in and how to best understand the trends in these variables.

I was initially going to use the rear and front end accidents and how the seatbelt law affected these rates, but ultimately I decided on fatality rates. I think studying the fatality rate may be more interesting given the importance of this topic, especially if we could clearly understand the trends in the visualization. The next challenge was with selecting the visualizations and after trying multiple plots, I think the scatter plot and bar plots best depict the trends in the two variables - Car Drivers Killed and Van Drivers Killed.

Cleaning and wrangling the dataset into a format needed was a bit confusing and I kept coming back to by year, but by law was important as well. I finally decided on subsetting, but facet_wrap() was a really great function. I did end up having to conduct some data wrangling on excel, especially while getting the data in the correct format for ttests, but I’m overall happy with this project.

I think if I were to work on this in the future, I would focus on rear and front accidents, but also understanding how effective the law was on crash and fatality rates. The best way to do this would be to calculate overall differences and then statistical testing to figure out where the decrease is biggest. Working on this project has definitely upped my skills in R, especially the tutorials.

Conclusion

For both the variables - van drivers killed and car drivers killed, the ttests show a significant difference in the average fatalities by year. This indicates that there is an effect of the seatbelt law and that there has been a significant decrease in accidents. For fatality in car drivers, the results t(6.0308) = 7.1257, p<0.05, there is a decrease in the mean fatalities (Mean(Law Active) = 100.16, Mean(Law Inactive) = 125.5164). Similarly,for van drivers, the results t(14.66) = 8.3989, p<0.05 also indicates a decrease in fatalities (Mean(Law Active) = 5.16, Mean(Law Inactive) = 9.488).

We can conclusively state that introducing the Seatbelt Law has led to a decrease in Car and Van Drivers Fatality Rate.

The next steps would be to focus on rear and front end crashes and also understanding how effective the law is on the said variables. We can calculate the average accidents when law is active vs inactive and then compare the averages to understand which difference is greater.

References

Global status report on road safety: time for action. Geneva, World Health Organization, 2009 (www.who.int/violence_ injury_prevention/road_safety_status/2009).
National Center for Statistics and Analysis. (2021, October).Early estimate of motor vehicle traffic fatalities for the first half (January–June) of 2021 (Crash•Stats Brief Statistical Summary. Report No. DOT HS 813 199). National Highway Traffic Safety Administration.
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/time
https://rdrr.io/r/datasets/UKDriverDeaths.html

Comment on this article Share:

Final Paper

Introduction

Research Questions

Hypothesis

Data

Data Selection

Data Tidying and Exploration

Visualizations

Data Analysis

Reflections

Conclusion

References

Reuse

Citation