DACSS 601 Data Science Fundamentals - Final Paper
Use of automobiles for transport has become a major part of everyday life. Despite the many advancements and safety laws initiated, vehicle crashes are one of the biggest causes of fatalities and injuries worldwide. Previous research on this topic has found that around 65% of vehicle-related fatalities happen to in-vehicle occupants (WHO, 2009). This same report also estimated that by 2030, road traffic injuries would be the 5th leading cause of death around the world. It is also essential to consider the financial losses caused due to vehicle fatalities. A report by NHTSA states that there was an increase trend in the fatalities, which continued int 2021 (NHTSA, 2021), and while the number of crashes and traffic injuries declined overall, fatal crashes increased by 6.8%, highest since 2007.
There is extensive research being conducted in this field and one of the most efficient ways is to introduce new laws and policies aimed at reduced crash rates and injuries. To actually understand the effect and importance of such laws, we will be exploring one of the initial datasets collected on automobile crashes. The dataset, Seatbelts, explores information about automobile fatalities before and after a seatbelt law was enforced. My research mainly focuses on Human Factors in Driving and Safety while Driving. The database most relevant to my study is the ‘Seatbelts’ data, which provides data to measure differences in driving deaths over the years. The main aim was to understand change in behavior after the introduction of seatbelt legislation on January 31st 1983. The dataset was officially commissioned by Department of Transport in 1984 and covers a time period of 15 years (1969 - 1984). Since it’s almost 40 years old, we’ll have to clean up the data and make it a bit more easy to read.The dataset includes entries from 1969 to 1984, with number of car and van fatalities listed for each month.
We will be visualizing analyzing two main parts of the data - DriversKilled and VanKilled.
My analysis will work on understanding if:
Is there a change in car driving deaths once the legislation was established?
Is there a change in van driving once the legislation was established?
Hypothesis 1: There is a reduction in car driving deaths after the legislation was established.
Hypothesis 2: There is a eduction in car driving deaths after the legislation was established.
First, we’ll set up packages and libraries need for this project.
Next, we’ll load the dataset and convert the time series into a more readable and understandable format.
years months DriversKilled drivers front rear kms PetrolPrice
1 1969 Jan 107 1687 867 269 9059 0.1029718
2 1969 Feb 97 1508 825 265 7685 0.1023630
3 1969 Mar 102 1507 806 319 9963 0.1020625
4 1969 Apr 87 1385 814 407 10955 0.1008733
5 1969 May 119 1632 991 454 11823 0.1010197
6 1969 Jun 106 1511 945 427 12391 0.1005812
VanKilled law
1 12 0
2 6 0
3 12 0
4 8 0
5 10 0
6 13 0
Next, we’ll work on seperating the dataset into two major ones based on when the law was active vs inactive.
We’ll now explore the datasets and calculate some basic statistics for both the dataframes.
summary(Law_Inactive)
years months DriversKilled drivers
Min. :1969 Jan :15 Min. : 79.0 Min. :1309
1st Qu.:1972 Feb :14 1st Qu.:108.0 1st Qu.:1511
Median :1976 Mar :14 Median :121.0 Median :1653
Mean :1976 Apr :14 Mean :125.9 Mean :1718
3rd Qu.:1979 May :14 3rd Qu.:140.0 3rd Qu.:1926
Max. :1983 Jun :14 Max. :198.0 Max. :2654
(Other):84
front rear kms PetrolPrice
Min. : 567.0 Min. :224.0 Min. : 7685 Min. :0.08118
1st Qu.: 767.0 1st Qu.:344.0 1st Qu.:12387 1st Qu.:0.09078
Median : 860.0 Median :401.0 Median :14455 Median :0.10273
Mean : 873.5 Mean :400.3 Mean :14463 Mean :0.10187
3rd Qu.: 986.0 3rd Qu.:454.0 3rd Qu.:16585 3rd Qu.:0.11132
Max. :1299.0 Max. :646.0 Max. :21040 Max. :0.13303
VanKilled law
Min. : 2.000 Min. :0
1st Qu.: 7.000 1st Qu.:0
Median :10.000 Median :0
Mean : 9.586 Mean :0
3rd Qu.:13.000 3rd Qu.:0
Max. :17.000 Max. :0
dim(Law_Inactive)
[1] 169 10
summary(Law_Active)
years months DriversKilled drivers
Min. :1983 Feb : 2 Min. : 60.0 Min. :1057
1st Qu.:1983 Mar : 2 1st Qu.: 85.0 1st Qu.:1171
Median :1984 Apr : 2 Median : 92.0 Median :1282
Mean :1984 May : 2 Mean :100.3 Mean :1322
3rd Qu.:1984 Jun : 2 3rd Qu.:119.0 3rd Qu.:1464
Max. :1984 Jul : 2 Max. :154.0 Max. :1763
(Other):11
front rear kms PetrolPrice
Min. :426.0 Min. :296.0 Min. :15511 Min. :0.1131
1st Qu.:516.0 1st Qu.:347.0 1st Qu.:17971 1st Qu.:0.1148
Median :585.0 Median :408.0 Median :19162 Median :0.1161
Mean :571.0 Mean :407.7 Mean :18890 Mean :0.1165
3rd Qu.:629.5 3rd Qu.:471.5 3rd Qu.:19952 3rd Qu.:0.1180
Max. :721.0 Max. :521.0 Max. :21626 Max. :0.1201
VanKilled law
Min. :2.000 Min. :1
1st Qu.:3.500 1st Qu.:1
Median :5.000 Median :1
Mean :5.174 Mean :1
3rd Qu.:7.000 3rd Qu.:1
Max. :8.000 Max. :1
dim(Law_Active)
[1] 23 10
We’ll also remove any variables we don’t need.
Law_Inactive <- Law_Inactive[ -c(7,8) ]
paged_table(Law_Inactive)
Law_Active <- Law_Active[ -c(7,8) ]
paged_table(Law_Active)
Our final datasets have the following columns:
We’ll start with some basic boxplots on the average fatalities, but they may not best indicate the trends in these variables. In order to understand that, I would use a scatter plot or maybe a bar plot.
Seatbelts$law<-as.factor(Seatbelts$law)
Seatbelts$DriversKilled<-as.numeric(Seatbelts$DriversKilled)
Seatbelts$VanKilled<-as.numeric(Seatbelts$VanKilled)
Seatbelts<-Seatbelts %>%
mutate_at("law", str_replace, "0", "Inactive")
Seatbelts<-Seatbelts %>%
mutate_at("law", str_replace, "1", "Active")
stats_law_DK<-Seatbelts %>%
group_by(law) %>%
dplyr::summarize(min = min(DriversKilled),
median = median(DriversKilled),
mean = mean(DriversKilled),
sd = sd(DriversKilled),
max = max(DriversKilled))
paged_table(stats_law_DK)
##Grouping by years and calculating average accidents
ggplot(Seatbelts, aes(x=factor(law), y =DriversKilled, fill = law)) +
geom_boxplot() +
scale_fill_manual(values = wes_palette("Darjeeling2"))+
ylab ("Car Drivers Killed")+
xlab("Law Status")
ggplot(Seatbelts, aes(x=factor(law), y =VanKilled, fill = law)) +
geom_boxplot() +
scale_fill_manual(values = wes_palette("Cavalcanti1"))+
ylab ("Van Drivers Killed")+
xlab("Law Status")
Avg_DK<- Seatbelts %>%
group_by(law) %>%
dplyr::summarise(mean = mean(DriversKilled))
ggplot(Avg_DK, aes(x = law, y = mean, fill=law)) +
geom_bar(stat="identity", position=position_dodge())+
scale_fill_manual(values = wes_palette("Darjeeling2"))+
xlab("Status of Law") +
ylab("Average Number of Drivers Killed")+
scale_y_continuous(limits = c(0,130), breaks = c(0,20,40,60,80,100,120))+
theme(legend.position = "none")
Avg_VK<- Seatbelts %>%
group_by(law) %>%
dplyr::summarise(mean = mean(VanKilled))
ggplot(Avg_VK, aes(x = law, y = mean, fill=law)) +
geom_bar(stat="identity", position=position_dodge())+
scale_fill_manual(values = wes_palette("Cavalcanti1"))+
xlab("Status of Law") +
ylab("Average Number of Van Drivers Killed")+
scale_y_continuous(limits = c(0,20), breaks = c(0,5,10,15,20))+
theme(legend.position = "none")
law_means_DK <- ddply(Seatbelts, "law", summarise, mean_DK = mean(DriversKilled))
ggplot(Seatbelts, aes(x=years, y=DriversKilled, color=law)) +
geom_point()+
stat_smooth(method = 'lm')+
geom_hline(data=law_means_DK, aes(yintercept=mean_DK, color=law),
linetype="dashed")+
scale_color_manual(values=wes_palette("Darjeeling2"))+
xlab("Years") +
ylab("Number of Drivers Killed")+
scale_y_continuous(limits = c(0,200), breaks = c(0,20,40,60,80,100,120,140,160,180,200))+
scale_x_continuous(limits = c(1969,1984), breaks = c(1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984))+
theme(legend.position = "right")
law_means_VK <- ddply(Seatbelts, "law", summarise, mean_VK = mean(VanKilled))
ggplot(Seatbelts, aes(x=years, y=VanKilled, color=law)) +
geom_point()+
stat_smooth(method = 'lm')+
geom_hline(data=law_means_VK, aes(yintercept=mean_VK, color=law),
linetype="dashed")+
scale_color_manual(values=wes_palette("GrandBudapest1"))+
xlab("Years") +
ylab("Number of Van Drivers Killed")+
scale_y_continuous(limits = c(0,20), breaks = c(0,5,10,15,20))+
scale_x_continuous(limits = c(1969,1984), breaks = c(1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984))+
theme(legend.position = "right")
Based on the plots, we can get a good idea of the average number of drivers and van drivers killed when the law is inactive and active. It’s clear that there are fewer fatalities when the law is active, which is to be expected. This does provide us with the insight that once the law was activated, the number of drivers and van drivers killed on average over the years reduced comapred to years when the seatbelt law is active. However, these are only averages. This doesn’t give us an idea of the trend in fatalities over the years.
data<-read_csv("All_Data.csv")
data$CONDITION<-as.factor(data$CONDITION)
data$VALUE<-as.numeric(data$VALUE)
data$YEAR<-as.factor(data$YEAR)
data$VARIABLE <- as.factor(data$VARIABLE)
L_AC<-subset(data,CONDITION=="ACTIVE", select = c("YEAR","VARIABLE", "VALUE"))
L_INAC<-subset(data,CONDITION=="INACTIVE", select = c("YEAR","VARIABLE", "VALUE"))
a<-ggplot(data=L_AC, aes(x=YEAR, y=VALUE, fill=YEAR)) +
geom_bar(stat="identity", position=position_dodge())+
facet_wrap(~VARIABLE,nrow = 1, ncol = 2)+
ggtitle("Seatbelt Law - Active")+
scale_fill_manual(values = wes_palette("Chevalier1"))+
scale_x_discrete(name ="Years")+
scale_y_continuous(name = "Values", limits = c(0,110))+
theme(legend.position = "none")
b<-ggplot(data=L_INAC, aes(x=YEAR, y=VALUE, fill=YEAR)) +
geom_bar(stat="identity", position=position_dodge())+
facet_wrap(~VARIABLE,nrow = 1, ncol = 2)+
ggtitle("Seatbelt Law - Inactive")+
scale_x_discrete(name ="Years")+
scale_y_continuous(name = "Values", limits = c(0,150))+
theme(legend.position = "none")
ggpubr::ggarrange(a,b,
ncol = 1, nrow = 2)
This gives a much better idea of the fatalities trend over the years and seperated by the status of the law. The different plots all seem to indicate a difference in averages when the law is active compared to inactive. In order to test our hypothesis and state whether there is an effect of the law, we will perform a statistical test.
We will now move on to statistical testing to check for differences in Cars and Van Drivers Killed before and after the law was introduced. Since we’re comparing the average fatalities over years, I would suggest we use T-Tests to compare the means and understadn if there is a significant difference.
data_ttest<-read.csv2(file = "ttest_data.csv", sep = ",")
data_ttest$ACTIVE<-as.numeric(data_ttest$ACTIVE)
data_ttest$INACTIVE<-as.numeric(data_ttest$INACTIVE)
cd<-subset(data_ttest, VARIABLE=="CAR_DRIVERS", select = c("YEAR", "ACTIVE", "INACTIVE"))
t.test(cd$ACTIVE, cd$INACTIVE, na.rm = TRUE)
Welch Two Sample t-test
data: cd$ACTIVE and cd$INACTIVE
t = 7.1257, df = 6.0308, p-value = 0.0003757
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
16.65575 34.04414
sample estimates:
mean of x mean of y
125.5164 100.1665
Welch Two Sample t-test
data: vd$ACTIVE and vd$INACTIVE
t = 8.3989, df = 14.66, p-value = 5.557e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
3.223193 5.421456
sample estimates:
mean of x mean of y
9.488824 5.166500
The t-tests state that there is a difference in the fatalities before and after the law was established.
This was a really fun class and a good experience of using R to actually understand and how it applies to real world data analysis. I had to adapt quickly to complete the assignments, but I think, overall it helped a lot. I have extensively used R before, but I have not taken classes that actively used R, so it was a good experience.
My thesis research focuses on driving behavior, so I automatically looked for a dataset that was related to my work. I thought this may make it easier for me to interpret the data better, especially given the literature reviews I’ve conducted in the topic. So I decided to go with the Seatbelts dataset. This is a very old, but very well collected dataset, as there are no NAs and all information has been maintained correctly. I think one of the biggest challenges was figuring out how to actual convert the date and year information given into a readable format. Once I figured out using the time() function, it was relatively easier to follow. Next, I figured out what variables I was interested in and how to best understand the trends in these variables.
I was initially going to use the rear and front end accidents and how the seatbelt law affected these rates, but ultimately I decided on fatality rates. I think studying the fatality rate may be more interesting given the importance of this topic, especially if we could clearly understand the trends in the visualization. The next challenge was with selecting the visualizations and after trying multiple plots, I think the scatter plot and bar plots best depict the trends in the two variables - Car Drivers Killed and Van Drivers Killed.
Cleaning and wrangling the dataset into a format needed was a bit confusing and I kept coming back to by year, but by law was important as well. I finally decided on subsetting, but facet_wrap() was a really great function. I did end up having to conduct some data wrangling on excel, especially while getting the data in the correct format for ttests, but I’m overall happy with this project.
I think if I were to work on this in the future, I would focus on rear and front accidents, but also understanding how effective the law was on crash and fatality rates. The best way to do this would be to calculate overall differences and then statistical testing to figure out where the decrease is biggest. Working on this project has definitely upped my skills in R, especially the tutorials.
For both the variables - van drivers killed and car drivers killed, the ttests show a significant difference in the average fatalities by year. This indicates that there is an effect of the seatbelt law and that there has been a significant decrease in accidents. For fatality in car drivers, the results t(6.0308) = 7.1257, p<0.05, there is a decrease in the mean fatalities (Mean(Law Active) = 100.16, Mean(Law Inactive) = 125.5164). Similarly,for van drivers, the results t(14.66) = 8.3989, p<0.05 also indicates a decrease in fatalities (Mean(Law Active) = 5.16, Mean(Law Inactive) = 9.488).
We can conclusively state that introducing the Seatbelt Law has led to a decrease in Car and Van Drivers Fatality Rate.
The next steps would be to focus on rear and front end crashes and also understanding how effective the law is on the said variables. We can calculate the average accidents when law is active vs inactive and then compare the averages to understand which difference is greater.
Global status report on road safety: time for action. Geneva, World Health Organization, 2009 (www.who.int/violence_ injury_prevention/road_safety_status/2009).
National Center for Statistics and Analysis. (2021, October).Early estimate of motor vehicle traffic fatalities for the first half (January–June) of 2021 (Crash•Stats Brief Statistical Summary. Report No. DOT HS 813 199). National Highway Traffic Safety Administration.
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/time
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hungund (2022, May 11). Data Analytics and Computational Social Science: Final Paper. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomahungundaphfp/
BibTeX citation
@misc{hungund2022final, author = {Hungund, Apoorva}, title = {Data Analytics and Computational Social Science: Final Paper}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomahungundaphfp/}, year = {2022} }