DACSS 601 Data Science Fundamentals - Homework 4

Apoorva Hungund
#1) Reading in the Dataset

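# Packages used below; plyr is attached before dplyr so that dplyr's verbs are not masked
library(plyr)
library(dplyr)
library(ggplot2)
library(wesanderson)

# Convert the built-in Seatbelts time series into a data frame with year and month columns
Seatbelts <- data.frame(years = floor(time(Seatbelts)), months = factor(cycle(Seatbelts), labels = month.abb), Seatbelts)

# Recode the law indicator (0/1) as a readable label
Seatbelts <- Seatbelts %>%
  mutate(law = if_else(law == 1, "Active", "Inactive"))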
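
As an alternative to recoding the law indicator as text, the column could be stored as a factor with explicit levels, which also fixes the order of the groups in tables and plot legends. The following is only a sketch under that assumption; the name `sb` is introduced here for illustration and is not part of the original analysis.

```r
# Hypothetical alternative: keep `law` as a factor rather than a character column.
# `sb` is an illustrative name; the built-in dataset is left untouched.
sb <- as.data.frame(datasets::Seatbelts)

# Map 0 -> "Inactive" and 1 -> "Active", with "Inactive" as the first level
sb$law <- factor(sb$law, levels = c(0, 1), labels = c("Inactive", "Active"))

# Number of monthly observations under each law status
table(sb$law)
```

With a factor, `group_by(law)` and `ggplot` keep the level order given in `labels` instead of sorting the labels alphabetically.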

#1)Descriptive Statistics

 min     Q1 median  Q3 max     mean       sd   n missing
  60 104.75  118.5 138 198 122.8021 25.37989 192       0
stats_law_DK<-Seatbelts %>%
  group_by(law) %>%
  dplyr::summarize(min = min(DriversKilled),
            median = median(DriversKilled),
            mean = mean(DriversKilled),
            sd = sd(DriversKilled),
            max = max(DriversKilled))
# A tibble: 2 × 6
  law        min median  mean    sd   max
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl>
1 Active      60     92  100.  22.2   154
2 Inactive    79    121  126.  24.3   198
stats_law_VK<-Seatbelts %>%
  group_by(law) %>%
  dplyr::summarize(min = min(VanKilled),
            median = median(VanKilled),
            mean = mean(VanKilled),
            sd = sd(VanKilled),
            max = max(VanKilled))
# A tibble: 2 × 6
  law        min median  mean    sd   max
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl>
1 Active       2      5  5.17  1.83     8
2 Inactive     2     10  9.59  3.50    17
stats_years_DK<-Seatbelts %>%
  group_by(years) %>%
  dplyr::summarize(min = min(DriversKilled),
            median = median(DriversKilled),
            mean = mean(DriversKilled),
            sd = sd(DriversKilled),
            max = max(DriversKilled))
# A tibble: 16 × 6
   years   min median  mean    sd   max
   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
 1  1969    87  107   117.   25.5   180
 2  1970   102  124.  133.   30.6   190
 3  1971   104  138   138.   22.5   187
 4  1972   114  150.  147.   25.5   198
 5  1973   109  145   144.   15.2   168
 6  1974   100  131   129.   21.9   163
 7  1975    92  111   118.   23.4   161
 8  1976    97  112.  120.   20.4   162
 9  1977    79  111   119.   31.2   183
10  1978   100  120.  127.   22.5   178
11  1979    94  116.  123.   20.7   171
12  1980    92  109   112.   16.3   137
13  1981    84  108.  112.   19.5   153
14  1982   103  122   123.   14.9   152
15  1983    60   97.5  99.8  20.2   126
16  1984    79   91   102.   24.7   154
stats_years_VK<-Seatbelts %>%
  group_by(years) %>%
  dplyr::summarize(min = min(VanKilled),
            median = median(VanKilled),
            mean = mean(VanKilled),
            sd = sd(VanKilled),
            max = max(VanKilled))
# A tibble: 16 × 6
   years   min median  mean    sd   max
   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
 1  1969     6   11.5 10.9   3.09    16
 2  1970     6   12   11.5   3.12    16
 3  1971     6   13.5 13.2   3.07    17
 4  1972     3   12   10.8   4.29    17
 5  1973     5   11   10.6   3.15    15
 6  1974     4   10.5 10.2   3.76    15
 7  1975     6   11   10.7   3.14    16
 8  1976     4   10    9.75  2.96    14
 9  1977     5    7.5  8.33  3.03    15
10  1978     3    9    8.33  3.26    14
11  1979     5    8.5  9     2.30    13
12  1980     4    8    8.08  2.68    14
13  1981     4    7    7.25  2.38    12
14  1982     2    5.5  5.83  2.76    12
15  1983     2    5.5  5.25  2.22     8
16  1984     3    5.5  5.33  1.56     7

##Grouping by law status and calculating average fatalities

Avg_DK<- Seatbelts %>%
  group_by(law) %>%
  dplyr::summarise(mean = mean(DriversKilled))

ggplot(Avg_DK, aes(x = law, y = mean, fill=law)) +
  geom_bar(stat="identity", position=position_dodge())+
  scale_fill_manual(values = wes_palette("Darjeeling2"))+
  xlab("Status of Law") +
  ylab("Average Number of Drivers Killed")+
  scale_y_continuous(limits = c(0,130), breaks = c(0,20,40,60,80,100,120))+
  theme(legend.position = "none")
Avg_VK<- Seatbelts %>%
  group_by(law) %>%
  dplyr::summarise(mean = mean(VanKilled))

ggplot(Avg_VK, aes(x = law, y = mean, fill=law)) +
  geom_bar(stat="identity", position=position_dodge())+
  scale_fill_manual(values = wes_palette("Cavalcanti1"))+
  xlab("Status of Law") +
  ylab("Average Number of Van Drivers Killed")+
  scale_y_continuous(limits = c(0,20), breaks = c(0,5,10,15,20))+
  theme(legend.position = "none")

These are very basic plots, but they give us a good idea of the average number of drivers and van drivers killed when the law is inactive and when it is active. There are clearly fewer fatalities on average when the law is active, which is to be expected. In other words, once the law came into force, the average number of drivers and van drivers killed was lower than in the period when the seatbelt law was inactive. However, these are only averages; they do not show the trend in fatalities over the years.
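
To put a number on the gap between the two bars, the group means can be turned into a percentage difference. This is a minimal sketch that rebuilds the data from the built-in `datasets::Seatbelts` object; the names `sb`, `group_means` and `pct_drop` are introduced here for illustration, and the result is purely descriptive (it does not account for the long-run downward trend).

```r
# Illustrative sketch: percentage difference in mean monthly fatalities
# between months with the law active and months with it inactive.
sb <- as.data.frame(datasets::Seatbelts)
sb$law <- factor(sb$law, levels = c(0, 1), labels = c("Inactive", "Active"))

# Matrix of means: rows are law status, columns are the two outcomes
group_means <- sapply(
  sb[c("DriversKilled", "VanKilled")],
  function(x) tapply(x, sb$law, mean)
)

# Percentage by which the "Active" mean is lower than the "Inactive" mean
pct_drop <- 100 * (1 - group_means["Active", ] / group_means["Inactive", ])
round(pct_drop, 1)
```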

#2)Visualizations Continued
law_means_DK <- ddply(Seatbelts, "law", summarise, mean_DK = mean(DriversKilled))
ggplot(Seatbelts, aes(x=years, y=DriversKilled, color=law)) +
  stat_smooth(method = 'lm')+
  geom_hline(data=law_means_DK, aes(yintercept=mean_DK, color=law)) +
  xlab("Years") +
  ylab("Number of Drivers Killed")+
  scale_y_continuous(limits = c(0,200), breaks = c(0,20,40,60,80,100,120,140,160,180,200))+
  scale_x_continuous(limits = c(1969,1984), breaks = c(1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984))+
  theme(legend.position = "right")
law_means_VK <- ddply(Seatbelts, "law", summarise, mean_VK = mean(VanKilled))
ggplot(Seatbelts, aes(x=years, y=VanKilled, color=law)) +
  stat_smooth(method = 'lm')+
  geom_hline(data=law_means_VK, aes(yintercept=mean_VK, color=law)) +
  xlab("Years") +
  ylab("Number of Van Drivers Killed")+
  scale_y_continuous(limits = c(0,20), breaks = c(0,5,10,15,20))+
  scale_x_continuous(limits = c(1969,1984), breaks = c(1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984))+
  theme(legend.position = "right")

Based only on the two plots above, we can conclude that fatalities among both car and van drivers show a decreasing trend up until the law is introduced, at which point there is a further, clear drop. However, it should also be noted that after the law is introduced in 1983, there is a slight increase from 1983 to 1984 for both car and van drivers. These plots also do not show the distribution of fatalities within each year. Instead of this kind of plot, a box plot might be a better visualization technique here.
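
The box plot idea could be sketched roughly as follows. This is an illustration rather than part of the original analysis: it rebuilds the data frame from `datasets::Seatbelts` under the hypothetical name `sb`, and treats each year as a category so that one box is drawn per year.

```r
library(ggplot2)

# Illustrative data frame (hypothetical name `sb`), rebuilt from the built-in dataset
sb <- data.frame(
  years = as.numeric(floor(time(datasets::Seatbelts))),
  as.data.frame(datasets::Seatbelts)
)
sb$law <- factor(sb$law, levels = c(0, 1), labels = c("Inactive", "Active"))

# One box per year of monthly driver fatalities, filled by law status
p_box <- ggplot(sb, aes(x = factor(years), y = DriversKilled, fill = law)) +
  geom_boxplot() +
  xlab("Years") +
  ylab("Number of Drivers Killed") +
  theme(legend.position = "right")

p_box
```

Because the law changed partway through 1983, that year contains months of both kinds and is split into two boxes. Swapping `DriversKilled` for `VanKilled` gives the matching plot for van drivers.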

#2)Visualizations Continued

ggplot(Seatbelts, aes(x=DriversKilled, y=VanKilled, color=law)) +
  stat_smooth(method = 'lm')+
  xlab("Number of Drivers Killed") +
  ylab("Number of Van Drivers Killed")+
  theme(legend.position = "right")

This graph depicts the association between drivers killed and van drivers killed when the law is active versus inactive. In both cases the two counts rise together: months with more drivers killed also tend to have more van drivers killed. When the law is inactive, the increase is sharper than when it is active. However, this does not help us understand the trend in fatalities by year. A good idea would be to have years on the y-axis and both VanKilled and DriversKilled as paired box plots.
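
One possible way to draw the paired box plots is to stack the two outcomes into a long data frame and facet by outcome. The sketch below is an assumption about how this might look, not the author's implementation; `sb`, `long`, `group` and `killed` are names introduced here for illustration.

```r
library(ggplot2)

# Illustrative data frame (hypothetical name `sb`), rebuilt from the built-in dataset
sb <- data.frame(
  years = as.numeric(floor(time(datasets::Seatbelts))),
  as.data.frame(datasets::Seatbelts)
)
sb$law <- factor(sb$law, levels = c(0, 1), labels = c("Inactive", "Active"))

# Stack the two outcomes so that each row is one month and one outcome
long <- rbind(
  data.frame(years = sb$years, law = sb$law,
             group = "Drivers", killed = sb$DriversKilled),
  data.frame(years = sb$years, law = sb$law,
             group = "Van drivers", killed = sb$VanKilled)
)

# One panel per outcome; free y scales because the counts differ in magnitude
p_pair <- ggplot(long, aes(x = factor(years), y = killed, fill = law)) +
  geom_boxplot() +
  facet_wrap(~ group, ncol = 1, scales = "free_y") +
  xlab("Years") +
  ylab("Number Killed") +
  theme(legend.position = "right")

p_pair
```

Years are placed on the x-axis here; adding `coord_flip()` would move them to the vertical axis as described above.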


For attribution, please cite this work as

Hungund (2022, May 4). Data Analytics and Computational Social Science: HW4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomahungundaphhw4/

BibTeX citation

  author = {Hungund, Apoorva},
  title = {Data Analytics and Computational Social Science: HW4},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomahungundaphhw4/},
  year = {2022}