Data Analytics and Computational Social Science: Data Science Fundamentals Paper

Brinda Murulidhara

Introduction

I have chosen the Emergency - 911 calls dataset from Kaggle (https://www.kaggle.com/mchirico/montcoalert/version/32) for my final project. The dataset contains emergency 911 calls in Montgomery County, Pennsylvania from 2015 to 2020.

Research questions

What type of emergency contributes the most to emergency calls in Montgomery County?
Which township has the highest number of emergency calls?
How do emergency calls vary across year, month, and time of the day? Has there been a reduction/increase in a certain type of emergency (e.g. vehicle accident) from 2015 to 2020?
How do vehicle accidents vary across year, month, week, and time of the day? Do you see any trends?
Do you see any unique patterns? If yes, can you find an explanation for those patterns? E.g. Are there more vehicle accidents during night times and peak hours of traffic? Are there more cardiac arrest emergencies during the daytime? When do we get more heat exhaustion emergency calls?

Data

Variables in the dataset

Below are the variables in the dataset:

lat: Latitude of the location where the emergency occurred. The data type is double.
lng: Longitude of the location where the emergency occurred. The data type is double.
desc: Description of the emergency. The data type is String.
zip: Zipcode of the location where the emergency occurred. The data type is integer. This variable is interpreted as a factor.
title: Type of emergency. The data type is String.
timeStamp: Date and time of the emergency call. The data type is String.
twp: Township where the emergency occurred. The data type is String.
addr: Address of the emergency location. The data type is String.
e: Index column whose value is always 1. The data type is integer.
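The types listed above can also be set explicitly after loading. Below is a minimal sketch on a small hypothetical sample (the `sample_calls` data frame is an assumption that copies two rows from the preview, not the real file). It converts `zip` to a factor and `timeStamp` to a date-time.

```r
# Hypothetical two-row sample that mimics the structure of 911.csv.
sample_calls <- data.frame(
  lat = c(40.29788, 40.25149),
  lng = c(-75.58129, -75.60335),
  zip = c(19525L, NA),
  title = c("EMS: BACK PAINS/INJURY", "EMS: DIZZINESS"),
  timeStamp = c("2015-12-10 17:10:52", "2015-12-10 16:56:52"),
  stringsAsFactors = FALSE
)

# Treat the zip code as a categorical label rather than a number.
sample_calls$zip <- factor(sample_calls$zip)

# Parse the timestamp text into a date-time; unparseable values become NA.
# tz = "UTC" is used only to keep the wall-clock time exactly as written.
sample_calls$timeStamp <- as.POSIXct(
  sample_calls$timeStamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"
)

str(sample_calls)
```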

Below is the code snippet to read and preview the data.

library(dplyr)
emergency_calls_data <- read.csv("911.csv")
head(emergency_calls_data)

       lat       lng
1 40.29788 -75.58129
2 40.25806 -75.26468
3 40.12118 -75.35198
4 40.11615 -75.34351
5 40.25149 -75.60335
6 40.25347 -75.28324
                                                                                 desc
1           REINDEER CT & DEAD END;  NEW HANOVER; Station 332; 2015-12-10 @ 17:10:52;
2 BRIAR PATH & WHITEMARSH LN;  HATFIELD TOWNSHIP; Station 345; 2015-12-10 @ 17:29:21;
3                          HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-Station:STA27;
4               AIRY ST & SWEDE ST;  NORRISTOWN; Station 308A; 2015-12-10 @ 16:47:36;
5    CHERRYWOOD CT & DEAD END;  LOWER POTTSGROVE; Station 329; 2015-12-10 @ 16:56:52;
6               CANNON AVE & W 9TH ST;  LANSDALE; Station 345; 2015-12-10 @ 15:39:04;
    zip                   title           timeStamp               twp
1 19525  EMS: BACK PAINS/INJURY 2015-12-10 17:10:52       NEW HANOVER
2 19446 EMS: DIABETIC EMERGENCY 2015-12-10 17:29:21 HATFIELD TOWNSHIP
3 19401     Fire: GAS-ODOR/LEAK 2015-12-10 14:39:21        NORRISTOWN
4 19401  EMS: CARDIAC EMERGENCY 2015-12-10 16:47:36        NORRISTOWN
5    NA          EMS: DIZZINESS 2015-12-10 16:56:52  LOWER POTTSGROVE
6 19446        EMS: HEAD INJURY 2015-12-10 15:39:04          LANSDALE
                        addr e
1     REINDEER CT & DEAD END 1
2 BRIAR PATH & WHITEMARSH LN 1
3                   HAWS AVE 1
4         AIRY ST & SWEDE ST 1
5   CHERRYWOOD CT & DEAD END 1
6      CANNON AVE & W 9TH ST 1

Descriptive statistics

Mean

Below are the means of latitude and longitude columns.

summarise(emergency_calls_data, mean_latitude=mean(lat), mean_longitude=mean(lng))

  mean_latitude mean_longitude
1      40.15816       -75.3001

Median

Below are the medians of latitude and longitude columns.

summarise(emergency_calls_data, median_latitude=median(lat), median_longitude=median(lng))

  median_latitude median_longitude
1        40.14393        -75.30514

Standard Deviation

Below are the standard deviations of latitude and longitude columns.

summarise(emergency_calls_data, sd_latitude=sd(lat), sd_longitude=sd(lng))

  sd_latitude sd_longitude
1   0.2206414     1.672884
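The three statistics above can also be computed in one pass. The sketch below uses base R on hypothetical coordinates (the `coords` values are made up, not taken from the dataset) and skips missing values with `na.rm = TRUE`, which may be needed if a column contains NA.

```r
# Hypothetical coordinates with one missing value; not taken from 911.csv.
coords <- data.frame(
  lat = c(40.1, 40.2, 40.3, NA),
  lng = c(-75.2, -75.3, -75.4, -75.5)
)

# Return the mean, median, and standard deviation of a vector, skipping NAs.
describe <- function(x) {
  c(mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE))
}

# One column of results per variable, one row per statistic.
stats <- sapply(coords, describe)
print(stats)
```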

Visualization

High-level visualizations

Emergency subcategory vs frequency

Below is a plot of emergency subcategory (Vehicle accidents and others) vs frequency of occurrence.
This visualization helps us understand that vehicle accident related emergencies occur more frequently in Montgomery County than any other type of emergency. The emergency response team can make better preparations with this knowledge.

Since vehicle accidents contribute the most to emergency calls in Montgomery County, further analysis is done in the upcoming visualizations.

library(ggplot2)
calls_data <- mutate(emergency_calls_data, broad_category = ifelse(grepl("VEHICLE ACCIDENT", title), "Vehicle accident", "Other"))
ggplot(calls_data, aes(x = broad_category)) +
        geom_bar() + 
        labs(x="Emergency sub-category", y="Frequency") + 
        ggtitle("Plot of emergency sub-category vs frequency")
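Because the plot above groups every non-vehicle emergency into a single "Other" bar, ranking the individual sub-categories can be a useful cross-check. The sketch below shows one way to do that on hypothetical titles written in the dataset's "Category: SUBCATEGORY" style (the `titles` vector is an assumption, not real data).

```r
# Hypothetical titles in the same "Category: SUBCATEGORY" style as the dataset.
titles <- c(
  "Traffic: VEHICLE ACCIDENT -", "EMS: CARDIAC EMERGENCY",
  "Traffic: VEHICLE ACCIDENT -", "Fire: GAS-ODOR/LEAK",
  "Traffic: DISABLED VEHICLE -", "EMS: CARDIAC EMERGENCY",
  "Traffic: VEHICLE ACCIDENT -"
)

# Drop the category prefix and any trailing " -" to get the sub-category.
subcategory <- sub("^[^:]*:\\s*", "", titles)
subcategory <- trimws(sub("-\\s*$", "", subcategory))

# Rank sub-categories from most to least frequent.
subcategory_counts <- sort(table(subcategory), decreasing = TRUE)
print(head(subcategory_counts, 3))
```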

Township vs frequency

The number of townships in Montgomery County is large. Hence, only the 10 townships with the highest number of emergency calls are shown. Lower Merion accounts for the most emergency calls in the county, followed by Abington and Norristown.

sort(table(emergency_calls_data[, "twp"]), decreasing=TRUE, na.last = TRUE)[1:10]


    LOWER MERION         ABINGTON       NORRISTOWN     UPPER MERION 
           55490            39947            37633            36010 
      CHELTENHAM        POTTSTOWN   UPPER MORELAND LOWER PROVIDENCE 
           30574            27387            22932            22476 
        PLYMOUTH     UPPER DUBLIN 
           20116            18862

Below is a plot of township vs frequency of calls grouped by emergency category for the top ten townships. The dataset has three major predefined categories - EMS, Traffic and Fire. EMS includes serious illness or injuries like weakness, head injuries, seizures etc. Traffic constitutes vehicle accidents, disabled vehicles etc. Fire includes accidents resulting from any kind of fire in a building or outside.
This visualization helps us understand what kind of emergencies occur more frequently in each township in Montgomery County. It also helps us compare the number of emergencies that occur across townships. The emergency response team can focus its efforts more on townships that have higher counts of emergencies.

library(stringr) 
calls_data <- mutate(emergency_calls_data, emergency_category=word(title, sep = fixed(":")))
ggplot(subset(calls_data, twp %in% c("LOWER MERION", "ABINGTON", "NORRISTOWN", "UPPER MERION", "CHELTENHAM", "POTTSTOWN", "UPPER MORELAND", "LOWER PROVIDENCE", "PLYMOUTH", "UPPER DUBLIN")), aes(x = twp, 
           fill = emergency_category)) + 
  geom_bar(position = "stack") + 
  theme(axis.text.x=element_text(angle=90)) +
  labs(x="Township", y="Frequency") + 
  ggtitle("Plot of township vs frequency grouped by category")
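The township names in the plot code above are typed by hand. As an alternative, they could be derived from the counts. The sketch below is one possible approach; the helper `top_townships` and the small `calls` data frame are hypothetical and only illustrate the idea.

```r
# Hypothetical township column; the real data has many more rows and townships.
calls <- data.frame(
  twp = c("ABINGTON", "LOWER MERION", "LOWER MERION", "NORRISTOWN",
          "ABINGTON", "LOWER MERION", "POTTSTOWN"),
  stringsAsFactors = FALSE
)

# Names of the n townships with the most calls, most frequent first.
top_townships <- function(twp, n = 10) {
  counts <- sort(table(twp), decreasing = TRUE)
  names(counts)[seq_len(min(n, length(counts)))]
}

# Keep only the rows that belong to the two busiest townships.
top_two <- top_townships(calls$twp, n = 2)
top_calls <- subset(calls, twp %in% top_two)
print(top_two)
```

With the real data, the result of `top_townships(emergency_calls_data$twp)` could be passed to `%in%` in place of the hand-typed vector.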

Emergency call count vs year

Below is a plot of emergency call count vs year.

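We can see that 2015 has significantly fewer emergency calls than the other years. The earliest records in the preview above are dated December 10, 2015, so 2015 is most likely a partial year.
Call counts are high and nearly the same from 2016 to 2019.
In 2020, the number of emergency calls is about half that of 2019, which may likewise reflect a partial year of data.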

library(stringr) 
emergency_calls_data  %>%
mutate(year=format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%Y")) %>%
ggplot(aes(x = year)) + 
  geom_bar() + 
  labs(x="Year", y="Emergency call count") +
  ggtitle("Plot of call count vs year")
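When comparing years, it may help to check how much of each year the data covers. The sketch below reports the first and last call date and the call count per year. The timestamps are hypothetical except the first, which mirrors the earliest row in the preview.

```r
# Hypothetical timestamps; the first mirrors the earliest row in the preview.
time_stamp <- c("2015-12-10 17:10:52", "2015-12-31 08:00:00",
                "2016-01-01 00:05:00", "2016-12-31 23:50:00",
                "2020-03-15 12:00:00")

# Keep only the date part and split the dates by calendar year.
call_date <- as.Date(time_stamp, format = "%Y-%m-%d")
by_year <- split(call_date, format(call_date, "%Y"))

# For each year, report the first and last call date and the call count.
coverage <- data.frame(
  year = names(by_year),
  first_call = vapply(by_year, function(d) format(min(d)), character(1)),
  last_call = vapply(by_year, function(d) format(max(d)), character(1)),
  calls = lengths(by_year),
  row.names = NULL
)
print(coverage)
```

A year whose first call is late in the year, or whose last call is early, is a partial year and should not be compared directly with full years.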

Emergency call count vs month

Below is a plot of emergency call count vs month. There is no significant difference in emergency counts across various months.

library(stringr) 
emergency_calls_data  %>%
mutate(month=ordered(month.abb[strtoi(format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%m"))], levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))) %>%
filter(!is.na(month)) %>%
ggplot(aes(x = month)) + 
  geom_bar() + 
  labs(x="Month", y="Emergency call count") +
  ggtitle("Plot of call count vs month")
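The month conversion above is fairly long and is repeated in later sections. A shorter helper could look like the sketch below. The name `month_of` is hypothetical, and the example timestamps are made up.

```r
# Convert timestamp text to an ordered month factor (Jan < Feb < ... < Dec).
month_of <- function(time_stamp) {
  parsed <- as.POSIXlt(time_stamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  # POSIXlt stores months as 0-11, so add 1 to index into month.abb.
  factor(month.abb[parsed$mon + 1], levels = month.abb, ordered = TRUE)
}

# Hypothetical timestamps, including one malformed value that becomes NA.
months <- month_of(c("2015-12-10 17:10:52", "2016-07-04 09:30:00", "not a date"))
print(table(months, useNA = "ifany"))
```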

Emergency call count vs year grouped by emergency category

Below is a bar plot of emergency counts vs year grouped by broad emergency categories (EMS, Fire and Traffic).

We can see that 2015 has significantly fewer emergencies of all types compared to other years. This could be the result of incomplete data for that year, which the data collection team should look into.
All types of emergencies increase in 2016 and remain nearly the same until 2019.
The emergency call count drops in 2020.

emergency_calls_data  %>%
mutate(emergency_category=word(title, sep = fixed(":")), year=format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%Y")) %>%
ggplot(aes(x = year, 
           fill = emergency_category)) + 
  geom_bar(position = "stack") + 
  labs(x="Year", y="Emergency call count") + 
  ggtitle("Plot of call count vs year grouped by emergency category")
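The counts behind a stacked bar plot like this one can also be inspected as a table. The sketch below uses base R as an alternative to `word()` for extracting the category, then cross-tabulates year against category. The `calls` rows are hypothetical.

```r
# Hypothetical rows in the dataset's "Category: SUBCATEGORY" title format.
calls <- data.frame(
  title = c("EMS: HEAD INJURY", "Traffic: VEHICLE ACCIDENT -",
            "Fire: GAS-ODOR/LEAK", "EMS: DIZZINESS"),
  timeStamp = c("2015-12-10 15:39:04", "2016-02-01 08:15:00",
                "2016-02-01 09:00:00", "2016-05-20 13:45:00"),
  stringsAsFactors = FALSE
)

# Keep only the text before the first colon as the broad category.
calls$emergency_category <- sub(":.*$", "", calls$title)

# The year is the first four characters of the timestamp text.
calls$year <- substr(calls$timeStamp, 1, 4)

# Cross-tabulate call counts: one row per year, one column per category.
counts <- table(calls$year, calls$emergency_category)
print(counts)
```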

Emergency call count vs time of the day

Below is a plot of emergency calls vs time of the day.

day_time <- as.POSIXct(strptime(c("000000","040000","114500","170000","193000","235959"),
                      "%H%M%S"),"UTC")
labels = c("night","morning","afternoon","evening","night")


calls_data <- mutate(emergency_calls_data, time_of_day=ordered(labels[findInterval(as.POSIXct(strptime(format(strptime(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%H%M%S"), format="%H%M%S"), "UTC"), day_time)], levels=c("morning", "afternoon", "evening", "night")))
calls_data <- filter(calls_data, !is.na(time_of_day))
ggplot(calls_data, aes(x = time_of_day)) + 
  geom_bar() + 
  labs(x="Time of day", y="Emergency call count") +
  ggtitle("Plot of emergency call count vs time of day")
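The time-of-day binning above can be written more compactly by working with the hour as a decimal number. The sketch below is one possible simplification using the same breakpoints (04:00, 11:45, 17:00 and 19:30). The function name `time_of_day` and the example timestamps are hypothetical.

```r
# Label each timestamp as morning, afternoon, evening, or night.
time_of_day <- function(time_stamp) {
  parsed <- as.POSIXlt(time_stamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  # Express the time as decimal hours, e.g. 11:45:00 becomes 11.75.
  hours <- parsed$hour + parsed$min / 60 + parsed$sec / 3600
  # Lower edge of each bin; night covers both 00:00-04:00 and 19:30-24:00.
  breaks <- c(0, 4, 11.75, 17, 19.5)
  bin_labels <- c("night", "morning", "afternoon", "evening", "night")
  factor(bin_labels[findInterval(hours, breaks)],
         levels = c("morning", "afternoon", "evening", "night"))
}

# Hypothetical timestamps chosen to sit on or near the bin edges.
example <- time_of_day(c("2015-12-10 03:59:59", "2015-12-10 04:00:00",
                         "2015-12-10 11:45:00", "2015-12-10 17:10:52",
                         "2015-12-10 19:30:00", "not a time"))
print(example)
```

Timestamps that cannot be parsed come out as NA and can be dropped with `filter(!is.na(...))` as before.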

Low-level visualizations

Vehicle accidents vs year

Since vehicle accidents contribute the most to emergency calls in Montgomery County, here is a plot of vehicle accidents vs year. Vehicle accident calls are high from 2016 to 2019.

library(stringr) 
calls_data <- mutate(emergency_calls_data, year=format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%Y"))
# Keep only rows whose title marks a traffic vehicle accident.
ggplot(subset(calls_data, grepl("Traffic: VEHICLE ACCIDENT", title, fixed = TRUE)), aes(x = year)) + 
  geom_bar() + 
  labs(x="Year", y="Vehicle accident calls") +
  ggtitle("Plot of vehicle accident calls vs year")

Traffic emergency call count vs weekday

Below is a plot of traffic related emergency calls vs day of the week.
We can see that the number of traffic related emergencies is higher on Fridays. There could be an increase in the number of vehicles on Fridays with people travelling to other towns, returning home, etc. for the weekend.
Weekends have fewer emergencies than weekdays, possibly because of higher vehicular movement on weekdays.

library(stringr) 
calls_data <- mutate(emergency_calls_data, day=ordered(weekdays(as.Date(timeStamp)), levels=c("Monday", "Tuesday", "Wednesday", "Thursday", 
"Friday", "Saturday", "Sunday")))
calls_data <- filter(calls_data, !is.na(day))
ggplot(subset(calls_data, grepl('Traffic:', title)), aes(x = day)) + 
  geom_bar() + 
  labs(x="Weekday", y="Traffic emergency call count") +
  ggtitle("Plot of traffic emergency call count vs weekday")
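One caveat with `weekdays()` is that it returns day names in the language of the current locale, so the English `levels` above may not match on every machine. The sketch below shows a locale-independent alternative based on the `%u` weekday number. The helper name `weekday_of` is hypothetical.

```r
# Convert timestamp text to an ordered weekday factor (Monday first).
weekday_of <- function(time_stamp) {
  day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")
  # %u gives the weekday as a number in text form: 1 for Monday to 7 for Sunday.
  call_date <- as.Date(time_stamp, format = "%Y-%m-%d")
  day_number <- as.integer(format(call_date, "%u"))
  factor(day_names[day_number], levels = day_names, ordered = TRUE)
}

# 2015-12-10 was a Thursday and 2015-12-12 was a Saturday.
days <- weekday_of(c("2015-12-10 17:10:52", "2015-12-12 08:00:00"))
print(days)
```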

Traffic emergency call count vs time of the day

Below is a plot of traffic related emergency calls vs time of the day.
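Generally, the number of traffic related emergencies is expected to be higher in the evenings and at night, due to higher vehicular movement and lower visibility when it’s dark.
However, we can see that the number of emergencies is higher during the afternoons in our data set. This could be due to incorrect data, or there may be an abnormal pattern that needs to be investigated by the emergency response team. Note also that the bins defined in the code below are unequal in length: the afternoon bin (11:45 to 17:00) spans 5.25 hours while the evening bin (17:00 to 19:30) spans only 2.5 hours, so raw counts favor the longer bins.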

day_time <- as.POSIXct(strptime(c("000000","040000","114500","170000","193000","235959"),
                      "%H%M%S"),"UTC")
labels = c("night","morning","afternoon","evening","night")


calls_data <- mutate(emergency_calls_data, time_of_day=ordered(labels[findInterval(as.POSIXct(strptime(format(strptime(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%H%M%S"), format="%H%M%S"), "UTC"), day_time)], levels=c("morning", "afternoon", "evening", "night")))
calls_data <- filter(calls_data, !is.na(time_of_day))
ggplot(subset(calls_data, grepl('Traffic:', title)), aes(x = time_of_day)) + 
  geom_bar() + 
  labs(x="Time of day", y="Traffic emergency call count") +
  ggtitle("Plot of traffic emergency call count vs time of day")
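The four time-of-day bins used here have different lengths, so raw counts are not directly comparable. One possible adjustment is to divide each count by the length of its bin to get calls per hour. The sketch below illustrates this with hypothetical labels; `bin_hours` and `calls_per_hour` are assumed names, and the bin lengths are derived from the breakpoints in the code above.

```r
# Length of each bin in hours, derived from the breakpoints used in this paper:
# morning 04:00-11:45, afternoon 11:45-17:00, evening 17:00-19:30,
# night 19:30-04:00.
bin_hours <- c(morning = 7.75, afternoon = 5.25, evening = 2.5, night = 8.5)

# Divide the raw count in each bin by the bin's length in hours.
calls_per_hour <- function(time_of_day) {
  counts <- table(factor(time_of_day, levels = names(bin_hours)))
  as.vector(counts) / bin_hours
}

# Hypothetical labels, not real data: 21 afternoon calls and 15 evening calls.
labels <- c(rep("afternoon", 21), rep("evening", 15))
rates <- calls_per_hour(labels)
print(rates)
```

In this made-up example the afternoon has more calls in total, but the evening has more calls per hour.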

Cardiac arrest emergency call count vs time of the day

Below is a plot of cardiac arrest emergency calls vs time of the day.
It can be observed that the number of cardiac emergency calls is higher in the morning. This is consistent with published research reporting that cardiovascular events such as heart attacks and strokes are more common in the morning hours, when blood pressure rises.

day_time <- as.POSIXct(strptime(c("000000","040000","114500","170000","193000","235959"),
                      "%H%M%S"),"UTC")
labels = c("night","morning","afternoon","evening","night")


calls_data <- mutate(emergency_calls_data, time_of_day=ordered(labels[findInterval(as.POSIXct(strptime(format(strptime(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%H%M%S"), format="%H%M%S"), "UTC"), day_time)], levels=c("morning", "afternoon", "evening", "night")))
calls_data <- filter(calls_data, !is.na(time_of_day))
ggplot(subset(calls_data, grepl('EMS: CARDIAC EMERGENCY', title)), aes(x = time_of_day)) + 
  geom_bar() + 
  labs(x="Time of day", y="Cardiac emergency call count") +
  ggtitle("Plot of cardiac emergency call count vs time of day")

Heat exhaustion emergency call count vs month

Below is a plot of heat exhaustion emergency calls vs month.
It can be observed that the number of heat exhaustion emergencies is higher in the month of July, which is the hottest summer month in Montgomery County.

library(stringr) 
calls_data <- mutate(emergency_calls_data, month=ordered(month.abb[strtoi(format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%m"))], levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))
calls_data <- filter(calls_data, !is.na(month))
ggplot(subset(calls_data, grepl('EMS: HEAT EXHAUSTION', title)), aes(x = month)) + 
  geom_bar() + 
  labs(x="Month", y="Heat exhaustion call count") +
  ggtitle("Plot of heat exhaustion calls vs month")

Vehicle accident emergency call count vs month

Below is a plot of vehicle accident emergency calls vs month.
It can be observed that the number of vehicle accidents is higher in the months of December and January, when winter is at its peak.
During winter, icy roads across the county likely contribute to the higher number of vehicle accidents. According to the National Highway Traffic Safety Administration (NHTSA), many fatalities involving an alcohol-impaired driver occur during the Christmas period.

library(stringr) 
calls_data <- mutate(emergency_calls_data, month=ordered(month.abb[strtoi(format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%m"))], levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))
calls_data <- filter(calls_data, !is.na(month))
ggplot(subset(calls_data, grepl('Traffic: VEHICLE ACCIDENT', title)), aes(x = month)) + 
  geom_bar() + 
  labs(x="Month", y="Vehicle accident emergency call count") +
  ggtitle("Plot of vehicle accident emergency calls vs month")
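If the dataset does not cover every calendar month the same number of times (for example, if it starts partway through one year and ends partway through another), raw monthly totals favor the months that appear more often. One possible correction is to average each month over the years in which it has data. The sketch below shows the idea on hypothetical timestamps.

```r
# Hypothetical timestamps: January appears in two years, August in one.
time_stamp <- c("2016-01-05 08:00:00", "2016-01-20 09:00:00",
                "2017-01-11 10:00:00", "2017-01-12 11:00:00",
                "2016-08-03 12:00:00", "2016-08-04 13:00:00",
                "2016-08-05 14:00:00")

# Read the year and month straight from the fixed-width timestamp text.
call_year <- substr(time_stamp, 1, 4)
call_month <- factor(month.abb[as.integer(substr(time_stamp, 6, 7))],
                     levels = month.abb)

# Count calls for every year-month pair.
year_month_counts <- table(call_year, call_month)

# Average each month over only the years in which that month has data.
# Months with no data at all come out as NaN (0 divided by 0).
years_observed <- colSums(year_month_counts > 0)
monthly_average <- colSums(year_month_counts) / years_observed
print(round(monthly_average[c("Jan", "Aug")], 2))
```

In this made-up example January has the larger total (4 calls vs 3), but August has the larger average per year.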

Reflection

Through this project I was able to apply the concepts I learned during the course to a real-world dataset. I learned how to handle large, highly granular datasets and create meaningful visualizations.

The number of emergency sub-categories is large, and the distinction between them may be unclear to a viewer unfamiliar with the data. Montgomery County also has a large number of townships. Analyzing emergency counts for every sub-category across every township is therefore not practical: the visualization would look cluttered, and patterns would be hard to notice. Hence the top 10 townships have been chosen for analysis and the emergency sub-categories have been grouped.

In the future, I would like to compare Montgomery County with other counties to contrast emergency patterns such as crime and traffic.
To improve my project, I would include more granular emergency categories (e.g. crime, minor ailments, serious injury) along with the predefined EMS, Traffic and Fire categories in further visualizations.
I would also see if it’s possible to group multiple geographically close counties into regions to avoid overwhelming a viewer with tens of counties.

Conclusion

The visualizations help us understand what kind of emergency occurs most frequently in Montgomery County as a whole as well as in every township. We can compare the number and type of emergencies that occur across townships. The emergency response team can make better preparations with this knowledge and can focus its efforts more on townships that have higher counts of serious emergencies.

The variation of call counts across years, months, weekdays and times of day can give us insights into crime, traffic patterns and the effects of weather in Montgomery County. Vehicle accidents were high from 2016 to 2019. It can also be observed that heat exhaustion cases are higher during peak summer, cardiac emergencies are higher during morning hours, and traffic related emergencies occur more on weekdays, especially Fridays, and during peak winter.

We can conclude that vehicle accidents contribute the most to emergency calls in Montgomery County. Lower Merion accounts for the most emergency calls in the county, followed by Abington and Norristown.

Bibliography

Emergency 911 calls dataset - https://www.kaggle.com/mchirico/montcoalert/version/32
R Language - https://rpubs.com/martike8/858127
Data Visualization with R - https://rkabacoff.github.io/datavis/
Top 50 ggplot2 visualizations - http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
