Final paper submission
I have chosen the Emergency - 911 calls dataset from Kaggle (https://www.kaggle.com/mchirico/montcoalert/version/32) for my final project. The dataset contains emergency 911 calls in Montgomery County, Pennsylvania from 2015 to 2020.
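Below are the variables in the dataset: lat (latitude), lng (longitude), desc (description of the call), zip (ZIP code), title (emergency category and sub-category), timeStamp (date and time of the call), twp (township), addr (address) and e (a column that is 1 in every previewed row).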
Below is the code snippet to read and preview the data.
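A minimal sketch of such a snippet, assuming the Kaggle CSV has been saved locally as 911.csv (a hypothetical file name) with the columns in the order shown in the preview. So that the sketch runs without the download, it reads the first two previewed rows from an inline string instead:

```r
# Assumption: with the downloaded file, the read would be something like
#   emergency_calls_data <- read.csv("911.csv")   # hypothetical file name
# Here two rows copied from the preview are read from an inline string instead.
sample_csv <- c(
  'lat,lng,desc,zip,title,timeStamp,twp,addr,e',
  '40.29788,-75.58129,"REINDEER CT & DEAD END; NEW HANOVER; Station 332; 2015-12-10 @ 17:10:52;",19525,EMS: BACK PAINS/INJURY,2015-12-10 17:10:52,NEW HANOVER,REINDEER CT & DEAD END,1',
  '40.25806,-75.26468,"BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP; Station 345; 2015-12-10 @ 17:29:21;",19446,EMS: DIABETIC EMERGENCY,2015-12-10 17:29:21,HATFIELD TOWNSHIP,BRIAR PATH & WHITEMARSH LN,1'
)
emergency_calls_data <- read.csv(text = sample_csv, stringsAsFactors = FALSE)

# Preview the first rows and the column types
head(emergency_calls_data)
str(emergency_calls_data)
```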
lat lng
1 40.29788 -75.58129
2 40.25806 -75.26468
3 40.12118 -75.35198
4 40.11615 -75.34351
5 40.25149 -75.60335
6 40.25347 -75.28324
desc
1 REINDEER CT & DEAD END; NEW HANOVER; Station 332; 2015-12-10 @ 17:10:52;
2 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP; Station 345; 2015-12-10 @ 17:29:21;
3 HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-Station:STA27;
4 AIRY ST & SWEDE ST; NORRISTOWN; Station 308A; 2015-12-10 @ 16:47:36;
5 CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; Station 329; 2015-12-10 @ 16:56:52;
6 CANNON AVE & W 9TH ST; LANSDALE; Station 345; 2015-12-10 @ 15:39:04;
zip title timeStamp twp
1 19525 EMS: BACK PAINS/INJURY 2015-12-10 17:10:52 NEW HANOVER
2 19446 EMS: DIABETIC EMERGENCY 2015-12-10 17:29:21 HATFIELD TOWNSHIP
3 19401 Fire: GAS-ODOR/LEAK 2015-12-10 14:39:21 NORRISTOWN
4 19401 EMS: CARDIAC EMERGENCY 2015-12-10 16:47:36 NORRISTOWN
5 NA EMS: DIZZINESS 2015-12-10 16:56:52 LOWER POTTSGROVE
6 19446 EMS: HEAD INJURY 2015-12-10 15:39:04 LANSDALE
addr e
1 REINDEER CT & DEAD END 1
2 BRIAR PATH & WHITEMARSH LN 1
3 HAWS AVE 1
4 AIRY ST & SWEDE ST 1
5 CHERRYWOOD CT & DEAD END 1
6 CANNON AVE & W 9TH ST 1
Below are the means of latitude and longitude columns.
mean_latitude mean_longitude
1 40.15816 -75.3001
Below are the medians of latitude and longitude columns.
median_latitude median_longitude
1 40.14393 -75.30514
Below are the standard deviations of latitude and longitude columns.
sd_latitude sd_longitude
1 0.2206414 1.672884
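Summary statistics of this kind could be computed along the following lines. This is a sketch in base R, not necessarily the code behind the figures above; it uses only the six coordinate pairs from the data preview, so its results differ from the full-dataset values reported above. The names coords and summarise_coords are hypothetical.

```r
# Six (lat, lng) pairs copied from the data preview; the full dataset is far larger
coords <- data.frame(
  lat = c(40.29788, 40.25806, 40.12118, 40.11615, 40.25149, 40.25347),
  lng = c(-75.58129, -75.26468, -75.35198, -75.34351, -75.60335, -75.28324)
)

# Apply one statistic to both columns; na.rm = TRUE skips missing coordinates
summarise_coords <- function(f) sapply(coords[c("lat", "lng")], f, na.rm = TRUE)

# One row per statistic, one column per coordinate
coord_summary <- rbind(
  mean   = summarise_coords(mean),
  median = summarise_coords(median),
  sd     = summarise_coords(sd)
)
print(coord_summary)
```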
Vehicle accidents account for the largest share of emergency calls in Montgomery County, so the visualizations that follow examine them in more detail.
library(dplyr)    # mutate()
library(ggplot2)
# Label each call as a vehicle accident or not, based on its title
calls_data <- mutate(emergency_calls_data, broad_category = ifelse(grepl("VEHICLE ACCIDENT", title), "Vehicle accident", "Other"))
ggplot(calls_data, aes(x = broad_category)) +
  geom_bar() +
  labs(x="Emergency sub-category", y="Frequency") +
  ggtitle("Plot of emergency sub-category vs frequency")
Montgomery County has a large number of townships, so only the 10 townships with the highest number of emergency calls are considered. Lower Merion accounts for the most emergency calls in the county, followed by Abington and Norristown.
LOWER MERION ABINGTON NORRISTOWN UPPER MERION
55490 39947 37633 36010
CHELTENHAM POTTSTOWN UPPER MORELAND LOWER PROVIDENCE
30574 27387 22932 22476
PLYMOUTH UPPER DUBLIN
20116 18862
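Counts like these can be obtained by tabulating the twp column and keeping the ten largest entries. A minimal sketch in base R on a small made-up vector of township names (the toy_twp values are illustrative, not the real data):

```r
# Toy township column; in the analysis this would be emergency_calls_data$twp
toy_twp <- c("LOWER MERION", "ABINGTON", "LOWER MERION", "NORRISTOWN",
             "ABINGTON", "LOWER MERION", "PLYMOUTH")

# Count calls per township, sort from most to least, keep at most the top 10
township_counts <- sort(table(toy_twp), decreasing = TRUE)
top_townships <- head(township_counts, 10)
print(top_townships)
```

The names of the result, names(top_townships), can then be used to filter the data with twp %in% names(top_townships).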
Below is a plot of township vs frequency of calls, grouped by emergency category, for the top ten townships. The dataset has three major predefined categories: EMS, Traffic and Fire. EMS includes serious illnesses or injuries such as weakness, head injuries and seizures. Traffic covers vehicle accidents, disabled vehicles and similar incidents. Fire includes accidents resulting from any kind of fire, in a building or outside.
This visualization helps us understand what kind of emergencies occur more frequently in each township in Montgomery County. It also helps us compare the number of emergencies that occur across townships. The emergency response team can focus its efforts more on townships that have higher counts of emergencies.
library(dplyr)    # mutate()
library(stringr)  # word(), fixed()
# The broad category is the text before the colon in the title
calls_data <- mutate(emergency_calls_data, emergency_category=word(title, sep = fixed(":")))
top_townships <- c("LOWER MERION", "ABINGTON", "NORRISTOWN", "UPPER MERION", "CHELTENHAM", "POTTSTOWN", "UPPER MORELAND", "LOWER PROVIDENCE", "PLYMOUTH", "UPPER DUBLIN")
ggplot(subset(calls_data, twp %in% top_townships), aes(x = twp, fill = emergency_category)) +
  geom_bar(position = "stack") +
  theme(axis.text.x=element_text(angle=90)) +
  labs(x="Township", y="Frequency") +
  ggtitle("Plot of township vs frequency grouped by category")
Below is a plot of emergency call count vs year.
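A plot of this kind could be produced as sketched below, following the same pattern as the other plots in this paper. The toy_calls data frame is made up for illustration, and tz = "UTC" is an assumption added so that the year is extracted the same way on every machine.

```r
library(ggplot2)

# Made-up timestamps in the same format as the timeStamp column
toy_calls <- data.frame(
  timeStamp = c("2015-12-10 17:10:52", "2016-03-01 08:15:00",
                "2016-07-04 21:30:10", "2017-01-15 12:00:00"),
  stringsAsFactors = FALSE
)

# Parse each timestamp and keep only its four-digit year
toy_calls$year <- format(
  as.POSIXct(toy_calls$timeStamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
  format = "%Y"
)

# One bar per year; bar height is the number of calls in that year
year_plot <- ggplot(toy_calls, aes(x = year)) +
  geom_bar() +
  labs(x = "Year", y = "Emergency call count") +
  ggtitle("Plot of call count vs year")
print(year_plot)
```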
Below is a plot of emergency call count vs month. The counts do not differ markedly from month to month.
library(dplyr)  # %>%, mutate(), filter()
emergency_calls_data %>%
  # Map each timestamp to an ordered month abbreviation; as.integer() reads "08" and "09" as decimal 8 and 9
  mutate(month=ordered(month.abb[as.integer(format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%m"))], levels=month.abb)) %>%
  filter(!is.na(month)) %>%
  ggplot(aes(x = month)) +
geom_bar() +
labs(x="Month", y="Emergency call count") +
ggtitle("Plot of call count vs month")
Below is a bar plot of emergency call counts vs year, grouped by broad emergency category (EMS, Fire and Traffic).
emergency_calls_data %>%
mutate(emergency_category=word(title, sep = fixed(":")), year=format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%Y")) %>%
ggplot(aes(x = year,
fill = emergency_category)) +
geom_bar(position = "stack") +
labs(x="Year", y="Emergency call count") +
ggtitle("Plot of call count vs year grouped by emergency category")
day_time <- as.POSIXct(strptime(c("000000","040000","114500","170000","193000","235959"),
"%H%M%S"),"UTC")
labels = c("night","morning","afternoon","evening","night")
calls_data <- mutate(emergency_calls_data, time_of_day=ordered(labels[findInterval(as.POSIXct(strptime(format(strptime(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%H%M%S"), format="%H%M%S"), "UTC"), day_time)], levels=c("morning", "afternoon", "evening", "night")))
calls_data <- filter(calls_data, !is.na(time_of_day))
ggplot(calls_data, aes(x = time_of_day)) +
geom_bar() +
labs(x="Time of day", y="Emergency call count") +
ggtitle("Plot of emergency call count vs time of day")
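The bucketing in the code above maps each call time to a label by finding which pair of breakpoints it falls between. The same idea can be sketched more compactly with seconds since midnight. The helper name time_of_day_label is hypothetical, the breakpoints are taken from the code above, and tz = "UTC" is an assumption that only keeps the clock time from being shifted.

```r
time_of_day_label <- function(time_stamp) {
  parsed <- as.POSIXlt(time_stamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  seconds <- parsed$hour * 3600 + parsed$min * 60 + parsed$sec
  # Lower bounds of each bucket: 00:00, 04:00, 11:45, 17:00, 19:30
  breaks <- c(0, 4 * 3600, 11 * 3600 + 45 * 60, 17 * 3600, 19 * 3600 + 30 * 60)
  labels <- c("night", "morning", "afternoon", "evening", "night")
  # findInterval() returns the index of the last breakpoint <= seconds
  ordered(labels[findInterval(seconds, breaks)],
          levels = c("morning", "afternoon", "evening", "night"))
}

time_of_day_label(c("2015-12-10 03:00:00", "2015-12-10 14:39:21",
                    "2015-12-10 17:10:52", "2015-12-10 23:59:59"))
```

Because the breakpoints are lower bounds only, a call at 23:59:59 falls into the final night bucket.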
Since vehicle accidents account for the largest share of emergency calls in Montgomery County, here is a plot of vehicle accidents vs year. Vehicle accidents are high during the period from 2016 to 2019.
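library(dplyr)  # mutate()
calls_data <- mutate(emergency_calls_data, year=format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%Y"))
# Keep only vehicle accident calls. The filter must be a logical test:
# an argument written as title="..." is silently ignored by subset(), which then keeps every row.
ggplot(subset(calls_data, grepl("Traffic: VEHICLE ACCIDENT", title, fixed = TRUE)), aes(x = year)) +
  geom_bar() +
  labs(x="Year", y="Vehicle accident calls") +
  ggtitle("Plot of vehicle accident calls vs year")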
library(stringr)
calls_data <- mutate(emergency_calls_data, day=ordered(weekdays(as.Date(timeStamp)), levels=c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")))
calls_data <- filter(calls_data, !is.na(day))
ggplot(subset(calls_data, grepl('Traffic:', title)), aes(x = day)) +
geom_bar() +
labs(x="Weekday", y="Traffic emergency call count") +
ggtitle("Plot of traffic emergency call count vs weekday")
day_time <- as.POSIXct(strptime(c("000000","040000","114500","170000","193000","235959"),
"%H%M%S"),"UTC")
labels = c("night","morning","afternoon","evening","night")
calls_data <- mutate(emergency_calls_data, time_of_day=ordered(labels[findInterval(as.POSIXct(strptime(format(strptime(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%H%M%S"), format="%H%M%S"), "UTC"), day_time)], levels=c("morning", "afternoon", "evening", "night")))
calls_data <- filter(calls_data, !is.na(time_of_day))
ggplot(subset(calls_data, grepl('Traffic:', title)), aes(x = time_of_day)) +
geom_bar() +
labs(x="Time of day", y="Traffic emergency call count") +
ggtitle("Plot of traffic emergency call count vs time of day")
day_time <- as.POSIXct(strptime(c("000000","040000","114500","170000","193000","235959"),
"%H%M%S"),"UTC")
labels = c("night","morning","afternoon","evening","night")
calls_data <- mutate(emergency_calls_data, time_of_day=ordered(labels[findInterval(as.POSIXct(strptime(format(strptime(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%H%M%S"), format="%H%M%S"), "UTC"), day_time)], levels=c("morning", "afternoon", "evening", "night")))
calls_data <- filter(calls_data, !is.na(time_of_day))
ggplot(subset(calls_data, grepl('EMS: CARDIAC EMERGENCY', title)), aes(x = time_of_day)) +
geom_bar() +
labs(x="Time of day", y="Cardiac emergency call count") +
ggtitle("Plot of cardiac emergency call count vs time of day")
library(dplyr)  # mutate(), filter()
# Map each timestamp to an ordered month abbreviation; as.integer() reads "08" and "09" as decimal 8 and 9
calls_data <- mutate(emergency_calls_data, month=ordered(month.abb[as.integer(format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%m"))], levels=month.abb))
calls_data <- filter(calls_data, !is.na(month))
ggplot(subset(calls_data, grepl('EMS: HEAT EXHAUSTION', title)), aes(x = month)) +
geom_bar() +
labs(x="Month", y="Heat exhaustion call count") +
ggtitle("Plot of heat exhaustion calls vs month")
library(dplyr)  # mutate(), filter()
# Map each timestamp to an ordered month abbreviation; as.integer() reads "08" and "09" as decimal 8 and 9
calls_data <- mutate(emergency_calls_data, month=ordered(month.abb[as.integer(format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%m"))], levels=month.abb))
calls_data <- filter(calls_data, !is.na(month))
ggplot(subset(calls_data, grepl('Traffic: VEHICLE ACCIDENT', title)), aes(x = month)) +
geom_bar() +
labs(x="Month", y="Vehicle accident emergency call count") +
ggtitle("Plot of vehicle accident emergency calls vs month")
Through this project I was able to apply the concepts I learned during the course to a real-world dataset. I learned how to handle large, highly granular data and how to create meaningful visualizations from it.
The number of emergency sub-categories is large, and the distinction between them may be unclear to a viewer unfamiliar with the data. Montgomery County also has a large number of townships. Analyzing the counts of every sub-category across every township is therefore not feasible: the resulting visualization would be cluttered, and patterns would be hard to notice. Hence the top 10 townships have been chosen for analysis and the emergency sub-categories have been grouped.
The visualizations help us understand what kind of emergency occurs most frequently in Montgomery County as a whole as well as in every township. We can compare the number and type of emergencies that occur across townships. The emergency response team can make better preparations with this knowledge and can focus its efforts more on townships that have higher counts of serious emergencies.
The variation of call counts across years, months, weekdays and times of day can give us insights into crime, traffic patterns and the effects of weather in Montgomery County. Vehicle accidents were high during the years 2016 to 2019. It can also be observed that heat exhaustion cases are higher during peak summer, cardiac emergencies are higher during morning hours, and traffic-related emergencies occur more on weekdays, especially Fridays, and during peak winter.
We can conclude that vehicle accidents account for the largest share of emergency calls in Montgomery County. Lower Merion accounts for the most emergency calls in the county, followed by Abington and Norristown.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Murulidhara (2022, Jan. 25). Data Analytics and Computational Social Science: Data Science Fundamentals Paper. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscombrinda857122/
BibTeX citation
@misc{murulidhara2022data,
  author = {Murulidhara, Brinda},
  title = {Data Analytics and Computational Social Science: Data Science Fundamentals Paper},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscombrinda857122/},
  year = {2022}
}