Data Science Fundamentals Paper

Final paper submission

Brinda Murulidhara
2022-01-25

Introduction

I have chosen the Emergency - 911 calls dataset from Kaggle (https://www.kaggle.com/mchirico/montcoalert/version/32) for my final project. The dataset contains emergency 911 calls in Montgomery County, Pennsylvania from 2015 to 2020.

Research questions

Data

Variables in the dataset

Below are the variables in the dataset:

  1. lat: Latitude of the location where the emergency occurred. The data type is double.
  2. lng: Longitude of the location where the emergency occurred. The data type is double.
  3. desc: Description of the emergency. The data type is String.
  4. zip: ZIP code of the location where the emergency occurred. The data type is integer, but this variable is interpreted as a factor.
  5. title: Type of emergency. The data type is String.
  6. timeStamp: Date and time of the emergency call. The data type is String.
  7. twp: Township where the emergency occurred. The data type is String.
  8. addr: Address of the emergency location. The data type is String.
  9. e: Index column whose value is always 1. The data type is integer.

Below is the code snippet to read and preview the data.

library(dplyr)
emergency_calls_data <- read.csv("911.csv")
head(emergency_calls_data)
       lat       lng
1 40.29788 -75.58129
2 40.25806 -75.26468
3 40.12118 -75.35198
4 40.11615 -75.34351
5 40.25149 -75.60335
6 40.25347 -75.28324
                                                                                 desc
1           REINDEER CT & DEAD END;  NEW HANOVER; Station 332; 2015-12-10 @ 17:10:52;
2 BRIAR PATH & WHITEMARSH LN;  HATFIELD TOWNSHIP; Station 345; 2015-12-10 @ 17:29:21;
3                          HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-Station:STA27;
4               AIRY ST & SWEDE ST;  NORRISTOWN; Station 308A; 2015-12-10 @ 16:47:36;
5    CHERRYWOOD CT & DEAD END;  LOWER POTTSGROVE; Station 329; 2015-12-10 @ 16:56:52;
6               CANNON AVE & W 9TH ST;  LANSDALE; Station 345; 2015-12-10 @ 15:39:04;
    zip                   title           timeStamp               twp
1 19525  EMS: BACK PAINS/INJURY 2015-12-10 17:10:52       NEW HANOVER
2 19446 EMS: DIABETIC EMERGENCY 2015-12-10 17:29:21 HATFIELD TOWNSHIP
3 19401     Fire: GAS-ODOR/LEAK 2015-12-10 14:39:21        NORRISTOWN
4 19401  EMS: CARDIAC EMERGENCY 2015-12-10 16:47:36        NORRISTOWN
5    NA          EMS: DIZZINESS 2015-12-10 16:56:52  LOWER POTTSGROVE
6 19446        EMS: HEAD INJURY 2015-12-10 15:39:04          LANSDALE
                        addr e
1     REINDEER CT & DEAD END 1
2 BRIAR PATH & WHITEMARSH LN 1
3                   HAWS AVE 1
4         AIRY ST & SWEDE ST 1
5   CHERRYWOOD CT & DEAD END 1
6      CANNON AVE & W 9TH ST 1
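
The variable types listed above can be made explicit once the data is loaded. Below is a minimal sketch, assuming a three-row sample copied from the preview stands in for the full 911.csv; the names sample_calls and typed_calls are hypothetical, and the UTC time zone is an assumption made only to keep the parsing independent of the local machine.

```r
library(dplyr)

# Hypothetical three-row sample copied from the data preview; stands in for 911.csv
sample_calls <- data.frame(
  zip = c(19525L, 19446L, NA),
  title = c("EMS: BACK PAINS/INJURY", "EMS: DIABETIC EMERGENCY", "EMS: DIZZINESS"),
  timeStamp = c("2015-12-10 17:10:52", "2015-12-10 17:29:21", "2015-12-10 16:56:52"),
  stringsAsFactors = FALSE
)

typed_calls <- sample_calls %>%
  mutate(
    # ZIP codes are labels, not quantities, so treat them as a factor
    zip = factor(zip),
    # Parse the timestamp string into a date-time (UTC is an assumption)
    timeStamp = as.POSIXct(timeStamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  )

# Inspect the resulting column types
str(typed_calls)
```

Converting timeStamp once, up front, would avoid repeating the same as.POSIXct() call in every plot.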

Descriptive statistics

Mean

Below are the means of latitude and longitude columns.

summarise(emergency_calls_data, mean_latitude=mean(lat), mean_longitude=mean(lng))
  mean_latitude mean_longitude
1      40.15816       -75.3001

Median

Below are the medians of latitude and longitude columns.

summarise(emergency_calls_data, median_latitude=median(lat), median_longitude=median(lng))
  median_latitude median_longitude
1        40.14393        -75.30514

Standard Deviation

Below are the standard deviations of latitude and longitude columns.

summarise(emergency_calls_data, sd_latitude=sd(lat), sd_longitude=sd(lng))
  sd_latitude sd_longitude
1   0.2206414     1.672884
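
The three summaries above can also be computed in a single call. The sketch below assumes dplyr 1.0.0 or later (for across()) and uses only the six coordinate pairs shown in the data preview, so its numbers will not match the full-dataset values reported above; sample_coords and coord_summary are hypothetical names, and na.rm = TRUE is a defensive assumption.

```r
library(dplyr)

# Hypothetical sample: the six coordinate pairs shown in the data preview
sample_coords <- data.frame(
  lat = c(40.29788, 40.25806, 40.12118, 40.11615, 40.25149, 40.25347),
  lng = c(-75.58129, -75.26468, -75.35198, -75.34351, -75.60335, -75.28324)
)

# Apply mean, median and sd to both columns in one summarise() call;
# result columns are named <column>_<statistic>, e.g. lat_mean
coord_summary <- sample_coords %>%
  summarise(across(
    c(lat, lng),
    list(
      mean = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      sd = ~ sd(.x, na.rm = TRUE)
    )
  ))

print(coord_summary)
```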

Visualization

High-level visualizations

Emergency subcategory vs frequency

The plot below compares vehicle accident calls with all other calls. Since vehicle accidents are the most frequent emergency sub-category in Montgomery County, they are analyzed further in the upcoming visualizations.

library(ggplot2)
calls_data <- mutate(emergency_calls_data, broad_category = ifelse(grepl("VEHICLE ACCIDENT", title), "Vehicle accident", "Other"))
ggplot(calls_data, aes(x = broad_category)) +
        geom_bar() + 
        labs(x="Emergency sub-category", y="Frequency") + 
        ggtitle("Plot of emergency sub-category vs frequency")

Township vs frequency

The number of townships in Montgomery County is large. Hence, only the 10 townships with the highest number of emergency calls are plotted. Lower Merion accounts for the most emergency calls in the county, followed by Abington and Norristown.

sort(table(emergency_calls_data[, "twp"]), decreasing=TRUE, na.last = TRUE)[1:10]

    LOWER MERION         ABINGTON       NORRISTOWN     UPPER MERION 
           55490            39947            37633            36010 
      CHELTENHAM        POTTSTOWN   UPPER MORELAND LOWER PROVIDENCE 
           30574            27387            22932            22476 
        PLYMOUTH     UPPER DUBLIN 
           20116            18862 
library(stringr) 
calls_data <- mutate(emergency_calls_data, emergency_category=word(title, sep = fixed(":")))
ggplot(subset(calls_data, twp %in% c("LOWER MERION", "ABINGTON", "NORRISTOWN", "UPPER MERION", "CHELTENHAM", "POTTSTOWN", "UPPER MORELAND", "LOWER PROVIDENCE", "PLYMOUTH", "UPPER DUBLIN")), aes(x = twp, 
           fill = emergency_category)) + 
  geom_bar(position = "stack") + 
  theme(axis.text.x=element_text(angle=90)) +
  labs(x="Township", y="Frequency") + 
  ggtitle("Plot of township vs frequency grouped by category")
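
The township names in the plot above are typed in by hand from the frequency table. As a hedged alternative, the sketch below derives the top townships programmatically, so the selection follows the data if it changes. The toy data frame, the names toy_calls, n_top, top_townships and top_calls, and the handling of empty township values are all assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data; in the paper this would be emergency_calls_data
toy_calls <- data.frame(
  twp = c("LOWER MERION", "LOWER MERION", "LOWER MERION",
          "ABINGTON", "ABINGTON", "NORRISTOWN", ""),
  stringsAsFactors = FALSE
)

# Number of townships to keep (10 in the paper, 2 here for the toy data)
n_top <- 2

top_townships <- toy_calls %>%
  filter(!is.na(twp), twp != "") %>%   # drop rows without a township
  count(twp, sort = TRUE) %>%          # calls per township, largest first
  head(n_top) %>%
  pull(twp)

# Keep only the calls from those townships, ready to pass to ggplot()
top_calls <- filter(toy_calls, twp %in% top_townships)
print(top_townships)
```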

Emergency call count vs year

Below is a plot of emergency call count vs year.

library(stringr) 
emergency_calls_data  %>%
mutate(year=format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%Y")) %>%
ggplot(aes(x = year)) + 
  geom_bar() + 
  labs(x="Year", y="Emergency call count") +
  ggtitle("Plot of call count vs year")

Emergency call count vs month

Below is a plot of emergency call count vs month. The plot shows no pronounced difference in call counts across months.

library(stringr) 
emergency_calls_data  %>%
# Extract the month number from the timestamp and map it to an ordered abbreviation (Jan to Dec)
mutate(month=ordered(month.abb[strtoi(format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%m"))], levels=month.abb)) %>%
# Drop rows whose timestamp could not be parsed
filter(!is.na(month)) %>%
ggplot(aes(x = month)) + 
  geom_bar() + 
  labs(x="Month", y="Emergency call count") +
  ggtitle("Plot of call count vs month")
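
The visual comparison across months can also be checked numerically. The sketch below, which assumes a toy set of four timestamps in place of the full dataset, tabulates the count and the share of calls per month; toy_calls and monthly_share are hypothetical names, and the UTC time zone is an assumption.

```r
library(dplyr)

# Hypothetical toy timestamps; in the paper this would be emergency_calls_data
toy_calls <- data.frame(
  timeStamp = c("2016-01-05 08:00:00", "2016-01-20 09:30:00",
                "2016-02-11 14:15:00", "2016-07-04 22:45:00"),
  stringsAsFactors = FALSE
)

monthly_share <- toy_calls %>%
  mutate(
    parsed = as.POSIXct(timeStamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    # Month number (1 to 12) mapped to an abbreviation with calendar ordering
    month = factor(month.abb[as.integer(format(parsed, "%m"))], levels = month.abb)
  ) %>%
  filter(!is.na(month)) %>%        # drop timestamps that failed to parse
  count(month) %>%                 # calls per month
  mutate(share = n / sum(n))       # fraction of all calls in each month

print(monthly_share)
```

If the shares are all close to 1/12 on the full data, that would support the reading of the plot; months differ in length, so small deviations are expected.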

Emergency call count vs year grouped by emergency category

Below is a bar plot of emergency call counts vs year, grouped by broad emergency category (EMS, Fire and Traffic).

emergency_calls_data  %>%
mutate(emergency_category=word(title, sep = fixed(":")), year=format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%Y")) %>%
ggplot(aes(x = year, 
           fill = emergency_category)) + 
  geom_bar(position = "stack") + 
  labs(x="Year", y="Emergency call count") + 
  ggtitle("Plot of call count vs year grouped by emergency category")

Emergency call count vs time of the day

# Breakpoints that split the day into periods (clock times on a common date, UTC)
day_time <- as.POSIXct(strptime(c("000000","040000","114500","170000","193000","235959"),
                      "%H%M%S"),"UTC")
# Period that starts at each breakpoint; night wraps around midnight
labels <- c("night","morning","afternoon","evening","night")

# Reduce each timestamp to its clock time and look up the period it falls in
calls_data <- mutate(emergency_calls_data, time_of_day=ordered(labels[findInterval(as.POSIXct(strptime(format(strptime(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%H%M%S"), format="%H%M%S"), "UTC"), day_time)], levels=c("morning", "afternoon", "evening", "night")))
# Drop rows whose timestamp could not be parsed
calls_data <- filter(calls_data, !is.na(time_of_day))
ggplot(calls_data, aes(x = time_of_day)) + 
  geom_bar() + 
  labs(x="Time of day", y="Emergency call count") +
  ggtitle("Plot of emergency call count vs time of day")
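
The time-of-day binning above is repeated in several later plots. As a hedged sketch, the same boundaries (04:00, 11:45, 17:00 and 19:30) can be wrapped in one helper that works on fractional hours instead of date-times. The function name time_of_day and the UTC time zone are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: maps "YYYY-mm-dd HH:MM:SS" strings to a time-of-day label
# using the boundaries 04:00, 11:45, 17:00 and 19:30
time_of_day <- function(time_stamp) {
  parsed <- as.POSIXlt(time_stamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  # Clock time as fractional hours since midnight, e.g. 11:45 becomes 11.75
  hours <- parsed$hour + parsed$min / 60 + parsed$sec / 3600
  period <- cut(
    hours,
    breaks = c(0, 4, 11.75, 17, 19.5, 24),
    labels = FALSE,   # return the interval index instead of a factor
    right = FALSE     # intervals are closed on the left, e.g. [04:00, 11:45)
  )
  period_names <- c("night", "morning", "afternoon", "evening", "night")
  factor(period_names[period],
         levels = c("morning", "afternoon", "evening", "night"))
}

# Example: one timestamp just before 04:00, one at 04:00, and two later in the day
time_of_day(c("2015-12-10 03:59:59", "2015-12-10 04:00:00",
              "2015-12-10 17:10:52", "2015-12-10 23:59:59"))
```

With these boundaries the periods have different lengths (morning 7.75 hours, afternoon 5.25, evening 2.5, night 8.5), so raw counts per period are not directly comparable as rates; dividing each count by the period length would be one way to adjust for that.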

Low-level visualizations

Vehicle accidents vs year

Since vehicle accidents are the most frequent emergency sub-category in Montgomery County, here is a plot of vehicle accident calls vs year. Vehicle accident calls are high during the period from 2016 to 2019.

library(stringr) 
calls_data <- mutate(emergency_calls_data, year=format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%Y"))
# Keep only the rows whose title marks a vehicle accident
ggplot(subset(calls_data, grepl("Traffic: VEHICLE ACCIDENT", title, fixed = TRUE)), aes(x = year)) + 
  geom_bar() + 
  labs(x="Year", y="Vehicle accident calls") +
  ggtitle("Plot of vehicle accident calls vs year")

Traffic emergency call count vs weekday

library(stringr) 
calls_data <- mutate(emergency_calls_data, day=ordered(weekdays(as.Date(timeStamp)), levels=c("Monday", "Tuesday", "Wednesday", "Thursday", 
"Friday", "Saturday", "Sunday")))
calls_data <- filter(calls_data, !is.na(day))
ggplot(subset(calls_data,  grepl('Traffic:', title)), aes(x = day)) + 
  geom_bar() + 
  labs(x="Weekday", y="Traffic emergency call count") +
  ggtitle("Plot of traffic emergency call count vs weekday")

Traffic emergency call count vs time of the day

day_time <- as.POSIXct(strptime(c("000000","040000","114500","170000","193000","235959"),
                      "%H%M%S"),"UTC")
labels = c("night","morning","afternoon","evening","night")


calls_data <- mutate(emergency_calls_data, time_of_day=ordered(labels[findInterval(as.POSIXct(strptime(format(strptime(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%H%M%S"), format="%H%M%S"), "UTC"), day_time)], levels=c("morning", "afternoon", "evening", "night")))
calls_data <- filter(calls_data, !is.na(time_of_day))
ggplot(subset(calls_data,  grepl('Traffic:', title)), aes(x = time_of_day)) + 
  geom_bar() + 
  labs(x="Time of day", y="Traffic emergency call count") +
  ggtitle("Plot of traffic emergency call count vs time of day")

Cardiac emergency call count vs time of the day

day_time <- as.POSIXct(strptime(c("000000","040000","114500","170000","193000","235959"),
                      "%H%M%S"),"UTC")
labels = c("night","morning","afternoon","evening","night")


calls_data <- mutate(emergency_calls_data, time_of_day=ordered(labels[findInterval(as.POSIXct(strptime(format(strptime(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%H%M%S"), format="%H%M%S"), "UTC"), day_time)], levels=c("morning", "afternoon", "evening", "night")))
calls_data <- filter(calls_data, !is.na(time_of_day))
ggplot(subset(calls_data,  grepl('EMS: CARDIAC EMERGENCY', title)), aes(x = time_of_day)) + 
  geom_bar() + 
  labs(x="Time of day", y="Cardiac emergency call count") +
  ggtitle("Plot of cardiac emergency call count vs time of day")

Heat exhaustion emergency call count vs month

library(stringr) 
calls_data <- mutate(emergency_calls_data, month=ordered(month.abb[strtoi(format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%m"))], levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))
calls_data <- filter(calls_data, !is.na(month))
ggplot(subset(calls_data,  grepl('EMS: HEAT EXHAUSTION', title)), aes(x = month)) + 
  geom_bar() + 
  labs(x="Month", y="Heat exhaustion call count") +
  ggtitle("Plot of heat exhaustion calls vs month")

Vehicle accident emergency call count vs month

library(stringr) 
calls_data <- mutate(emergency_calls_data, month=ordered(month.abb[strtoi(format(as.POSIXct(timeStamp, format="%Y-%m-%d %H:%M:%S"), format="%m"))], levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))
calls_data <- filter(calls_data, !is.na(month))
ggplot(subset(calls_data,  grepl('Traffic: VEHICLE ACCIDENT', title)), aes(x = month)) + 
  geom_bar() + 
  labs(x="Month", y="Vehicle accident emergency call count") +
  ggtitle("Plot of vehicle accident emergency calls vs month")

Reflection

Through this project I was able to apply the concepts I learned during the course to a real-world dataset. I learned how to handle a large, fine-grained dataset and create meaningful visualizations from it.

The number of emergency sub-categories is large, and the distinction between them may be unclear to a viewer unfamiliar with the data. Montgomery County also has a large number of townships. Plotting every sub-category against every township is therefore not feasible: the visualization would be cluttered and patterns would be hard to notice. Hence only the top 10 townships have been chosen for analysis and the emergency sub-categories have been grouped.

Conclusion

The visualizations help us understand what kind of emergency occurs most frequently in Montgomery County as a whole as well as in every township. We can compare the number and type of emergencies that occur across townships. The emergency response team can make better preparations with this knowledge and can focus its efforts more on townships that have higher counts of serious emergencies.

The variation of call counts across years, months, weekdays and times of day can give us insights into traffic patterns and the effects of weather in Montgomery County. Vehicle accidents have been high during the years 2016 to 2019. It can also be observed that heat exhaustion cases are higher during peak summer, cardiac emergency calls are higher during morning hours, and traffic-related emergencies occur more on weekdays, especially Fridays, and during peak winter.

We can conclude that vehicle accidents are the most frequent cause of emergency calls in Montgomery County. Lower Merion accounts for the most emergency calls in the county, followed by Abington and Norristown.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Murulidhara (2022, Jan. 25). Data Analytics and Computational Social Science: Data Science Fundamentals Paper. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscombrinda857122/

BibTeX citation

@misc{murulidhara2022data,
  author = {Murulidhara, Brinda},
  title = {Data Analytics and Computational Social Science: Data Science Fundamentals Paper},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscombrinda857122/},
  year = {2022}
}