DACSS 601: Data Science Fundamentals - FALL 2022
Final Project

Author

Siddharth Nammara Kalyana Raman

Published

December 14, 2022

Introduction

According to the Police Foundation, crime analysis is defined as the qualitative and quantitative study of crime and law enforcement information, in combination with socio-demographic and spatial factors, to apprehend criminals, prevent crime, reduce disorder, and evaluate organizational procedures.

The primary purpose of crime analysis is to support a police department’s operations. These activities include patrol deployment, crime prevention and reduction methods, problem-solving, evaluation and accountability of police actions, criminal investigation, arrest, and prosecution. Crime analysis would not be possible without the information that police forces collect.

In this project we take a small sample of Philadelphia crime data, perform some statistical analysis, and examine its trends. The dataset was taken from OpenDataPhilly, a portal for open data about the Philadelphia region.

Some of the questions I want to answer are:

What are the different categories of crime happening in Philadelphia, and what are the most common crimes?

What is the trend of crime as the years progress: are crimes increasing or decreasing? This will help us determine whether the strategies implemented by the police force to reduce the crime rate are working.

Which month has the most crimes?

Which hour of the day has the most crimes?

Which district in Philadelphia has the most crimes?

The answers to the last three questions will help us determine when and where security needs to be increased.

Code
#Loading libraries

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(dplyr)
library(summarytools)

Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view
Code
library(readxl)
Code
load("snkraman_final.RData")
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Import the Data

Importing the Philadelphia crime data into R.

Code
#The data frame 'dataset' was loaded from snkraman_final.RData in the setup chunk above
head(dataset)
# A tibble: 6 × 14
  Dc_Dist Psa   Dispatch_Date_Time  Year  Month Dispatch…¹  Hour  Dc_Key Locat…²
  <chr>   <chr> <dttm>              <chr> <chr> <time>     <dbl>   <dbl> <chr>  
1 35      D     2009-07-19 01:09:00 2009  07    01:09          1 2.01e11 5500 B…
2 09      R     2009-06-25 00:14:00 2009  06    00:14          0 2.01e11 1800 B…
3 17      1     2015-04-25 12:50:00 2015  04    12:50         12 2.02e11 800 BL…
4 23      K     2009-02-10 14:33:00 2009  02    14:33         14 2.01e11 2200 B…
5 22      3     2015-10-06 18:18:00 2015  10    18:18         18 2.02e11 1500 B…
6 22      3     2015-10-09 00:49:00 2015  10    00:49          0 2.02e11 1500 B…
# … with 5 more variables: UCR_General <dbl>, Text_General_Code <chr>,
#   Police_Districts <dbl>, Lon <dbl>, Lat <dbl>, and abbreviated variable
#   names ¹​Dispatch_Time, ²​Location_Block

Dataset Summary

The columns and their descriptions are as follows :

  1. Dc_Dist - A two-character field that names the District boundary.

  2. Psa - A single-character field that names the Police Service Area boundary.

  3. DC_Key - The unique identifier of the crime, composed of Year + District + Unique ID.

  4. Dispatch_Date_Time - The date and time that the officer was dispatched to the scene.

  5. Dispatch_Date - The dispatch date, formatted as character.

  6. Dispatch_Time - The dispatch time, formatted as character.

  7. Hour - The generalized hour of the dispatch time.

  8. Location_Block - The location of the crime, generalized to the street block.

  9. UCR_General - The Uniform Crime Reporting code, used to compare crimes across areas.

  10. Text_General_Code - The crime category.

  11. Police_Districts - The police district in which the crime happened.

  12. Month - The month in which the crime happened.

  13. Lon - Longitude of the crime location.

  14. Lat - Latitude of the crime location.
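
As an illustration of the DC_Key layout described above, a key can be split into its Year, District, and Unique ID parts. This is only a sketch on a made-up key value (not a real record), assuming the 12-digit layout YYYY | DD | UUUUUU suggested by the values in the data:

```r
# Hypothetical example: split a 12-digit Dc_Key into Year + District + Unique ID,
# assuming the layout YYYY|DD|UUUUUU described above.
dc_key <- "200935123456"              # made-up key, not from the dataset
year      <- substr(dc_key, 1, 4)     # "2009"
district  <- substr(dc_key, 5, 6)     # "35"
unique_id <- substr(dc_key, 7, 12)    # "123456"
paste(year, district, unique_id)
```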

Code
print(dfSummary(dataset, 
                varnumbers= FALSE, 
                plain.ascii= FALSE, 
                style= "grid", 
                graph.magnif= 0.80, 
                valid.col= TRUE),
      method= 'render', 
      table.classes= 'table-condensed')

Data Frame Summary

dataset

Dimensions: 2220256 x 14
Duplicates: 0

  • Dc_Dist [character]: top values 15 (8.3%), 24 (7.3%), 25 (6.8%), 19 (6.2%), 12 (5.9%), 35 (5.9%), 22 (5.7%), 14 (5.4%), 02 (5.2%), 18 (4.9%), 15 others (38.4%). Valid: 2220256 (100.0%), missing: 0.
  • Psa [character]: top values 2 (22.0%), 1 (20.1%), 3 (18.1%), 4 (2.7%), E (2.4%), D (2.3%), K (2.3%), J (2.3%), H (2.3%), F (2.2%), 20 others (23.4%). Valid: 2220256 (100.0%), missing: 0.
  • Dispatch_Date_Time [POSIXct]: min 2006-01-01, median 2011-03-09 03:11:00, max 2017-03-23 01:29:00 (range 11y 2m 22d); 1729275 distinct values. Valid: 2220256 (100.0%), missing: 0.
  • Year [character]: 2006 (10.5%), 2008 (10.0%), 2007 (10.0%), 2009 (9.2%), 2010 (8.9%), 2012 (8.8%), 2011 (8.7%), 2013 (8.3%), 2014 (8.3%), 2015 (8.2%), 2 others (9.0%). Valid: 2220256 (100.0%), missing: 0.
  • Month [character]: 08 (9.1%), 07 (9.0%), 05 (8.9%), 06 (8.7%), 03 (8.5%), 10 (8.5%), 04 (8.4%), 09 (8.4%), 01 (8.1%), 11 (7.7%), 2 others (14.6%). Valid: 2220256 (100.0%), missing: 0.
  • Dispatch_Time [hms]: min 0, median 52260, max 86340 (units: secs); 1440 distinct values. Valid: 2220256 (100.0%), missing: 0.
  • Hour [numeric]: mean 13.2 (sd 6.8); min ≤ med ≤ max: 0 ≤ 14 ≤ 23; IQR 10; 24 distinct values. Valid: 2220256 (100.0%), missing: 0.
  • Dc_Key [numeric]: mean 201097291687 (sd 323181332); min 199812085407, median 2.01106e+11, max 2.01777e+11; all 2220256 values distinct. Valid: 2220256 (100.0%), missing: 0.
  • Location_Block [character]: top values 4600 BLOCK E ROOSEVELT BL (0.2%), 1000 BLOCK MARKET ST (0.2%), 5200 BLOCK FRANKFORD AVE (0.2%), 0 BLOCK N 52ND ST (0.1%), 1300 BLOCK MARKET ST (0.1%), 1600 BLOCK S CHRISTOPHER (0.1%), 1500 BLOCK MARKET ST (0.1%), 2300 BLOCK COTTMAN AVE (0.1%), 2800 BLOCK KENSINGTON AVE (0.1%), 2700 BLOCK KENSINGTON AVE (0.1%), 106072 others (98.7%). Valid: 2220256 (100.0%), missing: 0.
  • UCR_General [numeric]: mean 1272.5 (sd 814.7); min ≤ med ≤ max: 100 ≤ 800 ≤ 2600; IQR 1200; 26 distinct values. Valid: 2219602 (100.0%), missing: 654 (0.0%).
  • Text_General_Code [character]: All Other Offenses (19.6%), Other Assaults (12.4%), Thefts (11.5%), Vandalism/Criminal Mischief (9.0%), Theft from Vehicle (7.6%), Narcotic / Drug Law Violations (6.2%), Fraud (5.1%), Recovered Stolen Motor Vehicle (4.2%), Burglary Residential (4.2%), Aggravated Assault No Firearm (3.1%), 23 others (17.0%). Valid: 2219602 (100.0%), missing: 654 (0.0%).
  • Police_Districts [numeric]: mean 12.1 (sd 5.8); min ≤ med ≤ max: 1 ≤ 12 ≤ 22; IQR 9; 22 distinct values. Valid: 2217675 (99.9%), missing: 2581 (0.1%).
  • Lon [numeric]: mean -75.1 (sd 0.1); min ≤ med ≤ max: -75.3 ≤ -75.2 ≤ -75; 197531 distinct values. Valid: 2220256 (100.0%), missing: 0.
  • Lat [numeric]: mean 40 (sd 0); min ≤ med ≤ max: 39.9 ≤ 40 ≤ 40.1; 169409 distinct values. Valid: 2220256 (100.0%), missing: 0.

Generated by summarytools 1.0.1 (R version 4.2.1), 2022-12-23

Tidy Data

Code
head(dataset)
# A tibble: 6 × 14
  Dc_Dist Psa   Dispatch_Date_Time  Year  Month Dispatch…¹  Hour  Dc_Key Locat…²
  <chr>   <chr> <dttm>              <chr> <chr> <time>     <dbl>   <dbl> <chr>  
1 35      D     2009-07-19 01:09:00 2009  07    01:09          1 2.01e11 5500 B…
2 09      R     2009-06-25 00:14:00 2009  06    00:14          0 2.01e11 1800 B…
3 17      1     2015-04-25 12:50:00 2015  04    12:50         12 2.02e11 800 BL…
4 23      K     2009-02-10 14:33:00 2009  02    14:33         14 2.01e11 2200 B…
5 22      3     2015-10-06 18:18:00 2015  10    18:18         18 2.02e11 1500 B…
6 22      3     2015-10-09 00:49:00 2015  10    00:49          0 2.02e11 1500 B…
# … with 5 more variables: UCR_General <dbl>, Text_General_Code <chr>,
#   Police_Districts <dbl>, Lon <dbl>, Lat <dbl>, and abbreviated variable
#   names ¹​Dispatch_Time, ²​Location_Block
Code
tail(dataset)
# A tibble: 6 × 14
  Dc_Dist Psa   Dispatch_Date_Time  Year  Month Dispatch…¹  Hour  Dc_Key Locat…²
  <chr>   <chr> <dttm>              <chr> <chr> <time>     <dbl>   <dbl> <chr>  
1 06      3     2017-01-17 08:33:00 2017  01    08:33          8 2.02e11 300 BL…
2 01      1     2017-01-17 09:13:00 2017  01    09:13          9 2.02e11 2100 B…
3 16      1     2017-01-17 22:35:00 2017  01    22:35         22 2.02e11 N 38TH…
4 16      1     2017-01-17 22:35:00 2017  01    22:35         22 2.02e11 N 38TH…
5 19      2     2017-01-18 01:23:00 2017  01    01:23          1 2.02e11 6000 B…
6 16      1     2017-01-17 16:20:00 2017  01    16:20         16 2.02e11 N 34TH…
# … with 5 more variables: UCR_General <dbl>, Text_General_Code <chr>,
#   Police_Districts <dbl>, Lon <dbl>, Lat <dbl>, and abbreviated variable
#   names ¹​Dispatch_Time, ²​Location_Block

Check the total number of null (NA) values in the dataset.

Code
sum(is.na(dataset))
[1] 3889

Checking which attributes have null data.

Code
cols_null_data<-colSums(is.na(dataset))
colnames(dataset)[cols_null_data>0]
[1] "UCR_General"       "Text_General_Code" "Police_Districts" 

Checking the number of null values in each of several columns.

Code
sum(is.na(dataset$UCR_General))
[1] 654
Code
sum(is.na(dataset$Police_Districts))
[1] 2581
Code
sum(is.na(dataset$Lon))
[1] 0
Code
sum(is.na(dataset$Lat))
[1] 0

Removing rows from the dataset that have null latitude or longitude values.

Code
dataset<-subset(dataset, !is.na(Lat) & !is.na(Lon))
head(dataset)
# A tibble: 6 × 14
  Dc_Dist Psa   Dispatch_Date_Time  Year  Month Dispatch…¹  Hour  Dc_Key Locat…²
  <chr>   <chr> <dttm>              <chr> <chr> <time>     <dbl>   <dbl> <chr>  
1 35      D     2009-07-19 01:09:00 2009  07    01:09          1 2.01e11 5500 B…
2 09      R     2009-06-25 00:14:00 2009  06    00:14          0 2.01e11 1800 B…
3 17      1     2015-04-25 12:50:00 2015  04    12:50         12 2.02e11 800 BL…
4 23      K     2009-02-10 14:33:00 2009  02    14:33         14 2.01e11 2200 B…
5 22      3     2015-10-06 18:18:00 2015  10    18:18         18 2.02e11 1500 B…
6 22      3     2015-10-09 00:49:00 2015  10    00:49          0 2.02e11 1500 B…
# … with 5 more variables: UCR_General <dbl>, Text_General_Code <chr>,
#   Police_Districts <dbl>, Lon <dbl>, Lat <dbl>, and abbreviated variable
#   names ¹​Dispatch_Time, ²​Location_Block

Checking whether any rows still have null latitude or longitude values after filtering the dataset.

Code
sum(is.na(dataset$Lon))
[1] 0
Code
sum(is.na(dataset$Lat))
[1] 0
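
As a side note, the same row filtering can be expressed more directly with complete.cases(); here is a small sketch on toy data (the toy values are made up for illustration):

```r
# Toy sketch: keep only rows where both coordinates are present.
toy <- data.frame(Lon = c(-75.1, NA, -75.2), Lat = c(40.0, 39.9, NA))
cleaned <- toy[complete.cases(toy[, c("Lon", "Lat")]), ]
nrow(cleaned)   # only the first row has both Lon and Lat
```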

Processing and Visualization

The Text_General_Code represents the crime category.

Code
sum(is.na(dataset$Text_General_Code))
[1] 654

We count the number of occurrences of each crime type in the dataset and drop the first row of the result, which does not correspond to a valid crime category.

Code
countData<- dataset%>%count(Text_General_Code)
countData<-countData[-c(1),]
countData
# A tibble: 33 × 2
   Text_General_Code                  n
   <chr>                          <int>
 1 Aggravated Assault No Firearm  68421
 2 All Other Offenses            435476
 3 Arson                           5643
 4 Burglary Non-Residential       23182
 5 Burglary Residential           93979
 6 Disorderly Conduct             39798
 7 DRIVING UNDER THE INFLUENCE    52750
 8 Embezzlement                    4642
 9 Forgery and Counterfeiting      4816
10 Fraud                         113555
# … with 23 more rows

Now let’s visualize the frequency of each crime category in Philadelphia.

Code
library(ggplot2)
ggplot(data = countData, mapping = aes(x= n, y= reorder(Text_General_Code, n)))+
  geom_col(aes(fill = Text_General_Code))+
  geom_text(data = countData[c(1,33),],mapping = aes(label = n))+
   theme_minimal()+
  labs(title = "Crime Category and their Frequency in Philadelphia",
       y = NULL,
       x = "Frequency")+
 theme(legend.position = "none")

Rearranging the data in decreasing order to make the major crimes happening in the city easy to identify.

Code
countData<-countData[order(countData$n,decreasing = T),]
countData
# A tibble: 33 × 2
   Text_General_Code                   n
   <chr>                           <int>
 1 All Other Offenses             435476
 2 Other Assaults                 275523
 3 Thefts                         254714
 4 Vandalism/Criminal Mischief    199335
 5 Theft from Vehicle             169539
 6 Narcotic / Drug Law Violations 136599
 7 Fraud                          113555
 8 Recovered Stolen Motor Vehicle  94186
 9 Burglary Residential            93979
10 Aggravated Assault No Firearm   68421
# … with 23 more rows

Extracting the data of the top 10 crimes happening in Philadelphia.

Code
top_crime_data<-countData[1:10,]
top_crime_data
# A tibble: 10 × 2
   Text_General_Code                   n
   <chr>                           <int>
 1 All Other Offenses             435476
 2 Other Assaults                 275523
 3 Thefts                         254714
 4 Vandalism/Criminal Mischief    199335
 5 Theft from Vehicle             169539
 6 Narcotic / Drug Law Violations 136599
 7 Fraud                          113555
 8 Recovered Stolen Motor Vehicle  94186
 9 Burglary Residential            93979
10 Aggravated Assault No Firearm   68421
Code
ggplot(data = top_crime_data, mapping = aes(x= n, y= reorder(Text_General_Code, n)))+
  geom_col(aes(fill = Text_General_Code))+
  geom_text(data = top_crime_data[c(1,10),],mapping = aes(label = n))+
   theme_minimal()+
  labs(title = "Common Crime Categories in Philadelphia",
       y = NULL,
       x = "Frequency")+
 theme(legend.position = "none")

From the above graph we can see that “All Other Offenses” is the most frequently occurring crime category. Most categories are similar in magnitude to their neighbors, but the frequency of “All Other Offenses” is far higher than the rest.

Now we are going to perform crime analysis per month. The code below was used to extract the month and year of each crime from Dispatch_Date into separate attributes so that we can analyze them.

Code
head(dataset)
# A tibble: 6 × 14
  Dc_Dist Psa   Dispatch_Date_Time  Year  Month Dispatch…¹  Hour  Dc_Key Locat…²
  <chr>   <chr> <dttm>              <chr> <chr> <time>     <dbl>   <dbl> <chr>  
1 35      D     2009-07-19 01:09:00 2009  07    01:09          1 2.01e11 5500 B…
2 09      R     2009-06-25 00:14:00 2009  06    00:14          0 2.01e11 1800 B…
3 17      1     2015-04-25 12:50:00 2015  04    12:50         12 2.02e11 800 BL…
4 23      K     2009-02-10 14:33:00 2009  02    14:33         14 2.01e11 2200 B…
5 22      3     2015-10-06 18:18:00 2015  10    18:18         18 2.02e11 1500 B…
6 22      3     2015-10-09 00:49:00 2015  10    00:49          0 2.02e11 1500 B…
# … with 5 more variables: UCR_General <dbl>, Text_General_Code <chr>,
#   Police_Districts <dbl>, Lon <dbl>, Lat <dbl>, and abbreviated variable
#   names ¹​Dispatch_Time, ²​Location_Block
Code
#This block was used to separate Year and Month from Dispatch_Date.
#Since we load a saved image of the dataset, the separated columns already exist, so there is no need to run it again.

#dataset<- dataset %>%
 # separate(`Dispatch_Date`,c('Year','Month'),sep = "-")

#head(dataset)
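
For reference, here is a small base-R sketch (on toy data, standing in for the real column) of what the commented-out separation step does: splitting a "YYYY-MM" Dispatch_Date value into Year and Month columns.

```r
# Toy data standing in for the real Dispatch_Date column.
toy <- data.frame(Dispatch_Date = c("2009-07", "2015-04"))
# Split each "YYYY-MM" string on the hyphen and bind into a 2-column matrix.
parts <- do.call(rbind, strsplit(toy$Dispatch_Date, "-", fixed = TRUE))
toy$Year  <- parts[, 1]
toy$Month <- parts[, 2]
toy
```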

Count the number of crimes that happened in each year from 2006 to 2017.

Code
countCrimeByYear <- dataset %>% 
  group_by(Year) %>% 
  summarise(total = n())
countCrimeByYear
# A tibble: 12 × 2
   Year   total
   <chr>  <int>
 1 2006  232577
 2 2007  222021
 3 2008  222118
 4 2009  203659
 5 2010  198048
 6 2011  194264
 7 2012  195544
 8 2013  185308
 9 2014  185132
10 2015  182349
11 2016  166051
12 2017   33185
Code
ggplot(countCrimeByYear, aes(x=Year, y=total)) + 
  geom_point(size=3) + 
  geom_segment(aes(x=Year, 
                   xend=Year, 
                   y=0, 
                   yend=total)) + 
  labs(title="Total Crimes per Year in Philadelphia") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

The above plot shows the trend of crime as the years progress. On average, crime decreased considerably over the period.
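
A rough back-of-envelope check, using the yearly totals above and the fact that the data runs only through 2017-03-23 (about 82 days into 2017), suggests the per-day rate also fell over the period:

```r
# Crimes per day: 2006 is a full year; 2017 covers roughly 82 days (Jan 1 - Mar 23).
rate_2006 <- 232577 / 365
rate_2017 <- 33185 / 82
round(rate_2006)   # about 637 crimes per day in 2006
round(rate_2017)   # about 405 crimes per day in early 2017
```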

Count the number of crimes that happened in each month, aggregated across all years from 2006 to 2017.

Code
countCrimeByMonth <- dataset %>% 
  group_by(Month) %>% 
  summarise(total = n())
head(countCrimeByMonth)
# A tibble: 6 × 2
  Month  total
  <chr>  <int>
1 01    179756
2 02    161409
3 03    188858
4 04    187082
5 05    196653
6 06    193874
Code
theme_set(theme_classic())

ggplot(countCrimeByMonth, aes(x = Month, y = total))+
  geom_col(fill = "firebrick3")+
  theme_minimal()+
  labs(
    title = "Crime per Month in Philadelphia",
    subtitle = "From 2006 to 2017",
    x = "Month",
    y = "Total Crime"
  )

Counting the number of crimes that happened at each hour.

Code
countCrimeByHour <- dataset %>% 
  group_by(Hour) %>% 
  summarise(total = n())
head(countCrimeByHour)
# A tibble: 6 × 2
   Hour  total
  <dbl>  <int>
1     0 119042
2     1  94198
3     2  66566
4     3  46011
5     4  29887
6     5  22603
Code
library(scales)
theme_set(theme_classic())

# Plot
ggplot(countCrimeByHour, aes(x=Hour, y=total)) + 
  geom_point(col="tomato2", size=3) +   # Draw points
  geom_segment(aes(x=Hour, 
                   xend=Hour, 
                   y=min(total), 
                   yend=max(total)), 
               linetype="dashed", 
               size=0.1) +   # Draw dashed lines
  labs(title="Dot Plot for the number of crimes per hour") +  
  coord_flip()

Code
countCrimeByHour <- dataset %>% 
  group_by(Hour) %>% 
  summarise(total = n())
countCrimeByHour<-countCrimeByHour[order(countCrimeByHour$total,decreasing = T),]
countCrimeByHour<-head(countCrimeByHour)
countCrimeByHour
# A tibble: 6 × 2
   Hour  total
  <dbl>  <int>
1    16 133738
2    17 125895
3    19 121618
4    22 119620
5     0 119042
6    18 118810
Code
library(scales)
theme_set(theme_classic())

# Plot
ggplot(countCrimeByHour, aes(x=Hour, y=total)) + 
  geom_point(col="tomato2", size=3) +   # Draw points
  geom_segment(aes(x=Hour, 
                   xend=Hour, 
                   y=min(total), 
                   yend=max(total)), 
               linetype="dashed", 
               size=0.1) +   # Draw dashed lines
  labs(title="Dot Plot for the number of crimes per hour") +  
  coord_flip()

Although the above graph shows that the largest number of crimes happened at 16:00 hours, to infer a time range we need to go back to the first graph and look for the collective window in which most crimes occur.
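
One way to look for such a collective window is to bucket the hours into broader time-of-day ranges; the break points and labels below are my own choices for illustration, not part of the original analysis:

```r
# Sketch: map each hour 0-23 into a broad time-of-day window with cut().
hours <- 0:23
window <- cut(hours,
              breaks = c(-1, 5, 11, 17, 23),
              labels = c("night", "morning", "afternoon", "evening"))
table(window)   # six hours fall in each window
```

In the real analysis one would group the hourly crime counts by these windows and sum the totals per window.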

Now we will analyze the number of crimes per district. Below we count the number of crimes that happened in each district.

Code
countCrimeByPoliceDistrict<- dataset %>% 
  group_by(Police_Districts) %>% 
  summarise(total = n())
head(countCrimeByPoliceDistrict)
# A tibble: 6 × 2
  Police_Districts  total
             <dbl>  <int>
1                1  48008
2                2 116180
3                3 114689
4                4  31113
5                5  96025
6                6  44444

To find the top 6 districts where the most crimes happen, we first rearrange the data in descending order and take the top 6 rows of the data frame. The top 6 districts and the number of crimes in each are shown below.

Code
countTopCrimeByPoliceDistrict<-countCrimeByPoliceDistrict[order(countCrimeByPoliceDistrict$total,decreasing = T),]
countTopCrimeByPoliceDistrict<-head(countTopCrimeByPoliceDistrict)
countTopCrimeByPoliceDistrict
# A tibble: 6 × 2
  Police_Districts  total
             <dbl>  <int>
1               11 183196
2               17 161245
3               16 153103
4               18 150186
5               15 135628
6                9 132875

We’ll plot a pie chart of the above data. The chart labels each district and colors it by crime count: the lightest wedge is the district with the most crimes and the darkest wedge is the district with the fewest. The value range is shown in the scale beside the chart.

Code
library(ggplot2)

ggplot(countTopCrimeByPoliceDistrict, aes(x = "", y = "", fill = total)) +
  geom_col() +
  geom_label(aes(label = Police_Districts),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
  coord_polar(theta = "y")

From the above pie chart we can clearly see that district 11 has the most crimes in Philadelphia.

The dataset includes latitude and longitude values, so let’s plot the crime locations on a map.

Code
map_drug <- dataset %>% 
  filter(Year == 2006) %>% 
  select(Location_Block, Lon, Lat)
map_drug<-head(map_drug,50)
map_drug
# A tibble: 50 × 3
   Location_Block            Lon   Lat
   <chr>                   <dbl> <dbl>
 1 7200 BLOCK SAUL ST      -75.1  40.0
 2 2200 BLOCK COTTMAN AVE  -75.1  40.0
 3 1900 BLOCK S MOLE ST    -75.2  39.9
 4 2000 BLOCK S HEMBERGER  -75.2  39.9
 5 6600 BLOCK LYNFORD ST   -75.1  40.0
 6 1700 BLOCK BORBECK AV   -75.1  40.1
 7 1800 BLOCK S HICKS ST   -75.2  39.9
 8 2400 BLOCK S 24TH ST    -75.2  39.9
 9 5900 BLOCK REACH ST     -75.1  40.0
10 7900 BLOCK BURHOLME AVE -75.1  40.1
# … with 40 more rows
Code
library(leaflet)


ico <- makeIcon(iconUrl = "https://cdn.iconscout.com/icon/free/png-256/drugs-26-129384.png",iconWidth=47/2, iconHeight=41/2)
map2 <- leaflet()
map2 <- addTiles(map2)
map2 <- addMarkers(map2, data = map_drug, icon = ico, popup = map_drug[,"Location_Block"])
map2

The above map shows the locations of 50 crimes that happened around Philadelphia in 2006.

Code
map_drug <- dataset %>% 
  filter(Text_General_Code=='Thefts',Year=='2006') %>% 
  select(Location_Block, Lon, Lat)
map_drug<-head(map_drug,50)
map_drug
# A tibble: 50 × 3
   Location_Block           Lon   Lat
   <chr>                  <dbl> <dbl>
 1 6600 BLOCK LYNFORD ST  -75.1  40.0
 2 1700 BLOCK BORBECK AV  -75.1  40.1
 3 2400 BLOCK S 24TH ST   -75.2  39.9
 4 2200 BLOCK OREGON AVE  -75.2  39.9
 5 300 BLOCK GERRITT ST   -75.2  39.9
 6 100 BLOCK CARPENTER ST -75.1  39.9
 7 500 BLOCK S 2ND ST     -75.1  39.9
 8 4700 BLOCK UMBRIA ST   -75.2  40.0
 9 0 BLOCK MIFFLIN ST     -75.1  39.9
10 6400 BLOCK RIDGE AV    -75.2  40.0
# … with 40 more rows
Code
library(leaflet)


ico <- makeIcon(iconUrl = "https://cdn.iconscout.com/icon/free/png-256/drugs-26-129384.png",iconWidth=47/2, iconHeight=41/2)
map2 <- leaflet()
map2 <- addTiles(map2)
map2 <- addMarkers(map2, data = map_drug, icon = ico, popup = map_drug[,"Location_Block"])
map2

The above map shows the locations of 50 thefts that happened around Philadelphia in 2006.

Reflection

I’ve learned a lot from working on this project. Before taking this course I did not have any experience with R. We see a lot of analytics used in the stock exchange, so initially I thought of choosing a stock-exchange dataset and working on its trends. But when I came across the crime dataset, it had latitude and longitude values, and I wanted to experiment with plotting those values on a map, so I went with crime analytics. After selecting the dataset I did not at first understand what kind of inferences I could draw from it. Then I asked myself why we actually need to analyze crime data and who would use it. The answer to that question helped me start my analysis, frame different research questions, and draw inferences from them.

My thought process was to understand how the columns in the dataset are related. Once we find a relation between columns we can dive deeper and narrow the research further. For example, I first found the relation between crime type and frequency, and later narrowed down to the most frequent crime categories in Philadelphia.

Later, while exploring the different graphs that can be plotted with R, I found many interesting plots, but I did not always understand how to manipulate the data needed to draw them. I need to research and explore new techniques for reshaping data as needed. I also tried some new ways to extract and create columns beyond the ones taught in class.

I faced many challenges while working on this project. I used the tutorials and techniques learned in class and also consulted different websites to learn how to manipulate data and draw plots. R is a really powerful tool, and there is a lot left for me to explore and learn so that I can draw better inferences and build better visualizations.

Conclusion

Now we have answers to all our questions. The most common crime categories in Philadelphia are “All Other Offenses”, “Other Assaults”, “Thefts”, “Vandalism”, and “Theft from Vehicle”. Although most categories’ frequencies are in a similar range to their neighbors, the frequency of “All Other Offenses” is far greater than the others. From the above lollipop chart we can see that the crime rate decreased significantly over the years from 2006 to 2017.

From the monthly crime plot we can infer that, although the counts differ from month to month, no month stands out as significantly different from the others, so we cannot build a strategy around monthly patterns.

From the hourly analysis we can infer that crimes happen at all hours, but viewed collectively most crimes happened between 6 pm and midnight. From the pie chart we also know the districts where the most crimes happened.

Based on all the above inferences, the police could adopt strategies such as increasing the police force and security at night and in crime-prone districts, and developing technologies to prevent crime. Finally, the yearly plot suggests that the strategies taken by the police force are working, as the crime rate decreased significantly over the years.

Bibliography

Dataset from Kaggle- https://www.kaggle.com/datasets/mchirico/philadelphiacrimedata

Referred crime analysis from - https://cops.usdoj.gov/ric/Publications/cops-w0273-pub.pdf

Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. OReilly Media.

Wickham, H. (2019). Advanced R. Chapman and Hall/CRC.

Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3-28.

Source Code
---
title: "Final Project"
author: "Siddharth Nammara Kalyana Raman"
description: "Final Project on Philadelphia Crime data"
date: "12/14/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
---
## Introduction

According to the Police Foundation, crime analysis is defined as the qualitative and quantitative study of crime and law enforcement information, in combination with socio-demographic and spatial factors, to apprehend criminals, prevent crime, reduce disorder, and evaluate organizational procedures.

The primary purpose of crime analysis is to support a police department's operations. These activities include patrol deployment, crime prevention and reduction methods, problem-solving, evaluation and accountability of police actions, criminal investigation, arrest, and prosecution. Crime analysis would not be possible without the information that police forces collect.

In this project we take a small sample of Philadelphia crime data, perform some statistical analysis, and examine its trends. The dataset was taken from OpenDataPhilly, a portal for open data about the Philadelphia region.

Some of the questions I want to answer are:

What are the different categories of crime happening in Philadelphia, and what are the most common crimes?

What is the trend of crime as the years progress: are crimes increasing or decreasing? This will help us determine whether the strategies implemented by the police force to reduce the crime rate are working.

Which month has the most crimes?

Which hour of the day has the most crimes?

Which district in Philadelphia has the most crimes?

The answers to the last three questions will help us determine when and where security needs to be increased.



```{r}
#Loading libraries

library(tidyverse)
library(dplyr)
library(summarytools)
library(readxl)
load("snkraman_final.RData")
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```


## Import the Data

Importing the Philadelphia crime data into R.
```{r}
#The data frame 'dataset' was loaded from snkraman_final.RData in the setup chunk above
head(dataset)
```

## Dataset Summary

The columns and their descriptions are as follows :

1. Dc_Dist - A two character field that names the District boundary.

2. Psa - It is a single character field that names the Police Service Area boundary.

3. DC_Key - The unique identifier of the crime that consists of Year+District+Unique ID.

4. Dispatch_Date_Time - The date and time that the officer was dispatched to the scene.

5. Dispatch_Date - It is the dispatch date formatted as character.

6. Dispatch_Time - It is the dispatach time formatted as character.

7. Hour - It is the generalized hour of the dispatched time.

8. Location_Block - The location of crime generalized by the street block.

9. UCR_General - Universal Crime Reporting, it is used to compare crimes in other areas.

10. Text_General_Code - It defines the crime category.

11. Police_Districts - It defines the police district where the crime happened.

12. Month - It defines the month and year on which the crime happened.

13. Lon - Longitude of the crime location.

14. Lat - Latitude of the crime location.
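Several of these columns (Dispatch_Date, Dispatch_Time, Hour, Month) are redundant encodings of Dispatch_Date_Time. If they ever needed to be rebuilt, a sketch with lubridate, shown here on a single hypothetical timestamp rather than the real data, would look like this:

```{r}
library(lubridate)

# Hypothetical timestamp in the same shape as Dispatch_Date_Time
ts <- ymd_hms("2006-05-14 16:32:00")

hour(ts)                 # 16, i.e. the Hour column
format(ts, "%Y-%m")      # "2006-05", i.e. the Month column
as.character(date(ts))   # "2006-05-14", i.e. Dispatch_Date as character
```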

```{r}
print(dfSummary(dataset, 
                varnumbers= FALSE, 
                plain.ascii= FALSE, 
                style= "grid", 
                graph.magnif= 0.80, 
                valid.col= TRUE),
      method= 'render', 
      table.classes= 'table-condensed')
```

## Tidy Data
```{r}
head(dataset)
```
```{r}
tail(dataset)
```

Checking the total number of null (NA) values in the dataset.
```{r}
sum(is.na(dataset))
```

Checking which attributes contain null data.
```{r}
cols_null_data<-colSums(is.na(dataset))
colnames(dataset)[cols_null_data>0]
```

Checking the number of null values in each of those columns.
```{r}
sum(is.na(dataset$UCR_General))
sum(is.na(dataset$Police_Districts))
sum(is.na(dataset$Lon))
sum(is.na(dataset$Lat))
```

Removing the rows whose latitude or longitude is null.
```{r}
dataset <- subset(dataset, !is.na(Lat) & !is.na(Lon))
head(dataset)
```
Checking that no null latitude or longitude values remain after filtering.
```{r}
sum(is.na(dataset$Lon))
sum(is.na(dataset$Lat))
```
## Processing and Visualization

The Text_General_Code represents the crime category.
```{r}
sum(is.na(dataset$Text_General_Code))
```

We are calculating the number of occurrences of each crime type in the dataset.
```{r}
countData <- dataset %>% count(Text_General_Code)
countData <- countData[-1, ]  # drop the first row (the empty/NA category)
countData
```
Now let's visualize the frequency of each crime category in Philadelphia.
```{r}
library(ggplot2)
ggplot(data = countData, mapping = aes(x= n, y= reorder(Text_General_Code, n)))+
  geom_col(aes(fill = Text_General_Code))+
  geom_text(data = countData[c(1,33),],mapping = aes(label = n))+
   theme_minimal()+
  labs(title = "Crime Category and their Frequency in Philadelphia",
       y = NULL,
       x = "Frequency")+
 theme(legend.position = "none")
```

Rearranging the data in decreasing order makes it easy to see the major crimes happening in the city.
```{r}
countData<-countData[order(countData$n,decreasing = T),]
countData
```

Extracting the data of the top 10 crimes happening in Philadelphia.
```{r}
top_crime_data<-countData[1:10,]
top_crime_data
```

```{r}
ggplot(data = top_crime_data, mapping = aes(x= n, y= reorder(Text_General_Code, n)))+
  geom_col(aes(fill = Text_General_Code))+
  geom_text(data = top_crime_data[c(1,10),],mapping = aes(label = n))+
   theme_minimal()+
  labs(title = "Common Crime Categories in Philadelphia",
       y = NULL,
       x = "Frequency")+
 theme(legend.position = "none")
```

From the above graph we can see that "All Other Offenses" is the most frequently occurring crime category. The remaining categories are in a similar range to their neighbors, but the frequency of "All Other Offenses" is far higher than the rest.

Next we analyze crime by month. The code below extracts the year and month of each crime from Dispatch_Date into separate attributes so we can analyze them.
```{r}
head(dataset)
```
```{r}
#This block was used to separate Year and Month from Dispatch_Date.
#Since we saved an image of the dataset, the Dispatch_Date column is already split in the loaded frame, so there is no need to run it again.

#dataset<- dataset %>%
 # separate(`Dispatch_Date`,c('Year','Month'),sep = "-")

#head(dataset)
```
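For reference, this is what the commented-out separate() call does, demonstrated on a hypothetical two-row column so it can run without the saved image:

```{r}
library(tidyr)
library(dplyr)

# Toy "YYYY-MM" values standing in for the real Dispatch_Date column
toy <- tibble(Dispatch_Date = c("2006-05", "2017-12"))

# Split each value into two character columns, as in the commented block above
toy %>% separate(Dispatch_Date, c("Year", "Month"), sep = "-")
```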

Counting the number of crimes that happened in each year from 2006 to 2017.
```{r}
countCrimeByYear <- dataset %>% 
  group_by(Year) %>% 
  summarise(total = n())
countCrimeByYear
```


```{r}
ggplot(countCrimeByYear, aes(x = Year, y = total)) + 
  geom_point(size = 3) + 
  geom_segment(aes(x = Year, xend = Year, y = 0, yend = total)) + 
  labs(title = "Total Crimes per Year in Philadelphia") + 
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))
```

The above plot shows the trend in crime as the years progress. We can see that crime decreased substantially over the period.

Counting the number of crimes that happened in each month across all years from 2006 to 2017.
```{r}
countCrimeByMonth <- dataset %>% 
  group_by(Month) %>% 
  summarise(total = n())
head(countCrimeByMonth)
```
```{r}
theme_set(theme_classic())

ggplot(countCrimeByMonth, aes(x = Month, y = total))+
  geom_col(fill = "firebrick3")+
  theme_minimal()+
  labs(
    title = "Crime per Month in Philadelphia",
    subtitle = "From 2006 to 2017",
    x = "Month",
    y = "Total Crime"
  )
```

Counting the number of crimes that happened during each hour of the day.
```{r}
countCrimeByHour <- dataset %>% 
  group_by(Hour) %>% 
  summarise(total = n())
head(countCrimeByHour)
```


```{r}
library(scales)
theme_set(theme_classic())

# Dot plot of crimes per hour; dashed guide lines span the range of totals
ggplot(countCrimeByHour, aes(x = Hour, y = total)) + 
  geom_point(col = "tomato2", size = 3) +   # draw points
  geom_segment(aes(x = Hour, xend = Hour, 
                   y = min(total), yend = max(total)), 
               linetype = "dashed", size = 0.1) +   # draw dashed lines
  labs(title = "Dot Plot of the Number of Crimes per Hour") + 
  coord_flip()
```

To find the hours with the most crimes, we sort the hourly counts in decreasing order and keep the top six.
```{r}
countCrimeByHour <- dataset %>% 
  group_by(Hour) %>% 
  summarise(total = n())
countCrimeByHour <- countCrimeByHour[order(countCrimeByHour$total, decreasing = TRUE), ]
countCrimeByHour <- head(countCrimeByHour)
countCrimeByHour
```

```{r}
library(scales)
theme_set(theme_classic())

# Same dot plot, restricted to the six busiest hours
ggplot(countCrimeByHour, aes(x = Hour, y = total)) + 
  geom_point(col = "tomato2", size = 3) + 
  geom_segment(aes(x = Hour, xend = Hour, 
                   y = min(total), yend = max(total)), 
               linetype = "dashed", size = 0.1) + 
  labs(title = "Dot Plot of the Six Busiest Hours") + 
  coord_flip()
```

Though the above graph shows that the maximum number of crimes happened at 16:00 hours, to infer a time range we need to go back to the first hourly graph and look for the contiguous span in which most crimes happen.
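The idea of finding the busiest time span can be made concrete with a rolling-window sum: among all contiguous six-hour windows, pick the one covering the most incidents. A sketch on made-up hourly counts (not the real totals):

```{r}
# Hypothetical counts for hours 0..23: higher counts from 18:00 onward
counts <- c(rep(100, 18), rep(300, 6))

window <- 6
# Total incidents in each six-hour window starting at hour 0, 1, ..., 18
totals <- sapply(0:(24 - window), function(start) sum(counts[start + 1:window]))
best_start <- which.max(totals) - 1
best_start  # starting hour of the busiest six-hour window (18 here)
```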

Now we will analyse the number of crimes per district. Below we count the number of crimes that happened in each district.
```{r}
countCrimeByPoliceDistrict<- dataset %>% 
  group_by(Police_Districts) %>% 
  summarise(total = n())
head(countCrimeByPoliceDistrict)
```
To find the top 6 districts where the most crimes are happening, we first rearrange the data in descending order and take the top 6 rows of the dataframe. The top 6 districts and the number of crimes in each are shown below.
```{r}
countTopCrimeByPoliceDistrict<-countCrimeByPoliceDistrict[order(countCrimeByPoliceDistrict$total,decreasing = T),]
countTopCrimeByPoliceDistrict<-head(countTopCrimeByPoliceDistrict)
countTopCrimeByPoliceDistrict
```

We'll plot a pie chart of the above data. Each slice is labeled with its district number and shaded by crime count: the lightest slice is the district with the most crimes and the darkest slice the one with the fewest. The scale beside the chart shows the value range.
```{r}
library(ggplot2)

ggplot(countTopCrimeByPoliceDistrict, aes(x = "", y = "", fill = total)) +
  geom_col() +
  geom_label(aes(label = Police_Districts),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
  coord_polar(theta = "y")
```

From the above pie chart we can clearly see that district 11 has the most crimes in Philadelphia.

The dataset includes latitude and longitude values, so let's plot the crime locations on a map.

```{r}
map_drug <- dataset %>% 
  filter(Year == 2006) %>% 
  select(Location_Block, Lon, Lat)
map_drug<-head(map_drug,50)
map_drug
```


```{r}
library(leaflet)


ico <- makeIcon(iconUrl = "https://cdn.iconscout.com/icon/free/png-256/drugs-26-129384.png",iconWidth=47/2, iconHeight=41/2)
map2 <- leaflet()
map2 <- addTiles(map2)
map2 <- addMarkers(map2, data = map_drug, icon = ico, popup = map_drug[,"Location_Block"])
map2
```
The above map shows the locations of 50 crime scenes from around Philadelphia in 2006.

```{r}
map_drug <- dataset %>% 
  filter(Text_General_Code=='Thefts',Year=='2006') %>% 
  select(Location_Block, Lon, Lat)
map_drug<-head(map_drug,50)
map_drug

```

```{r}
library(leaflet)


ico <- makeIcon(iconUrl = "https://cdn.iconscout.com/icon/free/png-256/drugs-26-129384.png",iconWidth=47/2, iconHeight=41/2)
map2 <- leaflet()
map2 <- addTiles(map2)
map2 <- addMarkers(map2, data = map_drug, icon = ico, popup = map_drug[,"Location_Block"])
map2
```

The above map shows the locations of 50 theft crimes around Philadelphia in 2006.

## Reflection

I've learned a lot from working on this project. Before taking this course I had no experience with R. We see a lot of analytics used in the stock exchange, so initially I thought of choosing a stock exchange dataset and working on its trends. But when I came across the crime analysis dataset, it had latitude and longitude values, and I wanted to experiment with plotting values on a map, so I went with the crime dataset. After selecting the dataset, I did not at first understand what kinds of inferences I could draw from it. Then a question came to mind: why do we actually need to analyse crime data, and who will be using it? The answer to that question helped me start my analysis, frame different research questions, and draw inferences from them.

My thought process was to understand how the columns in the dataset are related. Once we find a relation between columns, we can dive deeper and narrow the research further. For example, I first found the relation between crime type and frequency, then narrowed down to the most frequent crime categories in Philadelphia.

Later, while exploring the different graphs that can be plotted with R, I found many interesting plots, but I did not always understand how to manipulate the data needed to draw them. I need to research and explore new techniques for reshaping data to fit each plot. I also tried some new ways of extracting and creating columns beyond the ones taught in class.

I faced many challenges while working on this project. I used the tutorials and techniques learned in class and also went through different websites to learn how to manipulate data and draw plots. R is a really powerful tool, and there is a lot left for me to explore and learn so that I can draw better inferences and build better visualizations.

## Conclusion

Now we have answers to all our questions. The common crime categories in Philadelphia are "All Other Offenses", "Other Assaults", "Thefts", "Vandalism", and "Theft from Vehicle". Though each category's frequency is in a similar range to its neighbors, the frequency of "All Other Offenses" is far greater than the others. From the above lollipop chart we can see that the crime rate decreased significantly over the years from 2006 to 2017.

From the monthly crime plot we can infer that, although the number of crimes varies by month, no month is significantly different from the others, so we cannot develop a strategy based on monthly analysis.

From the hourly analysis we can infer that crimes happen at all times, but viewed collectively most crimes happened between 6 pm and 12 am. From the pie chart we also know the districts where the most crimes happened.

Given these inferences, the police could adopt strategies such as increasing the police force or security during night hours and in crime-prone districts, and developing technologies to prevent crime. Finally, the yearly plot suggests that the strategies taken by the police force are working, as the crime rate has decreased significantly over the years.

## Bibliography

Dataset from Kaggle: https://www.kaggle.com/datasets/mchirico/philadelphiacrimedata

Crime analysis definition referenced from: https://cops.usdoj.gov/ric/Publications/cops-w0273-pub.pdf

Wickham, H., & Grolemund, G. (2016). R for Data Science: Visualize, Model, Transform, Tidy, and Import Data. O'Reilly Media.

Wickham, H. (2019). Advanced R. Chapman and Hall/CRC.

Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3-28.