Final Project Assignment#1: Sai Pranav Kurly

Project & Data Description
Author

Sai Pranav Kurly

Published

May 12, 2023

library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Part 1. Introduction

  1. Dataset(s) Introduction:

The Boston Crime Dataset, also known as the Boston Crime Incident Reports, is a dataset that contains information about reported incidents of crime in the city of Boston, Massachusetts, USA. It provides a detailed record of criminal activities and incidents that have occurred within the city. The dataset includes various attributes related to each reported crime, such as the type of offense, location, date and time of occurrence, and other relevant details. The information is collected and maintained by the Boston Police Department, which aims to promote transparency and public awareness regarding crime trends and patterns in the city. Researchers, analysts, and data enthusiasts often utilize the Boston Crime Dataset to study crime patterns, develop predictive models, and gain insights into criminal activities within the city. It can be used for various purposes, such as identifying high-crime areas, evaluating the effectiveness of law enforcement strategies, or understanding the impact of crime on different neighborhoods.

  2. What questions would you like to answer with this dataset(s)?

Some of the questions that I would like to answer are:

  • How have crime rates changed over the years in different districts of Boston?
  • Is there a correlation between certain types of crimes and specific days of the week or months of the year?
  • Are there any noticeable spatial patterns or hotspots of crime in Boston?

Part 2. Describe the data set(s)

This part contains both a coding and a storytelling component.

In the coding component, you should:

  1. read the dataset;

I want the latest data, which is only available on the Boston Police Department website, so I first combine all the files that I downloaded from that site.

folder_path <- "SaipranavKurly_FinalProjectData/"
# List only the CSV files, sorted so they are combined in a stable order
file_list <- sort(list.files(folder_path, pattern = "\\.csv$"))
# Skip the offense-code lookup table and any previously written combined file
file_list <- setdiff(file_list, c("Offense_Codes.csv", "Combined_Dataset.csv"))
# Read every remaining file and stack the rows in one step,
# instead of growing a data frame inside a loop
combined_data <- do.call(rbind, lapply(file.path(folder_path, file_list), read.csv))
# Save the combined data so later steps can read a single file
combined_file_path <- file.path(folder_path, "Combined_Dataset.csv")
write.csv(combined_data, combined_file_path, row.names = FALSE)
    • (optional) If you have multiple dataset(s) you want to work with, you should combine these datasets at this step.

    • (optional) If your dataset is too big (for example, it contains too many variables/columns that may not be useful for your analysis), you may want to subset the data just to include the necessary variables/columns.

crime_dataset <- read.csv("SaipranavKurly_FinalProjectData/Combined_Dataset.csv")

  2. present the descriptive information of the dataset(s) using the functions in Challenges 1, 2, and 3;

    • for example: dim(), length(unique()), head();
    dim(crime_dataset)
    [1] 303651     17
    length(unique(crime_dataset))
    [1] 17
    head(crime_dataset)
  3. conduct summary statistics of the dataset(s); especially show the basic statistics (min, max, mean, median, etc.) for the variables you are interested in.

summary(crime_dataset)
 INCIDENT_NUMBER     OFFENSE_CODE   OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION
 Length:303651      Min.   :  100   Mode:logical       Length:303651      
 Class :character   1st Qu.: 1102   NA's:303651        Class :character   
 Mode  :character   Median : 3005                      Mode  :character   
                    Mean   : 2362                                         
                    3rd Qu.: 3201                                         
                    Max.   :99999                                         
                                                                          
   DISTRICT         REPORTING_AREA     SHOOTING       OCCURRED_ON_DATE  
 Length:303651      Min.   :  0.0   Min.   :0.00000   Length:303651     
 Class :character   1st Qu.:172.0   1st Qu.:0.00000   Class :character  
 Mode  :character   Median :352.0   Median :0.00000   Mode  :character  
                    Mean   :379.4   Mean   :0.01177                     
                    3rd Qu.:532.0   3rd Qu.:0.00000                     
                    Max.   :962.0   Max.   :1.00000                     
                    NA's   :77826                                       
      YEAR          MONTH        DAY_OF_WEEK             HOUR      
 Min.   :2019   Min.   : 1.000   Length:303651      Min.   : 0.00  
 1st Qu.:2019   1st Qu.: 4.000   Class :character   1st Qu.: 9.00  
 Median :2020   Median : 7.000   Mode  :character   Median :14.00  
 Mean   :2020   Mean   : 6.592                      Mean   :12.84  
 3rd Qu.:2021   3rd Qu.: 9.000                      3rd Qu.:18.00  
 Max.   :2022   Max.   :12.000                      Max.   :23.00  
                                                                   
 UCR_PART          STREET               Lat             Long       
 Mode:logical   Length:303651      Min.   : 0.00   Min.   :-71.35  
 NA's:303651    Class :character   1st Qu.:42.30   1st Qu.:-71.10  
                Mode  :character   Median :42.33   Median :-71.08  
                                   Mean   :42.32   Mean   :-71.08  
                                   3rd Qu.:42.35   3rd Qu.:-71.06  
                                   Max.   :42.46   Max.   :  0.00  
                                   NA's   :11929   NA's   :11929   
   Location        
 Length:303651     
 Class :character  
 Mode  :character  
                   
                   
                   
                   

Storytelling:

The Dataset contains the following columns and below are the descriptions:

  • Incident Number: Internal report number for each incident, non-null value.
  • Offense Code: Numerical code representing the offense description.
  • Offense Code Group: High-level group name for the offense code.
  • Offense Description: Detailed description and internal categorization of the offense.
  • District: District where the crime occurred.
  • Reporting Area: Number of the reporting area where the crime occurred.
  • Shooting: Numerical value indicating if a shooting took place.
  • Occurred on Date: Date and time of when the crime occurred.
  • Year: Year when the crime occurred.
  • Month: Month when the crime occurred.
  • Day of Week: Day of the week when the crime occurred.
  • Hour: Hour when the crime occurred.
  • UCR Part: Uniform Crime Reporting part number.
  • Street: Street name where the crime occurred.
  • Lat: Latitude coordinate of the crime location.
  • Long: Longitude coordinate of the crime location.
  • Location: Location where the crime took place.

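The combined dataset covers crimes reported from 2019 to 2022 and contains 303,651 incident records in 17 columns. In the summary above, OFFENSE_CODE_GROUP and UCR_PART are NA for every row, REPORTING_AREA is missing for 77,826 rows, and Lat and Long are missing for 11,929 rows and also contain 0 values, so these columns need attention before analysis.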
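
Checks on missing values and on the number of distinct values per column can be scripted rather than read off the summary by eye. The sketch below is only an illustration: `toy_crimes` is a hypothetical, made-up data frame that borrows a few column names from the real dataset, and the same lines would need to be run on `crime_dataset` to say anything about the actual data. Note that `length(unique())` applied to a whole data frame counts distinct columns, which is why `sapply()` is used here to get a count for each column.

```r
# Toy stand-in for crime_dataset; the values are invented for illustration only
toy_crimes <- data.frame(
  OFFENSE_CODE       = c(3005, 1102, 3005, 619),
  OFFENSE_CODE_GROUP = c(NA, NA, NA, NA),
  DISTRICT           = c("B2", "D4", "B2", "A1"),
  Lat                = c(42.33, 0, 42.35, NA),
  Long               = c(-71.08, 0, -71.06, NA)
)

# Number of distinct values in each column (NA counts as one value)
distinct_per_column <- sapply(toy_crimes, function(col) length(unique(col)))

# Share of missing values in each column
na_share <- colMeans(is.na(toy_crimes))

# Columns that carry no information because every value is NA
all_na_columns <- names(na_share)[na_share == 1]

# Rows whose coordinates are missing or recorded as 0, so they cannot be mapped
bad_coords <- is.na(toy_crimes$Lat) | is.na(toy_crimes$Long) |
  toy_crimes$Lat == 0 | toy_crimes$Long == 0

print(distinct_per_column)
print(all_na_columns)
print(sum(bad_coords))
```

Treating a coordinate of 0 as invalid is an assumption based on the Lat minimum and Long maximum of 0 in the summary, neither of which is a plausible value for Boston.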

Part 3. The Tentative Plan for Visualization

  1. Briefly describe what data analyses (please see the special note on statistics in the next section) and visualizations you plan to conduct to answer the research questions you proposed above.

Currently, I am planning to analyze the following using the dataset:

  • Crime Distribution and Frequency: Create a bar or pie chart to depict the distribution of crime types in Boston. Determine the most and least common crime categories. Calculate and display the relative frequency of each type of crime.
  • Temporal Patterns and Trends: Using line graphs or time series plots, plot the number of reported crimes over time (monthly or yearly). Identify any notable trends or patterns in crime rates over time.
  • Seasonal Variation in Crime: Aggregate the data by month or season to see if there are seasonal variations in crime rates. To compare the distribution of crimes across seasons, create box plots or violin plots.
  • Geographic Crime Hotspots: Identify high-crime areas in Boston using geospatial visualization techniques. To visualize crime density, plot crime incidents on a map with markers or heatmaps. To identify statistically significant crime clusters, use spatial analysis techniques such as hotspot analysis or cluster analysis.
  • Temporal Patterns by Crime Type: Examine temporal patterns associated with various types of crimes. To compare the temporal patterns of various crimes, create stacked line graphs or small multiples. Identify patterns and seasonality within crime types using statistical techniques such as time series decomposition or autocorrelation analysis.
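
As a rough sketch of the first two items in the plan, the code below counts incidents per offense type, computes relative frequencies, and builds a bar chart and a monthly line graph. It assumes the dplyr and ggplot2 packages (both part of the tidyverse loaded earlier). The data frame `toy_crimes` and its values are hypothetical and stand in for `crime_dataset`; the object names `offense_counts`, `monthly_counts`, `offense_plot`, and `trend_plot` are likewise made up for this example.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for crime_dataset; the values are invented for illustration only
toy_crimes <- data.frame(
  OFFENSE_DESCRIPTION = c("LARCENY", "VANDALISM", "LARCENY",
                          "ASSAULT", "LARCENY", "VANDALISM"),
  YEAR  = c(2019, 2019, 2019, 2020, 2020, 2020),
  MONTH = c(1, 1, 2, 1, 2, 2)
)

# Count incidents per offense type and add the relative frequency
offense_counts <- toy_crimes %>%
  count(OFFENSE_DESCRIPTION, name = "incidents") %>%
  mutate(share = incidents / sum(incidents)) %>%
  arrange(desc(incidents))

# Horizontal bar chart with the most common offense at the top
offense_plot <- ggplot(offense_counts,
                       aes(x = reorder(OFFENSE_DESCRIPTION, incidents),
                           y = incidents)) +
  geom_col() +
  coord_flip() +
  labs(x = "Offense", y = "Number of incidents")

# Monthly totals, with a proper Date column for the time axis
monthly_counts <- toy_crimes %>%
  count(YEAR, MONTH, name = "incidents") %>%
  mutate(month_start = as.Date(sprintf("%d-%02d-01", YEAR, MONTH)))

# Line graph of incidents over time
trend_plot <- ggplot(monthly_counts, aes(x = month_start, y = incidents)) +
  geom_line() +
  labs(x = "Month", y = "Number of incidents")

print(offense_counts)
```

Adding `facet_wrap(~ DISTRICT)` to the line graph would be one way to get small multiples per district, provided the counts are first grouped by district as well.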
  2. Explain why you choose to conduct these specific data analyses and visualizations. In other words, how do such types of statistics or graphs (see the R Gallery) help you answer specific questions? For example, how can a bivariate visualization reveal the relationship between two variables, or how does a linear graph of variables over time present the pattern of development?

The distribution of crime types can be represented visually using bar charts or pie charts. They give a clear overview of the most common and least common crime categories, making it simple to identify major crime trends. Line graphs and time series plots are useful for examining how crime rates change over time. In crime data, they reveal trends, patterns, and cyclical behavior. These visualizations aid in the identification of long-term trends, seasonal patterns, and unexpected changes in crime rates. Box plots and violin plots allow for the comparison of crime rates across seasons. They provide insights into the distribution of crime incidents during specific periods and aid in determining whether there are significant seasonal differences in crime rates.
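
To make the seasonal comparison concrete, here is a minimal sketch of a box plot of monthly incident counts grouped by season, assuming ggplot2 is available. The monthly numbers are invented for illustration, `season_of_month()` is a hypothetical helper, and the mapping of months to meteorological seasons (December to February as winter, and so on) is an assumption rather than something the dataset defines.

```r
library(ggplot2)

# Hypothetical monthly incident counts; the numbers are invented for illustration only
monthly_counts <- data.frame(
  YEAR  = rep(c(2019, 2020), each = 12),
  MONTH = rep(1:12, times = 2),
  incidents = c(610, 580, 640, 660, 700, 720, 750, 760, 710, 690, 630, 600,
                590, 570, 500, 480, 560, 640, 700, 710, 680, 660, 610, 580)
)

# Map month numbers to meteorological seasons (an assumed definition of "season")
season_of_month <- function(month) {
  seasons <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
               "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
  factor(seasons[month], levels = c("Winter", "Spring", "Summer", "Fall"))
}

monthly_counts$season <- season_of_month(monthly_counts$MONTH)

# One box per season, each built from that season's monthly counts
season_plot <- ggplot(monthly_counts, aes(x = season, y = incidents)) +
  geom_boxplot() +
  labs(x = "Season", y = "Incidents per month")

# Median monthly incidents per season as a numeric companion to the plot
season_medians <- tapply(monthly_counts$incidents, monthly_counts$season, median)
print(season_medians)
```

Swapping `geom_boxplot()` for `geom_violin()` gives the violin-plot variant of the same comparison.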

  3. If you plan to conduct specific data analyses and visualizations, describe how you need to process and prepare the tidy data.

Convert the date and time variables (such as OCCURRED_ON_DATE) to appropriate formats using date/time functions. Handle missing data and outliers by applying appropriate techniques (e.g., imputation, removal, robust statistics).
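
These preparation steps could look roughly like the base R sketch below. It rests on several assumptions that should be checked against the real files: that OCCURRED_ON_DATE is stored as text in the form `YYYY-MM-DD HH:MM:SS`, that the times are local Boston times, and that a coordinate of 0 marks a missing location. The data frame `toy_crimes` and its values are hypothetical, as are the new column names `occurred_at` and `occurred_date`.

```r
# Toy stand-in for crime_dataset; the values are invented for illustration only
toy_crimes <- data.frame(
  OCCURRED_ON_DATE = c("2019-01-15 08:30:00", "2020-07-04 23:10:00",
                       "2021-11-30 14:00:00"),
  SHOOTING = c(0, 1, 0),
  Lat  = c(42.33, 0, NA),
  Long = c(-71.08, 0, NA),
  stringsAsFactors = FALSE
)

# Parse the timestamp text; the format string and time zone are assumptions
toy_crimes$occurred_at <- as.POSIXct(toy_crimes$OCCURRED_ON_DATE,
                                     format = "%Y-%m-%d %H:%M:%S",
                                     tz = "America/New_York")
toy_crimes$occurred_date <- as.Date(toy_crimes$occurred_at,
                                    tz = "America/New_York")

# Treat 0 coordinates as missing, since (0, 0) is not a location in Boston
toy_crimes$Lat[which(toy_crimes$Lat == 0)] <- NA
toy_crimes$Long[which(toy_crimes$Long == 0)] <- NA

# Keep all rows for counting, but use only rows with coordinates for maps
mappable <- toy_crimes[!is.na(toy_crimes$Lat) & !is.na(toy_crimes$Long), ]

print(toy_crimes$occurred_date)
print(nrow(mappable))
```

Keeping the full table for counts and a filtered copy for maps avoids discarding incidents that only lack coordinates.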