DATA WRANGLING
#1. Identify the dataset you will be using for the final project.
I will be using a credit card transaction dataset containing both legitimate and fraudulent transactions. The dataset comes from Kaggle.com.
#Importing the dataset
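The import step itself is not shown in the output below; a minimal sketch, assuming the Kaggle file has been downloaded locally as "fraudTest.csv" (the path is an assumption):

```r
# Load the tidyverse for read_csv(), the dplyr verbs, and ggplot2 used below
library(tidyverse)

# Read the Kaggle CSV into a tibble; the filename/path is an assumption
fraudTest <- read_csv("fraudTest.csv")
```

readr's read_csv() names an unnamed leading index column "...1", which matches the first column shown in the preview below.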
#Preview of the dataset
head(fraudTest)
# A tibble: 6 x 23
...1 trans_date_trans_time cc_num merchant category amt first
<dbl> <dttm> <dbl> <chr> <chr> <dbl> <chr>
1 0 2020-06-21 12:14:25 2.29e15 fraud_Kirl~ persona~ 2.86 Jeff
2 1 2020-06-21 12:14:33 3.57e15 fraud_Spor~ persona~ 29.8 Joan~
3 2 2020-06-21 12:14:53 3.60e15 fraud_Swan~ health_~ 41.3 Ashl~
4 3 2020-06-21 12:15:15 3.59e15 fraud_Hale~ misc_pos 60.0 Brian
5 4 2020-06-21 12:15:17 3.53e15 fraud_John~ travel 3.19 Nath~
6 5 2020-06-21 12:15:37 3.04e13 fraud_Daug~ kids_pe~ 19.6 Dani~
# ... with 16 more variables: last <chr>, gender <chr>, street <chr>,
# city <chr>, state <chr>, zip <dbl>, lat <dbl>, long <dbl>,
# city_pop <dbl>, job <chr>, dob <date>, trans_num <chr>,
# unix_time <dbl>, merch_lat <dbl>, merch_long <dbl>,
# is_fraud <dbl>
#Using the function dim() to get the dimensions of the dataset.
dim(fraudTest)
[1] 555719 23
#Extracting column names from the dataset using the colnames() function
colnames(fraudTest)
[1] "...1" "trans_date_trans_time"
[3] "cc_num" "merchant"
[5] "category" "amt"
[7] "first" "last"
[9] "gender" "street"
[11] "city" "state"
[13] "zip" "lat"
[15] "long" "city_pop"
[17] "job" "dob"
[19] "trans_num" "unix_time"
[21] "merch_lat" "merch_long"
[23] "is_fraud"
#Renaming the first column to sr_no using names()
names(fraudTest)[1] <- "sr_no"
#Preview of the dataset after renaming the first column
head(fraudTest)
# A tibble: 6 x 23
sr_no trans_date_trans_time cc_num merchant category amt first
<dbl> <dttm> <dbl> <chr> <chr> <dbl> <chr>
1 0 2020-06-21 12:14:25 2.29e15 fraud_Kirl~ persona~ 2.86 Jeff
2 1 2020-06-21 12:14:33 3.57e15 fraud_Spor~ persona~ 29.8 Joan~
3 2 2020-06-21 12:14:53 3.60e15 fraud_Swan~ health_~ 41.3 Ashl~
4 3 2020-06-21 12:15:15 3.59e15 fraud_Hale~ misc_pos 60.0 Brian
5 4 2020-06-21 12:15:17 3.53e15 fraud_John~ travel 3.19 Nath~
6 5 2020-06-21 12:15:37 3.04e13 fraud_Daug~ kids_pe~ 19.6 Dani~
# ... with 16 more variables: last <chr>, gender <chr>, street <chr>,
# city <chr>, state <chr>, zip <dbl>, lat <dbl>, long <dbl>,
# city_pop <dbl>, job <chr>, dob <date>, trans_num <chr>,
# unix_time <dbl>, merch_lat <dbl>, merch_long <dbl>,
# is_fraud <dbl>
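The same rename can also be done with dplyr's rename(), which fits naturally into a pipeline; a sketch on a small toy tibble (the toy data is illustrative, not from the real dataset):

```r
library(dplyr)

# Toy tibble mimicking the unnamed index column produced by read_csv()
toy <- tibble(`...1` = 0:2, amt = c(2.86, 29.8, 41.3))

# Non-syntactic names like ...1 must be wrapped in backticks
toy <- rename(toy, sr_no = `...1`)
names(toy)  # "sr_no" "amt"
```

Unlike names(fraudTest)[1] <- "sr_no", rename() refers to the column by name rather than position, so it still works if the column order changes.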
#Focusing on Massachusetts (MA) data
Selecting the state, category, amt, is_fraud, and merchant columns from the dataset, filtering the result to state "MA", and arranging it by amt. Then plotting amt vs. category.
fraudTest %>%
  select(state, category, amt, is_fraud, merchant) %>%
  filter(state == "MA") %>%
  arrange(amt) %>%
  ggplot(aes(amt, category)) + geom_line()
1) Which state has the highest number of fraud transactions, and in which category?
2) Which category contributes the most fraud transactions?
3) What are the minimum and maximum fraud transaction amounts in each state?
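The three questions above can be sketched as dplyr summaries; a sketch, assuming fraudTest is loaded as before (column names as shown in the preview):

```r
# 1) Fraud counts by state and category, highest first
fraudTest %>%
  filter(is_fraud == 1) %>%
  count(state, category, sort = TRUE)

# 2) Fraud counts by category alone
fraudTest %>%
  filter(is_fraud == 1) %>%
  count(category, sort = TRUE)

# 3) Minimum and maximum fraud amount per state
fraudTest %>%
  filter(is_fraud == 1) %>%
  group_by(state) %>%
  summarise(min_amt = min(amt), max_amt = max(amt))
```

count(..., sort = TRUE) is shorthand for group_by() + summarise(n = n()) + arrange(desc(n)), so the first row of each result answers the corresponding question.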
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gogula (2022, March 23). Data Analytics and Computational Social Science: HW_3(2). Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommanikanta879859/
BibTeX citation
@misc{gogula2022hw_3(2),
  author = {Gogula, Mani kanta},
  title = {Data Analytics and Computational Social Science: HW_3(2)},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommanikanta879859/},
  year = {2022}
}