Data Analytics and Computational Social Science: HW_6

Mani Kanta Gogula

INTRODUCTION

With the rapid growth of cashless and digital transactions, credit cards are now widely used around the world. Credit card providers issue thousands of cards to their customers and have to ensure that every card user is genuine. A mistake in issuing a card can become a cause of financial crisis. As cashless transactions grow, the number of fraudulent transactions is likely to increase as well. A fraudulent transaction can be identified by analyzing the behavior of credit card customers in historical transaction datasets. If spending behavior deviates from the established patterns, the transaction is possibly fraudulent.

Fraudulent transactions are among the most serious threats to online security today. Artificial intelligence is vital for financial risk control in cloud environments. Many studies have explored methods for online transaction fraud detection; however, the existing methods are not sufficient to detect fraud with high precision.

Fraudulent transactions are orders and purchases made using a credit card or bank account that does not belong to the buyer. One of the largest factors in identity fraud, these types of transactions can end up doing damage to both merchants and the identity fraud victim.

The dataset I am using for my final project is a simulated credit card transaction dataset containing fraud transactions from 1 Jan 2019 to 31 Dec 2020. It covers credit card transactions of customers with a pool of 800 merchants. The source of the dataset is Kaggle.

RESEARCH QUESTIONS

1. Which states have the highest number of fraud transactions?
2. Which gender is the major victim of fraud transactions?
3. What are the minimum and maximum transaction amounts across all the states?
4. Which category is responsible for the maximum number of fraud transactions in the top 3 states?

library(readr)
library(tidyverse)
library(dplyr)
library(ggplot2)

LOADING THE DATASET

fraudTest <- read_csv("fraudTest.csv")
head(fraudTest)

# A tibble: 6 x 22
  Sr_no Trans_date_Trans~  cc_num Merchant Category Amount first last 
  <dbl> <chr>               <dbl> <chr>    <chr>     <dbl> <chr> <chr>
1     0 6/21/2020         2.29e15 fraud_K~ persona~      3 Jeff  Elli~
2     1 6/21/2020         3.57e15 fraud_S~ persona~     30 Joan~ Will~
3     2 6/21/2020         3.6 e15 fraud_S~ health_~     41 Ashl~ Lopez
4     3 6/21/2020         3.59e15 fraud_H~ misc_pos     60 Brian Will~
5     4 6/21/2020         3.53e15 fraud_J~ travel        3 Nath~ Mass~
6     5 6/21/2020         3.04e13 fraud_D~ kids_pe~     20 Dani~ Evans
# ... with 14 more variables: gender <chr>, street <chr>, city <chr>,
#   state <chr>, zip <dbl>, lat <dbl>, long <dbl>, city_pop <dbl>,
#   job <chr>, dob <chr>, trans_num <chr>, unix_time <dbl>,
#   merch_lat <dbl>, merch_long <dbl>

CHECKING THE DIMENSION OF THE DATASET

#Checking the dimension of the dataset
dim(fraudTest)

[1] 555719     22

GLIMPSE OF THE DATASET

glimpse(fraudTest)

Rows: 555,719
Columns: 22
$ Sr_no                 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ~
$ Trans_date_Trans_time <chr> "6/21/2020", "6/21/2020", "6/21/2020",~
$ cc_num                <dbl> 2.29e+15, 3.57e+15, 3.60e+15, 3.59e+15~
$ Merchant              <chr> "fraud_Kirlin and Sons", "fraud_Sporer~
$ Category              <chr> "personal_care", "personal_care", "hea~
$ Amount                <dbl> 3, 30, 41, 60, 3, 20, 134, 10, 4, 67, ~
$ first                 <chr> "Jeff", "Joanne", "Ashley", "Brian", "~
$ last                  <chr> "Elliott", "Williams", "Lopez", "Willi~
$ gender                <chr> "M", "F", "F", "M", "M", "F", "F", "F"~
$ street                <chr> "351 Darlene Green", "3638 Marsh Union~
$ city                  <chr> "Columbia", "Altonah", "Bellmore", "Ti~
$ state                 <chr> "SC", "UT", "NY", "FL", "MI", "NY", "C~
$ zip                   <dbl> 29209, 84002, 11710, 32780, 49632, 148~
$ lat                   <dbl> 33.9659, 40.3207, 40.6729, 28.5697, 44~
$ long                  <dbl> -80.9355, -110.4360, -73.5365, -80.819~
$ city_pop              <dbl> 333497, 302, 34496, 54767, 1126, 520, ~
$ job                   <chr> "Mechanical engineer", "Sales professi~
$ dob                   <chr> "3/19/1968", "1/17/1990", "10/21/1970"~
$ trans_num             <chr> "2da90c7d74bd46a0caf3777415b3ebd3", "3~
$ unix_time             <dbl> 1371816865, 1371816873, 1371816893, 13~
$ merch_lat             <dbl> 33.98639, 39.45050, 40.49581, 28.81240~
$ merch_long            <dbl> -81.20071, -109.96043, -74.19611, -80.~
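
The glimpse() output shows that Trans_date_Trans_time and dob were read as character columns. A minimal sketch of converting them to Date values, assuming every entry follows the month/day/year pattern visible above. The toy tibble reuses two rows from that output, and the column names trans_date and birth_date are illustrative, not part of the dataset.

```r
library(dplyr)

# Two rows copied from the glimpse() output above, used as a stand-in for fraudTest
toy <- tibble(
  Trans_date_Trans_time = c("6/21/2020", "6/21/2020"),
  dob = c("3/19/1968", "1/17/1990")
)

# as.Date() needs an explicit format because the strings are month/day/year
# (assumption: the whole column follows this pattern)
toy_dates <- toy %>%
  mutate(
    trans_date = as.Date(Trans_date_Trans_time, format = "%m/%d/%Y"),
    birth_date = as.Date(dob, format = "%m/%d/%Y")
  )

print(toy_dates)
```

The same mutate() call could be applied to fraudTest itself if date-based grouping is needed later.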

DESCRIPTIVE STATISTICS OF THE DATASET

median(fraudTest$Amount, na.rm= TRUE)

[1] 47

sd(fraudTest$Amount, na.rm= TRUE)

[1] 156.7456

mean(fraudTest$Amount , na.rm= TRUE)

[1] 69.39616

range(fraudTest$Amount)

[1]     1 22768

min(fraudTest$Amount)

[1] 1

max(fraudTest$Amount)

[1] 22768

The descriptive statistics functions give the following figures for the dataset:

1. The average amount of a fraud transaction across all the states is 69.39616 USD.
2. The minimum amount of a fraud transaction across all the states is 1 USD.
3. The maximum amount of a fraud transaction across all the states is 22768 USD.
4. The median amount of the fraud transactions is 47 USD.
5. The amounts of the fraud transactions range between 1 and 22768 USD.
6. The standard deviation of the fraud transaction amounts is 156.7456 USD.
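
The separate calls above can also be collected into a single summarise() call. The following is a minimal sketch on a hypothetical toy tibble; its amounts are made up for illustration and are not taken from the dataset, and the result column names are my own choice.

```r
library(dplyr)

# Hypothetical amounts, standing in for fraudTest$Amount
toy <- tibble(Amount = c(1, 3, 20, 47, 60, 134))

# One row holding all descriptive statistics of the Amount column
amount_stats <- toy %>%
  summarise(
    average = mean(Amount, na.rm = TRUE),
    median  = median(Amount, na.rm = TRUE),
    std_dev = sd(Amount, na.rm = TRUE),
    minimum = min(Amount, na.rm = TRUE),
    maximum = max(Amount, na.rm = TRUE)
  )

print(amount_stats)
```

Replacing toy with fraudTest should reproduce the figures listed above in one table.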

Finding the Minimum, Maximum and Average of the Amount for each state

fraudTest %>% 
  group_by(state) %>% 
  summarise(Average=mean(Amount) , lower=min(Amount) , Upper = max(Amount))

# A tibble: 50 x 4
   state Average lower Upper
   <chr>   <dbl> <dbl> <dbl>
 1 AK       78.4     1  1617
 2 AL       64.3     1  5030
 3 AR       76.2     1  8181
 4 AZ       75.8     1  7321
 5 CA       73.3     1 16837
 6 CO       76.0     1  5187
 7 CT       62.6     1  4120
 8 DC       71.7     1  1121
 9 FL       71.4     1 21438
10 GA       69.2     1  7886
# ... with 40 more rows

Visualizing the count of fraud transactions for each state

fraudTest %>%
  select('state') %>%
  group_by(state) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = state, y = count, fill = state)) +
  geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none")

States such as Texas, New York and Pennsylvania account for the largest number of fraud transactions, while states such as Alaska, Hawaii and Rhode Island have the fewest fraud transactions among all the states.
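
The ranking read off the bar chart can also be checked numerically. A minimal sketch, assuming one row per transaction as in fraudTest; the toy tibble and the names state_counts, top_states and bottom_states are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: one row per transaction
toy <- tibble(state = c("TX", "TX", "TX", "NY", "NY", "PA", "AK"))

# Number of transactions per state, largest first
state_counts <- toy %>%
  count(state, sort = TRUE, name = "count")

# States with the most and the fewest transactions
top_states    <- slice_max(state_counts, count, n = 2, with_ties = FALSE)
bottom_states <- slice_min(state_counts, count, n = 2, with_ties = FALSE)

print(top_states)
print(bottom_states)
```

With with_ties = FALSE each result has exactly n rows, so tied states are cut off arbitrarily; dropping that argument keeps all tied states.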

Focusing on the states with the fewest fraud transactions, city by city

selected_states <- c('AK', 'HI')

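# Draw one bar chart of transaction counts per city for each selected state
for (x in selected_states){
  p <- fraudTest %>%
    filter(state == x) %>%
    ggplot(aes(x = city, fill = city)) +
    # geom_bar() counts the rows per city; it is the idiomatic form of geom_histogram(stat = "count")
    geom_bar() +
    labs(title = x) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none")
  print(p)
}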

fraudTest %>% 
  select(state, Category, Amount , Merchant) %>%  
  filter(state =="HI") %>% 
  arrange(Amount) %>% 
  ggplot(aes(Amount , Category, fill= Category)) +geom_boxplot(aes(color=Category))

shopping_pos and shopping_net are the top two categories responsible for the fraud transactions in HI, while categories such as health_fitness, grocery_net and gas_transport have the fewest fraud transactions.

fraudTest %>% 
  select(state, Category, Amount , Merchant) %>%  
  filter(state =="AK") %>% 
  arrange(Amount) %>% 
  ggplot(aes(Amount , Category, fill= Category)) +geom_line(aes(color=Category))

shopping_net and shopping_pos are the top two categories responsible for the fraud transactions in AK, while categories such as grocery_net and gas_transport have the fewest fraud transactions.

What is missing from your final project?

The top categories responsible for the fraud transactions in the top 3 states in the United States, and the gender ratios of those states.
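
One possible way to approach these two missing pieces is sketched below. It runs on a hypothetical toy tibble that only borrows the column names state, Category and gender from fraudTest; the values and the names top3, top_category and gender_share are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with the same column names as fraudTest
toy <- tibble(
  state    = c("TX", "TX", "TX", "NY", "NY", "NY", "PA", "PA", "AK"),
  Category = c("shopping_net", "shopping_net", "travel",
               "grocery_pos", "grocery_pos", "travel",
               "shopping_pos", "shopping_pos", "travel"),
  gender   = c("F", "M", "F", "F", "F", "M", "M", "F", "M")
)

# Step 1: the three states with the most transactions
top3 <- toy %>%
  count(state, sort = TRUE, name = "count") %>%
  slice_head(n = 3) %>%
  pull(state)

# Step 2: the most frequent category within each of those states
top_category <- toy %>%
  filter(state %in% top3) %>%
  count(state, Category, name = "count") %>%
  group_by(state) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%
  ungroup()

# Step 3: the share of each gender within each of those states
gender_share <- toy %>%
  filter(state %in% top3) %>%
  count(state, gender, name = "count") %>%
  group_by(state) %>%
  mutate(share = count / sum(count)) %>%
  ungroup()

print(top_category)
print(gender_share)
```

Running the same three steps on fraudTest would address research questions 2 and 4.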
