Data Analytics and Computational Social Science: DACSS_601_FINAL_PROJECT

MANI KANTA GOGULA

INTRODUCTION

Credit card fraud is an inclusive term for fraud committed using a payment card, such as a credit card or debit card. The purpose may be to obtain goods or services, or to make unauthorized payments. The Payment Card Industry Data Security Standard (PCI DSS) is the data security standard created to help financial institutions process card payments securely and reduce card fraud.

Credit card fraud can be authorized, where the genuine customer processes a payment to another account that is controlled by a fraudster, or unauthorized, where the account holder does not authorize the payment and the transaction is carried out by a third party. In 2018, unauthorized financial fraud losses across payment cards and remote banking totaled £844.8 million in the United Kingdom, whereas banks and card companies prevented £1.66 billion in unauthorized fraud in the same year. That is equivalent to £2 in every £3 of attempted fraud being stopped.

Credit cards are more secure than ever, with regulators, card providers and banks taking considerable time and effort to collaborate with investigators worldwide to ensure fraudsters aren’t successful. Cardholders’ money is usually protected from scammers by regulations that make the card provider and bank accountable. The technology and security measures behind credit cards are becoming increasingly sophisticated, making it harder for fraudsters to steal money.

Due to the rapid growth of cashless and digital transactions, credit cards are widely used all around the world. Credit card providers issue thousands of cards to their customers and have to ensure that every card user is genuine. Any mistake in issuing a card can become a cause of financial crisis. As cashless transactions grow, the number of fraudulent transactions is also likely to increase. A fraudulent transaction can be identified by analyzing the behavior of credit card customers in previous transaction history datasets: if a deviation from the established spending pattern is noticed, the transaction is possibly fraudulent.

Fraudulent transactions are one of the most serious threats to online security today. Artificial intelligence is vital for financial risk control in cloud environments. Many studies have attempted to explore methods for online transaction fraud detection; however, the existing methods are not sufficient to perform detection with high precision.

Fraudulent transactions are orders and purchases made using a credit card or bank account that does not belong to the buyer. One of the largest factors in identity fraud, these types of transactions can end up doing damage to both merchants and the identity fraud victim.

DATASET

This is a simulated credit card transaction dataset covering the period from 1 January 2019 to 31 December 2020. It covers the credit cards of customers transacting with a pool of 800 merchants. The source of the dataset is Kaggle.com.

Loading Libraries

library(tidyverse) 
library(dplyr)
library(ggplot2)

Loading Dataset

fraudTest <- read_csv("fraudTest.csv") #Loading dataset
head(fraudTest)

# A tibble: 6 x 22
  Sr_no Trans_date_Trans~  cc_num Merchant Category Amount first last 
  <dbl> <chr>               <dbl> <chr>    <chr>     <dbl> <chr> <chr>
1     0 6/21/2020         2.29e15 fraud_K~ persona~      3 Jeff  Elli~
2     1 6/21/2020         3.57e15 fraud_S~ persona~     30 Joan~ Will~
3     2 6/21/2020         3.6 e15 fraud_S~ health_~     41 Ashl~ Lopez
4     3 6/21/2020         3.59e15 fraud_H~ misc_pos     60 Brian Will~
5     4 6/21/2020         3.53e15 fraud_J~ travel        3 Nath~ Mass~
6     5 6/21/2020         3.04e13 fraud_D~ kids_pe~     20 Dani~ Evans
# ... with 14 more variables: gender <chr>, street <chr>, city <chr>,
#   state <chr>, zip <dbl>, lat <dbl>, long <dbl>, city_pop <dbl>,
#   job <chr>, dob <chr>, trans_num <chr>, unix_time <dbl>,
#   merch_lat <dbl>, merch_long <dbl>

DATA EXPLORATION

Checking the dimension of the dataset

#Checking the dimension of the dataset
dim(fraudTest)

[1] 555719     22

The dataset contains 555,719 rows and 22 columns.

Glimpse of the dataset

glimpse(fraudTest) #glimpse of the dataset

Rows: 555,719
Columns: 22
$ Sr_no                 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ~
$ Trans_date_Trans_time <chr> "6/21/2020", "6/21/2020", "6/21/2020",~
$ cc_num                <dbl> 2.29e+15, 3.57e+15, 3.60e+15, 3.59e+15~
$ Merchant              <chr> "fraud_Kirlin and Sons", "fraud_Sporer~
$ Category              <chr> "personal_care", "personal_care", "hea~
$ Amount                <dbl> 3, 30, 41, 60, 3, 20, 134, 10, 4, 67, ~
$ first                 <chr> "Jeff", "Joanne", "Ashley", "Brian", "~
$ last                  <chr> "Elliott", "Williams", "Lopez", "Willi~
$ gender                <chr> "M", "F", "F", "M", "M", "F", "F", "F"~
$ street                <chr> "351 Darlene Green", "3638 Marsh Union~
$ city                  <chr> "Columbia", "Altonah", "Bellmore", "Ti~
$ state                 <chr> "SC", "UT", "NY", "FL", "MI", "NY", "C~
$ zip                   <dbl> 29209, 84002, 11710, 32780, 49632, 148~
$ lat                   <dbl> 33.9659, 40.3207, 40.6729, 28.5697, 44~
$ long                  <dbl> -80.9355, -110.4360, -73.5365, -80.819~
$ city_pop              <dbl> 333497, 302, 34496, 54767, 1126, 520, ~
$ job                   <chr> "Mechanical engineer", "Sales professi~
$ dob                   <chr> "3/19/1968", "1/17/1990", "10/21/1970"~
$ trans_num             <chr> "2da90c7d74bd46a0caf3777415b3ebd3", "3~
$ unix_time             <dbl> 1371816865, 1371816873, 1371816893, 13~
$ merch_lat             <dbl> 33.98639, 39.45050, 40.49581, 28.81240~
$ merch_long            <dbl> -81.20071, -109.96043, -74.19611, -80.~

The output above gives a glimpse of the fraudTest dataset. It shows the type of each column and the first few values recorded in it. For example, Merchant and Category are both character columns.
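The glimpse() output shows cc_num and zip parsed as <dbl>, although both are identifiers rather than quantities. As a minimal sketch, assuming readr 2.0 or later (which accepts literal data wrapped in I()), the snippet below reads a two-row inline CSV with those columns forced to character. The sample values are made up for illustration and are not taken from the dataset.

```r
library(readr)

# Hypothetical two-row sample; the card numbers and zip codes are invented.
sample_csv <- I("cc_num,zip,Amount\n2291160000000000,01002,3\n3573030000000000,84002,30\n")

typed <- read_csv(
  sample_csv,
  col_types = cols(
    cc_num = col_character(),  # identifier, never used in arithmetic
    zip    = col_character(),  # keeps leading zeros such as "01002"
    Amount = col_double()
  )
)

typed
```

The same col_types argument could be passed when reading fraudTest.csv, if keeping leading zeros in zip codes matters for the analysis.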

Column Names

colnames(fraudTest) #column names of the dataset

 [1] "Sr_no"                 "Trans_date_Trans_time"
 [3] "cc_num"                "Merchant"             
 [5] "Category"              "Amount"               
 [7] "first"                 "last"                 
 [9] "gender"                "street"               
[11] "city"                  "state"                
[13] "zip"                   "lat"                  
[15] "long"                  "city_pop"             
[17] "job"                   "dob"                  
[19] "trans_num"             "unix_time"            
[21] "merch_lat"             "merch_long"

The dataset contains 22 columns. The main ones are:

Category : Item type of the fraud transaction

Amount : Amount of the transaction

gender : Gender of the credit card holder

city : City of the credit card holder

state : State of the credit card holder

Trans_date_Trans_time : Date and time at which the fraud transaction occurred

cc_num : Credit card number used for the fraud transaction

Merchant : Site or physical location where the credit card was used for the fraud purchase

street : Street address of the customer

zip : Zip code of the customer's billing address

job : Job of the credit card holder

dob : Date of birth of the credit card holder

Renaming column names

fraudTest <- rename(fraudTest, first_name = first, last_name = last, Date_of_birth = dob, Trans_date = Trans_date_Trans_time) # rename() returns a new tibble, so the result is assigned back to keep the new names
fraudTest

# A tibble: 555,719 x 22
   Sr_no Trans_date  cc_num Merchant        Category Amount first_name
   <dbl> <chr>        <dbl> <chr>           <chr>     <dbl> <chr>     
 1     0 6/21/2020  2.29e15 fraud_Kirlin a~ persona~      3 Jeff      
 2     1 6/21/2020  3.57e15 fraud_Sporer-K~ persona~     30 Joanne    
 3     2 6/21/2020  3.6 e15 fraud_Swaniaws~ health_~     41 Ashley    
 4     3 6/21/2020  3.59e15 fraud_Haley Gr~ misc_pos     60 Brian     
 5     4 6/21/2020  3.53e15 fraud_Johnston~ travel        3 Nathan    
 6     5 6/21/2020  3.04e13 fraud_Daughert~ kids_pe~     20 Danielle  
 7     6 6/21/2020  2.13e14 fraud_Romaguer~ health_~    134 Kayla     
 8     7 6/21/2020  3.59e15 fraud_Reichel ~ persona~     10 Paula     
 9     8 6/21/2020  3.6 e15 fraud_Goyette,~ shoppin~      4 David     
10     9 6/21/2020  3.55e15 fraud_Kilback ~ food_di~     67 Kayla     
# ... with 555,709 more rows, and 15 more variables: last_name <chr>,
#   gender <chr>, street <chr>, city <chr>, state <chr>, zip <dbl>,
#   lat <dbl>, long <dbl>, city_pop <dbl>, job <chr>,
#   Date_of_birth <chr>, trans_num <chr>, unix_time <dbl>,
#   merch_lat <dbl>, merch_long <dbl>

Categorical variables

Categorical variables take one of a limited, countable set of labels rather than a numeric value. In the fraudTest dataset, state, Category and gender are categorical variables stored as character columns. The table() function counts the number of rows for each value of these variables.
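As an alternative to table(), the same tallies can be produced with dplyr::count(), which returns a tibble that is easy to sort or pass to ggplot. The sketch below uses a small made-up stand-in for fraudTest; the rows are illustrative only.

```r
library(dplyr)

# Toy stand-in for fraudTest with made-up rows.
toy <- tibble(
  state  = c("TX", "NY", "TX", "PA", "TX", "NY"),
  gender = c("F", "M", "F", "F", "M", "F")
)

# count() tallies rows per value; sort = TRUE puts the largest group first.
state_counts <- toy %>% count(state, sort = TRUE)
state_counts

# Two-way tally of state by gender, the dplyr analogue of table(state, gender).
state_gender <- toy %>% count(state, gender)
state_gender
```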

table(fraudTest$state) # State wise count


   AK    AL    AR    AZ    CA    CO    CT    DC    FL    GA    HI 
  843 17532 13484  4592 24135  5886  3277  1517 18104 11277  1090 
   IA    ID    IL    IN    KS    KY    LA    MA    MD    ME    MI 
11819  2490 18960 11959  9943 12506  8988  5186 11152  6928 19671 
   MN    MO    MS    MT    NC    ND    NE    NH    NJ    NM    NV 
13719 16501  8833  5052 12868  6397 10257  3449 10528  7020  2451 
   NY    OH    OK    OR    PA    RI    SC    SD    TN    TX    UT 
35918 20147 11379  7811 34326   195 12541  5250  7359 40393  4658 
   VA    VT    WA    WI    WV    WY 
12506  5044  8116 12370 10838  8454

Category wise count

t<-table(fraudTest$Category) #Category wise count
t


 entertainment    food_dining  gas_transport    grocery_net 
         40104          39268          56370          19426 
   grocery_pos health_fitness           home      kids_pets 
         52553          36674          52345          48692 
      misc_net       misc_pos  personal_care   shopping_net 
         27367          34574          39327          41779 
  shopping_pos         travel 
         49791          17449

Gender wise count

x <- table(fraudTest$gender) # Gender wise count
x


     F      M 
304886 250833

Fraud transactions by category

cat_fraud<- group_by(fraudTest , Category) %>%  #Category wise count
  summarise(count=n()) 
cat_fraud

# A tibble: 14 x 2
   Category       count
   <chr>          <int>
 1 entertainment  40104
 2 food_dining    39268
 3 gas_transport  56370
 4 grocery_net    19426
 5 grocery_pos    52553
 6 health_fitness 36674
 7 home           52345
 8 kids_pets      48692
 9 misc_net       27367
10 misc_pos       34574
11 personal_care  39327
12 shopping_net   41779
13 shopping_pos   49791
14 travel         17449

Visualizing fraud transactions by category

ggplot(cat_fraud, aes(x = Category, y = count, fill = Category)) +  # Plotting the category wise fraud transaction counts
  geom_col() +                                                       # geom_col() is shorthand for geom_bar(stat = "identity")
  geom_text(aes(label = count), hjust = -0.1) +                      # with flipped axes, hjust places the label past the bar end
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +         # leave room on the right for the labels
  coord_flip() +
  labs(title = "Fraud transactions by category")

The plot above shows the number of fraud transactions per category. The gas_transport category accounts for the most fraud transactions across the United States (56,370). The grocery_pos and home categories have almost the same number of transactions counted as fraud (52,553 and 52,345). Since most customers use their credit cards for shopping, the shopping_net and shopping_pos categories come next, with about 91.6k fraud transactions between them. It is interesting to see the vast difference between grocery_pos and grocery_net: grocery_net has fewer than 20k fraud transactions. Out-of-home payment categories such as travel, food_dining and entertainment together account for nearly 97k fraud transactions.
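To put the category counts in proportion, the sketch below rebuilds the category table from the counts printed earlier and adds each category's percentage share of all transactions. The object names cat_counts and cat_share are illustrative.

```r
library(dplyr)

# Counts copied from the category table printed above.
cat_counts <- tibble(
  Category = c("entertainment", "food_dining", "gas_transport", "grocery_net",
               "grocery_pos", "health_fitness", "home", "kids_pets",
               "misc_net", "misc_pos", "personal_care", "shopping_net",
               "shopping_pos", "travel"),
  count = c(40104, 39268, 56370, 19426, 52553, 36674, 52345, 48692,
            27367, 34574, 39327, 41779, 49791, 17449)
)

# Percentage share of each category, largest first.
cat_share <- cat_counts %>%
  mutate(share = round(100 * count / sum(count), 1)) %>%
  arrange(desc(count))

cat_share
```

On the real data the same two verbs can be applied directly to cat_fraud.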

Descriptive statistics

Descriptive statistics are brief coefficients that summarize a given dataset, which can represent either an entire population or a sample of it. They are broken down into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean, median and mode, while measures of variability include the standard deviation and variance. In this dataset I focus on the Amount variable, since the whole dataset is about fraudulent purchases and each purchase is most directly described by its amount. It is also important to look at the range, minimum and maximum of the fraud transaction amounts to inform the visualizations that follow.

median(fraudTest$Amount, na.rm= TRUE) #Median of the fraud transaction amount

[1] 47

sd(fraudTest$Amount, na.rm= TRUE)     #standard deviation of the fraud transaction amount

[1] 156.7456

mean(fraudTest$Amount , na.rm= TRUE)  #mean of the fraud transaction amount

[1] 69.39616

range(fraudTest$Amount)               #range of the fraud transaction amount

[1]     1 22768

min(fraudTest$Amount)                 #Minimum amount of the fraud transaction

[1] 1

max(fraudTest$Amount)                 #Maximum amount of the fraud transaction

[1] 22768

The code above computes the descriptive statistics of the Amount variable. The median fraud transaction amount across all states is 47 USD and the mean is 69.40 USD. To find the lower and upper limits of the fraud transaction amount, we can use the range function: across the 555,719 transactions the amount ranges from 1 USD to 22,768 USD. In statistics, the standard deviation quantifies the amount of variation or dispersion in a set of values, and it can be computed with the sd function. The standard deviation of the fraud amount is 156.75 USD. The minimum fraud transaction amount is 1 USD and the maximum is 22,768 USD.
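A mean (69.40) well above the median (47), together with a maximum of 22,768, suggests a right-skewed distribution in which a few very large transactions pull the mean upward. The sketch below illustrates that effect on a made-up vector of amounts (not taken from the dataset) and adds quantile() and IQR(), which are less sensitive to extreme values.

```r
# Made-up amounts: six small purchases and one very large one.
amounts <- c(1, 3, 20, 47, 60, 134, 22768)

stats <- c(
  median = median(amounts),
  mean   = mean(amounts),
  sd     = sd(amounts),
  iqr    = IQR(amounts)
)
round(stats, 2)

# Quantiles describe the spread without being dominated by the single outlier.
quantile(amounts, probs = c(0.25, 0.5, 0.75))
```

The same calls work on fraudTest$Amount, with na.rm = TRUE added if missing values are possible.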

Finding the fraud transaction amounts per state

desc_stats <- group_by(fraudTest, state) %>% 
  summarise(Average = mean(Amount), lower = min(Amount), Upper = max(Amount)) # Average, minimum and maximum fraud transaction amount per state
head(desc_stats)

# A tibble: 6 x 4
  state Average lower Upper
  <chr>   <dbl> <dbl> <dbl>
1 AK       78.4     1  1617
2 AL       64.3     1  5030
3 AR       76.2     1  8181
4 AZ       75.8     1  7321
5 CA       73.3     1 16837
6 CO       76.0     1  5187

The table above shows the average, minimum and maximum fraud transaction amount for each state. State-level statistics make it possible to see which states have higher average, minimum and maximum amounts than the country as a whole.

The table shows the first six states in alphabetical order: Alaska, Alabama, Arkansas, Arizona, California and Colorado. Five of the six have an average above the overall average of 69.40 USD; Alabama, at 64.3 USD, is the exception. Alaska has the highest average of the six at about 78 USD. All six states have 1 USD as the minimum fraud transaction amount. California has the highest maximum, 16,837 USD; the maxima of the other five states (AK, AL, AR, AZ, CO) range from 1,617 USD to 8,181 USD.
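The comparison against the overall average can be checked in code rather than by eye. The sketch below rebuilds the six rows printed by head(desc_stats), uses the overall mean reported earlier (69.39616), and flags which states lie above it. The object names first_six and compared are illustrative.

```r
library(dplyr)

# Values copied from the head(desc_stats) output and the overall mean above.
overall_mean <- 69.39616
first_six <- tibble(
  state   = c("AK", "AL", "AR", "AZ", "CA", "CO"),
  Average = c(78.4, 64.3, 76.2, 75.8, 73.3, 76.0),
  Upper   = c(1617, 5030, 8181, 7321, 16837, 5187)
)

# Flag states whose average exceeds the overall mean, highest average first.
compared <- first_six %>%
  mutate(above_overall = Average > overall_mean) %>%
  arrange(desc(Average))

compared
```

Applied to the full desc_stats tibble, the same mutate() call would cover every state at once.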

Fraud transaction count by gender

a <- fraudTest %>% 
  group_by(gender) %>% 
  summarise(count = n()) # Fraud transactions per gender
a

# A tibble: 2 x 2
  gender  count
  <chr>   <int>
1 F      304886
2 M      250833

pie(x, main = "Fraud transactions by gender")

The pie chart above shows fraud transactions by gender. Of the 555,719 fraud transactions in the dataset, 304,886 affected female credit card holders, while 250,833 affected male credit card holders.
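The gender split is easier to compare as percentages. A minimal sketch using the two counts printed above and base R's prop.table():

```r
# Counts copied from the gender table above.
gender_counts <- c(F = 304886, M = 250833)

# prop.table() divides each count by the total; multiply by 100 for percentages.
gender_share <- round(100 * prop.table(gender_counts), 1)
gender_share
```

On the real data, prop.table(table(fraudTest$gender)) gives the same shares without retyping the counts.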

Number of fraud transactions per State

  select(fraudTest ,state) %>% 
    group_by(state) %>% 
    count()

# A tibble: 50 x 2
# Groups:   state [50]
   state     n
   <chr> <int>
 1 AK      843
 2 AL    17532
 3 AR    13484
 4 AZ     4592
 5 CA    24135
 6 CO     5886
 7 CT     3277
 8 DC     1517
 9 FL    18104
10 GA    11277
# ... with 40 more rows

fraudTest %>%
  group_by(state) %>%
  summarise(count = n()) %>%                                   # number of fraud transactions per state
  ggplot(aes(x = state, y = count, fill = count)) +
  geom_col() +
  geom_text(aes(label = count), hjust = -0.1) +                # with flipped axes, hjust places the label past the bar end
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +   # leave room on the right for the labels
  coord_flip() +
  labs(title = "Fraud transactions by state") +
  theme(legend.position = "none")

The visualization above shows the fraud transaction count per state. The data has 50 state codes (49 states plus the District of Columbia). Texas has the most fraud transactions, with 40,393, followed by New York with 35,918 and Pennsylvania with 34,326. These three states account for the most fraud transactions in the dataset. Most states have fewer than 20,000 fraud transactions. AK and HI are among the states with the fewest, with 843 and 1,090 respectively; only RI, with 195, has fewer. I will focus on these five states (TX, NY, PA, AK and HI) and break down their fraud transactions by city and by category.
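The highest and lowest states can be extracted directly instead of being read off the chart. The sketch below rebuilds the state counts from the table(fraudTest$state) output shown earlier and uses dplyr's slice_max() and slice_min(). The object names are illustrative.

```r
library(dplyr)

# Counts copied from the table(fraudTest$state) output above.
state_counts <- tibble(
  state = c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "FL", "GA",
            "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD",
            "ME", "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH",
            "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA", "RI", "SC",
            "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY"),
  count = c(843, 17532, 13484, 4592, 24135, 5886, 3277, 1517, 18104, 11277,
            1090, 11819, 2490, 18960, 11959, 9943, 12506, 8988, 5186, 11152,
            6928, 19671, 13719, 16501, 8833, 5052, 12868, 6397, 10257, 3449,
            10528, 7020, 2451, 35918, 20147, 11379, 7811, 34326, 195, 12541,
            5250, 7359, 40393, 4658, 12506, 5044, 8116, 12370, 10838, 8454)
)

top3    <- state_counts %>% slice_max(count, n = 3)  # three largest counts
bottom3 <- state_counts %>% slice_min(count, n = 3)  # three smallest counts

top3
bottom3
```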

Top 3 states in fraud transactions

selected_states <- c('TX', 'NY', 'PA')

for (st in selected_states) {
  p <- fraudTest %>%
    filter(state == st) %>%
    ggplot(aes(x = city, fill = city)) +
    geom_bar() +                                  # geom_bar() counts the rows for each city
    theme(legend.position = "none") +
    labs(title = paste("Fraud transactions by city:", st)) +
    coord_flip()
  print(p)                                        # plots must be printed explicitly inside a for loop
}

The visualizations above show the number of fraud transactions per city for the three states (TX, NY, PA). In Texas, San Antonio and Holliday record the most fraud transactions, while O Brien and Desdemona record the fewest. Apart from the top two cities, most cities in Texas have fewer than 1,500 fraud transactions. In New York, Tupper Lake and Chatham have the most fraud transactions, most cities have roughly the same number, and Springfield and Kirkwood have the fewest. In Pennsylvania, Clarks Mills and Philadelphia have the most fraud transactions, while Dublin and Harmony have the fewest.
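The per-state loop can also be written as a small function applied with lapply(), which returns the plots as a list instead of printing them one by one. This is a sketch under stated assumptions: plot_city_counts is a hypothetical helper name, and the toy rows below are made up to stand in for fraudTest.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: bar chart of transaction counts per city for one state.
plot_city_counts <- function(data, state_code) {
  data %>%
    filter(state == state_code) %>%
    count(city) %>%
    ggplot(aes(x = reorder(city, n), y = n)) +   # order cities by their count
    geom_col() +
    coord_flip() +
    labs(title = paste("Fraud transactions by city:", state_code),
         x = "city", y = "count")
}

# Toy data with made-up rows standing in for fraudTest.
toy <- tibble(
  state = c("TX", "TX", "TX", "NY", "NY"),
  city  = c("San Antonio", "San Antonio", "Holliday", "Chatham", "Tupper Lake")
)

# One plot per state, collected in a list.
plots <- lapply(c("TX", "NY"), plot_city_counts, data = toy)
```

Ordering the bars by count makes the highest and lowest cities visible without scanning an alphabetical axis.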

fraudTest %>% 
  select(state, Category, Amount , Merchant) %>%  
  filter(state =="TX") %>% 
  arrange(Amount) %>% 
  ggplot(aes(Amount , Category, fill= Category)) +geom_jitter(aes(color=Category))+coord_flip()+labs(title="Category wise transactions in Texas")

fraudTest %>% 
  select(state, Category, Amount , Merchant) %>%  
  filter(state =="NY") %>% 
  arrange(Amount) %>%
  ggplot(aes(Amount , Category)) +geom_boxplot(aes(color=Category))+labs(title="Category wise transactions in NewYork")

fraudTest %>% 
  select(state, Category, Amount , Merchant) %>%  
  filter(state =="PA") %>% 
  arrange(Amount) %>% 
  ggplot(aes(Amount , Category)) +geom_line(color="red")+labs(title="Category wise transactions in Pennsylvania") # a fixed colour goes outside aes(); inside aes() it is treated as data and adds a legend

The three visualizations above show the relationship between fraud transaction amount and category for the three states with the most fraud transactions. Interestingly, in all three states the travel category has the highest maximum amount, followed by shopping_pos and shopping_net. The lowest amounts are in gas_transport and grocery_pos, and this pattern is similar in all three states.
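Reading maxima and minima off a jitter or box plot is approximate. A numeric summary per category can confirm the ordering. The sketch below uses a few made-up Texas rows in place of fraudTest; the amounts are illustrative and not taken from the dataset.

```r
library(dplyr)

# Toy rows with invented amounts, standing in for fraudTest.
toy <- tibble(
  state    = "TX",
  Category = c("travel", "travel", "shopping_pos", "gas_transport", "gas_transport"),
  Amount   = c(12, 950, 310, 45, 80)
)

# Lowest and highest amount per category for one state, largest maximum first.
category_range <- toy %>%
  filter(state == "TX") %>%
  group_by(Category) %>%
  summarise(lowest = min(Amount), highest = max(Amount), n = n()) %>%
  arrange(desc(highest))

category_range
```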

States with few fraud transactions, by city

selected_states <- c('AK', 'HI')

for (st in selected_states) {
  p <- fraudTest %>%
    filter(state == st) %>%
    ggplot(aes(x = city, fill = city)) +
    geom_bar() +                                  # geom_bar() counts the rows for each city
    theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none") +
    labs(title = paste("Fraud transactions by city:", st))
  print(p)
}

AK and HI are two of the states with the fewest fraud transactions (843 and 1,090). The visualizations above show the fraud transactions of these two states by city.

In Alaska, Wales has the most fraud transactions, followed by Huslia and then Craig. In Hawaii, Paauilo has more fraud transactions than the other city, Honokaa.

fraudTest %>% 
  select(state, Category, Amount , Merchant) %>%  
  filter(state =="AK") %>% 
  arrange(Amount) %>% 
  ggplot(aes(Amount , Category, fill= Category)) +geom_jitter(aes(color=Category))+labs(title="Category wise transactions in Alaska")+coord_flip()

fraudTest %>% 
  select(state, Category, Amount , Merchant) %>%  
  filter(state =="HI") %>% 
  arrange(Amount) %>% 
  ggplot(aes(Amount , Category, fill= Category)) +geom_boxplot(aes(color=Category))+labs(title="Category wise transactions in Hawaii")

The plots above show the fraud transaction amount by category in AK and HI. The results are quite different from those of the top three states (TX, NY and PA). In these two states, shopping_net and shopping_pos are the categories with the highest fraud transaction amounts, and the travel category has the lowest, which is the opposite of what was seen in the top three states.
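To compare states side by side, the category holding the single largest transaction in each state can be pulled out with a grouped slice_max(). The rows below are made up to illustrate the contrast described above; they are not values from the dataset.

```r
library(dplyr)

# Toy rows with invented amounts for one high-count and one low-count state.
toy <- tibble(
  state    = c("TX", "TX", "TX", "AK", "AK", "AK"),
  Category = c("travel", "shopping_net", "gas_transport",
               "travel", "shopping_net", "gas_transport"),
  Amount   = c(900, 400, 60, 15, 700, 55)
)

# For each state, keep the row with the largest amount.
largest_by_state <- toy %>%
  group_by(state) %>%
  slice_max(Amount, n = 1) %>%
  ungroup() %>%
  select(state, Category, Amount)

largest_by_state
```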

Plotting fraud transactions by state and category

fraudTest %>% 
  ggplot(mapping = aes(Amount, Category, color = Category)) +
  geom_point() +
  facet_wrap(facets = vars(state)) +  # one panel per state
  labs(title = "Category wise fraud transactions in the United States")

The visualization above shows the fraud transaction amount by category in each state of the United States.
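With amounts ranging from 1 to 22,768 USD, most points in a faceted scatter plot crowd near zero. One option, offered here as a suggestion rather than as part of the original analysis, is a logarithmic amount axis with semi-transparent points. The sketch uses made-up rows for two states.

```r
library(ggplot2)

# Toy rows with invented amounts spanning several orders of magnitude.
toy <- data.frame(
  state    = rep(c("TX", "NY"), each = 4),
  Category = rep(c("travel", "home"), times = 4),
  Amount   = c(5, 40, 900, 60, 3, 75, 2200, 30)
)

p <- ggplot(toy, aes(Amount, Category, color = Category)) +
  geom_point(alpha = 0.5) +      # transparency reveals overlapping points
  scale_x_log10() +              # log scale spreads out the small amounts
  facet_wrap(vars(state)) +
  labs(title = "Category wise transactions by state (log scale)")

p
```

Since the minimum amount in the dataset is 1, every value is positive and the log scale drops no points.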

Conclusions and Reflections

A few observations can be made from this project. The average fraud transaction amount of many individual states is higher than the combined average across all states. TX, NY and PA have the most fraud transactions of all the states. In these three states, the travel, shopping_pos and shopping_net categories have the highest fraud transaction amounts, while gas_transport and grocery_pos have the lowest. The results differ for HI and AK, two of the states with the fewest fraud transactions: there, shopping_pos and shopping_net are the two categories with the highest fraud transaction amounts, while gas_transport and grocery_pos have the lowest. Across the overall dataset, that is, fraud transactions in all states, gas_transport, home and grocery_pos are the top three categories by number of fraud transactions, while travel, grocery_net, misc_net and misc_pos have the fewest. This project did not examine two variables of the dataset, date of birth (dob) and Merchant. Analyzing these two variables could indicate how fraud transactions vary by age and by merchant, which would help in analyzing customers' spending behavior according to their age.
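As a starting point for the age analysis mentioned above, the sketch below derives an approximate age from the date-of-birth and transaction-date strings. It uses the first three rows shown in the glimpse() output; the column name trans_date and the age bands are assumptions made for illustration, and dividing by 365.25 is an approximation that can be off by one near a birthday.

```r
# Date strings copied from the first three rows of the glimpse() output.
toy <- data.frame(
  dob        = c("3/19/1968", "1/17/1990", "10/21/1970"),
  trans_date = c("6/21/2020", "6/21/2020", "6/21/2020")
)

# Parse month/day/year strings into Date objects.
birth <- as.Date(toy$dob, format = "%m/%d/%Y")
trans <- as.Date(toy$trans_date, format = "%m/%d/%Y")

# Approximate age in whole years at the time of the transaction.
toy$age <- floor(as.numeric(trans - birth) / 365.25)

# Illustrative age bands; the cut points are arbitrary.
toy$age_group <- cut(toy$age,
                     breaks = c(0, 29, 44, 59, Inf),
                     labels = c("under 30", "30-44", "45-59", "60+"))

toy
```

Once an age_group column exists, it can be used with group_by() and summarise() in the same way as state and Category.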

Bibliography

RStudio Team (2022). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA, http://www.rstudio.com/.

Wickham, H. & Bryan, J. (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl

Wickham, H., François, R., Henry, L., & Müller, K. (n.d.). Programming with dplyr. dplyr. https://dplyr.tidyverse.org/articles/programming.html

Wickham, H. & Grolemund, G. (n.d.). R for data science [eBook edition]. O’Reilly. https://r4ds.had.co.nz/index.html

Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Source of the dataset: https://www.kaggle.com/datasets











