This project examines fraudulent credit card transactions in the United States.
Credit card fraud is an inclusive term for fraud committed using a payment card, such as a credit card or debit card. The purpose may be to obtain goods or services, or to make unauthorized payments. The Payment Card Industry Data Security Standard (PCI DSS) is the data security standard created to help financial institutions process card payments securely and reduce card fraud.
Credit card fraud can be authorized, where the genuine customer themselves processes a payment to another account controlled by an unauthorized entity, or unauthorized, where the account holder does not provide authorization for the payment to proceed and the transaction is carried out by a third party. In 2018, unauthorized financial fraud losses across payment cards and remote banking totaled £844.8 million in the United Kingdom, while banks and card companies prevented £1.66 billion in unauthorized fraud. That is equivalent to £2 in every £3 of attempted fraud being stopped.
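The "£2 in every £3" figure can be checked from the two totals quoted above. The short calculation below is only a sketch and assumes that attempted fraud equals the losses plus the amount prevented.

```r
# Figures quoted in the paragraph above (GBP, 2018, United Kingdom)
losses_m    <- 844.8   # unauthorized fraud losses, in millions
prevented_m <- 1660    # fraud prevented (1.66 billion), in millions

# Assumption: attempted fraud = losses + prevented
attempted_m     <- losses_m + prevented_m
share_prevented <- prevented_m / attempted_m

round(share_prevented, 3)   # about 0.663, i.e. roughly 2 in every 3
```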
Credit cards are more secure than ever, with regulators, card providers and banks taking considerable time and effort to collaborate with investigators worldwide to ensure fraudsters aren’t successful. Cardholders’ money is usually protected from scammers with regulations that make the card provider and bank accountable. The technology and security measures behind credit cards are becoming increasingly sophisticated making it harder for fraudsters to steal money.
Due to the rapid growth of cashless and digital transactions, credit cards are widely used around the world. Card providers issue thousands of cards to their customers and have to ensure that every cardholder is genuine; a mistake in issuing a card can lead to financial losses. As cashless transactions grow, the number of fraudulent transactions is also likely to increase. A fraudulent transaction can be identified by analyzing the behavior of credit card customers in their previous transaction history. If a transaction deviates from a customer's usual spending pattern, it is possibly fraudulent.
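The idea of spotting a deviation from a customer's spending pattern can be illustrated with a very simple rule. The sketch below is not the method used in this project: the history values, the function name `is_unusual`, and the threshold of three standard deviations are all assumptions chosen for illustration.

```r
# Hypothetical transaction history for one cardholder (amounts in USD)
history <- c(12, 30, 25, 18, 40, 22, 35, 28)

# Flag a new amount when it lies more than k standard deviations
# from the cardholder's historical mean (k = 3 is an assumed threshold)
is_unusual <- function(amount, history, k = 3) {
  z <- (amount - mean(history)) / sd(history)
  abs(z) > k
}

is_unusual(27, history)    # typical amount for this history
is_unusual(950, history)   # far outside the usual range
```

Real fraud detection systems combine many more signals (merchant, location, time of day), so a single-variable rule like this one should be read only as a starting point.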
Fraudulent transactions are one of the most serious threats to online security today. Artificial intelligence is vital for financial risk control in cloud environments. Many studies have explored methods for online transaction fraud detection; however, the existing methods are not sufficient to detect fraud with high precision.
Fraudulent transactions are orders and purchases made using a credit card or bank account that does not belong to the buyer. One of the largest factors in identity fraud, these types of transactions can end up doing damage to both merchants and the identity fraud victim.
The data is a simulated credit card transaction dataset covering 1 January 2019 to 31 December 2020. It contains transactions made with customers' credit cards at a pool of 800 merchants. The source of the dataset is Kaggle.com.
#Loading Libraries
# A tibble: 6 x 22
Sr_no Trans_date_Trans~ cc_num Merchant Category Amount first last
<dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
1 0 6/21/2020 2.29e15 fraud_K~ persona~ 3 Jeff Elli~
2 1 6/21/2020 3.57e15 fraud_S~ persona~ 30 Joan~ Will~
3 2 6/21/2020 3.6 e15 fraud_S~ health_~ 41 Ashl~ Lopez
4 3 6/21/2020 3.59e15 fraud_H~ misc_pos 60 Brian Will~
5 4 6/21/2020 3.53e15 fraud_J~ travel 3 Nath~ Mass~
6 5 6/21/2020 3.04e13 fraud_D~ kids_pe~ 20 Dani~ Evans
# ... with 14 more variables: gender <chr>, street <chr>, city <chr>,
# state <chr>, zip <dbl>, lat <dbl>, long <dbl>, city_pop <dbl>,
# job <chr>, dob <chr>, trans_num <chr>, unix_time <dbl>,
# merch_lat <dbl>, merch_long <dbl>
Checking the dimension of the dataset
#Checking the dimension of the dataset
dim(fraudTest)
[1] 555719 22
The dataset contains 555,719 rows and 22 columns.
Glimpse of the dataset
glimpse(fraudTest) #glimpse of the dataset
Rows: 555,719
Columns: 22
$ Sr_no <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ~
$ Trans_date_Trans_time <chr> "6/21/2020", "6/21/2020", "6/21/2020",~
$ cc_num <dbl> 2.29e+15, 3.57e+15, 3.60e+15, 3.59e+15~
$ Merchant <chr> "fraud_Kirlin and Sons", "fraud_Sporer~
$ Category <chr> "personal_care", "personal_care", "hea~
$ Amount <dbl> 3, 30, 41, 60, 3, 20, 134, 10, 4, 67, ~
$ first <chr> "Jeff", "Joanne", "Ashley", "Brian", "~
$ last <chr> "Elliott", "Williams", "Lopez", "Willi~
$ gender <chr> "M", "F", "F", "M", "M", "F", "F", "F"~
$ street <chr> "351 Darlene Green", "3638 Marsh Union~
$ city <chr> "Columbia", "Altonah", "Bellmore", "Ti~
$ state <chr> "SC", "UT", "NY", "FL", "MI", "NY", "C~
$ zip <dbl> 29209, 84002, 11710, 32780, 49632, 148~
$ lat <dbl> 33.9659, 40.3207, 40.6729, 28.5697, 44~
$ long <dbl> -80.9355, -110.4360, -73.5365, -80.819~
$ city_pop <dbl> 333497, 302, 34496, 54767, 1126, 520, ~
$ job <chr> "Mechanical engineer", "Sales professi~
$ dob <chr> "3/19/1968", "1/17/1990", "10/21/1970"~
$ trans_num <chr> "2da90c7d74bd46a0caf3777415b3ebd3", "3~
$ unix_time <dbl> 1371816865, 1371816873, 1371816893, 13~
$ merch_lat <dbl> 33.98639, 39.45050, 40.49581, 28.81240~
$ merch_long <dbl> -81.20071, -109.96043, -74.19611, -80.~
The output above gives a glimpse of the fraudTest dataset. It shows the type of each column and its first few values. For example, Merchant and Category are character columns.
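The glimpse also shows that Trans_date_Trans_time and dob are stored as character columns, so they need converting before any date arithmetic. The sketch below uses sample values copied from the output above and assumes every value follows the same month/day/year pattern.

```r
# Sample values copied from the glimpse() output above
trans_date <- c("6/21/2020", "6/21/2020")
dob        <- c("3/19/1968", "1/17/1990")

# Assumption: every value follows the month/day/year pattern seen in the sample
trans_date <- as.Date(trans_date, format = "%m/%d/%Y")
dob        <- as.Date(dob, format = "%m/%d/%Y")

class(trans_date)   # "Date"

# Approximate age in whole years at the time of the transaction
age <- floor(as.numeric(trans_date - dob) / 365.25)
age
```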
colnames(fraudTest) #column names of the dataset
[1] "Sr_no" "Trans_date_Trans_time"
[3] "cc_num" "Merchant"
[5] "Category" "Amount"
[7] "first" "last"
[9] "gender" "street"
[11] "city" "state"
[13] "zip" "lat"
[15] "long" "city_pop"
[17] "job" "dob"
[19] "trans_num" "unix_time"
[21] "merch_lat" "merch_long"
The dataset contains 22 columns, including:
Category : Item type of the fraud transaction
Amount : Amount of the transaction
Gender : Gender of the credit card holder
City : City of the credit card holder
State : State of the credit card holder
Trans_date_Trans_time : Date and time at which the fraud transaction occurred
cc_num : Credit card number used for the fraud transaction
Merchant : Site or physical location where the credit card was used for the fraud purchase
Street : Street address of the customer
Zip : Zip code of the customer's billing address
Job : Job of the credit card holder
Dob : Date of birth of the credit card holder
library(dplyr)
rename(fraudTest, first_name = first, last_name = last, Date_of_birth = dob, Trans_date = Trans_date_Trans_time) #Renaming columns; the result is printed but not assigned, so fraudTest keeps its original names
# A tibble: 555,719 x 22
Sr_no Trans_date cc_num Merchant Category Amount first_name
<dbl> <chr> <dbl> <chr> <chr> <dbl> <chr>
1 0 6/21/2020 2.29e15 fraud_Kirlin a~ persona~ 3 Jeff
2 1 6/21/2020 3.57e15 fraud_Sporer-K~ persona~ 30 Joanne
3 2 6/21/2020 3.6 e15 fraud_Swaniaws~ health_~ 41 Ashley
4 3 6/21/2020 3.59e15 fraud_Haley Gr~ misc_pos 60 Brian
5 4 6/21/2020 3.53e15 fraud_Johnston~ travel 3 Nathan
6 5 6/21/2020 3.04e13 fraud_Daughert~ kids_pe~ 20 Danielle
7 6 6/21/2020 2.13e14 fraud_Romaguer~ health_~ 134 Kayla
8 7 6/21/2020 3.59e15 fraud_Reichel ~ persona~ 10 Paula
9 8 6/21/2020 3.6 e15 fraud_Goyette,~ shoppin~ 4 David
10 9 6/21/2020 3.55e15 fraud_Kilback ~ food_di~ 67 Kayla
# ... with 555,709 more rows, and 15 more variables: last_name <chr>,
# gender <chr>, street <chr>, city <chr>, state <chr>, zip <dbl>,
# lat <dbl>, long <dbl>, city_pop <dbl>, job <chr>,
# Date_of_birth <chr>, trans_num <chr>, unix_time <dbl>,
# merch_lat <dbl>, merch_long <dbl>
Categorical variables take one of a limited set of distinct values (levels) and do not have to be numeric. In the fraudTest dataset, state, Category, and gender are categorical variables. The number of transactions at each level of these variables can be found using table.
table(fraudTest$state) # State wise count
AK AL AR AZ CA CO CT DC FL GA HI
843 17532 13484 4592 24135 5886 3277 1517 18104 11277 1090
IA ID IL IN KS KY LA MA MD ME MI
11819 2490 18960 11959 9943 12506 8988 5186 11152 6928 19671
MN MO MS MT NC ND NE NH NJ NM NV
13719 16501 8833 5052 12868 6397 10257 3449 10528 7020 2451
NY OH OK OR PA RI SC SD TN TX UT
35918 20147 11379 7811 34326 195 12541 5250 7359 40393 4658
VA VT WA WI WV WY
12506 5044 8116 12370 10838 8454
t<-table(fraudTest$Category) #Category wise count
t
entertainment food_dining gas_transport grocery_net
40104 39268 56370 19426
grocery_pos health_fitness home kids_pets
52553 36674 52345 48692
misc_net misc_pos personal_care shopping_net
27367 34574 39327 41779
shopping_pos travel
49791 17449
#Gender wise count
x<-table(fraudTest$gender)#Gender wise count
x
F M
304886 250833
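Counts are often easier to compare as shares of the total. As a small sketch, the gender counts printed above can be converted to percentages with `prop.table`; the vector below simply re-enters those two counts by hand.

```r
# Gender counts copied from the table() output above
gender_counts <- c(F = 304886, M = 250833)

# Convert counts to percentage shares of all transactions
gender_share <- round(100 * prop.table(gender_counts), 1)
gender_share
```

On the full data, the same result should come from `prop.table(table(fraudTest$gender))`.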
# A tibble: 14 x 2
Category count
<chr> <int>
1 entertainment 40104
2 food_dining 39268
3 gas_transport 56370
4 grocery_net 19426
5 grocery_pos 52553
6 health_fitness 36674
7 home 52345
8 kids_pets 48692
9 misc_net 27367
10 misc_pos 34574
11 personal_care 39327
12 shopping_net 41779
13 shopping_pos 49791
14 travel 17449
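The code that produced the category count table above is not shown. One way such a table could be built with dplyr is sketched below; `fraudTest_demo` and `cat_fraud_demo` are hypothetical names, and the six rows are made-up stand-ins for the real data.

```r
library(dplyr)

# Tiny stand-in for fraudTest (hypothetical rows, same column name)
fraudTest_demo <- tibble(
  Category = c("travel", "home", "home", "travel", "home", "kids_pets")
)

# One row per category with the number of transactions in it
cat_fraud_demo <- fraudTest_demo %>%
  group_by(Category) %>%
  summarise(count = n())

cat_fraud_demo
```

Applied to the full fraudTest data, the same pipeline should give a 14-row tibble with the columns Category and count, matching the shape of the table above.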
#Plotting the category wise fraud transaction counts
ggplot(cat_fraud, aes(x = Category, y = count, fill = Category)) +
  geom_bar(stat = "identity") +
  labs(title = "Fraud transactions by category") +
  geom_text(aes(label = count), vjust = -0.3) +
  coord_flip()
The chart above shows the number of fraud transactions by category. The gas_transport category accounts for the most fraud transactions across the United States (56,370). The grocery_pos and home categories have almost the same number of fraud transactions (52,553 and 52,345). As most customers use their credit cards for shopping purchases, shopping_pos and shopping_net also rank high, with about 90k fraud transactions combined. It is interesting to see the large gap between grocery_pos and grocery_net: grocery_net has fewer than 20k fraud transactions. Outdoor payment categories such as travel, food_dining, and entertainment together account for nearly 100k fraud transactions.
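The rankings and combined totals mentioned here can be checked directly from the category counts. The sketch below re-enters the counts printed earlier as a named vector; the grouping of travel, food_dining, and entertainment as "outdoor" follows the text.

```r
# Category counts copied from the table() output above
cat_counts <- c(
  entertainment = 40104, food_dining = 39268, gas_transport = 56370,
  grocery_net = 19426, grocery_pos = 52553, health_fitness = 36674,
  home = 52345, kids_pets = 48692, misc_net = 27367, misc_pos = 34574,
  personal_care = 39327, shopping_net = 41779, shopping_pos = 49791,
  travel = 17449
)

# Three largest categories
head(sort(cat_counts, decreasing = TRUE), 3)

# Combined totals quoted in the text
shopping_total <- sum(cat_counts[c("shopping_net", "shopping_pos")])
outdoor_total  <- sum(cat_counts[c("travel", "food_dining", "entertainment")])
c(shopping = shopping_total, outdoor = outdoor_total)
```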
Descriptive statistics are brief coefficients that summarize a data set, which can represent either an entire population or a sample of it. They are divided into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean, median, and mode, while measures of variability include the standard deviation and variance. In this dataset I focus on the Amount variable, because the whole dataset concerns fraud transactions on purchases, which are naturally measured by their amount. It is also important to look at the range, minimum, and maximum of the fraud transaction amounts to build better visualizations.
median(fraudTest$Amount, na.rm= TRUE) #Median of the fraud transaction amount
[1] 47
sd(fraudTest$Amount, na.rm= TRUE) #standard deviation of the fraud transaction amount
[1] 156.7456
mean(fraudTest$Amount , na.rm= TRUE) #mean of the fraud transaction amount
[1] 69.39616
range(fraudTest$Amount) #range of the fraud transaction amount
[1] 1 22768
min(fraudTest$Amount) #Minimum amount of the fraud transaction
[1] 1
max(fraudTest$Amount) #Maximum amount of the fraud transaction
[1] 22768
The code above computes descriptive statistics for the Amount variable. The median fraud transaction amount across all states is 47 USD and the mean is 69.40 USD. The range function returns the lowest and highest values together: across the 555,719 transactions, amounts run from 1 to 22,768 USD, which matches the separate min and max results. The standard deviation, computed with the sd function, quantifies the variation or dispersion of a set of values; for the fraud amounts it is 156.75 USD.
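A mean well above the median, together with a maximum far from the bulk of the values, usually points to a long right tail. The sketch below shows how `summary` and `IQR` expose this on a small vector; `amount_demo` is hypothetical data, not taken from fraudTest.

```r
# Hypothetical amounts (USD) with one very large value, mimicking a long right tail
amount_demo <- c(3, 10, 20, 30, 41, 47, 60, 67, 134, 5000)

summary(amount_demo)                        # min, quartiles, median, mean, max
IQR(amount_demo)                            # spread of the middle 50%
mean(amount_demo) > median(amount_demo)     # TRUE when large values pull the mean up
```

On the real data, `summary(fraudTest$Amount)` would return the minimum, quartiles, median, mean, and maximum in a single call.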
desc_stats<- group_by(fraudTest ,state) %>%
summarise(Average=mean(Amount) , lower=min(Amount) , Upper = max(Amount)) #Fraud transactions value per state wise
head(desc_stats)
# A tibble: 6 x 4
state Average lower Upper
<chr> <dbl> <dbl> <dbl>
1 AK 78.4 1 1617
2 AL 64.3 1 5030
3 AR 76.2 1 8181
4 AZ 75.8 1 7321
5 CA 73.3 1 16837
6 CO 76.0 1 5187
The table above shows the average, minimum, and maximum fraud transaction amount for each state. Per-state statistics show which states have a higher average, minimum, or maximum fraud amount than the country as a whole.
The table shows the first six states in alphabetical order: Alaska, Alabama, Arkansas, Arizona, California, and Colorado. Five of the six have an average above the overall average of about 69.40 USD; Alabama, at 64.3 USD, is below it. Alaska has the highest average of the six at 78.4 USD. All six states have 1 USD as the minimum fraud transaction amount. California has the highest maximum, 16,837 USD, while the maximum in the other five states (AK, AL, AR, AZ, CO) is below 8,200 USD.
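The comparison between each state's average and the overall average can be made explicit rather than read off the table. The sketch below re-enters the six averages shown above together with the overall mean of Amount reported earlier.

```r
# Per-state averages copied from the head(desc_stats) output above
state_avg <- c(AK = 78.4, AL = 64.3, AR = 76.2, AZ = 75.8, CA = 73.3, CO = 76.0)

# Overall mean of Amount reported earlier
overall_mean <- 69.39616

# Which of the six states sit above the overall mean?
state_avg > overall_mean

# State with the highest average among the six
names(which.max(state_avg))
```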
a <- fraudTest %>%
group_by(gender) %>%
summarise(count = n())#Fraud transactions per Gender wise.
# count()
a
# A tibble: 2 x 2
gender count
<chr> <int>
1 F 304886
2 M 250833
pie(x, main = "Fraud transactions by Gender")
The visualization above shows fraud transactions by gender. Of the 555,719 fraud transactions in the dataset, 304,886 involve female credit card holders and 250,833 involve male credit card holders, so female cardholders are affected more often.
# A tibble: 50 x 2
# Groups: state [50]
state n
<chr> <int>
1 AK 843
2 AL 17532
3 AR 13484
4 AZ 4592
5 CA 24135
6 CO 5886
7 CT 3277
8 DC 1517
9 FL 18104
10 GA 11277
# ... with 40 more rows
fraudTest %>%
select('state') %>%
group_by(state) %>%
summarise(count = n()) %>%
ggplot(aes(x = state, y = count, fill = count)) +labs(title="Fraud Transactions by state wise")+
geom_col() + geom_text(aes(label = count), vjust = -0.3)+ theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none")+ coord_flip()
The visualization above shows the fraud transaction count per state. Among the 50 state codes in the data (49 states plus DC), Texas has the most fraud transactions with 40,393, followed by New York with 35,918 and Pennsylvania with 34,326. These three states account for the most fraud transactions. Most states have fewer than 20,000 fraud transactions. AK and HI are among the states with the fewest, with 843 and 1,090 respectively, although Rhode Island has the lowest count at 195. I will focus on these five states (TX, NY, PA, AK, and HI) and break down their fraud transactions by city and by category.
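Reading the largest and smallest bars off a 50-bar chart is error-prone, so it can help to sort the counts directly. The sketch below uses a subset of the state counts printed earlier, entered by hand for brevity.

```r
# State counts copied from the table() output above (subset for brevity)
state_counts <- c(
  TX = 40393, NY = 35918, PA = 34326, CA = 24135, OH = 20147,
  DC = 1517, HI = 1090, AK = 843, RI = 195
)

# Largest and smallest counts within this subset
head(sort(state_counts, decreasing = TRUE), 3)
head(sort(state_counts), 3)
```

On the full data, `sort(table(fraudTest$state), decreasing = TRUE)` should give the complete ranking.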
#Top 3 states in fraud transactions
selected_states <- c('TX', 'NY', 'PA')
#Loop variable named st so it does not overwrite the gender table stored in x
for (st in selected_states){
  p <- fraudTest %>%
    filter(state == st) %>%
    ggplot(aes(x = city, fill = city)) +
    geom_bar() + #geom_bar() counts rows per city, same as geom_histogram(stat = "count")
    theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none") +
    labs(title = paste("Fraud Transactions by City wise:", st)) +
    coord_flip()
  print(p)
}
The visualizations above show the number of fraud transactions per city in the three states (TX, NY, PA). In Texas, San Antonio and Holliday record the most fraud transactions, and O Brien and Desdemona the fewest. Apart from the top two cities, most Texas cities have fewer than 1,500 fraud transactions. In New York, Tupper Lake and Chatham have the most fraud transactions, most cities have almost the same number, and Springfield and Kirkwood have the fewest. In Pennsylvania, Clarks Mills and Philadelphia have the most fraud transactions, and Dublin and Harmony the fewest.
fraudTest %>%
select(state, Category, Amount , Merchant) %>%
filter(state =="TX") %>%
arrange(Amount) %>%
ggplot(aes(Amount , Category, fill= Category)) +geom_jitter(aes(color=Category))+coord_flip()+labs(title="Category wise transactions in Texas")
fraudTest %>%
select(state, Category, Amount , Merchant) %>%
filter(state =="NY") %>%
arrange(Amount) %>%
ggplot(aes(Amount , Category)) +geom_boxplot(aes(color=Category))+labs(title="Category wise transactions in NewYork")
fraudTest %>%
select(state, Category, Amount , Merchant) %>%
filter(state =="PA") %>%
arrange(Amount) %>%
ggplot(aes(Amount , Category)) +geom_line(color="red")+labs(title="Category wise transactions in Pennsylvania") #color set outside aes() so the lines are actually drawn in red
The three visualizations above relate fraud transaction amounts to category in the top three fraud transaction states. Interestingly, in all three states the travel category has the highest amounts, followed by shopping_pos and shopping_net. The lowest amounts are in gas_transport and grocery_pos, which is also similar across the three states.
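Statements about which category has the highest or lowest amounts can also be checked numerically instead of being read from the plots. The sketch below uses base R `aggregate` on a small made-up data frame; `tx_demo` and its values are hypothetical stand-ins for one state's rows of fraudTest.

```r
# Hypothetical rows standing in for one state's transactions
tx_demo <- data.frame(
  Category = c("travel", "travel", "gas_transport", "gas_transport", "shopping_pos"),
  Amount   = c(1200, 35, 60, 75, 410)
)

# Largest and smallest amount per category
max_by_cat <- aggregate(Amount ~ Category, data = tx_demo, FUN = max)
min_by_cat <- aggregate(Amount ~ Category, data = tx_demo, FUN = min)

# Categories ordered by their largest transaction
max_by_cat[order(-max_by_cat$Amount), ]
```

Replacing `tx_demo` with a filtered subset such as `fraudTest[fraudTest$state == "TX", ]` would apply the same check to the real data.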
selected_states <- c('AK', 'HI')
for (st in selected_states){
  p <- fraudTest %>%
    filter(state == st) %>%
    ggplot(aes(x = city, fill = city)) +
    geom_bar() + #geom_bar() counts rows per city, same as geom_histogram(stat = "count")
    theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none") +
    labs(title = paste("Least Fraud Transactions states:", st))
  print(p)
}
AK and HI are two of the states with the fewest fraud transactions. The visualizations above show the fraud transactions in these two states by city.
In Alaska, Wales has the most fraud transactions, followed by Huslia and then Craig. In Hawaii, Paauilo has more fraud transactions than the other city, Honokaa.
fraudTest %>%
select(state, Category, Amount , Merchant) %>%
filter(state =="AK") %>%
arrange(Amount) %>%
ggplot(aes(Amount , Category, fill= Category)) +geom_jitter(aes(color=Category))+labs(title="Category wise transactions in Alaska")+coord_flip()
fraudTest %>%
select(state, Category, Amount , Merchant) %>%
filter(state =="HI") %>%
arrange(Amount) %>%
ggplot(aes(Amount , Category, fill= Category)) +geom_boxplot(aes(color=Category))+labs(title="Category wise transactions in Hawaii")
The plots above show fraud transaction amounts by category in AK and HI. The results are quite different from those of the top three states (TX, NY, and PA). In both states, shopping_net and shopping_pos are the categories with the highest fraud transaction amounts. In both states the travel category has the lowest amounts, which is the opposite of what was seen in the top three states.
fraudTest %>%
ggplot(mapping=aes(Amount,Category,color=Category))+geom_point() + facet_wrap(facets = vars(`state`))+labs(title="Category wise Fraud transactions in The United States")
The visualization above shows the fraud transaction amounts by category in each state of the United States.
A few observations can be drawn from this project. In many states the average fraud transaction amount is higher than the combined average across all states. TX, NY, and PA have the most fraud transactions of all the states in the data. In these three states, travel, shopping_pos, and shopping_net have the highest fraud transaction amounts, while gas_transport and grocery_pos have the lowest.
These results do not carry over to HI and AK, two of the states with the fewest fraud transactions. There, shopping_pos and shopping_net are the two categories with the highest fraud transaction amounts, and categories like gas_transport and grocery_pos have the lowest amounts.
Across the overall dataset, that is, fraud transactions in all states, gas_transport, home, and grocery_pos are the top three categories by number of fraud transactions, and travel, grocery_net, misc_net, and misc_pos have the lowest counts.
This project did not analyze two variables of the dataset, DOB (date of birth) and Merchant. Analyzing them could show how fraud transactions vary by age and by merchant, which would help in understanding the spending behavior of customers according to their age.
RStudio Team (2022). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA, http://www.rstudio.com/.
Wickham, H. & Bryan, J. (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl
Wickham, H., François, R., Henry, L., & Müller, K. (n.d.). Programming with dplyr. dplyr. https://dplyr.tidyverse.org/articles/programming.html
Wickham, H. & Grolemund, G. (n.d.). R for data science [eBook edition]. O’Reilly. https://r4ds.had.co.nz/index.html
Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Source of the dataset: https://www.kaggle.com/datasets
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
GOGULA (2022, May 19). Data Analytics and Computational Social Science: DACSS_601_FINAL_PROJECT. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommanikanta896626/
BibTeX citation
@misc{gogula2022dacss_601_final_project, author = {GOGULA, MANI KANTA}, title = {Data Analytics and Computational Social Science: DACSS_601_FINAL_PROJECT}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommanikanta896626/}, year = {2022} }