HW_5

HW_5 601

Mani kanta Gogula
3/22/2022

About the Dataset: The dataset I'm using for my final project is a simulated credit card transaction dataset containing fraudulent transactions covering 1 Jan 2019 through 31 Dec 2020. It records credit card transactions of customers across a pool of 800 merchants. The source of the dataset is Kaggle.

Loading the dataset and viewing the first rows.

library(readr)
library(dplyr)
library(tidyverse)
fraudTest <- read_csv("fraudTest.csv")
head(fraudTest)
# A tibble: 6 x 22
  Sr_no Trans_date_Trans~  cc_num Merchant Category Amount first last 
  <dbl> <chr>               <dbl> <chr>    <chr>     <dbl> <chr> <chr>
1     0 6/21/2020         2.29e15 fraud_K~ persona~      3 Jeff  Elli~
2     1 6/21/2020         3.57e15 fraud_S~ persona~     30 Joan~ Will~
3     2 6/21/2020         3.6 e15 fraud_S~ health_~     41 Ashl~ Lopez
4     3 6/21/2020         3.59e15 fraud_H~ misc_pos     60 Brian Will~
5     4 6/21/2020         3.53e15 fraud_J~ travel        3 Nath~ Mass~
6     5 6/21/2020         3.04e13 fraud_D~ kids_pe~     20 Dani~ Evans
# ... with 14 more variables: gender <chr>, street <chr>, city <chr>,
#   state <chr>, zip <dbl>, lat <dbl>, long <dbl>, city_pop <dbl>,
#   job <chr>, dob <chr>, trans_num <chr>, unix_time <dbl>,
#   merch_lat <dbl>, merch_long <dbl>
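Note that `Trans_date_Trans_time` is read in as character (`<chr>`). If any date-based analysis is needed later, it can be parsed into a proper `Date` column first. A minimal sketch on a toy frame, assuming the `m/d/Y` format shown in the `head()` output above:

```r
library(dplyr)

# Toy frame mimicking the character date column seen in head(fraudTest);
# the column name and "%m/%d/%Y" format are taken from the output above
toy <- data.frame(Trans_date_Trans_time = c("6/21/2020", "12/31/2020"))

toy <- toy %>%
  mutate(trans_date = as.Date(Trans_date_Trans_time, format = "%m/%d/%Y"))

# trans_date is now a Date column, usable for filtering and grouping
```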

Checking the column names with the colnames() function to pick out the important columns.

colnames(fraudTest)
 [1] "Sr_no"                 "Trans_date_Trans_time"
 [3] "cc_num"                "Merchant"             
 [5] "Category"              "Amount"               
 [7] "first"                 "last"                 
 [9] "gender"                "street"               
[11] "city"                  "state"                
[13] "zip"                   "lat"                  
[15] "long"                  "city_pop"             
[17] "job"                   "dob"                  
[19] "trans_num"             "unix_time"            
[21] "merch_lat"             "merch_long"           

Using the count() and arrange() functions to see which states have the most fraud transactions.

fraudTest %>% 
  count(state) %>% 
  arrange(desc(n))
# A tibble: 50 x 2
   state     n
   <chr> <int>
 1 TX    40393
 2 NY    35918
 3 PA    34326
 4 CA    24135
 5 OH    20147
 6 MI    19671
 7 IL    18960
 8 FL    18104
 9 AL    17532
10 MO    16501
# ... with 40 more rows
States such as TX, NY, PA, CA, and OH each have more than 20,000 fraud transactions.

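Rather than reading the threshold off the printed tibble, the states above 20,000 can be pulled out directly with the same count/filter pattern. A sketch using illustrative counts copied from the table above (not the full dataset):

```r
library(dplyr)

# Illustrative per-state fraud counts taken from the printed tibble
state_counts <- data.frame(
  state = c("TX", "NY", "PA", "CA", "OH", "MO"),
  n     = c(40393, 35918, 34326, 24135, 20147, 16501)
)

# Keep only states with more than 20,000 fraud transactions
high_states <- state_counts %>%
  filter(n > 20000) %>%
  arrange(desc(n))
```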
fraudTest %>% 
  group_by(state) %>% 
  summarise(Average = mean(Amount), Upper = max(Amount), lower = min(Amount))
# A tibble: 50 x 4
   state Average Upper lower
   <chr>   <dbl> <dbl> <dbl>
 1 AK       78.4  1617     1
 2 AL       64.3  5030     1
 3 AR       76.2  8181     1
 4 AZ       75.8  7321     1
 5 CA       73.3 16837     1
 6 CO       76.0  5187     1
 7 CT       62.6  4120     1
 8 DC       71.7  1121     1
 9 FL       71.4 21438     1
10 GA       69.2  7886     1
# ... with 40 more rows
fraud_MA <- filter(fraudTest, state == "MA")
fraud_MA %>% 
  count()
# A tibble: 1 x 1
      n
  <int>
1  5186

MA has 5,186 fraud transactions. We will dig into these 5,186 transactions to see in which categories most of the fraud occurred.

Checking the victims of the fraud transactions by gender and visualizing the result.

library(tidyverse)
library(ggplot2)
fraud_MA %>% 
  ggplot(aes(gender)) +
  geom_bar()

In MA, males are the main victims of fraud, with more than 3,000 transactions.
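The bar chart gives a visual read; the exact counts per gender can be confirmed with count(). A sketch on a toy frame standing in for fraud_MA (the real numbers come from the dataset):

```r
library(dplyr)

# Toy stand-in for fraud_MA$gender, just to show the pattern
toy <- data.frame(gender = c("M", "M", "M", "F", "F"))

# sort = TRUE puts the most frequent gender first
gender_counts <- toy %>%
  count(gender, sort = TRUE)
```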

fraud_MA %>% 
  group_by(Category) %>% 
  count() %>% 
  arrange(desc(n)) 
# A tibble: 14 x 2
# Groups:   Category [14]
   Category           n
   <chr>          <int>
 1 gas_transport    623
 2 grocery_pos      529
 3 home             513
 4 shopping_pos     427
 5 kids_pets        415
 6 shopping_net     376
 7 entertainment    373
 8 food_dining      345
 9 personal_care    333
10 health_fitness   323
11 misc_pos         298
12 misc_net         264
13 grocery_net      198
14 travel           169
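The table shows gas_transport leading the MA fraud categories. The top categories can also be extracted programmatically with slice_max() instead of scanning the table. A sketch on an illustrative subset of the counts above:

```r
library(dplyr)

# Illustrative category counts taken from the table above (subset)
cat_counts <- data.frame(
  Category = c("gas_transport", "grocery_pos", "home", "travel"),
  n        = c(623, 529, 513, 169)
)

# Top 3 categories by number of fraud transactions
top3 <- cat_counts %>%
  slice_max(n, n = 3)
```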
Finally, plotting Amount by Category across states on a log scale:

fraudTest %>% 
  ggplot(aes(Category, Amount, color = state)) +
  geom_col() +
  scale_y_log10()
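One caveat: with geom_col(), mapping color= only outlines the stacked bars, so mapping fill= (or summarising before plotting) is usually easier to read. A hedged alternative sketch on a toy frame, summarising mean Amount per Category first:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for fraudTest with just Category and Amount
toy <- data.frame(
  Category = c("travel", "travel", "home", "home"),
  Amount   = c(3, 5, 60, 40)
)

# Summarise first, then draw one bar per category (fill instead of color)
cat_summary <- toy %>%
  group_by(Category) %>%
  summarise(Average = mean(Amount), .groups = "drop")

p <- ggplot(cat_summary, aes(Category, Average, fill = Category)) +
  geom_col() +
  scale_y_log10()
```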




 
Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Gogula (2022, April 27). Data Analytics and Computational Social Science: HW_5. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommanikanta891378/

BibTeX citation

@misc{gogula2022hw_5,
  author = {Gogula, Mani kanta},
  title = {Data Analytics and Computational Social Science: HW_5},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommanikanta891378/},
  year = {2022}
}