Data Analytics and Computational Social Science: HW_5

Mani kanta Gogula

Loading the dataset and viewing its first rows.

library(readr)
library(dplyr)
library(tidyverse)
library(ggplot2)
fraudTest <- read_csv("fraudTest.csv")
head(fraudTest)

# A tibble: 6 x 22
  Sr_no Trans_date_Trans~  cc_num Merchant Category Amount first last 
  <dbl> <chr>               <dbl> <chr>    <chr>     <dbl> <chr> <chr>
1     0 6/21/2020         2.29e15 fraud_K~ persona~      3 Jeff  Elli~
2     1 6/21/2020         3.57e15 fraud_S~ persona~     30 Joan~ Will~
3     2 6/21/2020         3.6 e15 fraud_S~ health_~     41 Ashl~ Lopez
4     3 6/21/2020         3.59e15 fraud_H~ misc_pos     60 Brian Will~
5     4 6/21/2020         3.53e15 fraud_J~ travel        3 Nath~ Mass~
6     5 6/21/2020         3.04e13 fraud_D~ kids_pe~     20 Dani~ Evans
# ... with 14 more variables: gender <chr>, street <chr>, city <chr>,
#   state <chr>, zip <dbl>, lat <dbl>, long <dbl>, city_pop <dbl>,
#   job <chr>, dob <chr>, trans_num <chr>, unix_time <dbl>,
#   merch_lat <dbl>, merch_long <dbl>

Checking the column names with the colnames() function to pick the important columns.

colnames(fraudTest)

 [1] "Sr_no"                 "Trans_date_Trans_time"
 [3] "cc_num"                "Merchant"             
 [5] "Category"              "Amount"               
 [7] "first"                 "last"                 
 [9] "gender"                "street"               
[11] "city"                  "state"                
[13] "zip"                   "lat"                  
[15] "long"                  "city_pop"             
[17] "job"                   "dob"                  
[19] "trans_num"             "unix_time"            
[21] "merch_lat"             "merch_long"

Using the count() and arrange() functions to see which state has the most fraud transactions.

count(fraudTest , state) %>% 
  arrange(desc(n))

# A tibble: 50 x 2
   state     n
   <chr> <int>
 1 TX    40393
 2 NY    35918
 3 PA    34326
 4 CA    24135
 5 OH    20147
 6 MI    19671
 7 IL    18960
 8 FL    18104
 9 AL    17532
10 MO    16501
# ... with 40 more rows

Five states (TX, NY, PA, CA and OH) each have more than 20,000 fraud transactions. Three states (HI, AK and RI) have the fewest fraud transactions among all states in the dataset.
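The output above only shows the top of the table, so the states with the fewest transactions can be checked by sorting in ascending order instead. The following is a minimal sketch, assuming a small made-up data frame `toy` stands in for fraudTest (its values are illustrative and not taken from the dataset):

```r
library(dplyr)

# Hypothetical stand-in for fraudTest; only the `state` column matters here.
toy <- data.frame(
  state = c("TX", "TX", "TX", "NY", "NY", "HI"),
  stringsAsFactors = FALSE
)

# Count rows per state, then sort ascending so the smallest counts come first.
fewest <- toy %>%
  count(state) %>%
  arrange(n) %>%
  head(3)

print(fewest)
```

On the real data, replacing `toy` with `fraudTest` should list the three states with the fewest rows.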


Finding the average, maximum and minimum fraud transaction amount for each state.

fraudTest %>% 
  group_by(state)%>% 
  summarise(Average=mean(Amount) , Upper=max(Amount) , lower=min(Amount))

# A tibble: 50 x 4
   state Average Upper lower
   <chr>   <dbl> <dbl> <dbl>
 1 AK       78.4  1617     1
 2 AL       64.3  5030     1
 3 AR       76.2  8181     1
 4 AZ       75.8  7321     1
 5 CA       73.3 16837     1
 6 CO       76.0  5187     1
 7 CT       62.6  4120     1
 8 DC       71.7  1121     1
 9 FL       71.4 21438     1
10 GA       69.2  7886     1
# ... with 40 more rows

Visualizing the number of fraud transactions per state.

fraudTest %>%
  select('state') %>%
  group_by(state) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = state, y = count, fill = state)) +
  geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none")

Focusing on the top 3 states by fraud transactions and breaking each of them down by city.

selected_states <- c('TX', 'NY', 'PA')

for (x in selected_states){
  p <- fraudTest %>%
    filter(state == x) %>%
    ggplot(aes(x = city, fill = city)) +
    geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none")
  print(p)
}

Breaking down the MA transactions.

fraud_MA<- filter(fraudTest, state== "MA") 
fraud_MA %>% 
  count()

# A tibble: 1 x 1
      n
  <int>
1  5186

MA has 5186 fraud transactions. We will dig deeper into these 5186 transactions to see who the victims are and in which categories most of the fraud happened.

Checking the victims of the fraud transactions by gender and visualizing the output

fraud_MA %>% 
  ggplot(aes(gender)) + geom_bar()

In MA, males are the main victims, accounting for more than 3,000 of the fraud transactions.
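The bar chart gives only an approximate reading, so the exact counts and shares per gender can be tabulated as well. This is a minimal sketch, assuming a made-up data frame `toy_MA` in place of fraud_MA (the values are illustrative only):

```r
library(dplyr)

# Hypothetical stand-in for fraud_MA; values are illustrative only.
toy_MA <- data.frame(
  gender = c("M", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# Count transactions per gender and add each group's share of the total.
gender_share <- toy_MA %>%
  count(gender) %>%
  mutate(share = n / sum(n))

print(gender_share)
```

Running the same pipeline on `fraud_MA` would give the exact number behind the "more than 3,000" reading of the plot.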

fraud_MA %>% 
  group_by(Category) %>% 
  count() %>% 
  arrange(desc(n))

# A tibble: 14 x 2
# Groups:   Category [14]
   Category           n
   <chr>          <int>
 1 gas_transport    623
 2 grocery_pos      529
 3 home             513
 4 shopping_pos     427
 5 kids_pets        415
 6 shopping_net     376
 7 entertainment    373
 8 food_dining      345
 9 personal_care    333
10 health_fitness   323
11 misc_pos         298
12 misc_net         264
13 grocery_net      198
14 travel           169

gas_transport and grocery_pos are the top two categories for fraud transactions in MA.
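The top categories can also be extracted directly rather than read off the sorted table. The following is a minimal sketch using dplyr's slice_max(), assuming a made-up data frame `toy_cat` in place of fraud_MA (the counts are illustrative only):

```r
library(dplyr)

# Hypothetical stand-in for fraud_MA; only the `Category` column matters here.
toy_cat <- data.frame(
  Category = c("gas_transport", "gas_transport", "gas_transport",
               "grocery_pos", "grocery_pos", "travel"),
  stringsAsFactors = FALSE
)

# Count rows per category, then keep the two categories with the largest counts.
top_two <- toy_cat %>%
  count(Category) %>%
  slice_max(order_by = n, n = 2)

print(top_two)
```

Note that slice_max() keeps ties by default, so it can return more than two rows when counts are equal at the cutoff.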

Plotting the distribution of fraud transaction amounts by category, with one panel per state.

fraudTest %>% 
  ggplot(aes(Amount, Category, color = Category)) +
  geom_boxplot() +
  facet_wrap(vars(state))
