HW_5 601 (Focusing on top 3 states)
Loading the dataset and viewing it.
library(readr)
library(dplyr)
library(tidyverse)
library(ggplot2)
fraudTest <- read_csv("fraudTest.csv")
head(fraudTest)
# A tibble: 6 x 22
Sr_no Trans_date_Trans~ cc_num Merchant Category Amount first last
<dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
1 0 6/21/2020 2.29e15 fraud_K~ persona~ 3 Jeff Elli~
2 1 6/21/2020 3.57e15 fraud_S~ persona~ 30 Joan~ Will~
3 2 6/21/2020 3.6 e15 fraud_S~ health_~ 41 Ashl~ Lopez
4 3 6/21/2020 3.59e15 fraud_H~ misc_pos 60 Brian Will~
5 4 6/21/2020 3.53e15 fraud_J~ travel 3 Nath~ Mass~
6 5 6/21/2020 3.04e13 fraud_D~ kids_pe~ 20 Dani~ Evans
# ... with 14 more variables: gender <chr>, street <chr>, city <chr>,
# state <chr>, zip <dbl>, lat <dbl>, long <dbl>, city_pop <dbl>,
# job <chr>, dob <chr>, trans_num <chr>, unix_time <dbl>,
# merch_lat <dbl>, merch_long <dbl>
Checking the column names with the colnames() function to pick the important columns.
colnames(fraudTest)
[1] "Sr_no" "Trans_date_Trans_time"
[3] "cc_num" "Merchant"
[5] "Category" "Amount"
[7] "first" "last"
[9] "gender" "street"
[11] "city" "state"
[13] "zip" "lat"
[15] "long" "city_pop"
[17] "job" "dob"
[19] "trans_num" "unix_time"
[21] "merch_lat" "merch_long"
Using the count() and arrange() functions to see which state has the most fraud transactions.
# A tibble: 50 x 2
state n
<chr> <int>
1 TX 40393
2 NY 35918
3 PA 34326
4 CA 24135
5 OH 20147
6 MI 19671
7 IL 18960
8 FL 18104
9 AL 17532
10 MO 16501
# ... with 40 more rows
TX, NY, PA, CA, and OH each have more than 20,000 fraud transactions. Three states, HI, AK, and RI, have the fewest fraud transactions among all the states in the United States.
Finding the average, maximum, and minimum fraud transaction amount per state.
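The call that produced the state count table above is not shown. A minimal sketch of the count-and-sort step, run on a small made-up data frame (the `toy` data and its values are assumptions for illustration, not rows from fraudTest.csv), might look like this:

```r
library(dplyr)

# Hypothetical stand-in for fraudTest: one row per transaction
toy <- data.frame(state = c("TX", "NY", "TX", "PA", "NY", "TX", "HI"))

# Count transactions per state, most frequent first
state_counts <- toy %>%
  count(state) %>%
  arrange(desc(n))

# The same counts sorted ascending show the states with the fewest transactions
fewest <- state_counts %>%
  arrange(n)

print(state_counts)
print(fewest)
```

Replacing `toy` with `fraudTest` should give a table like the one above, assuming that is how it was produced. The ascending sort is one way to check which states sit at the bottom of the list.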
fraudTest %>%
  group_by(state) %>%
  summarise(Average = mean(Amount), Upper = max(Amount), lower = min(Amount))
# A tibble: 50 x 4
state Average Upper lower
<chr> <dbl> <dbl> <dbl>
1 AK 78.4 1617 1
2 AL 64.3 5030 1
3 AR 76.2 8181 1
4 AZ 75.8 7321 1
5 CA 73.3 16837 1
6 CO 76.0 5187 1
7 CT 62.6 4120 1
8 DC 71.7 1121 1
9 FL 71.4 21438 1
10 GA 69.2 7886 1
# ... with 40 more rows
Visualizing the number of fraud transactions per state
fraudTest %>%
  select('state') %>%
  group_by(state) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = state, y = count, fill = state)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none")
Focusing on the top 3 states by number of fraud transactions and breaking each one down by city
selected_states <- c('TX', 'NY', 'PA')
# Draw one bar chart per state showing the number of transactions in each city
for (x in selected_states) {
  p <- fraudTest %>%
    filter(state == x) %>%
    ggplot(aes(x = city, fill = city)) +
    geom_bar() +  # counts rows per city, as geom_histogram(stat = "count") did
    theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none")
  print(p)
}
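The three states in the loop are typed in by hand from the count table. As a hedged alternative, they could be derived from the data so the loop stays correct if the file changes. The sketch below runs on a made-up data frame (`toy` and `top_states` are hypothetical names, not part of the original analysis):

```r
library(dplyr)

# Hypothetical stand-in for fraudTest: TX x4, NY x3, PA x2, CA x1
toy <- data.frame(
  state = c("TX", "TX", "TX", "TX", "NY", "NY", "NY", "PA", "PA", "CA")
)

# Count rows per state, sort by count descending, keep the first three states
top_states <- toy %>%
  count(state, sort = TRUE) %>%
  slice_head(n = 3) %>%
  pull(state) %>%
  as.character()

print(top_states)
```

On the real data, `top_states` could then stand in for the hard-coded vector in the loop. Ties at the third position would be broken by row order, so the result is worth checking by eye.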
Decoding MA state transactions
MA has 5186 fraud transactions. We will dig into these 5186 transactions and see in which category most of them occurred.
Checking the victims of the fraud transactions by gender and visualizing the output
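library(tidyverse)
library(ggplot2)
# fraud_MA is not defined in the code shown; it is assumed here to be the MA subset of fraudTest
fraud_MA <- fraudTest %>%
  filter(state == "MA")
fraud_MA %>%
  ggplot(aes(gender)) +
  geom_bar()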
In MA, males are the main victims, accounting for more than 3,000 of the fraud transactions.
Counting the MA fraud transactions in each category:
# A tibble: 14 x 2
# Groups: Category [14]
Category n
<chr> <int>
1 gas_transport 623
2 grocery_pos 529
3 home 513
4 shopping_pos 427
5 kids_pets 415
6 shopping_net 376
7 entertainment 373
8 food_dining 345
9 personal_care 333
10 health_fitness 323
11 misc_pos 298
12 misc_net 264
13 grocery_net 198
14 travel 169
gas_transport and grocery_pos are the top two categories for fraud transactions in MA.
Visualizing the fraud transaction amounts by category for each state
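The code behind the category table above is not shown. Its `# Groups: Category [14]` header suggests a grouped count, so a minimal sketch under that assumption, run on made-up rows (`toy` is hypothetical), could be:

```r
library(dplyr)

# Hypothetical stand-in for fraudTest with only the columns needed here
toy <- data.frame(
  state    = c("MA", "MA", "MA", "TX", "MA", "NY"),
  Category = c("gas_transport", "grocery_pos", "gas_transport",
               "travel", "home", "home")
)

# Keep one state's rows, then count transactions per category, largest first
category_counts <- toy %>%
  filter(state == "MA") %>%
  group_by(Category) %>%
  count() %>%
  arrange(desc(n))

print(category_counts)
```

Because the data is grouped before `count()`, the printed result keeps the `Groups:` line, which matches the header of the table above.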
fraudTest %>%
  ggplot(mapping = aes(Amount, Category, color = Category)) +
  geom_boxplot() +
  facet_wrap(facets = vars(state))
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gogula (2022, May 11). Data Analytics and Computational Social Science: HW_5. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommanikanta898845/
BibTeX citation
@misc{gogula2022hw_5,
  author = {Gogula, Mani kanta},
  title = {Data Analytics and Computational Social Science: HW_5},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommanikanta898845/},
  year = {2022}
}