HW_5 601
Loading the dataset and viewing its first rows.
library(readr)
library(dplyr)
library(tidyverse)
fraudTest <- read_csv("fraudTest.csv")
head(fraudTest)
# A tibble: 6 x 22
Sr_no Trans_date_Trans~ cc_num Merchant Category Amount first last
<dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
1 0 6/21/2020 2.29e15 fraud_K~ persona~ 3 Jeff Elli~
2 1 6/21/2020 3.57e15 fraud_S~ persona~ 30 Joan~ Will~
3 2 6/21/2020 3.6 e15 fraud_S~ health_~ 41 Ashl~ Lopez
4 3 6/21/2020 3.59e15 fraud_H~ misc_pos 60 Brian Will~
5 4 6/21/2020 3.53e15 fraud_J~ travel 3 Nath~ Mass~
6 5 6/21/2020 3.04e13 fraud_D~ kids_pe~ 20 Dani~ Evans
# ... with 14 more variables: gender <chr>, street <chr>, city <chr>,
# state <chr>, zip <dbl>, lat <dbl>, long <dbl>, city_pop <dbl>,
# job <chr>, dob <chr>, trans_num <chr>, unix_time <dbl>,
# merch_lat <dbl>, merch_long <dbl>
Checking the column names with the colnames() function to pick the important columns.
colnames(fraudTest)
[1] "Sr_no" "Trans_date_Trans_time"
[3] "cc_num" "Merchant"
[5] "Category" "Amount"
[7] "first" "last"
[9] "gender" "street"
[11] "city" "state"
[13] "zip" "lat"
[15] "long" "city_pop"
[17] "job" "dob"
[19] "trans_num" "unix_time"
[21] "merch_lat" "merch_long"
Using the count() and arrange() functions to see which state has the highest number of fraud transactions.
# A tibble: 50 x 2
state n
<chr> <int>
1 TX 40393
2 NY 35918
3 PA 34326
4 CA 24135
5 OH 20147
6 MI 19671
7 IL 18960
8 FL 18104
9 AL 17532
10 MO 16501
# ... with 40 more rows
TX, NY, PA, CA, and OH each have more than 20,000 fraud transactions.
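The call that produced the table above is not shown in the post. A minimal sketch of the count() and arrange() step follows, run on a small made-up data frame; `toy` and its values are hypothetical stand-ins for fraudTest.

```r
library(dplyr)

# Hypothetical stand-in for fraudTest: one row per transaction
toy <- tibble(
  state  = c("TX", "NY", "TX", "PA", "NY", "TX"),
  Amount = c(3, 30, 41, 60, 3, 20)
)

# Count rows per state, then sort so the state with the most
# transactions comes first
state_counts <- toy %>%
  count(state) %>%
  arrange(desc(n))

state_counts
```

Replacing `toy` with fraudTest should give a per-state table with the same two columns, state and n.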
fraudTest %>%
  group_by(state) %>%
  summarise(Average = mean(Amount), Upper = max(Amount), lower = min(Amount))
# A tibble: 50 x 4
state Average Upper lower
<chr> <dbl> <dbl> <dbl>
1 AK 78.4 1617 1
2 AL 64.3 5030 1
3 AR 76.2 8181 1
4 AZ 75.8 7321 1
5 CA 73.3 16837 1
6 CO 76.0 5187 1
7 CT 62.6 4120 1
8 DC 71.7 1121 1
9 FL 71.4 21438 1
10 GA 69.2 7886 1
# ... with 40 more rows
The state of MA has 5,186 fraud transactions. We take a closer look at these 5,186 transactions to see which category accounts for most of them.
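The fraud_MA object used in the plotting code is not defined anywhere in the post. One plausible way to build it, assuming it is simply the subset of rows whose state is "MA", is sketched here on hypothetical toy data (`toy` and `fraud_MA_toy` are illustrative names, not from the original).

```r
library(dplyr)

# Hypothetical stand-in for fraudTest
toy <- tibble(
  state    = c("MA", "TX", "MA", "NY", "MA"),
  gender   = c("M", "F", "M", "F", "F"),
  Category = c("home", "travel", "gas_transport", "home", "home")
)

# Assumption: fraud_MA is the subset of transactions recorded in MA
fraud_MA_toy <- toy %>%
  filter(state == "MA")

# Number of MA transactions in the toy data
nrow(fraud_MA_toy)
```

On the real data, the equivalent would be filtering fraudTest the same way, and the row count could then be compared with the 5,186 reported above.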
Checking the victims of the fraud transactions by gender and visualizing the result.
library(tidyverse)
library(ggplot2)
# Bar chart of the number of MA transactions per gender
fraud_MA %>%
  ggplot(aes(gender)) +
  geom_bar()
In MA, males are the main victims of fraud transactions, with more than 3,000 transactions.
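The bar heights can also be read off as numbers rather than estimated from the chart. A small sketch, using a hypothetical toy data frame (`toy_MA`) in place of fraud_MA, counts transactions per gender and adds each gender's share of the total.

```r
library(dplyr)

# Hypothetical stand-in for fraud_MA
toy_MA <- tibble(gender = c("M", "F", "M", "M", "F"))

# Transactions per gender, with each gender's share of all transactions
gender_counts <- toy_MA %>%
  count(gender) %>%
  mutate(share = n / sum(n)) %>%
  arrange(desc(n))

gender_counts
```

Running the same pipeline on fraud_MA would show exactly how many of the transactions fall under each gender.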
# A tibble: 14 x 2
# Groups: Category [14]
Category n
<chr> <int>
1 gas_transport 623
2 grocery_pos 529
3 home 513
4 shopping_pos 427
5 kids_pets 415
6 shopping_net 376
7 entertainment 373
8 food_dining 345
9 personal_care 333
10 health_fitness 323
11 misc_pos 298
12 misc_net 264
13 grocery_net 198
14 travel 169
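The code for the category table above is missing from the post. The "Groups: Category [14]" line in the output suggests the data were grouped by Category before counting. A sketch along those lines, on hypothetical toy rows standing in for fraud_MA:

```r
library(dplyr)

# Hypothetical stand-in for fraud_MA
toy_MA <- tibble(
  Category = c("home", "travel", "home", "gas_transport", "home", "travel")
)

# Group by category, count rows per group, then sort in descending order;
# count() on a grouped data frame keeps the grouping, which matches the
# "Groups:" line in the printed output
category_counts <- toy_MA %>%
  group_by(Category) %>%
  count() %>%
  arrange(desc(n))

category_counts
```

This is only one way to arrive at a table of that shape; the author's actual call may have differed.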
fraudTest %>%
  group_by(state) %>%
  summarise(Average = mean(Amount), lower = min(Amount), Upper = max(Amount))
# A tibble: 50 x 4
state Average lower Upper
<chr> <dbl> <dbl> <dbl>
1 AK 78.4 1 1617
2 AL 64.3 1 5030
3 AR 76.2 1 8181
4 AZ 75.8 1 7321
5 CA 73.3 1 16837
6 CO 76.0 1 5187
7 CT 62.6 1 4120
8 DC 71.7 1 1121
9 FL 71.4 1 21438
10 GA 69.2 1 7886
# ... with 40 more rows
fraudTest %>%
  ggplot(aes(Category, Amount, color = state)) +
  geom_col() +
  scale_y_log10()
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gogula (2022, April 27). Data Analytics and Computational Social Science: HW_5. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommanikanta891378/
BibTeX citation
@misc{gogula2022hw_5,
  author = {Gogula, Mani kanta},
  title = {Data Analytics and Computational Social Science: HW_5},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommanikanta891378/},
  year = {2022}
}