DATA WRANGLING
#1. Identify the dataset you will be using for the final project.
I will be using a credit card transaction dataset containing both legitimate and fraudulent transactions. The dataset comes from Kaggle.com.
#Importing the dataset
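The import step itself is not shown in the output below; a minimal sketch, assuming the Kaggle file has been downloaded locally as "fraudTest.csv" (the path is an assumption):

```r
# Load the tidyverse for read_csv(), the dplyr verbs, and ggplot2 used below
library(tidyverse)

# Read the Kaggle CSV into a tibble; the filename/path is an assumption
fraudTest <- read_csv("fraudTest.csv")
```

readr's read_csv() names an unnamed leading index column "...1", which matches the first column shown in the preview below.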
#Preview of the dataset
head(fraudTest)
# A tibble: 6 x 23
...1 trans_date_trans_time cc_num merchant category amt first
<dbl> <dttm> <dbl> <chr> <chr> <dbl> <chr>
1 0 2020-06-21 12:14:25 2.29e15 fraud_Kirl~ persona~ 2.86 Jeff
2 1 2020-06-21 12:14:33 3.57e15 fraud_Spor~ persona~ 29.8 Joan~
3 2 2020-06-21 12:14:53 3.60e15 fraud_Swan~ health_~ 41.3 Ashl~
4 3 2020-06-21 12:15:15 3.59e15 fraud_Hale~ misc_pos 60.0 Brian
5 4 2020-06-21 12:15:17 3.53e15 fraud_John~ travel 3.19 Nath~
6 5 2020-06-21 12:15:37 3.04e13 fraud_Daug~ kids_pe~ 19.6 Dani~
# ... with 16 more variables: last <chr>, gender <chr>, street <chr>,
# city <chr>, state <chr>, zip <dbl>, lat <dbl>, long <dbl>,
# city_pop <dbl>, job <chr>, dob <date>, trans_num <chr>,
# unix_time <dbl>, merch_lat <dbl>, merch_long <dbl>,
# is_fraud <dbl>
#Using the function dim() to get the dimensions of the dataset.
dim(fraudTest)
[1] 555719 23
#Extracting column names from the dataset using the colnames() function
colnames(fraudTest)
[1] "...1" "trans_date_trans_time"
[3] "cc_num" "merchant"
[5] "category" "amt"
[7] "first" "last"
[9] "gender" "street"
[11] "city" "state"
[13] "zip" "lat"
[15] "long" "city_pop"
[17] "job" "dob"
[19] "trans_num" "unix_time"
[21] "merch_lat" "merch_long"
[23] "is_fraud"
#Renaming the first column to sr_no using names()
names(fraudTest)[1] <- "sr_no"
#Preview of the dataset after renaming the first column
head(fraudTest)
# A tibble: 6 x 23
sr_no trans_date_trans_time cc_num merchant category amt first
<dbl> <dttm> <dbl> <chr> <chr> <dbl> <chr>
1 0 2020-06-21 12:14:25 2.29e15 fraud_Kirl~ persona~ 2.86 Jeff
2 1 2020-06-21 12:14:33 3.57e15 fraud_Spor~ persona~ 29.8 Joan~
3 2 2020-06-21 12:14:53 3.60e15 fraud_Swan~ health_~ 41.3 Ashl~
4 3 2020-06-21 12:15:15 3.59e15 fraud_Hale~ misc_pos 60.0 Brian
5 4 2020-06-21 12:15:17 3.53e15 fraud_John~ travel 3.19 Nath~
6 5 2020-06-21 12:15:37 3.04e13 fraud_Daug~ kids_pe~ 19.6 Dani~
# ... with 16 more variables: last <chr>, gender <chr>, street <chr>,
# city <chr>, state <chr>, zip <dbl>, lat <dbl>, long <dbl>,
# city_pop <dbl>, job <chr>, dob <date>, trans_num <chr>,
# unix_time <dbl>, merch_lat <dbl>, merch_long <dbl>,
# is_fraud <dbl>
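The same rename can also be done with dplyr's rename(), which fits naturally into a pipeline; a sketch on a small toy tibble (the toy data is illustrative, not from the real dataset):

```r
library(dplyr)

# Toy tibble mimicking the unnamed index column produced by read_csv()
toy <- tibble(`...1` = 0:2, amt = c(2.86, 29.8, 41.3))

# Non-syntactic names like ...1 must be wrapped in backticks
toy <- rename(toy, sr_no = `...1`)
names(toy)  # "sr_no" "amt"
```

Unlike names(fraudTest)[1] <- "sr_no", rename() refers to the column by name rather than position, so it still works if the column order changes.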
#Focusing on Massachusetts (MA) data
Selecting the state, category, amt, is_fraud, and merchant columns from the dataset, filtering the result to state "MA", and arranging it by amt. Then plotting amt vs. category.
fraudTest %>%
  select(state, category, amt, is_fraud, merchant) %>%
  filter(state == "MA") %>%
  arrange(amt) %>%
  ggplot(aes(amt, category)) + geom_line()
1) Which state has the highest number of fraud transactions, and in which category?
2) Which category contributes the most fraud transactions?
3) What are the minimum and maximum fraud transaction amounts in each state?
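The three questions above can be sketched as dplyr summaries; a sketch, assuming fraudTest is loaded as before (column names as shown in the preview):

```r
# 1) Fraud counts by state and category, highest first
fraudTest %>%
  filter(is_fraud == 1) %>%
  count(state, category, sort = TRUE)

# 2) Fraud counts by category alone
fraudTest %>%
  filter(is_fraud == 1) %>%
  count(category, sort = TRUE)

# 3) Minimum and maximum fraud amount per state
fraudTest %>%
  filter(is_fraud == 1) %>%
  group_by(state) %>%
  summarise(min_amt = min(amt), max_amt = max(amt))
```

count(..., sort = TRUE) is shorthand for group_by() + summarise(n = n()) + arrange(desc(n)), so the first row of each result answers the corresponding question.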
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gogula (2022, March 23). Data Analytics and Computational Social Science: HW_3(2). Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommanikanta879859/
BibTeX citation
@misc{gogula2022hw_3(2),
  author = {Gogula, Mani kanta},
  title = {Data Analytics and Computational Social Science: HW_3(2)},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommanikanta879859/},
  year = {2022}
}