library(tidyverse)
library(readr)
::opts_chunk$set(echo = TRUE) knitr
Default of credit card clients dataset - UCI Machine Learning Repository
Sources
The dataset was obtained from the University of California Irvine (UCI) Machine Learning Repository. With the dataset labeled “Default of credit card clients dataset”. The dataset can be accessed here.
Reading the dataset
To code to read the dataset is as follows:
<- read_csv('_data/UCI_Credit_Card.csv') data
Rows: 30000 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (25): ID, LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, PAY_3,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
# A tibble: 6 × 25
ID LIMIT_…¹ SEX EDUCA…² MARRI…³ AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 20000 2 2 1 24 2 2 -1 -1 -2 -2
2 2 120000 2 2 2 26 -1 2 0 0 0 2
3 3 90000 2 2 2 34 0 0 0 0 0 0
4 4 50000 2 2 1 37 0 0 0 0 0 0
5 5 50000 1 2 1 57 -1 0 -1 0 0 0
6 6 50000 1 1 2 37 0 0 0 0 0 0
# … with 13 more variables: BILL_AMT1 <dbl>, BILL_AMT2 <dbl>, BILL_AMT3 <dbl>,
# BILL_AMT4 <dbl>, BILL_AMT5 <dbl>, BILL_AMT6 <dbl>, PAY_AMT1 <dbl>,
# PAY_AMT2 <dbl>, PAY_AMT3 <dbl>, PAY_AMT4 <dbl>, PAY_AMT5 <dbl>,
# PAY_AMT6 <dbl>, default.payment.next.month <dbl>, and abbreviated variable
# names ¹LIMIT_BAL, ²EDUCATION, ³MARRIAGE
The head of the table is shown above.
The dataset with skipped header is as follows:
<- read_csv('_data/UCI_Credit_Card.csv', skip=1) # the header is skipped data2
New names:
Rows: 29999 Columns: 25
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," dbl
(25): 1...1, 20000, 2...3, 2...4, 1...5, 24, 2...7, 2...8, -1...9, -1......
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `1` -> `1...1`
• `2` -> `2...3`
• `2` -> `2...4`
• `1` -> `1...5`
• `2` -> `2...7`
• `2` -> `2...8`
• `-1` -> `-1...9`
• `-1` -> `-1...10`
• `-2` -> `-2...11`
• `-2` -> `-2...12`
• `689` -> `689...15`
• `0` -> `0...16`
• `0` -> `0...17`
• `0` -> `0...18`
• `0` -> `0...19`
• `689` -> `689...20`
• `0` -> `0...21`
• `0` -> `0...22`
• `0` -> `0...23`
• `0` -> `0...24`
• `1` -> `1...25`
head(data2)
# A tibble: 6 × 25
`1...1` `20000` `2...3` `2...4` `1...5` `24` `2...7` `2...8` `-1...9` -1...…¹
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 120000 2 2 2 26 -1 2 0 0
2 3 90000 2 2 2 34 0 0 0 0
3 4 50000 2 2 1 37 0 0 0 0
4 5 50000 1 2 1 57 -1 0 -1 0
5 6 50000 1 1 2 37 0 0 0 0
6 7 500000 1 1 2 29 0 0 0 0
# … with 15 more variables: `-2...11` <dbl>, `-2...12` <dbl>, `3913` <dbl>,
# `3102` <dbl>, `689...15` <dbl>, `0...16` <dbl>, `0...17` <dbl>,
# `0...18` <dbl>, `0...19` <dbl>, `689...20` <dbl>, `0...21` <dbl>,
# `0...22` <dbl>, `0...23` <dbl>, `0...24` <dbl>, `1...25` <dbl>, and
# abbreviated variable name ¹`-1...10`
Dataset Description
The number of rows in the dataset can be obtained as follows:
nrow(data)
[1] 30000
The number of fields or columns can be obtained as follows:
ncol(data)
[1] 25
The different columns are as follows:
names(data)
[1] "ID" "LIMIT_BAL"
[3] "SEX" "EDUCATION"
[5] "MARRIAGE" "AGE"
[7] "PAY_0" "PAY_2"
[9] "PAY_3" "PAY_4"
[11] "PAY_5" "PAY_6"
[13] "BILL_AMT1" "BILL_AMT2"
[15] "BILL_AMT3" "BILL_AMT4"
[17] "BILL_AMT5" "BILL_AMT6"
[19] "PAY_AMT1" "PAY_AMT2"
[21] "PAY_AMT3" "PAY_AMT4"
[23] "PAY_AMT5" "PAY_AMT6"
[25] "default.payment.next.month"
The data types of different columns are as follows:
as.data.frame(sapply(data, class))
sapply(data, class)
ID numeric
LIMIT_BAL numeric
SEX numeric
EDUCATION numeric
MARRIAGE numeric
AGE numeric
PAY_0 numeric
PAY_2 numeric
PAY_3 numeric
PAY_4 numeric
PAY_5 numeric
PAY_6 numeric
BILL_AMT1 numeric
BILL_AMT2 numeric
BILL_AMT3 numeric
BILL_AMT4 numeric
BILL_AMT5 numeric
BILL_AMT6 numeric
PAY_AMT1 numeric
PAY_AMT2 numeric
PAY_AMT3 numeric
PAY_AMT4 numeric
PAY_AMT5 numeric
PAY_AMT6 numeric
default.payment.next.month numeric
Since there are some categorical attributes we can obtain the number of unique elements in each column as follows:
<- function(column_label) {
getUniqueValues %>% select(all_of(column_label)) %>% n_distinct(.)
data
}for (i in names(data)) {
print(i)
print(getUniqueValues(i))
}
[1] "ID"
[1] 30000
[1] "LIMIT_BAL"
[1] 81
[1] "SEX"
[1] 2
[1] "EDUCATION"
[1] 7
[1] "MARRIAGE"
[1] 4
[1] "AGE"
[1] 56
[1] "PAY_0"
[1] 11
[1] "PAY_2"
[1] 11
[1] "PAY_3"
[1] 11
[1] "PAY_4"
[1] 11
[1] "PAY_5"
[1] 10
[1] "PAY_6"
[1] 10
[1] "BILL_AMT1"
[1] 22723
[1] "BILL_AMT2"
[1] 22346
[1] "BILL_AMT3"
[1] 22026
[1] "BILL_AMT4"
[1] 21548
[1] "BILL_AMT5"
[1] 21010
[1] "BILL_AMT6"
[1] 20604
[1] "PAY_AMT1"
[1] 7943
[1] "PAY_AMT2"
[1] 7899
[1] "PAY_AMT3"
[1] 7518
[1] "PAY_AMT4"
[1] 6937
[1] "PAY_AMT5"
[1] 6897
[1] "PAY_AMT6"
[1] 6939
[1] "default.payment.next.month"
[1] 2
The dataset comprises of 25 different columns, 30000 rows. Each of these columns depict a particular attribute of the data here a person and the column “default.payment.next.month” depicts whether the person defaulted the credit payment or not. All the attributes are preprocessed so that ML algorithms can efficiently train on the data as we can observe that even some of the categorical values are converted to integer type. From the above cell we can observe some attributes such as “SEX”, “EDUCATION”, “MARRIAGE” and the “default.payment.next.month” have very less number of unique values suggesting that these attributes are categorical.
Dataset summary
To summarize the dataset we can use the following code:
summary(data)
ID LIMIT_BAL SEX EDUCATION
Min. : 1 Min. : 10000 Min. :1.000 Min. :0.000
1st Qu.: 7501 1st Qu.: 50000 1st Qu.:1.000 1st Qu.:1.000
Median :15000 Median : 140000 Median :2.000 Median :2.000
Mean :15000 Mean : 167484 Mean :1.604 Mean :1.853
3rd Qu.:22500 3rd Qu.: 240000 3rd Qu.:2.000 3rd Qu.:2.000
Max. :30000 Max. :1000000 Max. :2.000 Max. :6.000
MARRIAGE AGE PAY_0 PAY_2
Min. :0.000 Min. :21.00 Min. :-2.0000 Min. :-2.0000
1st Qu.:1.000 1st Qu.:28.00 1st Qu.:-1.0000 1st Qu.:-1.0000
Median :2.000 Median :34.00 Median : 0.0000 Median : 0.0000
Mean :1.552 Mean :35.49 Mean :-0.0167 Mean :-0.1338
3rd Qu.:2.000 3rd Qu.:41.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. :3.000 Max. :79.00 Max. : 8.0000 Max. : 8.0000
PAY_3 PAY_4 PAY_5 PAY_6
Min. :-2.0000 Min. :-2.0000 Min. :-2.0000 Min. :-2.0000
1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 0.0000
Mean :-0.1662 Mean :-0.2207 Mean :-0.2662 Mean :-0.2911
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. : 8.0000 Max. : 8.0000 Max. : 8.0000 Max. : 8.0000
BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4
Min. :-165580 Min. :-69777 Min. :-157264 Min. :-170000
1st Qu.: 3559 1st Qu.: 2985 1st Qu.: 2666 1st Qu.: 2327
Median : 22382 Median : 21200 Median : 20088 Median : 19052
Mean : 51223 Mean : 49179 Mean : 47013 Mean : 43263
3rd Qu.: 67091 3rd Qu.: 64006 3rd Qu.: 60165 3rd Qu.: 54506
Max. : 964511 Max. :983931 Max. :1664089 Max. : 891586
BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
Min. :-81334 Min. :-339603 Min. : 0 Min. : 0
1st Qu.: 1763 1st Qu.: 1256 1st Qu.: 1000 1st Qu.: 833
Median : 18104 Median : 17071 Median : 2100 Median : 2009
Mean : 40311 Mean : 38872 Mean : 5664 Mean : 5921
3rd Qu.: 50190 3rd Qu.: 49198 3rd Qu.: 5006 3rd Qu.: 5000
Max. :927171 Max. : 961664 Max. :873552 Max. :1684259
PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0
1st Qu.: 390 1st Qu.: 296 1st Qu.: 252.5 1st Qu.: 117.8
Median : 1800 Median : 1500 Median : 1500.0 Median : 1500.0
Mean : 5226 Mean : 4826 Mean : 4799.4 Mean : 5215.5
3rd Qu.: 4505 3rd Qu.: 4013 3rd Qu.: 4031.5 3rd Qu.: 4000.0
Max. :896040 Max. :621000 Max. :426529.0 Max. :528666.0
default.payment.next.month
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.2212
3rd Qu.:0.0000
Max. :1.0000