Default of credit card clients dataset - UCI Machine Learning Repository

challenge_1
Pranav Bharadwaj Komaravolu
UCI_Credit_Card.csv
Author

Pranav Komaravolu

Published

February 22, 2023

library(tidyverse)
library(readr)

knitr::opts_chunk$set(echo = TRUE)

Sources

The dataset was obtained from the University of California Irvine (UCI) Machine Learning Repository. With the dataset labeled “Default of credit card clients dataset”. The dataset can be accessed here.

Reading the dataset

To code to read the dataset is as follows:

data <- read_csv('_data/UCI_Credit_Card.csv')
Rows: 30000 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (25): ID, LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, PAY_3,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
# A tibble: 6 × 25
     ID LIMIT_…¹   SEX EDUCA…² MARRI…³   AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6
  <dbl>    <dbl> <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1    20000     2       2       1    24     2     2    -1    -1    -2    -2
2     2   120000     2       2       2    26    -1     2     0     0     0     2
3     3    90000     2       2       2    34     0     0     0     0     0     0
4     4    50000     2       2       1    37     0     0     0     0     0     0
5     5    50000     1       2       1    57    -1     0    -1     0     0     0
6     6    50000     1       1       2    37     0     0     0     0     0     0
# … with 13 more variables: BILL_AMT1 <dbl>, BILL_AMT2 <dbl>, BILL_AMT3 <dbl>,
#   BILL_AMT4 <dbl>, BILL_AMT5 <dbl>, BILL_AMT6 <dbl>, PAY_AMT1 <dbl>,
#   PAY_AMT2 <dbl>, PAY_AMT3 <dbl>, PAY_AMT4 <dbl>, PAY_AMT5 <dbl>,
#   PAY_AMT6 <dbl>, default.payment.next.month <dbl>, and abbreviated variable
#   names ¹​LIMIT_BAL, ²​EDUCATION, ³​MARRIAGE

The head of the table is shown above.

The dataset with skipped header is as follows:

data2 <- read_csv('_data/UCI_Credit_Card.csv', skip=1) # the header is skipped
New names:
Rows: 29999 Columns: 25
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," dbl
(25): 1...1, 20000, 2...3, 2...4, 1...5, 24, 2...7, 2...8, -1...9, -1......
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `1` -> `1...1`
• `2` -> `2...3`
• `2` -> `2...4`
• `1` -> `1...5`
• `2` -> `2...7`
• `2` -> `2...8`
• `-1` -> `-1...9`
• `-1` -> `-1...10`
• `-2` -> `-2...11`
• `-2` -> `-2...12`
• `689` -> `689...15`
• `0` -> `0...16`
• `0` -> `0...17`
• `0` -> `0...18`
• `0` -> `0...19`
• `689` -> `689...20`
• `0` -> `0...21`
• `0` -> `0...22`
• `0` -> `0...23`
• `0` -> `0...24`
• `1` -> `1...25`
head(data2)
# A tibble: 6 × 25
  `1...1` `20000` `2...3` `2...4` `1...5`  `24` `2...7` `2...8` `-1...9` -1...…¹
    <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>   <dbl>    <dbl>   <dbl>
1       2  120000       2       2       2    26      -1       2        0       0
2       3   90000       2       2       2    34       0       0        0       0
3       4   50000       2       2       1    37       0       0        0       0
4       5   50000       1       2       1    57      -1       0       -1       0
5       6   50000       1       1       2    37       0       0        0       0
6       7  500000       1       1       2    29       0       0        0       0
# … with 15 more variables: `-2...11` <dbl>, `-2...12` <dbl>, `3913` <dbl>,
#   `3102` <dbl>, `689...15` <dbl>, `0...16` <dbl>, `0...17` <dbl>,
#   `0...18` <dbl>, `0...19` <dbl>, `689...20` <dbl>, `0...21` <dbl>,
#   `0...22` <dbl>, `0...23` <dbl>, `0...24` <dbl>, `1...25` <dbl>, and
#   abbreviated variable name ¹​`-1...10`

Dataset Description

The number of rows in the dataset can be obtained as follows:

nrow(data)
[1] 30000

The number of fields or columns can be obtained as follows:

ncol(data)
[1] 25

The different columns are as follows:

names(data)
 [1] "ID"                         "LIMIT_BAL"                 
 [3] "SEX"                        "EDUCATION"                 
 [5] "MARRIAGE"                   "AGE"                       
 [7] "PAY_0"                      "PAY_2"                     
 [9] "PAY_3"                      "PAY_4"                     
[11] "PAY_5"                      "PAY_6"                     
[13] "BILL_AMT1"                  "BILL_AMT2"                 
[15] "BILL_AMT3"                  "BILL_AMT4"                 
[17] "BILL_AMT5"                  "BILL_AMT6"                 
[19] "PAY_AMT1"                   "PAY_AMT2"                  
[21] "PAY_AMT3"                   "PAY_AMT4"                  
[23] "PAY_AMT5"                   "PAY_AMT6"                  
[25] "default.payment.next.month"

The data types of different columns are as follows:

as.data.frame(sapply(data, class))
                           sapply(data, class)
ID                                     numeric
LIMIT_BAL                              numeric
SEX                                    numeric
EDUCATION                              numeric
MARRIAGE                               numeric
AGE                                    numeric
PAY_0                                  numeric
PAY_2                                  numeric
PAY_3                                  numeric
PAY_4                                  numeric
PAY_5                                  numeric
PAY_6                                  numeric
BILL_AMT1                              numeric
BILL_AMT2                              numeric
BILL_AMT3                              numeric
BILL_AMT4                              numeric
BILL_AMT5                              numeric
BILL_AMT6                              numeric
PAY_AMT1                               numeric
PAY_AMT2                               numeric
PAY_AMT3                               numeric
PAY_AMT4                               numeric
PAY_AMT5                               numeric
PAY_AMT6                               numeric
default.payment.next.month             numeric

Since there are some categorical attributes we can obtain the number of unique elements in each column as follows:

getUniqueValues <- function(column_label) {
  data %>% select(all_of(column_label)) %>% n_distinct(.)
}
for (i in names(data)) {
  print(i)
  print(getUniqueValues(i))
}
[1] "ID"
[1] 30000
[1] "LIMIT_BAL"
[1] 81
[1] "SEX"
[1] 2
[1] "EDUCATION"
[1] 7
[1] "MARRIAGE"
[1] 4
[1] "AGE"
[1] 56
[1] "PAY_0"
[1] 11
[1] "PAY_2"
[1] 11
[1] "PAY_3"
[1] 11
[1] "PAY_4"
[1] 11
[1] "PAY_5"
[1] 10
[1] "PAY_6"
[1] 10
[1] "BILL_AMT1"
[1] 22723
[1] "BILL_AMT2"
[1] 22346
[1] "BILL_AMT3"
[1] 22026
[1] "BILL_AMT4"
[1] 21548
[1] "BILL_AMT5"
[1] 21010
[1] "BILL_AMT6"
[1] 20604
[1] "PAY_AMT1"
[1] 7943
[1] "PAY_AMT2"
[1] 7899
[1] "PAY_AMT3"
[1] 7518
[1] "PAY_AMT4"
[1] 6937
[1] "PAY_AMT5"
[1] 6897
[1] "PAY_AMT6"
[1] 6939
[1] "default.payment.next.month"
[1] 2

The dataset comprises of 25 different columns, 30000 rows. Each of these columns depict a particular attribute of the data here a person and the column “default.payment.next.month” depicts whether the person defaulted the credit payment or not. All the attributes are preprocessed so that ML algorithms can efficiently train on the data as we can observe that even some of the categorical values are converted to integer type. From the above cell we can observe some attributes such as “SEX”, “EDUCATION”, “MARRIAGE” and the “default.payment.next.month” have very less number of unique values suggesting that these attributes are categorical.

Dataset summary

To summarize the dataset we can use the following code:

summary(data)
       ID          LIMIT_BAL            SEX          EDUCATION    
 Min.   :    1   Min.   :  10000   Min.   :1.000   Min.   :0.000  
 1st Qu.: 7501   1st Qu.:  50000   1st Qu.:1.000   1st Qu.:1.000  
 Median :15000   Median : 140000   Median :2.000   Median :2.000  
 Mean   :15000   Mean   : 167484   Mean   :1.604   Mean   :1.853  
 3rd Qu.:22500   3rd Qu.: 240000   3rd Qu.:2.000   3rd Qu.:2.000  
 Max.   :30000   Max.   :1000000   Max.   :2.000   Max.   :6.000  
    MARRIAGE          AGE            PAY_0             PAY_2        
 Min.   :0.000   Min.   :21.00   Min.   :-2.0000   Min.   :-2.0000  
 1st Qu.:1.000   1st Qu.:28.00   1st Qu.:-1.0000   1st Qu.:-1.0000  
 Median :2.000   Median :34.00   Median : 0.0000   Median : 0.0000  
 Mean   :1.552   Mean   :35.49   Mean   :-0.0167   Mean   :-0.1338  
 3rd Qu.:2.000   3rd Qu.:41.00   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
 Max.   :3.000   Max.   :79.00   Max.   : 8.0000   Max.   : 8.0000  
     PAY_3             PAY_4             PAY_5             PAY_6        
 Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000  
 1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000  
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000  
 Mean   :-0.1662   Mean   :-0.2207   Mean   :-0.2662   Mean   :-0.2911  
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
 Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000  
   BILL_AMT1         BILL_AMT2        BILL_AMT3         BILL_AMT4      
 Min.   :-165580   Min.   :-69777   Min.   :-157264   Min.   :-170000  
 1st Qu.:   3559   1st Qu.:  2985   1st Qu.:   2666   1st Qu.:   2327  
 Median :  22382   Median : 21200   Median :  20088   Median :  19052  
 Mean   :  51223   Mean   : 49179   Mean   :  47013   Mean   :  43263  
 3rd Qu.:  67091   3rd Qu.: 64006   3rd Qu.:  60165   3rd Qu.:  54506  
 Max.   : 964511   Max.   :983931   Max.   :1664089   Max.   : 891586  
   BILL_AMT5        BILL_AMT6          PAY_AMT1         PAY_AMT2      
 Min.   :-81334   Min.   :-339603   Min.   :     0   Min.   :      0  
 1st Qu.:  1763   1st Qu.:   1256   1st Qu.:  1000   1st Qu.:    833  
 Median : 18104   Median :  17071   Median :  2100   Median :   2009  
 Mean   : 40311   Mean   :  38872   Mean   :  5664   Mean   :   5921  
 3rd Qu.: 50190   3rd Qu.:  49198   3rd Qu.:  5006   3rd Qu.:   5000  
 Max.   :927171   Max.   : 961664   Max.   :873552   Max.   :1684259  
    PAY_AMT3         PAY_AMT4         PAY_AMT5           PAY_AMT6       
 Min.   :     0   Min.   :     0   Min.   :     0.0   Min.   :     0.0  
 1st Qu.:   390   1st Qu.:   296   1st Qu.:   252.5   1st Qu.:   117.8  
 Median :  1800   Median :  1500   Median :  1500.0   Median :  1500.0  
 Mean   :  5226   Mean   :  4826   Mean   :  4799.4   Mean   :  5215.5  
 3rd Qu.:  4505   3rd Qu.:  4013   3rd Qu.:  4031.5   3rd Qu.:  4000.0  
 Max.   :896040   Max.   :621000   Max.   :426529.0   Max.   :528666.0  
 default.payment.next.month
 Min.   :0.0000            
 1st Qu.:0.0000            
 Median :0.0000            
 Mean   :0.2212            
 3rd Qu.:0.0000            
 Max.   :1.0000