Final Project Assignment#1: Pranav Bharadwaj Komaravolu

Final Project Assignment#1
Credit score evaluation
Project & Data Description
Author

Pranav Bharadwaj Komaravolu

Published

April 18, 2023

library(tidyverse)
library(readr)
library(mosaic)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Introduction

  1. The problem I chose to address in my final project is identifying how a credit score is affected by various socioeconomic factors. Credit refers to the ability to borrow money against a promise of future repayment. It supports two main actors in the economy: 1) consumers and 2) banks, and access to it helps advance the well-being of people and governments. A credit score is a rating or assessment issued by banks or agencies based on an individual’s societal status, economic trends, and behavioral patterns. Banks extend credit based on this value, which in turn contributes to the well-being of consumers. As a new credit card user, I was curious about how the credit system works and which parameters/attributes have the most impact on an individual’s credit score. Each row in the dataset contains attributes such as name, number of credit accounts, delays/defaults in loan repayment, and so on.

  2. The questions I will be addressing through this project are:

    • What are the parameters that impact the credit score/assessment the most and what are their correlations?
    • Which model works best to classify individuals into groups of “Good”, “Standard” and “Low”?

Dataset

For this task I have identified the “Credit score classification” dataset on Kaggle. Few datasets are available for this task, and of those this one seemed the most promising.

  1. Reading the dataset:
data <- read_csv("601_Spring_2023_project/PranavKomaravolu_FinalProjectData/train.csv")
head(data)
# A tibble: 6 × 28
  ID    Custom…¹ Month Name  Age   SSN   Occup…² Annua…³ Month…⁴ Num_B…⁵ Num_C…⁶
  <chr> <chr>    <chr> <chr> <chr> <chr> <chr>   <chr>     <dbl>   <dbl>   <dbl>
1 1602  CUS_0xd… Janu… Aaro… 23    821-… Scient… 19114.…   1825.       3       4
2 1603  CUS_0xd… Febr… Aaro… 23    821-… Scient… 19114.…     NA        3       4
3 1604  CUS_0xd… March Aaro… -500  821-… Scient… 19114.…     NA        3       4
4 1605  CUS_0xd… April Aaro… 23    821-… Scient… 19114.…     NA        3       4
5 1606  CUS_0xd… May   Aaro… 23    821-… Scient… 19114.…   1825.       3       4
6 1607  CUS_0xd… June  Aaro… 23    821-… Scient… 19114.…     NA        3       4
# … with 17 more variables: Interest_Rate <dbl>, Num_of_Loan <chr>,
#   Type_of_Loan <chr>, Delay_from_due_date <dbl>,
#   Num_of_Delayed_Payment <chr>, Changed_Credit_Limit <chr>,
#   Num_Credit_Inquiries <dbl>, Credit_Mix <chr>, Outstanding_Debt <chr>,
#   Credit_Utilization_Ratio <dbl>, Credit_History_Age <chr>,
#   Payment_of_Min_Amount <chr>, Total_EMI_per_month <dbl>,
#   Amount_invested_monthly <chr>, Payment_Behaviour <chr>, …
  2. The dataset is large; its dimensions are as follows:
dim(data)
[1] 100000     28

There are 100,000 rows and 28 columns in the dataset. From the head of the dataset above we can also observe some impurities and information gaps, such as missing (NA) salary values and an implausible age of -500.
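The impurities visible in the preview (an implausible Age, and numeric fields read in as character) could be handled along the following lines. This is only a minimal sketch on a small made-up tibble, not the real data: the values, the presence of stray trailing characters, the 0 to 120 age bounds, and the helper name `clean_numeric` are all illustrative assumptions.

```r
library(dplyr)
library(readr)

# Hypothetical toy rows mimicking the kinds of issues seen in head(data)
toy <- tibble(
  Age = c("23", "-500", "28_", "34"),
  Annual_Income = c("52000.5", "52000.5_", "34000", "91000.25")
)

# Hypothetical helper: drop stray non-numeric characters and convert to double
clean_numeric <- function(x) parse_number(x)

cleaned <- toy %>%
  mutate(
    Age = clean_numeric(Age),
    Annual_Income = clean_numeric(Annual_Income),
    # Treat ages outside an assumed plausible range as missing
    Age = if_else(Age < 0 | Age > 120, NA_real_, Age)
  )

cleaned
```

Marking implausible values as NA, rather than dropping the rows, keeps the remaining attributes of those records available for later imputation.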

The different columns in the dataset are as follows:

names(data)
 [1] "ID"                       "Customer_ID"             
 [3] "Month"                    "Name"                    
 [5] "Age"                      "SSN"                     
 [7] "Occupation"               "Annual_Income"           
 [9] "Monthly_Inhand_Salary"    "Num_Bank_Accounts"       
[11] "Num_Credit_Card"          "Interest_Rate"           
[13] "Num_of_Loan"              "Type_of_Loan"            
[15] "Delay_from_due_date"      "Num_of_Delayed_Payment"  
[17] "Changed_Credit_Limit"     "Num_Credit_Inquiries"    
[19] "Credit_Mix"               "Outstanding_Debt"        
[21] "Credit_Utilization_Ratio" "Credit_History_Age"      
[23] "Payment_of_Min_Amount"    "Total_EMI_per_month"     
[25] "Amount_invested_monthly"  "Payment_Behaviour"       
[27] "Monthly_Balance"          "Credit_Score"            

From the column list above, “Customer_ID” is a unique identifier assigned to each customer, so counting the distinct customer IDs gives the total number of customers in the sample. The same call also returns the number of distinct labels assigned to individuals as their credit score/evaluation:

  data %>%
  select(Customer_ID, Credit_Score) %>%
  summarize_all(n_distinct)
# A tibble: 1 × 2
  Customer_ID Credit_Score
        <int>        <int>
1       12500            3

So there are 12,500 unique customers, and each record carries one of three labels for the Credit_Score.
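Because the dataset has more rows than customers, it may also be worth checking how the labels are distributed and whether a customer's label changes across records. The sketch below uses a small made-up tibble (the customer IDs and rows are hypothetical) rather than the real data.

```r
library(dplyr)

# Hypothetical toy records: several rows per customer
toy <- tibble(
  Customer_ID = c("CUS_a", "CUS_a", "CUS_b", "CUS_b", "CUS_c"),
  Credit_Score = c("Good", "Good", "Standard", "Good", "Standard")
)

# Number of records carrying each label
label_counts <- toy %>%
  count(Credit_Score)

# Number of records and of distinct labels per customer
labels_per_customer <- toy %>%
  group_by(Customer_ID) %>%
  summarize(n_records = n(), n_labels = n_distinct(Credit_Score))

label_counts
labels_per_customer
```

A value of `n_labels` greater than 1 would indicate that the label is assigned per record rather than once per customer.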

The additional summary statistics for different columns are as follows:

summary(data)
      ID            Customer_ID           Month               Name          
 Length:100000      Length:100000      Length:100000      Length:100000     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
     Age                SSN             Occupation        Annual_Income     
 Length:100000      Length:100000      Length:100000      Length:100000     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 Monthly_Inhand_Salary Num_Bank_Accounts Num_Credit_Card   Interest_Rate    
 Min.   :  303.6       Min.   :  -1.00   Min.   :   0.00   Min.   :   1.00  
 1st Qu.: 1625.6       1st Qu.:   3.00   1st Qu.:   4.00   1st Qu.:   8.00  
 Median : 3093.7       Median :   6.00   Median :   5.00   Median :  13.00  
 Mean   : 4194.2       Mean   :  17.09   Mean   :  22.47   Mean   :  72.47  
 3rd Qu.: 5957.4       3rd Qu.:   7.00   3rd Qu.:   7.00   3rd Qu.:  20.00  
 Max.   :15204.6       Max.   :1798.00   Max.   :1499.00   Max.   :5797.00  
 NA's   :15002                                                              
 Num_of_Loan        Type_of_Loan       Delay_from_due_date
 Length:100000      Length:100000      Min.   :-5.00      
 Class :character   Class :character   1st Qu.:10.00      
 Mode  :character   Mode  :character   Median :18.00      
                                       Mean   :21.07      
                                       3rd Qu.:28.00      
                                       Max.   :67.00      
                                                          
 Num_of_Delayed_Payment Changed_Credit_Limit Num_Credit_Inquiries
 Length:100000          Length:100000        Min.   :   0.00     
 Class :character       Class :character     1st Qu.:   3.00     
 Mode  :character       Mode  :character     Median :   6.00     
                                             Mean   :  27.75     
                                             3rd Qu.:   9.00     
                                             Max.   :2597.00     
                                             NA's   :1965        
  Credit_Mix        Outstanding_Debt   Credit_Utilization_Ratio
 Length:100000      Length:100000      Min.   :20.00           
 Class :character   Class :character   1st Qu.:28.05           
 Mode  :character   Mode  :character   Median :32.31           
                                       Mean   :32.29           
                                       3rd Qu.:36.50           
                                       Max.   :50.00           
                                                               
 Credit_History_Age Payment_of_Min_Amount Total_EMI_per_month
 Length:100000      Length:100000         Min.   :    0.00   
 Class :character   Class :character      1st Qu.:   30.31   
 Mode  :character   Mode  :character      Median :   69.25   
                                          Mean   : 1403.12   
                                          3rd Qu.:  161.22   
                                          Max.   :82331.00   
                                                             
 Amount_invested_monthly Payment_Behaviour  Monthly_Balance    
 Length:100000           Length:100000      Min.   :   0.0078  
 Class :character        Class :character   1st Qu.: 270.1066  
 Mode  :character        Mode  :character   Median : 336.7312  
                                            Mean   : 402.5513  
                                            3rd Qu.: 470.2629  
                                            Max.   :1602.0405  
                                            NA's   :1209       
 Credit_Score      
 Length:100000     
 Class :character  
 Mode  :character  
                   
                   
                   
                   

So each of these columns represents a specific parameter that may affect the target attribute “Credit_Score”. The question I am trying to address is how each of them affects the credit score and what weight can be assigned to each attribute during credit evaluation.
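The summary output reports NA counts only for the numeric columns, so a per-column count of missing values can give a fuller picture before any weighting is attempted. The following is a minimal sketch on a made-up tibble; the values are hypothetical and only the column names are taken from the dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with gaps in numeric and character columns
toy <- tibble(
  Monthly_Inhand_Salary = c(1800, NA, NA, 2500),
  Num_Credit_Inquiries = c(4, 4, NA, 2),
  Occupation = c("Scientist", NA, "Teacher", "Teacher")
)

# Count missing values in every column, then reshape to one row per column
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_missing")

na_counts
```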

Visualization Plan

First, I will tidy the data, identify empty/NA fields, and try to populate them using various strategies. To visualize and analyze the problem, I would like to use a variety of plots that focus on how the target attribute is affected by the different factors, one such plot being a correlation plot. After that, I would like to try out different classification algorithms, such as decision trees and logistic regression, and assess their performance using ROC curves. I would also like to evaluate how the results vary with the strategy used to handle empty fields.
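As a rough illustration of the first two steps of this plan, the sketch below compares two possible ways of filling missing values (overall median versus per-customer median) and then computes a correlation matrix that could feed a correlation plot. It runs on a made-up tibble with hypothetical values, and the choice of median imputation is an assumption for illustration, not a settled decision.

```r
library(dplyr)

# Hypothetical toy data: two customers, one missing salary each
toy <- tibble(
  Customer_ID = c("CUS_a", "CUS_a", "CUS_a", "CUS_b", "CUS_b", "CUS_b"),
  Monthly_Inhand_Salary = c(1800, NA, 1800, 5200, 5000, NA),
  Delay_from_due_date = c(3, 5, 4, 20, 25, 22),
  Num_Credit_Inquiries = c(2, 2, 3, 8, 9, 9)
)

# Strategy 1 (assumed): fill gaps with the overall median
global_fill <- toy %>%
  mutate(Monthly_Inhand_Salary = coalesce(
    Monthly_Inhand_Salary, median(Monthly_Inhand_Salary, na.rm = TRUE)
  ))

# Strategy 2 (assumed): fill gaps with the median of the same customer
grouped_fill <- toy %>%
  group_by(Customer_ID) %>%
  mutate(Monthly_Inhand_Salary = coalesce(
    Monthly_Inhand_Salary, median(Monthly_Inhand_Salary, na.rm = TRUE)
  )) %>%
  ungroup()

# Pairwise correlations between the numeric attributes
cor_matrix <- grouped_fill %>%
  select(where(is.numeric)) %>%
  cor()

cor_matrix
```

Running the later modeling steps once on `global_fill` and once on `grouped_fill` would be one way to see how sensitive the results are to the imputation strategy.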