library(tidyverse)
library(readr)
library(mosaic)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Final Project Assignment#1: Pranav Bharadwaj Komaravolu
Introduction
The problem I chose to address in my final project is how credit scores are affected by various socioeconomic factors. Credit is the ability to borrow money against a promise of future repayment. It supports two main actors in the economy: (1) consumers and (2) banks, and in doing so helps improve the well-being of people and governments. A credit score is a rating issued by banks or rating agencies based on an individual's societal status, certain economic trends, and the individual's behavioral patterns. Banks extend credit based on this value, which in turn contributes to the well-being of consumers. As a new credit card user, I was curious about how the credit system works and which attributes have the greatest impact on an individual's credit score. Each row in the dataset contains attributes such as the customer's name, credit accounts, delays/defaults in loan repayment, and so on.
The questions I will be addressing through this project are:
- What are the parameters that impact the credit score/assessment the most and what are their correlations?
- Which model works best to classify individuals into groups of “Good”, “Standard” and “Low”?
Dataset
For this task I chose the “Credit score classification” dataset on Kaggle. Few datasets are available for this task, and of those this one seemed the most promising.
- reading the dataset:
data <- read_csv("601_Spring_2023_project/PranavKomaravolu_FinalProjectData/train.csv")
head(data)
# A tibble: 6 × 28
ID Custom…¹ Month Name Age SSN Occup…² Annua…³ Month…⁴ Num_B…⁵ Num_C…⁶
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1602 CUS_0xd… Janu… Aaro… 23 821-… Scient… 19114.… 1825. 3 4
2 1603 CUS_0xd… Febr… Aaro… 23 821-… Scient… 19114.… NA 3 4
3 1604 CUS_0xd… March Aaro… -500 821-… Scient… 19114.… NA 3 4
4 1605 CUS_0xd… April Aaro… 23 821-… Scient… 19114.… NA 3 4
5 1606 CUS_0xd… May Aaro… 23 821-… Scient… 19114.… 1825. 3 4
6 1607 CUS_0xd… June Aaro… 23 821-… Scient… 19114.… NA 3 4
# … with 17 more variables: Interest_Rate <dbl>, Num_of_Loan <chr>,
# Type_of_Loan <chr>, Delay_from_due_date <dbl>,
# Num_of_Delayed_Payment <chr>, Changed_Credit_Limit <chr>,
# Num_Credit_Inquiries <dbl>, Credit_Mix <chr>, Outstanding_Debt <chr>,
# Credit_Utilization_Ratio <dbl>, Credit_History_Age <chr>,
# Payment_of_Min_Amount <chr>, Total_EMI_per_month <dbl>,
# Amount_invested_monthly <chr>, Payment_Behaviour <chr>, …
- The dataset is large; its dimensions are as follows:
dim(data)
[1] 100000 28
There are 100,000 rows and 28 columns in the dataset. The head of the dataset above also shows some impurities and information gaps: for example, Age takes the value -500 in one row, and Monthly_Inhand_Salary is NA in several rows.
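One way to quantify these gaps is to count the missing values in every column. The sketch below uses a small made-up tibble named `toy` as a stand-in for the real data (its values are illustrative only, not taken from the dataset); the same `summarise(across(...))` call could be applied to `data`.

```r
library(dplyr)

# Toy stand-in for the real data; all values here are invented for illustration.
toy <- tibble::tibble(
  Customer_ID = c("CUS_1", "CUS_1", "CUS_2", "CUS_2"),
  Age = c("23", "-500", "31", "31"),
  Monthly_Inhand_Salary = c(1825, NA, NA, 2400)
)

# Count the NA values in each column (one row, one count per column).
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Note that this only catches explicit NA values. An implausible entry such as an Age of -500 is not counted and needs a separate range check.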
The different columns in the dataset are as follows:
names(data)
[1] "ID" "Customer_ID"
[3] "Month" "Name"
[5] "Age" "SSN"
[7] "Occupation" "Annual_Income"
[9] "Monthly_Inhand_Salary" "Num_Bank_Accounts"
[11] "Num_Credit_Card" "Interest_Rate"
[13] "Num_of_Loan" "Type_of_Loan"
[15] "Delay_from_due_date" "Num_of_Delayed_Payment"
[17] "Changed_Credit_Limit" "Num_Credit_Inquiries"
[19] "Credit_Mix" "Outstanding_Debt"
[21] "Credit_Utilization_Ratio" "Credit_History_Age"
[23] "Payment_of_Min_Amount" "Total_EMI_per_month"
[25] "Amount_invested_monthly" "Payment_Behaviour"
[27] "Monthly_Balance" "Credit_Score"
From the column list above, “Customer_ID” is an identifier assigned to each customer, so counting the distinct customer IDs gives the total number of customers in the sample. The same call also gives the number of distinct labels assigned to individuals as their credit score/evaluation.
The number of unique customers and labels can be obtained as follows:
data %>%
  select(Customer_ID, Credit_Score) %>%
  summarize_all(n_distinct)
# A tibble: 1 × 2
Customer_ID Credit_Score
<int> <int>
1 12500 3
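So there are 12,500 unique customers, and each record is assigned one of three labels for the Credit_Score. With 100,000 rows, that is an average of 8 rows per customer, consistent with the monthly records visible in the head of the dataset.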
The additional summary statistics for different columns are as follows:
summary(data)
ID Customer_ID Month Name
Length:100000 Length:100000 Length:100000 Length:100000
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Age SSN Occupation Annual_Income
Length:100000 Length:100000 Length:100000 Length:100000
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Monthly_Inhand_Salary Num_Bank_Accounts Num_Credit_Card Interest_Rate
Min. : 303.6 Min. : -1.00 Min. : 0.00 Min. : 1.00
1st Qu.: 1625.6 1st Qu.: 3.00 1st Qu.: 4.00 1st Qu.: 8.00
Median : 3093.7 Median : 6.00 Median : 5.00 Median : 13.00
Mean : 4194.2 Mean : 17.09 Mean : 22.47 Mean : 72.47
3rd Qu.: 5957.4 3rd Qu.: 7.00 3rd Qu.: 7.00 3rd Qu.: 20.00
Max. :15204.6 Max. :1798.00 Max. :1499.00 Max. :5797.00
NA's :15002
Num_of_Loan Type_of_Loan Delay_from_due_date
Length:100000 Length:100000 Min. :-5.00
Class :character Class :character 1st Qu.:10.00
Mode :character Mode :character Median :18.00
Mean :21.07
3rd Qu.:28.00
Max. :67.00
Num_of_Delayed_Payment Changed_Credit_Limit Num_Credit_Inquiries
Length:100000 Length:100000 Min. : 0.00
Class :character Class :character 1st Qu.: 3.00
Mode :character Mode :character Median : 6.00
Mean : 27.75
3rd Qu.: 9.00
Max. :2597.00
NA's :1965
Credit_Mix Outstanding_Debt Credit_Utilization_Ratio
Length:100000 Length:100000 Min. :20.00
Class :character Class :character 1st Qu.:28.05
Mode :character Mode :character Median :32.31
Mean :32.29
3rd Qu.:36.50
Max. :50.00
Credit_History_Age Payment_of_Min_Amount Total_EMI_per_month
Length:100000 Length:100000 Min. : 0.00
Class :character Class :character 1st Qu.: 30.31
Mode :character Mode :character Median : 69.25
Mean : 1403.12
3rd Qu.: 161.22
Max. :82331.00
Amount_invested_monthly Payment_Behaviour Monthly_Balance
Length:100000 Length:100000 Min. : 0.0078
Class :character Class :character 1st Qu.: 270.1066
Mode :character Mode :character Median : 336.7312
Mean : 402.5513
3rd Qu.: 470.2629
Max. :1602.0405
NA's :1209
Credit_Score
Length:100000
Class :character
Mode :character
Most of these columns describe a parameter that may affect the target attribute “Credit_Score”. The question I am trying to address is how each of them affects the credit score and what weight each attribute could be assigned during credit evaluation.
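The summary also shows that several columns that should be numeric, such as Age and Annual_Income, were read as character, and that some numeric columns contain implausible values (Num_Bank_Accounts ranges from -1 to 1798). A minimal cleaning sketch is given below. It assumes the character columns contain stray non-numeric characters (the `"28_"` entry is a hypothetical example), and the plausibility limits of 120 for Age and 50 for Num_Bank_Accounts are arbitrary assumptions, not values from the dataset documentation.

```r
library(dplyr)
library(readr)

# Toy stand-in for the real data; all values here are invented for illustration.
toy <- tibble::tibble(
  Age = c("23", "-500", "28_", "34"),
  Num_Bank_Accounts = c(3, -1, 1798, 6)
)

clean <- toy %>%
  mutate(
    # parse_number() extracts the number and drops surrounding non-numeric characters.
    Age = parse_number(Age),
    # Replace values outside an assumed plausible range with NA.
    Age = if_else(Age < 0 | Age > 120, NA_real_, Age),
    Num_Bank_Accounts = if_else(
      Num_Bank_Accounts < 0 | Num_Bank_Accounts > 50,
      NA_real_,
      Num_Bank_Accounts
    )
  )

clean
```

Turning implausible values into NA keeps them visible to whichever missing-value strategy is applied later, rather than silently dropping the rows.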
Visualization Plan
First, I will tidy the data, identify empty/NA fields, and try to populate them using various strategies. To visualize and analyze the problem, I would like to use a variety of plots that show how the target attribute is affected by the different factors, one of them being a correlation plot. After that, I would like to try different classification algorithms, such as decision trees and logistic regression, assess their performance using ROC curves, and evaluate how the results vary with the strategy used to handle empty fields.
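As a rough sketch of the first two steps of this plan, the code below fills missing salaries with each customer's own median and then computes a Spearman correlation matrix that could feed a correlation plot. It is only one possible strategy, not a final method. The tibble `toy` is made up for illustration, the label names follow the three groups named in the questions above (the actual values should be checked with `unique(data$Credit_Score)`), and the 1 to 3 ordinal encoding of the labels is an assumption.

```r
library(dplyr)

# Toy stand-in for the real data; all values here are invented for illustration.
toy <- tibble::tibble(
  Customer_ID = c("CUS_1", "CUS_1", "CUS_1", "CUS_2", "CUS_2", "CUS_2"),
  Monthly_Inhand_Salary = c(1825, NA, 1825, 4000, 4200, NA),
  Delay_from_due_date = c(3, 5, 4, 20, 25, 30),
  Credit_Score = c("Good", "Good", "Standard", "Standard", "Low", "Low")
)

# Strategy 1: fill a missing salary with the median of the same customer's other rows.
imputed <- toy %>%
  group_by(Customer_ID) %>%
  mutate(
    Monthly_Inhand_Salary = if_else(
      is.na(Monthly_Inhand_Salary),
      median(Monthly_Inhand_Salary, na.rm = TRUE),
      Monthly_Inhand_Salary
    )
  ) %>%
  ungroup()

# Encode the label as an ordered integer (assumed order: Low < Standard < Good).
scored <- imputed %>%
  mutate(
    Score_Num = as.integer(factor(Credit_Score, levels = c("Low", "Standard", "Good")))
  )

# Rank-based correlation between the numeric attributes and the encoded label.
cor_mat <- scored %>%
  select(Monthly_Inhand_Salary, Delay_from_due_date, Score_Num) %>%
  cor(method = "spearman")

cor_mat
```

Spearman correlation is used here because the encoded label is ordinal rather than continuous. Swapping the median for a different fill rule and recomputing `cor_mat` is one simple way to see how sensitive the results are to the empty-field handling strategy.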