Ayushe’s HW3 Submission for DACSS 601
The dataset is loaded using the read_csv(). It is found that the dataset has 10000 rows and 55 columns, that is, 55 features. Given below is a glimpse of the dataset. The dataset has two types of values in it, namely, character type and double class.
data <- read_csv("/Users/ayushe/RStudio stuff/RData/loans_full_schema.csv")
The first 10 rows of the dataset are given below:
emp_title | emp_length | state | homeownership | annual_income | verified_income | debt_to_income | annual_income_joint | verification_income_joint | debt_to_income_joint | delinq_2y | months_since_last_delinq | earliest_credit_line | inquiries_last_12m | total_credit_lines | open_credit_lines | total_credit_limit | total_credit_utilized | num_collections_last_12m | num_historical_failed_to_pay | months_since_90d_late | current_accounts_delinq | total_collection_amount_ever | current_installment_accounts | accounts_opened_24m | months_since_last_credit_inquiry | num_satisfactory_accounts | num_accounts_120d_past_due | num_accounts_30d_past_due | num_active_debit_accounts | total_debit_limit | num_total_cc_accounts | num_open_cc_accounts | num_cc_carrying_balance | num_mort_accounts | account_never_delinq_percent | tax_liens | public_record_bankrupt | loan_purpose | application_type | loan_amount | term | interest_rate | installment | grade | sub_grade | issue_month | loan_status | initial_listing_status | disbursement_method | balance | paid_total | paid_principal | paid_interest | paid_late_fees |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
global config engineer | 3 | NJ | MORTGAGE | 90000 | Verified | 18.01 | NA | NA | NA | 0 | 38 | 2001 | 6 | 28 | 10 | 70795 | 38767 | 0 | 0 | 38 | 0 | 1250 | 2 | 5 | 5 | 10 | 0 | 0 | 2 | 11100 | 14 | 8 | 6 | 1 | 92.9 | 0 | 0 | moving | individual | 28000 | 60 | 14.07 | 652.53 | C | C3 | Mar-2018 | Current | whole | Cash | 27015.86 | 1999.33 | 984.14 | 1015.19 | 0 |
warehouse office clerk | 10 | HI | RENT | 40000 | Not Verified | 5.04 | NA | NA | NA | 0 | NA | 1996 | 1 | 30 | 14 | 28800 | 4321 | 0 | 1 | NA | 0 | 0 | 0 | 11 | 8 | 14 | 0 | 0 | 3 | 16500 | 24 | 14 | 4 | 0 | 100.0 | 0 | 1 | debt_consolidation | individual | 5000 | 36 | 12.61 | 167.54 | C | C1 | Feb-2018 | Current | whole | Cash | 4651.37 | 499.12 | 348.63 | 150.49 | 0 |
assembly | 3 | WI | RENT | 40000 | Source Verified | 21.15 | NA | NA | NA | 0 | 28 | 2006 | 4 | 31 | 10 | 24193 | 16000 | 0 | 0 | 28 | 0 | 432 | 1 | 13 | 7 | 10 | 0 | 0 | 3 | 4300 | 14 | 8 | 6 | 0 | 93.5 | 0 | 0 | other | individual | 2000 | 36 | 17.09 | 71.40 | D | D1 | Feb-2018 | Current | fractional | Cash | 1824.63 | 281.80 | 175.37 | 106.43 | 0 |
customer service | 1 | PA | RENT | 30000 | Not Verified | 10.16 | NA | NA | NA | 0 | NA | 2007 | 0 | 4 | 4 | 25400 | 4997 | 0 | 1 | NA | 0 | 0 | 1 | 1 | 15 | 4 | 0 | 0 | 2 | 19400 | 3 | 3 | 2 | 0 | 100.0 | 1 | 0 | debt_consolidation | individual | 21600 | 36 | 6.72 | 664.19 | A | A3 | Jan-2018 | Current | whole | Cash | 18853.26 | 3312.89 | 2746.74 | 566.15 | 0 |
security supervisor | 10 | CA | RENT | 35000 | Verified | 57.96 | 57000 | Verified | 37.66 | 0 | NA | 2008 | 7 | 22 | 16 | 69839 | 52722 | 0 | 0 | NA | 0 | 0 | 1 | 6 | 4 | 16 | 0 | 0 | 10 | 32700 | 20 | 15 | 13 | 0 | 100.0 | 0 | 0 | credit_card | joint | 23000 | 36 | 14.07 | 786.87 | C | C3 | Mar-2018 | Current | whole | Cash | 21430.15 | 2324.65 | 1569.85 | 754.80 | 0 |
NA | NA | KY | OWN | 34000 | Not Verified | 6.46 | NA | NA | NA | 1 | 3 | 1990 | 6 | 32 | 12 | 42100 | 3898 | 0 | 0 | 60 | 0 | 0 | 0 | 2 | 5 | 12 | 0 | 0 | 1 | 27200 | 27 | 12 | 5 | 3 | 78.1 | 0 | 0 | other | individual | 5000 | 36 | 6.72 | 153.75 | A | A3 | Jan-2018 | Current | whole | Cash | 4256.71 | 873.13 | 743.29 | 129.84 | 0 |
hr | 10 | MI | MORTGAGE | 35000 | Source Verified | 23.66 | 155000 | Not Verified | 13.12 | 0 | NA | 2004 | 1 | 12 | 10 | 291852 | 18916 | 0 | 0 | NA | 0 | 0 | 2 | 1 | 9 | 10 | 0 | 0 | 3 | 9100 | 8 | 7 | 6 | 2 | 100.0 | 0 | 0 | credit_card | joint | 24000 | 60 | 13.59 | 553.35 | C | C2 | Jan-2018 | Current | whole | Cash | 22560.00 | 2730.51 | 1440.00 | 1290.51 | 0 |
police | 10 | AZ | MORTGAGE | 110000 | Source Verified | 16.19 | NA | NA | NA | 1 | 19 | 2005 | 1 | 30 | 15 | 342336 | 60805 | 0 | 0 | 71 | 0 | 0 | 2 | 4 | 7 | 15 | NA | 0 | 5 | 22250 | 16 | 12 | 10 | 7 | 93.0 | 0 | 0 | debt_consolidation | individual | 20000 | 60 | 11.99 | 444.79 | B | B5 | Jan-2018 | Current | whole | Cash | 19005.39 | 1765.84 | 994.61 | 771.23 | 0 |
parts | 10 | NV | MORTGAGE | 65000 | Source Verified | 36.48 | NA | NA | NA | 1 | 18 | 1998 | 3 | 35 | 21 | 283190 | 69561 | 0 | 0 | 18 | 0 | 0 | 6 | 10 | 4 | 21 | 0 | 0 | 11 | 36700 | 19 | 14 | 14 | 2 | 97.1 | 0 | 0 | home_improvement | individual | 20000 | 36 | 13.59 | 679.58 | C | C2 | Feb-2018 | Current | whole | Cash | 18156.66 | 2703.22 | 1843.34 | 859.88 | 0 |
4th person | 3 | IL | RENT | 30000 | Not Verified | 18.91 | NA | NA | NA | 0 | NA | 2001 | 0 | 9 | 6 | 33114 | 17768 | 0 | 0 | NA | 0 | 0 | 1 | 5 | 17 | 6 | 0 | 0 | 3 | 14700 | 7 | 5 | 3 | 0 | 100.0 | 0 | 0 | credit_card | individual | 6400 | 36 | 6.71 | 196.77 | A | A3 | Mar-2018 | Current | fractional | Cash | 6077.13 | 391.15 | 322.87 | 68.28 | 0 |
glimpse(data)
Rows: 10,000
Columns: 55
$ emp_title <chr> "global config engineer", "…
$ emp_length <dbl> 3, 10, 3, 1, 10, NA, 10, 10…
$ state <chr> "NJ", "HI", "WI", "PA", "CA…
$ homeownership <chr> "MORTGAGE", "RENT", "RENT",…
$ annual_income <dbl> 90000, 40000, 40000, 30000,…
$ verified_income <chr> "Verified", "Not Verified",…
$ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, …
$ annual_income_joint <dbl> NA, NA, NA, NA, 57000, NA, …
$ verification_income_joint <chr> NA, NA, NA, NA, "Verified",…
$ debt_to_income_joint <dbl> NA, NA, NA, NA, 37.66, NA, …
$ delinq_2y <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 1, …
$ months_since_last_delinq <dbl> 38, NA, 28, NA, NA, 3, NA, …
$ earliest_credit_line <dbl> 2001, 1996, 2006, 2007, 200…
$ inquiries_last_12m <dbl> 6, 1, 4, 0, 7, 6, 1, 1, 3, …
$ total_credit_lines <dbl> 28, 30, 31, 4, 22, 32, 12, …
$ open_credit_lines <dbl> 10, 14, 10, 4, 16, 12, 10, …
$ total_credit_limit <dbl> 70795, 28800, 24193, 25400,…
$ total_credit_utilized <dbl> 38767, 4321, 16000, 4997, 5…
$ num_collections_last_12m <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ num_historical_failed_to_pay <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, …
$ months_since_90d_late <dbl> 38, NA, 28, NA, NA, 60, NA,…
$ current_accounts_delinq <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_collection_amount_ever <dbl> 1250, 0, 432, 0, 0, 0, 0, 0…
$ current_installment_accounts <dbl> 2, 0, 1, 1, 1, 0, 2, 2, 6, …
$ accounts_opened_24m <dbl> 5, 11, 13, 1, 6, 2, 1, 4, 1…
$ months_since_last_credit_inquiry <dbl> 5, 8, 7, 15, 4, 5, 9, 7, 4,…
$ num_satisfactory_accounts <dbl> 10, 14, 10, 4, 16, 12, 10, …
$ num_accounts_120d_past_due <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0,…
$ num_accounts_30d_past_due <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ num_active_debit_accounts <dbl> 2, 3, 3, 2, 10, 1, 3, 5, 11…
$ total_debit_limit <dbl> 11100, 16500, 4300, 19400, …
$ num_total_cc_accounts <dbl> 14, 24, 14, 3, 20, 27, 8, 1…
$ num_open_cc_accounts <dbl> 8, 14, 8, 3, 15, 12, 7, 12,…
$ num_cc_carrying_balance <dbl> 6, 4, 6, 2, 13, 5, 6, 10, 1…
$ num_mort_accounts <dbl> 1, 0, 0, 0, 0, 3, 2, 7, 2, …
$ account_never_delinq_percent <dbl> 92.9, 100.0, 93.5, 100.0, 1…
$ tax_liens <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ public_record_bankrupt <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ loan_purpose <chr> "moving", "debt_consolidati…
$ application_type <chr> "individual", "individual",…
$ loan_amount <dbl> 28000, 5000, 2000, 21600, 2…
$ term <dbl> 60, 36, 36, 36, 36, 36, 60,…
$ interest_rate <dbl> 14.07, 12.61, 17.09, 6.72, …
$ installment <dbl> 652.53, 167.54, 71.40, 664.…
$ grade <chr> "C", "C", "D", "A", "C", "A…
$ sub_grade <chr> "C3", "C1", "D1", "A3", "C3…
$ issue_month <chr> "Mar-2018", "Feb-2018", "Fe…
$ loan_status <chr> "Current", "Current", "Curr…
$ initial_listing_status <chr> "whole", "whole", "fraction…
$ disbursement_method <chr> "Cash", "Cash", "Cash", "Ca…
$ balance <dbl> 27015.86, 4651.37, 1824.63,…
$ paid_total <dbl> 1999.330, 499.120, 281.800,…
$ paid_principal <dbl> 984.14, 348.63, 175.37, 274…
$ paid_interest <dbl> 1015.19, 150.49, 106.43, 56…
$ paid_late_fees <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
There are a few problems with the dataset: 1. It has missing values, which can be counted column-wise as follows:
emp_title emp_length
833 817
state homeownership
0 0
annual_income verified_income
0 0
debt_to_income annual_income_joint
24 8505
verification_income_joint debt_to_income_joint
8545 8505
delinq_2y months_since_last_delinq
0 5658
earliest_credit_line inquiries_last_12m
0 0
total_credit_lines open_credit_lines
0 0
total_credit_limit total_credit_utilized
0 0
num_collections_last_12m num_historical_failed_to_pay
0 0
months_since_90d_late current_accounts_delinq
7715 0
total_collection_amount_ever current_installment_accounts
0 0
accounts_opened_24m months_since_last_credit_inquiry
0 1271
num_satisfactory_accounts num_accounts_120d_past_due
0 318
num_accounts_30d_past_due num_active_debit_accounts
0 0
total_debit_limit num_total_cc_accounts
0 0
num_open_cc_accounts num_cc_carrying_balance
0 0
num_mort_accounts account_never_delinq_percent
0 0
tax_liens public_record_bankrupt
0 0
loan_purpose application_type
0 0
loan_amount term
0 0
interest_rate installment
0 0
grade sub_grade
0 0
issue_month loan_status
0 0
initial_listing_status disbursement_method
0 0
balance paid_total
0 0
paid_principal paid_interest
0 0
paid_late_fees
0
We find that the column emp_title has 833 missing values, emp_length has 817 missing values, debt_to_income, annual_income_joint, verification_income_joint have 24, 8505 and 8545 missing values respectively, and so on. We first drop all the rows with missing values in the emp_title.
emp_title emp_length
0 0
state homeownership
0 0
annual_income verified_income
0 0
debt_to_income annual_income_joint
1 7875
verification_income_joint debt_to_income_joint
7909 7875
delinq_2y months_since_last_delinq
0 5183
earliest_credit_line inquiries_last_12m
0 0
total_credit_lines open_credit_lines
0 0
total_credit_limit total_credit_utilized
0 0
num_collections_last_12m num_historical_failed_to_pay
0 0
months_since_90d_late current_accounts_delinq
7090 0
total_collection_amount_ever current_installment_accounts
0 0
accounts_opened_24m months_since_last_credit_inquiry
0 1122
num_satisfactory_accounts num_accounts_120d_past_due
0 295
num_accounts_30d_past_due num_active_debit_accounts
0 0
total_debit_limit num_total_cc_accounts
0 0
num_open_cc_accounts num_cc_carrying_balance
0 0
num_mort_accounts account_never_delinq_percent
0 0
tax_liens public_record_bankrupt
0 0
loan_purpose application_type
0 0
loan_amount term
0 0
interest_rate installment
0 0
grade sub_grade
0 0
issue_month loan_status
0 0
initial_listing_status disbursement_method
0 0
balance paid_total
0 0
paid_principal paid_interest
0 0
paid_late_fees
0
We also change the column name of “emp_title” as “Job” and “emp_length” as “employment_len”
We replace the missing values in the rest of the columns with character class type as “Not Applicable” and for double class type as -1
data <- replace_na(data, list(debt_to_income_joint = -1))
data <- replace_na(data, list(annual_income_joint = -1))
data <- replace_na(data, list(verification_income_joint = "Not Applicable"))
data <- replace_na(data, list(months_since_last_delinq = -1))
data <- replace_na(data, list(months_since_90d_late = -1))
data <- replace_na(data, list(months_since_last_credit_inquiry = -1))
data <- replace_na(data, list(num_accounts_120d_past_due = -1))
We now take a look at the dataset, which has zero missing values
job employment_len
0 0
state homeownership
0 0
annual_income verified_income
0 0
debt_to_income annual_income_joint
0 0
verification_income_joint debt_to_income_joint
0 0
delinq_2y months_since_last_delinq
0 0
earliest_credit_line inquiries_last_12m
0 0
total_credit_lines open_credit_lines
0 0
total_credit_limit total_credit_utilized
0 0
num_collections_last_12m num_historical_failed_to_pay
0 0
months_since_90d_late current_accounts_delinq
0 0
total_collection_amount_ever current_installment_accounts
0 0
accounts_opened_24m months_since_last_credit_inquiry
0 0
num_satisfactory_accounts num_accounts_120d_past_due
0 0
num_accounts_30d_past_due num_active_debit_accounts
0 0
total_debit_limit num_total_cc_accounts
0 0
num_open_cc_accounts num_cc_carrying_balance
0 0
num_mort_accounts account_never_delinq_percent
0 0
tax_liens public_record_bankrupt
0 0
loan_purpose application_type
0 0
loan_amount term
0 0
interest_rate installment
0 0
grade sub_grade
0 0
issue_month loan_status
0 0
initial_listing_status disbursement_method
0 0
balance paid_total
0 0
paid_principal paid_interest
0 0
paid_late_fees
0
The dataset has been successfully cleaned now.
The research questions which I intend to answer:
Q1: Explore the relationship between the grade of Loan and interest rate.
Q2: Explore the relationship between the purpose of Loan and the amount of loan on two given terms of loan.
Q3: Does people of one state take more loan than the people of other state?
Q4: How does loan amount vary on the basis of annual income of the borrower?
Q5: How does loan amount vary on the basis of the purpose of loan and the type of application?
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gangal (2022, April 27). Data Analytics and Computational Social Science: HW 3 DACSS 601. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomayushe17892927/
BibTeX citation
@misc{gangal2022hw, author = {Gangal, Ayushe}, title = {Data Analytics and Computational Social Science: HW 3 DACSS 601}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomayushe17892927/}, year = {2022} }