HW 3 DACSS 601

Ayushe’s HW3 Submission for DACSS 601

Ayushe Gangal
2022-04-21

Loan Data for Lending Club

The dataset is loaded using the read_csv(). It is found that the dataset has 10000 rows and 55 columns, that is, 55 features. Given below is a glimpse of the dataset. The dataset has two types of values in it, namely, character type and double class.

data <- read_csv("/Users/ayushe/RStudio stuff/RData/loans_full_schema.csv")

The first 10 rows of the dataset are given below:

kable(head(data, 10))
emp_title emp_length state homeownership annual_income verified_income debt_to_income annual_income_joint verification_income_joint debt_to_income_joint delinq_2y months_since_last_delinq earliest_credit_line inquiries_last_12m total_credit_lines open_credit_lines total_credit_limit total_credit_utilized num_collections_last_12m num_historical_failed_to_pay months_since_90d_late current_accounts_delinq total_collection_amount_ever current_installment_accounts accounts_opened_24m months_since_last_credit_inquiry num_satisfactory_accounts num_accounts_120d_past_due num_accounts_30d_past_due num_active_debit_accounts total_debit_limit num_total_cc_accounts num_open_cc_accounts num_cc_carrying_balance num_mort_accounts account_never_delinq_percent tax_liens public_record_bankrupt loan_purpose application_type loan_amount term interest_rate installment grade sub_grade issue_month loan_status initial_listing_status disbursement_method balance paid_total paid_principal paid_interest paid_late_fees
global config engineer 3 NJ MORTGAGE 90000 Verified 18.01 NA NA NA 0 38 2001 6 28 10 70795 38767 0 0 38 0 1250 2 5 5 10 0 0 2 11100 14 8 6 1 92.9 0 0 moving individual 28000 60 14.07 652.53 C C3 Mar-2018 Current whole Cash 27015.86 1999.33 984.14 1015.19 0
warehouse office clerk 10 HI RENT 40000 Not Verified 5.04 NA NA NA 0 NA 1996 1 30 14 28800 4321 0 1 NA 0 0 0 11 8 14 0 0 3 16500 24 14 4 0 100.0 0 1 debt_consolidation individual 5000 36 12.61 167.54 C C1 Feb-2018 Current whole Cash 4651.37 499.12 348.63 150.49 0
assembly 3 WI RENT 40000 Source Verified 21.15 NA NA NA 0 28 2006 4 31 10 24193 16000 0 0 28 0 432 1 13 7 10 0 0 3 4300 14 8 6 0 93.5 0 0 other individual 2000 36 17.09 71.40 D D1 Feb-2018 Current fractional Cash 1824.63 281.80 175.37 106.43 0
customer service 1 PA RENT 30000 Not Verified 10.16 NA NA NA 0 NA 2007 0 4 4 25400 4997 0 1 NA 0 0 1 1 15 4 0 0 2 19400 3 3 2 0 100.0 1 0 debt_consolidation individual 21600 36 6.72 664.19 A A3 Jan-2018 Current whole Cash 18853.26 3312.89 2746.74 566.15 0
security supervisor 10 CA RENT 35000 Verified 57.96 57000 Verified 37.66 0 NA 2008 7 22 16 69839 52722 0 0 NA 0 0 1 6 4 16 0 0 10 32700 20 15 13 0 100.0 0 0 credit_card joint 23000 36 14.07 786.87 C C3 Mar-2018 Current whole Cash 21430.15 2324.65 1569.85 754.80 0
NA NA KY OWN 34000 Not Verified 6.46 NA NA NA 1 3 1990 6 32 12 42100 3898 0 0 60 0 0 0 2 5 12 0 0 1 27200 27 12 5 3 78.1 0 0 other individual 5000 36 6.72 153.75 A A3 Jan-2018 Current whole Cash 4256.71 873.13 743.29 129.84 0
hr 10 MI MORTGAGE 35000 Source Verified 23.66 155000 Not Verified 13.12 0 NA 2004 1 12 10 291852 18916 0 0 NA 0 0 2 1 9 10 0 0 3 9100 8 7 6 2 100.0 0 0 credit_card joint 24000 60 13.59 553.35 C C2 Jan-2018 Current whole Cash 22560.00 2730.51 1440.00 1290.51 0
police 10 AZ MORTGAGE 110000 Source Verified 16.19 NA NA NA 1 19 2005 1 30 15 342336 60805 0 0 71 0 0 2 4 7 15 NA 0 5 22250 16 12 10 7 93.0 0 0 debt_consolidation individual 20000 60 11.99 444.79 B B5 Jan-2018 Current whole Cash 19005.39 1765.84 994.61 771.23 0
parts 10 NV MORTGAGE 65000 Source Verified 36.48 NA NA NA 1 18 1998 3 35 21 283190 69561 0 0 18 0 0 6 10 4 21 0 0 11 36700 19 14 14 2 97.1 0 0 home_improvement individual 20000 36 13.59 679.58 C C2 Feb-2018 Current whole Cash 18156.66 2703.22 1843.34 859.88 0
4th person 3 IL RENT 30000 Not Verified 18.91 NA NA NA 0 NA 2001 0 9 6 33114 17768 0 0 NA 0 0 1 5 17 6 0 0 3 14700 7 5 3 0 100.0 0 0 credit_card individual 6400 36 6.71 196.77 A A3 Mar-2018 Current fractional Cash 6077.13 391.15 322.87 68.28 0
A more detailed description of the dataset is given below:
glimpse(data)
Rows: 10,000
Columns: 55
$ emp_title                        <chr> "global config engineer", "…
$ emp_length                       <dbl> 3, 10, 3, 1, 10, NA, 10, 10…
$ state                            <chr> "NJ", "HI", "WI", "PA", "CA…
$ homeownership                    <chr> "MORTGAGE", "RENT", "RENT",…
$ annual_income                    <dbl> 90000, 40000, 40000, 30000,…
$ verified_income                  <chr> "Verified", "Not Verified",…
$ debt_to_income                   <dbl> 18.01, 5.04, 21.15, 10.16, …
$ annual_income_joint              <dbl> NA, NA, NA, NA, 57000, NA, …
$ verification_income_joint        <chr> NA, NA, NA, NA, "Verified",…
$ debt_to_income_joint             <dbl> NA, NA, NA, NA, 37.66, NA, …
$ delinq_2y                        <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 1, …
$ months_since_last_delinq         <dbl> 38, NA, 28, NA, NA, 3, NA, …
$ earliest_credit_line             <dbl> 2001, 1996, 2006, 2007, 200…
$ inquiries_last_12m               <dbl> 6, 1, 4, 0, 7, 6, 1, 1, 3, …
$ total_credit_lines               <dbl> 28, 30, 31, 4, 22, 32, 12, …
$ open_credit_lines                <dbl> 10, 14, 10, 4, 16, 12, 10, …
$ total_credit_limit               <dbl> 70795, 28800, 24193, 25400,…
$ total_credit_utilized            <dbl> 38767, 4321, 16000, 4997, 5…
$ num_collections_last_12m         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ num_historical_failed_to_pay     <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, …
$ months_since_90d_late            <dbl> 38, NA, 28, NA, NA, 60, NA,…
$ current_accounts_delinq          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_collection_amount_ever     <dbl> 1250, 0, 432, 0, 0, 0, 0, 0…
$ current_installment_accounts     <dbl> 2, 0, 1, 1, 1, 0, 2, 2, 6, …
$ accounts_opened_24m              <dbl> 5, 11, 13, 1, 6, 2, 1, 4, 1…
$ months_since_last_credit_inquiry <dbl> 5, 8, 7, 15, 4, 5, 9, 7, 4,…
$ num_satisfactory_accounts        <dbl> 10, 14, 10, 4, 16, 12, 10, …
$ num_accounts_120d_past_due       <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0,…
$ num_accounts_30d_past_due        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ num_active_debit_accounts        <dbl> 2, 3, 3, 2, 10, 1, 3, 5, 11…
$ total_debit_limit                <dbl> 11100, 16500, 4300, 19400, …
$ num_total_cc_accounts            <dbl> 14, 24, 14, 3, 20, 27, 8, 1…
$ num_open_cc_accounts             <dbl> 8, 14, 8, 3, 15, 12, 7, 12,…
$ num_cc_carrying_balance          <dbl> 6, 4, 6, 2, 13, 5, 6, 10, 1…
$ num_mort_accounts                <dbl> 1, 0, 0, 0, 0, 3, 2, 7, 2, …
$ account_never_delinq_percent     <dbl> 92.9, 100.0, 93.5, 100.0, 1…
$ tax_liens                        <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ public_record_bankrupt           <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ loan_purpose                     <chr> "moving", "debt_consolidati…
$ application_type                 <chr> "individual", "individual",…
$ loan_amount                      <dbl> 28000, 5000, 2000, 21600, 2…
$ term                             <dbl> 60, 36, 36, 36, 36, 36, 60,…
$ interest_rate                    <dbl> 14.07, 12.61, 17.09, 6.72, …
$ installment                      <dbl> 652.53, 167.54, 71.40, 664.…
$ grade                            <chr> "C", "C", "D", "A", "C", "A…
$ sub_grade                        <chr> "C3", "C1", "D1", "A3", "C3…
$ issue_month                      <chr> "Mar-2018", "Feb-2018", "Fe…
$ loan_status                      <chr> "Current", "Current", "Curr…
$ initial_listing_status           <chr> "whole", "whole", "fraction…
$ disbursement_method              <chr> "Cash", "Cash", "Cash", "Ca…
$ balance                          <dbl> 27015.86, 4651.37, 1824.63,…
$ paid_total                       <dbl> 1999.330, 499.120, 281.800,…
$ paid_principal                   <dbl> 984.14, 348.63, 175.37, 274…
$ paid_interest                    <dbl> 1015.19, 150.49, 106.43, 56…
$ paid_late_fees                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …

There are a few problems with the dataset: 1. It has missing values, which can be counted column-wise as follows:

sapply(data, function(x) sum(is.na(x)))
                       emp_title                       emp_length 
                             833                              817 
                           state                    homeownership 
                               0                                0 
                   annual_income                  verified_income 
                               0                                0 
                  debt_to_income              annual_income_joint 
                              24                             8505 
       verification_income_joint             debt_to_income_joint 
                            8545                             8505 
                       delinq_2y         months_since_last_delinq 
                               0                             5658 
            earliest_credit_line               inquiries_last_12m 
                               0                                0 
              total_credit_lines                open_credit_lines 
                               0                                0 
              total_credit_limit            total_credit_utilized 
                               0                                0 
        num_collections_last_12m     num_historical_failed_to_pay 
                               0                                0 
           months_since_90d_late          current_accounts_delinq 
                            7715                                0 
    total_collection_amount_ever     current_installment_accounts 
                               0                                0 
             accounts_opened_24m months_since_last_credit_inquiry 
                               0                             1271 
       num_satisfactory_accounts       num_accounts_120d_past_due 
                               0                              318 
       num_accounts_30d_past_due        num_active_debit_accounts 
                               0                                0 
               total_debit_limit            num_total_cc_accounts 
                               0                                0 
            num_open_cc_accounts          num_cc_carrying_balance 
                               0                                0 
               num_mort_accounts     account_never_delinq_percent 
                               0                                0 
                       tax_liens           public_record_bankrupt 
                               0                                0 
                    loan_purpose                 application_type 
                               0                                0 
                     loan_amount                             term 
                               0                                0 
                   interest_rate                      installment 
                               0                                0 
                           grade                        sub_grade 
                               0                                0 
                     issue_month                      loan_status 
                               0                                0 
          initial_listing_status              disbursement_method 
                               0                                0 
                         balance                       paid_total 
                               0                                0 
                  paid_principal                    paid_interest 
                               0                                0 
                  paid_late_fees 
                               0 

We find that the column emp_title has 833 missing values, emp_length has 817 missing values, debt_to_income, annual_income_joint, verification_income_joint have 24, 8505 and 8545 missing values respectively, and so on. We first drop all the rows with missing values in the emp_title.

data <- drop_na(data, emp_title)
sapply(data, function(x) sum(is.na(x)))
                       emp_title                       emp_length 
                               0                                0 
                           state                    homeownership 
                               0                                0 
                   annual_income                  verified_income 
                               0                                0 
                  debt_to_income              annual_income_joint 
                               1                             7875 
       verification_income_joint             debt_to_income_joint 
                            7909                             7875 
                       delinq_2y         months_since_last_delinq 
                               0                             5183 
            earliest_credit_line               inquiries_last_12m 
                               0                                0 
              total_credit_lines                open_credit_lines 
                               0                                0 
              total_credit_limit            total_credit_utilized 
                               0                                0 
        num_collections_last_12m     num_historical_failed_to_pay 
                               0                                0 
           months_since_90d_late          current_accounts_delinq 
                            7090                                0 
    total_collection_amount_ever     current_installment_accounts 
                               0                                0 
             accounts_opened_24m months_since_last_credit_inquiry 
                               0                             1122 
       num_satisfactory_accounts       num_accounts_120d_past_due 
                               0                              295 
       num_accounts_30d_past_due        num_active_debit_accounts 
                               0                                0 
               total_debit_limit            num_total_cc_accounts 
                               0                                0 
            num_open_cc_accounts          num_cc_carrying_balance 
                               0                                0 
               num_mort_accounts     account_never_delinq_percent 
                               0                                0 
                       tax_liens           public_record_bankrupt 
                               0                                0 
                    loan_purpose                 application_type 
                               0                                0 
                     loan_amount                             term 
                               0                                0 
                   interest_rate                      installment 
                               0                                0 
                           grade                        sub_grade 
                               0                                0 
                     issue_month                      loan_status 
                               0                                0 
          initial_listing_status              disbursement_method 
                               0                                0 
                         balance                       paid_total 
                               0                                0 
                  paid_principal                    paid_interest 
                               0                                0 
                  paid_late_fees 
                               0 

We also change the column name of “emp_title” as “Job” and “emp_length” as “employment_len”

data <- data %>%
  rename(job = emp_title) %>%
  rename(employment_len = emp_length)

We replace the missing values in the rest of the columns with character class type as “Not Applicable” and for double class type as -1

data <- replace_na(data, list(debt_to_income_joint = -1))
data <- replace_na(data, list(annual_income_joint = -1))
data <- replace_na(data, list(verification_income_joint = "Not Applicable"))
data <- replace_na(data, list(months_since_last_delinq = -1))
data <- replace_na(data, list(months_since_90d_late = -1))
data <- replace_na(data, list(months_since_last_credit_inquiry = -1))
data <- replace_na(data, list(num_accounts_120d_past_due = -1))

We now take a look at the dataset, which has zero missing values

data <- drop_na(data, debt_to_income)
sapply(data, function(x) sum(is.na(x)))
                             job                   employment_len 
                               0                                0 
                           state                    homeownership 
                               0                                0 
                   annual_income                  verified_income 
                               0                                0 
                  debt_to_income              annual_income_joint 
                               0                                0 
       verification_income_joint             debt_to_income_joint 
                               0                                0 
                       delinq_2y         months_since_last_delinq 
                               0                                0 
            earliest_credit_line               inquiries_last_12m 
                               0                                0 
              total_credit_lines                open_credit_lines 
                               0                                0 
              total_credit_limit            total_credit_utilized 
                               0                                0 
        num_collections_last_12m     num_historical_failed_to_pay 
                               0                                0 
           months_since_90d_late          current_accounts_delinq 
                               0                                0 
    total_collection_amount_ever     current_installment_accounts 
                               0                                0 
             accounts_opened_24m months_since_last_credit_inquiry 
                               0                                0 
       num_satisfactory_accounts       num_accounts_120d_past_due 
                               0                                0 
       num_accounts_30d_past_due        num_active_debit_accounts 
                               0                                0 
               total_debit_limit            num_total_cc_accounts 
                               0                                0 
            num_open_cc_accounts          num_cc_carrying_balance 
                               0                                0 
               num_mort_accounts     account_never_delinq_percent 
                               0                                0 
                       tax_liens           public_record_bankrupt 
                               0                                0 
                    loan_purpose                 application_type 
                               0                                0 
                     loan_amount                             term 
                               0                                0 
                   interest_rate                      installment 
                               0                                0 
                           grade                        sub_grade 
                               0                                0 
                     issue_month                      loan_status 
                               0                                0 
          initial_listing_status              disbursement_method 
                               0                                0 
                         balance                       paid_total 
                               0                                0 
                  paid_principal                    paid_interest 
                               0                                0 
                  paid_late_fees 
                               0 

The dataset has been successfully cleaned now.

The research questions which I intend to answer:

Q1: Explore the relationship between the grade of Loan and interest rate.

Q2: Explore the relationship between the purpose of Loan and the amount of loan on two given terms of loan.

Q3: Does people of one state take more loan than the people of other state?

Q4: How does loan amount vary on the basis of annual income of the borrower?

Q5: How does loan amount vary on the basis of the purpose of loan and the type of application?

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Gangal (2022, April 27). Data Analytics and Computational Social Science: HW 3 DACSS 601. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomayushe17892927/

BibTeX citation

@misc{gangal2022hw,
  author = {Gangal, Ayushe},
  title = {Data Analytics and Computational Social Science: HW 3 DACSS 601},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomayushe17892927/},
  year = {2022}
}