Data Analytics and Computational Social Science: HW 4 DACSS 601

Ayushe Gangal

Basic Setup

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(readxl)
library(ggExtra)
library(ggplot2)
library(knitr)

Data Loading

Loading the clean loan data which was cleaned in the previous HW assignment.

data_clean <- read_csv("/Users/ayushe/RStudio stuff/RData/clean_loan_data.csv")

A glimpse of the cleaned dataset.

data_clean

# A tibble: 9,166 × 56
    ...1 job          employment_len state homeownership annual_income
   <dbl> <chr>                 <dbl> <chr> <chr>                 <dbl>
 1     1 global conf…              3 NJ    MORTGAGE              90000
 2     2 warehouse o…             10 HI    RENT                  40000
 3     3 assembly                  3 WI    RENT                  40000
 4     4 customer se…              1 PA    RENT                  30000
 5     5 security su…             10 CA    RENT                  35000
 6     6 hr                       10 MI    MORTGAGE              35000
 7     7 police                   10 AZ    MORTGAGE             110000
 8     8 parts                    10 NV    MORTGAGE              65000
 9     9 4th person                3 IL    RENT                  30000
10    10 supervisor               10 IL    MORTGAGE              75000
# … with 9,156 more rows, and 50 more variables:
#   verified_income <chr>, debt_to_income <dbl>,
#   annual_income_joint <dbl>, verification_income_joint <chr>,
#   debt_to_income_joint <dbl>, delinq_2y <dbl>,
#   months_since_last_delinq <dbl>, earliest_credit_line <dbl>,
#   inquiries_last_12m <dbl>, total_credit_lines <dbl>,
#   open_credit_lines <dbl>, total_credit_limit <dbl>, …

Creating a dataframe with only numeric data type values

data_num <- select_if(data_clean, is.numeric) 
data_num

# A tibble: 9,166 × 43
    ...1 employment_len annual_income debt_to_income annual_income_jo…
   <dbl>          <dbl>         <dbl>          <dbl>             <dbl>
 1     1              3         90000          18.0                 -1
 2     2             10         40000           5.04                -1
 3     3              3         40000          21.2                 -1
 4     4              1         30000          10.2                 -1
 5     5             10         35000          58.0              57000
 6     6             10         35000          23.7             155000
 7     7             10        110000          16.2                 -1
 8     8             10         65000          36.5                 -1
 9     9              3         30000          18.9                 -1
10    10             10         75000          10.4                 -1
# … with 9,156 more rows, and 38 more variables:
#   debt_to_income_joint <dbl>, delinq_2y <dbl>,
#   months_since_last_delinq <dbl>, earliest_credit_line <dbl>,
#   inquiries_last_12m <dbl>, total_credit_lines <dbl>,
#   open_credit_lines <dbl>, total_credit_limit <dbl>,
#   total_credit_utilized <dbl>, num_collections_last_12m <dbl>,
#   num_historical_failed_to_pay <dbl>, …

Finding the mean of all the numeric (42) columns

summarize_all(data_num, list(mean=mean), na.rm=TRUE)

# A tibble: 1 × 43
  ...1_mean employment_len_mean annual_income_mean debt_to_income_mean
      <dbl>               <dbl>              <dbl>               <dbl>
1     4584.                5.93             82193.                19.0
# … with 39 more variables: annual_income_joint_mean <dbl>,
#   debt_to_income_joint_mean <dbl>, delinq_2y_mean <dbl>,
#   months_since_last_delinq_mean <dbl>,
#   earliest_credit_line_mean <dbl>, inquiries_last_12m_mean <dbl>,
#   total_credit_lines_mean <dbl>, open_credit_lines_mean <dbl>,
#   total_credit_limit_mean <dbl>, total_credit_utilized_mean <dbl>,
#   num_collections_last_12m_mean <dbl>, …

Finding the median of all the numeric (42) columns

summarize_all(data_num, list(median=median), na.rm=TRUE)

# A tibble: 1 × 43
  ...1_median employment_len_median annual_income_me… debt_to_income_…
        <dbl>                 <dbl>             <dbl>            <dbl>
1       4584.                     6             68000             17.4
# … with 39 more variables: annual_income_joint_median <dbl>,
#   debt_to_income_joint_median <dbl>, delinq_2y_median <dbl>,
#   months_since_last_delinq_median <dbl>,
#   earliest_credit_line_median <dbl>,
#   inquiries_last_12m_median <dbl>, total_credit_lines_median <dbl>,
#   open_credit_lines_median <dbl>, total_credit_limit_median <dbl>,
#   total_credit_utilized_median <dbl>, …

Finding the standard deviation of all the numeric (42) columns

summarize_all(data_num, list(sd=sd), na.rm=TRUE)

# A tibble: 1 × 43
  ...1_sd employment_len_sd annual_income_sd debt_to_income_sd
    <dbl>             <dbl>            <dbl>             <dbl>
1   2646.              3.70           65934.              14.2
# … with 39 more variables: annual_income_joint_sd <dbl>,
#   debt_to_income_joint_sd <dbl>, delinq_2y_sd <dbl>,
#   months_since_last_delinq_sd <dbl>, earliest_credit_line_sd <dbl>,
#   inquiries_last_12m_sd <dbl>, total_credit_lines_sd <dbl>,
#   open_credit_lines_sd <dbl>, total_credit_limit_sd <dbl>,
#   total_credit_utilized_sd <dbl>,
#   num_collections_last_12m_sd <dbl>, …

Finding the inter quartile range of all the numeric (42) columns

summarize_all(data_num, list(IQR=IQR), na.rm=TRUE)

# A tibble: 1 × 43
  ...1_IQR employment_len_IQR annual_income_IQR debt_to_income_IQR
     <dbl>              <dbl>             <dbl>              <dbl>
1    4582.                  8            50219.               13.7
# … with 39 more variables: annual_income_joint_IQR <dbl>,
#   debt_to_income_joint_IQR <dbl>, delinq_2y_IQR <dbl>,
#   months_since_last_delinq_IQR <dbl>,
#   earliest_credit_line_IQR <dbl>, inquiries_last_12m_IQR <dbl>,
#   total_credit_lines_IQR <dbl>, open_credit_lines_IQR <dbl>,
#   total_credit_limit_IQR <dbl>, total_credit_utilized_IQR <dbl>,
#   num_collections_last_12m_IQR <dbl>, …

Finding the minimum of all the numeric (42) columns

summarize_all(data_num, list(min=min), na.rm=TRUE)

# A tibble: 1 × 43
  ...1_min employment_len_min annual_income_min debt_to_income_min
     <dbl>              <dbl>             <dbl>              <dbl>
1        1                  0              3000                  0
# … with 39 more variables: annual_income_joint_min <dbl>,
#   debt_to_income_joint_min <dbl>, delinq_2y_min <dbl>,
#   months_since_last_delinq_min <dbl>,
#   earliest_credit_line_min <dbl>, inquiries_last_12m_min <dbl>,
#   total_credit_lines_min <dbl>, open_credit_lines_min <dbl>,
#   total_credit_limit_min <dbl>, total_credit_utilized_min <dbl>,
#   num_collections_last_12m_min <dbl>, …

Finding the maximum of all the numeric (42) columns

summarize_all(data_num, list(max=max), na.rm=TRUE)

# A tibble: 1 × 43
  ...1_max employment_len_max annual_income_max debt_to_income_max
     <dbl>              <dbl>             <dbl>              <dbl>
1     9166                 10           2300000               469.
# … with 39 more variables: annual_income_joint_max <dbl>,
#   debt_to_income_joint_max <dbl>, delinq_2y_max <dbl>,
#   months_since_last_delinq_max <dbl>,
#   earliest_credit_line_max <dbl>, inquiries_last_12m_max <dbl>,
#   total_credit_lines_max <dbl>, open_credit_lines_max <dbl>,
#   total_credit_limit_max <dbl>, total_credit_utilized_max <dbl>,
#   num_collections_last_12m_max <dbl>, …

Finding the mean of annual income grouped by term and also counting the number of values which fall in each term

data_num %>%
  group_by(term) %>%
  summarise(mean_annual_income = mean(annual_income), n = n())

# A tibble: 2 × 3
   term mean_annual_income     n
  <dbl>              <dbl> <int>
1    36             80463.  6339
2    60             86070.  2827

Finding the mean of loan amount grouped by term and also counting the number of values which fall in each term

data_num %>%
  group_by(term) %>%
  summarise(mean_aloan_amount = mean(loan_amount), n = n())

# A tibble: 2 × 3
   term mean_aloan_amount     n
  <dbl>             <dbl> <int>
1    36            14037.  6339
2    60            22470.  2827

Finding the minimum, maximum and mean of annual income grouped by the type of job of the borrower and also counting the number of values which fall in each job

data_clean %>%
  group_by(job) %>%
  summarise(min_annual_income = min(annual_income), max_annual_income = max(annual_income), mean_annual_income = mean(annual_income), n = n())

# A tibble: 4,411 × 5
   job        min_annual_inco… max_annual_inco… mean_annual_inc…     n
   <chr>                 <dbl>            <dbl>            <dbl> <int>
 1 1sg                   78000           100000            89000     2
 2 1st grade…            45000            45000            45000     1
 3 21 dealer             44000            44000            44000     1
 4 2nd shift…            55000            55000            55000     1
 5 360app sp…            37000            37000            37000     1
 6 3rd shift…            88000            88000            88000     1
 7 3rd shift…            54000            54000            54000     1
 8 4th person            30000            30000            30000     1
 9 5th year …            88400            88400            88400     1
10 6th grade…            55000            55000            55000     1
# … with 4,401 more rows

Finding the minimum, maximum and mean of annual income grouped by the type of home ownership of the borrower and also counting the number of values which fall in each home ownership

data_clean %>%
  group_by(homeownership) %>%
  summarise(min_annual_income = min(annual_income), max_annual_income = max(annual_income), mean_annual_income = mean(annual_income), n = n())

# A tibble: 3 × 5
  homeownership min_annual_income max_annual_income mean_annual_income
  <chr>                     <dbl>             <dbl>              <dbl>
1 MORTGAGE                   3300           2300000             93753.
2 OWN                        9996            740000             76683.
3 RENT                       3000           1200000             69540.
# … with 1 more variable: n <int>

Finding the minimum, maximum and mean of loan amount grouped by the type of home ownership of the borrower and also counting the number of values which fall in each home ownership

data_clean %>%
  group_by(homeownership) %>%
  summarise(min_loan_amount = min(loan_amount), max_loan_amount = max(loan_amount), mean_loan_amount = mean(loan_amount), n = n())

# A tibble: 3 × 5
  homeownership min_loan_amount max_loan_amount mean_loan_amount     n
  <chr>                   <dbl>           <dbl>            <dbl> <int>
1 MORTGAGE                 1000           40000           18367.  4458
2 OWN                      1000           40000           16290.  1125
3 RENT                     1000           40000           14596.  3583

Finding the interest rate, the loan amount and the mean of annual income grouped by the type of grade of the loan and also counting the number of values which fall in each grade

data_clean %>%
  group_by(grade) %>%
  summarise( interest_rate = mean(interest_rate), 
            loan_amount = mean(loan_amount),
            mean_annual_income = mean(annual_income),
            n = n())

# A tibble: 7 × 5
  grade interest_rate loan_amount mean_annual_income     n
  <chr>         <dbl>       <dbl>              <dbl> <int>
1 A              6.73      15617.             92063.  2263
2 B             10.5       16510.             83528.  2788
3 C             14.2       17116.             76904.  2434
4 D             19.1       16962.             75087.  1310
5 E             25.1       18844.             70736.   304
6 F             29.4       22153.             77298.    56
7 G             30.8       25923.             71066.    11

Visualizations

Univariate Graphs

Loan Amount Count

ggplot(data_clean, aes(x = loan_amount)) +
  geom_histogram(fill = "cornflowerblue", 
                 color = "white") + 
  labs(title="loan amount count",
       x = "loan amount")

This is a uni variate plot which counts the number of people on the basis of loan amounts. Since, loan amount is a continuous value, we have used a histogram to portray it. We have the loan amount on the x-axis and the count on the y-axis. We find that maximum people take a loan of $10000.

Grade count

ggplot(data_clean, aes(x = grade)) +
  geom_bar(fill = "purple", 
                 color = "white") + 
  labs(title="grade count",
       x = "grade")

This is another uni variate plot which counts the number of people in each grade of loan. Since grade is a categorical variable, I have used a bar chart to portray the data. Grade of loan is on the x-axis and the count is on the y-axis. We find that maximum borrowers fall into the grade B loan.

Bivariate Graphs

Rate of Interest VS Grade of Loan

ggplot(data_clean, aes(x =  interest_rate, y = grade)) + 
        geom_boxplot() + 
        labs(x = 'Interest Rate' , y = 'Grade')

This plot uses a boxplot to depict the relationship between the rate of interest and the grade of A boxplot displays a five-number summary, giving the maximum, minimum, median, first quartile and third quartile, and therefore the inter quartile range as well. We find that the rate of interest is lowest for the grade A loans and highest for grade G.

Purpose of Loan VS Loan Amount

ggplot(data_clean, aes(x = loan_amount , y = loan_purpose)) + 
        geom_boxplot() + 
        labs(y = ' Purpose of Loan' , x = 'Loan_amount')

I have used a boxplot to depict the relationship between loan amount and the purpose of loan. We find that the highest amount of loan is taken for financing small businesses and the lowest amount is takken for vacations. The boxplot also displays a five-number summary, giving the maximum, minimum, median, first quartile and third quartile, and therefore the inter quartile range as well.

Limitations of the plots

Though the uni-variate plots are attractive, they only depict a limited amount of relationships among the data variables, while the bi-variate plots lack attractiveness. We can enhance these plots to be more attractive and informatoive for the final project.

Comment on this article Share:

HW 4 DACSS 601