Ayushe’s HW4 Submission for DACSS 601
Loading the clean loan data which was cleaned in the previous HW assignment.
data_clean <- read_csv("/Users/ayushe/RStudio stuff/RData/clean_loan_data.csv")
A glimpse of the cleaned dataset.
data_clean
# A tibble: 9,166 × 56
...1 job employment_len state homeownership annual_income
<dbl> <chr> <dbl> <chr> <chr> <dbl>
1 1 global conf… 3 NJ MORTGAGE 90000
2 2 warehouse o… 10 HI RENT 40000
3 3 assembly 3 WI RENT 40000
4 4 customer se… 1 PA RENT 30000
5 5 security su… 10 CA RENT 35000
6 6 hr 10 MI MORTGAGE 35000
7 7 police 10 AZ MORTGAGE 110000
8 8 parts 10 NV MORTGAGE 65000
9 9 4th person 3 IL RENT 30000
10 10 supervisor 10 IL MORTGAGE 75000
# … with 9,156 more rows, and 50 more variables:
# verified_income <chr>, debt_to_income <dbl>,
# annual_income_joint <dbl>, verification_income_joint <chr>,
# debt_to_income_joint <dbl>, delinq_2y <dbl>,
# months_since_last_delinq <dbl>, earliest_credit_line <dbl>,
# inquiries_last_12m <dbl>, total_credit_lines <dbl>,
# open_credit_lines <dbl>, total_credit_limit <dbl>, …
Creating a dataframe with only numeric data type values
data_num <- select_if(data_clean, is.numeric)
data_num
# A tibble: 9,166 × 43
...1 employment_len annual_income debt_to_income annual_income_jo…
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 90000 18.0 -1
2 2 10 40000 5.04 -1
3 3 3 40000 21.2 -1
4 4 1 30000 10.2 -1
5 5 10 35000 58.0 57000
6 6 10 35000 23.7 155000
7 7 10 110000 16.2 -1
8 8 10 65000 36.5 -1
9 9 3 30000 18.9 -1
10 10 10 75000 10.4 -1
# … with 9,156 more rows, and 38 more variables:
# debt_to_income_joint <dbl>, delinq_2y <dbl>,
# months_since_last_delinq <dbl>, earliest_credit_line <dbl>,
# inquiries_last_12m <dbl>, total_credit_lines <dbl>,
# open_credit_lines <dbl>, total_credit_limit <dbl>,
# total_credit_utilized <dbl>, num_collections_last_12m <dbl>,
# num_historical_failed_to_pay <dbl>, …
Finding the mean of all the numeric (42) columns
summarize_all(data_num, list(mean=mean), na.rm=TRUE)
# A tibble: 1 × 43
...1_mean employment_len_mean annual_income_mean debt_to_income_mean
<dbl> <dbl> <dbl> <dbl>
1 4584. 5.93 82193. 19.0
# … with 39 more variables: annual_income_joint_mean <dbl>,
# debt_to_income_joint_mean <dbl>, delinq_2y_mean <dbl>,
# months_since_last_delinq_mean <dbl>,
# earliest_credit_line_mean <dbl>, inquiries_last_12m_mean <dbl>,
# total_credit_lines_mean <dbl>, open_credit_lines_mean <dbl>,
# total_credit_limit_mean <dbl>, total_credit_utilized_mean <dbl>,
# num_collections_last_12m_mean <dbl>, …
Finding the median of all the numeric (42) columns
summarize_all(data_num, list(median=median), na.rm=TRUE)
# A tibble: 1 × 43
...1_median employment_len_median annual_income_me… debt_to_income_…
<dbl> <dbl> <dbl> <dbl>
1 4584. 6 68000 17.4
# … with 39 more variables: annual_income_joint_median <dbl>,
# debt_to_income_joint_median <dbl>, delinq_2y_median <dbl>,
# months_since_last_delinq_median <dbl>,
# earliest_credit_line_median <dbl>,
# inquiries_last_12m_median <dbl>, total_credit_lines_median <dbl>,
# open_credit_lines_median <dbl>, total_credit_limit_median <dbl>,
# total_credit_utilized_median <dbl>, …
Finding the standard deviation of all the numeric (42) columns
summarize_all(data_num, list(sd=sd), na.rm=TRUE)
# A tibble: 1 × 43
...1_sd employment_len_sd annual_income_sd debt_to_income_sd
<dbl> <dbl> <dbl> <dbl>
1 2646. 3.70 65934. 14.2
# … with 39 more variables: annual_income_joint_sd <dbl>,
# debt_to_income_joint_sd <dbl>, delinq_2y_sd <dbl>,
# months_since_last_delinq_sd <dbl>, earliest_credit_line_sd <dbl>,
# inquiries_last_12m_sd <dbl>, total_credit_lines_sd <dbl>,
# open_credit_lines_sd <dbl>, total_credit_limit_sd <dbl>,
# total_credit_utilized_sd <dbl>,
# num_collections_last_12m_sd <dbl>, …
Finding the inter quartile range of all the numeric (42) columns
summarize_all(data_num, list(IQR=IQR), na.rm=TRUE)
# A tibble: 1 × 43
...1_IQR employment_len_IQR annual_income_IQR debt_to_income_IQR
<dbl> <dbl> <dbl> <dbl>
1 4582. 8 50219. 13.7
# … with 39 more variables: annual_income_joint_IQR <dbl>,
# debt_to_income_joint_IQR <dbl>, delinq_2y_IQR <dbl>,
# months_since_last_delinq_IQR <dbl>,
# earliest_credit_line_IQR <dbl>, inquiries_last_12m_IQR <dbl>,
# total_credit_lines_IQR <dbl>, open_credit_lines_IQR <dbl>,
# total_credit_limit_IQR <dbl>, total_credit_utilized_IQR <dbl>,
# num_collections_last_12m_IQR <dbl>, …
Finding the minimum of all the numeric (42) columns
summarize_all(data_num, list(min=min), na.rm=TRUE)
# A tibble: 1 × 43
...1_min employment_len_min annual_income_min debt_to_income_min
<dbl> <dbl> <dbl> <dbl>
1 1 0 3000 0
# … with 39 more variables: annual_income_joint_min <dbl>,
# debt_to_income_joint_min <dbl>, delinq_2y_min <dbl>,
# months_since_last_delinq_min <dbl>,
# earliest_credit_line_min <dbl>, inquiries_last_12m_min <dbl>,
# total_credit_lines_min <dbl>, open_credit_lines_min <dbl>,
# total_credit_limit_min <dbl>, total_credit_utilized_min <dbl>,
# num_collections_last_12m_min <dbl>, …
Finding the maximum of all the numeric (42) columns
summarize_all(data_num, list(max=max), na.rm=TRUE)
# A tibble: 1 × 43
...1_max employment_len_max annual_income_max debt_to_income_max
<dbl> <dbl> <dbl> <dbl>
1 9166 10 2300000 469.
# … with 39 more variables: annual_income_joint_max <dbl>,
# debt_to_income_joint_max <dbl>, delinq_2y_max <dbl>,
# months_since_last_delinq_max <dbl>,
# earliest_credit_line_max <dbl>, inquiries_last_12m_max <dbl>,
# total_credit_lines_max <dbl>, open_credit_lines_max <dbl>,
# total_credit_limit_max <dbl>, total_credit_utilized_max <dbl>,
# num_collections_last_12m_max <dbl>, …
Finding the mean of annual income grouped by term and also counting the number of values which fall in each term
# A tibble: 2 × 3
term mean_annual_income n
<dbl> <dbl> <int>
1 36 80463. 6339
2 60 86070. 2827
Finding the mean of loan amount grouped by term and also counting the number of values which fall in each term
# A tibble: 2 × 3
term mean_aloan_amount n
<dbl> <dbl> <int>
1 36 14037. 6339
2 60 22470. 2827
Finding the minimum, maximum and mean of annual income grouped by the type of job of the borrower and also counting the number of values which fall in each job
data_clean %>%
group_by(job) %>%
summarise(min_annual_income = min(annual_income), max_annual_income = max(annual_income), mean_annual_income = mean(annual_income), n = n())
# A tibble: 4,411 × 5
job min_annual_inco… max_annual_inco… mean_annual_inc… n
<chr> <dbl> <dbl> <dbl> <int>
1 1sg 78000 100000 89000 2
2 1st grade… 45000 45000 45000 1
3 21 dealer 44000 44000 44000 1
4 2nd shift… 55000 55000 55000 1
5 360app sp… 37000 37000 37000 1
6 3rd shift… 88000 88000 88000 1
7 3rd shift… 54000 54000 54000 1
8 4th person 30000 30000 30000 1
9 5th year … 88400 88400 88400 1
10 6th grade… 55000 55000 55000 1
# … with 4,401 more rows
Finding the minimum, maximum and mean of annual income grouped by the type of home ownership of the borrower and also counting the number of values which fall in each home ownership
data_clean %>%
group_by(homeownership) %>%
summarise(min_annual_income = min(annual_income), max_annual_income = max(annual_income), mean_annual_income = mean(annual_income), n = n())
# A tibble: 3 × 5
homeownership min_annual_income max_annual_income mean_annual_income
<chr> <dbl> <dbl> <dbl>
1 MORTGAGE 3300 2300000 93753.
2 OWN 9996 740000 76683.
3 RENT 3000 1200000 69540.
# … with 1 more variable: n <int>
Finding the minimum, maximum and mean of loan amount grouped by the type of home ownership of the borrower and also counting the number of values which fall in each home ownership
data_clean %>%
group_by(homeownership) %>%
summarise(min_loan_amount = min(loan_amount), max_loan_amount = max(loan_amount), mean_loan_amount = mean(loan_amount), n = n())
# A tibble: 3 × 5
homeownership min_loan_amount max_loan_amount mean_loan_amount n
<chr> <dbl> <dbl> <dbl> <int>
1 MORTGAGE 1000 40000 18367. 4458
2 OWN 1000 40000 16290. 1125
3 RENT 1000 40000 14596. 3583
Finding the interest rate, the loan amount and the mean of annual income grouped by the type of grade of the loan and also counting the number of values which fall in each grade
data_clean %>%
group_by(grade) %>%
summarise( interest_rate = mean(interest_rate),
loan_amount = mean(loan_amount),
mean_annual_income = mean(annual_income),
n = n())
# A tibble: 7 × 5
grade interest_rate loan_amount mean_annual_income n
<chr> <dbl> <dbl> <dbl> <int>
1 A 6.73 15617. 92063. 2263
2 B 10.5 16510. 83528. 2788
3 C 14.2 17116. 76904. 2434
4 D 19.1 16962. 75087. 1310
5 E 25.1 18844. 70736. 304
6 F 29.4 22153. 77298. 56
7 G 30.8 25923. 71066. 11
ggplot(data_clean, aes(x = loan_amount)) +
geom_histogram(fill = "cornflowerblue",
color = "white") +
labs(title="loan amount count",
x = "loan amount")
This is a uni variate plot which counts the number of people on the basis of loan amounts. Since, loan amount is a continuous value, we have used a histogram to portray it. We have the loan amount on the x-axis and the count on the y-axis. We find that maximum people take a loan of $10000.
ggplot(data_clean, aes(x = grade)) +
geom_bar(fill = "purple",
color = "white") +
labs(title="grade count",
x = "grade")
This is another uni variate plot which counts the number of people in each grade of loan. Since grade is a categorical variable, I have used a bar chart to portray the data. Grade of loan is on the x-axis and the count is on the y-axis. We find that maximum borrowers fall into the grade B loan.
ggplot(data_clean, aes(x = interest_rate, y = grade)) +
geom_boxplot() +
labs(x = 'Interest Rate' , y = 'Grade')
This plot uses a boxplot to depict the relationship between the rate of interest and the grade of A boxplot displays a five-number summary, giving the maximum, minimum, median, first quartile and third quartile, and therefore the inter quartile range as well. We find that the rate of interest is lowest for the grade A loans and highest for grade G.
ggplot(data_clean, aes(x = loan_amount , y = loan_purpose)) +
geom_boxplot() +
labs(y = ' Purpose of Loan' , x = 'Loan_amount')
I have used a boxplot to depict the relationship between loan amount and the purpose of loan. We find that the highest amount of loan is taken for financing small businesses and the lowest amount is takken for vacations. The boxplot also displays a five-number summary, giving the maximum, minimum, median, first quartile and third quartile, and therefore the inter quartile range as well.
Though the uni-variate plots are attractive, they only depict a limited amount of relationships among the data variables, while the bi-variate plots lack attractiveness. We can enhance these plots to be more attractive and informatoive for the final project.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gangal (2022, April 27). Data Analytics and Computational Social Science: HW 4 DACSS 601. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomayushe17893403/
BibTeX citation
@misc{gangal2022hw, author = {Gangal, Ayushe}, title = {Data Analytics and Computational Social Science: HW 4 DACSS 601}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomayushe17893403/}, year = {2022} }