603_Project_Check_In_Saisrinivas_Ambatipudi

Author

Akhilesh Kumar & Saisrinivas Ambatipudi

Published

March 22, 2023

Data Description:

The blood transfusion dataset contains 748 samples and 5 columns: four input features and one binary target.

  • Recency (number of months since the last donation)
  • Frequency (total number of donations)
  • Monetary (total blood donated in c.c.)
  • Time (number of months since the first donation)
  • Whether he/she donated blood in March 2007 (binary target: 1 = donated, 0 = did not donate)

Source: https://www.kaggle.com/datasets/ninalabiba/blood-transfusion-dataset

library(readr)  # provides read_csv()
BD <- read_csv("C:/UMass/DACSS_603/603_Spring_2023/posts/_data/transfusion_saisrinivas.csv")
Rows: 748 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): Recency (months), Frequency (times), Monetary (c.c. blood), Time (m...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
BD
# A tibble: 748 × 5
   `Recency (months)` `Frequency (times)` Monetary (c.c. blood…¹ Time …² wheth…³
                <dbl>               <dbl>                  <dbl>   <dbl>   <dbl>
 1                  2                  50                  12500      98       1
 2                  0                  13                   3250      28       1
 3                  1                  16                   4000      35       1
 4                  2                  20                   5000      45       1
 5                  1                  24                   6000      77       0
 6                  4                   4                   1000       4       0
 7                  2                   7                   1750      14       1
 8                  1                  12                   3000      35       0
 9                  2                   9                   2250      22       1
10                  5                  46                  11500      98       1
# … with 738 more rows, and abbreviated variable names
#   ¹​`Monetary (c.c. blood)`, ²​`Time (months)`,
#   ³​`whether he/she donated blood in March 2007`
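The original column names contain spaces and punctuation, so every reference to them needs backticks. A minimal sketch of one way to shorten them, using only the first three rows of the tibble printed above; the short names (`recency`, `frequency`, `monetary`, `time`, `donated`) are our own choice and not part of the dataset.

```r
# First three rows of the printed tibble, keeping the original column names.
bd_head <- data.frame(
  `Recency (months)` = c(2, 0, 1),
  `Frequency (times)` = c(50, 13, 16),
  `Monetary (c.c. blood)` = c(12500, 3250, 4000),
  `Time (months)` = c(98, 28, 35),
  `whether he/she donated blood in March 2007` = c(1, 1, 1),
  check.names = FALSE  # keep spaces and parentheses in the names
)

# Short names are an assumption made for readability, not dataset names.
short_names <- c("recency", "frequency", "monetary", "time", "donated")
bd_short <- setNames(bd_head, short_names)

names(bd_short)
```

The same `setNames()` call should work on the full `BD` tibble, since it has the same five columns in the same order.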

Research Questions:

Blood Donation Prediction Frequency:

The aim of this study is to develop linear regression and logistic regression models that predict whether a blood donor is likely to donate in the future. The results could be useful in developing targeted strategies for donor recruitment and retention, ultimately improving the availability and accessibility of blood donations.
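As a rough illustration of the logistic regression part of this question, the sketch below fits `glm()` with a binomial family. It runs on simulated donors rather than the transfusion data, so the coefficients mean nothing in themselves; the variable names and the assumed relationship between recency, frequency, and donation are ours. Monetary is left out on the assumption that it carries the same information as Frequency.

```r
set.seed(603)  # reproducible simulated data

# SIMULATED donors, for illustration only (not the transfusion data).
n <- 300
recency   <- rpois(n, lambda = 9)      # months since last donation
frequency <- rpois(n, lambda = 4) + 1  # total donations, at least 1

# Assumed relationship: recent, frequent donors are more likely to return.
p_true  <- plogis(-1 - 0.10 * recency + 0.20 * frequency)
donated <- rbinom(n, size = 1, prob = p_true)
sim <- data.frame(recency, frequency, donated)

# Logistic regression: log-odds of donating as a linear function of inputs.
fit <- glm(donated ~ recency + frequency, data = sim, family = binomial)

# Predicted probability of donating for one hypothetical donor.
new_donor <- data.frame(recency = 2, frequency = 10)
p_hat <- predict(fit, newdata = new_donor, type = "response")
p_hat
```

With the real data, `sim` would be replaced by the loaded dataset and the formula would use its column names.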

Identify Factors that Affect Blood Donation:

How do donation behaviors vary across different regions, and what factors may contribute to these variations? Using hypothesis testing, this research project aims to compare the donation patterns of donors from different regions based on demographic and donation-related variables, such as age, gender, donation frequency, and time since last donation. The findings can provide insights into the regional variations in blood donation behaviors and inform targeted strategies to address these differences, potentially leading to increased donation rates and more efficient allocation of resources for blood donation organizations. The study will visualize the results using plots and charts to identify any significant patterns or trends that could help healthcare and blood donation organizations to develop effective strategies to increase blood donation rates.

Segmentation of Blood Donors:

Use clustering techniques to segment blood donors based on their demographic and donation history characteristics. Explore methods such as K-means clustering or hierarchical clustering to identify groups of donors with similar characteristics. Examine the differences between these groups and explore their donation patterns over time.
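A minimal sketch of both clustering approaches, run on the ten rows shown in the tibble above so that it is self-contained. The choice of two clusters, the short column names, and dropping Monetary are assumptions made for illustration; ten rows are far too few to draw conclusions from.

```r
set.seed(603)  # k-means starts from random centers

# The ten rows printed above, with short names chosen for this sketch.
first_ten <- data.frame(
  recency   = c(2, 0, 1, 2, 1, 4, 2, 1, 2, 5),
  frequency = c(50, 13, 16, 20, 24, 4, 7, 12, 9, 46),
  time      = c(98, 28, 35, 45, 77, 4, 14, 35, 22, 98)
)

# Standardize so that no variable dominates the distance calculation.
x <- scale(first_ten)

# K-means with an assumed k = 2 and several random restarts.
km <- kmeans(x, centers = 2, nstart = 10)

# Hierarchical clustering on the same scaled data, cut into two groups.
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 2)

# Cross-tabulate the two groupings to see how far they agree.
table(kmeans = km$cluster, hclust = hc_groups)
```

On the full dataset, the number of clusters would need to be chosen with a diagnostic such as an elbow plot rather than fixed in advance.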

Building a Donor Retention Strategy:

Use the insights gained from the previous projects to develop a donor retention strategy for the blood donation center. Identify factors that are associated with donor churn (i.e., donors who stop donating blood), and develop a plan to mitigate these factors. This can help healthcare and blood donation organizations to develop strategies to retain donors over the long term.

Summary of the data

summary(BD)
 Recency (months) Frequency (times) Monetary (c.c. blood) Time (months)  
 Min.   : 0.000   Min.   : 1.000    Min.   :  250         Min.   : 2.00  
 1st Qu.: 2.750   1st Qu.: 2.000    1st Qu.:  500         1st Qu.:16.00  
 Median : 7.000   Median : 4.000    Median : 1000         Median :28.00  
 Mean   : 9.507   Mean   : 5.515    Mean   : 1379         Mean   :34.28  
 3rd Qu.:14.000   3rd Qu.: 7.000    3rd Qu.: 1750         3rd Qu.:50.00  
 Max.   :74.000   Max.   :50.000    Max.   :12500         Max.   :98.00  
 whether he/she donated blood in March 2007
 Min.   :0.000                             
 1st Qu.:0.000                             
 Median :0.000                             
 Mean   :0.238                             
 3rd Qu.:0.000                             
 Max.   :1.000                             

Description of the Variables:

The output of summary() gives a statistical description of the five variables in the dataset: Recency, Frequency, Monetary, Time, and whether the individual donated blood in March 2007. Here’s a breakdown of each variable:

Recency (months): This variable represents the number of months since the last blood donation. The minimum value is 0 months, indicating that some individuals donated blood very recently. The mean is 9.507 months, so the average donor last donated about 9.5 months ago; the median is lower, at 7 months. The maximum value is 74 months, which means the longest time elapsed since a donor's last donation is 74 months.

Frequency (times): This variable shows the total number of times an individual has donated blood. The minimum value is 1, meaning that at least one person has donated blood only once. The mean is 5.515 times, indicating that people, on average, have donated blood about 5.5 times. The maximum value is 50 times, showing that at least one individual has donated blood many times.

Monetary (c.c. blood): This variable represents the total volume of blood donated by an individual, measured in cubic centimeters (c.c.). The minimum value is 250 c.c., which is consistent with a donor who has made a single donation of 250 c.c. The mean is 1379 c.c., suggesting that, on average, individuals have donated around 1.379 liters of blood. The maximum value is 12,500 c.c., indicating that the highest total volume donated by a person is 12.5 liters.
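In the ten rows printed earlier, Monetary is exactly 250 times Frequency, and the summary statistics follow the same pattern (for example, a maximum of 50 donations and 12,500 c.c.). The sketch below checks this on those ten rows only; whether it holds for all 748 rows would need to be confirmed on the full data. The short vector names are ours.

```r
# Frequency and Monetary for the ten rows printed in the tibble above.
frequency <- c(50, 13, 16, 20, 24, 4, 7, 12, 9, 46)
monetary  <- c(12500, 3250, 4000, 5000, 6000, 1000, 1750, 3000, 2250, 11500)

# Volume per donation in each row.
per_donation <- monetary / frequency
unique(per_donation)  # 250

# The reported means are consistent with the same ratio.
5.515 * 250  # 1378.75, which rounds to the reported mean of 1379
```

If the relationship holds throughout, the two variables are perfectly collinear, and a regression model should include only one of them.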

Time (months): This variable measures the number of months since an individual's first donation. The minimum value is 2 months, suggesting that some individuals are relatively new to blood donation. The mean is 34.28 months, indicating that, on average, people made their first donation about 34.3 months ago. The maximum value is 98 months, showing that some individuals have been donating blood for a long time.

Whether he/she donated blood in March 2007: This is a binary variable that indicates if an individual donated blood in March 2007. The mean is 0.238, which means that about 23.8% of the individuals in the dataset donated blood in that specific month.
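Because the variable takes only the values 0 and 1, its mean is the proportion of ones. A short worked calculation converts the reported mean back into an approximate head count, using the row count and mean printed above:

```r
n_rows  <- 748    # rows reported by read_csv()
p_donor <- 0.238  # mean of the binary target from summary()

# Approximate number of donors and non-donors in March 2007.
n_donors     <- round(n_rows * p_donor)  # 178
n_non_donors <- n_rows - n_donors        # 570

c(donors = n_donors, non_donors = n_non_donors)
```

The classes are therefore imbalanced, roughly one donor for every three non-donors, which is worth keeping in mind when judging the accuracy of a prediction model.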

The summary function provides an overview of the dataset’s key statistics, such as minimum, 1st quartile, median, mean, 3rd quartile, and maximum values for each variable. This information helps to understand the distribution, central tendency, and spread of the data.

Expected Contribution:

  • Akhilesh Kumar: Will work on the Segmentation of Blood Donors and Building a Donor Retention Strategy

  • Saisrinivas Ambatipudi: Will work on the Blood Donation Prediction Frequency and Identify Factors that Affect Blood Donation

References:

Kaggle: https://www.kaggle.com/datasets/ninalabiba/blood-transfusion-dataset

Original Owner and Donor: Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan 30067, R.O.C., e-mail:icyeh ‘@’ chu.edu.tw, TEL:886-3-5186511, Date Donated: October 3, 2008