Final Project

Pavan Datta Abbineni


August 28, 2022

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)


According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.


About the Data

Data for this project originates from Electronic Health Records (EHR) controlled by McKinsey & Company. It is a refined and filtered version of the original dataset, which was collected over the course of several years in Bangladesh.

Project Goals

My main objective in this project is to use attribute analysis on this dataset to find the lifestyle choices that best help prevent stroke.

Import the data

The HealthCareStrokeDataset is imported into R for cleaning, wrangling, exploration, and analysis.

                        show_col_types = FALSE)

Attribute Information

  • id (int, categorical): unique identifier
  • gender (str, categorical): “Male”, “Female” or “Other”
  • age (int, numerical): age of the patient
  • hypertension (int, categorical): 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
  • heart_disease (int, categorical): 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease
  • ever_married (str, categorical): “No” or “Yes”
  • work_type (str, categorical): “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed”
  • Residence_type (str, categorical): “Rural” or “Urban”
  • avg_glucose_level (float, numerical): average glucose level in blood
  • bmi (str, numerical): body mass index*
  • smoking_status (str, categorical): “formerly smoked”, “never smoked”, “smokes” or “Unknown”*
  • stroke (int, categorical): 1 if the patient had a stroke or 0 if not

Note: “Unknown” in smoking_status and “N/A” in bmi mean that the information is unavailable for this patient.

Tidy the data

To learn more about the dataset, let’s look at a statistical summary (min, max, median, mean, and interquartile range), the data dimensions, and the column names.

       id           gender               age         hypertension    
 Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000  
 1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000  
 Median :36932   Mode  :character   Median :45.00   Median :0.00000  
 Mean   :36518                      Mean   :43.23   Mean   :0.09746  
 3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000  
 Max.   :72940                      Max.   :82.00   Max.   :1.00000  
 heart_disease     ever_married        work_type         Residence_type    
 Min.   :0.00000   Length:5110        Length:5110        Length:5110       
 1st Qu.:0.00000   Class :character   Class :character   Class :character  
 Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.05401                                                           
 3rd Qu.:0.00000                                                           
 Max.   :1.00000                                                           
 avg_glucose_level     bmi            smoking_status         stroke       
 Min.   : 55.12    Length:5110        Length:5110        Min.   :0.00000  
 1st Qu.: 77.25    Class :character   Class :character   1st Qu.:0.00000  
 Median : 91.89    Mode  :character   Mode  :character   Median :0.00000  
 Mean   :106.15                                          Mean   :0.04873  
 3rd Qu.:114.09                                          3rd Qu.:0.00000  
 Max.   :271.74                                          Max.   :1.00000  
[1] 5110   12
 [1] "id"                "gender"            "age"              
 [4] "hypertension"      "heart_disease"     "ever_married"     
 [7] "work_type"         "Residence_type"    "avg_glucose_level"
[10] "bmi"               "smoking_status"    "stroke"           
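
Output of this kind is typically produced with base R's summary(), dim() and colnames(). The sketch below runs them on a small made-up data frame (toyData and its values are assumptions for illustration, not the stroke data itself).

```r
# Toy stand-in for the stroke data; the values are invented for illustration.
toyData <- data.frame(
  id     = c(1, 2, 3, 4),
  gender = c("Male", "Female", "Female", "Male"),
  age    = c(67, 61, 80, 49),
  stroke = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Min, quartiles, median, mean and max for numeric columns;
# length/class/mode for character columns.
print(summary(toyData))

toyDims <- dim(toyData)       # number of rows, then number of columns
toyCols <- colnames(toyData)  # column names in order
print(toyDims)
print(toyCols)
```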

To get further insight into the dataset, let’s print it in three different ways:

  • Head of the dataset,
  • Tail of the dataset and
  • Randomly print ‘n’ elements from the dataset.
# A tibble: 6 × 12
     id gender   age hypertension heart_…¹ ever_…² work_…³ Resid…⁴ avg_g…⁵ bmi  
  <dbl> <chr>  <dbl>        <dbl>    <dbl> <chr>   <chr>   <chr>     <dbl> <chr>
1  9046 Male      67            0        1 Yes     Private Urban      229. 36.6 
2 51676 Female    61            0        0 Yes     Self-e… Rural      202. N/A  
3 31112 Male      80            0        1 Yes     Private Rural      106. 32.5 
4 60182 Female    49            0        0 Yes     Private Urban      171. 34.4 
5  1665 Female    79            1        0 Yes     Self-e… Rural      174. 24   
6 56669 Male      81            0        0 Yes     Private Urban      186. 29   
# … with 2 more variables: smoking_status <chr>, stroke <dbl>
#   variable names ¹​heart_disease, ²​ever_married, ³​work_type, ⁴​Residence_type,
#   ⁵​avg_glucose_level
# ℹ Use `colnames()` to see all variable names
# A tibble: 6 × 12
     id gender   age hypertension heart_…¹ ever_…² work_…³ Resid…⁴ avg_g…⁵ bmi  
  <dbl> <chr>  <dbl>        <dbl>    <dbl> <chr>   <chr>   <chr>     <dbl> <chr>
1 14180 Female    13            0        0 No      childr… Rural     103.  18.6 
2 18234 Female    80            1        0 Yes     Private Urban      83.8 N/A  
3 44873 Female    81            0        0 Yes     Self-e… Urban     125.  40   
4 19723 Female    35            0        0 Yes     Self-e… Rural      83.0 30.6 
5 37544 Male      51            0        0 Yes     Private Rural     166.  25.6 
6 44679 Female    44            0        0 Yes     Govt_j… Urban      85.3 26.2 
# … with 2 more variables: smoking_status <chr>, stroke <dbl>
#   variable names ¹​heart_disease, ²​ever_married, ³​work_type, ⁴​Residence_type,
#   ⁵​avg_glucose_level
# ℹ Use `colnames()` to see all variable names
randomlySelectedData <- healthCareStrokeData[sample(1:nrow(healthCareStrokeData), 5), ]
# A tibble: 5 × 12
     id gender   age hypertension heart_…¹ ever_…² work_…³ Resid…⁴ avg_g…⁵ bmi  
  <dbl> <chr>  <dbl>        <dbl>    <dbl> <chr>   <chr>   <chr>     <dbl> <chr>
1 48425 Male   21               0        0 No      Private Rural      89.3 23.4 
2 22706 Female  0.88            0        0 No      childr… Rural      88.1 15.5 
3 26154 Male   56               0        0 Yes     Private Rural      82.4 34.5 
4 49815 Female 17               0        0 No      Govt_j… Rural     116.  23.3 
5 60675 Female 48               1        0 Yes     Govt_j… Rural     221.  57.2 
# … with 2 more variables: smoking_status <chr>, stroke <dbl>
#   variable names ¹​heart_disease, ²​ever_married, ³​work_type, ⁴​Residence_type,
#   ⁵​avg_glucose_level
# ℹ Use `colnames()` to see all variable names
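
The three views above can be sketched with head(), tail() and sample(). The example uses a made-up data frame (toyData is an assumption), and the set.seed() call is an addition that makes the random draw reproducible; the original report does not show one.

```r
# Toy data frame with ten rows; values are invented for illustration.
toyData <- data.frame(id = 1:10, age = seq(20, 65, by = 5))

firstRows <- head(toyData, 3)  # first three rows
lastRows  <- tail(toyData, 3)  # last three rows

set.seed(42)  # assumption: fixed seed so the same rows are drawn each run
randomRows <- toyData[sample(nrow(toyData), 5), ]  # five rows without replacement

print(firstRows)
print(lastRows)
print(randomRows)
```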

We can see that each row in our dataset is a unique observation that represents one person’s lifestyle and whether they had a stroke.

Each variable has one consistent data type: some variables are numeric and some are categorical. The data type of each column is listed below.

  • Categorical : gender, ever_married, work_type, Residence_type, smoking_status
  • Categorical ( Boolean ) : hypertension, heart_disease, stroke
  • Quantitative (continuous) : avg_glucose_level, bmi
  • Quantitative (discrete) : age

Let’s check for any NA values in our dataset. Missing bmi values are stored as the text “N/A” at this point, so they are not counted here.

[1] 0
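
A short sketch of why an NA count can be zero even when values are missing: is.na() only recognises real NA values, not the text "N/A". The toyBmi data frame is a made-up assumption.

```r
# bmi stored as text, with a missing value written as the string "N/A"
toyBmi <- data.frame(bmi = c("36.6", "N/A", "32.5"), stringsAsFactors = FALSE)

naCount   <- sum(is.na(toyBmi))        # counts real NA values only
textCount <- sum(toyBmi$bmi == "N/A")  # counts the placeholder strings

print(naCount)
print(textCount)
```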

Next, let’s check our dataset for any duplicate rows, using the id column.

nOccur <- data.frame(table(healthCareStrokeData$id))
nOccur[nOccur$Freq > 1,]
[1] Var1 Freq
<0 rows> (or 0-length row.names)

From the above result we can confirm that there are no duplicates in our dataset.
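
As an alternative sketch of the same check, base R's duplicated() and anyDuplicated() can be applied directly to an id vector. The toyIds vector is made up for illustration and deliberately contains one repeated id.

```r
# Invented ids; 51676 appears twice on purpose.
toyIds <- c(9046, 51676, 31112, 51676)

dupCount <- sum(duplicated(toyIds))  # number of repeated entries after the first
firstDup <- anyDuplicated(toyIds)    # index of the first repeat, 0 if there is none

print(dupCount)
print(firstDup)
```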

The data type of bmi shows that it is stored as a string, so we need to convert it to numeric.

healthCareStrokeData$bmi <- as.numeric(healthCareStrokeData$bmi)

The conversion introduces NA values by coercion (the “N/A” entries), so let’s count them and drop all rows with NA values before going to the next step.

[1] 201
healthCareStrokeData <- na.omit(healthCareStrokeData)
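
The coercion and clean-up steps can be sketched on a tiny made-up data frame (toyData is an assumption). The suppressWarnings() wrapper is added here only to keep the example quiet; the "NAs introduced by coercion" warning is the expected behaviour.

```r
# bmi as text, with two entries written as "N/A"
toyData <- data.frame(
  id  = 1:4,
  bmi = c("36.6", "N/A", "32.5", "N/A"),
  stringsAsFactors = FALSE
)

# "N/A" cannot be parsed as a number, so it becomes a real NA
toyData$bmi <- suppressWarnings(as.numeric(toyData$bmi))

missingCount <- sum(is.na(toyData$bmi))  # rows that lost their bmi value
cleanData    <- na.omit(toyData)         # keep only complete rows

print(missingCount)
print(cleanData)
```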

As our main interest in this project is the people who had a stroke, let’s filter our dataset to contain only those patients.

healthCareOnlyStroke <- healthCareStrokeData %>% filter(stroke == 1)

Let’s look at the statistical summary (min, max, median, mean, and interquartile range) and the data dimensions for our target dataset.

       id           gender               age         hypertension   
 Min.   :  210   Length:209         Min.   :14.00   Min.   :0.0000  
 1st Qu.:17308   Class :character   1st Qu.:58.00   1st Qu.:0.0000  
 Median :36857   Mode  :character   Median :70.00   Median :0.0000  
 Mean   :37546                      Mean   :67.71   Mean   :0.2871  
 3rd Qu.:56939                      3rd Qu.:78.00   3rd Qu.:1.0000  
 Max.   :72918                      Max.   :82.00   Max.   :1.0000  
 heart_disease    ever_married        work_type         Residence_type    
 Min.   :0.0000   Length:209         Length:209         Length:209        
 1st Qu.:0.0000   Class :character   Class :character   Class :character  
 Median :0.0000   Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.1914                                                           
 3rd Qu.:0.0000                                                           
 Max.   :1.0000                                                           
 avg_glucose_level      bmi        smoking_status         stroke 
 Min.   : 56.11    Min.   :16.90   Length:209         Min.   :1  
 1st Qu.: 80.43    1st Qu.:26.40   Class :character   1st Qu.:1  
 Median :106.58    Median :29.70   Mode  :character   Median :1  
 Mean   :134.57    Mean   :30.47                      Mean   :1  
 3rd Qu.:196.92    3rd Qu.:33.70                      3rd Qu.:1  
 Max.   :271.74    Max.   :56.60                      Max.   :1  
[1] 209  12

Data Analysis and Visualization

Now that our dataset has been imported, cleaned, and tidied it can be used for further visualization and analysis. Let’s begin our analysis with the most basic questions like the mean and median of our age, bmi and avg_glucose_level.

As stated above, the main variables I’m going to focus on are:

  • age,
  • bmi and
  • avg_glucose_level.

strokeLabels  = table(healthCareStrokeData$stroke)
pie(strokeLabels,labels = strokeLabels, main = "Number of people who had a stroke")

# Assumed starting point: a copy of the cleaned data (the defining line is missing here)
histogramStrokeData <- healthCareStrokeData
histogramStrokeData$stroke <- factor(histogramStrokeData$stroke,
                         levels = c(0,1),
                         labels = c("Didn't have a stroke","Had a Stroke"))
ggplot(histogramStrokeData, aes(stroke))+
  geom_bar(fill=c("aquamarine2","pink3")) +
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Stroke Analysis")

favstats((healthCareStrokeData %>% filter(stroke == 0))$age)
  min Q1 median Q3 max     mean       sd    n missing
 0.08 24     43 59  82 41.76045 22.26813 4700       0
favstats((healthCareStrokeData %>% filter(stroke == 1))$age)
 min Q1 median Q3 max     mean       sd   n missing
  14 58     70 78  82 67.71292 12.40285 209       0

Effect of Age

strokeAgeLabels<-(healthCareStrokeData %>% filter(stroke == 1))$age
hist(strokeAgeLabels,
     main = "Age Histogram of all the people who had strokes",
     xlab = "Age",
     col = "white",
     border = 4)

We can see that strokes are far more frequent in the 70-80 age bracket. But age is something we cannot change, so let’s analyze the other attributes in detail.

Effect of heart_disease

# Assumed starting point: a copy of the cleaned data (the defining line is missing here)
datasetForHeartAnalysis <- healthCareStrokeData
datasetForHeartAnalysis$heart_disease[datasetForHeartAnalysis$heart_disease == 0]<-"No Heart Disease"
datasetForHeartAnalysis$heart_disease[datasetForHeartAnalysis$heart_disease == 1]<-"Has Heart Disease"
datasetForHeartAnalysis$stroke[datasetForHeartAnalysis$stroke == 0]<-"Didn't have a Stroke"
datasetForHeartAnalysis$stroke[datasetForHeartAnalysis$stroke == 1]<-"Had a Stroke"

datasetForHeartAnalysis %>% filter(stroke=="Had a Stroke")%>% 
                        ggplot(aes(age, fill=heart_disease)) + 
                        geom_density(alpha=0.3) + 
                        ggtitle("Stroke by Age and heart_disease") + 
                        xlab("Age") + 
                        ylab("Density")  # assumed label, matching the other density plots

ggplot(data = datasetForHeartAnalysis,
       aes(x = heart_disease,  # x aesthetic assumed from the plot title
           fill = stroke)) + 
          geom_bar() +
          ggtitle("Stacked barchart for Heart Diseases v/s Stroke")

tableDatasetForHeartAnalysis<-datasetForHeartAnalysis%>%filter(heart_disease=="Has Heart Disease")
table(tableDatasetForHeartAnalysis$stroke)  # call assumed from the counts printed below

Didn't have a Stroke         Had a Stroke 
                 203                   40 
tableDatasetForHeartAnalysis<-datasetForHeartAnalysis%>%filter(heart_disease=="No Heart Disease")
table(tableDatasetForHeartAnalysis$stroke)  # call assumed from the counts printed below

Didn't have a Stroke         Had a Stroke 
                4497                  169 

In this dataset, about 16% of patients with heart disease had a stroke (40 of 243), compared with about 3.6% of patients without heart disease (169 of 4,666).
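
The percentages can be recomputed from the printed counts with prop.table(). This is a sketch: the matrix is typed in by hand from the tables above, and the names heartCounts and heartShares are made up for the example.

```r
# Counts copied from the two tables above (rows: heart disease status)
heartCounts <- matrix(
  c(203, 40,
    4497, 169),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    heart_disease = c("Has Heart Disease", "No Heart Disease"),
    stroke        = c("Didn't have a Stroke", "Had a Stroke")
  )
)

# Row-wise shares in percent: each row sums to 100
heartShares <- round(100 * prop.table(heartCounts, margin = 1), 2)
print(heartShares)
```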

The counts and plots above suggest that people with heart disease are more likely to have a stroke, particularly as they age.

Effect of hypertension

# Assumed starting point: a copy of the cleaned data (the defining line is missing here)
datasetForHypertensionAnalysis <- healthCareStrokeData
datasetForHypertensionAnalysis$hypertension[datasetForHypertensionAnalysis$hypertension == 0]<-"No Hypertension"
datasetForHypertensionAnalysis$hypertension[datasetForHypertensionAnalysis$hypertension == 1]<-"Has Hypertension"
datasetForHypertensionAnalysis$stroke[datasetForHypertensionAnalysis$stroke == 0]<-"Didn't have a Stroke"
datasetForHypertensionAnalysis$stroke[datasetForHypertensionAnalysis$stroke == 1]<-"Had a Stroke"

datasetForHypertensionAnalysis %>% filter(stroke=="Had a Stroke")%>% 
                        ggplot(aes(age, fill=hypertension)) + 
                        geom_density(alpha=0.3) + 
                        ggtitle("Stroke by Age and hypertension") + 
                        xlab("Age") + 
                        ylab("Density")  # assumed label, matching the other density plots

ggplot(data = datasetForHypertensionAnalysis,
       aes(x = hypertension,  # x aesthetic assumed from the section topic
           fill = stroke)) + 
          geom_bar() +  # geom_bar() counts rows by default
          ggtitle("Stacked barchart for Hypertension v/s Stroke")

tableDatasetForHypertension<-datasetForHypertensionAnalysis%>%filter(hypertension=="Has Hypertension")
table(tableDatasetForHypertension$stroke)  # call assumed from the counts printed below

Didn't have a Stroke         Had a Stroke 
                 391                   60 
tableDatasetForHypertension<-datasetForHypertensionAnalysis%>%filter(hypertension=="No Hypertension")
table(tableDatasetForHypertension$stroke)  # call assumed from the counts printed below

Didn't have a Stroke         Had a Stroke 
                4309                  149 

In this dataset, 13.3% of patients with hypertension had a stroke (60 of 451), compared with 3.34% of patients without hypertension (149 of 4,458).

We can see that the chance of having a stroke is approximately 4 times higher for people with hypertension.
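
The "approximately 4 times" figure is a ratio of the two stroke shares. A small worked sketch using the counts printed above (the variable names are made up for the example):

```r
# Counts copied from the hypertension tables above
strokeWithHyp    <- 60
totalWithHyp     <- 391 + 60
strokeWithoutHyp <- 149
totalWithoutHyp  <- 4309 + 149

riskWithHyp    <- strokeWithHyp / totalWithHyp        # share with a stroke, hypertension
riskWithoutHyp <- strokeWithoutHyp / totalWithoutHyp  # share with a stroke, no hypertension

# Ratio of the two shares (often called a relative risk)
riskRatio <- riskWithHyp / riskWithoutHyp

print(round(100 * c(riskWithHyp, riskWithoutHyp), 2))
print(round(riskRatio, 2))
```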

What if a person has both hypertension and heart disease?

tableForDualAnalysis<-healthCareStrokeData%>%filter(hypertension==1 & heart_disease==1)
table(tableForDualAnalysis$stroke)  # call assumed from the counts printed below

 0  1 
47 11 

We can see that a person having both hypertension and heart disease has a 19% chance to have a stroke.

Effect of bmi

hist(healthCareStrokeData$bmi,col=viridis(12,0.5),xlab = "BMI")

# Assumed starting point: a copy of the cleaned data (the defining line is missing here)
datasetForbmiAnalysis <- healthCareStrokeData
datasetForbmiAnalysis$stroke[datasetForbmiAnalysis$stroke == 0]<-"No Stroke"
datasetForbmiAnalysis$stroke[datasetForbmiAnalysis$stroke == 1]<-"Had a Stroke"
datasetForbmiAnalysis %>% ggplot(aes(bmi, fill=stroke)) + geom_density(alpha=0.3) + ggtitle("Stroke by bmi") + xlab("BMI") + ylab("Density")

healthCareOnlyStroke %>% ggplot(aes(age, bmi, color=gender)) + geom_point() + ggtitle("Stroke and bmi over Time")

We can see that if you have a bmi greater than 25 (overweight), you are more likely to have a stroke.
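
One way to check a threshold claim like this numerically is to bin bmi with cut() and compare the stroke share per bin. The sketch below uses the common 18.5 / 25 / 30 cut-offs and a made-up toyData frame; the values are invented and do not come from the stroke dataset.

```r
# Invented bmi values and stroke flags, for illustration only
toyData <- data.frame(
  bmi    = c(18.0, 22.5, 24.9, 25.0, 27.3, 29.9, 31.0, 36.6),
  stroke = c(0, 0, 0, 0, 1, 0, 1, 1)
)

# right = FALSE makes each bin include its lower bound, so 25.0 is "Overweight"
toyData$bmiGroup <- cut(
  toyData$bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right  = FALSE
)

# Mean of a 0/1 column is the share of rows with a stroke in each group
strokeShare <- tapply(toyData$stroke, toyData$bmiGroup, mean)
print(strokeShare)
```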

Effect of Glucose Level

hist(healthCareStrokeData$avg_glucose_level,col=viridis(12,0.5),xlab = "Average Glucose Level")

# Assumed starting point: a copy of the cleaned data (the defining line is missing here)
datasetForGlucoseAnalysis <- healthCareStrokeData
datasetForGlucoseAnalysis$stroke[datasetForGlucoseAnalysis$stroke == 0]<-"No Stroke"
datasetForGlucoseAnalysis$stroke[datasetForGlucoseAnalysis$stroke == 1]<-"Had a Stroke"
datasetForGlucoseAnalysis %>% ggplot(aes(avg_glucose_level, fill=stroke)) + geom_density(alpha=0.3) + ggtitle("Stroke by glucoselevel") + xlab("avg_glucose_level") + ylab("Density")

Similar to the case of bmi, the chance of having a stroke is higher at higher glucose levels.

Effect of Smoking

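# Assumed starting point: a copy of the cleaned data (the defining line is missing here)
datasetForSmokingAnalysis <- healthCareStrokeData
datasetForSmokingAnalysis$stroke[datasetForSmokingAnalysis$stroke == 0]<-"No Stroke"
datasetForSmokingAnalysis$stroke[datasetForSmokingAnalysis$stroke == 1]<-"Had a Stroke"

ggplot(data = datasetForSmokingAnalysis,
       aes(x = smoking_status,  # x aesthetic assumed from the plot title
           fill = stroke)) +
  geom_bar() + ggtitle("Stacked barchart for Smoking Status v/s Stroke")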

datasetForSmokingAnalysis %>% 
  filter(stroke == "Had a Stroke" & age<70 &smoking_status!="Unknown")%>%
  ggplot(aes(age, fill=smoking_status)) + 
  geom_density(alpha=0.3) + 
  ggtitle("Stroke by Age and Smoking Status") + 
  xlab("Age") + ylab("Density")

From the plot we can conclude that former smokers and current smokers are more likely to have a stroke than people who have never smoked.

Effect of Gender

healthCareOnlyStroke  %>% ggplot(aes(age, fill=gender)) + geom_density(alpha=0.3) + ggtitle("Stroke by Age in Male and Female")

We can see that women tend to have strokes at an earlier age, while for men the risk builds up with age.


Now that we have a detailed analysis of the indicators associated with stroke in our dataset, you can check for yourself how these risk factors apply to you. One thing we have no control over is age, as everyone ages, but the rest of the attributes give us a clear understanding of what to do to decrease our chances of having a stroke. A good start would be to quit smoking, manage stress, normalize bmi, keep glucose levels low, and improve heart health.

I hope this study helps you make a data-driven decision about your health and lifestyle in order to prevent strokes.

Bibliography/ References

