Final Project
Author

Pavan Datta Abbineni

Published

August 28, 2022

Code
library(plyr)      # loaded before tidyverse so that dplyr's verbs are not masked
library(tidyverse) # attaches ggplot2, dplyr, readr and the other core packages
library(lubridate)
library(magrittr)
library(viridis)
library(glue)
library(leaflet)
library(plotrix)
library(scales)
library(mosaic)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Introduction

According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

Dataset

About the Data

The data for this project originates from Electronic Health Records (EHR) controlled by McKinsey & Company. It is a refined and filtered version of the original dataset, which was collected over the course of several years in Bangladesh.

Project Goals

My main objective in this project is to find the lifestyle choices that best help prevent stroke, by analyzing the attributes in this dataset.

Import the data

The healthcare stroke dataset is imported into R for cleaning, wrangling, exploration, and analysis.

Code
healthCareStrokeData<-read_csv("_data/healthcare-dataset-stroke-data.csv",
                        show_col_types = FALSE)

Attribute Information

  • id (int, categorical): unique identifier
  • gender (str, categorical): “Male”, “Female” or “Other”
  • age (float, numerical): age of the patient
  • hypertension (int, categorical): 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
  • heart_disease (int, categorical): 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease
  • ever_married (str, categorical): “No” or “Yes”
  • work_type (str, categorical): “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed”
  • Residence_type (str, categorical): “Rural” or “Urban”
  • avg_glucose_level (float, numerical): average glucose level in blood
  • bmi (str, numerical): body mass index*
  • smoking_status (str, categorical): “formerly smoked”, “never smoked”, “smokes” or “Unknown”*
  • stroke (int, categorical): 1 if the patient had a stroke or 0 if not

Note (*): “Unknown” in smoking_status and “N/A” in bmi mean that the information is unavailable for this patient.

Tidy the data

To learn more about the dataset, let’s look at a statistical summary (min, max, median, mean, and quartiles), the data dimensions, and the column names.

Code
summary(healthCareStrokeData)
       id           gender               age         hypertension    
 Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000  
 1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000  
 Median :36932   Mode  :character   Median :45.00   Median :0.00000  
 Mean   :36518                      Mean   :43.23   Mean   :0.09746  
 3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000  
 Max.   :72940                      Max.   :82.00   Max.   :1.00000  
 heart_disease     ever_married        work_type         Residence_type    
 Min.   :0.00000   Length:5110        Length:5110        Length:5110       
 1st Qu.:0.00000   Class :character   Class :character   Class :character  
 Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.05401                                                           
 3rd Qu.:0.00000                                                           
 Max.   :1.00000                                                           
 avg_glucose_level     bmi            smoking_status         stroke       
 Min.   : 55.12    Length:5110        Length:5110        Min.   :0.00000  
 1st Qu.: 77.25    Class :character   Class :character   1st Qu.:0.00000  
 Median : 91.89    Mode  :character   Mode  :character   Median :0.00000  
 Mean   :106.15                                          Mean   :0.04873  
 3rd Qu.:114.09                                          3rd Qu.:0.00000  
 Max.   :271.74                                          Max.   :1.00000  
Code
dim(healthCareStrokeData)
[1] 5110   12
Code
names(healthCareStrokeData)
 [1] "id"                "gender"            "age"              
 [4] "hypertension"      "heart_disease"     "ever_married"     
 [7] "work_type"         "Residence_type"    "avg_glucose_level"
[10] "bmi"               "smoking_status"    "stroke"           

To get further insight into the dataset, let’s print it in three different ways:

  • Head of the dataset,
  • Tail of the dataset and
  • Randomly print ‘n’ elements from the dataset.
Code
head(healthCareStrokeData)
# A tibble: 6 × 12
     id gender   age hypertension heart_…¹ ever_…² work_…³ Resid…⁴ avg_g…⁵ bmi  
  <dbl> <chr>  <dbl>        <dbl>    <dbl> <chr>   <chr>   <chr>     <dbl> <chr>
1  9046 Male      67            0        1 Yes     Private Urban      229. 36.6 
2 51676 Female    61            0        0 Yes     Self-e… Rural      202. N/A  
3 31112 Male      80            0        1 Yes     Private Rural      106. 32.5 
4 60182 Female    49            0        0 Yes     Private Urban      171. 34.4 
5  1665 Female    79            1        0 Yes     Self-e… Rural      174. 24   
6 56669 Male      81            0        0 Yes     Private Urban      186. 29   
# … with 2 more variables: smoking_status <chr>, stroke <dbl>, and abbreviated
#   variable names ¹​heart_disease, ²​ever_married, ³​work_type, ⁴​Residence_type,
#   ⁵​avg_glucose_level
# ℹ Use `colnames()` to see all variable names
Code
tail(healthCareStrokeData)
# A tibble: 6 × 12
     id gender   age hypertension heart_…¹ ever_…² work_…³ Resid…⁴ avg_g…⁵ bmi  
  <dbl> <chr>  <dbl>        <dbl>    <dbl> <chr>   <chr>   <chr>     <dbl> <chr>
1 14180 Female    13            0        0 No      childr… Rural     103.  18.6 
2 18234 Female    80            1        0 Yes     Private Urban      83.8 N/A  
3 44873 Female    81            0        0 Yes     Self-e… Urban     125.  40   
4 19723 Female    35            0        0 Yes     Self-e… Rural      83.0 30.6 
5 37544 Male      51            0        0 Yes     Private Rural     166.  25.6 
6 44679 Female    44            0        0 Yes     Govt_j… Urban      85.3 26.2 
# … with 2 more variables: smoking_status <chr>, stroke <dbl>, and abbreviated
#   variable names ¹​heart_disease, ²​ever_married, ³​work_type, ⁴​Residence_type,
#   ⁵​avg_glucose_level
# ℹ Use `colnames()` to see all variable names
Code
randomlySelectedData <- healthCareStrokeData[sample(1:nrow(healthCareStrokeData), 5), ]
randomlySelectedData
# A tibble: 5 × 12
     id gender   age hypertension heart_…¹ ever_…² work_…³ Resid…⁴ avg_g…⁵ bmi  
  <dbl> <chr>  <dbl>        <dbl>    <dbl> <chr>   <chr>   <chr>     <dbl> <chr>
1 48425 Male   21               0        0 No      Private Rural      89.3 23.4 
2 22706 Female  0.88            0        0 No      childr… Rural      88.1 15.5 
3 26154 Male   56               0        0 Yes     Private Rural      82.4 34.5 
4 49815 Female 17               0        0 No      Govt_j… Rural     116.  23.3 
5 60675 Female 48               1        0 Yes     Govt_j… Rural     221.  57.2 
# … with 2 more variables: smoking_status <chr>, stroke <dbl>, and abbreviated
#   variable names ¹​heart_disease, ²​ever_married, ³​work_type, ⁴​Residence_type,
#   ⁵​avg_glucose_level
# ℹ Use `colnames()` to see all variable names
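
Because sample() draws different rows on every run, the five rows printed above change each time the document is rendered. The following is a minimal sketch of a repeatable draw, assuming a fixed seed is acceptable for this report. The seed value 42 and the small stand-in data frame demoData are illustrative and not part of the original analysis.

```r
library(dplyr)

# Stand-in for healthCareStrokeData (hypothetical values, for illustration only)
demoData <- data.frame(id = 1:20, age = seq(21, 78, by = 3))

# Fixing the seed makes the random draw repeatable across renders
set.seed(42)
firstDraw <- slice_sample(demoData, n = 5)

set.seed(42)
secondDraw <- slice_sample(demoData, n = 5)

# Same seed, same rows
identical(firstDraw, secondDraw)
```

slice_sample() is the dplyr counterpart of indexing with sample(); either works once the seed is set.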

We can see that each row in our dataset is a unique observation that represents one person’s lifestyle and whether or not they had a stroke.

Each variable holds one consistent data type: some variables are numeric and some are categorical. The type of each column is listed below.

  • Categorical: gender, ever_married, work_type, Residence_type, smoking_status
  • Categorical (Boolean): hypertension, heart_disease, stroke
  • Quantitative (continuous): avg_glucose_level, bmi
  • Quantitative (discrete): age (mostly whole years, although the output above shows fractional values such as 0.08 and 0.88 for infants)
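
The categorical columns are read in as character or 0/1 numeric columns, so R does not yet treat them as categories. The sketch below shows one way to convert them to factors. It uses a hypothetical three-row stand-in (demoData) with the same column names as the dataset. The original analysis does not perform this step, so treat it as an optional addition.

```r
library(dplyr)

# Hypothetical three-row stand-in with the same column names as the dataset
demoData <- data.frame(
  gender = c("Male", "Female", "Female"),
  smoking_status = c("smokes", "never smoked", "Unknown"),
  hypertension = c(0, 1, 0),
  stroke = c(1, 0, 0)
)

# Convert the text columns and the 0/1 flags to factors in one step.
# The dplyr:: prefix avoids plyr::mutate, since the setup chunk also loads plyr.
demoFactors <- demoData %>%
  dplyr::mutate(across(c(gender, smoking_status, hypertension, stroke), as.factor))

str(demoFactors)
```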

Let’s check for any NA values in our dataset.

Code
sum(is.na(healthCareStrokeData))
[1] 0

Next, let’s check our dataset for duplicates by looking for id values that occur more than once.

Code
nOccur <- data.frame(table(healthCareStrokeData$id))
nOccur[nOccur$Freq > 1,]
[1] Var1 Freq
<0 rows> (or 0-length row.names)

From the above result we can confirm that no id occurs more than once, so there are no duplicate patients in our dataset.
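
The frequency-table check above looks only at the id column. As a sketch of an alternative, base R's anyDuplicated() and duplicated() cover both repeated ids and fully identical rows. The four-row data frame here is hypothetical and deliberately contains a repeat so that both checks return something.

```r
# Hypothetical stand-in: id 2 repeats, and rows 2 and 3 are fully identical
demoData <- data.frame(id = c(1, 2, 2, 3), age = c(67, 61, 61, 80))

# Position of the first repeated id, or 0 if every id is unique
anyDuplicated(demoData$id)

# Number of rows that exactly repeat an earlier row
sum(duplicated(demoData))
```

On the real data, a result of 0 from both calls would confirm the conclusion above.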

The summary shows that bmi is stored as a string (character), so we need to convert it to numeric.

Code
healthCareStrokeData$bmi <- as.numeric(healthCareStrokeData$bmi)

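The “N/A” strings in bmi cannot be converted to numbers, so they become NA values (introduced by coercion). This is also why the earlier NA check returned 0. Let’s count the NA values and drop those rows before going to the next step.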

Code
sum(is.na(healthCareStrokeData))
[1] 201
Code
healthCareStrokeData <- na.omit(healthCareStrokeData)
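
An alternative to converting bmi after import is to tell read_csv() at import time that “N/A” marks a missing value. The following is a minimal sketch of that approach, not what the analysis above does. It reads a two-row literal CSV (values taken from the head() output) instead of the real file, so that it runs on its own.

```r
library(readr)

# Two-row literal CSV standing in for the real file (values from head() above)
csvText <- I("id,bmi\n9046,36.6\n51676,N/A\n")

# Listing "N/A" in `na` lets readr parse bmi as numeric with a real NA
demoData <- read_csv(csvText, na = c("", "NA", "N/A"), show_col_types = FALSE)

is.numeric(demoData$bmi)
sum(is.na(demoData$bmi))
```

With this option the first NA check would report the missing bmi values directly, and the as.numeric() step would no longer be needed.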

As our main interest in this project is the people who had a stroke, let’s filter our dataset to contain only those people.

Code
healthCareOnlyStroke <- healthCareStrokeData %>% filter(stroke == 1)

Let’s look at the statistical summary (min, max, median, mean, and quartiles) and the data dimensions for our target dataset.

Code
summary(healthCareOnlyStroke)
       id           gender               age         hypertension   
 Min.   :  210   Length:209         Min.   :14.00   Min.   :0.0000  
 1st Qu.:17308   Class :character   1st Qu.:58.00   1st Qu.:0.0000  
 Median :36857   Mode  :character   Median :70.00   Median :0.0000  
 Mean   :37546                      Mean   :67.71   Mean   :0.2871  
 3rd Qu.:56939                      3rd Qu.:78.00   3rd Qu.:1.0000  
 Max.   :72918                      Max.   :82.00   Max.   :1.0000  
 heart_disease    ever_married        work_type         Residence_type    
 Min.   :0.0000   Length:209         Length:209         Length:209        
 1st Qu.:0.0000   Class :character   Class :character   Class :character  
 Median :0.0000   Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.1914                                                           
 3rd Qu.:0.0000                                                           
 Max.   :1.0000                                                           
 avg_glucose_level      bmi        smoking_status         stroke 
 Min.   : 56.11    Min.   :16.90   Length:209         Min.   :1  
 1st Qu.: 80.43    1st Qu.:26.40   Class :character   1st Qu.:1  
 Median :106.58    Median :29.70   Mode  :character   Median :1  
 Mean   :134.57    Mean   :30.47                      Mean   :1  
 3rd Qu.:196.92    3rd Qu.:33.70                      3rd Qu.:1  
 Max.   :271.74    Max.   :56.60                      Max.   :1  
Code
dim(healthCareOnlyStroke)
[1] 209  12

Data Analysis and Visualization

Now that our dataset has been imported, cleaned, and tidied it can be used for further visualization and analysis. Let’s begin our analysis with the most basic questions like the mean and median of our age, bmi and avg_glucose_level.

As stated above, the main variables I’m going to focus on are:

  • age,
  • bmi and
  • avg_glucose_level.

Code
strokeLabels <- table(healthCareStrokeData$stroke)
pie(strokeLabels,
    labels = paste(c("No stroke:", "Stroke:"), strokeLabels),
    main = "Number of people with and without a stroke")

Code
histogramStrokeData<-healthCareStrokeData
histogramStrokeData$stroke <- factor(histogramStrokeData$stroke,
                         levels = c(0,1),
                         labels = c("Didn't have a stroke","Had a Stroke"))
ggplot(histogramStrokeData, aes(stroke)) +
  geom_bar(fill=c("aquamarine2","pink3")) +
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Stroke Analysis")

Code
favstats((healthCareStrokeData %>% filter(stroke == 0))$age)
  min Q1 median Q3 max     mean       sd    n missing
 0.08 24     43 59  82 41.76045 22.26813 4700       0
Code
favstats((healthCareStrokeData %>% filter(stroke == 1))$age)
 min Q1 median Q3 max     mean       sd   n missing
  14 58     70 78  82 67.71292 12.40285 209       0

Effect of Age

Code
strokeAgeLabels<-(healthCareStrokeData %>% filter(stroke == 1))$age
hist(strokeAgeLabels,
     main = "Age Histogram of all the people who had strokes",
     xlab = "Age",
     col = "white",
     border = 4)

We can see that strokes in this dataset are most frequent in the 70-80 age bracket. But age is something we cannot change, so let’s take a detailed look at the other attributes.

Effect of heart_disease

Code
datasetForHeartAnalysis<-healthCareStrokeData
datasetForHeartAnalysis$heart_disease[datasetForHeartAnalysis$heart_disease == 0]<-"No Heart Disease"
datasetForHeartAnalysis$heart_disease[datasetForHeartAnalysis$heart_disease == 1]<-"Has Heart Disease"
datasetForHeartAnalysis$stroke[healthCareStrokeData$stroke == 0]<-"Didn't have a Stroke"
datasetForHeartAnalysis$stroke[healthCareStrokeData$stroke == 1]<-"Had a Stroke"

datasetForHeartAnalysis %>% filter(stroke=="Had a Stroke")%>% 
                        ggplot(aes(age, fill=heart_disease)) + 
                        geom_density(alpha=0.3) + 
                        ggtitle("Stroke by Age and heart_disease") + 
                        xlab("Age") + 
                        ylab("Density")

Code
ggplot(data = datasetForHeartAnalysis,
       aes(x = heart_disease,
           fill = stroke)) +
  geom_bar() +
  ggtitle("Stacked barchart for Heart Diseases v/s Stroke")

Code
tableDatasetForHeartAnalysis<-datasetForHeartAnalysis%>%filter(heart_disease=="Has Heart Disease")
table(tableDatasetForHeartAnalysis$stroke)

Didn't have a Stroke         Had a Stroke 
                 203                   40 
Code
tableDatasetForHeartAnalysis<-datasetForHeartAnalysis%>%filter(heart_disease=="No Heart Disease")
table(tableDatasetForHeartAnalysis$stroke)

Didn't have a Stroke         Had a Stroke 
                4497                  169 

In this dataset, about 16.5% of the people with heart disease had a stroke (40 of 243), compared with about 3.6% of the people without heart disease (169 of 4,666).

From the above data and plots, it is evident that people with heart disease are more likely to have a stroke, especially as they age.
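
The two percentages can be reproduced directly from the table() outputs above. This sketch re-enters the four counts by hand into a matrix (named heartCounts here for illustration) and lets prop.table() compute the row-wise shares.

```r
# Counts copied from the two table() outputs above
heartCounts <- matrix(
  c(203, 40,
    4497, 169),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    heart_disease = c("Has Heart Disease", "No Heart Disease"),
    stroke = c("Didn't have a Stroke", "Had a Stroke")
  )
)

# Row-wise percentages: share of each heart_disease group that had a stroke
heartRates <- round(100 * prop.table(heartCounts, margin = 1), 1)
heartRates
```

The "Had a Stroke" column of heartRates holds the stroke rate for each group.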

Effect of hypertension

Code
datasetForHypertensionAnalysis<-healthCareStrokeData
datasetForHypertensionAnalysis$hypertension[datasetForHypertensionAnalysis$hypertension == 0]<-"No Hypertension"
datasetForHypertensionAnalysis$hypertension[datasetForHypertensionAnalysis$hypertension == 1]<-"Has Hypertension"
datasetForHypertensionAnalysis$stroke[datasetForHypertensionAnalysis$stroke == 0]<-"Didn't have a Stroke"
datasetForHypertensionAnalysis$stroke[datasetForHypertensionAnalysis$stroke == 1]<-"Had a Stroke"

datasetForHypertensionAnalysis %>% filter(stroke=="Had a Stroke")%>% 
                        ggplot(aes(age, fill=hypertension)) + 
                        geom_density(alpha=0.3) + 
                        ggtitle("Stroke by Age and hypertension ") + 
                        xlab("Age") + 
                        ylab("Density")

Code
ggplot(data = datasetForHypertensionAnalysis,
       aes(x = hypertension,
           fill = stroke)) +
  geom_bar() +
  ggtitle("Stacked barchart for Hypertension v/s Stroke")

Code
tableDatasetForHypertension<-datasetForHypertensionAnalysis%>%filter(hypertension=="Has Hypertension")
table(tableDatasetForHypertension$stroke)

Didn't have a Stroke         Had a Stroke 
                 391                   60 
Code
tableDatasetForHypertension<-datasetForHypertensionAnalysis%>%filter(hypertension=="No Hypertension")
table(tableDatasetForHypertension$stroke)

Didn't have a Stroke         Had a Stroke 
                4309                  149 

In this dataset, about 13.3% of the people with hypertension had a stroke (60 of 451), compared with about 3.3% of the people without hypertension (149 of 4,458).

People with hypertension are therefore approximately 4 times as likely to have had a stroke.
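
The factor of roughly four follows from dividing the two stroke rates, using the counts from the tables above:

$$
\frac{60 / (391 + 60)}{149 / (4309 + 149)} = \frac{60 / 451}{149 / 4458} \approx \frac{0.1330}{0.0334} \approx 3.98
$$

This ratio is commonly called the relative risk. Here it describes this sample only and is not adjusted for age or any other attribute.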

What if a person has both hypertension and heart disease?

Code
tableForDualAnalysis<-healthCareStrokeData%>%filter(hypertension==1 & heart_disease==1)
table(tableForDualAnalysis$stroke)

 0  1 
47 11 

Of the 58 people who have both hypertension and heart disease, 11 had a stroke, which is about 19%.
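
This group contains only 58 people, so the 19% figure is less certain than the rates computed on thousands of rows. As a rough sketch of how to quantify that, base R's binom.test() returns an exact 95% confidence interval for a proportion. The original analysis does not include this step.

```r
# Counts from the table above: 11 strokes among 58 people with both conditions
strokes <- 11
groupSize <- strokes + 47

# Exact (Clopper-Pearson) 95% confidence interval for the stroke proportion
interval <- binom.test(strokes, groupSize)$conf.int

strokes / groupSize   # point estimate, about 0.19
interval              # lower and upper bounds of the interval
```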

Effect of bmi

Code
hist(healthCareStrokeData$bmi, col = viridis(12, 0.5), xlab = "BMI")

Code
datasetForbmiAnalysis<-healthCareStrokeData
datasetForbmiAnalysis$stroke[datasetForbmiAnalysis$stroke == 0]<-"No Stroke"
datasetForbmiAnalysis$stroke[datasetForbmiAnalysis$stroke == 1]<-"Had a Stroke"
datasetForbmiAnalysis %>% ggplot(aes(bmi, fill=stroke)) + geom_density(alpha=0.3) + ggtitle("Stroke by bmi") + xlab("BMI") + ylab("Density")

Code
healthCareOnlyStroke %>% ggplot(aes(age, bmi, color=gender)) + geom_point() + ggtitle("Stroke and bmi by Age")

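We can see that most of the people who had a stroke have a bmi greater than 25 (overweight): in the stroke-only summary above, the first quartile of bmi is 26.4. This suggests that a bmi above 25 goes along with a higher chance of having a stroke.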
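
To put a number on the bmi threshold, bmi can be binned into categories and the share of strokes computed per category. The sketch below shows the mechanics on a hypothetical eight-row data frame, so its values are made up and say nothing about the real data. The cut-offs 18.5, 25 and 30 are the commonly used adult BMI boundaries and are an assumption added here.

```r
# Hypothetical bmi values and stroke flags, for illustration only
demoData <- data.frame(
  bmi = c(17.5, 22.0, 24.9, 25.0, 27.3, 29.9, 30.0, 36.6),
  stroke = c(0, 0, 0, 1, 0, 1, 0, 1)
)

# right = FALSE makes each bin include its lower bound, so 25.0 is "Overweight"
demoData$bmiGroup <- cut(
  demoData$bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right = FALSE
)

# The mean of a 0/1 flag is the share of strokes within each group
strokeShare <- tapply(demoData$stroke, demoData$bmiGroup, mean)
strokeShare
```

Running the same two steps on healthCareStrokeData would give the stroke rate per bmi category.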

Effect of Glucose Level

Code
hist(healthCareStrokeData$avg_glucose_level,col=viridis(12,0.5),xlab = "Average Glucose Level")

Code
datasetForGlucoseAnalysis<-healthCareStrokeData
datasetForGlucoseAnalysis$stroke[datasetForGlucoseAnalysis$stroke == 0]<-"No Stroke"
datasetForGlucoseAnalysis$stroke[datasetForGlucoseAnalysis$stroke == 1]<-"Had a Stroke"
datasetForGlucoseAnalysis %>% ggplot(aes(avg_glucose_level, fill=stroke)) + geom_density(alpha=0.3) + ggtitle("Stroke by glucoselevel") + xlab("avg_glucose_level") + ylab("Density")

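As with bmi, the chance of having a stroke appears to be higher at higher glucose levels: the mean avg_glucose_level is 134.57 in the stroke-only summary, compared with 106.15 in the full dataset.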

Effect of Smoking

Code
datasetForSmokingAnalysis<-healthCareStrokeData
datasetForSmokingAnalysis$stroke[datasetForSmokingAnalysis$stroke == 0]<-"No Stroke"
datasetForSmokingAnalysis$stroke[datasetForSmokingAnalysis$stroke == 1]<-"Had a Stroke"

ggplot(datasetForSmokingAnalysis,
       aes(x = smoking_status,
           fill = stroke)) +
  geom_bar() +
  ggtitle("Stacked barchart for Smoking Status v/s Stroke")

Code
datasetForSmokingAnalysis %>% 
  filter(stroke == "Had a Stroke" & age<70 &smoking_status!="Unknown")%>%
  ggplot(aes(age, fill=smoking_status)) + 
  geom_density(alpha=0.3) + 
  ggtitle("Stroke by Age and Smoking Status") + 
  xlab("Age") + ylab("Density")

From the plot, former smokers and current smokers appear more likely to have a stroke than people who have never smoked.
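
A stacked bar chart of raw counts makes it hard to compare groups of different sizes. As a sketch of one remedy, position = "fill" in geom_bar() rescales every bar to the same height, so the coloured segments show proportions. The six rows below are hypothetical and only demonstrate the call.

```r
library(ggplot2)

# Hypothetical rows, for illustration only
demoData <- data.frame(
  smoking_status = c("smokes", "smokes", "never smoked", "never smoked",
                     "formerly smoked", "formerly smoked"),
  stroke = c("Had a Stroke", "No Stroke", "No Stroke", "No Stroke",
             "Had a Stroke", "No Stroke")
)

# position = "fill" rescales every bar to height 1, so the coloured
# segments show proportions instead of raw counts
fillPlot <- ggplot(demoData, aes(x = smoking_status, fill = stroke)) +
  geom_bar(position = "fill") +
  ylab("Proportion") +
  ggtitle("Share of strokes within each smoking status")

fillPlot
```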

Effect of Gender

Code
healthCareOnlyStroke  %>% ggplot(aes(age, fill=gender)) + geom_density(alpha=0.3) + ggtitle("Stroke by Age in Male and Female")

We can see that women tend to have strokes at an earlier age, while in men the chance of a stroke builds up over time.

Conclusion

Now that we have a detailed analysis of the indicators associated with stroke in our dataset, you can check for yourself how your own risk factors compare. One thing we have no control over is age, as everyone ages, but the rest of the attributes give us a clear understanding of what we can do to decrease our chances of having a stroke. A good start would be to quit smoking, manage stress, bring bmi into the normal range, keep glucose levels low, and look after heart health.

I hope this study helps you make a data-driven decision about your health and lifestyle in order to prevent strokes.
