Data Analytics and Computational Social Science: HW6

Rhowena Vespa

INTRODUCTION

According to the CDC, heart disease is the leading cause of death in the US, followed by cancer and Covid-19 in 2020. As such, unlike my initial homework submissions which was focused on the “stroke” variable, I have decided to concentrate on the prevalence of heart disease for this final paper. This research involves analysis of “Stroke Prediction Dataset” from Kaggle. The dataset has 5110 observations consisting of binary and continuous variables of risk factors for heart disease such as hypertension, obesity, diabetes, and smoking among others. Using the data set, this research aims to establish correlations between the prevalence of the different risk factors in patients who have heart disease. The CDC also states that adults age 45-54 have the highest prevalence of obesity (38.1%) while adults age 18-24 have the lowest prevalence (19.5%) in the US. I would like to compare this information with this data set.

RESEARCH QUESTIONS

What significant risk factors are prevalent in people with heart disease?
Are people who formerly smoked less likely to also have heart disease than smokers?
Are older people more likely to have obesity and have diabetes?

DATA

Read CSV file into R

library(distill)
library(dplyr)
library(readr)
library(tidyverse)
Stroke<- read.csv('healthcare-dataset-stroke-data.csv',TRUE,',',na.strings = "N/A")
class(Stroke)

[1] "data.frame"

colnames(Stroke)

 [1] "id"                "gender"            "age"              
 [4] "hypertension"      "heart_disease"     "ever_married"     
 [7] "work_type"         "Residence_type"    "avg_glucose_level"
[10] "bmi"               "smoking_status"    "stroke"

dim(Stroke)

[1] 5110   12

The dataset consisted of people from infant (age=0) to age 82. This research will focus on the prevalence of risk factors on ADULT patients with heart disease. The dataset will first be filtered to exclude data on children (age < 18) and patient ID. The binary variables that have values of 1 =yes and 0=No are hypertension, heart disease and stroke; gender is “male” or “female”, ever married is “yes” or “no”, residence type is “urban” or “rural”. Other categorical variables are work type and smoking status, Numerical variables are age, body mass index (bmi) and avg glucose levels.

As with most available medical datasets, there is an imbalance in their class labels. As such, explanations and predictability of data is difficult to interpret. The dataset from Kaggle also has an imbalance class problem (Figure 1)

AdultStroke<- filter(Stroke, age>=18) #Filter out children age 0-17
head(AdultStroke)

     id gender age hypertension heart_disease ever_married
1  9046   Male  67            0             1          Yes
2 51676 Female  61            0             0          Yes
3 31112   Male  80            0             1          Yes
4 60182 Female  49            0             0          Yes
5  1665 Female  79            1             0          Yes
6 56669   Male  81            0             0          Yes
      work_type Residence_type avg_glucose_level  bmi  smoking_status
1       Private          Urban            228.69 36.6 formerly smoked
2 Self-employed          Rural            202.21   NA    never smoked
3       Private          Rural            105.92 32.5    never smoked
4       Private          Urban            171.23 34.4          smokes
5 Self-employed          Rural            174.12 24.0    never smoked
6       Private          Urban            186.21 29.0 formerly smoked
  stroke
1      1
2      1
3      1
4      1
5      1
6      1

AdultStroke1<-select(AdultStroke,-c(id))

HD<-mutate(AdultStroke1, heart_disease = recode(heart_disease, `1` = "Yes", `0` = "No"))

HD %>%
  count(heart_disease, sort = TRUE) %>%
  head(4)

  heart_disease    n
1            No 3979
2           Yes  275

Figure 1 - Imbalance dataset

ggplot(HD, aes(x=age, fill=heart_disease))+
  geom_bar() +
  labs(x="Age", y="Count", title = "Figure 1- Imbalance Data Set")

Understand data with descriptive statistics

HD$heart_disease <- as.factor(HD$heart_disease) 
summary(HD)

    gender               age        hypertension    heart_disease
 Length:4254        Min.   :18.0   Min.   :0.0000   No :3979     
 Class :character   1st Qu.:36.0   1st Qu.:0.0000   Yes: 275     
 Mode  :character   Median :50.5   Median :0.0000                
                    Mean   :50.2   Mean   :0.1168                
                    3rd Qu.:64.0   3rd Qu.:0.0000                
                    Max.   :82.0   Max.   :1.0000                
                                                                 
 ever_married        work_type         Residence_type    
 Length:4254        Length:4254        Length:4254       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
 avg_glucose_level      bmi        smoking_status    
 Min.   : 55.12    Min.   :11.30   Length:4254       
 1st Qu.: 77.48    1st Qu.:25.40   Class :character  
 Median : 92.47    Median :29.20   Mode  :character  
 Mean   :108.51    Mean   :30.43                     
 3rd Qu.:116.14    3rd Qu.:34.20                     
 Max.   :271.74    Max.   :92.00                     
                   NA's   :181                       
     stroke       
 Min.   :0.00000  
 1st Qu.:0.00000  
 Median :0.00000  
 Mean   :0.05806  
 3rd Qu.:0.00000  
 Max.   :1.00000

BALANCE THE DATASET

Balance the data set through oversampling where the data will be divided into train and test set. I will perform a classification predictive model on the training data set. Random Forest model will be used for prediction purposes. After which, an oversampled data set is created to correct the imbalance. The new oversampled data set is now balanced. (Figure 2)

set.seed(123) #partition into train and test data
ind <- sample(2, nrow(AdultStroke1), replace = TRUE, prob = c(0.7, 0.3))
train <- AdultStroke1[ind==1,]
test <- AdultStroke1[ind==2,]

table(train$heart_disease) #classification predictive model data on training set


   0    1 
2777  199

prop.table(table(train$heart_disease))


         0          1 
0.93313172 0.06686828

summary(train)

    gender               age         hypertension  heart_disease    
 Length:2976        Min.   :18.00   Min.   :0.00   Min.   :0.00000  
 Class :character   1st Qu.:37.00   1st Qu.:0.00   1st Qu.:0.00000  
 Mode  :character   Median :51.00   Median :0.00   Median :0.00000  
                    Mean   :50.59   Mean   :0.12   Mean   :0.06687  
                    3rd Qu.:64.00   3rd Qu.:0.00   3rd Qu.:0.00000  
                    Max.   :82.00   Max.   :1.00   Max.   :1.00000  
                                                                    
 ever_married        work_type         Residence_type    
 Length:2976        Length:2976        Length:2976       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
 avg_glucose_level      bmi       smoking_status    
 Min.   : 55.12    Min.   :11.3   Length:2976       
 1st Qu.: 77.31    1st Qu.:25.5   Class :character  
 Median : 92.91    Median :29.1   Mode  :character  
 Mean   :108.85    Mean   :30.3                     
 3rd Qu.:116.02    3rd Qu.:34.0                     
 Max.   :267.60    Max.   :78.0                     
                   NA's   :127                      
     stroke       
 Min.   :0.00000  
 1st Qu.:0.00000  
 Median :0.00000  
 Mean   :0.06082  
 3rd Qu.:0.00000  
 Max.   :1.00000

Predictive Model using Random Forest

library(ROSE)
library(randomForest)
library(caret)
library(e1071)

rftrain <- randomForest(heart_disease~.,data = train, na.action=na.omit)
print(rftrain)


Call:
 randomForest(formula = heart_disease ~ ., data = train, na.action = na.omit) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 3

          Mean of squared residuals: 0.05386863
                    % Var explained: 6.56

BALANCED DATA by oversampling is now HDover (see Figure 2)

HDover <- ovun.sample(heart_disease~., data = train, method = "over", N = 5340)$data

summary(HDover)

    gender               age         hypertension    heart_disease   
 Length:5340        Min.   :18.00   Min.   :0.0000   Min.   :0.0000  
 Class :character   1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000  
 Mode  :character   Median :63.00   Median :0.0000   Median :0.0000  
                    Mean   :59.27   Mean   :0.1682   Mean   :0.4993  
                    3rd Qu.:74.00   3rd Qu.:0.0000   3rd Qu.:1.0000  
                    Max.   :82.00   Max.   :1.0000   Max.   :1.0000  
 ever_married        work_type         Residence_type    
 Length:5340        Length:5340        Length:5340       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
 avg_glucose_level      bmi        smoking_status         stroke   
 Min.   : 55.12    Min.   :11.30   Length:5340        Min.   :0.0  
 1st Qu.: 78.95    1st Qu.:26.10   Class :character   1st Qu.:0.0  
 Median : 96.58    Median :29.50   Mode  :character   Median :0.0  
 Mean   :119.65    Mean   :30.32                      Mean   :0.1  
 3rd Qu.:150.45    3rd Qu.:33.80                      3rd Qu.:0.0  
 Max.   :267.60    Max.   :78.00                      Max.   :1.0

rfHDover <- randomForest(heart_disease~., data = HDover)

dim(HDover)

[1] 5340   11

head(HDover)

  gender age hypertension heart_disease ever_married     work_type
1   Male  81            0             0          Yes       Private
2 Female  78            0             0          Yes       Private
3 Female  54            0             0          Yes       Private
4   Male  75            1             0          Yes       Private
5 Female  60            0             0           No       Private
6 Female  52            1             0          Yes Self-employed
  Residence_type avg_glucose_level  bmi  smoking_status stroke
1          Urban            186.21 29.0 formerly smoked      1
2          Urban             58.57 24.2         Unknown      1
3          Urban            104.51 27.3          smokes      1
4          Urban            221.29 25.8          smokes      1
5          Urban             89.22 37.8    never smoked      1
6          Urban            233.29 48.9    never smoked      1

HDover %>%
  count(heart_disease, sort = TRUE) %>%
  head(2)

  heart_disease    n
1             0 2674
2             1 2666

FIGURE 2 -BALANCE DATA SET WITH 5340 obs.

HDover<-mutate(HDover, heart_disease = recode(heart_disease, `1` = "Yes", `0` = "No"))

ggplot(HDover, aes(x=age, fill=heart_disease))+
  geom_bar() +
  labs(x="Age", y="Count", title = "Figure 2- BALANCED DATA SET")

Recode binary variables to “YES” or “NO” for clarity

HDover<-mutate(HDover, hypertension = recode(hypertension, `1` = "Yes", `0` = "No"))

HDover<-mutate(HDover, stroke = recode(stroke, `1` = "Yes", `0` = "No"))

ANALYZE AND VISUALIZE BALANCED DATASET

RESEARCH QUESTION 1: What significant risk factors are prevalent in people with heart disease?

The binary variables were separated based on counts then analyzed for statistical significance. The numerical variables are analyzed on mean, median, sd.

HDover %>%
    count(stroke, sort = TRUE) %>%
  head(2)

  stroke    n
1     No 4806
2    Yes  534

HDover %>%
    count(hypertension, sort = TRUE) %>%
  head(2)

  hypertension    n
1           No 4442
2          Yes  898

HDover %>%
    count(gender, sort = TRUE) %>%
  head(2)

  gender    n
1 Female 2702
2   Male 2637

HDover %>%
    count(smoking_status, sort = TRUE) %>%
  head(4)

   smoking_status    n
1    never smoked 2054
2 formerly smoked 1350
3          smokes 1049
4         Unknown  887

HDover %>%
    count(ever_married, sort = TRUE) %>%
  head(4)

  ever_married    n
1          Yes 4490
2           No  850

HDover %>%
    count(Residence_type, sort = TRUE) %>%
  head(2)

  Residence_type    n
1          Urban 2769
2          Rural 2571

HDover %>%
    count(work_type, sort = TRUE) %>%
  head(4)

      work_type    n
1       Private 3295
2 Self-employed 1321
3      Govt_job  722
4  Never_worked    2

HDsummary <- matrix(c(898,2702,534,4490,3295,2769,1049,4442,2637,4806,850,2045,2571,3404),ncol=7,byrow=TRUE)
colnames(HDsummary) <- c("Hypertension","Female","HadStroke","EverMarried","PrivateWork","UrbanHome","Smoker")
rownames(HDsummary) <- c("Yes", "No")
HDsummary <- as.table(HDsummary)
HDsummary

    Hypertension Female HadStroke EverMarried PrivateWork UrbanHome
Yes          898   2702       534        4490        3295      2769
No          4442   2637      4806         850        2045      2571
    Smoker
Yes   1049
No    3404

barplot(HDsummary,legend=T,beside=T,main='Figure 3 - Significant Risk Factors for Heart Disease', las = 2, cex.names = 0.75,col = c("pink", "blue"))

STATISTICAL ANALYSIS

Now that we have more balanced, tidy data, we would conduct statistical analysis to determine which of the predicted risk factors are significant to answer the initial research questions. Based on Figure 3, the data seems to indicate that most patients who have heart disease do not have hypertension, did not smoke, ever married, did private work and non-smokers. Gender is indeterminate.

Statistical analysis will be used to establish if any of these prevalent risk factors are statistically significant to heart disease. Pearson’s Chi-squared test was used for categorical variables (gender, hypertension, marital status, smoking status, stroke, residence type, work type. Welch Two sample T-Test was used for numerical variables ( age, bmi, average glucose levels)

Chisq for HD and Stroke

chisq.test(HDover$heart_disease,HDover$stroke)


    Pearson's Chi-squared test with Yates' continuity correction

data:  HDover$heart_disease and HDover$stroke
X-squared = 165.23, df = 1, p-value < 2.2e-16

Chisq for HD and Hypertension

chisq.test(HDover$heart_disease,HDover$hypertension)


    Pearson's Chi-squared test with Yates' continuity correction

data:  HDover$heart_disease and HDover$hypertension
X-squared = 146.09, df = 1, p-value < 2.2e-16

HDover %>%
    count(gender,heart_disease, sort = TRUE) %>%
  head(4)

  gender heart_disease    n
1 Female            No 1661
2   Male           Yes 1625
3 Female           Yes 1041
4   Male            No 1012

HDovergender <- matrix(c(1661,1017,1012,1649),ncol=2,byrow=TRUE)
colnames(HDovergender) <- c("No Heart Disease", "Has Heart Disease")
rownames(HDovergender) <- c("Female", "Male")
HDovergender <- as.table(HDovergender)
HDovergender

       No Heart Disease Has Heart Disease
Female             1661              1017
Male               1012              1649

Chisq for HD and Gender

chisq.test(HDovergender)


    Pearson's Chi-squared test with Yates' continuity correction

data:  HDovergender
X-squared = 306.39, df = 1, p-value < 2.2e-16

Chisq for HD and Smoking Status

chisq.test(HDover$heart_disease,HDover$smoking_status)


    Pearson's Chi-squared test

data:  HDover$heart_disease and HDover$smoking_status
X-squared = 133.14, df = 3, p-value < 2.2e-16

Chisq for HD and Marital status

chisq.test(HDover$heart_disease,HDover$ever_married)


    Pearson's Chi-squared test with Yates' continuity correction

data:  HDover$heart_disease and HDover$ever_married
X-squared = 84.485, df = 1, p-value < 2.2e-16

Chisq for HD and Residence type

chisq.test(HDover$heart_disease,HDover$Residence_type)


    Pearson's Chi-squared test with Yates' continuity correction

data:  HDover$heart_disease and HDover$Residence_type
X-squared = 1.5975, df = 1, p-value = 0.2063

Chisq for HD and Work type

chisq.test(HDover$heart_disease,HDover$work_type)


    Pearson's Chi-squared test

data:  HDover$heart_disease and HDover$work_type
X-squared = 119.73, df = 3, p-value < 2.2e-16

NumericHD <- subset(HDover, select= c("age","avg_glucose_level","bmi"))
dim(NumericHD)

[1] 5340    3

t.test(age ~ heart_disease, data = HDover)


    Welch Two Sample t-test

data:  age by heart_disease
t = -51.573, df = 4356.9, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -21.03291 -19.49237
sample estimates:
 mean in group No mean in group Yes 
         49.15071          69.41335

t.test(bmi ~ heart_disease, data = HDover)


    Welch Two Sample t-test

data:  bmi by heart_disease
t = -0.41206, df = 4881, p-value = 0.6803
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -0.4037695  0.2635147
sample estimates:
 mean in group No mean in group Yes 
         30.28246          30.35259

t.test(avg_glucose_level ~ heart_disease, data = HDover)


    Welch Two Sample t-test

data:  avg_glucose_level by heart_disease
t = -18.418, df = 4952.8, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -29.98902 -24.21915
sample estimates:
 mean in group No mean in group Yes 
         106.1215          133.2256

summary(NumericHD)

      age        avg_glucose_level      bmi       
 Min.   :18.00   Min.   : 55.12    Min.   :11.30  
 1st Qu.:48.00   1st Qu.: 78.95    1st Qu.:26.10  
 Median :63.00   Median : 96.58    Median :29.50  
 Mean   :59.27   Mean   :119.65    Mean   :30.32  
 3rd Qu.:74.00   3rd Qu.:150.45    3rd Qu.:33.80  
 Max.   :82.00   Max.   :267.60    Max.   :78.00

sapply(NumericHD, sd, na.rm=TRUE)

              age avg_glucose_level               bmi 
        17.578299         55.423278          6.220602

OBSERVATION TO RESEARCH QUESTION 1:

Based on p<0.05, we can identify that almost all the variables namely gender, age, hypertension, marital status, work type, average glucose level, smoking_status and stroke are prevalent to people with heart disease. Only two variables, residence type and BMI, failed to reject the null hypothesis with a p-value = 0.2063 and p-value = 0.6803, respectively. Thus, BMI and residence type are not statistically significant in the prevalence of heart disease.

RESEARCH QUESTION 2: Are people who formerly smoked also less likely to have heart disease than smokers?

OBSERVATION TO RESEARCH QUESTION 2: This question would be answered with visualization comparing counts of patients with heart disease and smokes vs heart disease and former smokers (Figure 4). Based on p-value < 2.2e-16, prevalence of smoking is statistically significant to heart disease. Based on Figure 4, it appears that heart disease is prevalent on more people who formerly smoked compared to non-smokers. Table 1 confirms this with 822 former smokers with heart disease compared to 569 smokers with heart disease.

ggplot(HDover, aes(x=age, fill=smoking_status))+
  geom_histogram(binwidth = 5)+
  facet_wrap(vars(smoking_status,heart_disease))+
  labs(x="age", y="Count", title = "Figure 4- Heart Disease and Smoking")

Chisq for HD and Smoking Status

chisq.test(HDover$heart_disease,HDover$smoking_status)


    Pearson's Chi-squared test

data:  HDover$heart_disease and HDover$smoking_status
X-squared = 133.14, df = 3, p-value < 2.2e-16

YesHDover <- subset(HDover, heart_disease == "Yes", select= c("gender","age","hypertension","heart_disease","ever_married","work_type","Residence_type","avg_glucose_level","bmi","smoking_status","stroke"))
dim(YesHDover)

[1] 2666   11

TABLE 1

YesHDover %>%
    count(smoking_status, sort = TRUE) %>%
  head(4)

   smoking_status   n
1    never smoked 929
2 formerly smoked 822
3          smokes 569
4         Unknown 346

RESEARCH QUESTION 3. Are older people more likely to have obesity and diabetes? For this question, we will use the CDC definition of obesity as BMI>or=30 and diabetes as average glucose levels over 200.

ggplot(HDover, aes(x = age, y = avg_glucose_level)) +
    geom_point() + 
    geom_smooth(method = "lm") +
    labs(x="Age", y="Ave Glucose Level", title = "Figure 5- Age and Ave Glucose Level")

ggplot(HDover, aes(x = age, y = bmi, color=bmi)) +
    geom_boxplot(outlier.colour="red", outlier.shape=8,
                outlier.size=4) + 
    geom_smooth(method = "lm") +
    labs(x="Age", y="BMI", title = "Figure 6- Age and BMI")

HDover.aov <- aov(age ~ bmi + avg_glucose_level, data = HDover)
# Summary of the analysis
summary(HDover.aov)

                    Df  Sum Sq Mean Sq F value Pr(>F)    
bmi                  1    2427    2427   8.387 0.0038 ** 
avg_glucose_level    1  102565  102565 354.357 <2e-16 ***
Residuals         5337 1544740     289                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

OBSERVATION TO RESEARCH QUESTION 3:

Both BMI and average glucose levels are statistically significant to age. On Figure 5, we can see a positive correlation that average glucose levels increase as people get older. Although we can not determine if the rise in glucose levels (mean=119.65) are significant to clinically define it as diabetes (glucose >200). The data set also did not define if average glucose levels was fasting blood sugar or not. This is important because is the glucose level is under “Fasting conditions” , then diabetes is defined as over 100. With our data set mean=119.65, if under “fasting conditions”, the average of population in data set is considered diabetic.

On Figure 6, we can see that the median BMI is 29.2 with a significant number of outliers with BMI >30. In addition, the slope of the line in Figure 6 is very slightly downward, possibly indicating a very small negative correlation between between age and BMI.

THINGS LEFT TO DO: Clean up final paper. Complete reflections and summarize conclusions. Explain visualizations in detail–e.g. rationale

BIBLIOGRAPHY:

Centers for Disease Control and Prevention. (2022, January 13). FASTSTATS - leading causes of death. Centers for Disease Control and Prevention. Retrieved January 16, 2022, from https://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm
Centers for Disease Control and Prevention. (2021, September 27). Adult obesity prevalence maps. Centers for Disease Control and Prevention. Retrieved January 18, 2022, from https://www.cdc.gov/obesity/data/prevalence-maps.html#age
Fedesoriano. (2021, January 26). Stroke prediction dataset. Kaggle. Retrieved January 17, 2022, from https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
Grolemund, H. W. and G. (n.d.). R for data science. Welcome | R for Data Science. Retrieved January 17, 2022, from https://r4ds.had.co.nz/index.html
R as programming language
Finnstats. (2021, April 13). Random Forest in R: R-bloggers. R. Retrieved January 17, 2022, from https://www.r-bloggers.com/2021/04/random-forest-in-r/

Comment on this article Share:

HW6

INTRODUCTION

RESEARCH QUESTIONS

DATA

Figure 1 - Imbalance dataset

Understand data with descriptive statistics

BALANCE THE DATASET

Predictive Model using Random Forest

BALANCED DATA by oversampling is now HDover (see Figure 2)

FIGURE 2 -BALANCE DATA SET WITH 5340 obs.

Recode binary variables to “YES” or “NO” for clarity

ANALYZE AND VISUALIZE BALANCED DATASET

RESEARCH QUESTION 1: What significant risk factors are prevalent in people with heart disease?

The binary variables were separated based on counts then analyzed for statistical significance. The numerical variables are analyzed on mean, median, sd.

STATISTICAL ANALYSIS

Chisq for HD and Stroke

Chisq for HD and Hypertension

Chisq for HD and Gender

Chisq for HD and Smoking Status

Chisq for HD and Marital status

Chisq for HD and Residence type

Chisq for HD and Work type

OBSERVATION TO RESEARCH QUESTION 1:

RESEARCH QUESTION 2: Are people who formerly smoked also less likely to have heart disease than smokers?

Chisq for HD and Smoking Status

TABLE 1

RESEARCH QUESTION 3. Are older people more likely to have obesity and diabetes? For this question, we will use the CDC definition of obesity as BMI>or=30 and diabetes as average glucose levels over 200.

OBSERVATION TO RESEARCH QUESTION 3:

BIBLIOGRAPHY:

Reuse

Citation