Prevalence of Other Risk Factors in People with Heart Disease
According to the CDC, heart disease is the leading cause of death in the US, followed by cancer and Covid-19 in 2020. As such, unlike my initial homework submissions which was focused on the “stroke” variable, I have decided to concentrate on the prevalence of heart disease for this final paper. This research involves analysis of “Stroke Prediction Dataset” from Kaggle. The dataset has 5110 observations consisting of binary and continuous variables of risk factors for heart disease such as hypertension, obesity, diabetes, and smoking among others. Using the data set, this research aims to establish correlations between the prevalence of the different risk factors in patients who have heart disease. The CDC also states that adults age 45-54 have the highest prevalence of obesity (38.1%) while adults age 18-24 have the lowest prevalence (19.5%) in the US. I would like to compare this information with this data set.
library(distill)
library(dplyr)
library(readr)
library(tidyverse)
Stroke<- read.csv('healthcare-dataset-stroke-data.csv',TRUE,',',na.strings = "N/A")
class(Stroke)
[1] "data.frame"
colnames(Stroke)
[1] "id" "gender" "age"
[4] "hypertension" "heart_disease" "ever_married"
[7] "work_type" "Residence_type" "avg_glucose_level"
[10] "bmi" "smoking_status" "stroke"
dim(Stroke)
[1] 5110 12
The dataset consisted of people from infant (age=0) to age 82. This research will focus on the prevalence of risk factors on ADULT patients with heart disease. The dataset will first be filtered to exclude data on children (age < 18) and patient ID. The binary variables that have values of 1 =yes and 0=No are hypertension, heart disease and stroke; gender is “male” or “female”, ever married is “yes” or “no”, residence type is “urban” or “rural”. Other categorical variables are work type and smoking status, Numerical variables are age, body mass index (bmi) and avg glucose levels.
As with most available medical datasets, there is an imbalance in their class labels. As such, explanations and predictability of data is difficult to interpret. The dataset from Kaggle also has an imbalance class problem (Figure 1)
id gender age hypertension heart_disease ever_married
1 9046 Male 67 0 1 Yes
2 51676 Female 61 0 0 Yes
3 31112 Male 80 0 1 Yes
4 60182 Female 49 0 0 Yes
5 1665 Female 79 1 0 Yes
6 56669 Male 81 0 0 Yes
work_type Residence_type avg_glucose_level bmi smoking_status
1 Private Urban 228.69 36.6 formerly smoked
2 Self-employed Rural 202.21 NA never smoked
3 Private Rural 105.92 32.5 never smoked
4 Private Urban 171.23 34.4 smokes
5 Self-employed Rural 174.12 24.0 never smoked
6 Private Urban 186.21 29.0 formerly smoked
stroke
1 1
2 1
3 1
4 1
5 1
6 1
gender age hypertension heart_disease
Length:4254 Min. :18.0 Min. :0.0000 No :3979
Class :character 1st Qu.:36.0 1st Qu.:0.0000 Yes: 275
Mode :character Median :50.5 Median :0.0000
Mean :50.2 Mean :0.1168
3rd Qu.:64.0 3rd Qu.:0.0000
Max. :82.0 Max. :1.0000
ever_married work_type Residence_type
Length:4254 Length:4254 Length:4254
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
avg_glucose_level bmi smoking_status
Min. : 55.12 Min. :11.30 Length:4254
1st Qu.: 77.48 1st Qu.:25.40 Class :character
Median : 92.47 Median :29.20 Mode :character
Mean :108.51 Mean :30.43
3rd Qu.:116.14 3rd Qu.:34.20
Max. :271.74 Max. :92.00
NA's :181
stroke
Min. :0.00000
1st Qu.:0.00000
Median :0.00000
Mean :0.05806
3rd Qu.:0.00000
Max. :1.00000
Balance the data set through oversampling where the data will be divided into train and test set. I will perform a classification predictive model on the training data set. Random Forest model will be used for prediction purposes. After which, an oversampled data set is created to correct the imbalance. The new oversampled data set is now balanced. (Figure 2)
table(train$heart_disease) #classification predictive model data on training set
0 1
2777 199
prop.table(table(train$heart_disease))
0 1
0.93313172 0.06686828
summary(train)
gender age hypertension heart_disease
Length:2976 Min. :18.00 Min. :0.00 Min. :0.00000
Class :character 1st Qu.:37.00 1st Qu.:0.00 1st Qu.:0.00000
Mode :character Median :51.00 Median :0.00 Median :0.00000
Mean :50.59 Mean :0.12 Mean :0.06687
3rd Qu.:64.00 3rd Qu.:0.00 3rd Qu.:0.00000
Max. :82.00 Max. :1.00 Max. :1.00000
ever_married work_type Residence_type
Length:2976 Length:2976 Length:2976
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
avg_glucose_level bmi smoking_status
Min. : 55.12 Min. :11.3 Length:2976
1st Qu.: 77.31 1st Qu.:25.5 Class :character
Median : 92.91 Median :29.1 Mode :character
Mean :108.85 Mean :30.3
3rd Qu.:116.02 3rd Qu.:34.0
Max. :267.60 Max. :78.0
NA's :127
stroke
Min. :0.00000
1st Qu.:0.00000
Median :0.00000
Mean :0.06082
3rd Qu.:0.00000
Max. :1.00000
rftrain <- randomForest(heart_disease~.,data = train, na.action=na.omit)
print(rftrain)
Call:
randomForest(formula = heart_disease ~ ., data = train, na.action = na.omit)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 3
Mean of squared residuals: 0.05386863
% Var explained: 6.56
HDover <- ovun.sample(heart_disease~., data = train, method = "over", N = 5340)$data
summary(HDover)
gender age hypertension heart_disease
Length:5340 Min. :18.00 Min. :0.0000 Min. :0.0000
Class :character 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:0.0000
Mode :character Median :63.00 Median :0.0000 Median :0.0000
Mean :59.27 Mean :0.1682 Mean :0.4993
3rd Qu.:74.00 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :82.00 Max. :1.0000 Max. :1.0000
ever_married work_type Residence_type
Length:5340 Length:5340 Length:5340
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
avg_glucose_level bmi smoking_status stroke
Min. : 55.12 Min. :11.30 Length:5340 Min. :0.0
1st Qu.: 78.95 1st Qu.:26.10 Class :character 1st Qu.:0.0
Median : 96.58 Median :29.50 Mode :character Median :0.0
Mean :119.65 Mean :30.32 Mean :0.1
3rd Qu.:150.45 3rd Qu.:33.80 3rd Qu.:0.0
Max. :267.60 Max. :78.00 Max. :1.0
rfHDover <- randomForest(heart_disease~., data = HDover)
dim(HDover)
[1] 5340 11
head(HDover)
gender age hypertension heart_disease ever_married work_type
1 Male 81 0 0 Yes Private
2 Female 78 0 0 Yes Private
3 Female 54 0 0 Yes Private
4 Male 75 1 0 Yes Private
5 Female 60 0 0 No Private
6 Female 52 1 0 Yes Self-employed
Residence_type avg_glucose_level bmi smoking_status stroke
1 Urban 186.21 29.0 formerly smoked 1
2 Urban 58.57 24.2 Unknown 1
3 Urban 104.51 27.3 smokes 1
4 Urban 221.29 25.8 smokes 1
5 Urban 89.22 37.8 never smoked 1
6 Urban 233.29 48.9 never smoked 1
smoking_status n
1 never smoked 2054
2 formerly smoked 1350
3 smokes 1049
4 Unknown 887
Residence_type n
1 Urban 2769
2 Rural 2571
work_type n
1 Private 3295
2 Self-employed 1321
3 Govt_job 722
4 Never_worked 2
HDsummary <- matrix(c(898,2702,534,4490,3295,2769,1049,4442,2637,4806,850,2045,2571,3404),ncol=7,byrow=TRUE)
colnames(HDsummary) <- c("Hypertension","Female","HadStroke","EverMarried","PrivateWork","UrbanHome","Smoker")
rownames(HDsummary) <- c("Yes", "No")
HDsummary <- as.table(HDsummary)
HDsummary
Hypertension Female HadStroke EverMarried PrivateWork UrbanHome
Yes 898 2702 534 4490 3295 2769
No 4442 2637 4806 850 2045 2571
Smoker
Yes 1049
No 3404
Now that we have more balanced, tidy data, we would conduct statistical analysis to determine which of the predicted risk factors are significant to answer the initial research questions. Based on Figure 3, the data seems to indicate that most patients who have heart disease do not have hypertension, did not smoke, ever married, did private work and non-smokers. Gender is indeterminate.
Statistical analysis will be used to establish if any of these prevalent risk factors are statistically significant to heart disease. Pearson’s Chi-squared test was used for categorical variables (gender, hypertension, marital status, smoking status, stroke, residence type, work type. Welch Two sample T-Test was used for numerical variables ( age, bmi, average glucose levels)
chisq.test(HDover$heart_disease,HDover$stroke)
Pearson's Chi-squared test with Yates' continuity correction
data: HDover$heart_disease and HDover$stroke
X-squared = 165.23, df = 1, p-value < 2.2e-16
chisq.test(HDover$heart_disease,HDover$hypertension)
Pearson's Chi-squared test with Yates' continuity correction
data: HDover$heart_disease and HDover$hypertension
X-squared = 146.09, df = 1, p-value < 2.2e-16
gender heart_disease n
1 Female No 1661
2 Male Yes 1625
3 Female Yes 1041
4 Male No 1012
HDovergender <- matrix(c(1661,1017,1012,1649),ncol=2,byrow=TRUE)
colnames(HDovergender) <- c("No Heart Disease", "Has Heart Disease")
rownames(HDovergender) <- c("Female", "Male")
HDovergender <- as.table(HDovergender)
HDovergender
No Heart Disease Has Heart Disease
Female 1661 1017
Male 1012 1649
chisq.test(HDovergender)
Pearson's Chi-squared test with Yates' continuity correction
data: HDovergender
X-squared = 306.39, df = 1, p-value < 2.2e-16
chisq.test(HDover$heart_disease,HDover$smoking_status)
Pearson's Chi-squared test
data: HDover$heart_disease and HDover$smoking_status
X-squared = 133.14, df = 3, p-value < 2.2e-16
chisq.test(HDover$heart_disease,HDover$ever_married)
Pearson's Chi-squared test with Yates' continuity correction
data: HDover$heart_disease and HDover$ever_married
X-squared = 84.485, df = 1, p-value < 2.2e-16
chisq.test(HDover$heart_disease,HDover$Residence_type)
Pearson's Chi-squared test with Yates' continuity correction
data: HDover$heart_disease and HDover$Residence_type
X-squared = 1.5975, df = 1, p-value = 0.2063
chisq.test(HDover$heart_disease,HDover$work_type)
Pearson's Chi-squared test
data: HDover$heart_disease and HDover$work_type
X-squared = 119.73, df = 3, p-value < 2.2e-16
t.test(age ~ heart_disease, data = HDover)
Welch Two Sample t-test
data: age by heart_disease
t = -51.573, df = 4356.9, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-21.03291 -19.49237
sample estimates:
mean in group No mean in group Yes
49.15071 69.41335
t.test(bmi ~ heart_disease, data = HDover)
Welch Two Sample t-test
data: bmi by heart_disease
t = -0.41206, df = 4881, p-value = 0.6803
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-0.4037695 0.2635147
sample estimates:
mean in group No mean in group Yes
30.28246 30.35259
t.test(avg_glucose_level ~ heart_disease, data = HDover)
Welch Two Sample t-test
data: avg_glucose_level by heart_disease
t = -18.418, df = 4952.8, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-29.98902 -24.21915
sample estimates:
mean in group No mean in group Yes
106.1215 133.2256
summary(NumericHD)
age avg_glucose_level bmi
Min. :18.00 Min. : 55.12 Min. :11.30
1st Qu.:48.00 1st Qu.: 78.95 1st Qu.:26.10
Median :63.00 Median : 96.58 Median :29.50
Mean :59.27 Mean :119.65 Mean :30.32
3rd Qu.:74.00 3rd Qu.:150.45 3rd Qu.:33.80
Max. :82.00 Max. :267.60 Max. :78.00
sapply(NumericHD, sd, na.rm=TRUE)
age avg_glucose_level bmi
17.578299 55.423278 6.220602
Based on p<0.05, we can identify that almost all the variables namely gender, age, hypertension, marital status, work type, average glucose level, smoking_status and stroke are prevalent to people with heart disease. Only two variables, residence type and BMI, failed to reject the null hypothesis with a p-value = 0.2063 and p-value = 0.6803, respectively. Thus, BMI and residence type are not statistically significant in the prevalence of heart disease.
OBSERVATION TO RESEARCH QUESTION 2: This question would be answered with visualization comparing counts of patients with heart disease and smokes vs heart disease and former smokers (Figure 4). Based on p-value < 2.2e-16, prevalence of smoking is statistically significant to heart disease. Based on Figure 4, it appears that heart disease is prevalent on more people who formerly smoked compared to non-smokers. Table 1 confirms this with 822 former smokers with heart disease compared to 569 smokers with heart disease.
ggplot(HDover, aes(x=age, fill=smoking_status))+
geom_histogram(binwidth = 5)+
facet_wrap(vars(smoking_status,heart_disease))+
labs(x="age", y="Count", title = "Figure 4- Heart Disease and Smoking")
chisq.test(HDover$heart_disease,HDover$smoking_status)
Pearson's Chi-squared test
data: HDover$heart_disease and HDover$smoking_status
X-squared = 133.14, df = 3, p-value < 2.2e-16
YesHDover <- subset(HDover, heart_disease == "Yes", select= c("gender","age","hypertension","heart_disease","ever_married","work_type","Residence_type","avg_glucose_level","bmi","smoking_status","stroke"))
dim(YesHDover)
[1] 2666 11
smoking_status n
1 never smoked 929
2 formerly smoked 822
3 smokes 569
4 Unknown 346
ggplot(HDover, aes(x = age, y = avg_glucose_level)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x="Age", y="Ave Glucose Level", title = "Figure 5- Age and Ave Glucose Level")
ggplot(HDover, aes(x = age, y = bmi, color=bmi)) +
geom_boxplot(outlier.colour="red", outlier.shape=8,
outlier.size=4) +
geom_smooth(method = "lm") +
labs(x="Age", y="BMI", title = "Figure 6- Age and BMI")
HDover.aov <- aov(age ~ bmi + avg_glucose_level, data = HDover)
# Summary of the analysis
summary(HDover.aov)
Df Sum Sq Mean Sq F value Pr(>F)
bmi 1 2427 2427 8.387 0.0038 **
avg_glucose_level 1 102565 102565 354.357 <2e-16 ***
Residuals 5337 1544740 289
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both BMI and average glucose levels are statistically significant to age. On Figure 5, we can see a positive correlation that average glucose levels increase as people get older. Although we can not determine if the rise in glucose levels (mean=119.65) are significant to clinically define it as diabetes (glucose >200). The data set also did not define if average glucose levels was fasting blood sugar or not. This is important because is the glucose level is under “Fasting conditions” , then diabetes is defined as over 100. With our data set mean=119.65, if under “fasting conditions”, the average of population in data set is considered diabetic.
On Figure 6, we can see that the median BMI is 29.2 with a significant number of outliers with BMI >30. In addition, the slope of the line in Figure 6 is very slightly downward, possibly indicating a very small negative correlation between between age and BMI.
THINGS LEFT TO DO: Clean up final paper. Complete reflections and summarize conclusions. Explain visualizations in detail–e.g. rationale
Centers for Disease Control and Prevention. (2022, January 13). FASTSTATS - leading causes of death. Centers for Disease Control and Prevention. Retrieved January 16, 2022, from https://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm
Centers for Disease Control and Prevention. (2021, September 27). Adult obesity prevalence maps. Centers for Disease Control and Prevention. Retrieved January 18, 2022, from https://www.cdc.gov/obesity/data/prevalence-maps.html#age
Fedesoriano. (2021, January 26). Stroke prediction dataset. Kaggle. Retrieved January 17, 2022, from https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
Grolemund, H. W. and G. (n.d.). R for data science. Welcome | R for Data Science. Retrieved January 17, 2022, from https://r4ds.had.co.nz/index.html
R as programming language
Finnstats. (2021, April 13). Random Forest in R: R-bloggers. R. Retrieved January 17, 2022, from https://www.r-bloggers.com/2021/04/random-forest-in-r/
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Vespa (2022, Jan. 20). Data Analytics and Computational Social Science: HW6. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomowenvespa856246/
BibTeX citation
@misc{vespa2022hw6, author = {Vespa, Rhowena}, title = {Data Analytics and Computational Social Science: HW6}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomowenvespa856246/}, year = {2022} }