Heart Disease
According to the CDC, heart disease was the leading cause of death in the US in 2020, followed by cancer and Covid-19. This research analyzes heart disease using the “Stroke Prediction Dataset” from Kaggle. The dataset has 5110 observations of binary, categorical, and continuous variables covering risk factors for heart disease such as hypertension, obesity, diabetes, and smoking, among others. Using this data set, the research aims to measure how prevalent the different risk factors are among patients who have heart disease. The CDC also states that adults age 45-54 have the highest prevalence of obesity (38.1%) while adults age 18-24 have the lowest prevalence (19.5%) in the US. I would like to compare this information with this data set.
A. How many patients with heart disease ALSO have hypertension?
B. How many patients with heart disease ALSO have stroke?
C. How many patients with heart disease continue smoking?
library(distill)
library(dplyr)
library(readr)
library(tidyverse)
library(knitr)
HeartDisease <- read.csv('healthcare-dataset-stroke-data.csv', header = TRUE, sep = ',', na.strings = "N/A")
class(HeartDisease)
[1] "data.frame"
colnames(HeartDisease)
[1] "id" "gender" "age"
[4] "hypertension" "heart_disease" "ever_married"
[7] "work_type" "Residence_type" "avg_glucose_level"
[10] "bmi" "smoking_status" "stroke"
dim(HeartDisease)
[1] 5110 12
The dataset consists of patients from infants (age=0) to age 82. This research focuses on the prevalence of risk factors in ADULT patients with heart disease. The dataset is first filtered to exclude children (age < 18) and to drop the patient ID column. The binary variables coded 1=Yes and 0=No are hypertension, heart disease, and stroke; gender is “Male” or “Female”, ever married is “Yes” or “No”, and residence type is “Urban” or “Rural”. The other categorical variables are work type and smoking status. The numerical variables are age, body mass index (bmi), and average glucose level.
AdultHD<- filter(HeartDisease, age>=18) #Filter out children age 0-17
AdultHD1<-select(AdultHD,-c(id)) #drop the patient id column 'id'
kable(head(AdultHD1), format = "markdown", digits = 2)
gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|
Male | 67 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
Female | 61 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NA | never smoked | 1 |
Male | 80 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
Female | 49 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
Female | 79 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
Male | 81 | 0 | 0 | Yes | Private | Urban | 186.21 | 29.0 | formerly smoked | 1 |
EXPLANATION: Used geom_bar to show distribution of growing age with heart disease compared with no heart disease.
# HD is not defined in the code shown; it is assumed to be the adult subset (AdultHD1) created above.
HD %>%
  mutate(heart_disease = recode(heart_disease, `1` = "Yes", `0` = "No"),
         hypertension = recode(hypertension, `1` = "Yes", `0` = "No")) %>%
  count(heart_disease, hypertension, sort = TRUE) %>%
  head(4)
heart_disease hypertension n
1 No No 3546
2 No Yes 433
3 Yes No 211
4 Yes Yes 64
ggplot(HD, aes(x=age, fill=heart_disease))+
geom_histogram(binwidth = 5)+
facet_wrap(vars(heart_disease,hypertension))+
labs(x="Age of Patients", y="Number of Patients",
title = "Figure 2- There are 64 Patients with Heart Disease AND Hypertension
(see lower right image)")
EXPLANATION: Used geom_histogram and facet_wrap to show the age distribution across 4 groups (heart disease=yes/no and hypertension=yes/no)
chisq.test(HD$heart_disease,HD$hypertension)
Pearson's Chi-squared test with Yates' continuity correction
data: HD$heart_disease and HD$hypertension
X-squared = 37.081, df = 1, p-value = 1.133e-09
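The test above can also be reproduced from the four counts in the table, without the raw data. The sketch below is only an illustration: the matrix name `counts` and its layout (rows = heart disease, columns = hypertension) are assumptions, while the numbers are copied from the count table reported earlier.

```r
# Counts copied from the heart_disease x hypertension table reported above.
# Rows are heart disease (No/Yes), columns are hypertension (No/Yes).
counts <- matrix(
  c(3546, 433,
     211,  64),
  nrow = 2, byrow = TRUE,
  dimnames = list(heart_disease = c("No", "Yes"),
                  hypertension  = c("No", "Yes"))
)

# For a 2x2 table, chisq.test() applies Yates' continuity correction by default.
hd_htn_test <- chisq.test(counts)
print(hd_htn_test)

# Expected counts under independence, to compare against the observed counts.
print(round(hd_htn_test$expected, 1))
```

Comparing the observed and expected counts in the Yes/Yes cell shows the direction of the association, which the test statistic alone does not.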
PrevHDHTN<- (64/4254)*100
PrevHDHTN
[1] 1.504466
PrevHTNinHD<- (64/275)*100
PrevHTNinHD
[1] 23.27273
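The two prevalence figures above use different denominators: all adult patients (4254) versus patients with heart disease (275). A minimal sketch that derives both denominators from the reported counts, rather than typing them in, might look like this; the variable names are illustrative and not from the original analysis.

```r
# Counts copied from the heart_disease x hypertension table reported above.
# Names follow the pattern <heart_disease>_<hypertension>.
counts <- c(no_no = 3546, no_yes = 433, yes_no = 211, yes_yes = 64)

total_adults <- sum(counts)                                 # all adult patients
total_hd     <- counts[["yes_no"]] + counts[["yes_yes"]]    # patients with heart disease

# Share of ALL adults who have both heart disease and hypertension.
prev_both <- 100 * counts[["yes_yes"]] / total_adults

# Share of heart disease patients who also have hypertension.
prev_htn_in_hd <- 100 * counts[["yes_yes"]] / total_hd

print(round(c(both_in_all_adults = prev_both,
              htn_in_heart_disease = prev_htn_in_hd), 2))
```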
# HD is not defined in the code shown; it is assumed to be the adult subset (AdultHD1) created above.
HD %>%
  mutate(heart_disease = recode(heart_disease, `1` = "Yes", `0` = "No"),
         stroke = recode(stroke, `1` = "Yes", `0` = "No")) %>%
  count(heart_disease, stroke, sort = TRUE) %>%
  head(4)
heart_disease stroke n
1 No No 3779
2 Yes No 228
3 No Yes 200
4 Yes Yes 47
ggplot(HD, aes(x=age, fill=heart_disease))+
geom_histogram(binwidth = 5)+
facet_wrap(vars(heart_disease,stroke))+
labs(x="Age of Patients", y="Number of Patients",
title = "Figure 3- There are 47 Patients with Heart Disease AND Stroke
(see lower right image)")
EXPLANATION: Used geom_histogram and facet_wrap to show the age distribution across 4 groups (heart disease=yes/no and stroke=yes/no)
chisq.test(HD$heart_disease,HD$stroke)
Pearson's Chi-squared test with Yates' continuity correction
data: HD$heart_disease and HD$stroke
X-squared = 66.267, df = 1, p-value = 3.937e-16
PrevHDStroke<- (47/4254)*100
PrevHDStroke
[1] 1.104843
PrevStrokeinHD<- (47/275)*100
PrevStrokeinHD
[1] 17.09091
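Because the hypertension and stroke questions follow the same steps (a 2x2 count table, a chi-squared test, and two prevalence figures), the calculation could be wrapped in a small helper. The function name `summarise_2x2` is hypothetical and not part of the original analysis; the counts are the heart disease by stroke counts reported above.

```r
# Hypothetical helper: takes a 2x2 table with rows = heart disease (No/Yes)
# and columns = the second condition (No/Yes).
summarise_2x2 <- function(tab) {
  both <- tab["Yes", "Yes"]
  test <- chisq.test(tab)  # Yates' correction is the default for 2x2 tables
  c(
    pct_both_of_all     = 100 * both / sum(tab),           # denominator: all adults
    pct_condition_in_hd = 100 * both / sum(tab["Yes", ]),  # denominator: heart disease patients
    chi_squared         = unname(test$statistic),
    p_value             = test$p.value
  )
}

# Counts copied from the heart_disease x stroke table reported above.
stroke_tab <- matrix(
  c(3779, 200,
     228,  47),
  nrow = 2, byrow = TRUE,
  dimnames = list(heart_disease = c("No", "Yes"),
                  stroke        = c("No", "Yes"))
)

stroke_summary <- summarise_2x2(stroke_tab)
print(signif(stroke_summary, 4))
```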
According to the CDC, 14% (34 million) of US adults are smokers. How many patients with heart disease are former smokers, and how many are current smokers? Ideally, since smoking is a risk factor, patients stop smoking after being diagnosed with heart disease. Let’s find out whether patients with heart disease are also smokers.
heart_disease smoking_status n
1 No never smoked 1662
2 No Unknown 815
3 No formerly smoked 783
4 No smokes 719
5 Yes never smoked 90
6 Yes formerly smoked 77
7 Yes smokes 61
8 Yes Unknown 47
ggplot(HD, aes(x=age, fill=smoking_status))+
geom_histogram(binwidth = 5)+
facet_wrap(vars(smoking_status,heart_disease))+
labs(x="Age", y="Number of Patients", title =
"Figure 4- Former Smokers with Heart Disease (n=77)
vs Current Smokers with Heart Disease (n=61)")
EXPLANATION: Used geom_histogram and facet_wrap to show the age distribution across 8 groups (heart disease=yes/no crossed with the four smoking status groups)
chisq.test(HD$heart_disease,HD$smoking_status)
Pearson's Chi-squared test
data: HD$heart_disease and HD$smoking_status
X-squared = 17.75, df = 3, p-value = 0.0004954
OBSERVATION: The data set shows that 61 patients have heart disease and are also current smokers. The data cannot identify whether the 77 former smokers quit smoking before or after being diagnosed with heart disease.
PrevSmokerHD<- (61/275)*100
PrevSmokerHD
[1] 22.18182
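The same denominator logic extends to all four smoking groups at once. As a sketch, the reported counts can be placed in a matrix and converted to row percentages with `prop.table`, so that the smoking mix of patients with and without heart disease can be read side by side. The object names are illustrative.

```r
# Counts copied from the heart_disease x smoking_status table reported above.
smoking <- matrix(
  c(1662, 815, 783, 719,
      90,  47,  77,  61),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    heart_disease  = c("No", "Yes"),
    smoking_status = c("never smoked", "Unknown", "formerly smoked", "smokes")
  )
)

# Row percentages: within each heart disease group, the share in each smoking status.
row_pct <- 100 * prop.table(smoking, margin = 1)
print(round(row_pct, 2))
```

Note that the "Unknown" group is kept in the denominator here, as it is in the 61/275 calculation above; dropping it would raise every other percentage.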
ggplot(HD, aes(x = age, y = avg_glucose_level)) +
geom_point(color=7) +
geom_smooth(method = "lm") +
labs(x="Age", y="Ave Glucose Level", title = "Figure 5- Average Glucose Level INCREASES with Age")
EXPLANATION: Used geom_point to show a scatter plot of average glucose level against age, with a linear fit added by geom_smooth
ggplot(HD, aes(x = age, y = bmi, color=bmi)) +
geom_boxplot(outlier.colour="blue", outlier.shape=8,
outlier.size=4) +
geom_jitter(width = 0.01,colour = 3)+
geom_smooth(method = "lm",colour = 1) +
labs(x="Age", y="BMI", title = "Figure 6- Age and BMI (used jitter to show distribution)")
EXPLANATION: Used geom_boxplot to show median, quartiles and used jitter to show distribution.
Df Sum Sq Mean Sq F value Pr(>F)
bmi 1 1373 1373 4.569 0.0326 *
avg_glucose_level 1 64603 64603 215.019 <2e-16 ***
Residuals 4070 1222853 300
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
181 observations deleted due to missingness
Both BMI and average glucose level are significantly associated with age. In Figure 5, we can see a positive correlation: average glucose levels increase as people get older. However, we cannot determine whether the rise in glucose levels (mean=119.65) is large enough to clinically define diabetes (glucose >200). The data set also does not say whether the average glucose level was measured as fasting blood sugar. This matters because under fasting conditions the thresholds are lower: a level of 100 to 125 is generally classed as prediabetes, and 126 or higher as diabetes. With our data set mean of 119.65, the average patient would fall in the prediabetes range if the measurements were taken under fasting conditions.
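The code that produced the ANOVA table is not shown. Its shape (one numeric response, two single-degree-of-freedom terms, and rows dropped for missing values) is consistent with a call such as `aov(age ~ bmi + avg_glucose_level, data = HD)`, but that formula is an assumption. The sketch below runs the assumed call on simulated data only, so its numbers will not match the table above.

```r
# Simulated stand-in data (NOT the Kaggle data), used only to show the call.
set.seed(1)
n <- 200
sim <- data.frame(
  bmi               = rnorm(n, mean = 29, sd = 6),
  avg_glucose_level = rnorm(n, mean = 110, sd = 45)
)
sim$age <- 40 + 0.1 * sim$bmi + 0.08 * sim$avg_glucose_level + rnorm(n, sd = 15)

# Blank out some BMI values, mimicking the missing BMI entries in the real data.
sim$bmi[sample(n, 10)] <- NA

# Assumed model: age as the response, BMI and glucose as predictors.
# Rows with a missing value are dropped by default (na.action = na.omit).
fit <- aov(age ~ bmi + avg_glucose_level, data = sim)
print(summary(fit))
```

With sequential sums of squares, as `aov` uses, the order of the terms in the formula affects each term's F value when the predictors are correlated.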
In Figure 6, we can see that the median BMI is 29.2, with a large number of outliers with BMI >30. In addition, the slope of the line in Figure 6 is very slightly upward, possibly indicating a very small positive correlation between age and BMI. This data set does not mirror the CDC data showing that adults age 45-54 have the highest prevalence of obesity (38.1%) while adults age 18-24 have the lowest prevalence (19.5%) in the US. Although our median BMI is below 30 and the slope is near zero, the data set has a large number of outliers.
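A more direct comparison with the CDC figures would compute the share of patients with BMI of 30 or more within each age band. The sketch below uses a small made-up data frame (`toy`) purely to show the mechanics. The 18-24 and 45-54 bands come from the CDC figures quoted above; the other two bands and all values are assumptions.

```r
# Hypothetical toy data; in practice this would be the adult subset of the real data.
toy <- data.frame(
  age = c(19, 23, 31, 47, 50, 52, 66, 71),
  bmi = c(22.4, 31.0, 27.5, 33.1, 29.0, 36.2, NA, 30.5)
)

# Age bands: [18,25), [25,45), [45,55), [55,Inf).
toy$age_band <- cut(toy$age,
                    breaks = c(18, 25, 45, 55, Inf),
                    right  = FALSE,
                    labels = c("18-24", "25-44", "45-54", "55+"))

# Obesity defined as BMI >= 30; missing BMI stays NA and is ignored in the mean.
toy$obese <- toy$bmi >= 30

obesity_pct <- 100 * tapply(toy$obese, toy$age_band, mean, na.rm = TRUE)
print(round(obesity_pct, 1))
```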
This is my first time dabbling in R and I found it very interesting to learn. With my healthcare background, my inclination was to find a health-related data set. The Stroke Prediction data set from Kaggle is relatively clean and only needed some tidying, such as restricting it to adult patients (filtering out children). Although it is a stroke prediction data set, I found researching the prevalence of heart disease, the leading cause of death in the US, more interesting.
For this final paper, I am focusing on data visualizations. Initially, I was trying to visualize all the prevalent risk factors but later decided to isolate the most important and common ones, namely hypertension, stroke, age, bmi, and glucose levels. This was a good exercise for binary, numeric, and continuous variables. Statistically, I conducted Pearson’s chi-squared test, a two-sample t-test, and a two-way ANOVA to measure significance between risk factors (heart disease vs age, bmi, hypertension, glucose levels, stroke). I used basic epidemiology knowledge to calculate prevalence rates. I would like to learn to use other epidemiology packages, such as pubh and epiR, for more sophisticated correlations.
The biggest challenge I had was building a machine learning model in Homework 6. To do this, I first needed to balance the data set, which is a common step when using healthcare data sets. I was looking to build a classification model and apply machine learning in R. I started this process by splitting the data (train and test) and then balancing the data set. After my Random Forest Sampling in Homework 6, I got stuck on the confusion matrix and could not finish my predictive model. This will be my next step as I plan to learn more about neural networks.
Based on statistical analysis, the association of hypertension and stroke with heart disease is significant.
For Research Question 1A: The prevalence of heart disease and hypertension in total patients is 1.50%. The prevalence of hypertension in patients WITH heart disease is 23.27%.
For Research Question 1B: The prevalence of heart disease AND stroke in total patients is 1.10%. The prevalence of stroke in patients WITH heart disease is 17.09%.
Smoking status is associated with heart disease and the data indicates that 61 patients who have heart disease are current smokers. This is a 22.18% prevalence rate of smokers in patients with heart disease.
Another conclusion from this data set is that glucose levels are positively associated with age. Older patients are more likely to have higher glucose levels than younger patients. The data showed a statistically significant association between age and bmi, but visually the correlation is minimal, as evidenced by a near-zero slope. Although the data also indicate a median BMI below 30, meaning most of the patients in this data set are not obese, there are too many outliers to draw a definite conclusion. More statistical analysis is recommended.
Fedesoriano. (2021, January 26). Stroke prediction dataset. Kaggle. Retrieved January 17, 2022, from https://www.kaggle.com/fedesoriano/stroke-prediction-dataset.
Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. O'Reilly Media.
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Centers for Disease Control and Prevention. (2022, January 13). FASTSTATS - leading causes of death. Centers for Disease Control and Prevention. Retrieved January 16, 2022, from https://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm
Centers for Disease Control and Prevention. (2021, September 27). Adult obesity prevalence maps. Centers for Disease Control and Prevention. Retrieved January 18, 2022, from https://www.cdc.gov/obesity/data/prevalence-maps.html#age
Centers for Disease Control and Prevention. (2020, December 10). Current cigarette smoking among adults in the United States. Centers for Disease Control and Prevention. Retrieved January 23, 2022, from https://www.cdc.gov/tobacco/data_statistics/fact_sheets/adult_data/cig_smoking/index.htm
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Vespa (2022, Jan. 25). Data Analytics and Computational Social Science: Exploratory Data Analysis. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomowenvespa857794/
BibTeX citation
@misc{vespa2022exploratory, author = {Vespa, Rhowena}, title = {Data Analytics and Computational Social Science: Exploratory Data Analysis}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomowenvespa857794/}, year = {2022} }