Data Analytics and Computational Social Science: Exploratory Data Analysis

Rhowena Vespa

INTRODUCTION

According to the CDC, heart disease was the leading cause of death in the US in 2020, followed by cancer and Covid-19. This research analyzes heart disease using the “Stroke Prediction Dataset” from Kaggle. The dataset has 5110 observations consisting of binary, categorical and continuous variables describing risk factors for heart disease such as hypertension, obesity, diabetes, and smoking, among others. Using the data set, this research aims to measure how prevalent the different risk factors are in patients who have heart disease. The CDC also states that adults age 45-54 have the highest prevalence of obesity (38.1%) while adults age 18-24 have the lowest prevalence (19.5%) in the US. I would like to compare these figures with this data set.

RESEARCH QUESTIONS

Assessment of prevalence of risk factors in patients with heart disease.

  A. How many patients with heart disease ALSO have hypertension?
  
  B. How many patients with heart disease ALSO have stroke?

According to the CDC, about 14% (34 million) of US adults are smokers.

  A. How many patients with heart disease continue smoking?

Are older patients more likely to have obesity or have diabetes?

DATA

Read CSV file into R

library(distill)
library(dplyr)
library(readr)
library(tidyverse)
library(knitr)
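# Read the CSV with a header row; the literal string "N/A" is treated as missing (NA)
HeartDisease <- read.csv('healthcare-dataset-stroke-data.csv', header = TRUE, sep = ',', na.strings = "N/A")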
class(HeartDisease)

[1] "data.frame"

colnames(HeartDisease)

 [1] "id"                "gender"            "age"              
 [4] "hypertension"      "heart_disease"     "ever_married"     
 [7] "work_type"         "Residence_type"    "avg_glucose_level"
[10] "bmi"               "smoking_status"    "stroke"

dim(HeartDisease)

[1] 5110   12
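Because the file is read with na.strings = "N/A", it can be worth checking how many values actually became missing before analyzing the data. The following is only a minimal sketch: the three-row CSV text and the name `toy` are hypothetical stand-ins for the Kaggle file, used so that the example runs on its own.

```r
# Hypothetical three-row excerpt standing in for the Kaggle CSV (an assumption for illustration)
csv_text <- "id,age,bmi,smoking_status
1,67,36.6,formerly smoked
2,61,N/A,never smoked
3,80,32.5,never smoked"

# Same arguments as the read.csv call above, but reading from a string
toy <- read.csv(text = csv_text, header = TRUE, sep = ",", na.strings = "N/A")

# bmi is parsed as numeric because "N/A" became NA instead of staying a string
class(toy$bmi)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts
```

Running the same colSums(is.na(...)) call on HeartDisease would show which columns contain missing values in the full data set.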

The dataset consists of patients from infants (age=0) to age 82. This research will focus on the prevalence of risk factors in ADULT patients with heart disease. The dataset will first be filtered to exclude data on children (age < 18) and to drop the patient ID. The binary variables that have values of 1=Yes and 0=No are hypertension, heart disease and stroke; gender is “male” or “female”, ever married is “yes” or “no”, and residence type is “urban” or “rural”. Other categorical variables are work type and smoking status. Numerical variables are age, body mass index (bmi) and average glucose level.

AdultHD<- filter(HeartDisease, age>=18) #Filter out children age 0-17
AdultHD1<-select(AdultHD,-c(id))  #drop the patient id column 'id' 
kable(head(AdultHD1), format = "markdown", digits = 2)

gender	age	hypertension	heart_disease	ever_married	work_type	Residence_type	avg_glucose_level	bmi	smoking_status	stroke
Male	67	0	1	Yes	Private	Urban	228.69	36.6	formerly smoked	1
Female	61	0	0	Yes	Self-employed	Rural	202.21	NA	never smoked	1
Male	80	0	1	Yes	Private	Rural	105.92	32.5	never smoked	1
Female	49	0	0	Yes	Private	Urban	171.23	34.4	smokes	1
Female	79	1	0	Yes	Self-employed	Rural	174.12	24.0	never smoked	1
Male	81	0	0	Yes	Private	Urban	186.21	29.0	formerly smoked	1

DATA VISUALIZATION

HD<-mutate(AdultHD1, heart_disease = recode(heart_disease, `1` = "Yes", `0` = "No"))

HD %>%
  count(heart_disease, sort = TRUE) %>%
  head(4)

  heart_disease    n
1            No 3979
2           Yes  275

ggplot(HD, aes(x=age, fill=heart_disease))+
  geom_bar() +
  labs(x="Age of Patients", y="Number of Patients", 
  title = "Figure 1- More elderly patients (age >65) have heart disease 
                    than younger patients")

EXPLANATION: Used geom_bar to show the age distribution of patients with heart disease compared with patients without heart disease.

RESEARCH QUESTION 1A: How many patients with heart disease ALSO have hypertension?

# heart_disease is already labeled "Yes"/"No" in HD, so only hypertension needs recoding
HD %>%
  mutate(hypertension = recode(hypertension, `1` = "Yes", `0` = "No")) %>%
  count(heart_disease, hypertension, sort = TRUE) %>%
  head(4)

  heart_disease hypertension    n
1            No           No 3546
2            No          Yes  433
3           Yes           No  211
4           Yes          Yes   64

ggplot(HD, aes(x=age, fill=heart_disease))+
  geom_histogram(binwidth = 5)+
  facet_wrap(vars(heart_disease,hypertension))+
  labs(x="Age of Patients", y="Number of Patients", 
  title = "Figure 2- There are 64 Patients with Heart Disease AND Hypertension
                                 (see lower right image)")

EXPLANATION: Used geom_histogram and facet_wrap to show the age distribution across 4 groups (heart disease=yes/no and hypertension=yes/no)

Chi-squared test for Heart Disease and Hypertension

chisq.test(HD$heart_disease,HD$hypertension)


    Pearson's Chi-squared test with Yates' continuity correction

data:  HD$heart_disease and HD$hypertension
X-squared = 37.081, df = 1, p-value = 1.133e-09
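To see what the test is comparing, the same chi-squared test can be rerun directly on the four counts from the table above and the expected counts inspected. This is a minimal sketch rather than part of the original analysis; the object names `counts` and `test` are my own.

```r
# Counts copied from the heart_disease x hypertension table above
counts <- matrix(c(3546, 433,
                    211,  64),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(heart_disease = c("No", "Yes"),
                                 hypertension  = c("No", "Yes")))

# For a 2x2 table, chisq.test applies Yates' continuity correction by default
test <- chisq.test(counts)
test

# Counts we would expect if heart disease and hypertension were independent
round(test$expected, 1)

# Observed minus expected: a positive Yes/Yes cell means the two conditions
# occur together more often than independence would predict
round(counts - test$expected, 1)
```

The Yes/Yes cell is the one of interest here: if the observed 64 is well above its expected count, the significant result reflects more co-occurrence than chance alone would produce.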

Calculate Prevalence –See explanation in CONCLUSION

PrevHDHTN<- (64/4254)*100
PrevHDHTN

[1] 1.504466

PrevHTNinHD<- (64/275)*100
PrevHTNinHD

[1] 23.27273
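The two prevalence figures above use hard-coded numbers. As a hedged alternative, the denominators can be derived from the four cell counts so that it is clear where 4254 and 275 come from. The variable names below are my own, and the last line (hypertension among patients WITHOUT heart disease) is an extra comparison that is not part of the original calculation.

```r
# Cell counts copied from the heart_disease x hypertension table above
no_hd_no_htn <- 3546
no_hd_htn    <- 433
hd_no_htn    <- 211
hd_htn       <- 64

all_total <- no_hd_no_htn + no_hd_htn + hd_no_htn + hd_htn  # all adult patients
hd_total  <- hd_no_htn + hd_htn                             # patients with heart disease

# Patients with BOTH conditions as a share of all adult patients
prev_both_in_all <- hd_htn / all_total * 100

# Hypertension among patients WITH heart disease
prev_htn_in_hd <- hd_htn / hd_total * 100

# Extra comparison (assumption, not in the original): hypertension among patients WITHOUT heart disease
prev_htn_in_no_hd <- no_hd_htn / (no_hd_no_htn + no_hd_htn) * 100

round(c(prev_both_in_all, prev_htn_in_hd, prev_htn_in_no_hd), 2)
```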

RESEARCH QUESTION 1B: How many patients with heart disease ALSO have stroke?

# heart_disease is already labeled "Yes"/"No" in HD, so only stroke needs recoding
HD %>%
  mutate(stroke = recode(stroke, `1` = "Yes", `0` = "No")) %>%
  count(heart_disease, stroke, sort = TRUE) %>%
  head(4)

  heart_disease stroke    n
1            No     No 3779
2           Yes     No  228
3            No    Yes  200
4           Yes    Yes   47

ggplot(HD, aes(x=age, fill=heart_disease))+
  geom_histogram(binwidth = 5)+
  facet_wrap(vars(heart_disease,stroke))+  # facet by stroke (not hypertension) to match the figure title
  labs(x="Age of Patients", y="Number of Patients", 
  title = "Figure 3- There are 47 Patients with Heart Disease AND Stroke 
                              (see lower right image)")

EXPLANATION: Used geom_histogram and facet_wrap to show the age distribution across 4 groups (heart disease=yes/no and stroke=yes/no)

Chi-squared for Heart Disease and Stroke

chisq.test(HD$heart_disease,HD$stroke)


    Pearson's Chi-squared test with Yates' continuity correction

data:  HD$heart_disease and HD$stroke
X-squared = 66.267, df = 1, p-value = 3.937e-16

Calculate Prevalence –see explanation in CONCLUSION

PrevHDStroke<- (47/4254)*100
PrevHDStroke

[1] 1.104843

PrevStrokeinHD<- (47/275)*100
PrevStrokeinHD

[1] 17.09091

RESEARCH QUESTION 2: Do patients with heart disease also smoke?

According to the CDC, 14% (34 million) of US adults are smokers. How many patients with heart disease are former smokers and how many are current smokers? Ideally, as smoking is a risk factor, patients stop smoking after being diagnosed with heart disease. Let’s find out if patients with heart disease are also smokers.

HD %>%
    count(heart_disease, smoking_status, sort = TRUE) %>%
  head(8)

  heart_disease  smoking_status    n
1            No    never smoked 1662
2            No         Unknown  815
3            No formerly smoked  783
4            No          smokes  719
5           Yes    never smoked   90
6           Yes formerly smoked   77
7           Yes          smokes   61
8           Yes         Unknown   47

ggplot(HD, aes(x=age, fill=smoking_status))+
  geom_histogram(binwidth = 5)+
  facet_wrap(vars(smoking_status,heart_disease))+
  labs(x="Age", y="Number of Patients", title = 
  "Figure 4- Former Smokers with Heart Disease (n=77) 
       vs Current Smokers with Heart Disease (n=61)")

EXPLANATION: Used geom_histogram and facet_wrap to show the age distribution across 8 groups (heart disease=yes/no and the four smoking status groups)

Chi-squared test for Heart Disease and Smoking Status

chisq.test(HD$heart_disease,HD$smoking_status)


    Pearson's Chi-squared test

data:  HD$heart_disease and HD$smoking_status
X-squared = 17.75, df = 3, p-value = 0.0004954

OBSERVATION: The data set shows that there are 61 patients who have heart disease and are also current smokers. The data cannot identify whether the 77 former smokers quit smoking before or after being diagnosed with heart disease.

Calculate Prevalence –See explanation in CONCLUSION

PrevSmokerHD<- (61/275)*100
PrevSmokerHD

[1] 22.18182
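The 22.18% figure only covers current smokers among patients with heart disease. A short sketch using prop.table on the counts from the table above gives the share of every smoking status within each heart disease group, which makes the two groups easier to compare. The object names `smoking` and `row_pct` are my own.

```r
# Counts copied from the heart_disease x smoking_status table above
smoking <- matrix(c(1662, 815, 783, 719,
                      90,  47,  77,  61),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(
                    heart_disease  = c("No", "Yes"),
                    smoking_status = c("never smoked", "Unknown",
                                       "formerly smoked", "smokes")))

# margin = 1 divides each cell by its row total, so each row sums to 100%
row_pct <- prop.table(smoking, margin = 1) * 100
round(row_pct, 2)
```

Note that the "Unknown" group is kept in the denominator here, as it is in the calculation above; dropping it would change every percentage.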

NumericHD <- subset(HD, select= c("age","avg_glucose_level","bmi"))

RESEARCH QUESTION 3: Are older people more likely to have obesity or have diabetes? For this question, we will use the CDC definition of obesity as BMI >= 30 and define diabetes as an average glucose level > 200.
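One way to compare the data set with the CDC age-group figures is to turn the two definitions above into yes/no flags and compute prevalence per age band. The sketch below is only an illustration under stated assumptions: the data frame `toy` is made up (it is NOT taken from the Kaggle file), and the age bands are my own choice, picked to line up with the CDC groups quoted in the introduction.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  age               = c(19, 23, 47, 52, 50, 70, 81),
  bmi               = c(22.0, 24.1, 35.2, 31.0, NA, 30.0, 26.4),
  avg_glucose_level = c(85, 90, 210, 130, 95, 228, 105)
)

# Flags based on the definitions used in this research question
toy$obese    <- toy$bmi >= 30               # NA bmi gives an NA flag
toy$diabetic <- toy$avg_glucose_level > 200

# right = FALSE makes the intervals [18,25), [25,45), [45,55), [55,Inf)
toy$age_group <- cut(toy$age, breaks = c(18, 25, 45, 55, Inf), right = FALSE,
                     labels = c("18-24", "25-44", "45-54", "55+"))

# Percentage of patients flagged in each age group; missing bmi values are skipped
obesity_pct  <- tapply(toy$obese, toy$age_group,
                       function(x) mean(x, na.rm = TRUE) * 100)
diabetes_pct <- tapply(toy$diabetic, toy$age_group, mean) * 100

obesity_pct
diabetes_pct
```

Age groups with no patients (here 25-44) come back as NA. Applying the same steps to HD would give obesity percentages that can be set side by side with the CDC's 38.1% and 19.5%.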

ggplot(HD, aes(x = age, y = avg_glucose_level)) +
    geom_point(color=7) + 
    geom_smooth(method = "lm") +
    labs(x="Age", y="Ave Glucose Level", title = "Figure 5- Average Glucose Level INCREASES with Age")

EXPLANATION: Used geom_point to show a scatter plot of age against average glucose level, with geom_smooth adding a linear trend line.

ggplot(HD, aes(x = age, y = bmi, color=bmi)) +
    geom_boxplot(outlier.colour="blue", outlier.shape=8,
                outlier.size=4) + 
    geom_jitter(width = 0.01,colour = 3)+
    geom_smooth(method = "lm",colour = 1) +
    labs(x="Age", y="BMI", title = "Figure 6- Age and BMI (used jitter to show distribution)")

EXPLANATION: Used geom_boxplot to show median, quartiles and used jitter to show distribution.

HD.aov <- aov(age ~ bmi + avg_glucose_level, data = HD)
# Summary of the analysis
summary(HD.aov)

                    Df  Sum Sq Mean Sq F value Pr(>F)    
bmi                  1    1373    1373   4.569 0.0326 *  
avg_glucose_level    1   64603   64603 215.019 <2e-16 ***
Residuals         4070 1222853     300                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
181 observations deleted due to missingness
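Because bmi and avg_glucose_level are both continuous, the aov call above fits a linear regression of age on the two predictors and reports a sequential ANOVA table for it. The sketch below illustrates this on simulated data (the data frame `sim` and its coefficients are hypothetical, not the Kaggle data) and shows that lm gives the same F values plus the estimated slopes, which indicate the direction and size of each association.

```r
set.seed(1)  # simulated data, for illustration only
n <- 200
sim <- data.frame(bmi               = rnorm(n, mean = 29, sd = 6),
                  avg_glucose_level = rnorm(n, mean = 110, sd = 40))
# Hypothetical relationship used to generate age
sim$age <- 30 + 0.1 * sim$bmi + 0.15 * sim$avg_glucose_level + rnorm(n, sd = 15)

fit_aov <- aov(age ~ bmi + avg_glucose_level, data = sim)
fit_lm  <- lm(age ~ bmi + avg_glucose_level, data = sim)

# Both tables use sequential sums of squares, so the F values agree
f_aov <- summary(fit_aov)[[1]][["F value"]]
f_lm  <- anova(fit_lm)[["F value"]]
all.equal(f_aov, f_lm)

# Slopes: change in age per one-unit change in each predictor
coef(fit_lm)
```

With sequential sums of squares, the order of the terms in the formula matters: the bmi row is tested first, and the avg_glucose_level row is tested after accounting for bmi.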

OBSERVATION TO RESEARCH QUESTION 3:

Both BMI and average glucose level are significantly associated with age. In Figure 5, we can see a positive correlation: average glucose levels increase as people get older. However, we cannot determine whether the rise in glucose levels (mean=119.65) is large enough to clinically define it as diabetes (glucose >200). The data set also does not state whether the average glucose level is a fasting blood sugar measurement. This is important because the CDC thresholds for a fasting test are lower: 100 to 125 mg/dL indicates prediabetes and 126 mg/dL or higher indicates diabetes. With our data set mean of 119.65, under fasting conditions the average patient in the data set would fall in the prediabetic range.

In Figure 6, we can see that the median BMI is 29.2, with a large number of outliers with BMI >30. In addition, the slope of the line in Figure 6 is very slightly upward, possibly indicating a very small positive correlation between age and BMI. This data set does not reproduce the CDC pattern in which adults age 45-54 have the highest prevalence of obesity (38.1%) while adults age 18-24 have the lowest prevalence (19.5%) in the US. Although our median BMI is below 30 and the slope is near zero, the data set has a large number of outliers.

REFLECTION

This is my first time dabbling in R and I found it very interesting to learn. With my healthcare background, my inclination was to find a health-related data set. The Stroke Prediction data set from Kaggle is relatively clean and only needed some tidying, such as narrowing the focus to adult patients (filtering out children). Although it is a stroke prediction data set, I found researching the prevalence of heart disease, the leading cause of death in the US, more interesting.

For this final paper, I am focusing on data visualizations. Initially, I was trying to visualize all the prevalent risk factors but later decided to isolate the most important and common ones, namely hypertension, stroke, age, bmi and glucose levels. This was a good exercise for binary, numeric and continuous variables. Statistically, I conducted Pearson’s chi-squared test, Two-sample T-test and Two way ANOVA to measure significance between risk factors (heart disease vs age, bmi, hypertension, glucose levels, stroke). I used basic epidemiology knowledge to calculate prevalence rates. I would like to learn to use other epidemiology packages like pubh and epiR for more sophisticated correlations.

The biggest challenge I had was building a machine learning model in Homework 6. To do this, I first needed to balance the data set, a step that is often needed with healthcare data sets. I was looking to build a classification model and apply machine learning in R. I started this process by splitting the data (train and test) and then balancing the data set. After my Random Forest Sampling in Homework 6, I got stuck with the Confusion Matrix and could not finish my predictive model. This will be my next step as I plan to learn more about neural networks.

CONCLUSION:

Based on statistical analysis, the association of hypertension and stroke with heart disease is significant.

For Research Question 1A: The prevalence of heart disease and hypertension in total patients is 1.50%. The prevalence of hypertension in patients WITH heart disease is 23.27%.

For Research Question 1B: The prevalence of heart disease AND stroke in total patients is 1.10%. The prevalence of stroke in patients WITH heart disease is 17.09%.

Smoking status is associated with heart disease and the data indicates that 61 patients who have heart disease are current smokers. This is a 22.18% prevalence rate of smokers in patients with heart disease.

Another conclusion from this data set is that an increase in glucose levels is positively associated with aging. Older people are more likely to have higher glucose levels than younger patients. The data showed a statistically significant association between age and bmi, but visually the correlation is very minimal, as evidenced by a near-zero slope. Although the data also indicates a median below 30, meaning most of the patients in this data are not obese, there are too many outliers to make a definite conclusion. More statistical analysis is recommended.

BIBLIOGRAPHY:

Fedesoriano. (2021, January 26). Stroke prediction dataset. Kaggle. Retrieved January 17, 2022, from https://www.kaggle.com/fedesoriano/stroke-prediction-dataset.
Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. O'Reilly Media.
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Centers for Disease Control and Prevention. (2022, January 13). FASTSTATS - leading causes of death. Centers for Disease Control and Prevention. Retrieved January 16, 2022, from https://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm
Centers for Disease Control and Prevention. (2021, September 27). Adult obesity prevalence maps. Centers for Disease Control and Prevention. Retrieved January 18, 2022, from https://www.cdc.gov/obesity/data/prevalence-maps.html#age
Centers for Disease Control and Prevention. (2020, December 10). Current cigarette smoking among adults in the United States. Centers for Disease Control and Prevention. Retrieved January 23, 2022, from https://www.cdc.gov/tobacco/data_statistics/fact_sheets/adult_data/cig_smoking/index.htm
