HW5 -More Data Visualization

Stroke Predictor

Rhowena Vespa
1/11/2022

This final project will use the Stroke Prediction Dataset from Kaggle.

  1. Read CSV file into R
library(distill)
library(dplyr)
library(readr)
library(tidyverse)
Stroke <- read.csv('healthcare-dataset-stroke-data.csv', header = TRUE, sep = ',', na.strings = "N/A")
class(Stroke)
[1] "data.frame"
colnames(Stroke)
 [1] "id"                "gender"            "age"              
 [4] "hypertension"      "heart_disease"     "ever_married"     
 [7] "work_type"         "Residence_type"    "avg_glucose_level"
[10] "bmi"               "smoking_status"    "stroke"           
dim(Stroke)
[1] 5110   12
# Keep only patients who had a stroke and drop the id column
Yesstroke <- subset(Stroke, stroke == 1, select = -id)
dim(Yesstroke)
[1] 249  11
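
To see what `na.strings = "N/A"` changes, here is a minimal sketch on a made-up three-row CSV. The values are invented for illustration and are not taken from the Kaggle file; only the column names and the `"N/A"` marker come from the code above.

```r
# Hypothetical three-row excerpt; values are invented, not from the Kaggle file.
csv_text <- "id,gender,age,bmi,stroke
1,Male,67,36.6,1
2,Female,61,N/A,1
3,Female,45,22.4,0"

# Default parsing keeps "N/A" as text, so bmi is not read as a numeric column.
raw <- read.csv(text = csv_text)
is.numeric(raw$bmi)

# Declaring "N/A" as a missing-value marker gives a numeric column with NA.
demo <- read.csv(text = csv_text, na.strings = "N/A")
is.numeric(demo$bmi)
sum(is.na(demo$bmi))

# Keep only stroke cases and drop the id column, as done for Yesstroke.
demo_yes <- subset(demo, stroke == 1, select = -id)
dim(demo_yes)
```

With the real file, the same check on `Stroke$bmi` would show whether the missing values were picked up as intended.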

Figure 1 - Smoking status of sample population

Observation: The sample population consisted mostly of current non-smokers (people who never smoked and former smokers).

# Age histogram (5-year bins), one panel per smoking status
ggplot(Stroke, aes(x = age, fill = smoking_status)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(vars(smoking_status)) +
  labs(x = "age", y = "Count", title = "Risk Factors")
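
The observation for Figure 1 can also be backed by numbers rather than read off the histogram. The sketch below uses ten invented `smoking_status` values, with category labels assumed to match the dataset's; on the real data the same two calls would be applied to `Stroke$smoking_status`.

```r
# Hypothetical smoking_status values for ten patients (invented for illustration).
smoking_status <- c("never smoked", "formerly smoked", "smokes", "Unknown",
                    "never smoked", "never smoked", "formerly smoked",
                    "Unknown", "smokes", "never smoked")

counts <- table(smoking_status)   # frequency of each category
shares <- prop.table(counts)      # proportion of each category
round(shares, 2)

# Share of current non-smokers (never smoked plus formerly smoked).
non_smoker_share <- sum(shares[c("never smoked", "formerly smoked")])
non_smoker_share
```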

Figure 2 - Smoking status of stroke patients by hypertension and marital status.

Observation: Among patients who had a stroke, smoking status appeared less informative than hypertension and marital status (a visual impression; no statistical test was performed). Most patients who had a stroke were MARRIED with NO HYPERTENSION.

# Age histogram of stroke patients, one panel per hypertension x marital status combination
ggplot(Yesstroke, aes(x = age, fill = smoking_status)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(vars(hypertension, ever_married)) +
  labs(x = "age", y = "Count", title = "Risk Factors")
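
Each panel of a faceted plot like Figure 2 corresponds to one cell of a cross-tabulation, so the "largest panel" can be checked directly. This is a sketch on eight invented rows that reuse the `hypertension` and `ever_married` column names; the counts are not from the real data.

```r
# Hypothetical stroke-only subset (eight invented rows, same column names as Yesstroke).
toy <- data.frame(
  hypertension = c(0, 0, 1, 0, 0, 1, 0, 0),
  ever_married = c("Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No")
)

# Cross-tabulate the two facet variables; each cell corresponds to one panel.
panel_counts <- table(hypertension = toy$hypertension,
                      ever_married = toy$ever_married)
panel_counts

# Locate the cell (panel) with the most patients.
which(panel_counts == max(panel_counts), arr.ind = TRUE)
```

The same `table()` call with `gender` and `heart_disease` would give the panel sizes for Figure 3.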

Figure 3 - Smoking status of stroke patients by gender and heart disease.

Observation: Among patients who had a stroke, smoking status appeared less informative than gender and heart disease (a visual impression; no statistical test was performed). Most patients who had a stroke were FEMALE with NO HEART DISEASE.

# Age histogram of stroke patients, one panel per gender x heart disease combination
ggplot(Yesstroke, aes(x = age, fill = smoking_status)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(vars(gender, heart_disease)) +
  labs(x = "age", y = "Count", title = "Risk Factors")
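# Counts of stroke patients per risk factor; one row each for Yes, No, Unknown
NewYesStroke <- matrix(c( 66, 141,  47, 220, 149, 135,  42,
                         183, 108, 202,  29,  98, 114, 160,
                           0,   0,   0,   0,   2,   0,  47),
                       ncol = 7, byrow = TRUE)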
colnames(NewYesStroke) <- c("Hypertension","Female","Heart Disease","Married","Private work","Urban Home","Smoker")
rownames(NewYesStroke) <- c("Yes", "No", "Unknown")
NewYesStroke <- as.table(NewYesStroke)
NewYesStroke
        Hypertension Female Heart Disease Married Private work
Yes               66    141            47     220          149
No               183    108           202      29           98
Unknown            0      0             0       0            2
        Urban Home Smoker
Yes            135     42
No             114    160
Unknown          0     47
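
Because the counts are typed in by hand, a quick consistency check is worthwhile: every column should add up to the 249 stroke patients reported by `dim(Yesstroke)`. A minimal sketch using the numbers from the table above:

```r
# Rebuild the counts exactly as entered in the text.
NewYesStroke <- matrix(c( 66, 141,  47, 220, 149, 135,  42,
                         183, 108, 202,  29,  98, 114, 160,
                           0,   0,   0,   0,   2,   0,  47),
                       ncol = 7, byrow = TRUE)
colnames(NewYesStroke) <- c("Hypertension", "Female", "Heart Disease", "Married",
                            "Private work", "Urban Home", "Smoker")
rownames(NewYesStroke) <- c("Yes", "No", "Unknown")

# Each column should account for all 249 stroke patients.
col_totals <- colSums(NewYesStroke)
col_totals
stopifnot(all(col_totals == 249))
```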

Figure 4 - Distribution of risk factors among patients who had a stroke

Observation: Most patients who had a stroke were non-smokers, married, and had no hypertension and no heart disease.

barplot(NewYesStroke, legend.text = TRUE, beside = TRUE, main = 'Risk Factors for patients who had Stroke', las = 2, cex.names = 0.75, col = c("pink", "blue", "gray"))
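
The bar heights in Figure 4 are raw counts. Converting each column to percentages makes "most patients" easier to judge. The sketch below reuses the counts from the table and is only one possible way to summarize them.

```r
# Same counts as in the table above.
NewYesStroke <- matrix(c( 66, 141,  47, 220, 149, 135,  42,
                         183, 108, 202,  29,  98, 114, 160,
                           0,   0,   0,   0,   2,   0,  47),
                       ncol = 7, byrow = TRUE,
                       dimnames = list(c("Yes", "No", "Unknown"),
                                       c("Hypertension", "Female", "Heart Disease",
                                         "Married", "Private work", "Urban Home",
                                         "Smoker")))

# Percentage within each column (margin = 2 normalizes column by column).
pct <- round(100 * prop.table(NewYesStroke, margin = 2), 1)
pct
```

Passing `pct` instead of `NewYesStroke` to `barplot()` would put every risk factor on the same 0 to 100 scale.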

ANSWERS TO QUESTIONS:

1. The visualizations need to account for the numerical variables.
2. Conclusions: Most patients who had a stroke were non-smokers, married, and had no hypertension and no heart disease.
3. A naive reader would need a basic understanding of epidemiology.
4. I think these observations are best visualized in matrix-like 3D or 4D plots.
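
One possible reading of a "matrix-like" plot for several categorical factors is a mosaic plot of a multi-way contingency table. The sketch below is an assumption about what such a plot could look like, built on eight invented rows rather than the real stroke data.

```r
# Hypothetical stroke-only rows (invented for illustration).
toy <- data.frame(
  hypertension  = c("No", "No", "Yes", "No", "No", "Yes", "No", "No"),
  ever_married  = c("Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No"),
  heart_disease = c("No", "Yes", "No", "No", "No", "No", "Yes", "No")
)

# Three-way contingency table: one dimension per risk factor.
tab3 <- table(toy)
dim(tab3)

# Mosaic plot: tile area is proportional to the count in each cell.
mosaicplot(tab3, main = "Risk factors (toy data)", color = TRUE)
```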

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Vespa (2022, Jan. 14). Data Analytics and Computational Social Science: HW5 -More Data Visualization. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomowenvespa855102/

BibTeX citation

@misc{vespa2022hw5,
  author = {Vespa, Rhowena},
  title = {Data Analytics and Computational Social Science: HW5 -More Data Visualization},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomowenvespa855102/},
  year = {2022}
}