Stroke Predictor
library(distill)
library(dplyr)
library(readr)
library(tidyverse)
Stroke <- read.csv("healthcare-dataset-stroke-data.csv", header = TRUE, sep = ",", na.strings = "N/A")
class(Stroke)
[1] "data.frame"
colnames(Stroke)
[1] "id" "gender" "age"
[4] "hypertension" "heart_disease" "ever_married"
[7] "work_type" "Residence_type" "avg_glucose_level"
[10] "bmi" "smoking_status" "stroke"
Data source: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
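The summaries that follow use a `NumericStroke` object whose definition does not appear in the text. A minimal sketch of how such an object could be built, assuming it is simply the three continuous columns (`age`, `avg_glucose_level`, `bmi`) selected with dplyr. The data frame `stroke_demo` and its values are made up for illustration; with the real data, `Stroke` would take its place.

```r
library(dplyr)

# Hypothetical stand-in for Stroke (made-up values, not rows from the dataset)
stroke_demo <- data.frame(
  id = 1:4,
  age = c(67, 45, 80, 23),
  avg_glucose_level = c(228.7, 95.1, 105.9, 88.0),
  bmi = c(36.6, NA, 32.5, 24.1),
  stroke = c(1L, 0L, 1L, 0L)
)

# Assumed definition: keep only the three continuous columns
numeric_demo <- select(stroke_demo, age, avg_glucose_level, bmi)
dim(numeric_demo)

# na.rm = TRUE skips the missing BMI value instead of returning NA
sapply(numeric_demo, mean, na.rm = TRUE)
```

Because `"N/A"` is read as a missing value when the file is loaded, `na.rm = TRUE` is what keeps the `bmi` summaries from coming back as `NA`.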
[1] 5110 3
sapply(NumericStroke, mean, na.rm=TRUE)
age avg_glucose_level bmi
43.22661 106.14768 28.89324
sapply(NumericStroke, median, na.rm=TRUE)
age avg_glucose_level bmi
45.000 91.885 28.100
sapply(NumericStroke, sd, na.rm=TRUE)
age avg_glucose_level bmi
22.612647 45.283560 7.854067
work_type n
1 Private 2925
2 Self-employed 819
3 children 687
4 Govt_job 657
Residence_type n
1 Urban 2596
2 Rural 2514
smoking_status n
1 never smoked 1892
2 Unknown 1544
3 formerly smoked 885
4 smokes 789
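The frequency tables above have the shape of `dplyr::count()` output, although the calls that produced them are not shown. A minimal sketch, assuming `count()` with `sort = TRUE` was used. The data frame `demo` is a made-up stand-in; with the real data, `Stroke` would be passed instead.

```r
library(dplyr)

# Hypothetical stand-in rows (made-up values) using the dataset's column names
demo <- data.frame(
  work_type = c("Private", "Private", "Self-employed", "children", "Private"),
  Residence_type = c("Urban", "Rural", "Urban", "Rural", "Urban")
)

# Tally each level of a categorical column, most frequent first
work_counts <- count(demo, work_type, sort = TRUE)
residence_counts <- count(demo, Residence_type, sort = TRUE)

work_counts
residence_counts
```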
Positive <- group_by(Stroke, age, gender, heart_disease, hypertension)
summarise(Positive, Stroke=1)
# A tibble: 424 x 5
# Groups: age, gender, heart_disease [282]
age gender heart_disease hypertension Stroke
<dbl> <chr> <int> <int> <dbl>
1 0.08 Female 0 0 1
2 0.08 Male 0 0 1
3 0.16 Male 0 0 1
4 0.24 Male 0 0 1
5 0.32 Female 0 0 1
6 0.32 Male 0 0 1
7 0.4 Female 0 0 1
8 0.4 Male 0 0 1
9 0.48 Female 0 0 1
10 0.48 Male 0 0 1
# ... with 414 more rows
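Note that `summarise(Positive, Stroke = 1)` assigns the constant 1 to every group, so the table above lists every combination of age, gender, heart_disease, and hypertension in the data, not only patients who had a stroke. One possible alternative, sketched here on made-up rows, sums the `stroke` indicator within each group. The names `demo`, `patients`, and `strokes` are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in rows (made-up values) using the dataset's column names
demo <- data.frame(
  age = c(70, 70, 70, 35, 35),
  gender = c("Male", "Male", "Female", "Female", "Female"),
  heart_disease = c(1L, 1L, 0L, 0L, 0L),
  hypertension = c(0L, 0L, 1L, 0L, 0L),
  stroke = c(1L, 0L, 1L, 0L, 0L)
)

# Count patients and stroke cases within each risk-factor combination
stroke_by_group <- demo %>%
  group_by(age, gender, heart_disease, hypertension) %>%
  summarise(patients = n(), strokes = sum(stroke), .groups = "drop")

# Keep only the combinations with at least one stroke
positive_groups <- filter(stroke_by_group, strokes > 0)
positive_groups
```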
Posstroke <- subset(
  Stroke,
  stroke == 1,
  select = c("gender", "age", "hypertension", "heart_disease", "ever_married",
             "work_type", "Residence_type", "avg_glucose_level", "bmi",
             "smoking_status", "stroke")
)
dim(Posstroke)
[1] 249 11
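The row counts give a sense of how imbalanced the outcome is. As a worked check using only the reported figures, and assuming the full `Stroke` data frame has the same 5110 rows reported earlier for `NumericStroke`:

```r
# Row counts taken from the dim() outputs in the text
n_total <- 5110   # assumed number of rows in Stroke
n_stroke <- 249   # rows with stroke == 1 (Posstroke)

# Percentage of records that are stroke cases
stroke_share <- round(100 * n_stroke / n_total, 2)
stroke_share
```

Under that assumption, just under 5% of the records are stroke cases.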
ggplot(data = Stroke) +
  geom_point(mapping = aes(x = age, y = bmi))
ggplot(data = Posstroke) +
  geom_point(mapping = aes(x = age, y = bmi))
1. The plots account only for age and BMI; other variables need to be considered.
2. A naive viewer may not understand stroke risk factors (age, BMI, smoking status).
For example, a BMI of 30 or above is clinically classified as obese.
Read with that threshold in mind, the visualization suggests that stroke
occurred more often among older and obese patients (Figure 4).
3. The visualizations would be better represented with multivariate points.
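One way to act on these points is to map further variables to color and shape in the same scatter plot. This is a minimal sketch, not the author's figure. The data frame `demo` holds made-up values, and the choice of `stroke` for color, `smoking_status` for shape, and a reference line at BMI 30 are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical stand-in rows (made-up values); with the real data, use Stroke
demo <- data.frame(
  age = c(25, 48, 61, 74, 80, 55),
  bmi = c(22.4, 31.0, 27.8, 33.5, 29.1, 36.2),
  stroke = c(0, 0, 1, 1, 1, 0),
  smoking_status = c("never smoked", "smokes", "formerly smoked",
                     "smokes", "never smoked", "Unknown")
)

p <- ggplot(data = demo) +
  # Color encodes the outcome, shape encodes smoking status
  geom_point(
    mapping = aes(x = age, y = bmi,
                  color = factor(stroke), shape = smoking_status),
    alpha = 0.7, na.rm = TRUE
  ) +
  # Dashed line at the obesity threshold (BMI of 30) mentioned in point 2
  geom_hline(yintercept = 30, linetype = "dashed") +
  labs(color = "stroke", shape = "smoking status")
p
```

Drawing the threshold line directly on the plot means a viewer does not need to know the clinical definition of obesity to read it.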
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Vespa (2022, Jan. 11). Data Analytics and Computational Social Science: HW4 -Data Visualization. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomowenvespa854286/
BibTeX citation
@misc{vespa2022hw4,
  author = {Vespa, Rhowena},
  title = {Data Analytics and Computational Social Science: HW4 -Data Visualization},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomowenvespa854286/},
  year = {2022}
}