HW4 -Data Visualization

Stroke Predictor

Rhowena Vespa
1/09/2022

This final project uses the Stroke Prediction Dataset from Kaggle.

  1. Read CSV file into R
library(distill)
library(dplyr)
library(readr)
library(tidyverse)
Stroke <- read.csv("healthcare-dataset-stroke-data.csv", header = TRUE, sep = ",", na.strings = "N/A")
class(Stroke)
[1] "data.frame"
colnames(Stroke)
 [1] "id"                "gender"            "age"              
 [4] "hypertension"      "heart_disease"     "ever_married"     
 [7] "work_type"         "Residence_type"    "avg_glucose_level"
[10] "bmi"               "smoking_status"    "stroke"           
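
The na.strings argument in the read.csv() call above matters because the file marks missing BMI values with the text "N/A". The following is a minimal sketch of that effect, using a hypothetical three-row extract typed inline rather than the real file.

```r
# Hypothetical three-row extract; the real file has 5110 rows and 12 columns.
csv_text <- "id,age,bmi
1,67,36.6
2,61,N/A
3,80,32.5"

# Without na.strings, the "N/A" text keeps bmi from being read as numbers.
raw <- read.csv(text = csv_text)
class(raw$bmi)

# With na.strings = "N/A", the marker becomes NA and bmi stays numeric.
clean <- read.csv(text = csv_text, na.strings = "N/A")
class(clean$bmi)

# Count of missing values per column.
colSums(is.na(clean))
```

Running colSums(is.na(Stroke)) on the full data frame would show, in the same way, how many values each column is missing.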

Datasource: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset

Compute Descriptive Statistics for Each Variable

Mean, median, standard deviation for numerical variables

NumericStroke <- subset(Stroke, select= c("age","avg_glucose_level","bmi"))
dim(NumericStroke)
[1] 5110    3
Mean Values:
sapply(NumericStroke, mean, na.rm=TRUE)
              age avg_glucose_level               bmi 
         43.22661         106.14768          28.89324 
Median values:
sapply(NumericStroke, median, na.rm=TRUE)
              age avg_glucose_level               bmi 
           45.000            91.885            28.100 
Standard deviation:
sapply(NumericStroke, sd, na.rm=TRUE)
              age avg_glucose_level               bmi 
        22.612647         45.283560          7.854067 
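
The three sapply() calls above could also be collected into a single table. This is only a sketch: the helper name describe and the toy values are hypothetical, standing in for NumericStroke.

```r
# Hypothetical toy values standing in for the three numeric columns.
toy <- data.frame(
  age = c(67, 61, 80, 49, 79),
  avg_glucose_level = c(228.69, 202.21, 105.92, 171.23, 174.12),
  bmi = c(36.6, NA, 32.5, 34.4, 24.0)
)

# Hypothetical helper: all statistics for one column, ignoring missing values.
describe <- function(x) {
  c(mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    n_missing = sum(is.na(x)))
}

# sapply() returns a matrix with one column per variable;
# t() flips it so that each variable becomes a row.
stats_table <- t(sapply(toy, describe))
round(stats_table, 2)
```

The n_missing column makes it visible which variables rely on na.rm = TRUE.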

Frequencies for categorical variables

Gender (two most frequent categories; head(2) cuts off the remaining 1 of the 5110 records)
Stroke %>%
    count(gender, sort = TRUE) %>%
  head(2)
  gender    n
1 Female 2994
2   Male 2115
Hypertension frequencies where 0 = No, 1 = Yes
Stroke %>%
    count(hypertension, sort = TRUE) %>%
  head(2)
  hypertension    n
1            0 4612
2            1  498
Heart disease frequencies where 0 = No, 1 = Yes
Stroke %>%
    count(heart_disease, sort = TRUE) %>%
  head(2)
  heart_disease    n
1             0 4834
2             1  276
Marital Status
Stroke %>%
    count(ever_married, sort = TRUE) %>%
  head(2)
  ever_married    n
1          Yes 3353
2           No 1757
Employment type (four most frequent categories; head(4) cuts off the remaining 22 of the 5110 records)
Stroke %>%
    count(work_type, sort = TRUE) %>%
  head(4)
      work_type    n
1       Private 2925
2 Self-employed  819
3      children  687
4      Govt_job  657
Residence type
Stroke %>%
    count(Residence_type, sort = TRUE) %>%
  head(5)
  Residence_type    n
1          Urban 2596
2          Rural 2514
Smoking status
Stroke %>%
    count(smoking_status, sort = TRUE) %>%
  head(4)
   smoking_status    n
1    never smoked 1892
2         Unknown 1544
3 formerly smoked  885
4          smokes  789
Stroke occurrences where 0 = No, 1 = Yes
Stroke %>%
    count(stroke, sort = TRUE) %>%
  head(2)
  stroke    n
1      0 4861
2      1  249
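
The frequency tables above repeat the same count() call per column and cap each result with head(). One possible alternative, sketched here in base R with a hypothetical helper freq_table and toy data, is to tabulate every categorical column in one pass and keep all levels, including missing values.

```r
# Hypothetical toy data standing in for the categorical columns of Stroke.
toy <- data.frame(
  gender = c("Female", "Male", "Female", "Other", "Female"),
  smoking_status = c("never smoked", "Unknown", "smokes",
                     "never smoked", "never smoked"),
  stroke = c(0, 1, 0, 0, 1)
)

# Hypothetical helper: counts and percentages for one column, most frequent first.
freq_table <- function(x) {
  counts <- sort(table(x, useNA = "ifany"), decreasing = TRUE)
  data.frame(
    level = names(counts),
    n = as.integer(counts),
    percent = round(100 * as.integer(counts) / length(x), 1)
  )
}

# One frequency table per column, returned as a named list.
freqs <- lapply(toy, freq_table)
freqs$gender
```

Because no rows are truncated, the n column of each table sums to the number of records.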
summarise(Stroke, Age = mean(age, na.rm = TRUE))
       Age
1 43.22661
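# Group patients by age, gender, heart disease, and hypertension.
# Note: Stroke = 1 assigns the constant 1 to every group; it does not filter on the stroke column.
Positive <- group_by(Stroke, age, gender, heart_disease, hypertension)
summarise(Positive, Stroke = 1)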
# A tibble: 424 x 5
# Groups:   age, gender, heart_disease [282]
     age gender heart_disease hypertension Stroke
   <dbl> <chr>          <int>        <int>  <dbl>
 1  0.08 Female             0            0      1
 2  0.08 Male               0            0      1
 3  0.16 Male               0            0      1
 4  0.24 Male               0            0      1
 5  0.32 Female             0            0      1
 6  0.32 Male               0            0      1
 7  0.4  Female             0            0      1
 8  0.4  Male               0            0      1
 9  0.48 Female             0            0      1
10  0.48 Male               0            0      1
# ... with 414 more rows
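
In the output above, the Stroke column is 1 for every group because summarise() was given a constant. If the goal is to see how many strokes occurred in each group, one option is to sum the stroke indicator instead. The sketch below assumes hypothetical toy data and fewer grouping columns than the original call.

```r
library(dplyr)

# Hypothetical toy data with the same 0/1 coding as the Stroke data frame.
toy <- data.frame(
  gender = c("Female", "Female", "Male", "Male", "Male", "Female"),
  hypertension = c(0, 1, 0, 0, 1, 1),
  stroke = c(0, 1, 0, 1, 1, 0)
)

# For each group: number of patients and number of recorded strokes.
by_group <- toy %>%
  group_by(gender, hypertension) %>%
  summarise(patients = n(), strokes = sum(stroke), .groups = "drop")
by_group
```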
Posstroke <- subset(Stroke, stroke == 1,
                    select = c("gender", "age", "hypertension", "heart_disease",
                               "ever_married", "work_type", "Residence_type",
                               "avg_glucose_level", "bmi", "smoking_status", "stroke"))
dim(Posstroke)
[1] 249  11
Breakdown of smoking status of the entire data set (Figure 1)
ggplot(Stroke) + 
  geom_bar(mapping = aes(x = smoking_status))

Breakdown of smoking status of the patients who had stroke (Figure 2)
ggplot(data = Posstroke) +
  geom_bar(mapping = aes(x = smoking_status))

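Among the patients who had a stroke (Figure 2), those who never smoked form the largest group, ahead of former smokers, current smokers, and patients with unknown smoking status. Because never-smokers are also the largest group in the entire data set (Figure 1), these raw counts do not by themselves show a higher stroke rate among them.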
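
The comparison between Figures 1 and 2 can be made explicit by dividing the stroke count in each smoking category by the size of that category. The following is a minimal sketch on hypothetical toy data, not the real counts.

```r
library(dplyr)

# Hypothetical toy data: 8 never-smokers, 4 smokers, 3 former smokers.
toy <- data.frame(
  smoking_status = c(rep("never smoked", 8),
                     rep("smokes", 4),
                     rep("formerly smoked", 3)),
  stroke = c(1, 1, 1, 0, 0, 0, 0, 0,
             1, 0, 0, 0,
             1, 1, 0)
)

# Stroke count and stroke rate per category, highest rate first.
rates <- toy %>%
  group_by(smoking_status) %>%
  summarise(patients = n(),
            strokes = sum(stroke),
            stroke_rate = mean(stroke)) %>%
  arrange(desc(stroke_rate))
rates
```

In this toy example the never-smoked group has the most strokes but not the highest rate, which is why counts and rates need to be read separately.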

Correlation of Age and BMI of the entire data set (Figure 3)
ggplot(data = Stroke) + 
  geom_point(mapping = aes(x = age, y = bmi))

Correlation of Age and BMI of the patients who had stroke (Figure 4)
ggplot(data = Posstroke) + 
  geom_point(mapping = aes(x = age, y = bmi))

Compared to the entire data set (Figure 3), most patients who had a stroke (Figure 4) were over 60 years old, had a BMI above 30, or both.
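
The visual impression from Figures 3 and 4 can be checked numerically by computing the share of patients who meet either condition. This is a sketch on hypothetical toy data; the thresholds of 60 years and a BMI of 30 come from the text above.

```r
# Hypothetical toy data with the same column names as the Stroke data frame.
toy <- data.frame(
  age = c(67, 45, 80, 30, 72, 55, 25, 64),
  bmi = c(36.6, 31.0, 24.0, 22.5, NA, 33.1, 27.0, 29.5),
  stroke = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# TRUE when a patient is over 60, has a BMI above 30, or both.
# A missing BMI counts as flagged only if the age condition holds.
toy$flagged <- toy$age > 60 | (!is.na(toy$bmi) & toy$bmi > 30)

# Share of flagged patients overall and among stroke patients.
share_all <- mean(toy$flagged)
share_stroke <- mean(toy$flagged[toy$stroke == 1])
c(all_patients = share_all, stroke_patients = share_stroke)
```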

Limitations of the visualizations

1. The plots do not account for the other variables in the data set.
2. A naive viewer may not understand the stroke risk factors shown (age, BMI, smoking status).
For example, a BMI of 30 or above is clinically classified as obese.
Figure 4 therefore suggests that stroke occurred more often among older and obese
patients.
3. The visualizations would be better presented as multivariate plots.
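
As an illustration of the third limitation, a single plot can carry more variables by mapping stroke status to color and splitting panels by smoking status. This is only a sketch on hypothetical toy data; the dashed line at a BMI of 30 marks the obesity threshold mentioned above.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as the Stroke data frame.
toy <- data.frame(
  age = c(67, 45, 80, 30, 72, 55, 25, 64),
  bmi = c(36.6, 31.0, 24.0, 22.5, 28.4, 33.1, 27.0, 29.5),
  smoking_status = c("formerly smoked", "never smoked", "never smoked", "Unknown",
                     "smokes", "never smoked", "Unknown", "smokes"),
  stroke = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Age against BMI, colored by stroke status, one panel per smoking status.
p <- ggplot(toy, aes(x = age, y = bmi, color = factor(stroke))) +
  geom_point() +
  geom_hline(yintercept = 30, linetype = "dashed") +
  facet_wrap(~ smoking_status) +
  labs(x = "Age", y = "BMI", color = "Stroke (0 = No, 1 = Yes)")
p
```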

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Vespa (2022, Jan. 11). Data Analytics and Computational Social Science: HW4 -Data Visualization. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomowenvespa854286/

BibTeX citation

@misc{vespa2022hw4,
  author = {Vespa, Rhowena},
  title = {Data Analytics and Computational Social Science: HW4 -Data Visualization},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomowenvespa854286/},
  year = {2022}
}