Data Analytics and Computational Social Science: HW4 - Data Visualization

Rhowena Vespa

This final project uses the Stroke Prediction Dataset from Kaggle.

Read CSV file into R

library(distill)
library(tidyverse)  # attaches dplyr, readr, and ggplot2
# Read the CSV, treating the literal string "N/A" as a missing value
Stroke <- read.csv("healthcare-dataset-stroke-data.csv", header = TRUE,
                   sep = ",", na.strings = "N/A")
class(Stroke)

[1] "data.frame"

colnames(Stroke)

 [1] "id"                "gender"            "age"              
 [4] "hypertension"      "heart_disease"     "ever_married"     
 [7] "work_type"         "Residence_type"    "avg_glucose_level"
[10] "bmi"               "smoking_status"    "stroke"
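
The `na.strings = "N/A"` argument in the call above tells R which text marks a missing value. The following is a minimal sketch of its effect, using a made-up three-row CSV string rather than the Kaggle file (the values and the `csv_text`, `raw`, and `clean` names are assumptions for illustration only).

```r
# Hypothetical three-row CSV text standing in for the real file.
csv_text <- "id,bmi\n1,36.6\n2,N/A\n3,32.5"

# Without na.strings, "N/A" is ordinary text, so bmi is not numeric
# (character in R >= 4.0, factor in earlier versions).
raw <- read.csv(text = csv_text)
class(raw$bmi)

# With na.strings = "N/A", the entry becomes NA and bmi is read as numeric.
clean <- read.csv(text = csv_text, na.strings = "N/A")
class(clean$bmi)
sum(is.na(clean$bmi))
```

A numeric `bmi` column is what allows `mean()`, `median()`, and `sd()` to run on it once `na.rm = TRUE` is set.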

Data source: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset

Compute Descriptive Statistics for Each Variable

Mean, median, standard deviation for numerical variables

NumericStroke <- subset(Stroke, select= c("age","avg_glucose_level","bmi"))
dim(NumericStroke)

[1] 5110    3

Mean Values:

sapply(NumericStroke, mean, na.rm=TRUE)

              age avg_glucose_level               bmi 
         43.22661         106.14768          28.89324

Median values:

sapply(NumericStroke, median, na.rm=TRUE)

              age avg_glucose_level               bmi 
           45.000            91.885            28.100

Standard deviation:

sapply(NumericStroke, sd, na.rm=TRUE)

              age avg_glucose_level               bmi 
        22.612647         45.283560          7.854067
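
The three `sapply()` calls above can also be combined into a single pass. This is only a sketch on invented toy data: the `toy` data frame and the `describe()` helper are hypothetical names, not part of the original analysis.

```r
# Hypothetical toy data; on the real data, NumericStroke would be used instead.
toy <- data.frame(
  age = c(20, 40, 60, 80),
  bmi = c(22, NA, 30, 26)
)

# Return mean, median, and standard deviation, ignoring missing values.
describe <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE))
}

# sapply() binds the per-column results into a matrix:
# one column per variable, one row per statistic.
stats <- sapply(toy, describe)
round(stats, 2)
```

Applied to the real data, `sapply(NumericStroke, describe)` should give the same numbers as the three separate calls, arranged in one table.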

Frequencies for categorical variables

Gender

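# head(2) keeps only the two most frequent categories. The counts below
# sum to 5109 of 5110 rows, so one record belongs to a category not shown.
Stroke %>%
  count(gender, sort = TRUE) %>%
  head(2)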

  gender    n
1 Female 2994
2   Male 2115

Hypertension frequencies where 0 = No, 1 = Yes

Stroke %>%
  count(hypertension, sort = TRUE) %>%
  head(2)

  hypertension    n
1            0 4612
2            1  498

Heart disease frequencies where 0 = No, 1 = Yes

Stroke %>%
  count(heart_disease, sort = TRUE) %>%
  head(2)

  heart_disease    n
1             0 4834
2             1  276

Marital Status

Stroke %>%
  count(ever_married, sort = TRUE) %>%
  head(2)

  ever_married    n
1          Yes 3353
2           No 1757

Employment type

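# head(4) keeps the four most frequent categories. The counts below sum to
# 5088 of 5110 rows, so 22 records belong to categories not shown.
Stroke %>%
  count(work_type, sort = TRUE) %>%
  head(4)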

      work_type    n
1       Private 2925
2 Self-employed  819
3      children  687
4      Govt_job  657

Residence type

Stroke %>%
  count(Residence_type, sort = TRUE) %>%
  head(5)

  Residence_type    n
1          Urban 2596
2          Rural 2514

Smoking status

Stroke %>%
  count(smoking_status, sort = TRUE) %>%
  head(4)

   smoking_status    n
1    never smoked 1892
2         Unknown 1544
3 formerly smoked  885
4          smokes  789

Stroke occurrences where 0 = No, 1 = Yes

Stroke %>%
  count(stroke, sort = TRUE) %>%
  head(2)

  stroke    n
1      0 4861
2      1  249
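
Raw counts do not show directly how unbalanced the outcome is. As a small sketch, the two counts printed above can be converted to percentages (the `stroke_counts` and `stroke_pct` names are introduced here for illustration).

```r
# Counts copied from the table above (0 = no stroke, 1 = stroke).
stroke_counts <- c("0" = 4861, "1" = 249)

# Share of each outcome as a percentage of all rows.
stroke_pct <- round(100 * stroke_counts / sum(stroke_counts), 2)
stroke_pct
```

About 4.87% of the 5110 rows are stroke cases, so any plot restricted to stroke patients is based on a small subset of the data.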

summarise(Stroke, Age = mean(age, na.rm = TRUE))

       Age
1 43.22661

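# Stroke = 1 adds a constant column to every group; it does not filter
# for stroke cases (the subset() call further down does that).
Positive <- group_by(Stroke, age, gender, heart_disease, hypertension)
summarise(Positive, Stroke = 1)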

# A tibble: 424 x 5
# Groups:   age, gender, heart_disease [282]
     age gender heart_disease hypertension Stroke
   <dbl> <chr>          <int>        <int>  <dbl>
 1  0.08 Female             0            0      1
 2  0.08 Male               0            0      1
 3  0.16 Male               0            0      1
 4  0.24 Male               0            0      1
 5  0.32 Female             0            0      1
 6  0.32 Male               0            0      1
 7  0.4  Female             0            0      1
 8  0.4  Male               0            0      1
 9  0.48 Female             0            0      1
10  0.48 Male               0            0      1
# ... with 414 more rows

# Keep stroke cases only and drop the id column (all other columns are kept)
Posstroke <- subset(Stroke, stroke == 1, select = -id)
dim(Posstroke)

[1] 249  11

Breakdown of smoking status of the entire data set (Figure 1)

ggplot(Stroke) + 
  geom_bar(mapping = aes(x = smoking_status))

Breakdown of smoking status of the patients who had stroke (Figure 2)

ggplot(data = Posstroke) +
  geom_bar(mapping = aes(x = smoking_status))

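Figure 2 shows that patients who never smoked account for more stroke cases than former smokers, current smokers, or patients with unknown smoking status. Because never smokers are also the largest group in the entire data set (Figure 1), these raw counts do not by themselves indicate a higher stroke rate in that group.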
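
A bar chart of counts cannot separate group size from stroke rate. One way to compare groups of different sizes is to compute the share of each group that had a stroke. The sketch below uses invented toy records, not the Kaggle data, and the `toy`, `cases`, and `rate` names are hypothetical.

```r
# Hypothetical toy data, not the Kaggle records.
toy <- data.frame(
  smoking_status = c(rep("never smoked", 6), rep("smokes", 2)),
  stroke         = c(1, 1, 0, 0, 0, 0, 1, 0)
)

# Number of stroke cases per group (what a bar chart of stroke patients shows).
cases <- tapply(toy$stroke, toy$smoking_status, sum)

# Share of each group that had a stroke (the mean of a 0/1 column).
rate <- tapply(toy$stroke, toy$smoking_status, mean)

cases
rate
```

In this toy example the larger group has more cases but a lower rate. On the real data, a comparable table could be obtained with `Stroke %>% group_by(smoking_status) %>% summarise(rate = mean(stroke))`.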

Correlation of Age and BMI of the entire data set (Figure 3)

ggplot(data = Stroke) + 
  geom_point(mapping = aes(x = age, y = bmi))

Correlation of Age and BMI of the patients who had stroke (Figure 4)

ggplot(data = Posstroke) + 
  geom_point(mapping = aes(x = age, y = bmi))

Compared with the entire data set (Figure 3), most patients who had a stroke (Figure 4) were over 60 years old, had a BMI above 30, or both.
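
The reading of Figure 4 can be checked numerically by computing the share of stroke patients who meet each condition. The following is a sketch on invented values, with `toy` standing in for Posstroke; the thresholds of 60 years and BMI 30 come from the text above, everything else is an assumption.

```r
# Hypothetical toy stand-in for Posstroke; the values are invented.
toy <- data.frame(
  age = c(45, 62, 70, 81, 58),
  bmi = c(31, 27, NA, 33, 24)
)

over_60 <- toy$age > 60
obese   <- toy$bmi > 30   # NA where bmi is missing

# Share of patients meeting each condition; na.rm drops unknown BMI values.
mean(over_60)
mean(obese, na.rm = TRUE)

# Either condition: a missing BMI counts as "not known to be above 30".
either <- over_60 | (!is.na(obese) & obese)
mean(either)
```

Handling the missing BMI values explicitly matters here, because `bmi` is the one numeric column read with `N/A` entries.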

Limitations of the visualizations

1. The plots do not account for other variables.
2. A viewer unfamiliar with stroke risk factors (age, BMI, smoking status) may not know how to interpret the plots.
For example, a BMI of 30 or higher is clinically classified as obese.
Figure 4 therefore suggests that stroke cases were concentrated among older
and obese patients.
3. The data would be better represented with multivariate plots.
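
One possible way to address the third limitation is to map the stroke outcome to colour, so that age, BMI, and stroke status appear in a single panel. This is a sketch on invented toy data, assuming ggplot2 is installed; the `toy` data frame and the axis labels are illustrative choices, not part of the original analysis.

```r
library(ggplot2)

# Hypothetical toy data; on the real data, Stroke would be passed instead.
toy <- data.frame(
  age    = c(25, 40, 63, 71, 80, 55),
  bmi    = c(22, 31, 29, 34, 27, 36),
  stroke = c(0, 0, 1, 1, 1, 0)
)

# Map the outcome to colour so that age, BMI, and stroke share one panel.
p <- ggplot(toy, aes(x = age, y = bmi, colour = factor(stroke))) +
  geom_point() +
  labs(x = "Age (years)", y = "BMI", colour = "Stroke (1 = yes)")
p
```

Wrapping `stroke` in `factor()` gives two discrete colours rather than a continuous colour scale.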
