Data Analytics and Computational Social Science: HW3-Data for Final Project

Rhowena Vespa

This final project will use the Stroke Prediction Dataset from Kaggle.

Read CSV file into R

library(distill)
library(dplyr)
library(readr)
library(tidyverse)
Stroke <- read.csv('healthcare-dataset-stroke-data.csv', header = TRUE, sep = ',')
class(Stroke)

[1] "data.frame"

colnames(Stroke)

 [1] "id"                "gender"            "age"              
 [4] "hypertension"      "heart_disease"     "ever_married"     
 [7] "work_type"         "Residence_type"    "avg_glucose_level"
[10] "bmi"               "smoking_status"    "stroke"

dim(Stroke)

[1] 5110   12

The data set has 5110 observations of 12 variables. In R, it can be used to address the following research questions:

1. Is there a single variable that can predict stroke? If yes, which is it?
2. Is work type a significant predictor of stroke?
3. Is residence type a significant predictor of stroke?
4. Can a model built on training data predict the occurrence of stroke in held-out test data?
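
As a rough illustration of how questions 2 to 4 might be approached, the sketch below runs a chi-square test on one categorical predictor and then fits a logistic regression on a 70/30 train/test split. It is not the project's analysis: the data frame `toy` is simulated, borrowing only a few column names from `Stroke`, and the 70/30 split, the choice of predictors, and the 0.5 cutoff are all assumptions made for the example.

```r
# Toy data only: simulated rows that borrow a few column names from the
# Stroke data frame. The values are made up and carry no real signal.
set.seed(42)
n <- 500
toy <- data.frame(
  age               = round(runif(n, min = 1, max = 82)),
  hypertension      = rbinom(n, size = 1, prob = 0.1),
  avg_glucose_level = round(runif(n, min = 55, max = 270), 2),
  Residence_type    = factor(sample(c("Rural", "Urban"), n, replace = TRUE))
)
# Simulate the outcome so that stroke risk rises with age (an assumption
# made purely so the toy model has something to fit).
toy$stroke <- rbinom(n, size = 1, prob = plogis(-5 + 0.06 * toy$age))

# Questions 2 and 3: test a categorical predictor against the outcome.
residence_test <- chisq.test(table(toy$Residence_type, toy$stroke))

# Question 4: hold out 30% of the rows as test data.
train_rows <- sample(seq_len(n), size = floor(0.7 * n))
train <- toy[train_rows, ]
test  <- toy[-train_rows, ]

# Fit a logistic regression on the training rows only.
fit <- glm(stroke ~ age + hypertension + avg_glucose_level + Residence_type,
           data = train, family = binomial)

# Predicted probabilities for the held-out rows, thresholded at 0.5
# (the threshold is an arbitrary illustrative choice).
test_prob <- predict(fit, newdata = test, type = "response")
test_pred <- as.integer(test_prob > 0.5)

# Confusion table and accuracy on the test split.
confusion <- table(predicted = test_pred, actual = test$stroke)
accuracy  <- mean(test_pred == test$stroke)
```

With the real data, `toy` would be replaced by `Stroke` after checking column types and missing values. Because strokes are likely to be a small share of the rows, accuracy alone may be misleading, so the confusion table is worth reading alongside it.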

Data source: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset


Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Vespa (2022, Jan. 3). Data Analytics and Computational Social Science: HW3-Data for Final Project. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomowenvespa852354/

BibTeX citation

@misc{vespa2022hw3-data,
  author = {Vespa, Rhowena},
  title = {Data Analytics and Computational Social Science: HW3-Data for Final Project},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomowenvespa852354/},
  year = {2022}
}