final_project

---
title: "Final_Project_2"
author: "Mani Kanta Gogula"
description: "Project Analysis"
date: "11/11/2022"
format:
  html:
    df-print: paged
    css: styles.css
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - Project Analysis
---

Introduction :

Heart Disease in the United States

Heart disease is the leading cause of death for men, women, and people of most racial and ethnic groups in the United States [1]. One person dies every 34 seconds in the United States from cardiovascular disease [1]. About 697,000 people in the United States died from heart disease in 2020, which is 1 in every 5 deaths [1, 2]. Heart disease cost the United States about $229 billion each year from 2017 to 2018 [3]. This includes the cost of health care services, medicines, and lost productivity due to death.

Research :

Examining the relationship between the maximum heart rate a person can achieve during exercise and the likelihood of developing heart disease.
Using multiple logistic regression to account for the confounding effects of age and sex.

Loading Packages and Dataset :

library(readr)
library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.1.3

-- Attaching packages --------------------------------------- tidyverse 1.3.1 --

v ggplot2 3.3.5     v dplyr   1.0.8
v tibble  3.1.6     v stringr 1.4.0
v tidyr   1.2.0     v forcats 0.5.1
v purrr   0.3.4

-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

library(dplyr)
# The working directory is already the posts folder, so the path is given relative to it
# (assuming the CSV sits alongside this document in posts/)
heart_cleveland_upload <- read_csv("heart_cleveland_upload.csv")

head(heart_cleveland_upload)

colnames(heart_cleveland_upload)

In this data set, we have 14 variables, including:

AGE: Age in years
SEX: Sex (1 = male; 0 = female)
CP: Chest pain type (0 = typical angina; 1 = atypical angina; 2 = non-anginal pain; 3 = asymptomatic)
TRESTBPS: Resting blood pressure (in mm Hg on admission to the hospital)
CHOL: Serum cholesterol in mg/dl
FBS: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
RESTECG: Resting electrocardiographic results (0 = normal; 1 = having ST-T wave abnormality; 2 = showing probable or definite left ventricular hypertrophy)
THALACH: Maximum heart rate achieved
EXANG: Exercise-induced angina (1 = yes; 0 = no)
OLDPEAK: ST depression induced by exercise relative to rest
SLOPE: The slope of the peak exercise ST segment (0 = upsloping; 1 = flat; 2 = downsloping)
CA: Number of major vessels (0-3) colored by fluoroscopy
CONDITION: 0 = no disease; 1 = disease

dim(heart_cleveland_upload)

summary(heart_cleveland_upload)

Recoding the sex variable into a categorical variable, where 0 becomes "female" and 1 becomes "male":

# recode sex from 0/1 to "female"/"male" in a single step
heart_cleveland_upload$sex <- ifelse(heart_cleveland_upload$sex == 0, "female", "male")

head(heart_cleveland_upload)

CHI-SQUARED TEST: The chi-squared test is a statistical method used to determine whether two categorical variables are significantly associated. Both variables are measured on the same population.

Like all statistical tests, it is framed in terms of a null hypothesis and an alternative hypothesis, and the decision is based on the p-value.

H0: The two variables are independent. H1: The two variables are related.

Under the null hypothesis, the test assumes that the two variables are independent. We reject the null hypothesis if the p-value is less than a predetermined significance level, usually 0.05.

heart_sex <- chisq.test(heart_cleveland_upload$sex, heart_cleveland_upload$condition)

print(heart_sex)

The test returns a large chi-squared statistic and a p-value below the 0.05 significance level, so we reject the null hypothesis and conclude that sex and condition are significantly associated.
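To make the mechanics of the test concrete, here is a minimal sketch on a made-up 2 x 2 table. The counts and the names `counts`, `expected` and `stat_by_hand` are assumptions for illustration only and are not taken from the heart disease data. Note that `chisq.test()` applies Yates' continuity correction to 2 x 2 tables by default; the sketch switches it off so that the result matches the textbook formula.

```r
# Hypothetical 2x2 counts (NOT taken from the heart disease data)
counts <- matrix(c(30, 10,
                   20, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"),
                                 outcome = c("no", "yes")))

# correct = FALSE turns off Yates' continuity correction so the
# statistic equals sum((observed - expected)^2 / expected)
res <- chisq.test(counts, correct = FALSE)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
stat_by_hand <- sum((counts - expected)^2 / expected)

res$statistic   # X-squared reported by chisq.test
stat_by_hand    # same value computed from the formula
res$p.value     # compare with the 0.05 significance level
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.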

# Does age have an effect? Age is continuous, so we use a t-test here
heart_age <- t.test(heart_cleveland_upload$age ~ heart_cleveland_upload$condition)

print(heart_age)

Here’s how to interpret the results of the t-test:

data: This tells us the data that was used in the two sample t-test. In this case, we used the variables called age and condition.

t: This is the t test-statistic. In this case, it is -4.0636.

df: This is the degrees of freedom associated with the t test-statistic. In this case, it is 294.66.

p-value: This is the p-value that corresponds to a t test-statistic of -4.0636 and df = 294.66. The p-value turns out to be 0.00006204.

alternative hypothesis: This tells us the alternative hypothesis used for this particular t-test. In this case, the alternative hypothesis is that the true difference in means between the two groups is not equal to zero.

95 percent confidence interval: This tells us the 95% confidence interval for the true difference in means between the two groups. It turns out to be [-6.108, -2.12].

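Because the p-value of our test (0.00006204) is less than alpha = 0.05, we reject the null hypothesis. We therefore have sufficient evidence that the mean age differs between group 0 (no disease) and group 1 (disease). Since the confidence interval for the difference (group 0 minus group 1) lies entirely below zero, patients with heart disease are older on average.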
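The non-integer degrees of freedom reported above arise because `t.test()` runs Welch's test by default, which does not assume equal variances in the two groups. The following sketch reproduces the t statistic and the Welch-Satterthwaite degrees of freedom by hand. It uses simulated values (the vectors `x` and `y` are assumptions, not the Cleveland data), so the numbers will not match those quoted in the text.

```r
set.seed(1)  # simulated values, not the heart disease data
x <- rnorm(40, mean = 50, sd = 8)
y <- rnorm(55, mean = 56, sd = 10)

# Squared standard error of each group mean (variances are not pooled)
vx <- var(x) / length(x)
vy <- var(y) / length(y)
t_by_hand <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df_by_hand <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

res <- t.test(x, y)   # var.equal = FALSE is the default
c(t = t_by_hand, df = df_by_hand)
res$statistic
res$parameter
```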

heart_thalach <- t.test(heart_cleveland_upload$thalach ~ heart_cleveland_upload$condition)

print(heart_thalach)

data: This tells us the data that was used in the two sample t-test. In this case, we used the variables called thalach and condition.

t: This is the t test-statistic. In this case, it is 7.9286.

df: This is the degrees of freedom associated with the t test-statistic. In this case, it is 266.44.

p-value: This is the p-value that corresponds to a t test-statistic of 7.9286 and df = 266.44. The p-value turns out to be 0.00000000000006108 (about 6.1e-14).

95 percent confidence interval: This tells us the 95% confidence interval for the true difference in means between the two groups. It turns out to be [14.636, 24.30715].

Because the p-value of our test (about 6.1e-14) is less than alpha = 0.05, we reject the null hypothesis. We therefore have sufficient evidence that the mean maximum heart rate differs between group 0 (no disease) and group 1 (disease). Since the confidence interval for the difference (group 0 minus group 1) lies entirely above zero, patients without heart disease reach a higher maximum heart rate on average.

Exploring the association graphically

# Recode condition to be labelled
heart_cleveland_upload <- heart_cleveland_upload %>%
  mutate(condition_labelled = ifelse(condition == 0, "No disease", "Disease"))

# age vs condition
ggplot(data = heart_cleveland_upload, aes(x = condition_labelled, y = age)) +
  geom_boxplot()

# sex vs condition
ggplot(data = heart_cleveland_upload, aes(x = condition_labelled, fill = sex)) +
  geom_bar(position = "fill") +
  ylab("Sex %")

Logistic Regression :

Logistic regression is a predictive modelling algorithm that is used when the Y variable is binary categorical. That is, it can take only two values like 1 or 0. The goal is to determine a mathematical equation that can be used to predict the probability of event 1. Once the equation is established, it can be used to predict the Y when only the Xs are known.

Logistic regression is a classic predictive modelling technique and still remains a popular choice for modelling binary categorical variables.

Another advantage of logistic regression is that it produces a predicted probability for the event, not just a class label.

Building the model and classifying Y is only part of the work. A range of evaluation metrics is available to judge how well a model performs, and choosing between models requires careful judgement. The metrics used here are discussed in the METRICS section.

In this project, I fit two models, compare them on metrics such as AUC and accuracy, and propose the better model based on these metrics.

Model 1 : Our goal in this model is to predict the condition of the patient from a set of input variables.

In Model 1, I use age, sex and thalach as input variables to predict the condition of the patient.

# use the glm function from base R and specify the family argument as binomial
model <- glm(condition ~ age + sex + thalach, data = heart_cleveland_upload, family = "binomial")

# extract the model summary
summary(model)

In the output above, the first thing we see is the call: this is R reminding us which model we ran, which options we specified, and so on.

Next we see the deviance residuals, which are a measure of model fit. This part of output shows the distribution of the deviance residuals for individual cases used in the model.

The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable, holding the other predictors constant.

For every one-year increase in age, the log odds of developing heart disease increase by 0.03. For male patients (sexmale = 1) compared with female patients, the log odds of developing heart disease are higher by 1.46.

Below the table of coefficients are fit indices, including the null and residual deviances and the AIC.
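Log odds are hard to read directly, so it is common to exponentiate the coefficients and interpret them as odds ratios. The sketch below does this for the two coefficients quoted in the text (0.03 for age and 1.46 for sexmale); the vector name `log_odds` is an assumption for illustration, and with a fitted model the same conversion would be `exp(coef(model))`.

```r
# Log-odds coefficients as quoted in the text for Model 1
log_odds <- c(age = 0.03, sexmale = 1.46)

# exp() converts a change in log odds into a multiplicative change in odds
odds_ratio <- exp(log_odds)
round(odds_ratio, 2)

# Percentage change in the odds per one-unit increase in the predictor
round((odds_ratio - 1) * 100, 1)
```

This gives odds ratios of roughly 1.03 for age and 4.31 for sexmale: each additional year of age multiplies the odds of heart disease by about 1.03, and the odds for male patients are a little over four times those for female patients with the same age and maximum heart rate.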

Prediction :

Let's say a patient is a 45-year-old female with a thalach of 150. We now predict that patient's probability of developing heart disease using Model 1.

# get the predicted probability in our dataset using the predict() function
# We include the argument type = "response" to get predicted probabilities rather than log odds.
pred_prob <- predict(model, heart_cleveland_upload, type = "response")

# create a decision rule using probability 0.5 as cutoff and save the predicted decision into the main data frame
heart_cleveland_upload$pred_condition <- ifelse(pred_prob >= 0.5, 1, 0)

# create a newdata data frame to save a new case information
newdata <- data.frame(age = 45, sex = "female", thalach = 150)

# predict probability for this new case and print out the predicted value
p_new <- predict(model, newdata, type = "response")

p_new

We see that this patient has about an 18% chance of developing heart disease.
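The predicted probability is the inverse logit of the linear predictor. The following sketch shows that computing it by hand gives the same answer as `predict(..., type = "response")`. Because the fitted intercept and thalach coefficient are not reported in the text, the sketch fits a model to simulated data; the data frame `sim` and the coefficients used to generate it are assumptions, so the resulting probability is not the 18% quoted above.

```r
set.seed(42)  # simulated data, not the Cleveland data
n <- 200
sim <- data.frame(
  age     = round(runif(n, 30, 75)),
  sex     = sample(c("female", "male"), n, replace = TRUE),
  thalach = round(runif(n, 90, 200))
)
# Hypothetical "true" coefficients, used only to generate an outcome
eta <- -1 + 0.03 * sim$age + 1.2 * (sim$sex == "male") - 0.015 * sim$thalach
sim$condition <- rbinom(n, size = 1, prob = plogis(eta))

fit <- glm(condition ~ age + sex + thalach, data = sim, family = "binomial")

new_case <- data.frame(age = 45, sex = "female", thalach = 150)

# Linear predictor by hand: intercept + sum(coefficient * value).
# sexmale is 0 for a female patient because "female" is the reference level.
b <- coef(fit)
lin_pred <- b[["(Intercept)"]] + b[["age"]] * 45 +
  b[["sexmale"]] * 0 + b[["thalach"]] * 150

p_by_hand <- plogis(lin_pred)   # same as 1 / (1 + exp(-lin_pred))
p_predict <- predict(fit, new_case, type = "response")
c(by_hand = p_by_hand, predict = unname(p_predict))
```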

METRICS :

Are the predictions accurate? How well does the model fit our data? We use some common metrics to evaluate model performance.

The most straightforward one is accuracy, the proportion of all predictions that were correct. The classification error rate is 1 - accuracy. However, accuracy can be misleading when the response is rare (i.e., an imbalanced response).

Another popular metric is the area under the ROC curve (AUC), which has the advantage of being independent of the proportion of responders. AUC ranges from 0 to 1; the closer it is to 1, the better the model performance.

Lastly, a confusion matrix is an N x N matrix, where N is the number of outcome levels. For the problem at hand N = 2, so we get a 2 x 2 matrix. It cross-tabulates the predicted outcome levels against the true outcome levels.
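As a rough illustration of what these metrics measure, the sketch below computes them in base R for eight made-up patients. The vectors `actual` and `prob` are hypothetical, and the AUC is computed from its pairwise-ranking interpretation rather than with the Metrics package.

```r
# Hypothetical labels and predicted probabilities for eight patients
actual <- c(0, 0, 0, 0, 1, 1, 1, 1)
prob   <- c(0.10, 0.40, 0.35, 0.80, 0.45, 0.60, 0.70, 0.90)

predicted <- ifelse(prob >= 0.5, 1, 0)   # 0.5 cutoff

# Confusion matrix: rows are true status, columns are predicted status
conf_mat <- table(actual, predicted, dnn = c("True Status", "Predicted Status"))

# Accuracy is the share of cases on the diagonal of the confusion matrix
accuracy_by_hand <- sum(diag(conf_mat)) / sum(conf_mat)
error_by_hand    <- 1 - accuracy_by_hand

# AUC as the probability that a randomly chosen diseased case receives a
# higher score than a randomly chosen non-diseased case (ties count 1/2)
pos <- prob[actual == 1]
neg <- prob[actual == 0]
cmp <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_by_hand <- mean(cmp)

conf_mat
c(accuracy = accuracy_by_hand, error = error_by_hand, auc = auc_by_hand)
```

In this toy example six of the eight cases are classified correctly (accuracy 0.75), and 13 of the 16 diseased/non-diseased pairs are ranked in the right order (AUC 0.8125). Because AUC depends on the ranking of scores, it is usually computed from predicted probabilities rather than from the 0/1 decisions.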

# load Metrics package
library(Metrics)

Warning: package 'Metrics' was built under R version 4.1.3

# calculate AUC, accuracy and classification error for Model 1
# (results are stored under new names so they do not mask the Metrics functions;
# AUC uses the predicted probabilities, accuracy and error use the 0/1 decisions)
auc_1 <- auc(heart_cleveland_upload$condition, pred_prob)
accuracy_1 <- accuracy(heart_cleveland_upload$condition, heart_cleveland_upload$pred_condition)
classification_error_1 <- ce(heart_cleveland_upload$condition, heart_cleveland_upload$pred_condition)

# print out the metrics on to screen
print(paste("AUC=", auc_1))
print(paste("Accuracy=", accuracy_1))
print(paste("Classification Error=", classification_error_1))

# confusion matrix
table(heart_cleveland_upload$condition, heart_cleveland_upload$pred_condition, dnn = c("True Status", "Predicted Status"))

For a 45-year-old female with a maximum heart rate of 150, our model gives a heart disease probability of 0.18, indicating low risk. Although the model has an overall accuracy of 0.71, some cases were misclassified, as the confusion matrix shows. One way to improve the current model is to include other relevant predictors from the dataset.

MODEL 2 : In Model 2, I use age, sex and chol as input variables to predict the condition of the patient.

model_2 <- glm(condition ~ age + sex + chol, data = heart_cleveland_upload, family = "binomial")

# extract the model summary
summary(model_2)

# get the predicted probability in our dataset using the predict() function
# We include the argument type = "response" to get predicted probabilities rather than log odds.
pred_prob_2 <- predict(model_2, heart_cleveland_upload, type = "response")

# create a decision rule using probability 0.5 as cutoff and save the predicted decision into the main data frame
heart_cleveland_upload$pred_condition_2 <- ifelse(pred_prob_2 >= 0.5, 1, 0)

# create a newdata data frame to save a new case information
newdata_2 <- data.frame(age = 45, sex = "female", chol = 150)

# predict probability for this new case and print out the predicted value
p_new_2 <- predict(model_2, newdata_2, type = "response")

p_new_2

For a 45-year-old female patient with a serum cholesterol of 150 mg/dl, Model 2 predicts about an 8.6% chance of developing heart disease, indicating low risk.

# load Metrics package
library(Metrics)

# calculate AUC, accuracy and classification error for Model 2
# (results are stored under new names so they do not mask the Metrics functions;
# AUC uses the predicted probabilities, accuracy and error use the 0/1 decisions)
auc_2 <- auc(heart_cleveland_upload$condition, pred_prob_2)
accuracy_2 <- accuracy(heart_cleveland_upload$condition, heart_cleveland_upload$pred_condition_2)
classification_error_2 <- ce(heart_cleveland_upload$condition, heart_cleveland_upload$pred_condition_2)

# print out the metrics on to screen
print(paste("AUC=", auc_2))
print(paste("Accuracy=", accuracy_2))
print(paste("Classification Error=", classification_error_2))

# confusion matrix
table(heart_cleveland_upload$condition, heart_cleveland_upload$pred_condition_2, dnn = c("True Status", "Predicted Status"))

We now compare Model 1 and Model 2 on the metrics above (AUC and accuracy).

Model 1 has an accuracy of 71% and Model 2 has an accuracy of 67%. Of these two models, we conclude that Model 1, with predictor variables age, sex and thalach, is the better model for predicting the likelihood of developing heart disease.
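A comparison like this can be wrapped in a small helper so that both models are scored in exactly the same way. The sketch below is one possible arrangement, not the code used for the results above: the function `evaluate_model()`, the simulated data frame `sim` and the coefficients used to generate it are all assumptions for illustration. It reports in-sample accuracy at a 0.5 cutoff together with the AIC, where a lower AIC indicates a better trade-off between fit and complexity.

```r
set.seed(7)  # simulated data; only the column names mirror the post
n <- 300
sim <- data.frame(
  age     = round(runif(n, 30, 75)),
  thalach = round(runif(n, 90, 200)),
  chol    = round(runif(n, 150, 350))
)
# Hypothetical data-generating model: thalach matters, chol does not
sim$condition <- rbinom(n, 1, plogis(3 + 0.03 * sim$age - 0.035 * sim$thalach))

# Hypothetical helper: fit a logistic model, then report in-sample accuracy and AIC
evaluate_model <- function(formula, data, cutoff = 0.5) {
  fit  <- glm(formula, data = data, family = "binomial")
  pred <- ifelse(predict(fit, data, type = "response") >= cutoff, 1, 0)
  c(accuracy = mean(pred == data$condition), AIC = AIC(fit))
}

comparison <- rbind(
  model_1 = evaluate_model(condition ~ age + thalach, sim),
  model_2 = evaluate_model(condition ~ age + chol, sim)
)
comparison
```

Both figures are computed on the same data used for fitting, so they are likely to be somewhat optimistic about performance on new patients.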