Final Project: Diabetes Prediction - Part 2

Author: Steph Roberts

Published: October 9, 2022

Code
library(tidyverse)
library(ggplot2)
library(dplyr)
library(Hmisc)
library(pscl)
library(magrittr)

knitr::opts_chunk$set(echo = TRUE)

Diabetes risk factors

According to the World Health Organization (WHO), an estimated 537 million people worldwide are living with diabetes. It is a leading cause of health complications and even death; the WHO states that close to 1.5 million people died due to diabetes and its complications in 2019 alone. It is a growing problem that requires dedicated research aimed at slowing and preventing future cases.

Research Questions

  1. What risk factors are most predictive of diabetes?

Research on diabetes is ongoing and in-depth within the medical field. The prevalence and incidence of diabetes mellitus type 2 (DM2) have increased consistently for decades, giving way to an increase in diabetes-related mortality. In the medical field, many risk factors are commonly used to measure a patient's risk of developing DM2, such as obesity, family history, hypertension, and changes in fasting blood sugar levels. Moreno et al. (2018) studied risk parameters for diabetes and concluded that the "risk of being diabetic rises in patients whose father has suffered an acute myocardial infarction, in those whose mother or father is diabetic and in patients with a high waist perimeter." Their focus on family history leaves room for studies more focused on individual medical factors, such as blood pressure, BMI, number of pregnancies, etc. That is the aim of this project.

Moreno et al. (2018, June 11). Predictive risk model for the diagnosis of diabetes mellitus type 2 in a follow-up study 15 years on: PRODI2 study. European Journal of Public Health. Retrieved October 9, 2022, from https://pubmed.ncbi.nlm.nih.gov/29897477/

  2. Can the use of regression analysis help predict risk of diabetes based on several medical variables?

Other research, such as the Edlitz & Segal (2022) study titled Prediction of type 2 diabetes mellitus onset using logistic regression-based scorecards, does focus on using individual medical factors to predict diabetes risk through regression. This project aims to conduct a similar analysis on different data.

Edlitz, Y., & Segal, E. (2022, June 22). Prediction of type 2 diabetes mellitus onset using logistic regression-based scorecards. eLife. Retrieved October 9, 2022, from https://pubmed.ncbi.nlm.nih.gov/35731045/

Hypothesis

Body mass index (BMI), glucose, and age are positive predictors of diabetes mellitus type 2.

Both hypotheses have been tested in the above-mentioned studies. The contribution from this project will be additional support for or against the hypotheses from the analysis of different data.

Descriptive Statistics

The data was collected by the National Institute of Diabetes and Digestive and Kidney Diseases as part of the Pima Indians Diabetes Database (PIDD). A total of 768 cases are available in PIDD. However, 5 patients had a glucose of 0, 11 patients had a body mass index of 0, 28 others had a diastolic blood pressure of 0, 192 others had skin fold thickness readings of 0, and 140 others had serum insulin levels of 0. Removing every case with a value that is incompatible with life would leave only 392 cases, so, as described below, we drop only the cases with 0s in our key explanatory variables, leaving 752. All patients are of Pima Indian heritage (a subgroup of Native Americans) and are females aged 21 years and above.

The dataset consists of 8 medical predictor (independent) variables and one target (dependent) variable, Outcome.

  • Pregnancies: number of times the woman has been pregnant
  • Glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: diastolic blood pressure (mm Hg)
  • SkinThickness: triceps skin fold thickness (mm)
  • Insulin: 2-hour serum insulin (mu U/ml)
  • BMI: body mass index (weight in kg / (height in m)^2)
  • Age: age (years)
  • DiabetesPedigreeFunction: scores the likelihood of diabetes based on family history
  • Outcome: 0 (does not have diabetes) or 1 (has diabetes)

Code
df <- read_csv("diabetes2.csv")
Code
dim(df)
Code
summary(df)
Code
head(df)
Code
#Check for missing (NA) entries
sum(is.na(df))
Code
#Check number of 0s in each column
colSums(df==0)

Glucose, blood pressure, skin thickness, insulin, BMI, and age are not variables that should logically have 0s; those values, if true, would be incompatible with life. To limit how much data we exclude, we will remove only the observations with 0s in our explanatory variables.

Code
#Remove rows with a 0 in glucose (column 2), BMI (column 6), or age (column 8)
df <- df[apply(df[c(2,6,8)], 1, function(z) !any(z == 0)),]
Code
#Verify 0s are gone in selected rows
colSums(df==0)
Code
#Check cleaned data frame
glimpse(df)
Code
#Summarize df
summary(df)

At a glance, this summary is more sensible after cleaning. An average of 3 pregnancies, considering our 21+ female population, makes sense. A mean glucose of 122, blood pressure of 70.66, BMI of 33, and age of 30.86 are reasonably representative of our population. The data is clean and ready for analysis.

Code
#Rename columns
colnames(df) <- c("pregnancies", "glucose", "bp", "skin_thickness", "insulin", "bmi", "dpf", "age", "diabetes")
Code
dim(df)

We have 752 observations and 9 variables with which to work.

Let’s get familiar with the data set by observing the scatterplot matrix and histograms for each variable.

Code
#Scatterplot matrix
pairs(df)
Code
#Histograms of each variable
hist.data.frame(df)

The scatterplot matrix shows that our variables generally lack linear relationships. Logistic regression may be best for analyzing this type of data.

The histograms show us the distribution of each variable. Pregnancies is right skewed, with most women having a small number of pregnancies and a few having over 10. Glucose appears to follow a close-to-normal distribution with a mean just over 100. BMI follows a similar pattern with a central value just over 30. Age is right skewed, with the majority of the sample in their 20s and the counts slowly tapering off into the 50s. This is a good place to start. The only variable a histogram does not suit is the outcome, due to its binary nature as a dummy variable.

Code
#Subset to pregnancies, glucose, bmi, age, and the outcome
bgap <- as.data.frame(df[, c(1,2,6,8,9)])
Code
pairs(bgap)
Code
#Plot outcome variable
ggplot(df, aes(x=factor(diabetes))) +
  geom_bar(fill = "#0073C2FF")+
  scale_x_discrete(labels=c('No Diabetes','Diabetes'))+
  xlab("Outcome")

About a third of our sample has diabetes.
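As a quick check on that proportion, a one-line tabulation (a sketch, assuming df is loaded and cleaned as above) shows the share of each outcome:

Code
#Proportion of cases without (0) and with (1) diabetes
prop.table(table(df$diabetes))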

Hypothesis Testing

Code
#Create training and test data
set.seed(1)
bgap %>% 
  nrow() %>% 
  multiply_by(0.7) %>% 
  round() -> training_set_size
Code
train_indices <- sample(1:nrow(bgap), training_set_size)
Code
train <- bgap[train_indices,]
Code
test <- bgap[-train_indices,]
Code
nrow(train)
Code
nrow(test)
Code
head(train)

As a reminder, pregnancies, age, and BMI are all right skewed, so using the logarithm of those variables is a natural refinement. The models below use the raw variables; the log-transformed versions are left as a next step (see the note at the end of this post).

Code
#Model using all independent variables
all_var <- glm(diabetes ~ ., data = train, family = binomial)
Code
#Model using BMI, glucose, and age
bga <- glm(diabetes ~ . -pregnancies, data = train, family = binomial)
Code
#Model using BMI and glucose
bg <- glm(diabetes ~ .-age -pregnancies, data = train, family = binomial)
Code
#Model using BMI, glucose, age, and an interaction between pregnancies and age.
ia <- glm(diabetes ~ glucose + bmi + age*pregnancies, data = train, family = binomial)
Code
#Model with BMI, glucose, and pregnancies (without age)
bgp <- glm(diabetes ~ . -age, data = train, family = binomial)

Our hypothesis involves the BMI, glucose, and age variables. First we can take a look at them individually.

Code
#A glm with the default gaussian family is a linear probability model here,
#so the slope is the change in the probability of diabetes
#Regression model on BMI
mbmi <- glm(diabetes ~ bmi, data = train)
Code
summary(mbmi)
Code
ggplot(data = train, aes(x =bmi, y = diabetes)) +
  geom_point() +
  geom_smooth(method = 'lm')

Interpretation: The residuals are relatively small. The coefficient estimate suggests that each 1-unit increase in BMI is associated with a 0.019 increase in the probability of diabetes. The standard errors are small, and the t-values are large relative to them and far from 0. All that, combined with a very small p-value (5.15e-11), indicates we can reject the null hypothesis and conclude there is a relationship between BMI and diabetes.

Code
#Regression model on glucose 
mgluc <- glm(diabetes ~ glucose, data = train)
Code
summary(mgluc)
Code
ggplot(data = train, aes(x = glucose, y = diabetes)) +
  geom_point() +
  geom_smooth(method = 'lm')

Interpretation: Interestingly, the values of this regression mirror those of the BMI regression. With a very small p-value (< 2e-16), we can see that the plasma glucose concentration from the 2-hour oral glucose tolerance test (OGTT) is related to our outcome variable, diabetes. This makes sense, as the OGTT is sometimes used as a diagnostic tool for diabetes.

The similarity in regression values between BMI and glucose sparked a curiosity. Let's see whether BMI and glucose are correlated with one another.

Code
cor.test(train$bmi, train$glucose, method = "pearson")

This shows a very small p-value of 8.477e-09, well under the significance level alpha = 0.05. However, the correlation coefficient suggests only a weak positive correlation, so the variables are only weakly related.

Let’s continue with our regression model hypothesis testing of individual variables.

Code
#Regression model on age
mage <- glm(diabetes ~ age, data = train)
Code
summary(mage)
Code
ggplot(data = train, aes(x = age, y = diabetes)) +
  geom_point() +
  geom_smooth(method = 'lm')

Interpretation: Like the previous variables, age has small residuals and standard errors. It shows a p-value of 1.12e-06, indicating a relationship between age and diabetes.

The hypothesis that BMI, glucose tolerance results, and age are positively associated with diabetes is supported. Next, we can investigate models that could help predict the outcome using these variables.

Model Comparisons

Code
#Summarize model including BMI, glucose, age, and pregnancies
summary(all_var)

We can see in the all-variable model that, collectively, the independent variables, notably pregnancies, glucose, and bmi, have low p-values, suggesting those variables have a relationship with the outcome. We can use AIC as an indicator of model fitness; this model's AIC is 517.97. Let's compare it with the following models.

Code
#Summarize BMI, glucose, and age model
summary(bga)
Code
ggplot(train, aes(x=age, y=bmi, color=glucose)) + geom_point() + facet_grid(~diabetes)

The model using BMI, age, and glucose as explanatory variables for diabetes has an AIC of 526.6, a worse fit than the all-variable model.

Code
#Summarize BMI and glucose model
summary(bg)
Code
ggplot(train, aes(x=glucose, y=bmi, color=diabetes)) + geom_point() + facet_grid(~diabetes)

This BMI-and-glucose-only model has a worse (higher) AIC, 536.8, than the two previous models. So far, the all-variable model appears to be the best fit.

Code
#Summarize model including interaction between age and pregnancies 
#This was chosen to see if the impact of age on diabetes is different depending on the number of pregnancies
summary(ia)
Code
ggplot(train, aes(x=age, y=bmi, color=pregnancies)) + geom_point() + facet_grid(~diabetes)

With an AIC of 510.41, this model appears to be the best-fitting of those we have compared. This indicates there is an interaction between age and pregnancies that has a relationship with the outcome variable.

Code
#Summarize the model without age
summary(bgp)
Code
ggplot(train, aes(x=pregnancies, y=bmi, color=glucose)) + geom_point() + facet_grid(~diabetes)

Removing age entirely increased the AIC of this model. So the interaction model seems to be the best fit. That is, the model including BMI and glucose along with an interaction between age and pregnancies has the best fit of all models tested.
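Since all five candidate models were fit above, their AICs can also be compared side by side in a single call (a sketch, assuming all of the fits above succeeded):

Code
#Compare AIC across the candidate models; lower is better
AIC(all_var, bga, bg, ia, bgp)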

We can use anova() to examine the analysis of deviance table for our regression model.

Code
#ANOVA test on interaction model
anova(ia, test = "Chisq")

All of our variables have small p-values and appear to be good predictors of the diabetes outcome, with the exception of age. However, age is needed for the interaction term.

Let’s check one more thing to ensure we have a good fit.

Code
#Check pseudo R2 for interaction model
pR2(ia)

With a McFadden R-squared of 0.27, this model can be considered a good fit; McFadden's guideline treats values between 0.2 and 0.4 as representing an excellent fit.
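If the pscl package is not available, McFadden's pseudo R-squared can also be computed directly from the fitted glm, since it equals one minus the ratio of residual to null deviance:

Code
#McFadden's pseudo R2 by hand: 1 - (residual deviance / null deviance)
1 - ia$deviance / ia$null.deviance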

Diagnostics

Code
#Apply model to test data
fitted.results <- predict(ia, test, type = 'response')
Code
fitted.results <- ifelse(fitted.results > 0.5,1,0)
Code
pred <- mean(fitted.results != test$diabetes)
Code
output <- cbind(test, fitted.results)
Code
print(paste('Accuracy',1-pred))
Code
head(output)

Testing the interaction model on the test data produces the correct outcome 79% of the time.
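Accuracy alone can hide where the model errs. A confusion matrix (a sketch, assuming the fitted.results and test objects above) breaks the errors down into false positives and false negatives:

Code
#Confusion matrix of predicted vs. actual outcomes on the test set
table(Predicted = fitted.results, Actual = test$diabetes)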

Code
#Diagnostic plot of model
plot(ia)

Residuals vs Fitted: The residuals fall into 2 distinct bands, one for each outcome class, which is typical when the outcome is binary. One band starts close to 0 and tapers off along a negative slope; the other starts large and tapers along a negative slope toward 0.

Normal Q-Q: The Q-Q plot, which indicates the level of asymmetry, tells us our residuals are not normally distributed. The distribution appears to have a right tail (see the points drifting from the line at the top), which aligns with our individual variable histogram interpretations; a model combining right-skewed variables may itself be skewed to the right. However, points also fall away from the line at the bottom left, suggesting a left tail as well. This distribution seems to have "fat tails."

Scale-Location: This plots the fitted values of the model along the x-axis and the square root of the standardized residuals along the y-axis. The red line does travel roughly horizontally across the plot. However, there is a clear X pattern among the residuals, which suggests our model has heteroscedasticity. That is not ideal.

Residuals vs Leverage: The points on this graph all fall within Cook’s distance, indicating there are not any influential points in our regression model.
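One caveat: lm-style diagnostic plots are hard to read for a binary outcome. A binned residual plot, sketched below assuming the arm package is installed, is often a more informative check for a logistic model:

Code
#Binned residual plot: average residuals within bins of fitted probability
#Most points should fall inside the gray 95% bounds
arm::binnedplot(fitted(ia), residuals(ia, type = "response"))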

  • Plan: fix the heteroscedasticity by log transforming BMI, pregnancies, and age when those variables are included in models, since they are right skewed. However, it is taking time to figure out how to run the multiple logistic regression on transformed variables without errors; one likely fix is sketched below.
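A probable source of those errors is that pregnancies can be 0, so log(pregnancies) returns -Inf and breaks the fit; log1p(), which computes log(1 + x), avoids this. A minimal sketch of the planned log-transformed model, assuming the train data frame from above:

Code
#Log-transform the right-skewed predictors inside the formula
#log1p() handles pregnancies = 0, where log() would return -Inf
ia_log <- glm(diabetes ~ glucose + log(bmi) + log(age) * log1p(pregnancies),
              data = train, family = binomial)
summary(ia_log)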