Final Project

finalpart1

Template of course blog qmd file

Author

Xiaoyan

Published

May 17, 2023

Code

library(tidyr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Code

library(readxl)
library(ggplot2)
library(MASS)


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

Code

library(reshape2)


Attaching package: 'reshape2'

The following object is masked from 'package:tidyr':

    smiths

Introduction and background

The one-child policy, implemented by the Chinese government in 1979, increased the number of families with only one child and produced a distinctive family structure known as the “four-two-one” model: four grandparents, two parents, and one child. Although this structure offers advantages in family and social resources, children without siblings, commonly called “only children,” may face physical and socio-psychological challenges during their development. One notable concern is an increased risk of overweight and obesity: only children are more likely to struggle with weight-related issues than peers who have one or more siblings. The psychosocial consequences of being an only child are also worth investigating. It is therefore important to explore not only the relationship between overweight/obesity and mental health in young adolescents, but also how the presence or absence of siblings and other factors play into this relationship. Investigating the link between overweight/obesity, mental health, and sibship size in young adolescents in the context of the one-child policy can shed light on the challenges faced by only children and contribute to a better understanding of their overall well-being.

research questions

Is obesity positively related to depression?
What factors affect obesity?
Is the number of siblings or obesity directly related to depression?

key predictors

depression rate
sibling number
obesity rate
Family location, finance and education

hypothesis

A higher obesity rate increases the risk of depression
Higher family income increases the rate of obesity
More siblings reduce the risk of depression

data description

overview of the data

Code

data<-read_excel("/Users/cassie199/Desktop/23spring/603_Spring_2023-1/posts/_data/mentalhealth_data.xlsx")
head(data)

# A tibble: 6 × 29
  T0depres…¹ T0anx…² T1dep…³ T1anx…⁴ Height Weight    WC    HC   SBP   DBP   FBG
       <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1         31      35      41      35   153.   34.6    58  67      98    60   4.4
2         35      24      35      25   172.   46.1    63  78     110    70   3.9
3         31      34      37      26   146.   38.9    72  77.7   102    62   4.6
4         27      31      42      35   162.   46.8    62  80     116    80   4.5
5         31      26      49      33   154.   36.4    56  72      90    60   4.2
6         30      28      47      32   164.   40.6    55  73     102    70   3.7
# … with 18 more variables: TC <dbl>, TG <dbl>, `HDL-C` <dbl>, `LDL-C` <dbl>,
#   BMI <dbl>, WHR <dbl>, WtHR <dbl>, `Family location` <dbl>,
#   `Number of siblings` <dbl>,
#   `How much time do you spend with your father in elementary school?` <dbl>,
#   `How much time do you spend with your mother in elementary school?` <dbl>,
#   `Father’s education level` <dbl>, `Mother’s education level` <dbl>,
#   `Family financial situation` <dbl>, `Sleeping hours` <dbl>, …

Code

sum(is.na(data))

[1] 728

Code

plot(data$T0depression~data$BMI)

This dataset contains 1,348 observations (rows) and 29 variables (columns), with 728 missing values (NA) in total. All variables are stored as numeric data; ordinal measures such as education level and family financial situation are coded as numeric levels, and depression is recorded as a numeric score. A preliminary plot of the baseline depression score (T0depression) against BMI shows some outliers that may need to be handled and no clear pattern. More data processing is needed in later steps.
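Since the total NA count does not show where the gaps are, a per-column breakdown can help decide which variables need attention. The following is a minimal sketch on a small hypothetical data frame (`toy` and its values are made up for illustration and are not the survey data):

```r
# Toy stand-in for the survey data (hypothetical values, not the real file)
toy <- data.frame(
  BMI = c(18.2, NA, 24.1, 27.5),
  SB  = c(1, 2, NA, NA),
  NS  = c(1, 2, 2, 1)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

The same two calls could be applied to the real `data` object to see which columns account for the missing values.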

Modified column name

Code

variables <- c("Family location", "Number of siblings", " time  spend with father in elementary school?", 
          " time spend with mother in elementary school?", "Father’s education level", 
          "Mother’s education level", "Family financial situation", "Sleeping hours", "Skipping breakfast", 
          "Vigorous", "Moderate")
abreviations <- c("FL", "NS", "TFE", "TME", "FEL", "MEL", "FS", "SL", "SB", "VG", "MD")


cat("Variable table\n")

Variable table

Code

variable_table <- data.frame(variables, abreviations)
variable_table

                                        variables abreviations
1                                 Family location           FL
2                              Number of siblings           NS
3   time  spend with father in elementary school?          TFE
4    time spend with mother in elementary school?          TME
5                        Father’s education level          FEL
6                        Mother’s education level          MEL
7                      Family financial situation           FS
8                                  Sleeping hours           SL
9                              Skipping breakfast           SB
10                                       Vigorous           VG
11                                       Moderate           MD

Code

colnames(data)<-c("T0depression","T0anxiety","T1depression","T1anxiety","Height","Weight","WC","HC","SBP","DBP","FBG","TC","TG","HDL-C","LDL-C","BMI","WHR","WtHR","FL", "NS", "TFE", "TME", "FEL", "MEL", "FS", "SL", "SB","Vigorous","Moderate")

parameter explanation

BMI (body mass index) is used in this study as the indicator of obesity. Following NIH cut-offs, BMI values are grouped here into three levels (the overweight level includes obesity), as shown in the table below.

Code

# Create the data frame for BMI categories
bmi_levels <- c("Underweight", "Normal Weight", "Overweight")
bmi_values <- c("<18.5", "18.5-24.9", ">=25")
bmi_table <- data.frame(Category = bmi_levels, BMI = bmi_values)


# Print the BMI category table
cat("\nBMI Categories\n")


BMI Categories

Code

print(bmi_table)

       Category       BMI
1   Underweight     <18.5
2 Normal Weight 18.5-24.9
3    Overweight      >=25
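The cut-offs in this table can be applied to numeric BMI values with `cut()`. The sketch below uses a few hypothetical BMI values; setting `right = FALSE` and using 25 as the upper break are assumptions about how boundary values (exactly 18.5, or values between 24.9 and 25) should be assigned, chosen so that the result matches the "<18.5" and ">=25" rows of the table.

```r
# Hypothetical BMI values, including values on and near the boundaries
bmi <- c(17.9, 18.5, 24.9, 24.95, 25, 31.2)

# right = FALSE makes intervals left-closed: [-Inf, 18.5), [18.5, 25), [25, Inf)
bmi_cat <- cut(bmi,
               breaks = c(-Inf, 18.5, 25, Inf),
               labels = c("Underweight", "Normal Weight", "Overweight"),
               right = FALSE)

print(data.frame(bmi, bmi_cat))
```

With the default `right = TRUE`, a BMI of exactly 18.5 would instead fall in the lower category, so the choice matters for values that sit on a boundary.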

exploratory data analysis

Some key predictors were plotted. The first chart shows the distribution of family location and the second chart shows the distribution of family financial situation.

Code

# Pie chart of family location: each slice is the share of respondents at that level
fl_data <- data[!is.na(data$FL), ]
ggplot(fl_data, aes(x = "", fill = factor(FL))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(fill = "Family location") +
  theme_void()

Code

# Pie chart of family financial situation, built the same way
fs_data <- data[!is.na(data$FS), ]
ggplot(fs_data, aes(x = "", fill = factor(FS))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(fill = "Financial situation") +
  theme_void()

A jittered scatter plot was used to visualize the relationship between skipping breakfast and BMI. From the scatter plot alone it is difficult to see a relationship between the two variables, so further analysis is needed.

Code

ggplot(data, aes(x = SB, y = BMI)) +
geom_jitter(width = 0.2, height = 0, color = "indianred", alpha = 0.5) +
  xlab("skipping breakfast")

Warning: Removed 36 rows containing missing values (`geom_point()`).

hypothesis test

1. A higher obesity rate increases the risk of depression

H0: there is no relationship between obesity rate and the risk of depression

Ha: a higher obesity rate increases the risk of depression

To test this hypothesis, a linear model was fitted to estimate the relationship between depression score and BMI.

Code

#linear regression of depression on BMI
lm0<-lm(T1depression ~ BMI, data = data)
summary(lm0)


Call:
lm(formula = T1depression ~ BMI, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.2371  -6.2000   0.1719   6.2964  21.9845 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.75036    1.50580  27.062   <2e-16 ***
BMI         -0.09528    0.07803  -1.221    0.222    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.093 on 1308 degrees of freedom
  (38 observations deleted due to missingness)
Multiple R-squared:  0.001139,  Adjusted R-squared:  0.0003749 
F-statistic: 1.491 on 1 and 1308 DF,  p-value: 0.2223

The BMI coefficient (-0.09528) is the estimated change in depression score for a one-unit increase in BMI: each additional BMI unit is associated with a 0.09528-point lower depression score. With a p-value of 0.222, the coefficient is not statistically significant, so there is no strong evidence of a linear relationship between BMI and depression. The residuals range from -19.2371 to 21.9845, and the residual standard error (8.093) is large relative to this effect, which means BMI leaves most of the variation in depression scores unexplained. The F-statistic tests the overall significance of the model; its p-value of 0.2223 shows that the model as a whole is not statistically significant. The multiple R-squared (0.001139) is the proportion of variance in depression score explained by the model: only 0.1139% of the variability in depression can be attributed to the linear relationship with BMI. The adjusted R-squared (0.0003749) corrects the multiple R-squared for the number of predictors, penalizing unnecessary ones. A value this close to zero indicates that the model fits the data poorly.

In summary, based on the provided output, there is no strong evidence to support a linear relationship between BMI and depression. The coefficient for BMI is not statistically significant, and the model’s overall fit is weak (low R-squared values and non-significant F-statistic).
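A confidence interval makes the same point as the p-value in a more readable form. The following sketch recomputes the t-statistic, p-value, and a 95% interval for the BMI coefficient from the estimate, standard error, and residual degrees of freedom printed in the summary above; the variable names are illustrative only.

```r
# Values copied from the summary(lm0) output above
estimate  <- -0.09528   # BMI coefficient
std_error <- 0.07803    # its standard error
df_resid  <- 1308       # residual degrees of freedom

# t-statistic and two-sided p-value
t_value <- estimate / std_error
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# 95% confidence interval for the coefficient
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(round(c(t = t_value, p = p_value), 4))
print(round(ci, 4))
```

Because the interval spans zero, the data are compatible with no linear effect of BMI as well as with small effects in either direction.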

Code

#diagnostic
par(mfrow = c(2,2))
plot(lm0)

Linear regression diagnostic plots were used to check the model assumptions. In the residuals vs. fitted plot, the red line is a smoothed trend of the residuals; it stays close to the horizontal zero line and the residuals are spread evenly above and below it, which suggests that a linear form is adequate. In the normal Q-Q plot, the points follow a straight line, which supports the assumption of normally distributed residuals. The scale-location plot detects systematic patterns in the spread (variance) of the residuals; a roughly flat red line indicates homoscedasticity, meaning that the residuals have constant variance across fitted values. Here, the plot suggests that the homoscedasticity assumption is met. According to these diagnostics, the model assumptions are reasonably satisfied, even though the model explains very little of the variation in depression.

Code

#visualization
ggplot(data, aes(x = BMI, y = T1depression)) +
  geom_point(color = "indianred") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 38 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 38 rows containing missing values (`geom_point()`).

Code

# Compare predicted values with observed values.
# lm() drops rows with missing values, so the observed values are taken from
# the model frame to keep them aligned with the predictions.
plot_data <- data.frame(Predicted_value = predict(lm0),
                        Observed_value = model.frame(lm0)$T1depression)
ggplot(plot_data, aes(x = Predicted_value, y = Observed_value)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "green")

In addition to BMI, several other variables were included in the analysis to examine their relationship with children’s depression score. Two of them, time spent with father in elementary school (TFE) and frequency of skipping breakfast (SB), were significantly associated with depression score.

Less time spent with the father in elementary school was associated with a higher depression score, which points to the importance of father-child interaction and involvement during this period of development. Skipping breakfast more frequently was also linked to a higher depression score. The diagnostic plots show no substantial deviation from the model assumptions, which suggests that the linear regression provides a reasonable basis for interpreting these results.

Code

#linear regression model 2
lm1<-lm(T1depression ~ BMI+NS+TFE+TME+FEL+MEL+FL+SL+SB, data = data)
summary(lm1)


Call:
lm(formula = T1depression ~ BMI + NS + TFE + TME + FEL + MEL + 
    FL + SL + SB, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.3852  -5.9985   0.1987   6.4197  21.1941 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.55778    2.88686  14.742  < 2e-16 ***
BMI         -0.10358    0.07798  -1.328   0.1843    
NS           0.72821    0.51703   1.408   0.1592    
TFE         -0.57069    0.24271  -2.351   0.0189 *  
TME         -0.18783    0.30401  -0.618   0.5368    
FEL         -0.28747    0.24332  -1.181   0.2376    
MEL         -0.22372    0.26894  -0.832   0.4056    
FL           0.09777    0.18801   0.520   0.6031    
SL           0.10706    0.38873   0.275   0.7831    
SB           1.47402    0.31288   4.711 2.73e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.967 on 1300 degrees of freedom
  (38 observations deleted due to missingness)
Multiple R-squared:  0.0379,    Adjusted R-squared:  0.03124 
F-statistic: 5.691 on 9 and 1300 DF,  p-value: 9.18e-08

Code

#diagnostic
par(mfrow = c(2,2))
plot(lm1)

To check the robustness of the model, variables with large p-values were removed by backward elimination. Compared with the previous model, time spent with mother in elementary school (TME), mother’s education level (MEL), family location (FL), and sleeping hours (SL) were dropped. In the reduced model, time spent with father in elementary school and skipping breakfast remain statistically significant. The adjusted R-squared values of the two models (0.03124 and 0.03312) differ very little.
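The adjusted R-squared values being compared here follow from the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). As a check, the sketch below recomputes both from the multiple R-squared values and degrees of freedom printed in the two model summaries (n = 1310 rows used, with 9 and 5 predictors); `adj_r2` is a helper defined only for this illustration.

```r
# Adjusted R-squared from multiple R-squared, sample size n, and p predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values taken from the printed summaries of the full and reduced models
full    <- adj_r2(r2 = 0.0379,  n = 1310, p = 9)
reduced <- adj_r2(r2 = 0.03681, n = 1310, p = 5)

print(round(c(full = full, reduced = reduced), 5))
```

The reduced model has a slightly lower multiple R-squared but a slightly higher adjusted R-squared, because the penalty for its four dropped predictors outweighs the small loss in explained variance.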

Code

#linear regression model 3
lm2<-lm(T1depression ~ BMI+NS+TFE+FEL+SB, data = data)
summary(lm2)


Call:
lm(formula = T1depression ~ BMI + NS + TFE + FEL + SB, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.9398  -6.0354   0.2289   6.3596  20.8288 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 41.91051    2.17737  19.248  < 2e-16 ***
BMI         -0.10798    0.07716  -1.399  0.16193    
NS           0.73126    0.45738   1.599  0.11011    
TFE         -0.64654    0.20647  -3.131  0.00178 ** 
FEL         -0.33519    0.23154  -1.448  0.14795    
SB           1.51314    0.31049   4.873 1.23e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.959 on 1304 degrees of freedom
  (38 observations deleted due to missingness)
Multiple R-squared:  0.03681,   Adjusted R-squared:  0.03312 
F-statistic: 9.967 on 5 and 1304 DF,  p-value: 2.26e-09

Code

#diagnostic
par(mfrow = c(2,2))
plot(lm2)
#compare predicted values with observed values
lm3<-lm(T1depression ~ TFE, data = data)

# Observed values come from the model frame so that they stay aligned with
# the predictions after lm() drops rows with missing values
plot_data <- data.frame(Predicted_value = predict(lm3),
                        Observed_value = model.frame(lm3)$T1depression)
ggplot(plot_data, aes(x = Predicted_value, y = Observed_value)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "green")

Predicted values were plotted against observed values, with the green line marking perfect agreement. The regression line for BMI slopes slightly downward, but the slope is not statistically significant. In summary, there is no strong evidence of a linear relationship between BMI and depression: the BMI coefficient is not statistically significant in any of the models, and the BMI-only model’s overall fit is weak (low R-squared and a non-significant F-statistic). Therefore, we fail to reject the null hypothesis.

2. Higher family income increases the rate of obesity

H0: there is no relationship between family income and the rate of obesity

Ha: higher family income increases the rate of obesity

With this data we can also explore which factors may affect obesity. Here we hypothesize that higher family income increases the rate of obesity. Because most of the variables are ordinal, ordinal logistic regression is used in this section to test the hypothesis.

Code

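# Convert BMI to an ordinal variable.
# Note: cut() uses right-closed intervals by default, so a BMI of exactly 18.5
# is labelled "Underweight" and any value above 24.9 is labelled "Overweight".
data$BMI_category <- cut(data$BMI, 
                       breaks = c(-Inf, 18.5, 24.9, Inf),
                       labels = c("Underweight", "Normal weight", "Overweight"))
# Integer-coded version of the category (1, 2, 3) stored as a factor
data$BMI_rank <- as.factor(unclass(data$BMI_category))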

# Visualizing data
# Filter out rows with NA values in BMI_rank or FS
filtered_data <- data[complete.cases(data$BMI_rank, data$FS), ]

# Create the plot with filtered data
ggplot(filtered_data, aes(x = BMI_category, y = FS)) +
  geom_boxplot(size = 0.75, color = "indianred") +
  geom_jitter(alpha = 0.5, color = "red") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))+
  ylab("Financial Situation")

Code

# Fit BMI rank against family financial situation with an ordinal logit model
model <- polr(BMI_rank ~ FS, data = data, Hess=TRUE)
summary(model)

Call:
polr(formula = BMI_rank ~ FS, data = data, Hess = TRUE)

Coefficients:
     Value Std. Error t value
FS -0.1883    0.07865  -2.394

Intercepts:
    Value   Std. Error t value
1|2 -0.7067  0.2554    -2.7672
2|3  2.6195  0.2828     9.2639

Residual Deviance: 2173.226 
AIC: 2179.226 
(36 observations deleted due to missingness)

Code

#p value
ctable <- coef(summary(model))
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
ctable <- cbind(ctable, "p value" = p)
ctable

         Value Std. Error   t value      p value
FS  -0.1883107  0.0786533 -2.394187 1.665724e-02
1|2 -0.7067002  0.2553870 -2.767174 5.654453e-03
2|3  2.6194751  0.2827624  9.263873 1.971474e-20

Code

# Getting odds-ratio
exp(coef(model))

       FS 
0.8283573

The results indicate that family financial situation (FS) is significantly associated with BMI rank (p = 0.017). The coefficient of -0.1883 corresponds to an odds ratio of 0.83: each one-unit increase in FS multiplies the odds of being in a higher BMI category by about 0.83, so a higher FS score is linked to a lower likelihood of being in a higher BMI rank. The intercepts (-0.7067 for 1|2 and 2.6195 for 2|3) are not effects of FS; they are the estimated cut-points on the latent scale that separate underweight from normal weight and normal weight from overweight. Because the association is statistically significant at the 0.05 level, the null hypothesis of no relationship is rejected. However, the direction of the effect is negative, which is the opposite of the alternative hypothesis that higher family income increases obesity. Adding sleeping hours (SL) and skipping breakfast (SB) to the model lowers the AIC from 2179.226 to 2146.853, which indicates a better fit; in that larger model FS is no longer significant at the 0.05 level (p = 0.095). These findings suggest that financial situation is related to BMI rank, but that the relationship partly reflects other lifestyle factors.
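The odds ratio and its uncertainty can be worked out directly from the FS coefficient and standard error printed above. The sketch below computes a Wald-type 95% interval; this is a normal approximation chosen for illustration, not necessarily the interval a profile-likelihood method would give, and the variable names are hypothetical.

```r
# Values copied from the coefficient table of the FS-only model above
est <- -0.1883107   # FS coefficient on the log-odds scale
se  <- 0.0786533    # its standard error

# Odds ratio for a one-unit increase in FS
odds_ratio <- exp(est)

# Wald 95% confidence interval, computed on the log-odds scale then exponentiated
ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)

# Two-sided p-value from the normal approximation
p_value <- 2 * pnorm(abs(est / se), lower.tail = FALSE)

print(round(c(OR = odds_ratio, lower = ci[1], upper = ci[2], p = p_value), 4))
```

An interval that lies entirely below 1 is consistent with the significant negative coefficient: higher FS scores go with lower odds of being in a higher BMI category.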

Code

#introduce more variables to compare

LR1<-polr(formula = BMI_rank~SL+SB+FS, data = data, Hess = TRUE, method = "logistic")
SUM1<-summary(LR1)
SUM1

Call:
polr(formula = BMI_rank ~ SL + SB + FS, data = data, Hess = TRUE, 
    method = "logistic")

Coefficients:
     Value Std. Error t value
SL -0.5179    0.09863  -5.250
SB  0.2064    0.07715   2.676
FS -0.1336    0.08003  -1.669

Intercepts:
    Value   Std. Error t value
1|2 -1.3558  0.3381    -4.0102
2|3  2.0251  0.3551     5.7024

Residual Deviance: 2136.853 
AIC: 2146.853 
(36 observations deleted due to missingness)

Code

#p value
ctable2 <- coef(summary(LR1))
p <- pnorm(abs(ctable2[, "t value"]), lower.tail = FALSE) * 2
ctable2 <- cbind(ctable2, "p value" = p)
ctable2

         Value Std. Error   t value      p value
SL  -0.5178720 0.09863452 -5.250414 1.517580e-07
SB   0.2064347 0.07715030  2.675748 7.456279e-03
FS  -0.1336130 0.08003340 -1.669465 9.502524e-02
1|2 -1.3557600 0.33807702 -4.010210 6.066466e-05
2|3  2.0251068 0.35513075  5.702426 1.181141e-08

Code

coef(SUM1)

         Value Std. Error   t value
SL  -0.5178720 0.09863452 -5.250414
SB   0.2064347 0.07715030  2.675748
FS  -0.1336130 0.08003340 -1.669465
1|2 -1.3557600 0.33807702 -4.010210
2|3  2.0251068 0.35513075  5.702426

Code

# Odds ratios: exponentiate the coefficients only
# (exponentiating standard errors and t values is not meaningful)
exp(coef(LR1))

       SL        SB        FS 
0.5957870 1.2292875 0.8749286 

Code

### Predict probability
# Create a data frame with every combination of the predictor values
newdat <- expand.grid(FS = 1:5, SL = 1:4, SB = 1:4)

# Get the predicted probability of each BMI rank
newdat <- cbind(newdat, predict(LR1, newdat, type = "probs"))

# Reshape to long format: one row per predictor combination and BMI rank
lnewdat <- melt(newdat, id.vars = c("FS", "SL", "SB"),
                variable.name = "Level", value.name="Probability")

# Visualizing probability against sleeping hours. BMI is the response
# (as BMI_rank), not a predictor, so it is not used as an axis here.
ggplot(lnewdat, aes(x = SL, y = Probability, colour = Level)) +
  geom_line() + facet_grid(FS ~ SB, labeller="label_both")

Code

#plot
ggplot(data, aes(x = SL, y = BMI)) +
  geom_point(color = "red4") +
  geom_smooth(method = "lm", se = FALSE,color="indianred")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 36 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 36 rows containing missing values (`geom_point()`).

3. More siblings reduce the risk of depression

H0: there is no relationship between number of siblings and depression score

Ha: more siblings reduce the risk of depression

The number of siblings is coded as 1 (only child) and 2 (has siblings). Therefore, a Welch two-sample t-test and a correlation test were performed to explore the relationship.

Code

ggplot(data=subset(data,!is.na(NS)), aes(x = factor(NS), y = T1depression)) +
  geom_boxplot(color = "red4")+
  geom_jitter(color = "tomato1")+
  xlab("Number of sibling")

Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

Warning: Removed 2 rows containing missing values (`geom_point()`).

Code

var1<-as.numeric(data$T1depression)
var2<-as.numeric(data$NS)

The analysis revealed a weak positive correlation between number of siblings and depression score, with a sample correlation coefficient (rho) of 0.0716. This means that adolescents with siblings tend to have slightly higher depression scores. The correlation is statistically significant (p = 0.0085), so it is unlikely to be due to chance alone, although its magnitude is small. A Welch t-test likewise showed that the group with siblings had a significantly higher mean depression score. The null hypothesis of no relationship is therefore rejected, but the direction is the opposite of the alternative hypothesis: in this sample, having siblings is associated with higher, not lower, depression scores. Because this is an observational association, it does not show that having siblings causes higher depression.
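The calls that produced these statistics are not shown in the post. As a rough sketch of how a Welch t-test and a rank correlation of this kind could be run, the example below uses simulated data; the data frame `toy`, its group sizes, and the effect size are invented for illustration, and the choice of Spearman's method is an assumption based on the reported "rho".

```r
set.seed(42)

# Simulated stand-in: NS is 1 (only child) or 2 (has siblings)
toy <- data.frame(NS = rep(c(1, 2), each = 50))
toy$T1depression <- 38 + 1.5 * (toy$NS == 2) + rnorm(100, sd = 8)

# Welch two-sample t-test (t.test() uses var.equal = FALSE by default)
welch <- t.test(T1depression ~ factor(NS), data = toy)
print(welch)

# Spearman rank correlation; exact = FALSE because NS has many ties
spearman <- cor.test(toy$T1depression, toy$NS,
                     method = "spearman", exact = FALSE)
print(spearman)
```

With a two-level grouping variable, the two tests address the same question from different angles, so they usually agree on the direction of the difference.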

Conclusion

In this study, we investigated the relationships among depression, obesity, family financial situation, and number of siblings in young adolescents.
First, we examined the association between depression and obesity using several linear regression models. BMI was not significantly related to depression score in any of them. Next, we examined the effect of family financial situation on BMI category using ordinal logistic regression. A higher financial situation score was associated with lower odds of being in a higher BMI category, the opposite of our hypothesis, and the association was no longer significant once sleeping hours and skipping breakfast were included. Finally, we tested the association between depression and number of siblings using a Welch t-test and a correlation test. Contrary to our hypothesis, adolescents with siblings had significantly higher depression scores than only children, although the correlation was weak. Overall, this study contributes to our understanding of the factors influencing the mental and physical well-being of young adolescents. The findings suggest that family financial situation and daily habits such as sleep and breakfast are related to weight status, that time spent with the father and skipping breakfast are related to depression, and that the presence of siblings does not show a protective effect against depression in this sample. These results emphasize the importance of considering familial and social factors in addressing the mental and physical health of young individuals. Further research and interventions can build on these findings to develop strategies for promoting healthier outcomes in this population.