Code
library(tidyr)
library(dplyr)
library(readxl)
library(ggplot2)
Introduction and background
The Chinese government implemented the one-child policy in 1979, which increased the proportion of one-child families and produced the “four-two-one” family structure of four grandparents, two parents, and one child. Although only children benefit from relatively more family and social resources, they may face physical and socio-psychological problems during development, including an elevated risk of overweight and obesity and negative psychosocial consequences. Previous studies have shown that only children are more likely to be overweight or obese than children with one or more siblings. Beyond obesity itself, it is also worth exploring how mental health in young adolescents relates to overweight/obesity and to sibship size.
Research questions
1. Is obesity positively related to mental health problems?
2. Which factors affect mental health?
3. Is the number of siblings or obesity directly related to mental health?
Key predictors
1. Mental health
2. Number of siblings
3. Obesity rate
4. Family location, finances, and education
Hypotheses
1. A higher obesity rate increases the risk of depression.
2. Higher family income increases the rate of obesity.
3. More siblings reduce the risk of both depression and anxiety.
In these hypotheses, the response variables are the depression score, the anxiety score, and BMI. The explanatory variables are drawn from the key predictors. Further analysis is needed to identify the control variables. For example, in hypothesis 2, family income is the explanatory variable and obesity (BMI) is the response variable; family financial situation may also serve as a control variable.
# A tibble: 6 × 29
T0depres…¹ T0anx…² T1dep…³ T1anx…⁴ Height Weight WC HC SBP DBP FBG
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 31 35 41 35 153. 34.6 58 67 98 60 4.4
2 35 24 35 25 172. 46.1 63 78 110 70 3.9
3 31 34 37 26 146. 38.9 72 77.7 102 62 4.6
4 27 31 42 35 162. 46.8 62 80 116 80 4.5
5 31 26 49 33 154. 36.4 56 72 90 60 4.2
6 30 28 47 32 164. 40.6 55 73 102 70 3.7
# … with 18 more variables: TC <dbl>, TG <dbl>, `HDL-C` <dbl>, `LDL-C` <dbl>,
# BMI <dbl>, WHR <dbl>, WtHR <dbl>, `Family location` <dbl>,
# `Number of siblings` <dbl>,
# `How much time do you spend with your father in elementary school?` <dbl>,
# `How much time do you spend with your mother in elementary school?` <dbl>,
# `Father’s education level` <dbl>, `Mother’s education level` <dbl>,
# `Family financial situation` <dbl>, `Sleeping hours` <dbl>, …
Code
sum(is.na(data))
[1] 728
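The single total above does not show where the missing values sit. A minimal sketch of a per-column breakdown, using a made-up data frame `toy` (an assumption for illustration, not the study data):

```r
# Toy data frame standing in for the real data (values are made up)
toy <- data.frame(
  BMI = c(18.2, NA, 22.5, 30.1, NA),
  NS  = c(1, 2, NA, 1, 2),
  FS  = c(3, 3, 4, 2, 3)
)

total_na   <- sum(is.na(toy))            # total number of missing cells
na_by_col  <- colSums(is.na(toy))        # missing cells per column
n_complete <- sum(complete.cases(toy))   # rows with no missing value

print(total_na)
print(na_by_col)
print(n_complete)
```

Applying `colSums(is.na(data))` to the real data would show which variables account for the missing values, which helps when deciding whether rows can be dropped safely.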
Code
plot(data$T0depression~data$BMI)
This dataset includes 1348 observations (rows) and 29 variables (columns), with 728 missing values (NA) in total. All variables are stored as numeric data; descriptive variables such as education level, family financial situation, and depression level are coded as ordinal grades. A preliminary plot of depression score against BMI shows some outliers that may need to be handled and no clear pattern. More data processing is needed in later steps.
Hypothesis test
1. A higher obesity rate increases the risk of depression
Warning in cor.test.default(data$T1depression, data$BMI, method =
c("spearman")): Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: data$T1depression and data$BMI
S = 384922674, p-value = 0.3229
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.0273327
Pearson's product-moment correlation
data: data$T1depression and data$BMI
t = -1.2211, df = 1308, p-value = 0.2223
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.08774365 0.02045498
sample estimates:
cor
-0.03374321
Code
summary(lm(T1depression ~ BMI+NS+TFE+TME, data = data))
Call:
lm(formula = T1depression ~ BMI + NS + TFE + TME, data = data)
Residuals:
Min 1Q Median 3Q Max
-20.996 -6.131 0.232 6.436 21.536
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.08563 2.09342 20.581 < 2e-16 ***
BMI -0.08628 0.07766 -1.111 0.26681
NS 1.00348 0.45136 2.223 0.02637 *
TFE -0.67037 0.24018 -2.791 0.00533 **
TME -0.27532 0.30312 -0.908 0.36389
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.033 on 1305 degrees of freedom
(38 observations deleted due to missingness)
Multiple R-squared: 0.01805, Adjusted R-squared: 0.01504
F-statistic: 5.999 on 4 and 1305 DF, p-value: 8.78e-05
The Pearson correlation test measures the linear relationship between two continuous variables. Spearman’s rank correlation coefficient (rho) measures the strength and direction of a monotonic association between two variables, which do not both have to be continuous or linearly related. Both coefficients range between -1 and 1, where -1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no correlation. Both tests give high p-values (0.32 for Spearman, 0.22 for Pearson), so there is no evidence in this dataset that the depression score is related to BMI. BMI is also not significant in the multiple regression (p = 0.27).
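The difference between the two coefficients can be seen on a small made-up example. The sketch below uses a strictly increasing but non-linear relationship; the vectors `x` and `y` are illustrative assumptions, not study data:

```r
# A strictly increasing but strongly non-linear relationship
x <- 1:20
y <- exp(x / 2)

pearson_r  <- cor(x, y, method = "pearson")   # measures linear association
spearman_r <- cor(x, y, method = "spearman")  # measures monotonic association via ranks

print(pearson_r)
print(spearman_r)
```

Because the ranks of `x` and `y` agree exactly, Spearman's rho is 1, while Pearson's r is below 1 since the points do not lie on a straight line.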
2. Higher family income increases the rate of obesity
Code
ggplot(data, aes(x = FS, y = BMI)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
Warning in cor.test.default(data$FS, data$BMI, method = c("spearman")): Cannot
compute exact p-value with ties
Spearman's rank correlation rho
data: data$FS and data$BMI
S = 390451141, p-value = 0.1766
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.03732942
Code
cor.test(data$FS, data$BMI, method = "pearson")
Pearson's product-moment correlation
data: data$FS and data$BMI
t = -0.74116, df = 1310, p-value = 0.4587
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.07451001 0.03368374
sample estimates:
cor
-0.02047308
Code
# Note: FS is the response in this model, whereas hypothesis 2 treats BMI as the response
fit2 <- lm(FS ~ BMI, data = data)
summary(fit2)
Call:
lm(formula = FS ~ BMI, data = data)
Residuals:
Min 1Q Median 3Q Max
-2.1991 -0.1810 -0.1691 0.8085 1.8644
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.265193 0.128881 25.335 <2e-16 ***
BMI -0.004949 0.006677 -0.741 0.459
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6932 on 1310 degrees of freedom
(36 observations deleted due to missingness)
Multiple R-squared: 0.0004191, Adjusted R-squared: -0.0003439
F-statistic: 0.5493 on 1 and 1310 DF, p-value: 0.4587
Neither correlation test nor the regression shows a significant relationship between family financial situation and BMI (all p-values > 0.05).
3. More siblings reduce the risk of both depression and anxiety
Code
ggplot(data, aes(x = NS, y = T1depression)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
Warning in cor.test.default(data$T1depression, data$NS, method = c("spearman")):
Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: data$T1depression and data$NS
S = 377319431, p-value = 0.008575
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.07162063
Pearson's product-moment correlation
data: data$T1depression and data$NS
t = 2.6388, df = 1344, p-value = 0.008415
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.01843398 0.12474745
sample estimates:
cor
0.07179463
Code
# Treat NS as a two-level grouping factor for the Welch two-sample t-test
data$NS <- factor(data$NS)
t.test(T1depression ~ NS, data = data)
Welch Two Sample t-test
data: T1depression by NS
t = -2.6337, df = 1252.5, p-value = 0.008551
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-2.0381547 -0.2979518
sample estimates:
mean in group 1 mean in group 2
38.38127 39.54932
The sample estimate of Spearman's rho is 0.0716, indicating a weak positive correlation between T1depression and NS. The p-value of the test is 0.008575, which is below 0.05, so the correlation is statistically significant at the 5% level.
Therefore, there is a significant but weak positive correlation between the number of siblings (NS) and the depression score in this dataset, which runs against hypothesis 3.
The Welch t-test agrees: the group with the higher NS level (group 2) has a higher mean depression score (39.55 vs 38.38), and p < 0.05 indicates that the difference is significant. The confidence interval is negative because it is an interval for the difference in means (group 1 minus group 2), not for the scores themselves.
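The interval reported by `t.test` is for the difference in means, group 1 minus group 2. With the means above, 38.38127 - 39.54932 = -1.16805, which is the midpoint of the reported interval (-2.038, -0.298). A minimal sketch with made-up scores (`g1` and `g2` are illustrative assumptions) shows the same behaviour:

```r
# Made-up scores for two groups (illustration only, not the study data)
g1 <- c(30, 35, 38, 40, 42)   # mean 37
g2 <- c(36, 40, 43, 45, 48)   # mean 42.4

res        <- t.test(g1, g2)            # Welch test by default (var.equal = FALSE)
diff_means <- mean(g1) - mean(g2)       # group 1 minus group 2
ci         <- as.numeric(res$conf.int)  # interval for the difference in means

# Swapping the group order flips the sign of the interval
ci_swapped <- as.numeric(t.test(g2, g1)$conf.int)

print(diff_means)
print(ci)
print(ci_swapped)
```

The interval is centred on the difference in means, so its sign only reflects which group is listed first.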
Warning in cor.test.default(data$T1anxiety, data$NS, method = c("spearman")):
Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: data$T1anxiety and data$NS
S = 343033890, p-value = 8.796e-09
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.1559788
Code
fit3 <- lm(T1anxiety ~ NS, data = data)
summary(fit3)
Call:
lm(formula = T1anxiety ~ NS, data = data)
Residuals:
Min 1Q Median 3Q Max
-13.906 -3.997 0.003 3.003 33.003
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.0883 0.5185 58.030 < 2e-16 ***
NS 1.9091 0.3411 5.597 2.64e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.207 on 1344 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.02278, Adjusted R-squared: 0.02205
F-statistic: 31.32 on 1 and 1344 DF, p-value: 2.643e-08
# BMI cut-offs: <18.5, 18.5-24.9, 25-29.9, >=30
data$BMI_category <- cut(data$BMI,
  breaks = c(-Inf, 18.5, 25, 30, Inf), right = FALSE,
  labels = c("Underweight", "Normal weight", "Overweight", "Obesity"))
# SDS depression cut-offs: <50, 50-59, 60-69, >=70
data$Depression_category <- cut(data$T1depression,
  breaks = c(-Inf, 50, 60, 70, Inf), right = FALSE,
  labels = c("Normal", "Mild", "Moderate to Marked Major", "Severe or Extreme Major"))
# Plot the bar chart
Answers to the feedback on check-in 1
Here are a few things you may want to work on in future steps:
1. Please provide more information about the dataset: what each variable means (e.g. WC, HC, SBP) and how it is measured. This is to make sure the audience understands your confounders.
A table explaining the abbreviations has been added to the data description.
2. Since gender is one of your key predictors, you may consider using the interaction between gender and other key variables in the model to see whether gender influences the impact of other predictors. Also, I could not find the gender variable in the dataset you provided.
Thanks for pointing this out. Since gender is missing, I will not use it as a key predictor.
3. As you mentioned, there are some outliers in the data, especially the one in the top-right corner. This outlier can change the slope of the regression. Also, the relationship between BMI and depression is not very clear in the graph; as you mentioned, more data processing is needed. You can also try plotting different groups (e.g. gender, family location) in different colors to see if there is any pattern.
Thanks for the comments. I will process the data further this time and plot more patterns.
Questions that need to be addressed
1. Variables such as family location or education level are coded as ranks (ordinal values). As the plots below show, it is hard to see a correlation with this kind of variable. How can I explore the relationship between an ordinal variable and a continuous variable?
2. Some of the continuous variables can also be converted to ordinal variables. Which methods or tests are suitable for finding the relationship between two such variables?
Code
# Convert numeric variables into categorical variables
# Use the recorded family location codes rather than randomly sampled labels
data$FL1 <- factor(data$FL)
# BMI cut-offs: <18.5, 18.5-24.9, 25-29.9, >=30
data$BMI_category <- cut(data$BMI,
  breaks = c(-Inf, 18.5, 25, 30, Inf), right = FALSE,
  labels = c("Underweight", "Normal weight", "Overweight", "Obesity"))
# SDS depression cut-offs: <50, 50-59, 60-69, >=70
data$Depression_category <- cut(data$T1depression,
  breaks = c(-Inf, 50, 60, 70, Inf), right = FALSE,
  labels = c("Normal", "Mild", "Moderate to Marked Major", "Severe or Extreme Major"))
# Plot
ggplot(data, aes(x = FL1, y = BMI, fill = BMI_category)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("#1b9e77", "#d95f02", "#7570b3", "#e7298a")) +
  xlab("Family location") +
  ylab("BMI") +
  ggtitle("BMI category and family location") +
  theme_bw()
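For the two questions above, a few standard rank-based tools in base R may be a reasonable starting point. The sketch below is only an illustration on made-up data: the data frame `toy`, its columns `edu` and `score`, and the cut-offs 40 and 45 are all assumptions, not part of the study.

```r
# Made-up toy data: an ordinal predictor (1-4) and a continuous outcome
toy <- data.frame(
  edu   = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  score = c(48, 45, 50, 44, 46, 41, 40, 43, 38, 36, 39, 35)
)
toy$edu_f <- factor(toy$edu)

# Ordinal vs continuous, option 1: Spearman rank correlation (uses the ordering of edu)
sp <- cor.test(toy$edu, toy$score, method = "spearman", exact = FALSE)

# Ordinal vs continuous, option 2: Kruskal-Wallis test (compares score across edu groups)
kw <- kruskal.test(score ~ edu_f, data = toy)

# Ordinal vs ordinal: bin the continuous score, cross-tabulate, then use Kendall's tau
toy$score_cat <- cut(toy$score, breaks = c(-Inf, 40, 45, Inf), right = FALSE,
                     labels = c("low", "mid", "high"), ordered_result = TRUE)
tab <- table(toy$edu, toy$score_cat)
kt  <- cor.test(toy$edu, as.integer(toy$score_cat), method = "kendall", exact = FALSE)

print(sp$estimate)
print(kw$p.value)
print(tab)
print(kt$estimate)
```

Spearman and Kendall keep the ordering of the categories, whereas Kruskal-Wallis only asks whether the groups differ, so the choice depends on whether a monotonic trend is expected. A boxplot of the continuous variable by ordinal level is usually easier to read than a scatter plot for this kind of data.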
---
title: "Final Project Checkin-2"
author: "Xiaoyan"
description: "Template of course blog qmd file"
date: "04/17/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - finalpart1
---

```{r}
library(tidyr)
library(dplyr)
library(readxl)
library(ggplot2)
```

# {.tabset}

## Introduction and background

The Chinese government implemented the one-child policy in 1979, which increased the proportion of one-child families and produced the "four-two-one" family structure of four grandparents, two parents, and one child. Although only children benefit from relatively more family and social resources, they may face physical and socio-psychological problems during development, including an elevated risk of overweight and obesity and negative psychosocial consequences. Previous studies have shown that only children are more likely to be overweight or obese than children with one or more siblings. Beyond obesity itself, it is also worth exploring how mental health in young adolescents relates to overweight/obesity and to sibship size.

## Research questions

1. Is obesity positively related to mental health problems?
2. Which factors affect mental health?
3. Is the number of siblings or obesity directly related to mental health?

## Key predictors

1. Mental health
2. Number of siblings
3. Obesity rate
4. Family location, finances, and education

## Hypotheses

1. A higher obesity rate increases the risk of depression.
2. Higher family income increases the rate of obesity.
3. More siblings reduce the risk of both depression and anxiety.

In these hypotheses, the response variables are the depression score, the anxiety score, and BMI. The explanatory variables are drawn from the key predictors. Further analysis is needed to identify the control variables. For example, in hypothesis 2, family income is the explanatory variable and obesity (BMI) is the response variable; family financial situation may also serve as a control variable.

## Data description

### Overview of the data

```{r}
data <- read_excel("/Users/cassie199/Desktop/23spring/603_Spring_2023-1/posts/_data/mentalhealth_data.xlsx")
head(data)
sum(is.na(data))
plot(data$T0depression ~ data$BMI)
```

This dataset includes 1348 observations (rows) and 29 variables (columns), with 728 missing values (NA) in total. All variables are stored as numeric data; descriptive variables such as education level, family financial situation, and depression level are coded as ordinal grades. A preliminary plot of depression score against BMI shows some outliers that may need to be handled and no clear pattern. More data processing is needed in later steps.

```{r}
variables <- c(
  "Internalizing problem - Depression (SDS)", "Internalizing problem - Anxiety (SAS)",
  "Obesity parameters - BMI", "Obesity parameters - WC", "Obesity parameters - WHR",
  "Obesity parameters - WHtR", "Biochemical parameters - TG", "Biochemical parameters - FBG",
  "Biochemical parameters - TC", "Biochemical parameters - HDL-C", "Biochemical parameters - LDL-C",
  "Blood pressure - SBP", "Blood pressure - DBP",
  "Family location", "Number of siblings",
  "Time spent with father in elementary school", "Time spent with mother in elementary school",
  "Father’s education level", "Mother’s education level", "Family financial situation",
  "Sleeping hours", "Skipping breakfast", "Vigorous", "Moderate"
)
abreviations <- c(
  "Depression", "Anxiety", "BMI", "WC", "WHR", "WHtR", "TG", "FBG", "TC", "HDL-C", "LDL-C",
  "SBP", "DBP", "FL", "NS", "TFE", "TME", "FEL", "MEL", "FS", "SL", "SB", "VG", "MD"
)
cat("variable table\n")
variable_table <- data.frame(variables, abreviations)
variable_table
```

### Parameter explanation

```{r}
# Create the data frames for the SAS and SDS scales
sas_levels <- c("Normal", "Mild to Moderate", "Marked to Severe", "Extreme")
sas_scores <- c("<45", "45-59", "60-74", ">=75")
sas_table <- data.frame(Level = sas_levels, Score = sas_scores)

sds_levels <- c("Normal", "Mild", "Moderate to Marked Major", "Severe or Extreme Major")
sds_scores <- c("<50", "50-59", "60-69", ">=70")
sds_table <- data.frame(Level = sds_levels, Score = sds_scores)

# Create the data frame for BMI categories
bmi_levels <- c("Underweight", "Normal Weight", "Overweight", "Obesity")
bmi_values <- c("<18.5", "18.5-24.9", "25-29.9", ">=30")
bmi_table <- data.frame(Category = bmi_levels, BMI = bmi_values)

# Print the SAS scale table
cat("Self-rating Anxiety Scale (SAS)\n")
print(sas_table)

# Print the SDS scale table
cat("\nSDS scores (SDS)\n")
print(sds_table)

# Print the BMI category table
cat("\nBMI Categories\n")
print(bmi_table)
```

## Hypothesis test

### 1. A higher obesity rate increases the risk of depression

```{r}
colnames(data) <- c(
  "T0depression", "T0anxiety", "T1depression", "T1anxiety", "Height", "Weight",
  "WC", "HC", "SBP", "DBP", "FBG", "TC", "TG", "HDL-C", "LDL-C", "BMI", "WHR", "WtHR",
  "FL", "NS", "TFE", "TME", "FEL", "MEL", "FS", "SL", "SB", "Vigorous", "Moderate"
)
ggplot(data, aes(x = T1depression, y = BMI)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
cor.test(data$T1depression, data$BMI, method = "spearman")
cor.test(data$T1depression, data$BMI, method = "pearson")
summary(lm(T1depression ~ BMI + NS + TFE + TME, data = data))
```

The Pearson correlation test measures the linear relationship between two continuous variables. Spearman's rank correlation coefficient (rho) measures the strength and direction of a monotonic association between two variables, which do not both have to be continuous or linearly related. Both coefficients range between -1 and 1, where -1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no correlation. Both tests give high p-values, so there is no evidence in this dataset that the depression score is related to BMI.

### 2. Higher family income increases the rate of obesity

```{r}
ggplot(data, aes(x = FS, y = BMI)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
cor.test(data$FS, data$BMI, method = "spearman")
cor.test(data$FS, data$BMI, method = "pearson")
# Note: FS is the response in this model, whereas hypothesis 2 treats BMI as the response
fit2 <- lm(FS ~ BMI, data = data)
summary(fit2)
```

Neither correlation test nor the regression shows a significant relationship between family financial situation and BMI.

### 3. More siblings reduce the risk of both depression and anxiety

```{r}
ggplot(data, aes(x = NS, y = T1depression)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
cor.test(data$T1depression, data$NS, method = "spearman")
cor.test(data$T1depression, data$NS, method = "pearson")
# Treat NS as a two-level grouping factor for the Welch two-sample t-test
data$NS <- factor(data$NS)
t.test(T1depression ~ NS, data = data)
```

The sample estimate of Spearman's rho is 0.0716, indicating a weak positive correlation between T1depression and NS. The p-value of the test is 0.008575, which is below 0.05, so the correlation is statistically significant at the 5% level.

Therefore, there is a significant but weak positive correlation between the number of siblings (NS) and the depression score in this dataset, which runs against hypothesis 3.

The Welch t-test agrees: the group with the higher NS level (group 2) has a higher mean depression score, and p < 0.05 indicates that the difference is significant. The confidence interval is negative because it is an interval for the difference in means (group 1 minus group 2), not for the scores themselves.

```{r}
print(data$NS)
ggplot(data, aes(x = NS, y = T1anxiety)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
# Convert the factor back to its numeric level codes before correlation and regression
data$NS <- as.numeric(data$NS)
cor.test(data$T1anxiety, data$NS, method = "spearman")
fit3 <- lm(T1anxiety ~ NS, data = data)
summary(fit3)
```

### Others

```{r}
# Create sample data
plot(data$BMI ~ data$T1depression)
plot(data$T1depression ~ data$FL)
# BMI cut-offs: <18.5, 18.5-24.9, 25-29.9, >=30
data$BMI_category <- cut(data$BMI,
  breaks = c(-Inf, 18.5, 25, 30, Inf), right = FALSE,
  labels = c("Underweight", "Normal weight", "Overweight", "Obesity"))
# SDS depression cut-offs: <50, 50-59, 60-69, >=70
data$Depression_category <- cut(data$T1depression,
  breaks = c(-Inf, 50, 60, 70, Inf), right = FALSE,
  labels = c("Normal", "Mild", "Moderate to Marked Major", "Severe or Extreme Major"))
# Plot the bar chart
```

## Answers to the feedback on check-in 1

Here are a few things you may want to work on in future steps:

1. Please provide more information about the dataset: what each variable means (e.g. WC, HC, SBP) and how it is measured. This is to make sure the audience understands your confounders.

   A table explaining the abbreviations has been added to the data description.

2. Since gender is one of your key predictors, you may consider using the interaction between gender and other key variables in the model to see whether gender influences the impact of other predictors. Also, I could not find the gender variable in the dataset you provided.

   Thanks for pointing this out. Since gender is missing, I will not use it as a key predictor.

3. As you mentioned, there are some outliers in the data, especially the one in the top-right corner. This outlier can change the slope of the regression. Also, the relationship between BMI and depression is not very clear in the graph; as you mentioned, more data processing is needed. You can also try plotting different groups (e.g. gender, family location) in different colors to see if there is any pattern.

   Thanks for the comments. I will process the data further this time and plot more patterns.

## Questions that need to be addressed

1. Variables such as family location or education level are coded as ranks (ordinal values). As the plots below show, it is hard to see a correlation with this kind of variable. How can I explore the relationship between an ordinal variable and a continuous variable?

```{r}
pairs(data[c("T1depression", "T1anxiety", "BMI", "FL", "NS", "TFE", "TME", "FEL")])
pairs(data[c("T1depression", "T1anxiety", "MEL", "FS", "SL", "SB", "Vigorous", "Moderate")])
```

2. Some of the continuous variables can also be converted to ordinal variables. Which methods or tests are suitable for finding the relationship between two such variables?

```{r}
# Convert numeric variables into categorical variables
# Use the recorded family location codes rather than randomly sampled labels
data$FL1 <- factor(data$FL)
# BMI cut-offs: <18.5, 18.5-24.9, 25-29.9, >=30
data$BMI_category <- cut(data$BMI,
  breaks = c(-Inf, 18.5, 25, 30, Inf), right = FALSE,
  labels = c("Underweight", "Normal weight", "Overweight", "Obesity"))
# SDS depression cut-offs: <50, 50-59, 60-69, >=70
data$Depression_category <- cut(data$T1depression,
  breaks = c(-Inf, 50, 60, 70, Inf), right = FALSE,
  labels = c("Normal", "Mild", "Moderate to Marked Major", "Severe or Extreme Major"))
# Plot
ggplot(data, aes(x = FL1, y = BMI, fill = BMI_category)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("#1b9e77", "#d95f02", "#7570b3", "#e7298a")) +
  xlab("Family location") +
  ylab("BMI") +
  ggtitle("BMI category and family location") +
  theme_bw()
ggplot(data, aes(x = FL1, y = T1depression, fill = Depression_category)) +
  geom_bar(stat = "identity", position = "stack")
```