Homework3
Akhilesh
The third homework
Author

Akhilesh Kumar Meghwal

Published

April 11, 2023

Question 1

United Nations (Data file: UN 1 lin alr4) The data in the file UN 11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries.The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.

(a)

Identify the predictor and the response?

Code
data(UN11)
head(UN11)

The predictor variable is ppgdp and the response variable is fertility. ‘ppgdp’ represents the gross national product per person, while ‘fertility’ refers to the birth rate per 1000 females. The objective of the study is to understand how changes in ‘ppgdp’ can influence the value of ‘fertility’. Positive or negative correlations between the two variables can lead to conclusions about how one variable impacts the other. The analysis of this relationship can provide insights into socio-economic factors that influence birth rates, which can then inform policies on population growth, family planning, and economic development.

(b):

Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?

Code
UN11 %>%
  select(c(ppgdp, fertility)) %>%
  ggplot(aes(x = ppgdp, y = fertility)) + 
  geom_point(color = "#001F3F", alpha = 0.8, size = 3) +
  geom_smooth(method = lm, se = FALSE, color = "#E41A1C", size = 1) +
  theme_minimal() +
  labs(x = "Gross National Product per Person (US$)",
       y = "Birth Rate per 1000 Females",
       title = element_text(color = "#006600", "Scatterplot of Fertility vs. PPGDP with Regression Line")) +
  theme(plot.title = element_text(size = 12, face = "bold", hjust = 0.5, color = "#006600"),
        axis.title = element_text(size = 10, color = "#001F3F"),
        axis.text = element_text(size = 8, color = "#001F3F"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.line = element_line(color = "#333333", size = 0.5),
        plot.background = element_blank(),
        panel.background = element_blank(),
        legend.position = "none")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.
`geom_smooth()` using formula = 'y ~ x'

The scatterplot shows the relationship between two variables, fertility (number of live births per 1000 females) and ppgdp (gross national product per person in US dollars). Each point on the plot represents a country, with its fertility rate on the vertical axis and its ppgdp on the horizontal axis.

From the scatterplot, we can see that there is a general negative relationship between fertility and ppgdp. This means that as a country’s income increases, its fertility rate tends to decrease. However, the relationship is not perfectly linear, and there is a lot of variation in fertility rates at each level of ppgdp.

To summarize the trend in the data, A straight-line mean function does seem to be plausible for summarizing the data, as the plotted regression line suggests a linear negative trend in the data. However, it is important to note that the relationship between ppgdp and fertility may not be entirely linear, and other factors may also contribute to the observed patterns in the data. Therefore, it may be appropriate to explore more complex statistical models to better understand the relationship between these variables.

Overall, this scatterplot provides a visual representation of the relationship between fertility and ppgdp. It allows us to quickly identify trends and patterns in the data and can be used to inform further analysis and modeling.

(c):

Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.

Code
UN11 %>%
  select(c(ppgdp, fertility)) %>%
  ggplot(aes(x = log(ppgdp), y = log(fertility))) + 
  geom_point(color = "#001F3F", alpha = 0.8, size = 3) +
  geom_smooth(method = lm, se = TRUE, color = "#E41A1C", size = 1) +
  theme_minimal() +
  labs(x = "Log(Gross National Product per Person (US$))",
       y = "Log(Birth Rate per 1000 Females)",
       title = element_text(color = "#006600", "Scatterplot of Log(Fertility) vs. Log(PPGDP) with Regression Line")) +
  theme(plot.title = element_text(size = 12, face = "bold", hjust = 0.5, color = "#006600"),
        axis.title = element_text(size = 10, color = "#001F3F"),
        axis.text = element_text(size = 8, color = "#001F3F"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.line = element_line(color = "#333333", size = 0.5),
        plot.background = element_blank(),
        panel.background = element_blank(),
        legend.position = "none")
`geom_smooth()` using formula = 'y ~ x'

The scatterplot of Log(Fertility) versus Log(PPGDP) using data from the UN11 dataset shows the relationship between birth rates per 1000 females and gross national product per person in U.S. dollars. The graph displays the logarithmic transformation of the variables to better visualize the relationship between them.

The graph shows a negative linear relationship between log(fertility) and log(ppgdp), indicating that as gross national product per person increases, birth rates per 1000 females decrease. The linear regression line, which is plotted using the method of least squares, indicates that the relationship is statistically significant.

The scatterplot also shows some degree of variability around the regression line, which is captured by the shaded area around the line. This variability could be due to other factors that are not accounted for in the linear regression model.

Overall, the scatterplot suggests that there is a general negative relationship between fertility and gross national product per person, with higher levels of economic development associated with lower birth rates. However, the relationship does not appear to be strictly linear, as the points are not forming a tight linear pattern around the regression line. There is also significant scatter in the data, which suggests that there may be other factors that influence fertility rates besides gross national product per person.

Therefore, while a simple linear regression model could provide a summary of the relationship between fertility and gross national product per person, it may not capture the full complexity of the relationship. A more sophisticated model that takes into account additional factors may be necessary to accurately model the relationship between these two variables.

Changing the base of the logs

Log10

Code
UN11 %>%
  select(c(ppgdp, fertility)) %>%
  ggplot(aes(x = log10(ppgdp), y = log10(fertility))) + 
  geom_point(color = "#001F3F", alpha = 0.8, size = 3) +
  geom_smooth(method = lm, se = TRUE, color = "#E41A1C", size = 1) +
  theme_minimal() +
  labs(x = "Log10(Gross National Product per Person (US$))",
       y = "Log10(Birth Rate per 1000 Females)",
       title = element_text(color = "#006600", "Scatterplot of Log10(Fertility) vs. Log10(PPGDP) with Regression Line")) +
  theme(plot.title = element_text(size = 12, face = "bold", hjust = 0.5, color = "#006600"),
        axis.title = element_text(size = 10, color = "#001F3F"),
        axis.text = element_text(size = 8, color = "#001F3F"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.line = element_line(color = "#333333", size = 0.5),
        plot.background = element_blank(),
        panel.background = element_blank(),
        legend.position = "none")
`geom_smooth()` using formula = 'y ~ x'

Log 2

Code
UN11 %>%
  select(c(ppgdp, fertility)) %>%
  ggplot(aes(x = log2(ppgdp), y = log2(fertility))) + 
  geom_point(color = "#001F3F", alpha = 0.8, size = 3) +
  geom_smooth(method = lm, se = TRUE, color = "#E41A1C", size = 1) +
  theme_minimal() +
  labs(x = "Log2(Gross National Product per Person (US$))",
       y = "Log2(Birth Rate per 1000 Females)",
       title = element_text(color = "#006600", "Scatterplot of Log2(Fertility) vs. Log2(PPGDP) with Regression Line")) +
  theme(plot.title = element_text(size = 12, face = "bold", hjust = 0.5, color = "#006600"),
        axis.title = element_text(size = 10, color = "#001F3F"),
        axis.text = element_text(size = 8, color = "#001F3F"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.line = element_line(color = "#333333", size = 0.5),
        plot.background = element_blank(),
        panel.background = element_blank(),
        legend.position = "none")
`geom_smooth()` using formula = 'y ~ x'

Changing the base of logarithms will only change the values on the axes and not the shape of the graph.

In the case of the fertility vs. PPGDP graph, changing the base of logarithms will not change the overall pattern observed in the data, but it will change the values on the x and y axes. For example, using the natural logarithm (base e) instead of base 10 will result in different numerical values on the axes but the same underlying relationship between the two variables.

Question 2

Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).

(a)

How, if at all, does the slope of the prediction equation change?

Code
UN11$ppgdp_pound <- 1.33 * UN11$ppgdp
summary(lm(fertility ~ ppgdp_pound, UN11))

Call:
lm(formula = fertility ~ ppgdp_pound, data = UN11)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9006 -0.8801 -0.3547  0.6749  3.7585 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.178e+00  1.048e-01  30.331  < 2e-16 ***
ppgdp_pound -2.407e-05  3.500e-06  -6.877  7.9e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.206 on 197 degrees of freedom
Multiple R-squared:  0.1936,    Adjusted R-squared:  0.1895 
F-statistic: 47.29 on 1 and 197 DF,  p-value: 7.903e-11
Code
summary(lm(fertility ~ ppgdp, UN11))

Call:
lm(formula = fertility ~ ppgdp, data = UN11)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9006 -0.8801 -0.3547  0.6749  3.7585 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.178e+00  1.048e-01  30.331  < 2e-16 ***
ppgdp       -3.201e-05  4.655e-06  -6.877  7.9e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.206 on 197 degrees of freedom
Multiple R-squared:  0.1936,    Adjusted R-squared:  0.1895 
F-statistic: 47.29 on 1 and 197 DF,  p-value: 7.903e-11

The analysis shows the results of fitting two linear regression models to the UN11 dataset. The first model predicts fertility rate based on the variable “ppgdp_pound,” which is the gross domestic product (GDP) per capita in British pounds. The second model predicts fertility rate based on the variable “ppgdp,” which is the GDP per capita in US dollars.

Both models show a statistically significant negative relationship between GDP per capita and fertility rate, with a p-value of 7.903e-11 for both. The coefficient estimates for ppgdp and ppgdp_pound are different, which is expected since the units are different. The coefficient for ppgdp_pound is -2.407e-05, which means that for every one pound increase in GDP per capita, the fertility rate decreases by 0.00002407. The coefficient for ppgdp is -3.201e-05, which means that for every one dollar increase in GDP per capita, the fertility rate decreases by 0.00003201.

Overall, the results suggest that increasing economic development, as measured by GDP per capita, is associated with a decrease in fertility rate. However, the impact of the pound conversion on the slope of the regression line is negligible, with a difference of only 0.0000086 in the coefficient estimates.

(b)

How, if at all, does the correlation change?

Code
cor(UN11$ppgdp, UN11$fertility)
[1] -0.4399891
Code
cor(UN11$ppgdp_pound, UN11$fertility)
[1] -0.4399891

The correlation coefficient between fertility and ppgdp (in US dollars) is the same as the correlation coefficient between fertility and ppgdp_pound (in British pounds). This suggests that the choice of currency used to measure gross domestic product per capita does not have a significant impact on the relationship between fertility and economic development.

The correlation coefficient of -0.4399891 indicates a moderate negative correlation between fertility and gross domestic product per capita. This suggests that as the gross domestic product per capita increases, the fertility rate tends to decrease, and vice versa.

Overall, the analysis suggests that the choice of currency used to measure gross domestic product per capita does not have a significant impact on the relationship between fertility and economic development.

Question 3

Water runoff in the Sierras (Data file: water in alr4) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots.

(Hint: Use the pairsQ function.)

Code
data(water)
pairs(~APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data=water, main = "Sierra Southern California Water Supply Runoff",
      pch = 21, bg = "red")

Code
# Set colors for variables
colorA <- "blue"
colorB <- "red"
par(cex.axis = 0.8, cex.lab=0.8)

# Create the plots with colored points and title
plot(y=water$BSAAM,x=water$APMAM, col=ifelse(substr(names(water)[2],1,1) == "A", colorA, colorB),
     main="Scatter plot of BSAAM and APMAM", xlab="APMAM", ylab="BSAAM")

Code
plot(y=water$BSAAM,x=water$APSAB, col=ifelse(substr(names(water)[3],1,1) == "A", colorA, colorB),
     main="Scatter plot of BSAAM and APSAB", xlab="APSAB", ylab="BSAAM")

Code
plot(y=water$BSAAM,x=water$APSLAKE, col=ifelse(substr(names(water)[4],1,1) == "A", colorA, colorB),
     main="Scatter plot of BSAAM and APSLAKE", xlab="APSLAKE", ylab="BSAAM")

Code
plot(y=water$BSAAM,x=water$OPBPC, col=ifelse(substr(names(water)[5],1,1) == "A", colorA, colorB),
     main="Scatter plot of BSAAM and OPBPC", xlab="OPBPC", ylab="BSAAM")

Code
plot(y=water$BSAAM,x=water$OPRC, col=ifelse(substr(names(water)[6],1,1) == "A", colorA, colorB),
     main="Scatter plot of BSAAM and OPRC", xlab="OPRC", ylab="BSAAM")

Code
plot(y=water$BSAAM,x=water$OPSLAKE, col=ifelse(substr(names(water)[7],1,1) == "A", colorA, colorB),
     main="Scatter plot of BSAAM and OPSLAKE", xlab="OPSLAKE", ylab="BSAAM")

I created output graph using ‘pairs’ to establish and read correlation between different variables given in the dataset, but I had trouble reading the correlation output; so I also ran each of the scatter-plots separately

The graph provided in the analysis shows the correlations between six variables: OPBPC, OPRC, OPSLAKE, APMAM, APSAB, and APSLAKE. The graph indicates that there are two groups of variables that are more closely related to each other than to the other variables in the set.

The first group consists of OPBPC, OPRC, and OPSLAKE, which are strongly correlated with each other. The correlation between these three variables and APMAM, APSAB, and APSLAKE is much weaker. This suggests that the first group of variables is related to each other in a way that is different from their relationship with the second group of variables.

The second group consists of APMAM, APSAB, and APSLAKE, which are also correlated with each other, although to a lesser extent than OPBPC, OPRC, and OPSLAKE. This suggests that there is indeed a correlation among these three variables, but it is not as strong as the correlation between the first group of variables.

The analysis also indicates that BSAAM is more closely related to OPBPC, OPRC, and OPSLAKE than to APMAM, APSAB, and APSLAKE. The results show that BSAAM has a stronger positive correlation with OPBPC, OPRC, and OPSLAKE. On the other hand, the correlation between BSAAM and APMAM, APSAB, and APSLAKE are weaker. This suggests that BSAAM is indeed more closely related to the first group of variables than to the second group of variables.

In summary, the analysis reveals that there are two groups of variables in the set, and they are more closely related to each other within their group. OPBPC, OPRC, and OPSLAKE are strongly correlated with each other, but not with APMAM, APSAB, and APSLAKE. APMAM, APSAB, and APSLAKE are also correlated with each other, but to a lesser extent than OPBPC, OPRC, and OPSLAKE. BSAAM is more closely related to the first group of variables than to the second group of variables. This information can be useful in identifying important variables for analysis or modeling.

Question 4

Professor ratings (Data fde: Rateprof in alr4) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1-5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterinterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Create a scatterplot matrix of these five variables. Provide a brief description of the relationships between the five ratings.

Code
summary(Rateprof)
    gender       numYears        numRaters       numCourses     pepper   
 female:159   Min.   : 1.000   Min.   :10.00   Min.   : 1.000   no :320  
 male  :207   1st Qu.: 6.000   1st Qu.:15.00   1st Qu.: 3.000   yes: 46  
              Median :10.000   Median :24.00   Median : 4.000            
              Mean   : 8.347   Mean   :28.58   Mean   : 4.251            
              3rd Qu.:11.000   3rd Qu.:37.00   3rd Qu.: 5.000            
              Max.   :11.000   Max.   :86.00   Max.   :12.000            
                                                                         
    discipline          dept        quality       helpfulness   
 Hum     :134   English   : 49   Min.   :1.409   Min.   :1.364  
 SocSci  : 66   Math      : 34   1st Qu.:2.936   1st Qu.:3.069  
 STEM    :103   Biology   : 20   Median :3.612   Median :3.662  
 Pre-prof: 63   Chemistry : 20   Mean   :3.575   Mean   :3.631  
                Psychology: 20   3rd Qu.:4.250   3rd Qu.:4.351  
                Spanish   : 20   Max.   :4.981   Max.   :5.000  
                (Other)   :203                                  
    clarity         easiness     raterInterest     sdQuality      
 Min.   :1.333   Min.   :1.391   Min.   :1.098   Min.   :0.09623  
 1st Qu.:2.871   1st Qu.:2.548   1st Qu.:2.934   1st Qu.:0.87508  
 Median :3.600   Median :3.148   Median :3.305   Median :1.15037  
 Mean   :3.525   Mean   :3.135   Mean   :3.310   Mean   :1.05610  
 3rd Qu.:4.214   3rd Qu.:3.692   3rd Qu.:3.692   3rd Qu.:1.28730  
 Max.   :5.000   Max.   :4.900   Max.   :4.909   Max.   :1.67739  
                                                                  
 sdHelpfulness      sdClarity        sdEasiness     sdRaterInterest 
 Min.   :0.0000   Min.   :0.0000   Min.   :0.3162   Min.   :0.3015  
 1st Qu.:0.9902   1st Qu.:0.9085   1st Qu.:0.9045   1st Qu.:1.0848  
 Median :1.2860   Median :1.1712   Median :1.0247   Median :1.2167  
 Mean   :1.1719   Mean   :1.0970   Mean   :1.0196   Mean   :1.1965  
 3rd Qu.:1.4365   3rd Qu.:1.3328   3rd Qu.:1.1485   3rd Qu.:1.3326  
 Max.   :1.8091   Max.   :1.8091   Max.   :1.6293   Max.   :1.7246  
                                                                    
Code
head(Rateprof)
Code
#subset of RateProf data

rate_my_prof<-select(Rateprof,c('quality','helpfulness','clarity','easiness','raterInterest'))

head(rate_my_prof)
Code
#Scatterplot Matrix of five RateProf variables

pairs(rate_my_prof,
      col = "red3",
      pch = 20,
      main = "Rate my Professor Matrix ScatterPlot ")

Interpreting the scatter plot output generated through ‘pairs’ for the average professor ratings on the topics of quality, clarity, helpfulness, easiness, and rater interest, the variables quality, clarity, and helpfulness appear to each have strong positive correlations with each other.

The relationship between some pairs of variables indicates better positive linear correlations than others.

  • Quality-Clarity and Quality-Helpfulness indicate very strong positive linear correlations.
  • Quality-Easiness and Quality-RaterInterest indicate weak positive linear correlations.
  • Helpfulness-Easiness and Helpfulness-RaterInterest indicate weak linear correlations.
  • Clarity-Helpfulness indicates a positive correlation
  • Clarity-RaterInterest and Clarity-Easiness indicate weak positive correlations.
  • Easiness and RaterInterest indicate a very weak positive correlation.

Question 5

For the student.survey data file in the smss package, conduct regression analysesrelating (by convention, y denotes the outcome variable, x denotes the explanatory variable) (i) y = political ideology and x = religiosity, (ii) y = high school GPA and x = hours of TV watching. (You can use ?student. survey in the R console, after loading the package, to see what each variable means.)

(a)

Graphically portray how the explanatory variable relates to the outcome variable in each of the two cases

Code
data(student.survey)
summary(student.survey)
      subj       ge           ag              hi              co       
 Min.   : 1.00   f:31   Min.   :22.00   Min.   :2.000   Min.   :2.600  
 1st Qu.:15.75   m:29   1st Qu.:24.00   1st Qu.:3.000   1st Qu.:3.175  
 Median :30.50          Median :26.50   Median :3.350   Median :3.500  
 Mean   :30.50          Mean   :29.17   Mean   :3.308   Mean   :3.453  
 3rd Qu.:45.25          3rd Qu.:31.00   3rd Qu.:3.625   3rd Qu.:3.725  
 Max.   :60.00          Max.   :71.00   Max.   :4.000   Max.   :4.000  
                                                                       
       dh             dr               tv               sp        
 Min.   :   0   Min.   : 0.200   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 205   1st Qu.: 1.450   1st Qu.: 3.000   1st Qu.: 3.000  
 Median : 640   Median : 2.000   Median : 6.000   Median : 5.000  
 Mean   :1232   Mean   : 3.818   Mean   : 7.267   Mean   : 5.483  
 3rd Qu.:1350   3rd Qu.: 5.000   3rd Qu.:10.000   3rd Qu.: 7.000  
 Max.   :8000   Max.   :20.000   Max.   :37.000   Max.   :16.000  
                                                                  
       ne               ah             ve          pa    
 Min.   : 0.000   Min.   : 0.000   Mode :logical   d:21  
 1st Qu.: 2.000   1st Qu.: 0.000   FALSE:60        i:24  
 Median : 3.000   Median : 0.500                   r:15  
 Mean   : 4.083   Mean   : 1.433                         
 3rd Qu.: 5.250   3rd Qu.: 2.000                         
 Max.   :14.000   Max.   :11.000                         
                                                         
                     pi                re         ab              aa         
 very liberal         : 8   never       :15   Mode :logical   Mode :logical  
 liberal              :24   occasionally:29   FALSE:60        FALSE:59       
 slightly liberal     : 6   most weeks  : 7                   NA's :1        
 moderate             :10   every week  : 9                                  
 slightly conservative: 6                                                    
 conservative         : 4                                                    
 very conservative    : 2                                                    
     ld         
 Mode :logical  
 FALSE:44       
 NA's :16       
                
                
                
                
Code
head(student.survey)
Code
student.survey %>%
  
  # Group by religiosity and political ideology and count the number of observations
  group_by(re, pi) %>%
  summarize(count = n(), .groups = "drop") %>%
  # Create a stacked bar chart
  ggplot() + 
  geom_col(aes(x = re, y = count, fill = pi), position = "stack") +
  # Add labels to x and y axis
  xlab("Religiosity") +
  ylab("Count") +
  # Add a legend to show the relationship between fill colors and political ideology
  scale_fill_discrete(name = "Political ideology")

Conservatism and religiosity appear to be positively correlated, meaning that individuals who are more religious tend to be more conservative in their political beliefs.

Code
ggplot(data=student.survey, aes(x=tv, y=hi))+
geom_point()+
geom_smooth(method="lm", se=FALSE)+
xlab("Average Number of Hours of TV watched per Week") +
ylab("High School GPA") 
`geom_smooth()` using formula = 'y ~ x'

High school GPA and TV-watching seem to have a negative relationship.

(b)

Summarize and interpret results of inferential analyses.

Code
summary(lm(data = student.survey, formula = as.numeric(pi) ~ as.numeric(re)))

Call:
lm(formula = as.numeric(pi) ~ as.numeric(re), data = student.survey)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81243 -0.87160  0.09882  1.12840  3.09882 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.9308     0.4252   2.189   0.0327 *  
as.numeric(re)   0.9704     0.1792   5.416 1.22e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared:  0.3359,    Adjusted R-squared:  0.3244 
F-statistic: 29.34 on 1 and 58 DF,  p-value: 1.221e-06

This output is from a linear regression model that examines the relationship between religiousness (re) and political ideology (pi) in a sample of students. The model shows that there is a statistically significant positive relationship between religiousness and political ideology.

The intercept coefficient is 0.9308, which means that when religiousness is zero, the expected value of political ideology is 0.9308. The slope coefficient for religiousness is 0.9704, which means that for a one-unit increase in religiousness, the expected value of political ideology increases by 0.9704 units. The R-squared value is 0.3359, which indicates that the model explains 33.59% of the variation in political ideology.

The p-value for the slope coefficient is 1.221e-06, which is less than 0.05, indicating that the relationship between religiousness and political ideology is statistically significant. Therefore, we can conclude that there is a positive relationship between religiousness and political ideology in the sample of students.

Code
summary(lm(data = student.survey, formula = hi ~ tv))

Call:
lm(formula = hi ~ tv, data = student.survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.441353   0.085345  40.323   <2e-16 ***
tv          -0.018305   0.008658  -2.114   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879

This analysis reports the results of a linear regression model to investigate the relationship between average hours of TV watched per week (TV) and high school GPA (hi) using data from student survey.

The model shows that the intercept is statistically significant at a p-value < 2e-16, indicating that the predicted value of the high school GPA when the average number of hours of TV watched per week is zero is 3.44. This value can be interpreted as the expected high school GPA for students who do not watch any TV.

Furthermore, the coefficient for TV is -0.018 with a p-value of 0.0388, which indicates that there is a statistically significant negative association between TV watching and high school GPA at the 0.05 level of significance. The negative coefficient suggests that as the average number of hours of TV watched per week increases by 1 hour, high school GPA decreases by 0.018.

The model’s R-squared value is 0.07156, indicating that only about 7% of the variance in high school GPA is explained by the variance in the average number of hours of TV watched per week. The adjusted R-squared value is 0.05555, which means that the model has a weak predictive power for high school GPA based on the average number of hours of TV watched per week.

In summary, the results suggest that there is a negative association between TV watching and high school GPA. However, the model’s low R-squared value suggests that other factors not included in the model might be more important in predicting high school GPA.