hw1
descriptive statistics
probability
Template of course blog qmd file
Author

Xiaoyan

Published

February 27, 2023

Code
library(tidyr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(readxl)
library(ggplot2)
data<-read_excel("/Users/cassie199/Desktop/23spring/603_Spring_2023-1/posts/_data/LungCapData.xls")
head(data)
# A tibble: 6 × 6
  LungCap   Age Height Smoke Gender Caesarean
    <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
1    6.48     6   62.1 no    male   no       
2   10.1     18   74.7 yes   female no       
3    9.55    16   69.7 no    female yes      
4   11.1     14   71   no    male   no       
5    4.8      5   56.9 no    male   no       
6    6.22    11   58.7 no    female no       

Question 1

  1. Use the LungCapData to answer the following questions. (Hint: Using dplyr, especially group_by() and summarize() can help you answer the following questions relatively efficiently.)
  a. What does the distribution of LungCap look like? (Hint: Plot a histogram with probability density on the y axis.)

- The distribution of LungCap looks approximately normal.

Code
ggplot(data, aes(x = LungCap)) + geom_histogram(aes(y = after_stat(density)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  b. Compare the probability distribution of LungCap for males and females. (Hint: make boxplots separated by gender using the boxplot() function.)
Code
boxplot(data$LungCap~data$Gender)

  c. Compare the mean lung capacities for smokers and non-smokers. Does it make sense?
Code
boxplot(data$LungCap~data$Smoke)
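
The boxplot shows medians and spread, while the question asks about means. A minimal sketch of the group means with base R's tapply() follows. To stay self-contained it uses only the six rows printed by head(data) above (with LungCap as rounded in that printout), so the numbers are illustrative and say nothing about the full data set. The name `sample_rows` is made up for this example; on the full data the equivalent call would be `tapply(data$LungCap, data$Smoke, mean)`.

```r
# Six rows copied from the head(data) printout above (rounded values);
# illustration only, not the full LungCapData set.
sample_rows <- data.frame(
  LungCap = c(6.48, 10.1, 9.55, 11.1, 4.8, 6.22),
  Smoke   = c("no", "yes", "no", "no", "no", "no")
)

# Mean lung capacity for each smoking status
mean_by_smoke <- tapply(sample_rows$LungCap, sample_rows$Smoke, mean)
mean_by_smoke
```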

  d. Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.
Code
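# Bin Age into the four requested groups. cut() intervals are right-closed,
# so (-Inf,13] is "<=13" and (17,Inf] is ">=18" for whole-number ages.
age_labels <- c("<=13", "14-15", "16-17", ">=18")
data$age_ranges <- cut(data$Age, breaks = c(-Inf, 13, 15, 17, Inf),
                     labels = age_labels)
# Compare smokers and non-smokers side by side within each age group
ggplot(data, aes(x = age_ranges, y = LungCap, fill = Smoke)) +
  geom_boxplot() +
  labs(x = "Age Range", y = "LungCap")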

  e. Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c? What could possibly be going on here?
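
One way to approach this part is to tabulate mean LungCap for every combination of age group and smoking status, and separately to check whether smokers tend to be older than non-smokers. The second check rests on an assumption, not on anything the output above establishes: that age may be driving the overall comparison. The sketch below again uses only the six rows shown by head(data), so its numbers are illustrative; `sample_rows`, `age_group`, `means`, and `age_by_smoke` are names made up for this example.

```r
# Six rows copied from the head(data) printout above; illustration only.
sample_rows <- data.frame(
  LungCap = c(6.48, 10.1, 9.55, 11.1, 4.8, 6.22),
  Age     = c(6, 18, 16, 14, 5, 11),
  Smoke   = c("no", "yes", "no", "no", "no", "no")
)

# Right-closed bins: (-Inf,13], (13,15], (15,17], (17,Inf]
sample_rows$age_group <- cut(sample_rows$Age,
                             breaks = c(-Inf, 13, 15, 17, Inf),
                             labels = c("<=13", "14-15", "16-17", ">=18"))

# Mean LungCap for each observed age group / smoking status combination
means <- aggregate(LungCap ~ age_group + Smoke, data = sample_rows, FUN = mean)
means

# Mean age per smoking status, to see whether one group is older
age_by_smoke <- tapply(sample_rows$Age, sample_rows$Smoke, mean)
age_by_smoke
```

If smokers turn out to be older on the full data, the overall comparison in part c would mix an age effect with a smoking effect, which is why the within-group comparison can point the other way.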

Question 2

  1. Let X = number of prior convictions for prisoners at a state prison at which there are 810 prisoners.

X            0     1     2     3     4
Frequency  128   434   160    64    24

  a. What is the probability that a randomly selected inmate has exactly 2 prior convictions?
Code
n<-810
X<-tibble(x=0:4,
          F=c(128,434,160,64,24))
X
# A tibble: 5 × 2
      x     F
  <int> <dbl>
1     0   128
2     1   434
3     2   160
4     3    64
5     4    24
Code
pa<-160/n
pa
[1] 0.1975309
  b. What is the probability that a randomly selected inmate has fewer than 2 prior convictions?
Code
pb<-(128+434)/n
pb
[1] 0.6938272
  c. What is the probability that a randomly selected inmate has 2 or fewer prior convictions?
Code
pc<-(128+434+160)/n
pc
[1] 0.891358
  d. What is the probability that a randomly selected inmate has more than 2 prior convictions?
Code
pd<-(64+24)/n
pd
[1] 0.108642
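
The four probabilities above are typed in from the table by hand. As a sketch of the same calculation driven by the table itself, the block below builds the probability mass function and its cumulative sum, then reads each answer off by condition on x. The variable names (`freq`, `p`, `cdf`, and the `p_*` results) are chosen for this example only.

```r
# Frequency table from the question
x    <- 0:4
freq <- c(128, 434, 160, 64, 24)

n   <- sum(freq)   # total number of inmates
p   <- freq / n    # P(X = x)
cdf <- cumsum(p)   # P(X <= x)

p_exactly_2 <- p[x == 2]        # part a
p_fewer_2   <- sum(p[x < 2])    # part b
p_at_most_2 <- cdf[x == 2]      # part c
p_more_2    <- 1 - cdf[x == 2]  # part d

c(p_exactly_2, p_fewer_2, p_at_most_2, p_more_2)
```

Parts c and d are complements, so their probabilities sum to 1, which is a quick consistency check on the hand-typed versions.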
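  e. What is the expected value for the number of prior convictions?
Code
# Expected value: each value of X weighted by its probability F / n
p <- X$F / n
ev <- sum(X$x * p)
ev
[1] 1.28642
  f. Calculate the variance and the standard deviation for the Prior Convictions.
Code
# Variance of the distribution: probability-weighted squared deviation from ev
v <- sum((X$x - ev)^2 * p)
v
[1] 0.8562353
Code
sqrt(v)
[1] 0.9253298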
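
For a frequency table, the expected value is the average over inmates, with each value of X counted as many times as it occurs. It is not the average of the frequency counts themselves. One way to sanity-check such a calculation is to expand the table into one entry per inmate with rep() and apply the ordinary mean. This is a sketch with names (`obs`, `ev`, `ev_w`, `var_pop`, `var_samp`) made up for the example.

```r
# Frequency table from the question
x    <- 0:4
freq <- c(128, 434, 160, 64, 24)

# One entry per inmate: 128 zeros, 434 ones, 160 twos, ...
obs <- rep(x, times = freq)

ev   <- mean(obs)                   # equals sum(x * freq) / sum(freq)
ev_w <- weighted.mean(x, w = freq)  # same value without expanding the table

var_pop  <- mean((obs - ev)^2)      # divides by n: variance of the distribution
var_samp <- var(obs)                # divides by n - 1: sample variance

c(ev, var_pop, sqrt(var_pop))
```

Note that var() and sd() use the n - 1 denominator, so on the expanded vector they come out slightly larger than the distribution's variance and standard deviation.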