Hw 1 by Kristin Abijaoude

LungCap and Prison data HW1

Kristin Abijaoude


February 27, 2023


Lung Capacity

library(readxl)     # read_excel()
library(tidyverse)  # %>%, dplyr verbs, ggplot2

LungCapData <- read_excel("~/Documents/GitHub/Github Help/603_Spring_2023/posts/_data/LungCapData.xls")
LungCapData
# A tibble: 725 × 6
   LungCap   Age Height Smoke Gender Caesarean
     <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
 1    6.48     6   62.1 no    male   no       
 2   10.1     18   74.7 yes   female no       
 3    9.55    16   69.7 no    female yes      
 4   11.1     14   71   no    male   no       
 5    4.8      5   56.9 no    male   no       
 6    6.22    11   58.7 no    female no       
 7    4.95     8   63.3 no    male   yes      
 8    7.32    11   70.4 no    male   no       
 9    8.88    15   70.5 no    male   no       
10    6.8     11   59.2 no    male   no       
# … with 715 more rows
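library(summarytools)  # dfSummary()

# The opening line of this call is missing from the source; dfSummary(LungCapData, ...) is assumed
print(dfSummary(LungCapData,
                varnumbers   = FALSE,
                plain.ascii  = FALSE,
                style        = "grid",
                graph.magnif = 0.70,
                valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')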

Data Frame Summary

Dimensions: 725 x 6
Duplicates: 0

Variable                Stats / Values                        Freqs (% of Valid)     Missing
LungCap [numeric]       Mean (sd): 7.9 (2.7)                  342 distinct values    0 (0.0%)
                        min ≤ med ≤ max: 0.5 ≤ 8 ≤ 14.7
                        IQR (CV): 3.7 (0.3)
Age [numeric]           Mean (sd): 12.3 (4)                   17 distinct values     0 (0.0%)
                        min ≤ med ≤ max: 3 ≤ 13 ≤ 19
                        IQR (CV): 6 (0.3)
Height [numeric]        Mean (sd): 64.8 (7.2)                 274 distinct values    0 (0.0%)
                        min ≤ med ≤ max: 45.3 ≤ 65.4 ≤ 81.8
                        IQR (CV): 10.4 (0.1)
Smoke [character]       1. no                                 648 (89.4%)            0 (0.0%)
                        2. yes                                 77 (10.6%)
Gender [character]      1. female                             358 (49.4%)            0 (0.0%)
                        2. male                               367 (50.6%)
Caesarean [character]   1. no                                 561 (77.4%)            0 (0.0%)
                        2. yes                                164 (22.6%)

[1] "LungCap"   "Age"       "Height"    "Smoke"     "Gender"    "Caesarean"
[1] 725   6

1a. The distribution of LungCap looks approximately normal, with values between 6 and 9 being the most frequent.

boxplot(LungCap ~ Gender, data=LungCapData)

1b. Separating the two genders, males appear to have a somewhat higher lung capacity than females.

LungCapData %>%
  group_by(Smoke) %>%
  summarise(mean = mean(LungCap), n = n())
# A tibble: 2 × 3
  Smoke  mean     n
  <chr> <dbl> <int>
1 no     7.77   648
2 yes    8.65    77

1c. The mean lung capacity is about 7.77 for non-smokers and 8.65 for smokers. In other words, smokers have a higher average lung capacity than non-smokers. At first this doesn't make sense, because smoking is supposed to be bad for your lungs; a likely explanation is age, since smokers tend to be older and lung capacity grows with age.

agegroup <- LungCapData %>%
  mutate(agegroup = case_when(Age <= 13 ~ "Less than 13 years old",
                              Age == 14| Age == 15 ~ "14 to 15 years old",
                              Age == 16 | Age == 17 ~ "16 to 17 years old",
                              Age >= 18 ~ "18 years old and older"))
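agegroup %>%
  ggplot(aes(x=LungCap, fill=Smoke)) +
  geom_histogram() +
  # the final layer is cut off in the source; faceting by age group is assumed from the 1d discussion
  facet_wrap(vars(agegroup))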

1d. Older teens are more likely to smoke, and also have higher lung capacity, than younger children. The vast majority of children 13 and younger are non-smokers (I would be horrified at the sight of a kid smoking).

agegroup <- agegroup %>%
  mutate(AgeGroup = factor(agegroup, levels = c("Less than 13 years old",
                                                "14 to 15 years old",
                                                "16 to 17 years old",
                                                "18 years old and older")))

boxplot(LungCap ~ AgeGroup, data=agegroup)

1e. Lung capacity is positively associated with age: it increases from each age group to the next.

Prior Convictions

The second dataset, which I created here, deals with prior convictions. The sample is 810 prisoners in a state prison; some are there for the first time, while others have as many as 4 prior convictions. prior is the number of prior convictions. freq is how many prisoners have that number of prior convictions (434 prisoners have 1 prior conviction, 160 prisoners have 2 prior convictions, and so on). Finally, I created a new variable called probability by dividing freq by the total number of prisoners; it gives the probability that a randomly selected prisoner has a given number of prior convictions.

df <- data.frame(prior = c(0:4),
                 freq = c(128, 434, 160, 64, 24))

df <- df %>%
  mutate(probability = freq/810)
df
  prior freq probability
1     0  128  0.15802469
2     1  434  0.53580247
3     2  160  0.19753086
4     3   64  0.07901235
5     4   24  0.02962963
# alternatively
(dbinom(x = 1, size = 1, prob = 160/810))*100
[1] 19.75309

2a. There is about a 19.8% probability that a randomly selected inmate has exactly 2 prior convictions.

128 + 434
[1] 562
(dbinom(x = 1, size = 1, prob = 562/810))*100
[1] 69.38272

2b. There is a 69% probability that a randomly selected inmate has fewer than 2 prior convictions.

128 + 434 + 160
[1] 722
(dbinom(x = 1, size = 1, prob = 722/810))*100
[1] 89.1358

2c. There is an 89% probability that a randomly selected inmate has 2 or fewer prior convictions.

64 + 24
[1] 88
(dbinom(x = 1, size = 1, prob = 88/810))*100
[1] 10.8642

2d. There is about an 11% probability that a randomly selected inmate has more than 2 prior convictions.
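
Each of the four answers above is a sum of rows from the probability column, so they can also be read straight off the table. A minimal base-R sketch that rebuilds the same counts; the names p_exactly_2, p_fewer_2, p_at_most_2, and p_more_2 are illustrative and not part of the original homework:

```r
# Counts of inmates by number of prior convictions (same values as df above)
prior <- 0:4
freq  <- c(128, 434, 160, 64, 24)
prob  <- freq / sum(freq)   # sum(freq) is 810

p_exactly_2 <- sum(prob[prior == 2])   # 2a
p_fewer_2   <- sum(prob[prior <  2])   # 2b
p_at_most_2 <- sum(prob[prior <= 2])   # 2c
p_more_2    <- sum(prob[prior >  2])   # 2d

# Percentages, rounded to two decimals
round(100 * c(p_exactly_2, p_fewer_2, p_at_most_2, p_more_2), 2)
```

Because 2c and 2d are complementary events, p_at_most_2 and p_more_2 should add up to 1, which is a quick sanity check on the arithmetic.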

prior <- df$prior
prob <- df$probability
freq <- df$freq

exval <- sum(prior*prob)
exval
[1] 1.28642

2e. The expected value exval, or long-term mean, is about 1.29 prior convictions. I pulled each column into its own vector, multiplied prior (the number of prior convictions) by prob (the probability that a prisoner has that number), and summed the products.
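
Written out by hand, the expected value is the probability-weighted sum of the number of prior convictions. Using the counts from the table, with X standing for the number of priors of a randomly selected inmate (the symbol X is introduced here for illustration):

```latex
E[X] = \sum_{x=0}^{4} x \, P(X = x)
     = \frac{0(128) + 1(434) + 2(160) + 3(64) + 4(24)}{810}
     = \frac{1042}{810} \approx 1.286
```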

# variance
var(rep(df$prior, df$freq))
[1] 0.8572937
# standard deviation
sd(rep(df$prior, df$freq))
[1] 0.9259016

The variance is 0.86 and the standard deviation is 0.93. With a mean of about 1.29, a typical inmate's number of prior convictions is within roughly one conviction of the mean, so most inmates have 0, 1, or 2 priors.
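
One caveat, sketched in base R: var() and sd() use the sample formulas with an n - 1 denominator, while the variance of the probability distribution itself weights the squared deviations by prob. With 810 inmates the two are nearly identical. The names pop_var and samp_var are illustrative and not part of the original homework:

```r
prior <- 0:4
freq  <- c(128, 434, 160, 64, 24)
prob  <- freq / sum(freq)
n     <- sum(freq)                        # 810 inmates

mu       <- sum(prior * prob)             # expected value
pop_var  <- sum((prior - mu)^2 * prob)    # variance of the distribution
samp_var <- var(rep(prior, freq))         # n - 1 denominator, as in the code above

c(pop_var = pop_var, samp_var = samp_var)
sqrt(pop_var)                             # standard deviation of the distribution
```

The two variances differ only by the factor (n - 1)/n, so pop_var equals samp_var multiplied by 809/810 and the conclusions above are unchanged.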

