Hw 1 by Kristin Abijaoude

Hw1
kristin abijaoude
descriptive statistics
probability
LungCap and Prison data HW1
Author

Kristin Abijaoude

Published

February 27, 2023

Code
library(ggplot2)
library(readxl)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(summarytools)
library(stats)

Lung Capacity

Code
LungCapData <- read_excel("~/Documents/GitHub/Github Help/603_Spring_2023/posts/_data/LungCapData.xls")
LungCapData
# A tibble: 725 × 6
   LungCap   Age Height Smoke Gender Caesarean
     <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
 1    6.48     6   62.1 no    male   no       
 2   10.1     18   74.7 yes   female no       
 3    9.55    16   69.7 no    female yes      
 4   11.1     14   71   no    male   no       
 5    4.8      5   56.9 no    male   no       
 6    6.22    11   58.7 no    female no       
 7    4.95     8   63.3 no    male   yes      
 8    7.32    11   70.4 no    male   no       
 9    8.88    15   70.5 no    male   no       
10    6.8     11   59.2 no    male   no       
# … with 715 more rows
Code
print(dfSummary(LungCapData,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

LungCapData

Dimensions: 725 x 6
Duplicates: 0
Variable                Stats / Values            Freqs (% of Valid)     Missing
LungCap [numeric]       Mean (sd) : 7.9 (2.7)     342 distinct values    0 (0.0%)
                        min ≤ med ≤ max:
                        0.5 ≤ 8 ≤ 14.7
                        IQR (CV) : 3.7 (0.3)
Age [numeric]           Mean (sd) : 12.3 (4)      17 distinct values     0 (0.0%)
                        min ≤ med ≤ max:
                        3 ≤ 13 ≤ 19
                        IQR (CV) : 6 (0.3)
Height [numeric]        Mean (sd) : 64.8 (7.2)    274 distinct values    0 (0.0%)
                        min ≤ med ≤ max:
                        45.3 ≤ 65.4 ≤ 81.8
                        IQR (CV) : 10.4 (0.1)
Smoke [character]       1. no                     648 (89.4%)            0 (0.0%)
                        2. yes                     77 (10.6%)
Gender [character]      1. female                 358 (49.4%)            0 (0.0%)
                        2. male                   367 (50.6%)
Caesarean [character]   1. no                     561 (77.4%)            0 (0.0%)
                        2. yes                    164 (22.6%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-03-16

Code
colnames(LungCapData)
[1] "LungCap"   "Age"       "Height"    "Smoke"     "Gender"    "Caesarean"
Code
dim(LungCapData)
[1] 725   6
Code
hist(LungCapData$LungCap)

1a. The distribution looks approximately normal to me, with lung capacities between 6 and 9 being the most frequent.

Code
boxplot(LungCap ~ Gender, data=LungCapData)

1b. Separating the two genders, it looks like males tend to have a higher lung capacity than females.

Code
LungCapData %>%
  group_by(Smoke) %>%
  summarise(mean = mean(LungCap), n = n())
# A tibble: 2 × 3
  Smoke  mean     n
  <chr> <dbl> <int>
1 no     7.77   648
2 yes    8.65    77

1c. The average lung capacity for non-smokers is around 7.77, while for smokers it’s 8.65. In other words, on average, the smokers have a higher lung capacity than the non-smokers… this doesn’t make sense at first, because smoking is supposed to be bad for your lungs. A likely explanation is that the smokers in this sample are older, and lung capacity increases with age (see 1d and 1e).
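One way to probe the puzzle in 1c is to check whether smokers and non-smokers differ in age. The sketch below rebuilds only the ten rows printed in the LungCapData preview (values rounded as displayed) into a data frame called `first_ten`, a name introduced here for illustration. With a single smoker among those rows it demonstrates the technique and does not settle the question; the same `summarise()` call can be run on the full `LungCapData`.

```r
library(dplyr)

# First ten rows of LungCapData as printed in the preview (rounded as displayed).
# `first_ten` is an illustrative name, not an object from the original analysis.
first_ten <- data.frame(
  LungCap = c(6.48, 10.1, 9.55, 11.1, 4.8, 6.22, 4.95, 7.32, 8.88, 6.8),
  Age     = c(6, 18, 16, 14, 5, 11, 8, 11, 15, 11),
  Smoke   = c("no", "yes", "no", "no", "no", "no", "no", "no", "no", "no")
)

# Report mean age next to mean lung capacity for each smoking group.
by_smoke <- first_ten %>%
  group_by(Smoke) %>%
  summarise(mean_age = mean(Age), mean_lungcap = mean(LungCap), n = n())
by_smoke
```

If the smokers turn out to be noticeably older on the full data, age is a plausible confounder for the difference in mean lung capacity.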

Code
# Bin ages into the four groups; the first group includes 13-year-olds
agegroup <- LungCapData %>%
  mutate(agegroup = case_when(Age <= 13 ~ "13 years old and younger",
                              Age == 14 | Age == 15 ~ "14 to 15 years old",
                              Age == 16 | Age == 17 ~ "16 to 17 years old",
                              Age >= 18 ~ "18 years old and older"))
agegroup %>%
  ggplot(aes(x=LungCap, fill=Smoke)) +
  geom_histogram() +
  facet_wrap(~agegroup)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1d. Older teens are more likely to be smokers, and also have higher lung capacity, than younger children. The vast majority of children 13 years old and younger are non-smokers (I would be horrified at the sight of a kid smoking).

Code
# Convert to an ordered set of factor levels so the boxplots appear youngest to oldest
agegroup <- agegroup %>%
  mutate(AgeGroup = factor(agegroup, levels = c("13 years old and younger", 
                                                "14 to 15 years old",
                                                "16 to 17 years old",
                                                "18 years old and older")))

boxplot(LungCap ~ AgeGroup, data=agegroup)

1e. There is a positive relationship between age and lung capacity: lung capacity increases as the person gets older.
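The relationship described in 1e can also be quantified with a correlation coefficient. As a minimal sketch, the block below uses only the ten `Age` and `LungCap` values shown in the LungCapData preview (rounded as displayed); the vector names `age`, `lungcap`, and `r` are introduced here. On the full data the equivalent call would be `cor(LungCapData$Age, LungCapData$LungCap)`.

```r
# Age and LungCap for the ten rows printed in the LungCapData preview.
age     <- c(6, 18, 16, 14, 5, 11, 8, 11, 15, 11)
lungcap <- c(6.48, 10.1, 9.55, 11.1, 4.8, 6.22, 4.95, 7.32, 8.88, 6.8)

# Pearson correlation: values near +1 indicate a strong positive linear association.
r <- cor(age, lungcap)
r
```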

Prior Convictions

Another dataset I created here deals with prior convictions. The sample is 810 prisoners in a state prison. Some of the prisoners are there for the first time, while others have been imprisoned as many as 4 times before, or in other words have 4 prior convictions. prior is the number of prior convictions. freq is how many prisoners have that number of prior convictions (434 prisoners have 1 prior conviction, 160 prisoners have 2 prior convictions, etc.). Finally, I created a new variable called probability, where I divided the freq variable by the total number of prisoners, to denote the probability that a randomly selected prisoner has a certain number of prior convictions.

Code
df <- data.frame(prior = 0:4,                      # number of prior convictions
                 freq = c(128, 434, 160, 64, 24)   # prisoners with that many priors
                 )

df <- df %>%
  mutate(probability = freq / sum(freq))   # sum(freq) is 810
df
  prior freq probability
1     0  128  0.15802469
2     1  434  0.53580247
3     2  160  0.19753086
4     3   64  0.07901235
5     4   24  0.02962963
Code
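# alternatively: one Bernoulli trial (size = 1) with success probability 160/810;
# dbinom() simply returns that probability, shown here as a percentage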
(dbinom(x = 1, size = 1, prob = 160/810))*100
[1] 19.75309

2a. There is about a 19.8% probability that a randomly selected inmate has exactly 2 prior convictions.

Code
128 + 434
[1] 562
Code
(dbinom(x = 1, size = 1, prob = 562/810))*100
[1] 69.38272

2b. There is about a 69.4% probability that a randomly selected inmate has fewer than 2 prior convictions.

Code
128 + 434 + 160
[1] 722
Code
(dbinom(x = 1, size = 1, prob = 722/810))*100
[1] 89.1358

2c. There is about an 89.1% probability that a randomly selected inmate has 2 or fewer prior convictions.

Code
64 + 24
[1] 88
Code
(dbinom(x = 1, size = 1, prob = 88/810))*100
[1] 10.8642

2d. There is about a 10.9% probability that a randomly selected inmate has more than 2 prior convictions.
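The four answers in 2a to 2d can also be read straight off the probability column by summing over the matching values of `prior`, without going through `dbinom()`. A minimal sketch, rebuilding the table so it runs on its own; the names `p`, `p_exactly_2`, `p_fewer_2`, `p_at_most_2`, and `p_more_than_2` are introduced here:

```r
# Prior-conviction counts from the table above.
prior <- 0:4
freq  <- c(128, 434, 160, 64, 24)
p     <- freq / sum(freq)   # sum(freq) is 810

# Each event is a sum of probabilities over the matching values of `prior`.
p_exactly_2   <- sum(p[prior == 2])  # 2a
p_fewer_2     <- sum(p[prior <  2])  # 2b
p_at_most_2   <- sum(p[prior <= 2])  # 2c
p_more_than_2 <- sum(p[prior >  2])  # 2d

round(100 * c(p_exactly_2, p_fewer_2, p_at_most_2, p_more_than_2), 2)
# [1] 19.75 69.38 89.14 10.86
```

As a sanity check, 2c and 2d are complements, so their probabilities add up to 1.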

Code
prior <- df$prior
prob <- df$probability
freq <- df$freq

exval <- sum(prior*prob)
exval
[1] 1.28642

2e. The expected value exval, or long-term mean, is about 1.286 prior convictions. I pulled the columns out into their own vectors, multiplied prior (# of prior convictions) by prob (the probability a given prisoner has that number of prior convictions), and summed the products.
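Written out, the calculation in 2e is the usual expected value of a discrete random variable. Here X is a symbol introduced for this sketch to denote the number of prior convictions of a randomly selected inmate:

```latex
\[
E[X] = \sum_{x=0}^{4} x \, P(X = x)
     = \frac{0(128) + 1(434) + 2(160) + 3(64) + 4(24)}{810}
     = \frac{1042}{810} \approx 1.286
\]
```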

Code
# variance
var(rep(df$prior, df$freq))
[1] 0.8572937
Code
# standard deviation
sd(rep(df$prior, df$freq))
[1] 0.9259016

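The variance is 0.86, which means the number of prior convictions does not vary much from inmate to inmate.

The standard deviation is 0.93, which means a typical inmate's number of prior convictions differs from the mean (about 1.29) by roughly one conviction.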
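Note that `var()` and `sd()` divide by n - 1, which treats the 810 inmates as a sample. If the table is instead read as a complete probability distribution, the variance is the probability-weighted squared deviation from the expected value. A minimal sketch under that assumption (the names `mu`, `var_pop`, and `sd_pop` are introduced here):

```r
prior <- 0:4
freq  <- c(128, 434, 160, 64, 24)
p     <- freq / sum(freq)

mu      <- sum(prior * p)            # expected value, about 1.286
var_pop <- sum((prior - mu)^2 * p)   # population variance, about 0.856
sd_pop  <- sqrt(var_pop)             # population standard deviation, about 0.925

# The sample variance from var() is larger by a factor of n / (n - 1).
n <- sum(freq)
all.equal(var_pop * n / (n - 1), var(rep(prior, freq)))
```

With n = 810 the two versions differ only in the third decimal place, so the interpretation is unchanged.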

Code
# render() comes from the rmarkdown package, so call it with its namespace
# (assumes rmarkdown is installed); run this from the console, not from inside the document
rmarkdown::render("Kristin_Abijaoude_HW1.qmd", output_format = "pdf_document", output_file = "Kristin_Abijaoude_HW1.pdf")