Hw 1 by Kristin Abijaoude

Hw1

kristin abijaoude

descriptive statistics

probability

LungCap and Prison data HW1

Author

Kristin Abijaoude

Published

February 27, 2023

Code

library(ggplot2)
library(readxl)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Code

library(summarytools)
library(stats)

Lung Capacity

Code

LungCapData <- read_excel("~/Documents/GitHub/Github Help/603_Spring_2023/posts/_data/LungCapData.xls")
LungCapData

# A tibble: 725 × 6
   LungCap   Age Height Smoke Gender Caesarean
     <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
 1    6.48     6   62.1 no    male   no       
 2   10.1     18   74.7 yes   female no       
 3    9.55    16   69.7 no    female yes      
 4   11.1     14   71   no    male   no       
 5    4.8      5   56.9 no    male   no       
 6    6.22    11   58.7 no    female no       
 7    4.95     8   63.3 no    male   yes      
 8    7.32    11   70.4 no    male   no       
 9    8.88    15   70.5 no    male   no       
10    6.8     11   59.2 no    male   no       
# … with 715 more rows

Code

print(dfSummary(LungCapData,
                varnumbers   = FALSE,
                plain.ascii  = FALSE,
                style        = "grid",
                graph.magnif = 0.70,
                valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

LungCapData

Dimensions: 725 x 6
Duplicates: 0

Variable              Stats / Values           Freqs (% of Valid)     Missing

LungCap [numeric]     Mean (sd) : 7.9 (2.7)    342 distinct values    0 (0.0%)
                      min ≤ med ≤ max:
                      0.5 ≤ 8 ≤ 14.7
                      IQR (CV) : 3.7 (0.3)

Age [numeric]         Mean (sd) : 12.3 (4)     17 distinct values     0 (0.0%)
                      min ≤ med ≤ max:
                      3 ≤ 13 ≤ 19
                      IQR (CV) : 6 (0.3)

Height [numeric]      Mean (sd) : 64.8 (7.2)   274 distinct values    0 (0.0%)
                      min ≤ med ≤ max:
                      45.3 ≤ 65.4 ≤ 81.8
                      IQR (CV) : 10.4 (0.1)

Smoke [character]     1. no                    648 (89.4%)            0 (0.0%)
                      2. yes                   77 (10.6%)

Gender [character]    1. female                358 (49.4%)            0 (0.0%)
                      2. male                  367 (50.6%)

Caesarean [character] 1. no                    561 (77.4%)            0 (0.0%)
                      2. yes                   164 (22.6%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-03-16

Code

colnames(LungCapData)

[1] "LungCap"   "Age"       "Height"    "Smoke"     "Gender"    "Caesarean"

Code

dim(LungCapData)

[1] 725   6

Code

hist(LungCapData$LungCap)

1a. The distribution of lung capacity looks approximately normal to me, with values between 6 and 9 being the most frequent.

Code

boxplot(LungCap ~ Gender, data=LungCapData)

1b. Comparing the two genders, it looks like males have a somewhat higher lung capacity than females.

Code

LungCapData %>%
  group_by(Smoke) %>%
  summarise(mean = mean(LungCap), n = n())

# A tibble: 2 × 3
  Smoke  mean     n
  <chr> <dbl> <int>
1 no     7.77   648
2 yes    8.65    77

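1c. The average lung capacity for non-smokers is about 7.77, while for smokers it is 8.65. In other words, smokers have a higher average lung capacity than non-smokers. This doesn't make sense at first glance, because smoking is supposed to be bad for your lungs. Note that the two groups are very unequal in size (648 non-smokers versus 77 smokers), so the comparison may be picking up some other difference between the groups.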
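One way to probe a puzzling comparison like this is to repeat it within age groups. The sketch below uses a small made-up data frame (the name toy and all of its values are hypothetical and are not taken from LungCapData) to show how smokers can have the higher overall mean while having the lower mean inside every age group, simply because most smokers are older. Whether the real data behave this way is something the age-group plots have to show.

```r
# Hypothetical toy data, NOT the LungCap data: the values are made up only
# to illustrate how an age imbalance can flip a comparison
toy <- data.frame(
  AgeGroup = c(rep("younger", 5), rep("older", 5)),
  Smoke    = c("no", "no", "no", "no", "yes",
               "no", "no", "yes", "yes", "yes"),
  LungCap  = c(6.0, 6.5, 7.0, 7.5, 6.0,
               10.5, 11.0, 9.5, 10.0, 10.5)
)

# Overall means by smoking status: smokers come out higher,
# because three of the four smokers sit in the older group
overall <- aggregate(LungCap ~ Smoke, data = toy, FUN = mean)

# Means within each age group: smokers are lower in both groups
by_age <- aggregate(LungCap ~ AgeGroup + Smoke, data = toy, FUN = mean)

overall
by_age
```

With the real data, the same two aggregate() calls could be pointed at the age-grouped data frame instead of toy.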

Code

# Bin Age into four groups; the first group includes 13-year-olds (Age <= 13)
agegroup <- LungCapData %>%
  mutate(agegroup = case_when(Age <= 13 ~ "13 years old and younger",
                              Age == 14 | Age == 15 ~ "14 to 15 years old",
                              Age == 16 | Age == 17 ~ "16 to 17 years old",
                              Age >= 18 ~ "18 years old and older"))
agegroup %>%
  ggplot(aes(x=LungCap, fill=Smoke)) +
  geom_histogram() +
  facet_wrap(~agegroup)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1d. Older teens are more likely to be smokers, and they also have higher lung capacity, than younger children. The vast majority of those 13 years and younger are non-smokers (I would be horrified at the sight of a kid smoking). This likely explains the odd result in 1c: smokers are concentrated in the older age groups, where lung capacity is higher anyway.

Code

# Turn the labels into an ordered factor so the boxplot shows the groups by age
# rather than alphabetically
agegroup <- agegroup %>%
  mutate(AgeGroup = factor(agegroup, levels = c("13 years old and younger",
                                                "14 to 15 years old",
                                                "16 to 17 years old",
                                                "18 years old and older")))

boxplot(LungCap ~ AgeGroup, data=agegroup)

1e. There is a positive relationship between age and lung capacity: lung capacity increases from each age group to the next.

Prior Convictions

The second dataset, which I created here, deals with prior convictions. The sample is 810 prisoners in a state prison. Some of the prisoners are there for the first time, while others have as many as 4 prior convictions. prior is the number of prior convictions. freq is how many prisoners have that number of prior convictions (434 prisoners have 1 prior conviction, 160 prisoners have 2 prior convictions, etc.). Finally, I created a new variable called probability by dividing freq by the total number of prisoners; it gives the probability that a randomly selected prisoner has that number of prior convictions.

Code

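# Number of prior convictions and how many inmates have each count
df <- data.frame(prior = 0:4,
                 freq = c(128, 434, 160, 64, 24)
                 )

# Probability of each count: frequency divided by the total (810 inmates)
df <- df %>%
  mutate(probability = freq / sum(freq))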
df

  prior freq probability
1     0  128  0.15802469
2     1  434  0.53580247
3     2  160  0.19753086
4     3   64  0.07901235
5     4   24  0.02962963

Code

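# alternatively: with x = 1 and size = 1, dbinom() returns prob itself,
# so this is simply 160/810 expressed as a percentage
(dbinom(x = 1, size = 1, prob = 160/810))*100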

[1] 19.75309

2a. There is about a 19.8% probability that a randomly selected inmate has exactly 2 prior convictions.
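A side note on the dbinom() call used here: with size = 1 and x = 1, the binomial density reduces to the success probability itself, so the call hands back prob unchanged. The minimal sketch below checks that identity for the 160-out-of-810 case. The names p_two and via_dbinom are illustrative and not part of the original analysis.

```r
# Share of inmates with exactly 2 prior convictions (160 of 810)
p_two <- 160 / 810

# One Bernoulli trial: P(1 success in 1 trial) is just prob
via_dbinom <- dbinom(x = 1, size = 1, prob = p_two)

# Both routes give the same value, shown here as a percentage
round(c(direct = p_two, dbinom = via_dbinom) * 100, 2)
```

So dividing the frequency by 810 is already the full calculation, and dbinom() adds nothing to it in this setting.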

Code

128 + 434

[1] 562

Code

(dbinom(x = 1, size = 1, prob = 562/810))*100

[1] 69.38272

2b. There is about a 69.4% probability that a randomly selected inmate has fewer than 2 prior convictions (that is, 0 or 1).

Code

128 + 434 + 160

[1] 722

Code

(dbinom(x = 1, size = 1, prob = 722/810))*100

[1] 89.1358

2c. There is about an 89.1% probability that a randomly selected inmate has 2 or fewer prior convictions.

Code

64 + 24

[1] 88

Code

(dbinom(x = 1, size = 1, prob = 88/810))*100

[1] 10.8642

2d. There is about a 10.9% probability that a randomly selected inmate has more than 2 prior convictions.
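The four answers in 2a to 2d can also be read off a running total of the probabilities, which avoids adding frequencies by hand. This is a sketch in base R that rebuilds the frequency table from the text. The variable names (cum_prob, p_exactly_2, and so on) are my own and not from the original code.

```r
# Frequency table from the text: number of prior convictions and inmate counts
prior <- 0:4
freq  <- c(128, 434, 160, 64, 24)
prob  <- freq / sum(freq)

# Running total of the probabilities: the entry for prior == k is P(X <= k)
cum_prob <- cumsum(prob)

p_exactly_2    <- prob[prior == 2]       # 2a: P(X = 2)
p_fewer_than_2 <- cum_prob[prior == 1]   # 2b: P(X <= 1)
p_at_most_2    <- cum_prob[prior == 2]   # 2c: P(X <= 2)
p_more_than_2  <- 1 - p_at_most_2        # 2d: complement of 2c

# Express all four as percentages
round(100 * c(p_exactly_2, p_fewer_than_2, p_at_most_2, p_more_than_2), 2)
```

Writing 2d as the complement of 2c makes it explicit that the two probabilities must add up to 1.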

Code

prior <- df$prior
prob <- df$probability
freq <- df$freq

exval <- sum(prior*prob)
exval

[1] 1.28642

2e. The expected value exval, or long-term mean, is 1.28642 prior convictions. I pulled each column into its own vector, multiplied each value of prior (the number of prior convictions) by prob (the probability that a prisoner has that number of prior convictions), and summed the products.
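As a cross-check on the expected value, the same number can be reached in two other ways: as a frequency-weighted mean, and as the plain mean of the data expanded to one entry per inmate. The sketch below assumes only the frequency table given in the text. The names ev_sum, ev_weighted, and ev_expanded are illustrative.

```r
# Frequency table from the text
prior <- 0:4
freq  <- c(128, 434, 160, 64, 24)
prob  <- freq / sum(freq)

# Expected value as a probability-weighted sum: sum of x * P(X = x)
ev_sum <- sum(prior * prob)

# Same quantity as a frequency-weighted mean (weighted.mean normalises the weights)
ev_weighted <- weighted.mean(prior, w = freq)

# Same quantity as the ordinary mean of the expanded data, one entry per inmate
ev_expanded <- mean(rep(prior, times = freq))

c(ev_sum, ev_weighted, ev_expanded)
```

All three agree because weighting each value by its probability is the same as averaging over every inmate individually.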

Code

# variance
var(rep(df$prior, df$freq))

[1] 0.8572937

Code

# standard deviation
sd(rep(df$prior, df$freq))

[1] 0.9259016

The variance is 0.86, which means the numbers of prior convictions are fairly close to one another.

The standard deviation is 0.93, which means the counts typically differ from the mean of 1.29 by a little under one conviction.
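One detail worth spelling out: var() and sd() treat the 810 inmates as a sample and divide by n - 1. If the table is instead read as a probability distribution, the variance divides by n, which gives a very slightly smaller number. The sketch below compares the two. The names var_dist and var_sample are my own, and which version is appropriate depends on whether the 810 inmates are viewed as a sample or as the whole population.

```r
# Frequency table from the text
prior <- 0:4
freq  <- c(128, 434, 160, 64, 24)
n     <- sum(freq)
prob  <- freq / n

# Mean of the distribution (the expected value from 2e)
mu <- sum(prior * prob)

# Variance of the probability distribution: effectively divides by n
var_dist <- sum((prior - mu)^2 * prob)

# Sample variance, as returned by var(): divides by n - 1
var_sample <- var(rep(prior, times = freq))

# The two differ only by the factor n / (n - 1)
all.equal(var_sample, var_dist * n / (n - 1))

# Corresponding standard deviations
c(sd_dist = sqrt(var_dist), sd_sample = sqrt(var_sample))
```

With n = 810 the correction factor is so close to 1 that both versions round to the same 0.86 and 0.93 quoted above.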

Code

# render() lives in the rmarkdown package, so it needs the rmarkdown:: prefix
# (or library(rmarkdown)); run this from the console, not from inside the document
rmarkdown::render("Kristin_Abijaoude_HW1.qmd", output_format = "pdf_document", output_file = "Kristin_Abijaoude_HW1.pdf")