Code
library(tidyverse)
library(readxl)
::opts_chunk$set(echo = TRUE) knitr
Sarah McAlpine
October 23, 2022
# A tibble: 6 × 3
Pathogen est_cases total_cost
<chr> <dbl> <dbl>
1 Campylobacter spp. (all species) 845024 2181485783.
2 Clostridium perfringens 965958 384277856.
3 Cryptosporidium spp. (all species) 57616 58394152.
4 Cyclospora cayetanensis 11407 2571518.
5 Listeria monocytogenes 1591 3189686110.
6 Norovirus 5461731 2566984191.
With this initial read-in, I can see that the data contain 15 pathogen groups, the estimated mean of cases in 2018, and the total cost of those cases, adjusted for inflation per the notes. From what I understand, these 15 groups are distinct, and so I expect I can trust the values without needing to disentangle them from one another (i.e. if all species of Campylobacter and a subspecies were both listed). I will want to mutate the data to find the average cost per case so that I can arrange them in the order of most expensive per case.
pathogen <- read_xlsx("_data/Total_cost_for_top_15_pathogens_2018.xlsx",
col_names = c("Pathogen","est_cases","total_cost"),
range = "A6:C20") %>%
# add cost per case variable
mutate(cost_case = total_cost/est_cases) %>%
# arrange in descending order
arrange(desc(cost_case))
path_ranked <- print(pathogen)
# A tibble: 15 × 4
Pathogen est_c…¹ total…² cost_…³
<chr> <dbl> <dbl> <dbl>
1 Vibrio vulnificus 96 3.59e8 3.74e6
2 Listeria monocytogenes 1591 3.19e9 2.00e6
3 Toxoplasma gondii 86686 3.74e9 4.32e4
4 Shiga toxin-producing Escherichia coli O157 (STEC O1… 63153 3.11e8 4.93e3
5 Vibrio non-cholera species other than V. parahaemoly… 17564 8.17e7 4.65e3
6 Salmonella (non-typhoidal species) 1027561 4.14e9 4.03e3
7 Yersinia enterocolitica 97656 3.13e8 3.21e3
8 Campylobacter spp. (all species) 845024 2.18e9 2.58e3
9 Vibrio parahaemolyticus 34664 4.57e7 1.32e3
10 Shigella (all species) 131254 1.59e8 1.21e3
11 Cryptosporidium spp. (all species) 57616 5.84e7 1.01e3
12 Norovirus 5461731 2.57e9 4.70e2
13 Clostridium perfringens 965958 3.84e8 3.98e2
14 non-O157 Shiga toxin-producing Escherichia coli (STE… 112752 3.17e7 2.81e2
15 Cyclospora cayetanensis 11407 2.57e6 2.25e2
# … with abbreviated variable names ¹est_cases, ²total_cost, ³cost_case
This plot shows the average cost per case of the 15 pathogens in the data. I elected to display these as scale_x_sqrt
so that the differences among the lower 13 were distinguishable. I am not sure how to preserve the ranked order I got from my clean read-in and mutate chunk.
Warning: The following aesthetics were dropped during statistical transformation: label
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
---
title: "Sarah McAlpine - Challenge 5"
author: "Sarah McAlpine"
desription: "Homework Challenge 5"
date: "10/23/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_5
- sarahmcalpine
- pathogens
- ggplot
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE)
```
## Reading in Pathogen data
```{r}
#initial read-in for summary
pathogen_orig <- read_xlsx("_data/Total_cost_for_top_15_pathogens_2018.xlsx",
col_names = c("Pathogen","est_cases","total_cost"),
range = "A6:C20")
head(pathogen_orig)
#tail(pathogen_orig)
```
With this initial read-in, I can see that the data contain 15 pathogen groups, the estimated mean of cases in 2018, and the total cost of those cases, adjusted for inflation per the notes. From what I understand, these 15 groups are distinct, and so I expect I can trust the values without needing to disentangle them from one another (i.e. if all species of Campylobacter and a subspecies were both listed). I will want to mutate the data to find the average cost per case so that I can arrange them in the order of most expensive per case.
```{r}
pathogen <- read_xlsx("_data/Total_cost_for_top_15_pathogens_2018.xlsx",
col_names = c("Pathogen","est_cases","total_cost"),
range = "A6:C20") %>%
# add cost per case variable
mutate(cost_case = total_cost/est_cases) %>%
# arrange in descending order
arrange(desc(cost_case))
path_ranked <- print(pathogen)
```
## Univariate Visualization
```{r}
#plot as bar columns so names read horizontally
ggplot(path_ranked)+
geom_col(aes(cost_case,Pathogen),
fill = "blue",
width = .9) +
scale_x_sqrt()
#how do I retain the ranked order from above?
```
This plot shows the average cost per case of the 15 pathogens in the data. I elected to display these as `scale_x_sqrt` so that the differences among the lower 13 were distinguishable. I am not sure how to preserve the ranked order I got from my clean read-in and mutate chunk.
## Bivariate Visualization
```{r}
#try a dot plot
ggplot(path_ranked, aes(x=cost_case, y=est_cases, label=Pathogen))+
geom_point()+
scale_x_log10() +
geom_text()
#try a violin plot
ggplot(path_ranked, aes(x=cost_case, y=est_cases, label=Pathogen))+
geom_violin()+
geom_text()
```