More data import and an introduction to the dplyr package.
Today, we’ll be working on how to import a slightly more difficult dataset. Specifically, we’ll be looking at a dataset on the cost of foodborne illnesses in the US. These data were obtained from data.gov.
Let’s take a look at these data.
As you can see, these data come from an Excel file and are not
optimized to easily read into R
. Luckily, we can use the
readxl
package to get them into R
(albeit with
a little work).
The first thing we’ll notice is that we need to skip a few rows.
Specifically, we need to skip the first 5 rows, as they aren’t needed.
We can specify skip=5
when we read in the data.
d <- read_excel("../data/Total_cost_for_top_15_pathogens_2018.xlsx",
skip=5)
d
# A tibble: 22 × 3
`Campylobacter spp. (all species)` `845024` `2181485783.43…`
<chr> <dbl> <dbl>
1 Clostridium perfringens 965958 384277856.
2 Cryptosporidium spp. (all species) 57616 58394152.
3 Cyclospora cayetanensis 11407 2571518.
4 Listeria monocytogenes 1591 3189686110.
5 Norovirus 5461731 2566984191.
6 Salmonella (non-typhoidal species) 1027561 4142179161.
7 Shigella (all species) 131254 159202402.
8 Shiga toxin-producing Escherichia coli O… 63153 311036907.
9 non-O157 Shiga toxin-producing Escherich… 112752 31701852.
10 Toxoplasma gondii 86686 3744008907.
# … with 12 more rows
There a couple things wrong here. First, the column names are wrong.
Since we skipped the column names, R
tried naming the
columns based on the first row of data (which is wrong!). We can specify
the column names ourselves, using the col_names
argument.
We can give col_names
a character vector
that contains the names we want to give our columns. Specifically, we
can specify c("pathogen","cases","cost")
as the names of
our columns. Note that pathogen
wasn’t specified in the
original data file, but we’re specifying it here.
Next, you’ll notice that the data go on longer than we need. There’s
some white space in the data file, as well as a row containing the total
cost for all pathogens (which we can eliminate). There’s also a couple
of notes that we can eliminate. Rather than specify skip
,
we can instead use the range
argument to specify the
range of cells we want to read in. Looking back at our
data file, it looks like we only need cells A6:C20.
Let’s try this again! As a reminder, we will be using the
col_names
argument to name the columns of our data and the
range
argument to specify the cells we want R
to read in. Also note that we specify the path
argument
explicitly now. read_excel()
by default assumes the first
thing you give it is the file path, but it doesn’t hurt to be
explicit.
dataset <- read_excel(path="../data/Total_cost_for_top_15_pathogens_2018.xlsx",
range="A6:C20",
col_names=c("pathogen","cases","cost"))
dataset
# A tibble: 15 × 3
pathogen cases cost
<chr> <dbl> <dbl>
1 Campylobacter spp. (all species) 8.45e5 2.18e9
2 Clostridium perfringens 9.66e5 3.84e8
3 Cryptosporidium spp. (all species) 5.76e4 5.84e7
4 Cyclospora cayetanensis 1.14e4 2.57e6
5 Listeria monocytogenes 1.59e3 3.19e9
6 Norovirus 5.46e6 2.57e9
7 Salmonella (non-typhoidal species) 1.03e6 4.14e9
8 Shigella (all species) 1.31e5 1.59e8
9 Shiga toxin-producing Escherichia coli O157 (STEC O1… 6.32e4 3.11e8
10 non-O157 Shiga toxin-producing Escherichia coli (STE… 1.13e5 3.17e7
11 Toxoplasma gondii 8.67e4 3.74e9
12 Vibrio parahaemolyticus 3.47e4 4.57e7
13 Vibrio vulnificus 9.6 e1 3.59e8
14 Vibrio non-cholera species other than V. parahaemoly… 1.76e4 8.17e7
15 Yersinia enterocolitica 9.77e4 3.13e8
Voila! The dataset looks great.
dplyr
is an R
package for manipulating,
wrangling, and transforming data. We will touch on it more next week,
but this is a brief introduction to a few useful dplyr
functions.
select()
allows us to only pick out
certain columns from a dataframe.
Below, we extract the cases
column from our
dataset
object (see above). We don’t have to specify it in
quotes (the reasons for this are beyond the scope of this lesson).
select(dataset, cases)
# A tibble: 15 × 1
cases
<dbl>
1 845024
2 965958
3 57616
4 11407
5 1591
6 5461731
7 1027561
8 131254
9 63153
10 112752
11 86686
12 34664
13 96
14 17564
15 97656
The first argument tells R
the dataframe we are looking
in (dataset
), and the second argument tells R
the columns we want to pick out.
We can also select multiple columns.
select(dataset, cases, cost)
# A tibble: 15 × 2
cases cost
<dbl> <dbl>
1 845024 2181485783.
2 965958 384277856.
3 57616 58394152.
4 11407 2571518.
5 1591 3189686110.
6 5461731 2566984191.
7 1027561 4142179161.
8 131254 159202402.
9 63153 311036907.
10 112752 31701852.
11 86686 3744008907.
12 34664 45735332.
13 96 359481557.
14 17564 81749064.
15 97656 313297920.
You can even use the -
operator to tell R
to not select certain columns. This is particularly
helpful when you have a dataset with many columns, but you want to
remove a couple of them. Rather than say, write out the names of 28
columns you want to include, you can simply write the
names of the 2 columns you don’t want to include.
select(dataset, -pathogen)
# A tibble: 15 × 2
cases cost
<dbl> <dbl>
1 845024 2181485783.
2 965958 384277856.
3 57616 58394152.
4 11407 2571518.
5 1591 3189686110.
6 5461731 2566984191.
7 1027561 4142179161.
8 131254 159202402.
9 63153 311036907.
10 112752 31701852.
11 86686 3744008907.
12 34664 45735332.
13 96 359481557.
14 17564 81749064.
15 97656 313297920.
The filter()
function allows us to only include rows of
the data that meet certain criteria. For example, imagine that we only
want to analyze pathogens that infected at least 10,000 people. We can
use filter()
to tell R
to only include rows of
the data where cases
is greater than or equal to
10,000.
filter(dataset, cases>=10000)
# A tibble: 13 × 3
pathogen cases cost
<chr> <dbl> <dbl>
1 Campylobacter spp. (all species) 8.45e5 2.18e9
2 Clostridium perfringens 9.66e5 3.84e8
3 Cryptosporidium spp. (all species) 5.76e4 5.84e7
4 Cyclospora cayetanensis 1.14e4 2.57e6
5 Norovirus 5.46e6 2.57e9
6 Salmonella (non-typhoidal species) 1.03e6 4.14e9
7 Shigella (all species) 1.31e5 1.59e8
8 Shiga toxin-producing Escherichia coli O157 (STEC O1… 6.32e4 3.11e8
9 non-O157 Shiga toxin-producing Escherichia coli (STE… 1.13e5 3.17e7
10 Toxoplasma gondii 8.67e4 3.74e9
11 Vibrio parahaemolyticus 3.47e4 4.57e7
12 Vibrio non-cholera species other than V. parahaemoly… 1.76e4 8.17e7
13 Yersinia enterocolitica 9.77e4 3.13e8
The first argument tells R
the dataframe we want to look
in, while the second argument tells R
the criteria for
including a row. Specifically, a row in dataset
should
only be included if its value on the variable
cases
is greater than or equal to 10000.
We can also use the &
(and) operator to include
multiple criteria. For example, say we want to filter
out the rows of data for pathogens that don’t have at least 10,000 cases
and costed less than $10,000,000 to the US in 2018.
filter(dataset, cases>=10000 & cost>=10000000)
# A tibble: 12 × 3
pathogen cases cost
<chr> <dbl> <dbl>
1 Campylobacter spp. (all species) 8.45e5 2.18e9
2 Clostridium perfringens 9.66e5 3.84e8
3 Cryptosporidium spp. (all species) 5.76e4 5.84e7
4 Norovirus 5.46e6 2.57e9
5 Salmonella (non-typhoidal species) 1.03e6 4.14e9
6 Shigella (all species) 1.31e5 1.59e8
7 Shiga toxin-producing Escherichia coli O157 (STEC O1… 6.32e4 3.11e8
8 non-O157 Shiga toxin-producing Escherichia coli (STE… 1.13e5 3.17e7
9 Toxoplasma gondii 8.67e4 3.74e9
10 Vibrio parahaemolyticus 3.47e4 4.57e7
11 Vibrio non-cholera species other than V. parahaemoly… 1.76e4 8.17e7
12 Yersinia enterocolitica 9.77e4 3.13e8
We might also be interested in pathogens that had a high case count
OR a high economic cost. For this, we can use the
logical or (|
) operator to consider pathogens with at least
10,000 cases OR at least $10,000,000 in cost to the
US.
filter(dataset, cases>=1000 | cost>=10000000)
# A tibble: 15 × 3
pathogen cases cost
<chr> <dbl> <dbl>
1 Campylobacter spp. (all species) 8.45e5 2.18e9
2 Clostridium perfringens 9.66e5 3.84e8
3 Cryptosporidium spp. (all species) 5.76e4 5.84e7
4 Cyclospora cayetanensis 1.14e4 2.57e6
5 Listeria monocytogenes 1.59e3 3.19e9
6 Norovirus 5.46e6 2.57e9
7 Salmonella (non-typhoidal species) 1.03e6 4.14e9
8 Shigella (all species) 1.31e5 1.59e8
9 Shiga toxin-producing Escherichia coli O157 (STEC O1… 6.32e4 3.11e8
10 non-O157 Shiga toxin-producing Escherichia coli (STE… 1.13e5 3.17e7
11 Toxoplasma gondii 8.67e4 3.74e9
12 Vibrio parahaemolyticus 3.47e4 4.57e7
13 Vibrio vulnificus 9.6 e1 3.59e8
14 Vibrio non-cholera species other than V. parahaemoly… 1.76e4 8.17e7
15 Yersinia enterocolitica 9.77e4 3.13e8
There’s many types of complicated datasets that you’ll encounter in
your data analysis career. This has been a short introduction to these
types of tricky datasets and some useful dplyr
functions
for working with them.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Conway (2022, June 7). Data Analytics and Computational Social Science: More data import/dplyr. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomspconway911852/
BibTeX citation
@misc{conway2022more, author = {Conway, Sean}, title = {Data Analytics and Computational Social Science: More data import/dplyr}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomspconway911852/}, year = {2022} }