More data import/dplyr

More data import and an introduction to the dplyr package.

Sean Conway
2022-06-07

Introduction

Today, we’ll be working on how to import a slightly more difficult dataset. Specifically, we’ll be looking at a dataset on the cost of foodborne illnesses in the US. These data were obtained from data.gov.

The data

Let’s take a look at these data.

As you can see, these data come from an Excel file and are not optimized to easily read into R. Luckily, we can use the readxl package to get them into R (albeit with a little work).

Reading in the data

The first thing we’ll notice is that we need to skip a few rows. Specifically, we need to skip the first 5 rows, as they aren’t needed. We can specify skip=5 when we read in the data.

d <- read_excel("../data/Total_cost_for_top_15_pathogens_2018.xlsx",
                skip=5)
d
# A tibble: 22 × 3
   `Campylobacter spp. (all species)`        `845024` `2181485783.43…`
   <chr>                                        <dbl>            <dbl>
 1 Clostridium perfringens                     965958       384277856.
 2 Cryptosporidium spp. (all species)           57616        58394152.
 3 Cyclospora cayetanensis                      11407         2571518.
 4 Listeria monocytogenes                        1591      3189686110.
 5 Norovirus                                  5461731      2566984191.
 6 Salmonella (non-typhoidal species)         1027561      4142179161.
 7 Shigella (all species)                      131254       159202402.
 8 Shiga toxin-producing Escherichia coli O…    63153       311036907.
 9 non-O157 Shiga toxin-producing Escherich…   112752        31701852.
10 Toxoplasma gondii                            86686      3744008907.
# … with 12 more rows

There a couple things wrong here. First, the column names are wrong. Since we skipped the column names, R tried naming the columns based on the first row of data (which is wrong!). We can specify the column names ourselves, using the col_names argument. We can give col_names a character vector that contains the names we want to give our columns. Specifically, we can specify c("pathogen","cases","cost") as the names of our columns. Note that pathogen wasn’t specified in the original data file, but we’re specifying it here.

Next, you’ll notice that the data go on longer than we need. There’s some white space in the data file, as well as a row containing the total cost for all pathogens (which we can eliminate). There’s also a couple of notes that we can eliminate. Rather than specify skip, we can instead use the range argument to specify the range of cells we want to read in. Looking back at our data file, it looks like we only need cells A6:C20.

Let’s try this again! As a reminder, we will be using the col_names argument to name the columns of our data and the range argument to specify the cells we want R to read in. Also note that we specify the path argument explicitly now. read_excel() by default assumes the first thing you give it is the file path, but it doesn’t hurt to be explicit.

dataset <- read_excel(path="../data/Total_cost_for_top_15_pathogens_2018.xlsx",
                      range="A6:C20",
                      col_names=c("pathogen","cases","cost"))
dataset
# A tibble: 15 × 3
   pathogen                                               cases   cost
   <chr>                                                  <dbl>  <dbl>
 1 Campylobacter spp. (all species)                      8.45e5 2.18e9
 2 Clostridium perfringens                               9.66e5 3.84e8
 3 Cryptosporidium spp. (all species)                    5.76e4 5.84e7
 4 Cyclospora cayetanensis                               1.14e4 2.57e6
 5 Listeria monocytogenes                                1.59e3 3.19e9
 6 Norovirus                                             5.46e6 2.57e9
 7 Salmonella (non-typhoidal species)                    1.03e6 4.14e9
 8 Shigella (all species)                                1.31e5 1.59e8
 9 Shiga toxin-producing Escherichia coli O157 (STEC O1… 6.32e4 3.11e8
10 non-O157 Shiga toxin-producing Escherichia coli (STE… 1.13e5 3.17e7
11 Toxoplasma gondii                                     8.67e4 3.74e9
12 Vibrio parahaemolyticus                               3.47e4 4.57e7
13 Vibrio vulnificus                                     9.6 e1 3.59e8
14 Vibrio non-cholera species other than V. parahaemoly… 1.76e4 8.17e7
15 Yersinia enterocolitica                               9.77e4 3.13e8

Voila! The dataset looks great.

A little bit about dplyr

dplyr is an R package for manipulating, wrangling, and transforming data. We will touch on it more next week, but this is a brief introduction to a few useful dplyr functions.

select()

select() allows us to only pick out certain columns from a dataframe.

Below, we extract the cases column from our dataset object (see above). We don’t have to specify it in quotes (the reasons for this are beyond the scope of this lesson).

select(dataset, cases)
# A tibble: 15 × 1
     cases
     <dbl>
 1  845024
 2  965958
 3   57616
 4   11407
 5    1591
 6 5461731
 7 1027561
 8  131254
 9   63153
10  112752
11   86686
12   34664
13      96
14   17564
15   97656

The first argument tells R the dataframe we are looking in (dataset), and the second argument tells R the columns we want to pick out.
We can also select multiple columns.

select(dataset, cases, cost)
# A tibble: 15 × 2
     cases        cost
     <dbl>       <dbl>
 1  845024 2181485783.
 2  965958  384277856.
 3   57616   58394152.
 4   11407    2571518.
 5    1591 3189686110.
 6 5461731 2566984191.
 7 1027561 4142179161.
 8  131254  159202402.
 9   63153  311036907.
10  112752   31701852.
11   86686 3744008907.
12   34664   45735332.
13      96  359481557.
14   17564   81749064.
15   97656  313297920.

You can even use the - operator to tell R to not select certain columns. This is particularly helpful when you have a dataset with many columns, but you want to remove a couple of them. Rather than say, write out the names of 28 columns you want to include, you can simply write the names of the 2 columns you don’t want to include.

select(dataset, -pathogen)
# A tibble: 15 × 2
     cases        cost
     <dbl>       <dbl>
 1  845024 2181485783.
 2  965958  384277856.
 3   57616   58394152.
 4   11407    2571518.
 5    1591 3189686110.
 6 5461731 2566984191.
 7 1027561 4142179161.
 8  131254  159202402.
 9   63153  311036907.
10  112752   31701852.
11   86686 3744008907.
12   34664   45735332.
13      96  359481557.
14   17564   81749064.
15   97656  313297920.

filter()

The filter() function allows us to only include rows of the data that meet certain criteria. For example, imagine that we only want to analyze pathogens that infected at least 10,000 people. We can use filter() to tell R to only include rows of the data where cases is greater than or equal to 10,000.

filter(dataset, cases>=10000)
# A tibble: 13 × 3
   pathogen                                               cases   cost
   <chr>                                                  <dbl>  <dbl>
 1 Campylobacter spp. (all species)                      8.45e5 2.18e9
 2 Clostridium perfringens                               9.66e5 3.84e8
 3 Cryptosporidium spp. (all species)                    5.76e4 5.84e7
 4 Cyclospora cayetanensis                               1.14e4 2.57e6
 5 Norovirus                                             5.46e6 2.57e9
 6 Salmonella (non-typhoidal species)                    1.03e6 4.14e9
 7 Shigella (all species)                                1.31e5 1.59e8
 8 Shiga toxin-producing Escherichia coli O157 (STEC O1… 6.32e4 3.11e8
 9 non-O157 Shiga toxin-producing Escherichia coli (STE… 1.13e5 3.17e7
10 Toxoplasma gondii                                     8.67e4 3.74e9
11 Vibrio parahaemolyticus                               3.47e4 4.57e7
12 Vibrio non-cholera species other than V. parahaemoly… 1.76e4 8.17e7
13 Yersinia enterocolitica                               9.77e4 3.13e8

The first argument tells R the dataframe we want to look in, while the second argument tells R the criteria for including a row. Specifically, a row in dataset should only be included if its value on the variable cases is greater than or equal to 10000.

We can also use the & (and) operator to include multiple criteria. For example, say we want to filter out the rows of data for pathogens that don’t have at least 10,000 cases and costed less than $10,000,000 to the US in 2018.

filter(dataset, cases>=10000 & cost>=10000000)
# A tibble: 12 × 3
   pathogen                                               cases   cost
   <chr>                                                  <dbl>  <dbl>
 1 Campylobacter spp. (all species)                      8.45e5 2.18e9
 2 Clostridium perfringens                               9.66e5 3.84e8
 3 Cryptosporidium spp. (all species)                    5.76e4 5.84e7
 4 Norovirus                                             5.46e6 2.57e9
 5 Salmonella (non-typhoidal species)                    1.03e6 4.14e9
 6 Shigella (all species)                                1.31e5 1.59e8
 7 Shiga toxin-producing Escherichia coli O157 (STEC O1… 6.32e4 3.11e8
 8 non-O157 Shiga toxin-producing Escherichia coli (STE… 1.13e5 3.17e7
 9 Toxoplasma gondii                                     8.67e4 3.74e9
10 Vibrio parahaemolyticus                               3.47e4 4.57e7
11 Vibrio non-cholera species other than V. parahaemoly… 1.76e4 8.17e7
12 Yersinia enterocolitica                               9.77e4 3.13e8

We might also be interested in pathogens that had a high case count OR a high economic cost. For this, we can use the logical or (|) operator to consider pathogens with at least 10,000 cases OR at least $10,000,000 in cost to the US.

filter(dataset, cases>=1000 | cost>=10000000)
# A tibble: 15 × 3
   pathogen                                               cases   cost
   <chr>                                                  <dbl>  <dbl>
 1 Campylobacter spp. (all species)                      8.45e5 2.18e9
 2 Clostridium perfringens                               9.66e5 3.84e8
 3 Cryptosporidium spp. (all species)                    5.76e4 5.84e7
 4 Cyclospora cayetanensis                               1.14e4 2.57e6
 5 Listeria monocytogenes                                1.59e3 3.19e9
 6 Norovirus                                             5.46e6 2.57e9
 7 Salmonella (non-typhoidal species)                    1.03e6 4.14e9
 8 Shigella (all species)                                1.31e5 1.59e8
 9 Shiga toxin-producing Escherichia coli O157 (STEC O1… 6.32e4 3.11e8
10 non-O157 Shiga toxin-producing Escherichia coli (STE… 1.13e5 3.17e7
11 Toxoplasma gondii                                     8.67e4 3.74e9
12 Vibrio parahaemolyticus                               3.47e4 4.57e7
13 Vibrio vulnificus                                     9.6 e1 3.59e8
14 Vibrio non-cholera species other than V. parahaemoly… 1.76e4 8.17e7
15 Yersinia enterocolitica                               9.77e4 3.13e8

Wrapping up

There’s many types of complicated datasets that you’ll encounter in your data analysis career. This has been a short introduction to these types of tricky datasets and some useful dplyr functions for working with them.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Conway (2022, June 7). Data Analytics and Computational Social Science: More data import/dplyr. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomspconway911852/

BibTeX citation

@misc{conway2022more,
  author = {Conway, Sean},
  title = {Data Analytics and Computational Social Science: More data import/dplyr},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomspconway911852/},
  year = {2022}
}