Data Import

data import

readr

importing data with R

Author

Sean Conway

Published

June 5, 2022

Overview

Today, we’re going to read in three versions of the poultry_tidy data. These data are available in the _data in the course blog repo.

We will specifically read in 3 data files:
- poultry_tidy.csv
- poultry_tidy.xlsx
- poultry_tidy.RData

These are the “clean” versions of the raw data files.

To run this file, all 3 datasets should be in the same directory on your computer.

I also use the here package to manage relative directories.

Getting Started

To begin, we need to load two packages: readr and readxl, which contain very useful functions for reading in data to `R.

library(readr)
library(readxl)

If you’re unsure whether or not you have these packages installed, you can run the following command:

installed.packages()

We’re now ready to get started reading in actual datasets.

Reading in delimited text files

.csv is a common type of delimited text file. .csv stands for comma-separated value. This means that commas separate cells from one another.

R has a base read.csv() function. However, it comes with a couple of downsides - namely that it imports data as a dataframe rather than a tibble. So we will be using the function read_csv() from the readr package. In addition to importing data as a tibble, it also does a much better job guessing data types.

read_csv() is essentially a wrapper function (a function that calls another function) around the more general read_delim() function. Also see read_tsv() for tab-separated values.

?read_delim

Let’s look at the data files available for us to read in:

list.files(here("posts","_data"))

 [1] "AB_NYC_2019.csv"                                                                                 
 [2] "abc_poll_2021.csv"                                                                               
 [3] "ActiveDuty_MaritalStatus.xls"                                                                    
 [4] "animal_weight.csv"                                                                               
 [5] "australian_marriage_law_postal_survey_2017_-_response_final.xls"                                 
 [6] "australian_marriage_tidy.csv"                                                                    
 [7] "birds.csv"                                                                                       
 [8] "cereal.csv"                                                                                      
 [9] "cwc.csv"                                                                                         
[10] "Data_Extract_From_World_Development_Indicators.xlsx"                                             
[11] "Data_Extract_FromWorld Development Indicators.xlsx"                                              
[12] "debt_in_trillions.xlsx"                                                                          
[13] "eggs_tidy.csv"                                                                                   
[14] "FAOSTAT_cattle_dairy.csv"                                                                        
[15] "FAOSTAT_country_groups.csv"                                                                      
[16] "FAOSTAT_egg_chicken.csv"                                                                         
[17] "FAOSTAT_livestock.csv"                                                                           
[18] "FedFundsRate.csv"                                                                                
[19] "FRBNY-SCE-Public-Microdata-Complete-13-16.xlsx"                                                  
[20] "hotel_bookings.csv"                                                                              
[21] "organiceggpoultry.xls"                                                                           
[22] "poultry_tidy.csv"                                                                                
[23] "poultry_tidy.RData"                                                                              
[24] "poultry_tidy.xlsx"                                                                               
[25] "Public_School_Characteristics_2017-18.csv"                                                       
[26] "railroad_2012_clean_county.csv"                                                                  
[27] "sce-labor-chart-data-public.xlsx"                                                                
[28] "snl_actors.csv"                                                                                  
[29] "snl_casts.csv"                                                                                   
[30] "snl_seasons.csv"                                                                                 
[31] "starwars1.RData"                                                                                 
[32] "StateCounty2012.xls"                                                                             
[33] "test_objs.RData"                                                                                 
[34] "Total_cost_for_top_15_pathogens_2018.xlsx"                                                       
[35] "USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx"
[36] "wild_bird_data.xlsx"

There’s a lot of data files there, but we are going to import the poultry_tidy.csv file. Doing so is very simple using read_csv():

poultry_from_csv <- read_csv(here("posts","_data","poultry_tidy.csv"))

Rows: 600 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Product, Month
dbl (2): Year, Price_Dollar

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s take a look at our dataset (to view the tibble, running the name of the object will print it to the console):

poultry_from_csv

It worked great! The data is all there. To inspect the data types for each of the four columns in poultry_from_csv, we can use spec() or typeof():

Note: we can also avoid the setwd() issue by modifying the name of the file to include the file path:

poultry_from_csv <- read_csv(here("posts","_data","poultry_tidy.csv"))

spec(poultry_from_csv) # use the spec() function to check the data type for your columns

cols(
  Product = col_character(),
  Year = col_double(),
  Month = col_character(),
  Price_Dollar = col_double()
)

# can also use typeof() function on individual columns
typeof(poultry_from_csv$Product)

[1] "character"

typeof(poultry_from_csv$Year)

[1] "double"

typeof(poultry_from_csv$Month)

[1] "character"

typeof(poultry_from_csv$Price_Dollar)

[1] "double"

See this R section below for some more info on read_delim():

# read_delim() has a number of optional arguments
args(read_delim)

function (file, delim = NULL, quote = "\"", escape_backslash = FALSE, 
    escape_double = TRUE, col_names = TRUE, col_types = NULL, 
    col_select = NULL, id = NULL, locale = default_locale(), 
    na = c("", "NA"), quoted_na = TRUE, comment = "", trim_ws = FALSE, 
    skip = 0, n_max = Inf, guess_max = min(1000, n_max), name_repair = "unique", 
    num_threads = readr_threads(), progress = show_progress(), 
    show_col_types = should_show_types(), skip_empty_rows = TRUE, 
    lazy = should_read_lazy()) 
NULL

# there's too many to list here, so we will just go over a few
# run ?read_delim() to learn more
# 1) delim - text delimiter.
# default is NULL and read_delim() guesses delimiter
#
# 2) quote - symbol telling R when to quote a string
# default is "\""
# below comes from R documentation on quotes
# https://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html
# identical() is a function that returns TRUE if two objects are equal
identical(1+4, 3+2)

[1] TRUE

identical('"It\'s alive!", he screamed.',
          "\"It's alive!\", he screamed.") # same

[1] TRUE

#
# 3) escape_backlash
# use backlash to escape special characters?
# default = FALSE
#
# 4) col_names
# can be TRUE (default), meaning that R reads in the first row of values as column names
# can FALSE - R creates column names (x1 x2 etc)
# OR can be a character vector of custom column names
poultry_custom_cols <- read_csv("_data/poultry_tidy.csv",
                                col_names = c("prod","yr","mo","$"),
                                skip = 1) # need this to skip the file's column names

Rows: 600 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): prod, mo
dbl (2): yr, $

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

poultry_custom_cols

poultry_custom_cols$`$` # note the backticks around the $ sign

  [1] 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500
 [10] 2.38500 2.38500 2.38500 7.03750 7.03750 7.03750 7.03750 7.03750 7.03750
 [19] 7.03750 7.03750 7.03750 7.03750 7.03750 7.03750 3.90500 3.90500 3.90500
 [28] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
 [37] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
 [46] 2.03500 2.03500 2.03500 2.16250 2.16250 2.16250 2.16250 2.16250 2.16250
 [55] 2.16250 2.16250 2.16250 2.16250 2.16250 2.16250 2.35000 2.38500 2.38500
 [64] 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500
 [73] 6.37500 7.00000 7.00000 7.00000 7.00000 7.00000 7.00000 7.00000 7.00000
 [82] 7.00000 7.03750 7.03750 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
 [91] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500
[100] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[109] 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.16250 2.16250
[118] 2.16250 2.16250 2.16250 2.35000 2.35000 2.35000 2.35000 2.35000 2.35000
[127] 2.35000 2.35000 2.35000 2.35000 2.35000 2.35000 6.37500 6.37500 6.37500
[136] 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500
[145] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[154] 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[163] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.15000 2.15000 2.15000
[172] 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000
[181] 2.48000 2.48000 2.48000 2.41500 2.35000 2.35000 2.41500 2.35000 2.35000
[190] 2.35000 2.35000 2.35000 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[199] 6.45500 6.42300 6.37500 6.37500 6.37500 6.37500 3.90500 3.90500 3.90500
[208] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[217] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[226] 2.03500 2.03500 2.03500 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[235] 2.22000 2.19200 2.15000 2.15000 2.15000 2.15000 2.48000 2.48000 2.48000
[244] 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000
[253] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[262] 6.45500 6.45500 6.45500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[271] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500
[280] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[289] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[298] 2.22000 2.22000 2.22000 2.20500 2.20500 2.20500 2.20500 2.20500 2.48000
[307] 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 6.45500 6.45500 6.45500
[316] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[325] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[334] 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[343] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.22000 2.22000 2.22000
[352] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[361] 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500
[370] 2.20500 2.20500 2.20500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[379] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 3.90500 3.90500 3.90500
[388] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[397] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[406] 2.03500 2.03500 2.03500 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[415] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.17000 2.17000 2.19625
[424] 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500
[433] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[442] 6.45500 6.45500 6.45500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[451] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500
[460] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[469] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[478] 2.22000 2.22000 2.22000 2.17000 2.17000 2.17000 2.17000 2.17000 2.17000
[487] 2.17000 2.17000 2.17000 2.17000 2.17000 2.17000 6.44000 6.45500 6.45500
[496] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[505] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[514] 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[523] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.13000 2.22000 2.22000
[532] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[541] 1.97500 1.97500 2.09000 2.12000 2.14500 2.16375 2.17000 2.17000 2.17000
[550] 2.17000 2.17000 2.17000 6.45500 6.42500 6.42500 6.42500 6.42500 6.41000
[559] 6.42500 6.42500 6.42500 6.42500 6.42500 6.42500      NA      NA      NA
[568]      NA      NA      NA 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[577] 1.93500 1.93500 1.93500 1.93500 1.93500 2.01875 2.03500 2.03500 2.03500
[586] 2.03500 2.03500 2.03500      NA 2.03000 2.03000 2.03000 2.03000 2.00375
[595] 1.99500 1.99500 1.99500 1.99500 1.99500 1.99500

# $ is a "special symbol" in R, because it is an operator used for indexing
# $ is technically an illegal column name, but we can still use it with ``
# same goes for column names consisting of numbers or other symbols, etc.
#
# 5) col_types
# default=NULL
# if NULL R guesses data type from first 1000 rows
# can also specify manually (but be careful)
# see ?read_delim and scroll to col_types for details
#
# 6) skip
# number of lines to skip
# default=0
# can be very useful with messy data files
#
# 7) n_max
# maximum number of lines to read
# default=Inf
#
#

Read in `.xls`/`.xlsx` files

.xls and .xlsx are files created in Microsoft Excel. There are separate functions read_xls() and read_xlsx(), but I find it’s best to use the wrapper function read_excel(). This will automatically call the correct function and avoid an error from accidentally mis-specifying the file type.

See below for what happens if we call the wrong function for the file type:

# the try() function will try to run the code
# see tryCatch() for more error handling 
# this code doesn't work because it tries to read the wrong file type
try(read_xls(here("posts","_data","poultry_tidy.xlsx")))

Error : 
  filepath: /Users/seanconway/Github/601_Winter_2022-2023/posts/_data/poultry_tidy.xlsx
  libxls error: Unable to open file

The code below works just fine, however:

# this code works 
poultry_from_excel <- try(read_excel(here("posts","_data","poultry_tidy.xlsx"),
                                     skip = 5,
                                     col_names = c("prod","year","month","price"))) 
poultry_from_excel

Let’s take a look at this tibble:

# examining our tibble
head(poultry_from_excel) # view the first several rows

colnames(poultry_from_excel) # print column names

[1] "prod"  "year"  "month" "price"

dplyr::glimpse(poultry_from_excel) # tidy little summary of it

Rows: 596
Columns: 4
$ prod  <chr> "Whole", "Whole", "Whole", "Whole", "Whole", "Whole", "Whole", "…
$ year  <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
$ month <chr> "May", "June", "July", "August", "September", "October", "Novemb…
$ price <dbl> 2.3850, 2.3850, 2.3850, 2.3850, 2.3850, 2.3850, 2.3850, 2.3850, …

# the package::function() syntax is only necessary if the package isn't loaded

Function documentation:

# to view function documentation
?read_excel

# optional arguments
# 1) sheet=NULL
# number of the sheet to read in
# by default it reads the first sheet

# 2) range=NULL
# range of cells to read in
# uses the cellranger package to work with specific cells in Excel files
# for more, see the cellranger package
# https://cran.r-project.org/web/packages/cellranger/index.html

# 3) col_names=TRUE
# how to get column names (works the same as read_delim())

# 4) col_types=NULL
# types of data in columns (works the same as read_delim())

# 5) skip = 0
# number of lines to skip (works the same as read_delim())

# 6) n_max=Inf
# max lines to read (works the same as read_delim())

Reading in `.RData` Files

Reading .RData is less commonly needed, but it’s still important to know about. .RData is a file type exclusively associated with R. It’s commonly used when someone has performed operations with data and saved the results to give to collaborators.

We can use the load() function to load R objects into our R environment from a file:

# running the load() function on the data file name will load the objects into your R environment
load(here("posts","_data","poultry_tidy.RData"))
poultry_tidy

# there's now a poultry_tidy object in our R environment

Note that we do not assign the data file to an object. Rather, it comes in as an object based on whatever the previous user named it as. If we try to assign it as an object, the object will only have the name of the data file, rather than the data itself:

# note that this operation shouldn't include any variable assignment
test_dat <- load(here("posts","_data","poultry_tidy.RData"))
test_dat # now it contains the object name, not the object itself

[1] "poultry_tidy"

You can also save any number of R objects to a .RData file using the save() function:

a <- rnorm(1000)
b <- matrix(runif(100),nrow=50,ncol=2)
c <- as_tibble(mtcars)
save(a,b,c,file=here("posts","_data","test_objs.RData"))
# there is now a test_objs.RData file in my working directory: 
list.files(here("posts","_data/"))

 [1] "AB_NYC_2019.csv"                                                                                 
 [2] "abc_poll_2021.csv"                                                                               
 [3] "ActiveDuty_MaritalStatus.xls"                                                                    
 [4] "animal_weight.csv"                                                                               
 [5] "australian_marriage_law_postal_survey_2017_-_response_final.xls"                                 
 [6] "australian_marriage_tidy.csv"                                                                    
 [7] "birds.csv"                                                                                       
 [8] "cereal.csv"                                                                                      
 [9] "cwc.csv"                                                                                         
[10] "Data_Extract_From_World_Development_Indicators.xlsx"                                             
[11] "Data_Extract_FromWorld Development Indicators.xlsx"                                              
[12] "debt_in_trillions.xlsx"                                                                          
[13] "eggs_tidy.csv"                                                                                   
[14] "FAOSTAT_cattle_dairy.csv"                                                                        
[15] "FAOSTAT_country_groups.csv"                                                                      
[16] "FAOSTAT_egg_chicken.csv"                                                                         
[17] "FAOSTAT_livestock.csv"                                                                           
[18] "FedFundsRate.csv"                                                                                
[19] "FRBNY-SCE-Public-Microdata-Complete-13-16.xlsx"                                                  
[20] "hotel_bookings.csv"                                                                              
[21] "organiceggpoultry.xls"                                                                           
[22] "poultry_tidy.csv"                                                                                
[23] "poultry_tidy.RData"                                                                              
[24] "poultry_tidy.xlsx"                                                                               
[25] "Public_School_Characteristics_2017-18.csv"                                                       
[26] "railroad_2012_clean_county.csv"                                                                  
[27] "sce-labor-chart-data-public.xlsx"                                                                
[28] "snl_actors.csv"                                                                                  
[29] "snl_casts.csv"                                                                                   
[30] "snl_seasons.csv"                                                                                 
[31] "starwars1.RData"                                                                                 
[32] "StateCounty2012.xls"                                                                             
[33] "test_objs.RData"                                                                                 
[34] "Total_cost_for_top_15_pathogens_2018.xlsx"                                                       
[35] "USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx"
[36] "wild_bird_data.xlsx"

Let’s remove these objects from our R environment and re-load them from the file we saved:

# remove objects from environment
rm(list=c("a","b","c"))

# now they're back! (If you save them)
try(load(here("posts","_data","test_objs.RData")))

Conclusion

You now know a little bit about how to read in some common data types. Note that these aren’t the only types of data you’ll encounter, but they are by far the most common ones.