library(readr)
library(readxl)
Data Import
Overview
Today, we’re going to read in three versions of the poultry_tidy
data. These data are available in the _data
in the course blog repo.
We will specifically read in 3 data files:
- poultry_tidy.csv
- poultry_tidy.xlsx
- poultry_tidy.RData
These are the “clean” versions of the raw data files.
To run this file, all 3 datasets should be in the same directory on your computer.
I also use the here package to manage relative directories.
Getting Started
To begin, we need to load two packages: readr
and readxl
, which contain very useful functions for reading in data to `R.
If you’re unsure whether or not you have these packages installed, you can run the following command:
installed.packages()
We’re now ready to get started reading in actual datasets.
Reading in delimited text files
.csv
is a common type of delimited text file. .csv
stands for comma-separated value. This means that commas separate cells from one another.
R
has a base read.csv()
function. However, it comes with a couple of downsides - namely that it imports data as a dataframe rather than a tibble. So we will be using the function read_csv()
from the readr
package. In addition to importing data as a tibble, it also does a much better job guessing data types.
read_csv()
is essentially a wrapper function (a function that calls another function) around the more general read_delim()
function. Also see read_tsv()
for tab-separated values.
?read_delim
Let’s look at the data files available for us to read in:
list.files(here("posts","_data"))
[1] "AB_NYC_2019.csv"
[2] "abc_poll_2021.csv"
[3] "ActiveDuty_MaritalStatus.xls"
[4] "animal_weight.csv"
[5] "australian_marriage_law_postal_survey_2017_-_response_final.xls"
[6] "australian_marriage_tidy.csv"
[7] "birds.csv"
[8] "cereal.csv"
[9] "cwc.csv"
[10] "Data_Extract_From_World_Development_Indicators.xlsx"
[11] "Data_Extract_FromWorld Development Indicators.xlsx"
[12] "debt_in_trillions.xlsx"
[13] "eggs_tidy.csv"
[14] "FAOSTAT_cattle_dairy.csv"
[15] "FAOSTAT_country_groups.csv"
[16] "FAOSTAT_egg_chicken.csv"
[17] "FAOSTAT_livestock.csv"
[18] "FedFundsRate.csv"
[19] "FRBNY-SCE-Public-Microdata-Complete-13-16.xlsx"
[20] "hotel_bookings.csv"
[21] "organiceggpoultry.xls"
[22] "poultry_tidy.csv"
[23] "poultry_tidy.RData"
[24] "poultry_tidy.xlsx"
[25] "Public_School_Characteristics_2017-18.csv"
[26] "railroad_2012_clean_county.csv"
[27] "sce-labor-chart-data-public.xlsx"
[28] "snl_actors.csv"
[29] "snl_casts.csv"
[30] "snl_seasons.csv"
[31] "starwars1.RData"
[32] "StateCounty2012.xls"
[33] "test_objs.RData"
[34] "Total_cost_for_top_15_pathogens_2018.xlsx"
[35] "USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx"
[36] "wild_bird_data.xlsx"
There’s a lot of data files there, but we are going to import the poultry_tidy.csv
file. Doing so is very simple using read_csv()
:
<- read_csv(here("posts","_data","poultry_tidy.csv")) poultry_from_csv
Rows: 600 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Product, Month
dbl (2): Year, Price_Dollar
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s take a look at our dataset (to view the tibble, running the name of the object will print it to the console):
poultry_from_csv
It worked great! The data is all there. To inspect the data types for each of the four columns in poultry_from_csv
, we can use spec()
or typeof()
:
Note: we can also avoid the setwd()
issue by modifying the name of the file to include the file path:
<- read_csv(here("posts","_data","poultry_tidy.csv")) poultry_from_csv
spec(poultry_from_csv) # use the spec() function to check the data type for your columns
cols(
Product = col_character(),
Year = col_double(),
Month = col_character(),
Price_Dollar = col_double()
)
# can also use typeof() function on individual columns
typeof(poultry_from_csv$Product)
[1] "character"
typeof(poultry_from_csv$Year)
[1] "double"
typeof(poultry_from_csv$Month)
[1] "character"
typeof(poultry_from_csv$Price_Dollar)
[1] "double"
See this R
section below for some more info on read_delim()
:
# read_delim() has a number of optional arguments
args(read_delim)
function (file, delim = NULL, quote = "\"", escape_backslash = FALSE,
escape_double = TRUE, col_names = TRUE, col_types = NULL,
col_select = NULL, id = NULL, locale = default_locale(),
na = c("", "NA"), quoted_na = TRUE, comment = "", trim_ws = FALSE,
skip = 0, n_max = Inf, guess_max = min(1000, n_max), name_repair = "unique",
num_threads = readr_threads(), progress = show_progress(),
show_col_types = should_show_types(), skip_empty_rows = TRUE,
lazy = should_read_lazy())
NULL
# there's too many to list here, so we will just go over a few
# run ?read_delim() to learn more
# 1) delim - text delimiter.
# default is NULL and read_delim() guesses delimiter
#
# 2) quote - symbol telling R when to quote a string
# default is "\""
# below comes from R documentation on quotes
# https://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html
# identical() is a function that returns TRUE if two objects are equal
identical(1+4, 3+2)
[1] TRUE
identical('"It\'s alive!", he screamed.',
"\"It's alive!\", he screamed.") # same
[1] TRUE
#
# 3) escape_backlash
# use backlash to escape special characters?
# default = FALSE
#
# 4) col_names
# can be TRUE (default), meaning that R reads in the first row of values as column names
# can FALSE - R creates column names (x1 x2 etc)
# OR can be a character vector of custom column names
<- read_csv("_data/poultry_tidy.csv",
poultry_custom_cols col_names = c("prod","yr","mo","$"),
skip = 1) # need this to skip the file's column names
Rows: 600 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): prod, mo
dbl (2): yr, $
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
poultry_custom_cols
$`$` # note the backticks around the $ sign poultry_custom_cols
[1] 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500
[10] 2.38500 2.38500 2.38500 7.03750 7.03750 7.03750 7.03750 7.03750 7.03750
[19] 7.03750 7.03750 7.03750 7.03750 7.03750 7.03750 3.90500 3.90500 3.90500
[28] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[37] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[46] 2.03500 2.03500 2.03500 2.16250 2.16250 2.16250 2.16250 2.16250 2.16250
[55] 2.16250 2.16250 2.16250 2.16250 2.16250 2.16250 2.35000 2.38500 2.38500
[64] 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500
[73] 6.37500 7.00000 7.00000 7.00000 7.00000 7.00000 7.00000 7.00000 7.00000
[82] 7.00000 7.03750 7.03750 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[91] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500
[100] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[109] 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.16250 2.16250
[118] 2.16250 2.16250 2.16250 2.35000 2.35000 2.35000 2.35000 2.35000 2.35000
[127] 2.35000 2.35000 2.35000 2.35000 2.35000 2.35000 6.37500 6.37500 6.37500
[136] 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500
[145] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[154] 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[163] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.15000 2.15000 2.15000
[172] 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000
[181] 2.48000 2.48000 2.48000 2.41500 2.35000 2.35000 2.41500 2.35000 2.35000
[190] 2.35000 2.35000 2.35000 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[199] 6.45500 6.42300 6.37500 6.37500 6.37500 6.37500 3.90500 3.90500 3.90500
[208] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[217] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[226] 2.03500 2.03500 2.03500 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[235] 2.22000 2.19200 2.15000 2.15000 2.15000 2.15000 2.48000 2.48000 2.48000
[244] 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000
[253] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[262] 6.45500 6.45500 6.45500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[271] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500
[280] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[289] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[298] 2.22000 2.22000 2.22000 2.20500 2.20500 2.20500 2.20500 2.20500 2.48000
[307] 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 6.45500 6.45500 6.45500
[316] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[325] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[334] 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[343] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.22000 2.22000 2.22000
[352] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[361] 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500
[370] 2.20500 2.20500 2.20500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[379] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 3.90500 3.90500 3.90500
[388] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[397] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[406] 2.03500 2.03500 2.03500 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[415] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.17000 2.17000 2.19625
[424] 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500
[433] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[442] 6.45500 6.45500 6.45500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[451] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500
[460] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[469] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[478] 2.22000 2.22000 2.22000 2.17000 2.17000 2.17000 2.17000 2.17000 2.17000
[487] 2.17000 2.17000 2.17000 2.17000 2.17000 2.17000 6.44000 6.45500 6.45500
[496] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[505] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[514] 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[523] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.13000 2.22000 2.22000
[532] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[541] 1.97500 1.97500 2.09000 2.12000 2.14500 2.16375 2.17000 2.17000 2.17000
[550] 2.17000 2.17000 2.17000 6.45500 6.42500 6.42500 6.42500 6.42500 6.41000
[559] 6.42500 6.42500 6.42500 6.42500 6.42500 6.42500 NA NA NA
[568] NA NA NA 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[577] 1.93500 1.93500 1.93500 1.93500 1.93500 2.01875 2.03500 2.03500 2.03500
[586] 2.03500 2.03500 2.03500 NA 2.03000 2.03000 2.03000 2.03000 2.00375
[595] 1.99500 1.99500 1.99500 1.99500 1.99500 1.99500
# $ is a "special symbol" in R, because it is an operator used for indexing
# $ is technically an illegal column name, but we can still use it with ``
# same goes for column names consisting of numbers or other symbols, etc.
#
# 5) col_types
# default=NULL
# if NULL R guesses data type from first 1000 rows
# can also specify manually (but be careful)
# see ?read_delim and scroll to col_types for details
#
# 6) skip
# number of lines to skip
# default=0
# can be very useful with messy data files
#
# 7) n_max
# maximum number of lines to read
# default=Inf
#
#
Read in .xls
/.xlsx
files
.xls
and .xlsx
are files created in Microsoft Excel. There are separate functions read_xls()
and read_xlsx()
, but I find it’s best to use the wrapper function read_excel()
. This will automatically call the correct function and avoid an error from accidentally mis-specifying the file type.
See below for what happens if we call the wrong function for the file type:
# the try() function will try to run the code
# see tryCatch() for more error handling
# this code doesn't work because it tries to read the wrong file type
try(read_xls(here("posts","_data","poultry_tidy.xlsx")))
Error :
filepath: /Users/seanconway/Github/601_Winter_2022-2023/posts/_data/poultry_tidy.xlsx
libxls error: Unable to open file
The code below works just fine, however:
# this code works
<- try(read_excel(here("posts","_data","poultry_tidy.xlsx"),
poultry_from_excel skip = 5,
col_names = c("prod","year","month","price")))
poultry_from_excel
Let’s take a look at this tibble:
# examining our tibble
head(poultry_from_excel) # view the first several rows
colnames(poultry_from_excel) # print column names
[1] "prod" "year" "month" "price"
::glimpse(poultry_from_excel) # tidy little summary of it dplyr
Rows: 596
Columns: 4
$ prod <chr> "Whole", "Whole", "Whole", "Whole", "Whole", "Whole", "Whole", "…
$ year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
$ month <chr> "May", "June", "July", "August", "September", "October", "Novemb…
$ price <dbl> 2.3850, 2.3850, 2.3850, 2.3850, 2.3850, 2.3850, 2.3850, 2.3850, …
# the package::function() syntax is only necessary if the package isn't loaded
Function documentation:
# to view function documentation
?read_excel
# optional arguments
# 1) sheet=NULL
# number of the sheet to read in
# by default it reads the first sheet
# 2) range=NULL
# range of cells to read in
# uses the cellranger package to work with specific cells in Excel files
# for more, see the cellranger package
# https://cran.r-project.org/web/packages/cellranger/index.html
# 3) col_names=TRUE
# how to get column names (works the same as read_delim())
# 4) col_types=NULL
# types of data in columns (works the same as read_delim())
# 5) skip = 0
# number of lines to skip (works the same as read_delim())
# 6) n_max=Inf
# max lines to read (works the same as read_delim())
Reading in .RData
Files
Reading .RData is less commonly needed, but it’s still important to know about. .RData
is a file type exclusively associated with R. It’s commonly used when someone has performed operations with data and saved the results to give to collaborators.
We can use the load()
function to load R
objects into our R
environment from a file:
# running the load() function on the data file name will load the objects into your R environment
load(here("posts","_data","poultry_tidy.RData"))
poultry_tidy
# there's now a poultry_tidy object in our R environment
Note that we do not assign the data file to an object. Rather, it comes in as an object based on whatever the previous user named it as. If we try to assign it as an object, the object will only have the name of the data file, rather than the data itself:
# note that this operation shouldn't include any variable assignment
<- load(here("posts","_data","poultry_tidy.RData"))
test_dat # now it contains the object name, not the object itself test_dat
[1] "poultry_tidy"
You can also save any number of R
objects to a .RData
file using the save()
function:
<- rnorm(1000)
a <- matrix(runif(100),nrow=50,ncol=2)
b <- as_tibble(mtcars)
c save(a,b,c,file=here("posts","_data","test_objs.RData"))
# there is now a test_objs.RData file in my working directory:
list.files(here("posts","_data/"))
[1] "AB_NYC_2019.csv"
[2] "abc_poll_2021.csv"
[3] "ActiveDuty_MaritalStatus.xls"
[4] "animal_weight.csv"
[5] "australian_marriage_law_postal_survey_2017_-_response_final.xls"
[6] "australian_marriage_tidy.csv"
[7] "birds.csv"
[8] "cereal.csv"
[9] "cwc.csv"
[10] "Data_Extract_From_World_Development_Indicators.xlsx"
[11] "Data_Extract_FromWorld Development Indicators.xlsx"
[12] "debt_in_trillions.xlsx"
[13] "eggs_tidy.csv"
[14] "FAOSTAT_cattle_dairy.csv"
[15] "FAOSTAT_country_groups.csv"
[16] "FAOSTAT_egg_chicken.csv"
[17] "FAOSTAT_livestock.csv"
[18] "FedFundsRate.csv"
[19] "FRBNY-SCE-Public-Microdata-Complete-13-16.xlsx"
[20] "hotel_bookings.csv"
[21] "organiceggpoultry.xls"
[22] "poultry_tidy.csv"
[23] "poultry_tidy.RData"
[24] "poultry_tidy.xlsx"
[25] "Public_School_Characteristics_2017-18.csv"
[26] "railroad_2012_clean_county.csv"
[27] "sce-labor-chart-data-public.xlsx"
[28] "snl_actors.csv"
[29] "snl_casts.csv"
[30] "snl_seasons.csv"
[31] "starwars1.RData"
[32] "StateCounty2012.xls"
[33] "test_objs.RData"
[34] "Total_cost_for_top_15_pathogens_2018.xlsx"
[35] "USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx"
[36] "wild_bird_data.xlsx"
Let’s remove these objects from our R
environment and re-load them from the file we saved:
# remove objects from environment
rm(list=c("a","b","c"))
# now they're back! (If you save them)
try(load(here("posts","_data","test_objs.RData")))
Conclusion
You now know a little bit about how to read in some common data types. Note that these aren’t the only types of data you’ll encounter, but they are by far the most common ones.