importing data with R
Today, we’re going to read in three versions of the
poultry_tidy
data. All data files are available on the
sample datasets page on the Google Classroom.
We will specifically read in 3 data files:
- poultry_tidy.csv
- poultry_tidy.xlsx
- poultry_tidy.RData
These are the “clean” versions of the raw data files (which are also available on Google Classroom).
To run this file, all 3 datasets should be in the same directory on your computer.
To begin, we need to load two packages: readr
and
readxl
, which contain very useful functions for reading in
data to `R.
If you’re unsure whether or not you have these packages installed, you can run the following command:
First, let’s check our working directory using the
getwd()
command:
getwd()
[1] "/Users/seanconway/Teaching/DACSS_601_Su22/scripts"
I need to change my working directory to the location of the datasets
on my computer, using the setwd()
command. Note that for
RMarkdown files, the working directory will revert back to the file
location for each code chunk. So we actually need to rerun this line
below everytime we import data:
setwd("/Users/seanconway/Teaching/DACSS_601_Su22/data_cleaned")
We’re now ready to get started reading in actual datasets.
.csv
is a common type of delimited text file.
.csv
stands for comma-separated value.
This means that commas separate cells from one another.
R
has a base read.csv()
function. However,
it comes with a couple of downsides - namely that it imports data as a
dataframe rather than a tibble. So we will be using the function
read_csv()
from the readr
package. In addition
to importing data as a tibble, it also does a much better job guessing
data types.
read_csv()
is essentially a wrapper function (a function
that calls another function) around the more general
read_delim()
function. Also see read_tsv()
for
tab-separated values.
?read_delim
Let’s look at the data files available for us to read in:
setwd("/Users/seanconway/Teaching/DACSS_601_Su22/data_cleaned")
list.files()
[1] "airforce_cleaned.csv"
[2] "army_cleaned.csv"
[3] "australian_marriage_tidy.csv"
[4] "australian_marriage_tidy.RData"
[5] "australian_marriage_tidy.xlsx"
[6] "bicycle_NY_tidy.RData"
[7] "eggs_2004_2013.csv"
[8] "eggs_tidy.xlsx"
[9] "marinecorps_cleaned.csv"
[10] "marital_dod_tidy.RData"
[11] "marital_tidy_all.csv"
[12] "navy_cleaned.csv"
[13] "poultry_tidy.csv"
[14] "poultry_tidy.RData"
[15] "poultry_tidy.xlsx"
[16] "railroad_2012_clean_county_tidy.csv"
[17] "railroad_2012_clean_county.xlsx"
[18] "railroad_2012_clean_state.csv"
[19] "railroad_2012_clean_state.xlsx"
[20] "railroad_2012_county.RData"
[21] "railroad_2012_state.RData"
[22] "starwars1.RData"
[23] "test_objs.RData"
[24] "totaldod_cleaned.csv"
There’s a lot of data files there, but we are going to import the
poultry_tidy.csv
file. Doing so is very simple using
read_csv()
:
Let’s take a look at our dataset (to view the tibble, running the name of the object will print it to the console):
poultry_from_csv
# A tibble: 600 × 4
Product Year Month Price_Dollar
<chr> <dbl> <chr> <dbl>
1 Whole 2013 January 2.38
2 Whole 2013 February 2.38
3 Whole 2013 March 2.38
4 Whole 2013 April 2.38
5 Whole 2013 May 2.38
6 Whole 2013 June 2.38
7 Whole 2013 July 2.38
8 Whole 2013 August 2.38
9 Whole 2013 September 2.38
10 Whole 2013 October 2.38
# … with 590 more rows
It worked great! The data is all there. To inspect the data types for
each of the four columns in poultry_from_csv
, we can use
spec()
or typeof()
:
Note: we can also avoid the setwd()
issue by modifying
the name of the file to include the file path:
poultry_from_csv <- read_csv("../data_cleaned/poultry_tidy.csv")
spec(poultry_from_csv) # use the spec() function to check the data type for your columns
cols(
Product = col_character(),
Year = col_double(),
Month = col_character(),
Price_Dollar = col_double()
)
# can also use typeof() function on individual columns
typeof(poultry_from_csv$Product)
[1] "character"
typeof(poultry_from_csv$Year)
[1] "double"
typeof(poultry_from_csv$Month)
[1] "character"
typeof(poultry_from_csv$Price_Dollar)
[1] "double"
See this R
section below for some more info on
read_delim()
:
# read_delim() has a number of optional arguments
args(read_delim)
function (file, delim = NULL, quote = "\"", escape_backslash = FALSE,
escape_double = TRUE, col_names = TRUE, col_types = NULL,
col_select = NULL, id = NULL, locale = default_locale(),
na = c("", "NA"), quoted_na = TRUE, comment = "", trim_ws = FALSE,
skip = 0, n_max = Inf, guess_max = min(1000, n_max), name_repair = "unique",
num_threads = readr_threads(), progress = show_progress(),
show_col_types = should_show_types(), skip_empty_rows = TRUE,
lazy = should_read_lazy())
NULL
# there's too many to list here, so we will just go over a few
# run ?read_delim() to learn more
# 1) delim - text delimiter.
# default is NULL and read_delim() guesses delimiter
#
# 2) quote - symbol telling R when to quote a string
# default is "\""
# below comes from R documentation on quotes
# https://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html
# identical() is a function that returns TRUE if two objects are equal
identical(1+4, 3+2)
[1] TRUE
identical('"It\'s alive!", he screamed.',
"\"It's alive!\", he screamed.") # same
[1] TRUE
#
# 3) escape_backlash
# use backlash to escape special characters?
# default = FALSE
#
# 4) col_names
# can be TRUE (default), meaning that R reads in the first row of values as column names
# can FALSE - R creates column names (x1 x2 etc)
# OR can be a character vector of custom column names
poultry_custom_cols <- read_csv("../data_cleaned/poultry_tidy.csv",
col_names = c("prod","yr","mo","$"),
skip = 1) # need this to skip the file's column names
poultry_custom_cols
# A tibble: 600 × 4
prod yr mo `$`
<chr> <dbl> <chr> <dbl>
1 Whole 2013 January 2.38
2 Whole 2013 February 2.38
3 Whole 2013 March 2.38
4 Whole 2013 April 2.38
5 Whole 2013 May 2.38
6 Whole 2013 June 2.38
7 Whole 2013 July 2.38
8 Whole 2013 August 2.38
9 Whole 2013 September 2.38
10 Whole 2013 October 2.38
# … with 590 more rows
poultry_custom_cols$`$` # note the backticks around the $ sign
[1] 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500
[9] 2.38500 2.38500 2.38500 2.38500 7.03750 7.03750 7.03750 7.03750
[17] 7.03750 7.03750 7.03750 7.03750 7.03750 7.03750 7.03750 7.03750
[25] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[33] 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500
[41] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[49] 2.16250 2.16250 2.16250 2.16250 2.16250 2.16250 2.16250 2.16250
[57] 2.16250 2.16250 2.16250 2.16250 2.35000 2.38500 2.38500 2.38500
[65] 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500
[73] 6.37500 7.00000 7.00000 7.00000 7.00000 7.00000 7.00000 7.00000
[81] 7.00000 7.00000 7.03750 7.03750 3.90500 3.90500 3.90500 3.90500
[89] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[97] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[105] 2.03500 2.03500 2.03500 2.03500 2.15000 2.15000 2.15000 2.15000
[113] 2.15000 2.15000 2.15000 2.16250 2.16250 2.16250 2.16250 2.16250
[121] 2.35000 2.35000 2.35000 2.35000 2.35000 2.35000 2.35000 2.35000
[129] 2.35000 2.35000 2.35000 2.35000 6.37500 6.37500 6.37500 6.37500
[137] 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500
[145] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[153] 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500
[161] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[169] 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000
[177] 2.15000 2.15000 2.15000 2.15000 2.48000 2.48000 2.48000 2.41500
[185] 2.35000 2.35000 2.41500 2.35000 2.35000 2.35000 2.35000 2.35000
[193] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.42300
[201] 6.37500 6.37500 6.37500 6.37500 3.90500 3.90500 3.90500 3.90500
[209] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[217] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[225] 2.03500 2.03500 2.03500 2.03500 2.22000 2.22000 2.22000 2.22000
[233] 2.22000 2.22000 2.22000 2.19200 2.15000 2.15000 2.15000 2.15000
[241] 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000
[249] 2.48000 2.48000 2.48000 2.48000 6.45500 6.45500 6.45500 6.45500
[257] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[265] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[273] 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500
[281] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[289] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[297] 2.22000 2.22000 2.22000 2.22000 2.20500 2.20500 2.20500 2.20500
[305] 2.20500 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000
[313] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[321] 6.45500 6.45500 6.45500 6.45500 3.90500 3.90500 3.90500 3.90500
[329] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[337] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[345] 2.03500 2.03500 2.03500 2.03500 2.22000 2.22000 2.22000 2.22000
[353] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[361] 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500
[369] 2.20500 2.20500 2.20500 2.20500 6.45500 6.45500 6.45500 6.45500
[377] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[385] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[393] 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500
[401] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[409] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[417] 2.22000 2.22000 2.22000 2.22000 2.17000 2.17000 2.19625 2.20500
[425] 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500
[433] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[441] 6.45500 6.45500 6.45500 6.45500 3.90500 3.90500 3.90500 3.90500
[449] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[457] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[465] 2.03500 2.03500 2.03500 2.03500 2.22000 2.22000 2.22000 2.22000
[473] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[481] 2.17000 2.17000 2.17000 2.17000 2.17000 2.17000 2.17000 2.17000
[489] 2.17000 2.17000 2.17000 2.17000 6.44000 6.45500 6.45500 6.45500
[497] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[505] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[513] 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500
[521] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[529] 2.13000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[537] 2.22000 2.22000 2.22000 2.22000 1.97500 1.97500 2.09000 2.12000
[545] 2.14500 2.16375 2.17000 2.17000 2.17000 2.17000 2.17000 2.17000
[553] 6.45500 6.42500 6.42500 6.42500 6.42500 6.41000 6.42500 6.42500
[561] 6.42500 6.42500 6.42500 6.42500 NA NA NA NA
[569] NA NA 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[577] 1.93500 1.93500 1.93500 1.93500 1.93500 2.01875 2.03500 2.03500
[585] 2.03500 2.03500 2.03500 2.03500 NA 2.03000 2.03000 2.03000
[593] 2.03000 2.00375 1.99500 1.99500 1.99500 1.99500 1.99500 1.99500
# $ is a "special symbol" in R, because it is an operator used for indexing
# $ is technically an illegal column name, but we can still use it with ``
# same goes for column names consisting of numbers or other symbols, etc.
#
# 5) col_types
# default=NULL
# if NULL R guesses data type from first 1000 rows
# can also specify manually (but be careful)
# see ?read_delim and scroll to col_types for details
#
# 6) skip
# number of lines to skip
# default=0
# can be very useful with messy data files
#
# 7) n_max
# maximum number of lines to read
# default=Inf
#
#
.xls
/.xlsx
files.xls
and .xlsx
are files created in
Microsoft Excel. There are separate functions read_xls()
and read_xlsx()
, but I find it’s best to use the wrapper
function read_excel()
. This will automatically call the
correct function and avoid an error from accidentally mis-specifying the
file type.
See below for what happens if we call the wrong function for the file type:
# the try() function will try to run the code
# see tryCatch() for more error handling
# this code doesn't work because it tries to read the wrong file type
try(read_xls("../data_cleaned/poultry_tidy.xlsx"))
Error :
filepath: /Users/seanconway/Teaching/DACSS_601_Su22/data_cleaned/poultry_tidy.xlsx
libxls error: Unable to open file
The code below works just fine, however:
# this code works
poultry_from_excel <- try(read_excel("poultry_tidy.xlsx",
skip = 5,
col_names = c("prod","year","month","price")))
Error : `path` does not exist: 'poultry_tidy.xlsx'
poultry_from_excel
[1] "Error : `path` does not exist: 'poultry_tidy.xlsx'\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError: `path` does not exist: 'poultry_tidy.xlsx'>
Let’s take a look at this tibble:
# examining our tibble
head(poultry_from_excel) # view the first several rows
[1] "Error : `path` does not exist: 'poultry_tidy.xlsx'\n"
colnames(poultry_from_excel) # print column names
NULL
dplyr::glimpse(poultry_from_excel) # tidy little summary of it
'try-error' chr "Error : `path` does not exist: 'poultry_tidy.xlsx'\n"
- attr(*, "condition")=List of 2
..$ message: chr "`path` does not exist: 'poultry_tidy.xlsx'"
..$ call : NULL
..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
# the package::function() syntax is only necessary if the package isn't loaded
Function documentation:
# to view function documentation
?read_excel
# optional arguments
# 1) sheet=NULL
# number of the sheet to read in
# by default it reads the first sheet
# 2) range=NULL
# range of cells to read in
# uses the cellranger package to work with specific cells in Excel files
# for more, see the cellranger package
# https://cran.r-project.org/web/packages/cellranger/index.html
# 3) col_names=TRUE
# how to get column names (works the same as read_delim())
# 4) col_types=NULL
# types of data in columns (works the same as read_delim())
# 5) skip = 0
# number of lines to skip (works the same as read_delim())
# 6) n_max=Inf
# max lines to read (works the same as read_delim())
.RData
FilesReading .RData is less commonly needed, but it’s still important to
know about. .RData
is a file type exclusively associated
with R. It’s commonly used when someone has performed operations with
data and saved the results to give to collaborators.
We can use the load()
function to load R
objects into our R
environment from a file:
# running the load() function on the data file name will load the objects into your R environment
load("../data_cleaned/poultry_tidy.RData")
poultry_tidy
# A tibble: 600 × 4
Product Year Month Price_Dollar
<chr> <int> <chr> <dbl>
1 Whole 2013 January 2.38
2 Whole 2013 February 2.38
3 Whole 2013 March 2.38
4 Whole 2013 April 2.38
5 Whole 2013 May 2.38
6 Whole 2013 June 2.38
7 Whole 2013 July 2.38
8 Whole 2013 August 2.38
9 Whole 2013 September 2.38
10 Whole 2013 October 2.38
# … with 590 more rows
# there's now a poultry_tidy object in our R environment
Note that we do not assign the data file to an object. Rather, it comes in as an object based on whatever the previous user named it as. If we try to assign it as an object, the object will only have the name of the data file, rather than the data itself:
# note that this operation shouldn't include any variable assignment
test_dat <- load("../data_cleaned/poultry_tidy.RData")
test_dat # now it contains the object name, not the object itself
[1] "poultry_tidy"
You can also save any number of R
objects to a
.RData
file using the save()
function:
a <- rnorm(1000)
b <- matrix(runif(100),nrow=50,ncol=2)
c <- as_tibble(mtcars)
save(a,b,c,file="test_objs.RData")
# there is now a test_objs.RData file in my working directory:
list.files()
[1] "data_import.html" "data_import.Rmd"
[3] "more_data_import_-dplyr.html" "more_data_import_&dplyr.Rmd"
[5] "Rmd_Demo.html" "Rmd_Demo.Rmd"
[7] "rsconnect" "test_objs.RData"
Let’s remove these objects from our R
environment and
re-load them from the file we saved:
You now know a little bit about how to read in some common data types. Note that these aren’t the only types of data you’ll encounter, but they are by far the most common ones.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Conway (2022, June 7). Data Analytics and Computational Social Science: Data Import. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomspconway910890/
BibTeX citation
@misc{conway2022data, author = {Conway, Sean}, title = {Data Analytics and Computational Social Science: Data Import}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomspconway910890/}, year = {2022} }