Data Analytics and Computational Social Science: Data Import

Sean Conway

Overview

Today, we’re going to read in three versions of the poultry_tidy data. All data files are available on the sample datasets page on the Google Classroom.

We will specifically read in 3 data files:
- poultry_tidy.csv
- poultry_tidy.xlsx
- poultry_tidy.RData

These are the “clean” versions of the raw data files (which are also available on Google Classroom).

To run this file, all 3 datasets should be in the same directory on your computer.

Getting Started

To begin, we need to load two packages: readr and readxl, which contain very useful functions for reading in data to `R.

library(readr)
library(readxl)

If you’re unsure whether or not you have these packages installed, you can run the following command:

installed.packages()

First, let’s check our working directory using the getwd() command:

getwd()

[1] "/Users/seanconway/Teaching/DACSS_601_Su22/scripts"

I need to change my working directory to the location of the datasets on my computer, using the setwd() command. Note that for RMarkdown files, the working directory will revert back to the file location for each code chunk. So we actually need to rerun this line below everytime we import data:

setwd("/Users/seanconway/Teaching/DACSS_601_Su22/data_cleaned")

We’re now ready to get started reading in actual datasets.

Reading in delimited text files

.csv is a common type of delimited text file. .csv stands for comma-separated value. This means that commas separate cells from one another.

R has a base read.csv() function. However, it comes with a couple of downsides - namely that it imports data as a dataframe rather than a tibble. So we will be using the function read_csv() from the readr package. In addition to importing data as a tibble, it also does a much better job guessing data types.

read_csv() is essentially a wrapper function (a function that calls another function) around the more general read_delim() function. Also see read_tsv() for tab-separated values.

?read_delim

Let’s look at the data files available for us to read in:

setwd("/Users/seanconway/Teaching/DACSS_601_Su22/data_cleaned")
list.files()

 [1] "airforce_cleaned.csv"               
 [2] "army_cleaned.csv"                   
 [3] "australian_marriage_tidy.csv"       
 [4] "australian_marriage_tidy.RData"     
 [5] "australian_marriage_tidy.xlsx"      
 [6] "bicycle_NY_tidy.RData"              
 [7] "eggs_2004_2013.csv"                 
 [8] "eggs_tidy.xlsx"                     
 [9] "marinecorps_cleaned.csv"            
[10] "marital_dod_tidy.RData"             
[11] "marital_tidy_all.csv"               
[12] "navy_cleaned.csv"                   
[13] "poultry_tidy.csv"                   
[14] "poultry_tidy.RData"                 
[15] "poultry_tidy.xlsx"                  
[16] "railroad_2012_clean_county_tidy.csv"
[17] "railroad_2012_clean_county.xlsx"    
[18] "railroad_2012_clean_state.csv"      
[19] "railroad_2012_clean_state.xlsx"     
[20] "railroad_2012_county.RData"         
[21] "railroad_2012_state.RData"          
[22] "starwars1.RData"                    
[23] "test_objs.RData"                    
[24] "totaldod_cleaned.csv"

There’s a lot of data files there, but we are going to import the poultry_tidy.csv file. Doing so is very simple using read_csv():

setwd("/Users/seanconway/Teaching/DACSS_601_Su22/data_cleaned")
poultry_from_csv <- read_csv("poultry_tidy.csv")

Let’s take a look at our dataset (to view the tibble, running the name of the object will print it to the console):

poultry_from_csv

# A tibble: 600 × 4
   Product  Year Month     Price_Dollar
   <chr>   <dbl> <chr>            <dbl>
 1 Whole    2013 January           2.38
 2 Whole    2013 February          2.38
 3 Whole    2013 March             2.38
 4 Whole    2013 April             2.38
 5 Whole    2013 May               2.38
 6 Whole    2013 June              2.38
 7 Whole    2013 July              2.38
 8 Whole    2013 August            2.38
 9 Whole    2013 September         2.38
10 Whole    2013 October           2.38
# … with 590 more rows

It worked great! The data is all there. To inspect the data types for each of the four columns in poultry_from_csv, we can use spec() or typeof():

Note: we can also avoid the setwd() issue by modifying the name of the file to include the file path:

poultry_from_csv <- read_csv("../data_cleaned/poultry_tidy.csv")

spec(poultry_from_csv) # use the spec() function to check the data type for your columns

cols(
  Product = col_character(),
  Year = col_double(),
  Month = col_character(),
  Price_Dollar = col_double()
)

# can also use typeof() function on individual columns
typeof(poultry_from_csv$Product)

[1] "character"

typeof(poultry_from_csv$Year)

[1] "double"

typeof(poultry_from_csv$Month)

[1] "character"

typeof(poultry_from_csv$Price_Dollar)

[1] "double"

See this R section below for some more info on read_delim():

# read_delim() has a number of optional arguments
args(read_delim)

function (file, delim = NULL, quote = "\"", escape_backslash = FALSE, 
    escape_double = TRUE, col_names = TRUE, col_types = NULL, 
    col_select = NULL, id = NULL, locale = default_locale(), 
    na = c("", "NA"), quoted_na = TRUE, comment = "", trim_ws = FALSE, 
    skip = 0, n_max = Inf, guess_max = min(1000, n_max), name_repair = "unique", 
    num_threads = readr_threads(), progress = show_progress(), 
    show_col_types = should_show_types(), skip_empty_rows = TRUE, 
    lazy = should_read_lazy()) 
NULL

# there's too many to list here, so we will just go over a few
# run ?read_delim() to learn more
# 1) delim - text delimiter.
# default is NULL and read_delim() guesses delimiter
#
# 2) quote - symbol telling R when to quote a string
# default is "\""
# below comes from R documentation on quotes
# https://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html
# identical() is a function that returns TRUE if two objects are equal
identical(1+4, 3+2)

[1] TRUE

identical('"It\'s alive!", he screamed.',
          "\"It's alive!\", he screamed.") # same

[1] TRUE

#
# 3) escape_backlash
# use backlash to escape special characters?
# default = FALSE
#
# 4) col_names
# can be TRUE (default), meaning that R reads in the first row of values as column names
# can FALSE - R creates column names (x1 x2 etc)
# OR can be a character vector of custom column names
poultry_custom_cols <- read_csv("../data_cleaned/poultry_tidy.csv",
                                col_names = c("prod","yr","mo","$"),
                                skip = 1) # need this to skip the file's column names
poultry_custom_cols

# A tibble: 600 × 4
   prod     yr mo          `$`
   <chr> <dbl> <chr>     <dbl>
 1 Whole  2013 January    2.38
 2 Whole  2013 February   2.38
 3 Whole  2013 March      2.38
 4 Whole  2013 April      2.38
 5 Whole  2013 May        2.38
 6 Whole  2013 June       2.38
 7 Whole  2013 July       2.38
 8 Whole  2013 August     2.38
 9 Whole  2013 September  2.38
10 Whole  2013 October    2.38
# … with 590 more rows

poultry_custom_cols$`$` # note the backticks around the $ sign

  [1] 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500
  [9] 2.38500 2.38500 2.38500 2.38500 7.03750 7.03750 7.03750 7.03750
 [17] 7.03750 7.03750 7.03750 7.03750 7.03750 7.03750 7.03750 7.03750
 [25] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
 [33] 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500
 [41] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
 [49] 2.16250 2.16250 2.16250 2.16250 2.16250 2.16250 2.16250 2.16250
 [57] 2.16250 2.16250 2.16250 2.16250 2.35000 2.38500 2.38500 2.38500
 [65] 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500 2.38500
 [73] 6.37500 7.00000 7.00000 7.00000 7.00000 7.00000 7.00000 7.00000
 [81] 7.00000 7.00000 7.03750 7.03750 3.90500 3.90500 3.90500 3.90500
 [89] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
 [97] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[105] 2.03500 2.03500 2.03500 2.03500 2.15000 2.15000 2.15000 2.15000
[113] 2.15000 2.15000 2.15000 2.16250 2.16250 2.16250 2.16250 2.16250
[121] 2.35000 2.35000 2.35000 2.35000 2.35000 2.35000 2.35000 2.35000
[129] 2.35000 2.35000 2.35000 2.35000 6.37500 6.37500 6.37500 6.37500
[137] 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500 6.37500
[145] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[153] 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500
[161] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[169] 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000 2.15000
[177] 2.15000 2.15000 2.15000 2.15000 2.48000 2.48000 2.48000 2.41500
[185] 2.35000 2.35000 2.41500 2.35000 2.35000 2.35000 2.35000 2.35000
[193] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.42300
[201] 6.37500 6.37500 6.37500 6.37500 3.90500 3.90500 3.90500 3.90500
[209] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[217] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[225] 2.03500 2.03500 2.03500 2.03500 2.22000 2.22000 2.22000 2.22000
[233] 2.22000 2.22000 2.22000 2.19200 2.15000 2.15000 2.15000 2.15000
[241] 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000
[249] 2.48000 2.48000 2.48000 2.48000 6.45500 6.45500 6.45500 6.45500
[257] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[265] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[273] 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500
[281] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[289] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[297] 2.22000 2.22000 2.22000 2.22000 2.20500 2.20500 2.20500 2.20500
[305] 2.20500 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000 2.48000
[313] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[321] 6.45500 6.45500 6.45500 6.45500 3.90500 3.90500 3.90500 3.90500
[329] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[337] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[345] 2.03500 2.03500 2.03500 2.03500 2.22000 2.22000 2.22000 2.22000
[353] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[361] 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500
[369] 2.20500 2.20500 2.20500 2.20500 6.45500 6.45500 6.45500 6.45500
[377] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[385] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[393] 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500
[401] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[409] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[417] 2.22000 2.22000 2.22000 2.22000 2.17000 2.17000 2.19625 2.20500
[425] 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500 2.20500
[433] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[441] 6.45500 6.45500 6.45500 6.45500 3.90500 3.90500 3.90500 3.90500
[449] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[457] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[465] 2.03500 2.03500 2.03500 2.03500 2.22000 2.22000 2.22000 2.22000
[473] 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[481] 2.17000 2.17000 2.17000 2.17000 2.17000 2.17000 2.17000 2.17000
[489] 2.17000 2.17000 2.17000 2.17000 6.44000 6.45500 6.45500 6.45500
[497] 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500 6.45500
[505] 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[513] 3.90500 3.90500 3.90500 3.90500 2.03500 2.03500 2.03500 2.03500
[521] 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500 2.03500
[529] 2.13000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000 2.22000
[537] 2.22000 2.22000 2.22000 2.22000 1.97500 1.97500 2.09000 2.12000
[545] 2.14500 2.16375 2.17000 2.17000 2.17000 2.17000 2.17000 2.17000
[553] 6.45500 6.42500 6.42500 6.42500 6.42500 6.41000 6.42500 6.42500
[561] 6.42500 6.42500 6.42500 6.42500      NA      NA      NA      NA
[569]      NA      NA 3.90500 3.90500 3.90500 3.90500 3.90500 3.90500
[577] 1.93500 1.93500 1.93500 1.93500 1.93500 2.01875 2.03500 2.03500
[585] 2.03500 2.03500 2.03500 2.03500      NA 2.03000 2.03000 2.03000
[593] 2.03000 2.00375 1.99500 1.99500 1.99500 1.99500 1.99500 1.99500

# $ is a "special symbol" in R, because it is an operator used for indexing
# $ is technically an illegal column name, but we can still use it with ``
# same goes for column names consisting of numbers or other symbols, etc.
#
# 5) col_types
# default=NULL
# if NULL R guesses data type from first 1000 rows
# can also specify manually (but be careful)
# see ?read_delim and scroll to col_types for details
#
# 6) skip
# number of lines to skip
# default=0
# can be very useful with messy data files
#
# 7) n_max
# maximum number of lines to read
# default=Inf
#
#

Read in `.xls`/`.xlsx` files

.xls and .xlsx are files created in Microsoft Excel. There are separate functions read_xls() and read_xlsx(), but I find it’s best to use the wrapper function read_excel(). This will automatically call the correct function and avoid an error from accidentally mis-specifying the file type.

See below for what happens if we call the wrong function for the file type:

# the try() function will try to run the code
# see tryCatch() for more error handling 
# this code doesn't work because it tries to read the wrong file type
try(read_xls("../data_cleaned/poultry_tidy.xlsx"))

Error : 
  filepath: /Users/seanconway/Teaching/DACSS_601_Su22/data_cleaned/poultry_tidy.xlsx
  libxls error: Unable to open file

The code below works just fine, however:

# this code works 
poultry_from_excel <- try(read_excel("poultry_tidy.xlsx",
                                     skip = 5,
                                     col_names = c("prod","year","month","price")))

Error : `path` does not exist: 'poultry_tidy.xlsx'

poultry_from_excel

[1] "Error : `path` does not exist: 'poultry_tidy.xlsx'\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError: `path` does not exist: 'poultry_tidy.xlsx'>

Let’s take a look at this tibble:

# examining our tibble
head(poultry_from_excel) # view the first several rows

[1] "Error : `path` does not exist: 'poultry_tidy.xlsx'\n"

colnames(poultry_from_excel) # print column names

NULL

dplyr::glimpse(poultry_from_excel) # tidy little summary of it

 'try-error' chr "Error : `path` does not exist: 'poultry_tidy.xlsx'\n"
 - attr(*, "condition")=List of 2
  ..$ message: chr "`path` does not exist: 'poultry_tidy.xlsx'"
  ..$ call   : NULL
  ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

# the package::function() syntax is only necessary if the package isn't loaded

Function documentation:

# to view function documentation
?read_excel

# optional arguments
# 1) sheet=NULL
# number of the sheet to read in
# by default it reads the first sheet

# 2) range=NULL
# range of cells to read in
# uses the cellranger package to work with specific cells in Excel files
# for more, see the cellranger package
# https://cran.r-project.org/web/packages/cellranger/index.html

# 3) col_names=TRUE
# how to get column names (works the same as read_delim())

# 4) col_types=NULL
# types of data in columns (works the same as read_delim())

# 5) skip = 0
# number of lines to skip (works the same as read_delim())

# 6) n_max=Inf
# max lines to read (works the same as read_delim())

Reading in `.RData` Files

Reading .RData is less commonly needed, but it’s still important to know about. .RData is a file type exclusively associated with R. It’s commonly used when someone has performed operations with data and saved the results to give to collaborators.

We can use the load() function to load R objects into our R environment from a file:

# running the load() function on the data file name will load the objects into your R environment
load("../data_cleaned/poultry_tidy.RData")
poultry_tidy

# A tibble: 600 × 4
   Product  Year Month     Price_Dollar
   <chr>   <int> <chr>            <dbl>
 1 Whole    2013 January           2.38
 2 Whole    2013 February          2.38
 3 Whole    2013 March             2.38
 4 Whole    2013 April             2.38
 5 Whole    2013 May               2.38
 6 Whole    2013 June              2.38
 7 Whole    2013 July              2.38
 8 Whole    2013 August            2.38
 9 Whole    2013 September         2.38
10 Whole    2013 October           2.38
# … with 590 more rows

# there's now a poultry_tidy object in our R environment

Note that we do not assign the data file to an object. Rather, it comes in as an object based on whatever the previous user named it as. If we try to assign it as an object, the object will only have the name of the data file, rather than the data itself:

# note that this operation shouldn't include any variable assignment
test_dat <- load("../data_cleaned/poultry_tidy.RData")
test_dat # now it contains the object name, not the object itself

[1] "poultry_tidy"

You can also save any number of R objects to a .RData file using the save() function:

a <- rnorm(1000)
b <- matrix(runif(100),nrow=50,ncol=2)
c <- as_tibble(mtcars)
save(a,b,c,file="test_objs.RData")
# there is now a test_objs.RData file in my working directory: 
list.files()

[1] "data_import.html"             "data_import.Rmd"             
[3] "more_data_import_-dplyr.html" "more_data_import_&dplyr.Rmd" 
[5] "Rmd_Demo.html"                "Rmd_Demo.Rmd"                
[7] "rsconnect"                    "test_objs.RData"

Let’s remove these objects from our R environment and re-load them from the file we saved:

# remove objects from environment
rm(list=c("a","b","c"))

# now they're back! (If you save them)
try(load("test_objs.RData"))

Conclusion

You now know a little bit about how to read in some common data types. Note that these aren’t the only types of data you’ll encounter, but they are by far the most common ones.

Comment on this article Share:

Data Import

Overview

Getting Started

Reading in delimited text files

Read in `.xls`/`.xlsx` files

Reading in `.RData` Files

Conclusion

Reuse

Citation

Data Import

Overview

Getting Started

Reading in delimited text files

Read in .xls/.xlsx files

Reading in .RData Files

Conclusion

Reuse

Citation

Read in `.xls`/`.xlsx` files

Reading in `.RData` Files