practicing the unsexy first step of data analysis
The first step in any project is going to be reading in the data. It’s not glamorous, but without it we’d be stuck trying to make sense of a messy spreadsheet forever. For this first post, I’ll use the data set “organiceggpoultry.xls.”
# load packages: readxl for read_excel(), dplyr for select()
library(readxl)
library(dplyr)
# set working directory
setwd("../../_data")
# assign data to variable
cogEggs <- read_excel("organiceggpoultry.xls")
Once we’ve read the data in, we’ll use a few commands to get a bird’s-eye view of what we’re looking at. Getting the dimensions is a good place to start.
# get dimensions
dim(cogEggs)
[1] 124 11
This means that there are 124 rows and 11 columns. In other words, there are 124 observations of 11 variables.
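To make the row/column order concrete, here is a minimal sketch on a tiny made-up data frame. `toy` is a hypothetical stand-in I invented for illustration, not the egg data.

```r
# toy is a made-up data frame used only for illustration
toy <- data.frame(
  month = c("Jan", "Feb", "Mar"),
  price = c(230, 230, 226)
)

# dim() returns the number of rows first, then the number of columns
dim(toy)

# nrow() and ncol() return the same two numbers separately
nrow(toy)
ncol(toy)
```

Rows always come first, so a result like `124 11` reads as 124 rows by 11 columns.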
Next, let’s preview the data set.
# preview first 6 rows (the default for head())
head(cogEggs)
# A tibble: 6 x 11
`(Certified Organ~ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
<chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr>
1 <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA>
2 USDA Certified Or~ <NA> <NA> <NA> <NA> NA USDA ~ <NA> <NA>
3 Price per Carton ~ <NA> <NA> <NA> <NA> NA Price~ <NA> <NA>
4 <NA> "Extr~ "Ext~ "Lar~ "Lar~ NA Whole B/S ~ Bone~
5 Jan 2004 "230" "132" "230" "126" NA 197.5 645.5 too ~
6 February "230" "134~ "226~ "128~ NA 197.5 642.5 too ~
# ... with 2 more variables: ...10 <chr>, ...11 <chr>
yikes!
Using the colnames() function will show us the names of all 11 columns, which may help us better understand what we’re looking at.
# get column names
colnames(cogEggs)
[1] "(Certified Organic denotes products grown and processed according to USDA's national organic standards and certified by USDA-accredited State and private certification organizations.)"
[2] "...2"
[3] "...3"
[4] "...4"
[5] "...5"
[6] "...6"
[7] "...7"
[8] "...8"
[9] "...9"
[10] "...10"
[11] "...11"
Something tells me that these are not actually the names of the columns.
Selecting a specific column to preview might help us understand what the column names should be.
# preview first 6 rows of column 2
head(select(cogEggs, "...2"))
# A tibble: 6 x 1
...2
<chr>
1 <NA>
2 <NA>
3 <NA>
4 "Extra Large \nDozen"
5 "230"
6 "230"
nope! This data set is going to need some work. More next week!
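In the meantime, here is a rough sketch of one possible direction, not a finished fix. It assumes the real labels sit in row 4 of the imported data, as the preview suggests, and it runs on a small made-up data frame (`raw`) rather than the actual file. The names `raw`, `header_row`, `labels`, `clean`, and the column name `month` are all hypothetical.

```r
# raw is a made-up stand-in for the imported sheet: junk rows on top,
# the real header labels in row 4, and data starting in row 5
raw <- data.frame(
  note = c(NA, "table title", "table subtitle", NA, "Jan 2004", "February"),
  x2   = c(NA, NA, NA, "Extra Large \nDozen", "230", "230"),
  stringsAsFactors = FALSE
)

header_row <- 4  # assumption: the row that holds the real labels

# pull the labels out of the header row
labels <- unlist(raw[header_row, ], use.names = FALSE)
# hypothetical name for the unlabeled first column
labels[is.na(labels)] <- "month"
# collapse the embedded line break into a single space
labels <- gsub("\\s+", " ", labels)

# keep only the rows below the header and apply the labels
clean <- raw[-seq_len(header_row), , drop = FALSE]
names(clean) <- labels
rownames(clean) <- NULL

# the prices were read in as text, so convert them to numbers
clean[["Extra Large Dozen"]] <- as.numeric(clean[["Extra Large Dozen"]])

clean
```

`read_excel()` also has a `skip` argument that might handle some of this at import time, though I haven't worked out the right value for this file.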
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Battaglia (2021, Sept. 15). DACSS 601 Fall 2021: blog post 1: reading in data. Retrieved from https://mrolfe.github.io/DACSS601Fall21/posts/2021-09-15-blog-post-1-read-in-data/
BibTeX citation
@misc{battaglia2021blog, author = {Battaglia, Claire}, title = {DACSS 601 Fall 2021: blog post 1: reading in data}, url = {https://mrolfe.github.io/DACSS601Fall21/posts/2021-09-15-blog-post-1-read-in-data/}, year = {2021} }