the adventure continues
First, let’s talk about what we mean by “data wrangling.” A typical data science project begins with reading in whatever data we’re working with. From there, the next two steps are tidying and transforming the data. Together, these two processes are loosely referred to as wrangling.1
In my last post homework 2: reading in data, I read in the dataset “organiceggpoultry.xls” and spent a little time trying to understand the basic shape of the data. Today, I’ll pick up where I left off and see if I can wrangle it into something more decipherable.
To start, I’ll once again read in the data.
# set working directory
setwd("../../_data")
# assign data to variable
cogEggs <- read_excel("organiceggpoultry.xls")
# preview dataset
head(cogEggs)
# A tibble: 6 x 11
`(Certified Organ~ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
<chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr>
1 <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA>
2 USDA Certified Or~ <NA> <NA> <NA> <NA> NA USDA ~ <NA> <NA>
3 Price per Carton ~ <NA> <NA> <NA> <NA> NA Price~ <NA> <NA>
4 <NA> "Extr~ "Ext~ "Lar~ "Lar~ NA Whole B/S ~ Bone~
5 Jan 2004 "230" "132" "230" "126" NA 197.5 645.5 too ~
6 February "230" "134~ "226~ "128~ NA 197.5 642.5 too ~
# ... with 2 more variables: ...10 <chr>, ...11 <chr>
It seems like we’ll have to do a few things before being able to manipulate the data in any meaningful way.
I’ll first use the separate() function to split the column into two columns. I’ll then use the fill() function to fill in the missing values within the Year column.
# separate column 1 into 2 columns
cogEggs <- cogEggs %>%
separate(col = 1, into = c('Month', 'Year'), sep = ' ')
# preview to check
head(cogEggs)
# A tibble: 6 x 12
Month Year ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
<chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <chr>
1 <NA> <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA> <NA>
2 USDA Cert~ <NA> <NA> <NA> <NA> NA USDA~ <NA> <NA> <NA>
3 Price per <NA> <NA> <NA> <NA> NA Pric~ <NA> <NA> <NA>
4 <NA> <NA> "Ext~ "Ext~ "Lar~ "Lar~ NA Whole B/S ~ Bone~ Whol~
5 Jan 2004 "230" "132" "230" "126" NA 197.5 645.5 too ~ 193.5
6 February <NA> "230" "134~ "226~ "128~ NA 197.5 642.5 too ~ 193.5
# ... with 1 more variable: ...11 <chr>
We now have two columns: Month and Year. Looking at row 5, we can see that where we have a value for year, it did indeed get moved into the appropriate column. Looking at row 6, however, we can see that not all of the rows have a value for year.
I’ll now use the fill() function to fill in the remaining years. This function will automatically fill designated cells with the value in the cell above/below. It’s easy to conceptualize this function as equivalent to copying and pasting by dragging in Excel.
# fill values in Year column
cogEggs <- cogEggs %>%
fill(Year)
#view to check
cogEggs
# A tibble: 124 x 12
Month Year ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
<chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <chr>
1 <NA> <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA> <NA>
2 USDA Certi~ <NA> <NA> <NA> <NA> NA USDA~ <NA> <NA> <NA>
3 Price per <NA> <NA> <NA> <NA> NA Pric~ <NA> <NA> <NA>
4 <NA> per "Ext~ "Ext~ "Lar~ "Lar~ NA Whole B/S ~ Bone~ Whol~
5 Jan 2004 "230" "132" "230" "126" NA 197.5 645.5 too ~ 193.5
6 Febru~ 2004 "230" "134~ "226~ "128~ NA 197.5 642.5 too ~ 193.5
7 March 2004 "230" "137" "225" "131" NA 209 642.5 too ~ 193.5
8 April 2004 "234~ "137" "225" "131" NA 212 642.5 too ~ 193.5
9 May 2004 "236" "137" "225" "131" NA 214.5 642.5 too ~ 193.5
10 June 2004 "241" "137" "231~ "133~ NA 216.~ 641 too ~ 201.~
# ... with 114 more rows, and 1 more variable: ...11 <chr>
Our dataset is slowly taking shape! As is, our dataset actually contains data for two separate datasets. It’s interesting to note that these were two separate sheets in the Excel file and were read in as one giant dataset. Because I am only interested in eggs, I’m going to reassign our object cogEggs to a new, paired-down dataset.
In order to make this easier, I am going to first rename the columns I want to keep. I’ll use the colnames() function to make sure I know the names of the columns.
# get column names
colnames(cogEggs)
[1] "Month" "Year" "...2" "...3" "...4" "...5" "...6" "...7"
[9] "...8" "...9" "...10" "...11"
Since I’ve already renamed the Month and Year columns, I only need to rename columns 2-5. Column 6 is empty and everything from column 7 on belongs to the chicken dataset.
# rename columns 2-5
cogEggs <- cogEggs %>%
rename(xlDoz = ...2, xlHalfDoz = ...3, lgDoz = ...4, lgHalfDoz = ...5)
#preview to check
head(cogEggs)
# A tibble: 6 x 12
Month Year xlDoz xlHalfDoz lgDoz lgHalfDoz ...6 ...7 ...8 ...9
<chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr>
1 <NA> <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA>
2 USDA Cert~ <NA> <NA> <NA> <NA> NA USDA~ <NA> <NA>
3 Price per <NA> <NA> <NA> <NA> NA Pric~ <NA> <NA>
4 <NA> per "Ext~ "Extra L~ "Lar~ "Large \~ NA Whole B/S ~ Bone~
5 Jan 2004 "230" "132" "230" "126" NA 197.5 645.5 too ~
6 Febru~ 2004 "230" "134.5" "226~ "128.5" NA 197.5 642.5 too ~
# ... with 2 more variables: ...10 <chr>, ...11 <chr>
#reassign object to paired-down dataset
cogEggs <- cogEggs %>%
select(Month, Year, xlDoz, xlHalfDoz, lgDoz, lgHalfDoz)
# preview to check
head(cogEggs)
# A tibble: 6 x 6
Month Year xlDoz xlHalfDoz lgDoz lgHalfDoz
<chr> <chr> <chr> <chr> <chr> <chr>
1 <NA> <NA> <NA> <NA> <NA> <NA>
2 USDA Certified <NA> <NA> <NA> <NA>
3 Price per <NA> <NA> <NA> <NA>
4 <NA> per "Extra Large \nDozen" "Extra Lar~ "Lar~ "Large \~
5 Jan 2004 "230" "132" "230" "126"
6 February 2004 "230" "134.5" "226~ "128.5"
Now let’s get rid of the the rows that don’t contain any data.
# slice all rows except 1-4
cogEggs <- cogEggs %>%
slice(-c(1,2,3,4))
#view to check
cogEggs
# A tibble: 120 x 6
Month Year xlDoz xlHalfDoz lgDoz lgHalfDoz
<chr> <chr> <chr> <chr> <chr> <chr>
1 Jan 2004 230 132 230 126
2 February 2004 230 134.5 226.25 128.5
3 March 2004 230 137 225 131
4 April 2004 234.5 137 225 131
5 May 2004 236 137 225 131
6 June 2004 241 137 231.375 133.5
7 July 2004 241 137 233.5 133.5
8 August 2004 241 137 233.5 133.5
9 September 2004 241 135.875 233.5 129.75
10 October 2004 241 135.5 233.5 128.5
# ... with 110 more rows
Our final step today will be to pivot this table from wide to long.
Our variables in this dataset are 1) month, 2) year, 3) egg size, 4) carton size
Our variables month and year already have their own column. The remaining columns, however, are actually values for the variables egg size (xl and lg) and carton size (dozen and half-dozen). This may be slightly more complicated because each column contains one egg size value and one carton size value. Yikes.
After reading about how to use the pivot_longer() function when each column contains multiple values, I realize that the way I renamed the columns is going to make my task here more difficult. To not totally lose my mind doing this, I’m going to rename them again, this time using an underscore, which will allow me to use the names_sep() function to separate by the _. I strongly dislike underscores but live and learn.
# rename columns 2-5 to contain _
cogEggs <- cogEggs %>%
rename(xl_Doz = xlDoz, xl_HalfDoz = xlHalfDoz, lg_Doz = lgDoz, lg_HalfDoz = lgHalfDoz)
#preview to check
head(cogEggs)
# A tibble: 6 x 6
Month Year xl_Doz xl_HalfDoz lg_Doz lg_HalfDoz
<chr> <chr> <chr> <chr> <chr> <chr>
1 Jan 2004 230 132 230 126
2 February 2004 230 134.5 226.25 128.5
3 March 2004 230 137 225 131
4 April 2004 234.5 137 225 131
5 May 2004 236 137 225 131
6 June 2004 241 137 231.375 133.5
Okay, now to actually pivot.
# pivot from wide to long
cogEggs <- cogEggs %>%
pivot_longer(
cols = !Month & !Year,
names_to = c("Egg_Size", "Carton_Size"),
names_sep = "_",
values_to = "Price"
)
#view to check
cogEggs
# A tibble: 480 x 5
Month Year Egg_Size Carton_Size Price
<chr> <chr> <chr> <chr> <chr>
1 Jan 2004 xl Doz 230
2 Jan 2004 xl HalfDoz 132
3 Jan 2004 lg Doz 230
4 Jan 2004 lg HalfDoz 126
5 February 2004 xl Doz 230
6 February 2004 xl HalfDoz 134.5
7 February 2004 lg Doz 226.25
8 February 2004 lg HalfDoz 128.5
9 March 2004 xl Doz 230
10 March 2004 xl HalfDoz 137
# ... with 470 more rows
That’s it for today! Datasets must meet three criteria to be considered “tidy.”2
Pivoting has made it so that this dataset meets these criteria. Next up: some summary stats and maybe a cool graph of some kind.
Here are a few notes I jotted down today that I’ll think about next time.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Battaglia (2021, Sept. 27). DACSS 601 Fall 2021: homework 3: basic data wrangling. Retrieved from https://mrolfe.github.io/DACSS601Fall21/posts/2021-09-27-homework-3-basic-data-wrangling/
BibTeX citation
@misc{battaglia2021homework, author = {Battaglia, Claire}, title = {DACSS 601 Fall 2021: homework 3: basic data wrangling}, url = {https://mrolfe.github.io/DACSS601Fall21/posts/2021-09-27-homework-3-basic-data-wrangling/}, year = {2021} }