homework 3: basic data wrangling

homework 3 data wrangling organiceggspoultry.xls

the adventure continues

Claire Battaglia
09-27-2021

basic data wrangling

First, let’s talk about what we mean by “data wrangling.” A typical data science project begins with reading in whatever data we’re working with. From there, the next two steps are tidying and transforming the data. Together, these two processes are loosely referred to as wrangling.1

In my last post homework 2: reading in data, I read in the dataset “organiceggpoultry.xls” and spent a little time trying to understand the basic shape of the data. Today, I’ll pick up where I left off and see if I can wrangle it into something more decipherable.

To start, I’ll once again read in the data.

# set working directory
setwd("../../_data")

# assign data to variable
cogEggs <- read_excel("organiceggpoultry.xls")

# preview dataset
head(cogEggs)
# A tibble: 6 x 11
  `(Certified Organ~ ...2   ...3  ...4  ...5  ...6  ...7   ...8  ...9 
  <chr>              <chr>  <chr> <chr> <chr> <lgl> <chr>  <chr> <chr>
1 <NA>                <NA>   <NA>  <NA>  <NA> NA    <NA>   <NA>  <NA> 
2 USDA Certified Or~  <NA>   <NA>  <NA>  <NA> NA    USDA ~ <NA>  <NA> 
3 Price per Carton ~  <NA>   <NA>  <NA>  <NA> NA    Price~ <NA>  <NA> 
4 <NA>               "Extr~ "Ext~ "Lar~ "Lar~ NA    Whole  B/S ~ Bone~
5 Jan 2004           "230"  "132" "230" "126" NA    197.5  645.5 too ~
6 February           "230"  "134~ "226~ "128~ NA    197.5  642.5 too ~
# ... with 2 more variables: ...10 <chr>, ...11 <chr>

It seems like we’ll have to do a few things before being able to manipulate the data in any meaningful way.

to-do

splitting one column into two

I’ll first use the separate() function to split the column into two columns. I’ll then use the fill() function to fill in the missing values within the Year column.

# separate column 1 into 2 columns
cogEggs <- cogEggs %>%
  separate(col = 1, into = c('Month', 'Year'), sep = ' ')

# preview to check
head(cogEggs)
# A tibble: 6 x 12
  Month    Year  ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
  <chr>    <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <chr>
1 <NA>     <NA>   <NA>  <NA>  <NA>  <NA> NA    <NA>  <NA>  <NA>  <NA> 
2 USDA     Cert~  <NA>  <NA>  <NA>  <NA> NA    USDA~ <NA>  <NA>  <NA> 
3 Price    per    <NA>  <NA>  <NA>  <NA> NA    Pric~ <NA>  <NA>  <NA> 
4 <NA>     <NA>  "Ext~ "Ext~ "Lar~ "Lar~ NA    Whole B/S ~ Bone~ Whol~
5 Jan      2004  "230" "132" "230" "126" NA    197.5 645.5 too ~ 193.5
6 February <NA>  "230" "134~ "226~ "128~ NA    197.5 642.5 too ~ 193.5
# ... with 1 more variable: ...11 <chr>

We now have two columns: Month and Year. Looking at row 5, we can see that where we have a value for year, it did indeed get moved into the appropriate column. Looking at row 6, however, we can see that not all of the rows have a value for year.

I’ll now use the fill() function to fill in the remaining years. This function will automatically fill designated cells with the value in the cell above/below. It’s easy to conceptualize this function as equivalent to copying and pasting by dragging in Excel.

# fill values in Year column
cogEggs <- cogEggs %>%
  fill(Year)

#view to check
cogEggs
# A tibble: 124 x 12
   Month  Year   ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
   <chr>  <chr>  <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <chr>
 1 <NA>   <NA>    <NA>  <NA>  <NA>  <NA> NA    <NA>  <NA>  <NA>  <NA> 
 2 USDA   Certi~  <NA>  <NA>  <NA>  <NA> NA    USDA~ <NA>  <NA>  <NA> 
 3 Price  per     <NA>  <NA>  <NA>  <NA> NA    Pric~ <NA>  <NA>  <NA> 
 4 <NA>   per    "Ext~ "Ext~ "Lar~ "Lar~ NA    Whole B/S ~ Bone~ Whol~
 5 Jan    2004   "230" "132" "230" "126" NA    197.5 645.5 too ~ 193.5
 6 Febru~ 2004   "230" "134~ "226~ "128~ NA    197.5 642.5 too ~ 193.5
 7 March  2004   "230" "137" "225" "131" NA    209   642.5 too ~ 193.5
 8 April  2004   "234~ "137" "225" "131" NA    212   642.5 too ~ 193.5
 9 May    2004   "236" "137" "225" "131" NA    214.5 642.5 too ~ 193.5
10 June   2004   "241" "137" "231~ "133~ NA    216.~ 641   too ~ 201.~
# ... with 114 more rows, and 1 more variable: ...11 <chr>

Our dataset is slowly taking shape! As is, our dataset actually contains data for two separate datasets. It’s interesting to note that these were two separate sheets in the Excel file and were read in as one giant dataset. Because I am only interested in eggs, I’m going to reassign our object cogEggs to a new, paired-down dataset.

In order to make this easier, I am going to first rename the columns I want to keep. I’ll use the colnames() function to make sure I know the names of the columns.

# get column names
colnames(cogEggs)
 [1] "Month" "Year"  "...2"  "...3"  "...4"  "...5"  "...6"  "...7" 
 [9] "...8"  "...9"  "...10" "...11"

Since I’ve already renamed the Month and Year columns, I only need to rename columns 2-5. Column 6 is empty and everything from column 7 on belongs to the chicken dataset.

# rename columns 2-5
cogEggs <- cogEggs %>%
  rename(xlDoz = ...2, xlHalfDoz = ...3, lgDoz = ...4, lgHalfDoz = ...5)

#preview to check
head(cogEggs)
# A tibble: 6 x 12
  Month  Year  xlDoz xlHalfDoz lgDoz lgHalfDoz ...6  ...7  ...8  ...9 
  <chr>  <chr> <chr> <chr>     <chr> <chr>     <lgl> <chr> <chr> <chr>
1 <NA>   <NA>   <NA>  <NA>      <NA>  <NA>     NA    <NA>  <NA>  <NA> 
2 USDA   Cert~  <NA>  <NA>      <NA>  <NA>     NA    USDA~ <NA>  <NA> 
3 Price  per    <NA>  <NA>      <NA>  <NA>     NA    Pric~ <NA>  <NA> 
4 <NA>   per   "Ext~ "Extra L~ "Lar~ "Large \~ NA    Whole B/S ~ Bone~
5 Jan    2004  "230" "132"     "230" "126"     NA    197.5 645.5 too ~
6 Febru~ 2004  "230" "134.5"   "226~ "128.5"   NA    197.5 642.5 too ~
# ... with 2 more variables: ...10 <chr>, ...11 <chr>
#reassign object to paired-down dataset
cogEggs <- cogEggs %>%
  select(Month, Year, xlDoz, xlHalfDoz, lgDoz, lgHalfDoz)

# preview to check
head(cogEggs)
# A tibble: 6 x 6
  Month    Year      xlDoz                 xlHalfDoz   lgDoz lgHalfDoz
  <chr>    <chr>     <chr>                 <chr>       <chr> <chr>    
1 <NA>     <NA>       <NA>                  <NA>        <NA>  <NA>    
2 USDA     Certified  <NA>                  <NA>        <NA>  <NA>    
3 Price    per        <NA>                  <NA>        <NA>  <NA>    
4 <NA>     per       "Extra Large \nDozen" "Extra Lar~ "Lar~ "Large \~
5 Jan      2004      "230"                 "132"       "230" "126"    
6 February 2004      "230"                 "134.5"     "226~ "128.5"  

Now let’s get rid of the the rows that don’t contain any data.

# slice all rows except 1-4
cogEggs <- cogEggs %>%
  slice(-c(1,2,3,4))

#view to check
cogEggs
# A tibble: 120 x 6
   Month     Year  xlDoz xlHalfDoz lgDoz   lgHalfDoz
   <chr>     <chr> <chr> <chr>     <chr>   <chr>    
 1 Jan       2004  230   132       230     126      
 2 February  2004  230   134.5     226.25  128.5    
 3 March     2004  230   137       225     131      
 4 April     2004  234.5 137       225     131      
 5 May       2004  236   137       225     131      
 6 June      2004  241   137       231.375 133.5    
 7 July      2004  241   137       233.5   133.5    
 8 August    2004  241   137       233.5   133.5    
 9 September 2004  241   135.875   233.5   129.75   
10 October   2004  241   135.5     233.5   128.5    
# ... with 110 more rows

pivoting from wide to long

Our final step today will be to pivot this table from wide to long.

Our variables in this dataset are 1) month, 2) year, 3) egg size, 4) carton size

Our variables month and year already have their own column. The remaining columns, however, are actually values for the variables egg size (xl and lg) and carton size (dozen and half-dozen). This may be slightly more complicated because each column contains one egg size value and one carton size value. Yikes.

After reading about how to use the pivot_longer() function when each column contains multiple values, I realize that the way I renamed the columns is going to make my task here more difficult. To not totally lose my mind doing this, I’m going to rename them again, this time using an underscore, which will allow me to use the names_sep() function to separate by the _. I strongly dislike underscores but live and learn.

# rename columns 2-5 to contain _
cogEggs <- cogEggs %>%
  rename(xl_Doz = xlDoz, xl_HalfDoz = xlHalfDoz, lg_Doz = lgDoz, lg_HalfDoz = lgHalfDoz)

#preview to check
head(cogEggs)
# A tibble: 6 x 6
  Month    Year  xl_Doz xl_HalfDoz lg_Doz  lg_HalfDoz
  <chr>    <chr> <chr>  <chr>      <chr>   <chr>     
1 Jan      2004  230    132        230     126       
2 February 2004  230    134.5      226.25  128.5     
3 March    2004  230    137        225     131       
4 April    2004  234.5  137        225     131       
5 May      2004  236    137        225     131       
6 June     2004  241    137        231.375 133.5     

Okay, now to actually pivot.

# pivot from wide to long
cogEggs <- cogEggs %>%
  pivot_longer(
    cols = !Month & !Year,
    names_to = c("Egg_Size", "Carton_Size"),
    names_sep = "_",
    values_to = "Price"
  )

#view to check
cogEggs
# A tibble: 480 x 5
   Month    Year  Egg_Size Carton_Size Price 
   <chr>    <chr> <chr>    <chr>       <chr> 
 1 Jan      2004  xl       Doz         230   
 2 Jan      2004  xl       HalfDoz     132   
 3 Jan      2004  lg       Doz         230   
 4 Jan      2004  lg       HalfDoz     126   
 5 February 2004  xl       Doz         230   
 6 February 2004  xl       HalfDoz     134.5 
 7 February 2004  lg       Doz         226.25
 8 February 2004  lg       HalfDoz     128.5 
 9 March    2004  xl       Doz         230   
10 March    2004  xl       HalfDoz     137   
# ... with 470 more rows

That’s it for today! Datasets must meet three criteria to be considered “tidy.”2

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Pivoting has made it so that this dataset meets these criteria. Next up: some summary stats and maybe a cool graph of some kind.

moving forward

Here are a few notes I jotted down today that I’ll think about next time.


  1. Hadley, Wickham and Garrett Grolemund. 2017. R for Data Science. Sebastopol, CA. O’Reilly Media, Inc.↩︎

  2. Hadley, Wickham and Garrett Grolemund. 2017. R for Data Science. Sebastopol, CA. O’Reilly Media, Inc.

    ↩︎

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Battaglia (2021, Sept. 27). DACSS 601 Fall 2021: homework 3: basic data wrangling. Retrieved from https://mrolfe.github.io/DACSS601Fall21/posts/2021-09-27-homework-3-basic-data-wrangling/

BibTeX citation

@misc{battaglia2021homework,
  author = {Battaglia, Claire},
  title = {DACSS 601 Fall 2021: homework 3: basic data wrangling},
  url = {https://mrolfe.github.io/DACSS601Fall21/posts/2021-09-27-homework-3-basic-data-wrangling/},
  year = {2021}
}