Challenge_3_Jyoti_Rani

Author

Jyoti Rani

Published

August 22, 2022

Read the data

eggs <- read_excel("_data/organiceggpoultry.xls",
                   sheet = "Data",
                   skip = 4,
                   range =cell_limits(c(6,2),c(NA,6)),
                  col_names = c("date", "xldozen", "xlhalf_dozen", "large_dozen", "large_half_dozen"))

eggs
# A tibble: 120 × 5
   date      xldozen xlhalf_dozen large_dozen large_half_dozen
   <chr>       <dbl>        <dbl>       <dbl>            <dbl>
 1 Jan 2004     230          132         230              126 
 2 February     230          134.        226.             128.
 3 March        230          137         225              131 
 4 April        234.         137         225              131 
 5 May          236          137         225              131 
 6 June         241          137         231.             134.
 7 July         241          137         234.             134.
 8 August       241          137         234.             134.
 9 September    241          136.        234.             130.
10 October      241          136.        234.             128.
# … with 110 more rows

Above I read in the data set as an excel file. I had to look at the solutions sheet to learn how to read in the data. The only trouble I had was figuring out the “range” parameter of the function. That one I just had to copy in because I was getting an error without it, but I understand that this tells R which cells to read.

eggs %>% 
  select("date")  %>% 
  distinct()
# A tibble: 22 × 1
   date     
   <chr>    
 1 Jan 2004 
 2 February 
 3 March    
 4 April    
 5 May      
 6 June     
 7 July     
 8 August   
 9 September
10 October  
# … with 12 more rows

Now we still have the issue of notes in the names of our months. This is most evident with “Jan…”, but the ” /1” in February needs to go. This needs to be removed to make analysis later on a bit easier to look at.

eggs <- eggs %>% 
  mutate(date = str_remove(date, " /1"))

Next, the January columns need to be dealt with so below I will remove the years in the “Jan” column with the separate and fill function.

eggs <- eggs %>% 
  separate(date, c("month", "year"), convert = TRUE) %>% 
  fill("year")
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 110 rows [2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, ...].
eggs
# A tibble: 120 × 6
   month      year xldozen xlhalf_dozen large_dozen large_half_dozen
   <chr>     <int>   <dbl>        <dbl>       <dbl>            <dbl>
 1 Jan        2004    230          132         230              126 
 2 February   2004    230          134.        226.             128.
 3 March      2004    230          137         225              131 
 4 April      2004    234.         137         225              131 
 5 May        2004    236          137         225              131 
 6 June       2004    241          137         231.             134.
 7 July       2004    241          137         234.             134.
 8 August     2004    241          137         234.             134.
 9 September  2004    241          136.        234.             130.
10 October    2004    241          136.        234.             128.
# … with 110 more rows

You can see as the years progress, the price of eggs increase, though at what rate I am uncertain. Now the data set is read for some pivoting. Now let me do a sanity check to view the data dimensions before the pivot and then I will look at it after the pivot.

dim(eggs)
[1] 120   6

There are four columns with the type of eggs this data set is viewing. If we want even tidier data we can collapse these four columns into one. Lets call this new data set “eggstidy” to represent the final form of the data in this challenge.

eggstidy <- eggs %>% 
  pivot_longer(c(3:6), names_to = "egg_type", values_to = "price")

eggstidy
# A tibble: 480 × 4
   month     year egg_type         price
   <chr>    <int> <chr>            <dbl>
 1 Jan       2004 xldozen           230 
 2 Jan       2004 xlhalf_dozen      132 
 3 Jan       2004 large_dozen       230 
 4 Jan       2004 large_half_dozen  126 
 5 February  2004 xldozen           230 
 6 February  2004 xlhalf_dozen      134.
 7 February  2004 large_dozen       226.
 8 February  2004 large_half_dozen  128.
 9 March     2004 xldozen           230 
10 March     2004 xlhalf_dozen      137 
# … with 470 more rows

By pivoting the data, each row has one observation, making it tidy for future manipulation. Within each row we can look at the particular variables within that observation with ease.

Below the dimensions have clearly changed, adding significantly more rows, but also condensing the amount of columns.