---
title: "Challenge 2 Instructions"
author: "Meredith Rolfe"
description: "Data wrangling: using group() and summarise()"
date: "08/16/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - faostat
  - hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1)  read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2)  provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
-   railroad\*.csv or StateCounty2012.xls ⭐
-   FAOstat\*.csv or birds.csv ⭐⭐⭐
-   hotel_bookings.csv ⭐⭐⭐⭐
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
## Railroad Dataset (railroad\*.csv)
```{r}
data <- read_csv('_data/railroad_2012_clean_county.csv')
data
```
The data shows the state, county, and total number of railroad employees for each county. First, let's group the data by county and state.
```{r}
data %>% group_by(county,state) %>%
  summarize_at(c('total_employees'),mean)
```
Now let's see the mean of the data set grouped by county.
```{r}
data %>% group_by(county) %>%
  summarize_at(c('total_employees'),mean)
```
Next, let's see the mean of the data set grouped by state.
```{r}
data %>% group_by(state) %>%
  summarize_at(c('total_employees'),mean)
```
The median of the data set grouped by state.
```{r}
data %>% group_by(state) %>%
  summarize_at(c('total_employees'),median)
```
The median of the data set grouped by county.
```{r}
data %>% group_by(county) %>%
  summarize_at(c('total_employees'),median)
```
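The separate mean and median chunks above can also be collapsed into a single `summarise()` call with `across()`. A minimal sketch on a toy tibble whose columns mirror the railroad data (the values here are invented, not the real counts):

```r
library(dplyr)

# Toy stand-in for the railroad data; values are illustrative only
toy <- tibble(
  state = c("AK", "AK", "AL", "AL"),
  county = c("JUNEAU", "SITKA", "AUTAUGA", "BALDWIN"),
  total_employees = c(3, 1, 102, 143)
)

# One pass computes several summary statistics per state
toy %>%
  group_by(state) %>%
  summarise(across(total_employees,
                   list(mean = mean, median = median),
                   .names = "{.fn}_employees"))
# AK: mean 2, median 2; AL: mean 122.5, median 122.5
```

Naming the functions in the `list()` also gives the output columns readable names (`mean_employees`, `median_employees`) instead of the default expression labels.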
CT, AZ, and DC are compared to see how the mean of total_employees differs between states.
```{r}
result <- c("CT")
filter(data, state %in% result) %>% 
  summarize_at(c('total_employees'),mean)
```
```{r}
result <- c("AZ")
filter(data, state %in% result) %>% 
  summarize_at(c('total_employees'),mean)
```
```{r}
result <- c("DC")
filter(data, state %in% result) %>% 
  summarize_at(c('total_employees'),mean)
```
The mean values vary a lot: 324 for CT, 210.2 for AZ, and 279 for DC.
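The three separate filter chunks above can be done in one grouped summary. A sketch on a toy tibble (the numbers below are fabricated for illustration, not the real railroad counts):

```r
library(dplyr)

# Invented values standing in for the railroad data
toy <- tibble(
  state = c("CT", "CT", "AZ", "DC"),
  total_employees = c(300, 348, 210, 279)
)

# Filter to the states of interest, then compute each state's mean at once
toy %>%
  filter(state %in% c("CT", "AZ", "DC")) %>%
  group_by(state) %>%
  summarise(mean_employees = mean(total_employees))
# AZ: 210, CT: 324, DC: 279
```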
Now let's look at the standard deviation of the data set, grouped first by county and then by state.
```{r}
data %>% group_by(county) %>%
  summarize_at(c('total_employees'),sd)
```
```{r}
data %>% group_by(state) %>%
  summarize_at(c('total_employees'),sd)
```
CA has the highest standard deviation, slightly above CT's, yet their means differ considerably: about 239 for CA versus 324 for CT.
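Note that `sd()` returns `NA` for any group with only one observation, since a single value has no measurable spread; that is why so many counties show `NA` above. A tiny illustration with made-up numbers:

```r
library(dplyr)

# ADAIR appears twice, ABBEVILLE only once (values are invented)
toy <- tibble(
  county = c("ADAIR", "ADAIR", "ABBEVILLE"),
  total_employees = c(5, 21, 124)
)

toy %>%
  group_by(county) %>%
  summarise(sd_employees = sd(total_employees))
# ABBEVILLE has a single row, so its sd is NA; ADAIR's sd is sd(c(5, 21))
```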
## Bird Dataset (birds.csv)
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
```{r}
birds <- read_csv("_data/birds.csv")%>%
  select(-c(contains("Code"), Element, Domain, Unit))
birds
```
After dropping the code columns, this data set has six variables: Area, Item, Year, Value, Flag, and Flag Description.
```{r}
summary(birds)
```
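The challenge also asks for the mode, which `summary()` does not report; base R's `mode()` returns a value's storage type, not its most frequent value. A small helper function like the one below (a common idiom, not part of the original analysis) fills the gap:

```r
# Returns the most frequent value in a vector, ignoring NAs
stat_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

stat_mode(c(1, 2, 2, 3))  # → 2
```

This could be applied to a column such as `Value` with, for example, `stat_mode(birds$Value)`.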
Let's take a look at average and median stock values by year.
```{r}
birds %>%
  group_by(Year) %>%
  summarise(mean = mean(Value, na.rm = TRUE),
            median = median(Value, na.rm = TRUE))
```
```{r}
birds %>%
  filter(Area == "Americas") %>%
  group_by(Item, Year) %>%
  summarise(average = mean(Value, na.rm = TRUE)) %>%
  pivot_wider(names_from = Year, values_from = average)
```
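The `pivot_wider()` step above spreads each Year into its own column, leaving one row per Item. A minimal standalone sketch of the same reshape, using a toy tibble with fabricated stock values:

```r
library(dplyr)
library(tidyr)

# Long-format toy data: one row per Item-Year pair (values invented)
toy <- tibble(
  Item = c("Chickens", "Chickens", "Ducks", "Ducks"),
  Year = c(1961, 1962, 1961, 1962),
  average = c(100, 110, 10, 12)
)

# Each Year becomes a column named `1961`, `1962`, ...
toy %>%
  pivot_wider(names_from = Year, values_from = average)
```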
From this analysis, I can see that the average stock value in the Americas has increased over the years, rising almost every year for every stock item.