---
title: "Challenge 2 Instructions"
author: "Meredith Rolfe"
description: "Data wrangling: using group_by() and summarise()"
date: "08/16/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- faostat
- hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
## Railroad Dataset (railroad\*.csv)
```{r}
data <- read_csv('_data/railroad_2012_clean_county.csv')
data
```
This data set shows the state, county, and total number of railroad employees for each county (the file name suggests the data is from 2012). First, let's group it by county and state.
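To put rough numbers on that description, here is a quick sketch counting the distinct states and counties, using only the columns printed above:

```{r}
# A quick sketch: count distinct states and counties, and sum all employees
data %>%
  summarise(n_states = n_distinct(state),
            n_counties = n_distinct(county),
            total = sum(total_employees))
```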
```{r}
# Each county-state pair is unique, so this grouped mean simply returns each
# county's own total; it mainly confirms the grouping structure
data %>%
  group_by(county, state) %>%
  summarize_at(c('total_employees'), mean)
```
Now let's look at the mean of total_employees grouped by county.
```{r}
data %>% group_by(county) %>%
summarize_at(c('total_employees'),mean)
```
And now the mean grouped by state.
```{r}
data %>% group_by(state) %>%
summarize_at(c('total_employees'),mean)
```
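To decide which states are worth a closer look, one option (a sketch, not part of the original challenge code) is to rank the state-level means:

```{r}
# Rank states by mean employees per county, largest first (a sketch)
data %>%
  group_by(state) %>%
  summarise(mean_employees = mean(total_employees)) %>%
  arrange(desc(mean_employees))
```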
Next, the median grouped by state.
```{r}
data %>% group_by(state) %>%
summarize_at(c('total_employees'),median)
```
And the median grouped by county.
```{r}
data %>% group_by(county) %>%
summarize_at(c('total_employees'),median)
```
CT, AZ, and DC are compared here because their means looked notably high in the state-level results above; the goal is to check how much the mean of total_employees differs across the three.
```{r}
result <- c("CT")
filter(data, state %in% result) %>%
summarize_at(c('total_employees'),mean)
```
```{r}
result <- c("AZ")
filter(data, state %in% result) %>%
summarize_at(c('total_employees'),mean)
```
```{r}
result <- c("DC")
filter(data, state %in% result) %>%
summarize_at(c('total_employees'),mean)
```
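The three chunks above can also be collapsed into a single pipeline; this alternative is just a sketch of the same comparison:

```{r}
# Same three state means in one grouped pipeline (alternative sketch)
data %>%
  filter(state %in% c("CT", "AZ", "DC")) %>%
  group_by(state) %>%
  summarise(mean_employees = mean(total_employees))
```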
The mean values vary quite a bit: for CT it's 324, for AZ 210.2, and for DC 279.
Now let's look at the standard deviation of total_employees, grouped first by county and then by state.
```{r}
data %>% group_by(county) %>%
summarize_at(c('total_employees'),sd)
```
```{r}
data %>% group_by(state) %>%
summarize_at(c('total_employees'),sd)
```
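Many of the county-level standard deviations come out as NA because `sd()` of a single observation is NA, and most county names occur in only one state. A sketch pairing each group's size with its standard deviation makes this visible:

```{r}
# Pair each county group's size with its sd: size-1 groups yield NA (a sketch)
data %>%
  group_by(county) %>%
  summarise(n = n(), sd_employees = sd(total_employees)) %>%
  arrange(n)
```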
At the state level, CA and CT have the highest standard deviations, but their means are quite different: 324 for CT versus around 239 for CA, which is a big gap. Similar spread does not imply a similar center.
## Bird Dataset (birds.csv)
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
```{r}
# Read the FAOSTAT birds data and drop the code and metadata columns
birds <- read_csv("_data/birds.csv") %>%
  select(-c(contains("Code"), Element, Domain, Unit))
birds
```
This data set has six variables: Area, Item, Year, Value, Flag, and Flag Description. Each row records the stock of a poultry item (chickens, ducks, geese, or turkeys) for a given area and year, and the flag indicates whether the value is official data or an FAO estimate.
```{r}
summary(birds)
```
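To cover the dispersion measures mentioned in the instructions (min/max/quantile), here is a sketch of grouped summaries of Value by Item:

```{r}
# Dispersion of stock values by poultry item (a sketch): min, quartiles, max
birds %>%
  group_by(Item) %>%
  summarise(min    = min(Value, na.rm = TRUE),
            q25    = quantile(Value, 0.25, na.rm = TRUE),
            median = median(Value, na.rm = TRUE),
            q75    = quantile(Value, 0.75, na.rm = TRUE),
            max    = max(Value, na.rm = TRUE))
```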
Let's take a look at average and median stock values by year.
```{r}
birds %>%
  group_by(Year) %>%
  summarise(mean_value = mean(Value, na.rm = TRUE),
            median_value = median(Value, na.rm = TRUE))
```
```{r}
birds %>%
  filter(Area == "Americas") %>%
  group_by(Item, Year) %>%
  summarise(average = mean(Value, na.rm = TRUE)) %>%
  pivot_wider(names_from = Year, values_from = average)
```
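The wide table is hard to scan, so as a final sketch, comparing each item's average in its earliest and latest recorded year supports the conclusion below:

```{r}
# Compare each item's average stock in its earliest vs. latest year (a sketch)
birds %>%
  filter(Area == "Americas") %>%
  group_by(Item, Year) %>%
  summarise(average = mean(Value, na.rm = TRUE), .groups = "drop") %>%
  group_by(Item) %>%
  summarise(first_year = average[which.min(Year)],
            last_year  = average[which.max(Year)])
```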
From this analysis I can see that the average stock value has increased over the years, rising in almost every year for every item.