DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 2

challenge_2
railroads
faostat
hotel_bookings
Author

Connor Skowyra

Published

August 16, 2022

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Reading State County 2012

library(readxl)
StateCounty2012.csv <- read_excel("posts/_data/StateCounty2012.xls")

Reading Birds

library(readr)
Birds.csv <- read_csv("posts/_data/Birds.csv")

Connor’s Answer to Reading the data

The data sets are stored either as Excel workbooks or as CSV files. Before a data set can be analyzed, it must first be loaded into R.

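library(tidyverse)
# load() only restores saved .RData files, so the CSV is read with read_csv()
Birds.csv <- read_csv("posts/_data/Birds.csv")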

The summary() function reports the minimum, maximum, mean, median, and quartiles of each numeric column in the data.

An example is: summary(Birds.csv)

To find out how many rows and columns are in the dataset, we can use the dim() function.

An example is dim(Birds.csv)

The first value in the output is the number of rows, 30,977, and the second is the number of columns, 14.
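As a minimal sketch of how dim() and summary() behave, the example below uses a small made-up table in place of Birds.csv. The name birds_demo and every value in it are hypothetical and chosen only so the output is easy to check by hand.

```r
library(tibble)

# Hypothetical stand-in for Birds.csv; all values are invented for illustration.
birds_demo <- tibble(
  Area  = c("Poland", "Poland", "Ukraine", "Ukraine"),
  Item  = c("Chickens", "Ducks", "Chickens", "Ducks"),
  Year  = c(2000, 2000, 2000, 2000),
  Value = c(100, 20, 150, 30)
)

# dim() returns the number of rows first, then the number of columns.
dim(birds_demo)

# nrow() and ncol() return each count separately.
nrow(birds_demo)
ncol(birds_demo)

# summary() on one numeric column gives min, quartiles, median, mean, and max.
summary(birds_demo$Value)
```

Calling summary() on a single column keeps the output short; calling it on the whole table summarizes every column at once.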

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.

Exploratory Data Analysis

I decided to look at the number of chickens in Poland by year. I chose this topic because I am Polish and was curious how the number of chickens in Poland changed from year to year. Using the filter(), arrange(), and select() functions, I reduced the data to the only columns that matter for this project: Area, Item, Year, and Value.

Birds.csv %>%
  filter(Item == "Chickens" & Area == "Poland") %>%
  arrange(Year) %>%
  select(Area, Item, Year, Value)
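The same filter, arrange, and select steps can be tried on a small made-up table, which makes it easier to see what each step keeps. This is only a sketch: birds_demo, the Flag column, and all values are hypothetical and do not come from Birds.csv.

```r
library(dplyr)

# Hypothetical rows, deliberately out of order and with an extra column.
birds_demo <- tibble(
  Area  = c("Poland", "Ukraine", "Poland", "Poland"),
  Item  = c("Chickens", "Chickens", "Ducks", "Chickens"),
  Year  = c(2001, 2000, 2000, 2000),
  Value = c(110, 150, 20, 100),
  Flag  = c("F", "F", "F", "F")
)

poland_chickens <- birds_demo %>%
  filter(Item == "Chickens", Area == "Poland") %>%  # keep matching rows only
  arrange(Year) %>%                                 # sort by year, ascending
  select(Area, Item, Year, Value)                   # drop the other columns

poland_chickens
```

Separating the conditions in filter() with a comma has the same effect as joining them with &, since both require every condition to hold.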

In the Birds data, the number of chickens in Poland generally increases from year to year. A likely reason is the modernization of Poland, which brought improved ways to care for chickens, a larger population to feed, and higher GDP per capita as the economy grew in the decades after Europe was rebuilt following World War II.

Finally, we need to summarize the data so that it can be easily shared and discussed by people making decisions about the chicken population of Poland.

Mean and Median

Birds.csv %>%
  filter(Item == "Chickens" & Area == "Poland") %>%
  group_by(Area) %>%
  select(Area, Item, Year, Value) %>%
  summarise(mean_value = mean(Value, na.rm = TRUE), median_value = median(Value, na.rm = TRUE))
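To show what group_by() followed by summarise() returns, here is a small sketch on made-up numbers. The table chickens_demo and its values are hypothetical; they include one missing value so the effect of na.rm = TRUE is visible.

```r
library(dplyr)

# Hypothetical values; the NA stands in for a year with no reported figure.
chickens_demo <- tibble(
  Area  = c("Poland", "Poland", "Poland", "Poland", "Ukraine", "Ukraine"),
  Value = c(10, 20, 60, NA, 40, 50)
)

center_by_area <- chickens_demo %>%
  group_by(Area) %>%
  summarise(
    mean_value   = mean(Value, na.rm = TRUE),    # NA is dropped before averaging
    median_value = median(Value, na.rm = TRUE)
  )

center_by_area
# Poland:  mean 30, median 20
# Ukraine: mean 45, median 45
```

The result has one row per group. When the mean sits above the median, as it does for the hypothetical Poland rows here, it usually means a few large values are pulling the average up.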

Min/Max, SD, IQR

Birds.csv %>%
  filter(Item == "Chickens" & Area == "Poland") %>%
  group_by(Area) %>%
  select(Area, Item, Year, Value) %>%
  summarise(min_value = min(Value, na.rm = TRUE), max_value = max(Value, na.rm = TRUE), sd_value = sd(Value, na.rm = TRUE), IQR_value = IQR(Value, na.rm = TRUE))
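The dispersion measures can be checked by hand on a tiny example. The sketch below uses five made-up values (the table spread_demo is hypothetical) so that each statistic is easy to verify.

```r
library(dplyr)

# Hypothetical values, evenly spaced so the statistics are easy to verify.
spread_demo <- tibble(
  Area  = rep("Poland", 5),
  Value = c(10, 20, 30, 40, 50)
)

spread_by_area <- spread_demo %>%
  group_by(Area) %>%
  summarise(
    min_value = min(Value, na.rm = TRUE),
    max_value = max(Value, na.rm = TRUE),
    sd_value  = sd(Value, na.rm = TRUE),   # sample standard deviation (n - 1)
    IQR_value = IQR(Value, na.rm = TRUE)   # third quartile minus first quartile
  )

spread_by_area
# min 10, max 50, sd = sqrt(250) (about 15.81), IQR = 40 - 20 = 20
```

The standard deviation reacts strongly to extreme values, while the IQR only describes the middle half of the data, so reporting both gives a fuller picture of spread.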

From the mean and median summary, Poland had a mean of 85,845 chickens and a median of 72,520 chickens over the last 50 years. Now, let's compare those values with a similar Eastern European country, Ukraine.

Birds.csv %>%
  filter(Item == "Chickens" & Area == "Ukraine") %>%
  group_by(Area) %>%
  select(Area, Item, Year, Value) %>%
  summarise(mean_value = mean(Value, na.rm = TRUE), median_value = median(Value, na.rm = TRUE))

To compare chicken numbers in Poland and Ukraine, I subtracted the two countries' means and medians. The difference in means shows that Ukraine had 67,548 more chickens than Poland on average over the last 50 years.
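The subtraction can also be done in code rather than by hand. The sketch below is one possible way to do it, using made-up values; chickens_demo, by_area, mean_diff, and median_diff are hypothetical names, and the numbers are not the Birds.csv results.

```r
library(dplyr)

# Hypothetical values for two countries over two years.
chickens_demo <- tibble(
  Area  = c("Poland", "Poland", "Ukraine", "Ukraine"),
  Year  = c(2000, 2001, 2000, 2001),
  Value = c(100, 120, 150, 190)
)

# Summarise both countries in one grouped pipeline instead of two copies.
by_area <- chickens_demo %>%
  filter(Area %in% c("Poland", "Ukraine")) %>%
  group_by(Area) %>%
  summarise(
    mean_value   = mean(Value, na.rm = TRUE),
    median_value = median(Value, na.rm = TRUE)
  )

poland  <- filter(by_area, Area == "Poland")
ukraine <- filter(by_area, Area == "Ukraine")

# Positive differences mean Ukraine's value is larger than Poland's.
mean_diff   <- ukraine$mean_value - poland$mean_value
median_diff <- ukraine$median_value - poland$median_value

mean_diff    # 170 - 110 = 60
median_diff  # 170 - 110 = 60
```

Filtering with %in% and grouping by Area computes both countries' statistics in one pass, which avoids repeating the same pipeline with only the country name changed.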
