DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 1

challenge_1
railroads
faostat
wildbirds
Author

Matthew O’Neill

Published

October 5, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables, etc)

Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

  • railroad_2012_clean_county.csv ⭐
  • birds.csv ⭐⭐
  • FAOstat*.csv ⭐⭐
  • wild_bird_data.xlsx ⭐⭐⭐
  • StateCounty2012.xls ⭐⭐⭐⭐

Find the _data folder, located inside the posts folder. Then you can read in the data, using either one of the readr standard tidy read commands, or a specialized package such as readxl.

Code
data <- read_csv("../posts/_data/FAOSTAT_cattle_dairy.csv")

Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.
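
As a minimal sketch of what a more documented read could look like, the snippet below declares column types explicitly and then checks for parsing problems. It does not touch the real file: it parses a two-row sample passed inline through `I()`, and the name `sample_csv` and the `Value` numbers are made up for illustration.

```r
library(readr)

# Made-up two-row sample that borrows a few column names from the FAOSTAT file.
sample_csv <- I("Area,Year,Unit,Value\nAfghanistan,1961,Head,100\nAfghanistan,1961,tonnes,200\n")

# Declaring column types documents what each column should contain and
# stops readr from guessing; cells that do not fit are recorded as problems.
sample <- read_csv(
  sample_csv,
  col_types = cols(
    Area  = col_character(),
    Year  = col_integer(),
    Unit  = col_character(),
    Value = col_double()
  )
)

# problems() lists the cells readr could not parse; zero rows means a clean read.
problems(sample)
```

The same `col_types` argument can be passed when reading the real CSV, provided the declared names match the file's columns.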

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code
head(data)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961 Head 
2 QL      Lives…       2 Afgh…    5420 Yield       882 Milk…    1961  1961 hg/An
3 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk…    1961  1961 tonn…
4 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1962  1962 Head 
5 QL      Lives…       2 Afgh…    5420 Yield       882 Milk…    1962  1962 hg/An
6 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk…    1962  1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
Code
dim(data)
[1] 36449    14
Code
colnames(data)
 [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
 [5] "Element Code"     "Element"          "Item Code"        "Item"            
 [9] "Year Code"        "Year"             "Unit"             "Value"           
[13] "Flag"             "Flag Description"

To begin, I’ve printed the first few rows of the dataset to see how it is laid out. Based on these rows (and on the file name, which refers to cattle and dairy), we can assume the dataset records dairy production for various countries over many years. Some rows have a “Flag Description” of “FAO estimate”, which leads me to believe much of this data was collected by the Food and Agriculture Organization of the United Nations (FAO); the FAOSTAT prefix in the file name points the same way.

Several column names, such as “Domain”, “Item”, and “Unit”, are too generic to interpret on their own. To get a better idea of what they hold, we can look at the values in each column.

Code
domain <- select(data, "Domain")
table(domain)
Domain
Livestock Primary 
            36449 

First, we can see that there is only one domain in this dataset, Livestock Primary. This column and its code are probably useful mainly when the dataset is joined with another one that has the same column.
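
Checking columns one at a time works, but constant columns can also be found in a single pass. The following is a small sketch on a made-up table (`toy`, with invented rows that reuse the dataset's column names), not output from the real file.

```r
library(dplyr)
library(tibble)

# Made-up rows that reuse four of the FAOSTAT column names.
toy <- tibble(
  Domain = rep("Livestock Primary", 4),
  Item   = rep("Milk, whole fresh cow", 4),
  Area   = c("Afghanistan", "Afghanistan", "Albania", "Albania"),
  Unit   = c("Head", "tonnes", "Head", "tonnes")
)

# Count the distinct values in every column at once.
distinct_counts <- summarise(toy, across(everything(), n_distinct))
distinct_counts

# Columns with a single distinct value carry no information within this table.
constant_cols <- names(distinct_counts)[unlist(distinct_counts) == 1]
constant_cols
```

Run against the full dataset, the same two calls would flag every column that holds only one value.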

Code
item <- select(data, "Item")
table(item)
Item
Milk, whole fresh cow 
                36449 
Code
prop.table(table(item))
Item
Milk, whole fresh cow 
                    1 

Once again, there is only one value: every row has the item “Milk, whole fresh cow”. The whole table is therefore about fresh cow’s milk, which makes sense given the file name.

Code
unit <- select(data, "Unit")
table(unit)
Unit
  Head  hg/An tonnes 
 12158  12121  12170 
Code
prop.table(table(unit))
Unit
     Head     hg/An    tonnes 
0.3335620 0.3325468 0.3338912 

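The “Unit” column takes three values, and they appear to be tied to the “Element” column rather than being three ways to weigh a cow. In the rows printed by head(data), element code 5318 is reported in “Head” (a count of animals), element code 5420 (“Yield”) in “hg/An” (hectograms per animal), and element code 5510 in “tonnes” (a weight). Each unit accounts for about a third of the rows.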
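
One way to check how units relate to elements is to count the combinations of the two columns. The sketch below uses a small table (`toy`) whose rows copy the six rows printed by `head(data)`; it is an illustration, not a result from the full dataset.

```r
library(dplyr)
library(tibble)

# Rows copied from the head(data) output: area, element code, year, and unit.
toy <- tribble(
  ~Area,         ~`Element Code`, ~Year, ~Unit,
  "Afghanistan", 5318,            1961,  "Head",
  "Afghanistan", 5420,            1961,  "hg/An",
  "Afghanistan", 5510,            1961,  "tonnes",
  "Afghanistan", 5318,            1962,  "Head",
  "Afghanistan", 5420,            1962,  "hg/An",
  "Afghanistan", 5510,            1962,  "tonnes"
)

# One row per (element code, unit) combination, with its frequency in n.
pairs <- count(toy, `Element Code`, Unit)
pairs

# If every element code has exactly one unit, the number of combinations
# equals the number of distinct element codes.
one_unit_per_element <- nrow(pairs) == n_distinct(toy$`Element Code`)
one_unit_per_element
```

If the same check holds on the full data, “Unit” adds no information beyond “Element” and simply records how each measure is expressed.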

Code
years <- select(data, "Year")
table(years)
Year
1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 
 594  594  594  594  594  594  594  594  594  594  594  594  594  594  594  594 
1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 
 594  594  594  594  594  594  594  594  594  594  594  594  594  594  600  657 
1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 
 663  664  664  664  664  664  665  666  666  666  666  666  666  669  671  669 
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 
 671  669  669  672  672  674  674  674  672  672 

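Overall, it appears that this dataset is a record of cow-milk production across many different countries from 1961 to 2018. For each country/year combination, there are up to three rows: the number of animals (Head), the milk yield per animal (hg/An), and the production weight (tonnes). The three units do not occur exactly the same number of times, so some combinations are missing at least one of these measures.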
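
Because each country/year combination is spread over several rows, one per unit, a wide layout can make the structure easier to see. The sketch below is not part of the original analysis: it applies `tidyr::pivot_wider()` to a made-up table (`toy`), and the `Value` numbers are invented, with one measure left out on purpose to show how a gap appears.

```r
library(tidyr)
library(tibble)

# Made-up values; the 1962 yield row is deliberately missing.
toy <- tribble(
  ~Area,         ~Year, ~Unit,    ~Value,
  "Afghanistan", 1961,  "Head",   10,
  "Afghanistan", 1961,  "hg/An",  20,
  "Afghanistan", 1961,  "tonnes", 30,
  "Afghanistan", 1962,  "Head",   11,
  "Afghanistan", 1962,  "tonnes", 33
)

# One row per area and year, one column per unit; absent measures become NA.
wide <- pivot_wider(
  toy,
  id_cols     = c(Area, Year),
  names_from  = Unit,
  values_from = Value
)
wide
```

In the wide table, a missing measure shows up as `NA`, which makes incomplete country/year combinations easy to count.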
