DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 1

challenge_1
railroads
birds
Theresa_Szczepanski
Author

Theresa Szczepanski

Published

September 16, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Railroad Data

  • railroad_2012_clean_county.csv ⭐
Code
  Railroad <- read_csv("_data/railroad_2012_clean_county.csv")
  summary(Railroad)
    state              county          total_employees  
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00  
Code
  Railroad
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
Code
  States <-select(Railroad, state)
  Num_States <-unique(States)
  dim(Num_States)
[1] 53  1

Describe the data

The Railroad data set consists of 2,930 observations of three variables: state, county, and total_employees, of type character, character, and double, respectively. The minimum number of employees is 1, which occurs in several counties, and the maximum is 8,207, in Cook County, Illinois.

Code
  #Overview of Railroad
  arrange(Railroad, desc(total_employees))
# A tibble: 2,930 × 3
   state county           total_employees
   <chr> <chr>                      <dbl>
 1 IL    COOK                        8207
 2 TX    TARRANT                     4235
 3 NE    DOUGLAS                     3797
 4 NY    SUFFOLK                     3685
 5 VA    INDEPENDENT CITY            3249
 6 FL    DUVAL                       3073
 7 CA    SAN BERNARDINO              2888
 8 CA    LOS ANGELES                 2545
 9 TX    HARRIS                      2535
10 NE    LINCOLN                     2289
# … with 2,920 more rows
Code
  arrange(Railroad, total_employees)
# A tibble: 2,930 × 3
   state county   total_employees
   <chr> <chr>              <dbl>
 1 AK    SITKA                  1
 2 AL    BARBOUR                1
 3 AL    HENRY                  1
 4 AP    APO                    1
 5 AR    NEWTON                 1
 6 CA    MONO                   1
 7 CO    BENT                   1
 8 CO    CHEYENNE               1
 9 CO    COSTILLA               1
10 CO    DOLORES                1
# … with 2,920 more rows

There are 53 distinct entries in the state column. The codes for the 50 U.S. states are represented, as well as:

  • DC, for Washington, D.C.
  • AE (county APO): not a state or territory; possibly an Armed Forces Europe military post office.
  • AP (county APO): not a state or territory; possibly an Armed Forces Pacific military post office.
Code
  #Finding the number of unique states
  States <-select(Railroad, state)
  Num_States <-unique(States)
  summary(Num_States)
    state          
 Length:53         
 Class :character  
 Mode  :character  
Code
  Num_States
# A tibble: 53 × 1
   state
   <chr>
 1 AE   
 2 AK   
 3 AL   
 4 AP   
 5 AR   
 6 AZ   
 7 CA   
 8 CO   
 9 CT   
10 DC   
# … with 43 more rows

Each case in this data set represents a unique state and county pairing. The total_employees value possibly represents the number of railroad employees in that state and county.
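
Both claims above (the number of distinct state codes and the uniqueness of each state and county pairing) can be checked directly. The following is a minimal sketch on a toy tibble, railroad_toy, built from a few rows of the output above; the name railroad_toy is an assumption for illustration, and the same calls should work on the full Railroad tibble.

```r
library(dplyr)

# Toy stand-in for Railroad: five rows copied from the printed output above
railroad_toy <- tibble(
  state = c("AK", "AK", "AL", "AL", "AE"),
  county = c("ANCHORAGE", "SITKA", "AUTAUGA", "BALDWIN", "APO"),
  total_employees = c(7, 1, 102, 143, 2)
)

# Number of distinct state codes, without building an intermediate tibble
n_states <- n_distinct(railroad_toy$state)

# State/county pairs that occur more than once; zero rows means every
# row is a unique pairing
duplicate_pairs <- railroad_toy %>%
  count(state, county) %>%
  filter(n > 1)

n_states
nrow(duplicate_pairs)
```

If nrow(duplicate_pairs) is 0 on the real data, each row is indeed a distinct state and county pairing.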

Read in the Birds Data

Code
  Birds <- read_csv("_data/birds.csv")
  summary(Birds)
 Domain Code           Domain            Area Code        Area          
 Length:30977       Length:30977       Min.   :   1   Length:30977      
 Class :character   Class :character   1st Qu.:  79   Class :character  
 Mode  :character   Mode  :character   Median : 156   Mode  :character  
                                       Mean   :1202                     
                                       3rd Qu.: 231                     
                                       Max.   :5504                     
                                                                        
  Element Code    Element            Item Code        Item          
 Min.   :5112   Length:30977       Min.   :1057   Length:30977      
 1st Qu.:5112   Class :character   1st Qu.:1057   Class :character  
 Median :5112   Mode  :character   Median :1068   Mode  :character  
 Mean   :5112                      Mean   :1066                     
 3rd Qu.:5112                      3rd Qu.:1072                     
 Max.   :5112                      Max.   :1083                     
                                                                    
   Year Code         Year          Unit               Value         
 Min.   :1961   Min.   :1961   Length:30977       Min.   :       0  
 1st Qu.:1976   1st Qu.:1976   Class :character   1st Qu.:     171  
 Median :1992   Median :1992   Mode  :character   Median :    1800  
 Mean   :1991   Mean   :1991                      Mean   :   99411  
 3rd Qu.:2005   3rd Qu.:2005                      3rd Qu.:   15404  
 Max.   :2018   Max.   :2018                      Max.   :23707134  
                                                  NA's   :1036      
     Flag           Flag Description  
 Length:30977       Length:30977      
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      
Code
  head(Birds)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1961  1961 1000…
2 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1962  1962 1000…
3 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1963  1963 1000…
4 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1964  1964 1000…
5 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1965  1965 1000…
6 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1966  1966 1000…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`

Describe the Data

The Birds data set consists of 30,977 observations of 14 variables: 8 of character type and 6 of double type. The data appear to describe stocks of domesticated fowl in countries and regions of the world for given years.

Domain: All cases in this data set have the same Domain Code (QA) and Domain (Live Animals).

Code
  Domains <-select(Birds, "Domain Code", Domain)
  Num_Domains <-unique(Domains)
  Num_Domains
# A tibble: 1 × 2
  `Domain Code` Domain      
  <chr>         <chr>       
1 QA            Live Animals

Area consists of 248 distinct entries. Entries with an Area Code below 5000 represent individual countries and territories, and the codes roughly follow the alphabetical order of the names, with exceptions (Armenia has code 1 and Aruba has code 22). An Area Code of 5000 represents the entire world, and codes greater than 5000 correspond to regions of the world rather than specific countries. Among the regions, codes that are closer in value are closer geographically. Note that there is an entry for Europe as well as entries for Eastern Europe and Western Europe, so some areas are represented in more than one entry.

Code
  Areas <-select(Birds, "Area Code", Area)
  Num_Areas <-unique(Areas)
  Num_Areas
# A tibble: 248 × 2
   `Area Code` Area               
         <dbl> <chr>              
 1           2 Afghanistan        
 2           3 Albania            
 3           4 Algeria            
 4           5 American Samoa     
 5           7 Angola             
 6           8 Antigua and Barbuda
 7           9 Argentina          
 8           1 Armenia            
 9          22 Aruba              
10          10 Australia          
# … with 238 more rows
Code
  arrange(Num_Areas, `Area Code`)
# A tibble: 248 × 2
   `Area Code` Area               
         <dbl> <chr>              
 1           1 Armenia            
 2           2 Afghanistan        
 3           3 Albania            
 4           4 Algeria            
 5           5 American Samoa     
 6           7 Angola             
 7           8 Antigua and Barbuda
 8           9 Argentina          
 9          10 Australia          
10          11 Austria            
# … with 238 more rows
Code
  arrange(Num_Areas, desc(`Area Code`))
# A tibble: 248 × 2
   `Area Code` Area                     
         <dbl> <chr>                    
 1        5504 Polynesia                
 2        5503 Micronesia               
 3        5502 Melanesia                
 4        5501 Australia and New Zealand
 5        5500 Oceania                  
 6        5404 Western Europe           
 7        5403 Southern Europe          
 8        5402 Northern Europe          
 9        5401 Eastern Europe           
10        5400 Europe                   
# … with 238 more rows
Code
  World_Region <- filter(Num_Areas, `Area Code` >= 5000)
  arrange(World_Region, `Area Code`)
# A tibble: 28 × 2
   `Area Code` Area            
         <dbl> <chr>           
 1        5000 World           
 2        5100 Africa          
 3        5101 Eastern Africa  
 4        5102 Middle Africa   
 5        5103 Northern Africa 
 6        5104 Southern Africa 
 7        5105 Western Africa  
 8        5200 Americas        
 9        5203 Northern America
10        5204 Central America 
# … with 18 more rows

Element: All cases in this data set have the same Element Code (5112) and Element (Stocks).

Code
  Elements <-select(Birds, "Element Code", Element)
  Num_Elements <-unique(Elements)
  Num_Elements
# A tibble: 1 × 2
  `Element Code` Element
           <dbl> <chr>  
1           5112 Stocks 

Item: Each observation belongs to one of five item types: chickens, ducks, geese and guinea fowls, turkeys, or pigeons and other birds.

Code
  Items <-select(Birds, "Item Code", Item)
  Num_Items <-unique(Items)
  Num_Items
# A tibble: 5 × 2
  `Item Code` Item                  
        <dbl> <chr>                 
1        1057 Chickens              
2        1068 Ducks                 
3        1072 Geese and guinea fowls
4        1079 Turkeys               
5        1083 Pigeons, other birds  
Code
  table(Items)
         Item
Item Code Chickens Ducks Geese and guinea fowls Pigeons, other birds Turkeys
     1057    13074     0                      0                    0       0
     1068        0  6909                      0                    0       0
     1072        0     0                   4136                    0       0
     1079        0     0                      0                    0    5693
     1083        0     0                      0                 1165       0

Year, Unit, Value: Each observation records the year it was made (between 1961 and 2018) and the number of birds counted, stored in Value in units of 1000 head. For example, a Value of 4700 represents 4,700,000 birds. Value is missing (NA) for 1,036 observations.
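
The unit conversion described above can be sketched as follows, using the first three rows of the Years_Values output; the names values_toy and head_count are assumptions introduced for illustration.

```r
library(dplyr)

# First three rows of the Years_Values output
values_toy <- tibble(
  Year = c(1961, 1962, 1963),
  Unit = "1000 Head",
  Value = c(4700, 4900, 5000)
)

# Value is stored in thousands of head; multiply by 1000 to get a head count
values_heads <- values_toy %>%
  mutate(head_count = Value * 1000)

values_heads$head_count
```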

Code
  Years_Values <-select(Birds, Year, Unit, Value)
  Units <-select(Birds, Unit)
  Num_Units <-unique(Units)
  Num_Units
# A tibble: 1 × 1
  Unit     
  <chr>    
1 1000 Head
Code
  summary(Years_Values)
      Year          Unit               Value         
 Min.   :1961   Length:30977       Min.   :       0  
 1st Qu.:1976   Class :character   1st Qu.:     171  
 Median :1992   Mode  :character   Median :    1800  
 Mean   :1991                      Mean   :   99411  
 3rd Qu.:2005                      3rd Qu.:   15404  
 Max.   :2018                      Max.   :23707134  
                                   NA's   :1036      
Code
  Years_Values
# A tibble: 30,977 × 3
    Year Unit      Value
   <dbl> <chr>     <dbl>
 1  1961 1000 Head  4700
 2  1962 1000 Head  4900
 3  1963 1000 Head  5000
 4  1964 1000 Head  5300
 5  1965 1000 Head  5500
 6  1966 1000 Head  5800
 7  1967 1000 Head  6600
 8  1968 1000 Head  6290
 9  1969 1000 Head  6300
10  1970 1000 Head  6000
# … with 30,967 more rows

Flag takes 6 values (five codes plus NA, which marks official data) describing how each figure was obtained.

Code
  Flag_Descriptions <-select(Birds, Flag, `Flag Description`)
  Num_Flag_Descriptions <-unique(Flag_Descriptions)
  Num_Flag_Descriptions
# A tibble: 6 × 2
  Flag  `Flag Description`                                                      
  <chr> <chr>                                                                   
1 F     FAO estimate                                                            
2 <NA>  Official data                                                           
3 Im    FAO data based on imputation methodology                                
4 M     Data not available                                                      
5 *     Unofficial figure                                                       
6 A     Aggregate, may include official, semi-official, estimated or calculated…
Code
  Flags <-select(Birds, Flag)
  table(Flags)
Flag
    *     A     F    Im     M 
 1494  6488 10007  1213  1002 
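
With its default arguments, table() omits NA, so the official-data rows do not appear in the counts above. The five counts sum to 20,204, which suggests that the remaining 10,773 of the 30,977 rows carry an NA flag. The following is a minimal sketch of two ways to include NA in the tally, using a toy column with hypothetical rows; the names flags_toy, flag_counts, and flag_table are assumptions for illustration.

```r
library(dplyr)

# Toy Flag column (hypothetical rows); NA marks official data, as in the
# Flag Description output above
flags_toy <- tibble(Flag = c("F", NA, "A", NA, "F", "M"))

# count() keeps NA as its own group, unlike table() with default arguments
flag_counts <- count(flags_toy, Flag)

# Base R equivalent that also reports NA
flag_table <- table(flags_toy$Flag, useNA = "ifany")

flag_counts
flag_table
```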

Element: All observations are of Stocks.

Code
  Elements <-select(Birds, "Element Code", Element)
  Num_Elements <-unique(Elements)
  dim(Elements)
[1] 30977     2
Code
  Num_Elements
# A tibble: 1 × 2
  `Element Code` Element
           <dbl> <chr>  
1           5112 Stocks 

Further Challenges to come back to

  • wild_bird_data.xlsx ⭐⭐⭐
  • StateCounty2012.xls ⭐⭐⭐⭐