Reading in data and creating a post
Author

Pooja Shah

Published

April 26, 2023

Code
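library(tidyverse)     # readr (read_csv) and dplyr
library(readxl)        # read_excel for .xlsx/.xls files
library(summarytools)  # dfSummary

# Show code in the rendered post; hide warnings and messages
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)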

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables)

Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

  • railroad_2012_clean_county.csv ⭐
  • birds.csv ⭐⭐
  • FAOstat*.csv ⭐⭐
  • wild_bird_data.xlsx ⭐⭐⭐
  • StateCounty2012.xls ⭐⭐⭐⭐

Find the _data folder, located inside the posts folder. Then read in the data using either one of the standard readr read commands or a specialized package such as readxl.

Reading the railroad dataset

Code
# Read the railroad_2012_clean_county dataset (a plain CSV, so readr::read_csv is enough)
railroad <- read_csv("_data/railroad_2012_clean_county.csv")

Printing the first few rows of the railroad dataset

Code
# Print only the first few rows
head(railroad)
# A tibble: 6 × 3
  state county               total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1

Reading the wild bird dataset

Code
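# Read the wild_bird_data dataset (an Excel file, so use readxl::read_excel)
# skip = 1 skips the first row of the sheet, so column names come from the second row
wildbird <- read_excel("_data/wild_bird_data.xlsx", skip = 1)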

Printing the first few rows of the wildbird dataset

Code
# Print only the first few rows
head(wildbird)
# A tibble: 6 × 2
  `Wet body weight [g]` `Population size`
                  <dbl>             <dbl>
1                  5.46           532194.
2                  7.76          3165107.
3                  8.64          2592997.
4                 10.7           3524193.
5                  7.42           389806.
6                  9.12           604766.
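The two wildbird column names contain spaces and brackets, so they have to be wrapped in backticks every time they are used. One option is to rename them once after reading. The sketch below is only an illustration: `wildbird_sample` is a hand-typed stand-in built from the six rows printed above (values as rounded in the printout), and the new names `wet_body_weight_g` and `population_size` are my own choices, not names from the source file.

```r
library(tibble)
library(dplyr)

# Stand-in for `wildbird`: the six rows printed above, typed in by hand
# (values are the rounded ones shown in the printout)
wildbird_sample <- tibble(
  `Wet body weight [g]` = c(5.46, 7.76, 8.64, 10.7, 7.42, 9.12),
  `Population size`     = c(532194, 3165107, 2592997, 3524193, 389806, 604766)
)

# Rename to syntactic names; the new names are illustrative choices
wildbird_renamed <- wildbird_sample %>%
  rename(
    wet_body_weight_g = `Wet body weight [g]`,
    population_size   = `Population size`
  )

names(wildbird_renamed)
```

The same `rename()` call should work unchanged on the full `wildbird` tibble, since it only depends on the two column names.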

Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high-level description of the data? Describe as efficiently as possible where and how the data was (likely) gathered, and indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code
dfSummary(railroad)
Data Frame Summary  
railroad  
Dimensions: 2930 x 3  
Duplicates: 0  

-----------------------------------------------------------------------------------------------------------------
No   Variable          Stats / Values             Freqs (% of Valid)    Graph                Valid      Missing  
---- ----------------- -------------------------- --------------------- -------------------- ---------- ---------
1    state             1. TX                       221 ( 7.5%)          I                    2930       0        
     [character]       2. GA                       152 ( 5.2%)          I                    (100.0%)   (0.0%)   
                       3. KY                       119 ( 4.1%)                                                   
                       4. MO                       115 ( 3.9%)                                                   
                       5. IL                       103 ( 3.5%)                                                   
                       6. IA                        99 ( 3.4%)                                                   
                       7. KS                        95 ( 3.2%)                                                   
                       8. NC                        94 ( 3.2%)                                                   
                       9. IN                        92 ( 3.1%)                                                   
                       10. VA                       92 ( 3.1%)                                                   
                       [ 43 others ]              1748 (59.7%)          IIIIIIIIIII                              

2    county            1. WASHINGTON                31 ( 1.1%)                               2930       0        
     [character]       2. JEFFERSON                 26 ( 0.9%)                               (100.0%)   (0.0%)   
                       3. FRANKLIN                  24 ( 0.8%)                                                   
                       4. LINCOLN                   24 ( 0.8%)                                                   
                       5. JACKSON                   22 ( 0.8%)                                                   
                       6. MADISON                   19 ( 0.6%)                                                   
                       7. MONTGOMERY                18 ( 0.6%)                                                   
                       8. CLAY                      17 ( 0.6%)                                                   
                       9. MARION                    17 ( 0.6%)                                                   
                       10. MONROE                   17 ( 0.6%)                                                   
                       [ 1699 others ]            2715 (92.7%)          IIIIIIIIIIIIIIIIII                       

3    total_employees   Mean (sd) : 87.2 (283.6)   404 distinct values   :                    2930       0        
     [numeric]         min < med < max:                                 :                    (100.0%)   (0.0%)   
                       1 < 21 < 8207                                    :                                        
                       IQR (CV) : 58 (3.3)                              :                                        
                                                                        :                                        
-----------------------------------------------------------------------------------------------------------------
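Reading the summary above: the railroad data has 2930 rows and 3 variables, with no duplicate rows and no missing values. County names such as WASHINGTON repeat many times, so a case appears to be one county within one state, and `total_employees` is presumably the number of railroad employees recorded for that county in 2012 (an inference from the file name, not something the data states). The counts are very uneven: the median is 21 while the maximum is 8207. A per-state roll-up is one way to make this easier to describe. The sketch below is a minimal illustration on `railroad_sample`, a hand-typed stand-in containing only the six rows printed earlier; `by_state`, `counties`, and `employees` are names introduced here.

```r
library(tibble)
library(dplyr)

# Stand-in for `railroad`: the six rows shown by head(railroad)
railroad_sample <- tibble(
  state  = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2, 7, 2, 3, 2, 1)
)

# One row per state: number of counties listed and employees summed over them
by_state <- railroad_sample %>%
  group_by(state) %>%
  summarise(
    counties  = n(),
    employees = sum(total_employees),
    .groups   = "drop"
  ) %>%
  arrange(desc(employees))

by_state
```

Replacing `railroad_sample` with the full `railroad` tibble would give one row per state code, including non-state codes such as AE.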

Code
dfSummary(wildbird)
Data Frame Summary  
wildbird  
Dimensions: 146 x 2  
Duplicates: 0  

-------------------------------------------------------------------------------------------------------------
No   Variable              Stats / Values                  Freqs (% of Valid)    Graph   Valid      Missing  
---- --------------------- ------------------------------- --------------------- ------- ---------- ---------
1    Wet body weight [g]   Mean (sd) : 363.7 (983.5)       146 distinct values   :       146        0        
     [numeric]             min < med < max:                                      :       (100.0%)   (0.0%)   
                           5.5 < 69.2 < 9639.8                                   :                           
                           IQR (CV) : 291.2 (2.7)                                :                           
                                                                                 : .                         

2    Population size       Mean (sd) : 382874 (951938.7)   146 distinct values   :       146        0        
     [numeric]             min < med < max:                                      :       (100.0%)   (0.0%)   
                           4.9 < 24353.2 < 5093378                               :                           
                           IQR (CV) : 196693.8 (2.5)                             :                           
                                                                                 : .                         
-------------------------------------------------------------------------------------------------------------
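Reading the summary above: the wild bird data has 146 rows and 2 numeric variables with no missing values. Each row pairs a wet body weight in grams with a population size, presumably one row per species, although the file does not name them. Both variables are strongly right-skewed: the mean weight is 363.7 g against a median of 69.2 g, and the mean population size is 382874 against a median of 24353.2. For data spanning several orders of magnitude, a log10 transform is a common way to make the values easier to compare. The sketch below is an illustration only, run on `wildbird_sample`, a hand-typed stand-in built from the six rows printed earlier (rounded values); the column names `log10_weight` and `log10_population` are introduced here.

```r
library(tibble)
library(dplyr)

# Stand-in for `wildbird`: the six rows shown by head(wildbird), rounded as printed
wildbird_sample <- tibble(
  `Wet body weight [g]` = c(5.46, 7.76, 8.64, 10.7, 7.42, 9.12),
  `Population size`     = c(532194, 3165107, 2592997, 3524193, 389806, 604766)
)

# Add log10-transformed copies of both variables; originals are kept
wildbird_log <- wildbird_sample %>%
  mutate(
    log10_weight     = log10(`Wet body weight [g]`),
    log10_population = log10(`Population size`)
  )

wildbird_log
```

Both variables are strictly positive in the summary (minimums 5.5 and 4.9), so the transform is defined for every row of the full dataset as well.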