Challenge 1 Solutions

challenge_1

railroads

faostat

wildbirds

Author

Vishnupriya Varadharaju

Published

October 12, 2022

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Working with the Wild Birds Dataset

Challenge Overview

Today’s challenge is to

read in a dataset, and
describe the dataset using both words and any supporting information (e.g., tables)

1. Read in the Data

Code

library("readxl")

# Reading in the data set such that the first two rows are skipped (skip=2)
# and the columns are renamed
wild_bird <- read_excel("_data/wild_bird_data.xlsx", skip=2, col_names=c('body_weight','pop_size'))
head(wild_bird)

# A tibble: 6 × 2
  body_weight pop_size
        <dbl>    <dbl>
1        5.46  532194.
2        7.76 3165107.
3        8.64 2592997.
4       10.7  3524193.
5        7.42  389806.
6        9.12  604766.

Code

# To show the columns and the dimensions of the data
dim(wild_bird)

[1] 146   2

Code

colnames(wild_bird)

[1] "body_weight" "pop_size"

The data has been read from the Excel file and stored in a variable named wild_bird. It consists of 146 rows and 2 columns. Each observation appears to correspond to a particular species of bird. The first column holds the body weight of the bird in grams, and the second column holds the population size of that species.
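To illustrate how skip and col_names interact when reading the file, here is a minimal sketch that uses readr's read_csv on inline text instead of read_excel on the spreadsheet. The title and header rows in raw_text are hypothetical stand-ins, not the actual contents of the Excel file, and the data values are rounded from the head() output above.

```r
library(readr)

# Hypothetical stand-in for the spreadsheet layout: one title row, one
# original header row, then data rows (rounded values from head() above).
raw_text <- paste(
  "Some title row,",
  "Original header A,Original header B",
  "5.46,532194",
  "7.76,3165107",
  "8.64,2592997",
  sep = "\n"
)

# skip = 2 drops the title row and the original header row;
# col_names supplies the new names, so the first line read is treated as data.
demo <- read_csv(
  I(raw_text),
  skip = 2,
  col_names = c("body_weight", "pop_size"),
  show_col_types = FALSE
)

demo
```

If skip were set to 1 here, the original header row would be read as a data row and both columns would likely be parsed as character, which is a quick way to spot a wrong skip value.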

2. Describe the data

Code

# Arranging the data in ascending order of body_weight
wild_bird <- arrange(wild_bird, body_weight)
head(wild_bird)

# A tibble: 6 × 2
  body_weight pop_size
        <dbl>    <dbl>
1        5.46  532194.
2        7.42  389806.
3        7.76 3165107.
4        8.04  192361.
5        8.64 2592997.
6        8.70  250452.

Code

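# Checking for missing (NA) values; is.null() would only test whether the
# object itself is NULL, not whether any cell is missing
anyNA(wild_bird)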

[1] FALSE
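A note on checking for missing values: is.null() tests whether an object itself is NULL, so it returns FALSE for any existing tibble regardless of its contents. The sketch below shows the difference on a small illustrative tibble (the NA is inserted deliberately and is not part of the wild birds data) and counts missing values per column.

```r
library(dplyr)

# Illustrative tibble with one deliberately missing value (not from the dataset)
demo_na <- tibble(
  body_weight = c(5.46, NA, 7.76),
  pop_size    = c(532194, 389806, 3165107)
)

is.null(demo_na)  # FALSE: the tibble exists, which says nothing about its cells
anyNA(demo_na)    # TRUE: at least one cell is NA

# Number of missing values in each column
na_counts <- demo_na %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Applied to wild_bird, the same summarise() call would be expected to report zero for both columns, since the means computed later in this post are not NA.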

Code

# Checking the data type of the two columns
str(wild_bird)

tibble [146 × 2] (S3: tbl_df/tbl/data.frame)
 $ body_weight: num [1:146] 5.46 7.42 7.76 8.04 8.64 ...
 $ pop_size   : num [1:146] 532194 389806 3165107 192361 2592997 ...

Code

# As both columns are numeric, we can use summarize_all to get high-level
# descriptive statistics of the data
summarize_all(wild_bird, list(mean=mean, median=median, min=min, max=max, sd=sd, var=var, IQR=IQR))

# A tibble: 1 × 14
  body_weight_…¹ pop_s…² body_…³ pop_s…⁴ body_…⁵ pop_s…⁶ body_…⁷ pop_s…⁸ body_…⁹
           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1           364. 382874.    69.2  24353.    5.46    4.92   9640.  5.09e6    984.
# … with 5 more variables: pop_size_sd <dbl>, body_weight_var <dbl>,
#   pop_size_var <dbl>, body_weight_IQR <dbl>, pop_size_IQR <dbl>, and
#   abbreviated variable names ¹body_weight_mean, ²pop_size_mean,
#   ³body_weight_median, ⁴pop_size_median, ⁵body_weight_min, ⁶pop_size_min,
#   ⁷body_weight_max, ⁸pop_size_max, ⁹body_weight_sd

The wild birds data consists of the body weight and the population size of different species. The dataset was probably collected by scientists for research purposes, and it could include bird species from different habitats such as marshlands, tropics, and deserts. The population size can indicate whether a species is endangered, vulnerable, or threatened. The body weight gives a sense of the build of each species and the quantity of food it might need to survive. Both columns are numeric, and the data set has no missing values. The descriptive statistics (mean, median, min, max, standard deviation, variance, and interquartile range) are shown above. Body weight ranges from 5.46 g to about 9,640 g, and its mean (about 364 g) is far above its median (69.2 g), which points to a strongly right-skewed distribution. Population size shows the same pattern, with a mean of about 382,874 against a median of about 24,353.
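Because the one-row, 14-column summary is truncated when printed, a long layout with one row per variable can be easier to read. The following is a sketch of one way to do this with pivot_longer and a grouped summarise. It uses only the six rows shown in the sorted head() output (with pop_size rounded), so its numbers will not match the full-data statistics above.

```r
library(dplyr)
library(tidyr)

# First six rows from the sorted head() output (pop_size rounded)
birds <- tibble(
  body_weight = c(5.46, 7.42, 7.76, 8.04, 8.64, 8.70),
  pop_size    = c(532194, 389806, 3165107, 192361, 2592997, 250452)
)

# Stack both columns into (variable, value) pairs, then compute each
# statistic once per variable so the result has one row per column.
stats_long <- birds %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(
    mean   = mean(value),
    median = median(value),
    min    = min(value),
    max    = max(value),
    sd     = sd(value),
    IQR    = IQR(value)
  )

stats_long
```

Replacing birds with wild_bird should give the same statistics as the summarize_all call, arranged as two rows instead of one wide row.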