DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 1 Solutions

challenge_1
railroads
faostat
wildbirds
Author

Vishnupriya Varadharaju

Published

October 12, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Working with the Wild Birds Dataset

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables, etc)

1. Read in the Data

Code
library("readxl")

# Reading in the data set such that the first two rows of the sheet are
# skipped (skip = 2) and the columns are renamed
wild_bird <- read_excel("_data/wild_bird_data.xlsx", skip=2, col_names=c('body_weight','pop_size'))
head(wild_bird)
# A tibble: 6 × 2
  body_weight pop_size
        <dbl>    <dbl>
1        5.46  532194.
2        7.76 3165107.
3        8.64 2592997.
4       10.7  3524193.
5        7.42  389806.
6        9.12  604766.
Code
# To show the columns and the dimensions of the data
dim(wild_bird)
[1] 146   2
Code
colnames(wild_bird)
[1] "body_weight" "pop_size"   

The data has been read from the Excel file and stored in a tibble named wild_bird. It consists of 146 rows and 2 columns. Each row appears to correspond to one species of bird: the first column holds the body weight of the bird in grams, and the second column holds the population size of that species.
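
To illustrate how `skip` combined with explicit column names behaves, here is a minimal sketch. Because `read_excel()` needs an actual .xlsx file, it uses base R's `read.csv()` on inline text as a stand-in. The two non-data lines in `raw_text` are invented for illustration and are not the real contents of the spreadsheet.

```r
# Hypothetical stand-in for the top of the sheet: two non-data rows followed
# by data rows. The header wording is invented; only the layout matters here.
raw_text <- "Some reference note,
Body weight,Population size
5.46,532194
7.76,3165107
8.64,2592997"

# skip = 2 drops the two non-data lines; header = FALSE plus col.names assigns
# our own names instead of reading them from the file
toy <- read.csv(text = raw_text, skip = 2, header = FALSE,
                col.names = c("body_weight", "pop_size"))

str(toy)
```

In this toy example, using `skip = 1` instead would cause the header line to be read as data, and both columns would come back as character rather than numeric.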

2. Describe the data

Code
# Arranging the data in ascending order of body_weight
wild_bird <- arrange(wild_bird, body_weight)
head(wild_bird)
# A tibble: 6 × 2
  body_weight pop_size
        <dbl>    <dbl>
1        5.46  532194.
2        7.42  389806.
3        7.76 3165107.
4        8.04  192361.
5        8.64 2592997.
6        8.70  250452.
Code
# Checking for missing (NA) values; is.null() would only test whether the
# object itself is NULL
anyNA(wild_bird)
[1] FALSE
Code
# Checking the data type of the two columns
str(wild_bird)
tibble [146 × 2] (S3: tbl_df/tbl/data.frame)
 $ body_weight: num [1:146] 5.46 7.42 7.76 8.04 8.64 ...
 $ pop_size   : num [1:146] 532194 389806 3165107 192361 2592997 ...
Code
# As both columns are numeric, we can use summarize_all() to get high-level
# descriptive statistics of the data
summarize_all(wild_bird, list(mean=mean, median=median, min=min, max=max, sd=sd, var=var, IQR=IQR))
# A tibble: 1 × 14
  body_weight_…¹ pop_s…² body_…³ pop_s…⁴ body_…⁵ pop_s…⁶ body_…⁷ pop_s…⁸ body_…⁹
           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1           364. 382874.    69.2  24353.    5.46    4.92   9640.  5.09e6    984.
# … with 5 more variables: pop_size_sd <dbl>, body_weight_var <dbl>,
#   pop_size_var <dbl>, body_weight_IQR <dbl>, pop_size_IQR <dbl>, and
#   abbreviated variable names ¹​body_weight_mean, ²​pop_size_mean,
#   ³​body_weight_median, ⁴​pop_size_median, ⁵​body_weight_min, ⁶​pop_size_min,
#   ⁷​body_weight_max, ⁸​pop_size_max, ⁹​body_weight_sd
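
The wide one-row result is truncated when printed. As a hedged alternative, the sketch below computes a few of the same statistics with `across()`, which supersedes `summarize_all()` in dplyr 1.0 and later, and reshapes them into one row per variable and statistic. The `toy` tibble and the name `stats_long` are illustrative only; `toy` reuses the first four rows as printed by `head(wild_bird)` in the reading step.

```r
library(dplyr)
library(tidyr)

# Toy data: the first four rows as printed by head(wild_bird) after reading
toy <- tibble(
  body_weight = c(5.46, 7.76, 8.64, 10.7),
  pop_size    = c(532194, 3165107, 2592997, 3524193)
)

stats_long <- toy %>%
  # across() applies every function in the list to every column
  summarise(across(everything(),
                   list(mean = mean, median = median, sd = sd),
                   .names = "{.col}__{.fn}")) %>%
  # split names such as "body_weight__mean" into variable and statistic
  pivot_longer(everything(),
               names_to = c("variable", "statistic"),
               names_sep = "__")

stats_long
```

The double underscore in `.names` is a deliberate choice so that the separator cannot be confused with the single underscores inside the column names.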

The wild birds data consists of the body weight and the population size of different bird species. There is a good chance that this dataset was collected by scientists for research purposes. It could include bird species from different habitats such as marshlands, the tropics and deserts. The population size can indicate whether a species is endangered, vulnerable or threatened. Furthermore, the body weight tells us about the build of each species and the quantity of food it might need to survive. This all-numeric data set does not have any missing values. The descriptive statistics shown above (mean, median, min, max, standard deviation, variance and interquartile range) suggest that both variables are strongly right-skewed: the mean body weight (about 364 g) is far above the median (69.2 g), and the mean population size (about 382,874) is far above the median (about 24,353).
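
When checking for missing values, note that `is.null()` only tests whether an object as a whole is `NULL`, whereas missing cells in a data frame are `NA`. The following is a minimal base R sketch on a made-up data frame. The `NA` is planted purely for illustration and does not occur in the real data, and the names `toy` and `na_per_column` are hypothetical.

```r
# Made-up data frame with one missing population size
toy <- data.frame(
  body_weight = c(5.46, 7.76, 8.64),
  pop_size    = c(532194, NA, 2592997)
)

is.null(toy)   # FALSE: the object exists, even though a cell is missing
anyNA(toy)     # TRUE: at least one NA somewhere in the data frame

# Count of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```

A per-column count is often more useful than a single TRUE/FALSE, because it shows where the gaps are before deciding how to handle them.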
