challenge_1
tidyverse
readxl
dplyr
Reading in data and creating a post
Author

Saaradhaa M

Published

August 15, 2022

Code
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to read in a dataset and describe the dataset using both words and any supporting information.

Read in the Data

I will be working with the wild bird dataset.

Code
# Load readxl package.
library(readxl)
#Read in and view the dataset.
wildbird <- read_excel("_data/wild_bird_data.xlsx")
view(wildbird)

Describe the data

Using a combination of words and results of R commands, our task is to provide a high level description of the data.

Code
# Run dim() to get the number of cases.
dim(wildbird)
[1] 147   2
Code
# There are 147 cases and 2 columns in this dataset.
# Run view() to see what these 2 columns are.
view(wildbird)

The raw import has 147 rows and 2 columns. The real column labels, Wet Body Weight (g) and Population Size, sit in the first data row rather than in the header, so the columns need to be renamed and that row dropped. Viewing the dataset also shows that there are no missing values.
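Scanning the viewer is one way to check for missing values; they can also be counted directly. The sketch below uses a small made-up data frame (`toy`, with invented values and a deliberately missing entry) as a stand-in for the raw import, and relies only on base R.

```r
# Toy stand-in for the raw import; all values are made up for illustration.
toy <- data.frame(
  Reference = c("Wet Body Weight (g)", "5.46", "7.76", NA),
  Source    = c("Population Size", "532194", "3165107", "2592997"),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only if no value is missing anywhere in the data frame.
no_missing <- !any(is.na(toy))
print(no_missing)
```

On the real data, the equivalent call would be `colSums(is.na(wildbird))`.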

The header row, which currently holds the text "Reference" and "Taken from Figure 1 of Nee et al.", indicates that the data was taken from Figure 1 of a paper written by Nee and colleagues (finding this paper will probably tell me which country this data is from). The variable names also suggest that the data was probably collected via field research with wild birds.

Code
#Rename columns.
library(dplyr)
wildbird_new <- rename(wildbird, "wet_body_weight" = "Reference", "pop_size" = "Taken from Figure 1 of Nee et al.")
#Remove the first row of data.
wildbird_new <- wildbird_new[-1,]
#Check that the cleaning was done correctly.
view(wildbird_new)
#Check the number of cases again.
dim(wildbird_new)
[1] 146   2

Now that the columns are renamed and the first row is removed, we see that the true number of cases is 146.
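An alternative to renaming and dropping a row afterwards is to skip the non-data rows at import time, which also lets the reader guess numeric column types. The sketch below illustrates the idea with base R's `read.csv()` on a small made-up file that mimics the layout described above; the file contents are invented. readxl's `read_excel()` has `skip` and `col_names` arguments that should allow the same approach with the original spreadsheet, though that is not run here.

```r
# Write a small made-up file: a title row, a row of real labels, then data.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Reference,Taken from Figure 1 of Nee et al.",
  "Wet Body Weight (g),Population Size",
  "5.46,532194",
  "7.76,3165107"
), path)

# Skip both non-data rows and supply clean column names while reading.
toy <- read.csv(path, skip = 2, header = FALSE,
                col.names = c("wet_body_weight", "pop_size"))

# Both columns are read as numbers, so no later conversion is needed.
str(toy)
```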

Code
# Let's check the descriptive statistics.
summary(wildbird_new)
 wet_body_weight      pop_size        
 Length:146         Length:146        
 Class :character   Class :character  
 Mode  :character   Mode  :character  
Code
# The data is in characters, so we need to convert it to numbers.
wildbird_new$wet_body_weight <- as.numeric(wildbird_new$wet_body_weight)
wildbird_new$pop_size <- as.numeric(wildbird_new$pop_size)
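Because `as.numeric()` returns `NA` (with a warning) for any string it cannot parse, it can be worth checking that the conversion did not silently create missing values. A minimal sketch on a made-up character vector, where `"n/a"` is deliberately not a number:

```r
# Made-up character values; the last one cannot be parsed as a number.
raw <- c("5.46", "7.76", "n/a")

# Convert, silencing the coercion warning for this demonstration.
converted <- suppressWarnings(as.numeric(raw))

# Entries that were present before conversion but are NA afterwards.
failed <- raw[is.na(converted) & !is.na(raw)]
print(failed)
```

An empty `failed` vector would indicate that every entry converted cleanly.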
Code
# Now let's check the descriptive statistics again.
# summary() is a base R function, so no extra package needs to be loaded here.
summary(wildbird_new)
 wet_body_weight       pop_size      
 Min.   :   5.459   Min.   :      5  
 1st Qu.:  18.620   1st Qu.:   1821  
 Median :  69.232   Median :  24353  
 Mean   : 363.694   Mean   : 382874  
 3rd Qu.: 309.826   3rd Qu.: 198515  
 Max.   :9639.845   Max.   :5093378  

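The mean wet body weight of the wild birds analysed was about 364 g, and the mean population size was close to 383,000. Both variables span a wide range (roughly 5 g to 9,640 g, and 5 to about 5.1 million), and both means sit well above their medians (about 69 g and 24,353), which points to strongly right-skewed distributions. Now, let’s check if they’re correlated.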
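In the summary output, each mean is several times larger than the corresponding median, which is typical of right-skewed data. One common option in that situation is to summarise the values on a log scale. The sketch below uses a made-up vector `x` as a stand-in for a column such as `pop_size`; the numbers are invented and only illustrate the pattern.

```r
# Made-up right-skewed values; not taken from the dataset.
x <- c(5, 1800, 24000, 200000, 5000000)

# On the raw scale, the largest value pulls the mean far above the median.
raw_mean   <- mean(x)
raw_median <- median(x)
print(c(mean = raw_mean, median = raw_median))

# On a log10 scale, the values are spread far more evenly.
log_x <- log10(x)
print(summary(log_x))
```

For the real data, the same idea would be `summary(log10(wildbird_new$pop_size))`, which is defined here because the minimum population size is above zero.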

Code
# Running correlation.
cor(wildbird_new$wet_body_weight,wildbird_new$pop_size)
[1] -0.1162993
Code
# Fit a simple linear regression of wet body weight on population size.
summary(lm(wildbird_new$wet_body_weight~wildbird_new$pop_size))

Call:
lm(formula = wildbird_new$wet_body_weight ~ wildbird_new$pop_size)

Residuals:
   Min     1Q Median     3Q    Max 
-400.1 -369.5 -275.6   -0.4 9230.6 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            4.097e+02  8.748e+01   4.683 6.47e-06 ***
wildbird_new$pop_size -1.202e-04  8.552e-05  -1.405    0.162    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 980.3 on 144 degrees of freedom
Multiple R-squared:  0.01353,   Adjusted R-squared:  0.006675 
F-statistic: 1.974 on 1 and 144 DF,  p-value: 0.1621

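They are quite weakly correlated: the correlation is negative but small (r ≈ -0.12), population size explains only about 1.4% of the variance in wet body weight, and the slope is not statistically significant at the 5% level (p = 0.162).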
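Pearson's correlation measures linear association, so with heavily skewed variables it can understate a relationship that is monotonic but not linear. A rank-based (Spearman) correlation or a correlation on log-transformed values are two common alternatives. The sketch below uses made-up vectors constructed to show the contrast; it says nothing about what the wild bird data would show.

```r
# Made-up values with a perfectly decreasing but strongly non-linear relationship.
weight <- c(5, 12, 40, 150, 900, 9000)
pop    <- c(4000000, 900000, 250000, 30000, 2000, 10)

# Linear (Pearson) correlation on the raw scale.
pearson <- cor(weight, pop)

# Rank-based (Spearman) correlation, which only uses the ordering of values.
spearman <- cor(weight, pop, method = "spearman")

# Pearson correlation after log-transforming both variables.
log_log <- cor(log10(weight), log10(pop))

print(c(pearson = pearson, spearman = spearman, log_log = log_log))
```

On the real data, the corresponding calls would be `cor(wildbird_new$wet_body_weight, wildbird_new$pop_size, method = "spearman")` and the log-log version of the original `cor()` call.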