---
title: "Challenge 1 "
author: "Zhongyue Lin"
description: "Reading in data and creating a post"
date: "5/31/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_1
  - railroads
  - faostat
  - wildbirds
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
# Import the "readxl" package
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a dataset, and
2) describe the dataset using both words and any supporting information (e.g., tables, etc)
## Read in the Data
Read in one (or more) of the following data sets, using the correct R package and command.
- wild_bird_data.xlsx ⭐⭐⭐
- StateCounty2012.xls ⭐⭐⭐⭐
For Challenge 1 I chose the raw data files "wild_bird_data.xlsx" and "StateCounty2012.xls" for practice. Since I had no prior experience importing .xls and .xlsx files in R, reading these two raw files was a new exercise for me.
### wild_bird_data
While working with "wild_bird_data.xlsx," I found that the spreadsheet has an extra row above the actual column headers. To keep the column names from being read incorrectly, I used `skip = 1` so that `read_excel()` starts reading at the header row.
```{r}
# Read the file, skipping the extra row above the column headers
data_wild_bird <- read_excel("_data/wild_bird_data.xlsx", skip = 1)
# Preview the data
head(data_wild_bird)
dim(data_wild_bird)
```
I then used the `head()` and `dim()` functions for an initial preview of the data (146 rows, 2 columns).
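As a quick additional check on the import (a small sketch of my own, not part of the original challenge code), the column types and missing-value count can be inspected directly on the `data_wild_bird` tibble created above.

```{r}
#| label: wildbird-check
# Sanity check (my own addition): both columns should be numeric, and skipping
# the extra first row should not have introduced any missing values.
sapply(data_wild_bird, class)
sum(is.na(data_wild_bird))
```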
### StateCounty2012
For this file I adjusted and extended the import code to match the format of the original spreadsheet. I skipped the first three rows, which sit above the actual column headers, dropped the columns that are completely empty, and filled in the missing "COUNTY" values on the state "Total" rows. I also removed the unrelated content at the end of the table.
One issue remains: each state code still has a row labeled "Total," which would distort aggregation and analysis if left in. This is handled in the later processing steps below.
```{r}
# Read the file, skipping the first 3 rows above the column headers,
# drop the columns that are completely empty, and replace the missing "COUNTY"
# values with "/" on the rows where "STATE" contains "Total"
data_clean <- read_excel("_data/StateCounty2012.xls", skip = 3) %>%
  select_if(~ !all(is.na(.))) %>%
  mutate(COUNTY = ifelse(is.na(COUNTY) & grepl("Total", STATE), "/", COUNTY))
# Print the first 100 rows of the cleaned data
head(data_clean, 100)
# Fill in "COUNTY" for the "CANADA" row, then drop the last four rows,
# which contain the unrelated content at the end of the sheet
data_clean <- data_clean %>%
  mutate(COUNTY = replace(COUNTY, STATE == "CANADA", "Canada")) %>%
  slice(1:(nrow(.) - 4))
# Print the last 100 rows of the cleaned data to check
tail(data_clean, 100)
# Remove the row immediately before the "CANADA" row
data_clean <- data_clean %>%
  slice(-(which(STATE == "CANADA") - 1))
# Print the last 100 rows again to check
tail(data_clean, 100)
```
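As an optional consistency check on the cleaned table (a sketch of my own, not part of the original cleaning steps), each state's "Total" row can be compared against the sum of its county rows; an empty result means the two agree. The helper names `reported` and `computed` are mine.

```{r}
#| label: check-totals
# Consistency check (my own addition): a state's reported "Total" row should
# equal the sum of that state's county rows.
reported <- data_clean %>%
  filter(grepl("Total", STATE), !grepl("Grand", STATE)) %>%
  mutate(STATE = str_remove(STATE, "\\s*Total.*$")) %>%
  select(STATE, reported_total = TOTAL)

computed <- data_clean %>%
  filter(!grepl("Total", STATE, ignore.case = TRUE)) %>%
  group_by(STATE) %>%
  summarise(computed_total = sum(TOTAL, na.rm = TRUE), .groups = "drop")

# Any rows returned here indicate a state whose "Total" row does not match
left_join(reported, computed, by = "STATE") %>%
  filter(reported_total != computed_total)
```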
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
### wild_bird_data
```{r}
#| label: summary
# Print the structure of the data
str(data_wild_bird)
# Print a summary of the data
summary(data_wild_bird)
```
The dataset file does not provide background information on how the data were collected, so the following description is inferred from the data itself.
The dataset, "wild_bird_data," consists of 146 rows and 2 columns. It was likely gathered through field research in which researchers measured bird body weights at one or more locations and recorded the bird population size in those areas.
The two main variables are "Wet body weight \[g\]" and "Population size":
"Wet body weight \[g\]": This variable represents the wet body weight of birds (in grams). The values range from 5.459g to 9639.845g, with an average of about 363.694g, and a median of about 69.232g. This variable could be very useful in studying the health conditions of birds, their life habits, or the impact of environmental changes on bird weight.
"Population size": This variable represents the bird population size in a specific area. The values range widely, from a minimum of 5 to a maximum of 5093378, with an average of about 382874, and a median of about 24353. This variable could be very useful in studying the distribution of bird populations, dynamic changes in population, or the impact of environmental changes on population size.
### StateCounty2012
```{r}
# Structure of the cleaned data
str(data_clean)
# Remove the rows where "STATE" contains "Total" so the summary statistics
# describe county-level counts only
data_clean_rm <- data_clean %>%
  filter(!grepl("total", STATE, ignore.case = TRUE))
summary(data_clean_rm)
tail(data_clean_rm, 100)
# Find the county row with the largest employment count
max_row <- data_clean_rm %>%
  filter(TOTAL == max(TOTAL, na.rm = TRUE))
print(max_row)
# Find the county rows with the smallest employment count
min_row <- data_clean_rm %>%
  filter(TOTAL == min(TOTAL, na.rm = TRUE))
print(min_row)
```
The data come from an Excel file called "StateCounty2012.xls". After cleaning, the table contains 2,985 rows and 3 columns (2,931 rows once the state "Total" rows are removed). The data appear to be counts of railroad employment by U.S. state and county for 2012.
The three columns are as follows:
The "STATE" column: This is a character vector, representing various states and other areas in the United States (such as AE and AK), as well as Canada. The mode (most common value) of this column is "character".
The "COUNTY" column: This is also a character vector, representing various counties or regions. Note some special values, for example, when the value of the "STATE" column contains "Total", the value of this column is replaced with "/", indicating that this is a summary row. In the row where the value of the "STATE" column is "CANADA", the value of this column is replaced with "Canada".
The "TOTAL" column: This is a numeric vector, representing the total number of railroad employments in various states and counties. The minimum value of this column is 1, the first quartile is 7, the median is 21, the mean is 87.37, the third quartile is 65, and the maximum value is 8207 (located in Cook County, Illinois).