library(tidyverse)
library(lubridate)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE)
Sarah McAlpine
October 11, 2022
For this assignment, I chose to use a set of data from iNaturalist.org of citizen scientist observations of plant life using mobile apps. This particular data set is limited to those observations made in North America whose identifications have the most disagreements. The iNaturalist site allows for custom data queries, but since I am not familiar with each data field, I exported more than I needed. First I will read in this data and take a look at the column names and values present within each to decide what to keep. At the beginning, I have 27 columns and 25,266 rows.
At the outset, possible research questions could be about patterns in family, order, genus, and/or species that are difficult for the general public to identify, and possibly certain areas of the continent that have the highest likelihood of identification disagreement. I would not, however, be able to compare which users are most likely to have their identifications disputed, since my data doesn’t include undisputed identifications.
Rows: 25266 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): observed_on, time_observed_at, time_zone, user_login, created_at, ...
dbl (7): id, user_id, num_identification_agreements, num_identification_dis...
lgl (1): captive_cultivated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] "id" "observed_on"
[3] "time_observed_at" "time_zone"
[5] "user_id" "user_login"
[7] "created_at" "updated_at"
[9] "quality_grade" "num_identification_agreements"
[11] "num_identification_disagreements" "captive_cultivated"
[13] "place_guess" "latitude"
[15] "longitude" "place_town_name"
[17] "place_state_name" "place_country_name"
[19] "species_guess" "scientific_name"
[21] "common_name" "iconic_taxon_name"
[23] "taxon_id" "taxon_order_name"
[25] "taxon_family_name" "taxon_genus_name"
[27] "taxon_species_name"
# A tibble: 6 × 27
id obser…¹ time_…² time_…³ user_id user_…⁴ creat…⁵ updat…⁶ quali…⁷ num_i…⁸
<dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 13597 4/1/20… 2011-0… Pacifi… 1 kueda 2011-0… 2022-0… needs_… 0
2 53394 2/17/2… 2003-0… Arizona 4881 victor… 2012-0… 2022-0… casual 1
3 56948 3/10/2… 2012-0… Pacifi… 1 kueda 2012-0… 2021-0… needs_… 0
4 59618 3/21/2… <NA> Pacifi… 549 bob-do… 2012-0… 2020-0… needs_… 1
5 101087 7/11/2… <NA> Easter… 4860 rcurtis 2012-0… 2021-0… needs_… 0
6 105005 7/21/2… 2012-0… Pacifi… 1 kueda 2012-0… 2016-0… needs_… 0
# … with 17 more variables: num_identification_disagreements <dbl>,
# captive_cultivated <lgl>, place_guess <chr>, latitude <dbl>,
# longitude <dbl>, place_town_name <chr>, place_state_name <chr>,
# place_country_name <chr>, species_guess <chr>, scientific_name <chr>,
# common_name <chr>, iconic_taxon_name <chr>, taxon_id <dbl>,
# taxon_order_name <chr>, taxon_family_name <chr>, taxon_genus_name <chr>,
# taxon_species_name <chr>, and abbreviated variable names ¹observed_on, …
Variable | Stats / Values | Missing
---|---|---
id [numeric] | 25266 distinct values | 0 (0.0%)
observed_on [character] | | 126 (0.5%)
time_observed_at [character] | | 1148 (4.5%)
time_zone [character] | | 3 (0.0%)
user_id [numeric] | 15396 distinct values | 0 (0.0%)
user_login [character] | | 0 (0.0%)
created_at [character] | | 0 (0.0%)
updated_at [character] | | 0 (0.0%)
quality_grade [character] | | 0 (0.0%)
num_identification_agreements [numeric] | | 0 (0.0%)
num_identification_disagreements [numeric] | | 0 (0.0%)
captive_cultivated [logical] | | 0 (0.0%)
place_guess [character] | | 16 (0.1%)
latitude [numeric] | 24587 distinct values | 0 (0.0%)
longitude [numeric] | 24595 distinct values | 0 (0.0%)
place_town_name [character] | | 23021 (91.1%)
place_state_name [character] | | 2761 (10.9%)
place_country_name [character] | | 69 (0.3%)
species_guess [character] | | 5967 (23.6%)
scientific_name [character] | | 0 (0.0%)
common_name [character] | | 2011 (8.0%)
iconic_taxon_name [character] | 1. Plantae | 0 (0.0%)
taxon_id [numeric] | 7187 distinct values | 0 (0.0%)
taxon_order_name [character] | | 2836 (11.2%)
taxon_family_name [character] | | 3169 (12.5%)
taxon_genus_name [character] | | 5092 (20.2%)
taxon_species_name [character] | | 10907 (43.2%)
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20
#[1] "id" DELETE "observed_on" 1973 - today "time_observed_at" LUBRIDATE
#[4] "time_zone" LUBRIDATE "user_id" DELETE "user_login" 15.3k
#[7] "created_at" LUBRIDATE "updated_at" LUBRIDATE "quality_grade" 3 categories
#[10] "num_identification_agreements" "num_identification_disagreements" "captive_cultivated" T/F
#[13] "place_guess" MAP? "latitude" MAP? "longitude" MAP?
#[16] "place_town_name" MAP? "place_state_name" MAP? "place_country_name" MAP?
#[19] "species_guess" "scientific_name" "common_name"
#[22] "iconic_taxon_name" "taxon_id" DELETE "taxon_order_name" GROUP
#[25] "taxon_family_name" GROUP "taxon_genus_name" GROUP "taxon_species_name"
#clean up time zones
#cleaning during read-in
obs_clean <- read_csv("_data/plant_observations.csv",
skip = 1,
col_names = c("id",
"delete",
"time_observed_at",
"delete",
"delete",
"user_login",
"delete",
"delete",
"quality_grade",
"num_identification_agreements",
"num_identification_disagreements",
"captive_cultivated",
"place_guess",
"latitude",
"longitude",
"place_town_name",
"place_state_name",
"place_country_name",
"species_guess",
"scientific_name",
"common_name",
"iconic_taxon_name",
"delete",
"taxon_order_name",
"taxon_family_name",
"taxon_genus_name",
"taxon_species_name")) %>%
#exclude columns called delete
select(!starts_with("delete"))
New names:
Rows: 25266 Columns: 27
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(19): delete...2, time_observed_at, delete...4, user_login, delete...7, ... dbl
(7): id, delete...5, num_identification_agreements, num_identification_... lgl
(1): captive_cultivated
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `delete` -> `delete...2`
• `delete` -> `delete...4`
• `delete` -> `delete...5`
• `delete` -> `delete...7`
• `delete` -> `delete...8`
• `delete` -> `delete...23`
After many hours, I was able to use `str_remove_all` from `stringr` to remove the time zone characters from the datetime strings, and then `ymd_hms` from `lubridate` to convert those strings to datetimes. Since I am not researching the times of day, I forced the UTC time zone for all of these data.
One other idea: could I use the longitude of each observation to calculate its time zone?
# A tibble: 6 × 21
id time_observed_at user_login quali…¹ num_i…² num_i…³ capti…⁴ place…⁵
<dbl> <dttm> <chr> <chr> <dbl> <dbl> <lgl> <chr>
1 13597 2011-04-02 01:00:04 kueda needs_… 0 1 FALSE Mount …
2 53394 2003-02-17 20:05:00 victorious… casual 1 3 TRUE Arizon…
3 56948 2012-03-10 22:00:26 kueda needs_… 0 1 FALSE Huckle…
4 59618 NA bob-dodge needs_… 1 3 FALSE jasper…
5 101087 NA rcurtis needs_… 0 1 FALSE Kent B…
6 105005 2012-07-21 19:51:00 kueda needs_… 0 1 FALSE Sagehe…
# … with 13 more variables: latitude <dbl>, longitude <dbl>,
# place_town_name <chr>, place_state_name <chr>, place_country_name <chr>,
# species_guess <chr>, scientific_name <chr>, common_name <chr>,
# iconic_taxon_name <chr>, taxon_order_name <chr>, taxon_family_name <chr>,
# taxon_genus_name <chr>, taxon_species_name <chr>, and abbreviated variable
# names ¹quality_grade, ²num_identification_agreements,
# ³num_identification_disagreements, ⁴captive_cultivated, ⁵place_guess
# rank every family by number of observations, with its share of all rows
ranked_families <- obs_clean %>%
  count(taxon_family_name) %>%
  arrange(desc(n)) %>%
  mutate(prop_families = round(n/sum(n),3))
# top 10 named families plus the NA group (11 rows);
# proportions here are relative to these 11 rows only
obs_clean %>%
  count(taxon_family_name) %>%
  arrange(desc(n)) %>%
  slice(1:11) %>%
  mutate(prop_families = round(n/sum(n),3))
# A tibble: 11 × 3
taxon_family_name n prop_families
<chr> <int> <dbl>
1 <NA> 3169 0.266
2 Asteraceae 2668 0.224
3 Fabaceae 1244 0.104
4 Rosaceae 1236 0.104
5 Fagaceae 669 0.056
6 Pinaceae 562 0.047
7 Poaceae 535 0.045
8 Lamiaceae 491 0.041
9 Cactaceae 470 0.039
10 Asparagaceae 464 0.039
11 Ericaceae 409 0.034
---
title: "Sarah McAlpine HW 2"
author: "Sarah McAlpine"
description: "Homework 2"
date: "10/11/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw2
- sarahmcalpine
- inaturalist data
- lubridate
- summarytools
- time zones
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(lubridate)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE)
```
## Select a Data Set
For this assignment, I chose to use a set of data from iNaturalist.org of citizen scientist observations of plant life using mobile apps. This particular data set is limited to those observations made in North America whose identifications have the most disagreements. The iNaturalist site allows for custom data queries, but since I am not familiar with each data field, I exported more than I needed. First I will read in this data and take a look at the column names and values present within each to decide what to keep. At the beginning, I have 27 columns and 25,266 rows.
At the outset, possible research questions could be about patterns in family, order, genus, and/or species that are difficult for the general public to identify, and possibly certain areas of the continent that have the highest likelihood of identification disagreement. I would not, however, be able to compare which users are most likely to have their identifications disputed, since my data doesn't include undisputed identifications.
```{r}
#read in data
observations <- read_csv("_data/plant_observations.csv")
# preview data and plan for cleaning
colnames(observations)
head(observations)
print(dfSummary(observations,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.60,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
#[1] "id" DELETE "observed_on" 1973 - today "time_observed_at" LUBRIDATE
#[4] "time_zone" LUBRIDATE "user_id" DELETE "user_login" 15.3k
#[7] "created_at" LUBRIDATE "updated_at" LUBRIDATE "quality_grade" 3 categories
#[10] "num_identification_agreements" "num_identification_disagreements" "captive_cultivated" T/F
#[13] "place_guess" MAP? "latitude" MAP? "longitude" MAP?
#[16] "place_town_name" MAP? "place_state_name" MAP? "place_country_name" MAP?
#[19] "species_guess" "scientific_name" "common_name"
#[22] "iconic_taxon_name" "taxon_id" DELETE "taxon_order_name" GROUP
#[25] "taxon_family_name" GROUP "taxon_genus_name" GROUP "taxon_species_name"
#clean up time zones
```
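To decide which columns to keep, it can help to rank them by how incomplete they are. The following is a minimal sketch on a made-up toy tibble (the values and the `toy_obs` and `missing_summary` names are illustrative assumptions, not the real export); the same pipeline should apply to `observations`.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the observations data; values are made up for illustration.
toy_obs <- tibble(
  id = c(1, 2, 3, 4),
  place_town_name = c(NA, NA, NA, "Town A"),
  taxon_family_name = c("Asteraceae", NA, "Rosaceae", "Fabaceae")
)

# Count and share of missing values per column, most incomplete first.
missing_summary <- toy_obs %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") %>%
  mutate(prop_missing = n_missing / nrow(toy_obs)) %>%
  arrange(desc(n_missing))

missing_summary
```

A column such as `place_town_name`, which is mostly missing in the real data, would rise to the top of this kind of summary.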
## Building a Clean Read-in
```{r}
#cleaning during read-in
obs_clean <- read_csv("_data/plant_observations.csv",
skip = 1,
col_names = c("id",
"delete",
"time_observed_at",
"delete",
"delete",
"user_login",
"delete",
"delete",
"quality_grade",
"num_identification_agreements",
"num_identification_disagreements",
"captive_cultivated",
"place_guess",
"latitude",
"longitude",
"place_town_name",
"place_state_name",
"place_country_name",
"species_guess",
"scientific_name",
"common_name",
"iconic_taxon_name",
"delete",
"taxon_order_name",
"taxon_family_name",
"taxon_genus_name",
"taxon_species_name")) %>%
#exclude columns called delete
select(!starts_with("delete"))
```
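The read-in above works by renaming the unwanted columns to `delete`, which is also what produces the name-repair messages. One possible alternative, sketched here on a small inline CSV with made-up values (`toy_csv`, `drop_cols`, and `toy_clean` are hypothetical names), is to keep the original header and drop the unwanted columns by name.

```r
library(readr)
library(dplyr)

# Small inline CSV standing in for the export; values are made up,
# and only a subset of the real column names is used.
toy_csv <- I("id,observed_on,user_id,user_login,taxon_id,taxon_family_name
1,4/1/2011,10,user_a,100,Asteraceae
2,2/17/2003,20,user_b,200,Fabaceae
")

# Columns marked DELETE in the planning notes (subset shown here).
drop_cols <- c("observed_on", "user_id", "taxon_id")

toy_clean <- read_csv(toy_csv, show_col_types = FALSE) %>%
  select(!all_of(drop_cols))

names(toy_clean)
```

This keeps the column names tied to the file's own header, so a change in column order in a future export would not silently mislabel anything.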
## Dealing with Time Zones
After many hours, I was able to use `str_remove_all` from `stringr` to remove the time zone characters from the datetime strings, and then `ymd_hms` from `lubridate` to convert those strings to datetimes. Since I am not researching the times of day, I forced the UTC time zone for all of these data.
One other idea: could I use the longitude of each observation to calculate its time zone?
* [lubridate source sheet](https://lubridate.tidyverse.org/articles/lubridate.html)
* [lubridate cheat sheet](https://rawgit.com/rstudio/cheatsheets/main/lubridate.pdf)
* [tz database time zones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones)
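On the idea of deriving time zones from coordinates: time zones follow longitude rather than latitude, at roughly 15 degrees per hour. The sketch below computes only that nominal offset (the function name `nominal_utc_offset` and the sample longitudes are assumptions for illustration). Real zone boundaries are political, so an accurate answer would need a lookup against time zone boundary data.

```r
# Nominal UTC offset in hours from longitude, assuming 15 degrees per hour.
# This ignores political boundaries and daylight saving time.
nominal_utc_offset <- function(longitude) {
  round(longitude / 15)
}

# Made-up longitudes for illustration.
offsets <- nominal_utc_offset(c(-120, -75, 0))
offsets
```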
```{r}
#strip time zone info: the regex matches one non-word character followed by "0000"
obs_clean$time_observed_at <- str_remove_all(obs_clean$time_observed_at, "\\W0000")
# turn into a date
obs_clean$time_observed_at <- ymd_hms(obs_clean$time_observed_at,
tz = "UTC")
#preview
head(obs_clean)
```
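The same two steps can also be written as a single `mutate()` pipeline. This is a sketch on a toy tibble; the sample string format (a datetime followed by a `+0000` offset) is an assumption based on the regular expression above, not a confirmed description of the export.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Toy input; the "+0000" suffix is an assumed format for illustration.
toy <- tibble(time_observed_at = c("2011-04-02 01:00:04 +0000", NA))

toy_clean <- toy %>%
  mutate(
    time_observed_at = time_observed_at %>%
      str_remove("\\s*\\W0000$") %>%  # drop the trailing offset and any space before it
      ymd_hms(tz = "UTC")             # parse what is left as a UTC datetime
  )

toy_clean
```

Missing values pass through both steps as `NA`, which matches the `NA` rows in the preview.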
## Identify Research Questions
* What are some of the most disputed families and genera of plants?
* Do these vary by location or year?
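As a first pass at the second question, disputed observations could be counted by year, state, and family. The sketch below uses a made-up toy tibble with the same column names as `obs_clean`; `obs_year` and `family_by_year_state` are hypothetical names.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for obs_clean; values are made up for illustration.
toy_obs <- tibble(
  time_observed_at = ymd_hms(c("2019-05-01 10:00:00", "2019-06-12 09:30:00",
                               "2020-07-04 15:45:00", NA), tz = "UTC"),
  place_state_name = c("Massachusetts", "Massachusetts", "Arizona", "Arizona"),
  taxon_family_name = c("Asteraceae", "Asteraceae", "Cactaceae", "Cactaceae")
)

# Count observations per year, state, and family, largest groups first.
family_by_year_state <- toy_obs %>%
  filter(!is.na(time_observed_at)) %>%
  count(obs_year = year(time_observed_at), place_state_name, taxon_family_name,
        sort = TRUE)

family_by_year_state
```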
```{r}
# rank every family by number of observations, with its share of all rows
ranked_families <- obs_clean %>%
  count(taxon_family_name) %>%
  arrange(desc(n)) %>%
  mutate(prop_families = round(n/sum(n),3))
# top 10 named families plus the NA group, so 11 rows are sliced;
# proportions here are relative to these 11 rows only
obs_clean %>%
  count(taxon_family_name) %>%
  arrange(desc(n)) %>%
  slice(1:11) %>%
  mutate(prop_families = round(n/sum(n),3))
```
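In the chunk above, the second pipeline computes proportions after `slice()`, so each value is a share of the 11 displayed rows rather than of all observations, and the `NA` group takes the top spot. One possible variation, sketched on made-up toy data (`toy_obs` and `top_families` are hypothetical names), drops the missing families first and computes proportions before slicing.

```r
library(dplyr)

# Toy stand-in for obs_clean; values are made up for illustration.
toy_obs <- tibble(
  taxon_family_name = c("Asteraceae", "Asteraceae", "Asteraceae",
                        "Fabaceae", "Fabaceae", "Rosaceae", NA, NA)
)

top_families <- toy_obs %>%
  filter(!is.na(taxon_family_name)) %>%          # leave out unidentified families
  count(taxon_family_name, sort = TRUE) %>%      # rank by number of observations
  mutate(prop_families = round(n / sum(n), 3)) %>%  # share of all identified rows
  slice_head(n = 2)                              # keep the top rows (10 on the real data)

top_families
```

Whether to exclude the `NA` group depends on the question: keeping it shows how many disputed observations lack a family-level identification at all.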