Sarah McAlpine HW 2

hw2

sarahmcalpine

inaturalist data

lubridate

summarytools

time zones

Author

Sarah McAlpine

Published

October 11, 2022

Code

library(tidyverse)
library(lubridate)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE)

Select a Data Set

For this assignment, I chose a data set from iNaturalist.org of plant observations that citizen scientists recorded with mobile apps. This particular data set is limited to observations made in North America whose identifications have the most disagreements. The iNaturalist site allows custom data queries, but since I am not familiar with every data field, I exported more than I needed. First I will read in the data and look at the column names and the values present in each column to decide what to keep. At the beginning, I have 27 columns and 25,266 rows.

At the outset, possible research questions could be about which families, orders, genera, and/or species are difficult for the general public to identify, and possibly which areas of the continent have the highest likelihood of identification disagreement. I would not, however, be able to compare which users are most likely to have disputed identifications, because my data do not include undisputed identifications.

Code

#read in data
observations <- read_csv("_data/plant_observations.csv")

Rows: 25266 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): observed_on, time_observed_at, time_zone, user_login, created_at, ...
dbl  (7): id, user_id, num_identification_agreements, num_identification_dis...
lgl  (1): captive_cultivated

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
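The message above suggests specifying column types. As a minimal sketch of what that could look like, the example below reads a tiny inline CSV (toy rows, not the real export) and declares the types of three columns explicitly. The toy values and the object name `toy_obs` are assumptions for illustration, and `I()` for literal data assumes readr 2.0 or later.

```r
library(readr)

# Toy CSV text standing in for the real export (values are illustrative only)
csv_text <- I("id,observed_on,captive_cultivated\n1,4/27/2019,FALSE\n2,4/24/2020,TRUE\n")

# Declaring col_types up front documents the expected types and
# quiets the column specification message
toy_obs <- read_csv(
  csv_text,
  col_types = cols(
    id = col_double(),
    observed_on = col_character(),
    captive_cultivated = col_logical()
  )
)

toy_obs
```

For a quick look where the guessed types are fine, `show_col_types = FALSE` (named in the message itself) silences the message without declaring anything.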

Code

# preview data and plan for cleaning
colnames(observations)

 [1] "id"                               "observed_on"                     
 [3] "time_observed_at"                 "time_zone"                       
 [5] "user_id"                          "user_login"                      
 [7] "created_at"                       "updated_at"                      
 [9] "quality_grade"                    "num_identification_agreements"   
[11] "num_identification_disagreements" "captive_cultivated"              
[13] "place_guess"                      "latitude"                        
[15] "longitude"                        "place_town_name"                 
[17] "place_state_name"                 "place_country_name"              
[19] "species_guess"                    "scientific_name"                 
[21] "common_name"                      "iconic_taxon_name"               
[23] "taxon_id"                         "taxon_order_name"                
[25] "taxon_family_name"                "taxon_genus_name"                
[27] "taxon_species_name"

Code

head(observations)

# A tibble: 6 × 27
      id obser…¹ time_…² time_…³ user_id user_…⁴ creat…⁵ updat…⁶ quali…⁷ num_i…⁸
   <dbl> <chr>   <chr>   <chr>     <dbl> <chr>   <chr>   <chr>   <chr>     <dbl>
1  13597 4/1/20… 2011-0… Pacifi…       1 kueda   2011-0… 2022-0… needs_…       0
2  53394 2/17/2… 2003-0… Arizona    4881 victor… 2012-0… 2022-0… casual        1
3  56948 3/10/2… 2012-0… Pacifi…       1 kueda   2012-0… 2021-0… needs_…       0
4  59618 3/21/2… <NA>    Pacifi…     549 bob-do… 2012-0… 2020-0… needs_…       1
5 101087 7/11/2… <NA>    Easter…    4860 rcurtis 2012-0… 2021-0… needs_…       0
6 105005 7/21/2… 2012-0… Pacifi…       1 kueda   2012-0… 2016-0… needs_…       0
# … with 17 more variables: num_identification_disagreements <dbl>,
#   captive_cultivated <lgl>, place_guess <chr>, latitude <dbl>,
#   longitude <dbl>, place_town_name <chr>, place_state_name <chr>,
#   place_country_name <chr>, species_guess <chr>, scientific_name <chr>,
#   common_name <chr>, iconic_taxon_name <chr>, taxon_id <dbl>,
#   taxon_order_name <chr>, taxon_family_name <chr>, taxon_genus_name <chr>,
#   taxon_species_name <chr>, and abbreviated variable names ¹observed_on, …

Code

print(dfSummary(observations,
                varnumbers   = FALSE,
                plain.ascii  = FALSE,
                style        = "grid",
                graph.magnif = 0.60,
                valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

observations

Dimensions: 25266 x 27
Duplicates: 0

Columns: Variable [type] | Stats / Values | Freqs (% of Valid) | Missing

id [numeric]
  Mean (sd): 49218948 (34354028)
  min ≤ med ≤ max: 13597 ≤ 42531827 ≤ 138451899
  IQR (CV): 46488582 (0.7)
  25266 distinct values
  Missing: 0 (0.0%)

observed_on [character]
   1. 4/27/2019                        143 (0.6%)
   2. 4/24/2020                        123 (0.5%)
   3. 4/26/2020                        121 (0.5%)
   4. 4/25/2020                        111 (0.4%)
   5. 4/28/2019                        111 (0.4%)
   6. 4/27/2020                         96 (0.4%)
   7. 4/29/2019                         92 (0.4%)
   8. 4/26/2019                         85 (0.3%)
   9. 5/1/2021                          71 (0.3%)
  10. 5/2/2021                          67 (0.3%)
  [ 3044 others ]                    24120 (95.9%)
  Missing: 126 (0.5%)

time_observed_at [character]
   1. 2020-06-10 19:02:00 +0000         18 (0.1%)
   2. 2016-12-05 17:03:00 +0000          6 (0.0%)
   3. 2018-07-13 18:00:00 +0000          6 (0.0%)
   4. 2016-06-17 23:29:00 +0000          5 (0.0%)
   5. 2017-06-11 03:54:00 +0000          4 (0.0%)
   6. 2020-09-03 08:02:00 +0000          4 (0.0%)
   7. 2022-02-06 20:43:00 +0000          4 (0.0%)
   8. 2022-04-30 21:51:00 +0000          4 (0.0%)
   9. 2019-04-27 14:54:00 +0000          3 (0.0%)
  10. 2019-04-27 15:04:00 +0000          3 (0.0%)
  [ 23946 others ]                   24061 (99.8%)
  Missing: 1148 (4.5%)

time_zone [character]
   1. Eastern Time (US & Canada       8778 (34.7%)
   2. Pacific Time (US & Canada       6250 (24.7%)
   3. UTC                             3443 (13.6%)
   4. Central Time (US & Canada       2711 (10.7%)
   5. Mountain Time (US & Canad       1397 (5.5%)
   6. Atlantic Time (Canada)           419 (1.7%)
   7. Hawaii                           416 (1.6%)
   8. Mexico City                      378 (1.5%)
   9. Arizona                          218 (0.9%)
  10. Bogota                           140 (0.6%)
  [ 78 others ]                       1113 (4.4%)
  Missing: 3 (0.0%)

user_id [numeric]
  Mean (sd): 1785599 (1550175)
  min ≤ med ≤ max: 1 ≤ 1487247 ≤ 6238464
  IQR (CV): 2270983 (0.9)
  15396 distinct values
  Missing: 0 (0.0%)

user_login [character]
   1. finatic                          510 (2.0%)
   2. jaykeller                        437 (1.7%)
   3. danielmorton                     124 (0.5%)
   4. lianamay                         119 (0.5%)
   5. aaronbalam_tutor                 118 (0.5%)
   6. sandbankspp                      117 (0.5%)
   7. marymacaulay                     115 (0.5%)
   8. silversea_starsong                95 (0.4%)
   9. frank324                          92 (0.4%)
  10. simpylmare55                      90 (0.4%)
  [ 15386 others ]                   23449 (92.8%)
  Missing: 0 (0.0%)

created_at [character]
   1. 2015-08-05 00:52:19 +0000          3 (0.0%)
   2. 2017-08-09 00:29:31 +0000          3 (0.0%)
   3. 2018-06-21 03:20:31 +0000          3 (0.0%)
   4. 2019-04-29 18:07:20 +0000          3 (0.0%)
   5. 2019-04-29 18:26:33 +0000          3 (0.0%)
   6. 2019-04-29 18:26:38 +0000          3 (0.0%)
   7. 2020-03-11 22:07:15 +0000          3 (0.0%)
   8. 2020-08-14 11:21:20 +0000          3 (0.0%)
   9. 2013-03-25 15:51:51 +0000          2 (0.0%)
  10. 2014-08-01 04:36:50 +0000          2 (0.0%)
  [ 25165 others ]                   25238 (99.9%)
  Missing: 0 (0.0%)

updated_at [character]
   1. 2022-08-21 16:55:57 +0000         78 (0.3%)
   2. 2021-08-26 05:12:47 +0000         46 (0.2%)
   3. 2022-01-04 18:30:50 +0000         28 (0.1%)
   4. 2022-09-27 02:03:07 +0000         20 (0.1%)
   5. 2021-09-14 04:57:16 +0000         19 (0.1%)
   6. 2021-05-24 15:24:02 +0000         18 (0.1%)
   7. 2022-05-26 17:15:15 +0000         17 (0.1%)
   8. 2022-09-13 20:37:50 +0000         11 (0.0%)
   9. 2021-10-16 01:04:15 +0000         10 (0.0%)
  10. 2022-05-30 14:02:12 +0000          9 (0.0%)
  [ 24741 others ]                   25010 (99.0%)
  Missing: 0 (0.0%)

quality_grade [character]
   1. casual                          6408 (25.4%)
   2. needs_id                       18306 (72.5%)
   3. research                         552 (2.2%)
  Missing: 0 (0.0%)

num_identification_agreements [numeric]
  Mean (sd): 0 (0.2)
  min ≤ med ≤ max: -1 ≤ 0 ≤ 3
  IQR (CV): 0 (8.6)
  -1 :    12 (0.0%)
   0 : 24855 (98.4%)
   1 :   348 (1.4%)
   2 :    47 (0.2%)
   3 :     4 (0.0%)
  Missing: 0 (0.0%)

num_identification_disagreements [numeric]
  Mean (sd): 1.2 (0.6)
  min ≤ med ≤ max: 0 ≤ 1 ≤ 9
  IQR (CV): 0 (0.5)
   0 :    13 (0.1%)
   1 : 22445 (88.8%)
   2 :  1700 (6.7%)
   3 :   761 (3.0%)
   4 :   236 (0.9%)
   5 :    79 (0.3%)
   6 :    23 (0.1%)
   7 :     7 (0.0%)
   8 :     1 (0.0%)
   9 :     1 (0.0%)
  Missing: 0 (0.0%)

captive_cultivated [logical]
   1. FALSE                          20227 (80.1%)
   2. TRUE                            5039 (19.9%)
  Missing: 0 (0.0%)

place_guess [character]
   1. Texas, US                        134 (0.5%)
   2. United States                    132 (0.5%)
   3. California, US                   123 (0.5%)
   4. North Carolina, US               122 (0.5%)
   5. Ohio, US                          95 (0.4%)
   6. Florida, US                       70 (0.3%)
   7. Los Angeles County, US-CA         65 (0.3%)
   8. New York, US                      58 (0.2%)
   9. Denver, CO 80226, USA             56 (0.2%)
  10. Chihuahua, Chih., México          50 (0.2%)
  [ 17711 others ]                   24345 (96.4%)
  Missing: 16 (0.1%)

latitude [numeric]
  Mean (sd): 36.7 (7.8)
  min ≤ med ≤ max: 7.4 ≤ 37.7 ≤ 71.7
  IQR (CV): 8.9 (0.2)
  24587 distinct values
  Missing: 0 (0.0%)

longitude [numeric]
  Mean (sd): -96.3 (18.2)
  min ≤ med ≤ max: -176.6 ≤ -95.1 ≤ -51.7
  IQR (CV): 36.4 (-0.2)
  24595 distinct values
  Missing: 0 (0.0%)

place_town_name [character]
   1. New York City                    330 (14.7%)
   2. City of Austin                   205 (9.1%)
   3. Zona Metropolitana               119 (5.3%)
   4. San Antonio                       92 (4.1%)
   5. Portland                          68 (3.0%)
   6. Boston                            65 (2.9%)
   7. Chicago                           58 (2.6%)
   8. Oakville, Ontario                 50 (2.2%)
   9. Pittsburgh                        48 (2.1%)
  10. Greater Chattanooga               45 (2.0%)
  [ 257 others ]                      1165 (51.9%)
  Missing: 23021 (91.1%)

place_state_name [character]
   1. California                      4877 (21.7%)
   2. Texas                           2133 (9.5%)
   3. Florida                         1169 (5.2%)
   4. New York                         860 (3.8%)
   5. North Carolina                   775 (3.4%)
   6. Ohio                             628 (2.8%)
   7. Virginia                         579 (2.6%)
   8. Arizona                          562 (2.5%)
   9. Pennsylvania                     557 (2.5%)
  10. Oregon                           537 (2.4%)
  [ 145 others ]                      9828 (43.7%)
  Missing: 2761 (10.9%)

place_country_name [character]
   1. United States                  20621 (81.8%)
   2. Canada                          2438 (9.7%)
   3. Mexico                          1715 (6.8%)
   4. Costa Rica                       119 (0.5%)
   5. Panama                           119 (0.5%)
   6. Honduras                          37 (0.1%)
   7. Guatemala                         20 (0.1%)
   8. Cuba                              18 (0.1%)
   9. Dominican Republic                17 (0.1%)
  10. El Salvador                       16 (0.1%)
  [ 18 others ]                         77 (0.3%)
  Missing: 69 (0.3%)

species_guess [character]
   1. dicots                          1912 (9.9%)
   2. flowering plants                 389 (2.0%)
   3. plants                           222 (1.2%)
   4. vascular plants                  168 (0.9%)
   5. monocots                         121 (0.6%)
   6. grasses                          111 (0.6%)
   7. sunflowers, daisies, aste        106 (0.5%)
   8. Magnolias, margaritas y p        103 (0.5%)
   9. roses                             97 (0.5%)
  10. Asteroideae                       76 (0.4%)
  [ 7230 others ]                    15994 (82.9%)
  Missing: 5967 (23.6%)

scientific_name [character]
   1. Magnoliopsida                   1678 (6.6%)
   2. Angiospermae                     375 (1.5%)
   3. Plantae                          200 (0.8%)
   4. Tracheophyta                     153 (0.6%)
   5. Quercus                          149 (0.6%)
   6. Bryophyta                        123 (0.5%)
   7. Poaceae                          114 (0.5%)
   8. Asteraceae                       105 (0.4%)
   9. Liliopsida                       101 (0.4%)
  10. Rosa                              95 (0.4%)
  [ 7125 others ]                    22173 (87.8%)
  Missing: 0 (0.0%)

common_name [character]
   1. dicots                          1678 (7.2%)
   2. flowering plants                 375 (1.6%)
   3. plants                           200 (0.9%)
   4. vascular plants                  153 (0.7%)
   5. mosses                           123 (0.5%)
   6. grasses                          114 (0.5%)
   7. sunflowers, daisies, aste        105 (0.5%)
   8. monocots                         101 (0.4%)
   9. roses                             95 (0.4%)
  10. broadleaf enchanter's nig         72 (0.3%)
  [ 6141 others ]                    20239 (87.0%)
  Missing: 2011 (8.0%)

iconic_taxon_name [character]
   1. Plantae                        25266 (100.0%)
  Missing: 0 (0.0%)

taxon_id [numeric]
  Mean (sd): 182411.7 (279303.7)
  min ≤ med ≤ max: 47121 ≤ 63329 ≤ 1419341
  IQR (CV): 103131 (1.5)
  7187 distinct values
  Missing: 0 (0.0%)

taxon_order_name [character]
   1. Asterales                       2758 (12.3%)
   2. Rosales                         1872 (8.3%)
   3. Lamiales                        1516 (6.8%)
   4. Caryophyllales                  1314 (5.9%)
   5. Fabales                         1259 (5.6%)
   6. Fagales                         1189 (5.3%)
   7. Poales                          1176 (5.2%)
   8. Asparagales                     1081 (4.8%)
   9. Pinales                          783 (3.5%)
  10. Ericales                         755 (3.4%)
  [ 93 others ]                       8727 (38.9%)
  Missing: 2836 (11.2%)

taxon_family_name [character]
   1. Asteraceae                      2668 (12.1%)
   2. Fabaceae                        1244 (5.6%)
   3. Rosaceae                        1236 (5.6%)
   4. Fagaceae                         669 (3.0%)
   5. Pinaceae                         562 (2.5%)
   6. Poaceae                          535 (2.4%)
   7. Lamiaceae                        491 (2.2%)
   8. Cactaceae                        470 (2.1%)
   9. Asparagaceae                     464 (2.1%)
  10. Ericaceae                        409 (1.9%)
  [ 319 others ]                     13349 (60.4%)
  Missing: 3169 (12.5%)

taxon_genus_name [character]
   1. Quercus                          606 (3.0%)
   2. Pinus                            279 (1.4%)
   3. Acer                             244 (1.2%)
   4. Rosa                             231 (1.1%)
   5. Rubus                            195 (1.0%)
   6. Prunus                           182 (0.9%)
   7. Lupinus                          176 (0.9%)
   8. Opuntia                          168 (0.8%)
   9. Ilex                             165 (0.8%)
  10. Ulmus                            160 (0.8%)
  [ 1954 others ]                    17768 (88.1%)
  Missing: 5092 (20.2%)

taxon_species_name [character]
   1. Circaea canadensis                72 (0.5%)
   2. Acer rubrum                       67 (0.5%)
   3. Amauropelta noveboracensi         65 (0.5%)
   4. Pseudoziziphus parryi             58 (0.4%)
   5. Viburnum cassinoides              57 (0.4%)
   6. Nymphaea odorata                  53 (0.4%)
   7. Ulmus americana                   48 (0.3%)
   8. Malus domestica                   41 (0.3%)
   9. Lantana urticoides                35 (0.2%)
  10. Nabalus alatus                    32 (0.2%)
  [ 5330 others ]                    13831 (96.3%)
  Missing: 10907 (43.2%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

Code

#[1]  "id"                  DELETE      "observed_on"           1973 - today   "time_observed_at"    LUBRIDATE
#[4]  "time_zone"           LUBRIDATE   "user_id"               DELETE         "user_login"          15.3k
#[7]  "created_at"          LUBRIDATE   "updated_at"            LUBRIDATE      "quality_grade"       3 categories
#[10] "num_identification_agreements"   "num_identification_disagreements"     "captive_cultivated"  T/F
#[13] "place_guess"         MAP?        "latitude"              MAP?           "longitude"           MAP?
#[16] "place_town_name"     MAP?        "place_state_name"      MAP?           "place_country_name"  MAP?
#[19] "species_guess"                   "scientific_name"                      "common_name"
#[22] "iconic_taxon_name"               "taxon_id"              DELETE         "taxon_order_name"    GROUP
#[25] "taxon_family_name"   GROUP       "taxon_genus_name"      GROUP          "taxon_species_name"

#clean up time zones
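The notes above flag the date columns for lubridate. As a minimal sketch, month/day/year strings like those in `observed_on` could be parsed with `mdy()`. The three toy values below (two copied from the summary, plus one missing value) and the name `toy_dates` are assumptions for illustration, not part of the cleaning that follows.

```r
library(lubridate)

# Toy observed_on values in month/day/year form, plus one missing value
toy_dates <- c("4/27/2019", "4/24/2020", NA)

# mdy() reads month-day-year strings and returns Date objects;
# missing inputs stay missing
parsed_dates <- mdy(toy_dates)

parsed_dates
```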

Building a Clean Read-in

Code

#cleaning during read-in
obs_clean <- read_csv("_data/plant_observations.csv",
                      skip = 1,
                      col_names = c("id", 
                                   "delete",
                                   "time_observed_at",
                                   "delete",
                                   "delete",
                                   "user_login",
                                   "delete",
                                   "delete",
                                   "quality_grade",
                                   "num_identification_agreements",
                                   "num_identification_disagreements",
                                   "captive_cultivated",
                                   "place_guess", 
                                   "latitude",
                                   "longitude",
                                   "place_town_name",
                                   "place_state_name",
                                   "place_country_name", 
                                   "species_guess",
                                   "scientific_name",
                                   "common_name",
                                   "iconic_taxon_name",
                                   "delete",
                                   "taxon_order_name",
                                   "taxon_family_name",
                                   "taxon_genus_name",
                                   "taxon_species_name")) %>%
  # exclude the placeholder columns named "delete"
  # (readr renames the duplicates to delete...2, delete...4, and so on)
  select(!starts_with("delete"))

New names:
Rows: 25266 Columns: 27
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(19): delete...2, time_observed_at, delete...4, user_login, delete...7, ... dbl
(7): id, delete...5, num_identification_agreements, num_identification_... lgl
(1): captive_cultivated
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `delete` -> `delete...2`
• `delete` -> `delete...4`
• `delete` -> `delete...5`
• `delete` -> `delete...7`
• `delete` -> `delete...8`
• `delete` -> `delete...23`

Dealing with Time Zones

After many hours, I was able to use stringr's str_remove_all() to strip the "+0000" offset from the datetime strings, and then lubridate's ymd_hms() to convert them to datetimes. Since I am not researching times of day, I set the time zone to UTC for all of these data.

One other idea: could I use the latitude and longitude coordinates to look up time zones?
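On the idea of deriving time zones from coordinates: time zones follow longitude rather than latitude, at roughly 15 degrees per hour. The sketch below is only a rough solar-time approximation under that assumption. It ignores political boundaries and daylight saving time, and the helper `approx_utc_offset()` is a hypothetical name, not a lubridate function. The three toy longitudes are the minimum, median, and maximum from the summary above.

```r
# Rough UTC offset in whole hours from longitude (15 degrees per hour).
# This is a solar-time approximation, not a real time zone lookup.
approx_utc_offset <- function(longitude) {
  round(longitude / 15)
}

# Minimum, median, and maximum longitude from the data summary
toy_longitudes <- c(-176.6, -95.1, -51.7)

approx_utc_offset(toy_longitudes)
```

A boundary-aware lookup (for example `tz_lookup_coords()` in the lutz package) would be more accurate, since real time zones follow borders rather than meridians.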

-lubridate source sheet

-lubridate cheat sheet

-tz database time zones

Code

# strip the trailing " +0000" offset from each datetime string
obs_clean$time_observed_at <- str_remove_all(obs_clean$time_observed_at, " \\+0000$")

# parse the remaining "YYYY-MM-DD HH:MM:SS" text as a datetime in UTC
obs_clean$time_observed_at <- ymd_hms(obs_clean$time_observed_at,
                                      tz = "UTC")

#preview
head(obs_clean)

# A tibble: 6 × 21
      id time_observed_at    user_login  quali…¹ num_i…² num_i…³ capti…⁴ place…⁵
   <dbl> <dttm>              <chr>       <chr>     <dbl>   <dbl> <lgl>   <chr>  
1  13597 2011-04-02 01:00:04 kueda       needs_…       0       1 FALSE   Mount …
2  53394 2003-02-17 20:05:00 victorious… casual        1       3 TRUE    Arizon…
3  56948 2012-03-10 22:00:26 kueda       needs_…       0       1 FALSE   Huckle…
4  59618 NA                  bob-dodge   needs_…       1       3 FALSE   jasper…
5 101087 NA                  rcurtis     needs_…       0       1 FALSE   Kent B…
6 105005 2012-07-21 19:51:00 kueda       needs_…       0       1 FALSE   Sagehe…
# … with 13 more variables: latitude <dbl>, longitude <dbl>,
#   place_town_name <chr>, place_state_name <chr>, place_country_name <chr>,
#   species_guess <chr>, scientific_name <chr>, common_name <chr>,
#   iconic_taxon_name <chr>, taxon_order_name <chr>, taxon_family_name <chr>,
#   taxon_genus_name <chr>, taxon_species_name <chr>, and abbreviated variable
#   names ¹quality_grade, ²num_identification_agreements,
#   ³num_identification_disagreements, ⁴captive_cultivated, ⁵place_guess

Identify Research Questions

What are some of the most disputed families and genera of plants?
Do these vary by location or year?
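As a minimal sketch of how the year question might be approached, the example below counts disputed observations per year and family on a toy tibble. The toy rows and the names `toy_obs`, `obs_year`, and `n_obs` are assumptions for illustration. Only the column names `time_observed_at` and `taxon_family_name` come from the data.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for obs_clean (values are illustrative only)
toy_obs <- tibble(
  time_observed_at = ymd_hms(c("2019-04-27 14:54:00",
                               "2019-04-27 15:04:00",
                               "2020-06-10 19:02:00"), tz = "UTC"),
  taxon_family_name = c("Asteraceae", "Asteraceae", "Fabaceae")
)

# Extract the year, then count observations per year and family
by_year <- toy_obs %>%
  mutate(obs_year = year(time_observed_at)) %>%
  count(obs_year, taxon_family_name, name = "n_obs")

by_year
```

Swapping `obs_year` for a place column such as `place_country_name` would give the same kind of count by location.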

Code

# rank all families by number of disputed observations
# (NA means the observation has no family-level identification)
ranked_families <- obs_clean %>%
  count(taxon_family_name) %>%
  arrange(desc(n)) %>%
  mutate(prop_families = round(n / sum(n), 3))


# keep the 11 most frequent rows (NA plus the top 10 families);
# prop_families here is each row's share of these 11 rows, not of all observations
obs_clean %>%
  count(taxon_family_name) %>%
  arrange(desc(n)) %>%
  slice(1:11) %>%
  mutate(prop_families = round(n / sum(n), 3))

# A tibble: 11 × 3
   taxon_family_name     n prop_families
   <chr>             <int>         <dbl>
 1 <NA>               3169         0.266
 2 Asteraceae         2668         0.224
 3 Fabaceae           1244         0.104
 4 Rosaceae           1236         0.104
 5 Fagaceae            669         0.056
 6 Pinaceae            562         0.047
 7 Poaceae             535         0.045
 8 Lamiaceae           491         0.041
 9 Cactaceae           470         0.039
10 Asparagaceae        464         0.039
11 Ericaceae           409         0.034

Code

# the top 10 rows included NA, so I sliced 11 rows instead
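An alternative to widening the slice, sketched here on toy data, is to drop the rows with no family before counting, so that the proportions describe only observations identified to family. The tibble `toy_obs` and its values are assumptions for illustration. The proportion is computed before slicing, so it is each family's share of all family-level observations rather than of the top rows only.

```r
library(dplyr)

# Toy stand-in for obs_clean$taxon_family_name (values are illustrative only)
toy_obs <- tibble(
  taxon_family_name = c("Asteraceae", "Asteraceae", "Asteraceae",
                        "Fabaceae", "Fabaceae", "Rosaceae", NA, NA)
)

top_families <- toy_obs %>%
  filter(!is.na(taxon_family_name)) %>%          # drop observations with no family
  count(taxon_family_name, sort = TRUE) %>%      # most frequent first
  mutate(prop_families = round(n / sum(n), 3)) %>%  # share of all non-missing rows
  slice_head(n = 2)                              # keep the top 2 for this toy example

top_families
```

On the real data, `slice_head(n = 10)` would return the ten most disputed families without the NA row.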