DACSS 601: Data Science Fundamentals - FALL 2022

Sarah McAlpine HW 2

hw2
sarahmcalpine
inaturalist data
lubridate
summarytools
time zones
Author

Sarah McAlpine

Published

October 11, 2022

Code
library(tidyverse)
library(lubridate)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE)

Select a Data Set

For this assignment, I chose a data set from iNaturalist.org: citizen-scientist observations of plant life, recorded with mobile apps. This particular data set is limited to observations made in North America whose identifications have the most disagreements. The iNaturalist site allows custom data queries, but since I am not familiar with every data field, I exported more than I needed. First I will read in the data and look at the column names and the values in each column to decide what to keep. At the start, I have 27 columns and 25,266 rows.

At the outset, possible research questions could be about patterns in the family, order, genus, and/or species that are difficult for the general public to identify, and possibly about which areas of the continent have the highest likelihood of identification disagreement. I would not, however, be able to compare which users are most likely to have disputed identifications, since my data doesn’t include undisputed identifications.

Code
#read in data
observations <- read_csv("_data/plant_observations.csv")
Rows: 25266 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): observed_on, time_observed_at, time_zone, user_login, created_at, ...
dbl  (7): id, user_id, num_identification_agreements, num_identification_dis...
lgl  (1): captive_cultivated

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
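The message above suggests specifying column types. As a minimal sketch of what that could look like, the block below reads a tiny made-up sample (three hypothetical rows, not the real file) with explicit `col_types`, then converts the month/day/year text in `observed_on` to a date with `lubridate::mdy()`. The names `sample_csv` and `obs_sample` are illustrative only.

```r
library(readr)
library(lubridate)

# Hypothetical three-row sample standing in for _data/plant_observations.csv
sample_csv <- I("id,observed_on,captive_cultivated
1,4/27/2019,FALSE
2,4/24/2020,TRUE
3,5/1/2021,FALSE")

# Supplying col_types declares each column's type and quiets the message
obs_sample <- read_csv(
  sample_csv,
  col_types = cols(
    id = col_double(),
    observed_on = col_character(),
    captive_cultivated = col_logical()
  )
)

# observed_on is month/day/year text, so mdy() turns it into a Date
obs_sample$observed_on <- mdy(obs_sample$observed_on)

obs_sample
```

Reading the column as text first and converting afterwards keeps the date format explicit, which may be easier to debug than letting the reader guess it.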
Code
# preview data and plan for cleaning
 colnames(observations)
 [1] "id"                               "observed_on"                     
 [3] "time_observed_at"                 "time_zone"                       
 [5] "user_id"                          "user_login"                      
 [7] "created_at"                       "updated_at"                      
 [9] "quality_grade"                    "num_identification_agreements"   
[11] "num_identification_disagreements" "captive_cultivated"              
[13] "place_guess"                      "latitude"                        
[15] "longitude"                        "place_town_name"                 
[17] "place_state_name"                 "place_country_name"              
[19] "species_guess"                    "scientific_name"                 
[21] "common_name"                      "iconic_taxon_name"               
[23] "taxon_id"                         "taxon_order_name"                
[25] "taxon_family_name"                "taxon_genus_name"                
[27] "taxon_species_name"              
Code
 head(observations)
# A tibble: 6 × 27
      id obser…¹ time_…² time_…³ user_id user_…⁴ creat…⁵ updat…⁶ quali…⁷ num_i…⁸
   <dbl> <chr>   <chr>   <chr>     <dbl> <chr>   <chr>   <chr>   <chr>     <dbl>
1  13597 4/1/20… 2011-0… Pacifi…       1 kueda   2011-0… 2022-0… needs_…       0
2  53394 2/17/2… 2003-0… Arizona    4881 victor… 2012-0… 2022-0… casual        1
3  56948 3/10/2… 2012-0… Pacifi…       1 kueda   2012-0… 2021-0… needs_…       0
4  59618 3/21/2… <NA>    Pacifi…     549 bob-do… 2012-0… 2020-0… needs_…       1
5 101087 7/11/2… <NA>    Easter…    4860 rcurtis 2012-0… 2021-0… needs_…       0
6 105005 7/21/2… 2012-0… Pacifi…       1 kueda   2012-0… 2016-0… needs_…       0
# … with 17 more variables: num_identification_disagreements <dbl>,
#   captive_cultivated <lgl>, place_guess <chr>, latitude <dbl>,
#   longitude <dbl>, place_town_name <chr>, place_state_name <chr>,
#   place_country_name <chr>, species_guess <chr>, scientific_name <chr>,
#   common_name <chr>, iconic_taxon_name <chr>, taxon_id <dbl>,
#   taxon_order_name <chr>, taxon_family_name <chr>, taxon_genus_name <chr>,
#   taxon_species_name <chr>, and abbreviated variable names ¹​observed_on, …
Code
 print(dfSummary(observations,
     varnumbers = FALSE,
     plain.ascii  = FALSE, 
     style = "grid",
     graph.magnif = 0.60, 
     valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary: observations
Dimensions: 25266 x 27
Duplicates: 0

Layout: Variable [type] | Missing, then Stats / Values | Freqs (% of Valid)

id [numeric] | Missing: 0 (0.0%)
  Mean (sd): 49218948 (34354028)
  min ≤ med ≤ max: 13597 ≤ 42531827 ≤ 138451899
  IQR (CV): 46488582 (0.7)
  25266 distinct values

observed_on [character] | Missing: 126 (0.5%)
  1. 4/27/2019 | 143 (0.6%)
  2. 4/24/2020 | 123 (0.5%)
  3. 4/26/2020 | 121 (0.5%)
  4. 4/25/2020 | 111 (0.4%)
  5. 4/28/2019 | 111 (0.4%)
  6. 4/27/2020 | 96 (0.4%)
  7. 4/29/2019 | 92 (0.4%)
  8. 4/26/2019 | 85 (0.3%)
  9. 5/1/2021 | 71 (0.3%)
  10. 5/2/2021 | 67 (0.3%)
  [ 3044 others ] | 24120 (95.9%)

time_observed_at [character] | Missing: 1148 (4.5%)
  1. 2020-06-10 19:02:00 +0000 | 18 (0.1%)
  2. 2016-12-05 17:03:00 +0000 | 6 (0.0%)
  3. 2018-07-13 18:00:00 +0000 | 6 (0.0%)
  4. 2016-06-17 23:29:00 +0000 | 5 (0.0%)
  5. 2017-06-11 03:54:00 +0000 | 4 (0.0%)
  6. 2020-09-03 08:02:00 +0000 | 4 (0.0%)
  7. 2022-02-06 20:43:00 +0000 | 4 (0.0%)
  8. 2022-04-30 21:51:00 +0000 | 4 (0.0%)
  9. 2019-04-27 14:54:00 +0000 | 3 (0.0%)
  10. 2019-04-27 15:04:00 +0000 | 3 (0.0%)
  [ 23946 others ] | 24061 (99.8%)

time_zone [character] | Missing: 3 (0.0%)
  1. Eastern Time (US & Canada | 8778 (34.7%)
  2. Pacific Time (US & Canada | 6250 (24.7%)
  3. UTC | 3443 (13.6%)
  4. Central Time (US & Canada | 2711 (10.7%)
  5. Mountain Time (US & Canad | 1397 (5.5%)
  6. Atlantic Time (Canada) | 419 (1.7%)
  7. Hawaii | 416 (1.6%)
  8. Mexico City | 378 (1.5%)
  9. Arizona | 218 (0.9%)
  10. Bogota | 140 (0.6%)
  [ 78 others ] | 1113 (4.4%)

user_id [numeric] | Missing: 0 (0.0%)
  Mean (sd): 1785599 (1550175)
  min ≤ med ≤ max: 1 ≤ 1487247 ≤ 6238464
  IQR (CV): 2270983 (0.9)
  15396 distinct values

user_login [character] | Missing: 0 (0.0%)
  1. finatic | 510 (2.0%)
  2. jaykeller | 437 (1.7%)
  3. danielmorton | 124 (0.5%)
  4. lianamay | 119 (0.5%)
  5. aaronbalam_tutor | 118 (0.5%)
  6. sandbankspp | 117 (0.5%)
  7. marymacaulay | 115 (0.5%)
  8. silversea_starsong | 95 (0.4%)
  9. frank324 | 92 (0.4%)
  10. simpylmare55 | 90 (0.4%)
  [ 15386 others ] | 23449 (92.8%)

created_at [character] | Missing: 0 (0.0%)
  1. 2015-08-05 00:52:19 +0000 | 3 (0.0%)
  2. 2017-08-09 00:29:31 +0000 | 3 (0.0%)
  3. 2018-06-21 03:20:31 +0000 | 3 (0.0%)
  4. 2019-04-29 18:07:20 +0000 | 3 (0.0%)
  5. 2019-04-29 18:26:33 +0000 | 3 (0.0%)
  6. 2019-04-29 18:26:38 +0000 | 3 (0.0%)
  7. 2020-03-11 22:07:15 +0000 | 3 (0.0%)
  8. 2020-08-14 11:21:20 +0000 | 3 (0.0%)
  9. 2013-03-25 15:51:51 +0000 | 2 (0.0%)
  10. 2014-08-01 04:36:50 +0000 | 2 (0.0%)
  [ 25165 others ] | 25238 (99.9%)

updated_at [character] | Missing: 0 (0.0%)
  1. 2022-08-21 16:55:57 +0000 | 78 (0.3%)
  2. 2021-08-26 05:12:47 +0000 | 46 (0.2%)
  3. 2022-01-04 18:30:50 +0000 | 28 (0.1%)
  4. 2022-09-27 02:03:07 +0000 | 20 (0.1%)
  5. 2021-09-14 04:57:16 +0000 | 19 (0.1%)
  6. 2021-05-24 15:24:02 +0000 | 18 (0.1%)
  7. 2022-05-26 17:15:15 +0000 | 17 (0.1%)
  8. 2022-09-13 20:37:50 +0000 | 11 (0.0%)
  9. 2021-10-16 01:04:15 +0000 | 10 (0.0%)
  10. 2022-05-30 14:02:12 +0000 | 9 (0.0%)
  [ 24741 others ] | 25010 (99.0%)

quality_grade [character] | Missing: 0 (0.0%)
  1. casual | 6408 (25.4%)
  2. needs_id | 18306 (72.5%)
  3. research | 552 (2.2%)

num_identification_agreements [numeric] | Missing: 0 (0.0%)
  Mean (sd): 0 (0.2)
  min ≤ med ≤ max: -1 ≤ 0 ≤ 3
  IQR (CV): 0 (8.6)
  -1 | 12 (0.0%)
  0 | 24855 (98.4%)
  1 | 348 (1.4%)
  2 | 47 (0.2%)
  3 | 4 (0.0%)

num_identification_disagreements [numeric] | Missing: 0 (0.0%)
  Mean (sd): 1.2 (0.6)
  min ≤ med ≤ max: 0 ≤ 1 ≤ 9
  IQR (CV): 0 (0.5)
  0 | 13 (0.1%)
  1 | 22445 (88.8%)
  2 | 1700 (6.7%)
  3 | 761 (3.0%)
  4 | 236 (0.9%)
  5 | 79 (0.3%)
  6 | 23 (0.1%)
  7 | 7 (0.0%)
  8 | 1 (0.0%)
  9 | 1 (0.0%)

captive_cultivated [logical] | Missing: 0 (0.0%)
  1. FALSE | 20227 (80.1%)
  2. TRUE | 5039 (19.9%)

place_guess [character] | Missing: 16 (0.1%)
  1. Texas, US | 134 (0.5%)
  2. United States | 132 (0.5%)
  3. California, US | 123 (0.5%)
  4. North Carolina, US | 122 (0.5%)
  5. Ohio, US | 95 (0.4%)
  6. Florida, US | 70 (0.3%)
  7. Los Angeles County, US-CA | 65 (0.3%)
  8. New York, US | 58 (0.2%)
  9. Denver, CO 80226, USA | 56 (0.2%)
  10. Chihuahua, Chih., México | 50 (0.2%)
  [ 17711 others ] | 24345 (96.4%)

latitude [numeric] | Missing: 0 (0.0%)
  Mean (sd): 36.7 (7.8)
  min ≤ med ≤ max: 7.4 ≤ 37.7 ≤ 71.7
  IQR (CV): 8.9 (0.2)
  24587 distinct values

longitude [numeric] | Missing: 0 (0.0%)
  Mean (sd): -96.3 (18.2)
  min ≤ med ≤ max: -176.6 ≤ -95.1 ≤ -51.7
  IQR (CV): 36.4 (-0.2)
  24595 distinct values

place_town_name [character] | Missing: 23021 (91.1%)
  1. New York City | 330 (14.7%)
  2. City of Austin | 205 (9.1%)
  3. Zona Metropolitana | 119 (5.3%)
  4. San Antonio | 92 (4.1%)
  5. Portland | 68 (3.0%)
  6. Boston | 65 (2.9%)
  7. Chicago | 58 (2.6%)
  8. Oakville, Ontario | 50 (2.2%)
  9. Pittsburgh | 48 (2.1%)
  10. Greater Chattanooga | 45 (2.0%)
  [ 257 others ] | 1165 (51.9%)

place_state_name [character] | Missing: 2761 (10.9%)
  1. California | 4877 (21.7%)
  2. Texas | 2133 (9.5%)
  3. Florida | 1169 (5.2%)
  4. New York | 860 (3.8%)
  5. North Carolina | 775 (3.4%)
  6. Ohio | 628 (2.8%)
  7. Virginia | 579 (2.6%)
  8. Arizona | 562 (2.5%)
  9. Pennsylvania | 557 (2.5%)
  10. Oregon | 537 (2.4%)
  [ 145 others ] | 9828 (43.7%)

place_country_name [character] | Missing: 69 (0.3%)
  1. United States | 20621 (81.8%)
  2. Canada | 2438 (9.7%)
  3. Mexico | 1715 (6.8%)
  4. Costa Rica | 119 (0.5%)
  5. Panama | 119 (0.5%)
  6. Honduras | 37 (0.1%)
  7. Guatemala | 20 (0.1%)
  8. Cuba | 18 (0.1%)
  9. Dominican Republic | 17 (0.1%)
  10. El Salvador | 16 (0.1%)
  [ 18 others ] | 77 (0.3%)

species_guess [character] | Missing: 5967 (23.6%)
  1. dicots | 1912 (9.9%)
  2. flowering plants | 389 (2.0%)
  3. plants | 222 (1.2%)
  4. vascular plants | 168 (0.9%)
  5. monocots | 121 (0.6%)
  6. grasses | 111 (0.6%)
  7. sunflowers, daisies, aste | 106 (0.5%)
  8. Magnolias, margaritas y p | 103 (0.5%)
  9. roses | 97 (0.5%)
  10. Asteroideae | 76 (0.4%)
  [ 7230 others ] | 15994 (82.9%)

scientific_name [character] | Missing: 0 (0.0%)
  1. Magnoliopsida | 1678 (6.6%)
  2. Angiospermae | 375 (1.5%)
  3. Plantae | 200 (0.8%)
  4. Tracheophyta | 153 (0.6%)
  5. Quercus | 149 (0.6%)
  6. Bryophyta | 123 (0.5%)
  7. Poaceae | 114 (0.5%)
  8. Asteraceae | 105 (0.4%)
  9. Liliopsida | 101 (0.4%)
  10. Rosa | 95 (0.4%)
  [ 7125 others ] | 22173 (87.8%)

common_name [character] | Missing: 2011 (8.0%)
  1. dicots | 1678 (7.2%)
  2. flowering plants | 375 (1.6%)
  3. plants | 200 (0.9%)
  4. vascular plants | 153 (0.7%)
  5. mosses | 123 (0.5%)
  6. grasses | 114 (0.5%)
  7. sunflowers, daisies, aste | 105 (0.5%)
  8. monocots | 101 (0.4%)
  9. roses | 95 (0.4%)
  10. broadleaf enchanter's nig | 72 (0.3%)
  [ 6141 others ] | 20239 (87.0%)

iconic_taxon_name [character] | Missing: 0 (0.0%)
  1. Plantae | 25266 (100.0%)

taxon_id [numeric] | Missing: 0 (0.0%)
  Mean (sd): 182411.7 (279303.7)
  min ≤ med ≤ max: 47121 ≤ 63329 ≤ 1419341
  IQR (CV): 103131 (1.5)
  7187 distinct values

taxon_order_name [character] | Missing: 2836 (11.2%)
  1. Asterales | 2758 (12.3%)
  2. Rosales | 1872 (8.3%)
  3. Lamiales | 1516 (6.8%)
  4. Caryophyllales | 1314 (5.9%)
  5. Fabales | 1259 (5.6%)
  6. Fagales | 1189 (5.3%)
  7. Poales | 1176 (5.2%)
  8. Asparagales | 1081 (4.8%)
  9. Pinales | 783 (3.5%)
  10. Ericales | 755 (3.4%)
  [ 93 others ] | 8727 (38.9%)

taxon_family_name [character] | Missing: 3169 (12.5%)
  1. Asteraceae | 2668 (12.1%)
  2. Fabaceae | 1244 (5.6%)
  3. Rosaceae | 1236 (5.6%)
  4. Fagaceae | 669 (3.0%)
  5. Pinaceae | 562 (2.5%)
  6. Poaceae | 535 (2.4%)
  7. Lamiaceae | 491 (2.2%)
  8. Cactaceae | 470 (2.1%)
  9. Asparagaceae | 464 (2.1%)
  10. Ericaceae | 409 (1.9%)
  [ 319 others ] | 13349 (60.4%)

taxon_genus_name [character] | Missing: 5092 (20.2%)
  1. Quercus | 606 (3.0%)
  2. Pinus | 279 (1.4%)
  3. Acer | 244 (1.2%)
  4. Rosa | 231 (1.1%)
  5. Rubus | 195 (1.0%)
  6. Prunus | 182 (0.9%)
  7. Lupinus | 176 (0.9%)
  8. Opuntia | 168 (0.8%)
  9. Ilex | 165 (0.8%)
  10. Ulmus | 160 (0.8%)
  [ 1954 others ] | 17768 (88.1%)

taxon_species_name [character] | Missing: 10907 (43.2%)
  1. Circaea canadensis | 72 (0.5%)
  2. Acer rubrum | 67 (0.5%)
  3. Amauropelta noveboracensi | 65 (0.5%)
  4. Pseudoziziphus parryi | 58 (0.4%)
  5. Viburnum cassinoides | 57 (0.4%)
  6. Nymphaea odorata | 53 (0.4%)
  7. Ulmus americana | 48 (0.3%)
  8. Malus domestica | 41 (0.3%)
  9. Lantana urticoides | 35 (0.2%)
  10. Nabalus alatus | 32 (0.2%)
  [ 5330 others ] | 13831 (96.3%)

Generated by summarytools 1.0.1 (R version 4.2.1), 2022-12-20

Code
#[1] "id"           DELETE              "observed_on"  1973 - today        "time_observed_at"       LUBRIDATE     
#[4] "time_zone"     LUBRIDATE          "user_id"      DELETE              "user_login"             15.3k    
#[7] "created_at"     LUBRIDATE         "updated_at"   LUBRIDATE           "quality_grade"          3 categories
#[10] "num_identification_agreements"    "num_identification_disagreements" "captive_cultivated"     T/F      
#[13] "place_guess"       MAP?           "latitude"               MAP?      "longitude"              MAP? 
#[16] "place_town_name"    MAP?          "place_state_name"      MAP?       "place_country_name"     MAP?     
#[19] "species_guess"                    "scientific_name"                  "common_name"                     
#[22] "iconic_taxon_name"                "taxon_id"              DELETE     "taxon_order_name"       GROUP    
#[25] "taxon_family_name"     GROUP      "taxon_genus_name"      GROUP      "taxon_species_name"  

#clean up time zones

Building a Clean Read-in

Code
#cleaning during read-in
obs_clean <- read_csv("_data/plant_observations.csv",
                      skip = 1,
                      col_names = c("id", 
                                   "delete",
                                   "time_observed_at",
                                   "delete",
                                   "delete",
                                   "user_login",
                                   "delete",
                                   "delete",
                                   "quality_grade",
                                   "num_identification_agreements",
                                   "num_identification_disagreements",
                                   "captive_cultivated",
                                   "place_guess", 
                                   "latitude",
                                   "longitude",
                                   "place_town_name",
                                   "place_state_name",
                                   "place_country_name", 
                                   "species_guess",
                                   "scientific_name",
                                   "common_name",
                                   "iconic_taxon_name",
                                   "delete",
                                   "taxon_order_name",
                                   "taxon_family_name",
                                   "taxon_genus_name",
                                   "taxon_species_name")) %>%
#exclude columns called delete
select(!starts_with("delete"))
New names:
Rows: 25266 Columns: 27
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(19): delete...2, time_observed_at, delete...4, user_login, delete...7, ... dbl
(7): id, delete...5, num_identification_agreements, num_identification_... lgl
(1): captive_cultivated
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `delete` -> `delete...2`
• `delete` -> `delete...4`
• `delete` -> `delete...5`
• `delete` -> `delete...7`
• `delete` -> `delete...8`
• `delete` -> `delete...23`
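The repeated `delete` names are what trigger the renaming messages above. One possible alternative, sketched here on a made-up two-row sample and assuming readr 2.0 or later (where `read_csv()` has a `col_select` argument), is to keep the file's own header and drop the unwanted columns by name. The sample rows and the names `sample_csv` and `obs_sample` are hypothetical; the six dropped names are the columns that the read-in above labels `delete`.

```r
library(readr)

# Hypothetical sample with a subset of the real file's column names
sample_csv <- I("id,observed_on,time_zone,user_id,user_login,created_at,updated_at,taxon_id,taxon_family_name
1,4/27/2019,UTC,10,user_a,2019-04-29 18:07:20 +0000,2022-08-21 16:55:57 +0000,47121,Asteraceae
2,4/24/2020,UTC,11,user_b,2020-03-11 22:07:15 +0000,2021-08-26 05:12:47 +0000,63329,Fabaceae")

# Negative selection: read everything except the six unwanted columns
obs_sample <- read_csv(
  sample_csv,
  col_select = -c(observed_on, time_zone, user_id,
                  created_at, updated_at, taxon_id),
  show_col_types = FALSE
)

colnames(obs_sample)
```

Because the header row is kept, there is no need for `skip = 1` or a hand-typed list of 27 names, and a change in the export's column order would not silently mislabel a column.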

Dealing with Time Zones

After many hours, I was able to use stringr's str_remove_all to remove the time zone characters from the datetime strings, and then use lubridate's ymd_hms to convert them to datetimes. Since I am not researching the times of day, I forced the UTC time zone for all of these data. The "+0000" suffix on each string already marks the value as UTC, so this does not shift any times.

One other idea: could I use latitude and longitude to look up time zones?

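  • lubridate source sheet: https://lubridate.tidyverse.org/articles/lubridate.html
  • lubridate cheat sheet: https://rawgit.com/rstudio/cheatsheets/main/lubridate.pdf
  • tz database time zones: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones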
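On the idea of deriving time zones from coordinates: offsets from UTC track longitude rather than latitude, at roughly one hour per 15 degrees. The block below is only a rough sketch of that arithmetic on made-up longitudes. Real time zones follow political boundaries and daylight saving rules, so a proper answer would need a lookup against tz database boundaries; the variable names here are hypothetical.

```r
# Hypothetical longitudes in degrees (negative values are west of Greenwich)
longitude <- c(-122.4, -95.1, -73.9)

# Nautical approximation: one hour of offset per 15 degrees of longitude
approx_utc_offset <- round(longitude / 15)

# Approximate local clock hour for an observation made at 19:00 UTC
utc_hour <- 19
approx_local_hour <- (utc_hour + approx_utc_offset) %% 24

data.frame(longitude, approx_utc_offset, approx_local_hour)
```

This kind of estimate could be good enough for coarse questions such as "morning or afternoon", but it should not be treated as the observer's actual local time.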

Code
#strip the "+0000" offset text from each datetime string
obs_clean$time_observed_at <- str_remove_all(obs_clean$time_observed_at, "\\W0000")

# convert the remaining text to a datetime in UTC
obs_clean$time_observed_at <- ymd_hms(obs_clean$time_observed_at, tz = "UTC")

#preview
head(obs_clean)
# A tibble: 6 × 21
      id time_observed_at    user_login  quali…¹ num_i…² num_i…³ capti…⁴ place…⁵
   <dbl> <dttm>              <chr>       <chr>     <dbl>   <dbl> <lgl>   <chr>  
1  13597 2011-04-02 01:00:04 kueda       needs_…       0       1 FALSE   Mount …
2  53394 2003-02-17 20:05:00 victorious… casual        1       3 TRUE    Arizon…
3  56948 2012-03-10 22:00:26 kueda       needs_…       0       1 FALSE   Huckle…
4  59618 NA                  bob-dodge   needs_…       1       3 FALSE   jasper…
5 101087 NA                  rcurtis     needs_…       0       1 FALSE   Kent B…
6 105005 2012-07-21 19:51:00 kueda       needs_…       0       1 FALSE   Sagehe…
# … with 13 more variables: latitude <dbl>, longitude <dbl>,
#   place_town_name <chr>, place_state_name <chr>, place_country_name <chr>,
#   species_guess <chr>, scientific_name <chr>, common_name <chr>,
#   iconic_taxon_name <chr>, taxon_order_name <chr>, taxon_family_name <chr>,
#   taxon_genus_name <chr>, taxon_species_name <chr>, and abbreviated variable
#   names ¹​quality_grade, ²​num_identification_agreements,
#   ³​num_identification_disagreements, ⁴​captive_cultivated, ⁵​place_guess

Identify Research Questions

  • What are some of the most disputed families and genera of plants?
  • Do these vary by location or year?
Code
# rank every family by its number of disputed observations;
# prop_families is each family's share of all rows
ranked_families <- obs_clean %>%
  count(taxon_family_name) %>%
  arrange(desc(n)) %>%
  mutate(prop_families = round(n / sum(n), 3))


# keep the top 11 rows (10 families plus NA);
# here prop_families is each row's share of these 11 rows only
obs_clean %>%
  count(taxon_family_name) %>%
  arrange(desc(n)) %>%
  slice(1:11) %>%
  mutate(prop_families = round(n / sum(n), 3))
# A tibble: 11 × 3
   taxon_family_name     n prop_families
   <chr>             <int>         <dbl>
 1 <NA>               3169         0.266
 2 Asteraceae         2668         0.224
 3 Fabaceae           1244         0.104
 4 Rosaceae           1236         0.104
 5 Fagaceae            669         0.056
 6 Pinaceae            562         0.047
 7 Poaceae             535         0.045
 8 Lamiaceae           491         0.041
 9 Cactaceae           470         0.039
10 Asparagaceae        464         0.039
11 Ericaceae           409         0.034
Code
#top 10 included NA, so I changed to 11
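Because the proportions above are computed after `slice()`, each value is a share of the 11 displayed rows rather than of the whole data set. A possible variation, sketched on a made-up six-row sample, drops the `NA` rows first and computes the share before slicing, so each proportion is relative to all observations that have a family-level identification. The names `obs_sample`, `top_families`, and `prop_identified` are hypothetical.

```r
library(dplyr)

# Hypothetical sample: four rows with a family, two without
obs_sample <- tibble(
  taxon_family_name = c("Asteraceae", "Asteraceae", "Fabaceae",
                        NA, "Rosaceae", NA)
)

top_families <- obs_sample %>%
  filter(!is.na(taxon_family_name)) %>%               # drop rows with no family
  count(taxon_family_name, sort = TRUE) %>%           # largest families first
  mutate(prop_identified = round(n / sum(n), 3)) %>%  # share of identified rows
  slice_head(n = 10)                                  # then keep the top 10

top_families
```

Computing the proportion before slicing is the design choice that matters here; whether to exclude the `NA` rows depends on whether missing family names are themselves of interest.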