HW2
Built-in Datasets: R ships with a number of datasets in its base packages, which makes them a convenient place to start exploring data. We will look at the iris dataset. It is stored as a data.frame, the default tabular data structure in R. We will convert it to a tibble with the as_tibble() function (from the tibble package, which is loaded with the tidyverse), since tibbles make some later data manipulations easier.
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
iris_tibble <- as_tibble(iris)
head(iris_tibble)
# A tibble: 6 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The tibble printout shows the data type of each column. We can examine the data with the head() function, which shows the first 6 rows by default. We can also use the print() function and specify how many rows to show and whether to show all the columns. This is one of the areas where a tibble differs from a plain data.frame. We will use the flights dataset from the nycflights13 package for this example.
# before print specifications: by default 10 rows and only as many columns as fit on the screen
print(nycflights13::flights)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ... with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
# after print specifications: 5 rows and all columns
print(nycflights13::flights, n = 5, width = Inf)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
sched_arr_time arr_delay carrier flight tailnum origin dest
<int> <dbl> <chr> <int> <chr> <chr> <chr>
1 819 11 UA 1545 N14228 EWR IAH
2 830 20 UA 1714 N24211 LGA IAH
3 850 33 AA 1141 N619AA JFK MIA
4 1022 -18 B6 725 N804JB JFK BQN
5 837 -25 DL 461 N668DN LGA ATL
air_time distance hour minute time_hour
<dbl> <dbl> <dbl> <dbl> <dttm>
1 227 1400 5 15 2013-01-01 05:00:00
2 227 1416 5 29 2013-01-01 05:00:00
3 160 1089 5 40 2013-01-01 05:00:00
4 183 1576 5 45 2013-01-01 05:00:00
5 116 762 6 0 2013-01-01 06:00:00
# ... with 336,771 more rows
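As a minimal sketch of the printing options described above, the same arguments can be tried on the much smaller iris tibble. This assumes the tibble package is installed; the name iris_tibble simply follows the earlier example.

```r
library(tibble)

# convert the built-in data.frame to a tibble
iris_tibble <- as_tibble(iris)

# a data.frame and a tibble report different classes
class(iris)
class(iris_tibble)

# head() returns the first 6 rows by default; n changes that
head(iris_tibble, n = 3)

# print() on a tibble accepts n (number of rows) and width;
# width = Inf asks for every column to be shown
print(iris_tibble, n = 3, width = Inf)
```

Because iris only has 5 columns, width = Inf makes no visible difference here; the effect is much clearer on a wide dataset such as flights.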
External Datasets: The built-in datasets are great for practice. However, most data is collected and stored outside of R and has to be read in. These datasets can come in many formats, from Excel sheets to binary files. We will use the tidyverse’s readr package to read in some datasets that are in .csv format.
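library(tidyverse) # loads readr, which provides read_csv()
library(here)
# intended as the absolute-path example; note that the path shown here is actually relative
animal_weights_absolute <- read_csv("../../_data/animal_weight.csv")
# using a relative path (resolved against the working directory) to read in the csv file
animal_weights_relative <- read_csv("../../_data/animal_weight.csv")
# using here() to build the path from the project root
# read_csv() already returns a tibble, so wrapping it in as_tibble() is not needed
animal_weights_here <- read_csv(here("_data", "animal_weight.csv"))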
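The difference between absolute and relative paths can be sketched without access to the original file. The example below writes a small, made-up CSV to a temporary location; the column names and values are hypothetical and are not the contents of animal_weight.csv.

```r
library(readr)

# hypothetical example data, written to a temporary file so the sketch is self-contained
example_path <- tempfile(fileext = ".csv")
writeLines(c("animal,weight", "cow,600", "pig,120"), example_path)

# tempfile() returns a full path that starts at the file system root,
# which is what makes it an absolute path
example_path

# read_csv() returns a tibble directly;
# col_types = "cd" declares one character and one double column
animal_example <- read_csv(example_path, col_types = "cd")
animal_example

# relative paths such as "../../_data/animal_weight.csv" are resolved
# against the current working directory
getwd()
```

An absolute path works only on the machine where it was written, which is why a relative path or here() is usually the more portable choice.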
Inline CSV File: We can also write a dataset directly in the code and store it as a tibble.
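The original post does not show code for this step, so the following is only a sketch of two common ways to do it: tribble() from the tibble package and read_csv() on literal text. The column names and values are made up for illustration, and wrapping the text in I() assumes a reasonably recent version of readr.

```r
library(tibble)
library(readr)

# tribble() lays the data out row by row; column names start with ~
animal_inline <- tribble(
  ~animal, ~weight,
  "cow",   600,
  "pig",   120
)
animal_inline

# read_csv() can also parse literal CSV text;
# I() marks the string as data rather than a file path
animal_inline_csv <- read_csv(
  I("animal,weight\ncow,600\npig,120\n"),
  col_types = "cd"
)
animal_inline_csv
```

tribble() tends to be easier to read for small hand-typed tables, while the read_csv() form is handy when the text was copied from an existing CSV file.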
Data Manipulation: A dataset can contain many different data types when it is read in. We will review some ways to manipulate these data types to get the data into a desirable form.
String Manipulation: Here are some strings that we will play with: “dog”, “mark”, “London”, and “tile”. Some other functions that are not covered here but are worth exploring are str_to_lower, str_to_upper, str_to_title, str_sort, str_order, str_wrap, and str_trim.
# some strings to work with
my_strings <- c("dog", "mark", "London", "tile")
my_strings[1]
[1] "dog"
# can print a vector of strings as lines
writeLines(my_strings)
dog
mark
London
tile
# can find the length of strings with str_length
str_length(my_strings)
[1] 3 4 6 4
# we can concatenate strings as well with str_c
str_c("This is a string called ", my_strings, "!")
[1] "This is a string called dog!"
[2] "This is a string called mark!"
[3] "This is a string called London!"
[4] "This is a string called tile!"
str_c("Hi", my_strings, "!", sep=", ")
[1] "Hi, dog, !" "Hi, mark, !" "Hi, London, !" "Hi, tile, !"
str_c(my_strings, collapse=", ")
[1] "dog, mark, London, tile"
# a subset of a string can be retrieved with str_sub
str_sub(my_strings[3], 1, 3)
[1] "Lon"
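The functions listed at the start of this section but not demonstrated can be tried on the same vector. This is a brief sketch using stringr; the padded string passed to str_trim and the sentence passed to str_wrap are made up for the example.

```r
library(stringr)

my_strings <- c("dog", "mark", "London", "tile")

# change case
str_to_upper(my_strings)  # "DOG" "MARK" "LONDON" "TILE"
str_to_lower(my_strings)
str_to_title(my_strings)  # "Dog" "Mark" "London" "Tile"

# str_sort() returns the sorted strings; str_order() returns the
# positions that would sort them
str_sort(my_strings)
str_order(my_strings)

# str_trim() removes leading and trailing whitespace
str_trim("  tile  ")  # "tile"

# str_wrap() inserts line breaks so each line fits within the given width
writeLines(str_wrap("This is a string called London!", width = 15))
```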
Regular Expressions: After some basic string manipulation, we can build on this understanding to search for and match patterns with regular expressions. The str_view() function lets us see which strings match our pattern. For instance, we can see how many types of berries are in the fruit dataset by searching for the pattern “berry”.
# we will use the fruit dataset for regex
head(fruit)
[1] "apple" "apricot" "avocado" "banana"
[5] "bell pepper" "bilberry"
# use str_view to see the regex matches
str_view(fruit, "berry", match=TRUE)
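str_view() displays the matches but does not count them. One way to get an actual number, sketched here with str_detect() and str_subset() from stringr, is to count how many strings match the pattern.

```r
library(stringr)

# str_detect() returns TRUE or FALSE for each string in the vector
has_berry <- str_detect(fruit, "berry")

# summing a logical vector counts the TRUE values
sum(has_berry)

# str_subset() keeps only the strings that match
berries <- str_subset(fruit, "berry")
berries

# the $ anchor restricts the match to strings that end in "berry"
str_subset(fruit, "berry$")
```

The unanchored pattern also matches names where "berry" is a separate word, so anchoring with $ is a design choice that depends on what should count as a berry.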
Factors: Factors are a way to represent categories within your data. For instance, we might have many different types of vehicles. With factors, we can create categories of “truck”, “sedan”, “suv”, etc. Another useful aspect of the factor data type is that it lets us sort the data in a specific order instead of just alphabetically. We will work with the gss_cat dataset since it has many factor columns.
gss_cat
# A tibble: 21,483 x 9
year marital age race rincome partyid relig denom tvhours
<int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
1 2000 Never married 26 White $8000 ~ Ind,ne~ Prot~ Sout~ 12
2 2000 Divorced 48 White $8000 ~ Not st~ Prot~ Bapt~ NA
3 2000 Widowed 67 White Not ap~ Indepe~ Prot~ No d~ 2
4 2000 Never married 39 White Not ap~ Ind,ne~ Orth~ Not ~ 4
5 2000 Divorced 25 White Not ap~ Not st~ None Not ~ 1
6 2000 Married 25 White $20000~ Strong~ Prot~ Sout~ NA
7 2000 Never married 36 White $25000~ Not st~ Chri~ Not ~ 3
8 2000 Divorced 44 White $7000 ~ Ind,ne~ Prot~ Luth~ NA
9 2000 Married 44 White $25000~ Not st~ Prot~ Other 0
10 2000 Married 47 White $25000~ Strong~ Prot~ Sout~ 3
# ... with 21,473 more rows
# one way to see the levels is to index in the column of interest
gss_cat %>% .$race %>% levels()
[1] "Other" "Black" "White"
[4] "Not applicable"
# another way is to use the count() function
# (levels with no observations, such as "Not applicable", are not shown)
gss_cat %>% count(race)
# A tibble: 3 x 2
race n
<fct> <int>
1 Other 1959
2 Black 3129
3 White 16395
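The vehicle categories mentioned at the start of this section give a small illustration of how factor levels control sort order. The vector below is made up for the example and uses only base R.

```r
# hypothetical vehicle data
vehicles <- c("sedan", "truck", "suv", "sedan", "truck")

# a character vector sorts alphabetically
sort(vehicles)

# a factor sorts by the order of its levels, which we choose here
vehicle_factor <- factor(vehicles, levels = c("truck", "suv", "sedan"))
levels(vehicle_factor)
sort(vehicle_factor)
```

The same level order is used by plotting functions such as geom_bar(), which is why reordering levels changes the order of bars in a chart.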
# we can use mutate and fct_recode to change factors
# before we change the factor codes
gss_cat %>% ggplot(aes(x = rincome)) + geom_bar() + coord_flip()
# after we change factor codes
gss_cat %>%
  mutate(rincome = fct_recode(rincome,
    "Less than $1000" = "Lt $1000")) %>%
  mutate(rincome = fct_recode(rincome,
    "NA" = "Not applicable",
    "NA" = "Don't know",
    "NA" = "No answer",
    "NA" = "Refused")) %>%
  ggplot(aes(x = rincome)) + geom_bar() + coord_flip()
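The recoding step can be checked on a small factor before applying it to the full dataset. The sketch below uses a made-up vector of responses and swaps in fct_collapse() from forcats, which takes all the old levels for one new level in a single argument instead of repeating the new name.

```r
library(forcats)

# hypothetical responses, using level names that appear in rincome
responses <- factor(c("Lt $1000", "Refused", "Don't know", "Lt $1000", "No answer"))

# rename one level, then merge several levels into a single "NA" level
renamed <- fct_recode(responses, "Less than $1000" = "Lt $1000")
recoded <- fct_collapse(renamed, "NA" = c("Refused", "Don't know", "No answer"))

levels(recoded)
table(recoded)
```

Note that "NA" here is an ordinary level label, not a true missing value, which matches what the recoding above does.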
# we can lump the small factor levels together after removing the "Not applicable", "Don't know", "No answer", "Refused" responses
gss_cat %>% count(rincome)
# A tibble: 16 x 2
rincome n
<fct> <int>
1 No answer 183
2 Don't know 267
3 Refused 975
4 $25000 or more 7363
5 $20000 - 24999 1283
6 $15000 - 19999 1048
7 $10000 - 14999 1168
8 $8000 to 9999 340
9 $7000 to 7999 188
10 $6000 to 6999 215
11 $5000 to 5999 227
12 $4000 to 4999 226
13 $3000 to 3999 276
14 $1000 to 2999 395
15 Lt $1000 286
16 Not applicable 7043
# first filter out the unwanted responses
gss_cat %>%
filter(!rincome %in%
c("Not applicable",
"Don't know",
"No answer",
"Refused")) %>%
count(rincome)
# A tibble: 12 x 2
rincome n
<fct> <int>
1 $25000 or more 7363
2 $20000 - 24999 1283
3 $15000 - 19999 1048
4 $10000 - 14999 1168
5 $8000 to 9999 340
6 $7000 to 7999 188
7 $6000 to 6999 215
8 $5000 to 5999 227
9 $4000 to 4999 226
10 $3000 to 3999 276
11 $1000 to 2999 395
12 Lt $1000 286
# lump groups outside the largest 8
gss_cat %>%
filter(!rincome %in%
c("Not applicable",
"Don't know",
"No answer",
"Refused")) %>%
mutate(rincome = fct_lump(rincome, n=8)) %>%
count(rincome)
# A tibble: 9 x 2
rincome n
<fct> <int>
1 $25000 or more 7363
2 $20000 - 24999 1283
3 $15000 - 19999 1048
4 $10000 - 14999 1168
5 $8000 to 9999 340
6 $3000 to 3999 276
7 $1000 to 2999 395
8 Lt $1000 286
9 Other 856
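The lumping behavior is easier to follow on a tiny factor. This sketch uses made-up values and fct_lump_n(), the more specific variant of fct_lump() available in forcats 0.5.0 and later (an assumption about the installed version).

```r
library(forcats)

# hypothetical factor: "high" appears 3 times, "mid" twice, the rest once
income <- factor(c("high", "high", "high", "mid", "mid", "low", "none"))

# keep the 2 most frequent levels and lump everything else into "Other"
lumped <- fct_lump_n(income, n = 2)

levels(lumped)
table(lumped)
```

In the gss_cat output above, the "Other" count of 856 is the sum of the four lumped brackets (188 + 215 + 227 + 226), which is a quick way to confirm that no rows were lost.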
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Beach (2021, Sept. 30). DACSS 601 Fall 2021: HW2. Retrieved from https://mrolfe.github.io/DACSS601Fall21/posts/2021-09-30-hw2-allyson-beach/
BibTeX citation
@misc{beach2021hw2,
  author = {Beach, Allyson},
  title = {DACSS 601 Fall 2021: HW2},
  url = {https://mrolfe.github.io/DACSS601Fall21/posts/2021-09-30-hw2-allyson-beach/},
  year = {2021}
}