DACSS 601 Fall 2021: HW2

Allyson Beach

Built-in Datasets: Some data sets ship with R or with R packages, which makes them a convenient place to start exploring data. We will look at the iris data set. It is stored as a data.frame, the default tabular data structure in base R. We will convert it to a tibble with the as_tibble() function, which makes some data manipulations easier later on.

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

library(tidyverse)  # provides as_tibble() and the other tidyverse functions used below
iris_tibble <- as_tibble(iris)
head(iris_tibble)

# A tibble: 6 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
6          5.4         3.9          1.7         0.4 setosa
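As a quick check of what the conversion changes, the sketch below compares the classes of the two objects and asks head() for a specific number of rows. It only assumes that the tibble package is installed.

```r
library(tibble)

iris_tibble <- as_tibble(iris)

# a tibble keeps the data.frame class and adds "tbl_df" and "tbl" on top
class(iris)
class(iris_tibble)

# head() takes an n argument when the default of 6 rows is not what we want
head(iris_tibble, n = 3)
```

Because a tibble is still a data.frame underneath, functions written for data frames generally continue to work on it.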

The tibble format shows the data type of each column. We can examine the data with the head() function, which shows the first 6 rows by default. We can also use the print() function and specify how many rows to show and how wide the output may be, so that all columns are printed. This is one of the areas where a tibble differs from a regular data.frame. We will use the flights data set from the nycflights13 package for this example.

# default print: 10 rows and only as many columns as fit on the screen
print(nycflights13::flights)

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# ... with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

# allow 5 rows and all columns to print 
print(nycflights13::flights, n=5, width=Inf)

# A tibble: 336,776 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  2013     1     1      517            515         2      830
2  2013     1     1      533            529         4      850
3  2013     1     1      542            540         2      923
4  2013     1     1      544            545        -1     1004
5  2013     1     1      554            600        -6      812
  sched_arr_time arr_delay carrier flight tailnum origin dest 
           <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>
1            819        11 UA        1545 N14228  EWR    IAH  
2            830        20 UA        1714 N24211  LGA    IAH  
3            850        33 AA        1141 N619AA  JFK    MIA  
4           1022       -18 B6         725 N804JB  JFK    BQN  
5            837       -25 DL         461 N668DN  LGA    ATL  
  air_time distance  hour minute time_hour          
     <dbl>    <dbl> <dbl>  <dbl> <dttm>             
1      227     1400     5     15 2013-01-01 05:00:00
2      227     1416     5     29 2013-01-01 05:00:00
3      160     1089     5     40 2013-01-01 05:00:00
4      183     1576     5     45 2013-01-01 05:00:00
5      116      762     6      0 2013-01-01 06:00:00
# ... with 336,771 more rows
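Another way to see every column of a wide tibble is glimpse(), which prints one line per column instead of one line per row. This is a small sketch that assumes the dplyr and nycflights13 packages are installed; it is an alternative to the print() call above, not a replacement for it.

```r
library(dplyr)
library(nycflights13)

# one line per column: name, type, and the first few values
glimpse(flights)

# the dimensions match the "336,776 x 19" header printed above
dim(flights)
```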

External Datasets: The built-in datasets are great for practice. However, most data is stored outside of R and has to be read in. These datasets can come in many formats, from Excel sheets to binary files. We will use the tidyverse's readr package to read in some datasets that are in .csv format.

library(here)
# read_csv() already returns a tibble, so wrapping it in as_tibble() is not needed
# using an absolute path: normalizePath() expands the relative path to an absolute one
animal_weights_absolute <- read_csv(normalizePath("../../_data/animal_weight.csv"))
# using a relative path to read in the csv file
animal_weights_relative <- read_csv("../../_data/animal_weight.csv")
# using here() to build the path from the project root
animal_weights_here <- read_csv(here("_data", "animal_weight.csv"))
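The paths above only work inside this project, so here is a self-contained sketch of the same round trip using a temporary file. The columns animal and weight_kg are made up for illustration and are not the contents of animal_weight.csv.

```r
library(readr)
library(tibble)

# hypothetical data standing in for a real csv file
animals <- tibble(animal = c("cat", "dog"), weight_kg = c(4.2, 18.5))

# write the data to a temporary csv file, then read it back in
tmp <- tempfile(fileext = ".csv")
write_csv(animals, tmp)

# col_types states the expected type of each column instead of letting readr guess
animals_back <- read_csv(
  tmp,
  col_types = cols(animal = col_character(), weight_kg = col_double())
)
animals_back
```

Stating the column types up front is optional, but it turns a silently wrong guess into a visible parsing problem.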

Inline CSV File: We can also pass CSV text directly to read_csv() to create a tibble.

# a small data set that shows a sample of the different data types you can write in a csv file
# the strings are left unquoted: readr only treats double quotes as quoting characters,
# so single quotes would become part of the values
my_dataset <- read_csv(
  "fox, 3, 12.45, TRUE, 2010-01-01\nhound, 5, 32.45, FALSE, 2010-01-01", 
  col_names = c("string", "integer", "decimals", "logical", "dates"))
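When readr guesses column types, whole numbers such as 3 and 5 are read as doubles rather than integers. If the integer type matters, the types can be given explicitly. The sketch below uses the compact col_types string, where c, i, d, l, and D stand for character, integer, double, logical, and date; the name typed_dataset is only used for this illustration.

```r
library(readr)

typed_dataset <- read_csv(
  "fox,3,12.45,TRUE,2010-01-01\nhound,5,32.45,FALSE,2010-01-01",
  col_names = c("string", "integer", "decimals", "logical", "dates"),
  col_types = "cidlD"  # one letter per column, in order
)

# check the resulting class of each column
sapply(typed_dataset, class)
```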

Data Manipulation: A dataset can contain many different data types. We will review some ways to manipulate these data types to get them into a desirable form.

String Manipulation: Here are some strings that we will play with: “dog”, “mark”, “London”, and “tile”. Some other functions that are not covered here but are worth exploring are str_to_lower, str_to_upper, str_to_title, str_sort, str_order, str_wrap, and str_trim.

# some strings to work with 
my_strings <- c("dog", "mark", "London", "tile")
my_strings[1]

[1] "dog"

# can print a vector of strings as lines 
writeLines(my_strings)

dog
mark
London
tile

# can find the length of strings with str_length
str_length(my_strings)

[1] 3 4 6 4

# we can concatenate strings as well with str_c
str_c("This is a string called ", my_strings, "!")

[1] "This is a string called dog!"   
[2] "This is a string called mark!"  
[3] "This is a string called London!"
[4] "This is a string called tile!"

str_c("Hi", my_strings, "!", sep=", ")

[1] "Hi, dog, !"    "Hi, mark, !"   "Hi, London, !" "Hi, tile, !"

str_c(my_strings, collapse=", ")

[1] "dog, mark, London, tile"

# a substring can be retrieved with str_sub
str_sub(my_strings[3], 1, 3)

[1] "Lon"
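A few of the functions named at the start of this section but not demonstrated can be tried on the same strings. This is only a brief sketch; the padded string passed to str_trim() is made up for the example.

```r
library(stringr)

my_strings <- c("dog", "mark", "London", "tile")

# change the case of every string
str_to_upper(my_strings)   # "DOG" "MARK" "LONDON" "TILE"
str_to_title(my_strings)   # "Dog" "Mark" "London" "Tile"

# sort the strings, or get the positions that would sort them
str_sort(my_strings)
str_order(my_strings)

# remove whitespace from both ends of a string
str_trim("  tile  ")       # "tile"
```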

Regular Expressions: After some basic manipulations of strings, we can use this understanding to search for and match patterns with regular expressions. The str_view() function lets us see which strings match our pattern. For instance, we can see how many types of berries are in the fruit dataset by searching for the pattern “berry”.

# we will use the fruit dataset for regex 
head(fruit)

[1] "apple"       "apricot"     "avocado"     "banana"     
[5] "bell pepper" "bilberry"

# use str_view to see the regex matches 
str_view(fruit, "berry", match=TRUE)
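str_view() shows the matches visually, but it does not count them. To get an actual number of berries, one option is str_detect() or str_subset(), sketched below on the same fruit vector that ships with stringr.

```r
library(stringr)

# keep only the fruit names that contain "berry"
berry_fruits <- str_subset(fruit, "berry")
berry_fruits

# two equivalent ways to count the matches
length(berry_fruits)
sum(str_detect(fruit, "berry"))

# "berry$" only matches names that end in "berry"
str_subset(fruit, "berry$")
```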

Factors: Factors are a way to represent categories within your data. For instance, we might have many different types of vehicles. With factors, we can create categories of “truck”, “sedan”, “suv”, etc. Another useful aspect of the factor data type is that it can be used to sort the dataset in a specific order instead of just alphabetically. We will work with the gss_cat dataset since it has many factor columns.

gss_cat

# A tibble: 21,483 x 9
    year marital         age race  rincome partyid relig denom tvhours
   <int> <fct>         <int> <fct> <fct>   <fct>   <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 ~ Ind,ne~ Prot~ Sout~      12
 2  2000 Divorced         48 White $8000 ~ Not st~ Prot~ Bapt~      NA
 3  2000 Widowed          67 White Not ap~ Indepe~ Prot~ No d~       2
 4  2000 Never married    39 White Not ap~ Ind,ne~ Orth~ Not ~       4
 5  2000 Divorced         25 White Not ap~ Not st~ None  Not ~       1
 6  2000 Married          25 White $20000~ Strong~ Prot~ Sout~      NA
 7  2000 Never married    36 White $25000~ Not st~ Chri~ Not ~       3
 8  2000 Divorced         44 White $7000 ~ Ind,ne~ Prot~ Luth~      NA
 9  2000 Married          44 White $25000~ Not st~ Prot~ Other       0
10  2000 Married          47 White $25000~ Strong~ Prot~ Sout~       3
# ... with 21,473 more rows
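Before looking at the gss_cat factors, the point about sorting can be illustrated with the vehicle categories mentioned above. This is a toy sketch in base R; the level order truck, sedan, suv is an arbitrary choice made for the example.

```r
vehicles <- c("suv", "truck", "sedan", "truck")

# a character vector sorts alphabetically
sort(vehicles)

# a factor sorts by the order of its levels instead
vehicle_factor <- factor(vehicles, levels = c("truck", "sedan", "suv"))
sort(vehicle_factor)
levels(vehicle_factor)
```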

# one way to see the levels is to pull out the column of interest
gss_cat %>% .$race %>% levels()

[1] "Other"          "Black"          "White"         
[4] "Not applicable"

# another way is to use the count() function
gss_cat %>% count(race)

# A tibble: 3 x 2
  race      n
  <fct> <int>
1 Other  1959
2 Black  3129
3 White 16395
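The two outputs above differ: levels() lists four levels, while count() shows only three rows. The "Not applicable" level exists in the factor but has no observations, and count() drops empty levels by default. A small sketch of how to see or remove the empty level, assuming dplyr and forcats are installed:

```r
library(dplyr)
library(forcats)

# .drop = FALSE keeps levels that have no observations, with n = 0
race_counts <- gss_cat %>% count(race, .drop = FALSE)
race_counts

# fct_drop() removes unused levels from the factor itself
gss_cat %>% pull(race) %>% fct_drop() %>% levels()
```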

# we can use mutate and fct_recode to change factors
# before we change the factor codes 
gss_cat %>% ggplot(aes(x = rincome)) + geom_bar() + coord_flip()

# after we change factor codes 
gss_cat %>% 
  mutate(rincome = fct_recode(rincome, 
                              "Less than $1000" = "Lt $1000")) %>%
  mutate(rincome = fct_recode(rincome, 
                              "NA" = "Not applicable", 
                              "NA" = "Don't know", 
                              "NA" = "No answer", 
                              "NA" = "Refused")) %>% 
  ggplot(aes(x = rincome)) + geom_bar() + coord_flip()
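When several old levels map to the same new level, as with the four non-responses above, fct_collapse() is a more compact alternative to repeating the new name in fct_recode(). The sketch below is one possible variant, not the approach used in this post; the label "Missing" is a made-up name chosen to avoid confusion with a real NA value.

```r
library(dplyr)
library(forcats)

collapsed <- gss_cat %>%
  mutate(rincome = fct_collapse(
    rincome,
    # one new level on the left, all the old levels it replaces on the right
    "Missing" = c("Not applicable", "Don't know", "No answer", "Refused")
  ))

# 16 original levels become 13: four are merged into one
collapsed %>% count(rincome)
```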

# we can lump the small factor levels together and remove the "Not applicable", "Don't know", "No answer", "Refused" responses
gss_cat %>% count(rincome)

# A tibble: 16 x 2
   rincome            n
   <fct>          <int>
 1 No answer        183
 2 Don't know       267
 3 Refused          975
 4 $25000 or more  7363
 5 $20000 - 24999  1283
 6 $15000 - 19999  1048
 7 $10000 - 14999  1168
 8 $8000 to 9999    340
 9 $7000 to 7999    188
10 $6000 to 6999    215
11 $5000 to 5999    227
12 $4000 to 4999    226
13 $3000 to 3999    276
14 $1000 to 2999    395
15 Lt $1000         286
16 Not applicable  7043

# first filter out the unwanted responses 
gss_cat %>% 
  filter(!rincome %in% 
           c("Not applicable", 
             "Don't know", 
             "No answer", 
             "Refused")) %>% 
  count(rincome)

# A tibble: 12 x 2
   rincome            n
   <fct>          <int>
 1 $25000 or more  7363
 2 $20000 - 24999  1283
 3 $15000 - 19999  1048
 4 $10000 - 14999  1168
 5 $8000 to 9999    340
 6 $7000 to 7999    188
 7 $6000 to 6999    215
 8 $5000 to 5999    227
 9 $4000 to 4999    226
10 $3000 to 3999    276
11 $1000 to 2999    395
12 Lt $1000         286

# lump groups outside the largest 8
gss_cat %>% 
  filter(!rincome %in% 
           c("Not applicable", 
             "Don't know", 
             "No answer", 
             "Refused")) %>% 
  mutate(rincome = fct_lump(rincome, n = 8)) %>% 
  count(rincome)

# A tibble: 9 x 2
  rincome            n
  <fct>          <int>
1 $25000 or more  7363
2 $20000 - 24999  1283
3 $15000 - 19999  1048
4 $10000 - 14999  1168
5 $8000 to 9999    340
6 $3000 to 3999    276
7 $1000 to 2999    395
8 Lt $1000         286
9 Other            856
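Newer versions of forcats (0.5.0 and later, which this sketch assumes) split fct_lump() into more specific functions such as fct_lump_n(), and the forcats documentation recommends those over the general fct_lump(). The same lumping as above could be written as follows; the name lumped is only used for this illustration.

```r
library(dplyr)
library(forcats)

lumped <- gss_cat %>%
  filter(!rincome %in% c("Not applicable", "Don't know", "No answer", "Refused")) %>%
  # keep the 8 most frequent levels and lump the rest into "Other"
  mutate(rincome = fct_lump_n(rincome, n = 8)) %>%
  count(rincome)

lumped
```

The result should match the table above: eight income levels plus an "Other" row that combines the four smallest groups.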
