Blog Post #4: Word Embedding & Dictionaries

Alexis Gamez

blogpost4

Word Embedding

Dictionary Analysis

Published

November 20, 2022

Setup

Code

knitr::opts_chunk$set(warning = FALSE, message = FALSE, echo = TRUE)
library(plyr)        # loaded before the tidyverse so the dplyr verbs mask the plyr ones
library(tidyverse)   # tidyverse 1.3.2 (attaches ggplot2, dplyr, readr, purrr and friends)
library(tidytext)
library(devtools)
library(knitr)
library(data.table)
library(rvest)
library(rtweet)
library(twitteR)
library(tm)
library(lubridate)
library(quanteda)    # package version 3.2.3
library(quanteda.textplots)
library(quanteda.textstats)
library(wordcloud)
library(text2vec)
# devtools::install_github("kbenoit/quanteda.dictionaries")
library(quanteda.dictionaries)
# remotes::install_github("quanteda/quanteda.sentiment")
library(quanteda.sentiment)

Data Source

As with previous blog posts, I will continue to use the corpus I’ve built by pulling tweets from the social media platform Twitter. All tweets were pulled for their relevance to the key phrases “eating bugs” and “eating insects”. The CSV file we will eventually read in is a compilation of such tweets extracted between November 3rd and November 13th.

Goals

For this post, I’m going to explore the use of word embedding and dictionary methods within my corpus. I will exclusively be using the techniques found in tutorials 7 & 8. Whether each technique is actually useful to our analysis will be determined by our experimentation.

Word Embeddings

Pre-Analysis

I start off by reading in my corpus as the bug_tweets object. Then, we’ll work to tokenize and vectorize the corpus. We’ll be using the text2vec package heavily in this section.

Code

# First we're going to read in our existing corpus, calling to the csv file we created in blog post #2
bug_tweets <- read.csv("eating_bugs_tweets_11_13_22.csv")
head(bug_tweets, 10)
dim(bug_tweets)

Tokenizing & Vectorizing

Now that we’ve read in our bug_tweets object, we’ll be using word_tokenizer to tokenize our documents into a new object, bug_tokens.

Code

# Tokenizing the tweet text, which sits in column `x` of the data frame
bug_tokens <- word_tokenizer(bug_tweets$x)
head(bug_tokens, 5)

With the token object created, we can move to creating an iterator object and begin building the vocabulary we’re going to be using in this section.

Code

# Create an iterator
it <- itoken(bug_tokens, progressbar = FALSE)

# Then we're going to build the vocabulary
vocab <- create_vocabulary(it)

# Calling the vocab object and its dimensions
vocab
dim(vocab)

With the vocab object created, we can now prune it and build a vectorizer from it. Pruning trims down our object by removing words that aren’t mentioned at least a certain number of times. In this case, that threshold is a minimum of 5: all tokens not mentioned at least 5 times will be dropped. Vectorizing does not coerce the vocabulary into a vector; vocab_vectorizer() returns a function that maps tokens onto the pruned vocabulary, which we need in order to build the co-occurrence matrix in the next step.

Code

# Now we're going to prune the vocabulary, keeping terms that occur at least 5 times
vocab <- prune_vocabulary(vocab, term_count_min = 5)

# Checking the dimensions of the vocab list again shows how much we were able to cut down the original list
dim(vocab)

# Now we're going to build a vectorizer that maps tokens onto the pruned vocabulary
vectorizer <- vocab_vectorizer(vocab)
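To make the pruning step a little more concrete, here is a minimal base R sketch of the same idea on made-up tokens. It is not how text2vec implements prune_vocabulary(); the toy_tokens data and the min_count threshold of 2 are assumptions chosen to keep the output small.

```r
# Toy tokens standing in for tokenized tweets (made-up data)
toy_tokens <- list(
  c("eating", "bugs", "is", "gross"),
  c("eating", "insects", "is", "fine"),
  c("bugs", "are", "protein"),
  c("eating", "bugs", "again")
)

# Count how often each term appears across all documents
term_counts <- table(unlist(toy_tokens))

# Keep only the terms that meet the minimum count (2 here, 5 in the post)
min_count <- 2
pruned_terms <- names(term_counts)[term_counts >= min_count]
pruned_terms
```

Only "bugs", "eating" and "is" survive, which also shows why pruning on its own keeps frequent filler words like "is".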

Finally, we’re going to create a term co-occurrence matrix (TCM). We’re going to stick to a skip-gram window of 5, meaning each word is paired with the words up to five positions away from it, considering we don’t have a massive corpus.

Code

# Creating a term co-occurrence matrix
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
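For intuition about what a co-occurrence window does, here is a rough base R sketch that counts word pairs within a window on one made-up sentence. It is only an illustration: the tokens and the window of 2 are assumptions, it uses raw counts, and I believe create_tcm() additionally down-weights pairs that sit further apart, which this sketch ignores.

```r
# One toy tokenized document (made-up data)
tokens <- c("people", "are", "eating", "bugs", "now")
window <- 2  # the post uses 5; 2 keeps the toy output small

# Square matrix with one row and one column per unique term
terms <- unique(tokens)
tcm_toy <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))

# For each position, count every later token that falls inside the window
for (i in seq_along(tokens)) {
  last <- min(i + window, length(tokens))
  if (i < last) {
    for (j in (i + 1):last) {
      tcm_toy[tokens[i], tokens[j]] <- tcm_toy[tokens[i], tokens[j]] + 1
    }
  }
}
tcm_toy
```

"people" co-occurs with "are" and "eating" but not with "bugs", because "bugs" sits three positions away and falls outside the window.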

Fitting a GloVe Model

We have a term co-occurrence matrix! With our TCM created, let’s move on to creating a GloVe model and fitting it to the matrix.

Code

# Creating GloVe model
bug_glove <- GlobalVectors$new(rank = 50, x_max = 10)
bug_glove

<GloVe>
  Public:
    bias_i: NULL
    bias_j: NULL
    clone: function (deep = FALSE) 
    components: NULL
    fit_transform: function (x, n_iter = 10L, convergence_tol = -1, n_threads = getOption("rsparse_omp_threads", 
    get_history: function () 
    initialize: function (rank, x_max, learning_rate = 0.15, alpha = 0.75, lambda = 0, 
    shuffle: FALSE
  Private:
    alpha: 0.75
    b_i: NULL
    b_j: NULL
    cost_history: 
    fitted: FALSE
    glove_fitter: NULL
    initial: NULL
    lambda: 0
    learning_rate: 0.15
    rank: 50
    w_i: NULL
    w_j: NULL
    x_max: 10

Code

# Fitting the model to our term co-occurrence matrix
bug_main <- bug_glove$fit_transform(tcm, n_iter = 10,
                                    convergence_tol = 0.01,
                                    n_threads = 8)
dim(bug_main)

We’ve now created our target word vectors. As covered in tutorial 7, GloVe learns two vectors for every word in the vocabulary: one for when the word is the target of interest and one for when it appears in the context of another word. fit_transform() returns the target vectors, while the context vectors are stored in the model’s components field. We’ll pull out the context vectors now.

Code

# Since we've created the `main` target vectors, we're going to pull out the context vectors now
bug_context <- bug_glove$components
dim(bug_context)

With the target and context vectors made, we can create a word vector matrix by taking the sum of both. The context matrix is transposed first so that its rows line up with the words in the target matrix.

Code

# Creating the word vector matrix
bug_vectors <- bug_main + t(bug_context)
bug_vectors
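The transpose in that sum is easy to gloss over, so here is a small base R sketch with made-up numbers. The matrices main_toy and context_toy are hypothetical stand-ins, assuming (as in the code above) that the target matrix is words by dimensions and the context matrix is dimensions by words.

```r
# Made-up numbers: 3 words, 2 embedding dimensions
words <- c("eating", "bugs", "cricket")

# Target vectors: one row per word, one column per dimension
main_toy <- matrix(c(0.1, 0.2, 0.3,
                     0.4, 0.5, 0.6),
                   nrow = 3, dimnames = list(words, NULL))

# Context vectors: one row per dimension, one column per word
context_toy <- matrix(c(1, 2,
                        3, 4,
                        5, 6),
                      nrow = 2, dimnames = list(NULL, words))

# Transposing the context matrix lines it up with the target matrix before adding
vectors_toy <- main_toy + t(context_toy)
vectors_toy
```

Without t(), R would stop with a non-conformable arrays error, since a 3 x 2 matrix cannot be added to a 2 x 3 one.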

Now that the word vector matrix has been created, we’re free to begin our analysis using Cosine Similarity.

Cosine Similarity

Cosine similarity measures how closely two word vectors point in the same direction: values near 1 mean the two words appear in similar contexts, while values near 0 mean they have little in common. The functions below are fed our word vector matrix and a token of our choosing to find which other tokens in our corpus are most similar to it. I decided to start off by testing the word “cricket” as it seemed the most direct (aside from “eating”, “bugs” and “insects”) while maintaining an air of objectivity.
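As a quick illustration of the measure itself, here is a minimal base R sketch of cosine similarity on made-up three-dimensional vectors. The cosine_sim() helper and the toy vectors are my own assumptions for illustration; the post itself relies on sim2() from text2vec.

```r
# Cosine similarity: dot product divided by the product of the vector lengths
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Made-up 3-dimensional "word vectors"
cricket_toy <- c(1, 2, 0)
locust_toy  <- c(2, 4, 0)   # same direction as cricket_toy, twice as long
salad_toy   <- c(0, 0, 3)   # perpendicular to cricket_toy

cosine_sim(cricket_toy, locust_toy)  # approximately 1: same direction
cosine_sim(cricket_toy, salad_toy)   # 0: nothing in common
```

Because the measure only looks at direction, the longer locust_toy vector scores the same as a vector identical to cricket_toy would.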

Code

# We're going to test out Cosine Similarity now by running the following functions.
# Row names are case-sensitive, so the token must match the vocabulary exactly.
cricket <- bug_vectors["Cricket", , drop = FALSE]
cricket_cos_sim <- sim2(x = bug_vectors, y = cricket,
                        method = "cosine", norm = "l2")
cricket_cos_sim

We can see some interesting results here, mostly that the words most similar to “cricket” (other than the word itself), according to cosine similarity, are an assortment of seemingly negative ones. I say seemingly because, without proper context, there’s no way of definitively understanding the intended sentiment behind them. With that said, we can still take some educated guesses. In this case, we see words that read as negative, such as “nope”, “crap”, “mandatory”, “nobody”, and “bother”, near the top. Let’s run a couple more tests and see what else pops up.

Code

# Each block builds a combined vector by adding and subtracting word vectors,
# then compares it against every word vector in the matrix
right_wing <- bug_vectors["POTUS", , drop = FALSE] -
     bug_vectors["liberals", , drop = FALSE] +
     bug_vectors["Conservative", , drop = FALSE]
right_wing_cos_sim <- sim2(x = bug_vectors, y = right_wing, method = "cosine", norm = "l2")
right_wing_cos_sim

conservative <- bug_vectors["POTUS", , drop = FALSE] -
     bug_vectors["JoeBiden", , drop = FALSE] +
     bug_vectors["Trump", , drop = FALSE]
conservative_cos_sim <- sim2(x = bug_vectors, y = conservative, method = "cosine", norm = "l2")
conservative_cos_sim

liberal <- bug_vectors["POTUS", , drop = FALSE] -
     bug_vectors["Trump", , drop = FALSE] +
     bug_vectors["JoeBiden", , drop = FALSE]
liberal_cos_sim <- sim2(x = bug_vectors, y = liberal, method = "cosine", norm = "l2")
liberal_cos_sim

eating_bugs <- bug_vectors["eating", , drop = FALSE] -
     bug_vectors["meat", , drop = FALSE] +
     bug_vectors["insects", , drop = FALSE] +
     bug_vectors["bugs", , drop = FALSE]
eating_bugs_cos_sim <- sim2(x = bug_vectors, y = eating_bugs, method = "cosine", norm = "l2")
eating_bugs_cos_sim

There are some interesting results here, primarily among the cosine similarities that are politically oriented. In the results for the last eating_bugs chunk, however, many stop words crowd the top of the list. Pruning only removes rare terms, so very frequent stop words were never going to be eliminated by it. Either that, or the code we wrote needs a little work. Either way, we can see some sentiment behind the previous key words, so let’s see if we can rework the last bit of code before we move on from cosine similarities.

Code

eating_insects <- bug_vectors["eating", , drop = FALSE] -
     bug_vectors["meat", , drop = FALSE] +
     bug_vectors["insects", , drop = FALSE]

# Compare every word vector against the eating_insects vector
eating_insects_cos_sim <- sim2(x = bug_vectors, y = eating_insects, method = "cosine", norm = "l2")

eating_bugs <- bug_vectors["eating", , drop = FALSE] -
     bug_vectors["meat", , drop = FALSE] +
     bug_vectors["bugs", , drop = FALSE]

# Compare every word vector against the eating_bugs vector
eating_bugs_cos_sim <- sim2(x = bug_vectors, y = eating_bugs, method = "cosine", norm = "l2")

We still see the same skewing of results, even after simplifying our code. What we’ll have to decide from here is whether it’s worth returning to the pre-processing stage and filtering stop words out of our corpus, or whether we should include word embedding in our analysis at all. For now, we’ll move on from cosine similarity tests.
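If we did go back to pre-processing, filtering stop words out of the token lists could look something like this base R sketch. The tokens and the tiny toy_stopwords list are made up for illustration; a real run would more likely lean on a full list such as the one returned by quanteda's stopwords("en").

```r
# Made-up tokenized tweets
toy_tokens <- list(
  c("I", "am", "not", "eating", "the", "bugs"),
  c("the", "insects", "are", "a", "protein", "source")
)

# A tiny hand-written stop word list (an assumption; a real list would be much longer)
toy_stopwords <- c("i", "am", "not", "the", "are", "a")

# Drop any token whose lowercase form is in the stop word list
filtered_tokens <- lapply(toy_tokens, function(doc) {
  doc[!(tolower(doc) %in% toy_stopwords)]
})
filtered_tokens
```

One thing to weigh before doing this for sentiment work: dropping a word like "not" flips the meaning of the first toy tweet.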

Dictionaries

Next we’re going to make use of the quanteda.dictionaries package. We’ll start by reading in our data set again to reset its properties and converting it into a corpus.

Code

bug_tweets <- read.csv("eating_bugs_tweets_11_13_22.csv")
bug_corpus <- corpus(bug_tweets$x)

Dictionary Analysis

With our corpus loaded in, we can begin taking a stab at different dictionary analysis methods. We’re going to start with the NRC dictionary, which refers to the NRC Emotion Lexicon, a lexicon that associates words with certain emotions. Using it, liwcalike() calculates, for each document in our corpus, the percentage of its words that reflect each emotional characteristic.
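To show what a per-document percentage means in practice, here is a small base R sketch using a made-up two-category dictionary. Neither toy_dictionary nor the example tweet comes from the real NRC lexicon or my corpus; they are assumptions for illustration only.

```r
# A made-up two-category dictionary (not the real NRC lexicon)
toy_dictionary <- list(
  positive = c("love", "tasty", "good"),
  negative = c("gross", "nope", "bad")
)

# One toy tweet, already tokenized and lowercased
tweet <- c("eating", "bugs", "is", "gross", "but",
           "crickets", "are", "tasty", "and", "good")

# Share of the tweet's tokens that match each category, as a percentage
category_pct <- sapply(toy_dictionary, function(words) {
  100 * sum(tweet %in% words) / length(tweet)
})
category_pct
```

Two of the ten tokens are in the positive list and one is in the negative list, so the tweet scores 20 for positive and 10 for negative.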

Code

bug_sentiment_nrc <- liwcalike(bug_corpus, data_dictionary_NRC)

# The function below provides us with the column names within our new `bug_sentiment_nrc` object
names(bug_sentiment_nrc)

ggplot(bug_sentiment_nrc) +
     geom_histogram(aes(x = positive), binwidth = 2) +
     theme_bw()

We see some interesting results here after plotting our data, specifically a lack of positive documents within our corpus. According to our analysis, a majority of the documents appear neutral, sitting at a “positive” score of 0. First, let’s take a look at some positive texts to see what we’re dealing with.

Code

bug_corpus[which(bug_sentiment_nrc$positive > 15)]

Something we can immediately see is that, while there are some positive sentiments within our eating bugs corpus, the dictionary can’t seem to pick up on all the nuances of Twitter-based dialect. Let’s look at this selection more thoroughly.

Code

table(bug_corpus[which(bug_sentiment_nrc$positive > 15)])

Even at a glance, I think it’s alright to assume that a majority of these “positive” tweets contain heavy sarcasm and satirical language. I think this is something we can remedy through the use of different dictionaries. For now, we’re going to continue with the NRC dictionary and take a look at the most negative documents.

Code

ggplot(bug_sentiment_nrc) +
     geom_histogram(aes(x = negative), binwidth = 2) +
     theme_bw()

Right off the bat, we see that there is a wider spread of negative values in our plot. While the positive results were concentrated toward the lower values of positivity, the negative scores are spread across a broader range. I also noticed that, for some reason, more documents score 0 in the negative plot than in the positive one. Let’s take a deeper look into the negative documents as we did with the positive ones.

Code

table(bug_corpus[which(bug_sentiment_nrc$negative > 15)])

Noticeably, I can see that a majority of the documents presented from this function are indeed negative! There are still some language nuances that NRC can’t seem to pick up on, but it’s interesting to see the quantity and intensity of the negative documents within our corpus.

Having isolated both positive and negative results, let’s try analyzing our corpus by incorporating both sides of the proverbial coin.

Code

# Polarity here is simply the positive score minus the negative score
bug_sentiment_nrc$polarity <- bug_sentiment_nrc$positive - bug_sentiment_nrc$negative

ggplot(bug_sentiment_nrc) +
     geom_histogram(aes(polarity), binwidth = 2) +
     theme_bw()

It seems that our previous observations are still consistent here. There isn’t much else to take away other than the fact that there seems to be a higher concentration of positive documents than negative, which is unexpected. I suspect that the NRC dictionary we’re using is heavily skewing the results. The fact that there are so many neutral valued documents further fortifies my suspicions.

Dictionaries with DFMs

Next, we’re going to use dictionary analysis methods that work on document-feature matrices (DFMs) as opposed to the corpus.

Code

# Here we are coercing our corpus into a dfm without using the NRC dictionary.
bug_dfm <- tokens(bug_corpus,
                  remove_punct = TRUE,
                  remove_symbols = TRUE,
                  remove_numbers = TRUE,
                  remove_url = TRUE,
                  split_hyphens = FALSE,
                  include_docvars = TRUE) %>%
  tokens_tolower() %>%
  dfm()

# Now we'll apply the NRC dictionary to that same dfm
bug_dfm_nrc <- dfm_lookup(bug_dfm, data_dictionary_NRC)

head(bug_dfm_nrc, 10)

# These functions can be run to provide some more details behind our new bug_dfm_nrc object
# dim(bug_dfm_nrc)
# class(bug_dfm_nrc)

Compared to the standard dfm (without dictionaries), we can see there’s a bit more diversity in terms of polarity measures. Instead of counts for each individual token, we get counts of tokens grouped into different sentiment categories, which may serve us some good. Next, we’re going to convert the dictionary dfm into a data frame. From there, we can create a polarity measure and attempt to visualize the data similarly to how we did prior to converting to a dfm.

Code

bug_df_nrc <- convert(bug_dfm_nrc, to = "data.frame")

# Again, these provide the sentiment categories to which we'll be arranging our data
names(bug_df_nrc)

# Polarity: (positive - negative) / (positive + negative), set to 0 when a document has no matches
bug_df_nrc$polarity <- (bug_df_nrc$positive - bug_df_nrc$negative)/(bug_df_nrc$positive + bug_df_nrc$negative)
bug_df_nrc$polarity[(bug_df_nrc$positive + bug_df_nrc$negative) == 0] <- 0

ggplot(bug_df_nrc) +
  geom_histogram(aes(x=polarity), binwidth = .25) +
  theme_bw()

# This function provides us with the most positive tweets, ranked at a value of 1
writeLines(head(bug_corpus[which(bug_df_nrc$polarity == 1)]))

Interestingly, we receive very similar results to those we received when using dictionaries without coercing to a dfm. That isn’t surprising, but judging by the output of our writeLines function, our approach is still struggling to accurately define what is positive and what is negative.

Using Different Dictionaries

Seeing as the NRC dictionary didn’t exactly give us what we wanted to see, we’re going to test drive a few other options we have. We’ll start with the General Inquirer dictionary.

Code

# Here we're re-converting our corpus to a dfm using the general inquirer dictionary
bug_dfm_geninq <- bug_dfm %>%
  dfm_lookup(data_dictionary_geninqposneg)

head(bug_dfm_geninq, 6)

We can see here that the General Inquirer dictionary, as opposed to NRC, splits sentiment into only 2 categories: positive and negative. On the surface, it already looks like we’re going to be getting a majority positive. Let’s continue.

Code

# Create polarity measure for `geninq`
bug_df_geninq <- convert(bug_dfm_geninq, to = "data.frame")
bug_df_geninq$polarity <- (bug_df_geninq$positive - bug_df_geninq$negative)/(bug_df_geninq$positive + bug_df_geninq$negative)
bug_df_geninq$polarity[which((bug_df_geninq$positive + bug_df_geninq$negative) == 0)] <- 0

head(bug_df_geninq)

Here we can see a bit of the logic behind the polarity scaling. We can also see that tokens of opposite polarities nullify each other, making it so that the documents are rated as a neutral 0. Interesting, but I’m not sure if it’s all that useful toward our analysis.
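The polarity scaling can be seen on a handful of made-up counts. This base R sketch uses a hypothetical counts data frame and mirrors the formula from the code above, including the rule that documents with no matches get a 0.

```r
# Made-up positive and negative match counts for four documents
counts <- data.frame(
  positive = c(3, 0, 2, 0),
  negative = c(1, 2, 2, 0)
)

total <- counts$positive + counts$negative

# (positive - negative) / (positive + negative), with 0 when nothing matched
counts$polarity <- ifelse(total == 0, 0,
                          (counts$positive - counts$negative) / total)
counts
```

The third document has two positive and two negative matches, so it lands on 0, the same score as the fourth document that matched nothing at all. That is the nullifying effect described above, and it is worth keeping in mind when reading the spike at 0 in the histograms.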

Code

# Let's create unique names for each data frame
colnames(bug_df_nrc) <- paste("nrc", colnames(bug_df_nrc), sep = "_")
colnames(bug_df_geninq) <- paste("geninq", colnames(bug_df_geninq), sep = "_")

# Now let's compare our estimates
sent_df <- merge(bug_df_nrc, bug_df_geninq, by.x = "nrc_doc_id", by.y = "geninq_doc_id")
head(sent_df)

cor(sent_df$nrc_polarity, sent_df$geninq_polarity)

With the functions used above, we’ve now calculated the correlation between the polarity scores we received from the General Inquirer and NRC dictionaries! Now, let’s try to plot it.

Code

#  Now we'll plot them out
ggplot(sent_df, mapping = aes(x = nrc_polarity,
                              y = geninq_polarity)) +
  geom_point(alpha = 0.1) +
  geom_smooth() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_bw()

While there was some correlation between both sets of scores, we can see from our visual that there is a clear distinction between them as well. The NRC dictionary seems to rank polarity in an extremely linear fashion while the General Inquirer rankings are much more fluid and varying. The differences between the two show us that the choice is going to come down to what better serves our analysis. What other options could we benefit from? Next, we’re going to experiment a bit with applying dictionaries within contexts.

Dictionaries within Contexts

Applying a dictionary within contexts means we only score the words that appear near the key words we care about. We start by isolating the tokens we wish to use as context: our target words and phrases, plus a window of tokens on either side of each match.
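Here is a rough base R sketch of what keeping a window around a target word looks like. The tweet, the targets and the window of 2 are made up for illustration, and the sketch only handles single-word targets, whereas tokens_keep() with phrase() also matches multi-word phrases.

```r
# One made-up tokenized tweet
tokens <- c("my", "cousin", "said", "eating", "bugs",
            "is", "the", "future", "of", "food")
targets <- c("bugs", "insects")
window <- 2  # the post uses 40; 2 keeps the toy output readable

# Positions of the target words
hits <- which(tokens %in% targets)

# Every position within `window` tokens of a hit, clipped to the document bounds
keep <- sort(unique(unlist(lapply(hits, function(h) {
  max(1, h - window):min(length(tokens), h + window)
}))))

tokens[keep]
```

Only "said", "eating", "bugs", "is" and "the" are kept. With a window of 40, most tweets are short enough that a single match keeps the entire tweet, so the window mainly acts as a filter for tweets that mention a target at all.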

Code

# tokenize corpus
bug_tokens <- tokens(bug_corpus, remove_punct = TRUE)

# what are the context (target) words or phrases
bug_words <- c("eating bugs", "eating insects", "bug", "bugs", "insect", "insects")

# retain only our tokens and their context
tokens_bugs <- tokens_keep(bug_tokens, pattern = phrase(bug_words), window = 40)

Next, within those token sets, we can pull out the positive and negative dictionaries to get an inside look at what we’re working with. In this case we’ll be using the Lexicoder Sentiment Dictionary or LSD as it’s denoted in the functions below. Once we’ve done that, we’ll coerce our token object into a DFM.

Code

# keep only the positive and negative categories of the LSD dictionary
data_dictionary_LSD2015_pos_neg <- data_dictionary_LSD2015[1:2]

tokens_bugs_lsd <- tokens_lookup(tokens_bugs,
                                dictionary = data_dictionary_LSD2015_pos_neg)

dfm_bugs <- dfm(tokens_bugs_lsd)
head(dfm_bugs, 10)

Finally, we’ll use the objects we’ve created thus far to: create a data frame; drop any documents that contain only 0 values (those with neither negative nor positive tokens); print a summary sentence telling us exactly how many tweets mention positive or negative tokens in the context of eating bugs; and create and plot the resulting polarity scores.

Code

# convert to data frame
mat_bugs <- convert(dfm_bugs, to = "data.frame")

# drop documents where both features are 0
# (a logical filter stays correct even when no document has a zero total)
mat_bugs <- mat_bugs[(mat_bugs$negative + mat_bugs$positive) != 0, ]

# print a little summary info
paste("We have ", nrow(mat_bugs), " tweets that mention positive or negative words in the context of eating bugs or insects.", sep = "")

# create polarity scores
mat_bugs$polarity <- (mat_bugs$positive - mat_bugs$negative)/(mat_bugs$positive + mat_bugs$negative)

# summary
summary(mat_bugs$polarity)

# plot
ggplot(mat_bugs) +
     geom_histogram(aes(x=polarity), binwidth = .25) +
     theme_bw()

Look at those results! Out of all the data we’ve pulled and created thus far, I believe the Lexicoder Sentiment Dictionary (LSD) has provided us with the most accurate results. If we were to use the view(mat_bugs) function, we could see the breakdown of the polarity scores and how each individual document received its ranking. The one problem I still notice is the relevance of certain tweets to the topic of humans eating bugs, but I believe our pre-processing work filters those down enough that they don’t skew our final results. With that, we’ll close out our post for today. Thank you for reading! :)