Sentiment Analysis on Covid-19 Vaccine

Blog Post 4
Kaushika
Published

October 30, 2022

In this project, I am going to predict the sentiments of COVID-19 vaccination tweets. The data was prepared by collecting tweets on the topic “Covid-19 Vaccination” from Twitter (web scraping), and I am going to use the R environment to implement this project. During the pandemic, many studies carried out analyses using Twitter data.

In the previous blog I mentioned that I have access to only the last 7 days of tweets. However, I have applied for academic access to the Twitter API, which allows me to collect more tweets for my analysis. I will be using the Premium search rather than the Standard search for tweets via the Twitter API.
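Such a full-archive search might look roughly like the sketch below, using rtweet’s search_fullarchive(); the environment label, date window, and query operators are placeholders that depend on my developer-portal setup, and the token is the one created in the scraping section below.

Code
# Sketch only: full-archive (premium/academic) search with rtweet
premium_tweets <- search_fullarchive(
  q = "covid 19 vaccine lang:en -is:retweet",  # premium query operators
  n = 20000,
  fromDate = "202101010000",   # placeholder window, YYYYMMDDHHMM
  toDate   = "202210300000",
  env_name = "fullarchive",    # placeholder dev environment label
  token    = token             # rtweet token created below
)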

Loading Important Libraries

Code
library(twitteR) #R package which provides access to the Twitter API
library(tm) #Text mining in R
Loading required package: NLP
Code
library(lubridate) #Lubridate is an R package that makes it easier to work with dates and times.

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
Code
library(quanteda) #Makes it easy to manage texts in the form of a corpus.
Package version: 3.2.3
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 8 of 8 threads used.
See https://quanteda.io for tutorials and examples.

Attaching package: 'quanteda'
The following object is masked from 'package:tm':

    stopwords
The following objects are masked from 'package:NLP':

    meta, meta<-
Code
library(wordcloud) #Visualize differences and similarity between documents
Loading required package: RColorBrewer
Code
library(wordcloud2)
library(ggplot2) #For creating Graphics 

Attaching package: 'ggplot2'
The following object is masked from 'package:NLP':

    annotate
Code
library(reshape2) # Transform data between wide and long formats.
library(dplyr) #Provides a grammar of data manipulation

Attaching package: 'dplyr'
The following objects are masked from 'package:twitteR':

    id, location
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(tidyverse) #Helps to transform and tidy data
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.1.8     ✔ purrr   0.3.5
✔ tidyr   1.2.1     ✔ stringr 1.4.1
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::annotate()      masks NLP::annotate()
✖ lubridate::as.difftime() masks base::as.difftime()
✖ lubridate::date()        masks base::date()
✖ dplyr::filter()          masks stats::filter()
✖ dplyr::id()              masks twitteR::id()
✖ lubridate::intersect()   masks base::intersect()
✖ dplyr::lag()             masks stats::lag()
✖ dplyr::location()        masks twitteR::location()
✖ lubridate::setdiff()     masks base::setdiff()
✖ lubridate::union()       masks base::union()
Code
library(tidytext) #Applies the principles of the tidyverse to analyzing text.
library(tidyr) #Helps to get tidy data
library(gridExtra) #Arrange multiple grid-based plots on a page, and draw tables

Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine
Code
library(grid) #Produce graphical output
library(rtweet) #Collecting Twitter Data

Attaching package: 'rtweet'

The following object is masked from 'package:purrr':

    flatten

The following object is masked from 'package:twitteR':

    lookup_statuses
Code
library(syuzhet) #Returns a data frame in which each row represents a sentence from the original file

Attaching package: 'syuzhet'

The following object is masked from 'package:rtweet':

    get_tokens

Scraping Data from Twitter

After getting access to the Twitter API, I can run the following (replacing ###### with my specific credentials) and search for tweets. (“######” is used to protect the credentials.)

Code
# twitter keys and tokens
api_key <- "######"
api_secret <- "######"
access_token <- "######"
access_token_secret <- "######"

# create token for rtweet
token <- create_token(
  app = "######",
  api_key,
  api_secret,
  access_token,
  access_token_secret,
  set_renv = TRUE)
Warning: `create_token()` was deprecated in rtweet 1.0.0.
ℹ See vignette('auth') for details
Saving auth to 'C:\Users\srika\AppData\Roaming/R/config/R/rtweet/create_token.rds'
Code
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
[1] "Using direct authentication"
Error in check_twitter_oauth(): OAuth authentication error:
This most likely means that you have incorrectly called setup_twitter_oauth()'
Code
#what to search

#Searching for tweets using terms covid + 19 + vaccine and filtering out the retweets to avoid repetitions. After that I converted the list of tweets into a data frame.

tweets_covid = searchTwitter("covid+19+vaccine -filter:retweets", n = 20000, lang = "en")
Error in twInterfaceObj$doAPICall(cmd, params, "GET", ...): OAuth authentication error:
This most likely means that you have incorrectly called setup_twitter_oauth()'
Code
tweets.df = twListToDF(tweets_covid)
Error in twListToDF(tweets_covid): object 'tweets_covid' not found
Code
# Append "..." to the text of tweets that the API marked as truncated
for (i in 1:nrow(tweets.df)) {
    if (tweets.df$truncated[i] == TRUE) {
        tweets.df$text[i] <- gsub("[[:space:]]*$","...",tweets.df$text[i])
    }
}
Error in nrow(tweets.df): object 'tweets.df' not found
Code
#Saving the collected tweets into a csv file.
write.csv(tweets.df, file = "covidtweets.csv", row.names = FALSE)
Error in is.data.frame(x): object 'tweets.df' not found

Reading the csv file

The csv file has approximately 15,000 tweets on the topic “Covid 19 Vaccination”.

Code
covid_19_vaccination <- read.csv("covidtweets.csv", header = T)
str(covid_19_vaccination)
'data.frame':   15040 obs. of  16 variables:
 $ text         : chr  "@1goodtern Who suffer the most, vaccine and mask 😷 off, not thinking long term effects with COVID-19 being a ma"| __truncated__ "@palminder1990 Google much?\nhttps://t.co/SXOBS5INdJ" "Arrest #JoeBiden for the assault on the #american people forcing and conning them to take the #vaccine for… htt"| __truncated__ "@9NewsSyd Remember that time \"conspiracy theorists\" said that the Covid-19 Vaccine was undertested, wouldn't "| __truncated__ ...
 $ favorited    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ favoriteCount: int  0 0 0 0 0 0 0 2 0 0 ...
 $ replyToSN    : chr  "1goodtern" "palminder1990" NA "9NewsSyd" ...
 $ created      : chr  "2022-10-31 01:35:17" "2022-10-31 01:33:07" "2022-10-31 01:27:07" "2022-10-31 01:24:45" ...
 $ truncated    : logi  TRUE FALSE TRUE TRUE TRUE TRUE ...
 $ replyToSID   : num  1.59e+18 1.59e+18 NA 1.59e+18 NA ...
 $ id           : num  1.59e+18 1.59e+18 1.59e+18 1.59e+18 1.59e+18 ...
 $ replyToUID   : num  9.61e+17 1.49e+18 NA 1.72e+08 NA ...
 $ statusSource : chr  "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>" ...
 $ screenName   : chr  "ecmoyer" "henri_gg" "Twitgovbot" "DjrellAZDelta" ...
 $ retweetCount : int  0 0 0 0 0 0 0 0 0 0 ...
 $ isRetweet    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ retweeted    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ longitude    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ latitude     : num  NA NA NA NA NA NA NA NA NA NA ...

Build Corpus

A corpus, or collection of text documents (in this case tweets), is the primary document management structure in the R package “tm” (text mining).

Code
corpus <- iconv(covid_19_vaccination$text, to = "utf-8")
corpus <- Corpus(VectorSource(corpus))
inspect(corpus[1:5])
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 5

[1] @1goodtern Who suffer the most, vaccine and mask 😷 off, not thinking long term effects with COVID-19 being a mass d… https://t.co/hxabqyjaIn...  
[2] @palminder1990 Google much?\nhttps://t.co/SXOBS5INdJ                                                                                             
[3] Arrest #JoeBiden for the assault on the #american people forcing and conning them to take the #vaccine for… https://t.co/VKh5GBecFn...           
[4] @9NewsSyd Remember that time "conspiracy theorists" said that the Covid-19 Vaccine was undertested, wouldn't work e… https://t.co/qNAoety4Y2...  
[5] One squat, deadlift, or benchpress session a day; keeps #COVID19 away!\n\nRun, stretch or dance: #Exercise could impr… https://t.co/Gh60QDwcvZ...
Code
#Suppress warnings in the global setting.
options(warn=-1)

Cleaning the Data: Data Pre-Processing

Cleaning the data includes removing stopwords, numbers, punctuation, and other elements. Stopwords are words that carry no sentimental meaning, such as conjunctions, pronouns, negations, etc. Common yet uninformative words like “covid,” “vaccination,” “corona,” etc. are also eliminated in this case.

Here we follow a particular order, removing usernames before punctuation: if punctuation were removed first, the ‘@’ symbol would be stripped out, and the username-removal step would no longer be able to detect the handles.
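A toy illustration of the ordering issue, using the same regular expressions as the helper functions defined in the next code block (the example string is made up):

Code
x <- "@user123 The covid vaccine works! https://t.co/abc"
# Usernames first, then punctuation/numbers: the whole handle is dropped
gsub("[^[:alpha:][:space:]]*", "", gsub("@[^[:space:]]*", "", x))
# Punctuation/numbers first: the '@' is already gone, so "user" is left behind
gsub("@[^[:space:]]*", "", gsub("[^[:alpha:][:space:]]*", "", x))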

Code
# clean text
removeUsername <- function(x) gsub('@[^[:space:]]*', '', x) #Removes usernames
removeURL <- function(x) gsub('http[[:alnum:]]*', '', x) #Removes URLs attached to tweets
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x) #Removes punctuation and other non-alphabetic characters

#Text Mining Functions
cleandata <- tm_map(corpus, PlainTextDocument) #Function to create plain text documents.
cleandata <- tm_map(cleandata, content_transformer(removeUsername)) #Function to remove Usernames attached to the text.
cleandata <- tm_map(cleandata, content_transformer(removeURL)) #Function to remove URLs attached to the text.
cleandata <- tm_map(cleandata, content_transformer(tolower)) #Function to convert text into lowercase.
cleandata <- tm_map(cleandata, content_transformer(removeNumPunct)) #Function to remove Punctuations attached to text.
cleandata <- tm_map(cleandata, content_transformer(removeNumbers)) #Function to remove numbers attached to texts.
cleandata <- tm_map(cleandata, removeWords, stopwords("english"))

#Removing meaningless words like "covid," "vaccination," "corona," etc
cleandata <- tm_map(cleandata, removeWords, c('covid','vaccination', 
                                            'vaccinations','vaccine','vaccines',
                                            'vaccinated', "corona", 
                                            "coronavirus"))
#Merging 'available' into 'availability' so both map to a single term
cleandata <- tm_map(cleandata, gsub,
                   pattern = 'available',
                   replacement = 'availability')
cleandata <- tm_map(cleandata, stripWhitespace) #Function to strip extra whitespace from a text document.
inspect(cleandata[1:5]) #Inspecting the first 5 rows.
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 5

[1]  suffer mask thinking long term effects mass d tcohxabqyjain                                    
[2]  google much tcosxobsindj                                                                       
[3] arrest joebiden assault american people forcing conning take tcovkhgbecfn                       
[4]  remember time conspiracy theorists said undertested wouldnt work e tcoqnaoetyy                 
[5] one squat deadlift benchpress session day keeps away run stretch dance exercise impr tcoghqdwcvz

Term Document Matrix

The TermDocumentMatrix() function constructs the term-document matrix, which describes the frequency of terms occurring in a collection of documents: the rows are the terms (words) and the columns are the documents (tweets). In the next step I build a TDM from the cleaned data.

Code
tdm <- TermDocumentMatrix(cleandata)
tdm <- as.matrix(tdm)
tdm[1:5, 1:10]
         Docs
Terms     1 2 3 4 5 6 7 8 9 10
  effects 1 0 0 0 0 0 1 0 0  0
  long    1 0 0 0 0 0 0 0 0  0
  mask    1 0 0 0 0 0 0 0 0  0
  mass    1 0 0 0 0 0 0 0 0  0
  suffer  1 0 0 0 0 0 0 0 0  0

Analysis of the Most Frequent Words - Word Cloud

A wordcloud is a collection of words displayed in different sizes. The bigger and bolder a word appears, the more often it is mentioned within the tweets and the more important it is. Words like “pfizer”, “booster”, “flu”, “biden”, “people”, and “get” appear most frequently.

Code
# row sums
w <- rowSums(tdm)                # how often does each word appear?
w <- subset(w, w >= 3000)        # keep only the most frequent terms
w <- sort(w, decreasing = TRUE)  # sort terms by frequency

# wordcloud
options(repr.plot.width=14, repr.plot.height=10)
wordcloud(words = names(w),
          freq = w,
          colors=brewer.pal(8, "Dark2"),
          random.color = TRUE,
          max.words = 100,
          scale = c(4, 0.04))
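For a complementary view of the same counts, the frequency vector w can also be plotted as a simple bar chart with ggplot2 (loaded above); a minimal sketch:

Code
# Bar chart of the 15 most frequent terms (reuses the frequency vector w)
top_terms <- data.frame(term = names(w), freq = as.numeric(w)) %>%
  arrange(desc(freq)) %>%
  head(15)

ggplot(top_terms, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Frequency",
       title = "Most frequent terms in the COVID-19 vaccine tweets")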

Next, I will perform sentiment analysis and keep updating this post.
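As a preview, here is a minimal sketch of that next step using syuzhet (loaded above), assuming the raw tweet text from covid_19_vaccination is scored against the NRC emotion lexicon:

Code
# Score each tweet against the NRC lexicon (8 emotions + positive/negative)
tweet_text <- iconv(covid_19_vaccination$text, to = "utf-8")
nrc_scores <- get_nrc_sentiment(tweet_text)

# Overall emotion/sentiment counts across all tweets
barplot(colSums(nrc_scores),
        las = 2,
        col = rainbow(10),
        ylab = "Count",
        main = "NRC sentiment scores for COVID-19 vaccine tweets")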