Blog Post 2
Author

Kaushika

Published

October 2, 2022

Data sources used:

In this project, I am going to predict the sentiments of COVID-19 vaccination tweets. The data was collected by scraping tweets on the topic “Covid-19 Vaccination” from Twitter, and I am going to use the R environment to implement this project. During the pandemic, many studies carried out analyses using Twitter data.

I have currently scraped the data from Twitter; however, I only have tweets from the last 7 days, since the standard Twitter search only allows access that far back. I will keep collecting data, or try to get access to the Twitter API for Academic Research, which would allow me to retrieve tweets from any time period. That would help me visualize my data without bias.

To connect to the Twitter API, I have used two libraries: twitteR and rtweet.

Code
library(twitteR) #R package which provides access to the Twitter API
library(tm) #Text mining in R
Loading required package: NLP
Code
library(lubridate) #Lubridate is an R package that makes it easier to work with dates and times.

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
Code
library(wordcloud) #Visualize differences and similarity between documents
Loading required package: RColorBrewer
Code
library(wordcloud2) #Creates interactive HTML word clouds
library(ggplot2) #For creating Graphics 

Attaching package: 'ggplot2'
The following object is masked from 'package:NLP':

    annotate
Code
library(reshape2) # Transform data between wide and long formats.
library(dplyr) #Provides a grammar of data manipulation

Attaching package: 'dplyr'
The following objects are masked from 'package:twitteR':

    id, location
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(tidyverse) #Helps to transform and tidy data
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.1.8     ✔ purrr   0.3.5
✔ tidyr   1.2.1     ✔ stringr 1.4.1
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::annotate()      masks NLP::annotate()
✖ lubridate::as.difftime() masks base::as.difftime()
✖ lubridate::date()        masks base::date()
✖ dplyr::filter()          masks stats::filter()
✖ dplyr::id()              masks twitteR::id()
✖ lubridate::intersect()   masks base::intersect()
✖ dplyr::lag()             masks stats::lag()
✖ dplyr::location()        masks twitteR::location()
✖ lubridate::setdiff()     masks base::setdiff()
✖ lubridate::union()       masks base::union()
Code
library(tidytext) #Applies the principles of the tidyverse to analyzing text.
library(tidyr) #Helps to get tidy data
library(gridExtra) #Arrange multiple grid-based plots on a page, and draw tables

Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine
Code
library(grid) #Produce graphical output
library(rtweet) #Collecting Twitter Data

Attaching package: 'rtweet'

The following object is masked from 'package:purrr':

    flatten

The following object is masked from 'package:twitteR':

    lookup_statuses
Code
library(syuzhet) #Extracts sentiment and sentiment-derived plot arcs from text

Attaching package: 'syuzhet'

The following object is masked from 'package:rtweet':

    get_tokens

In order to gain access to Twitter data, I have to apply for a developer account. I first need to establish a secure connection to the Twitter API; for the connection, I need to provide a consumer API key and a consumer API secret. I can obtain both by creating a developer profile with Twitter.

Code
# twitter keys and tokens
api_key <- "######"
api_secret <- "######"
access_token <- "######"
access_token_secret <- "######"

# create token for rtweet
token <- create_token(
  app = "######",
  api_key,
  api_secret,
  access_token,
  access_token_secret,
  set_renv = TRUE)
Warning: `create_token()` was deprecated in rtweet 1.0.0.
ℹ See vignette('auth') for details
Saving auth to 'C:\Users\srika\AppData\Roaming/R/config/R/rtweet/
create_token.rds'

To start, we need to establish the connection.

Code
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret) #Authorising the connection
[1] "Using direct authentication"
Error in check_twitter_oauth(): OAuth authentication error:
This most likely means that you have incorrectly called setup_twitter_oauth()'

I will then use the searchTwitter function to find tweets matching multiple criteria.

For example, I used searchTwitter to request 10,000 tweets on the topic of the Covid-19 vaccine. I also limited the scope to English tweets, and a time restriction can be added as well. I converted the returned tweets into a data frame using the function twListToDF. Moreover, I noticed that many of the returned tweets begin with RT, which means they are retweets; I filtered those out using the search operator -filter:retweets.

Code
tweets_covid = searchTwitter("covid+19+vaccine -filter:retweets", n = 10000, lang = "en")
Error in twInterfaceObj$doAPICall(cmd, params, "GET", ...): OAuth authentication error:
This most likely means that you have incorrectly called setup_twitter_oauth()'
Code
tweets.df = twListToDF(tweets_covid)
Error in twListToDF(tweets_covid): object 'tweets_covid' not found
Code
write.csv(tweets.df, file = "covid197tweets.csv", row.names = FALSE)
Error in is.data.frame(x): object 'tweets.df' not found

We can write our data frame to a CSV file and observe that the text column is complete.

Build Corpus

A corpus, or an aggregate of text documents or tweets, is the primary document management structure in the R package “tm” (text mining).
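
Since the authenticated pull above failed, here is a minimal sketch of how the corpus would be built, assuming the tweets.df data frame from the earlier chunk exists with a text column:

Code
corpus <- Corpus(VectorSource(tweets.df$text)) # one tm document per tweet
inspect(corpus[1:3]) # preview the first three documents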

Data Pre-Processing

Cleaning the Data

Cleaning the data includes removing stopwords, numbers, punctuation, and other elements. Stopwords are words that carry no sentiment, such as conjunctions, pronouns, and negations. Words that are common in this corpus yet uninformative, such as “covid,” “vaccination,” and “corona,” are also removed. Pre-processing the text data is an essential step, as it makes the raw text ready for mining.
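
A sketch of these cleaning steps with tm, applied to the corpus from the previous chunk (the extra topic words to drop are my assumption):

Code
corpus <- tm_map(corpus, content_transformer(tolower))      # lower-case all text
corpus <- tm_map(corpus, removePunctuation)                 # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                     # strip numbers
corpus <- tm_map(corpus, removeWords, stopwords("english")) # drop English stopwords
corpus <- tm_map(corpus, removeWords, c("covid", "vaccination", "corona")) # topic words
corpus <- tm_map(corpus, stripWhitespace)                   # collapse extra spaces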

Social Network Analysis

Analysis of the Most Frequent Words - Word Cloud

A word cloud is a collection of words presented in various sizes. The bigger and bolder a word appears, the more frequently it occurs in the tweets.
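
A sketch of how the word cloud could be drawn from the cleaned corpus; the frequency cutoffs here are illustrative assumptions:

Code
tdm <- TermDocumentMatrix(corpus)                        # terms x documents matrix
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE) # term frequencies
set.seed(1234)                                           # reproducible layout
wordcloud(words = names(freq), freq = freq, min.freq = 5,
          max.words = 200, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))                # frequent terms drawn larger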

Research Question

I am specifically focusing on tweets about the Covid-19 vaccine and wish to perform a sentiment analysis on them. In the first part, I collect tweets related to the Covid-19 vaccine (web scraping) and prepare the data.

In the next part, I wish to conduct a social network analysis and visualize the underlying emotions (sentiments) of the tweets.
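
As a preview of that next part, a minimal sketch with syuzhet, assuming tweets.df from above holds the collected tweets:

Code
sentiments <- get_nrc_sentiment(tweets.df$text)      # NRC scores: 8 emotions plus positive/negative
barplot(colSums(sentiments), las = 2,
        main = "Sentiments of Covid-19 vaccine tweets") # total score per emotion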