Blog 2-Sentiment Analysis on IRA and Healthcare Costs

Healthcare
Twitter
IRA
Rhowena Vespa
Blog Post 2
Sentiment Analysis
Author

Rhowena Vespa

Published

October 2, 2022

Code
library(tidyverse)
library(quarto)
library(quanteda)
library(quanteda.textplots)
library(lubridate)
library(tm) 
library(tidytext) 
library(dplyr)
library(corpus) 
library(ggplot2)
library(reshape2)
library(syuzhet) 
library(stringr)
library(tidyr)
library(DT)
library(RColorBrewer)



knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Research topic

This research revolves around text analytics and healthcare costs. Specifically, I wanted to focus on analyzing replies to a specific tweet on a specific social science topic.

According to the government, the Inflation Reduction Act of 2022 (IRA 2022) will cut prescription drug costs and lower overall healthcare costs. Unfortunately, the US healthcare system is very complex and the political divide among Americans view the IRA differently.

On October 15, 2022 at 9:10am, POTUS tweeted “With the Inflation Reduction Act, the American people won and Big Pharma lost.” At the time of this writing, this tweet has generated 7,301 replies, 5,016 retweet and 25.5K likes.

At first glance, my impression of this specific @POTUS tweet is positive. My research looks into the sentiments of all the Tweet replies (including replies to replies) to this specific $POTUS tweet.

    1. Do the responses imply trust on @POTUS?
    2. Is public opinion positive?
    3. Are there specific words that trigger positive or negative sentiments?

Data

Corpus is composed of 9497 twitter replies (all conversation threads) from @POTUS tweet on October 15, 2022 saying “With the Inflation Reduction Act, the American people won and Big Pharma lost.” Twarc package in python was used to extract the conversation threads of this specific tweet.

Data Pre-Processing

        1. Data extracted as jsonl, converted to csv
        2. Tweet replies isolated and cleaned.
        3. Cleaning: 
              remove stopwords, 
              lowercase
              remove twitter handles
              remove "POTUS" "joe" "biden" "inflation" "big" "pharma"
              remove punctuation
              remove retweet
              remove numbers
              remove white spaces
              & converted to 'amp', remove "amp"
        4. Stemming
        
        

Analysis -Word cloud, Sentiment Analysis

Word cloud was used to get visualization on most used words. Sentiment analysis using NRC as I wanted 8 specific emotions namely anger, fear, anticipation, trust, surprise, sadness, joy, disgust and positve or negative sentiment.

Results

Initial results was surprising. Most replies evoked trust to @POTUS tweet about the IRA and its effects on the people and Big Pharma. However, it was very surprising that sadness and anger were the next 2 prominent emotions from this tweet.
It was also unexpected that the tweet had an overall negative sentiment based on average sentiment scores.

Challenges

I had a difficult time sourcing data. Initial attempts to parse PUBMED and EMBASE. The Open-Access subsets were limited in terms of my research topic of sentiments on healthcare costs in the US. Due to time constraints, I did not find it practical hand-pick and parse 50+ pdfs, some requiring paid subscriptions. With this, I decided to go straight to the source- tweet responses. Extracting twitter replies was a steep, but fun, learning curve for me. Initially , I used the academictwitteR package in R but my dataframe kept coming up empty. I switched to python and found the twarc package much easier.

Future work

With this initial analysis, I would like to examine the replies closely and conduct a toxic comment classification. I would like to use classifiers to predict if a twitter response is toxic. I will continue to pre-process the data to remove emojis and non-english words.

Code
# Read in data

IRA.csv <- read_csv("IRA_med.csv")
Code
IRA_df <- as.data.frame(IRA.csv) #convert csv to df
Code
#standardize dates
IRA_df$created_at <- ymd_hms(IRA_df$created_at) 
IRA_df$created_at <- with_tz(IRA_df$created_at,"America/New_York")
IRA_df$created_date <- as.Date(IRA_df$created_at)

new df

Code
IRA_text <- IRA_df %>%
  select ("text", created_date) #new dataframe isolate text column 

remove @twitter handles

Code
IRA_text$text <- gsub("@[[:alpha:]]*","", IRA_text$text) #remove Twitter handles

text mining and cleaning - lowercase, remove common words, stop words, punctuation, retweet, & =amp and numbers

Code
IRA_corpus <- Corpus(VectorSource(IRA_text$text))
IRA_corpus <- tm_map(IRA_corpus, tolower) #lowercase
IRA_corpus <- tm_map(IRA_corpus, removeWords, 
                      c("joe", "biden","potus", "big", "pharma","inflation", "rt", "amp"))
IRA_corpus <- tm_map(IRA_corpus, removeWords, 
                      stopwords("en"))
IRA_corpus <- tm_map(IRA_corpus, removePunctuation)
IRA_corpus <- tm_map(IRA_corpus, stripWhitespace)
IRA_corpus <- tm_map(IRA_corpus, removeNumbers)

convert cleaned data back to data frame

Code
IRA_text_df <- data.frame(IRA_text_clean = get("content", IRA_corpus),  
                      stringsAsFactors = FALSE)

tokenize

Code
IRA_tidy <- IRA_text_df %>%
  unnest_tokens(word, IRA_text_clean)
IRA_tidy <- rename(IRA_tidy, text = word)
IRA_tokens <-text_tokens(IRA_tidy)

Stemming

Code
IRA_tokens <- tokens(IRA_tokens) 
IRA_tokens <- tokens_wordstem(IRA_tokens)

Word Cloud

Code
dfm_IRA <- corpus(as.character(IRA_tokens)) %>%
    dfm(remove = stopwords('english'), remove_punct = TRUE) %>%
    dfm_trim(min_termfreq = 10, verbose = FALSE)

textplot_wordcloud(dfm_IRA, scale=c(5,1), max.words=50, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))

Code
IRA_Sentiment <- get_nrc_sentiment(as.character(IRA_text_df$IRA_text_clean)) 
Code
IRA_td<-data.frame(t(IRA_Sentiment)) #transpose
IRA_td_new <- data.frame(rowSums(IRA_td[2:9497]))
IRA_td_mean <- data.frame(rowMeans(IRA_td[2:9497]))#get mean of sentiment values
Code
#Transformation and cleaning
names(IRA_td_new)[1] <- "Count"
Code
IRA_td_new <- cbind("sentiment" = rownames(IRA_td_new), IRA_td_new)
Code
rownames(IRA_td_new) <- NULL
Code
IRA_td_new2<-IRA_td_new[1:8,]
Code
#Plot One - Count of words associated with each sentiment
quickplot(sentiment, data=IRA_td_new2, weight=Count, geom="bar", fill=sentiment, ylab="count")+ggtitle("Emotions to @POTUS Tweet")

Code
names(IRA_td_mean)[1] <- "Mean"
Code
IRA_td_mean <- cbind("sentiment" = rownames(IRA_td_mean), IRA_td_mean)
Code
rownames(IRA_td_mean) <- NULL
Code
IRA_td_mean2<-IRA_td_mean[9:10,]
Code
#Plot One - Count of words associated with each sentiment
quickplot(sentiment, data=IRA_td_mean2, weight=Mean, geom="bar", fill=sentiment, ylab="Mean Sentiment Score")+ggtitle("Sentiments to @POTUS Tweet")

REFERENCES

    1. House, T., 2022. BY THE NUMBERS: The Inflation Reduction Act - The White House. [online] The White House. Available at: <https://www.whitehouse.gov/briefing-room/statements-releases/2022/08/15/by-the-numbers-the-inflation-reduction-act/> [Accessed 15 October 2022].
    2. Biden, P. (2022, October 15). We pay more for our prescription drugs than any other nation in the world. it's outrageous. but now, instead of money going into the pockets of drug companies, it will go into your pockets in the form of lower drug prices. Twitter. Retrieved October 15, 2022, from https://twitter.com/POTUS/status/1581374573815431168 
    3. Robinson, J. S. and D. (n.d.). Welcome to text mining with r: Text mining with R. Welcome to Text Mining with R | Text Mining with R. Retrieved October 15, 2022, from https://www.tidytextmining.com/