Blog 2-Sentiment Analysis on IRA and Healthcare Costs

Healthcare

Twitter

IRA

Rhowena Vespa

Blog Post 2

Sentiment Analysis

Author

Rhowena Vespa

Published

October 2, 2022

Code

library(tidyverse)
library(quarto)
library(quanteda)
library(quanteda.textplots)
library(lubridate)
library(tm) 
library(tidytext) 
library(dplyr)
library(corpus) 
library(ggplot2)
library(reshape2)
library(syuzhet) 
library(stringr)
library(tidyr)
library(DT)
library(RColorBrewer)



knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Research topic

This research revolves around text analytics and healthcare costs. Specifically, I wanted to focus on analyzing replies to a specific tweet on a specific social science topic.

According to the government, the Inflation Reduction Act of 2022 (IRA 2022) will cut prescription drug costs and lower overall healthcare costs. Unfortunately, the US healthcare system is very complex and the political divide among Americans view the IRA differently.

On October 15, 2022 at 9:10am, POTUS tweeted “With the Inflation Reduction Act, the American people won and Big Pharma lost.” At the time of this writing, this tweet has generated 7,301 replies, 5,016 retweet and 25.5K likes.

At first glance, my impression of this specific @POTUS tweet is positive. My research looks into the sentiments of all the Tweet replies (including replies to replies) to this specific $POTUS tweet.

    1. Do the responses imply trust on @POTUS?
    2. Is public opinion positive?
    3. Are there specific words that trigger positive or negative sentiments?

Data

Corpus is composed of 9497 twitter replies (all conversation threads) from @POTUS tweet on October 15, 2022 saying “With the Inflation Reduction Act, the American people won and Big Pharma lost.” Twarc package in python was used to extract the conversation threads of this specific tweet.

Data Pre-Processing

        1. Data extracted as jsonl, converted to csv
        2. Tweet replies isolated and cleaned.
        3. Cleaning: 
              remove stopwords, 
              lowercase
              remove twitter handles
              remove "POTUS" "joe" "biden" "inflation" "big" "pharma"
              remove punctuation
              remove retweet
              remove numbers
              remove white spaces
              & converted to 'amp', remove "amp"
        4. Stemming

Analysis -Word cloud, Sentiment Analysis

Word cloud was used to get visualization on most used words. Sentiment analysis using NRC as I wanted 8 specific emotions namely anger, fear, anticipation, trust, surprise, sadness, joy, disgust and positve or negative sentiment.

Results

Initial results was surprising. Most replies evoked trust to @POTUS tweet about the IRA and its effects on the people and Big Pharma. However, it was very surprising that sadness and anger were the next 2 prominent emotions from this tweet.
It was also unexpected that the tweet had an overall negative sentiment based on average sentiment scores.

Challenges

I had a difficult time sourcing data. Initial attempts to parse PUBMED and EMBASE. The Open-Access subsets were limited in terms of my research topic of sentiments on healthcare costs in the US. Due to time constraints, I did not find it practical hand-pick and parse 50+ pdfs, some requiring paid subscriptions. With this, I decided to go straight to the source- tweet responses. Extracting twitter replies was a steep, but fun, learning curve for me. Initially , I used the academictwitteR package in R but my dataframe kept coming up empty. I switched to python and found the twarc package much easier.

Future work

With this initial analysis, I would like to examine the replies closely and conduct a toxic comment classification. I would like to use classifiers to predict if a twitter response is toxic. I will continue to pre-process the data to remove emojis and non-english words.

Code

# Read in data

IRA.csv <- read_csv("IRA_med.csv")

Code

IRA_df <- as.data.frame(IRA.csv) #convert csv to df

Code

#standardize dates
IRA_df$created_at <- ymd_hms(IRA_df$created_at) 
IRA_df$created_at <- with_tz(IRA_df$created_at,"America/New_York")
IRA_df$created_date <- as.Date(IRA_df$created_at)

new df

Code

IRA_text <- IRA_df %>%
  select ("text", created_date) #new dataframe isolate text column

remove @twitter handles

Code

IRA_text$text <- gsub("@[[:alpha:]]*","", IRA_text$text) #remove Twitter handles

text mining and cleaning - lowercase, remove common words, stop words, punctuation, retweet, & =amp and numbers

Code

IRA_corpus <- Corpus(VectorSource(IRA_text$text))
IRA_corpus <- tm_map(IRA_corpus, tolower) #lowercase
IRA_corpus <- tm_map(IRA_corpus, removeWords, 
                      c("joe", "biden","potus", "big", "pharma","inflation", "rt", "amp"))
IRA_corpus <- tm_map(IRA_corpus, removeWords, 
                      stopwords("en"))
IRA_corpus <- tm_map(IRA_corpus, removePunctuation)
IRA_corpus <- tm_map(IRA_corpus, stripWhitespace)
IRA_corpus <- tm_map(IRA_corpus, removeNumbers)

convert cleaned data back to data frame

Code

IRA_text_df <- data.frame(IRA_text_clean = get("content", IRA_corpus),  
                      stringsAsFactors = FALSE)

tokenize

Code

IRA_tidy <- IRA_text_df %>%
  unnest_tokens(word, IRA_text_clean)
IRA_tidy <- rename(IRA_tidy, text = word)
IRA_tokens <-text_tokens(IRA_tidy)

Stemming

Code

IRA_tokens <- tokens(IRA_tokens) 
IRA_tokens <- tokens_wordstem(IRA_tokens)

Word Cloud

Code

dfm_IRA <- corpus(as.character(IRA_tokens)) %>%
    dfm(remove = stopwords('english'), remove_punct = TRUE) %>%
    dfm_trim(min_termfreq = 10, verbose = FALSE)

textplot_wordcloud(dfm_IRA, scale=c(5,1), max.words=50, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))

Code

IRA_Sentiment <- get_nrc_sentiment(as.character(IRA_text_df$IRA_text_clean))

Code

IRA_td<-data.frame(t(IRA_Sentiment)) #transpose
IRA_td_new <- data.frame(rowSums(IRA_td[2:9497]))
IRA_td_mean <- data.frame(rowMeans(IRA_td[2:9497]))#get mean of sentiment values

Code

#Transformation and cleaning
names(IRA_td_new)[1] <- "Count"

Code

IRA_td_new <- cbind("sentiment" = rownames(IRA_td_new), IRA_td_new)

Code

rownames(IRA_td_new) <- NULL

Code

IRA_td_new2<-IRA_td_new[1:8,]

Code

#Plot One - Count of words associated with each sentiment
quickplot(sentiment, data=IRA_td_new2, weight=Count, geom="bar", fill=sentiment, ylab="count")+ggtitle("Emotions to @POTUS Tweet")

Code

names(IRA_td_mean)[1] <- "Mean"

Code

IRA_td_mean <- cbind("sentiment" = rownames(IRA_td_mean), IRA_td_mean)

Code

rownames(IRA_td_mean) <- NULL

Code

IRA_td_mean2<-IRA_td_mean[9:10,]

Code

#Plot One - Count of words associated with each sentiment
quickplot(sentiment, data=IRA_td_mean2, weight=Mean, geom="bar", fill=sentiment, ylab="Mean Sentiment Score")+ggtitle("Sentiments to @POTUS Tweet")

REFERENCES

    1. House, T., 2022. BY THE NUMBERS: The Inflation Reduction Act - The White House. [online] The White House. Available at: <https://www.whitehouse.gov/briefing-room/statements-releases/2022/08/15/by-the-numbers-the-inflation-reduction-act/> [Accessed 15 October 2022].
    2. Biden, P. (2022, October 15). We pay more for our prescription drugs than any other nation in the world. it's outrageous. but now, instead of money going into the pockets of drug companies, it will go into your pockets in the form of lower drug prices. Twitter. Retrieved October 15, 2022, from https://twitter.com/POTUS/status/1581374573815431168 
    3. Robinson, J. S. and D. (n.d.). Welcome to text mining with r: Text mining with R. Welcome to Text Mining with R | Text Mining with R. Retrieved October 15, 2022, from https://www.tidytextmining.com/

--- title: "Blog 2-Sentiment Analysis on IRA and Healthcare Costs" author: "Rhowena Vespa" desription: "Sentiment Analysis on POTUS Twitter Replies" date: "10/02/2022" editor: visual format: html: toc: true code-fold: true code-copy: true code-tools: true categories: - Healthcare - Twitter - IRA - Rhowena Vespa - Blog Post 2 - Sentiment Analysis --- ```{r} #| label: setup #| warning: false library(tidyverse) library(quarto) library(quanteda) library(quanteda.textplots) library(lubridate) library(tm) library(tidytext) library(dplyr) library(corpus) library(ggplot2) library(reshape2) library(syuzhet) library(stringr) library(tidyr) library(DT) library(RColorBrewer) knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE) ``` ## Research topic This research revolves around text analytics and healthcare costs. Specifically, I wanted to focus on analyzing **replies to a specific tweet** on a specific social science topic. According to the government, the Inflation Reduction Act of 2022 (IRA 2022) will cut prescription drug costs and lower overall healthcare costs. Unfortunately, the US healthcare system is very complex and the political divide among Americans view the IRA differently. On October 15, 2022 at 9:10am, POTUS tweeted "With the Inflation Reduction Act, the American people won and Big Pharma lost." At the time of this writing, this tweet has generated 7,301 replies, 5,016 retweet and 25.5K likes. At first glance, my impression of this specific @POTUS tweet is positive. My research looks into the sentiments of all the Tweet replies (including replies to replies) to this specific \$POTUS tweet. 1. Do the responses imply trust on @POTUS? 2. Is public opinion positive? 3. Are there specific words that trigger positive or negative sentiments? ## Data Corpus is composed of **9497 twitter replies (all conversation threads)** from @POTUS tweet on October 15, 2022 saying "With the Inflation Reduction Act, the American people won and Big Pharma lost." Twarc package in python was used to extract the conversation threads of this specific tweet. ### Data Pre-Processing 1. Data extracted as jsonl, converted to csv 2. Tweet replies isolated and cleaned. 3. Cleaning: remove stopwords, lowercase remove twitter handles remove "POTUS" "joe" "biden" "inflation" "big" "pharma" remove punctuation remove retweet remove numbers remove white spaces & converted to 'amp', remove "amp" 4. Stemming ## Analysis -Word cloud, Sentiment Analysis Word cloud was used to get visualization on most used words. Sentiment analysis using NRC as I wanted 8 specific emotions namely anger, fear, anticipation, trust, surprise, sadness, joy, disgust and positve or negative sentiment. ## Results Initial results was surprising. Most replies evoked **trust** to @POTUS tweet about the IRA and its effects on the people and Big Pharma. However, it was very surprising that **sadness** and **anger** were the next 2 prominent emotions from this tweet.\ It was also unexpected that the tweet had an overall **negative sentiment** based on average sentiment scores. ## Challenges I had a difficult time sourcing data. Initial attempts to parse PUBMED and EMBASE. The Open-Access subsets were limited in terms of my research topic of sentiments on healthcare costs in the US. Due to time constraints, I did not find it practical hand-pick and parse 50+ pdfs, some requiring paid subscriptions. With this, I decided to go straight to the source- tweet responses. Extracting twitter replies was a steep, but fun, learning curve for me. Initially , I used the academictwitteR package in R but my dataframe kept coming up empty. I switched to python and found the twarc package much easier. ## Future work With this initial analysis, I would like to examine the replies closely and conduct a toxic comment classification. I would like to use classifiers to predict if a twitter response is toxic. I will continue to pre-process the data to remove emojis and non-english words. ```{r} # Read in data IRA.csv <- read_csv("IRA_med.csv") ``` ```{r} IRA_df <- as.data.frame(IRA.csv) #convert csv to df ``` ```{r} #standardize dates IRA_df$created_at <- ymd_hms(IRA_df$created_at) IRA_df$created_at <- with_tz(IRA_df$created_at,"America/New_York") IRA_df$created_date <- as.Date(IRA_df$created_at) ``` new df ```{r} IRA_text <- IRA_df %>% select ("text", created_date) #new dataframe isolate text column ``` remove @twitter handles ```{r} IRA_text$text <- gsub("@[[:alpha:]]*","", IRA_text$text) #remove Twitter handles ``` text mining and cleaning - lowercase, remove common words, stop words, punctuation, retweet, & =amp and numbers ```{r} IRA_corpus <- Corpus(VectorSource(IRA_text$text)) IRA_corpus <- tm_map(IRA_corpus, tolower) #lowercase IRA_corpus <- tm_map(IRA_corpus, removeWords, c("joe", "biden","potus", "big", "pharma","inflation", "rt", "amp")) IRA_corpus <- tm_map(IRA_corpus, removeWords, stopwords("en")) IRA_corpus <- tm_map(IRA_corpus, removePunctuation) IRA_corpus <- tm_map(IRA_corpus, stripWhitespace) IRA_corpus <- tm_map(IRA_corpus, removeNumbers) ``` convert cleaned data back to data frame ```{r} IRA_text_df <- data.frame(IRA_text_clean = get("content", IRA_corpus), stringsAsFactors = FALSE) ``` tokenize ```{r} IRA_tidy <- IRA_text_df %>% unnest_tokens(word, IRA_text_clean) IRA_tidy <- rename(IRA_tidy, text = word) IRA_tokens <-text_tokens(IRA_tidy) ``` Stemming ```{r} IRA_tokens <- tokens(IRA_tokens) IRA_tokens <- tokens_wordstem(IRA_tokens) ``` Word Cloud ```{r} dfm_IRA <- corpus(as.character(IRA_tokens)) %>% dfm(remove = stopwords('english'), remove_punct = TRUE) %>% dfm_trim(min_termfreq = 10, verbose = FALSE) textplot_wordcloud(dfm_IRA, scale=c(5,1), max.words=50, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2")) ``` ```{r} IRA_Sentiment <- get_nrc_sentiment(as.character(IRA_text_df$IRA_text_clean)) ``` ```{r} IRA_td<-data.frame(t(IRA_Sentiment)) #transpose IRA_td_new <- data.frame(rowSums(IRA_td[2:9497])) IRA_td_mean <- data.frame(rowMeans(IRA_td[2:9497]))#get mean of sentiment values ``` ```{r} #Transformation and cleaning names(IRA_td_new)[1] <- "Count" ``` ```{r} IRA_td_new <- cbind("sentiment" = rownames(IRA_td_new), IRA_td_new) ``` ```{r} rownames(IRA_td_new) <- NULL ``` ```{r} IRA_td_new2<-IRA_td_new[1:8,] ``` ```{r} #Plot One - Count of words associated with each sentiment quickplot(sentiment, data=IRA_td_new2, weight=Count, geom="bar", fill=sentiment, ylab="count")+ggtitle("Emotions to @POTUS Tweet") ``` ```{r} names(IRA_td_mean)[1] <- "Mean" ``` ```{r} IRA_td_mean <- cbind("sentiment" = rownames(IRA_td_mean), IRA_td_mean) ``` ```{r} rownames(IRA_td_mean) <- NULL ``` ```{r} IRA_td_mean2<-IRA_td_mean[9:10,] ``` ```{r} #Plot One - Count of words associated with each sentiment quickplot(sentiment, data=IRA_td_mean2, weight=Mean, geom="bar", fill=sentiment, ylab="Mean Sentiment Score")+ggtitle("Sentiments to @POTUS Tweet") ``` ## REFERENCES 1. House, T., 2022. BY THE NUMBERS: The Inflation Reduction Act - The White House. [online] The White House. Available at: <https://www.whitehouse.gov/briefing-room/statements-releases/2022/08/15/by-the-numbers-the-inflation-reduction-act/> [Accessed 15 October 2022]. 2. Biden, P. (2022, October 15). We pay more for our prescription drugs than any other nation in the world. it's outrageous. but now, instead of money going into the pockets of drug companies, it will go into your pockets in the form of lower drug prices. Twitter. Retrieved October 15, 2022, from https://twitter.com/POTUS/status/1581374573815431168 3. Robinson, J. S. and D. (n.d.). Welcome to text mining with r: Text mining with R. Welcome to Text Mining with R | Text Mining with R. Retrieved October 15, 2022, from https://www.tidytextmining.com/