knitr
Rhowena Vespa
October 29, 2022
This fourth blog post begins the supervised machine learning work from weeks 7 and 8. Using tweet replies as the corpus, sentiment scores and polarity scores are calculated and visualized.
Supervised machine learning will continue in the next blog post, which will build models for polarity classification.
Rows: 9497 Columns: 79
[1] "docname" "Segment" "WPS" "WC" "Sixltr"
[6] "Dic" "anger" "anticipation" "disgust" "fear"
[11] "joy" "negative" "positive" "sadness" "surprise"
[16] "trust" "AllPunc" "Period" "Comma" "Colon"
[21] "SemiC" "QMark" "Exclam" "Dash" "Quote"
[26] "Apostro" "Parenth" "OtherP"
Corpus consisting of 376 documents and 78 docvars.
text19 :
"4president The US gets zero benefit by supporting Ukraine...."
text28 :
" we’re winning?"
text78 :
" bold faced lie"
text131 :
" Prove it"
text138 :
" Bullshit clown"
text139 :
" Nothing has been won you clown"
[ reached max_ndoc ... 370 more documents ]
Corpus consisting of 884 documents and 78 docvars.
text24 :
" suck on that"
text25 :
" press x to doubt"
text26 :
" Hasbara troll.. shutthefuckup hoe"
text33 :
" Pharma didn't lose shit."
text78 :
" bold faced lie"
text91 :
" And inflation still went up"
[ reached max_ndoc ... 878 more documents ]
Document-feature matrix of: 10 documents, 11,231 features (99.88% sparse) and 78 docvars.
docs so when do i get to see those cheap prices
text1 0 0 0 0 0 0 0 0 0 0
text2 1 1 1 1 1 1 1 1 1 1
text3 0 0 0 1 0 0 0 0 0 0
text4 0 0 0 0 0 1 0 0 0 0
text5 0 0 0 0 1 1 0 0 0 0
text6 0 0 0 0 0 0 0 0 0 0
[ reached max_ndoc ... 4 more documents, reached max_nfeat ... 11,221 more features ]
[1] 9497 11231
[1] 9497 10
Document-feature matrix of: 10 documents, 10 features (76.00% sparse) and 78 docvars.
docs anger anticipation disgust fear joy negative positive sadness surprise
text1 0 0 0 0 0 0 0 0 0
text2 0 0 0 0 0 1 0 0 0
text3 2 1 0 1 1 2 1 1 0
text4 0 0 0 0 0 1 1 0 0
text5 0 0 0 0 0 0 1 0 0
text6 1 0 0 0 0 1 0 1 0
docs trust
text1 0
text2 0
text3 1
text4 0
text5 0
text6 0
[ reached max_ndoc ... 4 more documents ]
[1] "dfm"
[1] "quanteda"
[1] "doc_id" "anger" "anticipation" "disgust" "fear"
[6] "joy" "negative" "positive" "sadness" "surprise"
[11] "trust"
What use to cost 2-3 dollars at grocery stores now cost 5-7 dollars and we get less product. Great job.
How much of our money have you sent to Ukraine now joe?
4president The US gets zero benefit by supporting Ukraine. Absolutely nothing!
_di1200 Do you think the US will stand by when the Taliban use the “ARSENAL” left behind against an ally? \n\n5 years? \n10 years?\n\nPeace forever lol
we’re winning?
570SEASONS These are good provisions that will help people. Trump campaigned on a few of these too.
1. The White House. (2022). By the numbers: The Inflation Reduction Act. The White House. Retrieved October 15, 2022.
2. Biden, P. (2022, October 15). We pay more for our prescription drugs than any other nation in the world. It's outrageous. But now, instead of money going into the pockets of drug companies, it will go into your pockets in the form of lower drug prices. Twitter. Retrieved October 15, 2022.
3. Silge, J., & Robinson, D. (n.d.). Text mining with R. Retrieved October 15, 2022.
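#| label: setup
#| warning: false
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
library(tidyverse)             # read_csv(), ggplot2, %>% (library calls assumed; not shown in the listing)
library(quanteda)              # corpus(), tokens(), dfm(), data_dictionary_LSD2015
library(quanteda.dictionaries) # liwcalike(), data_dictionary_NRC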
# Read in data
IRA <- read_csv("IRA_med.csv")
# remove Twitter handles (note: [[:alpha:]] matches letters only, so digits and underscores in a handle are left behind)
IRA$text <- gsub("@[[:alpha:]]*", "", IRA$text)
IRA_corpus <- corpus(IRA,text_field = "text")
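
One detail of the handle-removal step is worth checking. The class `[[:alpha:]]` matches letters only, while Twitter handles may also contain digits and underscores, so fragments such as "4president" or "_di1200" can survive the cleanup. The sketch below compares that pattern with `@\\w+` on a made-up reply. The example string and the choice of `\\w` are illustrative assumptions, not the method used in this post.

```r
# Made-up reply text, for illustration only
reply <- "@joe4president @some_user The US gets zero benefit"

# Letters-only pattern: stops at the first digit or underscore
letters_only <- gsub("@[[:alpha:]]*", "", reply)

# Word-character pattern: also consumes digits and underscores
word_chars <- gsub("@\\w+", "", reply)

print(letters_only)
print(trimws(word_chars))
```

Switching patterns would change the cleaned text, so the printed corpus excerpts in this post reflect the letters-only version.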
# tokenize and stem
IRA_tokens <- tokens(IRA_corpus)
IRA_tokens <- tokens_wordstem(IRA_tokens)
# Week 8 lecture: NRC sentiment
# use liwcalike() to estimate sentiment using the NRC dictionary
IRAreviewSentiment_nrc <- liwcalike(IRA_corpus, data_dictionary_NRC)
# distribution of the positive scores
ggplot(IRAreviewSentiment_nrc) +
  geom_histogram(aes(x = positive))
# replies with a positive score above 15
IRA_corpus[which(IRAreviewSentiment_nrc$positive > 15)]
# distribution of the negative scores
ggplot(IRAreviewSentiment_nrc) +
  geom_histogram(aes(x = negative))
# replies with a negative score above 15
IRA_corpus[which(IRAreviewSentiment_nrc$negative > 15)]
# create a full dfm for comparison
IRA_Dfm <- tokens(IRA_corpus,
                  remove_punct = TRUE,
                  remove_symbols = TRUE,
                  remove_numbers = TRUE,
                  remove_url = TRUE,
                  split_hyphens = FALSE,
                  include_docvars = TRUE) %>%
  tokens_tolower() %>%
  dfm()
head(IRA_Dfm, 10)
# convert corpus to dfm using the dictionary
IRADfm_nrc <- tokens(IRA_corpus,
                     remove_punct = TRUE,
                     remove_symbols = TRUE,
                     remove_numbers = TRUE,
                     remove_url = TRUE,
                     split_hyphens = FALSE,
                     include_docvars = TRUE) %>%
  tokens_tolower() %>%
  dfm() %>%
  dfm_lookup(data_dictionary_NRC)  # count matches per NRC category
head(IRADfm_nrc, 10)
IRAdf_nrc <- convert(IRADfm_nrc, to = "data.frame")
IRAdf_nrc$polarity <- (IRAdf_nrc$positive - IRAdf_nrc$negative)/(IRAdf_nrc$positive + IRAdf_nrc$negative)
IRAdf_nrc$polarity[(IRAdf_nrc$positive + IRAdf_nrc$negative) == 0] <- 0
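
The polarity score computed here is (positive - negative) / (positive + negative), with documents that contain no positive or negative matches set to 0 instead of being left undefined. A minimal base-R sketch on made-up counts follows; the helper name `polarity_score` is hypothetical and is not part of the original code.

```r
# Hypothetical helper: polarity in [-1, 1], 0 when there are no sentiment words
polarity_score <- function(positive, negative) {
  total <- positive + negative
  ifelse(total == 0, 0, (positive - negative) / total)
}

# Made-up counts for four documents:
# all positive, all negative, balanced, and no matches at all
pos <- c(1, 0, 2, 0)
neg <- c(0, 1, 2, 0)
scores <- polarity_score(pos, neg)
print(scores)
```

A score of 1 means only positive words matched, -1 means only negative words matched, and 0 covers both balanced documents and documents with no matches.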
ggplot(IRAdf_nrc) +
  geom_histogram(aes(x = polarity))
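# call reconstructed from a garbled line (assumed); IRA_text_df is not defined in this listing
IRAdf_nrc_CBIND <- as.data.frame(cbind(IRAdf_nrc, IRA_text_df))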
IRAdf_nrc_CBIND <- as.character(IRAdf_nrc_CBIND)
# NEW CORPUS with polarity scores
IRApolarity_corpus <- corpus(IRAdf_nrc_CBIND)
writeLines(head(IRA_corpus[which(IRAdf_nrc$polarity == 1)]))
# APPLY DICTIONARY within context
# tokenize corpus
IRAtokens <- tokens(IRA_corpus, remove_punct = TRUE)
# what are the context (target) words or phrases
IRA_words <- c("inflation","POTUS", "price*","joe", "biden", "trump","medicare","drug","cost","america*","won","lost")
# retain only our tokens and their context
IRAtokens_HC <- tokens_keep(IRAtokens, pattern = phrase(IRA_words), window = 40)
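
Keeping tokens "in context" means retaining each target word together with a fixed number of neighbouring tokens on either side, and discarding everything else. The base-R sketch below illustrates the idea on a made-up token vector. The helper `keep_window` is hypothetical and much simpler than quanteda's `tokens_keep()`: it does exact matching only, with no glob patterns or phrases.

```r
# Hypothetical helper: keep target tokens plus `window` neighbours on each side
keep_window <- function(toks, targets, window = 2) {
  hits <- which(tolower(toks) %in% targets)
  if (length(hits) == 0) return(character(0))
  # Build the index range around every hit, clipped to the vector bounds
  keep <- unlist(lapply(hits, function(i) {
    max(1, i - window):min(length(toks), i + window)
  }))
  toks[sort(unique(keep))]
}

# Made-up tokenized reply
toks <- c("the", "new", "law", "lowers", "drug", "prices",
          "for", "seniors", "on", "medicare", "today")

kept <- keep_window(toks, targets = c("drug"), window = 1)
print(kept)
```

With a window as wide as 40 and short tweet replies, most replies that mention a target word are kept almost in full, so the window mainly filters out replies that mention none of the target words.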
# keep only the first two keys (negative, positive) of the Lexicoder dictionary
IRAdata_dictionary_LSD2015_pos_neg <- data_dictionary_LSD2015[1:2]
IRAtokens_HC_lsd <- tokens_lookup(IRAtokens_HC,
                                  dictionary = IRAdata_dictionary_LSD2015_pos_neg)
# convert to dfm
IRAdfm_HC <- dfm(IRAtokens_HC_lsd)
head(IRAdfm_HC, 10)
# convert to data frame
IRAmat_HC <- convert(IRAdfm_HC, to = "data.frame")
# drop rows where both features are 0 (logical indexing is safe even when no row qualifies)
IRAmat_HC <- IRAmat_HC[(IRAmat_HC$negative + IRAmat_HC$positive) > 0, ]
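
A general note on dropping rows by condition in R: negative indexing with `which()` has an edge case. If no row meets the condition, `-which(...)` evaluates to an empty index and every row is dropped, whereas a logical index keeps all rows as intended. The sketch below shows the difference on a made-up data frame; the object names are illustrative only.

```r
# Made-up counts where every row has at least one sentiment word
counts <- data.frame(negative = c(1, 0, 2), positive = c(0, 3, 1))
empty <- (counts$negative + counts$positive) == 0

# which(empty) is integer(0); negating it still selects nothing
dropped_with_which <- counts[-which(empty), ]

# Logical indexing keeps all three rows
dropped_with_logical <- counts[!empty, ]

print(nrow(dropped_with_which))
print(nrow(dropped_with_logical))
```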
# print a little summary info
paste("We have ", nrow(IRAmat_HC), " tweet replies that mention positive or negative words in the context of inflation terms.", sep = "")
# create polarity scores
IRAmat_HC$polarity <- (IRAmat_HC$positive - IRAmat_HC$negative)/(IRAmat_HC$positive + IRAmat_HC$negative)
# summary (call assumed; only the comment survives in this listing)
summary(IRAmat_HC$polarity)
# plot
ggplot(IRAmat_HC) +
  geom_histogram(aes(x = polarity))
