Blog Post 4 (Arlnow covid articles)

Author

Miranda Manka

Published

October 28, 2022

Code
library(tidyverse)
library(quanteda)
library(quanteda.textplots)
library(tidytext)
library(plyr)
library(ggplot2)
library(devtools)
devtools::install_github("kbenoit/quanteda.dictionaries")
library(quanteda.dictionaries)
remotes::install_github("quanteda/quanteda.sentiment")
library(quanteda.sentiment)

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Data

The dataset contains 550 different articles from ARLnow (a local news site in Northern Virginia) published between March 2020 and September 2022. I decided to go back to March 2020 because that is when COVID was officially declared a pandemic in the U.S. I may scrape more articles to cover the months before March. I may also try to find a similar site for another county or city in another state so I can compare the two and see similarities and differences.

Here I read in the data from the CSV I created, drop the extra index column that write.csv() added, and rename a column. I also removed articles whose title contained “Morning Notes,” since those are recaps rather than full articles.

Code
# read in the scraped articles
arlnow_covid = read_csv("_data/arlnow_covid_posts.csv", col_names = TRUE, show_col_types = FALSE)
# drop the extra index column created by write.csv()
arlnow_covid = subset(arlnow_covid, select = -c(1))
# rename the raw text column
arlnow_covid = dplyr::rename(arlnow_covid, text_field = raw_text)
# drop the "Morning Notes" recap posts
arlnow_covid = arlnow_covid[!grepl("Morning Notes", arlnow_covid$header_text), ]

Analysis

Most of this analysis follows the week 8 tutorial we were given. I found it very helpful, and I want to note that a lot of the code and information in this post comes from it.

Code
# build the corpus, a summary of it, and a punctuation-free token set
arlnow_covid_corpus = corpus(arlnow_covid, docid_field = "doc_id", text_field = "text_field")
arlnow_covid_summary = summary(arlnow_covid_corpus)
arlnow_covid_corpus_tokens = tokens(arlnow_covid_corpus, remove_punct = T)

Dictionary Analysis

The basic idea behind a dictionary analysis is to identify a set of words that connect to a certain concept and to count the frequency of that set of words within each document. The liwcalike() function takes a corpus or character vector and carries out an analysis, based on a provided dictionary, that mimics the software LIWC (Linguistic Inquiry and Word Count). LIWC calculates the percentage of each document that reflects a host of different characteristics.
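To make the mechanics concrete, here is a small toy example (separate from the actual analysis) of how a dictionary lookup counts matching words in a made-up sentence; liwcalike() then reports counts like these as percentages of each document’s total words. The dictionary and sentence below are placeholders for illustration only.

Code
# toy dictionary with two made-up categories
toy_dict = dictionary(list(positive = c("good", "great"),
                           negative = c("bad", "worse")))
toy_text = "The rollout was good, even great, though the weather was bad."
# count how many tokens in the sentence fall into each category
tokens(toy_text, remove_punct = TRUE) %>%
  tokens_lookup(dictionary = toy_dict) %>%
  dfm()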

Code
# use liwcalike() to estimate sentiment using NRC dictionary
reviewSentiment_nrc = liwcalike(arlnow_covid_corpus, data_dictionary_NRC)

names(reviewSentiment_nrc)
 [1] "docname"      "Segment"      "WPS"          "WC"           "Sixltr"      
 [6] "Dic"          "anger"        "anticipation" "disgust"      "fear"        
[11] "joy"          "negative"     "positive"     "sadness"      "surprise"    
[16] "trust"        "AllPunc"      "Period"       "Comma"        "Colon"       
[21] "SemiC"        "QMark"        "Exclam"       "Dash"         "Quote"       
[26] "Apostro"      "Parenth"      "OtherP"      

I think this could be interesting, but I am not sure how helpful it is for my own research questions, since I don’t know how well these built-in categories map onto the ideas I want to study.

Looking at the most positive articles.

Code
ggplot(reviewSentiment_nrc) +
  geom_histogram(aes(x = positive)) +
  theme_bw()

Code
#Based on that, let's look at those that are out in the right tail (i.e., which are greater than 8, most positive)
arlnow_covid_corpus[which(reviewSentiment_nrc$positive > 8)]
Corpus consisting of 23 documents and 3 docvars.
text59 :
", Elementary-school-aged children will soon be able to get t..."

text65 :
", This sponsored column is by James Montana, Esq., Doran She..."

text89 :
"Members of Grace Community Church in Arlington honored thous..."

text123 :
", This column is written and sponsored by Arlington Arts/Arl..."

text135 :
", Peter’s Take is a weekly opinion column. The views and opi..."

text140 :
", Progressive Voice is a bi-weekly opinion column. The views..."

[ reached max_ndoc ... 17 more documents ]

Most articles score around 3-7 on the positive measure, and the highest is above 10.

Looking at the most negative articles.

Code
ggplot(reviewSentiment_nrc) +
  geom_histogram(aes(x = negative)) +
  theme_bw()

Code
arlnow_covid_corpus[which(reviewSentiment_nrc$negative > 4)]
Corpus consisting of 25 documents and 3 docvars.
text7 :
"Don’t look now but Covid cases are declining in Arlington., ..."

text53 :
"A post-Thanksgiving rise in Covid cases in Arlington appears..."

text124 :
", Progressive Voice is a bi-weekly opinion column. The views..."

text125 :
"Health Matters is a biweekly opinion column. The views expre..."

text127 :
", (Updated at 8:20 p.m.) The chairman of the Arlington GOP h..."

text141 :
", This is a sponsored column by attorneys John Berry and Kim..."

[ reached max_ndoc ... 19 more documents ]

Most articles score around 1-3 on the negative measure; the highest is above 6 and the lowest is 0.

These alone may not be the best indicators, though; a combined polarity measure may be better.

Code
reviewSentiment_nrc$polarity = reviewSentiment_nrc$positive - reviewSentiment_nrc$negative

ggplot(reviewSentiment_nrc) +
  geom_histogram(aes(polarity)) +
  theme_bw()

Code
arlnow_covid_corpus[which(reviewSentiment_nrc$polarity < 0)]
Corpus consisting of 29 documents and 3 docvars.
text2 :
"(Updated at 9:50 a.m.) Covid cases have held relatively stea..."

text7 :
"Don’t look now but Covid cases are declining in Arlington., ..."

text12 :
"The stock market drop aside, some other falling figures in A..."

text16 :
"Arlington County Board member Libby Garvey is quarantining i..."

text25 :
"For the last two months, Arlington County has been getting y..."

text37 :
"The average rate of new Covid cases in Arlington has fallen ..."

[ reached max_ndoc ... 23 more documents ]

Most articles fall around 0-4 on polarity, with the lowest close to -4 and the highest almost 12.

Using Dictionaries with DFMs

For the dfm I am including most of the same preprocessing I did in the last post.

Code
# create a full dfm for comparison
arlnow_covid_dfm = tokens(arlnow_covid_corpus,
                                    remove_punct = TRUE,
                                    remove_symbols = TRUE,
                                    remove_numbers = TRUE) %>%
                           dfm(tolower = TRUE) %>%
                           dfm_remove(stopwords('english')) %>%
                           dfm_remove(c("arlington", "county", "virginia", "$"))

head(arlnow_covid_dfm, 10)
Document-feature matrix of: 10 documents, 13,682 features (98.79% sparse) and 3 docvars.
       features
docs    hundred parents say public schools prioritize recreating pre-covid
  text1       2       5   2      2       5          1          1         2
  text2       0       0   0      1       0          0          0         0
  text3       0       0   0      4       1          0          0         0
  text4       0       0   0      1       0          0          0         0
  text5       0       3   0      1       3          0          0         0
  text6       0       0   0      3       0          0          0         0
       features
docs    normalcy classroom
  text1        2         1
  text2        0         0
  text3        0         0
  text4        0         0
  text5        0         0
  text6        0         0
[ reached max_ndoc ... 4 more documents, reached max_nfeat ... 13,672 more features ]
Code
dim(arlnow_covid_dfm)
[1]   457 13682
Code
# convert corpus to dfm using the dictionary NRC
arlnow_covid_dfm_nrc = arlnow_covid_dfm %>%
                          dfm_lookup(data_dictionary_NRC)

dim(arlnow_covid_dfm_nrc)
[1] 457  10
Code
head(arlnow_covid_dfm_nrc, 10)
Document-feature matrix of: 10 documents, 10 features (9.00% sparse) and 3 docvars.
       features
docs    anger anticipation disgust fear joy negative positive sadness surprise
  text1     3           12       1    9   6        8       25       6        2
  text2     3            6       2    9   2       19       10      13        4
  text3     1            9       2    5   1        7       37       7        1
  text4     0            4       1    9   1       11       19      11        6
  text5     4           20       3   12   4       14       28       8        3
  text6     2           11       1   17   2       22       34      14        7
       features
docs    trust
  text1    26
  text2     6
  text3    10
  text4    15
  text5    26
  text6    10
[ reached max_ndoc ... 4 more documents ]
Code
class(arlnow_covid_dfm_nrc)
[1] "dfm"
attr(,"package")
[1] "quanteda"

I think this is getting a little closer to what I can use for my analysis. The emotion categories, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust, along with the positive and negative categories, are interesting in themselves. Looking at the emotions of the text and seeing how they change across articles over time is what I want to do for my next blog post.
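As a rough sketch of that next step, here is the kind of code I have in mind for plotting monthly emotion counts. This assumes the scraped data carries a publication date docvar; I call it post_date below, but that name is a placeholder and may not match my actual column. The explicit dplyr:: and tidyr:: prefixes are there to avoid the masking that comes from loading plyr earlier.

Code
# convert emotion counts to a data frame and attach the (assumed) date docvar
df_nrc_time = convert(arlnow_covid_dfm_nrc, to = "data.frame")
df_nrc_time$post_date = docvars(arlnow_covid_dfm_nrc)$post_date  # placeholder column name
df_nrc_time$month = lubridate::floor_date(as.Date(df_nrc_time$post_date), "month")

# reshape to long format and plot total counts per emotion per month
df_nrc_time %>%
  tidyr::pivot_longer(anger:trust, names_to = "emotion", values_to = "count") %>%
  dplyr::group_by(month, emotion) %>%
  dplyr::summarise(total = sum(count), .groups = "drop") %>%
  ggplot(aes(x = month, y = total, color = emotion)) +
  geom_line() +
  theme_bw()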

Next I’ll convert that to a data frame for further analysis, then create a polarity measure from the positive and negative counts.

Code
df_nrc = convert(arlnow_covid_dfm_nrc, to = "data.frame")
names(df_nrc)
 [1] "doc_id"       "anger"        "anticipation" "disgust"      "fear"        
 [6] "joy"          "negative"     "positive"     "sadness"      "surprise"    
[11] "trust"       
Code
df_nrc$polarity = (df_nrc$positive - df_nrc$negative)/(df_nrc$positive + df_nrc$negative)

df_nrc$polarity[(df_nrc$positive + df_nrc$negative) == 0] = 0

ggplot(df_nrc) +
  geom_histogram(aes(x = polarity)) +
  theme_bw()
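As a quick check of the formula: text1 has 25 positive and 8 negative NRC terms, so its polarity is (25 - 8)/(25 + 8) = 17/33 ≈ 0.52, which matches the nrc_polarity value shown for text1 in the merged table further down.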

A lot of the polarity scores are centered around 0.5, although some reach 1 and a few go to 0 or below.

Dictionary Comparison

There are multiple dictionaries that can be used, so it may be helpful to compare how they perform on this data.

Code
# convert corpus to DFM using the General Inquirer dictionary
arlnow_covid_dfm_geninq = arlnow_covid_dfm %>%
  dfm_lookup(data_dictionary_geninqposneg)

head(arlnow_covid_dfm_geninq, 6)
Document-feature matrix of: 6 documents, 2 features (0.00% sparse) and 3 docvars.
       features
docs    positive negative
  text1       23        9
  text2       17       14
  text3       19        8
  text4       12        5
  text5       36       17
  text6       20       18

I know that the choice of dictionary depends on the analysis, so I want to see how this one compares too.

Code
# create polarity measure for geninq
df_geninq = convert(arlnow_covid_dfm_geninq, to = "data.frame")
df_geninq$polarity = (df_geninq$positive - df_geninq$negative)/(df_geninq$positive + df_geninq$negative)
df_geninq$polarity[which((df_geninq$positive + df_geninq$negative) == 0)] = 0

# look at first few rows
head(df_geninq)
  doc_id positive negative   polarity
1  text1       23        9 0.43750000
2  text2       17       14 0.09677419
3  text3       19        8 0.40740741
4  text4       12        5 0.41176471
5  text5       36       17 0.35849057
6  text6       20       18 0.05263158

Combine all of these into a single data frame to see how well they match up.

Code
# create unique names for each data frame
colnames(df_nrc) = paste("nrc", colnames(df_nrc), sep = "_")
colnames(df_geninq) = paste("geninq", colnames(df_geninq), sep = "_")

# now let's compare our estimates
sent_df = merge(df_nrc, df_geninq, by.x = "nrc_doc_id", by.y = "geninq_doc_id")

head(sent_df)
  nrc_doc_id nrc_anger nrc_anticipation nrc_disgust nrc_fear nrc_joy
1      text1         3               12           1        9       6
2     text10         0                4           0        1       4
3    text100         0                7           0        3       3
4    text101         9               22           4        9      15
5    text102         0                5           1        2       3
6    text103         1                5           2        5       3
  nrc_negative nrc_positive nrc_sadness nrc_surprise nrc_trust nrc_polarity
1            8           25           6            2        26    0.5151515
2            3            8           2            0         6    0.4545455
3            5           19           1            1        14    0.5833333
4           20           60          11            6        41    0.5000000
5            4            9           2            0         2    0.3846154
6            7            9           4            0         3    0.1250000
  geninq_positive geninq_negative geninq_polarity
1              23               9       0.4375000
2               7               3       0.4000000
3              21              11       0.3125000
4              58              32       0.2888889
5              11               3       0.5714286
6              11               6       0.2941176

There are some differences between the NRC and General Inquirer polarity scores for the same documents.

Next, I check how well the different measures of polarity agree across the two approaches.

Code
cor(sent_df$nrc_polarity, sent_df$geninq_polarity)
[1] 0.4553892
Code
ggplot(sent_df, mapping = aes(x = nrc_polarity,
                              y = geninq_polarity)) +
  geom_point(alpha = 0.1) +
  geom_smooth() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_bw()

The correlation of about 0.46 is moderate, which suggests the two dictionaries only partly agree.
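One way to follow up on that (an exploratory sketch, not part of the tutorial workflow) is to sort the documents by the absolute difference between the two polarity scores and read the ones where the dictionaries disagree most.

Code
# absolute disagreement between the two polarity measures
sent_df$polarity_diff = abs(sent_df$nrc_polarity - sent_df$geninq_polarity)

# documents where the NRC and General Inquirer dictionaries disagree most
head(sent_df[order(-sent_df$polarity_diff),
             c("nrc_doc_id", "nrc_polarity", "geninq_polarity", "polarity_diff")])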

Apply Dictionary within Contexts

How is “vaccine” treated across the corpus of articles? Here I limit the tokens to just the vaccine-related terms (vax_words) and a window of words around them.

Code
# tokenize corpus
tokens_LMRD = tokens(arlnow_covid_corpus, remove_punct = TRUE)

# what are the context (target) words or phrases
vax_words = c("vaccine", "vaccinate", "vaccinated", "vax", "shot", "dose", "booster")

# retain only our tokens and their context
tokens_vax = tokens_keep(tokens_LMRD, pattern = phrase(vax_words), window = 40)

Pull out the positive and negative dictionaries and look for those within our token sets.

Code
data_dictionary_LSD2015_pos_neg = data_dictionary_LSD2015[1:2]

tokens_vax_lsd = tokens_lookup(tokens_vax,
                               dictionary = data_dictionary_LSD2015_pos_neg)

Convert this to a DFM.

Code
dfm_vax = dfm(tokens_vax_lsd)
head(dfm_vax, 10)
Document-feature matrix of: 10 documents, 2 features (65.00% sparse) and 3 docvars.
       features
docs    negative positive
  text1        0        0
  text2        2        5
  text3        6       10
  text4        0        0
  text5        0        2
  text6        6        5
[ reached max_ndoc ... 4 more documents ]

Drop the articles that did not feature any emotionally valenced words from the analysis, then take a look at the distribution.

Code
# convert to data frame
mat_vax = convert(dfm_vax, to = "data.frame")

# drop if both features are 0
mat_vax = mat_vax[-which((mat_vax$negative + mat_vax$positive)==0), ]

# print a little summary info
paste("We have ", nrow(mat_vax), " articles that mention positive or negative words in the context of vaccine terms.", sep = "")
[1] "We have 97 articles that mention positive or negative words in the context of vaccine terms."
Code
# create polarity scores
mat_vax$polarity = (mat_vax$positive - mat_vax$negative)/(mat_vax$positive + mat_vax$negative)

# summary
summary(mat_vax$polarity)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.0000 -0.1667  0.1667  0.1503  0.4667  1.0000 
Code
# plot
ggplot(mat_vax) + 
  geom_histogram(aes(x = polarity)) + 
  theme_bw()

I kept this analysis fairly general because I wasn’t sure whether or how I should build my own dictionary (I scheduled office hours to discuss this and will update). I still want to keep working on some past suggestions, like including another news source and looking at how specific words are connected and used (I did focus on vaccine-related words here), but I haven’t included them yet because I don’t know exactly what I want to do with them.
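In case I do end up building my own dictionary, here is the kind of starting point I have in mind. The categories and word lists below are placeholders I made up for illustration, not a finished dictionary, and kwic() is one way to look at how specific words are used in context.

Code
# a made-up covid dictionary with placeholder word lists (glob patterns)
covid_dict = dictionary(list(
  vaccination = c("vaccin*", "booster*", "dose*", "shot*"),
  restrictions = c("mask*", "lockdown*", "closure*", "distancing", "mandate*"),
  cases = c("case*", "hospitaliz*", "death*", "outbreak*", "variant*")
))

# apply the custom dictionary to the preprocessed dfm
head(dfm_lookup(arlnow_covid_dfm, covid_dict))

# keyword-in-context view of how "vaccine" is used across the articles
head(kwic(arlnow_covid_corpus_tokens, pattern = "vaccine", window = 8))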