knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
Rhowena Vespa
October 16, 2022
This third blog post applies the text representation concepts from week 6. Using tweet replies as the corpus, I build a document-feature matrix and a term co-occurrence matrix to better understand the relationships among the texts. I also apply TF-IDF (term frequency-inverse document frequency) to rank words by how distinctive they are across replies, not only by raw frequency. Finally, I visualize the semantic network to see which words co-occur with one another.
#Tokenization and stemming
I experimented with keyword-in-context searches and compared the results for the keywords “inflation”, “pharma”, and “price”.
Keyword-in-context with 5 matches.
[text2, 10] see those cheap | price | on my prescript
[text22, 16] the US allow | price | goug. That
[text49, 6] that these lower | price | aren't lower price
[text49, 9] price aren't lower | price | ?? You'r
[text117, 18] HIGH. Gas | price | in CA are
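To make the keyword-in-context idea concrete, here is a minimal base-R sketch of what such a search does: find each position of a keyword and return a window of tokens on either side. The toy replies, the `kwic_sketch` helper, and the whitespace tokenizer are all assumptions for illustration; this is not the function that produced the output above, and it does no stemming.

```r
# Toy replies; hypothetical text, not taken from the tweet corpus.
replies <- c(
  text1 = "when do i get to see those cheap prices",
  text2 = "gas prices are too high and food prices keep rising"
)

# Tokenize on whitespace after lower-casing (a crude stand-in for a real tokenizer).
toks <- strsplit(tolower(replies), "\\s+")
names(toks) <- names(replies)

# Return one row per match of `keyword`, with up to `window` tokens of context per side.
kwic_sketch <- function(toks, keyword, window = 3) {
  rows <- list()
  for (doc in names(toks)) {
    words <- toks[[doc]]
    for (pos in which(words == keyword)) {
      left  <- words[seq_len(pos - 1)]
      right <- words[seq_len(length(words) - pos) + pos]
      rows[[length(rows) + 1]] <- data.frame(
        docname  = doc,
        position = pos,
        pre      = paste(tail(left, window), collapse = " "),
        keyword  = keyword,
        post     = paste(head(right, window), collapse = " "),
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

hits <- kwic_sketch(toks, "prices", window = 3)
print(hits)
```

Because the match is exact, "price" would not find "prices" here; stemming the tokens first is what lets a single keyword like "price" catch both forms.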
#TF-IDF: first, rank words by raw term frequency
big inflation pharma n people biden lost american
1307 1225 1174 873 822 485 467 456
act will
438 433
#rank words by TF-IDF
inflation big pharma n people biden lost american
1199.1418 1193.4589 1108.9193 1058.6552 901.0779 648.4959 618.0324 611.9988
will act
609.2498 599.6485
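The two rankings differ because TF-IDF down-weights words that appear in many documents. The base-R sketch below works through that on a made-up four-document corpus, using raw counts for term frequency and a base-10 log for inverse document frequency. Those weighting choices and the toy data are assumptions for illustration; they are not necessarily the settings behind the numbers above.

```r
# Toy corpus; hypothetical replies, not the tweet data.
docs <- list(
  d1 = c("big", "pharma", "inflation"),
  d2 = c("big", "pharma", "big"),
  d3 = c("big", "inflation", "inflation", "act"),
  d4 = c("big", "people")
)

# Document-feature matrix: one row per document, one column per word, cells are counts.
vocab <- sort(unique(unlist(docs)))
dfm_toy <- t(sapply(docs, function(words) table(factor(words, levels = vocab))))

# Corpus-wide term frequency, and document frequency (number of documents with the word).
tf_total <- colSums(dfm_toy)
doc_freq <- colSums(dfm_toy > 0)

# Inverse document frequency, then TF-IDF per cell: count * log10(N / doc_freq).
idf <- log10(nrow(dfm_toy) / doc_freq)
tfidf <- sweep(dfm_toy, 2, idf, `*`)

rank_by_count <- sort(tf_total, decreasing = TRUE)
rank_by_tfidf <- sort(colSums(tfidf), decreasing = TRUE)
print(rank_by_count)
print(round(rank_by_tfidf, 3))
```

In this toy corpus "big" is the most frequent word but occurs in every document, so its IDF is log10(4/4) = 0 and it drops to the bottom, while "inflation" moves to the top. The real data shows a milder version of the same kind of swap between "big" and "inflation".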
#Keyness analysis as an alternative to TF-IDF
I grouped the responses by the possibly_sensitive flag on the tweet replies. Keyness then identifies the “keywords” of the possibly_sensitive = TRUE replies (the target group) by comparing their word frequencies against a reference group of possibly_sensitive = FALSE replies.
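# Tokenize, build a document-feature matrix, and pool replies by the possibly_sensitive flag
IRA_dfm <- tokens(IRA_corpus, remove_punct = TRUE, remove_numbers = TRUE,
                  remove_symbols = TRUE, remove_url = TRUE) %>%
  dfm() %>%
  dfm_group(groups = possibly_sensitive)
# Keyness of each feature in the TRUE group against the FALSE reference group (quanteda.textstats)
IRAkey <- textstat_keyness(IRA_dfm, target = "TRUE")
# Plot the strongest features of the target and reference groups (quanteda.textplots)
textplot_keyness(IRAkey,
                 show_reference = TRUE,
                 show_legend = TRUE,
                 n = 20L,
                 min_count = 2L,
                 margin = 0.05,
                 color = c("darkblue", "red"),
                 labelcolor = "gray30",
                 labelsize = 4,
                 font = NULL)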
[1] "the" "you" "to" "and" "a" "is"
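For intuition about what a keyness score measures, the sketch below computes a signed chi-squared statistic for a single word from a 2x2 table of counts, using base R's `chisq.test()`. The counts are invented, and the choice of chi-squared with Yates correction is an assumption about the measure (as I understand it, that is the default in `textstat_keyness()`, but the sketch does not call that function).

```r
# Hypothetical counts for one word in the two groups; not taken from the tweet data.
word_target     <- 30    # occurrences of the word in possibly_sensitive = TRUE replies
word_reference  <- 40    # occurrences in possibly_sensitive = FALSE replies
total_target    <- 1000  # all tokens in the target group
total_reference <- 4000  # all tokens in the reference group

# 2x2 contingency table: group (rows) by word vs. all other tokens (columns).
tab <- matrix(
  c(word_target,    total_target - word_target,
    word_reference, total_reference - word_reference),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("target", "reference"), token = c("word", "other"))
)

test <- chisq.test(tab, correct = TRUE)  # Yates continuity correction

# Sign the statistic: positive when the word is over-represented in the target group.
direction <- if (tab["target", "word"] > test$expected["target", "word"]) 1 else -1
keyness <- direction * unname(test$statistic)
print(keyness)
```

Here the word would be expected 14 times in the target group if both groups used it at the same rate, but it appears 30 times, so the score is positive. Very common words such as "the" or "you" can still score highly this way, which is one reason to consider removing stopwords before a keyness comparison.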
#Feature co-occurrence matrix of hashtags
Feature co-occurrence matrix of: 6 by 11,231 features.
features so when do i get to see those cheap prices
so 119805 172481 222460 433651 143080 1315660 87710 55862 4900 133770
when 0 61776 159808 311521 102784 945130 63008 40130 3520 96096
do 0 0 102831 401790 132568 1218990 81266 51756 4540 123942
i 0 0 0 391170 258420 2376235 158415 100892 8850 241605
get 0 0 0 0 42486 784020 52268 33288 2920 79716
to 0 0 0 0 0 3603315 480615 306110 26850 733005
[ reached max_nfeat ... 11,221 more features ]
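A co-occurrence matrix can be derived directly from a document-feature matrix. The base-R sketch below uses a simplified counting rule: a pair of words co-occurs once for every document that contains both. The toy documents are made up, and this Boolean rule is an assumption chosen for clarity; it will not reproduce the much larger frequency-based counts in the output above.

```r
# Toy corpus; hypothetical replies, not the tweet data.
docs <- list(
  d1 = c("when", "do", "i", "get", "cheap", "prices"),
  d2 = c("cheap", "prices", "now"),
  d3 = c("i", "get", "prices")
)
vocab <- unique(unlist(docs))

# Boolean document-feature matrix: 1 if the word occurs in the document, else 0.
present <- t(sapply(docs, function(words) as.integer(vocab %in% words)))
colnames(present) <- vocab

# crossprod(X) is t(X) %*% X: cell [a, b] counts the documents containing both a and b.
cooc <- crossprod(present)
diag(cooc) <- 0             # ignore a word's co-occurrence with itself
cooc[lower.tri(cooc)] <- 0  # keep only the upper triangle, like the output above
print(cooc)
```

Each unordered pair is stored once, in the upper triangle, which is why the lower triangle of the printed matrix is all zeros.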
#Visualization of semantic network based on hashtag co-occurrence
# Keep only the top hashtags, then draw their co-occurrence network
IRAtopgat_fcm <- fcm_select(IRAtag_fcm, pattern = IRAtoptag)
textplot_network(IRAtopgat_fcm, min_freq = 0.5,
                 omit_isolated = TRUE,
                 edge_color = "#1F78B4",
                 edge_alpha = 0.5,
                 edge_size = 2,
                 vertex_color = "#4D4D4D",
                 vertex_size = 2,
                 vertex_labelcolor = NULL,
                 vertex_labelfont = NULL,
                 vertex_labelsize = 5,
                 offset = NULL)
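A network plot of this kind is essentially a filtered edge list: each co-occurring pair becomes an edge, weak edges are dropped, and nodes left without any edge can be omitted. The base-R sketch below shows that filtering on invented hashtag counts. The hashtags, the counts, and the raw-count cut-off `min_count` are assumptions for illustration; the call above passes `min_freq = 0.5`, which may be interpreted as a proportion, not a count.

```r
# Hypothetical symmetric hashtag co-occurrence counts; not the tweet data.
tags <- c("#inflation", "#bigpharma", "#ira", "#gas")
cooc <- matrix(0, 4, 4, dimnames = list(tags, tags))
cooc["#inflation", "#bigpharma"] <- 12
cooc["#inflation", "#ira"] <- 7
cooc["#bigpharma", "#ira"] <- 1
cooc <- cooc + t(cooc)

# One row per unordered pair, read from the upper triangle.
idx <- which(upper.tri(cooc), arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(cooc)[idx[, "row"]],
  to     = colnames(cooc)[idx[, "col"]],
  weight = cooc[idx],
  stringsAsFactors = FALSE
)

# Keep edges at or above the cut-off, then find hashtags with no remaining edge.
min_count <- 5  # hypothetical cut-off
edges <- edges[edges$weight >= min_count, ]
connected <- union(edges$from, edges$to)
isolated <- setdiff(tags, connected)
print(edges)
print(isolated)
```

With this cut-off only the two strongest pairs survive, and "#gas" is isolated, so a plot that omits isolated nodes would not show it at all.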
