Blog Post 6: Structural Topic Modeling

BlogPost6

Andrea Mah

Author

Andrea Mah

Published

December 2, 2022

Based on suggestions from Prof Song,I used strucutural topic modelling on my data. Structural topic models should allow me to test the association between topics and document metadata. In my case, I have three metadata variables to look at: year of speech, Climate Risk Index (CRI) and income level of a given country. To do this, I planned to use the stm package, using the Roberts et al. paper as a guide.

Code

#strucutral topic modeling
#loading in nececssary libraries
library(quanteda)

Package version: 3.2.3
Unicode version: 13.0
ICU version: 69.1

Parallel computing: 8 of 8 threads used.

See https://quanteda.io for tutorials and examples.

Code

library(tidyr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Code

library(ggplot2)

Warning: package 'ggplot2' was built under R version 4.2.2

Code

library(text2vec)

Warning: package 'text2vec' was built under R version 4.2.2

Code

library(stopwords)
library(RCurl)


Attaching package: 'RCurl'

The following object is masked from 'package:tidyr':

    complete

Code

#install.packages('stm')
library(stm)

Warning: package 'stm' was built under R version 4.2.2

stm v1.3.6 successfully loaded. See ?stm for help. 
 Papers, resources, and other materials at structuraltopicmodel.com

Code

#load in data/ 
speech.meta <- read.csv('FINAL_combined_meta-text_dataset.csv')

Warning in file(file, "rt"): cannot open file 'FINAL_combined_meta-
text_dataset.csv': No such file or directory

Error in file(file, "rt"): cannot open the connection

Code

speech.meta$income <- as.factor(speech.meta$income)

Error in is.factor(x): object 'speech.meta' not found

Code

levels(speech.meta$income)

Error in levels(speech.meta$income): object 'speech.meta' not found

Code

speech.meta$incNum <- recode(speech.meta$income, "High income" = 4, "Upper middle income" = 3, "Lower middle income" = 2, "Low income" = 1)

Error in recode(speech.meta$income, `High income` = 4, `Upper middle income` = 3, : object 'speech.meta' not found

Code

speech.meta$incNum

Error in eval(expr, envir, enclos): object 'speech.meta' not found

After loading in the data, and creating a numeric income metadata variable (although it isn’t a true continuous scale, it helps with interpretation to use this kind of ordinal recoding to understand, roughly, the relationships between income levevl and topics…) I started processing the data for use with stm package.

Code

#process the data for use with stm package
processed <- textProcessor(speech.meta$text_field, metadata = speech.meta)

Error in textProcessor(speech.meta$text_field, metadata = speech.meta): object 'speech.meta' not found

Code

out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

Error in prepDocuments(processed$documents, processed$vocab, processed$meta): object 'processed' not found

Code

docs <- out$documents

Error in eval(expr, envir, enclos): object 'out' not found

Code

vocab <- out$vocab

Error in eval(expr, envir, enclos): object 'out' not found

Code

meta <- out$meta

Error in eval(expr, envir, enclos): object 'out' not found

After doing this initial processing, I realized I needed to limit my corpus to documents which had my covariates… thus:

Code

#process the data for use with stm package, limiting to only speeches
#from countries with an associated CRI
speech.meta <- subset(speech.meta, !(is.na(CRI)))

Error in subset(speech.meta, !(is.na(CRI))): object 'speech.meta' not found

Code

speech.meta.cri <- subset(speech.meta, !(is.na(incNum)))

Error in subset(speech.meta, !(is.na(incNum))): object 'speech.meta' not found

Code

tail(speech.meta.cri)

Error in tail(speech.meta.cri): object 'speech.meta.cri' not found

Code

processed <- textProcessor(speech.meta.cri$text_field, metadata = speech.meta.cri)

Error in textProcessor(speech.meta.cri$text_field, metadata = speech.meta.cri): object 'speech.meta.cri' not found

Code

out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

Error in prepDocuments(processed$documents, processed$vocab, processed$meta): object 'processed' not found

Code

docs <- out$documents

Error in eval(expr, envir, enclos): object 'out' not found

Code

vocab <- out$vocab

Error in eval(expr, envir, enclos): object 'out' not found

Code

meta <- out$meta

Error in eval(expr, envir, enclos): object 'out' not found

Next, I tried running strucutral topic models using a variety of the methods presented in the Roberts paper:

Code

stm.cri <- stm(documents = out$documents, vocab = out$vocab,
               K = 20, prevalence = ~ CRI +income + year, data = out$meta,
               init.type = "Spectral")

Error in asSTMCorpus(documents, vocab, data): object 'out' not found

Code

stm.cri.Select <- selectModel(documents = out$documents, vocab = out$vocab,
               K = 20, prevalence = ~ CRI + year + income, data = out$meta,
               runs = 20, seed = 11112)

Casting net 
1 models in net

Error in asSTMCorpus(documents, vocab, data): object 'out' not found

Code

#plot the models
plotModels(stm.cri.Select, pch = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

Error in plotModels(stm.cri.Select, pch = c(1, 2, 3, 4, 5, 6, 7, 8, 9)): object 'stm.cri.Select' not found

Code

selected.cri <- stm.cri.Select$runout[[1]]

Error in eval(expr, envir, enclos): object 'stm.cri.Select' not found

Code

summary(selected.cri)

Error in summary(selected.cri): object 'selected.cri' not found

I used the searchK function to compare models with different numbers of topics. Based on my previous exploration and use of LDA, I thought that k between 3 and 10 would likely be sufficient…

#SearchK function to determine  potential number of topics
#I wanted to test between 3 and 10 topics. I think 20 (which I looked at
#in the select function) was too many topics.
storage <- searchK(out$documents, out$vocab, K = c(3:10), 
                   prevalence =~ CRI + year + income, data = meta)

#now I want to examine the results of the search K... 
plot.searchK(storage)

results <- as.data.frame(storage$results)

View(results)

After running this, I was= looking for the following criteria: #higher semantic coherene #high held-out likelihood #low residual #low lower bound

iven the plots, I think that maybe a model with three, four or five topics could be best? ok, so given that I am thinking that three or four or five topics seems to fit criteria

I then ran models with each number of topics.

Code

stm.cri3 <- stm(documents = out$documents, vocab = out$vocab,
                K = 3, prevalence =~ CRI + year + income, data = out$meta,
                init.type = "Spectral")

Error in asSTMCorpus(documents, vocab, data): object 'out' not found

Code

stm.cri4 <- stm(documents = out$documents, vocab = out$vocab,
                K = 4, prevalence = ~ CRI + year + incNum, data = out$meta,
                init.type = "Spectral")

Error in asSTMCorpus(documents, vocab, data): object 'out' not found

Code

stm.cri5 <- stm(documents = out$documents, vocab = out$vocab,
                K = 5, prevalence = ~ CRI + year + income, data = out$meta,
                init.type = "Spectral")

Error in asSTMCorpus(documents, vocab, data): object 'out' not found

Then, I wanted to think about labeling these… After looking at three, four, and five-topic models, I decided to proceed with the four-topic model.

Labels based on FREX were a bit strange…Mostly country names. #1: nepal_namibia_malawi_zambia_vietnam #2: bonn_group_complianc_bueno_china #3: tuvalu_copenhagen_island_australia_barbado #4: comment_geneva_consumpt_albania_effici

perhaps labels with probability are better: #1: climat, chang, develop, countri, adapt #2: parti, convent, develop, countri, protocol #3: climate, will, chang, countri, must, develop #4: energi, develop, countri, emiss, climat

Code

labelTopics(stm.cri3)

Error in labelTopics(stm.cri3): object 'stm.cri3' not found

Code

sageLabels(stm.cri3)

Error in sageLabels(stm.cri3): object 'stm.cri3' not found

Code

labelTopics(stm.cri4)

Error in labelTopics(stm.cri4): object 'stm.cri4' not found

Code

labelTopics(stm.cri5)

Error in labelTopics(stm.cri5): object 'stm.cri5' not found

Code

#Topic 1 Top Words:
 # Highest Prob: climat, chang, develop, countri, presid, adapt, nation 
#FREX: napa, namibia, malawi, nepal, zambia, vietnam, african 
#Topic 2 Top Words:
 # Highest Prob: parti, develop, convent, protocol, countri, kyoto, presid 
#FREX: bonn, group, bueno, china, complianc, protocol, kyoto 
#Topic 3 Top Words:
 # Highest Prob: develop, climat, chang, countri, technolog, will, emiss 
#FREX: vision, poznan, market, australia, partnership, key, invest 
#Topic 4 Top Words:
 # Highest Prob: countri, emiss, energi, develop, climat, convent, chang 
#FREX: latvia, turkey, geneva, comment, agbm, lithuania, berlin 
#Topic 5 Top Words:
#  Highest Prob: will, climat, world, countri, chang, must, island 
#FREX: tuvalu, barbado, solomon, island, fiji, reef, children 

sageLabels(stm.cri4)

Error in sageLabels(stm.cri4): object 'stm.cri4' not found

Next I plotted more graphs looking at topic proportion:

Code

#getting the graphical display of estimated topic proportions
plot(stm.cri4, type = "summary")

Error in plot(stm.cri4, type = "summary"): object 'stm.cri4' not found

Proceeding with 4 topics, I wanted to look at my covariates in relation to topics. As a reminder, these covariates were income classification of the country, climate risk index, and year of speech.

Code

#Checking
prep <- estimateEffect(1:4 ~ CRI + year + incNum, stm.cri4,
                       meta = out$meta, uncertainty = "Global")

Error in estimateEffect(1:4 ~ CRI + year + incNum, stm.cri4, meta = out$meta, : object 'out' not found

Code

#looking at the relationship between covariates and these four topics;
summary(prep, topics = c(1,2,3,4))

Error in summary(prep, topics = c(1, 2, 3, 4)): object 'prep' not found

Interestingly, CRI did not seem to be a predictor of any topic. year and income were. I looked at a plot of income plotted against topic prevalence: #1 = low income, 2 = lower middle income, 3 = upper middle income, 4 = high income

Code

plot(prep, covariate = "incNum", topics = c(1:4), model = stm.cri4)

Error in plot(prep, covariate = "incNum", topics = c(1:4), model = stm.cri4): object 'prep' not found

Not completely sure I set this up right, but here I looked at expected topic proportion over time (by year) Strangely, topic 2 decreased over time, while 4 increased. Topic 1 remained relatively stable, Topic 2 slightly increased

Code

plot(prep, "year", method = "continuous", topics = c(1:4))

Error in plot(prep, "year", method = "continuous", topics = c(1:4)): object 'prep' not found

To remind myself of topic 2, I looked at the top words…

Code

#Topic 2:
labelTopics(stm.cri4)

Error in labelTopics(stm.cri4): object 'stm.cri4' not found

Code

#Topic 2 Top Words:
#Highest Prob: countri, develop, emiss, convent, climat, chang, energi 
#FREX: kazakhstan, romania, croatia, latvia, berlin, japan, geneva 
#Lift: absorpt, achi, acti, additon, adriat, aerosol, afterward 
#Score: romania, berlin, latvia, croatia, energi, implement, aij

To compare this topic to others, I used topical contrasts between topic 2 and other topics

Code

plot(stm.cri4, type = "perspectives", topics = c(1,2), xlim = c(0.1, 0.9))

Error in plot(stm.cri4, type = "perspectives", topics = c(1, 2), xlim = c(0.1, : object 'stm.cri4' not found

Code

plot(stm.cri4, type = "perspectives", topics = c(2,3), xlim = c(0.1, 0.9))

Error in plot(stm.cri4, type = "perspectives", topics = c(2, 3), xlim = c(0.1, : object 'stm.cri4' not found

Code

plot(stm.cri4, type = "perspectives", topics = c(2,4), xlim = c(0.1, 0.9))

Error in plot(stm.cri4, type = "perspectives", topics = c(2, 4), xlim = c(0.1, : object 'stm.cri4' not found

Finally, I looked at word clouds for each topic. I was hoping I could use the package grid or gridExtra to combine these four into one graphic, but I guess they were not the right kind of object.

Code

top1 <- cloud(stm.cri4, topic = 1, scale = c(3, .3))

Error in cloud(stm.cri4, topic = 1, scale = c(3, 0.3)): object 'stm.cri4' not found

Code

top2 <- cloud(stm.cri4, topic = 2, scale = c(3, .3))

Error in cloud(stm.cri4, topic = 2, scale = c(3, 0.3)): object 'stm.cri4' not found

Code

top3 <- cloud(stm.cri4, topic = 3, scale = c(3, .3))

Error in cloud(stm.cri4, topic = 3, scale = c(3, 0.3)): object 'stm.cri4' not found

Code

top4 <- cloud(stm.cri4, topic = 4, scale = c(3, .3))

Error in cloud(stm.cri4, topic = 4, scale = c(3, 0.3)): object 'stm.cri4' not found

I think this was a very interesting exercise, but I am definitely lacking confidence in interpretation of various outputs from structural topic modeling. I want to spend more time working with the package in the future. From these explorations, I don’t know that I have a good understanding of waht characterizes each of the four extracted topics. I think I would need to spend more time closely examining each one to identify what the latent ideas are for each topic.