Blog Post 6: Structural Topic Modeling

BlogPost6
Andrea Mah
Author

Andrea Mah

Published

December 2, 2022

Based on suggestions from Prof Song,I used strucutural topic modelling on my data. Structural topic models should allow me to test the association between topics and document metadata. In my case, I have three metadata variables to look at: year of speech, Climate Risk Index (CRI) and income level of a given country. To do this, I planned to use the stm package, using the Roberts et al. paper as a guide.

Code
#strucutral topic modeling
#loading in nececssary libraries
library(quanteda)
Package version: 3.2.3
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 8 of 8 threads used.
See https://quanteda.io for tutorials and examples.
Code
library(tidyr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.2.2
Code
library(text2vec)
Warning: package 'text2vec' was built under R version 4.2.2
Code
library(stopwords)
library(RCurl)

Attaching package: 'RCurl'
The following object is masked from 'package:tidyr':

    complete
Code
#install.packages('stm')
library(stm)
Warning: package 'stm' was built under R version 4.2.2
stm v1.3.6 successfully loaded. See ?stm for help. 
 Papers, resources, and other materials at structuraltopicmodel.com
Code
#load in data/ 
speech.meta <- read.csv('FINAL_combined_meta-text_dataset.csv')
Warning in file(file, "rt"): cannot open file 'FINAL_combined_meta-
text_dataset.csv': No such file or directory
Error in file(file, "rt"): cannot open the connection
Code
speech.meta$income <- as.factor(speech.meta$income)
Error in is.factor(x): object 'speech.meta' not found
Code
levels(speech.meta$income)
Error in levels(speech.meta$income): object 'speech.meta' not found
Code
speech.meta$incNum <- recode(speech.meta$income, "High income" = 4, "Upper middle income" = 3, "Lower middle income" = 2, "Low income" = 1)
Error in recode(speech.meta$income, `High income` = 4, `Upper middle income` = 3, : object 'speech.meta' not found
Code
speech.meta$incNum
Error in eval(expr, envir, enclos): object 'speech.meta' not found

After loading in the data, and creating a numeric income metadata variable (although it isn’t a true continuous scale, it helps with interpretation to use this kind of ordinal recoding to understand, roughly, the relationships between income levevl and topics…) I started processing the data for use with stm package.

Code
#process the data for use with stm package
processed <- textProcessor(speech.meta$text_field, metadata = speech.meta)
Error in textProcessor(speech.meta$text_field, metadata = speech.meta): object 'speech.meta' not found
Code
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
Error in prepDocuments(processed$documents, processed$vocab, processed$meta): object 'processed' not found
Code
docs <- out$documents
Error in eval(expr, envir, enclos): object 'out' not found
Code
vocab <- out$vocab
Error in eval(expr, envir, enclos): object 'out' not found
Code
meta <- out$meta
Error in eval(expr, envir, enclos): object 'out' not found

After doing this initial processing, I realized I needed to limit my corpus to documents which had my covariates… thus:

Code
#process the data for use with stm package, limiting to only speeches
#from countries with an associated CRI
speech.meta <- subset(speech.meta, !(is.na(CRI)))
Error in subset(speech.meta, !(is.na(CRI))): object 'speech.meta' not found
Code
speech.meta.cri <- subset(speech.meta, !(is.na(incNum)))
Error in subset(speech.meta, !(is.na(incNum))): object 'speech.meta' not found
Code
tail(speech.meta.cri)
Error in tail(speech.meta.cri): object 'speech.meta.cri' not found
Code
processed <- textProcessor(speech.meta.cri$text_field, metadata = speech.meta.cri)
Error in textProcessor(speech.meta.cri$text_field, metadata = speech.meta.cri): object 'speech.meta.cri' not found
Code
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
Error in prepDocuments(processed$documents, processed$vocab, processed$meta): object 'processed' not found
Code
docs <- out$documents
Error in eval(expr, envir, enclos): object 'out' not found
Code
vocab <- out$vocab
Error in eval(expr, envir, enclos): object 'out' not found
Code
meta <- out$meta
Error in eval(expr, envir, enclos): object 'out' not found

Next, I tried running strucutral topic models using a variety of the methods presented in the Roberts paper:

Code
stm.cri <- stm(documents = out$documents, vocab = out$vocab,
               K = 20, prevalence = ~ CRI +income + year, data = out$meta,
               init.type = "Spectral")
Error in asSTMCorpus(documents, vocab, data): object 'out' not found
Code
stm.cri.Select <- selectModel(documents = out$documents, vocab = out$vocab,
               K = 20, prevalence = ~ CRI + year + income, data = out$meta,
               runs = 20, seed = 11112)
Casting net 
1 models in net 
Error in asSTMCorpus(documents, vocab, data): object 'out' not found
Code
#plot the models
plotModels(stm.cri.Select, pch = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
Error in plotModels(stm.cri.Select, pch = c(1, 2, 3, 4, 5, 6, 7, 8, 9)): object 'stm.cri.Select' not found
Code
selected.cri <- stm.cri.Select$runout[[1]]
Error in eval(expr, envir, enclos): object 'stm.cri.Select' not found
Code
summary(selected.cri)
Error in summary(selected.cri): object 'selected.cri' not found

I used the searchK function to compare models with different numbers of topics. Based on my previous exploration and use of LDA, I thought that k between 3 and 10 would likely be sufficient…

#SearchK function to determine  potential number of topics
#I wanted to test between 3 and 10 topics. I think 20 (which I looked at
#in the select function) was too many topics.
storage <- searchK(out$documents, out$vocab, K = c(3:10), 
                   prevalence =~ CRI + year + income, data = meta)

#now I want to examine the results of the search K... 
plot.searchK(storage)

results <- as.data.frame(storage$results)

View(results)

After running this, I was= looking for the following criteria: #higher semantic coherene #high held-out likelihood #low residual #low lower bound

iven the plots, I think that maybe a model with three, four or five topics could be best? ok, so given that I am thinking that three or four or five topics seems to fit criteria

I then ran models with each number of topics.

Code
stm.cri3 <- stm(documents = out$documents, vocab = out$vocab,
                K = 3, prevalence =~ CRI + year + income, data = out$meta,
                init.type = "Spectral")
Error in asSTMCorpus(documents, vocab, data): object 'out' not found
Code
stm.cri4 <- stm(documents = out$documents, vocab = out$vocab,
                K = 4, prevalence = ~ CRI + year + incNum, data = out$meta,
                init.type = "Spectral")
Error in asSTMCorpus(documents, vocab, data): object 'out' not found
Code
stm.cri5 <- stm(documents = out$documents, vocab = out$vocab,
                K = 5, prevalence = ~ CRI + year + income, data = out$meta,
                init.type = "Spectral")
Error in asSTMCorpus(documents, vocab, data): object 'out' not found

Then, I wanted to think about labeling these… After looking at three, four, and five-topic models, I decided to proceed with the four-topic model.

Labels based on FREX were a bit strange…Mostly country names. #1: nepal_namibia_malawi_zambia_vietnam #2: bonn_group_complianc_bueno_china #3: tuvalu_copenhagen_island_australia_barbado #4: comment_geneva_consumpt_albania_effici

perhaps labels with probability are better: #1: climat, chang, develop, countri, adapt #2: parti, convent, develop, countri, protocol #3: climate, will, chang, countri, must, develop #4: energi, develop, countri, emiss, climat

Code
labelTopics(stm.cri3)
Error in labelTopics(stm.cri3): object 'stm.cri3' not found
Code
sageLabels(stm.cri3)
Error in sageLabels(stm.cri3): object 'stm.cri3' not found
Code
labelTopics(stm.cri4)
Error in labelTopics(stm.cri4): object 'stm.cri4' not found
Code
labelTopics(stm.cri5)
Error in labelTopics(stm.cri5): object 'stm.cri5' not found
Code
#Topic 1 Top Words:
 # Highest Prob: climat, chang, develop, countri, presid, adapt, nation 
#FREX: napa, namibia, malawi, nepal, zambia, vietnam, african 
#Topic 2 Top Words:
 # Highest Prob: parti, develop, convent, protocol, countri, kyoto, presid 
#FREX: bonn, group, bueno, china, complianc, protocol, kyoto 
#Topic 3 Top Words:
 # Highest Prob: develop, climat, chang, countri, technolog, will, emiss 
#FREX: vision, poznan, market, australia, partnership, key, invest 
#Topic 4 Top Words:
 # Highest Prob: countri, emiss, energi, develop, climat, convent, chang 
#FREX: latvia, turkey, geneva, comment, agbm, lithuania, berlin 
#Topic 5 Top Words:
#  Highest Prob: will, climat, world, countri, chang, must, island 
#FREX: tuvalu, barbado, solomon, island, fiji, reef, children 

sageLabels(stm.cri4)
Error in sageLabels(stm.cri4): object 'stm.cri4' not found

Next I plotted more graphs looking at topic proportion:

Code
#getting the graphical display of estimated topic proportions
plot(stm.cri4, type = "summary")
Error in plot(stm.cri4, type = "summary"): object 'stm.cri4' not found

Proceeding with 4 topics, I wanted to look at my covariates in relation to topics. As a reminder, these covariates were income classification of the country, climate risk index, and year of speech.

Code
#Checking
prep <- estimateEffect(1:4 ~ CRI + year + incNum, stm.cri4,
                       meta = out$meta, uncertainty = "Global")
Error in estimateEffect(1:4 ~ CRI + year + incNum, stm.cri4, meta = out$meta, : object 'out' not found
Code
#looking at the relationship between covariates and these four topics;
summary(prep, topics = c(1,2,3,4))
Error in summary(prep, topics = c(1, 2, 3, 4)): object 'prep' not found

Interestingly, CRI did not seem to be a predictor of any topic. year and income were. I looked at a plot of income plotted against topic prevalence: #1 = low income, 2 = lower middle income, 3 = upper middle income, 4 = high income

Code
plot(prep, covariate = "incNum", topics = c(1:4), model = stm.cri4)
Error in plot(prep, covariate = "incNum", topics = c(1:4), model = stm.cri4): object 'prep' not found

Not completely sure I set this up right, but here I looked at expected topic proportion over time (by year) Strangely, topic 2 decreased over time, while 4 increased. Topic 1 remained relatively stable, Topic 2 slightly increased

Code
plot(prep, "year", method = "continuous", topics = c(1:4))
Error in plot(prep, "year", method = "continuous", topics = c(1:4)): object 'prep' not found

To remind myself of topic 2, I looked at the top words…

Code
#Topic 2:
labelTopics(stm.cri4)
Error in labelTopics(stm.cri4): object 'stm.cri4' not found
Code
#Topic 2 Top Words:
#Highest Prob: countri, develop, emiss, convent, climat, chang, energi 
#FREX: kazakhstan, romania, croatia, latvia, berlin, japan, geneva 
#Lift: absorpt, achi, acti, additon, adriat, aerosol, afterward 
#Score: romania, berlin, latvia, croatia, energi, implement, aij 

To compare this topic to others, I used topical contrasts between topic 2 and other topics

Code
plot(stm.cri4, type = "perspectives", topics = c(1,2), xlim = c(0.1, 0.9))
Error in plot(stm.cri4, type = "perspectives", topics = c(1, 2), xlim = c(0.1, : object 'stm.cri4' not found
Code
plot(stm.cri4, type = "perspectives", topics = c(2,3), xlim = c(0.1, 0.9))
Error in plot(stm.cri4, type = "perspectives", topics = c(2, 3), xlim = c(0.1, : object 'stm.cri4' not found
Code
plot(stm.cri4, type = "perspectives", topics = c(2,4), xlim = c(0.1, 0.9))
Error in plot(stm.cri4, type = "perspectives", topics = c(2, 4), xlim = c(0.1, : object 'stm.cri4' not found

Finally, I looked at word clouds for each topic. I was hoping I could use the package grid or gridExtra to combine these four into one graphic, but I guess they were not the right kind of object.

Code
top1 <- cloud(stm.cri4, topic = 1, scale = c(3, .3))
Error in cloud(stm.cri4, topic = 1, scale = c(3, 0.3)): object 'stm.cri4' not found
Code
top2 <- cloud(stm.cri4, topic = 2, scale = c(3, .3))
Error in cloud(stm.cri4, topic = 2, scale = c(3, 0.3)): object 'stm.cri4' not found
Code
top3 <- cloud(stm.cri4, topic = 3, scale = c(3, .3))
Error in cloud(stm.cri4, topic = 3, scale = c(3, 0.3)): object 'stm.cri4' not found
Code
top4 <- cloud(stm.cri4, topic = 4, scale = c(3, .3))
Error in cloud(stm.cri4, topic = 4, scale = c(3, 0.3)): object 'stm.cri4' not found

I think this was a very interesting exercise, but I am definitely lacking confidence in interpretation of various outputs from structural topic modeling. I want to spend more time working with the package in the future. From these explorations, I don’t know that I have a good understanding of waht characterizes each of the four extracted topics. I think I would need to spend more time closely examining each one to identify what the latent ideas are for each topic.