Code
# loading in necessary libraries
library(quanteda)
library(tidyr)
library(dplyr)
library(ggplot2)
library(text2vec)
library(stopwords)
library(RCurl)
For the next stage in my project, I wanted to use topic modeling, although I’m not sure I will use it in my final project, since I don’t have clear hypotheses about topics, so whatever I find will be purely exploratory and/or descriptive. While I think it would be interesting to see the links between topics and my metadata (climate risk index, year of speech), I don’t have strong ideas about a) what the topics will be, given that in my reading the speeches seem highly similar, and b) how topics might relate to climate risk or year of speech.
Code
# load in and check the data
speech.meta <- read.csv(file = "FINAL_combined_meta-text_dataset.csv")
tail(speech.meta)
names(speech.meta)

# limit the dataset to just the text and filenum columns
speech.meta <- speech.meta[, c(2, 7)]
head(speech.meta)

# convert the text column to a character vector and spot-check it
speech.meta.ac <- as.character(speech.meta$text)
head(speech.meta.ac, 1)
After getting the data ready, I followed the steps to set up the model as we learned in class.
Code
# create an iterator over the documents
it <- itoken(speech.meta.ac, tolower, word_tokenizer,
             ids = speech.meta$textnum, n_chunks = 10)

# build the stopword list, adding extra tokens that are not meaningful
sw <- stopwords("en", source = "snowball")
sw <- c(sw, "must", "can", "c", "mr", "v", "il", "tt", "ll", "aij", "j",
        "es", "ul", "wi", "q", "el", "tl", "cl", "la", "er", "fl", "fi",
        "r", "l", "lo", "tel", "z", "le", "en", "ch", "ed", "co")

# create the vocabulary, removing stopwords
v <- create_vocabulary(it, stopwords = sw)

# prune the vocabulary: drop rare terms and terms appearing in almost
# every document (doc_proportion_min/max are prune_vocabulary arguments,
# not create_vocabulary arguments)
v <- prune_vocabulary(v, term_count_min = 10,
                      doc_proportion_max = 0.95, doc_proportion_min = 0.05)

# create a closure that transforms a list of tokens into vector space
vectorizer <- vocab_vectorizer(v)

# create the document-term matrix
dtm <- create_dtm(it, vectorizer, type = "dgTMatrix")
Next I ran a series of models, testing different numbers of topics. I started with 10, since that seemed like a lot of topics to me and I thought it would be informative. As I expected, many topics were not very prevalent in the documents.
Code
# create a new LDA model with 10 topics
lda_model <- LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)

# fit the model
doc_topic_distr <- lda_model$fit_transform(x = dtm, n_iter = 5000,
                                           convergence_tol = 0.001,
                                           n_check_convergence = 25,
                                           progressbar = TRUE)

# view the top 20 words for each topic
lda_model$get_top_words(n = 20, topic_number = 1:10, lambda = 0.3)

# topic proportions within a single document (here, document 2)
barplot(doc_topic_distr[2, ], xlab = "topic", ylab = "proportion",
        ylim = c(0, 1), names.arg = 1:ncol(doc_topic_distr))
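Note that the barplot above shows the topic mixture for one document only (the second one). Since what I really care about is prevalence across the whole corpus, here is a quick sketch of two corpus-level summaries, assuming the doc_topic_distr matrix from above:

Code
# sketch: average topic proportion across all documents
barplot(colMeans(doc_topic_distr), xlab = "topic", ylab = "mean proportion",
        ylim = c(0, 1), names.arg = 1:ncol(doc_topic_distr))

# sketch: share of documents whose single most likely topic is each topic
prop.table(table(factor(apply(doc_topic_distr, 1, which.max),
                        levels = 1:ncol(doc_topic_distr))))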
Since the proportions were so low, I decided to try a model with fewer topics, moving to a 5-topic solution.
Code
# for some topics, the proportion is very low, so now test with 5 topics
lda_model5 <- LDA$new(n_topics = 5, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr5 <- lda_model5$fit_transform(x = dtm, n_iter = 5000,
                                             convergence_tol = 0.001,
                                             n_check_convergence = 25,
                                             progressbar = TRUE)

# view the top 20 words for each topic
lda_model5$get_top_words(n = 20, topic_number = 1:5, lambda = 0.3)

# topic proportions within document 2
barplot(doc_topic_distr5[2, ], xlab = "topic", ylab = "proportion",
        ylim = c(0, 1), names.arg = 1:ncol(doc_topic_distr5))
For three of these topics, fewer than 20% of documents were classified under them. Further, there were some topics that I had difficulty interpreting, or where I struggled to see why they were separate. At this point I was starting to see some consistency in which topics were extracted. For example, one topic seemed very “solutions” focused, with words like energy, development, and technologies. However, I wondered whether I should use even fewer topics.
Code
lda_model4 <- LDA$new(n_topics = 4, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr4 <- lda_model4$fit_transform(x = dtm, n_iter = 5000,
                                             convergence_tol = 0.001,
                                             n_check_convergence = 25,
                                             progressbar = TRUE)

# view the top 20 words for each topic
lda_model4$get_top_words(n = 20, topic_number = 1:4, lambda = 0.3)

# topic proportions within document 2
barplot(doc_topic_distr4[2, ], xlab = "topic", ylab = "proportion",
        ylim = c(0, 1), names.arg = 1:ncol(doc_topic_distr4))

# inspect the full document-topic matrix
doc_topic_distr4
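Before interpreting these topics, I wanted to check how documents spread across them. A small sketch, assuming the doc_topic_distr4 matrix from above (rows are documents, columns are topics):

Code
# sketch: share of documents whose most likely topic is each of the four
prop.table(table(apply(doc_topic_distr4, 1, which.max)))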
I think the 4 topics look meaningful, and each topic has at least 10% of documents classified as most likely belonging to it. But, to be safe, and to explore the data even more, I also looked at a 3-topic model.
Code
lda_model3 <- LDA$new(n_topics = 3, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr3 <- lda_model3$fit_transform(x = dtm, n_iter = 5000,
                                             convergence_tol = 0.001,
                                             n_check_convergence = 25,
                                             progressbar = TRUE)

# view the top 20 words for each topic
lda_model3$get_top_words(n = 20, topic_number = 1:3, lambda = 0.3)

# topic proportions within document 2
barplot(doc_topic_distr3[2, ], xlab = "topic", ylab = "proportion",
        ylim = c(0, 1), names.arg = 1:ncol(doc_topic_distr3))
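Since I was choosing the number of topics mostly by eye, a more principled comparison could use perplexity (lower is better) across candidate models. A rough sketch, assuming the dtm from above; note this evaluates fit on the training data rather than a held-out set, so it is only suggestive:

Code
# sketch: compare candidate topic counts by (in-sample) perplexity
for (k in c(3, 4, 5, 10)) {
  m <- LDA$new(n_topics = k, doc_topic_prior = 0.1, topic_word_prior = 0.01)
  d <- m$fit_transform(x = dtm, n_iter = 5000, convergence_tol = 0.001,
                       n_check_convergence = 25, progressbar = FALSE)
  p <- perplexity(dtm, topic_word_distribution = m$topic_word_distribution,
                  doc_topic_distribution = d)
  cat("topics =", k, "perplexity =", round(p, 1), "\n")
}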
It seems like the 4-topic solution is the one I should go with. Now I need to describe each topic. Because this is completely exploratory, I thought it would be better to use a naming method that simply relies on the top words in each topic.
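For example, labels like the ones I use below can be built directly from the top words. A small sketch, assuming the fitted lda_model4 from above:

Code
# sketch: build a label for each topic from its top 5 words
# (get_top_words returns a character matrix with one column per topic)
top5 <- lda_model4$get_top_words(n = 5, topic_number = 1:4, lambda = 0.3)
apply(top5, 2, paste, collapse = "_")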
Code
# interactive topic visualisation; this requires the LDAvis package
# (install.packages("LDAvis") if it is not already installed)
lda_model4$plot()
This resulted in the following four topics. I can speculate about the content of each, but I want to explore the speeches most likely classified under each before doing so…

Topic 1: convention_parties_president_protocol_annex - perhaps this is about the UNFCCC itself and its process.
Topic 2: us_agreement_need_action_challenge - this also seems process focused: ‘how will we go about solving the problems’.
Topic 3: energy_development_technologies_projects_sustainable - this seems solution focused to me: ‘what should we do about climate change’.
Topic 4: island_people_human_small_sea - this final topic feels different from the others; based on my initial exploration, it seems focused on how people are being impacted, the ‘human element of climate change’.
Next steps: although it would be completely exploratory (and again, maybe not relevant to my key research questions), I was wondering how the climate risk index relates to topic prevalence. I could also look at topic probabilities changing over time (i.e., by year of speech).
To examine this, I want to save the probabilities of the four topics for each document, and then use my metadata to look at correlations/regressions between these and the topics. I haven’t yet figured out a clean way to export these probabilities and join them with my original data, though one possible approach is sketched below. Because it isn’t a priority for me, my plan is to return to topic modeling after completing the other key analyses.
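One rough sketch of how this could work: since the rows of doc_topic_distr4 should line up with the documents fed into the iterator, the probabilities can be bound straight onto the data. Here speech.full stands in for a version of the data that still has the metadata columns (my subset speech.meta dropped them), and risk_index and year are hypothetical column names:

Code
# sketch: attach the four topic probabilities to the metadata
# (speech.full, risk_index, and year are hypothetical placeholders)
topic_probs <- as.data.frame(doc_topic_distr4)
names(topic_probs) <- paste0("topic", 1:4)
speech.joined <- cbind(speech.full, topic_probs)

# exploratory checks: correlation with climate risk index, trend over year
cor(speech.joined$risk_index, speech.joined$topic4, use = "complete.obs")
summary(lm(topic4 ~ year, data = speech.joined))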
Further, based on my past experience doing similar types of analyses (to me, the concept seems highly similar to latent class/profile analysis and factor analysis), I feel like I need to think more deeply about what these topics are about, what they represent, and (because of my area of interest) what they mean psychologically. It would be fun to rush ahead and run some of the analyses I’d like to do, but I am still uncertain about what underlies each of the topics I found, and I want to think about it more before I do analyses, maybe so I can even come up with hypotheses to test rather than just testing every possibility…