I have switched data sets for future blog posts. I will be using open-ended survey responses from a 1996 study titled Survey of Gun Owners in the United States.
For the study, respondents were asked six qualifying questions related to: (1) gun ownership, (2) gun-carrying practices, (3) gun display against the respondent, (4) gun use in self-defense against animals, (5) gun use in self-defense against people, and (6) other weapons used in self-defense. A “yes” response to a qualifying question led to a series of additional questions on the same topic as the qualifying question.
The open-ended responses include descriptions specifically related to the following questions: (1) where the respondent was when he or she displayed a gun (in self-defense or otherwise), (2) specific reasons why the respondent displayed a gun, (3) how the other individual reacted when the respondent displayed the gun, (4) how the individual knew the respondent had a gun, (5) whether the police were contacted for specific self-defense events, and (6) if not, why not.
I will focus on the following research question: Can we identify the most common circumstance in which respondents displayed a gun? I am hoping that “circumstance” can include references to both the catalyst in the situation and the environment.
Package version: 3.2.3
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 4 of 4 threads used.
See https://quanteda.io for tutorials and examples.
library(magrittr)
library(stopwords)
library(tm)
Rows: 814 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Q#, VERBATIM RESPONSE
dbl (2): ID#, CODE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 814 × 4
Respondent_ID Survey_Question Code Open_Ended_Response
<dbl> <chr> <dbl> <chr>
1 250 12 0 "I had it in my vehicle to go hunting, n…
2 248 28 0 "None of the above. I was an employee w…
3 250 6 1 "Four of them."
4 250 6 1 "Four of them."
5 250 7 1 "My shotgun."
6 250 7 1 "My shotgun."
7 428 7 1 "{He said \"no\" to fully automatic weap…
8 409 8 1 "If they were used incorrectly."
9 1165 9 1 "Well, yeah, community property."
10 250 12 1 "I had it in my vehicle to go hunting, n…
# … with 804 more rows
The raw data frame has 814 rows and 4 columns. The four columns represent the following variables: survey respondent ID (‘Respondent_ID’), survey question (‘Survey_Question’), response code (‘Code’), and verbatim open-ended response (‘Open_Ended_Response’). After removing duplicate rows, 812 rows remain, each representing a unique observation. Additionally, there are 8134 unique words across all open-ended responses.
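For reference, the loading and de-duplication steps implied by the output above could look like the following sketch. The file name "gun_survey_1996.csv" is hypothetical, and the use of read_csv()/rename()/distinct() is my reconstruction; the old and new column names come from the output shown earlier.

```r
library(readr)
library(dplyr)

# read the raw file (file name is hypothetical) and rename the four columns:
Survey_data <- read_csv("gun_survey_1996.csv") %>%
  rename(Respondent_ID = `ID#`,
         Survey_Question = `Q#`,
         Code = CODE,
         Open_Ended_Response = `VERBATIM RESPONSE`)

# drop exact duplicate rows so each observation is unique:
Survey_unique <- Survey_data %>%
  distinct()
```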
I will next create a corpus and remove both capitalization and punctuation from all open-ended responses.
# create a corpus:
Survey_corpus <- corpus(Survey_unique$Open_Ended_Response)
head(Survey_corpus)
Corpus consisting of 6 documents.
text1 :
"I had it in my vehicle to go hunting, not on my person."
text2 :
"None of the above. I was an employee working in a convenien..."
text3 :
"Four of them."
text4 :
"My shotgun."
text5 :
"{He said "no" to fully automatic weapons but "yes" to semi-a..."
text6 :
"If they were used incorrectly."
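The lowercasing and punctuation removal mentioned above, plus the document-feature matrix that the later steps expect, could be produced along the following lines. The pipeline itself is my reconstruction; the object name Survey_dfm is chosen to match the code that follows.

```r
library(quanteda)

# tokenize, dropping punctuation, then lowercase the tokens:
Survey_tokens <- tokens(Survey_corpus, remove_punct = TRUE) %>%
  tokens_tolower()

# build the document-feature matrix:
Survey_dfm <- dfm(Survey_tokens)
```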
Error in dfm_trim(Survey_dfm_freq, min_docfreq = 0.3, docfreq_type = "prop"): object 'Survey_dfm_freq' not found
# use the dfm to create a feature co-occurrence matrix:
Survey_fcm <- fcm(Survey_dfm_freq)
Error in fcm(Survey_dfm_freq): object 'Survey_dfm_freq' not found
# check dimensions:
dim(Survey_fcm)
Error in eval(expr, envir, enclos): object 'Survey_fcm' not found
QUESTION: The first time I ran this chunk of code, the dimensions of the feature co-occurrence matrix were 395 rows by 395 columns. After reopening R and re-running the code, the dimensions now show as 0 rows by 0 columns.
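A guess about the 0-by-0 result: objects such as Survey_dfm_freq exist only in the running R session, so after reopening R they must be rebuilt before fcm() is called. The trimming step referenced in the errors above might be reconstructed as follows (the 0.3 proportion threshold comes from the error message; the rest is my assumption):

```r
# keep only features that appear in at least 30% of documents:
Survey_dfm_freq <- dfm_trim(Survey_dfm, min_docfreq = 0.3, docfreq_type = "prop")

# build the feature co-occurrence matrix from the trimmed dfm:
Survey_fcm <- fcm(Survey_dfm_freq)
dim(Survey_fcm)
```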
______________________
Next, I will apply the correlated topic modeling approach to my data set, since I do not have a covariate such as sentiment.
# load packages:
library(stm)
Warning: package 'stm' was built under R version 4.2.2
stm v1.3.6 successfully loaded. See ?stm for help.
Papers, resources, and other materials at structuraltopicmodel.com
library(quanteda)

# generate data frame:
Survey_dfm_clean <- dfm(Survey_dfm)
Error in dfm(Survey_dfm): object 'Survey_dfm' not found
Error in asSTMCorpus(documents, vocab, data): object 'Survey_dfm' not found
cor_topic_model
Error in eval(expr, envir, enclos): object 'cor_topic_model' not found
summary(cor_topic_model)
Error in summary(cor_topic_model): object 'cor_topic_model' not found
# label topics:
# frex = words that are both frequent and exclusive to the topic.
# lift = the topic-word distribution divided by the empirical word-count probability distribution.
labelTopics(cor_topic_model)
Error in labelTopics(cor_topic_model): object 'cor_topic_model' not found
I now have a topic model with 5 topics, 805 documents, and a 1224-word dictionary.
# identify the document most frequently associated with each of the five topics:
findThoughts(cor_topic_model,
             texts = Survey_dfm$Open_Ended_Responses,
             topics = c(1:5),
             n = 1)
Error in findThoughts(cor_topic_model, texts = Survey_dfm$Open_Ended_Responses, : object 'cor_topic_model' not found
I am having a little trouble with this code; my understanding is that it is intended to yield the document most strongly associated with each of the five topics, but I am seeing a different, less informative output. I will work more on this.
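One possible cause, offered as a guess: findThoughts() expects the texts argument to be a character vector aligned with the documents the model was fit on, and a dfm has no ‘$’ columns, so Survey_dfm$Open_Ended_Responses evaluates to NULL. A sketch of the call, assuming the model was fit on the responses in Survey_unique:

```r
# show the single response most strongly associated with each topic:
findThoughts(cor_topic_model,
             texts = Survey_unique$Open_Ended_Response,
             topics = 1:5,
             n = 1)
```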
QUESTION: Do I need to identify the question numbers associated with location/reason for drawing a weapon and apply this as a covariate? Do I need a covariate at all? If so, how do I identify one when all open-ended responses are aggregated? That is, the ‘Survey_Question’ variable is the only indicator of response categories, so there isn’t necessarily another variable to choose from as a predictor.
______________
Next, I will try structural topic modeling (without a predictor). My expectation is that these results will look similar to those of the correlated topic model, since I am not adding a predictor.
# choose the number of topics:
k <- 5

# specify the model:
Survey_STM <- stm(Survey_dfm,
                  K = k,
                  data = Survey_dfm$Open_Ended_Responses,
                  max.em.its = 1224,
                  seed = 1234,
                  init.type = "Spectral")
Error in asSTMCorpus(documents, vocab, data): object 'Survey_dfm' not found
labelTopics(Survey_STM)
Error in labelTopics(Survey_STM): object 'Survey_STM' not found
Try some visualization to capture the estimated frequency of words across topics:
plot(Survey_STM, type ="summary")
Error in plot(Survey_STM, type = "summary"): object 'Survey_STM' not found
Next, I want to try extracting topics and assigning them to the vector of document proportions. Extract the top words (as identified by ‘frex’), collapse the strings, and separate the tokens:
# get the words:
Survey_words <- labelTopics(Survey_STM, n = 5)$frex
Error in labelTopics(Survey_STM, n = 5): object 'Survey_STM' not found
# set up an empty vector:
Survey_topic_labels <- rep(NA, k)

# loop through the topics and collapse the words into a single label:
for (i in 1:k) {
  Survey_topic_labels[i] <- paste(Survey_words[i, ], collapse = "_")
}
Error in paste(Survey_words[i, ], collapse = "_"): object 'Survey_words' not found
# print the labels:
Survey_topic_labels
[1] NA NA NA NA NA
Without a predictor or covariate, I am unable to measure the effect on topic distributions. Identifying a way to incorporate a covariate or predictor will be my next task.
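One option for that next task, sketched as a guess rather than a worked solution: treat the existing ‘Survey_Question’ variable as a prevalence covariate when fitting the STM, then use estimateEffect() to measure its effect on topic proportions. The formula and the use of Survey_unique as the metadata are my assumptions.

```r
library(stm)

# refit the STM with survey question as a prevalence covariate:
Survey_STM_cov <- stm(Survey_dfm,
                      K = k,
                      prevalence = ~ Survey_Question,
                      data = Survey_unique,
                      seed = 1234,
                      init.type = "Spectral")

# estimate how topic proportions vary with survey question:
Survey_effects <- estimateEffect(1:k ~ Survey_Question,
                                 Survey_STM_cov,
                                 metadata = Survey_unique)
summary(Survey_effects)
```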