Claire Battaglia
text-as-data
blog post 2
open-text survey response
Author

Claire Battaglia

Published

October 2, 2022

Code
library(readxl)
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE)

The Data

The dataset that I’ll be working with is from the Missoula City-County’s inaugural Community Food Assessment, completed in 2021. In particular, I’ll be analyzing the responses to one (or more) of the open-ended questions:

  1. What changes would you like to see for Missoula’s food system?
  2. In your opinion, what strengths and assets exist in Missoula’s food system (i.e. in what ways is the food system doing well)?
  3. In your opinion, what gaps or unmet needs exist in Missoula’s food system (i.e. in what ways could the food system do better)?

Question One and Question Three are distinct questions, but that distinction might not have been clear to respondents, so I’m not sure yet whether it makes sense to analyze both questions or just one of the two.

We received 389 responses in 2021 but are working to implement random sampling for our next wave and anticipate a larger, valid sample moving forward. Working with this smaller sample will thus be good practice for me, but it’s important to note that, given the sampling method (a mix of convenience and snowball sampling), I will ultimately not be drawing any inferences about the Missoula community as a whole from it.

Research Question and Analysis

While I am still researching methods of analysis, I am currently leaning towards Structural Topic Modeling (STM). I am ultimately interested in understanding the prevalence and content of topics (for each of the above survey questions) by certain categories, such as age, household size, household income level, and geographic area. STM will enable me to do this. As described by Roberts et al., it will allow me to discover the topics within the responses rather than assume them based upon my own theoretical expectations.¹

Prevalence refers to how frequently a topic is discussed and content refers to the language used to discuss it.

STM will allow me to analyze not only whether different groups discuss certain topics more frequently than other groups but also what language they use to discuss them. This is important. Looking at prevalence alone may reveal that all income levels discuss farmers markets with the same frequency, but looking at content as well could reveal that one income level uses words/phrases like “fun” or “meet friends” to talk about them while another uses “expensive” or “far away.” These point to very different lived experiences of farmers markets.
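To make the prevalence/content distinction concrete, here is a toy sketch in base R. It is not STM itself, only a hand-rolled illustration: the four responses and the "lower"/"higher" income labels are made up for the example, and the topic is detected with a simple string match rather than discovered by a model.

```r
# Toy, made-up responses; NOT taken from the survey data
toy <- data.frame(
  income = c("lower", "lower", "higher", "higher"),
  text   = c("farmers market is expensive",
             "farmers market is far away",
             "farmers market is fun",
             "meet friends at the farmers market"),
  stringsAsFactors = FALSE
)

# Prevalence: share of responses in each group that mention the topic
toy$mentions_market <- grepl("farmers market", toy$text, fixed = TRUE)
prevalence <- tapply(toy$mentions_market, toy$income, mean)

# Content: the words each group uses around the topic,
# after dropping the topic words and a few filler words
drop_words <- c("farmers", "market", "is", "at", "the")
words_by_group <- lapply(split(toy$text, toy$income), function(x) {
  w <- unlist(strsplit(tolower(x), "[^a-z]+"))
  sort(setdiff(w, drop_words))
})

prevalence
words_by_group
```

In this toy data both groups mention farmers markets in every response, so prevalence is identical, while the remaining words differ sharply between the groups. That is the kind of difference a content covariate is meant to surface.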

Getting Started

Code
# check wd
getwd()
[1] "C:/Users/srika/OneDrive/Desktop/DACSS/Text_as_Data_Fall_2022/posts"
Code
# read data
CFA_raw <- read_excel("FPAB Community Food Assessment Survey Data.xlsx", sheet = "Raw")
New names:
• `` -> `...15`
• `` -> `...16`
• `` -> `...17`
• `` -> `...18`
• `` -> `...19`
• `` -> `...20`
• `` -> `...21`
• `` -> `...22`
• `` -> `...23`
• `` -> `...24`
• `` -> `...26`
• `` -> `...27`
• `` -> `...28`
• `` -> `...29`
• `` -> `...30`
• `` -> `...31`
• `` -> `...32`
• `` -> `...34`
• `` -> `...35`
• `` -> `...36`
• `` -> `...37`
• `` -> `...38`
• `` -> `...39`
• `` -> `...40`
• `` -> `...43`
• `` -> `...44`
• `` -> `...45`
• `` -> `...46`
• `` -> `...47`
• `` -> `...48`
• `` -> `...50`
• `` -> `...51`
• `` -> `...52`
• `` -> `...53`
• `` -> `...54`
• `` -> `...55`
• `` -> `...56`
• `` -> `...57`
• `` -> `...58`
• `` -> `...60`
• `` -> `...61`
• `` -> `...62`
• `` -> `...63`
• `` -> `...64`
• `` -> `...65`
• `` -> `...66`
• `` -> `...67`
• `` -> `...68`
• `` -> `...69`
Code
# preview
head(CFA_raw)
# A tibble: 6 × 72
  Respondent I…¹ Colle…² `Start Date`        `End Date`          IP Ad…³ Email…⁴
           <dbl>   <dbl> <dttm>              <dttm>              <chr>   <lgl>  
1    12978141302  4.09e8 2021-09-18 13:41:15 2021-09-18 14:43:04 216.14… NA     
2    12979593159  4.09e8 2021-09-19 17:30:07 2021-09-19 17:36:22 72.174… NA     
3    12991074702  4.09e8 2021-09-22 15:30:03 2021-09-23 13:55:29 184.16… NA     
4    12985165265  4.09e8 2021-09-21 15:03:48 2021-09-21 15:58:22 72.174… NA     
5    12978197761  4.09e8 2021-09-18 15:31:36 2021-09-18 15:44:55 209.6.… NA     
6    13003870096  4.09e8 2021-09-28 23:25:30 2021-09-28 23:27:22 69.145… NA     
# … with 66 more variables: `First Name` <lgl>, `Last Name` <lgl>,
#   `Custom Data 1` <lgl>, `Zip Code` <dbl>, Age <chr>,
#   `Annual Household Income` <chr>,
#   `Number of Individuals in Your Household` <chr>,
#   `In the context of food and agriculture, which of the following do you identify with? (Check all that apply)` <chr>,
#   ...15 <chr>, ...16 <chr>, ...17 <chr>, ...18 <chr>, ...19 <chr>,
#   ...20 <chr>, ...21 <chr>, ...22 <chr>, ...23 <chr>, ...24 <chr>, …

There is some initial cleaning I can do right now.

Code
# rename columns
CFA_tidy <- CFA_raw %>%
  rename("id" = "Respondent ID",
         "zip" = "Zip Code",
         "age" = "Age",
         "income" = "Annual Household Income",
         "size" = "Number of Individuals in Your Household",
         "change" = "What changes would you like to see for Missoula’s food system?",
         "strengths" = "In your opinion, what strengths and assets exist in Missoula's food system (i.e. in what ways is the food system doing well)?")

# create subset
CFA_tidy <- CFA_tidy %>%
  subset(select = c("id", "zip", "age", "income", "change"))

# preview
head(CFA_tidy)
# A tibble: 6 × 5
           id   zip age   income                      change                    
        <dbl> <dbl> <chr> <chr>                       <chr>                     
1 12978141302 59823 72    Under $25,000               "More concentration on th…
2 12979593159 59802 42    Between $35,000 and $49,999  <NA>                     
3 12991074702 59802 29    Between $50,000 and $99,999 "I'd like to see more sto…
4 12985165265 59801 60    Between $50,000 and $99,999 "Factual, evidence-based …
5 12978197761 59801 61    Between $50,000 and $99,999 "Whatever it takes to get…
6 13003870096 59803 27    Between $25,000 and $34,999  <NA>                     
Code
# change class
CFA_tidy$id <- as.character(CFA_tidy$id)
CFA_tidy$zip <- as.character(CFA_tidy$zip)
CFA_tidy$age <- as.numeric(CFA_tidy$age)
Warning: NAs introduced by coercion
Code
# TODO decide what to do with income categories
# CFA_tidy$size <- as.numeric(CFA_tidy$size)

# preview
head(CFA_tidy)
# A tibble: 6 × 5
  id          zip     age income                      change                    
  <chr>       <chr> <dbl> <chr>                       <chr>                     
1 12978141302 59823    72 Under $25,000               "More concentration on th…
2 12979593159 59802    42 Between $35,000 and $49,999  <NA>                     
3 12991074702 59802    29 Between $50,000 and $99,999 "I'd like to see more sto…
4 12985165265 59801    60 Between $50,000 and $99,999 "Factual, evidence-based …
5 12978197761 59801    61 Between $50,000 and $99,999 "Whatever it takes to get…
6 13003870096 59803    27 Between $25,000 and $34,999  <NA>                     
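The coercion warning above means that at least one age entry could not be read as a number. A small sketch of how such entries might be inspected before they are turned into NA follows; the `age_raw` vector is a made-up stand-in for the raw age column, not actual survey values.

```r
# Hypothetical raw age entries; values are invented for illustration
age_raw <- c("72", "42", "sixty", "29", NA, "50+")

# Coerce quietly, then flag entries that were present but failed to parse
age_num <- suppressWarnings(as.numeric(age_raw))
bad <- !is.na(age_raw) & is.na(age_num)

# The offending original strings, which could be recoded by hand
age_raw[bad]
```

Looking at the flagged strings first makes it possible to decide whether to recode them or to accept them as missing.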
Code
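# drop rows with an NA in any column (including ages coerced to NA above
# and respondents who left the "change" question blank)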
CFA_tidy <- na.omit(CFA_tidy)

save(CFA_tidy, file = "CFA_tidy.RData")

# TODO get mean age and size of household
# mean(CFA_tidy$age, na.rm = TRUE)
# mean(CFA_tidy$size, na.rm = TRUE)

Next I’ll calculate some summary statistics.

For both age and household size, I’ll be interested in the mean, median, mode, and range. For income, I’ll want the modal income level. I’ll also want all of those statistics broken out for each zip code.
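A rough sketch of how these statistics might be computed is below. It uses base R only so that it runs without extra packages, and it works on a small made-up data frame (`toy`) rather than the survey data. The helper names `stat_mode` and `summarise_numeric` are hypothetical, and household size is left out because it is not in the subset created above.

```r
# Hypothetical toy data standing in for the cleaned survey data; values are made up
toy <- data.frame(
  zip    = c("59801", "59801", "59802", "59802", "59802"),
  age    = c(60, 61, 42, 29, 29),
  income = c("Between $50,000 and $99,999", "Between $50,000 and $99,999",
             "Between $35,000 and $49,999", "Between $50,000 and $99,999",
             "Between $50,000 and $99,999"),
  stringsAsFactors = FALSE
)

# Base R has no built-in statistical mode, so define a small helper.
# On ties it returns whichever tied value appears first in x.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Mean, median, mode, and range (as min and max) for a numeric vector
summarise_numeric <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    mode   = stat_mode(x),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
}

# Overall, then broken out by zip code (one column per zip)
overall_age <- summarise_numeric(toy$age)
age_by_zip  <- sapply(split(toy$age, toy$zip), summarise_numeric)

# Modal income level per zip code
income_mode_by_zip <- tapply(toy$income, toy$zip, stat_mode)
```

The same `summarise_numeric` helper could be applied to household size once that column is kept and converted to numeric.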

Thinking Ahead to Next Steps

Next I’ll spend some time with the stm package and go through the methodology outlined in “Structural Topic Models for Open-Ended Survey Responses” in more detail. I definitely need to wrap my head around constructing the actual model(s), and some statistical concepts have surfaced that I’m not familiar with (e.g. shrinkage priors and regularization).
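As a starting point, constructing a model might look something like the sketch below. This is an assumption-laden outline, not a settled specification: it assumes the stm package and its text-processing dependencies (tm, and SnowballC for stemming) are installed, that the data frame has a `change` text column and an `income` column as in the cleaned data above, and that there are no missing values in the covariates. The function name `fit_stm_sketch`, the choice of `income` as the covariate, and `k = 5` topics are purely illustrative.

```r
# Sketch only: wraps the usual stm workflow in a function so nothing runs
# until it is called with real data. Names and defaults are hypothetical.
fit_stm_sketch <- function(df, k = 5, seed = 1234) {
  if (!requireNamespace("stm", quietly = TRUE)) {
    stop("The stm package is required for this sketch.")
  }

  # Tokenize, lowercase, remove stop words, and stem the open-text responses,
  # carrying the respondent metadata along
  processed <- stm::textProcessor(documents = df$change, metadata = df)

  # Drop very rare terms and any documents left empty, keeping metadata aligned
  prepped <- stm::prepDocuments(processed$documents,
                                processed$vocab,
                                processed$meta)

  # Let topic prevalence and topic content both vary by income level
  stm::stm(documents  = prepped$documents,
           vocab      = prepped$vocab,
           K          = k,
           prevalence = ~ income,
           content    = ~ income,
           data       = prepped$meta,
           seed       = seed)
}
```

It would be called as `fit_stm_sketch(CFA_tidy)` once the data are ready. The number of topics and the covariates in the two formulas are the main choices still to be worked out, and the content formula accepts only a single variable.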

I’ve also started thinking about what I would like to be able to visualize and came across these posts about visualizing text data in both R and Python:

Neither is specific to STM, but they may give me a sense of what is possible, which packages to use, etc. The stm package documentation also includes some ideas for visualizing STM models.

Footnotes

  1. Roberts, Margaret E.; Stewart, Brandon M.; Tingley, Dustin; Lucas, Christopher; Leder-Luis, Jetson; Gadarian, Shana Kushner; Albertson, Bethany; Rand, David G. (2014). Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064–1082. doi:10.1111/ajps.12103