library(readxl)
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)
Claire Battaglia
October 2, 2022
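The dataset that I’ll be working with is from the Missoula City-County’s inaugural Community Food Assessment, completed in 2021. In particular, I’ll be working with the responses to one (or more) of the open-ended questions:
1. What changes would you like to see for Missoula’s food system?
2. In your opinion, what strengths and assets exist in Missoula’s food system (i.e. in what ways is the food system doing well)?
3. In your opinion, what gaps or unmet needs exist in Missoula’s food system (i.e. in what ways could the food system do better)?
Question One and Question Three are distinct, but that distinction might not have been clear to respondents, so I’m not sure yet whether it makes sense to analyze both questions or just one of the two.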
We received 389 responses in 2021. We are working to implement random sampling for our next wave and anticipate a larger, more representative sample moving forward. Working with this smaller sample will be good practice for me, but it’s important to note that, given the sampling method (a mix of convenience and snowball sampling), I will not be drawing any inferences about the Missoula community as a whole from it.
While I am still researching methods of analysis, I am currently leaning towards Structural Topic Modeling (STM). I am ultimately interested in understanding the prevalence and content of topics (for each of the above survey questions) by certain categories, such as age, household size, household income level, and geographic area. STM will enable me to do this. As described by Roberts et al., it will allow me to discover the topics within the responses rather than assume them based upon my own theoretical expectations.¹
Prevalence refers to how frequently a topic is discussed and content refers to the language used to discuss it.
STM will allow me to analyze whether different groups discuss certain topics more frequently than other groups but also what language they use to discuss them. This is important. Looking at prevalence alone may reveal that all income levels discuss farmers markets with the same frequency but looking at content as well could reveal that one income level uses words/phrases like “fun” or “meet friends” to talk about them while another group uses “expensive” or “far away.” These are clearly getting at very different lived experiences of farmers markets.
[1] "C:/Users/srika/OneDrive/Desktop/DACSS/Text_as_Data_Fall_2022/posts"
New names:
• `` -> `...15`
• `` -> `...16`
• `` -> `...17`
• `` -> `...18`
• `` -> `...19`
• `` -> `...20`
• `` -> `...21`
• `` -> `...22`
• `` -> `...23`
• `` -> `...24`
• `` -> `...26`
• `` -> `...27`
• `` -> `...28`
• `` -> `...29`
• `` -> `...30`
• `` -> `...31`
• `` -> `...32`
• `` -> `...34`
• `` -> `...35`
• `` -> `...36`
• `` -> `...37`
• `` -> `...38`
• `` -> `...39`
• `` -> `...40`
• `` -> `...43`
• `` -> `...44`
• `` -> `...45`
• `` -> `...46`
• `` -> `...47`
• `` -> `...48`
• `` -> `...50`
• `` -> `...51`
• `` -> `...52`
• `` -> `...53`
• `` -> `...54`
• `` -> `...55`
• `` -> `...56`
• `` -> `...57`
• `` -> `...58`
• `` -> `...60`
• `` -> `...61`
• `` -> `...62`
• `` -> `...63`
• `` -> `...64`
• `` -> `...65`
• `` -> `...66`
• `` -> `...67`
• `` -> `...68`
• `` -> `...69`
# A tibble: 6 × 72
Respondent I…¹ Colle…² `Start Date` `End Date` IP Ad…³ Email…⁴
<dbl> <dbl> <dttm> <dttm> <chr> <lgl>
1 12978141302 4.09e8 2021-09-18 13:41:15 2021-09-18 14:43:04 216.14… NA
2 12979593159 4.09e8 2021-09-19 17:30:07 2021-09-19 17:36:22 72.174… NA
3 12991074702 4.09e8 2021-09-22 15:30:03 2021-09-23 13:55:29 184.16… NA
4 12985165265 4.09e8 2021-09-21 15:03:48 2021-09-21 15:58:22 72.174… NA
5 12978197761 4.09e8 2021-09-18 15:31:36 2021-09-18 15:44:55 209.6.… NA
6 13003870096 4.09e8 2021-09-28 23:25:30 2021-09-28 23:27:22 69.145… NA
# … with 66 more variables: `First Name` <lgl>, `Last Name` <lgl>,
# `Custom Data 1` <lgl>, `Zip Code` <dbl>, Age <chr>,
# `Annual Household Income` <chr>,
# `Number of Individuals in Your Household` <chr>,
# `In the context of food and agriculture, which of the following do you identify with? (Check all that apply)` <chr>,
# ...15 <chr>, ...16 <chr>, ...17 <chr>, ...18 <chr>, ...19 <chr>,
# ...20 <chr>, ...21 <chr>, ...22 <chr>, ...23 <chr>, ...24 <chr>, …
There is some initial cleaning I can do right now.
# rename columns
CFA_tidy <- CFA_raw %>%
rename("id" = "Respondent ID",
"zip" = "Zip Code",
"age" = "Age",
"income" = "Annual Household Income",
"size" = "Number of Individuals in Your Household",
"change" = "What changes would you like to see for Missoula’s food system?",
"strengths" = "In your opinion, what strengths and assets exist in Missoula's food system (i.e. in what ways is the food system doing well)?")
# create subset
CFA_tidy <- CFA_tidy %>%
subset(select = c("id", "zip", "age", "income", "change"))
# preview
head(CFA_tidy)
# A tibble: 6 × 5
id zip age income change
<dbl> <dbl> <chr> <chr> <chr>
1 12978141302 59823 72 Under $25,000 "More concentration on th…
2 12979593159 59802 42 Between $35,000 and $49,999 <NA>
3 12991074702 59802 29 Between $50,000 and $99,999 "I'd like to see more sto…
4 12985165265 59801 60 Between $50,000 and $99,999 "Factual, evidence-based …
5 12978197761 59801 61 Between $50,000 and $99,999 "Whatever it takes to get…
6 13003870096 59803 27 Between $25,000 and $34,999 <NA>
Warning: NAs introduced by coercion
# A tibble: 6 × 5
id zip age income change
<chr> <chr> <dbl> <chr> <chr>
1 12978141302 59823 72 Under $25,000 "More concentration on th…
2 12979593159 59802 42 Between $35,000 and $49,999 <NA>
3 12991074702 59802 29 Between $50,000 and $99,999 "I'd like to see more sto…
4 12985165265 59801 60 Between $50,000 and $99,999 "Factual, evidence-based …
5 12978197761 59801 61 Between $50,000 and $99,999 "Whatever it takes to get…
6 13003870096 59803 27 Between $25,000 and $34,999 <NA>
Next I’ll calculate some summary statistics.
For age I’ll be interested in mean, median, mode and range. For household size I’ll be interested in mean, median, mode and range. I’ll want the mode income level. I’ll also want all of those statistics broken out for each zip code.
Next I’ll spend some time with the stm package and go through the methodology outlined in “Structural Topic Models for Open-Ended Survey Responses” in more detail. I definitely need to wrap my head around constructing the actual model(s), and there are some statistical concepts that have surfaced that I’m not familiar with (e.g. shrinkage priors and regularization).
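I’ve also started thinking about what I would like to be able to visualize and came across these posts about visualizing text data in both R and Python:
* R - TextPlot: R Library for Visualizing Text Data (https://towardsdatascience.com/textplot-r-library-for-visualizing-text-data-a8f1740a032d)
* Python - Advanced Visualisations for Text Data Analysis (https://towardsdatascience.com/advanced-visualisations-for-text-data-analysis-fc8add8796e2)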
Neither is specific to STM, but both may give me a sense of what is possible, which packages to use, etc. The STM package documentation also includes some ideas for visualizing STM models.
Roberts, Margaret E.; Stewart, Brandon M.; Tingley, Dustin; Lucas, Christopher; Leder-Luis, Jetson; Gadarian, Shana Kushner; Albertson, Bethany; Rand, David G. (2014). Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064–1082. doi:10.1111/ajps.12103
---
title: "Blog Post 2"
author: "Claire Battaglia"
description: "Blog Post 2"
date: "10/02/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- Claire Battaglia
- text-as-data
- blog post 2
- open-text survey response
---
```{r}
#| label: setup
#| warning: false
library(readxl)
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)
```
## The Data
The dataset that I'll be working with is from the Missoula City-County's inaugural Community Food Assessment, completed in 2021. In particular, the responses to one (or more) of the open-ended questions:
1. What changes would you like to see for Missoula’s food system?
2. In your opinion, what strengths and assets exist in Missoula's food system (i.e. in what ways is the food system doing well)?
3. In your opinion, what gaps or unmet needs exist in Missoula's food system (i.e. in what ways could the food system do better)?
Question One and Question Three are distinct, but that distinction might not have been clear to respondents, so I'm not sure yet whether it makes sense to analyze both questions or just one of the two.
We received 389 responses in 2021. We are working to implement random sampling for our next wave and anticipate a larger, more representative sample moving forward. Working with this smaller sample will be good practice for me, but it's important to note that, given the sampling method (a mix of convenience and snowball sampling), I will not be drawing any inferences about the Missoula community as a whole from it.
## Research Question and Analysis
While I am still researching methods of analysis, I am currently leaning towards Structural Topic Modeling (STM). I am ultimately interested in understanding the prevalence and content of topics (for each of the above survey questions) by certain categories, such as age, household size, household income level, and geographic area. STM will enable me to do this. As described by Roberts et al., it will allow me to *discover* the topics within the responses rather than assume them based upon my own theoretical expectations.^[Roberts, Margaret E.; Stewart, Brandon M.; Tingley, Dustin; Lucas, Christopher; Leder-Luis, Jetson; Gadarian, Shana Kushner; Albertson, Bethany; Rand, David G. (2014). Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064–1082. doi:10.1111/ajps.12103]
*Prevalence* refers to how frequently a topic is discussed and *content* refers to the language used to discuss it.
STM will allow me to analyze whether different groups discuss certain topics more frequently than other groups but also what language they use to discuss them. This is important. Looking at prevalence alone may reveal that all income levels discuss farmers markets with the same frequency but looking at content as well could reveal that one income level uses words/phrases like "fun" or "meet friends" to talk about them while another group uses "expensive" or "far away." These are clearly getting at very different lived experiences of farmers markets.
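To make the distinction concrete for myself, here is a toy illustration in base R. It is not STM, just simple string matching and word lists on four invented responses (none of them come from the survey), and the group labels and the `drop_words` list are my own assumptions. Both groups mention farmers markets equally often, but the words they use differ.

```r
# Toy responses (invented for illustration; not from the survey).
toy <- data.frame(
  income = c("lower", "lower", "higher", "higher"),
  text = c(
    "farmers markets are expensive",
    "the farmers market is far away",
    "farmers markets are fun",
    "i meet friends at the farmers market"
  ),
  stringsAsFactors = FALSE
)

# Prevalence: share of responses in each group that mention farmers markets.
toy$mentions_market <- grepl("farmers market", toy$text, fixed = TRUE)
prevalence <- tapply(toy$mentions_market, toy$income, mean)

# Content: the words each group uses, after dropping the shared topic words
# and a few filler words.
drop_words <- c("farmers", "market", "markets", "are", "the", "is", "i", "at")
words_by_group <- lapply(split(toy$text, toy$income), function(x) {
  words <- unlist(strsplit(x, " ", fixed = TRUE))
  sort(setdiff(words, drop_words))
})

prevalence
words_by_group
```

Prevalence is identical across the two toy groups, so only the word lists reveal the difference in experience. STM estimates both quantities within a single model rather than through hand-built word lists like these.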
## Getting Started
```{r read data}
# check wd
getwd()
# read data
CFA_raw <- read_excel("FPAB Community Food Assessment Survey Data.xlsx", sheet = "Raw")
# preview
head(CFA_raw)
```
There is some initial cleaning I can do right now.
```{r tidy dataset}
# rename columns
CFA_tidy <- CFA_raw %>%
rename("id" = "Respondent ID",
"zip" = "Zip Code",
"age" = "Age",
"income" = "Annual Household Income",
"size" = "Number of Individuals in Your Household",
"change" = "What changes would you like to see for Missoula’s food system?",
"strengths" = "In your opinion, what strengths and assets exist in Missoula's food system (i.e. in what ways is the food system doing well)?")
# create subset
CFA_tidy <- CFA_tidy %>%
subset(select = c("id", "zip", "age", "income", "change"))
# preview
head(CFA_tidy)
```
```{r change class}
# change class
CFA_tidy$id <- as.character(CFA_tidy$id)
CFA_tidy$zip <- as.character(CFA_tidy$zip)
CFA_tidy$age <- as.numeric(CFA_tidy$age)
# TODO decide what to do with income categories
# CFA_tidy$size <- as.numeric(CFA_tidy$size)
# preview
head(CFA_tidy)
# remove na values
CFA_tidy <- na.omit(CFA_tidy)
save(CFA_tidy, file = "CFA_tidy.RData")
# TODO get mean age and size of household
# mean(CFA_tidy$age, na.rm = TRUE)
# mean(CFA_tidy$size, na.rm = TRUE)
```
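The `as.numeric()` call on age produces an "NAs introduced by coercion" warning, which means some raw age values are not plain numbers. Before dropping those rows with `na.omit()`, it may be worth looking at what they actually contain. A minimal sketch follows, using a hypothetical `age_raw` vector because I have not yet inspected which values in the survey cause the warning.

```r
# Hypothetical raw ages as they might appear in a survey export.
age_raw <- c("72", "42", "65+", "Prefer not to say", NA, "29")

# The coercion warning is expected here, so suppress it and inspect
# the failures explicitly instead.
age_num <- suppressWarnings(as.numeric(age_raw))

# Entries that were present in the raw data but became NA on conversion.
failed <- !is.na(age_raw) & is.na(age_num)
table(age_raw[failed])
```

With the real data, the same check could be run on the age column before it is overwritten by the numeric version.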
Next I'll calculate some summary statistics.
For **age** I'll be interested in mean, median, mode and range.
For **household size** I'll be interested in mean, median, mode and range.
I'll want the mode **income level**.
I'll also want all of those statistics broken out for each zip code.
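As a rough sketch of how these per-zip summaries might be computed: R has no built-in function for the statistical mode, so the `stat_mode()` helper below is my own (hypothetical) definition, and the toy data frame stands in for `CFA_tidy` with invented values. It is written in base R so that it runs without extra packages.

```r
# R has no built-in statistical mode, so define a small helper.
# Ties return the first of the most frequent values in sorted order.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Toy data standing in for CFA_tidy (values invented for illustration).
toy <- data.frame(
  zip = c("59801", "59801", "59801", "59802", "59802"),
  age = c(60, 61, 60, 42, 29),
  income = c("Between $50,000 and $99,999", "Between $50,000 and $99,999",
             "Under $25,000", "Between $35,000 and $49,999",
             "Between $35,000 and $49,999"),
  stringsAsFactors = FALSE
)

# One row of summary statistics per zip code.
by_zip <- do.call(rbind, lapply(split(toy, toy$zip), function(d) {
  data.frame(
    zip = d$zip[1],
    n = nrow(d),
    age_mean = mean(d$age),
    age_median = median(d$age),
    age_mode = as.numeric(stat_mode(d$age)),
    age_min = min(d$age),
    age_max = max(d$age),
    income_mode = stat_mode(d$income),
    stringsAsFactors = FALSE
  )
}))
by_zip
```

Household size could be summarized the same way as age once that column is kept in the subset and converted to numeric.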
## Thinking Ahead to Next Steps
Next I'll spend some time with the [`stm`](https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf) package and go through the methodology outlined in "Structural Topic Models for Open-Ended Survey Responses" in more detail. I definitely need to wrap my head around constructing the actual model(s), and there are some statistical concepts that have surfaced that I'm not familiar with (e.g. shrinkage priors and regularization).
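As a starting point, here is a first sketch of what constructing a model might look like. It is a sketch under assumptions, not a settled specification: `fit_change_stm()` is a hypothetical wrapper of my own, `k = 10` is an arbitrary placeholder for the number of topics, and the choice of `income` and `age` as covariates is only one possibility. It assumes the `stm` package is installed, along with `tm` and `SnowballC`, which `textProcessor()` relies on.

```r
# Sketch only: assumes `df` has the columns change (text), income
# (categorical), and age (numeric), with no missing values.
fit_change_stm <- function(df, k = 10) {
  # Lowercase, remove stopwords and punctuation, and stem the responses.
  processed <- stm::textProcessor(documents = df$change, metadata = df)

  # Drop infrequent terms and keep documents, vocabulary, and metadata aligned.
  prepped <- stm::prepDocuments(processed$documents, processed$vocab,
                                processed$meta)

  # prevalence: covariates that shift how often topics appear.
  # content: a single categorical covariate that shifts topic wording.
  stm::stm(
    documents = prepped$documents,
    vocab = prepped$vocab,
    K = k,
    prevalence = ~ income + age,
    content = ~ income,
    data = prepped$meta,
    init.type = "Spectral"
  )
}
```

Calling `fit_change_stm(CFA_tidy)` and then `stm::labelTopics()` on the result would be the natural first thing to try. With a sample this small, a much lower number of topics may turn out to be more appropriate.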
I've also started thinking about what I would like to be able to visualize and came across these posts about visualizing text data in both R and Python:
* R - [TextPlot: R Library for Visualizing Text Data](https://towardsdatascience.com/textplot-r-library-for-visualizing-text-data-a8f1740a032d)
* Python - [Advanced Visualisations for Text Data Analysis](https://towardsdatascience.com/advanced-visualisations-for-text-data-analysis-fc8add8796e2)
Neither is specific to STM, but both may give me a sense of what is possible, which packages to use, etc. The STM package documentation also includes some ideas for visualizing STM models.