Code
library(tidyverse)
library(RedditExtractoR)
library(syuzhet)
library(rvest)
library(quanteda)
library(quanteda.textplots)
library(polite)
library(cleanNLP)
::opts_chunk$set(echo = TRUE) knitr
Quinn He
October 19, 2022
Compare how two subreddits (/r/republicans and /r/democrats) discuss particular political issues. With this it may be difficult because the democrat subreddit has a far superior user base.
I think I will have to actually scrape reddit for the comments at least because I cannot find the command to get comments of posts. I am able to get the titles of posts and how many comments they have, but I want to analze the discourse of the comments as well.
Should I do something with /r/conspiracy?
If I choose to look into rhetoric on the ukraine/russian war, I would have to pick different subreddits because there were little to no posts in subreddit titles.
What words do they tend to use over the opposition?
Below I am pulling data from the two subreddits below. As of now, these two subreddits will be my focus for analysis.
parsing URLs on page 1...
Warning in file(con, "r"): cannot open URL 'https://www.reddit.com/r/
republicans/top.json?t=month&limit=100': HTTP status was '429 Unknown Error'
Error in value[[3L]](cond): Cannot read from Reddit, check your inputs or internet connection
parsing URLs on page 1...
Warning in file(con, "r"): cannot open URL 'https://www.reddit.com/r/democrats/
top.json?t=month&limit=100': HTTP status was '429 Unknown Error'
Error in value[[3L]](cond): Cannot read from Reddit, check your inputs or internet connection
Word cloud for /r/democrats titles
Error in corpus(blue$title): object 'blue' not found
Error in tokens(blue_corpus, remove_punct = T, remove_numbers = T): object 'blue_corpus' not found
Error in tokens_select(blue_tokens, pattern = stopwords("en"), selection = "remove"): object 'blue_tokens' not found
Error in dfm(blue_tokens): object 'blue_tokens' not found
Error in textplot_wordcloud(blue_dfm, max_words = 100, color = "blue"): object 'blue_dfm' not found
Word cloud for /r/republicans titles
Error in corpus(red$title): object 'red' not found
Error in tokens(red_corpus, remove_punct = T, remove_numbers = T): object 'red_corpus' not found
Error in tokens_select(red_tokens, pattern = stopwords("en"), selection = "remove"): object 'red_tokens' not found
Error in dfm(red_tokens): object 'red_tokens' not found
Error in textplot_wordcloud(red_dfm, max_words = 100, color = "red"): object 'red_dfm' not found
This only gets the titles of the recent posts on the subreddit, but for now I will use it to just run sentiment analysis on that.
Error in get_nrc_sentiment(red$title): object 'red' not found
Error in get_nrc_sentiment(blue$title): object 'blue' not found
Error in cbind(red_title_sent, red): object 'red_title_sent' not found
Error in cbind(blue_title_sent, blue): object 'blue_title_sent' not found
Right now, I am only taking comments from posts I see on the front page with more than 10 comments, just to test this method. In the future I will have to determine a topic that I can track between both subreddits in order to get a sentiment that is comparable.
blue_url <- c("https://www.reddit.com/r/democrats/comments/ye13dt/the_midterms_are_a_referendum_on_democracy_in/",
"https://www.reddit.com/r/democrats/comments/ydqws7/vote_fetterman/",
"https://www.reddit.com/r/democrats/comments/ye1vqa/republicans_denounce_inflation_but_few_economists/")
comments <- get_thread_content(blue_url)
Warning in file(con, "r"): cannot open URL 'https://www.reddit.com/r/democrats/
comments/ye13dt/the_midterms_are_a_referendum_on_democracy_in/.json?limit=500':
HTTP status was '429 Unknown Error'
Error in value[[3L]](cond): Cannot read from Reddit, check your inputs or internet connection
Error in eval(expr, envir, enclos): object 'comments' not found
Error: <text>:2:5: unexpected symbol
1:
2: use get_thread_urls
^
First, I want to check if I can even scrape the website for comments
<polite session> https://www.reddit.com/r/republicans/
User-agent: polite R package
robots.txt: 45 rules are defined for 7 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent
<polite session> https://www.reddit.com/r/democrats/
User-agent: polite R package
robots.txt: 45 rules are defined for 7 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent
I’m having trouble scraping for comments. When I try to pull all of the comments for one post, my output is NA and I cannot firgure out why. When I pull a single comment though, I get the output I desire no problem.
# url for the main subreddit
url <- "https://www.reddit.com/r/democrats/"
com_url <- "https://www.reddit.com/r/democrats/comments/ychxe8/new_battleground_polls_a_boon_for_dems/"
# To pull the comments for this specific post
css_selector <- "#t1_itmbw37 > div.Comment.t1_itmbw37.P8SGAKMtRxNwlmLz1zdJu.HZ-cv9q391bm8s7qT54B3._1z5rdmX8TDr6mqwNv7A70U > div._3tw__eCCe7j-epNCKGXUKk"
use get thread content
set a specific time frame to look at a particular issue. Setting a time frame would be better
Start doing more interesting stuff
#Here is where I run into an issue trying to get all the comments for one particular post
reddit_post <- "https://www.reddit.com/r/democrats/comments/ycha3g/supreme_court_puts_hold_on_order_that_graham/"
css <- "#overlayScrollContainer > div._1npCwF50X2J7Wt82SZi6J0 > div.u35lf2ynn4jHsVUwPmNU.Dx3UxiK86VcfkFQVHNXNi > div.uI_hDmU5GSiudtABRz_37 > div._2M2wOqmeoPVvcSsJ6Po9-V._3287nL7j7epK9JmDC3N1VR"
blue_post <- reddit_post %>%
read_html() %>%
html_node(css = css) %>%
html_text()
Reddit is not a great representation of the general public. It is a niche group, but can have more in depth discussion than Twitter. Reddit users are also, usually, passionate about certain ideas and subjects, therefore many users will talk freely about their ideas.
A Tale of Two Subreddits: https://ojs.aaai.org/index.php/ICWSM/article/view/19347/19119
No echo in the chambers of political interactions on Reddit: https://www.nature.com/articles/s41598-021-81531-x
Determining Presidential Approval Rating Using Reddit Sentiment Analysis: https://towardsdatascience.com/determining-presidential-approval-rating-using-reddit-sentiment-analysis-7912fdb5fcc7
https://www.researchgate.net/publication/349794705_Populist_Supporters_on_Reddit_A_Comparison_of_Content_and_Behavioral_Patterns_Within_Publics_of_Supporters_of_Donald_Trump_and_Hillary_Clinton
---
title: "Text as Data Final Project"
author: "Quinn He"
desription: "Research project"
date: "10/19/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- Blog Post 2
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(RedditExtractoR)
library(syuzhet)
library(rvest)
library(quanteda)
library(quanteda.textplots)
library(polite)
library(cleanNLP)
knitr::opts_chunk$set(echo = TRUE)
```
## Research Question
Compare how two subreddits (/r/republicans and /r/democrats) discuss particular political issues. With this it may be difficult because the democrat subreddit has a far superior user base.
I think I will have to actually scrape reddit for the comments at least because I cannot find the command to get comments of posts. I am able to get the titles of posts and how many comments they have, but I want to analze the discourse of the comments as well.
Should I do something with /r/conspiracy?
If I choose to look into rhetoric on the ukraine/russian war, I would have to pick different subreddits because there were little to no posts in subreddit titles.
What words do they tend to use over the opposition?
Below I am pulling data from the two subreddits below. As of now, these two subreddits will be my focus for analysis.
```{r}
red <- find_thread_urls(subreddit = "republicans", sort_by = "top", period = "month")
blue <- find_thread_urls(subreddit = "democrats", sort_by = "top", period = "month")
```
Word cloud for /r/democrats titles
```{r}
blue_corpus <- corpus(blue$title)
blue_tokens <- tokens(blue_corpus,
remove_punct = T,
remove_numbers = T)
blue_tokens <- tokens_select(blue_tokens,
pattern = stopwords("en"),
selection = "remove")
blue_dfm <- dfm(blue_tokens)%>%
dfm_trim(min_termfreq = 3)
textplot_wordcloud(blue_dfm, max_words = 100, color = "blue")
```
Word cloud for /r/republicans titles
```{r}
red_corpus <- corpus(red$title)
red_tokens <- tokens(red_corpus,
remove_punct = T,
remove_numbers = T)
red_tokens <- tokens_select(red_tokens,
pattern = stopwords("en"),
selection = "remove")
red_dfm <- dfm(red_tokens) %>%
dfm_trim(min_termfreq = 3)
textplot_wordcloud(red_dfm, max_words = 100, color = "red")
```
This only gets the titles of the recent posts on the subreddit, but for now I will use it to just run sentiment analysis on that.
```{r}
red_title_sent <- get_nrc_sentiment(red$title)
blue_title_sent <- get_nrc_sentiment(blue$title)
red_title_sent <- cbind(red_title_sent, red)
blue_title_sent <- cbind(blue_title_sent, blue)
```
## Getting the comments from posts
Right now, I am only taking comments from posts I see on the front page with more than 10 comments, just to test this method. In the future I will have to determine a topic that I can track between both subreddits in order to get a sentiment that is comparable.
```{r}
blue_url <- c("https://www.reddit.com/r/democrats/comments/ye13dt/the_midterms_are_a_referendum_on_democracy_in/",
"https://www.reddit.com/r/democrats/comments/ydqws7/vote_fetterman/",
"https://www.reddit.com/r/democrats/comments/ye1vqa/republicans_denounce_inflation_but_few_economists/")
comments <- get_thread_content(blue_url)
three_comments <- comments[["comments"]] #this is from the "comments" dataframe
```
```{r}
use get_thread_urls
then use get_thread_content()
```
```{r}
get_nrc_sentiment(three_comments$comment)
```
## Webscrapping for Reddit Comments
First, I want to check if I can even scrape the website for comments
```{r}
bow("https://www.reddit.com/r/republicans/")
bow("https://www.reddit.com/r/democrats/")
```
## /r/democrats
I'm having trouble scraping for comments. When I try to pull all of the comments for one post, my output is NA and I cannot firgure out why. When I pull a single comment though, I get the output I desire no problem.
```{r}
# url for the main subreddit
url <- "https://www.reddit.com/r/democrats/"
com_url <- "https://www.reddit.com/r/democrats/comments/ychxe8/new_battleground_polls_a_boon_for_dems/"
# To pull the comments for this specific post
css_selector <- "#t1_itmbw37 > div.Comment.t1_itmbw37.P8SGAKMtRxNwlmLz1zdJu.HZ-cv9q391bm8s7qT54B3._1z5rdmX8TDr6mqwNv7A70U > div._3tw__eCCe7j-epNCKGXUKk"
```
use get thread content
set a specific time frame to look at a particular issue. Setting a time frame would be better
Start doing more interesting stuff
```{r}
#Here is where I run into an issue trying to get all the comments for one particular post
reddit_post <- "https://www.reddit.com/r/democrats/comments/ycha3g/supreme_court_puts_hold_on_order_that_graham/"
css <- "#overlayScrollContainer > div._1npCwF50X2J7Wt82SZi6J0 > div.u35lf2ynn4jHsVUwPmNU.Dx3UxiK86VcfkFQVHNXNi > div.uI_hDmU5GSiudtABRz_37 > div._2M2wOqmeoPVvcSsJ6Po9-V._3287nL7j7epK9JmDC3N1VR"
blue_post <- reddit_post %>%
read_html() %>%
html_node(css = css) %>%
html_text()
```
## Notes from other research
Reddit is not a great representation of the general public. It is a niche group, but can have more in depth discussion than Twitter. Reddit users are also, usually, passionate about certain ideas and subjects, therefore many users will talk freely about their ideas.
## Previous Research
A Tale of Two Subreddits: https://ojs.aaai.org/index.php/ICWSM/article/view/19347/19119
No echo in the chambers of political interactions on Reddit: https://www.nature.com/articles/s41598-021-81531-x
Determining Presidential Approval Rating Using Reddit Sentiment Analysis: https://towardsdatascience.com/determining-presidential-approval-rating-using-reddit-sentiment-analysis-7912fdb5fcc7
https://www.researchgate.net/publication/349794705_Populist_Supporters_on_Reddit_A_Comparison_of_Content_and_Behavioral_Patterns_Within_Publics_of_Supporters_of_Donald_Trump_and_Hillary_Clinton