Text as Data Final Project

Blog Post 2
Author

Quinn He

Published

October 19, 2022

Code
library(tidyverse)
library(RedditExtractoR)
library(syuzhet)
library(rvest)
library(quanteda)
library(quanteda.textplots)
library(polite)
library(cleanNLP)

knitr::opts_chunk$set(echo = TRUE)

Research Question

Compare how two subreddits (/r/republicans and /r/democrats) discuss particular political issues. This may be difficult because the democrats subreddit has a much larger user base.

I think I will have to scrape Reddit directly, at least for the comments, because I cannot find the command that retrieves the comments on a post. I am able to get post titles and the number of comments on each post, but I want to analyze the discourse in the comments as well.

Should I do something with /r/conspiracy?

If I choose to look into rhetoric on the Russia-Ukraine war, I would have to pick different subreddits, because few to no post titles in these two subreddits mention it.

What words do they tend to use over the opposition?
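One simple way to approach this question is to compare each word's share of all words in one subreddit's titles against its share in the other's. The sketch below uses base R and a handful of invented titles standing in for `blue$title` and `red$title`; the helper `word_share()` is a hypothetical name, not part of any package, and a real analysis would also remove stopwords.

```r
# Toy titles standing in for blue$title and red$title (invented for illustration)
blue_titles <- c("Vote early and protect democracy", "Vote for voting rights")
red_titles  <- c("Inflation is hurting families", "Vote to stop inflation")

# Hypothetical helper: lowercase, split on non-letters,
# and return each word's share of all words as a named numeric vector
word_share <- function(titles) {
  words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
  words <- words[nchar(words) > 0]
  counts <- table(words)
  setNames(as.numeric(counts) / length(words), names(counts))
}

blue_share <- word_share(blue_titles)
red_share  <- word_share(red_titles)

# Align both vectors on a shared vocabulary; words absent from one side get 0
vocab <- union(names(blue_share), names(red_share))
align <- function(share) {
  out <- share[vocab]
  out[is.na(out)] <- 0
  names(out) <- vocab
  out
}

# Positive values lean toward the blue titles, negative toward the red titles
share_diff <- align(blue_share) - align(red_share)
sort(share_diff, decreasing = TRUE)
```

Using shares rather than raw counts keeps the comparison fair when one subreddit has many more posts than the other.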

Below I pull data from the two subreddits. For now, these two subreddits will be the focus of my analysis.

Code
red <- find_thread_urls(subreddit = "republicans", sort_by = "top", period = "month")
parsing URLs on page 1...
Warning in file(con, "r"): cannot open URL 'https://www.reddit.com/r/republicans/top.json?t=month&limit=100': HTTP status was '429 Unknown Error'
Error in value[[3L]](cond): Cannot read from Reddit, check your inputs or internet connection
Code
blue <- find_thread_urls(subreddit = "democrats", sort_by = "top", period = "month")
parsing URLs on page 1...
Warning in file(con, "r"): cannot open URL 'https://www.reddit.com/r/democrats/top.json?t=month&limit=100': HTTP status was '429 Unknown Error'
Error in value[[3L]](cond): Cannot read from Reddit, check your inputs or internet connection
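HTTP status 429 means "Too Many Requests", so the calls above were most likely rate limited rather than malformed. A minimal sketch of one way to cope is to retry the call after a pause. The helper `with_retry()` and its arguments are hypothetical names introduced here, not part of RedditExtractoR, and the wait times are guesses rather than documented Reddit limits.

```r
# Hypothetical helper: call f() up to max_tries times,
# waiting a little longer after each failed attempt
with_retry <- function(f, max_tries = 3, wait_seconds = 60) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) {
      Sys.sleep(wait_seconds * attempt)
    }
  }
  stop("All ", max_tries, " attempts failed")
}

# Usage (not run here because it needs network access):
# red <- with_retry(function() {
#   find_thread_urls(subreddit = "republicans", sort_by = "top", period = "month")
# })
```

Because every later chunk depends on `red` and `blue`, a single failed download cascades into the "object not found" errors that follow, so retrying at this step is where it pays off most.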

Word cloud for /r/democrats titles

Code
blue_corpus <- corpus(blue$title)
Error in corpus(blue$title): object 'blue' not found
Code
blue_tokens <- tokens(blue_corpus,
                      remove_punct = T,
                      remove_numbers = T)
Error in tokens(blue_corpus, remove_punct = T, remove_numbers = T): object 'blue_corpus' not found
Code
blue_tokens <- tokens_select(blue_tokens,
                             pattern = stopwords("en"),
                             selection = "remove")
Error in tokens_select(blue_tokens, pattern = stopwords("en"), selection = "remove"): object 'blue_tokens' not found
Code
blue_dfm <- dfm(blue_tokens) %>%
  dfm_trim(min_termfreq = 3)
Error in dfm(blue_tokens): object 'blue_tokens' not found
Code
textplot_wordcloud(blue_dfm, max_words = 100, color = "blue")
Error in textplot_wordcloud(blue_dfm, max_words = 100, color = "blue"): object 'blue_dfm' not found

Word cloud for /r/republicans titles

Code
red_corpus <- corpus(red$title)
Error in corpus(red$title): object 'red' not found
Code
red_tokens <- tokens(red_corpus,
                     remove_punct = T,
                     remove_numbers = T)
Error in tokens(red_corpus, remove_punct = T, remove_numbers = T): object 'red_corpus' not found
Code
red_tokens <- tokens_select(red_tokens,
                            pattern = stopwords("en"),
                            selection = "remove")
Error in tokens_select(red_tokens, pattern = stopwords("en"), selection = "remove"): object 'red_tokens' not found
Code
red_dfm <- dfm(red_tokens) %>% 
  dfm_trim(min_termfreq = 3)
Error in dfm(red_tokens): object 'red_tokens' not found
Code
textplot_wordcloud(red_dfm, max_words = 100, color = "red")
Error in textplot_wordcloud(red_dfm, max_words = 100, color = "red"): object 'red_dfm' not found

This only retrieves the titles of each subreddit's top posts from the past month, but for now I will run sentiment analysis on those titles.

Code
red_title_sent <- get_nrc_sentiment(red$title)
Error in get_nrc_sentiment(red$title): object 'red' not found
Code
blue_title_sent <- get_nrc_sentiment(blue$title)
Error in get_nrc_sentiment(blue$title): object 'blue' not found
Code
red_title_sent <- cbind(red_title_sent, red)
Error in cbind(red_title_sent, red): object 'red_title_sent' not found
Code
blue_title_sent <- cbind(blue_title_sent, blue)
Error in cbind(blue_title_sent, blue): object 'blue_title_sent' not found
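Once the sentiment scores exist, one way to compare the two subreddits is to average each sentiment column per subreddit, which corrects for the different number of titles on each side. The sketch below uses small invented data frames in place of `red_title_sent` and `blue_title_sent`, and only four of the columns that `get_nrc_sentiment()` returns.

```r
# Toy stand-ins for red_title_sent and blue_title_sent (numbers are invented)
red_toy <- data.frame(
  anger    = c(1, 0, 2),
  joy      = c(0, 1, 0),
  negative = c(2, 0, 1),
  positive = c(0, 1, 1)
)
blue_toy <- data.frame(
  anger    = c(0, 1, 0, 1),
  joy      = c(1, 0, 2, 1),
  negative = c(0, 1, 0, 1),
  positive = c(2, 1, 1, 0)
)

# Subset of the NRC columns used in this sketch
sentiment_cols <- c("anger", "joy", "negative", "positive")

# Mean score per title, one row per subreddit
comparison <- rbind(
  republicans = colMeans(red_toy[sentiment_cols]),
  democrats   = colMeans(blue_toy[sentiment_cols])
)
comparison
```

Averaging per title rather than summing matters here because the democrats subreddit is expected to contribute more posts.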

Getting the comments from posts

Right now, I am only taking comments from front-page posts with more than 10 comments, just to test this method. In the future I will have to choose a topic that I can track across both subreddits so that the sentiment is comparable.
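Instead of copying URLs by hand, the thread table itself could be filtered for busy, on-topic posts. This is a sketch on an invented data frame; it assumes the result of `find_thread_urls()` has `title`, `comments`, and `url` columns, which should be checked against the actual output, and the keyword "inflation" is only an example.

```r
# Toy stand-in for the data frame returned by find_thread_urls()
# (column names are an assumption; rows and URLs are invented)
threads <- data.frame(
  title    = c("Inflation and the midterms", "Vote early", "Inflation numbers are out"),
  comments = c(25, 40, 4),
  url      = c("https://example.com/a", "https://example.com/b", "https://example.com/c"),
  stringsAsFactors = FALSE
)

# Keep threads with more than 10 comments
busy <- threads[threads$comments > 10, ]

# Of those, keep threads whose title mentions the topic of interest
on_topic <- busy[grepl("inflation", busy$title, ignore.case = TRUE), ]

topic_urls <- on_topic$url

# Not run here because it needs network access:
# comments <- get_thread_content(topic_urls)
```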

Code
blue_url <- c("https://www.reddit.com/r/democrats/comments/ye13dt/the_midterms_are_a_referendum_on_democracy_in/",
              "https://www.reddit.com/r/democrats/comments/ydqws7/vote_fetterman/",
              "https://www.reddit.com/r/democrats/comments/ye1vqa/republicans_denounce_inflation_but_few_economists/")


comments <- get_thread_content(blue_url)
Warning in file(con, "r"): cannot open URL 'https://www.reddit.com/r/democrats/comments/ye13dt/the_midterms_are_a_referendum_on_democracy_in/.json?limit=500': HTTP status was '429 Unknown Error'
Error in value[[3L]](cond): Cannot read from Reddit, check your inputs or internet connection
Code
three_comments <- comments[["comments"]] #this is from the "comments" dataframe
Error in eval(expr, envir, enclos): object 'comments' not found
Code

# Note to self: use find_thread_urls() to collect the thread URLs,
# then pass them to get_thread_content()
Code
get_nrc_sentiment(three_comments$comment)
Error in get_nrc_sentiment(three_comments$comment): object 'three_comments' not found

Web Scraping for Reddit Comments

First, I want to check whether I am even allowed to scrape the website for comments.

Code
bow("https://www.reddit.com/r/republicans/")
<polite session> https://www.reddit.com/r/republicans/
    User-agent: polite R package
    robots.txt: 45 rules are defined for 7 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent
Code
bow("https://www.reddit.com/r/democrats/")
<polite session> https://www.reddit.com/r/democrats/
    User-agent: polite R package
    robots.txt: 45 rules are defined for 7 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent

/r/democrats

I’m having trouble scraping the comments. When I try to pull all of the comments for one post, my output is NA and I cannot figure out why. When I pull a single comment, though, I get the output I want with no problem.

Code
# url for the main subreddit
url <- "https://www.reddit.com/r/democrats/"

com_url <- "https://www.reddit.com/r/democrats/comments/ychxe8/new_battleground_polls_a_boon_for_dems/"

# To pull the comments for this specific post
css_selector  <- "#t1_itmbw37 > div.Comment.t1_itmbw37.P8SGAKMtRxNwlmLz1zdJu.HZ-cv9q391bm8s7qT54B3._1z5rdmX8TDr6mqwNv7A70U > div._3tw__eCCe7j-epNCKGXUKk"

Notes for next steps:

Use get_thread_content() to pull the comments.

Set a specific time frame for looking at a particular issue. Setting a time frame would be better.

Start doing more interesting analysis.
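Restricting threads to a time frame can be done by filtering on the post date. The sketch below uses an invented data frame and assumes a `date_utc` column holding dates as "YYYY-MM-DD" strings, which should be checked against the real `find_thread_urls()` output; the window itself is arbitrary.

```r
# Toy stand-in for a thread table (column name date_utc is an assumption)
threads <- data.frame(
  date_utc = c("2022-09-28", "2022-10-05", "2022-10-20"),
  title    = c("Post A", "Post B", "Post C"),
  stringsAsFactors = FALSE
)

# Convert the text dates to Date objects so they can be compared
threads$date <- as.Date(threads$date_utc)

# Arbitrary example window
window_start <- as.Date("2022-10-01")
window_end   <- as.Date("2022-10-15")

# Keep only threads posted inside the window (inclusive on both ends)
in_window <- threads[threads$date >= window_start & threads$date <= window_end, ]
in_window$title
```

Applying the same window to both subreddits keeps the comparison tied to the same news events.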

Code
#Here is where I run into an issue trying to get all the comments for one particular post
reddit_post <- "https://www.reddit.com/r/democrats/comments/ycha3g/supreme_court_puts_hold_on_order_that_graham/"
css <- "#overlayScrollContainer > div._1npCwF50X2J7Wt82SZi6J0 > div.u35lf2ynn4jHsVUwPmNU.Dx3UxiK86VcfkFQVHNXNi > div.uI_hDmU5GSiudtABRz_37 > div._2M2wOqmeoPVvcSsJ6Po9-V._3287nL7j7epK9JmDC3N1VR"

blue_post <- reddit_post %>% 
  read_html() %>%
  html_node(css = css) %>% 
  html_text()
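Part of the difference between pulling one comment and pulling all of them comes down to which rvest function is used. `html_node()` returns only the first match and yields NA when the selector matches nothing, while `html_nodes()` returns every match. The sketch below shows this on a tiny inline page; the markup and class names are invented and are not Reddit's, so it illustrates the functions rather than diagnosing the NA above.

```r
library(rvest)

# Toy page standing in for a comment thread (hypothetical markup, not Reddit's)
page <- minimal_html('
  <div class="comment"><p>First comment</p></div>
  <div class="comment"><p>Second comment</p></div>
')

# html_node() returns only the first matching element
first_comment <- page %>%
  html_node("div.comment") %>%
  html_text(trim = TRUE)

# html_nodes() returns every matching element
all_comments <- page %>%
  html_nodes("div.comment") %>%
  html_text(trim = TRUE)

# A selector that matches nothing gives NA from html_node()
missing_comment <- page %>%
  html_node("div.no-such-class") %>%
  html_text()
```

If a selector copied from the browser returns NA, it may be worth checking whether that element is present in the HTML that `read_html()` actually downloads.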

Notes from other research

Reddit is not a great representation of the general public. It is a niche group, but it can host more in-depth discussion than Twitter. Reddit users are also usually passionate about particular ideas and subjects, so many of them talk freely about their views.

Previous Research

A Tale of Two Subreddits: https://ojs.aaai.org/index.php/ICWSM/article/view/19347/19119

No echo in the chambers of political interactions on Reddit: https://www.nature.com/articles/s41598-021-81531-x

Determining Presidential Approval Rating Using Reddit Sentiment Analysis: https://towardsdatascience.com/determining-presidential-approval-rating-using-reddit-sentiment-analysis-7912fdb5fcc7

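Populist Supporters on Reddit: A Comparison of Content and Behavioral Patterns Within Publics of Supporters of Donald Trump and Hillary Clinton: https://www.researchgate.net/publication/349794705_Populist_Supporters_on_Reddit_A_Comparison_of_Content_and_Behavioral_Patterns_Within_Publics_of_Supporters_of_Donald_Trump_and_Hillary_Clinton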