hw3
Author

Molly Hackbarth

Published

October 15, 2022

Code
library(tidyverse)
library(cld3)
library(dplyr)
library(stringi)
library(stringr)
library(here)
library(devtools)
library(tidytext)
library(plyr)
library(quanteda)
library(preText)
library(quanteda.textstats)
library(quanteda.textmodels)
library(quanteda.textplots)
library(hunspell)
library(wordcloud)
library(RColorBrewer)
library(tm)
library(emojifont)

knitr::opts_chunk$set(echo = TRUE)

Research Question

My current research question: How do Reddit and Twitter users feel about the show Love is Blind Japan?

Working with the Data

In my previous blog post I started cleaning the data using two different methods. After some discussion, I decided to drop the "textclean" package to avoid potential issues. I will also backtrack a bit and keep the Twitter and Reddit data in separate data frames.

Adding the Data Separately

Code
reddit_data <- read.csv(here::here("posts", "_data", "loveisblindjapan.csv"))

twitter1 <- read.csv(here::here("posts", "_data", "tweets.csv"))

twitter2 <- read.csv(here::here("posts", "_data", "tweets#.csv"))
Code
reddit <- subset(reddit_data, select = c("body", "created_utc")) 

# created_utc is a Unix timestamp in seconds
reddit$created_utc <- as.Date(as.POSIXct(reddit$created_utc, origin = "1970-01-01", tz = "UTC"))

reddit <- reddit %>% 
  select(text = body, 
            date = created_utc)
# remove deleted or removed comments by moderators of the subreddit (ones that only contain [deleted] or [removed])
reddit <- reddit %>% 
  filter(!text == '[deleted]') %>% 
  filter(!text == '[removed]')
Code
#remove the row-number and user name columns
twitter1 <- twitter1 %>% select(!c(X, User))
twitter2 <- twitter2 %>% select(!c(X, User))

# combine the two sets of tweets, keeping all rows from both
twitter <- merge(twitter1, twitter2, by = c('Tweet', 'Date'), all = TRUE)
write.csv(twitter, here::here("posts", "_data", "twitter.csv"), row.names = FALSE)
Code
names(twitter) <- tolower(names(twitter))
twitter <- twitter %>% 
  rename(text = tweet) # "date" is already lowercase after tolower() above
twitter$date <- as.Date(strftime(twitter$date, format="%Y-%m-%d"))

# remove duplicate tweets
twitter <- twitter %>% distinct(text, date, .keep_all = TRUE)

#check that no duplicate tweets remain
any(duplicated(twitter))
[1] FALSE
Code
allsocialmedia <- merge(twitter, reddit, by = c('text', 'date'), all = TRUE)
write.csv(allsocialmedia, here::here("posts", "_data", "loveisblind_socialmedia.csv"), row.names = FALSE)

Using HunSpell

After listening to the class lectures, I decided to install the "hunspell" package so I could look at the spelling errors in the text.

Code
spell_check_reddit <- hunspell(reddit$text)

spell_check_twitter <- hunspell(twitter$text)

Here I can see that Twitter users more often use shortened words (e.g. "ppl" instead of "people"). However, I also noticed that the tweets contain more non-English text. This is likely because Reddit is a mostly English-speaking platform compared to Twitter.
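
To see what kinds of things hunspell is flagging most often, here is a small sketch that tabulates the flagged words from the objects above (the "_flagged" object names are my own).

Code
library(hunspell)

# most frequently flagged words on each platform
twitter_flagged <- sort(table(unlist(spell_check_twitter)), decreasing = TRUE)
head(twitter_flagged, 20)

reddit_flagged <- sort(table(unlist(spell_check_reddit)), decreasing = TRUE)
head(reddit_flagged, 20)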

Overall, though, the data wasn't too bad. I would still like to remove URLs, since they aren't helpful for the analysis. However, a good point was raised in class that emojis often convey emotion, so for now I will leave them in.

Non-English text I will still remove, since I can't analyze it accurately without knowing the language.
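
To get a rough sense of how much non-English text there is before removing it, here is a quick sketch using the cld3 package loaded above; detect_language() guesses a language code per post and returns NA when it can't tell.

Code
library(cld3)

# most common detected languages on each platform
head(sort(table(detect_language(twitter$text)), decreasing = TRUE), 10)
head(sort(table(detect_language(reddit$text)), decreasing = TRUE), 10)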

Below you will find some approaches that didn't work for me.

Cleaning the Data and Creating Tokens

To see whether the cleaning works, I'll test it first on the Reddit data.

Code
reddit_corpus <- corpus(reddit)
redditsummary <- summary(reddit_corpus)

#remove non english languages 
reddit_corpus <- subset(reddit_corpus, detect_language(reddit_corpus) == "en") 

#remove htmls
reddit_corpus <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", reddit_corpus)

#remove emojis
only_ascii_regexp <- '[^\u0001-\u007F]+|<U\\+\\w+>'
reddit_corpus <- reddit_corpus %>% 
 str_replace_all(regex(only_ascii_regexp), "") 

reddit_corpus <- reddit_corpus[!is.na(reddit_corpus)]
Code
reddit_corpus_tokens <- tokens(reddit_corpus, 
    remove_punct = T,
    remove_numbers = T)

reddit_corpus_tokens <- tokens_tolower(reddit_corpus_tokens)

reddit_corpus_tokens <- tokens_select(reddit_corpus_tokens, pattern = stopwords("en"), selection = "remove")

reddit_lemmitized <- tokens_replace(reddit_corpus_tokens, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)

Trying to create a DTM

Code
dtm <- TermDocumentMatrix(reddit_lemmitized) 
matrix <- as.matrix(dtm) 
Error: vector memory exhausted (limit reached?)
Code
words <- sort(rowSums(matrix),decreasing=TRUE) 
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...): 'x' must be an array of at least two dimensions
Code
df <- data.frame(word = names(words),freq=words)

While this works for creating tokens, it has unfortunately dropped the date column. That is fine for gathering an overall sentiment, but it would be nice to keep the date as well.

Additionally, trying to create a DTM from these tokens with the tm package doesn't work.
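
One possible fix (a minimal sketch, not the pipeline I ended up using; the object names are mine): building the corpus straight from the data frame keeps "date" as a document variable instead of dropping it, and quanteda's own dfm() works on a tokens object, unlike tm::TermDocumentMatrix().

Code
library(quanteda)
library(dplyr)

reddit_corpus_dv <- corpus(reddit, text_field = "text")

# the date is carried along as a docvar
head(docvars(reddit_corpus_dv, "date"))

reddit_dfm_dv <- reddit_corpus_dv %>% 
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>% 
  tokens_tolower() %>% 
  tokens_remove(stopwords("en")) %>% 
  dfm()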

Trying a Different Way

Below you will see me attempt to build a corpus that keeps the dates (this is the object that shows up as a large SimpleCorpus), since I would like to keep the date column. I will once again start with the Reddit data.

Code
test <- corpus(reddit)

#remove non english languages 
test$text <- subset(test$text, detect_language(test) == "en") 

#remove htmls
test$text <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", test$text)

#remove emojis
only_ascii_regexp <- '[^\u0001-\u007F]+|<U\\+\\w+>'
test$text <- test$text %>% 
 str_replace_all(regex(only_ascii_regexp), "") 

test <- Corpus(VectorSource(test))

test <- test %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>% # strip extra whitespace
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeWords, stopwords("english")) # remove stop words

Trying to Create Tokens

Code
test_tokens <- tokens(test, 
    remove_punct = T,
    remove_numbers = T)
Error: tokens() only works on character, corpus, list, tokens objects.
Code
test_tokens <- tokens_tolower(test_tokens)
Error in tokens_tolower(test_tokens): object 'test_tokens' not found
Code
lemmitized <- tokens_replace(test_tokens, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)
Error in tokens_replace(test_tokens, pattern = lexicon::hash_lemmas$token, : object 'test_tokens' not found

While this has created a large SimpleCorpus that keeps the two columns I mentioned, I can't seem to create quanteda "tokens" from it, which could make the analysis difficult.
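
One workaround (a minimal sketch; the object names are mine): quanteda's tokens() accepts a plain character vector, so the string cleaning can be done first and the text only wrapped in tm's Corpus() if the tm functions are actually needed.

Code
library(stringr)
library(quanteda)

only_ascii_regexp <- '[^\u0001-\u007F]+|<U\\+\\w+>'

# clean the raw text as a character vector
cleaned_text <- str_replace_all(reddit$text, regex(only_ascii_regexp), "")
cleaned_text <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", cleaned_text)

# tokens() is happy with a character vector
test_tokens_alt <- tokens(cleaned_text,
                          remove_punct = TRUE,
                          remove_numbers = TRUE)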

DTM & Word Cloud

Code
dtm <- TermDocumentMatrix(test) 
matrix <- as.matrix(dtm) 
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)
Code
library(wordcloud)
library(RColorBrewer)
set.seed(1234) # for reproducibility 
wordcloud(words = df$word, freq = df$freq, min.freq = 50,
          max.words = 100, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

Here we can see the word cloud worked!

I also noticed that, despite removing emojis and other languages this way, some still seem to end up in the matrix. They only show up once in a while, but it is a bit strange.

Additional TM code

To try to remove more of them, I will use some additional tm code.

Code
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
test <- test %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>% 
  tm_map(toSpace, "/") %>% 
  tm_map(toSpace, "@") %>% 
  tm_map(toSpace, "\\|")

Dealing with some issues

I realized my current approach to removing non-English text is not working completely, so I'm trying an additional transformation to make it work.

I also realized I was using tm_map to remove numbers, which may have caused an issue.

Code
test2 <- corpus(reddit)
test2 <- subset(test2, detect_language(test2) == "en") 

test2 <- stringi::stri_trans_general(test2, "latin-ascii")


#remove htmls
test2 <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", test2)

test2 <- Corpus(VectorSource(reddit_tm))
Error in SimpleSource(length = length(x), content = x, class = "VectorSource"): object 'reddit_tm' not found
Code
test2 <- subset(test2, detect_language(test2) == "en") 

test2 <- stringi::stri_trans_general(test2, "latin-ascii")

#remove htmls
test2 <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", test3)
Error in is.factor(x): object 'test3' not found
Code
#remove emojis
only_ascii_regexp <- '[^\u0001-\u007F]+|<U\\+\\w+>'
test2 <- test2 %>% 
 str_replace_all(regex(only_ascii_regexp), "") 

#change non letters to spaces
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

test2 <- test %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>% # strip extra whitespace
  tm_map(toSpace, "/") %>% 
  tm_map(toSpace, "@") %>% 
  tm_map(toSpace, "\\|") %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeWords, stopwords("english")) # remove stop words

Trying a DTM & Word Cloud Again

Code
dtm <- TermDocumentMatrix(test2) 
matrix <- as.matrix(dtm) 
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)
Code
library(wordcloud)
set.seed(1234) # for reproducibility 
wordcloud(words = df$word, freq = df$freq, min.freq = 500, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))

This seems to have kept the dates in the corpus! However, I noticed that the dates also show up in the word cloud, which I don't want.

Additionally, I learned that I need a fairly high minimum frequency for this to work well; otherwise I got warnings that some words could not be fit on the page and would not be plotted.
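
One tweak that may help here (a small sketch building on the df object above): wordcloud()'s scale argument (default c(4, 0.5)) controls the word sizes, and shrinking the upper value lets more words fit on the page without needing such a high minimum frequency. The exact scale values here are just a guess.

Code
library(wordcloud)
library(RColorBrewer)

set.seed(1234) # for reproducibility
wordcloud(words = df$word, freq = df$freq, min.freq = 50,
          max.words = 100, scale = c(3, 0.4), # smaller maximum word size
          random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))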

Way that Worked - Word Cloud with Lemmatized Tokens

Another way for Word Clouds

Thanks to the class blog, I found another way to make text plots that doesn't involve using the "tm" package to create a corpus! Below you will see the word cloud for the Reddit posts.

Code
library(quanteda.textplots)
reddit_corpus_edit <- reddit_lemmitized %>% 
  dfm(remove = stopwords('english'), remove_punct = TRUE) %>% 
  dfm_trim(min_termfreq = 10, verbose = FALSE)
Warning: '...' should not be used for tokens() arguments; use 'tokens()' first.
Warning: 'remove' is deprecated; use dfm_remove() instead
Code
textplot_wordcloud(reddit_corpus_edit, max_words=100, color="red")

I first tried it while leaving the stop words in, but the cloud was overtaken by words such as "I", "the", and "and", so I put the stop word removal back in.

Twitter Word Cloud

Below you will see the word cloud for Twitter posts.

Code
twitter_corpus <- corpus(twitter)
twittersummary <- summary(twitter_corpus)

#remove non english languages 
twitter_corpus <- subset(twitter_corpus, detect_language(twitter_corpus) == "en") 

twitter_corpus <- stringi::stri_trans_general(twitter_corpus, "latin-ascii")

#remove htmls
twitter_corpus <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", twitter_corpus)

twitter_corpus <- twitter_corpus[!is.na(twitter_corpus)]


twitter_corpus_tokens <- tokens(twitter_corpus, 
    remove_punct = T,
    remove_numbers = T)

twitter_corpus_tokens <- tokens_tolower(twitter_corpus_tokens)

twitter_corpus_tokens <- tokens_select(twitter_corpus_tokens, pattern = stopwords("en"), selection = "remove")

twitter_lemmitized <- tokens_replace(twitter_corpus_tokens, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)

library(quanteda.textplots)

twitter_corpus_edit <- twitter_lemmitized %>% 
  dfm(remove = stopwords('english'), remove_punct = TRUE) %>% 
  dfm_trim(min_termfreq = 10, verbose = FALSE)
Warning: '...' should not be used for tokens() arguments; use 'tokens()' first.
Warning: 'remove' is deprecated; use dfm_remove() instead
Code
textplot_wordcloud(twitter_corpus_edit, max_words=100, color="blue")

Something I noticed is that when I tried to subset to English, the number of tweets in the Twitter corpus went down by almost half. This seemed like an error, since looking at my data it shouldn't have dropped by more than about 1,000. This led me to remove the detect_language() step and instead add a step that strips non-ASCII characters, which removes emojis and non-English text.

Something else I noticed for Twitter is that some non-English words or emojis are used frequently. My guess is that it's mostly "love is blind japan" written in Japanese.
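
To check this guess, here is a quick sketch that counts how many tweets contain Japanese-script characters, using the same Unicode blocks as the str_rm_jap() helper further down in this post.

Code
library(stringr)

# kana and kanji blocks
japanese_chars <- "[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FAF]"
sum(str_detect(twitter$text, japanese_chars))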

Code
#library(emojifont) #reads japanese?

only_ascii_regexp <- '[^\u0001-\u03FF]+|<U\\+\\w+>'

twittertest <- twitter$text %>%
  str_replace_all(regex(only_ascii_regexp), "") 

#remove non english languages 

#twittertest <- stringi::stri_trans_general(twitter_corpus, "latin-ascii")

#remove htmls
twittertest <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", twittertest)

twittertest <- twittertest[!is.na(twittertest)]

twitter_corpus_tokens1 <- tokens(twittertest, 
    remove_punct = T,
    remove_numbers = T)

twitter_corpus_tokens1 <- tokens_tolower(twitter_corpus_tokens1)

twitter_corpus_tokens1 <- tokens_select(twitter_corpus_tokens1, pattern = stopwords("en"), selection = "remove")

twitter_lemmitized1 <- tokens_replace(twitter_corpus_tokens1, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)

library(quanteda.textplots)

twittertest1 <- twitter_lemmitized1 %>% 
  dfm(remove = stopwords('english'), remove_punct = TRUE) %>% 
  dfm_trim(min_termfreq = 10, verbose = FALSE)
Warning: '...' should not be used for tokens() arguments; use 'tokens()' first.
Warning: 'remove' is deprecated; use dfm_remove() instead
Code
textplot_wordcloud(twittertest1, max_words=100, color="blue")

A quick note: for some reason the tm package word clouds were a bit finicky; they sometimes seemed to work and sometimes did not.

Code For New Word Clouds

Lemmatize

Code
reddit_corpus <- corpus(reddit)
redditsummary <- summary(reddit_corpus)

only_ascii_regexp <- '[^\u0001-\u03FF]+|<U\\+\\w+>'

reddit_corpus <- reddit$text %>%
  str_replace_all(regex(only_ascii_regexp), "") 

#remove non english languages 
reddit_corpus <- stringi::stri_trans_general(reddit_corpus, "latin-ascii")

#remove emojis
only_ascii_regexp <- '[^\u0001-\u007F]+|<U\\+\\w+>'
reddit_corpus <- reddit_corpus %>% 
 str_replace_all(regex(only_ascii_regexp), "") 

#remove htmls
reddit_corpus <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", reddit_corpus)
reddit_corpus <- reddit_corpus[!is.na(reddit_corpus)]

reddit_corpus_tokens <- tokens(reddit_corpus, 
    remove_punct = T,
    remove_numbers = T)

reddit_corpus_tokens <- tokens_tolower(reddit_corpus_tokens)

reddit_corpus_tokens <- tokens_select(reddit_corpus_tokens, pattern = stopwords("en"), selection = "remove")

reddit_lemmitized <- tokens_replace(reddit_corpus_tokens, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)

library(quanteda.textplots)

reddit_corpus_edit <- reddit_lemmitized %>% 
  dfm(remove = stopwords('english'), remove_punct = TRUE) %>% 
  dfm_trim(min_termfreq = 30, verbose = FALSE)
Warning: '...' should not be used for tokens() arguments; use 'tokens()' first.
Warning: 'remove' is deprecated; use dfm_remove() instead
Code
textplot_wordcloud(reddit_corpus_edit, max_words=200, color="red")

Code
library(emojifont) #reads japanese?

only_ascii_regexp <- '[^\u0001-\u03FF]+|<U\\+\\w+>'

twitter_corpus<- twitter$text %>%
  str_replace_all(regex(only_ascii_regexp), "") 

twitter_corpus <- corpus(twitter_corpus)
twittersummary <- summary(twitter_corpus)

#remove non english languages 
twitter_corpus <- stringi::stri_trans_general(twitter_corpus, "latin-ascii")

#remove htmls
twitter_corpus <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", twitter_corpus)

twitter_corpus <- twitter_corpus[!is.na(twitter_corpus)]

twitter_corpus_tokens <- tokens(twitter_corpus, 
    remove_punct = T,
    remove_numbers = T)

twitter_corpus_tokens <- tokens_tolower(twitter_corpus_tokens)

twitter_corpus_tokens <- tokens_select(twitter_corpus_tokens, pattern = stopwords("en"), selection = "remove")

twitter_lemmitized <- tokens_replace(twitter_corpus_tokens, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)

library(quanteda.textplots)

twitter_corpus_edit <- twitter_lemmitized %>% 
  dfm(remove = stopwords('english'), remove_punct = TRUE) %>% 
  dfm_trim(min_termfreq = 30, verbose = FALSE)
Warning: '...' should not be used for tokens() arguments; use 'tokens()' first.
Warning: 'remove' is deprecated; use dfm_remove() instead
Code
textplot_wordcloud(twitter_corpus_edit, max_words=200, color="blue")

Some interesting facts about the lemmatized plots

  • For Twitter, #loveisblindjapan is one of the largest words, along with "blind", "love", and "japan". This is unsurprising, as that's the name of the show.

  • Reddit used the words "like" and "think" a lot. Interestingly, "japanese" is mentioned more often than "japan".

  • The names of the actual contestants are not mentioned as much as I would have thought. They are still mentioned often, but they are not major words in the word cloud.

  • Japanese characters and possibly emojis both made their way into the word cloud; I'm unsure why (the sketch after this list shows one way to inspect them).
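
To see exactly which of those made it through, here is a small sketch that lists the dfm features still containing non-ASCII characters, reusing the pattern from earlier (the object name leftover is mine; the same check works on reddit_corpus_edit).

Code
library(quanteda)
library(stringr)

# features in the Twitter dfm that still contain non-ASCII characters
leftover <- str_subset(featnames(twitter_corpus_edit), '[^\u0001-\u007F]')
head(leftover, 20)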

TM Package

Code
reddit_tm <- corpus(reddit)

only_ascii_regexp <- '[^\u0001-\u03FF]+|<U\\+\\w+>'

reddit_tm <- reddit$text %>%
  str_replace_all(regex(only_ascii_regexp), "") 

#remove non english languages 
reddit_tm <- stringi::stri_trans_general(reddit_tm, "latin-ascii")

#remove htmls
reddit_tm <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", reddit_tm)

reddit_tm <- Corpus(VectorSource(reddit_tm))

#change non letters to spaces
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

reddit_tm <- reddit_tm %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>% # remove multiple white space
  tm_map(toSpace, "/") %>% 
  tm_map(toSpace, "@") %>% 
  tm_map(toSpace, "\\|") %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeWords, stopwords("english")) # remove stop words
Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
documents
Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
documents
Warning in tm_map.SimpleCorpus(., toSpace, "/"): transformation drops documents
Warning in tm_map.SimpleCorpus(., toSpace, "@"): transformation drops documents
Warning in tm_map.SimpleCorpus(., toSpace, "\\|"): transformation drops
documents
Warning in tm_map.SimpleCorpus(., content_transformer(tolower)): transformation
drops documents
Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
transformation drops documents
Code
reddit_dtm <- TermDocumentMatrix(reddit_tm) 
reddit_matrix <- as.matrix(reddit_dtm) 
reddit_words <- sort(rowSums(reddit_matrix),decreasing=TRUE) 
reddit_df <- data.frame(word = names(reddit_words), freq = reddit_words)

set.seed(1234) # for reproducibility 
wordcloud(words = reddit_df$word, freq = reddit_df$freq, min.freq = 50,
          max.words = 100, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

Code
twitter_tm <- corpus(twitter)

only_ascii_regexp <- '[^\u0001-\u03FF]+|<U\\+\\w+>'

twitter_tm <- twitter$text %>%
  str_replace_all(regex(only_ascii_regexp), "") 

#remove non english languages 
twitter_tm <- stringi::stri_trans_general(twitter_tm, "latin-ascii")

#remove htmls
twitter_tm <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", twitter_tm)

twitter_tm <- Corpus(VectorSource(twitter_tm))

#change non letters to spaces
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

twitter_tm <- twitter_tm %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>% # remove multiple white space
  tm_map(toSpace, "/") %>% 
  tm_map(toSpace, "@") %>% 
  tm_map(toSpace, "\\|") %>%
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeWords, stopwords("english")) # remove stop words
Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
documents
Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
documents
Warning in tm_map.SimpleCorpus(., toSpace, "/"): transformation drops documents
Warning in tm_map.SimpleCorpus(., toSpace, "@"): transformation drops documents
Warning in tm_map.SimpleCorpus(., toSpace, "\\|"): transformation drops
documents
Warning in tm_map.SimpleCorpus(., content_transformer(tolower)): transformation
drops documents
Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
transformation drops documents
Code
twitter_dtm <- TermDocumentMatrix(twitter_tm) 
twitter_matrix <- as.matrix(twitter_dtm) 
twitter_words <- sort(rowSums(twitter_matrix),decreasing=TRUE) 
twitter_df <- data.frame(word = names(twitter_words), freq = twitter_words)

set.seed(12345) # for reproducibility 
wordcloud(words = twitter_df$word, freq = twitter_df$freq, min.freq = 50,
          max.words = 100, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

Some interesting facts for TM:

  • I thought that "'s" would be caught as part of some stop word, but it seems it isn't for either platform. My guess is that it mostly follows the names of people on the show.
  • Even with another way of removing emojis and non-English text, some still seem to remain in the data, though not very much.
  • "Like" and "think" were both very popular words on Twitter and Reddit.
  • Not all of the contestants were posted about equally.
  • Despite the show being about marriage, the words "married", "marry", and "marriage" weren't among the top words for either Twitter or Reddit.

Overall, between the two types of word clouds, I prefer the lemmatized version, so I will continue using that one.

However, I am surprised that they're so different. The only real difference in the cleaning code is the "toSpace" transformer, which removes just a couple of extra things.

Emoji Thoughts

Using the function I found in the DACSS Slack channel, I noticed it also has a nice side effect of removing non-English text.

Additionally, I've noticed that only a few emojis have made it into the word clouds, and I don't feel confident enough in decoding emojis to include them in a sentiment analysis.

Thus for this project I will not keep emojis.

Attempting to Remove Japanese

Since I noticed there was still a lot of Japanese left in the word cloud, I tested whether I could remove more of it with an additional helper function. Combined with the other cleaning options, this seems to give the best results!

Below you will see a test version.

Code
str_rm_jap = function(x) {
  #we replace japanese blocks with nothing, and clean any double whitespace from this
  #reference at http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
  x %>% 
    #japanese style punctuation
    str_replace_all("[\u3000-\u303F]", "") %>% 
    #katakana
    str_replace_all("[\u30A0-\u30FF]", "") %>% 
    #hiragana
    str_replace_all("[\u3040-\u309F]", "") %>% 
    #kanji
    str_replace_all("[\u4E00-\u9FAF]", "") %>% 
    #remove excess whitespace
    str_replace_all("  +", " ") %>% 
    str_trim()
}

test_corpus <- twitter$text %>% str_rm_jap()

test_corpus <- test_corpus %>%
 str_replace_all(regex(only_ascii_regexp), "") 

#remove non english languages 
test_corpus <- stringi::stri_trans_general(test_corpus, "latin-ascii")

#remove htmls
test_corpus <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", test_corpus)

test_corpus <- test_corpus[!is.na(test_corpus)]

test_corpus_tokens <- tokens(test_corpus, 
    remove_punct = T,
    remove_numbers = T)

test_corpus_tokens <- tokens_tolower(test_corpus_tokens)

test_corpus_tokens <- tokens_select(test_corpus_tokens, pattern = stopwords("en"), selection = "remove")

test_lemmitized <- tokens_replace(test_corpus_tokens, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)

library(quanteda.textplots)

test_corpus_edit <- test_lemmitized %>% 
  dfm(remove = stopwords('english'), remove_punct = TRUE) %>% 
  dfm_trim(min_termfreq = 50, verbose = FALSE)
Warning: '...' should not be used for tokens() arguments; use 'tokens()' first.
Warning: 'remove' is deprecated; use dfm_remove() instead
Code
textplot_wordcloud(test_corpus_edit, max_words=200, color="blue")

While doing this, I noticed there were a lot of Spanish words as well. To deal with this, I decided to go back to using the subset() approach with detect_language().

Removing all non English

While this has taken care of a lot of issues, it has unfortunately removed a lot of data, leaving me with only 7,990 tweets out of the original 14,866.

I have also added remove_symbols to the tokens call.
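
A possible middle ground (just a sketch): cld3::detect_language() returns NA when it can't confidently identify a language, which happens a lot with very short tweets, so keeping the NA rows along with the "en" rows would retain more of the data. Whether that trade-off is worth it is a judgment call.

Code
library(cld3)

lang <- detect_language(twitter$text)
table(lang, useNA = "ifany")

# keep confident English plus the "couldn't tell" rows
twitter_en <- twitter[is.na(lang) | lang == "en", ]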

Code
test <- corpus(twitter$text)

library(cld3)
test <- subset(test, detect_language(test) == "en")

test <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", test)

test <- test[!is.na(test)]

test_tokens <- tokens(test, 
    remove_punct = T,
    remove_numbers = T,
    remove_symbols = T)

test_tokens <- tokens_tolower(test_tokens)

test_tokens <- tokens_select(test_tokens, pattern = stopwords("en"), selection = "remove")

tests_lemmitized <- tokens_replace(test_tokens, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)

library(quanteda.textplots)

test_edit <- tests_lemmitized %>% 
  dfm(remove = stopwords('english'), remove_punct = TRUE) %>% 
  dfm_trim(min_termfreq = 50, verbose = FALSE)
Warning: '...' should not be used for tokens() arguments; use 'tokens()' first.
Warning: 'remove' is deprecated; use dfm_remove() instead
Code
textplot_wordcloud(test_edit, max_words=300, color="blue")

Updated Word Cloud Code

Code
reddit_corpus <- subset(reddit, detect_language(text) == "en")
reddit_corpus <- reddit_corpus[!is.na(reddit_corpus$text), ]
reddit_corpus <- corpus(reddit_corpus)
redditsummary <- summary(reddit_corpus)

reddit_corpus_tokens <- tokens(reddit_corpus, 
    remove_punct = T,
    remove_numbers = T, 
    remove_symbols = T,
    remove_url = T)

reddit_corpus_tokens <- tokens_tolower(reddit_corpus_tokens)

reddit_corpus_tokens <- tokens_select(reddit_corpus_tokens, pattern = stopwords("en"), selection = "remove")

reddit_lemmitized <- tokens_replace(reddit_corpus_tokens, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)

library(quanteda.textplots)

reddit_corpus_dfm <- reddit_lemmitized %>% 
  dfm() %>% 
  dfm_remove(stopwords('english')) %>% 
  dfm_trim(min_termfreq = 30, verbose = FALSE)

textplot_wordcloud(reddit_corpus_dfm, max_words=200, color="red")

Code
twitter_corpus <- subset(twitter, detect_language(text) == "en")
twitter_corpus <- twitter_corpus[!is.na(twitter_corpus$text), ]
twitter_corpus <- corpus(twitter_corpus)
twittersummary <- summary(twitter_corpus)

twitter_corpus_tokens <- tokens(twitter_corpus, 
    remove_punct = T,
    remove_numbers = T,
    remove_symbols = T,
    remove_url = T)

twitter_corpus_tokens <- tokens_tolower(twitter_corpus_tokens)

twitter_corpus_tokens <- tokens_select(twitter_corpus_tokens, pattern = stopwords("en"), selection = "remove")

twitter_lemmitized <- tokens_replace(twitter_corpus_tokens, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)

library(quanteda.textplots)

twitter_corpus_dfm <- twitter_lemmitized %>% 
  dfm() %>% 
  dfm_remove(stopwords('english')) %>% 
  dfm_trim(min_termfreq = 30, verbose = FALSE)

textplot_wordcloud(twitter_corpus_dfm, max_words=200, color="blue")

Code
social_corpus <- subset(allsocialmedia, detect_language(text) == "en")
social_corpus <- social_corpus[!is.na(social_corpus$text), ]
social_corpus <- corpus(social_corpus)
socialsummary <- summary(social_corpus)

social_corpus_tokens <- tokens(social_corpus, 
    remove_punct = T,
    remove_numbers = T,
    remove_symbols = T,
    remove_url = T)

social_corpus_tokens <- tokens_tolower(social_corpus_tokens)

social_corpus_tokens <- tokens_select(social_corpus_tokens, pattern = stopwords("en"), selection = "remove")

social_lemmitized <- tokens_replace(social_corpus_tokens, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)

library(quanteda.textplots)

social_corpus_dfm <- social_lemmitized %>% 
  dfm() %>% 
  dfm_remove(stopwords('english')) %>% 
  dfm_trim(min_termfreq = 30, verbose = FALSE)

textplot_wordcloud(social_corpus_dfm, max_words=200, color="orange")

Looking at the Data Closer

Below are some quick looks at the data.

Code
topfeatures(social_corpus_dfm, 10)
             love             japan              like             blind 
             7352              5886              5647              5270 
            think              just #loveisblindjapan              good 
             4582              3670              3445              3417 
              say               get 
             3378              2982 
Code
topfeatures(twitter_corpus_dfm, 10)
             love             blind             japan #loveisblindjapan 
             5965              5068              4766              3445 
            watch              like              good              just 
             2135              1200              1001               936 
              get              show 
              806               727 
Code
topfeatures(reddit_corpus_dfm, 10)
  like  think    say   just   good    get   show really    see   feel 
  4504   4120   2856   2750   2455   2196   2120   2007   1953   1897 

Here are the overall top ten words for each data set:

Love is Blind Japan Top Features

All Social Media     Twitter              Reddit
love                 love                 like
japan                blind                think
like                 japan                say
blind                #loveisblindjapan    just
think                watch                good
just                 like                 get
#loveisblindjapan    good                 show
good                 just                 really
say                  get                  see
get                  show                 feel

Here we can see a few things:

  • While "love", "blind", and "japan" are the top features on Twitter, this is most likely due to the show's name. In the combined social media data and on Reddit, "like" ranks much higher.

  • In all three columns, the top ten include several words that focus on how people watching the show feel.

  • From this brief look it seems there's a positive view of the show, as "good" is in the top ten for both Twitter and Reddit.

Trying a Co-occurrence Matrix

I will test setting a maximum term frequency for Twitter, since Twitter's top four words all relate to the show's name. (An alternative approach is sketched after the chunk below.)

Code
# let's create a nicer dfm by limiting to words that appear frequently but in no more than 30% of the documents
smaller_dfm <- dfm_trim(twitter_corpus_dfm, max_termfreq = 3400, min_termfreq = 30)
smaller_dfm <- dfm_trim(smaller_dfm, max_docfreq = .3, docfreq_type = "prop")

textplot_wordcloud(smaller_dfm, min_count = 100,
                   random_order = FALSE)
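
As an alternative to capping the term frequency, the show-name terms could also be dropped from the dfm directly. This is just a sketch; the term list is my own choice.

Code
# drop the show-name features before plotting
showname_terms <- c("love", "blind", "japan", "#loveisblindjapan")
noname_dfm <- dfm_remove(twitter_corpus_dfm, pattern = showname_terms)

textplot_wordcloud(noname_dfm, max_words = 100, color = "blue")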

Code
# create fcm from dfm
smaller_fcm <- fcm(smaller_dfm)

# check the dimensions (i.e., the number of rows and the number of columns)
# of the matrix we created
dim(smaller_fcm)
[1] 485 485
Code
myFeatures <- names(topfeatures(smaller_fcm, 30))

# retain only those top features as part of our matrix
even_smaller_fcm <- fcm_select(smaller_fcm, pattern = myFeatures, selection = "keep")

# check dimensions
dim(even_smaller_fcm)
[1] 30 30
Code
# compute size weight for vertices in network
size <- log(colSums(even_smaller_fcm))

# create plot
textplot_network(even_smaller_fcm, vertex_size = size / max(size) * 3)

Final Thoughts (TLDR)

Final thoughts

  • I did end up looking at YouTube to see whether there were clips of the show online from Netflix. Unfortunately, there only seemed to be a trailer; clips of this season were fairly rare there.

  • After debating for a while, I decided to use the subset/detect_language option to remove all other languages from the Twitter data (Reddit did not have this problem, most likely because its main user base is English-speaking). Although this cut my data set down significantly, I believe it was the best option, as it removes languages that I'm currently unable to analyze properly.

    • Conversely, Reddit had very few other languages; however, to keep the data sets comparable, I used subset for Reddit too.

      • This was also done so that the separate data sets and the combined data set are processed the same way.
  • I decided to use quanteda.textplots for the word clouds instead of the tm package, because the tm version had issues whenever I tried to re-render the word clouds.

  • I have decided to lemmatize my tokens, as it creates a more accurate picture of the words that were important to both Twitter and Reddit posters.

  • I have removed emojis because most of the emoji packages I found did not work. Additionally, I'm not confident that I could use them accurately in a sentiment analysis, or that there are that many of them in the first place.

  • For my tokens I have added remove_symbols and remove_url for both Reddit and Twitter, to deal with people spamming URLs to shows or gifs and with the "@" sign.

  • When I checked the top features it was interesting to see which words were used the most. While none of it was surprising, it did help me understand the data a bit more.

  • I believe a co-occurrence matrix may be useful for Twitter, as it lets me better remove the show's name and hashtag.

Questions and frustrations I still have

  • I'm still unsure how to keep both the date and the text as separate columns when using quanteda's corpus() function. I can see the dates in the summary, but I'm not sure I can use them outside of that (see the sketch after this list for one possibility).

    • The summary also only shows 100 documents by default; I'm not sure how to make it show all of the data either.
    • When I tried setting a maximum term frequency for Twitter's word cloud, the dates showed up in the cloud, so I'm not sure that solves my problem either.
  • I am unsure whether looking at all the data together or separately is currently best. I have included a word cloud that uses all of the social media data. This would change my research question.

    • Current research question: How do Reddit and Twitter users feel about the show Love is Blind Japan?

    • Possible research question: How does sentiment about the show Love is Blind Japan differ between Reddit and Twitter users?

      • This is the question I had previously.
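
A small sketch of one possibility (the object names mirror the ones above): building the corpus directly from the data frame stores the date as a document variable, which stays accessible through docvars(), and summary() can be told to show every document instead of the default 100.

Code
library(quanteda)

reddit_corpus <- corpus(reddit, text_field = "text")

# the dates travel with the documents as a docvar
head(docvars(reddit_corpus, "date"))

# summary() defaults to n = 100; asking for ndoc() rows shows everything
redditsummary_full <- summary(reddit_corpus, n = ndoc(reddit_corpus))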

Future Work

Here are a few things I’d like to do in blog post 4:

  • Try to group by date. With the date column seemingly working now, I'd like to see whether grouping by date works (one idea is sketched after this list).

  • Try to use textplot_network on Reddit as well as on the combined data set. Although I'm not sure I will, I think it could produce some interesting results.

  • Try to find a way to keep the dates out of the word clouds. My current thought is to make a separate R chunk that only uses twitter$text so the dates are left out.

  • Decide how, or whether, I want to use co-occurrence.
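
As a starting point for the date grouping, here is a rough sketch. It assumes the corpus is built straight from the reddit data frame so that "date" is kept as a document variable; the object name reddit_by_date is just mine.

Code
library(quanteda)
library(dplyr)

# group the document-feature matrix so there is one row per date
reddit_by_date <- corpus(reddit, text_field = "text") %>% 
  tokens(remove_punct = TRUE) %>% 
  dfm() %>% 
  dfm_group(groups = date)

ndoc(reddit_by_date) # one "document" per posting date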

Final Code

Code
#write.csv has been commented out because it kept trying to save an "updated version" in Git. 

reddit_data <- read.csv(here::here("posts", "_data", "loveisblindjapan.csv"))

twitter1 <- read.csv(here::here("posts", "_data", "tweets.csv"))

twitter2 <- read.csv(here::here("posts", "_data", "tweets#.csv"))

reddit <- subset(reddit_data, select = c("body", "created_utc")) 

# created_utc is a Unix timestamp in seconds
reddit$created_utc <- as.Date(as.POSIXct(reddit$created_utc, origin = "1970-01-01", tz = "UTC"))

reddit <- reddit %>% 
  select(text = body, 
            date = created_utc)
# remove deleted or removed comments by moderators of the subreddit (ones that only contain [deleted] or [removed])
reddit <- reddit %>% 
  filter(!text == '[deleted]') %>% 
  filter(!text == '[removed]')

#remove the row-number and user name columns
twitter1 <- twitter1 %>% select(!c(X, User))
twitter2 <- twitter2 %>% select(!c(X, User))

# combine the two sets of tweets, keeping all rows from both
twitter <- merge(twitter1, twitter2, by = c('Tweet', 'Date'), all = TRUE)
#write.csv(twitter, here::here("posts", "_data", "twitter.csv"), row.names = FALSE)

names(twitter) <- tolower(names(twitter))
twitter <- twitter %>% 
  rename(text = tweet) # "date" is already lowercase after tolower() above
twitter$date <- as.Date(strftime(twitter$date, format="%Y-%m-%d"))

# remove duplicate tweets
twitter <- twitter %>% distinct(text, date, .keep_all = TRUE)

#check that no duplicate tweets remain
any(duplicated(twitter))
[1] FALSE
Code
allsocialmedia <- merge(twitter, reddit, by = c('text', 'date'), all = TRUE)
#write.csv(allsocialmedia, here::here("posts", "_data", "loveisblind_socialmedia.csv"), row.names = FALSE)
Code
reddit_corpus <- subset(reddit, detect_language(text) == "en")
reddit_corpus <- reddit_corpus[!is.na(reddit_corpus$text), ]
reddit_corpus <- corpus(reddit_corpus)
redditsummary <- summary(reddit_corpus)

reddit_corpus_tokens <- tokens(reddit_corpus, 
    remove_punct = T,
    remove_numbers = T, 
    remove_symbols = T,
    remove_url = T)

reddit_corpus_tokens <- tokens_tolower(reddit_corpus_tokens)

reddit_corpus_tokens <- tokens_select(reddit_corpus_tokens, pattern = stopwords("en"), selection = "remove")

reddit_lemmitized <- tokens_replace(reddit_corpus_tokens, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)

library(quanteda.textplots)

reddit_corpus_dfm <- reddit_lemmitized %>% 
  dfm() %>% 
  dfm_remove(stopwords('english')) %>% 
  dfm_trim(min_termfreq = 30, verbose = FALSE)

textplot_wordcloud(reddit_corpus_dfm, max_words=200, color="red")

Code
twitter_corpus <- subset(twitter, detect_language(text) == "en")
twitter_corpus <- twitter_corpus[!is.na(twitter_corpus$text), ]
twitter_corpus <- corpus(twitter_corpus)
twittersummary <- summary(twitter_corpus)

twitter_corpus_tokens <- tokens(twitter_corpus, 
    remove_punct = T,
    remove_numbers = T,
    remove_symbols = T,
    remove_url = T)

twitter_corpus_tokens <- tokens_tolower(twitter_corpus_tokens)

twitter_corpus_tokens <- tokens_select(twitter_corpus_tokens, pattern = stopwords("en"), selection = "remove")

twitter_lemmitized <- tokens_replace(twitter_corpus_tokens, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)

library(quanteda.textplots)
twitter_corpus_dfm <- twitter_lemmitized %>% 
  dfm() %>% 
  dfm_remove(stopwords('english')) %>% 
  dfm_trim(min_termfreq = 30, verbose = FALSE)

textplot_wordcloud(twitter_corpus_dfm, max_words=200, color="blue")

Code
social_corpus <- subset(allsocialmedia, detect_language(text) == "en")
social_corpus <- social_corpus[!is.na(social_corpus$text), ]
social_corpus <- corpus(social_corpus)
socialsummary <- summary(social_corpus)

social_corpus_tokens <- tokens(social_corpus, 
    remove_punct = T,
    remove_numbers = T,
    remove_symbols = T,
    remove_url = T)

social_corpus_tokens <- tokens_tolower(social_corpus_tokens)

social_corpus_tokens <- tokens_select(social_corpus_tokens, pattern = stopwords("en"), selection = "remove")

social_lemmitized <- tokens_replace(social_corpus_tokens, 
                             pattern = lexicon::hash_lemmas$token, 
                             replacement = lexicon::hash_lemmas$lemma)

library(quanteda.textplots)

social_corpus_dfm <- social_lemmitized %>% 
  dfm() %>% 
  dfm_remove(stopwords('english')) %>% 
  dfm_trim(min_termfreq = 30, verbose = FALSE)

textplot_wordcloud(social_corpus_dfm, max_words=200, color="orange")

Code
topfeatures(social_corpus_dfm, 10)
             love             japan              like             blind 
             7352              5886              5647              5270 
            think              just #loveisblindjapan              good 
             4582              3670              3445              3417 
              say               get 
             3378              2982 
Code
topfeatures(twitter_corpus_dfm, 10)
             love             blind             japan #loveisblindjapan 
             5965              5068              4766              3445 
            watch              like              good              just 
             2135              1200              1001               936 
              get              show 
              806               727 
Code
topfeatures(reddit_corpus_dfm, 10)
  like  think    say   just   good    get   show really    see   feel 
  4504   4120   2856   2750   2455   2196   2120   2007   1953   1897 
Code
# let's create a nicer dfm by limiting to words that appear frequently but in no more than 30% of the documents
smaller_dfm <- dfm_trim(twitter_corpus_dfm, max_termfreq = 3400, min_termfreq = 30)
smaller_dfm <- dfm_trim(smaller_dfm, max_docfreq = .3, docfreq_type = "prop")

textplot_wordcloud(smaller_dfm, min_count = 100,
                   random_order = FALSE)

Code
# create fcm from dfm
smaller_fcm <- fcm(smaller_dfm)

# check the dimensions (i.e., the number of rows and the number of columns)
# of the matrix we created
dim(smaller_fcm)
[1] 485 485
Code
myFeatures <- names(topfeatures(smaller_fcm, 30))

# retain only those top features as part of our matrix
even_smaller_fcm <- fcm_select(smaller_fcm, pattern = myFeatures, selection = "keep")

# check dimensions
dim(even_smaller_fcm)
[1] 30 30
Code
# compute size weight for vertices in network
size <- log(colSums(even_smaller_fcm))

# create plot
textplot_network(even_smaller_fcm, vertex_size = size / max(size) * 3)