Blog Post 3: Pre-processing

Author

Andrea Mah

Published

October 24, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE)

For my project, I plan to analyze speeches given by world leaders at the UN climate conferences. My goal for the past week was to get my data into good shape. Previously, I was able to import all the PDFs into R. However, I did not have metadata associated with those documents, many speeches were not in English, and the data needed to be cleaned.

My first step was to detect the language of the speeches and to subset my corpus to include only English texts. Fortunately, there were multiple packages I could use to detect language. Ultimately, I chose cld2, which is reported to have high accuracy. After importing the files, I used the detect_language() function to create a vector representing the language of each document. Then I subsetted the data and saved a new file with only the English texts.

Code
library(cleanNLP)
library(tidytext)
#loading plyr before (re)attaching tidyverse avoids plyr masking dplyr
#functions such as mutate(), rename(), and summarise()
library(plyr)
library(tidyverse)
library(quanteda)
Package version: 3.2.3
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 8 of 8 threads used.
See https://quanteda.io for tutorials and examples.
Code
library(pdftools)
Using poppler version 22.04.0
Code
library(quanteda.textmodels)
library(quanteda.textplots)
library(quanteda.textstats)

#creating list of names of files to read in
#(list.files() takes a regex, so escape the dot and anchor the extension)
file_list <- list.files(pattern = "\\.pdf$")

#read each pdf and collapse its pages into a single string
all_files <- lapply(file_list, function(file){
  txt <- pdf_text(file)
  txt <- str_c(txt, collapse = " ")
  data.frame(File = file, text = txt)
})

#bind the per-file data frames into one
result <- do.call("rbind", all_files)

#checking the import worked (head() renders in a knitted document, unlike View())
head(result$File)
Code
#detecting the language of each speech
require(cld2)
t_start <- proc.time()
cld2_vec <- cld2::detect_language(text = result$text, plain_text = TRUE, lang_code = TRUE)

#bind result with data
result$language <- cld2_vec

#create subset data with only English-language speeches
en.result <- result[which(result$language == "en"), ]

#save as a dataframe
save(en.result, file = "speeches.RData")
load("speeches.RData")

My next step was to get some metadata (‘docvars’) to use with my documents. The filenames of the downloaded speeches contained a lot of useful information: which conference the speech was from, the date it was delivered, and the speaker (in most cases, a country). I couldn’t find anything in R to help me extract that information from the filenames (although I’m sure something exists). What I ended up doing was exporting the list of file names from my English speeches dataframe, importing it into Excel, and using a series of TEXTBEFORE and TEXTAFTER functions to isolate the information I wanted. I manually added the year of each speech, which was easy to do after sorting the files alphabetically. I saved this metadata as a CSV.
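For the record, the same filename parsing can be done in R with stringr. Here is a minimal sketch, assuming a hypothetical naming pattern like "COP1_1995-03-28_Germany.pdf" (conference, date, and speaker separated by underscores); my real filenames may need a different split, so this is an illustration rather than the code I used.

```r
#a sketch of parsing docvars out of filenames with stringr, assuming a
#hypothetical "conference_date_speaker.pdf" pattern
library(stringr)

files <- c("COP1_1995-03-28_Germany.pdf", "COP21_2015-11-30_France.pdf")
parts <- str_split_fixed(str_remove(files, "\\.pdf$"), "_", 3)

meta <- data.frame(
  File       = files,
  conference = parts[, 1],
  date       = parts[, 2],
  speaker    = parts[, 3]
)
#the year is just the first four characters of the date
meta$year <- str_sub(meta$date, 1, 4)
```

This would replace the Excel round trip entirely, since the resulting data frame already has a File column to join on.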

After importing the CSV into R, I used left_join() to merge the speeches with the metadata by file name. Now, when I made the speech dataframe into a corpus, I was able to see my metadata.

Code
#remove the "language" column
en.result <- en.result[, c(1, 2)]

#isolate only the filename column to export
en.result.names <- en.result[, 1]
#Here I exported a csv
write.csv(en.result.names, "en.result.csv")

#using Excel "=TEXTBEFORE()" and "=TEXTAFTER()" functions, I isolated
#Year and Speaker (country) from the file names. I saved the result as a csv
#to import as my corpus metadata.

#importing metadata file
metadata <- read.csv("metadata_docs.csv", header = TRUE)

#renaming column to match
metadata$File <- metadata$filename

#joining my speeches with my metadata
#(left_join() keeps all rows of its first argument, so no all.x is needed)
speech.meta <- left_join(metadata, en.result, by = "File")

#saving this dataframe
save(speech.meta, file = "speech.meta.RData")

Next, I needed to clean up the text and do some pre-processing. I didn’t have much success using built-in commands to remove numbers and other unwanted characters, but a different approach using gsub() worked. I’m sure the code could be simplified further, but at least it produced the result I wanted. After getting rid of the things I didn’t want, I created a dfm, removing stopwords and keeping only features that appeared a minimum of 10 times.

Code
#Cleaning up the speeches
load("speech.meta.RData")
text <- speech.meta$text

#remove stray symbols; fixed = TRUE makes gsub() match "$" literally
#instead of treating it as a regex end-of-string anchor
for (symbol in c("$", "~", "<", ">", "%", "#")) {
  text <- gsub(symbol, " ", text, fixed = TRUE)
}

#remove digits and ordinal suffixes (1st, 2nd, 3rd, 4th, ...)
text <- gsub("[0-9]+(st|nd|rd|th)?", " ", text)

#remove orphaned fragments left behind by the substitutions above
text <- gsub(" th ", " ", text)
text <- gsub(" t ", " ", text)
text <- gsub(" l ", " ", text)
text <- gsub(" d ", " ", text)

#remove titles; fixed = TRUE so the "." is a literal period, not "any character"
text <- gsub(" mr. ", " ", text, fixed = TRUE)

speech.meta$text <- text
Code
#Make this into a corpus, telling quanteda which columns hold the doc ids and text
speech_corpus <- corpus(speech.meta, docid_field = "File", text_field = "text")

#tokenize without punctuation (the argument is remove_punct, not remove_punc)
speech_tokens <- tokens(speech_corpus, remove_punct = TRUE)

#Create a DFM, then remove stopwords and trim rare features
dfm_speech <- dfm(speech_tokens)
dfm_speech <- dfm_remove(dfm_speech, stopwords("english")) %>%
  dfm_trim(min_termfreq = 10, verbose = FALSE)

Next, I was excited to explore the dfm.

Code
#get some information about the dfm
ndoc(dfm_speech)
head(featnames(dfm_speech), 25)

#See what's common
topfeatures(dfm_speech, 50)

#see what's common within each year
topfeatures(dfm_speech, 5, groups = year)

#make a wordcloud
set.seed(2222)
textplot_wordcloud(dfm_speech, min_count = 100, random_order = FALSE)

#subset the dfm to a single year (dfm_subset() works on dfms; base subset() does not)
dfm.1995 <- dfm_subset(dfm_speech, year == "1995")

Finally, since I now had some metadata, I wanted to see if I could actually use it. I found some code online showing how to plot the frequency of terms using ggplot(). I did this for the overall corpus, but could not figure out how to select specific terms and plot them by year…
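One way to get unstuck here, sketched under the assumption that textstat_frequency() with groups = year returns a table with feature, frequency, and group columns (as the quanteda.textstats documentation describes): filter that table down to the terms of interest, then map group to the x-axis. The little stand-in data frame below just mimics that output; "climate" and "energy" are illustrative terms, not results from my corpus.

```r
library(ggplot2)

#illustrative stand-in for textstat_frequency(dfm_speech, groups = year) output
ts_freq_byyear <- data.frame(
  feature   = c("climate", "climate", "energy", "energy"),
  frequency = c(120, 180, 60, 90),
  group     = c("1995", "2000", "1995", "2000")
)

#keep only the terms I want to track
terms_of_interest <- c("climate", "energy")
ts_sub <- subset(ts_freq_byyear, feature %in% terms_of_interest)

#one line per term across years
term_trends <- ggplot(ts_sub, aes(x = group, y = frequency,
                                  group = feature, colour = feature)) +
  geom_line() +
  labs(x = "Year", y = "Frequency")
```

With the real dfm, the only change would be replacing the stand-in data frame with the actual textstat_frequency() result.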

Code
#trying out some frequencies: the top 20 features overall
ts_freq <- textstat_frequency(dfm_speech, n = 20)
ts_freq

#top features by year and by speaker (the argument is "groups", plural)
ts_freq_byyear <- textstat_frequency(dfm_speech, n = 10, groups = year)
ts_freq_byyear

ts_freq_byspeaker <- textstat_frequency(dfm_speech, n = 10, groups = speaker)
ts_freq_byspeaker

#plot the overall top terms
topterms <- ggplot(data = ts_freq, aes(x = feature, y = frequency)) +
  geom_bar(stat = "identity") +
  theme(panel.background = element_rect(fill = "white"),
        axis.line = element_line(colour = "black"),
        axis.text = element_text(size = 12),
        axis.title.x = element_text(size = 12, vjust = -1),
        axis.title.y = element_text(size = 12),
        legend.key = element_blank(),
        legend.position = "top",
        legend.text = element_text(size = 14),
        legend.title = element_blank())
topterms

Now that my dataset is actually clean and ready to be analyzed, I’m excited to try topic modelling, to learn more about what kinds of statistics I can calculate using the dfm, and to figure out more interesting ways to visualize the data.
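As a note to my future self, a tiny sketch of the topic-modelling step: quanteda’s convert() can hand a dfm to the topicmodels package for LDA fitting. The toy texts and k = 2 below are purely illustrative stand-ins for dfm_speech and a real choice of k.

```r
library(quanteda)
library(topicmodels)

#toy documents standing in for the real speech corpus
toy_dfm <- dfm(tokens(c("climate change adaptation funding",
                        "emissions energy policy targets",
                        "climate finance adaptation support")))

#convert the quanteda dfm into the format topicmodels expects, then fit LDA
lda_fit <- LDA(convert(toy_dfm, to = "topicmodels"),
               k = 2, control = list(seed = 1234))
terms(lda_fit, 3)  #top 3 terms per topic
```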