Blog Post 3: Pre-processing

Author

Andrea Mah

Published

October 24, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE)

For my project, I plan to analyze speeches given by world leaders at the UN climate conferences. My goal for the past week was to get my data into good shape. Previously, I was able to import all the PDFs into R. However, I did not have metadata associated with those documents, many speeches were not in English, and the data needed to be cleaned.

My first step was to detect the language of the speeches and to subset my corpus to include only English texts. Fortunately, there were multiple packages I could use to detect language. Ultimately, I chose cld2, which is reported to have high accuracy. After importing the files, I used the detect_language() function to create a vector representing the language of each document. Then I subsetted the data and saved a new file with only the English texts.

Code
library(cleanNLP)
library(tidytext)
#loading plyr before (re)attaching tidyverse avoids plyr masking dplyr
#functions such as mutate(), rename(), and summarise()
library(plyr)
library(tidyverse)
library(quanteda)
Package version: 3.2.3
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 8 of 8 threads used.
See https://quanteda.io for tutorials and examples.
Code
library(pdftools)
Using poppler version 22.04.0
Code
library(quanteda.textmodels)
library(quanteda.textplots)
library(quanteda.textstats)

#creating list of names of files to read in
#(list.files() takes a regex, so escape the dot and anchor the extension)
file_list <- list.files(pattern = "\\.pdf$")

#read each pdf and collapse its pages into a single string
all_files <- lapply(file_list, function(file){
  txt <- pdf_text(file)
  txt <- str_c(txt, collapse = " ")
  data.frame(File = file, text = txt)
})

#bind the per-file data frames into one
result <- do.call("rbind", all_files)

#checking the import worked (head() renders in a knitted document, unlike View())
head(result$File)
Code
#detecting the language of each speech
require(cld2)
t_start <- proc.time()
cld2_vec <- cld2::detect_language(text = result$text, plain_text = TRUE, lang_code = TRUE)

#bind result with data
result$language <- cld2_vec

#create subset data with only English-language speeches
en.result <- result[which(result$language == "en"), ]

#save as a dataframe
save(en.result, file = "speeches.RData")
load("speeches.RData")

My next step was to get some metadata (‘docvars’) to use with my documents. The filenames of the downloaded speeches contained a lot of useful information: which conference the speech was from, the date it was delivered, and the speaker (in most cases, a country). I couldn’t find anything in R to help me extract that information from the filenames (although I’m sure something exists). What I ended up doing was exporting the list of file names from my English speeches dataframe, importing it into Excel, and using a series of TEXTBEFORE and TEXTAFTER functions to isolate the information I wanted. I manually added the year of each speech, which was easy to do after sorting the files alphabetically. I saved this metadata as a CSV.
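For the record, the same filename parsing can be done in R with stringr. Here is a minimal sketch, assuming a hypothetical naming pattern like "COP1_1995-03-28_Germany.pdf" (conference, date, and speaker separated by underscores); my real filenames may need a different split, so this is an illustration rather than the code I used.

```r
#a sketch of parsing docvars out of filenames with stringr, assuming a
#hypothetical "conference_date_speaker.pdf" pattern
library(stringr)

files <- c("COP1_1995-03-28_Germany.pdf", "COP21_2015-11-30_France.pdf")
parts <- str_split_fixed(str_remove(files, "\\.pdf$"), "_", 3)

meta <- data.frame(
  File       = files,
  conference = parts[, 1],
  date       = parts[, 2],
  speaker    = parts[, 3]
)
#the year is just the first four characters of the date
meta$year <- str_sub(meta$date, 1, 4)
```

This would replace the Excel round trip entirely, since the resulting data frame already has a File column to join on.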

After importing the CSV into R, I used left_join() to merge the speeches with the metadata by file name. Now, when I made the speech dataframe into a corpus, I was able to see my metadata.

Code
#remove the "language" column
en.result <- en.result[, c(1, 2)]

#isolate only the filename column to export
en.result.names <- en.result[, 1]
#Here I exported a csv
write.csv(en.result.names, "en.result.csv")

#using Excel "=TEXTBEFORE()" and "=TEXTAFTER()" functions, I isolated
#Year and Speaker (country) from the file names. I saved the result as a csv
#to import as my corpus metadata.

#importing metadata file
metadata <- read.csv("metadata_docs.csv", header = TRUE)

#renaming column to match
metadata$File <- metadata$filename

#joining my speeches with my metadata
#(left_join() keeps all rows of its first argument, so no all.x is needed)
speech.meta <- left_join(metadata, en.result, by = "File")

#saving this dataframe
save(speech.meta, file = "speech.meta.RData")

Next, I needed to clean up the text and do some pre-processing. I didn’t have much success using built-in commands to remove numbers and other unwanted characters, but a different approach using gsub() worked. I’m sure the code could be simplified further, but at least it produced the result I wanted. After getting rid of the things I didn’t want, I created a dfm, removing stopwords and keeping only features that appeared a minimum of 10 times.

Code
#Cleaning up the speeches
load("speech.meta.RData")
text <- speech.meta$text

#remove stray symbols; fixed = TRUE makes gsub() match "$" literally
#instead of treating it as a regex end-of-string anchor
for (symbol in c("$", "~", "<", ">", "%", "#")) {
  text <- gsub(symbol, " ", text, fixed = TRUE)
}

#remove digits and ordinal suffixes (1st, 2nd, 3rd, 4th, ...)
text <- gsub("[0-9]+(st|nd|rd|th)?", " ", text)

#remove orphaned fragments left behind by the substitutions above
text <- gsub(" th ", " ", text)
text <- gsub(" t ", " ", text)
text <- gsub(" l ", " ", text)
text <- gsub(" d ", " ", text)

#remove titles; fixed = TRUE so the "." is a literal period, not "any character"
text <- gsub(" mr. ", " ", text, fixed = TRUE)

speech.meta$text <- text
Code
#Make this into a corpus, telling quanteda which columns hold the doc ids and text
speech_corpus <- corpus(speech.meta, docid_field = "File", text_field = "text")

#tokenize without punctuation (the argument is remove_punct, not remove_punc)
speech_tokens <- tokens(speech_corpus, remove_punct = TRUE)

#Create a DFM, then remove stopwords and trim rare features
dfm_speech <- dfm(speech_tokens)
dfm_speech <- dfm_remove(dfm_speech, stopwords("english")) %>%
  dfm_trim(min_termfreq = 10, verbose = FALSE)

Next, I was excited to explore the dfm.

Code
#get some information about the dfm
ndoc(dfm_speech)
head(featnames(dfm_speech), 25)

#See what's common
topfeatures(dfm_speech, 50)

#see what's common within each year
topfeatures(dfm_speech, 5, groups = year)

#make a wordcloud
set.seed(2222)
textplot_wordcloud(dfm_speech, min_count = 100, random_order = FALSE)

#subset the dfm to a single year (dfm_subset() works on dfms; base subset() does not)
dfm.1995 <- dfm_subset(dfm_speech, year == "1995")

Finally, since I now had some metadata, I wanted to see if I could actually use it. I found some code online showing how to plot the frequency of terms using ggplot(). I did this for the overall corpus, but could not figure out how to select specific terms and plot them by year…
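One way to get unstuck here, sketched under the assumption that textstat_frequency() with groups = year returns a table with feature, frequency, and group columns (as the quanteda.textstats documentation describes): filter that table down to the terms of interest, then map group to the x-axis. The little stand-in data frame below just mimics that output; "climate" and "energy" are illustrative terms, not results from my corpus.

```r
library(ggplot2)

#illustrative stand-in for textstat_frequency(dfm_speech, groups = year) output
ts_freq_byyear <- data.frame(
  feature   = c("climate", "climate", "energy", "energy"),
  frequency = c(120, 180, 60, 90),
  group     = c("1995", "2000", "1995", "2000")
)

#keep only the terms I want to track
terms_of_interest <- c("climate", "energy")
ts_sub <- subset(ts_freq_byyear, feature %in% terms_of_interest)

#one line per term across years
term_trends <- ggplot(ts_sub, aes(x = group, y = frequency,
                                  group = feature, colour = feature)) +
  geom_line() +
  labs(x = "Year", y = "Frequency")
```

With the real dfm, the only change would be replacing the stand-in data frame with the actual textstat_frequency() result.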

Code
#trying out some frequencies: the top 20 features overall
ts_freq <- textstat_frequency(dfm_speech, n = 20)
ts_freq

#top features by year and by speaker (the argument is "groups", plural)
ts_freq_byyear <- textstat_frequency(dfm_speech, n = 10, groups = year)
ts_freq_byyear

ts_freq_byspeaker <- textstat_frequency(dfm_speech, n = 10, groups = speaker)
ts_freq_byspeaker

#plot the overall top terms
topterms <- ggplot(data = ts_freq, aes(x = feature, y = frequency)) +
  geom_bar(stat = "identity") +
  theme(panel.background = element_rect(fill = "white"),
        axis.line = element_line(colour = "black"),
        axis.text = element_text(size = 12),
        axis.title.x = element_text(size = 12, vjust = -1),
        axis.title.y = element_text(size = 12),
        legend.key = element_blank(),
        legend.position = "top",
        legend.text = element_text(size = 14),
        legend.title = element_blank())
topterms

Now that my dataset is actually clean and ready to be analyzed, I’m excited to try topic modelling, to learn more about what kinds of statistics I can calculate using the dfm, and to figure out more interesting ways to visualize the data.
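As a note to my future self, a tiny sketch of the topic-modelling step: quanteda’s convert() can hand a dfm to the topicmodels package for LDA fitting. The toy texts and k = 2 below are purely illustrative stand-ins for dfm_speech and a real choice of k.

```r
library(quanteda)
library(topicmodels)

#toy documents standing in for the real speech corpus
toy_dfm <- dfm(tokens(c("climate change adaptation funding",
                        "emissions energy policy targets",
                        "climate finance adaptation support")))

#convert the quanteda dfm into the format topicmodels expects, then fit LDA
lda_fit <- LDA(convert(toy_dfm, to = "topicmodels"),
               k = 2, control = list(seed = 1234))
terms(lda_fit, 3)  #top 3 terms per topic
```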