Post 3

Cleaning PDFs

Author

Saaradhaa M

Published

October 16, 2022

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

load("post3_saaradhaa.rdata")

library(tidyverse)
library(stringr)
library(quanteda)
library(quanteda.textplots)
library(readtext)
library(phrasemachine)

Error in library(phrasemachine): there is no package called 'phrasemachine'

Introduction

In this post, I focus on cleaning and pre-processing my data into a format that is useful for analysis, which I have struggled with quite a bit so far. Specifically, I clean the PDFs, pre-process them in R, and then try out some descriptive statistics and plots.

Manual Cleaning

To recap, I downloaded 50 PDFs of oral histories. I generated word clouds in Blog Post 2 to get some sense of how I need to clean them. I also skimmed through the PDFs to understand the content better.
I deleted PDFs 42-50 (representing students of fine arts teachers). Methodologically, it doesn’t make sense to include this group. When I skimmed through the transcripts, the students’ ages varied widely, so they don’t neatly represent young people. Further, those who are actually young are probably more attached to their cultural roots anyway, and might be fundamentally different from the young people who comment on Reddit.
This leaves me with 41 transcripts. However, I had a lot of trouble figuring out a standardised way to clean all of them, as the oral histories were collected over a number of years by different researchers - hence, not all the PDFs had a consistent format.
For transcripts 1 to 34, I initially wanted to convert them to HTML to remove bold text, which represents most of what needs to be removed. I was unable to use R packages (poppler, pdftohtml, pdf2html) to convert them to HTML and do this, since these packages only work on older versions of R.
So unfortunately, I had to do some steps manually:
- Using Actions in Adobe Acrobat, I converted PDF to HTML to get rid of the running header and footer, then converted to plain text.
- I manually removed the header for transcripts 12, 14, 21 and 22, and ‘narrator’, ‘date’, ‘interviewed by’, ‘place’ and ‘end of interview’ for all.
- I manually removed the post-script for transcript 28 - it’s a post-interview message from the interviewee, which does not represent meaningful oral history data.
- I manually removed the header, index and glossary for transcripts 35 to 41.
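
Some of the label removal above could probably be scripted rather than done by hand. Below is a minimal sketch in base R, assuming each label sits at the start of its own line; the toy lines and names are invented for illustration and do not reflect the real transcript layout.

```r
# Toy transcript lines; the layout and the names are made up for illustration.
raw_lines <- c(
  "Narrator: Jane Doe",
  "Date: 1 January 2000",
  "Interviewed by: John Roe",
  "Place: Seattle",
  "Interviewer: Where were you born?",
  "Interviewee: I was born in a small town.",
  "END OF INTERVIEW"
)

# Lines that start with one of the metadata labels listed above.
label_pattern <- "^\\s*(narrator|date|interviewed by|place|end of interview)\\b"

# Flag the metadata lines (ignoring case) and drop them.
is_metadata <- grepl(label_pattern, raw_lines, ignore.case = TRUE, perl = TRUE)
cleaned_lines <- raw_lines[!is_metadata]
cleaned_lines
```

Only the two question-and-answer lines are kept here. Real transcripts would likely need the pattern adjusted after checking how each label is actually written.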

Processing in R

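# list out files. 'pattern' is a regular expression rather than a glob, so match the '.txt' extension explicitly.
file_list <- dir("~/Desktop/2022_Fall/GitHub/Text_as_Data_Fall_2022/posts/Transcripts", full.names = TRUE, pattern = "\\.txt$")

# read the text files into a data frame with one row per transcript.
transcripts <- readtext(file_list, docvarnames = c("transcript", "text"))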

Error in list_files(file, ignore_missing_files, FALSE, cache, verbosity): File '' does not exist.

# remove speaker labels such as 'interviewer:' and 'interviewee:' (note that this pattern strips any run of letters followed by a colon), as well as line breaks, '\n'. I found this website really helpful in testing out regex: https://spannbaueradam.shinyapps.io/r_regex_tester/
transcripts$text <- str_replace_all(transcripts$text, "[a-zA-Z]+:", "")
transcripts$text <- str_replace_all(transcripts$text, "\n", "")
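
To see what these two replacements do, here is a small sketch on an invented string. It uses base R's gsub() as a stand-in for str_replace_all(), and the narrower pattern and variable names are my own assumptions, not part of the original pipeline.

```r
# Toy text, invented for illustration.
toy <- "Interviewer: Where did you live?\nInterviewee: Two places: Bombay\nand Madagascar."

# The broad pattern removes ANY run of letters followed by a colon,
# so "places:" disappears along with the speaker labels.
broad <- gsub("[a-zA-Z]+:", "", toy)

# A narrower pattern that only targets the two speaker labels.
narrow <- gsub("\\b(Interviewer|Interviewee):\\s*", "", toy, perl = TRUE)

# Replacing "\n" with "" glues neighbouring words together;
# replacing it with a space keeps them apart.
joined_empty <- gsub("\n", "", narrow, fixed = TRUE)
joined_space <- gsub("\n", " ", narrow, fixed = TRUE)

joined_empty
joined_space
```

If the transcripts wrap lines mid-sentence, replacing line breaks with a space may be the safer choice, since words fused across lines would otherwise be counted as new types.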

# create 'quanteda' corpus. 
oh_corpus <- corpus(transcripts$text)

# create my own list of stopwords, based on qualitative reading of the first transcript.
to_keep <- c("do", "does", "did", "would", "should", "could", "ought", "isn't", "aren't", "wasn't", "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't", "shan't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "because", "against", "between", "into", "through", "during", "before", "after", "above", "below", "over", "under", "no", "nor", "not")
Stopwords <- stopwords("en")
Stopwords <- Stopwords[!(Stopwords %in% to_keep)]

# create tokens, remove punctuation, numbers and stopwords, then convert to lowercase.
oh_tokens <- tokens(oh_corpus, remove_punct = TRUE, remove_numbers = TRUE)
oh_tokens <- tokens_select(oh_tokens, pattern = Stopwords, selection = "remove")
oh_tokens <- tokens_tolower(oh_tokens)

# get summary of corpus.
oh_summary <- summary(oh_corpus)

# what's the average number of 'types' per interview? ~1868.
mean(oh_summary$Types)

[1] 1867.537

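# how many tokens in the longest transcript? ~24k. (max() gives the largest single document; sum(ntoken(oh_corpus)) would give the corpus-wide total.)
max(ntoken(oh_corpus))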

[1] 24086
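
As a quick check on the difference between the largest document and the corpus total, here is a sketch with made-up counts shaped like the named vector that ntoken() returns. The numbers are hypothetical and are not taken from the transcripts.

```r
# Hypothetical per-document token counts; the values are invented.
token_counts <- c(text1 = 1200L, text2 = 3400L, text3 = 2100L)

# Size of the largest single document.
longest_doc <- max(token_counts)

# Total number of tokens across all documents.
corpus_total <- sum(token_counts)

longest_doc
corpus_total
```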

# create metadata. moving forward, i can potentially add extra metadata, like gender of interviewee and year of immigration.
docvars(oh_corpus) <- oh_summary

Text Plots

# i want to re-create the wordcloud from Blog Post 2, except now with cleaned data - just for the wordcloud, i'm going to remove all stopwords, so that they don't show up in it.
dfm <- oh_tokens %>%
  dfm() %>%
  dfm_trim(min_termfreq = 10, verbose = FALSE, min_docfreq = .1, docfreq_type = "prop") %>%
  dfm_remove(stopwords("en"))
textplot_wordcloud(dfm, max_words=50, color="purple")

The wordcloud looks a lot better! There are definitely some references to migration (come, going, went) and to temporal aspects (time, years). Conversations also involved references to family, school, work and India (likely the largest sub-group of South Asian interviewees).

# create fcm.
fcm <- fcm(dfm)

# keep only top features.
small_fcm <- fcm_select(fcm, pattern = names(topfeatures(fcm, 50)), selection = "keep")

# compute weights.
size <- log(colSums(small_fcm))

# create network.
textplot_network(small_fcm, vertex_size = size / max(size) * 4)

I made a network plot, just out of curiosity - I don’t know why there is a ‘<’ and ‘>’ even after removing punctuation. Will I need to remove these manually?
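
One likely explanation is that quanteda treats ‘<’ and ‘>’ as symbols rather than punctuation, so remove_punct alone leaves them in place. The sketch below uses an invented sentence to show the remove_symbols argument of tokens(); whether this fully cleans the real transcripts is an assumption that would need checking.

```r
library(quanteda)

# Toy sentence, invented for illustration.
toy <- "He said <laughs> and then we moved, in 1950!"

# remove_punct drops Unicode punctuation, but "<" and ">" belong to the
# symbol class, so they survive this step.
toks_punct <- tokens(toy, remove_punct = TRUE)

# Adding remove_symbols = TRUE drops the symbols as well.
toks_both <- tokens(toy, remove_punct = TRUE, remove_symbols = TRUE)

as.character(toks_punct)
as.character(toks_both)
```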

N-Grams

I want to test out n-grams with the first transcript to get a sense of what kind of terms I may need to further remove.

documents <- transcripts$text[1]
phrases <- phrasemachine(documents, minimum_ngram_length = 2, maximum_ngram_length = 4, return_phrase_vectors = TRUE, return_tag_sequences = TRUE)

Error in phrasemachine(documents, minimum_ngram_length = 2, maximum_ngram_length = 4, : could not find function "phrasemachine"

phrases[[1]]$phrases[1:100]

  [1] "Julie_Kerssen"                  "Asgar_Ahmedi"                  
  [3] "apartment_in_Edmonds"           "[to]_start"                    
  [5] "island_of_Madagascar"           "Madagascar_in_a_town"          
  [7] "East_Indians"                   "early_[19]20s"                 
  [9] "East_Africa"                    "[_Laughs"                      
 [11] "little_town"                    "natural_port"                  
 [13] "small_town"                     "vanilla_beans"                 
 [15] "business_in_that_product"       "good_place"                    
 [17] "two_thirds"                     "world’s_vanilla"               
 [19] "best_vanilla"                   "country_because_vanilla"       
 [21] "five_years"                     "younger_brother"               
 [23] "younger_sister"                 "Mahatma_Gandhi"                
 [25] "Mahatma_Gandhi"                 "right_next_door"               
 [27] "next_door"                      "uncle_for_education"           
 [29] "good_schools"                   "good_schools_in_Madagascar"    
 [31] "schools_in_Madagascar"          "[siblings_]"                   
 [33] "first_place"                    "first_place_–"                 
 [35] "place_–"                        "place_–_for_education"         
 [37] "–_for_education"                "couple_of_reasons"             
 [39] "five_new_brothers"              "new_brothers"                  
 [41] "[_India]"                       "teenage_years"                 
 [43] "high_school"                    "superpower_in_track"           
 [45] "friend_of_mine"                 "close_friends"                 
 [47] "close_friends_in_Bombay"        "friends_in_Bombay"             
 [49] "United_States"                  "Tufts_College"                 
 [51] "Tufts_College_in_Massachusetts" "College_in_Massachusetts"      
 [53] "British_colony"                 "North_Africa"                  
 [55] "threat_from_Japan"              "serious_matter"                
 [57] "parents_because_Madagascar"     "French_colony"                 
 [59] "France_during_World"            "France_during_World_War"       
 [61] "World_War"                      "World_War_II"                  
 [63] "War_II"                         "enemies_at_that_time"          
 [65] "two_things"                     "[19]40_India"                  
 [67] "school_activities"              "independence_movement"         
 [69] "movement_in_any_way"            "Partition_movement"            
 [71] "creation_of_Pakistan"           "threat_that_Japan"             
 [73] "Bombay_for_eight_months"        "eight_months"                  
 [75] "hometown_in_India"              "eight_months"                  
 [77] "14_years"                       "[_Laughs"                      
 [79] "uncle_during_the_war"           "war_years"                     
 [81] "cousins_in_the_house"           "lots_of_relatives"             
 [83] "one_cousin"                     "high_school"                   
 [85] "well-off_man"                   "high_school"                   
 [87] "high_school"                    "second_year"                   
 [89] "second_year_of_college"         "year_of_college"               
 [91] "money_in_India"                 "college_in_India"              
 [93] "high_school"                    "Group_A"                       
 [95] "Group_B."                       "Group_B._A"                    
 [97] "B._A"                           "pure_physics"                  
 [99] "Group_B"                        "good_student"

# save.image("post3_saaradhaa.RData")

Just from the above phrases, topic modelling seems like a reasonable direction for the transcripts. We can see references to South Asia (Mahatma Gandhi, Bombay, East Indians) and the US (Tufts, Massachusetts). There are also references to WW2, which could potentially come up as a topic.
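
Since phrasemachine could not be loaded when this post was rendered, a possible fallback is plain n-grams from quanteda. This is a different technique: tokens_ngrams() returns every contiguous word sequence, whereas phrasemachine filters by part-of-speech patterns, so the output will be noisier. The sentence below is invented for illustration.

```r
library(quanteda)

# Toy sentence, invented for illustration.
toy <- "I went to high school in Bombay before World War II."

# Tokenise, drop punctuation and lowercase, as in the main pipeline.
toks <- tokens(toy, remove_punct = TRUE)
toks <- tokens_tolower(toks)

# All contiguous 2- to 4-word sequences, joined with "_".
ngrams <- tokens_ngrams(toks, n = 2:4, concatenator = "_")
head(as.character(ngrams), 10)
```

Removing stopwords before building the n-grams would cut down sequences like "went_to", at the cost of losing phrases that contain them, such as "friend_of_mine".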

Next Blog Post

- Generate all comments for the Reddit data, as well as more posts. I may need to use Python for this.
- Use str_match() to compare how many times “culture” appears in the oral histories vs. the Reddit data, and do a statistical test.
- Try out CleanNLP (Week 4 Tutorial).
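
For the planned “culture” comparison, here is a rough sketch of how the counting and test might look, using base R and two tiny invented sets of documents. The helper functions count_word() and count_tokens() are hypothetical names, and the choice of Fisher's exact test is an assumption. In stringr, str_count() would be the natural way to count matches, since str_match() only returns the first match per string.

```r
# Toy documents standing in for the two sources; the text is invented.
oral_histories <- c("Our culture shaped the family. Culture mattered.",
                    "We moved for work and school.")
reddit <- c("I feel cut off from my culture.",
            "Anyone else struggle with this?")

# Count case-insensitive whole-word matches across a set of documents.
count_word <- function(x, word) {
  pattern <- paste0("\\b", word, "\\b")
  hits <- gregexpr(pattern, x, ignore.case = TRUE, perl = TRUE)
  sum(lengths(regmatches(x, hits)))
}

# Rough word totals (split on whitespace), used as denominators.
count_tokens <- function(x) sum(lengths(strsplit(x, "\\s+")))

hits   <- c(count_word(oral_histories, "culture"), count_word(reddit, "culture"))
totals <- c(count_tokens(oral_histories), count_tokens(reddit))

# 2x2 table: matches vs. all other words, by source.
tab <- rbind(oral_histories = c(hit = hits[1], other = totals[1] - hits[1]),
             reddit         = c(hit = hits[2], other = totals[2] - hits[2]))

# Fisher's exact test copes with small counts.
fisher.test(tab)
```

Words within a document are not independent observations, so a test like this is best treated as a rough first look rather than a firm result.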