Blog Post 3 - Data Cleaning and Word Clouds

Author

Adithya Parupudi

Published

October 11, 2022

Libraries

Loading all the libraries :)

Code
library(quanteda)
library(tidyverse)
library(rvest)
library(stringr)
library(tokenizers)
library(tm)
library(wordcloud)
library(wordcloud2)
library(stopwords)
library(tidytext)

knitr::opts_chunk$set(echo=TRUE)

Reading Data

From CSV

Code
dataset2 <- read_csv("./100FamousPeople.csv")
New names:
• `` -> `...1`
Rows: 116 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): people_names, links, content
dbl (1): ...1
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
head(dataset2)
# A tibble: 6 × 4
   ...1 people_names     links                                           content
  <dbl> <chr>            <chr>                                           <chr>  
1     1 Abraham Lincoln  https://www.biographyonline.net/politicians/am… "“With…
2     2 Adolf Hitler     https://www.biographyonline.net/military/adolf… "Adolf…
3     3 Albert Einstein  https://www.biographyonline.net/scientists/alb… "Born …
4     4 Alfred Hitchcock https://www.biographyonline.net/actors/alfred-… "Sir A…
5     5 Amelia Earhart ( https://www.biographyonline.net/adventurers/am… "Ameli…
6     6 Angelina Jolie   https://www.biographyonline.net/actors/angelin… "Angel…

Reading data from the website.

Code
all = read_html("https://www.biographyonline.net/people/famous-100.html")

I only want the year range to check the age/era in which each person lived, so I am extracting just the years. Some of the famous people are still alive; since they have no death year, I have temporarily filled that value with 2000 as a default.

Data cleaning

Extract age ranges

Many people on the list are still alive, so their age range (e.g. 1900–2000) is incomplete, which causes issues in the code. For now, I am filling those cases with a pre-decided value of 2000. (A follow-up sketch after the code below splits these ranges into numeric columns.)

Code
# extract year ranges such as "1918 – 2013"; fill unmatched rows (people
# still alive) with "2000"
only_year <- all %>% html_nodes("ol li") %>% html_text() %>%
  tolower() %>%
  str_extract(., "[0-9]+\\s.\\s[0-9]+") %>%
  replace(., is.na(.), "2000")
only_year <- data.frame(only_year)
head(only_year)
    only_year
1 1926 – 1962
2 1809 – 1865
3 1918 – 2013
4 1926 – 2022
5 1917 – 1963
6 1929 – 1968
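
To use these ranges for age/era analysis, one possible next step (a sketch of my own, not part of the original post) is to split each string into numeric birth and death years with tidyr::separate(). Note that rows filled with the "2000" placeholder contain a single year, so their death_year comes out as NA here.

Code
# Sketch (my assumption): split "YYYY – YYYY" into numeric columns.
# tidyr and dplyr are loaded with the tidyverse above.
year_cols <- only_year %>%
  separate(only_year, into = c("birth_year", "death_year"),
           sep = "\\s*[–-]\\s*", fill = "right") %>%
  mutate(across(everything(), as.integer))

head(year_cols)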

Occupations

Extracting each person's primary occupation as mentioned on the website, and converting it into a data frame for further visualization. I have used stringr functions extensively in this case. Some popular titles include american president, actress, leader, singer, and economist.

These titles directly indicate a person's line of work, or the field in which they earned the title, for example olympian, humanitarian, etc.

Code
# read the webpage list items and convert everything to lowercase
temp3 <- all %>%
  html_nodes("ol li") %>% html_text() %>% tolower() %>%
  str_sort() %>%
  str_extract(., "\\).*") %>%   # extract all text after ')'; still contains ')'
  str_replace(., "\\) ", "")    # remove the leading ') ' from each line

# the str functions above do not clean every line, so doing further data
# cleaning to get the desired output

# extracting people's designations: strip leftover digits, then dashes,
# parentheses, and periods (the '|' characters in the class below are literal)
peoples_title <- temp3 %>%
  str_remove_all(., "\\)[0-9]*") %>%
  str_trim(., "both")
peoples_title <- peoples_title %>%
  str_remove_all(., "[0-9]*") %>%
  str_trim(., "right") %>%
  str_replace_all("[–|-|-|(|.]", "") %>%
  str_trim(., "both")

peoples_title <- data.frame(peoples_title)
head(peoples_title)
                                           peoples_title
1                 us president during american civil war
2                                 leader of nazi germany
3 us presidential candidate and environmental campaigner
4                 german scientist, theory of relativity
5             english / american film producer, director
6                                                aviator

Removing jargon and fillers from all bios

I’ve used the stringr package to remove text such as “adsbygoogle” that came along with the scraped content. I am also removing the last paragraph of each page, which held links to further pages on the website.

This cleaning is applied to every row in the data set, but a few special characters are still left over. Writing the regex has been difficult, but at least the word cloud has a feature to ignore special characters :). I will have to find a way to generate a dataset free of all special characters so that they don’t affect the analysis in later stages (see the sketch after the code below).

Code
# wrap the patterns in fixed() so the dots are treated as literal characters
# rather than regex wildcards
dataset2$content <- dataset2$content %>%
  str_remove_all(., fixed("adsbygoogle")) %>%
  str_remove_all(., fixed("www.biographyonline.net")) %>%
  str_remove_all(., fixed("window.")) %>%
  str_remove_all(., fixed(".push")) %>%
  gsub("Citation.*", "", .)   # drop the trailing citation/links paragraph
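
As a rough sketch of that remaining special-character cleanup (my own suggestion, not something the original code does), anything outside a whitelist of letters, digits, and basic punctuation could be replaced with a space and the whitespace collapsed:

Code
# Sketch (my assumption): whitelist-based cleanup of special characters
dataset2$content <- dataset2$content %>%
  str_replace_all("[^[:alnum:][:space:].,'-]", " ") %>%
  str_squish()   # collapse the runs of whitespace left behind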

Wordclouds!!!

Let’s try to visualise everyone’s occupation using word clouds. Here are the steps I am following to generate a word cloud:

  • tokenizing

  • stop word removal, using the stopwords-iso list (1,298 stopwords; see the quick check after this list)

  • corpus creation

  • removing numbers, punctuation, and white spaces

  • converting all text to lower case to maintain uniformity

  • creating a Document-Term Matrix

  • finally generating word clouds!!!
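
As a quick sanity check (my addition, not in the original post), the size of that stop word list can be verified directly:

Code
# the English stopwords-iso list should contain the 1,298 words cited above
length(stopwords::stopwords(language = "en", source = "stopwords-iso"))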

Important. I faced issues generating word clouds, particularly at the matrix-creation step. I’ve seen a lot of memory-allocation errors, which I think are due to the RStudio update. Hence I am attaching screenshots of the word clouds generated in one of the successful attempts.

How I worked around it temporarily:

  • using the gc() command, which frees up unused memory (see the sketch below).

  • deleting the .RData file from my project.

  • waiting a long time for my RAM to free up before running another code block :(
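
For reference (a sketch of my own, not code from the original post), the first workaround is a one-liner that can be run before the heavy matrix steps that follow:

Code
# trigger garbage collection to release unused memory before building the
# TermDocumentMatrix and converting it to a dense matrix
gc()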

Word Cloud on Occupations

It is evident that roughly a third of the people on the list are from the USA and Britain, and that most of them are leaders, actresses, authors, musicians, etc. It would be interesting to see the correlation between each word and the people associated with it (a quick sketch of this follows the code below).

Code
for_occupation = peoples_title %>%
  unnest_tokens(word, peoples_title, token = "words", strip_punct = TRUE) %>% 
  filter(!(word %in% stopwords(source = "stopwords-iso"))) 


# corpus created from the word column (a character vector), mirroring the
# biography corpus below
create_corpus_occupation <- Corpus(VectorSource(for_occupation$word))


# cleaning data using tm library
create_corpus_occupation <- create_corpus_occupation %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
documents
Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
documents
Code
create_corpus_occupation <- tm_map(create_corpus_occupation, content_transformer(tolower))
Warning in tm_map.SimpleCorpus(create_corpus_occupation,
content_transformer(tolower)): transformation drops documents
Code
create_corpus_occupation <- tm_map(create_corpus_occupation, removeWords, stopwords("english"))
Warning in tm_map.SimpleCorpus(create_corpus_occupation, removeWords,
stopwords("english")): transformation drops documents
Code
# creating document term matrix

dtm_occupation <- TermDocumentMatrix(create_corpus_occupation) 
matrix_occupation <- as.matrix(dtm_occupation) 
words_occupation <- sort(rowSums(matrix_occupation),decreasing=TRUE) 
df2 <- data.frame(word = names(words_occupation),freq=words_occupation)

wordcloud2(data=df2, size = 0.8)
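
As a first look at that word-to-person correlation (an illustrative sketch, my addition), the extracted titles can be filtered for a word of interest such as “president”:

Code
# which extracted titles contain the word "president"?
peoples_title %>% filter(str_detect(peoples_title, "president"))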

Word cloud on ALL biographies

Code
bio <- dataset2 %>% 
  unnest_tokens(word, content, token = "words", strip_punct = TRUE) %>% 
  filter(!(word %in% stopwords(source = "stopwords-iso"))) 


bio_vec <- bio$word
create_corpus <- Corpus(VectorSource(bio_vec))

# cleaning data using the tm library
create_corpus <- create_corpus %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
documents
Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
documents
Code
create_corpus <- tm_map(create_corpus, content_transformer(tolower))
Warning in tm_map.SimpleCorpus(create_corpus, content_transformer(tolower)):
transformation drops documents
Code
create_corpus <- tm_map(create_corpus, removeWords, stopwords("english"))
Warning in tm_map.SimpleCorpus(create_corpus, removeWords,
stopwords("english")): transformation drops documents
Code
# creating the document-term matrix

dtm <- TermDocumentMatrix(create_corpus)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)

# word cloud
wordcloud2(data = df, size = 0.7)

Word cloud on all biographies

The above word cloud gives really good insight into the biographies as a whole. They revolve mostly around war, people, government, politics, and so on. Since this list is drawn largely from the last century, words like war stand out. And since all of them are famous people, so does the word famous :).

Words like president, american, political, army, peace, and king might have some interesting correlations that I should explore further. It’s also good to see women highlighted, which hints that the list contains a good number of women as well! (A quick filter sketch follows below.)
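
As a starting point for exploring those correlations (an illustrative sketch, my addition), the biographies can be filtered for any word of interest:

Code
# which people's biographies mention "president"?
dataset2 %>%
  filter(str_detect(tolower(content), "president")) %>%
  pull(people_names) %>%
  head()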

Future Analysis

In my next post, I plan to dive deeper into visualizing word counts using graphs. In later posts, I may also attempt sentiment analysis. A rough preview of the kind of graph I have in mind follows.
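
This is only a sketch of mine, reusing the df word-frequency table built above:

Code
# Sketch (my assumption of the next post's direction): bar chart of the 15
# most frequent words across all biographies
library(ggplot2)   # loaded with the tidyverse above

df %>%
  slice_max(freq, n = 15) %>%
  ggplot(aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "frequency",
       title = "Top 15 words across all biographies")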