Blog Post 3

Further Preprocessing
Author: Mekhala Kumar

Published: November 29, 2022

In this post, I focused on fixing the issues I faced while preprocessing the data in the previous blog post.

library(tidyverse)
library(quanteda)
#devtools::install_github("quanteda/readtext") 
library(readtext)
library(striprtf)
#library(LexisNexisTools)
library(corpus)
library(quanteda.textplots)
library(readr)

The dataset

I was able to save the files that had been read in as a single R object (an .rds file). This fixed the problem of all the files taking a long time to load, since the saved object can now be read back directly.

#saveRDS(df, file = "Data/All12Files.rds")
df_All<-readRDS(file = "D:/Text as Data/All12Files.rds")
df1<-df_All
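
The caching idea above can be sketched in a self-contained way. This is only an illustration under assumptions, not the code used for this post: `load_or_build()` and `slow_read()` are hypothetical helpers, the cache path is a temporary file, and the slow file-reading step is replaced by a stand-in.

```r
# Minimal sketch of caching a slow import with saveRDS()/readRDS().
# `load_or_build()` and `slow_read()` are hypothetical names, not from the post.
load_or_build <- function(path, build) {
  if (file.exists(path)) {
    return(readRDS(path))    # fast path: reuse the saved object
  }
  obj <- build()             # slow path: read and parse the raw files
  saveRDS(obj, file = path)  # store the result for the next session
  obj
}

# Stand-in for the expensive step of reading all the article files
slow_read <- function() {
  data.frame(
    Newspaper_Date = "The Hindu August 9, 2021",
    Body = "Body text",
    stringsAsFactors = FALSE
  )
}

cache_path <- file.path(tempdir(), "example_cache.rds")  # assumed location
first  <- load_or_build(cache_path, slow_read)  # builds the object and saves it
second <- load_or_build(cache_path, slow_read)  # reads the saved copy from disk
```

The second call skips `slow_read()` entirely, which is where the time saving comes from.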

Modification of the files that were read in

Before moving to the preprocessing steps, I had to clean the data, since it contained a lot of extra whitespace and text that was not needed.
Following the suggestions of Professor Song, I was able to separate the newspaper name, the date and the main text of each article.
This was done by removing leading and trailing whitespace and by finding words common to the articles (such as month names, weekday names and "Copyright") that could be used to split the text into the required columns.

Cleaning up the name of newspapers and the dates

The rows of text which did not have the names of the newspapers were removed.

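# Collapse repeated whitespace and trim both ends
df1$Newspaper_Date <- str_squish(df1$Newspaper_Date)
# Drop rows that have no newspaper name/date line
df2 <- df1 %>%
  filter(Newspaper_Date != "")
# Split just before the month name; the lookahead keeps the month in `date`
df2_cleaning2 <- df2 %>%
  separate(Newspaper_Date, into = c("newspaper", "date"),
           sep = "(?=(August|July))")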
Warning: Expected 2 pieces. Additional pieces discarded in 5 rows [329, 600,
880, 1047, 1134].
glimpse(df2_cleaning2)
Rows: 1,157
Columns: 4
$ newspaper <chr> "Hindustan Times ", "Hindustan Times ", "The Hindu ", "Hindu…
$ date      <chr> "August 9, 2021 Monday Copyright 2021 HT Media Ltd. All Righ…
$ Body      <chr> "\nBody\n\n\nMumbai, Aug. 9 -- From a solitary two-day fixtu…
$ Tags      <chr> "\nSubject: OLYMPICS (92%); CRICKET (90%); OLYMPIC COMMITTEE…
df2_cleaning2$newspaper <- str_squish(df2_cleaning2$newspaper)

df2_cleaning2 <- df2_cleaning2 %>%
  separate(date, into = c("date", "Delete"), sep = "Copyright") %>%
  select(-Delete)

df2_cleaning2 <- df2_cleaning2 %>%
  separate(date, into = c("date", "delete"),
           sep = "(?=(Sunday|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday))") %>%
  select(-delete)
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 115 rows [15, 50,
53, 59, 66, 76, 81, 87, 104, 108, 112, 115, 121, 123, 126, 142, 143, 147, 150,
154, ...].
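
To make the splitting logic easier to follow, here is a small sketch on two illustrative header strings written in the same shape as the data. It is a variation on the steps above, not the exact code: `toy` and `split_toy` are names introduced here, `extra = "merge"` is an optional `separate()` argument that keeps leftover text in the last column instead of discarding it with a warning, and `str_remove()` is used in place of the second and third `separate()` calls.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Illustrative header lines (made up to resemble the real ones)
toy <- data.frame(
  Newspaper_Date = c(
    "Hindustan Times   August 9, 2021 Monday Copyright 2021 HT Media Ltd.",
    "The Hindu July 27, 2021"
  ),
  stringsAsFactors = FALSE
)

split_toy <- toy %>%
  mutate(Newspaper_Date = str_squish(Newspaper_Date)) %>%
  # Split just before the month name; the lookahead keeps the month in `date`
  separate(Newspaper_Date, into = c("newspaper", "date"),
           sep = "(?=(August|July))", extra = "merge") %>%
  # Drop the copyright notice, if present
  mutate(date = str_remove(date, "Copyright.*$")) %>%
  # Drop the weekday name and anything after it, if present
  mutate(date = str_remove(
    date, "(Sunday|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday).*$"
  )) %>%
  # Trim the trailing spaces left behind by the splits
  mutate(across(c(newspaper, date), str_squish))

split_toy
```

Because the regular expression only looks for "August" and "July", this approach relies on all articles having been published in those two months, as is the case for this dataset.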

Cleaning up the main information of the article

I found that the main text was stored in two formats, so I used two approaches to clean it. Some articles begin with a dateline such as "Mumbai, Aug. 9 --", while the rest start directly after the "Body" label.
I also printed a few rows to check that the extracted text was in the format I wanted.

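df2_cleaning2$Body <- str_squish(df2_cleaning2$Body)
# Insert a hyphen after the first four characters ("Body" becomes "Body-")
df2_cleaning2$Body <- gsub("^(.{4})(.*)$",
                           "\\1-\\2",
                           df2_cleaning2$Body)
# Split at the dateline marker: one or more digits followed by " --"
df2_cleaning_reg <- df2_cleaning2 %>%
  separate(Body, into = c("delete", "body"),
           sep = "\\d+ --")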
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 738 rows [3, 6,
8, 12, 13, 15, 19, 20, 24, 25, 29, 31, 32, 33, 35, 37, 39, 41, 42, 43, ...].
df2_cleaning4 <- df2_cleaning_reg %>%
  filter(is.na(body))

df2_cleaning_remaining <- df2_cleaning_reg %>%
  filter(!is.na(body))

df2_cleaning5 <- df2_cleaning4 %>%
  separate(delete, into = c("delete2", "body2"),
           sep = "Body- ") %>%
  select(-delete2, -body)

glimpse(df2_cleaning4)
Rows: 738
Columns: 5
$ newspaper <chr> "The Hindu", "DNA", "The Telegraph (India)", "India Today On…
$ date      <chr> "August 9, 2021 ", "July 27, 2021 ", "August 7, 2021 ", "Aug…
$ delete    <chr> "Body- After the best-ever performance at the just-concluded…
$ body      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Tags      <chr> "\nSubject: 2020 TOKYO SUMMER OLYMPICS (90%); OLYMPICS (90%)…
glimpse(df2_cleaning_remaining)
Rows: 419
Columns: 5
$ newspaper <chr> "Hindustan Times", "Hindustan Times", "Hindustan Times", "Hi…
$ date      <chr> "August 9, 2021 ", "August 2, 2021 ", "July 28, 2021 ", "Aug…
$ delete    <chr> "Body- Mumbai, Aug. ", "Body- India, Aug. ", "Body- New Delh…
$ body      <chr> " From a solitary two-day fixture between Great Britain and …
$ Tags      <chr> "\nSubject: OLYMPICS (92%); CRICKET (90%); OLYMPIC COMMITTEE…
df2_cleaning_remaining <- df2_cleaning_remaining %>%
  select(-delete)
df2_cleaning5 <- df2_cleaning5 %>%
  rename(body = body2)

df2_cleaned_all <- rbind(df2_cleaning_remaining,
                         df2_cleaning5)
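
The two-format handling can also be seen on a pair of toy strings. The sketch below is a compact alternative that uses `str_detect()` and `str_replace()` rather than the two `separate()` calls above; it is not the code used for this post, and `body_raw`, `body_sq`, `has_dateline` and `body_clean` are names introduced only for the illustration.

```r
library(stringr)

# Illustrative strings in the two shapes seen in the data (shortened)
body_raw <- c(
  "\nBody\n\n\nMumbai, Aug. 9 -- From a solitary two-day fixture",
  "\nBody\n\n\nAfter the best-ever performance"
)

# Collapse newlines and repeated spaces into single spaces
body_sq <- str_squish(body_raw)

# Format 1: a dateline ending in "<digits> --"; keep only the text after it
has_dateline <- str_detect(body_sq, "\\d+ --")

body_clean <- ifelse(
  has_dateline,
  str_replace(body_sq, "^.*?\\d+ --", ""),
  # Format 2: no dateline; just drop the leading "Body" label
  str_replace(body_sq, "^Body ", "")
)
body_clean <- str_squish(body_clean)

body_clean
```

Either way, the idea is the same: try the dateline pattern first and fall back to stripping the "Body" label for the rows where it does not match.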

df2_cleaned_all %>% select(body) %>% slice(1)
   body
1  From a solitary two-day fixture between Great Britain and France in the 1900 Olympics, prospects of cricket's inclusion as an 8-team medal sport for men and women in Los Angeles 2028 have brightened. The International Cricket Council (ICC)'s proposal to introduce cricket as an Olympic sport in 2028 has been placed before the International Olympic Committee (IOC). Importantly, the Board of Control for Cricket in India (BCCI)'s reluctance to join the Olympic movement is now a thing of the past. "Once cricket is added in the Olympics, India will be participating," BCCI secretary Jay Shah said. "The BCCI and the ICC are on the same page as far as participation in the Olympics is concerned." The BCCI in its Apex Council meeting in April had given a conditional nod to send a team for the 2028 edition if its autonomy wasn't disturbed and there was no interference from the Indian Olympic Association (IOA). The BCCI, IOA and government though are working in sync. The Indian cricket board pledged Rs. 10 crore to assist the Tokyo bound Indian contingent's marketing budget. They also announced a cash prize totalling Rs. 4 crore for the seven medal winners on Saturday. The ICC, which has 92 Associate members but only 12 members play Test cricket, has been slow on the Olympics issue. Many of the top Test nations have in the past have had an insular view of safeguarding their playing window and TV rights revenue and resisting cricket's entry into Olympics. Now, with an agreement amongst most leading cricket boards, it has been in constant talks with IOC and an Olympics committee formed for the purpose. With a nudge from the government to increase India's medal prospects, BCCI administration has also switched its stance. "The BCCI is more than happy to work together with the government and help increase India's medal chances," a BCCI official said. Cricket has been added as a discipline in the 2022 Commonwealth Games (Birmingham, July-Aug). 
There's cricket in the 2022 Asian Games (Hangzhou, September) too. With the cricket calendar congested with ICC events, bilateral cricket and franchise leagues, finding a window for each of these games consistently may become a challenge. That is why, the Commonwealth Games will only have women's cricket. The Olympics, it is learnt, will require the participation of men and women. T20 has emerged as the format of choice despite some Associate nations championing for introducing T10. The English cricket board has explored the prospects of taking the Hundred-ball format to the Olympics, riding on its newly launched league. With neither format having international status, ICC is expected to start with T20Is. While introduction of a new sport at Olympics involves structured presentations and lobbying, those in cricket are confident that India's rapidly growing consumer market and digital engagement base will marry IOC's search for new Olympic markets. Published by HT Digital Content Services with permission from Hindustan Times. For any query with respect to this article or any other content requirement, please contact Editor at Classification Language: ENGLISH Publication-Type: Newswire
df2_cleaned_all %>% select(body) %>% slice(60)
   body
1  Indian wrestler Bajrang Punia on Saturday won the bronze medal after defeating Daulet Niyazbekov of Kazakhstan in the men's freestyle 65kg category. The second seed, who faced a crushing 5-12 defeat to Azerbaijan's Haji Aliyev in the semifinals, redeemed his campaign as he earned India's sixth medal of Tokyo 2020 with a clear 8-0 victory in the bronze medal bout. The men's freestyle 65kg bronze medal showdown bout began with Punia taking a point when Niyazbekov couldn't score in the 30 seconds attacking zone. The Indian then got another point to take a 2-0 lead at the break. (Full Tokyo 2020 Coverage) Punia started off on an attacking note in the last three minutes and got two more points with a take down to take a 4-0 lead. The India then gave no chance to the wrester from Kazakhstan by scoring four more points with two more take downs. Punia became the sixth Indian wrestler to finish on the Olympic podium after KD Jadhav, Sushil Kumar, Yogeshwar Dutt, Sakshi Malik and Ravi Kumar Dahiya. This became the second instance after the 2012 London Olympics when two Indian wrestlers won medals in the same Games. Ravi Dahiya had earlier won silver in the men's 57 kg category in this Olympic. Twenty-seven-year-old Punia began his challenge against Kyrgyzstan's Ernazar Akmataliev, defeating him 3-3 after scoring a later point with a smart take-down towards the end of the bout. In the 1/4 final, he enjoyed a successful second period against Morteza Cheka Ghiasi of Iran to win by fall. TOKYO 2020 OLYMPICS DAY 15 BLOG Bajrang is a three-time world championships medallist. He won a bronze in the 2019 World Championships and had won a silver in 2018, both in the 65kg category. He had won a bronze at the world championships in 2013 in the 60 kg category. He is also the reigning Commonwealth and Asian games champion of the 65kg category, having won the gold medal in 2018 in both the Games. He had won a silver medal in the 61kg category in the CWG and Asiad in 2014 respectively. 
(more details awaited) Published by HT Digital Content Services with permission from Hindustan Times. For any query with respect to this article or any other content requirement, please contact Editor at Classification Language: ENGLISH Publication-Type: Newswire
df2_cleaned_all_more <- df2_cleaned_all %>%
  separate(body, into = c("body", "delete"), sep = "Published|Classification")
Warning: Expected 2 pieces. Additional pieces discarded in 418 rows [1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
df2_cleaned_all_more <- df2_cleaned_all_more %>%
  select(-delete)
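
As a sketch of what this footer removal does, the example below trims two made-up, shortened article endings. It uses `str_remove()` instead of `separate()`, so it is an equivalent illustration rather than the code used above; `footer_toy` and `body_trimmed` are hypothetical names.

```r
library(stringr)

# Illustrative article endings (shortened, not real rows)
footer_toy <- c(
  "India finished Tokyo 2020 with seven medals. Published by HT Digital Content Services. Classification Language: ENGLISH",
  "A plan should be made to take this game to the villages. Classification Language: ENGLISH"
)

# Remove everything from the first "Published" or "Classification" onwards
body_trimmed <- str_squish(
  str_remove(footer_toy, "(Published|Classification).*$")
)

body_trimmed
```

One caveat applies to both versions: an article whose own text contains the word "Published" or "Classification" would be cut off early at that word.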

df2_cleaned_all_more %>% select(body) %>% slice(1)
   body
1  From a solitary two-day fixture between Great Britain and France in the 1900 Olympics, prospects of cricket's inclusion as an 8-team medal sport for men and women in Los Angeles 2028 have brightened. The International Cricket Council (ICC)'s proposal to introduce cricket as an Olympic sport in 2028 has been placed before the International Olympic Committee (IOC). Importantly, the Board of Control for Cricket in India (BCCI)'s reluctance to join the Olympic movement is now a thing of the past. "Once cricket is added in the Olympics, India will be participating," BCCI secretary Jay Shah said. "The BCCI and the ICC are on the same page as far as participation in the Olympics is concerned." The BCCI in its Apex Council meeting in April had given a conditional nod to send a team for the 2028 edition if its autonomy wasn't disturbed and there was no interference from the Indian Olympic Association (IOA). The BCCI, IOA and government though are working in sync. The Indian cricket board pledged Rs. 10 crore to assist the Tokyo bound Indian contingent's marketing budget. They also announced a cash prize totalling Rs. 4 crore for the seven medal winners on Saturday. The ICC, which has 92 Associate members but only 12 members play Test cricket, has been slow on the Olympics issue. Many of the top Test nations have in the past have had an insular view of safeguarding their playing window and TV rights revenue and resisting cricket's entry into Olympics. Now, with an agreement amongst most leading cricket boards, it has been in constant talks with IOC and an Olympics committee formed for the purpose. With a nudge from the government to increase India's medal prospects, BCCI administration has also switched its stance. "The BCCI is more than happy to work together with the government and help increase India's medal chances," a BCCI official said. Cricket has been added as a discipline in the 2022 Commonwealth Games (Birmingham, July-Aug). 
There's cricket in the 2022 Asian Games (Hangzhou, September) too. With the cricket calendar congested with ICC events, bilateral cricket and franchise leagues, finding a window for each of these games consistently may become a challenge. That is why, the Commonwealth Games will only have women's cricket. The Olympics, it is learnt, will require the participation of men and women. T20 has emerged as the format of choice despite some Associate nations championing for introducing T10. The English cricket board has explored the prospects of taking the Hundred-ball format to the Olympics, riding on its newly launched league. With neither format having international status, ICC is expected to start with T20Is. While introduction of a new sport at Olympics involves structured presentations and lobbying, those in cricket are confident that India's rapidly growing consumer market and digital engagement base will marry IOC's search for new Olympic markets. 
df2_cleaned_all_more %>% select(body) %>% slice(8)
   body
1  The Board of Control for Cricket in India (BCCI) on Saturday decided to celebrate India's most successful ever campaign at the Olympics by announcing cash rewards for all the medal winners at the Tokyo Games. the In a tweet, BCCI secretary Jay Shah also announced that Neeraj Chopra, India's first-ever gold medal winner in athletics - he won gold in men's javelin throw event - will get Rs.1 crore from the board. Rs.50 lakh each will be given to silver medallists -- weightlifter Mirabai Chanu and wrestler Ravi Kumar Dahiya. "Our athletes have made the country proud by finishing on the podium at @Tokyo2020. The @BCCI acknowledges their stellar efforts and we are delighted to announce cash prizes for the medallists," Jay Shah tweeted. Mirabai Chanu won India's first weightlifting medal at the Games and Ravi Dahiya became only the second wrestler from the country to win a silver after Sushil Kumar (2012). The bronze medallists -- wrestler Bajrang Punia, boxer Lovlina Borgohain and shuttler P V Sindhu -- will get Rs.25 lakh each. Sindhu became the first Indian woman and the second athlete overall to win two Olympic medals. She had won silver five years ago at the Rio Olympics. The men's hockey team which won its first Olympic medal in 41 years will get Rs.1.25 crore. India beat Germany 5-4 to win their third bronze medal and take their overall medal tally in Olympics to 12. India finished Tokyo 2020 with seven medals, making it their most successful campaign at the Games. India bettered their tally of six medals at the London Olympics in 2012. 
df2_cleaned_all_more %>% select(body) %>% slice(166)
   body
1  "I will come back with a medal for sure."This was the promise ace midfielder Lalit Kumar Upadhyay made to his father Satish Kumar Upadhyay and coach Parmanand Mishra before India won a historic medal in hockey after 41 years at the Tokyo Olympics on Thursday, defeating Germany 5-4 in the bronze medal match. The promise didn't make the two relax at all. They couldn't sleep properly till the start of the match for the bronze medal at the Oi Hockey Stadium on Thursday morning. There were altogether different scenes at Lalit's home at the Shivpur bypass in Varanasi and coach Mishra's house in Sarnath, located 10 km northeast of Varanasi.People from Lalit's village gathered at his house to watch the match live with his parents and his elder brother, whereas Mishra chose to watch it alone as he didn't want to be disturbed. "I watched the match all alone as I didn't want any disturbance. I even kept my family out of my room during the 60-minute clash. I couldn't pick my phone for a few minutes even after India's win as it was an emotional moment for me. I couldn't believe that my trainee Lalit kept his promise of a medal to me," said Mishra who trained Lalit at the Udai Pratap College in Varanasi under the Sports Authority of India's scheme. "This means a lot to me as well as to Varanasi, which has produced some great hockey players for India in the past like Mohd Shahid and Vivek Singh. India's success in hockey at the Tokyo Games is a great gift to Varanasi and it's a special occasion to celebrate in the month of Sawan as we all worship Lord Shiva," said Mishra, adding, "Lalit was one of the amazing kids we had in the lot of 10-11 boys under training. He (Lalit) never disobeyed my orders on the ground and was always disciplined." "I never found him missing his training sessions. His ability to bounce back, especially when he was trapped in a selection controversy, helped him a lot. 
Lalit's achievement will help attract many more Varanasi kids to hockey in the near future," said Mishra, 62. Lalit's father Satish, 61, too sounded emotional on his son's success. "I told him to stay focused even after India lost to Belgium in the semi-final. It was heartbreaking for all of us but I kept my cool and didn't let my son know my emotions," said the senior Upadhyay, still a private employee of a nationalised bank in Varanasi. "We celebrated all the goals scored by India today and my heart was almost in my mouth when Germany got a penalty corner in the last few seconds of the game. I still feel the goosebumps as controlling emotions at that time was quite difficult," he said. In Tokyo, the situation was no different for Lalit who was watching the match while sitting in the stands of the Oi Hockey Stadium after being injured in the previous match. "It's an emotional moment... can't understand what to say (after a pause). The way we played today was unique. It's a moment to celebrate, congratulate everyone," said Lalit on Thursday. "I came to Tokyo with Baba Bholenath's blessings, and was sure of winning a medal here as I had (made) this promise to my family and my coach in Varanasi. In today's game, our strong defence stopped the Germans. The last four minutes were heart-stopping as the opponents were scoring penalty corners one after the other. My heart came to a standstill for a while when Germany got a penalty corner in the last six seconds, but Anna (PR Sreejesh) stood like a rock at the goalpost and defended us." "This medal will act as a tonic for Indian hockey. After a long gap, this medal has increased our stature at the international level. The popularity of hockey will increase once again . A plan should be made to take this game to the villages." (With inputs from Sudhir Kumar in Varanasi) 
df2_cleaned_all_more$Tags <- str_squish(df2_cleaned_all_more$Tags)
#saveRDS(df2_cleaned_all_more, file = "Data/CleanData.rds")

Preprocessing

Once the data was cleaned, I repeated the preprocessing steps from the previous post and compared the resulting word cloud with the earlier one to check for any significant differences.

Creating the corpus and looking at the number of tokens in each document

newspaper_corpus <- corpus(df2_cleaned_all_more,text_field = "body")

newspaper_corpus_summary <- summary(newspaper_corpus)
head(newspaper_corpus_summary)
   Text Types Tokens Sentences       newspaper            date
1 text1   260    542        22 Hindustan Times August 9, 2021 
2 text2   106    220         5 Hindustan Times August 2, 2021 
3 text3   153    305         8 Hindustan Times  July 28, 2021 
4 text4   198    380        16 Hindustan Times August 8, 2021 
5 text5   197    348         6 Hindustan Times  July 29, 2021 
6 text6   166    255         8 Hindustan Times August 1, 2021 
  Tags
1 Subject: OLYMPICS (92%); CRICKET (90%); OLYMPIC COMMITTEES (90%); SPORTS GOVERNING BODIES (90%); SPORTS & RECREATION EVENTS (89%); AGREEMENTS (78%); WOMEN'S SPORTS (78%); ASSOCIATIONS & ORGANIZATIONS (77%); TALKS & MEETINGS (77%) Organization: INTERNATIONAL OLYMPIC COMMITTEE (57%) Industry: BUDGETS (66%); TELEVISION INDUSTRY (50%) Geographic: MUMBAI, MAHARASHTRA, INDIA (92%); LOS ANGELES, CA, USA (79%); BIRMINGHAM, ENGLAND (58%); CALIFORNIA, USA (58%); INDIA (94%); UNITED KINGDOM (79%); FRANCE (58%) Load-Date: August 8, 2021
2 Subject: 2020 TOKYO SUMMER OLYMPICS (90%); OLYMPICS (90%); SHOOTING SPORTS (90%); SUMMER OLYMPICS (90%); WEAPONS & ARMS (89%); EQUESTRIAN SPORTS (78%); WOMEN'S SPORTS (78%) Industry: MEDIA CONTENT (78%) Geographic: INDIA (91%) Load-Date: August 1, 2021
3 Subject: 2020 TOKYO SUMMER OLYMPICS (91%); OLYMPICS (91%); ATHLETES (90%); FIELD HOCKEY (90%); SUMMER OLYMPICS (90%); WOMEN'S SPORTS (90%); BOAT RACING (89%); ROWING (89%); ARCHERY (78%); BADMINTON (78%); BOXING (78%); MEN'S SPORTS (78%); SPORTS & RECREATION EVENTS (78%); SPORTS AWARDS (78%); TOURNAMENTS (76%); BOATING & RAFTING (75%) Industry: MEDIA CONTENT (78%) Geographic: NEW DELHI, INDIA (74%); INDIA (94%); UNITED KINGDOM (73%) Load-Date: July 27, 2021
4 Subject: ARMIES (94%); OLYMPICS (92%); 2020 TOKYO SUMMER OLYMPICS (90%); ARMED FORCES (90%); HEADS OF STATE & GOVERNMENT (90%); PHOTO & VIDEO SHARING (90%); PRIME MINISTERS (90%); SPORTS & RECREATION (90%); SPORTS AWARDS (90%); SUMMER OLYMPICS (90%); VIRAL VIDEOS (90%); HUMAN RESOURCES & PERSONNEL MANAGEMENT (78%); TRACK & FIELD (78%); EDUCATION & TRAINING (77%); MILITARY SERVICE (77%); SOCIAL MEDIA (77%); EDUCATION SYSTEMS & INSTITUTIONS (74%); GOVERNMENT ADVISORS & MINISTERS (73%); STUDENTS & STUDENT LIFE (73%); INTERNET SOCIAL NETWORKING (72%); WEAPONS & ARMS (72%) Industry: ARMIES (94%); ARMED FORCES (90%); PHOTO & VIDEO SHARING (90%); VIRAL VIDEOS (90%); MEDIA CONTENT (78%); MILITARY SERVICE (77%); SOCIAL MEDIA (77%); EDUCATION SYSTEMS & INSTITUTIONS (74%); INTERNET SOCIAL NETWORKING (72%) Person: NARENDRA MODI (79%) Geographic: GUJARAT, INDIA (79%); INDIA (94%) Load-Date: August 8, 2021
5 Subject: 2020 TOKYO SUMMER OLYMPICS (90%); ARCHERY (90%); MEN'S SPORTS (90%); SUMMER OLYMPICS (90%); WOMEN'S SPORTS (90%); WEAPONS & ARMS (89%); BADMINTON (78%); BOXING (78%); ROWING (78%); SHOOTING SPORTS (78%); BOAT RACING (73%); BOATING & RAFTING (73%) Company: RADIAL INC (63%) Industry: NAICS561499 ALL OTHER BUSINESS SUPPORT SERVICES (63%); NAICS561422 TELEMARKETING BUREAUS & OTHER CONTACT CENTERS (63%); NAICS541511 CUSTOM COMPUTER PROGRAMMING SERVICES (63%); NAICS518210 DATA PROCESSING, HOSTING & RELATED SERVICES (63%); NAICS454110 ELECTRONIC SHOPPING AND MAIL-ORDER HOUSES (63%); SIC7389 BUSINESS SERVICES (63%); MEDIA CONTENT (78%) Geographic: NEW DELHI, INDIA (74%); INDIA (91%); ARGENTINA (79%) Load-Date: July 28, 2021
6 Subject: 2020 TOKYO SUMMER OLYMPICS (90%); OLYMPICS (90%); SUMMER OLYMPICS (90%); BADMINTON (89%); FIELD HOCKEY (89%); MEN'S SPORTS (89%); BOXING (78%); SPORTS AWARDS (78%); WOMEN'S SPORTS (78%); GOLF (72%) Industry: MEDIA CONTENT (78%) Geographic: NEW DELHI, INDIA (74%); INDIA (93%); UNITED KINGDOM (68%) Load-Date: July 31, 2021
newspaper_corpus_summary$Tokens
  [1]  542  220  305  380  348  255  547  299  254  298  326  292  393  353  431
 [16]  644  289  328  300  305  479  522  371  232  352  371  255  317  780  376
 [31]  188  222  520  580  207  383  367  522  337  321  472  347  635  341  243
 [46]  545  447  310  509  583  331  270  363  282 1084  375  266  310  477  383
 [61]  237  271  504  370  321  603  271  359  237  342  449  662  884 1196  280
 [76]  336  649  378  635  548  375  400  483  353  707  434  255 1118  282  440
 [91]  469  294  310  333 1024  361  290  652  282  999
ndoc(newspaper_corpus)
[1] 1157
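
The token counts printed above stop at 100 even though the corpus holds 1,157 documents, because `summary()` on a corpus reports only the first 100 documents by default. The following is a minimal sketch of one way to get a count for every document. The toy corpus and the `toy_` names are illustrative assumptions and stand in for `newspaper_corpus`.

```r
library(quanteda)

# Toy corpus standing in for newspaper_corpus (illustrative texts only)
toy_corpus <- corpus(c(doc_a = "India won a medal in hockey.",
                       doc_b = "The match was played in Tokyo on Thursday morning."))

# ntoken() on a tokens object returns one count per document,
# without the 100-document default limit of summary()
toy_counts <- ntoken(tokens(toy_corpus))
print(toy_counts)

# summary() can also be asked for every document through its n argument
toy_summary <- summary(toy_corpus, n = ndoc(toy_corpus))
print(nrow(toy_summary))
```

Applied to the real data, the same idea would be `ntoken(tokens(newspaper_corpus))` or `summary(newspaper_corpus, n = ndoc(newspaper_corpus))`.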

Checking for metadata

The corpus does carry document-level metadata (docvars), but printing all of it produced too many pages of output, so I commented out the line.

#docvars(newspaper_corpus)
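
A lighter way to inspect the metadata is to print only the variable names or the first few rows. This is a small sketch under stated assumptions: `toy_df` is a made-up data frame that mimics the columns of `df2_cleaned_all_more`, and its values are illustrative only.

```r
library(quanteda)

# Toy data frame standing in for df2_cleaned_all_more (values are illustrative)
toy_df <- data.frame(
  newspaper = c("Hindustan Times", "The Hindu"),
  date = c("August 9, 2021", "July 28, 2021"),
  body = c("India won a medal.", "The team reached the final."),
  stringsAsFactors = FALSE
)
toy_corpus <- corpus(toy_df, text_field = "body")

# Names of the document-level variables
toy_vars <- names(docvars(toy_corpus))
print(toy_vars)

# First rows only, instead of the full table
print(head(docvars(toy_corpus), 3))

# A single docvar can be tabulated, for example articles per newspaper
print(table(docvars(toy_corpus, "newspaper")))
```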

Creating tokens

newspaper_tokens <- tokens(newspaper_corpus)
head(newspaper_tokens)
Tokens consisting of 6 documents and 3 docvars.
text1 :
 [1] "From"     "a"        "solitary" "two-day"  "fixture"  "between" 
 [7] "Great"    "Britain"  "and"      "France"   "in"       "the"     
[ ... and 530 more ]

text2 :
 [1] "Tokyo"      "Olympics"   "Day"        "10"         "Full"      
 [6] "Schedule"   ":"          "Kamalpreet" "Kaur"       "stunned"   
[11] "the"        "nation"    
[ ... and 208 more ]

text3 :
 [1] "The"      "Tokyo"    "Olympics" "2020"     "enters"   "the"     
 [7] "fifth"    "day"      "which"    "will"     "begin"    "with"    
[ ... and 293 more ]

text4 :
 [1] "After"      "Neeraj"     "Chopra"     "became"     "the"       
 [6] "second"     "Indian"     "ever"       "to"         "win"       
[11] "an"         "individual"
[ ... and 368 more ]

text5 :
 [1] "Day"       "5"         "of"        "the"       "Tokyo"     "Olympics" 
 [7] "on"        "Wednesday" "was"       "a"         "hot"       "and"      
[ ... and 336 more ]

text6 :
 [1] "Day"        "8"          "of"         "the"        "Tokyo"     
 [6] "Olympics"   "wasn't"     "great"      "in"         "particular"
[11] "for"        "India"     
[ ... and 243 more ]
#print(newspaper_tokens)
#sprintf(as.character(newspaper_tokens[1]))

Removing punctuation

newspaper_tokens <- tokens(newspaper_tokens, remove_punct = TRUE)
head(newspaper_tokens)
Tokens consisting of 6 documents and 3 docvars.
text1 :
 [1] "From"     "a"        "solitary" "two-day"  "fixture"  "between" 
 [7] "Great"    "Britain"  "and"      "France"   "in"       "the"     
[ ... and 464 more ]

text2 :
 [1] "Tokyo"      "Olympics"   "Day"        "10"         "Full"      
 [6] "Schedule"   "Kamalpreet" "Kaur"       "stunned"    "the"       
[11] "nation"     "with"      
[ ... and 171 more ]

text3 :
 [1] "The"      "Tokyo"    "Olympics" "2020"     "enters"   "the"     
 [7] "fifth"    "day"      "which"    "will"     "begin"    "with"    
[ ... and 255 more ]

text4 :
 [1] "After"      "Neeraj"     "Chopra"     "became"     "the"       
 [6] "second"     "Indian"     "ever"       "to"         "win"       
[11] "an"         "individual"
[ ... and 335 more ]

text5 :
 [1] "Day"       "5"         "of"        "the"       "Tokyo"     "Olympics" 
 [7] "on"        "Wednesday" "was"       "a"         "hot"       "and"      
[ ... and 274 more ]

text6 :
 [1] "Day"        "8"          "of"         "the"        "Tokyo"     
 [6] "Olympics"   "wasn't"     "great"      "in"         "particular"
[11] "for"        "India"     
[ ... and 215 more ]

Removing stopwords

withoutstopwords_news <- tokens_select(newspaper_tokens,
                                       pattern = stopwords("en"),
                                       selection = "remove")
print(withoutstopwords_news)
Tokens consisting of 1,157 documents and 3 docvars.
text1 :
 [1] "solitary"  "two-day"   "fixture"   "Great"     "Britain"   "France"   
 [7] "1900"      "Olympics"  "prospects" "cricket's" "inclusion" "8-team"   
[ ... and 267 more ]

text2 :
 [1] "Tokyo"      "Olympics"   "Day"        "10"         "Full"      
 [6] "Schedule"   "Kamalpreet" "Kaur"       "stunned"    "nation"    
[11] "64m"        "throw"     
[ ... and 136 more ]

text3 :
 [1] "Tokyo"    "Olympics" "2020"     "enters"   "fifth"    "day"     
 [7] "begin"    "Indian"   "women's"  "Hockey"   "team"     "locking" 
[ ... and 186 more ]

text4 :
 [1] "Neeraj"     "Chopra"     "became"     "second"     "Indian"    
 [6] "ever"       "win"        "individual" "gold"       "medal"     
[11] "Tokyo"      "Olympics"  
[ ... and 205 more ]

text5 :
 [1] "Day"       "5"         "Tokyo"     "Olympics"  "Wednesday" "hot"      
 [7] "cold"      "affair"    "Shuttler"  "PV"        "Sindhu"    "advanced" 
[ ... and 217 more ]

text6 :
 [1] "Day"        "8"          "Tokyo"      "Olympics"   "great"     
 [6] "particular" "India"      "top"        "guns"       "failed"    
[11] "make"       "mark"      
[ ... and 145 more ]

[ reached max_ndoc ... 1,151 more documents ]
head(withoutstopwords_news)
Tokens consisting of 6 documents and 3 docvars.
text1 :
 [1] "solitary"  "two-day"   "fixture"   "Great"     "Britain"   "France"   
 [7] "1900"      "Olympics"  "prospects" "cricket's" "inclusion" "8-team"   
[ ... and 267 more ]

text2 :
 [1] "Tokyo"      "Olympics"   "Day"        "10"         "Full"      
 [6] "Schedule"   "Kamalpreet" "Kaur"       "stunned"    "nation"    
[11] "64m"        "throw"     
[ ... and 136 more ]

text3 :
 [1] "Tokyo"    "Olympics" "2020"     "enters"   "fifth"    "day"     
 [7] "begin"    "Indian"   "women's"  "Hockey"   "team"     "locking" 
[ ... and 186 more ]

text4 :
 [1] "Neeraj"     "Chopra"     "became"     "second"     "Indian"    
 [6] "ever"       "win"        "individual" "gold"       "medal"     
[11] "Tokyo"      "Olympics"  
[ ... and 205 more ]

text5 :
 [1] "Day"       "5"         "Tokyo"     "Olympics"  "Wednesday" "hot"      
 [7] "cold"      "affair"    "Shuttler"  "PV"        "Sindhu"    "advanced" 
[ ... and 217 more ]

text6 :
 [1] "Day"        "8"          "Tokyo"      "Olympics"   "great"     
 [6] "particular" "India"      "top"        "guns"       "failed"    
[11] "make"       "mark"      
[ ... and 145 more ]
#as.character(withoutstopwords_news)
#saveRDS(withoutstopwords_news, file = "Data/PreprocessedData.rds")

Keyword in context

kwic_women <- kwic(withoutstopwords_news, pattern = "women")
kwic_men <- kwic(withoutstopwords_news, pattern = "men")
head(kwic_women)
Keyword-in-context with 6 matches.
   [text1, 16]          inclusion 8-team medal sport men | women | Los Angeles 2028 brightened International
  [text1, 224] Olympics learnt require participation men | women | T20 emerged format choice despite
   [text5, 35] archer Deepika Kumari sailed pre-quarters | women | individual event Taking part first
  [text5, 181]           IST Nethra Kumanan Laser Radial | Women | Race 7 8 8 45
 [text26, 199]    turned emotional watching game Notably | women | hockey players come humble backgrounds
  [text30, 62]          announce HK Group decided honour | Women | hockey team players player wishes
head(kwic_men)
Keyword-in-context with 6 matches.
   [text1, 15]        cricket's inclusion 8-team medal sport | men | women Los Angeles 2028 brightened
  [text1, 223] cricket Olympics learnt require participation | men | women T20 emerged format choice
  [text5, 159]            IST SAILING Vishnu Saravanan Laser | Men | Race 7 8 8 35
  [text5, 170]         IST Ganapathy Kelapanda Varun Thakkar | Men | Race 5 6 8 35
  [text22, 87]               Britain's goal came 45th minute | Men | Blue Great Britain quarterfinal long
 [text73, 333]             strong women's boxing good boxers | men | well competed ask gone Olympics
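
The patterns "women" and "men" match only those exact tokens (ignoring case), so possessive forms such as "women's", which appears as its own token in text3, are not returned. The sketch below, built on a made-up sentence, shows how a glob pattern with a trailing `*` widens the match. The sentence and the `toy_` and `kwic_` names are assumptions for illustration.

```r
library(quanteda)

# Made-up sentence containing both the plain and the possessive forms
toy_tokens <- tokens(
  "The women's hockey team and the men's hockey team both played; women and men cheered.",
  remove_punct = TRUE
)

# Exact pattern: matches only the token "women"
kwic_exact <- kwic(toy_tokens, pattern = "women")

# Glob pattern: also matches tokens that start with "women", such as "women's"
kwic_glob <- kwic(toy_tokens, pattern = "women*")

print(kwic_exact)
print(kwic_glob)
```

Note that a glob such as "men*" would also pick up unrelated words beginning with "men" (for example "mental"), so the pattern may need to be listed explicitly, e.g. `c("men", "men's")`.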

Creating a tokens object with bigrams and trigrams

news_ngrams <- tokens_ngrams(withoutstopwords_news, n=2:3)
head(news_ngrams)
Tokens consisting of 6 documents and 3 docvars.
text1 :
 [1] "solitary_two-day"    "two-day_fixture"     "fixture_Great"      
 [4] "Great_Britain"       "Britain_France"      "France_1900"        
 [7] "1900_Olympics"       "Olympics_prospects"  "prospects_cricket's"
[10] "cricket's_inclusion" "inclusion_8-team"    "8-team_medal"       
[ ... and 543 more ]

text2 :
 [1] "Tokyo_Olympics"      "Olympics_Day"        "Day_10"             
 [4] "10_Full"             "Full_Schedule"       "Schedule_Kamalpreet"
 [7] "Kamalpreet_Kaur"     "Kaur_stunned"        "stunned_nation"     
[10] "nation_64m"          "64m_throw"           "throw_qualification"
[ ... and 281 more ]

text3 :
 [1] "Tokyo_Olympics" "Olympics_2020"  "2020_enters"    "enters_fifth"  
 [5] "fifth_day"      "day_begin"      "begin_Indian"   "Indian_women's"
 [9] "women's_Hockey" "Hockey_team"    "team_locking"   "locking_horns" 
[ ... and 381 more ]

text4 :
 [1] "Neeraj_Chopra"   "Chopra_became"   "became_second"   "second_Indian"  
 [5] "Indian_ever"     "ever_win"        "win_individual"  "individual_gold"
 [9] "gold_medal"      "medal_Tokyo"     "Tokyo_Olympics"  "Olympics_video" 
[ ... and 419 more ]

text5 :
 [1] "Day_5"              "5_Tokyo"            "Tokyo_Olympics"    
 [4] "Olympics_Wednesday" "Wednesday_hot"      "hot_cold"          
 [7] "cold_affair"        "affair_Shuttler"    "Shuttler_PV"       
[10] "PV_Sindhu"          "Sindhu_advanced"    "advanced_Round"    
[ ... and 443 more ]

text6 :
 [1] "Day_8"            "8_Tokyo"          "Tokyo_Olympics"   "Olympics_great"  
 [5] "great_particular" "particular_India" "India_top"        "top_guns"        
 [9] "guns_failed"      "failed_make"      "make_mark"        "mark_Boxers"     
[ ... and 299 more ]
tail(news_ngrams)
Tokens consisting of 6 documents and 3 docvars.
text1152 :
 [1] "Diamond_baron"          "baron_Savji"            "Savji_Dholakia"        
 [4] "Dholakia_made"          "made_yet"               "yet_another"           
 [7] "another_announcement"   "announcement_declaring" "declaring_award"       
[10] "award_Rs"               "Rs_2.5"                 "2.5_lakh"              
[ ... and 171 more ]

text1153 :
 [1] "THE.WAIT.HAS.ENDED_men's" "men's_hockey"            
 [3] "hockey_team"              "team_every"              
 [5] "every_Indian"             "Indian_dreaming"         
 [7] "dreaming_winning"         "winning_Olympic"         
 [9] "Olympic_medal"            "medal_Indian"            
[11] "Indian_side"              "side_defeatedGermany"    
[ ... and 135 more ]

text1154 :
 [1] "New_Delhi"            "Delhi_Indian"         "Indian_sports"       
 [4] "sports_fraternity"    "fraternity_including" "including_cricketing"
 [7] "cricketing_greats"    "greats_Sachin"        "Sachin_Tendulkar"    
[10] "Tendulkar_country's"  "country's_white-ball" "white-ball_skipper"  
[ ... and 51 more ]

text1155 :
 [1] "India's_Equestrian" "Equestrian_Fouaad"  "Fouaad_Mirza"      
 [4] "Mirza_Seigneur"     "Seigneur_Medicott"  "Medicott_qualified"
 [7] "qualified_Jumping"  "Jumping_Individual" "Individual_Finals" 
[10] "Finals_got"         "got_8"              "8_point"           
[ ... and 179 more ]

text1156 :
 [1] "Saturday_Saikhom" "Saikhom_Mirabai"  "Mirabai_Chanu"    "Chanu_stood"     
 [5] "stood_podium"     "podium_Tokyo"     "Tokyo_silver"     "silver_medal"    
 [9] "medal_around"     "around_neck"      "neck_two"         "two_former"      
[ ... and 203 more ]

text1157 :
 [1] "Mirabai_Chanu"           "Chanu_said"             
 [3] "said_failed"             "failed_campaign"        
 [5] "campaign_Rio"            "Rio_taught"             
 [7] "taught_overcome"         "overcome_disappointment"
 [9] "disappointment_make"     "make_fresh"             
[11] "fresh_start"             "start_virtual"          
[ ... and 201 more ]
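
Because the n-grams are built after stopword removal, some of them join words that were not adjacent in the original article: "fixture_Great" in text1 comes from "fixture between Great". The sketch below reproduces this on the opening words of text1. The second half relies on an assumption worth verifying on the installed quanteda version, namely that `padding = TRUE` keeps empty placeholders for removed tokens and that `tokens_ngrams()` does not form n-grams across them.

```r
library(quanteda)

toy_tokens <- tokens("a solitary fixture between Great Britain and France")

# Dropping stopwords closes the gaps, so "fixture" and "Great" become neighbours
toy_nostop <- tokens_remove(toy_tokens, pattern = stopwords("en"))
toy_ngrams <- tokens_ngrams(toy_nostop, n = 2:3)
print(toy_ngrams)

# Assumption to verify: with padding = TRUE the removed positions are kept as
# empty pads, and n-grams should then be limited to originally adjacent words
toy_padded <- tokens_remove(toy_tokens, pattern = stopwords("en"), padding = TRUE)
toy_ngrams_padded <- tokens_ngrams(toy_padded, n = 2:3)
print(toy_ngrams_padded)
```

Whether the gap-spanning n-grams are a problem depends on the analysis; for topic modelling they may simply add noise.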

Word Cloud

I created a word cloud to check whether any words other than the stopwords still needed to be removed.
This time the words "classification", "publication-type" and "body" did not appear, which suggests that the cleaning steps worked as intended.

news_dfm <- dfm(withoutstopwords_news)
news_dfm
Document-feature matrix of: 1,157 documents, 22,370 features (99.21% sparse) and 3 docvars.
       features
docs    solitary two-day fixture great britain france 1900 olympics prospects
  text1        1       1       1     1       1      1    1        9         3
  text2        0       0       0     0       0      0    0        2         0
  text3        0       0       0     2       2      0    0        3         0
  text4        0       0       0     0       0      0    0       10         0
  text5        0       0       0     0       0      0    0        3         0
  text6        0       0       0     2       1      0    0        2         0
       features
docs    cricket's
  text1         2
  text2         0
  text3         0
  text4         0
  text5         0
  text6         0
[ reached max_ndoc ... 1,151 more documents, reached max_nfeat ... 22,360 more features ]
textplot_wordcloud(news_dfm, min_count = 50, random_order = FALSE)

#saveRDS(news_dfm, file = "Data/News_DFM.rds")
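
A word cloud can leave out words that do not fit in the plotting area, so a check that does not depend on the image may be useful. The sketch below uses a toy document-feature matrix (an assumption standing in for `news_dfm`) to list the most frequent features and to test directly whether the unwanted terms are still present.

```r
library(quanteda)

# Toy tokens standing in for withoutstopwords_news (illustrative texts only)
toy_tokens <- tokens(c(doc_a = "Olympics hockey medal hockey India",
                       doc_b = "India hockey Olympics boxing"))
toy_dfm <- dfm(toy_tokens)

# Most frequent features across all documents, sorted by count
toy_top <- topfeatures(toy_dfm, n = 3)
print(toy_top)

# Check whether specific unwanted terms are still present as features
unwanted <- c("classification", "publication-type", "body")
print(unwanted %in% featnames(toy_dfm))
```

On the real data, `topfeatures(news_dfm, 50)` and `unwanted %in% featnames(news_dfm)` would give the same checks.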

Questions:
1. How can I use the classification tags and newspaper names in my analysis?
2. Is the word cloud accurate, given that some of the words were omitted from the image?

In my next blog post, I will begin the process of topic modelling.