Blog Post 2

MekhalaKumar
Olympics2020
GenderandSports
BlogPost2
Reading in Data and Preprocessing
Author: Mekhala Kumar

Published: November 15, 2022

library(tidyverse)
library(quanteda)
#devtools::install_github("quanteda/readtext") 
library(readtext)
library(striprtf)
#library(LexisNexisTools)
library(corpus)
library(quanteda.textplots)

About the Data and Collection Process

I will be looking at how women’s and men’s sports were described in the context of the Tokyo 2020 Olympics. The data for this project is a collection of Indian newspaper articles from the LexisNexis archive, dated July 22, 2021 to August 9, 2021 (the Olympics were held from July 23 to August 8). I tried a few ways of collecting the articles. I first tried scraping the Deccan Herald website and then the LexisNexis archive, but I was unable to do either, so I downloaded the articles from LexisNexis after filtering the database. The filters were: Jul 22, 2021 to Aug 09, 2021; Asia; India; Men’s Sports/Women’s Sports/Sports Awards; Newspapers/web-based publications; and the following set of newspapers: Hindustan Times, Times of India (Electronic Edition), Free Press Journal (India), The Telegraph (India), Indian Express, Mint, DNA, India Today Online, The Hindu and Economic Times.

Initial steps

At first, I read in the data from one file and looked at the different components present, in order to understand what I needed to extract.
Because loading the files was causing errors, the code I used to read in the files and extract the information is provided as comments.
The data itself is loaded from a saved R object.

#my_texts <- readtext(paste0("D:/Text as Data/Files(100) Set 1.rtf"),text_field = "texts")
#testing <- as.character(my_texts)
#testing
#typeof(testing)
#str_length(testing)

Components

The body of the article is the main text required. To extract it, I found the index position at which the body of each article starts. I also collected the index positions of the newspaper name, date and classifications.

I noticed three issues while doing this step:

  1. The titles of the articles were not saved when the file was read in.
  2. The dateline, which included the location or date, was not present for every article.
  3. I had thought of using the index position of \nPublished, as it marks the end of the body of the article. However, it was not present for every article.

#Components
#indices_name<-str_locate_all(testing, "\n\n\n\n\n\n")%>%unlist()
#indices_name

#indices_date<-str_locate_all(testing,"\nDateline")
#indices_date

#For the actual text
#indices_body<-str_locate_all(testing,"\nBody")%>%unlist()
#indices_body
#indices_body_end<-str_locate_all(testing,"\nPublished")%>%unlist()
#indices_body_end

#Subject and classifications 
#indices_classification<-str_locate_all(testing,"\nSubject")%>%unlist()
#indices_classification

#Document end
#indices_end<-str_locate_all(testing,"\nEnd of Document")%>%unlist()
#indices_end
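
As a rough illustration of the marker search above, here is a minimal sketch on a made-up two-article string. The toy text and the names toy, body_start, subject_start and doc_end are assumptions for illustration, not the LexisNexis data. It keeps only the start (or end) column returned by str_locate_all(), which avoids relying on the order in which unlist() flattens the start and end positions.

```r
library(stringr)

# Hypothetical stand-in for the text of one .rtf file (two short "articles")
toy <- paste0(
  "\n\n\n\n\n\nPaper A\nJuly 27, 2021 Tuesday\n\n\nBody\nFirst article text.\n",
  "\nSubject: OLYMPICS (90%)\n\nEnd of Document",
  "\n\n\n\n\n\nPaper B\nAugust 2, 2021 Monday\n\n\nBody\nSecond article text.\n",
  "\nSubject: HOCKEY (90%)\n\nEnd of Document"
)

# str_locate_all() returns a list with one start/end matrix per input string;
# [[1]] picks the matrix for our single string, and the column name picks the positions
body_start <- str_locate_all(toy, fixed("\nBody"))[[1]][, "start"]
subject_start <- str_locate_all(toy, fixed("\nSubject"))[[1]][, "start"]
doc_end <- str_locate_all(toy, fixed("\nEnd of Document"))[[1]][, "end"]

# One position per article is expected for each marker
length(body_start)
length(subject_start)
length(doc_end)
```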

Creating the dataframe for 1 file

I tried creating a dataframe for the articles in one file, but ran into problems separating out the few words that come before the body of each article (such as the date and location).

#df_test <- data.frame(matrix(ncol = 3, nrow = 1191))%>% rename(
 #   Newspaper_Date=X1 ,
  #  Body=X2,
  #  Tags=X3
   # )
#Testing with a smaller set

#for (i in 1:3){
 # df_test[i,1]<-str_sub(testing,indices_name[i],indices_body[i])
 # df_test[i,2]<-str_sub(testing,indices_body[i],indices_body_end[i])
  #df_test[i,3]<-str_sub(testing,indices_classification[i],indices_end[i])

#}


#df_test<-df_test%>%separate(Newspaper_Date, into=c("Newspaper_Date", "Delete"), sep="Copyright")%>%select(-Delete)

#"\nPublished" is not present in every article, so the loop above did not work for all of them.
#Here the body is instead taken to end where "\nSubject" starts.
#for (i in 1:100){
 # df[i,1]<-str_sub(testing,indices_name[i],indices_body[i])
  #df[i,2]<-str_sub(testing,indices_body[i],indices_classification[i])
  #df[i,3]<-str_sub(testing,indices_classification[i],indices_end[i])
#}
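
The slicing step can be sketched on a small made-up example. This version assumes, as in the loops above, that each article's body runs from \nBody up to \nSubject and its tags from \nSubject up to \nEnd of Document. The toy text and the helper name starts_of() are hypothetical. Because str_sub() is vectorised over its start and end arguments, no explicit loop is needed, and checking that every marker occurs the same number of times catches files where a marker is missing.

```r
library(stringr)

# Hypothetical stand-in for the text of one .rtf file (two short "articles")
toy <- paste0(
  "\n\n\n\n\n\nPaper A\nJuly 27, 2021 Tuesday\n\n\nBody\nFirst article text.\n",
  "\nSubject: OLYMPICS (90%)\n\nEnd of Document",
  "\n\n\n\n\n\nPaper B\nAugust 2, 2021 Monday\n\n\nBody\nSecond article text.\n",
  "\nSubject: HOCKEY (90%)\n\nEnd of Document"
)

# Hypothetical helper: start positions of a fixed marker in a single string
starts_of <- function(text, marker) {
  str_locate_all(text, fixed(marker))[[1]][, "start"]
}

name_start <- starts_of(toy, "\n\n\n\n\n\n")
body_start <- starts_of(toy, "\nBody")
subject_start <- starts_of(toy, "\nSubject")
end_start <- starts_of(toy, "\nEnd of Document")

# Stop early if some article is missing a marker, since the pieces would misalign
stopifnot(
  length(name_start) == length(body_start),
  length(body_start) == length(subject_start),
  length(subject_start) == length(end_start)
)

# Each piece ends one character before the next marker starts
df_toy <- data.frame(
  Newspaper_Date = str_sub(toy, name_start, body_start - 1),
  Body = str_sub(toy, body_start, subject_start - 1),
  Tags = str_sub(toy, subject_start, end_start - 1)
)
nrow(df_toy)
```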

Reading all the files

This reads in the entire collection of files. The rest of the post works with the data from all the files.

#my_text1 <- readtext(paste0("Data/Files(100) Set 1.rtf"),text_field = "texts")
#my_text2 <- readtext(paste0("Data/Files(100) Set 2.rtf"),text_field = "texts")
#my_text3 <- readtext(paste0("Data/Files(100) Set 3.rtf"),text_field = "texts")
#my_text4 <- readtext(paste0("Data/Files(100) Set 4.rtf"),text_field = "texts")
#my_text5 <- readtext(paste0("Data/Files(100) Set 5.rtf"),text_field = "texts")
#my_text6 <- readtext(paste0("Data/Files(100) Set 6.rtf"),text_field = "texts")
#my_text7 <- readtext(paste0("Data/Files(100) Set 7.rtf"),text_field = "texts")
#my_text8 <- readtext(paste0("Data/Files(100) Set 8.rtf"),text_field = "texts")
#my_text9 <- readtext(paste0("Data/Files(100) Set 9.rtf"),text_field = "texts")
#my_text10 <- readtext(paste0("Data/Files(100) Set 10.rtf"),text_field = "texts")
#my_text11 <- readtext(paste0("Data/Files(100) Set 11.rtf"),text_field = "texts")
#my_text12 <- readtext(paste0("Data/Files(91) Set 12.rtf"),text_field = "texts")
#files<-c(my_text1,my_text2,my_text3,my_text4,my_text5,my_text6,my_text7,my_text8,my_text9,my_text10,my_text11,my_text12)
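
The twelve readtext() calls can be written more compactly. The sketch below builds the same file names with sprintf() and reads whichever of them exist. The names n_articles, paths, existing and texts are assumptions, and it relies on the files sitting in a Data/ folder as in the commented code. Combining readtext results with c(), as in the commented code, presumably produces a list that alternates doc_id and text elements, which would explain why the later loops index files at even positions (2, 4, ..., 24). Keeping one text per list element avoids that.

```r
# Article counts per file, taken from the file names above: eleven files of 100, one of 91
n_articles <- c(rep(100, 11), 91)
paths <- sprintf("Data/Files(%d) Set %d.rtf", n_articles, seq_along(n_articles))

# Read only the files that are actually present, so the sketch also runs without the data
existing <- paths[file.exists(paths)]
texts <- lapply(existing, function(p) {
  as.character(readtext::readtext(p)$text)
})

length(paths)
```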

Putting the information into a dataframe

Using a nested for loop, I collected the index positions of the different components of each newspaper article and put the extracted text into a dataframe. The last file has fewer articles (91 instead of 100), so I used a separate for loop to read it into its own dataframe.

Some issues I faced here:

  1. The word Body, as well as the location and date, often appears before the beginning of the article. However, there was no consistent separator that could be used to split these off.
  2. For articles 767-800, the newspaper and date were not read in. Because of this, I could not perform any operations on that column to remove unnecessary text.

#df1 <- data.frame(matrix(ncol = 3, nrow = 1100))%>% rename(
 #   Newspaper_Date=X1 ,
  #  Body=X2,
   # Tags=X3
   # )
#k=0
#for (i in seq(2,22,2)){
  #testing <- as.character(files[i])
  #indices_name<-str_locate_all(testing, "\n\n\n\n\n\n")%>%unlist()
  #indices_body<-str_locate_all(testing,"\nBody")%>%unlist()
  #indices_classification<-str_locate_all(testing,"\nSubject")%>%unlist()
  #indices_end<-str_locate_all(testing,"\nEnd of Document")%>%unlist()
  
 # print(k)
#  for(j in 1:100){
 #   df1[k+j,1]<-str_sub(testing,indices_name[j],indices_body[j])
  #  df1[k+j,2]<-str_sub(testing,indices_body[j],indices_classification[j])
  #  df1[k+j,3]<-str_sub(testing,indices_classification[j],indices_end[j])
  #}
  #k=k+100
  #}

#df2 <- data.frame(matrix(ncol = 3, nrow = 91))%>% rename(
 #   Newspaper_Date=X1 ,
  #  Body=X2,
   # Tags=X3
    #)
#testing <- as.character(files[24])
#indices_name<-str_locate_all(testing, "\n\n\n\n\n\n")%>%unlist()
#indices_body<-str_locate_all(testing,"\nBody")%>%unlist()
#indices_classification<-str_locate_all(testing,"\nSubject")%>%unlist()
#indices_end<-str_locate_all(testing,"\nEnd of Document")%>%unlist()


#for(l in 1:91){
 # df2[l,1]<-str_sub(testing,indices_name[l],indices_body[l])
  #df2[l,2]<-str_sub(testing,indices_body[l],indices_classification[l])
  #df2[l,3]<-str_sub(testing,indices_classification[l],indices_end[l])
#}

#df<-rbind(df1,df2)
df_All<-readRDS(file = "D:/Text as Data/All12Files.rds")
df<-df_All
#This gives a warning because the newspaper and date are missing for some rows (767-800)
df<-df%>%separate(Newspaper_Date, into=c("Newspaper_Date", "Delete"), sep="Copyright")%>%select(-Delete)
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 34 rows [767,
768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783,
784, 785, 786, ...].
#this does not work
#df<-df%>%separate(Newspaper_Date, into=c("Newspaper", "Date"), sep="{\n}")
#df<-df_test%>%separate(Newspaper_Date, into=c("Newspaper", "Date"), sep="[August|July]")
#df<-df%>%separate(Body,into=c("Delete","Body"),sep="--|Body")
#write.csv(df, "D:/Text as Data/Text_as_Data_Fall_2022/posts/output.csv", row.names=FALSE, quote=FALSE)
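
One possible way to split the Newspaper_Date column without relying on a single separator is to capture its two non-empty lines with a regular expression. This is only a sketch on made-up rows shaped like the Newspaper_Date values in this data. The data frame toy and the pattern are assumptions. Rows where the newspaper and date are missing come back as NA rather than raising an error. The last step removes only the leading Body marker and does not attempt to strip the location and date, since those have no consistent separator.

```r
library(stringr)

# Made-up rows: one complete header and one missing header
toy <- data.frame(
  Newspaper_Date = c("\n\n\n\n\n\nHindustan Times\nAugust 9, 2021 Monday\n\n\n", NA),
  Body = c("\nBody\nMumbai, Aug. 9 -- From a solitary two-day fixture", "\nBody\nText only")
)

# Group 1: first non-empty line (newspaper); group 2: the line after it (date)
parts <- str_match(toy$Newspaper_Date, "^\\s*([^\\n]+)\\n([^\\n]+)\\s*$")
toy$Newspaper <- parts[, 2]
toy$Date <- parts[, 3]

# Drop the "Body" marker and surrounding whitespace at the start of each article
toy$Body <- str_remove(toy$Body, "^\\s*Body\\s*")

toy[, c("Newspaper", "Date", "Body")]
```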

Creating the corpus and looking at the number of tokens in each document

newspaper_corpus <- corpus(df,text_field = "Body")
newspaper_corpus_summary <- summary(newspaper_corpus)
head(newspaper_corpus_summary)
   Text Types Tokens Sentences
1 text1   293    586        24
2 text2   138    264         6
3 text3   365    911        30
4 text4   187    349         9
5 text5   229    424        18
6 text6   231    467        25
                                              Newspaper_Date
1 \n\n\n\n\n\n\nHindustan Times\nAugust 9, 2021 Monday\n\n\n
2   \n\n\n\n\n\nHindustan Times\nAugust 2, 2021 Monday\n\n\n
3         \n\n\n\n\n\nThe Hindu\nAugust 9, 2021 Monday\n\n\n
4 \n\n\n\n\n\nHindustan Times\nJuly 28, 2021 Wednesday\n\n\n
5   \n\n\n\n\n\nHindustan Times\nAugust 8, 2021 Sunday\n\n\n
6               \n\n\n\n\n\nDNA\nJuly 27, 2021 Tuesday\n\n\n
  Tags
1 \nSubject: OLYMPICS (92%); CRICKET (90%); OLYMPIC COMMITTEES (90%); SPORTS GOVERNING BODIES (90%); SPORTS & RECREATION EVENTS (89%); AGREEMENTS (78%); WOMEN'S SPORTS (78%); ASSOCIATIONS & ORGANIZATIONS (77%); TALKS & MEETINGS (77%)\n\nOrganization: INTERNATIONAL OLYMPIC COMMITTEE (57%)\n\nIndustry: BUDGETS (66%); TELEVISION INDUSTRY (50%)\n\nGeographic: MUMBAI, MAHARASHTRA, INDIA (92%); LOS ANGELES, CA, USA (79%); BIRMINGHAM, ENGLAND (58%); CALIFORNIA, USA (58%); INDIA (94%); UNITED KINGDOM (79%); FRANCE (58%)\n\nLoad-Date: August 8, 2021\n\n\n
2 \nSubject: 2020 TOKYO SUMMER OLYMPICS (90%); OLYMPICS (90%); SHOOTING SPORTS (90%); SUMMER OLYMPICS (90%); WEAPONS & ARMS (89%); EQUESTRIAN SPORTS (78%); WOMEN'S SPORTS (78%)\n\nIndustry: MEDIA CONTENT (78%)\n\nGeographic: INDIA (91%)\n\nLoad-Date: August 1, 2021\n\n\n
3 \nSubject: 2020 TOKYO SUMMER OLYMPICS (90%); OLYMPICS (90%); SUMMER OLYMPICS (90%); MEN'S SPORTS (78%); SPORTS AWARDS (78%)\n\nGeographic: NEW DELHI, INDIA (59%); INDIA (92%)\n\nLoad-Date: August 9, 2021\n\n\n
4 \nSubject: 2020 TOKYO SUMMER OLYMPICS (91%); OLYMPICS (91%); ATHLETES (90%); FIELD HOCKEY (90%); SUMMER OLYMPICS (90%); WOMEN'S SPORTS (90%); BOAT RACING (89%); ROWING (89%); ARCHERY (78%); BADMINTON (78%); BOXING (78%); MEN'S SPORTS (78%); SPORTS & RECREATION EVENTS (78%); SPORTS AWARDS (78%); TOURNAMENTS (76%); BOATING & RAFTING (75%)\n\nIndustry: MEDIA CONTENT (78%)\n\nGeographic: NEW DELHI, INDIA (74%); INDIA (94%); UNITED KINGDOM (73%)\n\nLoad-Date: July 27, 2021\n\n\n
5 \nSubject: ARMIES (94%); OLYMPICS (92%); 2020 TOKYO SUMMER OLYMPICS (90%); ARMED FORCES (90%); HEADS OF STATE & GOVERNMENT (90%); PHOTO & VIDEO SHARING (90%); PRIME MINISTERS (90%); SPORTS & RECREATION (90%); SPORTS AWARDS (90%); SUMMER OLYMPICS (90%); VIRAL VIDEOS (90%); HUMAN RESOURCES & PERSONNEL MANAGEMENT (78%); TRACK & FIELD (78%); EDUCATION & TRAINING (77%); MILITARY SERVICE (77%); SOCIAL MEDIA (77%); EDUCATION SYSTEMS & INSTITUTIONS (74%); GOVERNMENT ADVISORS & MINISTERS (73%); STUDENTS & STUDENT LIFE (73%); INTERNET SOCIAL NETWORKING (72%); WEAPONS & ARMS (72%)\n\nIndustry: ARMIES (94%); ARMED FORCES (90%); PHOTO & VIDEO SHARING (90%); VIRAL VIDEOS (90%); MEDIA CONTENT (78%); MILITARY SERVICE (77%); SOCIAL MEDIA (77%); EDUCATION SYSTEMS & INSTITUTIONS (74%); INTERNET SOCIAL NETWORKING (72%)\n\nPerson: NARENDRA MODI (79%)\n\nGeographic: GUJARAT, INDIA (79%); INDIA (94%)\n\nLoad-Date: August 8, 2021\n\n\n
6 \nSubject: DRUGS IN SPORTS (94%); OLYMPICS (92%); SPORTS & RECREATION (92%); SPORTS & RECREATION EVENTS (91%); 2020 TOKYO SUMMER OLYMPICS (90%); SPORTS AWARDS (90%); SUMMER OLYMPICS (90%); OLYMPIC COMMITTEES (89%); WINTER OLYMPICS (89%); 2016 RIO SUMMER OLYMPICS (78%); ACADEMY AWARDS (78%); ENTERTAINMENT & ARTS AWARDS (78%); GENDER EQUALITY (78%); SPORTS GOVERNING BODIES (78%); WOMEN (78%); WOMEN'S SPORTS (78%); CONTROLLED SUBSTANCES CRIME (76%); NEGATIVE MISC NEWS (76%); NEGATIVE NEWS (76%); INVESTIGATIONS (73%); WINTER SPORTS (73%); SCANDALS (71%); AUTO RACING (66%); DISCRIMINATION (66%); FIFA WORLD CUP (66%); DOCUMENTARY FILMS (65%); TALIBAN (60%); ALTERNATIVE DISPUTE RESOLUTION (50%)\n\nCompany:  NETFLIX INC (54%);  AL MUDON INTERNATIONAL REAL ESTATE CO KSCC (51%)\n\nTicker: NFLX (NASDAQ) (54%); ALMUDON (KUW) (51%)\n\nIndustry: NAICS532282 VIDEO TAPE & DISC RENTAL (54%); SIC7841 VIDEO TAPE RENTAL (54%); NAICS531110 LESSORS OF RESIDENTIAL BUILDINGS & DWELLINGS (51%); SIC6513 OPERATORS OF APARTMENT BUILDINGS (51%); ACADEMY AWARDS (78%); ENTERTAINMENT & ARTS AWARDS (78%); DOCUMENTARY FILMS (65%); MOTOR VEHICLES (61%)\n\nGeographic: RUSSIAN FEDERATION (93%); AFGHANISTAN (52%)\n\nLoad-Date: July 27, 2021\n\n\n
newspaper_corpus_summary$Tokens
  [1] 586 264 911 349 424 467 392 605 300 591 343 295 315 298 298 344 370 336
 [19] 690 420 436 397 475 225 722 688 333 372 221 344 484 591 640 348 357 523
 [37] 276 567 359 415 518 533 414 391 345 354 601 554 216 288 277 466 232 395
 [55] 832 415 399 361 890 299 360 440 823 289 526 674 633 633 420 232 267 688
 [73] 320 248 563 357 781 633 562 624 240 251 426 597 869 591 576 633 364 427
 [91] 562 271 411 588 566 380 342 241 242 251
ndoc(newspaper_corpus)
[1] 1191
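
Note that the token counts printed above cover only the first 100 of the 1191 documents, because summary() on a quanteda corpus reports 100 documents by default. Below is a small sketch of getting counts for every document, using a made-up three-article data frame (toy_df and the other names are assumptions).

```r
library(quanteda)

# Made-up stand-in for the article data frame
toy_df <- data.frame(
  Body = c("India won a medal.", "The hockey team played well. They won.", "A short text."),
  Tags = c("a", "b", "c")
)
toy_corpus <- corpus(toy_df, text_field = "Body")

# Tokenise first, then count tokens per document; this covers every document
toy_counts <- ntoken(tokens(toy_corpus))
toy_counts

# summary() can also be asked for all documents through its n argument
toy_summary <- summary(toy_corpus, n = ndoc(toy_corpus))
nrow(toy_summary)
```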

Checking for metadata

The corpus does have metadata, but printing it produced too many pages of output, so I commented out the line.

#docvars(newspaper_corpus)
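
Instead of printing every row, the metadata can be inspected a few rows at a time. This is a minimal sketch on a made-up corpus (toy_corpus and its contents are assumptions).

```r
library(quanteda)

toy_corpus <- corpus(
  data.frame(
    Body = c("First text.", "Second text.", "Third text."),
    Newspaper_Date = c("Paper A", "Paper B", "Paper C"),
    Tags = c("x", "y", "z")
  ),
  text_field = "Body"
)

# Names of the document variables, without printing their contents
names(docvars(toy_corpus))

# Only the first two rows of metadata
head(docvars(toy_corpus), 2)
```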

Creating tokens

Calling as.character() on the tokens prints too many pages of output, so those lines are commented out.

newspaper_tokens <- tokens(newspaper_corpus)
head(newspaper_tokens)
Tokens consisting of 6 documents and 2 docvars.
text1 :
 [1] "Body"     "Mumbai"   ","        "Aug"      "."        "9"       
 [7] "-"        "From"     "a"        "solitary" "two-day"  "fixture" 
[ ... and 574 more ]

text2 :
 [1] "Body"     "India"    ","        "Aug"      "."        "2"       
 [7] "-"        "Tokyo"    "Olympics" "Day"      "10"       "Full"    
[ ... and 252 more ]

text3 :
 [1] "Body"           "After"          "the"            "best-ever"     
 [5] "performance"    "at"             "the"            "just-concluded"
 [9] "Tokyo"          "Olympics"       ","              "India"         
[ ... and 899 more ]

text4 :
 [1] "Body"     "New"      "Delhi"    ","        "July"     "28"      
 [7] "-"        "The"      "Tokyo"    "Olympics" "2020"     "enters"  
[ ... and 337 more ]

text5 :
 [1] "Body"   "India"  ","      "Aug"    "."      "8"      "-"      "After" 
 [9] "Neeraj" "Chopra" "became" "the"   
[ ... and 412 more ]

text6 :
 [1] "Body"      "If"        "you're"    "wondering" "who"       "are"      
 [7] "these"     "ROC"       "athletes"  "winning"   "so"        "many"     
[ ... and 455 more ]
#as.character(newspaper_tokens[1])
#sprintf(as.character(newspaper_tokens[1]))
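
The amount of output can also be limited when printing the tokens object itself, rather than converting it with as.character(). This sketch assumes the max_ndoc and max_ntoken arguments of quanteda's print method for tokens, applied to made-up text.

```r
library(quanteda)

toy_tokens <- tokens(c(
  text1 = "Body Mumbai, Aug. 9 - From a solitary two-day fixture",
  text2 = "Body India, Aug. 2 - Tokyo Olympics Day 10 Full Schedule",
  text3 = "Body After the best-ever performance"
))

# Show at most 2 documents and 5 tokens per document
print(toy_tokens, max_ndoc = 2, max_ntoken = 5)

# Token vector of a single document, if the full text of one article is needed
first_doc <- as.list(toy_tokens)[["text1"]]
head(first_doc, 3)
```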

Removing punctuation

newspaper_tokens <- tokens(newspaper_tokens,
                           remove_punct = TRUE)
head(newspaper_tokens)
Tokens consisting of 6 documents and 2 docvars.
text1 :
 [1] "Body"     "Mumbai"   "Aug"      "9"        "From"     "a"       
 [7] "solitary" "two-day"  "fixture"  "between"  "Great"    "Britain" 
[ ... and 501 more ]

text2 :
 [1] "Body"       "India"      "Aug"        "2"          "Tokyo"     
 [6] "Olympics"   "Day"        "10"         "Full"       "Schedule"  
[11] "Kamalpreet" "Kaur"      
[ ... and 208 more ]

text3 :
 [1] "Body"           "After"          "the"            "best-ever"     
 [5] "performance"    "at"             "the"            "just-concluded"
 [9] "Tokyo"          "Olympics"       "India"          "shall"         
[ ... and 783 more ]

text4 :
 [1] "Body"     "New"      "Delhi"    "July"     "28"       "The"     
 [7] "Tokyo"    "Olympics" "2020"     "enters"   "the"      "fifth"   
[ ... and 293 more ]

text5 :
 [1] "Body"   "India"  "Aug"    "8"      "After"  "Neeraj" "Chopra" "became"
 [9] "the"    "second" "Indian" "ever"  
[ ... and 372 more ]

text6 :
 [1] "Body"      "If"        "you're"    "wondering" "who"       "are"      
 [7] "these"     "ROC"       "athletes"  "winning"   "so"        "many"     
[ ... and 416 more ]

Removing stopwords

withoutstopwords_news <- tokens_select(newspaper_tokens,
                    pattern = stopwords("en"),
                    selection = "remove")
head(withoutstopwords_news)
Tokens consisting of 6 documents and 2 docvars.
text1 :
 [1] "Body"     "Mumbai"   "Aug"      "9"        "solitary" "two-day" 
 [7] "fixture"  "Great"    "Britain"  "France"   "1900"     "Olympics"
[ ... and 292 more ]

text2 :
 [1] "Body"       "India"      "Aug"        "2"          "Tokyo"     
 [6] "Olympics"   "Day"        "10"         "Full"       "Schedule"  
[11] "Kamalpreet" "Kaur"      
[ ... and 161 more ]

text3 :
 [1] "Body"           "best-ever"      "performance"    "just-concluded"
 [5] "Tokyo"          "Olympics"       "India"          "shall"         
 [9] "look"           "breaking"       "top"            "10"            
[ ... and 456 more ]

text4 :
 [1] "Body"     "New"      "Delhi"    "July"     "28"       "Tokyo"   
 [7] "Olympics" "2020"     "enters"   "fifth"    "day"      "begin"   
[ ... and 212 more ]

text5 :
 [1] "Body"       "India"      "Aug"        "8"          "Neeraj"    
 [6] "Chopra"     "became"     "second"     "Indian"     "ever"      
[11] "win"        "individual"
[ ... and 230 more ]

text6 :
 [1] "Body"       "wondering"  "ROC"        "athletes"   "winning"   
 [6] "many"       "medals"     "Tokyo"      "Olympics"   "everything"
[11] "need"       "know"      
[ ... and 230 more ]
#summary(withoutstopwords_news)
#as.character(withoutstopwords_news)
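
For reference, the same two cleaning steps can be sketched on a single made-up sentence. tokens_remove() is quanteda's shorthand for tokens_select() with selection = "remove". The example text and object names are assumptions.

```r
library(quanteda)

toy_tokens <- tokens("After the best-ever performance, India shall look ahead.",
                     remove_punct = TRUE)

# Remove English stopwords (matching is case-insensitive by default)
via_select <- tokens_select(toy_tokens, pattern = stopwords("en"), selection = "remove")
via_remove <- tokens_remove(toy_tokens, pattern = stopwords("en"))

# Both calls should leave the same tokens
identical(as.list(via_select), as.list(via_remove))
as.list(via_remove)[[1]]
```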

Keyword in context

Since I want to look at women’s and men’s sports, I tried looking at the keyword-in-context for the words women and men.

kwic_women<- kwic(withoutstopwords_news,
                        pattern = c("women"))
kwic_men<-kwic(withoutstopwords_news,
                        pattern = c("men"))
head(kwic_women)
Keyword-in-context with 6 matches.                                                                 
  [text1, 20]          inclusion 8-team medal sport men | women |
 [text1, 228] Olympics learnt require participation men | women |
 [text3, 274]              valiantly win medal 41 years | women |
 [text3, 304]     happy take note spectacular emergence | women |
 [text3, 318]              2016 Rio games medal winners | women |
 [text3, 324]           Tokyo three seven medal winners | women |
                                           
 Los Angeles 2028 brightened International 
 T20 emerged format choice despite         
 bravehearts made Olympics debut 1980      
 athletes international sports arena coming
 Tokyo three seven medal winners           
 golden record men's hockey team           
head(kwic_men)
Keyword-in-context with 6 matches.                                                                   
  [text1, 19]        cricket's inclusion 8-team medal sport | men |
 [text1, 227] cricket Olympics learnt require participation | men |
 [text7, 164]            IST SAILING Vishnu Saravanan Laser | Men |
 [text7, 175]         IST Ganapathy Kelapanda Varun Thakkar | Men |
 [text19, 21]      Olympics 2020 performance Manpreet Singh | men |
 [text31, 21]     Recently 78-year-old actor shared analogy | men |
                                        
 women Los Angeles 2028 brightened      
 women T20 emerged format choice        
 Race 7 8 8 35                          
 Race 5 6 8 35                          
 taken social media storm Several       
 women's Indian hockey performance Tokyo
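
In the output above, possessive forms such as women's appear as separate tokens, so an exact match on women or men does not pick them up. Below is a sketch of matching both forms on made-up sentences. The text, the object names and the choice of window = 3 are assumptions.

```r
library(quanteda)

toy_tokens <- tokens(
  c(text1 = "The men's hockey team won bronze. The women played well. Women's boxing shone."),
  remove_punct = TRUE
)

# Listing the possessive explicitly avoids a glob such as "men*",
# which would also match unrelated words like "mental"
kwic_women_toy <- kwic(toy_tokens, pattern = c("women", "women's"), window = 3)
kwic_men_toy <- kwic(toy_tokens, pattern = c("men", "men's"), window = 3)

nrow(kwic_women_toy)
nrow(kwic_men_toy)
```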

Creating a tokens object with bigrams and trigrams

news_ngrams <- tokens_ngrams(withoutstopwords_news, n=2:3)
head(news_ngrams)
Tokens consisting of 6 documents and 2 docvars.
text1 :
 [1] "Body_Mumbai"        "Mumbai_Aug"         "Aug_9"             
 [4] "9_solitary"         "solitary_two-day"   "two-day_fixture"   
 [7] "fixture_Great"      "Great_Britain"      "Britain_France"    
[10] "France_1900"        "1900_Olympics"      "Olympics_prospects"
[ ... and 593 more ]

text2 :
 [1] "Body_India"          "India_Aug"           "Aug_2"              
 [4] "2_Tokyo"             "Tokyo_Olympics"      "Olympics_Day"       
 [7] "Day_10"              "10_Full"             "Full_Schedule"      
[10] "Schedule_Kamalpreet" "Kamalpreet_Kaur"     "Kaur_stunned"       
[ ... and 331 more ]

text3 :
 [1] "Body_best-ever"             "best-ever_performance"     
 [3] "performance_just-concluded" "just-concluded_Tokyo"      
 [5] "Tokyo_Olympics"             "Olympics_India"            
 [7] "India_shall"                "shall_look"                
 [9] "look_breaking"              "breaking_top"              
[11] "top_10"                     "10_earliest"               
[ ... and 921 more ]

text4 :
 [1] "Body_New"       "New_Delhi"      "Delhi_July"     "July_28"       
 [5] "28_Tokyo"       "Tokyo_Olympics" "Olympics_2020"  "2020_enters"   
 [9] "enters_fifth"   "fifth_day"      "day_begin"      "begin_Indian"  
[ ... and 433 more ]

text5 :
 [1] "Body_India"      "India_Aug"       "Aug_8"           "8_Neeraj"       
 [5] "Neeraj_Chopra"   "Chopra_became"   "became_second"   "second_Indian"  
 [9] "Indian_ever"     "ever_win"        "win_individual"  "individual_gold"
[ ... and 469 more ]

text6 :
 [1] "Body_wondering"      "wondering_ROC"       "ROC_athletes"       
 [4] "athletes_winning"    "winning_many"        "many_medals"        
 [7] "medals_Tokyo"        "Tokyo_Olympics"      "Olympics_everything"
[10] "everything_need"     "need_know"           "know_ROC"           
[ ... and 469 more ]
tail(news_ngrams)
Tokens consisting of 6 documents and 2 docvars.
text1186 :
 [1] "Body_Diamond"           "Diamond_baron"          "baron_Savji"           
 [4] "Savji_Dholakia"         "Dholakia_made"          "made_yet"              
 [7] "yet_another"            "another_announcement"   "announcement_declaring"
[10] "declaring_award"        "award_Rs"               "Rs_2.5"                
[ ... and 183 more ]

text1187 :
 [1] "Body_THE.WAIT.HAS.ENDED"  "THE.WAIT.HAS.ENDED_men's"
 [3] "men's_hockey"             "hockey_team"             
 [5] "team_every"               "every_Indian"            
 [7] "Indian_dreaming"          "dreaming_winning"        
 [9] "winning_Olympic"          "Olympic_medal"           
[11] "medal_Indian"             "Indian_side"             
[ ... and 147 more ]

text1188 :
 [1] "Body_New"             "New_Delhi"            "Delhi_Indian"        
 [4] "Indian_sports"        "sports_fraternity"    "fraternity_including"
 [7] "including_cricketing" "cricketing_greats"    "greats_Sachin"       
[10] "Sachin_Tendulkar"     "Tendulkar_country's"  "country's_white-ball"
[ ... and 63 more ]

text1189 :
 [1] "Body_India's"       "India's_Equestrian" "Equestrian_Fouaad" 
 [4] "Fouaad_Mirza"       "Mirza_Seigneur"     "Seigneur_Medicott" 
 [7] "Medicott_qualified" "qualified_Jumping"  "Jumping_Individual"
[10] "Individual_Finals"  "Finals_got"         "got_8"             
[ ... and 191 more ]

text1190 :
 [1] "Body_Saturday"    "Saturday_Saikhom" "Saikhom_Mirabai"  "Mirabai_Chanu"   
 [5] "Chanu_stood"      "stood_podium"     "podium_Tokyo"     "Tokyo_silver"    
 [9] "silver_medal"     "medal_around"     "around_neck"      "neck_two"        
[ ... and 215 more ]

text1191 :
 [1] "Body_Mirabai"            "Mirabai_Chanu"          
 [3] "Chanu_said"              "said_failed"            
 [5] "failed_campaign"         "campaign_Rio"           
 [7] "Rio_taught"              "taught_overcome"        
 [9] "overcome_disappointment" "disappointment_make"    
[11] "make_fresh"              "fresh_start"            
[ ... and 213 more ]
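
To make the n = 2:3 argument concrete: it produces all bigrams and all trigrams of each document in one tokens object, joined with underscores. Here is a sketch on a made-up four-token document (the names are assumptions), followed by one way to list the most frequent n-grams.

```r
library(quanteda)

toy_tokens <- tokens(c(text1 = "Tokyo Olympics medal winners"))
toy_ngrams <- tokens_ngrams(toy_tokens, n = 2:3)

# Four tokens give three bigrams and two trigrams
toy_ngram_vec <- as.list(toy_ngrams)[["text1"]]
toy_ngram_vec

# Most frequent n-grams across documents (dfm() lowercases features by default)
topfeatures(dfm(toy_ngrams), n = 5)
```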

Word Cloud

I created a word cloud to check whether any words other than the stopwords need to be removed. The words classification, publication-type and body should ideally be removed.

# create the dfm
news_dfm <- dfm(withoutstopwords_news)

# find out a quick summary of the dfm
news_dfm
Document-feature matrix of: 1,191 documents, 22,788 features (99.15% sparse) and 2 docvars.
       features
docs    body mumbai aug 9 solitary two-day fixture great britain france
  text1    1      1   1 1        1       1       1     1       1      1
  text2    1      0   1 0        0       0       0     0       0      0
  text3    1      0   0 0        0       0       0     0       0      0
  text4    1      0   0 0        0       0       0     2       2      0
  text5    1      0   1 0        0       0       0     0       0      0
  text6    1      0   0 0        0       0       0     0       0      0
[ reached max_ndoc ... 1,185 more documents, reached max_nfeat ... 22,778 more features ]
textplot_wordcloud(news_dfm, min_count = 50, random_order = FALSE)
Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
#tokyo2020 could not be fit on page. It will not be plotted.
Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
parents could not be fit on page. It will not be plotted.
Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
becoming could not be fit on page. It will not be plotted.
Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
corners could not be fit on page. It will not be plotted.
Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
almost could not be fit on page. It will not be plotted.
Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, : story
could not be fit on page. It will not be plotted.
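
The leftover marker words can be dropped before building the dfm. This is a sketch on made-up tokens, assuming the words to remove are body, classification and publication-type (extra_words and the other names are hypothetical). The plotting call is left commented out. Its max_words argument is one way to reduce the "could not be fit on page" warnings, with 100 chosen arbitrarily.

```r
library(quanteda)

toy_tokens <- tokens(c(
  text1 = "Body Tokyo Olympics medal Classification Publication-Type Newspaper",
  text2 = "Body hockey team medal"
))

# Hypothetical list of marker words; matching is case-insensitive by default
extra_words <- c("body", "classification", "publication-type")
toy_clean <- tokens_remove(toy_tokens, pattern = extra_words)

toy_dfm <- dfm(toy_clean)
featnames(toy_dfm)

# library(quanteda.textplots)
# textplot_wordcloud(toy_dfm, min_count = 1, max_words = 100, random_order = FALSE)
```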