knitr::opts_chunk$set(echo = TRUE)
Web Scraping
Loading Packages
library(rvest)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.5
✔ tibble 3.1.8 ✔ dplyr 1.0.10
✔ tidyr 1.2.1 ✔ stringr 1.4.1
✔ readr 2.1.3 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag() masks stats::lag()
library(polite)
library(stringr)
library(quanteda)
Package version: 3.2.3
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 8 of 8 threads used.
See https://quanteda.io for tutorials and examples.
Data Sources
This study covers six teams: two from the top of the table, two from the middle, and two from the bottom. They are presented in that order, from top to bottom. The data had to be scraped from the match report page on each team's official website. Each report includes information about the match, statistics, and quotes from both players and managers.
Web Scraping/Tidying data
This is the start of the web scraping process. I was unable to find a way to make the scraper locate a link on one page and then follow it to the next page to scrape its contents, so for the time being each report is scraped manually from its own URL. Tidying is the real challenge, because the scraped text contains a lot of unwanted material, for example many \n newline characters.
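One possible way to automate the "find the link, then follow it" step is sketched below. It is only a sketch under assumptions: the index page markup, the `a.report` selector, the `https://www.example.com` base URL, and the helper names `extract_report_links()` and `scrape_report()` are all made up for illustration, and each club's real fixtures page would need its own selector.

```r
library(rvest)

# Collect absolute URLs for every link on an index page that matches `link_css`.
extract_report_links <- function(index_page, link_css, base_url) {
  index_page %>%
    html_elements(link_css) %>%
    html_attr("href") %>%
    xml2::url_absolute(base = base_url)
}

# Read one report page and return the text of its article body.
scrape_report <- function(url, body_css) {
  read_html(url) %>%
    html_element(body_css) %>%
    html_text2()
}

# Toy index page standing in for a real fixtures list (hypothetical markup).
index_page <- minimal_html('
  <a class="report" href="/match-1-report">Match 1</a>
  <a class="report" href="/match-2-report">Match 2</a>
  <a class="other" href="/tickets">Tickets</a>
')

report_urls <- extract_report_links(index_page, "a.report", "https://www.example.com")

# With real URLs, every report could then be scraped in one call (needs network access):
# reports <- purrr::map_chr(report_urls, scrape_report, body_css = ".article-body")
```

Keeping link discovery and page scraping in separate functions means only the two CSS selectors change from club to club.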
Arsenal
This data was scraped from the official Arsenal site. It covers every match played so far this season and will grow as the season progresses. A lot had to be removed from the scraped text, including \n characters, random number strings, and long sentences about buying Arsenal pictures. After cleaning the text with stringr, each report was unlisted and collected into a character vector, which was then turned into a corpus (see the end of this code). Some tidying remains: a few hyphens still need to be removed, and spaces need to be restored in certain parts of each document. I added the team name and match number to the summary table, after which we can start looking at what the data means. Seven matches have been played so far, and the word count stays fairly consistent until match 5, which drops to 790 tokens.
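The cleaning chain used for each report replaces every newline with "####" and then deletes the "#" characters, so words on either side of a line break end up joined. That may be part of the spacing problem mentioned above. A minimal sketch of an alternative is shown here; `clean_report()` is a hypothetical helper, not part of the original analysis, and the example sentence is invented.

```r
library(stringr)

# Hypothetical helper: turn whitespace into single spaces and drop brackets and quotes.
clean_report <- function(text) {
  text %>%
    str_replace_all("\\s+", " ") %>%   # newlines become one space instead of nothing
    str_remove_all("[()\"]") %>%       # drop parentheses and double quotes
    str_squish()                       # trim leading, trailing, and repeated spaces
}

cleaned <- clean_report("Saka scored.\nArteta (the manager) was pleased.")
```

Because the helper takes any character vector, the same call could be reused for every team instead of repeating the chain for each match.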
# Arsenal Match week 1 against Crystal Palace
Arsenal_URL_1 <- "https://www.arsenal.com/fixture/arsenal/2022-Aug-05/crystal-palace-0-2-arsenal-match-report"
Arsenal_URL_1 <- read_html(Arsenal_URL_1)
week_1_select <- (".article-body")
Arsenal_week_one <- Arsenal_URL_1 %>%
html_node(css = week_1_select) %>%
html_text2()
Arsenal_week_one <- str_replace_all(Arsenal_week_one, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_replace_all("[1234567890] of 42To buy official Arsenal pictures visit Arsenal Pics", "#") %>%
str_remove("Play videoWatch Arsenal video online05:24Highlights | Crystal Palace 0-2 Arsenal - bitesize") %>%
str_remove("111111111122222222223333333333444") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("#") %>%
unlist()
# Arsenal Match week 2 against Leicester
Arsenal_URL_2 <- "https://www.arsenal.com/fixture/arsenal/2022-Aug-13/arsenal-4-2-leicester-city-match-report"
Arsenal_URL_2 <- read_html(Arsenal_URL_2)
week_2_select <- (".article-body")
Arsenal_week_two <- Arsenal_URL_2 %>%
html_node(css = week_2_select) %>%
html_text2()
Arsenal_week_two <- str_replace_all(Arsenal_week_two, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_replace_all("[1234567890] of 36To buy official Arsenal pictures visit Arsenal Pics", "#") %>%
str_remove("Play videoWatch Arsenal video online04:53Highlights | Arsenal 4-2 Leicester City - bitesize") %>%
str_remove("111111111122222222223333333") %>%
str_remove("Play videoWatch Arsenal video online02:31") %>%
str_remove("Play videoWatch Arsenal video online02:27") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("#") %>%
unlist()
# Arsenal Match week 3 against Bournemouth
Arsenal_URL_3 <- "https://www.arsenal.com/premier-league-match-report-bournemouth-odegaard-saliba-jesus"
Arsenal_URL_3 <- read_html(Arsenal_URL_3)
week_3_select <- (".article-body")
Arsenal_week_three <- Arsenal_URL_3 %>%
html_node(css = week_3_select) %>%
html_text2()
Arsenal_week_three <- str_replace_all(Arsenal_week_three, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_replace_all("[1234567890] of 29To buy official Arsenal pictures visit Arsenal Pics", "#") %>%
str_remove("Play videoWatch Arsenal video online05:17Highlights | Bournemouth 0-3 Arsenal - bitesize") %>%
str_remove("11111111112222222222") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("#") %>%
unlist()
# Arsenal Match week 4 against Fulham
Arsenal_URL_4 <- "https://www.arsenal.com/premier-league-match-report-fulham-odegaard-gabriel"
Arsenal_URL_4 <- read_html(Arsenal_URL_4)
week_4_select <- (".article-body")
Arsenal_week_four <- Arsenal_URL_4 %>%
html_node(css = week_4_select) %>%
html_text2()
Arsenal_week_four <- str_replace_all(Arsenal_week_four, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_replace_all("[1234567890] of 45To buy official Arsenal pictures visit Arsenal Pics", "#") %>%
str_remove("Play videoWatch Arsenal video online01:59Highlights: Arsenal 2-1 Fulhambitesize") %>%
str_remove("111111111122222222223333333333444444") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("#") %>%
unlist()
# Arsenal Match week 5 against Aston Villa
Arsenal_URL_5 <- "https://www.arsenal.com/match-report-aston-villa-premier-league-martinelli-jesus"
Arsenal_URL_5 <- read_html(Arsenal_URL_5)
week_5_select <- (".article-body")
Arsenal_week_five <- Arsenal_URL_5 %>%
html_node(css = week_5_select) %>%
html_text2()
Arsenal_week_five <- str_replace_all(Arsenal_week_five, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_replace_all("[1234567890] of 38To buy official Arsenal pictures visit Arsenal Pics", "#") %>%
str_remove("Play videoWatch Arsenal video online02:00Highlights | Brentford 0-3 Arsenal - bitesize") %>%
str_remove("11111111112222222222333333333") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("#") %>%
unlist()
# Arsenal Match week 6 against Manchester United
Arsenal_URL_6 <- "https://www.arsenal.com/fixture/arsenal/2022-Sep-04/manchester-united-3-1-arsenal-match-report"
Arsenal_URL_6 <- read_html(Arsenal_URL_6)
week_6_select <- (".article-body")
Arsenal_week_six <- Arsenal_URL_6 %>%
html_node(css = week_6_select) %>%
html_text2()
Arsenal_week_six <- str_replace_all(Arsenal_week_six, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_replace_all("[1234567890] of 38To buy official Arsenal pictures visit Arsenal Pics", "#") %>%
str_remove("Play videoWatch Arsenal video online01:59Highlights: Manchester United 3-1 Arsenalbitesize") %>%
str_remove("11111111112222222222333333333") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("#") %>%
unlist()
# Arsenal Match week 7 against Brentford
Arsenal_URL_7 <- "https://www.arsenal.com/premier-league-match-report-brentford-saliba-jesus-vieira"
Arsenal_URL_7 <- read_html(Arsenal_URL_7)
week_7_select <- (".article-body")
Arsenal_week_seven <- Arsenal_URL_7 %>%
html_node(css = week_7_select) %>%
html_text2()
# Remove the \n\n, the /n, and the "buy Arsenal pictures" captions numbered 1-32
# (tidying: the current problem is that the "buy Arsenal pictures" part is not being removed)
Arsenal_week_seven <- str_replace_all(Arsenal_week_seven, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_replace_all("[1234567890] of 32To buy official Arsenal pictures visit Arsenal Pics", "#") %>%
str_remove("Play videoWatch Arsenal video online02:00Highlights | Brentford 0-3 Arsenal - bitesize") %>%
str_remove("11111111112222222222333") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("#") %>%
unlist()
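A possible reason some boilerplate survives the chain above: stringr treats patterns as regular expressions by default, so the "|" inside the highlights caption means "or" rather than a literal bar. The chain also strips " - " before it looks for a caption that still contains " - ". The sketch below shows the difference on an invented string. The real page text may be laid out differently, so this is an illustration and not a confirmed diagnosis.

```r
library(stringr)

# Hypothetical stand-in for scraped text; the real page layout may differ.
raw <- "Saka scored.\nHighlights | Brentford 0-3 Arsenal - bitesize\nArteta was pleased."

boilerplate <- "Highlights | Brentford 0-3 Arsenal - bitesize"

# As a regex, "|" splits the pattern into two alternatives, so only
# the first match ("Highlights ") is removed and the rest of the caption stays.
as_regex <- str_remove(raw, boilerplate)

# fixed() matches the caption literally, and it runs before any " - " is stripped.
as_fixed <- raw %>%
  str_remove(fixed(boilerplate)) %>%
  str_replace_all("\\s+", " ") %>%
  str_squish()
```

The same idea applies to the picture captions: removing known boilerplate with `fixed()` first, and only then stripping newlines and hyphens, keeps the later steps from altering the text the pattern is looking for.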
# Once everything is completed, each cleaned report is a character vector
class(Arsenal_week_seven)
[1] "character"
Arsenal <- c(Arsenal_week_one, Arsenal_week_two, Arsenal_week_three, Arsenal_week_four, Arsenal_week_five, Arsenal_week_six, Arsenal_week_seven)
nchar(Arsenal)
[1] 8163 9066 7726 7601 4433 7322 7818
Arsenal_corpus <- corpus(Arsenal)
Arsenal_corpus_summary <- summary(Arsenal_corpus)
Arsenal_corpus_summary$Team <- "Arsenal"
Arsenal_corpus_summary
Corpus consisting of 7 documents, showing 7 documents:
  Text Types Tokens Sentences    Team
 text1   523   1412        11 Arsenal
 text2   602   1623        22 Arsenal
 text3   551   1387        15 Arsenal
 text4   479   1299         9 Arsenal
 text5   408    790        12 Arsenal
 text6   481   1258         6 Arsenal
 text7   526   1395        21 Arsenal
# create a Match number
Arsenal_corpus_summary$Match <- as.numeric(str_extract(Arsenal_corpus_summary$Text, "[0-9]+"))
Arsenal_corpus_summary
Corpus consisting of 7 documents, showing 7 documents:
  Text Types Tokens Sentences    Team Match
 text1   523   1412        11 Arsenal     1
 text2   602   1623        22 Arsenal     2
 text3   551   1387        15 Arsenal     3
 text4   479   1299         9 Arsenal     4
 text5   408    790        12 Arsenal     5
 text6   481   1258         6 Arsenal     6
 text7   526   1395        21 Arsenal     7
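The team name and match number are added above by editing the summary table after the fact. An alternative, sketched here with invented texts, is to attach them as document variables when the corpus is built, so they travel with the corpus into later steps. The object names `reports`, `example_corpus`, and `example_summary` are placeholders.

```r
library(quanteda)

# Toy texts standing in for the cleaned match reports.
reports <- c("Arsenal won the opener.", "A second win followed.", "Three from three.")

# Attach team and match number as document variables at construction time.
example_corpus <- corpus(
  reports,
  docvars = data.frame(Team = "Arsenal", Match = seq_along(reports))
)

# The summary then carries the Team and Match columns without extra steps.
example_summary <- summary(example_corpus)
```

This avoids parsing the match number back out of the "text1", "text2" document names.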
Manchester City
Manchester City followed a very similar path to Arsenal: it also required a lot of cleaning with stringr, but it had a few unique issues. For example, I had to remove the parentheses that were scattered throughout the original data. Otherwise the cleaning process was the same, and this text will also need some additional cleaning. I moved it into a corpus as well, and it has more sentences per report than Arsenal. However, this could be due to the spacing problem mentioned above, so it needs to be re-examined once that is fixed. This team's counts of words and unique words were also more consistent.
# Manchester City First match against West Ham
Mancity_URL_1 <- "https://www.mancity.com/news/mens/west-ham-v-manchester-city-premier-league-match-report-63795480"
Mancity_URL_1 <- read_html(Mancity_URL_1)
Mancity_week_1 <- (".article-body__article-text")
Mancity_week_1 <- Mancity_URL_1 %>%
html_node(css = Mancity_week_1) %>%
html_text2()
Mancity_week_1 <- str_replace_all(Mancity_week_1, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("#") %>%
unlist()
# Manchester City second match against Bournemouth
Mancity_URL_2 <- "https://www.mancity.com/news/mens/man-city-bournemouth-premier-league-match-report-63795987"
Mancity_URL_2 <- read_html(Mancity_URL_2)
Mancity_week_2 <- (".article-body__article-text")
Mancity_week_2 <- Mancity_URL_2 %>%
html_node(css = Mancity_week_2) %>%
html_text2()
Mancity_week_2 <- str_replace_all(Mancity_week_2, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("#") %>%
unlist()
# Manchester City third match against Newcastle United
Mancity_URL_3 <- "https://www.mancity.com/news/mens/newcastle-v-manchester-city-match-report-63796690"
Mancity_URL_3 <- read_html(Mancity_URL_3)
Mancity_week_3 <- (".article-body__article-text")
Mancity_week_3 <- Mancity_URL_3 %>%
html_node(css = Mancity_week_3) %>%
html_text2()
Mancity_week_3 <- str_replace_all(Mancity_week_3, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Manchester City fourth match against Crystal Palace
Mancity_URL_4 <- "https://www.mancity.com/news/mens/man-city-crystal-palace-match-report-63797204"
Mancity_URL_4 <- read_html(Mancity_URL_4)
Mancity_week_4 <- (".article-body__article-text")
Mancity_week_4 <- Mancity_URL_4 %>%
html_node(css = Mancity_week_4) %>%
html_text2()
Mancity_week_4 <- str_replace_all(Mancity_week_4, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Manchester City fifth match against Nottm Forest
Mancity_URL_5 <- "https://www.mancity.com/news/mens/manchester-city-v-nottingham-forest-match-report-31-august-63797573"
Mancity_URL_5 <- read_html(Mancity_URL_5)
Mancity_week_5 <- (".article-body__article-text")
Mancity_week_5 <- Mancity_URL_5 %>%
html_node(css = Mancity_week_5) %>%
html_text2()
Mancity_week_5 <- str_replace_all(Mancity_week_5, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Manchester City sixth match against Aston Villa
Mancity_URL_6 <- "https://www.mancity.com/news/mens/aston-villa-manchester-city-premier-league-match-report-63797816"
Mancity_URL_6 <- read_html(Mancity_URL_6)
Mancity_week_6 <- (".article-body__article-text")
Mancity_week_6 <- Mancity_URL_6 %>%
html_node(css = Mancity_week_6) %>%
html_text2()
Mancity_week_6 <- str_replace_all(Mancity_week_6, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Manchester City did not play in week 7 because of the Queen's death
# Manchester City match week 8 against Wolves
Mancity_URL_7 <- "https://www.mancity.com/news/mens/wolves-manchester-city-away-premier-league-2022-match-report-63799002"
Mancity_URL_7 <- read_html(Mancity_URL_7)
Mancity_week_7 <- (".article-body__article-text")
Mancity_week_7 <- Mancity_URL_7 %>%
html_node(css = Mancity_week_7) %>%
html_text2()
Mancity_week_7 <- str_replace_all(Mancity_week_7, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
ManCity <- c(Mancity_week_1, Mancity_week_2, Mancity_week_3, Mancity_week_4, Mancity_week_5, Mancity_week_6, Mancity_week_7)
Mancity_corpus <- corpus(ManCity)
Mancity_corpus_summary <- summary(Mancity_corpus)
# Creating a Team Name
Mancity_corpus_summary$Team <- "Manchester City"
# create a Match number
Mancity_corpus_summary$Match <- as.numeric(str_extract(Mancity_corpus_summary$Text, "[0-9]+"))
Mancity_corpus_summary
Corpus consisting of 7 documents, showing 7 documents:
  Text Types Tokens Sentences            Team Match
 text1   564   1270        21 Manchester City     1
 text2   615   1482        28 Manchester City     2
 text3   660   1446        16 Manchester City     3
 text4   549   1202        19 Manchester City     4
 text5   489   1111        31 Manchester City     5
 text6   658   1609        57 Manchester City     6
 text7   701   1701        32 Manchester City     7
Newcastle United
This is the first of the mid-table teams, and I am excited to see how they differ from the two top teams. The data was scraped from the official Newcastle website, and cleaning was straightforward because nothing unique needed to change. One noticeable difference from the top teams is the length of the match reports: Newcastle's are about half as long as those of the first two teams. This might be unique to this team, or it may be that the lower a team sits in the league, the less it writes about its performance.
# Newcastle United first match against Nottingham Forest
# robots.txt: 1 rule for 1 bot, crawl delay 5 seconds, scrapable
bow("https://www.nufc.co.uk/matches/first-team/2022-23/newcastle-united-v-nottingham-forest/")
<polite session> https://www.nufc.co.uk/matches/first-team/2022-23/newcastle-united-v-nottingham-forest/
User-agent: polite R package
robots.txt: 1 rules are defined for 1 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent
Newc_URL_1 <- "https://www.nufc.co.uk/matches/first-team/2022-23/newcastle-united-v-nottingham-forest/"
Newc_URL_1 <- read_html(Newc_URL_1)
NewC_week_1 <- (".article__body")
NewC_week_1 <- Newc_URL_1 %>%
html_node(css = NewC_week_1) %>%
html_text2()
NewC_week_1 <- str_replace_all(NewC_week_1, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
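In the block above, `bow()` checks robots.txt and reports the crawl delay, but the page itself is still fetched with `read_html()`, which does not wait between requests. A minimal sketch of fetching through the polite session instead is shown below. `polite_report()` is a hypothetical wrapper, and the `NULL` check assumes that `scrape()` returns `NULL` when the path is disallowed.

```r
library(polite)
library(rvest)

# Hypothetical wrapper: fetch one report through a polite session so the
# crawl delay declared in robots.txt is respected.
polite_report <- function(url, body_css) {
  session <- bow(url)       # introduce the scraper and read robots.txt
  page <- scrape(session)   # rate-limited request through the session
  if (is.null(page)) {
    return(NA_character_)   # assumption: NULL signals a disallowed path
  }
  page %>%
    html_element(body_css) %>%
    html_text2()
}

# Example call (requires network access):
# NewC_week_1 <- polite_report(
#   "https://www.nufc.co.uk/matches/first-team/2022-23/newcastle-united-v-nottingham-forest/",
#   ".article__body"
# )
```

With seven reports per site and a 5 second crawl delay, the extra waiting time stays well under a minute per team.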
# Newcastle match week 2 against Brighton
Newc_URL_2 <- "https://www.nufc.co.uk/matches/first-team/2022-23/brighton-and-hove-albion-v-newcastle-united/"
Newc_URL_2 <- read_html(Newc_URL_2)
NewC_week_2 <- (".article__body")
NewC_week_2 <- Newc_URL_2 %>%
html_node(css = NewC_week_2) %>%
html_text2()
NewC_week_2 <- str_replace_all(NewC_week_2, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Newcastle match week 3 against Man City
Newc_URL_3 <- "https://www.nufc.co.uk/matches/first-team/2022-23/newcastle-united-v-manchester-city/"
Newc_URL_3 <- read_html(Newc_URL_3)
NewC_week_3 <- (".article__body")
NewC_week_3 <- Newc_URL_3 %>%
html_node(css = NewC_week_3) %>%
html_text2()
NewC_week_3 <- str_replace_all(NewC_week_3, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Newcastle match week 4 against Wolves
Newc_URL_4 <- "https://www.nufc.co.uk/matches/first-team/2022-23/wolverhampton-wanderers-v-newcastle-united/"
Newc_URL_4 <- read_html(Newc_URL_4)
NewC_week_4 <- (".article__body")
NewC_week_4 <- Newc_URL_4 %>%
html_node(css = NewC_week_4) %>%
html_text2()
NewC_week_4 <- str_replace_all(NewC_week_4, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Newcastle match week 5 against Liverpool
Newc_URL_5 <- "https://www.nufc.co.uk/matches/first-team/2022-23/liverpool-v-newcastle-united/"
Newc_URL_5 <- read_html(Newc_URL_5)
NewC_week_5 <- (".article__body")
NewC_week_5 <- Newc_URL_5 %>%
html_node(css = NewC_week_5) %>%
html_text2()
NewC_week_5 <- str_replace_all(NewC_week_5, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Newcastle match week 6 against Crystal Palace
Newc_URL_6 <- "https://www.nufc.co.uk/matches/first-team/2022-23/newcastle-united-v-crystal-palace/"
Newc_URL_6 <- read_html(Newc_URL_6)
NewC_week_6 <- (".article__body")
NewC_week_6 <- Newc_URL_6 %>%
html_node(css = NewC_week_6) %>%
html_text2()
NewC_week_6 <- str_replace_all(NewC_week_6, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Newcastle match week 7 against Bournemouth
Newc_URL_7 <- "https://www.nufc.co.uk/matches/first-team/2022-23/newcastle-united-v-bournemouth/"
Newc_URL_7 <- read_html(Newc_URL_7)
NewC_week_7 <- (".article__body")
NewC_week_7 <- Newc_URL_7 %>%
html_node(css = NewC_week_7) %>%
html_text2()
NewC_week_7 <- str_replace_all(NewC_week_7, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
NewCastle <- c(NewC_week_1, NewC_week_2, NewC_week_3, NewC_week_4, NewC_week_5, NewC_week_6, NewC_week_7)
Newcastle_corpus <- corpus(NewCastle)
Newcastle_corpus_summary <- summary(Newcastle_corpus)
# Creating a team name
Newcastle_corpus_summary$Team <- "New Castle"
# create a Match number
Newcastle_corpus_summary$Match <- as.numeric(str_extract(Newcastle_corpus_summary$Text, "[0-9]+"))
Newcastle_corpus_summary
Corpus consisting of 7 documents, showing 7 documents:
  Text Types Tokens Sentences       Team Match
 text1   336    665        11 New Castle     1
 text2   336    613        12 New Castle     2
 text3   340    677        12 New Castle     3
 text4   294    539         8 New Castle     4
 text5   324    631        16 New Castle     5
 text6   360    729         4 New Castle     6
 text7   380    717         7 New Castle     7
Everton
I originally planned to use Aston Villa, but their website was hard to scrape and did not return the correct information, so I switched to Everton, which works much more smoothly. Everton is the second mid-table team, and the cleaning process was about the same as for Newcastle. Everton's reports are longer than Newcastle's, and one match stands out as significantly longer than the rest: match 2, against Aston Villa, at 1577 tokens. I am currently unsure what causes this difference.
# Everton vs Chelsea
# robots.txt: 1 rule for 1 bot, crawl delay 5 seconds, scrapable
bow("https://www.evertonfc.com/match/74913/everton-chelsea#report")
<polite session> https://www.evertonfc.com/match/74913/everton-chelsea#report
User-agent: polite R package
robots.txt: 1 rules are defined for 1 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent
everton_URL_1 <- "https://www.evertonfc.com/match/74913/everton-chelsea#report"
everton_URL_1 <- read_html(everton_URL_1)
everton_week_1 <- (".article__body.mc-report__body.js-article-body")
everton_week_1 <- everton_URL_1 %>%
html_node(css = everton_week_1) %>%
html_text2()
everton_week_1 <- str_replace_all(everton_week_1, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Everton vs Aston Villa
everton_URL_2 <- "https://www.evertonfc.com/match/74922/aston-villa-everton#report"
everton_URL_2 <- read_html(everton_URL_2)
everton_week_2 <- (".article__body.mc-report__body.js-article-body")
everton_week_2 <- everton_URL_2 %>%
html_node(css = everton_week_2) %>%
html_text2()
everton_week_2 <- str_replace_all(everton_week_2, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Everton vs Nottingham Forest
everton_URL_3 <- "https://www.evertonfc.com/match/74933/everton-nottm-forest#report"
everton_URL_3 <- read_html(everton_URL_3)
everton_week_3 <- (".article__body.mc-report__body.js-article-body")
everton_week_3 <- everton_URL_3 %>%
html_node(css = everton_week_3) %>%
html_text2()
everton_week_3 <- str_replace_all(everton_week_3, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Everton vs Brentford
everton_URL_4 <- "https://www.evertonfc.com/match/74943/brentford-everton#report"
everton_URL_4 <- read_html(everton_URL_4)
everton_week_4 <- (".article__body.mc-report__body.js-article-body")
everton_week_4 <- everton_URL_4 %>%
html_node(css = everton_week_4) %>%
html_text2()
everton_week_4 <- str_replace_all(everton_week_4, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Everton vs Leeds
everton_URL_5 <- "https://www.evertonfc.com/match/74955/leeds-everton#report"
everton_URL_5 <- read_html(everton_URL_5)
everton_week_5 <- (".article__body.mc-report__body.js-article-body")
everton_week_5 <- everton_URL_5 %>%
html_node(css = everton_week_5) %>%
html_text2()
everton_week_5 <- str_replace_all(everton_week_5, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Everton vs Liverpool
everton_URL_6 <- "https://www.evertonfc.com/match/74965/everton-liverpool#report"
everton_URL_6 <- read_html(everton_URL_6)
everton_week_6 <- (".article__body.mc-report__body.js-article-body")
everton_week_6 <- everton_URL_6 %>%
html_node(css = everton_week_6) %>%
html_text2()
everton_week_6 <- str_replace_all(everton_week_6, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Everton vs West Ham
everton_URL_7 <- "https://www.evertonfc.com/match/74985/everton-west-ham#report"
everton_URL_7 <- read_html(everton_URL_7)
everton_week_7 <- (".article__body.mc-report__body.js-article-body")
everton_week_7 <- everton_URL_7 %>%
html_node(css = everton_week_7) %>%
html_text2()
everton_week_7 <- str_replace_all(everton_week_7, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
Everton <- c(everton_week_1, everton_week_2, everton_week_3, everton_week_4, everton_week_5, everton_week_6, everton_week_7)
Everton_corpus <- corpus(Everton)
Everton_corpus_summary <- summary(Everton_corpus)
# Creating a team name
Everton_corpus_summary$Team <- "Everton"
# create a match indicator
Everton_corpus_summary$Match <- as.numeric(str_extract(Everton_corpus_summary$Text, "[0-9]+"))
Everton_corpus_summary
Corpus consisting of 7 documents, showing 7 documents:
  Text Types Tokens Sentences    Team Match
 text1   516    990         6 Everton     1
 text2   703   1577        13 Everton     2
 text3   330    624         1 Everton     3
 text4   340    614         3 Everton     4
 text5   374    709         4 Everton     5
 text6   418    872         4 Everton     6
 text7   438    874        10 Everton     7
Leicester
This is the first of the bottom-of-the-table teams, which gives a look at sides in the relegation zone: if they do not improve their performance, they will be moved down to the second tier. I expect some urgency from this team, since each match means a lot more when one win can separate staying in the league from being relegated. The cleaning process went smoothly, but there is definitely still work to do before the real analysis. On average, Leicester's reports use more words than those of the two mid-table teams, and the counts fall in a fairly consistent range.
# Leicester against Brentford
# robots.txt: 1 rule for 1 bot, crawl delay 5 seconds, scrapable
bow("https://www.lcfc.com/news/2729025/city-held-by-bees-in-premier-league-opener/featured")
<polite session> https://www.lcfc.com/news/2729025/city-held-by-bees-in-premier-league-opener/featured
User-agent: polite R package
robots.txt: 1 rules are defined for 1 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent
leicester_URL_1 <- "https://www.lcfc.com/news/2729025/city-held-by-bees-in-premier-league-opener/featured"
leicester_URL_1 <- read_html(leicester_URL_1)
leicester_week_1 <- (".col-12")
leicester_week_1 <- leicester_URL_1 %>%
html_node(css = leicester_week_1) %>%
html_text2()
leicester_week_1 <- str_replace_all(leicester_week_1, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Leicester against Arsenal
leicester_URL_2 <- "https://www.lcfc.com/news/2739798/foxes-fall-to-defeat-at-arsenal/featured"
leicester_URL_2 <- read_html(leicester_URL_2)
leicester_week_2 <- (".col-12")
leicester_week_2 <- leicester_URL_2 %>%
html_node(css = leicester_week_2) %>%
html_text2()
leicester_week_2 <- str_replace_all(leicester_week_2, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Leicester against Southampton
leicester_URL_3 <- "https://www.lcfc.com/news/2751347/saints-take-the-points-on-filbert-way/featured"
leicester_URL_3 <- read_html(leicester_URL_3)
leicester_week_3 <- (".col-12")
leicester_week_3 <- leicester_URL_3 %>%
html_node(css = leicester_week_3) %>%
html_text2()
leicester_week_3 <- str_replace_all(leicester_week_3, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Leicester against Chelsea
leicester_URL_4 <- "https://www.lcfc.com/news/2762326/city-defeated-as-10man-chelsea-win-at-stamford-bridge/featured"
leicester_URL_4 <- read_html(leicester_URL_4)
leicester_week_4 <- (".col-12")
leicester_week_4 <- leicester_URL_4 %>%
html_node(css = leicester_week_4) %>%
html_text2()
leicester_week_4 <- str_replace_all(leicester_week_4, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Leicester against Manchester United
leicester_URL_5 <- "https://www.lcfc.com/news/2774578/man-utd-defeat-for-leicester-on-matchday-five/featured"
leicester_URL_5 <- read_html(leicester_URL_5)
leicester_week_5 <- (".col-12")
leicester_week_5 <- leicester_URL_5 %>%
html_node(css = leicester_week_5) %>%
html_text2()
leicester_week_5 <- str_replace_all(leicester_week_5, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Leicester against Brighton
leicester_URL_6 <- "https://www.lcfc.com/news/2779658/city-beaten-away-to-brighton/featured"
leicester_URL_6 <- read_html(leicester_URL_6)
leicester_week_6 <- (".col-12")
leicester_week_6 <- leicester_URL_6 %>%
html_node(css = leicester_week_6) %>%
html_text2()
leicester_week_6 <- str_replace_all(leicester_week_6, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# Leicester against Tottenham Hotspur
leicester_URL_7 <- "https://www.lcfc.com/news/2793845/leicester-lose-to-spurs-in-london/featured"
leicester_URL_7 <- read_html(leicester_URL_7)
leicester_week_7 <- (".col-12")
leicester_week_7 <- leicester_URL_7 %>%
html_node(css = leicester_week_7) %>%
html_text2()
leicester_week_7 <- str_replace_all(leicester_week_7, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
Leicester <- c(leicester_week_1, leicester_week_2, leicester_week_3, leicester_week_4, leicester_week_5, leicester_week_6, leicester_week_7)
Leicester_corpus <- corpus(Leicester)
Leicester_corpus_summary <- summary(Leicester_corpus)
# Creating a team name
Leicester_corpus_summary$Team <- "Leicester"
# create a match indicator
Leicester_corpus_summary$Match <- as.numeric(str_extract(Leicester_corpus_summary$Text, "[0-9]+"))
Leicester_corpus_summary
Corpus consisting of 7 documents, showing 7 documents:
Text Types Tokens Sentences Team Match
text1 533 1143 26 Leicester 1
text2 539 1183 18 Leicester 2
text3 484 1043 39 Leicester 3
text4 557 1263 27 Leicester 4
text5 463 830 36 Leicester 5
text6 498 1071 20 Leicester 6
text7 481 999 26 Leicester 7
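Each match block above repeats the same scrape-and-clean steps, so they could be folded into a small helper. The following is only a sketch: `clean_report()` and `scrape_report()` are hypothetical names that do not appear in the original code, and `html_element()` is the current rvest name for `html_node()`. The helper is intended to strip the same characters as the pipelines above (newlines, literal "/n", " - ", parentheses, double quotes, and "#"), but it does so in fewer steps, so edge cases may differ slightly.

```r
library(rvest)
library(stringr)

# Hypothetical helper: remove the characters targeted by the pipelines above.
clean_report <- function(text) {
  text %>%
    str_remove_all("\n") %>%       # real newline characters
    str_remove_all("/n") %>%       # literal "/n" sequences
    str_remove_all(" - ") %>%      # spaced dashes
    str_remove_all("[()\"#]")      # parentheses, double quotes, hashes
}

# Hypothetical helper: pull the first node matching `css` from a parsed page
# and return its cleaned text.
scrape_report <- function(page, css) {
  page %>%
    html_element(css = css) %>%
    html_text2() %>%
    clean_report()
}

# Offline demonstration on an invented HTML fragment (not a real match report).
demo_page <- minimal_html(
  '<div class="col-12"><p>City (10 men) lost - "unlucky"</p></div>'
)
demo_text <- scrape_report(demo_page, ".col-12")

# Live usage would need network access, for example:
# leicester_urls <- c(
#   "https://www.lcfc.com/news/2779658/city-beaten-away-to-brighton/featured",
#   "https://www.lcfc.com/news/2793845/leicester-lose-to-spurs-in-london/featured"
# )
# Leicester <- vapply(
#   leicester_urls,
#   function(u) scrape_report(read_html(u), ".col-12"),
#   character(1),
#   USE.NAMES = FALSE  # keep default docnames (text1, text2, ...) in corpus()
# )
```

`USE.NAMES = FALSE` matters here: with a named vector, `corpus()` would use the URLs as document names, and the match indicator extracted with `"[0-9]+"` would then pick up the article ID instead of the match number.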
West Ham United
West Ham was fairly straightforward, and I was able to clean this one well. Some spacing work remains, but that will come at a later stage. Looking at their reports, I noticed that they use some of the fewest words when talking about the matches. They also use some of the fewest unique words, so I am interested in breaking this one down to see whether they mostly talk about certain players' performances.
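One possible way to check whether the reports concentrate on a few players is to count the most frequent non-stopword terms with quanteda. The sketch below is an assumption about how that could be done, not part of the original analysis. It runs on two invented sentences with made-up player names ("Smith", "Jones"); the real input would be the `Westham` character vector built further down.

```r
library(quanteda)

# Invented toy text standing in for the scraped reports (not real match data).
toy_reports <- c(
  "Smith scored early and Smith pressed well.",
  "Jones found Smith before the break."
)

# Tokenise, drop punctuation and English stopwords, then count terms.
toy_tokens <- tokens(toy_reports, remove_punct = TRUE)
toy_tokens <- tokens_remove(toy_tokens, stopwords("en"))
toy_dfm <- dfm(toy_tokens)  # lowercases terms by default

# Most frequent remaining terms across all documents.
top_terms <- topfeatures(toy_dfm, n = 5)
top_terms
```

If player names dominate the top of such a list for the real reports, that would support the idea that the reports focus on individual performances.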
# West Ham vs Manchester City
Westham_URL_1 <- "https://www.whufc.com/fixture/view/6472"
Westham_URL_1 <- read_html(Westham_URL_1)
Westham_week_1 <- (".m-article__columns")
Westham_week_1 <- Westham_URL_1 %>%
html_node(css = Westham_week_1) %>%
html_text2()
Westham_week_1 <- str_replace_all(Westham_week_1, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# West Ham vs Nottingham Forest
Westham_URL_2 <- "https://www.whufc.com/fixture/view/6464"
Westham_URL_2 <- read_html(Westham_URL_2)
Westham_week_2 <- (".m-article__columns")
Westham_week_2 <- Westham_URL_2 %>%
html_node(css = Westham_week_2) %>%
html_text2()
Westham_week_2 <- str_replace_all(Westham_week_2, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# West Ham vs Brighton
Westham_URL_3 <- "https://www.whufc.com/fixture/view/6452"
Westham_URL_3 <- read_html(Westham_URL_3)
Westham_week_3 <- (".m-article__columns")
Westham_week_3 <- Westham_URL_3 %>%
html_node(css = Westham_week_3) %>%
html_text2()
Westham_week_3 <- str_replace_all(Westham_week_3, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# West Ham vs Aston Villa
Westham_URL_4 <- "https://www.whufc.com/fixture/view/6450"
Westham_URL_4 <- read_html(Westham_URL_4)
Westham_week_4 <- (".m-article__columns")
Westham_week_4 <- Westham_URL_4 %>%
html_node(css = Westham_week_4) %>%
html_text2()
Westham_week_4 <- str_replace_all(Westham_week_4, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# West Ham vs Tottenham Hotspur
Westham_URL_5 <- "https://www.whufc.com/fixture/view/6436"
Westham_URL_5 <- read_html(Westham_URL_5)
Westham_week_5 <- (".m-article__columns")
Westham_week_5 <- Westham_URL_5 %>%
html_node(css = Westham_week_5) %>%
html_text2()
Westham_week_5 <- str_replace_all(Westham_week_5, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# West Ham vs Chelsea
Westham_URL_6 <- "https://www.whufc.com/fixture/view/6428"
Westham_URL_6 <- read_html(Westham_URL_6)
Westham_week_6 <- (".m-article__columns")
Westham_week_6 <- Westham_URL_6 %>%
html_node(css = Westham_week_6) %>%
html_text2()
Westham_week_6 <- str_replace_all(Westham_week_6, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
# West Ham vs Everton
Westham_URL_7 <- "https://www.whufc.com/fixture/view/6407"
Westham_URL_7 <- read_html(Westham_URL_7)
Westham_week_7 <- (".m-article__columns")
Westham_week_7 <- Westham_URL_7 %>%
html_node(css = Westham_week_7) %>%
html_text2()
Westham_week_7 <- str_replace_all(Westham_week_7, "\n", "####") %>%
str_replace_all("/n", "####") %>%
str_remove_all("/n") %>%
str_remove_all("\n") %>%
str_remove_all(" - ") %>%
str_remove_all("\\(") %>%
str_remove_all("\\)") %>%
str_remove_all("\"") %>%
str_remove_all("#") %>%
unlist()
Westham <- c(Westham_week_1, Westham_week_2, Westham_week_3, Westham_week_4, Westham_week_5, Westham_week_6, Westham_week_7)
Westham_corpus <- corpus(Westham)
Westham_corpus_summary <- summary(Westham_corpus)
# Creating a team name
Westham_corpus_summary$Team <- "WestHam"
# create a match indicator
Westham_corpus_summary$Match <- as.numeric(str_extract(Westham_corpus_summary$Text, "[0-9]+"))
Westham_corpus_summary
Corpus consisting of 7 documents, showing 7 documents:
Text Types Tokens Sentences Team Match
text1 360 799 16 WestHam 1
text2 339 759 18 WestHam 2
text3 283 599 18 WestHam 3
text4 325 730 7 WestHam 4
text5 311 694 10 WestHam 5
text6 338 696 10 WestHam 6
text7 323 685 8 WestHam 7
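The two summary tables above can be put side by side to compare report length and vocabulary. The sketch below re-enters the Types and Tokens columns by hand from the printed output so that it runs on its own; in the actual workflow the summary objects would be combined directly. The type-token ratio column (`TTR`) is an addition not present in the original and is offered only as one possible comparison, since raw counts of unique words depend on how long each report is.

```r
# Values copied from the printed Leicester and West Ham summaries above.
leicester <- data.frame(
  Team   = "Leicester",
  Match  = 1:7,
  Types  = c(533, 539, 484, 557, 463, 498, 481),
  Tokens = c(1143, 1183, 1043, 1263, 830, 1071, 999)
)
westham <- data.frame(
  Team   = "WestHam",
  Match  = 1:7,
  Types  = c(360, 339, 283, 325, 311, 338, 323),
  Tokens = c(799, 759, 599, 730, 694, 696, 685)
)

both <- rbind(leicester, westham)

# Type-token ratio: unique words per word used, per match report.
both$TTR <- both$Types / both$Tokens

# Average Types, Tokens, and TTR per team.
team_means <- aggregate(cbind(Types, Tokens, TTR) ~ Team, data = both, FUN = mean)
team_means
```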
Exploratory Analysis
Bibliography
Arsenal. (2022). News. Retrieved from Arsenal: https://www.arsenal.com/news?field_article_arsenal_team_value=men&revision_information=&page=1
Everton. (2022). Results. Retrieved from Everton: https://www.evertonfc.com/results
Leicester Football Club. (2022). First Team. Retrieved from Leicester Football Club: https://www.lcfc.com/matches/reports
Manchester City. (2022). News. Retrieved from Manchester City: https://www.mancity.com/news/mens
Newcastle United. (2022). Our Results. Retrieved from Newcastle United: https://www.nufc.co.uk/matches/first-team/#results
West Ham United. (2022). Fixtures. Retrieved from West Ham United: https://www.whufc.com/fixture/list/713