hw3
Prasann Desai
worldfootballR
Reading in Data
Author

Prasann Desai

Published

July 7, 2023

1 Introduction

Until a decade and a half ago, recording data points regarding key events in soccer was a manual process. This limited the range of analysis and the variety of data attributes. Today, with the usage of advanced technology, hundreds of attributes describing a game are being recorded in real-time. This has enabled a wide range of research in the field of soccer analytics and led to the emergence of various soccer data aggregators like StatsBomb, FBref, Understat, Fotmob, etc. In this project, we will be using the data from these aggregators using the worldfootballR package in R. This package has been created by Jason Zivkovic that enables users to conveniently access data belonging to the above-mentioned aggregators in R.

The goal of the project will be to find performance attributes that best predict success in the game. We would want to see the following aspects:

  • What are the best teams/players good at? We will compare their aggregate statistics like goals, xG, Goal difference, etc.
  • Does higher xG (Expected Goals) correspond to higher league position? Any trends over years?
  • Which team has been the most unlucky? Here, we will analyse the difference between XG and Actual goals to see which teams couldn’t secure points despite performing well.

2 Data

This dataset allows us to obtain soccer game/team/player data with various attributes. The attributes are both categorical and numerical. The attributes describing a game are goals scored by the home/away team, assists, game result, corners taken, shots attempted, etc. Similar attributes are also recorded at a team and player level. The dataset also contains some advanced attributes like xG and xA that are widely accepted in the soccer analytics industry to describe the quality of actions of players/teams in quantitative terms. We will be making use of xG and related stats to determine if it is a good indicator of success (better points earned in the league, more goals scored, etc.). For simplicity, we will be using the stats of the 1st tier of English Premier League Football for our analysis.

This is a sample dataset call. Further in the project, we could change a couple of parameters in the dataset call for comparative analysis, etc.

Code
prem_team_2023_shooting <- fb_season_team_stats(country = "ENG", gender = "M", season_end_year = "2023", tier = "1st", stat_type = "shooting")
Code
colnames(prem_team_2023_shooting)
 [1] "Competition_Name"       "Gender"                 "Country"               
 [4] "Season_End_Year"        "Squad"                  "Team_or_Opponent"      
 [7] "Num_Players"            "Mins_Per_90"            "Gls_Standard"          
[10] "Sh_Standard"            "SoT_Standard"           "SoT_percent_Standard"  
[13] "Sh_per_90_Standard"     "SoT_per_90_Standard"    "G_per_Sh_Standard"     
[16] "G_per_SoT_Standard"     "Dist_Standard"          "FK_Standard"           
[19] "PK_Standard"            "PKatt_Standard"         "xG_Expected"           
[22] "npxG_Expected"          "npxG_per_Sh_Expected"   "G_minus_xG_Expected"   
[25] "np:G_minus_xG_Expected"

2.1 Attribute Information

  1. Competition_Name (str, categorical): Name of the Competition
  2. Gender (str, categorical): Gender category of the competition
  3. Country (str, categorical): Country of the competition
  4. Season_End_Year (int, numerical): Year of end of the competition season
  5. Squad (str, categorical): Name of the team
  6. Team_or_Opponent (str, categorical): Takes only two values - ‘team’ and ‘opponent’. If the record is a ‘team’ record, then the stats are scored by the Squad, else, the stats are scored against the Squad.
  7. Num_Players (int, numerical): Number of players registered in that season
  8. Mins_Per_90 (int, numerical): Games played per season
  9. Gls_Standard (int, numerical): Goals scored in a season
  10. Sh_Standard (int, numerical): Shots taken in a season
  11. SoT_Standard (int, numerical): Shots on target in a season
  12. SoT_percent_Standard (int, numerical): Ratio of SoT_Standard/Sh_Standard
  13. Sh_per_90_Standard (int, numerical): Sh_Standard per 90 mins
  14. SoT_per_90_Standard (int, numerical): SoT_Standard per 90 mins
  15. G_per_Sh_Standard (int, numerical): Gls_Standard/Sh_Standard
  16. G_per_SoT_Standard (int, numerical): Gls_Standard/SoT_Standard
  17. Dist_Standard (int, numerical): Distance travelled by the team in the season. (Running distance)
  18. FK_Standard (int, numerical): Free kicks taken in a season
  19. PK_Standard (int, numerical): Penalty kicks scored in a season
  20. PKatt_Standard (int, numerical): Penalty kicks awarded in a season
  21. xG_Expected (int, numerical): Expected Goals scored in a season
  22. npxG_Expected (int, numerical): non-penalty Expected Goals scored in a season
  23. npxG_per_Sh_Expected (int, numerical): non-penalty Expected Goals per shot in a season
  24. G_minus_xG_Expected (int, numerical): Difference between actual goals scored and Expected goals scored
  25. np:G_minus_xG_Expected (int, numerical): Difference between actual non-penalty goals scored and non-penalty Expected goals scored

3 Analysis

3.1 What are the best teams/players good at?

Before we dive deeper into any particular facet of team or individual performances, we must first look into some standard soccer statistics and do some comparative analysis to set the right foundations.

Firstly, it would be a good idea to have an identifier for the top performing teams vs the rest. A common practice is to view the teams based on their xGFor and xGAgainst on a grid and choose the teams on the bottom right of that graph. The graph “xG_For vs. xG_Against” below does just that.

Code
# Pre-processing
# selecting the required columns
prem_2023_shooting_1 <- select(prem_team_2023_shooting, Squad, Team_or_Opponent, npxG_Expected)
prem_2023_shooting_1 <- mutate(prem_2023_shooting_1, Squad = case_when(str_detect(Squad, "vs ") ~ str_split(Squad, "vs ", simplify = TRUE)[,2],
                                                                     TRUE ~ as.character(Squad)
                                                                     ))
# Using the pivot_wider function to pivot the dataset
prem_2023_shooting_pivoted <- pivot_wider(prem_2023_shooting_1, names_from = "Team_or_Opponent", values_from = "npxG_Expected")
prem_2023_shooting_pivoted <- rename(prem_2023_shooting_pivoted, xG_For = team, xG_Against = opponent)
Code
# Building a scatter plot of xG_For and xG_Against

ggplot(prem_2023_shooting_pivoted,
       aes(x = xG_For,
           y = xG_Against,
           col = Squad
           )) +
  geom_text(aes(y = xG_Against - 1, label = Squad), size = 3) + 
  geom_point(size = 2.5, alpha = 0.8) + 
  labs(x = "Expected Goals Scored",
       y = "Expected Goals Conceded",
       title = "xG_For vs. xG_Against",
       subtitle = "xG For vs. Against comparison of all Premier league teams for season of 2022-2023") +
  theme(legend.position = "none")

You’ll see that there is group of 6 teams that have separated themselves from the rest - They are Arsenal, Manchester City, Liverpool, Brighton, Newcastle Utd & Manchester Utd. Interestingly, these were the exact 6 teams that finished in the top 6 of the 2022-23 Premier League Season! This, in a way, validates our usage of xG to identify top-performing teams. Going forward, we will identify these 6 teams as the “Top 6” and rest of the teams as “Bottom 14”.

Code
prem_team_2023_shooting <- mutate(prem_team_2023_shooting, `Team type` = case_when(str_detect(Squad, "Arsenal") ~ "Top 6",
                                                                              str_detect(Squad, "Manchester City") ~ "Top 6",
                                                                              str_detect(Squad, "Liverpool") ~ "Top 6",
                                                                              str_detect(Squad, "Brighton") ~ "Top 6",
                                                                              str_detect(Squad, "Newcastle Utd") ~ "Top 6",
                                                                              str_detect(Squad, "Manchester Utd") ~ "Top 6",
                                                                              TRUE ~ "Bottom 14"
                                                                            ))
prem_team_2023_shooting <- mutate(prem_team_2023_shooting, Team_or_Opponent = case_when(str_detect(Team_or_Opponent, "opponent") ~ "Against",
                                                                              str_detect(Team_or_Opponent, "team") ~ "For",
                                                                              TRUE ~ "NA"
                                                                            ))
Code
# Function to plot comparisons on various metrics
plot_bar_facet_avg <- function(input_df, category_var, facet_var, measure, chart_title, subtitle, x_axis_label, y_axis_label) {

input_df_grouped <- group_by(input_df, across(all_of(category_var)), across(all_of(facet_var))) %>%
                                  summarise(Average_measure = mean(.data[[measure]], na.rm = TRUE))  

ggplot(input_df_grouped, aes(x = .data[[category_var]], y = Average_measure, fill = .data[[category_var]])) +
  geom_bar(size = 1, stat = "identity", show.legend = FALSE) +
  facet_wrap(~.data[[facet_var]]) +
  labs(title = chart_title,
       subtitle = subtitle,
       x = x_axis_label,
       y = y_axis_label
       ) +
  geom_text(aes(label = round(Average_measure,2)),
            color = "black",
            size = 3,
            vjust = 2
            ) +
  theme_linedraw()
}

3.1.1 Comparison of Average Shots Taken per 90 minutes

Code
plot_bar_facet_avg(prem_team_2023_shooting, category_var = "Team_or_Opponent", facet_var = "Team type", measure = "Sh_per_90_Standard",
                   chart_title = "Average shots taken per 90 mins", subtitle = "Comparison of the metric For v. Against a particular team", 
                   x_axis_label = "", y_axis_label = "Average Shots Taken per 90"
                   )

In the above plot, we observe that, on an average the Top 6 teams take a lot more shots per 90mins as compared to the Bottom 14. Also, we can see that the shots conceded by Top 6 teams are far lesser than the Bottom 14 teams. As a result, we can infer that the Top 6 teams create a lot more attacking threat and exhibit much better defensive capability.

3.1.2 Comparison of Average Shots on target per 90 minutes

Code
plot_bar_facet_avg(prem_team_2023_shooting, category_var = "Team_or_Opponent", facet_var = "Team type", measure = "SoT_per_90_Standard",
                   chart_title = "Average shots on target per 90 mins", subtitle = "Comparison of the metric For v. Against a particular team", 
                   x_axis_label = "", y_axis_label = "Average shots on target per 90"
                   )

3.1.3 Comparison of Average Shots on target per 90 minutes

Code
plot_bar_facet_avg(prem_team_2023_shooting, category_var = "Team_or_Opponent", facet_var = "Team type", measure = "SoT_percent_Standard",
                   chart_title = "Average % shot accuracy", subtitle = "Comparison of the metric For v. Against a particular team", 
                   x_axis_label = "", y_axis_label = "Average % shot accuracy"
                   )

In the above two plots, we observe that, on an average the Top 6 teams have more shots on target per 90mins as compared to the Bottom 14. Also, we can see that the shot accuracy of the Top 6 teams is not so different than the Bottom 14 teams. As a result, from the above 3 plots, we can infer that what seperates the Top 6 teams v. Bottom 14 teams is the volume of shots taken. This insight can also be extended to individual player performances. A similar analysis for player performance was done by Omar Chaudhuri from “Twenty First Group” that stated that statistically what separates top-tier players like Ronaldo, Messi is the the volume of shots that they attempt every match from high goal probability positions (basically, high xG).

3.1.4 Comparison of Average “expected” non-penalty goals (xnpG)

Code
plot_bar_facet_avg(prem_team_2023_shooting, category_var = "Team_or_Opponent", facet_var = "Team type", measure = "npxG_Expected",
                   chart_title = "Average volume of goal scoring chances created in a season", 
                   subtitle = "Comparison of the metric For v. Against a particular team", 
                   x_axis_label = "", y_axis_label = "Expected non-penalty goals (xnpG)"
                   )

After establishing the importance of producing high xG, the above plot is just to substantiate the fact that over the course of the entire season, the difference in the Top 6 teams and Bottom 14 teams in terms of quality of goal scoring attempts created is pretty significant. For coaches and teams around the world, it is like a north-star metric that lies at the heart of all the new tactics/strategies that are devised to achieve higher league positions over a season. Interestingly, xG accumulated over a season/year at a player level can also be used for new player purchase strategies – there are hidden gems with under-rated goal-scoring abilities across the world that sometimes get overshadowed by their teams’ bad luck/under-performance but statistically have high individual xGs.

3.2 Does higher xG (Expected Goals) correspond to higher league position?

3.2.1 Visualising the linear relation between xG and Goals scored

Code
# Building a x-y plot of npxG_Expected and Goals scored
prem_team_2023_shooting_for <- filter(prem_team_2023_shooting, Team_or_Opponent == "For")

ggplot(prem_team_2023_shooting_for,
       aes(x = npxG_Expected,
           y = Gls_Standard,
           col = `Team type`
           )) +
  geom_point(size = 2.5, alpha = 0.8) + 
  stat_smooth(method = "lm", formula = 'y~x', col = "purple") +
  labs(x = "Non-penalty Expected Goals Scored",
       y = "Goals Scored",
       title = "Volume of good quality chances created vs. Goals Scored",
       subtitle = "Checking linear correlation between npxG and Goals scored for 2022-2023 season"
       ) +
  theme()

In the above plot, it is intuitive to infer that there is a positive correlation between Non-Penalty xG and Goals scored i.e. Higher volume of goal scoring opportunities created lead to more number of goals. We know that, more goals scored lead to more wins. Consequently, the teams with higher Non-Penalty xG lead to higher league positions (See how the Top 6 teams separate themselves from the rest). Statistically, there is also an understanding in the soccer analytics industry that higher Non-Penalty xG is a better measure of team performance despite of the eventual league position. To validate this viewpoint, let’s now look at the trend of Non-Penalty xG / Regular xG vs League Performance (League Position) over the past 4 years.

3.3 Which team has been the most unlucky in the 2022-23 season?

We’ll look at teams that couldn’t secure points despite performing well.

Code
# Plot to see under/over performance of teams compared to their xG
# final_df
final_df_v2 <- filter(final_df, `Season_End_Year.y` == "2023") #| `Season_End_Year.y` == "2022")

final_df_v2 <- mutate(final_df_v2, `Team type` = case_when(Rk < 7 ~ "Top 6",
                                                                      TRUE ~ "Bottom 14"
                                                                      ))
final_df_v2 <- mutate(final_df_v2, Over_perform_flag = case_when((npxG_Expected_Rank - Rk) > 0 ~ "Over-perform",
                                                                 (npxG_Expected_Rank - Rk) < 0 ~ "Under-perform",
                                                                 TRUE ~ "As Expected"
                                                                 ))
ggplot(final_df_v2, aes(x = Rk, y = npxG_Expected_Rank - Rk, color = Over_perform_flag)) +
  geom_point(size = 2) +
  facet_wrap(~`Season_End_Year.y`, ncol = 1) +
  #scale_fill_manual(values = c("Over-perform" = "green", "Under-perform" = "red", "As Expected" = "yellow")) +
  labs(x = "League Rank",
       y = "Measure of Overperformance",
       title = "Under/over performance of teams compared to their xG by League Rank",
       subtitle = "Based on League data for the 2022-2023 season",
       color = "League Performance"
       ) +
  scale_x_continuous(breaks = c(seq(1, 20, by=1))) +
  scale_y_continuous(breaks = c(seq(-10, 10, by=2))) +
  geom_text(aes(label = paste0(Squad, ", ", npxG_Expected_Rank - Rk)),
            color = "black",
            size = 2.5,
            vjust = 2,
            angle = 30
            ) +
  theme_linedraw()

In the above plot, the teams have been arranged based on their league position from left to right. We have calculated a measure called “npxG_Expected_Rank” that is analogous to expected rank based on npxG alone. The y-axis of the plot is the difference between the actual rank and the expected rank. A couple of interesting things to look at here – Fulham, Crystal Palace and Wolves have all over-performed this season. This means that, there are teams below them in the table that have actually performed better than them in the 2023 season. If luck had gone out of the window, they might have found themselves in the relegation battle. Obviously, over-performance of some teams are also a factor in under-performance of other teams. Interestingly, we need to look at the two teams that have highly under-performed: Leicester City and Leeds United. Unfortunately and unluckily, despite their relatively better performances, these teams ended up getting relegated from the league this season and their managers/coaches getting sacked. A deviation for 7 league positions from the expected league position is worth noting, and the teams should have enough confidence in them next season to get promoted back to the Premier league.

4 Pending sections for final project

  1. Conclusion
  2. Citations/Bibliography
  3. Critical reflection