library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Final Paper
library(readr)
nfl <- read_csv("_data/nfl2019.csv")
View(nfl)
# Re-read the file, skipping the original header row and supplying descriptive column names
nfl <- read_csv("_data/nfl2019.csv",
  skip = 1,
  col_names = c("delete", "Opponent", "Home_Ranking", "Opponent_Ranking", "Home_1st_Downs", "Home_Total_Yards", "Home_Passing_Yards", "Home_Rushing_Yards", "Home_Turnovers", "Opponent_1st_Downs", "Opponent_Total_Yards","Opponent_Passing_Yards", "Opponent_Rushing_Yards", "Opponent_Turnovers", "Home_Offensive_Ranking", "Home_Defensive_Ranking", "Home_Special_Teams", "Home", "Year", "Winner")) %>%
  select(!starts_with("delete")) %>%
  # Treat the literal string "Skipped" as missing in every character column
  mutate(across(where(is.character), ~ na_if(.x, "Skipped")))
mean(nfl$Home_Total_Yards)
[1] 348.048
mean(nfl$Opponent_Total_Yards)
[1] 346.2134
mean(nfl$Home_Passing_Yards)
[1] 231.8302
mean(nfl$Opponent_Passing_Yards)
[1] 230.5542
mean(nfl$Home_Rushing_Yards)
[1] 116.2179
mean(nfl$Opponent_Rushing_Yards)
[1] 115.6592
nfl %>%
  ggplot(aes(x = `Home_Passing_Yards`, y = `Year`)) +
  geom_point()
nfl %>%
  ggplot(aes(x = `Opponent_Passing_Yards`, y = `Year`)) +
  geom_point()
# Flag each game as a home 'Win' or 'Loss' by comparing the home team name to the winner
nfl$home_winner <- with(nfl, ifelse(Home == Winner, 'Win', 'Loss'))
view(nfl)
nrow(filter(nfl,home_winner == "Win"))
[1] 451
sum(nfl$Home_Passing_Yards > nfl$Opponent_Passing_Yards & nfl$home_winner == "Win")
[1] 262
sum(nfl$Home_Rushing_Yards > nfl$Opponent_Rushing_Yards & nfl$home_winner == "Win")
[1] 311
library(ggplot2)
scatter <- nfl %>%
  ggplot(mapping = aes(x = Year, y = `Home_Passing_Yards`)) +
  geom_point(aes(color = home_winner))
scatter
library(ggplot2)
scatter <- nfl %>%
  ggplot(mapping = aes(x = Year, y = `Home_Rushing_Yards`)) +
  geom_point(aes(color = home_winner))
scatter
## Final Paper
My final project focuses on an NFL (National Football League) dataset. I chose this dataset because I watch the NFL games every Sunday and I think I understand the game pretty well. Even so, I find myself with a couple of questions during the games. My research question is whether each team’s yardage stats (passing and rushing yards) correlate with each other and with the win.

First, I wanted to see whether the home team having more passing yards than the opponent correlates with a win. If the answer is yes, it would suggest that rushing yards don’t matter as much in the game and that teams should focus on dominating the passing game. It might even affect what kind of quarterback teams draft: do you pick a quarterback who is fast and can run, or one who has a good arm and can throw for many yards?

If the answer is no, two other hypotheses follow. The first is that, since the passing game doesn’t decide the win, maybe the rushing game does, so the logical next step is to look at the team with more rushing yards and see how many times it pulled off the win. If the team with more rushing yards wins most of the time, then we have our answer that rushing yards matter more than passing yards. Again, I imagine that would affect draft picks by putting more focus on running backs and quarterbacks who can run. If there is no direct correlation there either, we have to look at the second hypothesis: neither the passing yards nor the rushing yards matter, and we cannot predict the winner by looking at either number alone. My own hypothesis is that the more passing yards a team has, the more likely it is to win.
Obviously, the team with more passing yards or rushing yards will not win every game, so I will arbitrarily say that the team with more passing yards or rushing yards has to win at least 65% of the games before I call that type of yardage a significant factor. I expect passing yards to matter more because every team has a quarterback who can throw well and for many yards, but not every team has a quarterback who can run often or consistently well. I think that rushing yards do not matter as much as passing yards, and maybe that’s why the average number of rushing yards in a game tends to be lower than the average number of passing yards. This research question can not only help teams select quarterbacks, wide receivers and running backs during the draft, but also identify weak points in a team’s roster to see where the team needs to practice more. It can also help predict wins and losses within the league.

The dataset I chose is the NFL team stats and outcomes from 2019-2022. It includes stats on the defense, offense and special teams units for the home team, along with team standings (rankings) and the outcome of each game. I renamed each column to something that made more sense to me, putting opponent or home at the beginning of each name to specify the team. I deleted the first column, the week of the NFL season in which the game was played, because I did not think it mattered for my research question. The next column is the opponent’s name. I renamed the following column from Tm, a very vague name, to Home_Ranking, because it holds the home team’s ranking during the week the game was played. The column after that is Opponent_Ranking, the opponent’s ranking during the week the game was played, which I renamed from the nonspecific Opp.1. Next, I renamed 1stD and TotYd to home 1st downs and home total yards respectively.
I kept 1st downs because I feel it’s a good stat to have. It won’t necessarily be a good predictor of who wins the game, but the more first downs you have, the closer you are likely to be to scoring a touchdown or kicking a field goal. Basically, more first downs probably means you are closer to putting points on the board (either 3 points or 7). The total yards field is extremely important because it is the sum of the passing yards and the rushing yards. The winning team will often, though not always, have more total yards. The next columns are home passing yards and home rushing yards. These two are the main columns I’ll be looking at to answer my research question. After those comes home turnovers, which is not a great indicator of whether a team wins the game, but a lot of turnovers means the other team got additional chances to put points on the board.

The next columns are the same as before but for the opponent: opponent 1st downs, opponent total yards, opponent passing yards, opponent rushing yards and opponent turnovers. They now have opponent leading the name instead of a .1 suffix differentiating them from the home team stats. These columns have the same importance as the parallel home team columns. I will be looking closely at the opponent passing yards and opponent rushing yards columns to answer my question. The next columns describe the unit rankings for the home team: the home offensive ranking, the home defensive ranking and the home special teams ranking. Finally, I have the home team name and the year the game was played. This dataset only covers games played in the 2019-2022 regular seasons, meaning that stats from playoff games and the Super Bowl are not counted here.
I kept the column naming the winner, but it lists the whole team name, which is why I created another column called home_winner. This column checks whether the winner’s team name is the same as the home team name. If it is, we put ‘Win’ in the home_winner column; if it is not, which means the opponent won, we put ‘Loss’. The home_winner column is always relative to the home team.

This is how I renamed the columns and cleaned the data. I needed to rename the columns because the original names did not say which team (opponent or home) we were looking at. Many of the numeric columns, like passing yards or rushing yards, had names like PassY and RushY for the home team and PassY.1 and RushY.1 for the opponent, which gets confusing after a while. This is why I renamed every column to something descriptive with opponent or home attached to the name. After cleaning the dataset, we have 20 total columns and 895 rows. For my visualizations, I am focusing on 6 columns: Home_Passing_Yards, Home_Rushing_Yards, Opponent_Passing_Yards, Opponent_Rushing_Yards, Year and home_winner.

I ran basic statistics on home/opponent total yards, home/opponent passing yards and home/opponent rushing yards by calculating the mean of each column. This is just to establish a rough baseline. For example, the means of Home_Passing_Yards and Opponent_Passing_Yards are 231.83 and 230.55 respectively. Although these values are close, they tell me that the home team edged out the opponent on average in passing yards and that a typical passing total in a game is somewhere around 230 yards.
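The baseline means described above can be computed in a single call. The following is a minimal sketch on made-up numbers, where `games` is a hypothetical stand-in for the cleaned `nfl` data frame. It also computes the share of games in which the home team had more passing yards, since a mean on its own does not say how often one side came out ahead.

```r
# Toy data standing in for the cleaned nfl data frame (all values are made up).
games <- data.frame(
  Home_Passing_Yards     = c(250, 180, 310, 220),
  Opponent_Passing_Yards = c(230, 240, 200, 260),
  Home_Rushing_Yards     = c(120,  90, 140, 100),
  Opponent_Rushing_Yards = c(110, 130,  80, 115)
)

# Mean of every yardage column in one call.
baseline <- colMeans(games)
print(round(baseline, 2))

# Fraction of games in which the home team out-passed the opponent.
# mean() of a logical vector gives the proportion of TRUE values.
share_home_more_passing <- mean(
  games$Home_Passing_Yards > games$Opponent_Passing_Yards
)
print(share_home_more_passing)
```

On the real data, the same two calls would be applied to the corresponding columns of `nfl`.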
By this baseline, if the home team has, for example, only 175 passing yards, I would guess that the opponent won, because 175 yards is well below the average of about 230 yards. The baseline for rushing yards is 116.22 for the home team and 115.66 for the opponent. Again, the values are so close that the difference is not really significant. I expected the means for home and opponent total yards to be very close too, since both the passing and the rushing averages were very close. They come out to 348.05 total yards for the home team and 346.21 for the opponent. Again, this is too close to lean significantly towards either team, but it establishes a nice baseline for every value we will be looking at. I decided not to calculate the maximum or minimum of each team’s passing, rushing or total yards because that information does not further answer my research question, and I can adequately read it off the data visualizations I created.

I plotted the home passing yards against the year to see how passing yards have evolved over the 4 years. This visualization gives me an overall view of the typical range of passing yards by the home team and of which data points are outliers. Next, I plotted the opponent passing yards against the year so I could compare Home_Passing_Yards and Opponent_Passing_Yards in the big picture. Looking at the big picture, I don’t see much of a difference. The overall range each year looks about the same and the outliers also look about the same, which seems right since the two averages were so close to each other. From there, I moved on to my research question of whether a team’s passing yards correlate with a win when it has more passing yards than the opposing team.
I needed a visual showing the wins and losses together with the home passing yards to see whether the home team won more games when it had a lot of passing yards. So, I had to create a new field called home_winner. To fill this new column, I wrote a simple conditional statement. If the home team name is equal to the winner’s team name, put the value ‘Win’ in the home_winner column. Otherwise, meaning the opponent’s team name equals the winner’s team name, put the value ‘Loss’ in the home_winner column. The home_winner column is always relative to the home team.

Next, I filtered for the ‘Win’ value in home_winner to see how many times the home team won: 451 wins over the 4 years. To answer my research question, I counted how many times the home team’s passing yards were greater than the opposing team’s passing yards and home_winner had the value ‘Win’. That condition was met 262 times. I also counted how many times the home team’s rushing yards were greater than the opposing team’s rushing yards in a home win, to answer whether a team’s rushing yards correlate with the win when it rushed for more than the other team. Here the condition is that Home_Rushing_Yards is greater than Opponent_Rushing_Yards and home_winner is equal to ‘Win’. That condition was met 311 times.

In the visual I created, a loss is shown in red and a win in blue, with the year on the x-axis and the home passing yards on the y-axis. The differently colored dots are all over the place. In 2019 and 2020, the highest home passing totals, around 500 and 450 yards respectively, both came in losses. On the other hand, in 2021 and 2022 the highest home passing totals, around 450 yards, came in wins. We can also see that when the home passing yards are at their minimum in all 4 years, anywhere from around 0 to 50, all of those games show as losses.
However, instead of seeing a lot of blue gathered towards the top of the y-axis and red towards the bottom, we see no real dividing line between the two colors. I created the same kind of visual for the home team’s rushing yards. Again, red shows a loss and blue shows a win, and the year is plotted against the home rushing yards. With the rushing yards, we can already see a little more separation. In all the years except maybe 2021, the highest rushing totals are clearly home team wins, and the lowest rushing totals are clearly home team losses. The blue wins are mostly concentrated towards the top of the y-axis, where the home team has more rushing yards, and the red losses are more concentrated towards the bottom, where the home team has fewer. From my visualizations I learned that looking at the bigger picture tells you a lot. By showing wins and losses in color, I can see where each is more concentrated and whether a dividing line exists.

I enjoyed my experience with this project. It allowed me to see NFL games in a new light, a more logical, statistical light. After I started this project, I began paying more attention to the passing and rushing yardage each team had put up by the end of the game. I started noticing that certain teams usually have more passing yardage than rushing yardage because they are more pass reliant, or the reverse for a run-heavy team. Looking back at my analysis, I realize that I only looked at the offensive side of each team and did not account for the defense at all, even though some teams win on defense alone.
I now realize that some defenses are better equipped to handle a pass-heavy team and others a run-heavy team, and that defense definitely plays a role in whether a team wins or loses. If I were continuing this project, I would try to incorporate a team’s defensive strategy into my analysis somehow. I could see whether the defensive ranking affects the win or loss, or compare each team’s defensive and offensive rankings and see how those correlate with the win. It would be interesting to see whether a team’s offensive or defensive ranking holds more weight in the win or whether it’s a tossup and really doesn’t matter.

The most challenging aspect of this project was figuring out how to visualize the home passing and rushing yardage and show which team won at the same time. Once I figured out that I had to base everything on the home team and create a new win/loss column for the home team, creating the visual became much easier. I originally thought of assigning each team a number and displaying that number instead of the team name, which would have played the role of the new win/loss column. However, since there are 32 NFL teams, I would have had to assign 32 numbers and remember the number for each team. On top of that, I would have been plotting 32 teams on each graph, which would have made the graph nearly impossible to read.

I also did not think about the fact that these numbers are spread over 4 years and that the quarterbacks, wide receivers and running backs are not the same in all 4 years. Some quarterbacks are more pass reliant and some are running quarterbacks, which means their rushing yardage is usually higher whether they win or lose. If I were continuing this project, I would look at the quarterback rating and at whether there are more running quarterbacks or pass-heavy quarterbacks within the league.
I think I would somehow incorporate what kind of quarterback the winning team had so that I could see whether a running quarterback or a pass-heavy quarterback tends to do better within the league. Another area of growth for this project would be to look at whether a team’s ranking as the regular season goes on has an impact on which team wins or loses. For example, if the San Francisco 49ers are ranked #9 in week 1 and also #9 in week 15 of a regular 18-week season, are they more likely to win in week 1 or in week 15, given that in both weeks they are ranked in the top 10?

Another area I would consider is whether COVID-19 had a big impact on teams putting up less passing or rushing yardage in 2020. If fans were not allowed to be there, would the teams still have had enough of a push to keep the numbers up in each category? On the subject of fans, it would be interesting to see how big the home team’s stadium was, how many fans of each team came out to see them play, and whether that affected the win. I know the atmosphere of the game is a big thing in college football (how big the stadium is, how many fans come out, how loud it can get, and so on), but is it as big of a factor in the NFL?

Another idea would be to look at the rest of the 2019-2022 seasons: not only the statistics from the regular 18-week season but also the numbers from the postseason, meaning the playoffs and the Super Bowl. Do the playoffs and the Super Bowl automatically bring bigger passing and rushing yardage? Do these numbers matter more or less in the postseason? Does rushing yardage matter more than passing yardage in the playoffs and the Super Bowl, or vice versa? Another area to look at would be whether penalties had a significant impact on any of the games.
Since yards given or taken away by penalties are not counted in passing or rushing yards, or even in the total yardage, a team might have won because more penalties were called against the other team, giving it advantages and yards that are not technically accounted for in any statistic. My final thought on this is that yes, a win is a win no matter how big or small, by 1 point or 50 points. But if the home team and the opposing team have a small gap in passing yardage and a big gap in rushing yardage, does that link directly to a smaller gap in points, a bigger one, or does it not matter? What about the opposite scenario (a big gap in passing yardage but a small gap in rushing yardage)?

In conclusion, I can answer both of my research questions using the data I gathered and the visuals I created from it. My first research question was: if one team has more passing yards than the other team, does that directly link to a win? The answer is no, not necessarily, since the data doesn’t show that happening most of the time. I arrived at this answer by seeing that in the 4-year time frame of my data, the home team claimed the win 451 times. Within those 451 games, the condition that the home team has more passing yards than the opposing team and also wins was met 262 times. This means that only 58% of the home wins also had the home team putting up more passing yards than the opposing team. 58% is a majority, but it is not the arbitrary 65% I defined in my introduction as the point where it could not be disputed that passing yards played a role in the win. With 58%, you also cannot accurately predict the winner of the game by looking only at passing yardage.
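The count behind the 58% figure can be reproduced on a small example. This is only a sketch on made-up games: the team names, yardage values and the `games` data frame are hypothetical, and only the column names follow the cleaned dataset. It also shows a second proportion with a different denominator, the home win rate among games where the home team had more passing yards, which is an addition for comparison and not something computed in the paper.

```r
# Made-up games for illustration only; column names mirror the cleaned data.
games <- data.frame(
  Home   = c("Bears", "Jets",  "Rams", "Colts", "Bills",  "Texans"),
  Winner = c("Bears", "Lions", "Rams", "Colts", "Chiefs", "Titans"),
  Home_Passing_Yards     = c(250, 180, 310, 200, 290, 300),
  Opponent_Passing_Yards = c(230, 240, 200, 260, 270, 250),
  stringsAsFactors = FALSE
)

# Same rule as in the paper: the home team wins when Home equals Winner.
games$home_winner <- ifelse(games$Home == games$Winner, "Win", "Loss")

home_won       <- games$home_winner == "Win"
home_outpassed <- games$Home_Passing_Yards > games$Opponent_Passing_Yards

# Share of home wins in which the home team also had more passing yards
# (the kind of quantity reported as 58% in the paper).
share_of_wins <- sum(home_won & home_outpassed) / sum(home_won)

# Home win rate among games in which the home team had more passing yards
# (different denominator; closer to a "can I predict the winner" question).
win_rate_when_outpassing <- sum(home_won & home_outpassed) / sum(home_outpassed)

print(round(share_of_wins, 3))
print(round(win_rate_when_outpassing, 3))
```

The two proportions answer different questions and need not agree, so it helps to state which denominator a percentage uses.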
The visual of home passing yards vs. year, with colors showing a win or loss, corroborates this answer: both blue and red appear all over the graph with no real line separating the wins from the losses.

My second research question was: if one team has more rushing yards than the other team, does that directly link to a win? The answer is yes; I think the data shows this happening most of the time. Again, we have 451 home team wins. The condition that the home team has more rushing yards than the opposing team and also wins is met 311 times. This means that 69% of the home wins also had the home team putting up more rushing yards than the opposing team. 69% is definitely a majority and is larger than the 65% I defined in my introduction. In my opinion, it is a big enough majority that you can usually pick the winner by comparing only the rushing yards of the two teams, and that rushing yards play a big role in determining which team gets the W. Also, in the visual for home rushing yards, the blue and red are a little more separated from each other than in the passing yards graph. The red, meaning a loss, sits more towards the bottom, and the blue sits more towards the top, meaning more rushing yards and a win.

These answers are the opposite of my hypothesis, since I originally thought that more passing yards would equate to a win and that more rushing yards would not matter. The only real question left unanswered is whether these findings would actually change a general manager’s outlook on how to draft a quarterback.
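The percentages in the conclusion follow directly from the counts reported earlier (451 home wins, 262 with more passing yards, 311 with more rushing yards) and the 65% cutoff from the introduction. A short sketch of that arithmetic, with variable names chosen here for illustration:

```r
# Counts reported in the paper.
home_wins <- 451
wins_with_more_yards <- c(passing = 262, rushing = 311)

# Cutoff chosen (arbitrarily) in the introduction.
threshold <- 0.65

# Share of home wins in which the home team also led in each yardage type.
share <- wins_with_more_yards / home_wins
meets_threshold <- share >= threshold

print(round(100 * share, 1))
print(meets_threshold)
```

Passing comes out at roughly 58% and falls short of the cutoff, while rushing comes out at roughly 69% and clears it.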
## Bibliography

Finnstats. “COUNTIF Function in R.” R-bloggers, 18 July 2021, https://www.r-bloggers.com/2021/07/countif-function-in-r/.

“NFL Teams Stats and Outcomes.” Kaggle, https://www.kaggle.com/datasets/thedevastator/nfl-team-stats-and-outcomes.

“The R Project for Statistical Computing.” R, https://www.r-project.org/.

Wickham, Hadley, and Garrett Grolemund. “R For Data Science.” https://r4ds.had.co.nz/index.html.