library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Final Project Assignment#1: Priyanka Thatikonda
Important Formatting & Submission Notes:
Use this file as the template to work on: start your own writing from Section “Part.1”
Please make the following changes to the above YAML header:
Change the “title” to “Final Project Assignment#1: First Name Last Name”;
Change the “author” to your name;
Change the “date” to the current date in the “MM-DD-YYYY” format;
Submission:
- Delete the unnecessary sections (“Overview”, “Tasks”, “Special Note”, and “Evaluation”).
- In the posts folder of your local 601_Spring_2023 project, create a folder named “FirstNameLastName_FinalProjectData”, and save your final project dataset(s) in this folder. DO NOT save the dataset(s) to the _data folder which stores the dataset(s) for challenges.
- Render and submit the file to the blog post like a regular challenge.
Overview of the Final Project
The goal is to tell a coherent and focused story with your data, which answers a question (or questions) that a researcher, or current or future employer, might want to have answered. The goal might be to understand a source of covariance, make a recommendation, or understand change over time. We don’t expect you to reach a definitive conclusion in this analysis. Still, you are expected to tell a data-driven story using evidence to support the claims you are making on the basis of the exploratory analysis conducted over the past term.
In this final project, statistical analyses are not required, but any students who wish to include these may do so. However, your primary analysis should center around visualization rather than inferential statistics. Many scientists only compute statistics after a careful process of exploratory data analysis and data visualization. Statistics are a way to gauge your certainty in your results - NOT A WAY TO DISCOVER MEANINGFUL DATA PATTERNS. Do not run a multiple regression with numerous predictors and report which predictors are significant!!
Tasks of Assignment#1
This assignment is the first component of your final project. Together with the later assignments, it makes up a short paper/report. In this assignment, you should introduce your dataset(s) and how you plan to present them. This assignment should include the following components:
A clear description of the dataset(s) that you are using.
What “story” do you want to present to the audience? In other words, what “question(s)” would you like to answer with this dataset(s)?
The Plan for Further Analysis and Visualization.
We will have a special class meeting on April 12 to review and discuss students’ proposed datasets for the final project. If you want your project to be discussed in class, please submit this assignment before April 12.
Part 1. Introduction
In this part, you should introduce the dataset(s) and your research questions.
- Dataset(s) Introduction:
The National Basketball Association (NBA) is a professional basketball league in North America composed of 30 teams (29 in the United States and 1 in Canada). It is one of the major professional sports leagues in the United States and Canada and is considered the premier professional basketball league in the world. The league was founded in 1946 as the Basketball Association of America (BAA) and changed its name to the National Basketball Association on August 3, 1949, after merging with the competing National Basketball League (NBL). The NBA’s regular season runs from October to April, with each team playing 82 games. The league’s playoff tournament extends into June. As of 2020, NBA players are the world’s best paid athletes by average annual salary per player.
The dataset I chose is from Kaggle (https://www.kaggle.com/mamadoudiallo/nba-players-stats-19802017). It is a list of NBA player statistics from 1980 to 2017. It records each player’s statistics for a given year, together with the team they played for. I chose this dataset because it is rich: it covers a wide range of years, which leaves a lot to explore, helps in understanding trends in the NBA over time, and can support future predictions.
In the NBA dataset I have, which includes data from 1980 to 2017, there are various player statistics available. Each row represents a player’s performance in a specific year, indicating their position, age, team, games played, minutes played (MP), player efficiency rating (PER), true shooting percentage (TS%), offensive win shares (OWS), defensive win shares (DWS), win shares (WS), win shares per 48 minutes (WS/48), box plus/minus (BPM), value over replacement player (VORP), field goals made (FG), field goals attempted (FGA), field goal percentage (FG%), three-pointers made (3P), three-pointers attempted (3PA), three-point percentage (3P%), two-pointers made (2P), two-pointers attempted (2PA), two-point percentage (2P%), effective field goal percentage (eFG%), free throws made (FT), free throws attempted (FTA), free throw percentage (FT%), offensive rebounds (ORB), defensive rebounds (DRB), total rebounds (TRB), assists (AST), steals (STL), blocks (BLK), turnovers (TOV), personal fouls (PF), and points scored (PTS). Each player’s performance is recorded for a specific year, allowing us to analyze their progress and explore trends over time.
The final part will reflect on the limitations of the dataset and of the process as the project unfolds, and will give a preliminary conclusion for the project.
- What questions would you like to answer with this dataset(s)?
How has player performance evolved over the years? You can analyze trends in key metrics such as player efficiency rating (PER), true shooting percentage (TS%), win shares (WS), or points scored (PTS) to understand how players’ skills and productivity have changed over time.
Which players have consistently performed at a high level throughout their careers? By examining metrics like PER, WS, or VORP, you can identify players who have consistently been valuable contributors to their teams over multiple seasons.
How does performance vary based on player position? You can compare statistics across different positions (e.g., centers, guards, forwards) to understand the unique roles and contributions of each position on the court.
How do players’ shooting percentages (FG%, 3P%, FT%) affect their overall performance? You can analyze the impact of shooting efficiency on player effectiveness by examining metrics such as PER, WS, or points scored.
Which teams have had the most successful players in terms of win shares? By aggregating win shares by team, you can identify the teams that have consistently had high-performing players throughout the dataset’s time range.
Can we predict a player’s performance based on their age? Analyzing how player statistics change as they age can provide insights into the typical career trajectory and identify any patterns or outliers.
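Several of these questions come down to grouped summaries. The sketch below is only an illustration on a small made-up data frame (the `toy` rows and player names are hypothetical stand-ins for `player_df.csv`); it shows how the position and team questions could be approached with `group_by()` and `summarise()`.

```r
library(dplyr)

# Hypothetical toy rows; the column names mirror the real dataset
toy <- data.frame(
  Year   = c(2015, 2015, 2016, 2016, 2016, 2017),
  Player = c("A", "B", "A", "B", "C", "C"),
  Pos    = c("C", "PG", "C", "PG", "SF", "SF"),
  Tm     = c("LAL", "BOS", "LAL", "BOS", "LAL", "GSW"),
  PER    = c(20, 15, 22, 14, 12, 13),
  WS     = c(8, 4, 9, 3, 2, 2.5)
)

# Position question: average efficiency and number of player-seasons per position
by_pos <- toy %>%
  group_by(Pos) %>%
  summarise(mean_PER = mean(PER), seasons = n(), .groups = "drop")

# Team question: total win shares accumulated by each team's players
by_team <- toy %>%
  group_by(Tm) %>%
  summarise(total_WS = sum(WS), .groups = "drop") %>%
  arrange(desc(total_WS))

by_pos
by_team
```

On the real data the same pattern would apply with `df` in place of `toy`; whether a sum or a per-season average is the fairer team comparison is a design choice still to be made.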
Part 2. Describe the data set(s)
This part contains both a coding and a storytelling component.
In the coding component, you should:
read the dataset;
(optional) If you have multiple dataset(s) you want to work with, you should combine these datasets at this step.
(optional) If your dataset is too big (for example, it contains too many variables/columns that may not be useful for your analysis), you may want to subset the data just to include the necessary variables/columns.
df <- read_csv("player_df.csv")
df
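If the full set of columns turns out to be more than the analysis needs, the optional subsetting step could look like the sketch below. It uses a small hypothetical data frame (`toy`) in place of the real file, and the list of columns to keep is only an assumption about which variables will matter.

```r
library(dplyr)

# Hypothetical stand-in for the data frame returned by read_csv()
toy <- data.frame(
  Year = c(1980, 1980), Player = c("A", "B"), Pos = c("C", "PG"),
  Age = c(32, 25), Tm = c("LAL", "GSW"), G = c(82, 67),
  PER = c(25.3, 11), WS = c(14.8, 2), PTS = c(2034, 362),
  ORB = c(190, 62), PF = c(216, 118)
)

# Assumed set of columns of interest; all_of() errors if a name is missing
keep <- c("Year", "Player", "Pos", "Age", "Tm", "G", "PER", "WS", "PTS")
toy_small <- toy %>% select(all_of(keep))

dim(toy_small)
```

Passing the names as strings through `all_of()` also avoids having to backtick-quote non-syntactic column names such as TS% or 3P in the real dataset.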
present the descriptive information of the dataset(s) using the functions in Challenges 1, 2, and 3;
- for examples: dim(), length(unique()), head();
# Display the structure of the dataset
str(df)
spc_tbl_ [15,107 × 38] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ...1  : num [1:15107] 689 690 691 692 694 696 697 699 703 705 ...
 $ Year  : num [1:15107] 1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
 $ Player: chr [1:15107] "Kareem Abdul-Jabbar*" "Tom Abernethy" "Alvan Adams" "Tiny Archibald*" ...
 $ Pos   : chr [1:15107] "C" "PF" "C" "PG" ...
 $ Age   : num [1:15107] 32 25 25 31 28 25 28 35 23 25 ...
 $ Tm    : chr [1:15107] "LAL" "GSW" "PHO" "BOS" ...
 $ G     : num [1:15107] 82 67 75 80 20 82 77 72 16 73 ...
 $ MP    : num [1:15107] 3143 1222 2168 2864 180 ...
 $ PER   : num [1:15107] 25.3 11 19.2 15.3 9.3 18.1 13.7 14.8 24.1 13.1 ...
 $ TS%   : num [1:15107] 0.639 0.511 0.571 0.574 0.467 0.532 0.533 0.517 0.552 0.513 ...
 $ OWS   : num [1:15107] 9.5 1.2 3.1 5.9 0 4.1 2.1 2.2 0.7 0.4 ...
 $ DWS   : num [1:15107] 5.3 0.8 3.9 2.9 0.2 2.8 1.9 1.2 0.3 2.7 ...
 $ WS    : num [1:15107] 14.8 2 7 8.9 0.2 6.9 3.9 3.4 0.9 3.2 ...
 $ WS/48 : num [1:15107] 0.227 0.08 0.155 0.148 0.043 0.136 0.081 0.09 0.188 0.08 ...
 $ BPM   : num [1:15107] 6.7 -1.6 4.4 0 -2.4 2.5 0.3 0.6 3.3 0.5 ...
 $ VORP  : num [1:15107] 6.8 0.1 3.5 1.5 0 2.7 1.4 1.2 0.3 1.2 ...
 $ FG    : num [1:15107] 835 153 465 383 16 545 384 325 72 299 ...
 $ FGA   : num [1:15107] 1383 318 875 794 35 ...
 $ FG%   : num [1:15107] 0.604 0.481 0.531 0.482 0.457 0.495 0.505 0.422 0.493 0.484 ...
 $ 3P    : num [1:15107] 0 0 0 4 1 16 1 73 8 1 ...
 $ 3PA   : num [1:15107] 1 1 2 18 1 47 3 221 19 5 ...
 $ 3P%   : num [1:15107] 0 0 0 0.222 1 0.34 0.333 0.33 0.421 0.2 ...
 $ 2P    : num [1:15107] 835 153 465 379 15 529 383 252 64 298 ...
 $ 2PA   : num [1:15107] 1382 317 873 776 34 ...
 $ 2P%   : num [1:15107] 0.604 0.483 0.533 0.488 0.441 0.502 0.506 0.458 0.504 0.486 ...
 $ eFG%  : num [1:15107] 0.604 0.481 0.531 0.485 0.471 0.502 0.506 0.469 0.521 0.485 ...
 $ FT    : num [1:15107] 364 56 188 361 5 171 139 143 28 99 ...
 $ FTA   : num [1:15107] 476 82 236 435 13 227 209 153 39 141 ...
 $ FT%   : num [1:15107] 0.765 0.683 0.797 0.83 0.385 0.753 0.665 0.935 0.718 0.702 ...
 $ ORB   : num [1:15107] 190 62 158 59 6 240 192 53 13 126 ...
 $ DRB   : num [1:15107] 696 129 451 138 22 398 264 183 16 327 ...
 $ TRB   : num [1:15107] 886 191 609 197 28 638 456 236 29 453 ...
 $ AST   : num [1:15107] 371 87 322 671 26 159 279 268 31 178 ...
 $ STL   : num [1:15107] 81 35 108 106 7 90 85 80 14 73 ...
 $ BLK   : num [1:15107] 280 12 55 10 4 36 49 28 2 92 ...
 $ TOV   : num [1:15107] 297 39 218 242 11 133 189 152 20 157 ...
 $ PF    : num [1:15107] 216 118 237 218 18 197 268 182 26 246 ...
 $ PTS   : num [1:15107] 2034 362 1118 1131 38 ...
 - attr(*, "spec")=
  .. cols(
  ..   ...1 = col_double(),
  ..   Year = col_double(),
  ..   Player = col_character(),
  ..   Pos = col_character(),
  ..   Age = col_double(),
  ..   Tm = col_character(),
  ..   G = col_double(),
  ..   MP = col_double(),
  ..   PER = col_double(),
  ..   `TS%` = col_double(),
  ..   OWS = col_double(),
  ..   DWS = col_double(),
  ..   WS = col_double(),
  ..   `WS/48` = col_double(),
  ..   BPM = col_double(),
  ..   VORP = col_double(),
  ..   FG = col_double(),
  ..   FGA = col_double(),
  ..   `FG%` = col_double(),
  ..   `3P` = col_double(),
  ..   `3PA` = col_double(),
  ..   `3P%` = col_double(),
  ..   `2P` = col_double(),
  ..   `2PA` = col_double(),
  ..   `2P%` = col_double(),
  ..   `eFG%` = col_double(),
  ..   FT = col_double(),
  ..   FTA = col_double(),
  ..   `FT%` = col_double(),
  ..   ORB = col_double(),
  ..   DRB = col_double(),
  ..   TRB = col_double(),
  ..   AST = col_double(),
  ..   STL = col_double(),
  ..   BLK = col_double(),
  ..   TOV = col_double(),
  ..   PF = col_double(),
  ..   PTS = col_double()
  .. )
 - attr(*, "problems")=<externalptr>
# head
head(df)
# tail
tail(df)
# dimension
dim(df)
[1] 15107 38
# check for missing values
sum(is.na(df))
[1] 0
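A single total hides where any missing values sit, so a per-column count can be a useful companion check, along with the number of distinct players. The sketch below runs on a tiny hypothetical data frame (`toy`, with NAs planted on purpose), not on the real dataset, which reported no missing values.

```r
# Hypothetical toy data with two deliberately planted NAs
toy <- data.frame(
  Player = c("A", "B", "A", "C"),
  Tm     = c("LAL", "BOS", "LAL", NA),
  PER    = c(20, NA, 22, 12)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of distinct players (a player appears once per season played)
n_players <- length(unique(toy$Player))

na_per_column
n_players
```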
conduct summary statistics of the dataset(s); especially show the basic statistics (min, max, mean, median, etc.) for the variables you are interested in.
# Calculate summary statistics for numeric variables
summary(df[, c("PER", "TS%", "WS", "PTS")])
PER TS% WS PTS
Min. :-23.00 Min. :0.0000 Min. :-2.100 Min. : 0.0
1st Qu.: 10.40 1st Qu.:0.4810 1st Qu.: 0.400 1st Qu.: 175.0
Median : 13.10 Median :0.5190 Median : 1.800 Median : 441.0
Mean : 13.18 Mean :0.5121 Mean : 2.775 Mean : 568.5
3rd Qu.: 15.90 3rd Qu.:0.5510 3rd Qu.: 4.200 3rd Qu.: 844.0
Max. : 40.20 Max. :0.9190 Max. :21.200 Max. :3041.0
# Calculate the mean of numeric variables
colMeans(df[, c("PER", "TS%", "WS", "PTS")], na.rm = TRUE)
PER TS% WS PTS
13.1797445 0.5121115 2.7754088 568.4941418
# Calculate the median of numeric variables
sapply(df[, c("PER", "TS%", "WS", "PTS")], median, na.rm = TRUE)
PER TS% WS PTS
13.100 0.519 1.800 441.000
# Calculate the standard deviation of numeric variables
sapply(df[, c("PER", "TS%", "WS", "PTS")], sd, na.rm = TRUE)
PER TS% WS PTS
4.63435300 0.06376001 3.05957652 487.26310544
# Calculate the minimum of numeric variables
sapply(df[, c("PER", "TS%", "WS", "PTS")], min, na.rm = TRUE)
PER TS% WS PTS
-23.0 0.0 -2.1 0.0
# Calculate the maximum of numeric variables
sapply(df[, c("PER", "TS%", "WS", "PTS")], max, na.rm = TRUE)
PER TS% WS PTS
40.200 0.919 21.200 3041.000
# Calculate the quartiles of numeric variables
sapply(df[, c("PER", "TS%", "WS", "PTS")], quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
PER TS% WS PTS
25% 10.4 0.481 0.4 175
50% 13.1 0.519 1.8 441
75% 15.9 0.551 4.2 844
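The separate calls above could also be folded into one table. This is a minimal sketch, assuming a helper named `describe()` (a hypothetical name introduced here) and a small made-up data frame in place of `df`; it returns one column per variable and one row per statistic.

```r
# Hypothetical toy data standing in for df[, c("PER", "TS%", "WS", "PTS")]
toy <- data.frame(
  PER = c(10, 12, 14, 16, 18),
  PTS = c(100, 300, 500, 700, 900)
)

# Hypothetical helper: the same statistics as above, computed in one pass
describe <- function(x) {
  c(
    min    = min(x, na.rm = TRUE),
    q25    = unname(quantile(x, 0.25, na.rm = TRUE)),
    median = median(x, na.rm = TRUE),
    mean   = mean(x, na.rm = TRUE),
    q75    = unname(quantile(x, 0.75, na.rm = TRUE)),
    max    = max(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE)
  )
}

# sapply() binds the named vectors into a statistics-by-variable matrix
stats_table <- sapply(toy, describe)
round(stats_table, 2)
```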
Based on the given dataset, here is a brief description of each column:
- Year: The year (season) to which the row’s statistics correspond.
- Player: The name of the player.
- Pos: The position of the player (C for center, PF for power forward, PG for point guard, SG for shooting guard, SF for small forward).
- Age: The age of the player at the time of data recording.
- Tm: The team abbreviation or code for the team the player belongs to.
- G: The number of games played by the player.
- MP: The total minutes played by the player.
- PER: Player Efficiency Rating, a measure of a player’s overall performance.
- TS%: True Shooting Percentage, a measure of shooting efficiency that accounts for field goals, three-pointers, and free throws.
- OWS: Offensive Win Shares, an estimate of the number of wins contributed by a player’s offense.
- DWS: Defensive Win Shares, an estimate of the number of wins contributed by a player’s defense.
- WS: Win Shares, the sum of offensive and defensive win shares, estimating the number of wins contributed by a player.
- WS/48: Win Shares per 48 minutes, a rate statistic that estimates the number of wins contributed by a player per 48 minutes.
- BPM: Box Plus/Minus, a measure of a player’s overall contribution per 100 possessions.
- VORP: Value Over Replacement Player, an estimate of the points per 100 team possessions a player contributes above a replacement-level player.
- FG: Field Goals made by the player.
- FGA: Field Goals attempted by the player.
- FG%: Field Goal Percentage, the ratio of successful field goals made to field goals attempted.
- 3P: Three-Point Field Goals made by the player.
- 3PA: Three-Point Field Goals attempted by the player.
- 3P%: Three-Point Field Goal Percentage, the ratio of successful three-point field goals made to three-point field goals attempted.
- 2P: Two-Point Field Goals made by the player.
- 2PA: Two-Point Field Goals attempted by the player.
- 2P%: Two-Point Field Goal Percentage, the ratio of successful two-point field goals made to two-point field goals attempted.
- eFG%: Effective Field Goal Percentage, a modified field goal percentage that adjusts for the added value of three-point field goals.
- FT: Free Throws made by the player.
- FTA: Free Throws attempted by the player.
- FT%: Free Throw Percentage, the ratio of successful free throws made to free throws attempted.
- ORB: Offensive Rebounds grabbed by the player.
- DRB: Defensive Rebounds grabbed by the player.
- TRB: Total Rebounds grabbed by the player.
- AST: Assists made by the player.
- STL: Steals made by the player.
- BLK: Blocks made by the player.
- TOV: Turnovers committed by the player.
- PF: Personal Fouls committed by the player.
- PTS: Total points scored by the player.
These columns represent various statistics and metrics related to player performance in basketball.
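The shooting-efficiency definitions can be checked against the rows shown in the str() output. The sketch below assumes the dataset follows the commonly published formulas TS% = PTS / (2 × (FGA + 0.44 × FTA)) and eFG% = (FG + 0.5 × 3P) / FGA; the source does not state which formulas it used, so this is a consistency check, not a confirmation. The four rows are typed in by hand from the output above, and `X3P` is a renamed stand-in for the 3P column.

```r
# First four rows of the str() output, copied by hand
rows <- data.frame(
  Player = c("Kareem Abdul-Jabbar*", "Tom Abernethy", "Alvan Adams", "Tiny Archibald*"),
  PTS = c(2034, 362, 1118, 1131),
  FG  = c(835, 153, 465, 383),
  FGA = c(1383, 318, 875, 794),
  X3P = c(0, 0, 0, 4),      # the 3P column, renamed to a syntactic name
  FTA = c(476, 82, 236, 435)
)

# Assumed formula: true shooting percentage
rows$TS_calc <- round(rows$PTS / (2 * (rows$FGA + 0.44 * rows$FTA)), 3)

# Assumed formula: effective field goal percentage
rows$eFG_calc <- round((rows$FG + 0.5 * rows$X3P) / rows$FGA, 3)

rows[, c("Player", "TS_calc", "eFG_calc")]
```

For these four players the computed values (0.639, 0.511, 0.571, 0.574 for TS% and 0.604, 0.481, 0.531, 0.485 for eFG%) agree with the TS% and eFG% columns in the str() output.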
Part 3. The Tentative Plan for Visualization
The data analyses and visualizations that can be conducted to answer the research questions proposed above include:
Trend analysis over time: Visualize the trends in key metrics such as PER, TS%, WS, or PTS over the years to understand how player performance has evolved. This can be done using line plots or bar plots with the years on the x-axis and the corresponding metric values on the y-axis.
Descriptive statistics: Calculate summary statistics such as mean, median, standard deviation, minimum, maximum, and quartiles for the key metrics to gain insights into the distribution and variability of player performance.
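As a rough sketch of the trend analysis, the metric can first be averaged per year and then drawn as a line. The data frame below is made up for illustration (`toy` is hypothetical), and the mean is only one possible yearly summary; a median or a minutes-weighted mean may be more appropriate on the real data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: three player-seasons per year
toy <- data.frame(
  Year = rep(c(2015, 2016, 2017), each = 3),
  PER  = c(12, 15, 18, 13, 16, 19, 14, 17, 20)
)

# One row per year with the average PER across players
yearly <- toy %>%
  group_by(Year) %>%
  summarise(mean_PER = mean(PER), .groups = "drop")

# Line plot with years on the x-axis and the metric on the y-axis
trend_plot <- ggplot(yearly, aes(x = Year, y = mean_PER)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Mean PER", title = "Mean PER by year (toy data)")

trend_plot
```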
The choice of specific data analyses and visualizations is driven by their ability to provide meaningful insights and answer the research questions effectively. Here’s how some types of statistics and graphs can help:
Bivariate visualization: Scatter plots or correlation matrices can reveal the relationship between two variables. For example, plotting PER against PTS can help determine if there is a strong correlation between a player’s efficiency and scoring ability.
Time-series analysis: Line plots or bar plots over time can showcase the pattern of development and identify any notable trends or changes in player performance. This can help understand how the NBA has evolved over the years.
Summary statistics: Descriptive statistics provide a concise summary of the data, allowing for comparisons between different metrics and understanding the central tendency, dispersion, and shape of the distributions.
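The bivariate idea can be sketched with a scatter plot and a correlation coefficient. The values below are made up for illustration (`toy` is hypothetical), and Pearson correlation is simply the default of `cor()`, not a choice the plan has committed to.

```r
library(ggplot2)

# Hypothetical toy data: efficiency rating and total points
toy <- data.frame(
  PER = c(8, 11, 13, 15, 18, 22, 25),
  PTS = c(150, 400, 520, 800, 1100, 1700, 2000)
)

# Pearson correlation between the two variables
r <- cor(toy$PER, toy$PTS)

# Scatter plot with a straight-line fit to show the direction of the relationship
scatter <- ggplot(toy, aes(x = PER, y = PTS)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "PER", y = "PTS", title = "PER vs. PTS (toy data)")

r
scatter
```

Because PTS is a season total, it also depends on games and minutes played, so a per-game or per-minute version of PTS may give a fairer comparison with PER on the real data.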
To process and prepare the tidy data for the analysis, the following steps can be taken:
Create new variables: If additional variables are required for analysis, create them using functions like mutate() to calculate derived metrics or ratios of player statistics.
Dealing with missing data/NAs: Handle missing data appropriately based on the nature and extent of missingness. This can include removing rows with missing values (na.omit()), imputing missing values with methods such as mean or regression imputation, or applying specific missing-data techniques based on domain knowledge.
Outlier treatment: Identify and handle outliers based on the context of the analysis. This can involve removing outliers, transforming variables to mitigate the impact of outliers, or using robust statistical methods.
The specific data processing and preparation steps may vary depending on the dataset and the research questions at hand.
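The three preparation steps can be sketched on a small made-up data frame (`toy` is hypothetical, with one NA and one extreme value planted on purpose). Mean imputation and the 1.5 × IQR rule are used here only as simple, common defaults; they are assumptions for illustration, not decisions already made for this project.

```r
library(dplyr)

# Hypothetical toy data with one missing PTS value and one extreme value
toy <- data.frame(
  Player = c("A", "B", "C", "D", "E", "F"),
  G      = c(82, 70, 60, 75, 10, 80),
  PTS    = c(1640, 700, NA, 900, 50, 4000)
)

# Option 1 for NAs: drop incomplete rows
complete_only <- na.omit(toy)

# Option 2 for NAs: replace with the column mean; also derive points per game
prepared <- toy %>%
  mutate(
    PTS_per_G   = PTS / G,
    PTS_imputed = if_else(is.na(PTS), mean(PTS, na.rm = TRUE), PTS)
  )

# Outliers: flag values above Q3 + 1.5 * IQR instead of deleting them
q <- unname(quantile(prepared$PTS_imputed, c(0.25, 0.75)))
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
prepared <- prepared %>%
  mutate(high_outlier = PTS_imputed > upper_fence)

prepared
```

Flagging keeps the extreme rows available for inspection, which matters here because an unusually high value may be a genuinely exceptional season and not a data error.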