Project: NBA Player of the Week
Our dataset is a combination of the following NBA-related datasets:

- NBA Player of the Week (1985 - 2019)
  https://www.kaggle.com/jacobbaruch/nba-player-of-the-week
  1,187 Rows, 14 Columns
- NBA Player Salary from basketball-reference.com (1991 - 2017)
  https://www.kaggle.com/whitefero/nba-player-salary-19902017
  11,837 Rows, 7 Columns
- NBA Player Salary from basketball-reference.com (2018 - 2019)
  https://web.archive.org/web/20181002194236/www.basketball-reference.com/contracts/players.html
  578 Rows, 11 Columns
- NBA Player Statistics (1985 - 2019)
  https://www.basketball-reference.com/leagues/
  18,480 Rows, 30 Columns
- NBA Yearly Summary (1985 - 2019)
  https://www.basketball-reference.com/leagues/
  35 Rows, 8 Columns
Combining the aforementioned datasets, we created a dataset in which each row is an NBA player in a given season and each column is a statistic of that player. We filtered the rows so that only players who have both statistics and salary data for that particular season are included.
There are 9,003 Rows and 38 Columns in the dataset.
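As a rough illustration of how the combined dataset can be assembled, here is a minimal pandas sketch; the file names and join keys below are assumptions, not the exact ones used.

```python
import pandas as pd

# Hypothetical file names and join keys; the real files and columns may differ.
stats = pd.read_csv("player_stats_1985_2019.csv")    # per-player, per-season statistics
salary = pd.read_csv("player_salary_1991_2019.csv")  # per-player, per-season salary
potw = pd.read_csv("nba_player_of_the_week.csv")     # weekly award winners

# Flag whether a player won Player of the Week at least once in a given season.
potw_flag = (potw.groupby(["Player", "Year"]).size()
                 .reset_index(name="wins")
                 .assign(Potw=1)[["Player", "Year", "Potw"]])

# The inner join keeps only player-seasons with both statistics and salary;
# the award flag is then left-joined and missing values are set to 0.
df = (stats.merge(salary, on=["Player", "Year"], how="inner")
           .merge(potw_flag, on=["Player", "Year"], how="left"))
df["Potw"] = df["Potw"].fillna(0).astype(int)
```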
Variable | Definition | Type |
---|---|---|
Year | Season (e.g. 1991 means the NBA 1990 - 1991 season) | Numerical |
Player | Player name | Categorical |
Pos | Player position | Categorical |
Age | Age of player on February 1st of that season | Numerical |
Tm | Team of player | Categorical |
G | Number of games played | Numerical |
GS | Number of games started | Numerical |
MP | Minutes played per game | Numerical |
FG | Field Goals per game | Numerical |
FGA | Field Goal attempts per game | Numerical |
FG_Prct | Field Goal percentage | Numerical |
Three_P | 3-Point Field Goals per game | Numerical |
Three_PA | 3-Point Field Goal attempts per game | Numerical |
Three_P_Prct | 3-Point Field Goal percentage | Numerical |
Two_P | 2-Point Field Goals per game | Numerical |
Two_PA | 2-Point Field Goal attempts per game | Numerical |
Two_P_Prct | 2-Point Field Goal percentage | Numerical |
eFG_Prct | Effective Field Goal percentage | Numerical |
FT | Free Throws per game | Numerical |
FTA | Free Throw attempts per game | Numerical |
FT_Prct | Free Throw percentage | Numerical |
ORB | Offensive Rebounds per game | Numerical |
DRB | Defensive Rebounds per game | Numerical |
TRB | Total Rebounds per game | Numerical |
AST | Assists per game | Numerical |
STL | Steals per game | Numerical |
BLK | Blocks per game | Numerical |
TOV | Turnovers per game | Numerical |
PF | Personal Fouls per game | Numerical |
PTS | Points per game | Numerical |
Potw | Was the player named Player of the Week during the season? | Binary |
APG_Leader | Was the player the Assists Per Game Leader for the season? | Binary |
MVP | Was the player named Most Valuable Player for the season? | Binary |
PPG_Leader | Was the player the Points Per Game Leader for the season? | Binary |
RPG_Leader | Was the player the Rebounds Per Game Leader for the season? | Binary |
Rookie | Was the player named Rookie of the Year for the season? | Binary |
WS_Leader | Was the player the Win Shares Leader for the season? | Binary |
Salary | Player salary | Numerical |
From the dataset, we derived two main research problems:

1. Which player statistic contributes the most to a player being named Player of the Week? Since whether a player is named Player of the Week is a binary variable, we approach this problem with a logistic regression model.
2. Which NBA title, including Player of the Week, carries the most weight on a player's salary? Since salary is a numerical variable, we approach this problem with a multiple linear regression model.
For both problems, model selection was performed to find the optimal model, and model diagnostics were performed to mitigate possible issues of heteroscedasticity, multicollinearity, and autocorrelation.
After extracting the relevant player statistics and Player of the Week from the dataset, we plotted the relationship between the statistics and Player of the Week using a scatter plot.
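A minimal sketch of how such scatter plots can be produced with matplotlib, assuming the combined DataFrame is named `df`; the panel layout and the subset of statistics shown are illustrative.

```python
import matplotlib.pyplot as plt

stat_cols = ["G", "MP", "FG", "Three_P", "Two_P", "eFG_Prct", "FT",
             "ORB", "DRB", "AST", "STL", "BLK", "TOV", "PF", "PTS"]

fig, axes = plt.subplots(3, 5, figsize=(20, 10))
for ax, col in zip(axes.ravel(), stat_cols):
    ax.scatter(df[col], df["Potw"], alpha=0.2)   # Potw is 0/1 on the y-axis
    ax.set_xlabel(col)
    ax.set_ylabel("Potw")
plt.tight_layout()
plt.show()
```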
Observing the scatter plot, since `Potw` is a binary variable, the plot did not give us much useful information apart from the differences in the range of statistic values between `Potw = 0` and `Potw = 1`. For every statistic, the range of values appears smaller for `Potw = 1`, with the most pronounced difference in `eFG_Prct`. This discrepancy in range is also evident in the difference in frequency between `Potw = 0` and `Potw = 1`.
Potw | Count | Proportion |
---|---|---|
0 | 8505 | 0.944685 |
1 | 498 | 0.0553149 |
The frequency table shows that `Potw = 0` accounts for about 94% of the data, which is to be expected, since the number of players receiving an award is always much smaller than the number who do not. However, we are not sure whether this imbalance would affect the reliability of the models we build in the regression analysis.
We also plotted the correlation using a heatmap.
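A heatmap of this kind can be drawn with seaborn; `df_stats` below is an assumed DataFrame holding only the numerical statistic columns.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between the player statistics, visualized as a heatmap.
plt.figure(figsize=(12, 10))
sns.heatmap(df_stats.corr(), cmap="coolwarm", center=0, annot=False)
plt.title("Correlation between player statistics")
plt.show()
```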
Observing the heatmap, there is evidence that multicollinearity might exist. For example, the most correlated variables are `Two_P_Prct` and `FG_Prct`, but this is to be expected since `FG_Prct` is derived from `Two_P_Prct`. Similarly, `eFG_Prct` is derived from `FG_Prct`, so the correlation between them is high. Hence, some of these variables, specifically those with direct relationships, will need to be removed prior to regression analysis.
Meanwhile, `TOV`, `AST`, and `STL` are highly correlated with one another. However, turnovers, assists, and steals are plays often made by point guards, so there may be indirect relationships among these variables. Nonetheless, these correlations will need to be addressed in the regression analysis.
As `Pos` is a categorical variable, we first create dummy variables for this predictor.
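A minimal sketch of the dummy encoding, assuming the data sits in a pandas DataFrame `df` and that center (`C`) is used as the reference category:

```python
import pandas as pd

# One-hot encode the position column; drop_first makes C the reference level,
# leaving Pos_PF, Pos_PG, Pos_SF and Pos_SG as dummy predictors.
pos_dummies = pd.get_dummies(df["Pos"], prefix="Pos", drop_first=True, dtype=int)
df = pd.concat([df.drop(columns="Pos"), pos_dummies], axis=1)
```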
Since some statistics are computed from other statistics, there would be strong multicollinearity if we included all of them. Therefore, we drop the following derived statistics for our first model:

- `TRB` = `ORB` + `DRB`
- `FGA` = `FG` / `FG_Prct`
- `Three_PA` = `Three_P` / `Three_P_Prct`
- `Two_PA` = `Two_P` / `Two_P_Prct`
- `FTA` = `FT` / `FT_Prct`
- `PTS` = 3 × `Three_P` + 2 × `Two_P` + `FT`
- `FG` = `Three_P` + `Two_P`
Then we fit the full model using all the remaining player statistics, such as `Age`, `G`, `GS`, `MP`, etc. This gives us the following logistic regression model (Model 1).
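A sketch of how this full model (Model 1) might be fitted with statsmodels; `df` is the assumed DataFrame and the predictor list mirrors the VIF table below.

```python
import statsmodels.api as sm

predictors = ["Age", "G", "GS", "MP", "FG_Prct", "Three_P", "Three_P_Prct",
              "Two_P", "Two_P_Prct", "eFG_Prct", "FT", "FT_Prct", "ORB", "DRB",
              "AST", "STL", "BLK", "TOV", "PF",
              "Pos_PF", "Pos_PG", "Pos_SF", "Pos_SG"]

# Full logistic regression of the Player of the Week indicator on all predictors.
X_full = sm.add_constant(df[predictors].astype(float))
model_1 = sm.Logit(df["Potw"], X_full).fit()
print(model_1.summary())
```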
Given the many highly correlated variables in the heatmap above, as well as the multicollinearity warning when fitting the model, we decided to first use both VIF factors and the deviance test to find removable predictors.
Features | VIF Factor |
---|---|
Age | 23.7475 |
G | 11.2644 |
GS | 6.57655 |
MP | 79.4739 |
FG_Prct | 870.907 |
Three_P | 8.40503 |
Three_P_Prct | 6.21584 |
Two_P | 22.851 |
Two_P_Prct | 122.755 |
eFG_Prct | 755.823 |
FT | 10.2741 |
FT_Prct | 21.833 |
ORB | 11.5458 |
DRB | 18.2727 |
AST | 12.1883 |
STL | 10.1388 |
BLK | 3.92813 |
TOV | 23.6123 |
PF | 20.7312 |
Pos_PF | 2.29591 |
Pos_PG | 4.82529 |
Pos_SF | 3.03752 |
Pos_SG | 3.895 |
We use a function that, at each step, removes the predictor with the maximum VIF, provided that deleting it does not reject H0 in a deviance test (so the reduced model is chosen); a sketch of this procedure is shown below.
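A rough sketch of that procedure; the single-predictor deviance test and the stopping rule below are our reading of the approach, not necessarily the exact code used.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2
from statsmodels.stats.outliers_influence import variance_inflation_factor

def reduce_by_vif(y, X, alpha=0.05):
    """Iteratively drop the predictor with the largest VIF, as long as the
    deviance test does not reject the reduced model (H0)."""
    X = X.astype(float).copy()
    while X.shape[1] > 1:
        vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
        worst = X.columns[int(np.argmax(vifs))]
        full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
        reduced = sm.Logit(y, sm.add_constant(X.drop(columns=worst))).fit(disp=0)
        delta_g = 2 * (full.llf - reduced.llf)   # deviance increase from dropping `worst`
        if delta_g < chi2.ppf(1 - alpha, df=1):  # H0 (reduced model) not rejected: drop it
            X = X.drop(columns=worst)
        else:                                    # dropping would be rejected: stop here
            break
    return X
```

Applied to the full set of predictors, a loop like this yields the reduced predictor set described next.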
Hence, we remove the predictors `FG_Prct`, `eFG_Prct`, `TOV`, `Age`, `MP`, `FT_Prct`, `Two_P_Prct`, and `ORB`, each of which has a high VIF factor and whose removal yields a reduced model with a low ΔG in the deviance test.
With these remaining predictors, we run the logistic regression again; here is our second model (Model 2).
Features | VIF Factor |
---|---|
Two_P | 16.2202 |
DRB | 12.2412 |
PF | 12.0968 |
STL | 9.59197 |
G | 9.12068 |
FT | 8.76224 |
AST | 8.3415 |
GS | 5.37419 |
Three_P_Prct | 4.76798 |
BLK | 3.78012 |
Pos_PG | 3.45641 |
Three_P | 3.4551 |
Pos_SG | 2.54538 |
Pos_SF | 2.13738 |
Pos_PF | 1.90174 |
However, some remaining predictors still have VIF factors larger than 10.
To check whether the reduced model is better than the full model, we perform a deviance test.
Null hypothesis: Reduced Model (Model 2)
Alternative hypothesis: Full Model (Model 1)
ΔG = G(Reduced Model) - G(Full Model) = 13.7661
χ² critical value (α = 0.05, df = 8) = 15.5073
Since ΔG < χ² at the 0.05 significance level, we cannot reject the null hypothesis and therefore choose Model 2.
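For reference, the test statistic can be computed from the two fitted models' log-likelihoods; a short sketch, assuming `model_1` and `model_2` are the full and reduced fits from above:

```python
from scipy.stats import chi2

# ΔG = G(reduced) - G(full) = 2 * (llf_full - llf_reduced)
delta_g = 2 * (model_1.llf - model_2.llf)
df_diff = model_1.df_model - model_2.df_model   # 8 removed predictors
crit = chi2.ppf(0.95, df_diff)                  # ≈ 15.5073 for df = 8

print(delta_g, crit, delta_g < crit)            # ΔG < critical value: keep the reduced model
```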
However, the Wald test shows that there still seem to be some insignificant predictors with p-values larger than 0.05. Therefore, we continue to remove predictors using the deviance test and the Wald test. Here are the removable predictors according to the deviance test.
Deviance test | GS | Three_P_Prct | Pos_PG | Pos_SG |
---|---|---|---|---|
delta_G | 14.9021 | 14.5091 | 14.2247 | 14.3993 |
chi2_crit (0.05) | 16.9190 | 16.9190 | 16.9190 | 16.9190 |
However, position is encoded as dummy variables, so if we drop `Pos_PG` and `Pos_SG`, we would also need to drop the other two position dummies. With that many predictors removed, the deviance test tells us to stick with the full model.
Hence, we only drop the variables `GS` and `Three_P_Prct` and keep the `Pos` dummies.
So far, this is the main logistic model we will use (Model 3).
Features | VIF Factor |
---|---|
Two_P | 15.268 |
DRB | 12.0013 |
PF | 11.9826 |
STL | 9.41759 |
FT | 8.6996 |
G | 8.27271 |
AST | 8.09547 |
BLK | 3.77999 |
Pos_PG | 2.90391 |
Three_P | 2.76944 |
Pos_SG | 2.12558 |
Pos_SF | 1.80177 |
Pos_PF | 1.73616 |
The VIF table above indicates that a multicollinearity problem remains in this model. However, we choose not to drop the predictors with high VIF factors, since both the deviance test and the Wald test consider them significant.
From the graph above, we can see that some studentized residuals have absolute values larger than 3, which indicates there may be outliers or influential points causing heteroscedasticity.
To find the outliers and influential points, we plot the residuals as well as Cook's distance.
Based on Cook's distance, DFFITS, and the studentized residuals, we find 316 influential points. Since these 316 observations account for only about 5% of the total, we drop them and rerun the model.
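A sketch of how such diagnostics can be obtained with statsmodels by refitting the logistic model as a binomial GLM, so that `get_influence()` is available; the cutoffs are common rules of thumb and the DFFITS check is omitted, so this is an illustration rather than the exact code used. `X3` and `y` stand for Model 3's design matrix and response.

```python
import numpy as np
import statsmodels.api as sm

# Refit Model 3 as a binomial GLM (equivalent to the logistic regression)
# so that get_influence() provides per-observation diagnostics.
glm3 = sm.GLM(y, X3, family=sm.families.Binomial()).fit()
infl = glm3.get_influence()

n = len(y)
cooks = infl.cooks_distance[0]      # first element of the (distance, p-value) pair
student = infl.resid_studentized    # studentized residuals

# Rule-of-thumb cutoffs (assumed for illustration): Cook's distance > 4/n
# or |studentized residual| > 3 flags a potentially influential point.
flag = (cooks > 4 / n) | (np.abs(student) > 3)
X4, y4 = X3[~flag], y[~flag]        # drop flagged observations
model_4 = sm.Logit(y4, X4).fit()    # refit as Model 4
```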
Here is our final model (Model 4). To confirm whether it is the best of the models we have fitted, we compare the AIC and BIC of the four models above.
Model | AIC | BIC |
---|---|---|
Model 1 | 1656.71 | -80147.9 |
Model 2 | 1654.48 | -80207 |
Model 3 | 1652.39 | -80223.3 |
Model 4 | 203.846 | -78484.6 |
Before dropping the outliers and influential points, Model 3 has the lowest AIC and BIC, showing that Model 3 is better than Models 1 and 2. After we drop the outliers and influential points, the AIC of Model 4 decreases substantially while its BIC increases slightly, so we choose Model 4 as our final model.
After removing the outliers, the residual plots look better.
Here we visualize how π changes with the model.
Predictors | βi | e^(βi) |
---|---|---|
Intercept | -43.98931819527114 | 7.86469427844486e-20 |
G | 0.16684022971613738 | 1.181565471176185 |
Three_P | 2.489105636067465 | 12.050493772492525 |
Two_P | 1.9363248937885942 | 6.9332237580619385 |
FT | 1.3465680005479181 | 3.8442095399977703 |
DRB | 1.0637258193048331 | 2.89714514483543 |
AST | 0.42443823739809583 | 1.52873139427614 |
STL | 1.9159062275216838 | 6.793092095448223 |
BLK | 1.3372375113410921 | 3.808507999802208 |
PF | -1.438304019319382 | 0.23732992451989693 |
Pos_PF | -1.6188977597716383 | 0.1981169512519482 |
Pos_PG | 2.351391076003565 | 10.50016609913466 |
Pos_SF | -2.4763565102424177 | 0.08404889969899432 |
Pos_SG | -0.8486571509347313 | 0.4279892712297531 |
- `Intercept`: with all other predictors at zero, the odds of a player winning Player of the Week are 7.8647e-20, which is essentially zero.
- `G`: holding the other variables constant, the odds of winning POTW increase by about 18% for each additional game played.
- `Three_P`: holding the other variables constant, the odds increase by about 11 times for each additional 3-point field goal per game.
- `Two_P`: holding the other variables constant, the odds increase by about 6 times for each additional 2-point field goal per game.
- `FT`: holding the other variables constant, the odds increase by about 2.8 times for each additional free throw per game.
- `DRB`: holding the other variables constant, the odds increase by about 1.9 times for each additional defensive rebound per game.
- `AST`: holding the other variables constant, the odds increase by about 53% for each additional assist per game.
- `STL`: holding the other variables constant, the odds increase by about 5.8 times for each additional steal per game.
- `BLK`: holding the other variables constant, the odds increase by about 2.8 times for each additional block per game.
- `PF`: holding the other variables constant, the odds decrease by about 77% for each additional personal foul per game.
- `Pos_PF`: holding the other variables constant, the odds for a power forward are about 80% lower than for a center.
- `Pos_PG`: holding the other variables constant, the odds for a point guard are about 9.5 times higher than for a center.
- `Pos_SF`: holding the other variables constant, the odds for a small forward are about 92% lower than for a center.
- `Pos_SG`: holding the other variables constant, the odds for a shooting guard are about 57% lower than for a center.

To summarize, the model indicates that 3-point field goals per game carry the most weight in determining whether a player is named Player of the Week. In addition, a point guard has a better chance of winning Player of the Week than players at other positions. A player who wants to improve his chance of winning Player of the Week should increase his 2-point field goals, free throws, steals, assists, blocks, and defensive rebounds per game, while reducing personal fouls.
Intercept | G | Three_P | Two_P | FT | DRB | AST | STL | BLK | PF | Pos_PF | Pos_PG | Pos_SF | Pos_SG | Predicted πi |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 56 | 0.9 | 2.1 | 1 | 2.5 | 1.5 | 0.6 | 0.3 | 1.9 | 0 | 0 | 0 | 0 | 0 |
1 | 82 | 5.1 | 9.3 | 0 | 11.1 | 10.7 | 2.4 | 2.7 | 3.8 | 0 | 1 | 0 | 0 | 1 |
We use the median statistics of 2019 and the maximum statistics of 2019 for prediction. As a result, a player with median performance has essentially a 0% chance of winning POTW, while a player with maximum performance has a 99.99% chance. This indicates that, to some extent, our model can predict whether a player will win POTW based on his performance.
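A sketch of how the two predicted probabilities can be computed from the final fit, assuming `model_4` is the fitted Model 4 and that the column order matches the design matrix used for fitting:

```python
import pandas as pd
import statsmodels.api as sm

cols = ["G", "Three_P", "Two_P", "FT", "DRB", "AST", "STL", "BLK", "PF",
        "Pos_PF", "Pos_PG", "Pos_SF", "Pos_SG"]

# Median and maximum 2019 statistics, taken from the table above.
new_players = pd.DataFrame(
    [[56, 0.9, 2.1, 1, 2.5, 1.5, 0.6, 0.3, 1.9, 0, 0, 0, 0],
     [82, 5.1, 9.3, 0, 11.1, 10.7, 2.4, 2.7, 3.8, 0, 1, 0, 0]],
    columns=cols)

pi_hat = model_4.predict(sm.add_constant(new_players, has_constant="add"))
print(pi_hat)   # roughly 0.0000 for the median player, 0.9999 for the max player
```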
After extracting the relevant player titles and salary from the dataset, we plotted the relationship between the titles and salary using a scatter plot.
Observing the scatter plot, since all of the variables are binary except for `Year`, it is difficult to interpret the relationships from the scatter plot. Looking at the `Year` plot, there is evidently a positive linear relationship between `Year` and `Salary`. What is also interesting about the `Year` graph is that, despite the positive relationship, the range of salary values also increased every season.
This change in range may have been affected by the increase in observations over the years.
Decade | Count |
---|---|
1990 | 2416 |
2000 | 2998 |
2010 | 3589 |
From the frequency table, we can clearly see the increase in observations over the seasons. The cause of this is unknown; either there is a steady increase in the number of players, or a steady increase in the data collected. Nonetheless, this is worth looking into and being cautious about during regression analysis.
We also plotted the correlation using a heatmap.
Observing the heatmap, the overall correlation seems quite low, except between `WS_Leader` and `MVP`. This means that, except for MVP and Win Shares Leader, having one NBA title does not automatically entitle a player to another. It also suggests that multicollinearity is likely not an issue in this regression analysis.
We first fitted the full model with the variables `Year`, `Potw`, `APG_Leader`, `MVP`, `PPG_Leader`, `RPG_Leader`, `Rookie`, and `WS_Leader`.
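A sketch of this full linear model fit, again with statsmodels and the assumed DataFrame `df`:

```python
import statsmodels.api as sm

titles = ["Year", "Potw", "APG_Leader", "MVP", "PPG_Leader",
          "RPG_Leader", "Rookie", "WS_Leader"]

# Multiple linear regression of salary on the season year and the title indicators.
X_titles = sm.add_constant(df[titles].astype(float))
salary_model_1 = sm.OLS(df["Salary"], X_titles).fit()
print(salary_model_1.summary())
```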