Greetings everyone! As a quick introduction, my name is Matt and I'm a machine learning engineer from the Pittsburgh area. This is my first of hopefully many posts for Bill & CFBD.
I love CFB and have been trying to gather enough data to predict game outcomes for the last 4 years. Thankfully, I stumbled upon Bill's amazing work last October and was able to do just that! I can't thank Bill enough for making this rich data so easily accessible, and, of course, open to the public!
Today I will be discussing the architecture of a few models that I've created to predict game winners and losers, as well as the raw spread. All of the models are gradient boosted decision trees made with LightGBM and NGBoost. I'll largely focus on how the algorithms work, but I would be remiss if I left out information about the data that I'm using. So, let's start with a quick overview of the data preparation.
I built my dataset by calling Bill's API using Python's "requests" package. This included every game Bill has logged from 1997 - present. I wrote the data to a local MySQL instance on my machine, spread across 18 different tables. From these tables, I generated a dataset that consists of 17,662 unique games. The predictive models take into consideration 714 different independent variables for each game. All of these variables fall into some general buckets:
- Game location/home vs. away
- Recruiting success over the last 4 years
- Returning player usage and production
- Pregame spreads and win probabilities (where available)
- Current and previous rank information
- Team talent
- Aggregated statistics for a team to-date. Things such as average points per game, average opponent points per game, average havoc rating per game, average QB hurries, etc. So, for example, if Oklahoma is on their 5th game in 2020, these statistics are summarized from the previous 4 games in some aggregated manner.
- Aggregated statistics for a team to-date vs. top 25 opponents
All the information from the statistical buckets listed above is also included for the opponent that the team is playing. For example, if Oklahoma were playing Texas, we would have all of this information in a single row for Oklahoma, and that same row would also carry this information about Texas.
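To make the "to-date" idea concrete, here is a minimal sketch of how such features could be built. It is not my production pipeline: the field names (game_id, season, team, opponent, points, opp_points) and both helper functions are hypothetical, and the scores are made up. The key point it illustrates is that a game's features only use games played before it.

```python
from collections import defaultdict


def add_to_date_averages(games):
    """Attach each team's pregame scoring averages to every game row.

    `games` must be sorted chronologically. Only earlier games in the same
    season feed the averages, so a row never sees its own result.
    """
    history = defaultdict(list)  # (season, team) -> [(points, opp_points), ...]
    rows = []
    for game in games:
        prior = history[(game["season"], game["team"])]
        n = len(prior)
        row = dict(game)
        row["games_played"] = n
        # None marks "no games yet" (a season opener); boosted trees can handle nulls
        row["avg_points"] = sum(p for p, _ in prior) / n if n else None
        row["avg_opp_points"] = sum(o for _, o in prior) / n if n else None
        rows.append(row)
        # Record this game's result only after its features are computed
        prior.append((game["points"], game["opp_points"]))
    return rows


def add_opponent_features(rows):
    """Copy the opponent's pregame features onto each row with an opp_ prefix."""
    by_game_and_team = {(r["game_id"], r["team"]): r for r in rows}
    combined = []
    for row in rows:
        opp = by_game_and_team.get((row["game_id"], row["opponent"]), {})
        out = dict(row)
        for name in ("games_played", "avg_points", "avg_opp_points"):
            out["opp_" + name] = opp.get(name)
        combined.append(out)
    return combined


def game(game_id, team, opponent, points, opp_points):
    """Build one team-perspective row of made-up data for the 2020 season."""
    return {"game_id": game_id, "season": 2020, "team": team,
            "opponent": opponent, "points": points, "opp_points": opp_points}


games = [
    game(1, "Team A", "Team B", 35, 21), game(1, "Team B", "Team A", 21, 35),
    game(2, "Team A", "Team B", 25, 31), game(2, "Team B", "Team A", 31, 25),
    game(3, "Team A", "Team B", 28, 24), game(3, "Team B", "Team A", 24, 28),
]
rows = add_opponent_features(add_to_date_averages(games))
print(rows[4])  # Team A's third game, described only by its first two games
```

The real dataset has many more aggregated columns, but each one follows the same pattern: summarize what happened before kickoff, then join the opponent's summary onto the same row.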
Now that we've briefly covered the data preparation, we can move into predictive modeling. There were two outcomes that I was most interested in: picking money line winners and predicting the actual spread. Based on the nature of the data, I thought that a gradient boosted decision tree (GBDT) would be the most appropriate algorithm to estimate these targets.
GBDT's are one of the most popular and powerful algorithms available for structured data problems. There are many advantages of boosted trees, including:
- The model learns over rounds of training
- GBDT's can learn more complicated interactions between covariates
- Fewer assumptions are made about the data patterns compared to other algorithms such as linear regression
- Boosted trees can find patterns in missing (null) data
- They're robust to redundant variables
- GBDT's are extremely scalable
So, how do GBDT's accomplish all of this? If you are familiar with simple decision trees and random forests, you know that those models choose each split to optimize some objective, such as accuracy, Gini impurity, or entropy. At their core, GBDT's do the same thing. However, instead of simply measuring the increase in accuracy or the largest decrease in mean squared error due to a split (depending on the target), a boosted tree measures the gradient, across all observations, of a loss function that the model seeks to minimize, which is tied directly to the objective function. For example, a boosted tree could minimize log loss in a binary classification model, or mean squared error in a regression problem, while a metric such as area under the curve is monitored alongside it.
When the gradient is measured with respect to this loss, we can assign an appropriate score, or weight, to each corresponding leaf, essentially measuring our confidence in the prediction that was made. This gives us a richer interpretation of our results beyond classification. To make our predictions more robust, each round of learning adds the tree that most reduces the loss given the predictions so far, and the process repeats over subsequent rounds to create a final, ensembled model. As in random forests, the final prediction combines many trees, but here the trees are built one after another rather than independently.
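The boosting loop itself is easier to see in code. Below is a toy sketch, assuming squared-error loss, a single made-up feature, and one-split trees ("stumps"). It is not how LightGBM is implemented (LightGBM also uses second-order gradient information and histogram-based splits, among other things), and every name and number here is hypothetical. With squared error, the negative gradient for each observation is simply its residual, so each round fits a small tree to the current residuals.

```python
def fit_stump(xs, targets):
    """Find the one-split tree on a single feature that minimizes squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split leaves both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value = sum(left) / len(left)     # leaf weight = mean target in the leaf
        right_value = sum(right) / len(right)
        error = (sum((t - left_value) ** 2 for t in left)
                 + sum((t - right_value) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, rounds=50, learning_rate=0.1):
    """Fit an additive ensemble of stumps, one per round."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps, losses = [], []
    for _ in range(rounds):
        # Negative gradient of 0.5 * (y - pred)^2 with respect to pred
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, losses


def predict(base, stumps, x, learning_rate=0.1):
    """The ensemble prediction is the base value plus every tree's scaled output."""
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Made-up data: a talent gap (x) and the resulting point margin (y)
xs = [-3, -2, -1, 0, 1, 2, 3, 4]
ys = [-21, -10, -7, 3, 6, 14, 17, 28]
base, stumps, losses = boost(xs, ys)
print(round(losses[0], 2), round(losses[-1], 2))  # training loss at first vs. last round
```

Each round only nudges the predictions by a fraction (the learning rate) of what the new tree suggests, which is why many rounds are needed and why the training loss shrinks gradually.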
There are many hyperparameters available when building these models. I used a hyperparameter tuning package, hyperopt, to get a good baseline, and then manually tuned the parameters to avoid overfitting. Overfitting is a phenomenon in data science where the model has memorized patterns in the training data that don't necessarily hold in testing data it hasn't seen before, leading to overconfidence in the accuracy measured during training. Some of the parameters that I focused on included:
- Maximum depth: How deep the tree can grow
- Number of leaves: Maximum number of leaves a tree can have when grown leaf-wise
- Early stopping rounds: Number of rounds a model can learn without improvement before stopping
- L1 and L2 regularization: Helps penalize the model for complexity to avoid overfitting
- Column sample by tree: Fraction of columns randomly sampled for consideration when constructing each tree
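For reference, the parameters above map onto LightGBM parameter names roughly as follows. The parameter names are LightGBM's own, but the values for depth, leaves, regularization, and column sampling are placeholders for illustration, not my tuned settings. Only the round count and early stopping value match what I describe in this post.

```python
# Illustrative LightGBM parameter dictionary for the winner/loser model.
# Values marked "placeholder" are NOT the tuned values from this post.
lgbm_params = {
    "objective": "binary",          # binary classification: win vs. loss
    "metric": "binary_logloss",     # the loss monitored during training
    "max_depth": 6,                 # placeholder: how deep each tree can grow
    "num_leaves": 31,               # placeholder: max leaves per tree
    "lambda_l1": 1.0,               # placeholder: L1 regularization
    "lambda_l2": 1.0,               # placeholder: L2 regularization
    "feature_fraction": 0.8,        # placeholder: column sample by tree
    "num_iterations": 100_000,      # upper bound on training rounds
    "early_stopping_round": 10,     # stop after 10 rounds without improvement
}
```

For the spread model, the same dictionary would presumably switch to a regression objective with RMSE as the metric. One common rule of thumb is to keep num_leaves at or below 2 ** max_depth, since a tree of that depth cannot hold more leaves than that.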
Before training, I divided my data into a training and a testing set, with the training set including all games before 2019 and the testing set including all completed games from 2019 - present. For predicting money line winners, I let my model train for up to 100,000 rounds with early stopping set to 10 while monitoring binary logloss. Early stopping monitors how well the model is doing during each round of training and stops training once it hasn't improved in n rounds, which in this case is 10. The algorithm finished training after 1,354 rounds, yielding an AUC of 0.88 on train and 0.87 on test. This gave me an overall accuracy of 76% when predicting winners vs. losers on my test set, using the probability threshold that maximized accuracy.
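The early stopping rule can be sketched in a few lines. This is a simplified stand-in for what the library does internally, assuming a list of per-round losses is already available; the function name and the loss values are made up for illustration.

```python
def find_stopping_round(losses, patience=10):
    """Return (best_round, stopped_at) for a sequence of per-round losses.

    Training stops once `patience` consecutive rounds pass without the
    monitored loss improving on its best value so far. Rounds are 1-indexed.
    """
    best_round, best_loss = 0, float("inf")
    for round_number, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_round, best_loss = round_number, loss
        elif round_number - best_round >= patience:
            return best_round, round_number
    # Ran out of rounds before the patience limit was hit
    return best_round, len(losses)


# Made-up losses: three improving rounds, then a plateau
example_losses = [0.70, 0.65, 0.60] + [0.61] * 20
print(find_stopping_round(example_losses, patience=10))  # (3, 13)
```

In this toy run the best round is 3, and training halts at round 13 after ten rounds without improvement. This is why a limit of 100,000 rounds is only an upper bound: the model keeps the best round it found and stops long before reaching the limit.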
As we all know, predicting the spread is much harder! Using the same dataset, I used LightGBM to predict the actual point margin (a team's points minus their opponent's points) while monitoring root mean squared error (RMSE). The model again trained for up to 100,000 rounds with early stopping set to 10. I also implemented an NGBoost model to monitor for confidence, but I think that I will save that for a later post! The model achieved an RMSE of 14.79 on the training set and 15.72 on the test set. For context, Ed Feng's model has an RMSE of 16.0, so these are promising results.
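For anyone unfamiliar with RMSE, here is a small sketch of how it is computed on point margins. The function name and the example margins are made up; the formula is the standard one.

```python
import math


def rmse(actual_margins, predicted_margins):
    """Root mean squared error between actual and predicted point margins.

    A margin is the team's points minus the opponent's points.
    """
    if len(actual_margins) != len(predicted_margins):
        raise ValueError("inputs must have the same length")
    squared_errors = [(a - p) ** 2
                      for a, p in zip(actual_margins, predicted_margins)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up example: the model misses by 10 points in each direction
print(rmse([7, -3], [-3, 7]))  # 10.0
```

Because the errors are squared before averaging, a handful of blowouts that the model badly misses will move RMSE much more than many small misses, which is part of why spreads are so hard to predict.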
Model Results in 2020
The model has been running in production for most games since September 26, 2020. This includes predictions for 131 unique games.
As it stands, the model has correctly predicted 98/131 games for winners vs. losers (75% accuracy). This is encouraging since it's close to the same accuracy seen during development (and considering the amount of uncertainty over the course of the year).
For the spread, the model currently has an RMSE of 16.77 in production. Again, these are encouraging results, since that's only about a point off of what it produced in development. In addition to RMSE, I have made my own selections based on my spread predictions compared to DraftKings' most popular opening lines. I have correctly picked 64/127 games, with 4 pushes so far (50% accuracy), which is about what is seen in other public models such as Ed Feng's. Of course, in real life we would never make picks on every game, but rather on the games that we are most confident about. I've also monitored my select picks on some alternate spreads (follow me on Twitter to get weekly picks @CFB_Spreads!). Out of the 35 spreads that I've predicted, I correctly picked 24 (68.6% accuracy).
As the season progresses, I would expect the model to get slightly more accurate since more data is being accumulated about these teams.
Thank you for reading! If you are interested in following my weekly picks, you can follow me on Twitter @CFB_Spreads. Feel free to reach out via email or Twitter if you are interested in hearing about some other topics in data science, or if you have any questions. In my upcoming post, I'll be talking about analyzing variable importance in my predictive models.