Talking Tech: Building a March Madness Model using XGBoost

In this edition of Talking Tech, we'll be building our first basketball model. Specifically, we'll use XGBoost to predict games for March Madness.

In one of the earliest iterations of Talking Tech, we built a random forest classifier to predict play calls for college football. In another edition, we used my personally preferred method of building an artificial neural network to predict college football games. In this edition, we're going to dive into another type of machine learning method. Like the earlier walkthrough using a random forest classifier, we'll look at another type of ensemble method. An ensemble method builds numerous disparate models and relies on strength through sheer numbers. In the random forest method, a multitude of decision trees is generated and their outputs are all gathered together into the final output. In this post, we're going to use an ensemble method that's a little less, well, random.

Gradient Boosting

Gradient boosting is similar in many ways to random forest methods. Both are ensemble models. Both typically make use of decision trees. Both can also be used for either classification or regression. So what sets them apart? If you remember, random forest methods typically generate a multitude of decision trees at random, counting on the erroneous trees to cancel each other out, more or less, while the stronger trees rise to the top. Gradient boosting, on the other hand, will start with one decision tree, evaluate it, and then use the resulting error to generate another decision tree that is incrementally more accurate. Rinse and repeat.

Eventually, this results in a multitude of trees all chained together, each one using the insights from its predecessors to make itself more accurate. But you don't simply discard the older models. All generated trees make up the final model, which makes this another ensemble method. You can see how this method might perform much better than random forests. In fact, gradient boosted decision tree models are usually some of the top performing in Kaggle competitions and the like.
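To make that chaining concrete, here is a toy sketch of the core loop for squared error, where each new tree is fit to the residuals (the negative gradient) of the ensemble so far. The helper names are just for illustration; XGBoost layers regularization, clever split finding, and much more on top of this basic idea.


import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fit(X, y, n_rounds=100, learning_rate=0.1):
    # Each new tree is fit to the residuals left over by the ensemble so far.
    trees = []
    pred = np.zeros(len(y))
    for _ in range(n_rounds):
        residuals = y - pred                     # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)
        pred += learning_rate * tree.predict(X)  # shrink each tree's contribution
        trees.append(tree)
    return trees

def boosted_predict(trees, X, learning_rate=0.1):
    # The final model sums every tree -- nothing gets discarded.
    return sum(learning_rate * tree.predict(X) for tree in trees)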

When it comes to gradient boosting in Python, there are two libraries with which I am familiar: XGBoost and LightGBM. While both libraries are solid options, we're going to be using XGBoost in this post. However, I do recommend going back and giving LightGBM a look at some point.



Gathering Data

We will be using the CBBD Python library to pull data from the CollegeBasketballData.com REST API. In total, we will be using these packages: cbbd, pandas, sklearn (installed as scikit-learn), and xgboost. Be sure to have those all installed via pip or however you manage your Python dependencies. We will start by importing everything we need up front. We will also set up our CBBD API key, so enter yours into the placeholder below. If you need a key, you can acquire one from the main CBBD site.


import cbbd
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

configuration = cbbd.Configuration(
    access_token = 'your_api_key_here'
)

I should also note that we will be making a total of 22 API calls, well within the free tier of 1000 monthly calls provided by CBBD and enough to rerun this model many times over.

Next, we will compile all NCAA tournament games from the 2014 through 2024 seasons. You can go further back if you desire. Note that we are passing in a parameter of tournament='NCAA'. This allows us to conveniently query all tournament games for a given year.


games = []
with cbbd.ApiClient(configuration) as api_client:
    games_api = cbbd.GamesApi(api_client)
    for season in range(2024, 2013, -1):
        results = games_api.get_games(season=season, tournament='NCAA')
        games += results
len(games)

That returned 686 games. Let's see what data is included in a game record.


games[0]
GameInfo(id=12010, source_id='401638579', season_label='20232024', season=2024, season_type=<SeasonType.POSTSEASON: 'postseason'>, start_date=datetime.datetime(2024, 3, 19, 18, 40, tzinfo=datetime.timezone.utc), start_time_tbd=False, neutral_site=True, conference_game=False, game_type='TRNMNT', tournament='NCAA', game_notes="Men's Basketball Championship - West Region - First Four", status=<GameStatus.FINAL: 'final'>, attendance=0, home_team_id=114, home_team='Howard', home_conference_id=18, home_conference='MEAC', home_seed=16, home_points=68, home_period_points=[27, 41], home_winner=False, away_team_id=341, away_team='Wagner', away_conference_id=21, away_conference='NEC', away_seed=16, away_points=71, away_period_points=[38, 33], away_winner=True, excitement=4.7, venue_id=76, venue='UD Arena', city='Dayton', state='OH')

Now we need to load up some stats to incorporate as features into our model. We will use the CBBD Stats API to query for team season stats for the same years for which we queried tournament game data. Note that we are passing in a season_type='regular' parameter. THIS IS IMPORTANT. We want to ONLY grab statistics for the regular season. In other words, stats that were available prior to the start of the tournament in a given year. Failing to pass in that filter will result in a model that is not predictive, but retrodictive. This is a VERY common mistake: people include data and statistics that were not available at the time of the games they are seeking to predict.

Anyway, run the code below to grab team season stats.


stats = []
with cbbd.ApiClient(configuration) as api_client:
    stats_api = cbbd.StatsApi(api_client)
    for season in range(2024, 2013, -1):
        results = stats_api.get_team_season_stats(season=season, season_type='regular')
        stats += results
len(stats)

And we'll also check out the contents of the stats records.


stats[0]
TeamSeasonStats(season=2024, season_label='20232024', team_id=1, team='Abilene Christian', conference='WAC', games=32, wins=15, losses=17, total_minutes=1325, pace=61.1, team_stats=TeamSeasonUnitStats(field_goals=TeamSeasonUnitStatsFieldGoals(pct=43.2, attempted=1877, made=811), two_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=46.4, attempted=1393, made=646), three_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=34.1, attempted=484, made=165), free_throws=TeamSeasonUnitStatsFieldGoals(pct=73.1, attempted=729, made=533), rebounds=TeamSeasonUnitStatsRebounds(total=1070, defensive=756, offensive=314), turnovers=TeamSeasonUnitStatsTurnovers(team_total=12, total=404), fouls=TeamSeasonUnitStatsFouls(flagrant=0, technical=6, total=635), points=TeamSeasonUnitStatsPoints(fast_break=319, off_turnovers=466, in_paint=1138, total=2320), four_factors=TeamSeasonUnitStatsFourFactors(free_throw_rate=38.8, offensive_rebound_pct=29.3, turnover_ratio=0.2, effective_field_goal_pct=47.6), assists=405, blocks=65, steals=253, possessions=2028, rating=114.4, true_shooting=52.8), opponent_stats=TeamSeasonUnitStats(field_goals=TeamSeasonUnitStatsFieldGoals(pct=46.5, attempted=1792, made=833), two_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=52.6, attempted=1227, made=645), three_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=33.3, attempted=565, made=188), free_throws=TeamSeasonUnitStatsFieldGoals(pct=68.7, attempted=723, made=497), rebounds=TeamSeasonUnitStatsRebounds(total=1171, defensive=859, offensive=312), turnovers=TeamSeasonUnitStatsTurnovers(team_total=23, total=478), fouls=TeamSeasonUnitStatsFouls(flagrant=0, technical=6, total=619), points=TeamSeasonUnitStatsPoints(fast_break=316, off_turnovers=411, in_paint=1120, total=2351), four_factors=TeamSeasonUnitStatsFourFactors(free_throw_rate=40.3, offensive_rebound_pct=26.6, turnover_ratio=0.2, effective_field_goal_pct=51.7), assists=388, blocks=108, steals=206, possessions=2023, rating=116.2, true_shooting=55.7))

That's a lot of stats! The final step here is to match the team statistics with each game record and put those into a data frame. We are going to create a list of dict objects to combine this data, which will be pretty easy to load up into pandas.

In the code below, we are converting each game object into a dict, querying team stats for the home and away team, and then loading data points from each stats object into the dict. You can completely change these up if you desire or add different stats. I am not trying to build the most comprehensive or accurate model in this exercise. I am merely trying to give you a good idea of how to combine the data and get it into the correct format.


records = []
for game in games:
    record = game.to_dict()
    home_stats = [stat for stat in stats if stat.team_id == game.home_team_id and stat.season == game.season][0]
    away_stats = [stat for stat in stats if stat.team_id == game.away_team_id and stat.season == game.season][0]
    record['home_pace'] = home_stats.pace
    record['home_o_rating'] = home_stats.team_stats.rating
    record['home_d_rating'] = home_stats.opponent_stats.rating
    record['home_free_throw_rate'] = home_stats.team_stats.four_factors.free_throw_rate
    record['home_offensive_rebound_rate'] = home_stats.team_stats.four_factors.offensive_rebound_pct
    record['home_turnover_ratio'] = home_stats.team_stats.four_factors.turnover_ratio
    record['home_efg'] = home_stats.team_stats.four_factors.effective_field_goal_pct
    record['home_free_throw_rate_allowed'] = home_stats.opponent_stats.four_factors.free_throw_rate
    record['home_offensive_rebound_rate_allowed'] = home_stats.opponent_stats.four_factors.offensive_rebound_pct
    record['home_turnover_ratio_forced'] = home_stats.opponent_stats.four_factors.turnover_ratio
    record['home_efg_allowed'] = home_stats.opponent_stats.four_factors.effective_field_goal_pct
    record['away_pace'] = away_stats.pace
    record['away_o_rating'] = away_stats.team_stats.rating
    record['away_d_rating'] = away_stats.opponent_stats.rating
    record['away_free_throw_rate'] = away_stats.team_stats.four_factors.free_throw_rate
    record['away_offensive_rebound_rate'] = away_stats.team_stats.four_factors.offensive_rebound_pct
    record['away_turnover_ratio'] = away_stats.team_stats.four_factors.turnover_ratio
    record['away_efg'] = away_stats.team_stats.four_factors.effective_field_goal_pct
    record['away_free_throw_rate_allowed'] = away_stats.opponent_stats.four_factors.free_throw_rate
    record['away_offensive_rebound_rate_allowed'] = away_stats.opponent_stats.four_factors.offensive_rebound_pct
    record['away_turnover_ratio_forced'] = away_stats.opponent_stats.four_factors.turnover_ratio
    record['away_efg_allowed'] = away_stats.opponent_stats.four_factors.effective_field_goal_pct
    records.append(record)
len(records)
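A quick side note on efficiency: those list comprehensions scan the entire stats list twice per game, which is fine for ~700 games but slow if you scale this up. If you expand the data set, a dict keyed on season and team id gives constant-time lookups instead. A minimal sketch (stats_lookup is just my own name for it):


# Index stats by (season, team_id) so each lookup is O(1) instead of a full scan.
stats_lookup = {(stat.season, stat.team_id): stat for stat in stats}

home_stats = stats_lookup[(games[0].season, games[0].home_team_id)]
away_stats = stats_lookup[(games[0].season, games[0].away_team_id)]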

All that's left to do is load this into a data frame. Once loaded up, I am going to compute a new column for the final score margin based on the home and away score columns.


df = pd.DataFrame(records)
df['margin'] = df.homePoints - df.awayPoints
df.head()
id sourceId seasonLabel season seasonType startDate startTimeTbd neutralSite conferenceGame gameType ... away_d_rating away_free_throw_rate away_offensive_rebound_rate away_turnover_ratio away_efg away_free_throw_rate_allowed away_offensive_rebound_rate_allowed away_turnover_ratio_forced away_efg_allowed margin
0 12010 401638579 20232024 2024 SeasonType.POSTSEASON 2024-03-19 18:40:00+00:00 False True False TRNMNT ... 98.3 26.2 31.4 0.2 45.4 29.1 25.4 0.2 47.9 -3
1 12009 401638580 20232024 2024 SeasonType.POSTSEASON 2024-03-19 21:10:00+00:00 False True False TRNMNT ... 102.0 32.4 23.5 0.2 55.4 31.4 28.4 0.2 48.8 -25
2 12023 401638581 20232024 2024 SeasonType.POSTSEASON 2024-03-20 18:40:00+00:00 False True False TRNMNT ... 114.5 39.1 29.7 0.2 48.9 32.6 32.2 0.2 49.0 -7
3 12022 401638582 20232024 2024 SeasonType.POSTSEASON 2024-03-20 21:28:00+00:00 False True False TRNMNT ... 102.7 35.3 27.0 0.2 55.3 28.1 29.1 0.2 49.3 -7
4 12022 401638582 20232024 2024 SeasonType.POSTSEASON 2024-03-20 21:28:00+00:00 False True False TRNMNT ... 102.7 35.3 27.0 0.2 55.3 28.1 29.1 0.2 49.3 -7

5 rows × 58 columns
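One thing you may notice in the preview above: some games appear more than once (rows 3 and 4 share the same id, for example). If you would rather count each game exactly once, deduplicating on the id column is a one-liner; just be aware that adopting it will shift the row counts and numbers you see later in this post.


deduped = df.drop_duplicates(subset='id').reset_index(drop=True)
len(deduped)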



Training the Model

The first step here is feature selection. Let's see what columns are currently included in the data frame.


df.columns
Index(['id', 'sourceId', 'seasonLabel', 'season', 'seasonType', 'startDate',
       'startTimeTbd', 'neutralSite', 'conferenceGame', 'gameType',
       'tournament', 'gameNotes', 'status', 'attendance', 'homeTeamId',
       'homeTeam', 'homeConferenceId', 'homeConference', 'homeSeed',
       'homePoints', 'homePeriodPoints', 'homeWinner', 'awayTeamId',
       'awayTeam', 'awayConferenceId', 'awayConference', 'awaySeed',
       'awayPoints', 'awayPeriodPoints', 'awayWinner', 'excitement', 'venueId',
       'venue', 'city', 'state', 'home_pace', 'home_o_rating', 'home_d_rating',
       'home_free_throw_rate', 'home_offensive_rebound_rate',
       'home_turnover_ratio', 'home_efg', 'home_free_throw_rate_allowed',
       'home_offensive_rebound_rate_allowed', 'home_turnover_ratio_forced',
       'home_efg_allowed', 'away_pace', 'away_o_rating', 'away_d_rating',
       'away_free_throw_rate', 'away_offensive_rebound_rate',
       'away_turnover_ratio', 'away_efg', 'away_free_throw_rate_allowed',
       'away_offensive_rebound_rate_allowed', 'away_turnover_ratio_forced',
       'away_efg_allowed', 'margin'],
      dtype='object')

We are going to pull out the columns we will be using, namely the features for training and the output we will be training against (margin).


features = [
    'home_o_rating',
    'home_d_rating',
    'home_pace',
    'home_free_throw_rate',
    'home_offensive_rebound_rate',
    'home_turnover_ratio',
    'home_efg',
    'home_free_throw_rate_allowed',
    'home_offensive_rebound_rate_allowed',
    'home_turnover_ratio_forced',
    'home_efg_allowed',
    'away_o_rating',
    'away_d_rating',
    'away_pace',
    'away_free_throw_rate',
    'away_offensive_rebound_rate',
    'away_turnover_ratio',
    'away_efg',
    'away_free_throw_rate_allowed',
    'away_offensive_rebound_rate_allowed',
    'away_turnover_ratio_forced',
    'away_efg_allowed',
    'homeSeed',
    'awaySeed'
]

outputs = ['margin']

df[features + outputs]
home_o_rating home_d_rating home_pace home_free_throw_rate home_offensive_rebound_rate home_turnover_ratio home_efg home_free_throw_rate_allowed home_offensive_rebound_rate_allowed home_turnover_ratio_forced ... away_offensive_rebound_rate away_turnover_ratio away_efg away_free_throw_rate_allowed away_offensive_rebound_rate_allowed away_turnover_ratio_forced away_efg_allowed homeSeed awaySeed margin
0 107.8 106.2 67.4 41.9 31.0 0.2 52.4 39.2 33.5 0.2 ... 31.4 0.2 45.4 29.1 25.4 0.2 47.9 16 16 -3
1 103.6 96.8 59.4 25.1 26.9 0.1 49.3 25.7 27.2 0.2 ... 23.5 0.2 55.4 31.4 28.4 0.2 48.8 10 10 -25
2 111.7 109.8 65.2 29.7 22.2 0.2 54.5 35.9 26.5 0.2 ... 29.7 0.2 48.9 32.6 32.2 0.2 49.0 16 16 -7
3 113.6 101.3 65.2 36.8 30.7 0.2 52.2 31.9 24.8 0.2 ... 27.0 0.2 55.3 28.1 29.1 0.2 49.3 10 10 -7
4 113.6 101.3 65.2 36.8 30.7 0.2 52.2 31.9 24.8 0.2 ... 27.0 0.2 55.3 28.1 29.1 0.2 49.3 10 10 -7
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
681 118.4 96.6 59.2 43.4 32.5 0.2 52.7 32.6 31.7 0.2 ... 28.5 0.2 51.4 35.5 36.4 0.2 43.9 1 7 -10
682 118.4 96.6 59.2 43.4 32.5 0.2 52.7 32.6 31.7 0.2 ... 28.5 0.2 51.4 35.5 36.4 0.2 43.9 1 7 -10
683 120.4 105.2 61.2 44.1 26.5 0.1 53.1 25.9 29.1 0.2 ... 35.6 0.2 49.7 37.8 36.1 0.2 45.0 2 8 -1
684 115.2 101.1 61.7 38.6 28.5 0.2 51.4 35.5 36.4 0.2 ... 35.6 0.2 49.7 37.8 36.1 0.2 45.0 7 8 6
685 115.2 101.1 61.7 38.6 28.5 0.2 51.4 35.5 36.4 0.2 ... 35.6 0.2 49.7 37.8 36.1 0.2 45.0 7 8 6

686 rows × 25 columns

Again, you can feel free to mix that up. If you added or changed any of the statistics in the prior section, this is where you will need to incorporate them.

We will now split our data set into training data and testing data. Training data will be used in training the model. Testing data is pulled back to test out the model once it's ready to go. In this example, I am pulling 2024 tournament games as my test set. If you are running through this looking to make predictions on tourney games that are in the future, you can pull those games instead (assuming you pulled games and statistics for that season into the data set).


training = df.query("season != 2024").copy()
testing = df.query("season == 2024").copy()

We are going to further split the training data into training and validation sets. The training set is what is actually fed into the model, whereas the validation set is held out so we can check whether the model is actually improving on games it has never seen. This mechanism mitigates overfitting onto the training data. (In this walkthrough we will mainly use it to score the model; later on, we'll also see how to wire it directly into training via early stopping.)


X_train, X_valid, y_train, y_valid = train_test_split(training[features], training[outputs], train_size=0.8, test_size=0.2, random_state=0)

Note that this splits the training features (X) out from the expected outputs (y). In the example above, we are randomly holding back 20% of the dataset to be used for validation.

We are ready to train! We will be using XGBRegressor since we want our gradient boosting model to perform regression. If we were doing classification, we would use XGBClassifier (more on that in a moment).


model = XGBRegressor(random_state=0)
model.fit(X_train, y_train)
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=0, ...)
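Speaking of classification: if you would rather predict winners directly instead of margins, the same workflow applies with XGBClassifier. Here is a minimal sketch, reusing the split from above and treating a home win as the positive class:


from xgboost import XGBClassifier

clf = XGBClassifier(random_state=0)
clf.fit(X_train, (y_train['margin'] > 0).astype(int))  # 1 = home team won
home_win_prob = clf.predict_proba(X_valid)[:, 1]       # estimated P(home team wins)

Back to our regression model.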

And just like that, we have a trained model! We can make predictions against our validation set.


predictions = model.predict(X_valid)
predictions
array([-1.87790477e+00,  7.16752386e+00,  1.32060270e+01,  6.78795004e+00,
        1.44662819e+01, -2.85689831e+00, -8.69423985e-01,  8.75045967e+00,
        3.85790849e+00, -6.43919373e+00, -8.83276880e-01,  6.97011662e+00,
        4.38355398e+00,  8.06833267e+00, -8.77752018e+00,  5.22899723e+00,
        2.80364990e+00,  3.31810045e+00, -9.09639931e+00, -1.38665593e+00,
        4.66550255e+00,  3.16841202e+01,  9.18671894e+00, -2.34628081e+00,
        1.58264847e+01,  9.93082142e+00,  9.44772053e+00,  1.88728504e+01,
        2.87765160e+01,  3.31487012e+00,  1.30118427e+01, -1.30986392e-01,
        5.33917189e+00,  8.50678921e+00, -3.34483713e-01,  2.57094145e+00,
        1.66184235e+01,  5.99199915e+00, -2.74236417e+00,  1.33841276e+00,
       -5.50944662e+00, -8.56299973e+00,  9.36406422e+00,  1.27445345e+01,
       -5.79891968e+00,  9.32999039e+00,  4.99850559e+00,  1.41290035e+01,
        1.27072744e+01,  5.49775696e+00,  2.92133301e-01,  2.85389748e+01,
       -2.77683735e+00,  1.41666784e+01,  1.65023022e+01,  6.03557158e+00,
        2.24876385e+01, -5.69163513e+00,  5.78824818e-01,  2.18679352e+01,
        1.81881466e+01,  6.27820158e+00, -3.48073578e+00, -2.05786265e-02,
        2.38070393e+01,  7.80937290e+00,  2.68855405e+00,  1.00340958e+01,
        1.03051748e+01,  6.70673037e+00, -4.66818810e+00,  1.42929211e+01,
        5.93736887e+00,  2.18488560e+01, -3.96203065e+00, -6.01904249e+00,
        1.15123062e+01,  1.06525719e+00, -5.60221529e+00, -2.91650534e+00,
        8.13025475e+00, -2.16232657e+00, -7.38539994e-02, -7.47696776e-03,
        6.57202673e+00,  3.21248150e+00,  3.89195323e-01,  2.67519027e-01,
       -1.49262440e+00, -5.93076229e+00,  1.55619888e+01, -9.42352295e-01,
        6.86150503e+00,  2.09990826e+01, -2.62024927e+00, -3.10824728e+00,
        1.55272758e+00,  6.41326475e+00,  2.17659950e+00,  2.06855249e+00,
        1.48680840e+01,  3.38636231e+00,  1.16376562e+01, -1.75216424e+00,
        1.12170439e+01,  1.02640734e+01,  1.19243898e+01,  6.55053318e-01,
        1.79168587e+01,  1.12861748e+01,  1.15750656e+01, -1.21279058e+01,
       -6.30171585e+00,  2.97097254e+00,  5.94197321e+00, -1.26525140e+00,
        1.78847879e-01,  1.99955502e+01,  1.16229486e+01,  9.16914749e+00,
        1.56323729e+01,  2.16536427e+01,  4.01582432e+00,  2.84138560e-01],
      dtype=float32)

Since our validation set contains games that have already been played, we can use these predictions to calculate the mean absolute error (or any other metric) for our model.


mae = mean_absolute_error(predictions, y_valid)
mae

7.965800762176514

I got an MAE of ~7.96. I'll be honest, I have no idea how good that is since I'm a bit newer to basketball modeling. Based on my reading, an MAE of around 6.5 is pretty good. So, this is perhaps not great, but it's a good starting point. My goal is not to have the best model but to walk you through the process. It will be up to you to make changes and get better predictions.
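For context, it can also help to score a couple of naive baselines on the same validation set; if the model can't beat these, something is wrong. A quick sketch:


import numpy as np

# Baseline 1: call every game a toss-up (predicted margin of 0).
print(mean_absolute_error(np.zeros(len(y_valid)), y_valid))

# Baseline 2: predict the average training margin for every game.
print(mean_absolute_error(np.full(len(y_valid), y_train['margin'].mean()), y_valid))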

What might fine-tuning look like? For one, we can update the parameters on the model. The code snippet below runs through the same process as above but explicitly sets the number of estimators, the learning rate, and the number of jobs for the model.


model = XGBRegressor(n_estimators=100, learning_rate=0.05, n_jobs=4)
model.fit(X_train, y_train)
predictions = model.predict(X_valid)
mae = mean_absolute_error(predictions, y_valid)
mae

7.976924419403076

As you can see, my MAE is not any better, but you can play around with those parameters and see if you get anything different. The best way to improve this will likely come from tweaking the input features and adding more stats.
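One more parameter trick worth knowing: this is where the validation set can participate in training itself. XGBoost can keep adding trees until validation error stops improving and then roll back to the best iteration. A sketch, assuming a recent XGBoost version where early_stopping_rounds and eval_metric are constructor arguments (older versions passed them to fit instead); model_es is just my own name so we don't clobber the model above:


model_es = XGBRegressor(
    n_estimators=1000,         # upper bound; early stopping picks the actual count
    learning_rate=0.05,
    eval_metric='mae',
    early_stopping_rounds=50,  # stop after 50 rounds without validation improvement
    random_state=0,
)
model_es.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)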

Let's go back to our testing set, generate predictions, and compare them to actual results from the 2024 NCAA Tournament.


predictions = model.predict(testing[features])
testing['prediction'] = predictions
testing[['homeSeed', 'homeTeam', 'awaySeed', 'awayTeam', 'margin', 'prediction']]
homeSeed homeTeam awaySeed awayTeam margin prediction
0 16 Howard 16 Wagner -3 4.429741
1 10 Virginia 10 Colorado State -25 0.494260
2 16 Montana State 16 Grambling -7 -0.163861
3 10 Boise State 10 Colorado -7 0.399193
4 10 Boise State 10 Colorado -7 0.399193
... ... ... ... ... ... ...
65 1 Purdue 2 Tennessee 6 -4.878470
66 4 Duke 11 NC State -12 0.975319
67 1 Purdue 11 NC State 13 12.650157
68 1 UConn 4 Alabama 14 6.204337
69 1 UConn 1 Purdue 15 0.927093

70 rows × 6 columns

Let's calculate the actual percentage of games our model correctly picked straight up.


testing.query("(margin < 0 and prediction < 0) or (margin > 0 and prediction > 0)").shape[0] / testing.shape[0]

0.6428571428571429

My model correctly predicted games in the 2024 Tournament at a 64.3% clip. Let's look at just the first round. I'm going to use the gameNotes property (which contains round information) to filter down to first round games.


testing[testing['gameNotes'].str.contains('1st')].query("(margin < 0 and prediction < 0) or (margin > 0 and prediction > 0)").shape[0] / testing[testing['gameNotes'].str.contains('1st')].shape[0]

0.696969696969697

For the first round, I'm at a slightly better 69.696969% clip (nice).
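Before moving on, it's also worth a peek at which inputs the trees actually lean on. The scikit-learn wrapper exposes relative importance scores via feature_importances_, which line up with our features list:


# Rank features by how heavily the trees rely on them.
importances = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
print(importances.head(10))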

At this point, we should save our model so that we can load it up and use it at a later time.


model.save_model('xgboostmodel')

This exports the model to a file. Replace xgboostmodel above with a filename of your choosing, especially if you want to train and save multiple models. (Newer versions of XGBoost infer the serialization format from the file extension, so a name like xgboostmodel.json avoids a format warning.) If we want to use our model later on to make predictions, we can load it up as follows.


model = XGBRegressor()
model.load_model('xgboostmodel')

Let's say I wanted to predict a hypothetical matchup that hasn't yet occurred and isn't even scheduled. This would be useful in, for example, filling out a bracket. Here is an example of how I might do that with a reusable method.


# Re-open a client here; the one from earlier was closed when its with-block exited.
with cbbd.ApiClient(configuration) as api_client:
    stats_api = cbbd.StatsApi(api_client)
    stats = stats_api.get_team_season_stats(season=2025, season_type='regular')

def predict_game(model, stats, projected_home_seed, home_team, projected_away_seed, away_team):
    home_stats = [stat for stat in stats if stat.team == home_team][0]
    away_stats = [stat for stat in stats if stat.team == away_team][0]
    record = {
        'home_o_rating': home_stats.team_stats.rating,
        'home_d_rating': home_stats.opponent_stats.rating,
        'home_pace': home_stats.pace,
        'home_free_throw_rate': home_stats.team_stats.four_factors.free_throw_rate,
        'home_offensive_rebound_rate': home_stats.team_stats.four_factors.offensive_rebound_pct,
        'home_turnover_ratio': home_stats.team_stats.four_factors.turnover_ratio,
        'home_efg': home_stats.team_stats.four_factors.effective_field_goal_pct,
        'home_free_throw_rate_allowed': home_stats.opponent_stats.four_factors.free_throw_rate,
        'home_offensive_rebound_rate_allowed': home_stats.opponent_stats.four_factors.offensive_rebound_pct,
        'home_turnover_ratio_forced': home_stats.opponent_stats.four_factors.turnover_ratio,
        'home_efg_allowed': home_stats.opponent_stats.four_factors.effective_field_goal_pct,
        'away_o_rating': away_stats.team_stats.rating,
        'away_d_rating': away_stats.opponent_stats.rating,
        'away_pace': away_stats.pace,
        'away_free_throw_rate': away_stats.team_stats.four_factors.free_throw_rate,
        'away_offensive_rebound_rate': away_stats.team_stats.four_factors.offensive_rebound_pct,
        'away_turnover_ratio': away_stats.team_stats.four_factors.turnover_ratio,
        'away_efg': away_stats.team_stats.four_factors.effective_field_goal_pct,
        'away_free_throw_rate_allowed': away_stats.opponent_stats.four_factors.free_throw_rate,
        'away_offensive_rebound_rate_allowed': away_stats.opponent_stats.four_factors.offensive_rebound_pct,
        'away_turnover_ratio_forced': away_stats.opponent_stats.four_factors.turnover_ratio,
        'away_efg_allowed': away_stats.opponent_stats.four_factors.effective_field_goal_pct,
        'homeSeed': projected_home_seed,
        'awaySeed': projected_away_seed
    }
    return model.predict(pd.DataFrame([record]))[0]
    
predict_game(model, stats, 5, 'Michigan', 11, 'Dayton')

np.float32(6.149086)

In the above example, I loaded up data from the current season, created a method that constructs a data frame record using the required features, and then called that method to get a prediction, passing in a model, a stats collection, and the teams' projected seeds and names. This model predicts that Michigan as a 5 seed would beat Dayton as an 11 seed by 6.1 points. Voila!
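One wrinkle worth noting: tournament games are played on neutral courts, but the model still sees an arbitrary "home" and "away" ordering in its features, so predict_game(A, B) and predict_game(B, A) won't be exact mirror images. A cheap way to wash that out is to average the two orderings:


# Average over both orderings to cancel out any home/away asymmetry the model learned.
m1 = predict_game(model, stats, 5, 'Michigan', 11, 'Dayton')
m2 = predict_game(model, stats, 11, 'Dayton', 5, 'Michigan')
michigan_margin = (m1 - m2) / 2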



And this is where I leave you. As mentioned, there are many improvements that can be made to get this thing ready for prime time. There were many features returned by the Stats API that we aren't even using. And none of our stats are opponent-adjusted. And you aren't limited to the Stats API, either. Try incorporating other endpoints or even other data sources.

As always, let me know what you think on Twitter, Bluesky, Discord, etc. And good luck with your brackets!