CFBD Blog

🏈 Revamping Win Probability for 2025

Bill Radjewski — Thu, 18 Sep 2025 20:00:28 GMT

Imagine this: it’s the fourth quarter, tie game, your team has the ball on the opponent’s 1-yard line with one second left. What are the odds they actually win?

That’s the kind of question a win probability model is built to answer and it’s one I’ve been calculating for years. But until now, the system powering those numbers was showing its age. For 2025, I’ve completely overhauled win probability on CollegeFootballData.com, replacing outdated models with a modernized, better-calibrated engine that understands not just the flow of regulation, but also the unique dynamics of clutch time and overtime.

The Old Way (Retired Models)

The previous version of win probability was powered by two models: one for regulation and one for overtime. These were built years ago using a now-obsolete JavaScript library called SynapticJS. They've generally worked well enough, but they had serious drawbacks:

They were essentially black boxes, with no good way to measure calibration or error.
There was no mechanism for handling rare, high-leverage scenarios, like the one-second, goal-line situation above.
And practically, SynapticJS is no longer maintained, making the models brittle and hard to improve.
Lastly, how many people are training machine learning models in JavaScript? There's a reason JS is used primarily for the web while most ML happens in the Python (and R) ecosystems.

In short, they were due for replacement.

The New Models (2025 Revamp)

For the new season, I’ve rebuilt the system from the ground up in Python using XGBoost, a modern machine learning library that’s fast, well-supported, and ideal for structured sports data.

Instead of two opaque models, there are now three specialized models:

Regulation Model – trained on all non-overtime plays, handles the bulk of game situations.
Clutch Time Model – trained specifically on close games in the final minutes, where every play can swing the outcome.
Overtime Model – trained only on overtime possessions, which are fundamentally different because of college football’s unique rules.

The regulation and clutch models are combined into a blended approach: the regulation model drives most of the game, while the clutch model gradually takes over in high-leverage late situations. This way, the system is both broadly calibrated and sharply tuned for the moments that matter most.
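To make the blending idea concrete, here is a minimal sketch of one way it could work. The weighting function, the five-minute window, and the eight-point margin cutoff are illustrative assumptions, not the values or the logic used on the site.

```python
def clutch_weight(seconds_remaining: float, score_margin: float,
                  window: float = 300.0, max_margin: float = 8.0) -> float:
    """Hypothetical weight in [0, 1] given to the clutch model.

    The weight is 0 outside the final `window` seconds or when the margin
    exceeds `max_margin`, and ramps linearly to 1 as the clock runs out.
    """
    if abs(score_margin) > max_margin or seconds_remaining >= window:
        return 0.0
    return 1.0 - max(seconds_remaining, 0.0) / window


def blended_win_prob(p_regulation: float, p_clutch: float,
                     seconds_remaining: float, score_margin: float) -> float:
    """Weighted average of the regulation and clutch model outputs."""
    w = clutch_weight(seconds_remaining, score_margin)
    return (1.0 - w) * p_regulation + w * p_clutch
```

With these assumed settings, a tie game with 2:30 left weights the two models equally, while a two-score game ignores the clutch model entirely.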

Calibration and Results

A major advantage of the new models is that they can be tested and evaluated, which is something the old SynapticJS models couldn’t do.

For each of the three models, I generated calibration curves that compare predicted probabilities to actual outcomes. The closer the line is to the diagonal, the better calibrated the model is. The results show:

Regulation Model

Clutch Model

Overtime Model
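For readers who want to build a similar check for their own models, a calibration curve can be sketched in a few lines. This is a generic binning approach, assuming ten equal-width bins. It is not necessarily how the curves above were produced.

```python
def calibration_curve(probs, outcomes, n_bins=10):
    """Return (mean predicted, observed win rate, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    curve = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        curve.append((mean_pred, observed, len(members)))
    return curve
```

Plotting mean predicted probability against observed win rate gives the curve. A well-calibrated model puts those points close to the diagonal.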

The bottom line: the new system doesn’t just look smarter; it actually measures smarter.

Clutch Time in Action

One of the biggest improvements comes from how the new system handles endgame situations.

In the old model, a tie game with one second left and the ball on the opponent’s 1-yard line might have been treated like a coin flip with ~50% win probability. That never felt right.

With the new blended approach, the clutch model takes over, recognizing this as a near-certain win for the offense. Scenarios that used to break the model now produce realistic, intuitive results.

This “clutch awareness” makes the new win probability charts much more believable, especially in the final minutes of close games.

🚀 New Tools on the Site

Along with the revamped models, I’ve added a brand-new Win Probability Calculator to the site. This tool lets you plug in the game situation (score, time remaining, down, distance, and field position) and instantly see the home team’s win probability. Behind the scenes, it uses the new regulation + clutch blended model, so the numbers reflect both the general flow of a game and the pressure of high-leverage moments.

Advanced box scores and data for all 2025 matchups and beyond have been using this new blended model. And every win probability chart you see during the season will now run on the new models. You’ll notice smoother, more realistic shifts, especially late in games where the old system struggled.

Finally, the Excitement Index, a measure of how thrilling a game is based on swings in win probability, has been using the updated engine during the 2025 season. Because the clutch model is sharper, excitement ratings will better capture the drama of close finishes.
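As a rough illustration of a swing-based measure, the sketch below sums the absolute play-to-play changes in win probability. The exact Excitement Index formula is not given here, so treat this as an assumed stand-in rather than the site's calculation.

```python
def swing_total(win_probs):
    """Sum of absolute changes between consecutive win probability values."""
    return sum(abs(b - a) for a, b in zip(win_probs, win_probs[1:]))
```

A game that sits near 90% throughout scores low, while a back-and-forth finish piles up large swings and scores high.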

Overtime Model

Overtime in college football is a world of its own: possessions starting at the opponent’s 25, alternating turns, and since 2021, two-point shootouts. That structure makes overtime play fundamentally different from regulation, which is why a dedicated model was necessary.

The new overtime model is trained only on overtime possessions and captures those dynamics directly. Its calibration curve shows a solid fit, giving confidence in the numbers when games head into extra frames.

Takeaways & What’s Next

The old SynapticJS models got us this far, but they were opaque and unmeasurable. The new system is:

Transparent – feature sets are clear, and the models can be tested.

Calibrated – probabilities better match reality across all game states.

Clutch-aware – no more 36% win probability in a one-yard, one-second tie.

Specialized – overtime handled with its own dedicated model.

This overhaul powers not just the charts you see on the site, but also the new calculator and updated Excitement Index.

Looking ahead, I hope to extend these improvements into other areas like live win probability updates during games, deeper situational models (e.g., 4th-down decisions), and expanded API access for developers.

Closing

The 2025 season marks a new era for win probability on CollegeFootballData.com. Whether you’re following along live, exploring charts after the fact, or testing “what if” scenarios in the calculator, the numbers you see are powered by smarter, sharper, clutch-ready models.

2025 is here, and win probability just got a whole lot smarter.

Submitting CFBD predictions with HTTP requests

John Edwards — Sat, 13 Sep 2025 14:04:09 GMT

A week ago on the College Football Data Discord, some folks were discussing the difficulties of updating their predictions for the CFBD Model Pick'em Contest.

I see this complaint a fair amount–it is difficult to track all of the games that are available to pick, not to mention significant changes to the outlook of different games (a team's starting QB being scratched with injury, for instance). That's why one of my top tips for doing well in the CFBD Model Pick'em is to automate everything! This does not just mean how you make your predictions, but how you submit your predictions as well.

Thanks to some new features implemented by Bill, I have since moved beyond the Selenium-based pipeline I implemented a few years ago, and now my entire CFBD Model Pick'em pipeline relies on a series of simple HTTP calls. In this post, I will demonstrate how to format and execute these calls using cURL. cURL is an open-source command-line tool (backed by the libcurl library) for transferring data to and from servers.

It is extremely unlikely that you will write these pipelines exclusively in cURL–rather, you will likely use a cURL wrapper library in your language of choice. Fortunately, the fantastic free website curlconverter.com will allow you to copy and paste valid cURL commands and convert them to the language of your choice (R, Python, etc.)

Obtaining your token

To begin, we will need to obtain a token for submitting to the game. We will first need to sign up for the predictions game if we have not done so already, then log in with our account. Visit predictions.collegefootballdata.com and sign in with one of the available options.

Once logged in, go to predictions.collegefootballdata.com/api/auth/token. You will see a long string of characters–this is your prediction token. This is a unique identifier that the CFBD Model Pick'em API will use to check that it is genuinely you submitting your picks, and not someone else.

Two very important notes:

This token is different from your basic CFBD API key and these cannot be used interchangeably! So do not swap them around–you cannot use your car key to open your house and vice versa!
Do not share this token with anyone! If you give this token to someone else, they can log into your account and access your predictions and information.

This token will work for one month. You can simply set a reminder for yourself once a month to update the prediction token when convenient.

Getting games to pick

Now that we have our token, we can begin to make HTTP requests to the site using cURL.

The most basic HTTP request is a GET request–when we make a GET request, we are asking the url we are querying to get us data and return it to us in some format. We first need to specify the web url we are trying to query, which is the picks endpoint of the CFBD Model Pick'em API. This API contains the list of games for which we can submit picks in a given week.

curl 'https://predictionsapi.collegefootballdata.com/api/picks'

{"error":"Unauthorized"}

Bummer! We cannot see the games to pick unless we can prove we have a CFBD Model Pick'em account. No matter, we will just need to give it our token. To do this, we will need to pass in our token as a header, which is specified with the -H flag. Note the backslashes (\) in our request: they let us split the command across multiple lines, which makes our requests more readable.

Much like querying the CFBD API, we simply add the header 'authorization: Bearer {your token here--no brackets!}' to our basic request.

curl 'https://predictionsapi.collegefootballdata.com/api/picks' \
  -H 'authorization: Bearer {your token here!}'

[{"id":401754531,"season":2025,"seasonType":"regular","week":3,"homeId":154,"homeTeam":"Wake Forest","awayId":152,"awayTeam":"NC State","spread":7.5,"pickId":120172,"pick":[REDACTED]},
...
{"id":401752921,"season":2025,"seasonType":"regular","week":14,"homeId":130,"homeTeam":"Michigan","awayId":194,"awayTeam":"Ohio State","spread":5.5,"pickId":104248,"pick":[REDACTED]}]

With this request, we have raw JSON data representing all of the games we have to pick! The id for each game returned by your request is identical to the id for games returned by requests to the CFBD API, so you can easily determine which games you need to predict for the contest.
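For those working in Python, the same request can be sketched with the standard library. The function names below are made up for illustration, and the sketch assumes the endpoint returns the JSON array shown above.

```python
import json
import urllib.request

PICKS_URL = "https://predictionsapi.collegefootballdata.com/api/picks"


def build_picks_request(token: str) -> urllib.request.Request:
    """Build (but do not send) the authorized GET request for the picks endpoint."""
    return urllib.request.Request(
        PICKS_URL,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def parse_game_ids(body: str) -> list[int]:
    """Extract the game ids from the JSON array the endpoint returns."""
    return [game["id"] for game in json.loads(body)]


# To actually send the request:
# with urllib.request.urlopen(build_picks_request(my_token)) as resp:
#     game_ids = parse_game_ids(resp.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes the pipeline easier to test without hitting the site.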

Submitting predictions

Suppose we have our prediction for the Michigan/Ohio State game for the end of the season–we predict Michigan will win by 3.5 points (Bill will not let me publish this blog post if I do not have Michigan winning). We want to submit our prediction to the site. How can we? We have three options:

We can manually submit our prediction on the website.
We can use the CSV import button on the website to submit our prediction for the game and any other games we want to make predictions for.
We can use cURL to make another HTTP request and submit our predictions algorithmically!

The third option is going to integrate most seamlessly into any prediction pipeline we build. To do this, we can craft another cURL command, this time making a POST request.

A POST request is kind of like sending a letter–you put what you want to send in your envelope, address it, and then POST it in the mail.

Just like before, we will need to include authorization for our request. Then, as a second header, we will need to tell the server what format the data we are sending is in. In this case, we are sending it some JSON. Finally, we use --data-raw to send the formatted JSON that reflects the pick we are submitting (passing data this way also makes cURL issue a POST instead of a GET):

curl 'https://predictionsapi.collegefootballdata.com/api/picks' \
  -H 'authorization: Bearer {your token here!}' \
  -H 'content-type: application/json' \
  --data-raw '{"gameId":401752921,"pick":-3.5}'

We don't get any output to our console with this request, but if we check the website, we can see that our submission went through to the predictions page!

Our final prediction
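The POST request translates to Python in much the same way. This is a sketch using only the standard library; the helper name is hypothetical, and the payload fields simply mirror the cURL example above.

```python
import json
import urllib.request

PICKS_URL = "https://predictionsapi.collegefootballdata.com/api/picks"


def build_pick_submission(token: str, game_id: int, pick: float) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits one pick."""
    payload = json.dumps({"gameId": game_id, "pick": pick}).encode("utf-8")
    return urllib.request.Request(
        PICKS_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To actually submit the pick:
# urllib.request.urlopen(build_pick_submission(my_token, 401752921, -3.5))
```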

Wrapping it up

Keep in mind to use HTTP requests responsibly–you do not want to spam a website with HTTP requests, as this can cause an unintentional denial of service or "DoS" attack or cause your IP to be limited (or even banned!) if you are not careful. Make sure you put adequate time in between HTTP requests to allow the website enough time to process your requests.
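One simple way to space out requests is to pause between submissions. In this sketch, `submit` stands in for whatever function sends a single pick, and the one-second default is an arbitrary assumption rather than a documented limit.

```python
import time


def submit_all(picks, submit, delay_seconds=1.0, sleep=time.sleep):
    """Call `submit(game_id, pick)` for each pick, pausing between calls."""
    results = []
    for i, (game_id, pick) in enumerate(picks.items()):
        if i > 0:
            sleep(delay_seconds)  # no pause needed before the first request
        results.append(submit(game_id, pick))
    return results
```

Passing `sleep` as a parameter is only there so the pause can be swapped out in tests.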

This should arm you with the tools to quickly pull in, predict, and submit your forecasts to the CFBD Model Pick'em! If you have data structured with a prediction for each CFBD game ID, submitting your predictions becomes a cinch. And because of how many languages allow you to submit HTTP requests, it should take very little work to submit predictions automatically using whatever language you use to generate predictions! Enjoy and best of luck in the prediction contest!

10 Data-Driven Visualizations That Will Change the Way You Watch College Football

Bill Radjewski — Wed, 10 Sep 2025 19:30:23 GMT

We all love the scoreboard, but sometimes it doesn't tell the whole story. That’s where data visualizations come in. They bring out the trends, the truths, and the surprises that raw box scores can’t capture.

Here are ten of my favorite charts, built from opponent-adjusted metrics and team-level data, that offer a deeper look into how the game is really played from both sides of the ball.

1. Success Rate: Standard Downs vs. Passing Downs

This pair of charts shows how teams perform in different game situations. Offensively, it's about staying efficient whether you're ahead of schedule or in a hole. Defensively, it’s about getting stops when it matters most.

Offense

Teams in the top right are effective on both standard and passing downs. The bottom left highlights units that struggle to stay on track or recover from setbacks.

Defense

On the defensive side, top-left teams shut down early-down runs and force passing situations but then struggle. Those in the bottom right may clean up on 3rd-and-long but struggle to contain base plays.

2. Line Yards vs. EPA per Rush

How much push does your line get, and what are your backs doing with it? And on defense, are you stonewalling rushers or getting gashed despite contact?

Offense

This chart compares line yards (blocking effectiveness) to rushing EPA (actual value). Teams in the top-left are relying on their playmakers to bail out their lethargic run game. Teams on the bottom-right are getting consistent push but not enough to spring explosive plays.

Defense

Defensively, it’s about limiting both initial yardage and big-play potential. Teams in the top-right are stone walls, stuffing runs and denying explosive plays. Top-left teams are your classic bend-don't-break defenses.

3. 3rd Down Success vs. Average Distance

Success on 3rd down isn’t just about execution, it’s also about setting yourself up with manageable situations. These charts break down how offenses and defenses handle the money down.

Offense

Elite teams convert often and avoid long-yardage scenarios. High success, low distance is the sweet spot. Teams above the trendline are what you would call clutch. They convert more often than you would expect given the average distance to go.

Defense

Strong defenses force longer 3rd downs and keep conversion rates low. Teams above the trendline hold firm on 3rd down more often than expected. These defensive coordinators are earning their paycheck.

4. Rushing Style: Line Yards vs. Highlight Yards

These charts reflect rushing identity. Offenses may grind out consistent gains or rely on splash plays. Defenses may force teams into low-efficiency runs or give up explosive gains.

Offense

Teams in the top right have both push and explosiveness. Upper-left teams are high-risk, high-reward. Lower-right teams grind it out but lack big-play potential.

Defense

Great defenses show up in the top right, limiting both consistent gains and explosive plays. Penn State was fantastic last season against the run. Struggling units trend toward the bottom right.

5. Dominating the Trenches

Winning up front still wins games. This chart shows which teams are physically controlling the line of scrimmage on both sides of the ball.

Offense vs. Defense

This combo plot shows offensive line yards gained vs. defensive line yards allowed. Top-right teams are trench kings who win both sides of the battle. Bottom-left teams are getting bullied around on both sides of the ball and may need to rethink their physical identity.

6. Recruiting vs. NFL Draft Output

Having top talent is great. Developing it into draft picks is even better. This chart doesn’t break down any game statistics or metrics, but it tells a powerful story.

Some programs overachieve and produce pros from modest classes. Others underdeliver despite recruiting success. Michigan and Georgia stand out as elite in both talent acquisition and development. Texas A&M and Clemson stand out for quite different reasons.

7. Net Success Rate by Half

Who gets better as the game goes on? This chart captures how teams perform before and after halftime, reflecting coaching adjustments, depth, and late-game execution.

Top-right teams are consistently good all game. Upper-left teams improve throughout the game. Lower-right teams start strong but fade.

8. Average Starting Field Position

It’s not just about scoring, it’s about controlling the field. Field position tells the hidden story of efficiency and control. These charts map where teams tend to spend their time on both sides of the ball.

Top-right teams spend the bulk of their time on the opponent's side of the field when they have the ball and far away from their own end zone when they don't. Teams in the bottom left are usually pinned up against their own goal line, whether they have the ball or not.

9. Field Goal Expected Points

This chart shows the expected point value of a field goal attempt by distance, based on outcomes for a replacement-level kicker. Short kicks (under 30 yards) are nearly automatic, but value drops quickly beyond 40 yards and attempts beyond 50 often return less than 2 points on average.

It’s a powerful reminder that not all “field goal range” is created equal. Coaches must weigh field position and down-distance against the real expected return, not just the hope of three. Kicking talent matters as well, as the curve for an above-average kicker will be more elongated than this one. For a below-average kicker, the curve will drop off much sooner and more sharply.

10. Returning Production: Usage vs. EPA

This one stands on its own. While we don’t have a defensive counterpart, it’s still a powerful preseason predictor.

We chart returning usage (volume) and total EPA (impact) from last season. Teams high in both are not just experienced, they’re returning proven performers.

All of the charts above are based on data from last season (2024), using opponent-adjusted metrics to give a clearer picture of team performance. As the 2025 season unfolds, I’ll be posting updated versions of many of these visuals, along with some others, each week.

You can follow along on Twitter/X and Bluesky, where I share fresh charts, insights, and data stories throughout the season.

Dig Deeper

These charts are just a sample of what’s possible with the tools at CollegeFootballData.com. Whether you're building models, prepping picks, or just watching smarter, we’ve got the data to give you an edge.

Explore more visuals and tools
Try the free or paid tiers of the API
Join Discord to share your own charts and nerd out
Subscribe on Patreon to unlock more API calls and features

Built with curiosity. Powered by data.

From Model Training Pack to Predictions: How to Use Your Model

Bill Radjewski — Tue, 12 Aug 2025 01:07:50 GMT

Since launching the Model Training Pack, one of the top questions I've heard is:

“I’ve got the trained model… now what?”

If that’s you, this guide is for you.
We’ll walk through:

What kind of data your model needs to make predictions.
Where to get that data from the CollegeFootballData API.
How to load different types of models from the pack and run predictions.
How to skip the data prep entirely with Tier 3 weekly CSV drops.

1. What Your Model Needs

The models in the training pack were all built on feature-ready CSV files.

That means the CSV you predict on must match the training data:

The CSV has the exact same columns as the training data.
The columns are in the same order.
The numbers are calculated the same way (e.g., using stats from games before the game you’re trying to predict).
If your CSV doesn’t match, your model will throw errors or give bad predictions.
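A quick way to catch mismatches before predicting is to compare column lists. The helper below is a hypothetical sketch that works on plain lists of column names, such as `list(df.columns)` from pandas.

```python
def compare_columns(train_cols, live_cols):
    """Report how a prediction CSV's columns differ from the training columns."""
    missing = [c for c in train_cols if c not in live_cols]
    extra = [c for c in live_cols if c not in train_cols]
    same_order = list(train_cols) == list(live_cols)
    return {"missing": missing, "extra": extra, "same_order": same_order}
```

If nothing is missing but the order differs, selecting the training columns in order (for example `X_live = X_live[train_cols]`) fixes it.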

2. Two Ways to Get the Data

Option 1: Build it yourself

You can pull comparable data from the CollegeFootballData API.
Here are the endpoints you’d use, at a high level:

Feature Group	API Endpoint	Key Fields
Opponent-adjusted team metrics	`/wepa/team/season`	`epa.`, `epa_allowed.`, `successRate.`, `successRateAllowed.`
Advanced team metrics (non-opponent-adjusted)	`/stats/season/advanced`	`havoc`, `fieldPosition`, `pointsPerOpportunity`
Game metadata	`/games`	`week`, `homeTeam`, `awayTeam`, `neutralSite`
Betting data	`/lines`	`lines[*].spread`
Talent composite	`/talent`	`talent`

Note: If you build your own CSV, you’ll need to join these datasets together and make sure your stats only include games before the prediction week.
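The "games before the prediction week" rule can be sketched as a simple filter. The row layout and field names here (`team`, `week`, `epa`) are assumptions for illustration, not the API's actual response shape.

```python
def prior_week_average(game_rows, team, week, field):
    """Average `field` over `team`'s games played strictly before `week`."""
    values = [r[field] for r in game_rows
              if r["team"] == team and r["week"] < week]
    return sum(values) / len(values) if values else None
```

The strict `<` comparison is what prevents leakage: a Week 5 prediction never sees Week 5 results.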

Option 2: Skip the work

Starting Week 5, Tier 3 patrons will get a weekly CSV that already:

Has all the right columns.
Is in the correct order.
Uses stats from games before that week.

With that file, you can go straight to loading your model and running predictions.

3. Loading and Predicting

Once you have your CSV, here’s how to use it with each type of model from the pack.
Replace "week5_features.csv" with your file and the model path (e.g. "models/sklearn_rf.pkl") with the path to your model file.

Random Forest / Regression (scikit-learn)

import pandas as pd, joblib

# Load your features
X_live = pd.read_csv("week5_features.csv")

# Load your model
model = joblib.load("models/sklearn_rf.pkl")

# Make predictions
preds = model.predict(X_live)
X_live['prediction'] = preds

XGBoost

import pandas as pd, joblib
import xgboost as xgb  # xgboost must be installed to unpickle the model

# Load your features
X_live = pd.read_csv("week5_features.csv")

# Load your model
model = joblib.load("models/xgb_model.pkl")

# Make predictions (predict_proba assumes a classifier; use model.predict for a regressor)
preds = model.predict_proba(X_live)[:, 1]
X_live['prediction'] = preds

fastai (tabular)

import pandas as pd
from fastai.tabular.all import load_learner

# Load your features
cat_features = [...] # list out categorical features
cont_features = [...] # list out continuous features

X_live = pd.read_csv("week5_features.csv")
X_live = X_live[cat_features + cont_features]


# Load your model
learn = load_learner("models/fastai_model.pkl")
dls = learn.dls.test_dl(X_live)

# Make predictions
batch_preds = learn.get_preds(dl=dls)[0].numpy()
X_live['prediction'] = batch_preds

4. Common Gotchas

Wrong column order → reorder to match your training data before predicting.

Missing columns → make sure your CSV includes everything from training.

Wrong data types → convert strings to numbers where needed.

fastai category mismatch → your categories must match what the model was trained on.

5. The Fast Lane

If you want to:

Avoid merging multiple datasets,
Skip figuring out lag logic, and
Be sure your columns match perfectly…

…join Tier 3 on Patreon.
Every week starting in Week 5, you’ll get a CSV that’s ready to feed directly into your model.

Join Tier 3 here →

6. Your Next Steps

Pick one of your models from the pack.
Grab a CSV, either your own or from the pack, and run the code above to test.
Get your hands on current-season features (DIY or Tier 3) and start making real predictions.

Bottom line:
If you can build the CSV yourself, great. You now know exactly what your model needs.
If you want to skip the grunt work and start predicting in minutes, Tier 3’s weekly CSV drops are your fastest path.

🧠 10 Tips for Building a College Football Predictive Model Without the Pain

Bill Radjewski — Mon, 04 Aug 2025 14:00:01 GMT

So you want to build a college football predictive model. Maybe you're tired of guessing spreads, or you want to enter a pick'em contest with actual math behind your picks. Great news: you're not alone and you're definitely not crazy.

But here's the catch.

Most beginners hit a wall not because they can't model, but because they can't get to the modeling stage at all. Data is messy. College football is chaotic. And feature selection? That’s a minefield.

This post will walk you through 10 hard-earned tips for building your first (or better) college football model, faster, cleaner, and smarter. Whether you're a student learning sports analytics or a fan trying to sharpen your edge, these tips are for you.

Let’s dive in.

1. Start With Clean, Structured Data

College football data is notoriously inconsistent across sources. Team names vary, game records are incomplete, and drive data is messy. Cleaning this yourself can take hours or even days.

Skip that headache.

Start with a clean dataset like the College Football Starter Pack, which includes structured CSVs for games, drives, plays, advanced stats, and team metadata. It's all ready for analysis or modeling.

📌 Bonus: No API calls or rate limits required.

2. Wait a Few Weeks Into the Season

Early-season games (especially Weeks 0–4) are notoriously unpredictable. There’s simply not enough data to go on and teams are still figuring things out. Sure, you can model these games, but doing it well usually requires a separate approach tailored for low-information scenarios.

For most use cases, it’s better to wait.

Start your training set in Week 5, when team identities begin to solidify, metrics stabilize, and opponent strength becomes more meaningful.

That’s the exact approach I use in the Model Training Pack, which includes a full training dataset filtered for Week 5 and beyond.

3. Opponent Adjustment Isn’t Optional

Raw stats lie.

Team A’s EPA might look elite until you realize they played three bottom-20 defenses. If you're not adjusting for opponent strength, you're modeling schedule, not skill.

Use opponent-adjusted metrics like:

Adjusted EPA per play metrics
Adjusted success rates
Adjusted rushing stats like adjusted line yards

These are included and ready-to-use in the Model Training Pack. No need to build your own adjustment pipeline (unless you really want to).

4. Margin First, Win Probability Second

A lot of beginners jump straight to win/loss prediction. That’s fine—but you lose granularity. Modeling final score margin gives you much more:

✅ Win probability
✅ Cover probability
✅ Total predictions
✅ Confidence rankings

Start by modeling score margin as a regression task, then derive win/loss from it. More signal, more flexibility.
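One common way to derive probabilities from a predicted margin is to assume the prediction errors are roughly normal. The sketch below does that. The 16-point standard deviation is a placeholder assumption, so estimate it from your own model's residuals.

```python
import math


def win_probability(pred_margin: float, sigma: float = 16.0) -> float:
    """P(actual margin > 0), assuming normal errors with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(pred_margin / (sigma * math.sqrt(2.0))))


def cover_probability(pred_margin: float, points_to_cover: float,
                      sigma: float = 16.0) -> float:
    """P(actual margin > points_to_cover) under the same assumption."""
    return win_probability(pred_margin - points_to_cover, sigma)
```

Here `points_to_cover` is the margin a team must win by to cover, so a 7-point favorite uses 7 and a 7-point underdog uses -7.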

5. Use Features That Actually Predict Outcomes

More features ≠ better model. You want features that have signal, not just noise.

Some high-value features:

Opponent-adjusted efficiency stats
Team talent composite
Run/pass ratio
Havoc metrics
Explosive play rate

Both the Starter Pack and Model Pack highlight the best ones and show how to use them in sample notebooks.

6. Talent Isn’t Everything, But It Matters

Talent composite rankings (from 247Sports or similar) are sticky over time. They don’t predict game-to-game variance, but they help explain why certain teams outperform models built only on stats.

Include talent as a prior, especially early in the season.

We’ve already merged talent data into the Model Training Pack so you don’t have to track it down or clean it yourself.

7. Don’t Skip Cross-Validation

It’s tempting to train on one season and test on another, but that won’t catch overfitting. Instead:

Use k-fold cross-validation
Shuffle by week or game ID
Be mindful of data leakage (especially with team-specific stats)

Even basic models benefit from good validation hygiene.
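To keep rows from the same game (or week) from landing in both train and test sets, folds can be built from groups rather than individual rows. This is a minimal pure-Python sketch. scikit-learn's `GroupKFold` offers the same idea if you are already using that library.

```python
import random


def grouped_folds(group_ids, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with every group kept in a single fold."""
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)
    fold_of = {g: i % k for i, g in enumerate(groups)}
    for fold in range(k):
        test_idx = [i for i, g in enumerate(group_ids) if fold_of[g] == fold]
        train_idx = [i for i, g in enumerate(group_ids) if fold_of[g] != fold]
        yield train_idx, test_idx
```

Here `group_ids` would be the game ID (or week) for each row of your training data, in row order.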

8. Build a Baseline Before You Get Fancy

Don’t jump straight to neural nets or ensemble methods.

Start with:

Linear regression for margin
Logistic regression for win probability
Decision trees for feature importance

Once you’ve got a strong baseline, experiment with:

XGBoost
Random Forest
Tabular neural networks (like fastai)

The Model Training Pack includes working examples of each so you can see how models evolve.

9. Visualize Your Errors

Don’t just trust metrics like MAE or RMSE. Visualize:

Predicted vs. actual margin
Residuals by team
Over/under predictions by spread

You’ll catch trends you’d never spot in raw numbers (e.g., your model consistently underrates service academies or overweights garbage time stats).
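Before reaching for a plotting library, a per-team residual summary is a quick first pass. This sketch assumes each row is a `(team, predicted_margin, actual_margin)` tuple, which is an illustrative layout rather than the pack's format.

```python
from collections import defaultdict


def mean_residual_by_team(rows):
    """Average (actual - predicted) margin per team."""
    totals = defaultdict(lambda: [0.0, 0])  # team -> [residual sum, game count]
    for team, predicted, actual in rows:
        totals[team][0] += actual - predicted
        totals[team][1] += 1
    return {team: s / n for team, (s, n) in totals.items()}
```

A large positive average means the model consistently underrates that team, and a large negative one means it overrates them.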

All notebooks included in the Model Training Pack feature error visualization examples to help you troubleshoot fast.

10. Use Prebuilt Tools to Learn Faster

The biggest bottleneck in building a model isn’t modeling. It’s everything before that:

Data cleaning
Feature selection
Normalization
Debugging

The Starter Pack and Model Training Pack are designed to eliminate those barriers so you can focus on building, testing, and improving your model.

No gatekeeping. No fluff. Just clean data and working code examples.

🚀 Ready to Get Started?

Here’s how to level up your college football modeling journey today:

🎯 Grab the Starter Pack - Ideal for exploring and building your first dashboard or basic model.
📊 Grab the Model Training Pack - Perfect for jumpstarting predictive modeling with ready-to-use training data and sample models.

Together, they give you everything you need, from structured data to proven code, so you can focus on what matters: building smarter models.

📬 Want More Tips Like This?

Follow @CFB_Data on Twitter, @collegefootballdata.com on Bluesky, and CollegeFootballData.com for more guides, tools, and insights all season long.

So You Got the Starter Pack. Now What?

Bill Radjewski — Thu, 17 Jul 2025 15:00:04 GMT

First off, thanks for picking up the CFBD Starter Pack! It gives you cleaned, historical data across several seasons and is perfect for building models, dashboards, and analytics workflows.

👉 Don’t have the Starter Pack yet? Grab it now and follow along.

But what if you want to pull in more recent or live data? That’s where the CollegeFootballData API and official Python client come in.

Let’s walk through how to set it up and fetch new data.

🔧 Step 1: Install the Python Client

Install the package:

pip install cfbd

🔐 Step 2: Set Your API Key

You’ll need an API key (free or Patreon tier) from the CFBD website. Once you have it, set it as an environment variable:

export BEARER_TOKEN="your_api_key_here"

Then set up the configuration in your Python code:

import cfbd
import os

configuration = cfbd.Configuration(
    access_token=os.environ["BEARER_TOKEN"]
)

🚀 Step 3: Fetch Data Using an API Client

The Python client uses context managers to handle the API session. Here's how to fetch advanced game stats for a team:

with cfbd.ApiClient(configuration) as api_client:
    api_instance = cfbd.StatsApi(api_client)

    # Example: get advanced game stats for Michigan in 2023
    response = api_instance.get_advanced_game_stats(
        year=2023,
        team="Michigan"
    )

    print(response)

This same pattern works for all endpoints.

📘 Examples You Can Try

Here are a few practical snippets to get started:

Recent Games

with cfbd.ApiClient(configuration) as api_client:
    games_api = cfbd.GamesApi(api_client)
    games = games_api.get_games(year=2024, week=13)
    for g in games:
        print(f"{g.away_team} at {g.home_team}: {g.away_points}-{g.home_points}")

Team Box Scores

with cfbd.ApiClient(configuration) as api_client:
    games_api = cfbd.GamesApi(api_client)
    box = games_api.get_game_team_stats(year=2024, week=13)
    for game in box:
        print(game)

Historical Betting Lines

with cfbd.ApiClient(configuration) as api_client:
    betting_api = cfbd.BettingApi(api_client)
    games = betting_api.get_lines(year=2024, week=13)
    for game in games:
        for line in game.lines:
            print(f"{game.away_team} @ {game.home_team}: {line.formatted_spread} ({line.provider})")

🧠 Combine with the Starter Pack

The Starter Pack has historical EPA, recruiting, and drive/play-level data. You can extend it by:

Merging recent API data with your historical CSVs
Running your models on up-to-date weekly metrics
Building dashboards that combine multiple types of data
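
As a rough sketch of the first idea, here is one way to fold fresh API records into rows you have already loaded from a historical CSV. The merge_games helper and the "id" key column are assumptions for illustration (check the actual column names in your Starter Pack files), and I am assuming the client's model objects expose to_dict(); plain dicts work too.

```python
def merge_games(historical_rows, api_games, key="id"):
    """Combine historical rows with newer API records, keyed on `key`.

    API records replace historical rows that share the same key, so a game
    that was updated since the CSV was built ends up with its latest values.
    """
    merged = {str(row[key]): dict(row) for row in historical_rows}
    for game in api_games:
        # Assumption: client models expose to_dict(); plain dicts also work.
        record = game.to_dict() if hasattr(game, "to_dict") else dict(game)
        merged[str(record[key])] = record
    return list(merged.values())
```

The historical rows could come from csv.DictReader or from a data frame converted with to_dict("records"). Keys are compared as strings because CSV readers return text while the API returns integers.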

🛑 Watch Your Limits

If you’re using the Free Tier, you’ll be capped at 1,000 calls/month. Consider bumping to a Patreon plan for more access (and goodies like weather, advanced metrics, and the GraphQL API).

You can check your remaining calls at any time either via the X-CallLimit-Remaining HTTP header returned with all responses or via the info endpoint (does not count against limits):

with cfbd.ApiClient(configuration) as api_client:
    api_instance = cfbd.InfoApi(api_client)
    api_response = api_instance.get_user_info()
    
    print(api_response)
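
If you are working with raw HTTP responses instead, the header route can be sketched like this. The remaining_calls helper is hypothetical, and how you get hold of the response headers depends on the HTTP library you use, so this only covers the parsing step on a plain mapping of header names to values.

```python
def remaining_calls(headers):
    """Return the X-CallLimit-Remaining value as an int, or None if absent.

    Header names are matched case-insensitively since HTTP libraries
    differ in how they normalize them.
    """
    for name, value in headers.items():
        if name.lower() == "x-calllimit-remaining":
            return int(value)
    return None
```

A script that loops over many weeks or seasons could check this value after each call and stop early when it gets close to zero.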

💬 Questions or Feedback?

Join the community on Discord or check out the interactive API docs to explore every endpoint.

Talking Tech: Building a March Madness Model using XGBoost

Bill Radjewski — Sat, 15 Mar 2025 01:59:48 GMT

In one of the earliest iterations of Talking Tech, we built a random forest classifier to predict play calls for college football. In another edition, we used my personally preferred method of building an artificial neural network to predict college football games. In this edition, we're going to dive into another type of machine learning method. Like the earlier walkthrough using a random forest classifier, we'll look at another type of ensemble method. An ensemble method builds numerous disparate models and relies on strength through sheer numbers. In the random forest method, a multitude of decision trees is generated and their outputs all gathered together in the final output. In this post, we're going to use an ensemble method that's a little less, well, random.

Gradient Boosting

Gradient boosting is similar in many ways to random forest methods. Both are ensemble models. Both typically make use of decision trees. Both also can be used for either classification or regression. So what sets them apart? If you remember, random forest methods typically generate a multitude of decision trees at random, counting on the erroneous trees to cancel each other out, more or less, while the stronger trees rise to the top. Gradient boosting, on the other hand, will start with one decision tree, evaluate it, and then use the resulting error to generate another decision tree that is incrementally more accurate. Rinse and repeat.

Eventually, this results in a multitude of trees all chained together, each one using the insights from its predecessors to make itself more accurate. But you don't simply discard the older models. All generated trees make up the final model, which makes this another ensemble method. You can see how this method might perform much better than random forests. In fact, gradient boosted decision tree models are usually some of the top performing in Kaggle competitions and the like.
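
To make the "fit a tree, look at the error, fit another tree" loop concrete, here is a toy sketch in plain Python. It is not how XGBoost is implemented (XGBoost adds regularization, second-order gradients, and much smarter tree building). It is just basic gradient boosting for squared error using one-split trees ("stumps") on a single made-up feature. All function names and the sample numbers are my own for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split tree: pick the threshold that minimises squared error.

    Needs at least two distinct x values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    baseline = sum(ys) / len(ys)
    predictions = [baseline] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]  # what is still unexplained
        stump = fit_stump(xs, residuals)
        stumps.append(stump)  # every tree is kept in the final model
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return {"baseline": baseline, "learning_rate": learning_rate, "stumps": stumps}


def predict(model, x):
    correction = sum(predict_stump(s, x) for s in model["stumps"])
    return model["baseline"] + model["learning_rate"] * correction


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Made-up data: one feature and a margin-like target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-6.0, -4.0, -1.0, 0.0, 2.0, 5.0, 9.0, 14.0]
one_tree = boost(xs, ys, rounds=1)
many_trees = boost(xs, ys, rounds=20)
print(mse(one_tree, xs, ys), mse(many_trees, xs, ys))
```

The training error with 20 chained stumps comes out lower than with a single stump, which is the whole idea: each new tree only has to explain what the previous ones got wrong.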

When it comes to gradient boosting in Python, there are two libraries with which I am familiar: XGBoost and LightGBM. While both libraries are solid options, we're going to be using XGBoost in this post. However, I do recommend going back and giving LightGBM a look at some point.

Gathering Data

We will be using the CBBD Python library to pull data from the CollegeBasketballData.com REST API. In total, we will be using these packages: cbbd, pandas, sklearn (installed as scikit-learn), xgboost. Be sure to have those all installed via pip or however you manage your Python dependencies. We will start by importing everything we need up front. We will also set up our CBBD API key, so enter yours into the placeholder below. If you need a key, you can acquire one from the main CBBD site.


import cbbd
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

configuration = cbbd.Configuration(
    access_token = 'your_api_key_here'
)

I should also note that we will be making a total of 22 API calls, well within the free tier of 1000 monthly calls provided by CBBD and enough to rerun this model many times over.

Next, we will compile all NCAA tournament games from the 2014 through 2024 seasons. You can go further back if you desire. Note that we are passing in a parameter of tournament='NCAA'. This allows us to conveniently query all tournament games for a given year.


games = []
with cbbd.ApiClient(configuration) as api_client:
    games_api = cbbd.GamesApi(api_client)
    for season in range(2024, 2013, -1):
        results = games_api.get_games(season=season, tournament='NCAA')
        games += results
len(games)

That returned 686 games. Let's see what data is included in a game record.


games[0]

GameInfo(id=12010, source_id='401638579', season_label='20232024', season=2024, season_type=, start_date=datetime.datetime(2024, 3, 19, 18, 40, tzinfo=datetime.timezone.utc), start_time_tbd=False, neutral_site=True, conference_game=False, game_type='TRNMNT', tournament='NCAA', game_notes="Men's Basketball Championship - West Region - First Four", status=, attendance=0, home_team_id=114, home_team='Howard', home_conference_id=18, home_conference='MEAC', home_seed=16, home_points=68, home_period_points=[27, 41], home_winner=False, away_team_id=341, away_team='Wagner', away_conference_id=21, away_conference='NEC', away_seed=16, away_points=71, away_period_points=[38, 33], away_winner=True, excitement=4.7, venue_id=76, venue='UD Arena', city='Dayton', state='OH')

Now we need to load up some stats to incorporate as features into our model. We will use the CBBD Stats API to query for team season stats for the same years for which we queried tournament game data. Note that we are passing in a season_type='regular' parameter. THIS IS IMPORTANT. We want to ONLY grab statistics for the regular season. In other words, stats that were available prior to the start of the tournament in a given year. Failing to pass in the filter will result in a model that is not predictive, but retrodictive. A VERY common mistake people make is including data and statistics that were not available at the time of the games they are seeking to predict.

Anyway, run the code below to grab team season stats.


stats = []
with cbbd.ApiClient(configuration) as api_client:
    stats_api = cbbd.StatsApi(api_client)
    for season in range(2024, 2013, -1):
        results = stats_api.get_team_season_stats(season=season, season_type='regular')
        stats += results
len(stats)

And we'll also check out the contents of the stats records.


stats[0]

TeamSeasonStats(season=2024, season_label='20232024', team_id=1, team='Abilene Christian', conference='WAC', games=32, wins=15, losses=17, total_minutes=1325, pace=61.1, team_stats=TeamSeasonUnitStats(field_goals=TeamSeasonUnitStatsFieldGoals(pct=43.2, attempted=1877, made=811), two_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=46.4, attempted=1393, made=646), three_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=34.1, attempted=484, made=165), free_throws=TeamSeasonUnitStatsFieldGoals(pct=73.1, attempted=729, made=533), rebounds=TeamSeasonUnitStatsRebounds(total=1070, defensive=756, offensive=314), turnovers=TeamSeasonUnitStatsTurnovers(team_total=12, total=404), fouls=TeamSeasonUnitStatsFouls(flagrant=0, technical=6, total=635), points=TeamSeasonUnitStatsPoints(fast_break=319, off_turnovers=466, in_paint=1138, total=2320), four_factors=TeamSeasonUnitStatsFourFactors(free_throw_rate=38.8, offensive_rebound_pct=29.3, turnover_ratio=0.2, effective_field_goal_pct=47.6), assists=405, blocks=65, steals=253, possessions=2028, rating=114.4, true_shooting=52.8), opponent_stats=TeamSeasonUnitStats(field_goals=TeamSeasonUnitStatsFieldGoals(pct=46.5, attempted=1792, made=833), two_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=52.6, attempted=1227, made=645), three_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=33.3, attempted=565, made=188), free_throws=TeamSeasonUnitStatsFieldGoals(pct=68.7, attempted=723, made=497), rebounds=TeamSeasonUnitStatsRebounds(total=1171, defensive=859, offensive=312), turnovers=TeamSeasonUnitStatsTurnovers(team_total=23, total=478), fouls=TeamSeasonUnitStatsFouls(flagrant=0, technical=6, total=619), points=TeamSeasonUnitStatsPoints(fast_break=316, off_turnovers=411, in_paint=1120, total=2351), four_factors=TeamSeasonUnitStatsFourFactors(free_throw_rate=40.3, offensive_rebound_pct=26.6, turnover_ratio=0.2, effective_field_goal_pct=51.7), assists=388, blocks=108, steals=206, possessions=2023, rating=116.2, true_shooting=55.7))

That's a lot of stats! The final step here is to match the team statistics with each game record and put those into a data frame. We are going to create a list of dict objects to combine this data, which will be pretty easy to load up into pandas.

In the code below, we are converting each game object into a dict, querying team stats for the home and away team, and then loading up data points from each stats object into the dict. You can completely change these up if you desire or add different stats. I am not trying to build the most comprehensive or accurate model in this exercise. I am merely trying to give you a good idea of how to combine the data and get it into the correct format.


records = []
for game in games:
    record = game.to_dict()
    home_stats = [stat for stat in stats if stat.team_id == game.home_team_id and stat.season == game.season][0]
    away_stats = [stat for stat in stats if stat.team_id == game.away_team_id and stat.season == game.season][0]
    record['home_pace'] = home_stats.pace
    record['home_o_rating'] = home_stats.team_stats.rating
    record['home_d_rating'] = home_stats.opponent_stats.rating
    record['home_free_throw_rate'] = home_stats.team_stats.four_factors.free_throw_rate
    record['home_offensive_rebound_rate'] = home_stats.team_stats.four_factors.offensive_rebound_pct
    record['home_turnover_ratio'] = home_stats.team_stats.four_factors.turnover_ratio
    record['home_efg'] = home_stats.team_stats.four_factors.effective_field_goal_pct
    record['home_free_throw_rate_allowed'] = home_stats.opponent_stats.four_factors.free_throw_rate
    record['home_offensive_rebound_rate_allowed'] = home_stats.opponent_stats.four_factors.offensive_rebound_pct
    record['home_turnover_ratio_forced'] = home_stats.opponent_stats.four_factors.turnover_ratio
    record['home_efg_allowed'] = home_stats.opponent_stats.four_factors.effective_field_goal_pct
    record['away_pace'] = away_stats.pace
    record['away_o_rating'] = away_stats.team_stats.rating
    record['away_d_rating'] = away_stats.opponent_stats.rating
    record['away_free_throw_rate'] = away_stats.team_stats.four_factors.free_throw_rate
    record['away_offensive_rebound_rate'] = away_stats.team_stats.four_factors.offensive_rebound_pct
    record['away_turnover_ratio'] = away_stats.team_stats.four_factors.turnover_ratio
    record['away_efg'] = away_stats.team_stats.four_factors.effective_field_goal_pct
    record['away_free_throw_rate_allowed'] = away_stats.opponent_stats.four_factors.free_throw_rate
    record['away_offensive_rebound_rate_allowed'] = away_stats.opponent_stats.four_factors.offensive_rebound_pct
    record['away_turnover_ratio_forced'] = away_stats.opponent_stats.four_factors.turnover_ratio
    record['away_efg_allowed'] = away_stats.opponent_stats.four_factors.effective_field_goal_pct
    records.append(record)
len(records)
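
One thing to be aware of: the list comprehensions above rescan the whole stats list for every game, and the [0] index raises an IndexError if a team has no stats record for that season. A hedged alternative is to index the stats once by (team_id, season) and set aside games that cannot be matched. The helper names below are hypothetical. The attribute names (team_id, season, home_team_id, away_team_id) are the ones used in the loop above.

```python
def index_stats(stats):
    """Map (team_id, season) -> stats record for constant-time lookups."""
    return {(s.team_id, s.season): s for s in stats}


def split_matched(games, stats_index):
    """Pair each game with its home/away stats; collect games missing either."""
    matched, skipped = [], []
    for game in games:
        home = stats_index.get((game.home_team_id, game.season))
        away = stats_index.get((game.away_team_id, game.season))
        if home is None or away is None:
            skipped.append(game)  # no regular season stats found for one side
        else:
            matched.append((game, home, away))
    return matched, skipped
```

You would then loop over the matched (game, home_stats, away_stats) tuples and build each record exactly as before.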

All that's left to do is load this into a data frame. Once loaded up, I am going to compute a new column for the final score margin based on the home and away score columns.


df = pd.DataFrame(records)
df['margin'] = df.homePoints - df.awayPoints
df.head()

	id	sourceId	seasonLabel	season	seasonType	startDate	startTimeTbd	neutralSite	conferenceGame	gameType	...	away_d_rating	away_free_throw_rate	away_offensive_rebound_rate	away_turnover_ratio	away_efg	away_free_throw_rate_allowed	away_offensive_rebound_rate_allowed	away_turnover_ratio_forced	away_efg_allowed	margin
0	12010	401638579	20232024	2024	SeasonType.POSTSEASON	2024-03-19 18:40:00+00:00	False	True	False	TRNMNT	...	98.3	26.2	31.4	0.2	45.4	29.1	25.4	0.2	47.9	-3
1	12009	401638580	20232024	2024	SeasonType.POSTSEASON	2024-03-19 21:10:00+00:00	False	True	False	TRNMNT	...	102.0	32.4	23.5	0.2	55.4	31.4	28.4	0.2	48.8	-25
2	12023	401638581	20232024	2024	SeasonType.POSTSEASON	2024-03-20 18:40:00+00:00	False	True	False	TRNMNT	...	114.5	39.1	29.7	0.2	48.9	32.6	32.2	0.2	49.0	-7
3	12022	401638582	20232024	2024	SeasonType.POSTSEASON	2024-03-20 21:28:00+00:00	False	True	False	TRNMNT	...	102.7	35.3	27.0	0.2	55.3	28.1	29.1	0.2	49.3	-7
4	12022	401638582	20232024	2024	SeasonType.POSTSEASON	2024-03-20 21:28:00+00:00	False	True	False	TRNMNT	...	102.7	35.3	27.0	0.2	55.3	28.1	29.1	0.2	49.3	-7

5 rows × 58 columns
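
Note that the preview lists id 12022 twice (rows 3 and 4), which suggests the API results can contain repeated game records. If that matters for your model, a minimal sketch of dropping repeats before building the data frame could look like this. The drop_duplicate_games helper is hypothetical and assumes each record dict carries the id key shown in the preview.

```python
def drop_duplicate_games(records, key="id"):
    """Keep the first record seen for each game id, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record[key] in seen:
            continue  # already have this game
        seen.add(record[key])
        unique.append(record)
    return unique
```

If you already have the data frame, df.drop_duplicates(subset='id') does the same job in pandas. Keep in mind that removing rows will change the row counts and metrics shown in the rest of this post.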

Training the Model

The first step here is feature selection. Let's see what columns are currently included in the data frame.


df.columns

Index(['id', 'sourceId', 'seasonLabel', 'season', 'seasonType', 'startDate',
       'startTimeTbd', 'neutralSite', 'conferenceGame', 'gameType',
       'tournament', 'gameNotes', 'status', 'attendance', 'homeTeamId',
       'homeTeam', 'homeConferenceId', 'homeConference', 'homeSeed',
       'homePoints', 'homePeriodPoints', 'homeWinner', 'awayTeamId',
       'awayTeam', 'awayConferenceId', 'awayConference', 'awaySeed',
       'awayPoints', 'awayPeriodPoints', 'awayWinner', 'excitement', 'venueId',
       'venue', 'city', 'state', 'home_pace', 'home_o_rating', 'home_d_rating',
       'home_free_throw_rate', 'home_offensive_rebound_rate',
       'home_turnover_ratio', 'home_efg', 'home_free_throw_rate_allowed',
       'home_offensive_rebound_rate_allowed', 'home_turnover_ratio_forced',
       'home_efg_allowed', 'away_pace', 'away_o_rating', 'away_d_rating',
       'away_free_throw_rate', 'away_offensive_rebound_rate',
       'away_turnover_ratio', 'away_efg', 'away_free_throw_rate_allowed',
       'away_offensive_rebound_rate_allowed', 'away_turnover_ratio_forced',
       'away_efg_allowed', 'margin'],
      dtype='object')

We are going to pull out the columns we will be using, namely the features for training and the output we will be training against (margin).


features = [
    'home_o_rating',
    'home_d_rating',
    'home_pace',
    'home_free_throw_rate',
    'home_offensive_rebound_rate',
    'home_turnover_ratio',
    'home_efg',
    'home_free_throw_rate_allowed',
    'home_offensive_rebound_rate_allowed',
    'home_turnover_ratio_forced',
    'home_efg_allowed',
    'away_o_rating',
    'away_d_rating',
    'away_pace',
    'away_free_throw_rate',
    'away_offensive_rebound_rate',
    'away_turnover_ratio',
    'away_efg',
    'away_free_throw_rate_allowed',
    'away_offensive_rebound_rate_allowed',
    'away_turnover_ratio_forced',
    'away_efg_allowed',
    'homeSeed',
    'awaySeed'
]

outputs = ['margin']

df[features + outputs]

	home_o_rating	home_d_rating	home_pace	home_free_throw_rate	home_offensive_rebound_rate	home_turnover_ratio	home_efg	home_free_throw_rate_allowed	home_offensive_rebound_rate_allowed	home_turnover_ratio_forced	...	away_offensive_rebound_rate	away_turnover_ratio	away_efg	away_free_throw_rate_allowed	away_offensive_rebound_rate_allowed	away_turnover_ratio_forced	away_efg_allowed	homeSeed	awaySeed	margin
0	107.8	106.2	67.4	41.9	31.0	0.2	52.4	39.2	33.5	0.2	...	31.4	0.2	45.4	29.1	25.4	0.2	47.9	16	16	-3
1	103.6	96.8	59.4	25.1	26.9	0.1	49.3	25.7	27.2	0.2	...	23.5	0.2	55.4	31.4	28.4	0.2	48.8	10	10	-25
2	111.7	109.8	65.2	29.7	22.2	0.2	54.5	35.9	26.5	0.2	...	29.7	0.2	48.9	32.6	32.2	0.2	49.0	16	16	-7
3	113.6	101.3	65.2	36.8	30.7	0.2	52.2	31.9	24.8	0.2	...	27.0	0.2	55.3	28.1	29.1	0.2	49.3	10	10	-7
4	113.6	101.3	65.2	36.8	30.7	0.2	52.2	31.9	24.8	0.2	...	27.0	0.2	55.3	28.1	29.1	0.2	49.3	10	10	-7
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
681	118.4	96.6	59.2	43.4	32.5	0.2	52.7	32.6	31.7	0.2	...	28.5	0.2	51.4	35.5	36.4	0.2	43.9	1	7	-10
682	118.4	96.6	59.2	43.4	32.5	0.2	52.7	32.6	31.7	0.2	...	28.5	0.2	51.4	35.5	36.4	0.2	43.9	1	7	-10
683	120.4	105.2	61.2	44.1	26.5	0.1	53.1	25.9	29.1	0.2	...	35.6	0.2	49.7	37.8	36.1	0.2	45.0	2	8	-1
684	115.2	101.1	61.7	38.6	28.5	0.2	51.4	35.5	36.4	0.2	...	35.6	0.2	49.7	37.8	36.1	0.2	45.0	7	8	6
685	115.2	101.1	61.7	38.6	28.5	0.2	51.4	35.5	36.4	0.2	...	35.6	0.2	49.7	37.8	36.1	0.2	45.0	7	8	6

686 rows × 25 columns

Again, you can feel free to mix that up. If you added or changed any of the statistics in the prior section, this is where you will need to incorporate them.

We will now split our data set into training data and testing data. Training data will be used in training the model. Testing data is held back to test out the model once it's ready to go. In this example, I am pulling 2024 tournament games as my test set. If you are running through this looking to make predictions on tourney games that are in the future, you can pull those games instead (assuming you pulled games and statistics for that season into the data set).


training = df.query("season != 2024").copy()
testing = df.query("season == 2024").copy()

We are going to further split out the training data into training and validation sets. Both of these sets will be used in training the model. The training set is what is actually fed into the model whereas the validation set is what the model uses in training to validate whether it is actually improving. This mechanism mitigates overfitting onto the training data.


X_train, X_valid, y_train, y_valid = train_test_split(training[features], training[outputs], train_size=0.8, test_size=0.2, random_state=0)

Note that this splits the training features (X) out from the expected outputs (y). In the example above, we are randomly holding back 20% of the dataset to be used for validation.
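
To illustrate how a validation set can guard against overfitting, here is a small sketch of patience-based early stopping: track the validation error after each boosting round and keep the round where it was last at its best. The best_round function and the error values are made up for illustration. This is the general idea, not XGBoost's internal code.

```python
def best_round(validation_errors, patience=3):
    """Return the index of the round to keep under patience-based early stopping.

    Stops scanning once `patience` rounds in a row fail to beat the best error.
    """
    best_index, best_error = 0, float("inf")
    for i, err in enumerate(validation_errors):
        if err < best_error:
            best_index, best_error = i, err
        elif i - best_index >= patience:
            break  # validation error has stopped improving
    return best_index
```

One caveat: a plain model.fit(X_train, y_train) call never sees the validation set, so in that case the validation data is only used afterwards for scoring. To have XGBoost monitor it during training, recent versions accept eval_set=[(X_valid, y_valid)] in fit along with early_stopping_rounds in the XGBRegressor constructor.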

We are ready to train! We will be using XGBRegressor to fit a gradient boosting model for regression. If we were doing classification, we would use XGBClassifier.


model = XGBRegressor(random_state=0)
model.fit(X_train, y_train)

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=0, ...)


And just like that, we have a trained model! We can make predictions against our validation set.


predictions = model.predict(X_valid)
predictions

array([-1.87790477e+00,  7.16752386e+00,  1.32060270e+01,  6.78795004e+00,
        1.44662819e+01, -2.85689831e+00, -8.69423985e-01,  8.75045967e+00,
        3.85790849e+00, -6.43919373e+00, -8.83276880e-01,  6.97011662e+00,
        4.38355398e+00,  8.06833267e+00, -8.77752018e+00,  5.22899723e+00,
        2.80364990e+00,  3.31810045e+00, -9.09639931e+00, -1.38665593e+00,
        4.66550255e+00,  3.16841202e+01,  9.18671894e+00, -2.34628081e+00,
        1.58264847e+01,  9.93082142e+00,  9.44772053e+00,  1.88728504e+01,
        2.87765160e+01,  3.31487012e+00,  1.30118427e+01, -1.30986392e-01,
        5.33917189e+00,  8.50678921e+00, -3.34483713e-01,  2.57094145e+00,
        1.66184235e+01,  5.99199915e+00, -2.74236417e+00,  1.33841276e+00,
       -5.50944662e+00, -8.56299973e+00,  9.36406422e+00,  1.27445345e+01,
       -5.79891968e+00,  9.32999039e+00,  4.99850559e+00,  1.41290035e+01,
        1.27072744e+01,  5.49775696e+00,  2.92133301e-01,  2.85389748e+01,
       -2.77683735e+00,  1.41666784e+01,  1.65023022e+01,  6.03557158e+00,
        2.24876385e+01, -5.69163513e+00,  5.78824818e-01,  2.18679352e+01,
        1.81881466e+01,  6.27820158e+00, -3.48073578e+00, -2.05786265e-02,
        2.38070393e+01,  7.80937290e+00,  2.68855405e+00,  1.00340958e+01,
        1.03051748e+01,  6.70673037e+00, -4.66818810e+00,  1.42929211e+01,
        5.93736887e+00,  2.18488560e+01, -3.96203065e+00, -6.01904249e+00,
        1.15123062e+01,  1.06525719e+00, -5.60221529e+00, -2.91650534e+00,
        8.13025475e+00, -2.16232657e+00, -7.38539994e-02, -7.47696776e-03,
        6.57202673e+00,  3.21248150e+00,  3.89195323e-01,  2.67519027e-01,
       -1.49262440e+00, -5.93076229e+00,  1.55619888e+01, -9.42352295e-01,
        6.86150503e+00,  2.09990826e+01, -2.62024927e+00, -3.10824728e+00,
        1.55272758e+00,  6.41326475e+00,  2.17659950e+00,  2.06855249e+00,
        1.48680840e+01,  3.38636231e+00,  1.16376562e+01, -1.75216424e+00,
        1.12170439e+01,  1.02640734e+01,  1.19243898e+01,  6.55053318e-01,
        1.79168587e+01,  1.12861748e+01,  1.15750656e+01, -1.21279058e+01,
       -6.30171585e+00,  2.97097254e+00,  5.94197321e+00, -1.26525140e+00,
        1.78847879e-01,  1.99955502e+01,  1.16229486e+01,  9.16914749e+00,
        1.56323729e+01,  2.16536427e+01,  4.01582432e+00,  2.84138560e-01],
      dtype=float32)

If your validation set contains games that have already been played, we can use this to calculate the mean absolute error (or any other metric) of our model.


mae = mean_absolute_error(y_valid, predictions)
mae

7.965800762176514

I got a MAE of ~7.97. I'll be honest, I have no idea how good that is since I'm a bit newer to basketball modeling. Based on my reading, a MAE of around 6.5 is pretty good. So, this is perhaps not great but a good starting point. My goal is not to have the best model but to walk you through this. It will be up to you to make changes and get better predictions.
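
For reference, mean absolute error is just the average size of the miss, ignoring direction. A tiny pure-Python version (my own helper, equivalent in spirit to sklearn's mean_absolute_error for simple lists) looks like this:

```python
def mean_abs_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must be the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

So a MAE near 8 means the predicted margin misses the final margin by about 8 points on average.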

What might fine tuning look like? For one, we can update the parameters on the model. The below code snippet runs through the same process as above but explicitly sets the number of estimators, the learning rate, and the number of jobs for the model.


model = XGBRegressor(n_estimators=100, learning_rate=0.05, n_jobs=4)
model.fit(X_train, y_train)
predictions = model.predict(X_valid)
mae = mean_absolute_error(y_valid, predictions)
mae

7.976924419403076

As you can see, my MAE is not any better, but you can play around with those parameters and see if you get anything different. The best way to improve this will likely come from tweaking the input features and adding more stats.

Let's go back to our testing set, generate predictions, and compare them to actual results from the 2024 NCAA Tournament.


predictions = model.predict(testing[features])
testing['prediction'] = predictions
testing[['homeSeed', 'homeTeam', 'awaySeed', 'awayTeam', 'margin', 'prediction']]

	homeSeed	homeTeam	awaySeed	awayTeam	margin	prediction
0	16	Howard	16	Wagner	-3	4.429741
1	10	Virginia	10	Colorado State	-25	0.494260
2	16	Montana State	16	Grambling	-7	-0.163861
3	10	Boise State	10	Colorado	-7	0.399193
4	10	Boise State	10	Colorado	-7	0.399193
...	...	...	...	...	...	...
65	1	Purdue	2	Tennessee	6	-4.878470
66	4	Duke	11	NC State	-12	0.975319
67	1	Purdue	11	NC State	13	12.650157
68	1	UConn	4	Alabama	14	6.204337
69	1	UConn	1	Purdue	15	0.927093

70 rows × 6 columns

Let's calculate the actual percentage of games our model correctly picked straight up.


testing.query("(margin < 0 and prediction < 0) or (margin > 0 and prediction > 0)").shape[0] / testing.shape[0]

0.6428571428571429

My model correctly picked games in the 2024 Tournament at a 64.3% clip. Let's look at just the first round. I'm going to use the gameNotes property (which contains round information) to filter down to first round games.


testing[testing['gameNotes'].str.contains('1st')].query("(margin < 0 and prediction < 0) or (margin > 0 and prediction > 0)").shape[0] / testing[testing['gameNotes'].str.contains('1st')].shape[0]

0.696969696969697

For the first round, I'm at a slightly better 69.696969% clip (nice).
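
Both accuracy figures come from the same check: a pick counts as correct when the predicted margin and the actual margin have the same sign. Here is that logic as a small standalone function (the name is my own), tried on the first five rows of the testing table shown earlier.

```python
def straight_up_accuracy(margins, predictions):
    """Share of games where the predicted and actual margins share a sign.

    A prediction of exactly 0 never counts as correct, matching the query above.
    """
    correct = sum(1 for m, p in zip(margins, predictions) if m * p > 0)
    return correct / len(margins)


# First five rows of the 2024 testing table
margins = [-3, -25, -7, -7, -7]
predictions = [4.429741, 0.494260, -0.163861, 0.399193, 0.399193]
print(straight_up_accuracy(margins, predictions))  # 0.2, only the Grambling pick is right
```

Those five rows are the First Four games, where the model did poorly. Over all 70 rows the same calculation gives the 64.3% figure.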

At this point, we should save our model so that we can load it up and use it at a later time.


model.save_model('xgboostmodel')

This exports the model into a file. Replace xgboostmodel above with a filename of your choosing, especially if you want to train and save multiple models. If we want to use our model later on to make predictions, we can load it up as follows.


model = XGBRegressor()
model.load_model('xgboostmodel')

Let's say I wanted to predict a hypothetical matchup that hasn't yet occurred and isn't even scheduled. This would be useful in, for example, filling out a bracket. Here is an example of how I might do that with a reusable method.


with cbbd.ApiClient(configuration) as api_client:
    stats_api = cbbd.StatsApi(api_client)
    stats = stats_api.get_team_season_stats(season=2025, season_type='regular')
    
def predict_game(model, stats, projected_home_seed, home_team, projected_away_seed, away_team):
    home_stats = [stat for stat in stats if stat.team == home_team][0]
    away_stats = [stat for stat in stats if stat.team == away_team][0]
    record = {
        'home_o_rating': home_stats.team_stats.rating,
        'home_d_rating': home_stats.opponent_stats.rating,
        'home_pace': home_stats.pace,
        'home_free_throw_rate': home_stats.team_stats.four_factors.free_throw_rate,
        'home_offensive_rebound_rate': home_stats.team_stats.four_factors.offensive_rebound_pct,
        'home_turnover_ratio': home_stats.team_stats.four_factors.turnover_ratio,
        'home_efg': home_stats.team_stats.four_factors.effective_field_goal_pct,
        'home_free_throw_rate_allowed': home_stats.opponent_stats.four_factors.free_throw_rate,
        'home_offensive_rebound_rate_allowed': home_stats.opponent_stats.four_factors.offensive_rebound_pct,
        'home_turnover_ratio_forced': home_stats.opponent_stats.four_factors.turnover_ratio,
        'home_efg_allowed': home_stats.opponent_stats.four_factors.effective_field_goal_pct,
        'away_o_rating': away_stats.team_stats.rating,
        'away_d_rating': away_stats.opponent_stats.rating,
        'away_pace': away_stats.pace,
        'away_free_throw_rate': away_stats.team_stats.four_factors.free_throw_rate,
        'away_offensive_rebound_rate': away_stats.team_stats.four_factors.offensive_rebound_pct,
        'away_turnover_ratio': away_stats.team_stats.four_factors.turnover_ratio,
        'away_efg': away_stats.team_stats.four_factors.effective_field_goal_pct,
        'away_free_throw_rate_allowed': away_stats.opponent_stats.four_factors.free_throw_rate,
        'away_offensive_rebound_rate_allowed': away_stats.opponent_stats.four_factors.offensive_rebound_pct,
        'away_turnover_ratio_forced': away_stats.opponent_stats.four_factors.turnover_ratio,
        'away_efg_allowed': away_stats.opponent_stats.four_factors.effective_field_goal_pct,
        'homeSeed': projected_home_seed,
        'awaySeed': projected_away_seed
    }
    return model.predict(pd.DataFrame([record]))[0]
    
predict_game(model, stats, 5, 'Michigan', 11, 'Dayton')

np.float32(6.149086)

In the above example, I loaded up data from the current season, created a method that constructs a data frame record using the required features, and then called that method to get a prediction, passing in a model, stats collection, and team projected seeds and names. This model predicts that Michigan as a 5 seed would beat Dayton as an 11 seed by 6.1 points. Voila!
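
One wrinkle worth considering: the sample game record earlier shows neutral_site=True, so for tournament games the "home" and "away" labels may be little more than a listing convention. A possible (untested on real data) way to remove that arbitrariness is to predict the matchup in both orders and average. The neutral_margin helper below is hypothetical and takes any function that returns a home-minus-away margin.

```python
def neutral_margin(predict, team_a, team_b):
    """Average both home/away orderings; positive means team_a is favored.

    `predict(home, away)` must return a predicted home-minus-away margin.
    """
    as_home = predict(team_a, team_b)   # team_a listed as home
    as_away = predict(team_b, team_a)   # team_a listed as away
    return (as_home - as_away) / 2
```

With the predict_game function above, you could wrap it like so, where seeds is a dict of projected seeds that you would supply: neutral_margin(lambda home, away: predict_game(model, stats, seeds[home], home, seeds[away], away), 'Michigan', 'Dayton').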

And this is where I leave you. As mentioned, there are many improvements that can be made to get this thing ready for prime time. There were many features returned by the Stats API that we aren't even using. And none of our stats are opponent-adjusted. And you aren't limited to the Stats API, either. Try incorporating other endpoints or even other data sources.

As always, let me know what you think on Twitter, Bluesky, Discord, etc. And good luck with your brackets!

Talking Tech: Generating Shot Charts using the Basketball API

Bill Radjewski — Wed, 05 Mar 2025 18:00:39 GMT

Welcome to the first ever basketball post on this here blog! As announced a few weeks back, CollegeBasketballData.com is now live. I've often been asked about providing service for college basketball and have always been hesitant. For one, the sheer volume of data is multiple times greater than for football due to nearly triple the number of teams and triple the number of games per team. I've also been a big fan both of Bart Torvik and Ken Pomeroy and wasn't sure there was much of a need for a CFBD-like service for CBB with the stats and analytics those guys provide.

That all said, I have been asked consistently over the years by various users, and the CFBD site and API refreshes have energized me to give CBB a go. I'm excited to provide this service and if I've been a part of your CFB analytics journey, I hope I can do the same for CBB.

Now let's dive into some charts!

Plotting the Court

We are going to be plotting team shot charts on top of a standard NCAA men's court using Python and the CollegeBasketballData.com API along with a few common Python packages. When all is said and done, we will have something that looks like this.

Before we do anything, we need to make sure we have all dependencies installed. We will need the CBBD Python package and a few others. Run the following code in terminal.

pip install cbbd pandas numpy matplotlib seaborn

Now we need to focus on plotting a basketball court. We will be using matplotlib to achieve this. Go ahead and run the following block to import all of the dependencies we just installed.


import cbbd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.patches import Circle, Rectangle, Arc
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
import seaborn as sns

plt.style.use('seaborn-v0_8-dark-palette')

As we dig into plotting the court, I first need to give a huge shout out to Rob Mulla, who wrote a series of helper functions for plotting NCAA courts on Kaggle. His Kaggle article goes more in-depth and even includes a plot for a full size court. We'll just be using a half court and copy/pasting a function from that article.


def create_ncaa_half_court(ax=None, three_line='mens', court_color='#dfbb85',
                           lw=3, lines_color='black', lines_alpha=0.5,
                           paint_fill='blue', paint_alpha=0.4,
                          inner_arc=False):
    """
    Version 2020.2.19

    Creates NCAA Basketball Half Court
    Dimensions are in feet (Court is 94x50 ft)
    Created by: Rob Mulla / https://github.com/RobMulla

    * Note that this function uses "feet" as the unit of measure.
    * NCAA Data is provided on a x range: 0, 100 and y-range 0 to 100
    * To plot X/Y positions first convert to feet like this:
    ```
    Events['X_'] = (Events['X'] * (94/100))
    Events['Y_'] = (Events['Y'] * (50/100))
    ```
    ax: matplotlib axes if None gets current axes using `plt.gca`
    
    three_line: 'mens', 'womens' or 'both' defines 3 point line plotted
    court_color : (hex) Color of the court
    lw : line width
    lines_color : Color of the lines
    lines_alpha : transparency of lines
    paint_fill : Color inside the paint
    paint_alpha : transparency of the "paint"
    inner_arc : paint the dotted inner arc
    """
    if ax is None:
        ax = plt.gca()

    # Create Patches for Court Lines
    center_circle = Circle((50/2, 94/2), 6,
                           linewidth=lw, color=lines_color, lw=lw,
                           fill=False, alpha=lines_alpha)
    hoop = Circle((50/2, 5.25), 1.5 / 2,
                       linewidth=lw, color=lines_color, lw=lw,
                       fill=False, alpha=lines_alpha)

    # Paint - 18 Feet 10 inches which converts to 18.833333 feet - gross!
    paint = Rectangle(((50/2)-6, 0), 12, 18.833333,
                           fill=paint_fill, alpha=paint_alpha,
                           lw=lw, edgecolor=None)
    
    paint_boarder = Rectangle(((50/2)-6, 0), 12, 18.833333,
                           fill=False, alpha=lines_alpha,
                           lw=lw, edgecolor=lines_color)
    
    arc = Arc((50/2, 18.833333), 12, 12, theta1=0,
                   theta2=180, color=lines_color, lw=lw,
                   alpha=lines_alpha)
    
    block1 = Rectangle(((50/2)-6-0.666, 7), 0.666, 1, 
                           fill=True, alpha=lines_alpha,
                           lw=0, edgecolor=lines_color,
                           facecolor=lines_color)
    block2 = Rectangle(((50/2)+6, 7), 0.666, 1, 
                           fill=True, alpha=lines_alpha,
                           lw=0, edgecolor=lines_color,
                           facecolor=lines_color)
    ax.add_patch(block1)
    ax.add_patch(block2)
    
    l1 = Rectangle(((50/2)-6-0.666, 11), 0.666, 0.166,
                           fill=True, alpha=lines_alpha,
                           lw=0, edgecolor=lines_color,
                           facecolor=lines_color)
    l2 = Rectangle(((50/2)-6-0.666, 14), 0.666, 0.166,
                           fill=True, alpha=lines_alpha,
                           lw=0, edgecolor=lines_color,
                           facecolor=lines_color)
    l3 = Rectangle(((50/2)-6-0.666, 17), 0.666, 0.166,
                           fill=True, alpha=lines_alpha,
                           lw=0, edgecolor=lines_color,
                           facecolor=lines_color)
    ax.add_patch(l1)
    ax.add_patch(l2)
    ax.add_patch(l3)
    l4 = Rectangle(((50/2)+6, 11), 0.666, 0.166,
                           fill=True, alpha=lines_alpha,
                           lw=0, edgecolor=lines_color,
                           facecolor=lines_color)
    l5 = Rectangle(((50/2)+6, 14), 0.666, 0.166,
                           fill=True, alpha=lines_alpha,
                           lw=0, edgecolor=lines_color,
                           facecolor=lines_color)
    l6 = Rectangle(((50/2)+6, 17), 0.666, 0.166,
                           fill=True, alpha=lines_alpha,
                           lw=0, edgecolor=lines_color,
                           facecolor=lines_color)
    ax.add_patch(l4)
    ax.add_patch(l5)
    ax.add_patch(l6)
    
    # 3 Point Line
    if (three_line == 'mens') | (three_line == 'both'):
        # 22' 1.75" distance to center of hoop
        three_pt = Arc((50/2, 6.25), 44.291, 44.291, theta1=12,
                            theta2=168, color=lines_color, lw=lw,
                            alpha=lines_alpha)

        # corner threes run parallel to the sideline, about 3.34 feet in for mens
        ax.plot((3.34, 3.34), (0, 11.20),
                color=lines_color, lw=lw, alpha=lines_alpha)
        ax.plot((50-3.34, 50-3.34), (0, 11.20),
                color=lines_color, lw=lw, alpha=lines_alpha)
        ax.add_patch(three_pt)

    if (three_line == 'womens') | (three_line == 'both'):
        # womens 3
        three_pt_w = Arc((50/2, 6.25), 20.75 * 2, 20.75 * 2, theta1=5,
                              theta2=175, color=lines_color, lw=lw, alpha=lines_alpha)
        # 4.25 feet to sideline for womens
        ax.plot( (4.25, 4.25), (0, 8), color=lines_color,
                lw=lw, alpha=lines_alpha)
        ax.plot((50-4.25, 50-4.25), (0, 8.1),
                color=lines_color, lw=lw, alpha=lines_alpha)

        ax.add_patch(three_pt_w)

    # Add Patches
    ax.add_patch(paint)
    ax.add_patch(paint_boarder)
    ax.add_patch(center_circle)
    ax.add_patch(hoop)
    ax.add_patch(arc)
    
    if inner_arc:
        inner_arc = Arc((50/2, 18.833333), 12, 12, theta1=180,
                             theta2=0, color=lines_color, lw=lw,
                       alpha=lines_alpha, ls='--')
        ax.add_patch(inner_arc)

    # Restricted Area Marker
    restricted_area = Arc((50/2, 6.25), 8, 8, theta1=0,
                        theta2=180, color=lines_color, lw=lw,
                        alpha=lines_alpha)
    ax.add_patch(restricted_area)
    
    # Backboard
    ax.plot(((50/2) - 3, (50/2) + 3), (4, 4),
            color=lines_color, lw=lw*1.5, alpha=lines_alpha)
    ax.plot( (50/2, 50/2), (4.3, 4), color=lines_color,
            lw=lw, alpha=lines_alpha)

    # Half Court Line
    ax.axhline(94/2, color=lines_color, lw=lw, alpha=lines_alpha)

    
    # Plot Limit
    ax.set_xlim(0, 50)
    ax.set_ylim(0, 94/2 + 2)
    ax.set_facecolor(court_color)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel('')
    return ax

You'll note that the code has several formatting options and you can even switch between a men's and women's courts. CBBD does not currently offer NCAA women's data, but that is still a very nice feature to have.

Go ahead and run the function without any options specified.


create_ncaa_half_court()

Pretty basic and it just works! We can add some formatting options.


create_ncaa_half_court(three_line='mens', court_color='black', lines_color='white', paint_alpha=0, inner_arc=True)

Feel free to mess around more with different court and style combinations.

Importing Shot Location Data

We will grab shot location data from the CollegeBasketballData.com (CBBD) API. Specifically, we'll be working with the cbbd Python package (imported above). First, configure your API key, replacing the placeholder below with your own API key. If you need an API key, you can register for a free key via the CBBD main website.


configuration = cbbd.Configuration(
    access_token = 'your_api_key_here'
)

Shot location data is included in play by play data. We can use the CBBD Plays API to grab all shooting plays for a specific team or player. In this example, we will grab team-level data. We will specify season and team parameters. We will also pass in a shooting_plays_only flag to only return shooting plays (i.e. filtering out things like timeouts, rebounds, fouls, etc). The code block below will grab shooting plays associated with Dayton in the 2025 season. Feel free to switch up the team or season.


with cbbd.ApiClient(configuration) as api_client:
    plays_api = cbbd.PlaysApi(api_client)
    plays = plays_api.get_plays_by_team(season=2025, team='Dayton', shooting_plays_only=True)
plays[0]

Example output of a shooting play:

PlayInfo(id=118229, source_id='401715398101806301', game_id=426, game_source_id='401715398', game_start_date=datetime.datetime(2024, 11, 9, 19, 30, tzinfo=datetime.timezone.utc), season=2025, season_type=, game_type='STD', play_type='LayUpShot', is_home_team=False, team_id=212, team='Northwestern', conference='Big Ten', opponent_id=64, opponent='Dayton', opponent_conference='A-10', period=1, clock='19:36', seconds_remaining=1176, home_score=0, away_score=0, home_win_probability=0.635, scoring_play=False, shooting_play=True, score_value=2, wallclock=None, play_text='Ty Berry missed Layup.', participants=[PlayInfoParticipantsInner(name='Ty Berry', id=5452)], shot_info=ShotInfo(shooter=ShotInfoShooter(name='Ty Berry', id=5452), made=False, range='rim', assisted=False, assisted_by=ShotInfoShooter(name=None, id=None), location=ShotInfoLocation(y=270, x=864.8)))

We can easily load this up into a pandas DataFrame. The current scale for the x and y coordinates is 10 pts for every 1 foot. Dividing by 10, we can convert that into feet as we import into a DataFrame, which will make it easier to work with the half court plot we ran through above. We will also filter out any shooting plays that may be missing location data for whatever reason.


df = pd.DataFrame.from_records([
    dict(
        x=p.shot_info.location.x / 10,
        y=p.shot_info.location.y / 10,
    )
    for p in plays
    if p.shot_info is not None
        and p.shot_info.location is not None
        and p.shot_info.location.x is not None
        and p.shot_info.location.y is not None
])

df.head()

	x	y
0	76.14	29.5
1	22.56	41.0
2	26.32	8.5
3	81.78	31.5
4	69.56	9.5
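
If you want more than raw coordinates, the same pattern extends to other fields on the play. Here's a minimal sketch that also pulls the shooter's name, whether the shot was made, and the shot range, all of which appear in the example output above. The helper name shot_to_record is my own and not part of the cbbd package, and the stand-in object at the bottom only mimics the attribute layout of a play so the snippet runs without an API call.

```python
from types import SimpleNamespace


def shot_to_record(play):
    """Flatten one shooting play into a dict, or return None if it has no usable location.

    shot_to_record is a hypothetical helper, not part of cbbd. It only relies on
    attributes visible in the example output: shot_info.location, .made, .range, .shooter.
    """
    info = getattr(play, "shot_info", None)
    if info is None or info.location is None:
        return None
    if info.location.x is None or info.location.y is None:
        return None
    return dict(
        x=info.location.x / 10,  # API units are tenths of a foot
        y=info.location.y / 10,
        made=info.made,
        shot_range=info.range,
        shooter=info.shooter.name if info.shooter is not None else None,
    )


# Stand-in play that mirrors the example output above (no API call needed)
example_play = SimpleNamespace(
    shot_info=SimpleNamespace(
        shooter=SimpleNamespace(name="Ty Berry", id=5452),
        made=False,
        range="rim",
        location=SimpleNamespace(x=864.8, y=270),
    )
)

record = shot_to_record(example_play)
print(record)
```

With real data, you could pass each play through this helper, drop the None results, and hand the list to pd.DataFrame.from_records. That would give you a made column to split on for made vs. missed charts.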

We have one last step to take to get our data into a usable state. We are currently working with half court plots, but these shot locations correspond to a full court. We will convert the shot locations to half court coordinates by translating locations from the missing half over to the visible half of the court.


df['x_half'] = df['x']
df.loc[df['x'] > 47, 'x_half'] = (94 - df['x'].loc[df['x'] > 47])
df['y_half'] = df['y']
df.loc[df['x'] > 47, 'y_half'] = (50 - df['y'].loc[df['x'] > 47])

# cast these to float to avoid typing issues later
df['x_half'] = df['x_half'].astype(float)
df['y_half'] = df['y_half'].astype(float)
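
To make the translation concrete, here is the same logic as a plain function applied to the first two rows of the table above. fold_to_half_court is just an illustrative name, and I'm assuming the 94 x 50 ft court dimensions used by the plotting function.

```python
COURT_LENGTH = 94.0  # feet, baseline to baseline
COURT_WIDTH = 50.0   # feet, sideline to sideline


def fold_to_half_court(x, y):
    """Map a full-court location (in feet) onto the half court nearest x = 0.

    Shots past midcourt are rotated 180 degrees around center court, so both
    coordinates flip. Shots already on the near half are returned unchanged.
    """
    if x > COURT_LENGTH / 2:
        return COURT_LENGTH - x, COURT_WIDTH - y
    return x, y


# First row of the DataFrame above: x=76.14, y=29.5 is on the far half
x_half, y_half = fold_to_half_court(76.14, 29.5)
print(round(x_half, 2), round(y_half, 2))

# Second row: x=22.56, y=41.0 is already on the near half
print(fold_to_half_court(22.56, 41.0))
```

Flipping y along with x is what keeps a shot from the left wing on the left wing after the fold. Flipping only x would mirror it to the other side of the basket.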

Plotting the Data

We can easily plot this data using matplotlib. For example, we can put it into a scatter plot.


plt.scatter(df['y_half'], df['x_half'])

Not very pretty, but you can clearly see a basketball court, including the general outline of the 3-point line.

We can improve upon this by making a hexbin chart, which will bucket shots into hexagonal areas of the court to create a sort of heatmap. The code below will create a hexbin plot using the inferno color map.


plt.hexbin(df['y_half'], df['x_half'], gridsize=20, cmap='inferno')

You can view more colormaps here and play around with different color schemes. Just replace inferno in the above snippet with the colormap of your choice. You can also type in plt.cm. and use autocomplete to conveniently see what is available.

I'm partial to gist_heat_r, so let's check that one out. We'll just rerun the code from above, replacing the colormap with that one.


plt.hexbin(df['y_half'], df['x_half'], gridsize=20, cmap=plt.cm.gist_heat_r)

You can also mess around with the gridsize parameter for lower or higher resolution. Here I will increase the value from 20 to 40.


plt.hexbin(df['y_half'], df['x_half'], gridsize=40, cmap=plt.cm.gist_heat_r)

Bringing it all together

We've plotted an empty half court. We've plotted actual shot location data points. It's time to bring that all together. Run the snippet below and then we'll break it down line by line.


fig, ax = plt.subplots(figsize=(13.8, 14))
ax.hexbin(x='y_half', y='x_half', cmap=plt.cm.gist_heat_r, gridsize=40, data=df)
create_ncaa_half_court(ax, court_color='white',
                       lines_color='black', paint_alpha=0,
                       inner_arc=True)
plt.show()

Pretty nice, huh? Let's walk through it.

On line 1, we are setting the size of the plot and returning the plot fig and ax objects.
On line 2, we are using the ax object to create a hexbin plot, almost identical to above.
On line 3, we are calling the create_ncaa_half_court function with our desired styling options. The colormap used here works best with a white background.
Lastly, we show the court with the plotted hex bins.

Let's make this even cooler. We're going to use a library called seaborn, which is built upon matplotlib. It contains many of the base plots found within matplotlib, but with its own tweaks and improvements. It also offers several additional, more advanced types of plots. You can view the gallery here. We are going to be working with a jointplot, which will combine the hexbin chart we created with aspects of a bar chart.

It's pretty simple. Just run the snippet below to see what it looks like.


sns.jointplot(data=df, x='y_half', y='x_half',
                                    kind='hex', space=0, color=plt.cm.gist_heat_r(.2), cmap=plt.cm.gist_heat_r)

Now put it all together and let's plot the jointplot on top of our half court plot.


cmap = plt.cm.gist_heat_r
joint_shot_chart = sns.jointplot(data=df, x='y_half', y='x_half',
                                kind='hex', space=0, color=cmap(.2), cmap=cmap)

joint_shot_chart.figure.set_size_inches(12,11)

# A joint plot has 3 Axes, the first one called ax_joint 
# is the one we want to draw our court onto 
ax = joint_shot_chart.ax_joint
create_ncaa_half_court(ax=ax,
                            three_line='mens',
                            court_color='white',
                            lines_color='black',
                            paint_alpha=0,
                            inner_arc=True)

One last thing: let's remove the axis labels and add a title.


cmap = plt.cm.gist_heat_r
joint_shot_chart = sns.jointplot(data=df, x='y_half', y='x_half',
                                kind='hex', space=0, color=cmap(.2), cmap=cmap)

joint_shot_chart.figure.set_size_inches(12,11)

# A joint plot has 3 Axes, the first one called ax_joint 
# is the one we want to draw our court onto 
ax = joint_shot_chart.ax_joint
create_ncaa_half_court(ax=ax,
                            three_line='mens',
                            court_color='white',
                            lines_color='black',
                            paint_alpha=0,
                            inner_arc=True)

# Get rid of axis labels and tick marks
ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(labelbottom=False, labelleft=False)
ax.set_title("Dayton Shot Attempts\n(2024-2025)", y=1.22, fontsize=18)

There are other styles of joint plots you can make by changing the kind parameter on line 3 above. For example, changing the kind from hex to scatter results in this.

Here is what happens when we change it to kde.

It doesn't look so great, does it? We can mess around a bit with the styling to make that look a little better. I'm going to change the colormap to inferno, add fill and thresh parameters, and change the half court styling a little bit.


cmap = plt.cm.inferno
joint_shot_chart = sns.jointplot(data=df, x='y_half', y='x_half',
                                kind='kde', space=0, fill=True, thresh=0,  color=cmap(.2), cmap=cmap)

joint_shot_chart.figure.set_size_inches(12,11)

# A joint plot has 3 Axes, the first one called ax_joint 
# is the one we want to draw our court onto 
ax = joint_shot_chart.ax_joint
create_ncaa_half_court(ax=ax,
                            three_line='mens',
                            court_color='black',
                            lines_color='white',
                            paint_alpha=0,
                            inner_arc=True)

# Get rid of axis labels and tick marks
ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(labelbottom=False, labelleft=False)
ax.set_title("Dayton Shot Attempts\n(2024-2025)", y=1.22, fontsize=18)

That's better. See the jointplot docs for more styles, examples, and customizations.

Conclusion and Further Reading

You should now be able to create shot location charts against an actual court using matplotlib and seaborn with the CBBD Python library. There are many ways to take this further:

Plot multiple teams using subplots
Plot made shots and missed shots side-by-side for the same team using subplots
Apply the same code to plotting shot charts for specific players
Find new styling and customizations

Lastly, I already cited Rob Mulla and his excellent Kaggle article and helper functions for plotting NCAA basketball courts. I'd be remiss if I didn't also shout out Savvas Tjortjoglou, as I drew a lot of inspiration from his article on plotting NBA shot charts.

As always, let me know what you think and happy coding!

Talking Tech: Build an environment for data analysis in 2025

Bill Radjewski — Sat, 08 Feb 2025 01:45:56 GMT

If you follow this blog, chances are that you've seen and perhaps even walked through my guide on building an environment for analysis. That article is from 5 years ago and I still get questions and feedback on it to this day. To be clear, I still think it's a perfectly valid way to build an environment and to this day I still primarily use the Docker setup outlined in the guide. However, I find myself starting to gravitate more and more towards a non-Docker environment.

Docker is great and I still use it for many things, but lately I've found that it eats up a lot of resources on my local machine so I don't always have it running. The base Docker image I shared in the previous article is still published and available for anyone to use, but it has been increasingly challenging to maintain and keep up-to-date via automation. You can still use that image and it still works great in my experience, but I've recently gained an appreciation for a more lightweight approach.

These are the tools used in this approach:

VS Code
Jupyter
Python (with virtual environments)
The CFBD and CBBD Python packages

If you've never used VS Code as an IDE, you should check it out. It's long been my IDE of choice for everything else and it provides a fantastic experience for working with Jupyter notebooks. What has put it over the top for me and caused me to use it more and more for data analytics tasks is GitHub Copilot. GitHub Copilot has become something that I am no longer able to live without. You may be familiar with my recent rewrite of the CFBD API, website, and most associated infrastructure. You may also be familiar with my recent foray into basketball with CollegeBasketballData.com. I wouldn't have been able to do any of this without Copilot. It's probably at least halved my development time on the above. And it works seamlessly with Jupyter notebooks in VS Code.

Just as with the previous guide, this guide should work whether you are on Windows, Mac, or Linux. I am a Windows user and still highly recommend setting up Windows Subsystem for Linux (WSL) with your favorite Linux flavor (I use Ubuntu) if you are also on Windows. I do all my development (personal and professional) exclusively in WSL.

Getting Started

Prerequisites are that you have the following installed:

VS Code
Python

You will also need some VS Code extensions, at the very least the Python and Jupyter extensions. Here is the list of extensions I am running for this tutorial:

Open up a terminal window. Let's create a directory called jupyter and move into that directory.


mkdir jupyter
cd jupyter

Next, we're going to create a Python virtual environment. This is always a good practice as it allows you to work with different Python versions and package versions across different folders/repos.


python -m venv ./venv

This should have created a venv folder with the Python binaries and some scripts. We are going to activate the virtual environment we just created by running:


source ./venv/bin/activate

Note that this command may differ for Mac and non-WSL Windows. Refer to the documentation linked above for instructions specific to those OSes.

Next we will install a list of commonly used Python packages. Feel free to add any others you may need. We will also write these packages into a requirements.txt file for easy installation.


pip install cbbd cfbd ipykernel matplotlib numpy pandas scikit-learn xgboost
pip freeze > requirements.txt

Let's create an empty Jupyter notebook and open this directory in VS Code.


touch test.ipynb
code .

Inside VS Code, open the test.ipynb file from the left sidebar. Then, click on "Select Kernel" in the top-right and then "Python Environments..." from the dropdown list that appears.

Select the environment labeled venv. There should be a star next to it.

Now we can begin working in the Jupyter notebook. Let's start by importing the cfbd and pandas packages and running the code block.


import cfbd
import pandas as pd

If you didn't install the ipykernel package with the list of packages above, you may be greeted with the below prompt. Just click 'Install' and wait.

Next, let's configure the CFBD package with our CFBD API key. If you do not have a key, you can acquire one from the website. Replace the text below with your personal key.


configuration = cfbd.Configuration(
    access_token = 'your_key_here'
)

We can now call the API to grab a list of games:


with cfbd.ApiClient(configuration) as api_client:
    games_api = cfbd.GamesApi(api_client)
    
    games = games_api.get_games(year=2024, classification='fbs')

len(games)

In my example, there were 920 games returned. It's pretty easy to load those into a Pandas DataFrame.


df = pd.DataFrame.from_records([g.to_dict() for g in games])
df.head()

One neat trick using the Python library is that every method has a special version that will also include the HTTP response metadata. Simply attach _with_http_info to the end of the method. You can use this to keep track of how many monthly calls you have remaining.


with cfbd.ApiClient(configuration) as api_client:
    games_api = cfbd.GamesApi(api_client)
    
    response = games_api.get_games_with_http_info(year=2024, classification='fbs')
    
response.headers['X-CallLimit-Remaining']

You can then access the same data as before via the response.data field.


games = response.data
df = pd.DataFrame.from_records([g.to_dict() for g in games])
df.head()
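
If you want to act on that header rather than just look at it, a small guard can help. This is only a sketch: calls_remaining and warn_if_low are hypothetical helpers of my own, the threshold of 100 is arbitrary, and I'm assuming the header value arrives as a string of digits.

```python
def calls_remaining(headers):
    """Return the remaining monthly call count from response headers, or None if absent.

    Hypothetical helper: assumes the X-CallLimit-Remaining value is a numeric string.
    """
    value = headers.get("X-CallLimit-Remaining")
    if value is None:
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def warn_if_low(headers, threshold=100):
    """Print a warning and return True when fewer than `threshold` calls remain."""
    remaining = calls_remaining(headers)
    if remaining is not None and remaining < threshold:
        print(f"Only {remaining} API calls left this month")
        return True
    return False


# Plain dict standing in for response.headers
warn_if_low({"X-CallLimit-Remaining": "42"})
```

In a notebook you would pass response.headers instead of the stand-in dict, for example right before kicking off a loop that makes many calls.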

And that is all there is to it!

Conclusion

I do still love Docker for many things and think it is still perfectly adequate to use for a data analytics environment. However, you can see how this approach is much more lightweight and allows you to leverage the full capabilities of VS Code. We didn't really dig into the GitHub Copilot extension. If you haven't installed it, I cannot recommend it enough, as it is a game changer.

Some other tweaks that people make include swapping out pip for conda. However, I have found the above setup to be more than adequate. Anyway, happy coding!

REST API v2 is now in general availability!

Bill Radjewski — Sat, 04 Jan 2025 14:00:00 GMT

The CFBD API v2 is now publicly available! The free tier has been set at 1000 monthly calls (tiering and call limits subject to change). Documentation is available at apinext.collegefootballdata.com. As previously announced, API v1 will be shut down prior to the start of the 2025 season. In May 2025, both api.collegefootballdata.com and apinext.collegefootballdata.com will point to v2.

To reiterate what has already been announced, there WILL be breaking changes, so it is recommended to check out the docs and update your code at your earliest possible convenience. Current API limits are as follows:

Free tier - 1000 monthly calls
Patreon Tier 1 ($1/mo) - 5000 monthly calls
Patreon Tier 2 ($5/mo) - 30,000 monthly calls
Patreon Tier 3 ($10/mo) - 75,000 monthly calls (+ access to the GraphQL API with realtime data subscriptions)

These tiers and limits are subject to change prior to the 2025 season. More tiers will be added as needed. If you need more than 75k monthly calls, reach out to me and I will add more tiers.

Unlike REST API v1, there is no request throttling in REST API v2. Throttling was dropped in favor of monthly limits to make things more transparent and easier to communicate and implement. However, note that Cloudflare limits are still in place, and if you make a large number of simultaneous requests, you may be blocked by Cloudflare for a short period of time (~10 minutes).
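
One simple way to stay on the right side of that is to space requests out instead of firing them all at once. The sketch below is generic: fetch_sequentially is a hypothetical helper, fetch is whatever function you use to make a single call, and the half-second pause is my guess at a polite delay, not a documented Cloudflare threshold.

```python
import time


def fetch_sequentially(fetch, params_list, delay_seconds=0.5):
    """Call fetch(**params) once per entry, pausing between calls.

    Hypothetical helper: running requests one at a time with a short pause
    avoids large bursts of simultaneous traffic.
    """
    results = []
    for i, params in enumerate(params_list):
        if i > 0:
            time.sleep(delay_seconds)  # no pause needed before the first call
        results.append(fetch(**params))
    return results


# Stand-in for a real API call so the example runs offline
def fake_fetch(year, week):
    return f"games for {year} week {week}"


weeks = [{"year": 2024, "week": w} for w in (1, 2, 3)]
print(fetch_sequentially(fake_fetch, weeks, delay_seconds=0.01))
```

Keep in mind that pacing only helps with bursts. Every call still counts against the monthly limit.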

There are multiple ways to access REST API v2:

Read the API docs at apinext.collegefootballdata.com
Install the revamped Python package.
Install the new TypeScript package.
Install the new C# package.

I am exploring adding support for additional languages. If there are specific languages you would like to see, please let me know.

Subscribers in Patreon Tier 3 receive access to the new GraphQL API with realtime data subscriptions.

REST API v2 and the new GraphQL API should still be considered to be in beta. Please do not hesitate to reach out if you run into any potential bugs or issues.

Subscribing to Data Events with the CFBD GraphQL API

Bill Radjewski — Tue, 03 Sep 2024 19:00:01 GMT

Over the weekend, I announced the new and experimental CFBD GraphQL API. I already broke down most of the benefits of using GraphQL, which include more dynamic querying and granular control over the data. One benefit is so big that it merits its own post: GraphQL subscriptions.

Subscriptions do exactly what they say. They allow you to subscribe to data updates. If you're a Patreon subscriber, you may already be familiar with the live endpoints in the CFBD REST API (e.g. /scoreboard). While these endpoints present live data, they also require you, the user, to implement some sort of polling mechanism to re-trigger the endpoint on a cycle. And what's more, the data returned by the endpoint may or may not have changed. It's up to the user to figure out if it has.

In GraphQL, however, subscriptions are event-based. You specify a GraphQL query as a subscription and, instead of polling the data source repeatedly, the query auto-triggers each time that data has actually updated. Instead of making a bunch of calls, you specify one operation and then the data is pushed directly to your code whenever it changes in the CFBD database.
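
For contrast, here is roughly what the polling approach forces you to write yourself. This is a sketch, not CFBD code: poll_for_changes is a hypothetical helper, and fetch stands in for any call to a live REST endpoint.

```python
import time


def poll_for_changes(fetch, interval_seconds, max_polls):
    """Call fetch() on a fixed cycle and yield only the responses that differ from the last one.

    Hypothetical helper illustrating the bookkeeping a subscription removes:
    every poll costs a request, even when nothing has changed.
    """
    previous = None
    for i in range(max_polls):
        if i > 0:
            time.sleep(interval_seconds)
        current = fetch()
        if current != previous:
            yield current
        previous = current


# Simulated endpoint: the score only changes on the third poll
responses = iter([{"home": 0, "away": 0}, {"home": 0, "away": 0}, {"home": 7, "away": 0}])
updates = list(poll_for_changes(lambda: next(responses), interval_seconds=0, max_polls=3))
print(len(updates))
```

The second poll in this simulation is pure overhead: a request was made, compared, and thrown away. With a subscription, that request never happens.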

Subscriptions are pretty simple. Let's take a regular GraphQL query, one that queries betting lines from a specific sportsbook for all future games:


query bettingQuery {
	game(
		where: {
			status: { _eq: "scheduled" }
			lines: { provider: { name: { _eq: "Bovada" } } }
			_or: [
				{ homeClassification: { _eq: "fbs" } }
				{ awayClassification: { _eq: "fbs" } }
			]
		}
	) {
		homeTeam
		awayTeam
		lines(where: { provider: { name: { _eq: "Bovada" } } }) {
			spread
			overUnder
			provider {
				name
			}
		}
	}
}

Pretty standard query, right? If we wanted, we could call this query regularly, parsing the response to see if any of the data has changed. Much simpler would be turning it into a subscription:


subscription bettingSubscription {
	game(
		where: {
			status: { _eq: "scheduled" }
			lines: { provider: { name: { _eq: "Bovada" } } }
			_or: [
				{ homeClassification: { _eq: "fbs" } }
				{ awayClassification: { _eq: "fbs" } }
			]
		}
	) {
		homeTeam
		awayTeam
		lines(where: { provider: { name: { _eq: "Bovada" } } }) {
			spread
			overUnder
			provider {
				name
			}
		}
	}
}

That was simple! The only change I made was swapping the query operation for a subscription operation (I also changed the arbitrary operation name to bettingSubscription). Now, whenever the data returned by this query changes in CFBD, I will get an update pushed directly to me. No more polling over and over again. No more trying to figure out if anything has actually changed.

If you want to get pushed an update whenever a game's status changes to "completed" so you know that it's time to pull play or box score data, you can do that. If you want to be alerted as above when a sportsbook spread has changed, you can do that. Want to be pushed an update when recruiting data changes? You can now do that, too.

Creating a Subscription in Python

One important thing to note: Insomnia does not support GraphQL subscriptions. However, I still recommend designing all of your GraphQL operations in Insomnia since you can take advantage of its autocomplete and interactive GraphQL docs. You would just build the subscription as a query and then change it to a subscription when putting it into your Python code.

We're going to be working with three packages: gql and backoff, which come from PyPI, and asyncio, which ships with Python's standard library. So make sure to have gql and backoff installed in your environment.

We're going to walk through two different examples. Here is the first example and it's pretty simple:


from gql import Client, gql
from gql.transport.websockets import WebsocketsTransport

transport = WebsocketsTransport(
    url="wss://graphql.collegefootballdata.com/v1/graphql",
    headers={ "Authorization": "Bearer YOUR_API_KEY"}
)

client = Client(
    transport=transport,
    fetch_schema_from_transport=True,
)

query = gql('''
    subscription bettingSubscription {
        game(
            where: {
                status: { _eq: "scheduled" }
                lines: { provider: { name: { _eq: "Bovada" } } }
                _or: [
                    { homeClassification: { _eq: "fbs" } }
                    { awayClassification: { _eq: "fbs" } }
                ]
            }
        ) {
            homeTeam
            awayTeam
            lines(where: { provider: { name: { _eq: "Bovada" } } }) {
                spread
                overUnder
                provider {
                    name
                }
            }
        }
    }
''')

for result in client.subscribe(query):
    # put your logic here
    print(result)

Let's walk through what this code is doing. On line 4, we are creating a WebsocketsTransport. You'll note this is different from what we did in the previous post for making GraphQL queries. If you remember, queries and mutations are just HTTP POST requests. If you look at line 5, we are instead using the wss:// protocol. Instead of making an HTTP request, we are working over a WebSocket. Unlike the HTTP protocol, WebSockets establish a persistent connection that allows for two-way communication. This is how GraphQL subscriptions are possible. A persistent connection is opened over a WebSocket. The client submits the subscription to the GraphQL server and then the GraphQL server pushes a message out to the client whenever there is an update relevant to that subscription.

On line 6, be sure to replace YOUR_API_KEY with the same API key you use to access the CFBD REST API.

Starting at line 14, we build out a GraphQL operation that will be submitted to the GraphQL server as a subscription. This is the same subscription we outlined at the start of this post which subscribes to updates to the spreads and totals from a specific sportsbook (Bovada) for upcoming games.

On line 39, we begin looping through subscription updates. The GraphQL server will return an initial data set pertaining to the subscription query. Whenever there are updates to the data set, more results will appear in the loop and our code will act upon it. In the example above, we are merely printing the results to the console, but this is where you would put the logic that you want to be executed whenever there is a data update, such as pushing the updated data to your own data store.
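
As an example of what that logic might look like, the sketch below flattens one subscription result into simple rows. I'm assuming each result is a dict shaped like the subscription's selection set (a game list whose items carry homeTeam, awayTeam, and a lines list). flatten_lines is a hypothetical helper and the sample values are made up.

```python
def flatten_lines(result):
    """Turn one subscription result into a list of flat dicts, one per betting line.

    Hypothetical helper: assumes the result mirrors the fields requested in
    bettingSubscription.
    """
    rows = []
    for game in result.get("game", []):
        for line in game.get("lines", []):
            rows.append({
                "home": game["homeTeam"],
                "away": game["awayTeam"],
                "provider": line["provider"]["name"],
                "spread": line["spread"],
                "over_under": line["overUnder"],
            })
    return rows


# Made-up payload in the assumed shape, for illustration only
sample = {
    "game": [
        {
            "homeTeam": "Team A",
            "awayTeam": "Team B",
            "lines": [
                {"spread": -3.5, "overUnder": 48.5, "provider": {"name": "Bovada"}}
            ],
        }
    ]
}

for row in flatten_lines(sample):
    print(row)
```

Inside the subscription loop, you would call flatten_lines(result) in place of the print and write the rows wherever you keep your data.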

I mentioned that we would be walking through two different examples. There is one potential issue with the example above: WebSocket connections, while incredibly useful, can be very brittle. The persistent connection can be interrupted for any number of reasons: network outage on your end, network outage on the GraphQL server's end, the GraphQL server going down temporarily for maintenance, etc.

Luckily, there are ways to address this. This is where we will be using the asyncio and backoff packages. Let's start with some imports:


import asyncio
import backoff

from gql import Client, gql
from gql.transport.websockets import WebsocketsTransport

Next, we are going to extract the GraphQL operation into its own async function. We will take a session as a parameter, which will be used to subscribe over a WebSocket session we will create later. This is basically a copy and paste from the previous example.


async def subscribe(session):
    query = gql('''
        subscription bettingSubscription {
            game(
                where: {
                    status: { _eq: "scheduled" }
                    lines: { provider: { name: { _eq: "Bovada" } } }
                    _or: [
                        { homeClassification: { _eq: "fbs" } }
                        { awayClassification: { _eq: "fbs" } }
                    ]
                }
            ) {
                homeTeam
                awayTeam
                lines {
                    spread
                    overUnder
                    provider {
                        name
                    }
                }
            }
        }
    ''')

    async for result in session.subscribe(query):
        # put your logic here
        print(result)

We will now create another function for managing the WebSocket connection and calling our subscription function:


@backoff.on_exception(backoff.expo, Exception, max_time=60)
async def graphql_connection():
    transport = WebsocketsTransport(
        url="wss://graphql.collegefootballdata.com/v1/graphql",
        headers={ "Authorization": "Bearer YOUR_API_KEY"}
    )

    client = Client(
        transport=transport,
        fetch_schema_from_transport=True,
    )
    
    async with client as session:
        task = asyncio.create_task(subscribe(session))
        
        await asyncio.gather(task)

The backoff module is used on line 1. This establishes some retry logic with an exponential backoff. In other words, if the WebSocket connection gets interrupted for any reason, it will retry this method over and over again with an exponential increase in the wait period in between retries.
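To make the idea of exponential backoff more concrete, here is a simplified, standard-library-only sketch of roughly what a decorator like this does. It is not the backoff package's implementation: the retry_with_backoff name and its parameters are hypothetical, the time budget is approximated by summing the waits, and the real package also adds random jitter by default, which this sketch leaves out.

```python
import asyncio


async def retry_with_backoff(attempt_fn, base=1.0, factor=2.0,
                             max_time=60.0, sleep=asyncio.sleep):
    """Retry attempt_fn, doubling the wait after each failure."""
    waited = 0.0
    tries = 0
    while True:
        try:
            return await attempt_fn()
        except Exception:
            delay = base * factor ** tries
            # Give up (re-raise) once the next wait would exceed the budget
            if waited + delay > max_time:
                raise
            await sleep(delay)
            waited += delay
            tries += 1


async def demo():
    delays = []
    calls = {"n": 0}

    async def flaky_connection():
        # Simulate a connection that drops three times before succeeding
        calls["n"] += 1
        if calls["n"] <= 3:
            raise ConnectionError("connection dropped")
        return "connected"

    async def fake_sleep(seconds):
        # Record the wait instead of actually sleeping
        delays.append(seconds)

    result = await retry_with_backoff(flaky_connection, sleep=fake_sleep)
    return result, delays


result, delays = asyncio.run(demo())
print(result, delays)  # connected [1.0, 2.0, 4.0]
```

The waits grow as 1, 2, 4, and so on, which keeps a struggling server from being hammered with immediate reconnect attempts.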

Starting on line 3, we have some more code copied and pasted from the previous example. Be sure to enter your CFBD API key on line 5.

The last four lines deal with calling the subscription method using the WebSocket session that was established on the previous lines. What's interesting is that we are calling the subscribe method inside of a task. We could take advantage of this to run several subscriptions at once if we had more than one. This would enable them all to share the same WebSocket connection. The modified code would look similar to this:


async def subscribe1(session):
    # GraphQL subscription here
    ...

async def subscribe2(session):
    # GraphQL subscription here
    ...

async def subscribe3(session):
    # GraphQL subscription here
    ...

async def subscribe4(session):
    # GraphQL subscription here
    ...

@backoff.on_exception(backoff.expo, Exception, max_time=60)
async def graphql_connection():
    transport = WebsocketsTransport(
        url="wss://graphql.collegefootballdata.com/v1/graphql",
        headers={ "Authorization": "Bearer YOUR_API_KEY"}
    )

    client = Client(
        transport=transport,
        fetch_schema_from_transport=True,
    )
    
    async with client as session:
        task1 = asyncio.create_task(subscribe1(session))
        task2 = asyncio.create_task(subscribe2(session))
        task3 = asyncio.create_task(subscribe3(session))
        task4 = asyncio.create_task(subscribe4(session))
        
        await asyncio.gather(task1, task2, task3, task4)

This modification has four different subscriptions to track, each encapsulated by its own function.
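If the task mechanics are unfamiliar, the following standard-library sketch shows the same pattern with stand-in coroutines instead of real subscriptions. The fake_subscription function and its update values are invented for illustration. The point is only that tasks created with asyncio.create_task run concurrently and asyncio.gather waits for all of them.

```python
import asyncio


async def fake_subscription(name, updates, log):
    """Stand-in for a subscription loop: handles each update as it arrives."""
    for update in updates:
        # Yield control to the event loop, as a real network wait would
        await asyncio.sleep(0)
        log.append((name, update))


async def main():
    log = []
    task1 = asyncio.create_task(fake_subscription("lines", [1, 2], log))
    task2 = asyncio.create_task(fake_subscription("scores", [10, 20], log))

    # Wait until both stand-in subscriptions have finished
    await asyncio.gather(task1, task2)
    return log


log = asyncio.run(main())
print(log)
```

Real subscriptions normally never finish on their own, so in practice gather keeps waiting until the connection drops or the program is stopped.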

The last thing we need to do is call the graphql_connection function and this is where the asyncio package comes into play:


asyncio.run(graphql_connection())

Putting everything together, your final code should look similar to this:


import asyncio
import backoff

from gql import Client, gql
from gql.transport.websockets import WebsocketsTransport

async def subscribe(session):
    query = gql('''
        subscription bettingSubscription {
            game(
                where: {
                    status: { _eq: "scheduled" }
                    lines: { provider: { name: { _eq: "Bovada" } } }
                    _or: [
                        { homeClassification: { _eq: "fbs" } }
                        { awayClassification: { _eq: "fbs" } }
                    ]
                }
            ) {
                homeTeam
                awayTeam
                lines {
                    spread
                    overUnder
                    provider {
                        name
                    }
                }
            }
        }
    ''')

    async for result in session.subscribe(query):
        # put your logic here
        print(result)
        
@backoff.on_exception(backoff.expo, Exception, max_time=60)
async def graphql_connection():
    transport = WebsocketsTransport(
        url="wss://graphql.collegefootballdata.com/v1/graphql",
        headers={ "Authorization": "Bearer YOUR_API_KEY"}
    )

    client = Client(
        transport=transport,
        fetch_schema_from_transport=True,
    )
    
    async with client as session:
        task = asyncio.create_task(subscribe(session))
        
        await asyncio.gather(task)
        
asyncio.run(graphql_connection())

Conclusion

GraphQL subscriptions are a great and efficient mechanism for subscribing to data updates. Whether you are looking to cut back on your API calls, be more efficient with your code, or be notified as soon as data changes, they are a great option. The experimental CFBD GraphQL API is available to Patreon subscribers at Tier 3. Join today if you would like to check it out. Also, check out my previous post to see more examples of what the GraphQL API can do for you. As always, let me know what you think!

Building Dynamic Queries with the CFBD GraphQL API

Bill Radjewski — Sat, 31 Aug 2024 01:08:25 GMT

Have you ever wanted more granular control over how you query data from CFBD? By more granular control, I mean dynamic filtering and sorting, querying related pieces of data in one query, and even the ability to specify which specific fields you want to be queried.

What about better real-time data support in the form of subscriptions? The REST API offers a few live endpoints that require constant polling, but I'm talking about being able to create a specific data query, subscribing to that query, and your own code being notified in real time when the data in that query changes. And this is far beyond the few live REST endpoints offered today. Imagine being able to subscribe to betting line updates, for example.

The experimental CFBD GraphQL API can enable you to do all of this and it is available to Patreon Tier 3 subscribers starting today. I put emphasis on the word experimental. It does not yet have full access to the entire CFBD data catalog, but it does incorporate a decent amount as of right now:

Team information
Conference information
Historical team/conference associations
Historical and live game data (scores, Elo ratings, excitement index, weather, media information)
Historical and live betting data
Recruiting data
Transfer data
NFL Draft history

Things that are not currently included but will be added over time:

Drive and play data
Basic game, player, and season stats
Advanced game, player, and season stats

Neither of these lists is exhaustive.

If you would like to learn and see some examples, then read on.

What is GraphQL?

GraphQL is a query language for APIs. Its central premise is that it defines a data model as a "graph" of attributes and relationships. When interfacing with such an API, you specify exactly which data you need, how it should be filtered, how it should be sorted, and it has paging abilities to grab data in batches. This is much different than a traditional REST API where you are given a concrete set of REST endpoints with discrete query parameters and a rigid data model response.

So how does it work differently from working with REST endpoints? The funny thing is, it basically is a REST endpoint. Unlike traditional REST APIs where you would likely have many different endpoints scattered across multiple different HTTP operations (e.g. GET, POST, PUT, etc), GraphQL exposes a single POST endpoint, usually named just graphql. You submit a POST request to that endpoint and the request body contains all the information about what you are trying to do and what data you want to receive back, all in GraphQL syntax.
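To make the single-endpoint idea concrete, here is a minimal sketch that packages a GraphQL operation as a plain POST request using only the Python standard library. The build_graphql_request helper is hypothetical and the tiny query is just a placeholder. The JSON body with a query key follows the usual GraphQL-over-HTTP convention. The request is only constructed here, not sent.

```python
import json
import urllib.request

URL = "https://graphql.collegefootballdata.com/v1/graphql"


def build_graphql_request(url, query, api_key, variables=None):
    """Build (but do not send) a POST request carrying a GraphQL operation."""
    payload = {"query": query}
    if variables is not None:
        payload["variables"] = variables
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_graphql_request(URL, "query { game(limit: 1) { id } }", "YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would actually send it. Every query, no matter how different, goes to that same URL with a different body.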

Here is a simple GraphQL query using the new CFBD GraphQL endpoint:


query gamesQuery {
	game(where: { season: { _eq: 2024 } }, orderBy: { startDate: ASC }) {
		id
		season
		seasonType
		week
		startDate
		homeTeam
		homeClassification
		homeConferece
		homePoints
		awayTeam
		awayClassification
		awayConferece
		awayPoints
		lines {
			provider {
				name
			}
			spread
		}
	}
}

GraphQL offers three types of operations: queries for querying data, mutations for changing data, and subscriptions for subscribing to data updates. The above example is a query named gamesQuery. The query part is important since it tells the API that we are querying for data, but the gamesQuery part is completely arbitrary. In fact, we could have completely left off query gamesQuery and the API would implicitly know we are trying to query data.

The interesting stuff starts on line 2. There is a game object that is made available in the graph and we are telling the API that we want to query these objects. We are also including some filtering and sorting on this line. We are telling the API to return games from the 2024 season and to sort by the startDate property.

Let's look at the filter a little more closely: where: { season: { _eq: 2024 } }. We are using an equal operator (_eq) to filter on the 2024 season, but there are many more operators. For example, we could use _gt if we wanted to query on seasons greater than a specific year. We can also combine filters. Let's say we wanted to query games from the 2024 season, but only in weeks 1, 3, and 5. We could do something like this: where: { season: { _eq: 2024 }, week: { _in: [1, 3, 5] } }. We'll look at some more complex scenarios later on.

We also have an ordering statement: orderBy: { startDate: ASC }. This tells the API to sort the results by the startDate field in ascending order. Similar to filters, we can combine these if we want to sort by multiple fields. And we can specify whether we want to sort in ascending or descending order on each field.
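If it helps to think about these clauses in Python terms, here is a rough local analogy using plain lists and dicts. The sample games are invented, and this only mimics what the filter and ordering mean. The API does this work on the server, not in your code. The list form of orderBy in the comment is an assumption based on the syntax used for multi-field sorts elsewhere in this post.

```python
# Invented sample rows standing in for game objects
games = [
    {"season": 2024, "week": 3, "homeTeam": "A"},
    {"season": 2024, "week": 1, "homeTeam": "B"},
    {"season": 2023, "week": 1, "homeTeam": "C"},
    {"season": 2024, "week": 2, "homeTeam": "D"},
    {"season": 2024, "week": 1, "homeTeam": "E"},
]

# where: { season: { _eq: 2024 }, week: { _in: [1, 3, 5] } }
filtered = [g for g in games if g["season"] == 2024 and g["week"] in (1, 3, 5)]

# orderBy: [{ week: DESC }, { homeTeam: ASC }]
# Python's sort is stable, so sort by the last field first, then the first field.
ordered = sorted(filtered, key=lambda g: g["homeTeam"])
ordered = sorted(ordered, key=lambda g: g["week"], reverse=True)

teams = [g["homeTeam"] for g in ordered]
print(teams)  # ['A', 'B', 'E']
```

Combined filters behave like an "and" of the individual conditions, and each sort field can have its own direction.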

As we continue past line 2, you can see that we are also able to specify which game object fields we would like returned back in the query. On line 16, we introduce another object in the graph via the lines property. We have a whole gameLines object that we could write a separate query on. However, we also have a relationship between games and game lines via the lines property. Because of this, we can tell the API to return any game lines associated with each game object. We can also specify which properties we want to be returned in these nested relationships. Notably, you'll see that we have another relationship nested within a relationship, as the provider object has a relationship with the lines object. provider provides information on the sportsbook that provides the game line.

We've gotten this far, so we should probably look at the data that gets returned by this query.


...
{
	"id": 401635525,
	"season": 2024,
	"seasonType": "regular",
	"week": 1,
	"startDate": "2024-08-24T16:00:00",
	"homeTeam": "Georgia Tech",
	"homeClassification": "fbs",
	"homeConferece": "ACC",
	"homePoints": 24,
	"awayTeam": "Florida State",
	"awayClassification": "fbs",
	"awayConferece": "ACC",
	"awayPoints": 21,
	"lines": [
		{
			"provider": {
				"name": "ESPN Bet"
			},
			"spread": 10.5
		},
		{
			"provider": {
				"name": "DraftKings"
			},
			"spread": 11.5
		},
		{
			"provider": {
				"name": "Bovada"
			},
			"spread": 10.0
		}
	]
},
...

As you can see, it matches the format and fields that we specified in the query. Let's write another query with a little bit more complexity. I want to query the most exciting games of the past 10 seasons as measured by the CFBD Excitement Index metrics. My query would look like this:


query excitementQuery {
	game(
		where: { season: { _gte: 2014 }, excitement: { _isNull: false } }
		orderBy: { excitement: DESC }
		limit: 100
	) {
		id
		season
		seasonType
		week
		startDate
		homeTeam
		homeClassification
		homeConferece
		homePoints
		awayTeam
		awayClassification
		awayConferece
		awayPoints
		excitement
	}
}

I'm writing this article right at the start of the 2024 season, so I've updated my filter, where: { season: { _gte: 2014 }, excitement: { _isNull: false } } to query all games starting with the 2014 season where the excitement field is not null or empty. I also included a sort clause, orderBy: { excitement: DESC }, because I want to sort by excitement in descending order so that the most exciting games are returned at the top. Lastly, I specified a limit of 100 results (limit: 100) because I only want the top 100 most exciting games.

Here are the partial results of that query:


{
	"data": {
		"game": [
			{
				"id": 401282177,
				"season": 2021,
				"seasonType": "regular",
				"week": 1,
				"startDate": "2021-09-05T00:00:00",
				"homeTeam": "South Alabama",
				"homeClassification": "fbs",
				"homeConferece": "SBC",
				"homePoints": 31,
				"awayTeam": "Southern Mississippi",
				"awayClassification": "fbs",
				"awayConferece": "CUSA",
				"awayPoints": 7,
				"excitement": 21.5355699358
			},
			{
				"id": 401418780,
				"season": 2022,
				"seasonType": "regular",
				"week": 9,
				"startDate": "2022-10-29T21:00:00",
				"homeTeam": "Central Arkansas",
				"homeClassification": "fcs",
				"homeConferece": "ASUN",
				"homePoints": 64,
				"awayTeam": "North Alabama",
				"awayClassification": "fcs",
				"awayConferece": "ASUN",
				"awayPoints": 29,
				"excitement": 16.5218277643
			},
			{
				"id": 401416599,
				"season": 2022,
				"seasonType": "regular",
				"week": 2,
				"startDate": "2022-09-10T22:00:00",
				"homeTeam": "Miami (OH)",
				"homeClassification": "fbs",
				"homeConferece": "MAC",
				"homePoints": 31,
				"awayTeam": "Robert Morris",
				"awayClassification": "fcs",
				"awayConferece": null,
				"awayPoints": 14,
				"excitement": 15.5860040950
			},
            ...
		]
	}
}

In the next few sections, we'll dive into how to query from the CFBD GraphQL API using Insomnia and Python.

Using the CFBD GraphQL API with Insomnia

If you haven't seen my post on using Insomnia with the CFBD API, then be sure to check it out. Insomnia is by far the best tool for experimenting with different APIs. Not only is it fantastic for experimenting with traditional REST calls, but it also has really great GraphQL support. This section of the guide assumes you are familiar with Insomnia and have it set up.

So let's go ahead and open up Insomnia. You are going to create a new request just like you normally would, but this time select "GraphQL Request" from the dropdown.

The new request should look really similar to a POST request and even be labeled as such. Before we fill in the URL, we're going to add our Auth details. Select "Bearer Token" from the Auth dropdown.

In the Token field, fill in your API key. It will be the same API key you use on the CFBD API. There is no need to add a Bearer prefix or anything else. Just paste in your key.

Now go ahead and fill out the URL: https://graphql.collegefootballdata.com/v1/graphql. After pasting that in, click on "schema" and select "Refresh Schema". Also, make sure that "Automatic Fetch" is enabled.

Clicking on "Show Documentation" from the same dropdown will open up a documentation side panel on the right. From the side panel, click on query_root to see which queries are available.

These docs are interactive, so feel free to click around to learn about the different queries and types. They aren't strictly necessary to get going, but I did want to point them out because they're still a very nice feature.

Go ahead and click on the GraphQL tab, click inside of the code body, and then hit Ctrl+Space. The code editor has full autocomplete capabilities.

As you type out queries, you can use this functionality to guide you without even needing to really know or reference the documentation.

Let's query some recruiting data. I want to query every #1 overall high school recruit since the 2014 cycle. Additionally, I want to order by overall composite rating, with the highest ratings at the top. My query would look like this:


query myQuery {
	recruit(
		where: {
			year: { _gte: 2014 }
			overallRank: { _eq: 1 }
			recruitType: { _eq: "HighSchool" }
		}
		orderBy: { rating: DESC }
	) {
		rating
		name
		position {
			position
			positionGroup
		}
		college {
			school
			conference
		}
		recruitSchool {
			name
		}
	}
}

Feel free to mess around with the query. Pick whatever fields you want to return and tweak the filters and the sorts if you like. Once you're satisfied, go ahead and submit. This is what my query returned:

I'm actually curious about my hometown. I come from a really tiny town in northern Ohio called Huron. I would like to know if there have been any legitimate recruits in the recruiting service era to hail from there. When I played (early aughts), the recruiting services were just becoming a thing and we didn't really have any FBS-level players. We had a really great TE named Jim Fisher who played at Michigan and would have fit the bill, but he was a year or two before my time and before Rivals and Scout got big.

Anyway, here's the query I drew up.


query myQuery {
	recruit(
		where: {
			recruitType: { _eq: "HighSchool" }
			hometown: { city: { _eq: "Huron" }, state: { _eq: "OH" } }
		}
		orderBy: { rating: DESC }
	) {
		stars
		ranking
		positionRank
		rating
		name
		position {
			position
			positionGroup
		}
		college {
			school
			conference
		}
		recruitSchool {
			name
		}
		hometown {
			city
			state
		}
	}
}

And here are the results:

We've had one lone 2* WR who ended up at Toledo. Way to go, Cody!

I can slightly modify this query if I want to filter historical recruits by any geographic region. Like if I wanted to query all-time recruits from the state of Alaska:

We can even do aggregates. For example, if I wanted to find mean stars and ratings and their respective standard deviations for all Michigan recruits since 2016, I could run something like the below:


query myQuery {
	recruitAggregate(
		where: {
			college: { school: { _eq: "Michigan" } }
			year: { _gte: 2016 }
			recruitType: { _eq: "HighSchool" }
		}
	) {
		aggregate {
			count
			avg {
				rating
				stars
			}
			stddev {
				rating
				stars
			}
		}
	}
}

Here are the results:

Using the CFBD GraphQL API with Python

I will preface this section by stating that you can interface with GraphQL APIs using just about any programming language. It all amounts to a basic HTTP POST request after all. If you can make an HTTP request, you can make a GraphQL request. That all said, some tools and libraries make things much easier. If I'm being honest, TypeScript/JavaScript is the best ecosystem for working with GraphQL. Much like Python is largely unparalleled when it comes to libraries available for data science and machine learning, the TypeScript/JavaScript ecosystem is unparalleled when it comes to libraries and utilities for GraphQL.

However, I recognize that a large majority of CFBD users are working in Python. And frankly, Python is probably still the correct choice for you if you are working in data and analytics. Luckily, Python does have its own set of libraries for working with GraphQL.

GQL is one of the more popular packages for interfacing with GraphQL APIs in Python. We can install it from PyPI:


pip install "gql[all]"

Or if you're using Conda:


conda install gql-with-all

For the duration of this section, I will be running my Python code out of a Jupyter notebook. However, you should be able to run this same code even if you aren't running in Jupyter.

We'll start off by importing packages from GQL:


from gql import Client, gql
from gql.transport.aiohttp import AIOHTTPTransport

Next, we will create a transport around the CFBD GraphQL URL and GraphQL client around this transport.


transport = AIOHTTPTransport(
    url="https://graphql.collegefootballdata.com/v1/graphql",
    headers={ "Authorization": "Bearer YOUR_API_KEY_HERE"}
)

client = Client(transport=transport, fetch_schema_from_transport=True)

Note that this is also where you need to configure your API key. Replace YOUR_API_KEY_HERE in the above snippet with the API key you use for the CFBD API. Notice that we do need to supply a "Bearer " prefix here.
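If you would rather not paste your key directly into a notebook, one option is to read it from an environment variable and build the header from that. This is just a sketch: the auth_headers helper and the CFBD_API_KEY variable name are my own assumptions, not something the API requires.

```python
import os


def auth_headers(env=os.environ, var="CFBD_API_KEY"):
    """Build the Authorization header from an environment variable."""
    key = env.get(var)
    if not key:
        raise RuntimeError(f"Set the {var} environment variable first")
    # The "Bearer " prefix is added here so the stored key stays bare
    return {"Authorization": f"Bearer {key}"}


# Demo with a plain dict standing in for the real environment
headers = auth_headers({"CFBD_API_KEY": "abc123"})
print(headers)  # {'Authorization': 'Bearer abc123'}
```

The resulting dict could then be passed as the headers argument when creating the transport.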

I'm going to mirror the previous section on using Insomnia. If you skipped it, I highly recommend checking it out. I find it's usually easier to design GraphQL queries in Insomnia prior to putting them into Python code.

Executing the same query, which grabs all #1 overall high school recruits since 2014 and sorts them in descending order of composite rating, looks like this:


query = gql(
    """
    query myQuery {
        recruit(
            where: {
                year: { _gte: 2014 }
                overallRank: { _eq: 1 }
                recruitType: { _eq: "HighSchool" }
            }
            orderBy: { rating: DESC }
        ) {
            rating
            name
            position {
                position
                positionGroup
            }
            college {
                school
                conference
            }
            recruitSchool {
                name
            }
        }
    }
"""
)

result = await client.execute_async(query)
result

This is what the output looks like in my Jupyter notebook.

We can run type(result) to see that result is a dict. It should be relatively easy to loop through this result and format it to our liking.
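One caveat if you are not in Jupyter: a bare await at the top level works in a notebook but not in a regular Python script. There, the usual approach is to wrap the call in an async function and start it with asyncio.run. The sketch below shows the pattern with a hypothetical fake_execute coroutine standing in for client.execute_async, so that it runs without a network connection.

```python
import asyncio


async def fake_execute(query):
    # Hypothetical stand-in for client.execute_async(query)
    return {"recruit": []}


async def run_query(execute, query):
    """Await the query inside a coroutine instead of at the top level."""
    return await execute(query)


# In a script, asyncio.run starts the event loop and returns the result
result = asyncio.run(run_query(fake_execute, "query myQuery { ... }"))
print(type(result))
```

Inside Jupyter, stick with the plain await shown earlier, since the notebook already has an event loop running.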

We can flatten all of the dicts to make them easier to put into a DataFrame:


formatted = [dict(rating=r['rating'], name=r['name'], college=r['college']['school'], position=r['position']['position']) for r in result['recruit']]
formatted

We can now easily get this into a pandas DataFrame.


import pandas as pd
df = pd.DataFrame(formatted)
df.head()

Let's run another query. This time I am going to query Michigan's historical entries in the AP poll, sorted with the most recent appearances first.


query = gql(
    """
    query myQuery {
        pollRank(
            where: {
                team: { school: { _eq: "Michigan" } }
                poll: { pollType: { name: { _eq: "AP Top 25" } } }
            }
            orderBy: [
                { poll: { season: DESC } }
                { poll: { seasonType: DESC } }
                { poll: { week: DESC } }
            ]
        ) {
            rank
            points
            firstPlaceVotes
            poll {
                season
                seasonType
                week
                pollType {
                    name
                }
            }
        }
    }

"""
)

result = await client.execute_async(query)
result

We can again flatten this and load it into a DataFrame if we desire, but I'll leave that up to you.
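If you want a starting point, here is one possible sketch of that flattening step. The flatten_poll_ranks name and the sample values are made up for illustration. The keys mirror the fields requested in the query above, and the sketch assumes the result is a dict keyed by pollRank.

```python
def flatten_poll_ranks(result):
    """Pull the nested poll fields up into one flat dict per ranking."""
    return [
        {
            "season": r["poll"]["season"],
            "seasonType": r["poll"]["seasonType"],
            "week": r["poll"]["week"],
            "poll": r["poll"]["pollType"]["name"],
            "rank": r["rank"],
            "points": r["points"],
            "firstPlaceVotes": r["firstPlaceVotes"],
        }
        for r in result["pollRank"]
    ]


# Invented sample shaped like the query result (not real poll data)
sample = {
    "pollRank": [
        {
            "rank": 5,
            "points": 1200,
            "firstPlaceVotes": 0,
            "poll": {
                "season": 2020,
                "seasonType": "regular",
                "week": 3,
                "pollType": {"name": "AP Top 25"},
            },
        }
    ]
}

rows = flatten_poll_ranks(sample)
print(rows[0]["poll"], rows[0]["rank"])  # AP Top 25 5
```

Passing rows to pd.DataFrame would then give you one row per poll appearance.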

Conclusion

I hope that illustrates the power of GraphQL and what it can do for you. It allows for much more flexibility and fewer restrictions. I get requests all the time for querying the data in different ways or different formats or allowing different types of query parameters. This can be very difficult to keep up with and maintain in a traditional REST API, but is easy work when working with GraphQL.

Again, this is available to you if you are a Patreon Tier 3 subscriber. Go to Patreon if you are interested in checking it out. I will reiterate that this is very experimental right now. If there are pieces of data available in the REST API that you would like to see here, I am in the process of adding more and more data. Another huge benefit is real-time GraphQL subscriptions, but I'll save that for a future post. If you end up checking it out, let me know what you think!

Talking Tech: Creating Charts with matplotlib

Bill Radjewski — Thu, 12 Oct 2023 19:03:18 GMT

In one of my earlier blog posts, I wrote a guide on creating charts using the (at the time) nascent CFBD Python library and a charting library/platform called Plotly. I was still relatively new to Python myself and was trying to sort out the ecosystem of Python charting libraries. Indeed in that very post, I noted that there was a wide array of different options. Ultimately, I settled on Plotly due to its ease of use, large feature set, and fantastic documentation. I still think that Plotly is a fantastic library for those very reasons. It offers a lot out of the box with a relatively minimal level of fiddling. In recent years, however, I have gravitated towards a different charting library that has since usurped Plotly as my charting library of choice: matplotlib.

The primary reason I've grown to love matplotlib is that it's very customizable. I've found that I've been able to do just about anything I've been able to draw up in my own imagination. Due to its versatility, things are not always as straightforward as they are with Plotly but I've found I've been able to do much, much more. Before we dive in deeper, check out some of the charts I've been able to generate with matplotlib.

I initially grew frustrated with Plotly when I was trying to create plots that had logos, which apparently Plotly can't really do. This is when I really started using matplotlib and discovered how to do all kinds of advanced stuff like you see above. If you want to learn how to get started doing some of this, keep on reading!

Let's get charting

Edit: The Jupyter notebook used in this guide has been uploaded to GitHub if you would like to use it to follow along.

First off, we'll assume you have a Python environment set up, preferably using Jupyter notebooks. We'll begin by importing the libraries that we need, starting with the standard ones: cfbd, pandas, and numpy. I don't always end up using numpy, but I usually import it anyway because you never know. We'll also import matplotlib.


import cfbd
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

matplotlib is pretty standard in any Jupyter or data science environment, so you should have it. If not and you get an error above, then open up a terminal and install the matplotlib package and then run the import statement again.


pip install matplotlib

Next up, we'll configure the cfbd Python library so we can make some calls. Be sure to replace the placeholder below with your personal API key.


config = cfbd.Configuration(
    access_token = 'YOUR_API_KEY'
)
client = cfbd.ApiClient(config)

Now let's grab some data that we can turn into a chart. We'll grab team Elo and SP+ ratings from the end of the 2022 season and put these into a scatterplot. Run the code below to get the data.


ratings_api = cfbd.RatingsApi(client)
elo_ratings = ratings_api.get_elo(year=2022)
sp_ratings = ratings_api.get_sp(year=2022)

Let's take a look at the format of the data that was returned from the API.

The Elo rating object is pretty simple. It's just a flat object consisting of team, year, conference, and the team's final Elo rating. The SP+ object is a bit more complex with some nesting. We really only care about the top-level properties for team and overall rating. We also want to combine these lists, but first let's convert them into DataFrames that can be merged.

Here's the code for converting the list of Elo ratings.


elo_df = pd.DataFrame.from_records([e.to_dict() for e in elo_ratings])
elo_df.head()

When converting the SP+ ratings to a DataFrame, we're only going to grab the properties we care about (team and rating).


sp_df = pd.DataFrame.from_records([dict(team=s.team, rating=s.rating) for s in sp_ratings])
sp_df.head()

Now we can merge these together into a single DataFrame. I'm also going to rename the rating column to sp to make things clearer in the data.


df = elo_df.merge(sp_df, left_on='team', right_on='team')
df.rename(columns={'rating': 'sp'}, inplace=True)
df.head()

We can now generate a scatterplot. We'll plot Elo ratings on the x-axis and SP+ ratings on the y-axis. This is super easy.


plt.scatter(df['elo'], df['sp'])

Good charts should always have a title and labels, so let's add some of those and regenerate the chart.


plt.scatter(df['elo'], df['sp'])

plt.xlabel('Elo rating')
plt.ylabel('SP+ rating')
plt.title('Elo and SP+ ratings (2022 season)')

Pretty easy, huh?

Jazzing Things Up

These charts look a little... bland? Don't you think? Let's look at jazzing things up a bit.

I mentioned that matplotlib is highly customizable. As a result, it can be heavily themed using style sheets. Luckily, it has several builtin themes out of the box. I recommend checking them all out.

A popular option is the ggplot theme, inspired by the famous R charting library. Let's check that one out.


plt.style.use('ggplot')

And then just rerun our chart code.

Personally, I'm partial to the fivethirtyeight theme, inspired by the charts from FiveThirtyEight.com.


plt.style.use('fivethirtyeight')

We can also easily manipulate the size and dimensions of charts. For example,


plt.rcParams["figure.figsize"] = [20,10]

We can also easily export charts to an image file format, such as PNG. Just add a call to savefig() with the name of the file you want to save to.


plt.scatter(df['elo'], df['sp'])

plt.xlabel('Elo rating')
plt.ylabel('SP+ rating')
plt.title('Elo and SP+ ratings (2022 season)')

plt.savefig("test.png")

Adding Team Logos

I mentioned the ability to plot team logos as being the initial impetus for my looking at matplotlib and moving away from Plotly. So this post would be no good if I didn't show you how to do that. First off, we need one more line of imports.


from matplotlib.offsetbox import OffsetImage, AnnotationBbox

Secondly, we need some logo files. I have a collection of logos on a Google Drive that you can download here. These logos (after you download and unzip them) should be placed in the same directory as your Jupyter notebook or Python script in a folder called logos.

Next up, we are going to define a function for retrieving a logo based on a team name and creating an image object from it.


def getImage(team):
    return OffsetImage(plt.imread(f'./logos/{team}.png'))
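One thing to watch for: if a team in your data has no matching file in the logos folder, reading the image will raise an error. Here is a small sketch of a helper that checks for the file first. The logo_path name is hypothetical, and the demo uses a temporary folder with an empty placeholder file rather than real logos.

```python
import tempfile
from pathlib import Path


def logo_path(team, logo_dir="./logos"):
    """Return the path to a team's logo file, or None if it is missing."""
    path = Path(logo_dir) / f"{team}.png"
    return path if path.is_file() else None


# Demo against a temporary folder containing one placeholder file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Michigan.png").write_bytes(b"")
    found = logo_path("Michigan", tmp)
    found_name = found.name if found else None
    missing = logo_path("Not A Team", tmp)

print(found_name, missing)  # Michigan.png None
```

You could use a check like this to skip teams without a logo, or to fall back to a regular scatterplot point for them.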

We need to modify our scatterplot code above to utilize this function to plot team logos in place of points on the scatterplot. Go ahead and run this code. We'll break it all down in a second.


fig, ax = plt.subplots()
ax.scatter(df['elo'], df['sp'], alpha=0)

for index, r in df.iterrows():
    ab = AnnotationBbox(getImage(r.team), (r.elo, r.sp), frameon=False)
    ax.add_artist(ab)
    
plt.xlabel('Elo rating')
plt.ylabel('SP+ rating')
plt.title('Elo and SP+ ratings (2022 season)')

If you added the logo directory properly, this is what should have been rendered:

Okay, let's break down the changes to our scatterplot code.


fig, ax = plt.subplots()

Instead of working directly off of the plt object, we called the subplots function, which allows multiple plots to be plotted in the same figure. We don't need subplot functionality here, but what's important is that this returned a Figure object (fig) and an Axes object (ax), which can both be used for various customizations. This is usually how you'll generate a chart instead of using plt directly.


ax.scatter(df['elo'], df['sp'], alpha=0)

There are two deviations here. First, we are calling scatter on the ax object instead of on plt. Secondly, we are setting an alpha property to 0. This is effectively making the plotted points invisible. We do not need the normal points to display because we will be adding team logos in their place.


for index, r in df.iterrows():
    ab = AnnotationBbox(getImage(r.team), (r.elo, r.sp), frameon=False)
    ax.add_artist(ab)

This block contains the meat of the changes. We are iterating through all of the rows in the DataFrame and creating an annotation box that consists of the team logo. In constructing the annotation box (AnnotationBbox), we are passing in the logo image (created by our getImage function using the logo path), the coordinates where the logo should display (Elo rating as the x-coordinate and SP+ rating as the y-coordinate), and setting a property to turn off the image frame (which would otherwise draw an ugly border around each logo).

Other types of charts

You can use matplotlib to create just about any type of chart: line charts, pie charts, bar charts, and more. We won't go into every one of these but let's check out a line chart.

We've already used the Elo ratings API endpoint, so let's use that to get historical data for a single team and put that into a line chart. I'm a Michigan guy so that's the team I'll be using, but feel free to substitute in your favorite team.


elos = ratings_api.get_elo(team='Michigan')
df = pd.DataFrame.from_records([e.to_dict() for e in elos])
df.head()

And let's go ahead and create a line chart.


fig, ax = plt.subplots()

ax.plot(df['year'], df['elo'], color='#00274c')

plt.xlabel('Year')
plt.ylabel('Elo rating')
plt.title('Historical Elo Rating (Michigan)')

Only two minor changes from our previous code here. First, we're calling the plot function to generate a line chart whereas previously we were calling scatter for scatterplots. Second, notice that we passed in a color parameter to style the line in the team's primary color.

How about we add the team logo as a sort of watermark somewhere on the chart?


fig, ax = plt.subplots()

ax.plot(df['year'], df['elo'], color='#00274c')

logo = OffsetImage(plt.imread('./logos/Michigan.png'), zoom=1.5)
ab = AnnotationBbox(logo, (2020, 2600), frameon=False)
ax.add_artist(ab)

plt.xlabel('Year')
plt.ylabel('Elo rating')
plt.title('Historical Elo Rating (Michigan)')

Note that lines 5-7 are almost identical to the code we used in the getImage function and to plot team logos as scatterplot points. In this example, I am plotting the logo in the upper right corner of the graph. I just had to pass in the actual graph coordinates where I wanted the image to go, (2020, 2600) in this example.

Let's say I wanted to highlight a particular range of years, in this case the tenure of a significant coach in the program's history.


fig, ax = plt.subplots()

ax.plot(df['year'], df['elo'], color='#00274c')

logo = OffsetImage(plt.imread('./logos/Michigan.png'), zoom=1.5)
ab = AnnotationBbox(logo, (2020, 2600), frameon=False)
ax.add_artist(ab)

ax.axvspan(1969, 1989, alpha=0.5, color="#FFCB05")
ax.text(1974, 1400, '    1969-1989\nBo Schembechler', va='center', fontstyle='italic', fontsize='small')

plt.xlabel('Year')
plt.ylabel('Elo rating')
plt.title('Historical Elo Rating (Michigan)')

Lines 9-10 are the only additions here. On line 9, I added a vertical span across the x-values 1969 to 1989, filled it in with the team's secondary color, and added some transparency. On line 10, I added a text label inside that span.

Now suppose there's a specific point on the chart I want to call out, maybe with some text and an arrow annotation. This is how I'd do that.


fig, ax = plt.subplots()

ax.plot(df['year'], df['elo'], color='#00274c')

logo = OffsetImage(plt.imread('./logos/Michigan.png'), zoom=1.5)
ab = AnnotationBbox(logo, (2020, 2600), frameon=False)
ax.add_artist(ab)

ax.axvspan(1969, 1989, alpha=0.5, color="#FFCB05")
ax.text(1974, 1400, '    1969-1989\nBo Schembechler', va='center', fontstyle='italic', fontsize='small')

ax.annotate("Fielding Yost\nPoint-a-Minute teams",
            xy=(1903, 2700), xycoords='data',
            xytext=(1940, 2600), textcoords='data',
            arrowprops=dict(facecolor='#FFCB05'),
            horizontalalignment='center', verticalalignment='top')

plt.xlabel('Year')
plt.ylabel('Elo rating')
plt.title('Historical Elo Rating (Michigan)')

Lines 12-16 here are the additions. This is the basic format for adding an arrow annotation. Using the annotate function, I specified the text, where the arrow should point (xy), where the text should sit (xytext), some styling for the arrow color (using the team's secondary color again), and some alignment properties. Notice how for xycoords and textcoords we specified the data option. This tells matplotlib how to interpret these coordinates. In this case, we are just going by the chart's data coordinate system. There are several other options for specifying these locations, but those are outside the scope of this article. I highly recommend looking into them on your own.
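As a small taste of those other options, here is a minimal, self-contained sketch that keeps the arrow tip in data coordinates but positions the text with the 'axes fraction' option, where (0, 0) is the lower left of the axes and (1, 1) is the upper right. The data values and the "Peak" label are placeholders rather than real Elo ratings, and the Agg backend is only used so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering so this runs without a display
import matplotlib.pyplot as plt

# Placeholder data, not real Elo ratings
years = [1900, 1950, 2000, 2020]
ratings = [1500, 1800, 1700, 2000]

fig, ax = plt.subplots()
ax.plot(years, ratings, color='#00274c')

# The arrow tip (xy) uses data coordinates; the text position (xytext)
# is given as a fraction of the axes, so it stays put if the data range changes.
note = ax.annotate("Peak",
                   xy=(2020, 2000), xycoords='data',
                   xytext=(0.5, 0.9), textcoords='axes fraction',
                   arrowprops=dict(facecolor='#FFCB05'),
                   horizontalalignment='center', verticalalignment='top')

fig.canvas.draw()  # force a render to confirm the annotation is valid
```

Positioning text by axes fraction can be handy for labels or watermarks that should stay in the same corner of the chart regardless of the plotted values.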

Further Steps

We've covered the basics of matplotlib. Hopefully it's given good insight into its versatility and power. While this post should give you some good building blocks to get started creating your own charts, we've really only scratched the surface. We only covered scatter and line charts, and there's a plethora of other chart types you can create. You can also create animated charts! Maybe that will be a blog post down the road. Here are some more resources which should help you expand upon what we've gone through here.

Measuring Field Goal Kicker Efficiency

Bill Radjewski — Mon, 25 Sep 2023 23:00:06 GMT

I was recently in a conundrum about the best way to go about measuring field goal kicker efficiency. This has been a topic of discussion in much of the college football content I follow, which is largely centered around the Michigan football team. For the past three years, Michigan has had the benefit of having perhaps the best special teams tandem in school history. What MGoBlog has dubbed the "Pax Specialistica" consisted of Groza-winning kicker Jake "Money" Moody and punter Brad Robbins, both taken in this year's NFL Draft. In the wake of Moody's departure, Michigan added Louisville transfer James Turner, who has had a pretty solid career. He's not quite Jake Moody, but then again it would be unrealistic to expect him to be.

This has raised the question around the value of field goal kicking. Namely, what is the difference between an average college kicker and one who is at the upper echelon of college kickers? Expected Points Added (EPA), such as this site's own Predicted Points Added model, seems like a good starting point with how ubiquitous EPA metrics have become in the world of CFB analytics. As it turns out, however, existing EPA models are almost entirely unsuitable for providing field goal kicker metrics. We'll break some of the reasons for that down.

If you are reading this, I am going to presume you have some familiarity with EPA. If not, we'll do just a really quick breakdown of EPA since it's central to the discussion around the unsuitability of existing EPA models in evaluating this sort of thing. The basic premise of EPA is that each yard line on the football field is assigned an Expected Points (EP) value which varies based on down and distance. Because of this, you have an EP value at the start of each play predicated on the starting down, distance, and yard line. Each play results in a new EP value, either from scoring points or from the resulting down, distance, and yard line. If you take the difference between the play's ending EP and the play's starting EP, you get the value of Expected Points Added, or EPA.
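As a rough illustration of that arithmetic, here is a minimal sketch. The EP values are made-up placeholders, not output from this site's model, and the function name is hypothetical.

```python
def expected_points_added(ep_start: float, ep_end: float) -> float:
    """EPA is the play's ending EP minus the play's starting EP."""
    return ep_end - ep_start


# Hypothetical non-scoring play: the offense moves from a 1.2 EP
# situation to a 2.0 EP situation.
print(round(expected_points_added(ep_start=1.2, ep_end=2.0), 2))

# Hypothetical scoring play: the ending EP is the points scored
# (3 for a made field goal).
print(round(expected_points_added(ep_start=1.2, ep_end=3.0), 2))
```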

Visualization of this site's EP model for 1st and 10

It might seem logical to apply this principle to field goal kicking, and in certain contexts it certainly is. However, evaluating kickers is not one of them. Think about the factors that affect the difficulty of making a FG kick. Here are a few:

Distance from the goalposts
Wind velocity
Wind direction
Kick angle (e.g. from one of the hashes versus dead center)

Ideally, we would consider each of these factors when evaluating kickers. Unfortunately, we only have data for the first factor listed here, distance from the goalposts. Also note that down and yards to go are not listed here. Whether it is 1st down or 4th down, there is no material impact on the difficulty of the kick. It also doesn't really matter if it is 4th and 10 or 4th and 1. Since these are central features in traditional EPA models, those models are very limited in their ability to measure kickers.

There's also another component to this. If a kicker misses a 50 yard field goal, the result of the play is a turnover on downs and great field position for the opponent. If a kicker misses a 20 yard field goal, there's still a turnover on downs but the opponent's field position is going to be pretty poor. As a result, the negative EPA from missing a 50 yard kick will be much more extreme than the negative EPA from missing a chip shot. This resulting EPA is still a valuable metric in certain contexts, such as evaluating a coach's decision to attempt a FG vs going for it vs punting. But it doesn't really make much sense to punish a kicker disproportionately more for missing a 50 yard kick than for missing a chip shot, does it?

A more sensible approach would be to take the one metric we have data for, field goal distance, and see how it correlates to field goal success. We could then use this information to spit out an Expected Points model for field goals based on kick distance. In fact, this is exactly what I did.

Methodology

For this exercise, I decided to query every field goal attempt dating back to the 2016 season. As we are currently a few weeks into the 2023 season, this is just over 7 seasons' worth of field goal data. I then assigned each kick a points value of 0 (for a missed kick) or 3 (for a successful kick). While there have been several instances of field goals being returned for a TD by the defense, I decided not to count these as -6 point outcomes. For one, this category of plays is a minuscule sample. Additionally, the factors that lead to a Kick Six type of play are massively outside of the control of the kicker.

I should also note that I only included FBS attempts in the dataset. This means that the resulting metrics won't necessarily be applicable at other levels, such as the NFL or FCS. After aggregating this data and assigning a points value to each attempt, I then calculated the average points scored on field goal attempts based on kick distance.

As you can see in the figure above, this gave a nice little trend that was easily fitted to a curve. The closest field goal attempts average out just short of 3 points per attempt. At a certain distance, the expected value is functionally 0 points. I am using this curve to define expected points at the FBS level for a field goal at a given distance.

FGA Expected Points at selected distances
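The aggregation step described above (before any curve fitting) can be sketched roughly as follows. The attempts listed here are a tiny made-up sample, not the actual dataset, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical sample: (kick distance in yards, whether the kick was made)
attempts = [(22, True), (22, True), (45, True), (45, False), (55, False)]


def average_points_by_distance(attempts):
    """Average points per attempt at each distance (3 for a make, 0 for a miss)."""
    totals = defaultdict(lambda: [0, 0])  # distance -> [total points, attempt count]
    for distance, made in attempts:
        totals[distance][0] += 3 if made else 0
        totals[distance][1] += 1
    return {d: points / count for d, (points, count) in sorted(totals.items())}


print(average_points_by_distance(attempts))  # {22: 3.0, 45: 1.5, 55: 0.0}
```

With the real data, these per-distance averages are the points that the curve is fitted to.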

I am also using this curve to define "replacement level" for field goal kickers. If a specific kick distance has an expected points value of 1.5, you would expect a replacement-level kicker to make that field goal about 50% of the time. Similarly, if the expected points value is 2.0, then a replacement-level kicker would be expected to make that kick 2 out of 3 times.
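Since a made field goal is always worth 3 points, the conversion between expected points and make probability is just a division by 3. A minimal sketch, with a hypothetical function name:

```python
FIELD_GOAL_POINTS = 3


def implied_make_probability(expected_points: float) -> float:
    """Make probability implied by an expected points value for a FG attempt."""
    return expected_points / FIELD_GOAL_POINTS


print(implied_make_probability(1.5))            # 0.5
print(round(implied_make_probability(2.0), 3))  # 0.667
```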

Using this concept, I've devised a metric called Points Added Above Replacement, or PAAR. To calculate PAAR, we look at each of a kicker's FG attempts and find the difference between the actual points scored by the kicker and the expected points based on FG distance. We then add these values up across all of a kicker's attempts. For example, here were the top 25 kickers in PAAR for the 2022 season.

2022 PAAR leaders

Looking at this chart, Jake Moody placed third in this metric behind Stanford's Joshua Karty and NC State's Christopher Dunn. Moody's PAAR value was +15.6. We define a replacement-level kicker as one who measures out at +0.0 PAAR, neither above nor below expectations. This means that, across all of his FG attempts for the 2022 season, Jake Moody scored 15.6 more points than a replacement-level kicker given the same attempts. That is more than two TDs over the course of the season. Or put another way, he provided ~1.1 points per game over what would be expected for a replacement-level kicker.

Conversely, you can also have negative PAAR values. I hate to single out kickers since it's one of the toughest and highest pressure jobs on the football field, but here's the flip side of the above chart.

2022 PAAR bottom 25

Kansas's Jacob Borcila netted -14.2 PAAR for the season. Accounting for each of his FG attempts, he scored 14.2 fewer points than would be expected for a replacement-level kicker. Compare that with Stanford's Joshua Karty at the top of the previous table with +19.9 PAAR. The difference between the top kicker and the bottom kicker from last season was a whopping 34.1 points, or just under five TDs! Kicking is important.
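To make the PAAR calculation described above concrete, here is a minimal sketch. The expected points lookup uses made-up values at three distances rather than the fitted curve, and the attempts are hypothetical.

```python
def paar(attempts, expected_points):
    """Sum of (actual points - expected points) over a kicker's attempts.

    attempts: iterable of (distance, made) tuples
    expected_points: callable mapping a kick distance to its expected points
    """
    return sum((3 if made else 0) - expected_points(distance)
               for distance, made in attempts)


# Hypothetical expected points at selected distances (NOT the fitted curve)
HYPOTHETICAL_EP = {25: 2.7, 40: 2.1, 50: 1.5}

# Hypothetical kicker: makes from 25, 40, and 50, then misses from 50
kicker_attempts = [(25, True), (40, True), (50, True), (50, False)]

print(round(paar(kicker_attempts, HYPOTHETICAL_EP.get), 2))  # 1.2
```

Each make adds the gap between 3 and the expected points at that distance, and each miss subtracts the expected points, so long makes help a kicker more than chip shots do and missed chip shots hurt more than long misses.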

Other Applications

I argued at the start of this that traditional EPA models aren't suitable for measuring kicker performance, but that doesn't mean they are altogether useless. We can combine this new FG expected points model with our traditional EPA model to visualize when it might make sense to go for a 1st down or TD versus attempting a field goal.

EP differential with a replacement level kicker

This heatmap illustrates the expected point differential between kicking a FG with a replacement-level kicker and the current expected points based on the distance to go and the yard line. Situations where there is more value in attempting a FG are shaded green whereas the redder areas are where points are being left on the table in deciding on a FG attempt. We can contrast this with the chart for an elite kicker.

EP differential with an elite kicker

See how much greener this chart is than the previous one? Having an elite kicker makes the decision to attempt a FG versus going for the TD or 1st down an easier one since you have greater confidence in actually getting points with a FG attempt. Now, let's check out the chart for a kicker who is far below replacement level.

EP differential with a far below replacement level kicker

A lot more red there. You're probably much better off taking your chances and going for it rather than attempting a FG. This is somewhat of a simplification since this isn't always a binary choice. The option to punt exists and should ideally be included in the calculus here. That said, I think the point is pretty clear on the value of a good kicker and how much an elite kicker can open up options and make decisions easier.
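The comparison behind each cell of these heatmaps can be sketched as a simple subtraction. All of the numbers below are hypothetical, and how kicker quality shifts the FG expected points is an assumption of this sketch (it is simply passed in directly), not something taken from the actual model.

```python
def fg_ep_differential(fg_expected_points: float, current_expected_points: float) -> float:
    """Positive values favor the FG attempt (green); negative values
    mean points are being left on the table by kicking (red)."""
    return fg_expected_points - current_expected_points


# Hypothetical EP for the current down, distance, and yard line
current_ep = 2.0

# Hypothetical FG expected points from the same spot for three kickers
kickers = [("replacement level", 1.8), ("elite", 2.4), ("far below replacement", 1.1)]

for label, fg_ep in kickers:
    print(f"{label}: {fg_ep_differential(fg_ep, current_ep):+.1f}")
```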

Conclusion

Hopefully, this gives a clear idea of how we can go about evaluating kickers. I'm very excited to share the PAAR metric and start utilizing it. I plan on posting updates throughout the course of the season. And in case you missed it, I've backfilled and am now including FG kickers in player-play statistic data so you can now track individual kickers at the play level and devise some metrics of your own. Next steps for me are making some of this stuff, like PAAR and FG EP, available on the site and API. Then, it's on to punters!

As always, feel free to reach out and let me know what you think on Twitter, Discord, or Reddit.

Talking Tech: Navigating the CFBD API with Insomnia

Bill Radjewski — Fri, 22 Sep 2023 14:00:17 GMT

There are a lot of good tools for working with APIs. Historically, Postman has been ubiquitous in this area. While Postman is still a great tool, I ditched it a few years back for a competing tool called Insomnia. If you've never used either of these tools, you may be wondering what they do. Mainly, they provide a convenient user interface for interacting with API endpoints. You can add an endpoint and configure its URL, query parameters, request body, and request headers. Then you can call that endpoint and explore its output. You can do all of this in the UI without having to write a line of code. The benefit is you can quickly get to experimenting and testing out APIs right out of the box.

Here is an example of what this looks like querying the /games endpoint of the CFBD API:

Calling an endpoint with Insomnia

You can see all of the configuration needed to call the endpoint laid out in the middle panel, including the URL, setting query parameters, and any other additional properties. Here we configured the request to query all 2022 games. After sending the request, the formatted payload appears in the right panel.

If you look at the left panel, you can see that I have all endpoints in the CFBD API available to me and searchable.

List of available endpoints

Looking at the middle panel more closely, I already have all available query parameters prepopulated for each endpoint. I can fill these in and enable/disable them at my leisure. Normally, you would have to manually add each endpoint and its available query parameters. Luckily, Insomnia has a very nice feature where you can auto-import an API collection from a Swagger or OpenAPI specification. We'll detail some steps for doing this further down.

For now, let's check out some more features of Insomnia that are worth noting. One of my favorite features is the ability to autogenerate code from any API call. All you need to do is click on the little arrow to the right of a given endpoint and select 'Generate Code':

Endpoint menu

From there, you can select from one of many popular programming languages:

Selecting a programming language

And the code will auto-generate:

Autogenerated code

Another great feature is JSONPath response filtering. Oftentimes, the response payload will be quite large and perhaps you'd like to filter it down because you're looking for a specific item. This is where JSONPath comes in. Using the /games response above, let's say I wanted to see ids for all games in the response body with an excitement index greater than 15. The JSONPath value would be $[?(@.excitement_index > 15)].id, which filters and transforms the output as seen here:

List of game ids
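If you are more comfortable in Python than JSONPath, the same filter can be expressed as a list comprehension over the parsed response. The records below are hypothetical stand-ins for a /games payload; only the id and excitement_index fields come from the JSONPath expression above, and the guard against missing values is a precaution I am assuming may be needed.

```python
# Hypothetical stand-in for a parsed /games response
games = [
    {"id": 1, "excitement_index": 17.2},
    {"id": 2, "excitement_index": 3.4},
    {"id": 3, "excitement_index": None},  # value missing for this game
    {"id": 4, "excitement_index": 15.5},
]


def ids_above(games, threshold=15):
    """Ids of games whose excitement index exceeds the threshold."""
    return [g["id"] for g in games
            if (g.get("excitement_index") or 0) > threshold]


print(ids_above(games))  # [1, 4]
```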

If you're interested in using Insomnia to work with the CFBD API, then read on. We'll walk through some steps to get things configured.

Configuring Insomnia for the CFBD API

Here are some simple steps for getting Insomnia up and running with the CFBD API. First off, we presume you already have Insomnia downloaded and installed. If not, then please do that before proceeding.

Now, let's go ahead and visit CollegeFootballData.com. On the right sidebar, you should see a convenient "Run in Insomnia!" button. Click it.

A new browser tab will open. Click on the "RUN CFBD" button.

If an alert box pops up, click on the "Open Insomnia" button.

Insomnia will open and present you with an import dialog. Click on "Scan".

And then "Import".

This will create a new Document. Go ahead and click to open the Document.

Click on "SPEC" at the top of the window. This will show the Swagger documentation. Then, click on "Generate Request Collection".

You will now see all of the CFBD API endpoints available to you and ready to call. However, there are a few more small steps needed before we can do that.

We need to configure our Insomnia environment. Select "Swagger env" from the environment icon at the top left. Once selected, click on the gear icon just to the right.

In the dialog that opens, add a key called base_url and give it a value of https://api.collegefootballdata.com. Afterwards, your environment config should look like this:

And with that, you are largely all set! You should now be able to configure and call any of the endpoints. Although, there is one pesky little detail we haven't looked at: authentication. Select any of the endpoints and click on the Auth tab. Click on the arrow to the left of the "Auth" text and select "Bearer Token".

Paste your API key into the "TOKEN" field. Note: you do NOT need to prefix your key with "Bearer". You don't need to add "Bearer" anywhere. You can leave the "PREFIX" field blank.

And that is about it. One minor annoyance: you will need to configure auth for each and every endpoint. If you want an easier way, there is a plugin called Global Headers that will automatically configure auth for you. I highly recommend installing it. Go to the plugin page and click "Install Plugin" to install it into Insomnia.

Assuming you now have the plugin installed, you can now set the Authorization header as a global environment variable. Go ahead and click again on the gear to the right of "Swagger env" at the top left.

Modify the environment configuration to look like below, replacing with your API key. Note that the value IS prefixed with "Bearer" here. Be sure to keep that part in.

Or you can just copy the below content, paste, and modify it in your instance.


{
    "base_path": "/",
    "scheme": "https",
    "host": "api.collegefootballdata.com",
    "base_url": "https://api.collegefootballdata.com",
    "GLOBAL_HEADERS": {
        "Authorization": "Bearer "
    }
}

You should now be able to call any of the endpoints without needing to add auth details.
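For reference, what Insomnia ends up sending is an ordinary Authorization header with a Bearer prefix. Here is a minimal sketch of building the equivalent request in Python using only the standard library. The year query parameter name and the YOUR_API_KEY placeholder are assumptions for illustration, and the helper function is hypothetical. Nothing is sent over the network here.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.collegefootballdata.com"


def build_request(path: str, api_key: str, **params) -> Request:
    """Build (but do not send) a GET request with a Bearer auth header."""
    url = f"{BASE_URL}{path}"
    if params:
        url += "?" + urlencode(params)
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})


# "year" is an assumed parameter name; replace YOUR_API_KEY with a real key
req = build_request("/games", "YOUR_API_KEY", year=2022)
print(req.full_url)
print(req.get_header("Authorization"))
```

Passing the request to urllib.request.urlopen would actually make the call, which requires a valid key.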

Okay. Now we are really done. There are a few caveats to be aware of. First off, this will import Patreon-exclusive endpoints. You will get an HTTP error if you try to call any of these without the appropriate Patreon subscription level. Secondly, this will not automatically update when the API updates, for example when new endpoints are added or the query parameters of existing endpoints are modified. In that scenario, you will need to redo all of these steps. You'll notice that Insomnia imported the Document with the CFBD API version number. CFBD API versions follow standard versioning in the format of ... You will typically only need to reimport the configuration when a major or minor version changes. And even then you may not necessarily need to, depending on what has changed.

I hope you found these steps helpful. More importantly, I hope you find Insomnia to be a useful tool. As I said, it is my go-to for quickly testing any API, including the CFBD API. If I want to debug something, it's the first program I open. And we've really only scratched the surface of its functionality.

Cheers!