Talking Tech: Calculating SRS (Pandemic Edition)

Hey all! I know I haven't been active on this here blog as of late. I'm very grateful to Matt for his awesome contributions and am looking forward to seeing what else he has in store. Truth is, attempting to play football in the middle of a global pandemic makes things very hard numbers-wise. For one, it can be super exhausting tracking down all the scheduling changes, postponements, and seasons getting cancelled and then un-cancelled. For two, it can make it nigh impossible to calculate many of the advanced metrics we all love, especially ones that adjust for strength of schedule and make opponent adjustments. And that is the topic of this post: the challenge of calculating metrics in this sort of environment.

You might wonder, why is it so difficult to calculate some of these metrics? We have games and data, right? While that may be true, there are several factors complicating things this season. For one, the number of data points varies wildly from team to team. At this point in the 2020 season, most MAC and Pac-12 teams have only played one or two games. Compare this with the Big Ten, which has played about five games thus far, or with other conferences that have already played nearly a full slate of games. Secondly, non-conference games have been a huge casualty of the pandemic. Just about every type of opponent-adjusted metric is reliant on non-conference games. Why is that? Well, you need these games in order to compare conferences to one another. Without them, you have various bubbles of teams playing one another with no way in the data to compare the level of competition between the bubbles. True, we have had a small number of non-conference games and that's better than nothing at all, but pretty much all of these are between P5 and G5 teams or between FBS and FCS. We have no good data to compare, say, the relative strength of the Power 5 conferences to one another.

So that certainly makes things incredibly difficult. And we haven't even talked about the variability of rosters from week to week due to positive tests and quarantines. For any given team, the roster that plays one week may be very different from the roster that plays the next, depending on who is available. But that's a much more difficult problem that we aren't going to dive into. For now, we are just going to focus on comparing teams based on strength of schedule. Do you remember the Simple Rating System (SRS) we created a while back? Let's dig into how to tackle something like that.


Quick Refresher

If you haven't checked out the previous post on SRS ratings, I highly recommend reading it for a more in-depth explanation of SRS before going much further. Either way, here's a very brief refresher. SRS basically takes all game results to build a system of linear equations, but instead of just a couple of variables (e.g. x and y) we potentially have over 130. In other words, we construct a system of equations where each team represents an unknown variable, and then we solve that system.
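
Concretely, each team's equation takes the following form. This is just my shorthand, but it matches what the code later in this post constructs:

rating_team - (1/n) * (rating_opp_1 + ... + rating_opp_n) = avg_margin_team

In plain English: a team's rating, minus the average rating of its n opponents, should equal its average scoring margin. One such equation per team gives us the full system.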

I've had a few people message me about trying to calculate SRS ratings for the 2020 season based on the previous blog post referenced above. Each attempt was met with errors when running the previous code. Remembering that we are attempting to solve a system of linear equations, can you think of why that is? Consider the following system of equations.

x + y = 10
x + 2y = 17
4w + z = 20

Right here we have four variables (w, x, y, z). Think of each variable as representing a different football team. The problem with the system above is that it is unsolvable as a whole. We can solve for x and y (x = 3 and y = 7), but we have no way of solving for w or z. There simply aren't enough data points, and there's nothing tying w and z to the variables we are actually able to solve for (x and y). This is the problem we run into when trying to calculate SRS ratings for a season like 2020, where some conferences aren't playing any non-conference games. There is no solution to the full system of equations. We have no way of calculating SRS.
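
In fact, you can reproduce this failure with NumPy directly. Here's a quick illustration of my own, padding the system with a row of zeros so the matrix is square (np.linalg.solve requires a square matrix):

import numpy as np

# variable order: (w, x, y, z)
terms = [
    [0, 1, 1, 0],  # x + y = 10
    [0, 1, 2, 0],  # x + 2y = 17
    [4, 0, 0, 1],  # 4w + z = 20
    [0, 0, 0, 0],  # nothing ties w and z back to x and y
]
solutions = [10, 17, 20, 0]

try:
    np.linalg.solve(terms, solutions)
except np.linalg.LinAlgError as err:
    print(err)  # "Singular matrix" -- no unique solution exists

This is exactly the kind of error people were hitting with the old code.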


Or do we?

A Different Angle

While we might not be able to build out a full set of SRS ratings, it should be possible to build them out for various subsets of teams. Since all teams are playing conference schedules, we should at the very least be able to generate a set of ratings for each conference. The next question is, would it be possible to stitch together all sets of conference ratings into a single, cohesive unit? I think so. Let's dive in and see what we can do.

First off, let's spin up our data analysis environment. If you don't have one, be sure to check out my earlier post on setting one up. That post shows you how to use Docker to spin up a prebuilt image that includes just about everything you need: Jupyter Lab, Anaconda, and several pre-installed libraries, like scikit-learn and Pandas. It's certainly not the only way to create an environment, but it's my preferred approach.

Once you have Jupyter up and running, go ahead and create a new Python notebook. The first thing we'll do is import all the packages we'll be using. There are just three.

import cfbd
import numpy as np
import pandas as pd

That's right. We'll be using the official CFBD Python library. If you're programming in Python, this is the best way to interact with the CFBD API. This library wasn't around the last time we generated SRS ratings, so we'll be doing things a little differently this time around. Some key benefits to using the official library:

  1. The package updates automatically whenever changes are made to the API, so it always has the latest features within minutes.
  2. Documentation is also re-generated each time.
  3. It's much more streamlined and easier to use than constructing REST requests and parsing out the responses yourself.

While this series predominantly uses Python, there are also official packages for JavaScript (cfb.js) and C#/.NET (CFBSharp) which enjoy the same benefits as their Python counterpart.

Querying data using the cfbd package is super easy. Let's go ahead and pull all game data for the 2020 season.

games = cfbd.GamesApi().get_games(year=2020, season_type='both')
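
It never hurts to sanity-check what came back before moving on. This quick peek is my own habit rather than a required step:

# how many game records did we pull?
print(len(games))

# inspect the first record to see which fields are available
print(games[0])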

There haven't been any bowl games played as of writing, but we included them above for anyone running this code in the future. Having these games as data points will make these rankings much more robust (more non-conference games!). Before I show you how to load this data into a DataFrame, I want to do a very small bit of cleaning. There's the question of Notre Dame. They're labeled in the CFBD API as an Independent (because they are) but you may recall they ended up actually playing a full ACC slate for the 2020 season. As such, I'd like to treat Notre Dame as an ACC team for the purpose of these calculations. The code below changes their conference label to ACC for all games in which they were involved.

for game in games:
    if (game.home_team == 'Notre Dame'):
        game.home_conference = 'ACC'
    elif (game.away_team == 'Notre Dame'):
        game.away_conference = 'ACC'

Alright, now we're ready to load this game data into a DataFrame. We'll use the from_records method in Pandas to accomplish this.

games_df = pd.DataFrame.from_records([
                g.to_dict() 
                for g in games 
                if g.home_points is not None 
                    and g.away_points is not None 
                    and g.home_conference is not None 
                    and g.away_conference is not None
            ])
games_df.head()

Here's the thing about the block of code above. The Game object has tons of fields, but I only care about a few. The rest are just cluttering things up and making it harder to see what I'm working with. Let's look at how to grab only the fields we care about. Let's modify the code block above and re-run it.

games_df = pd.DataFrame.from_records([
                dict(
                    id=g.id, 
                    neutral_site=g.neutral_site,
                    home_team=g.home_team,
                    home_conference=g.home_conference,
                    home_points=g.home_points,
                    away_team=g.away_team,
                    away_conference=g.away_conference,
                    away_points=g.away_points) 
                for g in games 
                if g.home_points is not None 
                    and g.away_points is not None 
                    and g.home_conference is not None 
                    and g.away_conference is not None
            ])
games_df.head()
Now that's more like it

You may have noticed that we included conference labels in this DataFrame, something we didn't include the last time we calculated SRS. Again, we need this data if we are going to do ratings by conference instead of for the whole pool of FBS teams. Now that we have our DataFrame loaded up, it's time to calculate the final scoring margin for each game. We'll do this from both the home and away teams' perspectives because it will make our work easier in a bit.

# neutral-site games keep the raw margin; home teams get docked 2.5 points for home field advantage
games_df['home_spread'] = np.where(
    games_df['neutral_site'] == True,
    games_df['home_points'] - games_df['away_points'],
    games_df['home_points'] - games_df['away_points'] - 2.5
)
games_df['away_spread'] = -games_df['home_spread']
games_df.head()

If you recall from last time, there are a few adjustments we can make to the final scoring margin. One is to adjust for home field advantage (HFA). I like to set HFA at 2.5 points, so we subtracted that many points from the home team's final margin and, by negating it, added the same amount to the away team's. You'll notice that we didn't make this adjustment for neutral-site games. This step is completely optional, so feel free to play around with different values or to omit it entirely.

Now it's time to restructure the data a little bit. Instead of having the rows be in terms of home and away, we're going to modify it to reference things in terms of team and opponent. This also means that each game will have two records, one for each team involved. This will enable us to group the data by team and make the rest of our operations easier from here on out.

teams = pd.concat([
    games_df[['home_team', 'home_conference', 'home_points', 'away_team', 'away_conference', 'away_points', 'home_spread']].rename(columns={'home_team': 'team', 'home_conference': 'conference', 'home_points': 'points', 'away_team': 'opponent', 'away_conference': 'opp_conference', 'away_points': 'opp_points', 'home_spread': 'spread'}),
    games_df[['away_team', 'away_conference', 'away_points', 'home_team', 'home_conference', 'home_points', 'away_spread']].rename(columns={'away_team': 'team', 'away_conference': 'conference', 'away_points': 'points', 'home_team': 'opponent', 'home_conference': 'opp_conference', 'home_points': 'opp_points', 'away_spread': 'spread'})
])

teams.head()

Before we go further, there are a few more adjustments we can make to the final scoring margin. Let's cap the margin at a maximum of plus/minus 28 points. Feel free to choose something different; this is just the value I've liked best. In fact, I highly encourage you to experiment with different values to see what happens. The reason we cap the margin is that huge blowouts can skew the ratings somewhat. Some teams like to call off the dogs in garbage time while others keep their foot on the gas, so this tries to account for that. This is what the code looks like to cap it at plus/minus 28 points.

teams['spread'] = np.where(teams['spread'] > 28, 28, teams['spread']) # cap the upper bound scoring margin at +28 points
teams['spread'] = np.where(teams['spread'] < -28, -28, teams['spread']) # cap the lower bound scoring margin at -28 points
teams.head()
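
As an aside, pandas can do that same clamping in a single line via the Series clip method, if you prefer:

teams['spread'] = teams['spread'].clip(-28, 28)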

Another similar adjustment is to set a floor on the scoring margin. If you're familiar with the SRS rankings that Sports Reference publishes, they use a minimum floor of plus/minus 7 points. So if the final score is within a touchdown, they adjust the margin up to a full 7-point difference. I don't do this with my ratings, but you are free to do so; the code should be very similar to the snippet above. And one last thing: just as with the HFA adjustment, these adjustments are completely optional. Feel free to substitute your own values or to omit them entirely.
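
For reference, a floor adjustment along those lines might look something like the snippet below. This is my own sketch rather than Sports Reference's actual code, and note that it leaves ties (a margin of exactly zero) untouched:

# optional: floor the margin at plus/minus 7 points
# games decided by less than a TD get bumped out to a full 7 points
teams['spread'] = np.where(
    (teams['spread'].abs() < 7) & (teams['spread'] != 0),
    np.sign(teams['spread']) * 7,
    teams['spread']
)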

Now it's just about time to run our SRS calculation, but I want to look into something before we continue any further. We talked about calculating a set of ratings for each conference, but is it possible to find a larger grouping of FBS teams to focus on, namely conferences that have played non-conference games against one another? This would also allow us to incorporate Independents into our ratings. And yes, I should probably know which conferences played non-conference games off the top of my head, but with all of the scheduling changes and game cancellations this season I'm having trouble keeping track. Let's run some code to see which conferences have played each other in at least one game this year.

games_df.query("home_conference != away_conference")[['home_conference', 'away_conference']].drop_duplicates().sort_values(['home_conference', 'away_conference'])

Here's what's going on in the above snippet. We filter down to all games where the two teams were from different conferences and grab only the conference columns from the dataset. Lastly, we drop duplicate records so that we can see each unique cross-conference matchup. Not counting FBS Independents, it looks like we have six FBS conferences that have played non-conference games. The ACC has played three of the others (AAC, Conference USA, and Sun Belt). The two remaining conferences, the Big 12 and the Mountain West, have each played the Sun Belt. That gives us the cross-matchups we need to put teams from all six conferences (plus FBS Independents) into a single pool. We will process the remaining conferences (B1G, SEC, Pac-12, MAC) in isolation.

combo_conferences = ['ACC', 'American Athletic', 'Big 12', 'Conference USA', 'FBS Independents', 'Mountain West', 'Sun Belt']
iso_conferences = ['Big Ten', 'SEC'] #, 'Pac-12', 'Mid-American']

You may have noticed that I commented out the Pac-12 and the MAC in the list of isolated conferences. Why did I do this? Teams in both conferences have only played one or two games thus far. We can definitely generate ratings for these conferences, but they aren't very meaningful due to the lack of data points. You can include them if you wish, but I am holding off for now. Heck, most B1G teams have played four or five games at this point and I'd say the ratings for them are just barely meaningful. SRS is the type of thing where it gets much more robust the more data points you have. I guess that really applies to any metric under the sun.

Now let's calculate some SRS ratings! We'll start with the combined group of teams from the first six conferences (plus independents). I have the code from last time down below with comments, but am not going to give a detailed breakdown this time around. Again, I highly encourage going back and checking out that original post for a more detailed explanation of what's going on here.

# get the group of teams to be included
combo_teams = teams[teams['conference'].isin(combo_conferences)]

# calculate the mean scoring margin for each of these teams
combo_spreads = combo_teams.groupby('team').spread.mean()

# create empty arrays
terms = []
solutions = []

# construct a system of equations
for team in combo_spreads.keys():
    row = []
    # get a list of team opponents
    opps = list(combo_teams[combo_teams['team'] == team]['opponent'])
    
    for opp in combo_spreads.keys():
        if opp == team:
            # coefficient for the team should be 1
            row.append(1)
        elif opp in opps:
            # coefficient for opponents should be -1 over the number of opponents
            row.append(-1.0/len(opps))
        else:
            # teams not faced get a coefficient of 0
            row.append(0)
            
    terms.append(row)
    
    # average game spread on the other side of the equation
    solutions.append(combo_spreads[team])

# solve the system of equations
solutions = np.linalg.solve(terms, solutions)

# combine the series of ratings with their corresponding teams
ratings = list(zip(combo_spreads.keys(), solutions))

# convert the data into a DataFrame
srs = pd.DataFrame(ratings, columns=['team', 'rating'])

# normalize the data so that the average team has a rating of 0
mean = srs.rating.mean()
srs['rating'] = srs['rating'] - mean

srs.head()

Before we go any further, we need to start considering any adjustments we would have to make in order to stitch all the different ratings together. Remember, SRS ratings are all relative based on an "average" team from the pool. If a team has a rating of +10, we take it to mean that they are 10 points better than an average team. Conversely, a rating of -3 means a team is slightly below average at 3 points worse than the average team.  If Team A has a rating of +3 and Team B a rating of -5, we would expect a game spread of +8 points in favor of Team A.
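
To make that concrete, here's a tiny helper of my own devising (predict_spread is a hypothetical name, not anything from the cfbd package) that turns two ratings into an expected margin:

def predict_spread(srs, team_a, team_b):
    # expected points by which team_a beats team_b on a neutral field
    rating_a = srs.loc[srs['team'] == team_a, 'rating'].iloc[0]
    rating_b = srs.loc[srs['team'] == team_b, 'rating'].iloc[0]
    return rating_a - rating_b

With the Team A and Team B ratings above, predict_spread would return 3 - (-5) = 8.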

Can you think of the main problem with stitching unrelated groups of team SRS ratings together? In the ratings we just calculated, the average team has a rating of 0, just like in any SRS rating system. The only problem is that we are missing teams from three of the five P5 conferences (B1G, SEC, Pac-12). An average team from this pool is going to be worse than an average team in a pool with those conferences included. Likewise, our SEC-only ratings will be normalized to 0, but the average SEC team is significantly better than the average FBS team.

Can you think of any solutions? Here's my proposal. We have historical SRS ratings going back to when college football first became a thing. It should be relatively straightforward to find the average historical SRS rating for all B1G teams, for example, and use that as our baseline for the B1G. We probably don't want to go back 150 years, though, or even 20. I'm going to settle on the most recent five years, starting with the 2015 season. This should give us enough data for a good baseline number for each group without getting too noisy. Let's be clear: this solution is far from optimal, but it's perhaps the best we can do given the circumstances.

As it so happens, I've already gone ahead and calculated these baselines. I got lazy and decided to use SQL rather than go through the Python package, but you should be able to calculate these same baselines by making API calls, either via the official Python package or querying the API directly. And remember, these numbers are the average SRS ratings from teams in each pool over the previous 5 years.

Main pool (ACC, B12, AAC, Conference USA, Mountain West, Sun Belt): -2.7
B1G: +6.9
SEC: +10.3
Pac-12: +5.8
MAC: -8.3
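
If you'd rather compute these baselines yourself in Python, a sketch like the one below should get you close. I'm assuming here that your version of the cfbd package exposes the historical SRS ratings endpoint as cfbd.RatingsApi().get_srs_ratings; double-check the generated docs for the exact method name and fields:

# assumption: get_srs_ratings(year=...) returns records with conference and rating fields
records = []
for year in range(2015, 2020):  # the 2015 through 2019 seasons
    records.extend(cfbd.RatingsApi().get_srs_ratings(year=year))

hist = pd.DataFrame.from_records([
    dict(conference=r.conference, rating=float(r.rating)) for r in records
])

# average rating by conference over the five-year window
print(hist.groupby('conference').rating.mean().round(1))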

So, let's go ahead and make a baseline adjustment of -2.7 to the main pool of ratings we just calculated.

srs['rating'] = srs['rating'] - 2.7 # baseline adjustment based on the main pool's historical average rating

And now let's get to running calculations for the remaining conferences, just the B1G and the SEC in this case. We're going to run the above calculation code as we loop through these other conferences. Once we have calculations from each conference, we will make our baseline adjustment and add them in with the main group.

# loop through each of the remaining conferences
for conference in iso_conferences:
    # grab teams from the conference
    conf_teams = teams[teams['conference'] == conference]

    # calculate the mean scoring margin for each conference team
    conf_spreads = conf_teams.groupby('team').spread.mean()

    # create empty arrays
    conf_terms = []
    conf_solutions = []

    for team in conf_spreads.keys():
        row = []
        # get a list of team opponents
        opps = list(conf_teams[conf_teams['team'] == team]['opponent'])

        for opp in conf_spreads.keys():
            if opp == team:
                # coefficient for the team should be 1
                row.append(1)
            elif opp in opps:
                # coefficient for opponents should be -1 over the number of opponents
                row.append(-1.0/len(opps))
            else:
                # teams not faced get a coefficient of 0
                row.append(0)

        conf_terms.append(row)

        # average game spread on the other side of the equation
        conf_solutions.append(conf_spreads[team])

    # solve the system of equations
    conf_solutions = np.linalg.solve(conf_terms, conf_solutions)

    # match the series of ratings up with the corresponding team names
    conf_ratings = list(zip(conf_spreads.keys(), conf_solutions))
    conf_srs = pd.DataFrame(conf_ratings, columns=['team', 'rating'])

    # normalize the ratings so that the average team is 0
    conf_mean = conf_srs.rating.mean()
    conf_srs['rating'] = conf_srs['rating'] - conf_mean

    # make a baseline adjustment for each conference based on historical ratings
    if conference == 'Pac-12':
        conf_srs['rating'] = conf_srs['rating'] + 5.8
    elif conference == 'Big Ten':
        conf_srs['rating'] = conf_srs['rating'] + 6.9
    elif conference == 'SEC':
        conf_srs['rating'] = conf_srs['rating'] + 10.3
    elif conference == 'Mid-American':
        conf_srs['rating'] = conf_srs['rating'] - 8.3

    # add these ratings to the main list of ratings
    srs = pd.concat([srs, conf_srs])

And we're done! Time to check out our ratings and see if they pass the eye test.


Go ahead and run the following snippet to check out the top 25 teams.

srs.sort_values('rating', ascending=False).reset_index()[['team', 'rating']].iloc[0:25]

Those seem... somewhat plausible? It's a weird season, so expect to see some weird results. Also keep in mind that the variable number of games played by each team throws things off. As such, you see some B1G teams' ratings being inflated, notably Wisconsin, which has had to cancel several games. Meanwhile, Ohio State is probably underrated for the same reason. This isn't atypical of what you'd see at the midpoint of a season, and that's effectively where a lot of these teams are. Just out of curiosity, let's check out the bottom 25.

srs.sort_values('rating').reset_index()[['team', 'rating']].iloc[0:25]

Again, somewhat plausible keeping in mind the factors mentioned above. We would hope that these get better as more games are played. It's also possible that some conferences are having an uncharacteristically good or bad year and there's really no way to know that at this point. Bowl games, if they get played, could help shore a lot of this up. For now, we'll take these with a grain of salt and hope they get better.

Insights

So, what insights did we draw from this exercise? For one, it's been a weird season so we can expect to get some weird data and results. We mentioned several factors that can't even be accounted for, like teams playing with variable rosters from week to week or a conference having an uncharacteristically good year. It's important to note that just about any metric that is opponent-adjusted or has an SOS component is going to run into these same issues. Heck, even Bill Connelly mentioned earlier in the season that his SP+ system would be weird this year.

In short, I would take a lot of data from this season with a grain of salt, no matter the source or how reputable it normally is. I'd also recommend against feeding data and metrics from this season into priors for future seasons. Does this mean that all data from the 2020 season is worthless? Not necessarily. I think it's still fine to use it in calculating things like EPA or comparing teams within the same conference. It's just anything with an opponent adjustment that I'd be very wary of.

If you've stuck with me this long, thank you for taking the time to read and explore this with me. Have a safe and happy Thanksgiving!