Talking Tech: Applying LRMC Rankings to College Football, Part Two

The previous post in this series covered how LRMC works and its mathematical implementation. Now it's time to take things a step further and apply this concept to ranking college football teams.

This is the second and final part of a series on implementing LRMC rankings for CFB! This entry presupposes you are familiar with the mathematical concepts behind LRMC. To view part one, which covers these concepts, click here.

When we last left off, we had covered how LRMC works and how to implement it mathematically. With that understanding in hand, we're now prepared to implement a version of LRMC computationally for ranking CFB teams. To start, let's fire up our Python notebook.

import cfbd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import math

First, we'll grab a six-year sample of CFB games spanning 2015-2020 (you'll see why we're grabbing so many games in a second), then filter out any games involving non-FBS opponents.

seasons = [cfbd.GamesApi().get_games(year=iteryear, season_type='both') for iteryear in range(2015,2021)]
games = [game for season in seasons for game in season]

games_df = pd.DataFrame().from_records([
                dict(
                    id=g.id, 
                    neutral_site=g.neutral_site,
                    home_team=g.home_team,
                    home_conference=g.home_conference,
                    home_points=g.home_points,
                    away_team=g.away_team,
                    away_conference=g.away_conference,
                    away_points=g.away_points,
                    season=g.season
                ) 
                for g in games 
                if g.home_points is not None 
                    and g.away_points is not None
                    and g.home_conference is not None 
                    and g.away_conference is not None
            ])
games_df.head()

Our first task is to calculate r_x for each of these games – the probability that the home team is better than the away team, given the home team's margin of victory.

For college basketball, estimating this value is quite easy with logistic regression – by virtue of conference home-and-home series, we can simply take all home-team margins of victory and run a logistic regression against the results of the road games in those series to determine the odds of the home team winning on the road, then adjust for home-court advantage. But it's quite rare in college football for two teams to play each other twice in one season (let alone play a home-and-home series), so we have to take a different approach. For college football, Kolbush and Sokol instead use a common-opponent approach for estimating r_x[1]. Essentially, for a given game, we can look at the opponents Team A and Team B both played over the course of the season, and estimate the odds that Team A is better than Team B based on their win percentages against those common opponents. Let's walk through an example.

games_df.loc[(games_df['id'] == 401112476)]

In this sample game, Miami lost to Virginia Tech by 7 points. How many opponents did Virginia Tech and Miami have in common that year?

opponents = (
    games_df['away_team'][(games_df['home_team'] == 'Virginia Tech') & (games_df['season'] == 2019)].tolist()
    + games_df['home_team'][(games_df['away_team'] == 'Virginia Tech') & (games_df['season'] == 2019)].tolist()
    + games_df['away_team'][(games_df['home_team'] == 'Miami') & (games_df['season'] == 2019)].tolist()
    + games_df['home_team'][(games_df['away_team'] == 'Miami') & (games_df['season'] == 2019)].tolist()
)

common_opponents = set([team for team in opponents if opponents.count(team) > 1])

common_opponents
{'Duke', 'Georgia Tech', 'North Carolina', 'Pittsburgh', 'Virginia'}

How well did VT and Miami do against those common opponents?

vt_w = 0
mia_w = 0
vt_g = 0
mia_g = 0
for index, game in games_df.loc[games_df['season'] == 2019].iterrows():
    if (game['home_team'] == 'Virginia Tech') & (game['away_team'] in common_opponents):
        vt_g += 1
        if game['home_points'] > game['away_points']:
            vt_w += 1
    elif (game['home_team'] == 'Miami') & (game['away_team'] in common_opponents):
        mia_g += 1
        if game['home_points'] > game['away_points']:
            mia_w += 1
    elif (game['away_team'] == 'Virginia Tech') & (game['home_team'] in common_opponents):
        vt_g += 1
        if game['away_points'] > game['home_points']:
            vt_w += 1
    elif (game['away_team'] == 'Miami') & (game['home_team'] in common_opponents):
        mia_g += 1
        if game['away_points'] > game['home_points']:
            mia_w += 1
            
print("Virginia Tech's common opponent WPCT: " + str(vt_w / vt_g))
print("Miami's common opponent WPCT: " + str(mia_w / mia_g))
Virginia Tech's common opponent WPCT: 0.6
Miami's common opponent WPCT: 0.4

Virginia Tech won 60% of their games against the common opponents, and Miami won 40%. Per Kolbush and Sokol's approach, this renders our calculus rather simple. Let p_j be the winning percentage of the home team (j) against the common opponents it shares with the away team (i), and let p_i be the winning percentage of i against those same opponents. Then r_x becomes:

r_x = (p_j + (1 - p_i)) / 2

Miami was the home team in this game, so p_j = 0.4 and p_i = 0.6, giving r_x = (0.4 + (1 - 0.6)) / 2 = 0.4. There is an approximated 40% chance Miami is better than Virginia Tech based on their common opponents.
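
As a quick sanity check (a minimal sketch, not from the original post), we can plug the win percentages we just tallied into that formula:

# j = Miami (the home team), i = Virginia Tech (the away team)
p_j = mia_w / mia_g
p_i = vt_w / vt_g
print((p_j + (1 - p_i)) / 2)
0.4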

To obtain a smooth estimate, we'll perform this calculation for every game where the teams had at least two common opponents in a season, compile the point differentials for those games, then run a weighted logistic regression on them (remember, our 40% estimate is currently independent of point differential – we want to translate point differential into this estimate!). Our weights will be the two teams' combined number of games against their common opponents[2].

We'll do this for the 2015-2018 seasons to estimate coefficients for the 2019 season (Kolbush and Sokol also used four prior seasons of data to estimate coefficients for a given season[1]). We'll also filter out neutral site games so we can get a better approximation for home field advantage.

train_games_df = games_df.loc[(games_df['season'].isin([2015,2016,2017,2018])) & (~games_df['neutral_site'])]

train_df = []

for index, game in train_games_df.iterrows():
    season = game['season']
    team_a = game['home_team']
    team_b = game['away_team']
    home_diff = game['home_points'] - game['away_points']
    # gather both teams' opponents for the season (using games_df consistently,
    # since the boolean masks are built from games_df)
    opponents = (
        games_df['away_team'][(games_df['home_team'] == team_a) & (games_df['season'] == season)].tolist()
        + games_df['home_team'][(games_df['away_team'] == team_a) & (games_df['season'] == season)].tolist()
        + games_df['away_team'][(games_df['home_team'] == team_b) & (games_df['season'] == season)].tolist()
        + games_df['home_team'][(games_df['away_team'] == team_b) & (games_df['season'] == season)].tolist()
    )
    common_opponents = set([team for team in opponents if opponents.count(team) > 1])
    if len(common_opponents) > 1:
        a_w = 0
        b_w = 0
        a_g = 0
        b_g = 0
        # tally wins and games against the common opponents, using a fresh
        # loop variable so we don't shadow the outer `game`
        for _, cg in games_df.loc[(games_df['season'] == season) &
                                  ((games_df['home_team'].isin([team_a, team_b])) |
                                   (games_df['away_team'].isin([team_a, team_b])))].iterrows():
            if (cg['home_team'] == team_a) & (cg['away_team'] in common_opponents):
                a_g += 1
                if cg['home_points'] > cg['away_points']:
                    a_w += 1
            elif (cg['home_team'] == team_b) & (cg['away_team'] in common_opponents):
                b_g += 1
                if cg['home_points'] > cg['away_points']:
                    b_w += 1
            elif (cg['away_team'] == team_a) & (cg['home_team'] in common_opponents):
                a_g += 1
                if cg['away_points'] > cg['home_points']:
                    a_w += 1
            elif (cg['away_team'] == team_b) & (cg['home_team'] in common_opponents):
                b_g += 1
                if cg['away_points'] > cg['home_points']:
                    b_w += 1
        
        r_x = ((a_w / a_g) + (1 - (b_w / b_g))) / 2
        
        d = {
            'season' : season,
            'team_a' : team_a,
            'team_b' : team_b,
            'home_diff' : home_diff,
            'r_x' : r_x,
            'weight' : a_g+b_g
        }
        train_df.append(d)

train_df = pd.DataFrame(train_df)  
train_df.head()

Now, we'll use our data to make a logistic regression model. Okay, I lied – it's not a strict logistic regression model. Because we have a series of continuous probabilities to train on rather than the series of 1s and 0s a logistic regression model is usually trained upon, we're going to use the logit transformation to linearize our data, run a weighted linear regression, then undo the logit transformation to obtain probabilities for r_x. Before we do that, we're going to clamp any probabilities of 100% to 90% and 0% to 10% – the logit transformation can't handle those exact probabilities, and realistically speaking, it would be difficult to project anyone as having a 100% chance of beating a given opponent regardless of their margins of victory against other opponents[3].

train_df.loc[train_df['r_x'] == 1,'r_x'] = 0.9
train_df.loc[train_df['r_x'] == 0,'r_x'] = 0.1

x = np.array(train_df['home_diff']).reshape(-1,1)
y = np.log(train_df['r_x'] / (1 - train_df['r_x']))
w = train_df['weight']
model = LinearRegression().fit(x, y, w)
plt.scatter(x, y, s = w)
plt.plot(x, model.predict(x), color='red', linewidth=3, label='Weighted model')
a = model.coef_[0]
b = model.intercept_
print("a = " + str(a))
print("b = " + str(b))
a = 0.023983237378834784
b = -0.09541138612221126

Remember that for a linear regression, we find our coefficients in the form y = ax + b. Since y is logit-transformed, to undo that transformation we plug ax + b into e^y / (1 + e^y), giving e^(ax + b) / (1 + e^(ax + b)).
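
If you find yourself reusing this inverse transformation, it can help to wrap it in a small helper (a hypothetical convenience function, not part of the original notebook, which inlines the expression instead):

def predict_r_x(margin, a=a, b=b):
    # hypothetical helper: P(home team is better) given the home margin of victory
    return math.exp(a * margin + b) / (1 + math.exp(a * margin + b))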

Let's return to that Virginia Tech/Miami game. Based on the point differential of the game and our model, what are the odds that Miami was better than Virginia Tech?

print(math.exp(a * -7 + b) / (1 + math.exp(a * -7 + b)))
0.43455413139478277

Not too far off from our original estimate of 40%!

We can also quickly calculate the expected home field advantage from these coefficients[4] – setting the logit ax + b to zero (i.e., r_x = 50%) gives x = -b / a:

h = (-b / a)
print(h)
3.9782530029249448

So from 2015-2018, home field advantage appears to be worth about 4 points – not too far off from the 2.5-point estimate Bill has been using in his SRS calculations.
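
As another sanity check (a minimal sketch, not from the original post), a home margin of exactly h should correspond to r_x = 50% – at that margin, the home team's entire edge is explained by playing at home:

print(math.exp(a * h + b) / (1 + math.exp(a * h + b)))  # exactly 0.5, since a * h + b = 0
0.5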

Now that we have our formula for estimating r_x, we can start building our Markov chain. Recall how we did this by hand in part one:

To generate our transition matrix, we go team by team. In the row for a given team (let's say, Team X), each column gets a sum taken over all of Team X's games: the opponent's column gets the probability that the opponent is better than Team X, and Team X's own column gets the probability that Team X is better. Then, we divide all of the values in the row by the number of games Team X played.
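
As a tiny illustration (a made-up two-team example, not from the original post), suppose Team X hosts Team Y once and our model gives the home team a 60% chance of being better. Part one's construction yields this transition matrix:

r_x = 0.6  # P(home team X is better than away team Y)
p_toy = np.array([
    [r_x, 1 - r_x],  # Team X's row: its own column gets r_x, Y's column gets 1 - r_x
    [r_x, 1 - r_x],  # Team Y's row: X's column gets r_x, its own column gets 1 - r_x
])
# each team played one game, so dividing by games played leaves each row summing to 1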

We're going to implement this algorithmically for our 2019 dataset. First, let's grab our 2019 values.

games_df_2019 = games_df.loc[games_df['season'] == 2019]
games_df_2019.head()

Next, we'll build two structures: an n x n transition matrix, where n is the number of FBS teams, and a length-n vector tracking how many games each team has played.

team_names = sorted(list(set(games_df_2019['home_team'].tolist() + games_df_2019['away_team'].tolist())))
n_teams = len(team_names)
p = np.zeros((n_teams,n_teams))
n_games = np.zeros(n_teams)

Now, we'll iterate through each game, calculate r_x, and put either r_x or 1 - r_x in the appropriate place in our transition matrix (to understand why we're placing each value where we are, consult part one of this guide here). For neutral-site games, we'll add the home-field advantage back into the margin before calculating r_x – since h = -b/a, this cancels out the home-field effect baked into the model's intercept. We'll also increment each team's n_games counter.

for index, game in games_df_2019.iterrows():
    home_team = game['home_team']
    away_team = game['away_team']
    home_points = game['home_points']
    away_points = game['away_points']
    neutral_flg = game['neutral_site']
    home_team_ndx = team_names.index(home_team)
    away_team_ndx = team_names.index(away_team)
    spread = home_points - away_points
    if neutral_flg:
        spread += h
    r_x = math.exp(a * spread + b) / (1 + math.exp(a * spread + b))
    n_games[home_team_ndx] += 1
    n_games[away_team_ndx] += 1
    p[home_team_ndx, away_team_ndx] += 1 - r_x  # home's row: away's column gets P(away is better)
    p[away_team_ndx, home_team_ndx] += r_x      # away's row: home's column gets P(home is better)
    p[home_team_ndx, home_team_ndx] += r_x      # home's row: its own column gets P(home is better)
    p[away_team_ndx, away_team_ndx] += 1 - r_x  # away's row: its own column gets P(away is better)

Next, we'll divide each row of our transition matrix by the number of games we're tracking for each team.

p = p / n_games[:,None]
print(p)

Finally, we'll approximate the steady state of this matrix to find which teams LRMC believes are best – we raise the transition matrix to a large power, then multiply a vector that counts down from n_teams to 1 by the resulting steady-state matrix, which gives us our relative team strengths.

prior = n_teams - np.array(list(range(n_teams)))
steady_state = np.linalg.matrix_power(p, 1000)
rating = prior.dot(steady_state)
rating_df = pd.DataFrame({
    'team': team_names,
    'lrmc_rating': rating
    
})
print(rating_df.sort_values(by=['lrmc_rating'], ascending=False).iloc[0:25])
                  team  lrmc_rating
77          Ohio State   187.106455
48                 LSU   175.945528
21             Clemson   157.969142
128          Wisconsin   130.052698
2              Alabama   129.821806
34             Georgia   118.031678
82              Oregon   115.817995
78            Oklahoma   110.981205
9               Auburn   108.585276
75          Notre Dame   107.732058
59            Michigan   106.242719
117               Utah   105.915359
84          Penn State   105.042793
12              Baylor   103.884959
101              Texas   101.253766
42                Iowa   100.165017
29             Florida    96.009519
56             Memphis    93.568406
62           Minnesota    90.033058
20          Cincinnati    89.598700
123         Washington    88.540195
43          Iowa State    88.377904
3    Appalachian State    86.243448
110                UCF    86.078021
102          Texas A&M    84.039906
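
If you'd like to double-check the matrix-power approximation (an optional cross-check, not from the original post), the steady state is also the left eigenvector of p associated with eigenvalue 1, and ranking by it should produce the same ordering:

eigvals, eigvecs = np.linalg.eig(p.T)  # right eigenvectors of p.T are left eigenvectors of p
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary = stationary / stationary.sum()  # normalize into a probability distribution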

These are some good-looking rankings! Most importantly, the model seems to be placing G5 teams in the sweet spot of where they usually end up year to year (10-3 UCF was also ranked 24th in the AP poll in 2019), which indicates that it's appropriately evaluating strength of schedule.

So we've finished with LRMC rankings for 2019 – but how will they work in the messed-up year of 2020, when some teams are playing conference-only schedules? Let's find out. We'll re-run the same analysis, but we'll adjust home field advantage for the weird pandemic stuff, eyeballing it and dropping it by 2 points – removing some to account for the impact of reduced or no crowds, but keeping some to account for rest and travel.

train_games_df = games_df.loc[(games_df['season'].isin([2016,2017,2018,2019])) & (~games_df['neutral_site'])]

train_df = []

for index, game in train_games_df.iterrows():
    season = game['season']
    team_a = game['home_team']
    team_b = game['away_team']
    home_diff = game['home_points'] - game['away_points']
    # gather both teams' opponents for the season (using games_df consistently,
    # since the boolean masks are built from games_df)
    opponents = (
        games_df['away_team'][(games_df['home_team'] == team_a) & (games_df['season'] == season)].tolist()
        + games_df['home_team'][(games_df['away_team'] == team_a) & (games_df['season'] == season)].tolist()
        + games_df['away_team'][(games_df['home_team'] == team_b) & (games_df['season'] == season)].tolist()
        + games_df['home_team'][(games_df['away_team'] == team_b) & (games_df['season'] == season)].tolist()
    )
    common_opponents = set([team for team in opponents if opponents.count(team) > 1])
    if len(common_opponents) > 1:
        a_w = 0
        b_w = 0
        a_g = 0
        b_g = 0
        # tally wins and games against the common opponents, using a fresh
        # loop variable so we don't shadow the outer `game`
        for _, cg in games_df.loc[(games_df['season'] == season) &
                                  ((games_df['home_team'].isin([team_a, team_b])) |
                                   (games_df['away_team'].isin([team_a, team_b])))].iterrows():
            if (cg['home_team'] == team_a) & (cg['away_team'] in common_opponents):
                a_g += 1
                if cg['home_points'] > cg['away_points']:
                    a_w += 1
            elif (cg['home_team'] == team_b) & (cg['away_team'] in common_opponents):
                b_g += 1
                if cg['home_points'] > cg['away_points']:
                    b_w += 1
            elif (cg['away_team'] == team_a) & (cg['home_team'] in common_opponents):
                a_g += 1
                if cg['away_points'] > cg['home_points']:
                    a_w += 1
            elif (cg['away_team'] == team_b) & (cg['home_team'] in common_opponents):
                b_g += 1
                if cg['away_points'] > cg['home_points']:
                    b_w += 1
        
        r_x = ((a_w / a_g) + (1 - (b_w / b_g))) / 2
        
        d = {
            'season' : season,
            'team_a' : team_a,
            'team_b' : team_b,
            'home_diff' : home_diff,
            'r_x' : r_x,
            'weight' : a_g+b_g
        }
        train_df.append(d)

train_df = pd.DataFrame(train_df)
train_df.loc[train_df['r_x'] == 1,'r_x'] = 0.9
train_df.loc[train_df['r_x'] == 0,'r_x'] = 0.1

x = np.array(train_df['home_diff']).reshape(-1,1)
y = np.log(train_df['r_x'] / (1 - train_df['r_x']))
w = train_df['weight']
model = LinearRegression().fit(x, y, w)
a = model.coef_[0]
b = model.intercept_
h = (-b / a) - 2

games_df_2020 = games_df.loc[games_df['season'] == 2020]
games_df_2020.head()

team_names = sorted(list(set(games_df_2020['home_team'].tolist() + games_df_2020['away_team'].tolist())))
n_teams = len(team_names)
p = np.zeros((n_teams,n_teams))
n_games = np.zeros(n_teams)

for index, game in games_df_2020.iterrows():
    home_team = game['home_team']
    away_team = game['away_team']
    home_points = game['home_points']
    away_points = game['away_points']
    neutral_flg = game['neutral_site']
    home_team_ndx = team_names.index(home_team)
    away_team_ndx = team_names.index(away_team)
    spread = home_points - away_points
    if neutral_flg:
        spread += h
    r_x = math.exp(a * spread + b) / (1 + math.exp(a * spread + b))
    n_games[home_team_ndx] += 1
    n_games[away_team_ndx] += 1
    p[home_team_ndx, away_team_ndx] += 1 - r_x
    p[away_team_ndx, home_team_ndx] += r_x
    p[home_team_ndx, home_team_ndx] += r_x
    p[away_team_ndx, away_team_ndx] += 1 - r_x
    
p = p / n_games[:,None]
prior = n_teams - np.array(list(range(n_teams)))
steady_state = np.linalg.matrix_power(p, 1000)
rating = prior.dot(steady_state)
rating_df = pd.DataFrame({
    'team': team_names,
    'lrmc_rating': rating
    
})
print(rating_df.sort_values(by=['lrmc_rating'], ascending=False).iloc[0:25])
                  team  lrmc_rating
2              Alabama   135.230015
16             Buffalo   131.356858
21             Clemson   130.404969
73          Notre Dame   129.633899
10                 BYU   129.319638
22    Coastal Carolina   113.635301
42          Iowa State   110.743776
49           Louisiana   109.852425
107                UCF   105.703515
20          Cincinnati   104.514956
69      North Carolina   103.885852
124   Western Michigan   103.099373
76            Oklahoma   100.090273
104             Tulane    99.269570
18    Central Michigan    99.083950
28             Florida    98.579495
14      Boston College    97.503712
56               Miami    96.908824
102             Toledo    96.386918
64            NC State    95.923478
52          Louisville    94.467302
98               Texas    93.446952
3    Appalachian State    93.341825
118      Virginia Tech    92.720235
33             Georgia    92.415851

These ratings are obviously a little rough, for various reasons – Buffalo has played just 4 games and thoroughly dominated their MAC-only opponents in three of them, and since our LRMC model doesn't account for sample size, they appear to be far better than they likely are – but I think it is worth pointing out that the rest of the rankings are more than plausible. What's more, this rating system actually resolves, unlike base SRS.

Further, this method is largely empirical – the only manual adjustment we've made is to our estimate of home field advantage; the rest has required no manual tuning. Bill's conference-adjusted SRS values are likely closer to what the final top 25 will look like, but given the oddity of the 2020 season, this method is more robust in that it works with relatively few manual adjustments from year to year. There are advantages and disadvantages to either approach, and you should weigh them before applying the rating technique of your choosing. Still, LRMC, in both normal times and odd, is a powerful, useful tool for estimating team strength, and worthy of consideration for use in rankings.

To view the notebook for this code, click here.


[1] Kolbush, J. and J. Sokol, "A logistic regression/Markov chain model for American college football", International Journal of Computer Science in Sport (2017), Vol. 16, Iss. 3, pp. 185-196

[2] If j has a common-opponent winning percentage of 100% and i has a common-opponent winning percentage of 0%, that difference is far less meaningful if j and i have just one game against a common opponent, and far more meaningful if they have 10+ combined games against common opponents. Weighting our regression reflects the amount of confidence we have in our estimate of r_x relative to other samples.

[3] You might have noticed that we make no adjustments here for running up the score, as most SRS implementations do. Why is that? Due to the sigmoid curve of a logistic regression function, running up the score yields diminishing returns as far as LRMC's ratings go.
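
To see this concretely (an illustrative sketch using the coefficients fit above, not from the original post), watch how much less each additional two touchdowns of margin moves r_x:

for margin in [7, 21, 35, 49]:
    r = math.exp(a * margin + b) / (1 + math.exp(a * margin + b))
    print(margin, round(r, 3))  # each successive 14-point jump buys a smaller gain in r_x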

[4] This is a little different from our calculation of home field advantage in part one – in part one, we used home point differential to estimate road winning percentage for a given matchup (s_x), then adjusted for home field advantage to find r_x. But here, we fit home point differential on a location-neutral estimate of team strength, so there isn't an intermediate s_x to deal with. In math terms: when s_x = 50%, x / 2 = h; when r_x = 50%, x = h. We needed both of these steps in part one, but here we only need the second.