Baseball is back! Spring training for the 2025 season started just a few days ago. Over the course of the off-season, here and there, I have been pulling this project together. THe goal: power-rank each team.
Standings allow us to rank teams based on who has more wins and fewer losses. Not all teams play one another over the course of a season. For followers of NCAA fottball, this is often a source of consternation in the build-up to the playoff selection: do teams that play and win a lot against weaker opponents deserve to have as high of a rank as a team that may have relatively more losses but has played against much tougher opponents?
The task of sorting this out seems relatively complicated at first blush. And it kind of is. Thankfully we have a class of statistical models that do not force us to deliberate in a committee. Instead, we can estimate each team’s latent (latent meaning that we cannot directly measure it easily since not every team plays against one another) ability based on who they win against and who they lose against.
A class of statistical models that allow us to do just that: Paired comparisons. One type of these models is called the Bradley-Terry model.
\[ \begin{align} y_i \sim Bernoulli(logit^{-1}(\alpha_a - \alpha_h))\\ \alpha \sim \mathcal{N}(0,1) \end{align} \]
Where \(y_i\) is equal to 0 if the home team won and 1 if the away team won. \(\alpha_a\) is the parameter for the estimated latent ability of the away team, while \(\alpha_h\) is a parameter for the estimated latent ability of the home team. I then sort the estimated \(\alpha\) parameter in descending order to get an estimated rank for each team.
Show Stan code
// Simple Bradley-Terry model.
data {
int<lower=1> N; // Number of games.
int<lower=1> J; // Number of teams.
array[N, 2] int T; // Matrix of team ids. T[N, 1] = Away. T[N, 2] = Home.
matrix[N, 2] S; // Matrix of scores. S[N, 1] = Away score; S[N, 2] = Home.
}
transformed data {
// Compute who won.
// Place in a vector where:
// - 1 indicates the away team won
// - 0 indicates the home team won
array[N] int<lower=0, upper=1> y;
for (n in 1:N) {
real diff = S[n, 1] - S[n, 2];
if (diff < 0)
0;
y[n] = else
1;
y[n] =
}// Create a vector indicating each team.
array[N] int away = to_array_1d(T[,1]);
array[N] int home = to_array_1d(T[,2]);
}
parameters {
vector[J] alpha; // The ability for each team.
}
model {
// Prior on a logged-odds scale of the ability for each team.
0, 1);
alpha ~ normal(// Compute the ability for each team given who won.
// The ability for the away team is dependent on whether they
// beat the home team.
// If the away team won, then the logged odd would be
// alpha_away * 1 - alpha_home * 0
y ~ bernoulli_logit(alpha[away] - alpha[home]);
}
generated quantities {
// Now compute the ranking of each team based on who won.
array[J] int rank; // Ranking of the teams.
{// Get the ranking of each team in descending order of the alpha.
array[J] int rank_index = sort_indices_desc(alpha);
// For each team, apply the rank so that the median value for rank[i]
// is the rank for team i.
for (i in 1:J) {
rank[rank_index[i]] = i;
}
} }
I also run a version of the model that accounts for home-field advantage. To do this, I simply add an intercept term to the model above. Doing so, adjusts the logged-odds for the home team winning which can act as a parameter to estimate the home-field advantage.
\[ \begin{align} y_i \sim Bernoulli(logit^{-1}(\alpha_a - \alpha_h + \gamma))\\ \alpha \sim \mathcal{N}(0,1)\\ \gamma \sim \mathcal{N}(0,1) \end{align} \]
Show Stan code
// Simple Bradley-Terry model with Home-field advantage.
data {
int<lower=1> N; // Number of games.
int<lower=1> J; // Number of teams.
array[N, 2] int T; // Matrix of team ids. T[N, 1] = Away. T[N, 2] = Home.
matrix[N, 2] S; // Matrix of scores. S[N, 1] = Away score; S[N, 2] = Home.
}
transformed data {
// Compute who won.
// Place in a vector where:
// - 1 indicates the away team won
// - 0 indicates the home team won
array[N] int<lower=0, upper=1> y;
for (n in 1:N) {
real diff = S[n, 1] - S[n, 2];
if (diff < 0)
0;
y[n] = else
1;
y[n] =
}// Create a vector indicating each team.
array[N] int away = to_array_1d(T[,1]);
array[N] int home = to_array_1d(T[,2]);
}
parameters {
vector[J] alpha; // The ability for each team.
real gamma; // Intercept term to provide a home-field advantage.
}
model {
// Prior on a logged-odds scale of the ability for each team.
0, 1);
alpha ~ normal(// Prior for home-field advantage.
0, 1);
gamma ~ normal(// Compute the ability for each team given who won.
// The ability for the away team is dependent on whether they
// beat the home team.
// If the away team won, then the logged odd would be
// alpha_away * 1 - alpha_home * 0 + gamma
y ~ bernoulli_logit(alpha[away] - alpha[home] + gamma);
}
generated quantities {
// Now compute the ranking of each team based on who won.
array[J] int rank; // Ranking of the teams.
{// Get the ranking of each team in descending order of the alpha.
array[J] int rank_index = sort_indices_desc(alpha);
// For each team, apply the rank so that the median value for rank[i]
// is the rank for team i.
for (i in 1:J) {
rank[rank_index[i]] = i;
}
} }
Finally, I run an extension of the Bradley-Terry model which is sometimes referred to as the Davidson model which allows for ties and an ordered set of outcomes. That is, rather than simply predicting whether or not a team won or not, now my model is supposed to predict the magnitude of the win. That is, in this final model, \(y_i\) is equal to 1 if the home team won by 5 or more runs, 2 if the home team won by between 2 and 4 runs, 3 if the home team won by 1 run, 4 if the home and away team tied, 5 if the away team won by 1 run, 6 if the away team won by between 2 and 4 runs, and 7 if the away team won by 5 or more runs. One issue with this model, is that ties are not common at all due to the allowance for extra innings. This gives my ordered logistic regression some problems due to how rare those events are relative to the other possible outcomes.
\[ \begin{align} y_i \sim Categorical(\alpha_a - \alpha_h + \gamma)\\ \alpha \sim \mathcal{N}(0,1)\\ \gamma \sim \mathcal{N}(0,1) \end{align} \]
Show Stan code
// Ordered Bradley-Terry model with Home-field advantage.
data {
int<lower=1> N; // Number of games.
int<lower=1> J; // Number of teams.
array[N, 2] int T; // Matrix of team ids. T[N, 1] = Away. T[N, 2] = Home.
matrix[N, 2] S; // Matrix of scores. S[N, 1] = Away score; S[N, 2] = Home.
}
transformed data {
// Compute who won.
// Place in a vector where:
// - Pos. diff indicates the away team won.
// - Neg. diff indicates the home team won.
array[N] int<lower=1, upper=7> y;
for (n in 1:N) {
real diff = S[n, 1] - S[n, 2];
if (diff<=-5)
1;
y[n] = else if (diff<=-2 && diff>=-4)
2;
y[n] = else if (diff==-1)
3;
y[n] = else if (diff==0)
4;
y[n] = else if (diff==1)
5;
y[n] = else if (diff>=2 && diff<=4)
6;
y[n] = else if (diff>=5)
7;
y[n] =
}// Create a vector indicating each team.
array[N] int away = to_array_1d(T[,1]);
array[N] int home = to_array_1d(T[,2]);
}
parameters {
vector[J] alpha; // The ability for each team.
real gamma; // Intercept term to provide a home-field advantage.
ordered[6] c; // Number of cutpoints.
}
model {
// Prior on a logged-odds scale of the ability for each team.
0, 1);
alpha ~ normal(// Prior for home-field advantage.
0, 1);
gamma ~ normal(// Compute the ability for each team given who won.
// The ability for the away team is dependent on whether they
// beat the home team.
// If the away team won, then the logged odd would be
// alpha_away * 1 - alpha_home * 0 + gamma
for (n in 1:N){
y[n] ~ ordered_logistic(alpha[away[n]] - alpha[home[n]] + gamma, c);
}
}
generated quantities {
// Now compute the ranking of each team based on who won.
array[J] int rank; // Ranking of the teams.
{// Get the ranking of each team in descending order of the alpha.
array[J] int rank_index = sort_indices_desc(alpha);
// For each team, apply the rank so that the median value for rank[i]
// is the rank for team i.
for (i in 1:J) {
rank[rank_index[i]] = i;
}
} }
I pulled the boxscores for the 2019, 2020, 2021, 2022, 2023, and 2024 regular seasons from the MLB API. With these data, I fit these three models using cmdstanpy. I fit these models and retain the last 2000 simulations (or draws) of the estimated rankings. The full repository containing all of this can be found on my GitHub.
There are a few tweaks I could make to these models:
Play around with the priors for the parameters a bit more. Though from testing, I don’t think they will make too much of a difference here.
I could do things like a hierarchical model where I model player ability and sum the ability of the players for each team. One issue with this approach is that the players for each team tends to remain relatively stable over time. While pitching may change, the rotation of pitchers tends to remain somewhat stable. So, it may be something to try, but my a priori expectation is that it won’t help the model much.
There may be more that I am not thinking of, but Spring Training is underway and I am just too excited to sit on this project too much longer!