Chatbot Arena and the Elo rating system - Part 2

In our previous blog post, we introduced the basics of the Elo rating system and its online linear update algorithm, hereafter referred to as “online Elo”. However, we identified a significant concern with online Elo: it is unstable and biased toward recent results. For example, a demonstration from Chatbot Arena showed substantial shifts in model ratings when Elo was recalculated with the matches in reverse order.

Figure: Online Elo ranking is not stable. Image source: notebook from Chatbot Arena

This instability is, of course, undesirable. While it’s true that players can learn from feedback and improve their skills (which means their ratings change), the assumption is that these improvements occur slowly. Therefore, in the short term, their ranks should stay relatively stable. In other words, no matter how we compute the score, we hope to get the same ranking regardless of the order of matches. So, how can we generate a stable and unique Elo ranking?

Why does this happen?

We touched on this issue briefly in our previous post. The online Elo algorithm adjusts a player’s rating incrementally after each match, based on the match outcome and the current ratings of both opponents. Each update then influences all later updates, so the sequence in which matches are processed significantly affects the final ratings. When we reverse the order of matches, the matches that were originally processed first are now processed last, starting from different ratings, which can lead to significantly different final ratings.
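To make the order dependence concrete, here is a minimal sketch of the sequential update, assuming the standard Elo formula. The function name `online_elo`, the choice of K=32, and the toy battle list are illustrative assumptions, not taken from the Chatbot Arena notebook.

```python
def online_elo(battles, k=32, scale=400, base=10, init_rating=1000):
    """Sequential Elo update. battles is a list of (model_a, model_b, winner)."""
    ratings = {}
    for model_a, model_b, winner in battles:
        ra = ratings.setdefault(model_a, init_rating)
        rb = ratings.setdefault(model_b, init_rating)
        # Expected score of model_a under the *current* ratings.
        ea = 1 / (1 + base ** ((rb - ra) / scale))
        # Actual score of model_a: 1 for a win, 0 for a loss, 0.5 otherwise (tie).
        sa = {"model_a": 1.0, "model_b": 0.0}.get(winner, 0.5)
        ratings[model_a] = ra + k * (sa - ea)
        ratings[model_b] = rb - k * (sa - ea)
    return ratings


# Hypothetical toy data, made up for illustration.
battles = [
    ("A", "B", "model_a"),
    ("B", "C", "model_a"),
    ("C", "A", "model_a"),
    ("A", "B", "model_b"),
    ("A", "C", "model_a"),
]

forward = online_elo(battles)
backward = online_elo(battles[::-1])
print(forward)
print(backward)
```

The two printed dictionaries contain the same matches but different ratings, because each update starts from whatever ratings the earlier updates produced.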

So if the problem lies in the sequential update, why not update the ratings all at once?

Maximum Likelihood Estimation with Bradley-Terry model

Basics

Recall from SGD that a good practice for stable optimization is to update on mini-batches rather than on a single data sample at a time. Taking this idea to its limit, we can fit the ratings on all matches at once. In the context of LLM evaluation, we often assume that a model’s capability remains static over the period being analyzed. This assumption allows us to use Maximum Likelihood Estimation (MLE) to fit the ratings globally.

But before we dive deeper, let’s clarify some fundamental concepts.

  • MLE might be familiar: it is a statistical method for estimating the parameters of a model. In simpler terms, it identifies the set of parameters under which the observed data is most probable.
  • The Bradley-Terry model is a probability model for predicting the outcomes of pairwise comparisons. In rating systems, it is used to estimate the relative strengths of players from the outcomes of their matches against each other. Its formulas are given in this wiki and are not hard to understand (they are very similar to the online Elo formula introduced in the previous post).

The connection between MLE and the Bradley-Terry model in the context of rating systems is quite direct. MLE is used to estimate the strength parameters of the Bradley-Terry model from the match results: we choose the ratings that maximize the likelihood of the observed outcomes under the model. But how exactly is this done? And how is it different from online Elo?
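For reference, the model and its likelihood can be written down compactly. The notation here is chosen for illustration: $R_i$ is the rating of player $i$, $w_{ij}$ is the number of times $i$ beat $j$, and the constants 400 and 10 are the usual Elo scale and base.

```latex
P(i \text{ beats } j) = \frac{1}{1 + 10^{(R_j - R_i)/400}}

\ell(R) = \sum_{i \neq j} w_{ij} \, \log P(i \text{ beats } j),
\qquad
\hat{R} = \arg\max_{R} \; \ell(R)
```

The log-likelihood depends on the data only through the counts $w_{ij}$, so the order in which the matches were played does not enter the estimate.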

Dive deep into MLE Elo

Let’s look at a Python implementation below, which is borrowed from this notebook from Chatbot Arena.

import math

import numpy as np
import pandas as pd


def compute_mle_elo(
    df, SCALE=400, BASE=10, INIT_RATING=1000, sample_weight=None
):
    from sklearn.linear_model import LogisticRegression
    ptbl_a_win = pd.pivot_table(
        df[df["winner"] == "model_a"],
        index="model_a",
        columns="model_b",
        aggfunc="size",
        fill_value=0,
    )
    # if no tie, create a zero matrix
    if sum(df["winner"].isin(["tie", "tie (bothbad)"])) == 0:
        ptbl_tie = pd.DataFrame(0, index=ptbl_a_win.index, columns=ptbl_a_win.columns)
    else:
        ptbl_tie = pd.pivot_table(
            df[df["winner"].isin(["tie", "tie (bothbad)"])],
            index="model_a",
            columns="model_b",
            aggfunc="size",
            fill_value=0,
        )
        ptbl_tie = ptbl_tie + ptbl_tie.T

    ptbl_b_win = pd.pivot_table(
        df[df["winner"] == "model_b"],
        index="model_a",
        columns="model_b",
        aggfunc="size",
        fill_value=0,
    )
    ptbl_win = ptbl_a_win * 2 + ptbl_b_win.T * 2 + ptbl_tie

    models = pd.Series(np.arange(len(ptbl_win.index)), index=ptbl_win.index)

    p = len(models)
    X = np.zeros([p * (p - 1) * 2, p])
    Y = np.zeros(p * (p - 1) * 2)

    cur_row = 0
    sample_weights = []
    for m_a in ptbl_win.index:
        for m_b in ptbl_win.columns:
            if m_a == m_b:
                continue
            # if nan skip
            if math.isnan(ptbl_win.loc[m_a, m_b]) or math.isnan(ptbl_win.loc[m_b, m_a]):
                continue
            X[cur_row, models[m_a]] = +math.log(BASE)
            X[cur_row, models[m_b]] = -math.log(BASE)
            Y[cur_row] = 1.0
            sample_weights.append(ptbl_win.loc[m_a, m_b])

            X[cur_row + 1, models[m_a]] = math.log(BASE)
            X[cur_row + 1, models[m_b]] = -math.log(BASE)
            Y[cur_row + 1] = 0.0
            sample_weights.append(ptbl_win.loc[m_b, m_a])
            cur_row += 2
    X = X[:cur_row]
    Y = Y[:cur_row]

    lr = LogisticRegression(fit_intercept=False, penalty=None, tol=1e-6)
    lr.fit(X, Y, sample_weight=sample_weights)
    elo_scores = SCALE * lr.coef_[0] + INIT_RATING
    return pd.Series(elo_scores, index=models.index).sort_values(ascending=False)

This function includes three stages. In the first stage, up to ptbl_win = ptbl_a_win * 2 + ptbl_b_win.T * 2 + ptbl_tie, we prepare count tables from the match outcomes. ptbl_win combines wins and ties into a single table covering all pairs of models: entry (a, b) is the weight of the outcome “a beats b”. Win counts are doubled because a tie is added once in both directions (ptbl_tie + ptbl_tie.T), so a tie counts as half a win for each model while all weights stay integers.
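A small worked example may help with the weighting. The head-to-head record below is hypothetical, and the variable names are made up for illustration. It only mirrors the arithmetic of the ptbl_win line for a single pair of models.

```python
# Hypothetical head-to-head record between two models, A and B.
a_wins, b_wins, ties = 3, 1, 2

# Same arithmetic as ptbl_win = ptbl_a_win * 2 + ptbl_b_win.T * 2 + ptbl_tie,
# restricted to the single pair (A, B).
weight_a_over_b = 2 * a_wins + ties  # weight of the outcome "A beats B"
weight_b_over_a = 2 * b_wins + ties  # weight of the outcome "B beats A"

# Share of the total weight that goes to A ...
effective_win_rate = weight_a_over_b / (weight_a_over_b + weight_b_over_a)
# ... equals A's score if a tie is worth half a win.
half_point_win_rate = (a_wins + 0.5 * ties) / (a_wins + b_wins + ties)

print(weight_a_over_b, weight_b_over_a)
print(effective_win_rate, half_point_win_rate)
```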

In the second stage, up to Y = Y[:cur_row], we set up the logistic regression inputs. Each ordered pair of models contributes two rows with the same features: X holds +log(BASE) in the column of m_a and -log(BASE) in the column of m_b, which encodes who is competing against whom. Y is 1 for the row that represents a win by m_a and 0 for the row that represents a win by m_b. sample_weights collects the weight of each row, namely the corresponding entry of ptbl_win, so frequent outcomes count more.

In the last stage, we simply fit the data with logistic regression. The final lr.coef_ contains the estimated skill levels of the players on a log-odds scale (in units of log base BASE, because the features are scaled by log(BASE)). These coefficients maximize the likelihood of the observed match outcomes. They are then multiplied by SCALE and shifted by INIT_RATING to obtain Elo scores. The output is a series of Elo scores indexed by model name, sorted in descending order so that the strongest model is at the top.
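To see why a logistic regression on these features is a Bradley-Terry fit, the prediction for a row can be expanded as follows. The symbols are notation chosen here: $\beta$ stands for lr.coef_, $B$ for BASE, $S$ for SCALE, $R_0$ for INIT_RATING, and $R_i = S\beta_i + R_0$ for the resulting Elo score.

```latex
P(Y = 1 \mid a, b)
  = \frac{1}{1 + e^{-\ln(B)\,(\beta_a - \beta_b)}}
  = \frac{1}{1 + B^{-(\beta_a - \beta_b)}}
  = \frac{1}{1 + B^{(R_b - R_a)/S}}
```

The last step uses $\beta_a - \beta_b = (R_a - R_b)/S$, so the fitted probabilities have exactly the form of the Elo expected score.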

Unlike the sequential update in the online Elo system, MLE with the Bradley-Terry model considers all match outcomes simultaneously. This global optimization removes the dependency on the order of matches, leading to more stable and consistent estimates of player strengths. In fact, if we reverse or shuffle the order of matches, the count tables are unchanged, so the MLE Elo ratings stay essentially the same.
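The order invariance can be checked with a small self-contained sketch. Note that this is not the notebook's implementation: instead of scikit-learn's LogisticRegression it uses the classic minorization-maximization iteration for Bradley-Terry, it ignores ties, and it assumes every model has at least one win. The function name `bradley_terry_elo` and the toy data are hypothetical.

```python
import math
import random
from collections import defaultdict


def bradley_terry_elo(battles, scale=400, base=10, init_rating=1000, iters=500):
    """Fit Bradley-Terry strengths by MM iteration. battles are (winner, loser) pairs."""
    wins = defaultdict(int)  # wins[(i, j)] = number of times i beat j
    for winner, loser in battles:
        wins[(winner, loser)] += 1
    models = sorted({m for pair in wins for m in pair})

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            total_wins = sum(wins[(i, j)] for j in models if j != i)
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            new[i] = total_wins / denom
        # Strengths are only identified up to a common factor, so fix the
        # geometric mean at 1 (the mean Elo then equals init_rating).
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        strength = {m: v / math.exp(log_mean) for m, v in new.items()}

    return {m: init_rating + scale * math.log(s, base) for m, s in strength.items()}


# Hypothetical toy data: each pair of models plays 10 matches.
battles = (
    [("A", "B")] * 6 + [("B", "A")] * 4
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
shuffled = battles[:]
random.Random(0).shuffle(shuffled)

elo_original = bradley_terry_elo(battles)
elo_shuffled = bradley_terry_elo(shuffled)
print(elo_original)
print(elo_original == elo_shuffled)
```

Because the fit only sees the win counts, shuffling the battles leaves the result unchanged.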

Compute bootstrap confidence intervals

Although MLE Elo is simple and stable, people often make the analysis more robust by using bootstrapping to compute confidence intervals, which helps to judge statistical significance. But hold on, what is a confidence interval, and what does bootstrap mean?

A confidence interval (CI) is a range of values, derived from the data, that is likely to contain the value of an unknown population parameter. For instance, in the context of Elo scores, a 95% confidence interval around a score indicates that, if the procedure were repeated many times, 95% of the confidence intervals calculated would contain the true Elo score. It gives an idea of the uncertainty around the estimated value.

Bootstrap is a powerful statistical technique for estimating the uncertainty of a statistic (like a mean, a median, or, in this case, Elo scores) by resampling the data. In essence, it involves repeatedly drawing samples, with replacement, from the observed data set and recalculating the statistic for each sample. This generates an empirical distribution of the statistic, which can then be used to compute confidence intervals or test hypotheses. Bootstrapping lets us assess how stable the Elo scores are: if the matches (data points) had come out differently, would we get a similar ranking and similar scores?

Let’s look at a Python implementation below, which is borrowed from this notebook from Chatbot Arena.

import numpy as np
import pandas as pd
from tqdm import tqdm


def get_bootstrap_result(battles, func_compute_elo, num_round):
    rows = []
    for i in tqdm(range(num_round), desc="bootstrap"):
        rows.append(func_compute_elo(battles.sample(frac=1.0, replace=True)))
    df = pd.DataFrame(rows)
    return df[df.median().sort_values(ascending=False).index]
    
BOOTSTRAP_ROUNDS = 100
np.random.seed(42)
bootstrap_elo_lu = get_bootstrap_result(battles, compute_mle_elo, BOOTSTRAP_ROUNDS)

The implementation above is straightforward. It loops num_round times, as set by BOOTSTRAP_ROUNDS. In each iteration, it resamples the dataset with replacement via battles.sample(frac=1.0, replace=True), which creates a new dataset of the same size as the original in which some battles are repeated and others are omitted. It then computes Elo scores for the resampled dataset with the provided function func_compute_elo and stores the result. Finally, the columns are sorted by median score.

It’s true that some battles may be duplicated in a bootstrap sample, which could introduce bias if certain players are overrepresented in these duplicates. But the purpose of bootstrap resampling is to create many such samples and then analyze the distribution of the computed Elo ratings. The bias introduced by any single bootstrap sample is mitigated by averaging over many samples.
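To show how a confidence interval comes out of the resampled results, here is a minimal percentile-bootstrap sketch using only the standard library. The helper name `bootstrap_ci`, the win/loss data, and the choice of the mean as the statistic are assumptions made for illustration. The same idea applies when the statistic is an Elo score.

```python
import random
import statistics


def bootstrap_ci(samples, statistic, num_rounds=1000, alpha=0.05, seed=42):
    """Percentile bootstrap interval: returns (lower, median, upper)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_rounds):
        # Draw a resample of the same size, with replacement.
        resample = rng.choices(samples, k=len(samples))
        estimates.append(statistic(resample))
    estimates.sort()
    lower = estimates[round(alpha / 2 * num_rounds)]
    upper = estimates[round((1 - alpha / 2) * num_rounds) - 1]
    return lower, statistics.median(estimates), upper


# Hypothetical outcomes for one model: 1 = win, 0 = loss.
outcomes = [1] * 60 + [0] * 40
low, median, high = bootstrap_ci(outcomes, statistics.mean)
print(low, median, high)
```

With the DataFrame returned by get_bootstrap_result, the analogous step would presumably be to take per-column quantiles, for example with the pandas quantile method at 0.025 and 0.975.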

If we visualize the confidence intervals of all models, we can see that the intervals under MLE Elo are narrow: each model’s bootstrap scores stay close to its median.

Figure: MLE Elo ranking is more stable than online Elo. Image source: notebook from Chatbot Arena

However, if we switch back to online Elo, the statistical range increases considerably. In particular, the models’ confidence intervals overlap significantly, which suggests that the differences in their Elo scores might not be statistically significant. In other words, the observed differences could be due to random variation in the data (such as specific matchups, outliers, or other noise) rather than a true difference in the models’ performance.

Figure: Online Elo ranking is not stable, the difference in models' Elo scores might not be statistically significant. Image source: notebook from Chatbot Arena

Some notes

Can I use MLE with the Bradley-Terry model for data beyond pairwise comparisons?

Not really. If the comparisons are not pairwise, the Bradley-Terry model may not be applicable as it stands because it cannot process multiple comparisons simultaneously or scenarios where the structure of competition is not strictly one-on-one. For instance, in events or competitions where outcomes involve more than two participants at a time (like races or multi-player games), alternative models are needed.

Are there other methods to compute Elo?

There are many algorithms, or at least extensions, for computing Elo scores. One popular method, whole history rating (WHR), is specifically designed to reflect changes in a player’s performance over time more accurately.

As mentioned above, MLE Elo computes ratings under the assumption that a player’s skill level is static or changes very slowly. It gives all matches equal weight regardless of when they occurred. However, in situations where performance may change significantly over time, such as long sports or gaming careers or the tracking of academic or professional development, it is better to model changes in player strength explicitly. This is where whole history rating comes in.

For a detailed implementation, we refer you to this great GitHub repo, which you can pip install and then import whole_history_rating to compute ratings. Simply put, WHR calculates ratings based on all past results, taking into account the time when each match was played, which provides a more nuanced and responsive rating system that adjusts as more data becomes available.

from whr import whole_history_rating

whr = whole_history_rating.Base()

whr.create_game("shusaku", "shusai", "B", 1, 0)
whr.create_game("shusaku", "shusai", "W", 2, 0)
whr.create_game("shusaku", "shusai", "W", 3, 0)

whr.auto_iterate(time_limit=10, precision=1e-3, batch_size=10)

ratings = whr.get_ordered_ratings(current=True, compact=False)

This code snippet starts by importing the library and initializing the base WHR object. Games are then added with the create_game() method, which takes the names of the black and white players, the winner (‘B’ for black, ‘W’ for white), the day number, and an optional handicap (generally less than 500 Elo). The day number is what gives WHR its temporal awareness. Finally, auto_iterate() refines the ratings iteratively until they are accurate and stable, and get_ordered_ratings() retrieves all ratings in order.

For LLMs, however, capabilities are relatively stable once training is done. This fits the MLE Elo assumption that a player’s skill level is static or changes very slowly. That is why, in most LLM evaluation settings, people simply use MLE Elo for its simplicity and stability.

Why do people compute win rate?

Win rate is a statistical measure that represents the percentage of games, matches, or competitions that a player wins over a certain period or across a series of events. It is calculated as the number of wins divided by the total number of matches played, often expressed as a percentage. Thus, it offers a straightforward, easily understood metric of success, which complements Elo rating to make the evaluation more comprehensive.
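The calculation can be sketched in a few lines. The helper name `win_rates` and the toy battles are hypothetical, and counting a tie as a match played but not won is an assumption of this sketch.

```python
from collections import Counter


def win_rates(battles):
    """battles: list of (model_a, model_b, winner), winner in {'model_a', 'model_b', 'tie'}."""
    wins, played = Counter(), Counter()
    for model_a, model_b, winner in battles:
        played[model_a] += 1
        played[model_b] += 1
        if winner == "model_a":
            wins[model_a] += 1
        elif winner == "model_b":
            wins[model_b] += 1
        # Ties add to the matches played but not to the wins (an assumption).
    return {model: wins[model] / played[model] for model in played}


# Hypothetical toy data.
battles = [
    ("A", "B", "model_a"),
    ("A", "B", "model_a"),
    ("A", "C", "model_b"),
    ("B", "C", "tie"),
]
rates = win_rates(battles)
print(rates)
```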

In addition, win rates can highlight anomalies or interesting trends that Elo might not capture. For instance, a player might have a high Elo rating but a surprisingly low win rate. Consider a scenario in which the player has played 50 matches:

  • Wins: 15 (Against top 20 players mostly)
  • Losses: 35 (Loses more frequently to lower-ranked players)
  • Win Rate: 30%
  • Elo Rating: High (Due to wins against top players)

This could indicate issues such as inconsistent performance, or psychological factors such as greater motivation or focus against better-known players and a lack of the same drive against lower-ranked players, who may be underestimated.

On the other hand, we can use win rate to check the accuracy and quality of the Elo rating system, because Elo ratings let us predict win probabilities. If the predicted win rates closely match the actual win rates, the Elo ratings are of high quality.
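As a rough sketch of this check, the snippet below compares the win probability implied by two Elo ratings with an observed head-to-head win rate. The ratings, the match counts, and the helper name `predicted_win_prob` are hypothetical. The formula is the standard Elo expected score with scale 400 and base 10.

```python
def predicted_win_prob(rating_a, rating_b, scale=400, base=10):
    """Probability that A beats B implied by their Elo ratings."""
    return 1 / (1 + base ** ((rating_b - rating_a) / scale))


# Hypothetical ratings and head-to-head record (ties left out for simplicity).
ratings = {"A": 1100, "B": 1000}
a_wins, b_wins = 62, 38

predicted = predicted_win_prob(ratings["A"], ratings["B"])
actual = a_wins / (a_wins + b_wins)
gap = abs(predicted - actual)
print(round(predicted, 3), actual, round(gap, 3))
```

A small gap across many model pairs suggests that the ratings describe the data well. A large gap for a particular pair is a hint to look more closely at those matchups.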

Summary

At this point, we have discussed MLE Elo, win rate, and bootstrap confidence intervals. Together, these should cover most evaluation scenarios, not just those for LLMs.