Pairwise Comparison Modeling

Author

Michael Betancourt

Published

November 2024

A diversity of applications consider data arising from the comparison of two objects to each other. For example we might be interested in the outcome of a sporting event between two teams, how well a student can answer a battery of test questions, or even consumer preferences between two products.

In this chapter we’ll discuss general strategies for modeling pairwise comparisons, the inferential degeneracies that are inherent to these models, and productive strategies for managing those degeneracies. Per tradition these concepts will all be demonstrated in a series of instructional analyses.

1 Modeling Pairwise Comparisons

Given the rich variety of pairwise comparisons that arise in practical applications we will need a reasonably generic vocabulary to describe the common modeling techniques.

To that end consider a collection of I items, any pair of which can be compared to each other according to some criterion. Items can refer to anything from individual living creatures to individual inanimate objects, collections of those individuals, and more. Comparative criteria include the personal judgement of an individual and the scoring of a direct competition.

In many applications every pair of items defines a valid comparison. Some applications, however, partition the items into two groups and allow comparisons only between an item from one group and an item from the other group; specifically we cannot compare items within each group to each other. For example in a sporting event we might model the points scored by one team as a comparison between that team’s offense and their opponent’s defense, but no other outcomes directly compare the opposing teams’ offenses or defenses to each other. I will refer to these restricted pairings as bipartite comparisons.

A comparison between two abstract items will not always result in a quantitative outcome. Just think of all of the poets and songwriters who have struggled to express their feelings about contrasting concepts! To facilitate mathematical modeling we have to restrict consideration to comparative criteria that are quantifiable. In this chapter I will assume a convention where quantitative comparisons yield larger numerical values if the first item in the pair i_{1} is superior to the second item i_{2} according to the stated criterion.

More formally we will consider only pairwise comparisons that yield a point in some ordered space, with the superiority of item i_{1} over item i_{2} manifesting as larger values in that space. For example comparisons might be quantified by binary values, integers, or even real numbers. Critically these outputs depend on the arrangement of the items; swapping the roles of item i_{1} and item i_{2} will in general result in outcomes on the opposite side of the space. For example if i_{1} is superior to i_{2}, and comparisons return larger values, then i_{2} will be inferior to i_{1} and comparisons will return smaller values.

Our general strategy for consistently modeling pairwise comparisons with ordered outputs will proceed in two stages.

First we equip each item with a real-valued quality parameter, \alpha_{i} \in \mathbb{R}, that quantifies how much each item contributes to a particular comparison. A useful convention when working with bipartite comparisons is to denote the quality parameters of each group with a separate variable name, for example \alpha_{i_{1}} and \beta_{i_{2}}. To even further distinguish the two types of items we can also use separate indices, for example \alpha_{i} and \beta_{j} or \alpha_{j} and \beta_{i}.

Next we need to ensure that when comparing item i_{1} to item i_{2} the larger \alpha_{i_{1}} is relative to \alpha_{i_{2}} the more the comparison outcome y_{i_{1} i_{2}} will tend towards larger values. Equivalently the larger \alpha_{i_{2}} is relative to \alpha_{i_{1}} the more y_{i_{1} i_{2}} should tend towards smaller values.

One way to guarantee this behavior is to model the comparison outcome y_{i_{1} i_{2}} with an appropriate location family of probability density functions, p(y_{i_{1} i_{2}} \mid \mu_{i_{1} i_{2}}, \phi), and then replace the location parameter \mu_{i_{1} i_{2}} with the output of a monotonically-increasing function of the difference in quality parameters, \mu_{i_{1} i_{2}} = f(\alpha_{i_{1}} - \alpha_{i_{2}}). The larger \alpha_{i_{1}} is relative to \alpha_{i_{2}} the larger the difference between them will be, the larger the output of the function f will be, and the more the comparison outcomes will concentrate on larger values as desired.

2 Coupling Outcome Location and Item Qualities

The general construction of pairwise comparison models presented in Section 1 allows for a variety of sophisticated behaviors. Perhaps the most important of these is the functional relationship that couples the location of the outcome model to the difference of item qualities, \mu_{i_{1} i_{2}} = f(\alpha_{i_{1}} - \alpha_{i_{2}}). Often it is more productive to build up the function f as a composition of multiple component functions, \mu_{i_{1} i_{2}} = f_{J} \circ \cdots \circ f_{1}(\alpha_{i_{1}} - \alpha_{i_{2}}), with each moderating the coupling in different ways.

In this section we’ll discuss a few functional behaviors that are particularly useful for many pairwise comparison applications.

2.1 Constraining Functions

By construction item qualities are real-valued, and so too are their differences. If the location configuration of the outcome model is not real-valued then we cannot directly relate it to those real-valued differences. In particular if the location configuration is constrained then it will not be immediately compatible with the item quality differences. Instead we need to introduce an intermediate function to corral any inconsistent values.

First, however, consider a pairwise comparison that results in a real-valued outcome. In this case we might consider a normal outcome model \text{normal}(y_{i_{1} i_{2}} \mid \mu_{i_{1} i_{2}}, \sigma). Because the location \mu_{i_{1} i_{2}} is unconstrained we can directly relate it to the item differences with an identity function, \mu_{i_{1} i_{2}} = \mathrm{Id}(\alpha_{i_{1}} - \alpha_{i_{2}}) = \alpha_{i_{1}} - \alpha_{i_{2}}. Here the quality of the first item \alpha_{i_{1}} directly translates the location of the outcome model upwards while the quality of the second item \alpha_{i_{2}} translates the location downwards.

Now consider a pairwise comparison that yields non-negative, integer-valued outcomes y_{i_{1} i_{2}} \in \mathbb{N}. Here we might reach for a Poisson outcome model, \text{Poisson}(y_{i_{1} i_{2}} \mid \lambda_{i_{1} i_{2}} ), with a positive location configuration \lambda_{i_{1} i_{2}}. In order to relate the unconstrained item quality differences to a valid location configuration we need a monotonic function that squeezes the entire real line into the positive real line, f : \mathbb{R} \rightarrow \mathbb{R}^{+}.

One possibility is the exponential function, \begin{align*} \lambda_{i_{1} i_{2}} &= \exp(\alpha_{i_{1}} - \alpha_{i_{2}}) \\ &= \exp(\alpha_{i_{1}}) \, \exp(-\alpha_{i_{2}}) \\ &= \exp(\alpha_{i_{1}}) \frac{1}{\exp(\alpha_{i_{2}})}. \end{align*} In this case the item quality differences don’t translate the location configuration but rather scale it multiplicatively.

Similarly when working with binary outcomes y_{i_{1} i_{2}} \in \{0, 1\} we can always assume a Bernoulli model \text{Bernoulli}(y_{i_{1} i_{2}} \mid p_{i_{1} i_{2}}). We can then use a logistic function \mathrm{logistic} : \mathbb{R} \rightarrow [0, 1] to map item quality differences into unit-interval-valued probability configurations, p_{i_{1} i_{2}} = \mathrm{logistic}(\alpha_{i_{1}} - \alpha_{i_{2}}). That said the logistic function is not the only possibility here. For example we could use the error function or really the cumulative distribution function for any probability distribution defined on a real line.
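As a quick sketch of these constraining functions, with purely illustrative quality values:

```python
import math

def logistic(x):
    """Map the real line into the unit interval."""
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative quality parameters; these values are assumptions for demonstration.
alpha_1, alpha_2 = 0.8, -0.3
diff = alpha_1 - alpha_2

# Poisson location: exp maps the real line to the positive reals.
lam = math.exp(diff)

# Bernoulli probability: logistic maps the real line to the unit interval.
p = logistic(diff)

assert lam > 0 and 0 < p < 1
# Swapping the roles of the two items reflects the outputs appropriately:
# the Poisson location inverts while the Bernoulli probability complements.
assert math.isclose(math.exp(-diff), 1.0 / lam)
assert math.isclose(logistic(-diff), 1.0 - p)
```

Any other strictly increasing cumulative distribution function on the real line would serve equally well in place of the logistic function here.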

2.2 Baseline-Inducing Functions

In many applications it is useful to have item qualities modify a common baseline behavior. One way to achieve this is with a translation function that centers differences in item qualities around a baseline value, t_{\eta}(\alpha_{i_{1}} - \alpha_{i_{2}}) = \eta + \alpha_{i_{1}} - \alpha_{i_{2}}.

For example we can model real-valued comparison outcomes with the same normal model that we considered above \text{normal}(y_{i_{1} i_{2}} \mid \mu_{i_{1} i_{2}}, \sigma) but with the location configuration \begin{align*} \mu_{i_{1} i_{2}} &= t_{\eta}(\alpha_{i_{1}} - \alpha_{i_{2}}) \\ &= \eta + \alpha_{i_{1}} - \alpha_{i_{2}}. \end{align*} Here \eta models the baseline location which can be translated up or down by the qualities of the items being compared.

2.3 Discrimination-Inducing Functions

Another useful feature of a pairwise comparison model is the ability to moderate exactly how much the location configuration is coupled to the item quality differences. One way to accomplish this is with a scaling function, s_{\gamma}(\alpha_{i_{1}} - \alpha_{i_{2}}) = \gamma \cdot (\alpha_{i_{1}} - \alpha_{i_{2}}).

Large, positive values of \gamma enhance the coupling while smaller values suppress it. As \gamma approaches zero the location configuration becomes completely uncoupled from the item qualities. Because of this behavior \gamma is known as a discrimination variable.

Negative values of \gamma invert the relationship between the item quality differences and the location configuration so that, for example, \alpha_{i_{1}} > \alpha_{i_{2}} results in smaller outcomes that suggest the superiority of item i_{2} over item i_{1}. This behavior can be useful when the nominal comparison criterion might actually be reversed.

That said negative values of \gamma also allow for some redundancy in the resulting pairwise comparison model. Flipping the sign of the discrimination and item qualities at the same time results in the same output, \gamma \cdot (\alpha_{i_{1}} - \alpha_{i_{2}}) = (-\gamma) \cdot ( (-\alpha_{i_{1}}) - (-\alpha_{i_{2}})), and hence the same model for the comparison outcomes. This redundancy then manifests as degenerate likelihood functions.
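We can verify this sign redundancy numerically; the parameter values below are arbitrary illustrations:

```python
import math

# Flipping the sign of the discrimination and both item qualities at the
# same time leaves the scaled difference, and hence the outcome model,
# unchanged; gamma, a1, and a2 are illustrative values.
gamma, a1, a2 = 1.7, 0.9, -0.4
assert math.isclose(gamma * (a1 - a2), (-gamma) * ((-a1) - (-a2)))
```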

Outside of cases where the nominal comparison criterion is suspect, the best practice is to constrain the sign of a discrimination parameter to positive values. This eliminates the sign redundancy and any resulting inferential degeneracies.

All of this said, a homogeneous discrimination model is not all that useful in practice. Scaling the differences in item qualities is most useful when the discrimination varies across different circumstances, with outcomes more strongly informed by the item qualities for some observations but more weakly informed by the item qualities for others. Exactly how \gamma should vary will depend on the details of a particular application; we’ll see some examples in Section 6.4 and Section 6.5.

2.4 Combining Functional Behaviors

As useful as these functional behaviors are individually, they become even more powerful when combined.

To demonstrate let’s reconsider pairwise comparisons that yield non-negative, integer-valued outcomes. Once again we’ll appeal to a Poisson model for the outcomes \text{Poisson}(y_{i_{1} i_{2}} \mid \lambda_{i_{1} i_{2}} ), but now we’ll allow for more sophisticated functional relationships between the location configuration \lambda_{i_{1} i_{2}} and the item qualities.

For example if we compose a translation function with an exponential function then the former will define a baseline while the latter will ensure positive location configurations, \begin{align*} \lambda_{i_{1} i_{2}} &= \exp \circ t_{\eta}(\alpha_{i_{1}} - \alpha_{i_{2}}) \\ &= \exp(\eta + \alpha_{i_{1}} - \alpha_{i_{2}}) \\ &= \exp(\eta) \cdot \exp(\alpha_{i_{1}}) \cdot \exp(-\alpha_{i_{2}}) \\ &= \exp(\eta) \cdot \exp(\alpha_{i_{1}}) \cdot \frac{1}{\exp(\alpha_{i_{2}})}. \end{align*} In this case \exp(\eta) defines a baseline location that is multiplicatively perturbed by the quality parameters of the items being compared.

We might even compose a scaling function, a translation function, and an exponential function together at the same time, \begin{align*} \lambda_{i_{1} i_{2}} &= \exp \circ t_{\eta} \circ s_{\gamma} \big(\alpha_{i_{1}} - \alpha_{i_{2}}\big) \\ &= \exp \circ t_{\eta} \big( \gamma \cdot (\alpha_{i_{1}} - \alpha_{i_{2}}) \big) \\ &= \exp \big( \eta + \gamma \cdot (\alpha_{i_{1}} - \alpha_{i_{2}}) \big) \\ &= \exp \big( \eta \big) \, \exp \big( \gamma \cdot (\alpha_{i_{1}} - \alpha_{i_{2}}) \big). \end{align*} As before \exp(\eta) defines a baseline location, but now the multiplicative perturbations defined by the item quality parameters are moderated by \gamma. In particular as \gamma \rightarrow 0 the output of any pairwise comparison shrinks towards the baseline behavior regardless of the qualities of the items being compared.
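A minimal sketch of this composed coupling, with illustrative values for \eta and \gamma:

```python
import math

# Composition of scaling, translation, and exponential functions; eta and
# gamma values here are illustrative, not estimates from any data.
def lam(alpha_1, alpha_2, eta, gamma):
    """Poisson location configuration exp(eta + gamma * (alpha_1 - alpha_2))."""
    return math.exp(eta + gamma * (alpha_1 - alpha_2))

baseline = math.exp(0.5)

# As gamma -> 0 the location shrinks toward the baseline exp(eta),
# regardless of the item qualities being compared.
assert math.isclose(lam(2.0, -1.0, 0.5, 0.0), baseline)
assert abs(lam(2.0, -1.0, 0.5, 0.1) - baseline) < abs(lam(2.0, -1.0, 0.5, 1.0) - baseline)
```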

3 Notable Pairwise Comparison Models

While pairwise comparison modeling is broadly applicable there are two applications that dominate the statistics literature, both of which consider only the special case of binary outcomes. In this section we’ll review these applications in the context of the more general techniques.

3.1 Bradley-Terry Model

One common pairwise comparison arises in competitions, with a game between two players or teams resulting in an unambiguous winner and loser. We can model these competitions as pairwise comparisons with the binary outcome y_{i_{1} i_{2}} = 1 denoting a win for the i_{1}th competitor and y_{i_{1} i_{2}} = 0 a win for the i_{2}th competitor. The items then model the competitors while the item qualities model their skills.

Wrapping the difference in item qualities in a logistic function gives the pairwise comparison model \text{Bernoulli}(y_{i_{1} i_{2}} \mid p_{i_{1} i_{2}}) with \begin{align*} p_{i_{1} i_{2}} &= \mathrm{logistic}(\alpha_{i_{1}} - \alpha_{i_{2}}) \\ &= \frac{1}{1 + \exp( -( \alpha_{i_{1}} - \alpha_{i_{2}} ) )} \\ &= \frac{1}{1 + \exp(-\alpha_{i_{1}}) \, \exp(\alpha_{i_{2}})} \\ &= \frac{\exp(\alpha_{i_{1}})} {\exp(\alpha_{i_{1}}) + \exp(\alpha_{i_{2}})}. \end{align*} This model is commonly known as the Bradley-Terry model (Bradley and Terry (1952)), although historically it has been independently derived multiple times from multiple different perspectives (Hamilton, Tawn, and Firth (2023)).

One nice feature of this model is the symmetry between the competitors. If we swap the order of the competitors then the probability that the i_{2}th competitor wins is \begin{align*} p_{i_{2} i_{1}} &= \mathrm{logistic}( \alpha_{i_{2}} - \alpha_{i_{1}} ) \\ &= \frac{1}{1 + \exp( -( \alpha_{i_{2}} - \alpha_{i_{1}} ) )} \\ &= \frac { \exp( -( \alpha_{i_{1}} - \alpha_{i_{2}} ) ) } { 1 + \exp( -( \alpha_{i_{1}} - \alpha_{i_{2}} ) ) } \\ &= \frac { 1 + \exp( -( \alpha_{i_{1}} - \alpha_{i_{2}} ) ) - 1} { 1 + \exp( -( \alpha_{i_{1}} - \alpha_{i_{2}} ) ) } \\ &= \frac { 1 + \exp( -( \alpha_{i_{1}} - \alpha_{i_{2}} ) ) } { 1 + \exp( -( \alpha_{i_{1}} - \alpha_{i_{2}} ) ) } - \frac { 1 } { 1 + \exp( -( \alpha_{i_{1}} - \alpha_{i_{2}} ) ) } \\ &= 1 - p_{i_{1} i_{2}}. \end{align*} Consequently modeling a win for either team will always yield the same inferences for their relative skills.
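This symmetry is easy to check numerically; the skill values below are illustrative:

```python
import math

def bt_win_prob(alpha_1, alpha_2):
    """Bradley-Terry probability that the first competitor wins."""
    return 1.0 / (1.0 + math.exp(-(alpha_1 - alpha_2)))

# Illustrative skills; any values exhibit the symmetry p_{i2 i1} = 1 - p_{i1 i2}.
a, b = 1.3, -0.7
assert math.isclose(bt_win_prob(b, a), 1.0 - bt_win_prob(a, b))
# Equal skills give even odds.
assert bt_win_prob(0.0, 0.0) == 0.5
```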

This model is also the basis for the Elo rating system (Elo (1967)) first developed to rank competitive chess players but since applied to all kinds of repeated competitions. The original Elo rating system used this pairwise comparison model to derive iterative updates to the competitor skills based on the initial skills and the outcome of a competition.
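As a sketch, the classic Elo system computes the expected score implied by this kind of model (in its conventional base-10 parameterization with scale 400) and then nudges each rating toward the observed outcome; the ratings and step size K below are illustrative:

```python
def elo_expected(r_a, r_b):
    """Expected score for player A under the standard Elo parameterization."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """One iterative rating update; the step size k is an illustrative choice."""
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# An upset win against a higher-rated player yields a large rating gain,
# and the update is zero-sum across the two players.
new_a, new_b = elo_update(1400.0, 1600.0, 1.0)
assert new_a > 1400.0 and new_b < 1600.0
assert abs((new_a - 1400.0) + (new_b - 1600.0)) < 1e-9
```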

One limitation of considering only the winner of a competition is that the resulting analysis is largely insensitive to the details of the competition itself. In particular any analysis using this model will be able to predict the winner of future competitions only so well.

In practice we can often use more general pairwise comparison models to capture these dynamics and inform more versatile predictions. For example in some competitions we can model the scores of two competitors, with item qualities capturing their distinct offensive and defensive skills. In many cases we can even model the outcome of individual plays and then integrate these component models together into a detailed model of how a competition progresses.

When analyzing a baseball game, for instance, we could model only the winner or we could model the scores of each team. We could even go deeper and model the constituent competitions between individual batters and pitchers, balls in play and team defenses, and so on.

3.2 Item Response Theory

Another common pairwise comparison arises in standardized testing, where the modeling framework is known as item response theory (Baker (2001)). These applications consider bipartite comparisons, with one item in each comparison representing a question on a test and the other representing a student attempting to answer that question. The item qualities then model the difficulty of the question and the test-taking ability of the student, respectively, while the binary outcome quantifies whether or not the student correctly answers the question.

The key difference between item response theory models and Bradley-Terry models is the asymmetry in the items. A Bradley-Terry model allows for comparisons between any two competitors, but an item response theory model considers only comparisons between one question and one student.

The simplest item response theory model takes the same form as the Bradley-Terry model with \text{Bernoulli}(y_{ji} \mid p_{ji}) and p_{ji} = \mathrm{logistic}(\alpha_{j} - \beta_{i}), where i indexes each question and j indexes each student. Historically this model is also known as the Rasch model (Rasch (1960)). In more contemporary literature it is often referred to as a 1PL model, with the “L” referring to the logistic mapping between item quality differences and outcome probabilities and the “1P” referring to the one parameter used to model the behavior of each question.

Adding a discrimination parameter for each question, p_{ji} = \mathrm{logistic}\big( \gamma_{i} \cdot (\alpha_{j} - \beta_{i} ) \big), allows the model to account for poorly designed questions that might be less sensitive to student abilities. This is also known as a 2PL model given the two parameters needed to model each question.

Lastly a 3PL model for each question introduces lower bounds to the correct answer probabilities, p_{ji} = \lambda_{i} + (1 - \lambda_{i}) \cdot \mathrm{logistic} \big( \gamma_{i} \cdot (\alpha_{j} - \beta_{i} ) \big). As we saw in the mixture modeling chapter this mixture of Bernoulli probability configurations can also be interpreted as a mixture of separate Bernoulli models, \begin{align*} p(y_{ji}) &= \quad\quad\quad\; \lambda_{i} \, \text{Bernoulli}(y_{ji} \mid 1) \\ &\quad + (1 - \lambda_{i}) \, \text{Bernoulli}(y_{ji} \mid \mathrm{logistic}\big( \gamma_{i} \cdot (\alpha_{j} - \beta_{i} ) \big)). \end{align*} In other words the 3PL model implies that students have some fixed probability of guessing each answer correctly, independent of their test-taking abilities.
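A small sketch of the 3PL probability, with an assumed guessing probability of 0.25:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_correct_3pl(alpha, beta, gamma, lam):
    """3PL correct-answer probability with guessing floor lam."""
    return lam + (1.0 - lam) * logistic(gamma * (alpha - beta))

# Even an arbitrarily weak student retains the guessing probability lam...
assert abs(p_correct_3pl(-20.0, 0.0, 1.0, 0.25) - 0.25) < 1e-6
# ...while a very strong student approaches certainty.
assert p_correct_3pl(20.0, 0.0, 1.0, 0.25) > 0.999
```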

4 Managing Inferential Degeneracies

Unfortunately pairwise comparison models aren’t as well-posed as we might initially have hoped. By construction observed outcomes inform only the differences between item qualities and not their definite values. Because of this the item qualities cannot be inferred independently of each other and, in a very real sense, they cannot be meaningfully interpreted independently of each other.

In other words pairwise comparison models are inherently redundant, and this redundancy complicates practical implementations.

4.1 The Fundamental Redundancy of Item Qualities

Pairwise comparison models are inherently non-identified, with any common translation to the item qualities yielding the same location configurations, \begin{align*} \mu_{i_{1} i_{2}}( \alpha_{i_{1}} + \zeta, \alpha_{i_{2}} + \zeta ) &= f \big( (\alpha_{i_{1}} + \zeta) - (\alpha_{i_{2}} + \zeta) \big) \\ &= f \big( \alpha_{i_{1}} - \alpha_{i_{2}} + \zeta - \zeta \big) \\ &= f \big( \alpha_{i_{1}} - \alpha_{i_{2}} \big) \\ &= \mu_{i_{1} i_{2}}( \alpha_{i_{1}}, \alpha_{i_{2}} ). \end{align*} Consequently every likelihood function realized from one of these models will be invariant to translations of the item qualities. This invariance in turn traces out non-compact inferential degeneracies that frustrate computation (Figure 1 (b)).

 


 

Figure 1: The item qualities in a naive pairwise comparison model are non-identified, resulting in (b) likelihood functions with non-compact degeneracies. (a) An informative prior model suppresses extreme qualities, resulting in a (c) much better behaved posterior distribution. Note, however, that a residual degeneracy remains within the containment of the prior model.

4.2 Informative Prior Modeling

Within a Bayesian analysis we can use an informative prior model to at least tame these non-compact degeneracies and contain the corresponding posterior distributions away from infinity (Figure 1). While inferential degeneracies can still persist within the breadth of a prior model and hinder computational efficiency, this containment at least makes accurate computation possible.

Because we cannot interpret the item qualities independently of each other any prior model over the \alpha_{i} will be somewhat arbitrary. What really determines whether a prior model is useful or not is how consistent the induced behavior for the differences in item qualities, and ultimately the corresponding outcome behaviors, is with the available domain expertise.

For example the independent component prior model over the item qualities, p( \alpha_{1}, \ldots, \alpha_{I} ) = \prod_{i = 1}^{I} \text{normal}( \alpha_{i} \mid 0, s_{i} ), implies a pushforward prior model for any item quality difference, p( \alpha_{i_{1}} - \alpha_{i_{2}} ) = \text{normal} \left( \alpha_{i_{1}} - \alpha_{i_{2}} \mid 0, \sqrt{ s_{i_{1}}^{2} + s_{i_{2}}^{2} } \right). If this pushforward behavior is consistent with our domain expertise then the initial prior model will be satisfactory.

That said many distinct prior models will in general share the same pushforward behavior. In particular any independent component prior model of the form p( \alpha_{1}, \ldots, \alpha_{I} ) = \prod_{i = 1}^{I} \text{normal}( \alpha_{i} \mid m, s_{i} ) implies the same pushforward behavior, p( \alpha_{i_{1}} - \alpha_{i_{2}} ) = \text{normal} \left( \alpha_{i_{1}} - \alpha_{i_{2}} \mid 0, \sqrt{ s_{i_{1}}^{2} + s_{i_{2}}^{2} } \right). Consequently any value of m defines an equally viable prior model.
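We can check this pushforward behavior with a quick Monte Carlo sketch; the common location m and the scales s1, s2 are arbitrary choices:

```python
import math
import random

random.seed(8675309)

# Independent normal priors with assumed scales s1, s2 and a common, arbitrary
# location m; the pushforward for any quality difference should be
# normal(0, sqrt(s1^2 + s2^2)) regardless of the value of m.
m, s1, s2 = 5.0, 1.0, 2.0
diffs = [random.gauss(m, s1) - random.gauss(m, s2) for _ in range(100_000)]

mean = sum(diffs) / len(diffs)
sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / len(diffs))

# The empirical mean and standard deviation match the pushforward prediction
# up to Monte Carlo error.
assert abs(mean) < 0.05
assert abs(sd - math.sqrt(s1 ** 2 + s2 ** 2)) < 0.05
```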

4.3 Anchoring Item Qualities

A more effective approach than prior containment is to remove a degree of freedom from the pairwise comparison model and eliminate the underlying redundancy entirely.

4.3.1 Single Anchor

For example consider picking an arbitrary item and then subtracting the distinguished item quality from all of the item qualities, \delta_{i} = \alpha_{i} - \alpha_{i'}. After this transformation the item quality differences are unchanged, \begin{align*} \mu_{i_{1} i_{2}}( \delta_{i_{1}}, \delta_{i_{2}} ) &= \mu_{i_{1} i_{2}}( \alpha_{i_{1}} - \alpha_{i'}, \alpha_{i_{2}} - \alpha_{i'} ) \\ &= f \big( (\alpha_{i_{1}} - \alpha_{i'}) - (\alpha_{i_{2}} - \alpha_{i'}) \big) \\ &= f \big( \alpha_{i_{1}} - \alpha_{i_{2}} \big) \\ &= \mu_{i_{1} i_{2}}( \alpha_{i_{1}}, \alpha_{i_{2}} ), \end{align*} but the distinguished item quality is now anchored to zero, \delta_{i'} = \alpha_{i'} - \alpha_{i'} = 0. Because \delta_{i'} is now fixed we have to infer only the I - 1 remaining quality parameters.
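A minimal sketch of this anchoring transformation, with illustrative qualities and an arbitrary choice of anchor:

```python
# Anchoring: subtract a distinguished item's quality from every quality.
alpha = {"A": 1.5, "B": 0.2, "C": -0.8}  # illustrative qualities
anchor = "B"                             # arbitrary choice of anchored item
delta = {i: q - alpha[anchor] for i, q in alpha.items()}

# The anchored quality is fixed to zero...
assert delta[anchor] == 0.0
# ...while every pairwise difference, and hence the likelihood, is unchanged.
for i1 in alpha:
    for i2 in alpha:
        assert abs((delta[i1] - delta[i2]) - (alpha[i1] - alpha[i2])) < 1e-12
```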

The \delta_{i} can be interpreted as the item qualities relative to the quality of the anchored item, with \delta_{i} > 0 implying that the ith item is superior to the anchored item and \delta_{i} < 0 implying that the ith item is inferior to the anchored item. This interpretation also facilitates prior modeling, allowing us to reason about the relative item qualities directly instead of having to analyze indirect pushforward behavior.

4.3.2 Multiple Anchors

All of this said, a single anchor isn’t always enough. Recall that once we fix the quality of the i'th item to zero the residual quality parameters are interpreted relative to the quality of that anchored item. This immediately identifies the \delta_{i} for any item that is directly compared to the i'th item. These identified parameters in turn identify any remaining items that are directly compared to them, and so on. Any items which are not at least indirectly compared to the anchored item, however, remain invariant to translations!

We can formalize the connectivity of the comparisons, and hence how many item qualities we need to anchor, with a little bit of graph theory (Clark and Holton (1991)). Any collection of observed pairwise comparisons defines a graph, with each item defining a node and the number of pairwise comparisons between any two items defining a weighted, undirected edge between the corresponding nodes (Figure 2).

Figure 2: Any collection of pairwise comparisons defines an undirected graph, with the items being compared defining nodes and the comparisons themselves defining edges between the nodes.

Two items have been directly compared if there is an edge between their nodes; two items have been indirectly compared if their nodes can be connected by a path of edges. More formally these interacting items form the connected subgraphs, or components, of the full graph (Figure 3).

 


 

Figure 3: When some nodes are isolated from the others (a) a graph decomposes into (b, c) multiple connected components. In a graph defined by pairwise comparisons the item qualities in any one component are independent of the item qualities in any other component.

From an inferential perspective the behavior of items in one connected component is independent of the behavior of items in any other connected component. In other words the item qualities are only locally interpretable within the encompassing connected component. We cannot learn the relationship between two items that have no connection to each other!

For example while we can learn the relative test-taking abilities of students who have taken the same test we cannot learn the relative test-taking abilities of students who have taken completely different tests. Similarly when analyzing historical competition data we can learn how skilled modern competitors are relative to older competitors only if we have enough intermediate competitions to bridge them together.

Mathematically anchoring the quality parameter of an item eliminates the redundancy of only the item qualities within the same component. The redundancy of item qualities in other components remains. To completely eliminate the redundancy of a pairwise comparison model and avoid the subsequent inferential degeneracies we have to anchor one item quality in each connected component.

Fortunately the translation of a collection of items into a graph and computation of the resulting connected components is relatively straightforward to implement in practice (Cormen et al. (2022)). We’ll review a demonstration in Section 6.2.
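As a sketch, a simple union-find pass over the observed comparisons recovers the connected components; the items and pairings here are hypothetical:

```python
# A minimal union-find sketch that groups items into connected components
# from a list of observed comparisons; the items and pairings are illustrative.
comparisons = [("A", "B"), ("B", "C"), ("D", "E")]
items = {"A", "B", "C", "D", "E", "F"}  # "F" never appears in a comparison

parent = {i: i for i in items}

def find(i):
    """Return the representative of item i's component, with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

for i1, i2 in comparisons:
    parent[find(i1)] = find(i2)

components = {}
for i in items:
    components.setdefault(find(i), set()).add(i)

# Three components, so three item qualities need to be anchored.
assert sorted(sorted(c) for c in components.values()) == [["A", "B", "C"], ["D", "E"], ["F"]]
```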

That said we have to be careful with what inferences we construct from a pairwise comparison model with more than one anchored item quality. For example if we anchor the quality of items i' and i'' in two different connected components then we will have both \delta_{i'} = \alpha_{i'} - \alpha_{i'} = 0 and \delta_{i''} = \alpha_{i''} - \alpha_{i''} = 0, and consequently \delta_{i'} - \delta_{i''} = 0 - 0 = 0. Naively this might be interpreted as the qualities of the two anchored items being identical, even though those relative qualities are actually incomparable with each other.

In theory we could obstruct ill-defined comparisons by anchoring the items in the second connected component to a different value, \begin{align*} \delta_{i'} &= \alpha_{i'} - \alpha_{i'} = 0 \\ \delta_{i''} &= \alpha_{i''} - (\alpha_{i''} - \phi) = \phi. \end{align*} Unfortunately this just reintroduces the initial degeneracy: \phi is completely uninformed by the observed data and we would have to rely on whatever domain expertise has been encoded in the prior model to inform its posterior behavior.

Perhaps the safest approach is to simply analyze the pairwise comparisons in each connected component separately from each other. When analyzing a single connected component at a time we will only ever need to anchor one item. Moreover inferences and predictions for all possible pairwise comparisons within each analysis will always be well-defined. The limitation of this approach, however, is that the full data set will no longer be able to inform any shared behaviors, such as common baselines or discriminations.

Regardless of the approach we take we are ultimately responsible for avoiding ill-defined inferences or predictions for any comparison of items in different connected components.

4.4 Coupling Function Redundancies

Every function that couples the item quality differences to the outcome location will suffer the fundamental redundancy of the item qualities. Some coupling functions, however, can introduce new redundancies and inferential degeneracies with them.

Consider, for example, the translation function t_{\eta} ( \alpha_{i_{1}} - \alpha_{i_{2}} ) = \eta + \alpha_{i_{1}} - \alpha_{i_{2}}. Mathematically the baseline suffers from an additive redundancy with the item quality differences, \begin{align*} t_{\eta} ( \alpha_{i_{1}} - \alpha_{i_{2}} ) &= \eta + \alpha_{i_{1}} - \alpha_{i_{2}} \\ &= \eta + 0 + \alpha_{i_{1}} - \alpha_{i_{2}} \\ &= \eta + \zeta - \zeta + \alpha_{i_{1}} - \alpha_{i_{2}} \\ &= (\eta + \zeta) + (\alpha_{i_{1}} - \alpha_{i_{2}} - \zeta) \\ &= t_{\eta + \zeta} (\alpha_{i_{1}} - \alpha_{i_{2}} - \zeta ). \end{align*}

When items can appear on both sides of the comparisons, however, there’s no way to factor that offset \zeta into the individual item qualities. In this case the baseline \eta is identified by the observed behavior common to all comparisons.

With bipartite comparisons, however, we can factor any offset \zeta into either of the contrasting item qualities \begin{align*} t_{\eta} ( \alpha_{i} - \beta_{j} ) &= t_{\eta + \zeta} ( \alpha_{i} - \beta_{j} - \zeta ) \\ &= t_{\eta + \zeta} \big( (\alpha_{i} - \zeta) - \beta_{j} \big) \\ &= t_{\eta + \zeta} \big( \alpha_{i} - (\beta_{j} + \zeta) \big). \end{align*} Because we can absorb the offset into either group of item qualities we cannot eliminate this redundancy by anchoring one type of item alone. Instead we have to anchor both types of items.

When the observed comparisons define a single connected component we need to set a distinguished \alpha_{i'} and a distinguished \beta_{j'} to zero at the same time. More generally we need to fix one of each type of item quality within each connected component.

For example if we’re modeling points scored in the games of a sports league as p(y_{ij} ) = \mathrm{Poisson}\big( y_{ij} \mid \exp( \eta + \alpha_{i}^{\mathrm{offense}} - \alpha_{j}^{\mathrm{defense}} ) \big) then we would need to anchor one team’s offensive skill to zero and one team’s defensive skill to zero so that the remaining skills are defined relative to the anchored skills. In general we can anchor the offensive and defensive skills of the same team or even different teams. Either way we can then interpret the baseline \eta as modeling the score in a, possibly hypothetical, game between the anchored offense and the anchored defense.
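A minimal sketch of this anchored score model, with illustrative values (here anchoring both of team1's skills):

```python
import math

# Sketch of the anchored score model; eta and the skills are illustrative.
eta = 1.0                                 # baseline log score
offense = {"team1": 0.0, "team2": 0.3}    # team1's offense anchored to zero
defense = {"team1": 0.0, "team2": -0.2}   # team1's defense anchored to zero

def score_rate(i, j):
    """Poisson location for team i's offense against team j's defense."""
    return math.exp(eta + offense[i] - defense[j])

# exp(eta) is the expected score for the anchored offense against the
# anchored defense.
assert math.isclose(score_rate("team1", "team1"), math.exp(eta))
# A stronger offense or a weaker opposing defense raises the expected score.
assert score_rate("team2", "team1") > score_rate("team1", "team1")
```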

Similarly a scaling function s_{\gamma} (\alpha_{i_{1}} - \alpha_{i_{2}} ) = \gamma \cdot (\alpha_{i_{1}} - \alpha_{i_{2}}) suffers from a multiplicative redundancy, \begin{align*} s_{\gamma} (\alpha_{i_{1}} - \alpha_{i_{2}} ) &= \gamma \cdot (\alpha_{i_{1}} - \alpha_{i_{2}}) \\ &= \gamma \cdot 1 \cdot (\alpha_{i_{1}} - \alpha_{i_{2}}) \\ &= \gamma \cdot \frac{\rho}{\rho} \cdot (\alpha_{i_{1}} - \alpha_{i_{2}}) \\ &= (\rho \cdot \gamma) \cdot \left( \frac{\alpha_{i_{1}} - \alpha_{i_{2}}}{\rho} \right) \\ &= s_{\rho \cdot \gamma} \left( \frac{\alpha_{i_{1}} - \alpha_{i_{2}}}{\rho} \right). \end{align*}

Because scaling the item qualities does not obstruct translations, \frac{\alpha_{i_{1}} - \alpha_{i_{2}}}{\rho} = \frac{ (\alpha_{i_{1}} + \zeta) - (\alpha_{i_{2}} - \zeta)}{\rho}, anchoring the item qualities will not eliminate this scale redundancy. Any proportional increase in the discrimination parameter \gamma can be compensated by a proportional decrease in the item qualities and vice versa.

An informative prior model on the item qualities and discrimination parameters does limit the scope of this redundancy, and hence the resulting inferential degeneracies. That said constraints on the qualities and discriminations can interact in subtle ways that complicate principled prior modeling. By far the most effective strategy is to eliminate the redundancy entirely.

If the discrimination is homogeneous across all comparisons then we can always fix \gamma to one without any loss of generality. On the other hand if the discrimination is heterogeneous, with a separate \gamma_{k} for each distinct context, then fixing one of the \gamma_{k'} will eliminate the redundancy in the pairwise comparison model. Specifically if we set \gamma_{k'} = 1 then the remaining discrimination parameters end up quantifying scaling relative to the anchored context, \kappa_{k} = \frac{ \gamma_{k} }{ \gamma_{k'} }.
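For example, with a hypothetical set of context-specific discriminations the anchoring amounts to a simple division:

```r
# Sketch: anchoring the first context, gamma_1 = 1, converts free
# discriminations into relative discriminations (hypothetical values)
gammas <- c(1.8, 0.9, 2.4, 1.2)
kappas <- gammas / gammas[1]

stopifnot(abs(kappas[1] - 1) < 1e-12)        # anchored context
stopifnot(abs(kappas[3] - 2.4 / 1.8) < 1e-9) # relative to context 1
```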

4.5 Trouble With Dominating Items

Depending on the structure of the observational space pairwise comparison models can also suffer from some more circumstantial inferential degeneracies.

Consider for example comparison outcomes that are bounded above, y_{i_{1} i_{2}} \le y_{\max}. If the observed comparisons to one particular item i' always saturate that maximum value, y_{i' i_{2}} = y_{\max}, then the observations will not be able to determine just how much better that dominating item is than the others.

The saturated observations will be able to lower bound the definite or relative quality of the dominating item, but they will be indifferent to the remaining, arbitrarily large values. This then results in a non-compact degeneracy extending out towards positively-infinite values.

Similarly if the observed comparisons are bounded below, y_{i_{1} i_{2}} \ge y_{\min}, and the observed comparisons to one dominating item always saturate that minimum value, y_{i' i_{2}} = y_{\min}, then the data will not be able to differentiate between arbitrarily large but negative values of the definite quality \alpha_{i'} or even the relative quality \delta_{i'}. In this case the likelihood function will exhibit a non-compact degeneracy that extends out towards negatively-infinite values.
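To see the shape of this degeneracy concretely, consider binary outcomes where one player wins every observed game. Assuming the logistic comparison probability used later in this chapter, and hypothetical anchored opponent skills, the log likelihood for the dominating player's skill plateaus towards zero as that skill increases.

```r
# Sketch: log likelihood for the skill of a player who has won every
# game, assuming logistic comparison probabilities and hypothetical,
# fixed opponent skills
opponent_skills <- c(-0.5, 0.0, 0.8)
log_lik <- function(alpha) {
  sum(plogis(alpha - opponent_skills, log.p=TRUE))
}

# The log likelihood is monotonically increasing but flattens out,
# bounding the skill from below without bounding it from above
lls <- sapply(c(2, 5, 10, 20), log_lik)
stopifnot(all(diff(lls) > 0))
stopifnot(abs(log_lik(20)) < 1e-6)
```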

For both cases a reasonably informative prior model will be able to contain these inferential degeneracies and prevent them from propagating to the posterior distribution. An even better strategy is to design more informative comparisons that can inform just how much better every item is than its peers.

For example in the context of sporting events analyzing score differentials is always more informative than analyzing winners. Similarly in the context of standardized testing one might be able to learn more from how long it takes students to reach the right answer than from whether or not the students eventually report the right answer.

5 Ranking Items

Because item qualities can be meaningfully interpreted only relative to one another they inform only a limited range of behaviors. One particularly useful behavior that they can inform is a ranking of the items being compared.

Formally any configuration of item qualities, ( \alpha_{1}, \ldots, \alpha_{i}, \ldots, \alpha_{I} ) \in A^{I}, implies one of I! possible rankings, (r_{1}, \ldots, r_{i}, \ldots, r_{I}) \in R, where the items are ordered by their relative qualities, \alpha_{r_{1}} < \ldots < \alpha_{r_{i}} < \ldots < \alpha_{r_{I}}. In fact this relationship defines a surjective function from the space of item qualities to the space of the possible orderings of the I items, o : A^{I} \rightarrow R.
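In R this map from item qualities to rankings is, under the convention above where r_{1} indexes the lowest-quality item, just the order function; the qualities here are hypothetical.

```r
# Sketch: mapping a configuration of item qualities to its implied
# ranking; r[1] indexes the lowest-quality item (hypothetical values)
alphas <- c(0.3, -1.2, 2.1, 0.7)
r <- order(alphas)

stopifnot(identical(r, c(2L, 1L, 4L, 3L)))
stopifnot(all(diff(alphas[r]) > 0))   # qualities increase along r
```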

Theoretically pushing forward a posterior distribution over the item qualities \pi along this function will define a posterior distribution o_{*} \pi that quantifies the consistent item rankings. The challenge in practice, however, is that the space of rankings R is difficult to navigate. In particular it is not immediately clear how we might be able to construct summaries of o_{*} \pi that are straightforward to communicate.

For example we might be interested in point summaries that quantify the centrality of the rank posterior distribution in some way. Because the space of rankings is discrete each ranking will in general be allocated non-zero probability, allowing us to consider a modal ranking, r^{*} = \underset{r \in R}{\mathrm{argmax}} \; o_{*} \pi( \{ r \} ).

Unfortunately even if a unique mode exists actually finding it is typically intractable. In general we will not be able to compute the atomic allocations o_{*} \pi( \{ r \} ) in closed form and estimation of all of these probabilities quickly becomes expensive. Consequently exhaustively searching through all I! atomic allocations is almost always too expensive. Moreover without gradient information we have no way to guide a more efficient search of the ranking space.

In analogy to the construction of posterior mean summaries we might instead consider a ranking distance function d : R \times R \rightarrow \mathbb{R}^{+} and then define a point summary that minimizes the expected distance, \mu_{R} = \underset{r \in R}{\mathrm{argmin}} \; \mathbb{E}_{o_{*} \pi}[ d( \cdot, r) ]. Conveniently there are a variety of useful distance functions on R that we could use to inform such point summaries, and in theory we can estimate the necessary expectation values with Markov chain Monte Carlo. The minimization over all possible candidate rankings r \in R, however, suffers from the same problems as above.
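One common choice of ranking distance is a Kendall-tau-style count of discordant pairs. The helper below is a hypothetical illustration of such a distance, using the convention above where r[k] is the item occupying rank k.

```r
# Sketch: Kendall-tau-style distance between two rankings, counting
# the pairs of items that the two rankings order differently
kendall_dist <- function(r1, r2) {
  pos1 <- match(seq_along(r1), r1)  # rank position of each item
  pos2 <- match(seq_along(r2), r2)
  I <- length(r1)
  d <- 0
  for (i in 1:(I - 1)) {
    for (j in (i + 1):I) {
      if (sign(pos1[i] - pos1[j]) != sign(pos2[i] - pos2[j])) {
        d <- d + 1
      }
    }
  }
  d
}

stopifnot(kendall_dist(c(1, 2, 3), c(1, 2, 3)) == 0)  # identical
stopifnot(kendall_dist(c(1, 2, 3), c(3, 2, 1)) == 3)  # reversed
stopifnot(kendall_dist(c(1, 2, 3), c(2, 1, 3)) == 1)  # one swap
```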

Because of these computational issues we usually need to appeal to more heuristic summaries in practice. For instance we can always rank the items by the posterior expectation values of their individual qualities, \mathbb{E}_{\pi}[ \alpha_{r_{1}} ] < \ldots < \mathbb{E}_{\pi}[ \alpha_{r_{i}} ] < \ldots < \mathbb{E}_{\pi}[ \alpha_{r_{I}} ].
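For instance, given an S x I matrix of posterior draws of the item qualities we can order the items by their estimated posterior means; the draws here are synthetic stand-ins for actual Markov chain Monte Carlo output.

```r
# Sketch: heuristic ranking by posterior mean qualities, using
# synthetic draws in place of actual Markov chain Monte Carlo output
set.seed(1)
alpha_samples <- cbind(rnorm(1000, 0.5, 0.1),
                       rnorm(1000, -0.3, 0.1),
                       rnorm(1000, 1.1, 0.1))

post_means <- colMeans(alpha_samples)
ranking <- order(post_means)   # lowest to highest expected quality

stopifnot(identical(ranking, c(2L, 1L, 3L)))
```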

When considering only two items at a time we can also consult the posterior probability that one skill surpasses the other, \pi( \{ \alpha_{i_{1}} > \alpha_{i_{2}} \} ). One nice feature of this pairwise inferential comparison is that it accounts for any coupled uncertainties between the two item qualities.
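Estimating this probability from posterior draws is straightforward: we just compute the proportion of draws in which one quality exceeds the other, which automatically respects any posterior correlation between the two qualities. The draws below are synthetic.

```r
# Sketch: posterior probability that item 1's quality exceeds item 2's,
# estimated from correlated synthetic draws
set.seed(1)
alpha1 <- rnorm(10000, 0.5, 0.2)
alpha2 <- alpha1 + rnorm(10000, -0.5, 0.1)  # strongly coupled to alpha1

p12 <- mean(alpha1 > alpha2)
stopifnot(p12 > 0.99)
```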

These binary inferential comparisons can also be used to construct another heuristic ranking. The posterior probability that the quality of each item is larger than the quality of all other items, p_{i'} = \pi( \{ \alpha_{i'} > \alpha_{1} \, \text{and} \, \ldots \, \text{and} \, \alpha_{i'} > \alpha_{i} \, \text{and} \, \ldots \, \text{and} \, \alpha_{i'} > \alpha_{I} \}), can be used to assign a top rank, r_{I} = \underset{i \in I}{\mathrm{argmax}} \; p_{i}. At this point we can compute the posterior probability that the quality of each remaining item is larger than the quality of all of the other remaining items and assign the second-highest rank based on the largest of these probabilities. We can then fill out all of the remaining rankings by iterating this procedure until only one item is left to occupy the lowest rank.
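A minimal sketch of this iterative procedure, assuming an S x I matrix of posterior draws (synthetic here), might look like:

```r
# Sketch: iteratively assign the highest remaining rank to the item
# most probably best among the remaining items
rank_by_best_prob <- function(alpha_samples) {
  remaining <- 1:ncol(alpha_samples)
  ranking <- integer(0)
  while (length(remaining) > 1) {
    sub <- alpha_samples[, remaining, drop=FALSE]
    # Proportion of draws in which each remaining item is the best
    p_best <- colMeans(sub == apply(sub, 1, max))
    best <- remaining[which.max(p_best)]
    ranking <- c(best, ranking)   # highest ranks assigned first
    remaining <- setdiff(remaining, best)
  }
  c(remaining, ranking)           # lowest to highest quality
}

set.seed(1)
alpha_samples <- cbind(rnorm(500, 0.5, 0.1),
                       rnorm(500, -0.3, 0.1),
                       rnorm(500, 1.1, 0.1))
stopifnot(identical(rank_by_best_prob(alpha_samples), c(2L, 1L, 3L)))
```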

The main limitation of this approach is that it requires estimating so many probability allocations that we can exhaust the available computational resources, especially if there are a lot of items. The computation is even more demanding if we want to ensure robust rankings that are not corrupted by computational artifacts. This requires that the Markov chain Monte Carlo error for each posterior probability estimate is smaller than any of the differences between it and the other posterior probability estimates. Implementing robust rankings in practice requires not only carefully monitoring the error of all of the posterior probability estimates but also potentially running additional or longer Markov chains to reduce any errors as needed.

6 Demonstrations

With the foundations laid let’s investigate the implementation of pairwise comparison models and their inferential behaviors with a series of increasingly sophisticated example analyses.

6.1 Setup

First and foremost we have to set up our local R environment.

par(family="serif", las=1, bty="l",
    cex.axis=1, cex.lab=1, cex.main=1,
    xaxs="i", yaxs="i", mar = c(5, 5, 3, 1))
library(rstan)
rstan_options(auto_write = TRUE)            # Cache compiled Stan programs
options(mc.cores = parallel::detectCores()) # Parallelize chains
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")

library(colormap)
util <- new.env()
source('mcmc_analysis_tools_rstan.R', local=util)
source('mcmc_visualization_tools.R', local=util)

6.2 Graph Analysis Utilities

Next we’ll need some utilities for analyzing the graphs defined by a collection of pairwise comparisons. There are a variety of dedicated graph analysis packages available in R, but here I’ll code up basic implementations of the specific tools that we’ll need.

Firstly we’ll need a function that builds up a weighted adjacency matrix from a list of pairwise comparisons. The (i_{1}, i_{2})-th element of a weighted adjacency matrix is the number of comparisons between item i_{1} and item i_{2}. In particular if the (i_{1}, i_{2})-th element is zero then the two items have not been directly compared.

build_adj_matrix <- function(N_items, N_comparisons, idx1, idx2) {
  if (min(idx1) < 1 | max(idx1) > N_items) {
    stop("Out of bounds idx1 values.")
  }
  if (min(idx2) < 1 | max(idx2) > N_items) {
    stop("Out of bounds idx2 values.")
  }

  adj <- matrix(0, nrow=N_items, ncol=N_items)

  # Compute directed edges
  for (n in 1:N_comparisons) {
    adj[idx1[n], idx2[n]] <- adj[idx1[n], idx2[n]] + 1
  }

  # Compute symmetric, undirected edges
  for (p1 in 1:(N_items - 1)) {
    for (p2 in (p1 + 1):N_items) {
      N1 <- adj[p1, p2]
      N2 <- adj[p2, p1]
      adj[p1, p2] <- N1 + N2
      adj[p2, p1] <- N1 + N2
    }
  }

  (adj)
}

Adjacency matrices are extremely useful for implementing many graph theory operations in practice. For example we can use a weighted adjacency matrix to visualize the comparison graph, with circles representing each node and lines representing the edges. Here the color of the lines visualizes the weight of each edge, equivalently how many comparisons have been made between any two items.

plot_undir_graph <- function(adj) {
  if (!is.matrix(adj)) {
    stop("Input adj is not a matrix.")
  }
  if (nrow(adj) != ncol(adj)) {
    stop("Input adj is not a square matrix.")
  }

  nom_colors <- c("#DCBCBC", "#C79999", "#B97C7C", "#A25050")
  edge_colors <- colormap(colormap=nom_colors, nshades=100)

  I <- nrow(adj)
  delta <- 2 * pi / I
  thetas <- (0:(I - 1)) * delta + pi / 2
  R <- 1

  xs <- R * cos(thetas)
  ys <- R * sin(thetas)

  plot(0, type='n', axes=FALSE, ann=FALSE,
       xlim=c(-1.1 * R, +1.1 * R), ylim=c(-1.1 * R, +1.1 * R))

  cthetas <- seq(0, 2 * pi + 0.05, 0.05)
  cxs <- R * cos(cthetas)
  cys <- R * sin(cthetas)
  lines(cxs, cys, lwd=3, col="#DDDDDD")

  max_adj <- max(adj)
  for (i1 in 1:(I - 1)) {
    for (i2 in (i1 + 1):I) {
      # Skip pairs with no direct comparisons so that we don't index
      # the color palette out of bounds
      if (adj[i1, i2] == 0) next
      cidx <- max(1, floor(100 * (adj[i1, i2] / max_adj)))
      lines(xs[c(i1, i2)], ys[c(i1, i2)], lwd=3, col=edge_colors[cidx])
    }
  }

  points(xs, ys, pch=16, cex=1.75, col="white")
  points(xs, ys, pch=16, cex=1.25, col=util$c_dark)
}

We can also use an adjacency matrix to decompose an initial graph into connected components.

Given an initial node we first find all of the nodes that are at least indirectly related by pairwise comparisons with a recursive graph search.

breadth_first_search <- function(n, adj, local_nodes=c()) {
  local_nodes <- c(local_nodes, n)
  neighbors <- which(adj[n,] > 0)
  for (nn in neighbors) {
    if (!(nn %in% local_nodes))
      local_nodes <- union(local_nodes,
                           Recall(nn, adj, local_nodes))
  }
  (local_nodes)
}

Repeating this search until we’ve exhausted all of the nodes identifies the connected components.

compute_connected_components <- function(adj) {
  if (!is.matrix(adj)) {
    stop("Input adj is not a matrix.")
  }
  if (nrow(adj) != ncol(adj)) {
    stop("Input adj is not a square matrix.")
  }

  all_nodes <- 1:nrow(adj)
  remaining_nodes <- all_nodes
  components <- list()

  for (n in all_nodes) {
    if (n %in% remaining_nodes) {
      subgraph <- breadth_first_search(n, adj)
      remaining_nodes <- setdiff(remaining_nodes, subgraph)
      components[[length(components) + 1]] <- subgraph
    }
  }

  (components)
}

For example let’s say that we have a graph defined by 8 items and 8 comparisons.

K <- 8
C <- 8

item1 <- c(1, 2, 3, 4, 4, 4, 5, 7)
item2 <- c(2, 3, 1, 5, 6, 8, 6, 8)

These comparisons define a weighted adjacency matrix.

adj <- build_adj_matrix(K, C, item1, item2)

The weighted adjacency matrix then informs a visualization of the graph.

par(mfrow=c(1, 1), mar=c(0, 0, 0, 0))

plot_undir_graph(adj)

In this case the graph decomposes into two components.

components <- compute_connected_components(adj)
components
[[1]]
[1] 1 2 3

[[2]]
[1] 4 5 6 8 7

One way to visualize this decomposition is to plot each connected component individually.

par(mfrow=c(1, 2), mar=c(0, 0, 2, 0))
for (c in seq_along(components)) {
  component_adj <- adj
  for (k in 1:K) {
    if (!(k %in% components[[c]]))
      component_adj[k,] <- rep(0, K)
  }
  plot_undir_graph(component_adj)
  title(paste("Component", c))
}

6.3 Modeling Binary Comparisons

With our toolbox prepared let’s move onto our first example. Here we’ll emphasize inferential degeneracies with a model of binary comparisons. To provide some basic context let’s say that these binary comparisons are the outcomes of games between two players and we’re interested in inferring the relative skills of each player.

6.3.1 Simulate data

To start we’ll need some data to analyze, including which players are matched up in each game and the eventual winner. For this example we’ll simulate the games in two steps, using player1_probs to sample the first player and pair_probs to conditionally sample their opponent. Note the zeroes in pair_probs prevent some player matchups from occurring at all.

N_players <- 7
N_games <- 4000

player1_probs <- c(0.2, 0.16, 0.19, 0.11, 0.12, 0.13, 0.09)

pair_probs <- c(0.0, 0.4, 0.0, 0.0, 0.5, 0.0, 0.0,
                0.5, 0.0, 0.0, 0.0, 0.2, 0.6, 0.0,
                0.0, 0.0, 0.0, 0.6, 0.0, 0.0, 0.5,
                0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 0.5,
                0.5, 0.3, 0.0, 0.0, 0.0, 0.4, 0.0,
                0.0, 0.3, 0.0, 0.0, 0.3, 0.0, 0.0,
                0.0, 0.0, 0.3, 0.4, 0.0, 0.0, 0.0)
pair_probs <-  matrix(pair_probs, nrow=N_players)
simu_bradley_terry.stan
data {
  int<lower=1> N_players; // Number of players
  int<lower=1> N_games;   // Number of games

  // Matchup probabilities
  simplex[N_players] player1_probs;
  array[N_players] simplex[N_players] pair_probs;
}

generated quantities {
  // Player skills
  array[N_players] real alpha
    = normal_rng(rep_vector(0, N_players), log(sqrt(2)) / 2.32);

  // Game outcomes
  array[N_games] int<lower=1, upper=N_players> player1_idx;
  array[N_games] int<lower=1, upper=N_players> player2_idx;
  array[N_games] int<lower=0, upper=1> y;

  for (n in 1:N_games) {
    // Simulate first player
    player1_idx[n] = categorical_rng(player1_probs);

    // Simulate second player
    player2_idx[n] = categorical_rng(pair_probs[player1_idx[n]]);

    // Simulate winner
    y[n] = bernoulli_logit_rng(  alpha[player1_idx[n]]
                               - alpha[player2_idx[n]]);
  }
}
simu <- stan(file="stan_programs/simu_bradley_terry.stan",
             algorithm="Fixed_param", seed=8438338,
             data=list("N_players" = N_players,
                       "N_games" = N_games,
                       "player1_probs" = player1_probs,
                       "pair_probs" = pair_probs),
             warmup=0, iter=1, chains=1, refresh=0)
data <- list("N_players" = N_players,
             "N_games" = N_games,
             "player1_idx" = extract(simu)$player1_idx[1,],
             "player2_idx" = extract(simu)$player2_idx[1,],
             "y" = extract(simu)$y[1,])

alpha_true <- extract(simu)$alpha[1,]

6.3.2 Explore Data

Now we can explore the simulated data.

Counting the number of games that include each player we can see that not every player participates in the same number of games. Consequently we should be able to more precisely learn the relative skills of the more active players.

game_counts <- sapply(1:data$N_players,
                      function(i) sum(data$player1_idx == i |
                                      data$player2_idx == i   ) )

par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))

barplot(game_counts,
        space=0, col=util$c_dark_teal, border="white",
        xlab="Player", ylab="Number of Games")

To investigate the interaction between the players we can visualize the matchups as a graph.

adj <- build_adj_matrix(data$N_players, data$N_games,
                        data$player1_idx, data$player2_idx)

par(mfrow=c(1, 1), mar=c(0, 0, 0, 0))

plot_undir_graph(adj)

Interestingly the players decompose into two isolated groups, with no games matching up players across the groups.

components <- compute_connected_components(adj)
components
[[1]]
[1] 1 2 5 6

[[2]]
[1] 3 4 7
par(mfrow=c(1, 2), mar=c(0, 0, 2, 0))

for (c in seq_along(components)) {
  component_adj <- adj
  for (i in 1:data$N_players) {
    if (!(i %in% components[[c]]))
      component_adj[i,] <- rep(0, data$N_players)
  }
  plot_undir_graph(component_adj)
  title(paste("Connected Subgraph", c))
}

Given the variation in the number of games played it’s no surprise that there is also heterogeneity in the total number of wins and losses for each player.

win_counts <- sapply(1:data$N_players,
                     function(i) sum(( data$player1_idx == i &
                                       data$y == 1              ) |
                                     ( data$player2_idx == i &
                                       data$y == 0              )  ) )

loss_counts <- sapply(1:data$N_players,
                      function(i) sum(( data$player1_idx == i &
                                        data$y == 0             ) |
                                      ( data$player2_idx == i &
                                        data$y == 1             )  ) )

par(mfrow=c(1, 2), mar=c(5, 5, 1, 1))

barplot(win_counts,
        space=0, col=util$c_dark_teal, border="white",
        xlab="Player", ylab="Number of Wins", ylim=c(0, 800))

barplot(loss_counts,
        space=0, col=util$c_dark_teal, border="white",
        xlab="Player", ylab="Number of Losses", ylim=c(0, 800))

The proportion of games won should be particularly sensitive to the individual player skills.

par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))

barplot(win_counts / game_counts,
        space=0, col=util$c_dark_teal, border="white",
        xlab="Player", ylab="Win Frequency")

Indeed these empirical win frequencies make for a useful summary statistic.

6.3.3 Attempt 1

Given the binary outcomes we’ll follow the Bradley-Terry construction, with a Bernoulli model, \text{Bernoulli}(y_{i_{1} i_{2}} \mid p_{i_{1} i_{2}}), and logistic relationship between the probability that player i_{1} defeats player i_{2} and the corresponding difference in skills, p_{i_{1} i_{2}} = \mathrm{logistic}(\alpha_{i_{1}} - \alpha_{i_{2}}).
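In R the Bradley-Terry win probability is just the logistic function, implemented by plogis, applied to the skill difference:

```r
# The Bradley-Terry probability that player 1 defeats player 2
bt_prob <- function(alpha1, alpha2) plogis(alpha1 - alpha2)

# Equal skills give a coin flip, and the two outcome probabilities
# are complementary
stopifnot(abs(bt_prob(0, 0) - 0.5) < 1e-12)
stopifnot(abs(bt_prob(1, 0) + bt_prob(0, 1) - 1) < 1e-12)
```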

To investigate any inferential degeneracies as directly as possible let’s start with an improper, flat prior density function so that our Markov chains are effectively exploring the contours of the realized likelihood function.

bradley_terry1.stan
data {
  int<lower=1> N_players; // Number of players
  int<lower=1> N_games;   // Number of games

  // Game outcomes
  array[N_games] int<lower=1, upper=N_players> player1_idx;
  array[N_games] int<lower=1, upper=N_players> player2_idx;
  array[N_games] int<lower=0, upper=1> y;
}

parameters {
  vector[N_players] alpha; // Player skills
}

model {
  // Implicit uniform prior density function for the alpha components

  // Observational model
  y ~ bernoulli_logit(alpha[player1_idx] - alpha[player2_idx]);
}

generated quantities {
  array[N_games] int<lower=0, upper=1> y_pred;
  array[N_players] int<lower=0> win_counts_pred
    = rep_array(0, N_players);
  array[N_players] int<lower=0> loss_counts_pred
    = rep_array(0, N_players);
  array[N_players] real win_freq_pred;

  for (n in 1:N_games) {
    int idx1 = player1_idx[n];
    int idx2 = player2_idx[n];

    y_pred[n] = bernoulli_logit_rng(alpha[idx1] - alpha[idx2]);
    if (y_pred[n]) {
      win_counts_pred[idx1] += 1;
      loss_counts_pred[idx2] += 1;
    } else {
      win_counts_pred[idx2] += 1;
      loss_counts_pred[idx1] += 1;
    }
  }

  for (i in 1:N_players) {
    win_freq_pred[i] =   (1.0 * win_counts_pred[i])
                       / (win_counts_pred[i] + loss_counts_pred[i]);
  }
}
fit <- stan(file="stan_programs/bradley_terry1.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)

Unfortunately the diagnostics are very unhappy.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  Chain 1: 485 of 1024 transitions (47.36328125%) saturated the maximum treedepth of 10.

  Chain 2: 439 of 1024 transitions (42.87109375%) saturated the maximum treedepth of 10.

  Chain 3: 562 of 1024 transitions (54.8828125%) saturated the maximum treedepth of 10.

  Chain 4: 497 of 1024 transitions (48.53515625%) saturated the maximum treedepth of 10.

  Numerical trajectories that saturate the maximum treedepth have
terminated prematurely.  Increasing max_depth above 10 should result in
more expensive, but more efficient, Hamiltonian transitions.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples,
                                       c('alpha'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
alpha[1]:
  Split hat{R} (4.600) exceeds 1.1.
  Chain 1: hat{ESS} (8.697) is smaller than desired (100).
  Chain 2: hat{ESS} (5.632) is smaller than desired (100).
  Chain 3: hat{ESS} (8.079) is smaller than desired (100).
  Chain 4: hat{ESS} (4.521) is smaller than desired (100).

alpha[2]:
  Split hat{R} (4.600) exceeds 1.1.
  Chain 1: hat{ESS} (8.697) is smaller than desired (100).
  Chain 2: hat{ESS} (5.632) is smaller than desired (100).
  Chain 3: hat{ESS} (8.079) is smaller than desired (100).
  Chain 4: hat{ESS} (4.521) is smaller than desired (100).

alpha[3]:
  Split hat{R} (3.996) exceeds 1.1.
  Chain 1: hat{ESS} (5.616) is smaller than desired (100).
  Chain 2: hat{ESS} (6.316) is smaller than desired (100).
  Chain 3: hat{ESS} (10.696) is smaller than desired (100).
  Chain 4: hat{ESS} (13.759) is smaller than desired (100).

alpha[4]:
  Split hat{R} (3.996) exceeds 1.1.
  Chain 1: hat{ESS} (5.616) is smaller than desired (100).
  Chain 2: hat{ESS} (6.316) is smaller than desired (100).
  Chain 3: hat{ESS} (10.696) is smaller than desired (100).
  Chain 4: hat{ESS} (13.759) is smaller than desired (100).

alpha[5]:
  Split hat{R} (4.600) exceeds 1.1.
  Chain 1: hat{ESS} (8.697) is smaller than desired (100).
  Chain 2: hat{ESS} (5.632) is smaller than desired (100).
  Chain 3: hat{ESS} (8.078) is smaller than desired (100).
  Chain 4: hat{ESS} (4.521) is smaller than desired (100).

alpha[6]:
  Split hat{R} (4.600) exceeds 1.1.
  Chain 1: hat{ESS} (8.694) is smaller than desired (100).
  Chain 2: hat{ESS} (5.632) is smaller than desired (100).
  Chain 3: hat{ESS} (8.078) is smaller than desired (100).
  Chain 4: hat{ESS} (4.520) is smaller than desired (100).

alpha[7]:
  Split hat{R} (3.996) exceeds 1.1.
  Chain 1: hat{ESS} (5.616) is smaller than desired (100).
  Chain 2: hat{ESS} (6.316) is smaller than desired (100).
  Chain 3: hat{ESS} (10.695) is smaller than desired (100).
  Chain 4: hat{ESS} (13.758) is smaller than desired (100).


Split Rhat larger than 1.1 suggests that at least one of the Markov
chains has not reached an equilibrium.

Small empirical effective sample sizes result in imprecise Markov chain
Monte Carlo estimators.

Together the treedepth saturation and low \hat{ESS}s suggest that the likelihood function is concentrating into a long, narrow geometry. Indeed when we look through pairs plots of the player skills we see some of them exhibiting translation invariance.

names <- sapply(1:data$N_players, function(i) paste0('alpha[', i, ']'))
util$plot_div_pairs(names, names, samples, diagnostics)

The behaviors become more consistent if we compare the skills of players only within the same connected components. All of these pairs plots show the same degenerate geometry.

names <- sapply(components[[1]], function(i) paste0('alpha[', i, ']'))
util$plot_div_pairs(names, names, samples, diagnostics)

names <- sapply(components[[2]], function(i) paste0('alpha[', i, ']'))
util$plot_div_pairs(names, names, samples, diagnostics)

Interestingly the skills of players in different connected subgroups appear to be largely uncorrelated. Instead the player skills incoherently meander around their consistent values.

names1 <- sapply(components[[1]], function(i) paste0('alpha[', i, ']'))
names2 <- sapply(components[[2]], function(i) paste0('alpha[', i, ']'))
util$plot_div_pairs(names1, names2, samples, diagnostics)

The fact that all of the cross-component pairs plots exhibit identical meandering patterns is somewhat disconcerting. Something has to be coupling all of these behaviors together.

Recall that the observed game outcomes inform each player’s skill relative to their historical opponents. Nothing, however, informs how the two connected components should behave relative to each other. A player who has won all of their matches but played only poor opponents isn’t likely to fare well against new competition. Similarly a player who has lost all of their matches against elite opponents can perform surprisingly well against weaker competition.

Consequently the average skills of the two connected components effectively random walk around each other.

var_repl <- list('alpha'=array(sapply(components[[1]],
                                      function(i)
                                      paste0("alpha[", i, "]"))))
mean_vals1 <- util$eval_expectand_pushforward(samples,
                                              function(alpha)
                                                mean(alpha),
                                              var_repl)

var_repl <- list('alpha'=array(sapply(components[[2]],
                                      function(i)
                                      paste0("alpha[", i, "]"))))
mean_vals2 <- util$eval_expectand_pushforward(samples,
                                              function(alpha)
                                                mean(alpha),
                                              var_repl)

util$plot_pairs_by_chain(mean_vals1, 'mean1',
                         mean_vals2, 'mean2')

Another way to understand these inferential degeneracies is that the observed game outcomes inform the differences in the player skills but not the sums. For this initial model the explored skill sums are limited only by the finite length of the Markov chains.

par(mfrow=c(1, 2), mar=c(5, 5, 1, 1))

var_repl <- list('alpha'=array(sapply(components[[1]],
                                      function(i)
                                      paste0("alpha[", i, "]"))))
sum_vals <- util$eval_expectand_pushforward(samples,
                                            function(alpha) sum(alpha),
                                            var_repl)
util$plot_expectand_pushforward(sum_vals, 20, flim=c(-4000, 4000),
                                display_name="sum(alpha)",
                                main="Connected Component 1")

var_repl <- list('alpha'=array(sapply(components[[2]],
                                      function(i)
                                      paste0("alpha[", i, "]"))))
sum_vals <- util$eval_expectand_pushforward(samples,
                                            function(alpha) sum(alpha),
                                            var_repl)
util$plot_expectand_pushforward(sum_vals, 20, flim=c(-4000, 4000),
                                display_name="sum(alpha)",
                                main="Connected Component 2")

On the other hand the skill differences within each connected component are consistent with the behavior of the simulation data generating process.

par(mfrow=c(2, 3), mar=c(5, 5, 3, 1))

N <- length(components[[1]])
for (i1 in 1:(N - 1)) {
  for (i2 in (i1 + 1):N) {
    idx1 <- components[[1]][i1]
    idx2 <- components[[1]][i2]
    name1 <- paste0('alpha[', idx1, ']')
    name2 <- paste0('alpha[', idx2, ']')
    diff_vals <- util$eval_expectand_pushforward(samples,
                                                 function(a1, a2) a1 - a2,
                                                 list('a1'=name1,
                                                      'a2'=name2))
    util$plot_expectand_pushforward(diff_vals,
                                    25, flim=c(-0.75, 0.5),
                                    baseline=alpha_true[idx1] -
                                             alpha_true[idx2],
                                    baseline_col=util$c_dark_teal,
                                    display_name=paste(name1, '-',
                                                       name2))
  }
}

mtext("Connected Component 1", line=-2, outer=TRUE)

par(mfrow=c(1, 3), mar=c(5, 5, 3, 1))

N <- length(components[[2]])
for (i1 in 1:(N - 1)) {
  for (i2 in (i1 + 1):N) {
    idx1 <- components[[2]][i1]
    idx2 <- components[[2]][i2]
    name1 <- paste0('alpha[', idx1, ']')
    name2 <- paste0('alpha[', idx2, ']')
    diff_vals <- util$eval_expectand_pushforward(samples,
                                                 function(a1, a2)
                                                 a1 - a2,
                                                 list('a1'=name1,
                                                      'a2'=name2))
    util$plot_expectand_pushforward(diff_vals,
                                    25, flim=c(-0.75, 0.5),
                                    baseline=alpha_true[idx1] -
                                             alpha_true[idx2],
                                    baseline_col=util$c_dark_teal,
                                    display_name=paste(name1, '-',
                                                       name2))
  }
}

mtext("Connected Component 2", line=-2, outer=TRUE)

6.3.4 Attempt 2

We should be able to tame the degenerate geometry of the realized likelihood function with a proper prior model. Let’s start by considering a diffuse, but well-defined, prior model that contains the magnitude of the player skills below 10. Keep in mind that inputs of \pm 10 to a logistic function correspond to pretty extreme output probabilities!
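The scale 10 / 2.32 follows the usual tail-quantile reasoning: 2.32 is approximately the 99th percentile of the standard normal distribution, so the prior allocates about 98% of its mass to skills between -10 and +10, magnitudes that already saturate the logistic function.

```r
# Check the tail reasoning behind the prior scale 10 / 2.32
stopifnot(abs(qnorm(0.99) - 2.32) < 0.01)

# About 98% of the prior mass falls inside (-10, +10)
prior_mass <- pnorm(10, 0, 10 / 2.32) - pnorm(-10, 0, 10 / 2.32)
stopifnot(abs(prior_mass - 0.98) < 0.001)

# A skill difference of +10 already implies a near-certain win
stopifnot(plogis(10) > 0.9999)
```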

bradley_terry2.stan
data {
  int<lower=1> N_players; // Number of players
  int<lower=1> N_games;   // Number of games

  // Game outcomes
  array[N_games] int<lower=1, upper=N_players> player1_idx;
  array[N_games] int<lower=1, upper=N_players> player2_idx;
  array[N_games] int<lower=0, upper=1> y;
}

parameters {
  vector[N_players] alpha; // Player skills
}

model {
  // Prior model
  // -10 <~ alpha[i] <~ +10
  alpha ~ normal(0, 10 / 2.32);

  // Observational model
  y ~ bernoulli_logit(alpha[player1_idx] - alpha[player2_idx]);
}

generated quantities {
  array[N_games] int<lower=0, upper=1> y_pred;
  array[N_players] int<lower=0> win_counts_pred
    = rep_array(0, N_players);
  array[N_players] int<lower=0> loss_counts_pred
    = rep_array(0, N_players);
  array[N_players] real win_freq_pred;

  for (n in 1:N_games) {
    int idx1 = player1_idx[n];
    int idx2 = player2_idx[n];

    y_pred[n] = bernoulli_logit_rng(alpha[idx1] - alpha[idx2]);
    if (y_pred[n]) {
      win_counts_pred[idx1] += 1;
      loss_counts_pred[idx2] += 1;
    } else {
      win_counts_pred[idx2] += 1;
      loss_counts_pred[idx1] += 1;
    }
  }

  for (i in 1:N_players) {
    win_freq_pred[i] =   (1.0 * win_counts_pred[i])
                       / (win_counts_pred[i] + loss_counts_pred[i]);
  }
}
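The prior scale `10 / 2.32` follows from the fact that 2.32 is approximately the 0.99 quantile of the standard normal density. A quick check in R, separate from the analysis itself, confirms that this choice leaves only about one percent of the prior mass above +10 and one percent below -10:

```r
# 2.32 is approximately the 0.99 quantile of the standard normal, so
# a normal(0, 10 / 2.32) density allocates about 98% of its total
# probability to the interval (-10, +10).
sigma <- 10 / 2.32
pnorm(+10, 0, sigma) - pnorm(-10, 0, sigma)  # about 0.98
```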
fit <- stan(file="stan_programs/bradley_terry2.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)

Even with this conservative prior model the diagnostics are much better behaved than they were before.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples,
                                       c('alpha'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

That said, the posterior exploration requires long, and consequently expensive, Hamiltonian trajectories.

par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))

util$plot_num_leapfrogs(diagnostics)

The costly exploration, however, does yield good retrodictive performance. In particular the behaviors of the posterior predictive and observed empirical win frequencies are consistent with each other.

par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))

names <- sapply(1:data$N_players,
                function(i) paste0('win_freq_pred[', i, ']'))
util$plot_disc_pushforward_quantiles(samples, names,
                                     baseline_values=win_counts /
                                                     game_counts,
                                     xlab="Player",
                                     ylab="Win Frequency")

Marginally the posterior inferences for each player skill aren’t that much more informative than the prior model.

par(mfrow=c(2, 4), mar=c(5, 5, 1, 1))

for (i in 1:data$N_players) {
  name <- paste0('alpha[', i, ']')
  util$plot_expectand_pushforward(samples[[name]],
                                  25, flim=c(-10, 10),
                                  baseline=alpha_true[i],
                                  baseline_col=util$c_dark_teal,
                                  display_name=name)
  xs <- seq(-10, 10, 0.25)
  ys <- dnorm(xs, 0, 10 / 2.32)
  lines(xs, ys, lwd=2, col=util$c_light_teal)
}

It’s not that we’re not learning anything; the game outcomes just don’t pin down the absolute values of the player skills.
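The underlying issue is a translation invariance: each Bernoulli probability depends on the player skills only through their difference, so shifting every skill by a common constant leaves the observational model unchanged. A minimal sketch with hypothetical skill values:

```r
# Shifting all player skills by the same constant does not change any
# pairwise win probability, so the game outcomes cannot inform the
# overall location of the skills, only their relative configuration.
alpha <- c(-1.0, 0.5, 2.0)  # hypothetical player skills
shift <- 5.0

p_orig  <- plogis(alpha[1] - alpha[2])
p_shift <- plogis((alpha[1] + shift) - (alpha[2] + shift))
all.equal(p_orig, p_shift)  # TRUE
```

Consequently the likelihood function concentrates along the skill differences while remaining completely flat along this shared shift, and only the prior model keeps the posterior distribution from stretching out indefinitely in that direction.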

par(mfrow=c(2, 4), mar=c(5, 5, 1, 1))

for (i in 1:(data$N_players - 1)) {
  for (j in (i + 1):data$N_players) {
    name1 <- paste0('alpha[', i, ']')
    name2 <- paste0('alpha[', j, ']')
    plot(c(samples[[name1]]), c(samples[[name2]]),
         cex=1, pch=16, col=util$c_dark,
         xlab=name1, xlim=c(-10, 10),
         ylab=name2, ylim=c(-10, 10))
    lines(10 * cos(seq(0, 2 * pi, 2 * pi / 250)),
          10 * sin(seq(0, 2 * pi, 2 * pi / 250)),
          col=util$c_dark_teal)
  }
}