par(family="serif", las=1, bty="l",
cex.axis=1, cex.lab=1, cex.main=1,
xaxs="i", yaxs="i", mar = c(5, 5, 3, 1))Die Another Bayes
As a freelancer branding is a necessary evil. The logo for my own company (Figure 1 (a)) is derived from applying a symplectomorphic transformation to the image of a six-sided die (Figure 1 (b)), emulating how Hamiltonian Monte Carlo transforms probabilistic objects. A few years after establishing my company I came across the Dice Lab and their Skew Dice™. While the shape of these dice doesn’t exactly match my logo, they were close enough for effective branding and I was able to place a bulk order for Skew Dice™ in my logo colors (Figure 2).
At a workshop in January 2023 I was giving away some dice to participants when one Will Pearse has the audacity to question the integrity of the Skew Dice™. Once I had calmed down I realized that this challenge raised an interesting question: how exactly can we learn how fair a die is in practice?
The Dice Lab website provides a geometric argument for why the Skew Dice™ should roll fairly:
Skew Dice™ are based on the trigonal trapezohedron. (A cube is a special case of a trigonal trapezohedron.) We distorted the faces in such a way that they remained flat and are congruent (each one has the same size and shape). In addition, the distorted polyhedron is an isohedron, meaning that the symmetry group of the polyhedron is transitive on the faces. Despite their bizarre appearance, this means that Skew Dice are just as fair as regular dice.
This argument, however, assumes that each die is manufactured precisely and with uniform material density. Without being able to test these mechanical properties directly all we can do is roll the dice over and over again and investigate how often each side lands facing upwards.
In this case study I present a series of Bayesian analyses that attempt to infer just how fair the Skew Dice™ that I ordered might be using hundreds of physical die rolls performed by a small group of volunteers. Along the way I’ll present various techniques for working with categorical data and simplex model configurations, including some approaches for incorporation heterogeneity.
1 Data Exploration
To ensure as rich of a data set as possible I sent Skew Dice™ to Adriano Yoshino, EM Wolkovich, Joe Wozny, Karim Naguib, Ian Costley, and John Flournoy, six of my generous Patreon supporters who volunteered their own, and in some cases their friends’ and family’s, time. Each volunteer then collected data from repeated rolls, recording which die was thrown, who threw it, and what the outcome was. Once all of the data had been shared I collected it into a single R data frame with unique indices for each die and player.
df <- read.csv("data/rolls.csv")
N <- nrow(df)
N_players <- length(unique(df$player_idxs))
N_dice <- length(unique(df$die_idxs))
data <- list("N" = N,
"outcome" = df$outcome,
"N_players" = N_players,
"player_idxs" = df$player_idxs,
"N_dice" = N_dice,
"die_idxs" = df$die_idxs)Altogether 10 unique individuals used a combination of 12 unique dice to give a 1682 total rolls.
print(sprintf("%s total rolls", N))[1] "1682 total rolls"
print(sprintf("%s distinct players", N_players))[1] "10 distinct players"
print(sprintf("%s distinct dice", N_dice))[1] "12 distinct dice"
Because each individual had access to only a few dice, however, the data include only a few player-dice pairings.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
library(colormap)
disc_colors <- c("#FFFFFF", "#DCBCBC", "#C79999", "#B97C7C",
"#A25050", "#8F2727", "#7C0000")
cont_colors <- colormap(colormap=disc_colors, nshades=100)
zs <- table(data$player_idxs, data$die_idxs)
image(1:N_players, 1:N_dice, zs,
col=rev(cont_colors),
xlab="Player Index", ylab="Die Index")
for (d in 1:N_dice) abline(h=d - 0.5, lwd=1.5, col="white")
for (p in 1:N_players) for (d in 1:N_dice) text(p, d, zs[p, d], col="white")The rolls themselves appear to be reasonably uniform across the six faces, both in aggregate and individually for each die.
util <- new.env()
source('mcmc_visualization_tools.R', local=util)Loading required package: StanHeaders
rstan version 2.32.6 (Stan version 2.32.2)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
change `threads_per_chain` option:
rstan_options(threads_per_chain = 1)
par(mfrow=c(1, 1), mar=c(5, 5, 3, 1))
util$plot_line_hist(data$outcome, 0.5, 6.5, 1,
xlab="Face", main="All Rolls")
abline(h=data$N / 6, lty=2, lwd=3, col="#DDDDDD")for (d in 1:data$N_dice) {
if (d %% 6 == 1) par(mfrow=c(2, 3), mar=c(4, 5, 2, 1))
util$plot_line_hist(data$outcome[data$die_idxs == d],
0.5, 6.5, 1, xlab="Face", main=paste("Die", d))
abline(h=sum(data$die_idxs == d) / 6, lty=2, lwd=3, col="#DDDDDD")
}That said the recorded outcomes are not exactly uniform. Our statistical challenge is to determine just how large the deviations from exact uniformity need to be before we need to be concerned about unfair dice.
2 Uniform Simplex Model
If we assume that the outcome of rolling a die is unpredictable then we won’t expect exactly uniform outcomes, even when a die is fair. In order to quantify these deviations we need to build a predictive model.
2.1 The Observational Model
Each roll of a die on a flat surface results in one, and only one, face. If we label each face with one of six integers then the observational space becomes Y = \{ 1, 2, 3, 4, 5, 6 \}.
In general we can model the relative frequencies of each these six outcomes with a finite probability distribution specified by six atomic probabilities, \begin{align*} \pi( \{ 1 \} ) &= q_{1} \\ \ldots& \\ \pi( \{ k \} ) &= q_{k} \\ \ldots& \\ \pi( \{ 6 \} ) &= q_{6} \end{align*} satisfying the constraints 0 \le q_{k} \le 1 and \sum_{k = 1}^{6} q_{k} = 1. The observational model for a single outcome is then given by categorical distribution, \begin{align*} p(y \mid q) &= \text{categorical}(y \mid q_{1}, \ldots, q_{k}, \ldots, q_{6}) \\ &\equiv q_{k = y}. \end{align*}
The space of all such probability distributions over six elements is known as the 5-simplex and is denoted \Delta^{5}. Note that we refer to this space as 5-simplex and not a 6-simplex because the normalization constraint effectively remove one degree of freedom. I will refer to any individual point in this simplex, q = (q_{1}, \ldots, q_{k}, \ldots, q_{6}) \in \Delta^{5}, as a simplex configuration.
If the properties of a particular die do not change appreciably between rolls then we should be able to model the outcome of multiple rolls using the same simplex configuration. In this case the joint observational model for multiple rolls reduces to a multinomial distribution, \begin{align*} p(y_{1}, \ldots, y_{N} \mid q_{1}, \ldots, q_{6}) &= \prod_{n = 1}^{N} \text{categorical}(y_{n} \mid q_{1}, \ldots, q_{6}) \\ &= \prod_{n = 1}^{N} q_{k = y_{n}} \\ &= \prod_{k = 1}^{6} q_{k}^{ \sum_{n = 1}^{N} \delta_{y_{n}, k} } \\ &= \prod_{k = 1}^{6} q_{k}^{ n_{k} }, \end{align*} where \delta_{y_{n}, k} is the Kronecker delta, \delta_{y_{n}, k} = \left\{ \begin{array}{rr} 1, & y_{n} = k \\ 0, & y_{n} \neq k \end{array} \right. , and n_{k} = \sum_{n = 1}^{N} \delta_{y_{n}, k} is the total number of rolls resulting in the kth face.
Finally if we assume that all of the dice behave identically then we can model all of their outcomes with a multinomial distribution and a single simplex configuration.
The context of this model provides a new way of interpreting “fairness”. Instead of requiring that a fair die produce uniform outcomes we can require that a fair die be modeled by the uniform simplex configuration \upsilon = \left( \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \right) \in \Delta^{5}. Interestingly this strict modeling assumption leaves us with no degrees of freedom to infer from the data. Instead we can immediately move on to the simulation of predictive data.
2.2 Summary Statistics For Retrodictive Checks
The observational space for all N = 1682 rolls in our data set, y_{1}, \ldots, y_{n}, \ldots, y_{N}, is 1682 dimensional. Unfortunately there are just a few too many dimensions to be able to visualize a predictive distribution directly.
In order to construct productive retrodictive checks to critique this model we’ll need summary statistics that can reduce the 1682-dimensional observational space into something a bit more manageable without sacrificing interpretability.
For example a histogram of the observed outcomes counts, n_{1}, \ldots, n_{6} is particularly natural for categorical data like this.
Moreover we can summarize the uniformity of these histograms using an empirical entropy function over the individual counts \begin{align*} \hat{H}(n_{1}, \ldots, n_{6}) &= - \sum_{k = 1}^{6} \hat{q}_{k} \log \hat{q}_{k} \\ &= - \sum_{k = 1}^{6} \left( \frac{n_{k}}{N} \right) \log \left( \frac{n_{k}}{N} \right). \end{align*} The empirical entropy function is minimized when the observed counts are exactly uniform, n_{1} = \ldots = n_{6} = \frac{N}{6}, and maximized when any of the n_{k} vanish. In a retrodictive check the empirical entropy allows us to investigate how well our model captures the uniformity, or lack thereof, exhibited by the observed rolls.
2.3 Implementation
Because there are no free parameters we can simulate predictive data directly in the generated quantities block of a Stan program.
uniform\_simplex.stan
data {
int<lower=1> N; // Number of rolls
}
transformed data {
// Uniform simplex configuration
simplex[6] upsilon = rep_vector(1.0 / 6, 6);
}
generated quantities {
// Compute predictions
array[6] int pred_counts = multinomial_rng(upsilon, N);
real pred_entropy = 0;
for (k in 1:6) {
real q_hat = pred_counts[k] * 1.0 / N;
pred_entropy += - lmultiply(q_hat, q_hat);
}
}The Fixed_param algorithm in RStan then allows us to repeatedly evaluate this generated quantities block to generate an ensemble of predictive summary statistic outputs.
library(rstan)
rstan_options(auto_write = TRUE) # Cache compiled Stan programs
options(mc.cores = parallel::detectCores()) # Parallelize chains
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
source('mcmc_analysis_tools_rstan.R', local=util)fit <- stan(file="stan_programs/uniform_simplex.stan",
algorithm="Fixed_param", data=data, seed=8438338,
warmup=1000, iter=2024, refresh=0)
samples <- util$extract_expectand_vals(fit)As we’re not running Markov chain Monte Carlo here there are no computational diagnostics to consider and we can instead jump straight into the retrodictive checks.
Let’s first consider the retrodictive behavior within each histogram bin, comparing each observed count to the corresponding predictive count.
par(mfrow=c(2, 3), mar=c(5, 5, 3, 1))
obs_counts <- table(data$outcome)
for (k in 1:6) {
name <- paste0('pred_counts[', k, ']')
util$plot_expectand_pushforward(samples[[name]], 20,
display_name="Counts",
main=paste0("Face ", k),
baseline=obs_counts[k])
}Overall there don’t appear to be any serious retrodictive disagreements. That said there are a few outcomes for which the observed counts fall into the tails of the corresponding predictive distribution.
In order to investigate for any patterns in these mild retrodictive tensions it will help to visualizing the observed and predictive behaviors for all outcomes at the same time. One particularly convenient way to do this is with nested quantile intervals.
par(mfrow=c(1, 1), mar=c(5, 5, 3, 1))
obs_counts <- table(data$outcome)
pred_names <- sapply(1:6, function(k) paste0('pred_counts[', k, ']'))
util$plot_disc_pushforward_quantiles(samples, pred_names,
baseline_values=obs_counts,
xlab="Face", display_ylim=c(225, 325))Immediately we see that the observed counts aren’t completely contained within the predictive quantile intervals. That, however, is not necessarily surprising. For example the largest intervals shown here span only the 10% to 90% predictive quantiles; 10% predictive probability continues below the lightest band and 10% continues above it. At the same time because the marginal quantiles are not sensitive to correlations in the counts they tend to underestimate the joint variation.
The empirical entropy retrodictive check, which directly quantifies the deviation of the observed counts from exact uniformity, exhibits the same mild retrodictive tension seen in the retrodictive checks for each outcome. This at least qualitatively supports the adequacy of uniform simplex model, and the fairness of the dice.
par(mfrow=c(1, 1))
obs_entropy <- sum(sapply(obs_counts,
function(n) -(n / data$N) * log(n / data$N)))
util$plot_expectand_pushforward(samples[["pred_entropy"]], 20,
display_name="Empirical Entropy",
baseline=obs_entropy)2.4 Interpreting the Possible Retrodictive Tension
The shape of the mild retrodictive tension in the histogram summary statistic is useful for motivating possible model inadequacies, and hence productive model expansions. Here the tension suggests that the uniform simplex model might be underestimating the occurrence of faces 3 and 4 while overestimating the occurrence of faces 1 and 6.
One advantage of knowing the provenance of the data is that we can compare this behavior to our domain expertise. In particular we can consider whether or not this deviation from uniformity is physically reasonable.
The faces on each Skew Dice™ are arranged so that opposite faces always sum to 7: 1 is opposite 6, 2 is opposite 5, and 3 is opposite 4. A symmetric deficit of 1 and 6 rolls could be explained by less material on these opposite faces relative to faces 2, 3, 4, and 5. Similarly a symmetric excess of 3 and 4 rolls could be explained by an excess of material on those opposite faces.
That said not every source of non-uniformity is consistent with these patterns. For example biases caused by the asymmetric filling of a die mold from one side to the other would manifest in excesses and deficits that are similar between not opposite faces but rather neighboring faces. Without knowing more details about the manufacturing process, however, we can really only speculate.
3 General Simplex Model
One way to clarify the situation is to consider a more elaborate model and see if the added flexibility resolves any of the mild retrodictive tension exhibited by the simpler model. If we can’t find any beneficial expansions then we can always return to the simpler model.
In this section we’ll consider a model that takes advantage of the entire simplex \Delta^{5}. We will continue to assume that the dice behave identically so that all of the data can be modeled with a single multinomial distribution.
3.1 Inferences
Conveniently modeling general simplex configurations is straightforward in Stan due to the simplex variable type which automatically enforces the simplex constraints. With only limited information about the manufacturing of the dice we’ll use a uniform prior density function over the entire simplex.
homogeneous\_simplex.stan
data {
int<lower=1> N; // Number of rolls
array[N] int<lower=1, upper=6> outcome; // Roll outcomes
}
transformed data {
// Compute outcome totals
array[6] int counts = rep_array(0, 6);
for (n in 1:N) {
counts[outcome[n]] += 1;
}
}
parameters {
// General simplex configuration
simplex[6] q;
}
model {
q ~ dirichlet(rep_vector(1, 6));
counts ~ multinomial(q);
}
generated quantities {
// Posterior predictions
array[6] int pred_counts = multinomial_rng(q, N);
real pred_entropy = 0;
for (k in 1:6) {
real q_hat = pred_counts[k] * 1.0 / N;
pred_entropy += - lmultiply(q_hat, q_hat);
}
}fit <- stan(file="stan_programs/homogeneous_simplex.stan",
data=data, seed=8438338,
warmup=1000, iter=2024, refresh=0)The diagnostics don’t suggest that Stan’s Hamiltonian Monte Carlo sampler is encountering any problems in its efforts to quantifying the posterior distribution.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics) All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples, 'q', check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
With an accurate posterior quantification we can faithfully compare the posterior predictive distribution to the observed data. The general simplex model appears to have more than enough flexibility to reproduce all of the behaviors in the observed data that manifest in the total outcome frequencies and empirical entropy.
par(mfrow=c(1, 1))
pred_names <- sapply(1:6, function(k) paste0('pred_counts[', k, ']'))
util$plot_disc_pushforward_quantiles(samples, pred_names,
baseline_values=obs_counts,
xlab="Outcome", ylab="Counts")par(mfrow=c(1, 1))
obs_entropy <- sum(sapply(obs_counts,
function(n) -(n / data$N) * log(n / data$N)))
util$plot_expectand_pushforward(samples[["pred_entropy"]], 20,
display_name="Empirical Entropy",
baseline=obs_entropy)Because there is no visible posterior retrodictive tension the posterior inferences from this model should capture meaningful insights. In this case the posterior distribution follows the pattern seen in the observed counts, with an excess of probability for faces 3 and 4 and a deficit for faces 1 and 6. That said the posterior distribution is not concentrating all that far from the uniform simplex configuration.
names <- sapply(1:6, function(k) paste0('q[', k, ']'))
util$plot_disc_pushforward_quantiles(samples, names,
xlab="Outcome", ylab="Probability")
abline(h=1/6, col="#DDDDDD", lwd=3, lty=2)3.2 Decisions
Inferences from the general simplex model provide multiple ways to characterize the fairness of the dice within the scope of the modeling assumptions. In particular we can use our inferences to inform explicit decisions about the fairness of the dice for each characterization. In this section we’ll see how we can use Bayesian decision theory to construct systematic decisions.
3.2.1 A Primer on Bayesian Decision Theory
Formal decision making is defined by a space of actions, a \in A, and a utility function that numerically quantifies the overall benefits of taking each action, \begin{alignat*}{6} U :\; & A & &\rightarrow& \; & \mathbb{R} & \\ & a & &\mapsto& & U(a) &. \end{alignat*} Once we have defined a space of actions and a utility function then the optimal decision is simply to take the action with the highest utility, a_{\mathrm{optimal}} = \underset{a \in A}{\mathrm{argmax}} \, U(a). Note that to simplify this discussion I am ignoring the possibility of ties, but they are not difficult to accommodate by first identifying the set of equally-optimal actions and then introducing a procedure to arbitrarily choose from those actions.
For many problems the utility of each action will depend on unknown behaviors. In these cases the best we can do is construct a model configuration space of possible behaviors \Theta and then a model-based utility function, \begin{alignat*}{6} U :\; & A \times \Theta & &\rightarrow& \; & \mathbb{R} & \\ & a, \theta & &\mapsto& & U(a, \theta) &. \end{alignat*} Because the ordering of the actions by their assigned utilities will in general vary with each model configuration \theta, \underset{a \in A}{\mathrm{argmax}} \, U(a, \theta) \ne \underset{a \in A}{\mathrm{argmax}} \, U(a, \theta'), there is no longer an unambiguous best action.
Bayesian decision theory uses posterior inferences to inform decision making under exactly this sort of uncertainty. A posterior distribution p(\theta \mid \tilde{y}) informing the unknown behaviors defines posterior expected utilities for each action, \overline{U}(a \mid \tilde{y}) = \int \mathrm{d} \theta \, p(\theta \mid \tilde{y}) , U(a, \theta). Given the observation \tilde{y} the posterior expected utilities depend on only the actions. Consequently we can use them to unambiguously rank the actions and identify a single best action, a_{\mathrm{optimal}} = \underset{a \in A}{\mathrm{argmax}} \, \overline{U}(a \mid \tilde{y}). Again for simplicity I am ignoring the possibility of ties here.
3.2.2 Fairness Actions
The space of actions that we consider in a decision making problem is easy to take for granted. This is unfortunate because the worst decisions often arise not from an inappropriate utility function but rather from the restriction to an unnecessarily small space of actions, all of which might be burdened with poor consequences.
When questioning the fairness of these dice there appear to be two immediate actions that we can take:
- a_{1}: declare that the dice are fair,
- a_{2}: declare that the dice are unfair.
In some applications a more refined set of actions might be more useful. For example we could incorporate additional actions that avoid making any declaration at all or even trigger followup data collection.
Even if these two actions are sufficient we still cannot move forwards without defining exactly what makes a die “fair” or not. In the context of our simplex models we can formalize fairness as uniformity of the simplex configuration. For example we might say that any simplex configuration that is not exactly uniform is unfair. With this definition of fairness our actions become
- a_{1}: declare that the true simplex configuration is uniform,
- a_{2}: declare that the true simplex configuration is not uniform.
This comparison between a single model configuration and all other model configurations is common in frequentist null hypothesis significance testing.
If we think of the uniform simplex configuration as an atomic subset then can we reinterpret these actions as
- a_{1}: declare that the true simplex configuration is in \{ \upsilon \},
- a_{2}: declare that the true simplex configuration is in \{ \upsilon \}^{c}.
This suggests an immediate generalization to this formalization of fairness where we define a fair die by not \{ \upsilon \} but rather a subset of the simplex that contains the uniform simplex configuration, \upsilon \in \mathsf{u},
- a_{1}: declare that the true simplex configuration is in \mathsf{u},
- a_{2}: declare that the true simplex configuration is in \mathsf{u}^{c}.
In other words this generalization requires that a fair die be only sufficiently close to uniform. In hindsight this perspective is typically more relevant to actual practice; a die that deviates from exact uniformity by only an infinitesimal amount may not technically be fair, but we would never be able to distinguish it from a truly fair die within our finite lifetimes. Consequently the more relevant decision isn’t actually whether these dice are perfectly uniform but rather whether or not these dice are close enough to being uniform that we wouldn’t be able to tell the difference in a particular application.
Exactly how close to uniform is close enough to be fair, and how small the subset \mathsf{u} \subset \Delta^{5} needs to be, will depend on the details of a given application. Any choice of simplex subset, however, motivates the same class of utility functions.
3.2.3 Choosing Between Model Configuration Subsets
The utility of claiming that the true simplex configuration falls within or without a subset of nearly uniform model configurations depends upon what the true simplex configuration is. If the true simplex configuration fall into \mathsf{u} then taking action a_{1} should result in some positive benefit u_{11}. On the other hand if the true simplex configuration falls into \mathsf{u}^{c} then taking a_{1} should result in lower, typically negative, utility u_{12}. Mathematically we can encode these assumptions into the utility assignment U(a_{1}, q) = \left\{ \begin{array}{rr} u_{11}, & q \in \mathsf{u} \\ u_{12}, & q \notin \mathsf{u} \end{array} \right. Similarly the utility assigned to the second action should take the form U(a_{2}, q) = \left\{ \begin{array}{rr} u_{21}, & q \in \mathsf{u} \\ u_{22}, & q \notin \mathsf{u} \end{array} \right. . Note that by allowing for asymmetric utilities we allow for the possibility that some correct decisions are more beneficial than others, while some incorrect statements are more costly than others.
Conveniently we can simplify both of these utility assignments into indicator functions, \begin{align*} U(a_{1}, q) &= u_{11} \, I_{\mathsf{u}}(q) + u_{12} \, \left( 1 - I_{\mathsf{u}}(q) \right) \\ &= (u_{11} - u_{12}) \, I_{\mathsf{u}}(q) + u_{12} \\ U(a_{2}, q) &= u_{21} \, I_{\mathsf{u}}(q) + u_{22} \, \left( 1 - I_{\mathsf{u}}(q) \right) \\ &= (u_{21} - u_{22}) \, I_{\mathsf{u}}(q) + u_{21}. \end{align*}
So long the subset of sufficiently uniform simplex configurations \mathsf{u} is measurable the posterior expected utilities for each action become \begin{align*} \overline{U}(a_{1} \mid \tilde{y}) &= \int \mathrm{d} q \, p(q \mid \tilde{y}) \, U(a_{1}, q) \\ &= \int \mathrm{d} q \, p(q \mid \tilde{y}) \, \left( (u_{11} - u_{12}) \, I_{\mathsf{u}}(q) + u_{12} \right) \\ &= (u_{11} - u_{12}) \, \pi( \mathsf{u} \mid \tilde{y}) + u_{12} \end{align*} and \begin{align*} \overline{U}(a_{2} \mid \tilde{y}) &= \int \mathrm{d} q \, p(q \mid \tilde{y}) \, U(a_{2}, q) \\ &= \int \mathrm{d} q \, p(q \mid \tilde{y}) \, \left( (u_{21} - u_{22}) \, I_{\mathsf{u}}(q) + u_{21} \right) \\ &= (u_{21} - u_{22}) \, \pi( \mathsf{u} \mid \tilde{y}) + u_{21}. \end{align*}
The optimal Bayesian decision is to choose a_{1} over a_{2} if and only if \begin{align*} \overline{U}(a_{1} \mid \tilde{y}) &> \overline{U}(a_{2} \mid \tilde{y}) \\ (u_{11} - u_{12}) \, \pi( \mathsf{u} \mid \tilde{y}) + u_{12} &> (u_{21} - u_{22}) \, \pi( \mathsf{u} \mid \tilde{y}) + u_{21}, \end{align*} or, equivalently, \pi( \mathsf{u} \mid \tilde{y}) > \frac{ u_{21} - u_{12} }{ u_{11} - u_{12} - u_{21} + u_{22} } \equiv t. In words we declare that the dice are practically fair only when the posterior probability allocated to \mathsf{u} is sufficiently large.
Note that the final decision doesn’t depend on any of the individual benefits and costs but rather only the particular combination t = \frac{ u_{21} - u_{12} }{ u_{11} - u_{12} - u_{21} + u_{22} }. For example translating or scaling all of the benefits and costs by the same amount would result in the same t, and hence the same decisions.
To implement this Bayesian decision making process in practice we need to define the subset \mathsf{u} and the probability threshold t.
The possibilities for t are, by construction, endless. In general the larger t is the more certain we have to be in the effective uniformity of the die behavior to declare fair dice.
Similarly we have infinite options for \mathsf{u}. Here we’ll consider three strategies for defining \mathsf{u}: using exact uniformity, using direct features, and using pushforward features.
3.2.4 Exact Uniformity
Our initial, strict definition of fairness corresponds to taking \mathsf{u} = \{ \upsilon \}. Unfortunately for our modeling assumptions, in particular our Dirichlet prior model, the posterior probability allocated to any atomic subset will always be zero, \pi( \mathsf{u} \mid \tilde{y} ) = \pi( \{ \upsilon \} \mid \tilde{y} ) = 0!
Consequently we will never choose a_{1} so long as t > 0. On the other hand if we take t = 0 then we will always choose a_{1} regardless of the realized data and resulting posterior distribution.
Intuitively the problem here is that the lone simplex configuration in \mathsf{u} = \{ \upsilon \} is overwhelmed by the infinitely many simplex configurations just outside of \mathsf{u}. With only a finitely many rolls our posterior inferences will never really be able to distinguish between the two.
There are only two circumstances in which we can have t > 0 but still might choose a_{1}. Firstly in the asymptotic limit of infinite rolls the posterior distribution will converge to a Dirac distribution with \lim_{N \rightarrow \infty} \pi( \{ \upsilon \} \mid \tilde{y}_{1}, \ldots, \tilde{y}_{N}) = 1 if the dice are exactly uniform and \lim_{N \rightarrow \infty} \pi( \{ \upsilon \} \mid \tilde{y}_{1}, \ldots, \tilde{y}_{N}) = 0 if the dice are not exactly uniform. In this ideal scenario we would choose a_{1} when the dice are exactly uniform and a_{2} when they are not for any value of 0 < t < 1, as desired.
Secondly we can inflate a Dirichlet prior model \rho with a Dirac probability distribution \delta to give the prior probability allocations \pi( \mathsf{q} ) = \alpha \, \delta_{ \upsilon }( \mathsf{q} ) + (1 - \alpha) \, \rho( \mathsf{q} ). In this case the posterior probability allocation to \mathsf{u} = \{ \upsilon \} becomes \begin{align*} \pi( \{ \upsilon \} \mid \tilde{y}) &= \int \pi( \mathrm{d} q \mid \tilde{y} ) \, I_{ \{ \upsilon \} }(q) \\ &= \frac{ \int \pi( \mathrm{d} q) \, p(\tilde{y} \mid q ) \, I_{ \{ \upsilon \} }(q) } { \int \pi( \mathrm{d} q) \, p(\tilde{y} \mid q ) } \\ &=\quad \quad\; \alpha \quad\; \frac{ \int \delta_{ \upsilon }( \mathrm{d} q) \, p(\tilde{y} \mid q ) \, I_{ \{ \upsilon \} }(q) } { \int \pi( \mathrm{d} q) \, p(\tilde{y} \mid q ) } \\ &\quad+ (1 - \alpha) \, \frac{ \int \rho( \mathrm{d} q) \, p(\tilde{y} \mid q ) \, I_{ \{ \upsilon \} }(q) } { \int \pi( \mathrm{d} q) \, p(\tilde{y} \mid q ) } \\ &=\quad \quad\; \alpha \quad\; \frac{ p(\tilde{y} \mid \upsilon ) } { \int \pi( \mathrm{d} q) \, p(\tilde{y} \mid q ) } \\ &\quad+ (1 - \alpha) \, \frac{ 0 } { \int \pi( \mathrm{d} q) \, p(\tilde{y} \mid q ) } \\ &= \frac{ \alpha \, p(\tilde{y} \mid \upsilon ) } { \int \pi( \mathrm{d} q) \, p(\tilde{y} \mid q ) }. \end{align*} Expanding the denominator then gives \begin{align*} \pi( \{ \upsilon \} \mid \tilde{y}) &= \frac{ \alpha \, p(\tilde{y} \mid \upsilon ) } { \int \pi( \mathrm{d} q) \, p(\tilde{y} \mid q ) } \\ &= \frac{ \alpha \, p(\tilde{y} \mid \upsilon ) } { \alpha \, \int \delta_{ \{ \upsilon \} }( \mathrm{d} q) \, p(\tilde{y} \mid q ) + (1 - \alpha) \, \int \rho( \mathrm{d} q) \, p(\tilde{y} \mid q) } \\ &= \frac{ \alpha \, p(\tilde{y} \mid \upsilon ) } { \alpha \, p(\tilde{y} \mid \upsilon ) + (1 - \alpha) \, \int \rho( \mathrm{d} q) \, p(\tilde{y} \mid q) } \\ &= \frac{ 1 } { 1 + \frac{ 1 - \alpha }{ \alpha } \, \frac{ \int \rho( \mathrm{d} q) \, p(\tilde{y} \mid q) } { p(\tilde{y} \mid \upsilon ) } }. \end{align*}
In the asymptotic limit where the posterior distribution concentrates on \upsilon we have \lim_{N \rightarrow \infty} \frac{ \int \rho( \mathrm{d} q) \, p(\tilde{y}_{1}, \ldots, \tilde{y}_{N} \mid q) } { p(\tilde{y}_{1}, \ldots, \tilde{y}_{N} \mid \upsilon ) } = 0 so that \lim_{N \rightarrow \infty} \pi( \{ \upsilon \} \mid \tilde{y}_{1}, \ldots, \tilde{y}_{N}) = 1, regardless of \alpha. Similarly if \lim_{N \rightarrow \infty} \frac{ \int \rho( \mathrm{d} q) \, p(\tilde{y}_{1}, \ldots, \tilde{y}_{N} \mid q) } { p(\tilde{y}_{1}, \ldots, \tilde{y}_{N} \mid \upsilon ) } = \infty then \lim_{N \rightarrow \infty} \pi( \{ \upsilon \} \mid \tilde{y}_{1}, \ldots, \tilde{y}_{N}) = 0, regardless of \alpha. In other words the asymptotic behavior of the inflated model is consistent with that of the un-inflated model.
For any finite observation, however, the posterior probability allocated to \mathsf{u} = \{ \upsilon \}, and hence our conviction in claiming exact fairness, is extremely sensitive to the precise strength of inflation \alpha. Without any principled way to determine \alpha are decision making will be arbitrary!
In general selecting between non-atomic subsets is much better behaved than attempting to resolve any particular atomic subset from the infinitely many alternatives that behave nearly identically.
3.2.5 Effective Uniformity From Direct Features
One way to construct a subset of simplex configurations that contains the uniform simplex configuration is to impose a constraint on the component probabilities. For example we could require that all of the component probabilities are within some multiplicative tolerance of 1/6, \mathsf{u} = \left\{ (q_{1}, \ldots, q_{6}) \in \Delta^{5} \mid \frac{1}{1 + t} \, \frac{1}{6} < q_{k} < (1 + t) \, \frac{1}{6} \right\}. The smaller the tolerance t is the smaller the subset \mathsf{u} will be.
Another useful strategy is motivated by the construction of open subsets in metric spaces. Recall that in Chapter Two we saw how any distance function \begin{alignat*}{6} d :\; & X \times X & &\rightarrow& \; & \mathbb{R}^{+} & \\ & x_{1}, x_{2} & &\mapsto& & d(x_{1}, x_{2}) & \end{alignat*} implicitly defines open balls, \mathsf{b}_{x, R} = \{ x' \in X \mid d(x, x') < R \}. By construction these subsets define neighborhoods of points around the center x \in X whose size is determined by the radius R \in \mathbb{R}^{+}. Moreover if the distance function is measurable then these open balls will also be measurable.
Consequently any pair of a measurable distance function on the simplex \Delta^{5} and radius R \in \mathbb{R}^{+} will defines a measurable subset of nearly uniform simplex configurations, \mathsf{u} = \{ q \in \Delta^{5} \mid d(\upsilon, q) < R \}. The smaller the radius is the stricter the notion of fairness these subsets will encode.
This construction can also be generalized beyond distance functions. Any measurable binary function \begin{alignat*}{6} f :\; & X \times X & &\rightarrow& \; & \mathbb{R}^{+} & \\ & x_{1}, x_{2} & &\mapsto& & d(x_{1}, x_{2}) & \end{alignat*} that vanishes when the two inputs are equal, f(x, x) = 0, can be used to construct two different ball-like, measurable subsets around a central point x \in X, \mathsf{b}_{x, R} = \{ x' \in X \mid f(x, x') < R \} and \mathsf{b}^{*}_{x, R} = \{ x' \in X \mid f(x', x) < R \}. I will refer to the any subset constructed in this way as a radial subset.
Distance functions on a simplex immediately satisfy this criterion, with the symmetry of their arguments implying that the two subsets we can define around the uniform simplex configuration are actually redundant, \mathsf{b}_{\upsilon, R} = \mathsf{b}^{*}_{\upsilon, R}. Because every point in a simplex defines a probability distribution we an also appeal to statistical divergences, such as the Kullback-Leibler divergence. Statistical divergences are not symmetric in their inputs so they define two distinct subsets around the uniform simplex configuration. \mathsf{b}_{\upsilon, R} \ne \mathsf{b}^{*}_{\upsilon, R}. Ultimately there are a wealth of distance and divergences functions on the simplex that we could use to define “sufficiently uniform”. I review some of the possibilities and their properties in the Appendix A.
In a practical analysis we would need to use our domain expertise to choose a subset of sufficiently uniform simplex configurations whose shape and size are appropriate for a given application. That is not to say, however, that identifying the most appropriate shape and size is at all straightforward.
For the analysis here let’s consider one subset defined by component-wise constraints and another defined by the geodesic distance function introduced in the Appendix A.
K <- 6
u <- 1 / 6
upsilon <- rep(u, K)
# Component-wise subset indicator function
cwise_indicator <- function(q, t) {
component_inclusion <- sapply(1:K, function(k) u / (1 + t) < q[k] &
q[k] < u * (1 + t) )
Reduce("&", component_inclusion)
}
# Radial subset indicator function
geodesic_distance <- function(q) {
2 * acos( sum(sqrt( q * upsilon )) )
}
radial_indicator <- function(q, R) {
geodesic_distance(q) < R
}To make these subsets somewhat comparable we can tune their size so that they are both allocated the same prior probability. Here let’s require that each subset is allocated a prior probability of 0.001, enforcing a relatively strict notion of fairness.
# Prior configuration
alpha <- rep(1, 6)
# Subset configurations
t <- 0.412
R <- 0.21
# Monte Carlo estimates of prior probability allocations
S <- 100000
I_c_samples <- rep(NA, S)
I_r_samples <- rep(NA, S)
for (s in 1:S) {
m <- rgamma(K, alpha, 1)
q <- m / sum(m)
I_c_samples[s] <- cwise_indicator(q, t)
I_r_samples[s] <- radial_indicator(q, R)
}pi_c <- mean(I_c_samples)
delta_c <- 2 * sqrt(var(I_c_samples) / S)
cat(paste0("Prior probability allocated to component-wise subset = ",
sprintf("%.5f +/- %.5f", pi_c, delta_c)))Prior probability allocated to component-wise subset = 0.00113 +/- 0.00021
pi_r <- mean(I_r_samples)
delta_r <- 2 * sqrt(var(I_r_samples) / S)
cat(paste0("Prior probability allocated to radial subset = ",
sprintf("%.5f +/- %.5f", pi_r, delta_r)))Prior probability allocated to radial subset = 0.00118 +/- 0.00022
The posterior probability allocated to these subsets can then be estimated with Markov chain Monte Carlo. In practice we can evaluate these indicator functions directly in R, as we have done here, or in the generated quantities block of a new Stan program. Here we’ll stay within R.
var_repl <- list('q'=util$name_array('q', c(6)))
pi_est <- util$implicit_subset_prob(samples,
function(q) cwise_indicator(q, t),
var_repl)
cat(paste0("Posterior probability allocated to component-wise subset = ",
sprintf("%.5f +/- %.5f", pi_est[1], 2 * pi_est[2])))Posterior probability allocated to component-wise subset = 0.99779 +/- 0.00146
pi_est <- util$implicit_subset_prob(samples,
function(q) radial_indicator(q, R),
var_repl)
cat(paste0("Posterior probability allocated to radial subset = ",
sprintf("%.5f +/- %.5f", pi_est[1], 2 * pi_est[2])))Posterior probability allocated to radial subset = 1.00000 +/- 0.00000
We cannot make a final Bayesian decision, however, without specifying a probability threshold t. In a real application we might use Bayesian calibration to tune t to achieve desired false positive and true positive rates. That said the posterior probabilities here are so large that we would declare that the die are fair for all but the largest thresholds!
3.2.6 Effective Uniformity From Pushforward Features
One of the reasons why choosing an appropriate subset of effectively uniform simplex configurations, not to mention a corresponding probability threshold, can be so awkward is that most of us don’t have experience reasoning about the high-dimensional 5-simplex. Typically lower-dimensional consequences are easier compare to our domain expertise. The practical challenge, however, is identifying which of these consequences are relevant in a given application.
Consider, for example, one of the many games of chance that rely not on all six outcomes from a single six-sided die roll but rather particular outcomes from multiple six-sided die rolls. For these games the more relevant notion of fairness is not now close the 5-simplex model is to the uniform simplex configuration but rather how close these particular consequences are to the those expected from a uniform die.
Apocalypse World (Baker and Baker 2016) is a table top role playing game that uses the sum of two six-sided die rolls to determine whether player actions succeed or fail. Since its introduction this straightforward mechanic has also been adopted by a variety of other role playing table top games.
The sum of two six-sided die rolls can take on eleven possible values (Table 1). Consequently the summation outcomes can be modeled with a 10-simplex space. Moreover we can use the rules for pushforward probability mass functions that we learned in Chapter 7 to map any point in the initial 5-simplex into a corresponding point in this 10-simplex.
| s | f^{-1}(s) | q_{s} = f_{*} p(s) |
|---|---|---|
| 2 | (1, 1) | q_{1}^{2} |
| 3 | (1, 2), (2, 1) | 2 \, q_{1} q_{2} |
| 4 | (1, 3), (2, 2), (3, 1) | 2 \, q_{1} \, q_{3} + q_{2}^{2} |
| 5 | (1, 4), (2, 3), (3, 2), (4, 1) | 2 \, q_{1} \, q_{4} + 2 \, q_{2} \, q_{3} |
| 6 | (1, 5), (2, 4), (3, 3), (4, 2), (5, 1) | 2 \, q_{1} \, q_{5} + 2 \, q_{2} \, q_{4} + q_{3}^{2} |
| 7 | (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1) | 2 \, q_{1} \, q_{6} + 2 \, q_{2} \, q_{5} + 2 \, q_{3}, q_{4} |
| 8 | (2, 6), (3, 5), (4, 4), (5, 3), (6, 2) | 2 \, q_{2} \, q_{6} + 2 \, q_{3} \, q_{5} + q_{4}^{2} |
| 9 | (3, 6), (4, 5), (5, 4), (6, 3) | 2 \, q_{3} \, q_{6} + 2 \, q_{4} \, q_{5} |
| 10 | (4, 6), (5, 5), (6, 4) | 2 \, q_{4} \, q_{6} + q_{5}^{2} |
| 11 | (5, 6), (6, 5) | 2 \, q_{5} q_{6} |
| 12 | (6, 6) | q_{6}^{2} |
We can use this mathematical result to immediately derive the consequences from not only an idealized, exactly uniform die but also the real Skew Dice™.
pushforward_simplex_comps <-
list(function(q)
q[1] * q[1],
function(q)
2 * q[1] * q[2],
function(q)
2 * q[1] * q[3] + q[2] * q[2],
function(q)
2 * q[1] * q[4] + 2 * q[2] * q[3],
function(q)
2 * q[1] * q[5] + 2 * q[2] * q[4] + q[3] * q[3],
function(q)
2 * q[1] * q[6] + 2 * q[2] * q[5] + 2 * q[3] * q[4],
function(q)
2 * q[2] * q[6] + 2 * q[3] * q[5] + q[4] * q[4],
function(q)
2 * q[3] * q[6] + 2 * q[4] * q[5],
function(q)
2 * q[4] * q[6] + q[5] * q[5],
function(q)
2 * q[5] * q[6],
function(q)
q[6] * q[6])
names(pushforward_simplex_comps) <- sapply(2:12,
function(k) paste0('q[', k, ']'))
pushforward_simplex <- function(q) {
sapply(1:11,
function(k) do.call(pushforward_simplex_comps[[k]], list(q)))
}unif_2d6_simplex <- pushforward_simplex(upsilon)inferred_2d6_simplex <-
util$eval_expectand_pushforwards(samples,
pushforward_simplex_comps,
var_repl)Overall the inferred behavior of rolling a Skew Dice™ twice is reasonably consistent with what we would expect from rolling an exactly uniform die twice.
par(mfrow=c(3, 4), mar=c(5, 5, 3, 1))
obs_counts <- table(data$outcome)
for (k in 1:11) {
name <- paste0('q[', k + 1, ']')
util$plot_expectand_pushforward(inferred_2d6_simplex[[name]], 25,
display_name="Probability",
baseline=unif_2d6_simplex[k],
baseline_col=util$c_mid_teal)
title(paste0("2d6 Outcome = ", k + 1))
}