As measurements become more complex they often convolve meaningful phenomena with more and more extraneous phenomena. In order to limit the impact of these irrelevant phenomena on our inferences, and to isolate the relevant phenomena, we need models that encourage sparse inferences. In this case study I review the basics of Bayesian inferential sparsity and several strategies for building prior models that incorporate sparsity assumptions.

1 Fading into Irrelevance

Fundamentally the concept of sparsity requires some notion of a population. Inferential sparsity in particular requires a population of individual contexts from which data are generated. In this case sparsity implies that most of the individual contexts follow some default behavior with only a few exhibiting any deviant behavior.

If the real-valued parameter \(\theta_{k}\) quantifies the variation of the data generating process in the \(k\)th context then sparsity implies that most of the \(\theta_{k}\) should be fixed to some default value with only a few taking on other values. Without loss of generality we can take the default value to be zero so that sparsity implies that most of the individual parameters will have a zero value with just a few having non-zero values.

Although conceptually appealing, this notion of sparsity as a preference for exactly zero values runs into problems in practice. In particular the information contained in any finite measurement is not enough to resolve the difference between individual parameter values that are exactly zero and those that are small enough to not strongly influence the data generating process.

For example consider a normal observational model \[ \pi(y_{n} \mid \theta_{k(n)}, \sigma) = \text{normal}(y_{n} \mid \theta_{k(n)}, \sigma), \] where \(k(n)\) maps each observation into the context from which it was generated. Each context features \[ N_{k} = \sum_{n = 1}^{N} \mathbb{I}[k(n) = k] \] observations with the empirical averages \[ \tilde{\mu}_{k} = \frac{1}{N_{k}} \sum_{n = 1}^{N} \mathbb{I}[k(n) = k] \cdot \tilde{y}_{n}. \] The individual likelihood function informing each context is then given by \[ l(\theta_{k} ; \tilde{y}) \propto \text{normal} \left( \tilde{\mu}_{k} ; \theta_{k}, \frac{\sigma}{\sqrt{N_{k}}} \right). \] These individual likelihood functions are relatively uniform within the interval \[ | \theta_{k} - \tilde{\mu}_{k} | \lessapprox \frac{\sigma}{\sqrt{N_{k}}}; \] even if this interval contains \(\theta_{k} = 0\) we can't just discard all of the other neighboring values that are also consistent with the observed data.
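
To make these summaries concrete, here is a minimal R sketch of my own, using hypothetical data and variable names that do not appear elsewhere in this case study, which computes the per-context counts \(N_{k}\), the empirical averages \(\tilde{\mu}_{k}\), and the corresponding likelihood widths \(\sigma / \sqrt{N_{k}}\).

# A standalone illustration with hypothetical values; these objects are not
# used in the analyses later in this case study
sigma_demo <- 0.5                                  # Hypothetical measurement variability
idx_demo <- c(1, 1, 2, 2, 2, 3)                    # Hypothetical context assignments k(n)
y_demo <- rnorm(length(idx_demo), 0, sigma_demo)   # Hypothetical observations

N_k <- as.vector(table(idx_demo))                  # Per-context counts N_k
mu_tilde <- tapply(y_demo, idx_demo, mean)         # Empirical averages mu_tilde_k
widths <- sigma_demo / sqrt(N_k)                   # Likelihood widths sigma / sqrt(N_k)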

Consequently even if \(\theta_{k} = 0\) is consistent with the observed data there will always be an infinite number of small but not-quite-zero values \(\theta_{k} \ne 0\) that we will not be able to strongly distinguish from that zero value. If the prior model is specified by a continuous probability density function then the resulting posterior distribution will allocate vanishing probability to \(\theta_{k} = 0\) no matter the shape of the realized likelihood function. The only way for the hypothesis \(\theta_{k} = 0\) to compete against these small values is to engineer a prior model that inflates zero by allocating the point a non-zero probability. Unfortunately such inflated prior models introduce a wealth of computational problems.

These problems can be avoided, however, by abandoning the strict binary classification between zero and non-zero. Because of the finite resolution afforded by any practical measurement the more pertinent question is not whether the individual parameters are exactly zero but rather whether they are too small to be relevant to our inferences.

For one-dimensional parameters relevance can be defined in terms of a relevance threshold below which the parameters influence realized likelihood functions only negligibly. This threshold can sometimes be motivated by the limitations of the observational model; for example in the normal observational model considered above the relevance threshold could be taken as \(\sigma / \sqrt{N_{k}}\). Alternatively a relevance threshold can be informed by prior knowledge of the underlying data generating process. In either case enforcing sparsity from this perspective becomes a matter of containing most of the individual parameters below this relevance threshold.

If the relevance threshold is not known exactly then sparsity takes on yet another form. In this case the critical manifestation of a sparse inference is that most of the parameters cluster around zero, and only a few are able to extend to larger values. Because the scale of that clustering may or may not identify a meaningful relevance threshold I will generally refer to the individual parameters that cluster around zero as the inner core and the exceptions as the outer expanse.




Any domain expertise about the relevance threshold can always be incorporated into the prior model for the shape of the inner core.

Regardless of its precise interpretation sparsity is a characteristic of the entire population of parameters, not any single parameter. In particular the assumption of sparsity does not inform which parameters are irrelevant and which are relevant. Even if we know exactly how many parameters we want to fall into each group we still have to learn a specific sparsity pattern that assigns the individual parameters to those groups.

Because sparsity assumptions are indifferent to these possible sparsity patterns, the domain expertise sparsity assumptions encode is invariant to permutations of the individual contexts. In other words a prior model that encodes sparsity must be exchangeable which, as I discuss in my hierarchical case study, implies that sparsity-inducing prior models are naturally hierarchical.

In order to ensure sparse inferences we need to engineer a hierarchical population model that not only accommodates both the inner core and the outer expanse at the same time but also encourages most parameters towards the former.

2 Sparse Population Models

Exactly what kind of hierarchical population model can coerce wide likelihood functions towards an inner core while allowing sufficiently narrow likelihood functions to roam the outer expanse as needed? In this section we'll investigate a variety of options and their limitations.

First let's set up our R environment for computational experiments to guide this investigation.

library(rstan)
Loading required package: StanHeaders
Loading required package: ggplot2
rstan (Version 2.19.3, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")

util <- new.env()
source('stan_utility.R', local=util)

c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")

c_light_trans <- c("#DCBCBC80")
c_dark_trans <- c("#8F272780")
c_green_trans <- c("#00FF0080")

par(family="CMU Serif", las=1, bty="l", cex.axis=1, cex.lab=1, cex.main=1,
    xaxs="i", yaxs="i", mar = c(5, 5, 3, 1))

set.seed(58533858)

To compare the various population models under consideration we'll see how they each perform on a relatively simple simulation. Of the nine total parameters six cluster below an inner scale of 0.1 and three venture out past 10.

K <- 9

theta_true <- c(-0.09604655,  0.02063505,  0.01246716,
                -0.04487767, -0.13519031,  0.09421304,
                -29.449233,  32.997172,  18.517443)

These parameters then inform the location of a normal observational model.

N <- K
sigma <- 0.5
context_idx <- rep(1:K, 1)

y <- rnorm(K, theta_true[context_idx], sigma)

data <- list("N" = N, "K" = K, "context_idx" = context_idx,
             "y" = y, "sigma" = sigma)

Note that with only one observation per context the width of the individual likelihood functions is determined by the measurement variability \(\sigma = 0.5\), and we will not be able to resolve the inner scale of \(0.1\) with the information gleaned from the data alone.

2.1 The Limitations of Normal Population Models

Let's begin with the normal population model that is so common to hierarchical modeling. Because we want to regularize our inferences towards \(\theta_{k} = 0\) we set the population location to zero, which leaves just a population scale \(\tau\) to configure, \[ \pi(\theta_{k} \mid \tau) = \text{normal}(\theta_{k} \mid 0, \tau). \]




With only a single scale can this model accommodate both the inner core and outer expanse populations at the same time? To build our understanding let's compare what happens when \(\tau\) is fixed to some finite value to what happens when \(\tau\) is sent off to infinity, resulting in an improper uniform prior density function.

No matter the location of the individual likelihood function the marginal posterior distribution is uniformly regularized to the normal population model. In particular both the posterior mean and standard deviation are shrunk by the same proportion regardless of whether the likelihood function concentrates below or above \(\tau\).
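
We can see this explicitly with the conjugate normal-normal updates for a single context. The helper function below is my own sketch, not part of the Stan programs used in this case study; it takes the likelihood center \(\tilde{\mu}_{k}\), the likelihood width \(s = \sigma / \sqrt{N_{k}}\), and the population scale \(\tau\), and returns the exact posterior mean and standard deviation.

# Conjugate normal-normal update for one context with prior normal(0, tau)
normal_shrinkage <- function(mu_tilde, s, tau) {
  shrink <- tau^2 / (tau^2 + s^2)     # Shrinkage factor; independent of mu_tilde
  c(post_mean = shrink * mu_tilde,    # Posterior mean pulled towards zero
    post_sd = sqrt(shrink) * s)       # Posterior standard deviation
}

normal_shrinkage(mu_tilde=0.05, s=0.5, tau=0.1)  # Likelihood concentrating below tau
normal_shrinkage(mu_tilde=30, s=0.5, tau=0.1)    # Likelihood far above tau; over-regularized
normal_shrinkage(mu_tilde=0.05, s=0.5, tau=10)   # Wide tau; essentially no regularization

In the first two calls the shrinkage factor is identical even though the likelihood functions concentrate at very different values, which is exactly the uniform regularization described above.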




This uniform regularization has some unfortunate consequences when trying to model sparsity. If an individual likelihood function is much wider than \(\tau\) then the posterior distribution is strongly regularized to the population model. When the individual likelihood function is narrow, however, the posterior distribution has to compromise between the prior model and the likelihood function. When the narrow likelihood function concentrates on values below \(\tau\) it overlaps with the prior model and this compromise does not affect the posterior distribution much. On the other hand if the narrow likelihood function concentrates on values much larger than \(\tau\) then the compromise pulls the posterior distribution right in between it and the population model.




In particular if we set \(\tau\) to the inner core scale then the inferences for the three parameters with large true values will be over-regularized. At the same time if we set \(\tau\) to the outer expanse scale then the inferences for the six parameters with small true values will be under-regularized. As expected the normal population model cannot accommodate both the inner core and the outer expanse at the same time.

To confirm let's build and run this model in Stan. We start by setting \(\tau\) to the inner core scale of \(0.1\).

writeLines(readLines("stan_programs/normal_narrow.stan"))
data {
  int<lower=0> K;                       // Number of contexts
  int<lower=0> N;                       // Number of observations
  vector[N] y;                          // Observations
  int<lower=1, upper=K> context_idx[N]; // Context assignments
  real<lower=0> sigma;                  // Measurement variability
}

parameters {
  // Individual parameters
  vector[K] theta;
}

model {
  // Narrow normal prior model
  theta ~ normal(0, 0.1);
  
  // Observational model
  y ~ normal(theta[context_idx], sigma);
}
fit <- stan(file='stan_programs/normal_narrow.stan', data=data,
            seed=4938483, refresh=1000)

util$check_all_diagnostics(fit)
[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "0 of 4000 iterations ended with a divergence (0%)"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-FMI indicated no pathological behavior"

Our posterior fit seems reasonable, but the inferences for the large parameters are pulled far below their true values. On the other hand the inferences for the six small parameters are regularized below the measurement resolution \(\sigma = 0.5\) due to the information provided by the prior model.

samples = extract(fit)

par(mfrow=c(3, 3))

for (k in 1:9) {
  hist(samples$theta[, k], breaks=seq(-35, 35, 0.25),
       main=paste("k = ", k), col=c_dark, border=c_dark_highlight,
       xlab="theta", yaxt='n', ylab="")
  abline(v=theta_true[k], col="white", lwd=2)
  abline(v=theta_true[k], col="black", lwd=1)
}

Now let's try fixing \(\tau = 10\) to accommodate the larger values.

writeLines(readLines("stan_programs/normal_wide.stan"))
data {
  int<lower=0> K;                       // Number of contexts
  int<lower=0> N;                       // Number of observations
  vector[N] y;                          // Observations
  int<lower=1, upper=K> context_idx[N]; // Context assignments
  real<lower=0> sigma;                  // Measurement variability
}

parameters {
  // Individual parameters
  vector[K] theta;
}

model {
  // Wide normal prior model
  theta ~ normal(0, 10);
  
  // Observational model
  y ~ normal(theta[context_idx], sigma);
}
fit <- stan(file='stan_programs/normal_wide.stan', data=data,
            seed=4938483, refresh=1000)

util$check_all_diagnostics(fit)
[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "0 of 4000 iterations ended with a divergence (0%)"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-FMI indicated no pathological behavior"

Again we have no signs of computational problems, and the weaker regularization seems to allow the inferences for the larger parameters to reach the true values.

samples = extract(fit)

par(mfrow=c(3, 3))

for (k in 1:9) {
  hist(samples$theta[, k], breaks=seq(theta_true[k] - 3, theta_true[k] + 3, 0.1),
       main=paste("k = ", k), col=c_dark, border=c_dark_highlight,
       xlab="theta", yaxt='n', ylab="")
  abline(v=theta_true[k], col="white", lwd=2)
  abline(v=theta_true[k], col="black", lwd=1)
}

The inferences for the six smaller parameters, however, leave much to be desired. Because the wide population model offers no regularization to these inferences, each marginal posterior distribution is limited to the resolution of the measurements \(\sigma = 0.5\). In particular the marginal posterior distributions are much wider than the true inner scale, \(0.1\).

par(mfrow=c(1, 1))

k <- 1
hist(samples$theta[, k], breaks=seq(-3, 3, 0.1),
     main=paste("k = ", k), col=c_dark, border=c_dark_highlight,
     xlab="theta", yaxt='n', ylab="")
abline(v=theta_true[k], col="white", lwd=2)
abline(v=theta_true[k], col="black", lwd=1)

abline(v=theta_true[k] - 0.1, col="gray80", lwd=2, lty=3)
abline(v=theta_true[k] + 0.1, col="gray80", lwd=2, lty=3)

Finally let's investigate what happens when we infer the population scale along with the individual parameters. Our prior model for \(\tau\) will be just large enough to contain both the inner core scale and the outer expanse scale.

writeLines(readLines("stan_programs/hier_normal_cp.stan"))
data {
  int<lower=0> K;                       // Number of contexts
  int<lower=0> N;                       // Number of observations
  vector[N] y;                          // Observations
  int<lower=1, upper=K> context_idx[N]; // Context assignments
  real<lower=0> sigma;                  // Measurement variability
}

parameters {
  // Individual parameters and population scale
  vector[K] theta;
  real<lower=0> tau;
}

model {
  // Hierarchical normal prior model
  theta ~ normal(0, tau);
  tau ~ normal(0, 10);
  
  // Observational model
  y ~ normal(theta[context_idx], sigma);
}
fit <- stan(file='stan_programs/hier_normal_cp.stan', data=data,
            seed=4938483, refresh=1000)

util$check_all_diagnostics(fit)
[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "0 of 4000 iterations ended with a divergence (0%)"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-FMI indicated no pathological behavior"

Because the normal population model is too rigid the posterior distribution is forced to compromise between the scales of the inner core and the outer expanse. Ultimately it ends up concentrating somewhere between the two in a feeble attempt to balance the over- and under-regularization.

samples = extract(fit)

par(mfrow=c(1, 1))

hist(abs(rnorm(4000, 0, 10)), breaks=seq(0, 50, 0.5),
     main="", col=c_light, border=c_light_highlight,
     xlab="tau", yaxt='n', ylab="", ylim=c(0, 300))
hist(samples$tau, breaks=seq(0, 50, 0.5),
     main="", col=c_dark, border=c_dark_highlight, add=T)

This balance, however, is still much too large for the parameters with small true values and a bit too small for the parameters with large true values. Because the balance favors the larger scale the over-regularization isn't too bad.

par(mfrow=c(3, 3))

for (k in 1:9) {
  hist(samples$theta[, k], seq(theta_true[k] - 3, theta_true[k] + 3, 0.1),
       main=paste("k = ", k), col=c_dark, border=c_dark_highlight,
       xlab="theta", yaxt='n', ylab="")
  abline(v=theta_true[k], col="white", lwd=2)
  abline(v=theta_true[k], col="black", lwd=1)
}

If we zoom in to one of the plots we can see the under-regularization of the small parameters more clearly. Because both \(\tau\) and \(\sigma\) are greater than the inner core scale \(0.1\) there is no information to concentrate the posterior distribution to the smaller scale.

par(mfrow=c(1, 1))

k <- 1
hist(samples$theta[, k], breaks=seq(-5, 5, 0.15),
     main=paste("k = ", k), col=c_dark, border=c_dark_highlight,
     xlab="theta", yaxt='n', ylab="")

abline(v=theta_true[k], col="white", lwd=2)
abline(v=theta_true[k], col="black", lwd=1)

abline(v=theta_true[k] - 0.1, col="gray80", lwd=1, lty=3)
abline(v=theta_true[k] + 0.1, col="gray80", lwd=1, lty=3)

2.2 Mixture Population Models

A single normal population model proved to be too rigid to accommodate parameters clustering in both the inner core and the outer expanse at the same time, but there's no reason why we have to consider only a single normal population model. By mixing together multiple component population models our inferences have the flexibility to associate individual contexts with different populations, clustering the parameters into different groups as directed by the observed data. We can then encode sparsity as a preference for associating parameters with the narrowest population.

2.2.1 Discrete Mixture Models

The simplest mixture model to consider is a discrete mixture model that features only a finite number of component population models.

For example we might consider mixing a Dirac distribution that concentrates entirely on zero with a wider distribution to accommodate non-zero parameter values.
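
Written out explicitly, such a spike and slab model might take the form \[ \pi(\theta_{k} \mid \rho, \tau) = \rho \cdot \delta_{0}(\theta_{k}) + (1 - \rho) \cdot \text{normal}(\theta_{k} \mid 0, \tau), \] where \(\delta_{0}\) denotes a Dirac distribution concentrating entirely at zero, \(\tau\) is the scale of the slab, and \(\rho\) is the probability allocated to the spike. The notation here is my own, and a normal slab is just one possible choice for the wider component.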




The hope is that parameters with likelihood functions concentrating near zero would collapse to the spike at zero while those with likelihood functions concentrating further away are more likely to be allocated to the non-zero slab. By inflating each \(\theta_{k} = 0\) to have a non-zero probability this spike and slab population model [1] is theoretically able to infer exact zeros and not just small values. Unfortunately because of the infinity of small but not-quite-zero values this population model requires special computational methods capable of transitioning between the spike and slab components in finite time. Even these special methods, however, often struggle to accurately quantify the posterior probabilities of each component in practice.

If we embrace irrelevant values instead of exact zero values then we can replace the spike with a narrow population model that concentrates around zero with scale \(\tau_{1}\). A second, wider normal population model with scale \(\tau_{2} > \tau_{1}\) then accommodates the outer expanse.




Each individual parameter \(\theta_{k}\) is equipped with a mixture probability \(\lambda_{k}\) that quantifies how consistent the parameter is with the narrower population model, \[ \pi(\theta_{k} \mid \tau_{1}, \tau_{2}, \lambda_{k}) = \lambda_{k} \cdot \text{normal}(\theta_{k} \mid 0, \tau_{1} ) + (1 - \lambda_{k}) \cdot \text{normal}(\theta_{k} \mid 0, \tau_{2} ). \]
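
A small R helper of my own, not part of the Stan programs in this case study, makes it easy to evaluate this per-context mixture density for illustrative scales.

# Per-context mixture prior density with a narrow and a wide normal component
mixture_prior_density <- function(theta, tau1, tau2, lambda) {
  lambda * dnorm(theta, 0, tau1) + (1 - lambda) * dnorm(theta, 0, tau2)
}

# Evaluate on a grid with illustrative scales and mixture probability
thetas <- seq(-1, 1, 0.01)
densities <- mixture_prior_density(thetas, tau1=0.1, tau2=10, lambda=0.9)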




When the marginal posterior distribution for a \(\lambda_{k}\) concentrates towards one then the corresponding \(\theta_{k}\) will be more strongly regularized by the narrow component population model. In other words the individual parameter effectively decouples from the wide component population model.




On the other hand if the posterior distribution for a \(\lambda_{k}\) concentrates towards zero then the corresponding \(\theta_{k}\) will effectively decouple from the narrow population model and be regularized by only the wide population model.




We can then encode sparsity with a prior model on each \(\lambda_{k}\) that concentrates towards one. The stronger this concentration the more an individual likelihood function has to concentrate above \(\tau_{1}\) for \(\theta_{k}\) to decouple from the narrow population model and escape the strong regularization towards the inner core.

A mixture probability for each \(\theta_{k}\) introduces a significant number of new parameters, and those parameters introduce opportunities for degeneracies that can frustrate computational efficiency. If we use a beta prior model for the mixture probabilities, however, then we can actually marginalize them out of the model analytically, \[ \begin{align*} &\int \mathrm{d} \lambda_{k} \, \pi(\theta_{k} \mid \tau_{1}, \tau_{2}, \lambda_{k} ) \, \text{beta}(\lambda_{k} \mid \alpha, \beta) \\ &\quad = \int \mathrm{d} \lambda_{k} \, \left[ \lambda_{k} \cdot \text{normal}(\theta_{k} \mid 0, \tau_{1} ) + (1 - \lambda_{k}) \cdot \text{normal}(\theta_{k} \mid 0, \tau_{2} ) \right] \, \text{beta}(\lambda_{k} \mid \alpha, \beta) \\ &\quad = \left[ \int \mathrm{d} \lambda_{k} \, \text{beta}(\lambda_{k} \mid \alpha, \beta) \, \lambda_{k} \right] \text{normal}(\theta_{k} \mid 0, \tau_{1} ) + \left(1 - \left[ \int \mathrm{d} \lambda_{k} \, \text{beta}(\lambda_{k} \mid \alpha, \beta) \, \lambda_{k} \right] \right) \text{normal}(\theta_{k} \mid 0, \tau_{2} ) \\ &\quad = \frac{\alpha}{\alpha + \beta} \cdot \text{normal}(\theta_{k} \mid 0, \tau_{1} ) + \left(1 - \frac{\alpha}{\alpha + \beta} \right) \cdot \text{normal}(\theta_{k} \mid 0, \tau_{2} ). \end{align*} \]
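
This marginalization is straightforward to verify numerically; the following Monte Carlo check is my own sketch with arbitrary illustrative values for \(\alpha\), \(\beta\), \(\tau_{1}\), and \(\tau_{2}\).

# Monte Carlo check of the marginalization over lambda_k at a single theta
alpha <- 3; beta <- 7     # Illustrative beta prior configuration
tau1 <- 0.1; tau2 <- 10   # Illustrative component scales
theta <- 0.5              # Point at which to evaluate the marginal density

lambda_samples <- rbeta(1e5, alpha, beta)
mc_estimate <- mean(lambda_samples * dnorm(theta, 0, tau1) +
                    (1 - lambda_samples) * dnorm(theta, 0, tau2))

gamma <- alpha / (alpha + beta)
exact <- gamma * dnorm(theta, 0, tau1) + (1 - gamma) * dnorm(theta, 0, tau2)

c(mc_estimate, exact)  # Should agree up to Monte Carlo error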

With only this particular ratio of \(\alpha\) and \(\beta\) persisting into the marginal density function our domain expertise does not need to specify both parameters but just the ratio \[ \gamma = \frac{\alpha}{\alpha + \beta}. \] In this case we can replace the discrete mixture population model with the marginal population model \[ \pi(\theta_{k} \mid \tau_{1}, \tau_{2}) = \gamma \cdot \text{normal}(\theta_{k} \mid 0, \tau_{1} ) + (1 - \gamma ) \cdot \text{normal}(\theta_{k} \mid 0, \tau_{2} ). \]

Because \(\gamma\) is the same for all of the individual contexts we have replaced the \(K\) mixture parameters with a single new parameter. In this context we can interpret \(\gamma\) as the proportion of the total observed contexts that are expected to be consistent with the narrow population model as opposed to the wide population model.




As before let's investigate the properties of this discrete mixture population model by fixing the scales and considering how this prior model affects the posterior mean and standard deviation relative to an improper uniform prior density function. To start let's assume that \(\tau_{1} < \sigma < \tau_{2}\).

Although each component normal model induces uniform regularization their mixture does not when the measurement variability can resolve differences between the component scales. When the likelihood function concentrates near and below the narrower scale \(\tau_{1}\) the posterior distribution couples to the narrow population model; the mean is strongly regularized towards zero and the standard deviation narrows because of the information provided by the prior model. In between \(\tau_{1}\) and \(\tau_{2}\) the regularization of the posterior mean weakens and the posterior standard deviation actually expands towards \(\tau_{2}\)! Finally when the likelihood function concentrates closer to \(\tau_{2}\) it decouples from the narrow population model entirely. If \(\tau_{2}\) is much wider than the width of the likelihood function, as is the case here, then the wide population model behaves almost exactly like an improper uniform prior model.
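
For a single normal observation this multi-scale regularization can be worked out in closed form: each component updates conjugately and the mixture weights update through the component evidences. The helper function below is my own sketch for building intuition, not part of the Stan programs used in this case study.

# Exact posterior mean under the two-component mixture prior given a
# likelihood centered at mu_tilde with width s
mixture_post_mean <- function(mu_tilde, s, tau1, tau2, gamma) {
  w1 <- gamma * dnorm(mu_tilde, 0, sqrt(tau1^2 + s^2))        # Narrow component evidence
  w2 <- (1 - gamma) * dnorm(mu_tilde, 0, sqrt(tau2^2 + s^2))  # Wide component evidence
  w1 <- w1 / (w1 + w2)                                        # Posterior mixture weight
  m1 <- mu_tilde * tau1^2 / (tau1^2 + s^2)                    # Narrow component posterior mean
  m2 <- mu_tilde * tau2^2 / (tau2^2 + s^2)                    # Wide component posterior mean
  w1 * m1 + (1 - w1) * m2
}

# Shrinkage is strong for likelihood functions concentrating near tau1 but
# fades as they concentrate at larger values
sapply(c(0.05, 0.5, 2, 30), mixture_post_mean, s=0.5, tau1=0.1, tau2=10, gamma=0.9)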




Because of this multi-scale regularization the discrete mixture population model can accommodate both the inner core and the outer expanse implied by the sparsity assumption. When an individual likelihood function is wide the marginal posterior distribution for \(\theta_{k}\) will be informed by both component population models, allocating most of its probability within the inner core but spreading some across the outer expanse. If an individual likelihood function is narrow and concentrates at values near \(\tau_{1}\) then the narrow population model dominates and the marginal posterior distribution is strongly regularized into the inner core. On the other hand if the individual likelihood function is narrow and concentrates at values above \(\tau_{1}\) then the wide population model governs, allowing the marginal posterior distribution to be determined mostly by the individual likelihood function.