par(family="serif", las=1, bty="l",
cex.axis=1, cex.lab=1, cex.main=1,
xaxs="i", yaxs="i", mar = c(5, 5, 3, 1))
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")
c_light_teal <- c("#6B8E8E")
c_mid_teal <- c("#487575")
c_dark_teal <- c("#1D4F4F")

Modeling Selection Processes
We often take for granted our ability to observe every event that arises from a latent data generating process. Many observations, however, are compromised by selection effects, where the output of an initial, latent data generating process must pass through an intermediate selection process that can prevent events from being observed.
Inferences derived from the resulting data capture only a biased perspective of that latent data generating process. To derive faithful inferences we need to model both the latent data generating process and the subsequent selection process.
In this chapter we’ll review the basic modeling techniques for incorporating selection processes and the computational challenges that arise when trying to implement the resulting models in practice.
1 Selection Models
Consider an initial data generating process which generates events that take values in the space Y. Introducing a selection process annotates Y with a binary variable z \in \{0, 1\} that takes the value z = 0 when an event is rejected and z = 1 when an event is accepted and persists to the final observation.
A joint model of the latent data generating process and the selection process is defined by a joint probability density function over the product space \{0, 1\} \times Y, p(z, y). The complete data generating process is then captured by the conditional decomposition p(z, y) = p(z \mid y) \, p(y), where p(y) models the latent data generating process and p(z \mid y) models the selection process. Here the conditional probability of selection, p(z = 1 \mid y) \equiv S(y), is known as a selection function.
Because we do not observe every event we cannot use the joint model p(z, y) directly. Instead we have to condition it on a successful observation.
1.1 Modeling Accepted Events
Observations that survive the selection process are modeled with the conditional probability density function (Figure 1) \begin{align*} p(y \mid z = 1) &= \frac{ p(z = 1, y) }{ p(z = 1) } \\ &= \frac{ p(y) \, S(y) }{ \int \mathrm{d}y' \, p(y') \, S(y') }. \end{align*} The more the selection function deviates from S(y) = 1 across neighborhoods of high latent probability the more the observational model p(y \mid z = 1) will deviate from the latent model p(y) (Figure 2).
The denominator Z \equiv p(z = 1) = \int \mathrm{d}y' \, p(y') \, S(y') is referred to as the normalization integral, normalization constant, or simply normalization for short. Note that the normalization integral is also equal to the expectation value of the selection function with respect to the latent probability distribution, Z = \int \mathrm{d}y' \, p(y') \, S(y') = \mathbb{E}_{\pi}[S].
In most applications the latent probability density function and selection function are not known exactly and we have to infer their behaviors from observed data (Figure 3). When the latent probability density function is configured with the parameters \phi and the selection function is configured with the parameters \psi the observational model becomes p(y \mid z = 1, \phi, \psi) = \frac{ p(y \mid \phi) \, S(y; \psi) } { \int \mathrm{d}y' \, p(y' \mid \phi) \, S(y'; \psi) }. Critically the normalization is no longer a constant but varies with the possible parameter configurations, Z(\phi, \psi) = \int \mathrm{d}y' \, p(y' \mid \phi) \, S(y'; \psi).
In order to implement this observational model in practice we need to be able to evaluate the normalization integral for arbitrary values of \phi and \psi. Unfortunately analytic evaluation of Z(\phi, \psi) is rarely feasible for practical models. Because of this computational hurdle selection models are often referred to as intractable models.
We will discuss general numerical methods for evaluating Z(\phi, \psi) in Section 2. The exercises in Section 3, however, will focus on exceptional cases where the normalization can be evaluated analytically so that we can directly quantify the error of these numerical methods.
1.2 Modeling The Number of Rejected Events
In some applications we may not be completely ignorant of the rejected events. The mere existence of rejected events can be surprisingly informative about the behavior of the latent probability distribution and selection function even if we don’t know the exact values of those events.
We can model the presence of a single rejected event by evaluating the joint model at z = 0 and then marginalizing out the possible latent values (Figure 4), \begin{align*} p(z = 0 \mid \phi, \psi) &= \int \mathrm{d} y' \, p(z = 0, y' \mid \phi, \psi) \\ &= \int \mathrm{d} y' \, p(y' \mid \phi) \, p(z = 0 \mid y', \psi) \\ &= \int \mathrm{d} y' \, p(y' \mid \phi) \, \big( 1 - p(z = 1 \mid y', \psi) \big) \\ &= \int \mathrm{d} y' \, p(y' \mid \phi) \, \big(1 - S(y'; \psi) \big) \\ &= \int \mathrm{d} y' \, p(y' \mid \phi) - \int \mathrm{d} y' \, p(y' \mid \phi) \, S(y'; \psi) \\ &= 1 - \int \mathrm{d} y' \, p(y' \mid \phi) \, S(y'; \psi) \\ &= 1 - Z(\phi, \psi). \end{align*}
An observation with N independent accepted events, \{ (z_{1} = 1, y_{1}), \ldots, (z_{N} = 1, y_{N}) \}, and R independent rejected events, \{ z_{N + 1} = 0, \ldots, z_{N + R} = 0 \}, is then modeled as \begin{align*} p( z_{1}, y_{1}, \ldots, &z_{N}, y_{N}, z_{N + 1}, \ldots, z_{N + R} \mid \phi, \psi) \\ &= \prod_{n = 1}^{N} p(z_{n} = 1, y_{n} \mid \phi, \psi) \, \prod_{r = 1}^{R} p(z_{N + r} = 0 \mid \phi, \psi) \\ &= \prod_{n = 1}^{N} p(y_{n} \mid \phi) \, p(z_{n} = 1 \mid y_{n}, \psi) \, \prod_{r = 1}^{R} p(z_{N + r} = 0 \mid \phi, \psi) \\ &= \prod_{n = 1}^{N} p(y_{n} \mid \phi) \, S(y_{n}; \psi) \prod_{r = 1}^{R} \big( 1 - Z(\phi, \psi) \big) \\ &= \bigg( \prod_{n = 1}^{N} p(y_{n} \mid \phi) \, S(y_{n}; \psi) \bigg) \bigg( 1 - Z(\phi, \psi) \bigg)^{R}. \end{align*}
Note that because we are modeling all of the latent events, both accepted and rejected, we model the accepted events with the joint model p(z = 1, y) instead of the conditional model p(y \mid z = 1) and its possibly intractable normalization integral. That said, the normalization still shows up in the model for the rejected events, so the implementation isn't actually any easier.
We can see the benefit of knowing the number of rejected events by rewriting this model in terms of the conditional model used for modeling accepted events, \begin{align*} p( z_{1}, y_{1}, \ldots, &z_{N}, y_{N}, z_{N + 1}, \ldots, z_{N + R} \mid \phi, \psi) \\ &= \bigg( \prod_{n = 1}^{N} p(y_{n} \mid \phi) \, S(y_{n}; \psi) \bigg) \bigg( 1 - Z(\phi, \psi) \bigg)^{R} \\ &= \bigg( \prod_{n = 1}^{N} \frac{ p(y_{n} \mid \phi) \, S(y_{n}; \psi) } { Z(\phi, \psi) } \, Z(\phi, \psi) \bigg) \bigg( 1 - Z(\phi, \psi) \bigg)^{R} \\ &= \bigg( \prod_{n = 1}^{N} \frac{ p(y_{n} \mid \phi) \, S(y_{n}; \psi) } { Z(\phi, \psi) } \bigg) \bigg( \prod_{n = 1}^{N} Z(\phi, \psi) \bigg) \bigg( 1 - Z(\phi, \psi) \bigg)^{R} \\ &= \bigg( \prod_{n = 1}^{N} p(y_{n} \mid z = 1, \phi, \psi) \bigg) \bigg( \prod_{n = 1}^{N} Z(\phi, \psi) \bigg) \bigg( 1 - Z(\phi, \psi) \bigg)^{R} \\ &= \bigg( \prod_{n = 1}^{N} p(y_{n} \mid z = 1, \phi, \psi) \bigg) Z(\phi, \psi)^{N} \, \bigg( 1 - Z(\phi, \psi) \bigg)^{R}. \end{align*}
Writing the total number of latent events as M = N + R this becomes \begin{align*} p( z_{1}, y_{1}, \ldots, &z_{N}, y_{N}, z_{N + 1}, \ldots, z_{N + R} \mid \phi, \psi) \\ &= \bigg( \prod_{n = 1}^{N} p(y_{n} \mid z = 1, \phi, \psi) \bigg) Z(\phi, \psi)^{N} \, \bigg( 1 - Z(\phi, \psi) \bigg)^{R} \\ &= \bigg( \prod_{n = 1}^{N} p(y_{n} \mid z = 1, \phi, \psi) \bigg) Z(\phi, \psi)^{N} \, \bigg( 1 - Z(\phi, \psi) \bigg)^{M - N} \\ &= \bigg( \prod_{n = 1}^{N} p(y_{n} \mid z = 1, \phi, \psi) \bigg) \text{binomial} \left( N \mid M, Z(\phi, \psi) \right). \end{align*} This additional binomial probability mass function directly informs Z(\phi, \psi), and hence indirectly informs \phi and \psi, beyond what we would learn from the accepted events alone!
For example observing a large number of rejections relative to the number of acceptances, R \gg N, restricts Z(\phi, \psi) to small values. Small values of Z(\phi, \psi), however, require that the latent probability density function and selection function are misaligned. This misalignment doesn’t directly inform \phi and \psi, but it can be a powerful complement to the information provided by the accepted events.
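To see this quantitatively, here is a small R sketch with hypothetical counts showing how the binomial factor concentrates Z(\phi, \psi) around the empirical acceptance rate; the particular values of N and R are chosen purely for illustration.

```r
# Hypothetical counts of accepted and rejected events
N <- 50
R <- 950
M <- N + R  # Total number of latent events

# Binomial log probability mass across a grid of candidate normalizations
Zs <- seq(0.001, 0.999, 0.001)
log_liks <- dbinom(N, M, Zs, log=TRUE)

# The binomial factor peaks at the empirical acceptance rate N / M
Z_best <- Zs[which.max(log_liks)]
```

Here R \gg N pins Z(\phi, \psi) near N / M = 0.05, information that the accepted events alone cannot provide.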
1.3 Types of Selection Processes
The inferential consequences and implementation details of selection models depend on the nature of the selection process and the resulting structure of the selection function. In particular there are two main categories of selection functions.
1.3.1 Deterministic Selection
If the selection function outputs only extreme values, S(y; \psi) : Y \rightarrow \{ 0, 1 \}, then all latent events with a given value y are either always accepted or always rejected. In other words the selection process becomes completely deterministic once we know the value of the latent event.
Deterministic selection functions partition the observational space into the subset of latent values that are always accepted, S^{-1}(1; \psi) \subset Y, and the subset of latent values that are always rejected, S^{-1}(0; \psi) \subset Y. Moreover the normalization reduces to the probability allocated to the acceptance subset, \begin{align*} Z(\phi, \psi) &= \int \mathrm{d}y' \, p(y' \mid \phi) \, S(y'; \psi ) \\ &= \int \mathrm{d}y' \, p(y' \mid \phi) \, I_{S^{-1}(1; \psi)}(y') \\ &= \int_{S^{-1}(1; \psi)} \mathrm{d}y' p(y' \mid \phi) \\ &= \pi ( S^{-1}(1; \psi) ). \end{align*} Consequently a deterministic selection process is equivalent to restricting the latent probability distribution to the subset S^{-1}(1; \psi) (Figure 5 (a)), \begin{align*} p(y \mid \phi, \psi, z = 1) &= \frac{ p(y \mid \phi) \, S(y; \psi ) } { \int \mathrm{d}y' \, p(y' \mid \phi) \, S(y'; \psi ) } \\ &= \frac{ p(y \mid \phi) \, I_{S^{-1}(1; \psi)}(y) } { \pi ( S^{-1}(1; \psi) ) }. \end{align*}
When the observational space is one-dimensional and ordered the acceptance subset S^{-1}(1; \psi) often decomposes into a union of continuous intervals. In this case we can evaluate \pi ( S^{-1}(1; \psi) ), and hence the normalization integral Z(\phi, \psi), with various cumulative distribution function calculations. If we can evaluate the cumulative distribution function for the latent model analytically then the implementation of these particular deterministic selection models is straightforward.
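As a concrete sketch, assuming a standard normal latent model and a hypothetical acceptance subset made of two disjoint intervals, the normalization reduces to a sum of cumulative distribution function differences:

```r
# Hypothetical latent model: normal(mu, tau)
mu <- 0
tau <- 1

# Hypothetical acceptance subset: a union of two disjoint intervals
intervals <- list(c(-2, -1), c(0.5, 2))

# Normalization as a sum of cumulative distribution function differences
Z <- sum(sapply(intervals,
                function(I) pnorm(I[2], mu, tau) - pnorm(I[1], mu, tau)))
```

Each term is the latent probability allocated to one accepted interval, and the sum is the probability allocated to the full acceptance subset.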
That said implementing inference for deterministic selection models requires some care. Any observation \tilde{y} by definition has been accepted by the selection process and has to fall in the acceptance subset, \tilde{y} \in S^{-1}(1; \psi). Consequently any model configuration with S(\tilde{y}; \psi) = 0 is completely inconsistent with the observation and must be excluded from our inferences. When deriving inferences from multiple observed events \{ \tilde{y}_{1}, \ldots, \tilde{y}_{N} \} we can include only those model configurations where the acceptance subset contains all of the observations (Figure 5 (b)), \{ \tilde{y}_{1}, \ldots, \tilde{y}_{N} \} \subset S^{-1}(1; \psi).
Any computational method attempting to quantify a posterior distribution derived from a deterministic selection model will then have to avoid model configurations that yield S(\tilde{y}_{n}; \psi) = 0. For example when using Markov chain Monte Carlo we have to find initial states that satisfy this condition and then reject all Markov transitions to states that violate this condition.
Unfortunately the influence of \psi on the zero level set S^{-1}(0; \psi) is subtle and difficult to quantify in many applications. Identifying the good model configurations with S(\tilde{y}_{n}; \psi) > 0 from the bad model configurations with S(\tilde{y}_{n}; \psi) = 0 is often much easier said than done.
1.3.2 Probabilistic Selection
When the selection function outputs an intermediate value for a given y \in Y, 0 < S(y; \psi) < 1, the acceptance of latent events with this value is no longer deterministic and we cannot perfectly predict whether a given event will be accepted or rejected. When 0 < S(y; \psi) < 1 for all y \in Y the selection function is described as probabilistic or, depending on the interpretation of the probabilistic selection, sometimes stochastic or even random.
Probabilistic selection functions that never output zero, 0 < S(y; \psi) for all y \in Y, are particularly well-behaved when it comes to implementing inferences. In this case no model configuration is completely incompatible with observed data. An initial model configuration might imply that the observations are extremely unlikely, but because they are not outright impossible we will be able to follow the shape of the posterior distribution towards more reasonable model configurations.
2 Computational Strategies For Evaluating Normalization Constants
In an ideal selection model the normalization integral Z(\phi, \psi) = \int \mathrm{d}y' \, p(y' \mid \phi) \, S(y'; \psi ) can be evaluated in closed form, allowing us to analytically evaluate the normalization constant for any model configuration (\phi, \psi). Unfortunately these ideal selection models are rare in practice.
When we cannot evaluate the normalization analytically we have to rely on numerical integration techniques that approximate it. In this section we'll consider a few techniques that provide deterministic estimates for the normalization integral, \hat{Z}(\phi, \psi) \approx Z(\phi, \psi), and hence deterministic approximations to the full Bayesian probability density function, \begin{align*} \hat{p}(y, \phi, \psi \mid z = 1) &= \frac{ p(y \mid \phi) \, S(y; \psi ) \, p(\phi, \psi) }{ \hat{Z}(\phi, \psi) } \\ &\approx \frac{ p(y \mid \phi) \, S(y; \psi ) \, p(\phi, \psi) }{ Z(\phi, \psi) } \\ &= p(y, \phi, \psi \mid z = 1), \end{align*} and finally deterministic approximations to posterior density functions, \hat{p}( \phi, \psi \mid \tilde{y}, \tilde{z} = 1) \propto \hat{p}(\tilde{y}, \phi, \psi \mid \tilde{z} = 1) \approx p(\tilde{y}, \phi, \psi \mid \tilde{z} = 1) \propto p( \phi, \psi \mid \tilde{y}, \tilde{z} = 1).
We can then use tools like Hamiltonian Monte Carlo to estimate expectation values with respect to these approximate posterior density functions. If the approximate posterior density function is close enough to the exact posterior density function then this can provide reasonably accurate estimates of the exact posterior expectation values, \int \mathrm{d} \phi \, \mathrm{d} \psi \, \hat{p}(\phi, \psi \mid \tilde{y}, \tilde{z} = 1) f(\phi, \psi) \approx \int \mathrm{d} \phi \, \mathrm{d} \psi \, p(\phi, \psi \mid \tilde{y}, \tilde{z} = 1) f(\phi, \psi).
While we can often quantify the error in estimating the normalization integral, propagating that error through to the posterior density function estimate and then to any resulting posterior expectation value estimates is much more difficult, and outright infeasible for most practical problems. This makes it difficult to determine just how accurate our numerical integration needs to be to ensure faithful posterior inferences.
In some cases we can verify that the consequences of the normalization estimator error are negligible by repeating the approximate posterior quantification with more accurate, although more expensive, estimators until the resulting posterior inferences appear to stabilize. The main limitation of this approach is that these consequences don’t always vary smoothly with the normalization error itself: apparent stability doesn’t always imply negligible error.
Another approach is to consider methods like Simulation-Based Calibration (Talts et al. 2018). Any normalization estimator error will introduce an inconsistency between the model used to simulate prior predictive data and the model used to construct posterior samples. This inconsistency will then bias the prior-posterior ranks which should manifest in the Simulation-Based Calibration rank histograms. That said we might need an excess number of replications, and hence an excess amount of computation, to resolve this bias.
2.1 Numerical Quadrature
When the observational space is one-dimensional numerical quadrature becomes a feasible tool for estimating the normalization integral.
Given a grid \{ y_{0}, \ldots, y_{k}, \ldots, y_{K} \} \subset Y we can construct a crude quadrature estimate of the normalization constant as \begin{align*} \hat{Z}(\phi, \psi) &= \sum_{k = 1}^{K} (y_{k} - y_{k - 1}) \, p(y_{k} \mid \phi) \, S(y_{k}; \psi ) \\ &\approx \int \mathrm{d}y' \, p(y' \mid \phi) \, S(y'; \psi ). \end{align*} The finer the grid the more we have to evaluate the latent probability density function and selection function, and the more computationally expensive the estimator becomes, but the smaller the error we can achieve.
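This crude grid estimator can be sketched in a few lines of R; here the normal latent model and logistic selection function are hypothetical choices made only so that we can compare against R's built-in adaptive quadrature.

```r
# Hypothetical latent model and selection function configurations
mu <- 3
tau <- 2
S <- function(y) plogis(y, location=1, scale=0.5)

# Crude Riemann-sum quadrature estimate over a uniform grid
ys <- seq(mu - 8 * tau, mu + 8 * tau, length.out=1000)
delta <- ys[2] - ys[1]
Z_hat <- sum(delta * dnorm(ys, mu, tau) * S(ys))

# Reference value from adaptive quadrature
Z_ref <- integrate(function(y) dnorm(y, mu, tau) * S(y), -Inf, Inf)$value
```

Even this uniform grid is accurate here because the integrand is smooth and concentrates well inside the grid; adaptive tools achieve the same accuracy with far fewer evaluations.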
Modern numerical quadrature tools dynamically construct a grid based on the shape of the integrand, here S(y_{k}; \psi ) \, p(y_{k} \mid \phi), to guarantee extremely small errors with as few evaluations as possible.
Unfortunately the effectiveness of numerical quadrature decays rapidly with the dimension of the observational space. In order to maintain a fixed error the cost of numerical quadrature needs to increase exponentially with the dimension. At the same time if we fix the cost then the error will grow exponentially with dimension. Moreover as we go to higher and higher-dimensional observational spaces robust dynamic grid adaptation becomes more difficult. Ultimately numerical quadrature is most productive for one-dimensional problems, with occasional applications to two- and three-dimensional problems.
2.2 Monte Carlo Estimation
Recall that the normalization integral can be written as the expectation value of the selection function with respect to the latent probability distribution, \begin{align*} Z(\phi, \psi) &= \int \mathrm{d}y' \, p(y' \mid \phi ) \, S(y'; \psi ) \\ &= \int \pi( \mathrm{d}y' \mid \phi) \, S(y'; \psi ) \\ &= \int \pi_{\phi}( \mathrm{d}y' ) \, S(y'; \psi ) \\ &= \mathbb{E}_{\pi_{\phi}}[ S(\cdot ; \psi) ]. \end{align*} Consequently we can use expectation value estimation techniques to approximate the normalization integral.
For example we can use Monte Carlo estimation to approximate the normalization integral. Given an ensemble of exact samples \{ \tilde{y}_{1}, \ldots, \tilde{y}_{j}, \ldots, \tilde{y}_{J} \} \sim \pi_{\phi} we can construct the Monte Carlo estimator \frac{1}{J} \sum_{j = 1}^{J} S(\tilde{y}_{j}; \psi) \approx \mathbb{E}_{\pi_{\phi}}[ S(\cdot ; \psi) ] = Z(\phi, \psi). We can then use the Monte Carlo central limit theorem to quantify the error of this estimator. Moreover if we ever need to decrease the error then we can increase the ensemble size J as needed.
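A minimal sketch in R, assuming a normal latent model and a hypothetical threshold selection function so that the exact normalization is available for comparison:

```r
set.seed(8438338)

# Hypothetical latent model and deterministic selection function
mu <- 3
tau <- 2
S <- function(y) ifelse(y <= 4.75, 1, 0)

# Monte Carlo estimate of the normalization from exact latent samples
J <- 10000
y_samples <- rnorm(J, mu, tau)
evals <- S(y_samples)
Z_hat <- mean(evals)

# Standard error from the Monte Carlo central limit theorem
Z_se <- sd(evals) / sqrt(J)

# Exact normalization for this particular configuration
Z_true <- pnorm(4.75, mu, tau)
```

The estimator error scales as 1 / \sqrt{J} regardless of the dimension of the observational space, which is exactly what makes Monte Carlo attractive in higher dimensions.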
As we consider higher and higher-dimensional observational spaces the cost of generating exact samples will in general increase, although the growth is often more linear than exponential. The Monte Carlo estimator error, however, will remain stable. This makes Monte Carlo particularly attractive for higher-dimensional selection problems.
That said any changes to the configuration of the latent probability distribution or selection function complicate the incorporation of Monte Carlo estimates into selection models. Varying \psi changes the expectand which requires computing a new Monte Carlo estimator. At the same time varying \phi changes the latent probability distribution which requires generating a new ensemble of samples and then re-evaluating the Monte Carlo estimator.
To accommodate variations in the behavior of the latent probability distribution we would have to generate a new ensemble every time we evaluate the selection model. Consequently we would no longer be working with a fixed approximate model \hat{p}(y, \phi, \psi \mid z = 1) but rather a stochastic one. This, in turn, disrupts the scalable performance of computational tools like Hamiltonian Monte Carlo (Betancourt 2015).
When the configuration of the latent probability distribution is fixed, however, we can generate a single ensemble of latent samples and use it to estimate the normalization integral for any given configuration of the selection function. We just have to be careful that the ensemble is large enough to ensure that the Monte Carlo estimator error is small for all selection function configurations of interest. In particular if we ever need to evaluate the selection model when the selection function is strongly misaligned with the latent probability distribution then we will need relatively large ensembles to ensure that the Monte Carlo estimator error is always well-behaved.
One way to monitor for any problems is to compute and save the Monte Carlo estimator error at each evaluation of the model and check that the entire ensemble of errors is sufficiently small.
2.3 Importance Sampling Estimation
Unfortunately the convenience of Monte Carlo estimation is rarely of use in applied problems given that the behavior of the latent probability distribution is almost never known precisely. One way to infer the latent probability distribution at the same time as the selection function is to generate one ensemble of samples and adapt them to the varying behavior of the latent probability distribution. More formally we can replace our Monte Carlo estimator with an importance sampling estimator.
In this case we would generate an ensemble of samples from a convenient reference probability distribution \rho, \{ \tilde{y}_{1}, \ldots, \tilde{y}_{j}, \ldots, \tilde{y}_{J} \} \sim \rho(y), and then estimate the normalization as \begin{align*} \frac{1}{J} \sum_{j = 1}^{J} w(\tilde{y}_{j}; \phi) \, S(\tilde{y}_{j}; \psi) \approx \mathbb{E}_{\pi_{\phi}}[ S(\cdot ; \psi) ] = Z(\phi, \psi), \end{align*} where the importance weights account for the difference between the reference probability distribution and the current configuration of the latent probability distribution, w(y; \phi) = \frac{ p(y \mid \phi) }{ \rho(y) }.
Unfortunately error quantification for importance sampling estimators is much more fragile than it is for Monte Carlo estimators. When the reference probability distribution is sufficiently similar to the latent probability distribution the importance weights will concentrate around 1 and the error in estimating the expectation value \mathbb{E}_{\pi_{\phi}}[f] will be well-approximated by \epsilon[f] = \sqrt{ \frac{\mathrm{Var}_{\rho}[w \, f] }{J} }. In practice we can estimate the variance \mathrm{Var}_{\rho}[w \, f] with the empirical variance of the individual products \{ w(\tilde{y}_{1}; \phi) \, f(\tilde{y}_{1}), \ldots, w(\tilde{y}_{j}; \phi) \, f(\tilde{y}_{j}), \ldots, w(\tilde{y}_{J}; \phi) \, f(\tilde{y}_{J}) \}.
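A sketch of this workflow under assumed configurations: a normal latent model, a hypothetical logistic selection function, and a deliberately wider normal reference distribution so that the weights are well-behaved.

```r
set.seed(5838299)

# Hypothetical latent model and selection function configurations
mu <- 3
tau <- 2
S <- function(y) plogis(y, location=1, scale=0.5)

# Hypothetical reference distribution, wider than the latent model
mu_ref <- 2
tau_ref <- 4

# Ensemble of reference samples
J <- 10000
y_ref <- rnorm(J, mu_ref, tau_ref)

# Importance weights and normalization estimate
ws <- dnorm(y_ref, mu, tau) / dnorm(y_ref, mu_ref, tau_ref)
products <- ws * S(y_ref)
Z_hat <- mean(products)

# Empirical standard error of the importance sampling estimator
Z_se <- sqrt(var(products) / J)
```

Because the reference distribution dominates the latent model here the weights are bounded in practice; had we chosen a reference narrower than the latent model the weight distribution would exhibit the heavy tails discussed below.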
If the reference probability distribution is not similar enough to the latent probability distribution, however, then the distribution of the importance weights will exhibit heavy tails and the resulting importance sampling estimators, let alone the estimator error quantification, will be unreliable. In particular most of the importance weights will be very close to zero with only a few taking on extremely large values. The small weights will have negligible contributions to the importance sampling estimator, and the effective ensemble size will be very small.
When applying importance sampling estimates to the normalization integral in selection models we cannot tune the reference probability distribution to a single latent probability distribution. Instead we need the reference probability distribution to span all of the latent probability distribution behaviors in neighborhoods of high posterior probability (Figure 6). The larger our posterior uncertainties the wider the reference probability distribution will need to be, and the larger the ensemble we will need to ensure reasonable importance sampling estimator accuracy for any given configuration of the latent probability distribution.
In most practical applications, however, we will not know what latent probability distribution behaviors will be spanned by the posterior distribution before we attempt posterior computation. All we can do in this case is adapt the reference probability distribution to the prior model; the more informative the prior model is the better this strategy will be. With enough effort we can even approach this problem iteratively, starting with a reference probability distribution adapted to the prior model and then tweaking it based on the results from each fit.
Unfortunately as the dimension of the observational space increases the engineering of a productive reference probability distribution that spans all of the relevant latent probability distribution behaviors becomes more and more difficult, and quickly becomes impractical altogether.
Because of their fragility we have to be very careful about the verification of importance sampling estimators. For example we can start by examining the importance weights generated when evaluating the normalization integral at each Markov chain state for any indications of a heavy tail, such as excessive zeros or large but sparse deviations. When the reference probability distribution isn’t too far off then diagnostics like the importance sampling effective sample size and the \hat{k} statistic (Vehtari et al. 2015) can also be helpful.
Critically all of these diagnostics are limited by the size of the ensemble. If the ensemble is too small then we can easily miss large but rare weights that indicate unreliable estimation. In practice it can be useful to run with multiple ensemble sizes to verify that the estimator behavior is consistent.
3 Demonstrations
To contextualize these ideas let’s explore their application in a variety of simulated data exercises that span different selection function scenarios.
3.1 Setup
Before we start with our first exercise, however, let’s set up our local R environment.
In particular we’ll need to load RStan and my recommended Hamiltonian Monte Carlo analysis suite.
library(rstan)
rstan_options(auto_write = TRUE) # Cache compiled Stan programs
options(mc.cores = parallel::detectCores()) # Parallelize chains
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
util <- new.env()
source('stan_utility_rstan.R', local=util)

3.2 Deterministic Selection
For our first example let’s consider a one-dimensional, real-valued observational space Y = \mathbb{R} and a deterministic selection process that always rejects events with values above a certain threshold \psi = \lambda \in Y. Because this selection process limits the size of latent values that can be observed it is also known as truncation.
We can model this selection process with the selection function S(y; \lambda) = \left\{ \begin{array}{rr} 1, & y \le \lambda \\ 0, & y > \lambda \end{array} \right. The resulting normalization integral is \begin{align*} Z(\phi, \lambda) &= \int \mathrm{d}y' \, p(y' \mid \phi) \, S(y'; \lambda) \\ &= \int \mathrm{d}y' \, p(y' \mid \phi) \, I_{(-\infty, \lambda]}(y') \\ &= \int_{-\infty}^{\lambda} \mathrm{d}y' p(y'; \phi), \\ &= \Pi(\lambda; \phi), \end{align*} where \Pi is the cumulative distribution function of the latent probability distribution.
In order to avoid selection function configurations that are inconsistent with the observed data, S(\tilde{y}; \lambda) = 0, we need to bound the selection threshold below by the observed data, \tilde{y} \le \lambda. If we have multiple observations \{ \tilde{y}_{1}, \ldots, \tilde{y}_{N} \} then we need to bound \lambda below by the largest observed value, \max \left( \{ \tilde{y}_{1}, \ldots, \tilde{y}_{N} \} \right) \le \lambda.
Now let’s assume that the latent probability distribution is specified by a one-dimensional, normal density function with the parameters \phi = (\mu, \tau), p(y \mid \mu, \tau) = \text{normal}(y \mid \mu, \tau). In this case the normalization integral reduces to \begin{align*} Z(\mu, \tau, \lambda) &= \int \mathrm{d}y' \, p(y' \mid \mu, \tau) \, S(y'; \lambda) \\ &= \int \mathrm{d}y' \, \text{normal}(y' \mid \mu, \tau) \, I_{(-\infty, \lambda]}(y') \\ &= \int_{-\infty}^{\lambda} \mathrm{d}y' \, \text{normal}(y' \mid \mu, \tau) \\ &= \Pi_{\text{normal}}(\lambda \mid \mu, \tau). \end{align*}
For example taking \mu = 3 and \tau = 2 gives a latent probability density function that concentrates around y = 3.
delta <- 0.1
ys <- seq(-7, 13, delta)
mu <- 3
tau <- 2
latent_pds <- dnorm(ys, mu, tau)
plot(ys, latent_pds, type="l", main ="p(y)", col=c_dark, lwd=4,
xlab="y", xlim=c(-7, 13),
ylab="Latent Probability Density", yaxt='n', ylim=c(0, 0.2))

Taking \lambda = 4.75 defines a selection function that truncates all values above y = 4.75.
lambda <- 4.75
selection_probs <- ifelse(ys <= lambda, 1, 0)
plot(ys, latent_pds, type="l", main ="p(z = 1 | y)", col=c_light, lwd=4,
xlab="y", xlim=c(-7, 13),
ylab="Selection Probability", yaxt='n', ylim=c(0, 0.25))
abline(v = lambda, col="#DDDDDD", lty=3, lwd=4)
lines(ys, 0.22 * selection_probs, col=c_dark, lwd=4)

Together these components define an observed probability density function that abruptly cuts off at the truncation point.
observed_pds <- (selection_probs * latent_pds)
norm <- pnorm(lambda, mu, tau)
observed_pds <- observed_pds / norm
plot(ys, latent_pds, type="l", main ="p(y | z = 1)", col=c_light, lwd=4,
xlab="y", xlim=c(-7, 13),
ylab="Observed Probability Density", yaxt='n', ylim=c(0, 0.25))
abline(v = lambda, col="#DDDDDD", lty=3, lwd=4)
lines(ys, 0.22 * selection_probs, col=c_light_highlight, lwd=4)
lines(ys, observed_pds, col=c_dark, lwd=4)

Note that below the threshold the observed probability density function is larger than the latent probability density function because the probability that would be allocated above the threshold instead has to be reallocated to values below the threshold.
3.2.1 Simulating Data
The most efficient way to generate data is to sample directly from the observed probability distribution specified by the probability density function p(y \mid \mu, \tau, \lambda, z = 1) = \frac{ \text{normal}(y \mid \mu, \tau) \, I_{(-\infty, \lambda]}(y) } { \Pi_{\text{normal}}(\lambda \mid \mu, \tau) }. Conveniently sampling from this particular observed probability distribution is straightforward with a small modification to the quantile function pushforward method.
First we compute the cumulative probability of the selection threshold, r = \Pi_{\text{normal}}(\lambda \mid \mu, \tau). Next we simulate a uniform variate in the interval (0, r), \tilde{q} \sim \text{uniform}(0, r). Finally we apply the normal quantile function, \tilde{y} = \Pi^{-1}_{\text{normal}}(\tilde{q} \mid \mu, \tau).
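These three steps can be sketched directly in R, using the ground truth values \mu = 3, \tau = 2, and \lambda = 4.75 introduced above:

```r
set.seed(2938472)

# Ground truth configuration from above
mu <- 3
tau <- 2
lambda <- 4.75

# Cumulative probability of the selection threshold
r <- pnorm(lambda, mu, tau)

# Uniform variates on (0, r) pushed through the normal quantile function
N <- 1000
q <- runif(N, 0, r)
y <- qnorm(q, mu, tau)
```

Because the quantile function is monotone and every q is below r, every sample is guaranteed to fall at or below the truncation point, with no rejections needed.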
More generally we can also simulate the selection process directly. Here we generate a candidate sample from the latent probability distribution, \tilde{y} \sim \pi_{\phi}, and then a binary acceptance variable \tilde{z} \sim \text{Bernoulli}( S(\tilde{y}; \psi) ). If \tilde{z} = 0 then we reject \tilde{y} and generate a new candidate sample. We then repeat these steps until \tilde{z} = 1.
Although straightforward this approach can also become extremely inefficient if the overlap between the latent probability distribution and selection function is weak. In this case we will need to generate many candidate samples, expend a lot of computation, and wait a long time for every acceptance.
The direct simulation approach is implemented in the following Stan program for the ground truth that we introduced above, \begin{align*} \mu &= 3 \\ \tau &= 2 \\ \lambda &= 4.75. \end{align*} In general the while loop here can be dangerous because there’s no guarantee that it will terminate in a finite time. Because the overlap between the latent probability distribution and selection function is reasonably large here, however, this shouldn’t be a problem.
simu\_threshold.stan
data {
int<lower=1> N; // Number of observations
}
transformed data {
real lambda = 4.75; // Selection threshold
real mu = 3; // Location of latent probability density function
real<lower=0> tau = 2; // Shape of latent probability density function
}
generated quantities {
// Simulated data
real y[N];
real N_reject = 0;
for (n in 1:N) {
// Sample latent events until one survives the selection process
while (1) {
y[n] = normal_rng(mu, tau);
if (y[n] <= lambda) {
break;
} else {
N_reject += 1;
}
}
}
}
To ensure a pretty substantial observation we'll take N = 1000.
N <- 1000
simu <- stan(file="stan_programs/simu_threshold.stan",
iter=1, warmup=0, chains=1, data=list("N" = N),
seed=4838282, algorithm="Fixed_param")
SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: Iteration: 1 / 1 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0 seconds (Warm-up)
Chain 1: 0 seconds (Sampling)
Chain 1: 0 seconds (Total)
Chain 1:
y <- extract(simu)$y[1,]
N_reject <- extract(simu)$N_reject[1]
To double check our simulation implementation we can compare a properly normalized histogram of the component observations to the true observed probability density function. Fortunately there don't appear to be any conflicts here.
plot_line_hist <- function(s, bin_min=NULL, bin_max=NULL, delta=NULL,
prob=FALSE, xlab="y", main="") {
# Remove any NA values
s <- s[!is.na(s)]
# Construct binning configuration
if (is.null(bin_min))
bin_min <- min(s)
if (is.null(bin_max))
bin_max <- max(s)
if (is.null(delta))
delta <- (bin_max - bin_min) / 25
# Construct bins
bins <- seq(bin_min, bin_max, delta)
B <- length(bins) - 1
idx <- rep(1:B, each=2)
x <- sapply(1:length(idx),
function(b) if(b %% 2 == 1) bins[idx[b]] else bins[idx[b] + 1])
x <- c(bin_min - 10, x, bin_max + 10)
# Check bin containment
N <- length(s)
N_low <- sum(s < bin_min)
if (N_low > 0)
warning(sprintf('%s values (%.1f%%) fell below the histogram binning.',
N_low, 100 * N_low / N))
N_high <- sum(bin_max < s)
if (N_high > 0)
warning(sprintf('%s values (%.1f%%) fell above the histogram binning.',
N_high, 100 * N_high / N))
# Compute bin contents, including empty bounding bins
counts <- hist(s[bin_min <= s & s <= bin_max], breaks=bins, plot=FALSE)$counts
ylab <- "Counts"
if (prob) {
ylab <- "Empirical Bin Probability / Bin Width"
# Normalize bin contents, if desired
counts <- counts / (delta * sum(counts))
}
y <- counts[idx]
y <- c(0, y, 0)
# Plot
ymax <- 1.1 * max(y)
plot(x, y, type="l", main=main, col="black", lwd=2,
xlab=xlab, xlim=c(bin_min, bin_max),
ylab=ylab, ylim=c(0, ymax))
}
plot_line_hist(y, -7, 13, 1, prob=TRUE, xlab="y")
lines(ys, observed_pds, col=c_dark, lwd=4)
Additionally if our simulation is correct then the ratio of the expected number of rejections to the total number of acceptances will be \frac{ \left< N_{\text{reject}} \right> }{ N } = \frac{ 1 - Z(\mu, \tau, \lambda) }{ Z(\mu, \tau, \lambda) }. Although we can't expect exact equality the ratio N_{\text{reject}} / N should be near the value on the right-hand side. Fortunately for us, it is.
N_reject / N; (1 - norm) / norm
[1] 0.233
[1] 0.2357685
With our simulation validated let’s try to actually fit the simulated data.
3.2.2 Ignoring The Selection Process
To begin we’ll be careless and ignore the selection process entirely, assuming that the data are drawn directly from the latent probability distribution.
fit\_threshold1.stan
data {
int<lower=1> N; // Number of observations
real y[N]; // Observations
}
parameters {
real mu; // Location of latent probability density function
real<lower=0> tau; // Shape of latent probability density function
}
model {
// Prior model
mu ~ normal(0, 5 / 2.32); // ~ 99% prior mass between -5 and +5
tau ~ normal(0, 5 / 2.57); // ~ 99% prior mass between 0 and +5
// Observational model
target += normal_lpdf(y | mu, tau);
}
generated quantities {
real y_pred[N]; // Posterior predictive data
for (n in 1:N) {
y_pred[n] = normal_rng(mu, tau);
}
}
data <- list("N" = N, "y" = y)
fit1 <- stan(file="stan_programs/fit_threshold1.stan",
data=data, seed=8438338,
warmup=1000, iter=2024, refresh=0)
There are no signs of computational problems.
diagnostics <- util$extract_hmc_diagnostics(fit1)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectands(fit1)
base_samples <- util$filter_expectands(samples,
c('mu', 'tau'))
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
The retrodictive performance, however, leaves much to be desired.
hist_retro <- function(obs, samples, pred_names,
bin_min=NULL, bin_max=NULL, delta=NULL,
xlab="", ylim=NA, title="") {
# Check that pred_names are in samples
all_names <- names(samples)
bad_names <- setdiff(pred_names, all_names)
if (length(bad_names) > 0) {
warning(sprintf('The expectand names %s are not in the `samples` object and will be ignored.',
paste(bad_names, collapse=", ")))
}
good_names <- intersect(pred_names, all_names)
if (length(good_names) == 0) {
stop('There are no valid expectand names.')
}
pred_names <- good_names
# Remove any NA values
obs <- obs[!is.na(obs)]
pred <- sapply(pred_names,
function(name) c(t(samples[[name]]), recursive=TRUE))
pred_collapse <- c(pred)
pred_collapse <- pred_collapse[!is.na(pred_collapse)]
# Construct binning configuration
if (is.null(bin_min))
bin_min <- min(min(obs), min(pred_collapse))
if (is.null(bin_max))
bin_max <- max(max(obs), max(pred_collapse))
if (is.null(delta))
delta <- (bin_max - bin_min) / 25
# Construct bins
breaks <- seq(bin_min, bin_max, delta)
B <- length(breaks) - 1
idx <- rep(1:B, each=2)
xs <- sapply(1:length(idx),
function(b) if(b %% 2 == 0) breaks[idx[b] + 1]
else breaks[idx[b]] )
# Check bin containment for observed values
N <- length(obs)
N_low <- sum(obs < bin_min)
if (N_low > 0)
warning(sprintf('%s data values (%.1f%%) fell below the histogram binning.',
N_low, 100 * N_low / N))
N_high <- sum(bin_max < obs)
if (N_high > 0)
warning(sprintf('%s data values (%.1f%%) fell above the histogram binning.',
N_high, 100 * N_high / N))
# Compute observed bin contents
obs_counts <- hist(obs[bin_min <= obs & obs <= bin_max],
breaks=breaks, plot=FALSE)$counts
pad_obs_counts <- do.call(cbind,
lapply(idx, function(n) obs_counts[n]))
# Check bin containment for posterior predictive values
N <- length(pred_collapse)
N_low <- sum(pred_collapse < bin_min)
if (N_low > 0)
warning(sprintf('%s predictive values (%.1f%%) fell below the histogram binning.',
N_low, 100 * N_low / N))
N_high <- sum(bin_max < pred_collapse)
if (N_high > 0)
warning(sprintf('%s predictive values (%.1f%%) fell above the histogram binning.',
N_high, 100 * N_high / N))
# Construct ribbons for predictive bin contents
N <- dim(pred)[1]
pred_counts <- sapply(1:N,
function(n) hist(pred[n, bin_min <= pred[n,] &
pred[n,] <= bin_max],
breaks=breaks,
plot=FALSE)$counts)
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred <- sapply(1:B,
function(b) quantile(pred_counts[b,], probs=probs))
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9, n]))
if (any(is.na(ylim))) {
ylim <- c(0, max(c(obs_counts, cred[9,])))
}
# Plot
plot(1, type="n", main=title,
xlim=c(bin_min, bin_max), xlab=xlab,
ylim=ylim, ylab="Counts")
polygon(c(xs, rev(xs)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(xs, pad_cred[5,], col=c_dark, lwd=2)
lines(xs, pad_obs_counts, col="white", lty=1, lwd=4)
lines(xs, pad_obs_counts, col="black", lty=1, lwd=2)
}
While the observed histogram exhibits a clear asymmetry, with a hard cutoff around y = 5, the posterior predictive distribution concentrates on symmetric histograms that not only peak at smaller values of y but also exhibit a tail that extends far above the hard cutoff seen in the data.
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
pred_names <- grep('y_pred', names(samples), value=TRUE)
hist_retro(data$y, samples, pred_names, -7, 13, 0.75, 'y')
Moreover the resulting posterior inferences concentrate far away from the true values. Any decisions or predictions that we might derive from these inferences will perform poorly.
par(mfrow=c(1, 2))
util$plot_expectand_pushforward(samples[['mu']], 25, 'mu',
flim=c(2, 4), baseline=mu)
util$plot_expectand_pushforward(samples[['tau']], 25, 'tau',
flim=c(1, 3), baseline=tau)
3.2.3 Modeling The Selection Process
To obtain an accurate picture of the latent behavior we’ll have to explicitly model the selection process. Fortunately this is straightforward given that the normalization integral can be evaluated as a normal cumulative distribution function.
In order to avoid model configurations with zero likelihood, and hence zero posterior density, we need to force the selection threshold \lambda to be larger than the largest observation.
Additionally to avoid excessive computation when the latent density function and selection function are poorly aligned we will cap the number of attempted latent simulations in the generated quantities block. Poor alignment is typically most problematic during the warmup phase, but we can readily diagnose any problems that persist into the sampling phase by looking for NA outputs.
fit\_threshold2.stan
data {
int<lower=1> N; // Number of observations
real y[N]; // Observations
}
transformed data {
real lambda_lower_bound = max(y);
}
parameters {
real<lower=lambda_lower_bound> lambda; // Selection threshold
real mu; // Location of latent probability density function
real<lower=0> tau; // Shape of latent probability density function
}
model {
real log_norm = normal_lcdf(lambda | mu, tau);
// Prior model
lambda ~ normal(5, 5 / 2.32); // ~ 99% prior mass between 0 and +10
mu ~ normal(0, 5 / 2.32); // ~ 99% prior mass between -5 and +5
tau ~ normal(0, 5 / 2.57); // ~ 99% prior mass between 0 and +5
// Observational model
for (n in 1:N) {
target += normal_lpdf(y[n] | mu, tau) - log_norm;
}
}
generated quantities {
real y_pred[N]; // Posterior predictive data
for (n in 1:N) {
// Sample latent intensities until one survives the selection process
// Use fixed iterations instead of while loop to avoid excessive trials
// when the selection function and latent density function are not aligned
for (m in 1:100) {
real y_sample = normal_rng(mu, tau);
if (y_sample <= lambda) {
y_pred[n] = y_sample;
break;
}
}
}
}
fit2 <- stan(file="stan_programs/fit_threshold2.stan",
data=data, seed=8438338,
warmup=1000, iter=2024, refresh=0)
The diagnostics are clean.
diagnostics <- util$extract_hmc_diagnostics(fit2)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectands(fit2)
base_samples <- util$filter_expectands(samples,
c('mu', 'tau', 'lambda'))
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
The lack of any retrodictive tension in the histogram summary statistic suggests that our selection model is accurately capturing the threshold behavior exhibited by the observed data.
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
pred_names <- grep('y_pred', names(samples), value=TRUE)
hist_retro(data$y, samples, pred_names, -7, 13, 0.75, 'y')
Warning in hist_retro(data$y, samples, pred_names, -7, 13, 0.75, "y"): 6
predictive values (0.0%) fell below the histogram binning.
More importantly our posterior inferences are consistent with the true model configurations from which we simulated data.
par(mfrow=c(1, 3))
util$plot_expectand_pushforward(samples[['mu']], 25, 'mu',
flim=c(2.5, 4), baseline=mu)
util$plot_expectand_pushforward(samples[['tau']], 25, 'tau',
flim=c(1.5, 2.75), baseline=tau)
util$plot_expectand_pushforward(samples[['lambda']], 25, 'lambda',
flim=c(4.72, 4.82), baseline=lambda)
3.2.4 Incorporating The Number of Rejections
Finally let’s go one step further and consider how our inferences might change if we knew not only that some latent events were rejected but also how many were rejected. Recall that in this case the accepted events no longer need to be modeled with a modified normalization, but that the normalization integral still shows up in the model for the rejected events.
fit\_threshold3.stan
data {
int<lower=1> N; // Number of observations
real y[N]; // Observations
int N_reject; // Number of rejected latent events
}
transformed data {
real lambda_lower_bound = max(y);
}
parameters {
real<lower=lambda_lower_bound> lambda; // Selection threshold
real mu; // Location of latent probability density function
real<lower=0> tau; // Shape of latent probability density function
}
model {
real log_norm = normal_lcdf(lambda | mu, tau);
// Prior model
lambda ~ normal(5, 5 / 2.32); // ~ 99% prior mass between 0 and +10
mu ~ normal(0, 5 / 2.32); // ~ 99% prior mass between -5 and +5
tau ~ normal(0, 5 / 2.57); // ~ 99% prior mass between 0 and +5
// Observational model
for (n in 1:N) {
target += normal_lpdf(y[n] | mu, tau);
}
target += N_reject * log1m_exp(log_norm);
}
generated quantities {
real y_pred[N]; // Posterior predictive data
int N_reject_pred = 0; // Posterior predictive data
for (n in 1:N) {
// Sample latent intensities until one survives the selection process
// Use fixed iterations instead of while loop to avoid excessive trials
// when the selection function and latent density function are not aligned
for (m in 1:100) {
real y_sample = normal_rng(mu, tau);
if (y_sample <= lambda) {
y_pred[n] = y_sample;
break;
} else {
N_reject_pred += 1;
}
}
}
}
data$N_reject <- N_reject
fit3 <- stan(file="stan_programs/fit_threshold3.stan",
data=data, seed=8438338,
warmup=1000, iter=2024, refresh=0)
Fortunately the computation remains unproblematic.
diagnostics <- util$extract_hmc_diagnostics(fit3)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectands(fit3)
base_samples <- util$filter_expectands(samples,
c('mu', 'tau', 'lambda'))
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
In addition to the histogram summary statistic which is sensitive to the behavior of the accepted events we can also consider the number of rejected events as its own summary statistic. Our model is consistent with the observed behavior of both of these statistics, giving us confidence that we’re capturing the relevant behavior.
par(mfrow=c(1, 2), mar = c(5, 5, 3, 1))
pred_names <- grep('y_pred', names(samples), value=TRUE)
hist_retro(data$y, samples, pred_names, -7, 13, 0.75, 'y')
Warning in hist_retro(data$y, samples, pred_names, -7, 13, 0.75, "y"): 1
predictive values (0.0%) fell below the histogram binning.
util$plot_expectand_pushforward(samples[['N_reject_pred']], 25, 'N_reject',
baseline=data$N_reject)
As in the previous fit the posterior inferences concentrate around the true model configuration. The concentration, however, is much stronger for the latent probability distribution parameters \mu and \tau than in that previous fit.
par(mfrow=c(3, 2))
samples2 <- util$extract_expectands(fit2)
samples3 <- util$extract_expectands(fit3)
util$plot_expectand_pushforward(samples2[['mu']], 25, 'mu',
flim=c(2.5, 4),
main="Unknown Number of Rejections",
baseline=mu)
util$plot_expectand_pushforward(samples3[['mu']], 25, 'mu',
flim=c(2.5, 4),
main="Known Number of Rejections",
baseline=mu)
util$plot_expectand_pushforward(samples2[['tau']], 25, 'tau',
flim=c(1.5, 2.75),
main="Unknown Number of Rejections",
baseline=tau)
util$plot_expectand_pushforward(samples3[['tau']], 25, 'tau',
flim=c(1.5, 2.75),
main="Known Number of Rejections",
baseline=tau)
util$plot_expectand_pushforward(samples2[['lambda']], 25, 'lambda',
flim=c(4.72, 4.82),
main="Unknown Number of Rejections",
baseline=lambda)
util$plot_expectand_pushforward(samples3[['lambda']], 25, 'lambda',
flim=c(4.72, 4.82),
main="Known Number of Rejections",
baseline=lambda)
Warning in util$plot_expectand_pushforward(samples3[["lambda"]], 25, "lambda",
: 1 posterior samples (0.0%) fell above the histogram binning.
We shouldn’t be surprised that we achieve our most precise inferences when we take advantage of all of the data available. What may be surprising is just how much more we learn when we incorporate only the total number of rejected events and not their latent values.
3.3 One-Dimensional Probabilistic Selection
As in the previous deterministic selection example let’s consider a one-dimensional latent probability distribution specified by the normal probability density function, p(y \mid \phi = (\mu, \tau) ) = \text{normal}(y \mid \mu, \tau), only this time with the location parameter \mu = -1 and scale parameter \tau = 3.
delta <- 0.1
ys <- seq(-20, 20, delta)
mu <- -1
tau <- 3
latent_pds <- dnorm(ys, mu, tau)
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
plot(ys, latent_pds, type="l", main ="p(y)", col=c_light, lwd=4,
xlab="y", xlim=c(-15, 15),
ylab="Latent Probability Density", yaxt='n', ylim=c(0, 0.25))
In this case, however, we'll consider a continuous selection function S \big( y; \psi = (\chi, \gamma) \big) = \Phi \big( \gamma \, (y - \chi) \big), where \Phi(x) = \Pi_{\text{normal}}(x; 0, 1) is the standard normal cumulative distribution function. When \gamma > 0 this selection function gradually rises as the input increases, achieving a selection probability of 1/2 at \chi. Conversely when \gamma < 0 the selection function falls as the input increases.
Here we’ll take \gamma = 0.75, so the selection function is monotonically increasing, and \chi = 2, so that the selection function is centered in the upper tail of the latent probability density function.
chi <- 2
gamma <- 0.75
selection_probs <- pnorm(gamma * (ys - chi))
plot(ys, latent_pds, type="l", main ="p(z = 1 | y)", col=c_light, lwd=4,
xlab="y", xlim=c(-15, 15),
ylab="Selection Probability", yaxt='n', ylim=c(0, 0.25))
lines(ys, 0.22 * selection_probs, col=c_dark, lwd=4)
Conveniently the normalization integral resulting from this selection model does admit a closed-form solution, \begin{align*} Z(\mu, \tau, \chi, \gamma) &= \int_{-\infty}^{+\infty} \mathrm{d} y \, S(y; \chi, \gamma) \, p(y \mid \mu, \tau) \\ &= \int_{-\infty}^{+\infty} \mathrm{d} y \, \Phi \big( \gamma \, (y - \chi) \big) \, \text{normal}(y \mid \mu, \tau) \\ &= \Phi \left( \frac{ \gamma \, (\mu - \chi) } { \sqrt{1 + (\gamma \, \tau)^{2} } } \right). \end{align*} More details on this integral are available in the Appendix.
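Before relying on this closed form we can sanity check it against a crude Monte Carlo estimate of the normalization integral, averaging the selection probability over draws from the latent probability distribution. Here is a minimal sketch in Python, included purely for illustration with hypothetical function names and only the standard library.

```python
import random
from statistics import NormalDist

std = NormalDist()  # standard normal, supplying Phi

def exact_norm(mu, tau, chi, gamma):
    """Closed-form Z = Phi(gamma * (mu - chi) / sqrt(1 + (gamma * tau)^2))."""
    return std.cdf(gamma * (mu - chi) / (1 + (gamma * tau) ** 2) ** 0.5)

def mc_norm(mu, tau, chi, gamma, N=100_000, seed=4838282):
    """Monte Carlo estimate of Z: average S(y) over latent normal draws."""
    rng = random.Random(seed)
    return sum(std.cdf(gamma * (rng.gauss(mu, tau) - chi))
               for _ in range(N)) / N

# Ground truth used in this section: mu = -1, tau = 3, chi = 2, gamma = 0.75
print(exact_norm(-1, 3, 2, 0.75), mc_norm(-1, 3, 2, 0.75))
```

The two estimates should agree to a few decimal places, anticipating the more careful comparison of exact and approximate normalizations later in this section.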
observed_pds <- (selection_probs * latent_pds)
norm <- pnorm(gamma * (mu - chi) / sqrt(1 + (gamma * tau)**2), 0, 1)
observed_pds <- observed_pds / norm
plot(ys, latent_pds, type="l", main ="p(y | z = 1)", col=c_light, lwd=4,
xlab="y", xlim=c(-15, 15),
ylab="Observed Probability Density", yaxt='n', ylim=c(0, 0.25))
lines(ys, 0.22 * selection_probs, col=c_light_highlight, lwd=4)
lines(ys, observed_pds, col=c_dark, lwd=4)
The observed probability density function exhibits a noticeable skew, but not one so large that it couldn't be reasonably well approximated by a symmetric normal probability density function. Indeed we'll see this below.
3.3.1 Simulating Data
Although we can derive an analytic normalization integral, we cannot derive an analytic cumulative distribution function for the observed probability density function. Consequently we cannot simulate data with the inverse cumulative distribution function method.
We can, however, generate data by simulating the selection process directly.
simu\_selection\_uni.stan
data {
int<lower=1> N; // Number of observations
}
transformed data {
real mu = -1; // Location of latent probability density function
real<lower=0> tau = 3; // Shape of latent probability density function
real chi = 2; // Location of selection function
real gamma = 0.75; // Shape of selection function
}
generated quantities {
// Simulate filtered data
real y[N];
real N_reject = 0;
for (n in 1:N) {
// Sample latent points until one survives the selection process
while (1) {
y[n] = normal_rng(mu, tau);
if ( bernoulli_rng( Phi(gamma * (y[n] - chi)) ) ) {
break;
} else {
N_reject += 1;
}
}
}
}
N <- 1000
simu <- stan(file="stan_programs/simu_selection_uni.stan",
iter=1, warmup=0, chains=1, data=list("N" = N),
seed=4838282, algorithm="Fixed_param")
SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: Iteration: 1 / 1 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0 seconds (Warm-up)
Chain 1: 0 seconds (Sampling)
Chain 1: 0 seconds (Total)
Chain 1:
y <- extract(simu)$y[1,]
N_reject <- extract(simu)$N_reject[1]
To double check our implementation let's compare a normalized histogram of the simulated data to the true observed probability density function. Fortunately there are no signs of inconsistencies.
plot_line_hist(y, -5, 15, 0.75, prob=TRUE, xlab="y")
lines(ys, observed_pds, col=c_dark, lwd=4)
3.3.2 Ignoring The Selection Process
Let’s begin as we did in the previous exercise by attempting to fit a model that ignores the selection process entirely.
fit\_no\_selection\_uni.stan
data {
int<lower=1> N; // Number of observations
real y[N]; // Observations
}
parameters {
real mu; // Location of latent probability density function
real<lower=0> tau; // Shape of latent probability density function
}
model {
// Prior model
mu ~ normal(0, 5 / 2.32); // ~ 99% prior mass between -5 and +5
tau ~ normal(0, 5 / 2.57); // ~ 99% prior mass between 0 and +5
// Observational model
for (n in 1:N) {
target += normal_lpdf(y[n] | mu, tau);
}
}
generated quantities {
real y_pred[N]; // Posterior predictive data
for (n in 1:N) {
y_pred[n] = normal_rng(mu, tau);
}
}
data <- list("N" = N, "y" = y)
fit <- stan(file="stan_programs/fit_no_selection_uni.stan",
data=data, seed=8438338,
warmup=1000, iter=2024, refresh=0)
There are no indications of inaccurate posterior quantification.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectands(fit)
base_samples <- util$filter_expectands(samples,
c('mu', 'tau'))
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
Perhaps surprisingly the retrodictive performance isn't terrible. If we look carefully we can see that the observed histogram skews toward smaller values slightly more than the posterior predictive histograms do.
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
pred_names <- grep('y_pred', names(samples), value=TRUE)
hist_retro(data$y, samples, pred_names, -5, 15, 1, 'y')
Warning in hist_retro(data$y, samples, pred_names, -5, 15, 1, "y"): 18
predictive values (0.0%) fell below the histogram binning.
We can see this skew a bit more clearly if we use a histogram with finer bins as our summary statistic. We can’t use too fine a histogram summary statistic, however, as the bin occupancies will become too noisy and wash out the signal. Our ability to resolve the retrodictive tension is limited by the available data.
par(mfrow=c(1, 1))
hist_retro(data$y, samples, pred_names, -5, 15, 0.5, 'y')
Warning in hist_retro(data$y, samples, pred_names, -5, 15, 0.5, "y"): 18
predictive values (0.0%) fell below the histogram binning.
If we wanted to investigate this retrodictive tension even further then we could consider summary statistics that target this shape more directly, such as an empirical skewness statistic.
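One common choice of such a statistic is the sample skewness, the third central moment standardized by the 3/2 power of the variance. A minimal sketch in Python, with a hypothetical helper name and included purely for illustration:

```python
def empirical_skewness(xs):
    """Sample skewness: third central moment over the 3/2 power
    of the second central moment."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

print(empirical_skewness([-2, -1, 0, 1, 2]))  # symmetric data: 0.0
print(empirical_skewness([1, 2, 3, 4, 4.5, 4.7, 4.75]) < 0)  # left-skewed: True
```

Evaluating such a statistic on the observed data and on each posterior predictive data set would directly target the asymmetry that the histograms only hint at.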
What is not ambiguous is how extremely misaligned the posterior inferences of the latent probability distribution configuration are with the true model configuration.
par(mfrow=c(1, 2))
util$plot_expectand_pushforward(samples[['mu']], 25, 'mu',
flim=c(-2, 4), baseline=mu)
util$plot_expectand_pushforward(samples[['tau']], 25, 'tau',
flim=c(1, 4), baseline=tau)
Our model for the latent probability distribution is itself flexible enough to contort into an okay approximation of the observed behavior. This contortion, however, will not generalize to other circumstances where the behavior of the selection process changes.
3.3.3 Modeling An Unknown Selection Behavior
In order to derive more robust inferences we’ll need to model the selection process, which then requires evaluating the normalization integral. These exercises will then give us an opportunity to compare and contrast exact evaluations to numerical approximations.
Before confronting the entire model let’s see what happens when the behavior of the latent probability distribution is known and we just need to infer the behavior of the selection function.
3.3.3.1 Exact
We’ll begin with the exact evaluation.
fit\_unknown\_selection\_uni\_exact.stan
data {
int<lower=1> N; // Number of observations
real y[N]; // Observations
int<lower=1> N_grid; // Number of latent value grid points
real y_grid[N_grid]; // Latent value grid points
}
transformed data {
// Exact latent probability distribution configuration
real mu = -1; // Location
real<lower=0> tau = 3; // Shape
}
parameters {
real chi; // Location of selection function
real gamma; // Shape of selection function
}
model {
// Filtered normalization
real norm = Phi( gamma * (mu - chi) / sqrt(1 + square(gamma * tau)) );
// Prior model
chi ~ normal(0, 3 / 2.32); // ~ 99% prior mass between -3 and +3
gamma ~ normal(0, 3 / 2.32); // ~ 99% prior mass between -3 and +3
// Observational model
for (n in 1:N) {
real log_select_prob = log( Phi(gamma * (y[n] - chi)) );
real lpdf = normal_lpdf(y[n] | mu, tau);
target += log_select_prob + lpdf - log(norm);
}
}
generated quantities {
real select_prob_grid[N_grid]; // Selection function values
real y_pred[N]; // Posterior predictive data
for (n in 1:N_grid)
select_prob_grid[n] = Phi(gamma * (y_grid[n] - chi));
for (n in 1:N) {
// Sample latent intensities until one survives the selection
// process. Use fixed iterations instead of while loop to avoid
// excessive trials when the selection function and latent density
// function are not aligned.
for (m in 1:100) {
real y_sample = normal_rng(mu, tau);
if ( bernoulli_rng( Phi(gamma * (y_sample - chi)) ) ) {
y_pred[n] = y_sample;
break;
}
}
}
}
To visualize the inferred selection function behaviors in this one-dimensional problem we can compute the inferred outputs along a grid of latent inputs.
delta <- 0.1
ys <- seq(-20, 20, delta)
data$y_grid <- ys
data$N_grid <- length(ys)
fit_exact <- stan(file="stan_programs/fit_unknown_selection_uni_exact.stan",
data=data, seed=8438338,
warmup=1000, iter=2024, refresh=0)
The diagnostics are clean.
diagnostics <- util$extract_hmc_diagnostics(fit_exact)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectands(fit_exact)
base_samples <- util$filter_expectands(samples,
c('chi', 'gamma'))
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
The posterior predictions now exhibit a mild skewness similar to what we see in the observed data. With this retrodictive agreement we have nothing to criticize.
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
pred_names <- grep('y_pred', names(samples), value=TRUE)
hist_retro(data$y, samples, pred_names, -5, 15, 1, 'y')
Warning in hist_retro(data$y, samples, pred_names, -5, 15, 1, "y"): 2
predictive values (0.0%) fell above the histogram binning.
Even better the inferred configuration of the selection function is consistent with the true configuration.
par(mfrow=c(1, 2))
util$plot_expectand_pushforward(samples[['chi']], 25, 'chi',
flim=c(1, 3), baseline=chi)
util$plot_expectand_pushforward(samples[['gamma']], 25, 'gamma',
flim=c(0.4, 1), baseline=gamma)
When the influence of the \chi and \gamma parameters on the shape of the selection function isn't immediately obvious we can also visualize the inferred selection function behaviors directly.
plot_realizations <- function(xs, fs, N=50,
x_name="", display_xlims=NULL,
y_name="", display_ylims=NULL,
title="") {
I <- dim(fs)[2]
J <- min(N, I)
plot_idx <- sapply(1:J, function(j) (I %/% J) * (j - 1) + 1)
nom_colors <- c("#DCBCBC", "#C79999", "#B97C7C",
"#A25050", "#8F2727", "#7C0000")
line_colors <- colormap(colormap=nom_colors, nshades=J)
if (is.null(display_xlims)) {
xlims <- range(xs)
} else {
xlims <- display_xlims
}
if (is.null(display_ylims)) {
ylims <- range(fs)
} else {
ylims <- display_ylims
}
plot(1, type="n", main=title,
xlab=x_name, xlim=xlims,
ylab=y_name, ylim=ylims)
for (j in 1:J) {
r_fs <- fs[, plot_idx[j]]
lines(xs, r_fs, col=line_colors[j], lwd=3)
}
}
par(mfrow=c(1, 1), mar = c(5, 4, 2, 1))
fs <- t(sapply(1:data$N_grid,
function(n) c(samples[[paste0('select_prob_grid[', n, ']')]],
recursive=TRUE)))
plot_realizations(data$y_grid, fs, 50,
x_name="y", display_xlims=c(-5, 15),
y_name="Selection Probability", display_ylims=c(0, 1))
lines(ys, selection_probs, type="l", col="white", lwd=4)
lines(ys, selection_probs, type="l", col="black", lwd=2)
3.3.3.2 Numerical Quadrature
Now we can repeat the fit using a numerical quadrature estimate of the normalization integral. Conveniently Stan implements an adaptive quadrature method for one-dimensional integrals.
fit\_unknown\_selection\_uni\_quad.stan
functions {
// Unnormalized observed probability density function
real observed_updf(real x, real xc, real[] theta, real[] x_r, int[] x_i) {
real chi = theta[1];
real gamma = theta[2];
real mu = x_r[1];
real tau = x_r[2];
return exp(normal_lpdf(x | mu, tau)) * Phi(gamma * (x - chi));
}
}
data {
int<lower=1> N; // Number of observations
real y[N]; // Observations
int<lower=1> N_grid; // Number of latent value grid points
real y_grid[N_grid]; // Latent value grid points
}
transformed data {
// Exact latent probability distribution configuration
real mu = -1; // Location
real<lower=0> tau = 3; // Shape
// Empty arrays for `integrate_1d` function
real x_r[2] = {mu, tau};
int x_i[0];
}
parameters {
real chi; // Location of selection function
real gamma; // Shape of selection function
}
model {
// Numerical quadrature estimator of normalization integral
real theta[2] = {chi, gamma};
real norm = integrate_1d(observed_updf,
negative_infinity(), positive_infinity(),
theta, x_r, x_i, 1e-8);
// Prior model
chi ~ normal(0, 3 / 2.32); // ~ 99% prior mass between -3 and +3
gamma ~ normal(0, 3 / 2.32); // ~ 99% prior mass between -3 and +3
// Observational model
for (n in 1:N) {
real log_select_prob = log( Phi(gamma * (y[n] - chi)) );
real lpdf = normal_lpdf(y[n] | mu, tau);
target += log_select_prob + lpdf - log(norm);
}
}
generated quantities {
real norm_error; // Quadrature normalization errors
real select_prob_grid[N_grid]; // Selection function values
real y_pred[N]; // Posterior predictive data
{
real theta[2] = {chi, gamma};
real quad_norm = integrate_1d(observed_updf,
negative_infinity(),
positive_infinity(),
theta, x_r, x_i, 1e-8);
real exact_norm = Phi( gamma * (mu - chi)
/ sqrt(1 + square(gamma * tau)) );
norm_error = quad_norm - exact_norm;
}
for (n in 1:N_grid)
select_prob_grid[n] = Phi(gamma * (y_grid[n] - chi));
for (n in 1:N) {
// Sample latent intensities until one survives the selection
// process. Use fixed iterations instead of while loop to avoid
// excessive trials when the selection function and latent density
// function are not aligned.
for (m in 1:100) {
real y_sample = normal_rng(mu, tau);
if ( bernoulli_rng( Phi(gamma * (y_sample - chi)) ) ) {
y_pred[n] = y_sample;
break;
}
}
}
}
fit <- stan(file="stan_programs/fit_unknown_selection_uni_quad.stan",
data=data, seed=8438338,
warmup=1000, iter=2024, refresh=0)
Large numerical error can often result in inconsistent model evaluations, which in turn can compromise the performance of Markov chain Monte Carlo. Fortunately that doesn't seem to be the case here.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectands(fit)
base_samples <- util$filter_expectands(samples,
c('chi', 'gamma'))
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
Indeed the error in the adaptive numerical quadrature is at the level of double floating point precision! We can even see the discretization that arises from subtracting two numbers whose difference is close to the double floating point precision threshold.
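To see where this discretization comes from, consider the spacing of double precision floating point numbers directly. The following standalone Python sketch, independent of the Stan workflow above, shows that the difference of two numbers near 1 is quantized in units of one unit in the last place, about 2.2e-16.

```python
import math

# Spacing between adjacent double-precision values near 1.0; differences
# of numbers close to 1 are quantized in units of this size.
ulp_at_one = math.ulp(1.0)
print(ulp_at_one)  # 2.220446049250313e-16

# Subtracting two nearly equal numbers exposes the discretization: the
# possible differences form a lattice with spacing equal to the ulp.
diffs = {(1.0 + k * ulp_at_one) - 1.0 for k in range(5)}
print(sorted(diffs))
```

Because each `1.0 + k * ulp_at_one` is exactly representable and the subtraction is exact, the recovered differences land exactly on multiples of the ulp, which is precisely the banding visible in the histogram of `norm_error`.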
par(mfrow=c(1, 1), mar = c(5, 5, 2, 1))
util$plot_expectand_pushforward(samples[['norm_error']], 25,
'Quadrature Norm - Exact Norm')
Because the error in evaluating the normalization integral is so small we see the same retrodictive behavior as in the exact fit.
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
pred_names <- grep('y_pred', names(samples), value=TRUE)
hist_retro(data$y, samples, pred_names, -5, 15, 1, 'y')
Warning in hist_retro(data$y, samples, pred_names, -5, 15, 1, "y"): 2
predictive values (0.0%) fell above the histogram binning.
Similarly the posterior inferences appear to be the same.
par(mfrow=c(2, 2))
samples_exact <- util$extract_expectands(fit_exact)
util$plot_expectand_pushforward(samples_exact[['chi']], 25, 'chi',
flim=c(1, 3), baseline=chi,
main="Exact Normalization")
util$plot_expectand_pushforward(samples[['chi']], 25, 'chi',
flim=c(1, 3), baseline=chi,
main="Numerical Quadrature\nNormalization")
util$plot_expectand_pushforward(samples_exact[['gamma']], 25, 'gamma',
flim=c(0.4, 1), baseline=gamma,
main="Exact Normalization")
util$plot_expectand_pushforward(samples[['gamma']], 25, 'gamma',
flim=c(0.4, 1), baseline=gamma,
                                  main="Numerical Quadrature\nNormalization")
3.3.3.3 Monte Carlo
How do Monte Carlo estimates of the normalization integral behave? Let’s start with a relatively small Monte Carlo ensemble.
Recall that we have to generate a single Monte Carlo ensemble, here in the transformed data block, which we then use to construct all Monte Carlo estimates of the normalization integral. Conditioned on this particular ensemble, the Monte Carlo estimates define a fully deterministic model approximation, with evaluations at the same selection function configuration always returning the same output.
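Before turning to the Stan implementation it may help to see the estimator in isolation. The following self-contained Python sketch draws a single fixed ensemble from the latent normal density and averages the selection probabilities to estimate the normalization integral, comparing against the closed-form answer. The selection configuration `chi = 2.0, gamma = 0.75` is chosen purely for illustration.

```python
import math
import random
import statistics

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Latent data generating process configuration, matching the Stan program
mu, tau = -1.0, 3.0

# Hypothetical selection function configuration, for illustration only
chi, gamma = 2.0, 0.75

# Single, fixed Monte Carlo ensemble drawn from the latent density
random.seed(8438338)
N_MC = 100
y_MC = [random.gauss(mu, tau) for _ in range(N_MC)]

# Monte Carlo estimate of the normalization integral,
#   norm = E[ S(y) ] with S(y) = Phi(gamma * (y - chi)),
# along with its Monte Carlo standard error
select_probs = [Phi(gamma * (y - chi)) for y in y_MC]
mc_norm = statistics.mean(select_probs)
mcse = math.sqrt(statistics.variance(select_probs) / N_MC)

# Closed-form normalization for this normal-CDF selection function
exact_norm = Phi(gamma * (mu - chi) / math.sqrt(1.0 + (gamma * tau)**2))

print(f"MC estimate = {mc_norm:.4f} +/- {mcse:.4f}")
print(f"Exact       = {exact_norm:.4f}")
```

Because the ensemble is drawn once and then frozen, repeated evaluations of `mc_norm` at the same `(chi, gamma)` always return the same value, just as in the transformed data construction below.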
fit_unknown_selection_uni_mc.stan
functions {
  real compute_mc_norm(real[] y_MC, real chi, real gamma) {
    int N_MC = size(y_MC);
    real select_probs[N_MC];
    for (n in 1:N_MC)
      select_probs[n] = Phi(gamma * (y_MC[n] - chi));
    return mean(select_probs);
  }

  real compute_MCSE(real[] y_MC, real chi, real gamma) {
    int N_MC = size(y_MC);
    real select_probs[N_MC];
    for (n in 1:N_MC)
      select_probs[n] = Phi(gamma * (y_MC[n] - chi));
    return sqrt(variance(select_probs) / N_MC);
  }
}

data {
  int<lower=1> N;      // Number of observations
  real y[N];           // Observations

  int<lower=1> N_grid; // Number of latent value grid points
  real y_grid[N_grid]; // Latent value grid points

  int<lower=1> N_MC;   // Size of Monte Carlo ensemble
}

transformed data {
  // Exact latent probability distribution configuration
  real mu = -1;          // Location
  real<lower=0> tau = 3; // Shape

  // Generate Monte Carlo ensemble
  real y_MC[N_MC];
  for (n in 1:N_MC)
    y_MC[n] = normal_rng(mu, tau);
}

parameters {
  real chi;   // Location of selection function
  real gamma; // Shape of selection function
}

model {
  // Monte Carlo estimator of normalization integral
  real norm = compute_mc_norm(y_MC, chi, gamma);

  // Prior model
  chi ~ normal(0, 3 / 2.32);   // ~ 99% prior mass between -3 and +3
  gamma ~ normal(0, 3 / 2.32); // ~ 99% prior mass between -3 and +3

  // Observational model
  for (n in 1:N) {
    real log_select_prob = log( Phi(gamma * (y[n] - chi)) );
    real lpdf = normal_lpdf(y[n] | mu, tau);
    target += log_select_prob + lpdf - log(norm);
  }
}

generated quantities {
  // Exact Monte Carlo error
  real norm_error;

  // Monte Carlo standard errors for normalization estimation
  real MCSE = compute_MCSE(y_MC, chi, gamma);

  real select_prob_grid[N_grid]; // Selection function values
  real y_pred[N];                // Posterior predictive data

  {
    real exact_norm = Phi(  gamma * (mu - chi)
                          / sqrt(1 + square(gamma * tau)) );
    real norm = compute_mc_norm(y_MC, chi, gamma);
    norm_error = norm - exact_norm;
  }

  for (n in 1:N_grid)
    select_prob_grid[n] = Phi(gamma * (y_grid[n] - chi));

  for (n in 1:N) {
    // Sample latent intensities until one survives the selection
    // process.  Use fixed iterations instead of while loop to avoid
    // excessive trials when the selection function and latent density
    // function are not aligned.
    for (m in 1:100) {
      real y_sample = normal_rng(mu, tau);
      if ( bernoulli_rng( Phi(gamma * (y_sample - chi)) ) ) {
        y_pred[n] = y_sample;
        break;
      }
    }
  }
}
data$N_MC <- 100
fit <- stan(file="stan_programs/fit_unknown_selection_uni_mc.stan",
data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)
The diagnostics are clean.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectands(fit)
base_samples <- util$filter_expectands(samples,
c('chi', 'gamma'))
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
The exact error of the Monte Carlo normalization estimate is much larger than it was for the adaptive numerical quadrature estimator, although it is not huge in absolute terms.
par(mfrow=c(1, 1), mar = c(5, 5, 2, 1))
util$plot_expectand_pushforward(samples[['norm_error']], 25,
                                'Monte Carlo Norm - Exact Norm')
Perhaps most importantly, the estimated Monte Carlo standard error provides a good estimate of the magnitude of the exact error. This makes the Monte Carlo standard error particularly important in applications where we don’t know the exact normalization and hence cannot compute the exact error.
par(mfrow=c(1, 1), mar = c(5, 5, 2, 1))
plot(abs(c(samples[['norm_error']], recursive=TRUE)),
c(samples[['MCSE']], recursive=TRUE),
col=c_dark, pch=16,
xlim=c(0, 0.01), xlab="Magnitude of Exact Error",
ylim=c(0, 0.05), ylab="Monte Carlo Standard Error")
lines(c(0, 1), c(0, 1), col="#DDDDDD", lwd=2, lty=2)
Note that the precise behavior we see here, where the Monte Carlo standard error conservatively upper bounds the exact error, is an accident of the particular Monte Carlo ensemble we drew. Changing the seed argument in the stan call will result in a different Monte Carlo ensemble whose Monte Carlo estimators could either overshoot or undershoot the exact error. That said, the Monte Carlo standard error will usually be within a factor of five of the exact error.
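This ensemble-to-ensemble variability is easy to demonstrate outside of Stan. The following Python sketch, with the same hypothetical configuration `chi = 2.0, gamma = 0.75` used only for illustration, draws ten independent ensembles of size 100 and compares each Monte Carlo standard error to the exact error it is trying to bound.

```python
import math
import random
import statistics

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mu, tau = -1.0, 3.0     # Latent density configuration
chi, gamma = 2.0, 0.75  # Hypothetical selection configuration
exact_norm = Phi(gamma * (mu - chi) / math.sqrt(1.0 + (gamma * tau)**2))

# For each ensemble compute the ratio of the Monte Carlo standard error
# to the exact error; values above 1 overshoot, values below 1 undershoot.
ratios = []
for seed in range(10):
    rng = random.Random(seed)
    probs = [Phi(gamma * (rng.gauss(mu, tau) - chi)) for _ in range(100)]
    error = abs(statistics.mean(probs) - exact_norm)
    se = math.sqrt(statistics.variance(probs) / 100)
    ratios.append(se / error)

print([round(r, 2) for r in ratios])
```

The ratios scatter both above and below one, confirming that the conservative behavior in the plot above is a property of one particular ensemble rather than of the estimator itself.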
The retrodictive performance has degraded substantially. Because we’ve changed only how we estimate the normalization integral, this degradation must be due to the Monte Carlo estimator error being too large.
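The natural remedy is a larger ensemble, but the returns diminish quickly. This Python sketch, again with the illustrative configuration `chi = 2.0, gamma = 0.75`, shows the characteristic square root decay of the Monte Carlo standard error with ensemble size.

```python
import math
import random
import statistics

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mu, tau = -1.0, 3.0     # Latent density configuration
chi, gamma = 2.0, 0.75  # Hypothetical selection configuration

def mcse(N_MC, seed):
    """Monte Carlo standard error of the normalization estimate."""
    rng = random.Random(seed)
    probs = [Phi(gamma * (rng.gauss(mu, tau) - chi)) for _ in range(N_MC)]
    return math.sqrt(statistics.variance(probs) / N_MC)

# The standard error decays like 1 / sqrt(N_MC): a hundred-fold
# increase in ensemble size buys only a ten-fold error reduction.
for N_MC in [100, 10000]:
    print(N_MC, mcse(N_MC, seed=1))
```

In other words, pushing the normalization error down to the level achieved by the adaptive quadrature estimator would require astronomically large ensembles.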
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
pred_names <- grep('y_pred', names(samples), value=TRUE)
hist_retro(data$y, samples, pred_names, -5, 15, 1, 'y')
Warning in hist_retro(data$y, samples, pred_names, -5, 15, 1, "y"): 1
predictive values (0.0%) fell above the histogram binning.
We can confirm this by examining the marginal posterior distributions, which have both shifted away from the true model configuration.
par(mfrow=c(2, 2))
util$plot_expectand_pushforward(samples_exact[['chi']], 25, 'chi',
flim=c(0.5, 3), baseline=chi,
main="Exact Normalization")
util$plot_expectand_pushforward(samples[['chi']], 25, 'chi',
flim=c(0.5, 3), baseline=chi,
main="Monte Carlo Normalization, N = 100")
util$plot_expectand_pushforward(samples_exact[['gamma']], 25, 'gamma',
flim=c(0.4, 1.25), baseline=gamma,
main="Exact Normalization")
util$plot_expectand_pushforward(samples[['gamma']], 25, 'gamma',
flim=c(0.4, 1.25), baseline=gamma,
main="Monte Carlo Normalization, N = 100")