Taylor approximation is a powerful and general strategy for modeling the behavior of a function within a local neighborhood of inputs. The utility of this strategy, however, can be limited when the output of the target function is constrained to some subset of values. In this case study we'll see how Taylor approximations can be combined with transformations from the constrained output space to an unconstrained output space and back to robustly model the local behavior of constrained functions.
We will begin by examining some of the limitations of directly Taylor approximating constrained functions before demonstrating how these limitations can largely be avoided by removing the constraint before constructing a Taylor approximation and then incorporating it back afterwards. Next we will discuss in detail two common constraints and implement those insights with explicit examples and then finish off with a short discussion of multi-dimensional function approximation.
In this section we'll explore the difficulties that can arise when constructing Taylor approximations of constrained functions and how we can avoid some of those difficulties with the application of a link function and its inverse. Finally we'll discuss one of the key properties that makes a link function particularly useful in practical modeling.
I present Taylor approximation theory in great depth, perhaps even too much depth, in Section 1 of my Taylor modeling case study. To summarize: a Taylor approximation captures the behavior of a real-valued function \(f : X \rightarrow \mathbb{R}\) in a local neighborhood \(U\) around some baseline input \(x_{0}\), \[ f_{I}(x; x_{0}) \approx f(x) \] for \(x \in U\).
More precisely a Taylor approximation uses the differential structure of \(f\) at \(x_{0}\) to inform an \(I\)th-order polynomial function that approximates the exact functional behavior within this local neighborhood. A good approximation can be engineered by building a sufficiently high-order polynomial or restricting the local neighborhood to a narrow interval of inputs around \(x_{0}\).
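For a one-dimensional input this takes the familiar form \[ f_{I}(x; x_{0}) = \sum_{i = 0}^{I} \frac{1}{i!} \frac{\mathrm{d}^{i} f}{\mathrm{d} x^{i}}(x_{0}) \, (x - x_{0})^{i}, \] with the multi-dimensional generalization contracting the partial derivatives at \(x_{0}\) against products of the component perturbations.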
Because of this local context Taylor approximations can theoretically be applied to any real-valued function, including those whose outputs are confined to some subset of the real line, \(f : X \rightarrow V \subset \mathbb{R}\).
While the polynomial functions that form a Taylor approximation have unconstrained outputs, the outputs within a small enough local neighborhood will satisfy the given constraint.
If the neighborhood is too large, however, then evaluations of the Taylor approximation at some inputs will return outputs that violate the constraint.
When the baseline \(x_{0}\) is close to the constraint boundary the constraint can be violated even when the absolute approximation error is small.
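To make this concrete, here is a minimal R sketch assuming the hypothetical positive function \(f(x) = \log(1 + \exp(x))\): a first-order Taylor approximation built around a baseline near the positivity boundary returns negative outputs only a short distance away, while the same construction far from the boundary stays positive.
# A minimal sketch, assuming the hypothetical positive function
# f(x) = log(1 + exp(x)) and its first-order Taylor approximation.
f <- function(x) log(1 + exp(x))
df <- function(x) 1 / (1 + exp(-x))                    # First derivative of f
taylor1 <- function(x, x0) f(x0) + df(x0) * (x - x0)   # First-order Taylor approximation
taylor1(-4, x0=-2)  # Approximately -0.11; violates the positivity constraint
taylor1( 0, x0= 2)  # Approximately  0.37; still positive far from the boundary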
In other words enforcing compatibility between a specific Taylor approximation \(f_{I}(x; x_{0})\) and an output constraint restricts the geometry of the local neighborhoods. If we know the local differential structure of \(f\) then we may be able to explicitly work out what inputs will lead to constraint-violating Taylor approximation outputs and then craft appropriate local neighborhoods. When we are inferring that local differential structure, however, establishing local neighborhoods that avoid constraint violations becomes much more difficult.
Output constraints also have a strong influence on the local error of Taylor approximations. Constrained functions that are smooth need to be highly nonlinear near a constraint boundary in order to contort their outputs away from the boundary and avoid violating the constraint.
Because of this nonlinearity Taylor approximations will behave very differently when the baseline input \(x_{0}\) is close to a constraint boundary and when it is far away. Away from the boundary constrained functions will tend to be more linear which makes Taylor approximations of a fixed order \(I\) more accurate. This then allows the Taylor approximation to be employed over a wider local neighborhood. Near a constraint boundary, however, constrained functions will tend to exhibit stronger nonlinearities which introduce large approximation errors that require smaller local neighborhoods.
Alternatively if we need to fix the local neighborhood then this varying curvature will require carefully tuning the order of the Taylor approximation for each baseline \(x_{0}\).
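A quick numerical comparison, again assuming the hypothetical function \(f(x) = \log(1 + \exp(x))\), illustrates this sensitivity: over neighborhoods of the same width the relative error of a first-order Taylor approximation is an order of magnitude larger when the baseline sits near the positivity boundary.
# Relative first-order approximation error over a fixed-width neighborhood,
# assuming the same hypothetical positive function as in the sketch above.
f <- function(x) log(1 + exp(x))
df <- function(x) 1 / (1 + exp(-x))
taylor1 <- function(x, x0) f(x0) + df(x0) * (x - x0)
xs <- seq(-0.5, 0.5, 0.1)
max(abs(f(-2 + xs) - taylor1(-2 + xs, -2)) / f(-2 + xs))  # Near the boundary, roughly 0.15
max(abs(f( 2 + xs) - taylor1( 2 + xs,  2)) / f( 2 + xs))  # Far from the boundary, roughly 0.009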
The strong sensitivity to the baseline input substantially complicates the implementation of Taylor approximations for constrained functions. Unless we need to evaluate the constrained function only far from the constraint boundaries, direct Taylor approximation will be at best a fragile approach to modeling the exact functional behavior.
If Taylor approximating constrained functions is so difficult then why don't we just eliminate the constraint before building a Taylor approximation in the first place?
Consider a one-to-one function that maps the constrained output space to the entire real line, \[ g : V \rightarrow \mathbb{R}. \] The function \(g\) is referred to as a link function because it "links" the nominal constrained output to an unconstrained output.
Composing the link function with the constrained function of interest defines a completely unconstrained function, \[ \begin{alignat*}{6} g \circ f :\; &X& &\rightarrow& \; &\mathbb{R}& \\ &x& &\mapsto& &g(f(x))&. \end{alignat*} \] Without any output constraints \(g \circ f\) should be much easier to approximate with a Taylor approximation, \[ g \circ f \approx (g \circ f)_{I}(x; x_{0}). \]
That said our model depends on the function \(f\), not the composed function \(g \circ f\). In order to incorporate this Taylor approximation into our model we need to undo the action of the link function. Mathematically we achieve this by applying the inverse link function, \[ \begin{alignat*}{6} g^{-1} :\; &\mathbb{R}& &\rightarrow& \; &V& \\ &w& &\mapsto& &g^{-1}(w)&, \end{alignat*} \] to the unconstrained composition, \[ \begin{align*} f &= \mathbb{I} \circ f \\ &= (g^{-1} \circ g) \circ f \\ &= g^{-1} \circ (g \circ f). \end{align*} \] Because we required that the link function is one-to-one this inverse function will always be well-defined.
Substituting our Taylor approximation for the unconstrained composition \(g \circ f\) then gives a general Taylor approximation \[ \begin{align*} f &= g^{-1} \circ (g \circ f) \\ &\approx g^{-1} \circ (g \circ f)_{I}, \end{align*} \] or for a given input, \[ f(x) \approx g^{-1} \left( (g \circ f)_{I}(x; x_{0}) \right). \] This construction results in a local functional model that always respects the output constraint regardless of the chosen input neighborhood. Moreover, because the link function is one-to-one we don't lose any information going to the constrained space and back.
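As a concrete sketch, assume the hypothetical positive function \(f(x) = \exp(-x^{2})\) and the log link: Taylor approximating \(\log \circ f\) and pushing the result back through the exponential yields a local model that is positive for every input, no matter how far we stray from the baseline.
# A first-order general Taylor approximation, assuming the hypothetical
# positive function f(x) = exp(-x^2) and the log link function.
f <- function(x) exp(-x^2)
x0 <- 0.5
grad_log_f <- -2 * x0                        # Derivative of log(f(x)) = -x^2 at x0
gen_taylor1 <- function(x) exp(log(f(x0)) + grad_log_f * (x - x0))
gen_taylor1(seq(-2, 2, 1))                   # Outputs respect the positivity constraint everywhere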
A well-chosen link function can also have the added benefit of warping the constrained function, allowing us to better resolve all of the rapidly changing behavior near the output boundaries. When the composite function \(g \circ f\) exhibits more uniform curvature the Taylor approximation \((g \circ f)_{I}(x; x_{0})\) will be much less sensitive to the choice of input baseline and hence much easier to wield in practice.
If we need to model functional behavior in only a small neighborhood of inputs, with output values far away from the constraint boundaries, then a Taylor approximation model can be directly applicable. When we need to consider wider ranges of input values, or perhaps more realistically when we don't know what range of input values we might need to consider, then a general Taylor approximation becomes a more robust tool.
Mechanically the warping of any polynomial function, with or without the Taylor approximation interpretation, through an inverse link function is often referred to as general linear modeling or generalized linear modeling [1]. The same construction with piece-wise polynomial models is also sometimes referred to as general(ized) additive modeling. That said the use of "linear" and "additive" in this terminology can be confusing for the same reasons discussed in Section 2.3.3 of the Taylor modeling case study.
Given a link function \(g: V \rightarrow \mathbb{R}\) and any one-to-one function \(h : \mathbb{R} \rightarrow \mathbb{R}\) that maps onto the entire real line the composition \[ \begin{alignat*}{6} h\circ g :\; &V& &\rightarrow& \; &\mathbb{R}& \\ &y& &\mapsto& &h(g(y))& \end{alignat*} \] will also define a valid link function that we can use to construct a general Taylor approximation. In other words there will be an infinite number of link functions compatible with any given output constraint. This then raises the question of what properties separate the link functions that are more useful in practice from those that are less useful.
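For example, composing the log link for a positivity constraint with the hypothetical bijection \(h(w) = w^{3}\) yields another perfectly valid, if less convenient, link function.
# Another admissible link function for positive outputs, built by composing
# the log link with the hypothetical bijection h(w) = w^3.
g2     <- function(v) log(v)^3                       # Maps (0, Inf) one-to-one onto the real line
g2_inv <- function(w) exp(sign(w) * abs(w)^(1 / 3))  # Corresponding inverse link
vs <- c(0.1, 0.5, 1, 2, 10)
g2_inv(g2(vs))                                       # Recovers the original positive inputs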
When working with tools like Hamiltonian Monte Carlo smoothness is a virtue so at the very least we want our link functions to be as differentiable as possible. That still leaves, however, an infinite number of possibilities.
From the Taylor approximation perspective the ideal link function is one that converts the initial constrained function into a polynomial function, \[ g \circ f(x) = \sum_{i = 0}^{I} \beta_{i} x^{i} \] that can be reconstructed exactly with an \(I\)th-order Taylor approximation. When we are trying to infer the constrained function \(f\), however, we typically won't have the information needed to even attempt to engineer such a link function.
A more practical consideration for Bayesian modeling is how a link function transforms the additive structure of a Taylor approximation. Recall that a Taylor approximation is built up from a sum of terms; at zeroth-order we have a constant intercept, at first-order we have the component covariate perturbations contracted against a one-dimensional array of parameters, \[ \Delta \mathbf{x}^{T} \cdot \boldsymbol{\beta}, \] and so on. The contractions that give these higher-order terms can also be written as sums of intermediate components, \[ \Delta \mathbf{x}^{T} \cdot \boldsymbol{\beta} = \Delta x_{1} \, \beta_{1} + \Delta x_{2} \, \beta_{2} + \ldots. \]
Because of this additive structure each of the summands influences the Taylor approximation independently of the others: the term \(\Delta x_{2} \, \beta_{2}\) will influence \(f_{I}(x; x_{0})\) in the same way regardless of the value of \(\Delta x_{1} \, \beta_{1}\) or any of the other terms in the Taylor approximation. This allows us to independently elicit domain expertise for each of the individual terms, greatly facilitating prior modeling.
A nonlinear inverse link function, however, convolves all of these contributions together so that they can no longer be reasoned about separately. Because \[ g^{-1}(\alpha + \Delta \alpha) \ne g^{-1}(\alpha) + g^{-1}(\Delta \alpha) \] the influence of the second term \(\Delta \alpha\) on the overall constrained functional output will change depending on the configuration of the first term \(\alpha\). In order to build a principled prior model for both contributions \(\alpha\) and \(\Delta \alpha\) we have to reason about their joint behavior, complicating the modeling process. Building a principled prior model for the many contributions in even a low-order Taylor approximation can quickly become overwhelming.
That said while a nonlinear inverse link function will not preserve additive structure it might translate it into some other interpretable, algebraic structure. In this case we can use that translation to connect our domain expertise to the configuration of each individual term in the latent Taylor approximation.
More formally if \[ g^{-1}(\alpha + \Delta \alpha) = g^{-1}(\alpha) \ast g^{-1}(\Delta \alpha) \] for some interpretable, binary operation \(\ast\) then we may be able to translate our domain expertise about the constrained functional behavior to each term \(g^{-1}(\alpha)\) and \(g^{-1}(\Delta \alpha)\) and then through the link function back to \(\alpha\) and \(\Delta \alpha\). Although still more complicated than prior modeling for an untransformed Taylor approximation this decomposition will be a huge benefit when we cannot take the output constraints for granted.
This preservation of algebraic structure is an exceptional property that few link functions will exhibit; indeed, for a given constraint there is often only a single link function that preserves an interpretable algebraic structure.
In this section we'll consider two common output constraints -- positive and interval -- and the link functions that are naturally compatible with them.
Let's start with a positive constraint \(V = \mathbb{R}^{+}\) that excludes negative output values. In other words we will be interested in modeling functional behavior of the form \(f : X \rightarrow \mathbb{R}^{+}\).
Given the positivity constraint we will need to engineer a link function of the form \(g : \mathbb{R}^{+} \rightarrow \mathbb{R}\). An immediate candidate is one of the most ubiquitous functions in applied mathematics: the natural logarithm function, \[ \begin{alignat*}{6} \log :\; &\mathbb{R}^{+}& &\rightarrow& \; &\mathbb{R}& \\ &v& &\mapsto& &\log(v)&. \end{alignat*} \] This infinitely differentiable log function maps inputs larger than \(1\) to positive real numbers and inputs smaller than \(1\) to negative real numbers, with \(1\) falling right in the middle at zero.
The corresponding inverse is given by the natural exponential function, \[ \begin{alignat*}{6} \exp :\; &\mathbb{R}& &\rightarrow& \; &\mathbb{R}^{+}& \\ &w& &\mapsto& &\exp(w)&. \end{alignat*} \] Here all positive inputs are mapped to the interval \((1, \infty)\) while all negative inputs are mapped to \((0, 1)\), with \(0\) being mapped in between to \(1\).
In this circumstance I often think of the exponential function as the result of a line that has been bent over to satisfy the positivity constraint, although this picture is admittedly a bit subjective.
Ubiquity alone does not make the log function a useful link function. What really matters is how it transforms addition. Conveniently the exponential inverse link function maps addition into the next simplest operation: multiplication, \[ \exp(\alpha + \Delta \alpha) = \exp(\alpha) \cdot \exp(\Delta \alpha). \] This allows us to reason about each contribution as independent scalings of the positive functional output.
Because the exponential inverse link function maps addition to multiplication we can reason about the functional output of a general Taylor approximation \[ \exp \left( \alpha + \sum_{j = 0}^{J} \Delta \alpha_{j} \right) \] as a baseline output \[ \exp \left( \alpha \right) \] that is proportionally scaled by the remaining terms, \[ \exp \left( \alpha \right) \, \prod_{j = 0}^{J} \exp \left( \Delta \alpha_{j} \right). \] In other words we can elicit domain expertise about each term in the latent Taylor approximation by reasoning about the possible proportional changes that the term induces.
For example let's say that we are interested in modeling the relationship between an organism's body mass, \(m\), and the temperature of its environment, \(x\), with the general, first-order Taylor model \[ m(x) = \exp \left( \alpha + \beta \, (x - x_{0}) \right). \]
Here the exponentiated intercept defines the baseline body mass. Any domain expertise about the range of reasonable baselines informs the distribution of \(\exp(\alpha)\) which then informs the distribution of \(\alpha\).
The linear term then allows the environmental temperature to scale this baseline value up or down. Eliciting domain expertise about the reasonable proportional changes directly informs \(\exp \left( \beta \, (x - x_{0}) \right)\) which we can then use to inform a prior model for \(\beta\).
For instance if even extreme environmental changes can modify the baseline body mass by only a factor of two then the reasonable values for this term will concentrate within the interval \[ \frac{1}{2} \lessapprox \exp \left( \beta \, (x - x_{0}) \right) \lessapprox 2. \] This implies that reasonable values for the linear argument concentrate within the interval \[ - \log 2 \lessapprox \beta \, (x - x_{0}) \lessapprox \log 2. \] If environmental temperatures can vary by at most ten degrees Celsius \[ | x - x_{0} | \le 10 \, C \] then \[ \begin{align*} - \frac{\log 2}{10 \, C} &\lessapprox \beta \lessapprox \frac{\log 2}{10 \, C} \\ - 0.069 \, C^{-1} &\lessapprox \beta \lessapprox 0.069 \, C^{-1}. \end{align*} \] Contrast this to the \(\text{normal}(0, 1)\) or even \(\text{normal}(0, 100)\) component prior models that are often recommended as defaults.
More informative domain expertise would lead to even tighter constraints. If we knew that body mass is unlikely to change by more than 10% then we would need a prior model that constrains \[ 0.9 \lessapprox \exp \left( \beta \, (x - x_{0}) \right) \lessapprox 1.1, \] which requires a prior model for the linear argument that concentrates within the interval \[ \begin{align*} - \log 1.1 &\lessapprox \beta \, (x - x_{0}) \lessapprox \log 1.1 \\ - 0.1 &\lessapprox \beta \, (x - x_{0}) \lessapprox 0.1. \end{align*} \] When environmental temperatures can vary by at most ten degrees Celsius the reasonable values for the slope would have to concentrate in the interval \[ \begin{align*} - \frac{0.1}{10 \, C} &\lessapprox \beta \lessapprox \frac{0.1}{10 \, C} \\ - 0.01 \, C^{-1} & \lessapprox \beta \lessapprox 0.01 \, C^{-1} \end{align*} \] which, despite the relatively weak domain expertise that we have elicited, is even narrower than most default prior models!
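The arithmetic in both of these elicitations is easy to reproduce in R, assuming the factor-of-two and ten-percent proportional limits and the ten-degree temperature range described above.
# Reproducing the slope bounds implied by the elicited proportional changes,
# assuming temperature perturbations of at most ten degrees Celsius.
max_delta_x <- 10                  # Maximum |x - x0| in degrees Celsius
log(2)   / max_delta_x             # Factor-of-two limit: approximately 0.069 per degree
log(1.1) / max_delta_x             # Ten-percent limit:   approximately 0.01  per degree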
The ability to reason about proportional changes makes the log link function, and the exponential inverse link function that complements it, incredibly powerful in practice.
That said the nonlinear nature of the exponential inverse link function does complicate this component prior modeling strategy a bit. Recall that the pushforward prior model for the output of a Taylor approximation is amplified relative to the component prior models for each of its individual terms. For example if the component prior models for the individual terms are given by \[ \alpha_{j} \sim \text{normal}(0, \tau) \] then the pushforward prior model for their sum will be inflated by a factor of \(\sqrt{J}\), \[ \sum_{j = 1}^{J} \alpha_{j} \sim \text{normal}(0, \sqrt{J} \, \tau). \] When the input space is high-dimensional or the order of the Taylor approximation is large there will be many individual terms and this inflation can be substantial.
In a general Taylor approximation the inverse link function amplifies this inflation which can make it even more problematic. Pushing the latent distribution of Taylor approximation outputs through the exponential inverse link function stretches the distribution out towards larger and larger values which can quickly transcend our domain expertise.
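A quick Monte Carlo sketch, assuming hypothetical values of \(J\) and \(\tau\), shows both effects: summing independent \(\text{normal}(0, \tau)\) terms inflates the latent scale by \(\sqrt{J}\), and pushing that sum through the exponential inverse link stretches the upper tail of the output distribution dramatically.
# Monte Carlo demonstration of the sqrt(J) inflation and its amplification
# by the exponential inverse link function (hypothetical J and tau).
set.seed(548675)
J <- 9; tau <- 0.5
sums <- colSums(matrix(rnorm(J * 10000, 0, tau), nrow=J))
sd(sums)                           # Close to sqrt(J) * tau = 1.5
quantile(exp(sums), c(0.5, 0.99))  # The exponential stretches the upper tail far beyond the median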
Consequently when using a component prior model for the parameters in a general Taylor approximation we have to be vigilant in investigating prior pushforward checks for the distribution of constrained function outputs. If the pushforward distribution extends too far into extreme values then we have to go back and narrow the latent component prior models as necessary.
To demonstrate the application of the log link function let's consider the regression of a discrete variate that can take only positive integer values. An immediate candidate for the probabilistic variation in this case is a Poisson distribution whose intensity parameter is given by a function of the covariates, \[ \pi(y \mid x, \theta) = \text{Poisson}(y \mid \lambda(x, \theta)). \]
We start by configuring our local R environment.
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")
c_light_teal <- c("#6B8E8E")
c_mid_teal <- c("#487575")
c_dark_teal <- c("#1D4F4F")
library(colormap)
nom_colors <- c("#DCBCBC", "#C79999", "#B97C7C", "#A25050", "#8F2727", "#7C0000")
par(family="CMU Serif", las=1, bty="l", cex.axis=1, cex.lab=1, cex.main=1,
xaxs="i", yaxs="i", mar = c(5, 5, 3, 5))
set.seed(29483833)
library(rstan)
rstan_options(auto_write = TRUE) # Cache compiled Stan programs
options(mc.cores = parallel::detectCores()) # Parallelize chains
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
util <- new.env()
source('stan_utility.R', local=util)
Next we'll define all of the useful visualization functions that were first introduced in the Taylor modeling case study.
Click for the definition of all of the visualization functions.
plot_line_hist <- function(s, min_val, max_val, delta, xlab, main="") {
bins <- seq(min_val, max_val, delta)
B <- length(bins) - 1
idx <- rep(1:B, each=2)
x <- sapply(1:length(idx),
function(b) if(b %% 2 == 1) bins[idx[b]] else bins[idx[b] + 1])
x <- c(min_val - 10, x, max_val + 10)
counts <- hist(s, breaks=bins, plot=FALSE)$counts
y <- counts[idx]
y <- c(0, y, 0)
ymax <- max(y) + 1
plot(x, y, type="l", main=main, col="black", lwd=2,
xlab=xlab, xlim=c(min_val, max_val),
ylab="Counts", yaxt='n', ylim=c(0, ymax))
}
marginal_location_median <- function(f, name, bin_min, bin_max, B) {
if (is.na(bin_min)) bin_min <- min(f)
if (is.na(bin_max)) bin_max <- max(f)
breaks <- seq(bin_min, bin_max, (bin_max - bin_min) / B)
idx <- rep(1:B, each=2)
xs <- sapply(1:length(idx),
function(b) if(b %% 2 == 0) breaks[idx[b] + 1] else breaks[idx[b]] )
N <- dim(f)[1]
prior_f <- sapply(1:N,
function(n) hist(f[n, bin_min < f[n,] & f[n,] < bin_max],
breaks=breaks, plot=FALSE)$counts)
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred <- sapply(1:B, function(b) quantile(prior_f[b,], probs=probs))
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9, n]))
plot(1, type="n", main="",
xlim=c(bin_min, bin_max), xlab=name,
ylim=c(min(cred[1,]), max(cred[9,])), ylab="Counts")
polygon(c(xs, rev(xs)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(xs, pad_cred[5,], col=c_dark, lwd=2)
}
conditional_location_median <- function(obs_covar, x_name, f, y_name,
bin_min, bin_max, B) {
breaks <- seq(bin_min, bin_max, (bin_max - bin_min) / B)
idx <- rep(1:B, each=2)
xs <- sapply(1:length(idx), function(b)
if(b %% 2 == 1) breaks[idx[b]] else breaks[idx[b] + 1])
S <- length(f[,1])
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred <- matrix(NA, nrow=9, ncol=B)
for (b in 1:B) {
bin_idx <- which(breaks[b] <= obs_covar & obs_covar < breaks[b + 1])
if (length(bin_idx)) {
bin_f_medians <- sapply(1:S, function(s) median(f[s, bin_idx]))
cred[,b] <- quantile(bin_f_medians, probs=probs)
}
}
# Compute graphical bounds before replacing NAs with zeros
ymin <- min(cred[1,], na.rm=T)
ymax <- max(cred[9,], na.rm=T)
for (b in 1:B) {
if ( is.na(cred[1, b]) ) cred[,b] = rep(0, length(probs))
}
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9, n]))
plot(1, type="n", main="",
xlim=c(bin_min, bin_max), xlab=x_name,
ylim=c(ymin, ymax), ylab=y_name)
polygon(c(xs, rev(xs)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
for (b in 1:B) {
if (length(which(breaks[b] <= obs_covar & obs_covar < breaks[b + 1]))) {
lines(xs[(2 * b - 1):(2 * b)],
pad_cred[5,(2 * b - 1):(2 * b)],
col=c_dark, lwd=2)
}
}
}
marginal_variate_median_retro <- function(obs, pred, bin_min, bin_max, B) {
if (is.na(bin_min)) bin_min <- min(pred)
if (is.na(bin_max)) bin_max <- max(pred)
breaks <- seq(bin_min, bin_max, (bin_max - bin_min) / B)
idx <- rep(1:B, each=2)
xs <- sapply(1:length(idx),
function(b) if(b %% 2 == 0) breaks[idx[b] + 1] else breaks[idx[b]] )
filt_obs <- obs[bin_min < obs & obs < bin_max]
obs <- hist(filt_obs, breaks=breaks, plot=FALSE)$counts
pad_obs <- do.call(cbind, lapply(idx, function(n) obs[n]))
N <- dim(pred)[1]
post_pred <- sapply(1:N,
function(n) hist(pred[n, bin_min < pred[n,] & pred[n,] < bin_max],
breaks=breaks, plot=FALSE)$counts)
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred <- sapply(1:B, function(b) quantile(post_pred[b,], probs=probs))
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9, n]))
plot(1, type="n", main="",
xlim=c(bin_min, bin_max), xlab="y",
ylim=c(0, max(c(obs, cred[9,]))), ylab="Counts")
polygon(c(xs, rev(xs)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(xs, pad_cred[5,], col=c_dark, lwd=2)
lines(xs, pad_obs, col="white", lty=1, lw=2.5)
lines(xs, pad_obs, col="black", lty=1, lw=2)
}
conditional_variate_median_retro <- function(obs_var, obs_covar, pred_var,
x_name, bin_min, bin_max, B) {
breaks <- seq(bin_min, bin_max, (bin_max - bin_min) / B)
idx <- rep(1:B, each=2)
xs <- sapply(1:length(idx), function(b)
if(b %% 2 == 1) breaks[idx[b]] else breaks[idx[b] + 1])
S <- length(pred_var[,1])
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
obs_var_medians <- rep(NA, B)
cred <- matrix(NA, nrow=9, ncol=B)
for (b in 1:B) {
bin_idx <- which(breaks[b] <= obs_covar & obs_covar < breaks[b + 1])
if (length(bin_idx)) {
obs_var_medians[b] <- median(obs_var[bin_idx])
bin_pred_medians <- sapply(1:S, function(s) median(pred_var[s, bin_idx]))
cred[,b] <- quantile(bin_pred_medians, probs=probs)
}
}
# Compute graphical bounds before replacing NAs with zeros
ymin <- min(min(cred[1,], na.rm=T), min(obs_var_medians, na.rm=T))
ymax <- max(max(cred[9,], na.rm=T), max(obs_var_medians, na.rm=T))
for (b in 1:B) {
if ( is.na(cred[1, b]) ) cred[,b] = rep(0, length(probs))
}
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9, n]))
plot(1, type="n", main="",
xlim=c(bin_min, bin_max), xlab=x_name,
ylim=c(ymin, ymax), ylab="y_pred")
polygon(c(xs, rev(xs)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
for (b in 1:B) {
if (length(which(breaks[b] <= obs_covar & obs_covar < breaks[b + 1]))) {
lines(xs[(2 * b - 1):(2 * b)],
pad_cred[5,(2 * b - 1):(2 * b)],
col=c_dark, lwd=2)
lines(xs[(2 * b - 1):(2 * b)],
c(obs_var_medians[b], obs_var_medians[b]),
col="white", lwd=3)
lines(xs[(2 * b - 1):(2 * b)],
c(obs_var_medians[b], obs_var_medians[b]),
col="black", lwd=2)
}
}
}
conditional_variate_median_retro_residual <- function(obs_var, obs_covar,
pred_var, name,
bin_min, bin_max, B) {
breaks <- seq(bin_min, bin_max, (bin_max - bin_min) / B)
idx <- rep(1:B, each=2)
xs <- sapply(1:length(idx), function(b)
if(b %% 2 == 1) breaks[idx[b]] else breaks[idx[b] + 1])
S <- length(pred_var[,1])
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred <- matrix(NA, nrow=9, ncol=B)
for (b in 1:B) {
bin_idx <- which(breaks[b] <= obs_covar & obs_covar < breaks[b + 1])
if (length(bin_idx)) {
bin_pred_medians <- sapply(1:S, function(s) median(pred_var[s, bin_idx]))
cred[,b] <- quantile(bin_pred_medians, probs=probs) - median(obs_var[bin_idx])
}
if ( is.na(cred[1, b]) ) cred[,b] = rep(0, length(probs))
}
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9, n]))
ymin <- min(pad_cred[1,], na.rm=T)
ymax <- max(pad_cred[9,], na.rm=T)
plot(1, type="n", main="",
xlim=c(bin_min, bin_max), xlab=name,
ylim=c(ymin, ymax), ylab="y_pred - y_obs")
abline(h=0, col="gray80", lwd=2, lty=3)
polygon(c(xs, rev(xs)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(xs, rev(xs)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
for (b in 1:B) {
if (length(which(breaks[b] <= obs_covar & obs_covar < breaks[b + 1]))) {
lines(xs[(2 * b - 1):(2 * b)], pad_cred[5,(2 * b - 1):(2 * b)],
col=c_dark, lwd=2)
}
}
}
plot_marginal <- function(values, name, display_xlims, title="") {
bin_lims <- range(values)
delta <- diff(bin_lims) / 50
breaks <- seq(bin_lims[1], bin_lims[2] + delta, delta)
hist(values, breaks=breaks, prob=T,
main=title, xlab=name, xlim=display_xlims,
ylab="", yaxt='n',
col=c_dark, border=c_dark_highlight)
}
plot_marginal_comp <- function(values1, values2, name,
display_xlims, display_ylims, title="") {
r1 <- range(values1)
r2 <- range(values2)
bin_lims <- c(min(r1[1], r2[1]), max(r1[2], r2[2]))
delta <- diff(bin_lims) / 50
breaks <- seq(bin_lims[1], bin_lims[2] + delta, delta)
hist(values1, breaks=breaks, prob=T,
main=title, xlab=name, xlim=display_xlims,
ylab="", yaxt='n', ylim=display_ylims,
col=c_light, border=c_light_highlight)
hist(values2, breaks=breaks, prob=T, add=T,
col=c_dark, border=c_dark_highlight)
}
In this exercise we'll use simulated data saved into the local data object.
Click for spoilers on how the data are simulated.
writeLines(readLines("stan_programs/simu_log_reg.stan"))transformed data {
int<lower=0> M = 3; // Number of covariates
int<lower=0> N = 1000; // Number of observations
vector[M] x0 = [0, 2, -1]'; // Covariate baseline
vector[M] z0 = [-1, 3, 0]'; // Latent functional behavior baseline
real gamma0 = 4.75; // True intercept
vector[M] gamma1 = [-0.4, 0.375, 0.5]'; // True slopes
real phi = 5; // True negative binomial concentration
}
generated quantities {
matrix[N, M] X; // Covariate design matrix
real y[N]; // Variates
real log_mu[N];
for (n in 1:N) {
X[n, 1] = normal_rng(x0[1], 0.25);
X[n, 2] = normal_rng( x0[2]
- 3 * (X[n, 1] - x0[1])
+ 1 * pow(X[n, 1] - x0[1], 3), 0.3);
X[n, 3] = x0[3]
+ ( (X[n, 1] - x0[1]) - 10 * fabs(normal_rng(0, 2.25)) ) / sqrt(1 + 100);
/*
if ((X[n, 1] - x0[1]) * (X[n, 2] - x0[2]) > 0)
X[n, 3]
= normal_rng(x0[3] + 0.41 * sqrt((X[n, 1] - x0[1]) * (X[n, 2] - x0[2])),
0.5);
else
X[n, 3]
= normal_rng(x0[3] - 0.41 * sqrt(-(X[n, 1] - x0[1]) * (X[n, 2] - x0[2])),
0.4);
*/
log_mu[n] = gamma0 + (X[n] - z0') * gamma1;
y[n] = neg_binomial_2_log_rng(gamma0 + (X[n] - z0') * gamma1, phi);
}
}
simu <- stan(file="stan_programs/simu_log_reg.stan",
iter=1, warmup=0, chains=1,
seed=4838282, algorithm="Fixed_param")
SAMPLING FOR MODEL 'simu_log_reg' NOW (CHAIN 1).
Chain 1: Iteration: 1 / 1 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0 seconds (Warm-up)
Chain 1: 0.001297 seconds (Sampling)
Chain 1: 0.001297 seconds (Total)
Chain 1:
X <- extract(simu)$X[1,,]
y <- extract(simu)$y[1,]
data <- list("M" = 3, "N" = 1000, "x0" = c(0, 2, -1), "X" = X, "y" = y)
range(data$X[,1])
[1] -0.7267635  0.8254803
range(data$X[,2])
[1] -0.2650418  4.1217727
range(data$X[,3])
[1] -8.9621706 -0.9563363
We begin the analysis proper by taking a quick look at the simulated data. As expected the observed variates take only positive integer values and there do seem to be some nontrivial correlations between those variates and the observed covariates.
par(mfrow=c(3, 3), mar = c(5, 5, 2, 1))
plot_line_hist(data$X[,1], -1, 1, 0.1, "x1")
plot_line_hist(data$X[,2], -1, 5, 0.25, "x2")
plot_line_hist(data$X[,3], -10, 0, 0.25, "x3")
plot(data$X[,1], data$X[,2], pch=16, cex=1.0, col="black",
main="", xlim=c(-1, 1), xlab="x1", ylim=c(-1, 5), ylab="x2")
plot(data$X[,1], data$X[,3], pch=16, cex=1.0, col="black",
main="", xlim=c(-1, 1), xlab="x1", ylim=c(-10, 0), ylab="x3")
plot(data$X[,2], data$X[,3], pch=16, cex=1.0, col="black",
main="", xlim=c(-1, 5), xlab="x2", ylim=c(-10, 0), ylab="x3")
plot(data$X[,1], data$y, pch=16, cex=1.0, col="black",
main="", xlim=c(-1, 1), xlab="x1", ylim=c(0, 60), ylab="y")
plot(data$X[,2], data$y, pch=16, cex=1.0, col="black",
main="", xlim=c(-1, 5), xlab="x2", ylim=c(0, 60), ylab="y")
plot(data$X[,3], data$y, pch=16, cex=1.0, col="black",
main="", xlim=c(-10, 0), xlab="x3", ylim=c(0, 60), ylab="y")If we want to model these observations with a Taylor model then we need to choose a baseline. In a real analysis we would inform these baselines with domain expertise, but here I'll just impose some component baselines by fiat.
data$x0 <- c(0, 2, -1)
For our first model let's be a little sloppy and ignore not only the positivity of the variates but also their discreteness. In particular instead of starting with the Poisson model for the probabilistic variation we'll start with a normal model and a completely unconstrained first-order Taylor approximation for the location function.
When the marginal variate distribution concentrates on sufficiently large values this approximation might not be terrible. That said, positive variates are guaranteed only when the location function is everywhere positive which limits the valid configurations of the Taylor approximation.
Without any interpretable context we don't have the domain expertise to build a principled prior model so as with the baselines I'll just impose one. This prior model, however, is not entirely arbitrary. It is designed to approximately restrict the location function to positive outcomes, consistent with our assumption that the normal model will be valid.
writeLines(readLines("stan_programs/normal_linear_prior.stan"))data {
int<lower=0> M; // Number of covariates
int<lower=0> N; // Number of observations
vector[M] x0; // Covariate baselines
matrix[N, M] X; // Covariate design matrix
}
transformed data {
matrix[N, M] deltaX;
for (n in 1:N) {
deltaX[n,] = X[n] - x0';
}
}
parameters {
real alpha; // Intercept
vector[M] beta; // Slopes
real<lower=0> sigma; // Measurement Variability
}
model {
// Prior model
alpha ~ normal(25, 5);
beta ~ normal(0, 3);
sigma ~ normal(0, 5);
}
// Simulate a full observation from the current value of the parameters
generated quantities {
vector[N] mu = alpha + deltaX * beta;
real y_pred[N] = normal_rng(alpha + deltaX * beta, sigma);
}
fit <- stan(file="stan_programs/normal_linear_prior.stan",
data=data, seed=8438338, refresh=0)
samples = extract(fit)
Indeed the induced behavior of the location function outputs is mostly contained to positive values, although there is some mild leakage into negative values. So long as the observed variates are large enough this shouldn't be too problematic.
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
marginal_location_median(samples$mu, "mu", -20, 60, 40)
rect(-20, 0, 0, 400, col=paste0(c_light_teal, "33"), border=FALSE)
par(mfrow=c(1, 3), mar = c(5, 5, 2, 1))
conditional_location_median(data$X[,1], "x1", samples$mu, "mu", -1, 1, 20)
abline(v=data$x0[1], lwd=2, col="grey", lty=3)
rect(-10, -20, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
conditional_location_median(data$X[,2], "x2", samples$mu, "mu", -1, 5, 20)
abline(v=data$x0[2], lwd=2, col="grey", lty=3)
rect(-10, -20, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
conditional_location_median(data$X[,3], "x3", samples$mu, "mu", -10, 0, 20)
abline(v=data$x0[3], lwd=2, col="grey", lty=3)
rect(-10, -20, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
Convolving these sampled location function outputs with the normal variation only increases the leakage into constraint-violating behaviors.
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
marginal_location_median(samples$y_pred, "y", -30, 70, 30)
rect(-30, 0, 0, 400, col=paste0(c_light_teal, "33"), border=FALSE)
par(mfrow=c(1, 3), mar = c(5, 5, 2, 1))
conditional_location_median(data$X[,1], "x1", samples$y_pred, "y", -1, 1, 20)
abline(v=data$x0[1], lwd=2, col="grey", lty=3)
rect(-10, -20, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
conditional_location_median(data$X[,2], "x2", samples$y_pred, "y", -1, 5, 20)
abline(v=data$x0[2], lwd=2, col="grey", lty=3)
rect(-10, -20, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
conditional_location_median(data$X[,3], "x3", samples$y_pred, "y", -10, 0, 20)
abline(v=data$x0[3], lwd=2, col="grey", lty=3)
rect(-10, -20, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
While this model isn't entirely consistent with the known constraints it might be a reasonable approximation. Of course hope alone isn't sufficient statistical reasoning. To verify whether or not this model is a decent approximation we have to fit it to the observed data.
writeLines(readLines("stan_programs/normal_linear.stan"))data {
int<lower=0> M; // Number of covariates
int<lower=0> N; // Number of observations
vector[M] x0; // Covariate baselines
matrix[N, M] X; // Covariate design matrix
real y[N]; // Variates
}
transformed data {
matrix[N, M] deltaX;
for (n in 1:N) {
deltaX[n,] = X[n] - x0';
}
}
parameters {
real alpha; // Intercept
vector[M] beta; // Slopes
real<lower=0> sigma; // Measurement Variability
}
model {
// Prior model
alpha ~ normal(25, 5);
beta ~ normal(0, 3);
sigma ~ normal(0, 5);
// Vectorized observation model
y ~ normal(alpha + deltaX * beta, sigma);
}
// Simulate a full observation from the current value of the parameters
generated quantities {
vector[N] mu = alpha + deltaX * beta;
real y_pred[N] = normal_rng(alpha + deltaX * beta, sigma);
}
fit <- stan(file="stan_programs/normal_linear.stan", data=data,
seed=8438338, refresh=0)
util$check_all_diagnostics(fit)
[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "0 of 4000 iterations ended with a divergence (0%)"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-FMI indicated no pathological behavior"
The lack of diagnostic warnings might give us hope, but the inferred location behaviors stray pretty far past the positivity constraint.
samples = extract(fit)
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
marginal_location_median(samples$mu, "mu", -20, 50, 30)
rect(-20, 0, 0, 400, col=paste0(c_light_teal, "33"), border=FALSE)
To really critique our model, however, we need to consider retrodictive checks. Unfortunately the retrodictive behavior of the marginal variate distribution leaves something to be desired. The retrodictive variate distribution not only spills below zero but also doesn't match the asymmetric shape of the observed data. Ultimately it doesn't look like the observed variates are far enough away from the positivity boundary for that boundary to be ignorable.
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
marginal_variate_median_retro(data$y, samples$y_pred, -50, 80, 50)
rect(-50, 0, 0, 400, col=paste0(c_light_teal, "33"), border=FALSE)
The conditional retrodictive checks suggest that our first-order Taylor approximation for the location function might be too rigid, but that might also be an artifact of the normal variation model.
par(mfrow=c(2, 3), mar = c(5, 5, 3, 1))
conditional_variate_median_retro(data$y, data$X[,1], samples$y_pred,
"x1", -1, 1, 20)
rect(-10, -50, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
conditional_variate_median_retro(data$y, data$X[,2], samples$y_pred,
"x2", -1, 5, 20)
rect(-10, -50, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
conditional_variate_median_retro(data$y, data$X[,3], samples$y_pred,
"x3", -10, 0, 20)
rect(-10, -50, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
conditional_variate_median_retro_residual(data$y, data$X[,1], samples$y_pred,
"x1", -1, 1, 20)
conditional_variate_median_retro_residual(data$y, data$X[,2], samples$y_pred,
"x2", -1, 5, 20)
conditional_variate_median_retro_residual(data$y, data$X[,3], samples$y_pred,
"x3", -10, 0, 20)Keeping in mind potential limitations in a first-order Taylor approximation let's prioritize the potential improvement of the variation model. One way to accommodate an asymmetric variate distributions is to drop the strong assumption of normal variation in faovr of the assumption of Poisson variation that is much more appropriate for our observed variates.
An immediate problem with this pairing of a Poisson variation model and an unconstrained Taylor approximation is that the Taylor approximation can output negative location values where the Poisson mass function is completely undefined. Contrast this to the normal variation model where negative location values could lead to poor inferences and predictions but didn't obstruct the evaluation of the model itself. In order to implement the Poisson model we have to be careful to avoid these ill-defined evaluations.
Trying to explicitly constrain the configuration of a first-order Taylor approximation so that its outputs will be positive for all the observed covariate inputs is typically infeasible. We can, however, reject Markov transitions that stray into constraint-violating configurations before we try to evaluate the Poisson model. We just have to be careful to initialize the Markov chains in positive configurations so that Stan's Hamiltonian Monte Carlo sampler will be able to adapt from these initializations.
Here we'll use a heuristic that avoids negative outputs with high probability while also covering most of the span of the prior model. As always with Markov chain Monte Carlo the more diffuse the initializations the better we can empirically diagnose convergence.
pos_inits <- list()
for (c in 1:4) {
beta <- rnorm(3, 0, 1)
alpha <- -min(data$X %*% beta) + 5
pos_inits[[c]] <- list("alpha" = alpha, "beta" = beta)
}
We begin, as always, with an investigation of the prior model and its consequences.
writeLines(readLines("stan_programs/poisson_linear_prior.stan"))data {
int<lower=0> M; // Number of covariates
int<lower=0> N; // Number of observations
vector[M] x0; // Covariate baselines
matrix[N, M] X; // Covariate design matrix
}
transformed data {
matrix[N, M] deltaX;
for (n in 1:N) {
deltaX[n,] = X[n] - x0';
}
}
parameters {
real alpha; // Intercept
vector[M] beta; // Slopes
}
model {
// Prior model
alpha ~ normal(25, 5);
beta ~ normal(0, 3);
for (n in 1:N)
if (alpha + deltaX[n,] * beta < 0)
reject("");
}
// Simulate a full observation from the current value of the parameters
generated quantities {
vector[N] mu = alpha + deltaX * beta;
int y_pred[N] = poisson_rng(mu);
}
fit <- stan(file="stan_programs/poisson_linear_prior.stan",
data=data, seed=8438338, refresh=0,
            init=pos_inits)
In Stan calling the reject function effectively forces the current numerical Hamiltonian trajectory to diverge. Consequently each transition that ends with a call to the reject function is tagged as a divergent transition, and we can use the divergence diagnostic to see how often the function was called.
util$check_all_diagnostics(fit)
[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "1369 of 4000 iterations ended with a divergence (34.225%)"
[1] " Try running with larger adapt_delta to remove the divergences"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-FMI indicated no pathological behavior"
The high percentage of divergences indicates that without the rejections our prior model places quite a bit of probability on Taylor approximation configurations with negative outputs. The constant rejections needed to avoid these constraint violations compromise the efficient exploration of Hamiltonian Monte Carlo, and with enough rejections the exploration will eventually devolve into a computationally wasteful random walk.
That said the rejections do ensure that the positive location function constraint is satisfied.
samples = extract(fit)
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
marginal_location_median(samples$mu, "mu", -5, 60, 50)
rect(-20, 0, 0, 400, col=paste0(c_light_teal, "33"), border=FALSE)
par(mfrow=c(1, 3), mar = c(5, 5, 2, 1))
conditional_location_median(data$X[,1], "x1", samples$mu, "mu", -1, 1, 20)
abline(v=data$x0[1], lwd=2, col="grey", lty=3)
rect(-10, -20, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
conditional_location_median(data$X[,2], "x2", samples$mu, "mu", -1, 5, 20)
abline(v=data$x0[2], lwd=2, col="grey", lty=3)
rect(-10, -20, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
conditional_location_median(data$X[,3], "x3", samples$mu, "mu", -10, 0, 20)
abline(v=data$x0[3], lwd=2, col="grey", lty=3)
rect(-10, -20, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
The Poisson variation model then ensures that the predicted variates are positive as well. Note the asymmetry in the prior predictive distribution.
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
marginal_location_median(samples$y_pred, "y", -5.5, 70.5, 76)
rect(-20, 0, 0, 400, col=paste0(c_light_teal, "33"), border=FALSE)
par(mfrow=c(1, 3), mar = c(5, 5, 2, 1))
conditional_location_median(data$X[,1], "x1", samples$y_pred, "y", -1, 1, 20)
abline(v=data$x0[1], lwd=2, col="grey", lty=3)
rect(-10, -20, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
conditional_location_median(data$X[,2], "x2", samples$y_pred, "y", -1, 5, 20)
abline(v=data$x0[2], lwd=2, col="grey", lty=3)
rect(-10, -20, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
conditional_location_median(data$X[,3], "x3", samples$y_pred, "y", -10, 0, 20)
abline(v=data$x0[3], lwd=2, col="grey", lty=3)
rect(-10, -20, 10, 0, col=paste0(c_light_teal, "33"), border=FALSE)
Will this rejection mechanism be robust enough to let us fit the observed data? There's only one way to find out.
writeLines(readLines("stan_programs/poisson_linear.stan"))data {
int<lower=0> M; // Number of covariates
int<lower=0> N; // Number of observations
vector[M] x0; // Covariate baselines
matrix[N, M] X; // Covariate design matrix
int y[N]; // Variates
}
transformed data {
matrix[N, M] deltaX;
for (n in 1:N) {
deltaX[n,] = X[n] - x0';
}
}
parameters {
real alpha; // Intercept
vector[M] beta; // Slopes
}
model {
// Prior model
alpha ~ normal(25, 5);
beta ~ normal(0, 3);
// Vectorized observation model
y ~ poisson(alpha + deltaX * beta);
}
// Simulate a full observation from the current value of the parameters
generated quantities {
vector[N] mu = alpha + deltaX * beta;
int y_pred[N] = poisson_rng(mu);
}
fit <- stan(file="stan_programs/poisson_linear.stan", data=data,
            seed=8438338, refresh=0, init=pos_inits)
Oh, no.
capture.output(util$check_n_eff(fit))[1:10]
 [1] "[1] \"n_eff / iter for parameter alpha is 0.000509359673613626!\""
[2] "[1] \"n_eff / iter for parameter beta[1] is 0.000500479030005201!\""
[3] "[1] \"n_eff / iter for parameter beta[2] is 0.000520106806623146!\""
[4] "[1] \"n_eff / iter for parameter beta[3] is 0.000526144781821528!\""
[5] "[1] \"n_eff / iter for parameter mu[1] is 0.000510454867546584!\""
[6] "[1] \"n_eff / iter for parameter mu[2] is 0.000509669960707064!\""
[7] "[1] \"n_eff / iter for parameter mu[3] is 0.000524258737463354!\""
[8] "[1] \"n_eff / iter for parameter mu[4] is 0.000584067233896216!\""
[9] "[1] \"n_eff / iter for parameter mu[5] is 0.000512168283647731!\""
[10] "[1] \"n_eff / iter for parameter mu[6] is 0.000509685157177343!\""
capture.output(util$check_rhat(fit))[1:10]
 [1] "[1] \"Rhat for parameter alpha is 12.4832179520727!\""
[2] "[1] \"Rhat for parameter beta[1] is 87.7713855779901!\""
[3] "[1] \"Rhat for parameter beta[2] is 11.2731376537136!\""
[4] "[1] \"Rhat for parameter beta[3] is 7.52477590970578!\""
[5] "[1] \"Rhat for parameter mu[1] is 11.6856932985649!\""
[6] "[1] \"Rhat for parameter mu[2] is 13.4745917484533!\""
[7] "[1] \"Rhat for parameter mu[3] is 7.47994191018897!\""
[8] "[1] \"Rhat for parameter mu[4] is 5.00906468474416!\""
[9] "[1] \"Rhat for parameter mu[5] is 13.1569100637833!\""
[10] "[1] \"Rhat for parameter mu[6] is 12.3489778553509!\""
util$check_div(fit)
[1] "3821 of 4000 iterations ended with a divergence (95.525%)"
[1] " Try running with larger adapt_delta to remove the divergences"
util$check_treedepth(fit)
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
util$check_energy(fit)
[1] "Chain 3: E-FMI = 0.0027378106235891"
[1] " E-FMI below 0.2 indicates you may need to reparameterize your model"
Let's try to break these diagnostic failures down one by one, starting with the divergences. The large proportion of divergent transitions is consistent with excessive rejections which compromise the Markov chains' ability to explore the truncated parameter space.
To better resolve this truncated region Stan tries to adapt the symplectic integrator step size in some of the Markov chains to very small values.
sapply(1:4, function(c) get_sampler_params(fit, inc_warmup=FALSE)[[c]][,'stepsize__'][1])
[1] 0.003470967 0.001348262 0.030180749 0.018740624
Normally the length of the numerical Hamiltonian trajectories would increase to compensate for such small step sizes, but most of the numerical trajectories are prematurely truncated by those rejections.
par(mfrow=c(1, 1), mar = c(5, 2, 2, 1))
steps <- do.call(rbind,
get_sampler_params(fit, inc_warmup=FALSE))[,'n_leapfrog__']
hist(steps, breaks=seq(-0.5, 220.5, 1), main="",
col=c_dark, border=c_dark_highlight,
xlab="Number of Leapfrog Steps", yaxt="n", ylab="")The stunted trajectories result in extremely inefficient exploration. We don't see blobs in the pairs plots but rather traces of random walks!
samples = extract(fit)
unpermuted_samples <- extract(fit, permute=FALSE)
par(mfrow=c(2, 2), mar = c(5, 5, 3, 1))
for (c in 1:4) {
plot(samples$alpha, samples$beta[,1],
main=paste("Chain", c), col=paste0(c_light, "33"), pch=16, cex=0.8,
xlab="alpha", ylab="beta[1]",
xlim=c(20, 23), ylim=c(-1, 0.5))
points(unpermuted_samples[,c,1], unpermuted_samples[,c,2],
col=paste0(c_dark, "33"), pch=16, cex=0.8)
}
The random walk behavior is even more clear if we zoom in to the individual Markov chains.
par(mfrow=c(2, 2), mar = c(5, 5, 3, 1))
for (c in 1:4) {
plot(unpermuted_samples[,c,1], unpermuted_samples[,c,2],
main=paste("Chain", c), xlab="alpha", ylab="beta[1]",
col=paste0(c_dark, "33"), pch=16, cex=0.8)
}
In fact the exploration is so slow that the individual Markov chains have not yet had the opportunity to converge to each other which explains all of the split \(\hat{R}\) warnings. Here the large values of split \(\hat{R}\) indicate not any real multimodality in the posterior distribution but rather a lack of convergence of the individual Markov chains.
Even if we ignored these computational issues, the posterior approximation afforded by this poor exploration doesn't even come close to conforming to what we see in the observed data.
par(mfrow=c(1, 1), mar = c(5, 5, 3, 1))
marginal_variate_median_retro(data$y, samples$y_pred, -5.5, 60.5, 66)
rect(-20, 0, 0, 400, col=paste0(c_light_teal, "33"), border=FALSE)