Stan and its implementation of dynamic Hamiltonian Monte Carlo is an extremely powerful tool for specifying and then fitting complex Bayesian models. In order to ensure a robust analysis, however, that power must be complemented with responsibility.

In particular, while dynamic implementations of Hamiltonian Monte Carlo, i.e. implementations where the integration time is dynamic, do perform well over a large class of models their success is not guaranteed. When they do fail, however, their failures manifest in diagnostics that are readily checked.

By acknowledging and respecting these diagnostics you can ensure that Stan is accurately fitting the Bayesian posterior and hence accurately characterizing your model. And only with an accurate characterization of your model can you properly utilize its insights.

1 A Little Bit About Markov Chain Monte Carlo

Hamiltonian Monte Carlo is an implementation of Markov chain Monte Carlo, an algorithm which approximates expectations with respect to a given target distribution, \(\pi\), \[ \mathbb{E}_{\pi} [ f ] = \int \mathrm{d}q \, \pi (q) \, f(q), \] using the states of a Markov chain, \(\{q_{0}, \ldots, q_{N} \}\), \[ \mathbb{E}_{\pi} [ f ] \approx \hat{f}_{N} = \frac{1}{N + 1} \sum_{n = 0}^{N} f(q_{n}). \] Typically the target distribution is taken to be the posterior distribution of our specified model.

These estimators are guaranteed to be accurate only asymptotically, as the Markov chain grows to be infinitely long, \[ \lim_{N \rightarrow \infty} \hat{f}_{N} = \mathbb{E}_{\pi} [ f ]. \] To be useful in applied analyses, we need these Markov chain Monte Carlo estimators to converge to the true expectation values sufficiently quickly that they are reasonably accurate before we exhaust our finite computational resources. This fast convergence requires strong ergodicity conditions to hold, typically a condition called geometric ergodicity between the Markov transition and target distribution. In particular, geometric ergodicity is a sufficient condition for Markov chain Monte Carlo estimators to follow a central limit theorem, which ensures not only that they are unbiased after only a finite number of iterations but also that we can empirically quantify their precision, \[ \hat{f}_{N} - \mathbb{E}_{\pi} [ f ] \sim \mathcal{N} \! \left( 0, \sqrt{ \mathrm{Var}[f] / N_{\mathrm{eff}}} \right). \]

Unfortunately proving geometric ergodicity theoretically is infeasible for any nontrivial problem. Instead we must rely on empirical diagnostics that identify obstructions to geometric ergodicity, and hence well-behaved Markov chain Monte Carlo estimators. For a general Markov transition and target distribution, the best known diagnostic is the split \(\hat{R}\) statistic over an ensemble of Markov chains initialized from diffuse points in parameter space. To do any better we need to exploit the particular structure of a given transition or target distribution.

Hamiltonian Monte Carlo, for example, is especially powerful in this regard as its failures to be geometrically ergodic with respect to any target distribution manifest in distinct behaviors that have been developed into sensitive diagnostics. One of these behaviors is the appearance of divergences that indicate the Hamiltonian Markov chain has encountered regions of high curvature in the target distribution which it cannot adequately explore. Another is the energy Bayesian fraction of missing information, or E-BFMI, which quantifies the efficacy of the momentum resampling in between Hamiltonian trajectories.

For more details on Markov chain Monte Carlo and Hamiltonian Monte Carlo see Betancourt (2017).

In this case study I will demonstrate the recommended Stan workflow in R where we not only fit a model but also scrutinize these diagnostics and ensure an accurate fit.

2 Setting Up The RStan Environment

We begin by importing the RStan module and setting some local options,

library(rstan)
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.16.2, packaged: 2017-07-03 09:24:58 UTC, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

By setting rstan_options(auto_write = TRUE) we allow RStan to cache compiled models so that we can run them multiple times without the overhead of recompilation. options(mc.cores = parallel::detectCores()) enables RStan to run multiple Markov chains in parallel over any cores that your computer may have. These settings are recommended if you are running locally on your own machine, as opposed to running on a remote cluster, and your local machine which has plenty of RAM. For very large problems running multiple Markov chains in parallel may exhaust your RAM and degrade performance, in which case you would not want to utilize this option.

In order to facilitate checking diagnostics we source the attached utility script which loads some useful functions,

source('stan_utility.R')
lsf.str()
check_all_diagnostics : function (fit)  
check_div : function (fit)  
check_energy : function (fit)  
check_n_eff : function (fit)  
check_rhat : function (fit)  
check_treedepth : function (fit, max_depth = 10)  
partition_div : function (fit)  

2.1 Specifying and Fitting A Model in Stan

To demonstrate the recommended Stan workflow let’s consider a hierarchical model of the eight schools dataset infamous in the statistical literature (D. B. Rubin 1981),

\[\mu \sim \mathcal{N}(0, 5)\]

\[\tau \sim \text{Half-Cauchy}(0, 5)\]

\[\theta_{n} \sim \mathcal{N}(\mu, \tau)\]

\[y_{n} \sim \mathcal{N}(\theta_{n}, \sigma_{n}),\]

where \(n \in \left\{1, \ldots, 8 \right\}\) and the \(\left\{ y_{n}, \sigma_{n} \right\}\) are given as data.

For more information on the eight schools dataset see Gelman et al. (2014).

2.2 Specifying the Model with a Stan Program

In particular, let’s implement a centered-parameterization of the model which is known to frustrate even sophisticated samplers like Hamiltonian Monte Carlo. In Stan the centered parameterization is specified with the Stan program

writeLines(readLines("eight_schools_cp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta[J];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}

Note that we have specified the Stan program in its own file. We strongly recommend keeping your workflow modular by separating the Stan program from the R environment in this way. Not only does it make it easier to identify and read through the Stan-specific components of your analysis, it also makes it easy to share your models Stan users exploiting workflows in environments, such as Python and the command line.

2.3 Specifying the Data

Similarly, we strongly recommend that you specify the data in its own file.

Data specified in R lists can be immediately converted to an external Stan data file using RStan’s stan_rdump function,

J <- 8
y <- c(28,  8, -3,  7, -1,  1, 18, 12)
sigma <- c(15, 10, 16, 11,  9, 11, 10, 18)

stan_rdump(c("J", "y", "sigma"), file="eight_schools.data.R")

At the same time, an existing Stan data file can be read into the R environment using the read_rdump function,

data <- read_rdump("eight_schools.data.R")

2.4 Fitting the Model

With the model and data specified we can now turn to Stan to quantify the resulting posterior distribution with Hamiltonian Monte Carlo,

fit <- stan(file='eight_schools_cp.stan', data=data, seed=194838)
Warning: There were 82 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems

We recommend explicitly specifying the seed of Stan’s random number generator, as we have done here, so that we can reproduce these exactly results in the future, at least when using the same machine, operating system, and interface. This is especially helpful for the more subtle pathologies that may not always be found, which results in seemingly stochastic behavior.

By default the sampling method runs 4 Markov chains of Hamiltonian Monte Carlo, each initialized from a diffuse initial condition to maximize the probability that at least one of the chains might encounter a pathological neighborhood of the posterior, if it exists. Because we set options(mc.cores = parallel::detectCores()) above, these chains will run in parallel when possible.

Each of those chains proceeds with 1000 warmup iterations and 1000 sampling iterations, totaling 4000 sampling iterations available for diagnostics and analysis.

2.5 Validating a Fit in Stan

We are now ready to validate the fit using the information contained in the fit object. While RStan automatically outputs diagnostic warnings that indicate problems with the fit, here I will demonstrate how to analyze those diagnostics programmatically.

The first diagnostics we will check are universal to Markov chain Monte Carlo: effective sample size per iteration and split \(\hat{R}\). We will then consider a suite of powerful diagnostics that are unique to Hamiltonian Monte Carlo.

2.6 Universal Diagnostics

The effective sample size, or n_eff, and split \(\hat{R}\), or rhat, for each parameter is displayed using the print method of the fit object,

print(fit)
Inference for Stan model: eight_schools_cp.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

           mean se_mean   sd   2.5%    25%    50%    75% 97.5% n_eff Rhat
mu         4.22    0.16 3.37  -2.72   2.10   4.23   6.53 10.43   459 1.01
tau        3.83    0.19 3.02   0.69   1.62   2.93   5.22 11.75   264 1.02
theta[1]   6.14    0.23 5.78  -3.43   2.53   5.39   8.99 19.82   613 1.01
theta[2]   4.82    0.17 4.87  -4.05   1.84   4.61   7.77 15.01   802 1.01
theta[3]   3.78    0.17 5.29  -7.53   0.85   4.01   6.98 13.69   926 1.00
theta[4]   4.58    0.18 4.91  -5.00   1.63   4.50   7.57 14.52   785 1.01
theta[5]   3.40    0.16 4.58  -6.44   0.83   3.77   6.34 11.39   816 1.00
theta[6]   3.78    0.15 4.83  -6.25   0.97   3.84   6.69 13.06  1004 1.01
theta[7]   6.23    0.23 5.13  -2.68   2.95   5.77   8.99 17.98   487 1.01
theta[8]   4.65    0.18 5.43  -6.05   1.53   4.37   7.97 15.91   890 1.00
lp__     -15.11    0.43 5.92 -26.41 -19.50 -15.01 -10.63 -4.51   187 1.03

Samples were drawn using NUTS(diag_e) at Fri Nov 10 18:08:26 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

We can investigate each more programatically, however, using some of our utility functions.

2.6.1 Checking Effective Sample Sizes

As noted in Section 1, the effective sample size quantifies the accuracy of the Markov chain Monte Carlo estimator of a given function, here each parameter mean, provided that geometric ergodicity holds. The potential problem with these effective sample sizes, however, is that we must estimate them from the fit output. When we geneate less than 0.001 effective samples per transition of the Markov chain the estimators that we use are typically biased and can significantly overestimate the true effective sample size.

We can check that our effective sample size per iteration is large enough with one of our utility functions,

check_n_eff(fit)
[1] "n_eff / iter looks reasonable for all parameters"

Here there are no indications of problems in our estimates of the effective sample size.

2.6.2 Checking Split \(\hat{R}\)

The effective sample size, however, is meaningless unless our Markov chain and target distribution interact sufficiently well that geometric ergodicity holds. Split \(\hat{R}\) quantifies an important necessary condition for geometric ergodicity, namely that all chains must converge to the same equilibrium behavior.

If the input Markov chains have the same behavior for a given parameter then the corresponding split \(\hat{R}\) will be close to 1. The further split \(\hat{R}\) is from 1 the more idiosyncraticly the chains behave. Empirically we have found that Rhat > 1.1 is usually indicative of problems in the fit.

In addition to browsing the print output, we can check split \(\hat{R}\) programmatically using one of our utility functions,

check_rhat(fit)
[1] "Rhat looks reasonable for all parameters"

Here the split \(\hat{R}\) for all of our parameters looks good.

Both large split \(\hat{R}\) and low effective sample size per iteration are consequences of poorly mixing Markov chains. Improving the mixing of the Markov chains almost always requires tweaking the model specification, for example with a reparameterization or stronger priors.

2.7 Hamiltonian Monte Carlo Diagnostics

One of the most powerful features of Hamiltonian Monte Carlo is that it provides additional diagnostics that can indicate problems with the fit. These diagnostics are extremely sensitive and typically indicate problems long before the arise in the more universal diagnostics considered above.

2.7.1 Checking the Tree Depth

The dynamic implementation of Hamiltonian Monte Carlo used in Stan has a maximum trajectory length built in to avoid infinite loops that can occur for non-identified models. For sufficiently complex models, however, Stan can saturate this threshold even if the model is identified, which limits the efficacy of the sampler.

We can check whether that threshold was hit using one of our utility functions,

check_treedepth(fit)
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"

We’re good here, but if our fit had saturated the threshold then we would have wanted to rerun with a larger maximum tree depth,

fit <- stan(file='eight_schools_cp.stan', data=data, seed=194838, control=list(max_treedepth=15))

and then check if still saturated this larger threshold with

check_treedepth(fit, 15)

2.7.2 Checking the E-BFMI

Hamiltonian Monte Carlo proceeds in two phases – the algorithm first simulates a Hamiltonian trajectory that rapidly explores a slice of the target parameter space before resampling the auxiliary momenta to allow the next trajectory to explore another slice of the target parameter space. Unfortunately, the jumps between these slices induced by the momenta resamplings can be short, which often leads to slow exploration.

We can identify this problem by consulting the energy Bayesian Fraction of Missing Information,

check_energy(fit)
[1] "E-BFMI indicated no pathological behavior"

The check_energy function uses the threshold of 0.2 to diagnose problems, although this is based on preliminary empirical studies and should be taken only as a very rough recommendation. In particular, this diagnostic comes out of recent theoretical work and will be better understood as we apply it to more and more problems. For further discussion see Section 4.2 and 6.1 of Betancourt (2017).

As with split \(\hat{R}\) and effective sample size per transition, the problems indicated by low E-BFMI are remedied by tweaking the specification of the model. Unfortunately the exact tweaks required depend on the exact structure of the model and, consequently, there are no generic solutions.

2.7.3 Checking Divergences

Finally, we can check divergences which indicate pathological neighborhoods of the posterior that the simulated Hamiltonian trajectories are not able to explore sufficiently well. For this fit we have a significant number of divergences

check_div(fit)
[1] "82 of 4000 iterations ended with a divergence (2.05%)"
[1] "  Try running with larger adapt_delta to remove the divergences"

indicating that the Markov chains did not completely explore the posterior and that our Markov chain Monte Carlo estimators will be biased.

Divergences, however, can sometimes be false positives. To verify that we have real fitting issues we can rerun with a larger target acceptance probability, adapt_delta, which will force more accurate simulations of Hamiltonian trajectories and reduce the false positives.

fit <- stan(file='eight_schools_cp.stan', data=data, seed=194838, control=list(adapt_delta=0.90))
Warning: There were 25 divergent transitions after warmup. Increasing adapt_delta above 0.9 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems

Checking again,

check_div(fit)
[1] "25 of 4000 iterations ended with a divergence (0.625%)"
[1] "  Try running with larger adapt_delta to remove the divergences"

we see that while the divergences were reduced they did not completely vanish. In order to argue that divergences are only false positives, the divergences have to be completely eliminated for some adapt_delta sufficiently close to 1. Here we could continue increasing adapt_delta, where we would see that the divergences do not completely vanish, or we can analyze the existing divergences graphically.

If the divergences are not false positives then they will tend to concentrate in the pathological neighborhoods of the posterior. Falsely positive divergent iterations, however, will follow the same distribution as non-divergent iterations.

Here we will use the partition_div function of the stan_utility module to separate divergence and non-divergent iterations,

c_dark <- c("#8F272780")
green <- c("#00FF0080")

partition <- partition_div(fit)
div_params <- partition[[1]]
nondiv_params <- partition[[2]]

par(mar = c(4, 4, 0.5, 0.5))
plot(nondiv_params$'theta[1]', log(nondiv_params$tau),
     col=c_dark, pch=16, cex=0.8, xlab="theta[1]", ylab="log(tau)",
     xlim=c(-20, 50), ylim=c(-1,4))
points(div_params$'theta[1]', log(div_params$tau),
       col=green, pch=16, cex=0.8)

One of the challenges with a visual analysis of divergences is determining exactly which parameters to examine. Consequently visual analyses are most useful when there are already components of the model about which you are suspicious, as in this case where we know that the correlation between random effects (theta.1 through theta.8) and the hierarchical standard deviation, tau, can be problematic.

Indeed we see the divergences clustering towards small values of tau where the posterior abruptly stops. This abrupt stop is indicative of a transition into a pathological neighborhood that Stan was not able to penetrate.

In order to avoid this issue we have to consider a modification to our model, and in this case we can appeal to a non-centered parameterization of the same model that does not suffer these issues.

2.8 Validating a Fit in ShinyStan

Another way of browsing the diagnostic information is to use the Shiny app ShinyStan.

After loading the library,

library(shinystan)

ShinyStan can be activated with the call

launch_shinystan(fit)

which opens ShinyStan in a new browser window.

Quantitative diagnostics can be found in the “Diagnose” tab. Divergences will also be shown in the plots created in the “Expore” tab.

3 A Successful Fit

Multiple diagnostics have indicated that our fit of the centered parameterization of our hierarchical model is not to be trusted, so let’s instead consider the complementary non-centered parameterization,

writeLines(readLines("eight_schools_ncp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta_tilde[J];
}

transformed parameters {
  real theta[J];
  for (j in 1:J)
    theta[j] = mu + tau * theta_tilde[j];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta_tilde ~ normal(0, 1);
  y ~ normal(theta, sigma);
}

Running this model,

fit <- stan(file='eight_schools_ncp.stan', data=data, seed=194838, control=list(adapt_delta=0.9))

we see that all of the diagnostics are clean,

check_all_diagnostics(fit)
[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "0 of 4000 iterations ended with a divergence (0%)"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-BFMI indicated no pathological behavior"

With this more appropriate implementation of our model we can now utilize Markov chain Monte Carlo estimators of expectations, such as parameter means and variances, to accurately characterize our model’s posterior distribution.

4 Original Computing Environment

writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined

CC=clang
CXX=clang++ -arch x86_64 -ftemplate-depth-256
devtools::session_info("rstan")
Session info -------------------------------------------------------------
 setting  value                       
 version  R version 3.4.2 (2017-09-28)
 system   x86_64, darwin15.6.0        
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            
 date     2017-11-10                  
Packages -----------------------------------------------------------------
 package      * version   date       source        
 BH             1.65.0-1  2017-08-24 CRAN (R 3.4.1)
 colorspace     1.3-2     2016-12-14 CRAN (R 3.4.0)
 dichromat      2.0-0     2013-01-24 CRAN (R 3.4.0)
 digest         0.6.12    2017-01-27 CRAN (R 3.4.0)
 ggplot2      * 2.2.1     2016-12-30 CRAN (R 3.4.0)
 graphics     * 3.4.2     2017-10-04 local         
 grDevices    * 3.4.2     2017-10-04 local         
 grid           3.4.2     2017-10-04 local         
 gridExtra      2.3       2017-09-09 CRAN (R 3.4.1)
 gtable         0.2.0     2016-02-26 CRAN (R 3.4.0)
 inline         0.3.14    2015-04-13 CRAN (R 3.4.0)
 labeling       0.3       2014-08-23 CRAN (R 3.4.0)
 lattice        0.20-35   2017-03-25 CRAN (R 3.4.2)
 lazyeval       0.2.1     2017-10-29 CRAN (R 3.4.2)
 magrittr       1.5       2014-11-22 CRAN (R 3.4.0)
 MASS           7.3-47    2017-02-26 CRAN (R 3.4.2)
 Matrix         1.2-11    2017-08-21 CRAN (R 3.4.2)
 methods      * 3.4.2     2017-10-04 local         
 munsell        0.4.3     2016-02-13 CRAN (R 3.4.0)
 plyr           1.8.4     2016-06-08 CRAN (R 3.4.0)
 R6             2.2.2     2017-06-17 CRAN (R 3.4.0)
 RColorBrewer   1.1-2     2014-12-07 CRAN (R 3.4.0)
 Rcpp           0.12.13   2017-09-28 CRAN (R 3.4.2)
 RcppEigen      0.3.3.3.0 2017-05-01 CRAN (R 3.4.0)
 reshape2       1.4.2     2016-10-22 CRAN (R 3.4.0)
 rlang          0.1.4     2017-11-05 CRAN (R 3.4.2)
 rstan        * 2.16.2    2017-07-03 CRAN (R 3.4.1)
 scales         0.5.0     2017-08-24 CRAN (R 3.4.1)
 StanHeaders  * 2.16.0-1  2017-07-03 CRAN (R 3.4.1)
 stats        * 3.4.2     2017-10-04 local         
 stats4         3.4.2     2017-10-04 local         
 stringi        1.1.5     2017-04-07 CRAN (R 3.4.0)
 stringr        1.2.0     2017-02-18 CRAN (R 3.4.0)
 tibble         1.3.4     2017-08-22 CRAN (R 3.4.1)
 tools          3.4.2     2017-10-04 local         
 utils        * 3.4.2     2017-10-04 local         
 viridisLite    0.2.0     2017-03-24 CRAN (R 3.4.0)

References

Betancourt, Michael. 2017. “A Conceptual Introduction to Hamiltonian Monte Carlo.” ArXiv E-Prints 1701.02434 (January).

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014. Bayesian Data Analysis. Third. Texts in Statistical Science Series. CRC Press, Boca Raton, FL.

Rubin, Donald B. 1981. “Estimation in Parallel Randomized Experiments.” Journal of Educational and Behavioral Statistics 6 (4): 377–401.