Part 1: Introduction to Gaussian Processes
Part 2: Optimizing Gaussian Processes Hyperparameters
Part 3: Bayesian Inference of Gaussian Processes Hyperparameters
In Part 2 of this case study we investigated the limited performance of both regularized and unregularized maximum marginal likelihood for inferring Gaussian process hyperparameters. Here we conclude the case study with a treatment of Bayesian inference for the hyperparameters. We will see that a naive Bayesian model fares no better than maximum marginal likelihood, but that maintaining a proper Bayesian workflow allows us to quickly identify the problems and build an appropriately robust model.
As in Part 2 we begin by setting up our local computing environment,
library(rstan)

Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.16.2, packaged: 2017-07-03 09:24:58 UTC, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
source("stan_utility.R")
source("gp_utility.R")
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")
c_light_trans <- c("#DCBCBC80")
c_light_highlight_trans <- c("#C7999980")
c_mid_trans <- c("#B97C7C80")
c_mid_highlight_trans <- c("#A2505080")
c_dark_trans <- c("#8F272780")
c_dark_highlight_trans <- c("#7C000080")
c_green_trans <- c("#00FF0080")
c_superfine <- c("#8F272705")

and then loading in both the simulated data and the ground truth,
data <- read_rdump('gp.data.R')
true_realization <- read_rdump('gp.truth.R')

To judge the performance of our fits we also recreate the true data generating process that we are attempting to model,
f_data <- list(sigma=true_realization$sigma_true,
N=length(true_realization$f_total),
f=true_realization$f_total)
dgp_fit <- stan(file='simu_gauss_dgp.stan', data=f_data, iter=1000, warmup=0,
chains=1, seed=5838298, refresh=1000, algorithm="Fixed_param")
SAMPLING FOR MODEL 'simu_gauss_dgp' NOW (CHAIN 1).
Iteration: 1 / 1000 [ 0%] (Sampling)
Iteration: 1000 / 1000 [100%] (Sampling)
Elapsed Time: 0 seconds (Warm-up)
0.054671 seconds (Sampling)
0.054671 seconds (Total)
plot_gp_pred_quantiles(dgp_fit, data, true_realization,
                       "True Data Generating Process Quantiles")

A common conceit of inexperienced Bayesians is that they have no information about the hyperparameters and hence resort to a default uniform prior that they misunderstand to be “non-informative”.
The corresponding Bayesian model is straightforward to implement in Stan: we just move the hyperparameters from the data block into the parameters block, and in the model block we assign no explicit priors,
writeLines(readLines("gp1.stan"))data {
int<lower=1> N;
real x[N];
vector[N] y;
}
parameters {
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
}
model {
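  // no explicit prior statements here; together with the <lower=0>
  // constraints in the parameters block this implies improper uniform
  // priors for rho, alpha, and sigma over the positive real line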
matrix[N, N] cov = cov_exp_quad(x, alpha, rho)
+ diag_matrix(rep_vector(square(sigma), N));
matrix[N, N] L_cov = cholesky_decompose(cov);
y ~ multi_normal_cholesky(rep_vector(0, N), L_cov);
}
We take our new model out for a run and, following a robust Bayesian workflow, check the output.
fit <- stan(file='gp1.stan', data=data, seed=5838298)

Warning: There were 2125 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 1581 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems
print(fit)

Inference for Stan model: gp1.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25%
rho 9.077243e+307 Inf Inf 7.521524e+306 4.448229e+307
alpha 1.213413e+08 78618520.15 113341765.62 1.380000e+00 9.890810e+03
sigma 3.020000e+00 0.05 0.57 1.930000e+00 2.790000e+00
lp__ 6.949200e+02 0.89 1.59 6.911100e+02 6.939300e+02
50% 75% 97.5% n_eff Rhat
rho 7.981426e+307 1.510861e+308 1.684642e+308 4000 NaN
alpha 2.121612e+08 2.329414e+08 2.498356e+08 2 4.97
sigma 3.010000e+00 3.240000e+00 4.450000e+00 114 1.05
lp__ 6.952000e+02 6.957300e+02 6.968500e+02 3 1.56
Samples were drawn using NUTS(diag_e) at Sun Nov 12 14:15:34 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
The output of print(fit) is already alarming as the posterior for the hyperparameters, especially the length scale, is concentrating at extremely large values,
params <- extract(fit)
par(mfrow=c(1, 3))
alpha_breaks=10 * (0:50) / 50 - 5
hist(log(params$alpha), main="", xlab="log(alpha)", col=c_dark,
border=c_dark_highlight, yaxt='n')
beta_breaks=10 * (0:50) / 50 - 5
hist(log(params$rho), main="", xlab="log(rho)", col=c_dark,
border=c_dark_highlight, yaxt='n')
sigma_breaks=5 * (0:50) / 50
hist(params$sigma, main="", xlab="sigma", col=c_dark, border=c_dark_highlight,
     yaxt='n')

What could be drawing the posterior to such large length scales? Well, recall that the input covariates from which we are trying to learn the length scale cover only the range \(-10 \le x \le 10\). Consequently there is nothing in the data from which the likelihood can inform length scales above \(\rho = 20\). More formally, the likelihood is non-identified above the maximum covariate distance.
In a Bayesian analysis this non-identifiability implies that the posterior above \(\rho = 20\) reduces to the prior. Unfortunately we have chosen a uniform prior which places infinite prior mass at arbitrarily large length scales, and the posterior dutifully follows. Counter to common intuition, uniform priors are far from non-informative. Instead they are extremely informative about the validity of arbitrarily extreme values, and their assumption has significant consequences for the corresponding model!
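We can check the relevant scale directly from the observed covariates; a quick sketch, assuming the covariates are stored in data$x exactly as they are passed to the Stan data block,

# span of the observed covariates; length scales larger than this
# maximum pairwise distance are not constrained by the likelihood
diff(range(data$x))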
In order to ensure reasonable inferences we need to incorporate some principled information about the hyperparameters into our model.
Firstly we need to construct a prior that constrains our inferences to more reasonable length scales, which requires some consideration of what a reasonable length scale might be. If we had prior information about the interaction between the variates and covariates then we could incorporate that directly. Alternatively, if we knew the range of covariate values that we could encounter in an experiment then we could constrain the length scales to within that range, as larger length scales wouldn’t have any observable consequences. We could even take this logic a step further and treat the observed covariates as the only experiment, in which case we would simply use the empirical range to set the scale of the length scale prior. The danger here is that such an empirical prior then limits the applicability of our model, both in terms of inferences and predictions, to within that range.
Given that we are dealing with simulated data we will take the empirical approach and consider the weakly informative prior \[ \rho \sim \text{Half-}\mathcal{N}(0, \Delta x / 3),\] or for our observed data, \[ \rho \sim \text{Half-}\mathcal{N}(0, 20 / 3).\]
We should also consider weakly informative priors for the marginal standard deviation, \(\alpha\), and measurement variability, \(\sigma\). Once again we could use principled prior information about the system being modeled to set the scales of these priors, or we could compromise a bit and consider some empirical behavior. For example, the range of observed variate values could inform the scale of the marginal standard deviation, and the standard deviation of the observed variate values could inform the scale of the measurement variability. The same caution about using empirical priors that we discussed above, however, applies here as well.
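A rough sketch of such empirical scales, assuming the observed variates are available as data$y in the same data list; these are illustrative choices rather than the exact priors adopted below,

# covariate span sets the scale of the length scale prior
rho_scale <- diff(range(data$x)) / 3

# range of the observed variates informs the marginal standard deviation scale
alpha_scale <- diff(range(data$y)) / 2

# spread of the observed variates informs the measurement variability scale
sigma_scale <- sd(data$y)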
For this exercise we will take the weakly-informative priors \[ \alpha \sim \text{Half-}\mathcal{N}(0, 2)\] and \[ \sigma \sim \text{Half-}\mathcal{N}(0, 1).\]
These priors are then readily incorporated into a Stan program, to which we will also add the generation of inferred posterior and posterior predictive realizations of the Gaussian process as we marginalize over the hyperparameters.
writeLines(readLines("gp2.stan"))functions {
vector gp_pred_rng(real[] x2,
vector y1, real[] x1,
real alpha, real rho, real sigma, real delta) {
int N1 = rows(y1);
int N2 = size(x2);
vector[N2] f2;
{
matrix[N1, N1] K = cov_exp_quad(x1, alpha, rho)
+ diag_matrix(rep_vector(square(sigma), N1));
matrix[N1, N1] L_K = cholesky_decompose(K);
vector[N1] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
vector[N1] K_div_y1 = mdivide_right_tri_low(L_K_div_y1', L_K)';
matrix[N1, N2] k_x1_x2 = cov_exp_quad(x1, x2, alpha, rho);
vector[N2] f2_mu = (k_x1_x2' * K_div_y1);
matrix[N1, N2] v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
matrix[N2, N2] cov_f2 = cov_exp_quad(x2, alpha, rho) - v_pred' * v_pred
+ diag_matrix(rep_vector(delta, N2));
f2 = multi_normal_rng(f2_mu, cov_f2);
}
return f2;
}
}
data {
int<lower=1> N;
real x[N];
vector[N] y;
int<lower=1> N_predict;
real x_predict[N_predict];
}
parameters {
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
}
model {
matrix[N, N] cov = cov_exp_quad(x, alpha, rho)
+ diag_matrix(rep_vector(square(sigma), N));
matrix[N, N] L_cov = cholesky_decompose(cov);
rho ~ normal(0, 20.0 / 3);
alpha ~ normal(0, 2);
sigma ~ normal(0, 1);
y ~ multi_normal_cholesky(rep_vector(0, N), L_cov);
}
generated quantities {
vector[N_predict] f_predict = gp_pred_rng(x_predict, y, x, alpha, rho, sigma, 1e-10);
vector[N_predict] y_predict;
for (n in 1:N_predict)
y_predict[n] = normal_rng(f_predict[n], sigma);
}
Diligently checking the diagnostics we see that even with weakly-informative priors the fit of this model exhibits divergences,
fit <- stan(file='gp2.stan', data=data, seed=5838298)

Warning: There were 14 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
check_all_diagnostics(fit)

[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "14 of 4000 iterations ended with a divergence (0.35%)"
[1] " Try running with larger adapt_delta to remove the divergences"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-BFMI indicated no pathological behavior"
Following up on the divergences we take a look at the two-dimensional marginal posteriors,
partition <- partition_div(fit)
div_params <- partition[[1]]
nondiv_params <- partition[[2]]
par(mfrow=c(1, 3))
par(mar = c(4, 4, 0.5, 0.5))
plot(nondiv_params$rho, nondiv_params$alpha, log="xy",
col=c_dark_trans, pch=16, cex=0.8, xlab="rho", ylab="alpha")
points(div_params$rho, div_params$alpha,
col=c_green_trans, pch=16, cex=0.8)
par(mar = c(4, 4, 0.5, 0.5))
plot(nondiv_params$rho, nondiv_params$sigma,
col=c_dark_trans, pch=16, cex=0.8, xlab="rho", ylab="sigma")
points(div_params$rho, div_params$sigma,
col=c_green_trans, pch=16, cex=0.8)
par(mar = c(4, 4, 0.5, 0.5))
plot(nondiv_params$alpha, nondiv_params$sigma,
col=c_dark_trans, pch=16, cex=0.8, xlab="alpha", ylab="sigma")
points(div_params$alpha, div_params$sigma,
col=c_green_trans, pch=16, cex=0.8)and immediately notice some interesting behavior below \(\rho = 2\), where in particular \(\sigma\) seem to expand to its prior range. What we are seeing is the manifestation of another non-identifiability, this time below the minimum covariate distance. Just as there is no information in the data above the maximum covariate distance, there is also no information in the data below the minimum covariate distance where the hyperparameters are informed only by their priors.
In particular, the measurement variability can drop to zero, allowing the Gaussian process to perfectly interpolate between the observed data. As the measurement variability vanishes the likelihood becomes somewhat singular and a ridge of high curvature develops exactly where we see the divergences!
We can see this behavior by isolating the Gaussian process realizations with \(\sigma < 0.5\), which concentrate around the input covariates to the detriment of performance between them.
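The helper plot_low_sigma_gp_realizations is defined in gp_utility.R; the selection it performs amounts to something like the following sketch, which pulls out the posterior draws in the small-sigma regime (the plotting itself is omitted),

params <- extract(fit)
low_sigma <- params$sigma < 0.5

# fraction of posterior draws falling into the small-sigma regime
mean(low_sigma)

# corresponding posterior realizations of the latent Gaussian process
f_low_sigma <- params$f_predict[low_sigma, ]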
plot_low_sigma_gp_realizations(fit, data, true_realization,
                               "Posterior Realizations with sigma < 0.5")

Unfortunately, these overfitting realizations at small length scales significantly compromise the overall marginal performance of the Bayesian model.
plot_gp_realizations(fit, data, true_realization,
                     "Posterior Realizations")

plot_gp_quantiles(fit, data, true_realization,
                  "Posterior Quantiles")

plot_gp_pred_realizations(fit, data, true_realization,
                          "Posterior Predictive Realizations")