Part 1: Introduction to Gaussian Processes
Part 2: Optimizing Gaussian Processes Hyperparameters
Part 3: Bayesian Inference of Gaussian Processes Hyperparameters
In Part 2 of this case study we investigated the limited performance of both regularized and unregularized maximum marginal likelihood for inferring Gaussian process hyperparameters. Here we conclude the case study with a treatment of Bayesian inference for the hyperparameters. We will see that a naive Bayesian model fares no better than maximum marginal likelihood, but that maintaining a proper Bayesian workflow allows us to quickly identify the problems and build an appropriately robust model.
As in Part 2 we begin by setting up our local computing environment,
library(rstan)

Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.16.2, packaged: 2017-07-03 09:24:58 UTC, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
source("stan_utility.R")
source("gp_utility.R")
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")
c_light_trans <- c("#DCBCBC80")
c_light_highlight_trans <- c("#C7999980")
c_mid_trans <- c("#B97C7C80")
c_mid_highlight_trans <- c("#A2505080")
c_dark_trans <- c("#8F272780")
c_dark_highlight_trans <- c("#7C000080")
c_green_trans <- c("#00FF0080")
c_superfine <- c("#8F272705")

and then loading in both the simulated data and the ground truth,
data <- read_rdump('gp.data.R')
true_realization <- read_rdump('gp.truth.R')

To judge the performance of our fits we also recreate the true data generating process that we are attempting to model,
f_data <- list(sigma=true_realization$sigma_true,
N=length(true_realization$f_total),
f=true_realization$f_total)
dgp_fit <- stan(file='simu_gauss_dgp.stan', data=f_data, iter=1000, warmup=0,
chains=1, seed=5838298, refresh=1000, algorithm="Fixed_param")
SAMPLING FOR MODEL 'simu_gauss_dgp' NOW (CHAIN 1).
Iteration: 1 / 1000 [ 0%] (Sampling)
Iteration: 1000 / 1000 [100%] (Sampling)
Elapsed Time: 0 seconds (Warm-up)
0.054671 seconds (Sampling)
0.054671 seconds (Total)
plot_gp_pred_quantiles(dgp_fit, data, true_realization,
                       "True Data Generating Process Quantiles")

A common conceit of inexperienced Bayesians is that they have no information about the hyperparameters and hence resort to a default uniform prior that they misunderstand to be “non-informative”.
The corresponding Bayesian model is straightforward to implement in Stan: we just move the hyperparameters from the data block into the parameters block, and in the model block we assign no explicit priors,
writeLines(readLines("gp1.stan"))data {
int<lower=1> N;
real x[N];
vector[N] y;
}
parameters {
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
}
model {
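  // no explicit prior statements here; together with the <lower=0>
  // constraints in the parameters block this implies improper uniform
  // priors for rho, alpha, and sigma over the positive real line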
matrix[N, N] cov = cov_exp_quad(x, alpha, rho)
+ diag_matrix(rep_vector(square(sigma), N));
matrix[N, N] L_cov = cholesky_decompose(cov);
y ~ multi_normal_cholesky(rep_vector(0, N), L_cov);
}
We take our new model out for a run and, following a robust Bayesian workflow, check the output.
fit <- stan(file='gp1.stan', data=data, seed=5838298)

Warning: There were 2125 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 1581 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems
print(fit)

Inference for Stan model: gp1.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25%
rho 9.077243e+307 Inf Inf 7.521524e+306 4.448229e+307
alpha 1.213413e+08 78618520.15 113341765.62 1.380000e+00 9.890810e+03
sigma 3.020000e+00 0.05 0.57 1.930000e+00 2.790000e+00
lp__ 6.949200e+02 0.89 1.59 6.911100e+02 6.939300e+02
50% 75% 97.5% n_eff Rhat
rho 7.981426e+307 1.510861e+308 1.684642e+308 4000 NaN
alpha 2.121612e+08 2.329414e+08 2.498356e+08 2 4.97
sigma 3.010000e+00 3.240000e+00 4.450000e+00 114 1.05
lp__ 6.952000e+02 6.957300e+02 6.968500e+02 3 1.56
Samples were drawn using NUTS(diag_e) at Sun Nov 12 14:15:34 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
The output of print(fit) is already alarming as the posterior for the hyperparameters, especially the length scale, is concentrating at extremely large values,
params <- extract(fit)
par(mfrow=c(1, 3))
alpha_breaks=10 * (0:50) / 50 - 5
hist(log(params$alpha), main="", xlab="log(alpha)", col=c_dark,
border=c_dark_highlight, yaxt='n')
beta_breaks=10 * (0:50) / 50 - 5
hist(log(params$rho), main="", xlab="log(rho)", col=c_dark,
border=c_dark_highlight, yaxt='n')
sigma_breaks=5 * (0:50) / 50
hist(params$sigma, main="", xlab="sigma", col=c_dark, border=c_dark_highlight,
     yaxt='n')

What could be drawing the posterior to such large length scales? Well, recall that the input covariates from which we are trying to learn the length scale cover only the range \(-10 \le x \le 10\). Consequently there is nothing in the data from which the likelihood can inform length scales above \(\rho = 20\). More formally, the likelihood is non-identified above the maximum covariate distance.
In a Bayesian analysis this non-identifiability implies that the posterior above \(\rho = 20\) reduces to the prior. Unfortunately we have chosen a uniform prior which places infinite prior mass at arbitrarily large length scales, and the posterior dutifully follows. Counter to common intuition, uniform priors are far from non-informative. Instead they are extremely informative about the validity of arbitrarily extreme values, and their assumption has significant consequences for the corresponding model!
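We can check the relevant scale directly from the observed covariates; a quick sketch, assuming the covariates are stored in data$x exactly as they are passed to the Stan data block,

# span of the observed covariates; length scales larger than this
# maximum pairwise distance are not constrained by the likelihood
diff(range(data$x))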
In order to ensure reasonable inferences we need to incorporate some principled information about the hyperparameters into our model.
Firstly we need to construct a prior that constrains our inferences to more reasonable length scales, which requires some consideration of what a reasonable length scale might be. If we had prior information about the interaction between the variates and covariates then we could incorporate that directly. Alternatively, if we knew the range of covariate values that we could encounter in an experiment then we could constrain the length scales to within that range, as larger length scales wouldn’t have any observable consequences. We could even take this logic a step further and treat the observed covariates as the only experiment, in which case we would simply use the empirical range to set the scale of the length scale prior. The danger here is that such an empirical prior then limits the applicability of our model, both in terms of inferences and predictions, to within that range.
Given that we are dealing with simulated data we will take the empirical approach and consider the weakly informative prior \[ \rho \sim \text{Half-}\mathcal{N}(0, \Delta x / 3),\] or for our observed data, \[ \rho \sim \text{Half-}\mathcal{N}(0, 20 / 3).\]
We should also consider weakly informative priors for the marginal standard deviation, \(\alpha\), and measurement variability, \(\sigma\). Once again we could use principled prior information about the system being modeled to set the scales of these priors, or we could compromise a bit and consider some empirical behavior. For example, the range of observed variate values could inform the scale of the marginal standard deviation, and the standard deviation of the observed variate values could inform the scale of the measurement variability. The same caution about using empirical priors that we discussed above, however, applies here as well.
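A rough sketch of such empirical scales, assuming the observed variates are available as data$y in the same data list; these are illustrative choices rather than the exact priors adopted below,

# covariate span sets the scale of the length scale prior
rho_scale <- diff(range(data$x)) / 3

# range of the observed variates informs the marginal standard deviation scale
alpha_scale <- diff(range(data$y)) / 2

# spread of the observed variates informs the measurement variability scale
sigma_scale <- sd(data$y)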
For this exercise we will take the weakly-informative priors \[ \alpha \sim \text{Half-}\mathcal{N}(0, 2)\] and \[ \sigma \sim \text{Half-}\mathcal{N}(0, 1).\]
These priors are then readily incorporated into a Stan program, to which we will also add the generation of inferred posterior and posterior predictive realizations of the Gaussian process as we marginalize over the hyperparameters.
writeLines(readLines("gp2.stan"))functions {
vector gp_pred_rng(real[] x2,
vector y1, real[] x1,
real alpha, real rho, real sigma, real delta) {
int N1 = rows(y1);
int N2 = size(x2);
vector[N2] f2;
{
matrix[N1, N1] K = cov_exp_quad(x1, alpha, rho)
+ diag_matrix(rep_vector(square(sigma), N1));
matrix[N1, N1] L_K = cholesky_decompose(K);
vector[N1] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
vector[N1] K_div_y1 = mdivide_right_tri_low(L_K_div_y1', L_K)';
matrix[N1, N2] k_x1_x2 = cov_exp_quad(x1, x2, alpha, rho);
vector[N2] f2_mu = (k_x1_x2' * K_div_y1);
matrix[N1, N2] v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
matrix[N2, N2] cov_f2 = cov_exp_quad(x2, alpha, rho) - v_pred' * v_pred
+ diag_matrix(rep_vector(delta, N2));
f2 = multi_normal_rng(f2_mu, cov_f2);
}
return f2;
}
}
data {
int<lower=1> N;
real x[N];
vector[N] y;
int<lower=1> N_predict;
real x_predict[N_predict];
}
parameters {
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
}
model {
matrix[N, N] cov = cov_exp_quad(x, alpha, rho)
+ diag_matrix(rep_vector(square(sigma), N));
matrix[N, N] L_cov = cholesky_decompose(cov);
rho ~ normal(0, 20.0 / 3);
alpha ~ normal(0, 2);
sigma ~ normal(0, 1);
y ~ multi_normal_cholesky(rep_vector(0, N), L_cov);
}
generated quantities {
vector[N_predict] f_predict = gp_pred_rng(x_predict, y, x, alpha, rho, sigma, 1e-10);
vector[N_predict] y_predict;
for (n in 1:N_predict)
y_predict[n] = normal_rng(f_predict[n], sigma);
}
Diligently checking the diagnostics we see that even with weakly-informative priors the fit of this model exhibits divergences,
fit <- stan(file='gp2.stan', data=data, seed=5838298)

Warning: There were 14 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
check_all_diagnostics(fit)

[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "14 of 4000 iterations ended with a divergence (0.35%)"
[1] " Try running with larger adapt_delta to remove the divergences"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-BFMI indicated no pathological behavior"
Following up on the divergences we take a look at the two-dimensional marginal posteriors,
partition <- partition_div(fit)
div_params <- partition[[1]]
nondiv_params <- partition[[2]]
par(mfrow=c(1, 3))
par(mar = c(4, 4, 0.5, 0.5))
plot(nondiv_params$rho, nondiv_params$alpha, log="xy",
col=c_dark_trans, pch=16, cex=0.8, xlab="rho", ylab="alpha")
points(div_params$rho, div_params$alpha,
col=c_green_trans, pch=16, cex=0.8)
par(mar = c(4, 4, 0.5, 0.5))
plot(nondiv_params$rho, nondiv_params$sigma,
col=c_dark_trans, pch=16, cex=0.8, xlab="rho", ylab="sigma")
points(div_params$rho, div_params$sigma,
col=c_green_trans, pch=16, cex=0.8)
par(mar = c(4, 4, 0.5, 0.5))
plot(nondiv_params$alpha, nondiv_params$sigma,
col=c_dark_trans, pch=16, cex=0.8, xlab="alpha", ylab="sigma")
points(div_params$alpha, div_params$sigma,
col=c_green_trans, pch=16, cex=0.8)and immediately notice some interesting behavior below \(\rho = 2\), where in particular \(\sigma\) seem to expand to its prior range. What we are seeing is the manifestation of another non-identifiability, this time below the minimum covariate distance. Just as there is no information in the data above the maximum covariate distance, there is also no information in the data below the minimum covariate distance where the hyperparameters are informed only by their priors.
In particular, the measurement variability can drop to zero, allowing the Gaussian process to perfectly interpolate between the observed data. As the measurement variability vanishes the likelihood becomes somewhat singular and a ridge of high curvature develops exactly where we see the divergences!
We can see this behavior by isolating the Gaussian process realizations with \(\sigma < 0.5\), which concentrate around the input covariates to the detriment of performance between them.
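The helper plot_low_sigma_gp_realizations is defined in gp_utility.R; the selection it performs amounts to something like the following sketch, which pulls out the posterior draws in the small-sigma regime (the plotting itself is omitted),

params <- extract(fit)
low_sigma <- params$sigma < 0.5

# fraction of posterior draws falling into the small-sigma regime
mean(low_sigma)

# corresponding posterior realizations of the latent Gaussian process
f_low_sigma <- params$f_predict[low_sigma, ]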
plot_low_sigma_gp_realizations(fit, data, true_realization,
                               "Posterior Realizations with sigma < 0.5")

Unfortunately, these overfitting realizations at small length scales significantly compromise the overall marginal performance of the Bayesian model.
plot_gp_realizations(fit, data, true_realization,
                     "Posterior Realizations")

plot_gp_quantiles(fit, data, true_realization,
                  "Posterior Quantiles")

plot_gp_pred_realizations(fit, data, true_realization,
                          "Posterior Predictive Realizations")