A probability distribution is a frustratingly abstract object. In order to explicitly specify a probability distribution in full we would have to specify the probability assigned to each element of the ambient \(\sigma\)-algebra, which is beyond impractical for even simple spaces. In practice we instead define probability distributions implicitly through their computational consequences. More formally we define them algorithmically through methods that return expectation values.

More poetically we can think of a probability distribution as an oracle. While we cannot perceive the oracle directly, we can ask it questions. Those questions take the form of real-valued, integrable functions, and the answers we receive from our probabilistic oracle take the form of expectation values of those functions with respect to the probability distribution.

Some simple distributions are structured such that certain expectation values are known analytically. For example the distributions specified by probability density functions within the Gaussian family are parameterized in terms of their means and standard deviations. These distributions also admit closed-form cumulative distribution functions which can be used to compute probabilities and quantiles.
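
In R, for example, these closed-form results are exposed directly through the pnorm and qnorm functions; a minimal sketch, with the mean of 1 and standard deviation of 2 chosen purely for illustration.

# Probability assigned to the interval (-1, 3) by a Gaussian
# probability density function with mean 1 and standard deviation 2
pnorm(3, mean=1, sd=2) - pnorm(-1, mean=1, sd=2)

# Boundaries of the central 90% interval of that same Gaussian
qnorm(c(0.05, 0.95), mean=1, sd=2)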

The more sophisticated probability distributions that arise in applied analyses, however, do not enjoy these analytic results, even for the expectation values of simple functions. Instead we have to rely on numerical algorithms that only estimate the exact expectation values returned by a given probability distribution. In other words we don’t hear our oracle so clearly, receiving its answers only over noisy transmission lines that corrupt the delicate messages. The challenge of practical probabilistic computation is the construction of algorithms with well-behaved and well-quantified errors that allow us to understand when we can trust the corrupted answers that we receive.

In this case study I introduce the basics of probabilistic computation, with a focus on the challenges that arise as we attempt to scale to problems in more than a few dimensions. I discuss a variety of popular probabilistic computational algorithms in the context of these challenges and set the stage for a more thorough discussion of Markov chain Monte Carlo and Hamiltonian Monte Carlo that will follow in future case studies.

1 Representation with Computational Taxation

The basic algorithmic approaches to probabilistic computation are motivated by how we represent a target probability distribution, \(\pi\), defined over an ambient space, \(Q\). I am intentionally using \(Q\) to denote the ambient space, and \(q\) to denote points in that space, instead of something more familiar, to emphasize the abstract nature of this discussion. Computation is the same regardless of the interpretation of \(\pi\) and \(Q\).

Here we will consider two of the most popular representations – probability density function representations and sampling representations.

1.1 I Just Find Densities Terribly Exhausting

In most applications probability distributions are specified through a probability density function representation within a given parameterization, and computation follows from that representation.

For example probability distributions over discrete spaces can be specified with a probability mass function, \(\pi(q)\), and expectation values can be computed with summation over the entire space, \[ \mathbb{E}_{\pi}[f] = \sum_{q \in Q} \pi(q) \, f(q). \] Beyond ambient spaces with only a finite number of points, these sums quickly exhaust our finite computational resources. In practice all we can do is limit whatever resources are available to a subset of points in order to approximate these expectation values.
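
For a small enough discrete space this summation is straightforward to implement; here is a minimal sketch in R, with the five-point space, the probability mass function, and the function f all hypothetical choices made just for demonstration.

# A toy discrete ambient space with five points
qs <- c(1, 2, 3, 4, 5)

# A hypothetical probability mass function over that space
pmf <- c(0.1, 0.2, 0.4, 0.2, 0.1)

# Expectation value of f(q) = q^2 by exhaustive summation
f <- function(q) q**2
sum(pmf * f(qs))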

When specifying probability distributions over continuous spaces we can appeal to probability density functions, \(\pi(q)\), that give expectation values through the standard integral from calculus, \[ \mathbb{E}_{\pi}[f] = \int_{Q} \mathrm{d} q \, \pi(q) \, f(q). \] For most probability density functions we won’t be able to work out these integrals analytically. Without infinite computational resources we won’t be able to exhaustively survey the entire ambient space to compute a numerical value, either.

In order to attempt a practical numerical integration we have to discretize the ambient space into a countable number of points and then restrict our attention to a finite number of them. Quadrature methods place a finite grid over the ambient space which partitions the ambient space into a collection of volume elements, \((\Delta q)_{n}\) [1]. If we then approximate the integrand as constant across each volume element the expectation value can be estimated as \[ \mathbb{E}_{\pi}[f] \approx \sum_{n = 1}^{N} (\Delta q)_{n} \, \pi(q_{n}) \, f(q_{n}). \]

For example in one dimension a quadrature method might define a grid of \(N + 1\) evenly spaced points \(\{q_{0}, \ldots, q_{N}\}\), with the volume elements defined as \((\Delta q)_{n} = q_{n} - q_{n - 1}\), and the integrand values assumed to be the value at the right end of each volume element.
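
A minimal sketch of this right-endpoint rule in R, estimating the expectation value of f(q) = q^2 under a unit Gaussian probability density function; the grid limits and the number of grid points are arbitrary choices for illustration.

# Right-endpoint quadrature estimate of a unit Gaussian expectation
# value using an evenly spaced grid spanning most of the probability mass
N_grid <- 1000
grid <- seq(-8, 8, length.out=N_grid + 1)

# Volume elements and right endpoints
delta_q <- diff(grid)
q_right <- grid[2:(N_grid + 1)]

# Quadrature estimate of E[q^2]; the exact value is 1
sum(delta_q * dnorm(q_right) * q_right**2)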




More sophisticated methods might consider the integrand value at both end points, interpolating between the two as in the trapezoid rule from calculus. They might also introduce complex gridding schemes or even modify the integrand to control the overall error.

In two dimensions the volume elements from a uniform grid become squares,




with four bounding grid points and a wealth of ways to interpolate the integrand values at those points into a single value.

Regardless of the dimension, as more points are added to the quadrature grid the discrete volume elements, and the error of the quadrature estimators, shrink. Exactly how fast the error shrinks depends on the structure and expanse of the grid, the smoothness of the integrand, and the dimension of the ambient space. The construction of sophisticated grids to minimize errors is a core topic of the Quasi-Monte Carlo literature.

In general the more of the ambient space a grid covers, and the denser it is within that expanse, the more accurate the resulting quadrature estimator will be, and the more faithful the messages we will receive from our probabilistic oracle. At the same time the cost of a quadrature estimator also scales with the total number of grid points, and more accurate grids will also be more expensive. This is where quadrature methods start facing limitations.

Consider a uniform grid with \(N\) points in each dimension of the ambient space, \(Q\). When \(Q\) is one-dimensional the cost of the quadrature estimator is dominated by the \(N\) evaluations of the integrand. If \(Q\) is two-dimensional? Then the cost increases to \(N^{2}\) evaluations. Three-dimensional? \(N^{3}\) evaluations. \(D\)-dimensional? \(N^{D}\).

At this point we have to stop and recognize that the cost of a quadrature estimator based on a uniform grid is scaling exponentially fast with dimension. To achieve reasonable accuracy we will need reasonably large \(N\) for a grid with the necessary expanse and density, and after just a few dimensions that exponential scaling will completely exhaust whatever computational resources we have.
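
To get a sense of just how punishing this scaling is, consider a grid with only one hundred points per dimension; the snippet below simply tabulates the implied number of integrand evaluations.

# Integrand evaluations required by a uniform grid
# with 100 points per dimension
N_per_dim <- 100
data.frame(D=1:6, evaluations=N_per_dim**(1:6))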

The ultimate lesson here is that any attempt to exhaustively quantify a space, whether a discrete space or a discretized continuous space, in order to estimate expectation values is largely limited to one or two-dimensional problems. In order to scale to the higher-dimensional problems that arise in practice we need to be smarter with our precious, finite computational resources. We need to identify exactly where in the ambient space we should be focusing that computation to achieve as small an error as possible.

Unfortunately the answer to that question is obfuscated by the counterintuitive behaviors of high-dimensional spaces. In order to build up any conceptual understanding we first need to learn how our intuition fails.

1.2 The House Always Wins

Another way to represent a probability distribution is through exact sampling. Recall that an exact sampling mechanism is a procedure for generating a sequence of independent points in the ambient space, \[ \{ q_{1}, \ldots, q_{N} \} \in Q, \] such that the ensemble average of a real-valued, integrable function, \[ \hat{f}_{N}^{\text{MC}} = \frac{1}{N} \sum_{n = 1}^{N} f(q_{n}), \] defines an asymptotically consistent estimator of the corresponding expectation value, \[ \lim_{N \rightarrow \infty} \hat{f}_{N}^{\text{MC}} = \mathbb{E}_{\pi}[f]. \] These estimators are commonly denoted Monte Carlo estimators, referencing the famous casino because of how exact sampling mechanisms are reminiscent of casino games [3].

Asymptotic consistency isn’t particularly useful in practice because, well, infinity is just too big to be reasonably approximated by the number of samples we can generate with finite computational resources. Fortunately the behavior of Monte Carlo estimators with only a finite number of samples can also be quantified.

The estimator for any square-integrable real-valued function, meaning that both \(\mathbb{E}_{\pi}[f]\) and \(\mathbb{E}_{\pi}[f^{2}]\) exist and are finite, satisfies a central limit theorem, \[ \frac{ \hat{f}_{N}^{\text{MC}} - \mathbb{E}_{\pi}[f] } {\text{MC-SE}_{N}[f] } \sim \mathcal{N}(0, 1), \] where the Monte Carlo Standard Error, \(\text{MC-SE}_{N}[f]\), is defined by \[ \text{MC-SE}_{N}[f] = \sqrt{ \frac{ \text{Var}_{\pi}[f]}{N} }. \] This means that we can quantify the error in the Monte Carlo estimator of a square-integrable function probabilistically. Moreover, because we know how the standard error varies with the number of samples, \(N\), we can determine how many exact samples we need to achieve a given precision. Keep in mind, however, that this error bound is probabilistic – while most samples will yield Monte Carlo estimators with error near \(\text{MC-SE}_{N}[f]\) some will inevitably yield much larger errors.
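
For example, inverting the standard error formula gives the number of exact samples needed to hit a target precision; a minimal sketch, with the variance and target standard error chosen purely for illustration.

# Number of exact samples needed to achieve a target Monte Carlo
# standard error given a known, or estimated, variance
var_f <- 4
target_se <- 0.01
ceiling(var_f / target_se**2)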

In practice we won’t know the variance of \(f\) and hence can’t quantify the Monte Carlo standard error exactly. We can, however, construct an estimator for it at the same time that we construct the Monte Carlo estimator \(\hat{f}_{N}^{\text{MC}}\). If \(f\) is cube-integrable then this will introduce a small error of order \(\mathcal{O}(N^{-1})\) which becomes negligible compared to the \(\mathcal{O}(N^{-\frac{1}{2}})\) standard error for reasonably large \(N\). We often ignore these technicalities in practice, but if the target probability distribution might be heavy-tailed then we can’t be so careless. While we can always compute Monte Carlo estimators and their standard errors, they won’t bear any meaningful relationship to the true expectation value unless all of these technical conditions hold.

The basics of implementing a Monte Carlo estimator are then computing empirical means and variances from a given exact sample. These require repeatedly summing over the function evaluations, \(f(q_{n})\) and \(f^{2}(q_{n})\), which for large \(N\) is vulnerable to significant errors on computers due to the limitations of floating point arithmetic [4]. One way to reduce this vulnerability, and avoid repeatedly sweeping over the sample, is to use Welford accumulators. The following accumulator, for example, computes Monte Carlo estimators and variance estimators for the input function evaluations.

welford_summary <- function(f) {
  # summary[1] accumulates a running mean of the inputs while
  # summary[2] accumulates a running sum of squared deviations
  summary <- c(0, 0)
  for (n in 1:length(f)) {
    delta <- f[n] - summary[1]
    summary[1] <- summary[1] + delta / n
    summary[2] <- summary[2] + delta * (f[n] - summary[1])
  }
  # Convert the sum of squared deviations into an empirical variance
  summary[2] <- summary[2] / (length(f) - 1)
  return(summary)
}

We can then use the Welford accumulator output to compute the Monte Carlo estimator of the given function as well as an estimate of its Monte Carlo Standard Error,

compute_mc_stats <- function(f) {
  # Return the Monte Carlo estimator and its estimated standard error
  summary <- welford_summary(f)
  return(c(summary[1], sqrt(summary[2] / length(f))))
}
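
As a quick sanity check, and assuming an exact sampling mechanism such as rnorm, we can exercise these helpers on a unit Gaussian where the Monte Carlo estimator of the second moment should land within a few standard errors of the exact value of 1.

# Exercise the helpers on exact unit Gaussian samples; the estimate
# of E[q^2] should be close to 1, within a few standard errors
check_samples <- rnorm(10000)
compute_mc_stats(check_samples**2)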

Technically we cannot perfectly generate exact samples on computers. Instead we use pseudo random number generators which can generate long sequences of samples from certain distributions that are nearly exact [5]. Only when the length of a sequence approaches the period of the pseudo random number generator does the error become significant.

Pseudo random number generators do have to be initialized with a seed. The exact samples generated will depend on the seed, but their statistical behavior should not. While the seed shouldn’t matter we do want to explicitly set one instead of relying on some implicit default that we can’t replicate; the last thing you want is to observe weird behavior that you can’t debug because you can’t reproduce it. Have fun with your seed, ideally make it an obscure reference that few will understand.

set.seed(8675309)

In this case study I will be using Monte Carlo estimators to demonstrate some of the counterintuitive behaviors of high-dimensional spaces. Let’s be conservative and use lots of samples to ensure small errors for most of the estimators we’ll encounter.

N <- 100000

One curious behavior of Monte Carlo estimators is that unlike quadrature estimators they don’t seem to depend on the dimension of the ambient space. We will return to Monte Carlo and this question in Section 3.2.1 once we develop a better understanding of just how weird high-dimensional spaces are.

2 Concentration of Measure and the Typical Set

We saw in Section 1.1 that the natural probabilistic computation when using probability density functions is limited by the overwhelming costs of searching the entire ambient space on which the latent probability distribution is defined. This then motivated the question of whether or not we can construct accurate approximations by focusing our finite computational resources on only a subset of the ambient space, and if so where that focus should be.

In this section I will build up theoretical motivation for the optimal focus while contrasting it to the poor intuitions that we often adopt when extrapolating our low-dimensional perceptions into higher-dimensional circumstances.

While the general lessons apply to both discrete and continuous spaces, for the rest of this case study I will consider only continuous spaces as ultimately it is only for those spaces that we have the technology to construct robust yet general estimation methods.

2.1 Dense

Probability distributions defined on continuous spaces admit probability density function representations, from which expectation values can be computed as integrals, \[ \mathbb{E}_{\pi}[f] = \int_{Q} \mathrm{d} q \, \pi(q) \, f(q). \] Although we can’t compute these integrals analytically in anything but the simplest problems, we can apply numerical integration methods like quadrature estimators.

Quadrature estimators based on uniform grids across the ambient space \(Q\), however, become too computationally demanding beyond a few dimensions. Visual inspection suggests that this demand might not be inherent to quadrature in general but rather to the wasteful nature of uniform grids – most of the volume elements defined by a uniform grid seem to offer little contribution to the final estimator.




How might we cull the uniform grid to achieve a more efficient quadrature estimator? One immediate idea would be to remove those grid points where the integrand, \[ \pi(q) \, f(q), \] is not significantly different from zero. If the integrand is zero in a neighborhood then so too will be the contribution to the integral in that neighborhood, \[ \int_{A} \mathrm{d} q \, \pi(q) \, f(q) \approx \int_{A} \mathrm{d} q \, 0 = 0. \] At the same time in neighborhoods where the integrand is larger we expect larger contributions. If we are constrained by limited computational resources, then shouldn’t we focus our computation on those neighborhoods where the integrand is the largest? In other words, let’s consider grids that concentrate in the neighborhoods where the integrand is largest.




Now I am going to make an important, but restricting, assumption. For most applications the functions encoding our questions, \(f(q)\), will be reasonably uniform relative to the strong variations in the target probability density function, \(\pi(q)\). Consequently the behavior of the integrand, and the shape of the most efficient grid, can be well approximated by considering only the target probability density function. Methods based on this assumption will yield expectation value estimators that are reasonably accurate for most functions, but not all. Care must be taken when attempting to apply any of the following lessons when estimating expectation values that are largest far in the tails of the target distribution.




Following this intuition we should be able to develop reasonably efficient expectation value estimators by finding the mode of the target probability density function, say with an efficient optimization algorithm, and then focusing our computation in a small neighborhood around that point of highest probability density. To visualize this intuition consider a relatively isotropic probability density function in \(D\)-dimensions. As we move away from the mode in any radial direction the target probability density function, and presumably the contribution to expectation values, quickly decays.




This intuition is pleasing, and consistent with the many optimization-based algorithms in the literature. Unfortunately it ultimately proves to be fatally flawed. Probability density alone is not sufficient for integration.

2.2 Voluminous

In the previous section we focused on how the integrand, and in particular the target probability density function, influences local contributions to an integral. Unfortunately our thought experiment ignored another critical influence: volume.

The differential volume element \(\mathrm{d} q\) that appears in our integral, \[ \mathbb{E}_{\pi}[f] = \int_{Q} \mathrm{d} q \, \pi(q) \, f(q), \] is not just a notational convenience reminding us of the variables with respect to which we are integrating. It indicates that an integral is determined not only by the values of the integrand but also by the space over which those values are distributed. A neighborhood where the integrand is small might still provide a significant, if not outright dominant, contribution to an integral if its volume is large enough.

Unfortunately volume can behave in surprising ways, especially as we move past the lower dimensional spaces that characterize our physical perception. In this section we’ll explore some of the counterintuitive properties of volume over the real numbers in order to better understand the fundamental challenges of probabilistic computation.

2.2.1 In Every Corner

Let’s begin with a simple example - a square box representing the ambient space. In each dimension we will partition the box into three smaller boxes, with the central box corresponding to the neighborhood around the mode and the outer two side boxes corresponding to the rest of the space.

In one dimension that means that the central, modal box encapsulates a third of the total volume of the enclosing box. Ignoring the other two boxes isn’t necessarily a terrible approximation.




Now consider adding another dimension. Adding the two side boxes corresponding to that new dimension yields five total boxes, and an almost linear increase in the total volume under consideration. Those boxes, however, do not completely fill the enclosing box. To do that we have to also add the four corner boxes which brings the total enclosed volume up to nine boxes. The relative importance of that central box has all of a sudden dropped to only a ninth.




Moving to three dimensions the situation becomes even more dire. Along each axis we have seven boxes. Adding the corners in each two-dimensional plane adds twelve more boxes for a total of nineteen. This, however, still does not fill up the enclosing box!

To completely fill the space we need to consider not just the corners but also the corners of corners. It’s corner madness! The corner-corners add eight more boxes, bringing the total number of boxes up to twenty seven and the relative importance of the central box to just one twenty-seventh!




Extrapolating it looks like the relative contribution of the central box scales as the relative partition width raised to the dimension of the space, in this case \((1 / 3)^{D}\). We can verify this scaling empirically using Monte Carlo. The relative importance of the central box is its probability with respect to a uniform distribution over the enclosing box, which we can estimate by sampling from that uniform distribution and computing the Monte Carlo estimator of the expectation value of the corresponding inclusion function, \[ \begin{align*} \mathbb{P}_{\pi} [ B_{c} ] &= \mathbb{E}_{\pi}[ I_{B_{c}} ] \\ &= \int_{B} \mathrm{d} q \, U(q) \, I_{B_{c}}(q) \\ &\approx \frac{1}{N} \sum_{n = 1}^{N} I_{B_{c}}(q_{n}) \\ &= \frac{ N_{\text{in central box}} }{ N }. \end{align*} \]

For example in one dimension we get

# We start by assuming that the sampled
# point will be in the central box with
# the indicator function equaling one
D <- 1
is_central_samples <- rep(1, N)

for (n in 1:N) {
  # Sample a new point one dimension at a time
  for (d in 1:D) {
    q_d <- runif(1, -3, 3)
  
    # If the sampled component is not contained 
    # within the central interval then set the 
    # inclusion function to zero
    if (-1 > q_d || q_d > 1)
      is_central_samples[n] <- 0
  }
}

# Estimate the relative volume as a probability
compute_mc_stats(is_central_samples)
[1] 0.3337567 0.0014912

This is consistent with the value of \(1/3\) we reached from geometric considerations. The beauty of Monte Carlo, however, is that we can quickly ramp up the dimensionality of the space without having to count so many damned boxes.

Ds <- 1:10

prob_means <- 0 * Ds
prob_ses <- 0 * Ds

for (D in Ds) {
  is_central_samples <- rep(1, N)
  for (n in 1:N) {
    for (d in 1:D) {
      q_d <- runif(1, -3, 3)
      if (-1 > q_d || q_d > 1)
        is_central_samples[n] <- 0
    }
  }
  # Estimate the relative volume as a probability
  s <- compute_mc_stats(is_central_samples)
  prob_means[D] <- s[1]
  prob_ses[D] <- s[2]
}

# Plot probabilities versus dimension
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")

idx <- rep(Ds, each=2)
plot_Ds <- sapply(1:length(idx), function(b) if(b %% 2 == 0) idx[b] + 0.5 else idx[b] - 0.5)

pad_means <- do.call(cbind, lapply(idx, function(n) prob_means[n]))
pad_ses <- do.call(cbind, lapply(idx, function(n) prob_ses[n]))

plot(1, type="n", main="",
xlim=c(head(Ds, 1) - 0.5, tail(Ds, 1) + 0.5), xlab="Dimension",
ylim=c(0, 0.5), ylab="Probability")

polygon(c(plot_Ds, rev(plot_Ds)), 
        c(pad_means + 2 * pad_ses, rev(pad_means - 2 * pad_ses)),
        col = c_light, border = NA)
lines(plot_Ds, pad_means, col=c_dark, lwd=2)

We see not only the one-ninth and one-twenty-seventh in two and three dimensions but also the general \((1/3)^{D}\) scaling we hypothesized for higher dimensions. This exponential scaling renders the central box, or any single box for that matter, essentially negligible relative to the full space once we reach three or four dimensions. In other words trying to represent the full space with any single box is a rapidly worsening approximation as the dimensionality of our space increases.
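
To make the comparison explicit we can also tabulate the Monte Carlo estimates against the exact scaling, reusing the Ds and prob_means objects computed above.

# Compare the Monte Carlo estimates to the exact (1/3)^D scaling
round(cbind(D=Ds, exact=(1 / 3)**Ds, estimate=prob_means), 5)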

2.2.2 She’ll Be Coming ’Round the Corner

Another way to reason about volume in spaces of increasing dimension is to consider what happens in the neighborhood of a sphere. In particular, how much volume is contained within a shell immediately inside the sphere relative to the volume contained within a shell immediately outside of the sphere with the same radial width? To better intuit that relationship let’s break those shells down into wedges that subtend a constant solid angle.

In one dimension a sphere is just two points, and the inner and outer wedges are just the inner and outer shells themselves. Because the wedges have the same size on both sides of either point, the inner and outer shells will encapsulate the same volume.




In two dimensions our sphere becomes a circle, and now we can see an asymmetry on the two sides. Inside the circle there is only limited room. As the radial width of the inner shell increases the inner wedges have to shrink. Outside the circle there is significantly more room which allows the wedges to instead expand. Summing up all of the wedges we end up with more volume outside the sphere than inside the sphere.




When we consider three dimensions the asymmetry only grows stronger. The outer wedges can now expand into both angular directions while the inner wedges have to shrink in both angular directions, leading to even more volume outside the sphere than inside.




We can verify these qualitative relationships, and provide some quantitative understanding, by appealing once again to Monte Carlo estimation. In particular let’s consider a sphere of radius 2 inside our same box and ask how much volume is contained within an inner shell spanning radii 1.5 to 2, and an outer shell spanning radii 2 to 2.5.

prob_inner_means <- 0 * Ds
prob_inner_ses <- 0 * Ds

prob_outer_means <- 0 * Ds
prob_outer_ses <- 0 * Ds

R <- 2
delta <- 0.5

for (D in Ds) {
  # Does the sampled point fall in the inside neighborhood?
  is_inner_samples <- rep(0, N)
  # Does the sampled point fall in the outside neighborhood?
  is_outer_samples <- rep(0, N)

  for (n in 1:N) {
    # Sample a new point
    x <- runif(D, -3, 3)
  
    # Compute distance from origin
    r <- sqrt(sum(x**2))
  
    # Check if point falls in the inside neighborhood
    if (R - delta < r && r < R)
      is_inner_samples[n] <- 1
  
    # Check if point falls in the outside neighborhood
    if (R < r && r < R + delta)
      is_outer_samples[n] <- 1
  }

  # Estimate the relative volumes as probabilities
  s1 <- compute_mc_stats(is_inner_samples)
  prob_inner_means[D] <- s1[1]
  prob_inner_ses[D] <- s1[2]

  s2 <- compute_mc_stats(is_outer_samples)
  prob_outer_means[D] <- s2[1]
  prob_outer_ses[D] <- s2[2]
}

# Plot probabilities versus dimension
pad_inner_means <- do.call(cbind, lapply(idx, function(n) prob_inner_means[n]))
pad_inner_ses <- do.call(cbind, lapply(idx, function(n) prob_inner_ses[n]))

pad_outer_means <- do.call(cbind, lapply(idx, function(n) prob_outer_means[n]))
pad_outer_ses <- do.call(cbind, lapply(idx, function(n) prob_outer_ses[n]))

plot(1, type="n", main="",
xlim=c(head(Ds, 1) - 0.5, tail(Ds, 1) + 0.5), xlab="Dimension",
ylim=c(0, 0.25), ylab="Probability")

polygon(c(plot_Ds, rev(plot_Ds)),
        c(pad_inner_means + 2 * pad_inner_ses, rev(pad_inner_means - 2 * pad_inner_ses)),
        col = c_light, border = NA)
lines(plot_Ds, pad_inner_means, col=c_light_highlight, lwd=2)

polygon(c(plot_Ds, rev(plot_Ds)),
        c(pad_outer_means + 2 * pad_outer_ses, rev(pad_outer_means - 2 * pad_outer_ses)),
        col = c_dark, border = NA)
lines(plot_Ds, pad_outer_means, col=c_dark_highlight, lwd=2)

The volume in both shells is rapidly decreasing with dimension, but the volume in the outer shell is always larger than the volume in the inner shell. Interestingly enough the volume in the outer shell initially increases before starting to decay. If we looked at shells of larger radii we would see this behavior extend out to higher dimensions. In some sense the growth of volume in the corners manifests as a wave that propagates from small radii to larger radii as the dimensionality of the space increases!

We can compare the volumes of the inner and outer shells directly by plotting the ratio of probabilities versus dimension.

plot(1, type="n", main="",
xlim=c(head(Ds, 1) - 0.5, tail(Ds, 1) + 0.5), xlab="Dimension",
ylim=c(0, 10), ylab="Ratio of Outer versus Inner Probability")

lines(plot_Ds, pad_outer_means / pad_inner_means, col=c_dark_highlight, lwd=2)

This then verifies that the volume in the outer shell is decreasing more slowly than the volume in the inner shell, so that there is always more volume immediately outside the sphere than immediately inside the sphere. Although the noise in the Monte Carlo standard errors here makes this somewhat hard to see, the ratio of volumes in fact grows exponentially fast, \[ \frac{(R + \delta)^{D} - (R)^{D}}{(R)^{D} - (R - \delta)^{D}}. \] Not only is there more volume immediately outside a sphere, there is exponentially more.
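
We can evaluate this exact ratio directly and compare it to the Monte Carlo estimates above; a minimal sketch that reuses the R, delta, and Ds objects defined earlier.

# Exact ratio of the volume immediately outside the sphere to the
# volume immediately inside, which grows exponentially with dimension
exact_ratio <- ((R + delta)**Ds - R**Ds) / (R**Ds - (R - delta)**Ds)
round(exact_ratio, 2)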

2.2.3 \(D \rightarrow \infty\) Is The Loneliest Number

Using what we’ve learned from the last two exercises we can now start to put together a very loose intuition for how volume behaves in high-dimensional real number spaces. Imagine situating ourselves at a point surrounded by a closed surface. No matter where that point is located in the ambient space, volume will appear to concentrate away from us, squeezing up against the boundary of the enclosing surface. If we expand the surface to probe more of the neighboring space then the volume only pushes further away. As the box grows to be infinitely large, and we get a more global perspective on the entire space, it appears as if all of the volume is in fact concentrating in neighborhoods infinitely far away.

One way to visualize this behavior is to employ a stereographic projection of the real numbers. Here we introduce a sphere that makes contact with the real numbers at a point and then “roll” the real numbers up around it. Alternatively we can imagine that the sphere is shiny, with the projection defined by the reflection of the real numbers on its surface.

Regardless of the exact interpretation, a stereographic projection maps infinity in each direction to a common point at the north pole of the sphere. Most of the ambient space maps into a small neighborhood surrounding that point at infinity.




High-dimensional spaces are lonely places. The next time you find yourself at a social event and notice that everyone seems to be avoiding your local neighborhood perhaps you can take solace in the fact that maybe, just maybe, you’ve actually stumbled into a higher-dimensional space than the one to which you are accustomed.

We can formalize the forlorn nature of high-dimensional spaces by simulating the distance between two points randomly sampled from our box. Because of the finite extent of our box the distance is bounded above by the largest diagonal, \[ L \cdot \sqrt{D} = 6 \sqrt{D}. \] To better compare between dimensions I will normalize all distances to this maximum.

# Quantile probabilities that we'll use to visualize distributions
quant_probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)

delta_means <- 0 * Ds
delta_ses <- 0 * Ds
delta_quantiles <- matrix(data = 0, nrow = tail(Ds, 1), ncol = 9)

for (D in Ds) {
  # Distances between two sampled points
  delta_samples <- rep(0, N)
  max_delta <- 6 * sqrt(D)

  for (n in 1:N) {
    # Sample two points
    x1 <- runif(D, -3, 3)
    x2 <- runif(D, -3, 3)
  
    # Compute distance between them
    delta_samples[n] <- sqrt(sum( (x1 - x2)**2 )) / max_delta
  }

  # Estimate average distance
  s <- compute_mc_stats(delta_samples)
  delta_means[D] <- s[1]
  delta_ses[D] <- s[2]

  # Estimate distance quantiles
  delta_quantiles[D,] <- quantile(delta_samples, probs=quant_probs)
}

After an initial increase the average relative distance between points stabilizes to about 40% of the maximum distance. Because that maximum distance is growing, however, the absolute distance is also growing on average.

pad_delta_means <- do.call(cbind, lapply(idx, function(n) delta_means[n]))
pad_delta_ses <- do.call(cbind, lapply(idx, function(n) delta_ses[n]))

plot(1, type="n", main="",
xlim=c(head(Ds, 1) - 0.5, tail(Ds, 1) + 0.5), xlab="Dimension",
ylim=c(0, 1), ylab="Average Relative Distance Between Points")

polygon(c(plot_Ds, rev(plot_Ds)),
c(pad_delta_means + 2 * pad_delta_ses, rev(pad_delta_means - 2 * pad_delta_ses)),
col = c_light, border = NA)
lines(plot_Ds, pad_delta_means, col=c_dark, lwd=2)

The average, however, doesn’t tell the complete story. The average is increasing, but the variance of the distances might also be sufficiently large that the probability of being close to someone doesn’t vanish. To study that we need to examine the entire distance distribution, which we accomplish here by plotting nested quantiles estimated from our samples.

pad_delta_quantiles <- do.call(cbind, lapply(idx, function(n) delta_quantiles[n, 1:9]))

plot(1, type="n", main="",
xlim=c(head(Ds, 1) - 0.5, tail(Ds, 1) + 0.5), xlab="Dimension",
ylim=c(0, 1), ylab="Relative Distance Between Points")

polygon(c(plot_Ds, rev(plot_Ds)), 
        c(pad_delta_quantiles[1,], rev(pad_delta_quantiles[9,])),
        col = c_light, border = NA)
polygon(c(plot_Ds, rev(plot_Ds)), 
        c(pad_delta_quantiles[2,], rev(pad_delta_quantiles[8,])),
        col = c_light_highlight, border = NA)
polygon(c(plot_Ds, rev(plot_Ds)), 
        c(pad_delta_quantiles[3,], rev(pad_delta_quantiles[7,])),
        col = c_mid, border = NA)
polygon(c(plot_Ds, rev(plot_Ds)), 
        c(pad_delta_quantiles[4,], rev(pad_delta_quantiles[6,])),
        col = c_mid_highlight, border = NA)
lines(plot_Ds, pad_delta_quantiles[5,], col=c_dark, lwd=2)

As the average relative distance stabilizes with increasing dimension, the distribution of relative distances rapidly concentrates around the average. In other words, not only is the average absolute distance increasing with the dimension, the probability that you end up close to someone decreases extremely quickly.

This behavior is relevant to more than just existential dread. The increasing space between points is extremely important when considering interpolation methods that try to fill in functional behavior between points where the function output is known. By assuming a certain rigidity one can constrain the possible interpolations in neighborhoods around each point. As the dimension increases, however, those neighborhoods drift apart and that constraint becomes weaker. To maintain the same interpolative power one would have to either increase the density of the points with known function values or strengthen the rigidity constraint exponentially quickly.

2.3 Massive

As the dimensionality of the ambient space increases volume appears to concentrate around the point at infinity, seemingly repelled from any finite point including the mode of the target probability density function within a given parameterization. Indeed the strength of that concentration grows exponentially fast with increasing dimension.




The neighborhoods in the ambient space that contribute most to integrals, and hence our target expectation values, must contain not only significant probability density but also significant volume. More specifically we should be considering the product of these contributions: differential probability mass.




Differential probability mass doesn’t concentrate in neighborhoods around the mode because there isn’t sufficient volume there. It also doesn’t concentrate in neighborhoods around infinity because of the vanishing probability density. Instead it compromises between those two extremes, concentrating around a surface surrounding the mode. This neighborhood is called the typical set, and as the dimension of the ambient space increases it manifests an important behavior known as concentration of measure. The geometry of this neighborhood is critical to not only probabilistic computational methods that can scale beyond a few dimensions but also probabilistic reasoning itself.
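
We can visualize this compromise with a minimal sketch, assuming an isotropic unit Gaussian target so that the probability density depends only on the distance from the mode; the differential probability mass at radius r is then proportional to the density times the r^(D - 1) volume scaling of a thin spherical shell, and the plot reuses the colors defined above.

# Differential probability mass as a function of radius for an
# isotropic unit Gaussian in D dimensions: the density decays like
# exp(-r^2 / 2) while the shell volume grows like r^(D - 1)
rs <- seq(0, 8, 0.01)

plot(1, type="n", main="",
     xlim=c(0, 8), xlab="Distance From Mode",
     ylim=c(0, 1.05), ylab="Normalized Differential Probability Mass")

demo_Ds <- c(1, 4, 16, 36)
cols <- c(c_light, c_mid, c_dark, c_dark_highlight)

for (i in 1:length(demo_Ds)) {
  mass <- rs**(demo_Ds[i] - 1) * exp(-0.5 * rs**2)
  lines(rs, mass / max(mass), col=cols[i], lwd=2)
}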

2.3.1 Where Have All the Cowboys Gone?

Conceptually the typical set defines a neighborhood in the ambient space where the probability density and the local volume are both large enough that their product, differential probability mass, is non-negligible. A threshold on the encapsulated probability mass, specifying exactly what “non-negligible” means, defines a surface around the mode in which differential probability mass concentrates.




Because we are free to choose a threshold, the typical set doesn’t really define a specific neighborhood but rather a family of nested neighborhoods, each corresponding to a weaker threshold. Consequently a better intuition is a fuzzy surface that surrounds, but concentrates away from, the local mode of the representative probability density function.




The concept of the typical set is borrowed from the field of information theory [6] [7] [8], which motivates yet another intuition for the concept. The typical set defines the most efficient way to compress a probability distribution by considering only a subset of the ambient space. This helps clarify, for example, why the neighborhood around the mode is so inconsequential for probabilistic computation; the most efficient way to characterize a probability distribution with only finite resources is to focus on quantifying this fuzzy surface, not its interior or exterior.
Approximations compatible with this compression tackle a smaller problem and hence better exploit whatever computational resources are available.

In information theory compression is defined with respect to messages comprised of words. In isolation each component word is likely to be near the modal word, but as we construct longer and longer messages we provide more and more opportunity for some of the words to deviate. The probability that a long message contains no deviant words or all deviant words both become negligible, with most messages confined to the set of messages with “typical” fluctuations.

Probabilistically we can interpret a message as a point in our ambient space, with each word interpreted as the projection of that point along each dimension in the space. Any single component is likely to be found near the projection of the mode because of the pull of the target probability density function. The volume, however, induces fluctuations in some of the components which push the constituent point away from the global mode. Exact samples are “typical”, each influenced by fluctuations that push it somewhere within the typical set.
Each sample experiences different fluctuations, and an ensemble of samples disperses across the typical set.




This sampling perspective is particularly helpful for understanding some of the more subtle properties of the typical set. For example, the typical set is not the highest probability set for a given volume, especially for small volumes. Because we don’t know how an exact sample will fluctuate, and where in the typical set it will manifest, our best guess of where to find a single exact sample will be a neighborhood around the mode that encompasses small fluctuations in all directions. It’s only when we consider many samples that we are better off guessing neighborhoods in the typical set.

One immediate application of this intuition is in the context of gambling on the games in a tournament, such as the college basketball championship. Absent all other information, the best strategy for a single game is to bet on the favorite. Taking the favorite for all of the games, however, ends up being a terrible strategy. While an upset in any single game is rare, having no upsets over the entire tournament is rarer.

At the same time we don’t know which games will produce upsets, and so any single bet on the outcome of every game is exceedingly unlikely to win. In order to increase the odds of winning one has to make many bets, filling out the typical set of tournament outcomes. This is one way that casinos and professional gamblers are able to maintain steady profits while even the smartest casual gambler enjoys only small expected gains.

2.3.2 Concentration of Measure

The typical set defines the right abstraction for how to efficiently approximate probability distributions in more than a few dimensions. Instead of visualizing the distribution as a density concentrating at the mode we are better served visualizing it as a fuzzy surface that surrounds the mode. Our next step is to figure out how this fuzzy surface abstraction changes with dimension.

As we motivated above, the volume will concentrate more strongly away from any single point and towards infinity as the dimensionality of the ambient space increases. This implies that volume recedes away from the mode.




If the target probability density function maintains the same tail behavior then the balance between probability density and volume will shift. The volume becomes more dominant, pulling the differential probability mass, and the typical set, further towards infinity in a process known as concentration of measure.