Probability theory teaches us that the only well-posed way to extract information from a probability distribution is through expectation values. Unfortunately computing expectation values from abstract probability distributions is no easy feat. In some cases, however, we can readily approximate expectation values by averaging functions of interest over special collections of points known as samples. In this case study we carefully define the concept of a sample from a probability distribution and show how those samples can be used to approximate expectation values through the Monte Carlo method.

1 Samples

Points in a space and a probability distribution defined over that space are distinct objects that can't be directly compared. Certain collections of points, however, can be constructed to be compatible with a given probability distribution in a way that allows them to approximate information from that distribution. In this section we formally define this notion of consistency and in doing so formally define the concept of a sample from a probability distribution.

1.1 Ensembles

A base probability space consisting of the ambient space \(X\), \(\sigma\)-algebra \(\mathcal{X}\), and probability distribution \(\pi\) can always be replicated an arbitrary number of times to construct an ensemble system.

We first replicate the ambient space multiple times, using the integer \(n\) to identify each identical replication, to define the product space \[ X_{1:N} = X_{1} \times \cdots \times X_{n} \times \cdots \times X_{N}, \] and the corresponding projection operators that map from the product space to each component space \[ \begin{alignat*}{6} \varpi_{n} :\; &X_{1:N} & &\rightarrow& \; &X_{n}& \\ &( x_{1}, \ldots, x_{n}, \ldots, x_{N} )& &\mapsto& &x_{n}&. \end{alignat*} \]

At the same time we can define a complementary \(\sigma\)-algebra over this product space, generated by the Cartesian products of measurable sets in each of the component spaces, \[ \mathcal{X}_{1:N} = \sigma \left( \left\{ A_{1} \times \cdots \times A_{n} \times \cdots \times A_{N} \mid A_{n} \in \mathcal{X}_{n} \right\} \right). \] In other words this product \(\sigma\)-algebra is compatible with the projection operators -- if \(A = A_{1} \times \cdots \times A_{N}\) is one of these generating product sets then \(\varpi_{n}(A) = A_{n} \in \mathcal{X}_{n}\).

Finally we can construct a unique identical product probability distribution with respect to this product \(\sigma\)-algebra that satisfies \[ \begin{align*} \mathbb{P}_{\pi_{1:N}}[A] &= \prod_{n = 1}^{N} \mathbb{P}_{\pi_{n}}[ \varpi_{n}(A) ] \\ &= \prod_{n = 1}^{N} \mathbb{P}_{\pi}[ \varpi_{n}(A) ]. \end{align*} \] A key feature of this product probability distribution is that the expectation of any function that depends on only one component space, \(f = f \circ \varpi_{n}\), is equal to the expectation with respect to the base probability distribution, \[ \begin{align*} \mathbb{E}_{\pi_{1:N}}[ f ] &= \mathbb{E}_{\pi_{1:N}}[ f \circ \varpi_{n} ] \\ &= \mathbb{E}_{\pi}[ f \circ \varpi_{n} ] \\ &= \mathbb{E}_{\pi}[ f ]. \end{align*} \]

We will refer to \(X_{1:N}\) as an ensemble space and \(\pi_{1:N}\) as an ensemble probability distribution, or just ensemble distribution for short.

1.1.1 Ensemble Averages

A measurable function defined on the base space \(f : X \rightarrow \mathbb{R}\) can be lifted to an ensemble space in various ways. For example we can define the lifted function \(f_{n} = f \circ \varpi_{n}\) that considers only one component space at a time, \[ \begin{alignat*}{6} f_{n} :\; &X_{1:N} & &\rightarrow& \; &\mathbb{R}& \\ &( x_{1}, \ldots, x_{n}, \ldots, x_{N} )& &\mapsto& &f \circ \varpi_{n} (x_{1}, \ldots, x_{n}, \ldots, x_{N}) = f(x_{n})&. \end{alignat*} \] By definition the expectation of this lifted function with respect to the ensemble distribution is equal to the expectation of the initial function with respect to the base distribution, \[ \mathbb{E}_{\pi_{1:N}}[ f \circ \varpi_{n} ] = \mathbb{E}_{\pi}[ f ]. \]

We can also lift the base function to every component space and then combine the results together. The ensemble average of any function \(f : X \rightarrow \mathbb{R}\) is defined by applying the function to all of the component spaces and then averaging the results together, \[ \left< f \right>_{1:N} = \frac{1}{N} \sum_{n = 1}^{N} f \circ \varpi_{n}, \] or more formally, \[ \begin{alignat*}{6} \left< f \right>_{1:N} :\; &X_{1:N} & &\rightarrow& \; &\mathbb{R}& \\ &( x_{1}, \ldots, x_{n}, \ldots, x_{N} )& &\mapsto& &\frac{1}{N} \sum_{n = 1}^{N} f(x_{n})&. \end{alignat*} \]

Because expectation is a linear operation the expectation value of the ensemble average reduces to the expectation of the base function, \[ \begin{align*} \mathbb{E}_{\pi_{1:N}}[ \left< f \right>_{1:N} ] &= \frac{1}{N} \sum_{n = 1}^{N} \mathbb{E}_{\pi_{1:N}}[ f \circ \varpi_{n} ] \\ &= \frac{1}{N} \sum_{n = 1}^{N} \mathbb{E}_{\pi}[ f ] \\ &= \frac{1}{N} \cdot N \cdot \mathbb{E}_{\pi}[ f ] \\ &= \mathbb{E}_{\pi}[ f ], \end{align*} \] at least when that base expectation \(\mathbb{E}_{\pi}[ f ]\) is well-defined. In other words an expectation of the ensemble average is consistent with the expectation of the initial function.
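This consistency is easy to probe numerically. The following sketch assumes a hypothetical unit-rate exponential base distribution, so that \(\mathbb{E}_{\pi}[f] = 1\) for the identity function \(f(x) = x\), and uses NumPy's random number generator as a stand-in sampling mechanism; averaging the ensemble average over many realized ensembles approximates its expectation.

```python
import numpy as np

# Hypothetical base distribution: unit-rate exponential, so that
# E_pi[f] = 1 exactly for f(x) = x.
rng = np.random.default_rng(8675309)

N = 100       # ensemble size
R = 50_000    # number of independent ensembles

# Each row is one point in the ensemble space X_{1:N}.
ensembles = rng.exponential(scale=1.0, size=(R, N))

# Ensemble average <f>_{1:N} for f(x) = x, evaluated on each ensemble.
ensemble_avgs = ensembles.mean(axis=1)

# Averaging over many ensembles approximates E_{pi_{1:N}}[ <f>_{1:N} ],
# which should recover E_pi[f] = 1.
print(ensemble_avgs.mean())
```

The printed value should land very close to one, with deviations shrinking as the number of realized ensembles grows.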

To really understand the behavior of the ensemble average, however, we have to go beyond the expectation value. If we consider \(\left< f \right>_{1:N}\) as a map from the product space \(X_{1:N}\) to the real numbers \(\mathbb{R}\) then we can consider the pushforward of the ensemble distribution over input values \(X_{1:N}\) to the induced distribution over output values in \(\mathbb{R}\), \((\left< f \right>_{1:N})_{*} \pi_{1:N}\).




1.1.2 The Law of Large Numbers

One way to characterize the pushforward distribution along the ensemble average is through its cumulants. The mean of the pushforward distribution is the expectation of the ensemble average, which we just calculated, \[ \begin{align*} \mathbb{M}_{\pi_{1:N}} [ \left< f \right>_{1:N} ] &= \mathbb{E}_{\pi_{1:N}}[ \left< f \right>_{1:N} ] \\ &= \mathbb{E}_{\pi}[ f ]. \end{align*} \]

Through some straightforward but messy algebra we can also show that the variance of the ensemble average is the variance of the base function scaled down by a factor of \(N\), \[ \begin{align*} \mathbb{V}_{\pi_{1:N}}[ \left< f \right>_{1:N} ] &= \mathbb{E}_{\pi_{1:N}}[ (\left< f \right>_{1:N})^{2} ] - \left( \mathbb{E}_{\pi_{1:N}}[ \left< f \right>_{1:N} ] \right)^{2} \\ &= \frac{1}{N} \left( \mathbb{E}_{\pi}[ f^{2} ] - \left( \mathbb{E}_{\pi}[ f ] \right)^{2} \right) \\ &= \frac{\mathbb{V}_{\pi}[f]}{N}, \end{align*} \] whenever \(\mathbb{V}_{\pi}[f]\) is well-defined. This pattern continues to higher-order cumulants as well: if the \(k\)-th order cumulant of \(f_{*} \pi\) is well-defined then the \(k\)-th order cumulant of \((\left< f \right>_{1:N})_{*} \pi_{1:N}\) is scaled down by \(k - 1\) factors of \(N\), \[ \mathbb{C}_{k}[ (\left< f \right>_{1:N})_{*} \pi_{1:N} ] = \frac{\mathbb{C}_{k}[ f_{*} \pi ]}{N^{k - 1}}. \]

Because of this scaling the higher-order cumulants of the ensemble average become increasingly negligible as the size of the ensemble \(N\) grows, at least provided that the corresponding cumulants of the base function are well-defined. In the limit of an infinite ensemble all of the cumulants except the mean vanish exactly, and all of the pushforward probability is allocated to ensemble averages exactly equaling that expectation value. More formally we say that the pushforward distribution of the ensemble average converges to a Dirac distribution, \[ \lim_{N \rightarrow \infty} (\left< f \right>_{1:N})_{*} \pi_{1:N} = \delta_{\mathbb{E}_{\pi}[f]}, \] where the Dirac distribution over the real numbers is defined as \[ \delta_{y}[A] = \left\{ \begin{array}{rr} 1, & y \in A \\ 0, & y \notin A \end{array} \right., \quad \forall A \in \mathcal{B}(\mathbb{R}). \] This convergence of the pushforward distribution is known as the strong law of large numbers.
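We can watch this convergence in action with a small numerical sketch. Here the base distribution is assumed to be uniform on \([0, 1]\) with \(f(x) = x^{2}\), so that \(\mathbb{E}_{\pi}[f] = 1/3\) exactly, and NumPy's generator stands in for a sampling mechanism.

```python
import numpy as np

# Hypothetical base distribution: uniform on [0, 1], with f(x) = x^2
# so that E_pi[f] = 1/3 exactly.
rng = np.random.default_rng(42)

xs = rng.uniform(0.0, 1.0, size=1_000_000)
fs = xs ** 2

# Running ensemble averages <f>_{1:N} for every N up to 10^6.
running_avgs = np.cumsum(fs) / np.arange(1, fs.size + 1)

# The averages drift toward E_pi[f] = 1/3 as N grows.
for N in (10, 1_000, 1_000_000):
    print(N, running_avgs[N - 1])
```

The fluctuations around \(1/3\) shrink like \(1 / \sqrt{N}\), consistent with the variance scaling derived above.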

1.1.3 The Central Limit Theorem

The law of large numbers determines to what the pushforward of the ensemble distribution along an ensemble average function will converge, but we can also characterize how it will converge asymptotically. If the pushforward mean and variance are well-defined then the pushforward probability density function of the shifted and scaled ensemble average converges to a standard normal probability density function, \[ \lim_{N \rightarrow \infty} \left( \frac{ \left< f \right>_{1:N} - \mathbb{E}_{\pi}[f] } { \sqrt{ \mathbb{V}_{\pi}[f] / N } } \right)_{*} \pi_{1:N} (y) = \text{normal} ( y \mid 0, 1 ). \] This asymptotic behavior is known as a central limit theorem.

If the third cumulant of the pushforward distribution is also well-behaved then the central limit theorem also approximately holds preasymptotically. For large but finite \(N\) we have \[ (\left< f \right>_{1:N})_{*} \pi_{1:N}(y) = \text{normal} \left( y \mid \mathbb{E}_{\pi}[f], \sqrt{ \frac{\mathbb{V}_{\pi}[f]}{N} } \right) + \mathcal{O}(N^{-\frac{3}{2}}). \] This preasymptotic characterization of the pushforward distribution is far more useful than the asymptotic characterizations in practical circumstances where infinity is just out of reach.
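A quick empirical sketch of this preasymptotic normality, assuming a deliberately skewed hypothetical base distribution (unit-rate exponential, for which \(\mathbb{E}_{\pi}[f] = \mathbb{V}_{\pi}[f] = 1\) with \(f(x) = x\)): even at moderate \(N\) the shifted and scaled ensemble averages already allocate roughly 68.3% of their realizations to the one-sigma interval, as a standard normal would.

```python
import numpy as np

# Hypothetical skewed base distribution: unit-rate exponential with
# E_pi[f] = 1 and V_pi[f] = 1 for f(x) = x.
rng = np.random.default_rng(1)

N = 400       # ensemble size
R = 10_000    # number of realized ensembles

avgs = rng.exponential(1.0, size=(R, N)).mean(axis=1)

# Shift and scale each average by the exact mean and standard error.
z = (avgs - 1.0) / np.sqrt(1.0 / N)

# For a standard normal about 68.3% of the probability lies in (-1, 1).
print(np.mean(np.abs(z) < 1.0))
```

Repeating this with smaller \(N\) shows the normal approximation degrading as the higher-order cumulants reassert themselves.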

In this case we can characterize the convergence of the pushforward distribution for all ensemble sizes \(N\). Initially the pushforward distribution is characterized by all of the pushforward cumulants, but as \(N\) grows the mean and variance quickly begin to dominate and the pushforward probability density function becomes approximately normal. As \(N\) continues to grow the variance narrows and this normal approximation converges to the limiting Dirac distribution.




1.2 Asymptotically Consistent Sequences

In the previous section we considered the behavior of the ensemble average in infinitely large ensemble spaces. Now let's consider what we can say about the elements of these spaces.

Any point in the ensemble space \(X_{1:N}\) is given by a sequence of points in each of the component spaces, \[ x_{1:N} = (x_{1}, \ldots, x_{n}, \ldots, x_{N}) \in X_{1:N}. \] Evaluating the ensemble average on this point then yields a particular real value, \[ \left< f \right>_{1:N} (x_{1:N}) = \left< f \right>_{1:N} (x_{1}, \ldots, x_{n}, \ldots, x_{N}) \in \mathbb{R}. \]




The same construction holds in the infinite case. All elements of the infinite ensemble space are given by a countably infinite sequence of points, \[ (x_{1}, \ldots, x_{n}, \ldots) \in \lim_{N \rightarrow \infty} X_{1:N}. \] Similarly the ensemble average evaluated over an infinitely long sequence converges to some real value, \[ \lim_{N \rightarrow \infty} \left< f \right>_{1:N} (x_{1:N}) = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n = 1}^{N} f(x_{n}) \in \mathbb{R}, \] whenever the limit is well-defined.

An arbitrary element of the infinite ensemble space will not, in general, have any relation to an ensemble distribution defined over that space, but some special elements might. In particular if the limiting value of the ensemble average over an element equals the corresponding expectation value, \[ \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n = 1}^{N} f(x_{n}) = \mathbb{E}_{\pi}[f], \] then that ensemble element recovers the exact same value to which the pushforward distribution along the corresponding ensemble average converges! If this limiting behavior holds for all well-behaved functions \(f : X \rightarrow \mathbb{R}\) then this ensemble element encodes the exact same information as \(\pi\)! We will refer to these sequences as asymptotically consistent with \(\pi\).




This consistency immediately implies that an asymptotically consistent sequence has to satisfy certain frequency properties. For example consider a measurable set \(A \in \mathcal{X}\) which is allocated probability \(p = \mathbb{P}_{\pi}[A]\). Because probabilities are expectations of indicator functions we can also write this as \[ p = \mathbb{P}_{\pi}[A] = \mathbb{E}_{\pi}[ \mathbb{I}_{A} ] \] where \[ \mathbb{I}_{A}[x] = \left\{ \begin{array}{rr} 1, & x \in A \\ 0, & x \notin A \end{array} \right. . \] A sequence asymptotically consistent with this distribution then satisfies \[ \begin{align*} p &= \mathbb{P}_{\pi}[A] \\ &= \mathbb{E}_{\pi}[ \mathbb{I}_{A} ] \\ &= \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n = 1}^{N} \mathbb{I}_{A}(x_{n}) \\ &= \lim_{N \rightarrow \infty} \frac{N[A]}{N}, \end{align*} \] where \[ N[A] = \sum_{n = 1}^{N} \mathbb{I}_{A}(x_{n}). \] In words, exactly \(100 \cdot p \%\) of the points in the asymptotically consistent sequence must asymptotically fall into \(A\). Points \(x \in X\) in neighborhoods of higher probability must appear more frequently in a compatible asymptotically consistent sequence than points in neighborhoods of smaller probability.
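A minimal numerical sketch of this frequency property, assuming a hypothetical standard normal base distribution and the set \(A = (-\infty, 0]\), for which \(p = \mathbb{P}_{\pi}[A] = 1/2\) by symmetry:

```python
import numpy as np

# Hypothetical base distribution: standard normal; take A = (-inf, 0]
# so that p = P_pi[A] = 0.5 exactly.
rng = np.random.default_rng(2718)

xs = rng.normal(0.0, 1.0, size=1_000_000)

# Indicator-function ensemble average N[A] / N.
freq = np.mean(xs <= 0.0)
print(freq)
```

The empirical frequency concentrates ever more tightly around \(p = 0.5\) as the sequence grows.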




While not every element of the infinite ensemble \((x_{1}, \ldots, x_{n}, \ldots)\) will be compatible with a given base probability distribution, in general there will be infinitely many elements that are. Because all of these asymptotically consistent sequences share the same relative frequency of component points they can loosely be interpreted as different orderings of some common, infinite population of points in \(X\) that contains the same information as the base distribution \(\pi\). That said we have to take care with such an interpretation as it involves a lot of finicky infinities!

1.3 Preasymptotically Consistent Sequences

Because of our unfortunate limitation to finite computation, we can't actually construct any infinitely long sequences of points, let alone sequences that are asymptotically consistent with a given probability distribution of interest. In practice we need finite sequences that are consistent with a given probability distribution in some way.

Towards that end consider selecting \(N\) points from an asymptotically consistent sequence, \((x_{1}, \ldots, x_{n}, \ldots)\). From the mathematically-casual perspective that an asymptotically consistent sequence is an ordering of some latent population of points, this selection is equivalent to sampling individuals from that population. Although we have to be careful not to take this analogy too far, we will use it to motivate some terminology. We denote any selection of \(N\) points from an asymptotically consistent sequence as a sample of size \(N\).

If the infinite population contains the same information as \(\pi\) then we might hope that a finite sample will contain enough information to approximate \(\pi\). Unfortunately this is only the case for certain very special sequences.

For a fixed sample size \(N\) there are a countably infinite number of possible samples from an infinitely long sequence. Let's label each of these samples with the integer \(s\), and the corresponding points with \[ (x_{\sigma_{s}(1)}, \ldots, x_{\sigma_{s}(n)}, \ldots, x_{\sigma_{s}(N)}). \] For example if \(N = 3\) then \[ \sigma_{1}(1) = 1, \sigma_{1}(2) = 2, \sigma_{1}(3) = 3 \] would define one possible sample.




A function \(g : X_{1:N} \rightarrow \mathbb{R}\) can be applied to any of these samples, yielding the value \[ g(x_{\sigma_{s}(1)}, \ldots, x_{\sigma_{s}(n)}, \ldots, x_{\sigma_{s}(N)}) \in \mathbb{R}. \] If the average function value over all possible samples of size \(N\) equals the corresponding joint expectation value, \[ \lim_{S \rightarrow \infty} \frac{1}{S} \sum_{s = 1}^{S} g(x_{\sigma_{s}(1)}, \ldots, x_{\sigma_{s}(n)}, \ldots, x_{\sigma_{s}(N)}) = \mathbb{E}_{\pi_{1:N}}[g], \] for all well-behaved functions \(g\) then this collection of finite samples contains the same information as the independent product distribution \(\pi_{1:N}\). When this consistency holds not only for all functions \(g\) but also for all finite sample sizes \(N\) then we say that the sequence is not only asymptotically consistent but also preasymptotically consistent. Likewise a finite sample from a preasymptotically consistent sequence defines a sample from the corresponding probability distribution \(\pi\).

Once again this consistency implies certain frequency properties that a compatible sequence must satisfy. For example if the measurable set \(A \in \mathcal{X}_{1:N}\) is allocated probability \(p = \mathbb{P}_{\pi_{1:N}}[A]\) then \[ \begin{align*} \lim_{S \rightarrow \infty} \frac{1}{S} \sum_{s = 1}^{S} \mathbb{I}_{A}(x_{\sigma_{s}(1)}, \ldots, x_{\sigma_{s}(n)}, \ldots, x_{\sigma_{s}(N)}) &= \mathbb{E}_{\pi_{1:N}}[ \mathbb{I}_{A} ] \\ &= \mathbb{P}_{\pi_{1:N}}[A] \\ &= p. \end{align*} \] Consequently exactly \(100 \cdot p \%\) of the possible samples of size \(N\) contained in that sequence must fall into \(A\). Samples \((x_{1}, \ldots, x_{n}, \ldots, x_{N}) \in X_{1:N}\) in neighborhoods of higher product probability must appear more frequently in a preasymptotically consistent sequence than samples in neighborhoods of smaller product probability.

A joint function \(g : X_{1:N} \rightarrow \mathbb{R}\) of particular interest is the ensemble average of some base function \(f: X \rightarrow \mathbb{R}\), \(\left< f \right>_{1:N} (x_{1:N})\). By definition the average value of \(\left< f \right>_{1:N} (x_{1:N})\) over the collection of finite samples of size \(N\) is just the expectation of \(f\), \[ \begin{align*} \lim_{S \rightarrow \infty} \frac{1}{S} \sum_{s = 1}^{S} \left< f \right>_{1:N}(x_{\sigma_{s}(1)}, \ldots, x_{\sigma_{s}(n)}, \ldots, x_{\sigma_{s}(N)}) &= \mathbb{E}_{\pi_{1:N}}[ \left< f \right>_{1:N} ] \\ &= \frac{1}{N} \sum_{n = 1}^{N} \mathbb{E}_{\pi_{1:N}}[ f \circ \varpi_{n} ] \\ &= \frac{1}{N} \sum_{n = 1}^{N} \mathbb{E}_{\pi}[ f ] \\ &= \frac{1}{N} \cdot N \cdot \mathbb{E}_{\pi}[ f ] \\ &= \mathbb{E}_{\pi}[ f ]. \end{align*} \] Even more, the frequency of particular ensemble average values over this collection is determined by the pushforward of the finite ensemble distribution along the ensemble average function, \((\left< f \right>_{1:N})_{*} \pi_{1:N}\). For example if the interval of possible average values \(\Delta f \subset \mathbb{R}\) is allocated probability \(p = \mathbb{P}_{(\left< f \right>_{1:N})_{*} \pi_{1:N}}[\Delta f]\) then exactly \(100 \cdot p \%\) of samples will yield ensemble average values in \(\Delta f\).

This quantification then provides a path towards approximating a probability distribution. If we can generate any finite subset of a preasymptotically consistent sequence, \[ (x_{1}, \ldots, x_{n}, \ldots, x_{N}) \] then the pushforward distribution \((\left< f \right>_{1:N})_{*} \pi_{1:N}\) will quantify how accurately an ensemble average evaluated over that sequence, \[ \left< f \right>_{1:N} (x_{1}, \ldots, x_{n}, \ldots, x_{N}), \] recovers the corresponding expectation value \(\mathbb{E}_{\pi}[ f ]\). We will return to this approach in Section 3 once we consider how to generate these samples in the first place.

1.4 Transformation Properties of Consistent Sequences

Before discussing how to construct a preasymptotically consistent sequence let's consider how they transform. Recall that if \(\phi : X \rightarrow Y\) is a measurable transformation from the initial ambient space \(X\) to some new ambient space \(Y\) then a probability distribution \(\pi\) over \(X\) defines a corresponding pushforward probability distribution \(\phi_{*} \pi\) over \(Y\).

The expectation value of any function on the transformed space, \(g : Y \rightarrow \mathbb{R}\), can be computed with respect to either probability distribution, \[ \mathbb{E}_{\pi}[g \circ \phi] = \mathbb{E}_{\phi_{*} \pi}[g]. \] This immediately implies that if \((x_{1}, \ldots, x_{n}, \ldots)\) is a sequence that is asymptotically consistent with \(\pi\) then \[ \begin{align*} \mathbb{E}_{\phi_{*} \pi}[g] &= \mathbb{E}_{\pi}[g \circ \phi] \\ &= \lim_{N \rightarrow \infty} \left< g \circ \phi \right>_{1:N} (x_{1:N}) \\ &= \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n = 1}^{N} g \circ \phi(x_{n}) \\ &= \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n = 1}^{N} g(y_{n}) \\ &= \lim_{N \rightarrow \infty} \left< g \right>_{1:N} (y_{1:N}). \end{align*} \] In other words \[ (y_{1} = \phi(x_{1}), \ldots, y_{n} = \phi(x_{n}), \ldots) \] defines a sequence that is asymptotically consistent with the pushforward distribution! Consequently any asymptotically consistent sequence on \(X\) defines an asymptotically consistent sequence on \(Y\) simply by transforming each component point one by one.

Because asymptotically consistent sequences transform (vary) in exactly the same way (co) as points in the ambient space, they are said to transform covariantly. By their definition so too do preasymptotically consistent sequences and any finite sample. Contrast this to, say, a probability density function representation, which does not transform covariantly but instead requires careful Jacobian determinant corrections and analytic integrations when transformed to a new space. This ability to easily transform samples between different spaces makes them an especially effective way to approximate a probability distribution in practice.

When \(\phi\) is a projection function this transformation property allows us to transform samples from a joint distribution to samples from a marginal distribution. For example let \(X\) be a product space, \(X = Y \times Z\) with points \(x = (y, z)\) and projection operator \(\varpi_{Y} : X \rightarrow Y\). If \[ ( x_{1}, \ldots, x_{N}) = ( (y_{1}, z_{1}), \ldots, (y_{N}, z_{N}) ) \] is a sample from \(\pi\) then \[ ( \varpi_{Y}(x_{1}), \ldots, \varpi_{Y}(x_{N}) ) = ( \varpi_{Y}(y_{1}, z_{1}), \ldots, \varpi_{Y}(y_{N}, z_{N}) ) = ( y_{1}, \ldots, y_{N} ) \] is a sample from \((\varpi_{Y})_{*} \pi\).
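This marginalization can be sketched numerically. Assuming a hypothetical bivariate normal joint distribution over \(X = Y \times Z\) with unit marginal variances and correlation 0.8, projecting each joint sample point onto its first component immediately yields a sample from the standard normal marginal:

```python
import numpy as np

# Hypothetical joint distribution: bivariate normal over X = Y x Z
# with correlation 0.8 and standard normal marginals.
rng = np.random.default_rng(5)

cov = [[1.0, 0.8],
       [0.8, 1.0]]
joint_sample = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)

# Projecting each point onto the Y component yields a sample from the
# marginal distribution, here a standard normal.
y_sample = joint_sample[:, 0]

print(y_sample.mean(), y_sample.std())
```

No Jacobian corrections or integrations are needed; the projection is applied point by point.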




Unfortunately constructing samples from a conditional distribution is nowhere near as straightforward. The only samples from the joint space that yield valid samples from a given conditional distribution are those that happen to fall exactly on the corresponding fiber. The probability that any points from a finite sample will fall on a fixed fiber is vanishingly small on continuous spaces.




In other words samples from a joint probability distribution can be immediately marginalized but they cannot be conditioned. If we want to generate samples from a conditional distribution in practice then we usually need to construct the conditional distribution first and then generate samples directly from that probability distribution.

2 Random Number Generators

Asymptotically and preasymptotically consistent sequences have the potential to be very useful in practice, but so far we have only defined how these sequences behave and not any explicit ways in which we can generate them. Indeed the generation of these sequences is a challenging and mostly heuristic endeavor.

Consider a process which incrementally generates sequences of points in \(X\) and, hypothetically, can be run long enough until those sequences become infinitely long. If all of the sequences that such a process can generate are asymptotically consistent with a given probability distribution then we'll define it to be a sampling mechanism for that distribution. In general such a process is capable of generating many different asymptotically consistent sequences, in which case we refer to each different sequence as a realization.

In practice we can only ever run a sampling mechanism for a finite time, and consequently we will only ever be able to work with finite samples from the asymptotically consistent sequences that it is capable of generating. For these samples to be useful we need those sequences to be not only asymptotically consistent but also preasymptotically consistent.

One interesting property of preasymptotically consistent sequences is that it is difficult to predict later points in the sequence given only information about previous points. If we don't know how the sampling mechanism is generating points then the only constraints we have on its behavior are its defining preasymptotic properties. Those properties, however, imply that the frequency behavior of later points is independent of previous points; knowing the realized values of the points \((x_{1}, \ldots, x_{n}, \ldots, x_{N})\) doesn't tell us anything about how \(x_{N + 1}\) will behave. Another way of thinking about this is that there are an infinite number of preasymptotically consistent sequences with the same initial values \((x_{1}, \ldots, x_{n}, \ldots, x_{N})\), and if we don't know which sequence is being generated then the only information we have about \(x_{N + 1}\) and later points is quantified by \(\pi\).




Because of this limited predictability preasymptotically consistent sequences are also known as random numbers, and a sampling mechanism that generates preasymptotically consistent sequences is known as a random number generator.

Perhaps the biggest conceptual and practical difficulty with samples is that a given probability distribution does not define any particular sampling mechanism. In general there could be many different ways in which we can theoretically generate random numbers, and typically only a few, if any, will be practically implementable. Moreover we can rarely prove that a particular sampling mechanism will generate sequences that are consistent with a given probability distribution. In practice we typically have to hypothesize that a sampling mechanism generates random numbers and then empirically verify consistency for as many finite sequences as we can.

2.1 Physical Random Number Generators

A common speculation is that random numbers can be generated from observations of some physical system, such as the chaotic variations of an electronic current or the measurement of a quantum mechanical system. Because we can never observe such a physical system for infinitely long, however, we can never completely verify that the frequency behavior of its measurements is consistent with a probability distribution, let alone identify with which probability distribution they are consistent.

The probabilistic nature of a physical system is an assumption that we have to make and then verify empirically by observing the system for as long as we can. This verification is challenging, and indeed many systems that appear unpredictable over short sequences manifest distinct patterns over longer sequences that invalidate preasymptotic consistency with any probability distribution. Physical sampling mechanisms can be useful in practice but we have to remain ever vigilant.

If we assume that a physical system is probabilistic, in other words that repeated observations generate preasymptotically consistent sequences, then its behavior is completely quantified by the corresponding probability distribution. In particular any random number generator consistent with that distribution will generate sequences indistinguishable from repeated observations of that system, even if the generator itself is completely independent of that system. These auxiliary random number generators are then said to simulate the physical system, with each realized sequence known as a simulation. Indeed simulation is increasingly used interchangeably with random number generation even in the context of a probability distribution without any particular physical interpretation.

Once we consider physical random number generators a natural question is whether we can use the repeated output of physical systems to define probability distributions entirely, offering an objective derivation of probability theory. This frequentist perspective on probability theory motivated many early thinkers, from Jacob Bernoulli to John Venn and Richard von Mises, but ultimately they were all thwarted by philosophical and mathematical problems. On the philosophical side we can never repeatedly observe the exact same physical system an infinite number of times -- either the physical system itself will wear under the repeated observations or we the observer will -- and limiting frequencies will always be a mathematical abstraction. At the same time it turns out that not all sequences with the right frequency properties will actually yield preasymptotically consistent sequences. In order to consistently define a probability distribution from sequences we have to be able to carefully identify and remove misbehaving sequences. For a thorough discussion of this history see [1].

Ultimately probability theory is best defined axiomatically, as we did in my probability theory case study. Once we define a probability distribution in this way we can look for physical systems that generate empirically consistent sequences without having to worry about whether those sequences are enough to fully define that distribution.

2.2 Pseudo-Random Number Generators

As compelling as physical random number generators might be, it would be much easier to work with deterministic algorithms than unpredictable physical systems. Algorithms are not only easier to implement in practice but also offer the possibility of reproducible results that allow us to follow up on any odd behavior observed in any particular realized sample.

A pseudo-random number generator is a deterministic algorithm that iteratively generates sequences from some initial seed configuration. Although the iterative updates are deterministic they can be designed to be as chaotic as possible to minimize correlations between successive points. If we ignore the exact updating mechanism then these sequences might approximate preasymptotically consistent sequences well enough for probabilistic computation. For much more on the design and implementation of pseudo-random number generators see for example [2] and [3].
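To make the idea of a deterministic, seeded iteration concrete, here is a toy sketch of one of the simplest such algorithms: a linear congruential generator with the classic "minimal standard" Lehmer parameters. Production generators such as the Mersenne Twister or PCG are vastly more sophisticated; this example illustrates only the basic structure of seed, state, and deterministic update.

```python
# A minimal sketch of a pseudo-random number generator: a linear
# congruential generator with the "minimal standard" Lehmer
# parameters. Real generators are far more sophisticated; this toy
# only illustrates the deterministic, seeded iteration.
class MinimalLCG:
    MODULUS = 2**31 - 1   # a Mersenne prime
    MULTIPLIER = 48271

    def __init__(self, seed):
        # The seed configuration fixes the entire sequence;
        # the internal state must be nonzero.
        self.state = seed % self.MODULUS or 1

    def next_uniform(self):
        # Chaotic-looking but fully deterministic update, mapped
        # into the unit interval (0, 1).
        self.state = (self.MULTIPLIER * self.state) % self.MODULUS
        return self.state / self.MODULUS

prng = MinimalLCG(8675309)
us = [prng.next_uniform() for _ in range(100_000)]
print(sum(us) / len(us))  # hovers near 0.5 for a uniform target
```

Re-instantiating the generator with the same seed reproduces the sequence exactly, which is precisely the determinism that distinguishes pseudo-random number generators from physical ones.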

2.2.1 Validating Pseudo-Random Number Generators

While formal mathematics can be used to motivate certain techniques for designing productive pseudo-random number generators, the verification of any pseudo-random number generator has to be implemented empirically. Test suites such as the NIST STS [4], TestU01 [5], and PractRand [6] evaluate the behavior of long sequences of pseudo-random numbers for any behavior that deviates from that expected of preasymptotically consistent sequences. These tests can never prove that a pseudo-random number generator is preasymptotically consistent, but they are sufficient for most probabilistic computation applications where we don't care so much about randomness but rather expectation value approximation.

One limitation of even the best pseudo-random number generators is that they approximate preasymptotically consistent sequences up to only some maximum sample size. Past some period the sequences start to repeat, and samples become more and more correlated. The longer this period the larger the samples that a pseudo-random number generator can reliably produce, and the more precisely it can approximate a preasymptotically consistent sampling mechanism.

2.2.2 Seeding Pseudo-Random Number Generators

Seeding a pseudo-random number generator with a specific configuration fixes the output, allowing for the same sequence to be generated on the same computing setup. This allows us to, for example, exactly replicate a previous analysis and investigate any potentially pathological behavior that might have arisen. One of the most unsettling analysis results is ghostly behavior that we are unable to recreate and explore, leaving only the memory to haunt us.
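A short sketch of this reproducibility, using NumPy's default generator as a stand-in for any seeded pseudo-random number generator:

```python
import numpy as np

# Fixing the seed configuration fixes the realized sequence.
seed = 8675309  # hypothetical seed value

run1 = np.random.default_rng(seed).uniform(size=5)
run2 = np.random.default_rng(seed).uniform(size=5)

# Both runs realize exactly the same sequence.
print(np.array_equal(run1, run2))
```

Any suspicious result can then be replayed exactly by re-running the analysis with the recorded seed, at least on the same computing setup.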

That said even with a fixed seed configuration a pseudo-random number generator may not be exactly reproducible across different computing setups. Any changes to the hardware, operating system, or program execution can influence the precise implementation of floating-point arithmetic which can then influence the implementation of pseudo-random number generators on continuous spaces. Indeed the more chaotic the updates, and the better the pseudo-random number generator approximates preasymptotically consistent sequences, the more of an effect differences in floating point arithmetic can have! Fortunately these implementation differences do not typically compromise the consistency properties of pseudo-random number generators, just the exact reproducibility.

The empirical validation of pseudo-random number generators is typically focused on the behavior of points sampled within particular sequences. This validation, however, does not guarantee that points sampled across different sequences will be as well-behaved. In particular the sequences generated from two different seed configurations can be surprisingly correlated if those seed configurations are not sufficiently distinct.

Programming folklore suggests seeding pseudo-random number generators with a single large integer. The initial seed configurations of many pseudo-random number generators, however, are much larger than a single integer; they might be fully specified only with the equivalent of hundreds of large integer values, if not more. In this case the seed configurations resulting from any two integers can be much more similar than one might naively expect.

Because of these potential seeding issues the best practice for most pseudo-random number generators is to generate samples by extending a single, long sequence from one particular seed configuration. If samples are needed in parallel then they can be pulled from distant regions of this sequence [3].
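One concrete way to pull from distant regions of a single underlying sequence is the jump-ahead facility of NumPy's PCG64 bit generator, where each `jumped(k)` call returns a generator advanced by \(k \cdot 2^{127}\) steps of the same stream; the seed value here is a hypothetical placeholder.

```python
import numpy as np

seed = 20240101  # hypothetical seed configuration

# Each worker draws from a widely separated stretch of one stream:
# jumped(k) advances the PCG64 state by k * 2^127 steps.
streams = [np.random.Generator(np.random.PCG64(seed).jumped(k + 1))
           for k in range(4)]

draws = [s.uniform(size=3) for s in streams]
for d in draws:
    print(d)
```

Because every stream derives deterministically from the one seed, the whole parallel computation remains reproducible while avoiding correlations between carelessly chosen seeds.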




2.2.3 Adapting Pseudo-Random Number Generators to Target Distributions

The most common pseudo-random number generators are designed for simple target distributions, such as a Bernoulli distribution over \(\{0, 1\}\), a uniform distribution over a set of integers, \(\{0, 1, \ldots, K\}\), or the unit real interval \((0, 1) \subset \mathbb{R}\). In some cases pseudo-random number generators for more sophisticated distributions can be derived by appropriately transforming these base generators.

Consider a probability distribution \(\pi\) defined over a one-dimensional, ordered space \(X\). In this case any point \(x \in X\) defines a corresponding interval, \[ I(x_{\text{min}}, x ) \subset X \] with the corresponding probability \(\mathbb{P}_{\pi}[I(x_{\text{min}}, x )]\).

The cumulative distribution function maps each point to this corresponding probability, \[ \begin{alignat*}{6} \Pi :\; &X& &\rightarrow& \; &[0, 1] \subset \mathbb{R}& \\ &x& &\mapsto& &\mathbb{P}_{\pi}[I(x_{\text{min}}, x )]&. \end{alignat*} \] Pushing \(\pi\) forward through this cumulative distribution function results in a uniform probability distribution over the unit real interval with probability density function \[ \Pi_{*} \pi(y) = \left\{ \begin{array}{rr} 1, & 0 \le y \le 1 \\ 0, & \text{else} \end{array} \right. . \]
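This uniformizing pushforward is easy to check numerically. Assuming a hypothetical unit-rate exponential base distribution, with \(\Pi(x) = 1 - e^{-x}\), applying \(\Pi\) to each sample point should produce values whose mean and variance match those of a uniform distribution on the unit interval, \(1/2\) and \(1/12\):

```python
import numpy as np

# Hypothetical base distribution: unit-rate exponential with
# cumulative distribution function Pi(x) = 1 - exp(-x).
rng = np.random.default_rng(999)

xs = rng.exponential(1.0, size=1_000_000)
ys = 1.0 - np.exp(-xs)   # push each point through Pi

# The pushed-forward sample behaves like a uniform sample on (0, 1),
# with mean near 1/2 and variance near 1/12.
print(ys.mean(), ys.var())
```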

Likewise the inverse cumulative distribution function, also known as the quantile function, \[ q(p) = \Pi^{-1}(p), \] pushes forward the uniform distribution over the unit real interval \(\upsilon\) back to the target distribution \(\pi\), \[ q_{*} \upsilon = \pi. \]

Consequently if we can generate a sample from the uniform distribution over the unit real interval \[ ( y_{1}, \ldots, y_{N} ) \] then \[ ( \Pi^{-1}(y_{1}), \ldots, \Pi^{-1}(y_{N}) ) \] will be a sample from \(\pi\).
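As a worked sketch of this inverse transform method, again assuming a unit-rate exponential target: the quantile function is \(\Pi^{-1}(p) = -\log(1 - p)\), so pushing uniform samples through it should yield a sample whose mean and standard deviation are both one.

```python
import numpy as np

# Inverse-CDF sampling sketch for a unit-rate exponential target:
# Pi(x) = 1 - exp(-x), so the quantile function is
# Pi^{-1}(p) = -log(1 - p).
rng = np.random.default_rng(31415)

ys = rng.uniform(0.0, 1.0, size=1_000_000)  # uniform sample on (0, 1)
xs = -np.log1p(-ys)                          # pushed through Pi^{-1}

# The transformed points form a sample from the exponential target,
# whose mean and standard deviation are both 1.
print(xs.mean(), xs.std())
```

The same recipe applies to any one-dimensional target whose quantile function can be evaluated, which is exactly how many library samplers for standard distributions are built.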