Probability Density Functions
As we discussed in the last chapter, expectation values allow us to interrogate the behaviors of the probability distributions that we’ll encounter in practical applications. At least, that is, in theory. In practice, the expectation values of most probability distributions cannot be computed directly. Indeed integrals with respect to only a few exceptional measures can be computed directly at all.
By scaling those exceptional measures, however, we can often use measure-informed integrals to indirectly compute integrals with respect to more general measures. If we can engineer scalings that reproduce expectation values of interest, then we can use this indirect approach to implement probability theory in practice.
In this chapter we will formalize this procedure, identifying exactly how we can scale a given measure to reproduce the expectation values of a target probability distribution. We will then investigate the detailed application of the method on discrete and real probability spaces.
1 Density Functions
Frustratingly, we cannot translate integrals with respect to a given measure to integrals with respect to any other measure. To avoid subtle mathematical inconsistencies, we have to restrict our consideration to measures that are compatible with each other. In this section we will first motivate what kind of compatibility we might need on finite spaces before formalizing the compatibility requirements, and the systematic translation of integrals, on general spaces.
1.1 Finite Spaces
Let’s start by considering a finite measurable space (X, 2^{X}) and two measures, \mu and \nu. Given a real-valued function f: X \rightarrow \mathbb{R} we can construct an integral with respect to both \mu, \mathbb{I}_{\mu}[f] = \sum_{x \in X} \mu( \{ x \} ) \cdot f(x), and \nu, \mathbb{I}_{\nu}[f] = \sum_{x \in X} \nu( \{ x \} ) \cdot f(x).
To simplify the initial construction, let’s first assume that all of the atomic allocations are non-zero and finite, \begin{align*} 0 &< \mu( \{ x \} ) < \infty \\ 0 &< \nu( \{ x \} ) < \infty \end{align*} for all x \in X. This allow us to multiply and divide atomic allocations without running into ill-defined results. In particular, we will always have \frac{ \nu( \{ x \} )}{ \nu( \{ x \} )} = 1, which enures that we can always write the \mu integral of f as \begin{align*} \mathbb{I}_{\mu}[f] &= \sum_{x \in X} \mu( \{ x \} ) \cdot f(x) \\ &= \sum_{x \in X} \mu( \{ x \} ) \cdot 1 \cdot f(x) \\ &= \sum_{x \in X} \mu( \{ x \} ) \cdot \frac{ \nu( \{ x \} )}{ \nu( \{ x \} )} \cdot f(x) \\ &= \sum_{x \in X} \nu( \{ x \} ) \cdot \frac{ \mu( \{ x \} )}{ \nu( \{ x \} )} \cdot f(x). \end{align*}
If we define a function that maps each element to the ratio of the atomic allocations, \begin{alignat*}{6} r :\; & X & &\rightarrow& \; & \mathbb{R}^{+} & \\ & x & &\mapsto& & \frac{ \mu( \{ x \} )}{ \nu( \{ x \} )} &, \end{alignat*} then we can write the \mu integral as \begin{align*} \mathbb{I}_{\mu}[f] &= \sum_{x \in X} \nu( \{ x \} ) \cdot \frac{ \mu( \{ x \} )}{ \nu( \{ x \} )} \cdot f(x) \\ &= \sum_{x \in X} \nu( \{ x \} ) \cdot r(x) \cdot f(x) \\ &= \mathbb{I}_{\nu} [ r \cdot f ]. \end{align*} In words, the integral of any real-valued function f with respect to \mu is equal to the integral of the modified function r \cdot f with respect to \nu.
More generally, this result doesn’t actually require that the ratio of atomic allocations \frac{ \nu( \{ x \} ) }{ \nu( \{ x \} ) } is well-defined for all x \in X. We need it to be well-defined for only the terms where \mu( \{ x \} ) > 0.
To see why, let \mathsf{z}_{\mu} denote the subset of elements that are allocated zero measure, \mathsf{z}_{\mu} = \{ x \in X \mid \mu( \{ x \} ) = 0 \}, with \mathsf{n}_{\mu} denoting the complementary subset of elements that are allocated non-zero measure, \mathsf{nz}_{\mu} = \{ x \in X \mid \mu( \{ x \} ) > 0 \}.
Using these subsets, we can write any integral with respect to \mu as \begin{align*} \mathbb{I}_{\mu}[f] &= \sum_{x \in X} \mu( \{ x \} ) \cdot f(x) \\ &= \sum_{x \in \mathsf{z}_{\mu}} \mu( \{ x \} ) \cdot f(x) + \sum_{x \in \mathsf{nz}_{\mu}} \mu( \{ x \} ) \cdot f(x) \\ &= \sum_{x \in \mathsf{z}_{\mu}} 0 \cdot f(x) + \sum_{x \in \mathsf{nz}_{\mu}} \mu( \{ x \} ) \cdot f(x) \\ &= \sum_{x \in \mathsf{nz}_{\mu}} \mu( \{ x \} ) \cdot f(x). \end{align*} At the same time, we can write the integral of the modified integrand with respect to \nu as \begin{align*} \mathbb{I}_{\nu} [ r \cdot f ] &= \sum_{x \in X} \nu( \{ x \} ) \cdot r(x) \cdot f(x) \\ &= \sum_{x \in \mathsf{z}_{\nu}} \nu( \{ x \} ) \cdot r(x) \cdot f(x) + \sum_{x \in \mathsf{nz}_{\nu}} \nu( \{ x \} ) \cdot r(x) \cdot f(x) \\ &= \sum_{x \in \mathsf{z}_{\nu}} 0 \cdot r(x) \cdot f(x) + \sum_{x \in \mathsf{nz}_{\nu}} \nu( \{ x \} ) \cdot r(x) \cdot f(x) \\ &= \sum_{x \in \mathsf{nz}_{\nu}} \nu( \{ x \} ) \cdot r(x) \cdot f(x) \\ &= \sum_{x \in \mathsf{nz}_{\nu}} \nu( \{ x \} ) \cdot \frac{ \mu \{ x \} }{ \nu \{ x \} } \cdot f(x) \\ &= \sum_{x \in \mathsf{nz}_{\nu}} \mu( \{ x \} ) \cdot f(x). \end{align*}
Because \mathsf{nz}_{\mu} and \mathsf{nz}_{\nu} will, in general, contain different elements, the two sums will not, in general, be equal, \mathbb{I}_{\mu}[f] \ne \mathbb{I}_{\nu} \left[ r \cdot f \right] We have equality only when \mu( \{ x \} ) = 0 every time that \nu( \{ x \} ) = 0. In other words, the two integrals will be equal only when \mathsf{z}_{\nu} \subseteq \mathsf{z}_{\mu}. Ultimately, we can translate \mu integrals into \nu integrals only when all atomic allocations are finite, and the null allocations of the two measures are consistent with each other (Figure 1).
Similarly, we can translate \nu integrals into \mu integrals only when the atomic allocations are finite and \mathsf{z}_{\mu} \subseteq \mathsf{z}_{\nu}. Importantly, these two conditions are not symmetric; we can translate back and forth between \mu and \nu integrals only when the null allocations are the same, \mathsf{z}_{\mu} = \mathsf{z}_{\nu}.
1.2 General Spaces
Translating integrals between more general measure spaces requires a generalization of the atomic allocation conditions that we derived in the previous section. Unsurprisingly, this generalization will take a bit of care.
In this section we’ll assume a measurable space (X, \mathcal{X}) and two measures, \mu : \mathcal{X} \rightarrow [0, \infty] and \nu : \mathcal{X} \rightarrow [0, \infty].
1.2.1 \sigma-Finite Measures
Avoiding infinite atomic allocations is necessary on general measure spaces, but it is not sufficient for translating integrals from one measure to another. To avoid any mathematical inconsistencies, we also need to limit exactly which subsets can be allocated infinite measure. Intuitively, we need to avoid infinite “local” allocations.
To formalize this intuition, we will first need to divide the ambient space into disjoint subsets that probe local structure. A partition of X is any collection of subsets, \mathcal{P} = \{ \mathsf{x}_{1}, \ldots, \mathsf{x}_{i}, \ldots \}, that are non-empty, \mathsf{x}_{i} \ne \emptyset, mutually disjoint, \mathsf{x}_{i} \cap \mathsf{x}_{i' \ne i} = \emptyset, and cover the entire space, \cup_{i} \mathsf{x}_{i} = X. When working with measures we need to restrict attention to measurable partitions, which contain only measurable subsets, and countable partitions, which contain only a countable number of subsets.
For example, the full set X by itself always defines a partition, as does the collection of atomic subsets. The collection of atomic subsets, however, will not always define a countable partition. On ordered metric spaces, such as real lines, measurable and countable partitions built up from intervals of constant length are particularly natural.
In general, we can partition a measurable space in many different ways, and some partitions will divide the space up into smaller subsets than others. If every subset in one measurable partition \mathcal{P} is encapsulated by a subset in another measurable partition \mathcal{P}', then we say that \mathcal{P} is a refinement of \mathcal{P}'. The more refined a measurable partition is, the more sensitively it will be able to probe the “local” structure of a measure (Figure 2).
We can always construct a refinement an initial refinement by breaking up the initial subsets. The partition defined by the atomic subsets is a refinement of every other partition, making it the finest possible partition. On the other hand, every partition is a refinement of the partition defined by the full set X alone, making the full set the crudest possible partition. Both of these extreme partitions are typically measurable, but the atomic partition will be countable on only countable spaces.
When the total measure is infinite, the allocations to the subsets in a crude, measurable partition might also be infinite. Breaking up those subsets into finer and finer pieces, however, might eventually spread the infinite allocations out into finite allocations. If we can construct a sufficiently fine, measurable partition such that all of the subset allocations are finite, then we will always able to avoid infinite allocations by working with small enough subsets. Moreover, if that sufficiently fine, measurable partition is also countable, then we will always be able to aggregate those fine subset allocations into any subset allocation.
Formally, a measure \mu is \sigma-finite if we can construct at least one measurable and countable partition where each subset is allocated only a finite measure, \mu(\mathsf{x}_{i}) < \infty for all \mathsf{x}_{i} \in \mathcal{P}.
Every finite measure, including every probability distribution, is \sigma-finite because we can always take the trivial partition \mathcal{P} = \{ X \} with \mu(X) < \infty.
On the other hand, only exceptional infinite measures are also \sigma-finite. For example, every counting measure \chi is \sigma-finite because the atomic partition \mathcal{P} = \{ \{ x \} \mid x \in X \} is both measurable and countable, and \chi( \mathsf{x} ) = 1 < \infty for all \mathsf{x} \in \mathcal{P}. Similarly, every Lebesgue measure \lambda is \sigma-finite because the interval partition of a real line, \mathcal{P} = \{ ( L \cdot n, L \cdot (n + 1) ] \mid n \in \ldots, -1, 0, 1, \ldots \}, is both measurable and countable, and \lambda( \mathsf{x} ) = L < \infty for all \mathsf{x} \in \mathcal{P}. In both cases, the allocations to each of the partition subsets are finite individually, but combining a infinite number of those subsets results in infinite allocations.
One immediate obstruction to \sigma-finiteness is the allocation of infinite measure to a single atomic subset. Because we can’t break an atomic subset down any further, there’s no way to diffuse that infinite allocation into something finite.
Fortunately, this extreme behavior is relatively rare in practical applications. Unfortunately, it is not impossible, and we will encounter exceptions.
1.2.2 Absolute Continuity
On finite spaces, the compatibility between two measures was determined by the overlap of the null atomic subsets. On more general spaces, we have to consider the overlap of all null subsets.
When every \nu-null subset is a \mu-null subset, \mathcal{X}_{\nu = 0} \subseteq \mathcal{X}_{\mu = 0} we say that \mu is absolutely continuous with respect to \nu and write \mu \ll \nu. Equivalently we can say that \nu dominates \mu.
Importantly, this notion of compatibility between two measures is completely determined by their null subsets; the allocations across non-null subsets plays no role. When a collection of measures all share the same null subsets, each measure will be always be absolutely continuous with respect to every other measure!
On a discrete space, for instance, every measure with non-zero atomic allocations is absolutely continuous with respect to every other measure with non-zero atomic allocations. Moreover, Lebesgue measures defined with respect to different metrics are always absolutely continuous with each other.
1.2.3 Radon-Nikodym Derivatives
With \sigma-finiteness and absolute continuity established, we are now ready to define the most general circumstances under which we can translate integrals from one measure to another.
If \mu and \nu are two \sigma-finite measures on (X, \mathcal{X}), and \mu is absolutely continuous with respect to \nu, \mu \ll \nu, then at least one \mathcal{X}-measurable, positive real-valued function \frac{ \mathrm{d} \mu }{ \mathrm{d} \nu } : X \rightarrow \mathbb{R}^{+} exists such that \mathbb{I}_{\mu}[f] = \mathbb{I}_{\nu} \left[ \frac{ \mathrm{d} \mu }{ \mathrm{d} \nu } \cdot f \right] for every \mathcal{X}-measurable, \mu-integrable, real-valued function f : X \rightarrow \mathbb{R}.
Any function \mathrm{d} \mu / \mathrm{d} \nu that translates \mu integrals into \nu integrals is known as a Radon-Nikodym derivative of \mu with respect to \nu. Moreover, \mu is denoted the target measure and \nu is denoted the reference measure.
Because Radon-Nikodym derivatives are defined by measure-informed integrals, they are not, in general, unique. Modifying the outputs of a Radon-Nikodym derivative on a \nu-null subset of inputs results in the same \nu integrals, and hence another valid Radon-Nikodym derivative.
In practice, we can usually restrict our consideration to Radon-Nikodym derivatives that also happen to be continuous, or even smooth. These structured Radon-Nikodym derivatives are typically unique for each \mu and \nu. We don’t lose any generality by ignoring other, non-structured Radon-Nikodym derivatives provided that we never try to interpret these functions outside of the shadow of an integral!
For example, the unit function \begin{alignat*}{6} u :\; & X & &\rightarrow& \; & \mathbb{R}^{+} & \\ & x & &\mapsto& & 1 & \end{alignat*} is one possible Radon-Nikodym derivative for any measure with respect to itself, \frac{ \mathrm{d} \nu }{ \mathrm{d} \nu } \overset{\nu}{=} u. At the same time any function that deviates from the unit function on only \nu-null subsets is also a valid Radon-Nikodym derivative for \nu with respect to itself. If X = \mathbb{R}, however, then the unit function is the only continuous Radon-Nikodym derivative.
Using the local scaling notation that we introduced in Chapter 5, we can also express the relationship between two compatible measures and the Radon-Nikodym derivative that links them together as \mu = \frac{ \mathrm{d} \mu }{ \mathrm{d} \nu } \cdot \nu. That said, this is just shorthand for the integral relationships \mathbb{I}_{\mu}[f] = \mathbb{I}_{\nu} \left[ \frac{ \mathrm{d} \mu }{ \mathrm{d} \nu } \cdot f \right] for any sufficiently well-behaved function f.
1.2.4 Interpreting Radon-Nikodym Derivatives
One way to build intuition for Radon-Nikodym derivatives is to consider the integral of indicator functions. For any measurable subset \mathsf{x} \in \mathcal{X}, we have \begin{align*} \mu( \mathsf{x} ) &= \mathbb{I}_{\mu} [ I_{\mathsf{x}} ] \\ &= \mathbb{I}_{\nu} \left[ \frac{ \mathrm{d} \mu }{ \mathrm{d} \nu } \cdot I_{\mathsf{x}} \right]. \end{align*} If \mathrm{d} \mu / \mathrm{d} \nu(x) > 1 for almost all points x \in \mathsf{x} – remember that we always have to excuse deviant behavior over the null subsets of \nu – then we will also have \frac{ \mathrm{d} \mu }{ \mathrm{d} \nu }(x) \cdot I_{\mathsf{x}}(x) > I_{\mathsf{x}}(x) for almost all x \in X. This implies that \begin{align*} \mu( \mathsf{x} ) &= \mathbb{I}_{\nu} \left[ \frac{ \mathrm{d} \mu }{ \mathrm{d} \nu } \cdot I_{\mathsf{x}} \right] \\ &> \mathbb{I}_{\nu} \left[ I_{\mathsf{x}} \right] \\ &> \nu(\mathsf{x}). \end{align*}
In other words, if \mathrm{d} \mu / \mathrm{d} \nu(x) > 1 across some measurable subset of the ambient space, then \mu will allocate more measure to that subset than \nu does. The larger the Radon-Nikodym derivative is, the more excessive the \mu allocation will be. Similarly, if \mathrm{d} \mu / \mathrm{d} \nu(x) < 1 across some measurable subset, then \mu will allocate less measure to that subset than \nu does.
At the extreme, any measurable collection of points with \mathrm{d} \mu / \mathrm{d} \nu(x) \overset{\nu}{=} 0 always defines a \mu-null subset. On the other hand, if \mathrm{d} \mu / \mathrm{d} \nu(x) \overset{\nu}{=} 1 across a measurable subset then \mu and \nu will allocate the same measure to that subset.
All of these examples demonstrates that we can interpret a Radon-Nikodym derivative as a quantification of how the allocations of \mu are locally enhanced or suppressed relative to the allocations of \nu. By integrating indicator functions, we aggregate these local warpings into non-local changes in the subset allocations.
A common physical analogy interprets the reference measure \nu as defining a sense of volume across the ambient space. The more measure \nu allocates to a given subset, the more background volume is spanned by that subset. If we analogize the measure allocated by \mu to a physical mass then the Radon-Nikodym derivatives become density functions that quantify how strongly mass concentrates within any given volume.
Personally, I’ve always found the term “Radon-Nikodym derivative” to be aesthetically pleasing – it sounds like a term straight out of science fiction – but this density function analogy is typically much more accessible. Because of that, I will mostly favor the term “density function” over “Radon-Nikodym derivative” in this book.
1.2.5 Operator Perspective
Speaking of terminology, why do we call a Radon-Nikodym derivative a “derivative” in the first place? It turns out that Radon-Nikodym derivatives do mathematically satisfy the properties of formal derivatives, but the theory needed to demonstrate that is pretty elaborate. In this section, we will instead investigate some of the more superficial similarities between Radon-Nikodym derivative and the familiar derivatives from calculus.
Recall that, in one-dimensional calculus, differentiation is an operation that converts a smooth function f: \mathbb{R} \rightarrow \mathbb{R} into a new function \frac{ \mathrm{d} f }{ \mathrm{d} x} : \mathbb{R} \rightarrow \mathbb{R}. This operation is linear, \frac{ \mathrm{d} }{ \mathrm{d} x} \left( \alpha \, f + \beta \, g \right) = \alpha \, \frac{ \mathrm{d} f }{ \mathrm{d} x} + \beta \, \frac{ \mathrm{d} g }{ \mathrm{d} x}, and satisfies a chain rule, \frac{ \mathrm{d} (g \circ f)}{ \mathrm{d} x}(x) = \frac{ \mathrm{d} g}{ \mathrm{d} y}(f(x)) \frac{ \mathrm{d} f}{ \mathrm{d} x}(x) for any two functions f: x \mapsto y and g: y \mapsto z.
The output of the derivative function at the point x \in \mathbb{R}, \frac{ \mathrm{d} f }{ \mathrm{d} x}(x), quantifies how quickly f is changing at that input (Figure 3). If \frac{ \mathrm{d} f }{ \mathrm{d} x}(x) > 0, then f is increasing at x. On the other hand, if \frac{ \mathrm{d} f }{ \mathrm{d} x}(x) < 0, then f is decreasing at x. Finally, if \frac{ \mathrm{d} f }{ \mathrm{d} x}(x) = 0, then f is constant at x.
Equivalently, we can interpret a derivative function as quantifying how much the initial function changes relative to a constant function. If \frac{ \mathrm{d} f }{ \mathrm{d} x}(x) > 0, then f is increasing faster than a constant function at x. Conversely, if \frac{ \mathrm{d} f }{ \mathrm{d} x}(x) < 0, then f is increasing slower than a constant function at x. Lastly, if \frac{ \mathrm{d} f }{ \mathrm{d} x}(x) = 0, then f is increasing at the same rate as a constant function at x. In other words, we can interpret the derivative as an operator that maps an input function to an output function that quantifies how much the input function changes relative to some reference behavior.
In a similar way, we can interpret Radon-Nikodym derivatives as the output of an operator that maps measures into output functions that quantify how much the input measure changes relative to some reference measure. Even more, this operator is linear and satisfies a chain rule.
To formalize this similarity a bit more, let’s use M(X, \mathcal{X}) to denote the collection of all measures that we can define over the measurable space (X, \mathcal{X}), and C^{+}(X, \mathcal{X}, \mu) to denote the collection of all \mathcal{X}-measurable functions from X to \mathbb{R}^{+}. We can then define Radon-Nikodym differentiation as a mapping \begin{alignat*}{6} \frac{ \mathrm{d} }{ \mathrm{d} \nu } :\; & M(X, \mathcal{X}) & &\rightarrow& \; & C^{+}(X, \mathcal{X}) & \\ & \mu & &\mapsto& & \frac{ \mathrm{d} \mu }{ \mathrm{d} \nu } & \end{alignat*} that takes the input measure \mu to an output function \frac{ \mathrm{d} \mu }{ \mathrm{d} \nu } : X \rightarrow \mathbb{R}^{+} that captures how \mu varies relative to \nu.
Admittedly, I’m being a bit mathematically sloppy here. In particular, because Radon-Nikodym derivatives are defined up to only \nu-null subsets, this operator doesn’t yield a single function but rather a collection of all functions that are equal \nu-almost everywhere. In order to achieve a unique output function, we need to introduce additional constraints, such as continuity or smoothness. This sloppy notation, however, does allow us to investigate many of the useful properties of the operation.
For instance, what does it mean to say that Radon-Nikodym differentiation is linear? If we define a linear combination of measures by the allocations (\alpha \cdot \mu + \beta \cdot \nu)(\mathsf{x}) = \alpha \cdot \mu(\mathsf{x}) + \beta \cdot \nu(\mathsf{x}), then \frac{ \mathrm{d} }{ \mathrm{d} \lambda } (\alpha \cdot \mu + \beta \cdot \nu) \overset{\nu}{=} \alpha \cdot \frac{ \mathrm{d} \mu}{ \mathrm{d} \lambda } + \beta \cdot \frac{ \mathrm{d} \nu }{ \mathrm{d} \lambda } whenever \mu and \nu are both absolutely continuous with respect to \lambda, \mathcal{X}_{\lambda = 0} \subseteq \mathcal{X}_{\mu = 0} \cap \mathcal{X}_{\nu = 0}. Note that to be technically correct – the best kind of correct – we’ve had to use equality up to only \nu-null subsets.
Similarly, what does it mean to say that Radon-Nikodym differentiation satisfies a chain rule? If \mu is absolutely continuous with respect to \lambda and \lambda is absolutely continuous with respect to \nu, \mathcal{X}_{\mu = 0} \subseteq \mathcal{X}_{\lambda = 0} \subseteq \mathcal{X}_{\nu = 0}, then \frac{ \mathrm{d} \mu}{ \mathrm{d} \nu} \overset{\nu}{=} \frac{ \mathrm{d} \mu}{ \mathrm{d} \lambda } \cdot \frac{ \mathrm{d} \lambda}{ \mathrm{d} \nu }. Again, we have to be careful to acknowledge that this “equality” holds up to only \nu-null subsets.
The Radon-Nikodym chain rule can be particularly useful in practice, where we’re often interested in comparing two measures \mu and \nu to each other but a third measure \lambda is a more convenient reference measure.
For example, consider the simplified case where \mu, \nu, and \lambda all share the same null subsets, \mathcal{X}_{\mu = 0} = \mathcal{X}_{\nu = 0} = \mathcal{X}_{\lambda = 0} \equiv \mathcal{X}_{0}, and hence are all absolutely continuous with respect to each other. In this case, all of the Radon-Nikodym derivatives between these measures are almost everywhere non-zero, and we can write \begin{align*} 1 &\overset{\nu}{=} \frac{ \mathrm{d} \nu}{ \mathrm{d} \nu}(x) \\ 1 &\overset{\nu}{=} \frac{ \mathrm{d} \nu}{ \mathrm{d} \lambda }(x) \cdot \frac{ \mathrm{d} \lambda}{ \mathrm{d} \nu }(x) \end{align*} or \frac{ \mathrm{d} \lambda}{ \mathrm{d} \nu }(x) \overset{\nu}{=} \frac{1}{ \frac{ \mathrm{d} \nu}{ \mathrm{d} \lambda }(x) }.
Consequently, we can write the Radon-Nikodym derivative between \mu and \nu as a ratio of the Radon-Nikodym derivatives with respect to \lambda, \begin{align*} \frac{ \mathrm{d} \mu}{ \mathrm{d} \nu}(x) &\overset{\nu}{=} \frac{ \mathrm{d} \mu}{ \mathrm{d} \lambda }(x) \cdot \frac{ \mathrm{d} \lambda}{ \mathrm{d} \nu }(x) \\ &\overset{\nu}{=} \frac{ \mathrm{d} \mu}{ \mathrm{d} \lambda }(x) \cdot \frac{1}{ \frac{ \mathrm{d} \nu}{ \mathrm{d} \lambda }(x) } \\ &\overset{\nu}{=} \frac{ \frac{ \mathrm{d} \mu}{ \mathrm{d} \lambda }(x) } { \frac{ \mathrm{d} \nu}{ \mathrm{d} \lambda }(x) }. \end{align*} This allows us to directly compare the behaviors of \mu and \nu to each other using only the more convenient Radon-Nikodym derivatives with respect to \lambda.
2 Specifying Probability Distributions With Density Functions
In order to construct a density function, née Radon-Nikodym derivative, we have to specify a target measure \mu and a compatible reference measure \nu. This allows us to translate implicit integrals with respect to the target measure to explicit integrals with respect to the reference measure, \mathbb{I}_{\mu}[f] = \mathbb{I}_{\nu} \left[ \frac{ \mathrm{d} \pi}{ \mathrm{d} \nu } \cdot f \right].
This construction, however, can also be reversed. Given a reference measure, any measurable, positive-real valued function implicitly defines a target measure without having to directly specify any subset allocations.
Consider, for example, a measurable space (X, \mathcal{X}) and a convenient, \sigma-finite measure \nu. Every non-negative, \mathcal{X}-measurable function m : X \rightarrow R^{+} implicitly defines a measure \mu via the measure-informed integrals \mathbb{I}_{\mu}[f] = \mathbb{I}_{\nu} \left[ m \cdot f \right].
The normalization of a density function is the measure allocated by the implied measure to the entire ambient space, \begin{align*} \mu(X) &= \mathbb{I}_{\mu}[I_{X}] \\ &= \mathbb{I}_{\nu} [ m \cdot I_{X} ] \\ &= \mathbb{I}_{\nu} [ m ]. \end{align*} If \mathbb{I}_{\nu}[ m ] = 1 then the total measure allocated by \mu is just unity, \mu(X) = 1, A properly normalized density function defines not just a measure but also a probability distribution!
More generally, every non-negative, \mathcal{X}-measurable function p : X \rightarrow R^{+} with unit normalization, \mathbb{I}_{\nu} [ p ] = 1, implicitly defines a probability distribution \pi via the expectation values \mathbb{E}_{\pi}[f] = \mathbb{I}_{\nu} \left[ p \cdot f \right]. Because we will be working with mostly probability distributions, we will be working with mostly these normalized probability density functions.
Note that, while the probabilities output by a probability distribution can take only real values between zero and one, the probability densities output by a probability density function can take any non-negative real value. Probability densities and probabilities are distinct, and should not be confused with each other!
In practice, working with point-wise functions is much easier than working with subset-wise functions. For instance, when the ambient space is low-dimensional we can visualize point-wise functional behavior directly, allowing density functions to not define probability distributions but also communicate their basic properties. We’ll explore this latter feature in the context of Lebesgue reference measures in Section 4.3.
This is not to say that probability density functions are not without their limitations. In particular, probability density functions are defined only relative to the given reference measure. If the reference measure is at all ambiguous ,then a density function alone will not completely determine a probability distribution! At the same time, if the reference measure ever changes, then probability density functions will also have to change if we want them to represent the same probability distributions.
We will use probability density functions to define relevant probability distributions almost exclusively in this book. In the cases where we start with a probability distribution \pi and reference measure \nu, I will use the notation \frac{ \mathrm{d} \pi}{ \mathrm{d} \nu} for the corresponding probability density function. This notation is both compact and explicit, displaying all of the information needed to put the density function into context.
Most of the time, however, we don’t start with an explicit probability distribution but rather a reference measure \nu and a properly normalized function p : X \rightarrow \mathbb{R}^{+} to implicitly define a probability distribution, \pi = p \cdot \nu, or, equivalently, \mathbb{E}_{\pi}[f] = \mathbb{I}_{\nu}[p \cdot f]. In these more common cases, I will use Roman letters to denote probability density functions. Whenever possible, I will also match the Roman letter used to denote a probability density function and the Greek letter used to denote the induced probability distribution.
The only limitation of this latter notational convention is that it doesn’t communicate the underlying reference measure. Because of this, it is not hard to lose track of that reference measure, especially when multiple reference measures might be relevant in a given application.
Unfortunately, this convention has also become standard in many fields. Consequently, it is nearly impossible to avoid in practice. To avoid any confusion when we encounter a probability density function p : X \rightarrow \mathbb{R}^{+}, we have to take care to consider what the accompanying reference measure is.
3 Probability Density Functions on Discrete Measure Spaces
Now that we have defined probability density functions in full generality, we can study how they’re used in the kinds of spaces that often arise in practical applications. We’ll begin by looking at probability density functions on discrete measure spaces.
3.1 Probability Mass Functions
On discrete measurable spaces (X, 2^{X}), we can always construct a uniform counting measure that allocates a unit measure to each atomic subset, \chi( \{ x \} ) = 1. Because a counting measure doesn’t have any null subsets, it dominates every measure that we could define on (X, 2^{X}), making it a natural reference measure for any discrete probability distribution. Moreover, the lack of null subsets also means that density functions will always be unique.
Following the steps we worked through in Section 1.1, we can derive an explicit result for the probability density function defined by any probability distribution \pi with respect to a counting measure. Translating an expectation value to an integral informed by the counting measure gives \begin{align*} \mathbb{E}_{\pi}[f] &= \sum_{x \in X} \pi( \{ x \} ) \cdot f(x) \\ &= \sum_{x \in X} 1 \cdot \pi( \{ x \} ) \cdot f(x) \\ &= \sum_{x \in X} \chi( \{ x \} ) \cdot \pi( \{ x \} ) \cdot f(x) \\ &= \sum_{x \in X} \chi( \{ x \} ) \cdot \left( \pi( \{ x \} ) \cdot f(x) \right) \\ &= \mathbb{I}_{\chi} [ p \cdot f ], \end{align*} where p in the last term denotes a function maps each element of X to its atomic allocation, \begin{alignat*}{6} p :\; & X & &\rightarrow& \; & [0, 1] & \\ & x & &\mapsto& & \pi( \{ x \} ) &. \end{alignat*}
In other words, the density of any probability distribution with respect to the counting measure is just the corresponding probability mass function, \frac{ \mathrm{d} \pi }{ \mathrm{d} \chi }(x) = p(x)! Because of this association, probability mass functions are also sometimes referred to as discrete probability density functions.
Identifying probability mass functions with Radon-Nikodym derivatives not only formalizes all of the more heuristic results that we’ve developed on countable spaces, but also places them within the context of the more general probability theory. For example, when we use the atomic allocations specified by a probability mass function to compute expectation values, we’re implicitly integrating with respect to a counting measure. Similarly, when we visualize a probability distribution by plotting the atomic allocations, we’re communicating how much the probability distribution warps the uniform allocations defined by the counting measure.
3.2 The Poisson Family of Probability Mass Functions
To demonstrate the use of probability density functions on countable, spaces let’s consider the positive integers, (\mathbb{N}, 2^{\mathbb{N}}). Assuming the use of the counting measure as the reference measure, we can specify an entire family of probability distributions with the parametrized probability density function (Figure 4) \mathrm{Poisson}(n; \lambda) = \frac{ \lambda^{n} e^{-\lambda} }{ n! }, where \lambda \in \mathbb{R}^{+}.
This family of probability mass functions, as well as the family of probability distributions they implicitly define, is known as the Poisson family. If we fix \lambda, then \mathrm{Poisson}(n; \lambda) denotes a Poisson probability mass function, and implicitly a Poisson distribution.
Expectation values with respect to a Poisson distribution are given by explicit summations, \begin{align*} \mathbb{E}_{\mathrm{Poisson}}[ f ; \lambda ] &= \mathbb{I}_{\chi}[ \mathrm{Poisson}( ; \lambda) \cdot f ] \\ &= \sum_{n = 0}^{\infty} \mathrm{Poisson}(n; \lambda) \, f(n). \end{align*} The sums for many common expectands can actually be worked out in closed form. I’ve isolated those calculations in the Appendix, and will simply state some of the more important results here.
For example, the normalization of each Poisson probability mass function is unity, \begin{align*} \mathrm{Poisson}(X; \lambda) &= \mathbb{E}_{\mathrm{Poisson}}[ I_{X}; \lambda ] \\ &= \mathbb{I}_{\chi}[ \mathrm{Poisson}( ; \lambda) \cdot I_{X} ] \\ &= \sum_{n = 0}^{\infty} \mathrm{Poisson}(n; \lambda) \cdot 1 \\ &= 1, \end{align*} as required for probability distributions.
Similarly, the mean for each Poisson probability distribution is given by \begin{align*} \mathbb{M}(\lambda) &= \mathbb{E}_{\mathrm{Poisson}} [ \iota ; \lambda ] \\ &= \mathbb{I}_{\chi} [ \mathrm{Poisson}( ; \lambda) \cdot \iota ] \\ &= \sum_{n = 0}^{\infty} \mathrm{Poisson}(n; \lambda) \cdot \iota(n) \\ &= \lambda. \end{align*} Consequently, the parameter \lambda moderates the centrality of probability distributions defined by each Poisson mass functions (Figure 5).
A slightly longer calculation shows that the variance of each Poisson distribution is also equal to \lambda. As we increase the parameter \lambda, the individual Poisson distributions not only shift to larger values but also become more diffuse.
The Poisson cumulative distribution functions \Pi(n) = \mathrm{Poisson}([0, n]; \lambda) = \sum_{n' = 0}^{n} \mathrm{Poisson}(n'; \lambda) can also be evaluated in closed form, \Pi(n) = \frac{ \Gamma(n + 1, \lambda) }{ \Gamma(n + 1) }, where \Gamma(x, y) is the incomplete Gamma function and \Gamma(x) is the gamma function. Conveniently, these two functions, if not the Poisson cumulative distribution function itself, are implemented in most programming languages. This makes the family particularly straightforward to use in practice.
These cumulative distribution functions give us two ways to evaluate interval probabilities (Figure 6). On one hand, we can brute force an interval probability with direct summation, \mathrm{Poisson}( \, (n_{1}, n_{2}] \, ; \lambda) = \sum_{n' = n_{1} + 1}^{n_{2}} \mathrm{Poisson}(n'; \lambda). On the other hand, we can evaluate the cumulative distribution function at each boundary and subtract, \mathrm{Poisson}( \, (n_{1}, n_{2}] \, ; \lambda) f = \Pi_{\mathrm{Poisson}}(n_{2}) - \Pi_{\mathrm{Poisson}}(n_{1}).
4 Probability Density Functions on Real Spaces
Having explored discrete measure spaces, let’s consider our prototypical uncountable spaces, the spaces of real numbers. On these spaces we don’t have heuristics on which we can fall back, and the full machinery of Radon-Nikodym derivatives is needed to ensure our first consistent results. On these spaces, the Lebesgue measure serves as a natural reference measure, and most expectation values reduce not to summations but rather classic Riemann integrals.
4.1 Lebesgue Probability Density Functions
The metric structure of a real space makes Lebesgue measures particularly useful. Specifically, because Lebesgue measures are \sigma-finite, they serve as immediate candidates for reference measures.
The only remaining obstruction to the construction of probability density functions is absolute continuity. For practical considerations, the most important Lebesgue null subsets are the subsets consisting of only a countable number of points; any probability distribution that is absolutely continuous with respect to the Lebesgue measure must also allocate zero probability to atomic subsets and their countable unions. In theory there are other, more abstract, null subsets that need to be accommodated as well, but the atomic subsets are the most practically relevant.
Given a D-dimensional real space, that is a particular rigid real space or particular parameterization of a flexible real space, and a compatible a compatible probability distribution \pi, we can define a Lebesgue probability density function \frac{ \mathrm{d} \pi \hphantom{ {}^{D} } }{ \mathrm{d} \lambda^{D} } : \mathbb{R}^{D} \rightarrow \mathbb{R}^{+}.
Alternatively, any measurable, positive real-valued function p : \mathbb{R}^{D} \rightarrow \mathbb{R}^{+} that is appropriately normalized, \mathbb{I}_{\lambda^{D}}[ p ] = \int \mathrm{d}^{D} x \, p(x_{1}, \ldots, x_{d}, \ldots, x_{D}) = 1, will implicitly specify a probability distribution. By construction, these derived probability distributions will be absolutely continuous with respect to the defining Lebesgue measure, allocating zero probability to every atomic subset. In practice, almost every probability distribution over real spaces that we will encounter in applied problems will be built up by scaling the Lebesgue measure in this way.
Given a one-dimensional Lebesgue probability density function, we can compute the probability allocated to any interval subset with an appropriately bounded integral, \begin{align*} \pi( \, [ x_{1}, x_{2} ] \, ) &= \mathbb{E}_{\pi} \left[ I_{[ x_{1}, x_{2} ]} \right] \\ &= \mathbb{I}_{\lambda} \left[ \frac{ \mathrm{d} \pi }{ \mathrm{d} \lambda} \cdot I_{[ x_{1}, x_{2} ]} \right] \\ &= \int_{-\infty}^{+\infty} \mathrm{d} x \, \frac{ \mathrm{d} \pi }{ \mathrm{d} \lambda}(x) \cdot I_{[ x_{1}, x_{2} ]}(x) \\ &= \int_{x_{1}}^{x_{2}} \mathrm{d} x \, \frac{ \mathrm{d} \pi }{ \mathrm{d} \lambda}(x). \end{align*} In other words, interval probabilities are equal to the area under the curve defined by the probability density function (Figure 7). For higher-dimensional real spaces, subset probabilities become volumes under the surfaces defined by the higher-dimensional probability density functions.
When working with Lebesgue probability density functions in practice, we have to be very careful to account for which Lebesgue measure we’re using at any given time. Different real spaces, or different parameterizations of a single flexible real space, will in general feature different metrics which then give rise to different Lebesgue measures.
Because of this, a fixed Lebesgue probability density function will define different probability distributions on different real spaces. Equivalently, in order to represent a fixed target probability distribution \pi we need to use different Lebesgue probability density functions on every individual real space. Either way, to avoid any ambiguity we need to clearly communicate which Lebesgue measure we’re assuming.
In the next chapter we’ll learn how to transform measures from one space to another. This will allow us to relate the Lebesgue measures, and the corresponding Lebesgue probability density functions, from one real space to another, or equivalently from one parameterization of a flexible real space to another.
4.2 Abusing The Equals Sign
As with any Radon-Nikodym derivative, Lebesgue probability density functions are not unique. Any two probability density functions that differ on only countable subsets of input points will specify exactly the same probability distribution. Consequently, we have to be careful whenever we’re comparing probability density functions to each other.
For instance, if two probability distributions over \mathbb{R} are equal to each other, \pi = \rho, then the expectation of every sufficiently well-behaved expectand f will be the same, \int_{-\infty}^{+\infty} \mathrm{d} x \, \frac{ \mathrm{d} \pi }{ \mathrm{d} \lambda}(x) \, f(x) = \int_{-\infty}^{+\infty} \mathrm{d} x \, \frac{ \mathrm{d} \rho }{ \mathrm{d} \lambda}(x) \, f(x) This does not, however, imply that the two probability density functions are exactly equal to each other at every input point x \in \mathbb{R}, \frac{ \mathrm{d} \pi }{ \mathrm{d} \lambda}(x) \ne \frac{ \mathrm{d} \rho }{ \mathrm{d} \lambda}(x). Rather, they can deviate from each other on any Lebesgue null subset (Figure 8), \frac{ \mathrm{d} \pi }{ \mathrm{d} \lambda}(x) \overset{\lambda}{=} \frac{ \mathrm{d} \rho }{ \mathrm{d} \lambda}.
Most references take this subtlety for granted, abusing the equals sign by writing \frac{ \mathrm{d} \pi }{ \mathrm{d} \lambda} = \frac{ \mathrm{d} \rho }{ \mathrm{d} \lambda} to mean that \int_{-\infty}^{+\infty} \mathrm{d} x \, \frac{ \mathrm{d} \pi }{ \mathrm{d} \lambda}(x) \, f(x) = \int_{-\infty}^{+\infty} \mathrm{d} x \, \frac{ \mathrm{d} \rho }{ \mathrm{d} \lambda}(x) \, f(x) for any sufficiently well-behaved expectand f. Unfortunately, this sloppy notation is ubiquitous and impossible to avoid in practice.
One way to make this convention a bit more rigorous is to impose additional constraints on the probability density functions when possible. For example, if we restrict consideration to continuous probability density functions, then \int_{-\infty}^{+\infty} \mathrm{d} x \, \frac{ \mathrm{d} \pi }{ \mathrm{d} \lambda}(x) \, f(x) = \int_{-\infty}^{+\infty} \mathrm{d} x \, \frac{ \mathrm{d} \rho }{ \mathrm{d} \lambda}(x) \, f(x) usually implies point-wise equality, \frac{ \mathrm{d} \pi }{ \mathrm{d} \lambda}(x) = \frac{ \mathrm{d} \rho }{ \mathrm{d} \lambda}(x), for all x \in \mathbb{R}.
Regardless, we have to be careful with unstated assumptions. Directly equating probability density functions implies either that the equality holds only up to reference null subsets or that we’re restricting consideration to certain structured probability density functions and not just any valid probability density functions.
The safest approach is to never forget that, unlike regular functions, Lebesgue probability density functions do not exist on their own. Instead, Lebesgue probability density functions always live under the shadow of integral signs. Anytime we see a bare probability density function we should recognize the implied integral context.
4.3 Lebesgue Probability Density Functions As Visualizations
Lebesgue probability density functions quantify probability distributions by integration. On a real line, for instance, the area under the curve defined by a probability density function corresponds to interval probabilities. If we train ourselves to “integrate by eye”, qualitatively mapping intervals to areas under a given curve, then we can extract a wealth of information by visually examining a Lebesgue probability density function.
For example, consider two intervals \mathsf{I}_{1} and \mathsf{I}_{2} of the same width but at different positions on a real line. If \mathrm{d} \pi / \mathrm{d} \lambda (x) is larger for all points in \mathsf{I}_{1} than it is for all points in \mathsf{I}_{2}, then the probability allocated to the first interval will be larger than the probability allocated to the second interval (Figure 9 (a)) \pi(\mathsf{I}_{1}) > \pi(\mathsf{I}_{2}).
That said, we have to take care when visually comparing intervals of different lengths. A narrow interval can be allocated negligible probability even if the probability density function is extremely large everywhere within it. Similarly, intervals where the probability density function is everywhere small can still be allocated appreciable probabilities if the interval is large enough (Figure 9 (b)).
With care, these visual comparisons can convey a wealth of qualitative insights. If a probability density function peaks at a single point, for instance, then intervals containing the peak will tend to be allocated larger probabilities than intervals that don’t contain the peak. In other words, the probability distribution implicitly defined by that probability density function will concentrate in the neighborhood of that peak (Figure 10).
If a probability density function exhibits multiple peaks, then the probability distribution will locally concentrate in the neighborhood of each. Subsets falling into the gaps between these local modes will be allocated much less probability.
Moreover, the behavior of a probability density function around a peak qualifies how the implied probability distribution concentrates (Figure 11). For instance, if the probability density function is wide, then the concentration of probability will be weak. Conversely, if the probability density function is narrow, then the concentration will be strong. Similarly, if the probability density function is asymmetric then the concentration will be skewed towards larger or smaller values.
All of this said, we have to be careful to not misinterpret the point-wise behavior of a Lebesgue probability density function. For one, point-wise behavior is formally ambiguous because Radon-Nikodym derivatives are defined only up to null subsets. More importantly, point-wise evaluations of a probability density function don’t correspond to any well-defined expectation values. Probability density functions exist to be integrated, and visualizations of probability density functions are useful only when they qualitatively inform how certain integrals behave.
Finally, the visualization of probability density functions is limited to only one and two-dimensional real spaces. In higher dimensions, we cannot plot how a Lebesgue probability density function changes in every direction at the same time.
One thing we can do is plot how a probability density function varies along one and two-dimensional cross sections of the ambient space. For example, when working on \mathbb{R}^{3} we might partially visualize a probability density function \frac{ \mathrm{d} \pi \hphantom{ {}^{3} } }{ \mathrm{d} \lambda^{3} } (x_{1}, x_{2}, x_{3}) by plotting the partial evaluations \frac{ \mathrm{d} \pi \hphantom{ {}^{3} } }{ \mathrm{d} \lambda^{3} } (\tilde{x}_{1}, x_{2}, x_{3}), \frac{ \mathrm{d} \pi \hphantom{ {}^{3} } }{ \mathrm{d} \lambda^{3} } (x_{1}, \tilde{x}_{2}, x_{3}), and \frac{ \mathrm{d} \pi \hphantom{ {}^{3} } }{ \mathrm{d} \lambda^{3} } (x_{1}, x_{2}, \tilde{x}_{3}) Properly interpreting these slices, however, can be tricky. In Chapter 8, we’ll learn how to properly interpret these slices as conditional probability density functions.
A more effective approach in practice is to not try to visualize an entire probability density function at once, but rather investigate its behavior when projected to lower-dimensional summaries spaces. We’ll learn how to construct these projections in the next chapter.
4.4 Lebesgue Probability Densities As Limiting Interval Probabilities
Lebesgue probability density functions can be integrated over intervals to compute the corresponding probability allocation. We can also use certain limiting interval probabilities to derive Lebesgue probability densities.
To set up this latter construction, let’s consider a real line X = \mathbb{R} and the interval \mathsf{I} = (x_{1}, x_{1} + L ] of length L > 0. We then decompose this initial interval into N disjoint subintervals \mathsf{I}_{n} = ( x_{1} + n \, \epsilon , x_{1} + (n + 1) \, \epsilon ], each of length \epsilon = \frac{L}{N}.
By countable additivity, the probability that a probability distribution \pi allocates to the interval \mathsf{I} is the same as the sum of the probabilities allocated to each subinterval \mathsf{I}_{n}, \pi( \, ( x_{1}, x_{1} + L ] \, ) = \sum_{n = 0}^{N} \pi( \, ( x_{1} + n \, \epsilon , x_{1} + (n + 1) \, \epsilon ] \, ). After multiplying and dividing by the subinterval length \epsilon, this becomes \pi( \, ( x_{1}, x_{1} + L ] \, ) = \sum_{n = 0}^{N} \epsilon \cdot \frac{ \pi( \, ( x_{1} + n \, \epsilon, x_{1} + (n + 1) \, \epsilon ] \, ) }{ \epsilon }.
This, however, is exactly the setup for a Riemann integral. As the number of subintervals increases, N \rightarrow \infty, and the length of each subinterval decreases, \epsilon \rightarrow 0, the left hand side will remain the same, but the sum on the right hand side will converge to a Riemann integral, \begin{align*} \pi( \, ( x_{1}, x_{1} + L ] \, ) &= \lim_{N \rightarrow \infty} \sum_{n = 0}^{N} \pi \left( \left( x_{1} + n \, \frac{L}{N}, x_{1} + (n + 1) \, \frac{L}{N} \right] \right) \\ &= \int_{x_{1}}^{x_{1} + L} \mathrm{d} x \, p(x), \end{align*} with the integrand p(x) = \lim_{\epsilon \rightarrow 0} \frac{ \pi( \, ( x, x + \epsilon ] \, ) }{ \epsilon }.
At the same time, we can also use a probability density function between \pi and the Lebesgue measure \lambda to compute the same interval probability, \pi( \, ( x_{1}, x_{1} + L ] \, ) = \int_{x_{1}}^{x_{1} + L} \mathrm{d} x \, \frac{ \mathrm{d} \pi }{ \mathrm{d} \lambda }(x). Comparing these two results, we can identify the Riemann integrand with the Lebesgue probability density function, \frac{ \mathrm{d} \pi }{ \mathrm{d} \lambda }(x) \overset{\lambda}{=} \lim_{\epsilon \rightarrow 0} \frac{ \pi( \, ( x, x + \epsilon ] \, ) }{ \epsilon }.
Note that this result relies on the choice of metric. The convergence of a scaled interval probability to a probability density function depends on the metric we use to define interval lengths. Different metrics will result in different limiting values, consistent with the fact that different metrics define different Lebesgue measures, and hence different Lebesgue probability density functions.
This result helps to explain why we have to take care with interpreting probability densities. Lebesgue probability densities don’t correspond to interval probabilities, but rather to how quickly interval probabilities change as we scan across the ambient real line. In other words, they encode differential information about probability allocations. Relatedly, probability density functions are endowed with units of probability over length, not units of probability.
One practical corollary of this relationship is that we can use properly scaled interval probabilities to approximate probability density functions. For small but finite \epsilon, the quantity \frac{ \pi( \, ( x, x + \epsilon ] \, ) }{ \epsilon } approximates the Lebesgue probability density \frac{ \mathrm{d} \pi }{ \mathrm{d} \lambda }(x).
Consequently, a histogram where the bin heights are scaled by the inverse bin widths, \frac{ \pi( \, ( x_{i}, x_{i + 1} ] \, ) }{ x_{i + 1} - x_{i} }, approximately visualizes a probability density function. As the bins become narrower, the scaled histogram becomes a more accurate representation of the Lebesgue probability density function (Figure 12).
This approximate visualization is particularly useful when we can compute interval probabilities, but we can’t evaluate the Lebesgue probability density function. As we’ll discover in the next chapter, this exact circumstance often arises when we try to project a probability distribution from a higher-dimensional space to a lower-dimensional space.
4.5 The Normal Family of Probability Density Functions
The most convenient Lebesgue probability density functions are those that facilitate analytic Riemann integration as much as possible. The two-parameter normal family of Lebesgue probability density functions \mathrm{normal}(x; \mu, \sigma) = \frac{1}{\sqrt{2 \, \pi} \sigma} \exp \left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right), where \mu \in \mathbb{R} and \sigma \in \mathbb{R}^{+}, is particularly amenable to analytic calculations.
The two parameters \mu and \sigma directly determine the basic shape of each normal probability density function (Figure 13). A normal probability density function peaks at \mu, which is accordingly referred to generally as a location parameter. The second parameter \sigma determines how quickly the probability density function decays as we move away from the peak; the smaller \sigma is, the narrower the density function will be. It is known as a scale parameter.
Each \mathrm{normal}(x; \mu, \sigma) is referred to as a normal probability density function, or just normal density function for short. We might be tempted to refer to the probability distribution defined by a normal density function as a normal distribution, but this association is well-defined only once we fix a metric, and hence a particular Lebesgue measure. Because the same normal density function can define different probability distributions on different real lines, I try to avoid terms like “normal distribution”.
The conventions I’ve used here are common, but no means standard. For example, in some fields this family is described as “Gaussian”, in honor of Carl Friedrich Gauss who first introduced it. Moreover, there are multiple, equivalent ways to parameterize the family, including \begin{align*} \mathrm{normal}(x; \mu, v) &= \frac{1}{\sqrt{2 \, \pi \, v}} \exp \left( -\frac{1}{2 \, v} \left( x - \mu \right)^{2} \right), \\ \mathrm{normal}(x; \mu, \tau) &= \sqrt{ \frac{\tau}{2 \, \pi} } \exp \left( -\frac{\tau}{2} \left( x - \mu \right)^{2} \right) \end{align*} and even \mathrm{normal}(x; \eta_{1}, \eta_{2}) = \sqrt{ \frac{-\eta_{2}}{\pi} } \exp \left( \eta_{1} \, x + \eta_{2} \, x^{2} + \frac{\eta_{1}^{2}}{4 \, \eta_{2}^{2}} \right). Each of these parameterizations can be convenient in certain circumstances, but the initial parameterization defined above tends to the most useful for most practical applications
The integrals \int_{-\infty}^{+\infty} \mathrm{d} x \, \mathrm{normal}(x; \mu, \sigma) \, f(x) that define expectation values are relatively nice as far as integrals go. That isn’t to say that the integrals are always easy to evaluate but rather that they many of them admit closed-form solutions, which is pretty miraculous when it comes to integrals. For those twisted individuals who fancy a good integral calculation, I’ve included those calculations in the Appendix. Everyone else can take these results at face value.
For instance, we can verify that each probability density function in the normal family is properly normalized with the integral \begin{align*} \mathrm{normal}(\mathbb{R}; \mu, \sigma) &= \int_{-\infty}^{+\infty} \mathrm{d} x \, \mathrm{normal}(x; \mu, \sigma) \\ &= \int_{-\infty}^{+\infty} \mathrm{d} x \, \frac{1}{\sqrt{2 \, \pi} \sigma} \exp \left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right) \\ &= 1. \end{align*}
Similarly, we can compute the mean of the probability distribution implied by each normal density function, \begin{align*} \mathbb{M}(\mu, \sigma) &= \mathbb{E}_{\mathrm{normal}}[ \iota; \mu, \sigma ] \\ &= \int_{-\infty}^{+\infty} \mathrm{d} x \, \mathrm{normal}(x; \mu, \sigma) \, x \\ &= \int_{-\infty}^{+\infty} \mathrm{d} x \, \frac{1}{\sqrt{2 \, \pi} \sigma} \exp \left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right) \, x \\ &= \mu. \end{align*} The parameter \mu determines not just the peak of each normal probability density function, but also the mean of each corresponding probability distribution. Because of this, \mu is sometimes referred to as a mean parameter.
With even more mathematical elbow grease, we can show that the variance is given by \begin{align*} \mathbb{V}(\mu, \sigma) &= \mathbb{E}_{\mathrm{normal}}[ (\iota - \mathbb{M}(\mu, \sigma))^{2}; \mu, \sigma ] \\ &= \mathbb{E}_{\mathrm{normal}}[ (\iota - \mu)^{2}; \mu, \sigma ] \\ &= \int_{-\infty}^{+\infty} \mathrm{d} x \, \mathrm{normal}(x; \mu, \sigma) \, (x - \mu)^{2} \\ &= \int_{-\infty}^{+\infty} \mathrm{d} x \, \frac{1}{\sqrt{2 \, \pi} \sigma} \exp \left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right) \, (x - \mu)^{2} \\ &= \sigma^{2}. \end{align*} As we increase \sigma, the normal probability density functions widen and the variance of the implied probability distributions increases.
Normal probability density functions can also be used to evaluate the cumulative distribution function corresponding to the implicitly-defined probability distributions, \begin{align*} \Pi_{\text{normal}}(x; \mu, \sigma) &= \text{normal}( \, (-\infty, x]; \mu, \sigma \, ) \\ &= \mathbb{E}_{\mathrm{normal}}[ I_{(-\infty, x]}; \mu, \sigma ] \\ &= \int_{-\infty}^{+\infty} \mathrm{d} x' \, \mathrm{normal}(x'; \mu, \sigma) \, I_{(-\infty, x]}(x') \\ &= \int_{-\infty}^{x} \mathrm{d} x' \, \mathrm{normal}(x'; \mu, \sigma) \\ &= \frac{1}{2} + \frac{1}{2} \, \mathrm{erf} \left( \frac{x - \mu}{\sqrt{2} \, \sigma} \right), \end{align*} where \mathrm{erf} (x) \frac{2}{\sqrt{\pi}} \int_{0}^{ x } \mathrm{d} t \, \exp \left( -t^{2} \right) is known as the error function.
Conveniently the error function, if not the normal cumulative distribution functions themselves, are available in most programming languages. This allows us directly compute interval probabilities by subtracting cumulative probabilities (Figure 14), \text{normal}( \, (x_{1}, x_{2} ] \, ; \mu, \sigma ) = \frac{1}{2} \, \left( \mathrm{erf} \left( \frac{x_{2} - \mu}{\sqrt{2} \, \sigma} \right) - \mathrm{erf} \left( \frac{x_{1} - \mu}{\sqrt{2} \, \sigma} \right) \right).
5 Other Useful Probability Density Functions
Most of the applications of probability theory we will tackle in this book will use probability distributions that are absolutely continuous with respect to a counting measure or a Lebesgue measure and, consequently, can defined and implemented with probability density functions. There are a few exceptions, however, that we will occasionally need to accommodate.
5.1 Singular Probability Density Functions
On any measurable space where the atomic subsets are measurable, we can always define a singular probability distribution \delta_{x'} that concentrates all probability on a single point x' \in X, \delta_{x}(\mathsf{x}) = \left\{ \begin{array}{rr} 1, & x' \in \mathsf{x} \\ 0, & x' \notin \mathsf{x} \end{array} \right. . Because only a single point contributes, we can define the corresponding expectation values by the point-wise evaluation of the expectand, \mathbb{E}_{\delta_{x}} [f] = f(x'). These probability distributions are known as Dirac distributions.
Because \delta_{x'}(\{ x' \}) = 1, the atomic subset \{ x' \} is not a null subset with respect to \delta_{x'}. Consequently, Dirac distributions are not absolutely continuous with respect to any reference measure that allocates vanishing probability to the atomic subsets. In particular, singular probability distributions on real lines are not absolutely continuous with respect to the Lebesgue measure, and we cannot represent them with ordinary functions!
If, for example, a probability density function for this singular probability distribution did exist, then we would be able to engineer a function \delta(x - x') such that f(x') = \mathbb{E}_{\delta_{x'}} [f] = \mathbb{I}_{\lambda} [ \delta(\cdot - x') \cdot f] = \int_{-\infty}^{\infty} \mathrm{d} x \, \delta(x - x') \, f(x) for any expectand f: \mathbb{R} \rightarrow \mathbb{R}. How could we achieve this behavior?
Well, if the hypothetical density function \delta(x - x') concentrated around x', then the integrals would also concentrate around f(x'). For instance, integrals of normal probability density functions centered at x' approximate the desired behavior better and better as the scale becomes smaller and smaller (Figure 15).
In the limit \sigma \rightarrow 0, the normal probability density functions reduce to an infinitely high spike at the location parameter, \lim_{\sigma \rightarrow 0} \frac{1}{\sqrt{2 \, \pi} \sigma} \exp \left( -\frac{1}{2} \left( \frac{x' - x}{\sigma} \right)^{2} \right) = \left\{ \begin{array}{rr} \infty, & x' = x \\ 0, & x' \ne x \end{array} \right. . Perhaps we can we define \delta(x - x') by this spike?
Unfortunately, this intuition doesn’t quite work out. Remember that probability density functions can be modified at a null subset of input points without affecting their integrals. Because \{ x' \} is a Lebesgue null subset, our infinite spike should be equivalent to a function z that returns zero for all inputs, \mathbb{I}_{\lambda} [ \delta(\cdot - x') \cdot f] = \mathbb{I}_{\lambda} [ z \cdot f] = \mathbb{I}_{\lambda} [ z ]. Unfortunately, this results in an ill-defined integral, \mathbb{I}_{\lambda} [ z ] = \int_{-\infty}^{+\infty} \mathrm{d} x \, 0 = 0 \cdot \infty. that contradicts the desired behavior \mathbb{I}_{\lambda} [ \delta(\cdot - x') \cdot f] = f(x').
We cannot define a function that can scale a Lebesgue measure into a singular Dirac distribution. This is what the failure of absolutely continuity was trying to tell us in the first place!
Because the expectation values are trivial to compute, working with a singular probability distribution directly is straightforward. The lack of a well-defined singular density function, however, can be awkward when we’re exclusively using probability density functions to specify every other probability distribution of interest.
One way to get around this issue is to just define an object \delta called the Dirac delta function that satisfies f(0) = \int_{-\infty}^{\infty} \mathrm{d} x \, \delta(x) \, f(x) for all functions f : \mathbb{R} \rightarrow \mathbb{R}. Frustratingly contrary to the name, \delta is not actually a function but rather what mathematicians call a generalized function or functional distribution. Fortunately, these technicalities don’t matter so long as we only ever use the formal integral definition in calculations.
Consider, for example, an application where we want to inflate the probability allocated to the atomic subset \{ x' \} from 0 to 0 < \gamma \le 1, breaking absolute continuity with respect to the Lebesgue measure in the process. We can achieve this with a mixture probability distribution that combines a Dirac distribution \delta_{x'} with the initial probability distribution \pi, \rho = \gamma \, \delta_{x'} + (1 - \gamma) \, \pi. This mixture distribution defines the subset allocations \begin{align*} \rho(\mathsf{x}) &= \gamma \, \delta_{x'}(\mathsf{x}) + (1 - \gamma) \, \pi(\mathsf{x}) \\ &= \left\{ \begin{array}{rr} \gamma + (1 - \gamma) \, \pi(\mathsf{x}), & x' \in \mathsf{x} \\ (1 - \gamma) \, \pi(\mathsf{x}), & x' \notin \mathsf{x} \end{array} \right. , \end{align*} and the expectation values \begin{align*} \mathbb{E}_{\rho}[f] &= \gamma \, \mathbb{E}_{\delta_{x'}}[f] + (1 - \gamma) \, \mathbb{E}_{\pi}[f] \\ &= \gamma \, f(x') + (1 - \gamma) \, \mathbb{E}_{\pi}[f]. \end{align*}
Using the Dirac delta function we can heuristicaly represent this mixture distribution as \frac{\mathrm{d} \rho}{\mathrm{d} \lambda}(x) = \gamma \, \delta(x - x') + (1 - \gamma) \, \frac{\mathrm{d} \pi}{\mathrm{d} \lambda}(x), where all expectation values are calculated as \begin{align*} \mathbb{E}_{\rho}[f] &= \int_{-\infty}^{\infty} \mathrm{d} x \, \frac{\mathrm{d} \rho}{\mathrm{d} \lambda}(x) \, f(x) \\ &= \int_{-\infty}^{\infty} \mathrm{d} x \, \left( \gamma \, \delta(x - x') + (1 - \gamma) \, \frac{\mathrm{d} \pi}{\mathrm{d} \lambda}(x)(x) \right) \, f(x) \\ &= \gamma \, \int_{-\infty}^{\infty} \mathrm{d} x \, \delta(x - x') \, f(x) + (1 - \gamma) \, \int_{-\infty}^{\infty} \mathrm{d} x \, \frac{\mathrm{d} \pi}{\mathrm{d} \lambda}(x) \, f(x) \\ &= \gamma \, f(x') + (1 - \gamma) \, \int_{-\infty}^{\infty} \mathrm{d} x \, \frac{\mathrm{d} \pi}{\mathrm{d} \lambda}(x) \, f(x) \\ &= \gamma \, f(x') + (1 - \gamma) \, \mathbb{E}_{\pi}[f]. \end{align*}
If we restrict consideration to continuous probability density functions, then we can visualize this mixture density function \mathrm{d} \rho / \mathrm{d} \lambda as a continuous base density function \mathrm{d} \pi / \mathrm{d} \lambda with a single discontinuity at x' (Figure 16). Without this restriction, however, visualizations like this are ambiguous because point defects in \mathrm{d} \pi / \mathrm{d} \lambda shouldn’t change the resulting expectation values.
Because of all of this delicate mathematical baggage, the Dirac delta “function” requires care when using in practice. That said the compact, dare I say elegant, probability density function descriptions it enables is often worth the added subtlety.
5.2 Geometric Probability Density Functions
Real spaces adequately model many phenomena that arise in practical applications, but by no means all of them. In some cases we will need to consider continuous spaces that look like a real spaces locally but exhibit different shapes globally (Figure 18). These include, for example, spheres, torii, and even more foreign spaces. Mathematically, these spaces, along with real spaces, are collectively known as manifolds.
One nice feature of manifolds is that we can always equip them with consistent metric structures. A chosen metric then allows us to define a corresponding uniform measure that emulates many of the features of the Lebesgue measure on real spaces. In particular, these uniform measures serve as natural reference measures on which we can build many useful probability distributions.
For instance, we can equip a circle S^{1} with a metric that endows angular intervals with a notion of length (Figure 18 (a)). We can then define a uniform measure that allocates the same measure to every angular interval of the same length (Figure 18 (b)).
Another important consequence of this construction is that, similar to the Lebesgue measure, these uniform measures allocate zero measure to every atomic subset. Consequently, any probability distribution over S^{1} that also allocates vanishing probability to the atomic subsets will be absolutely continuous to these uniform measures. This allows us to define circular probability density functions p : \mathbb{S}^{1} \rightarrow \mathbb{R}^{+} to represent each of these absolutely continuous probability distributions (Figure 19).
We have to take care, however, not to confuse these circular probability density functions with Lebesgue probability density functions. Specifically, expectation values \mathbb{E}_{\pi}[f] = \mathbb{I}_{\nu}[p \cdot f] are not implemented with classic Riemann integration, but rather a more general manifold integration that is not implemented in the same way.
That said, sometimes there are workarounds. For example, removing a point x' \in \mathbb{S}^{1} from the circle defines a new space \mathbb{S}^{1} \setminus x'. Circular probability distributions, circular probability density functions, and circular expectands f : \mathbb{S} \rightarrow \mathbb{R} on the circle all define corresponding objects on this excised space. Once we’ve removed the point we can then unroll and stretch out \mathbb{S}^{1} \setminus x' into a real line \mathbb{R}, taking all of the probabilistic objects along with us (Figure 20).
In general, expectation values on the initial circular probability space will be different from the expectation values on the excised probability space. If \{ x' \} is a null subset, however, then they will be the same, \mathbb{E}_{\pi}[f] = \mathbb{I}_{\nu}[p \cdot f] = \mathbb{I}_{\nu'}[p' \cdot f'] = \mathbb{I}_{\nu''}[p'' \cdot f''], where no ticks denotes objects on \mathbb{S}^{1}, one tick denotes objects on \mathbb{S}^{1} \setminus \mathbb{R}, and two ticks denotes objects on \mathbb{R}.
To summarize, by removing a point from the circle we can calculate circular expectation values with classic Riemann integrals. Not every cut point, however, will be as convenient as others. Cutting the circle at a point away from the peak of the initial probability density function results in a well-behaved, unimodal Lebesgue probability density function (Figure 20 (a)). On the other hand, cutting near the peak results in a Lebesgue probability density function with two peaks at relatively extreme values. This separation can frustrate integral calculations (Figure 20 (b)).
If we don’t know where the initial circular probability density function concentrates, then we won’t know where to make a good cut. In this case, we can end up with a difficult Lebesgue probability density function that doesn’t actually get us any closer to a completed calculation. The subtle relationship between circular probability density functions, let alone other, more complicated manifold probability density functions, and Lebesgue probability density functions is dark and full of terrors. To avoid computational problems, or worse corrupting the target expectation values, we need to be very careful when working on these more sophisticated spaces.
6 Conclusion
Because they provide a straightforward way to implement probability distributions in practice, probability density functions are absolutely critical in transitioning probability theory from abstract mathematics to a viable tool for applied practice.
That said, because they define probability distributions only in the context of integrals informed by a given reference measure, they are not without their subtleties. When we ignore this context, we become prone to incorrect interpretations and erroneous calculations, both of which can result in inconsistent applications of the underlying probability theory.
Appendix: Sums and Integrals
Probability density functions allow us to compute expectation values using explicit mathematical operations, in particular summation when working with a counting reference measure and Riemann integration when working with a Lebesgue reference measure. Just because these operations are explicit, however, doesn’t mean that they’re straightforward to calculate.
Deriving analytic results for even the most convenient summations and integrals can require a substantial amount of experience with calculus techniques, and in many cases tricks that seem to come out of nowhere. For those who are curious about these techniques, this appendix gathers full calculations for the Poisson and normal results that we used in this chapter.
Note that these calculations will absolutely not be necessary for keeping up with future chapters. Indeed, in most applications we will take advantage of other computational tools that don’t require these kinds of onerous calculations at all.
6.1 Poisson Summations
Let’s warm up with some summations.
The normalization of each Poisson probability mass function is given by \begin{align*} \mathrm{Poisson}(X; \lambda) &= \mathbb{E}_{\mathrm{Poisson}}[ I_{X}; \lambda ] \\ &= \mathbb{I}_{\chi}[ \mathrm{Poisson}( ; \lambda) \cdot I_{X} ] \\ &= \sum_{n = 0}^{\infty} \mathrm{Poisson}(n; \lambda) \cdot 1 \\ &= \sum_{n = 0}^{\infty} \frac{ \lambda^{n} e^{-\lambda} }{ n! } \\ &= e^{-\lambda} \sum_{n = 0}^{\infty} \frac{ \lambda^{n} }{ n! }. \end{align*} This summation, however, is just the power series definition for the exponential function, e^{x} = \sum_{n = 0}^{\infty} \frac{ x^{n} }{ n! }. Consequently, \begin{align*} \mathrm{Poisson}(X; \lambda) &= e^{-\lambda} \sum_{n = 0}^{\infty} \frac{ \lambda^{n} }{ n! } \\ &= e^{-\lambda} e^{\lambda} \\ &= 1, \end{align*} as required.
Similarly, the mean for each Poisson probability distribution is given by \begin{align*} \mathbb{M}(\lambda) &= \mathbb{E}_{\mathrm{Poisson}} [ \iota ; \lambda ] \\ &= \mathbb{I}_{\chi} [ \mathrm{Poisson}( ; \lambda) \cdot \iota ] \\ &= \sum_{n = 0}^{\infty} \mathrm{Poisson}(n; \lambda) \cdot \iota(n) \\ &= \sum_{n = 0}^{\infty} \frac{ \lambda^{n} e^{-\lambda} }{ n! } \cdot n \\ &= \sum_{n = 1}^{\infty} \frac{ \lambda^{n} e^{-\lambda} }{ (n - 1)! }. \end{align*} To evaluate this sum, we need to shift the summation index from n to m = n - 1, which gives \begin{align*} \mathbb{M}(\lambda) &= \sum_{n = 1}^{\infty} \frac{ \lambda^{n} e^{-\lambda} }{ (n - 1)! } \\ &= \sum_{m = 1}^{\infty} \frac{ \lambda^{m + 1} e^{-\lambda} }{ m! } \\ &= \lambda \sum_{m = 1}^{\infty} \frac{ \lambda^{m} e^{-\lambda} }{ m! }. \end{align*} Conveniently, the resulting sum is exactly the normalization that we showed above is equal to one, \begin{align*} \mathbb{M}(\lambda) &= \lambda \sum_{m = 1}^{\infty} \frac{ \lambda^{m} e^{-\lambda} }{ m! } \\ &= \lambda. \end{align*}
To compute the variances, we’ll first need to calculate the second-order moment, \begin{align*} \mathbb{M}_{2}(\lambda) &= \mathbb{E}_{\mathrm{Poisson}} [ \iota^{2} ; \lambda ] \\ &= \mathbb{I}_{\chi} [ \mathrm{Poisson}( ; \lambda) \cdot \iota^{2} ] \\ &= \sum_{n = 0}^{\infty} \mathrm{Poisson}(n; \lambda) \cdot iota(n)^{2} \\ &= \sum_{n = 0}^{\infty} \frac{ \lambda^{n} e^{-\lambda} }{ n! } \cdot n^{2} \\ &= \sum_{n = 1}^{\infty} \frac{ \lambda^{n} e^{-\lambda} }{ (n - 1)! } \, n \end{align*} Shifting the summation index from n to m = n - 1, as we did above, gives \begin{align*} \mathbb{M}_{2}(\lambda) &= \sum_{n = 1}^{\infty} \frac{ \lambda^{n} e^{-\lambda} }{ (n - 1)! } \, n \\ &= \sum_{m = 1}^{\infty} \frac{ \lambda^{m + 1} e^{-\lambda} }{ m! } (m + 1) \\ &= \lambda \sum_{m = 1}^{\infty} \frac{ \lambda^{m} e^{-\lambda} }{ m! } \, m + \lambda \sum_{m = 1}^{\infty} \frac{ \lambda^{m} e^{-\lambda} }{ m! }. \end{align*} The first summation is the Poisson mean, while second summation is the normalization, both of which we’ve already computed. Substituting their values gives \begin{align*} \mathbb{M}_{2}(\lambda) &= \lambda \sum_{m = 1}^{\infty} \frac{ \lambda^{m} e^{-\lambda} }{ m! } \, m + \lambda \sum_{m = 1}^{\infty} \frac{ \lambda^{m} e^{-\lambda} }{ m! } \\ &= \lambda \, \mathbb{M}(\lambda) + \lambda \, \mathrm{Poisson}(X; \lambda) \\ &= \lambda \, \lambda + \lambda \, 1 \\ &= \lambda^{2} + \lambda. \end{align*}
The Poisson variances are then given by \begin{align*} \mathbb{C}_{2} &= \mathbb{E}_{\mathrm{Poisson}} [ (\iota - \mathbb{M}(\lambda))^{2} ; \lambda ] \\ &= \mathbb{E}_{\mathrm{Poisson}} [ \iota^{2} - 2 \, \mathbb{M}(\lambda) \, \iota + \mathbb{M}(\lambda)^{2} ; \lambda ] \\ &= \mathbb{E}_{\mathrm{Poisson}} [ \iota^{2} ; \lambda ] - 2 \, \mathbb{M}(\lambda) \, \mathbb{E}_{\mathrm{Poisson}} [ \iota ; \lambda ] + \mathbb{M}(\lambda) \\ &= \left( \lambda^{2} + \lambda \right) -2 \, \lambda \, \lambda + \lambda^{2} \\ &= (2 \lambda^{2} - 2 \lambda^{2}) + \lambda \\ &= \lambda. \end{align*}
6.2 Normal Integrals
Working with normal expectation values is substantially easier once we’ve established a few foundational results.
6.2.1 The Basic Normal Integral
First, let’s compute the so-called “normal integral” H = \int_{-\infty}^{\infty} \mathrm{d} x \, \exp \left( -a \, x^{2} \right). The trick to computing this integral is to work not with a single copy of it, but rather two copies at the same time, \begin{align*} H^{2} &= H \cdot H \\ &= \int_{-\infty}^{\infty} \mathrm{d} x_{1} \, \exp \left( -a \, x_{1}^{2} \right) \int_{-\infty}^{\infty} \mathrm{d} x_{2} \, \exp \left( -a \, x_{2}^{2} \right) \\ &= \int_{-\infty}^{\infty} \mathrm{d} x_{1} \, \int_{-\infty}^{\infty} \mathrm{d} x_{2} \, \exp \left( -a \, x_{1}^{2} \right) \, \exp \left( -a \, x_{2}^{2} \right) \\ &= \int_{-\infty}^{\infty} \mathrm{d} x_{1} \, \int_{-\infty}^{\infty} \mathrm{d} x_{2} \, \exp \left( -a \left( x_{1}^{2} + x_{2}^{2} \right) \right). \end{align*}
Remarkably, this results in a two-dimensional integral that is ripe for polar coordinates. Making the transformation \begin{align*} r &= x_{1}^{2} + x_{2}^{2} \\ \theta &= \arctan \frac{x_{2}}{x_{1}} \end{align*} with \mathrm{d} x_{1} \, \mathrm{d} x_{2} = \mathrm{d} \theta \, \mathrm{d} r \, r gives \begin{align*} H^{2} &= \int_{-\infty}^{\infty} \mathrm{d} x_{1} \, \int_{-\infty}^{\infty} \mathrm{d} x_{2} \, \exp \left( -a \left( x_{1}^{2} + x_{2}^{2} \right) \right) \\ &= \int_{0}^{2 \, \pi} \mathrm{d} \theta \, \int_{0}^{\infty} \mathrm{d} r \, r \exp \left( -a \, r^{2} \right) \\ &= \int_{0}^{\infty} \mathrm{d} r \, r \exp \left( -a \, r^{2} \right) \, \int_{0}^{2 \, \pi} \mathrm{d} \theta \\ &= \int_{0}^{\infty} \mathrm{d} r \, r \exp \left( -a \, r^{2} \right) \, 2 \, \pi \\ &= 2 \, \pi \, \int_{0}^{\infty} \mathrm{d} r \, r \exp \left( -a \, r^{2} \right). \end{align*}
At this point we make the substitution u = - a \, r^{2} with \mathrm{d} u = \mathrm{d} r \, (- 2 \, a \, r ) to give \begin{align*} H^{2} &= 2 \, \pi \, \int_{0}^{\infty} \mathrm{d} r \, r \exp \left( -a \, r^{2} \right) \\ &= 2 \, \pi \, \int_{0}^{-\infty} \mathrm{d} u \, \frac{-1}{2 \, a} \exp \left( u \right) \\ &= - \frac{\pi}{a} \, \int_{0}^{-\infty} \mathrm{d} u \exp \left( u \right) \\ &= - \frac{\pi}{a} \left[ \exp \left( u \right) \right]_{0}^{-\infty} \\ &= - \frac{\pi}{a} \left[ \exp \left( -\infty \right) - \exp \left( 0 \right) \right] \\ &= - \frac{\pi}{a} \left[ 0 - 1 \right] \\ &= \frac{\pi}{a}. \end{align*}
Consequently, \int_{-\infty}^{\infty} \mathrm{d} x \, \exp \left( -a \, x^{2} \right) = \sqrt{H^{2}} = \sqrt{ \frac{\pi}{a} }.
6.2.2 Higher-Order Normal Integrals
We can now use this basic normal integral to compute the slightly more sophisticated integral \int_{-\infty}^{\infty} \mathrm{d} x \, x^{k} \, \exp \left( -a \, x^{2} \right). One thing that we can note immediately is that, when k is an odd integer, the integrand will be an odd function, and the integral will vanish by symmetry. Consequently, we really only need to compute the integrals \int_{-\infty}^{\infty} \mathrm{d} x \, x^{2 k} \, \exp \left( -a \, x^{2} \right) for k \in \mathbb{N}.
We could hack away at this integral by integrating the basic normal integral by parts. For example, \begin{align*} \sqrt{ \frac{\pi}{a} } &= \int_{-\infty}^{\infty} \mathrm{d} x \, \big( 1 \big) \, \big( \exp \left( -a \, x^{2} \right) \big) \\ &= \left[ x \, \exp \left( -a \, x^{2} \right) \right]_{-\infty}^{\infty} - \int_{-\infty}^{\infty} \mathrm{d} x \, \big( x \big) \, \big( (- 2 \, a \, x) \, \exp \left( -a \, x^{2} \right) \big) \\ &= \left[ 0 - 0 \right] + 2 \, a \, \int_{-\infty}^{\infty} \mathrm{d} x \, x^{2} \, \exp \left( -a \, x^{2} \right) \\ &= 2 \, a \, \int_{-\infty}^{\infty} \mathrm{d} x \, x^{2} \, \exp \left( -a \, x^{2} \right), \end{align*} or, equivalently, \int_{-\infty}^{\infty} \mathrm{d} x \, x^{2} \, \exp \left( -a \, x^{2} \right) = \frac{1}{2 \, a} \, \sqrt{ \frac{\pi}{a} } = \sqrt{\pi} \, \frac{1}{2} \, a^{-3/2}. Higher-order integrals can then be derived by repeating this process over and over again.
That said, we can also compute these integrals more efficiently with the help of a devious trick of mathematics. The idea is to start with our initial integral \int_{-\infty}^{\infty} \mathrm{d} x \, \exp \left( -a \, x^{2} \right) = \sqrt{ \frac{\pi}{a} }, and then differentiate both sides with respect to the parameter a, \begin{align*} \frac{\mathrm{d}}{\mathrm{d} a} \int_{-\infty}^{\infty} \mathrm{d} x \, \exp \left( -a \, x^{2} \right) &= \frac{\mathrm{d}}{\mathrm{d} a} \sqrt{ \frac{\pi}{a} } \\ &= \sqrt{\pi} \left( - \frac{1}{2} \right) a^{-3/2}. \end{align*}
Because all of these functions are smooth, the derivative and integral operations commute here, allowing us to pull the derivative inside of the integral, \begin{align*} \frac{\mathrm{d}}{\mathrm{d} a } \int_{-\infty}^{\infty} \mathrm{d} x \, \exp \left( -a \, x^{2} \right) &= \sqrt{\pi} \left( - \frac{1}{2} \right) a^{-3/2} \\ \int_{-\infty}^{\infty} \mathrm{d} x \, \frac{\mathrm{d}}{\mathrm{d} a} \exp \left( -a \, x^{2} \right) &= \sqrt{\pi} \left( - \frac{1}{2} \right) a^{-3/2} \\ \int_{-\infty}^{\infty} \mathrm{d} x \, \left( -x^{2} \right) \exp \left( -a \, x^{2} \right) &= \sqrt{\pi} \left( - \frac{1}{2} \right) a^{-3/2} \\ \int_{-\infty}^{\infty} \mathrm{d} x \, x^{2} \exp \left( -a \, x^{2} \right) &= \sqrt{\pi} \, \frac{1}{2} \, a^{-3/2}, \end{align*} consistent with our earlier integration by parts calculation.
The beauty of this latter approach, however, is that it generalizes to larger k immediately with repeated differentiation, \begin{align*} \frac{\mathrm{d}^{k}}{\mathrm{d} a^{k}} \int_{-\infty}^{\infty} \mathrm{d} x \, \exp \left( -a \, x^{2} \right) &= \frac{\mathrm{d}^{k}}{\mathrm{d} a^{k}} \sqrt{ \frac{\pi}{a} } \\ \int_{-\infty}^{\infty} \mathrm{d} x \, \frac{\mathrm{d}^{k}}{\mathrm{d} a^{k}} \exp \left( -a \, x^{2} \right) &= \frac{\mathrm{d}^{k}}{\mathrm{d} a^{k}} \sqrt{ \frac{\pi}{a} } \\ \int_{-\infty}^{\infty} \mathrm{d} x \, (-1)^{k} x^{2 k} \exp \left( -a \, x^{2} \right) &= \sqrt{\pi} \prod_{k' = 1}^{k} \left(-\frac{1}{2} - (k' - 1) \right) \, a^{-(1/2 + k)} \\ \int_{-\infty}^{\infty} \mathrm{d} x \, (-1)^{k} x^{2 k} \exp \left( -a \, x^{2} \right) &= \sqrt{\pi} \, (-1)^{k} \, \prod_{k' = 1}^{k} \frac{2 \, k' - 1}{2} \, a^{-(1/2 + k)} \\ \int_{-\infty}^{\infty} \mathrm{d} x \, x^{2 k} \exp \left( -a \, x^{2} \right) &= \sqrt{\pi} \, \frac{ (2 \, k - 1)!! }{2^{k}} \, a^{-(1/2 + k)}, \end{align*} where n!! = \prod_{k' = 1}^{\frac{n + 1}{2}} (2 \, k' - 1) is the double factorial.
6.2.3 Normal Expectation Values
With the general result \int_{-\infty}^{\infty} \mathrm{d} x \, x^{2 k} \exp \left( -a \, x^{2} \right) = \sqrt{\pi} \frac{k!!}{2^{k}} a^{-(k + 1/2)} we can now go to town on the standard normal expectation values.
For example, the normalization of each normal density function is given by \begin{align*} \mathrm{normal}(\mathbb{R}; \mu, \sigma) &= \mathbb{E}_{\mathrm{normal}}[ I_{\mathbb{R}} ; \mu, \sigma ] \\ &= \mathbb{I}_{\lambda}[ \mathrm{normal}(\cdot; \mu, \sigma) \cdot I_{\mathbb{R}} ] \\ &= \int_{-\infty}^{\infty} \mathrm{d} x \, \frac{1}{\sqrt{2 \, \pi} \, \sigma} \mathrm{normal}(x; \mu, \sigma) \\ &= \int_{-\infty}^{\infty} \mathrm{d} x \, \frac{1}{\sqrt{2 \, \pi} \, \sigma} \exp\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right) \\ &= \frac{1}{\sqrt{2 \, \pi} \, \sigma} \int_{-\infty}^{\infty} \mathrm{d} x \, \exp \left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right). \end{align*}
To proceed further, we have to change variables to u = \frac{x - \mu}{\sigma} with \mathrm{d} u = \mathrm{d} x \, \frac{1}{\sigma}. This gives \begin{align*} \mathrm{normal}(\mathbb{R}; \mu, \sigma) &= \frac{1}{\sqrt{2 \, \pi} \, \sigma} \int_{-\infty}^{\infty} \mathrm{d} x \, \exp \left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right) \\ &= \frac{1}{\sqrt{2 \, \pi} \, \sigma} \int_{-\infty}^{\infty} \mathrm{d} u \, \sigma \, \exp \left( -\frac{1}{2} u^{2} \right) \\ &= \frac{1}{\sqrt{2 \, \pi}} \int_{-\infty}^{\infty} \mathrm{d} u \, \exp \left( -\frac{1}{2} u^{2} \right). \end{align*}
For a = 1/2 and k = 0 our general normal integral gives \begin{align*} \mathrm{normal}(\mathbb{R}; \mu, \sigma) &= \frac{1}{\sqrt{2 \, \pi}} \int_{-\infty}^{\infty} \mathrm{d} u \, \exp \left( -\frac{1}{2} u^{2} \right) \\ &= \frac{1}{\sqrt{2 \, \pi}} \sqrt{2 \, \pi} \\ &= 1, \end{align*} as required.
Similarly, the mean is given by \begin{align*} \mathbb{M}(\mu, \sigma) &= \mathbb{E}_{\mathrm{normal}}[ \iota ; \mu, \sigma ] \\ &= \mathbb{I}_{\lambda}[ \mathrm{normal}(\cdot; \mu, \sigma) \cdot \iota ] \\ &= \int_{-\infty}^{\infty} \mathrm{d} x \, \frac{1}{\sqrt{2 \, \pi} \, \sigma} \mathrm{normal}(x; \mu, \sigma) \, \iota(x) \\ &= \int_{-\infty}^{\infty} \mathrm{d} x \, \frac{1}{\sqrt{2 \, \pi} \, \sigma} \exp \left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right) \, x \\ &= \frac{1}{\sqrt{2 \, \pi} \, \sigma} \int_{-\infty}^{\infty} \mathrm{d} x \, \exp \left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right) \, x. \end{align*}
Applying the same change of variables gives \begin{align*} \mathbb{M}(\mu, \sigma) &= \frac{1}{\sqrt{2 \, \pi} \, \sigma} \int_{-\infty}^{\infty} \mathrm{d} x \, \exp \left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right) \, x \\ &= \frac{1}{\sqrt{2 \, \pi} \, \sigma} \int_{-\infty}^{\infty} \mathrm{d} u \, \sigma \, \exp \left( -\frac{1}{2} u^{2} \right) \, \left( \sigma \, u + \mu \right) \\ &= \frac{\sigma}{\sqrt{2 \, \pi}} \int_{-\infty}^{\infty} \mathrm{d} u \, \exp \left( -\frac{1}{2} u^{2} \right) \, u + \frac{\mu}{\sqrt{2 \, \pi}} \int_{-\infty}^{\infty} \mathrm{d} u \, \exp \left( -\frac{1}{2} u^{2} \right). \end{align*} The first integral here vanishes by symmetry, while the second integral can be evaluated with another application of the general normal integral, \begin{align*} \mathbb{M}(\mu, \sigma) &= \frac{\sigma}{\sqrt{2 \, \pi}} \int_{-\infty}^{\infty} \mathrm{d} u \, \exp \left( -\frac{1}{2} u^{2} \right) \, u + \frac{\mu}{\sqrt{2 \, \pi}} \int_{-\infty}^{\infty} \mathrm{d} u \, \exp \left( -\frac{1}{2} u^{2} \right) \\ &= 0 + \frac{\mu}{\sqrt{2 \, \pi}} \sqrt{2 \, \pi} \\ &= \mu. \end{align*}
In fact, we can compute all of the central moments at once. The odd central moments all vanish by symmetry, whereas the even central moments are given by \begin{align*} \mathbb{D}_{2 \, k}(\mu, \sigma) &= \mathbb{E}_{\mathrm{normal}} [ \left( \iota - \mathbb{M}(\mu, \sigma) \right)^{2 \, k} ; \mu, \sigma ] \\ &= \mathbb{E}_{\mathrm{normal}} [ \left( \iota - \mu \right)^{2 \,k} ; \mu, \sigma ] \\ &= \mathbb{I}_{\lambda} [\mathrm{normal}(\cdot; \mu, \sigma) \cdot \left( \iota - \mu \right)^{2 \, k} ] \\ &= \int_{-\infty}^{\infty} \mathrm{d} x \, \frac{1}{\sqrt{2 \, \pi} \, \sigma} \mathrm{normal}(x; \mu, \sigma) \, \left( \iota(x) - \mu \right)^{2 \,k} \\ &= \int_{-\infty}^{\infty} \mathrm{d} x \, \frac{1}{\sqrt{2 \, \pi} \, \sigma} \exp \left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right) \, \left( x - \mu \right)^{2 \,k} \\ &= \frac{1}{\sqrt{2 \, \pi} \, \sigma} \int_{-\infty}^{\infty} \mathrm{d} x \, \exp \left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right) \, \left( x - \mu \right)^{2 \,k}. \end{align*}
After using the same change of variables again, this becomes \begin{align*} \mathbb{D}_{2 k}(\mu, \sigma) &= \frac{1}{\sqrt{2 \, \pi} \, \sigma} \int_{-\infty}^{\infty} \mathrm{d} x \, \exp \left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right) \, \left( x - \mu \right)^{2 k} \\ &= \frac{1}{\sqrt{2 \, \pi} \, \sigma} \int_{-\infty}^{\infty} \mathrm{d} u \, \sigma \, \exp \left( -\frac{1}{2} u^{2} \right) \, \sigma^{2 \, k} \, u^{2 k} \\ &= \frac{\sigma^{2 \, k} }{\sqrt{2 \, \pi}} \int_{-\infty}^{\infty} \mathrm{d} u \, \exp \left( -\frac{1}{2} u^{2} \right) \, u^{2 k}. \end{align*}
We already know, however, how to evaluate this integral, \begin{align*} \mathbb{D}_{2 k}(\mu, \sigma) &= \frac{\sigma^{2 \, k} }{\sqrt{2 \, \pi}} \int_{-\infty}^{\infty} \mathrm{d} u \, \exp \left( -\frac{1}{2} u^{2} \right) \, u^{2 k} \\ &= \frac{\sigma^{2 \, k} }{\sqrt{2 \, \pi}} \sqrt{\pi} \, \frac{ (2 \, k - 1)!! }{2^{k}} \, \left( \frac{1}{2} \right)^{-(1/2 + k)} \\ &= \frac{\sigma^{2 \, k} }{\sqrt{2}} \frac{ (2 \, k - 1)!! }{2^{k}} \, 2^{1/2 + k} \\ &= \frac{\sigma^{2 \, k} }{\sqrt{2}} \frac{ (2 \, k - 1)!! }{2^{k}} \, \sqrt{2} \, 2^{k} \\ &= \sigma^{2 \, k} \, (2 \, k - 1)!!. \end{align*}
Specifically, the variance is given by \mathbb{C}_{2}(\mu, \sigma) = \mathbb{D}_{2}(\mu, \sigma) = \sigma^{2}.
Finally the normal cumulative distribution functions are defined by interval probabilities, \begin{align*} \Pi_{\text{normal}}(x; \mu, \sigma) &= \text{normal}( \, (-\infty, x]; \mu, \sigma \, ) \\ &= \mathbb{E}_{\mathrm{normal}}[ I_{(-\infty, x]}; \mu, \sigma ] \\ &= \int_{-\infty}^{+\infty} \mathrm{d} x' \, \mathrm{normal}(x'; \mu, \sigma) \, I_{(-\infty, x]}(x') \\ &= \int_{-\infty}^{x} \mathrm{d} x' \, \mathrm{normal}(x'; \mu, \sigma) \\ &= \int_{-\infty}^{x} \mathrm{d} x' \, \frac{1}{\sqrt{2 \, \pi} \sigma} \exp \left( -\frac{1}{2} \left( \frac{x' - \mu}{\sigma} \right)^{2} \right) \\ &= \frac{1}{\sqrt{2 \, \pi} \sigma} \int_{-\infty}^{x} \mathrm{d} x' \, \exp \left( -\frac{1}{2} \left( \frac{x' - \mu}{\sigma} \right)^{2} \right). \end{align*}
After a change of variables to t = \frac{x' - \mu}{\sqrt{2} \, \sigma}, this becomes \begin{align*} \Pi(x; \mu, \sigma) &= \frac{1}{\sqrt{2 \, \pi} \sigma} \int_{-\infty}^{x} \mathrm{d} x' \, \exp \left( -\frac{1}{2} \left( \frac{x' - \mu}{\sigma} \right)^{2} \right). \\ &= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{ \frac{x - \mu}{\sqrt{2} \, \sigma} } \mathrm{d} t \, \exp \left( -t^{2} \right) \\ &= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{0} \mathrm{d} t \, \exp \left( -t^{2} \right) + \frac{1}{\sqrt{\pi}} \int_{0}^{ \frac{x - \mu}{\sqrt{2} \, \sigma} } \mathrm{d} t \, \exp \left( -t^{2} \right) \\ &= \frac{1}{2} + \frac{1}{2} \, \frac{2}{\sqrt{\pi}} \int_{0}^{ \frac{x - \mu}{\sqrt{2} \, \sigma} } \mathrm{d} t \, \exp \left( -t^{2} \right) \\ &= \frac{1}{2} + \frac{1}{2} \, \mathrm{erf} \left( \frac{x - \mu}{\sqrt{2} \, \sigma} \right), \end{align*} where \mathrm{erf} (x) \frac{2}{\sqrt{\pi}} \int_{0}^{ x } \mathrm{d} t \, \exp \left( -t^{2} \right) is the error function.
Acknowledgements
A very special thanks to everyone supporting me on Patreon: Adam Fleischhacker, Adriano Yoshino, Alan Chang, Alessandro Varacca, Alexander Noll, Alexander Petrov, Alexander Rosteck, Anders Valind, Andrea Serafino, Andrew Mascioli, Andrew Rouillard, Andrew Vigotsky, Angie_Hyunji Moon, Ara Winter, Austin Rochford, Austin Rochford, Avraham Adler, Ben Matthews, Ben Swallow, Benjamin Glemain, Benoit Essiambre, Bradley Kolb, Brandon Liu, Brendan Galdo, Brynjolfur Gauti Jónsson, Cameron Smith, Canaan Breiss, Cat Shark, Charles Naylor, Chase Dwelle, Chris, Chris Jones, Christopher Mehrvarzi, Colin Carroll, Colin McAuliffe, Damien Mannion, Damon Bayer, dan mackinlay, Dan Muck, Dan W Joyce, Dan Waxman, Dan Weitzenfeld, Daniel Edward Marthaler, Darshan Pandit, Darthmaluus , David Burdelski, David Galley, David Wurtz, Denis Vlašiček, Doug Rivers, Dr. Jobo, Dr. Omri Har Shemesh, Dylan Maher, Ed Cashin, Edgar Merkle, Eric LaMotte, Erik Banek, Ero Carrera, Eugene O’Friel, Felipe González, Fergus Chadwick, Finn Lindgren, Florian Wellmann, Francesco Corona, Geoff Rollins, Greg Sutcliffe, Guido Biele, Hamed Bastan-Hagh, Haonan Zhu, Hector Munoz, Henri Wallen, hs, Hugo Botha, Håkan Johansson, Ian Costley, Ian Koller, idontgetoutmuch, Ignacio Vera, Ilaria Prosdocimi, Isaac Vock, J, J Michael Burgess, Jair Andrade, James C, James Hodgson, James Wade, Janek Berger, Jason Martin, Jason Pekos, Jason Wong, Jeff Burnett, Jeff Dotson, Jeff Helzner, Jeffrey Erlich, Jesse Wolfhagen, Jessica Graves, Joe Wagner, John Flournoy, Jonathan H. Morgan, Jonathon Vallejo, Joran Jongerling, Josh Weinstock, Joshua Duncan, Joshua Griffith, JU, Justin Bois, Karim Naguib, Karim Osman, Kejia Shi, Kristian Gårdhus Wichmann, Kádár András, Lars Barquist, lizzie , LOU ODETTE, Luís F, Marcel Lüthi, Marek Kwiatkowski, Mark Donoghoe, Markus P., Martin Modrák, Matt Moores, Matt Rosinski, Matthew, Matthew Kay, Matthieu LEROY, Mattia Arsendi, Maurits van der Meer, Michael Colaresi, Michael DeWitt, Michael Dillon, Michael Lerner, Mick Cooney, Márton Vaitkus, N Sanders, Nathaniel Burbank, Nic Fishman, Nicholas Clark, Nicholas Cowie, Nick S, Nicolas Frisby, Octavio Medina, Oliver Crook, Olivier Ma, Patrick Kelley, Patrick Boehnke, Pau Pereira Batlle, Peter Smits, Pieter van den Berg , ptr, Putra Manggala, Ramiro Barrantes Reynolds, Ravin Kumar, Raúl Peralta Lozada, Riccardo Fusaroli, Richard Nerland, Robert Frost, Robert Goldman, Robert kohn, Robin Taylor, Ross McCullough, Ryan Grossman, Rémi , S Hong, Scott Block, Sean Pinkney, Sean Wilson, Sergiy Protsiv, Seth Axen, shira, Simon Duane, Simon Lilburn, sssz, Stan_user, Stefan, Stephanie Fitzgerald, Stephen Lienhard, Steve Bertolani, Stew Watts, Stone Chen, Susan Holmes, Svilup, Sören Berg, Tao Ye, Tate Tunstall, Tatsuo Okubo, Teresa Ortiz, Thomas Lees, Thomas Vladeck, Tiago Cabaço, Tim Radtke, Tobychev , Tom McEwen, Tony Wuersch, Utku Turk, Virginia Fisher, Vitaly Druker, Vladimir Markov, Wil Yegelwel, Will Farr, Will Tudor-Evans, woejozney, yolhaj , Zach A, Zad Rafi, and Zhengchen Cai.
License
A repository containing all of the files used to generate this chapter is available on GitHub.
The text and figures in this chapter are copyrighted by Michael Betancourt and licensed under the CC BY-NC 4.0 license:
https://creativecommons.org/licenses/by-nc/4.0/