Measure and Probability on General Spaces
In Chapter One we discussed measure and probability theory over sets with only a finite number of elements. We saw in Chapter Two, however, that many of the mathematical spaces we encounter in practical applications, like the integers and the real line, feature not a finite number of elements but rather countably infinite, and even uncountably infinite, numbers of elements. Unfortunately, extending measure and probability theory to more general spaces like these is surprisingly subtle.
Here we will investigate the difficulties in defining measure and probability theory on general mathematical spaces, with a focus on concepts instead of technical details. We will first discuss why measures allocated to individual elements does not, in general, provide enough information to define a consistent allocation for all subsets. Then we will consider how certain pathological subsets on some spaces can obstruct consistent allocations over the full power set, and how we can systematically remove these obstructions in practice. Finally we will present the most general form of measure and probability theory that can be applied to any mathematical space and discuss some common applications.
1 Allocation Over Elements
Recall that in Chapter One we first defined measures and probability distributions as allocations to the individual elements in a finite set. More formally, we were able to define a measure as a function that mapped each element to its allocation of the total measure, \begin{alignat*}{6} \mu :\; & X & &\rightarrow& \; & [0, \infty] & \\ & x & &\mapsto& & \mu(x) &. \end{alignat*}
These element-wise allocations then allowed us to define the measure allocated to subsets. Specifically, the measure allocated to a subset \mathsf{x} \subset X was unambiguously determined by summing up the measures allocated to the included elements, \mu(\mathsf{x}) = \sum_{x \in \mathsf{x}} \mu(x). On finite spaces this construction gives us a consistent allocation in the sense that the total measure is always preserved no matter how we might decompose the ambient set into subsets.
Conveniently, this construction does extend to spaces with countably infinite numbers of elements, such as the integers. In these spaces every subset contains at most a countably infinite number of elements, and sums of measures will always converge to well-defined values. Element-wise measure allocations on finite and countably infinite spaces are also known as mass functions, with element-wise probability allocations also known as probability mass functions.
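To make the element-wise construction concrete, here is a minimal Python sketch of a mass function on a countably infinite space. The geometric form of the mass function, and the names `pmf` and `prob`, are purely illustrative assumptions, not anything defined in the text.

```python
# A minimal sketch of a probability mass function on the non-negative
# integers, a countably infinite space. The geometric form here is just
# a hypothetical example.

def pmf(x, p=0.5):
    """Probability mass allocated to the individual element x."""
    return (1 - p) ** x * p

def prob(subset, p=0.5):
    """Probability allocated to a subset, derived by summing the
    element-wise allocations."""
    return sum(pmf(x, p) for x in subset)

# The even non-negative integers (truncated for computation) receive a
# well-defined allocation; the partial sums converge to 2/3 for p = 0.5.
print(prob(range(0, 100, 2)))
```

Because every subset of a countable space is at most countable, this summation always converges to a well-defined value.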
Mass functions are particularly straightforward to visualize when X is not only countable but also ordered, such as the integers or a subset of the integers. In this case we can visualize the element-wise allocations with a sequence of vertical bars stacked next to each other (Figure 1).
Unfortunately, the element-wise construction does not extend any further. Once we consider spaces with uncountably infinite numbers of elements, such as the real numbers, we have to confront subsets with uncountably infinite numbers of elements where sums start to misbehave.
Consider, for example, a subset \mathsf{x} where each of the included elements has been allocated exactly zero measure. If \mathsf{x} contains only a finite or countably infinite number of elements, then the sum of these zero measures always yields zero.
When \mathsf{x} contains an uncountably infinite number of elements, however, the sum of the individual element measures is not necessarily zero. In fact, it can give any value between zero and infinity. Uncountably infinite spaces have so many elements that we can very much get something from nothing!
Ultimately, the allocation of measure to individual elements does not generally provide enough information to uniquely determine what measure should be allocated to every combination of those elements. In order to completely define a measure, we need to specify what those subset allocations are ourselves.
2 Allocation Over All Subsets
In Chapter One we also considered defining a measure by specifying allocations to each subset in the power set, \begin{alignat*}{6} \mu :\; & 2^{X} & &\rightarrow& \; & [0, \infty] & \\ & \mathsf{x} & &\mapsto& & \mu(\mathsf{x}) &. \end{alignat*} Importantly, these subset allocations needed to be consistent with each other to match the behavior derived from summing over individual element allocations. In particular, for any finite collection of disjoint subsets we should have \mu( \cup_{i = 1}^{I} \mathsf{x}_{i} ) = \sum_{i = 1}^{I} \mu( \mathsf{x}_{i} ).
This construction is excessive for finite spaces; the subset allocations contain an abundance of redundant information. Because we can also derive subset allocations from element-wise allocations on countably infinite spaces, this construction is unnecessary there as well.
On the other hand, the specification of at least some subset allocations is strictly necessary for fully defining measures on uncountably infinite spaces, and hence mathematical spaces in general. The only question is whether or not consistent subset allocations are even possible on these more sophisticated spaces.
2.1 Consistent Allocations
Before answering this question, let’s take a second to define exactly what kind of consistency we need.
Because finite spaces feature only a finite number of subsets, we only ever have to consider the consistency of a finite collection of subsets at a time. More formally, if \{ \mathsf{x}_{1}, \ldots, \mathsf{x}_{i}, \ldots, \mathsf{x}_{I} \} is any finite collection of disjoint subsets, \mathsf{x}_{i} \cap \mathsf{x}_{i' \ne i} = \emptyset, then a consistent measure should give \mu( \cup_{i = 1}^{I} \mathsf{x}_{i} ) = \sum_{i = 1}^{I} \mu( \mathsf{x}_{i} ). This property is known as finite additivity.
More general spaces can feature infinitely many subsets, and hence different possible notions of additive consistency. For example, on a countably infinite space the subset allocations derived from a mass function are consistent across countably infinite collections of subsets. If \{ \mathsf{x}_{1}, \ldots, \mathsf{x}_{i}, \ldots \} is any countably infinite collection of disjoint subsets with \mathsf{x}_{i} \cap \mathsf{x}_{i' \ne i} = \emptyset, then \mu( \cup_{i} \mathsf{x}_{i} ) = \sum_{i} \mu( \mathsf{x}_{i} ). This is known as countable additivity.
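Countable additivity can be checked numerically on a countably infinite space. The sketch below assumes a hypothetical geometric mass function, decomposes the non-negative integers into three disjoint residue classes, and confirms that their allocations sum to the allocation of the full union; the truncation `N` is purely a computational convenience.

```python
# Numerical check of countable additivity for a hypothetical geometric
# distribution on the non-negative integers.

p = 0.5

def pmf(x):
    return (1 - p) ** x * p

N = 200  # truncation; the tail beyond N is negligible for p = 0.5

# Decompose the integers into the disjoint residue classes mod 3 and
# compute the probability allocated to each.
classes = [sum(pmf(x) for x in range(r, N, 3)) for r in range(3)]

# Additivity: the class allocations recover the total probability of
# their union, which here is the entire space.
print(sum(classes))  # approximately 1.0
```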
The question is then whether measures with finite additivity are sufficiently useful for practical applications, or whether we need to consider countably additive measures, or even more general notions of additivity.
For example, a common problem that arises in practice is reconstructing the measure allocated to a general subset from the measures allocated to particularly nice subsets that are easier to work with. If we could always decompose a generic subset into the disjoint union of a finite number of nice subsets, then finite additivity would be sufficient for this task. On the other hand, if we could always decompose a generic subset into the disjoint union of a countably infinite number of nice subsets, then countable additivity would be sufficient. At the extreme, if some subsets could be decomposed into only an uncountably infinite number of subsets then we would need even stronger notions of additivity!
Fortunately, we don’t have to go to that last extreme. It turns out that on the spaces that we’ll encounter in practice, and with a reasonable notion of “nice” subset, countable additivity is sufficient for reconstructing the measure allocated to more general subsets.
To demonstrate, let’s consider the two-dimensional real plane \mathbb{R}^{2} and a measure that is partially defined through its allocations to rectangular subsets (Figure 2). In general, a non-rectangular subset, in this case a disk, can be crudely approximated by a single rectangular subset. The disk can be approximated more precisely as the disjoint union of finitely many rectangular subsets, but there will always be some residual error. Only when we incorporate a countably infinite number of rectangular subsets can we reconstruct the disk exactly.
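This reconstruction can be mimicked numerically. The sketch below approximates the area of the unit disk by summing the areas of small grid squares, standing in for the rectangular subsets; finer grids push the total toward the exact allocation, \pi. The function name and the grid scheme are illustrative choices.

```python
import math

# Approximating the measure (area) of the unit disk with a disjoint
# union of small axis-aligned squares covering [-1, 1] x [-1, 1].

def disk_area_approx(n):
    """Sum the areas of the n-by-n grid squares whose centers lie
    inside the unit disk."""
    h = 2.0 / n  # side length of each square
    total = 0.0
    for i in range(n):
        for j in range(n):
            cx = -1 + (i + 0.5) * h
            cy = -1 + (j + 0.5) * h
            if cx * cx + cy * cy <= 1:
                total += h * h
    return total

for n in (10, 100, 400):
    print(n, disk_area_approx(n))  # approaches math.pi as n grows
```

Only in the limit of ever finer squares does the rectangular reconstruction become exact, which is precisely why countable collections of nice subsets are needed.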
Ultimately, countably additive measures give us the mathematical flexibility we need for practical calculation.
2.2 Sub-Additivity and Super-Additivity
Ideally, we would be able to define measures that are additive over any countably infinite collection of disjoint subsets on any space. Unfortunately, mathematics is not always kind; many seemingly well-behaved spaces feature pathological subsets that obstruct countable additivity.
Specifically, many uncountably infinite spaces feature disjoint subsets that will always behave sub-additively, \begin{align*} \mathsf{x}_{1} \cap \mathsf{x}_{2} &= \emptyset \\ \mu( \mathsf{x}_{1} \cup \mathsf{x}_{2} ) &< \mu( \mathsf{x}_{1}) + \mu( \mathsf{x}_{2} ) \end{align*} no matter how we try to define the allocations! In other words, the power set will always be contaminated by certain subsets that are always less than the sum of their parts. This contamination obstructs a fully consistent definition of measure.
At the same time, we can generally prove the existence of disjoint subsets that are super-additive, \begin{align*} \mathsf{x}_{1} \cap \mathsf{x}_{2} &= \emptyset \\ \mu( \mathsf{x}_{1} \cup \mathsf{x}_{2} ) &> \mu( \mathsf{x}_{1}) + \mu( \mathsf{x}_{2} ). \end{align*} If we compare the measure allocated to these subsets to the measure allocated to their union, then we will always appear to end up with more measure than what had been initially allocated.
What makes these pathological subsets even more awkward is that we can’t actually construct them from explicit conditions. Given typical assumptions about infinity, all we can do is prove that these subsets exist. These phantom subsets are known as non-constructive objects.
That said, because the misbehaving subsets are non-constructive, we don’t really need to consider them in any practical application of measure theory. If we could consistently filter them out of the full power set, then we would be able to define consistent measures over the remaining subsets, and that would be completely sufficient for any practical application.
3 \sigma-Algebras
Because the term “\sigma-algebra” is often thrown around in measure and probability theory without much explanation, it can seem like an impenetrable concept. In reality, however, \sigma-algebras are simply a way to consistently filter out undesired subsets from the power set.
3.1 Filtering Subsets
We can always filter the power set by removing certain subsets. The difficulty is ensuring that no application of the three set operations would ever lead us back to any of the excised subsets. More formally, we need our filtered collection of subsets to be closed under the three set operations so that there is no risk of accidentally recreating a subset outside of the collection.
For instance, if the subset \mathsf{x} \subset X is in our filtered collection then so too should be the complement \mathsf{x}^{c}. If this is true, then anytime we apply the complement operator to a subset in our collection we are guaranteed to always end up with another subset in our collection.
Similarly, for every pair of subsets \mathsf{x}_{1} \subset X and \mathsf{x}_{2} \subset X in a filtered collection the union \mathsf{x}_{1} \cup \mathsf{x}_{2} and intersection \mathsf{x}_{1} \cap \mathsf{x}_{2} should also be in the collection. In order to ensure closure under repeated applications of the union and intersection operators, we also need the union and intersection of any countably infinite sequence of subsets to also be in the filtered collection.
A \sigma-algebra is any collection of subsets that is closed under complements, countable unions, and countable intersections. In other words, a \sigma-algebra is just any consistent filtering of the power set. I will use a calligraphic font to refer to \sigma-algebras, so that if X is a space then \mathcal{X} \subset 2^{X} will denote a \sigma-algebra defined on that space.
A set equipped with a \sigma-algebra, (X, \mathcal{X}), is known as a measurable space. I will refer to X as the ambient set, or the ambient space if it is also equipped with additional structure. Similarly, the elements of a distinguished \sigma-algebra are known as measurable subsets, while any subsets in the power set but not in the \sigma-algebra are referred to as non-measurable subsets.
Non-measurable subsets typically probe the subtle, and often counterintuitive, pathologies inherent to a given space. By working with \sigma-algebras directly, we can avoid these awkward pathologies entirely.
3.2 Generating \sigma-Algebras
Now that we’ve defined how a consistent sub-collection of subsets behaves, we need to consider how to construct these \sigma-algebras in practice. One particularly useful way to build up \sigma-algebras is to generate them by repeatedly applying the three set operations to an initial collection of subsets.
For example, consider an initial collection of two subsets \{ \mathsf{x}_1, \mathsf{x}_2 \}. Applying the complement operator gives us two subsets that fall outside of the initial collection, \{ \mathsf{x}_1^c, \mathsf{x}_2^c \}. Similarly, applying the union operator gives \{ \mathsf{x}_1 \cup \mathsf{x}_2 \}, while applying the intersection operator gives \{ \mathsf{x}_1 \cap \mathsf{x}_2 \}. To ensure closure we have to add all of these subsets to our initial collection, \{ \mathsf{x}_1, \mathsf{x}_2, \mathsf{x}_1^c, \mathsf{x}_2^c, \mathsf{x}_1 \cup \mathsf{x}_2, \mathsf{x}_1 \cap \mathsf{x}_2 \}. At this point we iterate, applying the complement operator to every subset and the union and intersection operators to every finite and countably infinite sub-collection of subsets. This generates increasingly larger collections of subsets. When the set operations no longer return new subsets, the final collection of subsets defines a \sigma-algebra.
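This generating procedure is concrete enough to implement on a small finite set. The sketch below closes an initial collection under complements and pairwise unions and intersections; on a finite set this pairwise closure suffices, since countable operations reduce to finite ones. The function name is an illustrative choice.

```python
from itertools import combinations

# Generate the sigma-algebra on a finite set by repeatedly applying the
# three set operations to an initial collection of subsets.

def generate_sigma_algebra(ambient, initial):
    ambient = frozenset(ambient)
    collection = {frozenset(s) for s in initial} | {frozenset(), ambient}
    while True:
        new = {ambient - s for s in collection}                 # complements
        new |= {a | b for a, b in combinations(collection, 2)}  # unions
        new |= {a & b for a, b in combinations(collection, 2)}  # intersections
        if new <= collection:  # no new subsets: closure achieved
            return collection
        collection |= new

# Starting from { {1}, {1, 2} } on X = {1, 2, 3, 4} generates the
# sigma-algebra with atoms {1}, {2}, and {3, 4} -- eight subsets in all.
sigma = generate_sigma_algebra({1, 2, 3, 4}, [{1}, {1, 2}])
print(len(sigma))  # 8
```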
A convenient feature of this procedure is that if we start with a collection of constructive subsets then we will always end up with a \sigma-algebra that is free of any non-constructive subsets, and their pathological behaviors. To ensure that we don’t filter out any well-behaved subsets in the process, we just have to make sure that our initial collection is sufficiently large.
Conveniently, when working on a topological space we already have a natural collection of subsets that we can use to generate a \sigma-algebra – the defining topology itself! The \sigma-algebra generated by repeatedly applying all three set operations to the subsets in a topology is known as a Borel \sigma-algebra. In other words, a Borel \sigma-algebra is the smallest \sigma-algebra containing all of the open and closed subsets. If X is a topological space, then I will denote the corresponding Borel \sigma-algebra by \mathcal{B}_{X}.
Every space that we will consider in this book will be a topological space. Consequently, we can always use the corresponding Borel \sigma-algebra to remove any undesired subsets that might obstruct the definition of consistent measures and probability distributions. Indeed, Borel \sigma-algebras are so common that they are often taken for granted, with any reference to a “measurable space” implicitly assuming a topological space and its corresponding Borel \sigma-algebra to filter out any inconsistent behavior.
For example, finite and countably infinite spaces are almost always equipped with discrete topologies. Because discrete topologies contain all of the atomic sets, the \sigma-algebras derived from them will always be the full power set. In these cases there are no pathological behaviors that we have to avoid at all! I will refer to any measurable space (X, 2^{X}) compatible with a discrete topology as a discrete measurable space.
On the other hand, the Borel \sigma-algebra derived from the topology that defines the real line filters out all of the non-constructive subsets and their undesired behaviors while keeping all of the interval subsets and the subsets that we can derive from them. This results in a \sigma-algebra that is strictly smaller than the full power set of the real line.
3.3 Measurability Is Not Recursive
A potentially counterintuitive feature of general \sigma-algebras is that the subsets of a measurable subset need not themselves be measurable. For example, the intervals that are measurable with respect to the Borel \sigma-algebra over the real line contain many subsets that are not themselves Borel measurable.
This behavior doesn’t really have any practical consequence, but it can frustrate many formal proofs and derivations. Consequently, when engaging in more technical calculations it can be helpful to expand a given \sigma-algebra so that the subsets of certain measurable subsets are always guaranteed to be measurable as well; see for example the discussions at the end of Section 4.3 and Section 5.2.
If we add subsets to a \sigma-algebra, then we also have to add the subsets that are generated from complements, unions, and intersections. Fortunately, there is a systematic construction for this process that ensures a valid \sigma-algebra. All of this is to say that, in more applied practice, we can always safely assume a Borel \sigma-algebra or any extension of that \sigma-algebra that might be needed to resolve any technical issues.
4 Measures and Probability Distributions
With all of that work, we are finally ready to define a theory for allocating any conserved, but not necessarily finite, quantity across a general mathematical space.
4.1 Formal Definitions
A measure on any measurable space (X, \mathcal{X}) is a function from the \sigma-algebra \mathcal{X} to the extended positive real line, \begin{alignat*}{6} \mu :\; & \mathcal{X} & &\rightarrow& \; & [0, \infty] & \\ & \mathsf{x} & &\mapsto& & \mu(\mathsf{x}) &, \end{alignat*} that is countably additive, \mu( \cup_{i} \mathsf{x}_{i} ) = \sum_{i} \mu( \mathsf{x}_{i} ) for any countably infinite collection of subsets \{ \mathsf{x}_{1}, \ldots, \mathsf{x}_{i}, \ldots \} that are mutually disjoint, \mathsf{x}_{i} \cap \mathsf{x}_{i' \ne i} = \emptyset.
On finite and countably infinite spaces, we can always take \mathcal{X} = 2^{X} and ensure countable additivity by allocating measure to individual elements and then deriving the measure allocated to subsets by summing over the individual allocations to the included elements. When working with more sophisticated ambient spaces, however, the pair (X, 2^{X}) may not admit any consistent measures. In these cases we have to consider smaller \sigma-algebras in order for well-behaved measures to exist.
A set equipped with not only a \sigma-algebra but also a measure, in other words a triple (X, \mathcal{X}, \mu), is known as a measure space. Again, I will refer to X as the ambient set or ambient space as appropriate.
If the total measure is finite, \mu(X) < \infty, then \mu is referred to as a finite measure. In this case we can always normalize the measure by \mu(X) to define a proportional allocation.
A probability distribution (Figure 3) on any measurable space (X, \mathcal{X}) is a function from the \sigma-algebra \mathcal{X} to the closed unit interval, \begin{alignat*}{6} \pi :\; & \mathcal{X} & &\rightarrow& \; & [0, 1] & \\ & \mathsf{x} & &\mapsto& & \pi(\mathsf{x}) &, \end{alignat*} with \pi(X) = 1 and \pi( \cup_{i} \mathsf{x}_{i} ) = \sum_{i} \pi( \mathsf{x}_{i} ) for any countably infinite collection of subsets \{ \mathsf{x}_{1}, \ldots, \mathsf{x}_{i}, \ldots \} that are mutually disjoint, \mathsf{x}_{i} \cap \mathsf{x}_{i' \ne i} = \emptyset.
Collectively, these properties are also known as the Kolmogorov axioms. On a historical note, however, Kolmogorov first axiomatized probability theory using slightly different properties (Kolmogorov 1950). That said, that initial construction and the more contemporary construction shown here are mathematically equivalent.
In practical applications, subsets implicitly defined with the set-builder notation are sufficiently common that a more compact notation for their allocations will be useful. When there is no risk of ambiguity, we can write the probability allocated to the subset \begin{align*} \mathsf{x} &= \{ x \in X \mid \text{ condition(x) } \} \\ &= \{ \text{ condition(x) } \} \end{align*} as \begin{align*} \pi( \mathsf{x} ) &= \pi( \{ x \in X \mid \text{ condition(x) } \} ) \\ &= \pi( \{ \text{ condition(x) } \} ) \\ &= \pi[ \text{ condition(x) } ]. \end{align*} For example, the probability allocated to the interval I[a, b] = \{ x \in \mathbb{R} \mid a \le x \le b \} can be written as \pi( I[a, b] ) = \pi[ a \le x \le b ].
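The compact notation \pi[ \text{condition(x)} ] has a direct computational analogue: a function that sums a mass function over the elements satisfying a condition. The sketch below assumes a hypothetical geometric mass function on the (truncated) non-negative integers; the names and the truncated support are illustrative, not standard.

```python
# Sketch of the set-builder probability notation pi[ condition(x) ] as
# a sum of a hypothetical geometric mass function over the elements of
# a (truncated) countable space that satisfy the condition.

def pmf(x, p=0.5):
    return (1 - p) ** x * p

def prob(condition, support=range(1000)):
    """Probability allocated to { x in X | condition(x) }."""
    return sum(pmf(x) for x in support if condition(x))

# pi[ 2 <= x <= 4 ] = pmf(2) + pmf(3) + pmf(4)
print(prob(lambda x: 2 <= x <= 4))  # 0.21875
```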
A set equipped with a \sigma-algebra and a probability distribution is known as a probability space. Sometimes the combination (X, \mathcal{X}, \pi) is also referred to as a probability triple.
Probability spaces are also sometimes denoted by x \sim \pi, where x \in X indicates the ambient set, and a \sigma-algebra is taken for granted. In words, this reads “the variable x is distributed according to \pi” or “the variable x follows the distribution \pi”. Because probability distributions are generally defined over measurable subsets and not individual elements, however, a more precise description would be “the variable x takes values in a space X that is equipped with a probability distribution \pi”. The emphasis on variables instead of spaces in this notation is related to the awkward notion of a “random variable”, which we will discuss in more detail in Chapter Eleven.
From a mathematical perspective, probability distributions are just special cases of measures. That special case, however, is uniquely important. As we’ll see in Section 6, for example, probability distributions are naturally well-suited for many applications. Moreover, the constraint of unit total measure endows probability distributions with many exceptional theoretical properties that help us to implement probabilistic calculations in practice.
Because of these distinctions, measure theory and probability theory are often compartmentalized in the mathematics literature, with separate textbooks, terminologies, and notations despite their common mathematical foundations.
4.2 Derived Properties
Although these definitions might appear to be a bit stark, we can derive all of the usual rules of measure and probability theory from them.
Consider, for instance, one measurable subset that is strictly smaller than another, \mathsf{x}_{1} \subset \mathsf{x}_{2} \in \mathcal{X}. In this case we can always write \mathsf{x}_{2} = \mathsf{x}_{1} \cup \mathsf{x}_{3} for the non-empty, measurable subset of elements that are in \mathsf{x}_{2} but not in \mathsf{x}_{1}. Applying countable additivity then gives \begin{align*} \pi(\mathsf{x}_{2}) &= \pi(\mathsf{x}_{1} \cup \mathsf{x}_{3}) \\ &= \pi(\mathsf{x}_{1}) + \pi(\mathsf{x}_{3}) \\ &\ge \pi(\mathsf{x}_{1}), \end{align*} because \pi(\mathsf{x}_{3}) \ge 0. In other words, larger measurable subsets are always allocated at least as much probability as smaller subsets.
Similarly, because any subset and its complement are disjoint and combine to reconstruct the full set, we always have \begin{align*} 1 &= \pi(X) \\ &= \pi(\mathsf{x} \cup \mathsf{x}^{c}) \\ &= \pi(\mathsf{x}) + \pi(\mathsf{x}^{c}) \end{align*} or \pi(\mathsf{x}^{c}) = 1 - \pi(\mathsf{x}).
In order to work with two measurable subsets \mathsf{x}_{1}, \mathsf{x}_{2} \in \mathcal{X} that might not be disjoint (Figure 4), we have to consider the elements that are unique to each, \mathsf{x}_{1 \setminus 2} = \{ x \in X \mid x \in \mathsf{x}_{1}, x \notin \mathsf{x}_{2} \} and \mathsf{x}_{2 \setminus 1} = \{ x \in X \mid x \in \mathsf{x}_{2}, x \notin \mathsf{x}_{1} \}, and the elements that are shared, \mathsf{x}_{1} \cap \mathsf{x}_{2} = \{ x \in X \mid x \in \mathsf{x}_{1}, x \in \mathsf{x}_{2} \}.
This decomposition then allows us to decompose \mathsf{x}_{1}, \mathsf{x}_{2}, and their union into disjoint, measurable subsets (Figure 5), \begin{align*} \mathsf{x}_{1} &= \mathsf{x}_{1 \setminus 2} \cup (\mathsf{x}_{1} \cap \mathsf{x}_{2}) \\ \mathsf{x}_{2} &= \mathsf{x}_{2 \setminus 1} \cup (\mathsf{x}_{1} \cap \mathsf{x}_{2}) \\ \mathsf{x}_{1} \cup \mathsf{x}_{2} &= \mathsf{x}_{1 \setminus 2} \cup ( \mathsf{x}_{1} \cap \mathsf{x}_{2} ) \cup \mathsf{x}_{2 \setminus 1}. \end{align*}
Applying countable additivity to all three of these decompositions defines a system of equations, \begin{align*} \pi(\mathsf{x}_{1}) &= \pi(\mathsf{x}_{1 \setminus 2}) + \pi(\mathsf{x}_{1} \cap \mathsf{x}_{2}) \\ \pi(\mathsf{x}_{2}) &= \pi(\mathsf{x}_{2 \setminus 1}) + \pi(\mathsf{x}_{1} \cap \mathsf{x}_{2}) \\ \pi(\mathsf{x}_{1} \cup \mathsf{x}_{2}) &= \pi(\mathsf{x}_{1 \setminus 2}) + \pi(\mathsf{x}_{1} \cap \mathsf{x}_{2}) + \pi( \mathsf{x}_{2 \setminus 1} ). \end{align*} Adding the first two equations together gives \pi(\mathsf{x}_{1}) + \pi(\mathsf{x}_{2}) = \pi(\mathsf{x}_{1 \setminus 2}) + \pi(\mathsf{x}_{2 \setminus 1}) + 2 \, \pi(\mathsf{x}_{1} \cap \mathsf{x}_{2}) or \pi(\mathsf{x}_{1 \setminus 2}) + \pi(\mathsf{x}_{2 \setminus 1}) + \pi(\mathsf{x}_{1} \cap \mathsf{x}_{2}) = \pi(\mathsf{x}_{1}) + \pi(\mathsf{x}_{2}) - \pi(\mathsf{x}_{1} \cap \mathsf{x}_{2}). Substituting this into the third equation finally gives \pi(\mathsf{x}_{1} \cup \mathsf{x}_{2}) = \pi(\mathsf{x}_{1}) + \pi(\mathsf{x}_{2}) - \pi(\mathsf{x}_{1} \cap \mathsf{x}_{2}).
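These derived rules are easy to verify on a small finite probability space, where every subset allocation is a sum of element-wise allocations. The sketch below uses a hypothetical fair-die allocation with exact rational arithmetic; the names are illustrative.

```python
from fractions import Fraction

# Verify the derived rules on a small finite probability space with a
# hypothetical uniform (fair-die) allocation.

X = {1, 2, 3, 4, 5, 6}
mass = {x: Fraction(1, 6) for x in X}

def prob(subset):
    return sum(mass[x] for x in subset)

x1 = {1, 2, 3}
x2 = {3, 4}

# Complement rule: pi(x^c) = 1 - pi(x).
assert prob(X - x1) == 1 - prob(x1)

# Monotonicity: a larger subset receives at least as much probability.
assert prob(x1 | x2) >= prob(x1)

# Inclusion-exclusion: pi(x1 u x2) = pi(x1) + pi(x2) - pi(x1 n x2).
assert prob(x1 | x2) == prob(x1) + prob(x2) - prob(x1 & x2)

print(prob(x1 | x2))  # 2/3
```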
4.3 Null Subsets
The measure allocated to a measurable subset quantifies the weight of that subset relative to any other measurable subsets. Those measurable subsets that are allocated zero measure are the least important subsets in terms of the overall allocation. At the same time, these negligible subsets can be useful for characterizing certain properties of a given measure.
Any measurable subset \mathsf{x} \in \mathcal{X} that is allocated zero measure \mu(\mathsf{x}) = 0 is referred to as a null subset of the measure space (X, \mathcal{X}, \mu) or, more compactly, a \mu-null subset. Similarly, if \pi(\mathsf{x}) = 0 then the measurable subset \mathsf{x} is denoted a null subset of the probability space (X, \mathcal{X}, \pi), or simply a \pi-null subset.
I will denote the collection of null subsets by \mathcal{X}_{\mu = 0} = \{ \mathsf{x} \in \mathcal{X} \mid \mu(\mathsf{x}) = 0 \} \subset \mathcal{X}.
Most properties of measures depend on the detailed allocation of the total measure across all measurable subsets. Some useful properties, however, are completely characterized by which measurable subsets receive a non-zero allocation and which measurable subsets receive a zero allocation. In other words, these null properties can be completely derived from \mathcal{X}_{\mu = 0}.
Any two measures that share the same null subsets, \mathcal{X}_{\mu_{1} = 0} = \mathcal{X}_{\mu_{2} = 0} will share any properties that are derived from those null subsets. Consequently, the overlap of null subsets, or the lack thereof, is often a useful way to determine how compatible two measures are with each other. We’ll formalize this compatibility when we construct density functions in Chapter Six.
Many formal calculations become particularly straightforward when every subset of a null subset is also a null subset. This allows us to, for example, decompose null subsets into smaller subsets without having to worry about measurability concerns. Any measure with recursively-consistent null subsets is known as a complete measure.
Unfortunately, many measures that we encounter in practice, specifically many measures defined with respect to Borel \sigma-algebras, are not complete because not enough subsets are measurable. In these cases, extending the initial \sigma-algebra to include every subset of the initial null subsets can be convenient for more technical work. That said, these considerations have little to no practical consequence.
4.4 Measures and Probability Distributions In Practice
The formal definition of measures and probability distributions tells us what form the consistent allocation of any quantity on any measurable space has to take, but it does not necessarily provide a way to construct explicit allocations in practice. Specifically, in almost all circumstances it is infeasible, if not outright impossible, to exhaustively specify the measure or probability allocated to every subset in the ambient \sigma-algebra. Constructing and then storing infinitely large databases linking each measurable subset, or even just the non-null subsets, to their allocations is not particularly practical!
In some cases, we can define useful measures and probability distributions by specifying the allocation to only some of the measurable subsets, and then deriving the allocations to the rest with countable additivity. For instance, in finite and countable spaces we need to specify only the allocations to all atomic subsets. Similarly, measures over (\mathbb{R}, \mathcal{B}_{\mathbb{R}}) can be completely specified by allocations to all interval subsets.
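As a sketch of that last point, the length (Lebesgue) allocation on (\mathbb{R}, \mathcal{B}_{\mathbb{R}}) can be specified by its interval allocations alone; the allocation to a finite disjoint union of intervals then follows from additivity. The function names below are illustrative assumptions.

```python
# Sketch: deriving the measure of a disjoint union of intervals from
# interval allocations alone, using the length (Lebesgue) allocation.

def length(a, b):
    """Measure allocated to the interval [a, b]."""
    return max(b - a, 0.0)

def union_length(intervals):
    """Measure of a finite disjoint union of intervals, by additivity."""
    return sum(length(a, b) for a, b in intervals)

# [0, 1] u [2, 2.5] u [4, 7]
print(union_length([(0, 1), (2, 2.5), (4, 7)]))  # 4.5
```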
In most of these cases, however, those reduced allocations are still impractical to specify one-by-one. Because of that, our introduction to general measure and probability theory will have to remain a bit abstract, without many explicit examples, for the time being.
In applied problems measures and probability distributions are almost always defined algorithmically, with rules to evaluate the measure or probability allocated to a subset on the fly instead of storing and retrieving the allocation from an exhaustive specification. We will introduce two of these algorithmic representations, and use them to define many useful allocations, in Chapter Six and Chapter Fourteen.
5 Uniform Measures
Any given measurable space (X, \mathcal{X}) can be equipped with infinitely many measures and probability distributions. Some of these objects, however, are more useful in applied practice than others.
This section will introduce two measures that encode distinct notions of uniformity, each of which is applicable to different types of ambient spaces. In the following chapters, we will see how the properties of these uniform measures make them particularly useful in practical applications of probability theory.
5.1 The Counting Measure On Discrete Measurable Spaces
Intuitively, a uniform measure should allocate the same measure to as many measurable subsets as possible. The consistency of measures, however, limits just how many subsets can receive the same allocation. For example, if two disjoint subsets, \mathsf{x}_{1} \in \mathcal{X} and \mathsf{x}_{2} \in \mathcal{X}, are allocated the same measure, \mu(\mathsf{x}_{1}) = \mu(\mathsf{x}_{2}) = \mu_{0}, then their union will be allocated the measure \mu(\mathsf{x}_{1} \cup \mathsf{x}_{2}) = \mu(\mathsf{x}_{1}) + \mu(\mathsf{x}_{2}) = 2 \, \mu_{0}. All three allocations can be equal only if \mu_{0} = 0 or \mu_{0} = \infty.
In order to define a notion of uniformity that doesn’t abuse infinity, we need to restrict our consideration from the entire \sigma-algebra to collections of measurable subsets that we want to behave in the same way. If this collection is large enough, then we can completely define a uniform measure by enforcing the same allocation to those distinguished subsets.
On any countable measurable space (X, 2^{X}), for instance, a natural collection of subsets to consider are the atomic subsets. Because in this case the measure allocated to these subsets completely specifies the allocations to every other subset, we can fully define a uniform measure by enforcing the same allocation to each element. This results in a constant mass function (Figure 6).
In particular, the counting measure on the discrete measurable space (X, 2^{X}) is defined by a unit allocation to each atomic set, \chi(\{ x \}) = 1. Equivalently, we can define the counting measure with a uniform mass function that assigns a unit allocation to each element, \chi(x) = 1.
Given these element-wise allocations, we can derive the measure allocated to any other subset \mathsf{x} \subset X by countable additivity, \begin{align*} \chi( \mathsf{x} ) &= \sum_{x \in \mathsf{x}} \chi(\{ x \}) \\ &= \sum_{x \in \mathsf{x}} 1. \end{align*} By construction, this always results in the total number of elements in \mathsf{x}. The total counting measure, \chi( X ) = \sum_{x \in X} 1, just counts the total number of elements in the ambient set. In other words, a counting measure formalizes our intuitive notion of counting discrete objects.
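This derivation can be sketched in a few lines of code. The following is a minimal illustration, assuming that the ambient set and its subsets are represented as Python sets; the function name is illustrative, not from any particular library.

```python
# A sketch of the counting measure: allocate unit measure to each
# element and derive subset allocations by additivity.
def counting_measure(subset):
    """Sum a unit allocation over the elements of subset."""
    return sum(1 for _ in subset)

X = {0, 1, 2, 3, 4, 5}   # a finite ambient set
x = {1, 3, 5}            # a subset of interest

print(counting_measure(x))  # the number of elements in x, here 3
print(counting_measure(X))  # the total measure chi(X), here 6
```

By construction, any two subsets with the same number of elements receive the same allocation, matching the uniformity discussed above.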
An immediate consequence of these derived allocations is that any subset with the same number of elements will receive the same allocation. Uniformity over the individual elements induces uniformity over other subsets as well.
When there are only a finite number of elements in X, the total measure will be finite. In this case we can normalize the counting measure into a uniform probability distribution, \pi ( \mathsf{x} ) = \frac{ \sum_{x \in \mathsf{x}} 1 }{ \sum_{x \in X} 1 }, which quantifies the proportion of elements in X that are contained in \mathsf{x}. If there are an infinite number of elements in X then this normalization is no longer possible; for example, there is no well-defined notion of a uniform probability distribution over the integers \mathbb{Z}.
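On a finite ambient set, the normalization of the counting measure into a uniform probability distribution can be sketched as follows; again the names are illustrative only.

```python
# A sketch of the uniform probability distribution on a finite set:
# the counting measure of the subset divided by the total count.
def uniform_probability(subset, ambient):
    """Proportion of elements of the ambient set contained in subset."""
    return sum(1 for _ in subset) / sum(1 for _ in ambient)

X = set(range(10))   # ten elements
x = {0, 2, 4, 6, 8}  # the even elements

print(uniform_probability(x, X))  # 0.5
print(uniform_probability(X, X))  # 1.0, the total probability
```

For an infinite ambient set the denominator would diverge, which is exactly why no uniform probability distribution over the integers exists.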
Counting measures are not the only uniform measures that we could define over a countable ambient set. More generally, we can define a uniform measure by allocating any positive real number c \in \mathbb{R}^{+} to each element, \kappa(\{ x \}) = c. That said, these other uniform measures are somewhat redundant in the sense that their allocations can always be recovered by scaling the corresponding counting measure allocations, \kappa ( \mathsf{x} ) = c \cdot \chi ( \mathsf{x} ).
One important feature of a counting measure, indeed any uniform measure, is that every subset except the empty set receives a non-zero allocation. That is to say, the empty set is the only \chi-null subset.
5.2 The Lebesgue Measure On Real Lines
Unfortunately, on uncountable spaces element-wise allocations do not completely define measures. In order to define any consistent notion of uniform measure, we have to specify a larger class of measurable subsets that should receive equal allocations.
An uncountable set alone, however, offers no criteria for preferring any collection of subsets to any other, and hence no criteria for defining a consistent notion of uniform measure. Additional structure on X may be able to break this ambiguity.
Consider, for instance, a real line \mathbb{R} equipped with an appropriate ordering, algebra, metric, and topology as discussed in Chapter Two. Recall that we can interpret this as a particular rigid real line or a particular parameterization of a flexible real line.
Using the ordering, we can construct closed interval subsets, [ x_{1}, x_{2} ] = \{ x \in \mathbb{R} \mid x_{1} \le x \le x_{2} \}. We can then use the metric to characterize these intervals by the distance between the end points, L( \, [ x_{1}, x_{2} ] \, ) = d( x_{1}, x_{2} ) = | x_{2} - x_{1} |, otherwise known as the interval length. Moreover, if we use a Borel \sigma-algebra derived from the real topology then these closed intervals will all be measurable subsets.
Any notion of uniformity that is compatible with all of this structure should treat all interval subsets with the same length in the same way. In other words, a uniform measure over \mathbb{R} should allocate the same measure to all equal-length intervals (Figure 7). For example, because the intervals [-2, -1], [5, 6], and [150, 151] all have the same length, L( \, [-2, -1] \, ) = L( \, [5, 6] \, ) = L( \, [150, 151] \, ), any uniform measure should give \mu( \, [-2, -1] \, ) = \mu( \, [5, 6] \, ) = \mu( \, [150, 151] \, ). Likewise, because L( \, [-350, -300] \, ) = L( \, [0, 50] \, ), we should have \mu( \, [-350, -300] \, ) = \mu( \, [0, 50] \, ), and so on.
The easiest way to accomplish this uniformity is to allocate to each interval a measure directly equal to its length, \lambda( \, [x_{1}, x_{2}] \, ) = L( \, [x_{1}, x_{2}] \, ) = | x_{2} - x_{1} |. Allocations to more general measurable subsets can then be derived from these interval allocations and countable additivity. The resulting uniform measure is known as the Lebesgue measure.
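The interval allocations, and their extension to disjoint unions of intervals by additivity, can be sketched directly; the function names here are illustrative.

```python
# A sketch of the Lebesgue allocations: each closed interval receives
# a measure equal to its length, and disjoint unions of intervals
# receive the sum of the component lengths by countable additivity.
def lebesgue_interval(x1, x2):
    """Length of the closed interval [x1, x2]."""
    return abs(x2 - x1)

def lebesgue_disjoint_union(intervals):
    """Additivity over a disjoint collection of (x1, x2) intervals."""
    return sum(lebesgue_interval(a, b) for a, b in intervals)

print(lebesgue_interval(-2, -1))                  # 1, same as [5, 6]
print(lebesgue_disjoint_union([(0, 1), (2, 4)]))  # 1 + 2 = 3
```

Extending these allocations consistently to every Borel subset requires considerably more care than this sketch suggests, but the interval lengths are the seed from which everything else grows.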
Just as the counting measure formalizes intuitive notions of counting on countable spaces, the Lebesgue measure formalizes intuitive notions of length on a real line. This formalization of length, however, is a bit more subtle. Counting behaves the same on all countable spaces, but length can behave differently across different real lines!
Two real lines with incompatible metrics – different rigid real lines or different parameterizations of a flexible real line – will assign different lengths to the same intervals, resulting in different Lebesgue measures. When there might be any chance of confusion we have to be careful to communicate which real line we’re using in any given application.
Because the total Lebesgue measure is infinitely large, \begin{align*} \lambda(\mathbb{R}) &= \lim_{x \rightarrow \infty} \lambda( \, [-x, x] \, ) \\ &= \lim_{x \rightarrow \infty} 2 \, x \\ &= \infty, \end{align*} it cannot be normalized into a probability distribution. As with the integers, there is no well-defined notion of a uniform probability distribution over a real line.
Every other uniform measure over a real line is defined by allocating a measure to each interval proportional to its length, \nu( \, [a, b] \, ) \propto L( \, [a, b] \, ). Consequently every uniform measure over a real line reduces to a constant scaling of the Lebesgue measure, \nu ( \mathsf{x} ) \propto \lambda ( \mathsf{x} ), similar to how every uniform measure over countable spaces reduces to a scaling of the counting measure.
By definition, the Lebesgue measure on any real line will allocate zero measure to individual points, \begin{align*} \lambda(\{ x \}) &= \lambda([x, x]) \\ &= d(x, x) \\ &= 0. \end{align*} Indeed any measurable subset with only a countable number of elements will also be \lambda-null, \begin{align*} \lambda(\mathsf{x}) &= \lambda(\cup_{i} \{ x_{i} \} ) \\ &= \sum_{i} \lambda( \{ x_{i} \} ) \\ &= \sum_{i} 0 \\ &= 0. \end{align*}
These null properties follow from d(x, x) = 0, which is true for any well-behaved metric. Two real lines with incompatible metrics might feature different Lebesgue measures, but those Lebesgue measures will always share the same null subsets and hence share any properties derived from those null subsets. Moving between rigid real lines or parameterizations of a flexible real line will change the details of the Lebesgue measure, but not these shared properties.
Let’s conclude on a more technical note. Allocations based on interval lengths can be used to derive consistent allocations over any subset constructed from open and closed subsets, and hence every measurable subset in the Borel \sigma-algebra \mathcal{B}_{\mathbb{R}}. They can also be used, however, to derive null allocations to many subsets that are not in \mathcal{B}_{\mathbb{R}}.
Extending \mathcal{B}_{\mathbb{R}} to include these additional null subsets results in a larger \sigma-algebra known as the Lebesgue \sigma-algebra. Conveniently, if we define the Lebesgue measure with respect to this larger \sigma-algebra then it becomes a complete measure which, as we have previously discussed, facilitates many technical results and is consequently preferred in more formal mathematical references.
Many references reserve the term “Lebesgue measure” for the complete uniform measure defined over the Lebesgue \sigma-algebra and introduce the term “Borel measure” for the incomplete uniform measure defined over the smaller Borel \sigma-algebra \mathcal{B}_{\mathbb{R}}. That said, because the differences between these definitions are limited to null subsets they can be effectively ignored in practical applications.
5.3 The Lebesgue Measure On Multivariate Real Spaces
The construction of the Lebesgue measure on a real line immediately generalizes to multivariate real spaces built up from multiple real lines at the same time. Within each component real line, we can allocate a uniform Lebesgue measure to each interval subset, \lambda_{i}( \, [x_{1, i}, x_{2, i}] \, ) = | x_{2, i} - x_{1, i} |. To ensure uniformity over the composite product space we need to allocate to each rectangular subset a measure equal to the product of these component allocations, \begin{align*} \lambda( \, \times_{i = 1}^{I} [ x_{1, i}, x_{2, i} ] ) &= \prod_{i = 1}^{I} \lambda_{i}( \, [x_{1, i}, x_{2, i}] \, ) \\ &= \prod_{i = 1}^{I} | x_{2, i} - x_{1, i} |. \end{align*} As in the one-dimensional case, allocations to more general measurable subsets are derived from these rectangular allocations.
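The product allocation to rectangular subsets can be sketched as follows, assuming each rectangle is specified by one (x_{1, i}, x_{2, i}) pair per component line; the name is illustrative.

```python
# A sketch of the multivariate Lebesgue allocation: the measure of a
# rectangular subset is the product of the component interval lengths.
def lebesgue_rectangle(sides):
    """sides is a list of (x1_i, x2_i) pairs, one per component line."""
    measure = 1.0
    for x1, x2 in sides:
        measure *= abs(x2 - x1)
    return measure

print(lebesgue_rectangle([(0, 2), (0, 3)]))          # 6.0, an area
print(lebesgue_rectangle([(0, 1), (0, 1), (0, 1)]))  # 1.0, a volume
```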
Intuitively, the Lebesgue measure on \mathbb{R} quantifies length. The Lebesgue measure over \mathbb{R}^{2} then quantifies area. More generally, the Lebesgue measure over \mathbb{R}^{I} quantifies volume and its higher-order generalizations.
5.4 Uniformity, Ignorance, and Information
The concepts of ignorance and information are related to uniformity; formalizing the relationships between these concepts, however, is subtle. In order to avoid confusing these concepts, we have to take care to recognize not only their similarities but also their differences.
When two elements of a countable space are allocated the same measure, the overall allocation will be the same even if we permute those two elements before allocating measures. In other words, any measure that allocates the same measure to two elements is not able to distinguish between any permutations of those elements.
The more regular the individual allocations are, the less sensitive the resulting measure will be to any rearrangement of the elements. Conversely, the more the allocations vary from element to element the more the resulting measure will be able to discern one permutation from another. Informally, we might say that the more uniform the measure is, the less information it encodes. Because a uniform measure on a countable space allocates the same measure to every element, it is ignorant to any bijective transformation of the elements, capturing the least information possible on a countable space.
On uncountable spaces these concepts become more delicate. The allocations defined by the Lebesgue measure, for example, are not invariant to arbitrary transformations of the real line. Any transformation that warps the metric will also warp lengths, and hence the measures allocated by the Lebesgue measure. Instead, the Lebesgue measure is ignorant to only those transformations that preserve distances.
In order to formalize heuristic concepts like “ignorance” and “information”, we have to embrace a bit more abstraction. Recall that in Chapter Two we discussed the notion of a structure-preserving transformation. More generally, if \phi : X \rightarrow X is a structure-preserving automorphism then we say that the structure is symmetric to \phi, while the transformation \phi is a symmetry of the structure.
In other words, if \phi is a symmetry of a structure \mathfrak{x} then the behavior of \mathfrak{x} is the same before and after we apply the transformation. The structure cannot detect whether or not we apply the transformation. For example, on a real line the metric is symmetric to translations, \begin{align*} d( t_{x_{3}}(x_{1}), t_{x_{3}}(x_{2}) ) &= | t_{x_{3}}(x_{2}) - t_{x_{3}}(x_{1}) | \\ &= | x_{2} + x_{3} - (x_{1} + x_{3}) | \\ &= | x_{2} - x_{1} | \\ &= d(x_{1}, x_{2}). \end{align*}
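The translation symmetry derived above can also be checked numerically; this is only a spot check at particular points, not a proof, and the values chosen are arbitrary.

```python
# A numerical check of translation symmetry for the real-line metric
# d(x1, x2) = |x2 - x1| used in the text.
def d(x1, x2):
    return abs(x2 - x1)

def translate(x, x3):
    """The translation t_{x3}(x) = x + x3."""
    return x + x3

x1, x2, x3 = -1.5, 4.0, 10.0
print(d(x1, x2))                                # 5.5
print(d(translate(x1, x3), translate(x2, x3)))  # 5.5, unchanged
```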
Some structures admit multiple symmetries at the same time. The discrete topology on a countable set, for instance, is invariant to any permutation of the elements, while the metric on a real line is invariant to all translations of the elements. The more symmetric a structure is, the less it can distinguish between arbitrary transformations to the ambient set. If we formalize information as the ability of a structure to distinguish between transformations of the ambient set, then the more symmetric a structure is the less information it encodes.
From an abstract perspective, measures and probability distributions are, like orderings, algebras, metrics, topologies, and \sigma-algebras, just structures that we can endow onto a set. The more invariant a measure is to transformations of that set, the less information it will contain. We will more formally consider how measures transform, and hence how to precisely define symmetries of a measure, in Chapter Seven.
Uniform measures are built to be symmetric to at least some transformations, and hence encode less information than most other measures. For example, the counting measure is invariant to any permutation of a countable ambient set, while the Lebesgue measure is invariant to any translation of a real line. We can also extend this construction to more elaborate spaces, for instance defining uniform measures on spheres that are invariant to any rotation.
Not every uniform measure, however, is invariant to every possible transformation of the ambient set; some uniform measures are more informative than others! Consequently, notions of uniformity do not define any universal notions of ignorance, just ignorance to the particular transformations that are used to define uniformity in a given context.
In practice, this means that we have to be careful not to make broad claims about uniform measures being least informative or most ignorant. Instead we should specify with respect to which transformations a measure might be least informative or most ignorant.
6 Interpretations of Measure And Probability
To this point, our treatment of measure and probability theory has been purely mathematical. A measure defines the allocation of some abstract conserved quantity across some abstract measurable space; a probability distribution defines a proportional allocation. This mathematical construction cannot be endowed with any particular interpretation until we use it to model something.
In this section we’ll review some of the most common applications of measure and probability theory and the particular interpretations those applications create.
6.1 Modeling Physical Distributions
One immediate application of measure theory is to model the behavior of a physical quantity, such as mass or electric charge. For example, physical mass can be distributed across a solid object in a variety of different ways, with the exact distribution affecting how that object interacts with the surrounding environment. Similarly, the distribution of charge across the surface of a conducting object defines its electrostatic properties.
In some physical systems, the distribution can also change with time and influence the dynamics of the system. Time-dependent measures that quantify how the distribution of a physical quantity evolves are a common feature of many physical theories.
6.2 Modeling Populations
A similar application is modeling the selection of individuals, or the properties of individuals, from a larger population. Each time we sample a subset of individuals from the population we will observe a different ensemble of behaviors. The heterogeneity of these characteristics across the population can often be quantified with measures, and their relative occurrences modeled with probability distributions.
For instance, if 30\% of the individuals in a population of people have a height between 0 feet and 5 feet, then a probability distribution modeling the variation in heights would give \pi([0, 5]) = 0.3.
6.3 Modeling Frequencies
An application particular to probability theory concerns the frequencies of repeated events.
Consider an abstract event whose outcomes take unpredictable values in some space X. Perfectly replicating the circumstances of this event N times defines a sequence of values in X, \{ x_{1}, \ldots, x_{n}, \ldots, x_{N} \}.
While we cannot predict what values the individual events in this sequence will take, we may be able to characterize how often certain outcomes appear relative to others. In particular, we can define the frequency of a subset \mathsf{x} \subset X by the proportion of events that take values in \mathsf{x}, f_{N}(\mathsf{x}) = \frac{ \sum_{n = 1}^{N} \mathbb{I}_{\mathsf{x}}(x_{n}) }{N}, where \mathbb{I}_{\mathsf{x}}(x) = \left\{ \begin{array}{rr} 1, & x \in \mathsf{x} \\ 0, & x \notin \mathsf{x} \end{array} \right. .
Replicating the event a countably infinite number of times defines the asymptotic or long-run frequency of a subset, \begin{align*} f(\mathsf{x}) &= \lim_{N \rightarrow \infty} f_{N}(\mathsf{x}) \\ &= \lim_{N \rightarrow \infty} \frac{ \sum_{n = 1}^{N} \mathbb{I}_{\mathsf{x}}(x_{n}) }{N}. \end{align*} In other words, the more frequent subsets contain more common event outcomes.
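A finite-N version of these frequencies can be sketched in code. The simulation below is a hypothetical illustration, assuming independent replications of an event with outcomes in X = \{0, 1, 2\}; the weights are arbitrary choices, not from the text.

```python
import random

# A sketch of empirical frequencies f_N for a repeated event with
# outcomes in {0, 1, 2}; the weights below are hypothetical.
random.seed(8675309)

def empirical_frequency(outcomes, subset):
    """f_N: fraction of the observed outcomes that fall into subset."""
    return sum(1 for x in outcomes if x in subset) / len(outcomes)

outcomes = random.choices([0, 1, 2], weights=[0.5, 0.3, 0.2], k=100000)
print(empirical_frequency(outcomes, {0, 1}))  # approaches 0.8 as N grows
```

As N grows, these empirical frequencies stabilize around the asymptotic frequencies that probability theory can then model.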
If the frequencies are the same for any sequence of events, then we can model them with probability theory. Specifically, we can interpret the allocated probabilities as the proportion of the total event outcomes that fall into each subset of outcome values.
In this case the particular ordering of the event sequences doesn’t matter; by ignoring the order we can interpret the sequences as defining a population of possible events. From this perspective, the application of probability theory is equivalent to the application in the previous section. At the same time, if we interpret repeated samples from a population as events, then the population probabilities can be interpreted as frequencies.
6.4 Modeling Uncertainties
Probability theory can also be used to consistently quantify uncertain information.
Consider a space of possible statements X. Under perfect knowledge, we would be able to specify a particular statement x \in X as true, with all other statements in X being false. In other words, certainty is quantified with binary true/false assignments. When our knowledge is not quite so certain, however, we have to soften those claims.
To quantify uncertain information, we have to generalize from binary true/false assignments to continuous values that interpolate between absolute truth and falsity. The larger the value we assign to a subset of statements, the more our uncertain information supports one of those statements being true. Conversely, the smaller the value we assign to a subset, the more our uncertain information supports all of the included statements being false.
Applying probability theory allows us to enforce consistent uncertainty assignments across all of the possible statements. The individual probability allocations can then be interpreted as quantifying how strongly our information supports that one of the statements within a measurable set is true. In this setting the allocated probabilities are sometimes referred to as “plausibilities”, “credibilities”, or “beliefs”.
For example, the property that \pi(X) = 1 corresponds to the fact that at least one of the statements in X has to be true. A probability distribution that concentrates around the statement x encodes confidence that one of the statements near x is true. The singular limit where all of the available probability collapses onto a single statement, \pi( \{ x \} ) = 1, communicates certainty that x is true.
This kind of probabilistic uncertainty quantification can be interpreted in many ways. For instance, we can use it to model the personal, subjective beliefs that an individual holds about the behavior of a system. In particular, we can use it to model our own specific beliefs. At the same time, we can use it to model the collective understanding of entire communities. We can also avoid attempting to quantify the entirety of that knowledge at once, and instead use probability theory to model only certain aspects of individual or community knowledge.
More formally, this application of probability theory is one way to generalize classical propositional logic to a many-valued logic. Using probability theory to generalize other logical systems can sometimes also be possible, although the technical details quickly become more complicated.
6.5 Everyone Play Nicely
A key point of confusion in probability theory is the confounding of its abstract mathematical structure with the interpretations that arise in particular applications. This confusion is made all the worse by the long history of attempts to derive probability theory from these particular applications.
For example, many have tried to derive probabilities as asymptotic frequencies of physical events. The key motivation of this approach is that the resulting probabilities would be “objective” in the sense that everyone who could implement those infinite trials would attain the same probabilities. Even if we ignore the impracticality of perfectly repeating an event an infinite number of times within a finite lifetime – see Jay and Purce (2003) for visual demonstrations of just how ephemeral dice can be even when they are not rolled – it turns out that there are also some subtle mathematical complications with this approach. For a comprehensive discussion see Diaconis and Skyrms (2017).
Similarly, many have tried to derive probability theory from uncertainty quantification. The Cox postulates (Van Horn, 2003), for instance, define basic intuitions about uncertainty quantification. On simpler spaces, these rules are equivalent to probability theory, but that equivalence doesn’t persist to more general spaces. Consequently, this approach is not able to recover the full generality of probability theory.
A common reaction to these technical difficulties is to resort to a sort of philosophical bait and switch. When one cannot derive probability theory from a particular application, one might instead define probability theory abstractly, as we have done in this chapter, but then impose an arbitrary restriction that it can only ever be applied to that one application. Those trying to derive probability theory from frequencies, for example, might argue that probability theory can only ever be applied to model frequencies, in which case all probabilities must be frequencies. Others trying to derive probability theory from the Cox axioms might argue that any application of probability theory always models uncertain information.
These interpretational restrictions then force some awkward philosophical contortions when trying to apply probability theory in practice. For instance, after imposing that all probabilities are frequencies, the only way to model uncertainty in the value of some quantity is to treat it as the outcome of some hypothetical, and maybe even unachievable, event. The introduction of these hypothetical events alongside real events makes the entire system more difficult to understand.
In this book we will avoid these restrictions, respect the full generality of probability theory, and take advantage of any consistent applications that might be useful in any given problem. Indeed we will often take advantage of multiple applications of probability theory at the same time.
Consider, for example, a binary space X = \{0, 1\} that corresponds to whether a two-sided coin lands with its head side or its tail side up. In particular, let 0 denote a flip that lands tails up, and 1 denote a flip that lands heads up. Any probability distribution over X can be quantified with the probability p \in [0, 1] allocated to the point 1, which gives the consistent probability allocations \begin{align*} \pi( \emptyset; p ) &= 0 \\ \pi( \{0\}; p ) &= 1 - p \\ \pi( \{1\}; p ) &= p \\ \pi( X; p ) &= 1. \end{align*}
There are many ways to flip a coin, but let’s say that we flip our coin in a way that results in an unpredictable sequence of heads and tails. The asymptotic frequencies of these outcomes can then be modeled with an application of probability theory. In other words, we can use the probability distribution defined by p to model the physical outcomes of the flips.
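This frequency application can be sketched with a simulation. The pseudo-random flips below are a hypothetical stand-in for a physical coin, with the value of p chosen arbitrarily for illustration.

```python
import random

# A sketch, assuming the head frequency of the flips is governed by
# the probability parameter p; 1 denotes heads, 0 denotes tails.
random.seed(42)

def flip_coin(p, n):
    """Simulate n flips with head probability p."""
    return [1 if random.random() < p else 0 for _ in range(n)]

p = 0.6
flips = flip_coin(p, 100000)
freq_heads = sum(flips) / len(flips)
print(freq_heads)  # close to p = 0.6 for large n
```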
At the same time, we use probability theory to model any uncertainty in which of the possible frequency models best matches the true behavior of the coin. In particular, we can construct a probability distribution over the unit interval to quantify how compatible each probability allocation p \in [0, 1] is with our knowledge of the coin.
If we have a collection of I coins, then we might also be interested in the variation of the probability parameters \{ p_{1}, \ldots, p_{i}, \ldots, p_{I} \} for each coin. In this case we can apply probability theory once again, this time to model the population of coin behaviors.
To be clear, the interpretations inherent to particular applications of probability theory are important for ensuring that we implement those applications correctly in practice. Elevating one interpretation to the exclusion of others, however, limits the full potential of probability theory. To take full advantage of the practical utility of probability theory we have to respect all of the consistent applications!
7 Conclusion
Conceptually, measure and probability theory are straightforward. Measure theory quantifies how we can consistently allocate a conserved quantity across a general mathematical space and probability theory considers the special case of proportional allocations. In order to formalize that conceptual simplicity, however, we need to resort to some careful mathematics. In particular, we need to incorporate \sigma-algebras to surgically remove any pathological behavior that can arise, even on seemingly well-behaved spaces such as the real line, and obstruct consistent allocations.
Once we’ve safely constructed these theories in full generality, we can use the abstract mathematics to model particular systems. Within these applications, the math inherits particular interpretations. We have to be careful to not take these circumstantial interpretations too seriously, lest we abandon the full utility of the abstract mathematics.
The technical exploration of measures and probability distributions goes far beyond the introduction in this chapter. Unfortunately, many textbooks that cover this material can be difficult to parse without extensive mathematical experience. My personal favorite is Folland (1999) which, while technically rigorous, provides more exposition and motivation than I have found in other treatments.
8 Acknowledgements
I thank Jeff Helzner, Simon Duane, Adriano Yoshino, jd, and Léo Burgund for helpful discussion.
A very special thanks to everyone supporting me on Patreon: Adam Fleischhacker, Adriano Yoshino, Alan Chang, Alessandro Varacca, Alexander Bartik, Alexander Noll, Alexander Petrov, Alexander Rosteck, Anders Valind, Andrea Serafino, Andrew Mascioli, Andrew Rouillard, Andrew Vigotsky, Angie_Hyunji Moon, Ara Winter, Austin Rochford, Austin Rochford, Avraham Adler, Ben Matthews, Ben Swallow, Benjamin Glemain, Bradley Kolb, Brandon Liu, Brynjolfur Gauti Jónsson, Cameron Smith, Canaan Breiss, Cat Shark, Charles Naylor, Chase Dwelle, Chris Jones, Chris Zawora, Christopher Mehrvarzi, Colin Carroll, Colin McAuliffe, Damien Mannion, Damon Bayer, dan mackinlay, Dan Muck, Dan W Joyce, Dan Waxman, Dan Weitzenfeld, Daniel Edward Marthaler, Darshan Pandit, Darthmaluus , David Burdelski, David Galley, David Wurtz, Doug Rivers, Dr. Jobo, Dr. Omri Har Shemesh, Ed Cashin, Edgar Merkle, Eric LaMotte, Erik Banek, Ero Carrera, Eugene O’Friel, Felipe González, Fergus Chadwick, Finn Lindgren, Florian Wellmann, Francesco Corona, Geoff Rollins, Granville Matheson, Greg Sutcliffe, Guido Biele, Hamed Bastan-Hagh, Haonan Zhu, Hector Munoz, Henri Wallen, hs, Hugo Botha, Håkan Johansson, Ian Costley, Ian Koller, idontgetoutmuch, Ignacio Vera, Ilaria Prosdocimi, Isaac Vock, J, J Michael Burgess, Jair Andrade, James Hodgson, James McInerney, James Wade, Janek Berger, Jason Martin, Jason Pekos, Jason Wong, Jeff Burnett, Jeff Dotson, Jeff Helzner, Jeffrey Erlich, Jesse Wolfhagen, Jessica Graves, Joe Wagner, John Flournoy, Jonathan H. 
Morgan, Jonathon Vallejo, Joran Jongerling, Joseph Despres, Josh Weinstock, Joshua Duncan, Joshua Griffith, JU, Justin Bois, Karim Naguib, Karim Osman, Kejia Shi, Kevin Foley, Kristian Gårdhus Wichmann, Kádár András, Lars Barquist, lizzie , LOU ODETTE, Marc Dotson, Marcel Lüthi, Marek Kwiatkowski, Mark Donoghoe, Markus P., Martin Modrák, Matt Moores, Matthew, Matthew Kay, Matthieu LEROY, Maurits van der Meer, Merlin Noel Heidemanns, Michael DeWitt, Michael Dillon, Michael Lerner, Mick Cooney, Márton Vaitkus, N Sanders, Name, Nathaniel Burbank, Nic Fishman, Nicholas Clark, Nicholas Cowie, Nick S, Nicolas Frisby, Octavio Medina, Ole Rogeberg, Oliver Crook, Olivier Ma, Pablo León Villagrá, Patrick Kelley, Patrick Boehnke, Pau Pereira Batlle, Peter Smits, Pieter van den Berg , ptr, Putra Manggala, Ramiro Barrantes Reynolds, Ravin Kumar, Raúl Peralta Lozada, Riccardo Fusaroli, Richard Nerland, Robert Frost, Robert Goldman, Robert kohn, Robert Mitchell V, Robin Taylor, Ross McCullough, Ryan Grossman, Rémi , S Hong, Scott Block, Sean Pinkney, Sean Wilson, Seth Axen, shira, Simon Duane, Simon Lilburn, sssz, Stan_user, Stefan, Stephanie Fitzgerald, Stephen Lienhard, Steve Bertolani, Stew Watts, Stone Chen, Susan Holmes, Svilup, Sören Berg, Tao Ye, Tate Tunstall, Tatsuo Okubo, Teresa Ortiz, Thomas Lees, Thomas Vladeck, Tiago Cabaço, Tim Radtke, Tobychev , Tom McEwen, Tony Wuersch, Utku Turk, Virginia Fisher, Vitaly Druker, Vladimir Markov, Wil Yegelwel, Will Farr, Will Tudor-Evans, woejozney, yolhaj , yureq , Zach A, Zad Rafi, and Zhengchen Cai.
References
License
A repository containing all of the files used to generate this chapter is available on GitHub.
The text and figures in this chapter are copyrighted by Michael Betancourt and licensed under the CC BY-NC 4.0 license:
https://creativecommons.org/licenses/by-nc/4.0/