Formal probability theory is a rich and sophisticated field of mathematics with a reputation for being confusing, if not outright impenetrable. Much of that intimidation, however, is due not to the abstract mathematics itself but rather to how it is haphazardly employed in practice. In particular, many introductions to probability theory sloppily conflate the abstract mathematics with its practical implementations, convoluting what we can calculate in the theory with how we perform those calculations. To make matters even worse, probability theory is used to model a variety of subtly different systems, which then burdens the already confused mathematics with the distinct and often conflicting philosophical connotations of those applications.

In this case study I attempt to untangle this pedagogical knot and illuminate the basic concepts and manipulations of probability theory. Our ultimate goal is to demystify what we can calculate in probability theory and how we can perform those calculations in practice. We begin with an introduction to abstract set theory, continue to probability theory, and then move on to practical implementations without any interpretational distraction. We will spend time more thoroughly reviewing sampling-based calculation methods before finally considering the classic applications of probability theory and the interpretations of the theory that then arise.

In a few places I will dip a bit deeper into the mathematics than is strictly necessary. These sections are labeled “Extra Credit” and may be skipped without any consequence, but they do provide further unification and motivation of some of the more subtle aspects of probability theory.

Let me open with a warning that the section on abstract probability theory will be devoid of any concrete examples. This is not because of any conspiracy to confuse the reader, but rather a consequence of the fact that we cannot explicitly construct abstract probability distributions in any meaningful sense. Instead we must utilize problem-specific representations of abstract probability distributions which means that concrete examples will have to wait until we introduce these representations in Section 3.

1 Setting A Foundation

Because formal probability theory works with sets in a given space we first need to review the relevant results from set theory that we will need. For a more complete and rigorous, yet readable, treatment of set theory see, for example, Folland (1999) and the appendix of Lee (2011).

A set is a collection of elements and a space is the collection of all elements under consideration. For example, \(A_{1} = \{1\}\), \(A_{2} = \{1, 5, 10, 12\}\), and \(A_{3} = \{30, 2, 7\}\), are all sets contained in the space of natural numbers, \(\mathbb{N} = \left\{0, 1, \ldots \right\}\).
To denote their containment we write \(A_{1} \subset \mathbb{N}\), \(A_{2} \subset \mathbb{N}\), and \(A_{3} \subset \mathbb{N}\). Because \(A_{1}\) is also contained in \(A_{2}\) we say that \(A_{1}\) is a subset of \(A_{2}\) and write \(A_{1} \subset A_{2}\).

A point set or atomic set is a set containing a single element such as \(\{1\}\) above. The entire space itself is always a valid set, as is the empty set or null set, \(\emptyset\), which contains no elements at all.

Sets are often defined implicitly via an inclusion criterion. These sets are denoted with the set builder notation \[ A = \left\{ x \in X \mid f(x) = 0 \right\}, \] which reads “the set of elements \(x\) in the space \(X\) such that the condition \(f(x) = 0\) holds”. For example, we can define the positive real numbers as \[ \mathbb{R}^{+} = \left\{ x \in \mathbb{R} \mid x > 0 \right\}. \]

Throughout this case study we will use four sets to demonstrate the operations of set theory and probability theory. Our first two sets are finite intervals within the natural numbers, \[ \begin{align*} A_{1} &= \left\{ x \in \mathbb{N} \mid 4 \le x \le 7 \right\} \\ A_{2} &= \left\{ x \in \mathbb{N} \mid 7 \le x \le 9 \right\} \end{align*} \] Because these intervals contain only a finite number of elements we can explicitly represent them with a vector in R,

A1 <- c(4, 5, 6, 7)
A2 <- c(7, 8, 9)

The latter two sets are finite intervals within the real numbers, \[ \begin{align*} B_{1} &= \left\{ x \in \mathbb{R} \mid -2 \le x \le 1 \right\} \\ B_{2} &= \left\{ x \in \mathbb{R} \mid 0 \le x \le 3 \right\} \end{align*} \] These real intervals contain an uncountably infinite number of points so we’ll have to represent them with their boundaries.

B1_min <- -2
B1_max <- 1

B2_min <- 0
B2_max <- 3

1.1 Set Operations

There are three natural operations between sets: set complement, set union, and set intersection.

1.1.1 Set Complement

The set complement is a unary operation that maps one set into another set. Given a set, \(A\), its complement, \(A^{c}\), is defined by all of the elements not in that set, \[ A^{c} = \left\{ x \in X \mid x \notin A \right\}. \]





For example, the complement of \(A_{1}\) contains all of the natural numbers below 4 and above 7, \[ A_{1}^{c} = \left\{ x \in \mathbb{N} \mid x < 4 \text{ OR } x > 7 \right\}. \] Similarly, the complement of \(B_{1}\) is given by \[ B_{1}^{c} = \left\{ x \in \mathbb{R} \mid x < -2 \text{ OR } x > 1 \right\}. \]
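Because the complement of \(A_{1}\) contains infinitely many natural numbers we cannot represent it with an explicit vector in R, but we can sketch it within a truncated window of the space; the upper bound of 20 here is an arbitrary choice of mine, not part of the definition.

```r
# Truncate the infinite space of natural numbers at an arbitrary bound
X_trunc <- 0:20
A1 <- c(4, 5, 6, 7)

# setdiff returns the elements of X_trunc that are not contained in A1
A1_comp <- setdiff(X_trunc, A1)
A1_comp
```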

By definition the complement of the empty set is the entire space, \(\emptyset^{c} = X\), and vice versa, \(X^{c} = \emptyset\).

1.1.2 Set Union

The set union is a binary operator that maps two sets into a third set. We can construct the union of two sets, \(A_{1} \cup A_{2}\), as the set of elements included in either set, \[ A_{1} \cup A_{2} = \left\{ x \in X \mid x \in A_{1} \text{ OR } x \in A_{2} \right\}. \]





For example, the union of \(A_{1}\) and \(A_{2}\) is \[ A_{1} \cup A_{2} = \left\{ x \in X \mid 4 \le x \le 9 \right\}, \]

A_union <- union(A1, A2)
A_union
[1] 4 5 6 7 8 9

while the union of \(B_{1}\) and \(B_{2}\) is given by \[ B_{1} \cup B_{2} = \left\{ x \in X \mid -2 \le x \le 3 \right\}, \]

which we can specify with the boundaries

B_union_min <- -2
B_union_max <- 3
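Because \(B_{1}\) and \(B_{2}\) overlap, these union boundaries can also be computed from the component boundaries; a small sketch:

```r
B1_min <- -2; B1_max <- 1
B2_min <- 0;  B2_max <- 3

# When two intervals overlap their union is itself a single interval
# running from the smaller lower boundary to the larger upper boundary
B_union_min <- min(B1_min, B2_min)
B_union_max <- max(B1_max, B2_max)
```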

The union of any set and the empty set is just that set, \(A \cup \emptyset = A\).

1.1.3 Set Intersection

The set intersection is another binary operator that maps two sets into a third set. The intersection of two sets, \(A_{1} \cap A_{2}\), is defined as the set of elements included in both sets, \[ A_{1} \cap A_{2} = \left\{ x \in X \mid x \in A_{1} \text{ AND } x \in A_{2} \right\}. \]





For example, the intersection of \(A_{1}\) and \(A_{2}\) is \[ A_{1} \cap A_{2} = \left\{ x \in X \mid 7 \le x \le 7 \right\}, \]

A_inter <- intersect(A1, A2)
A_inter
[1] 7

while the intersection of \(B_{1}\) and \(B_{2}\) is given by \[ B_{1} \cap B_{2} = \left\{ x \in X \mid 0 \le x \le 1 \right\}, \] which we can specify with the boundaries

B_inter_min <- 0
B_inter_max <- 1
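As with the union, because the two intervals overlap these intersection boundaries can be computed directly from the component boundaries; a small sketch:

```r
B1_min <- -2; B1_max <- 1
B2_min <- 0;  B2_max <- 3

# When two intervals overlap their intersection runs from the larger of
# the two lower boundaries to the smaller of the two upper boundaries
B_inter_min <- max(B1_min, B2_min)
B_inter_max <- min(B1_max, B2_max)
```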

The intersection of any set and the empty set is just the empty set, \(A \cap \emptyset = \emptyset\).

1.2 Sigma Algebras

The collection of all subsets of a space, \(X\), is called the power set, \(\mathcal{P}(X)\). The power set is massive; it contains the empty set, the entire space, all of the atomic sets, and much more.





Unfortunately, even if the space \(X\) is well-behaved, the corresponding power set can contain some less mathematically savory elements. For example, under a seemingly innocent assumption known as the Axiom of Choice, one can show that there exist sets of real numbers that cannot be explicitly constructed, and that these non-constructible sets feature some very odd properties.

Consequently, when dealing with sets we often want to consider not the entire power set but rather a restriction of the power set that avoids any pathological sets that might be hiding. We have to be careful, however, to apply this restriction in a self-consistent manner. In particular, we want our restriction to be closed under the three natural set operations so that we don’t have to worry about straying outside of our carefully chosen restriction.

For example, if a set \(A\) is in our restriction then we want its complement \(A^{c}\) to also be in our restriction. If \(A_{1}\) and \(A_{2}\) are in our restriction then we also want the restriction to contain the union, \(A_{1} \cup A_{2}\), and the intersection, \(A_{1} \cap A_{2}\). In fact we often will need our restrictions to be closed under a countably infinite, or countable, number of unions or intersections. More precisely, given a countable collection of sets, \(\left\{A_{1}, A_{2}, \ldots\right\}\), we want both the countable union, \[ \cup_{n = 1}^{\infty} A_{n}, \] and the countable intersection, \[ \cap_{n = 1}^{\infty} A_{n}, \] to be in the restriction.

A restricted collection of sets satisfying these closure properties is called a \(\sigma\)-algebra over the space \(X\) (Folland 1999) and a choice of \(\sigma\)-algebra over \(X\) will be denoted with a calligraphic font, \(\mathcal{X}\). A given space will typically admit many \(\sigma\)-algebras, although only a few will be of practical interest. For example, we can always take the trivial \(\sigma\)-algebra \(\mathcal{X} = \{ \emptyset, X \}\).

An important benefit of these closure properties is that they ensure that any non-constructible sets that may be lurking within the power set don’t persist into a \(\sigma\)-algebra. Consequently identifying any \(\sigma\)-algebra removes many of the pathological behaviors that arise in uncountably-large spaces.

Conveniently, many spaces are equipped with structures that define natural \(\sigma\)-algebras. In particular, all of the spaces that we typically deal with in practical problems, such as the integers and the real numbers, have natural \(\sigma\)-algebras. Consequently, in practice we don’t have to spend any time worrying about explicitly constructing a \(\sigma\)-algebra, and the technical need to restrict the power set to a \(\sigma\)-algebra has little impact on the practical use of sets. That said, because avoiding the pathologies of non-constructible sets is critical to a mathematically well-defined probability theory, \(\sigma\)-algebras will be a common occurrence.

For more on these natural \(\sigma\)-algebras see the following optional section.

1.3 Extra Credit: Topology

A topology on a space, \(X\), is a collection of sets that contains the empty set and the full space and is closed under countable unions and finite intersections (Lee 2011). The sets in a given topology are called topologically open sets and the complement of any open set is called a topologically closed set. In a rough intuition, topologically closed sets contain their boundaries while topologically open sets do not. Be warned that open and closed in the topological sense are not exact opposites – by definition the empty set and the full space are both topologically open and topologically closed or, shudder, topologically clopen.

The topology of a space provides a formal way to define important properties like continuity. For example, the fundamental differences between discrete and continuous spaces arise from their different topologies. In particular, all of the common spaces, such as the integers and the real numbers, have unique topologies from which many of their more-intuitive properties arise.

Although superficially similar, a topology is distinct from a \(\sigma\)-algebra. For example, the topologically open sets are not closed under the complement operation. If we start with the topologically open sets defined by a topology and close them under complements, countable unions, and countable intersections, however, then we do get a proper \(\sigma\)-algebra. \(\sigma\)-algebras constructed from the topology of a space in this way are denoted Borel \(\sigma\)-algebras (Folland 1999).

Because the typical spaces we’ll consider all have natural topologies they consequently also have natural Borel \(\sigma\)-algebras that exclude pathological sets that can cause problems when trying to define objects like probabilities.

1.4 Functions

A function is a relation that associates elements in one space to elements in another space. If a function, \(f\), maps elements in the space \(X\) to elements in the space \(Y\) then we write \[ f : X \rightarrow Y. \] If we want to be particularly explicit and denote how a particular element \(x \in X\) maps into elements of Y then we can also write \[ \begin{alignat*}{6} f :\; &X& &\rightarrow& \; &Y& \\ &x& &\mapsto& &f(x)&. \end{alignat*} \]





A function also induces a mapping from sets in the input space, \(X\), to sets in the output space, \(Y\), by applying the function to each individual element in the input set, \[ f_{*}(A) = \left\{ y \in Y \mid y = f(x) \text{ for some } x \in A \right\}. \] This induced map is also known as a pushforward map over sets.





At the same time we can also map sets from the output space, \(Y\), to sets in the input space, \(X\), by considering all of the points that map into a given set, \[ f^{*}(B) = \left\{ x \in X \mid f(x) \in B \right\}. \] This induced map is also known as a pullback map over sets.
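For finite sets we can sketch both induced maps explicitly in R; the map \(f(x) = x^{2}\) and the truncated input space are my own illustrative choices, not anything intrinsic to the theory.

```r
f <- function(x) x^2

# Pushforward: apply f to every element of the input set A
push <- function(A) unique(sapply(A, f))

# Pullback: collect every point of a (truncated) input space that f maps
# into the output set B; the truncation at -10:10 is an arbitrary sketch
pull <- function(B, X = -10:10) X[f(X) %in% B]

push(c(-2, 1, 2))  # 4 1
pull(c(4))         # -2 2
```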





These induced maps will, in general, be between the power sets of the two spaces, \[ \begin{align*} f_{*} &: \mathcal{P}(X) \rightarrow \mathcal{P}(Y) \\ f^{*} &: \mathcal{P}(Y) \rightarrow \mathcal{P}(X), \end{align*} \] but they need not respect any restriction of the power sets that we might be considering, in particular a choice of \(\sigma\)-algebras on the input and output spaces. When dealing with \(\sigma\)-algebras, then, we often want to limit our consideration to only those special functions that induce maps between the \(\sigma\)-algebras themselves, \[ \begin{align*} f_{*} &: \mathcal{X} \rightarrow \mathcal{Y} \\ f^{*} &: \mathcal{Y} \rightarrow \mathcal{X}. \end{align*} \]

2 Mathematical Logistics

Once the philosophy has been stripped away, probability theory is simply the study of an object, a probability distribution that assigns values to sets, and the transformations of that object. To be fair this isn’t particularly surprising as most of pure mathematics ultimately reduces to studying objects and their transformations! Fortunately, the basic concepts of probability distributions and their manipulations are reasonably intuitive, even from this more formal perspective.

Probability textbooks tend to be either too simple, ignoring many important concepts and succumbing to the pedagogical issues I mentioned in the introduction, or too complex, focusing on technical details that quickly fall beyond the proficiency of many readers. My favorite treatment of the more formal details of probability theory, and its predecessor measure theory, is Folland (1999), who spends significant time discussing concepts between the technical details.

Working through those details, however, is a serious investment in time. In this section we instead review those concepts relevant to the practical application of probability theory, and some of the details necessary to piece those concepts together and form a self-contained treatment.

2.1 Probability Distributions

From an abstract mathematical perspective, probability is surprisingly boring. Probability is simply a positive, conserved quantity that we want to distribute across a given space – in particular it does not necessarily refer to anything inherently random or uncertain. If we rescale this distribution in terms of the relative proportions of the total amount then, regardless of the nature of this conserved quantity, we can treat it as unit-less with a total amount of 1.

A probability distribution defines a mathematically self-consistent allocation of this conserved quantity across \(X\). If \(A\) is a subset of \(X\) then we write \(\mathbb{P}_{\pi}[A]\) as the probability assigned to \(A\) by the probability distribution \(\pi\).

This allocation should be exhaustive, with all of the probability distributed into \(X\). In other words we require that the probability assigned to the full space is always the total probability available, \(\mathbb{P}_{\pi} [X] = 1\).





Moreover, any distribution of probability should be globally consistent over any disjoint decomposition of the total space. For example, consider the decomposition of \(X\) into the two disjoint sets \(A_{1}\) and \(A_{2}\).





We can assign an arbitrary amount of probability to \(A_{1}\),





but then we have to assign the remaining probability to \(A_{2}\),





In other words, the total probability assigned to \(A_{1}\) and \(A_{2}\) must sum to \(1\).





Indeed we’ll want to extend this behavior to local consistency of the probability distribution over the disjoint decomposition of any set. For example, consider the decomposition of \(A_{1}\) into the two disjoint sets \(A_{3}\) and \(A_{4}\).





The total probability assigned to \(A_{1}\) is given by the probability allocated to \(A_{3}\) plus that allocated to \(A_{4}\).





In particular, we can derive global consistency of probability distribution by recursively applying local consistency, \[ \begin{align*} \mathbb{P}_{\pi}[A_{3}] + \mathbb{P}_{\pi}[A_{4}] + \mathbb{P}_{\pi}[A_{2}] &= \mathbb{P}_{\pi}[A_{1}] + \mathbb{P}_{\pi}[A_{2}] \\ &= \mathbb{P}_{\pi}[X] \\ &= 1. \end{align*} \]





More formally we want the allocation of probability to any collection of disjoint sets, \(A_{n} \cap A_{m} = \emptyset, n \ne m\), to be the same as the allocation to the union of those sets, \[ \sum_{n = 1}^{N} \mathbb{P}_{\pi} [ A_{n} ] = \mathbb{P}_{\pi} [ \cup_{n = 1}^{N} A_{n} ], \] so that no matter how we decompose the space \(X\), or any subsets of \(X\), we don’t lose any probability.

Consistency over any finite collection of sets is known as finite additivity, and this would be sufficient if there were only a finite number of subsets in \(X\). Finite additivity, however, is insufficient for distributing probability across spaces that feature infinitely many subsets. In particular, consider computing the probability assigned to a circle using only rectangles, which is how distributions across the real numbers are more formally derived. No finite number of disjoint rectangles will exactly cover the circle and hence yield the desired probability, but the limit of a countably infinite number of them will.





Consequently in general we need our probability distributions to be countably additive across any disjoint sets, \(A_{n} \cap A_{m} = \emptyset, n \ne m\), \[ \sum_{n = 1}^{\infty} \mathbb{P}_{\pi} [ A_{n} ] = \mathbb{P}_{\pi} [ \cup_{n = 1}^{\infty} A_{n} ]. \]

This all sounds well and good, but it turns out that we cannot construct probability distributions that satisfy any kind of additivity for all subsets of every space. The problem? Those non-constructible sets to which we were introduced in Section 1.2. It turns out that when a space features non-constructible subsets we can combine a finite number of them, \(\{ \mathfrak{A}_{1}, \ldots, \mathfrak{A}_{N} \}\), in a certain way to achieve sub-additivity, \[ \sum_{n = 1}^{N} \mathbb{P}_{\pi} [ \mathfrak{A}_{n} ] > \mathbb{P}_{\pi} [ \cup_{n = 1}^{N} \mathfrak{A}_{n} ], \] even when \(\mathfrak{A}_{n} \cap \mathfrak{A}_{m} = \emptyset, n \ne m\). We can also combine a finite number of different non-constructible subsets to achieve super-additivity, \[ \sum_{n = 1}^{N} \mathbb{P}_{\pi} [ \mathfrak{A}_{n} ] < \mathbb{P}_{\pi} [ \cup_{n = 1}^{N} \mathfrak{A}_{n} ]! \]

In other words, there is no way that we can engineer a self-consistent distribution of probability across non-constructible sets. Fortunately this pathological behavior isn’t terminal because we already know how to avoid non-constructible sets in practice! All we have to do is limit our probability distributions to any well-defined \(\sigma\)-algebra, \(\mathcal{X}\).

The minimal properties needed to achieve this self-consistent allocation of probability are formalized in Kolmogorov’s axioms:

  • A probability distribution \(\pi\) on a space \(X\) with \(\sigma\)-algebra \(\mathcal{X}\) is a mapping from \(\mathcal{X}\) to the real numbers \([0, 1]\), \[ \begin{alignat*}{6} \mathbb{P}_{\pi} :\; &\mathcal{X}& &\rightarrow& \; &[0, 1]& \\ &A& &\mapsto& &\mathbb{P}_{\pi} [A]&. \end{alignat*} \]
  • This mapping defines a lossless allocation, \(\mathbb{P}_{\pi} [X] = 1\).
  • This mapping defines a consistent allocation for any collection of disjoint sets in \(\mathcal{X}\), \[ \mathbb{P}_{\pi} [ \cup_{n = 1}^{\infty} A_{n} ] = \sum_{n = 1}^{\infty} \mathbb{P}_{\pi} [ A_{n} ], \] \[ A_{n} \cap A_{m} = \emptyset, \, n \ne m. \]

The more familiar rules of probability theory can all be derived from these axioms. For example the consistency condition implies that \[ \mathbb{P}_{\pi}[A] + \mathbb{P}_{\pi}[A^{c}] = \mathbb{P}_{\pi}[X] = 1 \] or \[ \mathbb{P}_{\pi}[A] = 1 - \mathbb{P}_{\pi}[A^{c}]. \]

With a little work one can also derive the probability sum rules, \[ \mathbb{P}_{\pi}[ A_{1} \cup A_{2} ] = \mathbb{P}_{\pi}[ A_{1} ] + \mathbb{P}_{\pi}[ A_{2} ] - \mathbb{P}_{\pi}[ A_{1} \cap A_{2} ], \] and \[ \mathbb{P}_{\pi}[ A_{1} \cap A_{2} ] = \mathbb{P}_{\pi}[ A_{1} ] + \mathbb{P}_{\pi}[ A_{2} ] - \mathbb{P}_{\pi}[ A_{1} \cup A_{2} ]. \]
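We can verify the first sum rule numerically for a small discrete distribution; the uniform allocation over \(\{4, \ldots, 9\}\) is an arbitrary choice for illustration.

```r
# A uniform probability distribution over the discrete space {4, ..., 9}
X <- 4:9
p <- rep(1 / length(X), length(X))

# The probability of a set is the sum of the probabilities of its elements
P <- function(A) sum(p[X %in% A])

A1 <- c(4, 5, 6, 7)
A2 <- c(7, 8, 9)

# P[A1 union A2] = P[A1] + P[A2] - P[A1 intersect A2]
all.equal(P(union(A1, A2)), P(A1) + P(A2) - P(intersect(A1, A2)))  # TRUE
```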

These rules can be used to construct explicit probability distributions when the \(\sigma\)-algebra is small enough. For example, consider the binary space \(X = \{0, 1\}\). In this case the valid \(\sigma\)-algebra is the entire power set consisting of the empty set, \(A_{1} = \emptyset\), the atomic sets, \(A_{2} = \{ 0 \}\) and \(A_{3} = \{ 1 \}\), and the entire space, \(A_{4} = X\). What probabilities can we assign to these sets? The axioms require that \(\mathbb{P}_{\pi} [ A_{4} ] = 1\), and the complement rule then requires that \(\mathbb{P}_{\pi} [ A_{1} ] = 0\). We are free to assign any probability to one of the atomic sets, so we can take \(\mathbb{P}_{\pi} [ A_{3} ] = p\) in which case the complement rule requires that \(\mathbb{P}_{\pi} [ A_{2} ] = 1 - p\). In other words, any probability distribution on the binary space \(\{0, 1\}\) is completely specified by the parameter \(p\) and the resulting allocation \[ \begin{align*} \mathbb{P}_{\pi} [ \;\, \emptyset \;\, ] &= 0 \\ \mathbb{P}_{\pi} [ \{ 0 \} ] &= 1 - p \\ \mathbb{P}_{\pi} [ \{ 1 \} ] &= p \\ \mathbb{P}_{\pi} [ \;\, X \; ] &= 1. \end{align*} \]
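We can sketch this allocation in R; the value \(p = 0.3\) is an arbitrary example.

```r
# Any probability distribution over the binary space {0, 1} is completely
# specified by a single parameter p; p = 0.3 is an arbitrary example value
p <- 0.3

# Allocation over the four sets in the power set
P_empty <- 0
P_zero  <- 1 - p  # probability of the atomic set {0}
P_one   <- p      # probability of the atomic set {1}
P_X     <- 1

# The allocations to the two atomic sets recover the total probability
P_zero + P_one
```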

A probability distribution defined by Kolmogorov’s axioms is completely specified by the probability triplet \((X, \mathcal{X}, \pi)\), which is often denoted more compactly as \[ x \sim \pi. \] Here \(x\) simply denotes points in the space \(X\), \(\pi\) denotes the probability distribution, and a valid \(\sigma\)-algebra is implicitly assumed.
Care must be taken to not over interpret this notation – in particular it says nothing about individual points \(x \in X\) but rather denotes a probability distribution over all sets in the assumed \(\sigma\)-algebra.

2.2 Measurable Transformations

Once we have defined a probability distribution on a space, \(X\), and a well-behaved collection of subsets, \(\mathcal{X}\), we can then consider how the probability distribution transforms when \(X\) transforms. In particular, let \(f: X \rightarrow Y\) be a transformation from \(X\) to another space \(Y\). Can this transformation also transform our probability distribution on \(X\) into a probability distribution on \(Y\), and if so under what conditions?

The answer is straightforward once we have selected a \(\sigma\)-algebra for \(Y\) as well, which we will denote \(\mathcal{Y}\). In order for \(f\) to induce a probability distribution on \(Y\) we need the two \(\sigma\)-algebras to be compatible in some sense. In particular we need every subset \(B \in \mathcal{Y}\) to correspond to a unique subset \(f^{-1}(B) \in \mathcal{X}\). If this holds for all subsets in \(\mathcal{Y}\) then we say that the transformation \(f\) is measurable and we can define a pushforward distribution, \(\pi_{*}\) by \[ \mathbb{P}_{\pi_{*}} [ B ] = \mathbb{P}_{\pi} [ f^{-1} (B) ]. \] In other words, if \(f\) is measurable then a self-consistent allocation of probability over \(X\) induces a self-consistent allocation of probability over \(Y\).
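For a discrete space we can sketch the pushforward construction directly; the uniform distribution and the map \(f(x) = x \bmod 2\) are illustrative choices.

```r
# A uniform distribution over the discrete space X = {4, ..., 9}; both the
# distribution and the map f are arbitrary illustrative choices
X <- 4:9
p <- rep(1 / length(X), length(X))

# A measurable map into the binary space Y = {0, 1}
f <- function(x) x %% 2

# The pushforward distribution assigns to each set B in Y the probability
# of its preimage, P_pi*[B] = P_pi[f^{-1}(B)]
P_push <- function(B) sum(p[f(X) %in% B])

P_push(0)  # the even elements 4, 6, 8 carry half of the probability
```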





2.2.1 Projections

Measurable transformations can be used to project a probability distribution over a space onto a probability distribution over a lower-dimensional subspace.
Let \(\varpi: X \rightarrow Y\) be a projection operator that maps points in a space \(X\) to points in the subspace \(Y \subset X\). It turns out that in this case a \(\sigma\)-algebra on \(X\) naturally defines a \(\sigma\)-algebra on \(Y\) and the projection operator is measurable with respect to this choice. Consequently any joint probability distribution on \(X\) will transform into a unique marginal probability distribution on \(Y\). More commonly we say that we marginalize out the complementary subspace, \(Y^{c}\).





Marginalization is a bit more straightforward when we are dealing with a product space, \(X \times Y\), which is naturally equipped with the component projection operators \(\varpi_{X} : X \times Y \rightarrow X\) and \(\varpi_{Y}: X \times Y \rightarrow Y\). In this case by pushing a distribution over \((X \times Y, \mathcal{X} \times \mathcal{Y})\) forwards along \(\varpi_{X}\) we marginalize out \(Y\) to give a probability distribution over \((X, \mathcal{X})\). At the same time by pushing that same distribution forwards along \(\varpi_{Y}\) we can marginalize out \(X\) to give a probability distribution over \((Y, \mathcal{Y})\).





Consider, for example, the three-dimensional space, \(\mathbb{R}^{3}\), where the coordinate functions serve as projection operators onto the three axes, \(X\), \(Y\), and \(Z\). Marginalizing out \(X\) transforms a probability distribution over \(X \times Y \times Z\) to give a probability distribution over the two-dimensional space, \(Y \times Z = \mathbb{R}^{2}\). Marginalizing out \(Y\) then gives a probability distribution over the one-dimensional space, \(Z = \mathbb{R}\).
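For discrete product spaces the component pushforwards reduce to summing probabilities across the marginalized dimension; the joint allocation below is an arbitrary example of mine.

```r
# An arbitrary joint distribution over the product of two small discrete
# spaces, X = {1, 2} and Y = {1, 2, 3}, stored as a matrix of probabilities
joint <- matrix(c(0.10, 0.20, 0.05,
                  0.25, 0.30, 0.10),
                nrow = 2, byrow = TRUE)

# Marginalizing out Y sums across the columns; marginalizing out X
# sums across the rows
marginal_X <- rowSums(joint)  # distribution over X
marginal_Y <- colSums(joint)  # distribution over Y
```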

2.2.2 Reparameterizations

Another important class of measurable functions are those one-to-one maps for which \(f(A) \in \mathcal{Y}\) for any \(A \in \mathcal{X}\) in addition to \(f^{-1}(B) \in \mathcal{X}\) for any \(B \in \mathcal{Y}\). In this case \(f\) not only transforms a probability distribution on \(X\) into a probability distribution on \(Y\) but also transforms a probability distribution on \(Y\) into a probability distribution on \(X\). Moreover, transforming a distribution on \(X\) to one on \(Y\) and back perfectly restores the original distribution – unlike projections these reparameterizations preserve all information encoded in a probability distribution.

In this case \((X, \mathcal{X}, \pi)\) and \((Y, \mathcal{Y}, \pi_{*})\) are really just two different manifestations, or parameterizations, of the same abstract probability system. The two parameterizations, for example, might correspond to different choices of coordinate system, different choices of units, or different choices of language dialect capable of the same descriptions. The reparameterizations then translate from one equivalent manifestation to another.

For discrete spaces all reparameterizations can be reduced to permutations that simply rearrange the individual elements of the space. This simple intuition, however, doesn’t carry over to continuous spaces for which reparameterizations can be significantly more subtle.

3 Great Expectations

Conveniently, the distribution of probability across a space also serves a supplementary purpose: it allows us to summarize how functions of the form \(f: X \rightarrow \mathbb{R}\) behave.

Without probability the only way that we can summarize the behavior of such a function is through topological information such as minima and maxima. The introduction of a probability distribution, however, allows us to average the behavior of a function across its input domain. For example, consider the function below along with two sets, \(A_{1}\) and \(A_{2}\).





If \(\mathbb{P}_{\pi}[ A_{1} ] > \mathbb{P}_{\pi}[ A_{2} ]\) then the behavior of \(f\) within \(A_{1}\) is more relevant to an average.

Expectation values, \(\mathbb{E}_{\pi}[f]\), reduce a function to a single real number by weighting the function output at every point by the probability distributed around that point. More formally, let \(\mathcal{C}\) be the space of all functions from our target space to the real numbers, \(f : X \rightarrow \mathbb{R}\). Expectation is then a map \[ \begin{alignat*}{6} \mathbb{E}_{\pi} :\; &\mathcal{C}& &\rightarrow& \; &\mathbb{R}& \\ &f& &\mapsto& &\mathbb{E}_{\pi} [f]&. \end{alignat*} \]

Formalizing this expectation that weights a function with probabilities is subtle, but we can build up some intuition using one class of functions for which the process is straightforward. An indicator function is a function that vanishes outside of a given set, \[ \mathbb{I}_{A} [x] = \left\{ \begin{array}{rr} 1, & x \in A \\ 0, & x \notin A \end{array} \right. . \]





Because the indicator function is constant within its defining set, \(A\), we can simply stipulate its expectation value to be the weight assigned to that set, which is just the probability it is allocated, \[ \mathbb{E}_{\pi} [ \mathbb{I}_{A} ] \equiv \mathbb{P}_{\pi} [ A ]. \] We can then build up the expectation value of an arbitrary function with a careful approximation in terms of these indicator functions in a process known as Lebesgue integration. For more detail see the following optional section.
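We can check this stipulation numerically for a discrete distribution; the uniform allocation over \(\{4, \ldots, 9\}\) and the set \(A\) are illustrative choices.

```r
# Expectation of an indicator function under a uniform distribution over
# the discrete space {4, ..., 9}; the space, weights, and set A are all
# arbitrary illustrative choices
X <- 4:9
p <- rep(1 / length(X), length(X))

A <- c(4, 5, 6, 7)
indicator_A <- as.numeric(X %in% A)

# Weighting each function output by its probability recovers P[A] = 4 / 6
E_indicator <- sum(p * indicator_A)
```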

3.1 Extra Credit: Lebesgue Integration

As we saw above, only the indicator functions have immediate expectation values in terms of probabilities. In order to define expectation values of more general functions we have to build increasingly more complex functions out of these elementary ingredients.

The countable sum of indicator functions weighted by real numbers defines a simple function, \[ \phi = \sum_{n} a_{n} \mathbb{I}_{A_{n}}. \] If we require that expectation is linear over this summation then the expectation value of any simple function is given by \[ \begin{align*} \mathbb{E}_{\pi} [ \phi ] &= \mathbb{E}_{\pi} \left[ \sum_{n} a_{n} \mathbb{I}_{A_{n}} \right] \\ &= \sum_{n} a_{n} \mathbb{E}_{\pi} [ \mathbb{I}_{A_{n}} ] \\ &= \sum_{n} a_{n} \mathbb{P}_{\pi} [ A_{n} ]. \end{align*} \] Because of the countable additivity of \(\pi\) and the boundedness of probability, the expectation of a simple function will always be finite provided that each of the coefficients \(a_{n}\) is itself finite.
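For a discrete distribution this weighted sum is immediate to compute; the coefficients and disjoint sets below are arbitrary illustrative choices.

```r
# Expectation of a simple function phi = sum_n a_n * I_{A_n} under a
# uniform distribution over {4, ..., 9}; the coefficients a and the
# disjoint sets A are arbitrary illustrative choices
X <- 4:9
p <- rep(1 / length(X), length(X))

P <- function(S) sum(p[X %in% S])

a <- c(2.0, -1.5)
A <- list(c(4, 5), c(6, 7, 8))

# Linearity reduces the expectation to a weighted sum of probabilities
E_phi <- sum(a * sapply(A, P))
```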

We can then use simple functions to approximate an everywhere-positive function, \(f : X \rightarrow \mathbb{R}^{+}\). A simple function with only a few terms defined over only a few sets will yield a poor approximation to \(f\), but as we consider more terms and more sets we can build an increasingly accurate approximation. In particular, because of countable additivity we can always construct a simple function that approximates \(f\) with arbitrary accuracy while remaining bounded above by \(f\).





We can then define the expectation of an everywhere-positive function as the expectation of this approximating simple function. Because we were careful to consider only simple functions bounded by \(f\) we can also define the expectation of \(f\) as the supremum of the expectations of all such bounded simple functions, \[ \mathbb{E}_{\pi} [ f ] = \sup_{\phi \le f} \mathbb{E}_{\pi} [ \phi ]. \]

For functions that aren’t everywhere-positive we can decompose \(X\) into a collection of neighborhoods where \(f\) is entirely positive, \(A^{+}_{n}\), and entirely negative, \(A^{-}_{m}\). In those neighborhoods where \(f\) is entirely positive we apply the above procedure to define \[ \mathbb{E}_{\pi} [ f \cdot \mathbb{I}_{A^{+}_{n}}], \] while in the neighborhoods where \(f\) is entirely negative we apply the above procedure on the negation of \(f\) to define \[ \mathbb{E}_{\pi} [ -f \cdot \mathbb{I}_{A^{-}_{m}}]. \] Those regions where \(f\) vanishes yield zero expectation values and can be ignored. We then define the expectation value of an arbitrary function \(f\) as the sum of these contributions, \[ \mathbb{E}_{\pi} [ f ] = \sum_{n = 0}^{\infty} \mathbb{E}_{\pi} [ f \cdot \mathbb{I}_{A^{+}_{n}}] - \sum_{m = 0}^{\infty} \mathbb{E}_{\pi} [ -f \cdot \mathbb{I}_{A^{-}_{m}}]. \]
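For a discrete distribution we can sketch this decomposition directly; the function \(f(x) = x - 6\) and the uniform allocation are illustrative choices.

```r
# Splitting the expectation of a sign-changing function into positive and
# negative contributions; f(x) = x - 6 and the uniform weights over
# {4, ..., 9} are arbitrary illustrative choices
X <- 4:9
p <- rep(1 / length(X), length(X))
f <- X - 6

pos <- f > 0
neg <- f < 0

# Each contribution is an expectation of an everywhere-positive function;
# points where f vanishes contribute nothing
E_f <- sum(p[pos] * f[pos]) - sum(p[neg] * (-f[neg]))
```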

Formally this procedure is known as Lebesgue integration and is a critical tool in the more general measure theory of which probability theory is a special case.
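Although genuinely concrete examples must wait for the representations introduced in Section 4, the mechanics of the simple-function construction can be sketched numerically on a toy finite space where every probability is specified by an explicit lookup table. All of the names and the allocation below are hypothetical, chosen purely for illustration.

```r
# A toy finite space and a hypothetical probability allocation over its atoms
X <- c("a", "b", "c", "d")
pi_table <- c("a" = 0.1, "b" = 0.2, "c" = 0.3, "d" = 0.4)

# The probability of any subset follows from countable additivity
P <- function(A) sum(pi_table[A])

# A simple function phi = 2 * I_{{a, b}} - 1 * I_{{c}}
a_n <- c(2, -1)
A_n <- list(c("a", "b"), c("c"))

# E[phi] = sum_n a_n * P[A_n], by linearity of the expectation
E_phi <- sum(sapply(seq_along(a_n), function(n) a_n[n] * P(A_n[[n]])))
E_phi
```

Here the expectation evaluates to \(2 \cdot 0.3 - 1 \cdot 0.3 = 0.3\), exactly the weighted sum of set probabilities given by the formula above.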

3.2 Useful Expectations

Expectation values serve two powerful purposes. As we saw above they can be used to summarize the behavior of a function \(f : X \rightarrow \mathbb{R}\) in the context of a given probability distribution. We can also invert this perspective, however, and use the expectation value of certain functions to characterize the given probability distribution itself. In this section I’ll introduce certain functions whose expectations are particularly useful in interrogating the properties of probability distributions defined on certain spaces.

First, however, a word of caution. The structure of these special functions, and hence the interpretation of these expectation values, is specific to the context of a given space and will not, in general, push forward along a transformation to another space. In particular, the interpretation is unique to a given parameterization and will change under reparameterizations even though the abstract probabilistic system does not.

If desired we can preserve the value of an expectation if we first push the special function forward along the transformation and only then take the expectation. Expectation and transformation do not, in general, commute!

3.2.1 This Distribution and I Are Having a Moment

When our target space is a subset of the real line, \(X \subseteq \mathbb{R}\), there is a natural embedding of \(X\) into \(\mathbb{R}\), \[ \begin{alignat*}{4} \iota :\; &X& &\rightarrow& \mathbb{R} \\ &x& &\mapsto& x. \end{alignat*} \] For example there is a natural embedding that associates the natural numbers, \(\left\{0, 1, 2, \ldots \right\}\), with the corresponding values in the real line.





Similarly there is a natural embedding of the real interval \([0, 1]\) with the corresponding interval in the full real line.





If our target space is itself the real line then the identity function serves as an appropriate embedding.





The expectation values of this embedding function, and of simple transformations of it, are extremely useful in characterizing certain properties of probability distributions. Moments are expectations of powers of the embedding function, \[ \mathbb{m}_{\pi, n} \equiv \mathbb{E}_{\pi} [ \iota^{n} ]. \] Central moments center the embedding function around the first moment, \[ \mathbb{c}_{\pi, n} \equiv \mathbb{E}_{\pi} [ (\iota - \mathbb{m}_{\pi, 1})^{n} ]. \]

The mean of a probability distribution is the corresponding first moment, \[ \mathbb{m}_{\pi} = \mathbb{m}_{\pi, 1} = \mathbb{E}_{\pi} [ \iota ], \] which quantifies a sense of where in the target space the probability distribution is concentrating its allocation.

At the same time the variance of a probability distribution is the second central moment, \[ \mathbb{V}_{\pi} = \mathbb{c}_{\pi, 2} = \mathbb{E}_{\pi} [ (\iota - \mathbb{m}_{\pi})^{2}], \] which quantifies a sense of breadth of the probability allocation around the mean. In practice the standard deviation \[ \mathbb{s}_{\pi} = \sqrt{ \mathbb{V}_{\pi} } \] is often more interpretable than the variance.

Higher-order central moments go on to quantify more intricate properties of a given probability distribution. The third central moment, for example, quantifies how asymmetrically probability is distributed around the mean.
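As a numerical sketch, again resorting to a hypothetical lookup-table allocation of the kind discussed in Section 4, the mean, variance, and third central moment all reduce to weighted sums over the embedded atoms.

```r
# Hypothetical allocation over the embedded points {0, 1, 2, 3}
x <- 0:3
p <- c(0.1, 0.2, 0.3, 0.4)

m1 <- sum(p * x)            # mean: first moment of the embedding
c2 <- sum(p * (x - m1)^2)   # variance: second central moment
s  <- sqrt(c2)              # standard deviation
c3 <- sum(p * (x - m1)^3)   # third central moment: asymmetry around the mean

c(mean = m1, var = c2, sd = s, third = c3)
```

For this allocation the mean is 2 and the variance is 1, while the negative third central moment, \(-0.6\), indicates that probability is allocated asymmetrically towards values below the mean.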

While we can always define expectation values of a given function \(f: X \rightarrow \mathbb{R}\), a probability distribution will not always have well-defined moments. On one hand the corresponding expectation values may be infinite or simply ill-posed. On the other hand the embedding function itself may not be well-defined.

For example, if our space is a subset of the real numbers in more than one dimension, \(X \subseteq \mathbb{R}^{N}, \, N > 1\), then there is no unique embedding function and hence no well-defined moments. We can, however, define component moments as expectations of the coordinate functions, \(\hat{x}_{n}: X \rightarrow \mathbb{R}\), that project a point \(x \in X\) onto each of the component real lines. These component means and variances then provide some quantification of how probability is allocated along each axis.

On the other hand, there is no unique embedding of a sphere into the real line – any embedding turns into an equally good embedding by rotating the sphere before applying the embedding. Consequently there are no unique moments for distributions on spheres and many other non-Euclidean spaces! In practice one adds additional structure to distinguish unique moments, for example with circular moments in the case of spheres.

Unfortunately it is also common for the mean of a function to refer to its expectation value, \[ \mathbb{m}_{\pi} [f] = \mathbb{E}_{\pi} [ f ], \] and for the variance of a function to refer to the expectation \[ \mathbb{V}_{\pi} [f] = \mathbb{E}_{\pi} [ (f - \mathbb{E}_{\pi}[f])^{2}]. \] In order to avoid confusion with the mean and variance of a probability distribution, I will always specify the argument \([f]\) when these alternative notations are being used.

Finally, keep in mind that moments are defined within the context of a particular parameterization. Reparameterizing to another space introduces a different embedding function which defines different moments that quantify different properties of the abstract probability system.

3.2.2 Cumulative Distribution Functions and Quantiles

The target space is ordered when we can uniquely define an ordering of each point. For example the one-dimensional real line is ordered because each point is placed above or below all others, but the two-dimensional real plane is not ordered because, for example, there is no unique way to define whether the point \((1, 0)\) is greater than or less than the point \((0, 1)\). Similarly, there’s no self-consistent way to order a circle.

When the target space is ordered we can define one-sided interval sets, \[ I(x) = \left\{ x' \in X \mid x_{\mathrm{min}} \le x' \le x \right\}, \] where \(x_{\mathrm{min}}\) is the unique smallest value in \(X\), as well as two-sided interval sets, \[ I(x_{1}, x_{2}) = \left\{ x' \in X \mid x_{1} < x' \le x_{2} \right\}. \]

The cumulative distribution function is constructed from the expectation values of all of the indicator functions corresponding to one-sided interval sets, \[ \Pi(x) = \mathbb{E}_{\pi} [ \mathbb{I}_{I(x)} ] = \mathbb{P}_{\pi} [ I(x) ]. \]





Cumulative distribution functions not only give the probability of one-sided intervals, the difference between cumulative distribution functions also gives the probability of two-sided intervals. Because \[ I(x_{2}) = I(x_{1}) \cup I(x_{1}, x_{2}) \] we have \[ \mathbb{P}_{\pi} [ I(x_{2}) ] = \mathbb{P}_{\pi}[ I(x_{1})] + \mathbb{P}_{\pi} [ I(x_{1}, x_{2}) ] \] or \[ \begin{align*} \mathbb{P}_{\pi} [ I(x_{1}, x_{2}) ] &= \mathbb{P}_{\pi} [ I(x_{2}) ] - \mathbb{P}_{\pi} [ I(x_{1}) ] \\ &= \Pi(x_{2}) - \Pi(x_{1}). \end{align*} \]





Quantiles are inverses of the cumulative distribution function for a given target probability, \[ x_{p} = \Pi^{-1} (p). \] More descriptively, a \(p\)-quantile is the point in the target space below which we’ve accumulated the probability \(p\). The most common quantile is \(x_{0.5}\), or the median, which quantifies the point in the target space that splits the total probability allocation into equal halves below and above. This quantifies a different sense of where a probability distribution concentrates than the mean does. Tail quantiles, such as \(x_{0.05}\) and \(x_{0.95}\), quantify a different sense of the breadth of a probability distribution than the variance.
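On a discrete ordered space the cumulative distribution function is just a running sum of atomic probabilities, and quantiles follow from its generalized inverse. The sketch below uses a hypothetical toy allocation, anticipating the representations of Section 4.

```r
# Hypothetical allocation over the ordered points {0, 1, 2, 3}
x <- 0:3
p <- c(0.1, 0.2, 0.3, 0.4)
Pi <- cumsum(p)   # cumulative distribution function evaluated at each point

# Generalized inverse: the smallest x at which we've accumulated at least prob
quantile_fn <- function(prob) x[min(which(Pi >= prob))]

quantile_fn(0.5)    # median
quantile_fn(0.95)   # upper tail quantile
```

For this allocation the median is 2 while the \(0.95\)-quantile is 3.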

Because a reparameterization can arbitrarily reorder a space, a cumulative distribution function is defined only within the context of a given parameterization. As with moments, quantiles in different parameterizations isolate different properties of the abstract probability system.

3.2.3 Histograms

A histogram is the collection of probabilities over a sequence of disjoint intervals, \(\left\{ B_{1}, \ldots, B_{N} \right\}\), \[ \left\{ \mathbb{P}_{\pi} [ B_{1} ], \ldots, \mathbb{P}_{\pi} [ B_{N} ] \right\}, \] or, equivalently, the expectation values of the corresponding indicator functions, \[ \left\{ \mathbb{E}_{\pi} [ \mathbb{I}_{B_{1}} ], \ldots, \mathbb{E}_{\pi} [ \mathbb{I}_{B_{N}} ] \right\}. \]





For one-dimensional, and occasionally two-dimensional target spaces, histograms are useful ways of visualizing the overall behavior of a probability distribution.

4 Representing Probability Distributions with Densities

As we saw in the previous section, formal probability theory is simply the study of probability distributions that distribute a finite, conserved quantity across a space, the expectation values that such a distribution induces, and how the distribution behaves under transformations of the underlying space. While there is myriad complexity in the details of that study, the basic concepts are relatively straightforward.

Still, those concepts have so far been presented in the abstract without any concrete examples to provide context. Why weren’t there any examples to accompany the discussion in the previous section? Because, unfortunately, probability distributions cannot be explicitly defined in most problems!

Recall that a probability distribution is defined as a map from the well-defined subsets in a \(\sigma\)-algebra to real numbers in the interval \([0, 1]\). For a \(\sigma\)-algebra with only a few elements we could completely specify this mapping with a lookup table that associates a probability with each set, but in practice almost all \(\sigma\)-algebras are simply too large for this to be practical. In order to implement probability distributions in practice we need a way to define this mapping algorithmically so that we can calculate probabilities and expectation values on the fly.

Fortunately most probabilistic systems admit representations that faithfully recover all probabilities and expectation values on demand and hence completely specify a given probability distribution. In particular, density representations exploit the particular structure of \(X\) to fully characterize a probability distribution with special functions.

One of the most common sources of misunderstanding in probability theory is the confusion of an abstract probability distribution with these representations. In an understandable desire to present concrete examples as early as possible, many introductory treatments of probability theory begin immediately with these representations only to end up muddling their relationship with the abstract mathematics that they are characterizing. Now that we have worked through the abstract definitions, however, we can introduce representations in a more principled way.

4.1 Probability Mass Functions

When \(X\) is a discrete space any probability distribution can be represented by a probability mass function that assigns a probability to each atomic element of the space, \(\pi : X \rightarrow [0, 1]\). From these atomic allocations we can reconstruct the probability of any subset \(A \subset X\) as \[ \mathbb{P}_{\pi} [ A ] = \sum_{x \in A} \pi(x), \] and any expectation value as \[ \mathbb{E}_{\pi} [ f ] = \sum_{x \in X} \pi(x) f(x). \]

Probability mass functions also have the elegant property of transforming quite naturally along a measurable transformation \(g : X \rightarrow Y\), \[ \pi_{*} (y) = \sum_{x \in g^{-1}(y)} \pi(x). \] This implies that the probability of a set in \(Y\) is given by \[ \mathbb{P}_{\pi_{*}} [ B ] = \sum_{y \in B } \sum_{x \in g^{-1}(y)} \pi(x), \] while expectation values of functions on \(Y\) are given by \[ \mathbb{E}_{\pi_{*}} [ f ] = \sum_{x \in X } \pi(x) \, f(g(x)). \]

For a reparameterization, \(g\) is one-to-one and \(g^{-1}(y)\) identifies a single point. Hence the pushforward probability mass function is just a relabeling of the original probability mass function, as expected.
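The pushforward formula is easy to demonstrate on a toy discrete space. The parity map \(g\) and the allocation below are hypothetical, chosen purely for illustration.

```r
# Hypothetical probability mass function on X = {0, 1, 2, 3}
x <- 0:3
p <- c(0.1, 0.2, 0.3, 0.4)

# A measurable map g : X -> Y = {0, 1} collapsing points by parity
g <- function(x) x %% 2

# pi_*(y) = sum of pi(x) over the preimage g^{-1}(y)
push <- sapply(0:1, function(y) sum(p[g(x) == y]))
push

# Expectations of a function on Y agree whether computed with the pushforward
# mass function on Y or pulled back to X through g
f <- function(y) y
all.equal(sum(push * f(0:1)), sum(p * f(g(x))))
```

The even points contribute \(0.1 + 0.3 = 0.4\) and the odd points \(0.2 + 0.4 = 0.6\), and the two expectation computations agree as the formulas above require.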

4.1.1 The Poisson Family of Probability Mass Functions

Consider, for example, the Poisson probability mass functions which allocate probabilities across the natural numbers, \(X = \left\{0, 1, 2, \ldots \right\}\), \[ \pi(x \mid \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \] depending on the particular value of the intensity parameter \(\lambda\). In particular, there is no singular Poisson probability mass function but rather a family of them, and a corresponding family of probability distributions specified by each member of that family.

In R the Poisson probability mass function is implemented as dpois,

c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")

plot_poisson <- function(l) {
  p <- hist(0, breaks=0:21-0.5, plot=FALSE)
  p$counts <- dpois(0:20, l)

  par(mar = c(8, 6, 0, 0.5))
  plot(p, main="", col="white", border=c_dark_highlight,
       xlab="x", xlim=c(-0.5, 20.5), 
       ylab="Probability Mass", ylim=c(0, 0.2), yaxt='n',
       cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
}

l <- 5
plot_poisson(l)
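Assuming the standard behavior of dpois, we can also sanity check the family numerically: the probabilities should sum to one, and the mean should recover the intensity \(\lambda\). The truncation of the support at 200 is an arbitrary choice; the omitted tail probability is negligible for this intensity.

```r
l <- 5
xs <- 0:200   # truncated support; the omitted tail is negligible for l = 5

sum(dpois(xs, l))        # total probability, approximately 1
sum(xs * dpois(xs, l))   # mean, approximately lambda = 5
```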

The corresponding cumulative distribution function has an analytic form and in R is given by ppois

B <- 21
xs <- rep(0:B, each=2)
cdfs <- sapply(1:length(xs), function(n) ifelse(n > 1, ppois(xs[n - 1], l), 0))

par(mar = c(8, 6, 0, 0.5))
plot(xs, cdfs, type="l", main="", col=c_dark_highlight, 
     xlab="x", xlim=c(-0.5, 20.5), 
     ylab="Cumulative Probability", ylim=c(0, 1), yaxt='n',
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)

4.1.2 Computing Probabilities

It’s taken quite the buildup, but we are now finally ready to compute our first probability. Let’s start with the probability allocated to the set \(A_{1}\) that we defined in Section 1.

plot_poisson_probs <- function(A, l) {
  bin_edges <- c(A, A[length(A)] + 1) - 0.5
  p_sum <- hist(A[1], breaks=bin_edges, plot=FALSE)
  p_sum$counts <- dpois(A, l)

  plot(p_sum, col=c_dark, border=c_dark_highlight, add=T)
}

plot_poisson(l)
plot_poisson_probs(A1, l)

We can compute the probability by exhaustively summing over the elements in \(A_{1}\),

sum(sapply(A1, function(x) dpois(x, l)))
[1] 0.6016024

or we can use the difference of cumulative distribution functions,

poisson_prob <- function(A, l) {
  return(ppois(A[length(A)], l) - ppois(A[1] - 1, l))
}

poisson_prob(A1, l)
[1] 0.6016024

which delightfully yield the same answer.

Given this rousing success we can then ask: what is the probability allocated to the complement of \(A_{1}\)?

plot_poisson(l)
plot_poisson_probs(0:(A1[1] - 1), l)
plot_poisson_probs((A1[length(A1)] + 1):20, l)