Regression models are defined by two key assumptions: negligible confounders and covariates influencing only the centrality of the variate behavior. In order to define a particular regression model, however, we need to define the behavior of that covariate-dependent location function. Linear functions of the covariates \[ f(x) = \alpha + \beta \cdot x \] are especially ubiquitous in the statistics zeitgeist, although it's not clear if that ubiquity is due to any universal utility or just mathematical simplicity.

In this case study I consider linear functions of input covariates as local approximations of more general functional behavior within a neighborhood of covariate configurations. The theory of Taylor approximations grounds these models in an explicit context that offers interpretability and guidance for how they, and the heuristics that often accompany them, can be robustly applied in practice. Note that this is not a novel perspective -- see for example [1] -- but here we'll endeavor a comprehensive treatment for how this perspective informs Bayesian modeling.

We will begin with a review of Taylor approximation theory for both one-dimensional and multi-dimensional functions with an emphasis on how that theory informs practical modeling of local functional behavior. Then we will apply that theory to the modeling of location functions in regression models, with a specific emphasis on efficient implementations, prior modeling, and overall model checking.

1 Taylor Approximation Theory

I Almost Do (Taylor's Version)

Often the engineering of a function \(f : X \rightarrow \mathbb{R}\) is considered a global task, requiring the ability to evaluate outputs for every input \(x \in X\). In some cases, however, we don't need to consider functional behavior at every possible input but instead only locally within a small neighborhood of inputs. Modeling local functional behavior is typically a less demanding problem, especially when we equip ourselves with the theory of Taylor expansions to guide the construction of local functional behavior from interpretable polynomial functions that are straightforward to implement.

In this section we will discuss Taylor approximation theory and its applications for modeling local functional behavior first for one-dimensional input spaces and then for multi-dimensional input spaces. We will then consider the practical consequences of locality.

1.1 One-Dimensional Taylor Approximations

The 1

Let's first consider a one-dimensional input space \(X \subset \mathbb{R}\).

The first step in defining local functional behavior is specifying a local neighborhood of input points \(U \subset X\). Here we will define a local neighborhood as a base point \(x_{0} \in X\) and then a set of points around that base point, \(x_{0} \in U \subset X\).

Evaluating a function \(f: X \rightarrow \mathbb{R}\) and its derivatives at the base point provides a wealth of information about the behavior of that function immediately around that base point. In particular if \(f\) is a sufficiently nice function for inputs in \(U\) then we can completely reconstruct its local behavior from this differential information.

Formally we can Taylor expand \(f\) into an infinite polynomial whose coefficients are fixed by increasingly higher-order differential information at \(x_{0}\). For all \(x \in U\) we have \[ \begin{align*} f(x) &= \sum_{i = 0}^{\infty} \frac{1}{i!} \frac{ \mathrm{d}^{i} f }{ \mathrm{d} x^{i} } (x_{0}) \, (x - x_{0})^{i} \\ &= f(x_{0}) + \frac{ \mathrm{d} f }{ \mathrm{d} x } (x_{0}) \, (x - x_{0}) + \frac{1}{2} \frac{ \mathrm{d}^{2} f }{ \mathrm{d} x^{2} } (x_{0}) \, (x - x_{0})^{2} + \ldots. \end{align*} \] Note that the polynomials depend not on the absolute value of \(x\) but rather on only how far a given input \(x\) is from the base point \(x_{0}\). I will refer to these differences \[ \Delta x \equiv x - x_{0} \] as perturbations around \(x_{0}\), or simply perturbations for short.

This process of Taylor expansions can be thought of as a way to extrapolate functional behavior at \(x_{0}\) to nearby inputs. The initial term \(f(x_{0})\) quantifies the exact functional behavior at \(x_{0}\), which we can also consider as a crude approximation to the local functional behavior. Each higher-order term provides a successive refinement of this approximation, incorporating more sophisticated functional behavior that comes closer to the exact function. After incorporating an infinite number of terms we exactly reconstruct \(f\) for any local point \(x \in U\).

Not every function is nice enough to admit a Taylor expansion within a given local neighborhood. For example any ill-behaved derivative evaluations at \(x_{0}\) will obstruct the convergence of the infinite series. That said even if all of the derivative evaluations are well-behaved the higher-order terms might not decay fast enough for the series to converge. Some functions don't admit Taylor expansions around any base point, and some admit Taylor expansions around only specific base points. Even well-behaved functions typically admit Taylor expansions only in sufficiently small neighborhoods around any admissible base points. The more well-behaved a function is the more base points will give well-defined Taylor expansions and the further those Taylor expansions will accurately extrapolate beyond \(x_{0}\).

All of that said, even if a Taylor expansion is well-defined within a given neighborhood \(U\) around our base point \(x_{0}\) we can't actually implement it in practice because we'll never be able to evaluate the infinite number of terms, let alone sum them up to exactly recover the local functional behavior. What we can do is truncate the infinite series to give a polynomial with a finite number of terms.

Incorporating only the first \(I + 1\) terms in a Taylor expansion will not recover \(f\) exactly, but the resulting polynomial may provide a useful approximation to the local functional behavior, \[ \begin{align*} f(x) &= \sum_{i = 0}^{I} \frac{1}{i!} \frac{ \mathrm{d}^{i} f }{ \mathrm{d} x^{i} } (x_{0}) \, (x - x_{0})^{i} + \sum_{i = I + 1}^{\infty} \frac{1}{i!} \frac{ \mathrm{d}^{i} f }{ \mathrm{d} x^{i} } (x_{0}) \, (x - x_{0})^{i} \\ &= f_{I}(x; x_{0}) + R_{I}(x; x_{0}) \\ &\approx f_{I}(x; x_{0}). \end{align*} \] I will refer to the first \(I + 1\) terms of a Taylor expansion, \[ f_{I}(x; x_{0}) = \sum_{i = 0}^{I} \frac{1}{i!} \frac{ \mathrm{d}^{i} f }{ \mathrm{d} x^{i} } (x_{0}) \, (x - x_{0})^{i}, \] as an \(I\)th-order truncated Taylor polynomial or, more appropriate for our uses, an \(I\)th-order Taylor approximation.

The error of such an approximation is quantified by the remaining terms, \[ R_{I}(x; x_{0}) = \sum_{i = I + 1}^{\infty} \frac{1}{i!} \frac{ \mathrm{d}^{i} f }{ \mathrm{d} x^{i} } (x_{0}) \, (x - x_{0})^{i}, \] which I will refer to as the \(I\)th-order Taylor remainder.




If a function admits a full Taylor expansion around \(x_{0}\) then a Taylor approximation of any order will be well-defined in that neighborhood. Even when a full Taylor expansion isn't locally well-defined, however, Taylor approximations might still be. Formally if the first \(I\) derivatives of \(f\) are well-behaved at the base point then we can reconstruct the local functional behavior for \(x \in U\) as \[ f(x) = \sum_{i = 0}^{I} \frac{1}{i!} \frac{ \mathrm{d}^{i} f }{ \mathrm{d} x^{i} } (x_{0}) \, (x - x_{0})^{i} + R_{I}(x; x_{0}) \] for some remainder function \(R_{I}(x; x_{0})\) that vanishes as we get closer to the base point, \[ \lim_{x \rightarrow x_{0}} R_{I}(x; x_{0}) = 0. \] Only when a Taylor expansion is well-behaved can we identify this remainder with the infinite series above.

More practically if the first \(I + 1\) derivatives are finite not just at the base point \(x_{0}\) but also at all points in the local neighborhood \(U\) then the remainder \(R_{I}(x; x_{0})\) will also be finite for all \(x \in U\) [2, 3]. In this case the Taylor remainder provides a finite bound on the error of any Taylor approximation.

A finite error, however, does not mean that a Taylor approximation is a sufficiently good approximation for the local behavior of a function. For \(f_{I}(x; x_{0})\) to be an adequate approximation the error has to be small enough for the Taylor approximation to provide meaningful insights about the exact functional behavior around \(x_{0}\). In practice this means that we have to engineer useful Taylor approximations by modifying the neighborhood and the order.

For example a small neighborhood \(U\) constrains how far any local input \(x \in U\) can be from the given base point \(x_{0}\). This constraint then limits how large the perturbations \(x - x_{0}\) can be, and hence how the large the powers \((x - x_{0})^{i}\) from which Taylor approximations are constructed can be. Generally the smaller the neighborhood the more quickly these powers will decay and the smaller the Taylor remainder will be across \(U\). In other words shrinking the local neighborhood, and extrapolating functional behavior from \(x_{0}\) less aggressively, ensures that an \(I\)th order Taylor approximation \(f_{I}(x; x_{0})\) will better approximate \(f(x)\).




At the same time the faster the higher-order derivatives across \(U\) decay the smaller and the more accurate a given Taylor approximation will be; the flatter \(f\) is across \(U\) the better each \(f_{I}(x; x_{0})\) will approximate \(f(x)\). Equivalently the flatter the exact function is the more aggressively we can extrapolate functional behavior from \(x_{0}\), and the larger of a local neighborhood we can afford without compromising the approximation error.

When the exact function \(f\) and local neighborhood \(U\) are fixed the only way we can control the approximation error is to increase the order \(I\). If the local functional behavior across \(U\) is sufficiently complex then we will need to consider high-order Taylor approximations to achieve a given approximation error.




1.2 Multi-Dimensional Taylor Approximations

Two Is Better Than One

Taylor approximations generalize to multi-dimensional functions quite naturally once we account for all of the partial derivative information that informs how to extrapolate functional behavior from the base point in any direction. The main challenge in implementing multidimensional Taylor approximations is organizing all of these partial derivatives.

1.2.1 Basics

Tell Me Why

Generalizing the discussion in the last section we now let the input space \(X\) be an \(M\)-dimensional product space with \(M\) one-dimensional components. In other words each point \(x \in X\) is now identified with \(M\) component variables, \[ x = (x_{1}, \ldots, x_{m}, \ldots, x_{M}). \] For example a base point is defined by \(M\) variables, one for each of the component directions, \[ x_{0} = (x_{0, 1}, \ldots, x_{0, m}, \ldots, x_{0, M}). \]

The zeroth-order contribution to a Taylor expansion remains the same as in the one-dimensional case: the exact functional behavior at the base point, \[ f_{0}(x; x_{0}) = f(x_{0}) = f(x_{0, 1}, \ldots, x_{0, M}). \]

At first-order, however, we have to reconcile the fact that there is no longer a single, one-dimensional derivative. Instead first-order differential information about \(f\) is encoded in \(M\) partial derivative functions, \[ \frac{\partial f}{\partial x_{m}}(x), \] that quantify how the functional behavior changes along each component direction. These first-order partial derivatives are incorporated into a Taylor approximation by multiplying each by the corresponding component perturbation and then summing the individual contributions together, \[ f_{1}(x; x_{0}) = f(x_{0}) + \sum_{m = 1}^{M} \frac{\partial f}{\partial x_{m}}(x_{0}) \, (x_{m} - x_{0, m}). \]

Every time we go up one order we have to contend with a deluge of new partial derivatives; at \(I\)th order there are \(M^{I}\) partial derivatives, although only \[ { M + I - 1 \choose I } = \frac{ (M + I - 1)! }{ (M - 1)! \, I! } = \frac{ (M + I - 1) \cdot (M + I - 2) \cdots (M + 1) \cdot M } { I \cdot (I - 1) \cdots 2 \cdot 1 } \] are distinct due to symmetry. For example at second-order we have the \(M^{2}\) second-order partial derivative functions, \[ \frac{\partial^{2} f}{\partial x_{m_{1}} \partial x_{m_{2}} }(x), \] that are symmetric in their indices, \[ \frac{\partial^{2} f}{\partial x_{m_{1}} \partial x_{m_{2}} }(x) = \frac{\partial^{2} f}{\partial x_{m_{2}} \partial x_{m_{1}} }(x). \] Because of this symmetry there only \(M \cdot (M + 1) / 2\) distinct partial derivative functions out of the \(M^{2}\) nominal functions. I will refer to the second-order partial derivatives with \(m_{1} = m_{2}\) as on-diagonal and those with \(m_{1} \ne m_{2}\) as off-diagonal.

The second-order partial derivatives are incorporated into the multi-dimensional Taylor approximation in a similar way, except that each partial derivative is matched not with one component perturbation but rather two, \[ \begin{align*} f_{2}(x; x_{0}) &= \quad f(x_{0}) \\ & \quad + \sum_{m = 1}^{M} \frac{\partial f}{\partial x_{m}}(x_{0}) \, (x_{m} - x_{0, m}) \\ & \quad + \frac{1}{2} \sum_{m_{1} = 1}^{M} \sum_{m_{2} = 1}^{M} \frac{\partial^{2} f}{\partial x_{m_{1}} \partial x_{m_{2}} }(x_{0}) \, (x_{m_{1}} - x_{0, m_{1}}) \, (x_{m_{2}} - x_{0, m_{2}}). \end{align*} \]

More generally the \(I\)-th order contribution to a Taylor approximation is given by pairing each index with a corresponding component perturbation, \[ \frac{1}{I!} \sum_{m_{1} = 1}^{M} \cdots \sum_{m_{i} = 1}^{M} \cdot \sum_{m_{I} = 1}^{M} \frac{\partial^{I} f} {\partial x_{m_{1}} \cdots \partial x_{m_{i}} \cdot \partial x_{m_{I}} }(x_{0}) \, \prod_{i = 1}^{I} (x_{m_{i}} - x_{0, m_{i}}). \] The \(I!\) factorial coefficient accounts for the over-counting that arises when we sum over all of the partial derivatives and not just the distinct partial derivatives.

While the multi-dimensional case is conceptually straightforward, organizing the partial derivatives and their symmetries can be quite the burden.

1.2.2 Arrays Verses Tensors

We Are Never Ever Getting Back Together

One way to facilitate the implementation of multi-dimensional Taylor approximations is to organize all of the partial derivatives at a given order into arrays. In particular at the \(I\)th order we can organize the output of the \(M^{I}\) partial derivative functions into a rectangular \(I\)-dimensional array with elements \[ \mathcal{J}_{m_{1} \cdots m_{i} \cdots m_{I}} (x) = \frac{\partial^{I} f} {\partial x_{m_{1}} \cdots \partial x_{m_{i}} \cdot \partial x_{m_{I}} }(x). \] I will refer to these objects as Jacobian arrays.

For example with Jacobian arrays we can write a second-order Taylor approximation in the more compact form \[ \begin{align*} f_{2}(x; x_{0}) &= \quad f(x_{0}) + \sum_{m = 1}^{M} \frac{\partial f}{\partial x_{m}}(x_{0}) \, (x_{m} - x_{0, m}) \\ & \quad + \frac{1}{2} \sum_{m_{1} = 1}^{M} \sum_{m_{2} = 1}^{M} \frac{\partial^{2} f}{\partial x_{m_{1}} \partial x_{m_{2}} }(x_{0}) \, (x_{m_{1}} - x_{0, m_{1}}) \, (x_{m_{2}} - x_{0, m_{2}}). \\ &= \quad f(x_{0}) + \sum_{m = 1}^{M} \mathcal{J}_{m}(x_{0}) \, (x_{m} - x_{0, m}) \\ & \quad + \frac{1}{2} \sum_{m_{1} = 1}^{M} \sum_{m_{2} = 1}^{M} \mathcal{J}_{m_{1} m_{2}}(x_{0}) \, (x_{m_{1}} - x_{0, m_{1}}) \, (x_{m_{2}} - x_{0, m_{2}}) \\ &= \quad f(x_{0}) + \sum_{m = 1}^{M} \mathcal{J}_{m}(x_{0}) \, \Delta x_{m} \\ & \quad + \frac{1}{2} \sum_{m_{1} = 1}^{M} \sum_{m_{2} = 1}^{M} \mathcal{J}_{m_{1} m_{2}}(x_{0}) \, \Delta x_{m_{1}} \, \Delta x_{m_{2}}. \end{align*} \]

In this form the individual terms on the Taylor expansion start to resemble common operations in linear algebra. For example the first-order term looks like the dot product of two vectors and the second-order term looks like a quadratic form of two vectors and a matrix.

To make this similarity more explicit we might associate the one-dimensional arrays with vectors, \[ \Delta \mathbf{x} = \begin{pmatrix} \Delta x_{1} \vphantom{x_{0}} \\ \ldots \\ \Delta x_{m} \vphantom{x_{0}} \\ \ldots \\ \Delta x_{M} \vphantom{x_{0}} \end{pmatrix} = \begin{pmatrix} x_{1} - x_{0, 1} \\ \ldots \\ x_{m} - x_{0, m} \\ \ldots \\ x_{M} - x_{0, M} \end{pmatrix} \] and \[ \mathbf{J}_{1} = \begin{pmatrix} \mathcal{J}_{1}(x_{0}) \vphantom{\frac{\partial f}{\partial x_{1}}} \\ \ldots \\ \mathcal{J}_{m}(x_{0}) \vphantom{\frac{\partial f}{\partial x_{1}}} \\ \ldots \\ \mathcal{J}_{M}(x_{0}) \vphantom{\frac{\partial f}{\partial x_{1}}} \end{pmatrix} = \begin{pmatrix} \frac{\partial f}{\partial x_{1}}(x_{0}) \\ \ldots \\ \frac{\partial f}{\partial x_{m}}(x_{0}) \\ \ldots \\ \frac{\partial f}{\partial x_{M}}(x_{0}) \end{pmatrix}, \] and the two-dimensional array as a matrix, \[ \begin{align*} \mathbf{J}_{2} &= \begin{pmatrix} \mathcal{J}_{11}(x_{0}) & \ldots & \mathcal{J}_{1m'}(x_{0}) & \ldots & \mathcal{J}_{1M}(x_{0}) \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ \mathcal{J}_{m1}(x_{0}) & \ldots & \mathcal{J}_{mm'}(x_{0}) & \ldots & \mathcal{J}_{mM}(x_{0}) \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ \mathcal{J}_{M1}(x_{0}) & \ldots & \mathcal{J}_{Mm'}(x_{0}) & \ldots & \mathcal{J}_{MM}(x_{0}) \\ \end{pmatrix} \\ &= \begin{pmatrix} \frac{\partial^{2} f}{\partial x_{1} \partial x_{1} } (x_{0}) & \ldots & \frac{\partial^{2} f}{\partial x_{1} \partial x_{m'} } (x_{0}) & \ldots & \frac{\partial^{2} f}{\partial x_{1} \partial x_{M} } (x_{0}) \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ \frac{\partial^{2} f}{\partial x_{m} \partial x_{1} } (x_{0}) & \ldots & \frac{\partial^{2} f}{\partial x_{m} \partial x_{m'} } (x_{0}) & \ldots & \frac{\partial^{2} f}{\partial x_{m} \partial x_{M} }(x_{0}) \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ \frac{\partial^{2} f}{\partial x_{M} \partial x_{1} } (x_{0}) & \ldots & \frac{\partial^{2} f}{\partial x_{M} \partial x_{m'} } (x_{0}) & \ldots & \frac{\partial^{2} f}{\partial x_{M} \partial x_{M} } (x_{0}). \end{pmatrix} \end{align*} \]

With this association the second-order Taylor approximation can be written entirely with vector-vector and matrix-vector products, \[ \begin{align*} f_{2}(x; x_{0}) &= f(x_{0}) + \sum_{m = 1}^{M} \mathcal{J}_{m}(x_{0}) \, \Delta x_{m} + \frac{1}{2} \sum_{m_{1} = 1}^{M} \sum_{m_{2} = 1}^{M} \mathcal{J}_{m_{1} m_{2}}(x_{0}) \, \Delta x_{m_{1}} \, \Delta x_{m_{2}} \\ &= f(x_{0}) + \Delta \mathbf{x}^{T} \cdot \mathbf{J}_{1} + \frac{1}{2} \, \Delta \mathbf{x}^{T} \cdot \mathbf{J}_{2} \cdot \Delta \mathbf{x}. \end{align*} \] At higher-orders partial derivative arrays can be similarly be associated with tensors that maps multiple copies of the vector \(\mathbf{x}\) to a single real number.

This linear-algebraic construction can drastically facilitate the implementation of multivariate Taylor approximations. Instead of explicitly summing over each partial derivative evaluation at a given order we can gather them into arrays and then compute the overall contribution with tensor operations, many of which are implemented in accessible software packages.

That said we have to employ these associations with care because the Jacobian arrays are not actually tensors. Importantly they do not transform like tensors when the input space is transformed.

If the Jacobian arrays were actually tensors then each term in a Taylor approximation would be invariant under transformations of the input space. For example if \(\Delta \mathbf{x}\), \(\mathbf{J}_{1}\), and \(\mathbf{J}_{2}\) were transformed into \(\Delta \mathbf{x'}\), \(\mathbf{J'}_{1}\), and \(\mathbf{J'}_{2}\) resepctively then we would always have \[ \mathbf{x}^{T} \cdot \mathbf{J}_{1} = \mathbf{x'}^{T} \cdot \mathbf{J'}_{1} \] and \[ \frac{1}{2} \, \Delta \mathbf{x}^{T} \cdot \mathbf{J}_{2} \cdot \Delta \mathbf{x} = \frac{1}{2} \, \Delta \mathbf{x'}^{T} \cdot \mathbf{J'}_{2} \cdot \Delta \mathbf{x'}. \]

Unfortunately this preservation of the individual terms in a Taylor approximation isn't true. In general transforming the input space mixes the individual contributions in complex ways so that while the entire Taylor approximation is preserved the individual terms are not. In particular we cannot construct \(\mathbf{J'}_{2}\) using \(\mathbf{J}_{2}\) alone, as we would be able to if these objects were proper matrices, and instead have to use both \(\mathbf{J}_{2}\) and \(\mathbf{J}_{1}\).

Formally Jacobian arrays are related to more complicated algebraic objects knowns as jets. For an introduction to jets see for example [4].

Fortunately we don't need to learn about jets to take advantage of the superficial relationship between Jacobian arrays and tensors in practice. We just have to make sure that parameterization of the input space is fixed once we associate Jacobian arrays with tensors. In other words we have to fix the input space first before we fill up the Jacobian arrays with partial derivative evaluations in that parameterization and then apply tensor operations to those arrays. Anytime we want to reparameterize the input space we have to redo the entire construction.

The relationship between Jacobian arrays and tensors is subtle but often confused, but if we don't appreciate it enough then we put ourselves into a precarious position where it's easy to implement multi-dimensional Taylor approximations incorrectly.

1.2.3 Polynomial Geometry

No Body, No Crime

One nice feature of a Taylor approximation to functional behavior is that the order directly informs the possible shapes or geometries within the local neighborhood where the approximation is valid.

Before discussing these geometric possibilities, however, be warned that we will be using the fragile association between Jacobian arrays and tensors, here vectors and matrices, that we introduced in the previous section. Consequently the geometry of the functional behaviors will be best interpreted within a fixed parameterization of the input space.

That said, let's consider a first-order Taylor approximation built up from linear algebra operations, \[ f ( \Delta \mathbf{x} ) = f(x_{0}) + \Delta \mathbf{x}^{T} \cdot \mathbf{J}_{1}. \] Geometrically the functional behaviors encapsulated in this approximation correspond planes. Each plane is fixed to the baseline functional behavior \(f(x_{0})\) with the \(\mathbf{J}_{1}\) serving as slopes that orient the planes around that point.




An immediate consequence of this geometric interpretation is that a first-order Taylor approximation will never exhibit any local minima or maxima within the local neighborhood. Instead the extrema exist at infinity, far away from any local neighborhood.

We can verify this with a little bit of multivariable calculus. If a finite extremum \(\Delta \mathbf{x}^{*}\) exists then the gradient of the functional behavior will vanish there, \[ \begin{align*} \mathbf{0} &= \nabla f (\Delta \mathbf{x}^{*} ) \\ \mathbf{0} &= \nabla \left[ f(x_{0}) + \Delta \mathbf{x}^{T} \cdot \mathbf{J}_{1} \right]_{\Delta \mathbf{x} = \Delta \mathbf{x}^{*} } \\ \mathbf{0} &= \mathbf{J}_{1}. \end{align*} \] This constraint, however, holds only when the first-order slopes vanish, corresponding to a flat plane where all of the points are equally extreme.

On the other hand a second-order Taylor approximation, \[ f ( \Delta \mathbf{x} ) = f(x_{0}) + \Delta \mathbf{x}^{T} \cdot \mathbf{J}_{1} + \Delta \mathbf{x}^{T} \cdot \mathbf{J}_{2} \cdot \Delta \mathbf{x}, \] exhibits a much wider range of functional behaviors. In addition to the planes of the first-order approximation this model also includes paraboloids that curve around \(f(x_{0})\).

Unlike planes, however, paraboloids exhibit finite extrema. To derive where these finite extrema will be we can appeal to the vanishing gradient condition again, \[ \begin{align*} \mathbf{0} &= \nabla f (\Delta \mathbf{x}^{*} ) \\ \mathbf{0} &= \nabla \left[ f(x_{0}) + \Delta \mathbf{x}^{T} \cdot \mathbf{J}_{1} + \Delta \mathbf{x}^{T} \cdot \mathbf{J}_{2} \cdot \Delta \mathbf{x} \right]_{\Delta \mathbf{x} = \Delta \mathbf{x}^{*} } \\ \mathbf{0} &= \mathbf{J}_{1} + 2 \, \mathbf{J}_{2} \cdot \Delta \mathbf{x}^{*} \\ 2 \, \mathbf{J}_{2} \cdot \Delta \mathbf{x}^{*} &= - \mathbf{J}_{1}. \end{align*} \] If the determinant of \(\mathbf{J}_{2}\) doesn't vanish, \[ | \mathbf{J}_{2} | \ne 0, \] then this equation is satisfied by a unique point, \[ \begin{align*} 2 \, \mathbf{J}_{2} \cdot \Delta \mathbf{x}^{*} &= - \mathbf{J}_{1} \\ \Delta \mathbf{x}^{*} &= - \frac{1}{2} ( \mathbf{J}_{2} )^{-1} \cdot \mathbf{J}_{1}. \end{align*} \] Otherwise the system is ill-posed and there will be an infinite number of extrema.

Note that an extreme point is not guaranteed to fall within the local environment where a Taylor approximation is defined. If we extrapolate out to the quadratic extrema then we might encounter higher-order behaviors in the exact functional behavior that render the second-order Taylor approximation, and its implied extrema, inadequate. In other words if an extrema falls outside of the local neighborhood where a Taylor approximation has been validated then we have to treat it with at least some suspicion.

The precise shape of the parabolic behavior around a well-defined extrema will depend on the second-order partial derivatives at that point, which in this case is organized neatly into the second-order Jacobian, \[ \nabla^{2} f (\Delta \mathbf{x}^{*} ) = \mathbf{J}_{2}. \]

If \(\mathbf{J}_{2}\) is positive-definite, and all of the eigenvalues of the matrix are positive, then the functional behavior will curve upwards around the extreme point; in other words \(\Delta \mathbf{x}^{*}\) will be a minimum of the functional behavior. Likewise if \(\mathbf{J}_{2}\) is negative-definite, and all of the eigenvalues are negative, then \(\Delta \mathbf{x}^{*}\) will correspond to a maximum. In both cases the geometry of the functional behavior resembles a bowl centered at the extrema, oriented right-side up if \(\mathbf{J}_{2}\) is positive-definite and upside-down if \(\mathbf{J}_{2}\) is negative-definite.