Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

2 Operators and Fixed Points

Authors
Affiliations
New York University
Australian National University

This chapter discusses techniques that underlie the optimization and fixed-point methods used throughout the book. Many of these techniques relate to order. Order-theoretic concepts will prove valuable not only for fixed-point methods but also for understanding the main concepts in dynamic programming. Chapter 8 will show core components of dynamic programming can be expressed in terms of simple order-theoretic constructs.

2.1Stability

In this section, we discuss algorithms for computing fixed points and analyze their convergence.

2.1.1Conjugate Maps

First we treat a technique for simplifying analysis of stability and fixed points that we’ll apply in applications.

To illustrate the idea, suppose that we want to study dynamics induced by a self-map TT on URnU \subset \RR^n. We might want to know if a unique fixed point of TT exists and if iterates of TT converge to a fixed point. One approach is to apply fixed-point theory to TT.

However, sometimes there is an easier approach: Transform TT into a “simpler” map T^\hat T and study its the fixed-point properties. For this to work, we need to be sure that useful properties we discover about T^\hat T will transmit themselves back to properties of TT, the map that actually interests us.

This section explains a notion of conjugacy that formalizes these ideas. The study of conjugate relationships originated in the field of dynamical systems theory. Later we will apply this approach to operators that arise in contexts of dynamic programming and recursive preferences.

2.1.1.1Conjugacy

A dynamical system is a pair (U,T)(U, T), where UU is any set and TT is a self-map on UU. Two dynamical systems (U,T)(U, T) and (U^,T^)(\hat U, \hat T) are said to be conjugate under Φ\Phi if Φ\Phi is a bijection from UU into U^\hat U such that T=Φ1T^ΦT = \Phi^{-1} \circ \hat T \circ \Phi on UU.

Conjugacy of (U,T)(U, T) and (U^,T^)(\hat U, \hat T) under Φ\Phi can be understood as follows: Shifting a point uUu \in U to TuT u via TT is equivalent to moving uu into U^\hat U via u^=Φu\hat u = \Phi u, applying T^\hat T, and then moving the result back using Φ1\Phi^{-1}:

Conjugacy of dynamical systems (U, T) and (\\hat U, \\hat T) under \\Phi.

Figure 2.1:Conjugacy of dynamical systems (U,T)(U, T) and (hatU,hatT)(\\hat U, \\hat T) under Phi\\Phi.

The next two exercises illustrate benefits of establishing a conjugate relationship between two dynamical systems.

The next result summarizes the most important consequences of our findings.

In particular, if TT has a unique fixed point on UU if and only if T^\hat T has a unique fixed point on U^\hat U.

2.1.1.2Topological Conjugacy

Let UU and U^\hat U be two subsets of Rn\RR^n. A function Φ\Phi from UU to U^\hat U is called a homeomorphism if it is continuous, bijective, and its inverse Φ1\Phi^{-1} is also continuous.

Assume again that UU and U^\hat U are subsets of Rn\RR^n. In this setting, we say that dynamical systems (U,T)(U, T) and (U^,T^)(\hat U, \hat T) are topologically conjugate under Φ\Phi if (U,T)(U, T) and (U^,T^)(\hat U, \hat T) are conjugate under Φ\Phi and, in addition, Φ\Phi is a homeomorphism.

The next exercise asks you to show that topologically conjugacy is an equivalence relation, as defined in Section A.1.

From the preceding exercises we can state the following useful result:

2.1.2Local Stability

In Section 1.2.2.2 we investigated global stability. Here we introduce local stability and provide a sufficient condition for situations in which the map is smooth.

Let UU be a subset of Rn\RR^n and let TT be a self-map on UU. A fixed point uu^* of TT in UU is called locally stable for the dynamical system (U,T)(U, T) if there exists an open set OUO \subset U such that uOu^* \in O and TkuuT^k u \to u^* as kk \to \infty for every uOu \in O. In other words, the domain of attraction for uu^* contains an open neighborhood of uu^*.

For an interior fixed point xx^* of a smooth self-map gg on an interval of R\RR, local stability holds whenever g(x)<1|g'(x^*)| < 1. The proof strategy proceeds as follows: When g(x)<1|g'(x^*)| < 1, the first-order linear approximation

g^(x)g(x)+g(x)(xx)=x+g(x)(xx)\hat g(x) \coloneq g(x^*) + g'(x^*)(x - x^*) = x^* + g'(x^*)(x - x^*)

is a contraction of modulus g(x)|g'(x^*)| with unique fixed point xx^*. Hence all trajectories of g^\hat g converge to xx^*. Moreover, since gg and g^\hat g are similar in a neighborhood of xx^*, the same is true for trajectories of gg starting close to xx^*.

The next theorem formalizes this line of argument and extends it to multiple dimensions. In stating the theorem, we take TT to be a self-map on UU with fixed point uu^* in UU and assume that TT is continuously differentiable on UU. Recall that the Jacobian of TT at uUu \in U is

JT(u)(T1u1(u)T1un(u)Tnu1(u)Tnun(u))whereTu=(T1uTnu),J_T(u) \coloneq \begin{pmatrix} \frac{\partial T_1}{\partial u_1}(u) & \cdots & \frac{\partial T_1}{\partial u_n}(u) \\ & \cdots & \\ \frac{\partial T_n}{\partial u_1}(u) & \cdots & \frac{\partial T_n}{\partial u_n}(u) \end{pmatrix} \quad \text{where} \quad Tu = \begin{pmatrix} T_1 u \\ \vdots \\ T_n u \end{pmatrix} ,

and let T^\hat T be the first-order approximation to TT at uu^*:

T^u=u+JT(u)(uu)(uU).\hat Tu = u^* + J_T(u^*) (u - u^*) \qquad (u \in U).

Combining this theorem with the result of Exercise 2.6, we see that, under the conditions of the theorem, uu^* is globally stable for (O,T)(O, T), and hence locally stable for (U,T)(U, T), whenever (O,T^)(O, \hat T) is globally stable. By the Neumann series lemma, the first-order approximation will be globally stable whenever JT(u)J_T(u^*) has spectral radius less than one. Thus, we have

2.1.3Convergence Rates

To discuss relative rates of convergence we fix a norm \| \cdot \| on Rn\RR^n and take a sequence (uk)k0Rn(u_k)_{k \geq 0} \subset \RR^n converging to uRnu^* \in \RR^n. Set ekukue_k \coloneq \| u_k - u^* \| for all kk. We say that (uk)(u_k) converges to uu^* at rate at least qq if q1q \geq 1 and, for some β(0,)\beta \in (0, \infty) and NNN \in \NN, we have

ek+1βekq for all kN.e_{k+1} \leq \beta e_k^q \quad \text{ for all } k \geq N.

We say that convergence occurs at rate qq if, in addition,

lim supkek+1ekq=β.\limsup_{k \to \infty} \frac{e_{k+1}}{e_k^q} = \beta .

In addition,

Orders of convergence are studied in the neighborhood of zero, implying that higher orders are faster. For example, suppose ϵkuku\epsilon_k \coloneq \| u_k - u^* \| is the size of the error and that uku_k converges to uu^* quadratically. If, say, ϵk=105\epsilon_k = 10^{-5}, then ϵk+1β1010\epsilon_{k+1} \approx \beta 10^{-10}. Provided that β\beta is not large, the number of accurate digits roughly doubles at each step.

Successive approximations typically converge at a linear rate. To see this in one dimension, try the following exercise.

(The restriction that 0<Tu<10 < |T' u^*| < 1 in Exercise 2.7 is mild. For example, given convergence of successive approximation to the fixed point, we expect Tu<1|T' u^*| < 1, since this inequality implies that uu^* is locally stable.)

2.1.4Gradient-Based Methods

While successive approximation always converges when global stability holds, faster fixed-point algorithms can often be obtained by leveraging extra information, such as gradients. Newton’s method is an important gradient-based technique. (As we discuss in Section 5.1.4.2, Newton’s method is a key component of algorithms for solving dynamic programs.)

While Newton’s method is often used to solve for roots of a given function, here we use it to find fixed points.

2.1.4.1Newton Fixed-Point Iteration

Suppose first that TT is a differentiable self-map on an open set URnU \subset \RR^n and that we want to find a fixed point of TT. Our plan is to start with a guess u0u_0 of the fixed point and then update it to u1u_1. To do this we use the first-order approximation T^\hat T of TT around u0u_0 and solve for the fixed point of T^\hat T -- which we can do exactly since T^\hat T is linear. We take this new point as u1u_1 and then continue.

If TT is one-dimensional then T^uTu0+Tu0(uu0)\hat T u \coloneq T u_0 + T' u_0 (u - u_0). For n>1n > 1 we replace Tu0T' u_0 with the Jacobian of TT at u0u_0, which we write as JT(u0)J_T(u_0). We then solve T^u1=u1\hat T u_1 = u_1 for u1u_1, which gives

u1=(IJT(u0))1(Tu0JT(u0)u0)(I is the n×n identity).u_1 = (I - J_T(u_0))^{-1} (Tu_0 - J_T(u_0) u_0) \qquad \text{(} I \text{ is the } n \times n \text{ identity)}.

Figure Figure 2.2 shows u0u_0 and u1u_1 when n=1n=1 and Tu=1+u/(u+1)Tu = 1 + u/(u + 1) and u0=0.5u_0 = 0.5. The value u1u_1 is the fixed point of the first-order approximation T^\hat T. It is closer to the fixed point of TT than u0u_0, as desired.

First step of Newton’s method applied to T

Figure 2.2:First step of Newton’s method applied to TT

Newton’s (fixed-point) method continues in the same way, from u1u_1 to u2u_2 and so on, leading to the sequence of points

uk+1=QukwhereQu(IJT(u))1(TuJT(u)u)k=0,1,u_{k+1} = Qu_k \quad \text{where} \quad Qu \coloneq (I - J_T(u))^{-1} (Tu - J_T(u) u) \qquad k = 0, 1, \ldots

We need not write a new solver, since the successive approximation function in Listing 4 can be applied to QQ defined in (2.13).

2.1.4.2Rates of Convergence

Figure Figure 2.3 shows both the Newton approximation sequence and the successive approximation sequence applied to computing the fixed point of the Solow--Swan model from Section 1.2.3.2. We use two different initial conditions (top and bottom subfigures). Both sequences converge, but the Newton sequences converge faster.

Newton’s method applied to the Solow--Swan update rule

Figure 2.3:Newton’s method applied to the Solow--Swan update rule

A fast rate of convergence for Newton scheme can be confirmed theoretically: Under mild conditions, there exists a neighborhood of the fixed point within which the Newton iterates converge quadratically. See, for example, Theorem 5.4.1 of Atkinson & Han (2005). Some dynamic programming algorithms take advantage of this fast rate of convergence (see Section 5.1.4.3).

2.1.4.3Speed versus Robustness

Sometimes we can accelerate computations by exploiting a problem’s special structure (e.g., differentiability, convexity, monotonicity). But we often face a trade-off between speed and robustness to details of problem specification. More robust methods impose less structure.

Relative to other algorithms, successive approximation tends to be robust but slow. We saw one illustration of the relatively slow rate of convergence in Figure Figure 2.3. But we can also see its relatively strong robustness properties via the same example, by inspecting Figure Figure 2.4, which compares the update rule of successive approximation (the function gg) with the update rule for Newton’s method (the function QQ in (2.13)). Also plotted is the dashed 45 degree line.

The parameterization is the same as for the top subfigure in Figure Figure 1.7. As previously discussed, the shape of gg implies global convergence of successive approximation. However, QQ is well-behaved near the fixed point (i.e., very flat and hence strongly contractive) but poorly behaved away from the fixed point. This illustrates that Newton’s method is fast but generally less robust.

Robustness of successive approximation versus Newton’s method

Figure 2.4:Robustness of successive approximation versus Newton’s method

2.1.4.4Parallelization

We have discussed rates of convergence for fixed-point methods. Mathematicians and computer scientists also analyze algorithms via worst-case complexity, which measures the number of fundamental operations (e.g., addition and multiplication of floating point numbers) when an algorithm acts on data that is least favorable for good performance. These measures are attractive because they are independent of the software and hardware platforms on which algorithms are implemented.

Software and hardware matter not just for absolute performance of algorithms but also for relative performance. For example, although a single update step in successive approximation can often be partially parallelized, the algorithm is inherently serial, in the sense that the (k+1)(k+1)-th iterate cannot be computed until iterate kk is available. Moreover, because the rate of convergence is typically slow (i.e., linear), there can be many small serial steps. This limits parallelization.

Newton’s method is also serial to some degree, since we are just iterating with a different map (the operator QQ in (2.13)). However, because it involves inverting matrices of possibly high dimension, each step is computationally intensive. At the same time, since the rate of convergence is faster, we have to take fewer steps. In this sense, the algorithm is less serial -- it involves a smaller number of more expensive steps. Because it is less serial, Newton’s method offers far more potential for parallelization. Thus, the speed gain associated with Newton’s method can become very large when using effective parallelization.

2.2Order

This section reviews key concepts from order theory.

2.2.1Partial Orders

We define partial orders and examine some of their basic properties.

2.2.1.1Partially Ordered Sets

A partial order on a nonempty set PP is a relation \preceq on P×PP \times P that, for any p,q,rp, q, r in PP, satisfies

2

  1. ppp \preceq p

  2. pqp \preceq q and qpq \preceq p implies p=qp = q and

  3. pqp \preceq q and qrq \preceq r implies prp \preceq r

  4. (reflexivity),

  5. (antisymmetry), and

  6. (transitivity).

The pair (P,)(P, \preceq) is called a partially ordered set. For convenience, we sometimes write PP for (P,)(P, \preceq) and qpq \succeq p for pqp \preceq q. The statement pqrp \preceq q \preceq r means pqp \preceq q and qrq \preceq r.

In what follows, for u,vRXu, v \in \RR^\Xsf, we write uvu \ll v if u(x)<v(x)u(x) < v(x) for all xXx \in \Xsf.

The preceding pointwise concepts extend immediately to vectors, since vectors are just real-valued functions under the identification asserted in Lemma 1.2.4. In particular, for vectors u=(u1,,un)u = (u_1, \ldots, u_n) and v=(v1,,vn)v = (v_1, \ldots, v_n) in Rn\RR^n, we write

Statements uvu \geq v and uvu \gg v are defined analogously. Figure Figure 2.5 illustrates. Naturally, \leq is called the pointwise order on Rn\RR^n.

Pointwise we have u \leq v and u \ll v but not w \leq v

Figure 2.5:Pointwise we have uvu \leq v and uvu \ll v but not wvw \leq v

A partial order \preceq on PP is called total if, for all p,qPp, q \in P, either pqp \preceq q or qpq \preceq p.

2.2.1.2Least and Greatest Elements

Given a partially ordered set (P,)(P, \preceq) and APA \subset P, we say that gPg \in P is a greatest element of AA if gAg \in A and, in addition, aA    aga \in A \implies a \preceq g. We call P\ell \in P a least element of AA if A\ell \in A and, in addition, aA    aa \in A \implies \ell \preceq a.

If AA is totally ordered, then a greatest element gg of AA is also called a maximum of AA, whereas a least element \ell of AA is also called a minimum. See Appendix Appendix A for more about maxima and minima.

2.2.1.3Sup and Inf

Concepts of suprema and infima on the real line (Appendix Appendix A) extend naturally to partially ordered sets. Given a partially ordered set (P,)(P, \preceq) and a nonempty subset AA of PP, we call uPu \in P an upper bound of AA if aua \preceq u for all aa in AA. Letting UP(A)U_P(A) be the set of all upper bounds of AA in PP, we call uˉP\bar u \in P a supremum of AA if

uˉUP(A)   and   uˉu   for all   uUP(A).\bar u \in U_P(A) \; \text{ and } \; \bar u \preceq u \; \text{ for all } \; u \in U_P(A).

Thus, uˉ\bar u is the least element (see Section 2.2.1.2) of the set of upper bounds UP(A)U_P(A), whenever it exists.

If PRP \subset \RR and \preceq is \leq, then the notion of supremum on a partially ordered set reduces to the elementary definition of the supremum for subsets of the real line discussed in Appendix Appendix A.

Letting AA be a subset of partially ordered space PP,

Suprema and greatest elements are clearly related. The next exercise clarifies this.

We call P\ell \in P a lower bound of AA if aa \succeq \ell for all aa in AA. An element ˉ\bar \ell of PP is called a infimum of AA if ˉ\bar \ell is a lower bound of AA and ˉ\bar \ell \succeq \ell for every lower bound \ell of AA. We use analogous notation to denote the infimum. For example, if A={a,b}A = \{a, b\}, then A\bigwedge A is also written as aba \wedge b.

2.2.2The Case of Pointwise Order

For us, the pointwise partial order \leq introduced in Example 2.2.2 is especially useful. In this section, we review some properties of this order. Throughout, X\Xsf is an arbitrary finite set.

2.2.2.1Suprema and Infima under a Pointwise Order

Given u,vRXu, v \in \RR^\Xsf, the symbol uvu \wedge v is possibly ambiguous because we used the symbol both for a pointwise minimum in Section 1.2.4.1 and an infimum of {u,v}\{u, v\} in Section 2.2.1.3. Fortunately, for elements of the partially ordered set (RX,)(\RR^\Xsf, \leq), these two definitions coincide. Indeed, if f(x)min{u(x),v(x)}f(x) \coloneq \min\{u(x), v(x)\} for all xXx \in \Xsf, then

  1. ff is a lower bound for {u,v}\{u, v\} in (RX,)(\RR^\Xsf, \leq), and

  2. gug \leq u and gvg \leq v implies gfg \leq f.

Hence ff is the infimum of {u,v}\{u, v\} in (RX,)(\RR^\Xsf, \leq).

A subset VV of RX\RR^\Xsf is called a sublattice of RX\RR^\Xsf if

u,vVu, v \in V implies uvVu \vee v \in V and uvVu \wedge v \in V.

Above we discussed the fact that, for a pair of functions {u,v}\{u, v\}, the supremum in (RX,)(\RR^\Xsf, \leq) is the pointwise maximum, whereas the infimum in (RX,)(\RR^\Xsf, \leq) is the pointwise minimum. The same principle holds for finite collections of functions. Thus, if {vi}{vi}iI\{ v_i \} \coloneq \{ v_i \}_{i \in I} is a finite subset of RX\RR^\Xsf, then, for all xXx \in \Xsf,

(ivi)(x)maxiIvi(x)and(ivi)(x)miniIvi(x).\left( \bigvee_i v_i \right)(x) \coloneq \max_{i \in I} v_i(x) \quad \text{and} \quad \left( \bigwedge_i v_i \right)(x) \coloneq \min_{i \in I} v_i(x).

The next example discusses greatest elements in the setting of pointwise order.

Figure Figure 2.6 helps illustrate Example 2.2.7. In this case, vv^* is not in {vσ}\{v_\sigma\} and {vσ}\{v_\sigma\} has no greatest element (since neither vσvσv_{\sigma'} \leq v_{\sigma''} nor vσvσv_{\sigma''} \leq v_{\sigma'}).

v^* is the upper envelope of \{v_\sigma\}_{\sigma \in \Sigma}

Figure 2.6:vv^* is the upper envelope of {vσ}σΣ\{v_\sigma\}_{\sigma \in \Sigma}

Given a partially ordered set (P,)(P, \preceq) and a,bPa, b \in P, the order interval [a,b][a, b] is defined as all pPp\in P such that apba \preceq p \preceq b. (If aba \preceq b fails, the order interval is empty.)

2.2.2.2Inequalities and Identities

In this section, we note some useful inequalities and identities related to the pointwise partial order on RX\RR^\Xsf. As before, X\Xsf is any finite set.

These results follow immediately from proofs of corresponding claims when f,gf,g, and hh are in R\RR. For example, by the usual triangle inequality for scalars, we have f(x)+g(x)f(x)+g(x)|f(x) + g(x)| \leq |f(x)| + |g(x)| for all xXx \in \Xsf. This is equivalent to the statement f+gf+g|f+g| \leq |f|+|g| in (i). Similarly, inequality (v) follows directly from a corresponding scalar inequality that was already proved in Exercise 1.32,.

A complete proof of lemma Lemma 2.2.1 can be found with Theorem 30.1 of Aliprantis & Burkinshaw (1998).

It is also true that, if f,g,hR+Xf, g, h \in \RR_+^\Xsf, then

(f+g)h(fh)+(gh).(f + g) \wedge h \leq (f \wedge h) + (g \wedge h).

We note the following useful inequality.

The inequality in Lemma 2.2.2 helps with dynamic programming problems that involve maximization. The next exercise concerns minimization.

We end this section with a discussion of upper envelopes. To frame the discussion, we take {Tσ}{Tσ}σΣ\{ T_\sigma \} \coloneq \{ T_\sigma \}_{\sigma \in \Sigma} to be a finite family of self-maps on a sublattice VV of RX\RR^\Xsf. Consider some properties of the operator TT on VV defined by

Tv=σΣTσv(vV).Tv = \bigvee_{\sigma \in \Sigma} \, T_\sigma \, v \qquad (v \in V).

It follows from the sublattice property that TT is a self-map on VV. In some sources, TT is called the upper envelope of the functions {Tσ}\{ T_\sigma \}. The following lemma will be useful for dynamic programming.

2.2.3Order-Preserving Maps

Order-preserving maps appear throughout the theory of dynamic programming. Here we define them and state a condition for contractivity that requires the order preserving property.

2.2.3.1Definition

Given two partially ordered sets (P,)(P, \preceq) and (Q,)(Q, \trianglelefteq), a map TT from PP to QQ is called order-preserving if, given p,pPp, p' \in P, we have

pp    TpTp.p \preceq p' \quad \implies \quad Tp \trianglelefteq Tp'.

TT is called order-reversing if, instead,

pp    TpTp.p \preceq p' \quad \implies \quad Tp' \trianglelefteq Tp.

2.2.3.2Increasing and Decreasing Functions

Regarding the definitions in (2.28) and (2.29), when (Q,)=(R,)(Q, \trianglelefteq) = (\RR, \leq), it is common to say “increasing” instead of order-preserving, and “decreasing” instead of order-reversing. We adopt this terminology. In particular, given partially ordered set (P,)(P, \preceq), we call hRPh \in \RR^P

We use the symbol iRPi\RR^P for the set of increasing functions in RP\RR^P.

The next exercise shows that, in a totally ordered setting, an increasing function can be represented as the sum of increasing binary functions.

As usual, if h ⁣:PQh \colon P \to Q and P,QRP, Q \subset \RR, then we will call hh

2.2.3.3Blackwell’s Condition

Our discussion of Banach’s Theorem in Section 1.2.2.3 showed the usefulness of contractivity. For an order-preserving operator on a subset of RX\RR^\Xsf, the following condition often simplifies establishing this property. In the statement of the lemma, UU is a subset of RX\RR^\Xsf, partially ordered by \leq, and X\Xsf is finite. Also, UU has the property that uUu \in U and cR+c \in \RR_+ implies $u

2.2.4Stochastic Dominance

So far we have discussed partial orders over vectors, functions and sets. It is also useful to have a partial order over distributions that tells us when one distribution is in some sense “larger” than another. In this section, we introduce a partial order over some distributions commonly used in economics and finance.

Let’s start with an example. Recall that a random variable XX is binomial B(n,0.5)B(n,0.5) if it counts the number of heads in nn flips of a fair coin. Figure Figure 2.7 shows two distributions, ϕ=dXB(10,0.5)\phi \eqdist X \sim B(10, 0.5) and ψ=dYB(18,0.5)\psi \eqdist Y \sim B(18, 0.5). Since YY counts over more flips, we expect it to take larger values in some sense, and we also expect its distribution ψ\psi to reflect this. How can we make these thoughts precise?

Two binomial distributions

Figure 2.7:Two binomial distributions

A standard order over distributions that captures this idea is defined as follows: Given finite set X\Xsf partially ordered by \preceq and ϕ,ψD(X)\phi, \psi \in \dD(\Xsf), we say that ψ\psi stochastically dominates ϕ\phi and write ϕFψ\phi \lefsd \psi if

xu(x)ϕ(x)xu(x)ψ(x)   for every u in iRX.\sum_x u(x) \phi(x) \leq \sum_x u(x) \psi(x) \; \text{ for every } u \text{ in } i\RR^\Xsf.

The relation F\lefsd is also called first-order stochastic dominance to differentiate it from other forms of stochastic order.

A good way to interpret first-order stochastic dominance is to suppose that an agent has preferences over outcomes in X\Xsf described by a utility function uRXu \in \RR^\Xsf. Suppose in addition that the agent prefers more to less, in the sense that uiRXu \in i\RR^\Xsf, and that the agent ranks lotteries over X\Xsf according to expected utility, so that the agent evaluates ϕD(X)\phi \in \dD(\Xsf) according to xu(x)ϕ(x)\sum_x u(x) \phi(x). Then the agent (weakly) prefers ψ\psi to ϕ\phi whenever ϕFψ\phi \lefsd \psi.

We can say more. Consider the class A\aA of all agents who (a) have preferences over outcomes in X\Xsf, (b) prefer more to less, and (c) rank lotteries over X\Xsf according to expected utility. Then ϕFψ\phi \lefsd \psi if and only if every agent in A\aA prefers ψ\psi to ϕ\phi.

To state another useful perspective on stochastic dominance, we introduce the notation

Gϕ(y)xyϕ(x)(ϕD(X),  yX).G^\phi (y) \coloneq \sum_{x \succeq y} \phi(x) \qquad (\phi \in \dD(\Xsf), \; y \in \Xsf).

For a given distribution ϕ\phi, the function GϕG^\phi is sometimes called the counter CDF (counter cumulative distribution function) of ϕ\phi.

The proof is given. Figure Figure 2.8 helps to illustrate. Here XR\Xsf \subset \RR and ϕ\phi and ψ\psi are distributions on X\Xsf. We can see that ϕFψ\phi \lefsd \psi because the counter CDFs are ordered in the sense that GϕGψG^\phi \leq G^\psi pointwise on X\Xsf.

Visualization of \phi \lefsd \psi

Figure 2.8:Visualization of ϕFψ\phi \lefsd \psi

2.2.5Parametric Monotonicity

We are often interested in whether a change in a parameter shifts an outcome up or down. For example, a parameter might appear in a central bank decision rule for pegging an interest rate, and we want to know whether increasing that parameter will increase steady state inflation. By providing sufficient conditions for monotone shifts in fixed points, results in this section can help answer such questions.

Let (P,)(P, \preceq) be a partially ordered set. Given two self-maps SS and TT on a set PP, we write STS \preceq T if SuTuS u \preceq T u for every uPu \in P and say that TT dominates SS on PP.

One might assume that, in a setting where TT dominates SS, the fixed points of TT will be larger. This can hold, as in Figure Figure 2.9, but it can also fail, as in Figure Figure 2.10. A difference between these two situations is that in Figure Figure 2.9 the map TT is globally stable. This leads us to our next result.

Ordered fixed points when global stability holds

Figure 2.9:Ordered fixed points when global stability holds

Reverse-ordered fixed points when global stability fails

Figure 2.10:Reverse-ordered fixed points when global stability fails

As an application of Proposition 2.2.7, consider again the Solow--Swan growth model kt+1=g(kt)sf(kt)+(1δ)ktk_{t+1} = g(k_t) \coloneq s f(k_t) + (1 - \delta) k_t. We saw in Section 1.2.3.2 that if f(k)=Akαf(k) = Ak^\alpha where A>0A > 0 and α(0,1)\alpha \in (0, 1), then gg is globally stable on M(0,)M \coloneq (0,\infty). Clearly kg(k)k \mapsto g(k) is order-preserving on MM. If we now increase, say, the savings rate ss, then gg will be shifted up everywhere, implying, via Proposition 2.2.7, that the fixed point also rises. Exercise 2.45 asks you to step through the details.

Figure Figure 2.11 helps illustrate the results of Exercise 2.45. The top left sub-figure shows a baseline parameterization, with A=2.0A=2.0, s=α=0.3s = \alpha = 0.3 and δ=0.4\delta=0.4. The other sub-figures show how the steady state changes as parameters deviate from that baseline.

Parametric monotonicity for the Solow--Swan model

Figure 2.11:Parametric monotonicity for the Solow--Swan model

Figure Figure 2.12 gives an illustration of the result in Exercise 2.46. Here an increase in β\beta leads to a larger continuation value. This seems reasonable, since larger β\beta indicates more concern about outcomes in future periods.

Parametric monotonicity in \beta for the continuation value

Figure 2.12:Parametric monotonicity in β\beta for the continuation value

While the preceding examples of parametric monotonicity are all one-dimensional, we will soon see that Proposition 2.2.7 can also be applied in high-dimensional settings.

2.3Matrices and Operators

Many aspects of dynamic programming are most clearly framed using operator theory. In this section, we discuss linear operators and their connections to matrices. We emphasize nonnegative matrices and so-called positive linear operators that arise naturally in dynamic programming.

2.3.1Nonnegative Matrices

We begin by reviewing basic properties of nonnegative matrices.

2.3.1.1Nonnegative Matrices and Their Powers

We call a matrix AA nonnegative and write A0A \geq 0 if all elements of AA are nonnegative. We call AA everywhere positive and write A0A \gg 0 if all elements of AA are strictly positive. A square matrix AA is called irreducible if A0A \geq 0 and k=1Ak0\sum_{k=1}^\infty A^k \gg 0. An interpretation in terms of connected networks is given in Chapter 1 of Sargent & Stachurski (2023).

Let AA be n×nn \times n. It is not always true that the spectral radius ρ(A)\rho(A) is an eigenvalue of AA.[1] However, when A0A \geq 0, the spectral radius is always an eigenvalue. The following theorem states this result and several extensions.

The convergence in (2.43) provides a sharp characterization of large powers of AA that will prove useful in what follows. The assumption that AA is everywhere positive can be weakened without affecting this convergence. A complete statement and full proof of the Perron--Frobenius theorem can be found in Meyer (2000).

We can use the Perron--Frobenius theorem to provide bounds on the spectral radius of a nonnegative matrix. Fix n×nn \times n matrix A=(aij)A = (a_{ij}) and set

2.3.1.2A Local Spectral Radius Result

Let AA be an n×nn \times n matrix. We know from Gelfand’s formula that if \| \cdot \| is any matrix norm, then Ak1/kρ(A)\|A^k\|^{1/k} \to \rho(A) as kk \to \infty. While useful, this lemma can be difficult to apply because it involves matrix norms. Fortunately, when AA is nonnegative, we have the following variation, which only involves vector norms.

The expression on the left of (2.44) is sometimes called the local spectral radius of AA at hh. Lemma 2.3.3 gives one set of conditions under which a local spectral radius equals the spectral radius. This result will be useful when we examine state-dependent discounting in Chapter 6.

For a proof of Lemma 2.3.3 see Theorem 9.1 of Krasnosel’skii et al. (1972).

2.3.1.3Markov Matrices

An n×nn \times n matrix PP is called a stochastic matrix or Markov matrix if

P0andP1=1P \geq 0 \quad \text{and} \quad P \1 = \1

where 1\1 is a column vector of ones, so that PP is nonnegative and has unit row sums. The Perron--Frobenius theorem will be useful for the following exercise.

The vector ψ\psi in part (iii) of Exercise 2.48 is called a stationary distribution for PP. Such distributions play an important role in the theory of Markov chains. We discuss their interpretation and significance in Section 3.1.2.

2.3.2A Lake Model

We illustrate the power of the Perron--Frobenius theorem by showing how it helps us analyze a model of employment and unemployment flows in a large population.

The model is sometimes called a “lake model” because there are two pools of workers: those who are currently employed and those who are currently unemployed but still seeking work. The flows between states are as follows:

We assume that all parameters lie in (0,1)(0, 1). New workers are initially unemployed.

Transition rates between two pools appear in Figure Figure 2.13. For example, the rate of flow from employment to unemployment is α(1d)\alpha (1-d), which equals the fraction of employed workers who remained in the labor market and separated from their jobs.

Lake model transition dynamics

Figure 2.13:Lake model transition dynamics

Let ete_t and utu_t be the number of employed and unemployed workers at time tt respectively. The total population (of workers) is ntet+utn_t \coloneq e_t + u_t. In view of the rates just stated, the number of unemployed workers evolves according to

ut+1=(1d)αet+(1d)(1λ)ut+bnt.u_{t+1} = (1-d) \alpha e_t + (1-d)(1-\lambda) u_t + b n_t.

The three terms on the right correspond to the newly unemployed (due to separation), the unemployed who failed to find jobs last period, and new entrants into the labor force. The number of employed workers evolves according to

et+1=(1d)(1α)et+(1d)λut.e_{t+1} = (1-d) (1- \alpha) e_t + (1-d)\lambda u_t .

Evolution of the time series for utu_t, ete_t and ntn_t is illustrated in Figure Figure 2.14. We set parameters to α=0.01\alpha = 0.01, λ=0.1\lambda = 0.1, d=0.02d = 0.02, and b=0.025b = 0.025. The initial population of unemployed and employed workers are u0=0.6u_0 = 0.6 and e0=1.2e_0 =1.2, respectively. The series grow over the long run due to net population growth.

Time series for e_t, u_t and n_t, (lake_2.jl)

Figure 2.14:Time series for ete_t, utu_t and ntn_t, (lake_2.jl)

Can we say more about the dynamics of this system? For example, what long-run unemployment rate should we expect? Also, do long-run outcomes depend heavily on the initial conditions u0u_0 and e0e_0? Can we make some general statements that hold regardless of the initial state?

To begin to address these questions, we first organize the linear system for (et)(e_t) and (ut)(u_t) by setting

xt(utet)andA((1d)(1λ)+b(1d)α+b(1d)λ(1d)(1α)).x_t \coloneq \begin{pmatrix} u_t \\ e_t \end{pmatrix} \quad \text{and} \quad A \coloneq \begin{pmatrix} (1-d)(1-\lambda) + b & (1-d) \alpha + b \\ (1-d)\lambda & (1-d) (1- \alpha) \end{pmatrix}.

With these definitions, we can write the dynamics as xt+1=Axtx_{t+1} = A x_t. As a result, xt=Atx0x_t = A^t x_0, where x0=(u0  e0)x_0 = (u_0 \; e_0)^\top.

The overall growth rate of the total labor force is g=bdg = b - d, in the sense that nt+1=(1+g)ntn_{t+1} = (1+g) n_t for all tt.

In the language of Perron--Frobenius theory, the right eigenvector xˉ\bar x is called the dominant eigenvector, since it corresponds to the dominant (i.e., largest) eigenvalue ρ(A)\rho(A). This eigenvector plays an important role in determining long-run outcomes. In the remainder of this section we illustrate this fact.

To begin, recall that αxˉ\alpha \bar x is also a right eigenvector corresponding to the eigenvalue ρ(A)\rho(A) when α>0\alpha > 0. The set D{xR2:x=αxˉ for some α>0}D \coloneq \setntn{x \in \RR^2}{x = \alpha \bar x \text{ for some } \alpha > 0} is shown as a dashed black line in Figure Figure 2.15. The figure also shows two time paths, each of the form (xt)t0=(Atx0)t0(x_t)_{t \geq 0} = (A^t x_0)_{t \geq 0}, generated from two different initial conditions. In both cases, we see that both paths converge to DD over time. The figure suggests that paths share strong similarities in the long run that are determined by the dominant eigenvector xˉ\bar x.

Time paths x_t = A^t x_0 for two choices of x_0 (lake_1.jl)

Figure 2.15:Time paths xt=Atx0x_t = A^t x_0 for two choices of x0x_0 (lake_1.jl)

To see why this is so, we return (2.43) from to the Perron--Frobenius theorem, which tells us that, since A0A \gg 0, we have

Atρ(A)txˉ1=(1+g)t(uˉuˉeˉeˉ)for large t.A^t \approx \rho(A)^t \cdot \bar x \1^\top = (1 + g)^t \begin{pmatrix} \bar u & \bar u \\ \bar e & \bar e \end{pmatrix} \quad \text{for large } t.

As a result, for any initial condition x0=(u0  e0)x_0 = (u_0 \; e_0)^\top, we have

Atx0(1+g)t(uˉuˉeˉeˉ)(u0e0)=(1+g)t(u0+e0)(uˉeˉ)=ntxˉ,A^t x_0 \approx (1 + g)^t \begin{pmatrix} \bar u & \bar u \\ \bar e & \bar e \end{pmatrix} \begin{pmatrix} u_0 \\ e_0 \end{pmatrix} = (1 + g)^t (u_0 + e_0) \begin{pmatrix} \bar u \\ \bar e \end{pmatrix} = n_t \bar x,

where nt=(1+g)tn0n_t =(1 + g)^t n_0 and n0=u0+e0n_0 = u_0 + e_0. This says that, regardless of the initial condition, the state xtx_t scales along xˉ\bar x at the rate of population growth. This is precisely what we saw in Figure Figure 2.15.

We can provide additional interpretations to the components uˉ\bar u and eˉ\bar e of xˉ\bar x. Since ntn_t is the size of the workforce at time tt, the rate of unemployment is ut/ntu_t / n_t. As just shown, for large tt this is close to (ntuˉ)/nt=uˉ(n_t \bar u) / n_t = \bar u. Hence uˉ\bar u is the long-term rate of unemployment along the stable growth path. Similarly, the other component eˉ\bar e of the dominant eigenvector is the long-run employment rate.

In summary, the dominant eigenvector provides with both the long-run rate of unemployment and the stable growth path, to which all trajectories with positive initial conditions converge over time.

2.3.3Linear Operators

There are two ways to think about a matrix. In one definition, an n×kn \times k matrix AA is an n×kn \times k array of (real) numbers. In the second, AA is a linear operator from Rk\RR^k to Rn\RR^n that takes a vector uRku \in \RR^k and sends it to AuAu in Rn\RR^n. Let’s clarify these ideas in a setting where n=kn=k. While the matrix representation is important, the linear operator representation is more fundamental and more general.

2.3.3.1Matrices versus Linear Operators

A linear operator on Rn\RR^n is a map LL from Rn\RR^n to Rn\RR^n such that

L(αu+βv)=αLu+βLvfor all u,vRn and α,βR.L(\alpha u + \beta v) = \alpha Lu + \beta Lv \quad \text{for all } u, v \in \RR^n \text{ and } \alpha, \beta \in \RR.

(We write LuLu instead of L(u)L(u), etc.) For example, if AA is an n×nn \times n matrix, then the map from uu to AuAu defines a linear operator, since the rules of matrix algebra yield A(αu+βv)=αAu+βAvA(\alpha u + \beta v) = \alpha Au + \beta Av.

We just showed that each matrix can be regarded as a linear operator. In fact the converse is also true:

A proof of Theorem 2.3.4 can be found in Kreyszig (1978) and many other sources.

Why introduce linear operators if they are essentially the same as matrices? One reason is that, while a one-to-one correspondence between linear operators and matrices holds in Rn\RR^n, the concept of linear operators is far more general. Linear operators can be defined over many different kinds of sets whose elements have vector-like properties. This is related to the point that we made about function spaces in Remark 1.2.2.

Another reason is computational: The matrix representation of a linear operator can be tedious to construct and difficult to instantiate in memory in large problems. We illustrate this point in Section 2.3.3.3.

2.3.3.2Linear Operators on Function Space

The definition of linear operators on Rn\RR^n extends naturally to linear operators on RX\RR^\Xsf when X={x1,,xn}\Xsf = \{x_1, \ldots, x_n\}: A linear operator on RX\RR^\Xsf is a map LL from RX\RR^\Xsf to itself such that, for all u,vRXu, v \in \RR^\Xsf and α,βR\alpha, \beta \in \RR, we have L(αu+βv)=αLu+βLvL(\alpha u + \beta v) = \alpha Lu + \beta Lv. In what follows,

L(RX) the set of all linear operators on RX.\lopx \coloneq \text{ the set of all linear operators on } \RR^\Xsf.

Let LL be a function from X×X\Xsf \times \Xsf to R\RR. This function induces an operator LL from RX\RR^\Xsf to itself via

(Lu)(x)=xXL(x,x)u(x)(xX,  uRX).(Lu)(x) = \sum_{x' \in \Xsf} L(x,x') u(x') \qquad (x \in \Xsf, \; u \in \RR^\Xsf).

We use the same symbol LL on both sides of the equals sign because both represent essentially the same object (in the sense that a matrix AA can be viewed as a collection of numbers (Aij)(A_{ij}) or as a linear map uAuu \mapsto Au).

The function LL on the right-hand side of (2.56) is sometimes called the “kernel” of the operator LL. However, we will call it a matrix in what follows, since L(x,x)=L(xi,xj)L(x, x') = L(x_i, x_j) is just an n×nn \times n array of real numbers. When more precision is required, we will call it the matrix representation of LL.

In essence, the operation in (2.56) is just matrix multiplication: (Lu)(x)(Lu)(x) is row xx of the matrix product LuL u.

The eigenvalues and eigenvectors of the linear operator LL are defined as the eigenvalues and eigenvectors of its matrix representation. The spectral radius ρ(L)\rho(L) of LL is defined analogously.

We used the same symbol for the operator LL on the left-hand side of (2.56) and its matrix representation on the right because these two objects are in one-to-one correspondence. In particular, every LL(RX)L \in \lopx can be expressed in the form of (2.56) for a suitable choice of matrix (L(x,x))(L(x,x')). Readers who are comfortable with these claims can skip ahead to Section 2.3.3.3. The next lemma provides more details.

Lemma 2.3.5 needs no formal proof. Theorem 2.3.4 already tells us that (a) and (b) are in one-to-one correspondence. Also, (b) and (c) are in one-to-one correspondence because each LL(RX)L \in \lopx can be identified with a linear operator uLuu \mapsto Lu on Rn\RR^n by pairing u,LuRXu, Lu \in \RR^\Xsf with its vector representation in Rn\RR^n (see Section 1.2.4.2). Finally, (d) and (a) are in one-to-one correspondence under the identification L(xi,xj)LijL(x_i, x_j) \leftrightarrow L_{ij}.

2.3.3.3Computational Issues

At the end of Section 2.3.3.1 we claimed that working with linear operators brings some computational advantages vis-à-vis working with matrices. This section fills in some details (Readers who prefer not to think about computational issues at this point can skip ahead to Section 2.3.3.4.)

To illustrate the main idea, consider a setting where the state space X\Xsf takes the form X=Y×Z\Xsf = \Ysf \times \Zsf with Y=j|\Ysf| = j and Z=k|\Zsf| = k. A typical element of X\Xsf is x=(y,z)x = (y, z). As we shall see, this kind of setting arises naturally in dynamic programming.

Let QQ be a map from Z×Z\Zsf \times \Zsf to R\RR (i.e., a k×kk \times k matrix) and consider the operator sending uRXu \in \RR^\Xsf to LuRXLu \in \RR^\Xsf according to the rule

(Lu)(x)=(Lu)(y,z)=zZu(y,z)Q(z,z).(Lu)(x) = (Lu)(y, z) = \sum_{z' \in \Zsf} u(y, z') Q(z, z').

Since LL is a linear operator on RX\RR^\Xsf, Lemma 2.3.5 tells us that LL can be represented as an n×nn \times n matrix (L(xi,xj))=(Lij)(L(x_i,x_j)) = (L_{ij}), where n=X=j×kn = |\Xsf| = j \times k. To construct this matrix, we first need to “flatten” Y×Z\Ysf \times \Zsf into a set X={x1,,xn}\Xsf = \{x_1, \ldots, x_n\} with a single index. There are two natural ways to do this. Considering Y×Z\Ysf \times \Zsf as a two-dimensional array with typical element (yi,zj)(y_i, z_j), we can (a) stack all kk columns vertically into one long column, or (b) concatenate all jj rows into one long row. The first arrangement is called column-major ordering and is the default for languages such as Julia and Fortran. The second is called row-major ordering and is the default for languages such as Python and C. Either way we obtain a set of elements indexed by 1,,n1, \ldots, n.

After adopting one of these conventions, Lemma 2.3.5 assures us we can construct a uniquely defined n×nn \times n matrix that represents LL. Once we decide how to construct this matrix, we can instantiate it in computer memory and compute the operation uLuu \mapsto Lu by matrix multiplication.

There are, however, several disadvantages to implementing LL using this matrix-based approach. One is that constructing the matrix representation is tedious. Another is that confusion can arise when swapping between column- and row-major orderings in order to shift between languages or to communicate with colleagues. A third is that differences are introduced between computer code and the natural representation (2.57), which can be a source of bugs. A fourth issue is that an n×nn \times n matrix has to be instantiated in memory, even though the linear operation in (2.57) is only an inner product in Rk\RR^k. The last issue can be alleviated in most languages by employing sparse matrices, but doing so adds boilerplate and can be a source of inefficiency.

Because of these issues, most modern scientific computing environments support linear operators directly, as well as actions on linear operators such as inverting linear maps. These considerations encourage us to take an operator-based approach.

2.3.3.4Positive Operators and Markov Operators

Having agreed on the benefits of an operator-theoretic exposition, let us now describe some kinds of linear operators. We continue to assume that X\Xsf is a finite set with nn elements.

The set R+X\RR^\Xsf_+ of all uRXu \in \RR^\Xsf with u0u \geq 0 is called the positive cone of RX\RR^\Xsf. An operator LL(RX)L \in \lopx is called positive if LL is invariant on the positive cone; that is, if

u0        Lu0.u \geq 0 \; \implies \; Lu \geq 0.

An operator PL(RX)P \in \lopx is called a Markov operator on RX\RR^\Xsf if PP is positive and P1=1P \1 = \1. We let

M(RX) the set of all Markov operators on RX.\mopx \coloneq \text{ the set of all Markov operators on } \RR^\Xsf.

Viewed as matrices, elements of M(RX)\mopx are nonnegative matrices whose rows sum to one. The next exercise asks you to confirm this.

In the next exercise, you can think of ϕ\phi as a row vector and ϕP\phi P as premultiplying the matrix PP by this row vector. Chapter 3 uses the map ϕϕP\phi \mapsto \phi P to update marginal distributions generated by Markov chains.

Markov operators are important for us because they generate Markov dynamics, a foundation of dynamic programming. Thus, (2.63) is a rule for updating distributions by one period under the Markov dynamics specified by PP. We’ll use it often in the next chapter.

2.4Chapter Notes

Davey & Priestley (2002) provide a good introduction to partial orders and order-theoretic concepts. Our favorite books on fixed points and analysis include Ok (2007), Zhang (2012), Cheney (2013), and Atkinson & Han (2005). Good background material on order-theoretic fixed-point methods can be found in Guo et al. (2004) and Zhang (2012).

Footnotes
  1. For example, eigenvalues of A=diag(1,0)A = \diag(-1, 0) are {1,0}\{-1, 0\}. Hence ρ(A)=1=1\rho(A) = |-1| = 1, which is not an eigenvalue of AA.

References
  1. Atkinson, K., & Han, W. (2005). Theoretical Numerical Analysis (Vol. 39). Springer.
  2. Aliprantis, C. D., & Burkinshaw, O. (1998). Principles of Real Analysis (3rd ed.). Academic Press.
  3. Sargent, T., & Stachurski, J. (2023). Economic Networks: Theory and Computation. Cambridge University Press.
  4. Meyer, C. D. (2000). Matrix Analysis and Applied Linear Algebra (Vol. 71). Siam.
  5. Krasnosel’skii, M. A., Vainikko, G. M., Zabreiko, P. P., Rutitskii, Ya. B., & Stetsenko, V. Ya. (1972). Approximate Solution of Operator Equations. Springer Netherlands.
  6. Kreyszig, E. (1978). Introductory Functional Analysis with Applications (Vol. 1). Wiley New York.
  7. Zaanen, A. C. (2012). Introduction to Operator Theory in Riesz Spaces. Springer.
  8. Davey, B. A., & Priestley, H. A. (2002). Introduction to Lattices and Order. Cambridge University Press.
  9. Ok, E. A. (2007). Real Analysis with Economic Applications (Vol. 10). Princeton University Press.
  10. Zhang, Z. (2012). Variational, Topological, and Partial Order Methods with Their Applications (Vol. 29). Springer.
  11. Cheney, W. (2013). Analysis for Applied Mathematics (Vol. 208). Springer Science & Business Media.
  12. Guo, D., Cho, Y. J., & Zhu, J. (2004). Partial Ordering Methods in Nonlinear Problems. Nova Publishers.