Operators and Fixed Points - Dynamic Programming Volume I: Finite States

This chapter discusses techniques that underlie the optimization and fixed-point methods used throughout the book. Many of these techniques relate to order. Order-theoretic concepts will prove valuable not only for fixed-point methods but also for understanding the main concepts in dynamic programming. Chapter 8 will show that core components of dynamic programming can be expressed in terms of simple order-theoretic constructs.

2.1Stability¶

In this section, we discuss algorithms for computing fixed points and analyze their convergence.

2.1.1Conjugate Maps¶

First we treat a technique for simplifying the analysis of stability and fixed points. We will use it in later applications.

To illustrate the idea, suppose that we want to study dynamics induced by a self-map $T$ on $U \subset \RR^n$ . We might want to know if a unique fixed point of $T$ exists and if iterates of $T$ converge to a fixed point. One approach is to apply fixed-point theory to $T$ .

However, sometimes there is an easier approach: Transform $T$ into a “simpler” map $\hat T$ and study its fixed-point properties. For this to work, we need to be sure that useful properties we discover about $\hat T$ will transmit themselves back to properties of $T$ , the map that actually interests us.

This section explains a notion of conjugacy that formalizes these ideas. The study of conjugate relationships originated in the field of dynamical systems theory. Later we will apply this approach to operators that arise in contexts of dynamic programming and recursive preferences.

2.1.1.1Conjugacy¶

A dynamical system is a pair $(U, T)$ , where $U$ is any set and $T$ is a self-map on $U$ . Two dynamical systems $(U, T)$ and $(\hat U, \hat T)$ are said to be conjugate under $\Phi$ if $\Phi$ is a bijection from $U$ onto $\hat U$ such that $T = \Phi^{-1} \circ \hat T \circ \Phi$ on $U$ .

Conjugacy of $(U, T)$ and $(\hat U, \hat T)$ under $\Phi$ can be understood as follows: Shifting a point $u \in U$ to $T u$ via $T$ is equivalent to moving $u$ into $\hat U$ via $\hat u = \Phi u$ , applying $\hat T$ , and then moving the result back using $\Phi^{-1}$ :

The next two exercises illustrate benefits of establishing a conjugate relationship between two dynamical systems.

The next result summarizes the most important consequences of our findings.

In particular, $T$ has a unique fixed point on $U$ if and only if $\hat T$ has a unique fixed point on $\hat U$ .
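As a small numerical illustration of conjugacy (a sketch, not taken from the text), consider $U = (0, \infty)$ with $Tu = A u^\alpha$ , $\hat U = \RR$ with $\hat T y = \log A + \alpha y$ , and $\Phi = \log$ . The values $A = 2$ and $\alpha = 0.3$ are arbitrary choices made for this example. The code checks the identity $T = \Phi^{-1} \circ \hat T \circ \Phi$ at a few points and confirms that $\Phi^{-1}$ maps the fixed point of $\hat T$ to a fixed point of $T$ .

```python
import math

A, alpha = 2.0, 0.3          # illustrative parameters (assumed, not from the text)

def T(u):
    """Self-map on U = (0, inf)."""
    return A * u**alpha

def T_hat(y):
    """Affine self-map on U_hat = R."""
    return math.log(A) + alpha * y

Phi = math.log               # bijection from U onto U_hat
Phi_inv = math.exp           # its inverse

# Check T = Phi^{-1} o T_hat o Phi at several points of U
points = [0.1, 1.0, 2.5, 10.0]
max_gap = max(abs(T(u) - Phi_inv(T_hat(Phi(u)))) for u in points)

# Fixed point of the affine map T_hat, available in closed form
y_star = math.log(A) / (1 - alpha)
# Its image under Phi^{-1} should be a fixed point of T
u_star = Phi_inv(y_star)

print(max_gap, u_star, T(u_star))
```

The simpler map $\hat T$ is affine, so its fixed point is available in closed form, and conjugacy transfers that information back to $T$ .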

2.1.1.2Topological Conjugacy¶

Let $U$ and $\hat U$ be two subsets of $\RR^n$ . A function $\Phi$ from $U$ to $\hat U$ is called a homeomorphism if it is continuous, bijective, and its inverse $\Phi^{-1}$ is also continuous.

Assume again that $U$ and $\hat U$ are subsets of $\RR^n$ . In this setting, we say that dynamical systems $(U, T)$ and $(\hat U, \hat T)$ are topologically conjugate under $\Phi$ if $(U, T)$ and $(\hat U, \hat T)$ are conjugate under $\Phi$ and, in addition, $\Phi$ is a homeomorphism.

Solution to Exercise 2.1.4

From $\hat T = \Phi \circ T \circ \Phi^{-1}$ we have $\hat T^2 = \Phi \circ T \circ \Phi^{-1} \circ \Phi \circ T \circ \Phi^{-1} = \Phi \circ T^2 \circ \Phi^{-1}$ and, continuing in the same way (or using induction), $\hat T^k = \Phi \circ T^k \circ \Phi^{-1}$ for all $k \in \NN$ . Equivalently, $\hat T^k \circ \Phi = \Phi \circ T^k$ for all $k \in \NN$ . Hence, using continuity of $\Phi$ and $\Phi^{-1}$ ,

T^k u \to u^* \; \iff \; \Phi T^ku \to \Phi u^* \; \iff \; \hat T^k\Phi u \to \Phi u^*.

The next exercise asks you to show that topological conjugacy is an equivalence relation, as defined in Section A.1.

Solution to Exercise 2.1.5

Let $\mathbf U$ be the set of all dynamical systems $(U, T)$ with $U \subset \RR^n$ and write $(U, T) \sim (\hat U, \hat T)$ if these systems are topologically conjugate. It is easy to see that $\sim$ is reflexive and symmetric. Regarding transitivity, suppose that $(U, T) \sim (U', T')$ and $(U', T') \sim (U'', T'')$ . Let $F$ be the homeomorphism from $U$ to $U'$ and $G$ be the homeomorphism from $U'$ to $U''$ . Then $H \coloneq G \circ F$ is a homeomorphism from $U$ to $U''$ with inverse $F^{-1} \circ G^{-1}$ . Moreover, on $U$ , we have

T = F^{-1} \circ T' \circ F = F^{-1} \circ G^{-1} \circ T'' \circ G \circ F = (G \circ F)^{-1} \circ T'' \circ (G \circ F).

Hence $(U, T) \sim (U'', T'')$ and $\sim$ is transitive, as required.

From the preceding exercises we can state the following useful result:

2.1.2Local Stability¶

In Section 1.2.2.2 we investigated global stability. Here we introduce local stability and provide a sufficient condition for situations in which the map is smooth.

Let $U$ be a subset of $\RR^n$ and let $T$ be a self-map on $U$ . A fixed point $u^*$ of $T$ in $U$ is called locally stable for the dynamical system $(U, T)$ if there exists an open set $O \subset U$ such that $u^* \in O$ and $T^k u \to u^*$ as $k \to \infty$ for every $u \in O$ . In other words, the domain of attraction for $u^*$ contains an open neighborhood of $u^*$ .

For an interior fixed point $x^*$ of a smooth self-map $g$ on an interval of $\RR$ , local stability holds whenever $|g'(x^*)| < 1$ . The proof strategy proceeds as follows: When $|g'(x^*)| < 1$ , the first-order linear approximation

\hat g(x) \coloneq g(x^*) + g'(x^*)(x - x^*) = x^* + g'(x^*)(x - x^*)

is a contraction of modulus $|g'(x^*)|$ with unique fixed point $x^*$ . Hence all trajectories of $\hat g$ converge to $x^*$ . Moreover, since $g$ and $\hat g$ are similar in a neighborhood of $x^*$ , the same is true for trajectories of $g$ starting close to $x^*$ .

The next theorem formalizes this line of argument and extends it to multiple dimensions. In stating the theorem, we take $T$ to be a self-map on $U$ with fixed point $u^*$ in $U$ and assume that $T$ is continuously differentiable on $U$ . Recall that the Jacobian of $T$ at $u \in U$ is

J_T(u) \coloneq \begin{pmatrix} \frac{\partial T_1}{\partial u_1}(u) & \cdots & \frac{\partial T_1}{\partial u_n}(u) \\ & \cdots & \\ \frac{\partial T_n}{\partial u_1}(u) & \cdots & \frac{\partial T_n}{\partial u_n}(u) \end{pmatrix} \quad \text{where} \quad Tu = \begin{pmatrix} T_1 u \\ \vdots \\ T_n u \end{pmatrix} ,

and let $\hat T$ be the first-order approximation to $T$ at $u^*$ :

\hat Tu = u^* + J_T(u^*) (u - u^*) \qquad (u \in U).

Combining this theorem with the result of Exercise 2.1.6, we see that, under the conditions of the theorem, $u^*$ is globally stable for $(O, T)$ , and hence locally stable for $(U, T)$ , whenever $(O, \hat T)$ is globally stable. By the Neumann series lemma, the first-order approximation will be globally stable whenever $J_T(u^*)$ has spectral radius less than one. Thus, under the conditions of the theorem, $u^*$ is locally stable for $(U, T)$ whenever the spectral radius of $J_T(u^*)$ is less than one.
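The spectral radius condition can be checked numerically. The following sketch uses a map on $\RR^2$ invented for this illustration, with fixed point $u^* = (0, 0)$ . It computes the spectral radius of the Jacobian at $u^*$ and then iterates from one starting point near $u^*$ and one far from it. The map, the starting points, and the iteration counts are all assumptions of the example.

```python
import numpy as np

def T(u):
    """A smooth self-map on R^2 with fixed point u* = (0, 0) (illustrative)."""
    return np.array([0.5 * np.sin(u[1]), 0.4 * u[0] + u[1]**2])

def jacobian(u):
    """Analytical Jacobian of T at u."""
    return np.array([[0.0, 0.5 * np.cos(u[1])],
                     [0.4, 2.0 * u[1]]])

def iterate(u0, k):
    """Return T^k u0."""
    u = np.array(u0, dtype=float)
    for _ in range(k):
        u = T(u)
    return u

u_star = np.zeros(2)
rho = max(abs(np.linalg.eigvals(jacobian(u_star))))   # spectral radius at u*

near = iterate([0.1, -0.1], 50)    # start close to u*
far = iterate([0.0, 2.0], 5)       # start far from u*

print(rho, np.linalg.norm(near), np.linalg.norm(far))
```

Here the spectral radius is below one and the nearby trajectory converges to $u^*$ , while the distant trajectory moves away. The fixed point is locally stable but the system is not globally stable.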

2.1.3Convergence Rates¶

To discuss relative rates of convergence we fix a norm $\| \cdot \|$ on $\RR^n$ and take a sequence $(u_k)_{k \geq 0} \subset \RR^n$ converging to $u^* \in \RR^n$ . Set $e_k \coloneq \| u_k - u^* \|$ for all $k$ . We say that $(u_k)$ converges to $u^*$ at rate at least $q$ if $q \geq 1$ and, for some $\beta \in (0, \infty)$ and $N \in \NN$ , we have

e_{k+1} \leq \beta e_k^q \quad \text{ for all } k \geq N.

We say that convergence occurs at rate $q$ if, in addition,

\limsup_{k \to \infty} \frac{e_{k+1}}{e_k^q} = \beta .

In addition,

If $q=2$ , then we say that convergence is (at least) quadratic.
If $q=1$ and $\beta < 1$ , then we say that convergence is (at least) linear.

Orders of convergence are studied in the neighborhood of zero, implying that higher orders are faster. For example, suppose that $u_k$ converges to $u^*$ quadratically and, as above, let $e_k = \| u_k - u^* \|$ be the size of the error. If, say, $e_k = 10^{-5}$ , then $e_{k+1} \approx \beta 10^{-10}$ . Provided that $\beta$ is not large, the number of accurate digits roughly doubles at each step.
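To make the difference between linear and quadratic rates concrete, the sketch below iterates the error recursion $e_{k+1} = \beta e_k^q$ directly and counts the steps needed to push the error below a tolerance. The values $\beta = 0.5$ , $e_0 = 0.1$ and the tolerance $10^{-12}$ are arbitrary choices for illustration.

```python
def steps_to_tolerance(beta, q, e0=0.1, tol=1e-12, max_steps=1000):
    """Count steps until the error recursion e_{k+1} = beta * e_k**q falls below tol."""
    e, k = e0, 0
    while e >= tol and k < max_steps:
        e = beta * e**q
        k += 1
    return k

linear_steps = steps_to_tolerance(beta=0.5, q=1)      # linear rate
quadratic_steps = steps_to_tolerance(beta=0.5, q=2)   # quadratic rate
print(linear_steps, quadratic_steps)
```

With these values the linear recursion needs 37 steps and the quadratic recursion needs 4.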

Successive approximations typically converge at a linear rate. To see this in one dimension, try the following exercise.

(The restriction that $0 < |T' u^*| < 1$ in Exercise 2.1.7 is mild. For example, given convergence of successive approximation to the fixed point, we expect $|T' u^*| < 1$ , since this inequality implies that $u^*$ is locally stable.)
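A quick numerical check of linear convergence under successive approximation is sketched below. It uses the scalar map $Tu = 1 + u/(u+1)$ that also appears in the Newton illustration later in this chapter, whose fixed point on $(0, \infty)$ is $(1 + \sqrt 5)/2$ . The starting point and the number of iterations are assumptions of the example. The ratio of successive errors should settle near $|T' u^*|$ .

```python
import math

def T(u):
    return 1 + u / (u + 1)

def T_prime(u):
    return 1 / (u + 1)**2

u_star = (1 + math.sqrt(5)) / 2      # solves u = 1 + u / (u + 1)

u, errors = 0.5, []
for _ in range(10):
    errors.append(abs(u - u_star))   # record e_k before updating
    u = T(u)

# Ratios e_{k+1} / e_k along the trajectory
ratios = [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]
print(ratios[-1], abs(T_prime(u_star)))
```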

2.1.4Gradient-Based Methods¶

While successive approximation always converges when global stability holds, faster fixed-point algorithms can often be obtained by leveraging extra information, such as gradients. Newton’s method is an important gradient-based technique. (As we discuss in Section 5.1.4.2, Newton’s method is a key component of algorithms for solving dynamic programs.)

While Newton’s method is often used to solve for roots of a given function, here we use it to find fixed points.

2.1.4.1Newton Fixed-Point Iteration¶

Suppose first that $T$ is a differentiable self-map on an open set $U \subset \RR^n$ and that we want to find a fixed point of $T$ . Our plan is to start with a guess $u_0$ of the fixed point and then update it to $u_1$ . To do this we use the first-order approximation $\hat T$ of $T$ around $u_0$ and solve for the fixed point of $\hat T$ – which we can do exactly since $\hat T$ is linear. We take this new point as $u_1$ and then continue.

If $T$ is one-dimensional then $\hat T u \coloneq T u_0 + T' u_0 (u - u_0)$ . For $n > 1$ we replace $T' u_0$ with the Jacobian of $T$ at $u_0$ , which we write as $J_T(u_0)$ . We then solve $\hat T u_1 = u_1$ for $u_1$ , which gives

u_1 = (I - J_T(u_0))^{-1} (Tu_0 - J_T(u_0) u_0) \qquad \text{(} I \text{ is the } n \times n \text{ identity)}.

Figure 2.1 shows $u_0$ and $u_1$ when $n=1$ and $Tu = 1 + u/(u + 1)$ and $u_0 = 0.5$ . The value $u_1$ is the fixed point of the first-order approximation $\hat T$ . It is closer to the fixed point of $T$ than $u_0$ , as desired.

Figure 2.1: First step of Newton’s method applied to $T$
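The first Newton step for this example can be reproduced in a few lines. The sketch below specializes the update formula to $n = 1$ , where it reads $u_1 = (Tu_0 - T'u_0 \, u_0) / (1 - T'u_0)$ , and uses the derivative $T'u = 1/(u+1)^2$ . The function names are chosen for this illustration. With $u_0 = 0.5$ the step gives $u_1 = 2$ , while the fixed point of $T$ is $(1 + \sqrt 5)/2 \approx 1.618$ .

```python
def T(u):
    return 1 + u / (u + 1)

def T_prime(u):
    return 1 / (u + 1)**2

def newton_step(u):
    """Fixed point of the first-order approximation of T around u (scalar case)."""
    return (T(u) - T_prime(u) * u) / (1 - T_prime(u))

u0 = 0.5
u1 = newton_step(u0)
u_star = (1 + 5**0.5) / 2     # exact fixed point of T

print(u1, abs(u0 - u_star), abs(u1 - u_star))
```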

Newton’s (fixed-point) method continues in the same way, from $u_1$ to $u_2$ and so on, leading to the sequence of points

u_{k+1} = Qu_k \quad \text{where} \quad Qu \coloneq (I - J_T(u))^{-1} (Tu - J_T(u) u) \qquad k = 0, 1, \ldots

(2.2)

We need not write a new solver, since the successive approximation function in Listing 4 can be applied to $Q$ defined in (2.2).
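The sketch below shows one way this might look in Python with NumPy. The function `successive_approx` is a hypothetical stand-in for the routine in Listing 4, and the map $T$ on $\RR^2$ is an invented example with an analytical Jacobian. The operator $Q$ is evaluated by solving the linear system $(I - J_T(u)) x = Tu - J_T(u) u$ rather than forming the inverse explicitly.

```python
import numpy as np

def successive_approx(S, u0, tol=1e-10, max_iter=1000):
    """Iterate u <- S(u) until successive iterates differ by less than tol.
    A hypothetical stand-in for the book's successive approximation routine."""
    u = np.asarray(u0, dtype=float)
    for k in range(1, max_iter + 1):
        u_new = S(u)
        if np.max(np.abs(u_new - u)) < tol:
            return u_new, k
        u = u_new
    raise RuntimeError("no convergence within max_iter iterations")

def T(u):
    """Illustrative smooth self-map on R^2 (assumed example)."""
    return np.array([0.5 * np.cos(u[1]), 0.5 * np.sin(u[0]) + 0.2])

def J_T(u):
    """Analytical Jacobian of T at u."""
    return np.array([[0.0, -0.5 * np.sin(u[1])],
                     [0.5 * np.cos(u[0]), 0.0]])

def Q(u):
    """Newton fixed-point operator: solve (I - J_T(u)) x = Tu - J_T(u) u."""
    J = J_T(u)
    return np.linalg.solve(np.eye(len(u)) - J, T(u) - J @ u)

u0 = np.zeros(2)
u_newton, n_newton = successive_approx(Q, u0)   # Newton iteration
u_sa, n_sa = successive_approx(T, u0)           # plain successive approximation
print(n_newton, n_sa, u_newton)
```

Both calls use the same solver. Only the map passed to it changes.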

2.1.4.2Rates of Convergence¶

Figure 2.2 shows both the Newton approximation sequence and the successive approximation sequence applied to computing the fixed point of the Solow–Swan model from Section 1.2.3.2. We use two different initial conditions (top and bottom subfigures). Both sequences converge, but the Newton sequences converge faster.

Figure 2.2: Newton’s method applied to the Solow–Swan update rule

A fast rate of convergence for the Newton scheme can be confirmed theoretically: Under mild conditions, there exists a neighborhood of the fixed point within which the Newton iterates converge quadratically. See, for example, Theorem 5.4.1 of Atkinson & Han (2005). Some dynamic programming algorithms take advantage of this fast rate of convergence (see Section 5.1.4.3).
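A comparison of this kind can be sketched as follows for the Solow–Swan update rule $g(k) = s A k^\alpha + (1 - \delta) k$ . The parameter values ( $A = 2.0$ , $s = \alpha = 0.3$ , $\delta = 0.4$ ), the initial condition $k_0 = 1$ and the tolerance are assumptions made for this example and need not match those used for Figure 2.2. The closed-form steady state $k^* = (sA/\delta)^{1/(1-\alpha)}$ serves as a check.

```python
A, s, alpha, delta = 2.0, 0.3, 0.3, 0.4   # assumed parameter values

def g(k):
    """Solow-Swan update rule with f(k) = A * k**alpha."""
    return s * A * k**alpha + (1 - delta) * k

def g_prime(k):
    return s * A * alpha * k**(alpha - 1) + (1 - delta)

def Q(k):
    """Newton fixed-point operator for the scalar map g."""
    return (g(k) - g_prime(k) * k) / (1 - g_prime(k))

def count_iterations(S, k0, tol=1e-10, max_iter=10_000):
    """Iterate k <- S(k) and report the final point and number of steps."""
    k = k0
    for n in range(1, max_iter + 1):
        k_new = S(k)
        if abs(k_new - k) < tol:
            return k_new, n
        k = k_new
    raise RuntimeError("no convergence")

k_star = (s * A / delta)**(1 / (1 - alpha))   # closed-form steady state
k_sa, n_sa = count_iterations(g, 1.0)         # successive approximation
k_newton, n_newton = count_iterations(Q, 1.0) # Newton iteration
print(n_sa, n_newton, k_star)
```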

2.1.4.3Speed versus Robustness¶

Sometimes we can accelerate computations by exploiting a problem’s special structure (e.g., differentiability, convexity, monotonicity). But we often face a trade-off between speed and robustness to details of problem specification. More robust methods impose less structure.

Relative to other algorithms, successive approximation tends to be robust but slow. We saw one illustration of the relatively slow rate of convergence in Figure 2.2. But we can also see its relatively strong robustness properties via the same example, by inspecting Figure 2.3, which compares the update rule of successive approximation (the function $g$ ) with the update rule for Newton’s method (the function $Q$ in (2.2)). Also plotted is the dashed 45 degree line.

The parameterization is the same as for the top subfigure in Figure 1.7. As previously discussed, the shape of $g$ implies global convergence of successive approximation. By contrast, $Q$ is well-behaved near the fixed point (i.e., very flat and hence strongly contractive) but poorly behaved away from it. This illustrates that Newton’s method is fast but generally less robust.

Figure 2.3: Robustness of successive approximation versus Newton’s method
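The contrast between $g$ and $Q$ can also be seen numerically. The sketch below uses assumed parameter values ( $A = 2.0$ , $s = \alpha = 0.3$ , $\delta = 0.4$ ), which may differ from those behind Figure 2.3. It estimates the slopes of $g$ and $Q$ at the fixed point by a forward difference and then evaluates both maps at a small capital stock, where $g'$ is close to one and the denominator of $Q$ is close to zero.

```python
A, s, alpha, delta = 2.0, 0.3, 0.3, 0.4   # assumed parameter values

def g(k):
    return s * A * k**alpha + (1 - delta) * k

def g_prime(k):
    return s * A * alpha * k**(alpha - 1) + (1 - delta)

def Q(k):
    return (g(k) - g_prime(k) * k) / (1 - g_prime(k))

k_star = (s * A / delta)**(1 / (1 - alpha))

# Local slopes at the fixed point, by a forward difference
h = 1e-3
slope_g = (g(k_star + h) - g(k_star)) / h
slope_Q = (Q(k_star + h) - Q(k_star)) / h

# Away from the fixed point: g stays in (0, inf) but Q can leave it
k_small = 0.3
print(slope_g, slope_Q, g(k_small), Q(k_small))
```

Near $k^*$ the Newton map is almost flat, while at $k = 0.3$ it returns a negative value, which lies outside the state space $(0, \infty)$ .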

2.1.4.4Parallelization¶

We have discussed rates of convergence for fixed-point methods. Mathematicians and computer scientists also analyze algorithms via worst-case complexity, which measures the number of fundamental operations (e.g., addition and multiplication of floating point numbers) when an algorithm acts on data that is least favorable for good performance. These measures are attractive because they are independent of the software and hardware platforms on which algorithms are implemented.

Software and hardware matter not just for absolute performance of algorithms but also for relative performance. For example, although a single update step in successive approximation can often be partially parallelized, the algorithm is inherently serial, in the sense that the $(k+1)$ -th iterate cannot be computed until iterate $k$ is available. Moreover, because the rate of convergence is typically slow (i.e., linear), there can be many small serial steps. This limits parallelization.

Newton’s method is also serial to some degree, since we are just iterating with a different map (the operator $Q$ in (2.2)). However, because it involves inverting matrices of possibly high dimension, each step is computationally intensive. At the same time, since the rate of convergence is faster, we have to take fewer steps. In this sense, the algorithm is less serial – it involves a smaller number of more expensive steps. Because it is less serial, Newton’s method offers far more potential for parallelization. Thus, the speed gain associated with Newton’s method can become very large when using effective parallelization.

2.2Order¶

This section reviews key concepts from order theory.

2.2.1Partial Orders¶

We define partial orders and examine some of their basic properties.

2.2.1.1Partially Ordered Sets¶

A partial order on a nonempty set $P$ is a relation $\preceq$ on $P \times P$ that, for any $p, q, r$ in $P$ , satisfies

$p \preceq p$ (reflexivity),

$p \preceq q$ and $q \preceq p$ implies $p = q$ (antisymmetry), and

$p \preceq q$ and $q \preceq r$ implies $p \preceq r$ (transitivity).

The pair $(P, \preceq)$ is called a partially ordered set. For convenience, we sometimes write $P$ for $(P, \preceq)$ and $q \succeq p$ for $p \preceq q$ . The statement $p \preceq q \preceq r$ means $p \preceq q$ and $q \preceq r$ .

In what follows, for $u, v \in \RR^\Xsf$ , we write $u \ll v$ if $u(x) < v(x)$ for all $x \in \Xsf$ .

The preceding pointwise concepts extend immediately to vectors, since vectors are just real-valued functions under the identification asserted in Lemma 1.2.4. In particular, for vectors $u = (u_1, \ldots, u_n)$ and $v = (v_1, \ldots, v_n)$ in $\RR^n$ , we write

$u \leq v$ if $u_i \leq v_i$ for all $i \in \natset{n}$ and
$u \ll v$ if $u_i < v_i$ for all $i \in \natset{n}$ .

Statements $u \geq v$ and $u \gg v$ are defined analogously. Figure 2.4 illustrates. Naturally, $\leq$ is called the pointwise order on $\RR^n$ .

Figure 2.4: Pointwise we have $u \leq v$ and $u \ll v$ but not $w \leq v$
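In code, the pointwise comparisons reduce to elementwise tests. The following NumPy sketch uses three vectors in $\RR^2$ chosen for illustration (they are not the vectors plotted in Figure 2.4), and the helper names `leq` and `ll` are hypothetical.

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([2.0, 3.0])
w = np.array([3.0, 1.0])

def leq(a, b):
    """Pointwise order: a <= b iff a_i <= b_i for all i."""
    return bool(np.all(a <= b))

def ll(a, b):
    """Strict pointwise order: a << b iff a_i < b_i for all i."""
    return bool(np.all(a < b))

print(leq(u, v), ll(u, v), leq(w, v), leq(v, w))
```

For these vectors, neither $w \leq v$ nor $v \leq w$ holds, so the pair is not comparable under the pointwise order.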

Solution to Exercise 2.2.7

Regarding the first claim, fix $B \in \matset{m}{k}$ with $b_{ij} \geq 0$ for all $i, j$ . Pick any $i \in \natset{m}$ and $u \in \RR^k$ . By the triangle inequality, we have $|\sum_j b_{ij} u_j| \leq \sum_j b_{ij} |u_j|$ . Stacking these inequalities yields $|B u| \leq B |u|$ , as was to be shown.

Regarding the second, let $A$ and $(u_k)$ be as stated, with $u_{k+1} \leq A u_k$ for all $k$ . We aim to prove $u_k \leq A^k u_0$ for all $k$ using induction. In doing so, we observe that $u_1 \leq A u_0$ , so the claim is true at $k=1$ . Suppose now that it holds at $k-1$ . Then $u_k \leq A u_{k-1} \leq A A^{k-1} u_0 = A^k u_0$ , where the second inequality used nonnegativity of $A$ and the induction hypothesis. The claim is now proved.

A partial order $\preceq$ on $P$ is called total if, for all $p, q \in P$ , either $p \preceq q$ or $q \preceq p$ .

2.2.1.2Least and Greatest Elements¶

Given a partially ordered set $(P, \preceq)$ and $A \subset P$ , we say that $g \in P$ is a greatest element of $A$ if $g \in A$ and, in addition, $a \in A \implies a \preceq g$ . We call $\ell \in P$ a least element of $A$ if $\ell \in A$ and, in addition, $a \in A \implies \ell \preceq a$ .

If $A$ is totally ordered, then a greatest element $g$ of $A$ is also called a maximum of $A$ , whereas a least element $\ell$ of $A$ is also called a minimum. See Appendix A for more about maxima and minima.

2.2.1.3Sup and Inf¶

Concepts of suprema and infima on the real line (Appendix A) extend naturally to partially ordered sets. Given a partially ordered set $(P, \preceq)$ and a nonempty subset $A$ of $P$ , we call $u \in P$ an upper bound of $A$ if $a \preceq u$ for all $a$ in $A$ . Letting $U_P(A)$ be the set of all upper bounds of $A$ in $P$ , we call $\bar u \in P$ a supremum of $A$ if

\bar u \in U_P(A) \; \text{ and } \; \bar u \preceq u \; \text{ for all } \; u \in U_P(A).

Thus, $\bar u$ is the least element (see Section 2.2.1.2) of the set of upper bounds $U_P(A)$ , whenever it exists.

If $P \subset \RR$ and $\preceq$ is $\leq$ , then the notion of supremum on a partially ordered set reduces to the elementary definition of the supremum for subsets of the real line discussed in Appendix A.

Letting $A$ be a subset of partially ordered space $P$ ,

the supremum of $A$ is typically denoted $\bigvee A$ .
If $A = \{a_i\}_{i \in I}$ for some index set $I$ , we also write $\bigvee A$ as $\bigvee_i \, a_i$ .
If $A = \{a, b\}$ , then $\bigvee A$ is also written as $a \vee b$ .

Suprema and greatest elements are clearly related. The next exercise clarifies this.

We call $\ell \in P$ a lower bound of $A$ if $a \succeq \ell$ for all $a$ in $A$ . An element $\bar \ell$ of $P$ is called an infimum of $A$ if $\bar \ell$ is a lower bound of $A$ and $\bar \ell \succeq \ell$ for every lower bound $\ell$ of $A$ . We use analogous notation to denote the infimum. For example, if $A = \{a, b\}$ , then $\bigwedge A$ is also written as $a \wedge b$ .
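The definitions of supremum and infimum translate directly into a brute-force search on a finite partially ordered set. The sketch below is an illustration under assumptions: it takes $P$ to be the divisors of 60 ordered by divisibility, a standard example that does not appear in the text, and the function names are hypothetical.

```python
def divides(p, q):
    """Partial order on positive integers: p precedes q iff p divides q."""
    return q % p == 0

def supremum(A, P, leq):
    """Least element of the set of upper bounds of A in P, or None if absent."""
    upper = [u for u in P if all(leq(a, u) for a in A)]
    least = [u for u in upper if all(leq(u, w) for w in upper)]
    return least[0] if least else None

def infimum(A, P, leq):
    """Greatest element of the set of lower bounds of A in P, or None if absent."""
    lower = [l for l in P if all(leq(l, a) for a in A)]
    greatest = [l for l in lower if all(leq(w, l) for w in lower)]
    return greatest[0] if greatest else None

P = [d for d in range(1, 61) if 60 % d == 0]   # divisors of 60
A = [4, 6]
sup_A = supremum(A, P, divides)
inf_A = infimum(A, P, divides)

# In a different poset the same pair can fail to have a supremum
P_small = [4, 6, 24, 36]
sup_small = supremum(A, P_small, divides)
print(sup_A, inf_A, sup_small)
```

In the first poset the supremum of $\{4, 6\}$ is 12 and the infimum is 2. In the second, the upper bounds 24 and 36 are not comparable, so no least upper bound exists.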

2.2.2The Case of Pointwise Order¶

For us, the pointwise partial order $\leq$ introduced in Example 2.2.2 is especially useful. In this section, we review some properties of this order. Throughout, $\Xsf$ is an arbitrary finite set.

2.2.2.1Suprema and Infima under a Pointwise Order¶

Given $u, v \in \RR^\Xsf$ , the symbol $u \wedge v$ is possibly ambiguous because we used the symbol both for a pointwise minimum in Section 1.2.4.1 and an infimum of $\{u, v\}$ in Section 2.2.1.3. Fortunately, for elements of the partially ordered set $(\RR^\Xsf, \leq)$ , these two definitions coincide. Indeed, if $f(x) \coloneq \min\{u(x), v(x)\}$ for all $x \in \Xsf$ , then

(i) $f$ is a lower bound for $\{u, v\}$ in $(\RR^\Xsf, \leq)$ , and

(ii) $g \leq u$ and $g \leq v$ implies $g \leq f$ .

Hence $f$ is the infimum of $\{u, v\}$ in $(\RR^\Xsf, \leq)$ .

A subset $V$ of $\RR^\Xsf$ is called a sublattice of $\RR^\Xsf$ if

$u, v \in V$ implies $u \vee v \in V$ and $u \wedge v \in V$ .

Above we discussed the fact that, for a pair of functions $\{u, v\}$ , the supremum in $(\RR^\Xsf, \leq)$ is the pointwise maximum, whereas the infimum in $(\RR^\Xsf, \leq)$ is the pointwise minimum. The same principle holds for finite collections of functions. Thus, if $\{ v_i \} \coloneq \{ v_i \}_{i \in I}$ is a finite subset of $\RR^\Xsf$ , then, for all $x \in \Xsf$ ,

\left( \bigvee_i v_i \right)(x) \coloneq \max_{i \in I} v_i(x) \quad \text{and} \quad \left( \bigwedge_i v_i \right)(x) \coloneq \min_{i \in I} v_i(x).
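With functions on a finite set stored as rows of an array, these pointwise suprema and infima are column-wise maxima and minima. The following NumPy sketch uses made-up values on a three-point set. It also tests whether the supremum belongs to the family, which is the case exactly when the family has a greatest element.

```python
import numpy as np

# Rows are functions v_i on a three-point set X (illustrative values)
V = np.array([[1.0, 4.0, 2.0],
              [3.0, 0.0, 2.5],
              [2.0, 1.0, 1.0]])

v_sup = V.max(axis=0)   # pointwise maximum = supremum of the family
v_inf = V.min(axis=0)   # pointwise minimum = infimum of the family

# The family has a greatest element exactly when v_sup is one of its members
has_greatest = any(np.array_equal(row, v_sup) for row in V)
print(v_sup, v_inf, has_greatest)
```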

The next example discusses greatest elements in the setting of pointwise order.

Figure 2.5 helps illustrate Example 2.2.7. In this case, $v^*$ is not in $\{v_\sigma\}$ and $\{v_\sigma\}$ has no greatest element (since neither $v_{\sigma'} \leq v_{\sigma''}$ nor $v_{\sigma''} \leq v_{\sigma'}$ ).

Figure 2.5: $v^*$ is the upper envelope of $\{v_\sigma\}_{\sigma \in \Sigma}$

Solution to Exercise 2.2.21

Suppose first that $v^* \in \{v_\sigma\}$ . Since $v_\sigma \leq v^*$ for all $\sigma$ , the function $v^*$ is the greatest element. Regarding the second claim, suppose, seeking a contradiction, that $v^* \notin \{v_\sigma\}$ and $\bar v$ is a greatest element of $\{v_\sigma\}$ . By definition, $v_\sigma \leq \bar v$ for all $\sigma$ , so taking the pointwise maximum gives $v^* \leq \bar v$ . At the same time, since $\bar v$ is a greatest element, we have $\bar v \in \{v_\sigma\}$ , and therefore $\bar v \leq \max_\sigma v_\sigma = v^*$ . Putting the two inequalities together gives $\bar v = v^*$ , which in turn implies that $v^* \in \{v_\sigma\}$ . Contradiction.

Given a partially ordered set $(P, \preceq)$ and $a, b \in P$ , the order interval $[a, b]$ is defined as all $p\in P$ such that $a \preceq p \preceq b$ . (If $a \preceq b$ fails, the order interval is empty.)

Solution to Exercise 2.2.22

Let $I_a \coloneq [a_1, a_2]$ and $I_b \coloneq [b_1, b_2]$ be two order intervals in $V$ . Consider the order interval $I \coloneq [a_1 \vee b_1, a_2 \wedge b_2]$ . If $h \in I$ , then $h \geq a_1 \vee b_1$ , so $h \geq a_1$ and $h \geq b_1$ . A similar argument gives $h \leq a_2$ and $h \leq b_2$ . Hence $h \in I_a \cap I_b$ . Working in the other direction, it is not difficult to show that $h \in I_a \cap I_b$ implies $h \in I$ . Hence $I = I_a \cap I_b$ . In particular, $I_a \cap I_b$ is an order interval in $V$ .

2.2.2.2Inequalities and Identities¶

In this section, we note some useful inequalities and identities related to the pointwise partial order on $\RR^\Xsf$ . As before, $\Xsf$ is any finite set.

Lemma 2.2.1

For $f, g, h \in \RR^\Xsf$ , the following statements are true:

(i) $|f + g| \leq |f| + |g|$ .

(ii) $(f \wedge g) + h = (f + h) \wedge (g + h)$ and $(f \vee g) + h = (f + h) \vee (g + h)$ .

(iii) $(f \vee g) \wedge h = (f \wedge h) \vee (g \wedge h)$ and $(f \wedge g) \vee h = (f \vee h) \wedge (g \vee h)$ .

(iv) $|f \wedge h - g \wedge h | \leq |f - g|$ .

(v) $|f \vee h - g \vee h | \leq |f - g|$ .

These results follow immediately from proofs of corresponding claims when $f, g$ , and $h$ are in $\RR$ . For example, by the usual triangle inequality for scalars, we have $|f(x) + g(x)| \leq |f(x)| + |g(x)|$ for all $x \in \Xsf$ . This is equivalent to the statement $|f+g| \leq |f|+|g|$ in (i). Similarly, inequality (v) follows directly from a corresponding scalar inequality that was already proved in Exercise 1.3.1.

A complete proof of Lemma 2.2.1 can be found in Theorem 30.1 of Aliprantis & Burkinshaw (1998).

It is also true that, if $f, g, h \in \RR_+^\Xsf$ , then

(f + g) \wedge h \leq (f \wedge h) + (g \wedge h).

(2.3)
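A numerical spot check is no substitute for a proof, but it can help catch misreadings of these statements. The sketch below draws arbitrary vectors (standing in for functions on a six-point set, an assumption of the example) and tests each part of Lemma 2.2.1 together with inequality (2.3).

```python
import numpy as np

rng = np.random.default_rng(1)
f, g, h = (rng.normal(size=6) for _ in range(3))   # arbitrary elements of R^X

checks = {
    "(i)":   np.all(np.abs(f + g) <= np.abs(f) + np.abs(g)),
    "(ii)":  np.allclose(np.minimum(f, g) + h, np.minimum(f + h, g + h)),
    "(iii)": np.allclose(np.minimum(np.maximum(f, g), h),
                         np.maximum(np.minimum(f, h), np.minimum(g, h))),
    "(iv)":  np.all(np.abs(np.minimum(f, h) - np.minimum(g, h)) <= np.abs(f - g)),
    "(v)":   np.all(np.abs(np.maximum(f, h) - np.maximum(g, h)) <= np.abs(f - g)),
}

# Inequality (2.3) requires nonnegative functions
fp, gp, hp = np.abs(f), np.abs(g), np.abs(h)
checks["(2.3)"] = np.all(np.minimum(fp + gp, hp)
                         <= np.minimum(fp, hp) + np.minimum(gp, hp))
print(checks)
```

Only the first identity in each of (ii) and (iii) is tested here. The second follows the same pattern with minimum and maximum exchanged.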

We note the following useful inequality.

The inequality in Lemma 2.2.2 helps with dynamic programming problems that involve maximization. The next exercise concerns minimization.

We end this section with a discussion of upper envelopes. To frame the discussion, we take $\{ T_\sigma \} \coloneq \{ T_\sigma \}_{\sigma \in \Sigma}$ to be a finite family of self-maps on a sublattice $V$ of $\RR^\Xsf$ . Consider some properties of the operator $T$ on $V$ defined by

Tv = \bigvee_{\sigma \in \Sigma} \, T_\sigma \, v \qquad (v \in V).

It follows from the sublattice property that $T$ is a self-map on $V$ . In some sources, $T$ is called the upper envelope of the functions $\{ T_\sigma \}$ . The following lemma will be useful for dynamic programming.

Proof

Let the stated conditions hold and fix $u, v \in V$ . Applying Lemma 2.2.2, we get

\begin{aligned} \|Tu - Tv \|_\infty & = \max_x | \max_\sigma (T_\sigma \, u)(x) - \max_\sigma \, (T_\sigma \, v)(x) | \\ & \leq \max_x \max_\sigma \, | (T_\sigma \, u)(x) - (T_\sigma \, v)(x) | \\ & = \max_\sigma \, \max_x \, | (T_\sigma \, u)(x) - (T_\sigma \, v)(x) |. \end{aligned}

\fore \|Tu - Tv \|_\infty \leq \max_\sigma \, \| T_\sigma \, u - T_\sigma \, v \|_\infty \leq \max_\sigma \, \lambda_\sigma \, \| u - v \|_\infty .

Hence $T$ is a contraction of modulus $\max_\sigma \, \lambda_\sigma$ on $V$ , as claimed. ◻
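The bound can be illustrated numerically. In the sketch below, each $T_\sigma$ is taken to be an affine map $T_\sigma v = r_\sigma + \beta_\sigma P_\sigma v$ with $P_\sigma$ a stochastic matrix, so that $T_\sigma$ is a contraction of modulus $\beta_\sigma$ under the supremum norm. This functional form, the number of states, and the moduli are assumptions chosen for the example. The code then checks that the upper envelope $T$ satisfies the contraction inequality with modulus $\max_\sigma \beta_\sigma$ on randomly drawn pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                   # number of states (assumed)
betas = [0.5, 0.8, 0.9]                 # moduli lambda_sigma (assumed)

def random_stochastic_matrix(n):
    """Nonnegative matrix whose rows sum to one."""
    P = rng.random((n, n))
    return P / P.sum(axis=1, keepdims=True)

# Each T_sigma v = r_sigma + beta_sigma * P_sigma v is a contraction of
# modulus beta_sigma in the sup norm, since P_sigma is a stochastic matrix
family = [(rng.random(n), b, random_stochastic_matrix(n)) for b in betas]

def T(v):
    """Upper envelope: pointwise maximum of T_sigma v over sigma."""
    return np.max([r + b * P @ v for r, b, P in family], axis=0)

def sup_norm(x):
    return np.max(np.abs(x))

lam = max(betas)
worst_ratio = 0.0
for _ in range(1000):
    u, v = rng.normal(size=n), rng.normal(size=n)
    worst_ratio = max(worst_ratio, sup_norm(T(u) - T(v)) / sup_norm(u - v))

print(worst_ratio, lam)
```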

2.2.3Order-Preserving Maps¶

Order-preserving maps appear throughout the theory of dynamic programming. Here we define them and state a condition for contractivity that requires the order-preserving property.

2.2.3.1Definition¶

Given two partially ordered sets $(P, \preceq)$ and $(Q, \trianglelefteq)$ , a map $T$ from $P$ to $Q$ is called order-preserving if, given $p, p' \in P$ , we have

p \preceq p' \quad \implies \quad Tp \trianglelefteq Tp'.

(2.6)

$T$ is called order-reversing if, instead,

p \preceq p' \quad \implies \quad Tp' \trianglelefteq Tp.

(2.7)

Solution to Exercise 2.2.28

Fix square $A,B$ with $0 \leq A \leq B$ . It follows from the rules of matrix multiplication that, for arbitrary nonnegative square matrices $E, F, G$ with $F \leq G$ , we have $EF \leq EG$ and $FE \leq GE$ . Hence, if $A^k \leq B^k$ for some $k \in \NN$ , then $A^{k+1} = A A^k \leq B A^k \leq B B^k =B^{k+1}$ . Thus, by induction, $A^k \leq B^k$ for all $k \in \NN$ , which verifies the first claim. Regarding the second, it is clear that for nonnegative matrices $E, F$ with $E \leq F$ we have $\| E\|_\infty \leq \| F\|_\infty$ . Hence $\| A^k \|_\infty \leq \| B^k \|_\infty$ for all $k \in \NN$ . Raising both sides to the power $1/k$ and applying Gelfand’s lemma verifies $\rho(A) \leq \rho(B)$ .

2.2.3.2Increasing and Decreasing Functions¶

Regarding the definitions in (2.6) and (2.7), when $(Q, \trianglelefteq) = (\RR, \leq)$ , it is common to say “increasing” instead of order-preserving, and “decreasing” instead of order-reversing. We adopt this terminology. In particular, given partially ordered set $(P, \preceq)$ , we call $h \in \RR^P$

increasing if $p \preceq p'$ implies $h(p) \leq h(p')$ and
decreasing if $p \preceq p'$ implies $h(p) \geq h(p')$ .

We use the symbol $i\RR^P$ for the set of increasing functions in $\RR^P$ .

The next exercise shows that, in a totally ordered setting, an increasing function can be represented as the sum of increasing binary functions.

Solution to Exercise 2.2.32

Set $\alpha_k \coloneq u(x_k)$ for all $k$ and $s_k \coloneq \alpha_k - \alpha_{k-1}$ with $\alpha_0 \coloneq 0$ . Fix $x_j \in \Xsf$ . Then

\sum_{k=1}^n s_k \1\{x_j \geq x_k\} = \sum_{k=1}^j s_k = (\alpha_1 - \alpha_0) + (\alpha_2 - \alpha_1) + \ldots + (\alpha_j - \alpha_{j-1}) = \alpha_j.

In other words, $\sum_{k=1}^n s_k \1\{x_j \geq x_k\} = u(x_j)$ . This completes the proof.
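The decomposition in this solution is easy to verify numerically. The sketch below uses a four-point subset of $\RR$ and an increasing function on it, both invented for illustration, builds the increments $s_k$ , and reconstructs $u$ from the indicator functions.

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.5])        # totally ordered state space (assumed)
u = np.array([-1.0, 0.0, 0.0, 2.5])       # an increasing function on x (assumed)

alpha = np.concatenate(([0.0], u))        # alpha_0 = 0, alpha_k = u(x_k)
s = np.diff(alpha)                        # s_k = alpha_k - alpha_{k-1}

# indicator[j, k] = 1 if x_j >= x_k, else 0
indicator = (x[:, None] >= x[None, :]).astype(float)
u_rebuilt = indicator @ s                 # sum_k s_k * 1{x_j >= x_k}

print(s, u_rebuilt)
```

Because $u$ is increasing, every increment after the first is nonnegative.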

As usual, if $h \colon P \to Q$ and $P, Q \subset \RR$ , then we will call $h$

strictly increasing if $x < y$ implies $h(x) < h(y)$ , and
strictly decreasing if $x < y$ implies $h(x) > h(y)$ .

2.2.3.3Blackwell’s Condition¶

Our discussion of Banach’s Theorem in Section 1.2.2.3 showed the usefulness of contractivity. For an order-preserving operator on a subset of $\RR^\Xsf$ , the following condition often simplifies establishing this property. In the statement of the lemma, $U$ is a subset of $\RR^\Xsf$ , partially ordered by $\leq$ , and $\Xsf$ is finite. Also, $U$ has the property that $u \in U$ and $c \in \RR_+$ implies $u + c \in U$ .

2.2.4Stochastic Dominance¶

So far we have discussed partial orders over vectors, functions and sets. It is also useful to have a partial order over distributions that tells us when one distribution is in some sense “larger” than another. In this section, we introduce a partial order over some distributions commonly used in economics and finance.

Let’s start with an example. Recall that a random variable $X$ is binomial $B(n,0.5)$ if it counts the number of heads in $n$ flips of a fair coin. Figure 2.6 shows two distributions, $\phi \eqdist X \sim B(10, 0.5)$ and $\psi \eqdist Y \sim B(18, 0.5)$ . Since $Y$ counts over more flips, we expect it to take larger values in some sense, and we also expect its distribution $\psi$ to reflect this. How can we make these thoughts precise?

A standard order over distributions that captures this idea is defined as follows: Given finite set $\Xsf$ partially ordered by $\preceq$ and $\phi, \psi \in \dD(\Xsf)$ , we say that $\psi$ stochastically dominates $\phi$ and write $\phi \lefsd \psi$ if

\sum_x u(x) \phi(x) \leq \sum_x u(x) \psi(x) \; \text{ for every } u \text{ in } i\RR^\Xsf.

(2.9)

The relation $\lefsd$ is also called first-order stochastic dominance to differentiate it from other forms of stochastic order.

Example 2.2.11

If $\phi$ and $\psi$ are the binomial distributions defined in the preceding paragraphs and $\Xsf = \{0, \ldots, 18\}$ , then $\phi \lefsd \psi$ holds. Indeed, if $W_1, \ldots, W_{18}$ are IID binary random variables with $\PP\{W_i = 1\}=0.5$ for all $i$ , then $X \coloneq \sum_{i=1}^{10} W_i$ has distribution $\phi$ and $Y \coloneq \sum_{i=1}^{18} W_i$ has distribution $\psi$ . In addition, $X \leq Y$ with probability one (i.e., for any outcome of the draws $W_1, \ldots, W_{18}$ ). It follows that, for any given $u \in i\RR^\Xsf$ , we have $u(X) \leq u(Y)$ with probability one. Hence $\EE u(X) \leq \EE u(Y)$ holds, which is the same statement as (2.9).

A good way to interpret first-order stochastic dominance is to suppose that an agent has preferences over outcomes in $\Xsf$ described by a utility function $u \in \RR^\Xsf$ . Suppose in addition that the agent prefers more to less, in the sense that $u \in i\RR^\Xsf$ , and that the agent ranks lotteries over $\Xsf$ according to expected utility, so that the agent evaluates $\phi \in \dD(\Xsf)$ according to $\sum_x u(x) \phi(x)$ . Then the agent (weakly) prefers $\psi$ to $\phi$ whenever $\phi \lefsd \psi$ .

We can say more. Consider the class $\aA$ of all agents who (a) have preferences over outcomes in $\Xsf$ , (b) prefer more to less, and (c) rank lotteries over $\Xsf$ according to expected utility. Then $\phi \lefsd \psi$ if and only if every agent in $\aA$ prefers $\psi$ to $\phi$ .

Solution to Exercise 2.2.33

Fix $\phi, \psi \in \dD(\Xsf)$ and suppose that $\phi \lefsd \psi$ . Let $u \in \RR^\Xsf$ be defined by $u(1)=0$ and $u(2)=1$ . Since $u$ is increasing, the definition of stochastic dominance gives $\phi(2) \leq \psi(2)$ . Since $\phi(1)=1-\phi(2)$ and $\psi(1)=1-\psi(2)$ , this inequality is equivalent to $\psi(1) \leq \phi(1)$ . Finally, suppose that $\psi(1) \leq \phi(1)$ and fix $u \in i\RR^\Xsf$ . Let $h = u(2) - u(1) \geq 0$ . Then

\sum_x u(x) \phi(x) = u(1) \phi(1) + (u(1) + h) (1 - \phi(1)) = u(1) + h (1 - \phi(1)).

Similarly, $\sum_x u(x) \psi(x) = u(1) + h (1 - \psi(1))$ . Since $h \geq 0$ and $\psi(1) \leq \phi(1)$ , we have $\sum_x u(x) \phi(x) \leq \sum_x u(x) \psi(x)$ . Thus, $\phi \lefsd \psi$ . This chain of implications proves the equivalences in the exercise.

To state another useful perspective on stochastic dominance, we introduce the notation

G^\phi (y) \coloneq \sum_{x \succeq y} \phi(x) \qquad (\phi \in \dD(\Xsf), \; y \in \Xsf).

For a given distribution $\phi$ , the function $G^\phi$ is sometimes called the counter CDF (counter cumulative distribution function) of $\phi$ .

The proof is given. Figure 2.7 helps to illustrate. Here $\Xsf \subset \RR$ and $\phi$ and $\psi$ are distributions on $\Xsf$ . We can see that $\phi \lefsd \psi$ because the counter CDFs are ordered in the sense that $G^\phi \leq G^\psi$ pointwise on $\Xsf$ .

Figure 2.7: Visualization of $\phi \lefsd \psi$
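The counter CDF comparison can be carried out directly for the binomial example discussed earlier, with $\phi$ the $B(10, 0.5)$ distribution and $\psi$ the $B(18, 0.5)$ distribution on $\Xsf = \{0, \ldots, 18\}$ . The sketch below uses only the Python standard library. The helper names and the choice $u(x) = x^2$ for the spot check of (2.9) are assumptions of the example.

```python
from math import comb

def binomial_pmf(n, support_max):
    """Probabilities of B(n, 0.5) on {0, ..., support_max} (zero beyond n)."""
    return [comb(n, x) * 0.5**n if x <= n else 0.0 for x in range(support_max + 1)]

def counter_cdf(dist):
    """G(y) = sum of dist(x) over x >= y."""
    return [sum(dist[y:]) for y in range(len(dist))]

phi = binomial_pmf(10, 18)
psi = binomial_pmf(18, 18)

G_phi, G_psi = counter_cdf(phi), counter_cdf(psi)
# Small tolerance guards against floating point error where both values equal one
dominated = all(a <= b + 1e-12 for a, b in zip(G_phi, G_psi))

# Spot check of (2.9) with one increasing function u(x) = x**2
u = [x**2 for x in range(19)]
E_phi = sum(ux * p for ux, p in zip(u, phi))
E_psi = sum(ux * p for ux, p in zip(u, psi))
print(dominated, E_phi, E_psi)
```

The expectation check covers a single increasing function only, whereas the counter CDF comparison covers every point of $\Xsf$ .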

2.2.5Parametric Monotonicity¶

We are often interested in whether a change in a parameter shifts an outcome up or down. For example, a parameter might appear in a central bank decision rule for pegging an interest rate, and we want to know whether increasing that parameter will increase steady state inflation. By providing sufficient conditions for monotone shifts in fixed points, results in this section can help answer such questions.

Let $(P, \preceq)$ be a partially ordered set. Given two self-maps $S$ and $T$ on $P$ , we write $S \preceq T$ if $S u \preceq T u$ for every $u \in P$ and say that $T$ dominates $S$ on $P$ .

One might assume that, in a setting where $T$ dominates $S$ , the fixed points of $T$ will be larger. This can hold, as in Figure 2.8, but it can also fail, as in Figure 2.9. A difference between these two situations is that in Figure 2.8 the map $T$ is globally stable. This leads us to our next result.

Figure 2.8:Ordered fixed points when global stability holds

Figure 2.9:Reverse-ordered fixed points when global stability fails

As an application of Proposition 2.2.7, consider again the Solow–Swan growth model $k_{t+1} = g(k_t) \coloneq s f(k_t) + (1 - \delta) k_t$ . We saw in Section 1.2.3.2 that if $f(k) = Ak^\alpha$ where $A > 0$ and $\alpha \in (0, 1)$ , then $g$ is globally stable on $M \coloneq (0,\infty)$ . Clearly $k \mapsto g(k)$ is order-preserving on $M$ . If we now increase, say, the savings rate $s$ , then $g$ will be shifted up everywhere, implying, via Proposition 2.2.7, that the fixed point also rises. Exercise 2.2.38 asks you to step through the details.

Figure 2.10 helps illustrate the results of Exercise 2.2.38. The top left sub-figure shows a baseline parameterization, with $A=2.0$ , $s = \alpha = 0.3$ and $\delta=0.4$ . The other sub-figures show how the steady state changes as parameters deviate from that baseline.

Figure 2.10:Parametric monotonicity for the Solow–Swan model
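The effect of the savings rate on the steady state can be checked numerically. The sketch below uses the baseline parameters stated above and computes the fixed point by successive approximation. The function names are our own, and the closed form $k^* = (sA/\delta)^{1/(1-\alpha)}$ follows from solving $g(k) = k$ with $f(k) = Ak^\alpha$ .

```python
def g(k, s=0.3, A=2.0, alpha=0.3, delta=0.4):
    """Solow-Swan update k -> s * A * k**alpha + (1 - delta) * k."""
    return s * A * k**alpha + (1 - delta) * k


def steady_state(s=0.3, A=2.0, alpha=0.3, delta=0.4,
                 k0=1.0, tol=1e-12, max_iter=10_000):
    """Approximate the positive fixed point of g by iterating from k0."""
    k = k0
    for _ in range(max_iter):
        k_new = g(k, s, A, alpha, delta)
        if abs(k_new - k) < tol:
            return k_new
        k = k_new
    return k


def closed_form(s=0.3, A=2.0, alpha=0.3, delta=0.4):
    """Solve s * A * k**alpha = delta * k for k > 0."""
    return (s * A / delta) ** (1 / (1 - alpha))


k_base = steady_state()          # baseline savings rate s = 0.3
k_high_s = steady_state(s=0.4)   # higher savings rate shifts g up
```

The iterated value `k_base` agrees with `closed_form()` up to the tolerance, and `k_high_s` exceeds `k_base`, as parametric monotonicity suggests.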

Figure 2.11 gives an illustration of the result in Exercise 2.2.39. Here an increase in $\beta$ leads to a larger continuation value. This seems reasonable, since larger $\beta$ indicates more concern about outcomes in future periods.

Figure 2.11:Parametric monotonicity in $\beta$ for the continuation value

While the preceding examples of parametric monotonicity are all one-dimensional, we will soon see that Proposition 2.2.7 can also be applied in high-dimensional settings.

2.3Matrices and Operators¶

Many aspects of dynamic programming are most clearly framed using operator theory. In this section, we discuss linear operators and their connections to matrices. We emphasize nonnegative matrices and so-called positive linear operators that arise naturally in dynamic programming.

2.3.1Nonnegative Matrices¶

We begin by reviewing basic properties of nonnegative matrices.

2.3.1.1Nonnegative Matrices and Their Powers¶

We call a matrix $A$ nonnegative and write $A \geq 0$ if all elements of $A$ are nonnegative. We call $A$ everywhere positive and write $A \gg 0$ if all elements of $A$ are strictly positive. A square matrix $A$ is called irreducible if $A \geq 0$ and $\sum_{k=1}^\infty A^k \gg 0$ . An interpretation in terms of connected networks is given in Chapter 1 of Sargent & Stachurski (2023).
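For small matrices, irreducibility can be tested directly from the definition. The sketch below relies on the standard fact that, for an $n \times n$ nonnegative matrix, it suffices to check positivity of $\sum_{k=1}^n A^k$ , since any connecting path can be shortened to length at most $n$ . The helper names and example matrices are illustrative.

```python
def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def is_irreducible(A):
    """Check A >= 0 and A + A^2 + ... + A^n >> 0 for an n x n matrix A."""
    n = len(A)
    if any(a < 0 for row in A for a in row):
        return False
    power = [row[:] for row in A]
    total = [row[:] for row in A]
    for _ in range(n - 1):
        power = mat_mul(power, A)
        total = [[t + p for t, p in zip(t_row, p_row)]
                 for t_row, p_row in zip(total, power)]
    return all(t > 0 for row in total for t in row)


cycle = [[0, 1], [1, 0]]      # each state leads to the other
triangular = [[1, 1], [0, 1]] # state 2 never leads to state 1

cycle_irreducible = is_irreducible(cycle)
triangular_irreducible = is_irreducible(triangular)
```

This approach is meant for illustration only. With large entries or large $n$ , a graph-based connectivity check would be a more robust choice.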

Let $A$ be $n \times n$ . It is not always true that the spectral radius $\rho(A)$ is an eigenvalue of $A$ .^[1] However, when $A \geq 0$ , the spectral radius is always an eigenvalue. The following theorem states this result and several extensions.

Theorem 2.3.1 (Perron–Frobenius)

If $A \geq 0$ , then $\rho(A)$ is an eigenvalue of $A$ with nonnegative, real-valued right and left eigenvectors. In particular, we can find a nonnegative, nonzero column vector $e$ and a nonnegative, nonzero row vector $\epsilon$ such that

A e = \rho(A) e \quad \text{ and } \quad \epsilon A = \rho(A) \epsilon.

(2.10)

If $A$ is irreducible, then the right and left eigenvectors are everywhere positive and unique. Moreover, if $A$ is everywhere positive, then with $e$ and $\epsilon$ normalized so that $\inner{\epsilon, e}=1$ , we have

\rho(A)^{-t} A^t \to e \, \epsilon \qquad (t \to \infty).

(2.11)

The convergence in (2.11) provides a sharp characterization of large powers of $A$ that will prove useful in what follows. The assumption that $A$ is everywhere positive can be weakened without affecting this convergence. A complete statement and full proof of the Perron–Frobenius theorem can be found in Meyer (2000).
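The convergence in (2.11) is easy to observe numerically. The following sketch takes an everywhere positive $2 \times 2$ matrix (an arbitrary choice), approximates $\rho(A)$ and the right and left eigenvectors by power iteration, and compares $\rho(A)^{-t} A^t$ with $e \, \epsilon$ at $t = 50$ . Power iteration is used here only for convenience. It is not the method of the text.

```python
def mat_vec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def mat_mul(A, B):
    """Matrix-matrix product."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def dominant_pair(A, iters=200):
    """Power iteration: returns (approx. spectral radius, eigenvector summing to one)."""
    v = [1.0] * len(A)
    rho = 0.0
    for _ in range(iters):
        w = mat_vec(A, v)
        rho = sum(w) / sum(v)
        v = [x / sum(w) for x in w]
    return rho, v


A = [[1.0, 2.0], [3.0, 4.0]]             # illustrative positive matrix
rho, e = dominant_pair(A)                # right eigenvector
_, eps = dominant_pair(transpose(A))     # left eigenvector
scale = sum(a * b for a, b in zip(eps, e))
eps = [x / scale for x in eps]           # normalize so that <eps, e> = 1

limit = [[e_i * eps_j for eps_j in eps] for e_i in e]   # the matrix e * eps

scaled = [[a / rho for a in row] for row in A]
power = scaled
for _ in range(49):                      # power = (A / rho)^50
    power = mat_mul(power, scaled)

max_gap = max(abs(power[i][j] - limit[i][j])
              for i in range(2) for j in range(2))
```

For this matrix the second eigenvalue is small relative to $\rho(A)$ , so `max_gap` is already negligible at $t = 50$ .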

We can use the Perron–Frobenius theorem to provide bounds on the spectral radius of a nonnegative matrix. Fix $n \times n$ matrix $A = (a_{ij})$ and set

$\rsum_i(A) \coloneq \sum_j a_{ij} =$ the $i$ -th row sum of $A$ and
$\csum_j(A) \coloneq \sum_i a_{ij} =$ the $j$ -th column sum of $A$ .
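As a quick worked example with these definitions, take the illustrative matrix below (our choice, not from the text). A standard result for nonnegative matrices is that $\rho(A)$ lies between the smallest and largest row sums, and likewise for column sums.

A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \qquad \rsum_1(A) = 3, \;\; \rsum_2(A) = 7, \qquad \csum_1(A) = 4, \;\; \csum_2(A) = 6.

The eigenvalues of $A$ solve $\lambda^2 - 5 \lambda - 2 = 0$ , so $\rho(A) = (5 + \sqrt{33})/2 \approx 5.372$ . This value satisfies both $3 \leq \rho(A) \leq 7$ and $4 \leq \rho(A) \leq 6$ .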

2.3.1.2A Local Spectral Radius Result¶

Let $A$ be an $n \times n$ matrix. We know from Gelfand’s formula that if $\| \cdot \|$ is any matrix norm, then $\|A^k\|^{1/k} \to \rho(A)$ as $k \to \infty$ . While useful, this formula can be difficult to apply because it involves matrix norms. Fortunately, when $A$ is nonnegative, we have the following variation, which only involves vector norms.

The expression on the left of (2.12) is sometimes called the local spectral radius of $A$ at $h$ . Lemma 2.3.3 gives one set of conditions under which a local spectral radius equals the spectral radius. This result will be useful when we examine state-dependent discounting in Chapter 6.

For a proof of Lemma 2.3.3 see Theorem 9.1 of Krasnosel’skii et al. (1972).
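A local spectral radius can be approximated with vector operations alone. The sketch below computes $\| A^k h \|^{1/k}$ under the supremum norm for a nonnegative $A$ and an everywhere positive $h$ . Positivity of $h$ is an assumption of this sketch, and the precise conditions are those of Lemma 2.3.3. Norms are accumulated in logs to avoid overflow. The matrix and function name are illustrative.

```python
import math


def mat_vec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def local_spectral_radius(A, h, k):
    """Return the sup-norm of A^k h raised to the power 1/k."""
    v = list(h)
    log_norm = 0.0
    for _ in range(k):
        v = mat_vec(A, v)
        norm = max(abs(x) for x in v)
        log_norm += math.log(norm)      # the norm of A^k h is the product of these
        v = [x / norm for x in v]       # rescale to keep numbers moderate
    return math.exp(log_norm / k)


A = [[1.0, 2.0], [3.0, 4.0]]            # nonnegative, illustrative
h = [1.0, 1.0]                          # everywhere positive

approx_100 = local_spectral_radius(A, h, 100)
approx_1000 = local_spectral_radius(A, h, 1000)
rho_exact = (5 + math.sqrt(33)) / 2     # closed form for this 2 x 2 matrix
```

Convergence is slow because of the $1/k$ exponent, so `approx_1000` is closer to `rho_exact` than `approx_100` but still not exact.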

2.3.1.3Markov Matrices¶

An $n \times n$ matrix $P$ is called a stochastic matrix or Markov matrix if

P \geq 0 \quad \text{and} \quad P \1 = \1

where $\1$ is a column vector of ones, so that $P$ is nonnegative and has unit row sums. The Perron–Frobenius theorem will be useful for the following exercise.

Solution to Exercise 2.3.2

Let $P$ and $Q$ be as stated. Evidently $PQ \geq 0$ . Moreover, $PQ \1 = P \1 = \1$ , so $PQ$ is Markov. That $\rho(P)=1$ follows directly from Lemma 2.3.2. By the Perron–Frobenius theorem, there exists a nonzero, nonnegative row vector $\phi$ satisfying $\phi P = \phi$ . Rescaling $\phi$ to $\phi / (\phi \1)$ gives the desired vector $\psi$ .

The final positivity and uniqueness claim also follows from the Perron–Frobenius theorem and its consequences for irreducible matrices. Indeed, if $\phi$ is another nonnegative vector satisfying $\phi \1 = 1$ and $\phi P = \phi$ , then, by the Perron–Frobenius theorem, $\phi = \alpha \psi$ for some $\alpha > 0$ . But then $\alpha \psi \1 = 1$ and $\psi \1 = 1$ , which gives $\alpha=1$ . Hence $\phi = \psi$ .

The vector $\psi$ in part (iii) of Exercise 2.3.2 is called a stationary distribution for $P$ . Such distributions play an important role in the theory of Markov chains. We discuss their interpretation and significance in Section 3.1.2.
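One simple way to approximate a stationary distribution is to iterate on $\phi \mapsto \phi P$ until the update stops changing. The sketch below assumes that $P$ is everywhere positive, in which case (2.11) with $\rho(P) = 1$ suggests convergence from any initial distribution. For irreducible but periodic $P$ this iteration can fail to converge. The matrix and function names are illustrative.

```python
def update(phi, P):
    """Return the row vector phi P."""
    n = len(P)
    return [sum(phi[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, tol=1e-13, max_iter=10_000):
    """Iterate phi -> phi P from the uniform distribution until changes are small."""
    n = len(P)
    phi = [1.0 / n] * n
    for _ in range(max_iter):
        phi_new = update(phi, P)
        if max(abs(a - b) for a, b in zip(phi_new, phi)) < tol:
            return phi_new
        phi = phi_new
    return phi


P = [[0.9, 0.1],
     [0.4, 0.6]]      # everywhere positive Markov matrix

psi = stationary(P)
```

For this $P$ , solving $\psi P = \psi$ with $\psi \1 = 1$ by hand gives $\psi = (0.8, 0.2)$ , which the iteration reproduces.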

2.3.2A Lake Model¶

We illustrate the power of the Perron–Frobenius theorem by showing how it helps us analyze a model of employment and unemployment flows in a large population.

The model is sometimes called a “lake model” because there are two pools of workers: those who are currently employed and those who are currently unemployed but still seeking work. The flows between states are as follows:

Workers exit the labor market at rate $d$ .
New workers enter the labor market at rate $b$ .
Employed workers separate from their jobs and become unemployed at rate $\alpha$ .
Unemployed workers find jobs at rate $\lambda$ .

We assume that all parameters lie in $(0, 1)$ . New workers are initially unemployed.

Transition rates between the two pools appear in Figure 2.12. For example, the rate of flow from employment to unemployment is $\alpha (1-d)$ , which equals the fraction of employed workers who remain in the labor market and separate from their jobs.

Figure 2.12:Lake model transition dynamics

Let $e_t$ and $u_t$ be the number of employed and unemployed workers at time $t$ respectively. The total population (of workers) is $n_t \coloneq e_t + u_t$ . In view of the rates just stated, the number of unemployed workers evolves according to

u_{t+1} = (1-d) \alpha e_t + (1-d)(1-\lambda) u_t + b n_t.

The three terms on the right correspond to the newly unemployed (due to separation), the unemployed who failed to find jobs last period, and new entrants into the labor force. The number of employed workers evolves according to

e_{t+1} = (1-d) (1- \alpha) e_t + (1-d)\lambda u_t .

Evolution of the time series for $u_t$ , $e_t$ and $n_t$ is illustrated in Figure 2.13. We set parameters to $\alpha = 0.01$ , $\lambda = 0.1$ , $d = 0.02$ , and $b = 0.025$ . The initial populations of unemployed and employed workers are $u_0 = 0.6$ and $e_0 =1.2$ , respectively. The series grow over the long run due to net population growth.

Figure 2.13:Time series for $e_t$ , $u_t$ and $n_t$ (`lake_2.jl`)
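The two laws of motion can be simulated directly. The following Python sketch uses the parameter values and initial conditions stated above. It is not the book's `lake_2.jl` , and the function names and horizon are our own choices.

```python
# Parameters from the text: separation, job finding, exit and entry rates.
alpha, lam, d, b = 0.01, 0.1, 0.02, 0.025


def step(u, e):
    """Advance unemployed and employed stocks by one period."""
    n = u + e
    u_next = (1 - d) * alpha * e + (1 - d) * (1 - lam) * u + b * n
    e_next = (1 - d) * (1 - alpha) * e + (1 - d) * lam * u
    return u_next, e_next


def simulate(u0=0.6, e0=1.2, T=100):
    """Return the list of (u_t, e_t) pairs for t = 0, ..., T."""
    path = [(u0, e0)]
    for _ in range(T):
        path.append(step(*path[-1]))
    return path


path = simulate()
n_series = [u + e for u, e in path]                     # total labor force n_t
growth = [n_series[t + 1] / n_series[t] for t in range(len(n_series) - 1)]
```

Adding the two update rules shows that $n_{t+1} = (1 - d) n_t + b n_t$ , so every entry of `growth` equals $1 + b - d$ up to rounding.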

Can we say more about the dynamics of this system? For example, what long-run unemployment rate should we expect? Also, do long-run outcomes depend heavily on the initial conditions $u_0$ and $e_0$ ? Can we make some general statements that hold regardless of the initial state?

To begin to address these questions, we first organize the linear system for $(e_t)$ and $(u_t)$ by setting

x_t \coloneq \begin{pmatrix} u_t \\ e_t \end{pmatrix} \quad \text{and} \quad A \coloneq \begin{pmatrix} (1-d)(1-\lambda) + b & (1-d) \alpha + b \\ (1-d)\lambda & (1-d) (1- \alpha) \end{pmatrix}.

(2.13)

With these definitions, we can write the dynamics as $x_{t+1} = A x_t$ . As a result, $x_t = A^t x_0$ , where $x_0 = (u_0 \; e_0)^\top$ .

The overall growth rate of the total labor force is $g = b - d$ , in the sense that $n_{t+1} = (1+g) n_t$ for all $t$ .

In the language of Perron–Frobenius theory, the right eigenvector $\bar x$ is called the dominant eigenvector, since it corresponds to the dominant (i.e., largest) eigenvalue $\rho(A)$ . This eigenvector plays an important role in determining long-run outcomes. In the remainder of this section we illustrate this fact.

To begin, recall that $\alpha \bar x$ is also a right eigenvector corresponding to the eigenvalue $\rho(A)$ when $\alpha > 0$ . The set $D \coloneq \setntn{x \in \RR^2}{x = \alpha \bar x \text{ for some } \alpha > 0}$ is shown as a dashed black line in Figure 2.14. The figure also shows two time paths, each of the form $(x_t)_{t \geq 0} = (A^t x_0)_{t \geq 0}$ , generated from two different initial conditions. In both cases, the paths converge to $D$ over time. The figure suggests that paths share strong similarities in the long run that are determined by the dominant eigenvector $\bar x$ .

Figure 2.14:Time paths $x_t = A^t x_0$ for two choices of $x_0$ (`lake_1.jl`)

To see why this is so, we return to (2.11) from the Perron–Frobenius theorem, which tells us that, since $A \gg 0$ , we have

A^t \approx \rho(A)^t \cdot \bar x \1^\top = (1 + g)^t \begin{pmatrix} \bar u & \bar u \\ \bar e & \bar e \end{pmatrix} \quad \text{for large } t.

As a result, for any initial condition $x_0 = (u_0 \; e_0)^\top$ , we have

A^t x_0 \approx (1 + g)^t \begin{pmatrix} \bar u & \bar u \\ \bar e & \bar e \end{pmatrix} \begin{pmatrix} u_0 \\ e_0 \end{pmatrix} = (1 + g)^t (u_0 + e_0) \begin{pmatrix} \bar u \\ \bar e \end{pmatrix} = n_t \bar x,

where $n_t =(1 + g)^t n_0$ and $n_0 = u_0 + e_0$ . This says that, regardless of the initial condition, the state $x_t$ scales along $\bar x$ at the rate of population growth. This is precisely what we saw in Figure 2.14.

We can provide additional interpretations to the components $\bar u$ and $\bar e$ of $\bar x$ . Since $n_t$ is the size of the workforce at time $t$ , the rate of unemployment is $u_t / n_t$ . As just shown, for large $t$ this is close to $(n_t \bar u) / n_t = \bar u$ . Hence $\bar u$ is the long-term rate of unemployment along the stable growth path. Similarly, the other component $\bar e$ of the dominant eigenvector is the long-run employment rate.

In summary, the dominant eigenvector provides us with both the long-run rate of unemployment and the stable growth path, to which all trajectories with positive initial conditions converge over time.
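These claims can be checked numerically. The sketch below builds $A$ from (2.13) with the parameter values used earlier, approximates the dominant eigenvector by power iteration (a convenient method chosen for this sketch), normalizes it to sum to one, and compares its first component with the simulated unemployment rate from two different initial conditions. The second initial condition is an arbitrary choice.

```python
alpha, lam, d, b = 0.01, 0.1, 0.02, 0.025
g = b - d                                   # labor force growth rate

# The matrix A from (2.13), acting on x = (u, e).
A = [[(1 - d) * (1 - lam) + b, (1 - d) * alpha + b],
     [(1 - d) * lam,           (1 - d) * (1 - alpha)]]


def mat_vec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def dominant_eigenpair(M, iters=500):
    """Power iteration with the iterate normalized to sum to one."""
    x = [0.5, 0.5]
    rho = 0.0
    for _ in range(iters):
        y = mat_vec(M, x)
        rho = sum(y)                        # valid because x sums to one
        x = [v / rho for v in y]
    return rho, x


rho, (u_bar, e_bar) = dominant_eigenpair(A)


def unemployment_rate(x0, T=300):
    """Simulate x_{t+1} = A x_t and return u_T / n_T."""
    x = list(x0)
    for _ in range(T):
        x = mat_vec(A, x)
    return x[0] / (x[0] + x[1])


rate_a = unemployment_rate([0.6, 1.2])      # initial condition from the text
rate_b = unemployment_rate([5.0, 0.1])      # a very different starting point
```

Both `rate_a` and `rate_b` agree with `u_bar` to high precision, and `rho` equals $1 + g$ , consistent with the discussion above.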

2.3.3Linear Operators¶

There are two ways to think about a matrix. In one definition, an $n \times k$ matrix $A$ is an $n \times k$ array of (real) numbers. In the second, $A$ is a linear operator from $\RR^k$ to $\RR^n$ that takes a vector $u \in \RR^k$ and sends it to $Au$ in $\RR^n$ . Let’s clarify these ideas in a setting where $n=k$ . While the matrix representation is important, the linear operator representation is more fundamental and more general.

2.3.3.1Matrices versus Linear Operators¶

A linear operator on $\RR^n$ is a map $L$ from $\RR^n$ to $\RR^n$ such that

L(\alpha u + \beta v) = \alpha Lu + \beta Lv \quad \text{for all } u, v \in \RR^n \text{ and } \alpha, \beta \in \RR.

(2.15)

(We write $Lu$ instead of $L(u)$ , etc.) For example, if $A$ is an $n \times n$ matrix, then the map from $u$ to $Au$ defines a linear operator, since the rules of matrix algebra yield $A(\alpha u + \beta v) = \alpha Au + \beta Av$ .

We just showed that each matrix can be regarded as a linear operator. In fact the converse is also true:

A proof of Theorem 2.3.4 can be found in Kreyszig (1978) and many other sources.

Why introduce linear operators if they are essentially the same as matrices? One reason is that, while a one-to-one correspondence between linear operators and matrices holds in $\RR^n$ , the concept of linear operators is far more general. Linear operators can be defined over many different kinds of sets whose elements have vector-like properties. This is related to the point that we made about function spaces in Remark 1.2.2.

Another reason is computational: The matrix representation of a linear operator can be tedious to construct and difficult to instantiate in memory in large problems. We illustrate this point in Section 2.3.3.3.

2.3.3.2Linear Operators on Function Space¶

The definition of linear operators on $\RR^n$ extends naturally to linear operators on $\RR^\Xsf$ when $\Xsf = \{x_1, \ldots, x_n\}$ : A linear operator on $\RR^\Xsf$ is a map $L$ from $\RR^\Xsf$ to itself such that, for all $u, v \in \RR^\Xsf$ and $\alpha, \beta \in \RR$ , we have $L(\alpha u + \beta v) = \alpha Lu + \beta Lv$ . In what follows,

\lopx \coloneq \text{ the set of all linear operators on } \RR^\Xsf.

Let $L$ be a function from $\Xsf \times \Xsf$ to $\RR$ . This function induces an operator $L$ from $\RR^\Xsf$ to itself via

(Lu)(x) = \sum_{x' \in \Xsf} L(x,x') u(x') \qquad (x \in \Xsf, \; u \in \RR^\Xsf).

(2.16)

We use the same symbol $L$ on both sides of the equals sign because both represent essentially the same object (in the sense that a matrix $A$ can be viewed as a collection of numbers $(A_{ij})$ or as a linear map $u \mapsto Au$ ).

The function $L$ on the right-hand side of (2.16) is sometimes called the “kernel” of the operator $L$ . However, we will call it a matrix in what follows, since $L(x, x') = L(x_i, x_j)$ is just an $n \times n$ array of real numbers. When more precision is required, we will call it the matrix representation of $L$ .

In essence, the operation in (2.16) is just matrix multiplication: $(Lu)(x)$ is row $x$ of the matrix product $L u$ .

The eigenvalues and eigenvectors of the linear operator $L$ are defined as the eigenvalues and eigenvectors of its matrix representation. The spectral radius $\rho(L)$ of $L$ is defined analogously.

We used the same symbol for the operator $L$ on the left-hand side of (2.16) and its matrix representation on the right because these two objects are in one-to-one correspondence. In particular, every $L \in \lopx$ can be expressed in the form of (2.16) for a suitable choice of matrix $(L(x,x'))$ . Readers who are comfortable with these claims can skip ahead to Section 2.3.3.3. The next lemma provides more details.

Lemma 2.3.5 needs no formal proof. Theorem 2.3.4 already tells us that (a) and (b) are in one-to-one correspondence. Also, (b) and (c) are in one-to-one correspondence because each $L \in \lopx$ can be identified with a linear operator $u \mapsto Lu$ on $\RR^n$ by pairing $u, Lu \in \RR^\Xsf$ with their vector representations in $\RR^n$ (see Section 1.2.4.2). Finally, (d) and (a) are in one-to-one correspondence under the identification $L(x_i, x_j) \leftrightarrow L_{ij}$ .

2.3.3.3Computational Issues¶

At the end of Section 2.3.3.1 we claimed that working with linear operators brings some computational advantages vis-à-vis working with matrices. This section fills in some details. (Readers who prefer not to think about computational issues at this point can skip ahead to Section 2.3.3.4.)

To illustrate the main idea, consider a setting where the state space $\Xsf$ takes the form $\Xsf = \Ysf \times \Zsf$ with $|\Ysf| = j$ and $|\Zsf| = k$ . A typical element of $\Xsf$ is $x = (y, z)$ . As we shall see, this kind of setting arises naturally in dynamic programming.

Let $Q$ be a map from $\Zsf \times \Zsf$ to $\RR$ (i.e., a $k \times k$ matrix) and consider the operator sending $u \in \RR^\Xsf$ to $Lu \in \RR^\Xsf$ according to the rule

(Lu)(x) = (Lu)(y, z) = \sum_{z' \in \Zsf} u(y, z') Q(z, z').

(2.17)

Since $L$ is a linear operator on $\RR^\Xsf$ , Lemma 2.3.5 tells us that $L$ can be represented as an $n \times n$ matrix $(L(x_i,x_j)) = (L_{ij})$ , where $n = |\Xsf| = j \times k$ . To construct this matrix, we first need to “flatten” $\Ysf \times \Zsf$ into a set $\Xsf = \{x_1, \ldots, x_n\}$ with a single index. There are two natural ways to do this. Considering $\Ysf \times \Zsf$ as a two-dimensional array with typical element $(y_i, z_j)$ , we can (a) stack all $k$ columns vertically into one long column, or (b) concatenate all $j$ rows into one long row. The first arrangement is called column-major ordering and is the default for languages such as Julia and Fortran. The second is called row-major ordering and is the default for languages such as Python and C. Either way we obtain a set of elements indexed by $1, \ldots, n$ .

After adopting one of these conventions, Lemma 2.3.5 assures us we can construct a uniquely defined $n \times n$ matrix that represents $L$ . Once we decide how to construct this matrix, we can instantiate it in computer memory and compute the operation $u \mapsto Lu$ by matrix multiplication.

There are, however, several disadvantages to implementing $L$ using this matrix-based approach. One is that constructing the matrix representation is tedious. Another is that confusion can arise when swapping between column- and row-major orderings in order to shift between languages or to communicate with colleagues. A third is that differences are introduced between computer code and the natural representation (2.17), which can be a source of bugs. A fourth issue is that an $n \times n$ matrix has to be instantiated in memory, even though the linear operation in (2.17) is only an inner product in $\RR^k$ . The last issue can be alleviated in most languages by employing sparse matrices, but doing so adds boilerplate and can be a source of inefficiency.

Because of these issues, most modern scientific computing environments support linear operators directly, as well as actions on linear operators such as inverting linear maps. These considerations encourage us to take an operator-based approach.
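To make the comparison concrete, the sketch below implements (2.17) in two ways: directly as an operator on a $j \times k$ array, and through an explicit $n \times n$ matrix built under row-major flattening, where $(y, z)$ maps to index $y k + z$ with zero-based indices. All names and the small test data are illustrative. Scientific libraries offer far more efficient tools for both approaches.

```python
def apply_L(u, Q):
    """Operator form of (2.17): u is a list of j rows, each of length k."""
    k = len(Q)
    return [[sum(row[zp] * Q[z][zp] for zp in range(k)) for z in range(k)]
            for row in u]


def flatten_row_major(u):
    """Concatenate the rows of a two-dimensional array into one long list."""
    return [x for row in u for x in row]


def matrix_representation(Q, j):
    """The n x n matrix of L under row-major ordering, with n = j * k."""
    k = len(Q)
    n = j * k
    M = [[0.0] * n for _ in range(n)]
    for y in range(j):
        for z in range(k):
            for zp in range(k):
                # (Lu)(y, z) only involves u(y, z'), so most entries stay zero.
                M[y * k + z][y * k + zp] = Q[z][zp]
    return M


def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


Q = [[0.9, 0.1],
     [0.4, 0.6]]                        # k = 2
u = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]                        # j = 3, so n = 6

direct = flatten_row_major(apply_L(u, Q))
M = matrix_representation(Q, j=len(u))
via_matrix = mat_vec(M, flatten_row_major(u))
nonzeros = sum(1 for row in M for m in row if m != 0.0)
```

The two results coincide, but the matrix route stores $n^2 = 36$ entries of which only $j k^2 = 12$ are nonzero, while the operator route needs only $Q$ .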

2.3.3.4Positive Operators and Markov Operators¶

Having agreed on the benefits of an operator-theoretic exposition, let us now describe some kinds of linear operators. We continue to assume that $\Xsf$ is a finite set with $n$ elements.

The set $\RR^\Xsf_+$ of all $u \in \RR^\Xsf$ with $u \geq 0$ is called the positive cone of $\RR^\Xsf$ . An operator $L \in \lopx$ is called positive if $L$ is invariant on the positive cone; that is, if

u \geq 0 \; \implies \; Lu \geq 0.

(2.18)

Solution to Exercise 2.3.10

Fix $L \in \lopx$ with $(Lu)(x) = \sum_{x' \in \Xsf} L(x,x') u(x')$ for all $x \in \Xsf$ and $u \in \RR^\Xsf$ . Positivity of $L$ requires that

u \geq 0 \; \implies \; \sum_{x' \in \Xsf} L(x,x') u(x') \geq 0 \text{ for all } x \in \Xsf.

Clearly, this holds whenever $L(x, x') \geq 0$ for all $x, x' \in \Xsf$ .

Regarding the converse, suppose that $L$ is positive. Seeking a contradiction, suppose in addition that we can find a pair $(x_a, x_b) \in \Xsf \times \Xsf$ such that $L(x_a,x_b) < 0$ . With $u(x) \coloneq \1\{x = x_b\}$ , we have $(Lu)(x_a) = \sum_{x' \in \Xsf} L(x_a,x') u(x') = L(x_a, x_b) < 0$ . This contradicts positivity of $L$ .

An operator $P \in \lopx$ is called a Markov operator on $\RR^\Xsf$ if $P$ is positive and $P \1 = \1$ . We let

\mopx \coloneq \text{ the set of all Markov operators on } \RR^\Xsf.

Viewed as matrices, elements of $\mopx$ are nonnegative matrices whose rows sum to one. The next exercise asks you to confirm this.

In the next exercise, you can think of $\phi$ as a row vector and $\phi P$ as premultiplying the matrix $P$ by this row vector. Chapter 3 uses the map $\phi \mapsto \phi P$ to update marginal distributions generated by Markov chains.

Solution to Exercise 2.3.14

In the solution, we use the characterization in Exercise 2.3.12: $P \in \mopx$ if and only if $P(x,x') \geq 0$ for all $x,x' \in \Xsf$ and $\sum_{x' \in \Xsf} P(x, x') =1$ for all $x \in \Xsf$ .

Fix $P \in \lopx$ and suppose first that $P \in \mopx$ . Then

(\phi P)(x') = \sum_x P(x, x') \phi(x) \qquad (x' \in \Xsf)

(2.20)

is in $\dD(\Xsf)$ whenever $\phi \in \dD(\Xsf)$ , since, for any such $\phi$ , the vector $\phi P$ is clearly nonnegative and

\sum_{x'} (\phi P)(x') = \sum_x \sum_{x'} P(x, x') \phi(x) = \sum_x \phi(x) = 1.

Now suppose instead that $P \in \lopx$ and $\phi P \in \dD(\Xsf)$ whenever $\phi \in \dD(\Xsf)$ . It follows that $P(x, x')$ is nonnegative at arbitrary $(x,x')$ , since $(\phi P)(x') = P(x,x')$ when $\phi$ is the distribution that puts all mass on $x$ . Moreover, $P(x, \cdot)$ must sum to one at arbitrary $x$ because if $\phi$ is the distribution that puts all mass on $x$ , then

1 = \sum_{x'} (\phi P)(x') = \sum_{x'} P(x,x').

Markov operators are important for us because they generate Markov dynamics, a foundation of dynamic programming. Thus, (2.19) is a rule for updating distributions by one period under the Markov dynamics specified by $P$ . We’ll use it often in the next chapter.
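The characterization used in the solution above can be illustrated as follows. The sketch checks the two matrix conditions for a Markov operator and verifies that $\phi \mapsto \phi P$ maps distributions to distributions, while a matrix violating the row-sum condition does not. The matrices and function names are illustrative choices.

```python
def is_markov(P, tol=1e-12):
    """Nonnegative entries and unit row sums."""
    return (all(p >= 0 for row in P for p in row)
            and all(abs(sum(row) - 1.0) < tol for row in P))


def is_distribution(phi, tol=1e-12):
    """Nonnegative entries that sum to one."""
    return all(p >= 0 for p in phi) and abs(sum(phi) - 1.0) < tol


def push_forward(phi, P):
    """(phi P)(x') = sum over x of P(x, x') phi(x)."""
    n = len(P)
    return [sum(P[x][xp] * phi[x] for x in range(n)) for xp in range(n)]


P = [[0.5, 0.5, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.3, 0.7]]
not_markov = [[0.5, 0.6],
              [0.2, 0.8]]               # first row sums to 1.1

phi = [1.0, 0.0, 0.0]                   # all mass on the first state
path = [phi]
for _ in range(3):                      # distributions after 1, 2, 3 periods
    path.append(push_forward(path[-1], P))

bad_image = push_forward([1.0, 0.0], not_markov)
```

As in the argument above, pushing a point mass on $x$ through $P$ returns the row $P(x, \cdot)$ , which is why row sums must equal one.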

2.4Chapter Notes¶

Davey & Priestley (2002) provide a good introduction to partial orders and order-theoretic concepts. Our favorite books on fixed points and analysis include Ok (2007), Zhang (2012), Cheney (2013), and Atkinson & Han (2005). Good background material on order-theoretic fixed-point methods can be found in Guo et al. (2004) and Zhang (2012).

Footnotes¶

For example, eigenvalues of $A = \diag(-1, 0)$ are $\{-1, 0\}$ . Hence $\rho(A) = |-1| = 1$ , which is not an eigenvalue of $A$ .

References¶

Atkinson, K., & Han, W. (2005). Theoretical Numerical Analysis (Vol. 39). Springer.
Aliprantis, C. D., & Burkinshaw, O. (1998). Principles of Real Analysis (3rd ed.). Academic Press.
Sargent, T., & Stachurski, J. (2023). Economic Networks: Theory and Computation. Cambridge University Press.
Meyer, C. D. (2000). Matrix Analysis and Applied Linear Algebra (Vol. 71). SIAM.
Krasnosel’skii, M. A., Vainikko, G. M., Zabreiko, P. P., Rutitskii, Ya. B., & Stetsenko, V. Ya. (1972). Approximate Solution of Operator Equations. Springer Netherlands.
Kreyszig, E. (1978). Introductory Functional Analysis with Applications (Vol. 1). Wiley.
Zaanen, A. C. (2012). Introduction to Operator Theory in Riesz Spaces. Springer.
Davey, B. A., & Priestley, H. A. (2002). Introduction to Lattices and Order. Cambridge University Press.
Ok, E. A. (2007). Real Analysis with Economic Applications (Vol. 10). Princeton University Press.
Zhang, Z. (2012). Variational, Topological, and Partial Order Methods with Their Applications (Vol. 29). Springer.
Cheney, W. (2013). Analysis for Applied Mathematics (Vol. 208). Springer Science & Business Media.
Guo, D., Cho, Y. J., & Zhu, J. (2004). Partial Ordering Methods in Nonlinear Problems. Nova Publishers.