Abstract Decision Processes - Dynamic Programming Volume II: General States

One of the principal objects of theoretical research in any department of knowledge is to find the point of view from which the subject appears in its greatest simplicity.
—Josiah Willard Gibbs

Having reviewed some applications in Chapter 1, our next aim is to present a general theory of dynamic programming that includes these applications as special cases and extends to many new problems.

The framework developed in this chapter leads with order theory rather than metric structure. The reason is that contraction-based arguments on Banach spaces, the standard foundation for dynamic programming, are fragile. Many natural variants of the textbook setting, including nonstandard discounting, nonlinear period aggregators, and nonstandard value spaces, among others, break the contraction property. As suggested by the examples in Chapter 1, contraction is the exception rather than the rule once one moves outside the additively separable, expected-reward template.

What does unify these settings is that policy operators are order preserving and that lifetime values are characterized as fixed points of those operators on partially ordered sets. This chapter develops the optimality and convergence theory those ingredients support. This investment will pay off in later chapters: subsequent theory uses this foundation to establish, for many classes of models, the same fundamental optimality results and algorithmic convergence guarantees one expects from classical theory, even without contraction-type properties. The classical theory itself is recovered as a special case: the firm problem of Section 1.1, the finite MDP of Section 1.2, and the optimal savings problem of Section 1.3 all reappear as concrete instances.

A second feature of the framework is methodological. Order-theoretic proofs are almost entirely algebraic: each result reduces to systematic application of a small number of named hypotheses, with few analytic prerequisites. This setup pays off in two ways. It eases human analysis — a proof in this chapter typically needs nothing beyond applications of definitions and chains of inequalities — and it makes the theory a natural target for machine-assisted reasoning.

We begin with definitions and basic properties. Some readers will find it helpful to review the main order-theoretic concepts in Section A.1.2 before proceeding further.

The optimality results given in this chapter are high-level. We use them mostly as inputs for downstream optimality theory, rather than for applications per se. (In particular, from Chapter 3 on, we leverage findings from this chapter to construct downstream results that are more easily applied to specific problems.) When presenting this high-level theory, we refrain from giving examples until Section 2.3, hoping that readers have reviewed Chapter 1, or otherwise have strong backgrounds in core concepts of dynamic programming. Such readers will be able to appreciate the practical value of the abstract framework to be presented here.

2.1 ADPs on Posets

We begin by defining abstract dynamic programs and listing some of their properties, including the Bellman equation and Bellman operator. We then state conditions under which essential optimality properties hold.

2.1.1 Definitions and Properties

In this section we define abstract dynamic programs and list some simple properties.

2.1.1.1 Key Definitions

Let’s start with the most important concept.

In what follows,

$V$ is called the value space.
Each operator $T_\sigma$ in $\TT$ is called a policy operator.
$\Sigma$ is an arbitrary index set and elements of $\Sigma$ will be referred to as policies.

In applications we will impose conditions under which each $T_\sigma$ has a unique fixed point in $V$ . In these settings, the significance of $T_\sigma$ is that its fixed point represents the lifetime value of following policy $\sigma$ . We will denote the unique fixed point of $T_\sigma$ by $v_\sigma$ and call it the $\sigma$ -value function.

Figure 2.1 illustrates a simple ADP with $V = \RR$ (paired with the usual order) and $\Sigma = \{\sigma, \sigma', \sigma''\}$ . Each policy operator is an affine self-map on $\RR$ , and each has a unique fixed point: $v_\sigma$ , $v_{\sigma'}$ , and $v_{\sigma''}$ , respectively.

Figure 2.1: An ADP on $\RR$
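To make the picture concrete, the following sketch builds a three-policy ADP on $\RR$ of the kind shown in Figure 2.1. The intercepts and slopes in `COEFFS` are hypothetical (the figure does not report numerical values), as are the function names. Exact rational arithmetic is used so that fixed points can be compared without rounding error.

```python
from fractions import Fraction as F

# Hypothetical (intercept, slope) pairs: T_sigma(v) = a + b * v.
# A slope with 0 <= b < 1 makes T_sigma order preserving on R
# and guarantees exactly one fixed point.
COEFFS = {
    "sigma": (F(1), F(1, 2)),
    "sigma'": (F(2), F(1, 4)),
    "sigma''": (F(1), F(3, 4)),
}


def policy_operator(sigma):
    """Return the affine self-map T_sigma on R."""
    a, b = COEFFS[sigma]
    return lambda v: a + b * v


def sigma_value(sigma):
    """Solve v = a + b * v for the unique fixed point v_sigma."""
    a, b = COEFFS[sigma]
    return a / (1 - b)


for s in COEFFS:
    v_s = sigma_value(s)
    # Each v_sigma is fixed by its own policy operator.
    assert policy_operator(s)(v_s) == v_s
    print(s, v_s)
```

With these assumed coefficients the three $\sigma$-value functions are $2$, $8/3$ and $4$, so this toy ADP is well-posed in the sense defined below.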

In Chapter 1, the concept of greedy policies (see (1.51)) played a key role. It will also play a key role here, in our ADP framework. The definition is as follows.

As shown below, Definition 2.1.2 generalizes the notion of greedy policies for the problems considered in Chapter 1. Throughout this book we let

V_G \coloneq \setntn{v \in V}{\text{ at least one } v \text{-greedy policy exists}}.

(2.2)

Figure 2.2 illustrates the $v$ -greedy concept for the ADP from Figure 2.1, evaluated at the marked point $v$ . The three values $T_\sigma \, v$ , $T_{\sigma'} \, v$ , $T_{\sigma''} \, v$ are indicated: here $T_{\sigma'} \, v$ is the largest, so $\sigma'$ is the $v$ -greedy policy.

In applications, solving for a $v$ -greedy policy is often straightforward. Moreover, there exist conditions under which solving the overall problem reduces to solving for a $v$ -greedy policy with a “correct” choice of $v$ . We pursue this idea below.
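As a rough illustration of how a $v$-greedy policy might be computed when $V = \RR$ and $\Sigma$ is finite, the sketch below evaluates every policy operator at $v$ and keeps the policies whose image is largest. The affine coefficients and the helper names `T_sigma` and `greedy_policies` are assumptions made for this example only.

```python
from fractions import Fraction as F

# Hypothetical (intercept, slope) pairs: T_sigma(v) = a + b * v on V = R.
COEFFS = {
    "sigma": (F(1), F(1, 2)),
    "sigma'": (F(2), F(1, 4)),
    "sigma''": (F(1), F(3, 4)),
}


def T_sigma(sigma, v):
    """Apply the policy operator T_sigma to v."""
    a, b = COEFFS[sigma]
    return a + b * v


def greedy_policies(v):
    """Return every sigma with T_tau v <= T_sigma v for all tau."""
    images = {s: T_sigma(s, v) for s in COEFFS}
    top = max(images.values())
    return [s for s, image in images.items() if image == top]


print(greedy_policies(F(0)))  # a single greedy policy
print(greedy_policies(F(2)))  # a tie: two greedy policies
```

Because the policy set is finite and $\RR$ is totally ordered, the maximum always exists here, so every $v$ belongs to $V_G$. In a general poset the images $T_\sigma \, v$ need not be comparable and a greedy policy may fail to exist.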

2.1.1.2 Properties of ADPs

We now list properties useful for ADP optimality theory. First, we call $(V, \TT)$

well-posed if each $T_\sigma \in \TT$ has a unique fixed point in $V$ .

Well-posedness is a minimal precondition for a coherent dynamic program: to maximize lifetime values, we need them to be well-defined. When well-posedness holds we will always use $v_\sigma$ to denote the unique fixed point of $T_\sigma$ .

We call $(V, \TT)$

finite if $\TT$ is a finite set.

Finiteness holds for the finite MDPs described in Section 1.2, as well as other related settings with finite states and actions. Not surprisingly, finite ADPs have attractive optimality properties.

We call $(V, \TT)$

regular if $V_G = V$ ; that is, if a $v$ -greedy policy exists for all $v \in V$ .

Regularity helps us to construct algorithms and to obtain existence of optimal policies. Since $V = V_G$ under regularity, Lemma 2.1.1 implies that $T$ is well-defined on all of $V$ whenever this property holds. We focus primarily on regular ADPs.

We call $(V, \TT)$

order stable if each $T_\sigma \in \TT$ is order stable on $V$ .

Recalling the definitions in Section A.1.2.10, this means that $(V, \TT)$ is well-posed and, for each $\sigma \in \Sigma$ and $v \in V$ ,

(a) $v \preceq T_\sigma \, v$ implies $v \preceq v_\sigma$, and
(b) $T_\sigma \, v \preceq v$ implies $v_\sigma \preceq v$.

Most proofs in this chapter use only properties (a) and (b). Occasionally we will need monotone convergence of the iterates, in which case we will assume that $(V, \TT)$ is

strongly order stable if each $T_\sigma \in \TT$ is strongly order stable on $V$ .

As per the discussion in Section A.1.2.10, this implies that each $T_\sigma$ is order stable and, in addition, $v \preceq T_\sigma \, v$ implies $T_\sigma^n v \uparrow v_\sigma$ and $T_\sigma \, v \preceq v$ implies $T_\sigma^n v \downarrow v_\sigma$ .
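For intuition, consider a single affine policy operator $T_\sigma \, v = a + b v$ on $\RR$ with $0 \leq b < 1$. Then $v \leq T_\sigma \, v$ holds exactly when $v \leq a / (1 - b) = v_\sigma$, which is property (a), and symmetrically for (b). The sketch below checks both properties on a grid and confirms that the iterates move monotonically toward $v_\sigma$. The coefficients and names are hypothetical, and a finite check of this kind illustrates the definitions rather than proving anything.

```python
from fractions import Fraction as F

a, b = F(1), F(3, 4)      # hypothetical intercept and slope
v_sigma = a / (1 - b)     # unique fixed point of T_sigma, here 4


def T_sigma(v):
    """Affine, order-preserving policy operator on R."""
    return a + b * v


# Properties (a) and (b) on a grid of test points in [-5, 10].
for v in [F(k, 2) for k in range(-10, 21)]:
    if v <= T_sigma(v):
        assert v <= v_sigma
    if T_sigma(v) <= v:
        assert v_sigma <= v


def iterates(v, n):
    """Return [v, T_sigma v, ..., T_sigma^n v]."""
    out = [v]
    for _ in range(n):
        out.append(T_sigma(out[-1]))
    return out


up = iterates(F(0), 20)     # starts below: 0 <= T_sigma 0
down = iterates(F(10), 20)  # starts above: T_sigma 10 <= 10
assert all(x <= y <= v_sigma for x, y in zip(up, up[1:]))
assert all(v_sigma <= y <= x for x, y in zip(down, down[1:]))
```

In this example the gap $v_\sigma - T_\sigma^n v$ equals $b^n (v_\sigma - v)$, so it shrinks geometrically, which is consistent with strong order stability.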

2.1.1.3 The Bellman Equation

The Bellman equation, which played a central role in the optimality theory for the optimal savings problem of Section 1.3.2.1, is again key for ADPs. Here we define the ADP Bellman equation and record some preliminary observations. Throughout this section, $(V, \TT)$ is an ADP with policy set $\Sigma$.

In (2.3), the supremum is taken over all $\sigma \in \Sigma$ . There is, in general, no guarantee that the supremum exists.

Evidently $v \in V$ satisfies the Bellman equation if and only if $Tv$ exists and $Tv = v$ .

Figure 2.3 illustrates the Bellman operator for the simple ADP introduced earlier. At each $v$ , the value $Tv$ is the largest of $T_\sigma \, v$ , $T_{\sigma'} \, v$ , $T_{\sigma''} \, v$ : $T$ is the upper envelope of the three policy operators. The fixed point of $T$ is $v_{\sigma''}$ , which is also the largest of the three $\sigma$ -value functions and hence the value function $\vmax$ .
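The upper-envelope description can be checked numerically. The sketch below uses hypothetical affine coefficients (not the ones behind Figure 2.3, which are not reported) and an assumed helper name `bellman`. It computes $T$ as the pointwise maximum of the policy operators and verifies that the largest $\sigma$-value function is a fixed point of $T$ while the other two are not.

```python
from fractions import Fraction as F

# Hypothetical (intercept, slope) pairs: T_sigma(v) = a + b * v on V = R.
COEFFS = {
    "sigma": (F(1), F(1, 2)),
    "sigma'": (F(2), F(1, 4)),
    "sigma''": (F(1), F(3, 4)),
}


def bellman(v):
    """Upper envelope: T v is the largest of the values T_sigma v."""
    return max(a + b * v for a, b in COEFFS.values())


# sigma-value functions (unique fixed points) and their maximum.
v_sigma = {s: a / (1 - b) for s, (a, b) in COEFFS.items()}
v_max = max(v_sigma.values())

# v_max solves the Bellman equation ...
assert bellman(v_max) == v_max
# ... while the smaller sigma-value functions do not.
for s, v in v_sigma.items():
    if v != v_max:
        assert bellman(v) != v
```

Here $\vmax = 4$ is attained by $\sigma''$, matching the qualitative pattern described for the figure.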

The next lemma provides some essential facts about $T$ on $V_G$ (as defined in (2.2)).

Proof

We begin with part (ii). Fix $v \in V_G$ and let $\sigma$ be $v$ -greedy. Then, by definition, $T_\sigma \, v$ is the greatest element of $\{T_\tau \, v\}_{\tau \in \Sigma}$ . A greatest element is also a supremum, so we have

T v \coloneq \bigvee_{\tau \in \Sigma} T_\tau \, v = T_\sigma \, v .

This gives both (a) and $\Leftarrow$ in (b) of part (ii). For $\Rightarrow$ of (b), suppose $Tv = T_\sigma \, v$. Since $Tv$ is the supremum of $\{T_\tau \, v\}_{\tau \in \Sigma}$, it is an upper bound of this set, so $T_\tau \, v \preceq Tv = T_\sigma \, v$ for all $\tau \in \Sigma$. Hence $\sigma$ is $v$-greedy.

Next we prove (i). For $v \in V_G$ , a $v$ -greedy policy exists, so $Tv$ is well-defined by (b) of part (ii). Regarding the order preserving claim, fix $v, w \in V_G$ with $v \preceq w$ . Let $\sigma \in \Sigma$ be $v$ -greedy. Since $T_\sigma$ is order preserving, we have $T v = T_\sigma \, v \preceq T_\sigma \, w \preceq T w$ . ◻

2.1.1.4 Subsets of the Value Space

We often refer to the following three subsets of the value space $V$ , the first of which was already introduced in (2.2) and is repeated here for convenience:

In the next lemma, $(V, \TT)$ is any ADP.

2.1.2 Optimization

In this subsection we define optimality for ADPs, connect it to the Bellman equation, and state the fundamental optimality properties. We then give sufficient conditions for these properties to hold.

2.1.2.1 Optimality and the Bellman Equation

We say that a policy $\sigma \in \Sigma$ is optimal for $(V, \TT)$ if $v_\sigma$ is a greatest element of $V_\Sigma$ . In other words, $\sigma$ is optimal if it attains the “highest possible” lifetime value.

Perhaps the most important aspect of the theory of dynamic programming is the link between optimality and the Bellman equation. To clarify this link we introduce some terminology. Let $(V, \TT)$ be a well-posed ADP and set

\vmax \coloneq \bigvee_\sigma v_\sigma \coloneq \bigvee V_\Sigma \quad \text{whenever the supremum exists}.

When $\vmax$ exists (i.e., when the supremum exists) we call $\vmax$ the value function of the ADP. The following statements are obvious from the definitions:

Existence of an optimal policy $\sigma$ implies that $\vmax$ exists and is equal to $v_\sigma$ .
If $\vmax = v_\sigma$ for some $\sigma \in \Sigma$ , then $\sigma$ is optimal.

At the same time, existence of $\vmax$ does not generally imply existence of a greatest element (and hence an optimal policy).

(When $\vmax$ does not exist both sets are understood as empty.) The following results will be useful for studying optimality.

Proof

Regarding part (i), suppose that $\vmax$ exists in $V$ . We prove the equality in (2.5) when $T\vmax=\vmax$ . Suppose first that $\sigma \in \Sigma$ is optimal, so that $v_\sigma = \vmax$ . Since $T_\sigma \, v_\sigma = v_\sigma$ , this implies $T_\sigma \, \vmax = \vmax$ . But $T \, \vmax = \vmax$ , so $T_\sigma \, \vmax = T \vmax$ . Hence $\sigma$ is $\vmax$ -greedy, since $T_\tau \, \vmax \preceq T \vmax = T_\sigma \, \vmax$ for all $\tau$ . Suppose next that $\sigma$ is $\vmax$ -greedy, so that $T_\sigma \, \vmax = T \vmax = \vmax$ . But $v_\sigma$ is the unique fixed point of $T_\sigma$ in $V$ , so $v_\sigma = \vmax$ . Hence $\sigma$ is an optimal policy.

Regarding part (ii), suppose that $\vmax$ exists in $V_G$ and $T\vmax = \vmax$ . Let $\sigma$ be $\vmax$ -greedy. Then $T_\sigma \, \vmax = T \vmax = \vmax$ . But $v_\sigma$ is the unique fixed point of $T_\sigma$ , so $v_\sigma = \vmax$ . Hence $\sigma$ is an optimal policy. Bellman’s principle of optimality follows from part (i). Regarding (ii, $\Leftarrow$ ), let $\sigma \in \Sigma$ be optimal, so that $v_\sigma = \vmax$ . By Bellman’s principle of optimality, the policy $\sigma$ is $\vmax$ -greedy. As a result, $T \vmax = T_\sigma \, \vmax = T_\sigma \, v_\sigma = v_\sigma = \vmax$ . In particular, $\vmax$ exists and satisfies the Bellman equation. ◻

2.1.2.2 The Fundamental Optimality Properties

Throughout this section, $(V, \TT)$ is a well-posed ADP.

It is important to emphasize here that (B1)–(B3) are not independent. Indeed, (B2) implies both (B1) and (B3), as shown in the next proposition.

Despite the fact that (B2) implies both (B1) and (B3), we have stated them together because, in terms of applications, all three parts are significant. (B1) tells us that the problem at hand has a solution. (B3) implies that a solution can be computed by taking a $\vmax$ -greedy policy, and that any other optimal policy must also be $\vmax$ -greedy. Finally, (B2) provides a restriction that can help us calculate $\vmax$ . Provided that we search in $V_G$ , any fixed point of $T$ is equal to $\vmax$ .

Proposition 2.1.4 tells us our main goal is to construct conditions under which $\vmax$ exists and is the unique fixed point of $T$ in $V_G$ . We begin this task in Section 2.2.2.

The next exercise generalizes a well-known result from more traditional dynamic programming frameworks (see, e.g., Puterman (2005), Theorem 6.2.6).

2.1.2.3 From Fixed Points to Optimality

In this section we investigate the following question: When does existence of a solution to the Bellman equation imply the fundamental optimality properties (B1)–(B3)? We will see that order stability is useful here. The results stated in this section should be understood as intermediate inputs: we use them in downstream theorems that are tuned to applications.

In the statement of the next result, $(V, \TT)$ is an ADP and $T$ is the Bellman operator.

Proof of Theorem 2.1.5.

Let $(V, \TT)$ be as stated.

Suppose first that (i) holds, with $T \bar v = \bar v$ and $\bar v \in V_G$ . In view of Proposition 2.1.4, it suffices to show that $\vmax$ exists and is the unique solution to the Bellman equation in $V_G$ . By the characterization of greedy policies in Lemma 2.1.1, we can choose a $\sigma \in \Sigma$ such that $\bar v = T \, \bar v = T_\sigma \, \bar v$ . By well-posedness, the unique fixed point of $T_\sigma$ in $V$ is its $\sigma$ -value function $v_\sigma$ , so the last equation yields $\bar v = v_\sigma$ . Moreover, if $\tau$ is any policy, then $T_\tau \, v_\sigma \preceq T \, v_\sigma = v_\sigma$ and hence, by hypothesis, $v_\tau \preceq v_\sigma$ . These facts imply that the fixed point $\bar v$ is a greatest element of $V_\Sigma$ . Moreover, if $v'$ is another fixed point of $T$ in $V_G$ , then, by the same argument, $v'$ is also a greatest element of $V_\Sigma$ . Since greatest elements are unique, we have $v' = \bar v$ . These arguments prove that $\vmax$ exists and is the unique solution to the Bellman equation in $V_G$ , so (ii) holds.

Conversely, suppose that (ii) holds. Then (B2) states that $\vmax$ exists and is the unique solution to the Bellman equation in $V_G$ , so $\vmax \in V_G$ is a fixed point of $T$ , giving (i). ◻

The next result is almost a corollary of Theorem 2.1.5. It imposes a strong order completeness condition on the value space. (See Section A.5.1 for background on order completeness.)

While the chain completeness assumption is strong, Theorem 2.1.7 is nonetheless significant: It says that, at least for this idealized setting, regular well-posed ADPs possess all the optimality properties that we seek.

What drives this strong result? The definition of ADPs is itself doing some heavy lifting: it requires that lifetime values are inherently recursive and reflect order structure (being fixed points of order-preserving operators). Bellman’s fundamental ideas work smoothly in this setting when we add some regularity.

2.2 Algorithms and Convergence

In this section we introduce algorithms for solving ADPs, including value function iteration, Howard policy iteration and optimistic policy iteration, and establish conditions under which they converge. We treat both maximization and minimization, handling the latter via the theory of dual ADPs.

2.2.1 Algorithms

In this section we discuss algorithms for dynamic programming and present some preliminary results. The three algorithms we consider are value function iteration (VFI), Howard policy iteration (HPI) and optimistic policy iteration (OPI). These algorithms generalize the ones presented for the finite MDP case in Section 1.2.1.3.

2.2.1.1 Operators

As a first step, we introduce two operators. Throughout Section 2.2.1.1 we take $(V, \TT)$ to be a fixed ADP. As usual, when $(V, \TT)$ is well-posed, the unique fixed point of $T_\sigma \in \TT$ in $V$ is denoted by $v_\sigma$ .

(Here and below, the dependence of $W$ on $m$ is often suppressed to simplify notation.)

Like the Bellman operator $T$ , the map $W$ is well-defined on all of $V$ when $(V, \TT)$ is regular. The Howard policy operator $H$ is well-defined on all of $V$ when $(V, \TT)$ is well-posed and regular. Note that, for both of these maps, we always select the same $v$ -greedy policy when applying them to some fixed $v$ .^[1]

Below we will associate VFI, OPI, and HPI with fixing a $v \in V$ and then iteratively applying the operators $T$ , $W$ , and $H$ respectively. A small amount of thought will convince you that, for the optimal savings ADP described in Section 2.3.2, these iterative procedures coincide with our earlier description of VFI, HPI, and OPI from Section 1.2.1.3.
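The three operators can be sketched for a toy ADP on $\RR$ with finitely many affine policy operators. Everything named here (`COEFFS`, `greedy`, `T`, `W`, `H`) is an assumption made for illustration. The operators follow the descriptions used in this section: $Tv = T_\sigma \, v$, $Wv = T_\sigma^m v$ and $Hv = v_\sigma$, where $\sigma$ is $v$-greedy.

```python
from fractions import Fraction as F

# Hypothetical (intercept, slope) pairs: T_sigma(v) = a + b * v on V = R.
COEFFS = {
    "sigma": (F(1), F(1, 2)),
    "sigma'": (F(2), F(1, 4)),
    "sigma''": (F(1), F(3, 4)),
}


def T_sigma(sigma, v):
    a, b = COEFFS[sigma]
    return a + b * v


def greedy(v):
    """Pick one v-greedy policy. Ties go to the first key, so the
    same policy is always selected at a given v."""
    return max(COEFFS, key=lambda s: T_sigma(s, v))


def T(v):
    """Bellman operator (one step of VFI)."""
    return T_sigma(greedy(v), v)


def W(v, m):
    """Optimistic policy operator: m applications of T_sigma,
    with sigma chosen once, at the initial v."""
    sigma = greedy(v)
    for _ in range(m):
        v = T_sigma(sigma, v)
    return v


def H(v):
    """Howard policy operator: the value of a v-greedy policy."""
    a, b = COEFFS[greedy(v)]
    return a / (1 - b)


v = F(0)
print(T(v), W(v, 2), H(v))
```

At $v = 0$ the greedy policy is $\sigma'$, and the three operators return $2$, $5/2$ and $8/3$ respectively. Setting $m = 1$ in `W` recovers `T`.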

The next lemma collects useful results for the operators introduced above. In the statement, $(V, \TT)$ is an ADP with Howard policy operator $H$ , optimistic policy operator $W$ and Bellman operator $T$ . We assume regularity, so that $V_G = V$ .

Proof

Regarding (L1), fix $v \in V$ with $H \, v = v$ and let $\sigma$ be a $v$ -greedy policy such that $H \, v = v_\sigma$ . Then $v_\sigma = v$ . Since $\sigma$ is $v$ -greedy, $T_\sigma \, v = T \, v$ . Since $v_\sigma$ is fixed for $T_\sigma$ , we also have $T_\sigma \, v = v$ . Combining the last two equalities proves (L1).

Regarding (L2), fix $v \in V_U$ . Since $v \preceq Tv$ and $T$ is order preserving on $V_U$ , we have $Tv \preceq TTv$ . Hence $Tv \in V_U$ . Regarding $W$ , let $\sigma$ be $v$ -greedy with $Wv = T_\sigma^m v$ . Since $\sigma$ is $v$ -greedy, $Tv = T_\sigma \, v$ . Using this and the order preserving property of $T$ and $T_\sigma$ , we get

\begin{aligned} W v &= T_\sigma \, T_\sigma^{m-1} v \preceq T \, T_\sigma^{m-1} v &&\text{(since } T_\sigma \preceq T \text{)} \\ &\preceq T \, T_\sigma^{m-1} \, T v &&\text{(since } v \preceq Tv \text{ and } T, T_\sigma \text{ order preserving)} \\ &= T \, T_\sigma^{m-1} \, T_\sigma \, v = T \, T_\sigma^m \, v = T W v &&\text{(using } Tv = T_\sigma \, v \text{).} \end{aligned}

Hence $W v \in V_U$ . Finally, regarding $H$ , we observe that $Hv \in V_\Sigma$ and, since $V_G = V$ by regularity, Lemma 2.1.2 gives $V_\Sigma \subset V_U$ .

To prove (L3) we fix $v \in V_U$ . Letting $\sigma$ be $v$ -greedy, we have $v \preceq Tv = T_\sigma \, v$ . Iterating on this inequality with $T_\sigma$ proves that $(T_\sigma^k \, v)$ is increasing. In particular, $Tv = T_\sigma \, v \preceq W v$ . For the second inequality in (L3) we use the fact that $T_\sigma \preceq T$ on $V$ and $T$ and $T_\sigma$ are both order preserving to obtain $W v = T^m_\sigma v \preceq T^m v$ . ◻

The next lemma adds order stability and derives additional implications.

Proof of Lemma 2.2.2.

Our first claim is that

u,v \in V_U \text{ with } u \preceq v \implies Tu \preceq Wv \; \text{ and } Tu \preceq Hv.

(2.8)

To show this we fix such $u, v$ and use regularity to select a $v$ -greedy policy $\sigma$ . Since $(V, \TT)$ is order stable the $\sigma$ -value function (unique fixed point of $T_\sigma$ ) exists. Let $v_\sigma$ be the $\sigma$ -value function, so that $T_\sigma \, v_\sigma = v_\sigma$ and $v_\sigma = H v$ . Since $v \in V_U$ we have

v \preceq Tv = T_\sigma \, v \preceq T^m_\sigma \, v = W v \preceq v_\sigma = Hv.

(2.9)

The second inequality is by iterating on $v \preceq T_\sigma \, v$ , while the third is by order stability. Since $Tu \preceq Tv$ , we can use (2.9) to obtain (2.8). Iterating on (2.8) produces (2.7). The last claim in Lemma 2.2.2 follows from (2.9), which tells us that elements of $V_U$ are mapped up by $T$ , $W$ , and $H$ . ◻

2.2.1.2 Convergence

In this section, we assume that $(V, \TT)$ is a regular ADP and the fundamental optimality properties hold. Let $\vmax$ denote the value function.

For OPI, convergence is required to hold for every choice of the OPI step size $m \in \NN$.

The next lemma is a useful preliminary.

The next lemma sharpens Lemma 2.2.2 by adding upper bounds in terms of $\vmax$ .

Proof

Fix $v \in V_U$ . By Lemma 2.2.2, $T^n v$ , $W^n v$ , $H^n v$ are increasing in $n$ and $T^n v \preceq W^n v$ , $T^n v \preceq H^n v$ for all $n$ . In particular, $v \preceq T^n v$ in both chains. It remains to show that $W^n v \preceq \vmax$ and $H^n v \preceq \vmax$ for all $n$ .

We argue by induction on $n$ . For $n = 0$ : $v \preceq \vmax$ by Lemma 2.2.3. For the inductive step, assume $W^n v \preceq \vmax$ . By Lemma 2.2.1(L2), $W^n v \in V_U$ , so we may pick $\sigma$ that is $(W^n v)$ -greedy. Applying (2.9) at $W^n v$ yields $W^{n+1} v = T_\sigma^m \, W^n v \preceq v_\sigma \preceq \vmax$ , the last inequality holding because $\vmax$ is the greatest element of $V_\Sigma$ . The argument for $H^n v \preceq \vmax$ is identical, using $H^{n+1} v = v_\sigma \preceq \vmax$ in the last step. ◻
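The bounds established in this proof can be observed numerically on a toy ADP on $\RR$. The sketch below uses hypothetical affine policy operators and assumed helper names. Starting from a point $v_0$ with $v_0 \leq T v_0$, it checks that $v_0 \leq T^n v_0 \leq W^n v_0 \leq \vmax$ and $v_0 \leq T^n v_0 \leq H^n v_0 \leq \vmax$ for the first few $n$.

```python
from fractions import Fraction as F

# Hypothetical (intercept, slope) pairs: T_sigma(v) = a + b * v on V = R.
COEFFS = {
    "sigma": (F(1), F(1, 2)),
    "sigma'": (F(2), F(1, 4)),
    "sigma''": (F(1), F(3, 4)),
}


def T_sigma(sigma, v):
    a, b = COEFFS[sigma]
    return a + b * v


def greedy(v):
    """One v-greedy policy, with ties broken by key order."""
    return max(COEFFS, key=lambda s: T_sigma(s, v))


def T(v):
    return T_sigma(greedy(v), v)


def W(v, m):
    sigma = greedy(v)
    for _ in range(m):
        v = T_sigma(sigma, v)
    return v


def H(v):
    a, b = COEFFS[greedy(v)]
    return a / (1 - b)


def iterate(op, v, n):
    """Apply the operator op to v exactly n times."""
    for _ in range(n):
        v = op(v)
    return v


v0, m = F(0), 3
v_max = max(a / (1 - b) for a, b in COEFFS.values())
assert v0 <= T(v0)  # v0 lies in V_U

for n in range(1, 11):
    t_n = iterate(T, v0, n)
    w_n = iterate(lambda x: W(x, m), v0, n)
    h_n = iterate(H, v0, n)
    assert v0 <= t_n <= w_n <= v_max
    assert v0 <= t_n <= h_n <= v_max
```

The step size `m = 3` and the starting point are arbitrary choices. Other values of $m \in \NN$ and other starting points in $V_U$ should pass the same checks.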

2.2.2 Optimality and Convergence

In this section we provide conditions for optimality and for convergence of algorithms, focusing on ADPs that are well-posed and regular. These high-level results will later be used as inputs for lower-level results that are more straightforward to check in applications. In each case, we aim for the fundamental optimality properties and for convergence of the three major algorithms.

2.2.2.1 Case I: Finite ADPs

In applications we often deal with dynamic programs that have finitely many states and actions. This finiteness implies that the set of feasible policies is finite. The next result deals with this case.

Proof

Let $(V, \TT)$ be as stated. Fix $v$ in $V_U$ (which is nonempty by Lemma 2.1.2). Since $(V, \TT)$ is well-posed and regular, the Howard policy operator $H$ is well-defined on $V$ . Let $v_n = H^n v$ for all $n \geq 0$ . By Lemma 2.2.2, we have $v_n \preceq v_{n+1}$ for all $n$ . Since $(v_n)_{n \geq 1}$ is contained in the finite set $V_\Sigma$ , it must be that $v_{n+1} = v_n$ for some $n \in \NN$ . But then $H v_n = v_n$ , so, by Lemma 2.2.1, we have $T v_n = v_n$ . Since $T$ has a fixed point in $V$ , the fundamental optimality properties hold (Corollary 2.1.6). We have also shown that HPI converges in finitely many steps. ◻
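The argument in this proof translates directly into a stopping rule: iterate $H$ until the value stops changing, at which point the current value is a fixed point of $T$. The following sketch applies this to a hypothetical three-policy ADP on $\RR$. The coefficients, the function name `howard_policy_iteration` and the iteration cap are all assumptions.

```python
from fractions import Fraction as F

# Hypothetical (intercept, slope) pairs: T_sigma(v) = a + b * v on V = R.
COEFFS = {
    "sigma": (F(1), F(1, 2)),
    "sigma'": (F(2), F(1, 4)),
    "sigma''": (F(1), F(3, 4)),
}


def T_sigma(sigma, v):
    a, b = COEFFS[sigma]
    return a + b * v


def greedy(v):
    """One v-greedy policy, with ties broken by key order."""
    return max(COEFFS, key=lambda s: T_sigma(s, v))


def T(v):
    return T_sigma(greedy(v), v)


def H(v):
    a, b = COEFFS[greedy(v)]
    return a / (1 - b)


def howard_policy_iteration(v, max_iter=100):
    """Iterate H from v and return the path [v, H v, H^2 v, ...]
    up to the first fixed point of H."""
    path = [v]
    for _ in range(max_iter):
        v_next = H(path[-1])
        if v_next == path[-1]:
            return path
        path.append(v_next)
    raise RuntimeError("no fixed point of H found within max_iter")


path = howard_policy_iteration(F(0))
assert T(path[-1]) == path[-1]  # the limit solves the Bellman equation
print(path)
```

Starting from $v = 0$, the path visits $0$, $8/3$ and $4$ and then stops, so HPI terminates after two updates in this example. Exact rational arithmetic makes the equality test in the stopping rule reliable; with floating point one would typically compare greedy policies instead.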

2.2.2.2 Case II: Chain Complete Value Space

Next we state a result for the chain complete setting that extends Theorem 2.1.7. It shows that adding strong order stability is enough to provide convergence of all algorithms.

2.2.2.3 Case III: Countably Dedekind Complete Value Space

The result in this section replaces chain completeness with countable Dedekind completeness and order continuity. To state the result we need two new definitions. We call $(V, \TT)$

order continuous if each $T_\sigma \in \TT$ is order continuous on $V$ .

Order continuity means that $T_\sigma \, v_n \uparrow T_\sigma \, v$ whenever $\sigma \in \Sigma$ and $v_n \uparrow v$ . This technical condition holds in many of the applications we consider. A general discussion of order continuous operators is provided in Section A.5.1.3.

In addition, we call $(V, \TT)$

order bounded if there exists a $u \in V$ with $T_\sigma \, u \preceq u$ for all $T_\sigma \in \TT$ .

Now we can state the main result of this section.

Proof

In view of Corollary 2.1.6, the fundamental optimality properties will hold when $T$ has a fixed point in $V$ . To see that this is true, fix any $v \in V_U$ (which is nonempty by Lemma 2.1.2) and set $v_n \coloneq T^n v$ . Since $(V, \TT)$ is order bounded, there is a $u \in V$ with $Tu \preceq u$ and $v \preceq u$ (see Exercise 2.2.2). Hence $v_n \preceq u$ for all $n$ . Because $V$ is countably Dedekind complete, we deduce existence of a $\bar v \in V$ with $v_n \uparrow \bar v$ . We claim that $T \bar v = \bar v$ . Indeed, $v_{n+1} = T v_n \preceq T \bar v$ for all $n$ , so, taking the supremum, $\bar v \preceq T \bar v$ . For the reverse inequality we take $\sigma$ to be $\bar v$ -greedy and use order continuity of $T_\sigma$ to obtain

T \bar v = T_\sigma \, \bar v = T_\sigma \, \bigvee_n v_n = \bigvee_n T_\sigma \, v_n \preceq \bigvee_n T \, v_n = \bigvee_n v_{n+1} = \bar v.

The fundamental optimality properties are now proved. In view of these properties, the only fixed point of $T$ in $V$ is $\vmax$. Hence $T^n v = v_n \uparrow \bar v = \vmax$. This proves convergence of VFI. Convergence of OPI and HPI follows from Corollary 2.2.5. ◻

Table 2.1 summarizes the conditions and conclusions for the three cases treated above.

Table 2.1: Conditions and conclusions for the three convergence cases

| | Case I (Theorem 2.2.6) | Case II (Theorem 2.2.7) | Case III (Theorem 2.2.8) |
| --- | :---: | :---: | :---: |
| Regular | ✓ | ✓ | ✓ |
| Order stable | ✓ | ✓ | ✓ |
| Finite | ✓ | | |
| Chain complete | | ✓ | |
| Countably Dedekind complete | | | ✓ |
| Order bounded | | | ✓ |
| Order continuous | | | ✓ |
| Optimality properties | ✓ | ✓ | ✓ |
| VFI, OPI, HPI converge | | ✓ | ✓ |
| HPI finite convergence | ✓ | | |

2.2.3 Minimization

In some dynamic programs, the objective is to minimize lifetime cost rather than to maximize lifetime reward. While we focus primarily on maximization in this book, the present section discusses how to handle minimization problems. The content of this section can be summarized as follows: for a given ADP $(V, \TT)$, a minimization problem can be converted to a maximization problem by reversing the partial order on $V$. Further details are given below. Readers who prefer to focus on maximization results can safely skip ahead.

2.2.3.1 Definitions

Let $(V, \TT)$ be an ADP with policy set $\Sigma$ . We define the Bellman min-operator $\tmin$ corresponding to $(V, \TT)$ by

\tmin v = \bigwedge_\sigma T_\sigma \, v \quad \text{whenever the infimum exists.}

We say that $v \in V$ satisfies the Bellman min-equation if $\tmin v = v$ .

Paralleling the maximization terminology, we say that

$\sigma \in \Sigma$ is $v$ -min-greedy if $T_{\sigma} \, v \preceq T_\tau \, v$ for all $\tau \in \Sigma$ .

Analogous to the max case, we have

\sigma \text{ is } v \text{-min-greedy} \quad \iff \quad T_\sigma \, v = \tmin v.

In addition, we say that $(V, \TT)$ is

min-order bounded if there exists a $b \in V$ with $b \preceq T_\sigma \, b$ for all $T_\sigma \in \TT$ , and
min-regular if, for each $v \in V$ , at least one $v$ -min-greedy policy exists.

We let

V^G_{\triangledown} = \setntn{v \in V}{\text{ at least one } v \text{-min-greedy policy exists}}.

Now suppose $(V, \TT)$ is well-posed and let $V_\Sigma$ be defined as before (i.e., the set of $\sigma$ -value functions). In this setting we set $\vmin = \bigwedge_\sigma v_\sigma$ and call it the min-value function whenever the infimum exists. Also, we say that

$\sigma \in \Sigma$ is min-optimal for $(V, \TT)$ if $v_\sigma = \vmin$ , and
$(V, \TT)$ obeys Bellman’s principle of min-optimality if

\sigma \in \Sigma \text{ is min-optimal for } (V, \TT) \quad \iff \quad \sigma \text{ is } \vmin \text{-min-greedy}.

When $(V, \TT)$ is min-regular and well-posed, we define the Howard policy min-operator corresponding to $(V, \TT)$ via

\Hmin \colon V^G_{\triangledown} \to V_\Sigma, \qquad \Hmin v = v_\sigma \quad \text{ where } \sigma \text{ is } v \text{-min-greedy},

as well as, for each $m \in \NN$ , the optimistic policy min-operator via

\Wmin \colon V^G_{\triangledown} \to V, \qquad \Wmin v \coloneq T^m_\sigma v \quad \text{where} \quad \sigma \text{ is } v \text{-min-greedy}.

(2.11)

We say that the fundamental min-optimality properties hold if

(B1’) at least one min-optimal policy exists,

(B2’) $\vmin$ exists and is the unique solution to the Bellman min-equation in $V^G_\triangledown$ , and

(B3’) Bellman’s principle of min-optimality holds.

This definition parallels (B1)–(B3).

Let $V_D$ be all $v \in V^G_{\triangledown}$ with $\tmin v \preceq v$ . We say that

min-VFI converges if $\tmin^n v \downarrow \vmin$ for all $v \in V_D$ ,
min-OPI converges if $\Wmin^n v \downarrow \vmin$ for all $v \in V_D$ and all $m \in \NN$ , and
min-HPI converges if $\Hmin^n v \downarrow \vmin$ for all $v \in V_D$ .

To further increase clarity, when discussing maximization and minimization in the same section, we add a “max-” prefix to the previously introduced definitions that pertain to maximization. For example,

“ $v$ -greedy policies” will be referred to as $v$ -max-greedy policies,
“optimal policies” will be referred to as max-optimal policies,
“the Bellman equation” will be referred to as the Bellman max-equation,

and so on.

2.2.3.2 Dual ADPs

Let’s now investigate how minimization problems can be converted to maximization problems in this abstract setting. The key idea is that taking infima in $(V, \preceq)$ corresponds to taking suprema in the order-dual $(V, \preceq^\partial)$ .

Recall from Section A.1.2.5 that if $V \coloneq (V, \preceq)$ is a partially ordered set, then its order dual $V^\partial \coloneq (V, \preceq^\partial)$ is the set $V$ paired with the partial order $\preceq^\partial$ obtained by setting $u \preceq^\partial v$ if and only if $v \preceq u$ . If $(V, \TT)$ is any ADP, then we call $(V, {\TT})^\partial \coloneq (V^\partial, \TT)$ the dual of $(V, \TT)$ . In other words, the dual $(V, {\TT})^\partial$ of $(V, \TT)$ is the ADP created by maintaining the same family of policy operators $\TT$ while replacing the poset $V$ with its order dual $V^\partial$ .

Regarding notation for $(V, {\TT})^\partial$ ,

the Bellman max-operator will be denoted by $T^\partial$ ,
the Bellman min-operator will be denoted by $\tmin^\partial$ ,
the max-value function will be denoted by $\vmaxd$ ,
$V_G^\partial$ is the set of all $v \in V$ such that at least one $v$-max-greedy policy exists for $(V, {\TT})^\partial$,
etc.

Each ADP is self-dual, in the sense that $((V, {\TT})^\partial)^\partial = (V, \TT)$ . This follows from the fact that all partially ordered sets are self-dual.

Exercise 2.2.4

Let $(V, \TT)$ be a well-posed ADP with dual $(V, {\TT})^\partial$ . Fix $v \in V$ and verify the following:

(i) $\sigma$ is $v$-min-greedy for $(V, \TT)$ if and only if $\sigma$ is $v$-max-greedy for $(V, {\TT})^\partial$.
(ii) $(V, \TT)$ is min-regular if and only if $(V, {\TT})^\partial$ is max-regular.
(iii) $(V, \TT)$ is min-order bounded if and only if $(V, {\TT})^\partial$ is max-order bounded.
(iv) If $\tmax^\partial v$ exists then so does $\tmin v$, and, moreover, $\tmin v = \tmax^\partial v$.
(v) If $W^\partial v$ exists then so does $\Wmin v$, and, moreover, $\Wmin v = W^\partial v$.
(vi) If $H^\partial v$ exists then so does $\Hmin v$, and, moreover, $\Hmin v = H^\partial v$.
(vii) If $\vmaxd$ exists for $(V, {\TT})^\partial$, then $\vmin$ exists for $(V, \TT)$ and $\vmin = \vmaxd$.
(viii) $\sigma \in \Sigma$ is min-optimal for $(V, \TT)$ if and only if $\sigma$ is max-optimal for $(V, {\TT})^\partial$.
(ix) $V^G_{\triangledown} = V_G^\partial$.

Solution to Exercise 2.2.4

Regarding (i), fix $v \in V$ . Policy $\sigma$ is $v$ -min-greedy for $(V, \TT)$ if and only if $T_\sigma \, v \preceq T_\tau \, v$ for all $\tau \in \Sigma$ , which is equivalent to $T_\tau \, v \preceq^\partial T_\sigma \, v$ for all $\tau \in \Sigma$ . Hence $\sigma$ is $v$ -min-greedy for $(V, \TT)$ if and only if $\sigma$ is $v$ -max-greedy for $(V, {\TT})^\partial$ .

Claim (ii) follows from (i). Claim (iii) holds because $b \preceq T_\sigma \, b$ for all $\sigma$ if and only if $T_\sigma \, b \preceq^\partial b$ for all $\sigma$. Claim (iv) is immediate from Exercise A.1.7. The proofs of the remaining claims are also straightforward and the details are left to the reader.

Self-duality implies corollaries to Exercise 2.2.4 that we treat as self-evident. For example, $\sigma \in \Sigma$ is min-optimal for $(V, {\TT})^\partial$ if and only if $\sigma$ is max-optimal for $(V, \TT)$ .

Part (viii) of Exercise 2.2.4 tells us that we can solve an ADP for a min-optimal policy by switching to the dual ADP and maximizing.
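The correspondence between min-greedy policies and max-greedy policies for the dual can be checked numerically. The following is a minimal sketch, assuming a value space $\RR^2$ with the pointwise order and three hypothetical affine policy operators; the helper names `max_greedy`, `min_greedy`, `pointwise` and `dual` are illustrative and not part of the theory above.

```python
import numpy as np

def max_greedy(operators, v, leq):
    """Indices s with T_t v <= T_s v for every t, where <= is the order `leq`."""
    images = [T(v) for T in operators]
    return [s for s, Ts in enumerate(images)
            if all(leq(Tt, Ts) for Tt in images)]

def min_greedy(operators, v, leq):
    """Indices s with T_s v <= T_t v for every t, where <= is the order `leq`."""
    images = [T(v) for T in operators]
    return [s for s, Ts in enumerate(images)
            if all(leq(Ts, Tt) for Tt in images)]

def pointwise(a, b):
    """The pointwise partial order on R^n."""
    return bool(np.all(a <= b))

def dual(leq):
    """Reverse a partial order: a <=_dual b if and only if b <= a."""
    return lambda a, b: leq(b, a)

# Hypothetical policy operators T_s v = c_s + beta * P v (assumed for illustration)
beta = 0.9
P = np.array([[0.5, 0.5], [0.2, 0.8]])
costs = [np.array([1.0, 2.0]), np.array([0.5, 1.5]), np.array([2.0, 3.0])]
operators = [lambda v, c=c: c + beta * P @ v for c in costs]

v = np.zeros(2)
primal_min = min_greedy(operators, v, pointwise)      # min-greedy for (V, T)
dual_max = max_greedy(operators, v, dual(pointwise))  # max-greedy for the dual
print(primal_min, dual_max)
```

Only the order passed to the routine changes between the two calls, which mirrors the way the dual ADP keeps the same policy operators and reverses the partial order.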

Solution to Exercise 2.2.5

Bellman’s principle of min-optimality for $(V, \TT)$ states that

\setntn{\sigma \in \Sigma}{\sigma \text{ is min-optimal for } (V, \TT)} = \setntn{\sigma \in \Sigma}{\sigma \text{ is } \vmin \text{-min-greedy}}.

Bellman’s principle of max-optimality for $(V, {\TT})^\partial$ states that

\setntn{\sigma \in \Sigma}{\sigma \text{ is max-optimal for } (V, {\TT})^\partial} = \setntn{\sigma \in \Sigma}{\sigma \text{ is } \vmaxd \text{-max-greedy for } (V, {\TT})^\partial}.

We show that the former principle implies the latter using the facts established in Exercise 2.2.4. To this end, observe that the statement $\sigma$ is max-optimal for $(V, \TT)^\partial$ is equivalent to the statement that $\sigma$ is min-optimal for $(V, \TT)$ . Since Bellman’s principle of min-optimality holds for $(V, \TT)$ , this is equivalent to the statement that $\sigma$ is $\vmin$ -min-greedy for $(V, \TT)$ , which is equivalent to the statement that $\sigma$ is $\vmin$ -max-greedy for $(V, {\TT})^\partial$ , which is in turn equivalent to the statement that $\sigma$ is $\vmaxd$ -max-greedy for $(V, {\TT})^\partial$ . This argument confirms that the former principle implies the latter. The proof of the converse implication is similar and omitted.

2.2.3.3 Optimality and Convergence

We can now easily translate max-optimality results to min-optimality results and vice versa. Our key tool is the next exercise.

Solution to Exercise 2.2.6

We prove each equivalence using the correspondences established in Exercise 2.2.4.

Regarding the fundamental optimality properties, suppose the fundamental max-optimality properties hold for $(V, {\TT})^\partial$ . Then (B1) holds for $(V, {\TT})^\partial$ , so at least one max-optimal policy exists for $(V, {\TT})^\partial$ . By Exercise 2.2.4(viii), this policy is min-optimal for $(V, \TT)$ , giving (B1’). For (B2’), note that $\vmaxd$ is the unique solution to the Bellman max-equation in $V_G^\partial$ for $(V, {\TT})^\partial$ . By Exercise 2.2.4(iv), $\tmin v = \tmax^\partial v$ for all relevant $v$ , and by Exercise 2.2.4(ix), $V^G_\triangledown = V_G^\partial$ , so the Bellman max-equation for the dual in $V_G^\partial$ is precisely the Bellman min-equation for $(V, \TT)$ in $V^G_\triangledown$ . Moreover, by Exercise 2.2.4(vii), $\vmin = \vmaxd$ , so $\vmin$ is the unique solution in $V^G_\triangledown$ , giving (B2’). For (B3’), Bellman’s principle of max-optimality for the dual states that a policy is max-optimal for $(V, {\TT})^\partial$ if and only if it is $\vmaxd$ -max-greedy for $(V, {\TT})^\partial$ . By Exercise 2.2.4(viii), max-optimality for the dual is equivalent to min-optimality for $(V, \TT)$ . By Exercise 2.2.4(i), $\vmaxd$ -max-greediness for the dual is equivalent to $\vmaxd$ -min-greediness for $(V, \TT)$ , which by Exercise 2.2.4(vii) is $\vmin$ -min-greediness. This gives (B3’). The converse follows by the same argument applied to $((V, {\TT})^\partial)^\partial = (V, \TT)$ .

For (i), the set $V_U^\partial$ of all $v \in V_G^\partial$ with $v \preceq^\partial T^\partial v$ equals $V_D$ by Exercise 2.2.4(iv) and (ix), since $v \preceq^\partial T^\partial v$ means $\tmax^\partial v \preceq v$ , i.e., $\tmin v \preceq v$ . Moreover, $(\tmax^\partial)^n v \to \vmaxd$ in the order on $V^\partial$ means $\tmin^n v \downarrow \vmin$ in $V$ . Claims (ii) and (iii) follow similarly using Exercise 2.2.4(v) and (vi).

Below we use Exercise 2.2.6 to prove theorems providing sufficient conditions for minimization results.

As one example of how Exercise 2.2.6 can be applied, we construct a min-version of Theorem 2.1.5. In the statement, $\tmin$ is the Bellman min-operator.

Proof

Let $(V, \TT)$ be as stated and consider the dual $(V, {\TT})^\partial$ . By Exercise 2.2.4(iv), $\tmin v = \tmax^\partial v$ for all $v$ where either side is defined. By Exercise 2.2.4(ix), $V^G_\triangledown = V_G^\partial$ . Hence claim (i) is equivalent to “ $T^\partial$ has a fixed point in $V_G^\partial$ ”. Moreover, $T_\sigma \, v \preceq^\partial v$ means $v \preceq T_\sigma \, v$ , which by hypothesis gives $v \preceq v_\sigma$ , i.e., $v_\sigma \preceq^\partial v$ . So the hypothesis of Theorem 2.1.5 holds for the dual, and Theorem 2.1.5 states that “ $T^\partial$ has a fixed point in $V_G^\partial$ ” is equivalent to the fundamental max-optimality properties for $(V, {\TT})^\partial$ . By Exercise 2.2.6, that is equivalent to the fundamental min-optimality properties for $(V, \TT)$ . ◻

2.3 Applications

In this section we show how several important models fit into the ADP framework. Applications include firm valuation, optimal savings, finite Markov decision processes, distributional dynamic programming and linear-quadratic control.

2.3.1 Firm Valuation

Recall the firm problem from Section 1.1, where, repeating (1.5), the policy operators took the form

(T_\sigma \, v)(x) = \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right] \qquad (x \in \Xsf).

(2.12)

The policy set $\Sigma$ was defined as all Borel measurable functions mapping $\Xsf$ to $\{0,1\}$ . Let $\TT_{\rm FV} \coloneq \setntn{T_\sigma}{\sigma \in \Sigma}$ and let $\leq$ be the pointwise partial order on $b\Xsf$ .

The monotonicity in Exercise 2.3.1 implies that the pair $(b\Xsf, \TT_{\rm FV})$ is an ADP. We saw in Section 1.1.1.2 that each $T_\sigma$ is a contraction map on $b\Xsf$ . Hence $(b\Xsf, \TT_{\rm FV})$ is well-posed. As we discussed in Section 1.1.1.2, the unique fixed point $v_\sigma$ of $T_\sigma$ has the interpretation of assigning lifetime values (expected present value of the firm) to states (initial conditions for the underlying Markov chain) under $\sigma$ .

In Section 1.1 we introduced the Bellman operator $T$ (see (1.9)) and the notion of greedy policies (see (1.8)). Let’s make sure that these agree with the new ADP definitions from this chapter. To begin, fix $v \in b\Xsf$ . On one hand, in Section 1.1.1.2, we called $\sigma \in \Sigma$ $v$ -greedy whenever

\sigma(x) \in \argmax_{a \in \{0,1\}} \left\{ a s + (1 - a) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right] \right\} \quad \text{for all } x \in \Xsf.

(2.13)

On the other hand, in our ADP setting, we called $\sigma$ $v$ -greedy when $\sigma \in \Sigma$ and $T_\tau \, v \preceq T_\sigma \, v$ for all $\tau \in \Sigma$ . Here this translates to

\begin{aligned} & \tau(x) s + (1 - \tau(x)) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right] \leq \\ & \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right] \end{aligned}

for all $\tau \in \Sigma$ and all $x \in \Xsf$ , which is equivalent to stating that $\sigma$ obeys (2.13). Hence, for the firm ADP $(b\Xsf, \TT_{\rm FV})$ , the two definitions agree.

A $v$ -greedy policy can always be chosen. Indeed, if we set

\sigma(x) = \1\left\{s \geq \pi(x) + \beta \int v(x') P(x, \diff x')\right\} \qquad (x \in \Xsf),

then $\sigma$ is Borel measurable, since $\pi$ and $x \mapsto \int v(x') P(x, \diff x')$ are both Borel measurable, and $\sigma$ obeys (2.13). Given that we can always choose a $\sigma \in \Sigma$ obeying (2.13), the firm ADP $(b\Xsf, \TT_{\rm FV})$ is regular.

The ADP definition of the Bellman operator presented in Section 2.1.1.3 is $Tv = \bigvee_\sigma T_\sigma \, v$ . For our firm ADP $(b\Xsf, \TT_{\rm FV})$ , this agrees with the original definition we gave in Section 1.1.1.3. Indeed, given that the ADP is regular, we can state that $Tv = T_\sigma \, v$ whenever $\sigma$ is $v$ -greedy (Lemma 2.1.1). We have just shown that any such $\sigma$ obeys (2.13). Taking such a $\sigma$ and applying $Tv = T_\sigma \, v$ yields

(T v)(x) = \max \left\{ s ,\; \pi(x) + \beta \int v(x') P(x, \diff x') \right\}

for all $x \in \Xsf$ . This ADP construction of $T$ agrees with the original definition we presented in (1.9).
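To make the agreement between the two definitions concrete, here is a minimal sketch on a finite state grid. It assumes that the kernel $P$ is replaced by a stochastic matrix and that $\pi$ and $v$ are vectors; the grid size, parameter values and function names are hypothetical.

```python
import numpy as np

def T_sigma(v, sigma, s, pi, P, beta):
    """Policy operator (2.12) on a finite grid; sigma[i] is 0 (continue) or 1 (exit)."""
    return sigma * s + (1 - sigma) * (pi + beta * P @ v)

def greedy(v, s, pi, P, beta):
    """Exit exactly when the scrap value s is at least the continuation value."""
    return (s >= pi + beta * P @ v).astype(float)

def T(v, s, pi, P, beta):
    """Bellman operator: pointwise maximum of exit and continuation values."""
    return np.maximum(s, pi + beta * P @ v)

# Hypothetical primitives (assumed for illustration only)
rng = np.random.default_rng(0)
n, beta, s = 5, 0.95, 1.0
pi = np.linspace(-0.5, 0.5, n)
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)   # normalize rows to get a stochastic matrix
v = rng.normal(size=n)

sigma = greedy(v, s, pi, P, beta)
print(np.allclose(T_sigma(v, sigma, s, pi, P, beta), T(v, s, pi, P, beta)))
```

The final line checks that $Tv = T_\sigma \, v$ when $\sigma$ is $v$ -greedy, which is the finite-grid counterpart of the argument above.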

Solution to Exercise 2.3.2

Fix $\sigma \in \Sigma$ and let $(v_n)$ be a sequence in $b\Xsf$ with $v_n \uparrow v$ . By Lemma A.1.3, $v_n(x') \uparrow v(x')$ in $\RR$ for all $x' \in \Xsf$ . For any $x \in \Xsf$ , we have

(T_\sigma \, v_n)(x) = \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta \int v_n(x') P(x, \diff x') \right].

By the monotone convergence theorem, $\int v_n(x') P(x, \diff x') \uparrow \int v(x') P(x, \diff x')$ . Hence $(T_\sigma \, v_n)(x) \uparrow (T_\sigma \, v)(x)$ for all $x$ . Applying Lemma A.1.3 again gives $T_\sigma \, v_n \uparrow T_\sigma \, v$ , confirming order continuity.

It is possible to prove optimality results for the firm problem here, verifying and extending Theorem 1.1.1. For example, a proof can be constructed using Theorem 2.2.8. For now we’ll refrain from doing so. The reason is that we build additional ADP theory below, leveraging the framework set out in this chapter. Tackling specific applications will then become much easier.

2.3.2 Optimal Savings

Consider the optimal savings model from Section 1.3, with Assumption 1.3.1 in force. We can represent this model as an ADP by taking $V \coloneq b\RR_+$ as the value space, paired with the pointwise order $\leq$ , letting $\Sigma$ be the set of (Borel measurable) feasible policies, as defined in Section 1.3.1.1, and setting $\TT_{\rm OS} \coloneq \setntn{T_\sigma}{\sigma \in \Sigma}$ , where each policy operator $T_\sigma$ is as given in (1.44). It is straightforward to verify that each $T_\sigma \in \TT_{\rm OS}$ is order preserving under $\leq$ , and Exercise 1.3.1 confirms that $T_\sigma$ maps $V$ to itself. Hence $(V, \TT_{\rm OS})$ is an ADP.

By Lemma 1.3.1, each policy operator $T_\sigma$ has a unique fixed point (i.e., $\sigma$ -value function) $v_\sigma$ . Consistent with the discussion in Section 2.1.1.1, the real number $v_\sigma(w)$ represents the lifetime value of policy $\sigma$ , conditional on initial wealth state $W_0 = w$ .

In (1.51) we defined the concept of a $v$ -greedy policy for the optimal savings model. Earlier, in (2.1), we introduced the notion of a $v$ -greedy policy for an arbitrary ADP. The second definition is a generalization of the first. Indeed, if $\sigma$ obeys the optimal savings greedy condition (1.51) and $\tau$ is any other feasible policy, then

\begin{aligned} & u(\tau(w)) + \beta \int v(R(w - \tau(w)) + y) \phi(\diff y) \leq \\ & u(\sigma(w)) + \beta \int v(R(w - \sigma(w)) + y) \phi(\diff y) \qquad \text{for all } w \in \RR_+. \end{aligned}

(2.14)

This is equivalent to the statement $T_\tau \, v \leq T_\sigma \, v$ for all $\tau \in \Sigma$ , which is, in the present setting, the ADP definition of $v$ -greedy (given that the partial order is $\leq$ ). Conversely, if $\sigma \in \Sigma$ obeys $T_\tau \, v \leq T_\sigma \, v$ for all $\tau \in \Sigma$ , then it obeys (2.14). By appealing to Lemma 1.3.2, we can strengthen this to

\sigma(w) \in \argmax_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\} \quad \text{for all } w \in \RR_+.

(2.15)

In other words, $\sigma$ is $v$ -greedy in the sense of Section 1.3.

In addition, the ADP Bellman operator for $(V, \TT_{\rm OS})$ , as defined in (2.4), is a generalization of the optimal savings Bellman operator given in (1.53). To see this, let $T = \bigvee_\sigma T_\sigma$ be the ADP Bellman operator and fix $v \in V$ . By (ii) of Lemma 2.1.1, we have $Tv = T_\sigma \, v$ whenever $\sigma$ is $v$ -greedy. Letting $\sigma$ obey (2.15), which exists by Lemma 1.3.2, fixing $w \in \RR_+$ and combining these facts, we get

(Tv)(w) = (T_\sigma v)(w) = \max_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\}.

This confirms the claim that, in the setting of $(V, \TT_{\rm OS})$ , the ADP Bellman operator reduces to the optimal savings Bellman operator in (1.53).

In view of Lemma 1.3.2, a $v$ -greedy policy exists for every $v \in V$ . Hence $(V, \TT_{\rm OS})$ is regular. In Lemma 1.3.1 we showed that each $T_\sigma \in \TT_{\rm OS}$ has a unique fixed point, so $(V, \TT_{\rm OS})$ is well-posed. The same lemma also showed that each policy operator in $\TT_{\rm OS}$ is globally stable. Lemma A.5.19 now implies that $(V, \TT_{\rm OS})$ is order stable.

2.3.3 MDPs as ADPs

In Section 1.2 we introduced a finite MDP $(\Gamma, r, \beta, P)$ with finite state space $\Xsf$ and finite action space $\Asf$ . Now we show that this finite MDP can be framed as an ADP $(\RR^\Xsf, \TT_{\rm MDP})$ and then discuss its properties.

2.3.3.1 ADP Representations for MDPs

Let $(\Gamma, r, \beta, P)$ be as above. We recall that the policy operators take the form

(T_\sigma \, v)(x) = r(x, \sigma(x)) + \beta \sum_{x'} v(x') P(x, \sigma(x), x') \qquad (v \in \RR^\Xsf, \; x \in \Xsf)

(2.16)

We set $\TT_{\rm MDP} \coloneq \setntn{T_\sigma}{\sigma \in \Sigma}$ , where $\Sigma$ is the set of feasible policies, and pair $\RR^\Xsf$ with the pointwise partial order $\leq$ . Since each $T_\sigma \in \TT_{\rm MDP}$ is order preserving on $(\RR^\Xsf, \leq)$ , the pair $(\RR^\Xsf, \TT_{\rm MDP})$ is an ADP.

We proved in Exercise 1.2.1 that each $T_\sigma \in \TT_{\rm MDP}$ has a unique fixed point in $\RR^\Xsf$ given by $(I-\beta P_\sigma)^{-1} r_\sigma$ . Hence $(\RR^\Xsf, \TT_{\rm MDP})$ is well-posed. The ADP $(\RR^\Xsf, \TT_{\rm MDP})$ is also order continuous. Indeed, in $(\RR^\Xsf, \leq)$ , the statement $v_n \uparrow v$ is equivalent to $v_n(x) \uparrow v(x)$ in $\RR$ for all $x \in \Xsf$ (Lemma A.1.3). As a result, every continuous order preserving self-map on $\RR^\Xsf$ is order continuous, and each $T_\sigma$ , being affine and order preserving, is such a map.

The ADP $(\RR^\Xsf, \TT_{\rm MDP})$ is also order stable. Indeed, each $T_\sigma$ has the form $T_\sigma \, v = r_\sigma + \beta P_\sigma \, v$ , where $P_\sigma$ is a stochastic matrix. Since $\beta P_\sigma \geq 0$ and $\rho(\beta P_\sigma) = \beta < 1$ , Exercise A.4.4 implies that each $T_\sigma$ is order stable.
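The fixed point formula and the spectral radius condition are easy to confirm numerically. The sketch below assumes a hypothetical MDP with three states and two actions, stored as arrays `r[x, a]` and `P[x, a, x']`, and a fixed policy `sigma`; none of these values come from the text.

```python
import numpy as np

# Hypothetical primitives (assumed for illustration only)
rng = np.random.default_rng(0)
n_states, n_actions, beta = 3, 2, 0.9
r = rng.random((n_states, n_actions))
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # each P[x, a, :] is a distribution

sigma = np.array([0, 1, 0])                # a policy: one action per state
idx = np.arange(n_states)
r_sigma = r[idx, sigma]                    # r_sigma(x) = r(x, sigma(x))
P_sigma = P[idx, sigma]                    # P_sigma(x, x') = P(x, sigma(x), x')

# Fixed point of T_sigma via the linear system (I - beta P_sigma) v = r_sigma
v_sigma = np.linalg.solve(np.eye(n_states) - beta * P_sigma, r_sigma)

# Successive approximation with T_sigma v = r_sigma + beta P_sigma v
v_iter = np.zeros(n_states)
for _ in range(500):
    v_iter = r_sigma + beta * P_sigma @ v_iter

# Spectral radius of beta P_sigma
rho = max(abs(np.linalg.eigvals(beta * P_sigma)))
print(np.allclose(v_iter, v_sigma), rho)
```

Since $P_\sigma$ is a stochastic matrix, the computed spectral radius should equal $\beta$ up to rounding error, and the iterates of $T_\sigma$ should agree with the solution of the linear system.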

In Section 1.2.1.2 we introduced the concept of $v$ -greedy policies for the finite MDP, as well as the Bellman operator and the Bellman equation. Let’s make sure that these are in fact special cases of our ADP definitions from Section 2.1.1.

Solution to Exercise 2.3.8

Since $\RR^\Xsf$ is endowed with the pointwise partial order, for given $v \in \RR^\Xsf$ and $x \in \Xsf$ , the ADP Bellman operator $Tv = \bigvee_\sigma T_\sigma \, v$ reduces to

(T \, v)(x) = \sup_{\sigma \in \Sigma} (T_\sigma \, v)(x) = \sup_{\sigma \in \Sigma} \left\{ r(x, \sigma(x)) + \beta \sum_{x'} v(x') P(x, \sigma(x), x') \right\}.

By the definition of $\Sigma$ , we can also write this as

(T \, v)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\},

(2.19)

which is identical to (2.18).

From (2.18) it follows that the ADP Bellman equation for $(\RR^\Xsf, \TT_{\rm MDP})$ is given by

v(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \qquad\qquad (x \in \Xsf).

(2.20)

This aligns with the traditional expression for the Bellman equation that we set out in (1.20).

2.3.3.2 Optimality for Finite MDP Models

We made a number of optimality claims for the finite MDP model in Section 1.2.1.2. Later we’ll build on the ADP optimality results in ways that make verifying these claims straightforward. Nevertheless, for the sake of the exercise, we now prove the claims from Section 1.2.1.2 using Theorem 2.2.8, as well as establishing additional results. Readers who prefer to move on can skip this section without loss of continuity.

We begin with the following claim.

The statement in Proposition 2.3.1 is easily translated into more standard MDP terminology. For example, it tells us that the value function $\vmax$ solves the Bellman equation, which we know is given by (2.20), and, by the characterization of greedy policies in (2.17) plus Bellman’s principle of optimality, that a policy $\sigma$ is optimal if and only if

\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} \vmax(x') P(x, a, x') \right\} \quad \text{for all } x \in \Xsf.

Later we will show that ADP theory can also handle many extensions to the basic MDP framework.
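The characterization of optimal policies above suggests a simple computational routine: iterate the Bellman operator to approximate $\vmax$ and then read off a greedy policy. The following is a minimal sketch under the assumptions that every action is feasible in every state and that the primitives are stored as arrays `r[x, a]` and `P[x, a, x']`; the model values and function names are hypothetical.

```python
import numpy as np

def bellman(v, r, P, beta):
    """Return (Tv, greedy policy), where q[x, a] = r(x, a) + beta * sum_x' v(x') P(x, a, x')."""
    q = r + beta * (P @ v)
    return q.max(axis=1), q.argmax(axis=1)

def policy_value(sigma, r, P, beta):
    """Lifetime value v_sigma = (I - beta P_sigma)^{-1} r_sigma."""
    idx = np.arange(len(sigma))
    return np.linalg.solve(np.eye(len(sigma)) - beta * P[idx, sigma], r[idx, sigma])

# Hypothetical primitives (assumed for illustration only)
rng = np.random.default_rng(1)
n_states, n_actions, beta = 3, 2, 0.9
r = rng.random((n_states, n_actions))
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)

# Iterate the Bellman operator until successive iterates are close
v = np.zeros(n_states)
for _ in range(10_000):
    v_new, _ = bellman(v, r, P, beta)
    if np.max(np.abs(v_new - v)) < 1e-12:
        break
    v = v_new
v_star = v_new

# A greedy policy for the approximate value function, and its lifetime value
_, sigma_star = bellman(v_star, r, P, beta)
print(sigma_star, np.allclose(policy_value(sigma_star, r, P, beta), v_star, atol=1e-8))
```

In this small example one can also enumerate all eight policies and confirm that none has a lifetime value exceeding `v_star` at any state.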

The next exercise asks you to work through an alternative to the proof of Proposition 2.3.1.

2.3.4 Distributional Dynamic Programming

In Section 1.1.3.2 we mentioned distributional dynamic programming, which seeks to track not just the expected discounted return under a policy, but also the full probability distribution of the random discounted return $\sum_{t \geq 0} \beta^t r(X_t, \sigma(X_t))$ . This richer object captures variance, quantiles, tail risk, and other features that matter when agents are not risk-neutral. Here we review some of the basic concepts and show how these ideas can be embedded into the ADP framework. Further reading can be found in Section 2.4.

2.3.4.1 The Distributional ADP

Let $\Xsf$ be a metric space of states, let $\Asf$ be a metric space of actions, let $r \colon \Xsf \times \Asf \to \RR$ be a bounded measurable reward function, let $P$ be a stochastic kernel on $\Xsf$ given $\Xsf \times \Asf$ (see Section A.5.4), and let $\beta \in (0,1)$ . A policy is a measurable map $\sigma \colon \Xsf \to \Asf$ and $\Sigma$ denotes the set of all feasible policies.

Let $\dD_1(\RR)$ denote the set of Borel probability measures on $\RR$ with finite first moment. A distributional value function is a stochastic kernel $\eta$ from $\Xsf$ to $\RR$ (see Section A.5.4) with $\eta(x, \cdot) \in \dD_1(\RR)$ for each $x \in \Xsf$ ; it assigns a return distribution $\eta(x, \cdot)$ to each state $x$ . We let $\hH$ denote the set of all distributional value functions $\eta$ satisfying the uniform bound

\sup_{x \in \Xsf} \int |z| \, \eta(x, \diff z) < \infty.

This condition ensures that the metric introduced in Section 2.3.4.2 below is well-defined.

Given $\eta \in \hH$ , we write $\eta(x, h) \coloneq \int h(z) \, \eta(x, \diff z)$ for bounded measurable $h \colon \RR \to \RR$ . The space $\hH$ is equipped with the pointwise stochastic dominance order, which we denote by $\trianglelefteq$ and define by $\eta \trianglelefteq \eta'$ if $\eta(x, \cdot) \lefsd \eta'(x, \cdot)$ for every $x \in \Xsf$ (see Section A.5.5).

We now define the distributional analogue of the scalar policy operator $T_\sigma$ . In the scalar case,

(T_\sigma \, v)(x) = r_\sigma(x) + \beta \int v(x') \, P_\sigma(x, \diff x'),

where $r_\sigma(x) \coloneq r(x, \sigma(x))$ and $P_\sigma(x, \cdot) \coloneq P(x, \sigma(x), \cdot)$ . The distributional version replaces the scalar continuation value $v(x')$ with a random draw from the distribution $\eta(x', \cdot)$ , and replaces addition and scalar multiplication with the corresponding operations on distributions.

As a first step, given $\eta \in \hH$ , we define the continuation kernel $P_\sigma \otimes \eta$ from $\Xsf$ to $\RR$ by

(P_\sigma \otimes \eta)(x, B) \coloneq \int \eta(x', B) \, P_\sigma(x, \diff x') \qquad (B \in \bB(\RR)).

(2.21)

Here $(P_\sigma \otimes \eta)(x, \cdot)$ is a distributional analog of the continuation value: the distribution of the random value obtained by drawing $x' \sim P_\sigma(x, \cdot)$ and then drawing from $\eta(x', \cdot)$ .

Given a policy $\sigma \in \Sigma$ , the distributional policy operator $D_\sigma \colon \hH \to \hH$ is defined by

(D_\sigma \, \eta)(x, h) \coloneq \int h(r_\sigma(x) + \beta v) \, (P_\sigma \otimes \eta)(x, \diff v)

(2.22)

for bounded measurable $h \colon \RR \to \RR$ . Here $h$ should be understood as a test function that we use to characterize the distribution $(D_\sigma \, \eta)(x, \cdot)$ , which is the law of $r_\sigma(x) + \beta V$ when $V \sim (P_\sigma \otimes \eta)(x, \cdot)$ : today’s reward plus a discounted draw from the continuation distribution. This mirrors the scalar recursion. Expanding the continuation kernel, (2.22) can be equivalently written as

(D_\sigma \, \eta)(x, h) = \int \left[ \int h(r_\sigma(x) + \beta v) \, \eta(x', \diff v) \right] P_\sigma(x, \diff x').

(2.23)
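When the state space is finite and each $\eta(x, \cdot)$ has finite support, the operator in (2.23) can be computed exactly. The sketch below is one possible implementation under those assumptions, representing $\eta$ as a list of `(atoms, probs)` pairs; the representation, the function names and the numerical values are all hypothetical.

```python
import numpy as np

def D_sigma(eta, r_sigma, P_sigma, beta):
    """Apply (2.23) once, where eta[x] = (atoms, probs) is a finitely supported law."""
    new_eta = []
    for x in range(len(eta)):
        # Push each eta(x', .) through z -> r_sigma(x) + beta * z ...
        atoms = np.concatenate([r_sigma[x] + beta * z for z, _ in eta])
        # ... and mix over next states x' with weights P_sigma(x, x')
        probs = np.concatenate([P_sigma[x, y] * p for y, (_, p) in enumerate(eta)])
        new_eta.append((atoms, probs))
    return new_eta

def means(eta):
    """Mean of eta(x, .) at each state x."""
    return np.array([z @ p for z, p in eta])

# Hypothetical two-state example (assumed for illustration only)
beta = 0.95
r_sigma = np.array([1.0, 0.0])
P_sigma = np.array([[0.9, 0.1], [0.4, 0.6]])
eta = [(np.array([0.0, 2.0]), np.array([0.5, 0.5])),
       (np.array([-1.0, 1.0, 3.0]), np.array([0.2, 0.3, 0.5]))]

new_eta = D_sigma(eta, r_sigma, P_sigma, beta)
print(means(new_eta), r_sigma + beta * P_sigma @ means(eta))
```

The two printed vectors should coincide: taking means on both sides of the distributional recursion recovers the scalar recursion $T_\sigma \, v = r_\sigma + \beta P_\sigma v$ . Note that the number of atoms grows with each application, which is one reason practical implementations project onto a fixed grid.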

Proof

Fix $\sigma \in \Sigma$ and $\eta \in \hH$ . Self-map. For each $x$ , the measure $(D_\sigma \, \eta)(x, \cdot)$ is the pushforward of the probability measure $(P_\sigma \otimes \eta)(x, \cdot)$ under the map $v \mapsto r_\sigma(x) + \beta v$ , and is therefore a probability measure. Since $P_\sigma \otimes \eta$ is a stochastic kernel from $\Xsf$ to $\RR$ (as a composition of stochastic kernels) and the map $(x, v) \mapsto r_\sigma(x) + \beta v$ is jointly measurable, it follows that $D_\sigma \, \eta$ is again a stochastic kernel from $\Xsf$ to $\RR$ . Finally, letting $M \coloneq \sup_{x,a} |r(x,a)|$ and $C \coloneq \sup_x \int |z| \, \eta(x, \diff z) < \infty$ , we have

\sup_x \int |z| \, (D_\sigma \, \eta)(x, \diff z) \leq M + \beta C < \infty,

so $D_\sigma \, \eta \in \hH$ .

Order preservation. Fix $\eta \trianglelefteq \eta'$ in $\hH$ , $x \in \Xsf$ , and let $h \colon \RR \to \RR$ be bounded and increasing. Since $\beta > 0$ , the map $v \mapsto h(r_\sigma(x) + \beta v)$ is also increasing. Because $\eta(x', \cdot) \lefsd \eta'(x', \cdot)$ for every $x'$ , we have

\int h(r_\sigma(x) + \beta z) \, \eta(x', \diff z) \leq \int h(r_\sigma(x) + \beta z) \, \eta'(x', \diff z)

for every $x'$ . Integrating against the nonnegative measure $P_\sigma(x, \diff x')$ and applying (2.23) gives $(D_\sigma \, \eta)(x, h) \leq (D_\sigma \, \eta')(x, h)$ . As $x$ and $h$ are arbitrary, $D_\sigma \eta \trianglelefteq D_\sigma \eta'$ . ◻

Since each $D_\sigma$ is an order preserving self-map on $\hH$ , the pair $(\hH, \TT_{\rm DDP})$ is an ADP, where $\TT_{\rm DDP} \coloneq \setntn{D_\sigma}{\sigma \in \Sigma}$ . The value space is $\hH$ (distributional value functions), the partial order is pointwise stochastic dominance, and the policy operators are $\{D_\sigma\}$ .

2.3.4.2 Well-Posedness

As in the scalar MDP case, well-posedness follows from a contraction argument. To state it, we need a metric on distributions. The Wasserstein-1 distance between $\mu, \nu \in \dD_1(\RR)$ is

W_1(\mu, \nu) \coloneq \sup_{\|h\|_{\rm Lip} \leq 1} \left[ \int h \, \diff\mu - \int h \, \diff\nu \right],

where the supremum is over all 1-Lipschitz functions $h \colon \RR \to \RR$ . We equip $\hH$ with the supremum Wasserstein metric

\bar d_1(\eta, \eta') \coloneq \sup_{x \in \Xsf} W_1(\eta(x, \cdot), \eta'(x, \cdot)).

Under this metric, $\hH$ is a complete metric space.
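For finitely supported distributions on $\RR$ , the Wasserstein-1 distance can be computed without solving the supremum directly, using the standard fact that on the real line $W_1(\mu, \nu)$ equals the integral of the absolute difference of the two distribution functions. The sketch below relies on that identity; the `(atoms, probs)` representation and the function names `w1` and `d_bar` are assumptions made for illustration.

```python
import numpy as np

def w1(mu, nu):
    """Wasserstein-1 distance between finitely supported laws mu, nu on R.

    Uses W_1(mu, nu) = integral of |F_mu(z) - F_nu(z)| dz, valid on the real line.
    """
    (a_mu, p_mu), (a_nu, p_nu) = mu, nu
    grid = np.unique(np.concatenate([a_mu, a_nu]))      # sorted breakpoints
    F_mu = np.array([p_mu[a_mu <= z].sum() for z in grid])
    F_nu = np.array([p_nu[a_nu <= z].sum() for z in grid])
    # Both CDFs are constant on [grid[i], grid[i+1])
    return float(np.sum(np.abs(F_mu - F_nu)[:-1] * np.diff(grid)))

def d_bar(eta, eta_prime):
    """Supremum Wasserstein metric over a finite state space."""
    return max(w1(m, n) for m, n in zip(eta, eta_prime))

# Hypothetical distributions (assumed for illustration only)
delta0 = (np.array([0.0]), np.array([1.0]))
delta1 = (np.array([1.0]), np.array([1.0]))
mix = (np.array([0.0, 2.0]), np.array([0.5, 0.5]))

print(w1(delta0, delta1), w1(mix, delta1), d_bar([delta0, mix], [delta1, delta1]))
```

In each of the three cases the distance equals one: a unit mass is moved one unit, or two half-masses are each moved one unit.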

Solution to Exercise 2.3.11

Fix $x \in \Xsf$ and let $h$ be 1-Lipschitz. Define $g(v) \coloneq h(r_\sigma(x) + \beta v)$ , which is $\beta$ -Lipschitz. Applying (2.23) and setting $\tilde g \coloneq g/\beta$ , which is 1-Lipschitz,

\begin{aligned} (D_\sigma \eta)(x, h) - (D_\sigma \eta')(x, h) &= \int [\eta(x', g) - \eta'(x', g)] \, P_\sigma(x, \diff x') \\ &= \beta \int [\eta(x', \tilde g) - \eta'(x', \tilde g)] \, P_\sigma(x, \diff x') \\ &\leq \beta \int W_1(\eta(x', \cdot), \eta'(x', \cdot)) \, P_\sigma(x, \diff x') \\ &\leq \beta \, \bar d_1(\eta, \eta'). \end{aligned}

Taking the supremum over all 1-Lipschitz $h$ and then over $x$ gives the result.

By the Banach contraction mapping theorem, each $D_\sigma$ has a unique fixed point $\eta_\sigma \in \hH$ . Hence $(\hH, \TT_{\rm DDP})$ is well-posed. Moreover, since each $D_\sigma$ is a contraction and hence globally stable on $\hH$ , Lemma A.5.19 implies that $(\hH, \TT_{\rm DDP})$ is order stable.

2.3.4.3 Identification of the Fixed Point

Let $(X_t)_{t \geq 0}$ be the $P_\sigma$ -Markov chain with $X_0 = x$ and define the discounted return

Z_\sigma(x) \coloneq \sum_{t=0}^\infty \beta^t r(X_t, \sigma(X_t)).

In the scalar case, the fixed point of $T_\sigma$ is the expected-return function $v_\sigma(x) = \EE Z_\sigma(x)$ . What is the distributional analogue? The fixed point of $D_\sigma$ should be the distribution of $Z_\sigma(x)$ . Since $r$ is bounded, say $|r| \leq M$ , we have $|Z_\sigma(x)| \leq M/(1-\beta)$ almost surely, so the distribution of $Z_\sigma(x)$ lies in $\dD_1(\RR)$ .

Solution to Exercise 2.3.12

Since $|Z_\sigma(x)| \leq M/(1-\beta)$ a.s. for all $x$ , the distribution $\hat\eta(x, \cdot)$ has bounded support and hence lies in $\dD_1(\RR)$ with $\sup_x \int |z| \, \hat\eta(x, \diff z) \leq M/(1-\beta) < \infty$ . Measurability of $x \mapsto \hat\eta(x, B)$ follows from the canonical Markov construction in Section A.5.4.1: $\hat\eta(x, B) = \PP_x\{Z_\sigma \in B\}$ , where $\PP_x$ is the law of the $(P_\sigma, x)$ -Markov chain, and $Z_\sigma$ is a measurable functional of the path. Hence $\hat\eta \in \hH$ .

For the fixed point property, fix $x \in \Xsf$ and a bounded measurable $h$ . The decomposition $Z_\sigma(x) = r_\sigma(x) + \beta Z_\sigma(X_1)$ and the Markov property give

\begin{aligned} \hat\eta(x, h) &= \EE[h(Z_\sigma(x))] = \EE[h(r_\sigma(x) + \beta Z_\sigma(X_1))] \\ &= \int \EE[h(r_\sigma(x) + \beta Z_\sigma(x'))] \, P_\sigma(x, \diff x') \\ &= \int \left[ \int h(r_\sigma(x) + \beta z) \, \hat\eta(x', \diff z) \right] P_\sigma(x, \diff x') = (D_\sigma \hat\eta)(x, h). \end{aligned}

Hence $D_\sigma \hat\eta = \hat\eta$ . Since $D_\sigma$ has a unique fixed point, $\hat\eta = \eta_\sigma$ .

Thus, while the scalar $\sigma$ -value function $v_\sigma(x) = \EE[Z_\sigma(x)]$ records only the mean of the random return, the distributional $\sigma$ -value function $\eta_\sigma(x, \cdot)$ records its full distribution, including variance, quantiles and tail behavior. The scalar value function is recovered as a special case: $v_\sigma(x) = \int z \, \eta_\sigma(x, \diff z)$ .

Figure 2.4 illustrates the action of the distributional policy operator on the optimal savings model from Section 1.3, using the policy $\sigma$ computed in Figure 1.14. The left panel shows $D_{\sigma}^5 \eta_0$ , an early iterate starting from the initial condition $\eta_0(x, \cdot) = \delta_0$ for all $x$ . The right panel shows the converged distributional value function $\eta_{\sigma}$ , which assigns to each state $x$ the full distribution of the discounted return $Z_{\sigma}(x)$ . The iterations were implemented using the categorical projection method of Bellemare et al. (2017): the return axis is discretized into a fixed grid of atoms, and at each step the affine shift $z \mapsto r_\sigma(x) + \beta z$ is projected back onto this grid via linear interpolation of probability mass. The mean of $\eta_{\sigma}(x, \cdot)$ at each $x$ recovers the scalar value function $v_\sigma$ shown in Figure 1.14.

Figure 2.4: Iterating $D_\sigma$ : early iterate (left) and converged $\eta_\sigma$ (right)
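The projection step described above can be sketched as follows. This is an illustrative implementation in the spirit of the categorical projection, not the code used to produce Figure 2.4; it assumes a uniformly spaced grid of atoms, and the function name `project_shift` and all numerical values are hypothetical.

```python
import numpy as np

def project_shift(grid, probs, r, beta):
    """Push mass on `grid` through z -> r + beta * z, then project back onto `grid`.

    Each shifted atom splits its mass between its two neighboring grid points,
    with weights given by linear interpolation. Atoms leaving the grid are clipped.
    """
    z_min, z_max = grid[0], grid[-1]
    dz = grid[1] - grid[0]                      # assumes uniform spacing
    shifted = np.clip(r + beta * grid, z_min, z_max)
    pos = (shifted - z_min) / dz                # fractional grid index
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, len(grid) - 1)
    w_hi = pos - lo                             # share of mass sent to the upper neighbor
    out = np.zeros_like(probs)
    np.add.at(out, lo, probs * (1 - w_hi))      # accumulate, allowing repeated indices
    np.add.at(out, hi, probs * w_hi)
    return out

# Hypothetical example: three atoms on a grid over [-10, 10]
grid = np.linspace(-10.0, 10.0, 201)
probs = np.zeros(201)
probs[[90, 100, 120]] = [0.2, 0.5, 0.3]         # mass at -1, 0 and 2
r, beta = 0.37, 0.9

out = project_shift(grid, probs, r, beta)
print(out.sum(), out @ grid, r + beta * (probs @ grid))
```

When no clipping occurs, linear interpolation preserves both total mass and the mean, so the last two printed values should agree. This is consistent with the remark that the mean of $\eta_\sigma(x, \cdot)$ recovers $v_\sigma$ .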

2.3.4.4 Regularity

One complication, in terms of developing a theory of distributional dynamic programming, is failure of regularity. Even when the state and action spaces are finite, greedy policies typically fail to exist. To see why, observe that, in the present setting, a policy $\sigma$ is $\eta$ -greedy when

\int h(r_\tau(x) + \beta v) \, (P_\tau \otimes \eta)(x, \diff v) \leq \int h(r_\sigma(x) + \beta v) \, (P_\sigma \otimes \eta)(x, \diff v)

(2.24)

for all $x \in \Xsf$ , all $\tau \in \Sigma$ and all $h \in ib\RR$ . A natural approach to this problem is to solve

\max_a \int \left[ \int h(r(x,a) + \beta v) \, \eta(x', \diff v) \right] P(x, a, \diff x')

at each $x$ , and produce a policy $\sigma$ from the correspondence of maximizers. The problem with this idea is that, in most cases, the solution will depend on $h$ . For example, if $h$ is concave with high curvature then optimal choices will avoid risk. If $h$ is linear, optimal choices will ignore risk. This makes it very difficult to attain (2.24) for all $h$ in $ib\RR$ .
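A small numerical example illustrates the dependence on $h$ . The sketch below assumes a single state and two hypothetical actions whose induced return distributions are given directly; the two test functions are bounded and increasing, one approximately linear on the relevant range and one flat above 1, so that it ignores upside.

```python
import numpy as np

def expect(h, dist):
    """Integral of h against a finitely supported distribution (atoms, probs)."""
    atoms, probs = dist
    return float(h(atoms) @ probs)

# Hypothetical return distributions induced by two actions at a given state
returns = {
    "safe": (np.array([1.0]), np.array([1.0])),             # return 1 for sure
    "risky": (np.array([0.0, 2.2]), np.array([0.5, 0.5])),  # mean 1.1, but dispersed
}

def best_action(h):
    """Action maximizing the integral of h against its return distribution."""
    return max(returns, key=lambda a: expect(h, returns[a]))

h_mean = lambda z: np.clip(z, -5.0, 5.0)      # linear on the support: ranks by mean
h_cautious = lambda z: np.clip(z, 0.0, 1.0)   # flat above 1: no reward for upside

print(best_action(h_mean), best_action(h_cautious))
```

The two test functions select different actions, so neither action satisfies (2.24) for every $h \in ib\RR$ , and no greedy policy exists in this example.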

Without regularity, the core optimization loop of dynamic programming—solve the Bellman equation, then extract a greedy policy—breaks down at the second step. In the reinforcement learning literature, practitioners who use distributional methods typically select actions using the mean of the return distribution—which amounts to standard scalar greediness—while exploiting the distributional representation to improve function approximation and learning dynamics.

At a deeper level, the failure of regularity reflects the fact that, in order to set up a well-defined criterion for control, one must first commit to how risk is valued. This commitment takes the form of a specific nonlinear aggregator $K$ applied period-by-period, as described in Section 1.1.3.6 and Section 1.2.3.2 (see also the discussion of certainty equivalents and risk preferences in Section 7.2.4). Such a commitment collapses the distributional object back to a scalar recursion, restoring regularity and the full apparatus of dynamic programming. This is the approach adopted throughout the remainder of this book.

2.3.5 LQ Control

LQ control is a major sub-field of dynamic programming, routinely applied to problems in engineering, economics, operations research and elsewhere. In this section we describe a canonical LQ problem and show how it can be solved using ADP methods. Rather than aiming to provide new results, we plan to show that LQ problems can be cleanly handled by the theory provided above, instead of requiring specialized machinery.

2.3.5.1 Description

We consider a deterministic undiscounted linear-quadratic (LQ) control problem, defined as a tuple $(Q, R, A, B)$ , where

$A$ is $k \times k$ and $B$ is $k \times m$ ,
$Q$ is $k \times k$ and positive semidefinite, and
$R$ is $m \times m$ and positive definite.

The objective of the LQ problem is to solve

\min_{(u_t)} \sum_{t \geq 0} \left[ x_t^\top Q x_t + u_t^\top R u_t \right]

(2.25)

subject to

x_{t+1} = A x_t + B u_t \quad \text{ for all } t \geq 0.

The vector $x_t \in \RR^k$ is called the state variable and $u_t \in \RR^m$ is called the action or control. Note that $x_t^\top Q x_t + u_t^\top R u_t \geq 0$ for all $t$ , so the infinite sum (2.25) takes values in $[0, \infty]$ . The Bellman equation takes the form

\ell(x) = \min_{u \in \RR^m} \left\{ x^\top Q x + u^\top R u + \ell(Ax + B u) \right\} \quad \text{for all } x \in \RR^k.

(2.26)

Since the solution to this problem is well-known, we will not seek new results. Rather, our aim is to illustrate how we can embed the LQ problem into the ADP framework and recover existing results in a relatively straightforward way.

2.3.5.2 Preamble: LQ Background

Before we go further, let’s set up a framework for working with LQ problems and note down some standard results from the literature. In what follows, $A, B, R,$ and $Q$ are as described in Section 2.3.5.1.

We let

$\pP$ be the set of positive semidefinite $k \times k$ matrices
$\preccurlyeq$ be the Loewner partial order on $\RR^{k \times k}$ , so that

M \preccurlyeq N \quad \iff \quad N - M \in \pP.

We use 0 to denote the zero element of $\RR^{k \times k}$ , so that $P$ is in $\pP$ if and only if $0 \preccurlyeq P$ .

Also, we define

the Riccati map $\bR \colon \pP \to \pP$ via

\bR(P) = A^\top P A - A^\top P B(B^\top P B + R)^{-1} B^\top P A + Q, \quad \text{and}

(2.27)

the control gain map $\bF \colon \pP \to \RR^{m \times k}$ via

\bF (P) = - (B^\top P B + R)^{-1} B^\top P A .

(2.28)

Below we will connect the Riccati map to the Bellman operator for this problem and the control gain map will help us select decision rules. The fixed point equation $P = \bR(P)$ is called the Riccati equation. Note that the inverse in (2.27) exists because $R$ is positive definite and $P$ is positive semidefinite, implying that $B^\top P B + R$ is also positive definite.
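The maps (2.27) and (2.28) translate directly into code. The sketch below implements both and, as an illustration, iterates the Riccati map from the zero matrix on a hypothetical discretized double integrator. Convergence of this iteration is assumed for the example rather than derived from anything stated so far, and the function names `riccati_map` and `gain_map` are illustrative.

```python
import numpy as np

def riccati_map(P, A, B, Q, R):
    """The Riccati map (2.27), for symmetric positive semidefinite P."""
    BPA = B.T @ P @ A
    return A.T @ P @ A - BPA.T @ np.linalg.solve(B.T @ P @ B + R, BPA) + Q

def gain_map(P, A, B, R):
    """The control gain map (2.28)."""
    return -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)

# Hypothetical system: a discretized double integrator (assumed for illustration)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

# Iterate the Riccati map from P = 0 until successive iterates are close
P = np.zeros((2, 2))
for _ in range(100_000):
    P_next = riccati_map(P, A, B, Q, R)
    if np.max(np.abs(P_next - P)) < 1e-10:
        break
    P = P_next
P = P_next

F = gain_map(P, A, B, R)
rho = max(abs(np.linalg.eigvals(A + B @ F)))
print(np.allclose(P, riccati_map(P, A, B, Q, R)), rho)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical convenience; it relies on $B^\top P B + R$ being positive definite, as noted above.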

The next exercise is a well-known result and the proof is not trivial.

Solution to Exercise 2.3.13

Let $M \coloneq (B^\top P B + R)^{-1}$ and $K \coloneq B^\top P A$ , so that $F = -MK$ . From (2.27), the Riccati map satisfies

\bR(P) = A^\top P A - K^\top M K + Q.

Since $A + BF = A - BMK$ , expanding the right-hand side of the target expression gives

\begin{aligned} &(A + BF)^\top P (A + BF) + F^\top R F + Q \\ &= (A - BMK)^\top P (A - BMK) + K^\top M^\top R M K + Q \\ &= A^\top P A - A^\top P BMK - K^\top M^\top B^\top P A \\ & \qquad \qquad + K^\top M^\top B^\top P B M K + K^\top M^\top R M K + Q \\ &= A^\top P A - 2K^\top M K + K^\top M^\top (B^\top P B + R) M K + Q. \end{aligned}

Since $M = (B^\top P B + R)^{-1}$, we have $(B^\top P B + R) M = I$, and $M$ is symmetric because it is the inverse of a symmetric matrix. Hence $K^\top M^\top (B^\top P B + R) M K = K^\top M K$. Substituting, the right-hand side of the target expression reduces to $A^\top P A - K^\top M K + Q$, which equals $\bR(P)$, as required.
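The identity established in this solution can also be checked numerically. The following is a minimal sketch on randomly generated matrices; the dimensions, the seed, and the choice $Q = I$, $R = I$ are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 3, 2
A = rng.standard_normal((k, k))
B = rng.standard_normal((k, m))
Z = rng.standard_normal((k, k))
P = Z @ Z.T                      # positive semidefinite by construction
Q = np.eye(k)
R = np.eye(m)

# Same shorthand as in the solution
M = np.linalg.inv(B.T @ P @ B + R)
K = B.T @ P @ A
F = -M @ K

lhs = A.T @ P @ A - K.T @ M @ K + Q                        # Riccati map at P
rhs = (A + B @ F).T @ P @ (A + B @ F) + F.T @ R @ F + Q    # target expression
print(np.allclose(lhs, rhs))
```

The two sides should agree up to floating point rounding.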

The next lemma shows that the control gain map selects matrices that are akin to “min-greedy policies,” although we need to be a bit careful with that terminology, since we also want our policies to be stable (as clarified below).

In stating the next result we let $C$ be such that $C^\top C = Q$ . We refer to Bertsekas (2012) for the definitions of observability and controllability.

2.3.5.3 Policies

In the LQ setting, a control matrix is any $F \in \RR^{m \times k}$ . Under a given control matrix $F$ , the current control obeys $u_t = F x_t$ and the update rule for the state is $x_{t+1} = A x_t + B F x_t$ . Hence the state evolves according to $x_t = (A + BF)^t x_0$ . Following (2.25), the lifetime cost of following $F$ , starting at initial condition $x_0 \in \RR^k$ , is

\ell_F (x_0) \coloneq \sum_{t=0}^\infty x_t^\top \left( F^\top R F + Q \right) x_t \; \text{ with } x_t = (A + BF)^t x_0.

(2.30)

Figure 2.5: The function $\ell_F$ for different choices of $F$

Returning to the general case, finite lifetime costs require driving the state to zero fast enough for the sum (2.30) to converge. In this connection, extending the condition $|A + BF| < 1$ from the one-dimensional example, a control matrix $F$ is called stable if the spectral radius condition $\rho(A + BF) < 1$ holds.
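To make the stability condition concrete, here is a small sketch that computes $\rho(A + BF)$ and approximates the lifetime cost (2.30) by truncating the sum. The helper names, the truncation length `T`, and the example matrices are all assumptions chosen for illustration.

```python
import numpy as np

def spectral_radius(M):
    """Largest modulus among the eigenvalues of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def lifetime_cost(F, x0, A, B, Q, R, T=1_000):
    """Approximate the sum (2.30) by its first T terms."""
    W = F.T @ R @ F + Q
    x, total = x0, 0.0
    for _ in range(T):
        total += x @ W @ x       # period cost at the current state
        x = (A + B @ F) @ x      # state update under control matrix F
    return total

# Hypothetical example data
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])
F = np.array([[-1.0, -2.0]])     # chosen so both eigenvalues of A + BF equal 0.9

x0 = np.array([1.0, 0.0])
rho = spectral_radius(A + B @ F)
cost = lifetime_cost(F, x0, A, B, Q, R)
print(f"spectral radius = {rho:.3f}, truncated lifetime cost = {cost:.3f}")
```

Because $F$ is stable here, the terms of the sum decay geometrically and the truncation error is negligible; for an unstable $F$ the truncated sum would typically keep growing with `T`.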

2.3.5.4 From LQ to ADP

We wish to produce an ADP representation of the LQ problem. To this end, we set

$\Sigma \coloneq$ the set of stable control matrices, and
$\TT \coloneq \setntn{\bT_F}{F \in \Sigma}$ with each policy operator $P \mapsto \bT_F(P)$ defined by

\bT_F (P) = Q + F^\top R F + (A + BF)^\top P (A + BF) .

(2.32)

To match earlier ADP terminology, a stable control matrix is also referred to as a policy.

It follows from Exercise 2.3.16 that $(\pP, \TT)$ is an ADP.

2.3.5.5 Interpretation and Properties

The fixed point equation $P = \bT_F(P)$ can be interpreted as a recursive equation for lifetime cost under policy $F$ . To see this, suppose $P = \bT_F(P)$ and set $\ell(x) \coloneq x^\top P x$ . Using (2.32) and a bit of algebra, you will be able to confirm that this function $\ell$ obeys the recursion

\ell(x) = x^\top Q x + (Fx)^\top R (Fx) + \ell((A + BF)x).

(2.34)

The right-hand side equals the current state cost $x^\top Q x$ , plus the current action cost $(Fx)^\top R (Fx)$ , plus the lifetime cost from the next state $(A + BF)x$ . This is exactly the recursive structure of lifetime cost under policy $F$ .
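For completeness, the algebra behind (2.34) can be written out in one chain. Assuming, as above, that $P = \bT_F(P)$ and $\ell(x) = x^\top P x$, substituting the definition (2.32) gives

\begin{aligned} \ell(x) &= x^\top P x = x^\top \bT_F(P) \, x \\ &= x^\top Q x + x^\top F^\top R F x + x^\top (A + BF)^\top P (A + BF) x \\ &= x^\top Q x + (Fx)^\top R (Fx) + \ell((A + BF)x). \end{aligned}

The last step uses only $x^\top F^\top = (Fx)^\top$ and the definition of $\ell$ evaluated at the point $(A + BF)x$.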

Proof

Fix $\bT_F \in \TT$ . Let $\bL_F$ be a linear self-map on $\RR^{k \times k}$ defined by $\bL_F (P) \coloneq (A + BF)^\top P (A + BF)$ . Since $F$ is stable, $\rho(\bL_F) < 1$ on $\RR^{k \times k}$ . Hence, by the Neumann series lemma (see, in particular, Corollary A.4.11), the map $\bT_F$ is globally stable on $\pP$ with unique fixed point

P_F = \sum_{t=0}^\infty \bL_F^t \left(F^\top R F + Q\right) = \sum_{t=0}^\infty [(A + BF)^t]^\top \left(F^\top R F + Q\right) (A + BF)^t.

This verifies the expression for $P_F$ in (2.35). ◻

Comparing (2.35) with (2.30), we see that $x^\top P_F \, x = \ell_F(x)$ for all $x \in \RR^k$ . Hence $P_F$ is the matrix representation of the lifetime cost function.
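The relationship between $P_F$ and $\ell_F$ can be illustrated numerically. The sketch below approximates $P_F$ by iterating $\bT_F$ from the zero matrix, which is one simple option suggested by global stability, and compares $x^\top P_F \, x$ with a truncated version of the sum (2.30). The function name, the iteration counts, and the example data are assumptions.

```python
import numpy as np

def policy_operator(P, F, A, B, Q, R):
    """Policy operator T_F from (2.32)."""
    AF = A + B @ F
    return Q + F.T @ R @ F + AF.T @ P @ AF

# Hypothetical example data; F is stable because rho(A + BF) = 0.9
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])
F = np.array([[-1.0, -2.0]])

# Approximate the fixed point P_F by successive approximation from P = 0
P_F = np.zeros((2, 2))
for _ in range(1_000):
    P_F = policy_operator(P_F, F, A, B, Q, R)

# Truncated lifetime cost (2.30) from a given initial condition
x0 = np.array([1.0, -0.5])
W = F.T @ R @ F + Q
x, truncated_sum = x0, 0.0
for _ in range(1_000):
    truncated_sum += x @ W @ x
    x = (A + B @ F) @ x

print(x0 @ P_F @ x0, truncated_sum)
```

The two printed numbers should coincide up to rounding, since the $n$-th iterate of $\bT_F$ from zero is exactly the first $n$ terms of the series for $P_F$.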

2.3.5.6 Min-Greedy Policies

Fixing $P \in \pP$ , the ADP definition of min-greedy policies tells us that $F \in \Sigma$ is $P$ -min-greedy if and only if $\bT_F(P) \preccurlyeq \bT_G(P)$ for all $G \in \Sigma$ . The next exercise is useful for characterizing min-greedy policies.

Solution to Exercise 2.3.17

Fix $P \in \pP$ and $x \in \RR^k$ . Let $F = \bF(P)$ . Since $F x$ is the minimizer of (2.26) when $\ell(x) = x^\top P x$ , we have

x^\top Q x + x^\top F^\top R Fx + x^\top (A + BF)^\top P (A + BF) x \leq x^\top Q x + u^\top R u + \ell(Ax + B u)

(2.36)

for any $u \in \RR^m$ . Setting $G$ to be any control matrix and letting $u = Gx$ allows us to write (2.36) as $x^\top \bT_F(P) \, x \leq x^\top \bT_G(P) \, x$ . Hence $\bT_F(P) \preccurlyeq \bT_G(P)$ .

Now let $\ell(x) = x^\top P x$ and let $F'$ be any control matrix. If $\bT_{F'}(P) \preccurlyeq \bT_G(P)$ holds for every control matrix $G$, then in particular it holds when $G = \bF(P)$. Using this choice of $G$ and a fixed $x \in \RR^k$, we get $x^\top \bT_{F'}(P) \, x \leq x^\top \bT_G(P) \, x$ and hence

x^\top Q x + u_{F'}^\top R u_{F'} + \ell(Ax + Bu_{F'}) \leq \min_{u \in \RR^m} \left\{ x^\top Q x + u^\top R u + \ell(Ax + B u) \right\}

where $u_{F'} = F'x$ . Lemma 2.3.3 now implies that $F' = - (B^\top P B + R)^{-1} B^\top P A = \bF(P)$ .
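The inequality $\bT_F(P) \preccurlyeq \bT_G(P)$ with $F = \bF(P)$ can be spot-checked numerically by testing that $\bT_G(P) - \bT_F(P)$ has no negative eigenvalues for randomly drawn $G$. This is only a sanity check under assumed random data, not a proof; the tolerance `-1e-8` is an arbitrary allowance for rounding error.

```python
import numpy as np

def policy_operator(P, F, A, B, Q, R):
    """Policy operator T_F from (2.32)."""
    AF = A + B @ F
    return Q + F.T @ R @ F + AF.T @ P @ AF

rng = np.random.default_rng(1)
k, m = 3, 2
A = rng.standard_normal((k, k))
B = rng.standard_normal((k, m))
Q, R = np.eye(k), np.eye(m)
Z = rng.standard_normal((k, k))
P = Z @ Z.T                                   # positive semidefinite

F = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)   # control gain (2.28)
TF = policy_operator(P, F, A, B, Q, R)

min_eigs = []
for _ in range(100):
    G = rng.standard_normal((m, k))           # arbitrary control matrix
    gap = policy_operator(P, G, A, B, Q, R) - TF
    min_eigs.append(np.linalg.eigvalsh(gap).min())

# Loewner order: the gap should be positive semidefinite in every draw
print(min(min_eigs) >= -1e-8)
```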

In obtaining optimality results, one problem we have is that $(\pP, \TT)$ is not always regular. On the one hand, if we take an arbitrary $P \in \pP$ and then calculate the control gain matrix $F = \bF(P)$ , Exercise 2.3.17 assures us that we get the “min-greedy” property $\bT_F(P) \preccurlyeq \bT_G(P)$ . On the other hand, we have no guarantee that $F$ is actually in $\Sigma$ . In particular, $F$ might not be a stable control matrix. For this reason, we introduce a smaller set $\pP_S \subset \pP$ of all positive semidefinite matrices such that control gain $F = \bF(P)$ is stable. That is, we set

\pP_S \coloneq \setntn{P \in \pP}{\rho(A + B \bF(P))<1} = \setntn{P \in \pP}{\bF(P) \in \Sigma}.

With this definition, the following lemma is immediate from Exercise 2.3.17.

Note that $\pP_S$ is not necessarily closed under the policy operators $\bT_F$ , so we cannot take $(\pP_S, \TT)$ as an ADP.
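A one-dimensional example shows why the restriction to $\pP_S$ matters. Suppose, purely for illustration, that $k = m = 1$ with $A = 2$, $B = 1$ and $R = 1$. Then $A + B \bF(P) = 2/(P + 1)$, so $P \in \pP_S$ exactly when $P > 1$. The sketch below checks membership numerically; the helper name `in_P_S` is hypothetical.

```python
import numpy as np

def control_gain(P, A, B, R):
    """Control gain map (2.28)."""
    return -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)

def in_P_S(P, A, B, R):
    """True if the control gain F(P) is a stable control matrix."""
    F = control_gain(P, A, B, R)
    return np.max(np.abs(np.linalg.eigvals(A + B @ F))) < 1

# Hypothetical scalar data, stored as 1 x 1 arrays
A = np.array([[2.0]])
B = np.array([[1.0]])
R = np.array([[1.0]])

for p in (0.0, 0.5, 3.0):
    P = np.array([[p]])
    print(p, in_P_S(P, A, B, R))
```

In this example the zero matrix is positive semidefinite but lies outside $\pP_S$, since $\bF(0) = 0$ leaves the unstable dynamics $x_{t+1} = 2 x_t$ unchanged.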

2.3.5.7 The Bellman Equation

From the definition in Section 2.1.1.3, the Bellman min-operator associated with the ADP $(\pP, \TT)$ is defined by

\bT (P) = \bigwedge_{F \in \Sigma} \bT_F (P) \;\; \text{ whenever the infimum exists}.

If $P \in \pP_S$ and $F = \bF(P)$ , then, by Lemma 2.3.7, $F$ is $P$ -min-greedy. Hence $\bT (P) = \bT_F (P)$ (see Lemma 2.1.1), so

\bT(P) = Q + F^\top R F + (A + BF)^\top P (A + BF) \quad\text{when} \quad F = \bF(P).

(2.37)

Recalling Exercise 2.3.13, this means that

\bT(P) = \bR(P) \quad \text{whenever } P \in \pP_S.

The next lemma connects this to the Riccati equation and another version of the LQ Bellman equation.

2.3.5.8 Optimality

Suppose that $(A, B)$ is controllable and $(A, C)$ is observable. Since $\bT$ and $\bR$ agree on $\pP_S$ , Lemma 2.3.4 tells us that $\bT$ has a fixed point $P^*$ in $\pP_S$ . We call $P^*$ the minimum loss matrix. Let $\pP_\Sigma$ be the set of lifetime costs, so that

\pP_\Sigma = \setntn{P = P_F}{F \in \Sigma}.

Applying Theorem 2.2.9 and the fact that $(\pP, \TT)$ is order stable (Lemma 2.3.6), we obtain the following optimality results:

(a) $P^*$ is the least element of $(\pP_\Sigma, \preccurlyeq)$,
(b) $P^*$ obeys the Bellman min-equation $\bT P^* = P^*$, and
(c) a policy $F$ is min-optimal for $(\pP, \TT)$ if and only if $F$ is $P^*$-min-greedy.

We can translate (a)–(c) into more familiar optimality results for LQ problems. For example, (b) combined with Lemma 2.3.9 tells us that

\ell^*(x) = \min_{u \in \RR^m} \left\{ x^\top Q x + u^\top R u + \ell^*(Ax + B u) \right\} \quad (x \in \RR^k),

(2.38)

where $\ell^*$ is defined by $\ell^*(x) = x^\top P^* x$ for all $x$ . Moreover, if we now set $F = \bF(P^*)$ , then, by Lemma 2.3.7, the policy $F$ is $P^*$ -min-greedy. Hence, by (c), $F$ is min-optimal.
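These optimality results can be illustrated numerically. The sketch below approximates $P^*$ by iterating the Riccati map from the zero matrix, a standard numerical approach that the text above does not itself prescribe, then forms $F = \bF(P^*)$ and compares $P^*$ with the lifetime cost matrix $P_G$ of another stable policy $G$. All function names, tolerances, and example matrices are assumptions; the example pair $(A, B)$ is controllable and $Q = I$, so the conditions of this section hold.

```python
import numpy as np

def riccati_map(P, A, B, Q, R):
    """Riccati map (2.27)."""
    K = B.T @ P @ A
    return A.T @ P @ A - K.T @ np.linalg.solve(B.T @ P @ B + R, K) + Q

def control_gain(P, A, B, R):
    """Control gain map (2.28)."""
    return -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)

def policy_operator(P, F, A, B, Q, R):
    """Policy operator T_F from (2.32)."""
    AF = A + B @ F
    return Q + F.T @ R @ F + AF.T @ P @ AF

# Hypothetical example data
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

# Approximate P* by iterating the Riccati map from P = 0
P_star = np.zeros((2, 2))
for _ in range(10_000):
    P_new = riccati_map(P_star, A, B, Q, R)
    converged = np.max(np.abs(P_new - P_star)) < 1e-10
    P_star = P_new
    if converged:
        break

F_star = control_gain(P_star, A, B, R)
rho = np.max(np.abs(np.linalg.eigvals(A + B @ F_star)))

# Lifetime cost matrix of another stable policy G, with rho(A + BG) = 0.9
G = np.array([[-1.0, -2.0]])
P_G = np.zeros((2, 2))
for _ in range(2_000):
    P_G = policy_operator(P_G, G, A, B, Q, R)

gap_eigs = np.linalg.eigvalsh(P_G - P_star)
print("rho(A + B F*) =", rho)
print("smallest eigenvalue of P_G - P* =", gap_eigs.min())
```

If the iteration has converged, one expects $\rho(A + BF) < 1$, so that $P^* \in \pP_S$, and a nonnegative smallest eigenvalue for $P_G - P^*$, consistent with (a).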

2.4 Chapter Notes

This chapter is based on the abstract dynamic programming framework of Sargent & Stachurski (2025), which builds on theory found in Denardo (1967), Bertsekas (1977), Verdu & Poor (1987), Szepesvari (1998), Kamihigashi (2014), Bertsekas (2017), Li & Xie (2021), and, in particular, Bertsekas (2022). The paper by Sargent & Stachurski (2025) adds a layer of abstraction over these earlier frameworks by shifting analysis to families of policy operators on partially ordered sets. Doing so makes it possible to treat a wider class of problems and generate new results, as discussed in later chapters.

Other precursors to the framework described in this chapter include Porteus (1975) and Kreps & Porteus (1977), who pioneered the use of order-theoretic methods to extend dynamic programming optimality theory beyond the standard expected discounted reward criterion. (An overview of this line of work appears in Appendix 6 of Kreps (2013).) Their operator-based approach accommodates non-additive objectives, such as expected utility criteria, risk-sensitive preferences and stochastic games. Building on this foundation, Kreps & Porteus (1979) showed that non-additive recursive preferences are amenable to dynamic programming, work that inspired the recursive utility framework of Epstein & Zin (1989).

The discussion of distributional dynamic programming in Section 2.3.4 builds on Bellemare et al. (2017). The objective of this line of work is to replace scalar value functions with return distributions, enabling agents to reason about risk, variability, and higher-order statistics of outcomes. The theoretical foundations have been extended by several authors, including Dabney et al. (2018), Rowland et al. (2018), Bäuerle et al. (2025), Marthe (2026), and Bäuerle & Vasileiadis (2026). Distributional methods have found application in risk-sensitive control, robotics, and finance.

Linear-quadratic optimal control theory was developed during the late 1950s and 1960s, when the Riccati equation emerged as a key tool for computing optimal feedback policies. These methods have been widely applied in economics. Sargent (1987) provides an early treatment connecting LQ foundations to economic modeling. Hansen & Sargent (1980) showed how to formulate and estimate dynamic linear rational expectations models using LQ methods. The risk-sensitive extension to linear-exponential-quadratic-Gaussian (LEQG) control, developed by Whittle (1981) and adapted for discounted problems in economics by Hansen & Sargent (1995), provides a bridge between LQ control and robust decision-making under model uncertainty; see Hansen & Sargent (2011) for a comprehensive treatment. LQ methods underpin much of macroeconomic modeling, including the analysis of fiscal policy, monetary policy, and business cycle dynamics. See Hansen & Sargent (2013) and Ljungqvist & Sargent (2018) for numerous applications.

Research at the intersection of nonlinear dynamics and LQ control is currently quite active. One of the key ideas is to approximate nonlinear systems with very high dimensional linear systems, and then to approximate those linear systems via singular value decomposition. For further discussion of these topics, see Kutz et al. (2016) or Brunton & Kutz (2019).

Footnotes

In general, designating a $v$ -greedy policy for all $v \in V$ requires the Axiom of Choice. In practice, many applications induce some structure on the policy set that can be used to produce simple selection mechanisms.

References

Puterman, M. L. (2005). Markov decision processes: discrete stochastic dynamic programming. Wiley Interscience.
Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. International Conference on Machine Learning, 449–458.
Bertsekas, D. (2012). Dynamic programming and optimal control (Vol. 1). Athena Scientific.
Sargent, T. J., & Stachurski, J. (2025). Dynamic Programs on Partially Ordered Sets. SIAM Journal on Optimization and Control, in press.
Denardo, E. V. (1967). Contraction Mappings in the Theory Underlying Dynamic Programming. SIAM Review, 9(2), 165–177.
Bertsekas, D. P. (1977). Monotone mappings with application in dynamic programming. SIAM Journal on Control and Optimization, 15(3), 438–464.
Verdu, S., & Poor, H. V. (1987). Abstract dynamic programming models under commutativity conditions. SIAM Journal on Control and Optimization, 25(4), 990–1006.
Szepesvari, C. (1998). Non-Markovian policies in sequential decision problems. Acta Cybernetica, 13(3), 305–318.
Kamihigashi, T. (2014). Elementary results on solutions to the Bellman equation of dynamic programming: existence, uniqueness, and convergence. Economic Theory, 56, 251–273.
Bertsekas, D. P. (2017). Regular policies in abstract dynamic programming. SIAM Journal on Optimization, 27(3), 1694–1727.
Li, X., & Xie, L. (2021). Online Abstract Dynamic Programming with Contractive Models.
Bertsekas, D. P. (2022). Abstract dynamic programming (3rd ed.). Athena Scientific.
Porteus, E. L. (1975). On the optimality of structured policies in countable stage decision processes. Management Science, 22(2), 148–157.
Kreps, D. M., & Porteus, E. L. (1977). On the optimality of structured policies in countable stage decision processes. II: Positive and negative problems. SIAM Journal on Applied Mathematics, 32(2), 457–466.
Kreps, D. M. (2013). Microeconomic foundations (Vol. 1). Princeton University Press.
