Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

2 Abstract Decision Processes

One of the principal objects of theoretical research in any department of knowledge is to find the point of view from which the subject appears in its greatest simplicity.

---Josiah Willard Gibbs

Having reviewed some applications in Chapter 1, our next aim is to present a general theory of dynamic programming that includes these applications as special cases and extends to many new problems.

The framework developed in this chapter leads with order theory rather than metric structure. The reason is that contraction-based arguments on Banach spaces, the standard foundation for dynamic programming, are fragile. Many natural variants of the textbook setting --- nonstandard discounting, nonlinear period aggregators, nonstandard value spaces, and others --- break it. As suggested by the examples in Chapter 1, contraction is the exception rather than the rule once one moves outside the additively separable, expected-reward template.

What does unify these settings is that policy operators are order preserving and that lifetime values are characterized as fixed points of those operators on partially ordered sets. This chapter develops the optimality and convergence theory those ingredients support. This investment will pay off in later chapters: subsequent theory uses this foundation to establish, for many classes of models, the same fundamental optimality results and algorithmic convergence guarantees one expects from classical theory, even without contraction-type properties. The classical theory itself is recovered as a special case: the firm problem of Section 1.1, the finite MDP of Section 1.2, and the optimal savings problem of Section 1.3 all reappear as concrete instances.

A second feature of the framework is methodological. Order-theoretic proofs are almost entirely algebraic: each result reduces to systematic application of a small number of named hypotheses, with few analytic prerequisites. This setup pays off in two ways. It eases human analysis --- a proof in this chapter typically needs nothing beyond applications of definitions and chains of inequalities --- and it makes the theory a natural target for machine-assisted reasoning.

We begin with definitions and basic properties. Some readers will find it helpful to review the main order-theoretic concepts in Section A.1.2 before proceeding further.

The optimality results given in this chapter are high-level. We use them mostly as inputs for downstream optimality theory, rather than for applications per se. (In particular, from Chapter 3 on, we leverage findings from this chapter to construct downstream results that are more easily applied to specific problems.) When presenting this high-level theory, we refrain from giving examples until Section 2.3, hoping that readers have reviewed Chapter 1, or otherwise have strong backgrounds in core concepts of dynamic programming. Such readers will be able to appreciate the practical value of the abstract framework to be presented here.

2.1ADPs on Posets

We begin by defining abstract dynamic programs and listing some of their properties, including the Bellman equation and Bellman operator. We then state conditions under which essential optimality properties hold.

2.1.1Definitions and Properties

In this section we define abstract dynamic programs and list some simple properties.

2.1.1.1Key Definitions

Let’s start with the most important concept.

In what follows,

In applications we will impose conditions under which each TσT_\sigma has a unique fixed point in VV. In these settings, the significance of TσT_\sigma is that its fixed point represents the lifetime value of following policy σ\sigma. We will denote the unique fixed point of TσT_\sigma by vσv_\sigma and call it the σ\sigma-value function.

Figure 2.1 illustrates a simple ADP with V=RV = \RR (paired with the usual order) and Σ={σ,σ,σ}\Sigma = \{\sigma, \sigma', \sigma''\}. Each policy operator is an affine self-map on R\RR, and each has a unique fixed point: vσv_\sigma, vσv_{\sigma'}, and vσv_{\sigma''}, respectively.

An ADP on \RR

Figure 2.1:An ADP on R\RR

In Chapter 1, the concept of greedy policies (see (1.97)) played a key role. It will also play a key role here, in our ADP framework. The definition is as follows.

As shown below, Definition 2.1.2 generalizes the notion of greedy policies for the problems considered in Chapter 1. Throughout this book we let

VG{vV: at least one v-greedy policy exists}.V_G \coloneq \setntn{v \in V}{\text{ at least one } v \text{-greedy policy exists}}.

Figure 2.2 illustrates the vv-greedy concept for the ADP from Figure 2.1, evaluated at the marked point vv. The three values TσvT_\sigma \, v, TσvT_{\sigma'} \, v, TσvT_{\sigma''} \, v are indicated: here TσvT_{\sigma'} \, v is the largest, so σ\sigma' is the vv-greedy policy.

In applications, solving for a vv-greedy policy is often straightforward. Moreover, there exist conditions under which solving the overall problem reduces to solving for a vv-greedy policy with a “correct” choice of vv. We pursue this idea below.

2.1.1.2Properties of ADPs

We now list properties useful for ADP optimality theory. First, we call (V,T)(V, \TT)

Well-posedness is a minimal precondition for a coherent dynamic program: to maximize lifetime values, we need them to be well-defined. When well-posedness holds we will always use vσv_\sigma to denote the unique fixed point of TσT_\sigma.

We call (V,T)(V, \TT)

Finiteness holds for the finite MDPs described in Section 1.2, as well as other related settings with finite states and actions. Not surprisingly, finite ADPs have attractive optimality properties.

We call (V,T)(V, \TT)

Regularity helps us to construct algorithms and to obtain existence of optimal policies. Since V=VGV = V_G under regularity, Lemma 2.1.1 implies that TT is well-defined on all of VV whenever this property holds. We focus primarily on regular ADPs.

We call (V,T)(V, \TT)

Recalling the definitions in Section A.1.2.10, this means that (V,T)(V, \TT) is well-posed and, for each σΣ\sigma \in \Sigma and vVv \in V,

Most proofs in this chapter use only properties (a) and (b). Occasionally we will need monotone convergence of the iterates, in which case we will assume that (V,T)(V, \TT) is

As per the discussion in Section A.1.2.10, this implies that each TσT_\sigma is order stable and, in addition, vTσvv \preceq T_\sigma \, v implies TσnvvσT_\sigma^n v \uparrow v_\sigma and TσvvT_\sigma \, v \preceq v implies TσnvvσT_\sigma^n v \downarrow v_\sigma.

2.1.1.3The Bellman Equation

A Bellman equation that plays a central role in optimality theory for the Section 1.3.2.1 optimal savings problem is again key for ADPs. Here we define the ADP Bellman equation and note some preliminary observations. Throughout this section, (V,T)(V, \TT) is an ADP with policy set Σ\Sigma.

In (2.3), the supremum is taken over all σΣ\sigma \in \Sigma. There is, in general, no guarantee that the supremum exists.

Evidently vVv \in V satisfies the Bellman equation if and only if TvTv exists and Tv=vTv = v.

Figure 2.3 illustrates the Bellman operator for the simple ADP introduced earlier. At each vv, the value TvTv is the largest of TσvT_\sigma \, v, TσvT_{\sigma'} \, v, TσvT_{\sigma''} \, v: TT is the upper envelope of the three policy operators. The fixed point of TT is vσv_{\sigma''}, which is also the largest of the three σ\sigma-value functions and hence the value function v\vmax.

σ′ is v-greedy

Figure 2.2:σ′ is v-greedy

T as upper envelope

Figure 2.3:T as upper envelope

The next lemma provides some essential facts about TT on VGV_G (as defined in (2.2)).

2.1.1.4Subsets of the Value Space

We often refer to the following three subsets of the value space VV, the first of which was already introduced in (2.2) and is repeated here for convenience:

In the next lemma, (V,T)(V, \TT) is any ADP.

2.1.2Optimization

In this subsection we define optimality for ADPs, connect it to the Bellman equation, and state the fundamental optimality properties. We then give sufficient conditions for these properties to hold.

2.1.2.1Optimality and the Bellman Equation

We say that a policy σΣ\sigma \in \Sigma is optimal for (V,T)(V, \TT) if vσv_\sigma is a greatest element of VΣV_\Sigma. In other words, σ\sigma is optimal if it attains the “highest possible” lifetime value.

Perhaps the most important aspect of the theory of dynamic programming is the link between optimality and the Bellman equation. To clarify this link we introduce some terminology. Let (V,T)(V, \TT) be a well-posed ADP and set

vσvσVΣwhenever the supremum exists.\vmax \coloneq \bigvee_\sigma v_\sigma \coloneq \bigvee V_\Sigma \quad \text{whenever the supremum exists}.

When v\vmax exists (i.e., when the supremum exists) we call v\vmax the value function of the ADP. The following statements are obvious from the definitions:

At the same time, existence of v\vmax does not generally imply existence of a greatest element (and hence an optimal policy).

(When v\vmax does not exist both sets are understood as empty.) The following results will be useful for studying optimality.

2.1.2.2The Fundamental Optimality Properties

Throughout this section, (V,T)(V, \TT) is a well-posed ADP.

It is important to emphasize here that (B1)--(B3) are not independent. Indeed, (B2) implies both (B1) and (B3), as shown in the next proposition.

Despite the fact that (B2) implies both (B1) and (B3), we have stated them together because, in terms of applications, all three parts are significant. (B1) tells us that the problem at hand has a solution. (B3) implies that a solution can be computed by taking a v\vmax-greedy policy, and that any other optimal policy must also be v\vmax-greedy. Finally, (B2) provides a restriction that can help us calculate v\vmax. Provided that we search in VGV_G, any fixed point of TT is equal to v\vmax.

Proposition 2.1.4 tells us our main goal is to construct conditions under which v\vmax exists and is the unique fixed point of TT in VGV_G. We begin this task in Section 2.2.2.

The next exercise generalizes a well-known result from more traditional dynamic programming frameworks (see, e.g., Puterman (2005), Theorem 6.2.6).

2.1.2.3From Fixed Points to Optimality

In this section we investigate the following question: When does existence of a solution to the Bellman equation imply the fundamental optimality properties (B1)--(B3)? We will see that order stability is useful here. The results stated in this section should be understood as intermediate inputs: we use them in downstream theorems that are tuned to applications.

In the statement of the next result, (V,T)(V, \TT) is an ADP and TT is the Bellman operator.

The next result is almost a corollary of Theorem 2.1.5. It imposes a strong order completeness condition on the value space. (See Section A.5.1 for background on order completeness.)

While the chain completeness assumption is strong, Theorem 2.1.7 is nonetheless significant: It says that, at least for this idealized setting, regular well-posed ADPs possess all the optimality properties that we seek.

What drives this strong result? The definition of ADPs is itself doing some heavy lifting: it requires that lifetime values are inherently recursive and reflect order structure (being fixed points of order-preserving operators). Bellman’s fundamental ideas work smoothly in this setting when we add some regularity.

2.2Algorithms and Convergence

In this section we introduce algorithms for solving ADPs---including value function iteration, Howard policy iteration and optimistic policy iteration---and establish conditions under which they converge. We treat both maximization and minimization, extending the latter via the theory of dual ADPs.

2.2.1Algorithms

In this section we discuss algorithms for dynamic programming and present some preliminary results. The three algorithms we consider are value function iteration (VFI), Howard policy iteration (HPI) and optimistic policy iteration (OPI). These algorithms generalize the ones presented for the finite MDP case in Section 1.2.1.3.

2.2.1.1Operators

As a first step, we introduce two operators. Throughout Section 2.2.1.1 we take (V,T)(V, \TT) to be a fixed ADP. As usual, when (V,T)(V, \TT) is well-posed, the unique fixed point of TσTT_\sigma \in \TT in VV is denoted by vσv_\sigma.

(Here and below, the dependence of WW on mm is often suppressed to simplify notation.)

Like the Bellman operator TT, the map WW is well-defined on all of VV when (V,T)(V, \TT) is regular. The Howard policy operator HH is well-defined on all of VV when (V,T)(V, \TT) is well-posed and regular. Note that, for both of these maps, we always select the same vv-greedy policy when applying them to some fixed vv.[1]

Below we will associate VFI, OPI, and HPI with fixing a vVv \in V and then iteratively applying the operators TT, WW, and HH respectively. A small amount of thought will convince you that, for the optimal savings ADP described in Section 2.3.2, these iterative procedures coincide with our earlier description of VFI, HPI, and OPI from Section 1.2.1.3.

The next lemma collects useful results for the operators introduced above. In the statement, (V,T)(V, \TT) is an ADP with Howard policy operator HH, optimistic policy operator WW and Bellman operator TT. We assume regularity, so that VG=VV_G = V.

The next lemma adds order stability and derives additional implications.

2.2.1.2Convergence

In this section, we assume that (V,T)(V, \TT) is a regular ADP and the fundamental optimality properties on page  hold. Let v\vmax denote the value function.

For OPI convergence, the meaning is that convergence holds for any choice of the OPI step size mNm \in \NN.

The next lemma is a useful preliminary.

The next lemma sharpens Lemma 2.2.2 by adding upper bounds in terms of v\vmax.

2.2.2Optimality and Convergence

In this section we introduce conditions for optimality of ADPs and convergence of algorithms that are well-posed and regular. These high-level results will later be used as inputs for lower level results that are more straightforward to check in applications. In each case, we will aim for the fundamental optimality properties, as given on page , and the convergence of the three major algorithms, discussed on page .

2.2.2.1Case I: Finite ADPs

In applications we often deal with dynamic programs that have finitely many states and actions. This finiteness implies that the set of feasible policies is finite. The next result deals with this case.

2.2.2.2Case II: Chain Complete Value Space

Next we state a result for the chain complete setting that extends Theorem 2.1.7. It shows that adding strong order stability is enough to provide convergence of all algorithms.

2.2.2.3Case III: Countably Dedekind Complete Value Space

The result in this section replaces chain completeness with countable Dedekind completeness and order continuity. To state the result we need two new definitions. We call (V,T)(V, \TT)

Order continuity means that TσvnTσvT_\sigma \, v_n \uparrow T_\sigma \, v whenever σΣ\sigma \in \Sigma and vnvv_n \uparrow v. This technical condition holds in many of the applications we consider. A general discussion of order continuous operators is provided in Section A.5.1.3.

In addition, we call (V,T)(V, \TT)

Now we can state the main result of this section.

Table 2.1 summarizes the conditions and conclusions for the three cases treated above.

Table 2.1:Conditions and conclusions for the three convergence cases

Case I

Case II

Case III

(Theorem 2.2.6)

(Theorem 2.2.7)

(Theorem 2.2.8)

Regular

Order stable

Finite

Chain complete

Countably Dedekind complete

Order bounded

Order continuous

Optimality properties

VFI, OPI, HPI converge

HPI finite convergence

2.2.3Minimization

In some dynamic programs, the objective is to minimize lifetime cost of a given policy, rather than maximizing rewards. While we focus primarily on maximization in this book, the present section discusses how to handle minimization problems. The content of this section can be summarized by the following statement: for a given ADP (V,T)(V, \TT), a minimization problem can be converted to a maximization problem by reversing the partial order on VV. Further details are given below. Readers who prefer to focus on maximization results can safely skip ahead.

2.2.3.1Definitions

Let (V,T)(V, \TT) be an ADP with policy set Σ\Sigma. We define the Bellman min-operator T\tmin corresponding to (V,T)(V, \TT) by

Tv=σTσvwhenever the infimum exists.\tmin v = \bigwedge_\sigma T_\sigma \, v \quad \text{whenever the infimum exists.}

We say that vVv \in V satisfies the Bellman min-equation if Tv=v\tmin v = v.

Paralleling the maximization terminology, we say that

Analogous to the max case, we have

σ is v-min-greedy    Tσv=Tv.\sigma \text{ is } v \text{-min-greedy} \quad \iff \quad T_\sigma \, v = \tmin v.

In addition, we say that (V,T)(V, \TT) is

We let

VG={vV: at least one v-min-greedy policy exists}.V^G_{\triangledown} = \setntn{v \in V}{\text{ at least one } v \text{-min-greedy policy exists}}.

Now suppose (V,T)(V, \TT) is well-posed and let VΣV_\Sigma be defined as before (i.e., the set of σ\sigma-value functions). In this setting we set v=σvσ\vmin = \bigwedge_\sigma v_\sigma and call it the min-value function whenever the infimum exists. Also, we say that

σΣ is min-optimal for (V,T)    σ is v-min-greedy.\sigma \in \Sigma \text{ is min-optimal for } (V, \TT) \quad \iff \quad \sigma \text{ is } \vmin \text{-min-greedy}.

When (V,T)(V, \TT) is min-regular and well-posed, we define the Howard policy min-operator corresponding to (V,T)(V, \TT) via

H ⁣:VGVΣ,Hv=vσ where σ is v-min-greedy,\Hmin \colon V^G_{\triangledown} \to V_\Sigma, \qquad \Hmin v = v_\sigma \quad \text{ where } \sigma \text{ is } v \text{-min-greedy},

as well as, for each mNm \in \NN, the optimistic policy min-operator via

W ⁣:VGV,WvTσmvwhereσ is v-min-greedy.\Wmin \colon V^G_{\triangledown} \to V, \qquad \Wmin v \coloneq T^m_\sigma v \quad \text{where} \quad \sigma \text{ is } v \text{-min-greedy}.

We say that the fundamental min-optimality properties hold if

  1. at least one min-optimal policy exists,

  2. v\vmin exists and is the unique solution to the Bellman min-equation in VGV^G_\triangledown, and

  3. Bellman’s principle of min-optimality holds.

This definition parallels (B1)--(B3) from page .

Let VDV_D be all vVGv \in V^G_{\triangledown} with Tvv\tmin v \preceq v. We say that

To further increase clarity, when discussing maximization and minimization in the same section, we add a “max-” prefix to the previously introduced definitions that pertain to maximization. For example,

and so on.

2.2.3.2Dual ADPs

Let’s now investigate how minimization problems can be converted to maximization problems in this abstract setting. The key idea is that taking infima in (V,)(V, \preceq) corresponds to taking suprema in the order-dual (V,)(V, \preceq^\partial).

Recall from Section A.1.2.5 that if V(V,)V \coloneq (V, \preceq) is a partially ordered set, then its order dual V(V,)V^\partial \coloneq (V, \preceq^\partial) is the set VV paired with the partial order \preceq^\partial obtained by setting uvu \preceq^\partial v if and only if vuv \preceq u. If (V,T)(V, \TT) is any ADP, then we call (V,T)(V,T)(V, {\TT})^\partial \coloneq (V^\partial, \TT) the dual of (V,T)(V, \TT). In other words, the dual (V,T)(V, {\TT})^\partial of (V,T)(V, \TT) is the ADP created by maintaining the same family of policy operators T\TT while replacing the poset VV with its order dual VV^\partial.

Regarding notation for (V,T)(V, {\TT})^\partial,

Each ADP is self-dual, in the sense that ((V,T))=(V,T)((V, {\TT})^\partial)^\partial = (V, \TT). This follows from the fact that all partially ordered sets are self-dual.

Self-duality implies corollaries to Exercise 2.5 that we treat as self-evident. For example, σΣ\sigma \in \Sigma is min-optimal for (V,T)(V, {\TT})^\partial if and only if σ\sigma is max-optimal for (V,T)(V, \TT).

Part (viii) of Exercise 2.5 tells us that we can solve an ADP for a min-optimal policy by switching to the dual ADP and maximizing.

2.2.3.3Optimality and Convergence

We can now easily translate max-optimality results to min-optimality results and vice versa. Our key tool is the next exercise.

Below we use Exercise 2.7 to prove theorems providing sufficient conditions for minimization results.

As one example of how Exercise 2.7 can be applied, we construct a min-version of Theorem 2.1.5. In the statement, T\tmin is the Bellman min-operator.

2.3Applications

In this section we show how several important models fit into the ADP framework. Applications include firm valuation, optimal savings, finite Markov decision processes and linear-quadratic control.

2.3.1Firm Valuation

Recall the firm problem from Section 1.1, where, repeating (1.7), the policy operators took the form

(Tσv)(x)=σ(x)s+(1σ(x))[π(x)+βv(x)P(x, ⁣dx)](xX).(T_\sigma \, v)(x) = \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right] \qquad (x \in \Xsf).

The policy set Σ\Sigma was defined as all Borel measurable functions mapping X\Xsf to {0,1}\{0,1\}. Let TFV{Tσ:σΣ}\TT_{\rm FV} \coloneq \setntn{T_\sigma}{\sigma \in \Sigma} and let \leq be the pointwise partial order on bXb\Xsf.

The monotonicity in Exercise 2.8 implies that the pair (bX,TFV)(b\Xsf, \TT_{\rm FV}) is an ADP. We saw in Section 1.1.1.2 that each TσT_\sigma is a contraction map on bXb\Xsf. Hence (bX,TFV)(b\Xsf, \TT_{\rm FV}) is well-posed. As we discussed in Section 1.1.1.2, the unique fixed point vσv_\sigma of TσT_\sigma has the interpretation of assigning lifetime values (expected present value of the firm) to states (initial conditions for the underlying Markov chain) under σ\sigma.

In Section 1.1 we introduced the Bellman operator TT (see (1.16)) and the notion of greedy policies (see (1.14)). Let’s make sure that these agree with the new ADP definitions from this chapter. To begin, fix vbXv \in b\Xsf. On one hand, in Section 1.1.1.2, we called σΣ\sigma \in \Sigma vv-greedy whenever

σ(x)argmaxa{0,1}{as+(1a)[π(x)+βv(x)P(x, ⁣dx)]}for all xX.\sigma(x) \in \argmax_{a \in \{0,1\}} \left\{ a s + (1 - a) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right] \right\} \quad \text{for all } x \in \Xsf.

On the other hand, in our ADP setting, we called σ\sigma vv-greedy when σΣ\sigma \in \Sigma and TτvTσvT_\tau \, v \preceq T_\sigma \, v for all τΣ\tau \in \Sigma. Here this translates to

τ(x)s+(1τ(x))[π(x)+βv(x)P(x, ⁣dx)]σ(x)s+(1σ(x))[π(x)+βv(x)P(x, ⁣dx)]\tau(x) s + (1 - \tau(x)) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right] \leq \\ \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right]

for all τΣ\tau \in \Sigma and all xXx \in \Xsf, which is equivalent to stating that σ\sigma obeys (2.28). Hence, for the firm ADP (bX,TFV)(b\Xsf, \TT_{\rm FV}), the two definitions agree.

A vv-greedy policy can always be chosen. Indeed, if we set

σ(x)=1{sπ(x)+βv(x)P(x, ⁣dx)}(xX),\sigma(x) = \1\left\{s \geq \pi(x) + \beta \int v(x') P(x, \diff x')\right\} \qquad (x \in \Xsf),

then σ\sigma is Borel measurable, since π\pi and xv(x)P(x, ⁣dx)x \mapsto \int v(x') P(x, \diff x') are both Borel measurable, and σ\sigma obeys (2.28). Given that we can always choose a σΣ\sigma \in \Sigma obeying (2.28), the firm ADP (bX,TFV)(b\Xsf, \TT_{\rm FV}) is regular.

The ADP definition of the Bellman operator presented in Section 2.1.1.3 is Tv=σTσvTv = \bigvee_\sigma T_\sigma \, v. For our firm ADP (bX,TFV)(b\Xsf, \TT_{\rm FV}), this agrees with the original definition we gave in Section 1.1.1.3. Indeed, given that the ADP is regular, we can state that Tv=TσvTv = T_\sigma \, v whenever σ\sigma is vv-greedy (Lemma 2.1.1). We have just shown that any such σ\sigma obeys (2.28). Taking such a σ\sigma and applying Tv=TσvTv = T_\sigma \, v yields

(Tv)(x)=max{s,  π(x)+βv(x)P(x, ⁣dx)}(T v)(x) = \max \left\{ s ,\; \pi(x) + \beta \int v(x') P(x, \diff x') \right\}

for all xXx \in \Xsf. This ADP construction of TT agrees with the original definition we presented in (1.16).

It is possible to prove optimality results for the firm problem here, verifying and extending Theorem 1.1.1. For example, a proof can be constructed using Theorem 2.2.8. For now we’ll refrain from doing so. The reason is that we build additional ADP theory below, leveraging the framework set out in this chapter. Tackling specific applications will then become much easier.

2.3.2Optimal Savings

Consider the optimal savings model from Section 1.3, with Assumption 1.3.1 in force. We can represent this model as an ADP by taking VbR+V \coloneq b\RR_+ as the value space, paired with the pointwise order \leq, letting Σ\Sigma be the set of (Borel measurable) feasible policies, as defined in Section 1.3.1.1, and setting TOS{Tσ:σΣ}\TT_{\rm OS} \coloneq \setntn{T_\sigma}{\sigma \in \Sigma}, where each policy operator TσT_\sigma is as given in (1.83). It is straightforward to verify that each TσTOST_\sigma \in \TT_{\rm OS} is order preserving under \leq, and Exercise 1.10 confirms that TσT_\sigma maps VV to itself. Hence (V,TOS)(V, \TT_{\rm OS}) is an ADP.

By Lemma 1.3.1, each policy operator TσT_\sigma has a unique fixed point (i.e., σ\sigma-value function) vσv_\sigma. Consistent with the discussion in Section 2.1.1.1, the real number vσ(w)v_\sigma(w) represents the lifetime value of policy σ\sigma, conditional on initial wealth state W0=wW_0 = w.

In (1.97) we defined the concept of a vv-greedy policy for the optimal savings model. Earlier, in (2.1), we introduced the notion of a vv-greedy policy for an arbitrary ADP. The second definition is a generalization of the first. Indeed, if σ\sigma obeys the optimal savings greedy condition (1.97) and τ\tau is any other feasible policy, then

u(τ(w))+βv(R(wτ(w))+y)ϕ( ⁣dy)u(σ(w))+βv(R(wσ(w))+y)ϕ( ⁣dy)for all wR+.u(\tau(w)) + \beta \int v(R(w - \tau(w)) + y) \phi(\diff y) \leq \\ u(\sigma(w)) + \beta \int v(R(w - \sigma(w)) + y) \phi(\diff y) \qquad \text{for all } w \in \RR_+.

This is equivalent to the statement TτvTσvT_\tau \, v \leq T_\sigma \, v for all τΣ\tau \in \Sigma, which is, in the present setting, the ADP definition of vv-greedy (given that the partial order is \leq). Conversely, if σΣ\sigma \in \Sigma obeys TτvTσvT_\tau \, v \leq T_\sigma \, v for all τΣ\tau \in \Sigma, then it obeys (2.33). By appealing to Lemma 1.3.2, we can strengthen this to

σ(w)argmax0cw{u(c)+βv(R(wc)+y)ϕ( ⁣dy)}for all wR+.\sigma(w) \in \argmax_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\} \quad \text{for all } w \in \RR_+.

In other words, σ\sigma is vv-greedy in the sense of Section 1.3.

In addition, the ADP Bellman operator for (V,TOS)(V, \TT_{\rm OS}), as defined in (2.4), is a generalization of the optimal savings Bellman operator given in (1.99). To see this, let T=σTσT = \bigvee_\sigma T_\sigma be the ADP Bellman operator and fix vVv \in V. By (ii) of Lemma 2.1.1, we have Tv=TσvTv = T_\sigma \, v whenever σ\sigma is vv-greedy. Letting σ\sigma obey (2.34), which exists by Lemma 1.3.2, fixing wR+w \in \RR_+ and combining these facts, we get

(Tv)(w)=(Tσv)(w)=max0cw{u(c)+βv(R(wc)+y)ϕ( ⁣dy)}.(Tv)(w) = (T_\sigma v)(w) = \max_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\}.

This confirms the claim that, in the setting of (V,TOS)(V, \TT_{\rm OS}), the ADP Bellman operator reduces to the optimal savings Bellman operator in (1.99).

In view of Lemma 1.3.2, a vv-greedy policy exists for every vVv \in V. Hence (V,TOS)(V, \TT_{\rm OS}) is regular. In Lemma 1.3.1 on page  we showed that each TσTOST_\sigma \in \TT_{\rm OS} has a unique fixed point, so (V,TOS)(V, \TT_{\rm OS}) is well-posed. The same lemma also showed that each policy operator in TOS\TT_{\rm OS} is globally stable. Lemma A.5.19 now implies that (V,TOS)(V, \TT_{\rm OS}) is order stable.

2.3.3MDPs as ADPs

In Section 1.2 we introduced a finite MDP (Γ,r,β,P)(\Gamma, r, \beta, P) with finite state space X\Xsf and finite action space A\Asf. Now we show that this finite MDP can be framed as an ADP (RX,TMDP)(\RR^\Xsf, \TT_{\rm MDP}) and then discuss its properties.

2.3.3.1ADP Representations for MDPs

Let (Γ,r,β,P)(\Gamma, r, \beta, P) be as above. We recall that the policy operators take the form

(Tσv)(x)=r(x,σ(x))+βxv(x)P(x,σ(x),x)(vRX,  xX)(T_\sigma \, v)(x) = r(x, \sigma(x)) + \beta \sum_{x'} v(x') P(x, \sigma(x), x') \qquad (v \in \RR^\Xsf, \; x \in \Xsf)

We set TMDP{Tσ:σΣ}\TT_{\rm MDP} \coloneq \setntn{T_\sigma}{\sigma \in \Sigma}, where Σ\Sigma is the feasible policies, and pair RX\RR^\Xsf with the pointwise partial order \leq. Since each TσTMDPT_\sigma \in \TT_{\rm MDP} is order preserving on (RX,)(\RR^\Xsf, \leq), the pair (RX,TMDP)(\RR^\Xsf, \TT_{\rm MDP}) is an ADP.

We proved in Exercise 1.4 that each TσTMDPT_\sigma \in \TT_{\rm MDP} has a unique fixed point in RX\RR^\Xsf given by (IβPσ)1rσ(I-\beta P_\sigma)^{-1} r_\sigma. Hence (RX,TMDP)(\RR^\Xsf, \TT_{\rm MDP}) is well-posed. (RX,TMDP)(\RR^\Xsf, \TT_{\rm MDP}) is also order continuous. Indeed, in (RX,)(\RR^\Xsf, \leq), the statement vnvv_n \uparrow v is equivalent to vn(x)v(x)v_n(x) \uparrow v(x) in R\RR for all xXx \in \Xsf (Lemma A.1.3). As a result, every continuous order preserving map is order continuous.

The ADP (RX,TMDP)(\RR^\Xsf, \TT_{\rm MDP}) is also order stable. Indeed, each TσT_\sigma has the form Tσv=rσ+βPσvT_\sigma \, v = r_\sigma + \beta P_\sigma \, v, where PσP_\sigma is a stochastic matrix. Since βPσ0\beta P_\sigma \geq 0 and ρ(βPσ)=β<1\rho(\beta P_\sigma) = \beta < 1, Exercise A.24 implies that each TσT_\sigma is order stable.

In Section 1.2.1.2 we introduced the concept of vv-greedy policies for the finite MDP, as well as the Bellman operator and the Bellman equation. Let’s make sure that these are in fact special cases of our ADP definitions from Section 2.1.1.

From (2.40) it follows that the ADP Bellman equation for (RX,TMDP)(\RR^\Xsf, \TT_{\rm MDP}) is given by

v(x)=maxaΓ(x){r(x,a)+βxv(x)P(x,a,x)}(xX).v(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \qquad\qquad (x \in \Xsf).

This aligns with the traditional expression for the Bellman equation that we set out in (1.46).

2.3.3.2Optimality for Finite MDP Models

We make a number of optimality claims for the finite MDP model in Section 1.2.1.2. Later we’ll build on the ADP optimality results in ways that make verifying these claims straightforward. Nevertheless, for the sake of the exercise, we now prove the claims from Section 1.2.1.2 using Theorem 2.2.8, as well as establishing additional results. Readers who prefer to move on can skip this section without loss of continuity.

We begin with the following claim.

The statement in Proposition 2.3.1 is easily translated into more standard MDP terminology. For example, it tells us that the value function v\vmax solves the Bellman equation, which we know is given by (2.43), and, by the characterization of greedy policies in (2.39) plus Bellman’s principle of optimality, that a policy σ\sigma is optimal if and only if

σ(x)argmaxaΓ(x){r(x,a)+βxv(x)P(x,a,x)}for all xX.\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} \vmax(x') P(x, a, x') \right\} \quad \text{for all } x \in \Xsf.

Later we will show that ADP theory can also handle many extensions to the basic MDP framework.

The next exercise asks you to work through an alternative to the proof of Proposition 2.3.1.

2.3.4Distributional Dynamic Programming

In Section 1.1.3.2 we mentioned distributional dynamic programming, which seeks to track not just the expected discounted return under a policy, but also the full probability distribution of the random discounted return t0βtr(Xt,σ(Xt))\sum_{t \geq 0} \beta^t r(X_t, \sigma(X_t)). This richer object captures variance, quantiles, tail risk, and other features that matter when agents are not risk-neutral. Here we review some of the basic concepts and show how these ideas can be embedded into the ADP framework. Further reading can be found in Section 2.4.

2.3.4.1The Distributional ADP

Let X\Xsf be a metric space of states, let A\Asf be a metric space of actions, let r ⁣:X×ARr \colon \Xsf \times \Asf \to \RR be a bounded measurable reward function, let PP be a stochastic kernel on X\Xsf given X×A\Xsf \times \Asf (see Section A.5.4), and let β(0,1)\beta \in (0,1). A policy is a measurable map σ ⁣:XA\sigma \colon \Xsf \to \Asf and Σ\Sigma denotes the set of all feasible policies.

Let D1(R)\dD_1(\RR) denote the set of Borel probability measures on R\RR with finite first moment. A distributional value function is a stochastic kernel η\eta from X\Xsf to R\RR (see Section A.5.4) with η(x,)D1(R)\eta(x, \cdot) \in \dD_1(\RR) for each xXx \in \Xsf; it assigns a return distribution η(x,)\eta(x, \cdot) to each state xx. We let H\hH denote the set of all distributional value functions η\eta satisfying the uniform bound

supxXzη(x, ⁣dz)<.\sup_{x \in \Xsf} \int |z| \, \eta(x, \diff z) < \infty.

This condition ensures that the metric introduced in Section 2.3.4.2 below is well-defined.

Given ηH\eta \in \hH, we write η(x,h)h(z)η(x, ⁣dz)\eta(x, h) \coloneq \int h(z) \, \eta(x, \diff z) for bounded measurable h ⁣:RRh \colon \RR \to \RR. The space H\hH is equipped with the pointwise stochastic dominance order, which we denote by \trianglelefteq and define by ηη\eta \trianglelefteq \eta' if η(x,)Fη(x,)\eta(x, \cdot) \lefsd \eta'(x, \cdot) for every xXx \in \Xsf (see Section A.5.5).

We now define the distributional analogue of the scalar policy operator TσT_\sigma. In the scalar case,

(Tσv)(x)=rσ(x)+βv(x)Pσ(x, ⁣dx),(T_\sigma \, v)(x) = r_\sigma(x) + \beta \int v(x') \, P_\sigma(x, \diff x'),

where rσ(x)r(x,σ(x))r_\sigma(x) \coloneq r(x, \sigma(x)) and Pσ(x,)P(x,σ(x),)P_\sigma(x, \cdot) \coloneq P(x, \sigma(x), \cdot). The distributional version replaces the scalar continuation value v(x)v(x') with a random draw from the distribution η(x,)\eta(x', \cdot), and replaces addition and scalar multiplication with the corresponding operations on distributions.

As a first step, given ηH\eta \in \hH, we define the continuation kernel PσηP_\sigma \otimes \eta from X\Xsf to R\RR by

(Pση)(x,B)η(x,B)Pσ(x, ⁣dx)(BB(R)).(P_\sigma \otimes \eta)(x, B) \coloneq \int \eta(x', B) \, P_\sigma(x, \diff x') \qquad (B \in \bB(\RR)).

Here (Pση)(x,)(P_\sigma \otimes \eta)(x, \cdot) is a distributional analog of the continuation value: the distribution of the random value obtained by drawing xPσ(x,)x' \sim P_\sigma(x, \cdot) and then drawing from η(x,)\eta(x', \cdot).

Given a policy σΣ\sigma \in \Sigma, the distributional policy operator Dσ ⁣:HHD_\sigma \colon \hH \to \hH is defined by

(Dση)(x,h)h(rσ(x)+βv)(Pση)(x, ⁣dv)(D_\sigma \, \eta)(x, h) \coloneq \int h(r_\sigma(x) + \beta v) \, (P_\sigma \otimes \eta)(x, \diff v)

for bounded measurable h ⁣:RRh \colon \RR \to \RR. Here hh should be understood as a test function that we use to characterize the distribution (Dση)(x,)(D_\sigma \, \eta)(x, \cdot), which is the law of rσ(x)+βVr_\sigma(x) + \beta V when V(Pση)(x,)V \sim (P_\sigma \otimes \eta)(x, \cdot): today’s reward plus a discounted draw from the continuation distribution. This mirrors the scalar recursion. Expanding the continuation kernel, (2.48) can be equivalently written as

(Dση)(x,h)=[h(rσ(x)+βv)η(x, ⁣dv)]Pσ(x, ⁣dx).(D_\sigma \, \eta)(x, h) = \int \left[ \int h(r_\sigma(x) + \beta v) \, \eta(x', \diff v) \right] P_\sigma(x, \diff x').

Since each DσD_\sigma is an order preserving self-map on H\hH, the pair (H,TDDP)(\hH, \TT_{\rm DDP}) is an ADP, where TDDP{Dσ:σΣ}\TT_{\rm DDP} \coloneq \setntn{D_\sigma}{\sigma \in \Sigma}. The value space is H\hH (distributional value functions), the partial order is pointwise stochastic dominance, and the policy operators are {Dσ}\{D_\sigma\}.

2.3.4.2Well-Posedness

As in the scalar MDP case, well-posedness follows from a contraction argument. To state it, we need a metric on distributions. The Wasserstein-1 distance between μ,νD1(R)\mu, \nu \in \dD_1(\RR) is

W1(μ,ν)suphLip1[h ⁣dμh ⁣dν],W_1(\mu, \nu) \coloneq \sup_{\|h\|_{\rm Lip} \leq 1} \left[ \int h \, \diff\mu - \int h \, \diff\nu \right],

where the supremum is over all 1-Lipschitz functions h ⁣:RRh \colon \RR \to \RR. We equip H\hH with the supremum Wasserstein metric

dˉ1(η,η)supxXW1(η(x,),η(x,)).\bar d_1(\eta, \eta') \coloneq \sup_{x \in \Xsf} W_1(\eta(x, \cdot), \eta'(x, \cdot)).

Under this metric, H\hH is a complete metric space.

By the Banach contraction mapping theorem, each DσD_\sigma has a unique fixed point ησH\eta_\sigma \in \hH. Hence (H,TDDP)(\hH, \TT_{\rm DDP}) is well-posed. Moreover, since each DσD_\sigma is a contraction and hence globally stable on H\hH, Lemma A.5.19 implies that (H,TDDP)(\hH, \TT_{\rm DDP}) is order stable.

2.3.4.3Identification of the Fixed Point

Let (Xt)t0(X_t)_{t \geq 0} be the PσP_\sigma-Markov chain with X0=xX_0 = x and define the discounted return

Zσ(x)t=0βtr(Xt,σ(Xt)).Z_\sigma(x) \coloneq \sum_{t=0}^\infty \beta^t r(X_t, \sigma(X_t)).

In the scalar case, the fixed point of TσT_\sigma is the expected-return function vσ(x)=EZσ(x)v_\sigma(x) = \EE Z_\sigma(x). What is the distributional analogue? The fixed point of DσD_\sigma should be the distribution of Zσ(x)Z_\sigma(x). Since rr is bounded, say rM|r| \leq M, we have Zσ(x)M/(1β)|Z_\sigma(x)| \leq M/(1-\beta) almost surely, so the distribution of Zσ(x)Z_\sigma(x) lies in D1(R)\dD_1(\RR).

Thus, while the scalar σ\sigma-value function vσ(x)=E[Zσ(x)]v_\sigma(x) = \EE[Z_\sigma(x)] records only the mean of the random return, the distributional σ\sigma-value function ησ(x,)\eta_\sigma(x, \cdot) records its full distribution---including variance, quantiles and tail behavior. The scalar value function is recovered as a special case: vσ(x)=zησ(x, ⁣dz)v_\sigma(x) = \int z \, \eta_\sigma(x, \diff z).

Figure 2.4 illustrates the action of the distributional policy operator on the optimal savings model from Section 1.3, using the policy σ\sigma computed in Figure 1.14. The left panel shows Dσ5η0D_{\sigma}^5 \eta_0, an early iterate starting from the initial condition η0(x,)=δ0\eta_0(x, \cdot) = \delta_0 for all xx. The right panel shows the converged distributional value function ησ\eta_{\sigma}, which assigns to each state xx the full distribution of the discounted return Zσ(x)Z_{\sigma}(x). The iterations were implemented using the categorical projection method of Bellemare et al. (2017): the return axis is discretized into a fixed grid of atoms, and at each step the affine shift zrσ(x)+βzz \mapsto r_\sigma(x) + \beta z is projected back onto this grid via linear interpolation of probability mass. The mean of ησ(x,)\eta_{\sigma}(x, \cdot) at each xx recovers the scalar value function vσv_\sigma shown in Figure 1.14.

Iterating D_\sigma: early iterate (left) and converged \eta_\sigma (right)

Figure 2.4:Iterating DσD_\sigma: early iterate (left) and converged ησ\eta_\sigma (right)

2.3.4.4Regularity

One complication, in terms of developing a theory of distributional dynamic programming, is failure of regularity. Even when the state and action spaces are finite, greedy policies typically fail to exist. To see why, observe that, in the present setting, a policy σ\sigma is η\eta-greedy when

h(rτ(x)+βv)(Pτη)(x, ⁣dv)h(rσ(x)+βv)(Pση)(x, ⁣dv)\int h(r_\tau(x) + \beta v) \, (P_\tau \otimes \eta)(x, \diff v) \leq \int h(r_\sigma(x) + \beta v) \, (P_\sigma \otimes \eta)(x, \diff v)

for all xXx \in \Xsf, all τΣ\tau \in \Sigma and all hibRh \in ib\RR. A natural approach to this problem is to solve

maxa[h(r(x,a)+βv)η(x, ⁣dv)P(x,a, ⁣dx)]\max_a \int \left[ \int h(r(x,a) + \beta v) \, \eta(x', \diff v) P(x, a, \diff x') \right]

at each xx, and produce a policy σ\sigma from the correspondence of maximizers. The problem with this idea is that, in most cases, the solution will depend on hh. For example, if hh is concave with high curvature then optimal choices will avoid risk. If hh is linear, optimal choices will ignore risk. This makes it very difficult to attain (2.58) for all hh in ibRib\RR.

Without regularity, the core optimization loop of dynamic programming---solve the Bellman equation, then extract a greedy policy---breaks down at the second step. In the reinforcement learning literature, practitioners who use distributional methods typically select actions using the mean of the return distribution---which amounts to standard scalar greediness---while exploiting the distributional representation to improve function approximation and learning dynamics.

At a deeper level, the failure of regularity reflects the fact that, in order to set up a well-defined criterion for control, one must first commit to how risk is valued. This commitment takes the form of a specific nonlinear aggregator KK applied period-by-period, as described in Section 1.1.3.6 and Section 1.2.3.2 (see also the discussion of certainty equivalents and risk preferences in Section 7.2.4). Such a commitment collapses the distributional object back to a scalar recursion, restoring regularity and the full apparatus of dynamic programming. This is the approach adopted throughout the remainder of this book.

2.3.5LQ Control

LQ control is a major sub-field of dynamic programming, routinely applied to problems in engineering, economics, operations research and elsewhere. In this section we describe a canonical LQ problem and show how it can be solved using ADP methods. Rather than aiming to provide new results, we plan to show that LQ problems can be cleanly handled by the theory provided above, instead of requiring specialized machinery.

2.3.5.1Description

We consider a deterministic undiscounted linear-quadratic (LQ) control problem, defined as a tuple (Q,R,A,B)(Q, R, A, B), where

The objective of the LQ problem is to solve

min(ut)t0[xtQxt+utRut]\min_{(u_t)} \sum_{t \geq 0} \left[ x_t^\top Q x_t + u_t^\top R u_t \right]

subject to

xt+1=Axt+But for all t0.x_{t+1} = A x_t + B u_t \quad \text{ for all } t \geq 0.

The vector xtRkx_t \in \RR^k is called the state variable and utRmu_t \in \RR^m is called the action or control. Note that xtQxt+utRut0x_t^\top Q x_t + u_t^\top R u_t \geq 0 for all tt, so the infinite sum (2.60) takes values in [0,][0, \infty]. The Bellman equation takes the form

(x)=minuRm{xQx+uRu+(Ax+Bu)}for all xRk.\ell(x) = \min_{u \in \RR^m} \left\{ x^\top Q x + u^\top R u + \ell(Ax + B u) \right\} \quad \text{for all } x \in \RR^k.

Since the solution to this problem is well-known, we will not seek new results. Rather, our aim is to illustrate how we can embed the LQ problem into the ADP framework and recover existing results in a relatively straightforward way.

2.3.5.2Preamble: LQ Background

Before we go further, let’s set up a framework for working with LQ problems and note down some standard results from the literature. In what follows, A,B,R,A, B, R, and QQ are as described in Section 2.3.5.1.

We let

MN    NMP.M \preccurlyeq N \quad \iff \quad N - M \in \pP.

We use 0 to denote the zero element of Rk×k\RR^{k \times k}, so that PP is in P\pP if and only if 0P0 \preccurlyeq P.

Also, we define

R(P)=APAAPB(BPB+R)1BPA+Q,and\bR(P) = A^\top P A - A^\top P B(B^\top P B + R)^{-1} B^\top P A + Q, \quad \text{and}
F(P)=(BPB+R)1BPA.\bF (P) = - (B^\top P B + R)^{-1} B^\top P A .

Below we will connect the Riccati map to the Bellman operator for this problem and the control gain map will help us select decision rules. The fixed point equation P=R(P)P = \bR(P) is called the Riccati equation. Note that the inverse in (2.64) exists because RR is positive definite and PP is positive semidefinite, implying that BPB+RB^\top P B + R is also positive definite.

The next exercise is a well-known result and the proof is not trivial.

The next lemma shows that the control gain map selects matrices that are akin to “min-greedy policies,” although we need to be a bit careful with that terminology, since we also want our policies to be stable (as clarified below).

In stating the next result we let CC be such that CC=QC^\top C = Q. We refer to Bertsekas (2012) for the definitions of observability and controllability.

2.3.5.3Policies

In the LQ setting, a control matrix is any FRm×kF \in \RR^{m \times k}. Under a given control matrix FF, the current control obeys ut=Fxtu_t = F x_t and the update rule for the state is xt+1=Axt+BFxtx_{t+1} = A x_t + B F x_t. Hence the state evolves according to xt=(A+BF)tx0x_t = (A + BF)^t x_0. Following (2.60), the lifetime cost of following FF, starting at initial condition x0Rkx_0 \in \RR^k, is

F(x0)t=0xt(FRF+Q)xt   with xt=(A+BF)tx0.\ell_F (x_0) \coloneq \sum_{t=0}^\infty x_t^\top \left( F^\top R F + Q \right) x_t \; \text{ with } x_t = (A + BF)^t x_0.
The function \ell_F for different choices of F

Figure 2.5:The function F\ell_F for different choices of FF

Returning to the general case, finite lifetime costs require driving the state to zero fast enough for the sum (2.71) to converge. In this connection, extending the condition A+BF<1|A + BF| < 1 from the one-dimensional example, a control matrix FF is called stable if the spectral radius condition ρ(A+BF)<1\rho(A + BF) < 1 holds.

2.3.5.4From LQ to ADP

We wish to produce an ADP representation of the LQ problem. To this end, we set

TF(P)=Q+FRF+(A+BF)P(A+BF).\bT_F (P) = Q + F^\top R F + (A + BF)^\top P (A + BF) .

To match earlier ADP terminology, a stable control matrix is also referred to as a policy.

It follows from Exercise 2.23 that (P,T)(\pP, \TT) is an ADP.

2.3.5.5Interpretation and Properties

The fixed point equation P=TF(P)P = \bT_F(P) can be interpreted as a recursive equation for lifetime cost under policy FF. To see this, suppose P=TF(P)P = \bT_F(P) and set (x)xPx\ell(x) \coloneq x^\top P x. Using (2.73) and a bit of algebra, you will be able to confirm that this function \ell obeys the recursion

(x)=xQx+(Fx)R(Fx)+((A+BF)x).\ell(x) = x^\top Q x + (Fx)^\top R (Fx) + \ell((A + BF)x).

The right-hand side equals the current state cost xQxx^\top Q x, plus the current action cost (Fx)R(Fx)(Fx)^\top R (Fx), plus the lifetime cost from the next state (A+BF)x(A + BF)x. This is exactly the recursive structure of lifetime cost under policy FF.

Comparing (2.76) with (2.71), we see that xPFx=F(x)x^\top P_F \, x = \ell_F(x) for all xRkx \in \RR^k. Hence PFP_F is the matrix representation of the lifetime cost function.

2.3.5.6Min-Greedy Policies

Fixing PPP \in \pP, the ADP definition of min-greedy policies (page ) tells us that FΣF \in \Sigma is PP-min-greedy if and only if TF(P)TG(P)\bT_F(P) \preccurlyeq \bT_G(P) for all GΣG \in \Sigma. The next exercise is useful for characterizing min-greedy policies.

In obtaining optimality results, one problem we have is that (P,T)(\pP, \TT) is not always regular. On the one hand, if we take an arbitrary PPP \in \pP and then calculate the control gain matrix F=F(P)F = \bF(P), Exercise 2.24 assures us that we get the “min-greedy” property TF(P)TG(P)\bT_F(P) \preccurlyeq \bT_G(P). On the other hand, we have no guarantee that FF is actually in Σ\Sigma. In particular, FF might not be a stable control matrix. For this reason, we introduce a smaller set PSP\pP_S \subset \pP of all positive semidefinite matrices such that control gain F=F(P)F = \bF(P) is stable. That is, we set

PS{PP:ρ(A+BF(P))<1}={PP:F(P)Σ}.\pP_S \coloneq \setntn{P \in \pP}{\rho(A + B \bF(P))<1} = \setntn{P \in \pP}{\bF(P) \in \Sigma}.

With this definition, the following lemma is immediate from Exercise 2.24.

Note that PS\pP_S is not necessarily closed under the policy operators TF\bT_F, so we cannot take (PS,T)(\pP_S, \TT) as an ADP.

2.3.5.7The Bellman Equation

From the definition in Section 2.1.1.3, the Bellman min-operator associated with the ADP (P,T)(\pP, \TT) is defined by

T(P)=FΣTF(P)     whenever the infimum exists.\bT (P) = \bigwedge_{F \in \Sigma} \bT_F (P) \;\; \text{ whenever the infimum exists}.

If PPSP \in \pP_S and F=F(P)F = \bF(P), then, by Lemma 2.3.7, FF is PP-min-greedy. Hence T(P)=TF(P)\bT (P) = \bT_F (P) (see Lemma 2.1.1), so

T(P)=Q+FRF+(A+BF)P(A+BF)whenF=F(P).\bT(P) = Q + F^\top R F + (A + BF)^\top P (A + BF) \quad\text{when} \quad F = \bF(P).

Recalling Exercise 2.20, this means that

The next lemma connects this to the Riccati equation and another version of the LQ Bellman equation.

2.3.5.8Optimality

Suppose that (A,B)(A, B) is controllable and (A,C)(A, C) is observable. Since T\bT and R\bR agree on PS\pP_S, Lemma 2.3.4 tells us that T\bT has a fixed point PP^* in PS\pP_S. We call PP^* the minimum loss matrix. Let PΣ\pP_\Sigma be the set of lifetime costs, so that

PΣ={P=PF:FΣ}.\pP_\Sigma = \setntn{P = P_F}{F \in \Sigma}.

Applying Theorem 2.2.9 and the fact that (P,T)(\pP, \TT) is order stable (Lemma 2.3.6), we obtain the following optimality results:

  1. PP^* is the least element of (PΣ,)(\pP_\Sigma, \preccurlyeq),

  2. PP^* obeys the Bellman min-equation TP=P\bT P^* = P^*, and

  3. a policy FF is min-optimal for (P,T)(\pP, \TT) if and only if FF is PP^*-min-greedy.

We can translate (a)--(c) into more familiar optimality results for LQ problems. For example, (b) combined with Lemma 2.3.9 tells us that

(x)=minuRm{xQx+uRu+(Ax+Bu)}(xRk),\ell^*(x) = \min_{u \in \RR^m} \left\{ x^\top Q x + u^\top R u + \ell^*(Ax + B u) \right\} \quad (x \in \RR^k),

where \ell^* is defined by (x)=xPx\ell^*(x) = x^\top P^* x for all xx. Moreover, if we now set F=F(P)F = \bF(P^*), then, by Lemma 2.3.7, the policy FF is PP^*-min-greedy. Hence, by (c), FF is min-optimal.

2.4Chapter Notes

This chapter is based on the abstract dynamic programming framework of Sargent & Stachurski (2025), which builds on theory found in Denardo (1967), Bertsekas (1977), Verdu & Poor (1987), Szepesvari (1998), Kamihigashi (2014), Bertsekas (2017), Li & Xie (2021), and, in particular, Bertsekas (2022). The paper by Sargent & Stachurski (2025) adds a layer of abstraction over these earlier frameworks by shifting analysis to families of policy operators on partially ordered sets. Doing so makes it possible treat a wider class of problems and generate new results, as discussed in later chapters.

Other precursors to the framework described in this chapter include Porteus (1975) and Kreps & Porteus (1977), who pioneered the use of order-theoretic methods to extend dynamic programming optimality theory beyond the standard expected discounted reward criterion. (An overview of this line of work appears in Appendix 6 of Kreps (2013).) Their operator-based approach accommodates non-additive objectives, such as expected utility criteria, risk-sensitive preferences and stochastic games. Building on this foundation, Kreps & Porteus (1979) showed that non-additive recursive preferences are amenable to dynamic programming, work that inspired the recursive utility framework of Epstein & Zin (1989).

The discussion of distributional dynamic programming in Section 2.3.4 builds on Bellemare et al. (2017). The objective of this line of work is to replace scalar value functions with return distributions, enabling agents to reason about risk, variability, and higher-order statistics of outcomes. The theoretical foundations have been extended by several authors, including Dabney et al. (2018), Rowland et al. (2018), Bäuerle et al. (2025), Marthe (2026), and Bäuerle & Vasileiadis (2026). Distributional methods have found application in risk-sensitive control, robotics, and finance.

Linear-quadratic optimal control theory was developed during the late 1950s and 1960s, when the Riccati equation emerged as a key tool for computing optimal feedback policies. These methods have been widely applied in economics. Sargent (1987) provides an early treatment connecting LQ foundations to economic modeling. Hansen & Sargent (1980) showed how to formulate and estimate dynamic linear rational expectations models using LQ methods. The risk-sensitive extension to linear-exponential-quadratic-Gaussian (LEQG) control, developed by Whittle (1981) and adapted for discounted problems in economics by Hansen & Sargent (1995), provides a bridge between LQ control and robust decision-making under model uncertainty; see Hansen & Sargent (2011) for a comprehensive treatment. LQ methods underpin much of macroeconomic modeling, including the analysis of fiscal policy, monetary policy, and business cycle dynamics. See Hansen & Sargent (2013) and Ljungqvist & Sargent (2018) for numerous applications.

Research at the intersection of nonlinear dynamics and LQ control is currently quite active. One of the key ideas is to approximate nonlinear systems with very high dimensional linear systems, and then to approximate those linear systems via singular value decomposition. For further discussion of these topics, see Kutz et al. (2016) or Brunton & Kutz (2019).

Footnotes
  1. In general, designating a vv-greedy policy for all vVv \in V requires the Axiom of Choice. In practice, many applications induce some structure on the policy set that can be used to produce simple selection mechanisms.

References
  1. Puterman, M. L. (2005). Markov decision processes: discrete stochastic dynamic programming. Wiley Interscience.
  2. Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. International Conference on Machine Learning, 449–458.
  3. Bertsekas, D. (2012). Dynamic programming and optimal control (Vol. 1). Athena Scientific.
  4. Sargent, T. J., & Stachurski, J. (2025). Dynamic Programs on Partially Ordered Sets. SIAM Journal on Optimization and Control, in press.
  5. Denardo, E. V. (1967). Contraction Mappings in the Theory Underlying Dynamic Programming. SIAM Review, 9(2), 165–177.
  6. Bertsekas, D. P. (1977). Monotone mappings with application in dynamic programming. SIAM Journal on Control and Optimization, 15(3), 438–464.
  7. Verdu, S., & Poor, H. V. (1987). Abstract dynamic programming models under commutativity conditions. SIAM Journal on Control and Optimization, 25(4), 990–1006.
  8. Szepesvari, C. (1998). Non-Markovian policies in sequential decision problems. Acta Cybernetica, 13(3), 305–318.
  9. Kamihigashi, T. (2014). Elementary results on solutions to the Bellman equation of dynamic programming: existence, uniqueness, and convergence. Economic Theory, 56, 251–273.
  10. Bertsekas, D. P. (2017). Regular policies in abstract dynamic programming. SIAM Journal on Optimization, 27(3), 1694–1727.
  11. Li, X., & Xie, L. (2021). Online Abstract Dynamic Programming with Contractive Models.
  12. Bertsekas, D. P. (2022). Abstract dynamic programming (3rd ed.). Athena Scientific.
  13. Porteus, E. L. (1975). On the optimality of structured policies in countable stage decision processes. Management Science, 22(2), 148–157.
  14. Kreps, D. M., & Porteus, E. L. (1977). On the optimality of structured policies in countable stage decision processes. II: Positive and negative problems. SIAM Journal on Applied Mathematics, 32(2), 457–466.
  15. Kreps, D. M. (2013). Microeconomic foundations (Vol. 1). Princeton University Press.