One of the principal objects of theoretical research in any department of knowledge is to find the point of view from which the subject appears in its greatest simplicity.
---Josiah Willard Gibbs
Having reviewed some applications in Chapter 1, our next aim is to present a general theory of dynamic programming that includes these applications as special cases and extends to many new problems.
The framework developed in this chapter leads with order theory rather than metric structure. The reason is that contraction-based arguments on Banach spaces, the standard foundation for dynamic programming, are fragile. Many natural variants of the textbook setting --- nonstandard discounting, nonlinear period aggregators, nonstandard value spaces, and others --- break it. As suggested by the examples in Chapter 1, contraction is the exception rather than the rule once one moves outside the additively separable, expected-reward template.
What does unify these settings is that policy operators are order preserving and that lifetime values are characterized as fixed points of those operators on partially ordered sets. This chapter develops the optimality and convergence theory those ingredients support. This investment will pay off in later chapters: subsequent theory uses this foundation to establish, for many classes of models, the same fundamental optimality results and algorithmic convergence guarantees one expects from classical theory, even without contraction-type properties. The classical theory itself is recovered as a special case: the firm problem of Section 1.1, the finite MDP of Section 1.2, and the optimal savings problem of Section 1.3 all reappear as concrete instances.
A second feature of the framework is methodological. Order-theoretic proofs are almost entirely algebraic: each result reduces to systematic application of a small number of named hypotheses, with few analytic prerequisites. This setup pays off in two ways. It eases human analysis --- a proof in this chapter typically needs nothing beyond applications of definitions and chains of inequalities --- and it makes the theory a natural target for machine-assisted reasoning.
We begin with definitions and basic properties. Some readers will find it helpful to review the main order-theoretic concepts in Section A.1.2 before proceeding further.
The optimality results given in this chapter are high-level. We use them mostly as inputs for downstream optimality theory, rather than for applications per se. (In particular, from Chapter 3 on, we leverage findings from this chapter to construct downstream results that are more easily applied to specific problems.) When presenting this high-level theory, we refrain from giving examples until Section 2.3, hoping that readers have reviewed Chapter 1, or otherwise have strong backgrounds in core concepts of dynamic programming. Such readers will be able to appreciate the practical value of the abstract framework to be presented here.
We begin by defining abstract dynamic programs and listing some of their properties, including the Bellman equation and Bellman operator. We then state conditions under which essential optimality properties hold.
Each operator Tσ in T is called a policy operator.
Σ is an arbitrary index set and elements of Σ will be referred to as policies.
In applications we will impose conditions under which each Tσ has a unique fixed point in V. In these settings, the significance of Tσ is that its fixed point represents the lifetime value of following policy σ. We will denote the unique fixed point of Tσ by vσ and call it the σ-value function.
Figure 2.1 illustrates a simple ADP with V=R (paired with the usual order) and Σ={σ,σ′,σ′′}. Each policy operator is an affine self-map on R, and each has a unique fixed point: vσ, vσ′, and vσ′′, respectively.
In Chapter 1, the concept of greedy policies (see (1.97)) played a key role. It will also play a key role here, in our ADP framework. The definition is as follows.
As shown below, Definition 2.1.2 generalizes the notion of greedy policies for the problems considered in Chapter 1. Throughout this book we let
Figure 2.2 illustrates the v-greedy concept for the ADP from Figure 2.1, evaluated at the marked point v. The three values Tσv, Tσ′v, Tσ′′v are indicated: here Tσ′v is the largest, so σ′ is the v-greedy policy.
In applications, solving for a v-greedy policy is often straightforward. Moreover, there exist conditions under which solving the overall problem reduces to solving for a v-greedy policy with a “correct” choice of v. We pursue this idea below.
We now list properties useful for ADP optimality theory. First, we call (V,T)
well-posed if each Tσ∈T has a unique fixed point in V.
Well-posedness is a minimal precondition for a coherent dynamic program: to maximize lifetime values, we need them to be well-defined. When well-posedness holds we will always use vσ to denote the unique fixed point of Tσ.
We call (V,T)
finite if T is a finite set.
Finiteness holds for the finite MDPs described in Section 1.2, as well as other related settings with finite states and actions. Not surprisingly, finite ADPs have attractive optimality properties.
We call (V,T)
regular if VG=V; that is, if a v-greedy policy exists for all v∈V.
Regularity helps us to construct algorithms and to obtain existence of optimal policies. Since V=VG under regularity, Lemma 2.1.1 implies that T is well-defined on all of V whenever this property holds. We focus primarily on regular ADPs.
We call (V,T)
order stable if each Tσ∈T is order stable on V.
Recalling the definitions in Section A.1.2.10, this means that (V,T) is well-posed and, for each σ∈Σ and v∈V,
v⪯Tσv implies v⪯vσ, and
Tσv⪯v implies vσ⪯v.
Most proofs in this chapter use only properties (a) and (b). Occasionally we will need monotone convergence of the iterates, in which case we will assume that (V,T) is
strongly order stable if each Tσ∈T is strongly order stable on V.
As per the discussion in Section A.1.2.10, this implies that each Tσ is order stable and, in addition, v⪯Tσv implies Tσnv↑vσ and Tσv⪯v implies Tσnv↓vσ.
A Bellman equation that plays a central role in optimality theory for the Section 1.3.2.1 optimal savings problem is again key for ADPs. Here we define the ADP Bellman equation and note some preliminary observations. Throughout this section, (V,T) is an ADP with policy set Σ.
In (2.3), the supremum is taken over all σ∈Σ. There is, in general, no guarantee that the supremum exists.
Evidently v∈V satisfies the Bellman equation if and only if Tv exists and Tv=v.
Figure 2.3 illustrates the Bellman operator for the simple ADP introduced earlier. At each v, the value Tv is the largest of Tσv, Tσ′v, Tσ′′v: T is the upper envelope of the three policy operators. The fixed point of T is vσ′′, which is also the largest of the three σ-value functions and hence the value function v∗.
We often refer to the following three subsets of the value space V, the first of which was already introduced in (2.2) and is repeated here for convenience:
In this subsection we define optimality for ADPs, connect it to the Bellman equation, and state the fundamental optimality properties. We then give sufficient conditions for these properties to hold.
We say that a policy σ∈Σ is optimal for (V,T) if vσ is a greatest element of VΣ. In other words, σ is optimal if it attains the “highest possible” lifetime value.
Perhaps the most important aspect of the theory of dynamic programming is the link between optimality and the Bellman equation. To clarify this link we introduce some terminology. Let (V,T) be a well-posed ADP and set
Throughout this section, (V,T) is a well-posed ADP.
It is important to emphasize here that (B1)--(B3) are not independent. Indeed, (B2) implies both (B1) and (B3), as shown in the next proposition.
Despite the fact that (B2) implies both (B1) and (B3), we have stated them together because, in terms of applications, all three parts are significant. (B1) tells us that the problem at hand has a solution. (B3) implies that a solution can be computed by taking a v∗-greedy policy, and that any other optimal policy must also be v∗-greedy. Finally, (B2) provides a restriction that can help us calculate v∗. Provided that we search in VG, any fixed point of T is equal to v∗.
Proposition 2.1.4 tells us our main goal is to construct conditions under which v∗ exists and is the unique fixed point of T in VG. We begin this task in Section 2.2.2.
The next exercise generalizes a well-known result from more traditional dynamic programming frameworks (see, e.g., Puterman (2005), Theorem 6.2.6).
In this section we investigate the following question: When does existence of a solution to the Bellman equation imply the fundamental optimality properties (B1)--(B3)? We will see that order stability is useful here. The results stated in this section should be understood as intermediate inputs: we use them in downstream theorems that are tuned to applications.
In the statement of the next result, (V,T) is an ADP and T is the Bellman operator.
The next result is almost a corollary of Theorem 2.1.5. It imposes a strong order completeness condition on the value space. (See Section A.5.1 for background on order completeness.)
While the chain completeness assumption is strong, Theorem 2.1.7 is nonetheless significant: It says that, at least for this idealized setting, regular well-posed ADPs possess all the optimality properties that we seek.
What drives this strong result? The definition of ADPs is itself doing some heavy lifting: it requires that lifetime values are inherently recursive and reflect order structure (being fixed points of order-preserving operators). Bellman’s fundamental ideas work smoothly in this setting when we add some regularity.
In this section we introduce algorithms for solving ADPs---including value function iteration, Howard policy iteration and optimistic policy iteration---and establish conditions under which they converge. We treat both maximization and minimization, extending the latter via the theory of dual ADPs.
In this section we discuss algorithms for dynamic programming and present some preliminary results. The three algorithms we consider are value function iteration (VFI), Howard policy iteration (HPI) and optimistic policy iteration (OPI). These algorithms generalize the ones presented for the finite MDP case in Section 1.2.1.3.
As a first step, we introduce two operators. Throughout Section 2.2.1.1 we take (V,T) to be a fixed ADP. As usual, when (V,T) is well-posed, the unique fixed point of Tσ∈T in V is denoted by vσ.
(Here and below, the dependence of W on m is often suppressed to simplify notation.)
Like the Bellman operator T, the map W is well-defined on all of V when (V,T) is regular. The Howard policy operator H is well-defined on all of V when (V,T) is well-posed and regular. Note that, for both of these maps, we always select the same v-greedy policy when applying them to some fixed v.[1]
Below we will associate VFI, OPI, and HPI with fixing a v∈V and then iteratively applying the operators T, W, and H respectively. A small amount of thought will convince you that, for the optimal savings ADP described in Section 2.3.2, these iterative procedures coincide with our earlier description of VFI, HPI, and OPI from Section 1.2.1.3.
The next lemma collects useful results for the operators introduced above. In the statement, (V,T) is an ADP with Howard policy operator H, optimistic policy operator W and Bellman operator T. We assume regularity, so that VG=V.
The next lemma adds order stability and derives additional implications.
In this section we introduce conditions for optimality of ADPs and convergence of algorithms that are well-posed and regular. These high-level results will later be used as inputs for lower level results that are more straightforward to check in applications. In each case, we will aim for the fundamental optimality properties, as given on page , and the convergence of the three major algorithms, discussed on page .
In applications we often deal with dynamic programs that have finitely many states and actions. This finiteness implies that the set of feasible policies is finite. The next result deals with this case.
Next we state a result for the chain complete setting that extends Theorem 2.1.7. It shows that adding strong order stability is enough to provide convergence of all algorithms.
2.2.2.3Case III: Countably Dedekind Complete Value Space¶
The result in this section replaces chain completeness with countable Dedekind completeness and order continuity. To state the result we need two new definitions. We call (V,T)
order continuous if each Tσ∈T is order continuous on V.
Order continuity means that Tσvn↑Tσv whenever σ∈Σ and vn↑v. This technical condition holds in many of the applications we consider. A general discussion of order continuous operators is provided in Section A.5.1.3.
In addition, we call (V,T)
order bounded if there exists a u∈V with Tσu⪯u for all Tσ∈T.
Now we can state the main result of this section.
Table 2.1 summarizes the conditions and conclusions for the three cases treated above.
Table 2.1:Conditions and conclusions for the three convergence cases
In some dynamic programs, the objective is to minimize lifetime cost of a given policy, rather than maximizing rewards. While we focus primarily on maximization in this book, the present section discusses how to handle minimization problems. The content of this section can be summarized by the following statement: for a given ADP (V,T), a minimization problem can be converted to a maximization problem by reversing the partial order on V. Further details are given below. Readers who prefer to focus on maximization results can safely skip ahead.
Now suppose (V,T) is well-posed and let VΣ be defined as before (i.e., the set of σ-value functions). In this setting we set v▽∗=⋀σvσ and call it the min-value function whenever the infimum exists. Also, we say that
σ∈Σ is min-optimal for (V,T) if vσ=v▽∗, and
(V,T) obeys Bellman’s principle of min-optimality if
σ∈Σ is min-optimal for (V,T)⟺σ is v▽∗-min-greedy.
We say that the fundamental min-optimality properties hold if
at least one min-optimal policy exists,
v▽∗ exists and is the unique solution to the Bellman min-equation in V▽G, and
Bellman’s principle of min-optimality holds.
This definition parallels (B1)--(B3) from page .
Let VD be all v∈V▽G with T▽v⪯v. We say that
min-VFI converges if T▽nv↓v▽∗ for all v∈VD,
min-OPI converges if W▽nv↓v▽∗ for all v∈VD and all m∈N, and
min-HPI converges if H▽nv↓v▽∗ for all v∈VD.
To further increase clarity, when discussing maximization and minimization in the same section, we add a “max-” prefix to the previously introduced definitions that pertain to maximization. For example,
“v-greedy policies” will be referred to as v-max-greedy policies,
“optimal policies” will be referred to as max-optimal policies,
“the Bellman equation” will be referred to as the Bellman max-equation,
Let’s now investigate how minimization problems can be converted to maximization problems in this abstract setting. The key idea is that taking infima in (V,⪯) corresponds to taking suprema in the order-dual (V,⪯∂).
Recall from Section A.1.2.5 that if V:=(V,⪯) is a partially ordered set, then its order dual V∂:=(V,⪯∂) is the set V paired with the partial order ⪯∂ obtained by setting u⪯∂v if and only if v⪯u. If (V,T) is any ADP, then we call (V,T)∂:=(V∂,T) the dual of (V,T). In other words, the dual (V,T)∂ of (V,T) is the ADP created by maintaining the same family of policy operators T while replacing the poset V with its order dual V∂.
Regarding notation for (V,T)∂,
the Bellman max-operator will be denoted by T∂,
the Bellman min-operator will be denoted by T▽∂,
the max-value function will be denoted by v∗∂,
VG∂ is all v∈V such that at least one v-max greedy policy exists for (V,T)∂,
etc.
Each ADP is self-dual, in the sense that ((V,T)∂)∂=(V,T). This follows from the fact that all partially ordered sets are self-dual.
Self-duality implies corollaries to Exercise 2.5 that we treat as self-evident. For example, σ∈Σ is min-optimal for (V,T)∂ if and only if σ is max-optimal for (V,T).
Part (viii) of Exercise 2.5 tells us that we can solve an ADP for a min-optimal policy by switching to the dual ADP and maximizing.
In this section we show how several important models fit into the ADP framework. Applications include firm valuation, optimal savings, finite Markov decision processes and linear-quadratic control.
The policy set Σ was defined as all Borel measurable functions mapping X to {0,1}. Let TFV:={Tσ:σ∈Σ} and let ⩽ be the pointwise partial order on bX.
The monotonicity in Exercise 2.8 implies that the pair (bX,TFV) is an ADP. We saw in Section 1.1.1.2 that each Tσ is a contraction map on bX. Hence (bX,TFV) is well-posed. As we discussed in Section 1.1.1.2, the unique fixed point vσ of Tσ has the interpretation of assigning lifetime values (expected present value of the firm) to states (initial conditions for the underlying Markov chain) under σ.
In Section 1.1 we introduced the Bellman operator T (see (1.16)) and the notion of greedy policies (see (1.14)). Let’s make sure that these agree with the new ADP definitions from this chapter. To begin, fix v∈bX. On one hand, in Section 1.1.1.2, we called σ∈Σv-greedy whenever
σ(x)∈a∈{0,1}argmax{as+(1−a)[π(x)+β∫v(x′)P(x,dx′)]}for all x∈X.
then σ is Borel measurable, since π and x↦∫v(x′)P(x,dx′) are both Borel measurable, and σ obeys (2.28). Given that we can always choose a σ∈Σ obeying (2.28), the firm ADP (bX,TFV) is regular.
The ADP definition of the Bellman operator presented in Section 2.1.1.3 is Tv=⋁σTσv. For our firm ADP (bX,TFV), this agrees with the original definition we gave in Section 1.1.1.3. Indeed, given that the ADP is regular, we can state that Tv=Tσv whenever σ is v-greedy (Lemma 2.1.1). We have just shown that any such σ obeys (2.28). Taking such a σ and applying Tv=Tσv yields
for all x∈X. This ADP construction of T agrees with the original definition we presented in (1.16).
It is possible to prove optimality results for the firm problem here, verifying and extending Theorem 1.1.1. For example, a proof can be constructed using Theorem 2.2.8. For now we’ll refrain from doing so. The reason is that we build additional ADP theory below, leveraging the framework set out in this chapter. Tackling specific applications will then become much easier.
Consider the optimal savings model from Section 1.3, with Assumption 1.3.1 in force. We can represent this model as an ADP by taking V:=bR+ as the value space, paired with the pointwise order ⩽, letting Σ be the set of (Borel measurable) feasible policies, as defined in Section 1.3.1.1, and setting TOS:={Tσ:σ∈Σ}, where each policy operator Tσ is as given in (1.83). It is straightforward to verify that each Tσ∈TOS is order preserving under ⩽, and Exercise 1.10 confirms that Tσ maps V to itself. Hence (V,TOS) is an ADP.
By Lemma 1.3.1, each policy operator Tσ has a unique fixed point (i.e., σ-value function) vσ. Consistent with the discussion in Section 2.1.1.1, the real number vσ(w) represents the lifetime value of policy σ, conditional on initial wealth state W0=w.
In (1.97) we defined the concept of a v-greedy policy for the optimal savings model. Earlier, in (2.1), we introduced the notion of a v-greedy policy for an arbitrary ADP. The second definition is a generalization of the first. Indeed, if σ obeys the optimal savings greedy condition (1.97) and τ is any other feasible policy, then
u(τ(w))+β∫v(R(w−τ(w))+y)ϕ(dy)⩽u(σ(w))+β∫v(R(w−σ(w))+y)ϕ(dy)for all w∈R+.
This is equivalent to the statement Tτv⩽Tσv for all τ∈Σ, which is, in the present setting, the ADP definition of v-greedy (given that the partial order is ⩽). Conversely, if σ∈Σ obeys Tτv⩽Tσv for all τ∈Σ, then it obeys (2.33). By appealing to Lemma 1.3.2, we can strengthen this to
σ(w)∈0⩽c⩽wargmax{u(c)+β∫v(R(w−c)+y)ϕ(dy)}for all w∈R+.
In other words, σ is v-greedy in the sense of Section 1.3.
In addition, the ADP Bellman operator for (V,TOS), as defined in (2.4), is a generalization of the optimal savings Bellman operator given in (1.99). To see this, let T=⋁σTσ be the ADP Bellman operator and fix v∈V. By (ii) of Lemma 2.1.1, we have Tv=Tσv whenever σ is v-greedy. Letting σ obey (2.34), which exists by Lemma 1.3.2, fixing w∈R+ and combining these facts, we get
This confirms the claim that, in the setting of (V,TOS), the ADP Bellman operator reduces to the optimal savings Bellman operator in (1.99).
In view of Lemma 1.3.2, a v-greedy policy exists for every v∈V. Hence (V,TOS) is regular. In Lemma 1.3.1 on page we showed that each Tσ∈TOS has a unique fixed point, so (V,TOS) is well-posed. The same lemma also showed that each policy operator in TOS is globally stable. Lemma A.5.19 now implies that (V,TOS) is order stable.
In Section 1.2 we introduced a finite MDP (Γ,r,β,P) with finite state space X and finite action space A. Now we show that this finite MDP can be framed as an ADP (RX,TMDP) and then discuss its properties.
We set TMDP:={Tσ:σ∈Σ}, where Σ is the feasible policies, and pair RX with the pointwise partial order ⩽. Since each Tσ∈TMDP is order preserving on (RX,⩽), the pair (RX,TMDP) is an ADP.
We proved in Exercise 1.4 that each Tσ∈TMDP has a unique fixed point in RX given by (I−βPσ)−1rσ. Hence (RX,TMDP) is well-posed. (RX,TMDP) is also order continuous. Indeed, in (RX,⩽), the statement vn↑v is equivalent to vn(x)↑v(x) in R for all x∈X (Lemma A.1.3). As a result, every continuous order preserving map is order continuous.
The ADP (RX,TMDP) is also order stable. Indeed, each Tσ has the form Tσv=rσ+βPσv, where Pσ is a stochastic matrix. Since βPσ⩾0 and ρ(βPσ)=β<1, Exercise A.24 implies that each Tσ is order stable.
In Section 1.2.1.2 we introduced the concept of v-greedy policies for the finite MDP, as well as the Bellman operator and the Bellman equation. Let’s make sure that these are in fact special cases of our ADP definitions from Section 2.1.1.
From (2.40) it follows that the ADP Bellman equation for (RX,TMDP) is given by
We make a number of optimality claims for the finite MDP model in Section 1.2.1.2. Later we’ll build on the ADP optimality results in ways that make verifying these claims straightforward. Nevertheless, for the sake of the exercise, we now prove the claims from Section 1.2.1.2 using Theorem 2.2.8, as well as establishing additional results. Readers who prefer to move on can skip this section without loss of continuity.
We begin with the following claim.
The statement in Proposition 2.3.1 is easily translated into more standard MDP terminology. For example, it tells us that the value function v∗ solves the Bellman equation, which we know is given by (2.43), and, by the characterization of greedy policies in (2.39) plus Bellman’s principle of optimality, that a policy σ is optimal if and only if
σ(x)∈a∈Γ(x)argmax{r(x,a)+βx′∑v∗(x′)P(x,a,x′)}for all x∈X.
In Section 1.1.3.2 we mentioned distributional dynamic programming, which seeks to track not just the expected discounted return under a policy, but also the full probability distribution of the random discounted return ∑t⩾0βtr(Xt,σ(Xt)). This richer object captures variance, quantiles, tail risk, and other features that matter when agents are not risk-neutral. Here we review some of the basic concepts and show how these ideas can be embedded into the ADP framework. Further reading can be found in Section 2.4.
Let X be a metric space of states, let A be a metric space of actions, let r:X×A→R be a bounded measurable reward function, let P be a stochastic kernel on X given X×A (see Section A.5.4), and let β∈(0,1). A policy is a measurable map σ:X→A and Σ denotes the set of all feasible policies.
Let D1(R) denote the set of Borel probability measures on R with finite first moment. A distributional value function is a stochastic kernel η from X to R (see Section A.5.4) with η(x,⋅)∈D1(R) for each x∈X; it assigns a return distribution η(x,⋅) to each state x. We let H denote the set of all distributional value functions η satisfying the uniform bound
This condition ensures that the metric introduced in Section 2.3.4.2 below is well-defined.
Given η∈H, we write η(x,h):=∫h(z)η(x,dz) for bounded measurable h:R→R. The space H is equipped with the pointwise stochastic dominance order, which we denote by ⊴ and define by η⊴η′ if η(x,⋅)⪯Fη′(x,⋅) for every x∈X (see Section A.5.5).
We now define the distributional analogue of the scalar policy operator Tσ. In the scalar case,
where rσ(x):=r(x,σ(x)) and Pσ(x,⋅):=P(x,σ(x),⋅). The distributional version replaces the scalar continuation value v(x′) with a random draw from the distribution η(x′,⋅), and replaces addition and scalar multiplication with the corresponding operations on distributions.
As a first step, given η∈H, we define the continuation kernelPσ⊗η from X to R by
Here (Pσ⊗η)(x,⋅) is a distributional analog of the continuation value: the distribution of the random value obtained by drawing x′∼Pσ(x,⋅) and then drawing from η(x′,⋅).
Given a policy σ∈Σ, the distributional policy operatorDσ:H→H is defined by
for bounded measurable h:R→R. Here h should be understood as a test function that we use to characterize the distribution (Dση)(x,⋅), which is the law of rσ(x)+βV when V∼(Pσ⊗η)(x,⋅): today’s reward plus a discounted draw from the continuation distribution. This mirrors the scalar recursion. Expanding the continuation kernel, (2.48) can be equivalently written as
Since each Dσ is an order preserving self-map on H, the pair (H,TDDP) is an ADP, where TDDP:={Dσ:σ∈Σ}. The value space is H (distributional value functions), the partial order is pointwise stochastic dominance, and the policy operators are {Dσ}.
As in the scalar MDP case, well-posedness follows from a contraction argument. To state it, we need a metric on distributions. The Wasserstein-1 distance between μ,ν∈D1(R) is
By the Banach contraction mapping theorem, each Dσ has a unique fixed point ησ∈H. Hence (H,TDDP) is well-posed. Moreover, since each Dσ is a contraction and hence globally stable on H, Lemma A.5.19 implies that (H,TDDP) is order stable.
In the scalar case, the fixed point of Tσ is the expected-return function vσ(x)=EZσ(x). What is the distributional analogue? The fixed point of Dσ should be the distribution of Zσ(x). Since r is bounded, say ∣r∣⩽M, we have ∣Zσ(x)∣⩽M/(1−β) almost surely, so the distribution of Zσ(x) lies in D1(R).
Thus, while the scalar σ-value function vσ(x)=E[Zσ(x)] records only the mean of the random return, the distributional σ-value function ησ(x,⋅) records its full distribution---including variance, quantiles and tail behavior. The scalar value function is recovered as a special case: vσ(x)=∫zησ(x,dz).
Figure 2.4 illustrates the action of the distributional policy operator on the optimal savings model from Section 1.3, using the policy σ computed in Figure 1.14. The left panel shows Dσ5η0, an early iterate starting from the initial condition η0(x,⋅)=δ0 for all x. The right panel shows the converged distributional value function ησ, which assigns to each state x the full distribution of the discounted return Zσ(x). The iterations were implemented using the categorical projection method of Bellemare et al. (2017): the return axis is discretized into a fixed grid of atoms, and at each step the affine shift z↦rσ(x)+βz is projected back onto this grid via linear interpolation of probability mass. The mean of ησ(x,⋅) at each x recovers the scalar value function vσ shown in Figure 1.14.
Figure 2.4:Iterating Dσ: early iterate (left) and converged ησ (right)
One complication, in terms of developing a theory of distributional dynamic programming, is failure of regularity. Even when the state and action spaces are finite, greedy policies typically fail to exist. To see why, observe that, in the present setting, a policy σ is η-greedy when
at each x, and produce a policy σ from the correspondence of maximizers. The problem with this idea is that, in most cases, the solution will depend on h. For example, if h is concave with high curvature then optimal choices will avoid risk. If h is linear, optimal choices will ignore risk. This makes it very difficult to attain (2.58) for all h in ibR.
Without regularity, the core optimization loop of dynamic programming---solve the Bellman equation, then extract a greedy policy---breaks down at the second step. In the reinforcement learning literature, practitioners who use distributional methods typically select actions using the mean of the return distribution---which amounts to standard scalar greediness---while exploiting the distributional representation to improve function approximation and learning dynamics.
At a deeper level, the failure of regularity reflects the fact that, in order to set up a well-defined criterion for control, one must first commit to how risk is valued. This commitment takes the form of a specific nonlinear aggregator K applied period-by-period, as described in Section 1.1.3.6 and Section 1.2.3.2 (see also the discussion of certainty equivalents and risk preferences in Section 7.2.4). Such a commitment collapses the distributional object back to a scalar recursion, restoring regularity and the full apparatus of dynamic programming. This is the approach adopted throughout the remainder of this book.
LQ control is a major sub-field of dynamic programming, routinely applied to problems in engineering, economics, operations research and elsewhere. In this section we describe a canonical LQ problem and show how it can be solved using ADP methods. Rather than aiming to provide new results, we plan to show that LQ problems can be cleanly handled by the theory provided above, instead of requiring specialized machinery.
The vector xt∈Rk is called the state variable and ut∈Rm is called the action or control. Note that xt⊤Qxt+ut⊤Rut⩾0 for all t, so the infinite sum (2.60) takes values in [0,∞]. The Bellman equation takes the form
Since the solution to this problem is well-known, we will not seek new results. Rather, our aim is to illustrate how we can embed the LQ problem into the ADP framework and recover existing results in a relatively straightforward way.
Before we go further, let’s set up a framework for working with LQ problems and note down some standard results from the literature. In what follows, A,B,R, and Q are as described in Section 2.3.5.1.
We let
P be the set of positive semidefinite k×k matrices
Below we will connect the Riccati map to the Bellman operator for this problem and the control gain map will help us select decision rules. The fixed point equation P=R(P) is called the Riccati equation. Note that the inverse in (2.64) exists because R is positive definite and P is positive semidefinite, implying that B⊤PB+R is also positive definite.
The next exercise is a well-known result and the proof is not trivial.
The next lemma shows that the control gain map selects matrices that are akin to “min-greedy policies,” although we need to be a bit careful with that terminology, since we also want our policies to be stable (as clarified below).
In stating the next result we let C be such that C⊤C=Q. We refer to Bertsekas (2012) for the definitions of observability and controllability.
In the LQ setting, a control matrix is any F∈Rm×k. Under a given control matrix F, the current control obeys ut=Fxt and the update rule for the state is xt+1=Axt+BFxt. Hence the state evolves according to xt=(A+BF)tx0. Following (2.60), the lifetime cost of following F, starting at initial condition x0∈Rk, is
ℓF(x0):=t=0∑∞xt⊤(F⊤RF+Q)xt with xt=(A+BF)tx0.
Figure 2.5:The function ℓF for different choices of F
Returning to the general case, finite lifetime costs require driving the state to zero fast enough for the sum (2.71) to converge. In this connection, extending the condition ∣A+BF∣<1 from the one-dimensional example, a control matrix F is called stable if the spectral radius condition ρ(A+BF)<1 holds.
The fixed point equation P=TF(P) can be interpreted as a recursive equation for lifetime cost under policy F. To see this, suppose P=TF(P) and set ℓ(x):=x⊤Px. Using (2.73) and a bit of algebra, you will be able to confirm that this function ℓ obeys the recursion
The right-hand side equals the current state cost x⊤Qx, plus the current action cost (Fx)⊤R(Fx), plus the lifetime cost from the next state (A+BF)x. This is exactly the recursive structure of lifetime cost under policy F.
Comparing (2.76) with (2.71), we see that x⊤PFx=ℓF(x) for all x∈Rk. Hence PF is the matrix representation of the lifetime cost function.
Fixing P∈P, the ADP definition of min-greedy policies (page ) tells us that F∈Σ is P-min-greedy if and only if TF(P)≼TG(P) for all G∈Σ. The next exercise is useful for characterizing min-greedy policies.
In obtaining optimality results, one problem we have is that (P,T) is not always regular. On the one hand, if we take an arbitrary P∈P and then calculate the control gain matrix F=F(P), Exercise 2.24 assures us that we get the “min-greedy” property TF(P)≼TG(P). On the other hand, we have no guarantee that F is actually in Σ. In particular, F might not be a stable control matrix. For this reason, we introduce a smaller set PS⊂P of all positive semidefinite matrices such that control gain F=F(P) is stable. That is, we set
Suppose that (A,B) is controllable and (A,C) is observable. Since T and R agree on PS, Lemma 2.3.4 tells us that T has a fixed point P∗ in PS. We call P∗ the minimum loss matrix. Let PΣ be the set of lifetime costs, so that
where ℓ∗ is defined by ℓ∗(x)=x⊤P∗x for all x. Moreover, if we now set F=F(P∗), then, by Lemma 2.3.7, the policy F is P∗-min-greedy. Hence, by (c), F is min-optimal.
This chapter is based on the abstract dynamic programming framework of Sargent & Stachurski (2025), which builds on theory found in Denardo (1967), Bertsekas (1977), Verdu & Poor (1987), Szepesvari (1998), Kamihigashi (2014), Bertsekas (2017), Li & Xie (2021), and, in particular, Bertsekas (2022). The paper by Sargent & Stachurski (2025) adds a layer of abstraction over these earlier frameworks by shifting analysis to families of policy operators on partially ordered sets. Doing so makes it possible treat a wider class of problems and generate new results, as discussed in later chapters.
Other precursors to the framework described in this chapter include Porteus (1975) and Kreps & Porteus (1977), who pioneered the use of order-theoretic methods to extend dynamic programming optimality theory beyond the standard expected discounted reward criterion. (An overview of this line of work appears in Appendix 6 of Kreps (2013).) Their operator-based approach accommodates non-additive objectives, such as expected utility criteria, risk-sensitive preferences and stochastic games. Building on this foundation, Kreps & Porteus (1979) showed that non-additive recursive preferences are amenable to dynamic programming, work that inspired the recursive utility framework of Epstein & Zin (1989).
The discussion of distributional dynamic programming in Section 2.3.4 builds on Bellemare et al. (2017). The objective of this line of work is to replace scalar value functions with return distributions, enabling agents to reason about risk, variability, and higher-order statistics of outcomes. The theoretical foundations have been extended by several authors, including Dabney et al. (2018), Rowland et al. (2018), Bäuerle et al. (2025), Marthe (2026), and Bäuerle & Vasileiadis (2026). Distributional methods have found application in risk-sensitive control, robotics, and finance.
Linear-quadratic optimal control theory was developed during the late 1950s and 1960s, when the Riccati equation emerged as a key tool for computing optimal feedback policies. These methods have been widely applied in economics. Sargent (1987) provides an early treatment connecting LQ foundations to economic modeling. Hansen & Sargent (1980) showed how to formulate and estimate dynamic linear rational expectations models using LQ methods. The risk-sensitive extension to linear-exponential-quadratic-Gaussian (LEQG) control, developed by Whittle (1981) and adapted for discounted problems in economics by Hansen & Sargent (1995), provides a bridge between LQ control and robust decision-making under model uncertainty; see Hansen & Sargent (2011) for a comprehensive treatment. LQ methods underpin much of macroeconomic modeling, including the analysis of fiscal policy, monetary policy, and business cycle dynamics. See Hansen & Sargent (2013) and Ljungqvist & Sargent (2018) for numerous applications.
Research at the intersection of nonlinear dynamics and LQ control is currently quite active. One of the key ideas is to approximate nonlinear systems with very high dimensional linear systems, and then to approximate those linear systems via singular value decomposition. For further discussion of these topics, see Kutz et al. (2016) or Brunton & Kutz (2019).
In general, designating a v-greedy policy for all v∈V requires the Axiom of Choice. In practice, many applications induce some structure on the policy set that can be used to produce simple selection mechanisms.
Puterman, M. L. (2005). Markov decision processes: discrete stochastic dynamic programming. Wiley Interscience.
Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. International Conference on Machine Learning, 449–458.
Bertsekas, D. (2012). Dynamic programming and optimal control (Vol. 1). Athena Scientific.
Sargent, T. J., & Stachurski, J. (2025). Dynamic Programs on Partially Ordered Sets. SIAM Journal on Optimization and Control, in press.
Denardo, E. V. (1967). Contraction Mappings in the Theory Underlying Dynamic Programming. SIAM Review, 9(2), 165–177.
Bertsekas, D. P. (1977). Monotone mappings with application in dynamic programming. SIAM Journal on Control and Optimization, 15(3), 438–464.
Verdu, S., & Poor, H. V. (1987). Abstract dynamic programming models under commutativity conditions. SIAM Journal on Control and Optimization, 25(4), 990–1006.
Szepesvari, C. (1998). Non-Markovian policies in sequential decision problems. Acta Cybernetica, 13(3), 305–318.
Kamihigashi, T. (2014). Elementary results on solutions to the Bellman equation of dynamic programming: existence, uniqueness, and convergence. Economic Theory, 56, 251–273.
Bertsekas, D. P. (2017). Regular policies in abstract dynamic programming. SIAM Journal on Optimization, 27(3), 1694–1727.
Li, X., & Xie, L. (2021). Online Abstract Dynamic Programming with Contractive Models.
Bertsekas, D. P. (2022). Abstract dynamic programming (3rd ed.). Athena Scientific.
Porteus, E. L. (1975). On the optimality of structured policies in countable stage decision processes. Management Science, 22(2), 148–157.
Kreps, D. M., & Porteus, E. L. (1977). On the optimality of structured policies in countable stage decision processes. II: Positive and negative problems. SIAM Journal on Applied Mathematics, 32(2), 457–466.
Kreps, D. M. (2013). Microeconomic foundations (Vol. 1). Princeton University Press.