Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

9 Abstract Dynamic Programming

Authors
Affiliations
New York University
Australian National University

In Chapter 8 we introduced RDPs, stated their optimality properties, and investigated applications that satisfy optimality conditions. But we have yet to prove the core optimality and convergence results in Theorem 8.1.1.

Rather than proving these result directly, we now present a very abstract version of a dynamic programming problem that consist of a family of self-maps on a partially ordered set. Doing so allows us to simplify proofs and extend the reach of dynamic programming theory. (The value of these extensions will become clearer in Volume 2.)

9.1Abstract Dynamic Programs

First, we define abstract dynamic programs and prove optimality results under a set of high-level assumptions. Then we connect these results to our Chapter 8 optimality claims for RDPs.

9.1.1Preliminaries

Let’s cover some fundamental concepts that we’ll use when considering abstract dynamic programs.

9.1.1.1Order Stability

The first concept is related to stability of maps over partially ordered spaces. Our aim is to provide a weak notion of stability that can be applied in any partially ordered set (without any form of topology).

Let VV be a partially ordered set and let TT be a self-map on VV with exactly one fixed point vˉ\bar v in VV. In this setting, we call TT

Figure Figure 9.1 gives an illustration of a map TT that on V=[0,1]V = [0,1] that is order stable: all points mapped up by TT lie below its fixed point and all points mapped down by TT lie above its fixed point. The figure suggests that order stability is related to global stability, as defined in Section 1.2.2.2. We will affirm this in Lemma 9.1.1.

Order-stability of a self-map T on V
    = ([0,1], \leq)

Figure 9.1:Order-stability of a self-map TT on V=([0,1],)V = ([0,1], \leq)

9.1.1.2Order Duals

Given partially ordered set VV, let V=(V,)V^\partial = (V, \preceq^\partial) be the order dual, so that, for u,vVu, v \in V, we have uvu \preceq^\partial v if and only if vuv \preceq u. (The notation is slightly confusing but the concept is simple: VV^\partial is just VV with the order reversed.) The following result will be useful.

9.1.2Abstract Dynamic Programs

In this section, we formalize abstract dynamic programs and present fundamental optimality results. Section 9.1.2.1 starts the ball rolling with an informal overview.

9.1.2.1Prelude

We saw in Section 8.1 that a globally stable RDP yields a set of feasible policies Σ\Sigma and, for each σΣ\sigma \in \Sigma, a policy operator TσT_\sigma defined on the value space VRXV \subset \RR^\Xsf. Notice that the dynamic program is fully specified by the family of operators {Tσ}σΣ\{T_\sigma\}_{\sigma \in \Sigma} and the space VV that they act on. From this set of operators we obtain the set of lifetime values {vσ}σΣ\{v_\sigma\}_{\sigma \in \Sigma}, with each vσv_\sigma uniquely identified as a fixed point of TσT_\sigma. These lifetime values define the value function vv^* as the pointwise maximum v=σvσv^* = \vee_\sigma \, v_\sigma. An optimal policy is then defined as a σΣ\sigma \in \Sigma obeying vσ=vv_\sigma = v^*.

To shed unnecessary structure before the main optimality proofs, a natural idea is to start directly with an abstract set of “policy operators” {Tσ}\{T_\sigma\} acting on some set VV. One can then define lifetime values and optimality as in the previous paragraph and start to investigate conditions on the family of operators {Tσ}\{T_\sigma\} that lead to optimality.

We use these ideas as our starting point, beginning with an arbitrary family {Tσ}\{T_\sigma\} of operators on a partially ordered set.

9.1.2.2Defining ADPs

An abstract dynamic program (ADP) is a pair A=(V,{Tσ}σΣ)\aA = (V, \{T_\sigma\}_{\sigma \in \Sigma}) such that

  1. V=(V,)V = (V, \preceq) is a partially ordered set,

  2. {Tσ}{Tσ}σΣ\{T_\sigma\} \coloneq \{T_\sigma\}_{\sigma \in \Sigma} is a family of self-maps on VV, and

  3. for all vVv \in V, the set {Tσv}σΣ\{T_\sigma \, v\}_{\sigma \in \Sigma} has both a least and greatest element.

Elements of the index set Σ\Sigma are called policies and elements of {Tσ}\{T_\sigma \} are called policy operators. Given vVv \in V, a policy σ\sigma in Σ\Sigma is called vv-greedy if TσvTτvT_{\sigma} \, v \succeq T_\tau \, v for all τΣ\tau \in \Sigma. Existence of a greatest element in (iii) of the definition is equivalent to the statement that each vVv \in V has at least one vv-greedy policy.

In the setting of Example 9.1.1, we call AR\aA_{\rR} the ADP generated by R\rR.

We have just shown that RDPs are ADPs. But there are also ADPs that do not fit naturally into the RDP framework. The next two examples illustrate. In these examples, the Bellman equation does not match the RDP Bellman equation v(x)=maxaΓ(x)B(x,a,v)v(x) = \max_{a \in \Gamma(x)} B(x, a, v) due to the inverted order of expectation and maximization.

In Chapter 10 we will see that continuous time dynamic programs can also be viewed as ADPs.

9.2Optimality

In this section, we study optimality properties of ADPs, aiming for generalizations of the foundational results of dynamic programming. To achieve this aim we need to define optimality and provide sufficient conditions.

9.2.1Max-Optimality

We begin with maximization. Later, in Section 9.2.3, we will show that results for minimization problems are simple corollaries of maximization results.

9.2.1.1Lifetime Values

The objective of dynamic programming is to optimize lifetime value. But what is lifetime value in this abstract context? Suppose that, for an ADP (V,{Tσ})(V, \{T_\sigma\}) and fixed σΣ\sigma \in \Sigma, the policy operator TσT_\sigma has a unique fixed point. In this setting, we write vσv_\sigma for the fixed point of TσT_\sigma and call it the σ\sigma-value function. We interpret it as the lifetime value of following policy σ\sigma. A closely related interpretation was discussed at length for RDPs in Section 8.1.2.1 and the situation here is analogous.

We call an ADP A(V,{Tσ})\aA \coloneq (V, \{T_\sigma\}) well-posed if every policy operator TσT_\sigma has a unique fixed point in VV. In view of the preceding discussion on lifetime values, well-posedness is a minimum requirement for constructing an optimality theory around ADPs.

9.2.1.2Operators

Let A=(V,{Tσ})\aA = (V, \{T_\sigma\}) be an ADP. We set

TvσTσv(vV),\tmax \, v \coloneq \bigvee_\sigma T_\sigma \, v \qquad (v \in V),

and call T\tmax the Bellman operator generated by A\aA. Note that TT is a well-defined self-map on VV by part (iii) of the definition of ADPs (existence of greedy policies). A function vVv \in V is said to satisfy the Bellman equation if it is a fixed point of T\tmax.

The definition of T\tmax in (9.5) includes all of the Bellman operators we have met as special cases. For example, consider an RDP R=(Γ,V,B)\rR = (\Gamma, V, B) with Bellman operator (Tv)(x)=maxaΓ(x)B(x,a,v)(Tv)(x) = \max_{a \in \Gamma(x)}B(x,a,v). We can write TT as σTσv\bigvee_\sigma T_\sigma \, v, as shown in Exercise 8.8. Thus, the Bellman operator of the RDP agrees with the Bellman operator T\tmax of the corresponding ADP AR\aA_{\rR}.

Below we consider Howard policy iteration (HPI) as an algorithm for solving for optimal policies of ADPs. We use precisely the same instruction set as for the RDP case, as shown in Algorithm 8.1. To further clarify the algorithm, we define a map H\Hmax from VV to {vσ}\{v_\sigma\} via Hv=vσ\Hmax \, v = v_\sigma where σ\sigma is vv-max-greedy. Iterating with H\Hmax generates the value sequence associated with Howard policy iteration.[1] In what follows, we call H\Hmax the Howard operator generated by the ADP.

9.2.1.3Properties

Let A(V,{Tσ}σΣ)\aA \coloneq (V, \{T_\sigma\}_{\sigma \in \Sigma}) be an ADP. We call A\aA

Obviously max-stable     \implies order stable     \implies well-posed.

Regarding the definition of max-stability, existence of a fixed point of TT in VV is a high-level assumption that can be challenging to verify in applications. At the same time, our main concern in the present volume is the case where A\aA is finite, and, in this setting, order stability is enough:

Proposition 9.2.1 is proved in Section B.4.

Order stability is central to the optimality results just stated. While order stability is a somewhat nonstandard condition, the next result shows that, at least in simple settings, order stability is necessary for any discussion of optimality.

9.2.1.4Max-Optimality Results

Let A=(V,{Tσ})\aA = (V, \{T_\sigma\}) be a well-posed ADP with σ\sigma-value functions {vσ}σΣ\{ v_\sigma \}_{\sigma \in \Sigma}. We define

VΣ{vσ}σΣandVu{vV:vTv}.V_\Sigma \coloneq \{v_\sigma\}_{\sigma \in \Sigma} \quad \text{and} \quad V_u \coloneq \setntn{v \in V}{v \preceq Tv}.

If VΣV_\Sigma has a greatest element, then we denote it by v\vmax and call it the value function generated by A\aA. In this setting, a policy σΣ\sigma \in \Sigma is called optimal for A\aA if vσ=vv_\sigma = \vmax. We say that A\aA obeys Bellman’s principle of optimality if

σΣ is optimal for A    σ is v-greedy.\sigma \in \Sigma \text{ is optimal for } \aA \quad \iff \quad \sigma \text{ is } \vmax \text{-greedy}.

These definitions are direct generalizations of the corresponding definitions for RDPs discussed in Chapter 8.

We can now state our main optimality result for ADPs.

Theorem 9.2.4 informs us that finite well-posed ADPs have first-rate optimality properties under a relatively mild stability condition. In Section 9.2.2 we use Theorem 9.2.4 to prove all optimality results for RDPs stated in Chapter 8. The proof of Theorem 9.2.4 is given in Section B.4. Note that (iv) follows directly from (i) and is included only for completeness.

9.2.1.5General States

This volume focuses on dynamic programming problems with finite states. Here we restrict ourselves to one high-level result for general state spaces.

Proposition 9.2.5 tells us that we can drop finiteness of policy set Σ\Sigma (which is implied by finite states and actions) whenever the Bellman operator has at least one fixed point. Various fixed-point methods are available for establishing this existence. We defer further details until Volume 2. Proposition 9.2.5 is proved in in Section B.4.

9.2.1.6Application: Mixed Strategies

This section discusses adding mixed strategies to an RDP. We will need to apply Proposition 9.2.5 to discuss optimality because the set of mixed strategies is not finite.

Let R=(Γ,V,B)\rR = (\Gamma, V, B) be an RDP with finite state space X\Xsf, finite action space A\Asf, policy set Σ\Sigma and Bellman operator TT (see Section 8.1.3). A mixed strategy for R\rR is a map ϕ\phi sending xXx \in \Xsf into a distribution ϕxD(A)\phi_x \in \dD(\Asf) supported on Γ(x)\Gamma(x). In other words, for each xXx \in \Xsf,

ϕx ⁣:A[0,1]andaΓ(x)ϕx(a)=1.\phi_x \colon \Asf \to [0, 1] \quad \text{and} \quad \sum_{a \in \Gamma(x)} \phi_x(a) = 1.

Let Φ\Phi be the set of all mixed strategies for R\rR. For each mixed strategy ϕΦ\phi \in \Phi, we introduce the policy operator on VV defined by

(T^ϕv)(x)=aAB(x,a,v)ϕx(a)(vV,  xX).(\hat T_\phi \, v)(x) = \sum_{a \in \Asf} B(x, a, v) \phi_x(a) \qquad (v \in V, \; x \in \Xsf).

The right-hand side is the expected lifetime value from current state xx, when the current action is drawn from ϕx\phi_x and future states are evaluated via vv.

It follows from this discussion that AM(V,{T^ϕ}ϕΦ)\aA_M \coloneq (V, \{\hat T_\phi\}_{\phi \in \Phi}) is an ADP (where “M” stands for “mixed”), and that the Bellman operator T^\hat T associated with the ADP AM\aA_M is given by

(T^v)(x)=maxaΓ(x)B(x,a,v)=(Tv)(x)(vV,  xX).(\hat T v)(x) = \max_{a \in \Gamma(x)} B(x, a, v) = (Tv)(x) \qquad (v \in V, \; x \in \Xsf).

Let us assume for simplicity that R\rR is contracting (see Section 8.2.1), with modulus of contraction β(0,1)\beta \in (0,1). Assume also that VV is closed in RX\RR^\Xsf. As a result, the value function vv^* for R\rR exists in VV and is the unique fixed point of TT in VV (Corollary 8.2.2).

By Exercise 9.7, the ADP AM\aA_M is max-stable (since globally stable operators are order stable -- see Lemma 9.1.1 -- and the Bellman operator T^\hat T has a fixed point). Hence, by Proposition 9.2.5, the value function v^\hat v^* for AM\aA_M exists in VV and is the unique fixed point of T^\hat T in VV. But, by (9.11), T^\hat T and TT agree on VV. Hence v^=v\hat v^* = v^*. We conclude as follows: while the set of mixed strategies is larger than the set of pure strategies (i.e., deterministic policies), the maximal lifetime value from each state is the same.

9.2.2Optimality Results for RDPs

In this section, we return to the optimality properties of RDPs, as first discussed in Section 8.1.3.3. Our aim is to connect the ADP optimality results from Section 9.2.1.4 to the special case of RDPs and, through this process, complete the proofs of our key RDP optimality results from Chapter 8.

9.2.2.1OPI Convergence

The first step is to provide some preliminary results related to OPI convergence, where OPI obeys the algorithm given. Throughout, R=(Γ,V,B)\rR = (\Gamma, V, B) is a globally stable RDP with policy set Σ\Sigma, policy operators {Tσ}\{T_\sigma\}, Bellman operator TT, and value function vv^*. As usual, vσv_\sigma denotes the unique fixed point of TσT_\sigma for all σΣ\sigma \in \Sigma. In the results that follow, mm is a fixed natural number indicating the OPI step size and HH and WmW_m are as defined in Section 8.1.3.2.

9.2.2.2Proofs of RDP Results

In Section 8.1.3 we stated two key optimality results for RDPs, the first concerning globally stable RDPs (Theorem 8.1.1) and the second concerning bounded RDPs (Theorem 8.1.2). Let’s now prove them. In what follows, R=(Γ,V,B)\rR = (\Gamma, V, B) is a well-posed RDP and AR(V,{Tσ})\aA_\rR \coloneq (V, \{T_\sigma\}) is the ADP generated by R\rR.

9.2.3Min-Optimality

Until now, our ADP theory has focused on maximization of lifetime values. Now we turn to minimization. One of our aims is to prove the RDP minimization results in Section 8.3.5. We will see that ADP minimization results are easily recovered from ADP maximization results via order duality.

Let A=(V,{Tσ})\aA = (V, \{T_\sigma\}) be a well-posed ADP and let VΣ{vσ}V_\Sigma \coloneq \{v_\sigma\} be the set of σ\sigma-value functions. We call σΣ\sigma \in \Sigma min-optimal for A\aA if vσv_\sigma is a least element of VΣV_\Sigma. When VΣV_\Sigma has a least element we denote it by v\vmin and call it the min-value function generated by A\aA. A policy σ\sigma is called vv-min-greedy if TσvTτvT_\sigma \, v \preceq T_\tau \, v for all τΣ\tau \in \Sigma. Existence of a vv-min-greedy policy for each vVv \in V is guaranteed by the definition of ADPs.

We say that A\aA obeys Bellman’s principle of min-optimality if

σΣ is min-optimal for A    σ is v-min-greedy.\sigma \in \Sigma \text{ is min-optimal for } \aA \quad \iff \quad \sigma \text{ is } \vmin \text{-min-greedy}.

We define the Bellman min-operator corresponding to A\aA as the self-map T\tmin on VV defined by Tv=σTσv\tmin v = \bigwedge_\sigma T_\sigma \, v. This map is well-defined because {Tσv}σΣ\{T_\sigma \, v\}_{\sigma \in \Sigma} has a least element and, moreover, σΣ\sigma \in \Sigma is vv-min-greedy if and only if Tσv=TvT_\sigma \, v = \tmin \, v.

We say that vv satisfies the Bellman min-equation if Tv=v\tmin v = v. We call A\aA min-stable if A\aA is order stable and T\tmin has at least one fixed point in VV. We define H\Hmin from VV to {vσ}\{v_\sigma\} via Hv=vσ\Hmin \, v = v_\sigma where σ\sigma is vv-min-greedy and call H\Hmin the Howard min-operator generated by A\aA. Iterating with H\Hmin is called min-HPI.

Results analogous to Theorem 9.2.4 hold for the minimization case.

To prove Theorem 9.2.11 we use order duality. Below, if A(V,{Tσ})\aA \coloneq (V, \{T_\sigma\}) is an ADP then its dual is

A(V,{Tσ})   where V is the order dual of V.\aA^\partial \coloneq (V^\partial, \{T_\sigma\}) \; \text{ where } V^\partial \text{ is the order dual of } V.

In this setting, we let T\tmax^\partial be the Bellman operator for A\aA^\partial, (v)(\vmax)^\partial be the value function for A\aA^\partial, and so on. We note that A\aA is self-dual, in the sense that (A)=A(\aA^\partial)^\partial = \aA, since the same is true for VV.

To make our terminology more symmetric, in the remainder of this section we refer to maximization-based optimal policies as max-optimal, the Bellman operator T=σTσv\tmax = \bigvee_\sigma T_\sigma \, v as the Bellman max-operator, and so on.

Self-duality implies corollaries to Exercise 9.8 which we treat as self-evident. For example, if A\aA is max-stable if and only if A\aA^\partial is min-stable, which follows from part (v) and the fact that (A)=A(\aA^\partial)^\partial = \aA.

9.3Chapter Notes

As indicated in notes for Chapter 8, our interest in abstract dynamic programming was inspired by Bertsekas (2022). This chapter generalizes his framework by switching to a “completely abstract” setting based on analysis of self-maps on partially ordered space. The material here is based on Sargent & Stachurski (2023). Earlier work on dynamic programming in a setting with no topology can be found in Kamihigashi (2014).

Footnotes
  1. For H\Hmax to be well-defined, we must always select the same vv-greedy policy when the operator is applied to vv. We can use the axiom of choice to assign to each vv a designated vv-greedy policy, although, in applications, a simple rule usually suffices. For example, if Σ\Sigma is finite, we can enumerate the policy set Σ\Sigma and choose the first vv-greedy policy.

References
  1. Fei, Y., Yang, Z., Chen, Y., & Wang, Z. (2021). Exponential Bellman equation and improved regret bounds for risk-sensitive reinforcement learning. Advances in Neural Information Processing Systems, 34, 20436–20446.
  2. Bertsekas, D. (2022). Abstract Dynamic Programming (3rd ed.). Athena Scientific.
  3. Sargent, T., & Stachurski, J. (2023). Completely abstract dynamic programming. arXiv, 2308.02148.
  4. Kamihigashi, T. (2014). Elementary results on solutions to the Bellman equation of dynamic programming: Existence, uniqueness, and convergence. Economic Theory, 56, 251–273.