In Chapter 8 we introduced RDPs, stated their optimality properties, and investigated applications that satisfy optimality conditions. But we have yet to prove the core optimality and convergence results in Theorem 8.1.1.
Rather than proving these result directly, we now present a very abstract version of a dynamic programming problem that consist of a family of self-maps on a partially ordered set. Doing so allows us to simplify proofs and extend the reach of dynamic programming theory. (The value of these extensions will become clearer in Volume 2.)
First, we define abstract dynamic programs and prove optimality results under a set of high-level assumptions. Then we connect these results to our Chapter 8 optimality claims for RDPs.
The first concept is related to stability of maps over partially ordered spaces. Our aim is to provide a weak notion of stability that can be applied in any partially ordered set (without any form of topology).
Let V be a partially ordered set and let T be a self-map on V with exactly one fixed point vˉ in V. In this setting, we call T
upward stable on V if v∈V and v⪯Tv implies v⪯vˉ,
downward stable on V if v∈V and Tv⪯v implies vˉ⪯v, and
order stable on V if T is both upward and downward stable.
Figure Figure 9.1 gives an illustration of a map T that on V=[0,1] that is order stable: all points mapped up by T lie below its fixed point and all points mapped down by T lie above its fixed point. The figure suggests that order stability is related to global stability, as defined in Section 1.2.2.2. We will affirm this in Lemma 9.1.1.
Figure 9.1:Order-stability of a self-map T on V=([0,1],⩽)
Given partially ordered set V, let V∂=(V,⪯∂) be the order dual, so that, for u,v∈V, we have u⪯∂v if and only if v⪯u. (The notation is slightly confusing but the concept is simple: V∂ is just V with the order reversed.) The following result will be useful.
In this section, we formalize abstract dynamic programs and present fundamental optimality results. Section 9.1.2.1 starts the ball rolling with an informal overview.
We saw in Section 8.1 that a globally stable RDP yields a set of feasible policies Σ and, for each σ∈Σ, a policy operator Tσ defined on the value space V⊂RX. Notice that the dynamic program is fully specified by the family of operators {Tσ}σ∈Σ and the space V that they act on. From this set of operators we obtain the set of lifetime values {vσ}σ∈Σ, with each vσ uniquely identified as a fixed point of Tσ. These lifetime values define the value function v∗ as the pointwise maximum v∗=∨σvσ. An optimal policy is then defined as a σ∈Σ obeying vσ=v∗.
To shed unnecessary structure before the main optimality proofs, a natural idea is to start directly with an abstract set of “policy operators” {Tσ} acting on some set V. One can then define lifetime values and optimality as in the previous paragraph and start to investigate conditions on the family of operators {Tσ} that lead to optimality.
We use these ideas as our starting point, beginning with an arbitrary family {Tσ} of operators on a partially ordered set.
An abstract dynamic program (ADP) is a pair A=(V,{Tσ}σ∈Σ) such that
V=(V,⪯) is a partially ordered set,
{Tσ}:={Tσ}σ∈Σ is a family of self-maps on V, and
for all v∈V, the set {Tσv}σ∈Σ has both a least and greatest element.
Elements of the index set Σ are called policies and elements of {Tσ} are called policy operators. Given v∈V, a policy σ in Σ is called v-greedy if Tσv⪰Tτv for all τ∈Σ. Existence of a greatest element in (iii) of the definition is equivalent to the statement that each v∈V has at least one v-greedy policy.
In the setting of Example 9.1.1, we call AR the ADP generated byR.
We have just shown that RDPs are ADPs. But there are also ADPs that do not fit naturally into the RDP framework. The next two examples illustrate. In these examples, the Bellman equation does not match the RDP Bellman equation v(x)=maxa∈Γ(x)B(x,a,v) due to the inverted order of expectation and maximization.
In Chapter 10 we will see that continuous time dynamic programs can also be viewed as ADPs.
In this section, we study optimality properties of ADPs, aiming for generalizations of the foundational results of dynamic programming. To achieve this aim we need to define optimality and provide sufficient conditions.
We begin with maximization. Later, in Section 9.2.3, we will show that results for minimization problems are simple corollaries of maximization results.
The objective of dynamic programming is to optimize lifetime value. But what is lifetime value in this abstract context? Suppose that, for an ADP (V,{Tσ}) and fixed σ∈Σ, the policy operator Tσ has a unique fixed point. In this setting, we write vσ for the fixed point of Tσ and call it the σ-value function. We interpret it as the lifetime value of following policy σ. A closely related interpretation was discussed at length for RDPs in Section 8.1.2.1 and the situation here is analogous.
We call an ADP A:=(V,{Tσ})well-posed if every policy operator Tσ has a unique fixed point in V. In view of the preceding discussion on lifetime values, well-posedness is a minimum requirement for constructing an optimality theory around ADPs.
and call T the Bellman operator generated by A. Note that T is a well-defined self-map on V by part (iii) of the definition of ADPs (existence of greedy policies). A function v∈V is said to satisfy the Bellman equation if it is a fixed point of T.
The definition of T in (9.5) includes all of the Bellman operators we have met as special cases. For example, consider an RDP R=(Γ,V,B) with Bellman operator (Tv)(x)=maxa∈Γ(x)B(x,a,v). We can write T as ⋁σTσv, as shown in Exercise 8.8. Thus, the Bellman operator of the RDP agrees with the Bellman operator T of the corresponding ADP AR.
Below we consider Howard policy iteration (HPI) as an algorithm for solving for optimal policies of ADPs. We use precisely the same instruction set as for the RDP case, as shown in Algorithm 8.1. To further clarify the algorithm, we define a map H from V to {vσ} via Hv=vσ where σ is v-max-greedy. Iterating with H generates the value sequence associated with Howard policy iteration.[1] In what follows, we call H the Howard operator generated by the ADP.
order stable if every policy operator Tσ is order stable on V, and
max-stable if A is order stable and T has at least one fixed point in V.
Obviously max-stable ⟹ order stable ⟹ well-posed.
Regarding the definition of max-stability, existence of a fixed point of T in V is a high-level assumption that can be challenging to verify in applications. At the same time, our main concern in the present volume is the case where A is finite, and, in this setting, order stability is enough:
Order stability is central to the optimality results just stated. While order stability is a somewhat nonstandard condition, the next result shows that, at least in simple settings, order stability is necessary for any discussion of optimality.
If VΣ has a greatest element, then we denote it by v∗ and call it the value function generated by A. In this setting, a policy σ∈Σ is called optimal for A if vσ=v∗. We say that A obeys Bellman’s principle of optimality if
These definitions are direct generalizations of the corresponding definitions for RDPs discussed in Chapter 8.
We can now state our main optimality result for ADPs.
Theorem 9.2.4 informs us that finite well-posed ADPs have first-rate optimality properties under a relatively mild stability condition. In Section 9.2.2 we use Theorem 9.2.4 to prove all optimality results for RDPs stated in Chapter 8. The proof of Theorem 9.2.4 is given in Section B.4. Note that (iv) follows directly from (i) and is included only for completeness.
This volume focuses on dynamic programming problems with finite states. Here we restrict ourselves to one high-level result for general state spaces.
Proposition 9.2.5 tells us that we can drop finiteness of policy set Σ (which is implied by finite states and actions) whenever the Bellman operator has at least one fixed point. Various fixed-point methods are available for establishing this existence. We defer further details until Volume 2. Proposition 9.2.5 is proved in in Section B.4.
This section discusses adding mixed strategies to an RDP. We will need to apply Proposition 9.2.5 to discuss optimality because the set of mixed strategies is not finite.
Let R=(Γ,V,B) be an RDP with finite state space X, finite action space A, policy set Σ and Bellman operator T (see Section 8.1.3). A mixed strategy for R is a map ϕ sending x∈X into a distribution ϕx∈D(A) supported on Γ(x). In other words, for each x∈X,
The right-hand side is the expected lifetime value from current state x, when the current action is drawn from ϕx and future states are evaluated via v.
It follows from this discussion that AM:=(V,{T^ϕ}ϕ∈Φ) is an ADP (where “M” stands for “mixed”), and that the Bellman operator T^ associated with the ADP AM is given by
Let us assume for simplicity that R is contracting (see Section 8.2.1), with modulus of contraction β∈(0,1). Assume also that V is closed in RX. As a result, the value function v∗ for R exists in V and is the unique fixed point of T in V (Corollary 8.2.2).
By Exercise 9.7, the ADP AM is max-stable (since globally stable operators are order stable -- see Lemma 9.1.1 -- and the Bellman operator T^ has a fixed point). Hence, by Proposition 9.2.5, the value function v^∗ for AM exists in V and is the unique fixed point of T^ in V. But, by (9.11), T^ and T agree on V. Hence v^∗=v∗. We conclude as follows: while the set of mixed strategies is larger than the set of pure strategies (i.e., deterministic policies), the maximal lifetime value from each state is the same.
In this section, we return to the optimality properties of RDPs, as first discussed in Section 8.1.3.3. Our aim is to connect the ADP optimality results from Section 9.2.1.4 to the special case of RDPs and, through this process, complete the proofs of our key RDP optimality results from Chapter 8.
The first step is to provide some preliminary results related to OPI convergence, where OPI obeys the algorithm given. Throughout, R=(Γ,V,B) is a globally stable RDP with policy set Σ, policy operators {Tσ}, Bellman operator T, and value function v∗. As usual, vσ denotes the unique fixed point of Tσ for all σ∈Σ. In the results that follow, m is a fixed natural number indicating the OPI step size and H and Wm are as defined in Section 8.1.3.2.
In Section 8.1.3 we stated two key optimality results for RDPs, the first concerning globally stable RDPs (Theorem 8.1.1) and the second concerning bounded RDPs (Theorem 8.1.2). Let’s now prove them. In what follows, R=(Γ,V,B) is a well-posed RDP and AR:=(V,{Tσ}) is the ADP generated by R.
Until now, our ADP theory has focused on maximization of lifetime values. Now we turn to minimization. One of our aims is to prove the RDP minimization results in Section 8.3.5. We will see that ADP minimization results are easily recovered from ADP maximization results via order duality.
Let A=(V,{Tσ}) be a well-posed ADP and let VΣ:={vσ} be the set of σ-value functions. We call σ∈Σmin-optimal for A if vσ is a least element of VΣ. When VΣ has a least element we denote it by v↓∗ and call it the min-value function generated by A. A policy σ is called v-min-greedy if Tσv⪯Tτv for all τ∈Σ. Existence of a v-min-greedy policy for each v∈V is guaranteed by the definition of ADPs.
We say that A obeys Bellman’s principle of min-optimality if
We define the Bellman min-operator corresponding to A as the self-map T↓ on V defined by T↓v=⋀σTσv. This map is well-defined because {Tσv}σ∈Σ has a least element and, moreover, σ∈Σ is v-min-greedy if and only if Tσv=T↓v.
We say that v satisfies the Bellman min-equation if T↓v=v. We call Amin-stable if A is order stable and T↓ has at least one fixed point in V. We define H↓ from V to {vσ} via H↓v=vσ where σ is v-min-greedy and call H↓ the Howard min-operator generated by A. Iterating with H↓ is called min-HPI.
Results analogous to Theorem 9.2.4 hold for the minimization case.
To prove Theorem 9.2.11 we use order duality. Below, if A:=(V,{Tσ}) is an ADP then its dual is
In this setting, we let T∂ be the Bellman operator for A∂, (v∗)∂ be the value function for A∂, and so on. We note that A is self-dual, in the sense that (A∂)∂=A, since the same is true for V.
To make our terminology more symmetric, in the remainder of this section we refer to maximization-based optimal policies as max-optimal, the Bellman operator T=⋁σTσv as the Bellman max-operator, and so on.
Self-duality implies corollaries to Exercise 9.8 which we treat as self-evident. For example, if A is max-stable if and only if A∂ is min-stable, which follows from part (v) and the fact that (A∂)∂=A.
As indicated in notes for Chapter 8, our interest in abstract dynamic programming was inspired by Bertsekas (2022). This chapter generalizes his framework by switching to a “completely abstract” setting based on analysis of self-maps on partially ordered space. The material here is based on Sargent & Stachurski (2023). Earlier work on dynamic programming in a setting with no topology can be found in Kamihigashi (2014).
For H to be well-defined, we must always select the same v-greedy policy when the operator is applied to v. We can use the axiom of choice to assign to each v a designated v-greedy policy, although, in applications, a simple rule usually suffices. For example, if Σ is finite, we can enumerate the policy set Σ and choose the first v-greedy policy.
Kamihigashi, T. (2014). Elementary results on solutions to the Bellman equation of dynamic programming: Existence, uniqueness, and convergence. Economic Theory, 56, 251–273.