Abstract Dynamic Programming - Dynamic Programming Volume I: Finite States

In Chapter 8 we introduced RDPs, stated their optimality properties, and investigated applications that satisfy optimality conditions. But we have yet to prove the core optimality and convergence results in Theorem 8.1.1.

Rather than proving these results directly, we now present a very abstract version of a dynamic programming problem that consists of a family of self-maps on a partially ordered set. Doing so allows us to simplify proofs and extend the reach of dynamic programming theory. (The value of these extensions will become clearer in Volume 2.)

9.1 Abstract Dynamic Programs

First, we define abstract dynamic programs and prove optimality results under a set of high-level assumptions. Then we connect these results to our Chapter 8 optimality claims for RDPs.

9.1.1 Preliminaries

Let’s cover some fundamental concepts that we’ll use when considering abstract dynamic programs.

9.1.1.1 Order Stability

The first concept is related to stability of maps over partially ordered spaces. Our aim is to provide a weak notion of stability that can be applied in any partially ordered set (without any form of topology).

Let $V$ be a partially ordered set and let $T$ be a self-map on $V$ with exactly one fixed point $\bar v$ in $V$ . In this setting, we call $T$

upward stable on $V$ if $v \in V$ and $v \preceq T \, v$ implies $v \preceq \bar v$ ,
downward stable on $V$ if $v \in V$ and $T \, v \preceq v$ implies $\bar v \preceq v$ , and
order stable on $V$ if $T$ is both upward and downward stable.

Figure 9.1 gives an illustration of a map $T$ on $V = [0,1]$ that is order stable: all points mapped up by $T$ lie below its fixed point and all points mapped down by $T$ lie above its fixed point. The figure suggests that order stability is related to global stability, as defined in Section 1.2.2.2. We will affirm this in Lemma 9.1.1.

Figure 9.1: Order stability of a self-map $T$ on $V = ([0,1], \leq)$
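As a concrete check of these definitions, the following minimal sketch tests upward and downward stability on a finite grid of $[0,1]$. The map $T v = v/2 + 1/4$ and the grid are illustrative assumptions rather than objects from the text, and a grid check can only suggest, not prove, order stability.

```python
from fractions import Fraction

# Hypothetical self-map on V = [0, 1] with unique fixed point v_bar = 1/2.
def T(v):
    return v / 2 + Fraction(1, 4)

v_bar = Fraction(1, 2)
grid = [Fraction(i, 100) for i in range(101)]

# Upward stability: v <= T v implies v <= v_bar.
upward_stable = all(v <= v_bar for v in grid if v <= T(v))

# Downward stability: T v <= v implies v_bar <= v.
downward_stable = all(v_bar <= v for v in grid if T(v) <= v)

print(upward_stable and downward_stable)
```

Exact rational arithmetic is used so that the order comparisons are not affected by floating-point rounding.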

9.1.1.2 Order Duals

Given partially ordered set $V$ , let $V^\partial = (V, \preceq^\partial)$ be the order dual, so that, for $u, v \in V$ , we have $u \preceq^\partial v$ if and only if $v \preceq u$ . (The notation is slightly confusing but the concept is simple: $V^\partial$ is just $V$ with the order reversed.) The following result will be useful.

9.1.2 Abstract Dynamic Programs

In this section, we formalize abstract dynamic programs and present fundamental optimality results. Section 9.1.2.1 starts the ball rolling with an informal overview.

9.1.2.1 Prelude

We saw in Section 8.1 that a globally stable RDP yields a set of feasible policies $\Sigma$ and, for each $\sigma \in \Sigma$ , a policy operator $T_\sigma$ defined on the value space $V \subset \RR^\Xsf$ . Notice that the dynamic program is fully specified by the family of operators $\{T_\sigma\}_{\sigma \in \Sigma}$ and the space $V$ that they act on. From this set of operators we obtain the set of lifetime values $\{v_\sigma\}_{\sigma \in \Sigma}$ , with each $v_\sigma$ uniquely identified as a fixed point of $T_\sigma$ . These lifetime values define the value function $v^*$ as the pointwise maximum $v^* = \vee_\sigma \, v_\sigma$ . An optimal policy is then defined as a $\sigma \in \Sigma$ obeying $v_\sigma = v^*$ .

To shed unnecessary structure before the main optimality proofs, a natural idea is to start directly with an abstract set of “policy operators” $\{T_\sigma\}$ acting on some set $V$ . One can then define lifetime values and optimality as in the previous paragraph and start to investigate conditions on the family of operators $\{T_\sigma\}$ that lead to optimality.

We use these ideas as our starting point, beginning with an arbitrary family $\{T_\sigma\}$ of operators on a partially ordered set.

9.1.2.2 Defining ADPs

An abstract dynamic program (ADP) is a pair $\aA = (V, \{T_\sigma\}_{\sigma \in \Sigma})$ such that

(i) $V = (V, \preceq)$ is a partially ordered set,

(ii) $\{T_\sigma\} \coloneq \{T_\sigma\}_{\sigma \in \Sigma}$ is a family of self-maps on $V$ , and

(iii) for all $v \in V$ , the set $\{T_\sigma \, v\}_{\sigma \in \Sigma}$ has both a least and greatest element.

Elements of the index set $\Sigma$ are called policies and elements of $\{T_\sigma \}$ are called policy operators. Given $v \in V$ , a policy $\sigma$ in $\Sigma$ is called $v$ -greedy if $T_{\sigma} \, v \succeq T_\tau \, v$ for all $\tau \in \Sigma$ . Existence of a greatest element in (iii) of the definition is equivalent to the statement that each $v \in V$ has at least one $v$ -greedy policy.
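To make the role of condition (iii) concrete, here is a minimal sketch that searches a finite family of policy operators for $v$-greedy policies under the pointwise order on $\RR^2$. The four operators and their labels are hypothetical, chosen only so that one family has greatest elements and a sub-family does not.

```python
def leq(u, v):
    """Pointwise partial order on tuples of equal length."""
    return all(a <= b for a, b in zip(u, v))

def greedy_policies(v, operators, order=leq):
    """Return every sigma with T_tau v <= T_sigma v for all tau."""
    images = {sigma: T(v) for sigma, T in operators.items()}
    return [sigma for sigma, Tsv in images.items()
            if all(order(Ttv, Tsv) for Ttv in images.values())]

# Hypothetical policy operators on V = R^2 (labels "a".."d" are illustrative).
operators = {
    "a": lambda v: (1 + 0.5 * v[0], 0.5 * v[1]),
    "b": lambda v: (0.5 * v[0], 1 + 0.5 * v[1]),
    "c": lambda v: (1 + 0.5 * v[0], 1 + 0.5 * v[1]),
    "d": lambda v: (0.5 * v[0], 0.5 * v[1]),
}

v = (0.0, 0.0)
full_family = greedy_policies(v, operators)
sub_family = greedy_policies(v, {k: operators[k] for k in "ab"})
print(full_family, sub_family)
```

In the full family, policy `c` is greedy at every $v$ and `d` yields the least element, so condition (iii) holds. In the sub-family containing only `a` and `b`, the images $T_a \, v$ and $T_b \, v$ are never comparable, no greedy policy exists, and the pair fails to be an ADP.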

Example 9.1.1 (RDPs generate ADPs)

Let $\rR = (\Gamma, V, B)$ be an RDP with finite state space $\Xsf$ , as defined in Section 8.1.1. For each $\sigma$ in the feasible policy set $\Sigma$ , let $T_\sigma$ be the corresponding policy operator, defined at $v \in V$ by $(T_\sigma \, v)(x) = B(x, \sigma(x), v)$ . The pair $\aA_{\rR} \coloneq (V, \{T_\sigma\})$ is an ADP, since $V$ is partially ordered by $\leq$ , $T_\sigma$ is a self-map on $V$ for all $\sigma \in \Sigma$ , and, given $v \in V$ , choosing $\bar \sigma \in \Sigma$ such that $\bar \sigma(x) \in \argmax_{a \in \Gamma(x)} B(x, a, v)$ for all $x \in \Xsf$ produces a $v$ -greedy policy and a greatest element for $\{T_\sigma \, v\}$ (cf. Exercise 8.1.7). A least element of $\{T_\sigma \, v\}$ can be generated by replacing “argmax” with “argmin.”

In the setting of Example 9.1.1, we call $\aA_{\rR}$ the ADP generated by $\rR$ .

We have just shown that RDPs are ADPs. But there are also ADPs that do not fit naturally into the RDP framework. The next two examples illustrate. In these examples, the Bellman equation does not match the RDP Bellman equation $v(x) = \max_{a \in \Gamma(x)} B(x, a, v)$ due to the inverted order of expectation and maximization.

Example 9.1.3

Recall the $Q$ -factor MDP Bellman operator, which takes the form

(Sq)(x, a) = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')}q(x', a') P(x, a, x'),

(9.1)

with $q \in \RR^\Gsf$ and $(x,a) \in \Gsf$. (We are repeating (5.39).) The $Q$ -factor policy operators $\{S_\sigma\}$ corresponding to (9.1) are given by

(S_\sigma \, q)(x, a) = r(x, a) + \beta \sum_{x'} q(x', \sigma(x')) P(x, a, x') \qquad ((x,a) \in \Gsf).

(9.2)

Each $S_\sigma$ is a self-map on $\RR^\Gsf = (\RR^\Gsf, \leq)$ . If $q \in \RR^\Gsf$ and $\sigma \in \Sigma$ is such that $\sigma(x) \in \argmax_{a \in \Gamma(x)}q(x, a)$ for all $x \in \Xsf$ , then $S_\sigma \, q \geq S_{\tau} \, q$ on $\Gsf$ for all $\tau \in \Sigma$ . Hence $\sigma$ is $q$ -greedy and $\aA \coloneq (\RR^\Gsf, \{S_\sigma\})$ is an ADP.
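The greedy property in Example 9.1.3 can be checked numerically. The sketch below implements (9.1) and (9.2) for a hypothetical two-state, two-action MDP in which every action is feasible at every state; the rewards, transition probabilities and the test point `q` are all illustrative assumptions.

```python
from itertools import product

# Hypothetical MDP; Gamma(x) = A for every x is assumed.
X, A = [0, 1], [0, 1]
beta = 0.9
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.5, (1, 1): 2.0}      # r(x, a)
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],                  # P(x, a, .)
     (1, 0): [0.5, 0.5], (1, 1): [0.3, 0.7]}

def S_sigma(q, sigma):
    """Q-factor policy operator, as in (9.2)."""
    return {(x, a): r[x, a] + beta * sum(q[xp, sigma[xp]] * P[x, a][xp] for xp in X)
            for (x, a) in r}

def S(q):
    """Q-factor Bellman operator, as in (9.1)."""
    return {(x, a): r[x, a]
            + beta * sum(max(q[xp, ap] for ap in A) * P[x, a][xp] for xp in X)
            for (x, a) in r}

def q_greedy(q):
    """Pick sigma(x) in argmax_a q(x, a), taking the first maximizer."""
    return tuple(max(A, key=lambda a: q[x, a]) for x in X)

q = {(0, 0): 0.3, (0, 1): 1.2, (1, 0): -0.4, (1, 1): 0.1}
sigma = q_greedy(q)

# S_sigma q dominates S_tau q for every policy tau (small tolerance for rounding).
dominates = all(S_sigma(q, sigma)[k] >= S_sigma(q, tau)[k] - 1e-12
                for tau in product(A, repeat=len(X)) for k in r)
# The greatest element of {S_tau q} coincides with the Bellman operator S q.
attains = all(abs(S_sigma(q, sigma)[k] - S(q)[k]) < 1e-12 for k in r)
print(sigma, dominates, attains)
```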

Example 9.1.4

In reinforcement learning and related fields the $Q$ -factor approach from Example 9.1.3 has been extended to risk-sensitive decision processes (see, e.g., Fei et al. (2021)). The corresponding $Q$ -factor Bellman equation is given by

f(x, a) = r(x, a) + \frac{\beta}{\theta} \ln \left\{ \sum_{x'} \exp \left[ \theta \max_{a' \in \Gamma(x')} f(x', a') \right] P(x, a, x') \right\} \qquad ((x,a) \in \Gsf).

(9.3)

The policy operators over risk-sensitive $Q$ -factors take the form

(Q_\sigma \, f)(x, a) = r(x, a) + \frac{\beta}{\theta} \ln \left[ \sum_{x'} \exp \left[ \theta f(x', \sigma(x')) \right] P(x, a, x') \right],

(9.4)

where $f \in \RR^\Gsf$ and $\sigma \in \Sigma$ . An argument similar to the one given in Example 9.1.3 confirms that each $f \in \RR^\Gsf$ has an $f$ -greedy policy. Hence $(\RR^\Gsf, \{Q_\sigma\})$ is an ADP.
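The same check carries over to the risk-sensitive policy operators in (9.4). The sketch below is again built on a hypothetical two-state, two-action MDP with all actions feasible, and it tries one negative and one positive value of $\theta$; the specific numbers are assumptions made for illustration.

```python
import math
from itertools import product

# Hypothetical MDP; Gamma(x) = A for every x is assumed.
X, A = [0, 1], [0, 1]
beta = 0.9
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.5, (1, 1): 2.0}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.3, 0.7]}

def Q_sigma(f, sigma, theta):
    """Risk-sensitive Q-factor policy operator, as in (9.4); theta is nonzero."""
    return {(x, a): r[x, a] + (beta / theta) * math.log(
                sum(math.exp(theta * f[xp, sigma[xp]]) * P[x, a][xp] for xp in X))
            for (x, a) in r}

def f_greedy(f):
    """Pick sigma(x) in argmax_a f(x, a), taking the first maximizer."""
    return tuple(max(A, key=lambda a: f[x, a]) for x in X)

f = {(0, 0): 0.3, (0, 1): 1.2, (1, 0): -0.4, (1, 1): 0.1}
sigma = f_greedy(f)

# For each theta, check that Q_sigma f dominates Q_tau f for every policy tau.
dominates = {
    theta: all(Q_sigma(f, sigma, theta)[k] >= Q_sigma(f, tau, theta)[k] - 1e-12
               for tau in product(A, repeat=len(X)) for k in r)
    for theta in (-0.5, 0.5)
}
print(dominates)
```

The check passes for both signs of $\theta$ because the map $t \mapsto (1/\theta) \ln \sum_{x'} \exp(\theta t_{x'}) P(x, a, x')$ is increasing in each $t_{x'}$ whenever $\theta \neq 0$.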

In Chapter 10 we will see that continuous time dynamic programs can also be viewed as ADPs.

9.2 Optimality

In this section, we study optimality properties of ADPs, aiming for generalizations of the foundational results of dynamic programming. To achieve this aim we need to define optimality and provide sufficient conditions.

9.2.1 Max-Optimality

We begin with maximization. Later, in Section 9.2.3, we will show that results for minimization problems are simple corollaries of maximization results.

9.2.1.1 Lifetime Values

The objective of dynamic programming is to optimize lifetime value. But what is lifetime value in this abstract context? Suppose that, for an ADP $(V, \{T_\sigma\})$ and fixed $\sigma \in \Sigma$ , the policy operator $T_\sigma$ has a unique fixed point. In this setting, we write $v_\sigma$ for the fixed point of $T_\sigma$ and call it the $\sigma$ -value function. We interpret it as the lifetime value of following policy $\sigma$ . A closely related interpretation was discussed at length for RDPs in Section 8.1.2.1 and the situation here is analogous.

We call an ADP $\aA \coloneq (V, \{T_\sigma\})$ well-posed if every policy operator $T_\sigma$ has a unique fixed point in $V$ . In view of the preceding discussion on lifetime values, well-posedness is a minimum requirement for constructing an optimality theory around ADPs.

9.2.1.2 Operators

Let $\aA = (V, \{T_\sigma\})$ be an ADP. We set

\tmax \, v \coloneq \bigvee_\sigma T_\sigma \, v \qquad (v \in V),

(9.5)

and call $\tmax$ the Bellman operator generated by $\aA$ . Note that $\tmax$ is a well-defined self-map on $V$ by part (iii) of the definition of ADPs (existence of greedy policies). A function $v \in V$ is said to satisfy the Bellman equation if it is a fixed point of $\tmax$ .

The definition of $\tmax$ in (9.5) includes all of the Bellman operators we have met as special cases. For example, consider an RDP $\rR = (\Gamma, V, B)$ with Bellman operator $(Tv)(x) = \max_{a \in \Gamma(x)}B(x,a,v)$ . We can write $T$ as $\bigvee_\sigma T_\sigma \, v$ , as shown in Exercise 8.1.8. Thus, the Bellman operator of the RDP agrees with the Bellman operator $\tmax$ of the corresponding ADP $\aA_{\rR}$ .

Below we consider Howard policy iteration (HPI) as an algorithm for solving for optimal policies of ADPs. We use precisely the same instruction set as for the RDP case, as shown in Algorithm 8.1. To further clarify the algorithm, we define a map $\Hmax$ from $V$ to $\{v_\sigma\}$ via $\Hmax \, v = v_\sigma$ where $\sigma$ is $v$ -greedy. Iterating with $\Hmax$ generates the value sequence associated with Howard policy iteration.^[1] In what follows, we call $\Hmax$ the Howard operator generated by the ADP.
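The Bellman and Howard operators can be sketched for a small finite ADP as follows. This is a minimal illustration under stated assumptions: the ADP is generated by a hypothetical two-state, two-action MDP with affine policy operators, each lifetime value $v_\sigma$ is approximated by iterating the contraction $T_\sigma$ rather than solved exactly, and ties between greedy policies are broken by taking the first maximizer.

```python
from itertools import product

# Hypothetical two-state, two-action MDP used to build a finite ADP.
X, A = [0, 1], [0, 1]
beta = 0.9
r = [[1.0, 0.0], [0.5, 2.0]]                   # r[x][a]
P = [[[0.9, 0.1], [0.2, 0.8]],                 # P[x][a][x']
     [[0.5, 0.5], [0.3, 0.7]]]

def B(x, a, v):
    return r[x][a] + beta * sum(P[x][a][xp] * v[xp] for xp in X)

def T_sigma(v, sigma):
    return [B(x, sigma[x], v) for x in X]

def T_max(v):
    """Bellman operator: the greatest element of {T_sigma v}."""
    return [max(B(x, a, v) for a in A) for x in X]

def greedy(v):
    # Fixed selection rule: the first maximizer in enumeration order.
    return tuple(max(A, key=lambda a: B(x, a, v)) for x in X)

def lifetime_value(sigma, n_iter=1000):
    # Approximates the fixed point v_sigma by iterating T_sigma.
    v = [0.0 for _ in X]
    for _ in range(n_iter):
        v = T_sigma(v, sigma)
    return v

def H_max(v):
    """Howard operator: v -> v_sigma with sigma v-greedy."""
    return lifetime_value(greedy(v))

def hpi(v, max_steps=100):
    """Iterate the Howard operator until the greedy policy stops changing."""
    sigma = greedy(v)
    for _ in range(max_steps):
        v = lifetime_value(sigma)
        new_sigma = greedy(v)
        if new_sigma == sigma:
            break
        sigma = new_sigma
    return v, sigma

v_star, sigma_star = hpi([0.0, 0.0])
# v_star should be the greatest element of the set of lifetime values.
is_greatest = all(v_star[x] >= lifetime_value(tau)[x] - 1e-9
                  for tau in product(A, repeat=len(X)) for x in X)
print(sigma_star, is_greatest)
```

In this instance the returned value is, up to rounding, a fixed point of both `T_max` and `H_max`.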

9.2.1.3 Properties

Let $\aA \coloneq (V, \{T_\sigma\}_{\sigma \in \Sigma})$ be an ADP. We call $\aA$

finite if $\Sigma$ is a finite set,
order stable if every policy operator $T_\sigma$ is order stable on $V$ , and
max-stable if $\aA$ is order stable and $\tmax$ has at least one fixed point in $V$ .

Obviously max-stable $\implies$ order stable $\implies$ well-posed.

Regarding the definition of max-stability, existence of a fixed point of $\tmax$ in $V$ is a high-level assumption that can be challenging to verify in applications. At the same time, our main concern in the present volume is the case where $\aA$ is finite, and, in this setting, order stability is enough:

Proposition 9.2.1 is proved in Section B.4.

Order stability is central to the optimality results just stated. While order stability is a somewhat nonstandard condition, the next result shows that, at least in simple settings, order stability is necessary for any discussion of optimality.

Proof

Let $\aA$ be as stated, with $V = [v_1, v_2]$ for some $v_1, v_2$ in $\RR^\Xsf$ with $v_1 \leq v_2$ . Obviously (ii) $\implies$ (i). Regarding (i) $\implies$ (ii), let $\aA$ be well-posed and pick any policy operator $T_\sigma$ . Since $\aA$ is well-posed, $T_\sigma$ has a unique fixed point $v_\sigma$ in $V$ . Suppose $v \in V$ with $T_\sigma \, v \leq v$ . Since $T_\sigma$ is order preserving, $u \leq v$ implies $T_\sigma \, u \leq T_\sigma \, v \leq v$ , so $T_\sigma$ is a self-map on $[v_1, v]$ . By the Knaster–Tarski theorem, $T_\sigma$ has at least one fixed point in $[v_1, v]$ . By uniqueness, that fixed point is $v_\sigma$ . Hence $v_\sigma \leq v$ and downward stability holds. Upward stability can be confirmed via a similar argument. Hence $\aA$ is order stable. ◻

9.2.1.4 Max-Optimality Results

Let $\aA = (V, \{T_\sigma\})$ be a well-posed ADP with $\sigma$ -value functions $\{ v_\sigma \}_{\sigma \in \Sigma}$ . We define

V_\Sigma \coloneq \{v_\sigma\}_{\sigma \in \Sigma} \quad \text{and} \quad V_u \coloneq \setntn{v \in V}{v \preceq \tmax \, v}.

If $V_\Sigma$ has a greatest element, then we denote it by $\vmax$ and call it the value function generated by $\aA$ . In this setting, a policy $\sigma \in \Sigma$ is called optimal for $\aA$ if $v_\sigma = \vmax$ . We say that $\aA$ obeys Bellman’s principle of optimality if

\sigma \in \Sigma \text{ is optimal for } \aA \quad \iff \quad \sigma \text{ is } \vmax \text{-greedy}.

These definitions are direct generalizations of the corresponding definitions for RDPs discussed in Chapter 8.

We can now state our main optimality result for ADPs.

Theorem 9.2.4 informs us that finite well-posed ADPs have first-rate optimality properties under a relatively mild stability condition. In Section 9.2.2 we use Theorem 9.2.4 to prove all optimality results for RDPs stated in Chapter 8. The proof of Theorem 9.2.4 is given in Section B.4. Note that (iv) follows directly from (i) and is included only for completeness.

9.2.1.5 General States

This volume focuses on dynamic programming problems with finite states. Here we restrict ourselves to one high-level result for general state spaces.

Proposition 9.2.5 tells us that we can drop finiteness of policy set $\Sigma$ (which is implied by finite states and actions) whenever the Bellman operator has at least one fixed point. Various fixed-point methods are available for establishing this existence. We defer further details until Volume 2. Proposition 9.2.5 is proved in Section B.4.

9.2.1.6 Application: Mixed Strategies

This section discusses adding mixed strategies to an RDP. We will need to apply Proposition 9.2.5 to discuss optimality because the set of mixed strategies is not finite.

Let $\rR = (\Gamma, V, B)$ be an RDP with finite state space $\Xsf$ , finite action space $\Asf$ , policy set $\Sigma$ and Bellman operator $T$ (see Section 8.1.3). A mixed strategy for $\rR$ is a map $\phi$ sending $x \in \Xsf$ into a distribution $\phi_x \in \dD(\Asf)$ supported on $\Gamma(x)$ . In other words, for each $x \in \Xsf$ ,

\phi_x \colon \Asf \to [0, 1] \quad \text{and} \quad \sum_{a \in \Gamma(x)} \phi_x(a) = 1.

Let $\Phi$ be the set of all mixed strategies for $\rR$ . For each mixed strategy $\phi \in \Phi$ , we introduce the policy operator on $V$ defined by

(\hat T_\phi \, v)(x) = \sum_{a \in \Asf} B(x, a, v) \phi_x(a) \qquad (v \in V, \; x \in \Xsf).

The right-hand side is the expected lifetime value from current state $x$ , when the current action is drawn from $\phi_x$ and future states are evaluated via $v$ .

It follows from this discussion that $\aA_M \coloneq (V, \{\hat T_\phi\}_{\phi \in \Phi})$ is an ADP (where “M” stands for “mixed”), and that the Bellman operator $\hat T$ associated with the ADP $\aA_M$ is given by

(\hat T v)(x) = \max_{a \in \Gamma(x)} B(x, a, v) = (Tv)(x) \qquad (v \in V, \; x \in \Xsf).

(9.6)

Let us assume for simplicity that $\rR$ is contracting (see Section 8.2.1), with modulus of contraction $\beta \in (0,1)$ . Assume also that $V$ is closed in $\RR^\Xsf$ . As a result, the value function $v^*$ for $\rR$ exists in $V$ and is the unique fixed point of $T$ in $V$ (Corollary 8.2.2).

By Exercise 9.2.6, the ADP $\aA_M$ is max-stable (since globally stable operators are order stable – see Lemma 9.1.1 – and the Bellman operator $\hat T$ has a fixed point). Hence, by Proposition 9.2.5, the value function $\hat v^*$ for $\aA_M$ exists in $V$ and is the unique fixed point of $\hat T$ in $V$ . But, by (9.6), $\hat T$ and $T$ agree on $V$ . Hence $\hat v^* = v^*$ . We conclude as follows: while the set of mixed strategies is larger than the set of pure strategies (i.e., deterministic policies), the maximal lifetime value from each state is the same.
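The identity (9.6) rests on the fact that an average of the values $B(x, a, v)$ never exceeds their maximum and that the maximum is attained by a degenerate distribution. The sketch below checks this numerically for a hypothetical two-state, two-action RDP of MDP type with all actions feasible; the primitives, the test point `v` and the random draws are illustrative assumptions.

```python
import random

# Hypothetical primitives; Gamma(x) = A for every x is assumed.
X, A = [0, 1], [0, 1]
beta = 0.9
r = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.3, 0.7]]]

def B(x, a, v):
    return r[x][a] + beta * sum(P[x][a][xp] * v[xp] for xp in X)

def T(v):
    """Bellman operator of the underlying RDP."""
    return [max(B(x, a, v) for a in A) for x in X]

def T_hat(v, phi):
    """Mixed-strategy policy operator; phi[x][a] is the probability of a at x."""
    return [sum(B(x, a, v) * phi[x][a] for a in A) for x in X]

def random_mixed_strategy(rng):
    phi = []
    for _ in X:
        w = [rng.random() for _ in A]
        total = sum(w)
        phi.append([wi / total for wi in w])
    return phi

rng = random.Random(0)
v = [3.0, -1.0]

# No mixed strategy does better than T v at any state (up to rounding).
never_exceeds = all(T_hat(v, random_mixed_strategy(rng))[x] <= T(v)[x] + 1e-12
                    for _ in range(1000) for x in X)

# The degenerate strategy concentrated on greedy actions attains T v.
greedy = [max(A, key=lambda a: B(x, a, v)) for x in X]
phi_pure = [[1.0 if a == greedy[x] else 0.0 for a in A] for x in X]
attains = all(abs(T_hat(v, phi_pure)[x] - T(v)[x]) < 1e-12 for x in X)
print(never_exceeds, attains)
```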

9.2.2 Optimality Results for RDPs

In this section, we return to the optimality properties of RDPs, as first discussed in Section 8.1.3.3. Our aim is to connect the ADP optimality results from Section 9.2.1.4 to the special case of RDPs and, through this process, complete the proofs of our key RDP optimality results from Chapter 8.

9.2.2.1 OPI Convergence

The first step is to provide some preliminary results related to OPI convergence, where OPI obeys the algorithm given. Throughout, $\rR = (\Gamma, V, B)$ is a globally stable RDP with policy set $\Sigma$ , policy operators $\{T_\sigma\}$ , Bellman operator $T$ , and value function $v^*$ . As usual, $v_\sigma$ denotes the unique fixed point of $T_\sigma$ for all $\sigma \in \Sigma$ . In the results that follow, $m$ is a fixed natural number indicating the OPI step size and $H$ and $W_m$ are as defined in Section 8.1.3.2.

Proof

Regarding the self-map property, pick any $v \in V_u$ . Since $T$ and $T_\sigma$ are order preserving, $v \leq Tv$ and $\sigma$ is $v$ -greedy, we have

W_m v = T_\sigma T_\sigma^{m-1} v \leq T T_\sigma^{m-1} v \leq T T_\sigma^{m-1} T v = T T_\sigma^m v = TW_m v .

Hence $W_m v \in V_u$ and $W_m$ is invariant on $V_u$ .

To obtain the inequality $Tv \leq W_m v$ , fix $v \in V_u$ . Since $T_\sigma$ is order preserving, $v \leq T v$ and $\sigma$ is $v$ -greedy, we have

T_\sigma^{m-1} v \leq T_\sigma^{m-1} T v = T_\sigma^{m-1} T_\sigma \, v = W_m v.

Continuing in the same manner gives $T_\sigma^{m-j} v \leq W_m v$ for $j < m$ and, in particular, $T_\sigma v \leq W_m v$ . Because $\sigma$ is $v$ -greedy, this yields $Tv \leq W_m v$ .

Regarding the second inequality, we use the fact that $T_\sigma \leq T$ on $V$ and $T$ and $T_\sigma$ are both order preserving to obtain $W_m v = T^m_\sigma v \leq T^m v$ (see Exercise 2.2.36). ◻

Proof

Fix $v \in V_u$ . Let $v_k = T^k v$ and $w_k = W_m^k v$ for all $k$ . The claim is true at $k=1$ by Lemma 9.2.7. Suppose it is true at $k-1$ , so that $v_{k-1} \leq w_{k-1}$ . We claim it is true at $k$ as well. To show this we take $\sigma$ to be $w_{k-1}$ -greedy and, using the fact that $v \in V_u$ and $W_m V_u \subset V_u$ , obtain $w_{k-1} \leq T w_{k-1} = T_\sigma \, w_{k-1}$ . Since $T_\sigma$ is order preserving, this means that the sequence $(T^\ell_\sigma \, w_{k-1})_{\ell \in \NN}$ is increasing. As a result, we have

v_k = T v_{k-1} \leq T w_{k-1} = T_\sigma \, w_{k-1} \leq T_\sigma^m \, w_{k-1} = W_m w_{k-1} = w_k.

This proves the claim in Lemma 9.2.8. ◻
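The inequalities established in the two proofs above, namely $Tv \leq W_m v \leq T^m v$ and $T^k v \leq W_m^k v$ for $v \in V_u$, can be illustrated numerically. The sketch below uses a hypothetical two-state, two-action MDP and the starting point $v = 0$, which lies in $V_u$ because the assumed rewards are nonnegative. A small tolerance absorbs floating-point rounding.

```python
# Hypothetical primitives; Gamma(x) = A for every x is assumed.
X, A = [0, 1], [0, 1]
beta = 0.9
r = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.3, 0.7]]]

def B(x, a, v):
    return r[x][a] + beta * sum(P[x][a][xp] * v[xp] for xp in X)

def T(v):
    return [max(B(x, a, v) for a in A) for x in X]

def T_sigma(v, sigma):
    return [B(x, sigma[x], v) for x in X]

def greedy(v):
    return tuple(max(A, key=lambda a: B(x, a, v)) for x in X)

def W(v, m):
    """OPI operator W_m: apply T_sigma m times, where sigma is v-greedy."""
    sigma = greedy(v)
    for _ in range(m):
        v = T_sigma(v, sigma)
    return v

def power(op, v, k):
    """Apply the operator op to v a total of k times."""
    for _ in range(k):
        v = op(v)
    return v

def leq(u, w, tol=1e-12):
    return all(a <= b + tol for a, b in zip(u, w))

m = 3
v = [0.0, 0.0]
in_V_u = leq(v, T(v))

# T v <= W_m v <= T^m v
sandwich_ok = leq(T(v), W(v, m)) and leq(W(v, m), power(T, v, m))
# W_m maps V_u into itself
invariant_ok = leq(W(v, m), T(W(v, m)))
# T^k v <= W_m^k v for k = 1, ..., 7
dominance_ok = all(leq(power(T, v, k), power(lambda u: W(u, m), v, k))
                   for k in range(1, 8))
print(in_V_u, sandwich_ok, invariant_ok, dominance_ok)
```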

Proof

Let $\rR$ be as stated and fix $(v_k) \subset V_u$ with $v_k \to v^*$ as $k \to \infty$ . Let $\Sigma^*$ be the set of optimal policies and let $\Sigma' \coloneq \Sigma \setminus \Sigma^*$ . Since $\Sigma'$ is finite, we have

e \coloneq \min_{\sigma \in \Sigma'} \|v_\sigma - v^*\|_\infty > 0.

Choose $K \in \NN$ such that $\|v_k - v^*\|_\infty < e$ for all $k \geq K$ . Fix $k \geq K$ and let $\sigma$ be $v_k$ -greedy. We claim that $\sigma$ is optimal. Indeed, since $v_k \in V_u$ , we have $v_k \leq T v_k = T_\sigma \, v_k$ , so, by upward stability, $v_k \leq v_\sigma$ . As a result,

|v^* - v_\sigma | = v^* - v_\sigma \leq v^* - v_k .

Hence $\|v^* - v_\sigma\|_\infty \leq \| v^* - v_k \|_\infty < e$ . But then $\sigma \notin \Sigma'$ , so $\sigma$ is optimal. ◻

9.2.2.2 Proofs of RDP Results

In Section 8.1.3 we stated two key optimality results for RDPs, the first concerning globally stable RDPs (Theorem 8.1.1) and the second concerning bounded RDPs (Theorem 8.1.2). Let’s now prove them. In what follows, $\rR = (\Gamma, V, B)$ is a well-posed RDP and $\aA_\rR \coloneq (V, \{T_\sigma\})$ is the ADP generated by $\rR$ .

Proof

Proof of Theorem 8.1.1.

Let $\rR$ be globally stable. Then $\aA_\rR$ is finite and max-stable, by Corollary 9.2.2. Hence the optimality and HPI convergence claims in Theorem 8.1.1 follow from Theorem 9.2.4.

Regarding OPI convergence, let $(v_k, \sigma_k)$ be as given in (8.18). From Lemma 9.2.6 we obtain $T^k v_0 \to \vmax$ . Also, from Lemma 9.2.8, we have $T^k v_0 \leq v_k$ for all $k$ . In fact we also have $T^k v_0 \leq v_k \leq \vmax$ for all $k$ , where the second inequality holds because $W_m$ has the property $W_m w \leq \vmax$ whenever $w \leq \vmax$ . (If $w \leq \vmax$ , then, taking $\sigma$ to be $w$ -greedy, we have $T_\sigma \, w = T w \leq T \vmax = \vmax$ , so, iterating $m$ times on this inequality, $W_m w \leq \vmax$ .)

The convergence $T^k v_0 \to \vmax$ and the bound $T^k v_0 \leq v_k \leq \vmax$ for all $k$ together imply $v_k \to \vmax$ as $k \to \infty$ . Given such convergence, Lemma 9.2.10 implies that there exists a $K \in \NN$ such that $\sigma_k$ is optimal whenever $k \geq K$ . ◻
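The OPI convergence argument can be traced on a small example. The sketch below assumes that OPI takes $\sigma_k$ to be $v_{k-1}$-greedy and sets $v_k = T_{\sigma_k}^m v_{k-1}$, which is our reading of the OPI recursion rather than a quotation of (8.18). The two-state MDP, the step size $m = 3$ and the approximation of the value function by long value function iteration are further assumptions made for illustration.

```python
# Hypothetical primitives; Gamma(x) = A for every x is assumed.
X, A = [0, 1], [0, 1]
beta = 0.9
r = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.3, 0.7]]]

def B(x, a, v):
    return r[x][a] + beta * sum(P[x][a][xp] * v[xp] for xp in X)

def T(v):
    return [max(B(x, a, v) for a in A) for x in X]

def T_sigma(v, sigma):
    return [B(x, sigma[x], v) for x in X]

def greedy(v):
    return tuple(max(A, key=lambda a: B(x, a, v)) for x in X)

def opi(v, m, n_steps):
    """Return the value sequence (v_k) and policy sequence (sigma_k)."""
    values, policies = [], []
    for _ in range(n_steps):
        sigma = greedy(v)
        for _ in range(m):
            v = T_sigma(v, sigma)
        values.append(v)
        policies.append(sigma)
    return values, policies

# Approximate the value function by iterating T many times from zero.
v_star = [0.0, 0.0]
for _ in range(2000):
    v_star = T(v_star)

values, policies = opi([0.0, 0.0], m=3, n_steps=100)

# Every OPI iterate stays below the value function (up to rounding).
bounded = all(vk[x] <= v_star[x] + 1e-9 for vk in values for x in X)
# The final iterate is close to the value function.
gap = max(abs(values[-1][x] - v_star[x]) for x in X)
print(policies[:3], bounded, gap)
```

In this instance the policy sequence changes once and then remains at a single policy, consistent with the existence of a finite $K$ beyond which $\sigma_k$ is optimal.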

9.2.3 Min-Optimality

Until now, our ADP theory has focused on maximization of lifetime values. Now we turn to minimization. One of our aims is to prove the RDP minimization results in Section 8.3.5. We will see that ADP minimization results are easily recovered from ADP maximization results via order duality.

Let $\aA = (V, \{T_\sigma\})$ be a well-posed ADP and let $V_\Sigma \coloneq \{v_\sigma\}$ be the set of $\sigma$ -value functions. We call $\sigma \in \Sigma$ min-optimal for $\aA$ if $v_\sigma$ is a least element of $V_\Sigma$ . When $V_\Sigma$ has a least element we denote it by $\vmin$ and call it the min-value function generated by $\aA$ . A policy $\sigma$ is called $v$ -min-greedy if $T_\sigma \, v \preceq T_\tau \, v$ for all $\tau \in \Sigma$ . Existence of a $v$ -min-greedy policy for each $v \in V$ is guaranteed by the definition of ADPs.

We say that $\aA$ obeys Bellman’s principle of min-optimality if

\sigma \in \Sigma \text{ is min-optimal for } \aA \quad \iff \quad \sigma \text{ is } \vmin \text{-min-greedy}.

We define the Bellman min-operator corresponding to $\aA$ as the self-map $\tmin$ on $V$ defined by $\tmin v = \bigwedge_\sigma T_\sigma \, v$ . This map is well-defined because $\{T_\sigma \, v\}_{\sigma \in \Sigma}$ has a least element and, moreover, $\sigma \in \Sigma$ is $v$ -min-greedy if and only if $T_\sigma \, v = \tmin \, v$ .

We say that $v$ satisfies the Bellman min-equation if $\tmin v = v$ . We call $\aA$ min-stable if $\aA$ is order stable and $\tmin$ has at least one fixed point in $V$ . We define $\Hmin$ from $V$ to $\{v_\sigma\}$ via $\Hmin \, v = v_\sigma$ where $\sigma$ is $v$ -min-greedy and call $\Hmin$ the Howard min-operator generated by $\aA$ . Iterating with $\Hmin$ is called min-HPI.

Results analogous to Theorem 9.2.4 hold for the minimization case.

To prove Theorem 9.2.11 we use order duality. Below, if $\aA \coloneq (V, \{T_\sigma\})$ is an ADP then its dual is

\aA^\partial \coloneq (V^\partial, \{T_\sigma\}) \; \text{ where } V^\partial \text{ is the order dual of } V.

In this setting, we let $\tmax^\partial$ be the Bellman operator for $\aA^\partial$ , $(\vmax)^\partial$ be the value function for $\aA^\partial$ , and so on. We note that $\aA$ is self-dual, in the sense that $(\aA^\partial)^\partial = \aA$ , since the same is true for $V$ .

To make our terminology more symmetric, in the remainder of this section we refer to maximization-based optimal policies as max-optimal, the Bellman operator $\tmax$ (defined by $\tmax \, v = \bigvee_\sigma T_\sigma \, v$ ) as the Bellman max-operator, and so on.

Exercise 9.2.7

Let $\aA$ be a well-posed ADP with dual $\aA^\partial$ . Verify the following.

(i) Given $v \in V$ , $\sigma \in \Sigma$ is $v$ -min-greedy for $\aA$ if and only if $\sigma$ is $v$ -max-greedy for $\aA^\partial$ ,

(ii) $\tmin = \tmax^\partial$ and $\tmin^\partial = \tmax$ ,

(iii) $\Hmin = \Hmax^\partial$ and $\Hmin^\partial = \Hmax$ ,

(iv) $\aA$ is order stable if and only if $\aA^\partial$ is order stable,

(v) $\aA$ is min-stable if and only if $\aA^\partial$ is max-stable, and, in this case, $\vmin = (\vmax)^\partial$ , and

(vi) $\sigma \in \Sigma$ is max-optimal for $\aA$ if and only if $\sigma$ is min-optimal for $\aA^\partial$ .

Solution to Exercise 9.2.7

Regarding (i), fix $v \in V$ . Policy $\sigma$ is $v$ -min-greedy for $\aA$ if and only if $T_\sigma \, v \preceq T_\tau \, v$ for all $\tau \in \Sigma$ , which is equivalent to $T_\sigma \, v \succeq^\partial T_\tau \, v$ for all $\tau \in \Sigma$ . Hence $\sigma$ is $v$ -min-greedy for $\aA$ if and only if $\sigma$ is $v$ -max-greedy for $\aA^\partial$ .

Regarding (ii)–(iii), fix $v \in V$ and let $\sigma$ be $v$ -min-greedy for $\aA$ (and hence $v$ -max-greedy for $\aA^\partial$ ). We then have $\tmax^\partial v = T_\sigma \, v = \tmin v$ . Hence $\tmax^\partial = \tmin$ . Similarly, at the same $v$ and with the same policy $\sigma$ , $\Hmax^\partial v$ is equal to $v_\sigma$ and so is $\Hmin \, v$ . A similar argument gives $\tmin^\partial = \tmax$ and $\Hmin^\partial = \Hmax$ .

Regarding (iv), Lemma 9.1.2 implies that $\aA$ is order stable if and only if $\aA^\partial$ is order stable.

Regarding (v), $\tmin = \tmax^\partial$ , so $\tmin$ has a fixed point in $V$ if and only if $\tmax^\partial$ has a fixed point in $V$ . By this fact and (iv), $\aA$ is min-stable if and only if $\aA^\partial$ is max-stable. Moreover, in this setting, we have $\vmin = \bigwedge_\sigma v_\sigma = \bigvee_\sigma^\partial v_\sigma = (\vmax)^\partial$ .

Part (vi) follows from similar analysis and details are left to the reader.

Self-duality implies corollaries to Exercise 9.2.7 which we treat as self-evident. For example, $\aA$ is max-stable if and only if $\aA^\partial$ is min-stable, which follows from part (v) and the fact that $(\aA^\partial)^\partial = \aA$ .
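Parts (i) and (ii) of Exercise 9.2.7 can be illustrated by passing the partial order around as a function and reversing it to form the dual. In the sketch below, the helper names and the four policy operators on $\RR^2$ are hypothetical; the point is only that greatest elements under the dual order are least elements under the original order.

```python
def leq(u, v):
    """Pointwise partial order on tuples of equal length."""
    return all(a <= b for a, b in zip(u, v))

def dual(order):
    """Order dual: u precedes v in the dual iff v precedes u originally."""
    return lambda u, v: order(v, u)

def greatest(elements, order):
    """Greatest element of a finite list under `order`, or None if absent."""
    for e in elements:
        if all(order(o, e) for o in elements):
            return e
    return None

def least(elements, order):
    """Least element of a finite list under `order`, or None if absent."""
    for e in elements:
        if all(order(e, o) for o in elements):
            return e
    return None

# Hypothetical policy operators on V = R^2.
operators = {
    "a": lambda v: (1 + 0.5 * v[0], 0.5 * v[1]),
    "b": lambda v: (0.5 * v[0], 1 + 0.5 * v[1]),
    "c": lambda v: (1 + 0.5 * v[0], 1 + 0.5 * v[1]),
    "d": lambda v: (0.5 * v[0], 0.5 * v[1]),
}

def t_max(v, order):
    return greatest([T(v) for T in operators.values()], order)

def t_min(v, order):
    return least([T(v) for T in operators.values()], order)

v = (0.0, 0.0)
# The min-operator of the ADP equals the max-operator of its dual, and vice versa.
min_is_dual_max = t_min(v, leq) == t_max(v, dual(leq))
max_is_dual_min = t_max(v, leq) == t_min(v, dual(leq))
# Reversing the order twice recovers the original order.
self_dual = t_max(v, dual(dual(leq))) == t_max(v, leq)
print(min_is_dual_max, max_is_dual_min, self_dual)
```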

Proof

Proof of Theorem 9.2.11.

Let $\aA$ be min-stable. By Exercise 9.2.7, the dual $\aA^\partial$ is max-stable. Hence all of the conclusions of the max-optimality result in Theorem 9.2.4 apply to $\aA^\partial$ . All that remains is to translate these max-optimality results for $\aA^\partial$ back to min-optimality results for $\aA$ .

Regarding claim (i) of the min-optimality results, max-optimality of $\aA^\partial$ implies that $(\vmax)^\partial$ exists in $V$ . But then $\vmin$ exists in $V$ , since, by Exercise 9.2.7, $\vmin = (\vmax)^\partial$ .

Regarding (ii), we know that $(\vmax)^\partial$ is the unique solution to $\tmax^\partial (\vmax)^\partial = (\vmax)^\partial$ , so, applying Exercise 9.2.7 again, we have $\tmin \, \vmin = \vmin$ .

The remaining steps of the proof are similar and left to the reader. ◻

9.3 Chapter Notes

As indicated in the notes for Chapter 8, our interest in abstract dynamic programming was inspired by Bertsekas (2022). This chapter generalizes his framework by switching to a “completely abstract” setting based on analysis of self-maps on partially ordered spaces. The material here is based on Sargent & Stachurski (2023). Earlier work on dynamic programming in a setting with no topology can be found in Kamihigashi (2014).

Footnotes

For $\Hmax$ to be well-defined, we must always select the same $v$ -greedy policy when the operator is applied to $v$ . We can use the axiom of choice to assign to each $v$ a designated $v$ -greedy policy, although, in applications, a simple rule usually suffices. For example, if $\Sigma$ is finite, we can enumerate the policy set $\Sigma$ and choose the first $v$ -greedy policy.

References

Fei, Y., Yang, Z., Chen, Y., & Wang, Z. (2021). Exponential Bellman equation and improved regret bounds for risk-sensitive reinforcement learning. Advances in Neural Information Processing Systems, 34, 20436–20446.
Bertsekas, D. (2022). Abstract Dynamic Programming (3rd ed.). Athena Scientific.
Sargent, T., & Stachurski, J. (2023). Completely abstract dynamic programming. arXiv, 2308.02148.
Kamihigashi, T. (2014). Elementary results on solutions to the Bellman equation of dynamic programming: Existence, uniqueness, and convergence. Economic Theory, 56, 251–273.
