ADPs on Pospace - Dynamic Programming Volume II: General States

In this chapter, we add topological structure to the value space and the policy operators. This allows us to provide sufficient conditions for optimality that are often easy to use in applications. Proofs of the theorems presented in this chapter leverage the poset ADP theory that we constructed in Chapter 2.

We begin by studying ADPs on pospace, then specialize to partially ordered metric spaces, where contractivity arguments become available. Minimization counterparts and a treatment of nonstationary policies round out the theoretical development. In Section 3.2 we apply the theory to discrete MDPs and Q-factors, optimal savings, no-discount optimal stopping, and the sequential analysis problem from Section 1.4.

3.1 Adding Topology

In this section we introduce ADPs in settings where the value space is a poset and also a pospace (i.e., partially ordered space, see Section A.5.3.1 for the definition and basic properties). We study optimality in settings where the policy operators have topological stability properties, as well as order properties. Throughout this chapter, $V$ is always assumed to be a pospace.

In Section 3.1.1 we study ADPs on pospace, leveraging global stability to obtain optimality and convergence results. In Section 3.1.2 we specialize to partially ordered metric spaces, where contraction-based arguments apply. Section 3.1.3 provides minimization counterparts and Section 3.1.4 treats nonstationary policies.

3.1.1 ADPs on Pospace

Let $V$ be a pospace and let $(V, \TT)$ be an ADP. Recall that a self-map $S$ on $V$ is called globally stable when $S$ has a unique fixed point $\bar v$ in $V$ and $S^n v \to \bar v$ as $n \to \infty$ for all $v \in V$ . (See Section A.2.2 for more background.) We say that

$(V, \TT)$ is globally stable if each $T_\sigma \in \TT$ is globally stable on $V$ .

Obviously, if $(V, \TT)$ is globally stable, then $(V, \TT)$ is well-posed. We also have the following useful preliminary result, which is an immediate consequence of Lemma A.5.19.

The next result shows that regularity and global stability together yield strong optimality properties.

Proof

Let $(V, \TT)$ and $T$ be as stated. Since $(V, \TT)$ is order stable (Lemma 3.1.1), part (i) follows from Corollary 2.1.6. Regarding convergence of VFI, fix $v \in V_U$ , let $\sigma$ be an optimal policy and let $\vmax$ be the value function. We have $T_\sigma \, v \preceq T v \preceq \vmax$ , where the last inequality is by Lemma 2.2.3 (since, by that lemma, $v \preceq \vmax$ and, therefore, $Tv \preceq T\vmax = \vmax$ ). From this chain of inequalities, combined with the fact that $T_\sigma \preceq T$ on $V$ , we obtain $T_\sigma^n \, v \preceq T^n v \preceq \vmax$ for all $n$ . As $\sigma$ is optimal and $(V, \TT)$ is globally stable, the policy operator $T_\sigma$ is globally stable with unique fixed point $\vmax$ . Because $T_\sigma^n \, v \preceq \vmax$ for all $n$ and $T_\sigma^n \, v \to \vmax$ , Lemma A.5.18 implies that the supremum of $T^n_\sigma \, v$ is $\vmax$ . From this fact and $T_\sigma^n \, v \preceq T^n v \preceq \vmax$ for all $n$ , the supremum of $T^n v$ is also $\vmax$ (Exercise A.1.3). Moreover, this sequence is increasing (because $v \in V_U$ ), which leads us to $T^n v \uparrow \vmax$ . Hence VFI converges. Since $(V, \TT)$ is regular and order stable (by Lemma 3.1.1), convergence of VFI implies convergence of OPI and HPI (Corollary 2.2.5). ◻

Theorem 3.1.2 is a high-level result because a condition (existence of a fixed point) is placed on the derived object $T$ , rather than the primitives $V$ and $\TT$ . Below, we present a sequence of results that leverage Theorem 3.1.2 while also placing all assumptions on primitives. The first is a relatively obvious but useful corollary, which adds convergence of VFI and OPI to Theorem 2.2.6.

The conditions of the next theorem are similar to those of Theorem 2.2.8, after replacing order stability and order continuity with global stability.

Proof

In view of Theorem 3.1.2, we need only show that $T$ has a fixed point in $V$ . To see this, we use the order bounded property to take a $u \in V$ such that $T_\sigma \, u \preceq u$ for all $\sigma \in \Sigma$ . Now let $v$ be any element of $V_\Sigma$ . Global stability implies order stability (Lemma 3.1.1), so we have $v \preceq u$ . Moreover, $v \preceq T v$ , since $V_\Sigma \subset V_U$ (by regularity and Lemma 2.1.2). Hence $T$ is a self-map on the order interval $[v, u]$ . Letting $v_n \coloneq T^n v$ and applying countably Dedekind completeness, we have $v_n \uparrow \bar v$ for some $\bar v \in [v, u]$ . We claim that $T \bar v = \bar v$ . To see this, first observe that $v_{n+1} = T v_n \preceq T \bar v$ for all $n$ , so $\bar v \preceq T \bar v$ . Letting $\sigma$ be $\bar v$ -greedy, we have $\bar v \preceq T \bar v = T_\sigma \, \bar v$ , so, by order stability, $\bar v \preceq v_\sigma$ . Also, letting $w_n \coloneq T_\sigma^n \, v$ , we have $w_n \preceq v_n \preceq \bar v \preceq v_\sigma$ for all $n \in \NN$ . By global stability, $w_n \to v_\sigma$ . Applying Lemma A.5.18, we also have $\vee_n w_n = v_\sigma$ . From this fact and the previous chain of inequalities, it must be that $\bar v = v_\sigma$ . Hence $T \bar v = T_\sigma \, \bar v = T_\sigma \, v_\sigma = v_\sigma = \bar v$ . This completes the proof that $\bar v$ is a fixed point of $T$ . ◻

3.1.2 ADPs on Metric Space

In this section we add more structure by assuming that our value space $V$ is in fact a partially ordered metric space, by which we mean that $V = (V, d, \preceq)$ where $d$ is a metric on $V$ and $(V, \preceq)$ is a pospace under the topology generated by $d$ . By construction, this specializes the earlier setting of Section 3.1.1, where $(V, \preceq)$ is just a pospace.

Throughout Section 3.1.2,

$(V, \TT)$ is an ADP and
$V = (V, d, \preceq)$ is a partially ordered metric space.

We will say that the ADP $(V, \TT)$ is semi-regular if there exists a closed subset $V_0$ of $V$ such that $V_0 \subset V_G$ and $T V_0 \subset V_0$ . In addition, we will say that VFI is geometrically convergent on $V_0$ if $\vmax$ exists and there is a $\beta \in (0,1)$ such that

d(T^n v, \vmax) = \OO(\beta^n) \text{ as } n \to \infty \text{ for each } v \in V_0.

Here $f(n) = \OO(\beta^n)$ means that there exists a $C < \infty$ with $f(n) \leq C \beta^n$ for all $n \in \NN$ .
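To make the rate condition concrete, here is a minimal numerical sketch. It assumes a hypothetical two-state, two-action discounted MDP (the arrays `r` and `P` are invented for illustration), takes $V_0 = V = \RR^2$ with the supremum distance, and approximates $\vmax$ by iterating $T$ far past convergence. It then checks the bound $d(T^n v, \vmax) \leq C \beta^n$ with $C = d(v, \vmax)$.

```python
import numpy as np

beta = 0.95

# Hypothetical primitives: rewards r[x, a] and transition kernel P[x, a, x'].
r = np.array([[1.0, 0.5],
              [0.0, 2.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])


def T(v):
    """Bellman operator: pointwise maximum of r + beta * P v over actions."""
    return np.max(r + beta * P @ v, axis=1)


# Approximate the value function by iterating well past convergence.
v_star = np.zeros(2)
for _ in range(2000):
    v_star = T(v_star)

# Track the sup-distance d(T^n v, v_star) from an arbitrary starting point.
v = np.array([50.0, -50.0])
C = np.max(np.abs(v - v_star))
errors, bounds = [], []
for n in range(60):
    errors.append(np.max(np.abs(v - v_star)))
    bounds.append(C * beta**n)
    v = T(v)

print(all(err <= b + 1e-9 for err, b in zip(errors, bounds)))
```

In this sketch the constant $C$ depends on the starting point $v$, which is consistent with the definition: the bound is required for each $v \in V_0$ separately.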

In stating the next theorem, we use the notion of sup-nonexpansiveness from Section A.5.3.2.

Proof

Let $V$ and $\TT$ have the stated properties. By completeness and Banach’s fixed point theorem, the ADP $(V, \TT)$ is globally stable and hence order stable (Lemma 3.1.1).

Moreover, each $T_\sigma$ is a contraction of modulus $\beta$ on $V$ and $T$ is a well-defined self-map on $V_0$ . Lemma A.5.21 now implies that $T$ is a contraction of modulus $\beta$ on $V_0$ . As $V_0$ is closed and $V$ is complete, this implies that $T$ has a fixed point in $V_0 \subset V_G$ . Hence, by Corollary 2.1.6, the fundamental optimality properties hold. It follows that $\vmax$ exists and is the unique fixed point of $T$ in $V_G$ . Since $T$ has a fixed point in $V_0$ , this also implies that $\vmax \in V_0$ . Geometric convergence of VFI now follows from contractivity of $T$ on $V_0$ .

If $(V, \TT)$ is also regular, then convergence of OPI and HPI follows from Theorem 3.1.2. ◻

3.1.3 Minimization

Because of Exercise 2.2.6, the optimality and convergence theorems based around maximization can easily be converted to theorems for minimization. In this section we give some examples. Our first is a min-version of Theorem 3.1.2.

Proof

Let $(V, \TT)$ be as stated and let $(V, \TT)^\partial$ be the dual ADP. Since $(V, \TT)$ is min-regular, $(V, \TT)^\partial$ is max-regular (Exercise 2.2.4). Since $(V, \TT)$ is globally stable, $(V, \TT)^\partial$ is likewise globally stable. By assumption, the Bellman min-operator $\tmin$ for $(V, \TT)$ has a fixed point $\bar v$ in $V$ . Since the Bellman max-operator $T^\partial$ for $(V, \TT)^\partial$ satisfies $T^\partial = \tmin$ (the supremum under $\preceq^\partial$ equals the infimum under $\preceq$ ), the element $\bar v$ is also a fixed point of $T^\partial$ . As a result, Theorem 3.1.2 implies that, for $(V, \TT)^\partial$ , the fundamental max-optimality properties hold and max-VFI, max-OPI and max-HPI all converge. The conclusions of Theorem 3.1.6 now follow from Exercise 2.2.6. ◻

For the rest of Section 3.1.3, $V = (V, d, \preceq)$ is always assumed to be a partially ordered metric space. Our next result is a min-version of Theorem 3.1.5, restricted to the regular case.

Now let’s consider a min-version of Theorem 3.1.4.

3.1.4 Nonstationary Policies

In all of the preceding discussion we focused on stationary policies. For example, in the context of the optimal savings problem from Section 1.3, we fixed a policy $\sigma$ and computed its lifetime value $v_\sigma$ by assuming that $\sigma$ is applied at every $t$ in $\{0,1,\ldots\}$ . In particular, in Section 1.3.1.2, we showed that, for $v$ arbitrarily chosen from the value space,

v_\sigma = \lim_{n \to \infty} T^n_\sigma \, v.

(3.1)

This expression illustrates how lifetime value is obtained by repeatedly applying the same policy.

But is this focus on stationary policies justified? Could it be that higher lifetime value is available when we allow a change of policy in each period?

To address this question, suppose that we can select a policy plan $\bar \sigma \coloneq (\sigma_t)_{t \geq 0}$ in the infinite Cartesian product $\times_{t \geq 0} \Sigma$ and apply the $t$ -th element $\sigma_t$ at time $t$ . Generalizing (3.1), the lifetime value of $\bar \sigma$ can be defined by

v_{\bar \sigma} = \lim_{n \to \infty} T_{\sigma_0} T_{\sigma_1} \cdots T_{\sigma_n} v.

(3.2)

Of course, for the definition in (3.2) to make sense we need to know that the limit exists. Ideally, it should also be independent of $v$ . (In (3.2), we iterate backwards in time, applying $T_{\sigma_n}$ first, because $v$ is best thought of as a terminal condition, rather than an initial condition. See Section 1.3.1.2 for intuition.)

Since the expression (3.2) requires a topology, we consider an ADP $(V, \TT)$ where $V = (V, \preceq)$ is a partially ordered space. We also assume that the topology on $V$ is generated by a metric $d$ . As usual, $\TT \coloneq \setntn{T_\sigma}{\sigma \in \Sigma}$ is a family of order preserving self-maps on $V$ . To ensure that (3.2) exists we also require the following:

We will make use of the following preliminary results.

Lemma 3.1.9

If Assumption 3.1.1 holds, then claims (i)–(ii) below are valid. If, in addition, $(V, \TT)$ is semi-regular, then claim (iii) is also valid.

(i) For each $v \in V$ and policy plan $\hat \sigma \coloneq (\sigma_t)_{t \geq 0}$ , the limit

v_{\hat \sigma} \coloneq \lim_{n \to \infty} T_{\sigma_0} \cdots T_{\sigma_n} \, v

exists in $V$ and is independent of $v$ .

(ii) Every $T_\sigma \in \TT$ is continuous and globally stable on $V$ , with unique fixed point $v_\sigma$ satisfying

v_\sigma = \lim_{j \to \infty} T_\sigma^j \, v \quad \text{for all } v \in V.

(3.3)

(iii) There exists a $v \in V$ such that $v = \bigvee_{\sigma \in \Sigma} T_\sigma \, v$ .

Proof

Fix $v \in V$ and policy plan $\hat \sigma \coloneq (\sigma_t)_{t \geq 0}$ . Given the policy plan above and $m \leq n$ , we adopt the following simplified notation:

T_{m, n} \coloneq T_{\sigma_m} \circ T_{\sigma_{m+1}} \circ \cdots \circ T_{\sigma_n}.

Let $v_n = T_{0, n} v$ . Our claim is that $\lim_n v_n$ exists in $V$ and is independent of $v$ . To see this, observe first that $(v_n)$ is Cauchy, since, fixing $m, j \in \NN$ ,

d(v_m, v_{m+j}) \leq \lambda^{m+1} d \left( v, T_{m+1, m+j} v \right) ,

and, by repeatedly applying the triangle inequality,

d (v, T_{m+1, m+j} v) \leq d (v, T_{m+1} v) + d (T_{m+1} v, T_{m+1} T_{m+2} v) \\ + \cdots + d (T_{m+1} \cdots T_{m+j-1} v, T_{m+1} \cdots T_{m+j-1} T_{m+j} v) .

By Assumption 3.1.1, there exists a finite constant $b$ satisfying $d ( v, T_{\sigma_j} v ) \leq b$ for all $j$ . From this and the last bound we obtain

d (v, T_{m+1, m+j} v) \leq b + \lambda b + \cdots + \lambda^{j-1} b \leq \frac{b}{1-\lambda}.

As a result, $d(v_m, v_{m+j}) \leq \lambda^{m+1} b / (1- \lambda)$ . This shows that $(v_n)$ is Cauchy. Using completeness of $V$ and letting $\bar v$ be the limit of this sequence, we argue that $\bar v$ is independent of $v$ . Indeed, if $w_n \coloneq T_{0, n} \, w$ for some $w \in V$ , then $d(v_n, w_n) \leq \lambda^{n+1} d(v, w)$ for all $n$ , so that $(v_n)$ and $(w_n)$ have the same limit. Hence $\lim_{n \to \infty} T_{0, n} \, v$ exists in $V$ and is independent of the initial condition $v$ . This proves claim (i).

The result in (ii) is immediate because, by Assumption 3.1.1, every $T_\sigma \in \TT$ is a contraction mapping (and therefore continuous) on the complete metric space $V$ . Finally, for (iii), applying Lemma A.5.21, the Bellman operator $T$ is also a contraction map and, therefore, has at least one fixed point in $V$ . ◻

We can now prove the main result of this section, which shows that any policy plan is (weakly) dominated in value by a stationary policy.
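Before turning to the formal statement, the following sketch illustrates both the limit in (3.2) and the domination claim numerically. It assumes a small, randomly generated discounted MDP in which every action is feasible in every state, so that each $T_\sigma$ is a contraction with modulus $\beta$ (playing the role of $\lambda$). The policy plan is truncated at 400 periods, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, beta = 3, 2, 0.9

# Hypothetical primitives: rewards r[x, a] and kernel P[x, a, x'].
r = rng.uniform(size=(n_states, n_actions))
P = rng.uniform(size=(n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
idx = np.arange(n_states)


def T_sigma(sigma, v):
    """Policy operator T_sigma v = r_sigma + beta * P_sigma v."""
    return r[idx, sigma] + beta * P[idx, sigma] @ v


def plan_value(plan, v):
    """Evaluate T_{sigma_0} ... T_{sigma_n} v, applying T_{sigma_n} first."""
    for sigma in reversed(plan):
        v = T_sigma(sigma, v)
    return v


# A randomly drawn (truncated) nonstationary plan.
plan = [rng.integers(n_actions, size=n_states) for _ in range(400)]

# Two very different terminal conditions give (numerically) the same value.
v_a = plan_value(plan, np.zeros(n_states))
v_b = plan_value(plan, np.full(n_states, 100.0))

# Value function of the stationary problem, approximated by VFI.
v_max = np.zeros(n_states)
for _ in range(2000):
    v_max = np.max(r + beta * P @ v_max, axis=1)

print(np.max(np.abs(v_a - v_b)), np.all(v_a <= v_max + 1e-10))
```

The reversed loop in `plan_value` mirrors the backward iteration in (3.2), where $v$ acts as a terminal condition.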

3.2 Applications

We now apply the theory developed above to a range of problems. Section 3.2.1 revisits discrete MDPs and introduces Q-factors. Section 3.2.2 treats optimal savings under both strong and weak continuity assumptions. Section 3.2.3 develops a no-discount optimal stopping framework, which is then applied to sequential analysis in Section 3.2.4.

3.2.1 Discrete MDPs and Q-Factors

In Section 3.2.1.1 we re-derive the optimality results for finite MDPs using the pospace theory of this chapter. In Section 3.2.1.2 we introduce the Q-factor formulation of the MDP and establish its optimality properties. The Q-factor formulation underpins reinforcement learning, one of the most influential branches of modern AI. Reinforcement learning algorithms—most notably Q-learning—use Q-factors to learn optimal policies from data, without requiring a model of the environment. Here we focus on the underlying dynamic programming theory associated with Q-factors, taking the model as given. Further discussion of Q-learning can be found in Section 3.3 and Section 9.2.1.

3.2.1.1 MDP Optimality, Again

In Section 1.2 we introduced a finite MDP $(\Gamma, r, \beta, P)$ with state space $\Xsf$ and action space $\Asf$ . In Section 2.3.3.1 we showed that this finite MDP can be framed as ADP $(\RR^\Xsf, \TT_{\rm MDP})$ , with policy operators given by $T_\sigma = r_\sigma + \beta P_\sigma$ . In Section 2.3.3.2 we established all of the major optimality properties for $(\RR^\Xsf, \TT_{\rm MDP})$ . For the purpose of illustration, working in a familiar and simple environment, let’s re-prove these results using the theorems in this chapter.

First, we know from Section 2.3.3.1 that $(\RR^\Xsf, \TT_{\rm MDP})$ is regular. In Exercise 1.2.1 we proved that every policy operator $T_\sigma$ is globally stable. Since $\TT_{\rm MDP}$ is finite, Corollary 3.1.3 applies. This proves validity of the fundamental optimality properties and convergence of all algorithms. Since global stability implies order stability, convergence of HPI in finite time holds by Theorem 2.2.6.
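As a numerical companion to this argument, the sketch below runs VFI and HPI on a small finite MDP and confirms that they agree. The primitives are randomly generated and purely hypothetical, and we assume for simplicity that $\Gamma(x) = \Asf$ for every state $x$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, beta = 4, 3, 0.9

# Hypothetical primitives: rewards r[x, a] and kernel P[x, a, x'].
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
P = rng.uniform(size=(n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)      # each P[x, a, :] is a distribution
idx = np.arange(n_states)


def T_sigma(sigma, v):
    """Policy operator T_sigma v = r_sigma + beta * P_sigma v."""
    return r[idx, sigma] + beta * P[idx, sigma] @ v


def T(v):
    """Bellman operator: pointwise maximum over actions."""
    return np.max(r + beta * P @ v, axis=1)


def greedy(v):
    """A v-greedy policy (ties broken by lowest action index)."""
    return np.argmax(r + beta * P @ v, axis=1)


def policy_value(sigma):
    """Fixed point of T_sigma, from (I - beta P_sigma) v = r_sigma."""
    A = np.eye(n_states) - beta * P[idx, sigma]
    return np.linalg.solve(A, r[idx, sigma])


# Value function iteration.
v = np.zeros(n_states)
for _ in range(1000):
    v_new = T(v)
    if np.max(np.abs(v_new - v)) < 1e-12:
        break
    v = v_new
v_vfi = v_new

# Howard policy iteration: stops once the greedy policy repeats.
sigma = np.zeros(n_states, dtype=int)
for _ in range(100):
    new_sigma = greedy(policy_value(sigma))
    if np.array_equal(new_sigma, sigma):
        break
    sigma = new_sigma
v_hpi = policy_value(sigma)

print(np.max(np.abs(v_vfi - v_hpi)))
```

The linear solve in `policy_value` is available because $T_\sigma$ is affine; HPI uses it to compute each $v_\sigma$ exactly rather than by iteration.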

Exercise 3.2.1

Consider again the MDP model $(\Gamma, r, \beta, P)$ on state space $\Xsf$ and with action space $\Asf$ , but now suppose that $\Asf$ and $\Xsf$ are countable rather than finite. Suppose in addition that the reward function $r$ is bounded and $\Gamma(x)$ is finite for all $x \in \Xsf$ . The remaining MDP definitions from Section 1.2.1.1 are unchanged. Let $b\Xsf$ be the set of all bounded functions on $\Xsf$ . As before, $\Sigma$ is the set of all maps from $\Xsf$ to $\Asf$ and $\TT_{\rm MDP}$ is the set of all policy operators of the form $T_\sigma \, v = r_\sigma + \beta P_\sigma \, v$ . Prove that

$(b\Xsf, \TT_{\rm MDP})$ is an ADP,
the fundamental optimality properties hold, and
VFI, OPI, and HPI all converge.

Solution to Exercise 3.2.1

Since $|T_\sigma \, v| \leq |r_\sigma| + \beta P_\sigma |v|$ , the image $T_\sigma \, v$ is bounded whenever $v$ is bounded. Hence $T_\sigma$ is a self-map on $b\Xsf$ . Clearly $T_\sigma$ is order preserving. Hence $(b\Xsf, \TT_{\rm MDP})$ is an ADP. Since each set $\Gamma(x)$ is finite, the proof in Section 2.3.3.1 that the finite ADP is regular extends directly to the countable case. The proof in Exercise 1.2.1 that every policy operator $T_\sigma$ is a contraction of modulus $\beta < 1$ under the supremum norm also extends to the countable case without significant modifications. The supremum norm remains sup-nonexpansive and complete on $b\Xsf$ . Hence Theorem 3.1.5 applies. This proves validity of the fundamental optimality properties and convergence of all algorithms.

3.2.1.2 The Q-Factor Model

Next we examine the Q-factor variation of the MDP model. This modification provides an alternative view on the Bellman equation that unlocks a stochastic approximation approach to reinforcement learning. We can study optimality of the Q-factor variation either by treating it directly, as an ADP in its own right, or by inferring optimality properties from the original MDP version of the problem, which we just obtained in Section 3.2.1.1. The first approach is treated here and the second is treated in Chapter 5.

To begin, we take the MDP model from Section 3.2.1.1 and, given $v \in \RR^\Xsf$ , set

q(x, a) \coloneq r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \qquad ((x,a) \in \Gsf).

(3.4)

The function $q$ is called the Q-factor corresponding to $v$ . We will convert the original MDP Bellman equation (2.20) into an equation in $Q$ -factors. The first step is to observe that, given $q$ in (3.4), the Bellman equation can be written as $v(x) = \max_{a \in \Gamma(x)}q(x, a)$ . Taking the expectations and discounting on both sides of this equation yields

\beta \sum_{x'} v(x') P(x, a, x') = \beta \sum_{x'} \max_{a' \in \Gamma(x')}q(x', a') P(x, a, x').

Adding $r(x,a)$ and using the definition of $q$ again gives

q(x, a) = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')}q(x', a') P(x, a, x').

(3.5)

This is the Q-factor Bellman equation. To study it, we introduce a family of policy operators $\SS \coloneq \setntn{S_\sigma}{\sigma \in \Sigma}$ via

(S_\sigma \, q)(x, a) = r(x, a) + \beta \sum_{x'} q(x', \sigma(x')) P(x, a, x') \qquad ((x,a) \in \Gsf).

(3.6)

Here $S_\sigma$ acts on function $q \in \RR^\Gsf$ . The set $\RR^\Gsf$ is paired with the pointwise partial order.

By definition, the ADP Bellman operator corresponding to $(\RR^\Gsf, \SS)$ obeys $S q \coloneq \bigvee_\sigma S_\sigma \, q$ . The next exercise helps us connect this to the Q-factor Bellman equation (3.5).

Exercise 3.2.4 tells us that, as expected, $q \in \RR^\Gsf$ is a fixed point of $S$ if and only if it is a solution to the Q-factor Bellman equation (3.5).
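The link between the two Bellman equations can be checked numerically. The sketch below assumes a small random MDP with $\Gamma(x) = \Asf$ for all $x$ (so $\Gsf = \Xsf \times \Asf$), iterates the operator on the right-hand side of (3.5), and compares the result with the value function obtained from ordinary VFI. All arrays are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, beta = 4, 3, 0.9

# Hypothetical primitives: rewards r[x, a] and kernel P[x, a, x'].
r = rng.uniform(size=(n_states, n_actions))
P = rng.uniform(size=(n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)


def S(q):
    """Right-hand side of the Q-factor Bellman equation (3.5)."""
    return r + beta * P @ np.max(q, axis=1)


def T(v):
    """Ordinary MDP Bellman operator."""
    return np.max(r + beta * P @ v, axis=1)


# Iterate both operators well past convergence.
q = np.zeros((n_states, n_actions))
v = np.zeros(n_states)
for _ in range(2000):
    q = S(q)
    v = T(v)

# The fixed points are linked by v(x) = max_a q(x, a) and by (3.4).
print(np.max(np.abs(np.max(q, axis=1) - v)))
print(np.max(np.abs(q - (r + beta * P @ v))))
```

Both printed gaps should be at the level of floating point error, reflecting the fact that the Q-factor fixed point and the value function carry the same information.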

Now let’s turn to optimality.

Solution to Exercise 3.2.5

Fix $\sigma \in \Sigma$ and $q, f \in \RR^{\Gsf}$ . For each $(x, a) \in \Gsf$ , we have

\begin{aligned} |(S_\sigma \, q)(x, a) - (S_\sigma \, f)(x, a)| & \leq \beta \sum_{x'} |q(x', \sigma(x')) - f(x', \sigma(x'))| P(x, a, x') \\ & \leq \beta \sum_{x'} \|q - f\|_{\infty} P(x, a, x') = \beta \|q - f\|_{\infty}. \end{aligned}

Taking the supremum over $(x, a)$ yields $\|S_\sigma \, q - S_\sigma \, f\|_{\infty} \leq \beta \|q - f\|_{\infty}$ .
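The contraction bound can also be spot-checked by simulation. The following sketch, which again assumes $\Gamma(x) = \Asf$ and uses randomly generated, hypothetical primitives, draws many triples $(\sigma, q, f)$ and records the largest observed ratio $\|S_\sigma \, q - S_\sigma \, f\|_\infty / \|q - f\|_\infty$. A passing check is evidence, not a proof.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, beta = 5, 3, 0.9

# Hypothetical primitives: rewards r[x, a] and kernel P[x, a, x'].
r = rng.normal(size=(n_states, n_actions))
P = rng.uniform(size=(n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
idx = np.arange(n_states)


def S_sigma(sigma, q):
    """Q-factor policy operator from (3.6)."""
    return r + beta * P @ q[idx, sigma]


worst_ratio = 0.0
for _ in range(1000):
    sigma = rng.integers(n_actions, size=n_states)
    q = rng.normal(size=(n_states, n_actions))
    f = rng.normal(size=(n_states, n_actions))
    lhs = np.max(np.abs(S_sigma(sigma, q) - S_sigma(sigma, f)))
    rhs = np.max(np.abs(q - f))
    worst_ratio = max(worst_ratio, lhs / rhs)

print(worst_ratio <= beta)
```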

The next exercise asks you to confirm the core optimality properties also hold for the Q-factor MDP. You may like to use Theorem 3.1.5.

3.2.2 Optimal Savings

In Section 1.3 we introduced a simple optimal savings problem. In Section 2.3.2, we converted the optimal savings problem from Section 1.3 into an ADP $(V, \TT_{\rm OS})$ , where $V = b\RR_+$ and each $T_\sigma \in \TT_{\rm OS}$ takes the form

(T_\sigma \, v)(w) = u(\sigma(w)) + \beta \int v(R(w - \sigma(w)) + y) \phi(\diff y) \qquad (w \in \RR_+).

(3.8)

Now we turn to optimality. In Section 3.2.2.1 we will maintain the conditions in Assumption 1.3.1, so that $u$ is continuous and bounded on $\RR_+$ and the distribution of labor income can be represented by a continuous density. In Section 3.2.2.2 we will drop the continuous density assumption.

3.2.2.1 The Strongly Continuous Case

Maintaining Assumption 1.3.1, we prove the following optimality properties, which were stated without proof in Section 1.3.2.2.

Let’s briefly translate these ADP results into the optimal savings results stated in Section 1.3.2.2. We showed in Section 2.3.2 that $v \in V$ satisfies the ADP Bellman equation $\bigvee_\sigma T_\sigma \, v=v$ if and only if it satisfies

v(w) = \max_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\} \quad \text{for all } w \geq 0.

(3.9)

By this fact and the fundamental optimality properties, the value function $\vmax$ exists and is the unique solution to (3.9) in $V$ .

In Section 2.3.2, we saw that a policy $\sigma$ is $v$ -greedy if and only if

\sigma(w) \in \argmax_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\} \quad \text{for all } w \geq 0.

(3.10)

Applying Bellman’s principle of optimality, a policy is optimal if and only if it satisfies (3.10) with $v$ replaced by $\vmax$ .
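To see how (3.9) and (3.10) might be used in practice, here is a rough computational sketch of VFI for the savings problem. Everything specific in it is an assumption made for illustration: the utility function $u(c) = 1 - e^{-c}$ (bounded and continuous), lognormal labor income approximated by Monte Carlo draws, a truncated wealth grid with linear interpolation, and a finite grid of consumption choices. It approximates the operator in (3.9) rather than implementing it exactly.

```python
import numpy as np

beta, R = 0.9, 1.02


def u(c):
    """Assumed utility: bounded and continuous on R_+."""
    return 1.0 - np.exp(-c)


rng = np.random.default_rng(3)
y_draws = rng.lognormal(mean=0.0, sigma=0.5, size=100)  # stand-in for phi

w_grid = np.linspace(0.0, 20.0, 80)   # truncated wealth grid (assumption)
c_frac = np.linspace(0.0, 1.0, 40)    # consumption as a share of wealth


def T(v):
    """Grid approximation of the Bellman operator in (3.9).

    Returns the updated values and an approximate v-greedy policy.
    """
    Tv = np.empty_like(v)
    sigma = np.empty_like(v)
    for i, w in enumerate(w_grid):
        c = c_frac * w                                  # choices in [0, w]
        w_next = R * (w - c)[:, None] + y_draws[None, :]
        # np.interp is linear inside the grid and flat beyond its endpoints.
        cont = np.interp(w_next, w_grid, v).mean(axis=1)
        values = u(c) + beta * cont
        j = np.argmax(values)
        Tv[i], sigma[i] = values[j], c[j]
    return Tv, sigma


v = np.zeros_like(w_grid)
for _ in range(500):
    v_new, sigma = T(v)
    err = np.max(np.abs(v_new - v))
    v = v_new
    if err < 1e-6:
        break

print(err, v.min(), v.max())
```

The discretized operator is still a contraction of modulus $\beta$ in the supremum norm, since interpolation, averaging and maximization are all nonexpansive, which is why the loop terminates. How well its fixed point approximates $\vmax$ depends on the grids and the number of income draws.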

3.2.2.2 The Weakly Continuous Case

Now let’s prove a result similar to Proposition 3.2.1 under weaker conditions. In particular, we modify Assumption 1.3.1 by dropping the assumption that $\phi$ is a continuous density. Instead we’ll let $\phi$ be an arbitrary probability measure on the Borel subsets of $\RR_+$ . Other conditions in Assumption 1.3.1 are maintained.

Without the continuity of $\phi$ , Lemma 1.3.2 fails. In particular, we cannot claim that each $v \in b\RR_+$ has a greedy policy. As a result, $(V, \TT_{\rm OS})$ is no longer regular. However, we do have the following result, stated as a solved exercise.

Solution to Exercise 3.2.7

Fix $v \in bc\RR_+$ . Since $u, v$ are both continuous and bounded, an application of the dominated convergence theorem confirms that the map

(w, c) \mapsto u(c) + \beta \int v(R(w - c) + y) \phi(\diff y)

is continuous on the set of feasible state-action pairs $\Gsf$ . Applying Theorem A.3.3, we see that there exists a measurable selection $\sigma$ such that

u(\sigma(w)) + \beta \int v(R(w - \sigma(w)) + y) \phi(\diff y) = \max_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\}

for all $w \in \RR_+$ . This measurable selection is $v$ -greedy. The same theorem tells us that

(Tv)(w) = \max_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\}

is continuous on $\RR_+$ . Since $Tv$ is also bounded, we have $Tv \in bc\RR_+$ .

With the results from this exercise in hand, we can prove the next proposition.

3.2.3 No-Discount Optimal Stopping

Many important applications—including sampling problems, shortest path and routing problems, bandit problems, and reinforcement learning tasks—involve no discounting. In the absence of a discount factor, contractivity of the policy operators typically fails, so the results based on contraction mappings do not directly apply. This necessitates more sophisticated techniques. One of our main aims is to provide foundations for solving the sequential sampling problem from Section 1.4.

3.2.3.1 Setup

Let $(\Xsf, \bB)$ be a measurable space. We recall from Section A.5.4.1 that a discrete time $\Xsf$ -valued stochastic process $(X_t)_{t \geq 0}$ on probability space $(\Omega, \fF, \PP)$ is called $P$ -Markov if

\PP \{X_{t+1} \in B \given \fF_t\} = P(X_t, B) \quad \PP \text{-a.s.\ for all } t \geq 0 \text{ and } B \in \bB \text{. }

We write $\PP_x$ and $\EE_x$ for probabilities and expectations when conditioning on $X_0 = x$ .

Let $b\Xsf_+$ be all $g \in b\Xsf$ taking only nonnegative values. We consider a cost minimization problem with Bellman equation

g(x) = \min \left\{ e(x), c(x) + \int g(x') P(x, \diff x') \right\},

(3.11)

where $e, c$ are functions in $b\Xsf_+$ , while $P$ is a stochastic kernel on $\Xsf$ . The function $e$ is called the exit cost function and $c$ is called the flow cost. The Bellman equation corresponds to a setting where a controller observes a $P$ -Markov state process $(X_t)_{t \geq 0}$ and decides when to stop. Stopping at time $t$ incurs the one-off penalty $e(X_t)$ . Continuing incurs the flow cost $c(X_t)$ , followed by transition to the new state $X_{t+1}$ and the opportunity to decide again.

3.2.3.2 Policies

A policy is a $\bB$ -measurable map $\sigma$ from $\Xsf$ to $\{0, 1\}$ , with $\sigma(x) = 1$ indicating the decision to stop in state $x$ . Given a policy $\sigma$ , we call

E_\sigma \coloneq \setntn{x \in \Xsf}{\sigma(x) = 1}

the exit region for $\sigma$ . We call its complement $E_\sigma^c$ the continuation region.

To each policy $\sigma$ , we associate the stopping time

\tau^\sigma \coloneq \inf\setntn{t \geq 0}{X_t \in E_\sigma} = \inf\setntn{t \geq 0}{\sigma(X_t) = 1}.

(3.12)

Here and below, the convention for the infimum is that $\inf \varnothing \coloneq \infty$ . Also, given $\sigma$ , we define the $\sigma$ -loss function via

g_\sigma(x) \coloneq \EE_x \left[ \sum_{t=0}^{\tau^\sigma-1} c(X_t) + e(X_{\tau^\sigma}) \right].

(3.13)

Here $\sum_{t=0}^{-1} c(X_t)$ is understood as 0, so that $g_\sigma(x) = e(x)$ when $x \in E_\sigma$ . The function $g_\sigma$ takes values in $[0, \infty]$ and $g_\sigma(x)$ represents the total expected cost when applying $\sigma$ in every period, conditional on starting in state $x$ .

3.2.3.3 The Lower Bound Policy

One policy of particular interest is $\bar \sigma = \1\{e \leq c\}$ . To simplify notation, the exit region for this policy is denoted

\bar E \coloneq E_{\bar \sigma} = \setntn{x \in \Xsf}{e(x) \leq c(x)}.

(3.14)

and the stopping time is denoted

\bar \tau \coloneq \tau^{\bar \sigma} = \inf\setntn{t \geq 0}{\bar \sigma(X_t) = 1}.

(3.15)

We call $\bar E$ the certain exit region. Under the innocuous convention that the controller always stops when indifferent between stopping and continuing, any optimal policy will choose to stop when $x \in \bar E$ . The reason is that the controller has the opportunity to exit at cost $e(x) \leq c(x)$ , and continuing incurs $c(x)$ plus additional costs in subsequent stages.

The fact that the controller always stops when $x \in \bar E$ allows us to shrink the policy space. Specifically, we consider only policies where $\sigma(x) = 1$ for all $x \in \bar E$ . Let $\Sigma$ be the set of all such policies. We can also express this set via

\Sigma \coloneq \{ \text{all } \bB \text{-measurable } \sigma \colon \Xsf \to \{0,1\} \text{ with } \bar \sigma \leq \sigma \}.

(3.16)

Since $\bar \sigma \leq \sigma$ for all $\sigma \in \Sigma$ , we refer to $\bar \sigma$ as the lower bound policy. Also, since

\bar \sigma \leq \sigma \; \implies \; \tau^\sigma \leq \bar \tau \;\; \PP\text{-a.s.},

(3.17)

we refer to $\bar \tau$ as the upper bound stopping time.

Our key assumption is as follows.

For Assumption 3.2.1, it suffices to check that $\sup_{x \in \bar E^c} \EE_x \bar \tau$ is finite, since $\bar \tau = 0$ with probability one when $x \in \bar E$ . Below we show that Assumption 3.2.1 is sufficient for all of the major optimality results associated with dynamic programming.

3.2.3.4 Policy Operators

For each $\sigma \in \Sigma$ , we define a policy operator $T_\sigma$ via

(T_\sigma \, g)(x) = \sigma(x) e(x) + (1 - \sigma(x)) \left[c(x) + \int g(x') P(x, \diff x') \right],

(3.18)

or, in operator notation, as

T_\sigma \, g = \sigma e + (1-\sigma) c + K_\sigma \, g \qquad (g \in b\Xsf),

(3.19)

where

(K_\sigma \, g)(x) \coloneq (1 - \sigma(x)) \int g(x') P(x, \diff x') \qquad (x \in \Xsf).

(3.20)

In (3.19), expressions such as $\sigma e$ are understood as pointwise products.

As usual, the policy operator associated with $\sigma$ is introduced with the idea that its fixed point gives lifetime value – in this case lifetime cost functions – generated by $\sigma$ . This turns out to be true here as well, although the proof is not entirely trivial. It requires some familiarity with shift operators and the Markov property in its general form. An introduction to these topics can be found in Chapter 3 of Meyn & Tweedie (2009).

Proof

Fix $\sigma \in \Sigma$ . By Lemma 3.2.3, $g_\sigma$ is bounded on $\Xsf$ , so all expectations below are well-defined. To simplify notation, for the duration of this proof we set $\tau \coloneq \tau^\sigma$ and $E \coloneq E_\sigma$ . For $x \in E$ , we have $\sigma(x)=1$ and hence $(T_\sigma \, g_\sigma)(x) = e(x)$ . At the same time, (3.13) implies that $g_\sigma(x) = e(x)$ also holds for such $x$ . In particular, $T_\sigma \, g_\sigma = g_\sigma$ on $E$ . Hence, to complete the proof, we only need to show that $T_\sigma \, g_\sigma = g_\sigma$ on $E^c$ .

To this end, fix $x \notin E$ and define the random variable

H \coloneq \sum_{t=0}^{\tau - 1} c(X_t) + e(X_{\tau}),

so that

g_\sigma(x) = \EE_x H \quad \text{and} \quad g_\sigma(X_1) = \EE_{X_1} H = \EE_x [ H \circ \theta \given X_1 ].

On the right-hand side, $\theta \colon (x_0, x_1, \ldots) \mapsto (x_1, x_2, \ldots)$ is the shift operator on the sequence space $\Xsf^\infty$ , and the last equality is by the Markov property (A.25). We can write $g_\sigma(X_1)$ more explicitly by expanding $H \circ \theta$ to get

g_\sigma(X_1) = \EE_x \left[ \sum_{t=0}^{\tau \circ \, \theta - 1} c(X_{t+1}) + e(X_{\tau \circ \, \theta + 1}) \, \given \, X_1 \right] = \EE_x \left[ \sum_{t=1}^{\tau \circ \, \theta} c(X_t) + e(X_{\tau \circ \, \theta + 1}) \, \given \, X_1 \right].

From the law of iterated expectations, this yields

\EE_x \, g_\sigma(X_1) = \EE_x \left[ \sum_{t=1}^{\tau \circ \, \theta} c(X_t) + e(X_{\tau \circ \, \theta + 1}) \right].

(3.21)

Recalling that $x \in E^c$ , so that $\sigma(x)=0$ , we have

(T_\sigma \, g_\sigma)(x) = c(x) + \EE_x \, g_\sigma(X_1) = c(x) + \EE_x \left[ \sum_{t=1}^{\tau \circ \, \theta} c(X_t) + e(X_{\tau \circ \, \theta + 1}) \right].

Using the fact that $x \in E^c$ , so that the first visit to $E$ occurs at or after $t=1$ , we obtain $\tau \circ \theta = \tau - 1$ and $X_{\tau \circ \, \theta + 1} = X_\tau$ . (The stopping time $\tau \circ \theta$ counts time using the shifted sequence $(X_1, X_2, \ldots)$ and so equals $\tau - 1$ ; the stopped value on the shifted path then sits at position $\tau \circ \theta + 1 = \tau$ in the original sequence.) Applying these facts to the last display yields

(T_\sigma \, g_\sigma)(x) = c(x) + \EE_x \left[ \sum_{t=1}^{\tau - 1} c(X_t) + e(X_\tau) \right] = \EE_x \left[ \sum_{t=0}^{\tau - 1} c(X_t) + e(X_\tau) \right] = g_\sigma(x).

This confirms that $T_\sigma \, g_\sigma = g_\sigma$ on $E^c$ , so the proof of Lemma 3.2.4 is done. ◻
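The identity $T_\sigma \, g_\sigma = g_\sigma$ can be illustrated on a finite state space. In the sketch below, the four-state kernel `P`, the cost vectors `e` and `c`, and the policy `sigma` are all hypothetical. We compute the fixed point of $T_\sigma$ by solving the linear system implied by (3.19), then estimate $g_\sigma$ directly from its definition (3.13) by Monte Carlo and compare the two.

```python
import numpy as np

# Hypothetical four-state stopping problem.
P = np.array([[0.5, 0.3, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.2, 0.5, 0.2],
              [0.1, 0.1, 0.3, 0.5]])
e = np.array([2.0, 1.5, 0.6, 0.1])   # exit costs
c = np.full(4, 0.2)                  # flow costs
sigma = np.array([0, 0, 1, 1])       # stop in states 2 and 3
n = len(e)


def T_sigma(g):
    """Policy operator from (3.18)."""
    return sigma * e + (1 - sigma) * (c + P @ g)


# Fixed point of T_sigma: solve (I - K_sigma) g = sigma e + (1 - sigma) c.
K = (1 - sigma)[:, None] * P
g_fixed = np.linalg.solve(np.eye(n) - K, sigma * e + (1 - sigma) * c)

rng = np.random.default_rng(6)
cdf = np.cumsum(P, axis=1)


def simulate_loss(x0, n_paths=20_000, max_steps=10_000):
    """Monte Carlo estimate of g_sigma(x0) as defined in (3.13)."""
    x = np.full(n_paths, x0)
    total = np.zeros(n_paths)
    alive = np.ones(n_paths, dtype=bool)
    for _ in range(max_steps):
        stop = alive & (sigma[x] == 1)
        total[stop] += e[x[stop]]          # pay the exit cost and stop
        alive &= ~stop
        if not alive.any():
            break
        total[alive] += c[x[alive]]        # pay the flow cost and continue
        draws = rng.uniform(size=alive.sum())
        nxt = (draws[:, None] > cdf[x[alive]]).sum(axis=1)
        x[alive] = np.minimum(nxt, n - 1)  # guard against rounding in cdf
    return total.mean()


g_sim = np.array([simulate_loss(x0) for x0 in range(n)])
print(np.max(np.abs(g_sim - g_fixed)))
```

The simulated values agree with the linear-algebra solution only up to sampling error, so the comparison should be read with a tolerance in mind.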

3.2.3.5 ADP Formulation

Let $\TT$ be the set of all policy operators, as defined in (3.18), indexed over the restricted policy set $\Sigma$ defined in (3.16). Since each $T_\sigma$ is order preserving, the pair $(b\Xsf_+, \TT)$ forms an ADP. For this ADP and given $g \in b\Xsf_+$ , a policy $\sigma \in \Sigma$ is $g$ -min-greedy when $T_\sigma g \leq T_s g$ for all $s \in \Sigma$ . It is easy to check that one such policy always exists. Indeed, such a policy can be found by setting

\sigma(x) = \1 \left\{ e(x) \leq c(x) + \int g(x') P(x, \diff x') \right\} \qquad (x \in \Xsf)

This policy is in $\Sigma$ , since $\sigma$ is $\bB$ -measurable and, in addition,

e(x) \leq c(x) \implies e(x) \leq c(x) + \int g(x') P(x, \diff x'),

so that $\bar \sigma \leq \sigma$ . Moreover, it is clear that

(T_\sigma \, g)(x) = \min \left\{ e(x), c(x) + \int g(x') P(x, \diff x') \right\} \leq (T_s \, g)(x)

for all $s \in \Sigma$ .

This argument also proves that the ADP $(b\Xsf_+, \TT)$ is regular.

In essence, a $g$ -min-greedy policy treats $g$ as a loss function, using it to assign a total expected cost to each state, and makes the current best choice accordingly.
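The construction above translates directly into code on a finite state space. The sketch below uses a hypothetical four-state problem (the arrays `P`, `e` and `c` are invented), iterates the Bellman min-operator from (3.11) with no discounting, reads off a min-greedy policy, and checks that this policy lies in $\Sigma$ and that its loss function matches the fixed point.

```python
import numpy as np

# Hypothetical four-state stopping problem (no discounting).
P = np.array([[0.5, 0.3, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.2, 0.5, 0.2],
              [0.1, 0.1, 0.3, 0.5]])
e = np.array([2.0, 1.5, 0.6, 0.1])   # exit costs
c = np.full(4, 0.2)                  # flow costs
n = len(e)


def T(g):
    """Bellman min-operator: right-hand side of (3.11)."""
    return np.minimum(e, c + P @ g)


def min_greedy(g):
    """A g-min-greedy policy: stop iff e <= c + P g."""
    return (e <= c + P @ g).astype(int)


def loss(sigma):
    """Fixed point of T_sigma, from (I - K_sigma) g = sigma e + (1 - sigma) c."""
    K = (1 - sigma)[:, None] * P
    return np.linalg.solve(np.eye(n) - K, sigma * e + (1 - sigma) * c)


# Value function iteration from g = 0.
g = np.zeros(n)
for _ in range(500):
    g = T(g)

sigma_star = min_greedy(g)
sigma_bar = (e <= c).astype(int)     # lower bound policy

print(sigma_star, g)
print(np.all(sigma_star >= sigma_bar), np.allclose(loss(sigma_star), g))
```

In this example the certain exit region is a single state and every other state reaches it with positive probability in one step, so expected stopping times under the lower bound policy are finite.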

3.2.3.6Stability of the Policy Operators¶

In this section we prove that $(b\Xsf_+, \TT)$ is globally stable.

We prove the proposition using a sequence of lemmas. In the statement of the next lemma, $\sigma$ is a fixed policy, $K_\sigma$ is as defined in (3.20), and $\tau^\sigma$ is as defined in (3.12).

Proof

Fixing $f \in b\Xsf$ and $x_0 \in \Xsf$ , we iterate with $K_\sigma$ and apply the triangle inequality to obtain

|(K^n_\sigma f)(x_0)| \leq (1-\sigma(x_0)) \int (1-\sigma(x_1)) \int \cdots \, (1-\sigma(x_{n-1})) \int |f(x_n)| P(x_{n-1}, \diff x_n) P(x_{n-2}, \diff x_{n-1}) \cdots P(x_0, \diff x_1).

Hence, if $(X_t)$ is $P$ -Markov and starts at $x_0$ , then

\begin{aligned} |(K^n_\sigma f)(x_0)| & \leq \| f\| \int \cdots \int \prod_{t=0}^{n-1} (1 - \sigma(x_t)) P(x_{n-1}, \diff x_n) P(x_{n-2}, \diff x_{n-1}) \cdots P(x_0, \diff x_1) \\ & = \| f\| \cdot \PP_{x_0} \bigcap_{t=0}^{n-1} \, \{\sigma(X_t) = 0\} = \| f\| \cdot \PP_{x_0} \{\tau^\sigma \geq n\}. \end{aligned}

Taking the supremum on the right and then the left produces the bound in Lemma 3.2.6. ◻
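The bound just derived can be checked numerically on a finite chain. The sketch below assumes that $K_\sigma$ acts as $(K_\sigma f)(x) = (1 - \sigma(x)) \sum_y f(y) P(x, y)$ , which is the form suggested by the iterated integral above (the definition in (3.20) is not reproduced here). Under that assumption, applying $K_\sigma$ to the constant function $1$ a total of $n$ times returns $\PP_x\{\tau^\sigma \geq n\}$ . The chain, policy and test function are made up.

```python
def apply_K(sigma, P, f):
    """Assumed form of K_sigma: (1 - sigma(x)) * sum_y P(x, y) f(y)."""
    n = len(f)
    return [(1 - sigma[x]) * sum(P[x][y] * f[y] for y in range(n))
            for x in range(n)]


def iterate_K(sigma, P, f, n):
    """Return K_sigma^n f."""
    for _ in range(n):
        f = apply_K(sigma, P, f)
    return f


def sup_norm(f):
    return max(abs(v) for v in f)


P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
sigma = [1, 0, 0]              # stop in state 0, continue elsewhere
f = [-2.0, 1.0, 3.0]           # arbitrary bounded function
n = 5

Knf = iterate_K(sigma, P, f, n)
# survival[x] = P_x{tau^sigma >= n}: no stop at times 0, ..., n - 1
survival = iterate_K(sigma, P, [1.0, 1.0, 1.0], n)
bound = [sup_norm(f) * p for p in survival]
```

Each entry of `Knf` is dominated in absolute value by the matching entry of `bound`, which is the pointwise inequality established in the proof.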

We will also use the following result regarding stopping times.

Proof

Let $\bar \tau$ be as in (3.15). Under Assumption 3.2.1, there exists an $M < \infty$ with $\EE_x \bar \tau \leq M$ for all $x$ . In addition, for any $x \in \Xsf$ and any $n > 0$ , we have:

\EE_x \bar \tau = \sum_{s=0}^{\infty} \PP_x\{\bar \tau > s\} \geq \sum_{s=0}^n \PP_x\{\bar \tau > s\} \geq n \cdot \PP_x\{\bar \tau > n\},

where the last inequality holds because $\bar \tau > n$ implies $\bar \tau > s$ for all $s \leq n$ . (This is Markov’s inequality applied to $\bar \tau$ .) Combining these facts yields

\PP_x\{\bar \tau > n\} \leq \frac{\EE_x \bar \tau}{n} \leq \frac{M}{n} \quad \text{for all } x \in \Xsf.

As a consequence,

\lim_{n \to \infty} \sup_{x \in \Xsf} \PP_x\{\bar \tau > n\} = 0.

(3.23)

Now fix $\sigma \in \Sigma$ . By (3.17) we have $\tau^\sigma \leq \bar \tau$ and hence $\{\tau^\sigma > n\} \subset \{\bar \tau > n\}$ . Combining this with (3.23) yields (3.22). ◻
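To see the tail bound at work, here is a small sketch on a finite chain with a fixed stopping rule. It computes $\PP_x\{\tau > s\}$ recursively, approximates $\EE_x \tau$ by a truncated sum of these tails (so the approximation is from below), and checks the inequality $n \cdot \PP_x\{\tau > n\} \leq \EE_x \tau$ . The chain and the stopping rule are hypothetical.

```python
def tail_probabilities(sigma, P, horizon):
    """Return tails with tails[s][x] = P_x{tau > s} for s = 0, ..., horizon,
    where tau is the first time t >= 0 with sigma(X_t) = 1."""
    n = len(sigma)
    survive = [1.0] * n
    tails = []
    for _ in range(horizon + 1):
        # One more period without stopping, starting from each state x.
        survive = [(1 - sigma[x]) * sum(P[x][y] * survive[y] for y in range(n))
                   for x in range(n)]
        tails.append(survive)
    return tails


P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
sigma = [1, 0, 0]              # stop on the first visit to state 0
horizon = 200

tails = tail_probabilities(sigma, P, horizon)
# Truncated version of E_x tau = sum_{s >= 0} P_x{tau > s}.
expected_tau = [sum(tails[s][x] for s in range(horizon + 1)) for x in range(3)]
worst_tail = [max(tails[s]) for s in range(horizon + 1)]   # sup over x
```

For this chain the expected stopping times from states 1 and 2 solve a two-equation linear system and equal 6 and 8, and `worst_tail` decays to zero, in line with (3.23).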

3.2.3.7Optimality¶

We define the minimum loss function $\gmin$ via

\gmin(x) \coloneq \inf_{\sigma \in \Sigma} g_\sigma(x) \qquad (x \in \Xsf).

(3.24)

The function $\gmin$ also takes values in $[0,\infty)$ and is well-defined everywhere on $\Xsf$ . The minimum loss function $\gmin$ is equal to the min-value function $\vmin$ of the ADP, as defined in Section 2.2.3.1. (This is because, by definition, $\vmin = \bigwedge_\sigma v_\sigma$ , which is $\vmin = \bigwedge_\sigma g_\sigma$ in the current setting. This equation reduces to (3.24) when working in $b\Xsf$ with the pointwise partial order.)

A policy $\sigma \in \Sigma$ is called optimal if $g_\sigma \leq g_s$ for all $s \in \Sigma$ . This is equivalent to the statement that $\sigma$ attains the minimum possible cost from every state, and is equivalent to the ADP definition in Section 2.2.3.1.

We can now state the following result.

3.2.4Sequential Analysis Revisited¶

With Theorem 3.2.9 in hand, we are in a position to prove optimality results for the sequential analysis problem we presented in Section 1.4. First we reduce the sequential analysis problem to a special case of the no-discount optimal stopping problem treated in Theorem 3.2.9. Then we check the conditions of that theorem. The only significant condition is Assumption 3.2.1, so this is where we will be investing all our effort.

3.2.4.1Set Up¶

Our first step is to show that the hard part of the sequential analysis problem is a special case of the no-discount optimal stopping problem from Section 3.2.3. To this end, let’s remind ourselves of the set up in Section 1.4. We recall that the state space for the belief state $\pi$ is $\Xsf = (0,1)$ and the action space is $\Asf = \{0, 1, 2\}$ , where action 0 represents accepting $f_0$ , action 1 represents accepting $f_1$ , and action 2 represents continuing to sample. We assume that the two densities are defined on $\RR$ .

Repeating (1.60), the Bellman equation has the form

g(\pi) = \min \left\{ \pi L_0, \; (1-\pi) L_1, \; c + \int g(\pi') P(\pi, \diff \pi') \right\}

(3.25)

for $\pi \in (0,1)$ , where the stochastic kernel $P$ obeys

(Pg)(\pi) \coloneq \int g(\kappa(\pi, z)) \psi(\pi, z) \diff z.

(3.26)

Here, recalling (1.59),

\psi(\pi,z) = (1-\pi)f_0(z) + \pi f_1(z)

is the predictive density and

\kappa(\pi, z) = \frac{\pi f_1(z)}{(1-\pi) f_0(z) + \pi f_1(z)} = \frac{\pi f_1(z)}{\psi(\pi, z)}

is the Bayesian update rule. Together, $\psi$ and $\kappa$ define the stochastic kernel $P$ governing the belief state $(\pi_n)_{n \geq 0}$ , describing how beliefs evolve from the perspective of the controller when the observation sequence $(Z_n)_{n \geq 1}$ is forecast using the predictive density.

In all of what follows, the cost $c$ and the losses $L_0$ and $L_1$ are assumed to be positive constants. Our aim is to characterize and solve for optimal policies.

In terms of dynamic programming, we can simplify the sequential analysis problem to a binary stopping problem. Our first step is to set

e(\pi) \coloneq \min\{\pi L_0, \; (1-\pi) L_1\}.

(3.27)

Now consider the Bellman equation

g(\pi) = \min \left\{ e(\pi), \; c + (Pg)(\pi) \right\}.

(3.28)

We claim that solving this dynamic program is sufficient for solving the sequential analysis problem. To see this, suppose we are able to show that the fundamental min-optimality properties hold for this dynamic program. Let the min-value function be denoted by $\gmin$ . Continuing the convention that the controller always stops when indifferent between stopping and continuing, Bellman’s principle of min-optimality tells the controller to stop if and only if $e(\pi) \leq c + (P\gmin)(\pi)$ .

In the present setting, this means that the controller stops if and only if at least one of the stopping losses $\pi L_0$ and $(1-\pi)L_1$ is less than or equal to the continuation loss $c + (P\gmin)(\pi)$ . When this stop occurs, the controller then makes the static choice over the two density options (selecting $f_0$ or $f_1$ ) depending on which of $\pi L_0$ and $(1-\pi)L_1$ is smaller. Since this static problem is trivial, we can concentrate on solving the dynamic problem represented by the Bellman equation (3.28).
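As a rough numerical companion to the Bellman equation (3.28), the sketch below iterates $g \mapsto \min\{e, c + Pg\}$ on a grid of beliefs. Several simplifications are assumptions made only for illustration: the observations take finitely many values, so $f_0$ and $f_1$ are probability mass functions and the integral in (3.26) becomes a sum; $g$ is evaluated off the grid by linear interpolation; and the parameter values are made up. It is not meant as a definitive solution method.

```python
def solve_stopping_problem(f0, f1, L0, L1, c,
                           grid_size=99, tol=1e-10, max_iter=10_000):
    """Iterate g -> min{e, c + Pg} on a belief grid, starting from g = e."""
    # Interior grid points of (0, 1): 1/(N+1), ..., N/(N+1).
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    e = [min(p * L0, (1 - p) * L1) for p in grid]

    def interp(g, p):
        # Piecewise linear interpolation, constant beyond the grid ends.
        pos = p * (grid_size + 1) - 1
        if pos <= 0:
            return g[0]
        if pos >= grid_size - 1:
            return g[-1]
        i = int(pos)
        w = pos - i
        return (1 - w) * g[i] + w * g[i + 1]

    def continuation(g, p):
        # c + sum_z g(kappa(p, z)) psi(p, z)
        total = c
        for q0, q1 in zip(f0, f1):
            psi = (1 - p) * q0 + p * q1
            if psi > 0:
                total += psi * interp(g, p * q1 / psi)
        return total

    g = e[:]
    for _ in range(max_iter):
        new_g = [min(e_p, continuation(g, p)) for p, e_p in zip(grid, e)]
        gap = max(abs(x - y) for x, y in zip(new_g, g))
        g = new_g
        if gap < tol:
            break
    # Stop (sigma = 1) when the stopping loss is no larger than continuing.
    sigma = [1 if e_p <= continuation(g, p) else 0 for p, e_p in zip(grid, e)]
    return grid, e, g, sigma


grid, e, g, sigma = solve_stopping_problem(
    f0=[0.6, 0.4], f1=[0.4, 0.6], L0=1.0, L1=1.0, c=0.05
)
```

With these values the computed policy stops near the ends of the belief interval and continues sampling around $\pi = 1/2$ , where one more observation is cheap relative to the expected loss from deciding immediately.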

The stopping problem just described is a special case of the no-discount optimal stopping problem from Section 3.2.3, with the exit cost function $e$ given by (3.27) and the flow cost $c$ constant over the state space $\Xsf = (0,1)$ . (To confirm this, compare the Bellman equation (3.28) with the general case in (3.11).) As a result, Theorem 3.2.9 applies. All we need to do is check that Assumption 3.2.1 is valid in the current setting. This turns out to be true whenever $f_0$ and $f_1$ are distinct. Our proof will rely on a bound for martingale stopping times in Theorem A.3.9.

3.2.4.2Verifying Assumption 3.2.1¶

Let’s consider verification of Assumption 3.2.1 in the present setting. Let $\bar \tau$ be the upper bound stopping time for the belief state (i.e., the stopping time in (3.15) specialized to the current setting). This stopping time is defined in terms of the certain exit region (see (3.14)), which, for our problem, is

\bar E \coloneq \setntn{\pi \in (0,1)}{\min\{\pi L_0, \; (1-\pi) L_1\} \leq c}.

Equivalently,

\pi \in \bar E \iff \pi \leq \frac{c}{L_0} \quad \text{or} \quad \pi \geq 1 - \frac{c}{L_1}.

The lower bound policy $\bar \sigma$ is (recalling its definition from Section 3.2.3.3) the indicator function for $\bar E$ , stopping the process whenever the belief state enters the certain exit region. The upper bound stopping time is, therefore,

\bar \tau = \inf \left\{ n \geq 0 \,:\, \pi_n \leq \frac{c}{L_0} \text{ or } \pi_n \geq 1 - \frac{c}{L_1} \right\}.

(3.29)

Here the $(\pi_n)$ process evolves according to the kernel $P$ from (3.26): given $\pi_n$ , we draw $Z_{n+1}$ independently from $\psi(\pi_n, \cdot)$ , and then set

\pi_{n+1} = \kappa(\pi_n, Z_{n+1}) = \frac{\pi_n f_1(Z_{n+1})}{\psi(\pi_n, Z_{n+1})}.

(3.30)

(Note here that division by zero is not a concern: For $\pi \in (0,1)$ , we have $\psi(\pi, z) > 0$ if and only if $f_0(z) + f_1(z) > 0$ . Since $Z_{n+1}$ is drawn from $\psi(\pi_n, \cdot)$ , we have $\psi(\pi_n, Z_{n+1}) > 0$ almost surely.)
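The update rule (3.30) is easy to express in code. The following minimal sketch takes the values $f_0(z)$ and $f_1(z)$ at an observed $z$ and returns the updated belief. The function names are hypothetical, and the example uses observations with two possible values (so $f_0$ and $f_1$ are probability mass functions) purely for illustration.

```python
def predictive(pi, q0, q1):
    """psi(pi, z) = (1 - pi) f0(z) + pi f1(z), with q0 = f0(z), q1 = f1(z)."""
    return (1 - pi) * q0 + pi * q1


def update_belief(pi, q0, q1):
    """kappa(pi, z) = pi f1(z) / psi(pi, z)."""
    psi = predictive(pi, q0, q1)
    if psi <= 0:
        # Such a z has zero probability under the predictive distribution.
        raise ValueError("observation has zero predictive probability")
    return pi * q1 / psi


# Illustrative two-outcome example (numbers are made up).
f0 = [0.6, 0.4]
f1 = [0.4, 0.6]
pi = 0.3

posteriors = [update_belief(pi, q0, q1) for q0, q1 in zip(f0, f1)]
# Average of the updated belief under the predictive distribution.
mean_posterior = sum(predictive(pi, q0, q1) * update_belief(pi, q0, q1)
                     for q0, q1 in zip(f0, f1))
```

An outcome that is more likely under $f_0$ pushes the belief down and one that is more likely under $f_1$ pushes it up, while `mean_posterior` equals `pi` up to rounding, since $\sum_z \psi(\pi, z) \kappa(\pi, z) = \pi \sum_z f_1(z) = \pi$ .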

Matching (3.16), the policy set $\Sigma$ under consideration is all $\bB$ -measurable $\sigma \colon \Xsf \to \{0,1\}$ with $\bar \sigma \leq \sigma$ . For $\sigma$ in this set, the policy operator takes the form

(T_\sigma \, g)(\pi) = \sigma(\pi) e(\pi) + (1 - \sigma(\pi)) \left[c + \int g(\pi') P(\pi, \diff \pi') \right].

(3.31)

Let $V$ be the set of all bounded Borel measurable functions from $(0,1)$ to $\RR_+$ . With $\TT_{\rm SA}$ as the family of operators described by (3.31), indexed by $\sigma \in \Sigma$ , the pair $(V, \TT_{\rm SA})$ forms an ADP. For this ADP, we can state the following result. In the statement, two densities are distinct when they are not equal almost everywhere.

To prove Proposition 3.2.10, we recognize $(V, \TT_{\rm SA})$ as a special case of the no-discount optimal stopping ADP treated in Theorem 3.2.9. As such, we only need to verify Assumption 3.2.1, which amounts to showing that $\sup_{0 < \pi < 1} \EE_\pi \bar \tau$ is finite.

As a first step, we recall that, given two probability densities $f_0$ and $f_1$ on $\RR$ , the triangular discrimination is

\Delta(f_0, f_1) := \int \frac{[f_1(z) - f_0(z)]^2}{f_0(z) + f_1(z)} \, \diff z,

where the integrand is defined to be zero when $f_0(z) + f_1(z) = 0$ .

Now let’s investigate the properties of stopping times for the process $(\pi_n)$ from (3.30). We will begin with a generic stopping time

\tau = \inf\{n \geq 0 : \pi_n \notin (a,b)\},

where $a, b$ are numbers satisfying $0 < a < b < 1$ .

Proof

Fix $\pi_0 \in (0,1)$ . If $\pi_0$ is not in $(a,b)$ , then $\tau = 0$ , in which case the claim is trivial. Hence, from now on, we assume that $\pi_0$ is in $(a,b)$ . Note that $(\pi_n)_{n \geq 0}$ is a martingale, since

\EE[\pi_{n+1} \mid \pi_n = \pi] = \int \frac{\pi f_1(z)}{\psi(\pi, z)} \cdot \psi(\pi, z) \, \diff z = \int \pi f_1(z) \, \diff z = \pi.

We apply Theorem A.3.9 to this bounded martingale. Our main task is to obtain $\delta$ in (A.10). For fixed $\pi$ , the law of motion (3.30) yields

\EE[\pi^2_{n+1} \mid \pi_n = \pi] = \int \frac{\pi^2 f^2_1(z)}{\psi(\pi, z)} \, \diff z = \pi^2 \int \frac{f^2_1(z)}{\psi(\pi, z)} \, \diff z.

(As for the triangular discrimination, the integrand is defined to be zero when $\psi(\pi, z) = 0$ , which occurs only when $f_0(z) = f_1(z) = 0$ .) Using the fact that $(\pi_n)$ is a martingale, we get

v(\pi) := \EE[(\pi_{n+1} - \pi_n)^2 \mid \pi_n = \pi] = \pi^2 \left[\int \frac{f^2_1(z)}{\psi(\pi, z)} \, \diff z - 1\right].

To simplify the bracketed term, we observe that

\begin{aligned} \int \frac{[f_1(z) - \psi(\pi, z)]^2}{\psi(\pi, z)} \, \diff z &= \int \frac{f_1(z)^2}{\psi(\pi, z)} \, \diff z - 2\int f_1(z) \, \diff z + \int \psi(\pi, z) \, \diff z \\ &= \int \frac{f_1(z)^2}{\psi(\pi, z)} \, \diff z - 1. \end{aligned}

Using this fact, combined with $f_1(z) - \psi(\pi, z) = (1-\pi)[f_1(z) - f_0(z)]$ and the definition of $v(\pi)$ , gives

v(\pi) = \pi^2 (1-\pi)^2 \int \frac{[f_1(z) - f_0(z)]^2}{\psi(\pi, z)} \, \diff z.

(3.32)

Since $\psi(\pi, z) \leq f_0(z) + f_1(z)$ , we have

\int \frac{[f_1(z) - f_0(z)]^2}{\psi(\pi, z)} \, \diff z \geq \int \frac{[f_1(z) - f_0(z)]^2}{f_0(z) + f_1(z)} \, \diff z = \Delta(f_0, f_1).

(3.33)

Combining (3.32) and (3.33), for $\pi \in [a,b]$ :

v(\pi) \geq \pi^2 (1-\pi)^2 \Delta(f_0, f_1) \geq a^2 (1-b)^2 \Delta(f_0, f_1) =: \delta.

By Lemma 3.2.11, the triangular discrimination is strictly positive when $f_0$ and $f_1$ are distinct, so $\delta > 0$ . Since $v(\pi) = \EE[(\pi_{n+1} - \pi_n)^2 \mid \pi_n = \pi]$ , this verifies condition (A.10) for $M_n = \pi_n$ and this choice of $\delta$ . Applying Theorem A.3.9 and using $(\pi_\tau - \pi_0)^2 < 1$ , which holds because both $\pi_\tau$ and $\pi_0$ lie in $(0,1)$ , we get

\EE_{\pi_0}[\tau] \leq \frac{\EE_{\pi_0}[(\pi_\tau - \pi_0)^2]}{\delta} < \frac{1}{\delta}.

This verifies the claim in Lemma 3.2.12. ◻
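The algebra in this proof can be sanity-checked numerically. The sketch below replaces densities on $\RR$ with probability mass functions on a finite set (an assumption made only so that integrals become sums), computes the conditional variance $v(\pi)$ directly from the update rule, and compares it with the closed form (3.32) and the lower bound $\delta = a^2 (1-b)^2 \Delta(f_0, f_1)$ . All names and numbers are hypothetical.

```python
def triangular_discrimination(f0, f1):
    """Delta(f0, f1), with terms set to zero where f0 + f1 = 0."""
    return sum((q1 - q0) ** 2 / (q0 + q1)
               for q0, q1 in zip(f0, f1) if q0 + q1 > 0)


def variance_direct(pi, f0, f1):
    """E[(pi_{n+1} - pi_n)^2 | pi_n = pi], computed from the update rule."""
    total = 0.0
    for q0, q1 in zip(f0, f1):
        psi = (1 - pi) * q0 + pi * q1
        if psi > 0:
            total += psi * (pi * q1 / psi - pi) ** 2
    return total


def variance_formula(pi, f0, f1):
    """Right-hand side of (3.32)."""
    total = 0.0
    for q0, q1 in zip(f0, f1):
        psi = (1 - pi) * q0 + pi * q1
        if psi > 0:
            total += (q1 - q0) ** 2 / psi
    return pi ** 2 * (1 - pi) ** 2 * total


f0 = [0.6, 0.4]
f1 = [0.4, 0.6]
a, b = 0.25, 0.75

Delta = triangular_discrimination(f0, f1)
delta = a ** 2 * (1 - b) ** 2 * Delta
# Beliefs in [a, b] at which to compare the two expressions.
test_points = [a + k * (b - a) / 20 for k in range(21)]
direct = [variance_direct(p, f0, f1) for p in test_points]
formula = [variance_formula(p, f0, f1) for p in test_points]
```

Here `Delta` is $0.08$ and `delta` is $0.0003125$ , so the lemma's bound on the expected exit time from $(a, b)$ is $1/\delta = 3200$ for this pair of distributions.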

Proof

Proof of Proposition 3.2.10.

Let $f_0$ and $f_1$ be distinct. Since $(V, \TT_{\rm SA})$ is a special case of the no-discount optimal stopping ADP treated in Theorem 3.2.9, we only need to show that $\sup_{0 < \pi < 1} \EE_\pi \bar \tau$ is finite, where $\bar \tau$ is as defined in (3.29). If $c/L_0 + c/L_1 \geq 1$ , then $\bar E = (0,1)$ , so $\bar \tau = 0$ a.s. and Assumption 3.2.1 holds trivially. Otherwise, $a \coloneq c/L_0$ and $b \coloneq 1 - c/L_1$ satisfy $0 < a < b < 1$ , and we can apply Lemma 3.2.12 to $\bar\tau$ . This leads us to $\EE_{\pi}[\bar \tau] < 1/\delta$ , where

\delta \coloneq \left(\frac{c}{L_0}\right)^2 \left(1 - 1 + \frac{c}{L_1}\right)^2 \Delta(f_0, f_1) = \left(\frac{c^2}{L_0L_1}\right)^2 \Delta(f_0, f_1).

Since $f_0$ and $f_1$ are distinct, Lemma 3.2.11 implies that $\Delta(f_0, f_1) > 0$ , so $\delta > 0$ . The bound $\EE_{\pi}[\bar \tau] < 1/\delta$ is valid for all $\pi \in (0, 1)$ , confirming that Assumption 3.2.1 holds when $f_0$ and $f_1$ are distinct. As a result, all the claims in Theorem 3.2.9 are valid. ◻
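For intuition about how conservative the bound $1/\delta$ can be, the following sketch simulates the upper bound stopping time $\bar \tau$ from (3.29) and compares its sample mean with $1/\delta$ . As before, observations with finitely many values stand in for densities on $\RR$ , and the parameter values, function name and seed are all hypothetical. A Monte Carlo average is only a rough sanity check, not a substitute for the proof.

```python
import random


def simulate_upper_stopping_time(pi0, f0, f1, a, b, rng, max_steps=10_000):
    """Simulate the belief process and return the first n with
    pi_n <= a or pi_n >= b (capped at max_steps as a safeguard)."""
    pi, n = pi0, 0
    while a < pi < b and n < max_steps:
        # Draw the next observation from the predictive distribution psi(pi, .).
        weights = [(1 - pi) * q0 + pi * q1 for q0, q1 in zip(f0, f1)]
        z = rng.choices(range(len(f0)), weights=weights)[0]
        # Bayesian update (3.30).
        pi = pi * f1[z] / weights[z]
        n += 1
    return n


f0 = [0.6, 0.4]
f1 = [0.4, 0.6]
L0, L1, c = 1.0, 1.0, 0.25
a, b = c / L0, 1 - c / L1          # boundaries of the certain exit region

Delta = sum((q1 - q0) ** 2 / (q0 + q1) for q0, q1 in zip(f0, f1))
delta = (c ** 2 / (L0 * L1)) ** 2 * Delta

rng = random.Random(0)             # fixed seed for reproducibility
samples = [simulate_upper_stopping_time(0.5, f0, f1, a, b, rng)
           for _ in range(2000)]
mean_tau = sum(samples) / len(samples)
bound = 1 / delta
```

In this example `bound` equals $3200$ , while the simulated mean of $\bar \tau$ started from $\pi_0 = 1/2$ is far smaller. The bound is crude, but finiteness uniformly in $\pi$ is all that Assumption 3.2.1 requires.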

3.3Chapter Notes¶

The results in this chapter extend the abstract dynamic programming framework of Chapter 2 by adding topological structure to the value space. The contraction-based approach in Section 3.1.2 is rooted in the classical work of Denardo (1967) and Bertsekas (2022). Earlier foundations for the operator-theoretic approach to dynamic programming include Blackwell (1965) on discounted models, Strauch (1966) on negative dynamic programming, and Bertsekas (1977) on monotone mappings without contraction. The order-theoretic perspective on fixed points used in Section 3.1.1 connects to Marinacci & Montrucchio (2019), who establish uniqueness results for Tarski-type fixed points of monotone operators. The pospace ADP framework of this chapter is developed in Sargent & Stachurski (2025).

The Q-factor representation of MDPs treated in Section 3.2.1.2 is used extensively in reinforcement learning, where the Bellman equation over Q-factors provides the basis for model-free algorithms. The name originates from Q-learning, introduced by Watkins (1989); see Watkins & Dayan (1992) for the convergence proof and Tsitsiklis (1994) for convergence analysis under asynchronous updates. Standard references on reinforcement learning include Bertsekas & Tsitsiklis (1996) and Sutton & Barto (2018).

The optimal savings problem in Section 3.2.2 has a long history, originating with Brock & Mirman (1972) in the stochastic setting; see also Stokey & Lucas (1989). The strongly continuous case draws on classical results for MDPs with continuous densities, while the weakly continuous case uses Berge’s maximum theorem and relates to the Feller-continuity approach to MDPs developed in Hernández-Lerma & Lasserre (2012) and Hernández-Lerma & Lasserre (2012).

The no-discount optimal stopping framework in Section 3.2.3 covers problems where contractivity fails due to the absence of discounting. Such problems arise in shortest path problems, where the foundational reference is Bertsekas & Tsitsiklis (1991); in bandit problems, where the seminal work is Gittins (1979); and in sequential sampling. For general treatments of optimal stopping theory, see Shiryaev (2007), Peskir & Shiryaev (2006), and Chow et al. (1971).

The sequential analysis problem treated in Section 3.2.4 goes back to the pioneering work of Wald (1947). The optimality of the sequential probability ratio test was established by Wald & Wolfowitz (1948). Modern treatments and extensions of sequential analysis include DeGroot (1970), Siegmund (1985), and Lai (2001). The triangular discrimination used in our proof of Lemma 3.2.12 is a classical $f$ -divergence; see Topsøe (2000) for sharp inequalities involving this divergence. The general theory of $f$ -divergences was introduced independently by Csiszár (1967) and Ali & Silvey (1966); for a modern treatment, see Liese & Vajda (2006).

References¶

Meyn, S. P., & Tweedie, R. L. (2009). Markov chains and stochastic stability. Cambridge University Press.
Denardo, E. V. (1967). Contraction Mappings in the Theory Underlying Dynamic Programming. SIAM Review, 9(2), 165–177.
Bertsekas, D. P. (2022). Abstract dynamic programming (3rd ed.). Athena Scientific.
Blackwell, D. (1965). Discounted Dynamic Programming. The Annals of Mathematical Statistics, 36(1), 226–235.
Strauch, R. E. (1966). Negative Dynamic Programming. The Annals of Mathematical Statistics, 37(4), 871–890.
Bertsekas, D. P. (1977). Monotone mappings with application in dynamic programming. SIAM Journal on Control and Optimization, 15(3), 438–464.
Marinacci, M., & Montrucchio, L. (2019). Unique Tarski fixed points. Mathematics of Operations Research, 44(4), 1174–1191.
Sargent, T. J., & Stachurski, J. (2025). Dynamic Programs on Partially Ordered Sets. SIAM Journal on Optimization and Control, in press.
Watkins, C. J. C. H. (1989). Learning from delayed rewards [PhD thesis]. King’s College, Cambridge, United Kingdom.
Watkins, C. J., & Dayan, P. (1992). Q-Learning. Machine Learning, 8, 279–292.
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 185–202.
Bertsekas, D., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Brock, W. A., & Mirman, L. J. (1972). Optimal economic growth and uncertainty: The discounted case. Journal of Economic Theory, 4(3), 479–513.
Stokey, N. L., & Lucas, R. E. (1989). Recursive methods in dynamic economics. Harvard University Press.
