Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

3 ADPs on Pospace

In this chapter, we add topological structure to the value space and the policy operators. This allows us to provide sufficient conditions for optimality that are often easy to use in applications. Proofs of the theorems presented in this chapter leverage the poset ADP theory that we constructed in Chapter 2.

We begin by studying ADPs on pospace, then specialize to partially ordered metric spaces, where contractivity arguments become available. Minimization counterparts and a treatment of nonstationary policies round out the theoretical development. In Section 3.2 we apply the theory to discrete MDPs and Q-factors, optimal savings, no-discount optimal stopping, and the sequential analysis problem from Section 1.4.

3.1Adding Topology

In this section we introduce ADPs in settings where the value space is a poset and also a pospace (i.e., partially ordered space, see Section A.5.3.1 for the definition and basic properties). We study optimality in settings where the policy operators have topological stability properties, as well as order properties. Throughout this chapter, VV is always assumed to be a pospace.

In Section 3.1.1 we study ADPs on pospace, leveraging global stability to obtain optimality and convergence results. In Section 3.1.2 we specialize to partially ordered metric spaces, where contraction-based arguments apply. Section 3.1.3 provides minimization counterparts and Section 3.1.4 treats nonstationary policies.

3.1.1ADPs on Pospace

Let VV be a pospace and let (V,T)(V, \TT) be an ADP. Recall that a self-map SS on VV is called globally stable when SS has a unique fixed point vˉ\bar v in VV and SnvvˉS^n v \to \bar v as nn \to \infty for all vVv \in V. (See Section A.2.2 for more background.) We say that

Obviously, if (V,T)(V, \TT) is globally stable, then (V,T)(V, \TT) is well-posed. We also have the following useful preliminary result, which is an immediate consequence of Lemma A.5.19.

The next result shows that regularity and global stability together yield strong optimality properties.

Theorem 3.1.2 is a high-level result because a condition (existence of a fixed point) is placed on the derived object TT, rather than the primitives VV and T\TT. Below, we present a sequence of results that leverage Theorem 3.1.2 while also placing all assumptions on primitives. The first is the relatively obvious but useful corollary, which adds convergence of VFI and OPI to Theorem 2.2.6.

The conditions of the next theorem are similar to those of Theorem 2.2.8, after replacing order stability and order continuity with global stability.

3.1.2ADPs on Metric Space

In this section we add more structure by assuming that our value space VV is in fact a partially ordered metric space, by which we mean that V=(V,d,)V = (V, d, \preceq) where dd is a metric on VV and (V,)(V, \preceq) is a pospace under the topology generated by dd. By construction, this specializes the earlier setting of Section 3.1.1, where (V,)(V, \preceq) is just a pospace.

Throughout Section 3.1.2,

We will say that the ADP (V,T)(V, \TT) is semi-regular if there exists a closed subset V0V_0 of VV such that V0VGV_0 \subset V_G and TV0V0T V_0 \subset V_0. In addition, we will say that VFI is geometrically convergent on V0V_0 if v\vmax exists and there is a β(0,1)\beta \in (0,1) such that

 d(Tnv,v)=O(βn) as n for each vV0\text{ } d(T^n v, \vmax) = \OO(\beta^n) \text{ as } n \to \infty \text{ for each } v \in V_0 \text{. }

Here f(n)=O(βn)f(n) = \OO(\beta^n) means that there exists a C<C < \infty with f(n)Cβnf(n) \leq C \beta^n for all nNn \in \NN.

In stating the next theorem, we use the notion of sup-nonexpansiveness from Section A.5.3.2.

3.1.3Minimization

Because of Exercise 2.7, the optimality and convergence theorems based around maximization can easily be converted to theorems for minimization. In this section we give some examples. Our first is a min-version of Theorem 3.1.2.

For the rest of Section 3.1.3, V=(V,d,)V = (V, d, \preceq) is always assumed to be a partially ordered metric space. Our next result is a min-version of Theorem 3.1.5, restricted to the regular case.

Now let’s consider a min-version of Theorem 3.1.4.

3.1.4Nonstationary Policies

In all of the preceding discussion we focused on stationary policies. For example, in the context of the optimal savings problem from Section 1.3, we fixed a policy σ\sigma and computed its lifetime value vσv_\sigma by assuming that σ\sigma is applied at every tt in {0,1,}\{0,1,\ldots\}. In particular, in Section 1.3.1.2, we showed that, for vv arbitrarily chosen from the value space,

vσ=limnTσnv.v_\sigma = \lim_{n \to \infty} T^n_\sigma \, v.

This expression illustrates how lifetime value is obtained by repeatedly applying the same policy.

But is this focus on stationary policies justified? Could it be that higher lifetime value is available when we allow a change of policy in each period?

To address this question, suppose that we can select a policy plan σˉ(σt)t0\bar \sigma \coloneq (\sigma_t)_{t \geq 0} in the infinite Cartesian product ×t0Σ\times_{t \geq 0} \Sigma and apply the tt-th element σt\sigma_t at time tt. Generalizing (3.2), the lifetime value of σˉ\bar \sigma can be defined by

vσˉ=limnTσ0Tσ1Tσnv.v_{\bar \sigma} = \lim_{n \to \infty} T_{\sigma_0} T_{\sigma_1} \cdots T_{\sigma_n} v.

Of course, for the definition in (3.3) to make sense we need to know that the limit exists. Ideally, it should also be independent of vv. (In (3.3), we iterate backwards in time, applying TσjT_{\sigma_j} first, because vv is best thought of as a terminal condition, rather than an initial condition. See Section 1.3.1.2 for intuition.)

Since the expression (3.3) requires a topology, we consider an ADP (V,T)(V, \TT) where V=(V,)V = (V, \preceq) is a partially ordered space. We also assume that the topology on VV is generated by a metric dd. As usual, T{Tσ:σΣ}\TT \coloneq \setntn{T_\sigma}{\sigma \in \Sigma} is a family of order preserving self-maps on VV. To ensure that (3.3) exists we also require the following:

We will make use of the following preliminary results.

We can now prove the main result of this section, which shows that any policy plan is (weakly) dominated in value by a stationary policy.

3.2Applications

We now apply the theory developed above to a range of problems. Section 3.2.1 revisits discrete MDPs and introduces Q-factors. Section 3.2.2 treats optimal savings under both strong and weak continuity assumptions. Section 3.2.3 develops a no-discount optimal stopping framework, which is then applied to sequential analysis in Section 3.2.4.

3.2.1Discrete MDPs and Q-Factors

In Section 3.2.1.1 we re-derive the optimality results for finite MDPs using the pospace theory of this chapter. In Section 3.2.1.2 we introduce the Q-factor formulation of the MDP and establish its optimality properties. The Q-factor formulation underpins reinforcement learning, one of the most influential branches of modern AI. Reinforcement learning algorithms---most notably Q-learning---use Q-factors to learn optimal policies from data, without requiring a model of the environment. Here we focus on the underlying dynamic programming theory associated with Q-factors, taking the model as given. Further discussion of Q-learning can be found in Section 3.3 and Section 9.2.1.

3.2.1.1MDPs Optimality, Again

In Section 1.2 we introduced a finite MDP (Γ,r,β,P)(\Gamma, r, \beta, P) with state space X\Xsf and action space A\Asf. In Section 2.3.3.1 we showed that this finite MDP can be framed as ADP (RX,TMDP)(\RR^\Xsf, \TT_{\rm MDP}), with policy operators given by Tσ=rσ+βPσT_\sigma = r_\sigma + \beta P_\sigma. In Section 2.3.3.2 we established all of the major optimality properties for (RX,TMDP)(\RR^\Xsf, \TT_{\rm MDP}). For the purpose of illustration, working in a familiar and simple environment, let’s re-prove these results using the theorems in this chapter.

First, we know from Section 2.3.3.1 that (RX,TMDP)(\RR^\Xsf, \TT_{\rm MDP}) is regular. In Exercise 1.4 we proved that every policy operator TσT_\sigma is globally stable. Since TMDP\TT_{\rm MDP} is finite Corollary 3.1.3 applies. This proves validity of the fundamental optimality properties and convergence of all algorithms. Since global stability implies order stability, convergence of HPI in finite time holds by Theorem 2.2.6.

3.2.1.2The Q-Factor Model

Next we examine the Q-factor variation of the MDP model. This modification provides an alternative view on the Bellman equation that unlocks a stochastic approximation approach to reinforcement learning. We can study optimality of the Q-factor variation either by treating it directly, as an ADP in its own right, or by inferring optimality properties from the original MDP version of the problem, which we just obtained in Section 3.2.1.1. The first approach is treated here and the second is treated in Chapter 5.

To begin, we take the MDP model from Section 3.2.1.1 and, given vRXv \in \RR^\Xsf, set

q(x,a)r(x,a)+βxv(x)P(x,a,x)((x,a)G).q(x, a) \coloneq r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \qquad ((x,a) \in \Gsf).

The function qq is called the Q-factor corresponding to vv. We will convert the original MDP Bellman equation (2.43) into an equation in QQ-factors. The first step is to observe that, given qq in (3.11), the Bellman equation can be written as v(x)=maxaΓ(x)q(x,a)v(x) = \max_{a \in \Gamma(x)}q(x, a). Taking the expectations and discounting on both sides of this equation yields

βxv(x)P(x,a,x)=βxmaxaΓ(x)q(x,a)P(x,a,x).\beta \sum_{x'} v(x') P(x, a, x') = \beta \sum_{x'} \max_{a' \in \Gamma(x')}q(x', a') P(x, a, x').

Adding r(x,a)r(x,a) and using the definition of qq again gives

q(x,a)=r(x,a)+βxmaxaΓ(x)q(x,a)P(x,a,x).q(x, a) = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')}q(x', a') P(x, a, x').

This is the Q-factor Bellman equation. To study it, we introduce a family of policy operators S{Sσ:σΣ}\SS \coloneq \setntn{S_\sigma}{\sigma \in \Sigma} via

(Sσq)(x,a)=r(x,a)+βxq(x,σ(x))P(x,a,x)((x,a)G).(S_\sigma \, q)(x, a) = r(x, a) + \beta \sum_{x'} q(x', \sigma(x')) P(x, a, x') \qquad ((x,a) \in \Gsf).

Here SσS_\sigma acts on function qRGq \in \RR^\Gsf. The set RG\RR^\Gsf is paired with the pointwise partial order.

By definition, the ADP Bellman operator corresponding to (RG,S)(\RR^\Gsf, \SS) obeys SqσSσqS q \coloneq \bigvee_\sigma S_\sigma \, q. The next exercise helps us connect this to the Q-factor Bellman equation (3.13).

Exercise 3.4 tells us that, as expected, qRGq \in \RR^\Gsf is a fixed point of SS if and only if it is a solution to the Q-factor Bellman equation (3.13).

Now let’s turn to optimality.

The next exercise asks you to confirm the core optimality properties also hold for the Q-factor MDP. You may like to use Theorem 3.1.5.

3.2.2Optimal Savings

In Section 1.3 we introduced a simple optimal savings problem. In Section 2.3.2, we converted the optimal savings problem from Section 1.3 into an ADP (V,TOS)(V, \TT_{\rm OS}), where V=bR+V = b\RR_+ and each TσTOST_\sigma \in \TT_{\rm OS} takes the form

(Tσv)(w)=u(σ(w))+βv(R(wσ(w))+y)ϕ( ⁣dy)(wR+).(T_\sigma \, v)(w) = u(\sigma(w)) + \beta \int v(R(w - \sigma(w)) + y) \phi(\diff y) \qquad (w \in \RR_+).

Now we turn to optimality. In Section 3.2.2.1 we will maintain the conditions in Assumption 1.3.1, so that uu is continuous and bounded on R+\RR_+ and the distribution of labor income can be represented by a continuous density. In Section 3.2.2.2 we will drop the continuous density assumption.

3.2.2.1The Strongly Continuous Case

Maintaining Assumption 1.3.1, we prove the following optimality properties, which were stated without proof in Section 1.3.2.2.

Let’s briefly translate these ADP results into the optimal savings results stated in Section 1.3.2.2. We showed in Section 2.3.2 that vVv \in V satisfies the ADP Bellman equation σTσv=v\bigvee_\sigma T_\sigma \, v=v if and only if it satisfies

v(w)=max0cw{u(c)+βv(R(wc)+y)ϕ( ⁣dy)}for all w0.v(w) = \max_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\} \quad \text{for all } w \geq 0.

By this fact and the fundamental optimality properties, the value function v\vmax exists and is the unique solution to (3.19) in VV.

In Section 2.3.2, we say that a policy σ\sigma is vv-greedy if and only if

σ(w)argmax0cw{u(c)+βv(R(wc)+y)ϕ( ⁣dy)}for all w0.\sigma(w) \in \argmax_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\} \quad \text{for all } w \geq 0.

Applying Bellman’s principle of optimality, a policy is optimal if and only if it satisfies (3.20) with vv replaced by v\vmax.

3.2.2.2The Weakly Continuous Case

Now let’s prove a result similar to Proposition 3.2.1 under weaker conditions. In particular, we modify Assumption 1.3.1 by dropping the assumption that ϕ\phi is a continuous density. Instead we’ll let ϕ\phi be an arbitrary probability measure on the Borel subset of R+\RR_+. Other conditions in Assumption 1.3.1 are maintained.

Without the continuity of ϕ\phi, Lemma 1.3.2 on page  fails. In particular, we cannot claim that each vbR+v \in b\RR_+ has a greedy policy. As a result, (V,TOS)(V, \TT_{\rm OS}) is no longer regular. However, we do have the following result, stated as a solved exercise.

With the results from this exercise in hand, we can prove the next proposition.

3.2.3No-Discount Optimal Stopping

Many important applications---including sampling problems, shortest path and routing problems, bandit problems, and reinforcement learning tasks---involve no discounting. In the absence of a discount factor, contractivity of the policy operators typically fails, so the results based on contraction mappings do not directly apply. This necessitates more sophisticated techniques. One of our main aims is to provide foundations for solving the sequential sampling problem from Section 1.4.

3.2.3.1Setup

Let (X,B)(\Xsf, \bB) be a measurable space. We recall from Section A.5.4.1 that a discrete time X\Xsf-valued stochastic process (Xt)t0(X_t)_{t \geq 0} on probability space (Ω,F,Px)(\Omega, \fF, \PP_x) is called PP-Markov if

P{Xt+1BFt}=P(Xt,B)P-a.s. for all t0 and BB\PP \{X_{t+1} \in B \given \fF_t\} = P(X_t, B) \quad \PP \text{-a.s.\ for all } t \geq 0 \text{ and } B \in \bB \text{. }

We write Px\PP_x and Ex\EE_x for probabilities and expectations when conditioning on X0=xX_0 = x.

Let bX+b\Xsf_+ be all gbXg \in b\Xsf taking only nonnegative values. We consider a cost minimization problem with Bellman equation

g(x)=min{e(x),c(x)+g(x)P(x, ⁣dx)},g(x) = \min \left\{ e(x), c(x) + \int g(x') P(x, \diff x') \right\},

where e,ce, c are functions in bX+b\Xsf_+, while PP is a stochastic kernel on X\Xsf. The function ee is called the exit cost function and cc is called the flow cost. The Bellman equation corresponds to a setting where a controller observes a PP-Markov state process (Xt)t0(X_t)_{t \geq 0} and decides when to stop. Stopping at time tt incurs the one-off penalty e(Xt)e(X_t). Continuing incurs the flow cost c(Xt)c(X_t), followed by transition to the new state Xt+1X_{t+1} and the opportunity to decide again.

3.2.3.2Policies

A policy is a B\bB-measurable map σ\sigma from X\Xsf to {0,1}\{0, 1\}, with σ(x)=1\sigma(x) = 1 indicating the decision to stop in state xx. Given a policy σ\sigma, we call

Eσ{xX:σ(x)=1}E_\sigma \coloneq \setntn{x \in \Xsf}{\sigma(x) = 1}

the exit region for σ\sigma. We call its complement EσcE_\sigma^c the continuation region.

To each policy σ\sigma, we associate the stopping time

τσinf{t0:XtEσ}=inf{t0:σ(Xt)=1}.\tau^\sigma \coloneq \inf\setntn{t \geq 0}{X_t \in E_\sigma} = \inf\setntn{t \geq 0}{\sigma(X_t) = 1}.

Here and below, the convention for the infimum is that inf\inf \varnothing \coloneq \infty. Also, given σ\sigma, we define the σ\sigma-loss function via

gσ(x)Ex[t=0τσ1c(Xt)+e(Xτσ)].g_\sigma(x) \coloneq \EE_x \left[ \sum_{t=0}^{\tau^\sigma-1} c(X_t) + e(X_{\tau^\sigma}) \right].

Here t=01c(Xt)\sum_{t=0}^{-1} c(X_t) is understood as 0, so that gσ(x)=e(x)g_\sigma(x) = e(x) when xEσx \in E_\sigma. The function gσg_\sigma takes values in [0,][0, \infty] and gσ(x)g_\sigma(x) represents the total expected cost when applying σ\sigma in every period, conditional on starting in state xx.

3.2.3.3The Lower Bound Policy

One policy of particular interest is σˉ=1{ec}\bar \sigma = \1\{e \leq c\}. To simplify notation, the exit region for this policy is denoted

EˉEσˉ={xX:e(x)c(x)}.\bar E \coloneq E_{\bar \sigma} = \setntn{x \in \Xsf}{e(x) \leq c(x)}.

and the stopping time is denoted

τˉτσˉ=inf{t0:σˉ(Xt)=1}.\bar \tau \coloneq \tau^{\bar \sigma} = \inf\setntn{t \geq 0}{\bar \sigma(X_t) = 1}.

We call Eˉ\bar E the certain exit region. Under the innocuous convention that the controller always stops when indifferent between stopping and continuing, any optimal policy will choose to stop when xEˉx \in \bar E. The reason is that the controller has the opportunity to exit at cost e(x)c(x)e(x) \leq c(x), and continuing incurs c(x)c(x) plus additional costs in subsequent stages.

The fact that the controller always stops when xEˉx \in \bar E allows us to shrink the policy space. Specifically, we consider only policies where σ(x)=1\sigma(x) = 1 for all xEˉx \in \bar E. Let Σ\Sigma be the set of all such policies. We can also express this set via

Σ{all B-measurable σ ⁣:X{0,1} with σˉσ}.\Sigma \coloneq \{ \text{all } \bB \text{-measurable } \sigma \colon \Xsf \to \{0,1\} \text{ with } \bar \sigma \leq \sigma \}.

Since σˉσ\bar \sigma \leq \sigma for all σΣ\sigma \in \Sigma, we refer to σˉ\bar \sigma as the lower bound policy. Also, since

σˉσ        τστˉ    P-a.s.,\bar \sigma \leq \sigma \; \implies \; \tau^\sigma \leq \bar \tau \;\; \PP\text{-a.s.},

we refer to τˉ\bar \tau as the upper bound stopping time.

Our key assumption is as follows.

For Assumption 3.2.1, it suffices to check that supxEˉcExτˉ\sup_{x \in \bar E^c} \EE_x \bar \tau is finite, since τˉ=0\bar \tau = 0 with probability one when xEˉx \in \bar E. Below we show that Assumption 3.2.1 is sufficient for all of the major optimality results associated with dynamic programming.

3.2.3.4Policy Operators

For each σΣ\sigma \in \Sigma, we define a policy operator TσT_\sigma via

(Tσg)(x)=σ(x)e(x)+(1σ(x))[c(x)+g(x)P(x, ⁣dx)],(T_\sigma \, g)(x) = \sigma(x) e(x) + (1 - \sigma(x)) \left[c(x) + \int g(x') P(x, \diff x') \right],

or, in operator notation, as

Tσg=σe+(1σ)c+Kσg(gbX),T_\sigma \, g = \sigma e + (1-\sigma) c + K_\sigma \, g \qquad (g \in b\Xsf),

where

(Kσg)(x)(1σ(x))g(x)P(x, ⁣dx).(xX).(K_\sigma \, g)(x) \coloneq (1 - \sigma(x)) \int g(x') P(x, \diff x'). \qquad (x \in \Xsf).

In (3.36), expressions such as σe\sigma e are understood as pointwise products.

As usual, the policy operator associated with σ\sigma is introduced with the idea that its fixed point gives lifetime value -- in this case lifetime cost functions -- generated by σ\sigma. This turns out to be true here as well, although the proof is not entirely trivial. It requires some familiarity with shift operators and the Markov property in its general form. An introduction to these topics can be found in Chapter 3 of Meyn & Tweedie (2009).

3.2.3.5ADP Formulation

Let T\TT be the set of all policy operators, as defined in (3.35), indexed over the restricted policy set Σ\Sigma defined in (3.31). Since each TσT_\sigma is order preserving, the pair (bX+,T)(b\Xsf_+, \TT) forms an ADP. For this ADP and given gbX+g \in b\Xsf_+, a policy σΣ\sigma \in \Sigma is gg-min-greedy when TσgTsgT_\sigma g \leq T_s g for all sΣs \in \Sigma. It is easy to check that one such policy always exists. Indeed, such a policy can be found by setting

σ(x)=1{e(x)c(x)+g(x)P(x, ⁣dx)}(xX)\sigma(x) = \1 \left\{ e(x) \leq c(x) + \int g(x') P(x, \diff x') \right\} \qquad (x \in \Xsf)

This policy is in Σ\Sigma, since σ\sigma is B\bB-measurable and, in addition,

e(x)c(x)    e(x)c(x)+g(x)P(x, ⁣dx),e(x) \leq c(x) \implies e(x) \leq c(x) + \int g(x') P(x, \diff x'),

so that σˉσ\bar \sigma \leq \sigma. Moreover, it is clear that

(Tσg)(x)=min{e(x),c(x)+g(x)P(x, ⁣dx)}(Tsg)(x)(T_\sigma \, g)(x) = \min \left\{ e(x), c(x) + \int g(x') P(x, \diff x') \right\} \leq (T_s \, g)(x)

for all sΣs \in \Sigma.

This argument also proves that the ADP (bX+,T)(b\Xsf_+, \TT) is regular.

In essence, a gg-min-greedy policy treats gg as a loss function, using it to associate the total expected cost of each state, and makes the current best choice accordingly.

3.2.3.6Stability of the Policy Operators

In this section we prove that (bX+,T)(b\Xsf_+, \TT) is globally stable.

We prove the proposition using a sequence of lemmas. In the statement of the next lemma, σ\sigma is a fixed policy, KσK_\sigma is as defined in (3.37), and τσ\tau^\sigma is as defined in (3.27).

We will also use the following result regarding stopping times.

3.2.3.7Optimality

We define the minimum loss function g\gmin via

g(x)infσΣgσ(x)(xX).\gmin(x) \coloneq \inf_{\sigma \in \Sigma} g_\sigma(x) \qquad (x \in \Xsf).

The function g\gmin also takes values in [0,)[0,\infty) and is well-defined everywhere on X\Xsf. The minimum loss function g\gmin is equal to the min-value function v\vmin of the ADP, as defined in Section 2.2.3.1. (This is because, by definition, v=σvσ\vmin = \bigwedge_\sigma v_\sigma, which is v=σgσ\vmin = \bigwedge_\sigma g_\sigma in the current setting. This equation reduces to (3.55) when working in bXb\Xsf with the pointwise partial order.)

A policy σΣ\sigma \in \Sigma is called optimal if gσgsg_\sigma \leq g_s for all sΣs \in \Sigma. This is equivalent to the statement that σ\sigma attains the minimum possible cost from every state, and is equivalent to the ADP definition in Section 2.2.3.1.

We can now state the following result.

3.2.4Sequential Analysis Revisited

With Theorem 3.2.9 in hand, we are in a position to prove optimality results for the sequential analysis problem we presented in Section 1.4. First we reduce the sequential analysis problem to a special case of the no-discount optimal stopping problem treated in Theorem 3.2.9. Then we check the conditions of that theorem. The only significant condition is Assumption 3.2.1, so this is where we will be investing all our effort.

3.2.4.1Set Up

Our first step is to show that the hard part of the sequential analysis problem is a special case of the no-discount optimal stopping problem from Section 3.2.3. To this end, let’s remind ourselves of the set up in Section 1.4. We recall that the state space for the belief state π\pi is X=(0,1)\Xsf = (0,1) and action space is A={0,1,2}\Asf = \{0, 1, 2\}, where action 0 represents accepting f0f_0, action 1 represents accepting f1f_1, and action 2 represents continuing to sample. We assume that the two densities are defined on R\RR.

Repeating (1.116), the Bellman equation has the form

g(π)=min{πL0,  (1π)L1,  c+g(π)P(π, ⁣dπ)}g(\pi) = \min \left\{ \pi L_0, \; (1-\pi) L_1, \; c + \int g(\pi') P(\pi, \diff \pi') \right\}

for π(0,1)\pi \in (0,1), where the stochastic kernel PP obeys

(Pg)(π)g(κ(π,z))ψ(π,z) ⁣dz.(Pg)(\pi) \coloneq \int g(\kappa(\pi, z)) \psi(\pi, z) \diff z.

Here, recalling (1.115),

ψ(π,z)=(1π)f0(z)+πf1(z)\psi(\pi,z) = (1-\pi)f_0(z) + \pi f_1(z)

is the predictive density and

κ(π,z)=πf1(z)(1π)f0(z)+πf1(z)=πf1(z)ψ(π,z)\kappa(\pi, z) = \frac{\pi f_1(z)}{(1-\pi) f_0(z) + \pi f_1(z)} = \frac{\pi f_1(z)}{\psi(\pi, z)}

is the Bayesian update rule. Together, ψ\psi and κ\kappa define the stochastic kernel PP governing the belief state (πn)n0(\pi_n)_{n \geq 0}, describing how beliefs evolve from the perspective of the controller when the observation sequence (Zn)n1(Z_n)_{n \geq 1} is forecast using the predictive density.

In all of what follows, the cost cc and the losses L0L_0 and L1L_1 are assumed to be positive constants. Our aim is to characterize and solve for optimal policies.

In terms of dynamic programming, we can simplify the sequential analysis to a binary stopping problem. Our first step is to set

e(π)min{πL0,  (1π)L1}.e(\pi) \coloneq \min\{\pi L_0, \; (1-\pi) L_1\}.

Now consider the Bellman equation

g(π)=min{e(π),  c+(Pg)(π)}.g(\pi) = \min \left\{ e(\pi), \; c + (Pg)(\pi) \right\}.

We claim that solving this dynamic program is sufficient for solving the sequential analysis problem. To see this, suppose we are able to show that the fundamental min-optimality properties hold for this dynamic program. Let the min-value function be denoted by g\gmin. Continuing the convention that the controller always stops when indifferent between stopping and continuing, Bellman’s principle of min-optimality tells the controller to stop if and only if e(π)c+(Pg)(π)e(\pi) \leq c + (P\gmin)(\pi).

In the present setting, this means that the controller stops if and only if at least one of the stopping losses πL0\pi L_0 and (1π)L1(1-\pi)L_1 is less than or equal to the continuation loss c+(Pg)(π)c + (P\gmin)(\pi). When this stop occurs, the controller then makes the static choice over the two density options (selecting f0f_0 or f1f_1) depending on which of πL0\pi L_0 and (1π)L1(1-\pi)L_1 is smaller. Since this static problem is trivial, we can concentrate on solving the dynamic problem represented by the Bellman equation (3.61).

The stopping problem just described is a special case of the no-discount optimal stopping from Section 3.2.3, with ee as the exit cost function in (3.60) and the flow cost cc constant over the state space X=(0,1)\Xsf = (0,1). (To confirm this, compare the Bellman equation (3.61) with the general case in (3.25).) As a result, Theorem 3.2.9 applies. All we need to do is check that Assumption 3.2.1 is valid in the current setting. This turns out to be true whenever f0f_0 and f1f_1 are distinct. Our proof will rely on a bound for martingale stopping times in Theorem A.3.9.

3.2.4.2Verifying Assumption 3.2.1

Let’s consider verification of Assumption 3.2.1 in the present setting. Let τˉ\bar \tau be the upper bound stopping time for the belief state (i.e., the stopping time in (3.30) specialized to the current setting). This stopping time is defined in terms of the certain exit region (see (3.29)), which, for our problem, is

Eˉ{π(0,1):min{πL0,  (1π)L1}c}.\bar E \coloneq \setntn{\pi \in (0,1)}{\min\{\pi L_0, \; (1-\pi) L_1\} \leq c}.

Equivalently,

πEˉ    πcL0orπ1cL1.\pi \in \bar E \iff \pi \leq \frac{c}{L_0} \quad \text{or} \quad \pi \geq 1 - \frac{c}{L_1}.

The lower bound policy σˉ\bar \sigma is (recalling its definition from Section 3.2.3.3) the indicator function for Eˉ\bar E, stopping the process whenever the belief state enters the certain exit region. The upper bound stopping time is, therefore,

τˉ=inf{n0:πncL0 or πn1cL1}.\bar \tau = \inf \left\{ n \geq 0 \,:\, \pi_n \leq \frac{c}{L_0} \text{ or } \pi_n \geq 1 - \frac{c}{L_1} \right\}.

Here the (πn)(\pi_n) process evolves according to the kernel PP from (3.57): given πn\pi_n, we draw Zn+1Z_{n+1} independently from ψ(πn,)\psi(\pi_n, \cdot), and then set

πn+1=κ(πn,Zn+1)=πnf1(Zn+1)ψ(πn,Zn+1).\pi_{n+1} = \kappa(\pi_n, Z_{n+1}) = \frac{\pi_n f_1(Z_{n+1})}{\psi(\pi_n, Z_{n+1})}.

(Note here that division by zero is not a concern: For π(0,1)\pi \in (0,1), we have ψ(π,z)>0\psi(\pi, z) > 0 if and only if f0(z)+f1(z)>0f_0(z) + f_1(z) > 0. Since Zn+1Z_{n+1} is drawn from ψ(π,)\psi(\pi, \cdot), we have ψ(πn,Zn+1)>0\psi(\pi_n, Z_{n+1}) > 0 almost surely.)

Matching (3.31), the policy set Σ\Sigma under consideration is all B\bB-measurable σ ⁣:X{0,1}\sigma \colon \Xsf \to \{0,1\} with σˉσ\bar \sigma \leq \sigma. For σ\sigma in this set, the policy operator takes the form

(Tσg)(π)=σ(π)e(π)+(1σ(π))[c+g(π)P(π, ⁣dπ)].(T_\sigma \, g)(\pi) = \sigma(\pi) e(\pi) + (1 - \sigma(\pi)) \left[c + \int g(\pi') P(\pi, \diff \pi') \right].

Let VV be all bounded Borel measurable functions from (0,1)(0,1) to R+\RR_+. With TSA\TT_{\rm SA} as the family of operators described by (3.66), indexed by σΣ\sigma \in \Sigma, the pair (b(0,1),TSA)(b(0,1), \TT_{\rm SA}) forms an ADP. For this ADP, we can state the following result. In the statement, two functions are distinct when they are not equal almost everywhere.

To prove Proposition 3.2.10, we recognize (V,TSA)(V, \TT_{\rm SA}) as a special case of the no-discount optimal stopping ADP treated in Theorem 3.2.9. As such, we only need to verify Assumption 3.2.1, which amounts to showing that sup0π1Eπτˉ\sup_{0 \leq \pi \leq 1} \EE_\pi \bar \tau is finite.

As a first step, we recall that, given two probability densities f0f_0 and f1f_1 on R\RR, the triangular discrimination is

Δ(f0,f1):=[f1(z)f0(z)]2f0(z)+f1(z) ⁣dz,\Delta(f_0, f_1) := \int \frac{[f_1(z) - f_0(z)]^2}{f_0(z) + f_1(z)} \, \diff z,

where the integrand is defined to be zero when f0(z)+f1(z)=0f_0(z) + f_1(z) = 0.

Now let’s investigate the properties of stopping times for the process (πn)(\pi_n) from (3.65). We will begin with a generic stopping time

τ=inf{n0:πn(a,b)},\tau = \inf\{n \geq 0 : \pi_n \notin (a,b)\},

where a,ba, b are numbers satisfying 0<a<b<10 < a < b < 1.

3.3Chapter Notes

The results in this chapter extend the abstract dynamic programming framework of Chapter 2 by adding topological structure to the value space. The contraction-based approach in Section 3.1.2 is rooted in the classical work of Denardo (1967) and Bertsekas (2022). Earlier foundations for the operator-theoretic approach to dynamic programming include Blackwell (1965) on discounted models, Strauch (1966) on negative dynamic programming, and Bertsekas (1977) on monotone mappings without contraction. The order-theoretic perspective on fixed points used in Section 3.1.1 connects to Marinacci & Montrucchio (2019), who establish uniqueness results for Tarski-type fixed points of monotone operators. The pospace ADP framework of this chapter is developed in Sargent & Stachurski (2025).

The Q-factor representation of MDPs treated in Section 3.2.1.2 is used extensively in reinforcement learning, where the Bellman equation over Q-factors provides the basis for model-free algorithms. The name originates from Q-learning, introduced by Watkins (1989); see Watkins & Dayan (1992) for the convergence proof and Tsitsiklis (1994) for convergence analysis under asynchronous updates. Standard references on reinforcement learning include Bertsekas & Tsitsiklis (1996) and Sutton & Barto (2018).

The optimal savings problem in Section 3.2.2 has a long history, originating with Brock & Mirman (1972) in the stochastic setting; see also Stokey & Lucas (1989). The strongly continuous case draws on classical results for MDPs with continuous densities, while the weakly continuous case uses Berge’s maximum theorem and relates to the Feller-continuity approach to MDPs developed in Hernández-Lerma & Lasserre (2012) and Hernández-Lerma & Lasserre (2012).

The no-discount optimal stopping framework in Section 3.2.3 covers problems where contractivity fails due to the absence of discounting. Such problems arise in shortest path problems, where the foundational reference is Bertsekas & Tsitsiklis (1991); in bandit problems, where the seminal work is Gittins (1979); and in sequential sampling. For general treatments of optimal stopping theory, see Shiryaev (2007), Peskir & Shiryaev (2006), and Chow et al. (1971).

The sequential analysis problem treated in Section 3.2.4 goes back to the pioneering work of Wald (1947). The optimality of the sequential probability ratio test was established by Wald & Wolfowitz (1948). Modern treatments and extensions of sequential analysis include DeGroot (1970), Siegmund (1985), and Lai (2001). The triangular discrimination used in our proof of Lemma 3.2.12 is a classical ff-divergence; see Topsøe (2000) for sharp inequalities involving this divergence. The general theory of ff-divergences was introduced independently by Csiszár (1967) and Ali & Silvey (1966); for a modern treatment, see Liese & Vajda (2006).

References
  1. Meyn, S. P., & Tweedie, R. L. (2009). Markov chains and stochastic stability. Cambridge University Press.
  2. Denardo, E. V. (1967). Contraction Mappings in the Theory Underlying Dynamic Programming. SIAM Review, 9(2), 165–177.
  3. Bertsekas, D. P. (2022). Abstract dynamic programming (3rd ed.). Athena Scientific.
  4. Blackwell, D. (1965). Discounted Dynamic Programming. The Annals of Mathematical Statistics, 36(1), 226–235.
  5. Strauch, R. E. (1966). Negative Dynamic Programming. The Annals of Mathematical Statistics, 37(4), 871–890.
  6. Bertsekas, D. P. (1977). Monotone mappings with application in dynamic programming. SIAM Journal on Control and Optimization, 15(3), 438–464.
  7. Marinacci, M., & Montrucchio, L. (2019). Unique tarski fixed points. Mathematics of Operations Research, 44(4), 1174–1191.
  8. Sargent, T. J., & Stachurski, J. (2025). Dynamic Programs on Partially Ordered Sets. SIAM Journal on Optimization and Control, in press.
  9. Watkins, C. J. C. H. (1989). Learning from delayed rewards [Techreport]. PhD Thesis, King’s College, Cambridge United Kingdom.
  10. Watkins, C. J., & Dayan, P. (1992). Q-Learning. Machine Learning, 8, 279–292.
  11. Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 185–202.
  12. Bertsekas, D., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.
  13. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  14. Brock, W. A., & Mirman, L. J. (1972). Optimal economic growth and uncertainty: The discounted case. Journal of Economic Theory, 4(3), 479–513.
  15. Stokey, N. L., & Lucas, R. E. (1989). Recursive methods in dynamic economics. Harvard University Press.