In this chapter, we add topological structure to the value space and the policy operators. This allows us to provide sufficient conditions for optimality that are often easy to use in applications. Proofs of the theorems presented in this chapter leverage the poset ADP theory that we constructed in Chapter 2.
We begin by studying ADPs on pospace, then specialize to partially ordered metric spaces, where contractivity arguments become available. Minimization counterparts and a treatment of nonstationary policies round out the theoretical development. In Section 3.2 we apply the theory to discrete MDPs and Q-factors, optimal savings, no-discount optimal stopping, and the sequential analysis problem from Section 1.4.
In this section we introduce ADPs in settings where the value space is a poset and also a pospace (i.e., partially ordered space, see Section A.5.3.1 for the definition and basic properties). We study optimality in settings where the policy operators have topological stability properties, as well as order properties. Throughout this chapter, V is always assumed to be a pospace.
In Section 3.1.1 we study ADPs on pospace, leveraging global stability to obtain optimality and convergence results. In Section 3.1.2 we specialize to partially ordered metric spaces, where contraction-based arguments apply. Section 3.1.3 provides minimization counterparts and Section 3.1.4 treats nonstationary policies.
Let V be a pospace and let (V,T) be an ADP. Recall that a self-map S on V is called globally stable when S has a unique fixed point vˉ in V and Snv→vˉ as n→∞ for all v∈V. (See Section A.2.2 for more background.) We say that
(V,T) is globally stable if each Tσ∈T is globally stable on V.
Obviously, if (V,T) is globally stable, then (V,T) is well-posed. We also have the following useful preliminary result, which is an immediate consequence of Lemma A.5.19.
The next result shows that regularity and global stability together yield strong optimality properties.
Theorem 3.1.2 is a high-level result because a condition (existence of a fixed point) is placed on the derived object T, rather than the primitives V and T. Below, we present a sequence of results that leverage Theorem 3.1.2 while also placing all assumptions on primitives. The first is the relatively obvious but useful corollary, which adds convergence of VFI and OPI to Theorem 2.2.6.
The conditions of the next theorem are similar to those of Theorem 2.2.8, after replacing order stability and order continuity with global stability.
In this section we add more structure by assuming that our value space V is in fact a partially ordered metric space, by which we mean that V=(V,d,⪯) where d is a metric on V and (V,⪯) is a pospace under the topology generated by d. By construction, this specializes the earlier setting of Section 3.1.1, where (V,⪯) is just a pospace.
We will say that the ADP (V,T) is semi-regular if there exists a closed subset V0 of V such that V0⊂VG and TV0⊂V0. In addition, we will say that VFI is geometrically convergent on V0 if v∗ exists and there is a β∈(0,1) such that
Because of Exercise 2.7, the optimality and convergence theorems based around maximization can easily be converted to theorems for minimization. In this section we give some examples. Our first is a min-version of Theorem 3.1.2.
For the rest of Section 3.1.3, V=(V,d,⪯) is always assumed to be a partially ordered metric space. Our next result is a min-version of Theorem 3.1.5, restricted to the regular case.
In all of the preceding discussion we focused on stationary policies. For example, in the context of the optimal savings problem from Section 1.3, we fixed a policy σ and computed its lifetime value vσ by assuming that σ is applied at every t in {0,1,…}. In particular, in Section 1.3.1.2, we showed that, for v arbitrarily chosen from the value space,
This expression illustrates how lifetime value is obtained by repeatedly applying the same policy.
But is this focus on stationary policies justified? Could it be that higher lifetime value is available when we allow a change of policy in each period?
To address this question, suppose that we can select a policy planσˉ:=(σt)t⩾0 in the infinite Cartesian product ×t⩾0Σ and apply the t-th element σt at time t. Generalizing (3.2), the lifetime value of σˉ can be defined by
Of course, for the definition in (3.3) to make sense we need to know that the limit exists. Ideally, it should also be independent of v. (In (3.3), we iterate backwards in time, applying Tσj first, because v is best thought of as a terminal condition, rather than an initial condition. See Section 1.3.1.2 for intuition.)
Since the expression (3.3) requires a topology, we consider an ADP (V,T) where V=(V,⪯) is a partially ordered space. We also assume that the topology on V is generated by a metric d. As usual, T:={Tσ:σ∈Σ} is a family of order preserving self-maps on V. To ensure that (3.3) exists we also require the following:
We will make use of the following preliminary results.
We can now prove the main result of this section, which shows that any policy plan is (weakly) dominated in value by a stationary policy.
We now apply the theory developed above to a range of problems. Section 3.2.1 revisits discrete MDPs and introduces Q-factors. Section 3.2.2 treats optimal savings under both strong and weak continuity assumptions. Section 3.2.3 develops a no-discount optimal stopping framework, which is then applied to sequential analysis in Section 3.2.4.
In Section 3.2.1.1 we re-derive the optimality results for finite MDPs using the pospace theory of this chapter. In Section 3.2.1.2 we introduce the Q-factor formulation of the MDP and establish its optimality properties. The Q-factor formulation underpins reinforcement learning, one of the most influential branches of modern AI. Reinforcement learning algorithms---most notably Q-learning---use Q-factors to learn optimal policies from data, without requiring a model of the environment. Here we focus on the underlying dynamic programming theory associated with Q-factors, taking the model as given. Further discussion of Q-learning can be found in Section 3.3 and Section 9.2.1.
In Section 1.2 we introduced a finite MDP (Γ,r,β,P) with state space X and action space A. In Section 2.3.3.1 we showed that this finite MDP can be framed as ADP (RX,TMDP), with policy operators given by Tσ=rσ+βPσ. In Section 2.3.3.2 we established all of the major optimality properties for (RX,TMDP). For the purpose of illustration, working in a familiar and simple environment, let’s re-prove these results using the theorems in this chapter.
First, we know from Section 2.3.3.1 that (RX,TMDP) is regular. In Exercise 1.4 we proved that every policy operator Tσ is globally stable. Since TMDP is finite Corollary 3.1.3 applies. This proves validity of the fundamental optimality properties and convergence of all algorithms. Since global stability implies order stability, convergence of HPI in finite time holds by Theorem 2.2.6.
Next we examine the Q-factor variation of the MDP model. This modification provides an alternative view on the Bellman equation that unlocks a stochastic approximation approach to reinforcement learning. We can study optimality of the Q-factor variation either by treating it directly, as an ADP in its own right, or by inferring optimality properties from the original MDP version of the problem, which we just obtained in Section 3.2.1.1. The first approach is treated here and the second is treated in Chapter 5.
To begin, we take the MDP model from Section 3.2.1.1 and, given v∈RX, set
The function q is called the Q-factor corresponding to v. We will convert the original MDP Bellman equation (2.43) into an equation in Q-factors. The first step is to observe that, given q in (3.11), the Bellman equation can be written as v(x)=maxa∈Γ(x)q(x,a). Taking the expectations and discounting on both sides of this equation yields
Here Sσ acts on function q∈RG. The set RG is paired with the pointwise partial order.
By definition, the ADP Bellman operator corresponding to (RG,S) obeys Sq:=⋁σSσq. The next exercise helps us connect this to the Q-factor Bellman equation (3.13).
Exercise 3.4 tells us that, as expected, q∈RG is a fixed point of S if and only if it is a solution to the Q-factor Bellman equation (3.13).
Now let’s turn to optimality.
The next exercise asks you to confirm the core optimality properties also hold for the Q-factor MDP. You may like to use Theorem 3.1.5.
In Section 1.3 we introduced a simple optimal savings problem. In Section 2.3.2, we converted the optimal savings problem from Section 1.3 into an ADP (V,TOS), where V=bR+ and each Tσ∈TOS takes the form
Now we turn to optimality. In Section 3.2.2.1 we will maintain the conditions in Assumption 1.3.1, so that u is continuous and bounded on R+ and the distribution of labor income can be represented by a continuous density. In Section 3.2.2.2 we will drop the continuous density assumption.
Maintaining Assumption 1.3.1, we prove the following optimality properties, which were stated without proof in Section 1.3.2.2.
Let’s briefly translate these ADP results into the optimal savings results stated in Section 1.3.2.2. We showed in Section 2.3.2 that v∈V satisfies the ADP Bellman equation ⋁σTσv=v if and only if it satisfies
v(w)=0⩽c⩽wmax{u(c)+β∫v(R(w−c)+y)ϕ(dy)}for all w⩾0.
Now let’s prove a result similar to Proposition 3.2.1 under weaker conditions. In particular, we modify Assumption 1.3.1 by dropping the assumption that ϕ is a continuous density. Instead we’ll let ϕ be an arbitrary probability measure on the Borel subset of R+. Other conditions in Assumption 1.3.1 are maintained.
Without the continuity of ϕ, Lemma 1.3.2 on page fails. In particular, we cannot claim that each v∈bR+ has a greedy policy. As a result, (V,TOS) is no longer regular. However, we do have the following result, stated as a solved exercise.
With the results from this exercise in hand, we can prove the next proposition.
Many important applications---including sampling problems, shortest path and routing problems, bandit problems, and reinforcement learning tasks---involve no discounting. In the absence of a discount factor, contractivity of the policy operators typically fails, so the results based on contraction mappings do not directly apply. This necessitates more sophisticated techniques. One of our main aims is to provide foundations for solving the sequential sampling problem from Section 1.4.
Let (X,B) be a measurable space. We recall from Section A.5.4.1 that a discrete time X-valued stochastic process (Xt)t⩾0 on probability space (Ω,F,Px) is called P-Markov if
P{Xt+1∈B∣Ft}=P(Xt,B)P-a.s. for all t⩾0 and B∈B.
where e,c are functions in bX+, while P is a stochastic kernel on X. The function e is called the exit cost function and c is called the flow cost. The Bellman equation corresponds to a setting where a controller observes a P-Markov state process (Xt)t⩾0 and decides when to stop. Stopping at time t incurs the one-off penalty e(Xt). Continuing incurs the flow cost c(Xt), followed by transition to the new state Xt+1 and the opportunity to decide again.
Here ∑t=0−1c(Xt) is understood as 0, so that gσ(x)=e(x) when x∈Eσ. The function gσ takes values in [0,∞] and gσ(x) represents the total expected cost when applying σ in every period, conditional on starting in state x.
We call Eˉ the certain exit region. Under the innocuous convention that the controller always stops when indifferent between stopping and continuing, any optimal policy will choose to stop when x∈Eˉ. The reason is that the controller has the opportunity to exit at cost e(x)⩽c(x), and continuing incurs c(x) plus additional costs in subsequent stages.
The fact that the controller always stops when x∈Eˉ allows us to shrink the policy space. Specifically, we consider only policies where σ(x)=1 for all x∈Eˉ. Let Σ be the set of all such policies. We can also express this set via
For Assumption 3.2.1, it suffices to check that supx∈EˉcExτˉ is finite, since τˉ=0 with probability one when x∈Eˉ. Below we show that Assumption 3.2.1 is sufficient for all of the major optimality results associated with dynamic programming.
In (3.36), expressions such as σe are understood as pointwise products.
As usual, the policy operator associated with σ is introduced with the idea that its fixed point gives lifetime value -- in this case lifetime cost functions -- generated by σ. This turns out to be true here as well, although the proof is not entirely trivial. It requires some familiarity with shift operators and the Markov property in its general form. An introduction to these topics can be found in Chapter 3 of Meyn & Tweedie (2009).
Let T be the set of all policy operators, as defined in (3.35), indexed over the restricted policy set Σ defined in (3.31). Since each Tσ is order preserving, the pair (bX+,T) forms an ADP. For this ADP and given g∈bX+, a policy σ∈Σ is g-min-greedy when Tσg⩽Tsg for all s∈Σ. It is easy to check that one such policy always exists. Indeed, such a policy can be found by setting
This argument also proves that the ADP (bX+,T) is regular.
In essence, a g-min-greedy policy treats g as a loss function, using it to associate the total expected cost of each state, and makes the current best choice accordingly.
In this section we prove that (bX+,T) is globally stable.
We prove the proposition using a sequence of lemmas. In the statement of the next lemma, σ is a fixed policy, Kσ is as defined in (3.37), and τσ is as defined in (3.27).
We will also use the following result regarding stopping times.
The function g▽∗ also takes values in [0,∞) and is well-defined everywhere on X. The minimum loss function g▽∗ is equal to the min-value function v▽∗ of the ADP, as defined in Section 2.2.3.1. (This is because, by definition, v▽∗=⋀σvσ, which is v▽∗=⋀σgσ in the current setting. This equation reduces to (3.55) when working in bX with the pointwise partial order.)
A policy σ∈Σ is called optimal if gσ⩽gs for all s∈Σ. This is equivalent to the statement that σ attains the minimum possible cost from every state, and is equivalent to the ADP definition in Section 2.2.3.1.
With Theorem 3.2.9 in hand, we are in a position to prove optimality results for the sequential analysis problem we presented in Section 1.4. First we reduce the sequential analysis problem to a special case of the no-discount optimal stopping problem treated in Theorem 3.2.9. Then we check the conditions of that theorem. The only significant condition is Assumption 3.2.1, so this is where we will be investing all our effort.
Our first step is to show that the hard part of the sequential analysis problem is a special case of the no-discount optimal stopping problem from Section 3.2.3. To this end, let’s remind ourselves of the set up in Section 1.4. We recall that the state space for the belief state π is X=(0,1) and action space is A={0,1,2}, where action 0 represents accepting f0, action 1 represents accepting f1, and action 2 represents continuing to sample. We assume that the two densities are defined on R.
Repeating (1.116), the Bellman equation has the form
is the Bayesian update rule. Together, ψ and κ define the stochastic kernel P governing the belief state (πn)n⩾0, describing how beliefs evolve from the perspective of the controller when the observation sequence (Zn)n⩾1 is forecast using the predictive density.
In all of what follows, the cost c and the losses L0 and L1 are assumed to be positive constants. Our aim is to characterize and solve for optimal policies.
In terms of dynamic programming, we can simplify the sequential analysis to a binary stopping problem. Our first step is to set
We claim that solving this dynamic program is sufficient for solving the sequential analysis problem. To see this, suppose we are able to show that the fundamental min-optimality properties hold for this dynamic program. Let the min-value function be denoted by g▽∗. Continuing the convention that the controller always stops when indifferent between stopping and continuing, Bellman’s principle of min-optimality tells the controller to stop if and only if e(π)⩽c+(Pg▽∗)(π).
In the present setting, this means that the controller stops if and only if at least one of the stopping losses πL0 and (1−π)L1 is less than or equal to the continuation loss c+(Pg▽∗)(π). When this stop occurs, the controller then makes the static choice over the two density options (selecting f0 or f1) depending on which of πL0 and (1−π)L1 is smaller. Since this static problem is trivial, we can concentrate on solving the dynamic problem represented by the Bellman equation (3.61).
The stopping problem just described is a special case of the no-discount optimal stopping from Section 3.2.3, with e as the exit cost function in (3.60) and the flow cost c constant over the state space X=(0,1). (To confirm this, compare the Bellman equation (3.61) with the general case in (3.25).) As a result, Theorem 3.2.9 applies. All we need to do is check that Assumption 3.2.1 is valid in the current setting. This turns out to be true whenever f0 and f1 are distinct. Our proof will rely on a bound for martingale stopping times in Theorem A.3.9.
Let’s consider verification of Assumption 3.2.1 in the present setting. Let τˉ be the upper bound stopping time for the belief state (i.e., the stopping time in (3.30) specialized to the current setting). This stopping time is defined in terms of the certain exit region (see (3.29)), which, for our problem, is
The lower bound policy σˉ is (recalling its definition from Section 3.2.3.3) the indicator function for Eˉ, stopping the process whenever the belief state enters the certain exit region. The upper bound stopping time is, therefore,
(Note here that division by zero is not a concern: For π∈(0,1), we have ψ(π,z)>0 if and only if f0(z)+f1(z)>0. Since Zn+1 is drawn from ψ(π,⋅), we have ψ(πn,Zn+1)>0 almost surely.)
Matching (3.31), the policy set Σ under consideration is all B-measurable σ:X→{0,1} with σˉ⩽σ. For σ in this set, the policy operator takes the form
Let V be all bounded Borel measurable functions from (0,1) to R+. With TSA as the family of operators described by (3.66), indexed by σ∈Σ, the pair (b(0,1),TSA) forms an ADP. For this ADP, we can state the following result. In the statement, two functions are distinct when they are not equal almost everywhere.
To prove Proposition 3.2.10, we recognize (V,TSA) as a special case of the no-discount optimal stopping ADP treated in Theorem 3.2.9. As such, we only need to verify Assumption 3.2.1, which amounts to showing that sup0⩽π⩽1Eπτˉ is finite.
As a first step, we recall that, given two probability densities f0 and f1 on R, the triangular discrimination is
The results in this chapter extend the abstract dynamic programming framework of Chapter 2 by adding topological structure to the value space. The contraction-based approach in Section 3.1.2 is rooted in the classical work of Denardo (1967) and Bertsekas (2022). Earlier foundations for the operator-theoretic approach to dynamic programming include Blackwell (1965) on discounted models, Strauch (1966) on negative dynamic programming, and Bertsekas (1977) on monotone mappings without contraction. The order-theoretic perspective on fixed points used in Section 3.1.1 connects to Marinacci & Montrucchio (2019), who establish uniqueness results for Tarski-type fixed points of monotone operators. The pospace ADP framework of this chapter is developed in Sargent & Stachurski (2025).
The Q-factor representation of MDPs treated in Section 3.2.1.2 is used extensively in reinforcement learning, where the Bellman equation over Q-factors provides the basis for model-free algorithms. The name originates from Q-learning, introduced by Watkins (1989); see Watkins & Dayan (1992) for the convergence proof and Tsitsiklis (1994) for convergence analysis under asynchronous updates. Standard references on reinforcement learning include Bertsekas & Tsitsiklis (1996) and Sutton & Barto (2018).
The optimal savings problem in Section 3.2.2 has a long history, originating with Brock & Mirman (1972) in the stochastic setting; see also Stokey & Lucas (1989). The strongly continuous case draws on classical results for MDPs with continuous densities, while the weakly continuous case uses Berge’s maximum theorem and relates to the Feller-continuity approach to MDPs developed in Hernández-Lerma & Lasserre (2012) and Hernández-Lerma & Lasserre (2012).
The no-discount optimal stopping framework in Section 3.2.3 covers problems where contractivity fails due to the absence of discounting. Such problems arise in shortest path problems, where the foundational reference is Bertsekas & Tsitsiklis (1991); in bandit problems, where the seminal work is Gittins (1979); and in sequential sampling. For general treatments of optimal stopping theory, see Shiryaev (2007), Peskir & Shiryaev (2006), and Chow et al. (1971).
The sequential analysis problem treated in Section 3.2.4 goes back to the pioneering work of Wald (1947). The optimality of the sequential probability ratio test was established by Wald & Wolfowitz (1948). Modern treatments and extensions of sequential analysis include DeGroot (1970), Siegmund (1985), and Lai (2001). The triangular discrimination used in our proof of Lemma 3.2.12 is a classical f-divergence; see Topsøe (2000) for sharp inequalities involving this divergence. The general theory of f-divergences was introduced independently by Csiszár (1967) and Ali & Silvey (1966); for a modern treatment, see Liese & Vajda (2006).