Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

8 Recursive Decision Processes

Authors
Affiliations
New York University
Australian National University

While the MDP model from Chapter 5 and Chapter 6 is elegant and widely used, researchers in economics, finance, and other fields are working to extend it. Reasons include:

  1. MDP theory cannot be applied to settings where lifetime values are described by the kinds of nonlinear recursions discussed in Chapter 7.

  2. Equilibria in some models of production and economic geography can be computed using dynamic programming but not all such programming problems fit within the MDP framework.

  3. Dynamic programming problems that include adversarial agents to promote robust decision rules can fail to be MDPs.

To handle such departures from the MDP assumptions, we now construct a more general dynamic programming framework, building on an approach to optimization initially developed by Denardo (1967) and extended by Bertsekas (2022). Further references are provided in Section 8.4.

We start this chapter by building a framework that centers on an abstract representation of the Bellman equation (Section 8.1). We then state optimality results and show how they can be verified in a range of applications. We defer proofs of core optimality results to Chapter 9, where we strip dynamic programs down to their essence by adopting a purely operator-theoretic perspective.

8.1Definition and Properties

In this section, we introduce and analyze optimality conditions for recursive decision processes that include and extend all dynamic programming frameworks discussed so far. Throughout this chapter, X\Xsf denotes a finite set.

8.1.1Defining RDPs

Consider a generic Bellman equation of the form

v(x)=maxaΓ(x)B(x,a,v).v(x) = \max_{a \in \Gamma(x)} B(x, a, v).

Here xx is the state, aa is an action, Γ\Gamma is a feasible correspondence, and BB is an “aggregator” function. We understand Γ(x)\Gamma(x) as all actions available to the controller in state xx. The function vv assigns values to states and is a member of some class VRXV \subset \RR^\Xsf. This “abstract” Bellman equation generalizes all of the Bellman equations presented in previous chapters.

Our plan is to analyze the Bellman equation (8.1) and state conditions on BB and the other primitives that make strong optimality properties hold. As a first step, we introduce two finite sets,

Given X\Xsf and A\Asf, we define a recursive decision process (RDP) to be a triple R=(Γ,V,B)\rR = (\Gamma, V, B) consisting of

  1. a feasible correspondence Γ\Gamma that is a nonempty correspondence from X\Xsf to A\Asf, which in turn defines

    • the feasible state-action pairs

G{(x,a)X×A:aΓ(x)}\Gsf \coloneq \setntn{(x, a) \in \Xsf \times \Asf}{a \in \Gamma(x)}
- and the set of feasible policies
Σ{σAX:σ(x)Γ(x) for all xX},\Sigma \coloneq \setntn{\sigma \in \Asf^\Xsf} {\sigma(x) \in \Gamma(x) \text{ for all } x \in \Xsf},
  1. a subset VV of RX\RR^\Xsf called the value space, and

  2. a value aggregator BB that maps G×V\Gsf \times V to R\RR and satisfies both the monotonicity condition

v,wV and vw    B(x,a,v)B(x,a,w)   for all (x,a)G,v, w \in V \text{ and } v \leq w \implies B(x, a, v) \leq B(x, a, w) \; \text{ for all } (x, a) \in \Gsf,

and the consistency condition

wV whenever w(x)=B(x,σ(x),v) for some σΣ and vV.w \in V \text{ whenever } w(x) = B(x, \sigma(x), v) \text{ for some } \sigma \in \Sigma \text{ and } v \in V.

Throughout, \leq represents the pointwise order on RX\RR^\Xsf.

The definition of the feasible correspondence in (i) is identical to that for the MDP in Chapter 5. As for (ii), we understand VV to be a class of functions that assign values to states. In (iii), the interpretation of the aggregator BB is:

B(x,a,v)=B(x, a, v) = total lifetime rewards, contingent on current action aa, current state xx, and using vv to evaluate future states.

The monotonicity condition (8.4) is natural: if, relative to vv, rewards are at least as high for ww in every future state, then the total rewards one can extract under ww should be at least as high. The consistency condition in (8.5) ensures that as we consider values of different policies we remain within the value space VV.

The MDP framework is a special case of the RDP framework:

The last example is a special case of Example 8.1.1, since the cake eating problem is an MDP (see Section 5.1.2.3). Nonetheless, Example 8.1.2 is instructive because, for cake eating, the MDP construction is tedious (e.g., we need to define a stochastic kernel PP even though transitions are deterministic), while the RDP construction is straightforward.

The next example makes a related point.

Example 8.1.1--Example 8.1.4 treated RDPs that can be embedded into the MDP framework. In the remaining examples, we consider models that cannot be represented as MDPs.

Example 8.1.8 is a minimization problem. We treat minimization explicitly in Section 8.3.5, although the shortest path setting can be converted maximization by replacing c(x,x)c(x,x') with c(x,x)-c(x,x'). This produces an application similar to the cake eating problem in Example 8.1.2 (although discounting is eliminated and network structure shows up in the constraint).

8.1.2Lifetime Value

We aim to discuss optimality of RDPs. To prepare for this topic, we now clarify lifetime values associated with different policy choices in the RDP setting.

8.1.2.1Policies and Value

Let R=(Γ,V,B)\rR = (\Gamma, V, B) be an RDP with state and action spaces X\Xsf and A\Asf, and let Σ\Sigma be the set of all feasible policies. For each σΣ\sigma \in \Sigma we introduce the policy operator TσT_\sigma as a self-map on VV defined by

(Tσv)(x)=B(x,σ(x),v)(xX).(T_\sigma \, v)(x) = B(x, \sigma(x), v) \qquad (x \in \Xsf).

The RDP policy operator is a direct generalization of the MDP policy operator defined, as well as the optimal stopping policy operator defined.

Consider a given RDP (Γ,V,B)(\Gamma, V, B) and fix σΣ\sigma \in \Sigma. If TσT_\sigma has a unique fixed point in VV, we denote this fixed point by vσv_\sigma and call it the σ\sigma-value function. It is natural to interpret vσv_\sigma as representing the lifetime value of following policy σ\sigma.

The previous examples are linear but the same idea extends to nonlinear recursive preference models as well. To see this, recall the generic Koopmans operator (Kv)(x)=A(x,(Rv)(x))(Kv)(x) = A(x, (Rv)(x)) introduced in Section 7.3.1. Lifetime value is the unique fixed point of this operator whenever it exists. In all of the RDP examples we have considered, the policy operator can be expressed as (Tσv)(x)=Aσ(x,(Rσv)(x))(T_\sigma \, v)(x) = A_\sigma(x, (R_\sigma \, v)(x)) for some aggregator AσA_\sigma and certainty equivalent operator RσR_\sigma. Hence TσT_\sigma is a Koopmans operator and lifetime value associated with policy σ\sigma is the fixed point of this operator.

8.1.2.2Uniqueness and Stability

Let R=(Γ,V,B)\rR = (\Gamma, V, B) be a given RDP with policy operators {Tσ}\{T_\sigma\}. Given that our objective is to maximize lifetime value over the set of policies in Σ\Sigma, we need to assume at the very least that lifetime value is well defined at each policy. To this end, we say that R\rR is well-posed whenever TσT_\sigma has a unique fixed point vσv_\sigma in VV for all σΣ\sigma \in \Sigma.

Let R\rR be an RDP with policy operators {Tσ}σΣ\{T_\sigma\}_{\sigma \in \Sigma}. In what follows, we call R\rR globally stable if TσT_\sigma is globally stable on VV for all σΣ\sigma \in \Sigma.

Obviously, every globally stable RDP is well-posed.

In Section 8.1.3 we will see that global stability yields strong optimality properties.

8.1.2.3Continuity

Let R=(Γ,V,B)\rR = (\Gamma, V, B) be an RDP. We call R\rR continuous if B(x,a,v)B(x, a, v) is continuous in vv for all (x,a)G(x, a) \in \Gsf. In other words, R\rR is continuous if, for any vVv \in V, any (x,a)G(x, a) \in \Gsf and any sequence (vk)k1(v_k)_{k \geq 1} in VV, we have

limkB(x,a,vk)=B(x,a,v)wheneverlimkvk=v.\lim_{k \to \infty} B(x, a, v_k) = B(x, a, v) \quad \text{whenever} \quad \lim_{k \to \infty} v_k = v .

Continuity is satisfied by all applications considered in this text. For example, for the RDP generated by an MDP (Example 8.1.1), the deviation B(x,a,vk)B(x,a,v)| B(x, a, v_k) - B(x, a, v)| is dominated by βvkv\beta \| v_k - v \|_\infty for all (x,a)G(x, a) \in \Gsf. Hence continuity holds.

Below we will see that continuity is useful when considering covergence of certain algorithms.

8.1.3Optimality

In this section, we present optimality theory for RDPs.

8.1.3.1Greedy Policies

Given an RDP R=(Γ,V,B)\rR = (\Gamma, V, B) and vVv \in V, a policy σΣ\sigma \in \Sigma is called vv-greedy if

σ(x)argmaxaΓ(x)B(x,a,v)for all xX.\sigma(x) \in \argmax_{a \in \Gamma(x)} B(x, a, v) \quad \text{for all } x \in \Xsf.

Since Γ(x)\Gamma(x) is finite and nonempty at each xXx \in \Xsf, at least one such policy exists. As with policy operators, the notion of greedy policies extends existing definitions from earlier chapters.

Given RDP R=(Γ,V,B)\rR = (\Gamma, V, B), we say that vVv \in V satisfies the Bellman equation if v(x)=maxaΓ(x)B(x,a,v)v(x) = \max_{a \in \Gamma(x)}B(x,a,v) for all xXx \in \Xsf. The Bellman operator corresponding to R\rR is the map TT on VV defined by

(Tv)(x)=maxaΓ(x)B(x,a,v)(xX).(T v)(x) = \max_{a \in \Gamma(x)} B(x, a, v) \qquad (x \in \Xsf).

8.1.3.2Algorithms

To solve RDPs for optimal policies, we use two core algorithms: Howard policy iteration (HPI) and optimistic policy iteration (OPI). As in previous chapters, OPI includes VFI as a special case.

To describe HPI we take R=(Γ,V,B)\rR = (\Gamma, V, B) to be a well-posed RDP with feasible policy set Σ\Sigma, policy operators {Tσ}\{T_\sigma\}, and Bellman operator TT. In this setting, the HPI algorithm is essentially identical to the one given for MDPs in Section 5.1.4.2, except that vσv_\sigma is calculated as the fixed point of TσT_\sigma, rather than taking the specific form (IβPσ)1rσ(I-\beta P_\sigma)^{-1} r_\sigma. The details are in Algorithm 8.1.

Algorithm 8.1 is somewhat ambiguous, since it is not always clear how to implement the instruction “vkv_k \leftarrow the fixed point of TσkT_{\sigma_k}”. However, if R\rR is globally stable, then each TσkT_{\sigma_k} is globally stable, so an approximation of the fixed point can be calculated by iterating with TσkT_{\sigma_k}. This line of thought leads us to consider optimistic policy iterating (OPI) as a more practical alternative. Algorithm 8.2 states an OPI routine for solving R\rR that generalizes the MDP OPI routine in Section 5.1.4.

In Algorithm 8.2 we require that v0=vσv_0 = v_\sigma for some σΣ\sigma \in \Sigma. This assumption can be dropped in some settings. For practical purposes, however, it is almost always straightforward to initialize OPI with v0=vσv_0 = v_\sigma for some simple choice of σ\sigma.

When we turn to proofs, it will help to have an operator-theoretic description of HPI and OPI. To this end, we define two operators. The first is H ⁣:V{vσ}\Hmax \colon V \to \{v_\sigma\}, which is defined via

Hv=vσ where σ is v-greedy.\Hmax v = v_\sigma \text{ where } \sigma \text{ is } v \text{-greedy}.

We call H\Hmax the Howard operator generated by R\rR. Iterating with H\Hmax implements HPI. In particular, if we fix σΣ\sigma \in \Sigma and set vk=Hkvσv_k = \Hmax^k v_\sigma, then (vk)k0(v_k)_{k \geq 0} is the sequence of σ\sigma-value functions generated by HPI.[1]

Next, fixing mNm \in \NN, we define the operator WmW_m from VV to itself via

WmvTσmvwhereσ is v-greedy.W_m v \coloneq T^m_\sigma v \quad \text{where} \quad \sigma \text{ is } v \text{-greedy}.

(See the previous footnote on the choice of vv-greedy policies.) The operator WmW_m is an approximation of HH, since Tσmvvσ=HvT^m_\sigma v \to v_\sigma = Hv as mm \to \infty. Iterating with WmW_m generates the value sequence in OPI. More specifically, we take v0{vσ}v_0 \in \{v_\sigma\} and generate

(vk,σk)k0where vk=Wmkv0 and σk is vk-greedy.(v_k, \sigma_k)_{k \geq 0} \quad \text{where } v_k = W_m^k v_0 \text{ and } \sigma_k \text{ is } v_k \text{-greedy}.

This produces an infinite sequence of OPI value and policy iterates.

8.1.3.3Optimality

Let R\rR be a well-posed RDP with policy operators {Tσ}\{T_\sigma\} and σ\sigma-value functions {vσ}\{v_\sigma\}. In this context, we set vσvσRXv^* \coloneq \bigvee_\sigma \, v_\sigma \in \RR^\Xsf and call vv^* the value function of R\rR. By definition, vv^* satisfies

v(x)=maxσΣvσ(x)for all   xX.v^*(x) = \max_{\sigma \in \Sigma} v_\sigma(x) \qquad \text{for all } \; x \in \Xsf.

A policy σ\sigma is called optimal for R\rR if vσ=vv_\sigma = v^*; that is, if

vσ(x)vτ(x)for all τΣ and all xX.v_\sigma(x) \geq v_\tau(x) \quad \text{for all } \tau \in \Sigma \text{ and all } x \in \Xsf.

Both of these definitions generalize the definitions we used for MDPs and optimal stopping. In particular, optimality of a policy means that it generates maximum possible lifetime value from every state.

We say that R\rR satisfies Bellman’s principle of optimality if

σΣ is optimal for R    σ is v-greedy.\sigma \in \Sigma \text{ is optimal for } \rR \quad \iff \quad \sigma \text{ is } v^*\text{-greedy}.

We can now state our main optimality result for RDPs. In the statement, R\rR is a well-posed RDP with value function vv^*.

As OPI includes VFI as a special case (m=1m=1), Theorem 8.1.1 also implies convergence of VFI under the stated conditions.

In terms of applications, Theorem 8.1.1 is the most important optimality result in this book. It provides the core optimality results from dynamic programming and a broadly convergent algorithm for computing optimal policies.

The proof of Theorem 8.1.1 is deferred to Section 9.1.

Example 8.1.18--Example 8.1.16 are relatively elementary. More complex models will be handled in Section 8.2.

8.1.3.4Comments on the Optimality Theorem

Many traditional treatments of dynamic programming build optimality theory around contractivity (see, e.g., Puterman (2005) or Stokey & Lucas (1989), Section 4.2). Assumptions are constructed so that the policy operators and Bellman operator are all contraction mappings.

While such assumptions are sufficient for Theorem 8.1.1 (since contractivity of the policy operators implies stability), they are not necessary. There are a variety of ways to prove uniqueness and stability of fixed points, including the monotonicity-based methods discussed in Section 7.1.2 and the spectral methods in Section 6.1.3.2. These alternatives will prove useful in settings where contractivity fails, as we shall see in Section 8.2.

Another point worth noting about the conditions in Theorem 8.1.1 is that no assumptions are placed on the Bellman operator. Rather, one only needs to check properties of the policy operators. This is advantageous because, unlike the Bellman operator, the policy operators do not involve maximization.

8.1.3.5Nonstationary Policies

Up until now, we have focused entirely on stationary policies, in the sense that the same policy is used at every point in time. What if we drop this assumption and admit the option to change policies? Might this lead to higher lifetime values?

In this section, we show that for globally stable RDPs the answer is negative. This finding justifies our focus on stationary policies.

To begin, let R=(Γ,V,B)\rR = (\Gamma, V, B) be a globally stable RDP. Recall from Remark 8.1.1 that, given vVv \in V, σΣ\sigma \in \Sigma, kNk \in \NN and xXx \in \Xsf, the value (Tσkv)(x)(T_\sigma^k \, v)(x) gives finite horizon utility over periods 0,,k0, \ldots, k under policy σ\sigma, with initial state xx and terminal condition vv. Extending this idea, it is natural to understand TσkTσk1Tσ1vT_{\sigma_k} T_{\sigma_{k-1}} \cdots T_{\sigma_1} v as providing finite horizon utility values for the nonstationary policy sequence (σk)kNΣ(\sigma_k)_{k \in \NN} \subset \Sigma, given terminal condition vVv \in V. For the same policy sequence, we define its lifetime value via

vˉlim supkvkwith vkTσkTσk1Tσ1v\bar v \coloneq \limsup_{k \to \infty} v_k \quad \text{with } v_k \coloneq T_{\sigma_k} T_{\sigma_{k-1}} \cdots T_{\sigma_1} v

whenever the limsup is finite and independent of the terminal condition vv.

Suppose that this is the case, and hence vˉ\bar v is well defined. We claim that vˉv\bar v \leq v^*.

Since vˉ\bar v is independent of the terminal condition vv, we can assume without loss of generality that vVΣv \in V_\Sigma. By Theorem 8.1.1, we have TkvvT^k v \to v^* as kk \to \infty. Hence, by Exercise 8.11,

vˉ=lim supkvklim supkTkv=limkTkv=v,\bar v = \limsup_{k \to \infty} v_k \leq \limsup_{k \to \infty} T^k v = \lim_{k \to \infty} T^k v = v^*,

as was to be shown.

8.1.3.6Bounded RDPs

We call an RDP R=(Γ,V,B)\rR = (\Gamma, V, B) bounded if VV is convex and, moreover, there exist functions v1,v2Vv_1, v_2 \in V such that v1v2v_1 \leq v_2,

v1(x)B(x,a,v1) and B(x,a,v2)v2(x)     for all (x,a)G.v_1(x) \leq B(x, a, v_1) \quad \text{ and } \quad B(x, a, v_2) \leq v_2(x) \;\; \text{ for all } (x, a) \in \Gsf .

We will show that boundedness can be used to obtain optimality results for well-posed RDPs, even without global stability.

Another attractive feature of boundedness is that it permits a reduction of value space, as illustrated by the next two exercises.

Exercise 8.13 implies the reduced RDP (Γ,V^,B)(\Gamma, \hat V, B) is also well-posed under the stated conditions, and that it contains all the σ\sigma-value functions and the value function from the original RDP (Γ,V,B)(\Gamma, V, B). Hence any optimality results for (Γ,V^,B)(\Gamma, \hat V, B) carry over to (Γ,V,B)(\Gamma, V, B).

The next result shows that, when considering optimality, stability can be replaced by boundedness.

8.1.4Topologically Conjugate RDPs

Sometimes RDP models can be simplified by transformations over value space. In this section we investigate such transformations. The underlying ideas are related to topological conjugacy of dynamical systems, which we introduced in Section 2.1.1.2.

To begin, let R=(Γ,V,B)\rR = (\Gamma, V, B) and R^=(Γ,V^,B^)\hat \rR = (\Gamma, \hat V, \hat B) be two RDPs with identical state space X\Xsf, action space A\Asf and feasible correspondence Γ\Gamma. We consider settings where

V=MXandV^=M^Xwhere M,M^R,V = \MM^\Xsf \quad \text{and} \quad \hat V = \hat{\MM}^\Xsf \quad \text{where } \MM, \hat \MM \subset \RR,

and, in addition, that there exists a homeomorphism ϕ\phi from M\MM onto M^\hat \MM such that

B(x,a,v)=ϕ1[B^(x,a,ϕv)]for all vV and (x,a)G.B(x, a, v) = \phi^{-1}[ \hat B(x, a, \phi \circ v) ] \quad \text{for all } v \in V \text{ and } (x, a) \in \Gsf.

We call R\rR and R^\hat R topologically conjugate under ϕ\phi if ϕ\phi is a homeomorphism ϕ\phi from M\MM to M^\hat \MM and (8.43) holds.

Here is our main result for this section.

The benefit of Proposition 8.1.3 is that one of these models might be easier to analyze than the other. We apply the proposition to the Epstein--Zin specification in Section 8.1.4.1 and to a smooth ambiguity model in Section 8.3.4. The next exercise will be useful for the proof.

In the next section we will see how these ideas can simplify optimality analysis.

8.1.4.1Application: Epstein--Zin RDPs

In this section, we show how the preceding optimality results and the notion of topologically conjugacy can be deployed to analyze the Epstein--Zin RDP from Example 8.1.7.

Recall that the aggregator in Example 8.1.7 is

B(x,a,v)={r(x,a)+β(xv(x)γP(x,a,x))α/γ}1/α.B(x, a, v) = \left\{ r(x, a) + \beta \left( \sum_{x'} v(x')^\gamma P(x, a, x') \right)^{\alpha/\gamma} \right\}^{1/\alpha}.

Let V=(0,)XV = (0, \infty)^\Xsf. We assume that r0r \gg 0 and take a nonempty feasible correspondence Γ\Gamma as given. Exercise 8.5 confirmed that R(Γ,V,B)\rR \coloneq (\Gamma, V, B) is an RDP.

We will call the stochastic kernel PP irreducible if P(x,σ(x),x)P(x, \sigma(x), x') is irreducible for all σΣ\sigma \in \Sigma. Below we establish stability of R\rR under irreducibility.

To prove Proposition 8.1.4, we set up a simpler and more tractable model. Our first step is to introduce another RDP by setting

B^(x,a,v)=B(x,a,v1/γ)γ.\hat B(x, a, v) = B\left(x, a, v^{1/\gamma}\right)^{\gamma}.

We set R(Γ,V,B)\rR \coloneq (\Gamma, V, B) and R^(Γ,V,B^)\hat \rR \coloneq (\Gamma, V, \hat B). Notice that B^\hat B can also be expressed as

B^(x,a,v)={r(x,a)+β(xv(x)P(x,a,x))1/θ}θ,\hat B(x,a, v) = \left\{ r(x, a) + \beta \left( \sum_{x'} v(x') P(x, a, x') \right)^{1/\theta} \right\}^{\theta},

where θγ/α\theta \coloneq \gamma/\alpha.

The value of of introducing R^\hat \rR comes from the fact that R^\hat \rR is easier to work with than R\rR (just as the modified Epstein--Zin Koopmans operator K^\hat K defined in Section 7.2.3.3 turned out to be easier to work with than the original Epstein--Zin Koopmans operator KK introduced in Section 7.2.3.2).

Now we investigate the properties of the simpler RDP R^\hat \rR.

8.2Types of RDPs

In Section 8.1 we showed that well-posed RDPs have strong optimality properties whenever they are globally stable or bounded, and that VFI and OPI converge whenever they are globally stable. But what conditions are sufficient for these properties? We start with a relatively strict condition based on contractivity and then progress to models that fail to be contractive.

8.2.1Contracting RDPs

In this section, we study RDPs with strong contraction properties. Many traditional dynamic programs fit into this framework.

8.2.1.1Definition and Examples

Let R=(Γ,V,B)\rR = (\Gamma, V, B) be an RDP with state space X\Xsf, action space A\Asf, and feasible state-action pair set G\Gsf. We call R\rR contracting if there exists a β<1\beta < 1 such that

B(x,a,v)B(x,a,w)βvwfor all (x,a)G and v,wV.| B(x, a, v) - B(x, a, w)| \leq \beta \| v - w \|_\infty \quad \text{for all } (x, a) \in \Gsf \text{ and } v, w \in V.

In line with the terminology for contraction maps, we call β\beta the modulus of contraction for R\rR when (8.49) holds.

The following corollary to Proposition 8.2.1 is immediate from Banach’s contraction mapping theorem.

8.2.1.2Error Bounds

Corollary 8.2.2 tells us that contracting RDPs are globally stable and, as a result, the sequence of functions in VV generated by VFI (Algorithm 8.2 with m=1m=1) converges to vv^*. However this result is asymptotic and conditions on v0=vσv_0 = v_\sigma for some σΣ\sigma \in \Sigma. We can improve this result in the current setting by leveraging the contraction property:

Since the VFI algorithm terminates when vkvk1\|v_k - v_{k-1} \|_{\infty} falls below a given tolerance, the result in (8.52) directly provides a quantitative bound on the performance of the policy returned by VFI.

8.2.1.3A Blackwell-Type Condition

Next we state a useful condition for contractivity that is related to Blackwell’s sufficient condition discussed in Section 2.2.3.3. We say that RDP (Γ,V,B)(\Gamma, V, B) satisfies Blackwell’s condition if vVv \in V implies v+λv+λ1v + \lambda \coloneq v + \lambda \1 is in VV for every λ0\lambda \geq 0 and, in addition, there exists a β[0,1)\beta \in [0, 1) such that

B(x,a,v+λ)B(x,a,v)+βλfor all (x,a)GvV and λR+.B(x, a, v + \lambda) \leq B(x, a, v) + \beta \lambda \qquad \text{for all } (x, a) \in \Gsf \text{, } v \in V \text{ and } \lambda \in \RR_+.

8.2.1.4Application: Job Search with Quantile Preferences

Consider the job search problem with correlated wage draws first investigated in Section 3.3.1. With finite wage offer set W\Wsf, wage offer process generated by PM(RW)P \in \mopw and β(0,1)\beta \in (0,1), we can frame this as an RDP (Γ,V,B)(\Gamma, V, B) with V=RWV = \RR^\Wsf, Γ(w)={0,1}\Gamma(w) = \{0, 1\} for wWw \in \Wsf and

B(w,a,v)aw1β+(1a)[c+β(Pv)(w)].B(w, a, v) \coloneq a \frac{w}{1-\beta} + (1-a) [ c + \beta (Pv)(w) ].

Since the model just described is an optimal stopping problem, Example 8.2.1 tells us that (V,Γ,B)(V, \Gamma, B) is contracting.

Now consider the following modification, where Γ\Gamma and VV are as before but BB is replaced by

Bτ(w,a,v)aw1β+(1a)[c+β(Rτv)(w)],B_\tau(w, a, v) \coloneq a \frac{w}{1-\beta} + (1-a) [ c + \beta (R_\tau v)(w) ],

where τ[0,1]\tau \in [0,1] and RτR_\tau is the quantile certainty equivalent operator described in Exercise 7.22.

Figure Figure 8.1 shows the reservation wage for a range of τ\tau values, computed using OPI (and taking the smallest wWw \in \Wsf such that σ(w)=1\sigma^*(w) = 1). The stationary distribution of PP is also shown in the figure, tilted 90 degrees.

The parameters and the code for applying TσT_\sigma and evaluating greedy functions is shown in Listing 2. That listing includes the quantile operator RτR_\tau, which is implemented in Listing 1. (Quantiles of discrete random variables can also be computed using functionality contained in Distributions.jl.)

The reservation wage as a function of \tau

Figure 8.1:The reservation wage as a function of τ\tau

The main message of Figure Figure 8.1 is that the reservation wage rises in τ\tau. In essence, higher τ\tau focuses the attention of the worker on the right tail of the distribution of continuation values. This encourages the worker to take on more risk, which leads to a higher reservation wage (i.e., reluctance to accept a given current offer).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
"Compute the τ-th quantile of v(X) when X ∼ ϕ."
function quantile(τ, v, ϕ)
    # Sort v and reorder ϕ accordingly
    indices = sortperm(v)
    v_sorted = v[indices]
    ϕ_sorted = ϕ[indices]
    
    for (i, v_value) in enumerate(v_sorted)
        p = sum(ϕ_sorted[1:i])  # sum all ϕ[j] s.t. v[j] ≤ v_value
        if p ≥ τ                # exit and return v_value if prob ≥ τ
            return v_value
        end
    end
end

Program 1:Conditional quantile operator (quantile_function.jl)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
using QuantEcon
include("quantile_function.jl")

"Creates an instance of the job search model."
function create_markov_js_model(;
        n=100,       # wage grid size
        ρ=0.9,       # wage persistence
        ν=0.2,       # wage volatility
        β=0.98,      # discount factor
        c=1.0,       # unemployment compensation
        τ=0.5        # quantile parameter
    )
    mc = tauchen(n, ρ, ν)
    w_vals, P = exp.(mc.state_values), mc.p
    return (; n, w_vals, P, β, c, τ)
end

"""
The policy operator 

    (T_σ v)(w) = σ(w) (w / (1-β)) + (1 - σ(w))(c + β (R_τ v)(w))

"""
function T_σ(v, σ, model)
    (; n, w_vals, P, β, c, τ) = model
    h = c .+ β * R(τ, v, P)
    e = w_vals ./ (1 - β)
    return σ .* e + (1 .- σ) .* h
end

" Get a v-greedy policy."
function get_greedy(v, model)
    (; n, w_vals, P, β, c, τ) = model
    σ = w_vals / (1 - β) .≥ c .+ β * R(τ, v, P)
    return σ
end

Program 2:Job search with quantile operator (quantile_js.jl)

8.2.1.5Application: Optimal Default

In this section, we consider a small open economy that borrows in international financial markets in order to smooth consumption and has the option to default. We show that the model is a contractive RDP.

Income (Yt)t0(Y_t)_{t \geq 0} is exogenous and QQ-Markov on finite set Y\Ysf. A representative household faces budget constraint

Ct=Yt+btqbt+1(t0),C_t = Y_t + b_t - q b_{t+1} \qquad (t \geq 0),

where CtC_t is consumption at time tt, qq is the price at time tt of a risk-free claim on one unit of time t+1t+1 consumption; qq is determined outside the model, say international markets; btb_t measures foreign lending. Purchasing a claim on bt+1b_{t+1} units of time t+1t+1 consumption costs qbt+1q b_{t+1}. Purchasing bond with negative face value bt+1b_{t+1} pays qbt+1q b_{t+1} in current consumption goods and promises to deliver bt+1b_{t+1} next period.

Bond trading is managed by a benevolent government that wants to maximize household utility. Households discount future utility at rate β(0,1)\beta \in (0,1) and current consumption CtC_t generates current utility u(Ct)u(C_t). The government faces borrowing constraint btmb_t \geq - m where m0m \geq 0. The government maximizes expected discounted utility for the households.

The government can default on foreign loans. In this case, output available for consumption drops from YtY_t to h(Yt)h(Y_t), where hh is a function satisfying h(y)<yh(y) < y for all yy. After a country defaults, it temporarily loses access to the international credit market.

At the end of each period during which the country is in default, it regains access to international credit markets with probability θ(0,1)\theta \in (0,1). With probability 1θ1-\theta it remains in financial autarky. When a country regains access to foreign borrowing, its debt is reset to zero.

We can cast this as an RDP by considering the value of each state and action. We set the state space X\Xsf to be the set of all (y,b,d)(y, b, d) in Y×B×{0,1}\Ysf \times \Bsf \times \{0, 1\}, where B\Bsf is a finite subset of [m,)[-m, \infty) indicating possible choices for bond holdings btb_t and dd is a binary variable indicating whether the country is in default (d=0d=0 means not in default and d=1d=1 means in default).

The value space VV is all of RX\RR^\Xsf. The action space is (ba,da)B×{0,1}(b_a, d_a) \in \Bsf \times \{0,1\} indicating choices for bond holdings and default. The feasible correspondence specifies feasible (ba,da)(b_a,d_a) at given state (y,b,d)(y, b, d) and is given by

Γ(y,b,d)={B×{0,1} if d=0 and {0}×{1} if d=1.\Gamma(y, b, d) = \begin{cases} \Bsf \times \{0, 1\} & \text{ if } d = 0 \text{ and } \\ \{0 \} \times \{1\} & \text{ if } d = 1. \end{cases}

In other words, if d=0d=0, so the country is not in default, the government can choose any baBb_a \in \Bsf and also any da{0,1}d_a \in \{0, 1\} (i.e., default or not default). If d=1d=1, however, the government has no choices. We represent this situation by ba=0b_a =0 and da=1d_a =1.

The value aggregator takes the form

B((y,b,d),(ba,da),v)=value in state (y,b,d) under action (ba,da).B((y, b, d), (b_a, d_a), v) = \text{value in state } (y, b, d) \text{ under action } (b_a, d_a).

To specify it we decompose the problem across cases for dd and dad_a. First consider the case where d=0d=0 (not currently in default) and da=0d_a=0 (the government chooses not to default). For this case y+bqbay + b - q b_a is current consumption, so we set

B((y,b,0),(ba,0),v)=u(y+bqba)+βyv(y,ba,0)Q(y,y).B((y, b, 0), (b_a, 0), v) = u(y + b - q b_a) + \beta \sum_{y'} v(y', b_a, 0) Q(y, y').

Now consider the case where d=0d=0 and da=1d_a=1, so the government chooses to default. Then current consumption is h(y)h(y) and we set

B((y,b,0),(ba,1),v)=u(h(y))+β[θyv(y,0,0)Q(y,y)+(1θ)yv(y,0,1)Q(y,y)].B((y, b, 0), (b_a, 1), v) = u(h(y)) + \beta \\ \left[ \theta \sum_{y'} v(y', 0, 0) Q(y, y') + (1-\theta) \sum_{y'} v(y', 0, 1) Q(y, y') \right].

The term yv(y,0,0)Q(y,y)\sum_{y'} v(y', 0, 0) Q(y, y') is the expected value next period when the country is readmitted to international financial markets (with b=0b'=0 and d=0d'=0), whereas the term yv(y,0,1)Q(y,y)\sum_{y'} v(y', 0, 1) Q(y, y') is the expected value next period when default continues (with b=0b'=0 and d=1d'=1).

Since B((y,b,1),(ba,0),v)B((y, b, 1), (b_a, 0), v) is not feasible (a defaulted country cannot itself directly choose to reenter financial markets), the only other possibility is B((y,b,1),(ba,1),v)B((y, b, 1), (b_a, 1), v), which is the expected value when the country remains in default. But this is the same as B((y,b,0),(ba,1),v)B((y, b, 0), (b_a, 1), v) specified earlier: The value for a country that stays in default is the same as that for a country that newly enters default.

8.2.2Eventually Contracting RDPs

Many RDPs are not contracting. There is no single method for handling all types of non-contractive RDPs, so we introduce alternative techniques over the next few sections. The first such technique, treated in this section, handles RDPs that contract “eventually,” even though they may fail to contract in one step. We show that these eventually contracting RDPs are globally stable, so all of the fundamental optimality results apply.

One application for these results is the MDP model with state-dependent discounting treated in Chapter 6. This section contains a proof of the main optimality result in that chapter (Proposition 6.2.2).

8.2.2.1Definition and Properties

Let R=(Γ,V,B)\rR = (\Gamma, V, B) be an RDP with policy set Σ\Sigma. We call R\rR eventually contracting if there is a map LL from G×X\Gsf \times \Xsf to R+\RR_+ such that

B(x,a,v)B(x,a,w)xv(x)w(x)L(x,a,x),|B(x, a, v)-B(x, a, w)| \leq \sum_{x'} |v(x') - w(x')| L(x, a, x'),

for all (x,a)G(x, a) \in \Gsf and all v,wVv, w \in V, and moreover,

σΣ    ρ(Lσ)<1whereLσ(x,x)L(x,σ(x),x).\sigma \in \Sigma \implies \rho(L_\sigma) < 1 \quad \text{where} \quad L_\sigma(x, x') \coloneq L(x, \sigma(x), x').

8.2.2.2Optimality for MDPs with State-Dependent Discounting

With Proposition 8.2.4 in hand, we can complete the proof of Proposition 6.2.2, which pertained to optimality properties for MDPs with state-dependent discounting.

Let (Γ,β,r,P)(\Gamma, \beta, r, P) be an MDP with state-dependent discounting, as defined in Section 6.2.1.1. The state space is X\Xsf and the action space is A\Asf. The function β\beta maps G×X\Gsf \times \Xsf to R+\RR_+. Set

L(x,a,x)β(x,a,x)P(x,a,x)andLσ(x,x)L(x,σ(x),x),L(x, a, x') \coloneq \beta(x, a, x') P(x, a, x') \quad \text{and} \quad L_\sigma(x, x') \coloneq L(x, \sigma(x), x'),

for all (x,a,x)G×X(x, a, x') \in \Gsf \times \Xsf and σΣ\sigma \in \Sigma.

Assume the conditions of Proposition 6.2.2, so that ρ(Lσ)<1\rho(L_\sigma) < 1 for all σΣ\sigma \in \Sigma.

If we set

B(x,a,v)r(x,a)+xv(x)β(x,a,x)P(x,a,x),B(x, a, v) \coloneq r(x, a) + \sum_{x'} v(x') \beta(x, a, x') P(x, a, x'),

and take VV to be all of RX\RR^\Xsf, then R(Γ,V,B)\rR \coloneq (\Gamma, V, B) forms an RDP, as discussed in Exercise 8.23. We claim that R\rR is an eventually contracting RDP.

To see this, fix v,wVv, w \in V and (x,a)G(x, a) \in \Gsf. Applying the definition (8.74) and the triangle inequality, we have

B(x,a,v)B(x,a,w)x[v(x)w(x)]β(x,a,x)P(x,a,x)xv(x)w(x)L(x,a,x),\begin{aligned} |B(x, a, v)-B(x, a, w)| & \leq \left| \sum_{x'} [v(x') - w(x')] \beta(x, a, x') P(x, a, x') \right| \\ & \leq \sum_{x'} |v(x') - w(x')| L(x, a, x'), \end{aligned}

Under the stated assumptions, for each σΣ\sigma \in \Sigma, the operator Lσ(x,x)=L(x,σ(x),x)L_\sigma(x, x') = L(x, \sigma(x), x') satisfies ρ(Lσ)<1\rho(L_\sigma) < 1. Hence R\rR is eventually contracting, as claimed. Since V=RXV = \RR^\Xsf is closed, Proposition 8.2.4 implies that R\rR is a globally stable RDP. The claims in Proposition 6.2.2 now follow from Theorem 8.1.1.

8.2.3Convex and Concave RDPs

Theorem 8.1.1 shows that RDPs have excellent optimality properties when all policy operators are globally stable on value space. So far we have looked at conditions for stability based on contractions (Section 8.2.1) and eventual contractions (Section 8.2.2). But sometimes both of these approaches fail and we need alternative conditions.

In this section, we explore alternative conditions based on Du’s theorem. Du’s theorem is well suited to the task of studying stability of policy operators, since it leverages the fact that all policy operators are order-preserving.

8.2.3.1Definitions and Optimality

Let R=(Γ,V,B)\rR = (\Gamma, V, B) be an RDP with V=[v1,v2]V = [v_1, v_2] for some v1v2v_1 \leq v_2 in RX\RR^\Xsf. We call R\rR convex if

  1. for all (x,a)G(x, a) \in \Gsf, λ[0,1]\lambda \in [0, 1] and v,wv, w in VV, we have

B(x,a,λv+(1λ)w)λB(x,a,v)+(1λ)B(x,a,w)and,B(x, a, \lambda v + (1-\lambda) w) \leq \lambda B(x, a, v ) + (1-\lambda) B(x, a, w) \quad \text{and},
  1. there exists a δ>0\delta > 0 such that

B(x,a,v2)v2(x)δ[v2(x)v1(x)] for all (x,a)G.B(x, a, v_2) \leq v_2(x) - \delta[v_2(x) - v_1(x)] \text{ for all } (x, a) \in \Gsf.

Analogous to the convex case, we call R\rR concave if

  1. for all (x,a)G(x, a) \in \Gsf, λ[0,1]\lambda \in [0, 1] and v,wv, w in VV, we have

B(x,a,λv+(1λ)w)λB(x,a,v)+(1λ)B(x,a,w)and,B(x, a, \lambda v + (1-\lambda) w) \geq \lambda B(x, a, v ) + (1-\lambda) B(x, a, w) \quad \text{and},
  1. there exists a δ>0\delta > 0 such that

B(x,a,v1)v1(x)+δ[v2(x)v1(x)] for all (x,a)G.B(x, a, v_1) \geq v_1(x) + \delta [v_2(x) - v_1(x)] \text{ for all } (x, a) \in \Gsf.

In both of the these definitions, condition (ii) is rather complex. The next exercise provides simpler sufficient conditions.

Both convexity and concavity yield stability, as the next proposition shows.

It follows from Div that, for convex and concave RDPs, all of the optimality and convergence results in Theorem 8.1.1 apply.

8.2.3.2Application to MDPs

Div can be applied to establish optimality properties of regular MDPs. This exercise is redundant in the sense that optimality properties of regular MDPs have already been established using other means. At the same time, some of the arguments developed here will be helpful when we face more sophisticated problems.

To sketch the argument, let R=(Γ,V,B)\rR = (\Gamma, V, B) be an RDP generated by an ordinary MDP (Γ,β,r,P)(\Gamma, \beta, r, P), as discussed in Example 8.1.1. In particular, V=RXV =\RR^\Xsf, and B(x,a,v)=r(x,a)+βxv(x)P(x,a,x)B(x,a, v) = r(x, a) + \beta \sum_{x'} v(x') P(x, a, x'). We set r1minrr_1 \coloneq \min r and r2maxrr_2 \coloneq \max r. Then we fix ϵ>0\epsilon > 0 and define VV via

V^[v1,v2] where v1r1ϵ1β   and   v2r2+ϵ1β.\hat V \coloneq [v_1, v_2] \quad \text{ where } \quad v_1 \coloneq \frac{r_1 - \epsilon}{1-\beta} \; \text{ and } \; v_2 \coloneq \frac{r_2 + \epsilon}{1-\beta}.

(The functions v1v_1 and v2v_2 are constant.) We claim that the RDP R^(Γ,V^,B)\hat \rR \coloneq (\Gamma, \hat V, B) is both convex and concave.

8.3Further Applications

In this section, we consider some applications of the optimality results in Section 8.2.

8.3.1Risk-Sensitive RDPs

In Section 7.2.2 we introduced risk-sensitive preferences and discussed a recursive utility problem. Now we embed risk-sensitive preferences into a dynamic program and apply the preceding optimality results to compute optimal policies.

8.3.1.1Optimality Results

Consider the risk-sensitive preference RDP in Example 8.1.6, with state space X\Xsf and action space A\Asf. Let V=RXV = \RR^\Xsf. For (x,a)G(x,a) \in \Gsf and vVv \in V, we can express the aggregator as

B(x,a,v)r(x,a)+β(Rθav)(x),B(x, a, v) \coloneq r(x, a) + \beta (R_\theta^a \, v)(x),

where θ\theta is a nonzero constant and

(Rθav)(x)1θln{xexp(θv(x))P(x,a,x)}.(R_\theta^a \, v)(x) \coloneq \frac{1}{\theta} \ln \left\{ \sum_{x'} \exp(\theta v(x')) P(x, a,x') \right\}.

Notice that, for each fixed aΓ(x)a \in \Gamma(x), the operator RθaR_\theta^a is an entropic certainty equivalent operator on VV (see Example 7.3.2).

The next exercise pertains to quantile preferences rather than risk-sensitive preferences, but the result can be obtained via a relatively straightforward modification of the proof of Proposition 8.3.1.

Let’s consider a job search problem where future wage outcomes are evaluated via risk-sensitive expectations. The associated Bellman operator is

(Tv)(w)=max{w1β,c+βθln[wexp(θv(w))P(w,w)]}(wW).(Tv)(w) = \max \left\{ \frac{w}{1-\beta} ,\, c + \frac{\beta}{\theta} \ln \left[ \sum_{w'} \, \exp(\theta v(w')) P(w, w') \right] \right\} \qquad (w \in \Wsf).

Here θ\theta is a nonzero parameter and other details are as in Section 3.3.1. We can represent the problem as an RDP with state space W\Wsf, action space A={0,1}\Asf = \{0, 1\}, feasible correspondence Γ(w)=A\Gamma(w) = \Asf, value space VRWV \coloneq \RR^\Wsf, and value aggregator

B(w,a,v)=aw1β+(1a){c+βθln[wexp(θv(w))P(w,w)]}.B(w, a, v) = a \frac{w}{1-\beta} + (1-a) \left\{ c + \frac{\beta}{\theta} \ln \left[ \sum_{w'} \, \exp(\theta v(w')) P(w, w') \right] \right\}.

If θ<0\theta < 0, then the agent is risk-averse with respect to the gamble associated with continuing and waiting for new wage draws. If θ>0\theta > 0 then the agent is risk-loving with respect to such gambles. For θ0\theta \approx 0, the agent is close to risk-neutral.

Figure Figure 8.2 shows how the continuation value, value function and optimal decision vary with θ\theta. Apart from θ\theta, parameters are identical to those in Listing 4. Indeed, for θ\theta close to zero, as in the middle sub-figure of Figure Figure 8.2, we see that the value function and reservation wage are almost identical to those from the risk-neutral model in Figure Figure 3.5.

Job search with risk-sensitive preferences

Figure 8.2:Job search with risk-sensitive preferences

As expected, a negative value of θ\theta tends to reduce the continuation value and hence the reservation wage, since the agent’s dislike of risk encourages early acceptance of an offer. For positive values of θ\theta the reverse is true, as seen in the bottom sub-figure.

8.3.2Adversarial Agents

Some problems in economics, finance, and artificial intelligence assume that decisions emerge from a dynamic two-person zero-sum game in which the two agents’ preferences are perfectly misaligned. This can lead to a dynamic program where the Bellman equation takes the form

v(x)=maxaΓ(x)infdD(x,a)B(x,a,d,v)(xX,  vRX),v(x) = \max_{a \in \Gamma(x)} \inf_{d \in D(x, a)} B(x, a, d, v) \qquad (x \in \Xsf, \; v \in \RR^\Xsf),

where B(x,a,d,v)B(x, a, d, v) represents lifetime value for the decision maker conditional on her current action aa and her adversary’s action dd. The decision maker chooses action aΓ(x)a \in \Gamma(x) with the knowledge that the opponent will then choose dD(x,a)d \in D(x, a) to minimize her lifetime value.

8.3.2.1Optimality

To establish optimality properties in the setting of (8.93), we introduce the following assumptions:

B(x,a,d,v)B(x,a,d,w)for all xX,  aΓ(x),  dD(x,a).B(x, a, d, v) \leq B(x, a, d, w) \quad \text{for all } x \in \Xsf, \; a \in \Gamma(x), \; d \in D(x, a).
v1(x)+ϵB(x,a,d,v1)for all xX,  aΓ(x),  dD(x,a).v_1(x) + \epsilon \leq B(x, a, d, v_1) \quad \text{for all } x \in \Xsf, \; a \in \Gamma(x), \; d \in D(x, a).
B(x,a,d,v2)v2(x)for all xX,  aΓ(x),  dD(x,a).B(x, a, d, v_2) \leq v_2(x) \quad \text{for all } x \in \Xsf, \; a \in \Gamma(x), \; d \in D(x, a).
B(x,a,d,λv+(1λ)w)λB(x,a,d,v)+(1λ)B(x,a,d,w)B(x, a, d, \lambda v + (1-\lambda) w) \geq \lambda B(x, a, d, v ) + (1-\lambda) B(x, a, d, w)

for all xXx \in \Xsf, aΓ(x)a \in \Gamma(x) and dD(x,a)d \in D(x, a).

Condition (a) is a natural monotonicity condition: A uniform increase in continuation values increases current value at all states and actions. Conditions (b) and (c) provide upper and lower bounds. Condition (d) is a concavity condition.

To analyze the decision maker’s problem, we set V[v1,v2]V \coloneq [v_1, v_2] and

B^(x,a,v)infdD(x,a)B(x,a,d,v)((x,a)G,  vV),\hat B(x, a, v) \coloneq \inf_{d \in D(x, a)} B(x, a, d, v) \qquad ((x, a) \in \Gsf, \; v \in V),

We consider R=(Γ,V,B^)\rR = (\Gamma, V, \hat B).

An immediate corollary of Proposition 8.3.2 is that, under the stated conditions, the decision maker’s problem is a globally stable RDP, and hence the fundamental optimality properties in Theorem 8.1.1 all hold.

In the proof of Proposition 8.3.2, we use the following exercise.

8.3.2.2A Perturbed MDP Problem

In this section, we provide a relatively abstract application of Proposition 8.3.2. Later, in Section 8.3.3, we will see more concrete applications.

The setting we consider is a modified MDP where the adversarial agent’s actions affect the reward function and transition kernel. This leads to a Bellman equation of the form

v(x)=maxaΓ(x)infdD(x,a){r(x,a,d)+βxv(x)P(x,a,d,x)}(xX).v(x) = \max_{a \in \Gamma(x)} \inf_{d \in D(x, a)} \left\{ r(x, a, d) + \beta \sum_{x'} v(x')P(x, a, d, x') \right\} \qquad (x \in \Xsf).

The choice perturbation dD(x,a)d \in D(x, a) is made by the adversary. The object PP is a stochastic kernel, in the sense that P(x,a,d,)P(x, a, d, \cdot) is a distribution over X\Xsf for each feasible (x,a,d)(x, a, d). We assume that Γ\Gamma is a nonempty correspondence from X\Xsf to A\Asf and D(x,a)D(x, a) is nonempty for all (x,a)G(x, a) \in \Gsf. Let

B^(x,a,v)=infdD(x,a){r(x,a,d)+βxv(x)P(x,a,d,x)}((x,a)G).\hat B(x, a, v) = \inf_{d \in D(x, a)} \left\{ r(x, a, d) + \beta \sum_{x'} v(x') P(x, a, d, x') \right\} \qquad ((x, a) \in \Gsf).

To construct the value space VV, we let r1=minrr_1 = \min r and r2=maxrr_2 = \max r, and set

V=[v1,v2]wherev1r1ϵ1βandv2r21β.V = [v_1, v_2] \quad \text{where} \quad v_1 \coloneq \frac{r_1 - \epsilon}{1-\beta} \quad \text{and} \quad v_2 \coloneq \frac{r_2}{1-\beta}.

(These constant functions are similar to v1,v2v_1, v_2 in (8.85).)

An immediate corollary of Lemma 8.3.3 is that R\rR is globally stable (via Proposition 8.2.5) and all optimality results in Theorem 8.1.1 apply.

8.3.3Ambiguity and Robustness

Until now we have considered agents facing decision problems where outcomes are uncertain but probabilities are known. For example, while the job seeker introduced in Chapter 1 does not know the next period wage offer when choosing her current action, she does know the distribution of that offer. She uses this distribution to determine an optimal course of action. Similarly, the controllers in our discussion of optimal stopping and MDPs used their knowledge of the Markov transition law to determine an optimal policy.

In many cases, the assumption that the decision maker knows all probability distributions that govern outcomes under different actions is debatable. In this section we study lifetime valuations in settings of Knightian uncertainty (Knight, (1921)), which means that outcome distributions are themselves unknown. Some authors refer to Knightian uncertainty as ambiguity.

Below we consider some dynamic problems where decision makers face Knightian uncertainty.

8.3.3.1Robust Control

First we study the choices of a decision maker who knows her reward function but distrusts her specification of the stochastic kernel PP that describes the evolution of the state. This distrust is expressed by assuming that she knows that PP belongs to some class of stochastic kernels from G×X\Gsf \times \Xsf to X\Xsf. This can lead to aggregators of the form

B(x,a,v)=r(x,a)+βinfPP(x,a){xv(x)P(x,a,x)},B(x, a, v) = r(x, a) + \beta \inf_{P \in \pP(x, a)} \left\{ \sum_{x'} v(x')P(x, a, x') \right\},

for (x,a)G(x, a) \in \Gsf. As usual, rr maps G\Gsf to R\RR and β(0,1)\beta \in (0,1). The decision maker can construct a policy that is robust to her distrust of the stochastic kernel by using this aggregator BB. Such aggregators arise in the field of robust control.

Positing that the decision maker knows a nontrivial set of stochastic kernels is a way of modeling Knightian uncertainty, as distinguished from risks that are described by known probability distributions.

Returning to the robust control model with aggregator BB in (8.108), we take VV is as defined in (8.105) and set R=(Γ,V,B)\rR = (\Gamma, V, B). The set P\pP of stochastic kernels is entirely arbitrary.

We conclude from this discussion that the robust control RDP is globally stable. Hence all of the fundamental optimality properties hold.

8.3.3.2Robustness and Adversarial Agents

A more general way to implement robustness is via the aggregator

B(x,a,v)=r(x,a)+βinfPP(x,a){xv(x)P(x,a,x)+d(P(x,a,),Pˉ(x,a,))}.B(x, a, v) = r(x, a) + \beta \inf_{P \in \pP(x, a)} \left\{ \sum_{x'} v(x')P(x, a, x') + d(P(x, a, \cdot), \bar P(x, a, \cdot)) \right\}.

In this set up, P(x,a)\pP(x, a) is often large, weakening the constraint on PP. At the same time, we introduce the penalty term d(P(x,a,),Pˉ(x,a,))d(P(x, a, \cdot), \bar P(x, a, \cdot)), which can be understood as recording the deviation between a given kernel PP and some baseline specification Pˉ\bar P.

One interpretation of this setting is that the decision maker begins with a baseline specification of dynamics but lacks confidence in its accuracy. In her desire to choose a robust policy, she imagines herself playing against an adversarial agent. Her adversary can choose transition kernels that deviate from the baseline, but the presence of the penalty term means that extreme deviations are curbed.

If we define

r^(x,a)=r(x,a)+d(P(x,a,),Pˉ(x,a,)),\hat r(x, a) = r(x, a) + d(P(x, a, \cdot), \bar P(x, a, \cdot)),

then (8.111) can be expressed as

B(x,a,v)=infP{r^(x,a)+βxv(x)P(x,a,x)}.B(x, a, v) = \inf_P \left\{ \hat r(x, a) + \beta \sum_{x'} v(x')P(x, a, x') \right\}.

This is a special case of (8.110), so the same optimality theory applies.

8.3.3.3Connection to Risk-Sensitive Preferences

One measure of discrepancy between two probability distributions is the Kullback--Liebler divergence (KL divergence)

dKL(qp)xq(x)ln(q(x)p(x))for q,pD(X).d_{KL}(q \given p) \coloneq \sum_x q(x) \ln \left( \frac{q(x)}{p(x)} \right) \quad \text{for } q, p \in \dD(\Xsf).

It is assumed here that qacpq \prec_{\rm{ac}} p, which means that q(x)=0q(x)=0 whenever p(x)=0p(x)=0. We note for future reference that dKLd_{KL} obeys the duality formula for variational inference, which states that, given hRXh \in \RR^\Xsf,

lnxexp(h(x))p(x)=supqacp{xh(x)q(x)dKL(qp)}.\ln \sum_x \exp(h(x)) p(x) = \sup_{q \prec_{\rm{ac}} p} \left\{ \sum_x h(x) q(x) - d_{KL}(q \given p) \right\}.

(See, e.g., Dupuis & Ellis (2011), Proposition 1.4.2.)

In robust control, KL divergence can be used to measure deviation between the baseline specification and alternative specifications. It turns out that, under this measure of divergence, there is a tight relationship between robust control and risk-sensitive preferences.

To illustrate this relationship, we fix θ<0\theta < 0 and set dθ(1/θ)dKLd_\theta \coloneq -(1/\theta) d_{KL}, so that dθd_\theta is a simple positive rescaling of the Kullback--Leibler divergence. Using dθd_\theta in (8.111) leads to

B(x,a,v)=r(x,a)+βinfPP(x,a){xv(x)P(x,a,x)+dθ(P(x,a,)Pˉ(x,a,))}.B(x, a, v) = r(x, a) + \beta \inf_{P \in \pP(x, a)} \left\{ \sum_{x'} v(x')P(x, a, x') + d_\theta (P(x, a, \cdot) \given \bar P(x, a, \cdot)) \right\}.

The constraint set P(x,a)\pP(x,a) is all PM(RX)P \in \mopx such that P(x,a,)acPˉ(x,a,)P(x, a, \cdot) \prec_{\rm{ac}} \bar P(x, a, \cdot).

If we multiply both sides of the variational formula (8.115) by (1/θ)(1/\theta) and set h=θvh = \theta v we get

1θlnxexp(θv(x))p(x)=infqacp{xv(x)q(x)1θdKL(qp)}.\frac{1}{\theta} \ln \sum_x \exp(\theta v(x)) p(x) = \inf_{q \prec_{\rm{ac}} p} \left\{ \sum_x v(x) q(x) - \frac{1}{\theta} d_{KL}(q \given p) \right\}.

This allows us to rewrite BB as

B(x,a,v)=r(x,a)+β1θln{xexp(θv(x))Pˉ(x,a,x)}.B(x, a, v) = r(x, a) + \beta \frac{1}{\theta} \ln \left\{ \sum_{x'} \exp(\theta v(x')) \bar P(x, a, x') \right\}.

Hence, for this choice of deviation, the robust control aggregator (8.111) reduces to the risk-sensitive aggregator (see Example 8.1.6) under the baseline transition kernel.

8.3.4Smooth Ambiguity

Ju & Miao (2012) propose and study a recursive smooth ambiguity model in the context of asset pricing. A generic discrete formulation of their optimization problem can be expressed in terms of the aggregator

B(x,a,v)={r(x,a)+β{[xv(x)γPθ(x,a,x)]κ/γμ(x, ⁣dθ)}α/κ}1/α,B(x, a, v) = \left\{ r(x, a) + \beta \left\{ \int \left[ \sum_{x'} v(x')^{\gamma} P_\theta (x, a, x') \right]^{\kappa/\gamma} \mu(x, \diff \theta) \right\}^{\alpha/\kappa} \right\}^{1/\alpha},

where α,κ,γ\alpha, \kappa, \gamma are nonzero parameters, PθP_\theta is a stochastic kernel from G\Gsf to X\Xsf for each θ\theta in a finite dimensional parameter space Θ\Theta, and μ(x,)\mu(x, \cdot) is a probability distribution over Θ\Theta for each xXx \in \Xsf. The distribution μ(x,)\mu(x, \cdot) represents subjective beliefs over the transition rule for the state.

The aggregator BB in (8.119) is defined for xXx \in \Xsf, aΓ(x)a \in \Gamma(x) and vIv \in I, where II is be the interior of the positive cone of RX\RR^\Xsf. To ensure finite real values, we assume r0r \gg 0.

As with the Epstein--Zin case, α\alpha parameterizes the elasticity of intertemporal substitution and γ\gamma governs risk aversion. The parameter κ\kappa captures ambiguity aversion. If κ=γ\kappa = \gamma, the agent is said to be ambiguity neutral.

Returning to (8.119), we focus on the case κ<γ<0<α<1\kappa < \gamma < 0 < \alpha < 1, which includes the calibration used in Ju & Miao (2012). (Other cases can be handled using similar methods and details are left to the reader.) After constructing a suitable value space, we will show that the resulting RDP is globally stable.

As a first step, set r1minrr_1 \coloneq \min r, r2maxrr_2 \coloneq \max r and fix ϵ>0\epsilon > 0. Consider the constant functions

v1(r11β)1/αandv2(r2+ϵ1β)1/α.v_1 \coloneq\left( \frac{r_1}{1-\beta} \right)^{1/\alpha} \quad \text{and} \quad v_2 \coloneq \left( \frac{r_2 +\epsilon}{1-\beta} \right)^{1/\alpha} .

In the remainder of this section on smooth ambiguity, we set V=[v1,v2]V = [v_1, v_2].

Here is our main result for this section. It implies that all optimality and convergence results for R\rR are valid (see, in particular, Theorem 8.1.1).

To prove Proposition 8.3.5, we use a transformation, just as we did with the Epstein--Zin case in Section 8.1.4.1. To this end, we introduce the composite parameters

ξγκ(0,1)andζκα<0.\xi \coloneq \frac{\gamma}{\kappa} \in (0,1) \quad \text{and} \quad \zeta \coloneq \frac{\kappa}{\alpha} < 0.

Then we define

B^(x,a,v)={r(x,a)+β{[xv(x)ξPθ(x,a,x)]1/ξμ(x, ⁣dθ)}ζ}1/ζ,\hat B(x, a, v) = \left\{ r(x, a) + \beta \left\{ \int \left[ \sum_{x'} v(x')^{\xi} P_\theta (x, a, x') \right]^{1/\xi} \mu(x, \diff \theta) \right\}^{\zeta} \right\}^{1/\zeta},

and

V^=[v^1,v^2]where   v^1v21/κ and v^2v11/κ.\hat V = [\hat v_1, \hat v_2] \quad \text{where } \; \hat v_1 \coloneq v_2^{1/\kappa} \text{ and } \hat v_2 \coloneq v_1^{1/\kappa}.

Note that V^\hat V is a nonempty order interval of strictly positive real-valued functions, since 0<v1<v20 < v_1 < v_2 and κ<0\kappa < 0. We set R^=(Γ,V^,B^)\hat \rR = (\Gamma, \hat V, \hat B).

The next exercise shows that R\rR and R^\hat \rR are topologically conjugate (see Section 8.1.4).

8.3.5Minimization

Until now, all results and applications have concerned maximization of lifetime values. Now is a good time to treat minimization. Throughout this section, R\rR is a well-posed RDP. The pointwise minimum vσvσ\vmin \coloneq \bigwedge_\sigma v_\sigma is called the min-value function generated by R\rR. We call a policy σΣ\sigma \in \Sigma min-optimal for R\rR if vσ=vv_\sigma = \vmin. A policy σΣ\sigma \in \Sigma is called vv-min-greedy for R\rR if

σ(x)argminaΓ(x)B(x,a,v)for all xX.\sigma(x) \in \argmin_{a \in \Gamma(x)} B(x, a, v) \quad \text{for all } x \in \Xsf.

We say that R\rR obeys Bellman’s principle of min-optimality if

σΣ is min-optimal for R    σ is v-min-greedy.\sigma \in \Sigma \text{ is min-optimal for } \rR \quad \iff \quad \sigma \text{ is } \vmin \text{-min-greedy}.

The Bellman min-operator T\tmin is defined by

(Tv)(x)=minaΓ(x)B(x,a,v)(xX).(\tmin v)(x) = \min_{a \in \Gamma(x)} B(x, a, v) \qquad (x \in \Xsf).

We say that vVv \in V obeys the min-Bellman equation if Tv=v\tmin v = v. The algorithm defined by replacing “vv-greedy” with “vv-min-greedy” in Algorithm 8.1 (HPI) will be called min-HPI.

We can now state the following result, which is analogous to Theorem 8.1.1. In the statement, R\rR is a well-posed RDP with min-value function v\vmin.

Although we omit the details, a min-OPI convergence result directly analogous to the OPI convergence result in (v) of Theorem 8.1.1 also holds (after replacing maximization-based vv-greedy policies with vv-min-greedy policies).

Theorem 8.3.7 is proved in Section 9.2.3. For now, we consider two applications that involve minimization.

8.3.5.1Application: Shortest Paths

Recall the shortest path problem introduced in Example 8.1.8, where X\Xsf is the vertices of a graph, EE is the edges, c ⁣:ER+c \colon E \to \RR_+ maps a travel cost to each edge (x,x)E(x,x') \in E, and O(x)\oO(x) is the set of direct successors of xx. The aim is to minimize total travel cost to a destination node dd. We adopt all assumptions from Exercise 8.17 and assume in addition that c(x,x)=0c(x,x')=0 implies x=dx=d. As in Exercise 8.17, we let C(x)C(x) be the maximum cost of traveling to dd from xx along any directed path.

We regard the problem as an RDP R=(O,V,B)\rR = (\oO, V, B) with V=[0,C]V = [0, C] and

B(x,x,v)=c(x,x)+v(x)(xX).B(x, x', v) = c(x, x') + v(x') \qquad (x \in \Xsf).

In the present setting, the function vv in (8.132) is often called the cost-to-go function, with v(x)v(x') in (8.137) understood as remaining costs after moving to state xx'.

While the value aggregator BB in (8.132) is simple, the absence of discounting (which is standard in the shortest path literature) means that R\rR is not contracting. Fortunately, R\rR turns out to be concave (in the sense of Section 8.2.3), which allows us to prove

(In the present context, v\vmin is also known as the minimum cost-to-go function.)

8.3.5.2Application: Negative Discount Rate Optimality

When discussing MDPs we used β\beta to represent the discount factor. Given β\beta, the discount rate or rate of time preference is the value ρ\rho that solves β=1/(1+ρ)\beta = 1/(1+\rho). The standard MDP assumption β<1\beta < 1 implies this rate is positive. You will recall from Chapter 5 that the condition β<1\beta < 1 is central to the general theory of MDPs, since it yields global stability of the Bellman and policy operators on RX\RR^\Xsf (via the Neumann series lemma or Banach’s fixed-point theorem).

In the previous section, on shortest paths, we studied an RDP with a zero discount rate. Now we go one step further and consider problems with negative rates of time preference. Such preference are commonly inferred when people face unpleasant tasks. Subjects of studies often prefer getting such tasks “over and done with” rather than postponing them. (Negative discount rates are inferred in other settings as well. Section 9.3 provides background and references.)

In this section, we model optimal choice under a negative discount rate. Taking our cue from the preceding discussion, we consider a scenario where a task generates disutility but has to be completed. In particular, we assume that

B(x,x,v)=c(x,x)+βv(x)(x,xX),B(x, x', v) = c(x, x') + \beta v(x') \qquad (x, x' \in \Xsf),

where X\Xsf is a finite set and β>1\beta > 1 is some positive constant. The function cc gives the cost of transitioning from xx to the new state xx'

The value aggregator BB in (8.137) is the same as the shortest path aggregator (8.132), except for the constant β\beta. To keep the discussion simple, we adopt all other assumptions from the shortest path discussion in Section 8.3.5.1.

We now have an R=(Γ,V,B)\rR = (\Gamma, V, B) with Γ=O\Gamma = \oO, BB as in (8.137) and V=[0,C]V = [0, C]. The policy operators map VV into itself because, for vVv \in V, we clearly have 0Tσv0 \leq T_\sigma \, v and, in addition,

(Tσv)(x)=c(x,σ(x))+βv(σ(x))c(x,σ(x))+βC(σ(x))C(x).(T_\sigma \, v)(x) = c(x, \sigma(x)) + \beta v(\sigma(x)) \leq c(x, \sigma(x)) + \beta C(\sigma(x)) \leq C(x).

The last bound holds because C(x)C(x) is, by definition, greater than the cost of traveling from xx to σ(x)\sigma(x) and then following the most expensive path.

8.4Chapter Notes

The RDP framework adopted in this chapter is inspired by Bertsekas (2022), who in turn credits Mitten (1964) as the first research paper to frame Richard Bellman’s dynamic programming problems in an abstract setting. Denardo (1967) describes key ideas including what we call contracting RDPs (see Section 8.2.1). Denardo (1967) credits Shapley (1953) for inspiring his contraction-based arguments.

The key optimality results from this chapter are new, although closely related results appear in Bertsekas (2022). See, in addition, Bloise et al. (2024), which builds on Bertsekas (2022) and Ren & Stachurski (2021).

The job search application with quantile preferences in Section 8.2.1.4 is based on Castro et al. (2022). The same reference includes a general theory of dynamic programming when certainty equivalents are computed using quantile operators and aggregation is time additive.

The optimal default application in Section 8.2.1.5 is loosely based on Arellano (2008). Influential contributions to this line of work include, Yue (2010), Chatterjee & Eyigungor (2012), Arellano & Ramanarayanan (2012), Cruces & Trebesch (2013), Ghosh et al. (2013), Gennaioli et al. (2014), and Bocola et al. (2019).

At the start of the chapter we motivated RDPs by mentioning that equilibria in some models of production and economic geography can be computed using dynamic programming. Examples include Hsu (2012), Hsu et al. (2014), Antràs & Gortari (2020), Kikuchi et al. (2021) and Tyazhelnikov (2022).

Early references for dynamic programming with risk-sensitive preferences include Jacobson (1973), Whittle (1981), and Hansen & Sargent (1995). Elegant modern treatments can be found in Asienkiewicz & Jaśkiewicz (2017) and Bäuerle & Jaśkiewicz (2024), and an extension to general static risk measures is available in Bäuerle & Glauner (2022). Risk-sensitivity is applied to the study of optimal growth in Bäuerle & Jaśkiewicz (2018), and to optimal divided payouts in Bäuerle & Jaśkiewicz (2017). Risk-sensitivity is also used in applications of reinforcement learning, where the underlying state process is not known. See, for example, Shen et al. (2014), Majumdar et al. (2017) or Gao et al. (2021).

Dynamic programming problems that acknowledge model uncertainty by including adversarial agents to promote robust decision rules can be found in Cagetti et al. (2002), Hansen & Sargent (2011), and other related papers. Al-Najjar & Shmaya (2019) study the connection between Epstein--Zin utility and parameter uncertainty. Ruszczyński (2010) considers risk averse dynamic programming and time consistency.

The smooth ambiguity model in Section 8.3.4 is loosely adapted from Klibanoff et al. (2009) and Ju & Miao (2012). For applications of optimization under smooth ambiguity, see, for example, Guan & Wang (2020) or Yu et al. (2023). Zhao (2020) studies yield curves in a setting where ambiguity-averse agents face varying amounts of Knightian uncertainty over the short and long run.

Readers who wish to see some motivation for the discussion of negative discounting in Section 8.3.5.2 can consult Loewenstein & Sicherman (1991), who found that the majority of workers they surveyed reported a preference for increasing wage profiles over decreasing ones that yield the same undiscounted sum, even when it was pointed out that the latter could be used to construct a dominating consumption sequence. Loewenstein & Prelec (1991) obtained similar results. In summarizing their study, they argue that, in the context of the choice problems that they examined, “sequences of outcomes that decline in value are greatly disliked, indicating a negative rate of time preference” Loewenstein & Prelec, 1991.

In Section 8.3.5.2 we considered dynamic programs with negative discount rates. A more general treatment of such problems can be found in Kikuchi et al. (2021), which also shows how negative discount rate dynamic programs connect to static problems concerning equilibria in production networks and draws connections with Coase’s theory of the firm.

An algorithm that we neglected to discuss is stochastic gradient descent (or ascent) in policy space. Typically policies are parameterized via an approximation architecture that consists of basis functions, activation functions, and compositions of them (e.g., a neural network). In large models, such approximation is used even when the state and action spaces are finite, simply because the curse of dimensionality makes exact representations infeasible. For recent discussions of gradient descent in policy spaces see Nota & Thomas (2019), Mei et al. (2020), and Bhandari & Russo (2022).

Footnotes
  1. For H\Hmax to be well-defined, we must always select the same vv-greedy policy when the operator is applied to vv. To this end, we enumerate the policy set Σ\Sigma and choose the first vv-greedy policy. This choice of convention has no effect on convergence results.

References
  1. Denardo, E. V. (1967). Contraction mappings in the theory underlying dynamic programming. Siam Review, 9(2), 165–177.
  2. Bertsekas, D. (2022). Abstract Dynamic Programming (3rd ed.). Athena Scientific.
  3. Stokey, N., & Lucas, R. (1989). Recursive Methods in Dynamic Economics. Harvard University Press.
  4. Puterman, M. L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Interscience.
  5. Knight, F. H. (1921). Risk, Uncertainty and Profit (Vol. 31). Houghton Mifflin.
  6. Dupuis, P., & Ellis, R. S. (2011). A Weak Convergence Approach to the Theory of Large Deviations. John Wiley & Sons.
  7. Ju, N., & Miao, J. (2012). Ambiguity, learning, and asset returns. Econometrica, 80(2), 559–591.
  8. Mitten, L. (1964). Composition principles for synthesis of optimal multistage processes. Operations Research, 12(4), 610–619.
  9. Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences, 39(10), 1095–1100.
  10. Bloise, G., Le Van, C., & Vailakis, Y. (2024). Do not blame Bellman: It is Koopmans’ fault. Econometrica, 92(1), 111–140.
  11. Ren, G., & Stachurski, J. (2021). Dynamic programming with value convexity. Automatica, 130, 109641.
  12. de Castro, L., Galvao, A., & Nunes, D. (2022). Dynamic Economics with Quantile Preferences [Techreport]. SSRN, 4108230.
  13. Arellano, C. (2008). Default risk and income fluctuations in emerging economies. American Economic Review, 98(3), 690–712.
  14. Yue, V. Z. (2010). Sovereign default and debt renegotiation. Journal of International Economics, 80(2), 176–187.
  15. Chatterjee, S., & Eyigungor, B. (2012). Maturity, indebtedness, and default risk. American Economic Review, 102(6), 2674–2699.