Recursive Decision Processes - Dynamic Programming Volume I: Finite States

While the MDP model from Chapter 5 and Chapter 6 is elegant and widely used, researchers in economics, finance, and other fields are working to extend it. Reasons include:

(i) MDP theory cannot be applied to settings where lifetime values are described by the kinds of nonlinear recursions discussed in Chapter 7.

(ii) Equilibria in some models of production and economic geography can be computed using dynamic programming but not all such programming problems fit within the MDP framework.

(iii) Dynamic programming problems that include adversarial agents to promote robust decision rules can fail to be MDPs.

To handle such departures from the MDP assumptions, we now construct a more general dynamic programming framework, building on an approach to optimization initially developed by Denardo (1967) and extended by Bertsekas (2022). Further references are provided in Section 8.4.

We start this chapter by building a framework that centers on an abstract representation of the Bellman equation (Section 8.1). We then state optimality results and show how they can be verified in a range of applications. We defer proofs of core optimality results to Chapter 9, where we strip dynamic programs down to their essence by adopting a purely operator-theoretic perspective.

8.1 Definition and Properties

In this section, we introduce and analyze optimality conditions for recursive decision processes that include and extend all dynamic programming frameworks discussed so far. Throughout this chapter, $\Xsf$ denotes a finite set.

8.1.1 Defining RDPs

Consider a generic Bellman equation of the form

v(x) = \max_{a \in \Gamma(x)} B(x, a, v).

(8.1)

Here $x$ is the state, $a$ is an action, $\Gamma$ is a feasible correspondence, and $B$ is an “aggregator” function. We understand $\Gamma(x)$ as all actions available to the controller in state $x$ . The function $v$ assigns values to states and is a member of some class $V \subset \RR^\Xsf$ . This “abstract” Bellman equation generalizes all of the Bellman equations presented in previous chapters.

Our plan is to analyze the Bellman equation (8.1) and state conditions on $B$ and the other primitives that make strong optimality properties hold. As a first step, we introduce two finite sets,

an action space $\Asf$ and
a state space $\Xsf$ .

Given $\Xsf$ and $\Asf$ , we define a recursive decision process (RDP) to be a triple $\rR = (\Gamma, V, B)$ consisting of

(i) a feasible correspondence $\Gamma$ that is a nonempty correspondence from $\Xsf$ to $\Asf$ , which in turn defines

the feasible state-action pairs

\Gsf \coloneq \setntn{(x, a) \in \Xsf \times \Asf}{a \in \Gamma(x)}

and the set of feasible policies

\Sigma \coloneq \setntn{\sigma \in \Asf^\Xsf} {\sigma(x) \in \Gamma(x) \text{ for all } x \in \Xsf},

(ii) a subset $V$ of $\RR^\Xsf$ called the value space, and

(iii) a value aggregator $B$ that maps $\Gsf \times V$ to $\RR$ and satisfies both the monotonicity condition

v, w \in V \text{ and } v \leq w \implies B(x, a, v) \leq B(x, a, w) \; \text{ for all } (x, a) \in \Gsf,

(8.2)

and the consistency condition

w \in V \text{ whenever } w(x) = B(x, \sigma(x), v) \text{ for some } \sigma \in \Sigma \text{ and } v \in V.

(8.3)

Throughout, $\leq$ represents the pointwise order on $\RR^\Xsf$ .

The definition of the feasible correspondence in (i) is identical to that for the MDP in Chapter 5. As for (ii), we understand $V$ to be a class of functions that assign values to states. In (iii), the interpretation of the aggregator $B$ is:

$B(x, a, v) =$ total lifetime rewards, contingent on current action $a$ , current state $x$ , and using $v$ to evaluate future states.

The monotonicity condition (8.2) is natural: if, relative to $v$ , rewards are at least as high for $w$ in every future state, then the total rewards one can extract under $w$ should be at least as high. The consistency condition in (8.3) ensures that as we consider values of different policies we remain within the value space $V$ .

The MDP framework is a special case of the RDP framework:

Example 8.1.2

Consider a basic cake eating problem (see Section 5.1.2.3), where $\Xsf$ is a finite subset of $\RR_+$ and $x \in \Xsf$ is understood to be the number of remaining slices of cake today. Let $x'$ be the number of remaining slices next period and $u(x-x')$ be the utility from slices enjoyed today. The utility function $u$ maps $\RR_+$ to $\RR$ . Let $V = \RR^\Xsf$ , let $\Gamma$ be defined by $\Gamma(x) = \setntn{x' \in \Xsf}{x' \leq x}$ and let

B(x, x', v) = u(x - x') + \beta v(x').

Then $(\Gamma, V, B)$ is an RDP with Bellman equation identical to that of the original cake eating problem in Section 5.1.2.3. The monotonicity condition (8.2) and the consistency condition (8.3) are easy to verify.
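As a rough illustration, the cake eating RDP can be sketched in a few lines of Julia. The grid `X`, the discount factor `β`, and the utility function $u(c) = \sqrt{c}$ are illustrative assumptions rather than values from the text, and the monotonicity check covers only one pair $v \leq w$.

```julia
# Cake eating as an RDP. The grid, β and u are illustrative assumptions.
β = 0.95
u(c) = sqrt(c)                                  # assumed utility on ℝ₊
X = collect(0:5)                                # remaining slices of cake
Γ(x) = [x_next for x_next in X if x_next ≤ x]   # feasible next-period states

# Aggregator B(x, x′, v) = u(x - x′) + β v(x′).
# v is a vector over X, so v[x_next + 1] is the value at state x_next.
B(x, x_next, v) = u(x - x_next) + β * v[x_next + 1]

# Bellman operator (Tv)(x) = max over x′ ∈ Γ(x) of B(x, x′, v)
T(v) = [maximum(B(x, x_next, v) for x_next in Γ(x)) for x in X]

# Spot check of the monotonicity condition (8.2) for one pair v ≤ w
v = zeros(length(X))
w = v .+ 1.0
monotone = all(B(x, x_next, v) ≤ B(x, x_next, w) for x in X for x_next in Γ(x))
```

Since $V = \RR^\Xsf$ here, the consistency condition (8.3) holds automatically: any vector returned by `T` or by a policy operator is again an element of $V$.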

The last example is a special case of Example 8.1.1, since the cake eating problem is an MDP (see Section 5.1.2.3). Nonetheless, Example 8.1.2 is instructive because, for cake eating, the MDP construction is tedious (e.g., we need to define a stochastic kernel $P$ even though transitions are deterministic), while the RDP construction is straightforward.

The next example makes a related point.

Example 8.1.3

In Section 5.1.2.4 we showed that the job search model is an MDP, but the construction was tedious. We can also represent job search as an RDP, and this embedding is straightforward. To see this, recall that, for an arbitrary optimal stopping problem with primitives as described in Chapter 4, the Bellman equation is

v(x) = \max \left\{ e(x), c(x) + \beta \sum_{x'} v(x') P(x, x') \right\} \qquad (x \in \Xsf).

(8.5)

Let $V = \RR^\Xsf$ and $\Gamma(x) = \{0, 1\}$ for all $x$ . Let

B(x, a, v) = a e(x) + (1-a) \left[ c(x) + \beta \sum_{x'} v(x') P(x, x') \right],

(8.6)

for $x \in \Xsf$ and $a \in \Asf \coloneq \{0, 1\}$ . Then $(\Gamma, V, B)$ is an RDP (Exercise 8.1.1) and setting $v(x) = \max_{a \in \Gamma(x)} B(x, a, v)$ reproduces the Bellman equation (8.5).

Example 8.1.4

The dynamic programming framework popularized by Stokey & Lucas (1989) is characterized by two features: First, the state is divided into an exogenous process $(Z_t)$ and an endogenous process $(Y_t)$ . In addition, the next period endogenous state is directly chosen by the current action. The Bellman equation takes the form

v(y, z) = \max_{y' \in \Gamma(y, z)} \left\{ F(y, z, y') + \beta \sum_{z'} v(y', z') Q(z, z') \right\}.

(8.7)

We assume that $(Z_t)$ is $Q$ -Markov on finite set $\Zsf$ and $(Y_t)$ takes values in finite set $\Ysf$ . With state space $\Xsf \coloneq \Ysf \times \Zsf$ , action space $\Ysf$ , feasible correspondence $x \mapsto \Gamma(x)$ , value space $V = \RR^\Xsf$ and aggregator

B(x, a, v) = B((y, z), y', v) = F(y, z, y') + \beta \sum_{z'} v(y', z') Q(z, z'),

we obtain an RDP with Bellman equation identical to (8.7).

Example 8.1.1–Example 8.1.4 treated RDPs that can be embedded into the MDP framework. In the remaining examples, we consider models that cannot be represented as MDPs.

Example 8.1.8

The shortest path problem considers optimal traversal of a directed graph $\gG = (\Xsf, E)$, where $\Xsf$ is the set of vertices of the graph and $E$ is the set of edges. A weight function $c \colon E \to \RR_+$ associates a cost to each edge $(x,x') \in E$. The aim is to find the minimum cost path from $x$ to a specified vertex $d$ for every $x \in \Xsf$. Under some conditions, the problem can be solved by applying a Bellman operator of the form

(Tv)(x) = \min_{x' \in \oO(x)} \{ c(x, x') + v(x') \} \qquad (x \in \Xsf),

(8.11)

where $\oO(x) \coloneq \setntn{x' \in \Xsf}{(x, x') \in E}$ is the set of direct successors of $x$ and $v(x')$ is the minimum cost-to-go from state $x'$. The problem is not an MDP because future values are not discounted. It can be framed as an RDP, however, by setting $\Gamma(x) = \oO(x)$, $B(x, x', v) = c(x, x') + v(x')$ and $V = \RR^\Xsf$.

Example 8.1.8 is a minimization problem. We treat minimization explicitly in Section 8.3.5, although the shortest path setting can be converted to a maximization problem by replacing $c(x,x')$ with $-c(x,x')$. This produces an application similar to the cake eating problem in Example 8.1.2 (although discounting is eliminated and network structure shows up in the constraint).
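The operator in (8.11) is easy to experiment with on a toy graph. The sketch below uses a hypothetical four-vertex graph with destination $d = 4$ and assumes, as one convenient convention, that the destination has a zero-cost self-loop so that $\oO(x)$ is nonempty at every vertex. Iterating $T$ from $v \equiv 0$ until the iterates stop changing recovers the minimum cost-to-go in this example.

```julia
# Hypothetical graph: edge costs stored as (x, x′) => c(x, x′).
# The zero-cost self-loop at the destination 4 is an assumption of this sketch.
c = Dict((1, 2) => 1.0, (1, 3) => 4.0, (2, 3) => 1.0,
         (2, 4) => 5.0, (3, 4) => 1.0, (4, 4) => 0.0)
X = 1:4
O(x) = [y for y in X if haskey(c, (x, y))]    # direct successors of x

# Bellman operator (8.11): (Tv)(x) = min over x′ ∈ O(x) of c(x, x′) + v(x′)
T(v) = [minimum(c[(x, y)] + v[y] for y in O(x)) for x in X]

# Iterate T until the iterates stop changing (or max_iter is reached)
function iterate_T(v; max_iter=100)
    for _ in 1:max_iter
        v_new = T(v)
        v_new == v && return v_new
        v = v_new
    end
    return v
end

v_star = iterate_T(zeros(4))   # minimum cost-to-go from each vertex
```

For this graph the cheapest route from vertex 1 is 1 → 2 → 3 → 4, with total cost 3.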

8.1.2 Lifetime Value

We aim to discuss optimality of RDPs. To prepare for this topic, we now clarify lifetime values associated with different policy choices in the RDP setting.

8.1.2.1 Policies and Value

Let $\rR = (\Gamma, V, B)$ be an RDP with state and action spaces $\Xsf$ and $\Asf$ , and let $\Sigma$ be the set of all feasible policies. For each $\sigma \in \Sigma$ we introduce the policy operator $T_\sigma$ as a self-map on $V$ defined by

(T_\sigma \, v)(x) = B(x, \sigma(x), v) \qquad (x \in \Xsf).

(8.12)

The RDP policy operator is a direct generalization of the MDP policy operator and of the optimal stopping policy operator defined in earlier chapters.

Consider a given RDP $(\Gamma, V, B)$ and fix $\sigma \in \Sigma$ . If $T_\sigma$ has a unique fixed point in $V$ , we denote this fixed point by $v_\sigma$ and call it the $\sigma$ -value function. It is natural to interpret $v_\sigma$ as representing the lifetime value of following policy $\sigma$ .

The previous examples are linear but the same idea extends to nonlinear recursive preference models as well. To see this, recall the generic Koopmans operator $(Kv)(x) = A(x, (Rv)(x))$ introduced in Section 7.3.1. Lifetime value is the unique fixed point of this operator whenever it exists. In all of the RDP examples we have considered, the policy operator can be expressed as $(T_\sigma \, v)(x) = A_\sigma(x, (R_\sigma \, v)(x))$ for some aggregator $A_\sigma$ and certainty equivalent operator $R_\sigma$ . Hence $T_\sigma$ is a Koopmans operator and lifetime value associated with policy $\sigma$ is the fixed point of this operator.

8.1.2.2 Uniqueness and Stability

Let $\rR = (\Gamma, V, B)$ be a given RDP with policy operators $\{T_\sigma\}$ . Given that our objective is to maximize lifetime value over the set of policies in $\Sigma$ , we need to assume at the very least that lifetime value is well defined at each policy. To this end, we say that $\rR$ is well-posed whenever $T_\sigma$ has a unique fixed point $v_\sigma$ in $V$ for all $\sigma \in \Sigma$ .

Example 8.1.14

The shortest path problem discussed in Example 8.1.8 is not well-posed without further assumptions. For example, consider a graph that contains two vertices $x$ and $y$ , with $x \in \oO(y)$ , $y \in \oO(x)$ , and $c(x,y) + c(y,x) > 0$ . Then, for any policy $\sigma$ that maps $x$ to $y$ and $y$ to $x$ , we have

(T_\sigma \, v)(x) = c(x,y) + v(y) \quad \text{and} \quad (T_\sigma \, v)(y) = c(y,x) + v(x) .

Hence, if $v \in \RR^\Xsf$ is a fixed point of $T_\sigma$, we obtain $v(x) = c(x,y) + v(y)$ and $v(y) = c(y,x) + v(x)$. Substitution yields $v(x) = c(x,y) + c(y, x) + v(x)$, which contradicts $c(x,y) + c(y,x) > 0$. Hence $T_\sigma$ has no fixed point.
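The same failure can be seen numerically. In the sketch below the two edge costs are hypothetical, and $v$ is stored as the vector $(v(x), v(y))$. Two applications of $T_\sigma$ shift every entry of $v$ up by $c(x,y) + c(y,x)$, so the iterates drift upward forever and cannot settle on a fixed point.

```julia
# Hypothetical edge costs on the two-vertex cycle x → y → x
c_xy, c_yx = 1.0, 2.0

# Policy operator for σ(x) = y and σ(y) = x, with v stored as [v(x), v(y)]
T_σ(v) = [c_xy + v[2], c_yx + v[1]]

v = [0.0, 0.0]
two_steps = T_σ(T_σ(v))    # one full trip around the cycle
drift = two_steps - v      # equals c_xy + c_yx in both entries
```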

Let $\rR$ be an RDP with policy operators $\{T_\sigma\}_{\sigma \in \Sigma}$ . In what follows, we call $\rR$ globally stable if $T_\sigma$ is globally stable on $V$ for all $\sigma \in \Sigma$ .

Obviously, every globally stable RDP is well-posed.

In Section 8.1.3 we will see that global stability yields strong optimality properties.

8.1.2.3 Continuity

Let $\rR = (\Gamma, V, B)$ be an RDP. We call $\rR$ continuous if $B(x, a, v)$ is continuous in $v$ for all $(x, a) \in \Gsf$ . In other words, $\rR$ is continuous if, for any $v \in V$ , any $(x, a) \in \Gsf$ and any sequence $(v_k)_{k \geq 1}$ in $V$ , we have

\lim_{k \to \infty} B(x, a, v_k) = B(x, a, v) \quad \text{whenever} \quad \lim_{k \to \infty} v_k = v .

Continuity is satisfied by all applications considered in this text. For example, for the RDP generated by an MDP (Example 8.1.1), the deviation $| B(x, a, v_k) - B(x, a, v)|$ is dominated by $\beta \| v_k - v \|_\infty$ for all $(x, a) \in \Gsf$ . Hence continuity holds.

Below we will see that continuity is useful when considering convergence of certain algorithms.

8.1.3 Optimality

In this section, we present optimality theory for RDPs.

8.1.3.1 Greedy Policies

Given an RDP $\rR = (\Gamma, V, B)$ and $v \in V$ , a policy $\sigma \in \Sigma$ is called $v$ -greedy if

\sigma(x) \in \argmax_{a \in \Gamma(x)} B(x, a, v) \quad \text{for all } x \in \Xsf.

(8.13)

Since $\Gamma(x)$ is finite and nonempty at each $x \in \Xsf$ , at least one such policy exists. As with policy operators, the notion of greedy policies extends existing definitions from earlier chapters.
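To illustrate (8.13), the following sketch computes a $v$-greedy policy for the optimal stopping aggregator (8.6). The three-state primitives `e`, `c`, `P` and the discount factor are hypothetical, and the two-argument form `argmax(f, domain)` assumes Julia 1.7 or later.

```julia
# Hypothetical primitives for a three-state optimal stopping problem
β = 0.9
e = [1.0, 5.0, 10.0]          # exit reward e(x)
c = [0.5, 0.5, 0.5]           # continuation reward c(x)
P = [0.6 0.3 0.1;
     0.3 0.4 0.3;
     0.1 0.3 0.6]             # transition probabilities P(x, x′)
X = 1:3
Γ(x) = (0, 1)                 # a = 1 means stop, a = 0 means continue

# Aggregator (8.6)
B(x, a, v) = a * e[x] + (1 - a) * (c[x] + β * sum(v[y] * P[x, y] for y in X))

# A v-greedy policy selects, at each x, an action that maximizes B(x, a, v).
# argmax returns the first maximizer, so ties are resolved in favor of a = 0.
greedy(v) = [argmax(a -> B(x, a, v), Γ(x)) for x in X]

σ = greedy(e)                 # greedy policy when v = e
```

With these numbers the greedy policy continues in the two low-reward states and stops in the high-reward state. Finiteness of $\Gamma(x)$ is what guarantees that the `argmax` is well defined at every state.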

Solution to Exercise 8.1.7

Fix $v \in V$ and consider the set $\{T_\sigma \, v\}_{\sigma \in \Sigma} \subset V$ . We first show that $\{T_\sigma \, v\}_{\sigma \in \Sigma}$ contains a greatest element. Suppose that $\bar \sigma$ is $v$ -greedy. If $\sigma$ is any other policy, then

(T_\sigma \, v)(x) = B(x, \sigma(x), v) \leq (T_{\bar \sigma} \, v)(x) \quad \text{for all } x \in \Xsf .

Hence $T_{\bar \sigma} \, v$ is a greatest element of $\{T_\sigma \, v\}_{\sigma \in \Sigma}$ .

The proof that $\{T_\sigma \, v\}_{\sigma \in \Sigma}$ contains a least element is analogous, after replacing $\argmax$ with $\argmin$ .

Given RDP $\rR = (\Gamma, V, B)$ , we say that $v \in V$ satisfies the Bellman equation if $v(x) = \max_{a \in \Gamma(x)}B(x,a,v)$ for all $x \in \Xsf$ . The Bellman operator corresponding to $\rR$ is the map $T$ on $V$ defined by

(T v)(x) = \max_{a \in \Gamma(x)} B(x, a, v) \qquad (x \in \Xsf).

Solution to Exercise 8.1.8

Regarding part (i), fix $v \in V$. For any $x \in \Xsf$, we have

(T v)(x) = \max_{a \in \Gamma(x)} B(x, a, v) = \max_{\sigma \in \Sigma} B(x, \sigma(x), v) = \max_{\sigma \in \Sigma} (T_\sigma \, v)(x).

Since $x$ was chosen arbitrarily, we have confirmed that $T v = \bigvee_{\sigma \in \Sigma} T_\sigma \, v$ .

Regarding part (ii), $\sigma$ is $v$ -greedy if and only if

\sigma(x) \in \argmax_{a \in \Gamma(x)} B(x, a, v) \quad \text{for all } x \in \Xsf.

This is equivalent to $B(x, \sigma(x), v) = \max_{a \in \Gamma(x)} B(x, a, v)$ for all $x \in \Xsf$ . Hence $\sigma$ is $v$ -greedy if and only if $T_\sigma \, v = Tv$ , as claimed.

Regarding (iii), to see that $T$ is a self-map on $V$ , fix $v \in V$ and let $\sigma$ be $v$ -greedy. Then, by (ii), $Tv = T_\sigma \, v \in V$ . Hence $T$ is a self-map, as claimed. The fact that $T$ is order-preserving on $V$ follows immediately from the monotonicity property of $B$ in (8.2).

8.1.3.2 Algorithms

To solve RDPs for optimal policies, we use two core algorithms: Howard policy iteration (HPI) and optimistic policy iteration (OPI). As in previous chapters, OPI includes VFI as a special case.

To describe HPI we take $\rR = (\Gamma, V, B)$ to be a well-posed RDP with feasible policy set $\Sigma$ , policy operators $\{T_\sigma\}$ , and Bellman operator $T$ . In this setting, the HPI algorithm is essentially identical to the one given for MDPs in Section 5.1.4.2, except that $v_\sigma$ is calculated as the fixed point of $T_\sigma$ , rather than taking the specific form $(I-\beta P_\sigma)^{-1} r_\sigma$ . The details are in Algorithm 8.1.
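The following is a minimal sketch of Howard policy iteration for an RDP, not a transcription of the algorithm in the text. It assumes a hypothetical two-state, two-action MDP written in RDP form, approximates each $v_\sigma$ by iterating $T_\sigma$ (which is only justified when $T_\sigma$ is globally stable), and stops when the greedy policy repeats. The function names are illustrative, and `argmax(f, domain)` assumes Julia 1.7 or later.

```julia
# Hypothetical two-state, two-action MDP written as an RDP
β = 0.9
r = [1.0 0.0;
     0.0 2.0]                 # r[x, a]
P = zeros(2, 2, 2)            # P[x, a, y]
P[1, 1, :] = [0.9, 0.1]
P[1, 2, :] = [0.2, 0.8]
P[2, 1, :] = [0.8, 0.2]
P[2, 2, :] = [0.1, 0.9]
X, A = 1:2, 1:2
Γ(x) = A
B(x, a, v) = r[x, a] + β * sum(v[y] * P[x, a, y] for y in X)

T_σ(v, σ) = [B(x, σ[x], v) for x in X]                    # policy operator
greedy(v) = [argmax(a -> B(x, a, v), Γ(x)) for x in X]    # v-greedy policy

# Approximate v_σ by iterating T_σ; this relies on global stability of T_σ.
function policy_value(σ; tol=1e-10, max_iter=10_000)
    v = zeros(length(X))
    for _ in 1:max_iter
        v_new = T_σ(v, σ)
        maximum(abs.(v_new - v)) < tol && return v_new
        v = v_new
    end
    return v
end

# Howard policy iteration: evaluate, improve, stop when the policy repeats.
function howard(σ)
    while true
        v = policy_value(σ)
        σ_new = greedy(v)
        σ_new == σ && return v, σ
        σ = σ_new
    end
end

v_star, σ_star = howard([1, 1])
```

For an MDP one would normally replace `policy_value` with the exact linear solve $(I - \beta P_\sigma)^{-1} r_\sigma$. The iterative version is shown because it carries over to aggregators that are nonlinear in $v$.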

Algorithm 8.1 is somewhat ambiguous, since it is not always clear how to implement the instruction “$v_k \leftarrow$ the fixed point of $T_{\sigma_k}$”. However, if $\rR$ is globally stable, then each $T_{\sigma_k}$ is globally stable, so an approximation of the fixed point can be calculated by iterating with $T_{\sigma_k}$. This line of thought leads us to consider optimistic policy iteration (OPI) as a more practical alternative. Algorithm 8.2 states an OPI routine for solving $\rR$ that generalizes the MDP OPI routine in Section 5.1.4.

In Algorithm 8.2 we require that $v_0 = v_\sigma$ for some $\sigma \in \Sigma$ . This assumption can be dropped in some settings. For practical purposes, however, it is almost always straightforward to initialize OPI with $v_0 = v_\sigma$ for some simple choice of $\sigma$ .

When we turn to proofs, it will help to have an operator-theoretic description of HPI and OPI. To this end, we define two operators. The first is $\Hmax \colon V \to \{v_\sigma\}$ , which is defined via

\Hmax v = v_\sigma \text{ where } \sigma \text{ is } v \text{-greedy}.

(8.17)

We call $\Hmax$ the Howard operator generated by $\rR$ . Iterating with $\Hmax$ implements HPI. In particular, if we fix $\sigma \in \Sigma$ and set $v_k = \Hmax^k v_\sigma$ , then $(v_k)_{k \geq 0}$ is the sequence of $\sigma$ -value functions generated by HPI.^[1]

Next, fixing $m \in \NN$ , we define the operator $W_m$ from $V$ to itself via

W_m v \coloneq T^m_\sigma v \quad \text{where} \quad \sigma \text{ is } v \text{-greedy}.

(See the previous footnote on the choice of $v$-greedy policies.) The operator $W_m$ is an approximation of $\Hmax$, since $T^m_\sigma v \to v_\sigma = \Hmax v$ as $m \to \infty$. Iterating with $W_m$ generates the value sequence in OPI. More specifically, we take $v_0 \in \{v_\sigma\}$ and generate

(v_k, \sigma_k)_{k \geq 0} \quad \text{where } v_k = W_m^k v_0 \text{ and } \sigma_k \text{ is } v_k \text{-greedy}.

(8.18)

This produces an infinite sequence of OPI value and policy iterates.
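As a hedged sketch of how iteration with $W_m$ might look in code, the block below applies it to the optimal stopping aggregator (8.6) with hypothetical three-state primitives. The initial condition is $v_0 = e$, which is the $\sigma$-value function of the policy that always stops (in that case $T_\sigma v = e$ for every $v$). The stopping rule, tolerance, and function names are assumptions of this sketch, and `argmax(f, domain)` assumes Julia 1.7 or later.

```julia
# Hypothetical primitives for a three-state optimal stopping problem
β = 0.9
e = [1.0, 5.0, 10.0]          # exit reward
c = [0.5, 0.5, 0.5]           # continuation reward
P = [0.6 0.3 0.1;
     0.3 0.4 0.3;
     0.1 0.3 0.6]
X = 1:3
Γ(x) = (0, 1)                 # a = 1 means stop, a = 0 means continue
B(x, a, v) = a * e[x] + (1 - a) * (c[x] + β * sum(v[y] * P[x, y] for y in X))

T_σ(v, σ) = [B(x, σ[x], v) for x in X]
greedy(v) = [argmax(a -> B(x, a, v), Γ(x)) for x in X]

# W_m v = T_σ^m v, where σ is v-greedy
function W(v, m)
    σ = greedy(v)
    for _ in 1:m
        v = T_σ(v, σ)
    end
    return v
end

# Iterate with W_m from v0 until successive iterates are within tol
function opi(v0; m=20, tol=1e-8, max_iter=1_000)
    v = v0
    for _ in 1:max_iter
        v_new = W(v, m)
        maximum(abs.(v_new - v)) < tol && return v_new, greedy(v_new)
        v = v_new
    end
    return v, greedy(v)
end

v_opi, σ_opi = opi(e)         # v0 = e is the value of "always stop"
```

Setting `m=1` in `opi` reduces the routine to value function iteration, while large `m` makes each step close to a Howard policy iteration step.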

8.1.3.3 Optimality

Let $\rR$ be a well-posed RDP with policy operators $\{T_\sigma\}$ and $\sigma$ -value functions $\{v_\sigma\}$ . In this context, we set $v^* \coloneq \bigvee_\sigma \, v_\sigma \in \RR^\Xsf$ and call $v^*$ the value function of $\rR$ . By definition, $v^*$ satisfies

v^*(x) = \max_{\sigma \in \Sigma} v_\sigma(x) \qquad \text{for all } \; x \in \Xsf.

(8.19)

A policy $\sigma$ is called optimal for $\rR$ if $v_\sigma = v^*$ ; that is, if

v_\sigma(x) \geq v_\tau(x) \quad \text{for all } \tau \in \Sigma \text{ and all } x \in \Xsf.

Both of these definitions generalize the definitions we used for MDPs and optimal stopping. In particular, optimality of a policy means that it generates maximum possible lifetime value from every state.

We say that $\rR$ satisfies Bellman’s principle of optimality if

\sigma \in \Sigma \text{ is optimal for } \rR \quad \iff \quad \sigma \text{ is } v^*\text{-greedy}.

We can now state our main optimality result for RDPs. In the statement, $\rR$ is a well-posed RDP with value function $v^*$ .

As OPI includes VFI as a special case ( $m=1$ ), Theorem 8.1.1 also implies convergence of VFI under the stated conditions.

In terms of applications, Theorem 8.1.1 is the most important optimality result in this book. It provides the core optimality results from dynamic programming and a broadly convergent algorithm for computing optimal policies.

The proof of Theorem 8.1.1 is deferred to Section 9.1.

Example 8.1.18–Example 8.1.19 are relatively elementary. More complex models will be handled in Section 8.2.

8.1.3.4 Comments on the Optimality Theorem

Many traditional treatments of dynamic programming build optimality theory around contractivity (see, e.g., Puterman (2005) or Stokey & Lucas (1989), Section 4.2). Assumptions are constructed so that the policy operators and Bellman operator are all contraction mappings.

While such assumptions are sufficient for Theorem 8.1.1 (since contractivity of the policy operators implies stability), they are not necessary. There are a variety of ways to prove uniqueness and stability of fixed points, including the monotonicity-based methods discussed in Section 7.1.2 and the spectral methods in Section 6.1.3.2. These alternatives will prove useful in settings where contractivity fails, as we shall see in Section 8.2.

Another point worth noting about the conditions in Theorem 8.1.1 is that no assumptions are placed on the Bellman operator. Rather, one only needs to check properties of the policy operators. This is advantageous because, unlike the Bellman operator, the policy operators do not involve maximization.

8.1.3.5 Nonstationary Policies

Up until now, we have focused entirely on stationary policies, in the sense that the same policy is used at every point in time. What if we drop this assumption and admit the option to change policies? Might this lead to higher lifetime values?

In this section, we show that for globally stable RDPs the answer is negative. This finding justifies our focus on stationary policies.

To begin, let $\rR = (\Gamma, V, B)$ be a globally stable RDP. Recall from Remark 8.1.1 that, given $v \in V$ , $\sigma \in \Sigma$ , $k \in \NN$ and $x \in \Xsf$ , the value $(T_\sigma^k \, v)(x)$ gives finite horizon utility over periods $0, \ldots, k$ under policy $\sigma$ , with initial state $x$ and terminal condition $v$ . Extending this idea, it is natural to understand $T_{\sigma_k} T_{\sigma_{k-1}} \cdots T_{\sigma_1} v$ as providing finite horizon utility values for the nonstationary policy sequence $(\sigma_k)_{k \in \NN} \subset \Sigma$ , given terminal condition $v \in V$ . For the same policy sequence, we define its lifetime value via

\bar v \coloneq \limsup_{k \to \infty} v_k \quad \text{with } v_k \coloneq T_{\sigma_k} T_{\sigma_{k-1}} \cdots T_{\sigma_1} v

whenever the limsup is finite and independent of the terminal condition $v$ .

Suppose that this is the case, and hence $\bar v$ is well defined. We claim that $\bar v \leq v^*$ .

Since $\bar v$ is independent of the terminal condition $v$ , we can assume without loss of generality that $v \in V_\Sigma$ . By Theorem 8.1.1, we have $T^k v \to v^*$ as $k \to \infty$ . Hence, by Exercise 8.1.11,

\bar v = \limsup_{k \to \infty} v_k \leq \limsup_{k \to \infty} T^k v = \lim_{k \to \infty} T^k v = v^*,

as was to be shown.

8.1.3.6 Bounded RDPs

We call an RDP $\rR = (\Gamma, V, B)$ bounded if $V$ is convex and, moreover, there exist functions $v_1, v_2 \in V$ such that $v_1 \leq v_2$ ,

v_1(x) \leq B(x, a, v_1) \quad \text{ and } \quad B(x, a, v_2) \leq v_2(x) \;\; \text{ for all } (x, a) \in \Gsf .

(8.20)

We will show that boundedness can be used to obtain optimality results for well-posed RDPs, even without global stability.

Another attractive feature of boundedness is that it permits a reduction of value space, as illustrated by the next two exercises.

Exercise 8.1.13 implies the reduced RDP $(\Gamma, \hat V, B)$ is also well-posed under the stated conditions, and that it contains all the $\sigma$ -value functions and the value function from the original RDP $(\Gamma, V, B)$ . Hence any optimality results for $(\Gamma, \hat V, B)$ carry over to $(\Gamma, V, B)$ .

The next result shows that, when considering optimality, stability can be replaced by boundedness.

8.1.4 Topologically Conjugate RDPs

Sometimes RDP models can be simplified by transformations over value space. In this section we investigate such transformations. The underlying ideas are related to topological conjugacy of dynamical systems, which we introduced in Section 2.1.1.2.

To begin, let $\rR = (\Gamma, V, B)$ and $\hat \rR = (\Gamma, \hat V, \hat B)$ be two RDPs with identical state space $\Xsf$ , action space $\Asf$ and feasible correspondence $\Gamma$ . We consider settings where

V = \MM^\Xsf \quad \text{and} \quad \hat V = \hat{\MM}^\Xsf \quad \text{where } \MM, \hat \MM \subset \RR,

and, in addition, there exists a homeomorphism $\phi$ from $\MM$ onto $\hat \MM$ such that

B(x, a, v) = \phi^{-1}[ \hat B(x, a, \phi \circ v) ] \quad \text{for all } v \in V \text{ and } (x, a) \in \Gsf.

(8.21)

We call $\rR$ and $\hat \rR$ topologically conjugate under $\phi$ if $\phi$ is a homeomorphism from $\MM$ to $\hat \MM$ and (8.21) holds.

Here is our main result for this section.

The benefit of Proposition 8.1.3 is that one of these models might be easier to analyze than the other. We apply the proposition to the Epstein–Zin specification in Section 8.1.4.1 and to a smooth ambiguity model in Section 8.3.4. The next exercise will be useful for the proof.

Proof of Proposition 8.1.3.

By Exercise 8.1.18, $\Phi v \coloneq \phi \circ v$ is a homeomorphism from $V$ to $\hat V$ . Moreover, for any $\sigma \in \Sigma$ , the respective policy operators $T_\sigma$ and $\hat T_\sigma$ are linked by

(T_\sigma \, v)(x) = B(x, \sigma(x), v) = \phi^{-1}[ \hat B(x, \sigma(x), \phi \circ v) ] = \phi^{-1}[ ( \hat T_\sigma (\phi \circ v))(x) ].

This shows that $T_\sigma = \Phi^{-1} \circ \hat T_\sigma \circ \Phi$ on $V$ . Hence $(V, T_\sigma)$ and $(\hat V, \hat T_\sigma)$ are topologically conjugate dynamical systems, from which it follows that $T_\sigma$ is globally stable if and only if $\hat T_\sigma$ is globally stable. This completes the proof of Proposition 8.1.3. ◻

In the next section we will see how these ideas can simplify optimality analysis.

8.1.4.1 Application: Epstein–Zin RDPs

In this section, we show how the preceding optimality results and the notion of topological conjugacy can be deployed to analyze the Epstein–Zin RDP from Example 8.1.7.

Recall that the aggregator in Example 8.1.7 is

B(x, a, v) = \left\{ r(x, a) + \beta \left( \sum_{x'} v(x')^\gamma P(x, a, x') \right)^{\alpha/\gamma} \right\}^{1/\alpha}.

(8.22)

Let $V = (0, \infty)^\Xsf$ . We assume that $r \gg 0$ and take a nonempty feasible correspondence $\Gamma$ as given. Exercise 8.1.5 confirmed that $\rR \coloneq (\Gamma, V, B)$ is an RDP.

We will call the stochastic kernel $P$ irreducible if $P(x, \sigma(x), x')$ is irreducible for all $\sigma \in \Sigma$ . Below we establish stability of $\rR$ under irreducibility.

To prove Proposition 8.1.4, we set up a simpler and more tractable model. Our first step is to introduce another RDP by setting

\hat B(x, a, v) = B\left(x, a, v^{1/\gamma}\right)^{\gamma}.

(8.23)

We set $\rR \coloneq (\Gamma, V, B)$ and $\hat \rR \coloneq (\Gamma, V, \hat B)$ . Notice that $\hat B$ can also be expressed as

\hat B(x,a, v) = \left\{ r(x, a) + \beta \left( \sum_{x'} v(x') P(x, a, x') \right)^{1/\theta} \right\}^{\theta},

(8.24)

where $\theta \coloneq \gamma/\alpha$ .
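The equivalence of (8.23) and (8.24) can be checked numerically. The sketch below uses hypothetical parameter values and a two-state model with a single action, so the action argument is dropped. It evaluates $\hat B$ once through the definition (8.23) and once through the closed form (8.24), and records the largest gap between the two.

```julia
# Hypothetical primitives: two states, one action (action index dropped)
α, γ, β = 0.5, -2.0, 0.9
θ = γ / α
r = [1.0, 2.0]                # strictly positive rewards
P = [0.7 0.3;
     0.4 0.6]

# Original Epstein–Zin aggregator (8.22)
B(x, v) = (r[x] + β * sum(v[y]^γ * P[x, y] for y in 1:2)^(α / γ))^(1 / α)

# Transformed aggregator: via the definition (8.23) ...
B_hat_def(x, v) = B(x, v .^ (1 / γ))^γ
# ... and via the closed form (8.24)
B_hat(x, v) = (r[x] + β * sum(v[y] * P[x, y] for y in 1:2)^(1 / θ))^θ

v = [0.5, 1.5]                # an arbitrary element of (0, ∞)^X
gap = maximum(abs(B_hat_def(x, v) - B_hat(x, v)) for x in 1:2)
```

Up to floating point error the gap is zero, which reflects the algebra: substituting $v^{1/\gamma}$ into (8.22) cancels the inner power $\gamma$, and raising the result to the power $\gamma$ turns the outer exponent $1/\alpha$ into $\theta$.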

The value of introducing $\hat \rR$ comes from the fact that $\hat \rR$ is easier to work with than $\rR$ (just as the modified Epstein–Zin Koopmans operator $\hat K$ defined in Section 7.2.3.3 turned out to be easier to work with than the original Epstein–Zin Koopmans operator $K$ introduced in Section 7.2.3.2).

Now we investigate the properties of the simpler RDP $\hat \rR$ .

Proof

In view of (8.24), each policy operator $\hat T_\sigma$ associated with $\hat \rR$ takes the form

(\hat T_\sigma \,v)(x) = \left\{ r(x, \sigma(x)) + \beta \left( \sum_{x'} v(x') P(x, \sigma(x), x') \right)^{1/\theta} \right\}^{\theta}.

(8.25)

Each such $\hat T_\sigma$ is a special case of $\hat K$ defined by $\hat K v = \left\{ h + \beta (P v)^{1/\theta} \right\}^\theta$ (see (7.15)). We saw in Section 7.2.3.3 that this operator is globally stable under the stated assumptions. Hence $\hat \rR$ is a globally stable RDP. ◻

8.2 Types of RDPs

In Section 8.1 we showed that well-posed RDPs have strong optimality properties whenever they are globally stable or bounded, and that VFI and OPI converge whenever they are globally stable. But what conditions are sufficient for these properties? We start with a relatively strict condition based on contractivity and then progress to models that fail to be contractive.

8.2.1 Contracting RDPs

In this section, we study RDPs with strong contraction properties. Many traditional dynamic programs fit into this framework.

8.2.1.1 Definition and Examples

Let $\rR = (\Gamma, V, B)$ be an RDP with state space $\Xsf$ , action space $\Asf$ , and feasible state-action pair set $\Gsf$ . We call $\rR$ contracting if there exists a $\beta < 1$ such that

| B(x, a, v) - B(x, a, w)| \leq \beta \| v - w \|_\infty \quad \text{for all } (x, a) \in \Gsf \text{ and } v, w \in V.

(8.26)

In line with the terminology for contraction maps, we call $\beta$ the modulus of contraction for $\rR$ when (8.26) holds.

The following corollary to Proposition 8.2.1 is immediate from Banach’s contraction mapping theorem.

8.2.1.2 Error Bounds

Corollary 8.2.2 tells us that contracting RDPs are globally stable and, as a result, the sequence of functions in $V$ generated by VFI (Algorithm 8.2 with $m=1$) converges to $v^*$. However, this result is asymptotic and requires $v_0 = v_\sigma$ for some $\sigma \in \Sigma$. We can improve on it in the current setting by leveraging the contraction property:

Since the VFI algorithm terminates when $\|v_k - v_{k-1} \|_{\infty}$ falls below a given tolerance, the result in (8.28) directly provides a quantitative bound on the performance of the policy returned by VFI.

Proof of Proposition 8.2.3.

Let $(\Gamma, V, B)$ and $v$ be as stated and let $v^*$ be the value function. Note that

\|v^* - v_{\sigma} \|_{\infty} \leq \|v^* - v_k \|_{\infty} + \|v_k - v_{\sigma} \|_{\infty}.

(8.29)

To bound the first term on the right-hand side of (8.29), we use the fact that $v^*$ is a fixed point of $T$ , obtaining

\|v^* - v_k \|_{\infty} \leq \|v^* - Tv_k \|_{\infty} + \| Tv_k - v_k \|_{\infty} \leq \beta \|v^* - v_k \|_{\infty} + \beta \|v_k - v_{k-1}\|_{\infty}.

Hence

\|v^* - v_k \|_{\infty} \leq \frac{\beta}{1-\beta} \|v_k - v_{k-1} \|_{\infty}.

(8.30)

Now consider the second term on the right-hand side of (8.29). Since $\sigma$ is $v_k$ -greedy, we have $Tv_k = T_{\sigma} v_k$ , and

\|v_k - v_{\sigma} \|_{\infty} \leq \|v_k - Tv_k \|_{\infty} + \|Tv_k - v_{\sigma}\|_{\infty} = \|Tv_{k-1} - Tv_k\|_{\infty} + \|T_{\sigma} \, v_k - T_{\sigma} \, v_{\sigma}\|_{\infty},

\therefore \quad \|v_k - v_{\sigma} \|_{\infty} \leq \beta \|v_{k-1} - v_k\|_{\infty} + \beta \| v_k - v_{\sigma}\|_{\infty},

\therefore \quad \|v_k - v_{\sigma} \|_{\infty} \leq \frac{\beta}{1-\beta} \|v_k - v_{k-1} \|_{\infty}.

(8.31)

Together, (8.29), (8.30), and (8.31) give us (8.28). ◻
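Substituting (8.30) and (8.31) into (8.29) yields $\|v^* - v_\sigma\|_\infty \leq \frac{2\beta}{1-\beta} \|v_k - v_{k-1}\|_\infty$ whenever $v_k = T v_{k-1}$ and $\sigma$ is $v_k$-greedy. The sketch below checks this inequality on a hypothetical two-state, two-action MDP, which is contracting with modulus $\beta$. Here $v^*$ is approximated by a long run of VFI and $v_\sigma$ is computed by a linear solve. All names are illustrative, and `argmax(f, domain)` assumes Julia 1.7 or later.

```julia
using LinearAlgebra

# Hypothetical two-state, two-action MDP (contracting with modulus β)
β = 0.9
r = [1.0 0.0;
     0.0 2.0]                 # r[x, a]
P = zeros(2, 2, 2)            # P[x, a, y]
P[1, 1, :] = [0.9, 0.1]; P[1, 2, :] = [0.2, 0.8]
P[2, 1, :] = [0.8, 0.2]; P[2, 2, :] = [0.1, 0.9]
X, A = 1:2, 1:2
B(x, a, v) = r[x, a] + β * sum(v[y] * P[x, a, y] for y in X)

T(v) = [maximum(B(x, a, v) for a in A) for x in X]       # Bellman operator
greedy(v) = [argmax(a -> B(x, a, v), A) for x in X]      # v-greedy policy
sup_norm(v) = maximum(abs.(v))

# Exact σ-value function for an MDP: v_σ = (I - β P_σ)⁻¹ r_σ
function policy_value(σ)
    P_σ = [P[x, σ[x], y] for x in X, y in X]
    r_σ = [r[x, σ[x]] for x in X]
    return (I - β * P_σ) \ r_σ
end

# Apply T to v a total of k times
function iterate_T(v, k)
    for _ in 1:k
        v = T(v)
    end
    return v
end

v_prev = iterate_T(zeros(2), 4)          # v_{k-1}
v_k = T(v_prev)                          # v_k
σ = greedy(v_k)
v_star = iterate_T(zeros(2), 2_000)      # numerical stand-in for v*

lhs = sup_norm(v_star - policy_value(σ))
rhs = 2β / (1 - β) * sup_norm(v_k - v_prev)
```

In this example `lhs` is well below `rhs`, which is typical: the bound is a worst-case guarantee and is often loose in practice.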

8.2.1.3 A Blackwell-Type Condition

Next we state a useful condition for contractivity that is related to Blackwell’s sufficient condition discussed in Section 2.2.3.3. We say that RDP $(\Gamma, V, B)$ satisfies Blackwell’s condition if $v \in V$ implies $v + \lambda \coloneq v + \lambda \1$ is in $V$ for every $\lambda \geq 0$ and, in addition, there exists a $\beta \in [0, 1)$ such that

B(x, a, v + \lambda) \leq B(x, a, v) + \beta \lambda \qquad \text{for all } (x, a) \in \Gsf \text{, } v \in V \text{ and } \lambda \in \RR_+.

Solution to Exercise 8.2.3

Let $\rR = (\Gamma, V, B)$ satisfy Blackwell’s condition. Fix $v, w \in V$ and $(x, a) \in \Gsf$ . Observe that $v = w + v - w \leq w + \| v - w\|_\infty$ . By monotonicity of $B$ and Blackwell’s condition, we have

B(x, a, v) \leq B(x, a, w + \| v - w\|_\infty) \leq B(x, a, w) + \beta \| v - w\|_\infty.

As a result, $B(x, a, v) - B(x, a, w) \leq \beta \| v - w\|_\infty$ . Reversing the roles of $v$ and $w$ yields

|B(x, a, v) - B(x, a, w)| \leq \beta \| v - w\|_\infty.

Since $\beta < 1$ , the RDP $\rR$ is contracting.

8.2.1.4 Application: Job Search with Quantile Preferences

Consider the job search problem with correlated wage draws first investigated in Section 3.3.1. With finite wage offer set $\Wsf$ , wage offer process generated by $P \in \mopw$ and $\beta \in (0,1)$ , we can frame this as an RDP $(\Gamma, V, B)$ with $V = \RR^\Wsf$ , $\Gamma(w) = \{0, 1\}$ for $w \in \Wsf$ and

B(w, a, v) \coloneq a \frac{w}{1-\beta} + (1-a) [ c + \beta (Pv)(w) ].

Since the model just described is an optimal stopping problem, Example 8.2.1 tells us that $(\Gamma, V, B)$ is contracting.

Now consider the following modification, where $\Gamma$ and $V$ are as before but $B$ is replaced by

B_\tau(w, a, v) \coloneq a \frac{w}{1-\beta} + (1-a) [ c + \beta (R_\tau v)(w) ],

where $\tau \in [0,1]$ and $R_\tau$ is the quantile certainty equivalent operator described in Exercise 7.3.4.

Figure 8.1 shows the reservation wage for a range of $\tau$ values, computed using OPI (and taking the smallest $w \in \Wsf$ such that $\sigma^*(w) = 1$ ). The stationary distribution of $P$ is also shown in the figure, tilted 90 degrees.

The parameters and the code for applying $T_\sigma$ and evaluating greedy functions are shown in Listing 2. That listing calls the quantile operator $R_\tau$ , which is implemented in Listing 1. (Quantiles of discrete random variables can also be computed using functionality contained in Distributions.jl.)

Figure 8.1: The reservation wage as a function of $\tau$

The main message of Figure 8.1 is that the reservation wage rises in $\tau$ . In essence, higher $\tau$ focuses the attention of the worker on the right tail of the distribution of continuation values. This encourages the worker to take on more risk, which leads to a higher reservation wage (i.e., reluctance to accept a given current offer).

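"Compute the τ-th quantile of v(X) when X ∼ ϕ."
function quantile(τ, v, ϕ)
    # Sort v and reorder ϕ accordingly
    indices = sortperm(v)
    v_sorted = v[indices]
    ϕ_sorted = ϕ[indices]

    p = 0.0
    for (i, v_value) in enumerate(v_sorted)
        p += ϕ_sorted[i]        # p = sum of all ϕ[j] s.t. v[j] ≤ v_value
        if p ≥ τ                # exit and return v_value if prob ≥ τ
            return v_value
        end
    end
    return v_sorted[end]        # guard against rounding error when τ ≈ 1
end

# The operator called as R(τ, v, P) in Listing 2
"For each i, compute the τ-th quantile of v(Y) when Y ∼ P(i, ⋅)."
function R(τ, v, P)
    return [quantile(τ, v, P[i, :]) for i in eachindex(v)]
end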

Program 1: Conditional quantile operator (quantile_function.jl)

using QuantEcon
include("quantile_function.jl")

"Creates an instance of the job search model."
function create_markov_js_model(;
        n=100,       # wage grid size
        ρ=0.9,       # wage persistence
        ν=0.2,       # wage volatility
        β=0.98,      # discount factor
        c=1.0,       # unemployment compensation
        τ=0.5        # quantile parameter
    )
    mc = tauchen(n, ρ, ν)
    w_vals, P = exp.(mc.state_values), mc.p
    return (; n, w_vals, P, β, c, τ)
end

"""
The policy operator 

    (T_σ v)(w) = σ(w) (w / (1-β)) + (1 - σ(w))(c + β (R_τ v)(w))

"""
function T_σ(v, σ, model)
    (; n, w_vals, P, β, c, τ) = model
    h = c .+ β * R(τ, v, P)
    e = w_vals ./ (1 - β)
    return σ .* e + (1 .- σ) .* h
end

"Get a v-greedy policy."
function get_greedy(v, model)
    (; n, w_vals, P, β, c, τ) = model
    # Accept (σ = 1) when the stopping value is at least the continuation value
    σ = w_vals / (1 - β) .≥ c .+ β * R(τ, v, P)
    return σ
end

Program 2: Job search with quantile operator (quantile_js.jl)
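To see how these pieces fit together without installing QuantEcon, here is a minimal self-contained sketch that computes the reservation wage for several values of $\tau$ . The four-point wage grid, the transition matrix, the parameter values, and the function names `quantile_ce`, `R_op` and `reservation_wage` are all assumptions made for illustration. The sketch also uses plain iteration on the Bellman operator rather than OPI.

```julia
# Self-contained sketch: quantile job search on a small hand-made wage chain.
# All numbers below are illustrative assumptions, not the calibration above.

"Smallest v[j] such that the ϕ-probability of {v ≤ v[j]} is at least τ."
function quantile_ce(τ, v, ϕ)
    idx = sortperm(v)
    p = 0.0
    for i in idx
        p += ϕ[i]
        p ≥ τ && return v[i]
    end
    return v[idx[end]]          # guard against rounding when τ ≈ 1
end

"Apply R_τ row by row: (R_τ v)(w) is the τ-quantile of v under P(w, ⋅)."
R_op(τ, v, P) = [quantile_ce(τ, v, P[i, :]) for i in eachindex(v)]

"Reservation wage computed by iterating on the Bellman operator."
function reservation_wage(τ; w_vals, P, β=0.9, c=0.5, iters=500)
    e = w_vals ./ (1 - β)                 # value of accepting
    v = copy(e)
    for _ in 1:iters
        v = max.(e, c .+ β .* R_op(τ, v, P))
    end
    σ = e .≥ c .+ β .* R_op(τ, v, P)      # v-greedy policy (true = accept)
    return w_vals[findfirst(σ)]           # smallest accepted wage
end

w_vals = [0.5, 1.0, 1.5, 2.0]
P = [0.6 0.3 0.1 0.0;
     0.2 0.5 0.2 0.1;
     0.1 0.2 0.5 0.2;
     0.0 0.1 0.3 0.6]

res_wages = [reservation_wage(τ; w_vals, P) for τ in (0.25, 0.55, 0.85)]
```

Because the quantile is nondecreasing in $\tau$ and the Bellman operator is order-preserving, the entries of `res_wages` should be nondecreasing, in line with the message of Figure 8.1.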

8.2.1.5 Application: Optimal Default

In this section, we consider a small open economy that borrows in international financial markets in order to smooth consumption and has the option to default. We show that the model is a contractive RDP.

Income $(Y_t)_{t \geq 0}$ is exogenous and $Q$ -Markov on finite set $\Ysf$ . A representative household faces budget constraint

C_t = Y_t + b_t - q b_{t+1} \qquad (t \geq 0),

where $C_t$ is consumption at time $t$ and $q$ is the price at time $t$ of a risk-free claim on one unit of time $t+1$ consumption. The price $q$ is determined outside the model, say, in international markets, and $b_t$ measures foreign lending. Purchasing a claim on $b_{t+1}$ units of time $t+1$ consumption costs $q b_{t+1}$ . Purchasing a bond with negative face value $b_{t+1}$ yields $-q b_{t+1}$ in current consumption goods and carries a promise to deliver $-b_{t+1}$ next period.

Bond trading is managed by a benevolent government that wants to maximize household utility. Households discount future utility at rate $\beta \in (0,1)$ and current consumption $C_t$ generates current utility $u(C_t)$ . The government faces borrowing constraint $b_t \geq - m$ where $m \geq 0$ . The government maximizes expected discounted utility for the households.

The government can default on foreign loans. In this case, output available for consumption drops from $Y_t$ to $h(Y_t)$ , where $h$ is a function satisfying $h(y) < y$ for all $y$ . After a country defaults, it temporarily loses access to the international credit market.

At the end of each period during which the country is in default, it regains access to international credit markets with probability $\theta \in (0,1)$ . With probability $1-\theta$ it remains in financial autarky. When a country regains access to foreign borrowing, its debt is reset to zero.

We can cast this as an RDP by considering the value of each state and action. We set the state space $\Xsf$ to be the set of all $(y, b, d)$ in $\Ysf \times \Bsf \times \{0, 1\}$ , where $\Bsf$ is a finite subset of $[-m, \infty)$ indicating possible choices for bond holdings $b_t$ and $d$ is a binary variable indicating whether the country is in default ( $d=0$ means not in default and $d=1$ means in default).

The value space $V$ is all of $\RR^\Xsf$ . The action space is $(b_a, d_a) \in \Bsf \times \{0,1\}$ indicating choices for bond holdings and default. The feasible correspondence specifies feasible $(b_a,d_a)$ at given state $(y, b, d)$ and is given by

\Gamma(y, b, d) = \begin{cases} \Bsf \times \{0, 1\} & \text{ if } d = 0 \text{ and } \\ \{0 \} \times \{1\} & \text{ if } d = 1. \end{cases}

In other words, if $d=0$ , so the country is not in default, the government can choose any $b_a \in \Bsf$ and also any $d_a \in \{0, 1\}$ (i.e., default or not default). If $d=1$ , however, the government has no choices. We represent this situation by $b_a =0$ and $d_a =1$ .

The value aggregator takes the form

B((y, b, d), (b_a, d_a), v) = \text{value in state } (y, b, d) \text{ under action } (b_a, d_a).

To specify it we decompose the problem across cases for $d$ and $d_a$ . First consider the case where $d=0$ (not currently in default) and $d_a=0$ (the government chooses not to default). For this case $y + b - q b_a$ is current consumption, so we set

B((y, b, 0), (b_a, 0), v) = u(y + b - q b_a) + \beta \sum_{y'} v(y', b_a, 0) Q(y, y').

(8.32)

Now consider the case where $d=0$ and $d_a=1$ , so the government chooses to default. Then current consumption is $h(y)$ and we set

B((y, b, 0), (b_a, 1), v) = u(h(y)) + \beta \left[ \theta \sum_{y'} v(y', 0, 0) Q(y, y') + (1-\theta) \sum_{y'} v(y', 0, 1) Q(y, y') \right].

(8.33)

The term $\sum_{y'} v(y', 0, 0) Q(y, y')$ is the expected value next period when the country is readmitted to international financial markets (with $b'=0$ and $d'=0$ ), whereas the term $\sum_{y'} v(y', 0, 1) Q(y, y')$ is the expected value next period when default continues (with $b'=0$ and $d'=1$ ).

Since $B((y, b, 1), (b_a, 0), v)$ is not feasible (a defaulted country cannot itself directly choose to reenter financial markets), the only other possibility is $B((y, b, 1), (b_a, 1), v)$ , which is the expected value when the country remains in default. But this is the same as $B((y, b, 0), (b_a, 1), v)$ specified earlier: The value for a country that stays in default is the same as that for a country that newly enters default.
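The case analysis above can be collected into a single function. The following Julia sketch is one way to do so, assuming small illustrative grids, log utility, a proportional output loss $h(y) = 0.9 y$ , and made-up parameter values. The name `B_default` and the array layout `v[iy, ib, d + 1]` are choices made here for illustration.

```julia
# Sketch of the default-model aggregator, following (8.32) and (8.33).
# The grids, utility function, and parameter values are illustrative assumptions.
y_vals = [0.8, 1.0, 1.2]                      # income grid
b_vals = [-0.4, -0.2, 0.0, 0.2]               # bond grid (contains 0)
Q = [0.6 0.3 0.1; 0.2 0.6 0.2; 0.1 0.3 0.6]   # income transition matrix
β, θ, q = 0.95, 0.3, 0.96
u(c) = log(c)                                 # consumption is positive on these grids
h(y) = 0.9 * y                                # output under default, h(y) < y
i0 = findfirst(==(0.0), b_vals)               # index of b = 0

"""
Value aggregator. States are index triples (iy, ib, d), actions are (ia, d_a),
and v is stored as an array v[iy, ib, d + 1].
"""
function B_default((iy, ib, d), (ia, d_a), v)
    y, b = y_vals[iy], b_vals[ib]
    if d == 0 && d_a == 0
        # Repay: consume y + b - q b_a and keep market access, as in (8.32)
        cont = sum(v[j, ia, 1] * Q[iy, j] for j in eachindex(y_vals))
        return u(y + b - q * b_vals[ia]) + β * cont
    else
        # Default, or remain in default: consume h(y), as in (8.33).
        # The infeasible pair d = 1, d_a = 0 is assumed never to be passed in.
        readmit = sum(v[j, i0, 1] * Q[iy, j] for j in eachindex(y_vals))
        stay    = sum(v[j, i0, 2] * Q[iy, j] for j in eachindex(y_vals))
        return u(h(y)) + β * (θ * readmit + (1 - θ) * stay)
    end
end

v0 = zeros(length(y_vals), length(b_vals), 2)
val_repay   = B_default((2, 3, 0), (3, 0), v0)    # y = 1, b = 0, b_a = 0
val_default = B_default((2, 3, 0), (i0, 1), v0)   # same state, default chosen
```

With $v \equiv 0$ the two values reduce to the current utilities $u(1)$ and $u(h(1))$ , which gives a quick sanity check on the two branches.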

8.2.2 Eventually Contracting RDPs

Many RDPs are not contracting. There is no single method for handling all types of non-contractive RDPs, so we introduce alternative techniques over the next few sections. The first such technique, treated in this section, handles RDPs that contract “eventually,” even though they may fail to contract in one step. We show that these eventually contracting RDPs are globally stable, so all of the fundamental optimality results apply.

One application for these results is the MDP model with state-dependent discounting treated in Chapter 6. This section contains a proof of the main optimality result in that chapter (Proposition 6.2.2).

8.2.2.1 Definition and Properties

Let $\rR = (\Gamma, V, B)$ be an RDP with policy set $\Sigma$ . We call $\rR$ eventually contracting if there is a map $L$ from $\Gsf \times \Xsf$ to $\RR_+$ such that

|B(x, a, v)-B(x, a, w)| \leq \sum_{x'} |v(x') - w(x')| L(x, a, x'),

(8.34)

for all $(x, a) \in \Gsf$ and all $v, w \in V$ , and moreover,

\sigma \in \Sigma \implies \rho(L_\sigma) < 1 \quad \text{where} \quad L_\sigma(x, x') \coloneq L(x, \sigma(x), x').

Proof

Let $\rR$ be as stated and fix $\sigma \in \Sigma$ . Let $T_\sigma$ be the associated policy operator and let $L_\sigma$ be the linear operator in (8.34). For fixed $v, w \in V$ we have

\begin{aligned} |(T_\sigma \, v)(x) - (T_\sigma \, w)(x)| & = \left| B(x, \sigma(x), v) - B(x, \sigma(x), w) \right| \\ & \leq \sum_{x'} \left| v(x') - w(x') \right| \, L_\sigma(x, x'). \end{aligned}

Since $L_\sigma \geq 0$ and $\rho(L_\sigma) < 1$ , Proposition 6.1.6 implies that $T_\sigma$ is eventually contracting on $V$ . Since $V$ is closed in $\RR^\Xsf$ , it follows that $T_\sigma$ is globally stable (Theorem 6.1.5). Hence $\rR$ is globally stable, as claimed. ◻

Exercise 8.2.8

In Section 4.1.2 we studied an optimal exit problem for a firm. We can modify this problem to handle stochastic interest rates by introducing the RDP $\rR = (\Gamma, V, B)$ on state space $\Xsf$ with $\Gamma(x) = \{0,1\}$ , $V = \RR^\Xsf$ and

B(x, a, v) = a s + (1-a) \left[ \pi(x) + \beta(x) \sum_{x'} v(x') Q(x, x') \right],

for some $\beta \in \RR^\Xsf_+$ . (We suppose that state-dependence of $\beta$ is generated by state-dependent interest rates.) State the Bellman equation for this problem. Prove that $\rR$ is globally stable whenever there exists an $L \in \lopx$ such that $\rho(L) < 1$ and $\beta(x) Q(x, x') \leq L(x, x')$ for all $x, x' \in \Xsf$ .

8.2.2.2 Optimality for MDPs with State-Dependent Discounting

With Proposition 8.2.4 in hand, we can complete the proof of Proposition 6.2.2, which pertained to optimality properties for MDPs with state-dependent discounting.

Let $(\Gamma, \beta, r, P)$ be an MDP with state-dependent discounting, as defined in Section 6.2.1.1. The state space is $\Xsf$ and the action space is $\Asf$ . The function $\beta$ maps $\Gsf \times \Xsf$ to $\RR_+$ . Set

L(x, a, x') \coloneq \beta(x, a, x') P(x, a, x') \quad \text{and} \quad L_\sigma(x, x') \coloneq L(x, \sigma(x), x'),

for all $(x, a, x') \in \Gsf \times \Xsf$ and $\sigma \in \Sigma$ .

Assume the conditions of Proposition 6.2.2, so that $\rho(L_\sigma) < 1$ for all $\sigma \in \Sigma$ .

If we set

B(x, a, v) \coloneq r(x, a) + \sum_{x'} v(x') \beta(x, a, x') P(x, a, x'),

(8.35)

and take $V$ to be all of $\RR^\Xsf$ , then $\rR \coloneq (\Gamma, V, B)$ forms an RDP, as discussed in Exercise 8.2.4. We claim that $\rR$ is an eventually contracting RDP.

To see this, fix $v, w \in V$ and $(x, a) \in \Gsf$ . Applying the definition (8.35) and the triangle inequality, we have

\begin{aligned} |B(x, a, v)-B(x, a, w)| & = \left| \sum_{x'} [v(x') - w(x')] \beta(x, a, x') P(x, a, x') \right| \\ & \leq \sum_{x'} |v(x') - w(x')| L(x, a, x'). \end{aligned}

Under the stated assumptions, for each $\sigma \in \Sigma$ , the operator $L_\sigma(x, x') = L(x, \sigma(x), x')$ satisfies $\rho(L_\sigma) < 1$ . Hence $\rR$ is eventually contracting, as claimed. Since $V = \RR^\Xsf$ is closed, Proposition 8.2.4 implies that $\rR$ is a globally stable RDP. The claims in Proposition 6.2.2 now follow from Theorem 8.1.1.
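The spectral radius condition is easy to check numerically. The Julia sketch below does so for a made-up two-state example in which the discount factor depends only on the current state (a special case of $\beta(x, a, x')$ ) and exceeds one in the first state. The numbers and the helper name `spectral_radius` are illustrative assumptions.

```julia
using LinearAlgebra

# Illustrative two-state example (made-up numbers): the discount factor exceeds
# one in state 1, so a one-step contraction argument is unavailable.
β_vals = [1.05, 0.8]                  # state-dependent discount factors β(x)
P_σ = [0.5 0.5; 0.5 0.5]              # transition matrix under some policy σ
L_σ = β_vals .* P_σ                   # L_σ(x, x′) = β(x) P_σ(x, x′)

spectral_radius(A) = maximum(abs, eigvals(A))
ρ = spectral_radius(L_σ)

# One row of L_σ sums to more than 1, yet powers of L_σ shrink geometrically
row_sums = vec(sum(L_σ, dims=2))
norm_after_50 = opnorm(L_σ^50, Inf)
```

Here $L_\sigma$ has rank one, so its only nonzero eigenvalue equals its trace, $0.525 + 0.4 = 0.925$ . Hence $\rho(L_\sigma) < 1$ even though the first row sums to $1.05$ .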

8.2.3 Convex and Concave RDPs

Theorem 8.1.1 shows that RDPs have excellent optimality properties when all policy operators are globally stable on value space. So far we have looked at conditions for stability based on contractions (Section 8.2.1) and eventual contractions (Section 8.2.2). But sometimes both of these approaches fail and we need alternative conditions.

In this section, we explore alternative conditions based on Du’s theorem. Du’s theorem is well suited to the task of studying stability of policy operators, since it leverages the fact that all policy operators are order-preserving.

8.2.3.1 Definitions and Optimality

Let $\rR = (\Gamma, V, B)$ be an RDP with $V = [v_1, v_2]$ for some $v_1 \leq v_2$ in $\RR^\Xsf$ . We call $\rR$ convex if

(i) for all $(x, a) \in \Gsf$ , $\lambda \in [0, 1]$ and $v, w$ in $V$ , we have

B(x, a, \lambda v + (1-\lambda) w) \leq \lambda B(x, a, v ) + (1-\lambda) B(x, a, w) \quad \text{and},

(8.36)

(ii) there exists a $\delta > 0$ such that

B(x, a, v_2) \leq v_2(x) - \delta[v_2(x) - v_1(x)] \text{ for all } (x, a) \in \Gsf.

(8.37)

Analogous to the convex case, we call $\rR$ concave if

(i) for all $(x, a) \in \Gsf$ , $\lambda \in [0, 1]$ and $v, w$ in $V$ , we have

B(x, a, \lambda v + (1-\lambda) w) \geq \lambda B(x, a, v ) + (1-\lambda) B(x, a, w) \quad \text{and},

(8.38)

(ii) there exists a $\delta > 0$ such that

B(x, a, v_1) \geq v_1(x) + \delta [v_2(x) - v_1(x)] \text{ for all } (x, a) \in \Gsf.

(8.39)

In both of these definitions, condition (ii) is rather complex. The next exercise provides simpler sufficient conditions.

Solution to Exercise 8.2.9

We discuss the first case, regarding (8.37). When (8.40) holds, by finiteness of $\Gsf$ , we can take an $\epsilon > 0$ such that

B(x, a, v_2) \leq v_2(x) - \epsilon \text{ for all } (x, a) \in \Gsf.

We then have

\epsilon \leq v_2(x) - B(x, a, v_2) \leq v_2(x) - B(x, a, v_1) \leq v_2(x) - v_1(x)

for all $x$ , so $0 < \epsilon \leq \|v_2 - v_1\|_\infty$ . Set $\delta \coloneq \epsilon / \| v_2 - v_1 \|_\infty$ . From (8.40) we get

B(x, a, v_2) \leq v_2(x) - \delta \| v_2 - v_1 \|_\infty \leq v_2(x) - \delta [ v_2(x) - v_1(x)]

for arbitrary $(x,a) \in \Gsf$ . Hence (8.37) holds.

Both convexity and concavity yield stability, as the next proposition shows.

It follows from Proposition 8.2.5 that, for convex and concave RDPs, all of the optimality and convergence results in Theorem 8.1.1 apply.

8.2.3.2 Application to MDPs

Proposition 8.2.5 can be applied to establish optimality properties of regular MDPs. This exercise is redundant in the sense that optimality properties of regular MDPs have already been established using other means. At the same time, some of the arguments developed here will be helpful when we face more sophisticated problems.

To sketch the argument, let $\rR = (\Gamma, V, B)$ be an RDP generated by an ordinary MDP $(\Gamma, \beta, r, P)$ , as discussed in Example 8.1.1. In particular, $V = \RR^\Xsf$ , and $B(x,a, v) = r(x, a) + \beta \sum_{x'} v(x') P(x, a, x')$ . We set $r_1 \coloneq \min r$ and $r_2 \coloneq \max r$ . Then we fix $\epsilon > 0$ and define $\hat V$ via

\hat V \coloneq [v_1, v_2] \quad \text{ where } \quad v_1 \coloneq \frac{r_1 - \epsilon}{1-\beta} \; \text{ and } \; v_2 \coloneq \frac{r_2 + \epsilon}{1-\beta}.

(8.42)

(The functions $v_1$ and $v_2$ are constant.) We claim that the RDP $\hat \rR \coloneq (\Gamma, \hat V, B)$ is both convex and concave.
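Conditions (8.36) and (8.38) hold with equality here, because $B(x, a, v)$ is affine in $v$ . The bound conditions (8.37) and (8.39) can be checked numerically, as in the following Julia sketch. The rewards, transition probabilities, and parameter values are arbitrary illustrative choices.

```julia
# Numerical check of conditions (8.37) and (8.39) for an MDP-generated RDP.
# The rewards and transition probabilities are arbitrary illustrative values.
β, ϵ = 0.9, 0.1
r = [1.0 0.5; 0.0 2.0]                  # r[x, a]
P = zeros(2, 2, 2)                      # P[x, a, x′]
P[1, 1, :] .= [0.7, 0.3];  P[1, 2, :] .= [0.2, 0.8]
P[2, 1, :] .= [0.5, 0.5];  P[2, 2, :] .= [0.1, 0.9]

B(x, a, v) = r[x, a] + β * sum(v[y] * P[x, a, y] for y in 1:2)

r1, r2 = minimum(r), maximum(r)
v1 = fill((r1 - ϵ) / (1 - β), 2)        # constant lower bound
v2 = fill((r2 + ϵ) / (1 - β), 2)        # constant upper bound

# Largest δ for which (8.37) and (8.39) hold at every feasible (x, a)
δ_convex  = minimum((v2[x] - B(x, a, v2)) / (v2[x] - v1[x]) for x in 1:2, a in 1:2)
δ_concave = minimum((B(x, a, v1) - v1[x]) / (v2[x] - v1[x]) for x in 1:2, a in 1:2)
```

Both quantities are strictly positive because $v_2(x) - B(x, a, v_2) = r_2 + \epsilon - r(x, a) \geq \epsilon$ and $B(x, a, v_1) - v_1(x) = r(x, a) - r_1 + \epsilon \geq \epsilon$ .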

8.3 Further Applications

In this section, we consider some applications of the optimality results in Section 8.2.

8.3.1 Risk-Sensitive RDPs

In Section 7.2.2 we introduced risk-sensitive preferences and discussed a recursive utility problem. Now we embed risk-sensitive preferences into a dynamic program and apply the preceding optimality results to compute optimal policies.

8.3.1.1 Optimality Results

Consider the risk-sensitive preference RDP in Example 8.1.6, with state space $\Xsf$ and action space $\Asf$ . Let $V = \RR^\Xsf$ . For $(x,a) \in \Gsf$ and $v \in V$ , we can express the aggregator as

B(x, a, v) \coloneq r(x, a) + \beta (R_\theta^a \, v)(x),

where $\theta$ is a nonzero constant and

(R_\theta^a \, v)(x) \coloneq \frac{1}{\theta} \ln \left\{ \sum_{x'} \exp(\theta v(x')) P(x, a,x') \right\}.

Notice that, for each fixed $a \in \Gamma(x)$ , the operator $R_\theta^a$ is an entropic certainty equivalent operator on $V$ (see Example 7.3.2).

The next exercise pertains to quantile preferences rather than risk-sensitive preferences, but the result can be obtained via a relatively straightforward modification of the proof of Proposition 8.3.1.

Exercise 8.3.1

Let $\rR \coloneq (\Gamma, V, B)$ be an RDP with $V = \RR^\Xsf$ and fix $\tau \in [0,1]$ . Let $B(x, a, v) = r(x, a) + \beta (R_\tau^a \, v)(x)$ where, for each $a \in \Gamma(x)$ , the map $R_\tau^a$ is given by

(R_\tau^a \, v)(x) = \min \left\{ y \in \RR \;\Big|\; \sum_{x'} \1\{v(x') \leq y\} P(x, a, x') \geq \tau \right\} \qquad (v \in V, \; x \in \Xsf).

Prove that $\rR$ is globally stable whenever $\beta < 1$ .

8.3.1.2 Risk-Sensitive Job Search

Let’s consider a job search problem where future wage outcomes are evaluated via risk-sensitive expectations. The associated Bellman operator is

(Tv)(w) = \max \left\{ \frac{w}{1-\beta} ,\, c + \frac{\beta}{\theta} \ln \left[ \sum_{w'} \, \exp(\theta v(w')) P(w, w') \right] \right\} \qquad (w \in \Wsf).

Here $\theta$ is a nonzero parameter and other details are as in Section 3.3.1. We can represent the problem as an RDP with state space $\Wsf$ , action space $\Asf = \{0, 1\}$ , feasible correspondence $\Gamma(w) = \Asf$ , value space $V \coloneq \RR^\Wsf$ , and value aggregator

B(w, a, v) = a \frac{w}{1-\beta} + (1-a) \left\{ c + \frac{\beta}{\theta} \ln \left[ \sum_{w'} \, \exp(\theta v(w')) P(w, w') \right] \right\}.

If $\theta < 0$ , then the agent is risk-averse with respect to the gamble associated with continuing and waiting for new wage draws. If $\theta > 0$ then the agent is risk-loving with respect to such gambles. For $\theta \approx 0$ , the agent is close to risk-neutral.

Figure 8.2 shows how the continuation value, value function and optimal decision vary with $\theta$ . Apart from $\theta$ , parameters are identical to those in Listing 4. Indeed, for $\theta$ close to zero, as in the middle sub-figure of Figure 8.2, we see that the value function and reservation wage are almost identical to those from the risk-neutral model in Figure 3.5.

Figure 8.2:Job search with risk-sensitive preferences

As expected, a negative value of $\theta$ tends to reduce the continuation value and hence the reservation wage, since the agent’s dislike of risk encourages early acceptance of an offer. For positive values of $\theta$ the reverse is true, as seen in the bottom sub-figure.
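The comparison across values of $\theta$ can be reproduced on a small scale. The following Julia sketch iterates on the risk-sensitive Bellman operator and returns the continuation value for three values of $\theta$ . The four-point wage grid, transition matrix, and parameters are illustrative assumptions and differ from the calibration behind Figure 8.2.

```julia
# Sketch: value function iteration for risk-sensitive job search.
# The wage grid, transition matrix, and parameters are illustrative assumptions.
w_vals = [0.5, 1.0, 1.5, 2.0]
P = [0.6 0.3 0.1 0.0;
     0.2 0.5 0.2 0.1;
     0.1 0.2 0.5 0.2;
     0.0 0.1 0.3 0.6]

"Entropic certainty equivalent (R_θ v)(w) = (1/θ) ln Σ_w′ exp(θ v(w′)) P(w, w′)."
R_θ(θ, v, P) = log.(P * exp.(θ .* v)) ./ θ

"Bellman operator for the risk-sensitive job search problem."
T(v, θ; β=0.9, c=0.5) = max.(w_vals ./ (1 - β), c .+ β .* R_θ(θ, v, P))

"Iterate T from the stopping value and return the continuation value."
function continuation_value(θ; β=0.9, c=0.5, iters=1_000)
    v = w_vals ./ (1 - β)
    for _ in 1:iters
        v = T(v, θ; β, c)
    end
    return c .+ β .* R_θ(θ, v, P)
end

h_averse  = continuation_value(-0.5)    # θ < 0: risk-averse
h_neutral = continuation_value(-1e-6)   # θ ≈ 0: close to risk-neutral
h_loving  = continuation_value(0.5)     # θ > 0: risk-loving
```

Since the entropic certainty equivalent is increasing in $\theta$ , the continuation values should be ordered pointwise as `h_averse ≤ h_neutral ≤ h_loving`.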

8.3.2 Adversarial Agents

Some problems in economics, finance, and artificial intelligence assume that decisions emerge from a dynamic two-person zero-sum game in which the two agents’ preferences are perfectly misaligned. This can lead to a dynamic program where the Bellman equation takes the form

v(x) = \max_{a \in \Gamma(x)} \inf_{d \in D(x, a)} B(x, a, d, v) \qquad (x \in \Xsf, \; v \in \RR^\Xsf),

(8.43)

where $B(x, a, d, v)$ represents lifetime value for the decision maker conditional on her current action $a$ and her adversary’s action $d$ . The decision maker chooses action $a \in \Gamma(x)$ with the knowledge that the opponent will then choose $d \in D(x, a)$ to minimize her lifetime value.

8.3.2.1 Optimality

To establish optimality properties in the setting of (8.43), we introduce the following assumptions:

(a) If $v, w \in \RR^\Xsf$ with $v \leq w$ , then

B(x, a, d, v) \leq B(x, a, d, w) \quad \text{for all } x \in \Xsf, \; a \in \Gamma(x), \; d \in D(x, a).

(b) There exists a $v_1 \in \RR^\Xsf$ and $\epsilon > 0$ such that

v_1(x) + \epsilon \leq B(x, a, d, v_1) \quad \text{for all } x \in \Xsf, \; a \in \Gamma(x), \; d \in D(x, a).

(c) There exists a $v_2 \in \RR^\Xsf$ such that $v_1 \leq v_2$ and

B(x, a, d, v_2) \leq v_2(x) \quad \text{for all } x \in \Xsf, \; a \in \Gamma(x), \; d \in D(x, a).

(d) If $\lambda \in [0,1]$ and $v, w \in \RR^\Xsf$ , then

B(x, a, d, \lambda v + (1-\lambda) w) \geq \lambda B(x, a, d, v ) + (1-\lambda) B(x, a, d, w)

for all $x \in \Xsf$ , $a \in \Gamma(x)$ and $d \in D(x, a)$ .

Condition (a) is a natural monotonicity condition: A uniform increase in continuation values increases current value at all states and actions. Conditions (b) and (c) provide lower and upper bounds, respectively. Condition (d) is a concavity condition.

To analyze the decision maker’s problem, we set $V \coloneq [v_1, v_2]$ and

\hat B(x, a, v) \coloneq \inf_{d \in D(x, a)} B(x, a, d, v) \qquad ((x, a) \in \Gsf, \; v \in V).

We consider $\rR = (\Gamma, V, \hat B)$ .

An immediate corollary of Proposition 8.3.2 is that, under the stated conditions, the decision maker’s problem is a globally stable RDP, and hence the fundamental optimality properties in Theorem 8.1.1 all hold.

In the proof of Proposition 8.3.2, we use the following exercise.

Proof

Proof of Proposition 8.3.2.

First, we need to check that $\rR$ is an RDP. In view of (a) we have $\hat B(x, a, v) \leq \hat B(x, a, w)$ whenever $(x, a) \in \Gsf$ and $v, w \in V$ and $v \leq w$ . Also, by (b) and (c),

v_1(x) < \hat B(x, a, v_1) \; \text{ and } \hat B(x, a, v_2) \leq v_2(x) \quad \text{for all } (x, a) \in \Gsf.

(8.44)

As a result, $v_1(x) \leq \hat B(x, \sigma(x), v) \leq v_2(x)$ for all $x \in \Xsf$ and $v \in V$ . Together, these facts imply the monotonicity and consistency conditions required of an RDP.

In view of (8.44) and Exercise 8.2.9, to establish that $\rR$ is concave, we need only show that, for fixed $\lambda \in [0,1]$ and $v, w \in V$ ,

\hat B(x, a, \lambda v + (1-\lambda) w) \geq \lambda \hat B(x, a, v ) + (1-\lambda) \hat B(x, a, w) ,

(8.45)

for all $(x, a) \in \Gsf$ . This holds because, given $(x, a) \in \Gsf$ , $\lambda \in [0,1]$ and $v, w \in V$ ,

\begin{aligned} \hat B(x, a, \lambda v + (1-\lambda) w) & = \inf_{d \in D(x, a)} B(x, a, d, \lambda v + (1-\lambda) w) \\ & \geq \inf_{d \in D(x, a)} [\lambda B(x, a, d, v ) + (1-\lambda) B(x, a, d, w)] \\ & \geq \lambda \inf_{d \in D(x, a)} B(x, a, d, v ) + (1-\lambda) \inf_{d \in D(x, a)} B(x, a, d, w), \end{aligned}

where the first inequality is by condition (d) and the second is by Exercise 8.3.3. This proves (8.45), so $\rR$ is a concave RDP. ◻

8.3.2.2 A Perturbed MDP Problem

In this section, we provide a relatively abstract application of Proposition 8.3.2. Later, in Section 8.3.3, we will see more concrete applications.

The setting we consider is a modified MDP where the adversarial agent’s actions affect the reward function and transition kernel. This leads to a Bellman equation of the form

v(x) = \max_{a \in \Gamma(x)} \inf_{d \in D(x, a)} \left\{ r(x, a, d) + \beta \sum_{x'} v(x')P(x, a, d, x') \right\} \qquad (x \in \Xsf).

(8.46)

The perturbation $d \in D(x, a)$ is chosen by the adversary. The object $P$ is a stochastic kernel, in the sense that $P(x, a, d, \cdot)$ is a distribution over $\Xsf$ for each feasible $(x, a, d)$ . We assume that $\Gamma$ is a nonempty correspondence from $\Xsf$ to $\Asf$ and $D(x, a)$ is nonempty for all $(x, a) \in \Gsf$ . Let

\hat B(x, a, v) = \inf_{d \in D(x, a)} \left\{ r(x, a, d) + \beta \sum_{x'} v(x') P(x, a, d, x') \right\} \qquad ((x, a) \in \Gsf).

To construct the value space $V$ , we fix $\epsilon > 0$ , let $r_1 = \min r$ and $r_2 = \max r$ , and set

V = [v_1, v_2] \quad \text{where} \quad v_1 \coloneq \frac{r_1 - \epsilon}{1-\beta} \quad \text{and} \quad v_2 \coloneq \frac{r_2}{1-\beta}.

(8.47)

(These constant functions are similar to $v_1, v_2$ in (8.42).)

Solution to Exercise 8.3.4

Regarding (b), note that $v_1$ is constant. Hence, at fixed $(x, a) \in \Gsf$ and $d \in D(x, a)$ , we have

B(x, a, d, v_1) = r(x, a, d) + \beta \frac{r_1 - \epsilon}{1-\beta} \geq r_1 + \beta \frac{r_1 - \epsilon}{1-\beta} = \frac{r_1 - \beta r_1 + \beta r_1 - \beta \epsilon}{1-\beta} = v_1 + \epsilon.

Hence (b) is confirmed. Regarding (c), we have

B(x, a, d, v_2) = r(x, a, d) + \beta \frac{r_2}{1-\beta} \leq r_2 + \beta \frac{r_2}{1-\beta} = v_2.

We have now verified conditions (b)–(c).

An immediate corollary of Lemma 8.3.3 is that $\rR$ is globally stable (via Proposition 8.2.5) and all optimality results in Theorem 8.1.1 apply.
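To make the perturbed MDP concrete, the following Julia sketch iterates on the Bellman operator in (8.46) when the perturbation set is finite and the same at every $(x, a)$ . The primitives are randomly generated placeholders, and the names `B_hat`, `T` and `iterate_map` are chosen here for illustration.

```julia
# Sketch: value function iteration for the perturbed MDP in (8.46) with a
# finite perturbation set. All primitives are randomly generated placeholders.
using Random
Random.seed!(1234)

n, m, k = 4, 3, 2                       # numbers of states, actions, perturbations
β, ϵ = 0.9, 0.1
r = rand(n, m, k)                       # r[x, a, d]
P = rand(n, m, k, n)                    # P[x, a, d, x′] before normalization
P ./= sum(P, dims=4)                    # each P[x, a, d, :] is a distribution

"Aggregator: the adversary picks d to minimize current value."
B_hat(x, a, v) = minimum(r[x, a, d] + β * sum(v[y] * P[x, a, d, y] for y in 1:n)
                         for d in 1:k)

"Bellman operator (Tv)(x) = max_a B_hat(x, a, v)."
T(v) = [maximum(B_hat(x, a, v) for a in 1:m) for x in 1:n]

"Apply the map T to v a fixed number of times."
function iterate_map(T, v; iters=1_000)
    for _ in 1:iters
        v = T(v)
    end
    return v
end

r1, r2 = minimum(r), maximum(r)
v1 = fill((r1 - ϵ) / (1 - β), n)        # lower bound of V, as in (8.47)
v2 = fill(r2 / (1 - β), n)              # upper bound of V, as in (8.47)

v_star = iterate_map(T, v2)             # approximate fixed point of T
```

Starting the iteration from $v_2$ keeps every iterate inside $V = [v_1, v_2]$ , and the limit approximately satisfies the Bellman equation (8.46).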

8.3.3 Ambiguity and Robustness

Until now we have considered agents facing decision problems where outcomes are uncertain but probabilities are known. For example, while the job seeker introduced in Chapter 1 does not know the next period wage offer when choosing her current action, she does know the distribution of that offer. She uses this distribution to determine an optimal course of action. Similarly, the controllers in our discussion of optimal stopping and MDPs used their knowledge of the Markov transition law to determine an optimal policy.

In many cases, the assumption that the decision maker knows all probability distributions that govern outcomes under different actions is debatable. In this section we study lifetime valuations in settings of Knightian uncertainty (Knight, 1921), which means that outcome distributions are themselves unknown. Some authors refer to Knightian uncertainty as ambiguity.

Below we consider some dynamic problems where decision makers face Knightian uncertainty.

8.3.3.1 Robust Control

First we study the choices of a decision maker who knows her reward function but distrusts her specification of the stochastic kernel $P$ that describes the evolution of the state. This distrust is expressed by assuming that she knows only that $P$ belongs to some class of stochastic kernels from $\Gsf$ to $\Xsf$ . This can lead to aggregators of the form

B(x, a, v) = r(x, a) + \beta \inf_{P \in \pP(x, a)} \left\{ \sum_{x'} v(x')P(x, a, x') \right\},

(8.48)

for $(x, a) \in \Gsf$ . As usual, $r$ maps $\Gsf$ to $\RR$ and $\beta \in (0,1)$ . The decision maker can construct a policy that is robust to her distrust of the stochastic kernel by using this aggregator $B$ . Such aggregators arise in the field of robust control.

Positing that the decision maker knows a nontrivial set of stochastic kernels is a way of modeling Knightian uncertainty, as distinguished from risks that are described by known probability distributions.

Returning to the robust control model with aggregator $B$ in (8.48), we take $V$ as defined in (8.47) and set $\rR = (\Gamma, V, B)$ . The set $\pP$ of stochastic kernels is entirely arbitrary.

We conclude from this discussion that the robust control RDP is globally stable. Hence all of the fundamental optimality properties hold.

8.3.3.2 Robustness and Adversarial Agents

A more general way to implement robustness is via the aggregator

B(x, a, v) = r(x, a) + \beta \inf_{P \in \pP(x, a)} \left\{ \sum_{x'} v(x')P(x, a, x') + d(P(x, a, \cdot), \bar P(x, a, \cdot)) \right\}.

(8.50)

In this setup, $\pP(x, a)$ is often large, weakening the constraint on $P$ . At the same time, we introduce the penalty term $d(P(x, a, \cdot), \bar P(x, a, \cdot))$ , which can be understood as recording the deviation between a given kernel $P$ and some baseline specification $\bar P$ .

One interpretation of this setting is that the decision maker begins with a baseline specification of dynamics but lacks confidence in its accuracy. In her desire to choose a robust policy, she imagines herself playing against an adversarial agent. Her adversary can choose transition kernels that deviate from the baseline, but the presence of the penalty term means that extreme deviations are curbed.

If we define

\hat r(x, a, P) = r(x, a) + \beta \, d(P(x, a, \cdot), \bar P(x, a, \cdot)),

then (8.50) can be expressed as

B(x, a, v) = \inf_{P \in \pP(x, a)} \left\{ \hat r(x, a, P) + \beta \sum_{x'} v(x')P(x, a, x') \right\}.

This is a special case of (8.49), so the same optimality theory applies.

8.3.3.3 Connection to Risk-Sensitive Preferences

One measure of discrepancy between two probability distributions is the Kullback–Leibler divergence (KL divergence)

d_{KL}(q \given p) \coloneq \sum_x q(x) \ln \left( \frac{q(x)}{p(x)} \right) \quad \text{for } q, p \in \dD(\Xsf).

It is assumed here that $q \prec_{\rm{ac}} p$ , which means that $q(x)=0$ whenever $p(x)=0$ . We note for future reference that $d_{KL}$ obeys the duality formula for variational inference, which states that, given $h \in \RR^\Xsf$ ,

\ln \sum_x \exp(h(x)) p(x) = \sup_{q \prec_{\rm{ac}} p} \left\{ \sum_x h(x) q(x) - d_{KL}(q \given p) \right\}.

(8.51)

(See, e.g., Dupuis & Ellis (2011), Proposition 1.4.2.)

In robust control, KL divergence can be used to measure deviation between the baseline specification and alternative specifications. It turns out that, under this measure of divergence, there is a tight relationship between robust control and risk-sensitive preferences.

To illustrate this relationship, we fix $\theta < 0$ and set $d_\theta \coloneq -(1/\theta) d_{KL}$ , so that $d_\theta$ is a simple positive rescaling of the Kullback–Leibler divergence. Using $d_\theta$ in (8.50) leads to

B(x, a, v) = r(x, a) + \beta \inf_{P \in \pP(x, a)} \left\{ \sum_{x'} v(x')P(x, a, x') + d_\theta (P(x, a, \cdot) \given \bar P(x, a, \cdot)) \right\}.

The constraint set $\pP(x,a)$ is all $P \in \mopx$ such that $P(x, a, \cdot) \prec_{\rm{ac}} \bar P(x, a, \cdot)$ .

If we multiply both sides of the variational formula (8.51) by $(1/\theta)$ and set $h = \theta v$ , then, since $\theta < 0$ turns the supremum into an infimum, we get

\frac{1}{\theta} \ln \sum_x \exp(\theta v(x)) p(x) = \inf_{q \prec_{\rm{ac}} p} \left\{ \sum_x v(x) q(x) - \frac{1}{\theta} d_{KL}(q \given p) \right\}.

This allows us to rewrite $B$ as

B(x, a, v) = r(x, a) + \beta \frac{1}{\theta} \ln \left\{ \sum_{x'} \exp(\theta v(x')) \bar P(x, a, x') \right\}.

Hence, for this choice of deviation, the robust control aggregator (8.50) reduces to the risk-sensitive aggregator (see Example 8.1.6) under the baseline transition kernel.
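The identity behind this reduction can be verified numerically. The Julia sketch below compares the entropic certainty equivalent with the KL-penalized objective, evaluated at the exponentially tilted distribution $q^* \propto p \exp(\theta v)$ , which is the standard minimizer for this problem. The vectors `v`, `p` and `q_alt` are arbitrary examples.

```julia
# Numerical check of the identity linking the entropic certainty equivalent to
# a KL-penalized minimization. The vectors v, p and q_alt are arbitrary examples.
θ = -0.5
v = [1.0, 2.0, 4.0]
p = [0.2, 0.5, 0.3]

kl(q, p) = sum(q .* log.(q ./ p))                 # d_KL(q | p), assumes q ≫ 0
objective(q) = sum(v .* q) - (1 / θ) * kl(q, p)   # penalized expected value

lhs = (1 / θ) * log(sum(exp.(θ .* v) .* p))       # risk-sensitive side

# Candidate minimizer: the exponentially tilted distribution q* ∝ p exp(θ v)
q_star = p .* exp.(θ .* v)
q_star ./= sum(q_star)

# Another feasible distribution, which should do no better than q*
q_alt = [0.3, 0.3, 0.4]
```

The value `objective(q_star)` should agree with `lhs` up to rounding, while `objective(p)` and `objective(q_alt)` should both be at least as large.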

8.3.4 Smooth Ambiguity

Ju & Miao (2012) propose and study a recursive smooth ambiguity model in the context of asset pricing. A generic discrete formulation of their optimization problem can be expressed in terms of the aggregator

B(x, a, v) = \left\{ r(x, a) + \beta \left\{ \int \left[ \sum_{x'} v(x')^{\gamma} P_\theta (x, a, x') \right]^{\kappa/\gamma} \mu(x, \diff \theta) \right\}^{\alpha/\kappa} \right\}^{1/\alpha},

(8.52)

where $\alpha, \kappa, \gamma$ are nonzero parameters, $P_\theta$ is a stochastic kernel from $\Gsf$ to $\Xsf$ for each $\theta$ in a finite dimensional parameter space $\Theta$ , and $\mu(x, \cdot)$ is a probability distribution over $\Theta$ for each $x \in \Xsf$ . The distribution $\mu(x, \cdot)$ represents subjective beliefs over the transition rule for the state.

The aggregator $B$ in (8.52) is defined for $x \in \Xsf$ , $a \in \Gamma(x)$ and $v \in I$ , where $I$ is the interior of the positive cone of $\RR^\Xsf$ . To ensure finite real values, we assume $r \gg 0$ .

As with the Epstein–Zin case, $\alpha$ parameterizes the elasticity of intertemporal substitution and $\gamma$ governs risk aversion. The parameter $\kappa$ captures ambiguity aversion. If $\kappa = \gamma$ , the agent is said to be ambiguity neutral.

Returning to (8.52), we focus on the case $\kappa < \gamma < 0 < \alpha < 1$ , which includes the calibration used in Ju & Miao (2012). (Other cases can be handled using similar methods and details are left to the reader.) After constructing a suitable value space, we will show that the resulting RDP is globally stable.

As a first step, set $r_1 \coloneq \min r$ , $r_2 \coloneq \max r$ and fix $\epsilon > 0$ . Consider the constant functions

v_1 \coloneq\left( \frac{r_1}{1-\beta} \right)^{1/\alpha} \quad \text{and} \quad v_2 \coloneq \left( \frac{r_2 +\epsilon}{1-\beta} \right)^{1/\alpha} .

In the remainder of this section on smooth ambiguity, we set $V = [v_1, v_2]$ .
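For intuition, the following Julia sketch evaluates the aggregator (8.52) when the parameter space is finite, so that the integral becomes a weighted sum, and checks that $B(x, a, \cdot)$ maps the endpoints of $V$ back into $V$ . The parameter values, rewards, beliefs `μ` and kernels `P` are illustrative assumptions.

```julia
# Sketch of the smooth ambiguity aggregator (8.52) with a finite parameter
# space Θ, so the integral becomes a weighted sum. All numbers are illustrative.
α, γ, κ = 0.5, -1.0, -2.0                # κ < γ < 0 < α < 1
β, ϵ = 0.9, 0.1
r = [1.0 1.5; 0.8 2.0]                   # r[x, a], strictly positive
μ = [0.4 0.6; 0.5 0.5]                   # μ[x, i]: belief over Θ = {θ_1, θ_2}
P = zeros(2, 2, 2, 2)                    # P[i, x, a, x′] is the kernel P_θi
P[1, :, :, 1] .= 0.7;  P[1, :, :, 2] .= 0.3
P[2, :, :, 1] .= 0.2;  P[2, :, :, 2] .= 0.8

function B(x, a, v)
    # Risk adjustment under each candidate kernel P_θi
    inner(i) = sum(v[y]^γ * P[i, x, a, y] for y in 1:2)^(κ / γ)
    # Ambiguity adjustment: average over beliefs μ(x, ⋅)
    amb = sum(inner(i) * μ[x, i] for i in 1:2)
    return (r[x, a] + β * amb^(α / κ))^(1 / α)
end

r1, r2 = minimum(r), maximum(r)
v1 = fill((r1 / (1 - β))^(1 / α), 2)
v2 = fill(((r2 + ϵ) / (1 - β))^(1 / α), 2)

# B(x, a, ⋅) maps the endpoints of V = [v1, v2] back into V
# (the relative tolerance absorbs rounding where the lower bound binds)
lower_ok = all(B(x, a, v1) ≥ v1[x] * (1 - 1e-12) for x in 1:2, a in 1:2)
upper_ok = all(B(x, a, v2) ≤ v2[x] for x in 1:2, a in 1:2)
```

At a constant function $v$ the two certainty equivalents collapse, so $B(x, a, v) = (r(x, a) + \beta v^\alpha)^{1/\alpha}$ , which is why the bounds can be checked by hand as well.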

Here is our main result for this section. It implies that all optimality and convergence results for $\rR$ are valid (see, in particular, Theorem 8.1.1).

To prove Proposition 8.3.5, we use a transformation, just as we did with the Epstein–Zin case in Section 8.1.4.1. To this end, we introduce the composite parameters

\xi \coloneq \frac{\gamma}{\kappa} \in (0,1) \quad \text{and} \quad \zeta \coloneq \frac{\alpha}{\kappa} < 0.

Then we define

\hat B(x, a, v) = \left\{ r(x, a) + \beta \left\{ \int \left[ \sum_{x'} v(x')^{\xi} P_\theta (x, a, x') \right]^{1/\xi} \mu(x, \diff \theta) \right\}^{\zeta} \right\}^{1/\zeta},

(8.54)

and

\hat V = [\hat v_1, \hat v_2] \quad \text{where } \; \hat v_1 \coloneq v_2^{\kappa} \text{ and } \hat v_2 \coloneq v_1^{\kappa}.

Note that $\hat V$ is a nonempty order interval of strictly positive real-valued functions, since $0 < v_1 < v_2$ and $\kappa < 0$ . We set $\hat \rR = (\Gamma, \hat V, \hat B)$ .

The next exercise shows that $\rR$ and $\hat \rR$ are topologically conjugate (see Section 8.1.4).

Proof

Fix $(x, a) \in \Gsf$ . We write $\hat B(x, a, v)$ as

\hat B(x, a, v) = \psi \left( \int f(\theta, v) \mu(x, \diff \theta) \right),

where

f(\theta, v) \coloneq \left[ \sum_{x'} v(x')^{\xi} P_\theta (x, a, x') \right]^{1/\xi} \quad \text{and} \quad \psi(t) \coloneq \left\{ r(x, a) + \beta t^{\zeta} \right\}^{1/\zeta}.

For fixed $\theta$ , the function $v \mapsto f(\theta, v)$ is concave over all $v$ in the interior of the positive cone of $\RR^\Xsf$ by Lemma 7.3.1. Integrating over $\theta$ preserves both concavity and monotonicity in $v$ . The real-valued function $\psi$ satisfies $\psi' > 0$ and $\psi'' < 0$ over $t \in (0,\infty)$ . Since we are composing order-preserving concave functions, it follows that $v \mapsto \hat B(x, a, v)$ is concave on $\hat V$ . ◻
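The claimed shape of $\psi$ can be checked numerically. The following is a minimal sketch, not part of the proof: the values of `r`, `beta`, and `zeta` are hypothetical, chosen only so that the reward is positive and the exponent is negative, and the check uses finite differences on a grid rather than derivatives.

```python
import numpy as np

def psi(t, r, beta, zeta):
    """psi(t) = (r + beta * t**zeta)**(1/zeta), defined for t > 0."""
    return (r + beta * t**zeta) ** (1.0 / zeta)

# Hypothetical parameter values (assumptions for illustration only):
# a positive reward, a discount factor in (0, 1), and a negative exponent.
r, beta, zeta = 1.0, 0.96, -2.0

t = np.linspace(0.1, 10.0, 2001)
y = psi(t, r, beta, zeta)

first_diff = np.diff(y)         # all positive if psi is increasing on the grid
second_diff = np.diff(y, n=2)   # all negative if psi is concave on the grid

increasing = bool(first_diff.min() > 0)
concave = bool(second_diff.max() < 0)
print(increasing, concave)
```

A grid check of this kind only illustrates the property for one parameter choice; it does not replace the calculus argument.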

8.3.5Minimization¶

Until now, all results and applications have concerned maximization of lifetime values. Now is a good time to treat minimization. Throughout this section, $\rR$ is a well-posed RDP. The pointwise minimum $\vmin \coloneq \bigwedge_\sigma v_\sigma$ is called the min-value function generated by $\rR$ . We call a policy $\sigma \in \Sigma$ min-optimal for $\rR$ if $v_\sigma = \vmin$ . A policy $\sigma \in \Sigma$ is called $v$ -min-greedy for $\rR$ if

\sigma(x) \in \argmin_{a \in \Gamma(x)} B(x, a, v) \quad \text{for all } x \in \Xsf.

We say that $\rR$ obeys Bellman’s principle of min-optimality if

\sigma \in \Sigma \text{ is min-optimal for } \rR \quad \iff \quad \sigma \text{ is } \vmin \text{-min-greedy}.

The Bellman min-operator $\tmin$ is defined by

(\tmin v)(x) = \min_{a \in \Gamma(x)} B(x, a, v) \qquad (x \in \Xsf).

We say that $v \in V$ obeys the min-Bellman equation if $\tmin v = v$ . The algorithm defined by replacing “ $v$ -greedy” with “ $v$ -min-greedy” in Algorithm 8.1 (HPI) will be called min-HPI.
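To make these definitions concrete, here is a minimal sketch of the Bellman min-operator and a $v$ -min-greedy policy on a finite state space. The names `t_min`, `min_greedy`, `Gamma`, `cost`, and `B` are hypothetical, as is the three-state example, and the aggregator used is just one simple choice. Ties in the argmin are broken by taking the first minimizer.

```python
import numpy as np

def t_min(v, Gamma, B):
    """Bellman min-operator: (T_min v)(x) = min over a in Gamma[x] of B(x, a, v)."""
    return np.array([min(B(x, a, v) for a in Gamma[x]) for x in range(len(v))])

def min_greedy(v, Gamma, B):
    """Return a v-min-greedy policy sigma, with sigma[x] in Gamma[x] for every x."""
    return np.array([min(Gamma[x], key=lambda a: B(x, a, v)) for x in range(len(v))])

# Hypothetical example: states 0, 1, 2 and actions that select the next state.
Gamma = {0: [1, 2], 1: [2], 2: [2]}
cost = {(0, 1): 1.0, (0, 2): 5.0, (1, 2): 1.0, (2, 2): 0.0}

def B(x, a, v):
    # Illustrative aggregator: current cost plus half of the continuation value.
    return cost[x, a] + 0.5 * v[a]

v0 = np.zeros(3)
v1 = t_min(v0, Gamma, B)          # one application of the min-operator
sigma0 = min_greedy(v0, Gamma, B)  # a v0-min-greedy policy
v2 = t_min(v1, Gamma, B)
print(v1, sigma0, v2)
```

Passing `B` as a function keeps the sketch close to the abstract formulation, at the cost of speed; an array-based version would be preferable for larger problems.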

We can now state the following result, which is analogous to Theorem 8.1.1. In the statement, $\rR$ is a well-posed RDP with min-value function $\vmin$ .

Although we omit the details, a min-OPI convergence result directly analogous to the OPI convergence result in (v) of Theorem 8.1.1 also holds (after replacing maximization-based $v$ -greedy policies with $v$ -min-greedy policies).

Theorem 8.3.7 is proved in Section 9.2.3. For now, we consider two applications that involve minimization.

8.3.5.1Application: Shortest Paths¶

Recall the shortest path problem introduced in Example 8.1.8, where $\Xsf$ is the set of vertices of a graph, $E$ is the set of edges, $c \colon E \to \RR_+$ assigns a travel cost to each edge $(x,x') \in E$ , and $\oO(x)$ is the set of direct successors of $x$ . The aim is to minimize total travel cost to a destination node $d$ . We adopt all assumptions from Exercise 8.1.17 and assume in addition that $c(x,x')=0$ implies $x=d$ . As in Exercise 8.1.17, we let $C(x)$ be the maximum cost of traveling to $d$ from $x$ along any directed path.

We regard the problem as an RDP $\rR = (\oO, V, B)$ with $V = [0, C]$ and

B(x, x', v) = c(x, x') + v(x') \qquad (x \in \Xsf).

(8.55)

In the present setting, the function $v$ in (8.55) is often called the cost-to-go function, with $v(x')$ understood as remaining costs after moving to state $x'$ .
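As an illustration of the aggregator (8.55), the sketch below iterates the associated min-operator on a small graph. The graph, the edge costs, and the names `successors`, `cost`, `t_min`, and `iterate` are hypothetical, and we assume a self-loop at the destination with zero cost. The upper starting point is the maximum cost-to-go for this graph, computed by hand.

```python
import numpy as np

# Hypothetical directed graph with destination d = 3 (assumption for illustration).
d = 3
successors = {0: [1, 2], 1: [2, 3], 2: [3], 3: [3]}
cost = {(0, 1): 1.0, (0, 2): 4.0, (1, 2): 1.0,
        (1, 3): 5.0, (2, 3): 1.0, (3, 3): 0.0}
n = len(successors)

def t_min(v):
    """Apply the min-operator built from B(x, x', v) = c(x, x') + v(x')."""
    return np.array([min(cost[x, y] + v[y] for y in successors[x]) for x in range(n)])

def iterate(v, max_iter=100):
    """Iterate t_min until the cost-to-go function stops changing."""
    for _ in range(max_iter):
        v_new = t_min(v)
        if np.array_equal(v_new, v):
            break
        v = v_new
    return v

v_from_zero = iterate(np.zeros(n))                     # start from the bottom of V
v_from_top = iterate(np.array([6.0, 5.0, 1.0, 0.0]))   # start from C (hand-computed)

# A min-greedy policy with respect to the limit: the next node on a cheapest path.
sigma = [min(successors[x], key=lambda y: cost[x, y] + v_from_zero[y]) for x in range(n)]
print(v_from_zero, v_from_top, sigma)
```

In this example both starting points lead to the same cost-to-go function, and the greedy policy traces out the path 0, 1, 2, 3.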

While the value aggregator $B$ in (8.55) is simple, the absence of discounting (which is standard in the shortest path literature) means that $\rR$ is not contracting. Fortunately, $\rR$ turns out to be concave (in the sense of Section 8.2.3), which allows us to prove

(In the present context, $\vmin$ is also known as the minimum cost-to-go function.)

Proof

We first show that $\rR$ is concave. By the definition of concave RDPs in Section 8.2.3, and given that $B(x, x', v)$ is affine in $v$ (and hence concave), it suffices to prove that there exists a $\delta > 0$ such that

c(x, x') \geq \delta C(x) \text{ for all } x \in \Xsf \text{ and } x' \in \oO(x).

(8.56)

(This corresponds to (8.39) when $v_1 = 0$ and $v_2 = C$ .)

To this end, we set

\delta = \min_{x \neq d} \min_{x' \in \oO(x)} \frac{c(x, x')}{C(x)}.

By the stated cost assumptions, we have $c(x, x') > 0$ when $x \neq d$ and $x' \in \oO(x)$ , whereas $C(x) > 0$ when $x \neq d$ . Since $\Xsf$ is finite, it follows that $\delta$ is finite and positive. Evidently, with this definition, the bound (8.56) holds for all $x \neq d$ . In addition, (8.56) holds trivially when $x=d$ , since $C(d) = 0$ . Hence (8.56) is valid for all $x \in \Xsf$ .

Concavity of $\rR$ implies global stability by Proposition 8.2.5. The remaining claims now follow from Theorem 8.3.7. ◻
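The constant $\delta$ used in the proof can be computed directly on a small example. The sketch below is illustrative only: the graph and costs are hypothetical, the self-loop at the destination is assumed to have zero cost, and the maximum cost-to-go $C$ is obtained by iterating a max version of the aggregator, which terminates here because the graph has no cycles other than the self-loop at $d$ .

```python
import numpy as np

# Hypothetical directed graph with destination d = 3 (assumption for illustration).
d = 3
successors = {0: [1, 2], 1: [2, 3], 2: [3], 3: [3]}
cost = {(0, 1): 1.0, (0, 2): 4.0, (1, 2): 1.0,
        (1, 3): 5.0, (2, 3): 1.0, (3, 3): 0.0}
n = len(successors)

# Maximum cost of traveling to d along any directed path, via n rounds of
# C(x) <- max over successors x' of c(x, x') + C(x').
C = np.zeros(n)
for _ in range(n):
    C = np.array([max(cost[x, y] + C[y] for y in successors[x]) for x in range(n)])

# delta = min over x != d and successors x' of c(x, x') / C(x).
delta = min(cost[x, y] / C[x] for x in range(n) if x != d for y in successors[x])

# Check the bound c(x, x') >= delta * C(x) at every edge, including x = d.
bound_holds = all(cost[x, y] >= delta * C[x] - 1e-12
                  for x in range(n) for y in successors[x])
print(C, delta, bound_holds)
```

Here the minimum is attained at the edge from node 0 to node 1, where the edge cost is small relative to the worst-case cost of reaching the destination.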

8.3.5.2Application: Negative Discount Rate Optimality¶

When discussing MDPs we used $\beta$ to represent the discount factor. Given $\beta$ , the discount rate or rate of time preference is the value $\rho$ that solves $\beta = 1/(1+\rho)$ . The standard MDP assumption $\beta < 1$ implies this rate is positive. You will recall from Chapter 5 that the condition $\beta < 1$ is central to the general theory of MDPs, since it yields global stability of the Bellman and policy operators on $\RR^\Xsf$ (via the Neumann series lemma or Banach’s fixed-point theorem).

In the previous section, on shortest paths, we studied an RDP with a zero discount rate. Now we go one step further and consider problems with negative rates of time preference. Such preferences are commonly inferred when people face unpleasant tasks. Subjects of studies often prefer getting such tasks “over and done with” rather than postponing them. (Negative discount rates are inferred in other settings as well. Section 8.4 provides background and references.)

In this section, we model optimal choice under a negative discount rate. Taking our cue from the preceding discussion, we consider a scenario where a task generates disutility but has to be completed. In particular, we assume that

B(x, x', v) = c(x, x') + \beta v(x') \qquad (x, x' \in \Xsf),

(8.57)

where $\Xsf$ is a finite set and $\beta > 1$ is a constant, so that the implied discount rate $\rho = 1/\beta - 1$ is negative. The function $c$ gives the cost of transitioning from $x$ to the new state $x'$ .

The value aggregator $B$ in (8.57) is the same as the shortest path aggregator (8.55), except for the constant $\beta$ . To keep the discussion simple, we adopt all other assumptions from the shortest path discussion in Section 8.3.5.1.

Solution to Exercise 8.3.10

A feasible policy $\sigma$ is a map from $\Xsf$ to itself satisfying $\sigma(x) \in \oO(x)$ for all $x$ . Recalling that $\Xsf$ is finite and setting $n = |\Xsf|$ , the stated assumptions imply that $\sigma^k(x) = d$ for all $k \geq n$ (since all paths lead to $d$ in at most $n$ steps). Given that $c(d,d)=0$ , it follows that the lifetime cost of following $\sigma$ from initial condition $x$ is no more than

c(x, \sigma(x)) + \beta c(\sigma(x), \sigma^2(x)) + \beta^2 c(\sigma^2(x), \sigma^3(x)) + \cdots + \beta^{n-1} c(\sigma^{n-1}(x), \sigma^n(x)).

With $c_\top \coloneq \max c$ , we then have

C(x) \leq c_\top \frac{1 - \beta^n}{1 - \beta}.

We now have an RDP $\rR = (\Gamma, V, B)$ with $\Gamma = \oO$ , $B$ as in (8.57) and $V = [0, C]$ . The policy operators map $V$ into itself because, for $v \in V$ , we clearly have $0 \leq T_\sigma \, v$ and, in addition,

(T_\sigma \, v)(x) = c(x, \sigma(x)) + \beta v(\sigma(x)) \leq c(x, \sigma(x)) + \beta C(\sigma(x)) \leq C(x).

The last bound holds because $C(x)$ is, by definition, at least as large as the cost of traveling from $x$ to $\sigma(x)$ and then following the most expensive path from $\sigma(x)$ to $d$ .
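The argument above can be checked on a small example. The sketch below rests on assumptions that are not in the text: the graph, the costs, and the value $\beta = 1.1$ are hypothetical, and $C(x)$ is taken to be the largest discounted cost of traveling from $x$ to $d$ along any directed path. It verifies the geometric bound on $C$ and that every policy operator maps $V = [0, C]$ into itself, using the fact that each $T_\sigma$ is order preserving, so it suffices to test the two endpoints of $V$ .

```python
import itertools
import numpy as np

beta = 1.1  # hypothetical discount factor above one (negative discount rate)
d = 3
successors = {0: [1, 2], 1: [2, 3], 2: [3], 3: [3]}
cost = {(0, 1): 1.0, (0, 2): 4.0, (1, 2): 1.0,
        (1, 3): 5.0, (2, 3): 1.0, (3, 3): 0.0}
n = len(successors)

# Assumed definition: C(x) = max over successors x' of c(x, x') + beta * C(x').
C = np.zeros(n)
for _ in range(n):
    C = np.array([max(cost[x, y] + beta * C[y] for y in successors[x])
                  for x in range(n)])

# Geometric bound: C(x) <= c_top * (1 - beta**n) / (1 - beta).
c_top = max(cost.values())
bound = c_top * (1 - beta**n) / (1 - beta)

def T_sigma(sigma, v):
    """Policy operator: (T_sigma v)(x) = c(x, sigma(x)) + beta * v(sigma(x))."""
    return np.array([cost[x, sigma[x]] + beta * v[sigma[x]] for x in range(n)])

# Enumerate all feasible policies and test both endpoints of V = [0, C].
policies = list(itertools.product(*(successors[x] for x in range(n))))
maps_into_V = all(
    np.all(T_sigma(s, np.zeros(n)) >= 0) and np.all(T_sigma(s, C) <= C + 1e-12)
    for s in policies
)
print(C, bound, maps_into_V)
```

Enumerating policies is feasible only for tiny graphs; it is used here purely to confirm the self-map property exhaustively.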

8.4Chapter Notes¶

The RDP framework adopted in this chapter is inspired by Bertsekas (2022), who in turn credits Mitten (1964) with the first research paper to frame Richard Bellman’s dynamic programming problems in an abstract setting. Denardo (1967) describes key ideas including what we call contracting RDPs (see Section 8.2.1). Denardo (1967) credits Shapley (1953) for inspiring his contraction-based arguments.

The key optimality results from this chapter are new, although closely related results appear in Bertsekas (2022). See, in addition, Bloise et al. (2024), which builds on Bertsekas (2022) and Ren & Stachurski (2021).

The job search application with quantile preferences in Section 8.2.1.4 is based on Castro et al. (2022). The same reference includes a general theory of dynamic programming when certainty equivalents are computed using quantile operators and aggregation is time additive.

The optimal default application in Section 8.2.1.5 is loosely based on Arellano (2008). Influential contributions to this line of work include Yue (2010), Chatterjee & Eyigungor (2012), Arellano & Ramanarayanan (2012), Cruces & Trebesch (2013), Ghosh et al. (2013), Gennaioli et al. (2014), and Bocola et al. (2019).

At the start of the chapter we motivated RDPs by mentioning that equilibria in some models of production and economic geography can be computed using dynamic programming. Examples include Hsu (2012), Hsu et al. (2014), Antràs & Gortari (2020), Kikuchi et al. (2021) and Tyazhelnikov (2022).

Early references for dynamic programming with risk-sensitive preferences include Jacobson (1973), Whittle (1981), and Hansen & Sargent (1995). Elegant modern treatments can be found in Asienkiewicz & Jaśkiewicz (2017) and Bäuerle & Jaśkiewicz (2024), and an extension to general static risk measures is available in Bäuerle & Glauner (2022). Risk-sensitivity is applied to the study of optimal growth in Bäuerle & Jaśkiewicz (2018), and to optimal dividend payouts in Bäuerle & Jaśkiewicz (2017). Risk-sensitivity is also used in applications of reinforcement learning, where the underlying state process is not known. See, for example, Shen et al. (2014), Majumdar et al. (2017) or Gao et al. (2021).

Dynamic programming problems that acknowledge model uncertainty by including adversarial agents to promote robust decision rules can be found in Cagetti et al. (2002), Hansen & Sargent (2011), and other related papers. Al-Najjar & Shmaya (2019) study the connection between Epstein–Zin utility and parameter uncertainty. Ruszczyński (2010) considers risk averse dynamic programming and time consistency.

The smooth ambiguity model in Section 8.3.4 is loosely adapted from Klibanoff et al. (2009) and Ju & Miao (2012). For applications of optimization under smooth ambiguity, see, for example, Guan & Wang (2020) or Yu et al. (2023). Zhao (2020) studies yield curves in a setting where ambiguity-averse agents face varying amounts of Knightian uncertainty over the short and long run.

Readers who wish to see some motivation for the discussion of negative discounting in Section 8.3.5.2 can consult Loewenstein & Sicherman (1991), who found that the majority of workers they surveyed reported a preference for increasing wage profiles over decreasing ones that yield the same undiscounted sum, even when it was pointed out that the latter could be used to construct a dominating consumption sequence. Loewenstein & Prelec (1991) obtained similar results. In summarizing their study, they argue that, in the context of the choice problems that they examined, “sequences of outcomes that decline in value are greatly disliked, indicating a negative rate of time preference” (Loewenstein & Prelec, 1991).

In Section 8.3.5.2 we considered dynamic programs with negative discount rates. A more general treatment of such problems can be found in Kikuchi et al. (2021), which also shows how negative discount rate dynamic programs connect to static problems concerning equilibria in production networks and draws connections with Coase’s theory of the firm.

An algorithm that we neglected to discuss is stochastic gradient descent (or ascent) in policy space. Typically policies are parameterized via an approximation architecture that consists of basis functions, activation functions, and compositions of them (e.g., a neural network). In large models, such approximation is used even when the state and action spaces are finite, simply because the curse of dimensionality makes exact representations infeasible. For recent discussions of gradient descent in policy spaces see Nota & Thomas (2019), Mei et al. (2020), and Bhandari & Russo (2022).

Footnotes¶

For $\Hmax$ to be well-defined, we must always select the same $v$ -greedy policy when the operator is applied to $v$ . To this end, we enumerate the policy set $\Sigma$ and choose the first $v$ -greedy policy. This choice of convention has no effect on convergence results.

References¶

Denardo, E. V. (1967). Contraction mappings in the theory underlying dynamic programming. SIAM Review, 9(2), 165–177.
Bertsekas, D. (2022). Abstract Dynamic Programming (3rd ed.). Athena Scientific.
Stokey, N., & Lucas, R. (1989). Recursive Methods in Dynamic Economics. Harvard University Press.
Puterman, M. L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Interscience.
Knight, F. H. (1921). Risk, Uncertainty and Profit (Vol. 31). Houghton Mifflin.
Dupuis, P., & Ellis, R. S. (2011). A Weak Convergence Approach to the Theory of Large Deviations. John Wiley & Sons.
Ju, N., & Miao, J. (2012). Ambiguity, learning, and asset returns. Econometrica, 80(2), 559–591.
Mitten, L. (1964). Composition principles for synthesis of optimal multistage processes. Operations Research, 12(4), 610–619.
Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences, 39(10), 1095–1100.
Bloise, G., Le Van, C., & Vailakis, Y. (2024). Do not blame Bellman: It is Koopmans’ fault. Econometrica, 92(1), 111–140.
Ren, G., & Stachurski, J. (2021). Dynamic programming with value convexity. Automatica, 130, 109641.
de Castro, L., Galvao, A., & Nunes, D. (2022). Dynamic Economics with Quantile Preferences [Techreport]. SSRN, 4108230.
Arellano, C. (2008). Default risk and income fluctuations in emerging economies. American Economic Review, 98(3), 690–712.
Yue, V. Z. (2010). Sovereign default and debt renegotiation. Journal of International Economics, 80(2), 176–187.
Chatterjee, S., & Eyigungor, B. (2012). Maturity, indebtedness, and default risk. American Economic Review, 102(6), 2674–2699.
