Recursive Decision Processes - Dynamic Programming Volume II: General States

In this chapter we study what Sargent & Stachurski (2025) call recursive decision processes (RDPs). The following display shows where RDPs fit relative to the other major classes of DP models studied so far in this book:

\text{MDPs } \subset \text{ LDPs } \subset \text{ RDPs } \subset \text{ ADPs}.

The main role of RDPs is to extend LDPs by accommodating nonlinearities in aggregators and calculation of present values. Another difference between this chapter and our discussion of LDPs is that we allow for unbounded rewards and value functions. While this can also be done in an LDP setting, the corresponding analysis turns out to be cleaner when working with RDPs.

Section 7.1 introduces the RDP framework, provides examples, clarifies the relationships between RDPs, LDPs and ADPs, and discusses existence of greedy policies. Section 7.2.1 and Section 7.2.2 present optimality results—first for bounded rewards and then for unbounded rewards handled via weighted contractions—while Section 7.2.3 gives conditions under which the value function is monotone, concave, or uniquely determined. After a digression on certainty equivalents (Section 7.2.4), we extend the optimality theory to MDPs with general certainty equivalents (Section 7.2.5). Section 7.3.1 applies the theory to optimal savings with unbounded utility, and we then study irreversible investment under risk neutrality, risk aversion, and ambiguity aversion.

7.1 Introduction

Section 7.1.1 defines RDPs and provides examples. We then clarify the relationships between RDPs, LDPs and ADPs, and discuss existence of greedy policies in Section 7.1.3.

7.1.1 Definition and Examples

To a first approximation, RDPs are dynamic programs with a Bellman equation of the form

v(x) = \max_{a \in \Gamma(x)} B(x, a, v) \qquad (x \in \Xsf)

(7.1)

for some suitable choice of $B$ . Here $x$ is the state, $a$ is an action, $\Gamma$ is a feasible correspondence and $B$ is an “aggregator,” with interpretation

$B(x, a, v) =$ total lifetime rewards, contingent on current action $a$ , current state $x$ and the use of $v$ to evaluate future states.

In Section 7.1.1.1–Section 7.1.1.5 we improve this definition and then provide examples. As usual, in a topological space setting, “measurable” means “Borel measurable” unless otherwise stated.

7.1.1.1 Definition

Let $\Xsf$ and $\Asf$ be separable metric spaces, referred to henceforth as the state and action spaces respectively. Given these spaces, a recursive decision process (RDP) is a tuple $(\Gamma, V, B)$ containing

a nonempty correspondence $\Gamma$ from $\Xsf$ to $\Asf$ called the feasible correspondence, with an associated set of feasible state-action pairs

\Gsf := \graph \Gamma = \setntn{(x, a) \in \Xsf \times \Asf}{a \in \Gamma(x)}

and an associated set of feasible policies

\Sigma \coloneq \{ \text{all measurable } \sigma \colon \Xsf \to \Asf \text{ satisfying } \sigma(x) \in \Gamma(x) \text{ for all } x \in \Xsf \},

a subset $V$ of $\RR^\Xsf$ called the value space,
a map $B \colon \Gsf \times V \to \RR$ , referred to as an aggregator, satisfying the monotonicity condition

v \leq w \implies B(x, a, v) \leq B(x, a, w)

(7.2)

for every $v, w \in V$ and every $(x, a) \in \Gsf$ , and the consistency condition

\sigma \in \Sigma \text{ and } v \in V \; \implies \; m(x) \coloneq B(x, \sigma(x), v) \text{ is in } V.

(7.3)

Several objects, such as $\Xsf$, $\Asf$ and $\Gamma$, are familiar from our definition of LDPs in Section 6.1.2.1. Analogous to the LDP case, when representing the RDP by the tuple $(\Gamma, V, B)$, we treat $\Xsf$ and $\Asf$ as understood from context.

The value space $V$ is a class of functions that assign values to states. The order on the left side of (7.2) is the usual pointwise partial order for functions. The monotonicity restriction is natural: relative to $v$ , if rewards are at least as high under $w$ in every future state, then the total rewards we can extract under $w$ should be at least as high.

The final condition, in (7.3), is a consistency condition implying that $V$ is large enough to capture the value of following a particular policy.

7.1.1.2 Example: Finite MDPs

In Section 1.2 we introduced the basic MDP model, with finite state space $\Xsf$ , finite action space $\Asf$ , and remaining primitives $(\Gamma, r, \beta, P)$ as given in Section 1.2.1.1. This maps easily to the RDP setting by taking $V = \RR^\Xsf$ , $\Gamma$ as given, and

B(x, a, v) = r(x, a) + \beta \sum_{x'} v(x') P(x, a, x')

for $(x, a) \in \Gsf$ and $v \in \RR^\Xsf$ .

For this model, it is clear that the RDP Bellman equation $v(x) = \max_{a \in \Gamma(x)} B(x, a, v)$ from (7.1) agrees with the original expression we gave in (1.20).
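The mapping from a finite MDP to an RDP can be sketched in a few lines of Python. The two-state, two-action primitives below (`r`, `P`, `beta`) are hypothetical numbers chosen only for illustration, and the helper names `B`, `T` and `iterate` are our own; the sketch assumes every action is feasible in every state.

```python
# Hypothetical two-state, two-action MDP; all numbers are illustrative.
states = [0, 1]
actions = [0, 1]
beta = 0.9

def Gamma(x):
    """Feasible correspondence: every action is feasible in every state."""
    return actions

# r[x][a] is the current reward; P[x][a][y] is the probability of moving to y.
r = [[1.0, 0.5],
     [0.0, 2.0]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]

def B(x, a, v):
    """MDP aggregator: r(x, a) + beta * sum over x' of v(x') P(x, a, x')."""
    return r[x][a] + beta * sum(v[y] * P[x][a][y] for y in states)

def T(v):
    """Bellman operator: (Tv)(x) = max over a in Gamma(x) of B(x, a, v)."""
    return [max(B(x, a, v) for a in Gamma(x)) for x in states]

def iterate(v, tol=1e-10, max_iter=10_000):
    """Apply T repeatedly until the sup-norm change falls below tol."""
    for _ in range(max_iter):
        v_new = T(v)
        if max(abs(s - t) for s, t in zip(v_new, v)) < tol:
            return v_new
        v = v_new
    return v

v_star = iterate([0.0, 0.0])   # approximate solution of the Bellman equation
```

Because $\beta \geq 0$ and each $P(x, a, \cdot)$ is nonnegative, `B` is increasing in `v`, which is the monotonicity condition (7.2) in this finite setting.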

7.1.1.3 Example: The Firm Valuation Problem

Recall the firm decision problem we analyzed in Section 1.1.1.2, where the decision is binary (0 means continue and 1 means sell) and the state $x$ takes values in a set $\Xsf$ and evolves via stochastic kernel $P$ . To map this problem to an RDP we set $\Asf = \{0,1\}$ and $\Gamma(x) = \Asf$ for all $x$ . We take $V \coloneq b\Xsf$ as the value space and set

B(x, a, v) = a s + (1 - a) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right].

(7.4)

The monotonicity condition (7.2) clearly holds. A policy is a $\bB$ -measurable map $\sigma \colon \Xsf \to \{0,1\}$ . Given any such policy and any $v \in b\Xsf$ , the function

m(x) \coloneq \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right]

is in $b\Xsf$ (since $\pi$ is assumed to be bounded), so the consistency condition (7.3) also holds. For this model, the RDP Bellman equation $v(x) = \max_{a \in \Gamma(x)} B(x, a, v)$ becomes

v(x) = \max_{a \in \{0,1\}} \left\{ a s + (1 - a) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right] \right\}

This is equivalent to the original statement of the Bellman equation in Theorem 1.1.1.
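For intuition, here is a minimal discrete sketch of the aggregator (7.4), with the integral replaced by a finite sum over three states. The values of `s`, `pi`, `P` and `beta` are hypothetical and chosen only for illustration; they are not taken from Section 1.1.1.2.

```python
# Hypothetical three-state example; s, pi, P and beta are illustrative values.
states = [0, 1, 2]
beta = 0.95
s = 10.0                      # payoff from selling the firm
pi = [0.0, 0.5, 1.0]          # current profit in each state
P = [[0.8, 0.2, 0.0],         # P[x][y]: probability of moving from x to y
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]]

def B(x, a, v):
    """Aggregator (7.4), with the integral replaced by a finite sum."""
    continuation = pi[x] + beta * sum(v[y] * P[x][y] for y in states)
    return a * s + (1 - a) * continuation

def T(v):
    """Bellman operator: maximize B over the binary action set {0, 1}."""
    return [max(B(x, a, v) for a in (0, 1)) for x in states]

# Successive approximation from the zero function.
v = [0.0, 0.0, 0.0]
for _ in range(1000):
    v = T(v)

# A greedy policy: 0 means continue, 1 means sell (ties resolved toward 0).
sigma = [max((0, 1), key=lambda a: B(x, a, v)) for x in states]
```

Since selling is always feasible, the computed values satisfy `v[x] >= s` in every state.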

7.1.1.4 Firm Valuation with Unbounded Profits

The firm valuation problem can also fit into the RDP framework when profits are unbounded, at least in some cases. For example, suppose that $\Xsf = \RR_+$ , that $\ell$ is a given weight function on $\RR_+$ (see Section A.5.3.5), and that there exist nonnegative constants $\alpha, \eta, \delta$ such that

\pi(x) \leq \eta \ell(x) + \delta \quad \text{and} \quad \int \ell(x') P(x, \diff x') \leq \alpha \ell(x) \qquad (x \in \Xsf).

(7.5)

(These conditions bound the rate at which profits grow.) We again take $B$ as in (7.4) and $\Gamma(x) = \{0,1\}$ for all $x \in \RR_+$ . We set $V$ equal to $b_\ell \Xsf$ , the set of measurable functions $v \in \RR^\Xsf$ with $\| v \|_\ell < \infty$ . (Here $\| \cdot \|_\ell$ denotes the $\ell$ -weighted supremum norm, as in Section A.5.3.5.)

Solution to Exercise 7.1.2

For monotonicity, fix $(x, a) \in \Gsf$ and $v, w \in V$ with $v \leq w$ . Since $P(x, \cdot)$ is a nonnegative measure, we have $\int v(x') P(x, \diff x') \leq \int w(x') P(x, \diff x')$ . Because $a \in \{0,1\}$ , the coefficient $(1 - a)$ is nonnegative, so $B(x, a, v) \leq B(x, a, w)$ .

For consistency, fix $\sigma \in \Sigma$ and $v \in V = b_\ell \Xsf$ . Consider $m(x) \coloneq B(x, \sigma(x), v)$ . Since $\sigma$ is measurable and $\pi$ and $x \mapsto \int v(x') P(x, \diff x')$ are measurable, $m$ is measurable. For $\ell$ -boundedness, we have

|m(x)| \leq |s| + |\pi(x)| + \beta \int |v(x')| P(x, \diff x').

Since $v \in b_\ell \Xsf$ , we have $|v| \leq \|v\|_\ell \ell$ , giving $\int |v(x')| P(x, \diff x') \leq \|v\|_\ell \int \ell(x') P(x, \diff x') \leq \alpha \|v\|_\ell \ell(x)$ by (7.5). Also $|\pi(x)| \leq \eta \ell(x) + \delta$ by (7.5). Hence

|m(x)| \leq (|s| + \delta) + (\eta + \beta \alpha \|v\|_\ell)\, \ell(x),

so $m \in b_\ell \Xsf = V$ and the consistency condition holds.

7.1.1.5 Example: Optimal Savings

Consider the optimal savings problem studied in Section 1.3. The state is $w \in \RR_+$ and the action is $c \in \RR_+$ . The feasible correspondence is $\Gamma(w) = [0, w]$ and $V \coloneq b\RR_+$ is the value space. We set

B(w, c, v) = u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \qquad (v \in V, \; 0 \leq c \leq w).

As in Assumption 1.3.1, we take $u$ to be bounded and continuous. Under these restrictions, the tuple $(\Gamma, V, B)$ is an RDP. The function $B$ is real-valued and the monotonicity condition (7.2) clearly holds. The consistency condition (7.3) holds because, by the definition of $\Gamma$ , a policy is a Borel measurable map $\sigma \colon \RR_+ \to \RR_+$ with $0 \leq \sigma(w) \leq w$ for all $w$ , and given any such policy and any $v \in b\RR_+$ , the function

m(w) \coloneq u(\sigma(w)) + \beta \int v(R(w - \sigma(w)) + y) \phi(\diff y)

is measurable and bounded (since $u$ is bounded and continuous and $v$ is bounded and measurable).

For this model, the RDP Bellman equation $v(x) = \max_{a \in \Gamma(x)} B(x, a, v)$ from (7.1) agrees with the optimal savings Bellman equation in (1.52).

7.1.1.6 Example: Savings with Kreps–Porteus Expectations

Consider a variation of the optimal savings model from Section 7.1.1.5 with Epstein–Zin-type preferences. To simplify the presentation, we set the EIS parameter to $\psi = \infty$ , so that the CES aggregator reduces to addition, while retaining the nonlinear Kreps–Porteus expectation over future values. The Bellman equation becomes

v(w) = \max_{0 \leq c \leq w} \left\{ (1-\beta) u(c) + \beta \left( \int v(R(w-c) + y)^{1-\gamma} \phi(\diff y) \right)^{\frac{1}{1-\gamma}} \right\}

(7.6)

where $\gamma > 0$ with $\gamma \neq 1$ is the coefficient of relative risk aversion.

As before, the state is $w \in \RR_+$ , the action is $c \in \RR_+$ , and $\Gamma(w) = [0, w]$ . To avoid raising zero to a negative power, we assume that $u$ is measurable and that there exist constants $0 < \underline{u} \leq \bar{u} < \infty$ with $\underline{u} \leq u(c) \leq \bar{u}$ for all $c \in \RR_+$ . The aggregator is

B(w, c, v) = (1-\beta) u(c) + \beta \left( \int v(R(w-c) + y)^{1-\gamma} \phi(\diff y) \right)^{\frac{1}{1-\gamma}}.

We set the value space $V$ to be all measurable functions $v \colon \RR_+ \to [\underline{u}, \bar{u}]$ .

Solution to Exercise 7.1.3

For monotonicity, fix $(w, c) \in \Gsf$ and $v, v' \in V$ with $v \leq v'$ . We show that the Kreps–Porteus expectation is increasing: if $\gamma < 1$ , then $v^{1-\gamma} \leq (v')^{1-\gamma}$ pointwise, so the integral increases, and raising to $1/(1-\gamma) > 0$ preserves the inequality. If $\gamma > 1$ , then $v^{1-\gamma} \geq (v')^{1-\gamma}$ (the power reverses order) so the integral decreases, but raising to $1/(1-\gamma) < 0$ reverses order again. In both cases $B(w, c, v) \leq B(w, c, v')$ , confirming (7.2).

For consistency, fix $\sigma \in \Sigma$ and $v \in V$ . Since $v$ takes values in $[\underline{u}, \bar{u}] \subset (0, \infty)$ , the Kreps–Porteus expectation of $v$ also lies in $[\underline{u}, \bar{u}]$ (by the same argument as above, applied with the constant functions $\underline{u}$ and $\bar{u}$ in place of $v$ and $v'$ ). Hence

\underline{u} = (1-\beta) \underline{u} + \beta \underline{u} \leq m(w) \leq (1-\beta) \bar{u} + \beta \bar{u} = \bar{u}

where $m(w) \coloneq B(w, \sigma(w), v)$ . Since $m$ is also measurable, we have $m \in V$ , confirming (7.3).

With $B$ defined as above, the RDP Bellman equation $v(x) = \max_{a \in \Gamma(x)} B(x, a, v)$ from (7.1) agrees with (7.6).
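The two properties used in the argument above (the Kreps–Porteus expectation is increasing in $v$ and stays between the smallest and largest value of $v$) can be checked numerically for a discrete distribution. The sketch below is illustrative only: the function name `kp_expectation` and the two-point distribution are our own assumptions, with a finite sum standing in for the integral against $\phi$.

```python
def kp_expectation(values, probs, gamma):
    """Kreps-Porteus expectation (sum_i p_i v_i^(1-gamma))^(1/(1-gamma))."""
    assert gamma > 0 and gamma != 1
    assert all(v > 0 for v in values)      # avoids zero raised to a negative power
    moment = sum(p * v ** (1 - gamma) for v, p in zip(values, probs))
    return moment ** (1 / (1 - gamma))

# Hypothetical two-point distribution and two value vectors with v_lo <= v_hi.
probs = [0.3, 0.7]
v_lo = [1.0, 2.0]
v_hi = [1.5, 2.0]

results = {}
for gamma in (0.5, 2.0, 5.0):              # one case with gamma < 1, two with gamma > 1
    ce_lo = kp_expectation(v_lo, probs, gamma)
    ce_hi = kp_expectation(v_hi, probs, gamma)
    results[gamma] = (ce_lo, ce_hi)
```

For each `gamma`, the pair in `results` should satisfy `ce_lo <= ce_hi`, and each entry should lie between the minimum and maximum of the corresponding value vector.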

7.1.1.7 Example: MDPs with Modified Rewards

Some authors use an MDP framework where current rewards depend on the next period state, so that the Bellman equation has the form

v(x) = \max_{a \in \Gamma(x)} \sum_{x'} \left\{ r(x, a, x') + \beta v(x') \right\} P(x, a, x') \qquad (x \in \Xsf).

(7.7)

Here $r$ maps $\Gsf \times \Xsf$ to $\RR$ and other primitives are unchanged. We take $V = \RR^\Xsf$ , $\Gamma$ as given, and set

B(x, a, v) = \sum_{x'} \left\{ r(x, a, x') + \beta v(x') \right\} P(x, a, x')

Evidently, for the associated RDP $(\Gamma, V, B)$ , the monotonicity and consistency conditions (7.2) and (7.3) both hold. For this choice of $B$ , the RDP Bellman equation $v(x) = \max_{a \in \Gamma(x)} B(x, a, v)$ agrees with the modified MDP Bellman equation in (7.7).
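By linearity of the sum, the aggregator above coincides with a standard MDP aggregator in which the current reward is replaced by its conditional expectation $\bar r(x, a) \coloneq \sum_{x'} r(x, a, x') P(x, a, x')$. The following sketch checks this numerically; the primitives `r3`, `P` and `beta` are hypothetical, and the names `B_modified`, `r_bar` and `B_standard` are introduced here only for illustration.

```python
states = [0, 1]
actions = [0, 1]
beta = 0.9

# Hypothetical primitives: r3[x][a][y] is the reward when (x, a) leads to y,
# and P[x][a][y] is the corresponding transition probability.
r3 = [[[1.0, 0.0], [0.5, 2.0]],
      [[0.0, 1.0], [3.0, -1.0]]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]

def B_modified(x, a, v):
    """Aggregator from (7.7): sum over x' of {r(x, a, x') + beta v(x')} P(x, a, x')."""
    return sum((r3[x][a][y] + beta * v[y]) * P[x][a][y] for y in states)

def r_bar(x, a):
    """Expected current reward given the state-action pair (x, a)."""
    return sum(r3[x][a][y] * P[x][a][y] for y in states)

def B_standard(x, a, v):
    """Standard MDP aggregator built from the expected reward r_bar."""
    return r_bar(x, a) + beta * sum(v[y] * P[x][a][y] for y in states)

v = [1.0, -2.0]
gaps = [abs(B_modified(x, a, v) - B_standard(x, a, v))
        for x in states for a in actions]
```

Every entry of `gaps` should be zero up to floating point error.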

7.1.1.8 Example: Risk-Sensitive Preferences

In Example 6.1.6 we discussed a risk-sensitive MDP with entropic certainty equivalent. This model can be embedded in the RDP framework by setting $V = \RR^\Xsf$, $\Gamma$ as given, and

B(x, a, v) \coloneq r(x, a) + \frac{\beta}{\theta} \ln \left[ \sum_{x'} \exp(\theta v(x')) P(x,a,x') \right]

(7.8)

The parameter $\theta$ is any nonzero real value.

When $B$ is given by (7.8), the RDP Bellman equation $v(x) = \max_{a \in \Gamma(x)} B(x, a, v)$ from (7.1) becomes

v(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \frac{\beta}{\theta} \ln \left[ \sum_{x'} \exp(\theta v(x')) P(x,a,x') \right] \right\}.
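The entropic term in (7.8) can overflow when $\theta v(x')$ is large, so a numerical implementation usually subtracts a shift before exponentiating. The sketch below is one way to do this in plain Python. The function names and the two-point distribution are hypothetical, and `B` is evaluated at a single fixed pair $(x, a)$ for simplicity.

```python
import math

def entropic_ce(values, probs, theta):
    """(1/theta) * ln(sum_i p_i exp(theta * v_i)), computed with a stabilizing shift."""
    assert theta != 0
    shift = max(theta * v for v in values)
    total = sum(p * math.exp(theta * v - shift) for v, p in zip(values, probs))
    return (shift + math.log(total)) / theta

def B(r_xa, values, probs, beta, theta):
    """Aggregator (7.8) at a fixed (x, a): reward plus beta times the entropic term."""
    return r_xa + beta * entropic_ce(values, probs, theta)

# Hypothetical next-period values and transition probabilities.
values = [1.0, 3.0]
probs = [0.5, 0.5]
mean = sum(p * v for v, p in zip(values, probs))

ce_negative = entropic_ce(values, probs, theta=-1.0)   # below the mean, by Jensen
ce_positive = entropic_ce(values, probs, theta=1.0)    # above the mean, by Jensen
ce_small = entropic_ce(values, probs, theta=1e-4)      # close to the mean
```

By Jensen's inequality, the entropic term lies below the ordinary expectation when $\theta < 0$ and above it when $\theta > 0$, and it approaches the expectation as $\theta \to 0$. In all cases it is increasing in `values`, which is the monotonicity condition (7.2).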

7.1.2 RDPs vs LDPs vs ADPs

As mentioned at the start of the chapter, we have MDPs $\subset$ LDPs $\subset$ RDPs $\subset$ ADPs and the inclusions are all strict. We already know that the first inclusion is strict (consider, for example, the LDP with state-dependent discounting in Section 6.1.4). Here we review the remaining relationships.

7.1.2.1 LDPs are RDPs

Let $(\Gamma, r, K)$ be an LDP with state and action spaces $\Xsf$ , $\Asf$ , as defined in Section 6.1.2.1. Setting $V = b\Xsf$ and

B(x, a, v) = r(x, a) + \int v(x') K(x,a, \diff x') \qquad ((x,a) \in \Gsf, \; v \in V),

(7.9)

the resulting tuple $(\Gamma, V, B)$ is an RDP. To see this, note that $V = b\Xsf$ is a subset of $\RR^\Xsf$ , so we only need to check the monotonicity and consistency conditions for the aggregator $B$ . For monotonicity, fix $(x, a) \in \Gsf$ and $v, w \in V$ with $v \leq w$ . Since $K(x, a, \cdot)$ is a nonnegative measure, we have $\int v(x') K(x, a, \diff x') \leq \int w(x') K(x, a, \diff x')$ and hence $B(x, a, v) \leq B(x, a, w)$ . For consistency, fix $\sigma \in \Sigma$ and $v \in b\Xsf$ . We need to show that $m(x) \coloneq B(x, \sigma(x), v)$ is in $b\Xsf$ . This follows from the LDP conditions, which require $r \in b\Gsf$ and $Kv \in b\Gsf$ whenever $v \in b\Xsf$ .

The risk-sensitive MDP in Section 7.1.1.8 is an RDP but not an LDP, since the aggregator is nonlinear in future values.

7.1.2.2 RDPs are ADPs

Every RDP generates an ADP. To see this, let $(\Gamma, V, B)$ be an RDP with state space $\Xsf$ and action space $\Asf$ . The set $V$ is paired with the pointwise partial order. With $\Sigma$ as the set of feasible policies and given $\sigma$ in $\Sigma$ , we define $T_\sigma$ by

(T_\sigma \, v)(x) = B(x, \sigma(x), v) \qquad (x \in \Xsf, \; v \in V).

The monotonicity and consistency conditions in (7.2)–(7.3) imply that $T_\sigma$ is an order-preserving self-map on $V$ . Hence, with $\TT$ as the set of all policy operators, the pair $(V, \TT)$ is an ADP. We call $(V, \TT)$ the ADP generated by $(\Gamma, V, B)$ .

For ADPs generated by RDPs, we can provide intuitive representations of greedy policies and the Bellman equation. For example, we recall from our ADP definition in (2.1) that a policy $\sigma \in \Sigma$ is $v$ -greedy for ADP $(V, \TT)$ if $T_\tau \, v \leq T_\sigma \, v$ for all $\tau \in \Sigma$ . If $(V, \TT)$ is generated by $(\Gamma, V, B)$ , then this is equivalent to the statement that

B(x, \tau(x), v) \leq B(x, \sigma(x), v) \quad \text{for all } \tau \in \Sigma \text{ and } x \in \Xsf.

(7.10)

Also, we recall that the ADP Bellman operator is defined by $Tv = \bigvee_\sigma T_\sigma \, v$ whenever the supremum exists. When $(V, \TT)$ is generated by $(\Gamma, V, B)$ , this is equivalent to the statement $(Tv)(x) = \sup_{\sigma \in \Sigma} B(x, \sigma(x), v)$ for all $x \in \Xsf$ whenever the pointwise supremum exists (see Exercise A.1.4). Under reasonable conditions on $\Gamma$ and $B$ , we will show that this can be improved to the stronger form

(Tv)(x) = \max_{a \in \Gamma(x)} B(x, a, v) \qquad (x \in \Xsf)

(7.11)

(see Lemma 7.1.1 and Lemma 7.1.2).
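In a finite model, the set $\Sigma$ can be enumerated, so the relationship between the supremum over policies and the maximum over actions can be checked directly. The sketch below uses a hypothetical three-state aggregator with deterministic transitions; the names `Gamma`, `B`, `T_sigma` and `T` mirror the notation of the text but the specific functional forms are assumptions made for illustration.

```python
from itertools import product

states = [0, 1, 2]
actions = [0, 1]
beta = 0.95

def Gamma(x):
    """Hypothetical feasible correspondence: action 1 is unavailable in state 0."""
    return [0] if x == 0 else actions

def B(x, a, v):
    """Hypothetical aggregator: a reward plus the discounted value of the next state."""
    reward = float(x) - 0.5 * a
    x_next = min(x + 1, 2) if a == 1 else max(x - 1, 0)
    return reward + beta * v[x_next]

# Sigma: all maps sigma with sigma(x) in Gamma(x), stored as tuples indexed by state.
Sigma = list(product(*(Gamma(x) for x in states)))

def T_sigma(sigma, v):
    """Policy operator: (T_sigma v)(x) = B(x, sigma(x), v)."""
    return [B(x, sigma[x], v) for x in states]

def T(v):
    """Bellman operator in the form (7.11): pointwise maximum over feasible actions."""
    return [max(B(x, a, v) for a in Gamma(x)) for x in states]

v = [0.0, 1.0, 4.0]
sup_over_policies = [max(T_sigma(sigma, v)[x] for sigma in Sigma) for x in states]
```

Here `sup_over_policies` agrees with `T(v)`, since in the finite case any state-by-state selection of maximizers is itself a feasible policy.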

Solution to Exercise 7.1.5

Let $F v \coloneq \phi \circ v$ . By construction, $F$ maps $V$ onto $\hat V$ . Moreover, $F$ is an order isomorphism: since $\phi$ is an order isomorphism on $\RR$ , we have $v(x) \leq w(x) \iff \phi(v(x)) \leq \phi(w(x))$ for each $x \in \Xsf$ , and hence $v \leq w$ pointwise if and only if $Fv \leq Fw$ pointwise (cf. Exercise A.1.11).

It remains to verify the conjugacy condition. Fix $\sigma \in \Sigma$ and $v \in V$ . Evaluating (7.12) at $a = \sigma(x)$ gives $B(x, \sigma(x), v) = \phi^{-1}[\hat B(x, \sigma(x), \phi \circ v)]$ for all $x$ , which is $T_\sigma \, v = F^{-1} \circ \hat T_\sigma \circ F \, v$ . Hence the generated ADPs are isomorphic.

Since every RDP is an ADP, we can use ADP optimality results to study RDPs. Given an RDP $(\Gamma, V, B)$ and its generated ADP $(V, \TT)$ , we make the obvious connections, saying that

$(\Gamma, V, B)$ is regular if $(V, \TT)$ is regular,
$T$ is the Bellman operator for $(\Gamma, V, B)$ when $T$ is the Bellman operator for $(V, \TT)$ ,
$\sigma$ is optimal for $(\Gamma, V, B)$ when $\sigma$ is optimal for $(V, \TT)$ ,
etc.

7.1.2.3 Not all ADPs are RDPs

Although the RDP framework is broad, there are significant dynamic programs that fall outside this framework.

In the examples above, it is possible to rearrange the problem so that the $\max$ operator is shifted to the outside and, thereby, construct a version that fits the RDP framework. But there are good reasons to avoid this, related to smoothness and dimensionality (see, e.g., Kristensen et al. (2021) or Rust (1994)).

Example 7.1.3

In Section 2.3.5 we investigated linear-quadratic (LQ) problems with Bellman equations such as

v(x) = \min_u \left\{ u^\top R u + x^\top Q x + v(Ax + B u) \right\}.

(7.15)

Here $R, Q, A$ , and $B$ are matrices, while $u$ and $x$ are vectors. This model looks similar to an RDP if we set $\Xsf = \RR^k$ , $\Asf = \RR^m$ , $\Gamma(x) = \Asf$ for all $x \in \Xsf$ and, for the aggregator,

B(x, u, v) = u^\top Ru + x^\top Q x + v(Ax + Bu) .

However, in Section 2.3.5 we took $\Sigma$ to be a set of stable controls, so that $F \in \Sigma$ means that $F$ is a matrix and $\rho(A + BF) < 1$ . Thus, we restrict $\Sigma$ beyond just feasibility of actions, in contrast to the specification of $\Sigma$ within the definition of an RDP. The LQ problem is difficult to handle without such additional restrictions on policies.

7.1.3 Existence of Greedy Policies

As always, existence of greedy policies is important for our analysis. In this section, we investigate RDP environments where greedy policies exist. We begin with finite action spaces and then move to the general case.

7.1.3.1 Finite Actions

We begin with the discrete choice setting, where greedy policies always exist.

Proof

Fix $v \in V$ and enumerate $\Asf = \{a_1, \ldots, a_n\}$ . The claim (i, $\Leftarrow$ ) is obvious: if $\sigma \in \Sigma$ obeys (7.16) then (7.10) clearly holds. Next we prove (ii). Since $\Gamma(x)$ is nonempty and finite for each $x$ , the set $\argmax_{a \in \Gamma(x)} B(x, a, v)$ is nonempty for all $x$ . Define $\sigma(x) = a_{i(x)}$ where $i(x)$ is the smallest index $i$ such that $a_i \in \argmax_{a \in \Gamma(x)} B(x, a, v)$ . Then $\sigma$ satisfies (7.16). Moreover, $\sigma$ is Borel measurable. To see this, for each $k$ , let

S_k \coloneq \setntn{x}{a_k \in \Gamma(x) \text{ and } B(x, a_k, v) \geq B(x, a_j, v) \text{ for all } a_j \in \Gamma(x)}.

Each $S_k$ is Borel, since it is defined by finitely many inequalities involving measurable functions. Moreover, $\sigma^{-1}(a_k) = S_k \setminus (S_1 \cup \cdots \cup S_{k-1})$ , which is also Borel. These claims show that $\sigma \in \Sigma$ and $\sigma$ obeys (7.16). By (i, $\Leftarrow$ ), $\sigma$ is $v$ -greedy. In particular, at least one $v$ -greedy policy exists.

For (iii), since a $v$ -greedy policy exists, Lemma 2.1.1 gives $Tv = T_\sigma \, v$ for any $v$ -greedy $\sigma$ . Combined with (7.16), this yields (7.17).

Now we return to (i, $\Rightarrow$ ). Suppose $\sigma$ is $v$ -greedy. By Lemma 2.1.1, $T_\sigma \, v = Tv$ , and hence, by (iii),

B(x, \sigma(x), v) = (T_\sigma \, v)(x) = (Tv)(x) = \max_{a \in \Gamma(x)} B(x, a, v)

for all $x$ , which is (7.16). ◻

Note that, in the simple setting of Lemma 7.1.1, the Bellman equation takes the form of (7.1). Below we investigate more complex settings where this is still true.
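The tie-breaking rule used in the proof of Lemma 7.1.1 (select the maximizer with the smallest index in a fixed enumeration of $\Asf$) translates directly into code. The following is a minimal sketch for a finite state space; the function `greedy_policy` and the example primitives are hypothetical and serve only to illustrate the construction.

```python
def greedy_policy(B, Gamma, states, actions, v):
    """Return sigma where sigma[x] is the first action, in the order of `actions`,
    that attains the maximum of B(x, a, v) over a in Gamma(x)."""
    sigma = {}
    for x in states:
        # Filtering the enumeration keeps the original index order.
        feasible = [a for a in actions if a in Gamma(x)]
        best = max(B(x, a, v) for a in feasible)
        sigma[x] = next(a for a in feasible if B(x, a, v) == best)
    return sigma

# Hypothetical example in which ties occur in every state.
states = [0, 1, 2]
actions = ["a1", "a2", "a3"]          # the fixed enumeration a_1, a_2, a_3

def Gamma(x):
    """Action a1 is infeasible in state 0; all actions are feasible elsewhere."""
    return {"a2", "a3"} if x == 0 else {"a1", "a2", "a3"}

def B(x, a, v):
    """Actions a1 and a2 give the same value; a3 is strictly worse."""
    return v[x] - (1.0 if a == "a3" else 0.0)

sigma = greedy_policy(B, Gamma, states, actions, v=[0.0, 1.0, 2.0])
# sigma == {0: "a2", 1: "a1", 2: "a1"}
```

In state 0 the rule selects `a2` because `a1` is infeasible there, while in the other states the tie between `a1` and `a2` is resolved in favor of the smaller index.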

7.1.3.2 Continuous Actions

Now we drop the finiteness restriction on $\Asf$ . In these general RDP settings, existence is less trivial. Here we state one useful result.

Lemma 7.1.2

Let $(\Gamma, V, B)$ be an RDP and fix $v \in V$ . If

$\Gamma$ is continuous and compact-valued on $\Xsf$ , and
$(x, a) \mapsto B(x, a, v)$ is continuous on $\Gsf$ ,

then

(i) a policy $\sigma \in \Sigma$ is $v$-greedy if and only if

\sigma(x) \in \argmax_{a \in \Gamma(x)} B(x, a, v) \quad \text{for all } x \in \Xsf

(7.18)

(ii) at least one $v$-greedy policy exists, and
(iii) the Bellman operator is well-defined at $v$, the function $Tv$ is continuous on $\Xsf$, and

(Tv)(x) = \max_{a \in \Gamma(x)} B(x, a, v) \quad \text{for all } x \in \Xsf.

(7.19)

If, in addition, $\sigma(x)$ is the unique maximizer of $B(x, a, v)$ over $\Gamma(x)$ for each $x \in \Xsf$ , then $\sigma$ is continuous on $\Xsf$ .

Proof

Fix $v \in V$ and suppose that the stated conditions hold. We use the following facts, which follow from Theorem A.3.3:

(a) $m(x) \coloneq \max_{a \in \Gamma(x)} B(x, a, v)$ is well-defined and continuous on $\Xsf$, and
(b) there exists a measurable $\sigma \in \Sigma$ satisfying (7.18).

The claim (i, $\Leftarrow$ ) is obvious: if $\sigma \in \Sigma$ obeys (7.18) then (7.10) clearly holds.

Claim (ii) follows from (b) and (i, $\Leftarrow$ ).

For (iii), existence of a $v$ -greedy policy implies that $T$ is well-defined at $v$ . To prove continuity, let $\sigma$ be the $v$ -greedy policy constructed above, which satisfies (7.18). Lemma 2.1.1 gives $Tv = T_\sigma \, v$ , so, by (7.18), $(Tv)(x) = m(x)$ for all $x$ . This yields (7.19). Also, by (a), the function $Tv$ is continuous.

For (i, $\Rightarrow$ ), suppose $\sigma$ is $v$ -greedy. By Lemma 2.1.1, $T_\sigma \, v = Tv$ , and hence, by (iii),

B(x, \sigma(x), v) = (T_\sigma \, v)(x) = (Tv)(x) = \max_{a \in \Gamma(x)} B(x, a, v)

for all $x$, which is (7.18).

The final claim follows directly from Theorem A.3.3. ◻

7.2 Optimality Results

Let’s put together some sufficient conditions for optimality of RDP models. We will focus here on models that are naturally contracting. This permits us to handle dynamic programs with both bounded and unbounded rewards.

We begin in Section 7.2.1 with the bounded case, where the aggregator $B$ is bounded and satisfies a Blackwell-type discounting condition. In Section 7.2.2 we extend to potentially unbounded rewards using weighted contractions. Finally, Section 7.2.3 investigates properties of solutions, giving sufficient conditions for the value function to be monotone, concave, or uniquely determined, and for the optimal policy to be continuous.

7.2.1 Bounded Contractions

RDPs have strong optimality properties when they uniformly contract values. The current section investigates this case. Throughout this section, we assume that values are bounded. This typically occurs when reward functions are bounded. Later, in Section 7.2.2, we will consider unbounded settings.

7.2.1.1 Framework

Let $\Xsf, \Asf$ be separable metric spaces, let $\Gamma$ be a nonempty correspondence from $\Xsf$ to $\Asf$ , and let $\Gsf$ be the feasible state-action pairs (see Section 7.1.1.1). Set $V = b\Xsf$ . Let $B \colon \Gsf \times V \to \RR$ be a given function such that

$(x, a) \mapsto B(x, a, v)$ is measurable on $\Gsf$ for all $v \in V$ , and
$B(x, a, w) \leq B(x, a, v)$ for all $w \leq v$ in $V$ and $(x, a) \in \Gsf$ .

The tuple $(\Gamma, V, B)$ is an RDP. To see this, note that the monotonicity condition (7.2) is given by the second restriction on $B$ above. For the consistency condition (7.3), fix $\sigma \in \Sigma$ and $v \in V$ . The function $x \mapsto B(x, \sigma(x), v)$ is measurable, since $\sigma$ is measurable and $(x,a) \mapsto B(x,a,v)$ is measurable on $\Gsf$ , and bounded, since $B$ is bounded by Assumption 7.2.1. Hence $B(\cdot, \sigma(\cdot), v) \in b\Xsf = V$ .

7.2.1.2 Finite Actions

We seek optimality results for the RDP $(\Gamma, V, B)$ introduced in Section 7.2.1.1. The simplest case is when the choice set is always finite. Also note that, in this setting, a policy $\sigma$ is $v$ -greedy if and only if (7.16) holds.

Proof

We apply Theorem 4.1.3 with $E = b\Xsf$ , $e = \1$ , and $V = V_0 = b\Xsf$ . The function $\1$ is a normalized order unit of $b\Xsf$ . Assumption 7.2.1 provides the Blackwell condition (4.4): evaluating at $a = \sigma(x)$ gives $T_\sigma(v + \kappa \1) \leq T_\sigma \, v + \lambda \kappa \1$ for all $v \in V$ , $\kappa \in \RR_+$ , and $\sigma \in \Sigma$ . Since $V_0 = V$ , it is trivially closed in $V$ .

By Lemma 7.1.1, every $v \in V$ has a $v$ -greedy policy, so the ADP is regular. The optimality and convergence claims follow from Theorem 4.1.3. If $\Xsf$ is also finite, then $\Sigma$ is finite and hence $\TT$ is finite. Each $T_\sigma$ is a contraction and therefore globally stable, so the ADP is order stable by Lemma 3.1.1. Hence HPI converges in finitely many steps by Theorem 2.2.6. ◻
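To illustrate finite-step convergence when both $\Xsf$ and $\Asf$ are finite, here is a minimal sketch of a policy iteration loop of the usual evaluate-then-improve form, which is how we read HPI here. The two-state MDP primitives are hypothetical, policy evaluation is done approximately by iterating $T_\sigma$ rather than by solving a linear system, and the helper names are our own.

```python
# Hypothetical two-state, two-action MDP with all actions feasible in every state.
states, actions, beta = [0, 1], [0, 1], 0.9
r = [[1.0, 0.5],
     [0.0, 2.0]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]

def B(x, a, v):
    """MDP aggregator; it satisfies B(x, a, v + k) = B(x, a, v) + beta * k."""
    return r[x][a] + beta * sum(v[y] * P[x][a][y] for y in states)

def T_sigma(sigma, v):
    """Policy operator for the policy sigma."""
    return [B(x, sigma[x], v) for x in states]

def evaluate(sigma, tol=1e-12, max_iter=100_000):
    """Approximate the fixed point of T_sigma by successive approximation."""
    v = [0.0 for _ in states]
    for _ in range(max_iter):
        v_new = T_sigma(sigma, v)
        if max(abs(s - t) for s, t in zip(v_new, v)) < tol:
            return v_new
        v = v_new
    return v

def greedy(v):
    """Greedy policy; max() returns the first maximizer, so ties go to the smallest index."""
    return [max(actions, key=lambda a: B(x, a, v)) for x in states]

def hpi(sigma, max_steps=100):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    for step in range(1, max_steps + 1):
        v_sigma = evaluate(sigma)
        sigma_new = greedy(v_sigma)
        if sigma_new == sigma:
            return sigma, v_sigma, step
        sigma = sigma_new
    raise RuntimeError("policy did not stabilize within max_steps")

sigma_star, v_star, n_steps = hpi([0, 0])
```

With only four feasible policies in this example, the loop stops after a handful of steps, and the returned `v_star` approximately satisfies the Bellman equation.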

7.2.1.3 The Continuous Case

We now drop the finiteness assumption, while continuing to work with the RDP $(\Gamma, V, B)$ introduced in Section 7.2.1.1. In place of finiteness, we consider two continuity conditions on $B$ .

The main result of this section is as follows.

Proof

We apply Theorem 4.1.3 with $E = b\Xsf$ , $e = \1$ , $V = b\Xsf$ , and $V_0 = bc\Xsf$ . The function $\1$ is a normalized order unit of $b\Xsf$ . As in the proof of Proposition 7.2.1, Assumption 7.2.1 provides the Blackwell condition (4.4). The set $bc\Xsf$ is closed in $b\Xsf$ under the supremum norm.

We verify semi-regularity on $V_0$ . Fix $v \in bc\Xsf$ . By Assumption 7.2.2, the map $(x,a) \mapsto B(x,a,v)$ is continuous on $\Gsf$ . Lemma 7.1.2 then gives that a $v$ -greedy policy exists and that $Tv$ is continuous on $\Xsf$ . Since $B$ is bounded (Assumption 7.2.1), we have $Tv \in bc\Xsf$ . Hence $bc\Xsf \subset V_G$ and $T(bc\Xsf) \subset bc\Xsf$ , confirming semi-regularity. Claims (i)–(iii) now follow from Theorem 4.1.3.

Under Assumption 7.2.3, the same argument applies to every $v \in V$ , so $V_G = V$ and the ADP is regular. Convergence of OPI and HPI then follows from Theorem 4.1.3. ◻

7.2.2 Weighted Contractions

In Section 7.2.1, we considered RDPs that are both contracting and bounded. Some useful RDPs fail to have this boundedness property. Here we extend our results to potentially unbounded problems that still retain contractivity. (After minor modifications, the results obtained in Section 7.2.1 are special cases of the results presented here. We present them separately in order to provide simple sufficient conditions in the bounded case.)

When maximizing, the theory works best for problems where rewards are unbounded above and bounded below. (One approach to the reverse type of unboundedness can be found in Ma et al. (2022).) Because we focus on such problems, we will typically assume that rewards are nonnegative. This costs no generality in such settings, since optimal policies are invariant to additive shifts.

Throughout this section, $\ell$ is a weight function on $\Xsf$ , $\| \cdot \|_\ell$ denotes the $\ell$ -weighted supremum norm, and $b_\ell \Xsf$ is all $f \colon \Xsf \to \RR$ with $f/\ell \in b\Xsf$ . See Section A.5.3.5 for background and discussion of weight functions and the space $b_\ell \Xsf$ .

7.2.2.1 Framework

Let $\Gamma$ be a nonempty correspondence from $\Xsf$ to $\Asf$ and let $\Gsf$ be the feasible state-action pairs (see Section 7.1.1.1). Set $V = b_\ell \Xsf_+$ , so that $V$ is the nonnegative functions in $b_\ell \Xsf$ . Let $B \colon \Gsf \times V \to \RR_+$ be a given function. (In the current setting, where $B$ can be unbounded, we restrict attention to the case where $B$ is nonnegative. Since the weighted contraction approach pursued here works best for rewards that are unbounded above but bounded below, imposing nonnegativity costs very little in the way of generality.)

We suppose that

$(x, a) \mapsto B(x, a, v)$ is measurable on $\Gsf$ for all $v \in V$ , and
$B(x, a, w) \leq B(x, a, v)$ for all $w \leq v$ in $V$ and $(x, a) \in \Gsf$ .

We also require two conditions related to contractivity and $\ell$ -boundedness:

Consider the tuple $(\Gamma, V, B)$ .

7.2.2.2 Finite Actions

We seek optimality results for the RDP $(\Gamma, V, B)$ introduced in Lemma 7.2.3. The simplest case is when the choice set is always finite:

In this case we have the following result:

7.2.2.3 The Continuous Case

We now drop the finiteness assumption, while continuing to work with the RDP $(\Gamma, V, B)$ introduced in Lemma 7.2.3.

Proof

We apply Theorem 4.1.3 with $E = b_\ell \Xsf$ , $e = \ell$ , $V = b_\ell \Xsf_+$ , and $V_0 = b_\ell c \Xsf_+$ . The weight function $\ell$ is a normalized order unit of $b_\ell \Xsf$ , and condition (U1) provides the Blackwell condition (4.4): evaluating at $a = \sigma(x)$ gives $T_\sigma(v + \kappa \ell) \leq T_\sigma \, v + \lambda \kappa \ell$ for all $v \in V$ , $\kappa \in \RR_+$ , and $\sigma \in \Sigma$ .

We verify semi-regularity on $V_0$ . Fix $v \in b_\ell c \Xsf_+$ . By Assumption 7.2.6, the map $(x, a) \mapsto B(x, a, v)$ is continuous on $\Gsf$ . Our restrictions on $\Gamma$ and Lemma 7.1.2 imply that $v$ has at least one greedy policy and that $Tv \in b_\ell c \Xsf_+$ . Hence $V_0 \subset V_G$ and $T V_0 \subset V_0$ . Since $\ell$ is continuous (Assumption 7.2.6), $V_0$ is closed in $V$ (Theorem A.5.24), and claims (i)–(iii) follow from Theorem 4.1.3.

For the last claim, if Assumption 7.2.7 holds, the same argument applies to all $v \in V$ , giving regularity. OPI and HPI convergence then follow from Theorem 4.1.3. ◻

7.2.3 Properties of Solutions

In this section, we seek sufficient conditions for the value and policy functions to have useful shape and continuity properties. We adopt the setting of Proposition 7.2.5 and study the properties of the RDP $(\Gamma, V, B)$ discussed in that result. In the proofs below, we repeatedly use Lemma A.2.6.

7.2.3.1 Monotone Values

First, we seek conditions under which the value function is increasing. In addition to the conditions in Proposition 7.2.5, we suppose that $\Xsf$ is partially ordered by $\preceq$ . Let

ib_\ell c\Xsf_+ \coloneq \text{ the set of increasing functions in } b_\ell c\Xsf_+.

Both conditions in Assumption 7.2.8 are monotonicity conditions. The first is equivalent to stating that $\Gamma$ is order preserving when viewed as a map from $(\Xsf, \preceq)$ to $(\wp(\Asf), \subset)$ . Here $\wp(\Asf)$ is the set of all subsets of $\Asf$ and $\subset$ is the partial order induced by set inclusion (Example A.1.2).

Proof

It suffices to show that $T$ is invariant on $ib_\ell c\Xsf_+$ , since, by Proposition 7.2.5, $T$ is globally stable on $b_\ell c\Xsf_+$ and, in addition, $i b_\ell c\Xsf_+$ is closed in $b_\ell c\Xsf_+$ . To see that this holds, pick any $v \in i b_\ell c\Xsf_+$ and fix $x$ and $x'$ with $x \preceq x'$ . Since $T$ is invariant on $b_\ell c \Xsf_+$ , we need only show that $Tv$ is increasing. But this must be so, since, by Assumption 7.2.8,

\sup_{a \in \Gamma(x)} B(x, a, v) \leq \sup_{a \in \Gamma(x)} B(x', a, v) \leq \sup_{a \in \Gamma(x')} B(x', a, v).

Hence $Tv(x) \leq Tv(x')$ and $T$ is invariant on $ib_\ell c\Xsf_+$ . ◻

7.2.3.2 Concavity

Next we seek sufficient conditions for the value function to be concave. In this section, we assume that both $\Xsf$ and $\Asf$ are convex subsets of a vector space.

The convexity requirement on $\Gsf$ in Assumption 7.2.9 is equivalent to the statement that, for all $x, x'$ in $\Xsf$, all $a \in \Gamma(x)$, all $a' \in \Gamma(x')$ and all $\lambda \in [0, 1]$, we have

\lambda a + (1-\lambda) a' \in \Gamma(\lambda x + (1-\lambda) x').

By taking $x=x'$ , we see that each set $\Gamma(x)$ is convex in $\Asf$ .

Proof

Let $c b_\ell c \Xsf_+$ be the concave functions in $b_\ell c \Xsf_+$ . By a similar argument to the one used in the proof of Proposition 7.2.6, it suffices to show that $T$ is invariant on $c b_\ell c \Xsf_+$ . To this end, fix $v$ in $c b_\ell c \Xsf_+$ , $\lambda$ in $[0, 1]$ and $x_0, x_1 \in \Xsf$ . Let $a_i \in \Gamma(x_i)$ satisfy $Tv(x_i) = B(x_i, a_i, v)$ for each $i$ . Let $x_\lambda = \lambda x_0 + (1-\lambda) x_1$ and $a_\lambda = \lambda a_0 + (1-\lambda) a_1$ . By convexity of $\Gsf$ , we know that $a_\lambda$ lies in $\Gamma(x_\lambda)$ . Combining this feasibility with concavity of $(x, a) \mapsto B(x, a, v)$ on $\Gsf$ gives

\lambda B(x_0, a_0, v) + (1- \lambda) B(x_1, a_1, v) \leq B(x_\lambda, a_\lambda, v) \leq Tv(x_\lambda).

The left-hand side is $\lambda Tv(x_0) + (1-\lambda) Tv(x_1)$ , so we have proved concavity of $Tv$ . Hence $T$ is invariant on $c b_\ell c \Xsf_+$ , and the claim in Proposition 7.2.7 holds. ◻

Solution to Exercise 7.2.3

Monotonicity of $\vmax$ follows from Exercise 7.2.2 and Proposition 7.2.6. For concavity, we verify Assumption 7.2.9. The set $\Gsf = \{(w, c) : 0 \leq c \leq w\}$ is convex. Fix concave $v$ on $\RR_+$ . The map $(w, c) \mapsto u(c)$ is concave in $(w, c)$ since $u$ is concave. The map $(w, c) \mapsto v(R(w - c) + y)$ is concave in $(w, c)$ for each $y$ , since $(w, c) \mapsto R(w - c) + y$ is affine and the composition of a concave function with an affine map is concave. Integrating preserves concavity, so $(w, c) \mapsto B(w, c, v)$ is concave on $\Gsf$ . Proposition 7.2.7 now gives the result.

7.2.3.3Uniqueness and Continuity¶

When the conditions of Proposition 7.2.5 are in force, we know that at least one optimal policy exists in $\Sigma$ . The question we ask now is, when is it unique? Not surprisingly, uniqueness can be obtained with a form of strict concavity.

7.2.4Digression on Certainty Equivalents¶

Before continuing with the theory of RDPs, it will be helpful to review risk measures and certainty equivalents. These concepts are in one-to-one correspondence: we convert between them by flipping signs. Certainty equivalents can be understood as extensions of mathematical expectation that include attitudes towards risk. In later sections, we will tie the discussion of risk measures and certainty equivalents back into RDP theory and its applications.

(The existence of parallel literatures on risk measures and certainty equivalents reflects the fact that researchers in finance and engineering often think about minimizing risk, while economists typically concern themselves with maximizing rewards.^[1] In this book, we tend to work with certainty equivalents, although the following discussion will allow readers to translate between the two.)

Throughout the following discussion, the triple $(\Omega, \fF, \PP)$ is a probability space and $L_\infty \coloneq L_\infty(\Omega, \fF, \PP)$ is the set of essentially bounded random variables on $(\Omega, \fF, \PP)$ ; that is, all random $Z$ admitting an $N \in \NN$ with $|Z| \leq N$ $\PP$ -a.s.

In this setting, a risk measure is a map $\rR \colon L_\infty \to \RR$ satisfying

(R1) Monotonicity: If $Z, Z' \in L_\infty$ and $Z \leq Z'$ $\PP$ -a.s., then $\rR(Z') \leq \rR(Z)$ .

(R2) Cash invariance: $\rR(Z + a) = \rR(Z) - a$ for all $Z \in L_\infty$ and $a \in \RR$ .

A certainty equivalent is a map $\eE \colon L_\infty \to \RR$ satisfying

(C1) Monotonicity: If $Z, Z' \in L_\infty$ and $Z \leq Z'$ $\PP$ -a.s., then $\eE(Z) \leq \eE(Z')$ .

(C2) Cash invariance: $\eE(Z + a) = \eE(Z) + a$ for all $Z \in L_\infty$ and $a \in \RR$ .

(Note that the meaning of monotonicity and cash invariance changes from $\rR$ to $\eE$ .)

We now define several significant subclasses of risk measures and state the corresponding properties of the associated certainty equivalent $\eE = -\rR$ .

A risk measure $\rR$ is called convex if

\rR(\lambda Z + (1 - \lambda) Z') \leq \lambda\, \rR(Z) + (1 - \lambda)\, \rR(Z')

for all $Z, Z' \in L_\infty$ and $\lambda \in [0, 1]$ . Equivalently, $\rR$ is convex if and only if its negation $\eE \coloneq -\rR$ is concave:

\eE(\lambda Z + (1 - \lambda) Z') \geq \lambda\, \eE(Z) + (1 - \lambda)\, \eE(Z').

Concavity of $\eE$ captures the idea that diversification is weakly preferred.

A risk measure $\rR$ is called coherent if it is convex and positively homogeneous, meaning that $\rR(\lambda Z) = \lambda\, \rR(Z)$ for all $Z \in L_\infty$ and $\lambda > 0$ . The certainty equivalent $\eE = -\rR$ is then concave and positively homogeneous. Together, these two properties imply that $\eE$ is superadditive, since $\eE(Z + Z') = 2\, \eE(Z/2 + Z'/2) \geq \eE(Z) + \eE(Z')$ .

7.2.4.1Duality¶

There is a dual representation theorem for convex risk measures, originally due to Föllmer & Schied (2002), that helps us interpret and manipulate these functionals. Here we restate their result in terms of concave certainty equivalents. In doing so, we will restrict attention to the law invariant case; that is, the case where $\eE(Z)$ depends only on the distribution of $Z$ for all $Z \in L_\infty$ .^[2]

In the theorem statement, $P_Z$ is $\PP \circ Z^{-1}$ , the distribution of $Z$ , and the infimum is over all $Q \in \pP(\RR)$ such that $Q$ is absolutely continuous with respect to $P_Z$ . The constraint $Q \ll P_Z$ means that if $P_Z$ says an event is impossible, then $Q$ must also say it’s impossible.

One way to interpret (7.20) is in terms of an adversarial agent who chooses $Q$ to minimize the expected return $\EE_Q[Z]$ , while being constrained by a penalty term $\alpha(Q)$ . This is the robust optimization point of view: the agent makes choices that are robust to variations by a real or fictitious adversary. The penalty function $\alpha$ controls how far the adversary is able to deviate from the reference model $P_Z$ . From this perspective, the absolute continuity condition $Q \ll P_Z$ means that the adversary is allowed to disagree about how likely different scenarios are, but not about which scenarios are conceivable.

A second interpretation involves ambiguity. The agent does not know the true model and $P_Z$ is only a reference point. The agent’s cautious reasoning forces him to entertain a range of plausible models. The penalty term $\alpha(Q)$ reflects how implausible $Q$ is relative to $P_Z$ . The absolute continuity constraint defines what the agent considers to be possible—the set of scenarios that could actually occur.

7.2.4.2Examples of Certainty Equivalents¶

Let’s look at examples, focusing primarily on certainty equivalents. Throughout this discussion, $Z$ is an element of $L_\infty$ , $P_Z$ is its distribution, $F_Z$ is its CDF, and $F_Z^{-1}$ is the inverse CDF.

The simplest certainty equivalent is mathematical expectation: $\eE(Z) = \EE[Z]$ . This corresponds to risk neutrality: the agent is indifferent between any random variable and its mean. The other extreme is the pessimistic certainty equivalent

\eE_p(Z) \coloneq \operatorname{ess\,inf} Z = \sup \setntn{a \in \RR}{\PP\{Z < a\} = 0}.

We can think of $\eE_p(Z)$ as the left-hand end point of the support of $Z$ . For the pessimistic certainty equivalent, the dual representation (7.20) becomes

\eE_p(Z) = \inf_{Q \ll P_Z} \EE_Q[Z].

Both of these examples are coherent.

Another example is the $\alpha$ -quantile certainty equivalent

\qQ_\alpha(Z) = F_Z^{-1}(\alpha), \qquad \alpha \in (0, 1).

The value $\qQ_\alpha(Z)$ is the $\alpha$ -quantile of $Z$ . The corresponding risk measure $\rR_\alpha = - \qQ_\alpha$ is just $\alpha$ -level value-at-risk (VaR).

VaR admits some pathologies. For example, VaR is not convex, and hence can increase under diversification. These deficiencies have motivated the introduction of conditional value at risk (CVaR) (also called average value at risk, or expected shortfall), defined as

\rR_\alpha(Z) = -\frac{1}{\alpha} \int_0^\alpha F_Z^{-1}(t)\, dt \qquad (\alpha \in (0, 1]).

The corresponding CVaR certainty equivalent is $\eE_\alpha(Z) = - \rR_\alpha(Z)$ , interpreted as the mean of the $\alpha$ -tail of the distribution of $Z$ , that is, the average over the worst $\alpha$ -fraction of outcomes. The CVaR certainty equivalent is coherent and admits the dual representation

\eE_\alpha(Z) = \inf\, \left\{ \EE_Q[Z] \,:\, Q \ll P_Z,\; \frac{\diff Q}{\diff P_Z} \leq \frac{1}{\alpha} \right\}.

The parameter $\alpha$ interpolates between the two previous cases:

\alpha = 1 \implies \eE_1(Z) = \EE[Z], \qquad \alpha \to 0 \implies \eE_\alpha(Z) \to \operatorname{ess\,inf} Z.

Another important case, already discussed in Chapter 1, is the entropic certainty equivalent

\eE_\gamma(Z) = -\frac{1}{\gamma} \ln \EE\, \left[\exp(-\gamma Z)\right] \qquad (\gamma > 0).

(7.21)

This certainty equivalent is concave but not coherent. The dual representation is

\eE_\gamma(Z) = \inf_{Q \ll P_Z} \left\{ \EE_Q[Z] + \frac{1}{\gamma}\, D_{\mathrm{KL}}(Q \,\|\, P_Z) \right\},

(7.22)

where $D_{\mathrm{KL}}(Q \,\|\, P_Z) = \EE_Q\!\left[\log \frac{\diff Q}{\diff P_Z}\right]$ is the Kullback–Leibler divergence. The parameter $\gamma$ controls the degree of risk aversion and interpolates between risk neutrality and worst case. In particular, $\gamma \to 0$ implies $\eE_\gamma(Z) \to \EE[Z]$ , while $\gamma \to \infty$ implies $\eE_\gamma(Z) \to \operatorname{ess\,inf}\, Z$ .
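The certainty equivalents above are easy to evaluate when $Z$ has finitely many atoms. The following is a minimal sketch under that assumption; the atoms `z`, the probabilities `p`, and all function names are hypothetical and chosen only for illustration. It computes the pessimistic, quantile, CVaR, and entropic certainty equivalents, so that cash invariance and the limiting cases discussed above can be checked numerically.

```python
import numpy as np

# Hypothetical finite distribution: atoms z with probabilities p
z = np.array([0.5, 1.0, 2.0, 3.5, 4.0])
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

def mean_ce(z, p):
    """Risk-neutral certainty equivalent E[Z]."""
    return p @ z

def ess_inf(z, p):
    """Pessimistic certainty equivalent: smallest atom with positive mass."""
    return z[p > 0].min()

def quantile_ce(z, p, alpha):
    """alpha-quantile: smallest atom x with F(x) >= alpha."""
    order = np.argsort(z)
    cdf = np.cumsum(p[order])
    idx = min(np.searchsorted(cdf, alpha, side="left"), len(z) - 1)
    return z[order][idx]

def cvar_ce(z, p, alpha):
    """(1/alpha) * integral of the inverse CDF over (0, alpha]."""
    order = np.argsort(z)
    zs, ps = z[order], p[order]
    hi = np.cumsum(ps)          # right end of each atom's interval in (0, 1]
    lo = hi - ps                # left end of each atom's interval
    weight = np.clip(np.minimum(hi, alpha) - lo, 0.0, None)
    return (weight @ zs) / alpha

def entropic_ce(z, p, gamma):
    """-(1/gamma) * ln E[exp(-gamma Z)], shifted by min(z) for stability."""
    shift = z.min()
    return shift - np.log(p @ np.exp(-gamma * (z - shift))) / gamma
```

In this example, `cvar_ce(z, p, 1.0)` agrees with the mean and `cvar_ce(z, p, 0.05)` agrees with the essential infimum, while `entropic_ce` moves from the mean toward the essential infimum as `gamma` grows.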

It is worth noting here that the Kreps–Porteus expectation $\kK(Z) \coloneq (\EE[Z^{1-\gamma}])^{1/(1-\gamma)}$ is not a certainty equivalent, at least according to our definition. While monotonicity holds, $\kK$ fails cash invariance. This is, in essence, why Bellman and policy operators based around Epstein–Zin preferences often fail to be contractions. We discuss Kreps–Porteus expectations again in Section 7.3.4.
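The failure of cash invariance can be seen in a two-point example. The sketch below assumes a strictly positive random variable with hypothetical atoms and $\gamma = 2$, so that the Kreps–Porteus expectation reduces to a harmonic mean; the function name `kreps_porteus` is ours.

```python
import numpy as np

# Hypothetical two-point distribution with strictly positive atoms
z = np.array([1.0, 4.0])
p = np.array([0.5, 0.5])
gamma = 2.0   # any gamma != 1 works; gamma = 2 gives the harmonic mean

def kreps_porteus(z, p, gamma):
    """(E[Z^(1 - gamma)])^(1 / (1 - gamma)) for positive Z and gamma != 1."""
    return (p @ z ** (1 - gamma)) ** (1 / (1 - gamma))

shifted = kreps_porteus(z + 1.0, p, gamma)      # K(Z + 1)
target = kreps_porteus(z, p, gamma) + 1.0       # K(Z) + 1
scaled = kreps_porteus(2.0 * z, p, gamma)       # K(2 Z)
```

Here `shifted` and `target` differ, so (C2) fails, whereas `scaled` equals twice `kreps_porteus(z, p, gamma)`: the Kreps–Porteus expectation is positively homogeneous rather than cash invariant.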

7.2.4.3Continuity¶

Let $\eE$ be a certainty equivalent on $L_\infty = L_\infty(\Omega, \fF, \PP)$ . We call $\eE$ continuous if, given any uniformly bounded sequence $(Z_n)_{n \in \NN}$ in $L_\infty$ and any $Z \in L_\infty$ , we have

\eE(Z_n) \to \eE(Z) \quad \text{whenever } \; Z_n \to Z \;\; \PP \text{-almost surely}.

In the definition above, uniform boundedness means that there exists an $M < \infty$ with $|Z_n| \leq M$ almost surely for all $n$ . Continuity will be useful for the optimality theory developed below.

Proof

Fix a sequence $(Z_n)$ and $Z$ in $L_\infty$ with $Z_n \to Z$ $\PP$ -a.s. and $|Z_n| \leq M$ for all $n$ . Let $F_n$ and $F$ be the respective CDFs. Since almost sure convergence implies convergence in distribution, $F_n(x) \to F(x)$ at every continuity point of $F$ . By the quantile convergence theorem, $F_n^{-1}(t) \to F^{-1}(t)$ at every continuity point of $F^{-1}$ , which excludes at most countably many $t$ . Since $|F_n^{-1}(t)| \leq M$ for all $t$ , the dominated convergence theorem gives

\eE_\alpha(Z_n) = \frac{1}{\alpha} \int_0^\alpha F_n^{-1}(t)\, dt \;\to\; \frac{1}{\alpha} \int_0^\alpha F^{-1}(t)\, dt = \eE_\alpha(Z),

confirming continuity. ◻

7.2.5MDPs with Certainty Equivalents¶

We first set up a standard MDP framework with aggregator based on mathematical expectation and then replace the expectation with a general certainty equivalent, showing that the fundamental optimality results carry over when the certainty equivalent is continuous.

7.2.5.1MDP Framework¶

We begin by setting up a basic MDP framework that can then be adapted to add risk preferences. To this end, let $\Xsf$ and $\Asf$ be arbitrary metric spaces and let $\Gamma$ be a nonempty correspondence from $\Xsf$ to $\Asf$ . Let $\Gsf = \setntn{(x,a)\in \Xsf \times \Asf}{a \in \Gamma(x)}$ . Consider an RDP with feasible correspondence $\Gamma$ and aggregator

B(x, a, v) = r(x,a) + \beta \, \EE [ v(f(x,a,\xi)) ].

(7.23)

Here $\xi$ is a random element that takes values in a metric space $\Zsf$ and has distribution $\phi$ , $f$ is a measurable function from $\Gsf \times \Zsf$ to $\Xsf$ , and $r$ is a measurable function from $\Gsf$ to $\RR$ . The discount factor $\beta$ obeys $0 \leq \beta < 1$ . The value space is set to $b\Xsf$ .

We can treat optimality and convergence of algorithms for the associated RDP $(\Gamma, b\Xsf, B)$ using Proposition 6.1.7 from Chapter 6. Here, however, we’ll extend the model to use arbitrary certainty equivalents in place of $\EE$ . Results for $\EE$ will be a special case. The next section gives details.

7.2.5.2Certainty Equivalents for MDPs¶

Let’s consider replacing the expectation in (7.23), which corresponds to risk-neutrality over continuation values, with an arbitrary certainty equivalent $\eE$ . The aggregator is now

B_{\eE}(x, a, v) \coloneq r(x,a) + \beta \, \eE [v(f(x,a,\xi))].

(7.24)

Other primitives are left unchanged. The value space continues to be $b\Xsf$ . We assume throughout that $(x,a) \mapsto \eE [v(f(x,a,\xi))]$ is measurable on $\Gsf$ . We consider the RDP $(\Gamma, b\Xsf, B_{\eE})$ .

Proof

We verify the conditions of Proposition 7.2.2. By Assumption 7.2.11, $\Gamma$ is compact-valued and continuous. Measurability of $(x, a) \mapsto B_{\eE}(x, a, v)$ on $\Gsf$ holds for each $v \in V$ by our standing assumption. Monotonicity of $B_{\eE}$ in $v$ follows from monotonicity of $\eE$ . Since $r$ is bounded (Assumption 7.2.11) and $\eE$ maps bounded random variables to finite values, the map $(x, a) \mapsto B_{\eE}(x, a, v)$ is bounded for each $v \in V$ . Moreover, cash invariance of $\eE$ gives

B_{\eE}(x, a, v + \kappa) = B_{\eE}(x, a, v) + \beta \kappa

for all $\kappa \in \RR_+$ , so Assumption 7.2.1 holds with $\lambda = \beta$ .

For Assumption 7.2.2, fix $v \in bc\Xsf$ and let $(x_n, a_n) \to (x, a)$ in $\Gsf$ . By Assumption 7.2.11, $f(x_n, a_n, z) \to f(x, a, z)$ for all $z \in \Zsf$ , so continuity of $v$ gives $v(f(x_n, a_n, \xi)) \to v(f(x, a, \xi))$ almost surely. Since $|v| \leq \|v\|_\infty$ , continuity of $\eE$ gives $\eE\, v(f(x_n, a_n, \xi)) \to \eE\, v(f(x, a, \xi))$ . Combined with continuity of $r$ , we conclude that $(x, a) \mapsto B_{\eE}(x, a, v)$ is continuous on $\Gsf$ .

The claims now follow from Proposition 7.2.2. ◻
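To make the argument concrete, here is a minimal sketch of value function iteration for the aggregator (7.24) on a finite state and action space. The finite setting, the random primitives, and all names are assumptions made for illustration: `P[x, a, :]` plays the role of the distribution of $f(x, a, \xi)$, and the certainty equivalent is either $\EE$ or the entropic certainty equivalent (7.21).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3                     # number of states and actions (hypothetical)
beta, gamma = 0.9, 2.0          # discount factor and risk aversion
r = rng.uniform(0.0, 1.0, size=(n, m))         # bounded rewards r(x, a)
P = rng.dirichlet(np.ones(n), size=(n, m))     # P[x, a, :] = law of next state

def ce_mean(v):
    """Risk-neutral certainty equivalent of v(next state), shape (n, m)."""
    return P @ v

def ce_entropic(v):
    """Entropic certainty equivalent of v(next state), shape (n, m)."""
    shift = v.min()
    return shift - np.log(P @ np.exp(-gamma * (v - shift))) / gamma

def B(v, ce):
    """Aggregator r(x, a) + beta * CE[v(f(x, a, xi))] for all (x, a)."""
    return r + beta * ce(v)

def T(v, ce):
    """Bellman operator: maximize the aggregator over actions."""
    return B(v, ce).max(axis=1)

def vfi(ce, tol=1e-10, max_iter=10_000):
    """Iterate T from v = 0 until successive iterates are within tol."""
    v = np.zeros(n)
    for _ in range(max_iter):
        v_new = T(v, ce)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    raise RuntimeError("VFI did not converge")

v_neutral = vfi(ce_mean)
v_averse = vfi(ce_entropic)
```

Because both certainty equivalents are monotone and cash invariant, `T` is a contraction of modulus `beta` in the supremum norm for either choice, which is why the same `vfi` routine serves both cases. Since the entropic certainty equivalent never exceeds the mean, `v_averse` lies weakly below `v_neutral`.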

7.3Applications¶

We now apply the RDP optimality theory developed in Section 7.2 to a range of dynamic programming problems. In Section 7.3.1 we revisit the optimal savings problem, this time with utility unbounded above, and verify the conditions of our weighted contraction results. We then study irreversible investment under risk neutrality (Section 7.3.2.1), risk aversion (Section 7.3.2.2), and ambiguity aversion (Section 7.3.3).

7.3.1Optimal Savings with Utility Unbounded Above¶

Here we again consider the optimal savings model from Section 1.3, but without the boundedness restriction on $u$ . In particular, we assume that $u$ is continuous, nonnegative, and increasing, and that

\ell(w) \coloneq \EE \sum_{t \geq 0} \delta^t u(\hat W_t) < \infty

(7.25)

for some $\delta \in (\beta, 1)$ . Here $(\hat W_t)$ is defined recursively via $\hat W_{t+1} = R \hat W_t + Y_{t+1}$ with $\hat W_0 = w$ and $(Y_t) \iidsim \phi$ . We can think of $(\hat W_t)$ as an upper bound process for wealth, achieved when consumption is always zero. As before, $\phi$ is a continuous density on $\RR_+$ . Our aim is to provide conditions under which the conclusions of Proposition 7.2.5 apply.

We set $V = b_\ell \RR_+$ , $\Gamma(w) = [0, w]$ , and

B(w, c, v) = u(c) + \beta \int v(R(w - c) + y) \phi(\diff y).

Solution to Exercise 7.3.1

For (U1), we have

B(w, c, v + \kappa \ell) - B(w, c, v) = \beta \kappa \int \ell(R(w - c) + y) \phi(\diff y) \leq \beta \kappa \int \ell(Rw + y) \phi(\diff y),

where the inequality uses $c \geq 0$ and the fact that $\ell$ is increasing (which follows from its definition as the expected sum of nonnegative terms along increasing sample paths). By the recursive structure of $\ell$ , we have

\ell(w) = u(w) + \delta \int \ell(Rw + y) \phi(\diff y),

so $\int \ell(Rw + y) \phi(\diff y) \leq \ell(w)/\delta$ . Hence $B(w, c, v + \kappa \ell) - B(w, c, v) \leq (\beta/\delta) \kappa \ell(w)$ , confirming (U1).

For (U2), fix $v \in V$ . Since $u$ is increasing and $c \leq w$ , we have $u(c) \leq u(w)$ , and since $u \geq 0$ , the definition of $\ell$ gives $u(w) \leq \ell(w)$ . Also, $|v| \leq \|v\|_\ell \ell$ , so

|B(w, c, v)| \leq u(c) + \beta \|v\|_\ell \int \ell(Rw + y) \phi(\diff y) \leq \ell(w) + (\beta/\delta) \|v\|_\ell \ell(w),

which confirms (U2).
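As a concrete check on the bound used for (U1), consider the special case of linear utility $u(c) = c$ . This case, along with the requirements $\mu \coloneq \int y \, \phi(\diff y) < \infty$ and $\delta R < 1$ , is assumed here only for illustration. Substituting the guess $\ell(w) = a w + b$ into the recursion for $\ell$ gives

```latex
\begin{aligned}
a w + b &= w + \delta \left[ a (R w + \mu) + b \right]
\quad \Longrightarrow \quad
a = \frac{1}{1 - \delta R},
\qquad
b = \frac{\delta \mu}{(1 - \delta)(1 - \delta R)}, \\
\int \ell(Rw + y) \phi(\diff y)
&= a (R w + \mu) + b
= \frac{\ell(w) - w}{\delta}
\leq \frac{\ell(w)}{\delta}.
\end{aligned}
```

A direct computation of $\EE \sum_{t \geq 0} \delta^t \hat W_t$ yields the same $a$ and $b$ . In this example, finiteness of $\ell$ amounts to $\delta R < 1$ , and the final inequality is exactly the step that delivers the modulus $\beta / \delta < 1$ in (U1).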

Solution to Exercise 7.3.2

Fix $v \in b_\ell \RR_+$ and let $(w_n, c_n) \to (w, c)$ in $\Gsf$ . Since $u$ is continuous, it suffices to show that $\int v(R(w_n - c_n) + y) \phi(y) \diff y \to \int v(R(w - c) + y) \phi(y) \diff y$ . Set $s_n = R(w_n - c_n)$ and $s = R(w - c)$ , so $s_n \to s$ . The change of variable $x' = s_n + y$ gives $\int v(s_n + y) \phi(y) \diff y = \int v(x') \phi(x' - s_n) \diff x'$ , and similarly with $s$ in place of $s_n$ (cf. Example 6.1.2). Fix $\epsilon > 0$ and set $\bar s = \sup_n s_n$ . Since $\int \ell(\bar s + y) \phi(y) \diff y < \infty$ (by the definition of $\ell$ ), we can choose $M$ so that $\int_{M - \bar s}^\infty \ell(\bar s + y) \phi(y) \diff y < \epsilon$ . Using $|v| \leq \|v\|_\ell \ell$ and $\ell$ increasing, the integral over $(M, \infty)$ is bounded by $\|v\|_\ell \epsilon$ for all $n$ , and the same bound holds with $s$ in place of $s_n$ . On $[0, M]$ , the function $v$ is bounded by $\|v\|_\ell \max_{[0, M]} \ell < \infty$ (since $\ell$ is continuous), and Scheffé’s lemma together with continuity of $\phi$ gives $\int_0^M |v(x')| \cdot |\phi(x' - s_n) - \phi(x' - s)| \diff x' \to 0$ . Combining gives the result.

The results above imply that Assumption 7.2.7 and Assumption 7.2.4 both hold. As a result, the conclusions of Proposition 7.2.5 apply. For example, the value function $\vmax$ exists, is an element of $b_\ell c \RR_+$ , and satisfies

\vmax(w) = \max_{0 \leq c \leq w} \left\{ u(c) + \beta \int \vmax(R(w - c) + y) \phi(\diff y) \right\}

for all $w \geq 0$ . Moreover, VFI, OPI, and HPI all converge.

In general, OPI and VFI are the easiest to implement. Figure 7.1 illustrates the runtime of OPI as a function of $m$ for this model with CRRA utility $u(c) = (c^{1-\gamma} - 1)/(1-\gamma)$ and $\gamma = 0.5$ . The runtime of VFI is shown as a horizontal line. Since VFI is the special case of OPI with $m=1$ , the leftmost point of the OPI curve coincides with the VFI runtime. The minimum is attained near $m = 40$ , where OPI runs roughly six times faster than VFI. Runtime then rises linearly in $m$ . The key message is that OPI dominates VFI over a wide range of $m$ .

Figure 7.1:Runtime of OPI vs. VFI for the optimal savings model
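The relationship between OPI and VFI described above can be sketched in a few lines. The code below is an illustrative discretization, not the computation behind Figure 7.1: the wealth grid, the three-point income distribution, the parameter values, and the utility $u(c) = \sqrt{c}$ (chosen because it is continuous, nonnegative, and increasing) are all assumptions, and continuation values are evaluated by linear interpolation on a truncated grid.

```python
import numpy as np

# Illustrative parameters (assumptions, not taken from the text)
beta, R = 0.9, 1.02
w_grid = np.linspace(0.0, 10.0, 60)      # truncated wealth grid
fracs = np.linspace(0.0, 1.0, 40)        # consumption as a fraction of wealth
y_vals = np.array([0.5, 1.0, 1.5])       # equally likely income draws

def u(c):
    return np.sqrt(c)                    # continuous, nonnegative, increasing

c = w_grid[:, None] * fracs[None, :]                        # c[i, j]
w_next = R * (w_grid[:, None] - c)[:, :, None] + y_vals     # w_next[i, j, k]
rows = np.arange(len(w_grid))

def expected_value(v, w_prime):
    """Interpolate v at next-period wealth, then average over income draws."""
    vals = np.interp(w_prime.ravel(), w_grid, v).reshape(w_prime.shape)
    return vals.mean(axis=-1)

def greedy(v):
    """Index of the v-greedy consumption fraction at each grid point."""
    return np.argmax(u(c) + beta * expected_value(v, w_next), axis=1)

def T_sigma(v, sigma):
    """Policy operator for the policy encoded by the index array sigma."""
    return u(c[rows, sigma]) + beta * expected_value(v, w_next[rows, sigma])

def opi(m, tol=1e-6, max_iter=10_000):
    """Optimistic policy iteration; m = 1 reduces to VFI."""
    v = np.zeros(len(w_grid))
    for _ in range(max_iter):
        sigma = greedy(v)
        v_new = v
        for _ in range(m):               # apply T_sigma m times
            v_new = T_sigma(v_new, sigma)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, sigma
        v = v_new
    raise RuntimeError("OPI did not converge")

v_vfi, _ = opi(m=1)
v_opi, sigma_opi = opi(m=20)
```

Each call to `greedy` evaluates every action at every grid point, while each call to `T_sigma` evaluates only one action per grid point. This difference in cost per step is the source of the runtime gains from choosing $m > 1$ . Starting from $v = 0$ keeps the iterates increasing, since $u \geq 0$ .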

7.3.2Irreversible Investment¶

We begin with the risk-neutral case in Section 7.3.2.1, where standard contraction arguments apply. In Section 7.3.2.2 we introduce risk aversion by replacing mathematical expectation with a certainty equivalent, using the framework developed in Section 7.2.4.

7.3.2.1The Risk-Neutral Case¶

Let’s begin with a canonical firm problem with irreversible investment, where the Bellman equation is

v(k, z) = \max_{i \in \Gamma(k, z)} \left\{ f(k,z) - i + \beta \int v(i + (1-\delta) k, g(z, \xi)) \phi(\diff \xi) \right\}.

Here $k \in \RR_+$ is capital stock, $i \in \RR_+$ is investment, $\beta \in (0,1)$ is the discount factor, $\delta \in (0,1)$ is a depreciation rate, $f$ is a production function, and $z \in \RR^m$ is an exogenous state vector. The feasible correspondence is defined by $\Gamma(k,z) = [0, \theta f(k, z)]$ , where $\theta > 0$ is a borrowing constraint parameter. The state process evolves according to

Z_{t+1} = g(Z_t, \xi_{t+1}), \qquad (\xi_t)_{t \geq 0} \iidsim \phi.

Each $\xi_t$ takes values in a metric space $\Zsf$ , the distribution $\phi$ is an element of $\pP(\Zsf)$ , and $g \colon \RR^m \times \Zsf \to \RR^m$ .

Some comments are in order. First, to simplify the presentation, we’ve set the output price to unity, so that $f(k,z)$ is both output and revenue. This can easily be modified. Second, the boundedness restriction on $f$ is not automatically satisfied in many cases but greatly simplifies the analysis. In terms of quantitative applications, the cost is not large. For example, $f(k,z)=zk^\alpha$ can be replaced with $f(k,z) = \min\{zk^\alpha, y\}$ for large $y$ . If $y$ is very large then the impact on choices and values is negligible.

This firm problem is called an irreversible investment model because $i$ is required to be nonnegative. To frame this problem as an RDP, we set $V = b\Xsf$ , where $\Xsf \coloneq \RR_+ \times \RR^m$ , and

B(k,z,i, v) = f(k,z) - i + \beta \int v(i + (1-\delta) k, g(z, \xi)) \phi(\diff \xi).

The set of policies $\Sigma$ is all measurable maps from $\Xsf$ to $\RR_+$ satisfying the feasibility constraint.

Figure 7.2 shows an example computation, plotting optimal investment as a function of capital for two productivity levels. The plot also compares this irreversible case ( $i \geq 0$ ) with a reversible benchmark where the firm can also disinvest ( $i \geq -(1 - \delta)k$ ). At low capital, both firms invest identically. At high capital, with the same level of productivity, the reversible firm disinvests, while the irreversible firm sets $i$ to the lower bound constraint of zero. Importantly, for intermediate levels of capital, the reversible firm invests more aggressively, knowing that it can sell capital later if productivity drops. The irreversible firm faces a higher effective cost of capital due to the option value of waiting and targets a lower stock.^[3]

Figure 7.3 shows simulated paths for both firms facing identical productivity shocks. The reversible firm tracks productivity more closely, boosting capital during good times and shedding it during downturns. The irreversible firm adjusts sluggishly on the downside, since it can only reduce capital through depreciation.

Figure 7.2:Investment policies: irreversible vs. reversible

Figure 7.3:Simulated capital and investment paths under common shocks

7.3.2.2Adding Risk Aversion¶

As discussed in Section 1.1.3, actual firm behavior often deviates from the risk-neutral benchmark attained under the assumption of frictionless complete markets. Here we extend the model from Section 7.3.2.1 to accommodate risk aversion. We replace the Bellman equation from Section 7.3.2.1 with

v(k, z) = \max_{0 \leq i \leq \theta f(k,z)} \left\{ f(k, z) - i + \beta \eE [v(i + (1-\delta) k, g(z, \xi))] \right\}.

(7.26)

The only change is that mathematical expectation has been replaced with a certainty equivalent $\eE$ . The term $\xi$ should be understood as a random element on $\Zsf$ with distribution $\phi$ . We assume that the map $(k, z, i) \mapsto \eE [v(i + (1-\delta) k, g(z, \xi))]$ is measurable on $\Gsf$ for all $v \in b\Xsf$ .

We can set this model up as an RDP by taking $V = b\Xsf$ , $\Gamma(k,z) = [0, \theta f(k,z)]$ , and

B(k,z,i,v) = f(k,z) - i + \beta \eE [v(i + (1-\delta) k, g(z, \xi))].

By the monotonicity of certainty equivalents, we have $B(k,z,i,v) \leq B(k,z,i,v')$ whenever $v \leq v'$ . Also, by our measurability assumption and boundedness of $f$ , the map sending $(k,z)$ into $B(k,z, \sigma(k,z), v)$ is bounded and measurable whenever $\sigma \in \Sigma$ and $v \in V$ . This confirms that $(\Gamma, V, B)$ is an RDP.

7.3.2.3Numerical Illustration¶

To illustrate the impact of risk aversion on firm behavior, we now compute optimal policies under the specific certainty equivalent

\eE_\lambda(Z) \coloneq \lambda \EE[Z] - (1 - \lambda)\, \rR_\alpha(Z).

(7.27)

Here $\lambda \in (0,1)$ and $\rR_\alpha$ is value-at-risk at a fixed $\alpha$ , which will be set to the industry standard value 0.05. The certainty equivalent puts positive weight on both expected rewards and VaR, matching common management practice. Decreasing $\lambda$ increases concern for left tail events. The map $\eE_\lambda$ is a valid certainty equivalent: $-\rR_\alpha = \qQ_\alpha$ , the $\alpha$ -quantile certainty equivalent, and convex combinations of certainty equivalents are certainty equivalents (Exercise 7.2.5).

Note that $\eE_\lambda$ is not a continuous certainty equivalent, since VaR can jump under small perturbations. This means that Proposition 7.3.2 does not directly apply. (We are treating VaR here because of its popularity in applications, rather than its attractive theoretical properties.) At the same time, when we implement the model on a machine, all numerical quantities are ultimately represented by a finite set of double-precision floats. In this sense, the model as actually computed is an RDP with finite action sets. By Proposition 7.2.1, optimal policies exist and VFI converges.
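The jump in VaR can be exhibited with two-point distributions. On $\Omega = [0, 1]$ with Lebesgue measure, set $Z_n = \1\{\omega > \alpha - 1/n\}$ and $Z = \1\{\omega \geq \alpha\}$ , so that $Z_n \to Z$ pointwise. The sketch below, which uses exact rational arithmetic and a hypothetical helper `quantile`, computes the $\alpha$ -quantiles from the distributions of these random variables.

```python
from fractions import Fraction

def quantile(atoms, alpha):
    """Generalized inverse CDF: smallest x with F(x) >= alpha.

    atoms is a list of (value, probability) pairs with exact probabilities.
    """
    cdf = Fraction(0)
    for value, prob in sorted(atoms):
        cdf += prob
        if cdf >= alpha:
            return value
    raise ValueError("probabilities must sum to at least alpha")

alpha = Fraction(1, 20)          # the level 0.05 used in the text

def law_Z_n(n):
    """Distribution of Z_n: zero with probability alpha - 1/n (needs n > 20)."""
    return [(0, alpha - Fraction(1, n)), (1, 1 - alpha + Fraction(1, n))]

law_Z = [(0, alpha), (1, 1 - alpha)]                 # distribution of the limit Z

q_n = [quantile(law_Z_n(n), alpha) for n in (21, 100, 10**6)]
q_limit = quantile(law_Z, alpha)
```

Every entry of `q_n` equals 1 while `q_limit` equals 0, so $\qQ_\alpha(Z_n)$ does not converge to $\qQ_\alpha(Z)$ even though the sequence is uniformly bounded and converges almost surely.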

Figure 7.4 compares optimal investment policies for the risk-neutral firm (certainty equivalent $\EE$ ) and a risk-averse firm using $\eE_\lambda$ .^[4] At both productivity levels, the risk-averse firm invests less aggressively. The intuition is that the quantile component of $\eE_\lambda$ penalizes downside outcomes in the continuation value, which lowers the perceived return to investment. Because the firm cannot reverse investment decisions, the option value of waiting is amplified by risk aversion.

Figure 7.5 shows simulated paths for both firms facing identical productivity shocks. The risk-averse firm maintains a persistently lower capital stock and invests more cautiously throughout the sample. During periods of high productivity, the gap is especially pronounced: the risk-neutral firm boosts capital aggressively, while the risk-averse firm is restrained, anticipating the possibility of future downturns.

Figure 7.4:Investment policies: risk-neutral vs. risk-averse

Figure 7.5:Simulated paths: risk-neutral vs. risk-averse under common shocks

7.3.3Firm Investment under Ambiguity¶

In Section 1.2.3.3 we discussed how concern for model misspecification can be incorporated into dynamic programs. Here we return to this topic in the context of irreversible investment. We first formulate the robust control version of the firm problem and then show how duality reduces it to the risk-sensitive case already covered by our theory.

7.3.3.1Model Formulation¶

To set up a robust control version of the investment problem we formulate the Bellman equation as

v(k, z) = \max_i \left\{ f(k, z) - i + \beta \inf_{\psi \ll \phi} \left[ \int v(k', g(z, x)) \psi(\diff x) + \frac{1}{\gamma} D_{\mathrm{KL}}(\psi \,\|\, \phi) \right] \right\}.

Here $k' \coloneq i + (1-\delta) k$ and the maximization is over $i$ with $0 \leq i \leq \theta f(k,z)$ . In this case we interpret the problem as one where the manager does not fully trust the model: she fears misspecification in terms of the distribution $\phi$ of the shock sequence $(\xi_t)$ and hence lacks full confidence when calculating expectations of continuation values. Nonetheless, she is willing to treat $\phi$ as a reference model. She entertains distributions $\psi$ that deviate from $\phi$ , provided that they don’t assign positive probability to events that $\phi$ deems impossible.

The penalty term $(1/\gamma) D_{\mathrm{KL}}(\psi \,\|\, \phi)$ can be thought of as a soft constraint. Models further from the reference point (in terms of KL divergence) are regarded as less plausible. If $\gamma$ is close to zero then the penalty term will be very large for even small deviations. Because the evaluation of the continuation value involves an infimum, only very small deviations are considered. This corresponds to greater trust in the model. Conversely, larger values of $\gamma$ indicate deeper distrust.

7.3.3.2Risk-Sensitive Formulation¶

Using the duality in (7.22), we can rewrite the robust control Bellman equation for the firm problem as

v(k, z) = \max_i \left\{ f(k, z) - i - \frac{\beta}{\gamma} \ln \left[ \int \exp[-\gamma v(k', g(z, x))] \phi(\diff x) \right] \right\}.

This is a version of (7.26), with $\eE$ set to the entropic certainty equivalent $\eE_\gamma$ . Since $\eE_\gamma$ is continuous (Exercise 7.2.6), the conditions of Proposition 7.3.2 hold under Assumption 7.3.1. As a result, for this model, the fundamental optimality properties hold, the value function $\vmax$ lies in $bc \Xsf$ , and VFI converges geometrically on $bc \Xsf$ .

The above discussion shows that we do not require any new machinery to tackle the somewhat intimidating robust control version of the investment problem: a duality based approach allows us to switch to a setting where we already have all the results we need.
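The duality (7.22) that drives this reduction can be verified numerically when the shock distribution has finite support. In the sketch below, the continuation values `z`, the reference probabilities `p`, and the function names are hypothetical. The candidate minimizer is the exponentially tilted distribution $q^*_i \propto p_i \exp(-\gamma z_i)$ , which is a standard fact about the entropic functional rather than something stated in the text above.

```python
import numpy as np

rng = np.random.default_rng(1)
z = np.array([0.2, 1.0, 1.5, 3.0])     # hypothetical continuation values
p = np.array([0.1, 0.4, 0.3, 0.2])     # reference model
gamma = 1.5

def entropic(z, p, gamma):
    """Left-hand side of (7.22): -(1/gamma) * ln E_p[exp(-gamma Z)]."""
    return -np.log(p @ np.exp(-gamma * z)) / gamma

def penalized(q, z, p, gamma):
    """Objective inside the infimum: E_q[Z] + (1/gamma) * KL(q || p), q > 0."""
    kl = np.sum(q * np.log(q / p))
    return q @ z + kl / gamma

# Exponentially tilted distribution: the adversary's candidate best response
q_star = p * np.exp(-gamma * z)
q_star /= q_star.sum()

value_at_q_star = penalized(q_star, z, p, gamma)
random_values = [penalized(q, z, p, gamma)
                 for q in rng.dirichlet(np.ones(len(z)), size=200)]
```

In this example `value_at_q_star` coincides with `entropic(z, p, gamma)`, and none of the randomly drawn alternatives attains a lower value, consistent with the infimum in (7.22) being achieved at `q_star`. Note that `q_star` shifts mass toward low values of `z`, which is the sense in which the adversary distorts the reference model.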

Figure 7.6 compares optimal investment policies under three specifications: the risk-neutral benchmark ( $\gamma \to 0$ ) and the entropic certainty equivalent $\eE_\gamma$ at two levels of ambiguity aversion.^[5] As $\gamma$ increases, reflecting deeper distrust in the reference model, investment falls. The manager who entertains a wider range of alternative models, and who evaluates continuation values under the worst-case distribution subject to the KL penalty, perceives a lower return to committing capital. The effect is monotone in $\gamma$ : higher ambiguity aversion leads to uniformly less aggressive investment across all capital levels.

Figure 7.7 shows simulated paths for all three firms facing identical productivity shocks. The more ambiguity-averse firm maintains a persistently lower capital stock. During periods of high productivity, the differences are most visible: the risk-neutral firm ramps up capital, while the ambiguity-averse firm invests more cautiously, hedging against the possibility that the favorable conditions are less persistent than the reference model suggests.

Figure 7.6:Investment policies under ambiguity aversion

Figure 7.7:Simulated paths under common shocks with varying ambiguity aversion

7.3.4Kreps–Porteus vs Risk-Sensitivity¶

We return to the setup in Section 7.2.5.1, where $\Xsf$ , $\Asf$ are arbitrary metric spaces, $\Gamma$ is a nonempty correspondence from $\Xsf$ to $\Asf$ , and $B(x, a, v) = r(x,a) + \beta \EE [v(f(x,a,\xi))]$ . We suppose, as in Assumption 7.2.11, that the correspondence $\Gamma$ is compact-valued and continuous, the reward function $r$ is bounded and continuous, and that the map $(x,a) \mapsto f(x,a,z)$ is continuous on $\Gsf$ for all $z \in \Zsf$ . As discussed in Section 7.2.5.1, the fundamental optimality properties hold and VFI converges on $bc \Xsf$ .

In Section 7.2.5.2 we extended this basic MDP analysis to settings where the aggregator has the form $B(x, a, v) \coloneq r(x,a) + \beta \eE v(f(x,a,\xi))$ . In Proposition 7.2.10 we showed that, when $\eE$ is continuous, the fundamental optimality properties hold, the value function $\vmax$ lies in $bc \Xsf$ , and VFI converges geometrically on $bc \Xsf$ .

One special case is the entropic setting, where

B_{\rm RS}(x, a, v) = r(x,a) + \frac{\beta}{\theta} \ln \EE \left[ \exp(\theta \cdot v(f(x,a,\xi))) \right].

(7.28)

This model is called a risk-sensitive MDP. The modified expectation is an application of the entropic certainty equivalent (7.21) with $\theta = -\gamma$ . This modified expectation allows for parameterization of risk-sensitivity through $\theta$ , with $\theta < 0$ injecting risk-aversion. Since the entropic certainty equivalent is continuous (Exercise 7.2.6), we can apply Proposition 7.2.10. This tells us that all of the preceding convergence and optimality results apply.

Another alternative is to replace the entropic certainty equivalent with Kreps–Porteus expectations, leading to aggregator

B_{\rm KP}(x, a, v) = r(x,a) + \beta \left\{ \EE [ v(f(x,a,\xi))^\nu ] \right\}^{1/\nu} \qquad (\nu \in \RR \text{ and } \nu \neq 0).

Here, in order to avoid running into trouble with exponents, we require that $r > 0$ and take the value space $V$ to be all functions in $b\Xsf$ that take only positive values. We discussed such an RDP in Section 7.1.1.6.

Note, however, that the Kreps–Porteus expectation fails cash invariance, and, as such, is not a certainty equivalent (as previously discussed in Section 7.2.4.2). As a result, the preceding optimality theory does not apply. In particular, we cannot appeal to Proposition 7.2.10. Moreover, the aggregator $B_{\rm KP}$ is not generally contracting, in the sense that Assumption 7.2.1 typically fails. Instead, the RDP $(\Gamma, V, B_{\rm KP})$ has to be treated with other methods, such as the convexity-based techniques used in Section 5.1.3.

There is, however, a multiplicative variation on the Kreps–Porteus RDP that is simple to analyze. The model is obtained by setting

B_{\rm MKP}(x, a, v) = r(x,a) \cdot \left\{ \EE [ v(f(x,a,\xi))^\nu ] \right\}^{\beta/\nu},

while continuing to assume that $r$ is everywhere positive. The parameter $\beta$ is a discount factor for the multiplicative model, and is assumed to take values in $[0,1)$ . We call $(\Gamma, V, B_{\rm MKP})$ the multiplicative Kreps–Porteus RDP.

It turns out that the multiplicative Kreps–Porteus RDP and the additive risk-sensitive RDP $(\Gamma, b\Xsf, B_{\rm RS})$ are closely related—in fact they are isomorphic. To illustrate this, we take logs of the Bellman equation associated with the multiplicative Kreps–Porteus RDP, obtaining

\ln v(x) = \max_{a \in \Gamma(x)} \left\{ \ln r(x,a) + \frac{\beta}{\nu} \ln \EE [ v(f(x,a,\xi))^\nu ] \right\}.

Setting $\hat v = \ln v$ and $\hat r = \ln r$ yields

\hat v(x) = \max_{a \in \Gamma(x)} \left\{ \hat r(x,a) + \frac{\beta}{\nu} \ln \EE [ \exp( \nu \cdot \hat v(f(x,a,\xi))) ] \right\}.

This is the Bellman equation for the risk-sensitive RDP with aggregator (7.28), after replacing $r$ with $\hat r$ and $\theta$ with $\nu$ .
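The log transformation above can be checked on a small example. The following sketch assumes a finite state and action space, with a randomly generated kernel `P[x, a, x']` standing in for the distribution of $f(x, a, \xi)$ and with every action feasible at every state. These choices, along with the names `T_mkp` and `T_rs` and the parameter values, are ours and not part of the text. The sketch applies one step of each Bellman operator and confirms that the log of the multiplicative update equals the risk-sensitive update applied to $\ln v$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                # number of states and actions (hypothetical)
beta, nu = 0.9, -2.0       # discount factor and Kreps-Porteus exponent

r = rng.uniform(0.5, 2.0, size=(n, m))   # strictly positive rewards
P = rng.uniform(size=(n, m, n))          # P[x, a, x'] = prob of next state x'
P /= P.sum(axis=2, keepdims=True)

def T_mkp(v):
    """One Bellman step for the multiplicative Kreps-Porteus model."""
    expect = P @ v**nu                    # E[v(x')^nu], shape (n, m)
    return np.max(r * expect ** (beta / nu), axis=1)

def T_rs(v_hat, r_hat, theta):
    """One Bellman step for the risk-sensitive model with reward r_hat."""
    expect = P @ np.exp(theta * v_hat)    # E[exp(theta * v_hat(x'))]
    return np.max(r_hat + (beta / theta) * np.log(expect), axis=1)

v = rng.uniform(0.5, 3.0, size=n)         # an arbitrary positive candidate

log_of_mkp_step = np.log(T_mkp(v))
rs_step_on_logs = T_rs(np.log(v), np.log(r), nu)

print(np.max(np.abs(log_of_mkp_step - rs_step_on_logs)))
```

The maximum commutes with the logarithm because $\ln$ is increasing, which is why the comparison can be made after maximizing over actions.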

Solution to Exercise 7.3.4

We apply Exercise 7.1.5. The multiplicative Kreps–Porteus RDP is $(\Gamma, V, B_{\rm MKP})$ with $V$ the bounded measurable functions from $\Xsf$ to $(0, \infty)$ . Let $\phi = \ln \colon (0, \infty) \to \RR$ , which is an order isomorphism. Then $\hat V = \{\phi \circ v : v \in V\} = b\Xsf$ , and $(\Gamma, \hat V, B_{\rm RS})$ is the risk-sensitive RDP (after replacing $r$ with $\hat r = \ln r$ and $\theta$ with $\nu$ ).

It remains to verify (7.12): for all $v \in V$ and $(x, a) \in \Gsf$ , we need $B_{\rm MKP}(x, a, v) = \phi^{-1}[B_{\rm RS}(x, a, \phi \circ v)]$ . Setting $\hat v = \phi \circ v = \ln \circ \, v$ , we have

\begin{aligned} \phi^{-1}[B_{\rm RS}(x, a, \hat v)] &= \exp\!\left[ \ln r(x,a) + \frac{\beta}{\nu} \ln \EE\, \exp(\nu \cdot \hat v(f(x,a,\xi))) \right] \\ &= r(x,a) \cdot \left\{ \EE\, \exp(\nu \cdot \ln v(f(x,a,\xi))) \right\}^{\beta/\nu} \\ &= r(x,a) \cdot \left\{ \EE [ v(f(x,a,\xi))^\nu ] \right\}^{\beta/\nu} = B_{\rm MKP}(x, a, v). \end{aligned}

The conclusion now follows from Exercise 7.1.5.
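One practical consequence is that the two models can be solved interchangeably. The sketch below runs value function iteration on both in an assumed finite setting and compares the limits. As before, the kernel `P`, the rewards, the parameter values and the helper `vfi` are hypothetical, and all actions are taken to be feasible. If the isomorphism holds, the multiplicative value function should equal the exponential of the risk-sensitive one, and the greedy policies should coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 3                 # number of states and actions (hypothetical)
beta, nu = 0.8, -1.5        # discount factor and exponent

r = rng.uniform(0.5, 2.0, size=(n, m))   # strictly positive rewards
P = rng.uniform(size=(n, m, n))          # P[x, a, x'] = prob of next state x'
P /= P.sum(axis=2, keepdims=True)

def B_mkp(v):
    """Multiplicative Kreps-Porteus aggregator at every (x, a)."""
    return r * (P @ v**nu) ** (beta / nu)

def B_rs(v_hat):
    """Risk-sensitive aggregator with reward ln r and theta = nu."""
    return np.log(r) + (beta / nu) * np.log(P @ np.exp(nu * v_hat))

def vfi(B, v0, tol=1e-10, max_iter=10_000):
    """Iterate v -> max_a B(x, a, v) until successive iterates are close."""
    v = v0
    for _ in range(max_iter):
        v_new = B(v).max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    raise RuntimeError("VFI did not converge within max_iter iterations")

v_mkp = vfi(B_mkp, np.ones(n))    # start from v = 1, so that ln v = 0
v_rs = vfi(B_rs, np.zeros(n))

policy_mkp = B_mkp(v_mkp).argmax(axis=1)
policy_rs = B_rs(v_rs).argmax(axis=1)

print(np.max(np.abs(v_mkp - np.exp(v_rs))))
print(policy_mkp, policy_rs)
```

Working with the risk-sensitive form is often the more convenient choice numerically, since it operates on logs and avoids raising values to the power $\nu$ directly.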

7.4Chapter Notes¶

RDPs were introduced in Chapter 8 of Sargent & Stachurski (2025) in settings where the state space is finite. The theory in this chapter extends that treatment to general state spaces. The study of contracting dynamic programs with abstract Bellman equations was begun by Denardo (1967). Extensive discussion can be found in Bertsekas (2022). (The terminology is slightly confusing: the abstract dynamic programs studied in Denardo (1967) and Bertsekas (2022) are similar to the RDPs studied in this chapter. For us, however, abstract dynamic programs are the more general objects introduced in Section 2.1.1.)

The optimality results in Sections 7.2.1 and 7.2.2 combine the RDP framework with the Blackwell contraction theory of Chapter 4. Our framework is similar to the contractive models in Bertsekas (2022). The results in Section 7.2.3 on monotonicity, concavity, uniqueness, and continuity of solutions extend related results in Bäuerle & Jaśkiewicz (2018).

In Section 7.2.2 we treated RDPs where rewards are bounded below and unbounded above. Related work can be found in Toda (2023). One approach to the reverse case—where rewards are bounded above and unbounded below—can be found in Ma et al. (2022). Their idea is to rearrange the Bellman equation so that the transformed problem has bounded rewards, allowing standard contraction mapping arguments to be applied. The transformation is inspired by the Q-function used in reinforcement learning.

Regarding Euler equations, early results along the lines of Proposition 8.3.9 were established by Mirman & Zilcha (1975) and Benveniste & Scheinkman (1979).

The certainty equivalents and risk measures discussed in Section 7.2.4 are standard tools in mathematical finance and decision theory. The dual representation theorem for convex risk measures is due to Föllmer & Schied (2002); see also Jouini et al. (2006). The quantile certainty equivalent and its risk measure counterpart (VaR) have been studied in dynamic programming environments by Castro & Galvao (2019), Castro & Galvao (2022), Almeida et al. (2024), Castro et al. (2025), and Castro & Galvao (2025), among others.

The robust control formulation in Section 7.3.3 builds on Hansen & Sargent (2001) and Hansen & Sargent (2011), who developed the multiplier preference approach to robustness in dynamic economic models. The duality between robust control and risk-sensitive preferences, which we exploit to reduce the robust problem to the entropic certainty equivalent case, is a central theme of that literature.

For a recent textbook treatment that uses the RDP framework, see Toda (2024).

Footnotes¶

If we were more cynical, we would add that existence of these two literatures also reflects the fact that researchers can publish more papers if they study the same thing under different names.
More formally, a certainty equivalent $\eE$ is called law invariant if there exists a functional $e \colon \pP(\RR) \to \RR$ such that $\eE(Z) = e(\PP \circ Z^{-1})$ for all $Z \in L_\infty$ .
We set $f(k,z) = \min\{zk^\alpha, y\}$ with $y=1000$ , $\alpha = 0.3$ , $\beta = 0.95$ , $\delta = 0.1$ , and $\theta = 1.5$ . The exogenous state follows $Z_t = \exp(X_t)$ where $(X_t)$ is AR(1) with persistence $\rho = 0.9$ and volatility $\nu = 0.2$ , discretized via the Tauchen method. The value function is approximated on a grid via linear interpolation of $v(\cdot, z)$ for each $z$ , and solved via VFI.
The parameterization is the same as for the risk-neutral case and $\lambda = 0.5$ . The dynamic program is solved via VFI.
The parameterization is the same as for the risk-neutral case, with $\gamma = 0.05$ and $\gamma = 0.5$ for the entropic certainty equivalent. VFI converges geometrically in all cases.

References¶

Sargent, T. J., & Stachurski, J. (2025). Dynamic Programming: Finite States. Cambridge University Press.
Kristensen, D., Mogensen, P. K., Moon, J. M., & Schjerning, B. (2021). Solving dynamic discrete choice models using smoothing and sieve methods. Journal of Econometrics, 223(2), 328–360.
Rust, J. (1994). Structural estimation of Markov decision processes. Handbook of Econometrics, 4, 3081–3143.
Ma, Q., Stachurski, J., & Toda, A. A. (2022). Unbounded dynamic programming via the Q-transform. Journal of Mathematical Economics, 100, 102652.
Föllmer, H., & Schied, A. (2002). Convex Measures of Risk and Trading Constraints. Finance and Stochastics, 6(4), 429–447. doi:10.1007/s007800200072
Jouini, E., Schachermayer, W., & Touzi, N. (2006). Law Invariant Risk Measures Have the Fatou Property. In S. Kusuoka & A. Yamazaki (Eds.), Advances in Mathematical Economics (Vol. 9, pp. 49–71). Springer. doi:10.1007/4-431-34342-3_4
Denardo, E. V. (1967). Contraction Mappings in the Theory Underlying Dynamic Programming. SIAM Review, 9(2), 165–177.
Bertsekas, D. P. (2022). Abstract dynamic programming (3rd ed.). Athena Scientific.
Bäuerle, N., & Jaśkiewicz, A. (2018). Stochastic optimal growth model with risk sensitive preferences. Journal of Economic Theory, 173, 181–200.
Toda, A. A. (2023). Unbounded Markov Dynamic Programming with Weighted Supremum Norm Perov Contractions.
Mirman, L. J., & Zilcha, I. (1975). On optimal growth under uncertainty. Journal of Economic Theory, 11(3), 329–339.
Benveniste, L. M., & Scheinkman, J. A. (1979). On the differentiability of the value function in dynamic models of economics. Econometrica, 47(3), 727–732.
de Castro, L., & Galvao, A. F. (2019). Dynamic quantile models of rational behavior. Econometrica, 87(6), 1893–1939.
de Castro, L., & Galvao, A. F. (2022). Static and dynamic quantile preferences. Economic Theory, 73(2–3), 747–779.
Almeida, H., Campello, M., de Castro, L. I., & Galvao Jr, A. F. (2024). A Quantile Model of Firm Investment. Technical report, National Bureau of Economic Research.
