Linear Decision Processes - Dynamic Programming Volume II: General States

In this chapter, we define linear decision processes (LDPs). In terms of level of generality, we can think of LDPs as sitting between MDPs, as discussed in Section 1.2, and the ADPs we considered in Chapter 2–Chapter 5:

\text{MDPs } \subset \text{ LDPs } \subset \text{ ADPs},

with all inclusions being strict. LDPs have a range of advantages over MDPs while maintaining much of their tractability. One advantage is that we can work with state-dependent discounting, which is particularly important for economic and financial applications. Another is that their flexible structure makes them easy to apply. For example, optimal stopping problems can be embedded directly into the LDP framework, whereas embedding optimal stopping problems into MDPs requires expanding the state space.

LDPs differ from ADPs by including actions explicitly, instead of taking policy operators as the basic primitive. This is a more traditional perspective: one where controllers observe states and respond to those states by choosing actions. Ultimately, a choice of action given a state will take the form of a policy function; that will lead us back to the ADPs. By studying this circle, we can leverage theory from our earlier chapters.

LDPs are more limited than ADPs, but also more concrete and more structured. For example, they provide an algebraic formula for computing lifetime values similar to the one available for MDPs (see, e.g., (1.18)). This formula is not available for general ADPs. Thus, for LDPs, the HPI step requiring computation of lifetime values from policies is fully articulated, at least at a theoretical level. Another advantage of LDPs, relative to ADPs, is that we can start to construct systematic conditions for regularity, or existence of greedy policies, unlike in previous chapters.

We begin with the theoretical foundations in Section 6.1. After introducing Feller properties in Section 6.1.1, we define LDPs in Section 6.1.2.1 and present optimality results. We then treat exogenous discount processes (Section 6.1.4) and specialize the LDP framework to MDPs on general state spaces (Section 6.1.5). In this chapter, we focus on the bounded case; unbounded models are handled in Chapter 7. Section 6.2.1 and Section 6.2.2 apply the theory to natural resource management and optimal savings with stochastic rates of return.

6.1 Theory

In this section we develop the foundational theory of linear decision processes. We begin in Section 6.1.1 by studying Feller properties of transition kernels, which provide the continuity conditions needed for existence of optimal policies. We then define LDPs in Section 6.1.2.1, give examples, and discuss lifetime values. Next we present optimality results and their implications. Section 6.1.4 treats exogenous discount processes, and Section 6.1.5 specializes the LDP framework to Markov decision processes on general state spaces.

6.1.1 Feller Properties

Since we are always interested in whether or not optimal policies exist, we study conditions under which future values are continuous in states and actions. In the case of LDPs, this continuity will require that integrals of transition kernels vary continuously with actions. (For background on transition kernels see Section A.5.4.1.) Here we provide a collection of definitions and results that help us address this question.

Throughout this section,

$\Xsf$ and $\Asf$ are separable metric spaces,
$\| \cdot \| \coloneq \| \cdot \|_\infty$ is the supremum norm,
$\Gsf$ is a subset of $\Xsf \times \Asf$ , and
$K$ is a transition kernel mapping $b\Xsf$ to $b\Gsf$ .

Here you can think of $\Gsf$ as a collection of feasible state-action pairs. The last statement means that

(Kv)(x, a) \coloneq \int v(x') K(x, a, \diff x')

is in $b\Gsf$ whenever $v \in b\Xsf$ .

Extending standard terminology, we will say that the transition kernel $K$ is

weak Feller if $Kh$ is continuous on $\Gsf$ whenever $h \in b c\Xsf$ and
strong Feller if $Kh$ is continuous on $\Gsf$ whenever $h \in b \Xsf$ .

Let’s look at some special cases.

Example 6.1.1

Suppose that $\Wsf$ is another metric space and that $K$ has the form

(Kh)(x, a) = \beta(x, a) \int h(F(x, a, w)) \phi(\diff w) \qquad ((x, a) \in \Gsf),

where $\phi$ is a distribution on $\Wsf$ , $F \colon \Gsf \times \Wsf \to \Xsf$ is Borel measurable, $(x, a) \mapsto F(x, a, w)$ is continuous for all $w \in \Wsf$ , and $\beta \in bc\Gsf$ . This corresponds to the case where discounting depends on states and actions, while the state evolves according to

X_{t+1} = F(X_t, A_t, W_{t+1}) \quad \text{with} \quad (W_t)_{t \geq 1} \iidsim \phi \in \dD(\Wsf).

In this setting, $K$ is weak Feller. Indeed, taking $(x_n, a_n) \to (x, a)$ in $\Gsf$ and assuming $h \in b c\Xsf$ , the dominated convergence theorem yields

\beta(x_n, a_n) \int h(F(x_n, a_n, w)) \phi(\diff w) \to \beta(x, a) \int h(F(x, a, w)) \phi(\diff w)

as $n \to \infty$ . In particular, $Kh$ is continuous on $\Gsf$ . More generally, $K$ is weak Feller whenever $(x, a) \mapsto F(x, a, w)$ is continuous for $\phi$ -almost all $w \in \Wsf$ .

The strong Feller property requires more conditions, since we need to map a potentially discontinuous function $h$ into a continuous function $Kh$ . For this, we rely on smoothing properties of the integral. To obtain these properties we introduce a “dominating” measure $\mu$ on $(\Xsf, \bB)$ , which we assume to be $\sigma$ -finite. A Borel measurable map $p$ from $\Gsf \times \Xsf$ to $\RR$ is called a density kernel from $\Gsf$ to $\Xsf$ with dominating measure $\mu$ if $p$ is nonnegative and

\int p(x, a, x') \mu(\diff x') = 1 \quad \text{for all } (x, a) \in \Gsf.

We say that a stochastic kernel $P$ from $\Gsf$ to $\Xsf$ has density kernel $p$ with dominating measure $\mu$ if $p$ is a density kernel from $\Gsf$ to $\Xsf$ with dominating measure $\mu$ and

P(x, a, B) = \int_B p(x, a, x') \mu(\diff x') \quad \text{for all } (x, a, B) \in \Gsf \times \bB.

If the dominating measure $\mu$ is not identified in the discussion below then we will be referring to Lebesgue measure, and we write $\diff x$ instead of $\mu(\diff x)$ . The following lemma shows how a continuous density kernel can transform discontinuous functions into continuous ones under integration.

Proof

Fix $h \in b\Xsf$ . Since products of continuous functions are continuous, we need only show that $(Ph)(x,a) \coloneq \int h(x') p(x, a, x') \mu(\diff x')$ is continuous. Given $(x_n, a_n) \to (x, a)$ in $\Gsf$ , we have

|(Ph)(x_n, a_n) - (Ph)(x, a)| \leq \|h\| \int |p(x_n, a_n, x') - p(x, a, x')| \mu(\diff x').

The continuity condition on $p$ gives $p(x_n, a_n, x') \to p(x, a, x')$ for $\mu$ -almost all $x' \in \Xsf$ , so Scheffé’s lemma applies. This yields $(Ph)(x_n, a_n) \to (Ph)(x,a)$ . The strong Feller property follows. ◻

Example 6.1.2

Suppose that $\Gsf \subset \RR^k$ and $\Xsf = \RR^m$ . Let $g$ be a continuous map from $\Gsf$ to $\Xsf$ , and let $\beta \colon \Xsf \to \RR$ be continuous and bounded. We consider the transition kernel from $\Gsf$ to $\Xsf$ given by

(Kh)(x, a) = \beta(x) \int h[g(x, a) + w] \phi(w) \diff w \qquad ((x, a) \in \Gsf),

(6.2)

where the density $\phi$ is continuous on $\Xsf$ . In this setting, $K$ is strong Feller. Indeed, fix $h \in b\Xsf$ . The change of variable $x' = g(x, a) + w$ yields

\begin{aligned} (Kh)(x, a) & = \beta(x) \int h[ g(x, a) + w ] \phi(w) \diff w \\ & = \beta(x) \int h(x') \phi(x' - g(x, a)) \diff x'. \end{aligned}

The strong Feller property now follows from continuity of the functions $\phi$ and $g$ , combined with Lemma 6.1.1.
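The smoothing effect in Example 6.1.2 can be checked numerically. The sketch below takes a discontinuous $h$ (an indicator function) and approximates $Kh$ with a Riemann sum. The scalar setting, the map `g`, the function `beta`, and the standard normal density `phi` are illustrative assumptions and not part of the example.

```python
import math
import numpy as np

# Illustrative assumptions (not from the text): scalar state and action,
# g(x, a) = x - a, a standard normal shock density, and a smooth beta.
GRID = np.linspace(-10.0, 10.0, 200_001)   # integration grid for x'
DX = GRID[1] - GRID[0]

def phi(w):
    """Standard normal density."""
    return np.exp(-0.5 * w**2) / math.sqrt(2.0 * math.pi)

def g(x, a):
    """Hypothetical continuous map from state-action pairs to states."""
    return x - a

def beta(x):
    """Hypothetical bounded, continuous discount function."""
    return 0.9 + 0.05 * math.cos(x)

def h(x_next):
    """Bounded but discontinuous: the indicator of [0, infinity)."""
    return (x_next >= 0.0).astype(float)

def K_h(x, a):
    """Riemann-sum approximation of beta(x) * int h(x') phi(x' - g(x, a)) dx'."""
    return beta(x) * float(np.sum(h(GRID) * phi(GRID - g(x, a))) * DX)

def K_h_exact(x, a):
    """For this h, (Kh)(x, a) = beta(x) * Phi(g(x, a)), with Phi the normal CDF."""
    return beta(x) * 0.5 * (1.0 + math.erf(g(x, a) / math.sqrt(2.0)))
```

Although $h$ jumps at zero, `K_h_exact` is a composition of continuous functions, and the values of `K_h` should track it closely as `a` varies near the point where $g(x, a) = 0$.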

6.1.2 LDPs

We now introduce LDPs and study their basic properties. Section 6.1.2.1 defines LDPs and connects them to the ADP framework. We then present several examples, showing which models can and cannot be expressed as LDPs. Finally, we discuss lifetime values and their computation in the LDP setting.

6.1.2.1 Definition

Let $\Xsf$ and $\Asf$ be separable metric spaces, referred to henceforth as the state and action spaces. As before, $\| \cdot \|$ denotes the supremum norm on $b \Xsf$ . Given $\Xsf$ and $\Asf$ , a linear decision process (LDP) is a tuple $(\Gamma, r, K)$ containing

a nonempty correspondence $\Gamma$ from $\Xsf$ to $\Asf$ called the feasible correspondence, with an associated set of feasible state-action pairs

\Gsf \coloneq \graph \Gamma = \setntn{(x, a) \in \Xsf \times \Asf}{a \in \Gamma(x)},

(6.3)

a bounded Borel measurable reward function $r$ mapping $\Gsf$ into $\RR$ , and
a transition kernel $K$ from $\Gsf$ to $\Xsf$ satisfying $Kv \in b\Gsf$ whenever $v \in b\Xsf$ .

The set $\Gamma(x)$ represents all actions available to a controller in state $x$ . Figure 6.1 shows an illustration of one possible correspondence $\Gamma$ when $\Asf = \Xsf = \RR_+$ , along with $\Gsf$ , the resulting set of feasible state-action pairs. When representing the LDP by the tuple $(\Gamma, r, K)$ , we are treating $\Xsf$ and $\Asf$ as understood from context.

Figure 6.1: Feasible correspondence and feasible state-action pairs

For the LDP $(\Gamma, r, K)$ , a feasible policy is a Borel measurable map $\sigma \colon \Xsf \to \Asf$ such that $\sigma(x) \in \Gamma(x)$ for all $x \in \Xsf$ . Figure 6.2 shows a feasible policy $\sigma$ in the same setting.

Figure 6.2: The action $\sigma(x)$ lies in $\Gamma(x)$ for all $x$

We let $\Sigma$ denote the set of all feasible policies. With these policies in hand, we define the set of policy operators associated with $(\Gamma, r, K)$ via

(T_\sigma \, v)(x) = r(x, \sigma(x)) + \int v(x') K(x, \sigma(x), \diff x') \qquad (x \in \Xsf),

(6.4)

where $v$ varies over $b \Xsf$ .

The assumption that $\Xsf$ and $\Asf$ are metric spaces is important in some applications and irrelevant in others. For simplicity, we maintain it throughout. When $\Xsf$ and $\Asf$ are discrete, the metric in question is always understood to be the discrete metric. In this case, every subset of these sets is a Borel set, so the measurability constraint in the definition of $\Sigma$ never binds.

6.1.2.2 ADP Representation

With

K_\sigma(x, x') \coloneq K(x, \sigma(x), x') \quad \text{and} \quad r_\sigma(x) \coloneq r(x, \sigma(x)),

we can also write the policy operator (6.4) as

T_\sigma \, v = r_\sigma + K_\sigma \, v.

Given $v \in b \Xsf$ , we have $Kv \in b\Gsf$ and hence $K_\sigma v \in b\Xsf$ . Since $b \Xsf$ is a vector space, it follows that $T_\sigma \, v$ is in $b \Xsf$ . Since $K$ is a transition kernel, $K_\sigma$ is a positive linear operator, so $T_\sigma$ is order preserving. Hence

(b \Xsf, \TT_{\rm LDP}) \quad \text{with} \quad \TT_{\rm LDP} \coloneq \setntn{T_\sigma}{\sigma \in \Sigma}

is an ADP. We call $(b\Xsf, \TT_{\rm LDP})$ the ADP generated by $(\Gamma, r, K)$ and use the following obvious conventions:

$(\Gamma, r, K)$ is called well-posed (resp., regular, order stable, etc.) if $(b\Xsf, \TT_{\rm LDP})$ is well-posed (resp., regular, order stable, etc.),
$v_\sigma$ is the $\sigma$ -value function for $(\Gamma, r, K)$ when $v_\sigma$ is the $\sigma$ -value function for $(b\Xsf, \TT_{\rm LDP})$ ,
$\sigma$ is called optimal for $(\Gamma, r, K)$ when $\sigma$ is optimal for $(b\Xsf, \TT_{\rm LDP})$ ,
etc.

We notice that each $T_\sigma$ has the affine form from the ADP analysis in Section 4.1.2.3, with $K_\sigma \in \blop_+(b\Xsf)$ by Theorem A.5.25. We will use the theorems in that section for some of our optimality results.

6.1.2.3 Examples

Let’s discuss some examples. Some but not all of these examples can be framed as LDPs.

6.1.2.4 Lifetime Values

Let $(\Gamma, r, K)$ be an LDP with state space $\Xsf$ and action space $\Asf$ . Given a policy $\sigma \in \Sigma$ , the $\sigma$ -value function $v_\sigma$ is defined as the fixed point of the policy operator $T_\sigma$ in (6.4). As a result, $v_\sigma$ satisfies the recursion

v_\sigma = r_\sigma + K_\sigma v_\sigma.

(6.5)

If the spectral radius condition $\rho(K_\sigma) < 1$ holds, then, by the Neumann series lemma (see, in particular, Corollary A.4.11), the operator $I - K_\sigma$ is invertible on $b \Xsf$ and the unique solution to (6.5) is

v_\sigma = (I - K_\sigma)^{-1} r_\sigma = \sum_{t=0}^{\infty} K_\sigma^t r_\sigma.

(6.6)

The $t$ -th term $K_\sigma^t r_\sigma$ gives the expected reward at time $t$ under policy $\sigma$ , discounted back to the present.

The explicit representation of $v_\sigma$ in (6.6) is valuable for computation. For example, the MDP version of HPI in Algorithm 1.2.2 can be extended to the current setting by replacing $v \leftarrow (I - \beta P_{\sigma} )^{-1} r_{\sigma}$ with $v \leftarrow (I - K_{\sigma} )^{-1} r_{\sigma}$ . Under the conditions of Proposition 6.1.3, with $K$ strong Feller, this algorithm converges.
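To make the formula in (6.6) concrete, here is a minimal sketch for a finite state space, where $K_\sigma$ is a matrix. The three-state example, with $K_\sigma(x, x') = \beta(x) P_\sigma(x, x')$ , and all of its numerical values are hypothetical. The code solves the linear system and offers a truncated Neumann series for comparison.

```python
import numpy as np

# Hypothetical finite LDP under a fixed policy sigma: three states,
# state-dependent discounting beta(x), and a stochastic matrix P_sigma.
beta = np.array([0.90, 0.95, 0.99])
P_sigma = np.array([[0.8, 0.2, 0.0],
                    [0.1, 0.8, 0.1],
                    [0.0, 0.2, 0.8]])
r_sigma = np.array([1.0, 0.5, 2.0])

# K_sigma(x, x') = beta(x) * P_sigma(x, x')
K_sigma = beta[:, None] * P_sigma

def spectral_radius(A):
    """Largest modulus among the eigenvalues of A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

def lifetime_value(K, r):
    """Solve v = r + K v, i.e. v = (I - K)^{-1} r, assuming rho(K) < 1."""
    if spectral_radius(K) >= 1.0:
        raise ValueError("spectral radius condition rho(K) < 1 fails")
    return np.linalg.solve(np.eye(len(r)) - K, r)

def neumann_sum(K, r, n_terms):
    """Partial sum of the series sum_t K^t r."""
    total, term = np.zeros_like(r), r.copy()
    for _ in range(n_terms):
        total += term
        term = K @ term
    return total

v_sigma = lifetime_value(K_sigma, r_sigma)
```

Solving the linear system is usually preferable to forming the inverse explicitly. The partial sums from `neumann_sum` should approach `v_sigma` as the number of terms grows.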

6.1.3 Optimality Results

Now we turn to optimality results. We first treat the case where $\TT$ is finite, and then shift to general (metric) state and action spaces by adding continuity conditions. We conclude by deriving implications for greedy policies and the Bellman operator.

In the following, we suppose that $(\Gamma, r, K)$ is an LDP with state space $\Xsf$ and action space $\Asf$ . As before, these sets are separable metric spaces (with the discrete topology when finite). As shown in Section 6.1.2.2, the LDP $(\Gamma, r, K)$ generates an ADP $(b\Xsf, \TT_{\rm LDP})$ where each $T_\sigma \in \TT_{\rm LDP}$ has the affine form $T_\sigma v = r_\sigma + K_\sigma v$ . We will infer optimality of the LDP by studying this ADP.

6.1.3.1 Results

First we present a result that works for the finite case.

To shift to the general case, we inject some continuity.

Recalling the definition of a discount operator, we can state the following result.

Proof

Let the stated conditions hold and let $(b\Xsf, \TT_{\rm LDP})$ be the ADP generated by $(\Gamma, r, K)$ . We apply Theorem 4.1.8 with $V = b\Xsf$ and $V_0 = bc\Xsf$ . The ADP has the required affine form $T_\sigma \, v = r_\sigma + K_\sigma \, v$ with $K_\sigma \in \blop_+(b\Xsf)$ , and the discount operator condition $K_\sigma \leq D$ on $b\Xsf_+$ holds by hypothesis. We verify semi-regularity on $bc\Xsf$ . Fix $v \in bc\Xsf$ . Since $r$ is continuous on $\Gsf$ by Assumption 6.1.1 and $K$ is weak Feller, the map $(x, a) \mapsto r(x,a) + \int v(x') K(x, a, \diff x')$ is continuous on $\Gsf$ . Combined with our assumptions on $\Gamma$ , Theorem A.3.3 implies that a $v$ -greedy policy exists and that $Tv \in bc\Xsf$ . Hence $bc\Xsf \subset V_G$ and $T(bc\Xsf) \subset bc\Xsf$ .

Since $bc\Xsf$ is closed in $b\Xsf$ , claims (i)–(iii) follow from Theorem 4.1.8. For the last claim, if $K$ is strong Feller, the same argument applies to any $v \in b\Xsf$ , giving regularity. OPI and HPI convergence then follow from Theorem 4.1.8. ◻

6.1.3.2 Implications

Let $(\Gamma, r, K)$ be a given LDP. Clearly $\sigma \in \Sigma$ is $v$ -greedy for $(\Gamma, r, K)$ if and only if

r(x, \tau(x)) + \int v(x') K(x, \tau(x), \diff x') \leq r(x, \sigma(x)) + \int v(x') K(x, \sigma(x), \diff x')

(6.7)

for all $\tau \in \Sigma$ and $x \in \Xsf$ . The Bellman operator obeys

(Tv)(x) = \sup_{\sigma \in \Sigma} \left\{ r(x, \sigma(x)) + \int v(x') K(x, \sigma(x), \diff x') \right\} \qquad (x \in \Xsf).

If, say, the conditions of Proposition 6.1.3 hold and $K$ is strong Feller, then, for every $v \in b \Xsf$ , there always exists a $\sigma \in \Sigma$ obeying

\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \int v(x') K(x,a, \diff x') \right\} \quad \text{for all } x \in \Xsf.

(6.8)

(See the proof of Proposition 6.1.3.) In this setting, a policy $\sigma \in \Sigma$ is $v$ -greedy if and only if (6.8) holds. Moreover, the Bellman operator simplifies to

(Tv)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \int v(x') K(x, a, \diff x') \right\}

(6.9)

for every $v \in b \Xsf$ . These expressions remain valid in the weak Feller setting when we restrict to $v \in b c \Xsf$ .
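For finite state and action spaces, (6.8) and (6.9) reduce to a row-wise maximization over feasible actions. The sketch below implements the Bellman operator, a greedy policy, and a simple version of HPI under that assumption. The array layout (`r[x, a]`, `K[x, a, x']`, a boolean mask `feasible[x, a]` for $\Gsf$ ) and the random test instance are our own illustrative choices.

```python
import numpy as np

def bellman(v, r, K, feasible):
    """Return (Tv, a v-greedy policy) for a finite LDP.

    r[x, a] is the reward, K[x, a, x'] the transition kernel, and
    feasible[x, a] is True when a is in Gamma(x).
    """
    q = r + K @ v                       # q[x, a] = r(x, a) + sum_x' K(x, a, x') v(x')
    q = np.where(feasible, q, -np.inf)  # rule out infeasible actions
    return q.max(axis=1), q.argmax(axis=1)

def policy_value(sigma, r, K):
    """Lifetime value v_sigma = (I - K_sigma)^{-1} r_sigma."""
    idx = np.arange(len(sigma))
    return np.linalg.solve(np.eye(len(sigma)) - K[idx, sigma], r[idx, sigma])

def howard_policy_iteration(r, K, feasible, max_iter=100):
    """Alternate policy evaluation and greedy improvement until the policy repeats."""
    sigma = feasible.argmax(axis=1)     # first feasible action in each state
    for _ in range(max_iter):
        v = policy_value(sigma, r, K)
        _, sigma_new = bellman(v, r, K, feasible)
        if np.array_equal(sigma_new, sigma):
            break
        sigma = sigma_new
    return v, sigma

# A small hypothetical instance with state-dependent discounting.
rng = np.random.default_rng(0)
n, m = 4, 3
P = rng.random((n, m, n))
P /= P.sum(axis=2, keepdims=True)       # each P[x, a, :] is a distribution
beta = np.array([0.90, 0.95, 0.97, 0.99])
K = beta[:, None, None] * P
r = rng.random((n, m))
feasible = np.ones((n, m), dtype=bool)
feasible[0, 2] = False                  # action 2 is unavailable in state 0

v_star, sigma_star = howard_policy_iteration(r, K, feasible)
```

When the loop stops because the greedy policy repeats, the returned `v_star` is a fixed point of the Bellman operator, which can be confirmed by applying `bellman` once more.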

6.1.4 Exogenous Discount Processes

In Chapter 4 we looked at several settings that include state-dependent discounting. In each case the setting was relatively simple: either a binary stopping problem or a model with discrete states and actions. Here we’ll look at a problem with continuous state and action spaces. To make this setting tractable, we’ll insist that the discount factor process depends only on an exogenous state (i.e., a state that is not influenced by decisions of the agent).

6.1.4.1 Discount Factor Processes

When the discount factor varies over time, forming a sequence $(\beta_t)_{t \geq 0}$ , the present value of a random time $t$ payoff $H_t$ has the general form $\EE \, \beta_0 \cdots \beta_{t-1} H_t$ . In this section we formalize this idea in a Markov environment and examine some simple consequences.

Let $\Zsf$ be a metric space and let $Q$ be a stochastic kernel on $\Zsf$ . Let $(Z_t)$ be $Q$ -Markov on $\Zsf$ . Let $\beta \in b\Zsf$ be a nonnegative function and consider the discount factor process $(\beta_t)_{t \geq 0}$ where $\beta_t \coloneq \beta(Z_t)$ for all $t$ . We introduce the operator

(K_Q h)(z) := \beta(z) \int h(z') Q(z, \diff z')

(6.10)

Our next lemma connects powers of $K_Q$ to expected present values.

We can confirm this rather natural expression by induction.

Proof

When $n=1$ , we have $(K_Q h)(z) = \beta(z) \int h(z') Q(z, \diff z') = \EE_z \, \beta_0 \, h(Z_1)$ . For the inductive step, suppose the claim holds at $n$ . Then

\begin{aligned} (K_Q^{n+1} h)(z) & = (K_Q (K_Q^n h))(z) = \beta(z) \int (K_Q^n h)(z') Q(z, \diff z') \\ & = \beta(z) \int \EE_{z'} \, \beta_0 \cdots \beta_{n-1} h(Z_n) \, Q(z, \diff z') \\ & = \EE_z \, \beta_0 \cdots \beta_n \, h(Z_{n+1}), \end{aligned}

where the last step uses the law of iterated expectations and the Markov property. ◻
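The identity established in this proof, $(K_Q^n h)(z) = \EE_z \, \beta_0 \cdots \beta_{n-1} h(Z_n)$ , can be checked by brute force when $\Zsf$ is finite. The sketch below compares matrix powers of $K_Q$ with an explicit sum over all paths of the chain. The two-state example and its numbers are hypothetical.

```python
import itertools
import numpy as np

# Hypothetical two-state example; note that beta exceeds 1 in state 0.
beta = np.array([1.02, 0.90])
Q = np.array([[0.5, 0.5],
              [0.5, 0.5]])
h = np.array([1.0, 3.0])
K_Q = beta[:, None] * Q      # (K_Q h)(z) = beta(z) * sum_z' h(z') Q(z, z')

def expected_discounted_payoff(z0, n):
    """E_z[beta_0 ... beta_{n-1} h(Z_n)] by summing over all paths from z0."""
    total = 0.0
    for path in itertools.product(range(len(beta)), repeat=n):
        states = (z0,) + path            # (Z_0, Z_1, ..., Z_n)
        prob = np.prod([Q[s, t] for s, t in zip(states[:-1], states[1:])])
        discount = np.prod(beta[list(states[:-1])])   # beta(Z_0) ... beta(Z_{n-1})
        total += prob * discount * h[states[-1]]
    return total

n = 4
via_powers = np.linalg.matrix_power(K_Q, n) @ h
via_paths = np.array([expected_discounted_payoff(z, n) for z in range(len(beta))])
```

The two arrays `via_powers` and `via_paths` should agree up to floating point error. Path enumeration grows exponentially in `n`, so it is useful only as a check.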

The next result follows from Gelfand’s formula for the spectral radius; the details of the argument can be found in Example 4.1.1.

Now consider pricing an infinite horizon cash flow $(h(Z_t))_{t \geq 0}$ . We set

q(z) \coloneq \EE_z \, \sum_{t \geq 0} \prod_{i=0}^{t-1} \beta(Z_i) \cdot h(Z_t).
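Since the $t$ -th term of this sum equals $(K_Q^t h)(z)$ , the Neumann series gives $q = (I - K_Q)^{-1} h$ whenever $\rho(K_Q) < 1$ . The following sketch illustrates this for a finite $\Zsf$ , along with the Gelfand-type limit $\| K_Q^n \1 \|^{1/n} \to \rho(K_Q)$ . The two-state example is hypothetical and is chosen so that the discount factor exceeds one in one state while the spectral radius stays below one.

```python
import numpy as np

beta = np.array([1.02, 0.90])   # hypothetical: discount factor above 1 in state 0
Q = np.array([[0.5, 0.5],
              [0.5, 0.5]])
h = np.array([1.0, 3.0])
K_Q = beta[:, None] * Q

# Spectral radius from the eigenvalues of the matrix K_Q.
rho = float(np.max(np.abs(np.linalg.eigvals(K_Q))))

def gelfand_estimate(K, n):
    """Return ||K^n 1||^(1/n) in the sup norm, which tends to rho(K) as n grows.

    For a nonnegative matrix K, the operator norm of K^n equals ||K^n 1||.
    """
    ones = np.ones(K.shape[0])
    return float(np.max(np.linalg.matrix_power(K, n) @ ones) ** (1.0 / n))

def price(K, h):
    """Present value q = (I - K)^{-1} h, assuming rho(K) < 1."""
    return np.linalg.solve(np.eye(len(h)) - K, h)

q = price(K_Q, h)
```

In this example the price is finite even though $\beta(z) > 1$ in one state, because only the long-run average behavior of the discount process enters the spectral radius.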

6.1.4.2 An LDP with Exogenous Discounting

Let $\Xsf$ and $\Asf$ be separable metric spaces and let $(\Gamma, r, K)$ be an LDP with state space $\Xsf$ and action space $\Asf$ . Suppose further that $\Xsf$ is a product space of the form $\Ysf \times \Zsf$ and that $K$ has the form

(Kh)(x, a) = (Kh)(y,z, a) = \beta(z) \int \sum_{z'} h(y', z') Q(z, z') R(y, z, a, \diff y') ,

where

$R$ is a stochastic kernel from $\Gsf$ to $\Ysf$ ,
$Q$ is a stochastic kernel from $\Zsf$ to $\Zsf$ , and
$\beta$ is an element of $bc\Zsf$ .

We call $R$ the endogenous kernel, $Q$ the exogenous kernel and $\beta$ the discount function. The expression for $K$ tells us that the endogenous state $y$ updates via the kernel $R$ , depending on current state $x=(y,z)$ and action $a$ , while $z$ updates via $Q$ . Since we are taking products, the two updates are independent. The exogenous process feeds into values and hence optimal policies through its impact on the discount factor.

To make our lives slightly easier, we’ll assume that $\Zsf$ is finite. As with every other finite set, we endow $\Zsf$ with the discrete topology.

Let $K_Q$ be defined as in (6.10). In this setting, we have the following result.

Proof

We apply Proposition 6.1.3. Assumption 6.1.1 holds by hypothesis, so it remains to verify that (i) $K$ is weak Feller and (ii) there exists a discount operator $D$ on $b\Xsf$ such that $K_\sigma \leq D$ on $b\Xsf_+$ for all $\sigma \in \Sigma$ .

To verify (i), we fix $h \in bc\Xsf$ . Since $\Zsf$ is finite (and has the discrete topology), it suffices to show $(y,a) \mapsto (Kh)(y, z, a)$ is continuous for fixed $z \in \Zsf$ . Fix any such $z$ . For each $z' \in \Zsf$ , the map $y' \mapsto h(y', z')$ is continuous on $\Ysf$ (since $h \in bc\Xsf$ and $\Zsf$ is discrete), so Assumption 6.1.2 implies that $(y, a) \mapsto \int h(y', z') R(y, z, a, \diff y')$ is continuous on $\Gsf_z$ . This implies that

(y, a) \mapsto \int \sum_{z'} h(y', z') Q(z, z') R(y, z, a, \diff y') = (Kh)(y, z, a)

is continuous. Hence $K$ is weak Feller.

To verify (ii), we introduce the operator $D$ on $b\Xsf$ via

(D h)(y, z) \coloneq \beta(z) \sup_{a \in \Gamma(y,z)} \int \sum_{z'} h(y', z')Q(z, z') R(y, z, a, \diff y') .

Since $D$ takes the supremum over $a$ , we immediately have $K_\sigma \leq D$ on $b\Xsf_+$ for all $\sigma \in \Sigma$ . It remains to show that $D$ is a discount operator (i.e., $D0 = 0$ , $D$ is order-preserving, and $D$ is eventually contracting). The first two properties are clear. For eventual contractivity, fix $h \in b\Xsf_+$ and observe that, since $h \leq \|h\|$ and $R, Q$ are stochastic kernels,

(Dh)(y, z) \leq \beta(z) \|h\| = \|h\| \cdot (K_Q \1)(z).

Since $D$ is order preserving, we can iterate on this bound to obtain

\begin{aligned} (D^2 h)(y, z) & \leq \beta(z) \sup_{a \in \Gamma(y,z)} \int \sum_{z'} \|h\| \cdot (K_Q \1)(z') \, Q(z, z') R(y, z, a, \diff y') \\ & = \|h\| \cdot \beta(z) \sum_{z'} (K_Q \1)(z') \, Q(z, z') = \|h\| \cdot (K_Q^2 \1)(z). \end{aligned}

Continuing to iterate, we obtain, for all $(y, z) \in \Xsf$ and all $n \in \NN$ ,

(D^n h)(y, z) \leq \| h \| \cdot (K_Q^n \1)(z).

(6.11)

Taking the supremum over the right-hand side and then the left yields $\| D^n h \| \leq \| h \| \cdot \| K_Q^n \1 \|$ . Since $\rho(K_Q) < 1$ , Lemma 6.1.5 provides an $n \in \NN$ and $\lambda \in [0,1)$ with $\| K_Q^n \1 \| \leq \lambda$ , so $D$ is eventually contracting. We conclude that $D$ is a discount operator on $b\Xsf$ , and the claims follow from Proposition 6.1.3. ◻

6.1.5 Markov Decision Processes

We treated discrete MDPs in Section 1.2. Let’s now consider MDPs on general state spaces. Mathematically, MDPs are LDPs with a fixed discount factor and Markov dynamics under any fixed policy. On one hand, MDPs are a special case of LDPs and need no separate theoretical discussion. On the other hand, MDPs are a benchmark representation of a dynamic program, used throughout mathematics, operations research, and computer science. For this reason we’ll take the time to specialize our LDP results to the Markov setting. Throughout this section, $\Xsf$ and $\Asf$ are separable metric spaces.

6.1.5.1 Theory

Let $(\Gamma, r, K)$ be an LDP with state space $\Xsf$ and action space $\Asf$ . This LDP is called a Markov Decision Process (MDP) when the transition kernel has the form

\int v(x') K(x, a, \diff x') = \beta \int v(x') P(x, a, \diff x')

(6.12)

for some $\beta \in [0, 1)$ and some stochastic kernel $P$ from $\Gsf$ to $\Xsf$ .

The MDP above will be represented by the tuple $(\Gamma, r, \beta, P)$ . The ADP generated by this MDP will be denoted $(b\Xsf, \TT_{\rm MDP})$ , where

T_\sigma \, v = r_\sigma + \beta P_\sigma \, v, \quad \text{with} \quad r_\sigma(x) \coloneq r(x, \sigma(x)) \quad \text{and} \quad P_\sigma(x, \diff x') \coloneq P(x, \sigma(x), \diff x').

(6.13)

Choosing a policy $\sigma$ picks out a stochastic kernel $P_\sigma$ on $\Xsf$ , so choosing a policy is akin to picking an $\Xsf$ -valued Markov process.

The following optimality result is an immediate consequence of Proposition 6.1.3.

Example 6.1.8

The basic optimal savings problem we studied in Section 1.3 is a strong Feller MDP. To put the model in this framework we set

$\Xsf = \Asf = \RR_+$ ,
$\Gamma(x) = [0, x]$ ,
$r(x, a) = u(a)$ , and
$p(x, a, x') = \phi(x' - R(x - a))$ ,

where $\phi$ is the continuous density of the income process. Using the change of variable $y = x' - R(x-a)$ , we can write the Bellman equation as

\begin{aligned} v(x) & = \max_{0 \leq a \leq x} \left\{ u(a) + \beta \int v(x') p(x, a, x') \diff x' \right\} \\ & = \max_{0 \leq a \leq x} \left\{ u(a) + \beta \int v(R(x - a) + y) \phi(y) \diff y \right\} \end{aligned}

(To simplify the change of variable argument, we are assuming that $\phi$ is defined on all of $\RR$ with $\phi(y)=0$ whenever $y \leq 0$ . The integrals above are taken over all of $\RR$ .) The optimality results we obtained for the optimal savings model in Section 1.3 can be recovered from Proposition 6.1.7.

6.1.5.2 Implications

Since MDPs are such an important special case, we briefly specialize the implications from Section 6.1.3.2 to the MDP setting, replacing the general transition kernel $K$ with $\beta P$ .

If the conditions of Proposition 6.1.7 hold and $P$ is strong Feller, then, for every $v \in b \Xsf$ , there exists a $\sigma \in \Sigma$ obeying

\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \int v(x') P(x,a, \diff x') \right\} \quad \text{for all } x \in \Xsf,

(6.14)

and a policy $\sigma \in \Sigma$ is $v$ -greedy if and only if (6.14) holds. Moreover, the Bellman operator simplifies to

(Tv)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \int v(x') P(x, a, \diff x') \right\}

(6.15)

for every $v \in b \Xsf$ . These expressions remain valid without the strong Feller condition when we restrict to $v \in b c \Xsf$ .

6.2 Applications

We apply the theory developed above to two classes of problems. In Section 6.2.1 we study a natural resource management problem with state-dependent discounting. In Section 6.2.2 we analyze an optimal savings problem with stochastic rates of return on assets.

6.2.1 Natural Resource Management

We consider a natural resource management application with Bellman equation

v(y, z) = \max_{0 \leq e \leq y} \left\{ \pi(e) + \beta(z) \int \sum_{z'} v(f(y - e) \xi, z') Q(z, z')\phi(\diff \xi) \right\}.

Here $y$ is the stock of the resource, $e$ is the current usage, $Q$ is a stochastic kernel on finite set $\Zsf$ , $\phi$ is a distribution on $\RR_+$ , $\pi \colon \RR_+ \to \RR$ is a profit function, $f \colon \RR_+ \to \RR_+$ is a transition function that updates the resource, $\beta \colon \Zsf \to \RR_+$ is a discount factor function, and $\xi$ is a multiplicative shock. The quantity $f(y-e) \xi$ is the next period stock.

If, say, $\xi$ is concentrated at 1 and $f(y-e) = y-e$ , then this is exploitation of a nonrenewable resource. Another interpretation is that $y$ is a stock of fish at a given fishery, $e$ is current catch, $f$ is a transition rule that updates the stock given biological properties and environmental factors, and $\xi$ is a random shock to updating.

We assume that $\pi$ is continuous and bounded, and that the function $f$ is continuous. In the exogenous discounting setting of Section 6.1.4, the state space is $\Xsf = \RR_+ \times \Zsf$ , the action space is $\RR_+$ , the feasible correspondence is $\Gamma(y, z) = [0, y]$ , the reward function is $r(y, z, e) = \pi(e)$ , and the transition kernel is

(Kh)(y,z,e) = \beta(z) \sum_{z'} \int h(f(y-e) \xi, z') \phi(\diff \xi) Q(z,z').

The endogenous kernel $R$ is determined by

\int g(y') R(y, z, e, \diff y') = \int g(f(y-e) \xi) \phi(\diff \xi).

Proof

We verify that the associated LDP $(\Gamma, r, K)$ satisfies the conditions of Proposition 6.1.6. For Assumption 6.1.1, $\Gamma(y,z) = [0,y]$ is nonempty, continuous and compact-valued (see Exercise A.3.1), and $r = \pi$ is continuous by assumption. For Assumption 6.1.2, fix $g \in bc\RR_+$ . We need $(y, e) \mapsto \int g(f(y-e)\xi) \phi(\diff \xi)$ to be continuous on the feasible state-action pairs. Since $f$ and $g$ are continuous, so is $(y, e) \mapsto g(f(y-e)\xi)$ for each $\xi$ , and the dominated convergence theorem gives the required continuity. Since $\rho(K_Q) < 1$ by hypothesis, all conditions of Proposition 6.1.6 are met and the conclusions therein hold. ◻

The state evolves according to

y_{t+1} = f(y_t - \sigma(y_t))\xi_{t+1}

(6.16)

where $\sigma$ is the optimal consumption policy. Let’s take a look at the kind of outcomes we can generate when $\beta$ is fixed, so that the exogenous shock process is degenerate. For simulation purposes, profits take the exponential form $\pi(x) = 1 - \exp(-\theta x^\gamma)$ , while the transition function is set to $f(x) = x^\alpha \ell(x)$ . Here $\ell$ is a generalized logistic function, while $\xi$ is lognormal.^[1] We compute the optimal policy $\sigma$ using value function iteration and then study the dynamics associated with the law of motion (6.16).
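A minimal sketch of value function iteration for the fixed- $\beta$ case is given below. All numerical values, the particular logistic curve used for $\ell$ , the lognormal scale, and the grid sizes are our own assumptions for illustration and are not the parameters behind the figures. The integral over $\xi$ is replaced by an average over fixed Monte Carlo draws, and $v$ is evaluated off the grid by linear interpolation.

```python
import numpy as np

# All numerical values and the form of ell below are illustrative assumptions;
# the text does not report the parameters behind its figures.
theta, gamma, alpha, beta = 1.0, 0.5, 0.5, 0.96

def profit(e):
    """pi(e) = 1 - exp(-theta * e^gamma)."""
    return 1.0 - np.exp(-theta * e**gamma)

def ell(x):
    """A hypothetical logistic curve."""
    return 2.0 / (1.0 + np.exp(-(x - 1.0)))

def f(x):
    """Transition function f(x) = x^alpha * ell(x)."""
    return x**alpha * ell(x)

rng = np.random.default_rng(0)
xi = np.exp(0.1 * rng.standard_normal(50))        # lognormal shock draws
y_grid = np.linspace(0.0, 10.0, 60)               # grid for the stock y
shares = np.linspace(0.0, 1.0, 41)                # usage e as a share of y
e = shares[None, :] * y_grid[:, None]             # e[i, j] lies in [0, y_i]
y_next = f(y_grid[:, None] - e)[..., None] * xi   # next-period stock draws

def bellman(v):
    """Approximate Bellman operator: interpolation in y, Monte Carlo in xi."""
    v_next = np.interp(y_next.ravel(), y_grid, v).reshape(y_next.shape)
    q = profit(e) + beta * v_next.mean(axis=2)
    j = q.argmax(axis=1)
    return q.max(axis=1), e[np.arange(len(y_grid)), j]

def value_function_iteration(tol=1e-6, max_iter=2_000):
    """Iterate the approximate Bellman operator from v = 0 until the update is small."""
    v = np.zeros(len(y_grid))
    for _ in range(max_iter):
        v_new, sigma = bellman(v)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return v_new, sigma

v_star, sigma_star = value_function_iteration()
```

Restricting usage to a grid of shares of the current stock keeps every candidate action inside $\Gamma(y, z) = [0, y]$ . A finer action grid or a continuous optimizer would give a smoother policy at higher cost.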

Figure 6.3: Optimal policy and dynamics for the natural resource model

Figure 6.3 shows the optimal consumption policy $\sigma$ when $\beta = 0.96$ , along with the 45 degree line, the map $y \mapsto f(y) \EE \xi$ , which shows the expected next period stock with zero consumption, and the map $y \mapsto f(y - \sigma(y)) \EE \xi$ , which shows expected dynamics under the optimal policy. Interestingly, the optimal choice for this parameterization is to consume none of the resource when the stock is small, enabling the stock to grow. Consumption only becomes positive when the stock is large enough to remain stable at a relatively high level. Of course, this kind of behavior will only be seen when the agent is sufficiently patient.

Figure 6.4 shows more detail on the dynamics by examining the stochastic kernel associated with the Markov dynamics in (6.16), after taking logs. Each stochastic kernel is represented as a contour plot of the relevant conditional density. The four subplots correspond to four different values of the discount factor $\beta$ . For each value of $\beta$ , the plot shows where probability mass for next period stock concentrates relative to current stock, given the associated optimal policy. Mass above the 45 degree line implies that the state moves up on average, while mass below indicates that the state drifts down.

As $\beta$ increases, the optimal policy adjusts to reduce current consumption and increase conservation, leading to probability mass shifting upward at each current state value. The changes in the stochastic kernel in Figure 6.4 seem minor but in fact they have large impacts on long run outcomes. Figure 6.5 illustrates this by showing an estimate of the stationary distribution corresponding to each Markov process. Densities were estimated by simulating 100 independent paths of length $1{,}000$ from a common initial condition. The plots show a sharp transition around $\beta=0.95$ . For $\beta$ around that level, the long run stock is low. For slightly higher $\beta$ , the optimal path leads to much larger stocks (recalling that we are working in logs).

Figure 6.4: Stochastic kernel under the optimal policy at different $\beta$

Figure 6.5: Variation in the stationary distribution across $\beta$ values

Up until now we’ve taken $\beta$ as a fixed parameter when computing optimal policies. Now we allow it to vary with an exogenous state $z$ via $\beta(z)$ , in line with our theoretical analysis in Proposition 6.2.1. To illustrate the effect of state-dependent discounting, we set $\Zsf = \{0.9, 0.99\}$ and $\beta(z) = z$ . The exogenous state follows a two-state Markov chain with persistence 0.99 in each state. Other model parameters are as in the fixed- $\beta$ experiments above. We computed optimal policies via value function iteration on the product space $\RR_+ \times \Zsf$ .
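The exogenous discount process in this experiment is easy to set up directly. The sketch below builds $K_Q$ for the two-state chain, computes its spectral radius to confirm that $\rho(K_Q) < 1$ , and simulates a path of $(\beta_t)$ . We read "persistence 0.99" as a transition matrix with 0.99 on the diagonal; the initial state, seed, and path length are arbitrary choices.

```python
import numpy as np

z_vals = np.array([0.9, 0.99])          # Z = {0.9, 0.99}, with beta(z) = z
Q = np.array([[0.99, 0.01],
              [0.01, 0.99]])            # assumed reading of "persistence 0.99"
K_Q = z_vals[:, None] * Q               # (K_Q h)(z) = beta(z) * sum_z' h(z') Q(z, z')

# Spectral radius of K_Q; the theory requires this to be below one.
rho = float(np.max(np.abs(np.linalg.eigvals(K_Q))))

def simulate_regimes(n_steps, z0=0, seed=0):
    """Simulate state indices of (Z_t); the seed and initial state are arbitrary."""
    rng = np.random.default_rng(seed)
    path = np.empty(n_steps, dtype=int)
    path[0] = z0
    for t in range(1, n_steps):
        path[t] = rng.choice(2, p=Q[path[t - 1]])
    return path

beta_path = z_vals[simulate_regimes(500)]   # realized discount factors beta_t
```

Because $\beta(z) \leq 0.99$ in both states, the spectral radius is necessarily below one here. The same check becomes informative in specifications where $\beta(z)$ occasionally exceeds one.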

Figure 6.7 shows the outcome of simulating 20 independent paths of the resource stock under the optimal policy, given a single realization of the exogenous process $(Z_t)$ . The top panel displays the discount factor $\beta_t$ , while the bottom panel shows the corresponding log stock $\log y_t$ over multiple alternative paths for $(\xi_t)$ . During patient regimes, the stock tends to grow as the optimal policy shifts toward conservation. When the discount factor drops, the agent increases exploitation and the stock tends to decline.

Figure 6.6: Optimal investment policy under state-dependent discounting

Figure 6.7: Simulated resource dynamics under state-dependent discounting

6.2.2 Stochastic Rates of Return

As our next application, we consider a savings problem with a persistent state process and a stochastic rate of return on assets. Stochastic returns on assets appear to be important in generating sufficiently heavy right tails in wealth distributions when we take models to the data.

In this model,

the state is $x = (w, z)$, where $w \in \RR_+$ is wealth and $z$ is an exogenous state process on finite set $\Zsf$ with stochastic kernel (matrix) $Q$,
the action $a$ is current consumption $c$, taking values in $\RR_+$,
the feasible correspondence is $\Gamma(x) = \Gamma(w, z) = [0, w]$,
the reward is $r(x, a) = r((w, z), c) = u(c)$, where $u$ is bounded and continuous,
the discount factor is $\beta \in (0,1)$, and
the stochastic kernel takes the form

\int v(x') P(x, a, \diff x') = \sum_{z'} \int v[ R(z') (w - c) + y(z', s') , z' ] \phi(\diff s') Q(z, z').

The kernel can be explained as follows: Labor income is affected by an IID shock $s'$ drawn from distribution $\phi \in \dD(\Ssf)$, where $\Ssf$ is a topological space. In addition, both the interest rate and labor income are impacted by a common persistent component $z$. The latter is driven by stochastic matrix $Q$. We give $\Zsf$ the discrete topology and $\Xsf = \RR_+ \times \Zsf$ the product topology.
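One way to read the kernel is as a recipe for drawing next period's state. The sketch below follows that recipe and then averages over draws to approximate the integral of a function against $P$. All primitives are hypothetical placeholders chosen for illustration: the matrix `Q`, the returns `R`, the income function `y`, and the lognormal choice for $\phi$ are our assumptions, not values from the text.

```python
import random

# Hypothetical primitives, for illustration only
Q = [[0.9, 0.1], [0.1, 0.9]]   # stochastic matrix on Z = {0, 1}
R = [0.98, 1.04]               # gross rate of return R(z')

def y(z, s):
    """Hypothetical labor income y(z', s')."""
    return (0.5 + 0.5 * z) * s

def draw_next_state(w, z, c, rng):
    """Draw (w', z') from P((w, z), c, .)."""
    assert 0 <= c <= w, "consumption must lie in [0, w]"
    z_next = rng.choices(range(len(Q)), weights=Q[z])[0]   # z' ~ Q(z, .)
    s_next = rng.lognormvariate(0.0, 0.1)                  # s' ~ phi (assumed)
    w_next = R[z_next] * (w - c) + y(z_next, s_next)
    return w_next, z_next

def integrate(v, w, z, c, rng, n_draws=10_000):
    """Monte Carlo estimate of the integral of v against P((w, z), c, .)."""
    total = sum(v(*draw_next_state(w, z, c, rng)) for _ in range(n_draws))
    return total / n_draws

rng = random.Random(0)
w_next, z_next = draw_next_state(w=2.0, z=0, c=0.5, rng=rng)
```

The persistent component enters twice, through `R[z_next]` and through `y(z_next, s_next)`, while the IID shock affects labor income only.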

We apply Proposition 6.1.7 to this model. For Assumption 6.1.1, continuity and compact-valuedness of $\Gamma$ follow from Exercise A.3.1, and $r = u$ is continuous by assumption. It remains to verify that $P$ is weak Feller. Fixing $v \in bc\Xsf$, we must show that the mapping

m(w, z, c) \coloneq \sum_{z'} \int v[ R(z') (w - c) + y(z', s') , z' ] \phi(\diff s') Q(z, z')

is continuous on $\Gsf$. Taking $(w_n, z_n, c_n) \to (w, z, c)$ in $\Gsf$, since $\Zsf$ has the discrete topology, $(z_n)$ is eventually constant at $z$. Hence it suffices to show that $m(w_n, z, c_n) \to m(w, z, c)$. This follows from continuity and boundedness of $v$ and the dominated convergence theorem.

Hence Proposition 6.1.7 applies and the conclusions therein hold. The Bellman operator takes the form

(Tv)(w,z) = \max_{0 \leq c \leq w} \left\{ u(c) + \beta \, \sum_{z' \in \Zsf} \int v[ R(z') (w - c) + y(z', s') , z' ] \phi(\diff s') Q(z, z') \right\}.
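For intuition, the following sketch applies a discretized version of this Bellman operator and iterates it from $v \equiv 0$. Everything numerical here is a hypothetical choice made for illustration: the primitives `Q`, `R`, `y` and `u`, the two-point stand-in for $\phi$, the wealth grid on $[0, 10]$, and the consumption grid. Values off the grid are obtained by linear interpolation, and next-period wealth above the top of the grid is clamped to it, which is an approximation.

```python
import math

# Every numerical value and functional form below is hypothetical
beta = 0.95
Q = [[0.9, 0.1], [0.1, 0.9]]               # stochastic matrix on Z = {0, 1}
R = [0.98, 1.04]                           # gross rate of return R(z')
s_vals, s_probs = [0.5, 1.5], [0.5, 0.5]   # two-point stand-in for phi
step = 0.5
w_grid = [step * i for i in range(21)]     # wealth grid on [0, 10]

def u(c):
    return 1.0 - math.exp(-c)              # bounded and continuous on R_+

def y(z, s):
    return (0.5 + 0.5 * z) * s             # labor income y(z', s')

def interp(v_z, w):
    """Linear interpolation of v(., z) on w_grid, clamped at the grid ends."""
    w = min(max(w, w_grid[0]), w_grid[-1])
    i = min(int(w / step), len(w_grid) - 2)
    t = (w - w_grid[i]) / step
    return (1 - t) * v_z[i] + t * v_z[i + 1]

def T(v, n_c=25):
    """Discretized Bellman operator; v[z][i] approximates v(w_grid[i], z)."""
    Tv = [[0.0] * len(w_grid) for _ in Q]
    for z in range(len(Q)):
        for i, w in enumerate(w_grid):
            best = -math.inf
            for k in range(n_c + 1):       # consumption grid on [0, w]
                c = w * k / n_c
                cont = sum(
                    Q[z][zp] * p * interp(v[zp], R[zp] * (w - c) + y(zp, s))
                    for zp in range(len(Q))
                    for s, p in zip(s_vals, s_probs)
                )
                best = max(best, u(c) + beta * cont)
            Tv[z][i] = best
    return Tv

def sup_dist(f, g):
    """Sup-norm distance between two functions stored on the grid."""
    return max(abs(a - b) for fz, gz in zip(f, g) for a, b in zip(fz, gz))

# Value function iteration starting from v = 0
v = [[0.0] * len(w_grid) for _ in Q]
for _ in range(200):
    v_new = T(v)
    error = sup_dist(v, v_new)
    v = v_new
```

Since linear interpolation averages neighboring grid values, this discretized operator is still a contraction of modulus $\beta$ in the sup norm on the grid, so `error` shrinks geometrically across iterations.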

Exercise 6.2.1

Consider an optimal savings model identical to the one described in Section 6.2.2 except that agents die with a time-dependent probability in each period (see, e.g., De Nardi et al. (2020)). To accommodate this feature, we modify the state, setting it to $x = (w, z, t) \in \Xsf := \RR_+ \times \Zsf \times \ZZ_+$, where $t$ represents time. The transition kernel becomes

\int v(x') K(x, c, \diff x') = \beta \, q(t) \sum_{z' \in \Zsf} \int v[ R(z') (w - c) + y(z', s') , z', t+1 ] \phi(\diff s') Q(z, z'),

where $q(t) \in [0, 1]$ is the survival probability at age $t$. Higher probability of dying reduces the expected continuation value. Show that Proposition 6.1.3 applies. (Impose the discrete topology on $\ZZ_+$.)

Solution to Exercise 6.2.1

We verify the conditions of Proposition 6.1.3. For Assumption 6.1.1, the correspondence $\Gamma(w, z, t) = [0, w]$ is continuous and compact-valued by Exercise A.3.1, and $r = u$ is continuous by assumption. For the discount operator condition, define

(Dh)(w,z,t) \coloneq \beta \max_{c \in [0,w]} \sum_{z'} \int h[ R(z') (w - c) + y(z', s') , z', t+1 ] \phi(\diff s') Q(z, z').

Since $q(t) \leq 1$, we have $K_\sigma \leq D$ on $b\Xsf_+$ for all $\sigma \in \Sigma$. Moreover, for $h \in b\Xsf_+$, the stochastic kernel structure gives $(Dh)(w,z,t) \leq \beta \|h\|$, so $D$ is a contraction of modulus $\beta$ and hence a discount operator.
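In more detail, the two bounds just stated can be written out as follows. This sketch uses only the definitions of $K$ and $D$, together with the facts that $\phi$ is a probability measure and $Q$ is a stochastic matrix. Fix $h \in b\Xsf_+$, $\sigma \in \Sigma$ and $(w, z, t) \in \Xsf$. Since $q(t) \leq 1$, $h \geq 0$ and $\sigma(w, z, t) \in [0, w]$,

(K_\sigma h)(w,z,t) = \beta \, q(t) \sum_{z'} \int h[ R(z') (w - \sigma(w,z,t)) + y(z', s') , z', t+1 ] \phi(\diff s') Q(z, z') \leq (Dh)(w,z,t).

In addition, since $h \leq \|h\|$ pointwise,

(Dh)(w,z,t) \leq \beta \|h\| \sum_{z'} \int \phi(\diff s') Q(z, z') = \beta \|h\|.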

It remains to show that $K$ is weak Feller. Fix $v \in bc\Xsf$ and let $(w_n, z_n, t_n, c_n) \to (w, z, t, c)$ in $\Gsf$. Since $\Zsf$ and $\ZZ_+$ have the discrete topology, $(z_n, t_n)$ is eventually constant at $(z, t)$. Hence it suffices to show that

\sum_{z'} \int v[ R(z') (w_n - c_n) + y(z', s') , z', t+1 ] \phi(\diff s') Q(z, z')

converges to the same expression with $(w_n, c_n)$ replaced by $(w, c)$. This follows from continuity and boundedness of $v$ and the dominated convergence theorem.

6.3 Chapter Notes

The Feller properties discussed in Section 6.1.1 are standard tools in the theory of Markov chains and stochastic processes. For further background, see Hernández-Lerma & Lasserre (2012) or Bäuerle & Rieder (2011). The use of Feller conditions to guarantee existence of optimal policies in MDPs and dynamic programs dates back to the foundational work of Blackwell (1965). Scheffé’s lemma, used in the proof of Lemma 6.1.1, is a classical result in measure theory.

Standard proofs of the optimality results we stated for MDPs on general state spaces (Section 6.1.5) can be found in Puterman (2005), Bäuerle & Rieder (2011), Hernández-Lerma & Lasserre (2012), Stachurski (2022), or Sargent & Stachurski (2025).

The exposition of exogenous discount processes in Section 6.1.4 is partly based on Stachurski & Zhang (2021). State-dependent discounting in the context of dynamic programming is also studied in Jaśkiewicz et al. (2014).

The natural resource management model in Section 6.2.1 is a standard bioeconomic exploitation model; see Clark (2010) for background. Versions with state-dependent discounting are relevant for modeling resource management under fluctuating economic conditions.

For a discussion of stochastic rates of return on financial income, as considered in Section 6.2.2, see Benhabib et al. (2015) or Stachurski & Toda (2019). The latter shows that heavy-tailed wealth distributions can also be generated by time preference shocks, but this channel is relatively unrealistic, since it requires that all households in the economy simultaneously experience time preference shocks in the same direction. Additional work on the relationship between stochastic discount factors and wealth distributions includes Toda (2019), Ma et al. (2020), and Nirei & Aoki (2015).

What we have called linear decision processes (LDPs) might be confused with Markov decision processes having linear reward or cost functions. The latter are a special case of the former. For a recent discussion of MDPs with linear cost functions, see Rantzer (2022) and Li & Bertsekas (2024).

Footnotes

The logistic function is $\ell(x) = a + (b-a)/(1 + \exp(-c(x-d)))$ with $a=1$, $b=1.5$, $c=20$, $d=1$. Other parameters are $\theta = 0.5$, $\gamma = 0.9$, $\alpha = 0.7$, and $\xi \sim \text{LN}(-0.1, 0.2)$. The optimal policy was computed by value function iteration on a grid of 500 state points and $2{,}000$ action points using JAX.

References

De Nardi, M., Fella, G., & Paz-Pardo, G. (2020). Nonlinear household earnings dynamics, self-insurance, and welfare. Journal of the European Economic Association, 18(2), 890–926.
Hernández-Lerma, O., & Lasserre, J. B. (2012). Discrete-time Markov control processes: basic optimality criteria (Vol. 30). Springer Science & Business Media.
Bäuerle, N., & Rieder, U. (2011). Markov decision processes with applications to finance. Springer Science & Business Media.
Blackwell, D. (1965). Discounted Dynamic Programming. The Annals of Mathematical Statistics, 36(1), 226–235.
Puterman, M. L. (2005). Markov decision processes: discrete stochastic dynamic programming. Wiley Interscience.
Stachurski, J. (2022). Economic dynamics: theory and computation (2nd ed.). MIT Press.
Sargent, T. J., & Stachurski, J. (2025). Dynamic Programming: Finite States. Cambridge University Press.
Stachurski, J., & Zhang, J. (2021). Dynamic programming with state-dependent discounting. Journal of Economic Theory, 192, 105190.
Jaśkiewicz, A., Matkowski, J., & Nowak, A. S. (2014). On variable discounting in dynamic programming: applications to resource extraction and other economic models. Annals of Operations Research, 220, 263–278.
Clark, C. W. (2010). Mathematical Bioeconomics: The Mathematics of Conservation (3rd ed.). John Wiley & Sons.
Benhabib, J., Bisin, A., & Luo, M. (2015). Wealth distribution and social mobility in the US: A quantitative approach. Technical report, National Bureau of Economic Research.
Stachurski, J., & Toda, A. A. (2019). An impossibility theorem for wealth in heterogeneous-agent models with limited heterogeneity. Journal of Economic Theory, 182, 1–24.
Toda, A. A. (2019). Wealth distribution with random discount factors. Journal of Monetary Economics, 104, 101–113.
Ma, Q., Stachurski, J., & Toda, A. A. (2020). The income fluctuation problem and the evolution of wealth. Journal of Economic Theory, 187, 105003.
Nirei, M., & Aoki, S. (2015). Wealth distribution and stochastic discount factors. Journal of Monetary Economics, 69, 119–133.
