
6 Linear Decision Processes

In this chapter, we define linear decision processes (LDPs). In terms of level of generality, we can think of LDPs as sitting between MDPs, as discussed in Section 1.2, and the ADPs we considered in Chapter 2--Chapter 5:

$$\text{MDPs } \subset \text{ LDPs } \subset \text{ ADPs},$$

with all inclusions being strict. LDPs have a range of advantages over MDPs while maintaining much of their tractability. One advantage is that we can work with state-dependent discounting, which is particularly important for economic and financial applications. Another is that their flexible structure makes them easy to apply. For example, optimal stopping problems can be embedded directly into the LDP framework, whereas embedding optimal stopping problems into MDPs requires expanding the state space.

LDPs differ from ADPs by including actions explicitly, instead of taking policy operators as the basic primitive. This is a more traditional perspective: one where controllers observe states and respond to those states by choosing actions. Ultimately, a choice of action given a state will take the form of a policy function, and each policy function induces a policy operator, which leads us back to ADPs. By closing this circle, we can leverage theory from our earlier chapters.

LDPs are more limited than ADPs, but also more concrete and more structured. For example, they provide an algebraic formula for computing lifetime values similar to the one available for MDPs (see, e.g., (1.42)). This formula is not available for general ADPs. Thus, for LDPs, the HPI step requiring computation of lifetime values from policies is fully articulated, at least at a theoretical level. Another advantage of LDPs, relative to ADPs, is that we can start to construct systematic conditions for regularity, or existence of greedy policies, unlike in previous chapters.

We begin with the theoretical foundations in Section 6.1. After introducing Feller properties in Section 6.1.1, we define LDPs in Section 6.1.2.1 and present optimality results. We then treat exogenous discount processes (Section 6.1.4) and specialize the LDP framework to MDPs on general state spaces (Section 6.1.5). In this chapter, we focus on the bounded case; unbounded models are handled in Chapter 7. Section 6.2.1 and Section 6.2.2 apply the theory to natural resource management and optimal savings with stochastic rates of return.

6.1 Theory

In this section we develop the foundational theory of linear decision processes. We begin in Section 6.1.1 by studying Feller properties of transition kernels, which provide the continuity conditions needed for existence of optimal policies. We then define LDPs in Section 6.1.2.1, give examples, and discuss lifetime values. Next we present optimality results and their implications. Section 6.1.4 treats exogenous discount processes, and Section 6.1.5 specializes the LDP framework to Markov decision processes on general state spaces.

6.1.1 Feller Properties

Since we are always interested in whether or not optimal policies exist, we study conditions under which future values are continuous in states and actions. In the case of LDPs, this continuity will require that integrals of transition kernels vary continuously with actions. (For background on transition kernels see Section A.5.4.1.) Here we provide a collection of definitions and results that help us address this question.

Throughout this section,

Here you can think of $\Gsf$ as a collection of feasible state-action pairs. The last statement means that

$$(Kv)(x, a) \coloneq \int v(x') K(x, a, \diff x')$$

is in $b\Gsf$ whenever $v \in b\Xsf$.

Extending standard terminology, we will say that the transition kernel $K$ is

Let’s look at some special cases.

The strong Feller property requires more conditions, since we need to map a potentially discontinuous function $h$ into a continuous function $Kh$. For this, we rely on smoothing properties of the integral. To obtain these properties we introduce a “dominating” measure $\mu$ on $(\Xsf, \bB)$, which we assume to be $\sigma$-finite. A Borel measurable map $p$ from $\Gsf \times \Xsf$ to $\RR$ is called a density kernel from $\Gsf$ to $\Xsf$ with dominating measure $\mu$ if $p$ is nonnegative and

$$\int p(x, a, x') \mu(\diff x') = 1 \quad \text{for all } (x, a) \in \Gsf.$$

We say that a stochastic kernel $P$ from $\Gsf$ to $\Xsf$ has density kernel $p$ with dominating measure $\mu$ if $p$ is a density kernel from $\Gsf$ to $\Xsf$ and

$$P(x, a, B) = \int_B p(x, a, x') \mu(\diff x') \quad \text{for all } (x, a, B) \in \Gsf \times \bB.$$

If the dominating measure $\mu$ is not identified in the discussion below then we will be referring to Lebesgue measure, and we write $\diff x$ instead of $\mu(\diff x)$. The following lemma shows how a continuous density kernel can transform discontinuous functions into continuous ones under integration.
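The smoothing effect of a continuous density kernel can be seen in a small numerical sketch. Everything specific here is an assumption made for illustration: the Gaussian density kernel $p(x, a, \cdot) = N(\rho x + a, s^2)$ with respect to Lebesgue measure, the parameter values, and the test function $h$, which is the indicator of $[0, \infty)$. The function `Ph_exact` uses the closed form for this particular case, while `Ph_quadrature` approximates the integral $\int h(x') p(x, a, x') \diff x'$ directly as a cross-check.

```python
import math

def normal_cdf(t):
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def h(x_next):
    """A bounded but discontinuous test function: the indicator of [0, inf)."""
    return 1.0 if x_next >= 0.0 else 0.0

def Ph_exact(x, a, rho=0.9, s=0.5):
    """(Ph)(x, a) for the assumed Gaussian density kernel N(rho*x + a, s^2).

    Integrating the indicator of [0, inf) against this density gives the
    probability that the next state is nonnegative.
    """
    return normal_cdf((rho * x + a) / s)

def Ph_quadrature(x, a, rho=0.9, s=0.5, n=20_000, width=8.0):
    """Midpoint-rule approximation of the same integral."""
    m = rho * x + a
    lo, hi = m - width * s, m + width * s
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        xp = lo + (i + 0.5) * dx
        dens = math.exp(-0.5 * ((xp - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        total += h(xp) * dens * dx
    return total

# h jumps at zero, but (x, a) -> (Ph)(x, a) moves only slightly
# when (x, a) moves slightly.
jump_in_h = h(0.0) - h(-1e-6)
change_in_Ph = Ph_exact(0.0, 0.0) - Ph_exact(-1e-6, 0.0)
```

In this example `jump_in_h` equals one while `change_in_Ph` is tiny, which is the sense in which integration against a continuous density turns a discontinuous $h$ into a continuous $Ph$.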

6.1.2 LDPs

We now introduce LDPs and study their basic properties. Section 6.1.2.1 defines LDPs and connects them to the ADP framework. We then present several examples, showing which models can and cannot be expressed as LDPs. Finally, we discuss lifetime values and their computation in the LDP setting.

6.1.2.1 Definition

Let $\Xsf$ and $\Asf$ be separable metric spaces, referred to henceforth as the state and action spaces. As before, $\| \cdot \|$ denotes the supremum norm on $b \Xsf$. Given $\Xsf$ and $\Asf$, a linear decision process (LDP) is a tuple $(\Gamma, r, K)$ containing

  1. a nonempty correspondence $\Gamma$ from $\Xsf$ to $\Asf$ called the feasible correspondence, with an associated set of feasible state-action pairs

$$\Gsf \coloneq \graph \Gamma = \setntn{(x, a) \in \Xsf \times \Asf}{a \in \Gamma(x)},$$

  2. a bounded Borel measurable reward function $r$ mapping $\Gsf$ into $\RR$, and

  3. a transition kernel $K$ from $\Gsf$ to $\Xsf$ satisfying $Kv \in b\Gsf$ whenever $v \in b\Xsf$.

The set $\Gamma(x)$ represents all actions available to a controller in state $x$. Figure 6.1 shows an illustration of one possible correspondence $\Gamma$ when $\Asf = \Xsf = \RR_+$, along with $\Gsf$, the resulting set of feasible state-action pairs. When representing the LDP by the tuple $(\Gamma, r, K)$, we are treating $\Xsf$ and $\Asf$ as understood from context.

Figure 6.1: Feasible correspondence and feasible state-action pairs

For the LDP $(\Gamma, r, K)$, a feasible policy is a Borel measurable map $\sigma \colon \Xsf \to \Asf$ such that $\sigma(x) \in \Gamma(x)$ for all $x \in \Xsf$. Figure 6.2 shows a feasible policy $\sigma$ in the same setting.

Figure 6.2: The action $\sigma(x)$ lies in $\Gamma(x)$ for all $x$

We let $\Sigma$ denote the set of all feasible policies. With these policies in hand, we define the set of policy operators associated with $(\Gamma, r, K)$ via

$$(T_\sigma \, v)(x) = r(x, \sigma(x)) + \int v(x') K(x, \sigma(x), \diff x') \qquad (x \in \Xsf),$$

where $v$ varies over $b \Xsf$.

The assumption that $\Xsf$ and $\Asf$ are metric spaces is important in some applications and irrelevant in others. For simplicity, we maintain it throughout. When $\Xsf$ and $\Asf$ are discrete, the metric in question is always understood to be the discrete metric. In this case, every subset of these sets is a Borel set, so the measurability constraint in the definition of $\Sigma$ never binds.

6.1.2.2 ADP Representation

With

$$K_\sigma(x, x') \coloneq K(x, \sigma(x), x') \quad \text{and} \quad r_\sigma(x) \coloneq r(x, \sigma(x)),$$

we can also write the policy operator (6.13) as

$$T_\sigma \, v = r_\sigma + K_\sigma \, v.$$

Given $v \in b \Xsf$, we have $Kv \in b\Gsf$ and hence $K_\sigma v \in b\Xsf$. Since $b \Xsf$ is a vector space, it follows that $T_\sigma \, v$ is in $b \Xsf$. Since $K$ is a transition kernel, $K_\sigma$ is a positive linear operator, so $T_\sigma$ is order preserving. Hence

$$(b \Xsf, \TT_{\rm LDP}) \quad \text{with} \quad \TT_{\rm LDP} \coloneq \setntn{T_\sigma}{\sigma \in \Sigma}$$

is an ADP. We call $(b \Xsf, \TT_{\rm LDP})$ the ADP generated by $(\Gamma, r, K)$ and use the following obvious conventions:

We notice that each $T_\sigma$ has the affine form from the ADP analysis in Section 4.1.2.3, with $K_\sigma \in \blop_+(b\Xsf)$ by Theorem A.5.25. We will use the theorems in that section for some of our optimality results.
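To make the affine form $T_\sigma \, v = r_\sigma + K_\sigma \, v$ concrete, here is a minimal sketch of a policy operator for a finite LDP in plain Python. The three-state model, the vector `DISCOUNT`, the matrices in `P`, and the reward `r` are all hypothetical, chosen only for illustration. The kernel is built as a state-dependent discount factor times a stochastic matrix, which is one simple way to obtain a transition kernel that is not a constant multiple of a stochastic kernel.

```python
# Hypothetical finite LDP with three states and two actions (all feasible).
STATES = [0, 1, 2]
ACTIONS = [0, 1]
DISCOUNT = [0.90, 0.95, 0.99]   # assumed state-dependent discount factors
P = {                           # assumed stochastic matrix for each action
    0: [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
    1: [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],
}

def K(x, a, x_next):
    """Transition kernel K(x, a, x'), here a discounted stochastic matrix."""
    return DISCOUNT[x] * P[a][x][x_next]

def r(x, a):
    """Bounded reward on feasible state-action pairs (assumed form)."""
    return float(x) - 0.5 * a

def T_sigma(v, sigma):
    """(T_sigma v)(x) = r(x, sigma(x)) + sum over x' of v(x') K(x, sigma(x), x')."""
    return [
        r(x, sigma[x]) + sum(v[xn] * K(x, sigma[x], xn) for xn in STATES)
        for x in STATES
    ]

sigma = [1, 0, 0]                 # a feasible policy, stored as a list indexed by state
v_low = [0.0, 0.0, 0.0]
v_high = [1.0, 1.0, 1.0]
Tv_low = T_sigma(v_low, sigma)    # equals r_sigma, since v_low is zero
Tv_high = T_sigma(v_high, sigma)  # pointwise no smaller than Tv_low
```

Because every entry of the kernel is nonnegative, raising $v$ pointwise can only raise $T_\sigma \, v$, which is the order-preserving property used to conclude that the pair is an ADP.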

6.1.2.3 Examples

Let’s discuss some examples. Some but not all of these examples can be framed as LDPs.

6.1.2.4 Lifetime Values

Let $(\Gamma, r, K)$ be an LDP with state space $\Xsf$ and action space $\Asf$. Given a policy $\sigma \in \Sigma$, the $\sigma$-value function $v_\sigma$ is defined as the fixed point of the policy operator $T_\sigma$ in (6.13). As a result, $v_\sigma$ satisfies the recursion

$$v_\sigma = r_\sigma + K_\sigma v_\sigma.$$

If the spectral radius condition $\rho(K_\sigma) < 1$ holds, then, by the Neumann series lemma (see, in particular, Corollary A.4.11), the operator $I - K_\sigma$ is invertible on $b \Xsf$ and the unique solution to (6.21) is

$$v_\sigma = (I - K_\sigma)^{-1} r_\sigma = \sum_{t=0}^{\infty} K_\sigma^t r_\sigma.$$

The $t$-th term $K_\sigma^t r_\sigma$ gives the expected reward at time $t$ under policy $\sigma$, discounted back to the present.
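The series representation can be checked numerically in a finite setting. The sketch below sums the terms $K_\sigma^t r_\sigma$ until they are negligible and then confirms that the result approximately satisfies $v_\sigma = r_\sigma + K_\sigma v_\sigma$. The matrix `K_sigma` and vector `r_sigma` are assumed numbers for illustration. Each row of `K_sigma` sums to at most 0.99, so its operator norm under the supremum norm, and hence its spectral radius, is below one.

```python
STATES = [0, 1, 2]

# Assumed K_sigma and r_sigma for one fixed policy (illustrative numbers).
K_sigma = [
    [0.450, 0.450, 0.000],
    [0.095, 0.760, 0.095],
    [0.000, 0.198, 0.792],
]
r_sigma = [-0.5, 1.0, 2.0]

def apply(K, h):
    """Return the vector K h."""
    return [sum(K[x][xn] * h[xn] for xn in STATES) for x in STATES]

def lifetime_value(K, r, tol=1e-12, max_terms=100_000):
    """Sum the Neumann series v = sum_t K^t r until the terms fall below tol."""
    term = list(r)                      # the t = 0 term, K^0 r = r
    v = list(r)
    for _ in range(max_terms):
        term = apply(K, term)           # next term, K^{t+1} r
        v = [vi + ti for vi, ti in zip(v, term)]
        if max(abs(ti) for ti in term) < tol:
            break
    return v

v_sigma = lifetime_value(K_sigma, r_sigma)

# Largest violation of the recursion v = r + K v.
Kv = apply(K_sigma, v_sigma)
residual = max(abs(v - (r + kv)) for v, r, kv in zip(v_sigma, r_sigma, Kv))
```

Summing the series is used here only because it mirrors the formula directly. With a linear algebra library one would normally solve $(I - K_\sigma) v = r_\sigma$ instead.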

The explicit representation of $v_\sigma$ in (6.22) is valuable for computation. For example, the MDP version of HPI in Algorithm 1.2.2 can be extended to the current setting by replacing $v \leftarrow (I - \beta P_{\sigma} )^{-1} r_{\sigma}$ with $v \leftarrow (I - K_{\sigma} )^{-1} r_{\sigma}$. Under the conditions of Proposition 6.1.3, with $K$ strong Feller, this algorithm converges.
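The following is a rough sketch of how the HPI loop might look for a small finite LDP. It is not a transcription of Algorithm 1.2.2. The model primitives are hypothetical, and the evaluation step approximates $(I - K_\sigma)^{-1} r_\sigma$ by iterating $v \leftarrow T_\sigma \, v$ rather than inverting a matrix, purely to keep the example free of dependencies.

```python
# Hypothetical finite LDP (illustrative numbers only).
STATES = [0, 1, 2]
ACTIONS = [0, 1]
DISCOUNT = [0.90, 0.95, 0.99]
P = {
    0: [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
    1: [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],
}

def K(x, a, x_next):
    return DISCOUNT[x] * P[a][x][x_next]

def r(x, a):
    return float(x) - 0.5 * a

def action_value(v, x, a):
    """r(x, a) + sum over x' of v(x') K(x, a, x')."""
    return r(x, a) + sum(v[xn] * K(x, a, xn) for xn in STATES)

def evaluate(sigma, tol=1e-12, max_iter=100_000):
    """Approximate v_sigma by successive approximation with T_sigma."""
    v = [0.0 for _ in STATES]
    for _ in range(max_iter):
        v_new = [action_value(v, x, sigma[x]) for x in STATES]
        if max(abs(a - b) for a, b in zip(v, v_new)) < tol:
            return v_new
        v = v_new
    return v

def greedy(v):
    """Pick a maximizing action at each state."""
    return [max(ACTIONS, key=lambda a: action_value(v, x, a)) for x in STATES]

def howard_policy_iteration(sigma, max_rounds=100):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    for _ in range(max_rounds):
        v = evaluate(sigma)
        sigma_new = greedy(v)
        if sigma_new == sigma:
            return sigma, v
        sigma = sigma_new
    raise RuntimeError("policy did not stabilize")

sigma_star, v_star = howard_policy_iteration([0, 0, 0])
```

On exit, `sigma_star` is greedy with respect to its own lifetime value `v_star`, so `v_star` approximately satisfies the Bellman equation for this toy model.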

6.1.3 Optimality Results

Now we turn to optimality results. We first treat the case where $\TT$ is finite, and then shift to general (metric) state and action spaces by adding continuity conditions. We conclude by deriving implications for greedy policies and the Bellman operator.

In the following, we suppose that $(\Gamma, r, K)$ is an LDP with state space $\Xsf$ and action space $\Asf$. As before, these sets are separable metric spaces (with the discrete topology when finite). As shown in Section 6.1.2.2, the LDP $(\Gamma, r, K)$ generates an ADP $(b\Xsf, \TT_{\rm LDP})$ where each $T_\sigma \in \TT_{\rm LDP}$ has the affine form $T_\sigma v = r_\sigma + K_\sigma v$. We will infer optimality of the LDP by studying this ADP.

6.1.3.1 Results

First we present a result that works for the finite case.

To shift to the general case, we inject some continuity.

Recalling the definition of a discount operator, we can state the following result.

6.1.3.2 Implications

Let $(\Gamma, r, K)$ be a given LDP. Clearly $\sigma \in \Sigma$ is $v$-greedy for $(\Gamma, r, K)$ if and only if

$$r(x, \tau(x)) + \int v(x') K(x, \tau(x), \diff x') \leq r(x, \sigma(x)) + \int v(x') K(x, \sigma(x), \diff x')$$

for all $\tau \in \Sigma$ and $x \in \Xsf$. The Bellman operator obeys

$$(Tv)(x) = \sup_{\sigma \in \Sigma} \left\{ r(x, \sigma(x)) + \int v(x') K(x, \sigma(x), \diff x') \right\} \qquad (x \in \Xsf).$$

If, say, the conditions of Proposition 6.1.3 hold and $K$ is strong Feller, then, for every $v \in b \Xsf$, there always exists a $\sigma \in \Sigma$ obeying

$$\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \int v(x') K(x,a, \diff x') \right\} \quad \text{for all } x \in \Xsf.$$

(See the proof of Proposition 6.1.3.) In this setting, a policy $\sigma \in \Sigma$ is $v$-greedy if and only if (6.25) holds. Moreover, the Bellman operator simplifies to

$$(Tv)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \int v(x') K(x, a, \diff x') \right\}$$

for every $v \in b \Xsf$. These expressions remain valid in the weak Feller setting when we restrict to $v \in b c \Xsf$.
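In a finite model the maximum always exists, so the pointwise characterization of greedy policies and the simplified Bellman operator can be coded directly. The sketch below uses a hypothetical stock-consumption model in which the feasible set $\Gamma(x) = \{0, \ldots, x\}$ depends on the state. The growth rule, the reward $\sqrt{a}$, and the constant `BETA` are all illustrative assumptions.

```python
import math

N = 5                         # largest stock level (assumed)
STATES = range(N + 1)
BETA = 0.95                   # assumed constant discount factor

def Gamma(x):
    """Feasible correspondence: consume any integer amount up to the stock."""
    return range(x + 1)

def r(x, a):
    """Reward from consuming a (assumed form)."""
    return math.sqrt(a)

def K(x, a, x_next):
    """Discounted probability of reaching x_next from x after consuming a.

    The remaining stock x - a grows by 0 or 1 with equal probability,
    capped at N.
    """
    total = 0.0
    for growth in (0, 1):
        if min(x - a + growth, N) == x_next:
            total += 0.5
    return BETA * total

def action_value(v, x, a):
    return r(x, a) + sum(v[xn] * K(x, a, xn) for xn in STATES)

def bellman(v):
    """(Tv)(x) = max over a in Gamma(x) of the action value."""
    return [max(action_value(v, x, a) for a in Gamma(x)) for x in STATES]

def greedy(v):
    """A v-greedy policy: a maximizer in Gamma(x) at each state x."""
    return [max(Gamma(x), key=lambda a: action_value(v, x, a)) for x in STATES]

v = [float(x) for x in STATES]
sigma = greedy(v)
Tv = bellman(v)
T_sigma_v = [action_value(v, x, sigma[x]) for x in STATES]
```

Here `T_sigma_v` coincides with `Tv`, which is the finite-state version of the statement that $\sigma$ is $v$-greedy exactly when it attains the maximum at every state.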

6.1.4 Exogenous Discount Processes

In Chapter 4 we looked at several settings that include state-dependent discounting. In each case the setting was relatively simple: either a binary stopping problem or a model with discrete states and actions. Here we’ll look at a problem with continuous state and action spaces. To make this setting tractable, we’ll insist that the discount factor process depends only on an exogenous state (i.e., a state that is not influenced by decisions of the agent).

6.1.4.1 Discount Factor Processes

When the discount factor varies over time, forming a sequence $(\beta_t)_{t \geq 0}$, the present value of a random time $t$ payoff $H_t$ has the general form $\EE \, \beta_0 \cdots \beta_{t-1} H_t$. In this section we formalize this idea in a Markov environment and examine some simple consequences.

Let $\Zsf$ be a metric space and let $Q$ be a stochastic kernel on $\Zsf$. Let $(Z_t)$ be $Q$-Markov on $\Zsf$. Let $\beta \in b\Zsf$ be a nonnegative function and consider the discount factor process $(\beta_t)_{t \geq 0}$ where $\beta_t \coloneq \beta(Z_t)$ for all $t$. We introduce the operator

$$(K_Q h)(z) \coloneq \beta(z) \int h(z') Q(z, \diff z').$$

Our next lemma connects powers of $K_Q$ to expected present values.

We can confirm this rather natural expression by induction.

The next result follows from Gelfand’s formula for the spectral radius; the details of the argument can be seen in Example 4.1.1.

Now consider pricing an infinite horizon cash flow $(h(Z_t))_{t \geq 0}$. We set

$$q(z) \coloneq \EE_z \, \sum_{t \geq 0} \prod_{i=0}^{t-1} \beta(Z_i) \cdot h(Z_t).$$
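When $\Zsf$ is finite and $\rho(K_Q) < 1$, the $t$-th term in the definition of $q$ equals $(K_Q^t h)(z)$, so $q$ is the unique solution of $q = h + K_Q q$. The sketch below computes it by iteration. The two-state chain, the choice $\beta(z) = z$ with $\Zsf = \{0.9, 0.99\}$, and the cash flow $h \equiv 1$ are assumptions made for illustration.

```python
Z = [0.9, 0.99]                        # exogenous states (assumed)
Q = [[0.99, 0.01], [0.01, 0.99]]       # assumed stochastic matrix on Z

def beta(z):
    """Discount function; here the state itself is the discount factor."""
    return z

def K_Q(h):
    """(K_Q h)(z) = beta(z) * sum over z' of h(z') Q(z, z'), with h as a list."""
    n = len(Z)
    return [beta(Z[i]) * sum(h[j] * Q[i][j] for j in range(n)) for i in range(n)]

def price(h, tol=1e-10, max_iter=1_000_000):
    """Iterate q <- h + K_Q q, starting from q = h."""
    q = list(h)
    for _ in range(max_iter):
        q_new = [hi + kq for hi, kq in zip(h, K_Q(q))]
        if max(abs(a - b) for a, b in zip(q, q_new)) < tol:
            return q_new
        q = q_new
    raise RuntimeError("no convergence; is the spectral radius of K_Q below one?")

h = [1.0, 1.0]                         # one unit of payoff in every period
q = price(h)

# Largest violation of the fixed point equation q = h + K_Q q.
residual = max(abs(qi - (hi + kq)) for qi, hi, kq in zip(q, h, K_Q(q)))
```

The price is higher in the patient state, since a persistent chain keeps future discount factors close to the current one for many periods.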

6.1.4.2 An LDP with Exogenous Discounting

Let $\Xsf$ and $\Asf$ be separable metric spaces and let $(\Gamma, r, K)$ be an LDP with state space $\Xsf$ and action space $\Asf$. Suppose further that $\Xsf$ is a product space of the form $\Ysf \times \Zsf$ and that $K$ has the form

$$(Kh)(x, a) = (Kh)(y,z, a) = \beta(z) \int \sum_{z'} h(y', z') Q(z, z') R(y, z, a, \diff y'),$$

where

We call $R$ the endogenous kernel, $Q$ the exogenous kernel and $\beta$ the discount function. The expression for $K$ tells us that the endogenous state $y$ updates via the kernel $R$, depending on current state $x=(y,z)$ and action $a$, while $z$ updates via $Q$. Since we are taking products, the two updates are independent. The exogenous process feeds into values and hence optimal policies through its impact on the discount factor.

To make our lives slightly easier, we’ll assume that $\Zsf$ is finite. As with every other finite set, we endow $\Zsf$ with the discrete topology.

Let $K_Q$ be defined as in (6.27). In this setting, we have the following result.

6.1.5 Markov Decision Processes

We treated discrete MDPs in Section 1.2. Let’s now consider MDPs on general state spaces. Mathematically, MDPs are LDPs with a fixed discount factor and Markov dynamics under any fixed policy. On one hand, MDPs are a special case of LDPs and need no separate theoretical discussion. On the other hand, MDPs are a benchmark representation of a dynamic program, used throughout mathematics, operations research, and computer science. For this reason we’ll take the time to specialize our LDP results to the Markov setting. Throughout this section, $\Xsf$ and $\Asf$ are separable metric spaces.

6.1.5.1 Theory

Let $(\Gamma, r, K)$ be an LDP with state space $\Xsf$ and action space $\Asf$. This LDP is called a Markov Decision Process (MDP) when the transition kernel has the form

$$\int v(x') K(x, a, \diff x') = \beta \int v(x') P(x, a, \diff x')$$

for some $\beta \in [0, 1)$ and some stochastic kernel $P$ from $\Gsf$ to $\Xsf$.

The MDP above will be represented by the tuple $(\Gamma, r, \beta, P)$. The ADP generated by this MDP will be denoted $(b\Xsf, \TT_{\rm MDP})$, where

$$T_\sigma = r_\sigma + \beta P_\sigma, \quad \text{where} \quad r_\sigma(x) \coloneq r(x, \sigma(x)) \quad \text{and} \quad P_\sigma(x, \diff x') \coloneq P(x, \sigma(x), \diff x').$$

Choosing a policy $\sigma$ picks out a stochastic kernel $P_\sigma$ on $\Xsf$, so choosing a policy is akin to picking an $\Xsf$-valued Markov process.
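As a short worked check, the spectral radius condition from Section 6.1.2.4 holds automatically in the MDP case. The steps use only that $P_\sigma$ is a stochastic kernel and that $\beta \in [0, 1)$. The symbol $\mathbb{1}$, denoting the constant function equal to one, is introduced here for illustration.

```latex
\begin{aligned}
& P_\sigma \mathbb{1} = \mathbb{1}
  \ \text{ and } \ \| P_\sigma \| = 1
  \quad \Longrightarrow \quad \rho(P_\sigma) = 1, \\
& K_\sigma = \beta P_\sigma
  \quad \Longrightarrow \quad
  \rho(K_\sigma) = \beta \, \rho(P_\sigma) = \beta < 1, \\
& v_\sigma = (I - \beta P_\sigma)^{-1} r_\sigma
  = \sum_{t \geq 0} \beta^t P_\sigma^t \, r_\sigma .
\end{aligned}
```

The first line holds because the spectral radius is bounded above by the operator norm and $1$ is an eigenvalue of $P_\sigma$. The last line is the lifetime value formula (6.22) with $K_\sigma = \beta P_\sigma$.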

The following optimality result is an immediate consequence of Proposition 6.1.3.

6.1.5.2 Implications

Since MDPs are such an important special case, we briefly specialize the implications from Section 6.1.3.2 to the MDP setting, replacing the general transition kernel $K$ with $\beta P$.

If the conditions of Proposition 6.1.7 hold and $P$ is strong Feller, then, for every $v \in b \Xsf$, there exists a $\sigma \in \Sigma$ obeying

$$\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \int v(x') P(x,a, \diff x') \right\} \quad \text{for all } x \in \Xsf,$$

and a policy $\sigma \in \Sigma$ is $v$-greedy if and only if (6.44) holds. Moreover, the Bellman operator simplifies to

$$(Tv)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \int v(x') P(x, a, \diff x') \right\}$$

for every $v \in b \Xsf$. These expressions remain valid without the strong Feller condition when we restrict to $v \in b c \Xsf$.

6.2 Applications

We apply the theory developed above to two classes of problems. In Section 6.2.1 we study a natural resource management problem with state-dependent discounting. In Section 6.2.2 we analyze an optimal savings problem with stochastic rates of return on assets.

6.2.1 Natural Resource Management

We consider a natural resource management application with Bellman equation

$$v(y, z) = \max_{0 \leq e \leq y} \left\{ \pi(e) + \beta(z) \int \sum_{z'} v(f(y - e) \xi, z') Q(z, z')\phi(\diff \xi) \right\}.$$

Here $y$ is the stock of the resource, $e$ is the current usage, $Q$ is a stochastic kernel on finite set $\Zsf$, $\phi$ is a distribution on $\RR_+$, $\pi \colon \RR_+ \to \RR$ is a profit function, $f \colon \RR_+ \to \RR_+$ is a transition function that updates the resource, $\beta \colon \Zsf \to \RR_+$ is a discount factor function, and $\xi$ is a multiplicative shock. The quantity $f(y-e) \xi$ is the next period stock.

If, say, $\xi$ is concentrated at 1 and $f(y-e) = y-e$, then this is exploitation of a nonrenewable resource. Another interpretation is that $y$ is a stock of fish at a given fishery, $e$ is current catch, $f$ is a transition rule that updates the stock given biological properties and environmental factors, and $\xi$ is a random shock to updating.

We assume that $\pi$ is continuous and bounded, and that the function $f$ is continuous. In the exogenous discounting setting of Section 6.1.4, the state space is $\Xsf = \RR_+ \times \Zsf$, the action space is $\RR_+$, the feasible correspondence is $\Gamma(y, z) = [0, y]$, the reward function is $r(y, z, e) = \pi(e)$, and the transition kernel is

$$(Kh)(y,z,e) = \beta(z) \sum_{z'} \int h(f(y-e) \xi, z') \phi(\diff \xi) Q(z,z').$$

The endogenous kernel $R$ is determined by

$$\int g(y') R(y, z, e, \diff y') = \int g(f(y-e) \xi) \phi(\diff \xi).$$

The state evolves according to

$$y_{t+1} = f(y_t - \sigma(y_t))\xi_{t+1}$$

where $\sigma$ is the optimal consumption policy. Let’s take a look at the kind of outcomes we can generate when $\beta$ is fixed, so that the exogenous shock process is degenerate. For simulation purposes, profits take the exponential form $\pi(x) = 1 - \exp(-\theta x^\gamma)$, while the transition function is set to $f(x) = x^\alpha \ell(x)$. Here $\ell$ is a generalized logistic function, while $\xi$ is lognormal.[1] We compute the optimal policy $\sigma$ using value function iteration and then study the dynamics associated with the law of motion (6.49).
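The primitives of the simulation and the law of motion (6.49) can be sketched in a few lines of plain Python. The parameter values are those reported in the footnote, with one assumption: $\text{LN}(-0.1, 0.2)$ is read as a lognormal with log-mean $-0.1$ and log-standard-deviation $0.2$. The function `placeholder_policy` is a hypothetical stand-in that consumes a fixed share of the stock. It is not the optimal policy, which the text computes by value function iteration.

```python
import math
import random

# Parameters from the footnote. Treating 0.2 as the standard deviation of
# log(xi) is an assumption about the LN(-0.1, 0.2) notation.
A, B, C, D = 1.0, 1.5, 20.0, 1.0
THETA, GAMMA, ALPHA = 0.5, 0.9, 0.7
MU, S = -0.1, 0.2

def ell(x):
    """Generalized logistic function from the footnote."""
    return A + (B - A) / (1.0 + math.exp(-C * (x - D)))

def f(x):
    """Transition function f(x) = x^alpha * ell(x)."""
    return x ** ALPHA * ell(x)

def profit(e):
    """Profit function pi(e) = 1 - exp(-theta * e^gamma)."""
    return 1.0 - math.exp(-THETA * e ** GAMMA)

def placeholder_policy(y):
    """Hypothetical rule: consume 10% of the stock. NOT the optimal policy."""
    return 0.1 * y

def simulate(y0, T, policy, seed=0):
    """Simulate y_{t+1} = f(y_t - policy(y_t)) * xi_{t+1} for T periods."""
    rng = random.Random(seed)
    path = [y0]
    for _ in range(T):
        y = path[-1]
        xi = rng.lognormvariate(MU, S)
        path.append(f(y - policy(y)) * xi)
    return path

path = simulate(y0=1.0, T=50, policy=placeholder_policy)
log_path = [math.log(y) for y in path]   # the figures in the text work in logs
```

Replacing `placeholder_policy` with a policy obtained from value function iteration would give paths of the kind studied in the remainder of this section.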

Figure 6.3: Optimal policy and dynamics for the natural resource model

Figure 6.3 shows the optimal consumption policy $\sigma$ when $\beta = 0.96$, along with the 45 degree line, the map $y \mapsto f(y) \EE \xi$, which shows the expected next period stock with zero consumption, and the map $y \mapsto f(y - \sigma(y)) \EE \xi$, which shows expected dynamics under the optimal policy. Interestingly, the optimal choice for this parameterization is to consume none of the resource when the stock is small, enabling the stock to grow. Consumption only becomes positive when the stock is large enough to remain stable at a relatively high level. Of course, this kind of behavior will only be seen when the agent is sufficiently patient.

Figure 6.4 shows more detail on the dynamics by examining the stochastic kernel associated with the Markov dynamics in (6.49), after taking logs. Each stochastic kernel is represented as a contour plot of the relevant conditional density. The four subplots correspond to four different values of the discount factor $\beta$. For each value of $\beta$, the plot shows where probability mass for next period stock concentrates relative to current stock, given the associated optimal policy. Mass above the 45 degree line implies that the state moves up on average, while mass below indicates that the state drifts down.

As $\beta$ increases, the optimal policy adjusts to reduce current consumption and increase conservation, leading to probability mass shifting upward at each current state value. The changes in the stochastic kernel in Figure 6.4 seem minor but in fact they have large impacts on long run outcomes. Figure 6.5 illustrates this by showing an estimate of the stationary distribution corresponding to each Markov process. Densities were estimated by simulating 100 independent paths of length 1,000 from a common initial condition. The plots show a sharp transition around $\beta=0.95$. For $\beta$ around that level, the long run stock is low. For slightly higher $\beta$, the optimal path leads to much larger stocks (recalling that we are working in logs).

Figure 6.4: Stochastic kernel under the optimal policy at different $\beta$

Figure 6.5: Variation in the stationary distribution across $\beta$ values

Up until now we’ve taken $\beta$ as a fixed parameter when computing optimal policies. Now we allow it to vary with an exogenous state $z$ via $\beta(z)$, in line with our theoretical analysis in Proposition 6.2.1. To illustrate the effect of state-dependent discounting, we set $\Zsf = \{0.9, 0.99\}$ and $\beta(z) = z$. The exogenous state follows a two-state Markov chain with persistence 0.99 in each state. Other model parameters are as in the fixed-$\beta$ experiments above. We computed optimal policies via value function iteration on the product space $\RR_+ \times \Zsf$.
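The exogenous discount process in this experiment is simple to simulate. The sketch below draws one path of $\beta_t = \beta(Z_t) = Z_t$ from the two-state chain, using the state space and persistence stated above. The function name, the seed, the initial state, and the path length are arbitrary choices for illustration.

```python
import random

Z = [0.9, 0.99]          # exogenous states, with beta(z) = z
PERSISTENCE = 0.99       # probability of remaining in the current state

def simulate_discount_path(T, i0=0, seed=1234):
    """Simulate beta_t = Z_t for T periods, starting from state index i0."""
    rng = random.Random(seed)
    i, path = i0, []
    for _ in range(T):
        path.append(Z[i])
        if rng.random() > PERSISTENCE:   # switch states with probability 0.01
            i = 1 - i
    return path

betas = simulate_discount_path(T=2_000)

# Count regime changes along the path.
switches = sum(1 for b0, b1 in zip(betas, betas[1:]) if b0 != b1)
```

With persistence 0.99, the time spent in a regime before switching is geometric with mean $1 / (1 - 0.99) = 100$ periods, so patient and impatient regimes are long-lived.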

Figure 6.7 shows the outcome of simulating 20 independent paths of the resource stock under the optimal policy, given a single realization of the exogenous process $(Z_t)$. The top panel displays the discount factor $\beta_t$, while the bottom panel shows the corresponding log stock $\log y_t$ over multiple alternative paths for $(\xi_t)$. During patient regimes, the stock tends to grow as the optimal policy shifts toward conservation. When the discount factor drops, the agent increases exploitation and the stock tends to decline.

Figure 6.6: Optimal investment policy under state-dependent discounting

Figure 6.7: Simulated resource dynamics under state-dependent discounting

6.2.2 Stochastic Rates of Return

As our next application, we consider a savings problem with a persistent state process and a stochastic rate of return on assets. Stochastic returns on assets appear to be important in generating sufficiently heavy right tails in wealth distributions when we take models to the data.

In this model,

$$\int v(x') P(x, a, \diff x') = \sum_{z'} \int v[ R(z') (w - c) + y(z', s') , z' ] \phi(\diff s') Q(z, z').$$

The kernel can be explained as follows: Labor income is affected by an iid shock $s'$ drawn from distribution $\phi \in \dD(\Ssf)$, where $\Ssf$ is a topological space. In addition, both the interest rate and labor income are impacted by a common persistent component $z$. The latter is driven by stochastic matrix $Q$. We give $\Zsf$ the discrete topology and $\Xsf = \RR_+ \times \Zsf$ the product topology.

We apply Proposition 6.1.7 to this model. For Assumption 6.1.1, continuity and compact-valuedness of $\Gamma$ follow from Exercise A.20, and $r = u$ is continuous by assumption. It remains to verify that $P$ is weak Feller. Fixing $v \in bc\Xsf$, we must show that the mapping

$$m(w, z, c) \coloneq \sum_{z'} \int v[ R(z') (w - c) + y(z', s') , z' ] \phi(\diff s') Q(z, z')$$

is continuous on $\Gsf$. Taking $(w_n, z_n, c_n) \to (w, z, c)$ in $\Gsf$, since $\Zsf$ has the discrete topology, $(z_n)$ is eventually constant at $z$. Hence it suffices to show that $m(w_n, z, c_n) \to m(w, z, c)$. This follows from continuity and boundedness of $v$ and the dominated convergence theorem.

Hence Proposition 6.1.7 applies and the conclusions therein hold. The Bellman operator takes the form

$$(Tv)(w,z) = \max_{0 \leq c \leq w} \left\{ u(c) + \beta \, \sum_{z' \in \Zsf} \int v[ R(z') (w - c) + y(z', s') , z' ] \phi(\diff s') Q(z, z') \right\}.$$

6.3 Chapter Notes

The Feller properties discussed in Section 6.1.1 are standard tools in the theory of Markov chains and stochastic processes. For further background, see Hernández-Lerma & Lasserre (2012) or Bäuerle & Rieder (2011). The use of Feller conditions to guarantee existence of optimal policies in MDPs and dynamic programs dates back to the foundational work of Blackwell (1965). Scheffé’s lemma, used in the proof of Lemma 6.1.1, is a classical result in measure theory.

Standard proofs of the optimality results we stated for MDPs on general state spaces (Section 6.1.5) can be found in Puterman (2005), Bäuerle & Rieder (2011), Hernández-Lerma & Lasserre (2012), Stachurski (2022), or Sargent & Stachurski (2025).

The exposition of exogenous discount processes in Section 6.1.4 is partly based on Stachurski & Zhang (2021). State-dependent discounting in the context of dynamic programming is also studied in Jaśkiewicz et al. (2014).

The natural resource management model in Section 6.2.1 is a standard bioeconomic exploitation model; see Clark (2010) for background. Versions with state-dependent discounting are relevant for modeling resource management under fluctuating economic conditions.

For a discussion of stochastic rates of return on financial income, as considered in Section 6.2.2, see Benhabib et al. (2015) or Stachurski & Toda (2019). The latter shows that heavy-tailed wealth distributions can also be generated by time preference shocks, but this channel is relatively unrealistic, since it requires that all households in the economy simultaneously experience time preference shocks in the same direction. Additional work on the relationship between stochastic discount factors and wealth distributions includes Toda (2019), Ma et al. (2020), and Nirei & Aoki (2015).

What we have called linear decision processes (LDPs) might be confused with Markov decision processes having linear reward or cost functions. The latter are a special case of the former. For a recent discussion of MDPs with linear cost functions, see Rantzer (2022) and Li & Bertsekas (2024).

Footnotes
  1. The logistic function is $\ell(x) = a + (b-a)/(1 + \exp(-c(x-d)))$ with $a=1$, $b=1.5$, $c=20$, $d=1$. Other parameters are $\theta = 0.5$, $\gamma = 0.9$, $\alpha = 0.7$, and $\xi \sim \text{LN}(-0.1, 0.2)$. The optimal policy was computed by value function iteration on a grid of 500 state points and 2,000 action points using JAX.

References
  1. De Nardi, M., Fella, G., & Paz-Pardo, G. (2020). Nonlinear household earnings dynamics, self-insurance, and welfare. Journal of the European Economic Association, 18(2), 890–926.
  2. Hernández-Lerma, O., & Lasserre, J. B. (2012). Discrete-time Markov control processes: basic optimality criteria (Vol. 30). Springer Science & Business Media.
  3. Bäuerle, N., & Rieder, U. (2011). Markov decision processes with applications to finance. Springer Science & Business Media.
  4. Blackwell, D. (1965). Discounted Dynamic Programming. The Annals of Mathematical Statistics, 36(1), 226–235.
  5. Puterman, M. L. (2005). Markov decision processes: discrete stochastic dynamic programming. Wiley Interscience.
  6. Stachurski, J. (2022). Economic dynamics: theory and computation (2nd ed.). MIT Press.
  7. Sargent, T. J., & Stachurski, J. (2025). Dynamic Programming: Finite States. Cambridge University Press.
  8. Stachurski, J., & Zhang, J. (2021). Dynamic programming with state-dependent discounting. Journal of Economic Theory, 192, 105190.
  9. Jaśkiewicz, A., Matkowski, J., & Nowak, A. S. (2014). On variable discounting in dynamic programming: applications to resource extraction and other economic models. Annals of Operations Research, 220, 263–278.
  10. Clark, C. W. (2010). Mathematical Bioeconomics: The Mathematics of Conservation (3rd ed.). John Wiley & Sons.
  11. Benhabib, J., Bisin, A., & Luo, M. (2015). Wealth distribution and social mobility in the US: A quantitative approach [Techreport]. National Bureau of Economic Research.
  12. Stachurski, J., & Toda, A. A. (2019). An impossibility theorem for wealth in heterogeneous-agent models with limited heterogeneity. Journal of Economic Theory, 182, 1–24.
  13. Toda, A. A. (2019). Wealth distribution with random discount factors. Journal of Monetary Economics, 104, 101–113.
  14. Ma, Q., Stachurski, J., & Toda, A. A. (2020). The income fluctuation problem and the evolution of wealth. Journal of Economic Theory, 187, 105003.
  15. Nirei, M., & Aoki, S. (2015). Wealth distribution and stochastic discount factors. Journal of Monetary Economics, 69, 119–133.