In this chapter, we define linear decision processes (LDPs). In terms of level of generality, we can think of LDPs as sitting between MDPs, as discussed in Section 1.2, and the ADPs we considered in Chapter 2--Chapter 5:
with all inclusions being strict. LDPs have a range of advantages over MDPs while maintaining much of their tractability. One advantage is that we can work with state-dependent discounting, which is particularly important for economic and financial applications. Another is that their flexible structure makes them easy to apply. For example, optimal stopping problems can be embedded directly into the LDP framework, whereas embedding optimal stopping problems into MDPs requires expanding the state space.
LDPs differ from ADPs by including actions explicitly, instead of taking policy operators as the basic primitive. This is a more traditional perspective: one where controllers observe states and respond to those states by choosing actions. Ultimately, a choice of action given a state will take the form of a policy function; that will lead us back to the ADPs. By studying this circle, we can leverage theory from our earlier chapters.
LDPs are more limited than ADPs, but also more concrete and more structured. For example, they provide an algebraic formula for computing lifetime values similar to the one available for MDPs (see, e.g., (1.42) on page ). This formula is not available for general ADPs. Thus, for LDPs, the HPI step requiring computation of lifetime values from policies is fully articulated, at least at a theoretical level. Another advantage of LDPs, relative to ADPs, is that we can start to construct systematic conditions for regularity, or existence of greedy policies, unlike in previous chapters.
We begin with the theoretical foundations in Section 6.1. After introducing Feller properties in Section 6.1.1, we define LDPs in Section 6.1.2.1 and present optimality results. We then treat exogenous discount processes (Section 6.1.4) and specialize the LDP framework to MDPs on general state spaces (Section 6.1.5). In this chapter, we focus on the bounded case; unbounded models are handled in Chapter 7. Section 6.2.1 and Section 6.2.2 apply the theory to natural resource management and optimal savings with stochastic rates of return.
In this section we develop the foundational theory of linear decision processes. We begin in Section 6.1.1 by studying Feller properties of transition kernels, which provide the continuity conditions needed for existence of optimal policies. We then define LDPs in Section 6.1.2.1, give examples, and discuss lifetime values. Next we present optimality results and their implications. Section 6.1.4 treats exogenous discount processes, and Section 6.1.5 specializes the LDP framework to Markov decision processes on general state spaces.
Since we are always interested in whether or not optimal policies exist, we study conditions under which future values are continuous in states and actions. In the case of LDPs, this continuity will require that integrals of transition kernels vary continuously with actions. (For background on transition kernels see Section A.5.4.1.) Here we provide a collection of definitions and results that help us address this question.
Throughout this section,
X and A are separable metric spaces,
∥⋅∥:=∥⋅∥∞ is the supremum norm,
G is a subset of X×A, and
K is a transition kernel mapping bX to bG.
Here you can think of G as a collection of feasible state-action pairs. The last statement means that
Extending standard terminology, we will say that the transition kernel K is
weak Feller if Kh is continuous on G whenever h∈bcX and
strong Feller if Kh is continuous on G whenever h∈bX.
Let’s look at some special cases.
The strong Feller property requires more conditions, since we need to map a potentially discontinuous function h into a continuous function Kh. For this, we rely on smoothing properties of the integral. To obtain these properties we introduce a “dominating” measure μ on (X,B), which we assume to be σ-finite. A Borel measurable map p from G×X to R is called a density kernel from G to X with dominating measure μ if p is nonnegative and
If the dominating measure μ is not identified in the discussion below then we will be referring to Lebesgue measure, and we write dx instead of μ(dx). The following lemma shows how a continuous density kernel can transform discontinuous functions into continuous ones under integration.
We now introduce LDPs and study their basic properties. Section 6.1.2.1 defines LDPs and connects them to the ADP framework. We then present several examples, showing which models can and cannot be expressed as LDPs. Finally, we discuss lifetime values and their computation in the LDP setting.
Let X and A be separable metric spaces, referred to henceforth as the state and action spaces. As before, ∥⋅∥ denotes the supremum norm on bX. Given X and A, a linear decision process (LDP) is a tuple (Γ,r,K) containing
a nonempty correspondence Γ from X to A called the feasible correspondence, with an associated set of feasible state-action pairs
a bounded Borel measurable reward functionr mapping G into R, and
a transition kernelK from G to X satisfying Kv∈bG whenever v∈bX.
The set Γ(x) represents all actions available to a controller in state x. Figure Figure 6.1 shows an illustration of one possible correspondence Γ when A=X=R+, along with G, the resulting set of feasible state-action pairs. When representing the LDP by the tuple (Γ,r,K), we are treating X and A as understood from context.
Figure 6.1:Feasible correspondence and feasible state-action pairs
For the LDP (Γ,r,K), a feasible policy is a Borel measurable map σ:X→A such that σ(x)∈Γ(x) for all x∈X. Figure 6.2 shows a feasible policy σ in the same setting.
The assumption that X and A are metric spaces is important in some applications and irrelevant in others. For simplicity, we maintain it throughout. When X and A are discrete, the metric in question is always understood to be the discrete metric. In this case, every subset of these sets is a Borel set, so the measurability constraint in the definition of Σ never binds.
Given v∈bX, we have Kv∈bG and hence Kσv∈bX. Since bX is a vector space, it follows that Tσv is in bX. Since K is a transition kernel, Kσ is a positive linear operator, so Tσ is order preserving. Hence
is an ADP. We call (V,TLDP) the ADP generated by(Γ,r,K) and use the following obvious conventions:
(Γ,r,K) is called well-posed (resp., regular, order stable, etc.) if (V,TLDP) is well-posed (resp., regular, order stable, etc.).
vσ is the σ-value function for (Γ,r,K) when vσ is the σ-value function for (V,TLDP),
σ is called optimal for (Γ,r,K) when σ is optimal for (V,TLDP),
etc.
We notice that each Tσ has the affine form from the ADP analysis in Section 4.1.2.3, with Kσ∈B+(bX) by Theorem A.5.25. We will use the theorems in that section for some of our optimality results.
Let (Γ,r,K) be an LDP with state space X, action space A. Given a policy σ∈Σ, the σ-value functionvσ is defined as the fixed point of the policy operator Tσ in (6.13). As a result, vσ satisfies the recursion
If the spectral radius condition ρ(Kσ)<1 holds, then, by the Neumann series lemma (see, in particular Corollary A.4.11 on page ), the operator I−Kσ is invertible on bX and the unique solution to (6.21) is
The t-th term Kσtrσ gives the expected reward at time t under policy σ, discounted back to the present.
The explicit representation of vσ in (6.22) is valuable for computation. For example, the MDP version of HPI in Algorithm 1.2.2 can be extended to the current setting by replacing v←(I−βPσ)−1rσ with v←(I−Kσ)−1rσ. Under the conditions of Proposition 6.1.3, with K strong Feller, this algorithm converges.
Now we turn to optimality results. We first treat the case where T is finite, and then shift to general (metric) state and action spaces by adding continuity conditions. We conclude by deriving implications for greedy policies and the Bellman operator.
In the following, we suppose that (Γ,r,K) is an LDP with state space X and action space A. As before, these sets are separable metric spaces (with the discrete topology when finite). As shown in Section 6.1.2.1, the LDP (Γ,r,K) generates an ADP (bX,TLDP) where each Tσ∈TLDP has the affine form Tσv=rσ+Kσv. We will infer optimality of the LDP by studying this ADP.
(See the proof of Proposition 6.1.3.) In this setting, a policy σ∈Σ is v-greedy if and only if (6.25) holds. Moreover, the Bellman operator simplifies to
In Chapter 4 we look at several settings that include state-dependent discounting. In each case the setting was relatively simple: either a binary stopping problem or a model with discrete states and actions. Here we’ll look at a problem with continuous state and action spaces. To make this setting tractable, we’ll insist that the discount factor process depends only on an exogenous state (i.e., a state that is not influenced by decisions of the agent).
When the discount factor varies over time, forming a sequence (βt)t⩾0, the present value of a random time t payoff Ht has the general form Eβ0⋯βt−1Ht. In this section we formalize this idea in a Markov environment and examine some simple consequences.
Let Z be a metric space and let Q be a stochastic kernel on Z. Let (Zt) be Q-Markov on Z. Let β∈bZ be a nonnegative function and consider the discount factor process (βt)t⩾0 where βt:=β(Zt) for all t. We introduce the operator
Let X and A be separable metric spaces and let (Γ,r,K) be an LDP with state space X and action space A. Suppose further that X is a product space of the form Y×Z and that K has the form
We call R the endogenous kernel, Q the exogenous kernel and β the discount function. The expression for K tells us that the endogenous state y updates via the kernel R, depending on current state x=(y,z) and action a, while z updates via Q. Since we are taking products, the two updates are independent. The exogenous process feeds into values and hence optimal policies through its impact on the discount factor.
To make our lives slightly easier, we’ll assume that Z is finite. As with every other finite set, we endow Z with the discrete topology.
Let KQ be defined as in (6.27). In this setting, we have the following result.
We treated discrete MDPs in Section 1.2. Let’s now consider MDPs on general state spaces. Mathematically, MDPs are LDPs with a fixed discount factor and Markov dynamics under any fixed policy. On one hand, MDPs are a special case of LDPs and need no separate theoretical discussion. On the other hand, MDPs are a benchmark representation of a dynamic program, used throughout mathematics, operations research, and computer science. For this reason we’ll take the time to specialize our LDP results to the Markov setting. Throughout this section, X and A are separable metric spaces.
Let (Γ,r,K) be an LDP with state space X and action space A. This LDP is called a Markov Decision Process (MDP) when the transition kernel has the form
Since MDPs are such an important special case, we briefly specialize the implications from Section 6.1.2.1 to the MDP setting, replacing the general transition kernel K with βP.
If the conditions of Proposition 6.1.7 hold and P is strong Feller, then, for every v∈bX, there exists a σ∈Σ obeying
σ(x)∈a∈Γ(x)argmax{r(x,a)+β∫v(x′)P(x,a,dx′)}for all x∈X,
We apply the theory developed above to two classes of problems. In Section 6.2.1 we study a natural resource management problem with state-dependent discounting. In Section 6.2.2 we analyze an optimal savings problem with stochastic rates of return on assets.
Here y is the stock of the resource, e is the current usage, Q is a stochastic kernel on finite set Z, ϕ is a distribution on R+, π:R+→R is a profit function, f:R+→R+ is a transition function that updates the resource, β:Z→R+ is a discount factor function, and ξ is a multiplicative shock. The quantity f(y−e)ξ is the next period stock.
If, say, ξ is concentrated at 1 and f(y−e)=y−e, then this is exploitation of a nonrenewable resource. Another interpretation is that y is a stock of fish at a given fishery, e is current catch, f is a transition rule that updates the stock given biological properties and environmental factors, and ξ is a random shock to updating.
We assume that π is continuous and bounded, and that the function f is continuous. In the exogenous discounting setting of Section 6.1.4, the state is X=R+×Z, the action space is R+, the feasible correspondence is Γ(y,z)=[0,y], the reward function is r(y,z,e)=π(e), and the transition kernel is
where σ is the optimal consumption policy. Let’s take a look at the kind of outcomes we can generate when β is fixed, so that the exogenous shock process is degenerate. For simulation purposes, profits take the exponential form π(x)=1−exp(−θxγ), while the transition function is set to f(x)=xαℓ(x). Here ℓ is a generalized logistic function, while ξ is lognormal.[1] We compute the optimal policy σ using value function iteration and then study the dynamics associated with the law of motion (6.49).
Figure 6.3:Optimal policy and dynamics for the natural resource model
Figure 6.3 shows the optimal consumption policy σ when β=0.96, along with the 45 degree line, the map y↦f(y)Eξ, which shows the expected next period stock with zero consumption, and the map y↦f(y−σ(y))Eξ, which shows expected dynamics under the optimal policy. Interestingly, the optimal choice for this parameterization is to consume none of the resource when the stock is small, enabling the stock to grow. Consumption only becomes positive when the stock is large enough to remain stable at a relatively high level. Of course, this kind of behavior will only be seen when the agent is sufficiently patient.
Figure 6.4 shows more detail on the dynamics by examining the stochastic kernel associated with the Markov dynamics in (6.49), after taking logs. Each stochastic kernel is represented as a contour plot of the relevant conditional density. The four subplots correspond to four different values of the discount factor β. For each value of β, the plot shows where probability mass for next period stock concentrates relative to current stock, given the associated optimal policy. Mass above the 45 degree line implies that the state moves up on average, while mass below indicates that the state drifts down.
As β increases, the optimal policy adjusts to reduce current consumption and increase conservation, leading to probability mass shifting upward at each current state value. The changes in the stochastic kernel in Figure 6.4 seem minor but in fact they have large impacts on long run outcomes. Figure 6.5 illustrates this by showing an estimate of the stationary distribution corresponding to each Markov process. Densities were estimated by simulating 100 independent paths of length 1,000 from a common initial condition. The plots show a sharp transition around β=0.95. For β around that level, the long run stock is low. For slightly higher β, the optimal path leads to much larger stocks (recalling that we are working in logs).
Figure 6.4:Stochastic kernel under the optimal policy at different β
Figure 6.5:Variation in the stationary distribution across β values
Up until now we’ve taken β as a fixed parameter when computing optimal policies. Now we allow it to vary with an exogenous state z via β(z), in line with our theoretical analysis in Proposition 6.2.1. To illustrate the effect of state-dependent discounting, we set Z={0.9,0.99} and β(z)=z. The exogenous state follows a two-state Markov chain with persistence 0.99 in each state. Other model parameters are as in the fixed-β experiments above. We computed optimal policies via value function iteration on the product space R+×Z.
Figure 6.7 shows the outcome of simulating 20 independent paths of the resource stock under the optimal policy, given a single realization of the exogenous process (Zt). The top panel displays the discount factor βt, while the bottom panel shows the corresponding log stock logyt over multiple alternative paths for (ξt). During patient regimes, the stock tends to grow as the optimal policy shifts toward conservation. When the discount factor drops, the agent increases exploitation and the stock tends to decline.
Figure 6.6:Optimal investment policy under state-dependent discounting
Figure 6.7:Simulated resource dynamics under state-dependent discounting
As our next application, we consider a savings problem with a persistent state process and a stochastic rate of return on assets. Stochastic returns on assets appear to be important in generating sufficiently heavy right tails in wealth distributions when we take models to the data.
In this model,
the state is x=(w,z), where w∈R+ is wealth and z is an exogenous state process on finite set Z with stochastic kernel (matrix) Q,
the action a is current consumption c, taking values in R+,
the feasible correspondence is Γ(x)=Γ(w,z)=[0,w],
the reward is r(x,a)=r((w,z),c)=u(c), where u is bounded and continuous,
The kernel can be explained as follows: Labor income is affected by an iid shock s′ drawn from distribution ϕ∈D(S), where S is a topological space. In addition, both the interest rate and labor income are impacted by a common persistent component z. The latter is driven by stochastic matrix Q. We give Z the discrete topology and X=R+×Z the product topology.
We apply Proposition 6.1.7 to this model. For Assumption 6.1.1, continuity and compact-valuedness of Γ follow from Exercise A.20, and r=u is continuous by assumption. It remains to verify that P is weak Feller. Fixing v∈bcX, we must show that the mapping
is continuous on G. Taking (wn,zn,cn)→(w,z,c) in G, since Z has the discrete topology, (zn) is eventually constant at z. Hence it suffices to show that m(wn,z,cn)→m(w,z,c). This follows from continuity and boundedness of v and the dominated convergence theorem.
Hence Proposition 6.1.7 applies and the conclusions therein hold. The Bellman operator takes the form
The Feller properties discussed in Section 6.1.1 are standard tools in the theory of Markov chains and stochastic processes. For further background, see Hernández-Lerma & Lasserre (2012) or Bäuerle & Rieder (2011). The use of Feller conditions to guarantee existence of optimal policies in MDPs and dynamic programs dates back to the foundational work of Blackwell (1965). Scheffé’s lemma, used in the proof of Lemma 6.1.1, is a classical result in measure theory (see p. ).
Standard proofs of the optimality results we stated for MDPs on general state spaces (Section 6.1.5) can be found in Puterman (2005), Bäuerle & Rieder (2011), Hernández-Lerma & Lasserre (2012), Stachurski (2022), or Sargent & Stachurski (2025).
The exposition of exogenous discount processes in Section 6.1.4 is partly based on Stachurski & Zhang (2021). State-dependent discounting in the context of dynamic programming is also studied in Jaśkiewicz et al. (2014).
The natural resource management model in Section 6.2.1 is a standard bioeconomic exploitation model; see Clark (2010) for background. Versions with state-dependent discounting are relevant for modeling resource management under fluctuating economic conditions.
For a discussion of stochastic rates of return on financial income, as considered in Section 6.2.2, see Benhabib et al. (2015) or Stachurski & Toda (2019). The latter shows that heavy-tailed wealth distributions can also be generated by time preference shocks, but this channel is relatively unrealistic, since it requires that all households in the economy simultaneously experience time preference shocks in the same direction. Additional work on the relationship between stochastic discount factors and wealth distributions includes Toda (2019), Ma et al. (2020), and Nirei & Aoki (2015).
What we have called linear decision processes (LDPs) might be confused with Markov decision processes having linear reward or cost functions. The latter are a special case of the former. For a recent discussion of MDPs with linear cost functions, see Rantzer (2022) and Li & Bertsekas (2024).
The logistic function is ℓ(x)=a+(b−a)/(1+exp(−c(x−d))) with a=1, b=1.5, c=20, d=1. Other parameters are θ=0.5, γ=0.9, α=0.7, and ξ∼LN(−0.1,0.2). The optimal policy was computed by value function iteration on a grid of 500 state points and 2,000 action points using JAX.
De Nardi, M., Fella, G., & Paz-Pardo, G. (2020). Nonlinear household earnings dynamics, self-insurance, and welfare. Journal of the European Economic Association, 18(2), 890–926.
Hernández-Lerma, O., & Lasserre, J. B. (2012). Discrete-time Markov control processes: basic optimality criteria (Vol. 30). Springer Science & Business Media.
Bäuerle, N., & Rieder, U. (2011). Markov decision processes with applications to finance. Springer Science & Business Media.
Blackwell, D. (1965). Discounted Dynamic Programming. The Annals of Mathematical Statistics, 36(1), 226–235.
Puterman, M. L. (2005). Markov decision processes: discrete stochastic dynamic programming. Wiley Interscience.
Stachurski, J. (2022). Economic dynamics: theory and computation (2nd ed.). MIT Press.
Sargent, T. J., & Stachurski, J. (2025). Dynamic Programming: Finite States. Cambridge University Press.
Stachurski, J., & Zhang, J. (2021). Dynamic programming with state-dependent discounting. Journal of Economic Theory, 192, 105190.
Jaśkiewicz, A., Matkowski, J., & Nowak, A. S. (2014). On variable discounting in dynamic programming: applications to resource extraction and other economic models. Annals of Operations Research, 220, 263–278.
Clark, C. W. (2010). Mathematical Bioeconomics: The Mathematics of Conservation (3rd ed.). John Wiley & Sons.
Benhabib, J., Bisin, A., & Luo, M. (2015). Wealth distribution and social mobility in the US: A quantitative approach [Techreport]. National Bureau of Economic Research.
Stachurski, J., & Toda, A. A. (2019). An impossibility theorem for wealth in heterogeneous-agent models with limited heterogeneity. Journal of Economic Theory, 182, 1–24.
Toda, A. A. (2019). Wealth distribution with random discount factors. Journal of Monetary Economics, 104, 101–113.
Ma, Q., Stachurski, J., & Toda, A. A. (2020). The income fluctuation problem and the evolution of wealth. Journal of Economic Theory, 187, 105003.
Nirei, M., & Aoki, S. (2015). Wealth distribution and stochastic discount factors. Journal of Monetary Economics, 69, 119–133.