Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

1 Prelude: Examples of Dynamic Programs

Dynamic programming is a recursive technique for solving optimization problems. While initially developed for intertemporal problems (inventory management, investment planning, optimal savings and consumption, etc.), it has since been applied to various atemporal problems, ranging from genome sequencing and matrix multiplication to the structure of production chains. Creators of machine learning and artificial intelligence routinely use dynamic programming.

The sheer breadth of applications currently being tackled with dynamic programming is a challenge for presenting a modern theory. As well as the vast range of concrete problems faced in applied settings, researchers are adding features to their models that require extensions of the foundations of dynamic programming. Such new features include time-varying discount rates, nonlinear discounting, risk-sensitive control, ambiguity aversion, nonlinear time aggregation, and so on. Researchers added these features in their quests to create new models capable of coming closer to data sets of interest.

To set the scene, we begin with some problems that can be handled using classic methods. Then we discuss extensions and transformations that require more sophisticated theoretical foundations. Since our purpose here is to set the stage for the abstract theory that is the main focus of this Volume, our presentation here is mathematically informal at many points. Results stated in this chapter are special cases of the abstract theory, which begins in Chapter 2.

1.1A Firm Problem

This section uses a firm valuation and control problem to introduce core dynamic programming concepts. We begin with traditional methods, showing how recursive representations lead to the Bellman equation and optimal policies. We then discuss extensions involving unbounded rewards, time-varying discount rates, and risk-sensitive preferences.

1.1.1Models of a Firm

We first consider firm valuation under a Markov profit process, deriving a recursive representation for expected present value. Next, by giving the manager an option to sell the firm at a time of their choosing, we add a control. A Bellman type of optimality is then stated and proved. This material establishes a template for dynamic programs that recur throughout this chapter.

1.1.1.1Valuation

A firm generates a random flow of profits (πt)t0=π0,π1,π2,(\pi_t)_{t \geq 0} = \pi_0, \pi_1, \pi_2, \ldots. At time tt, a manager wants to calculate the present value of the profit process. Current profits πt\pi_t are known but future values πt+1,πt+2,\pi_{t+1}, \pi_{t+2}, \ldots are not. The manager knows the distribution of the process (πt)t0(\pi_t)_{t \geq 0}. Consequently the manager can compute the expected present value, namely

VtEt[πt+βπt+1+β2πt+2+].V_t \coloneq \EE_t \left[ \pi_t + \beta \pi_{t+1} + \beta^2 \pi_{t+2} + \cdots \right].

Here β\beta is a discount factor, often reparameterized as β=1/(1+r)\beta = 1/(1+r) when rr is a discount rate. Thus, βmπt+m\beta^m \pi_{t+m} is time t+mt+m profits discounted to the present date tt. The symbol Et\EE_t denotes mathematical expectation conditional on time tt information.

We can switch to a recursive representation of the sequence of valuations (Vt)t0(V_t)_{t \geq 0} by first writing

Vt=πt+βEt[πt+1+βπt+2+β2πt+3+]V_t = \pi_t + \beta \EE_t \left[ \pi_{t+1} + \beta \pi_{t+2} + \beta^2 \pi_{t+3} + \cdots \right]

and then applying the law of iterated expectations Et=EtEt+1\EE_t = \EE_t \EE_{t+1}. This leads to

Vt=πt+βEtVt+1.V_t = \pi_t + \beta \EE_t V_{t+1}.

This expression will be important for us because the theory of dynamic programming is built around recursive relationships. Later we will see many variations of (1.3).

To make further progress in computing (Vt)t0(V_t)_{t \geq 0}, let’s assume that πt=π(Xt)\pi_t = \pi(X_t), where (Xt)t0(X_t)_{t \geq 0} is a discrete time Markov process taking values in a measurable space (X,B)(\Xsf, \bB). We assume that (Xt)t0(X_t)_{t \geq 0} is driven by a stochastic kernel PP (see Section A.5.4.1), so that P(Xt,B)P(X_t, B) is the probability that Xt+1BBX_{t+1} \in B \in \bB given current state XtX_t. The function π\pi is assumed to be in bXb\Xsf, the set of bounded, B\bB-measurable functions from X\Xsf to R\RR. We understand XtX_t as representing the “state of the world”, including all factors affecting firm profits. Figure 1.1 shows multiple realizations of the profit process (πt)t0(\pi_t)_{t \geq 0} in the case where X=R\Xsf = \RR, B\bB is the Borel sets, π(x)=exp(x)\pi(x) = \exp(x), and (Xt)(X_t) is a discretization of an AR(1) process with ρ=0.9\rho = 0.9 and ν=0.2\nu = 0.2.

Sample profit paths

Figure 1.1:Sample profit paths

Under the Markov profits assumption, VtV_t will depend on the current state XtX_t, since knowing the state helps predict future profits. At the same time, the Markov assumption means that earlier values X0,,Xt1X_0, \ldots, X_{t-1} will not aid prediction once XtX_t is known. This leads us to conjecture that Vt=v(Xt)V_t = v(X_t) for some fixed function v ⁣:XRv \colon \Xsf \to \RR. Inserting this conjecture into (1.3) and evaluating at Xt=xX_t = x yields

v(x)=π(x)+βv(x)P(x, ⁣dx).v(x) = \pi(x) + \beta \int v(x') P(x, \diff x').

Defining (Pv)(x)v(x)P(x, ⁣dx)(Pv)(x) \coloneq \int v(x') P(x, \diff x'), which is consistent with the operator-theoretic notation in Section A.5.4, we can rewrite (1.4) as v=π+βPvv = \pi + \beta P v. Using the fact that πbX\pi \in b\Xsf, and that the spectral radius of βP\beta P is β\beta (Lemma A.5.30), this equation has a unique solution for vv in bXb\Xsf whenever β<1\beta < 1. The Neumann series lemma (Theorem A.4.10) tells us that the solution has the form

v=(IβP)1π.v = (I - \beta P)^{-1} \pi.
Value of the firm for different discount factors

Figure 1.2:Value of the firm for different discount factors

Figure 1.2 plots v=(IβP)1πv = (I - \beta P)^{-1} \pi over the state space X\Xsf for several choices of the discount factor β\beta. The environment is the same as the one underlying Figure 1.1. As β\beta increases, the firm places greater weight on future profits, resulting in higher valuations across all states.

Notice how the difficult problem of computing a stochastic process (Vt)t0(V_t)_{t \geq 0} has been converted into the much easier problem of calculating v=(IβP)1πv = (I - \beta P)^{-1} \pi.

1.1.1.2Control

So far our manager has had no choice to make. Let’s now give the manager the option to sell the firm to an outside buyer at the start of each period (before receiving current profits) for the fixed price ss. We will describe the manager’s decision with a binary policy function σ\sigma that selects between selling the firm in the current period, after observing the current state xx, and not selling --- in which case the manager again faces the option to sell the firm at the beginning of the next period. The same policy σ\sigma is applied in every period, with σ(x){0,1}\sigma(x) \in \{0,1\} being the current decision. Modifying (1.4) appropriately to a σ\sigma-dependent value function, we obtain the Bellman equation

vσ(x)=σ(x)s+(1σ(x))[π(x)+βvσ(x)P(x, ⁣dx)],v_\sigma(x) = \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta \int v_\sigma(x') P(x, \diff x') \right],

where vσ(x)v_\sigma(x) is the total value of the firm under the policy σ\sigma, conditional on state xx. If σ(x)=1\sigma(x) = 1, then the manager sells at payoff ss and the process ends, so ss is the total value received. Otherwise σ(x)=0\sigma(x) = 0, indicating no sale, profit π(x)\pi(x) is received, and the process continues, with discounted expected payoff βvσ(x)P(x, ⁣dx)\beta \int v_\sigma(x') P(x, \diff x'). This is the second term in (1.6). In what follows we call vσv_\sigma the σ\sigma-value function.

Let Σ\Sigma be the set of all policies, defined as all B\bB-measurable functions mapping X\Xsf to {0,1}\{0,1\}. For each σΣ\sigma \in \Sigma, let vσv_\sigma be the function defined recursively in (1.6). We can use the Neumann series lemma to show that vσv_\sigma is uniquely defined in bXb\Xsf. Alternatively, we can introduce the operator Tσ ⁣:bXbXT_\sigma \colon b\Xsf \to b\Xsf via

(Tσv)(x)=σ(x)s+(1σ(x))[π(x)+βv(x)P(x, ⁣dx)](xX).(T_\sigma \, v)(x) = \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right] \qquad (x \in \Xsf).

We call TσT_\sigma the policy operator defined by σ\sigma. In operator form it can be expressed as

Tσv=σs+(1σ)(π+βPv).T_\sigma \, v = \sigma s + (1-\sigma)(\pi + \beta Pv).

Evidently, vσv_\sigma solves (1.6) if and only if vσv_\sigma is a fixed point of TσT_\sigma.

One easily shows that TσT_\sigma is contracting (see Section A.2.2.2 for the definition) on the complete space (bX,)(b\Xsf, \| \cdot\|), where \| \cdot \| is the supremum norm, defined by f=supxXf(x)\| f \| = \sup_{x \in \Xsf} |f(x)| for all fbXf \in b\Xsf. Indeed, for arbitrary v,wbXv, w \in b\Xsf,

TσvTσw(1σ)βPvPwβPvPwβPvw,|T_\sigma \, v - T_\sigma \, w| \leq (1-\sigma) \beta | Pv - Pw| \leq \beta | Pv - Pw| \leq \beta P | v - w|,

where the absolute value | \cdot | is applied pointwise (and the last inequality is by the triangle inequality for integrals --- or, if you prefer, by the inequality for positive operators in Exercise A.41 on page ). From the last expression we get

TσvTσwβPvwβvw.|T_\sigma \, v - T_\sigma \, w| \leq \beta P \| v - w \| \leq \beta \| v - w \|.

Taking the supremum over the left-hand side, we find that TσT_\sigma is a contraction of modulus β\beta on bXb\Xsf. Hence, the σ\sigma-value function vσv_\sigma is the unique fixed point of TσT_\sigma in the candidate space bXb\Xsf.

Policies and their lifetime value functions

Figure 1.3:Policies and their lifetime value functions

Figure 1.3 shows the value functions vσv_\sigma and vτv_\tau for two possible policies σ\sigma and τ\tau. The policies are shown on the left and the value functions are on the right. The environment is the same as Figure 1.2 with β\beta set to 0.96. The first policy σ\sigma sells when the state is below 1.0 and continues otherwise. The second policy τ\tau oscillates between selling and continuing. Each value function is computed as the fixed point of the corresponding policy operator in bXb\Xsf. The horizontal dashed line indicates the sale price ss. In terms of value, policy τ\tau is outperformed by σ\sigma everywhere on the state space.

Now let’s consider optimality. A policy σ\sigma is called optimal if vτ(x)vσ(x)v_\tau(x) \leq v_\sigma(x) for all τΣ\tau \in \Sigma and all xXx \in \Xsf. The value function v\vmax, sometimes called the optimal value function, is defined by

v(x)supσΣvσ(x)(xX).\vmax(x) \coloneq \sup_{\sigma \in \Sigma} v_\sigma(x) \qquad (x \in \Xsf).

Evidently σ\sigma is an optimal policy if and only if vσ=vv_\sigma = \vmax.

The problem of finding an optimal policy appears nontrivial, since Σ\Sigma is a large set whenever X\Xsf is large. Indeed, each σΣ\sigma \in \Sigma is just an indicator function of a measurable set, so the cardinality of Σ\Sigma is the same as B\bB. However, it turns out that we can find, characterize, and compute optimal policies relatively easily, using the theory of dynamic programming. In particular, we can state the following result, which is proved below in Section 1.1.1.4.

Theorem 1.1.1 is a simple consequence of Richard Bellman’s (1920--1984) beautiful theory of dynamic programming Bellman, 1957. Equation (1.12) is called the Bellman equation.

The characterization in (1.13) has the following natural interpretation. The expression π(x)+βv(x)P(x, ⁣dx)\pi(x) + \beta \int \vmax(x') P(x, \diff x') is the payoff (expected present value) for choosing to continue, receiving current profits, and then behaving optimally (since we are valuing future states with v\vmax). The best decision at xx is to continue if this is larger than ss, which is achieved by setting a=0a=0. Otherwise the manager should set a=1a=1 and stop.

We shall repeatedly use a term related to Theorem 1.1.1. Thus, given vbXv \in b\Xsf, we will agree to say that σΣ\sigma \in \Sigma is vv-greedy whenever

σ(x)argmaxa{0,1}{as+(1a)[π(x)+βv(x)P(x, ⁣dx)]}for all xX.\sigma(x) \in \argmax_{a \in \{0,1\}} \left\{ a s + (1 - a) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right] \right\} \quad \text{for all } x \in \Xsf.

With this terminology, we can repeat the policy optimality characterization in Theorem 1.1.1 by saying that a policy is optimal if and only if it is v\vmax-greedy. This idea is a cornerstone of our theory.

1.1.1.3The Bellman Operator

We noted above that a policy is optimal if and only if it is v\vmax-greedy. This provides a direct avenue for computing optimal policies: First calculate v\vmax and then take a v\vmax-greedy policy. A straightforward way to approximate v\vmax is to iterate on the Bellman operator, which, in the present setting, is the self-map TT on bXb\Xsf defined at vbXv \in b\Xsf by

(Tv)(x)=max{s,  π(x)+βv(x)P(x, ⁣dx)}(xX).(T v)(x) = \max \left\{ s ,\; \pi(x) + \beta \int v(x') P(x, \diff x') \right\} \qquad (x \in \Xsf).

In operator-theoretic notation, we write TT as Tv=s(π+βPv)Tv = s \vee (\pi + \beta Pv). Evidently vv solves the Bellman equation if and only if vv is a fixed point of TT.

Note that TT is a contraction of modulus β\beta on (bX,)(b\Xsf, \| \cdot \|). Indeed, fixing v,wbXv, w \in b\Xsf and applying the elementary bound

αxαyxy(α,x,yR),|\alpha \vee x - \alpha \vee y| \leq |x - y| \qquad (\alpha, x, y \in \RR),

we get

TvTw=s(π+βPv)s(π+βPw)βPvPw.|T v - T w| = |s \vee (\pi + \beta Pv) - s \vee (\pi + \beta Pw)| \leq \beta | Pv - Pw|.

The rest of the argument is identical to the one for contractivity of TσT_\sigma in Section 1.1.1.2.

From this and Theorem 1.1.1, we see that TT has a unique fixed point in bXb\Xsf and that the fixed point is the value function v\vmax. Moreover, for any vv in bXb\Xsf, we have TkvvT^k v \to \vmax as kk \to \infty. In other words, fixed point iteration (also called successive approximation) allows us to approximate v\vmax arbitrarily well.

Optimal policy and value function

Figure 1.4:Optimal policy and value function

Figure 1.4 shows an approximate optimal policy and an approximation to the value function v\vmax. The latter was computed by iterating with the Bellman operator TT from initial condition v00v_0 \equiv 0 and monitoring for convergence (waiting for the step sizes Tkv0Tk+1v0\| T^k v_0 - T^{k+1} v_0 \| to fall below some threshold). We then took the resulting function vv and computed a vv-greedy policy. We used the same parameters as in Figure 1.3. The optimal policy has a threshold structure: the manager sells the firm when the state is below a critical value and continues operating otherwise. The value function v\vmax dominates the sale price ss everywhere, reflecting the value of the option to wait and sell later.

1.1.1.4Proving Theorem 1.1.1

In this book we will prove far more general results than Theorem 1.1.1. So providing an immediate proof of the theorem here is actually redundant and postpones our presentation of some sample applications. Nevertheless, we now sketch a short proof of Theorem 1.1.1, but alert readers that they can safely move on without digesting it now.

1.1.1.5How About More General Policies?

Up to now we have focused on stationary Markov policies, which, in our language, are measurable maps from X\Xsf to {0,1}\{0,1\}. Restricting our attention to such policies prevents the manager from making decisions based on a longer history of states and actions, or from changing policies at some date. We have also ignored the possibility that the manager might wish to randomize actions, meaning that the choice is a probability of selling, rather than a fixed selection from {0,1}\{0,1\}.

It turns out that focusing exclusively on stationary Markov policies is appropriate in the current setting. We show in Section 3.1.4 that allowing nonstationary policy choices cannot lead to higher expected present value, and that this result holds far more generally. Moreover, randomization cannot improve outcomes in this setting. For a proof in a similar environment, see our discussion of mixed strategies in Section 9.2.1.6 of Sargent & Stachurski (2025).

1.1.2Extensions

The theory stated so far is elegant and can be extended in some directions with relatively little effort. For example, we can ask the manager to control inventories, capital stocks, and the size and disposition of a labor force across tasks. The same fundamental ideas still govern optimality, and the same solution approaches still work.

But some extensions are more challenging and require moving beyond standard dynamic programs. We describe some of these challenges next.

1.1.2.1Beyond Constant Discount Rates

One restrictive assumption in Section 1.1.1 is that the discount factor β\beta is constant. In fact discount rates vary considerably. For example, market interest rates fluctuate substantially, responding to changes in monetary policy, inflation expectations, and risk premia. Interest rates charged to risky borrowers can fluctuate even more widely than benchmark rates. For a firm weighing the decision to continue operating versus exiting, the cost of capital at which future profits are discounted will have a substantial impact on optimal decisions. In practice, firms routinely incorporate time-varying discount rates into their strategic planning.

Figure 1.5 illustrates the extent of this variation using US data from the Federal Reserve. The top panel shows the federal funds rate, which transmits to firm financing costs through the credit channel. The bottom panel plots the real interest rate, computed as the 10-year Treasury yield minus twelve-month CPI inflation. Both sets of rates affect the intertemporal tradeoffs that firms face: when rates are high, future profits are discounted more heavily, reducing the present value of long-lived investments and making exit or disinvestment more attractive. Conversely, low or negative rates reduce the cost of waiting and encourage firms to invest, expand capacity, or delay exit from declining markets. The relative importance of real vs nominal rates varies across firms. The swings visible in Figure 1.5 underscore the issues with assuming a fixed discount factor β\beta when studying intertemporal choices of firms.

US nominal and real interest rates.

Figure 1.5:US nominal and real interest rates.

To accommodate dependence of discount factors on firm-specific or macroeconomic variables, we can specify that β=b(Xt)\beta = b(X_t) for some suitable function bb. Fortunately, this modification can be accommodated in the current setting. A discussion can be found in Section 4.2.1.3. But the associated theory can become more complicated when the controller’s actions affect state components that affect the discount factor. This happens, for example, in models of firms who face higher borrowing costs because they pursue high-risk strategies that involve occasionally running down their cash reserves. That makes the dynamic programming problem more challenging, particularly if we assume that interest rates are occasionally negative; such cases break contractivity properties of policy and Bellman operators. Optimality results for the finite state case can be found in Sargent & Stachurski (2025). We study more general cases in Chapter 6.

1.1.2.2Unbounded Rewards

Another---more technical---issue with our analysis in Section 1.1.1 is that π\pi is bounded. This directly embeds the problem in the space of bounded functions bXb\Xsf and pairs naturally with the supremum norm. The contraction and optimality proofs unfold smoothly when working in such an environment. But assuming that π\pi is bounded can be overly restrictive.

It turns out that we can drop the boundedness assumption without too much difficulty. One way is to assume instead that π<\| \pi \|_\ell < \infty where  ⁣:X[1,)\ell \colon \Xsf \to [1, \infty) and \| \cdot \|_\ell is the \ell-weighted norm defined by fsupxXf(x)/(x)\| f \|_\ell \coloneq \sup_{x \in \Xsf} |f(x)|/\ell(x). This approach is discussed in Section 7.2.2.2.

Another option is to embed the problem in the class of integrable functions L1(ψ)L1(X,B,ψ)L_1(\psi) \coloneq L_1(\Xsf, \bB, \psi) for a suitable measure ψ\psi on (X,B)(\Xsf, \bB). (For background see Section A.4.2.4.) For example, suppose that ψ\psi is a stationary distribution (see Section A.5.4.4) of the stochastic kernel PP and that πL1(ψ)\pi \in L_1(\psi). The policy operators then send L1(ψ)L_1(\psi) into itself. Indeed, if πL1(ψ)\pi \in L_1(\psi) and we fix vL1(ψ)v \in L_1(\psi), then Tσvs+π+βPv|T_\sigma \, v| \leq |s| + |\pi| + \beta |P v|, so TσvT_\sigma \, v is in L1(ψ)L_1(\psi) when PvPv is ψ\psi-integrable. This is true by stationarity of ψ\psi --- see (i) of Lemma A.5.32 and the surrounding discussion. Moreover, TσT_\sigma is a contraction map on L1(ψ)L_1(\psi), as can be seen by integrating both sides of the bound TσvTσwβPvw|T_\sigma \, v - T_\sigma \, w| \leq \beta P | v - w| with respect to ψ\psi and using (A.102) on p.  to obtain Pvw ⁣dψ=vw ⁣dψ\int P | v - w| \diff \psi = \int |v - w| \diff \psi. One can also show that the Bellman operator TT is a contraction with respect to the norm on L1(ψ)L_1(\psi) and then proceed to adapt proofs in Theorem 1.1.1.

Rather than provide all details here, we defer further discussion to Section 4.2.1.2.

1.1.3Beyond Risk Neutrality

A limitation of the preceding analysis is how it treats risk. We assumed that the manager wants to maximize the expected present value of cash flow generated by the firm across different strategies (policies). But what if the manager wants something else?

Canonical theories of firm behavior suggest that expected present value is the appropriate criterion. In complete and frictionless markets, a firm’s shareholders can hedge firm-specific risks by trading other securities, so the firm’s manager should not worry about that Modigliani & Miller, 1958Smith & Stulz, 1985. But maybe financial markets are incomplete and maybe shareholders lack enough information about risks or financial sophistication to develop optimal hedging strategies. Maybe managers’ incentives are not aligned with shareholders’ interests. Managers might be concerned about a firm’s survival and overweight downside risks relative to shareholders. Indeed, various studies offer evidence for such risk-averse decision-making within both small and large firms (see, e.g., Graham et al. (2013), Kerr et al. (2019), or Almeida et al. (2024)).

In this section we consider adjusting the manager’s problem to incorporate some of these considerations.

1.1.3.1Distributions of Rewards

Let’s return to our jump-off point for optimization, where we decided to seek a policy that solves maxσΣvσ(x)\max_{\sigma \in \Sigma} v_\sigma(x) for all xx. We know that vσ(x)v_\sigma(x) is the expected lifetime value of policy σ\sigma given initial condition xx. As such, we can also write

vσ(x)=EZσ,v_\sigma(x) = \EE Z_\sigma,

where

Zσt=0T(σ)1βtπt+βT(σ)s,withT(σ)inf{t0:σ(Xt)=1}.Z_\sigma \coloneq \sum_{t=0}^{T(\sigma)-1} \beta^t \pi_t + \beta^{T(\sigma)} s, \quad \text{with} \quad T(\sigma) \coloneq \inf \, \setntn{t \geq 0}{\sigma(X_t)=1}.

Here T(σ)T(\sigma) is the date at which the firm is sold at price ss (with convention inf=\inf \varnothing = \infty, so that T(σ)=T(\sigma)=\infty if the firm is never sold). Evidently ZσZ_\sigma is a random payoff, depending on the policy and the random path (πt)t0(\pi_t)_{t \geq 0}. The random variable ZσZ_\sigma depends on the initial state xx but we have chosen to suppress this in the notation.

The left-hand subfigure in Figure 1.6 shows the distribution of ZσZ_{\sigopt}, lifetime value under the optimal policy σ\sigopt, computed by fixing an initial condition xx, simulating 100,000 profit paths from a given initial state, and then computing the discounted payoff along each path. (Parameter values were the same as in Figure 1.3.) The mean of the distribution is vσ(x)v_{\sigopt}(x), which is also v(x)\vmax(x). The policy is regarded as optimal because the mean of the distribution is larger than the mean of ZσZ_{\sigma} under any other policy, and given any other initial condition xx. The right-hand subfigure compares ZσZ_{\sigopt} with ZσZ_\sigma, where σ\sigma is the policy that never sells.

Distributions of the random payoff Z_\sigma defined in .

Figure 1.6:Distributions of the random payoff ZσZ_\sigma defined in (1.23).

Let’s consider these distributions from the perspective of firm managers when markets are not frictionless, and information is not perfect. In this case, as discussed at the start of Section 1.1.3, managers care about more than just the mean of these distributions. While the mean will surely be of interest, these managers are likely to also care about factors such as variance, upside risk and downside risk. Let’s now consider how we might insert preferences over such factors into our model.

1.1.3.2Distributional Dynamic Programming

Some researchers have begun to construct a theory of “distributional dynamic programming” where the core idea is to track the distribution of the payoff across policies and initial conditions. In our context, this means choosing σ\sigma so that ZσZ_\sigma has an “optimal” distribution. For example, a manager might want a distribution with a relatively high mean and low downside risk. Bellemare et al. (2023) show that, for a relatively broad class of dynamic programming problems, a distributional version of the Bellman equation can be constructed, where the left- and right-hand sides of the Bellman equation are both distributions. We formalize this idea within the abstract dynamic programming framework in Section 2.3.4.

In practice, the theory of distributional dynamic programming is constrained by the fact that there is no natural extension of the idea of a greedy policy to the setting of distributional Bellman equations. As such, we focus on environments where agents are able to specify loss or reward functions over distributions. The next few sections investigate such cases, while still preserving concern for tail properties and additional moments beyond the mean.

1.1.3.3Mean-Variance Analysis

Let RR be a random payoff that we want to evaluate. Mean-variance analysis proposes the criterion E[R](γ/2)Var[R]\EE [R] - (\gamma/2) \var[R], where γ\gamma parameterizes risk-aversion. In the context of our manager’s problem, the mean-variance criterion tells us to solve

maxσΣ  mV(Zσ)wheremV(Zσ){E[Zσ]γ2Var[Zσ]}.\max_{\sigma \in \Sigma} \; m_V(Z_\sigma) \quad \text{where} \quad m_V(Z_\sigma) \coloneq \left\{ \EE [Z_\sigma] - \frac{\gamma}{2} \var[Z_\sigma] \right\}.

Assuming that the initial condition is xx, the first term E[Zσ]\EE [Z_\sigma] is just vσ(x)v_\sigma(x). The second term is harder to calculate but its role is clear: it downweights policies that generate high variance payoffs, with the extent of downweighting depending on the size of γ\gamma. More risk averse managers will use larger values of γ\gamma and their preferred policies will deviate more from what we previously defined to be optimal---that is, the policy that solves maxσΣvσ(x)=maxσΣEZσ\max_{\sigma \in \Sigma} v_\sigma(x) = \max_{\sigma \in \Sigma} \EE Z_\sigma for all xx.

1.1.3.4Alternatives to Mean-Variance

The mean-variance criterion is not the only way to formulate concerns about risk. An alternative formulation solves

maxσΣ  eγ(Zσ)whereeγ(Zσ)1γln[E[exp(γZσ)]]\max_{\sigma \in \Sigma}\; e_\gamma(Z_\sigma) \quad \text{where} \quad e_\gamma(Z_\sigma) \coloneq - \frac{1}{\gamma} \ln \left[ \EE [\exp(- \gamma Z_\sigma)] \right]

Here E\EE again denotes the mathematical expectation with respect to the probability distribution of the random payoff, which the decision maker is assumed to know. The map eγe_\gamma is called an entropic certainty equivalent. The parameter γ\gamma parameterizes risk aversion. When γ>0\gamma > 0, the decision maker values ZσZ_\sigma below E[Zσ]\EE[Z_\sigma]. We say that risk aversion is higher when γ\gamma is larger. An introduction to these ideas is provided in Section 7.2.2 of Sargent & Stachurski (2025).

The entropic criterion is attractive for several reasons. One is that, with sufficiently many finite moments, a Taylor expansion produces

eγ(X)  =  κ1    γ2κ2  +  γ26κ3    γ324κ4  +  ,e_\gamma(X) \;=\; \kappa_1 \;-\; \frac{\gamma}{2}\,\kappa_2 \;+\; \frac{\gamma^2}{6}\,\kappa_3 \;-\; \frac{\gamma^3}{24}\,\kappa_4 \;+\; \cdots,

where κn\kappa_n is the nn-th cumulant

κ1=E[X],κ2=Var[X],κ3=E ⁣[(Xκ1)3],\kappa_1 = \EE[X], \qquad \kappa_2 = \var[X], \qquad \kappa_3 = \EE\!\left[(X - \kappa_1)^3\right], \qquad \cdots

This tells us that, with positive γ\gamma, the agent likes a high mean, dislikes variance, likes positive skewness (right tails), dislikes kurtosis (fat tails), etc. When higher moments are small we get

eγ(Zσ)E[Zσ]γ2Var[Zσ],e_\gamma(Z_\sigma) \approx \EE [Z_\sigma] - \frac{\gamma}{2} \var[Z_\sigma],

which connects us back to mean-variance analysis. The approximation above becomes exact when ZσZ_\sigma is normally distributed.

In addition, the entropic criterion can be regarded as an indirect utility function that emerges from a setting in which the manager doubts its probability model for RR. In particular, let PP denote the manager’s baseline model probability measure for the payoff ZσZ_\sigma. Then it can be shown that

eγ(Zσ)=minQ  {EQ[Zσ]+1γDKL(QP)},e_\gamma(Z_\sigma) = \min_Q\; \left\{ \EE_Q [Z_\sigma] + \frac{1}{\gamma} D_{KL}(Q \| P) \right\},

where the minimum is over probability measures QQ absolutely continuous with respect to PP and DKL(QP)EQ[ln( ⁣dQ/ ⁣dP)]D_{KL}(Q \| P) \coloneq \EE_Q [\ln(\diff Q / \diff P)] is a Kullback--Leibler statistical divergence for measuring the discrepancy between two probability distributions. Here the parameter 1/γ1/\gamma controls the size of a penalty that the minimizer pays for distorting QQ relative to baseline probability model PP; larger γ\gamma’s allow the minimizing “player” who chooses QQ to range over a larger set of alternative models. Such an analysis connects entropic preferences to robust control and ambiguity aversion and will be discussed in Chapter 7.

The entropic criterion eγ(Zσ)e_\gamma(Z_\sigma) is a special case of a more general objective

Φ(Zσ)ϕ1{E[ϕ(Zσ)]}\Phi(Z_\sigma) \coloneq \phi^{-1} \left\{ \EE [\phi(Z_\sigma)] \right\}

where ϕ\phi is a given function. Typically ϕ\phi is concave, as is the case for ϕ(x)=exp(γx)\phi(x) = \exp(-\gamma x) when γ>0\gamma > 0. Another example is the Kreps--Porteus expectation, which is obtained by setting ϕ(x)=x1γ\phi(x) = x^{1-\gamma}.

A third option for inserting preferences over risk is to evaluate

VaRα(Zσ)inf{cR:P{Zσ+c<0}α}.\VaR_\alpha(Z_\sigma) \coloneq \inf \, \setntn{c \in \RR}{\PP\{Z_\sigma+c<0\}\leq\alpha}.

where α[0,1]\alpha \in [0,1] is a given constant. This objective is called value-at-risk and can be understood as the smallest cash injection cc such that the probability of a net loss is no more than α\alpha. Intuitively, if ZσZ_\sigma has more downside risk, then VaRα(Zσ)\VaR_\alpha(Z_\sigma) increases, as more cash is needed to keep the loss probability below the threshold. Thus, a manager seeking a low-risk policy might look to minimize VaRα(Zσ)\VaR_\alpha(Z_\sigma), or, equivalently, to solve maxσVaRα(Zσ)\max_\sigma - \VaR_\alpha(Z_\sigma).

Value-at-risk became industry standard in the 1990s, partly due to popularization through RiskMetrics, a risk management framework developed at J.P. Morgan. It has spawned a variety of alternatives and extensions, including conditional value-at-risk, entropic value-at-risk and relativistic value-at-risk. We will meet some of these ideas again in Chapter 7.

1.1.3.5Difficulties

While the risk-management concepts discussed above are all sensible, they complicate solving for an optimal policy because they make the objective be nonlinear. For example, consider the mean-variance problem (1.24), which we can write as

maxσΣmV{t=0T(σ)1βtπt+βT(σ)s}.\max_{\sigma \in \Sigma} m_V \left\{ \sum_{t=0}^{T(\sigma)-1} \beta^t \pi_t + \beta^{T(\sigma)} s \right\}.

The function mVm_V is nonlinear due to the presence of the variance term, and this nonlinearity prevents us from passing mVm_V through the sum and thereby deriving a recursive expression for the value of a strategy similar to (1.6). To see this, recall the role that linearity of the expectations operator Et\EE_t played in our derivation of representation (1.4).

Bellman’s dynamic programming theory requires having a recursive representation for valuations under alternative strategies. Without that, numerical strategies for solving global optimization problems like maxσU(Zσ)\max_\sigma U(Z_\sigma) can be poorly behaved, very high dimensional, and virtually inaccessible to the theory of dynamic programming.

Even worse, there is an important sense in which the criterion of maximizing mV(Zσ)m_V(Z_\sigma) over σΣ\sigma \in \Sigma is no longer the right one, since there is no guarantee that the best strategy for the manager is to choose a stationary Markov policy and apply it in every period. For example, in our new nonlinear setting, it might be optimal for the manager to apply a given policy σ\sigma in the first period and a second policy τ\tau in all periods thereafter (see, e.g., Section 5 of Bäuerle & Jaśkiewicz (2024)). While this might still seem feasible---we need only to compute one more policy--- time inconsistency arises. The manager must be committed to switching to τ\tau in the second period, since re-optimization would lead to choosing σ\sigma again.

Unfortunately none of the alternatives to mean-variance analysis discussed in Section 1.1.3.4 offer a way out of the problem described above, since nonlinearity in the objective again prevents construction of a recursive representation. As with mean-variance, this lack of recursivity requires deploying dynamic programs in new ways.[1]

1.1.3.6Back to Recursion

In Section 1.1.3.5, we saw how nonlinearity intended to capture risk-preferences led to a breakdown of dynamic programming theory. Fortunately, there is a way to inject nonlinearity and risk-preferences into the manager’s problem without breaking recursivity and hence access to the core ideas of dynamic programming. The idea is to stop trying to apply risk-preferences directly to the net present value sum and instead apply them period-by-period. We can do this by starting with the risk-neutral recursive valuation (1.6) and modifying it to

vσ(x)=σ(x)s+(1σ(x))[π(x)+β(Kvσ)(x)],v_\sigma(x) = \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta (K v_\sigma)(x) \right],

where KK is a possibly nonlinear operator from bXb\Xsf to itself. For example, using the notation (Pv)(x)=v(x)P(x, ⁣dx)(Pv)(x) = \int v(x') P(x, \diff x'), we can set

(Kv)(x)=(Pv)(x)γ2[v(x)(Pv)(x)]2P(x, ⁣dx)(K v)(x) = (Pv)(x) - \frac{\gamma}{2} \int \left[v(x') - (Pv)(x)\right]^2 P(x, \diff x')

to implement the mean-variance criterion, or

(Kv)(x)=1γln{exp(γv(x))P(x, ⁣dx)}(K v)(x) = - \frac{1}{\gamma} \ln \left\{ \int \exp(- \gamma v(x')) P(x, \diff x') \right\}

for entropic risk preferences.

This alternative approach to inserting risk preferences into the manager’s problem is somewhat less intuitive than the direct approach that we reviewed in Section 1.1.3.3 and Section 1.1.3.4. In addition, it introduces a new problem: is vσv_\sigma actually well-defined by the nonlinear functional equation (1.33)? On the other hand, it offers a major advantage: provided we can show that vσv_\sigma is in fact well-defined --- which is a problem for fixed point theory --- the valuations are recursive by construction. Exploiting this fact, we can extend Bellman’s original theory in very natural ways. This is one of the main subjects of this book.

We will attack this problem in stages, beginning with an abstract recursive setup in Chapter 2. The setup will be recursive in the sense that each valuation vσv_\sigma will be represented as the fixed point of a possibly nonlinear policy operator TσT_\sigma. Our approach will then be to rewrite Bellman’s optimality theory in this abstract setting and seek properties on the policy operators under which the main results go through. At the end of this process we will connect back to the applications from this chapter.

Before progressing to this abstract theory, we look at some other concrete examples, beginning with finite state Markov decision processes.

1.2Finite MDPs

Finite state Markov decision processes (MDPs) form the foundations of many quantitative modeling and reinforcement learning routines, as well as providing a benchmark setting for dynamic programming theory (see, e.g., Puterman (2005) or Chapter 5 of Sargent & Stachurski (2025)). In this section we introduce finite state MDPs and some extensions. As was the case for the firm problem considered in Section 1.1, our main objective is to introduce a class of dynamic programming problems that will motivate the abstract theory starting in Chapter 2.

While the presentation below is self-contained, readers wanting a slower pace and more examples might prefer to begin with Chapter 5 of Sargent & Stachurski (2025).

1.2.1Theory

In this section we introduce the finite MDP framework and state core optimality results---the Bellman equation and the greedy policy characterization of optimal policies. We then present three fundamental algorithms: value function iteration, Howard policy iteration, and optimistic policy iteration. We illustrate the main ideas with an application to firm cash management.

1.2.1.1The Discrete Time Model

A finite state Markov decision process (finite MDP) consists of

and a tuple (Γ,r,β,P)(\Gamma, r, \beta, P), where

Since PP is a stochastic kernel, it satisfies xP(x,a,x)=1\sum_{x'} P(x, a, x') = 1 for all (x,a)G(x,a) \in \Gsf.

Given an initial condition X0=xX_0 = x, the objective is to maximize the expected discounted sum

Et0βtr(Xt,At) s.t. AtΓ(Xt) for all t0.\EE \sum_{t \geq 0} \beta^t r(X_t, A_t) \quad \st \quad A_t \in \Gamma(X_t) \text{ for all } t \geq 0.

Here (Xt)t0(X_t)_{t \geq 0} takes values in X\Xsf and (At)t0(A_t)_{t \geq 0} takes values in A\Asf. After observing state XtX_t, the controller chooses action AtA_t from the feasible set Γ(Xt)\Gamma(X_t) and the new state Xt+1X_{t+1} is drawn from the distribution P(Xt,At,)P(X_t, A_t, \cdot). The constant β[0,1)\beta \in [0,1) is a discount factor and rr is a reward function. In maximizing this objective, the action sequence (At)(A_t) must also satisfy an information constraint: Each AtA_t is required to be measurable with respect to the σ\sigma-algebra generated by (X0,,Xt)(X_0, \ldots, X_t). Thus, the controller can use information from the past and present but not the future.

As was the case for the firm problem in Section 1.1, it turns out that actions depending on all of (X0,,Xt)(X_0, \ldots, X_t) are no better than actions that depend only on XtX_t. We will show this formally in Section 3.1.4. As a result, we focus on stationary Markov policies, where the same deterministic function of the state is applied at every point in time. We call such stationary Markov policies feasible policies, the set of which is given by

Σ{σAX:σ(x)Γ(x) for all xX}.\Sigma \coloneq \setntn{\sigma \in \Asf^\Xsf} {\sigma(x) \in \Gamma(x) \text{ for all } x \in \Xsf}.

For each σΣ\sigma \in \Sigma, we set

Pσ(x,x)P(x,σ(x),x)andrσ(x)r(x,σ(x)).P_\sigma(x, x') \coloneq P(x, \sigma(x), x') \quad \text{and} \quad r_\sigma(x) \coloneq r(x, \sigma(x)).

It follows from our assumptions on PP that PσP_\sigma is a stochastic matrix, meaning that Pσ0P_\sigma \geq 0 and all rows sum to one. By choosing a policy σ\sigma, the controller determines a reward function rσr_\sigma on the state and Markov dynamics PσP_\sigma for the state process. Following notation in Section A.5.4, we write

(Pσh)(x)=xXh(x)Pσ(x,x)(hRX,  xX),(P_\sigma h)(x) = \sum_{x' \in \Xsf} h(x') P_\sigma(x, x') \qquad (h \in \RR^\Xsf, \; x \in \Xsf),

interpreting this value as the expectation of h(Xt+1)h(X_{t+1}) when Xt=xX_t = x and the controller uses policy σ\sigma.

The lifetime value of σ\sigma given X0=xX_0 = x is

vσ(x)Et0βtr(Xt,σ(Xt))=t0βtErσ(Xt)v_\sigma(x) \coloneq \EE \sum_{t \geq 0} \beta^t r(X_t, \sigma(X_t)) = \sum_{t \geq 0} \beta^t \EE \, r_\sigma(X_t)

when (Xt)t0(X_t)_{t \geq 0} is a Markov chain generated by PσP_\sigma with initial condition X0=xXX_0 = x \in \Xsf. Since Erσ(Xt)=(Pσtrσ)(x)\EE \, r_\sigma(X_t) = (P_\sigma^t r_\sigma)(x), the function vσv_\sigma can be expressed pointwise on X\Xsf as

vσ=t0(βPσ)trσ=(IβPσ)1rσ,v_\sigma = \sum_{t \geq 0} (\beta P_\sigma)^t r_\sigma = (I-\beta P_\sigma)^{-1} r_\sigma,

where II is the identity map on RX\RR^\Xsf, the set of real-valued functions on X\Xsf. This representation on the right-hand side is essentially the same as (1.5). In particular, the second equality comes from the Neumann series lemma (page ). See also Puterman (2005), Theorem 6.1.1, or Chapter 5 of Sargent & Stachurski (2025).

The policy operator associated with given σ\sigma for the finite MDP model takes the form

(Tσv)(x)=r(x,σ(x))+βxv(x)P(x,σ(x),x)(vRX,  xX)(T_\sigma \, v)(x) = r(x, \sigma(x)) + \beta \sum_{x'} v(x') P(x, \sigma(x), x') \qquad (v \in \RR^\Xsf, \; x \in \Xsf)

1.2.1.2Core Optimality Results

The definition of the value function is the same as that for the firm problem in Section 1.1.1.2:

v(x)supσΣvσ(x)(xX).\vmax(x) \coloneq \sup_{\sigma \in \Sigma} v_\sigma(x) \qquad (x \in \Xsf).

Similarly, a policy σΣ\sigma \in \Sigma is called optimal if vτvσv_\tau \leq v_\sigma for all τΣ\tau \in \Sigma.

The Bellman equation for this problem is

v(x)=maxaΓ(x){r(x,a)+βxv(x)P(x,a,x)}.v(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\}.

This is a functional equation that restricts vRXv \in \RR^\Xsf. The Bellman operator is given by

(Tv)(x)=maxaΓ(x){r(x,a)+βxv(x)P(x,a,x)}(xX).(T \, v)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \qquad\qquad (x \in \Xsf).

By construction, vv solves the Bellman equation if and only if vv is a fixed point of the Bellman operator. In the next exercise, d(f,g)supxXf(x)g(x)d_\infty(f, g) \coloneq \sup_{x \in \Xsf}|f(x) - g(x)|. Corollary A.5.13 may be helpful.

We say that σΣ\sigma \in \Sigma is vv-greedy if

σ(x)argmaxaΓ(x){r(x,a)+βxv(x)P(x,a,x)}for all xX,\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \quad \text{for all } x \in \Xsf,

We can now state the following optimality result, which naturally mirrors our previous result for the firm problem (Theorem 1.1.1).

A full proof of Theorem 1.2.1 can be found in Chapter 5 of Sargent & Stachurski (2025). The proof is almost identical to that of Theorem 1.1.1, which we provided for the firm problem. We will also prove Theorem 1.2.1 in Section 2.3.3.2, as a special case of far more general results.

The next obvious step is to use the results in Theorem 1.2.1 to compute optimal policies. Next we consider algorithms designed for this purpose.

1.2.1.3Algorithms

The three most important algorithms for solving dynamic programming problems are value function iteration (VFI), Howard policy iteration, and optimistic policy iteration (OPI). In the present setting, they take the forms of Algorithm 1.2.1, Algorithm 1.2.2, and Algorithm 1.2.3.

VFI amounts to iterating kk times with TT from some initial condition vVv \in V (where kk is determined by a fixed tolerance level for error), producing an approximation vkTkvv_k \coloneq T^k v to v\vmax, and then computing a vkv_k-greedy policy σ\sigma. This idea is natural, given that v\vmax-greedy policies are optimal, since TT is a contraction mapping and v\vmax is the unique fixed point (so that vkv_k is close to v\vmax).

In HPI, one begins with a guess σ\sigma of the optimal policy and then iterates between computing the lifetime value of that policy (as given in (1.42)) and the corresponding greedy policy. In fact HPI is equivalent to Newton fixed point iteration applied to the Bellman operator. See, for example, Chapter 5 of Sargent & Stachurski (2025).

OPI can be thought of as a “convex combination” of VFI and HPI. Instead of computing the lifetime value vσ=(IβPσ)1rσv_\sigma = (I - \beta P_{\sigma})^{-1} r_{\sigma} of current policy guess σ\sigma, one computes instead TσmvT_{\sigma}^m v, which is an approximation to vσv_\sigma (since TσT_\sigma is a contraction with fixed point vσv_\sigma). There are two edge cases:

  1. If mm is large, this approximation is tight, and hence OPI is close to HPI.

  2. If m=1m=1, OPI reduces to VFI.

OPI usually outperforms both VFI and HPI for some intermediate values of mm. Further intuition and discussion is provided in Chapter 5 of Sargent & Stachurski (2025).

For the finite MDP setting, we can state the following results:

We shall prove these results after we discuss convergence of these algorithms in a general setting in Section 2.2.1.

1.2.1.4Solving MDPs via Linear Programming

Many dynamic programs can be formulated as linear programs. We illustrate with finite MDPs. To do so, we recall that a typical linear program has the form

minvc,v   over all   vRn with Avb.\min_v \inner{c,v} \; \text{ over all } \; v \in \RR^n \text{ with } Av \leq b.

Here cc is a vector in Rn\RR^n, the term c,v\inner{c,v} is the inner product icivi\sum_i c_i v_i, AA is a matrix with nn columns and bb is a vector with the same length as AvAv. (Other LP formulations replace the inequality AvbAv \leq b with Av=bAv = b or AvbAv \geq b, or a mix of inequality and equality constraints. Standard LP algorithms and theory can be applied to all of these cases.)

To place the finite state MDP from Section 1.2.1.1 into this framework, we begin by setting

VD{vRX:Tvv},V_D \coloneq \setntn{v \in \RR^\Xsf}{Tv \leq v},

where TT is the Bellman operator. Recalling that v\vmax represents the value function, we have the following result:

Now let cc be an everywhere positive element of RX\RR^\Xsf and consider the linear program

minvc,v s.t. r(x,a)+βxv(x)P(x,a,x)v(x) for all (x,a)G.\begin{aligned} & \min_v \inner{c, v} \\ & \st r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \leq v(x) \text{ for all } (x, a) \in \Gsf. \end{aligned}

Here c,v=xc(x)v(x)\inner{c, v} = \sum_x c(x) v(x) and the minimization is over all vRXv \in \RR^\Xsf that satisfy the stated constraint. As before, we require that cc is everywhere positive. Evidently (1.52) takes the form of (1.50) after suitable assignment of indices. Thus, (1.52) is a linear program. This leads us to

Proposition 1.2.4 implies that we can compute the value function and hence solve the MDP using linear programming techniques. The LP approach is useful for many models, such as those that incorporate additional linear constraints, e.g., bounds on expected rewards and resources, that are difficult to handle with iterative methods. On the other hand, the number of constraints in (1.52) equals G|\Gsf|, which can be large, so iterative methods are still often preferred. See Section 1.6 for references and further discussion.

1.2.1.5Example: Cash Management

To illustrate finite MDPs in a concrete setting, we now study a cash management problem faced by a firm that must balance cash holdings against returns from securities. This problem dates back to the work of Baumol (1952) and Tobin (1956), who developed inventory-theoretic models of the demand for money. In their models, the decision maker decides how much cash to hold versus interest-bearing assets, balancing the opportunity cost of holding idle cash against the transaction costs of converting assets to cash.

The decision maker manages fixed total wealth wˉ\bar w, which is divided between cash holdings xx and securities s=wˉxs = \bar w - x. Each period, the decision maker experiences random portfolio shocks and must decide whether to transfer funds between cash and securities. The decision maker earns returns on securities but pays transaction costs for transfers and faces penalties for insufficient cash. (Assuming that wealth is fixed allows us to get by tracking only xx, and not ss.)

The state space for cash is X={0,1,,wˉ}\Xsf = \{0, 1, \ldots, \bar w\}. At state xx, the feasible actions are transfers aa satisfying 0x+awˉ0 \leq x+a \leq \bar w. Thus, Γ(x)={aZxawˉx}\Gamma(x) = \{a \in \ZZ \mid -x \leq a \leq \bar w - x\}. Portfolio shocks (cash payments, equity payments, debt restructuring, etc.) are iid, written as (ξt)t1(\xi_t)_{t \geq 1}, and take values in a set Ξ={k,,k}\Xi = \{-k, \ldots, k\} with probability mass function ϕ\phi. Transition probabilities are determined by the next-period state

F(x,a,ξ)=max{0,min{wˉ,x+a+ξ}},F(x,a, \xi) = \max\{0, \min\{\bar w, x + a + \xi\}\},

where the max and min keep xx' in the state space. The transition probabilities are, therefore,

P(x,a,x)=ξΞ1{F(x,a,ξ)=x}ϕ(ξ).P(x, a, x') = \sum_{\xi \in \Xi} \1\{F(x, a, \xi) = x'\} \phi(\xi).

Flow profits are given by

π(x,a,ξ)ρ(wˉx)(c+τa)1{a0}p1{x+a+ξ<0}.\pi(x, a, \xi) \coloneq \rho (\bar w - x) - (c + \tau |a|) \1\{a \neq 0\} - p \, \1\{x + a + \xi < 0\}.

Here ρ\rho is the rate of return on securities, cc is a fixed transaction cost, τ\tau is a proportional transaction cost, and pp is a penalty for insufficient cash (in this case, when x+a+ξ<0x + a + \xi < 0). To fit the problem into the MDP framework, we take expectations of flow profits to get the period reward

r(x,a)=ξΞπ(x,a,ξ)ϕ(ξ).r(x, a) = \sum_{\xi \in \Xi} \pi(x, a, \xi) \, \phi(\xi).

Future payoffs are discounted using discount factor β\beta.

The set of feasible policies Σ\Sigma is defined from Γ\Gamma in the usual way (see (1.38)). The lifetime value of any given policy can be computed from vσ=(IβPσ)1rσv_\sigma = (I - \beta P_\sigma)^{-1} r_\sigma, as discussed in Section 1.2.1.1. Figure 1.7 illustrates by using this formula to compute σ\sigma-value functions for two policies. The first is a “do nothing” policy that sets a=0a = 0 for all states. The second is a target policy that always moves toward a fixed target cash level. As we will see, both policies are suboptimal.

Value of a do-nothing policy (\sigma_1) and a target policy (\sigma_2)

Figure 1.7:Value of a do-nothing policy (σ1\sigma_1) and a target policy (σ2\sigma_2)

Next let’s solve for an approximately optimal policy using VFI, as described in Algorithm 1.2.1. Figure 1.8 shows the resulting (approximately) optimal policy and value function. The policy shows the optimal transfer amount as a function of current cash holdings, while the value function shows the lifetime value of following the optimal policy. Here we set total wealth wˉ=50\bar w = 50, return on securities ρ=0.02\rho = 0.02, fixed transaction cost c=1c = 1, proportional transaction cost τ=0.1\tau = 0.1, penalty for insufficient cash p=10p = 10, and β=0.95\beta = 0.95. The cash flow shocks are uniformly distributed on {5,4,,4,5}\{-5, -4, \ldots, 4, 5\}.

The optimal policy recommends that when cash holdings are low, the decision maker should move funds from securities to cash (take a positive action), and when cash holdings are high, move funds from cash to securities (a negative action). The value function declines for large cash balances because wealth is fixed and hence high cash balances mean low holdings of securities and reduced returns.

Optimal policy and value function for the cash management problem

Figure 1.8:Optimal policy and value function for the cash management problem

Figure 1.9 shows iterates for the policy sequence and the value sequence under the HPI algorithm. The initial policy is a do-nothing policy. As mentioned in Theorem 1.2.2, HPI converges in a finite number of iterations. Here it converges in 5 iterations, so the last policy is the exact optimal policy (modulo floating point arithmetic), and the last σ\sigma-value function is the value function v\vmax. The gap between the first value function associated with the do-nothing policy, and the final value function associated with the optimal policy, is value of active cash management. This gap is largest at extreme cash levels, where the do-nothing policy either leaves the firm exposed to cash shortfalls or forgoes returns on securities.

Iterating with HPI from the do-nothing policy

Figure 1.9:Iterating with HPI from the do-nothing policy

Figure 1.10 shows a simulated time path for cash and optimal cash transfers under the optimal policy. Cash is held unchanged in many time periods as a result of transaction costs.

Time path for cash and actions under the optimal policy

Figure 1.10:Time path for cash and actions under the optimal policy

1.2.2Continuous Time

In this section we modify our finite state MDP model from Section 1.2.1.1 to a continuous time setting. With appropriate manipulations, our continuous time model can be embedded in the discrete time framework.

1.2.2.1Primitives and Values

As in the discrete time case, X\Xsf and A\Asf are finite sets, while the controller is constrained by a feasible correspondence Γ\Gamma from X\Xsf to A\Asf. The definitions of G\Gsf, Σ\Sigma, and rr are unchanged. Discounting is determined by a constant δ>0\delta > 0, referred to as the discount rate, while transitions are driven by an intensity kernel QQ from G\Gsf to X\Xsf, which is a map QQ from G×X\Gsf \times \Xsf to R\RR that satisfies

xQ(x,a,x)=0 for all (x,a) in G and Q(x,a,x)0 when xx.\sum_{x'} Q(x, a, x') = 0 \text{ for all } (x,a) \text{ in } \Gsf \text{ and } Q(x, a, x') \geq 0 \text{ when } x \not= x'.

Informally, over the short interval from tt to t+ht+h, the controller receives instantaneous reward r(x,a)hr(x,a)h and the state transitions to state xx' with probability Q(x,a,x)h+o(h)Q(x, a, x') h + o(h).

For a fixed σΣ\sigma \in \Sigma, we obtain an intensity operator (i.e., an infinitesimal generator)

Qσ(x,x)Q(x,σ(x),x)(x,xX)Q_\sigma(x, x') \coloneq Q(x, \sigma(x), x') \qquad (x, x' \in \Xsf)

that determines a continuous time Markov chain (Xt)t0(X_t)_{t \geq 0} with transition probabilities given by PtσetQσP^\sigma_t \coloneq \me^{t Q_\sigma} for all xXx \in \Xsf. In particular,

Exh(Xt)=(Ptσh)(x) for any hRX.\EE_x h(X_t) = (P^\sigma_t h)(x) \text{ for any } h \in \RR^\Xsf.

(For background see Chapter 10 of Sargent & Stachurski (2025).) Continuing to define rσ(x)r(x,σ(x))r_\sigma(x) \coloneq r(x, \sigma(x)), the lifetime value of following σ\sigma starting from state xx is

vσ(x)=Ex0eδtrσ(Xt) ⁣dt=0eδt(Ptσrσ)(x) ⁣dtv_\sigma (x) = \EE_x \int_0^\infty \me^{-\delta t} r_\sigma(X_t) \diff t = \int_0^\infty \me^{-\delta t} (P^\sigma_t r_\sigma)(x) \diff t

(Passing the expectation through the integral can be justified by Fubini’s theorem.) Using δ>0\delta > 0, we can rewrite vσv_\sigma as

vσ=0et(QσδI)rσ ⁣dt=(δIQσ)1rσ.v_\sigma = \int_0^\infty \me^{t (Q_\sigma - \delta I)} r_\sigma \diff t = (\delta I - Q_\sigma)^{-1} r_\sigma.

The two representations for vσv_\sigma are the continuous time analogs of the discrete-time representations given in (1.42). A proof of the second equality is given in §10.2 of Sargent & Stachurski (2025).

(Readers familiar with semigroup theory will recognize the two representations in (1.61) as alternative expressions for the resolvent of the semigroup (etQ)(\me^{tQ}) -- see, for example, Engel & Nagel (2006), Theorem 1.10.)

1.2.2.2Uniformization

We can use the ADP framework to reformulate (1.61) by making vσv_\sigma be the fixed point of an order preserving policy operator. This process is called uniformization. The first step is to set

P(x,a,x)1{x=x}+Q(x,a,x)mwheremmaxxX,aAQ(x,a,x).P(x, a, x') \coloneq \1\{x = x'\} + \frac{Q(x, a, x')}{m} \quad \text{where} \quad m \coloneq \max_{x \in \Xsf, \, a \in \Asf} |Q(x, a, x)|.

Then set

βmm+δandr^σrσm+δ.\beta \coloneq \frac{m}{m + \delta} \quad \text{and} \quad \hat r_\sigma \coloneq \frac{r_\sigma}{m + \delta}.

As in the discrete time case, for each σΣ\sigma \in \Sigma, define PσP_\sigma and r^σ\hat r_\sigma according to

Pσ(x,x)P(x,σ(x),x)andr^σ(x)=r^(x,σ(x)).P_\sigma(x, x') \coloneq P(x, \sigma(x), x') \quad \text{and} \quad \hat r_\sigma(x) = \hat r(x, \sigma(x)).

From Exercise 1.8, we see that vσv_\sigma is the unique fixed point in VRXV \coloneq \RR^\Xsf of the policy operator

Tσv=r^σ+βPσv.T_\sigma \, v = \hat r_\sigma + \beta P_\sigma \, v.

1.2.2.3Optimality

Since (1.65) becomes (1.43) after replacing r^σ\hat r_\sigma with rσr_\sigma, we can apply the discrete time MDP theory in Section 1.2. The Bellman equation becomes

v(x)=maxaΓ(x){r^(x,a)+βxv(x)P(x,a,x)}(xX).v(x) = \max_{a \in \Gamma(x)} \left\{ \hat r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \qquad (x \in \Xsf).

The optimality properties in Theorem 1.2.1 hold and, by Theorem 1.2.2, VFI, OPI and HPI all converge. With v\vmax denoting the value function, a policy is optimal if and only if

σ(x)argmaxaΓ(x){r^(x,a)+βxv(x)P(x,a,x)}for all xX,\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ \hat r(x, a) + \beta \sum_{x'} \vmax(x') P(x, a, x') \right\} \quad \text{for all } x \in \Xsf,

The next two exercises unpack these equations and conditions to recover our original continuous time formulation.

Equation (1.68) connects the exposition above to the traditional theory of continuous time MDPs (see, e.g., Guo & Hernández-Lerma (2009)). It is sometimes called the Hamilton--Jacobi--Bellman (HJB) equation, although that name is more commonly used when the state process is a diffusion.

As in Exercise 1.9, it can be shown that σΣ\sigma \in \Sigma is v\vmax-greedy if and only if

σ(x)argmaxaΓ(x){r(x,a)+xv(x)Q(x,a,x)}for all xX.\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \sum_{x'} \vmax(x') Q(x, a, x') \right\} \quad \text{for all } x \in \Xsf.

1.2.2.4Example: Service Rate Control

Here we study a queue system where a firm controls service rates to maximize profit. The firm operates a service facility with finite capacity NN. Customers arrive according to a Poisson process with rate λ\lambda. The state xX={0,1,,N}x \in \Xsf = \{0, 1, \ldots, N\} represents the number of customers currently waiting for service. The firm can control the service rate by selecting from a finite set of actions A\Asf. Each action aAa \in \Asf corresponds to a service rate μ(a)\mu(a). Higher service rates allow faster customer processing but incur greater operating costs.

The intensity kernel QQ associated with this problem is

Q(x,a,x)={λif x=x+1 and x<Nμ(a)if x=x1 and x>00for all other x with xx.Q(x, a, x') = \begin{cases} \lambda & \text{if } x' = x + 1 \text{ and } x < N \\ \mu(a) & \text{if } x' = x - 1 \text{ and } x > 0 \\ 0 & \text{for all other } x' \text{ with } x \neq x' \text{.} \end{cases}

When x=xx=x', we set Q(x,a,x)=yxQ(x,a,y)Q(x, a, x') = -\sum_{y \neq x} Q(x, a, y) in order to ensure xQ(x,a,x)=0\sum_{x'} Q(x, a, x') = 0 at each (x,a)(x,a).

All choices of aa are feasible, so Γ(x)=A\Gamma(x) = \Asf for all xx. The instantaneous profit rate is

r(x,a)=μ(a)R1{x>0}hxc(a),r(x, a) = \mu(a) R \1\{x > 0\} - h x - c(a),

where RR is revenue per customer served, hh is holding cost per customer per unit time, and c(a)c(a) is the service cost rate for action aa. The first term represents revenue from serving customers, the second term captures the cost of customers waiting in the queue, and the third term is the cost of operating at service rate μ(a)\mu(a). The firm’s objective is to maximize expected discounted profit flow vσ(x)=Ex0eδtr(Xt,σ(Xt)) ⁣dtv_\sigma(x) = \EE_x \int_0^\infty \me^{-\delta t} r(X_t, \sigma(X_t)) \diff t, where δ>0\delta > 0 is the discount rate.

We solve the problem using the uniformization technique discussed in Section 1.2.2.2. The first step is to calculate the uniformization rate m=maxx,aQ(x,a,x)m = \max_{x,a} |Q(x,a,x)|. Here

Hence

m=λ+μˉwhereμˉmaxaμ(a).m = \lambda + \bar \mu \quad \text{where} \quad \bar \mu \coloneq \max_a \mu(a).

Thus, following the specifications in Section 1.2.2.2, we set

P(x,a,x)=1{x=x}+Q(x,a,x)λ+μˉ,β=λ+μˉλ+μˉ+δ,r^(x,a)=r(x,a)λ+μˉ+δ.P(x, a, x') = \1\{x = x'\} + \frac{Q(x, a, x')}{\lambda + \bar \mu}, \quad \beta = \frac{\lambda + \bar \mu}{\lambda + \bar \mu + \delta}, \quad \hat r(x, a) = \frac{r(x, a)}{\lambda + \bar \mu + \delta}.

We then compute the optimal policy using VFI based on the Bellman equation (1.66).

Figure 1.11 shows the optimal policy and value function for a system with N=10N = 10 customers, arrival rate λ=2.0\lambda = 2.0, service rates μ=(2.5,3.0,3.5)\mu = (2.5, 3.0, 3.5), revenue R=10R = 10, holding cost h=2.5h = 2.5, service costs c=(1.5,2.0,4.5)c = (1.5, 2.0, 4.5), and discount rate δ=0.1\delta = 0.1. When the queue is empty, the firm uses the lowest service rate to minimize operating costs. As the queue length increases, the firm gradually raises the service rate to balance the increasing holding costs against service costs. The value function increases initially as queue length grows (reflecting the value of serving customers) but eventually decreases as holding costs dominate.

Optimal service rate policy and value function for the queue system

Figure 1.11:Optimal service rate policy and value function for the queue system

1.2.3Extensions

In this section we discuss extensions of the MDP framework that parallel those for the firm problem. These include nonlinear objectives including mean-variance and risk-sensitive preferences. As before, we find that nonlinearity forecloses a recursive structure, motivating resort to period-by-period reformulations that restore tractability. We also introduce ambiguity, where the controller faces uncertainty about transition probabilities.

1.2.3.1Nonlinear Criteria

Some extensions to the firm problem discussed in Section 1.1.2 have counterparts here. For example, we discussed maximization problems of the form

maxσΣU(Zσ)whereZσt=0T(σ)1βtπt+βT(σ)s,\max_{\sigma \in \Sigma} U( Z_\sigma ) \quad \text{where} \quad Z_\sigma \coloneq \sum_{t=0}^{T(\sigma)-1} \beta^t \pi_t + \beta^{T(\sigma)} s,

and UU is a nonlinear real-valued function. One example was given in (1.32), where UU was the mean-variance map in (1.24). In other examples, UU emerges from value-at-risk, conditional value-at-risk, risk sensitivity, or a desire for robustness.

There are obvious parallels for the MDP model we introduced in Section 1.2.1. We simply take the criterion from (1.74) and modify it to

maxσΣU(Zσ)whereZσt0βtr(Xt,σ(Xt)).\max_{\sigma \in \Sigma} U \left( Z_\sigma \right) \quad \text{where} \quad Z_\sigma \coloneq \sum_{t \geq 0} \beta^t r(X_t, \sigma(X_t)).

Here (Xt)t0(X_t)_{t \geq 0} is a Markov chain generated by PσP_\sigma with fixed initial condition and UU is again, some given real-valued function. As before, we can choose UU to inject concern for mean-variance trade-offs, value-at-risk, conditional value-at-risk, risk sensitivity and a desire for robustness.

1.2.3.2Back to Recursions

In Section 1.1.3.5 we discussed how optimization problems of the form (1.74) can be troublesome. The lack of a recursive structure prevents us from using Bellman machinery. The result is that we are left without a clear path to optimization, as well as the loss of time-consistency. Not surprisingly, all of these difficulties remain present when we switch to the MDP version in (1.75).

For the most part, theorists have responded in ways similar to ones discussed for the firm problem in Section 1.1.3.6, where recursive structure is enforced by applying nonlinear criteria period-by-period, rather than applying them directly to the sum representing lifetime value. For example, we can modify the policy operator (1.7) to

(Tσv)(x)=σ(x)s+(1σ(x))[π(x)+β(Kσv)(x)](T_\sigma \, v)(x) = \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta (K_\sigma \, v)(x) \right]

where, for each σΣ\sigma \in \Sigma, the map KσK_\sigma is a given nonlinear operator. For example, by setting

(Kσv)(x)=1γln{exp(γv(x))Pσ(x, ⁣dx)}(K_\sigma \, v)(x) = - \frac{1}{\gamma} \ln \left\{ \int \exp(- \gamma v(x')) P_\sigma(x, \diff x') \right\}

we switch the MDP problem to entropic risk preferences.

As for the firm problem, this alternative approach to inserting risk preferences is less intuitive than the direct approach in (1.75) and raises a question: when is vσv_\sigma well-defined by the nonlinear functional equation (1.76)? In addition, the formulation in (1.76) offers the advantage that valuations are recursive by construction. This allows us to apply solution methods that extend Bellman’s original theory in natural ways. We explore these ideas in the remainder of the text, beginning with the abstract recursive setup in Chapter 2.

1.2.3.3Ambiguity

The MDP framework has been extended to include a decision maker’s concerns about misspecification of the probability distribution. For example, consider the cash management problem from Section 1.2.1.5. Applying the definitions of the profit function π\pi and the transition function FF from that section, the risk-neutral problem can be written as

maxσΣEϕt0βtπ(Xt,σ(Xt),ξt+1)\max_{\sigma \in \Sigma} \, \EE_\phi \, \sum_{t \geq 0} \beta^t \pi(X_t, \sigma(X_t), \xi_{t+1})

where (Xt)t0(X_t)_{t \geq 0} obeys Xt+1=F(Xt,σ(Xt),ξt+1)X_{t+1} = F(X_t, \sigma(X_t), \xi_{t+1}) for all tt. We subscript expectation with ϕ\phi to emphasize the fact that the mathematical expectation over ξt+1\xi_{t+1} is taken with respect to distribution ϕ\phi.

Suppose now that the manager doesn’t know ϕ\phi but does know that ϕ\phi belongs to a set of possible distributions Φ\Phi. This is how the decision maker expresses ambiguity about the probability law that governs ξt+1\xi_{t+1}. If we were to say to the decision maker to put a subjective probability distribution over Φ\Phi, the decision maker would decline to do so.

The decision maker proceeds in the spirit of Abraham Wald Wald (1950) by assuming only that he knows a set of possible models. To do this, he replaces (1.78) with

maxσΣminϕΦEϕt0βtπ(Xt,σ(Xt),ξt+1).\max_{\sigma \in \Sigma} \, \min_{\phi \in \Phi} \EE_\phi \, \sum_{t \geq 0} \beta^t \pi(X_t, \sigma(X_t), \xi_{t+1}).

By using this criterion, the manager seeks a decision rule that works well enough no matter which probability distribution ϕΦ\phi \in \Phi governs ξt+1\xi_{t+1}.

A recursive structure is absent from criterion (1.79). One way out of this difficulty is to make our decision maker express model ambiguity in a way that is more susceptible to a recursive formulation. For example, we could ascribe our decision maker a value function that solves the following Bellman equation:

v(x)=maxaΓ(x)minϕΦξ{π(x,a,ξ)+βv(F(x,a,ξ))}ϕ(ξ).v(x) = \max_{a \in \Gamma(x)} \min_{\phi \in \Phi} \sum_\xi \left\{ \pi(x, a, \xi) + \beta v(F(x, a, \xi)) \right\} \phi(\xi).

As we will see in Section 7.3.3, this kind of specification puts dynamic programming theory back in business.

1.3Optimal Savings

This section presents an optimal savings problem (also called the optimal consumption problem and the income fluctuation problem). This problem is a building block for many economic models. It features a basic intertemporal trade-off from consuming now or later. This trade-off can be solved by dynamic programming.

Unlike the finite state problems above, the optimal savings model has a continuous state space, as well as a continuous action space. We define policies, policy operators, and lifetime values and state the key optimality results. We then look at extensions to Epstein--Zin preferences.

1.3.1Policies and Decisions

In an optimal savings problem (sometimes called an “income fluctuation problem”), a household seeks to maximize

Et=0βtu(Ct)s.t.Wt+1=R(WtCt)+Yt+1and0CtWt.\EE \, \sum_{t=0}^{\infty} \beta^t u(C_t) \quad \text{s.t.} \quad W_{t+1} = R(W_t - C_t) + Y_{t+1} \quad \text{and} \quad 0 \leq C_t \leq W_t.

The constraints in (1.81) are required to hold for all t0t \geq 0, and an initial condition w0w_0 is taken as given. The utility function u ⁣:R+Ru \colon \RR_+ \to \RR maps current consumption CtC_t into a utility value (loosely speaking, a measure of satisfaction), β(0,1)\beta \in (0,1) is a discount factor indicating impatience, and R>0R > 0 is a gross rate of return on assets. The variable WtW_t represents wealth at time tt, while YtY_t is labor income. To keep the model simple, we assume (Yt)(Y_t) is iid with common distribution ϕD(R+)\phi \in \dD(\RR_+), the set of distributions (i.e., Borel probability measures) on R+\RR_+.

(We study more general settings later.)

The variable WtW_t is the state of the dynamic program, while CtC_t is the action. Figure 1.12 shows the timing for the optimal savings problem. After observing WtW_t, the household chooses CtC_t and hence WtCtW_t - C_t. Then labor income Yt+1Y_{t+1} is realized and the state updates to Wt+1W_{t+1}. The process then repeats.

Timing for the optimal savings problem

Figure 1.12:Timing for the optimal savings problem

In maximization problem (1.81) there is another constraint: CtC_t can depend only on information available at time tt. Formally, current consumption CtC_t must be a (deterministic) Borel measurable function of shocks, states, and actions observed up to and including time tt. Thus, the current action cannot depend on future values such as Yt+1Y_{t+1} or Wt+1W_{t+1}. A mapping from the history of the state and the shocks into the current action is called a policy function.[2]

The infinite horizon, iid (Yt)(Y_t)-process, time-invariant structure of the optimal savings problem lets us focus on policies that make current consumption CtC_t be a deterministic function σ\sigma of the current state WtW_t. (We will prove this later and discover how it depends on the iid-nature of the (Yt)(Y_t) process.)

We impose the following simplifying conditions:

In a slight abuse of notation, we use ϕ\phi to represent the density of labor income as well as the corresponding distribution (i.e., Borel probability measure on R+\RR_+). Thus, in the integrals below, ϕ( ⁣dy)\phi(\diff y) and ϕ(y) ⁣dy\phi(y) \diff y have the same meaning.

1.3.1.1Lifetime Value

In this setting, a stationary Markov policy is a Borel measurable map σ\sigma from R+\RR_+ to itself. Here we refer to stationary Markov policies more simply as policies. We call a policy σ\sigma feasible if 0σ(w)w0 \leq \sigma(w) \leq w for all wR+w \in \RR_+, so that the consumption response c=σ(w)c = \sigma(w) obeys the inequalities in (1.81). Let Σ\Sigma denote the set of all feasible policies. We seek σΣ\sigma \in \Sigma that maximizes expected lifetime value. For given σ\sigma and initial condition w=w0w = w_0, expected lifetime value is

vσ(w)=Et0βtu(σ(Wt))when   Wt+1=R(Wtσ(Wt))+Yt+1v_\sigma(w) = \EE \sum_{t \geq 0} \beta^t u(\sigma(W_t)) \quad \text{when } \; W_{t+1} = R(W_t - \sigma(W_t)) + Y_{t+1}

for all t0t \geq 0 and (Wt)t0(W_t)_{t \geq 0} starts at ww. Below, we refer to vσv_\sigma as the σ\sigma-value function.

It is helpful to represent vσv_\sigma as a policy operator for σΣ\sigma \in \Sigma:

(Tσv)(w)=u(σ(w))+βv(R(wσ(w))+y)ϕ( ⁣dy)(wR+).(T_\sigma \, v)(w) = u(\sigma(w)) + \beta \int v(R(w - \sigma(w)) + y) \phi(\diff y) \qquad (w \in \RR_+).

This policy operator is a continuous state analog of the finite MDP policy operator we saw in (1.43). It acts on functions vVv \in V, where

VbR+all bounded Borel measurable functions from R+ to R.V \coloneq b\RR_+ \coloneq \text{all bounded Borel measurable functions from } \RR_+ \text{ to } \RR.

Recall that VV is a Banach space (see Section A.4.2) with supremum norm v:=supxv(x)\| v \| := \sup_x |v(x)|.

Policy operators are useful because vVv \in V is a fixed point of TσT_\sigma if and only if it equals the σ\sigma-value function. Thus, the fixed point of TσT_\sigma characterizes the lifetime value of σ\sigma. This is a consequence of the following lemma.

Incidentally, one can use the law of iterated expectations to prove that the σ\sigma-value function vσv_\sigma is a fixed point of TσT_\sigma. Write

vσ(w)=u(σ(w))+Et1βtu(σ(Wt)).v_\sigma(w) = u(\sigma(w)) + \EE \sum_{t \geq 1} \beta^t u(\sigma(W_t)) .

Letting E1\EE_1 be the expectation conditional on W1W_1, applying the law of iterated expectations implies

vσ(w)=u(σ(w))+βE[E1t1βt1u(σ(Wt))]=u(σ(w))+βEvσ(W1).v_\sigma(w) = u(\sigma(w)) + \beta \EE \, \left[ \EE_1 \, \sum_{t \geq 1} \beta^{t-1} u(\sigma(W_t)) \right] = u(\sigma(w)) + \beta \EE \, v_\sigma(W_1).

Expanding the last expression yields

vσ(w)=u(σ(w))+βvσ(R(wσ(w))+y)ϕ( ⁣dy).v_\sigma(w) = u(\sigma(w)) + \beta \int v_\sigma(R(w - \sigma(w)) + y) \phi(\diff y).

Thus, vσv_\sigma is a fixed point of TσT_\sigma.

1.3.1.2Lifetime Values as Limits

In the previous section we learned that fixed points of policy operators represent lifetime value. What do finite iterates of policy operators represent? Fixing σ\sigma and inspecting the definition of TσT_\sigma (see (1.83)) indicates that (Tσv)(w)(T_\sigma \, v)(w) represents the reward received from using policy σ\sigma for one period, when ww is initial wealth and the function vv is used to evaluate the reward from wealth in the second period.

We can lengthen the horizon by iterating with TσT_\sigma while keeping the terminal value function vv fixed. Choosing kNk \in \NN and using the expression for TσT_\sigma in (1.87), we get

Tσkv=rσ+βPσrσ++(βPσ)k1rσ+(βPσ)kvT^k_\sigma \, v = r_\sigma + \beta P_\sigma \, r_\sigma + \cdots + (\beta P_\sigma)^{k-1} r_\sigma + (\beta P_\sigma)^k v

The expression on the right is the value of using policy σ\sigma for kk periods and then receiving a reward for terminal wealth determined by the function vv. Thus, it is the finite horizon value of following σ\sigma under this terminal condition.

It seems plausible that the infinite-horizon lifetime value of a policy σ\sigma could equal the limit of finite horizon values, so that

vσ=limkTσkv.v_\sigma = \lim_{k \to \infty} T^k_\sigma v.

Lemma 1.3.1 assures us that this is true: since TσT_\sigma is globally stable on VV with unique fixed point vσv_\sigma, the limit in (1.94) exists and equals vσv_\sigma, independent of the terminal condition vVv \in V.

Figure 1.13 shows two arbitrarily chosen feasible policies and their lifetime values when R=1.04R=1.04, β=0.95\beta=0.95, u(c)=1exp(c)u(c)=1 - \exp(-c), and Yt=exp(νZt)Y_t = \exp(\nu Z_t) when ν=0.8\nu=0.8 and ZtZ_t is standard normal. The lifetime values were computed via (1.94).

Randomly chosen policies and their lifetime values

Figure 1.13:Randomly chosen policies and their lifetime values

1.3.2Optimality

The value function for the optimal savings model is

v(w)supσΣvσ(w)(wR+).\vmax(w) \coloneq \sup_{\sigma \in \Sigma} v_\sigma (w) \qquad (w \in \RR_+).

Under Assumption 1.3.1 the supremum is always well defined in R\RR, since uu and hence rσr_\sigma is bounded by some constant MM, implying that, for any wR+w \in \RR_+ and σΣ\sigma \in \Sigma,

vσ(w)t0βtM=M1β.v_\sigma(w) \leq \sum_{t \geq 0} \beta^t M = \frac{M}{1-\beta}.

A policy is called optimal if vσ=vv_\sigma = \vmax; that is, if following the policy from every initial state ww leads to the largest possible lifetime value attainable from ww.

The set of feasible policies lies in an infinite-dimensional function space, so we cannot find an optimal policy by exhaustive search. We want a systematic and efficient search procedure. Following the techniques we used for the firm management problem in Section 1.1, our approach will be to (a) set up a Bellman equation to help us assign maximal lifetime values to states, and (b) solve for a greedy policy with respect to this maximizing function.

1.3.2.1Bellman’s Method

Fix vVv \in V. In the present setting, a policy σΣ\sigma \in \Sigma will be called vv-greedy if

σ(w)argmax0cw{u(c)+βv(R(wc)+y)ϕ( ⁣dy)}for all w0.\sigma(w) \in \argmax_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\} \quad \text{for all } w \geq 0.

A vv-greedy policy uses vv to value next-period states and then chooses consumption optimally to trade off current utility against expected discounted future value associated with the implied level of savings. The following statements are both true:

  1. Computing vv-greedy policies is typically much easier than computing optimal policies, since we are only solving a two-period problem.

  2. Computing vv-greedy policies can be equivalent to computing optimal policies, given the right choice of vv.

What is the right choice of vv? A natural candidate is the value function, since the value function tells us the maximal reward from alternative states. We explain this in more detail in Section 1.3.2.2. In that same section, we will also use the fact that the value function satisfies an important functional equation, which we now describe.

We say that vVv \in V satisfies the Bellman equation for the optimal savings problem if

v(w)=max0cw{u(c)+βv(R(wc)+y)ϕ( ⁣dy)}for all w0.v(w) = \max_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\} \quad \text{for all } w \geq 0.

Stating that vv solves the Bellman equation is equivalent to stating that vv is a fixed point of the Bellman operator TT that maps a value function v(w)v(w) into a value function (Tv)(w)(T v)(w) defined by

(Tv)(w)=max0cw{u(c)+βv(R(wc)+y)ϕ( ⁣dy)}(w0).(T v)(w) = \max_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\} \qquad (w \geq 0).

The next lemma discusses properties of greedy policies and the Bellman operator.

Since (V,)(V, \| \cdot \|) is a Banach space, the contraction property in Exercise 1.11 implies that TT is globally stable on VV. (See Section A.2.2.2 for details).

1.3.2.2DP Results for Optimal Savings

Dynamic programming theory tells us that, under Assumption 1.3.1,

  1. at least one optimal policy exists,

  2. the value function v\vmax is the unique solution to the Bellman equation in VV, and

  3. a policy σΣ\sigma \in \Sigma is optimal if and only if it is v\vmax-greedy.

A direct proof of (i)--(iii) can be found in Stokey & Lucas (1989), Stachurski (2022) and numerous other sources. The proofs heavily exploit the fact that the Bellman operator is a contraction mapping (as discussed in Exercise 1.11). We skip proofs for now, noting that they will be special cases of proofs we shall provide in Section 3.2.2.

Let’s review what we’ve found so far. We started with one optimization problem---choosing an optimal consumption path C0,C1,C_0, C_1, \ldots to maximize expected discounted lifetime utility---and ended up with another one---finding a greedy policy from the value function. Are we actually better off? The answer is: yes! Finding a greedy policy involves solving a scalar optimization problem performed for each state ww, whereas as our previous optimization problem was infinite dimensional. High dimensionality is the mountain we must climb in all hard optimization problems and here we have used the recursive structure inherent in the problem to map a route up to the top.

Of course this claim that we are better off is contingent on us being able to learn what the value function is, so that we can compute v\vmax-greedy policies---or at least some reasonable approximation. We discuss this topic next.

Figure 1.14 shows an approximation of the optimal policy σ\sigopt and the value function v\vmax, both computed by OPI, for the same version of the optimal savings problem used in Figure 1.13. In this case we set m=20m=20.

Approximating the optimal policy and value function via OPI

Figure 1.14:Approximating the optimal policy and value function via OPI

1.3.2.3Special Case: No Labor Income

Let’s quickly look at a version of the savings model where it’s possible to get an analytical solution for the optimal policy and the value function. We will use this solution to help us investigate the role of parameters and, through this process, consider the need for extensions to the basic optimal savings model.

To obtain an analytical solution, we set Yt0Y_t \equiv 0 and assume that the utility function has the CRRA form

u(c)c1γ1γ(γ>0,  γ1).u(c) \coloneq \frac{c^{1-\gamma}}{1-\gamma} \qquad (\gamma > 0, \; \gamma \not= 1).

The conditions of the preceding discussion are not satisfied, since uu is not bounded on R+\RR_+ and may take the value -\infty. We assume instead that βR1γ<1\beta R^{1-\gamma} < 1. This turns out to be sufficient to ensure finite lifetime values when consumption choices are positive:

For this CRRA problem, the optimal consumption policy is linear in ww. That is,

there exists a constant η such that σ(w)=ηw is the optimal policy\text{there exists a constant } \eta \text{ such that } \sigma(w) = \eta w \text{ is the optimal policy}

Let’s verify this claim and also seek the value of the constant η\eta. In doing so, we first observe that if (1.103) holds, then

Wt=Rt(1η)twwhen   W0=wW_t = R^t (1 - \eta)^t w \quad \text{when } \; W_0 = w

and hence the value function v\vmax satisfies

v(w)=tβtu(ηWt)=tβtu(ηRt(1η)tw)=tβt(ηRt(1η)t)1γu(w)=η1γ1β(R(1η))1γu(w)\begin{aligned} \vmax(w) = \sum_t \beta^t u (\eta W_t) & = \sum_t \beta^t u \left( \eta R^t \left(1 -\eta \right)^t w \right) \\ & = \sum_t \beta^t \left( \eta R^t \left(1 -\eta \right)^t \right)^{1-\gamma} u \left( w \right) = \frac{\eta^{1-\gamma}}{1-\beta \left( R \left( 1-\eta \right) \right)^{1-\gamma}} u(w) \end{aligned}

Our conjecture is that the linear policy σ(w)=ηw\sigma(w) = \eta w satisfies the Bellman equation with the value function as given above. Under this conjecture, the Bellman equation becomes

v(w)=maxc{c1γ1γ+βη1γ1β(R(1η))1γ(R(wc))1γ1γ}\vmax(w) = \max_c \left\{ \frac{c^{1-\gamma}}{1-\gamma} + \beta \cdot \frac{\eta^{1-\gamma}}{1-\beta \left( R \left( 1-\eta \right) \right)^{1-\gamma}} \cdot \frac{\left(R \left( w-c \right) \right)^{1-\gamma}}{1-\gamma} \right\}

Taking the derivative with respect to cc yields the first-order condition

cγ+βm(R(wc))γ(R)=0 when   mη1γ1β(R(1η))1γc^{-\gamma} + \beta m \left(R \left( w-c \right) \right)^{-\gamma} (-R) =0 \quad \text{ when } \; m \coloneq \frac{\eta^{1-\gamma}} { 1-\beta \left( R \left( 1-\eta \right) \right)^{1-\gamma} }

It then follows that cγ=βmR1γ(wc)γc^{-\gamma} = \beta m R^{1-\gamma}(w-c)^{-\gamma}. Substituting the optimal policy σ(w)=ηw\sigma(w) = \eta w into this equality gives

(ηw)γ=βR1γη1γ1β(R(1η))1γ(1η)γwγ\left( \eta w \right)^{-\gamma} = \frac{\beta R^{1-\gamma} \eta^{1-\gamma}} {1- \beta \left( R \left( 1-\eta \right) \right)^{1-\gamma}} (1-\eta)^{-\gamma} w^{-\gamma}

Now solving the above equality for η\eta yields

η=1(βR1γ)1/γ\eta = 1 - \left( \beta R^{1-\gamma} \right)^{1/\gamma}

In this connection, given any initial wealth ww, the value function becomes

v(w)=η1γ1β(R(1η))1γu(w)=(1(βR1γ)1/γ)1γ1βR1γ(βR1γ)1γγu(w)=ηγu(w).\vmax(w) = \frac{\eta^{1-\gamma}}{1-\beta \left( R \left( 1-\eta \right) \right)^{1-\gamma}} u(w) = \frac{\left( 1 - \left( \beta R^{1-\gamma} \right)^{1/\gamma} \right)^{1-\gamma}} {1-\beta R^{1-\gamma} \left( \beta R^{1-\gamma} \right)^{\frac{1-\gamma}{\gamma}}} u(w) =\eta^{-\gamma} u(w).

It is not difficult to verify that v(w)=ηγu(w)\vmax(w) = \eta^{-\gamma} u(w) solves the Bellman equation (1.106) for any ww.

The parameter γ\gamma governs the curvature of the utility function and hence preferences about consumption smoothing. To see this, observe that consumption at time tt is Ct=ηWt=η(R(1η))twC_t = \eta W_t = \eta (R(1-\eta))^t w, so the consumption growth factor is Ct+1/Ct=(βR)1/γC_{t+1}/C_t = (\beta R)^{1/\gamma}. When γ\gamma is large, the utility function has high curvature and the agent dislikes variation in consumption across time. Conversely, when γ\gamma is small, the agent is more tolerant of consumption variation, leading to steeper paths. Figure 1.15 illustrates these effects for β=0.96\beta = 0.96 and R=1R=1.

Optimal consumption paths under CRRA utility for different values of \gamma.

Figure 1.15:Optimal consumption paths under CRRA utility for different values of γ\gamma.

1.3.3Epstein--Zin Preferences

There are a number of issues and limitations associated with the basic optimal savings model we have discussed so far. Moreover, these limitations tend to bind more often as we move towards quantitative analysis and interesting research applications. In this section we discuss issues related to risk and intertemporal substitution. This discussion will motivate us to introduce Epstein--Zin preferences, which are a particularly popular specification of intertemporal preferences in economics and finance.

1.3.3.1Risk vs EIS

One issue is that, under the model considered so far, the curvature of the utility function uu simultaneously governs both risk aversion (e.g., a more strongly concave utility function indicates stronger aversion to risk) and willingness to substitute consumption across time (as we saw in Figure 1.15, where increasing γ\gamma led to flatter consumption paths). Willingness to substitute consumption is usually measured by the elasticity of intertemporal substitution (EIS), which, for the CRRA utility function is 1/γ1/\gamma. Larger γ\gamma pushes down the EIS, indicating preference for smooth consumption over time.

The fact that utility curvature controls both risk preferences and the EIS binds attitudes toward uncertainty together with attitudes toward intertemporal substitution. Researchers have found that to explain various macro-finance patterns, it helps to unbind them and allow separate parameters to describe these two attitudes. For example, matching observed equity premia using standard asset pricing models requires high risk aversion, but high γ\gamma under CRRA implies a small EIS, which creates other difficulties including what is called a risk-free rate puzzle (see, e.g., Chapter 13 of Ljungqvist & Sargent (2018)).

1.3.3.2EZ Preferences

Epstein--Zin preferences Epstein & Zin, 1989Weil, 1990 play a big role in many macro-finance models. Under these preferences, the Bellman equation (1.98) from the standard optimal savings model becomes

v(w)=max0cw[(1β)c11/ψ+β(v(R(wc)+y)1γϕ( ⁣dy))11/ψ1γ]111/ψv(w) = \max_{0 \leq c \leq w} \left[ (1-\beta) c^{1-1/\psi} + \beta \left( \int v(R(w-c) + y)^{1-\gamma} \phi(\diff y) \right)^{\frac{1-1/\psi}{1-\gamma}} \right]^{\frac{1}{1-1/\psi}}

where ψ>0\psi > 0 is the EIS and γ>0\gamma > 0 is the coefficient of relative risk aversion. The inner expectation applies risk adjustment to future value via the Kreps--Porteus expectation, which we met earlier in Section 1.1.3.4. The outer CES aggregator governs intertemporal substitution. The policy operator (1.83) now becomes

(Tσv)(w)=[(1β)σ(w)11/ψ+β(v(R(wσ(w))+y)1γϕ( ⁣dy))11/ψ1γ]111/ψ(T_\sigma v)(w) = \left[ (1-\beta) \sigma(w)^{1-1/\psi} + \beta \left( \int v(R(w-\sigma(w)) + y)^{1-\gamma} \phi(\diff y) \right)^{\frac{1-1/\psi}{1-\gamma}} \right]^{\frac{1}{1-1/\psi}}

for any feasible policy σ\sigma. Figure 1.16 shows two arbitrarily chosen policies and their lifetime values under Epstein--Zin preferences, using the same income process as in Figure 1.13. The σ\sigma-value functions are now computed by iterating on our new version of TσT_\sigma. Parameters are R=1.04R=1.04, β=0.95\beta=0.95, γ=5\gamma=5 (risk aversion), and ψ=1.5\psi=1.5 (EIS).

Policies and lifetime values under Epstein--Zin preferences

Figure 1.16:Policies and lifetime values under Epstein--Zin preferences

Figure 1.17 shows the optimal policy and value function under Epstein--Zin preferences, computed via OPI. Compared to the standard expected utility case in Figure 1.14, the optimal consumption policy is qualitatively similar---increasing and concave in wealth---but the value function differs in interpretation and scale. With γ>1/ψ\gamma > 1/\psi, the agent exhibits preference for early resolution of uncertainty, which affects how future risk is valued.

Optimal policy and value function under Epstein--Zin preferences

Figure 1.17:Optimal policy and value function under Epstein--Zin preferences

Figure 1.18 explores how risk aversion affects optimal consumption. The figure shows optimal policies for γ{1.25,5,20}\gamma \in \{1.25, 5, 20\}, holding the EIS fixed at ψ=1.5\psi = 1.5. Higher risk aversion leads to more precautionary saving: at each wealth level, the agent consumes less and saves more as γ\gamma increases. Values of γ\gamma around 10--20 are common in the long-run risk literature Bansal & Yaron, 2004, where high risk aversion is needed to match observed asset pricing moments.

Optimal consumption by risk aversion \gamma

Figure 1.18:Optimal consumption by risk aversion γ\gamma

1.3.3.3Optimization Theory

While the preceding analysis illustrates the potential usefulness of Epstein--Zin preferences, it puts us on shaky ground technically. For example, optimality properties of the ordinary optimal savings model in Section 1.3.2.2 depend on contractivity of the Bellman operator. (For a sense of why, read the proof of Theorem 1.1.1.)

The Bellman operator associated with the Epstein--Zin Bellman equation is not a contraction under the supremum distance for the most quantitatively significant parameterizations, and the same is true for the policy operators. This means that, in order to handle both the standard and the Epstein--Zin variations of the savings problem, we require a more general theory of dynamic programming that can handle both contractive and non-contractive settings. We begin constructing appropriate tools in Chapter 2.

1.4Sequential Analysis

This section presents a Bayesian formulation of a statistical decision problem described by Bertsekas (1976). Unlike the previous examples, there is no discounting, so the Bellman operator is not a contraction. Nonetheless, the same conceptual framework applies: the optimal loss function solves a Bellman equation and optimal policies have a threshold structure. In subsequent chapters, we will build a theory of dynamic programming that can handle this no-discounting case. For now, our objective is to motivate the theory by exploring the application through guess-work and simulation.[3]

1.4.1Introduction

We now consider a Bayesian formulation of the sequential testing problem originally studied by Milton Friedman, Allen Wallis, and Abraham Wald Wald, 1947Arrow et al., 1949. The following is an account of how the problem was conceived and came to the attention of Wald. The account is by Milton Friedman, one of the giants of 20th Century economics, and relates to his work during World War II as an analyst at the U.S. Government’s Statistical Research Group at Columbia University.

In order to understand the story, it is necessary to have an idea of a simple statistical problem, and of the standard procedure for dealing with it. The actual problem out of which sequential analysis grew will serve. The Navy has two alternative designs (say A and B) for a projectile. It wants to determine which is superior. To do so it undertakes a series of paired firings. On each round, it assigns the value 1 or 0 to A accordingly as its performance is superior or inferior to that of B and conversely 0 or 1 to B. The Navy asks the statistician how to conduct the test and how to analyze the results.

The standard statistical answer was to specify a number of firings and a pair of percentages (e.g., 53% and 47%) and tell the client that if A receives a 1 in more than 53% of the firings, it can be regarded as superior; if it receives a 1 in fewer than 47%, B can be regarded as superior; if the percentage is between 47% and 53%, neither can be so regarded.

When Allen Wallis was discussing such a problem with (Navy) Captain Garret L. Schuyler, the captain objected that such a test, to quote from Allen’s account, may prove wasteful. If a wise and seasoned ordnance officer like Schuyler were on the premises, he would see after the first few thousand or even few hundred [rounds] that the experiment need not be completed either because the new method is obviously inferior or because it is obviously superior beyond what was hoped for.

Friedman and Wallis worked on the problem for a while but didn’t completely solve it. Realizing that, they told Wald about the problem. That set Wald on a path that led him to create sequential analysis Wald, 1947. While the story above relates to wartime activity, sequential analysis has many significant applications in economics, finance, operations research, and other fields. Examples include determining the number of clinical trials before bringing a drug to market, real-time fraud detection, algorithmic trading, supply chain monitoring, and experimental interface design by social media companies.

On a technical level, this problem differs from the other problems we have investigated so far in that it involves no discounting. As a result, the Bellman operator is not necessarily a contraction. Nonetheless, we will find ways to prove that the core concepts from dynamic programming theory still apply.

The setting is as follows: A decision-maker observes a sequence of iid draws Z1,Z2,Z3,Z_1, Z_2, Z_3, \ldots from an unknown distribution ff. The distribution ff is either f0f_0 or f1f_1, where both f0f_0 and f1f_1 are known probability densities. After observing each draw, the decision-maker must choose one of three actions:

  1. Accept the hypothesis that f=f0f = f_0 and stop.

  2. Accept the hypothesis that f=f1f = f_1 and stop.

  3. Draw another observation at cost c>0c > 0.

The decision-maker incurs a loss whenever she makes an incorrect decision. The losses are as follows:

Both L0L_0 and L1L_1 are strictly positive. The objective is to minimize the expected loss, which includes both the cost of sampling and the potential loss from incorrect terminal decisions.

The decision-maker begins with a prior belief π0(0,1)\pi_0 \in (0,1) that f=f1f = f_1. The state variable is the posterior belief πn\pi_n, which represents the probability that f=f1f = f_1 given observations 1,,n1, \ldots, n. After observing ZnZ_n, the posterior is updated via Bayes’ rule:

πn+1=κ(πn,Zn+1),whereκ(π,z)πf1(z)(1π)f0(z)+πf1(z).\pi_{n+1} = \kappa(\pi_n, Z_{n+1}), \quad \text{where} \quad \kappa(\pi, z) \coloneq \frac{\pi f_1(z)}{(1-\pi) f_0(z) + \pi f_1(z)}.

Notice that (πn)n0(\pi_n)_{n \geq 0} is Markovian over the sampling process.

Given current belief π\pi, the next draw from the (Zn)nN(Z_n)_{n \in \NN} sequence has the predicted distribution

ψ(π,z)(1π)f0(z)+πf1(z).\psi(\pi, z) \coloneq (1-\pi)f_0(z) + \pi f_1(z).

The controller uses this distribution to take expectations over next-sample draws from the (Zn)n1(Z_n)_{n \geq 1} process.

1.4.2Optimality

For this sequential sampling problem, the Bellman equation for minimizing loss has the form

g(π)=min{πL0,  (1π)L1,  c+g(κ(π,z))ψ(π,z) ⁣dz}.g(\pi) = \min \left\{ \pi L_0, \; (1-\pi) L_1, \; c + \int g(\kappa(\pi, z)) \psi(\pi, z) \diff z \right\}.

The Bellman equation can be understood as follows: The value g(π)g(\pi) represents the minimum expected loss given current belief state π\pi. This value is itself the minimum over three terms, each of which corresponds to a choice. The first term is associated with accepting f0f_0 and has expected loss πL0\pi L_0, since π\pi is the (subjective) probability that f=f1f = f_1. The second term is for accepting f1f_1. This has expected loss (1π)L1(1-\pi) L_1, since 1π1-\pi is the probability that f=f0f = f_0. The last term is the expected loss associated with continuing to the next sample and then behaving optimally.

We now state an optimality result that parallels our earlier theorems. Let X=(0,1)\Xsf = (0,1) be the state space for the belief state π\pi and let bX+b\Xsf_+ denote the set of bounded, Borel measurable functions from X\Xsf to R+\RR_+. The action space is A={0,1,2}\Asf = \{0, 1, 2\}, where action 0 represents accepting f0f_0, action 1 represents accepting f1f_1, and action 2 represents continuing to sample. The set of all feasible policies, denoted by Σ\Sigma, is all Borel measurable σ ⁣:X{0,1,2}\sigma \colon \Xsf \to \{0, 1, 2\}.

Distributions and sample paths for f_0 and f_1

Figure 1.19:Distributions and sample paths for f0f_0 and f1f_1

We prove this theorem via a more general result in Theorem 3.2.9. For now, to illustrate the key ideas, we consider a specific example where f0=Beta(3,4)f_0 = \text{Beta}(3, 4) and f1=Beta(4,3)f_1 = \text{Beta}(4, 3), as shown in Figure 1.19. The figure also shows iid sample paths generated by the densities f0f_0 and f1f_1. The remaining parameters are set to L0=25L_0 = 25, L1=25L_1 = 25, and c=0.5c = 0.5.

Optimal policy and loss function

Figure 1.20:Optimal policy and loss function

Figure 1.20 shows the optimal policy and the corresponding loss function g\gmin. The functions were computed by a version of value function iteration, starting from initial condition g00g_0 \equiv 0. The state space X\Xsf was discretized into a grid of 200 points, and the integral over future observations was approximated using a grid of 50 points over the support of the distributions. The left panel displays the optimal action as a function of the posterior belief π\pi. As predicted by Theorem 1.4.1, the optimal policy has a threshold structure: there exist cutoffs t0,t1[0,1]t_0, t_1 \in [0,1] with t0t1t_0 \leq t_1 such that

Figure 1.21 shows dynamics of the belief state under the optimal policy. The belief state πn\pi_n shifts according to the update rule πn+1=κ(πn,Zn+1)\pi_{n+1} = \kappa(\pi_n, Z_{n+1}), with the samples (Zn)nN(Z_n)_{n \in \NN} being drawn from either f0f_0 or f1f_1. When the true distribution is f0f_0, the belief πn\pi_n tends to drift downward toward zero; when it is f1f_1, the belief drifts upward toward one. Under the optimal policy, sampling terminates once the belief exits the continuation region (t0,t1)(t_0, t_1), at which point the corresponding hypothesis is accepted.

Belief paths under the optimal policy

Figure 1.21:Belief paths under the optimal policy

1.5Summary

The examples in this chapter illustrate the breadth of problems that dynamic programming can address, but they also expose the limits of classical methods. The firm problem and finite MDPs with constant discounting yield contracting Bellman operators, making optimality theory and computation straightforward. However, several of the extensions and models we encountered require a more general foundation: Epstein--Zin preferences and the sequential analysis problem produce Bellman operators that are not contractions; risk-sensitive and robust formulations involve nonlinear aggregators; and distributional dynamic programming operates on a non-standard value space ordered by stochastic dominance. The abstract theory developed in the next chapter provides a unified framework that accommodates all of these settings.

1.6Chapter Notes

Richard Bellman’s ((1957)) monograph established dynamic programming as a unified framework for sequential optimization, introducing optimality concepts and recursive functional equations that form the foundations of this text. David Blackwell made major contributions to the mathematical theory, proving contraction properties for discounted problems with finite state spaces Blackwell, 1962 and extending these results to Borel spaces using order-preserving operators Blackwell, 1965. Eric Denardo ((1967)) further generalized contraction results to a broad class of sequential decision problems, introducing conditions that anticipate many ideas in this text.

The LP formulation for MDPs discussed in Section 1.2.1.4 has a long history. LP methods are particularly useful for constrained MDPs, where the controller faces additional restrictions on expected rewards or resource usage. Such problems arise in network routing, healthcare resource allocation, and other applications. See Altman (1999) for a textbook treatment. The LP formulation is also central to average-reward problems, where occupation measures play a key role (see, e.g., Chapter 8 of Puterman (2005)). For large state-action spaces, approximate LP methods using basis function representations can help scale the approach; Farias & Van Roy (2003) provides a foundational treatment.

The firm problem we studied in Section 1.1 is closely related to classic references such as Jovanovic (1982) and Hopenhayn & Prescott (1992), and has been extended by many authors (see, e.g., Alessandria et al. (2021) or Sterk et al. (2021)). Regarding the extensions to the firm problem in Section 1.1.3, excellent discussions of Markov decision processes with risk-sensitive objectives can be found in Bäuerle & Jaśkiewicz (2024) and Bäuerle & Jaśkiewicz (2025). We borrowed from their exposition in several parts of the chapter and return to the key ideas later in the text. The variational formula connecting risk-sensitivity to robustness is developed in Anantharam & Borkar (2017); see also Chapter 8 of Sargent & Stachurski (2025) for further discussion.

The discussion in Section 1.1.3.5 mentions dynamic (time) inconsistency. For analysis of time inconsistency in macroeconomic models and its connections to dynamic programming, see Sargent (2024), Sargent & Yang (2025), and Sargent & Yang (2025). For recent theoretical work, see Stanca (2025), Strack & Taubinsky (2026), and Bayraktar et al. (2023) on the stability of equilibria in time-inconsistent stopping.

The finite MDP framework in Section 1.2 is treated comprehensively in Puterman (2005); see also Chapter 5 of Sargent & Stachurski (2025) for an introductory treatment. The three core algorithms we presented --- VFI, HPI, and OPI --- are discussed in these sources. Howard policy iteration was introduced in Howard (1960). The cash management application in Section 1.2.1.5 builds on the inventory-theoretic models of money demand developed by Baumol (1952) and Tobin (1956). Our continuous time MDP model in Section 1.2.2 follows the framework described in Guo & Hernández-Lerma (2009); the uniformization technique we used to reduce continuous time problems to discrete time ones is standard (see, e.g., Puterman (2005), Chapter 11).

The optimal savings problem in Section 1.3 is also called the income fluctuation problem. It was studied in an early and influential form by Brock & Mirman (1972), who analyzed optimal growth under uncertainty with discounted CRRA utility. It has become a core building block for heterogeneous agent models following Bewley (1986), Huggett (1993), and Aiyagari (1994). Recent analysis can be found in Carroll & Shanker (2026), Li & Stachurski (2014), Lehrer & Light (2018), Light (2018), Ma et al. (2020), and Ma & Toda (2021). For a continuous-time treatment, see Achdou et al. (2022). See Stokey & Lucas (1989) and Stachurski (2022) for textbook treatments of the underlying dynamic programming theory.

The Epstein--Zin preferences discussed in Section 1.3.3 were introduced by Epstein & Zin (1989) and Weil (1990), building on earlier work of Kreps & Porteus (1978). The separation of risk aversion from the elasticity of intertemporal substitution enabled by Epstein--Zin utility has been central to the long-run risk literature initiated by Bansal & Yaron (2004), where small persistent consumption shocks get heavily priced, generating realistic equity premia using “reasonable” parameters. The equity premium puzzle was posed by Mehra & Prescott (1985); the associated risk-free rate puzzle was posed by Weil (1989). Both puzzles are discussed extensively in Chapter 13 of Ljungqvist & Sargent (2018). Chapter 7 of Sargent & Stachurski (2025) provides an introduction to recursive preferences.

The discussion of ambiguity in Section 1.2.3.3 connects to a large literature on robust decision-making under model uncertainty. The minimax formulation we presented follows the approach of Wald (1950); see also Ellsberg (1961) for a foundational discussion of ambiguity aversion and Hansen & Sargent (2011) for connections to robust control. Recent work on dynamic programming under ambiguity includes Maccheroni et al. (2006), Klibanoff et al. (2009), Marinacci & Montrucchio (2019), Neufeld et al. (2023), Cerreia-Vioglio et al. (2026), Benyamine et al. (2026), and Wang & Si (2026). An excellent survey on ambiguity and its implications for economics and finance can be found in Ilut & Schneider (2023).

The sequential analysis problem in Section 1.4 originated with Wald (1947) and Arrow et al. (1949). The Bayesian formulation we presented follows Bertsekas (1976). For treatments from frequentist and Bayesian perspectives, respectively, see Sargent & Stachurski (2026) and Sargent & Stachurski (2026).

The introduction to this chapter mentioned applications of dynamic programming to atemporal problems, such as genome sequencing and the structure of production chains. For one discussion of the former see Gu et al. (2023); for the latter see, for example, Kikuchi et al. (2021). We mentioned also that many recent applications of dynamic programming are connected to machine learning and artificial intelligence. Introductions to the literature can be found in Bertsekas (2021) and Kochenderfer et al. (2022).

Footnotes
  1. The literature on “recursive contracts” in macroeconomics makes progress here by using a set of procedures that have been called “dynamic programming squared”. Ljungqvist & Sargent (2018) devote a suite of chapters to that topic.

  2. In engineering it is sometimes called a closed loop control to emphasize that the control must be a measurable function of an observed history and not depend on as yet unrealized random variables.

  3. In his formulation, Abraham Wald Wald (1947) proceeded as a frequentist statistician, using objects from Neyman-Pearson’s hypothesis testing theory. For descriptions of the problem from the distinct frequentist and Bayesian perspectives, see Sargent & Stachurski (2026) and Sargent & Stachurski (2026).

References
  1. Bellman, R. (1957). Dynamic programming. In Science. American Association for the Advancement of Science.
  2. Sargent, T. J., & Stachurski, J. (2025). Dynamic Programming: Finite States. Cambridge University Press.
  3. Modigliani, F., & Miller, M. H. (1958). The cost of capital, corporation finance and the theory of investment. The American Economic Review, 48(3), 261–297.
  4. Smith, C. W., & Stulz, R. M. (1985). The Determinants of Firms’ Hedging Policies. Journal of Financial and Quantitative Analysis, 20(4), 391–405.
  5. Graham, J. R., Harvey, C. R., & Puri, M. (2013). Managerial attitudes and corporate actions. Journal of Financial Economics, 109(1), 103–121. 10.1016/j.jfineco.2013.01.010
  6. Kerr, S. P., Kerr, W. R., & Dalton, M. (2019). Risk attitudes and personality traits of entrepreneurs and venture team members. Proceedings of the National Academy of Sciences, 116(36), 17712–17716. 10.1073/pnas.1908375116
  7. Almeida, H., Campello, M., de Castro, L. I., & Galvao Jr, A. F. (2024). A Quantile Model of Firm Investment [Techreport]. National Bureau of Economic Research.
  8. Bellemare, M. G., Dabney, W., & Rowland, M. (2023). Distributional reinforcement learning. MIT Press.
  9. Bäuerle, N., & Jaśkiewicz, A. (2024). Markov decision processes with risk-sensitive criteria: an overview. Mathematical Methods of Operations Research, 99(1), 141–178.
  10. Puterman, M. L. (2005). Markov decision processes: discrete stochastic dynamic programming. Wiley Interscience.
  11. Baumol, W. J. (1952). The Transactions Demand for Cash: An Inventory Theoretic Approach. The Quarterly Journal of Economics, 66(4), 545–556.
  12. Tobin, J. (1956). The Interest-Elasticity of Transactions Demand For Cash. The Review of Economics and Statistics, 38(3), 241–247.
  13. Engel, K.-J., & Nagel, R. (2006). A short course on operator semigroups. Springer Science & Business Media.
  14. Guo, X., & Hernández-Lerma, O. (2009). Continuous-time Markov decision processes. Springer.
  15. Wald, A. (1950). Statistical Decision Functions (p. ix + 179). John Wiley & Sons.