Prelude: Examples of Dynamic Programs - Dynamic Programming Volume II: General States

Dynamic programming is a recursive technique for solving optimization problems. While initially developed for intertemporal problems (inventory management, investment planning, optimal savings and consumption, etc.), it has since been applied to various atemporal problems, ranging from genome sequencing and matrix multiplication to the structure of production chains. It is also used routinely in machine learning and artificial intelligence.

The sheer breadth of applications currently being tackled with dynamic programming is a challenge for presenting a modern theory. Beyond the vast range of concrete problems faced in applied settings, researchers are adding features to their models that require extensions of the foundations of dynamic programming. Such features include time-varying discount rates, nonlinear discounting, risk-sensitive control, ambiguity aversion, and nonlinear time aggregation. Researchers have added these features in order to build models that come closer to the data sets of interest.

To set the scene, we begin with some problems that can be handled using classic methods. Then we discuss extensions and transformations that require more sophisticated theoretical foundations. Since our purpose here is to set the stage for the abstract theory that is the main focus of this Volume, our presentation here is mathematically informal at many points. Results stated in this chapter are special cases of the abstract theory, which begins in Chapter 2.

1.1 A Firm Problem

This section uses a firm valuation and control problem to introduce core dynamic programming concepts. We begin with traditional methods, showing how recursive representations lead to the Bellman equation and optimal policies. We then discuss extensions involving unbounded rewards, time-varying discount rates, and risk-sensitive preferences.

1.1.1 Models of a Firm

We first consider firm valuation under a Markov profit process, deriving a recursive representation for expected present value. Next, we add a control by giving the manager an option to sell the firm at a time of their choosing. A Bellman-type optimality result is then stated and proved. This material establishes a template for dynamic programs that recurs throughout this chapter.

1.1.1.1 Valuation

A firm generates a random flow of profits $(\pi_t)_{t \geq 0} = \pi_0, \pi_1, \pi_2, \ldots$ . At time $t$ , a manager wants to calculate the present value of the profit process. Current profits $\pi_t$ are known but future values $\pi_{t+1}, \pi_{t+2}, \ldots$ are not. The manager knows the distribution of the process $(\pi_t)_{t \geq 0}$ . Consequently the manager can compute the expected present value, namely

V_t \coloneq \EE_t \left[ \pi_t + \beta \pi_{t+1} + \beta^2 \pi_{t+2} + \cdots \right].

Here $\beta$ is a discount factor, often reparameterized as $\beta = 1/(1+r)$, where $r$ is a discount rate. Thus, $\beta^m \pi_{t+m}$ is time $t+m$ profits discounted to the present date $t$. The symbol $\EE_t$ denotes mathematical expectation conditional on time $t$ information.

We can switch to a recursive representation of the sequence of valuations $(V_t)_{t \geq 0}$ by first writing

V_t = \pi_t + \beta \EE_t \left[ \pi_{t+1} + \beta \pi_{t+2} + \beta^2 \pi_{t+3} + \cdots \right]

and then applying the law of iterated expectations $\EE_t = \EE_t \EE_{t+1}$ . This leads to

V_t = \pi_t + \beta \EE_t V_{t+1}.

(1.1)

This expression will be important for us because the theory of dynamic programming is built around recursive relationships. Later we will see many variations of (1.1).

To make further progress in computing $(V_t)_{t \geq 0}$ , let’s assume that $\pi_t = \pi(X_t)$ , where $(X_t)_{t \geq 0}$ is a discrete time Markov process taking values in a measurable space $(\Xsf, \bB)$ . We assume that $(X_t)_{t \geq 0}$ is driven by a stochastic kernel $P$ (see Section A.5.4.1), so that $P(X_t, B)$ is the probability that $X_{t+1} \in B \in \bB$ given current state $X_t$ . The function $\pi$ is assumed to be in $b\Xsf$ , the set of bounded, $\bB$ -measurable functions from $\Xsf$ to $\RR$ . We understand $X_t$ as representing the “state of the world”, including all factors affecting firm profits. Figure 1.1 shows multiple realizations of the profit process $(\pi_t)_{t \geq 0}$ in the case where $\Xsf = \RR$ , $\bB$ is the Borel sets, $\pi(x) = \exp(x)$ , and $(X_t)$ is a discretization of an AR(1) process with $\rho = 0.9$ and $\nu = 0.2$ .

Under the Markov profits assumption, $V_t$ will depend on the current state $X_t$ , since knowing the state helps predict future profits. At the same time, the Markov assumption means that earlier values $X_0, \ldots, X_{t-1}$ will not aid prediction once $X_t$ is known. This leads us to conjecture that $V_t = v(X_t)$ for some fixed function $v \colon \Xsf \to \RR$ . Inserting this conjecture into (1.1) and evaluating at $X_t = x$ yields

v(x) = \pi(x) + \beta \int v(x') P(x, \diff x').

(1.2)

Defining $(Pv)(x) \coloneq \int v(x') P(x, \diff x')$ , which is consistent with the operator-theoretic notation in Section A.5.4, we can rewrite (1.2) as $v = \pi + \beta P v$ . Using the fact that $\pi \in b\Xsf$ , and that the spectral radius of $\beta P$ is $\beta$ (Lemma A.5.30), this equation has a unique solution for $v$ in $b\Xsf$ whenever $\beta < 1$ . The Neumann series lemma (Theorem A.4.10) tells us that the solution has the form

v = (I - \beta P)^{-1} \pi.

(1.3)

Figure 1.2: Value of the firm for different discount factors

Figure 1.2 plots $v = (I - \beta P)^{-1} \pi$ over the state space $\Xsf$ for several choices of the discount factor $\beta$ . The environment is the same as the one underlying Figure 1.1. As $\beta$ increases, the firm places greater weight on future profits, resulting in higher valuations across all states.

Notice how the difficult problem of computing a stochastic process $(V_t)_{t \geq 0}$ has been converted into the much easier problem of calculating $v = (I - \beta P)^{-1} \pi$ .
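The computation in (1.3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind Figures 1.1 and 1.2: the helper `tauchen` is a hypothetical implementation of one standard discretization scheme (the text does not say which scheme was used), and the grid size and the list of discount factors are arbitrary choices. Only $\rho = 0.9$, $\nu = 0.2$ and $\pi(x) = \exp(x)$ come from the text.

```python
from math import erf, sqrt

import numpy as np


def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def tauchen(n, rho, nu, m=3.0):
    """Discretize X' = rho * X + nu * W, W ~ N(0, 1), onto n grid points.

    Hypothetical helper: a Tauchen-style scheme chosen for illustration.
    """
    std_x = nu / sqrt(1.0 - rho**2)          # stationary standard deviation
    x = np.linspace(-m * std_x, m * std_x, n)
    half = (x[1] - x[0]) / 2.0
    P = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            lo = norm_cdf((x[j] - half - rho * x[i]) / nu)
            hi = norm_cdf((x[j] + half - rho * x[i]) / nu)
            if j == 0:
                P[i, j] = hi                 # all mass below the first midpoint
            elif j == n - 1:
                P[i, j] = 1.0 - lo           # all mass above the last midpoint
            else:
                P[i, j] = hi - lo
    return x, P


def firm_value(P, profits, beta):
    """Solve v = profits + beta * P v, i.e. v = (I - beta P)^{-1} profits."""
    n = len(profits)
    return np.linalg.solve(np.eye(n) - beta * P, profits)


x, P = tauchen(n=25, rho=0.9, nu=0.2)
profits = np.exp(x)                           # pi(x) = exp(x) on the grid
values = {beta: firm_value(P, profits, beta) for beta in (0.90, 0.95, 0.96)}

# Cross-check against a truncated Neumann series sum_k beta^k P^k pi
beta = 0.96
neumann, term = np.zeros_like(profits), profits.copy()
for _ in range(2_000):
    neumann += term
    term = beta * P @ term
```

On a finite grid, every function is bounded, so the boundedness requirement on $\pi$ holds automatically. The linear solve avoids forming the inverse explicitly, which is usually preferable numerically.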

1.1.1.2 Control

So far our manager has had no choice to make. Let’s now give the manager the option to sell the firm to an outside buyer at the start of each period (before receiving current profits) for the fixed price $s$. We will describe the manager’s decision with a binary policy function $\sigma$ that selects between selling the firm in the current period, after observing the current state $x$, and not selling, in which case the manager again faces the option to sell the firm at the beginning of the next period. The same policy $\sigma$ is applied in every period, with $\sigma(x) \in \{0,1\}$ being the current decision. Modifying (1.2) appropriately to a $\sigma$-dependent value function, we obtain the recursion

v_\sigma(x) = \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta \int v_\sigma(x') P(x, \diff x') \right],

(1.4)

where $v_\sigma(x)$ is the total value of the firm under the policy $\sigma$ , conditional on state $x$ . If $\sigma(x) = 1$ , then the manager sells at payoff $s$ and the process ends, so $s$ is the total value received. Otherwise $\sigma(x) = 0$ , indicating no sale, profit $\pi(x)$ is received, and the process continues, with discounted expected payoff $\beta \int v_\sigma(x') P(x, \diff x')$ . This is the second term in (1.4). In what follows we call $v_\sigma$ the $\sigma$ -value function.

Let $\Sigma$ be the set of all policies, defined as all $\bB$ -measurable functions mapping $\Xsf$ to $\{0,1\}$ . For each $\sigma \in \Sigma$ , let $v_\sigma$ be the function defined recursively in (1.4). We can use the Neumann series lemma to show that $v_\sigma$ is uniquely defined in $b\Xsf$ . Alternatively, we can introduce the operator $T_\sigma \colon b\Xsf \to b\Xsf$ via

(T_\sigma \, v)(x) = \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right] \qquad (x \in \Xsf).

(1.5)

We call $T_\sigma$ the policy operator defined by $\sigma$ . In operator form it can be expressed as

T_\sigma \, v = \sigma s + (1-\sigma)(\pi + \beta Pv).

Evidently, $v_\sigma$ solves (1.4) if and only if $v_\sigma$ is a fixed point of $T_\sigma$ .

One easily shows that $T_\sigma$ is contracting (see Section A.2.2.2 for the definition) on the complete space $(b\Xsf, \| \cdot\|)$ , where $\| \cdot \|$ is the supremum norm, defined by $\| f \| = \sup_{x \in \Xsf} |f(x)|$ for all $f \in b\Xsf$ . Indeed, for arbitrary $v, w \in b\Xsf$ ,

|T_\sigma \, v - T_\sigma \, w| \leq (1-\sigma) \beta | Pv - Pw| \leq \beta | Pv - Pw| \leq \beta P | v - w|,

where the absolute value $| \cdot |$ is applied pointwise (and the last inequality is by the triangle inequality for integrals — or, if you prefer, by the inequality for positive operators in Exercise A.5.17). From the last expression we get

|T_\sigma \, v - T_\sigma \, w| \leq \beta P \| v - w \| \leq \beta \| v - w \|.

Taking the supremum over the left-hand side, we find that $T_\sigma$ is a contraction of modulus $\beta$ on $b\Xsf$ . Hence, the $\sigma$ -value function $v_\sigma$ is the unique fixed point of $T_\sigma$ in the candidate space $b\Xsf$ .
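The policy operator and the contraction bound just derived can be checked numerically on a small example. The sketch below is illustrative only: the three-state kernel `P`, the profit vector, the sale price `s`, and the policy `sigma` are all hypothetical values, not taken from the text.

```python
import numpy as np

# Hypothetical three-state example; none of these numbers come from the text.
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
profits = np.array([0.5, 1.0, 2.0])
beta, s = 0.96, 20.0


def T_sigma(v, sigma):
    """Policy operator: sigma * s + (1 - sigma) * (pi + beta P v)."""
    return sigma * s + (1 - sigma) * (profits + beta * P @ v)


def policy_value(sigma, tol=1e-10, max_iter=10_000):
    """Approximate the fixed point v_sigma by successive approximation."""
    v = np.zeros(len(profits))
    for _ in range(max_iter):
        v_new = T_sigma(v, sigma)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v


sigma = np.array([1, 0, 0])        # sell only in the lowest state
v_sigma = policy_value(sigma)

# Contraction check on two arbitrary functions v and w
rng = np.random.default_rng(0)
v, w = rng.normal(size=3), rng.normal(size=3)
lhs = np.max(np.abs(T_sigma(v, sigma) - T_sigma(w, sigma)))
rhs = beta * np.max(np.abs(v - w))
```

Here `lhs` is the sup-norm distance between the images and `rhs` is $\beta$ times the sup-norm distance between the inputs, so the contraction property corresponds to `lhs <= rhs`.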

Figure 1.3: Policies and their lifetime value functions

Figure 1.3 shows the value functions $v_\sigma$ and $v_\tau$ for two possible policies $\sigma$ and $\tau$ . The policies are shown on the left and the value functions are on the right. The environment is the same as Figure 1.2 with $\beta$ set to 0.96. The first policy $\sigma$ sells when the state is below 1.0 and continues otherwise. The second policy $\tau$ oscillates between selling and continuing. Each value function is computed as the fixed point of the corresponding policy operator in $b\Xsf$ . The horizontal dashed line indicates the sale price $s$ . In terms of value, policy $\tau$ is outperformed by $\sigma$ everywhere on the state space.

Now let’s consider optimality. A policy $\sigma$ is called optimal if $v_\tau(x) \leq v_\sigma(x)$ for all $\tau \in \Sigma$ and all $x \in \Xsf$ . The value function $\vmax$ , sometimes called the optimal value function, is defined by

\vmax(x) \coloneq \sup_{\sigma \in \Sigma} v_\sigma(x) \qquad (x \in \Xsf).

Evidently $\sigma$ is an optimal policy if and only if $v_\sigma = \vmax$ .

Solution to Exercise 1.1.2

Fix $\sigma \in \Sigma$ . Let $a = |s| + \| \pi \|$ and let $M = a/(1-\beta)$ . Let $[-M, M]$ be all $v \in b\Xsf$ such that $|v| \leq M$ . Repeated use of the triangle inequality shows that $|T_\sigma \, v| \leq |s| + |\pi| + \beta P |v|$ . Hence, for $v \in [-M, M]$ , we have $|T_\sigma \, v| \leq a + \beta M = M$ . In particular, $T_\sigma$ is a self-map on the closed set $[-M, M]$ . As a result, its fixed point $v_\sigma$ also lies in this set. From this fact we obtain $v_\sigma(x) \leq M$ for all $\sigma$ and all $x$ . As bounded above sets in $\RR$ have suprema, we conclude that $\vmax$ is well-defined (Theorem A.1.1).

The problem of finding an optimal policy appears nontrivial, since $\Sigma$ is a large set whenever $\Xsf$ is large. Indeed, each $\sigma \in \Sigma$ is just an indicator function of a measurable set, so the cardinality of $\Sigma$ is the same as that of $\bB$. However, it turns out that we can find, characterize, and compute optimal policies relatively easily, using the theory of dynamic programming. In particular, we can state the following result, which is proved below in Section 1.1.1.4.

Theorem 1.1.1

The value function $\vmax$ is the unique $v \in b\Xsf$ that solves the functional equation

v(x) = \max \left\{ s ,\; \pi(x) + \beta \int v(x') P(x, \diff x') \right\} \qquad (x \in \Xsf).

(1.6)

In addition, at least one optimal policy exists. Finally, a policy $\sigma \in \Sigma$ is optimal if and only if, for each $x \in \Xsf$ ,

\sigma(x) \in \argmax_{a \in \{0,1\}} \left\{ a s + (1 - a) \left[ \pi(x) + \beta \int \vmax(x') P(x, \diff x') \right] \right\}.

(1.7)

Theorem 1.1.1 is a simple consequence of Richard Bellman’s (1920–1984) beautiful theory of dynamic programming (Bellman, 1957). Equation (1.6) is called the Bellman equation.

The characterization in (1.7) has the following natural interpretation. The expression $\pi(x) + \beta \int \vmax(x') P(x, \diff x')$ is the payoff (expected present value) for choosing to continue, receiving current profits, and then behaving optimally (since we are valuing future states with $\vmax$ ). The best decision at $x$ is to continue if this is larger than $s$ , which is achieved by setting $a=0$ . Otherwise the manager should set $a=1$ and stop.

We shall repeatedly use a term related to Theorem 1.1.1. Specifically, given $v \in b\Xsf$, we will agree to say that $\sigma \in \Sigma$ is $v$-greedy whenever

\sigma(x) \in \argmax_{a \in \{0,1\}} \left\{ a s + (1 - a) \left[ \pi(x) + \beta \int v(x') P(x, \diff x') \right] \right\} \quad \text{for all } x \in \Xsf.

(1.8)

With this terminology, we can repeat the policy optimality characterization in Theorem 1.1.1 by saying that a policy is optimal if and only if it is $\vmax$ -greedy. This idea is a cornerstone of our theory.
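On a finite state space, a $v$-greedy policy as in (1.8) can be computed by comparing $s$ with the continuation value state by state. The sketch below assumes a hypothetical three-state model, and it breaks ties in favor of selling; the tie-breaking rule is an arbitrary choice, since the argmax in (1.8) may contain both actions.

```python
import numpy as np

# Hypothetical three-state example; none of these numbers come from the text.
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
profits = np.array([0.5, 1.0, 2.0])
beta, s = 0.96, 20.0


def greedy(v):
    """Return a v-greedy policy: 1 (sell) where s >= pi + beta P v, else 0."""
    continuation = profits + beta * P @ v
    return (s >= continuation).astype(int)   # ties resolved by selling


v = np.array([10.0, 30.0, 60.0])             # an arbitrary candidate value function
sigma = greedy(v)

# For a v-greedy sigma, the policy operator and the max in (1.8) agree at v
T_v = np.maximum(s, profits + beta * P @ v)
T_sigma_v = sigma * s + (1 - sigma) * (profits + beta * P @ v)
```

With these numbers the continuation values are roughly 13.94, 30.76 and 53.84, so the greedy policy sells only in the lowest state.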

1.1.1.3 The Bellman Operator

We noted above that a policy is optimal if and only if it is $\vmax$ -greedy. This provides a direct avenue for computing optimal policies: First calculate $\vmax$ and then take a $\vmax$ -greedy policy. A straightforward way to approximate $\vmax$ is to iterate on the Bellman operator, which, in the present setting, is the self-map $T$ on $b\Xsf$ defined at $v \in b\Xsf$ by

(T v)(x) = \max \left\{ s ,\; \pi(x) + \beta \int v(x') P(x, \diff x') \right\} \qquad (x \in \Xsf).

(1.9)

In operator-theoretic notation, we write $T$ as $Tv = s \vee (\pi + \beta Pv)$ . Evidently $v$ solves the Bellman equation if and only if $v$ is a fixed point of $T$ .

Note that $T$ is a contraction of modulus $\beta$ on $(b\Xsf, \| \cdot \|)$ . Indeed, fixing $v, w \in b\Xsf$ and applying the elementary bound

|\alpha \vee x - \alpha \vee y| \leq |x - y| \qquad (\alpha, x, y \in \RR),

we get

|T v - T w| = |s \vee (\pi + \beta Pv) - s \vee (\pi + \beta Pw)| \leq \beta | Pv - Pw|.

The rest of the argument is identical to the one for contractivity of $T_\sigma$ in Section 1.1.1.2.

From this and Theorem 1.1.1, we see that $T$ has a unique fixed point in $b\Xsf$ and that the fixed point is the value function $\vmax$ . Moreover, for any $v$ in $b\Xsf$ , we have $T^k v \to \vmax$ as $k \to \infty$ . In other words, fixed point iteration (also called successive approximation) allows us to approximate $\vmax$ arbitrarily well.

Figure 1.4: Optimal policy and value function

Figure 1.4 shows an approximate optimal policy and an approximation to the value function $\vmax$ . The latter was computed by iterating with the Bellman operator $T$ from initial condition $v_0 \equiv 0$ and monitoring for convergence (waiting for the step sizes $\| T^k v_0 - T^{k+1} v_0 \|$ to fall below some threshold). We then took the resulting function $v$ and computed a $v$ -greedy policy. We used the same parameters as in Figure 1.3. The optimal policy has a threshold structure: the manager sells the firm when the state is below a critical value and continues operating otherwise. The value function $\vmax$ dominates the sale price $s$ everywhere, reflecting the value of the option to wait and sell later.
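The procedure described for Figure 1.4 (iterate with $T$ from $v_0 \equiv 0$, stop when the step size is small, then take a greedy policy) can be sketched as follows. The environment here is a hypothetical one chosen to keep the code short, not the one used for the figure: a lazy random walk with reflecting barriers on a 15-point grid, with $\pi(x) = \exp(x)$, $\beta = 0.96$, and an assumed sale price `s = 20.0`.

```python
import numpy as np

# Hypothetical environment (not the one behind Figure 1.4).
n = 15
x = np.linspace(-1.5, 1.5, n)
P = np.zeros((n, n))
for i in range(n):
    P[i, max(i - 1, 0)] += 0.25              # step down (reflect at the bottom)
    P[i, i] += 0.5                           # stay put
    P[i, min(i + 1, n - 1)] += 0.25          # step up (reflect at the top)
profits = np.exp(x)
beta, s = 0.96, 20.0


def bellman(v):
    """Bellman operator: (T v)(x) = max{s, pi(x) + beta (P v)(x)}."""
    return np.maximum(s, profits + beta * P @ v)


def successive_approximation(tol=1e-8, max_iter=10_000):
    """Iterate T from v_0 = 0 until the sup-norm step size falls below tol."""
    v = np.zeros(n)
    for _ in range(max_iter):
        v_new = bellman(v)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return v_new


v_approx = successive_approximation()
# A v_approx-greedy policy (ties resolved by selling)
policy = (s >= profits + beta * P @ v_approx).astype(int)
```

Because profits are increasing in the state and this particular kernel preserves monotonicity, the computed policy is nonincreasing in the state, which is the threshold structure described above. The stopping rule controls the step size rather than the distance to $\vmax$; by the contraction property, the latter is at most $\beta/(1-\beta)$ times the former.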

1.1.1.4 Proving Theorem 1.1.1

In this book we will prove far more general results than Theorem 1.1.1. So providing an immediate proof of the theorem here is actually redundant and postpones our presentation of some sample applications. Nevertheless, we now sketch a short proof of Theorem 1.1.1, but alert readers that they can safely move on without digesting it now.

Proof of Theorem 1.1.1.

In Section 1.1.1.3 we showed that $T$ is a contraction mapping on the complete space $(b\Xsf, \| \cdot \|)$ . Hence $T$ is globally stable on $b\Xsf$ and therefore has a unique fixed point $\bar v \in b\Xsf$ . Our first claim is that $\bar v = \vmax$ . We show $\bar v \leq \vmax$ and then $\bar v \geq \vmax$ .

For the first inequality, let $\sigma \in \Sigma$ be $\bar v$ -greedy. Recalling Exercise 1.1.3, we have $T_\sigma\, \bar v = T \bar v = \bar v$ . Hence $\bar v$ is also a fixed point of $T_\sigma$ . But the only fixed point of $T_\sigma$ in $b\Xsf$ is $v_\sigma$ , so $\bar v = v_\sigma \leq \vmax$ . This is our first inequality. As for the second, fix an arbitrary $\sigma \in \Sigma$ and observe that $T_\sigma \, \bar v \leq T \bar v = \bar v$ . Since $T_\sigma$ is order-preserving (see Section A.1.2.7), this implies that $T^k_\sigma \, \bar v$ is decreasing and bounded above by $\bar v$ . Because $T_\sigma$ is a contraction with fixed point $v_\sigma$ , we can take the limit in $k$ to obtain $v_\sigma \leq \bar v$ . Taking the supremum over $\sigma \in \Sigma$ yields $\vmax \leq \bar v$ .

This argument shows that $\vmax$ is a fixed point of $T$ in $b\Xsf$ . Since $T$ is contracting on $b\Xsf$ , we have confirmed that $\vmax$ is the unique solution to the Bellman equation in $b\Xsf$ .

Turning to our characterization of greedy policies, it follows from Exercise 1.1.3 that

\sigma \text{ is } \vmax \text{-greedy} \quad \iff \quad T_\sigma \, \vmax = T \vmax \quad \iff \quad T_\sigma \, \vmax = \vmax.

The right hand side of this expression tells us that $\vmax$ is a fixed point of $T_\sigma$ . But the only fixed point of $T_\sigma$ is $v_\sigma$ , so the right hand side is equivalent to the statement $v_\sigma = \vmax$ . By this chain of logic and the definition of optimality, we see that

\sigma \text{ is } \vmax \text{-greedy} \iff \vmax = v_\sigma \iff \text{ } \sigma \text{ is optimal}.

Since greedy policies exist for every $v$ in $b\Xsf$ , this also proves existence of at least one optimal policy. ◻
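When the state space is small, the claims in Theorem 1.1.1 can be verified by brute force, since $\Sigma$ is then a finite set that can be enumerated. The sketch below does this for a hypothetical four-state model (all numbers are assumptions made for illustration): it computes $v_\sigma$ exactly for each of the 16 policies, takes the pointwise maximum, and compares the result with the Bellman equation and with the value of a greedy policy.

```python
import itertools

import numpy as np

# Hypothetical four-state example; none of these numbers come from the text.
P = np.array([[0.7, 0.3, 0.0, 0.0],
              [0.2, 0.6, 0.2, 0.0],
              [0.0, 0.2, 0.6, 0.2],
              [0.0, 0.0, 0.3, 0.7]])
profits = np.array([0.2, 0.8, 1.5, 3.0])
beta, s = 0.9, 10.0
n = len(profits)


def policy_value(sigma):
    """Solve v = sigma * s + (1 - sigma) * (pi + beta P v) exactly."""
    D = np.diag(1 - sigma)
    rhs = sigma * s + (1 - sigma) * profits
    return np.linalg.solve(np.eye(n) - beta * D @ P, rhs)


# Enumerate every policy and take the pointwise supremum of the values
policies = [np.array(p) for p in itertools.product([0, 1], repeat=n)]
values = np.array([policy_value(sigma) for sigma in policies])
v_max = values.max(axis=0)

# Right-hand side of the Bellman equation (1.6), evaluated at v_max
bellman_rhs = np.maximum(s, profits + beta * P @ v_max)

# A v_max-greedy policy and its lifetime value
greedy = (s >= profits + beta * P @ v_max).astype(int)
v_greedy = policy_value(greedy)
```

If the theorem applies, `v_max` should match `bellman_rhs` and `v_greedy` should match `v_max` up to floating point error. Enumeration is only feasible for tiny state spaces, which is one reason the fixed point characterization matters in practice.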

1.1.1.5 How About More General Policies?

Up to now we have focused on stationary Markov policies, which, in our language, are measurable maps from $\Xsf$ to $\{0,1\}$ . Restricting our attention to such policies prevents the manager from making decisions based on a longer history of states and actions, or from changing policies at some date. We have also ignored the possibility that the manager might wish to randomize actions, meaning that the choice is a probability of selling, rather than a fixed selection from $\{0,1\}$ .

It turns out that focusing exclusively on stationary Markov policies is appropriate in the current setting. We show in Section 3.1.4 that allowing nonstationary policy choices cannot lead to higher expected present value, and that this result holds far more generally. Moreover, randomization cannot improve outcomes in this setting. For a proof in a similar environment, see our discussion of mixed strategies in Section 9.2.1.6 of Sargent & Stachurski (2025).

1.1.2 Extensions

The theory stated so far is elegant and can be extended in some directions with relatively little effort. For example, we can ask the manager to control inventories, capital stocks, and the size and disposition of a labor force across tasks. The same fundamental ideas still govern optimality, and the same solution approaches still work.

But some extensions are more challenging and require moving beyond standard dynamic programs. We describe some of these challenges next.

1.1.2.1 Beyond Constant Discount Rates

One restrictive assumption in Section 1.1.1 is that the discount factor $\beta$ is constant. In fact discount rates vary considerably. For example, market interest rates fluctuate substantially, responding to changes in monetary policy, inflation expectations, and risk premia. Interest rates charged to risky borrowers can fluctuate even more widely than benchmark rates. For a firm weighing the decision to continue operating versus exiting, the cost of capital at which future profits are discounted will have a substantial impact on optimal decisions. In practice, firms routinely incorporate time-varying discount rates into their strategic planning.

Figure 1.5 illustrates the extent of this variation using US data from the Federal Reserve. The top panel shows the federal funds rate, which transmits to firm financing costs through the credit channel. The bottom panel plots the real interest rate, computed as the 10-year Treasury yield minus twelve-month CPI inflation. Both sets of rates affect the intertemporal tradeoffs that firms face: when rates are high, future profits are discounted more heavily, reducing the present value of long-lived investments and making exit or disinvestment more attractive. Conversely, low or negative rates reduce the cost of waiting and encourage firms to invest, expand capacity, or delay exit from declining markets. The relative importance of real vs nominal rates varies across firms. The swings visible in Figure 1.5 underscore the issues with assuming a fixed discount factor $\beta$ when studying intertemporal choices of firms.

Figure 1.5: US nominal and real interest rates.

To accommodate dependence of discount factors on firm-specific or macroeconomic variables, we can replace the constant $\beta$ with a state-dependent discount factor $\beta_t = b(X_t)$ for some suitable function $b$. Fortunately, this modification can be accommodated in the current setting. A discussion can be found in Section 4.2.1.3. But the associated theory can become more complicated when the controller’s actions affect state components that affect the discount factor. This happens, for example, in models of firms that face higher borrowing costs because they pursue high-risk strategies that involve occasionally running down their cash reserves. That makes the dynamic programming problem more challenging, particularly if we assume that interest rates are occasionally negative, since the discount factor then exceeds one in some states and this breaks contractivity properties of policy and Bellman operators. Optimality results for the finite state case can be found in Sargent & Stachurski (2025). We study more general cases in Chapter 6.
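For the pure valuation problem on a finite state space, one way to see what is at stake is to replace $\beta P$ in (1.2) with the matrix $L(x, x') = b(x) P(x, x')$ and ask whether its spectral radius is below one. The sketch below is an illustration under assumptions: the two-state kernel, the profits, and the discount vector `b` are hypothetical, and the timing convention (discounting next-period value by $b$ evaluated at the current state) is one possible reading of the specification above.

```python
import numpy as np

# Hypothetical two-state example; none of these numbers come from the text.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
profits = np.array([1.0, 2.0])
b = np.array([1.02, 0.90])         # the discount factor exceeds one in state 0

L = b[:, None] * P                 # L[x, x'] = b(x) * P[x, x']
spectral_radius = np.max(np.abs(np.linalg.eigvals(L)))

# The Neumann series for (I - L)^{-1} converges when the spectral radius is < 1
if spectral_radius < 1:
    v = np.linalg.solve(np.eye(len(profits)) - L, profits)
else:
    v = None
```

In this example the sup-norm contraction argument fails, because $b$ exceeds one in one state, yet the spectral radius of `L` is about 0.974 and the valuation is still well defined.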

1.1.2.2 Unbounded Rewards

Another, more technical, issue with our analysis in Section 1.1.1 is that $\pi$ is assumed to be bounded. This directly embeds the problem in the space of bounded functions $b\Xsf$ and pairs naturally with the supremum norm. The contraction and optimality proofs unfold smoothly when working in such an environment. But assuming that $\pi$ is bounded can be overly restrictive.

It turns out that we can drop the boundedness assumption without too much difficulty. One way is to assume instead that $\| \pi \|_\ell < \infty$ where $\ell \colon \Xsf \to [1, \infty)$ and $\| \cdot \|_\ell$ is the $\ell$ -weighted norm defined by $\| f \|_\ell \coloneq \sup_{x \in \Xsf} |f(x)|/\ell(x)$ . This approach is discussed in Section 7.2.2.2.

Another option is to embed the problem in the class of integrable functions $L_1(\psi) \coloneq L_1(\Xsf, \bB, \psi)$ for a suitable measure $\psi$ on $(\Xsf, \bB)$. (For background see Section A.4.2.4.) For example, suppose that $\psi$ is a stationary distribution (see Section A.5.4.4) of the stochastic kernel $P$ and that $\pi \in L_1(\psi)$. The policy operators then send $L_1(\psi)$ into itself. Indeed, if $\pi \in L_1(\psi)$ and we fix $v \in L_1(\psi)$, then $|T_\sigma \, v| \leq |s| + |\pi| + \beta |P v|$, so $T_\sigma \, v$ is in $L_1(\psi)$ when $Pv$ is $\psi$-integrable. This is true by stationarity of $\psi$ (see (i) of Lemma A.5.32 and the surrounding discussion). Moreover, $T_\sigma$ is a contraction map on $L_1(\psi)$, as can be seen by integrating both sides of the bound $|T_\sigma \, v - T_\sigma \, w| \leq \beta P | v - w|$ with respect to $\psi$ and using (A.33) to obtain $\int P | v - w| \diff \psi = \int |v - w| \diff \psi$. One can also show that the Bellman operator $T$ is a contraction with respect to the norm on $L_1(\psi)$ and then proceed to adapt the proof of Theorem 1.1.1.

Rather than provide all details here, we defer further discussion to Section 4.2.1.2.
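The two facts used in the $L_1(\psi)$ argument, namely $\int P|v - w| \diff \psi = \int |v - w| \diff \psi$ and the resulting contraction bound, can be checked numerically on a finite chain. A finite state space cannot exhibit unbounded rewards, so the sketch below only illustrates the identities; the kernel, profits, sale price, and policy are hypothetical.

```python
import numpy as np

# Hypothetical three-state example; none of these numbers come from the text.
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
profits = np.array([0.5, 1.0, 2.0])
beta, s = 0.96, 20.0
sigma = np.array([1, 0, 0])

# Stationary distribution by power iteration: psi P = psi
psi = np.full(3, 1 / 3)
for _ in range(1_000):
    psi = psi @ P


def l1_norm(f):
    """Norm of f in L_1(psi): the psi-weighted sum of |f|."""
    return np.sum(np.abs(f) * psi)


def T_sigma(v):
    """Policy operator for the fixed policy sigma."""
    return sigma * s + (1 - sigma) * (profits + beta * P @ v)


rng = np.random.default_rng(1)
v, w = rng.normal(size=3), rng.normal(size=3)

# Stationarity identity: integrating P|v - w| against psi gives |v - w|'s integral
lhs_identity = psi @ (P @ np.abs(v - w))
rhs_identity = psi @ np.abs(v - w)

# Contraction in the L_1(psi) norm
contraction_lhs = l1_norm(T_sigma(v) - T_sigma(w))
contraction_rhs = beta * l1_norm(v - w)
```

For this kernel the stationary distribution is $(0.25, 0.5, 0.25)$, which can be confirmed by hand from the detailed balance conditions.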

1.1.3 Beyond Risk Neutrality

A limitation of the preceding analysis is how it treats risk. We assumed that the manager wants to maximize the expected present value of cash flow generated by the firm across different strategies (policies). But what if the manager wants something else?

Canonical theories of firm behavior suggest that expected present value is the appropriate criterion. In complete and frictionless markets, a firm’s shareholders can hedge firm-specific risks by trading other securities, so the firm’s manager need not worry about them (Modigliani & Miller, 1958; Smith & Stulz, 1985). But maybe financial markets are incomplete and maybe shareholders lack enough information about risks or financial sophistication to develop optimal hedging strategies. Maybe managers’ incentives are not aligned with shareholders’ interests. Managers might be concerned about a firm’s survival and overweight downside risks relative to shareholders. Indeed, various studies offer evidence for such risk-averse decision-making within both small and large firms (see, e.g., Graham et al. (2013), Kerr et al. (2019), or Almeida et al. (2024)).

In this section we consider adjusting the manager’s problem to incorporate some of these considerations.

1.1.3.1 Distributions of Rewards

Let’s return to our jump-off point for optimization, where we decided to seek a policy that solves $\max_{\sigma \in \Sigma} v_\sigma(x)$ for all $x$ . We know that $v_\sigma(x)$ is the expected lifetime value of policy $\sigma$ given initial condition $x$ . As such, we can also write

v_\sigma(x) = \EE Z_\sigma,

(1.10)

where

Z_\sigma \coloneq \sum_{t=0}^{T(\sigma)-1} \beta^t \pi_t + \beta^{T(\sigma)} s, \quad \text{with} \quad T(\sigma) \coloneq \inf \, \setntn{t \geq 0}{\sigma(X_t)=1}.

(1.11)

Here $T(\sigma)$ is the date at which the firm is sold at price $s$ (with convention $\inf \varnothing = \infty$ , so that $T(\sigma)=\infty$ if the firm is never sold). Evidently $Z_\sigma$ is a random payoff, depending on the policy and the random path $(\pi_t)_{t \geq 0}$ . The random variable $Z_\sigma$ depends on the initial state $x$ but we have chosen to suppress this in the notation.

The left-hand subfigure in Figure 1.6 shows the distribution of $Z_{\sigopt}$, lifetime value under the optimal policy $\sigopt$, computed by fixing an initial condition $x$, simulating 100,000 profit paths from that state, and then computing the discounted payoff along each path. (Parameter values were the same as in Figure 1.3.) The mean of the distribution is $v_{\sigopt}(x)$, which is also $\vmax(x)$. The policy is regarded as optimal because the mean of this distribution is at least as large as the mean of $Z_{\sigma}$ under any other policy $\sigma$, and this holds for every initial condition $x$. The right-hand subfigure compares $Z_{\sigopt}$ with $Z_\sigma$, where $\sigma$ is the policy that never sells.

Figure 1.6: Distributions of the random payoff $Z_\sigma$ defined in (1.11).
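The simulation procedure can be sketched as follows: draw many paths of the state, accumulate discounted profits until the first date at which the policy sells, add the discounted sale price, and compare the sample mean with $v_\sigma(x)$ as in (1.10). This is not the code behind Figure 1.6. The three-state model, the policy, the number of paths, and the truncation horizon are all hypothetical choices; truncation at a finite horizon is an approximation that matters only for paths that have not sold by then.

```python
import numpy as np

# Hypothetical three-state example; none of these numbers come from the text.
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
profits = np.array([0.5, 1.0, 2.0])
beta, s = 0.9, 8.0
sigma = np.array([1, 0, 0])        # sell in the lowest state only
n = len(profits)


def simulate_payoffs(x0, num_paths=50_000, horizon=300, seed=0):
    """Draw num_paths realizations of Z_sigma in (1.11), truncated at horizon."""
    rng = np.random.default_rng(seed)
    cum_P = np.cumsum(P, axis=1)
    state = np.full(num_paths, x0)
    alive = np.ones(num_paths, dtype=bool)    # paths on which the firm is unsold
    payoff = np.zeros(num_paths)
    discount = 1.0                            # beta ** t
    for _ in range(horizon):
        selling = alive & (sigma[state] == 1)
        payoff[selling] += discount * s       # sale occurs before profits arrive
        alive &= ~selling
        payoff[alive] += discount * profits[state[alive]]
        # Inverse-CDF draw of the next state for every path
        u = rng.random(num_paths)
        state = np.minimum((u[:, None] >= cum_P[state]).sum(axis=1), n - 1)
        discount *= beta
    return payoff


# Exact v_sigma from the linear system implied by (1.4)
D = np.diag(1 - sigma)
v_sigma = np.linalg.solve(np.eye(n) - beta * D @ P,
                          sigma * s + (1 - sigma) * profits)

z = simulate_payoffs(x0=1)
```

The array `z` holds draws of $Z_\sigma$ started from the middle state, so a histogram of `z` plays the role of one panel in Figure 1.6, and `z.mean()` is a Monte Carlo estimate of `v_sigma[1]`.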

Let’s consider these distributions from the perspective of firm managers when markets are not frictionless, and information is not perfect. In this case, as discussed at the start of Section 1.1.3, managers care about more than just the mean of these distributions. While the mean will surely be of interest, these managers are likely to also care about factors such as variance, upside risk and downside risk. Let’s now consider how we might insert preferences over such factors into our model.

1.1.3.2 Distributional Dynamic Programming

Some researchers have begun to construct a theory of “distributional dynamic programming” where the core idea is to track the distribution of the payoff across policies and initial conditions. In our context, this means choosing $\sigma$ so that $Z_\sigma$ has an “optimal” distribution. For example, a manager might want a distribution with a relatively high mean and low downside risk. Bellemare et al. (2023) show that, for a relatively broad class of dynamic programming problems, a distributional version of the Bellman equation can be constructed, where the left- and right-hand sides of the Bellman equation are both distributions. We formalize this idea within the abstract dynamic programming framework in Section 2.3.4.

In practice, the theory of distributional dynamic programming is constrained by the fact that there is no natural extension of the idea of a greedy policy to the setting of distributional Bellman equations. As such, we focus on environments where agents are able to specify loss or reward functions over distributions. The next few sections investigate such cases, while still preserving concern for tail properties and additional moments beyond the mean.

1.1.3.3 Mean-Variance Analysis

Let $R$ be a random payoff that we want to evaluate. Mean-variance analysis proposes the criterion $\EE [R] - (\gamma/2) \var[R]$ , where $\gamma$ parameterizes risk-aversion. In the context of our manager’s problem, the mean-variance criterion tells us to solve

\max_{\sigma \in \Sigma} \; m_V(Z_\sigma) \quad \text{where} \quad m_V(Z_\sigma) \coloneq \left\{ \EE [Z_\sigma] - \frac{\gamma}{2} \var[Z_\sigma] \right\}.

(1.12)

Assuming that the initial condition is $x$ , the first term $\EE [Z_\sigma]$ is just $v_\sigma(x)$ . The second term is harder to calculate but its role is clear: it downweights policies that generate high variance payoffs, with the extent of downweighting depending on the size of $\gamma$ . More risk averse managers will use larger values of $\gamma$ and their preferred policies will deviate more from what we previously defined to be optimal—that is, the policy that solves $\max_{\sigma \in \Sigma} v_\sigma(x) = \max_{\sigma \in \Sigma} \EE Z_\sigma$ for all $x$ .

1.1.3.4 Alternatives to Mean-Variance

The mean-variance criterion is not the only way to formulate concerns about risk. An alternative formulation solves

\max_{\sigma \in \Sigma}\; e_\gamma(Z_\sigma) \quad \text{where} \quad e_\gamma(Z_\sigma) \coloneq - \frac{1}{\gamma} \ln \left[ \EE [\exp(- \gamma Z_\sigma)] \right]

Here $\EE$ again denotes the mathematical expectation with respect to the probability distribution of the random payoff, which the decision maker is assumed to know. The map $e_\gamma$ is called an entropic certainty equivalent. The parameter $\gamma$ parameterizes risk aversion. When $\gamma > 0$ , the decision maker values $Z_\sigma$ below $\EE[Z_\sigma]$ . We say that risk aversion is higher when $\gamma$ is larger. An introduction to these ideas is provided in Section 7.2.2 of Sargent & Stachurski (2025).

The entropic criterion is attractive for several reasons. One is that, with sufficiently many finite moments, a Taylor expansion produces

e_\gamma(X) \;=\; \kappa_1 \;-\; \frac{\gamma}{2}\,\kappa_2 \;+\; \frac{\gamma^2}{6}\,\kappa_3 \;-\; \frac{\gamma^3}{24}\,\kappa_4 \;+\; \cdots,

where $\kappa_n$ is the $n$ -th cumulant

\kappa_1 = \EE[X], \qquad \kappa_2 = \var[X], \qquad \kappa_3 = \EE\!\left[(X - \kappa_1)^3\right], \qquad \cdots

This tells us that, with positive $\gamma$ , the agent likes a high mean, dislikes variance, likes positive skewness (right tails), dislikes kurtosis (fat tails), etc. When higher moments are small we get

e_\gamma(Z_\sigma) \approx \EE [Z_\sigma] - \frac{\gamma}{2} \var[Z_\sigma],

which connects us back to mean-variance analysis. The approximation above becomes exact when $Z_\sigma$ is normally distributed.
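The link between the entropic and mean-variance criteria can be illustrated with simulated normal payoffs, for which $e_\gamma$ equals the mean minus $\gamma/2$ times the variance. In the sketch below the mean, standard deviation, risk parameter, and sample size are arbitrary illustrative values, and the function names are hypothetical.

```python
import numpy as np


def entropic_ce(z, gamma):
    """Sample analogue of -(1/gamma) * ln E[exp(-gamma Z)], via log-sum-exp."""
    a = -gamma * np.asarray(z, dtype=float)
    a_max = a.max()                            # subtract the max for stability
    log_mean_exp = a_max + np.log(np.mean(np.exp(a - a_max)))
    return -log_mean_exp / gamma


def mean_variance(z, gamma):
    """Sample analogue of E[Z] - (gamma / 2) * var[Z]."""
    return np.mean(z) - 0.5 * gamma * np.var(z)


rng = np.random.default_rng(0)
mu, sd, gamma = 1.0, 0.5, 1.0
z = rng.normal(mu, sd, size=200_000)

exact = mu - 0.5 * gamma * sd**2               # closed form for normal payoffs
ce_estimate = entropic_ce(z, gamma)
mv_estimate = mean_variance(z, gamma)
```

With these parameters the closed form is 0.875, and both sample quantities should be close to it. For payoffs with pronounced skewness or fat tails, the two criteria can differ noticeably because of the higher cumulant terms.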

In addition, the entropic criterion can be regarded as an indirect utility function that emerges from a setting in which the manager doubts their probability model for the payoff. In particular, let $P$ denote the manager’s baseline probability measure for the payoff $Z_\sigma$. Then it can be shown that

e_\gamma(Z_\sigma) = \min_Q\; \left\{ \EE_Q [Z_\sigma] + \frac{1}{\gamma} D_{KL}(Q \| P) \right\},

where the minimum is over probability measures $Q$ absolutely continuous with respect to $P$ and $D_{KL}(Q \| P) \coloneq \EE_Q [\ln(\diff Q / \diff P)]$ is a Kullback–Leibler statistical divergence for measuring the discrepancy between two probability distributions. Here the parameter $1/\gamma$ controls the size of a penalty that the minimizer pays for distorting $Q$ relative to baseline probability model $P$ ; larger $\gamma$ ’s allow the minimizing “player” who chooses $Q$ to range over a larger set of alternative models. Such an analysis connects entropic preferences to robust control and ambiguity aversion and will be discussed in Chapter 7.
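For a payoff with finitely many outcomes, the variational formula can be verified directly. The minimizing distribution is the exponentially tilted one, with $Q(z) \propto P(z) \exp(-\gamma z)$, which is a standard fact about relative entropy rather than something stated in the text. The outcomes, baseline probabilities, and $\gamma$ below are hypothetical, and `p` denotes the baseline distribution of the payoff, not the stochastic kernel.

```python
import numpy as np

# Hypothetical discrete payoff distribution.
z = np.array([-1.0, 0.5, 2.0, 4.0])            # payoff outcomes
p = np.array([0.1, 0.4, 0.3, 0.2])             # baseline probabilities
gamma = 0.7


def entropic_ce(z, p, gamma):
    """-(1/gamma) * ln E_P[exp(-gamma Z)]."""
    return -np.log(np.sum(p * np.exp(-gamma * z))) / gamma


def penalized_mean(q, z, p, gamma):
    """E_Q[Z] + (1/gamma) * KL(Q || P) for strictly positive q."""
    kl = np.sum(q * np.log(q / p))
    return np.sum(q * z) + kl / gamma


# Exponentially tilted distribution, the candidate minimizer
q_star = p * np.exp(-gamma * z)
q_star /= q_star.sum()

# Compare against many randomly drawn alternative distributions
rng = np.random.default_rng(0)
others = rng.dirichlet(np.ones(len(z)), size=1_000)
gap = min(penalized_mean(q, z, p, gamma) for q in others) - entropic_ce(z, p, gamma)
```

The tilted distribution shifts probability toward low payoffs, which is the sense in which the minimizing player selects a pessimistic model. The quantity `gap` is nonnegative when no randomly drawn alternative beats the entropic value.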

The entropic criterion $e_\gamma(Z_\sigma)$ is a special case of a more general objective

\Phi(Z_\sigma) \coloneq \phi^{-1} \left\{ \EE [\phi(Z_\sigma)] \right\}

where $\phi$ is a given function. Typically $\phi$ is concave, as is the case for $\phi(x) = -\exp(-\gamma x)$ when $\gamma > 0$, which recovers the entropic criterion. Another example is the Kreps–Porteus expectation, which is obtained by setting $\phi(x) = x^{1-\gamma}$.

A third option for inserting preferences over risk is to evaluate

\VaR_\alpha(Z_\sigma) \coloneq \inf \, \setntn{c \in \RR}{\PP\{Z_\sigma+c<0\}\leq\alpha},

where $\alpha \in [0,1]$ is a given constant. This objective is called value-at-risk and can be understood as the smallest cash injection $c$ such that the probability of a net loss is no more than $\alpha$ . Intuitively, if $Z_\sigma$ has more downside risk, then $\VaR_\alpha(Z_\sigma)$ increases, as more cash is needed to keep the loss probability below the threshold. Thus, a manager seeking a low-risk policy might look to minimize $\VaR_\alpha(Z_\sigma)$ , or, equivalently, to solve $\max_\sigma - \VaR_\alpha(Z_\sigma)$ .
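For a payoff with finitely many values, the infimum in the definition can be read off the cumulative distribution. The sketch below assumes $0 \leq \alpha < 1$. The payoff distribution is hypothetical and the function name is ours.

```python
import numpy as np

def value_at_risk(z, p, alpha):
    """VaR_alpha(Z) = inf{c : P(Z + c < 0) <= alpha} for discrete Z and 0 <= alpha < 1."""
    z, p = np.asarray(z, float), np.asarray(p, float)
    order = np.argsort(z)
    z, cdf = z[order], np.cumsum(p[order])
    # First payoff level whose cumulative probability strictly exceeds alpha.
    # Any c below minus this level gives a loss probability above alpha.
    k = np.searchsorted(cdf, alpha, side="right")
    k = min(k, len(z) - 1)  # guard against rounding in the cumulative sum
    return -z[k]

z = np.array([-10.0, 0.0, 5.0, 20.0])   # hypothetical payoffs
p = np.array([0.05, 0.15, 0.5, 0.3])    # hypothetical probabilities

for alpha in (0.01, 0.1, 0.5):
    print(alpha, value_at_risk(z, p, alpha))
```

A tighter threshold $\alpha$ requires a larger cash injection, so the computed value is decreasing in $\alpha$.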

Value-at-risk became an industry standard in the 1990s, partly due to popularization through RiskMetrics, a risk management framework developed at J.P. Morgan. It has spawned a variety of alternatives and extensions, including conditional value-at-risk, entropic value-at-risk and relativistic value-at-risk. We will meet some of these ideas again in Chapter 7.

1.1.3.5Difficulties¶

While the risk-management concepts discussed above are all sensible, they complicate solving for an optimal policy because they make the objective nonlinear. For example, consider the mean-variance problem (1.12), which we can write as

\max_{\sigma \in \Sigma} m_V \left\{ \sum_{t=0}^{T(\sigma)-1} \beta^t \pi_t + \beta^{T(\sigma)} s \right\}.

(1.13)

The function $m_V$ is nonlinear due to the presence of the variance term, and this nonlinearity prevents us from passing $m_V$ through the sum and thereby deriving a recursive expression for the value of a strategy similar to (1.4). To see this, recall the role that linearity of the expectations operator $\EE_t$ played in our derivation of representation (1.2).

Bellman’s dynamic programming theory requires having a recursive representation for valuations under alternative strategies. Without that, global optimization problems like $\max_\sigma U(Z_\sigma)$ can be poorly behaved, very high dimensional, and virtually inaccessible to the theory of dynamic programming.

Even worse, there is an important sense in which the criterion of maximizing $m_V(Z_\sigma)$ over $\sigma \in \Sigma$ is no longer the right one, since there is no guarantee that the best strategy for the manager is to choose a stationary Markov policy and apply it in every period. For example, in our new nonlinear setting, it might be optimal for the manager to apply a given policy $\sigma$ in the first period and a second policy $\tau$ in all periods thereafter (see, e.g., Section 5 of Bäuerle & Jaśkiewicz (2024)). While this might still seem feasible, since we need only compute one more policy, time inconsistency arises. The manager must be committed to switching to $\tau$ in the second period, since re-optimization would lead to choosing $\sigma$ again.

Unfortunately none of the alternatives to mean-variance analysis discussed in Section 1.1.3.4 offer a way out of the problem described above, since nonlinearity in the objective again prevents construction of a recursive representation. As with mean-variance, this lack of recursivity requires deploying dynamic programs in new ways.^[1]

1.1.3.6Back to Recursion¶

In Section 1.1.3.5, we saw how nonlinearity intended to capture risk-preferences led to a breakdown of dynamic programming theory. Fortunately, there is a way to inject nonlinearity and risk-preferences into the manager’s problem without breaking recursivity and hence access to the core ideas of dynamic programming. The idea is to stop trying to apply risk-preferences directly to the net present value sum and instead apply them period-by-period. We can do this by starting with the risk-neutral recursive valuation (1.4) and modifying it to

v_\sigma(x) = \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta (K v_\sigma)(x) \right],

(1.14)

where $K$ is a possibly nonlinear operator from $b\Xsf$ to itself. For example, using the notation $(Pv)(x) = \int v(x') P(x, \diff x')$ , we can set

(K v)(x) = (Pv)(x) - \frac{\gamma}{2} \int \left[v(x') - (Pv)(x)\right]^2 P(x, \diff x')

to implement the mean-variance criterion, or

(K v)(x) = - \frac{1}{\gamma} \ln \left\{ \int \exp(- \gamma v(x')) P(x, \diff x') \right\}

for entropic risk preferences.
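To illustrate, the following sketch implements both choices of $K$ on a finite state space and computes $v_\sigma$ from (1.14) by successive approximation under entropic preferences. The three-state kernel, profits, scrap value and policy are hypothetical. Successive approximation is our choice of method here, and its convergence is not guaranteed for the mean-variance operator.

```python
import numpy as np

def K_mean_variance(v, P, gamma):
    """Conditional mean of v(X') minus (gamma / 2) times its conditional variance."""
    Pv = P @ v
    return Pv - 0.5 * gamma * (P @ v**2 - Pv**2)

def K_entropic(v, P, gamma):
    """-(1/gamma) ln sum_x' exp(-gamma v(x')) P(x, x'), shifted for stability."""
    shift = v.min()
    return shift - np.log(P @ np.exp(-gamma * (v - shift))) / gamma

def policy_value(sigma, s, pi, P, beta, K, tol=1e-10, max_iter=10_000):
    """Iterate v -> sigma s + (1 - sigma)[pi + beta K v], as in (1.14)."""
    v = np.zeros(len(pi))
    for _ in range(max_iter):
        v_new = sigma * s + (1 - sigma) * (pi + beta * K(v))
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Hypothetical primitives
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
pi = np.array([-1.0, 1.0, 3.0])   # current profits
s, beta, gamma = 5.0, 0.9, 0.5    # scrap value, discount factor, risk aversion
sigma = np.array([1, 0, 0])       # exit in state 0, continue otherwise

v_neutral = policy_value(sigma, s, pi, P, beta, lambda v: P @ v)
v_averse = policy_value(sigma, s, pi, P, beta, lambda v: K_entropic(v, P, gamma))
print(v_neutral, v_averse)
```

In this example the risk-averse valuation lies weakly below the risk-neutral one, as expected when $\gamma > 0$.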

This alternative approach to inserting risk preferences into the manager’s problem is somewhat less intuitive than the direct approach that we reviewed in Section 1.1.3.3 and Section 1.1.3.4. In addition, it introduces a new problem: is $v_\sigma$ actually well-defined by the nonlinear functional equation (1.14)? On the other hand, it offers a major advantage: provided we can show that $v_\sigma$ is in fact well-defined — which is a problem for fixed point theory — the valuations are recursive by construction. Exploiting this fact, we can extend Bellman’s original theory in very natural ways. This is one of the main subjects of this book.

We will attack this problem in stages, beginning with an abstract recursive setup in Chapter 2. The setup will be recursive in the sense that each valuation $v_\sigma$ will be represented as the fixed point of a possibly nonlinear policy operator $T_\sigma$ . Our approach will then be to rewrite Bellman’s optimality theory in this abstract setting and seek properties on the policy operators under which the main results go through. At the end of this process we will connect back to the applications from this chapter.

Before progressing to this abstract theory, we look at some other concrete examples, beginning with finite state Markov decision processes.

1.2Finite MDPs¶

Finite state Markov decision processes (MDPs) form the foundations of many quantitative modeling and reinforcement learning routines, as well as providing a benchmark setting for dynamic programming theory (see, e.g., Puterman (2005) or Chapter 5 of Sargent & Stachurski (2025)). In this section we introduce finite state MDPs and some extensions. As was the case for the firm problem considered in Section 1.1, our main objective is to introduce a class of dynamic programming problems that will motivate the abstract theory starting in Chapter 2.

While the presentation below is self-contained, readers wanting a slower pace and more examples might prefer to begin with Chapter 5 of Sargent & Stachurski (2025).

1.2.1Theory¶

In this section we introduce the finite MDP framework and state core optimality results—the Bellman equation and the greedy policy characterization of optimal policies. We then present three fundamental algorithms: value function iteration, Howard policy iteration, and optimistic policy iteration. We illustrate the main ideas with an application to firm cash management.

1.2.1.1The Discrete Time Model¶

A finite state Markov decision process (finite MDP) consists of

a finite set $\Xsf$ called the state space,
a finite set $\Asf$ called the action space,

and a tuple $(\Gamma, r, \beta, P)$ , where

$\Gamma$ is a nonempty correspondence from $\Xsf$ to $\Asf$ , which in turn defines the feasible state-action pairs
$\Gsf \coloneq \setntn{(x, a) \in \Xsf \times \Asf}{a \in \Gamma(x)},$
(1.15)
a reward function $r \colon \Gsf \to \RR$ ,
a discount factor $\beta$ in $[0,1)$ , and
a stochastic kernel $P$ from $\Gsf$ to $\Xsf$ , which provides transition probabilities for the next period state given current state and action.

Since $P$ is a stochastic kernel, it satisfies $\sum_{x'} P(x, a, x') = 1$ for all $(x,a) \in \Gsf$ .

Given an initial condition $X_0 = x$ , the objective is to maximize the expected discounted sum

\EE \sum_{t \geq 0} \beta^t r(X_t, A_t) \quad \st \quad A_t \in \Gamma(X_t) \text{ for all } t \geq 0.

Here $(X_t)_{t \geq 0}$ takes values in $\Xsf$ and $(A_t)_{t \geq 0}$ takes values in $\Asf$ . After observing state $X_t$ , the controller chooses action $A_t$ from the feasible set $\Gamma(X_t)$ and the new state $X_{t+1}$ is drawn from the distribution $P(X_t, A_t, \cdot)$ . The constant $\beta \in [0,1)$ is a discount factor and $r$ is a reward function. In maximizing this objective, the action sequence $(A_t)$ must also satisfy an information constraint: Each $A_t$ is required to be measurable with respect to the $\sigma$ -algebra generated by $(X_0, \ldots, X_t)$ . Thus, the controller can use information from the past and present but not the future.

As was the case for the firm problem in Section 1.1, it turns out that actions depending on all of $(X_0, \ldots, X_t)$ are no better than actions that depend only on $X_t$ . We will show this formally in Section 3.1.4. As a result, we focus on stationary Markov policies, where the same deterministic function of the state is applied at every point in time. We call such stationary Markov policies feasible policies, the set of which is given by

\Sigma \coloneq \setntn{\sigma \in \Asf^\Xsf} {\sigma(x) \in \Gamma(x) \text{ for all } x \in \Xsf}.

(1.16)

For each $\sigma \in \Sigma$ , we set

P_\sigma(x, x') \coloneq P(x, \sigma(x), x') \quad \text{and} \quad r_\sigma(x) \coloneq r(x, \sigma(x)).

(1.17)

It follows from our assumptions on $P$ that $P_\sigma$ is a stochastic matrix, meaning that $P_\sigma \geq 0$ and all rows sum to one. By choosing a policy $\sigma$ , the controller determines a reward function $r_\sigma$ on the state and Markov dynamics $P_\sigma$ for the state process. Following notation in Section A.5.4, we write

(P_\sigma h)(x) = \sum_{x' \in \Xsf} h(x') P_\sigma(x, x') \qquad (h \in \RR^\Xsf, \; x \in \Xsf),

interpreting this value as the expectation of $h(X_{t+1})$ when $X_t = x$ and the controller uses policy $\sigma$ .

The lifetime value of $\sigma$ given $X_0 = x$ is

v_\sigma(x) \coloneq \EE \sum_{t \geq 0} \beta^t r(X_t, \sigma(X_t)) = \sum_{t \geq 0} \beta^t \EE \, r_\sigma(X_t)

when $(X_t)_{t \geq 0}$ is a Markov chain generated by $P_\sigma$ with initial condition $X_0 = x \in \Xsf$ . Since $\EE \, r_\sigma(X_t) = (P_\sigma^t r_\sigma)(x)$ , the function $v_\sigma$ can be expressed pointwise on $\Xsf$ as

v_\sigma = \sum_{t \geq 0} (\beta P_\sigma)^t r_\sigma = (I-\beta P_\sigma)^{-1} r_\sigma,

(1.18)

where $I$ is the identity map on $\RR^\Xsf$ , the set of real-valued functions on $\Xsf$ . This representation on the right-hand side is essentially the same as (1.3). In particular, the second equality comes from the Neumann series lemma. See also Puterman (2005), Theorem 6.1.1, or Chapter 5 of Sargent & Stachurski (2025).
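Both expressions in (1.18) are easy to evaluate numerically. The sketch below stores $P$ as an array of shape $(|\Xsf|, |\Asf|, |\Xsf|)$ and $r$ as an array of shape $(|\Xsf|, |\Asf|)$, treating every action as feasible. The layout, the randomly generated primitives and the function names are our assumptions.

```python
import numpy as np

def restrict(P, r, sigma):
    """Return P_sigma and r_sigma as in (1.17)."""
    idx = np.arange(P.shape[0])
    return P[idx, sigma], r[idx, sigma]

def policy_value(P, r, sigma, beta):
    """Solve (I - beta P_sigma) v = r_sigma."""
    P_sigma, r_sigma = restrict(P, r, sigma)
    return np.linalg.solve(np.eye(len(r_sigma)) - beta * P_sigma, r_sigma)

# Hypothetical primitives
rng = np.random.default_rng(0)
n, m, beta = 5, 3, 0.9
P = rng.random((n, m, n))
P /= P.sum(axis=2, keepdims=True)      # each P(x, a, .) sums to one
r = rng.random((n, m))
sigma = rng.integers(0, m, size=n)     # an arbitrary policy

v_solve = policy_value(P, r, sigma, beta)

# Truncated Neumann series sum_t (beta P_sigma)^t r_sigma
P_sigma, r_sigma = restrict(P, r, sigma)
v_series, term = np.zeros(n), r_sigma.copy()
for _ in range(500):
    v_series += term
    term = beta * P_sigma @ term

print(np.max(np.abs(v_solve - v_series)))
```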

The policy operator associated with given $\sigma$ for the finite MDP model takes the form

(T_\sigma \, v)(x) = r(x, \sigma(x)) + \beta \sum_{x'} v(x') P(x, \sigma(x), x') \qquad (v \in \RR^\Xsf, \; x \in \Xsf)

(1.19)

Solution to Exercise 1.2.1

In operator notation, the action of $T_\sigma$ can be written as $T_\sigma \, v = r_\sigma + \beta P_\sigma \, v$ . With $\| \cdot \|$ as the supremum norm and $v, v' \in \RR^\Xsf$ , we have

\| T_\sigma \, v - T_\sigma \, v' \| = \beta \| P_\sigma \, (v - v') \| \leq \beta \| P_\sigma \| \, \| v - v' \| = \beta \| v - v' \|.

(Readers who are less comfortable with an operator-theoretic approach can write these steps out pointwise, at fixed $x \in \Xsf$, and arrive at the same bound. In the last step we used $\| P_\sigma \| = 1$ from Lemma A.5.30.) Hence $T_\sigma$ is a contraction of modulus $\beta$ on $\RR^\Xsf$ and has a unique fixed point, which solves $v = r_\sigma + \beta P_\sigma v$. Assuming that $I - \beta P_\sigma$ is invertible, the fixed point is $v_\sigma = (I-\beta P_\sigma)^{-1} r_\sigma$. The invertibility assumption holds because $\rho(P_\sigma) = 1$ (see Lemma A.5.32) and hence $\rho(\beta P_\sigma) = \beta < 1$. See Corollary A.4.11 for more details.
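The contraction bound implies that iterates of $T_\sigma$ approach $v_\sigma$ at geometric rate $\beta$ in the supremum norm. A minimal numerical check, using a randomly generated and purely hypothetical pair $(P_\sigma, r_\sigma)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 5, 0.9
P_sigma = rng.random((n, n))
P_sigma /= P_sigma.sum(axis=1, keepdims=True)   # stochastic matrix
r_sigma = rng.random(n)

def T_sigma(v):
    """Policy operator T_sigma v = r_sigma + beta P_sigma v."""
    return r_sigma + beta * P_sigma @ v

v_sigma = np.linalg.solve(np.eye(n) - beta * P_sigma, r_sigma)

# Track the sup-norm distance to v_sigma along the iterates
v, errors = np.zeros(n), []
for _ in range(50):
    errors.append(np.max(np.abs(v - v_sigma)))
    v = T_sigma(v)

ratios = [e1 / e0 for e0, e1 in zip(errors[:-1], errors[1:])]
print(max(ratios))   # no larger than beta, up to rounding
```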

1.2.1.2Core Optimality Results¶

The definition of the value function is the same as that for the firm problem in Section 1.1.1.2:

\vmax(x) \coloneq \sup_{\sigma \in \Sigma} v_\sigma(x) \qquad (x \in \Xsf).

Similarly, a policy $\sigma \in \Sigma$ is called optimal if $v_\tau \leq v_\sigma$ for all $\tau \in \Sigma$ .

The Bellman equation for this problem is

v(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\}.

(1.20)

This is a functional equation that restricts $v \in \RR^\Xsf$ . The Bellman operator is given by

(T \, v)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \qquad\qquad (x \in \Xsf).

(1.21)

By construction, $v$ solves the Bellman equation if and only if $v$ is a fixed point of the Bellman operator. In the next exercise, $d_\infty(f, g) \coloneq \sup_{x \in \Xsf}|f(x) - g(x)|$ . Corollary A.5.13 may be helpful.

Solution to Exercise 1.2.3

For arbitrary $v, w \in \RR^\Xsf$ and $x \in \Xsf$ ,

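\begin{aligned} |(Tv)(x) - (Tw)(x)| & \leq \beta \max_{a \in \Gamma(x)} \left| \sum_{x'} v(x')P(x,a,x') - \sum_{x'} w(x')P(x,a,x') \right| \\ & \leq \beta \max_{a \in \Gamma(x)} \sum_{x'} \left| v(x') - w(x') \right| P(x,a,x') \\ & \leq \beta \| v - w\|. \end{aligned}

Taking the supremum over $x \in \Xsf$ gives $\| Tv - Tw \| \leq \beta \| v - w \|$, so $T$ is a contraction of modulus $\beta$.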

We say that $\sigma \in \Sigma$ is $v$ -greedy if

\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \quad \text{for all } x \in \Xsf,

(1.22)
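In code, the Bellman operator (1.21) and a $v$-greedy policy (1.22) differ only in whether we take the max or the argmax over feasible actions. The sketch below encodes $\Gamma$ as a Boolean mask and uses randomly generated, hypothetical primitives. The array layout and function names are our assumptions.

```python
import numpy as np

def action_values(v, r, P, beta, feasible):
    """r(x, a) + beta sum_x' v(x') P(x, a, x'), set to -inf off the feasible set."""
    q = r + beta * P @ v            # P has shape (n, m, n), so q has shape (n, m)
    return np.where(feasible, q, -np.inf)

def bellman(v, r, P, beta, feasible):
    """The Bellman operator T."""
    return action_values(v, r, P, beta, feasible).max(axis=1)

def greedy(v, r, P, beta, feasible):
    """One v-greedy policy, with ties broken by the lowest action index."""
    return action_values(v, r, P, beta, feasible).argmax(axis=1)

# Hypothetical primitives
rng = np.random.default_rng(0)
n, m, beta = 6, 4, 0.9
P = rng.random((n, m, n))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n, m))
feasible = rng.random((n, m)) < 0.7
feasible[:, 0] = True               # keeps Gamma(x) nonempty at every x

v, w = rng.standard_normal(n), rng.standard_normal(n)
sigma = greedy(v, r, P, beta, feasible)
Tv, Tw = bellman(v, r, P, beta, feasible), bellman(w, r, P, beta, feasible)
print(sigma, np.max(np.abs(Tv - Tw)), beta * np.max(np.abs(v - w)))
```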

We can now state the following optimality result, which naturally mirrors our previous result for the firm problem (Theorem 1.1.1).

A full proof of Theorem 1.2.1 can be found in Chapter 5 of Sargent & Stachurski (2025). The proof is almost identical to that of Theorem 1.1.1, which we provided for the firm problem. We will also prove Theorem 1.2.1 in Section 2.3.3.2, as a special case of far more general results.

The next obvious step is to use the results in Theorem 1.2.1 to compute optimal policies. Next we consider algorithms designed for this purpose.

1.2.1.3Algorithms¶

The three most important algorithms for solving dynamic programming problems are value function iteration (VFI), Howard policy iteration (HPI), and optimistic policy iteration (OPI). In the present setting, they take the forms of Algorithm 1.2.1, Algorithm 1.2.2, and Algorithm 1.2.3.

VFI amounts to iterating $k$ times with $T$ from some initial condition $v \in \RR^\Xsf$ (where $k$ is determined by a fixed tolerance level for error), producing an approximation $v_k \coloneq T^k v$ to $\vmax$, and then computing a $v_k$-greedy policy $\sigma$. This idea is natural: $T$ is a contraction mapping with unique fixed point $\vmax$, so $v_k$ is close to $\vmax$ when $k$ is large, and $\vmax$-greedy policies are optimal.

In HPI, one begins with a guess $\sigma$ of the optimal policy and then iterates between computing the lifetime value of that policy (as given in (1.18)) and the corresponding greedy policy. In fact HPI is equivalent to Newton fixed point iteration applied to the Bellman operator. See, for example, Chapter 5 of Sargent & Stachurski (2025).

OPI can be thought of as a “convex combination” of VFI and HPI. Instead of computing the lifetime value $v_\sigma = (I - \beta P_{\sigma})^{-1} r_{\sigma}$ of current policy guess $\sigma$ , one computes instead $T_{\sigma}^m v$ , which is an approximation to $v_\sigma$ (since $T_\sigma$ is a contraction with fixed point $v_\sigma$ ). There are two edge cases:

If $m$ is large, this approximation is tight, and hence OPI is close to HPI.
If $m=1$ , OPI reduces to VFI.

OPI usually outperforms both VFI and HPI for some intermediate values of $m$ . Further intuition and discussion is provided in Chapter 5 of Sargent & Stachurski (2025).
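The verbal descriptions above translate into the following sketch. The stopping rules, the zero initial condition, the default $m = 20$ and the randomly generated MDP (with all actions feasible and nonnegative rewards) are our assumptions rather than part of the algorithm statements.

```python
import numpy as np

def action_values(v, r, P, beta):
    return r + beta * P @ v

def greedy(v, r, P, beta):
    return action_values(v, r, P, beta).argmax(axis=1)

def policy_value(sigma, r, P, beta):
    idx = np.arange(P.shape[0])
    return np.linalg.solve(np.eye(len(idx)) - beta * P[idx, sigma], r[idx, sigma])

def vfi(r, P, beta, tol=1e-10, max_iter=100_000):
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        v_new = action_values(v, r, P, beta).max(axis=1)    # apply T
        done = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if done:
            break
    return greedy(v, r, P, beta), v

def hpi(r, P, beta, max_iter=1_000):
    sigma = np.zeros(P.shape[0], dtype=int)
    for _ in range(max_iter):
        v = policy_value(sigma, r, P, beta)                 # exact lifetime value
        sigma_new = greedy(v, r, P, beta)
        if np.array_equal(sigma_new, sigma):
            break
        sigma = sigma_new
    return sigma, v

def opi(r, P, beta, m=20, tol=1e-10, max_iter=100_000):
    idx = np.arange(P.shape[0])
    v = np.zeros(P.shape[0])    # with r >= 0 this start satisfies T v >= v
    for _ in range(max_iter):
        sigma = greedy(v, r, P, beta)
        v_new = v
        for _ in range(m):                                  # apply T_sigma m times
            v_new = r[idx, sigma] + beta * P[idx, sigma] @ v_new
        done = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if done:
            break
    return sigma, v

# Hypothetical MDP
rng = np.random.default_rng(0)
n, m_actions, beta = 8, 3, 0.9
P = rng.random((n, m_actions, n))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n, m_actions))

sigma_vfi, v_vfi = vfi(r, P, beta)
sigma_hpi, v_hpi = hpi(r, P, beta)
sigma_opi, v_opi = opi(r, P, beta)
print(sigma_vfi, sigma_hpi, sigma_opi)
```

Setting `m=1` in `opi` applies $T$ once per step and so reproduces the VFI iterates, which matches the second edge case above.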

For the finite MDP setting, we can state the following results:

We shall prove these results after we discuss convergence of these algorithms in a general setting in Section 2.2.1.

1.2.1.4Solving MDPs via Linear Programming¶

Many dynamic programs can be formulated as linear programs. We illustrate with finite MDPs. To do so, we recall that a typical linear program has the form

\min_v \inner{c,v} \; \text{ over all } \; v \in \RR^n \text{ with } Av \leq b.

(1.23)

Here $c$ is a vector in $\RR^n$ , the term $\inner{c,v}$ is the inner product $\sum_i c_i v_i$ , $A$ is a matrix with $n$ columns and $b$ is a vector with the same length as $Av$ . (Other LP formulations replace the inequality $Av \leq b$ with $Av = b$ or $Av \geq b$ , or a mix of inequality and equality constraints. Standard LP algorithms and theory can be applied to all of these cases.)

To place the finite state MDP from Section 1.2.1.1 into this framework, we begin by setting

V_D \coloneq \setntn{v \in \RR^\Xsf}{Tv \leq v},

where $T$ is the Bellman operator. Recalling that $\vmax$ represents the value function, we have the following result:

Now let $c$ be an everywhere positive element of $\RR^\Xsf$ and consider the linear program

\begin{aligned} & \min_v \inner{c, v} \\ & \st r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \leq v(x) \text{ for all } (x, a) \in \Gsf. \end{aligned}

(1.24)

Here $\inner{c, v} = \sum_x c(x) v(x)$ and the minimization is over all $v \in \RR^\Xsf$ that satisfy the stated constraint. As before, we require that $c$ is everywhere positive. Evidently (1.24) takes the form of (1.23) after suitable assignment of indices. Thus, (1.24) is a linear program. This leads us to

Proof

Let $\bar v$ be any solution to (1.24) in $\RR^\Xsf$ . The constraint in (1.24) implies that $T \bar v \leq \bar v$ . Since $\bar v$ is in $V_D$ , Lemma 1.2.3 implies that $\vmax \leq \bar v$ . In addition, $\bar v \leq \vmax$ . Indeed, if $\vmax(x) < \bar v(x)$ for some $x$ , then, using $\vmax \leq \bar v$ and the positivity of $c$ , we have $\inner{c , \vmax} < \inner{c, \bar v}$ . This contradicts the hypothesis that $\bar v$ solves (1.24), since $\vmax$ is also in the choice set. The contradiction confirms that $\bar v \leq \vmax$ and hence $\bar v = \vmax$ . The claim in Proposition 1.2.4 follows. ◻

Proposition 1.2.4 implies that we can compute the value function and hence solve the MDP using linear programming techniques. The LP approach is useful for many models, such as those that incorporate additional linear constraints, e.g., bounds on expected rewards and resources, that are difficult to handle with iterative methods. On the other hand, the number of constraints in (1.24) equals $|\Gsf|$ , which can be large, so iterative methods are still often preferred. See Section 1.6 for references and further discussion.
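As an illustration, the linear program (1.24) can be passed to an off-the-shelf solver. The sketch below uses `scipy.optimize.linprog` with $c \equiv 1$ and assumes that every action is feasible at every state, so that there is one constraint row per pair $(x, a)$. The MDP itself is randomly generated and hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(r, P, beta, c=None):
    """Minimize <c, v> subject to r(x, a) + beta sum_x' v(x') P(x, a, x') <= v(x)."""
    n, m = r.shape
    c = np.ones(n) if c is None else c
    # Row x * m + a of A_ub is beta P(x, a, .) - e_x, and the bound is -r(x, a)
    A_ub = beta * P.reshape(n * m, n) - np.repeat(np.eye(n), m, axis=0)
    b_ub = -r.reshape(n * m)
    # linprog restricts variables to be nonnegative by default, so free them
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
    return res.x

# Hypothetical MDP
rng = np.random.default_rng(0)
n, m, beta = 6, 3, 0.9
P = rng.random((n, m, n))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n, m))

v_lp = solve_mdp_lp(r, P, beta)
bellman_residual = np.max(np.abs((r + beta * P @ v_lp).max(axis=1) - v_lp))
print(v_lp, bellman_residual)
```

The Bellman residual at the LP solution is zero up to solver tolerance, consistent with the LP solution being the value function.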

1.2.1.5Example: Cash Management¶

To illustrate finite MDPs in a concrete setting, we now study a cash management problem faced by a firm that must balance cash holdings against returns from securities. This problem dates back to the work of Baumol (1952) and Tobin (1956), who developed inventory-theoretic models of the demand for money. In their models, the decision maker decides how much cash to hold versus interest-bearing assets, balancing the opportunity cost of holding idle cash against the transaction costs of converting assets to cash.

The decision maker manages fixed total wealth $\bar w$ , which is divided between cash holdings $x$ and securities $s = \bar w - x$ . Each period, the decision maker experiences random portfolio shocks and must decide whether to transfer funds between cash and securities. The decision maker earns returns on securities but pays transaction costs for transfers and faces penalties for insufficient cash. (Assuming that wealth is fixed allows us to get by tracking only $x$ , and not $s$ .)

The state space for cash is $\Xsf = \{0, 1, \ldots, \bar w\}$ . At state $x$ , the feasible actions are transfers $a$ satisfying $0 \leq x+a \leq \bar w$ . Thus, $\Gamma(x) = \{a \in \ZZ \mid -x \leq a \leq \bar w - x\}$ . Portfolio shocks (cash payments, equity payments, debt restructuring, etc.) are IID, written as $(\xi_t)_{t \geq 1}$ , and take values in a set $\Xi = \{-k, \ldots, k\}$ with probability mass function $\phi$ . Transition probabilities are determined by the next-period state

F(x,a, \xi) = \max\{0, \min\{\bar w, x + a + \xi\}\},

(1.25)

where the max and min keep $x'$ in the state space. The transition probabilities are, therefore,

P(x, a, x') = \sum_{\xi \in \Xi} \1\{F(x, a, \xi) = x'\} \phi(\xi).

(1.26)

Flow profits are given by

\pi(x, a, \xi) \coloneq \rho (\bar w - x) - (c + \tau |a|) \1\{a \neq 0\} - p \, \1\{x + a + \xi < 0\}.

(1.27)

Here $\rho$ is the rate of return on securities, $c$ is a fixed transaction cost, $\tau$ is a proportional transaction cost, and $p$ is a penalty for insufficient cash (in this case, when $x + a + \xi < 0$ ). To fit the problem into the MDP framework, we take expectations of flow profits to get the period reward

r(x, a) = \sum_{\xi \in \Xi} \pi(x, a, \xi) \, \phi(\xi).

(1.28)

Future payoffs are discounted using discount factor $\beta$ .

The set of feasible policies $\Sigma$ is defined from $\Gamma$ in the usual way (see (1.16)). The lifetime value of any given policy can be computed from $v_\sigma = (I - \beta P_\sigma)^{-1} r_\sigma$ , as discussed in Section 1.2.1.1. Figure 1.7 illustrates by using this formula to compute $\sigma$ -value functions for two policies. The first is a “do nothing” policy that sets $a = 0$ for all states. The second is a target policy that always moves toward a fixed target cash level. As we will see, both policies are suboptimal.

Figure 1.7: Value of a do-nothing policy ($\sigma_1$) and a target policy ($\sigma_2$)

Next let’s solve for an approximately optimal policy using VFI, as described in Algorithm 1.2.1. Figure 1.8 shows the resulting (approximately) optimal policy and value function. The policy shows the optimal transfer amount as a function of current cash holdings, while the value function shows the lifetime value of following the optimal policy. Here we set total wealth $\bar w = 50$ , return on securities $\rho = 0.02$ , fixed transaction cost $c = 1$ , proportional transaction cost $\tau = 0.1$ , penalty for insufficient cash $p = 10$ , and $\beta = 0.95$ . The cash flow shocks are uniformly distributed on $\{-5, -4, \ldots, 4, 5\}$ .

The optimal policy recommends that when cash holdings are low, the decision maker should move funds from securities to cash (take a positive action), and when cash holdings are high, move funds from cash to securities (a negative action). The value function declines for large cash balances because wealth is fixed and hence high cash balances mean low holdings of securities and reduced returns.

Figure 1.8: Optimal policy and value function for the cash management problem
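A minimal sketch of this computation follows, using the parameter values stated above. As an implementation choice of ours, actions are indexed by post-transfer cash $y = x + a$, which makes every pair $(x, y)$ feasible. The stopping rule is also an assumption, and the code is not the one used to produce the figures.

```python
import numpy as np

# Parameter values as stated in the text
w_bar, rho, c, tau, penalty, beta, k = 50, 0.02, 1.0, 0.1, 10.0, 0.95, 5
shocks = np.arange(-k, k + 1)
phi = np.full(len(shocks), 1.0 / len(shocks))   # uniform shock distribution
n = w_bar + 1                                   # cash levels 0, ..., w_bar

# Build P and r, with the action stored as y = x + a (so a = y - x)
P = np.zeros((n, n, n))
r = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        a = y - x
        for xi, prob in zip(shocks, phi):
            x_next = max(0, min(w_bar, y + xi))                   # (1.25)
            P[x, y, x_next] += prob                               # (1.26)
            flow = (rho * (w_bar - x) - (c + tau * abs(a)) * (a != 0)
                    - penalty * (y + xi < 0))                     # (1.27)
            r[x, y] += flow * prob                                # (1.28)

def policy_value(sigma):
    """Lifetime value (I - beta P_sigma)^{-1} r_sigma."""
    idx = np.arange(n)
    return np.linalg.solve(np.eye(n) - beta * P[idx, sigma], r[idx, sigma])

v_nothing = policy_value(np.arange(n))          # do-nothing policy: y = x

# Value function iteration
v = np.zeros(n)
for _ in range(10_000):
    v_new = (r + beta * P @ v).max(axis=1)
    gap = np.max(np.abs(v_new - v))
    v = v_new
    if gap < 1e-8:
        break

sigma_star = (r + beta * P @ v).argmax(axis=1)  # approximately optimal target cash
transfers = sigma_star - np.arange(n)           # implied transfer a at each x
print(transfers)
```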

Figure 1.9 shows iterates for the policy sequence and the value sequence under the HPI algorithm. The initial policy is a do-nothing policy. As mentioned in Theorem 1.2.2, HPI converges in a finite number of iterations. Here it converges in 5 iterations, so the last policy is the exact optimal policy (modulo floating point arithmetic), and the last $\sigma$-value function is the value function $\vmax$. The gap between the first value function, associated with the do-nothing policy, and the final value function, associated with the optimal policy, is the value of active cash management. This gap is largest at extreme cash levels, where the do-nothing policy either leaves the firm exposed to cash shortfalls or forgoes returns on securities.

Figure 1.9: Iterating with HPI from the do-nothing policy

Figure 1.10 shows a simulated time path for cash and optimal cash transfers under the optimal policy. Cash is held unchanged in many time periods as a result of transaction costs.

Figure 1.10: Time path for cash and actions under the optimal policy

1.2.2Continuous Time¶

In this section we modify our finite state MDP model from Section 1.2.1.1 to a continuous time setting. With appropriate manipulations, our continuous time model can be embedded in the discrete time framework.

1.2.2.1Primitives and Values¶

As in the discrete time case, $\Xsf$ and $\Asf$ are finite sets, while the controller is constrained by a feasible correspondence $\Gamma$ from $\Xsf$ to $\Asf$ . The definitions of $\Gsf$ , $\Sigma$ , and $r$ are unchanged. Discounting is determined by a constant $\delta > 0$ , referred to as the discount rate, while transitions are driven by an intensity kernel $Q$ from $\Gsf$ to $\Xsf$ , which is a map $Q$ from $\Gsf \times \Xsf$ to $\RR$ that satisfies

\sum_{x'} Q(x, a, x') = 0 \text{ for all } (x,a) \text{ in } \Gsf \text{ and } Q(x, a, x') \geq 0 \text{ when } x \neq x'.

Informally, over the short interval from $t$ to $t+h$, the controller receives instantaneous reward $r(x,a)h$ and the state transitions to state $x' \neq x$ with probability $Q(x, a, x') h + o(h)$.

For a fixed $\sigma \in \Sigma$ , we obtain an intensity operator (i.e., an infinitesimal generator)

Q_\sigma(x, x') \coloneq Q(x, \sigma(x), x') \qquad (x, x' \in \Xsf)

that determines a continuous time Markov chain $(X_t)_{t \geq 0}$ with transition probabilities given by $P^\sigma_t \coloneq \me^{t Q_\sigma}$ for all $t \geq 0$. In particular,

\EE_x h(X_t) = (P^\sigma_t h)(x) \text{ for any } h \in \RR^\Xsf.

(For background see Chapter 10 of Sargent & Stachurski (2025).) Continuing to define $r_\sigma(x) \coloneq r(x, \sigma(x))$ , the lifetime value of following $\sigma$ starting from state $x$ is

v_\sigma (x) = \EE_x \int_0^\infty \me^{-\delta t} r_\sigma(X_t) \diff t = \int_0^\infty \me^{-\delta t} (P^\sigma_t r_\sigma)(x) \diff t

(1.29)

(Passing the expectation through the integral can be justified by Fubini’s theorem.) Using $\delta > 0$ , we can rewrite $v_\sigma$ as

v_\sigma = \int_0^\infty \me^{t (Q_\sigma - \delta I)} r_\sigma \diff t = (\delta I - Q_\sigma)^{-1} r_\sigma.

(1.30)

The two representations for $v_\sigma$ are the continuous time analogs of the discrete-time representations given in (1.18). A proof of the second equality is given in §10.2 of Sargent & Stachurski (2025).

(Readers familiar with semigroup theory will recognize the two representations in (1.30) as alternative expressions for the resolvent of the semigroup $(\me^{tQ})$ – see, for example, Engel & Nagel (2006), Theorem 1.10.)

1.2.2.2Uniformization¶

We can use the ADP framework to reformulate (1.30) so that $v_\sigma$ becomes the fixed point of an order preserving policy operator. This process is called uniformization. The first step is to set

P(x, a, x') \coloneq \1\{x = x'\} + \frac{Q(x, a, x')}{m} \quad \text{where} \quad m \coloneq \max_{(x, a) \in \Gsf} |Q(x, a, x)|.

(1.31)

Then set

\beta \coloneq \frac{m}{m + \delta} \quad \text{and} \quad \hat r(x, a) \coloneq \frac{r(x, a)}{m + \delta}.

(1.32)

As in the discrete time case, for each $\sigma \in \Sigma$ , define $P_\sigma$ and $\hat r_\sigma$ according to

P_\sigma(x, x') \coloneq P(x, \sigma(x), x') \quad \text{and} \quad \hat r_\sigma(x) = \hat r(x, \sigma(x)).

From Exercise 1.2.5, we see that $v_\sigma$ is the unique fixed point in $V \coloneq \RR^\Xsf$ of the policy operator

T_\sigma \, v = \hat r_\sigma + \beta P_\sigma \, v.

(1.33)
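The equivalence between the continuous time representation (1.30) and the fixed point of (1.33) can be checked numerically. The sketch below builds a random, hypothetical intensity kernel with all actions feasible, applies (1.31) and (1.32), and compares the two values of $v_\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions, delta = 4, 3, 0.1
idx = np.arange(n)

# Hypothetical intensity kernel: nonnegative off the diagonal, rows sum to zero
Q = rng.random((n, n_actions, n))
Q[idx, :, idx] = 0.0
Q[idx, :, idx] = -Q.sum(axis=2)
r = rng.random((n, n_actions))
sigma = rng.integers(0, n_actions, size=n)      # an arbitrary policy

# Continuous time value, as in (1.30)
Q_sigma, r_sigma = Q[idx, sigma], r[idx, sigma]
v_direct = np.linalg.solve(delta * np.eye(n) - Q_sigma, r_sigma)

# Uniformization, as in (1.31) and (1.32)
m = np.max(np.abs(Q[idx, :, idx]))
P = np.eye(n)[:, None, :] + Q / m
beta = m / (m + delta)
r_hat = r / (m + delta)

# Fixed point of T_sigma v = r_hat_sigma + beta P_sigma v
P_sigma, r_hat_sigma = P[idx, sigma], r_hat[idx, sigma]
v_unif = np.linalg.solve(np.eye(n) - beta * P_sigma, r_hat_sigma)

print(np.max(np.abs(v_direct - v_unif)))
```

The agreement reflects the identity $\delta I - Q_\sigma = (m + \delta)(I - \beta P_\sigma)$, which follows directly from $Q_\sigma = m (P_\sigma - I)$.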

1.2.2.3Optimality¶

Since (1.33) becomes (1.19) after replacing $\hat r_\sigma$ with $r_\sigma$, we can apply the discrete time MDP theory in Section 1.2.1. The Bellman equation becomes

v(x) = \max_{a \in \Gamma(x)} \left\{ \hat r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \qquad (x \in \Xsf).

(1.34)

The optimality properties in Theorem 1.2.1 hold and, by Theorem 1.2.2, VFI, OPI and HPI all converge. With $\vmax$ denoting the value function, a policy $\sigma \in \Sigma$ is optimal if and only if

\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ \hat r(x, a) + \beta \sum_{x'} \vmax(x') P(x, a, x') \right\} \quad \text{for all } x \in \Xsf.

The next two exercises unpack these equations and conditions to recover our original continuous time formulation.

Equation (1.35) connects the exposition above to the traditional theory of continuous time MDPs (see, e.g., Guo & Hernández-Lerma (2009)). It is sometimes called the Hamilton–Jacobi–Bellman (HJB) equation, although that name is more commonly used when the state process is a diffusion.

As in Exercise 1.2.6, it can be shown that $\sigma \in \Sigma$ is $\vmax$ -greedy if and only if

\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \sum_{x'} \vmax(x') Q(x, a, x') \right\} \quad \text{for all } x \in \Xsf.

1.2.2.4Example: Service Rate Control¶

Here we study a queue system where a firm controls service rates to maximize profit. The firm operates a service facility with finite capacity $N$ . Customers arrive according to a Poisson process with rate $\lambda$ . The state $x \in \Xsf = \{0, 1, \ldots, N\}$ represents the number of customers currently waiting for service. The firm can control the service rate by selecting from a finite set of actions $\Asf$ . Each action $a \in \Asf$ corresponds to a service rate $\mu(a)$ . Higher service rates allow faster customer processing but incur greater operating costs.

The intensity kernel $Q$ associated with this problem is

Q(x, a, x') = \begin{cases} \lambda & \text{if } x' = x + 1 \text{ and } x < N \\ \mu(a) & \text{if } x' = x - 1 \text{ and } x > 0 \\ 0 & \text{for all other } x' \text{ with } x \neq x' \text{.} \end{cases}

When $x=x'$ , we set $Q(x, a, x') = -\sum_{y \neq x} Q(x, a, y)$ in order to ensure $\sum_{x'} Q(x, a, x') = 0$ at each $(x,a)$ .

All choices of $a$ are feasible, so $\Gamma(x) = \Asf$ for all $x$ . The instantaneous profit rate is

r(x, a) = \mu(a) R \1\{x > 0\} - h x - c(a),

where $R$ is revenue per customer served, $h$ is holding cost per customer per unit time, and $c(a)$ is the service cost rate for action $a$ . The first term represents revenue from serving customers, the second term captures the cost of customers waiting in the queue, and the third term is the cost of operating at service rate $\mu(a)$ . The firm’s objective is to maximize expected discounted profit flow $v_\sigma(x) = \EE_x \int_0^\infty \me^{-\delta t} r(X_t, \sigma(X_t)) \diff t$ , where $\delta > 0$ is the discount rate.

We solve the problem using the uniformization technique discussed in Section 1.2.2.2. The first step is to calculate the uniformization rate $m = \max_{x,a} |Q(x,a,x)|$ . Here

$Q(0, a, 0) = -\lambda$ (only arrivals when empty),
$Q(x, a, x) = -(\lambda + \mu(a))$ for $0 < x < N$ (both arrivals and departures), and
$Q(N, a, N) = -\mu(a)$ (only departures at capacity).

Hence

m = \lambda + \bar \mu \quad \text{where} \quad \bar \mu \coloneq \max_a \mu(a).

Thus, following the specifications in Section 1.2.2.2, we set

P(x, a, x') = \1\{x = x'\} + \frac{Q(x, a, x')}{\lambda + \bar \mu}, \quad \beta = \frac{\lambda + \bar \mu}{\lambda + \bar \mu + \delta}, \quad \hat r(x, a) = \frac{r(x, a)}{\lambda + \bar \mu + \delta}.

We then compute the optimal policy using VFI based on the Bellman equation (1.34).

Figure 1.11 shows the optimal policy and value function for a system with capacity $N = 10$, arrival rate $\lambda = 2.0$, service rates $\mu = (2.5, 3.0, 3.5)$, revenue $R = 10$, holding cost $h = 2.5$, service costs $c = (1.5, 2.0, 4.5)$, and discount rate $\delta = 0.1$. When the queue is empty, the firm uses the lowest service rate to minimize operating costs. As the queue length increases, the firm gradually raises the service rate to balance the increasing holding costs against service costs. The value function increases initially as queue length grows (reflecting the value of serving customers) but eventually decreases as holding costs dominate.

Figure 1.11: Optimal service rate policy and value function for the queue system
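The steps above can be sketched as follows, using the parameter values stated in the text. The array layout and stopping rule are our assumptions, and the code is not the one used to produce the figure. As a consistency check, the sketch also evaluates the continuous time condition $\delta v(x) = \max_a \{ r(x, a) + \sum_{x'} v(x') Q(x, a, x') \}$, which follows from (1.34) after substituting the definitions of $P$, $\beta$ and $\hat r$.

```python
import numpy as np

# Parameter values as stated in the text
N, lam, delta, R, h = 10, 2.0, 0.1, 10.0, 2.5
mu = np.array([2.5, 3.0, 3.5])       # service rate for each action
cost = np.array([1.5, 2.0, 4.5])     # service cost rate for each action
n, n_actions = N + 1, len(mu)

# Intensity kernel and profit rate
Q = np.zeros((n, n_actions, n))
r = np.zeros((n, n_actions))
for x in range(n):
    for a in range(n_actions):
        if x < N:
            Q[x, a, x + 1] = lam             # arrival
        if x > 0:
            Q[x, a, x - 1] = mu[a]           # service completion
        Q[x, a, x] = -Q[x, a].sum()          # rows of Q(x, a, .) sum to zero
        r[x, a] = mu[a] * R * (x > 0) - h * x - cost[a]

# Uniformization
m = np.max(np.abs(Q[np.arange(n), :, np.arange(n)]))   # equals lam + max(mu)
P = np.eye(n)[:, None, :] + Q / m
beta = m / (m + delta)
r_hat = r / (m + delta)

# VFI on the Bellman equation (1.34)
v = np.zeros(n)
for _ in range(100_000):
    v_new = (r_hat + beta * P @ v).max(axis=1)
    gap = np.max(np.abs(v_new - v))
    v = v_new
    if gap < 1e-10:
        break

sigma = (r_hat + beta * P @ v).argmax(axis=1)           # index of chosen service rate
hjb_residual = np.max(np.abs((r + Q @ v).max(axis=1) - delta * v))
print(mu[sigma], hjb_residual)
```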

1.2.3Extensions¶

In this section we discuss extensions of the MDP framework that parallel those for the firm problem. These include nonlinear objectives such as mean-variance and risk-sensitive preferences. As before, we find that nonlinearity forecloses a recursive structure, which motivates period-by-period reformulations that restore tractability. We also introduce ambiguity, where the controller faces uncertainty about transition probabilities.

1.2.3.1Nonlinear Criteria¶

Some extensions to the firm problem discussed in Section 1.1.2 have counterparts here. For example, we discussed maximization problems of the form

\max_{\sigma \in \Sigma} U( Z_\sigma ) \quad \text{where} \quad Z_\sigma \coloneq \sum_{t=0}^{T(\sigma)-1} \beta^t \pi_t + \beta^{T(\sigma)} s,

(1.36)

and $U$ is a nonlinear real-valued function. One example was given in (1.13), where $U$ was the mean-variance map in (1.12). In other examples, $U$ emerges from value-at-risk, conditional value-at-risk, risk sensitivity, or a desire for robustness.

There are obvious parallels for the MDP model we introduced in Section 1.2.1. We simply take the criterion from (1.36) and modify it to

\max_{\sigma \in \Sigma} U \left( Z_\sigma \right) \quad \text{where} \quad Z_\sigma \coloneq \sum_{t \geq 0} \beta^t r(X_t, \sigma(X_t)).

(1.37)

Here $(X_t)_{t \geq 0}$ is a Markov chain generated by $P_\sigma$ with fixed initial condition and $U$ is, again, some given real-valued function. As before, we can choose $U$ to inject concern for mean-variance trade-offs, value-at-risk, conditional value-at-risk, risk sensitivity, or a desire for robustness.

1.2.3.2Back to Recursions¶

In Section 1.1.3.5 we discussed how optimization problems of the form (1.36) can be troublesome. The lack of a recursive structure prevents us from using Bellman machinery. As a result, we are left without a clear path to optimization and we lose time-consistency. Not surprisingly, all of these difficulties remain when we switch to the MDP version in (1.37).

For the most part, theorists have responded in ways similar to ones discussed for the firm problem in Section 1.1.3.6, where recursive structure is enforced by applying nonlinear criteria period-by-period, rather than applying them directly to the sum representing lifetime value. For example, we can modify the policy operator (1.5) to

(T_\sigma \, v)(x) = \sigma(x) s + (1 - \sigma(x)) \left[ \pi(x) + \beta (K_\sigma \, v)(x) \right]

(1.38)

where, for each $\sigma \in \Sigma$ , the map $K_\sigma$ is a given nonlinear operator. For example, by setting

(K_\sigma \, v)(x) = - \frac{1}{\gamma} \ln \left\{ \int \exp(- \gamma v(x')) P_\sigma(x, \diff x') \right\}

we switch the MDP problem to entropic risk preferences.
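To get a feel for the entropic operator, the following sketch evaluates it on a finite state space, where the integral becomes a matrix-vector product. The stochastic matrix `P`, the function `v`, and the value of `gamma` are arbitrary illustrative choices, and the helper name `entropic_K` is ours.

```python
import numpy as np

def entropic_K(v, P, gamma):
    """(K v)(x) = -(1/gamma) * log( sum_x' exp(-gamma * v(x')) * P[x, x'] )."""
    z = -gamma * v
    z_max = z.max()                    # shift the exponent for numerical stability
    return -(z_max + np.log(P @ np.exp(z - z_max))) / gamma

rng = np.random.default_rng(0)
n = 5
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)      # rows are probability distributions
v = rng.normal(size=n)

Kv = entropic_K(v, P, gamma=2.0)       # risk-adjusted continuation value
Ev = P @ v                             # risk-neutral continuation value
K_small = entropic_K(v, P, gamma=1e-6) # close to Ev when gamma is near zero
```

By Jensen's inequality, `Kv` lies below `Ev` when $\gamma > 0$, which is the sense in which the entropic adjustment penalizes risk. Adding a constant to `v` shifts `Kv` by the same constant.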

As for the firm problem, this alternative approach to inserting risk preferences is less intuitive than the direct approach in (1.37) and raises a question: when is $v_\sigma$ well-defined by the nonlinear functional equation (1.38)? On the other hand, the formulation in (1.38) offers the advantage that valuations are recursive by construction. This allows us to apply solution methods that extend Bellman’s original theory in natural ways. We explore these ideas in the remainder of the text, beginning with the abstract recursive setup in Chapter 2.

1.2.3.3Ambiguity¶

The MDP framework has been extended to include a decision maker’s concerns about misspecification of the probability distribution. For example, consider the cash management problem from Section 1.2.1.5. Applying the definitions of the profit function $\pi$ and the transition function $F$ from that section, the risk-neutral problem can be written as

\max_{\sigma \in \Sigma} \, \EE_\phi \, \sum_{t \geq 0} \beta^t \pi(X_t, \sigma(X_t), \xi_{t+1})

(1.39)

where $(X_t)_{t \geq 0}$ obeys $X_{t+1} = F(X_t, \sigma(X_t), \xi_{t+1})$ for all $t$ . We subscript expectation with $\phi$ to emphasize the fact that the mathematical expectation over $\xi_{t+1}$ is taken with respect to distribution $\phi$ .

Suppose now that the manager doesn’t know $\phi$ but does know that $\phi$ belongs to a set of possible distributions $\Phi$. This is how the decision maker expresses ambiguity about the probability law that governs $\xi_{t+1}$. If we asked the decision maker to put a subjective probability distribution over $\Phi$, the decision maker would decline to do so.

The decision maker proceeds in the spirit of Abraham Wald (1950) by assuming only that he knows a set of possible models. To do this, he replaces (1.39) with

\max_{\sigma \in \Sigma} \, \min_{\phi \in \Phi} \EE_\phi \, \sum_{t \geq 0} \beta^t \pi(X_t, \sigma(X_t), \xi_{t+1}).

(1.40)

By using this criterion, the manager seeks a decision rule that works well enough no matter which probability distribution $\phi \in \Phi$ governs $\xi_{t+1}$ .

A recursive structure is absent from criterion (1.40). One way out of this difficulty is to make our decision maker express model ambiguity in a way that is more susceptible to a recursive formulation. For example, we could ascribe to our decision maker a value function that solves the following Bellman equation:

v(x) = \max_{a \in \Gamma(x)} \min_{\phi \in \Phi} \sum_\xi \left\{ \pi(x, a, \xi) + \beta v(F(x, a, \xi)) \right\} \phi(\xi).

(1.41)

As we will see in Section 7.3.3, this kind of specification puts dynamic programming theory back in business.
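As a rough illustration of (1.41), the sketch below iterates the max-min Bellman operator on a small finite model. Everything specific here is hypothetical: the profit function `profit`, the transition rule `X_next`, the feasible set (all actions are allowed in every state), and the three candidate distributions in `Phi` are chosen for illustration and are not the cash management model from Section 1.2.1.5.

```python
import numpy as np

n, beta = 8, 0.9
states = np.arange(n)
actions = np.array([-1, 0, 1])
shocks = np.array([-1, 0, 1])
Phi = np.array([[0.2, 0.6, 0.2],       # candidate shock distributions (one per row)
                [0.4, 0.4, 0.2],
                [0.2, 0.4, 0.4]])

X, A, Z = np.meshgrid(states, actions, shocks, indexing="ij")
profit = np.sqrt(X) - 0.2 * np.abs(A) + 0.1 * Z   # hypothetical pi(x, a, xi)
X_next = np.clip(X + A + Z, 0, n - 1)             # hypothetical F(x, a, xi)

def T(v, Phi):
    """Max-min Bellman operator: max over actions, min over rows of Phi."""
    payoff = profit + beta * v[X_next]     # shape (n, n_actions, n_shocks)
    q = payoff @ Phi.T                     # expectation under each phi
    return q.min(axis=2).max(axis=1)

def solve(Phi, tol=1e-10, max_iter=10_000):
    v = np.zeros(n)
    for _ in range(max_iter):
        v_new = T(v, Phi)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

v_robust = solve(Phi)                                      # worst case over all of Phi
v_single = [solve(Phi[i:i + 1]) for i in range(len(Phi))]  # one model at a time
```

In this example the robust value lies weakly below the value obtained under any single distribution in `Phi`, as one would expect from a worst-case criterion.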

1.3Optimal Savings¶

This section presents an optimal savings problem (also called the optimal consumption problem and the income fluctuation problem). This problem is a building block for many economic models. It features a basic intertemporal trade-off between consuming now and consuming later. This trade-off can be analyzed using dynamic programming.

Unlike the finite state problems above, the optimal savings model has a continuous state space, as well as a continuous action space. We define policies, policy operators, and lifetime values and state the key optimality results. We then look at extensions to Epstein–Zin preferences.

1.3.1Policies and Decisions¶

In an optimal savings problem (sometimes called an “income fluctuation problem”), a household seeks to maximize

\EE \, \sum_{t=0}^{\infty} \beta^t u(C_t) \quad \text{s.t.} \quad W_{t+1} = R(W_t - C_t) + Y_{t+1} \quad \text{and} \quad 0 \leq C_t \leq W_t.

(1.42)

The constraints in (1.42) are required to hold for all $t \geq 0$ , and an initial condition $w_0$ is taken as given. The utility function $u \colon \RR_+ \to \RR$ maps current consumption $C_t$ into a utility value (loosely speaking, a measure of satisfaction), $\beta \in (0,1)$ is a discount factor indicating impatience, and $R > 0$ is a gross rate of return on assets. The variable $W_t$ represents wealth at time $t$ , while $Y_t$ is labor income. To keep the model simple, we assume $(Y_t)$ is IID with common distribution $\phi \in \dD(\RR_+)$ , the set of distributions (i.e., Borel probability measures) on $\RR_+$ .

(We study more general settings later.)

The variable $W_t$ is the state of the dynamic program, while $C_t$ is the action. Figure 1.12 shows the timing for the optimal savings problem. After observing $W_t$ , the household chooses $C_t$ and hence $W_t - C_t$ . Then labor income $Y_{t+1}$ is realized and the state updates to $W_{t+1}$ . The process then repeats.

In maximization problem (1.42) there is another constraint: $C_t$ can depend only on information available at time $t$ . Formally, current consumption $C_t$ must be a (deterministic) Borel measurable function of shocks, states, and actions observed up to and including time $t$ . Thus, the current action cannot depend on future values such as $Y_{t+1}$ or $W_{t+1}$ . A mapping from the history of the state and the shocks into the current action is called a policy function.^[2]

The infinite horizon, IID $(Y_t)$ -process, time-invariant structure of the optimal savings problem lets us focus on policies that make current consumption $C_t$ be a deterministic function $\sigma$ of the current state $W_t$ . (We will prove this later and discover how it depends on the IID-nature of the $(Y_t)$ process.)

We impose the following simplifying conditions:

In a slight abuse of notation, we use $\phi$ to represent the density of labor income as well as the corresponding distribution (i.e., Borel probability measure on $\RR_+$ ). Thus, in the integrals below, $\phi(\diff y)$ and $\phi(y) \diff y$ have the same meaning.

1.3.1.1Lifetime Value¶

In this setting, a stationary Markov policy is a Borel measurable map $\sigma$ from $\RR_+$ to itself. Here we refer to stationary Markov policies more simply as policies. We call a policy $\sigma$ feasible if $0 \leq \sigma(w) \leq w$ for all $w \in \RR_+$ , so that the consumption response $c = \sigma(w)$ obeys the inequalities in (1.42). Let $\Sigma$ denote the set of all feasible policies. We seek $\sigma \in \Sigma$ that maximizes expected lifetime value. For given $\sigma$ and initial condition $w = w_0$ , expected lifetime value is

v_\sigma(w) = \EE \sum_{t \geq 0} \beta^t u(\sigma(W_t)) \quad \text{when } \; W_{t+1} = R(W_t - \sigma(W_t)) + Y_{t+1}

(1.43)

for all $t \geq 0$ and $(W_t)_{t \geq 0}$ starts at $w$ . Below, we refer to $v_\sigma$ as the $\sigma$ -value function.

It is helpful to characterize $v_\sigma$ as the fixed point of a policy operator $T_\sigma$, defined for each $\sigma \in \Sigma$ by

(T_\sigma \, v)(w) = u(\sigma(w)) + \beta \int v(R(w - \sigma(w)) + y) \phi(\diff y) \qquad (w \in \RR_+).

(1.44)

This policy operator is a continuous state analog of the finite MDP policy operator we saw in (1.19). It acts on functions $v \in V$ , where

V \coloneq b\RR_+ \coloneq \text{all bounded Borel measurable functions from } \RR_+ \text{ to } \RR.

Recall that $V$ is a Banach space (see Section A.4.2) with supremum norm $\| v \| := \sup_x |v(x)|$ .

Policy operators are useful because $v \in V$ is a fixed point of $T_\sigma$ if and only if it equals the $\sigma$ -value function. Thus, the fixed point of $T_\sigma$ characterizes the lifetime value of $\sigma$ . This is a consequence of the following lemma.

Proof

Fix $\sigma \in \Sigma$ and set $r_\sigma \coloneq u \circ \sigma$ . Let $P_\sigma$ be the Markov operator (see Section A.5.4.2) defined at $v \in V$ by

(P_\sigma \, v)(w) \coloneq \int v(R(w - \sigma(w)) + y) \phi(\diff y) \qquad (w \in \RR_+).

Using this notation, we can write

T_\sigma \, v = r_\sigma + \beta P_\sigma \, v.

(1.45)

In Section A.5.4 and Corollary A.4.11 we show that $P_\sigma$ is a bounded linear operator from $V$ to itself and, using $\beta \in (0,1)$ , that $T_\sigma$ is globally stable on $V$ with unique fixed point $v_\sigma \in V$ obeying

v_\sigma = (I - \beta P_\sigma)^{-1} r_\sigma = \sum_{t \geq 0} (\beta P_\sigma)^t \, r_\sigma .

(1.46)

(Here $I$ is the identity map on $V$ and the second equality follows from the Neumann series lemma.) It remains only to show that $v_\sigma$ in (1.46) agrees with $v_\sigma$ defined in (1.43). To obtain this we use the fact that, when $W_{t + 1} = R(W_t - \sigma(W_t)) + Y_{t + 1}$ for all $t$ and $W_0 = w$ ,

\left( P_\sigma^t \, r_\sigma \right)(w) = \EE \left[ r_\sigma(W_t) \, \given W_0 = w \right] = \EE \left[ u(\sigma(W_t)) \, \given W_0 = w \right].

(1.47)

(The first equality also uses results in Section A.5.4.) Combining this with the last expression in (1.46), we see that $v_\sigma$ in (1.46) and (1.43) are identical. ◻
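The identities in (1.46) are easy to check in a finite-state analog, where $P_\sigma$ is a stochastic matrix and $r_\sigma$ is a vector. The sketch below uses a randomly generated matrix and reward vector, both purely illustrative, and compares the linear solve with a truncated Neumann series.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 6, 0.95
P_sigma = rng.random((n, n))
P_sigma /= P_sigma.sum(axis=1, keepdims=True)   # stochastic matrix
r_sigma = rng.random(n)                         # bounded reward vector

# v_sigma = (I - beta P_sigma)^{-1} r_sigma
v_solve = np.linalg.solve(np.eye(n) - beta * P_sigma, r_sigma)

# v_sigma = sum_t (beta P_sigma)^t r_sigma, truncated after 2000 terms
v_series, term = np.zeros(n), r_sigma.copy()
for t in range(2000):
    v_series += term
    term = beta * (P_sigma @ term)

# Fixed-point residual of T_sigma v = r_sigma + beta P_sigma v
residual = np.max(np.abs(r_sigma + beta * (P_sigma @ v_solve) - v_solve))
```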

Incidentally, one can use the law of iterated expectations to prove that the $\sigma$ -value function $v_\sigma$ is a fixed point of $T_\sigma$ . Write

v_\sigma(w) = u(\sigma(w)) + \EE \sum_{t \geq 1} \beta^t u(\sigma(W_t)) .

Letting $\EE_1$ be the expectation conditional on $W_1$ , applying the law of iterated expectations implies

v_\sigma(w) = u(\sigma(w)) + \beta \EE \, \left[ \EE_1 \, \sum_{t \geq 1} \beta^{t-1} u(\sigma(W_t)) \right] = u(\sigma(w)) + \beta \EE \, v_\sigma(W_1).

Expanding the last expression yields

v_\sigma(w) = u(\sigma(w)) + \beta \int v_\sigma(R(w - \sigma(w)) + y) \phi(\diff y).

(1.48)

Thus, $v_\sigma$ is a fixed point of $T_\sigma$ .

1.3.1.2Lifetime Values as Limits¶

In the previous section we learned that fixed points of policy operators represent lifetime value. What do finite iterates of policy operators represent? Fixing $\sigma$ and inspecting the definition of $T_\sigma$ (see (1.44)) indicates that $(T_\sigma \, v)(w)$ represents the reward received from using policy $\sigma$ for one period, when $w$ is initial wealth and the function $v$ is used to evaluate the reward from wealth in the second period.

We can lengthen the horizon by iterating with $T_\sigma$ while keeping the terminal value function $v$ fixed. Choosing $k \in \NN$ and using the expression for $T_\sigma$ in (1.45), we get

T^k_\sigma \, v = r_\sigma + \beta P_\sigma \, r_\sigma + \cdots + (\beta P_\sigma)^{k-1} r_\sigma + (\beta P_\sigma)^k v

The expression on the right is the value of using policy $\sigma$ for $k$ periods and then receiving a reward for terminal wealth determined by the function $v$ . Thus, it is the finite horizon value of following $\sigma$ under this terminal condition.

It seems plausible that the infinite-horizon lifetime value of a policy $\sigma$ could equal the limit of finite horizon values, so that

v_\sigma = \lim_{k \to \infty} T^k_\sigma v.

(1.49)

Lemma 1.3.1 assures us that this is true: since $T_\sigma$ is globally stable on $V$ with unique fixed point $v_\sigma$ , the limit in (1.49) exists and equals $v_\sigma$ , independent of the terminal condition $v \in V$ .

Figure 1.13 shows two arbitrarily chosen feasible policies and their lifetime values when $R=1.04$, $\beta=0.95$, $u(c)=1 - \exp(-c)$, and $Y_t = \exp(\nu Z_t)$, where $\nu=0.8$ and $Z_t$ is standard normal. The lifetime values were computed via (1.49).

Figure 1.13:Randomly chosen policies and their lifetime values
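One way to approximate the limit in (1.49) is to iterate $T_\sigma$ on a wealth grid. The sketch below uses the parameters stated for Figure 1.13, but several choices are our own assumptions: the policy $\sigma(w) = w/2$, the grid on $[0, 20]$, Monte Carlo integration over 500 income draws, and linear interpolation via `np.interp` (which holds $v$ constant beyond the grid endpoints). It is not the code used to produce the figure.

```python
import numpy as np

R, beta, nu = 1.04, 0.95, 0.8
u = lambda c: 1 - np.exp(-c)

w_grid = np.linspace(0.0, 20.0, 200)
rng = np.random.default_rng(0)
y_draws = np.exp(nu * rng.standard_normal(500))   # draws of Y = exp(nu * Z)

sigma = lambda w: 0.5 * w      # hypothetical feasible policy: consume half of wealth

def T_sigma(v):
    """Apply the policy operator to v, stored as values on w_grid."""
    savings = w_grid - sigma(w_grid)
    w_next = R * savings[:, None] + y_draws[None, :]   # shape (grid, draws)
    Ev = np.interp(w_next, w_grid, v).mean(axis=1)     # Monte Carlo expectation
    return u(sigma(w_grid)) + beta * Ev

# Iterate from the terminal condition v = 0 until successive iterates agree
v = np.zeros_like(w_grid)
for k in range(1000):
    v_new = T_sigma(v)
    err = np.max(np.abs(v_new - v))
    v = v_new
    if err < 1e-8:
        break
```

Since $0 \leq u < 1$ here, the computed lifetime values must lie in $[0, 1/(1-\beta)]$, which gives a quick sanity check on the output.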

1.3.2Optimality¶

The value function for the optimal savings model is

\vmax(w) \coloneq \sup_{\sigma \in \Sigma} v_\sigma (w) \qquad (w \in \RR_+).

(1.50)

Under Assumption 1.3.1 the supremum is always well defined in $\RR$ , since $u$ and hence $r_\sigma$ is bounded by some constant $M$ , implying that, for any $w \in \RR_+$ and $\sigma \in \Sigma$ ,

v_\sigma(w) \leq \sum_{t \geq 0} \beta^t M = \frac{M}{1-\beta}.

A policy $\sigma \in \Sigma$ is called optimal if $v_\sigma = \vmax$; that is, if following the policy from every initial state $w$ leads to the largest possible lifetime value attainable from $w$.

The set of feasible policies lies in an infinite-dimensional function space, so we cannot find an optimal policy by exhaustive search. We want a systematic and efficient search procedure. Following the techniques we used for the firm management problem in Section 1.1, our approach will be to (a) set up a Bellman equation to help us assign maximal lifetime values to states, and (b) solve for a greedy policy with respect to this maximizing function.

1.3.2.1Bellman’s Method¶

Fix $v \in V$ . In the present setting, a policy $\sigma \in \Sigma$ will be called $v$ -greedy if

\sigma(w) \in \argmax_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\} \quad \text{for all } w \geq 0.

(1.51)

A $v$ -greedy policy uses $v$ to value next-period states and then chooses consumption optimally to trade off current utility against expected discounted future value associated with the implied level of savings. The following statements are both true:

Computing $v$ -greedy policies is typically much easier than computing optimal policies, since we are only solving a two-period problem.
Computing $v$ -greedy policies can be equivalent to computing optimal policies, given the right choice of $v$ .

What is the right choice of $v$ ? A natural candidate is the value function, since the value function tells us the maximal reward from alternative states. We explain this in more detail in Section 1.3.2.2. In that same section, we will also use the fact that the value function satisfies an important functional equation, which we now describe.

We say that $v \in V$ satisfies the Bellman equation for the optimal savings problem if

v(w) = \max_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\} \quad \text{for all } w \geq 0.

(1.52)

Stating that $v$ solves the Bellman equation is equivalent to stating that $v$ is a fixed point of the Bellman operator $T$ that maps a value function $v(w)$ into a value function $(T v)(w)$ defined by

(T v)(w) = \max_{0 \leq c \leq w} \left\{ u(c) + \beta \int v(R(w - c) + y) \phi(\diff y) \right\} \qquad (w \geq 0).

(1.53)

The next lemma discusses properties of greedy policies and the Bellman operator.

Solution to Exercise 1.3.2

Lemma 1.3.2 tells us that $T$ maps $V$ into $bc\RR_+$ , which is a subset of $b\RR_+$ . In particular, $T$ is a self-map on $V$ . For the contraction property, we apply the sup inequality from Corollary A.5.13 and the triangle inequality for integrals to obtain

\begin{aligned} |(Tv)(w) - (Tv')(w)| & \leq \max_{0 \leq c \leq w} \beta \int \left| v(R(w - c) + y) - v'(R(w - c) + y) \right| \phi(\diff y) \\ & \leq \beta \| v - v'\|. \end{aligned}

Taking the supremum gives $\|Tv - Tv'\| \leq \beta \|v-v'\|$ .

Since $(V, \| \cdot \|)$ is a Banach space, the contraction property in Exercise 1.3.2 implies that $T$ is globally stable on $V$. (See Section A.2.2.2 for details.)
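Global stability suggests a simple computational strategy: iterate $T$ from any bounded initial guess. The following sketch does this for a discretized version of (1.53). The discretization is our own assumption, not the method used in the text: wealth lives on a grid, consumption is restricted to a finite set of shares of wealth, the integral is replaced by a Monte Carlo average, and $v$ is interpolated linearly. The primitives $u$, $R$, $\beta$, and the income process follow Figure 1.13.

```python
import numpy as np

R, beta, nu = 1.04, 0.95, 0.8
u = lambda c: 1 - np.exp(-c)

w_grid = np.linspace(0.0, 20.0, 60)
frac = np.linspace(0.0, 1.0, 41)            # consumption as a share of wealth
rng = np.random.default_rng(0)
y_draws = np.exp(nu * rng.standard_normal(100))

c = w_grid[:, None] * frac[None, :]                        # (grid, choices)
w_next = R * (w_grid[:, None] - c)[:, :, None] + y_draws   # (grid, choices, draws)

def T(v):
    """Discretized Bellman operator; returns (Tv, v-greedy consumption)."""
    Ev = np.interp(w_next, w_grid, v).mean(axis=2)
    q = u(c) + beta * Ev
    return q.max(axis=1), c[np.arange(len(w_grid)), q.argmax(axis=1)]

# Numerical check of the contraction inequality on two arbitrary functions
v1, v2 = rng.random(len(w_grid)), rng.random(len(w_grid))
lhs = np.max(np.abs(T(v1)[0] - T(v2)[0]))
rhs = beta * np.max(np.abs(v1 - v2))

# Value function iteration
v = np.zeros(len(w_grid))
for k in range(2000):
    v_new, c_greedy = T(v)
    err = np.max(np.abs(v_new - v))
    v = v_new
    if err < 1e-6:
        break
```

The discretized operator inherits the contraction modulus $\beta$ because interpolation, averaging, and maximization are all nonexpansive in the supremum norm.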

1.3.2.2DP Results for Optimal Savings¶

Dynamic programming theory tells us that, under Assumption 1.3.1,

(i) at least one optimal policy exists,
(ii) the value function $\vmax$ is the unique solution to the Bellman equation in $V$, and
(iii) a policy $\sigma \in \Sigma$ is optimal if and only if it is $\vmax$-greedy.

A direct proof of (i)–(iii) can be found in Stokey & Lucas (1989), Stachurski (2022) and numerous other sources. The proofs heavily exploit the fact that the Bellman operator is a contraction mapping (as discussed in Exercise 1.3.2). We skip proofs for now, noting that they will be special cases of proofs we shall provide in Section 3.2.2.

Let’s review what we’ve found so far. We started with one optimization problem, choosing an optimal consumption path $C_0, C_1, \ldots$ to maximize expected discounted lifetime utility, and ended up with another one: finding a greedy policy from the value function. Are we actually better off? The answer is: yes! Finding a greedy policy involves solving a scalar optimization problem for each state $w$, whereas our previous optimization problem was infinite dimensional. High dimensionality is the mountain we must climb in all hard optimization problems and here we have used the recursive structure inherent in the problem to map a route up to the top.

Of course this claim that we are better off is contingent on us being able to learn what the value function is, so that we can compute $\vmax$ -greedy policies—or at least some reasonable approximation. We discuss this topic next.

Figure 1.14 shows an approximation of the optimal policy $\sigopt$ and the value function $\vmax$ , both computed by OPI, for the same version of the optimal savings problem used in Figure 1.13. In this case we set $m=20$ .

Figure 1.14:Approximating the optimal policy and value function via OPI

1.3.2.3Special Case: No Labor Income¶

Let’s quickly look at a version of the savings model where it’s possible to get an analytical solution for the optimal policy and the value function. We will use this solution to help us investigate the role of parameters and, through this process, consider the need for extensions to the basic optimal savings model.

To obtain an analytical solution, we set $Y_t \equiv 0$ and assume that the utility function has the CRRA form

u(c) \coloneq \frac{c^{1-\gamma}}{1-\gamma} \qquad (\gamma > 0, \; \gamma \neq 1).

(1.54)

The conditions of the preceding discussion are not satisfied, since $u$ is not bounded on $\RR_+$ and may take the value $-\infty$ . We assume instead that $\beta R^{1-\gamma} < 1$ . This turns out to be sufficient to ensure finite lifetime values when consumption choices are positive:

For this CRRA problem, the optimal consumption policy is linear in $w$ . That is,

\text{there exists a constant } \eta \text{ such that } \sigma(w) = \eta w \text{ is the optimal policy}

(1.55)

Let’s verify this claim and also seek the value of the constant $\eta$ . In doing so, we first observe that if (1.55) holds, then

W_t = R^t (1 - \eta)^t w \quad \text{when } \; W_0 = w

and hence the value function $\vmax$ satisfies

\begin{aligned} \vmax(w) = \sum_t \beta^t u (\eta W_t) & = \sum_t \beta^t u \left( \eta R^t \left(1 -\eta \right)^t w \right) \\ & = \sum_t \beta^t \left( \eta R^t \left(1 -\eta \right)^t \right)^{1-\gamma} u \left( w \right) = \frac{\eta^{1-\gamma}}{1-\beta \left( R \left( 1-\eta \right) \right)^{1-\gamma}} u(w) \end{aligned}

Our conjecture is that the linear policy $\sigma(w) = \eta w$ satisfies the Bellman equation with the value function as given above. Under this conjecture, the Bellman equation becomes

\vmax(w) = \max_c \left\{ \frac{c^{1-\gamma}}{1-\gamma} + \beta \cdot \frac{\eta^{1-\gamma}}{1-\beta \left( R \left( 1-\eta \right) \right)^{1-\gamma}} \cdot \frac{\left(R \left( w-c \right) \right)^{1-\gamma}}{1-\gamma} \right\}

(1.56)

Taking the derivative with respect to $c$ yields the first-order condition

c^{-\gamma} + \beta m \left(R \left( w-c \right) \right)^{-\gamma} (-R) =0 \quad \text{ where } \; m \coloneq \frac{\eta^{1-\gamma}} { 1-\beta \left( R \left( 1-\eta \right) \right)^{1-\gamma} }

It then follows that $c^{-\gamma} = \beta m R^{1-\gamma}(w-c)^{-\gamma}$ . Substituting the optimal policy $\sigma(w) = \eta w$ into this equality gives

\left( \eta w \right)^{-\gamma} = \frac{\beta R^{1-\gamma} \eta^{1-\gamma}} {1- \beta \left( R \left( 1-\eta \right) \right)^{1-\gamma}} (1-\eta)^{-\gamma} w^{-\gamma}

Now solving the above equality for $\eta$ yields

\eta = 1 - \left( \beta R^{1-\gamma} \right)^{1/\gamma}

(1.57)

Substituting (1.57) into the expression for $\vmax$ obtained above, we find that, for any initial wealth $w$, the value function becomes

\vmax(w) = \frac{\eta^{1-\gamma}}{1-\beta \left( R \left( 1-\eta \right) \right)^{1-\gamma}} u(w) = \frac{\left( 1 - \left( \beta R^{1-\gamma} \right)^{1/\gamma} \right)^{1-\gamma}} {1-\beta R^{1-\gamma} \left( \beta R^{1-\gamma} \right)^{\frac{1-\gamma}{\gamma}}} u(w) =\eta^{-\gamma} u(w).

It is not difficult to verify that $\vmax(w) = \eta^{-\gamma} u(w)$ solves the Bellman equation (1.56) for any $w$ .

The parameter $\gamma$ governs the curvature of the utility function and hence preferences about consumption smoothing. To see this, observe that consumption at time $t$ is $C_t = \eta W_t = \eta (R(1-\eta))^t w$ , so the consumption growth factor is $C_{t+1}/C_t = (\beta R)^{1/\gamma}$ . When $\gamma$ is large, the utility function has high curvature and the agent dislikes variation in consumption across time. Conversely, when $\gamma$ is small, the agent is more tolerant of consumption variation, leading to steeper paths. Figure 1.15 illustrates these effects for $\beta = 0.96$ and $R=1$ .

Figure 1.15:Optimal consumption paths under CRRA utility for different values of $\gamma$ .
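The closed-form solution can be checked numerically. The sketch below computes $\eta$ from (1.57), maximizes the right-hand side of the Bellman equation over a fine consumption grid using the candidate value function $\eta^{-\gamma} u$, and compares the maximizer with $\eta w$ and the maximized value with $\eta^{-\gamma} u(w)$. The values $\beta = 0.96$ and $R = 1$ follow Figure 1.15, while the values of $\gamma$, the choice $w = 1$, and the grid size are our own illustrative choices.

```python
import numpy as np

beta, R = 0.96, 1.0      # values used for Figure 1.15

def check_crra(gamma, w=1.0, n=200_001):
    u = lambda c: c**(1 - gamma) / (1 - gamma)
    eta = 1 - (beta * R**(1 - gamma))**(1 / gamma)   # equation (1.57)
    v = lambda x: eta**(-gamma) * u(x)               # candidate value function
    c = np.linspace(0.0, w, n)[1:-1]                 # interior consumption grid
    rhs = u(c) + beta * v(R * (w - c))               # right side of the Bellman equation
    i = rhs.argmax()
    return eta, c[i], rhs[i], v(w)

# gamma -> (eta, grid maximizer, maximized value, candidate value at w = 1)
results = {g: check_crra(g) for g in (0.5, 2.0, 5.0)}

# Consumption growth factor R(1 - eta), to be compared with (beta R)^(1/gamma)
growth = {g: R * (1 - results[g][0]) for g in results}
```

For each $\gamma$, the grid maximizer should agree with $\eta w$ up to the grid spacing, and the growth factor $R(1-\eta)$ should coincide with $(\beta R)^{1/\gamma}$.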

1.3.3Epstein–Zin Preferences¶

There are a number of issues and limitations associated with the basic optimal savings model we have discussed so far. Moreover, these limitations tend to bind more often as we move towards quantitative analysis and interesting research applications. In this section we discuss issues related to risk and intertemporal substitution. This discussion will motivate us to introduce Epstein–Zin preferences, which are a particularly popular specification of intertemporal preferences in economics and finance.

1.3.3.1Risk vs EIS¶

One issue is that, under the model considered so far, the curvature of the utility function $u$ simultaneously governs both risk aversion (e.g., a more strongly concave utility function indicates stronger aversion to risk) and willingness to substitute consumption across time (as we saw in Figure 1.15, where increasing $\gamma$ led to flatter consumption paths). Willingness to substitute consumption is usually measured by the elasticity of intertemporal substitution (EIS), which, for the CRRA utility function is $1/\gamma$ . Larger $\gamma$ pushes down the EIS, indicating preference for smooth consumption over time.

The fact that utility curvature controls both risk preferences and the EIS binds attitudes toward uncertainty together with attitudes toward intertemporal substitution. Researchers have found that to explain various macro-finance patterns, it helps to unbind them and allow separate parameters to describe these two attitudes. For example, matching observed equity premia using standard asset pricing models requires high risk aversion, but high $\gamma$ under CRRA implies a small EIS, which creates other difficulties including what is called a risk-free rate puzzle (see, e.g., Chapter 13 of Ljungqvist & Sargent (2018)).

1.3.3.2EZ Preferences¶

Epstein–Zin preferences (Epstein & Zin, 1989; Weil, 1990) play a big role in many macro-finance models. Under these preferences, the Bellman equation (1.52) from the standard optimal savings model becomes

v(w) = \max_{0 \leq c \leq w} \left[ (1-\beta) c^{1-1/\psi} + \beta \left( \int v(R(w-c) + y)^{1-\gamma} \phi(\diff y) \right)^{\frac{1-1/\psi}{1-\gamma}} \right]^{\frac{1}{1-1/\psi}}

where $\psi > 0$ is the EIS and $\gamma > 0$ is the coefficient of relative risk aversion. The inner expectation applies risk adjustment to future value via the Kreps–Porteus expectation, which we met earlier in Section 1.1.3.4. The outer CES aggregator governs intertemporal substitution. The policy operator (1.44) now becomes

(T_\sigma v)(w) = \left[ (1-\beta) \sigma(w)^{1-1/\psi} + \beta \left( \int v(R(w-\sigma(w)) + y)^{1-\gamma} \phi(\diff y) \right)^{\frac{1-1/\psi}{1-\gamma}} \right]^{\frac{1}{1-1/\psi}}

for any feasible policy $\sigma$ . Figure 1.16 shows two arbitrarily chosen policies and their lifetime values under Epstein–Zin preferences, using the same income process as in Figure 1.13. The $\sigma$ -value functions are now computed by iterating on our new version of $T_\sigma$ . Parameters are $R=1.04$ , $\beta=0.95$ , $\gamma=5$ (risk aversion), and $\psi=1.5$ (EIS).

Figure 1.16:Policies and lifetime values under Epstein–Zin preferences
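The Epstein–Zin policy operator can be coded in much the same way as the additive one. The sketch below uses the parameters stated above, together with several assumptions of our own: the policy $\sigma(w) = w/2$, a wealth grid on $[0, 20]$, Monte Carlo integration, linear interpolation, and the initial guess $v \equiv 1$. We simply iterate and record the change between successive iterates; the sketch makes no claim about convergence guarantees, which is exactly the theoretical issue raised in the next section.

```python
import numpy as np

R, beta, gamma, psi, nu = 1.04, 0.95, 5.0, 1.5, 0.8
rho = 1 - 1 / psi                                  # exponent of the CES aggregator

w_grid = np.linspace(0.0, 20.0, 200)
rng = np.random.default_rng(0)
y_draws = np.exp(nu * rng.standard_normal(500))
sigma = lambda w: 0.5 * w                          # hypothetical feasible policy

def T_sigma_ez(v):
    """Epstein-Zin policy operator applied to strictly positive v on w_grid."""
    w_next = R * (w_grid - sigma(w_grid))[:, None] + y_draws[None, :]
    v_next = np.interp(w_next, w_grid, v)
    # Kreps-Porteus certainty equivalent of next-period value
    ce = np.mean(v_next**(1 - gamma), axis=1)**(1 / (1 - gamma))
    return ((1 - beta) * sigma(w_grid)**rho + beta * ce**rho)**(1 / rho)

v = np.ones_like(w_grid)
errors = []
for k in range(300):
    v_new = T_sigma_ez(v)
    errors.append(np.max(np.abs(v_new - v)))       # monitor successive changes
    v = v_new
```

A useful sanity check is that the certainty equivalent of a constant is that constant, so applying the operator to $v \equiv 1$ gives $[(1-\beta)\sigma(w)^{1-1/\psi} + \beta]^{1/(1-1/\psi)}$ exactly.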

Figure 1.17 shows the optimal policy and value function under Epstein–Zin preferences, computed via OPI. Compared to the standard expected utility case in Figure 1.14, the optimal consumption policy is qualitatively similar—increasing and concave in wealth—but the value function differs in interpretation and scale. With $\gamma > 1/\psi$ , the agent exhibits preference for early resolution of uncertainty, which affects how future risk is valued.

Figure 1.18 explores how risk aversion affects optimal consumption. The figure shows optimal policies for $\gamma \in \{1.25, 5, 20\}$, holding the EIS fixed at $\psi = 1.5$. Higher risk aversion leads to more precautionary saving: at each wealth level, the agent consumes less and saves more as $\gamma$ increases. Values of $\gamma$ around 10–20 are common in the long-run risk literature (Bansal & Yaron, 2004), where high risk aversion is needed to match observed asset pricing moments.

Figure 1.18:Optimal consumption by risk aversion $\gamma$

1.3.3.3Optimization Theory¶

While the preceding analysis illustrates the potential usefulness of Epstein–Zin preferences, it puts us on shaky ground technically. For example, optimality properties of the ordinary optimal savings model in Section 1.3.2.2 depend on contractivity of the Bellman operator. (For a sense of why, read the proof of Theorem 1.1.1.)

The Bellman operator associated with the Epstein–Zin Bellman equation is not a contraction under the supremum distance for the most quantitatively significant parameterizations, and the same is true for the policy operators. This means that, in order to handle both the standard and the Epstein–Zin variations of the savings problem, we require a more general theory of dynamic programming that can handle both contractive and non-contractive settings. We begin constructing appropriate tools in Chapter 2.

1.4Sequential Analysis¶

This section presents a Bayesian formulation of a statistical decision problem described by Bertsekas (1976). Unlike the previous examples, there is no discounting, so the Bellman operator is not a contraction. Nonetheless, the same conceptual framework applies: the optimal loss function solves a Bellman equation and optimal policies have a threshold structure. In subsequent chapters, we will build a theory of dynamic programming that can handle this no-discounting case. For now, our objective is to motivate the theory by exploring the application through guess-work and simulation.^[3]

1.4.1Introduction¶

We now consider a Bayesian formulation of the sequential testing problem originally studied by Milton Friedman, Allen Wallis, and Abraham Wald (Wald, 1947; Arrow et al., 1949). The following is an account of how the problem was conceived and came to the attention of Wald. The account is by Milton Friedman, one of the giants of 20th Century economics, and relates to his work during World War II as an analyst at the U.S. Government’s Statistical Research Group at Columbia University.

In order to understand the story, it is necessary to have an idea of a simple statistical problem, and of the standard procedure for dealing with it. The actual problem out of which sequential analysis grew will serve. The Navy has two alternative designs (say A and B) for a projectile. It wants to determine which is superior. To do so it undertakes a series of paired firings. On each round, it assigns the value 1 or 0 to A accordingly as its performance is superior or inferior to that of B and conversely 0 or 1 to B. The Navy asks the statistician how to conduct the test and how to analyze the results.
The standard statistical answer was to specify a number of firings and a pair of percentages (e.g., 53% and 47%) and tell the client that if A receives a 1 in more than 53% of the firings, it can be regarded as superior; if it receives a 1 in fewer than 47%, B can be regarded as superior; if the percentage is between 47% and 53%, neither can be so regarded.
When Allen Wallis was discussing such a problem with (Navy) Captain Garret L. Schuyler, the captain objected that such a test, to quote from Allen’s account, may prove wasteful. If a wise and seasoned ordnance officer like Schuyler were on the premises, he would see after the first few thousand or even few hundred [rounds] that the experiment need not be completed either because the new method is obviously inferior or because it is obviously superior beyond what was hoped for.

Friedman and Wallis worked on the problem for a while but didn’t completely solve it. Realizing that, they told Wald about the problem. That set Wald on a path that led him to create sequential analysis (Wald, 1947). While the story above relates to wartime activity, sequential analysis has many significant applications in economics, finance, operations research, and other fields. Examples include determining the number of clinical trials before bringing a drug to market, real-time fraud detection, algorithmic trading, supply chain monitoring, and experimental interface design by social media companies.

On a technical level, this problem differs from the other problems we have investigated so far in that it involves no discounting. As a result, the Bellman operator is not necessarily a contraction. Nonetheless, we will find ways to prove that the core concepts from dynamic programming theory still apply.

The setting is as follows: A decision-maker observes a sequence of IID draws $Z_1, Z_2, Z_3, \ldots$ from an unknown distribution $f$ . The distribution $f$ is either $f_0$ or $f_1$ , where both $f_0$ and $f_1$ are known probability densities. After observing each draw, the decision-maker must choose one of three actions:

Accept the hypothesis that $f = f_0$ and stop.
Accept the hypothesis that $f = f_1$ and stop.
Draw another observation at cost $c > 0$ .

The decision-maker incurs a loss whenever she makes an incorrect decision. The losses are as follows:

loss $L_0$ when incorrectly accepting $f_0$ (in fact $f = f_1$)
loss $L_1$ when incorrectly accepting $f_1$ (in fact $f = f_0$)

Both $L_0$ and $L_1$ are strictly positive. The objective is to minimize the expected loss, which includes both the cost of sampling and the potential loss from incorrect terminal decisions.

The decision-maker begins with a prior belief $\pi_0 \in (0,1)$ that $f = f_1$. The state variable is the posterior belief $\pi_n$, which represents the probability that $f = f_1$ given observations $1, \ldots, n$. After observing $Z_{n+1}$, the posterior is updated via Bayes’ rule:

$$
\pi_{n+1} = \kappa(\pi_n, Z_{n+1}), \quad \text{where} \quad \kappa(\pi, z) \coloneq \frac{\pi f_1(z)}{(1-\pi) f_0(z) + \pi f_1(z)}. \tag{1.58}
$$

Notice that $(\pi_n)_{n \geq 0}$ is Markovian over the sampling process.

Given current belief $\pi$, the next draw from the $(Z_n)_{n \in \NN}$ sequence has the predictive density

$$
\psi(\pi, z) \coloneq (1-\pi)f_0(z) + \pi f_1(z). \tag{1.59}
$$

The controller uses this distribution to take expectations over next-sample draws from the $(Z_n)_{n \geq 1}$ process.
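To make the update rule (1.58) and the predictive density (1.59) concrete, here is a minimal sketch in plain Python. It assumes the Beta densities used in this section's numerical example ($f_0 = \text{Beta}(3, 4)$, $f_1 = \text{Beta}(4, 3)$). The helper `beta_pdf`, the function names, and the sequence of draws are illustrative choices, not part of the original text.

```python
import math

def beta_pdf(z, a, b):
    """Density of the Beta(a, b) distribution at z in (0, 1)."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return z ** (a - 1) * (1 - z) ** (b - 1) / norm

# Assumed densities for illustration: f0 = Beta(3, 4), f1 = Beta(4, 3).
def f0(z):
    return beta_pdf(z, 3, 4)

def f1(z):
    return beta_pdf(z, 4, 3)

def psi(pi, z):
    """Predictive density (1.59) of the next draw given belief pi."""
    return (1 - pi) * f0(z) + pi * f1(z)

def kappa(pi, z):
    """Bayes update (1.58): probability that f = f1 after observing z."""
    return pi * f1(z) / psi(pi, z)

# Update a belief along a short, made-up sequence of draws.
pi = 0.5
beliefs = [pi]
for z in (0.8, 0.6, 0.7):
    pi = kappa(pi, z)
    beliefs.append(pi)
print(beliefs)

# Sanity check: averaging the posterior over the predictive density
# returns the current belief (midpoint rule on (0, 1) with n cells).
n = 1000
z_mid = [(j + 0.5) / n for j in range(n)]
mean_posterior = sum(kappa(0.3, z) * psi(0.3, z) for z in z_mid) / n
print(mean_posterior)  # close to 0.3
```

The final check reflects the identity $\kappa(\pi, z)\psi(\pi, z) = \pi f_1(z)$, which integrates to $\pi$. In other words, the belief process is a martingale under the decision-maker's own predictive distribution.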

1.4.2 Optimality

For this sequential sampling problem, the Bellman equation for minimizing loss has the form

$$
g(\pi) = \min \left\{ \pi L_0, \; (1-\pi) L_1, \; c + \int g(\kappa(\pi, z)) \psi(\pi, z) \diff z \right\}. \tag{1.60}
$$

The Bellman equation can be understood as follows: The value $g(\pi)$ represents the minimum expected loss given current belief state $\pi$. This value is itself the minimum over three terms, each of which corresponds to a choice. The first term is associated with accepting $f_0$ and has expected loss $\pi L_0$, since $\pi$ is the (subjective) probability that $f = f_1$. The second term is for accepting $f_1$. This has expected loss $(1-\pi) L_1$, since $1-\pi$ is the probability that $f = f_0$. The last term is the expected loss associated with continuing to the next sample and then behaving optimally.
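A short calculation, added here as an illustration, compares the two stopping terms in (1.60). The symbol $h$ is a label introduced only for this calculation. Accepting $f_0$ is weakly better than accepting $f_1$ exactly when

$$
\pi L_0 \leq (1-\pi) L_1
\quad \iff \quad
\pi \leq \frac{L_1}{L_0 + L_1}.
$$

Hence the loss from stopping immediately, $h(\pi) \coloneq \min\{\pi L_0, \, (1-\pi) L_1\}$, is piecewise linear and peaks at $\pi = L_1/(L_0 + L_1)$, where

$$
h\left(\frac{L_1}{L_0 + L_1}\right) = \frac{L_0 L_1}{L_0 + L_1}.
$$

Since the right-hand side of (1.60) is a minimum that includes both stopping terms, any nonnegative solution $g$ satisfies $0 \leq g \leq h \leq L_0 L_1 / (L_0 + L_1)$. For instance, when $L_0 = L_1 = 25$, the bound equals $12.5$ and is attained by $h$ at $\pi = 1/2$.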

We now state an optimality result that parallels our earlier theorems. Let $\Xsf = (0,1)$ be the state space for the belief state $\pi$ and let $b\Xsf_+$ denote the set of bounded, Borel measurable functions from $\Xsf$ to $\RR_+$. The action space is $\Asf = \{0, 1, 2\}$, where action 0 represents accepting $f_0$, action 1 represents accepting $f_1$, and action 2 represents continuing to sample. The set of all feasible policies, denoted by $\Sigma$, is all Borel measurable $\sigma \colon \Xsf \to \{0, 1, 2\}$.

Figure 1.19: Distributions and sample paths for $f_0$ and $f_1$

We prove this theorem via a more general result in Theorem 3.2.9. For now, to illustrate the key ideas, we consider a specific example where $f_0 = \text{Beta}(3, 4)$ and $f_1 = \text{Beta}(4, 3)$, as shown in Figure 1.19. The figure also shows IID sample paths generated by the densities $f_0$ and $f_1$. The remaining parameters are set to $L_0 = 25$, $L_1 = 25$, and $c = 0.5$.

Figure 1.20: Optimal policy and loss function

Figure 1.20 shows the optimal policy and the corresponding loss function $\gmin$. The functions were computed by a version of value function iteration, starting from initial condition $g_0 \equiv 0$. The state space $\Xsf$ was discretized into a grid of 200 points, and the integral over future observations was approximated using a grid of 50 points over the support of the distributions. The left panel displays the optimal action as a function of the posterior belief $\pi$. As predicted by Theorem 1.4.1, the optimal policy has a threshold structure: there exist cutoffs $t_0, t_1 \in [0,1]$ with $t_0 \leq t_1$ such that

accept $f_0$ if $\pi \leq t_0$,
accept $f_1$ if $\pi \geq t_1$, or
continue sampling if $t_0 < \pi < t_1$.
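The computation described above can be sketched as follows. This is one possible implementation under stated assumptions, not necessarily the routine used to produce Figure 1.20. It uses only the standard library, places the 50 integration nodes at cell midpoints, includes the endpoints 0 and 1 in the belief grid for convenience, and interpolates linearly between grid points. The function names, tolerance, and iteration cap are illustrative.

```python
import math

L0, L1, c = 25.0, 25.0, 0.5      # losses and sampling cost from the text
n_pi, n_z = 200, 50              # grid sizes from the text

def beta_pdf(z, a, b):
    """Density of the Beta(a, b) distribution at z in (0, 1)."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return z ** (a - 1) * (1 - z) ** (b - 1) / norm

# Belief grid on [0, 1] (endpoints included as a numerical convenience).
pi_grid = [i / (n_pi - 1) for i in range(n_pi)]
# Midpoints of n_z equal cells in (0, 1), where both densities are positive.
z_grid = [(j + 0.5) / n_z for j in range(n_z)]
dz = 1.0 / n_z
f0_vals = [beta_pdf(z, 3, 4) for z in z_grid]
f1_vals = [beta_pdf(z, 4, 3) for z in z_grid]

def interp(x, g):
    """Linearly interpolate the grid values g at a belief x in [0, 1]."""
    s = x * (n_pi - 1)
    i = min(int(s), n_pi - 2)
    w = s - i
    return (1 - w) * g[i] + w * g[i + 1]

def continuation_value(g, pi):
    """Approximate c + integral of g(kappa(pi, z)) psi(pi, z) dz."""
    total = 0.0
    for f0z, f1z in zip(f0_vals, f1_vals):
        p = (1 - pi) * f0z + pi * f1z               # psi(pi, z)
        total += interp(pi * f1z / p, g) * p * dz   # g(kappa(pi, z)) psi dz
    return c + total

def bellman(g):
    """Apply the right-hand side of (1.60) at every grid point."""
    return [min(pi * L0, (1 - pi) * L1, continuation_value(g, pi))
            for pi in pi_grid]

# Value function iteration from g_0 = 0.
g = [0.0] * n_pi
tol, max_iter = 1e-6, 1000
for k in range(max_iter):
    g_new = bellman(g)
    err = max(abs(a - b) for a, b in zip(g_new, g))
    g = g_new
    if err < tol:
        break

def greedy_action(pi):
    """0 = accept f0, 1 = accept f1, 2 = continue sampling."""
    values = (pi * L0, (1 - pi) * L1, continuation_value(g, pi))
    return values.index(min(values))

policy = [greedy_action(pi) for pi in pi_grid]
# Smallest and largest grid points at which continuing is optimal;
# the cutoffs t0 and t1 lie within one grid step of these values.
continue_region = [pi for pi, a in zip(pi_grid, policy) if a == 2]
t0_approx, t1_approx = continue_region[0], continue_region[-1]
print(f"iterations: {k + 1}, "
      f"continuation region roughly ({t0_approx:.3f}, {t1_approx:.3f})")
```

Because $L_0 = L_1$ and the two Beta densities are mirror images of each other, the approximate cutoffs should be symmetric about $1/2$, which gives a simple check on the output. A vectorized version using an array library would be faster, but the loops here keep each step of (1.60) visible.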

Figure 1.21 shows dynamics of the belief state under the optimal policy. The belief state $\pi_n$ shifts according to the update rule $\pi_{n+1} = \kappa(\pi_n, Z_{n+1})$, with the samples $(Z_n)_{n \in \NN}$ being drawn from either $f_0$ or $f_1$. When the true distribution is $f_0$, the belief $\pi_n$ tends to drift downward toward zero; when it is $f_1$, the belief drifts upward toward one. Under the optimal policy, sampling terminates once the belief exits the continuation region $(t_0, t_1)$, at which point the corresponding hypothesis is accepted.

Figure 1.21: Belief paths under the optimal policy
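The belief dynamics under a threshold policy can be simulated with a few lines of standard-library Python. The sketch below is illustrative only: the cutoffs `t0 = 0.3` and `t1 = 0.7` are placeholders rather than the optimal cutoffs from Figure 1.20, and the function names, seed, and number of replications are assumptions made for the example.

```python
import math
import random

def beta_pdf(z, a, b):
    """Density of the Beta(a, b) distribution at z in (0, 1)."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return z ** (a - 1) * (1 - z) ** (b - 1) / norm

def kappa(pi, z):
    """Bayes update (1.58) with f0 = Beta(3, 4) and f1 = Beta(4, 3)."""
    num = pi * beta_pdf(z, 4, 3)
    return num / ((1 - pi) * beta_pdf(z, 3, 4) + num)

def simulate_beliefs(pi0, t0, t1, true_is_f1, rng, max_steps=10_000):
    """Sample until the belief leaves (t0, t1); return (path, decision)."""
    a, b = (4, 3) if true_is_f1 else (3, 4)
    pi, path = pi0, [pi0]
    while t0 < pi < t1 and len(path) <= max_steps:
        z = rng.betavariate(a, b)       # draw from the true density
        pi = kappa(pi, z)
        path.append(pi)
    if pi <= t0:
        decision = 0                    # accept f0
    elif pi >= t1:
        decision = 1                    # accept f1
    else:
        decision = None                 # step cap reached without stopping
    return path, decision

t0, t1 = 0.3, 0.7                       # placeholder cutoffs, not optimal values
rng = random.Random(1234)
results = [simulate_beliefs(0.5, t0, t1, True, rng) for _ in range(500)]
share_f1 = sum(d == 1 for _, d in results) / len(results)
mean_draws = sum(len(p) - 1 for p, _ in results) / len(results)
print(f"share accepting f1: {share_f1:.3f}, mean draws: {mean_draws:.2f}")
```

With the data generated by $f_1$, most paths should exit through the upper cutoff, in line with the upward drift described above. Widening the interval $(t_0, t_1)$ trades more draws for fewer incorrect decisions.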

1.5 Summary

The examples in this chapter illustrate the breadth of problems that dynamic programming can address, but they also expose the limits of classical methods. The firm problem and finite MDPs with constant discounting yield contracting Bellman operators, making optimality theory and computation straightforward. However, several of the extensions and models we encountered require a more general foundation: Epstein–Zin preferences and the sequential analysis problem produce Bellman operators that are not contractions; risk-sensitive and robust formulations involve nonlinear aggregators; and distributional dynamic programming operates on a non-standard value space ordered by stochastic dominance. The abstract theory developed in the next chapter provides a unified framework that accommodates all of these settings.

1.6 Chapter Notes

Richard Bellman’s (1957) monograph established dynamic programming as a unified framework for sequential optimization, introducing optimality concepts and recursive functional equations that form the foundations of this text. David Blackwell made major contributions to the mathematical theory, proving contraction properties for discounted problems with finite state spaces (Blackwell, 1962) and extending these results to Borel spaces using order-preserving operators (Blackwell, 1965). Eric Denardo (1967) further generalized contraction results to a broad class of sequential decision problems, introducing conditions that anticipate many ideas in this text.

The LP formulation for MDPs discussed in Section 1.2.1.4 has a long history. LP methods are particularly useful for constrained MDPs, where the controller faces additional restrictions on expected rewards or resource usage. Such problems arise in network routing, healthcare resource allocation, and other applications. See Altman (1999) for a textbook treatment. The LP formulation is also central to average-reward problems, where occupation measures play a key role (see, e.g., Chapter 8 of Puterman (2005)). For large state-action spaces, approximate LP methods using basis function representations can help scale the approach; Farias & Van Roy (2003) provides a foundational treatment.

The firm problem we studied in Section 1.1 is closely related to classic references such as Jovanovic (1982) and Hopenhayn & Prescott (1992), and has been extended by many authors (see, e.g., Alessandria et al. (2021) or Sterk et al. (2021)). Regarding the extensions to the firm problem in Section 1.1.3, excellent discussions of Markov decision processes with risk-sensitive objectives can be found in Bäuerle & Jaśkiewicz (2024) and Bäuerle & Jaśkiewicz (2025). We borrowed from their exposition in several parts of the chapter and return to the key ideas later in the text. The variational formula connecting risk-sensitivity to robustness is developed in Anantharam & Borkar (2017); see also Chapter 8 of Sargent & Stachurski (2025) for further discussion.

The discussion in Section 1.1.3.5 mentions dynamic (time) inconsistency. For analysis of time inconsistency in macroeconomic models and its connections to dynamic programming, see Sargent (2024), Sargent & Yang (2025), and Sargent & Yang (2025). For recent theoretical work, see Stanca (2025), Strack & Taubinsky (2026), and Bayraktar et al. (2023) on the stability of equilibria in time-inconsistent stopping.

The finite MDP framework in Section 1.2 is treated comprehensively in Puterman (2005); see also Chapter 5 of Sargent & Stachurski (2025) for an introductory treatment. The three core algorithms we presented — VFI, HPI, and OPI — are discussed in these sources. Howard policy iteration was introduced in Howard (1960). The cash management application in Section 1.2.1.5 builds on the inventory-theoretic models of money demand developed by Baumol (1952) and Tobin (1956). Our continuous time MDP model in Section 1.2.2 follows the framework described in Guo & Hernández-Lerma (2009); the uniformization technique we used to reduce continuous time problems to discrete time ones is standard (see, e.g., Puterman (2005), Chapter 11).

The optimal savings problem in Section 1.3 is also called the income fluctuation problem. It was studied in an early and influential form by Brock & Mirman (1972), who analyzed optimal growth under uncertainty with discounted CRRA utility. It has become a core building block for heterogeneous agent models following Bewley (1986), Huggett (1993), and Aiyagari (1994). Recent analysis can be found in Carroll & Shanker (2026), Li & Stachurski (2014), Lehrer & Light (2018), Light (2018), Ma et al. (2020), and Ma & Toda (2021). For a continuous-time treatment, see Achdou et al. (2022). See Stokey & Lucas (1989) and Stachurski (2022) for textbook treatments of the underlying dynamic programming theory.

The Epstein–Zin preferences discussed in Section 1.3.3 were introduced by Epstein & Zin (1989) and Weil (1990), building on earlier work of Kreps & Porteus (1978). The separation of risk aversion from the elasticity of intertemporal substitution enabled by Epstein–Zin utility has been central to the long-run risk literature initiated by Bansal & Yaron (2004), where small persistent consumption shocks get heavily priced, generating realistic equity premia using “reasonable” parameters. The equity premium puzzle was posed by Mehra & Prescott (1985); the associated risk-free rate puzzle was posed by Weil (1989). Both puzzles are discussed extensively in Chapter 13 of Ljungqvist & Sargent (2018). Chapter 7 of Sargent & Stachurski (2025) provides an introduction to recursive preferences.

The discussion of ambiguity in Section 1.2.3.3 connects to a large literature on robust decision-making under model uncertainty. The minimax formulation we presented follows the approach of Wald (1950); see also Ellsberg (1961) for a foundational discussion of ambiguity aversion and Hansen & Sargent (2011) for connections to robust control. Recent work on dynamic programming under ambiguity includes Maccheroni et al. (2006), Klibanoff et al. (2009), Marinacci & Montrucchio (2019), Neufeld et al. (2023), Cerreia-Vioglio et al. (2026), Benyamine et al. (2026), and Wang & Si (2026). An excellent survey on ambiguity and its implications for economics and finance can be found in Ilut & Schneider (2023).

The sequential analysis problem in Section 1.4 originated with Wald (1947) and Arrow et al. (1949). The Bayesian formulation we presented follows Bertsekas (1976). For treatments from frequentist and Bayesian perspectives, respectively, see Sargent & Stachurski (2026) and Sargent & Stachurski (2026).

The introduction to this chapter mentioned applications of dynamic programming to atemporal problems, such as genome sequencing and the structure of production chains. For one discussion of the former see Gu et al. (2023); for the latter see, for example, Kikuchi et al. (2021). We mentioned also that many recent applications of dynamic programming are connected to machine learning and artificial intelligence. Introductions to the literature can be found in Bertsekas (2021) and Kochenderfer et al. (2022).

Footnotes

The literature on “recursive contracts” in macroeconomics makes progress here by using a set of procedures that have been called “dynamic programming squared”. Ljungqvist & Sargent (2018) devote a suite of chapters to that topic.
In engineering it is sometimes called a closed loop control to emphasize that the control must be a measurable function of an observed history and not depend on as yet unrealized random variables.
In his formulation, Abraham Wald (1947) proceeded as a frequentist statistician, using objects from Neyman-Pearson’s hypothesis testing theory. For descriptions of the problem from the distinct frequentist and Bayesian perspectives, see Sargent & Stachurski (2026) and Sargent & Stachurski (2026).

References

Bellman, R. (1957). Dynamic programming. In Science. American Association for the Advancement of Science.
Sargent, T. J., & Stachurski, J. (2025). Dynamic Programming: Finite States. Cambridge University Press.
Modigliani, F., & Miller, M. H. (1958). The cost of capital, corporation finance and the theory of investment. The American Economic Review, 48(3), 261–297.
Smith, C. W., & Stulz, R. M. (1985). The Determinants of Firms’ Hedging Policies. Journal of Financial and Quantitative Analysis, 20(4), 391–405.
Graham, J. R., Harvey, C. R., & Puri, M. (2013). Managerial attitudes and corporate actions. Journal of Financial Economics, 109(1), 103–121. doi:10.1016/j.jfineco.2013.01.010
Kerr, S. P., Kerr, W. R., & Dalton, M. (2019). Risk attitudes and personality traits of entrepreneurs and venture team members. Proceedings of the National Academy of Sciences, 116(36), 17712–17716. doi:10.1073/pnas.1908375116
Almeida, H., Campello, M., de Castro, L. I., & Galvao Jr, A. F. (2024). A Quantile Model of Firm Investment. Technical report, National Bureau of Economic Research.
Bellemare, M. G., Dabney, W., & Rowland, M. (2023). Distributional reinforcement learning. MIT Press.
Bäuerle, N., & Jaśkiewicz, A. (2024). Markov decision processes with risk-sensitive criteria: an overview. Mathematical Methods of Operations Research, 99(1), 141–178.
Puterman, M. L. (2005). Markov decision processes: discrete stochastic dynamic programming. Wiley Interscience.
Baumol, W. J. (1952). The Transactions Demand for Cash: An Inventory Theoretic Approach. The Quarterly Journal of Economics, 66(4), 545–556.
Tobin, J. (1956). The Interest-Elasticity of Transactions Demand For Cash. The Review of Economics and Statistics, 38(3), 241–247.
Engel, K.-J., & Nagel, R. (2006). A short course on operator semigroups. Springer Science & Business Media.
Guo, X., & Hernández-Lerma, O. (2009). Continuous-time Markov decision processes. Springer.
Wald, A. (1950). Statistical Decision Functions (p. ix + 179). John Wiley & Sons.
