Markov Decision Processes - Dynamic Programming Volume I: Finite States

In this chapter we study a class of discrete time, infinite horizon dynamic programs called Markov decision processes (MDPs). This standard class of problems is broad enough to encompass many applications, including the optimal stopping problems in Chapter 4. MDPs can also be combined with reinforcement learning to tackle settings where important inputs to an MDP are not known.

5.1 Definition and Properties

In this section, we define MDPs and investigate optimality.

5.1.1 The MDP Model

We study a controller who interacts with a state process $(X_t)_{t \geq 0}$ by choosing an action path $(A_t)_{t \geq 0}$ to maximize expected discounted rewards

\EE \sum_{t \geq 0} \beta^t r(X_t, A_t),

(5.1)

taking an initial state $X_0$ as given. As with all dynamic programs, we insist that the controller is not clairvoyant: He or she cannot choose actions that depend on future states.

To formalize the problem, we fix a finite set $\Xsf$, henceforth called the state space, and a finite set $\Asf$, henceforth called the action space. In what follows, a correspondence $\Gamma$ from $\Xsf$ to $\Asf$ is a function from $\Xsf$ into $\wp(\Asf)$, the set of all subsets of $\Asf$. The correspondence is called nonempty if $\Gamma(x) \neq \emptyset$ for all $x \in \Xsf$. For example, the map $\Gamma$ defined by $\Gamma(x) = [-|x|, |x|]$ is a nonempty correspondence from $\RR$ to $\RR$.

Given $\Xsf$ and $\Asf$ , we define a Markov decision process (MDP) to be a tuple $\mM = (\Gamma, \beta, r, P)$ consisting of

(i) a nonempty correspondence $\Gamma$ from $\Xsf$ to $\Asf$ , referred to as the feasible correspondence, which in turn defines the feasible state action pairs

\Gsf \coloneq \setntn{(x, a) \in \Xsf \times \Asf}{a \in \Gamma(x)},

(ii) a constant $\beta$ in $(0, 1)$ , referred to as the discount factor,

(iii) a function $r$ from $\Gsf$ to $\RR$ , referred to as the reward function, and

(iv) a stochastic kernel $P$ from $\Gsf$ to $\Xsf$ ; that is, $P$ is a map from $\Gsf \times \Xsf$ to $\RR_+$ satisfying

\sum_{x' \in \Xsf} P(x, a, x') = 1 \quad \text{ for all } (x,a) \text{ in } \Gsf.

Here $\Gamma(x) \subset \Asf$ is the set of actions available to the controller in state $x$ . Given a feasible state action pair $(x, a)$ , reward $r(x, a)$ is received, and the next period state $x'$ is randomly drawn from $P(x,a, \cdot)$ , which is an element of $\dD(\Xsf)$ . The dynamics and reward flow are summarized in Algorithm 5.1.
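The state, action and reward loop just described can be sketched in a few lines of Python. This is a minimal illustration rather than a reproduction of Algorithm 5.1: the function name `simulate_mdp`, the dictionary-based encoding of $r$ and $P$, the fixed rule `sigma` for choosing actions, the finite truncation `horizon`, and all numbers are assumptions made for the example.

```python
import random

def simulate_mdp(x0, sigma, r, P, beta, horizon, rng):
    """Simulate one path and return the discounted reward sum up to `horizon`.

    sigma: dict state -> action (assumed feasible)
    r:     dict (state, action) -> reward
    P:     dict (state, action) -> list of next-state probabilities
    """
    x, total = x0, 0.0
    for t in range(horizon):
        a = sigma[x]                      # the action depends on the current state only
        total += beta**t * r[(x, a)]      # collect the discounted reward
        probs = P[(x, a)]
        # draw the next state from P(x, a, .)
        x = rng.choices(range(len(probs)), weights=probs)[0]
    return total

# Hypothetical two-state, two-action example (all numbers are illustrative).
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.5}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0]}
sigma = {0: 1, 1: 0}
value = simulate_mdp(0, sigma, r, P, beta=0.95, horizon=200,
                     rng=random.Random(0))
```

Averaging `value` over many independent paths would give a Monte Carlo estimate of the truncated version of (5.1) under this particular rule for choosing actions.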

The Bellman equation corresponding to $\mM$ is

v(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \quad \text{for all } \, x \in \Xsf.

(5.2)

This can be understood as an equation in the unknown function $v \in \RR^\Xsf$ . Below we define the value function $v^*$ as maximal lifetime rewards and show that $v^*$ is the unique solution to the Bellman equation in $\RR^\Xsf$ .

We can understand the Bellman equation as reducing an infinite-horizon problem to a two-period problem involving the present and the future. Current actions influence (i) current rewards and (ii) expected discounted value from future states. In every case we examine, there is a trade-off between maximizing current rewards and shifting probability mass towards states with high future rewards.

5.1.2 Examples

Here we list examples of MDPs. We will see that some models neatly fit the MDP structure, whereas others can be coaxed into the MDP framework by adding states or applying other tricks.

5.1.2.1 A Renewal Problem

Rust (1987) ignited the field of dynamic structural estimation by examining an engine replacement problem for a bus workshop. In each period the superintendent decides whether or not to replace the engine of a given bus. Replacement is costly but delaying risks unexpected failure. Rust (1987) solved this trade-off using dynamic programming.

We consider an abstract version of Rust’s problem with binary action $A_t$ . When $A_t = 1$ , the state resets to some fixed renewal state $\bar x$ in a finite set $\Xsf$ (e.g., mileage resets to zero when an engine is replaced). When $A_t = 0$ , the state updates according to $Q \in \mopx$ (e.g., mileage increases stochastically when the engine is not replaced). Given current state $x$ and action $a$ , current reward $r(x,a)$ is received. The discount factor is $\beta \in (0,1)$ .

For this problem, the Bellman equation has the form

v(x) = \max \left\{ r(x,1) + \beta v(\bar x), \; r(x,0) + \beta \sum_{x'} v(x')Q(x, x') \right\} \qquad (x \in \Xsf),

(5.3)

where the first term is the value from action 1 and the second is the value of action 0.

To set the problem up as an MDP we set $\Asf = \{0,1\}$ and $\Gamma(x) = \Asf$ for all $x \in \Xsf$ . We define

P(x, a, x') \coloneq a \1\{x' = \bar x\} + (1-a) Q(x, x') \qquad ((x,a) \in \Gsf, \; x' \in \Xsf).

(5.4)

The primitives $(\Gamma, \beta, r, P)$ form an MDP. Moreover, the renewal Bellman equation (5.3) is a special case of the MDP Bellman equation (5.2). To verify this we rewrite (5.3) as

v(x) = \max_{a \in \{0,1\}} \left\{ r(x,a) + \beta \left[ a v(\bar x) + (1-a) \sum_{x'} v(x')Q(x, x') \right] \right\}.

Inserting $P$ from (5.4) into the right-hand side of the last equation recovers the MDP Bellman equation (5.2).
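As a quick check on (5.4), the following sketch builds the renewal kernel from a Markov matrix and confirms that each row is a distribution. The function name `renewal_kernel`, the encoding of states as integers $0, \ldots, n-1$, and the three-state matrix `Q` are assumptions chosen for illustration.

```python
def renewal_kernel(Q, x_bar):
    """Build P(x, a, x') = a * 1{x' = x_bar} + (1 - a) * Q(x, x') as in (5.4)."""
    n = len(Q)
    P = {}
    for x in range(n):
        for a in (0, 1):
            P[(x, a)] = [a * (1.0 if y == x_bar else 0.0) + (1 - a) * Q[x][y]
                         for y in range(n)]
    return P

# Illustrative three-state "mileage" chain; the numbers are assumptions.
Q = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
P = renewal_kernel(Q, x_bar=0)

# Every feasible state action pair should map to a probability distribution.
row_sums = {key: sum(dist) for key, dist in P.items()}
```

With $a = 1$ the row puts all mass on the renewal state, and with $a = 0$ it coincides with the corresponding row of $Q$.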

5.1.2.2 Optimal Inventory Management

We study a firm where a manager maximizes shareholder value. To simplify the problem, we ignore exit options (so that firm value is the expected present value of profits) and assume that the firm only sells one product. Letting $\pi_t$ be profits at time $t$ and $r > 0$ be the interest rate, the value of the firm is

V_0 = \EE \sum_{t \geq 0} \beta^t \pi_t \qquad \text{ where } \quad \beta \coloneq \frac{1}{1+r}.

(5.5)

The firm faces exogenous demand process $(D_t)_{t \geq 0} \iidsim \phi \in \dD(\ZZ_+)$ . Inventory $(X_t)_{t \geq 0}$ of the product obeys

X_{t+1} = f(X_t, A_t, D_{t+1}) \qquad \text{where} \quad f(x,a,d) \coloneq (x - d)\vee 0 + a.

(5.6)

The term $A_t$ is units of stock ordered this period, which take one period to arrive. The definition of $f$ imposes the assumption that firms cannot sell more stock than they have on hand. We assume that the firm can store at most $K$ items at one time.

With the price of the firm’s product set to one, current profits are given by

\pi_t \coloneq X_t \wedge D_{t+1} - c A_t - \kappa \1\{A_t > 0\}.

Here $c$ is unit product cost and $\kappa$ is a fixed cost of ordering inventory. We take the minimum $X_t \wedge D_{t+1}$ because orders in excess of inventory are assumed to be lost rather than back-filled.

We can map our inventory problem into an MDP with state space $\Xsf \coloneq \{0, \ldots, K\}$ and action space $\Asf \coloneq \Xsf$ . The feasible correspondence $\Gamma$ is

\Gamma(x) \coloneq \{0, \ldots, K - x\},

(5.7)

which represents the set of feasible orders when the current inventory state is $x$ . The reward function is expected current profits, or

r(x, a) \coloneq \sum_{d \geq 0} (x \wedge d) \phi(d) - c a - \kappa \1\{a > 0\}.

(5.8)

The stochastic kernel from the set of feasible state action pairs $\Gsf$ induced by $\Gamma$ is, in view of (5.6),

P(x, a, x') \coloneq \PP\{ f(x, a, D) = x' \} \qquad \text{when} \quad D \sim \phi.

(5.9)

The Bellman equation for this optimal inventory problem is

v(x) = \max_{a \in \Gamma(x)} \left\{ r(x,a) + \beta \sum_{d \geq 0} v(f(x, a, d)) \phi(d) \right\},

(5.11)

at each $x \in \Xsf$ , where $r(x,a)$ is as given in (5.8) and the aim is to solve for $v$ . We introduce the Bellman operator

(Tv)(x) = \max_{a \in \Gamma(x)} \left\{ r(x,a) + \beta \sum_{d \geq 0} v(f(x, a, d)) \phi(d) \right\}.

(5.12)

This operator maps $\RR^\Xsf$ to itself and is designed so that its set of fixed points in $\RR^\Xsf$ coincides with the set of solutions to (5.11) in $\RR^\Xsf$.
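To make the operator in (5.12) concrete, here is a minimal sketch under one added assumption that is not in the text: demand is geometric, $\phi(d) = (1-p)^d p$. With that choice the infinite sums can be evaluated exactly, because every $d \geq x$ leads to the same next state $a$ and $\PP\{D \geq x\} = (1-p)^x$. The function name `make_inventory_model` and all parameter values are hypothetical.

```python
def make_inventory_model(K=10, beta=0.98, c=0.2, kappa=0.8, p=0.6):
    """Return (Gamma, r, T) for the inventory MDP under geometric demand.

    Assumption (not from the text): phi(d) = (1 - p)**d * p on d = 0, 1, ...
    """
    def phi(d):
        return (1 - p)**d * p

    def tail(x):
        return (1 - p)**x                  # P{D >= x} for geometric demand

    def Gamma(x):
        return range(K - x + 1)            # feasible orders {0, ..., K - x}

    def r(x, a):
        # E[min(x, D)]: sum over d < x, plus x on the event {D >= x}
        revenue = sum(d * phi(d) for d in range(x)) + x * tail(x)
        return revenue - c * a - kappa * (a > 0)

    def continuation(v, x, a):
        # E v(f(x, a, D)) with f(x, a, d) = max(x - d, 0) + a;
        # every d >= x leads to next state a, so the infinite sum is exact.
        return sum(v[x - d + a] * phi(d) for d in range(x)) + v[a] * tail(x)

    def T(v):
        return [max(r(x, a) + beta * continuation(v, x, a) for a in Gamma(x))
                for x in range(K + 1)]

    return Gamma, r, T

Gamma, r, T = make_inventory_model()
v = [0.0] * 11
for _ in range(50):        # a few rounds of successive approximation
    v = T(v)
```

Fifty iterations are only meant to show the operator in use; a stopping rule based on the distance between successive iterates would be the natural refinement.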

Solution to Exercise 5.1.3

$T$ is a sup norm contraction mapping on $\RR^\Xsf$ because, in view of the max-inequality lemma, for any $v, w$ in $\RR^\Xsf$ ,

\begin{aligned} |(T v)(x) - (T w)(x)| & \leq \beta \, \max_{a \in \Gamma(x)} \left| \sum_{d \geq 0} \left[ v(f(x, a, d)) - w(f(x, a, d)) \right] \phi(d) \right| \\ & \leq \beta \, \max_{a \in \Gamma(x)} \sum_{d \geq 0} \left| v(f(x, a, d)) - w(f(x, a, d)) \right| \phi(d). \end{aligned}

Since $\sum_{d \geq 0} \phi(d) = 1$ , it follows that, for arbitrary $x \in \Xsf$ ,

|(T v)(x) - (T w)(x)| \leq \beta \| v - w\|_\infty.

Taking the supremum over all $x \in \Xsf$ yields the desired result.

5.1.2.3 Example: Cake Eating

Many dynamic programming problems in economics involve a trade-off between current and future consumption. The simplest example in this class is the “cake eating” problem, where initial household wealth is given but no labor income is received. Wealth evolves according to

W_{t+1} = R(W_t - C_t) \qquad (t \geq 0),

where $C_t$ is current consumption and $R$ is the gross interest rate. The agent seeks to maximize

\EE \sum_{t \geq 0} \beta^t u(C_t) \quad \text{given } W_0 = w,

subject to $0 \leq C_t \leq W_t$ (implying that the agent cannot borrow). Consumption level $C_t$ generates utility $u(C_t)$ . Assuming that wealth takes values in a finite set $\Wsf \subset \RR_+$ , the Bellman equation for this problem can be written as

v(w) = \max_{0 \leq w' \leq Rw} \left\{ u (w - w'/R) + \beta v(w') \right\}.

(5.13)

In (5.13) we are using $w' = R(w - c)$ to obtain $c = w - w'/R$, so that the constraint $0 \leq c \leq w$ becomes $0 \leq w' \leq Rw$. The household uses (5.13) to trade off current utility of consumption against the value of future wealth.

Solution to Exercise 5.1.4

We take the action $A_t$ to be the choice of next period wealth $W_{t+1}$ , so that the action space is also $\Wsf$ . The feasible correspondence is

\Gamma(w) = \setntn{a \in \Wsf}{a \leq R w} \qquad (w \in \Wsf),

implying that $\Gsf = \setntn{(w, a) \in \Wsf \times \Wsf}{a \leq Rw}$ . The current reward is utility of consumption, or

r(w, a) = u \left( w - \frac{a}{R} \right) \qquad ((w, a) \in \Gsf).

The stochastic kernel is $P(w, a, w') = \1\{w' = a\}$ . This just states that next period wealth $w'$ is equal to the action $a$ with probability one.

5.1.2.4 Example: Optimal Stopping

The optimal stopping problem we studied in Chapter 4 can be framed as an MDP. On one hand, doing so allows us to apply results obtained for MDPs to optimal stopping. On the other hand, expressing an optimal stopping problem as an MDP requires an additional state variable, which complicates the exposition. The next exercise helps to illustrate the key ideas.

Let’s focus on the job search problem with Markov state discussed in Section 3.3.1 (although the arguments for the general optimal stopping problem in Section 4.1.1.1 are very similar). As before, $\Wsf$ is the set of wage outcomes. Since we need the symbol $P$ for other purposes, we let $Q$ be the Markov matrix for wages, so that $(W_t)_{t\geq 0}$ is $Q$ -Markov on $\Wsf$ .

To express the job search problem as an MDP, let $\Xsf = \{0,1\} \times \Wsf$ be a state space whose typical element is $(e, w)$ , with $e$ representing either unemployment ( $e=0$ ) or employment ( $e=1$ ) and $w$ being the current wage offer. An action $a \in \Asf \coloneq \{0, 1\}$ indicates rejection or acceptance of the current wage offer.

Solution to Exercise 5.1.5

To impose that workers never leave the firm, we require $a \geq e$ . Thus, the feasible correspondence is

\Gamma(x) = \Gamma(e, w) = \setntn{a \in \{0, 1\}}{a \geq e} .

The set of feasible state action pairs is $\Gsf = \setntn{ ((e, w), a) \in \Xsf \times \Asf}{a \geq e}$ . The reward function is

r(x,a) = r((e, w), a) = a w + (1-a) c.

Regarding the stochastic kernel, we need to define state transition probabilities for all feasible state action pairs. Letting $P[(e, w), a, (e', w')]$ be the probability of transitioning to state $(e', w')$ given current state $(e,w)$ and current action $a \geq e$, we set

P[(0, w), a, (e', w')] = \1\{e'=a\} \cdot [ \, a \1\{w' = w\} + (1-a) Q(w, w') \, ]

(5.14)

and $P[(1, w), 1, (e', w')] = \1\{e'=1\} \1\{w' = w\}$ . Equation (5.14) says that if $a=0$ then $e'=0$ and the next wage is drawn from $Q(w, w')$ , whereas if $a=1$ then $e'=1$ and the next wage is $w$ . You can verify that $P$ is a stochastic kernel from $\Gsf$ to $\Xsf$ .

To double-check that these definitions work, we can verify that they lead to the same Bellman equations that we saw in Section 3.3.1. Under the definitions of $\Gamma$, $r$, and $P$ just provided, we have $v(1, w) = w + \beta v(1, w)$, since an employed agent remains in state $(1, w)$ with probability one. This implies that $v(1, w) = w/(1-\beta)$, which is what we expect for lifetime value of an agent employed with wage $w$.

Moreover, the Bellman equation for $v(0, w)$ agrees with the one we obtained for an unemployed agent. To see this when $e=0$ , observe that the Bellman equation is

\begin{aligned} v(0, w) & = \max_{a \in \{0, 1\}} \left\{ a w + (1-a) c + \beta \sum_{(e', w')} v(e', w') P[(0, w), a, (e', w')] \right\} \\ & = \max_{a \in \{0, 1\}} \left\{ a w + (1-a) c + \beta \left[ a v(a, w) + (1-a) \sum_{w'} v(a, w') Q(w, w') \right] \right\}, \end{aligned}

where the second equation follows from (5.14). (You can see this by checking the cases $a=0$ and $a=1$ .) Rearranging and using $v(1, w) = w/(1-\beta)$ now gives

v(0, w) = \max \left\{ \frac{w}{1-\beta} ,\, c + \beta \, \sum_{w'} \, v(0, w') Q(w, w') \right\}.

(5.15)

This is the Bellman equation for an unemployed agent from the job search problem we saw previously.
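The claim that $P$ is a stochastic kernel from $\Gsf$ to $\Xsf$ can also be checked numerically. The sketch below builds the kernel from (5.14) together with the rule for employed states. It assumes wages are represented by their indices $0, \ldots, n-1$, and the function name `job_search_kernel` and the matrix `Q` are illustrative.

```python
def job_search_kernel(Q):
    """Map each feasible ((e, w), a) to a dict over next states (e', w')."""
    n = len(Q)
    P = {}
    for w in range(n):
        for a in (0, 1):               # unemployed: both actions are feasible
            P[((0, w), a)] = {
                (e1, w1): (1.0 if e1 == a else 0.0)
                          * (a * (1.0 if w1 == w else 0.0) + (1 - a) * Q[w][w1])
                for e1 in (0, 1) for w1 in range(n)}
        # employed: only a = 1 is feasible and the state (1, w) is absorbing
        P[((1, w), 1)] = {(e1, w1): 1.0 if (e1, w1) == (1, w) else 0.0
                          for e1 in (0, 1) for w1 in range(n)}
    return P

# Illustrative two-wage Markov matrix (numbers are assumptions).
Q = [[0.7, 0.3],
     [0.4, 0.6]]
P = job_search_kernel(Q)
row_sums = {key: sum(dist.values()) for key, dist in P.items()}
```

Note that the infeasible pairs, such as $((1, w), 0)$, never appear as keys, which mirrors the restriction $a \geq e$.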

5.1.3 Optimality

In this section, we return to the general MDP setting of Section 5.1.1, define optimal policies and state our main optimality result. As was the case for job search, actions are governed by policies, which are maps from states to actions (see, in particular, Section 1.3.1.3, where policies were introduced).

5.1.3.1 Policies and Lifetime Values

Let $\mM = (\Gamma, \beta, r, P)$ be an MDP. The set of feasible policies corresponding to $\mM$ is

\Sigma \coloneq \setntn{\sigma \in \Asf^\Xsf} {\sigma(x) \in \Gamma(x) \text{ for all } x \in \Xsf}.

(5.16)

If we select a policy $\sigma$ from $\Sigma$ , it is understood that we respond to state $X_t$ with action $A_t \coloneq \sigma(X_t)$ at every date $t$ . As a result, the state evolves by drawing $X_{t+1}$ from $P(X_t, \sigma(X_t), \cdot)$ at each $t \geq 0$ . In other words, $(X_t)_{t \geq 0}$ is $P_\sigma$ -Markov when

P_\sigma(x, x') \coloneq P(x, \sigma(x), x') \qquad (x, x' \in \Xsf).

Note that $P_\sigma \in \mopx$ . Fixing a policy “closes the loop” in the state transition process and defines a Markov chain for the state.

Under the policy $\sigma$ , rewards at state $x$ are $r(x, \sigma(x))$ . If

r_\sigma(x) \coloneq r(x, \sigma(x)) \quad \text{and} \quad \EE_x \coloneq \EE[ \; \cdot \; \given X_0 = x],

then the lifetime value of following $\sigma$ starting from state $x$ can be written as

v_\sigma (x) = \EE_x \sum_{t \geq 0} \beta^t r_\sigma(X_t) \quad \text{where } (X_t) \text{ is } P_\sigma \text{-Markov with } X_0 = x.

(5.17)

Since $\beta < 1$ , applying Lemma 3.2.1 to this expression yields

v_\sigma = \sum_{t \geq 0} \beta^t P_\sigma^t \, r_\sigma = (I - \beta P_\sigma)^{-1} \, r_\sigma .

(5.18)

Analogous to the optimal stopping case, we call $v_\sigma$ the $\sigma$ -value function. We also call $v_\sigma(x)$ the lifetime value of policy $\sigma$ conditional on initial state $x$ .
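Formula (5.18) translates directly into a linear solve. The sketch below uses a small hand-written Gaussian elimination so that it runs without external libraries; in practice a numerical linear algebra routine would be the usual choice. The helper names `solve` and `policy_value` and the two-state example are assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        for i in range(col + 1, n):
            factor = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= factor * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def policy_value(r_sigma, P_sigma, beta):
    """Return v_sigma = (I - beta P_sigma)^{-1} r_sigma as in (5.18)."""
    n = len(r_sigma)
    A = [[(1.0 if i == j else 0.0) - beta * P_sigma[i][j] for j in range(n)]
         for i in range(n)]
    return solve(A, r_sigma)

# Illustrative two-state example (numbers are assumptions).
r_sigma = [1.0, 2.0]
P_sigma = [[0.9, 0.1],
           [0.5, 0.5]]
v_sigma = policy_value(r_sigma, P_sigma, beta=0.9)
```

The result should satisfy $v_\sigma = r_\sigma + \beta P_\sigma v_\sigma$ up to rounding error, which gives an easy sanity check.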

Solution to Exercise 5.1.6

We need to show that $v_\sigma = (I-\beta P_\sigma)^{-1} r_\sigma$ obeys $v_1 \leq v_\sigma \leq v_2$ where $v_1, v_2$ are as defined in the exercise. Regarding the upper bound, let $\bar r \coloneq \| r \|_\infty$ . We have

(I-\beta P_\sigma)^{-1} \, r_\sigma \leq (I-\beta P_\sigma)^{-1} \, \bar r \, \1 = \bar r \sum_{t \geq 0} (\beta P_\sigma)^t \1 = \frac{\bar r}{1 - \beta} = v_2.

A similar argument shows that $v_1 \leq v_\sigma$ .

Another way to compute $v_\sigma$ is to use the policy operator $T_\sigma$ corresponding to $\sigma$ , which is defined at $v \in \RR^\Xsf$ by

(T_\sigma \, v)(x) = r(x, \sigma(x)) + \beta \sum_{x'} v(x') P(x, \sigma(x), x') \qquad (x \in \Xsf).

(5.19)

( $T_\sigma$ is analogous to the policy operator defined for the optimal stopping problem in Section 4.1.1.3.) In vector notation,

T_\sigma \, v = r_\sigma + \beta P_\sigma \, v.

(5.20)

The next exercise shows how $T_\sigma$ can be put to work.

Solution to Exercise 5.1.7

Fix $\sigma \in \Sigma$ . It is obvious that $T_\sigma$ is a self-map on $\RR^\Xsf$ and $T_\sigma$ is clearly order-preserving, since $v \leq w$ implies $P_\sigma v \leq P_\sigma w$ and hence $T_\sigma v \leq T_\sigma w$ .

Also, $T_\sigma$ is a contraction of modulus $\beta$ on $\RR^\Xsf$ under the supremum norm, since, for any $v, w$ in $\RR^\Xsf$ we have

\begin{aligned} |(T_\sigma v)(x) -(T_\sigma w)(x)| & = \beta \, \left| \sum_{x'} P(x, \sigma(x), x') v(x') - \sum_{x'} P(x, \sigma(x), x') w(x') \right| \\ & \leq \sum_{x'} P(x, \sigma(x), x') \beta \, \left| v(x') - w(x') \right| \leq \beta \| v - w\|_\infty. \end{aligned}

Taking the supremum over all $x \in \Xsf$ yields the desired result. This contraction property combined with Banach’s fixed-point theorem implies that $T_\sigma$ has a unique fixed point.

Now suppose that $v$ is the unique fixed point of $T_\sigma$ . Then $v = r_\sigma + \beta P_\sigma v$ . But then $v = (I-\beta P_\sigma)^{-1} r_\sigma$ . Hence $v = v_\sigma$ . This establishes all claims in the lemma.

Computationally, this means that we can pick $v \in \RR^\Xsf$ and iterate with $T_\sigma$ to obtain an approximation to $v_\sigma$ .
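A minimal sketch of this successive approximation scheme follows. The function names, the zero initial guess, the stopping tolerance, and the two-state example are assumptions chosen so that the answer can be verified by hand.

```python
def T_sigma(v, r_sigma, P_sigma, beta):
    """Apply the policy operator: T_sigma v = r_sigma + beta * P_sigma v."""
    n = len(v)
    return [r_sigma[x] + beta * sum(P_sigma[x][y] * v[y] for y in range(n))
            for x in range(n)]

def iterate_policy_operator(r_sigma, P_sigma, beta, tol=1e-10, max_iter=10_000):
    """Iterate T_sigma from v = 0 until successive iterates are within tol."""
    v = [0.0] * len(r_sigma)
    for _ in range(max_iter):
        v_new = T_sigma(v, r_sigma, P_sigma, beta)
        if max(abs(a - b) for a, b in zip(v_new, v)) < tol:
            return v_new
        v = v_new
    return v

# Illustrative chain that alternates between two states, paying 1 in state 0.
r_sigma = [1.0, 0.0]
P_sigma = [[0.0, 1.0],
           [1.0, 0.0]]
v_sigma = iterate_policy_operator(r_sigma, P_sigma, beta=0.5)
```

In this example $v_\sigma(0) = 1 + 0.5 \, v_\sigma(1)$ and $v_\sigma(1) = 0.5 \, v_\sigma(0)$, so the iterates approach $v_\sigma = (4/3, 2/3)$.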

The next exercise extends Exercise 5.1.8 and aids interpretation of policy operators. It tells us that $(T_\sigma^k \, v)(x)$ is the payoff from following policy $\sigma$ and starting in state $x$ when lifetime is truncated to the finite horizon $k$ and $v$ provides a terminal payoff in each state.

5.1.3.2 Defining Optimality

Given MDP $\mM = (\Gamma, \beta, r, P)$ with $\sigma$ -value functions $\{v_\sigma\}_{\sigma \in \Sigma}$ , the value function corresponding to $\mM$ is defined as $v^* \coloneq \vee_{\sigma \in \Sigma} \, v_\sigma$ , where, as usual, the maximum is pointwise. More explicitly,

v^*(x) = \max_{\sigma \in \Sigma} v_\sigma(x) \qquad (x \in \Xsf).

(5.21)

This is consistent with our definition of the value function in the optimal stopping case. It is the maximal lifetime value we can extract from each state using feasible behavior. The maximum in (5.21) exists at each $x$ because $\Sigma$ is finite.

A policy $\sigma \in \Sigma$ is called optimal for $\mM$ if $v_\sigma = v^*$ . In other words, a policy is optimal if its lifetime value is maximal at each state.

Our optimality results are easier to follow with some additional terminology. To start, given $v \in \RR^\Xsf$ , we define a policy $\sigma \in \Sigma$ to be $v$ -greedy if

\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \quad \text{for all } x \in \Xsf.

(5.22)

In essence, a $v$ -greedy policy treats $v$ as the correct value function and sets all actions accordingly.
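The definition in (5.22) amounts to one small maximization per state. The sketch below assumes states and actions are integers, `Gamma[x]` is a list of feasible actions, and `r` and `P` are dictionaries keyed by feasible pairs; ties are broken in favor of the first maximizer, which is one arbitrary selection from the argmax.

```python
def greedy(v, Gamma, r, P, beta):
    """Return a v-greedy policy as a list sigma[x]."""
    n = len(v)

    def q(x, a):
        # value of action a at state x when v is used as the continuation value
        return r[(x, a)] + beta * sum(P[(x, a)][y] * v[y] for y in range(n))

    return [max(Gamma[x], key=lambda a: q(x, a)) for x in range(n)]

# Illustrative MDP: state 1 is absorbing, state 0 can stay (a=0) or move (a=1).
Gamma = [[0, 1], [0]]
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0}
P = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0], (1, 0): [0.0, 1.0]}

sigma_myopic = greedy([0.0, 0.0], Gamma, r, P, beta=0.9)     # ignores the future
sigma_patient = greedy([0.0, 10.0], Gamma, r, P, beta=0.9)   # values state 1 highly
```

With $v = 0$ the greedy policy simply maximizes current reward, while a $v$ that assigns high value to state 1 makes moving there attractive despite the lower current reward.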

Solution to Exercise 5.1.10

Fix $v \in \RR^\Xsf$ and take $\hat \sigma$ to be $v$-greedy, so that

\hat \sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \quad \text{for all } x \in \Xsf .

(5.23)

If $\sigma$ is any other feasible policy, then

r(x, \hat \sigma(x)) + \beta \sum_{x'} v(x') P(x, \hat \sigma(x), x') \geq r(x, \sigma(x)) + \beta \sum_{x'} v(x') P(x, \sigma(x), x')

at all $x$. In operator form, this is $T_{\hat \sigma} \, v \geq T_\sigma \, v$. Since $\sigma$ was an arbitrary feasible policy, we have shown that $T_{\hat \sigma} \, v$ is the greatest element of $\{T_\sigma \, v\}_{\sigma \in \Sigma}$.

A similar argument replacing argmax with argmin in (5.23) shows that a least element also exists.

Bellman’s principle of optimality is said to hold for the MDP $\mM$ if

\sigma \in \Sigma \text{ is optimal for } \mM \quad \iff \quad \sigma \text{ is } v^*\text{-greedy}.

The Bellman operator corresponding to $\mM$ is the self-map $T$ on $\RR^\Xsf$ defined by

(Tv)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \qquad (x \in \Xsf).

(5.24)

Obviously, $Tv=v$ if and only if $v$ satisfies the Bellman equation (5.2).

Solution to Exercise 5.1.11

Fix $v \in \RR^\Xsf$ . Part (i) follows from the fact that $\Gamma(x)$ is finite and nonempty at each $x \in \Xsf$ . Hence we can select an element $a^*_x$ from the argmax in the definition of a $v$ -greedy policy at each $x$ in $\Xsf$ . The resulting policy is $v$ -greedy. For part (ii) we need to show that $\sigma \in \Sigma$ is $v$ -greedy if and only if

r(x, \sigma(x)) + \beta \sum_{x'} v(x') P(x, \sigma(x), x') = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\}

for all $x \in \Xsf$ . But this is immediate from the definition.

Regarding part (iii), it follows from the definitions that $(T_\sigma \, v)(x) \leq (Tv)(x)$ for all $x \in \Xsf$ . At the same time, for any $v$ -greedy $\sigma \in \Sigma$ , we have $(T_\sigma \, v)(x) = (Tv)(x)$ for all $x$ . Hence $Tv = \vee_\sigma \, T_\sigma \, v$ , as was to be shown.

The last part of Exercise 5.1.11 tells us that $T$ is the pointwise maximum of $\{T_\sigma\}_{\sigma \in \Sigma}$ , which can be expressed as $T = \vee_\sigma \, T_\sigma$ . Figure 5.1 illustrates this relationship in one dimension.

Figure 5.1: $T$ is the pointwise maximum of $\{T_\sigma\}_{\sigma \in \Sigma}$ (one-dimensional setting)
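The identity $Tv = \vee_\sigma \, T_\sigma \, v$ can be checked by brute force when $\Sigma$ is small. The sketch below enumerates every feasible policy of a hypothetical two-state, two-action MDP and compares the pointwise maximum of the images $T_\sigma v$ with $Tv$. The data and the test vector `v` are assumptions.

```python
from itertools import product

def T_sigma(v, sigma, r, P, beta):
    """Policy operator for the policy sigma (a tuple indexed by state)."""
    n = len(v)
    return [r[(x, sigma[x])]
            + beta * sum(P[(x, sigma[x])][y] * v[y] for y in range(n))
            for x in range(n)]

def T(v, Gamma, r, P, beta):
    """Bellman operator: maximize over feasible actions at each state."""
    n = len(v)
    return [max(r[(x, a)] + beta * sum(P[(x, a)][y] * v[y] for y in range(n))
                for a in Gamma[x])
            for x in range(n)]

# Illustrative MDP (numbers are assumptions).
Gamma = [[0, 1], [0, 1]]
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.5}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0]}
beta, v = 0.95, [3.0, -1.0]

Sigma = list(product(*Gamma))       # every feasible policy, as a tuple
images = [T_sigma(v, s, r, P, beta) for s in Sigma]
envelope = [max(img[x] for img in images) for x in range(len(v))]
```

Enumeration is only viable for tiny examples, since the number of policies grows as the product of the sizes of the sets $\Gamma(x)$.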

Solution to Exercise 5.1.12

This result follows from Lemma 2.2.3. For the sake of the exercise, we also provide a direct proof:

Fix $v, w \in \RR^\Xsf$ and $x \in \Xsf$ . By Exercise 5.1.11 and the max-inequality lemma, we have

\begin{aligned} |(T v)(x) -(T w)(x)| & = \left| \max_{\sigma \in \Sigma} (T_\sigma \, v)(x) - \max_{\sigma \in \Sigma} (T_\sigma \, w)(x) \right| \\ & \leq \max_{\sigma \in \Sigma} \left| (T_\sigma \, v)(x) - (T_\sigma \, w)(x) \right| \leq \max_{\sigma \in \Sigma} \| T_\sigma \, v - T_\sigma \, w\|_\infty. \end{aligned}

Applying contractivity of $T_\sigma$ (Exercise 5.1.7) to each term in the last maximum and then taking the maximum over $x \in \Xsf$, we get $\| Tv - Tw \|_\infty \leq \beta \| v - w\|_\infty$.

5.1.3.3 Optimality Theory

We can now state our main optimality result for MDPs.

While Proposition 5.1.1 is a special case of later results (see Section 8.1.3.3), a direct proof is not difficult and we now provide one for interested readers.

Proof of Proposition 5.1.1.

In Exercise 5.1.12 we showed that $T$ is a contraction mapping on the closed set $\RR^\Xsf$ . Hence $T$ is globally stable on $\RR^\Xsf$ and therefore has a unique fixed point $\bar v \in \RR^\Xsf$ . Our first claim is that $\bar v = v^*$ . We show $\bar v \leq v^*$ and then $\bar v \geq v^*$ .

For the first inequality, let $\sigma \in \Sigma$ be $\bar v$ -greedy. Recalling Exercise 5.1.11, we have $T_\sigma\, \bar v = T \bar v = \bar v$ . Hence $\bar v$ is also a fixed point of $T_\sigma$ . But the only fixed point of $T_\sigma$ in $\RR^\Xsf$ is $v_\sigma$ , so $\bar v = v_\sigma$ . But then $\bar v \leq v^*$ , since, by definition, $v^* = \vee_\sigma \, v_\sigma$ . This is our first inequality.

As for the second inequality, fix $\sigma \in \Sigma$ and observe that $T_\sigma \, v \leq T v$ for all $v \in \RR^\Xsf$ . Since $T$ is order-preserving and globally stable, Proposition 2.2.7 implies that $v_\sigma \leq \bar v$ . Taking the supremum over $\sigma \in \Sigma$ yields $v^* \leq \bar v$ .

Hence $v^*$ is a fixed point of $T$ in $\RR^\Xsf$ . Since $T$ is globally stable on $\RR^\Xsf$ , the remaining claims in parts (i)–(ii) follow immediately.

As for part (iii), it follows from Exercise 5.1.11 and part (i) of this theorem that

\sigma \text{ is } v^* \text{-greedy} \quad \iff \quad T_\sigma \, v^* = T v^* \quad \iff \quad T_\sigma \, v^* = v^*.

The right hand side of this expression tells us that $v^*$ is a fixed point of $T_\sigma$ . But the only fixed point of $T_\sigma$ is $v_\sigma$ , so the right hand side is equivalent to the statement $v_\sigma = v^*$ . Hence, by this chain of logic and the definition of optimality,

\sigma \text{ is } v^* \text{-greedy} \iff v^* = v_\sigma \iff \text{ } \sigma \text{ is optimal}.

(5.25)

Hence (iii) holds.

Part (iv) is left to Exercise 5.1.13. ◻

Figure 5.2 illustrates Proposition 5.1.1 in an abstract case, where $\Xsf$ is a singleton $\{x\}$. We write $v$ instead of $v(x)$ for the value of state $x$ and place $v$ on the horizontal axis. In the figure, the set of policies is $\Sigma = \{\sigma', \sigma''\}$. For given $\sigma \in \Sigma$, the map $T_\sigma$ is an affine function $T_\sigma \, v = r_\sigma + \beta P_\sigma \, v$ and the fixed point is $v_\sigma$. The Bellman operator $T$ is the upper envelope of the functions $\{T_\sigma\}$, as shown in (iii) of Exercise 5.1.11. By definition,

(i) $v^*$ is the largest of these fixed points, which equals $v_{\sigma''}$ , and

(ii) $\sigma''$ is the optimal policy, since $v_{\sigma''} = v^*$ .

In accordance with Proposition 5.1.1, $v^*$ is also the fixed point of the Bellman operator.

Figure 5.2: Illustration of optimality for MDPs

It is important to understand the significance of (iii) in Proposition 5.1.1. Greedy policies are relatively easy to compute, in the sense that solving (5.22) at each $x$ is easier than trying to directly solve the problem of maximizing lifetime value, since $\Sigma$ is in general far larger than $\Gamma(x)$. Part (iii) tells us that solving the overall problem reduces to computing a $v$-greedy policy with the right choice of $v$. As with optimal stopping problems, that choice is the value function $v^*$. Intuitively, $v^*$ assigns a “correct” value to each state, in the sense of maximal lifetime value the controller can extract, so using $v^*$ to calculate greedy policies leads to the optimal outcome.

5.1.4 Algorithms

In previous chapters we solved job search and optimal stopping problems using value function iteration. In this section, we present a generalization suitable for arbitrary MDPs and then discuss two important alternatives.

5.1.4.1 Value Function Iteration

Value function iteration (VFI) for MDPs is similar to VFI for the job search model: We use successive approximation on $T$ to compute an approximation $v_k$ to the value function $v^*$ and then take a $v_k$ -greedy policy. The general procedure is given by Algorithm 5.2.

The fact that the sequence $(v_k)_{k \geq 0}$ produced by VFI converges to $v^*$ is immediate from Proposition 5.1.1 (as the tolerance $\tau$ is taken toward zero). It is also true that the greedy policy produced in the last step is approximately optimal when $\tau$ is small, and exactly optimal when $k$ is sufficiently large. Proofs are given in Chapter 8, where we examine VFI in a more general setting.

VFI is robust, easy to understand and easy to implement. These properties explain its enduring popularity. At the same time, in terms of efficiency, VFI is often dominated by alternative algorithms that we now describe.
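A minimal sketch of VFI along the lines described above is given next. It is not a transcription of Algorithm 5.2: the zero initial condition, the sup-norm stopping rule with tolerance `tol`, the function names, and the toy MDP are all assumptions.

```python
def bellman(v, Gamma, r, P, beta):
    """Apply the Bellman operator T from (5.24) to v."""
    n = len(v)
    return [max(r[(x, a)] + beta * sum(P[(x, a)][y] * v[y] for y in range(n))
                for a in Gamma[x])
            for x in range(n)]

def greedy(v, Gamma, r, P, beta):
    """Return a v-greedy policy (first maximizer at each state)."""
    n = len(v)

    def q(x, a):
        return r[(x, a)] + beta * sum(P[(x, a)][y] * v[y] for y in range(n))

    return [max(Gamma[x], key=lambda a: q(x, a)) for x in range(n)]

def vfi(Gamma, r, P, beta, tol=1e-8, max_iter=100_000):
    """Iterate with T until successive iterates are within tol, then go greedy."""
    v = [0.0] * len(Gamma)
    for _ in range(max_iter):
        v_new = bellman(v, Gamma, r, P, beta)
        err = max(abs(a - b) for a, b in zip(v_new, v))
        v = v_new
        if err < tol:
            break
    return v, greedy(v, Gamma, r, P, beta)

# Illustrative MDP: state 1 is absorbing, state 0 can stay (a=0) or move (a=1).
Gamma = [[0, 1], [0]]
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0}
P = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0], (1, 0): [0.0, 1.0]}
v_star, sigma_star = vfi(Gamma, r, P, beta=0.9)
```

In this toy example state 1 pays 2 forever, so $v^*(1) = 2/(1 - 0.9) = 20$. Moving from state 0 to state 1 is worth $0.9 \times 20 = 18$, which exceeds the value 10 of staying in state 0 forever, so the computed values should be close to $(18, 20)$ with the policy choosing action 1 in state 0.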

5.1.4.2 Howard Policy Iteration

Unlike VFI, Howard policy iteration (HPI) computes optimal policies by iterating between computing the value of a given policy and computing the greedy policy associated with that value. The full technique is described in Algorithm 5.3.

A visualization of HPI is given in Figure 5.3, where $\sigma$ is the initial choice. Next, we compute the lifetime value $v_\sigma$ , and then the $v_\sigma$ -greedy policy $\sigma'$ , and so on. The computation of lifetime value is called the policy evaluation step, whereas the computation of greedy policies is called policy improvement.

HPI has two very attractive features. One is that, in a finite state setting, the algorithm always converges to an exact optimal policy in a finite number of steps, regardless of the initial condition. We prove this fact in a more general setting in Chapter 8. The second is that the rate of convergence is faster than VFI, as will be shown in Section 5.1.4.3.
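The evaluation and improvement steps can be sketched as follows. This is an illustration under stated assumptions rather than a copy of Algorithm 5.3: the initial policy picks the first feasible action in each state, the loop stops when the greedy policy repeats, and policy evaluation uses a small hand-written Gaussian elimination so that the code needs no external libraries.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        for i in range(col + 1, n):
            factor = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= factor * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def policy_value(sigma, r, P, beta):
    """Policy evaluation: v_sigma = (I - beta P_sigma)^{-1} r_sigma."""
    n = len(sigma)
    A = [[(1.0 if x == y else 0.0) - beta * P[(x, sigma[x])][y]
          for y in range(n)] for x in range(n)]
    b = [r[(x, sigma[x])] for x in range(n)]
    return solve(A, b)

def greedy(v, Gamma, r, P, beta):
    """Policy improvement: return a v-greedy policy (first maximizer)."""
    n = len(v)

    def q(x, a):
        return r[(x, a)] + beta * sum(P[(x, a)][y] * v[y] for y in range(n))

    return [max(Gamma[x], key=lambda a: q(x, a)) for x in range(n)]

def hpi(Gamma, r, P, beta, max_iter=1000):
    sigma = [acts[0] for acts in Gamma]          # arbitrary initial policy
    for _ in range(max_iter):
        v_sigma = policy_value(sigma, r, P, beta)
        new_sigma = greedy(v_sigma, Gamma, r, P, beta)
        if new_sigma == sigma:                   # greedy policy repeats: stop
            return v_sigma, sigma
        sigma = new_sigma
    return policy_value(sigma, r, P, beta), sigma

# Illustrative MDP: state 1 is absorbing, state 0 can stay (a=0) or move (a=1).
Gamma = [[0, 1], [0]]
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0}
P = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0], (1, 0): [0.0, 1.0]}
v_star, sigma_star = hpi(Gamma, r, P, beta=0.9)
```

On this example the initial policy stays in state 0 and has value $(10, 20)$; a single improvement step switches to moving, whose value is $(18, 20)$, and the next greedy step leaves the policy unchanged.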

Figure 5.4 gives another illustration, presented in the one-dimensional setting that we used for Figure 5.2. In this illustration, we imagine that there are many feasible policies, and hence many functions in $\{T_\sigma\}$, so that their upper envelope, which is the Bellman operator, becomes a smoother curve. The figure shows the update from $v_\sigma$ to the next lifetime value $v_{\sigma'}$, via the following two steps:

(i) Take $\sigma'$ to be $v_\sigma$ -greedy, which means that $T_{\sigma'} v_\sigma = T v_\sigma$ (see Exercise 5.1.11).

(ii) Take $v_{\sigma'}$ to be the fixed point of $T_{\sigma'}$ .

The next step, from $v_{\sigma'}$ to $v_{\sigma''}$ is analogous.

Comparison of this figure with Figure 2.1 suggests that HPI is an implementation of Newton’s method, applied to the Bellman operator. We confirm this in Section 5.1.4.3.

Figure 5.4: HPI as a version of Newton’s method

5.1.4.3 HPI as Newton Iteration

In discussing the connection between HPI and Newton iteration, one issue is that $T$ is not always differentiable, as seen in Figure 5.2. But $T$ is convex, and this lets us substitute subgradients for derivatives. Once we make this modification, HPI and Newton iteration are identical, as we now show.

First, recall that, given a self-map $T$ from $S \subset \RR^n$ to itself, an $n \times n$ matrix $D$ is called a subgradient of $T$ at $v \in S$ if

Tu \geq Tv + D (u - v) \quad \text{for all } u \in S.

(5.26)

Figure 5.5 illustrates the definition in one dimension, where $D$ is just a scalar determining the slope of a tangent line at $v$ . In the left subfigure, $T_1$ is convex and differentiable at $v$ , which means that only one subgradient exists (since any other choice of slope implies that the inequality in (5.26) will fail for some $u$ ). In the right subfigure, $T_2$ is convex but nondifferentiable at $v$ , so multiple subgradients exist.

Figure 5.5: Subgradients of convex functions

In the next result, we take $(\Gamma, \beta, r, P)$ to be a given MDP and let $T$ be the associated Bellman operator.

Now let’s consider Newton’s method applied to the problem of finding the fixed point of $T$ . Since $T$ is nondifferentiable and convex, we replace the Jacobian in Newton’s method (see (2.2)) with the subgradient. This leads us to iterate on

v_{k+1} = Qv_k \quad \text{where} \quad Qv \coloneq (I - \beta P_\sigma)^{-1} (Tv - \beta P_\sigma v).

In the definition of $Q$, the policy $\sigma$ is $v$-greedy. Using $Tv = T_\sigma v$, the map $Q$ reduces to $Qv = (I - \beta P_\sigma)^{-1} r_\sigma$, which is exactly the update step to produce the next $\sigma$-value function in HPI (i.e., the lifetime value of a $v$-greedy policy).
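For readers who want the intermediate steps, the algebra can be spelled out as follows. This is a sketch that assumes, as the substitution above does, that $\beta P_\sigma$ serves as a subgradient of $T$ at $v_k$ whenever $\sigma$ is $v_k$-greedy. Newton's method applied to the problem of finding a zero of $v \mapsto Tv - v$ then gives

\begin{aligned} v_{k+1} & = v_k - (\beta P_\sigma - I)^{-1} (T v_k - v_k) \\ & = (I - \beta P_\sigma)^{-1} \left[ (I - \beta P_\sigma) v_k + T v_k - v_k \right] \\ & = (I - \beta P_\sigma)^{-1} (T v_k - \beta P_\sigma v_k), \end{aligned}

which is $Q v_k$. Since $\sigma$ is $v_k$-greedy, we have $T v_k = T_\sigma v_k = r_\sigma + \beta P_\sigma v_k$, and hence

Q v_k = (I - \beta P_\sigma)^{-1} (r_\sigma + \beta P_\sigma v_k - \beta P_\sigma v_k) = (I - \beta P_\sigma)^{-1} r_\sigma = v_\sigma.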

The fact that HPI is a version of Newton’s method suggests that its iterates $(v_k)_{k \geq 0}$ enjoy quadratic convergence. This is indeed the case: Under mild conditions, one can show there exists a constant $N$ such that, for all $k \geq 1$,

\| v_{k+1} - v_k \| \leq N \| v_k - v_{k-1} \|^2

(5.27)

(see, e.g., Puterman (2005), Theorem 6.4.8). Hence HPI enjoys both a fast convergence rate and the robustness of global convergence.

However, HPI is not always optimal in terms of efficiency, since the size of the constant term in (5.27) also matters. This term can be large because, at each step, the update from $v_\sigma$ to $v_{\sigma'}$ requires computing the exact lifetime value $v_{\sigma'}$ of the $v_\sigma$ -greedy policy $\sigma'$ . Computing this fixed point exactly can be computationally expensive in high dimensions.

One way around this issue is to forgo computing the fixed point $v_{\sigma'}$ exactly, replacing it with an approximation. Section 5.1.4.4 takes up this idea.

5.1.4.4 Optimistic Policy Iteration

Optimistic policy iteration (OPI) is an algorithm that borrows from both VFI and HPI. In essence, the algorithm is the same as HPI except that, instead of computing the full value $v_\sigma$ of a given policy, the approximation $T_\sigma^m v$ from Exercise 5.1.9 is used instead. Algorithm 5.4 clarifies.

In the algorithm, the policy operator $T_{\sigma_k}$ is applied $m$ times to generate an approximation of $v_{\sigma_k}$ . The constant step size $m$ can also be replaced with a sequence $(m_k) \subset \NN$ . In either case, for MDPs, convergence to an optimal policy is guaranteed. We prove this in a more general setting in Chapter 8.

Notice that, as $m \to \infty$ , the algorithm increasingly approximates HPI, since $T_{\sigma_k}^m v_k$ converges to $v_{\sigma_k}$ . At the same time, if $m=1$ , it reduces to VFI. This follows from Exercise 5.1.11, which tells us that, when $\sigma_k$ is $v_k$ -greedy, $T_{\sigma_k} v_k = T v_k$ . Hence, with intermediate $m$ , OPI can be seen as a “convex combination” of HPI and VFI.
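The two limiting cases can be checked numerically. The sketch below uses a small random MDP of our own construction (all actions feasible; the names `action_values`, `greedy`, and `opi_step` are hypothetical) and compares one OPI update with $m = 1$ against $Tv$ , and one with large $m$ against $v_\sigma$ .

```julia
using LinearAlgebra, Random

# Hypothetical small MDP with all actions feasible: r[x, a], P[x, a, x′]
rng = MersenneTwister(1)
n, k, β = 10, 3, 0.9
r = rand(rng, n, k)
P = rand(rng, n, k, n)
P ./= sum(P, dims=3)

# q[x, a] = r(x, a) + β Σ_x′ v(x′) P(x, a, x′)
action_values(v) = [r[x, a] + β * dot(P[x, a, :], v) for x in 1:n, a in 1:k]

T(v) = vec(maximum(action_values(v), dims=2))              # Bellman operator
T_σ(v, σ) = [r[x, σ[x]] + β * dot(P[x, σ[x], :], v) for x in 1:n]  # policy operator

function greedy(v)
    q = action_values(v)
    return [argmax(q[x, :]) for x in 1:n]
end

# One OPI update: take a v-greedy policy σ, then apply T_σ m times
function opi_step(v, m)
    σ = greedy(v)
    for _ in 1:m
        v = T_σ(v, σ)
    end
    return v
end

v0 = rand(MersenneTwister(2), n)

# m = 1: the OPI update coincides with one step of VFI
vfi_gap = maximum(abs.(opi_step(v0, 1) - T(v0)))

# Large m: the OPI update approaches the HPI update v_σ = (I - β P_σ)⁻¹ r_σ
σ = greedy(v0)
P_σ = [P[x, σ[x], x′] for x in 1:n, x′ in 1:n]
r_σ = [r[x, σ[x]] for x in 1:n]
hpi_gap = maximum(abs.(opi_step(v0, 500) - (I - β * P_σ) \ r_σ))
```

Both gaps should be at the level of floating point error; intermediate values of $m$ sit between the two cases.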

In almost all dynamic programming applications, there exist choices of $m > 1$ such that OPI converges faster than VFI. We investigate these ideas in the applications. In some cases, there exist values of $m$ such that OPI dominates HPI. However, this depends on the structure of the problem and the software and hardware platforms being employed – see Section 2.1.4.4 and the applications for additional discussion.

5.2Applications¶

This section gives several applications of the MDP model to economic problems. The applications illustrate the ease with which MDPs can be implemented on a computer (provided that the state and action spaces are not too large).

5.2.1Optimal Inventories¶

In Section 3.1.1.2 we studied a firm whose inventory behavior was specified to follow S–s dynamics. In Section 5.1.2.2 we introduced a model where inventory behavior is endogenous, determined by the desire to maximize firm value. In this section, we show that this endogenous inventory behavior can replicate the S–s dynamics from Section 3.1.1.2.

We saw in Section 5.1.2.2 that the optimal inventory model is an MDP, so the Proposition 5.1.1 optimality and convergence results apply. In particular, the unique fixed point of the Bellman operator is the value function $v^*$ , and a policy $\sigma^*$ is optimal if and only if $\sigma^*$ is $v^*$ -greedy.

We solve the model numerically using VFI. As in Exercise 5.1.2, we take $\phi$ to be the geometric distribution on $\ZZ_+$ with parameter $p$ . We use the default parameter values shown in Listing 1. The code listing also presents an implementation of the Bellman operator.

using Distributions

f(x, a, d) = max(x - d, 0) + a  # Inventory update

function create_inventory_model(; β=0.98,     # discount factor
                                  K=40,       # maximum inventory
                                  c=0.2, κ=2, # cost parameters
                                  p=0.6)      # demand parameter
    ϕ(d) = (1 - p)^d * p        # demand distribution
    x_vals = collect(0:K)       # set of inventory levels
    return (; β, K, c, κ, p, ϕ, x_vals)
end

"The function B(x, a, v) = r(x, a) + β Σ_x′ v(x′) P(x, a, x′)."
function B(x, a, v, model; d_max=100)
    (; β, K, c, κ, p, ϕ, x_vals) = model
    revenue = sum(min(x, d) * ϕ(d) for d in 0:d_max) 
    current_profit = revenue - c * a - κ * (a > 0)
    next_value = sum(v[f(x, a, d) + 1] * ϕ(d) for d in 0:d_max)
    return current_profit + β * next_value
end

"The Bellman operator."
function T(v, model)
    (; β, K, c, κ, p, ϕ, x_vals) = model
    new_v = similar(v)
    for (x_idx, x) in enumerate(x_vals)
        Γx = 0:(K - x) 
        new_v[x_idx], _ = findmax(B(x, a, v, model) for a in Γx)
    end
    return new_v
end

Program 1:Solving the optimal inventory model (inventory_dp.jl)

Figure 5.6 exhibits an approximation of the value function $v^*$ , computed by iterating with $T$ starting at $v \equiv 1$ . Figure 5.6 also shows the approximate optimal policy, obtained as a $v^*$ -greedy policy:

\sigma^*(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{d \geq 0} v^*(f(x, a, d)) \phi(d) \right\}.

The plot of the optimal policy shows that there is a threshold region below which the firm orders large batches and above which the firm orders nothing. This makes sense, since the firm wishes to economize on the fixed cost of ordering. Figure 5.7 shows a simulation of inventory dynamics under the optimal policy, starting from $X_0 = 0$ . The time path closely approximates the S–s dynamics discussed in Section 3.1.1.2.
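The simulation step can be sketched as follows. This is a compact, self-contained variant rather than the code behind the figures: it restates `f` and `B` with the default parameters hard-wired as keyword arguments, and the helper names `solve_inventory`, `draw_demand`, and `simulate_inventory` are our own assumptions.

```julia
using Random

f(x, a, d) = max(x - d, 0) + a                        # inventory update

# B(x, a, v) with the geometric demand distribution truncated at d_max
function B(x, a, v; β=0.98, c=0.2, κ=2, p=0.6, d_max=100)
    ϕ(d) = (1 - p)^d * p
    profit = sum(min(x, d) * ϕ(d) for d in 0:d_max) - c * a - κ * (a > 0)
    return profit + β * sum(v[f(x, a, d) + 1] * ϕ(d) for d in 0:d_max)
end

# Value function iteration from v ≡ 1, followed by a greedy policy
function solve_inventory(; K=40, tol=1e-5, max_iter=10_000)
    v = ones(K + 1)
    for _ in 1:max_iter
        v_new = [maximum(B(x, a, v) for a in 0:(K - x)) for x in 0:K]
        err = maximum(abs.(v_new - v))
        v = v_new
        err < tol && break
    end
    # σ[x + 1] is the order size chosen at inventory level x
    σ = [argmax(a -> B(x, a, v), 0:(K - x)) for x in 0:K]
    return v, σ
end

# Geometric demand by inversion: P{D = d} = (1 - p)^d p for d = 0, 1, ...
draw_demand(rng; p=0.6) = floor(Int, log(1 - rand(rng)) / log(1 - p))

# Simulate X_{t+1} = f(X_t, σ(X_t), D_{t+1}) starting from x_init
function simulate_inventory(σ; ts_length=200, x_init=0, rng=MersenneTwister(0))
    X = zeros(Int, ts_length)
    X[1] = x_init
    for t in 1:(ts_length - 1)
        X[t + 1] = f(X[t], σ[X[t] + 1], draw_demand(rng))
    end
    return X
end

v_star, σ_star = solve_inventory()
X = simulate_inventory(σ_star)
```

Plotting `X` against time gives a path of the kind described above, and plotting `σ_star` against the inventory grid displays the order sizes by state.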

Figure 5.6:The value function and optimal policy for the inventory problem

5.2.2Optimal Savings with Labor Income¶

As our next example of an MDP, we modify the cake eating problem in Section 5.1.2.3 to add labor income. Wealth evolves according to

W_{t+1} = R (W_t + Y_t - C_t) \qquad (t = 0, 1, \ldots),

(5.28)

where $(W_t)$ takes values in finite set $\Wsf \subset \RR_+$ and labor income $(Y_t)$ is a Markov chain on finite set $\Ysf \subset \RR_+$ with transition matrix $Q$ .^[1] $R$ is a gross rate of interest, so that investing $d$ dollars today returns $Rd$ next period. Other parts of the problem are unchanged. The Bellman operator can be written as

(Tv)(w, y) = \max_{w' \in \Gamma(w, y)} \left\{ u \left( w + y - \frac{w'}{R} \right) + \beta \sum_{y'} v(w', y') Q(y, y') \right\}.

(5.29)

5.2.2.1MDP Representation¶

To frame this problem as an MDP, we set the state to $x \coloneq (w, y)$ , representing current wealth and income, taking values in the state space $\Xsf \coloneq \Wsf \times \Ysf$ . The action is savings $s$ , which takes values in $\Wsf$ and equals $w'$ . The feasible correspondence is the set of feasible savings values

\Gamma(w, y) = \setntn{s \in \Wsf}{s \leq R (w + y)}.

The current reward is utility of consumption $r((w, y), s) = u(w + y - s/R)$ . The stochastic kernel is

P((w, y), s, (w', y')) = \1\{w' = s\} Q(y, y').

Having framed an MDP, the Proposition 5.1.1 optimality results apply.

5.2.2.2Implementation¶

To implement the algorithms discussed in Section 5.1.4, we use the Bellman operator (5.29), and the corresponding definition of a $v$ -greedy policy, which is

\sigma(w, y) \in \argmax_{w' \in \Gamma(w, y)} \left\{ u \left( w + y - \frac{w'}{R} \right) + \beta \sum_{y'} v(w', y') Q(y, y') \right\},

for all $(w, y)$ . The policy operator for given $\sigma \in \Sigma$ is

(T_\sigma \, v)(w, y) = u \left( w + y - \frac{\sigma(w, y)}{R} \right) + \beta \sum_{y'} v(\sigma(w, y), y') Q(y, y').

(5.30)

Code for implementing the model and these two operators is given in Listing 2. Income is constructed as a discretized AR(1) process using the method from Section 3.1.3. Exponentiation is applied to the grid so that income takes positive values.

The function get_value in Listing 3 uses the expression $v_\sigma = (I - \beta \, P_\sigma)^{-1} r_\sigma$ from (5.18) to obtain the value of a given policy $\sigma$ . The matrix $P_\sigma$ and vector $r_\sigma$ take the form

\begin{aligned} P_\sigma((w, y), (w', y')) & = \1\{\sigma(w, y) = w'\} Q(y, y'), \\ r_\sigma(w, y) & = u(w + y - \sigma(w, y) / R). \end{aligned}

using QuantEcon, LinearAlgebra, IterTools

function create_savings_model(; R=1.01, β=0.98, γ=2.5,  
                                w_min=0.01, w_max=20.0, w_size=200,
                                ρ=0.9, ν=0.1, y_size=5)
    w_grid = LinRange(w_min, w_max, w_size)  
    mc = tauchen(y_size, ρ, ν)
    y_grid, Q = exp.(mc.state_values), mc.p
    return (; β, R, γ, w_grid, y_grid, Q)
end

"B(w, y, w′, v) = u(w + y - w′/R) + β Σ_y′ v(w′, y′) Q(y, y′)."
function B(i, j, k, v, model)
    (; β, R, γ, w_grid, y_grid, Q) = model
    w, y, w′ = w_grid[i], y_grid[j], w_grid[k]
    u(c) = c^(1-γ) / (1-γ)
    c = w + y - (w′ / R)
    @views value = c > 0 ? u(c) + β * dot(v[k, :], Q[j, :]) : -Inf
    return value
end

"The Bellman operator."
function T(v, model)
    w_idx, y_idx = (eachindex(g) for g in (model.w_grid, model.y_grid))
    v_new = similar(v)
    for (i, j) in product(w_idx, y_idx)
        v_new[i, j] = maximum(B(i, j, k, v, model) for k in w_idx)
    end
    return v_new
end

"The policy operator."
function T_σ(v, σ, model)
    w_idx, y_idx = (eachindex(g) for g in (model.w_grid, model.y_grid))
    v_new = similar(v)
    for (i, j) in product(w_idx, y_idx)
        v_new[i, j] = B(i, j, σ[i, j], v, model) 
    end
    return v_new
end

Program 2:Discrete optimal savings model (finite_opt_saving_0.jl)

include("finite_opt_saving_0.jl")

"Compute a v-greedy policy."
function get_greedy(v, model)
    w_idx, y_idx = (eachindex(g) for g in (model.w_grid, model.y_grid))
    σ = Matrix{Int32}(undef, length(w_idx), length(y_idx))
    for (i, j) in product(w_idx, y_idx)
        _, σ[i, j] = findmax(B(i, j, k, v, model) for k in w_idx)
    end
    return σ
end

"Get the value v_σ of policy σ."
function get_value(σ, model)
    # Unpack and set up
    (; β, R, γ, w_grid, y_grid, Q) = model
    w_idx, y_idx = (eachindex(g) for g in (w_grid, y_grid))
    wn, yn = length(w_idx), length(y_idx)
    n = wn * yn
    u(c) = c^(1-γ) / (1-γ)
    # Build P_σ and r_σ as multi-index arrays
    P_σ = zeros(wn, yn, wn, yn)
    r_σ = zeros(wn, yn)
    for (i, j) in product(w_idx, y_idx)
        w, y, w′ = w_grid[i], y_grid[j], w_grid[σ[i, j]]
        r_σ[i, j] = u(w + y - w′/R)
        for j′ in y_idx
            P_σ[i, j, σ[i, j], j′] = Q[j, j′]
        end
    end
    # Reshape for matrix algebra
    P_σ = reshape(P_σ, n, n)
    r_σ = reshape(r_σ, n)
    # Apply matrix operations --- solve for the value of σ 
    v_σ = (I - β * P_σ) \ r_σ
    # Return as multi-index array
    return reshape(v_σ, wn, yn)
end

Program 3:Discrete optimal savings model (finite_opt_saving_1.jl)

include("s_approx.jl")
include("finite_opt_saving_1.jl")

"Value function iteration routine."
function value_iteration(model, tol=1e-5)
    vz = zeros(length(model.w_grid), length(model.y_grid))
    v_star = successive_approx(v -> T(v, model), vz, tolerance=tol)
    return get_greedy(v_star, model)
end

"Howard policy iteration routine."
function policy_iteration(model)
    wn, yn = length(model.w_grid), length(model.y_grid)
    σ = ones(Int32, wn, yn)
    i, error = 0, 1.0
    while error > 0
        v_σ = get_value(σ, model)
        σ_new = get_greedy(v_σ, model)
        error = maximum(abs.(σ_new - σ))
        σ = σ_new
        i = i + 1
        println("Concluded loop $i with error $error.")
    end
    return σ
end

"Optimistic policy iteration routine."
function optimistic_policy_iteration(model; tolerance=1e-5, m=100)
    v = zeros(length(model.w_grid), length(model.y_grid))
    error = tolerance + 1
    while error > tolerance
        last_v = v
        σ = get_greedy(v, model)
        for i in 1:m
            v = T_σ(v, σ, model)
        end
        error = maximum(abs.(v - last_v))
    end
    return get_greedy(v, model)
end

Program 4:Discrete optimal savings model (finite_opt_saving_2.jl)

5.2.2.3Timing¶

Since all results for MDPs apply, we know that the value function $v^*$ is the unique fixed point of the Bellman operator in $\RR^\Xsf$ , and that VFI, HPI, and OPI all converge. Listing 4 implements these three algorithms. Since the state and action spaces are finite, HPI is guaranteed to return an exact optimal policy.

Figure 5.8 shows the number of seconds taken to solve the finite optimal savings model under the default parameters when executed on a laptop machine with 20 CPUs running at around 4 GHz. The horizontal axis corresponds to the step parameter $m$ in OPI (Algorithm 5.4). The two other algorithms do not depend on $m$ and hence their timings are constant. The figure shows that HPI is an order of magnitude faster than VFI and that OPI is even faster for moderate values of $m$ .

One reason VFI is slow is that the discount factor is close to one. This matters because the convergence rate for VFI is linear with error size decreasing geometrically in $\beta$ . In contrast, HPI, being an instance of Newton iteration, converges quadratically (see Section 2.1.4.2). As a result, HPI tends to dominate VFI when the discount factor approaches unity.

Run-times are also dependent on implementation, and relative speed varies significantly with coding style, software, and hardware platforms. In our implementation, the main deficiency is that parallelization is under-utilized. Better exploitation of parallelization tends to favor HPI, as discussed in Section 2.1.4.4.

Figure 5.8:Timings for alternative algorithms, savings model

5.2.2.4Outputs¶

Figure 5.9 shows a typical time series for the wealth of a single household under the optimal policy. The series is created by computing an optimal policy $\sigma^*$ , generating $(Y_t)_{t=0}^{m-1}$ as a $Q$ -Markov chain on $\Ysf$ and then computing $(W_t)_{t=0}^m$ via $W_{t+1} = \sigma^*(W_t, Y_t)$ for $t$ running from 0 to $m-1$ . Initial wealth $W_0$ is set to 1.0 and $m = 2000$ .
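The simulation just described can be sketched without any external packages. In the code below, `draw_next` and `simulate_wealth` are hypothetical helper names, the policy is stored as an index array `σ[i, j]` in the style of Listing 3, and the two-state income chain and the policy passed in at the end are placeholders for illustration only, not the model solution.

```julia
using Random

# Draw the next state index from row j of transition matrix Q by inversion
function draw_next(Q, j, rng)
    u, cum = rand(rng), 0.0
    for j′ in axes(Q, 2)
        cum += Q[j, j′]
        u < cum && return j′
    end
    return size(Q, 2)          # guard against rounding in the cumulative sum
end

# σ[i, j] is the index of next period wealth at wealth index i, income index j
function simulate_wealth(σ, w_grid, Q; ts_length=2000, w_init_idx=1,
                         y_init_idx=1, rng=MersenneTwister(0))
    w_idx = zeros(Int, ts_length + 1)
    w_idx[1], j = w_init_idx, y_init_idx
    for t in 1:ts_length
        w_idx[t + 1] = σ[w_idx[t], j]     # W_{t+1} = σ(W_t, Y_t)
        j = draw_next(Q, j, rng)          # Y_{t+1} drawn from Q(Y_t, ·)
    end
    return w_grid[w_idx]
end

# Placeholder inputs: a two-state chain and a policy that moves one grid
# point up in the high income state and one grid point down otherwise
w_grid = LinRange(0.01, 20.0, 200)
Q = [0.9 0.1; 0.2 0.8]
σ = [clamp(i + (j == 2 ? 1 : -1), 1, 200) for i in 1:200, j in 1:2]
W = simulate_wealth(σ, w_grid, Q)
```

Replacing the placeholder `σ` and `Q` with an optimal policy and the discretized income chain from the model gives a wealth series of the kind plotted in the figure.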

Figure 5.10 shows the result of computing and histogramming a longer time series, with $m$ set to 1,000,000. This histogram approximates the stationary distribution of wealth for a large population, each updating via $\sigma^*$ and each with independently generated labor income series $(Y_t)_{t=0}^{m-1}$ . (This is due to ergodicity of the wealth-income process. For a discussion of the connection between stationary distributions and time series under ergodicity see, for example, Sargent & Stachurski (2023).)

The shape of the wealth distribution in Figure 5.10 is unrealistic. In almost all countries, the wealth distribution has a very long right tail. The Gini coefficient of the distribution in Figure 5.10 is 0.54, which is too low. For example, World Bank data for 2019 produces a wealth Gini for the US equal to 0.852. For Germany and Japan the figures are 0.816 and 0.627 respectively.
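The text does not show how the Gini coefficient was computed. One standard way to compute it from a simulated sample, given here as a sketch with the hypothetical function name `gini`, uses the sorted-sample formula $G = \frac{2 \sum_i i \, y_{(i)}}{n \sum_i y_{(i)}} - \frac{n+1}{n}$ , where $y_{(1)} \leq \cdots \leq y_{(n)}$ are the ordered observations.

```julia
# Gini coefficient of a sample of nonnegative values (not all zero)
function gini(y)
    ys = sort(y)
    n = length(ys)
    return 2 * sum(i * ys[i] for i in 1:n) / (n * sum(ys)) - (n + 1) / n
end

equal_shares = gini([1.0, 1.0, 1.0, 1.0])    # perfectly equal sample
one_holder   = gini([0.0, 0.0, 0.0, 1.0])    # all wealth held by one household
```

A perfectly equal sample has a Gini coefficient of zero, while a sample in which one of $n$ households holds everything has a coefficient of $(n-1)/n$ .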

In Section 5.3.3 we discuss a variation on the optimal savings model that can produce a more realistic wealth distribution.

5.2.3Optimal Investment¶

As our next application, we consider a monopolist facing adjustment costs and stochastically evolving demand. The monopolist balances setting enough capacity to meet demand against costs of adjusting capacity.

5.2.3.1Problem Description¶

We assume that the monopolist produces a single product and faces an inverse demand function of the form

P_t = a_0 - a_1 Y_t + Z_t,

where $a_0, a_1$ are positive parameters, $Y_t$ is output, $P_t$ is price, and the demand shock $Z_t$ follows

Z_{t+1} = \rho Z_t + \sigma \eta_{t+1}, \qquad \{\eta_t \} \iidsim N(0, 1).

Current profits are

\pi_t \coloneq P_t Y_t - c Y_t - \gamma (Y_{t+1} - Y_t)^2.

Here $\gamma (Y_{t+1} - Y_t)^2$ represents costs associated with adjusting production scale, parameterized by $\gamma$ , and $c$ is unit cost of current production. Costs are convex, so rapid changes to capacity are expensive.

The monopolist chooses $(Y_t)$ to maximize the expected discounted value of its profit flow, which we write as

\EE \, \sum_{t=0}^{\infty} \beta^t \pi_t.

(5.31)

Here $\beta = 1/(1+r)$ , where $r > 0$ is a fixed interest rate.

A way to start thinking about the optimal time path of output is to consider what would happen if $\gamma = 0$ . Without adjustment costs there is no intertemporal trade-off, so the monopolist should choose output to maximize current profit in each period. The implied level of output at time $t$ is

\bar Y_t \coloneq \frac{a_0 - c + Z_t}{2 a_1}.

(5.32)
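For completeness, here is a short derivation of this expression. When $\gamma = 0$ , substituting the inverse demand function into current profits gives a concave quadratic in $Y_t$ ,

\pi_t = (a_0 - a_1 Y_t + Z_t - c) Y_t.

Setting the derivative with respect to $Y_t$ equal to zero yields the first-order condition

a_0 - 2 a_1 Y_t + Z_t - c = 0,

and solving for $Y_t$ gives $\bar Y_t = (a_0 - c + Z_t) / (2 a_1)$ .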

For $\gamma > 0$ , we expect the following behavior.

If $\gamma$ is close to zero, then the optimal output path $Y_t$ will track the time path of $\bar Y_t$ relatively closely, whereas
if $\gamma$ is larger, then $Y_t$ will be significantly smoother than $\bar Y_t$ , as the monopolist seeks to avoid adjustment costs.

5.2.3.2MDP Representation¶

We can represent this problem as an MDP. To do so we let $\Ysf$ be a grid contained in $\RR_+$ that lists possible output values. To conform to the finite state setting, we discretize the shock process $(Z_t)$ using Tauchen’s method, as described in Section 3.1.3. For convenience we again use $(Z_t)$ to represent the discrete process, which is a finite Markov chain on $\Zsf \subset \RR$ with transition matrix $Q$ .

The state space for this MDP is $\Xsf = \Ysf \times \Zsf$ , while the action space is $\Ysf$ . The feasible correspondence is defined by $\Gamma(x) = \Ysf$ , meaning that choice of output is not restricted by the state. Thus, the feasible policy set $\Sigma$ is all $\sigma \colon \Ysf \times \Zsf \to \Ysf$ .

We write $(y,z)$ for the current state, $q$ for the action (which chooses next period output) and $(y',z')$ for the next period state. The current reward function is current profits, which we can write as

r((y, z), q) = (a_0 - a_1 y + z - c) y - \gamma (q - y)^2.

The stochastic kernel is

P((y, z), q, (y', z')) = \1\{y' = q\} Q(z, z').

The term $\1\{y' = q\}$ states that next period output $y'$ is equal to our current choice $q$ for next period output. With these definitions, the problem defines an MDP and all of the optimality theory for MDPs applies.

5.2.3.3Implementation¶

The Bellman operator can be expressed as

(Tv)(y, z) = \max_{y' \in \Ysf} \left\{ r(y, z, y') + \beta \sum_{z'} v(y', z') Q(z, z') \right\}.

Given $\sigma \in \Sigma$ , we can express the policy operator as

(T_\sigma \, v)(y, z) = r(y, z, \sigma(y, z)) + \beta \sum_{z'} v(\sigma(y, z), z') Q(z, z').

A $v$ -greedy policy is a $\sigma \in \Sigma$ that obeys

\sigma(y, z) \in \argmax_{y' \in \Ysf} \left\{ r(y, z, y') + \beta \sum_{z'} v(y', z') Q(z, z') \right\} \quad \text{for all } (y,z) \in \Xsf.

By combining iteration with the policy operator and computation of greedy policies, we can implement OPI, compute the optimal policy $\sigma^*$ , and study output choices generated by this policy. We are particularly interested in how output responds over time to randomly generated demand shocks.
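A minimal sketch of these operators and of OPI for the investment model follows. To keep the example self-contained, the demand shock is a hypothetical three-state Markov chain rather than a Tauchen discretization, and the output grid is coarser than in the default parameterization; `create_small_investment_model` is our own name and is not part of the listings.

```julia
using LinearAlgebra

# Stand-in model: hypothetical three-state shock chain, coarse output grid
function create_small_investment_model(; r=0.04, a_0=10.0, a_1=1.0,
                                         γ=25.0, c=1.0, y_size=50)
    β = 1 / (1 + r)
    y_grid = LinRange(0.0, 20.0, y_size)
    z_grid = [-1.0, 0.0, 1.0]
    Q = [0.8 0.2 0.0; 0.1 0.8 0.1; 0.0 0.2 0.8]
    return (; β, a_0, a_1, γ, c, y_grid, z_grid, Q)
end

# B(y, z, y′, v) = r(y, z, y′) + β Σ_z′ v(y′, z′) Q(z, z′), in index form
function B(i, j, k, v, model)
    (; β, a_0, a_1, γ, c, y_grid, z_grid, Q) = model
    y, z, y′ = y_grid[i], z_grid[j], y_grid[k]
    profit = (a_0 - a_1 * y + z - c) * y - γ * (y′ - y)^2
    @views continuation = dot(v[k, :], Q[j, :])
    return profit + β * continuation
end

# Bellman operator
T(v, model) = [maximum(B(i, j, k, v, model) for k in eachindex(model.y_grid))
               for i in eachindex(model.y_grid), j in eachindex(model.z_grid)]

# v-greedy policy, stored as indices into y_grid
get_greedy(v, model) = [argmax(k -> B(i, j, k, v, model), eachindex(model.y_grid))
                        for i in eachindex(model.y_grid), j in eachindex(model.z_grid)]

# Policy operator
T_σ(v, σ, model) = [B(i, j, σ[i, j], v, model)
                    for i in eachindex(model.y_grid), j in eachindex(model.z_grid)]

# OPI: alternate between greedy policies and m applications of T_σ
function optimistic_policy_iteration(model; tol=1e-6, m=50)
    v = zeros(length(model.y_grid), length(model.z_grid))
    err = tol + 1
    while err > tol
        last_v = v
        σ = get_greedy(v, model)
        for _ in 1:m
            v = T_σ(v, σ, model)
        end
        err = maximum(abs.(v - last_v))
    end
    return v, get_greedy(v, model)
end

model = create_small_investment_model()
v, σ = optimistic_policy_iteration(model)
bellman_residual = maximum(abs.(T(v, model) - v))
```

The residual `bellman_residual` gives a quick check that the returned `v` is close to a fixed point of $T$ . Re-solving with different values of `γ` and simulating $Y_{t+1} = \sigma(Y_t, Z_t)$ shows how the adjustment cost affects the output path.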

Figure 5.11 shows the result of a simulation designed to shed light on how output responds to demand. After choosing initial values $(Y_1, Z_1)$ and generating a $Q$ -Markov chain $(Z_t)_{t=1}^T$ , we simulated optimal output via $Y_{t+1} = \sigma^*(Y_t, Z_t)$ . The default parameters are shown in Listing 5. In the figure, the adjustment cost parameter $\gamma$ is varied as shown in the title. In addition to the optimal output path, the path of $(\bar Y_t)$ as defined in (5.32) is also presented.

The figure shows how increasing $\gamma$ promotes smoothing, as predicted in the preceding discussion. For small $\gamma$ , adjustment costs have only minor effects on choices, so output closely follows $(\bar Y_t)$ , the optimal path when output responds immediately to demand shocks. Conversely, larger values of $\gamma$ make adjustment expensive, so the monopolist responds relatively slowly to changes in demand.

using QuantEcon, LinearAlgebra, IterTools
include("s_approx.jl")

function create_investment_model(; 
        r=0.04,                              # Interest rate
        a_0=10.0, a_1=1.0,                   # Demand parameters
        γ=25.0, c=1.0,                       # Adjustment and unit cost 
        y_min=0.0, y_max=20.0, y_size=100,   # Grid for output
        ρ=0.9, ν=1.0,                        # AR(1) parameters
        z_size=25)                           # Grid size for shock
    β = 1/(1+r) 
    y_grid = LinRange(y_min, y_max, y_size)  
    mc = tauchen(z_size, ρ, ν)
    z_grid, Q = mc.state_values, mc.p
    return (; β, a_0, a_1, γ, c, y_grid, z_grid, Q)
end

Program 5:Optimal investment model (finite_lq.jl)

Figure 5.11:Simulation of optimal output with different adjustment costs

Figure 5.12 compares timings for VFI, HPI, and OPI. Parameters are as in Listing 5. As in Figure 5.8, which gave timings for the optimal savings model, the horizontal axis shows $m$ , which is the step parameter in OPI (see Algorithm 5.4). VFI and HPI do not depend on $m$ and hence their timings are constant. The vertical axis is time in seconds.

HPI is faster than VFI, although the difference is not as dramatic as was the case for optimal savings. One reason is that the discount factor is relatively small for the optimal investment model ( $r=0.04$ and $\beta = 1/(1+r)$ , so $\beta \approx 0.96$ ). Since $\beta$ is the modulus of contraction for the Bellman operator, this means that VFI converges relatively quickly. Another observation is that, for many values of $m$ , OPI dominates both VFI and HPI in terms of speed, which is consistent with our findings for the optimal savings model. At $m=70$ , OPI is around 20 times faster than VFI.

Figure 5.12:Timings for alternative algorithms, investment model

Exercise 5.2.3

Consider a firm that maximizes expected discounted value in a setting where future profits are discounted at rate $\beta = 1/(1+r)$ , the only production input is labor and hiring involves fixed costs. Let $\ell_t$ be employment at the firm at time $t$ . Current profits are

\pi_t = p Z_t \ell_t^\alpha - w \ell_t - \kappa \1\{\ell_{t+1} \neq \ell_t\},

where $p$ is the output price, $w$ is the wage rate, $\alpha$ is a production parameter, the productivity shock is $Q$ -Markov on $\Zsf$ and $\kappa$ is a fixed cost of hiring and firing. This fixed cost induces lumpy adjustment, as shown in Figure 5.13. Show that this model is an MDP. Write the Bellman equation and the procedure for OPI in the context of this model. Replicate Figure 5.13, modulo randomness, using the parameters shown in Listing 6.

using QuantEcon, LinearAlgebra, IterTools

function create_hiring_model(; 
        r=0.04,                              # Interest rate
        κ=1.0,                               # Adjustment cost 
        α=0.4,                               # Production parameter
        p=1.0, w=1.0,                        # Price and wage
        l_min=0.0, l_max=30.0, l_size=100,   # Grid for labor
        ρ=0.9, ν=0.4, b=1.0,                 # AR(1) parameters
        z_size=100)                          # Grid size for shock
    β = 1/(1+r) 
    l_grid = LinRange(l_min, l_max, l_size)  
    mc = tauchen(z_size, ρ, ν, b, 6)
    z_grid, Q = mc.state_values, mc.p
    return (; β, κ, α, p, w, l_grid, z_grid, Q)
end

Program 6:Firm hiring model (firm_hiring.jl)

Figure 5.13:Optimal shifts in the stock of labor

5.3Modified Bellman Equations¶

Direct application of MDP theory is sometimes suboptimal. For example, we saw in Section 1.3.2.2 that solving the job search problem with IID wage draws is best accomplished by generating a recursion on the continuation value, which reduces dimensionality for iterative solution methods. Separately, in Section 4.2.2.2, we saw how a different manipulation of the Bellman equation also increased efficiency.

Now we aim to study such modifications systematically. We begin by providing other examples of how manipulating a Bellman equation can facilitate computation and analysis. Then we establish a theoretical foundation for this line of analysis, and show how similar ideas can also be applied to policy operators and greedy policies.

(We also treat similar topics at a more advanced and abstract level in Volume II.)

5.3.1Structural Estimation¶

As a first illustration of the ideas in this section, we discuss a connection between econometric estimation and dynamic programs. Our focus is on some modifications that econometricians often make to Bellman equations and how they affect computation and optimality.

5.3.1.1What Is Structural Estimation?¶

Structural estimation is a branch of quantitative social science in which, in a quest to understand observed quantities and prices, researchers attribute Markov decision problems to economic agents. A key step in this approach is to formulate dynamic programs in terms of functional forms and parameters. The econometric challenge is to infer parameters that bring the model outputs as close as possible to actual data.

Structural estimation aims to discover objects that are invariant to hypothetical interventions that the analyst wants to investigate. Examples of such invariant objects are parameters of utility functions, discount factors, and production technologies. Agents inside the model solve their MDPs. A policy intervention that systematically alters the Markov processes that they face will alter agents’ optimal policies, that is, their decision rules. Examples of such interventions involving aspects of fiscal and monetary policy are described in various chapters of Lucas & Sargent (1981), a compendium of early papers that were written in response to the Lucas (1976) Critique of then prevailing dynamic econometric models.^[2]

Efficient solution methods are essential in structural estimation because the underlying dynamic program must be solved repeatedly in order to search the parameter space for a good fit to data. Moreover, these dynamic programs are often high-dimensional, due to shocks to preferences and other random variables that the agents inside the model are assumed to see but that the econometrician does not. When these shocks are persistent, the dimension of the state grows.^[3]

In order to maintain focus on dynamic programming, we will not describe the details of the estimation step required for structural estimation (although Section 5.4 contains references for those who wish to learn about that). Instead, we focus on the kinds of dynamic programs treated in structural estimation and techniques for solving them efficiently.

5.3.1.2An Illustration¶

Let us look at an example of a dynamic program with preference shocks used in structural estimation, which is taken from a study of labor supply by married women (Keane et al., 2011). The husband of the decision-maker, a married woman, is already working. The couple has young children and the mother is deciding whether to work. Her utility function is

u(c, d, \xi) = c + (\alpha n + \xi) (1 - d),

where $c$ is consumption, $\alpha$ is a parameter, $n$ is the number of children, $\xi$ is a preference shock and $d$ is the action variable. The action is binary, with $d=1$ representing the decision to work in the current period and $d=0$ representing the decision not to work.^[4]

The budget constraint for the household is

c_t = f_t + w_t d_t - \pi n d_t,

where $f_t$ is the father’s income, $w_t$ is the mother’s wage and $\pi$ is the cost of child care. Wages depend on human capital $h_t$ , which increases with experience. In particular,

w_t = \gamma h_t + \eta_t, \quad \text{with} \quad h_t = h_{t-1} + d_{t-1}.

Here $\eta_t$ is random and $\gamma$ is a parameter. We assume that $(f_t)_{t \geq 0}$ is $F$ -Markov on some finite set. In the model, $(\xi_t)_{t \geq 0}$ and $(\eta_t)_{t \geq 0}$ are IID. We denote their joint distribution by $\phi$ .

With constant discount factor $\beta$ and implied utility

r(f, h, \xi, \eta, d) \coloneq f + (\gamma h + \eta) d - \pi n d + (\alpha n + \xi) (1 - d),

the problem of maximizing expected discounted utility is an MDP with the Bellman equation

v(f, h, \xi, \eta) = \max_d \left\{ r(f, h, \xi, \eta, d) + \beta \sum_{f', \xi', \eta'} v(f', h + d, \xi', \eta') F(f, f') \phi(\xi', \eta') \right\}.

While we can proceed directly with a technique such as VFI to obtain optimal choices, we can simplify.

One way is by reducing the number of states. A hint comes from looking at the expected value function

g(f, h, d) \coloneq \sum_{f', \xi', \eta'} v(f', h + d, \xi', \eta') F(f, f') \phi(\xi', \eta')

This function depends only on three arguments and, moreover, the choice variable $d$ is binary. Hence we can break $g$ down into two functions $g(f, h, 0)$ and $g(f, h, 1)$ , each of which depends only on the pair $(f, h)$ . These functions are substantially simpler than $v$ when the domain of $(\xi, \eta)$ is large. Hence, it is natural to consider whether we can solve our problem using $g$ rather than $v$ .

5.3.1.3Expected Value Functions¶

Rather than address this question within the context of the preceding model, let’s shift to a generic version of the dynamic program used in structural estimation and see how it can be solved using expected value methods. Our generic version takes the form

v(y, \epsilon) = \max_{a \in \Gamma(y)} \left\{ r(y, \epsilon, a) + \beta \sum_{y'} \int v(y', \epsilon') P(y, a, y') \phi(\epsilon') \diff \epsilon' \right\}

(5.33)

for all $y \in \Ysf$ and $\epsilon \in \Esf$ . Here $\Ysf$ is a finite set, often determined by discretization of a continuous space, whereas $\Esf$ , the outcome space for $\epsilon$ , is allowed to be continuous. The state $y$ will be called the endogenous state and $\epsilon$ is the preference shock. In practice, $\epsilon$ will often be a vector of shocks that affect current rewards. The integral can therefore be multivariate and is over all of $\Esf$ .

The problem represented by (5.33) is a version of a regular MDP, with state $x = (y, \epsilon)$ taking values in $\Xsf \coloneq \Ysf \times \Esf$ . If we discretize the space $\Esf$ , then all the optimality theory for MDPs applies. Instead of taking this approach, however, we draw on our discussion of labor choice in Section 5.3.1.2. In particular, to enhance efficiency, we will work with the expected value function

g(y, a) \coloneq \sum_{y'} \int v(y', \epsilon') P(y, a, y') \phi(\epsilon') \diff \epsilon'.

(5.34)

There are several potential advantages associated with working with $g$ rather than $v$ . One is that the set of actions $\Asf$ can be much smaller than the set of states that would be created by discretization of the preference shock space $\Esf$ (especially if $\epsilon_t$ takes values in a high-dimensional space). Another is that the integral provides smoothing, so that $g$ is typically a smooth function. This can accelerate structural estimation procedures.

5.3.1.4Optimality via EV Methods¶

To exploit the relative simplicity of the expected value function, we rewrite the Bellman equation (5.33) as

v(y, \epsilon) = \max_{a \in \Gamma(y)} \left\{ r(y, \epsilon, a) + \beta g(y, a) \right\}.

Taking expectations of both sides and using (5.34) again gives

g(y, a) = \sum_{y'} \int \max_{a' \in \Gamma(y')} \left\{ r(y', \epsilon', a') + \beta g(y', a') \right\} \phi(\epsilon') \diff \epsilon' P(y, a, y') .

To solve this functional equation we introduce the expected value Bellman operator $R$ defined at $g \in \RR^\Gsf$ by

(Rg)(y, a) = \sum_{y'} \int \max_{a' \in \Gamma(y')} \left\{ r(y', \epsilon', a') + \beta g(y', a') \right\} \phi(\epsilon') \diff \epsilon' P(y, a, y').

(5.35)

Here $\Gsf$ is the set of feasible state action pairs $(y, a)$ .

In what follows, we let $g^*$ be the fixed point of $R$ in $\RR^\Gsf$ . Since $R$ is a contraction map, $g^*$ can be computed by successive approximation. The next result shows that knowing this fixed point is enough to solve the dynamic program.

We postpone proving Proposition 5.3.1 until Section 5.3.5, where we prove a more general result.
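Before turning to an example, we can check the relationship between $v$ and $g$ numerically. The sketch below assumes that the shock space $\Esf$ has been replaced by a small finite set with probabilities `ϕ`, that all actions are feasible, and that rewards and transition probabilities are randomly generated; all names and sizes are illustrative assumptions. It computes the fixed point of the standard Bellman operator and the fixed point $g^*$ of $R$ , and then compares $v^*(y, \epsilon)$ with $\max_a \{ r(y, \epsilon, a) + \beta g^*(y, a) \}$ .

```julia
using LinearAlgebra, Random

rng = MersenneTwister(0)
n, ne, k, β = 6, 4, 3, 0.9        # hypothetical sizes of Y, (discretized) E, and A
r = rand(rng, n, ne, k)           # r[y, ε, a]
P = rand(rng, n, k, n)            # P[y, a, y′], normalized below
P ./= sum(P, dims=3)
ϕ = fill(1 / ne, ne)              # shock probabilities

# Standard Bellman operator on functions v[y, ε]
function T(v)
    ev = v * ϕ                    # ev[y′] = Σ_ε′ v(y′, ε′) ϕ(ε′)
    return [maximum(r[y, e, a] + β * dot(P[y, a, :], ev) for a in 1:k)
            for y in 1:n, e in 1:ne]
end

# Expected value Bellman operator on functions g[y, a]
function R(g)
    inner = [sum(maximum(r[y′, e, a′] + β * g[y′, a′] for a′ in 1:k) * ϕ[e]
                 for e in 1:ne) for y′ in 1:n]
    return [dot(P[y, a, :], inner) for y in 1:n, a in 1:k]
end

# Successive approximation
function fixed_point(F, x; tol=1e-10, max_iter=10_000)
    for _ in 1:max_iter
        x_new = F(x)
        maximum(abs.(x_new - x)) < tol && return x_new
        x = x_new
    end
    return x
end

v_star = fixed_point(T, zeros(n, ne))
g_star = fixed_point(R, zeros(n, k))

# Recover v* from g* via v(y, ε) = max_a { r(y, ε, a) + β g(y, a) }
v_from_g = [maximum(r[y, e, a] + β * g_star[y, a] for a in 1:k)
            for y in 1:n, e in 1:ne]
gap = maximum(abs.(v_from_g - v_star))
```

Here `g_star` has $|\Ysf| \times |\Asf|$ entries while `v_star` has $|\Ysf| \times |\Esf|$ entries, so the saving from iterating on $g$ grows with the number of shock values.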

Example 5.3.2

In the labor supply problem in Section 5.3.1.2, the expected value Bellman operator becomes

(Rg)(f, h, d) = \sum_{f', \xi', \eta'} \max_{d'} \left\{ r(f', h+d, \xi', \eta', d') + \beta g(f', h+d, d') \right\} F(f, f') \phi(\xi', \eta').

Iterating from an arbitrary guess of $g$ converges to the unique fixed point $g^*$ of $R$ . By Proposition 5.3.1, we can then compute the optimal policy $\sigma^*$ at $(f,h, \xi, \eta)$ by taking

\sigma^*(f, h, \xi, \eta) \in \argmax_d \left\{ r(f, h, \xi, \eta, d) + \beta g^*(f, h, d) \right\}.

5.3.2The Gumbel Max Trick¶

Section 5.3.1.3 described how using expected values can reduce dimensionality by smoothing. But there is another feature of an expected value formulation of a Bellman equation that we can take advantage of when we are prepared to impose extra structure on preference shocks. This section provides details.

A real-valued random variable $Z$ is said to have a Gumbel distribution (or a “type 1 generalized extreme value distribution”) with mode $\mu \in \RR$ if its cumulative distribution function takes the form $F(z) = \exp(-\exp(-(z - \mu)))$ . To denote a random variable with a Gumbel distribution, we write $Z \sim G(\mu)$ . The expectation of $Z$ is $\mu + \gamma$ , where $\gamma \approx 0.577$ is the Euler–Mascheroni constant.

The Gumbel distribution has the following useful stability property, a proof of which can be found in Huijben et al. (2022).

To exploit Lemma 5.3.2, we continue the discussion in Section 5.3.1.4, but assume now that $\Asf = \{a_1, \ldots, a_k\}$ , that $\Gamma(y') = \Asf$ for all $y'$ (so that actions are unrestricted), that $\epsilon'$ in (5.35) is additive in rewards and indexed by actions, so that $r(y', \epsilon', a') = r(y', a') + \epsilon'(a')$ for all feasible $(y',a')$ , and that, conditional on $y'$ , the vector $(\epsilon'(a_1), \ldots, \epsilon'(a_k))$ consists of $k$ independent $G(-\gamma)$ shocks. The mode $-\gamma$ is chosen so that each shock has mean zero. Thus, each feasible choice returns a reward perturbed by an independent Gumbel shock.

From these assumptions and Lemma 5.3.2, the term inside the integral in (5.35) satisfies

\begin{aligned} \max_{a'} \left\{ r(y', \epsilon', a') + \beta g(y', a') \right\} & = \max_{a'} \left\{ r(y', a') + \epsilon'(a') + \beta g(y', a') \right\} \\ & \sim G \left\{ -\gamma + \ln \left[ \sum_{a'} \exp\left( r(y', a') + \beta g(y', a') \right) \right] \right\}. \end{aligned}
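As a quick sanity check on this distributional claim, the following sketch compares a Monte Carlo estimate of the expected maximum with the mean of the limiting Gumbel distribution, which is $\ln \sum_{a'} \exp(\cdot)$ . It assumes the shocks are independent $G(-\gamma)$ draws (so that each has mean zero) and uses an arbitrary illustrative vector `c` in place of $r(y', a') + \beta g(y', a')$ . The names `gumbel_draws` and `simulate_max` and all numerical values are assumptions made for illustration.

```julia
using Random, Statistics

# Draw k independent G(μ) shocks via the inverse transform Z = μ - ln(-ln U),
# using the fact that -ln U is standard exponential when U is uniform on (0, 1).
gumbel_draws(rng, μ, k) = μ .- log.(randexp(rng, k))

# Monte Carlo estimate of E max_i { c[i] + ε_i } with independent ε_i ~ G(-γ).
function simulate_max(c; N=200_000, seed=1234)
    rng = MersenneTwister(seed)
    γ = Base.MathConstants.eulergamma      # Euler–Mascheroni constant
    draws = [maximum(c .+ gumbel_draws(rng, -γ, length(c))) for _ in 1:N]
    return mean(draws)
end

c = [0.5, -1.0, 2.0]                 # hypothetical values of r(y', a') + β g(y', a')
simulated = simulate_max(c)
closed_form = log(sum(exp.(c)))      # mean of G(-γ + ln Σ exp(c_i))
println((simulated, closed_form))
```

The two numbers should agree up to simulation error, which shrinks at rate $1/\sqrt{N}$ .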

Recalling our rule for computing mathematical expectations of Gumbel distributed random variables, the expected value Bellman operator $R$ in (5.35) becomes

(Rg)(y, a) = \sum_{y'} \ln \left[ \sum_{a'} \exp\left( r(y', a') + \beta g(y', a') \right) \right] P(y, a, y').

(5.36)

This operator is convenient because the absence of a max operator permits fast evaluation. Notice also that $R$ is smooth in $g$ , which suggests that we can use gradient information to compute its fixed points.
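To make the evaluation of (5.36) concrete, here is a minimal sketch that implements the operator with a numerically stable log-sum-exp and computes its fixed point by successive approximation. The primitives (4 states, 3 actions, random rewards and a random stochastic kernel) are hypothetical, and the function names `logsumexp_actions`, `R_gumbel` and `successive_approx` are assumptions introduced for illustration.

```julia
using LinearAlgebra, Random

# Numerically stable log-sum-exp over actions (columns) for each state (row).
function logsumexp_actions(q)
    q_max = maximum(q, dims=2)
    return vec(q_max .+ log.(sum(exp.(q .- q_max), dims=2)))
end

# The operator in (5.36): r and g are n × m arrays, P[y, a, y'] is n × m × n.
function R_gumbel(g, r, P, β)
    n, m = size(r)
    s = logsumexp_actions(r .+ β .* g)      # s[y'] = ln Σ_{a'} exp(r + βg)
    return [dot(P[y, a, :], s) for y in 1:n, a in 1:m]
end

# Successive approximation from an initial guess g.
function successive_approx(F, g; tol=1e-10, max_iter=10_000)
    for _ in 1:max_iter
        g_new = F(g)
        maximum(abs.(g_new .- g)) < tol && return g_new
        g = g_new
    end
    return g
end

# Hypothetical primitives, for illustration only.
rng = MersenneTwister(0)
n, m, β = 4, 3, 0.9
r = rand(rng, n, m)
P = rand(rng, n, m, n)
P ./= sum(P, dims=3)                        # each P[y, a, :] sums to one

g_star = successive_approx(g -> R_gumbel(g, r, P, β), zeros(n, m))
residual = maximum(abs.(R_gumbel(g_star, r, P, β) .- g_star))
```

Subtracting the row maximum before exponentiating leaves the result unchanged but avoids overflow when $r + \beta g$ is large.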

Notice how the Gumbel max trick that exploits Lemma 5.3.2 depends crucially on the expected value formulation of the Bellman equation, rather than the standard formulation (5.33). This is because the expected value formulation puts the max inside the expectation operator, unlike the standard formulation, where the max is on the outside.

Variations of the Gumbel max trick have many uses in structural econometrics (see Section 5.4).

5.3.3Optimal Savings with Stochastic Returns on Wealth¶

We modify the Section 5.2.2 optimal savings problem by replacing a constant gross rate of interest $R$ by an IID sequence $(\eta_t)_{t \geq 0}$ with common distribution $\phi$ on a finite set $\Esf$ . So the consumer faces a fluctuating rate of return on financial wealth. In each period $t$ , the consumer knows $\eta_t$ , the gross rate of interest between $t$ and $t+1$ , before deciding how much to consume and how much to save. Other aspects of the problem are unchanged.

We have two motivations. One is computational, namely, to illustrate how framing a decision in terms of expected values can reduce dimensionality, analogous to the results in Section 5.3.1.4. The other is to generate a more realistic wealth distribution than that generated by the Section 5.2.2.4 optimal savings model.

With stochastic returns on wealth, the Bellman equation becomes

v(w, y, \eta) = \max_{w' \leq \eta(w+y)} \left\{ u \left(w+y - \frac{w'}{\eta} \right) + \beta \sum_{y', \eta'} v(w', y', \eta') Q(y, y') \phi(\eta') \right\} .

Both $w$ and $w'$ are constrained to a finite set $\Wsf \subset \RR_+$ . The expected value function can be expressed as

g(y, w') \coloneq \sum_{y', \; \eta'} v(w', y', \eta') Q(y, y') \phi(\eta').

(5.37)

In the remainder of this section, we will say that a savings policy $\sigma$ is $g$ -greedy if

\sigma(w, y, \eta) \in \argmax_{ w' \leq \eta(w+y)} \left\{ u \left(w+y - \frac{w'}{\eta} \right) + \beta g(y, w') \right\} .

Since this problem is an MDP, we can see immediately that if we replace $v$ in (5.37) with the value function $v^*$ , then a $g$ -greedy policy will be an optimal one.

Using manipulations analogous to those we used in Section 5.3.1.4, we can rewrite the Bellman equation in terms of expected value functions via

g(y, w') = \sum_{y', \; \eta'} \; \max_{ w'' \leq \eta'(w'+y')} \left\{ u \left(w'+y' - \frac{w''}{\eta'} \right) + \beta g(y', w'') \right\} Q(y, y') \phi(\eta').

From here we could proceed by introducing an expected value Bellman operator analogous to $R$ in (5.35), proving it to be a contraction map and then showing that greedy policies taken with respect to the fixed point are optimal. All of this can be accomplished without too much difficulty – we prove more general results in Section 5.3.5.
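To illustrate what such an expected value Bellman operator looks like in code, here is a minimal sketch on a deliberately small grid. It assumes CRRA utility, treats choices with nonpositive consumption as infeasible, and uses hand-picked primitives (income grid, transition matrix, return grid) that are hypothetical and unrelated to the parameterization used for the figures. The names `R_savings` and `solve_g` are assumptions introduced here.

```julia
# CRRA utility; γ ≠ 1 is assumed.
u(c, γ) = c^(1 - γ) / (1 - γ)

# One application of the expected value Bellman operator for the savings problem.
# g[i, j] stores g(y_i, w_j); the returned array has the same layout.
function R_savings(g, model)
    β, γ, ϕ, Q = model.β, model.γ, model.ϕ, model.Q
    w_grid, y_grid, η_grid = model.w_grid, model.y_grid, model.η_grid
    g_new = zeros(size(g))
    for (j, w1) in enumerate(w_grid), (ip, y1) in enumerate(y_grid),
        (l, η1) in enumerate(η_grid)
        # Inner maximization over savings w'' given next period state (w', y', η')
        best = -Inf
        for (jp, w2) in enumerate(w_grid)
            c = w1 + y1 - w2 / η1
            c > 0 || continue          # skip choices with nonpositive consumption
            best = max(best, u(c, γ) + β * g[ip, jp])
        end
        # Weight by Q(y, y') ϕ(η') and accumulate for each current income state y
        for i in eachindex(y_grid)
            g_new[i, j] += best * Q[i, ip] * ϕ[l]
        end
    end
    return g_new
end

# Successive approximation starting from g ≡ 0.
function solve_g(model; tol=1e-8, max_iter=5_000)
    g = zeros(length(model.y_grid), length(model.w_grid))
    for _ in 1:max_iter
        g_new = R_savings(g, model)
        err = maximum(abs.(g_new .- g))
        g = g_new
        err < tol && break
    end
    return g
end

# Hypothetical small model, for illustration only.
model = (β = 0.95, γ = 2.5,
         w_grid = LinRange(0.01, 10.0, 30),
         y_grid = [0.5, 1.0, 1.5],
         Q = [0.8 0.2 0.0; 0.1 0.8 0.1; 0.0 0.2 0.8],
         η_grid = [0.9, 1.1],
         ϕ = [0.5, 0.5])

g_star = solve_g(model)
```

Note that `g_star` has only `length(y_grid) × length(w_grid)` entries, whereas the value function $v(w, y, \eta)$ carries an additional dimension for $\eta$ .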

However, we also know that OPI is, in general, more efficient than VFI. This motivates us to introduce the modified $\sigma$ -value operator

(R_\sigma \, g)(y, w') = \sum_{y', \; \eta'} \left\{ u \left( w' +y' - \frac{\sigma(w', y', \eta')}{\eta'} \right) + \beta g(y', \sigma(w', y', \eta')) \right\} Q(y, y') \phi(\eta').

This is a modification of the regular $\sigma$ -value operator $T_\sigma$ that makes it act on expected value functions.

A suitably modified OPI routine that is adapted from the regular OPI algorithm in Section 5.1.4.4 can be found in Algorithm 5.5. The routine is convergent. We discuss this in greater detail in Section 5.3.5.

using QuantEcon, LinearAlgebra, IterTools

# Build a named tuple of primitives for the savings model with stochastic returns.
function create_savings_model(; β=0.98, γ=2.5,
                                w_min=0.01, w_max=20.0, w_size=100,
                                ρ=0.9, ν=0.1, y_size=20,
                                η_min=0.75, η_max=1.25, η_size=2)
    η_grid = LinRange(η_min, η_max, η_size)  # Grid of gross returns
    ϕ = ones(η_size) * (1 / η_size)          # Uniform distribution over η_grid
    w_grid = LinRange(w_min, w_max, w_size)  # Wealth grid
    mc = tauchen(y_size, ρ, ν)               # Discretized AR(1) for log income
    y_grid, Q = exp.(mc.state_values), mc.p  # Income levels and transition matrix
    return (; β, γ, η_grid, ϕ, w_grid, y_grid, Q)
end

Program 7: Optimal savings parameters (modified_opt_savings.jl)

Figure 5.14 shows a histogram of a long wealth time series that parallels Figure 5.10. The only significant difference is the switch to stochastic returns (as previously described). Parameters are as in Program 7. Now the wealth distribution has a more realistic long right tail (a few observations are in the far right tail, although they are difficult to see). The Gini coefficient is 0.72, which is closer to typical country values recorded in World Bank data (but still lower than the US). In essence, this occurs because return shocks have multiplicative rather than additive effects on wealth, so a sequence of high draws compounds to make wealth grow fast.

Figure 5.14: Histogram of wealth (stochastic returns)

Solution to Exercise 5.3.3

The Bellman equation becomes

v(w, z, \epsilon) = \max_{w' \leq R(w+z+\epsilon)} \left\{ u \left(w+z + \epsilon - \frac{w'}{R} \right) + \beta \sum_{z', \epsilon'} v(w', z', \epsilon') Q(z, z') \phi(\epsilon') \right\} .

Both $w$ and $w'$ are constrained to a finite set $\Wsf \subset \RR_+$ . The expected value function can be expressed as

g(z, w') \coloneq \sum_{z', \; \epsilon'} v(w', z', \epsilon') Q(z, z') \phi(\epsilon').

(5.38)

In the remainder of this section, we will say that a savings policy $\sigma$ is $g$ -greedy if

\sigma(z, w, \epsilon) \in \argmax_{ w' \leq R(w+z+\epsilon)} \left\{ u \left(w+z + \epsilon - \frac{w'}{R} \right) + \beta g(z, w') \right\} .

Since this problem is an MDP, we can see immediately that if we replace $v$ in (5.38) with the value function $v^*$ , then a $g$ -greedy policy will be an optimal one. We can rewrite the Bellman equation in terms of expected value functions via

g(z, w') = \sum_{z', \; \epsilon'} \; \max_{ w'' \leq R(w'+z'+\epsilon')} \left\{ u \left(w'+z' + \epsilon' - \frac{w''}{R} \right) + \beta g(z', w'') \right\} Q(z, z') \phi(\epsilon').

5.3.4Q-Factors¶

$Q$ -factors assign values to state action pairs. They set the stage for $Q$ -learning, a reinforcement learning method that estimates $Q$ -factors recursively using stochastic approximation techniques. Under special conditions $Q$ -learning eventually learns optimal $Q$ -factors for a finite MDP.

$Q$ -learning is connected to the topic of this chapter because it relies on a Bellman operator for the $Q$ -factor. We discuss that Bellman operator, but we don’t discuss $Q$ -learning here.

To begin, we fix an MDP $(\Gamma, \beta, r, P)$ with state space $\Xsf$ and action space $\Asf$ . For each $v \in \RR^\Xsf$ , the $Q$ -factor corresponding to $v$ is the function

q(x, a) = r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \qquad ((x,a) \in \Gsf).

We can convert the Bellman equation into an equation in $Q$ -factors by observing that, given such a $q$ , the Bellman equation can be written as $v(x) = \max_{a \in \Gamma(x)}q(x, a)$ . Evaluating this equation at $x'$ , taking expectations with respect to $P(x, a, \cdot)$ and discounting gives

\beta \sum_{x'} v(x') P(x, a, x') = \beta \sum_{x'} \max_{a' \in \Gamma(x')}q(x', a') P(x, a, x').

Adding $r(x,a)$ and using the definition of $q$ again gives

q(x, a) = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')}q(x', a') P(x, a, x').

This functional equation motivates us to introduce the $Q$ -factor Bellman operator

(Sq)(x, a) = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')}q(x', a') P(x, a, x') \qquad ((x,a) \in \Gsf).

(5.39)

Let $q^*$ be the unique fixed point of $S$ in $\RR^\Gsf$ .

Enthusiastic readers might like to try to prove Proposition 5.3.4 directly. We defer the proof until Section 5.3.5, where a more general result is obtained.
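As an illustration of how the $Q$ -factor Bellman operator in (5.39) can be used computationally, the sketch below iterates $S$ to an approximate fixed point and then reads off a policy by maximizing the resulting $Q$ -factor over actions. It assumes, for simplicity, that every action is feasible in every state, and it uses randomly generated rewards and transition probabilities. The names `S_op` and `fixed_point` and all numerical values are hypothetical.

```julia
using LinearAlgebra, Random

# Q-factor Bellman operator (5.39), assuming Γ(x) = A for every state x.
# r is n × m and P[x, a, x'] is n × m × n.
function S_op(q, r, P, β)
    n, m = size(r)
    v = vec(maximum(q, dims=2))             # v[x'] = max_{a'} q(x', a')
    return [r[x, a] + β * dot(P[x, a, :], v) for x in 1:n, a in 1:m]
end

# Successive approximation from an initial guess q.
function fixed_point(F, q; tol=1e-10, max_iter=10_000)
    for _ in 1:max_iter
        q_new = F(q)
        maximum(abs.(q_new .- q)) < tol && return q_new
        q = q_new
    end
    return q
end

# Hypothetical primitives, for illustration only.
rng = MersenneTwister(1)
n, m, β = 4, 3, 0.9
r = rand(rng, n, m)
P = rand(rng, n, m, n)
P ./= sum(P, dims=3)                        # each P[x, a, :] sums to one

q_star = fixed_point(q -> S_op(q, r, P, β), zeros(n, m))
σ = [argmax(q_star[x, :]) for x in 1:n]     # action index maximizing q*(x, ·)
```

Once `q_star` is available, choosing an action requires no further expectations: the policy is a row-wise `argmax`.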

5.3.5Operator Factorizations¶

Our study of structural estimation in Section 5.3.1, optimal savings in Section 5.3.3 and $Q$ -factors in Section 5.3.4 all involved manipulations of the Bellman and policy operators that presented alternative perspectives on the respective optimization problems. Rather than offering additional applications that apply such ideas, we now develop a general theoretical framework from which to understand manipulations of the Bellman and policy operators for general MDPs. The framework clarifies when and how these techniques can be applied.

5.3.5.1Refactoring the Bellman Operator¶

Fix an MDP $(\Gamma, \beta, r, P)$ with state space $\Xsf$ and action space $\Asf$ . As usual, $\Sigma$ is the set of feasible policies, $\Gsf$ is the set of feasible state, action pairs, $T$ is the Bellman operator and $v^*$ denotes the value function. Our first step is to decompose $T$ . To do this we introduce three auxiliary operators:

$E \colon \RR^\Xsf \to \RR^\Gsf$ defined by $(Ev)(x, a) = \sum_{x'} v(x') P(x, a, x')$ ,
$D \colon \RR^\Gsf \to \RR^\Gsf$ defined by $(Dg)(x, a) = r(x, a) + \beta g(x, a)$ and
$M \colon \RR^\Gsf \to \RR^\Xsf$ defined by $(Mq)(x) = \max_{a \in \Gamma(x)} q(x, a)$ .

Evidently the action of the Bellman operator $T$ on a given $v \in \RR^\Xsf$ is the composition of these three steps:

(i) take conditional expectations given $(x, a) \in \Gsf$ (applying $E$ ),

(ii) discount and add current rewards (applying $D$ ), and

(iii) maximize with respect to current action (applying $M$ ).

As a result, we can write $T = M D E \coloneq M \circ D \circ E$ (apply $E$ first, $D$ second, and $M$ third). This decomposition is visualized in Figure 5.15. The action of $T$ is a round trip from the top node, which is the set of value functions.

Figure 5.15: Multiple Bellman operators (EV $=$ expected value)

If we study Figure 5.15, we can imagine two other round trips. One is a round trip from the set of expected value functions, obtained by the sequence $EMD$ . The other is a round trip from the set of $Q$ -factors, obtained by the sequence $DEM$ . Let’s name these additional round trips $R$ and $S$ respectively, so that, collecting all three,

R = EMD, \quad S = DEM, \quad T = MDE.

(5.40)

Both $R$ and $S$ act on functions in $\RR^\Gsf$ . The next exercise provides an explicit representation of these operators.

Exercise 5.3.5

Show that for any $g, q \in \RR^\Gsf$ and $(x, a) \in \Gsf$ we have

\begin{aligned} (Rg)(x, a) & = \sum_{x'} \max_{a' \in \Gamma(x')} \left\{ r(x', a') + \beta g(x', a') \right\} P(x, a, x') \;\; \text{ and} \\ (Sq)(x, a) & = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')}q(x', a') P(x, a, x'). \end{aligned}

Let’s connect our “refactored” Bellman operators $R$ and $S$ to our preceding examples. Inspection of (5.39) confirms that $S$ is exactly the $Q$ -factor Bellman operator. In addition, $R$ is a general version of the expected value Bellman operator defined in (5.35).

While the equalities in Exercise 5.3.6 can be proved by induction via the logic revealed by (5.40), the intuition is straightforward from Figure 5.15. For example, the relationship $R^k = ET^{k-1}MD$ states that round-tripping $k$ times from the space of expected values (EV function space) is the same as shifting to value function space by applying $MD$ , round-tripping $k-1$ times using $T$ , and then shifting one more step to EV function space via $E$ .

Although the relationships in Exercise 5.3.6 are easy to prove, they are already useful. For example, suppose that, in a computational setting, $R$ is easier to iterate with than $T$ . Then to iterate with $T$ $k$ times, we can instead use $T^k = MDR^{k-1}E$ : We apply $E$ once, $R$ $k-1$ times, and $M$ and $D$ once each. If $k$ is large, this might be more efficient.
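The factorizations can be checked numerically. The sketch below builds $E$ , $D$ and $M$ as Julia functions for a small randomly generated MDP (assuming, for simplicity, that all actions are feasible in every state), composes them as in (5.40), and compares $T^k$ with $M D R^{k-1} E$ and $R^k$ with $E T^{k-1} M D$ . The helper `apply_k` and all numerical values are assumptions introduced for illustration.

```julia
using LinearAlgebra, Random

# Hypothetical primitives: n states, m actions, Γ(x) = A for all x.
rng = MersenneTwister(42)
n, m, β = 4, 3, 0.9
r = rand(rng, n, m)
P = rand(rng, n, m, n)
P ./= sum(P, dims=3)                        # each P[x, a, :] sums to one

E(v) = [dot(P[x, a, :], v) for x in 1:n, a in 1:m]   # R^X → R^G
D(g) = r .+ β .* g                                   # R^G → R^G
M(q) = vec(maximum(q, dims=2))                       # R^G → R^X

T(v) = M(D(E(v)))        # Bellman operator
R(g) = E(M(D(g)))        # expected value Bellman operator
S(q) = D(E(M(q)))        # Q-factor Bellman operator

# Apply F to x a total of k times.
function apply_k(F, x, k)
    for _ in 1:k
        x = F(x)
    end
    return x
end

k = 5
v = rand(rng, n)
g = rand(rng, n, m)

# T^k = M D R^{k-1} E
gap_T = maximum(abs.(apply_k(T, v, k) .- M(D(apply_k(R, E(v), k - 1)))))
# R^k = E T^{k-1} M D
gap_R = maximum(abs.(apply_k(R, g, k) .- E(apply_k(T, M(D(g)), k - 1))))
```

Both gaps should be zero up to floating point error, since the two sides apply the same sequence of maps.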

In the rest of this section, we let $\| \cdot \| \coloneq \| \cdot \|_\infty$ , the supremum norm on either $\RR^\Xsf$ or $\RR^\Gsf$ .

We can say that $E$ and $M$ are nonexpansive on $\RR^\Xsf$ and $\RR^\Gsf$ respectively, whereas $D$ is a contraction on $\RR^\Gsf$ .

In Section 5.3.5.2, we clarify relationships between these operators and prove Proposition 5.3.1 and Proposition 5.3.4.

5.3.5.2Refactorizations and Optimality¶

From Lemma 5.3.5 we see that $R$ , $S$ and $T$ all have unique fixed points. We denote them by $g^*$ , $q^*$ and $v^*$ respectively, so that

Rg^* = g^*, \quad Sq^* = q^*, \quad \text{and} \quad Tv^* = v^*.

We already know that $v^*$ is the value function (Proposition 5.1.1). The following results show that the other two fixed points are, like the value function, sufficient to determine optimality.

The results in Proposition 5.3.6 can be written more explicitly as

$g^*(x, a) = \sum_{x'} v^*(x') P(x, a, x')$ for all $(x,a) \in \Gsf$ ,
$q^*(x, a) = r(x, a) + \beta g^*(x, a)$ for all $(x,a) \in \Gsf$ , and
$v^*(x) = \max_{a \in \Gamma(x)}q^*(x, a)$ for all $x \in \Xsf$ .
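These three identities are easy to confirm numerically. In the sketch below, each fixed point is computed independently by successive approximation on a small random MDP, and the three relationships are then checked. As before, the primitives are hypothetical, all actions are assumed feasible in every state, and `fixed_point` is a helper introduced for illustration.

```julia
using LinearAlgebra, Random

# Hypothetical primitives: n states, m actions, Γ(x) = A for all x.
rng = MersenneTwister(7)
n, m, β = 5, 3, 0.9
r = rand(rng, n, m)
P = rand(rng, n, m, n)
P ./= sum(P, dims=3)

E(v) = [dot(P[x, a, :], v) for x in 1:n, a in 1:m]
D(g) = r .+ β .* g
M(q) = vec(maximum(q, dims=2))

# Successive approximation from an initial guess x.
function fixed_point(F, x; tol=1e-12, max_iter=100_000)
    for _ in 1:max_iter
        x_new = F(x)
        maximum(abs.(x_new .- x)) < tol && return x_new
        x = x_new
    end
    return x
end

v_star = fixed_point(v -> M(D(E(v))), zeros(n))       # fixed point of T
g_star = fixed_point(g -> E(M(D(g))), zeros(n, m))    # fixed point of R
q_star = fixed_point(q -> D(E(M(q))), zeros(n, m))    # fixed point of S

gap_g = maximum(abs.(g_star .- E(v_star)))   # g* = E v*
gap_q = maximum(abs.(q_star .- D(g_star)))   # q* = D g*
gap_v = maximum(abs.(v_star .- M(q_star)))   # v* = M q*
```

Each gap is of the order of the stopping tolerance, consistent with the three fixed points being images of one another under $E$ , $D$ and $M$ .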

In the next result and the discussion that follows, given $g, q \in \RR^\Gsf$ , we will call $\sigma \in \Sigma$

$g$ -greedy if $\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta g(x, a) \right\}$ for all $x \in \Xsf$ , and
$q$ -greedy if $\sigma(x) \in \argmax_{a \in \Gamma(x)} q(x, a)$ for all $x \in \Xsf$ .

These definitions are exact analogs of the $v$ -greedy concept, applied to expected value functions and $Q$ -factors respectively.

Proof

To see that (i) implies (ii), suppose that $\sigma$ is $v$ -greedy when $v = v^*$ . Then for arbitrary $x \in \Xsf$

\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v^*(x') P(x, a, x') \right\} = \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta g^*(x, a) \right\}.

Hence $\sigma$ is $g$ -greedy when $g=g^*$ , and (i) implies (ii). The proofs of the remaining equivalences (ii) $\implies$ (iii) $\implies$ (i) are similar. The claim that $\sigma$ is optimal if and only if any one of (i)–(iii) holds now follows from Proposition 5.1.1, which asserts that $\sigma$ is optimal if and only if $\sigma$ is $v^*$ -greedy. ◻

Notice that Proposition 5.3.4 is a special case of Corollary 5.3.7.

The results in Corollary 5.3.7 can be understood as “refactored” versions of Bellman’s principle of optimality. A consequence of these results is that we can solve a given MDP by modifying VFI to operate either on expected value functions or on $Q$ -factors. For example, if we find it more convenient to iterate in expected value space, then (informally) we can proceed as follows:

(i) Fix $g \in \RR^\Gsf$ .

(ii) Iterate with $R$ to obtain $g_k \coloneq R^k g \approx g^*$ .

(iii) Compute a $g_k$ -greedy policy.

Since $g_k \approx g^*$ , the resulting policy will be approximately optimal.
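The three steps can be sketched as follows for a small random MDP. The code iterates with $R$ , computes a $g_k$ -greedy policy, and compares it with the greedy policy obtained from ordinary VFI. The primitives are hypothetical, all actions are assumed feasible in every state, and the helper names `apply_k` and `greedy` are assumptions introduced for illustration.

```julia
using LinearAlgebra, Random

# Hypothetical primitives: n states, m actions, Γ(x) = A for all x.
rng = MersenneTwister(11)
n, m, β = 5, 3, 0.9
r = rand(rng, n, m)
P = rand(rng, n, m, n)
P ./= sum(P, dims=3)

E(v) = [dot(P[x, a, :], v) for x in 1:n, a in 1:m]
D(g) = r .+ β .* g
M(q) = vec(maximum(q, dims=2))
R(g) = E(M(D(g)))
T(v) = M(D(E(v)))

# Apply F to x a total of k times.
function apply_k(F, x, k)
    for _ in 1:k
        x = F(x)
    end
    return x
end

# Policy (as action indices) maximizing q(x, ·) at each state x.
greedy(q) = [argmax(q[x, :]) for x in 1:n]

k = 300
g_k = apply_k(R, zeros(n, m), k)      # steps (i) and (ii): g_k = R^k g
σ_g = greedy(D(g_k))                  # step (iii): a g_k-greedy policy

v_k = apply_k(T, zeros(n), k)         # ordinary VFI, for comparison
σ_v = greedy(D(E(v_k)))               # a v_k-greedy policy
```

Here `σ_g` is computed from `r .+ β .* g_k` directly, with no expectation taken at the policy step, whereas the VFI policy requires one more application of `E`.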

5.3.5.3Refactored OPI¶

Earlier in this chapter we found that VFI is often outperformed by HPI or OPI. Our next step is to apply these methods to modified versions of the Bellman equation, as discussed in Section 5.3.5.2. This allows us to combine advantages of HPI/OPI with the potential efficiency gains obtained by refactoring the Bellman equation.

We illustrate these ideas by producing a version of OPI that can compute $Q$ -factors and expected value functions. (The same is true for HPI, although we leave details of that construction to interested readers.) To begin, we introduce a new operator, denoted $M_\sigma$ , that, for fixed $\sigma \in \Sigma$ and $q \in \RR^\Gsf$ , produces

(M_\sigma \, q)(x) \coloneq q(x, \sigma(x)) \qquad (x \in \Xsf).

This operator is the policy analog of the maximization operator $M$ defined by $(Mq)(x) = \max_{a \in \Gamma(x)} q(x, a)$ in Section 5.3.5.1. Analogous to (5.40), we set

R_\sigma \coloneq E \, M_\sigma \, D, \quad S_\sigma \coloneq D \,E \,M_\sigma, \quad T_\sigma \coloneq M_\sigma \, D \,E.

You can verify that $T_\sigma$ is the ordinary $\sigma$ -policy operator (defined in (5.19)). The operators $R_\sigma$ and $S_\sigma$ are the expected value and $Q$ -factor equivalents.

Let’s now show that OPI can be successfully modified via these alternative operators. We will focus on the expected value viewpoint (value functions are replaced by expected value functions), which is often helpful in the applications we wish to consider.

Our modified OPI routine is given in Algorithm 5.5. It makes the obvious modifications to regular OPI, switching to working with expected value functions in $\RR^\Gsf$ and from iteration with $T_\sigma$ to iteration with $R_\sigma$ .

Algorithm 5.5 is globally convergent in the same sense as regular OPI (Algorithm 5.4). In fact, if we pick $v_0 \in \RR^\Xsf$ and apply regular OPI with this initial condition, as well as applying Algorithm 5.5 with initial condition $g_0 \coloneq E v_0$ , then the sequences $(v_k)_{k \geq 0}$ and $(g_k)_{k \geq 0}$ generated by the two algorithms are connected via $g_k = E v_k$ for all $k \geq 0$ . If greedy policies are unique, then it is also true that the policy sequences generated by the two algorithms are identical.

Let’s prove these claims, assuming for convenience that greedy policies are unique. Consider first the claim that $g_k = E v_k$ for all $k \geq 0$ . This is true by assumption when $k=0$ . Suppose, as an induction hypothesis, that $g_k = E v_k$ holds at arbitrary $k$ . Let $\sigma$ be $g_k$ -greedy. Then

\sigma(x) = \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta g_k(x, a) \right\} = \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v_k(x') P(x, a, x') \right\},

where the second equality is implied by $g_k = E v_k$ . Hence $\sigma$ is both $g_k$ -greedy and $v_k$ -greedy and so is the next policy selected by both modified and regular OPI. Moreover, updating via Algorithm 5.5 and applying (5.41), we have

g_{k+1} = R_\sigma^m \, g_k = E T_\sigma^{m-1} \, M_\sigma \, D g_k = E T_\sigma^{m-1} \, M_\sigma \, D E v_k = E T_\sigma^m \, v_k.

Since $\sigma$ is $v_k$ -greedy, $T_\sigma^m \, v_k$ is the next function selected by regular OPI. Hence $v_{k+1} = T_\sigma^m \, v_k$ . Connecting with the last chain of equalities yields $g_{k+1} = Ev_{k+1}$ . This completes the proof that $g_k = E v_k$ for all $k$ . Policy functions generated by the algorithms are identical as well.

The preceding discussion provides a justification for the modified OPI algorithm we adopted in Section 5.3.3.
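The claim $g_k = E v_k$ can also be observed numerically. The sketch below runs regular OPI and the expected value version side by side on a small random MDP and records the largest gap between $g_k$ and $E v_k$ along the way. It reflects one reading of the two routines (select a greedy policy, then apply the corresponding policy operator a fixed number of times), with hypothetical primitives and all actions assumed feasible in every state. The names `compare_opi`, `m_steps` and `n_iter` are assumptions introduced here.

```julia
using LinearAlgebra, Random

# Hypothetical primitives: n states, m actions, Γ(x) = A for all x.
rng = MersenneTwister(3)
n, m, β = 5, 3, 0.9
r = rand(rng, n, m)
P = rand(rng, n, m, n)
P ./= sum(P, dims=3)

E(v) = [dot(P[x, a, :], v) for x in 1:n, a in 1:m]
D(g) = r .+ β .* g
M_σ(q, σ) = [q[x, σ[x]] for x in 1:n]            # (M_σ q)(x) = q(x, σ(x))
greedy(q) = [argmax(q[x, :]) for x in 1:n]

T_σ(v, σ) = M_σ(D(E(v)), σ)                      # T_σ = M_σ D E
R_σ(g, σ) = E(M_σ(D(g), σ))                      # R_σ = E M_σ D

# Apply F to x a total of k times.
function apply_k(F, x, k)
    for _ in 1:k
        x = F(x)
    end
    return x
end

# Run both routines from v0 and g0 = E v0; track max_k ‖g_k − E v_k‖.
function compare_opi(v0; m_steps=20, n_iter=30)
    v, g = v0, E(v0)
    worst_gap = 0.0
    for _ in 1:n_iter
        σ_v = greedy(D(E(v)))                    # v_k-greedy policy
        σ_g = greedy(D(g))                       # g_k-greedy policy
        v = apply_k(x -> T_σ(x, σ_v), v, m_steps)
        g = apply_k(x -> R_σ(x, σ_g), g, m_steps)
        worst_gap = max(worst_gap, maximum(abs.(g .- E(v))))
    end
    return v, g, worst_gap
end

v_final, g_final, worst_gap = compare_opi(zeros(n))
```

With continuously distributed random rewards, ties in the greedy step are not expected, so the two policy sequences coincide and `worst_gap` stays at the level of floating point error.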

5.4Chapter Notes¶

Detailed treatment of MDPs can be found in books by Bellman (1957), Howard (1960), Denardo (1981), Puterman (2005), Bertsekas (2012), Hernández-Lerma & Lasserre (2012), Hernández-Lerma & Lasserre (2012), and Kochenderfer et al. (2022). Further discussion of the connection between HPI and Newton iteration can be found in Section 6.4 of Puterman (2005).

Variants of HPI are routinely used in artificial intelligence applications, including during the training of AlphaZero by DeepMind. Further discussion of these variants of HPI and their connection to Newton iteration can be found in Bertsekas (2021) and Bertsekas (2022).

There are several methods available for accelerating value function iteration, including asynchronous VFI and Anderson acceleration. Due to space constraints, we omit discussion of these topics. Interested readers can find a treatment of asynchronous VFI in Bertsekas (2022). For discussion of Anderson acceleration see, for example, Walker & Ni (2011) or Geist & Scherrer (2018). First-order methods for accelerating VFI are presented in Goyal & Grand-Clement (2023).

Other methods for computing solutions to MDPs include the linear programming (LP) approach and the policy gradient technique, both of which solve a problem of the form

\max_{\sigma \in \Sigma} \sum_x w(x) v(x) \quad \st \quad v = r_\sigma + \beta P_\sigma \, v,

(5.42)

for some chosen weight function $w$ . The LP approach views (5.42) as a linear program and applies various algorithms to the primal and dual problems. See, for example, Puterman (2005) or Ying & Zhu (2020).

The policy gradient method involves approximating $\sigma$ and $v$ in (5.42) using smooth functions with finitely many parameters. These parameters are then adjusted via some version of gradient ascent. A recent trend for high-dimensional MDPs is to approximate the value and policy functions with neural nets. An early exposition can be found in Bertsekas & Tsitsiklis (1996). A more recent monograph is Bertsekas (2021). For research along these lines in the context of economic applications see, for example, Maliar et al. (2021), Hill et al. (2021), Han et al. (2021), Kahou et al. (2021), Kase et al. (2022), and Azinovic et al. (2022).

In some versions of these algorithms, as well as in VFI and HPI, the expectations associated with dynamic programs are computed using Monte Carlo sampling methods. See, for example, Rust (1997), Powell (2007), and Bertsekas (2021). Sidford et al. (2023) combine LP and sampling approaches.

The optimal savings problem is a workhorse in macroeconomics and has been treated extensively in the literature. Early references include Brock & Mirman (1972), Mirman & Zilcha (1975), Schechtman (1976), Deaton & Laroque (1992), and Carroll (1997). For more recent studies, see, for example, Li & Stachurski (2014), Açıkgöz (2018), Light (2018), Lehrer & Light (2018), or Ma et al. (2020). Recent applications involving optimal savings in a representative agent framework include Bianchi (2011), Paciello & Wiederholt (2014), Rendahl (2016), Heathcote & Perri (2018), Paroussos et al. (2019), Erosa & González (2019), Herrendorf et al. (2021), and Michelacci et al. (2022). For more on the long right tail of the wealth distribution (as discussed in Section 5.3.3), see, for example, Benhabib et al. (2015), Krueger et al. (2016), or Stachurski & Toda (2019).

Households solving optimal savings problems are often embedded in heterogeneous agent models in order to study income distributions, wealth distributions, business cycles and other macroeconomic phenomena. Representative examples include Aiyagari (1994), Huggett (1993), Krusell & Smith (1998), Miao (2006), Algan et al. (2014), Toda (2014), Benhabib et al. (2015), Stachurski & Toda (2019), Toda (2019), Light (2020), Hubmer et al. (2020), and Cao (2020).

Exercise 5.3.3 considered optimal savings and consumption in the presence of transient and persistent shocks to labor income. For research in this vein, see, for example, Quah (1990), Carroll (2009), De Nardi et al. (2010), or Lettau & Ludvigson (2014). For empirical work on labor income dynamics, see, for example, Newhouse (2005), Guvenen (2007), Guvenen (2009), or Blundell et al. (2015). For analysis of optimal savings in a very general setting, see Ma et al. (2020) or Ma & Toda (2021).

The optimal investment problem dates back to Lucas & Prescott (1971). Textbook treatments can be found in Stokey & Lucas (1989) and Dixit & Pindyck (2012). Sargent (1980) and Hayashi (1982) used optimal investment problems to connect optimal capital accumulation with Tobin’s $q$ (which is the ratio between a physical asset’s market value and its replacement value). Other influential papers in the field include Lee & Shin (2000), Hassett & Hubbard (2002), Bloom et al. (2007), Bond & Van Reenen (2007), Bloom (2009), and Wang & Wen (2012). Carruth et al. (2000) contains a survey.

Classic papers about S–s inventory models include Arrow et al. (1951) and Dvoretzky et al. (1952). Optimality of S–s policies under certain conditions was first established by Scarf (1960). Kelle & Milne (1999) study the impact of S–s inventory policies on the supply chain, including connection to the “bullwhip” effect. The connection between S–s inventory policies and macroeconomic fluctuations is studied in Nirei (2006).

The model in Exercise 5.2.3 is loosely adapted from Bagliano & Bertola (2004).

Rust (1994) is a classic and highly readable reference in the area of structural estimation of MDPs. Keane & Wolpin (1997) provides an influential study of the career choices of young men. Roberts & Tybout (1997) analyze the decision to export in the presence of sunk costs. Keane et al. (2011) provide an overview of structural estimation applied to labor market problems. Gentry et al. (2018) review analysis of auctions using structural estimation. Legrand (2019) surveys the use of structural models to study the dynamics of commodity prices. Calsamiglia et al. (2020) use structural estimation to study school choices. Iskhakov et al. (2020) provide a thoughtful discussion on the differences between structural estimation and machine learning. Luo & Sang (2022) propose structural estimation via sieves.

Theoretical analysis of expected value functions in discrete choice models and other settings can be found in Rust (1994), Norets (2010), Mogensen (2018) and Kristensen et al. (2021). The expected value Gumbel max trick is due to Rust (1987) and builds on work by McFadden (1974). The Gumbel max trick is also used in machine learning methods (see, e.g., Jang et al. (2016)).

In Section 5.3.4 we mentioned $Q$ -learning, which was originally proposed by Watkins (1989). Tsitsiklis (1994) and Melo (2001) studied convergence of $Q$ -learning. In related work, Esponda & Pouzo (2021) study MDPs where dynamics are unknown, and where agents update their understanding of transition laws via Bayesian updating.

The theory in Section 5.3.5 on optimality under modifications of the Bellman equation is loosely based on Ma & Stachurski (2021). That paper considers arbitrary modifications in a very general setting.

Footnotes¶

See Marcet et al. (2007) and Zhu (2020) for more extensive analysis of how adding a labor supply choice can affect outcomes in a consumption-savings model.
Rational expectations econometrics was a response to that Critique. While early work on rational expectations originated from the macroeconomics community (e.g., Hansen & Sargent (1980), Hansen & Sargent (1990)), many of their examples were actually about industrial organization and other microeconomic models. This work was part of a broad process that erased many boundaries between micro and macro theory.
Hansen & Sargent (1980) analyze the implications of such “Shiller errors” for efficient estimation procedures in a class of linear structural models.
Here, the woman is the primary carer of the child; she derives no utility from children in periods in which she works. See Keane et al. (2011) for further discussion.

References¶

Rust, J. (1987). Optimal replacement of GMC bus engines: An empirical model of Harold Zurcher. Econometrica, 55(5), 999–1033.
Puterman, M. L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Interscience.
Sargent, T., & Stachurski, J. (2023). Economic Networks: Theory and Computation. Cambridge University Press.
Lucas, R., & Sargent, T. (1981). Rational Expectations and Econometric Practice (Vol. 2). University of Minnesota Press.
Lucas, R. E. (1976). Econometric policy evaluation: A critique. Carnegie-Rochester Conference Series on Public Policy, 1, 19–46.
Gillingham, K., Iskhakov, F., Munk-Nielsen, A., Rust, J., & Schjerning, B. (2022). Equilibrium trade in automobiles. Journal of Political Economy, 130(10), 2534–2593.
Keane, M. P., Todd, P. E., & Wolpin, K. I. (2011). The structural estimation of behavioral models: Discrete choice dynamic programming methods and applications. In O. Ashenfelter & D. Card (Eds.), Handbook of Labor Economics (Vol. 4, pp. 331–461). Elsevier.
Huijben, I. A., Kool, W., Paulus, M. B., & Van Sloun, R. J. (2022). A review of the gumbel-max trick and its extensions for discrete stochasticity in machine learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Bellman, R. (1957). Dynamic Programming. In Science. American Association for the Advancement of Science.
Howard, R. A. (1960). Dynamic Programming and Markov Processes. John Wiley & Sons.
Denardo, E. V. (1981). Dynamic Programming: Models and Applications. Prentice Hall PTR.
Bertsekas, D. (2012). Dynamic Programming and Optimal Control (Vol. 1). Athena Scientific.
Hernández-Lerma, O., & Lasserre, J. B. (2012). Discrete-Time Markov Control Processes: Basic Optimality Criteria (Vol. 30). Springer Science & Business Media.
Hernández-Lerma, O., & Lasserre, J. B. (2012). Further Topics on Discrete-Time Markov Control Processes (Vol. 42). Springer Science & Business Media.
Kochenderfer, M. J., Wheeler, T. A., & Wray, K. H. (2022). Algorithms for Decision Making. MIT Press.