Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

5 Markov Decision Processes

Authors
Affiliations
New York University
Australian National University

In this chapter we study a class of discrete time, infinite horizon dynamic programs called Markov decision processes (MDPs). This standard class of problems is broad enough to encompass many applications, including the optimal stopping problems in Chapter 4. MDPs can also be combined with reinforcement learning to tackle settings where important inputs to an MDP are not known.

5.1Definition and Properties

In this section, we define MDPs and investigate optimality.

5.1.1The MDP Model

We study a controller who interacts with a state process (Xt)t0(X_t)_{t \geq 0} by choosing an action path (At)t0(A_t)_{t \geq 0} to maximize expected discounted rewards

Et0βtr(Xt,At),\EE \sum_{t \geq 0} \beta^t r(X_t, A_t),

taking an initial state X0X_0 as given. As with all dynamic programs, we insist that the controller is not clairvoyant: He or she cannot choose actions that depend on future states.

To formalize the problem, we fix a finite set X\Xsf, henceforth called the state space, and a finite set A\Asf, henceforth called the action space. In what follows, a correspondence Γ\Gamma from X\Xsf to A\Asf is a function from X\Xsf into (A)\wp(\Asf), the set of all subsets of A\Asf. The correspondence is called nonempty if Γ(x)\Gamma(x) \not= \emptyset for all xXx \in \Xsf. For example, the map Γ\Gamma defined by Γ(x)=[x,x]\Gamma(x) = [-x, x] is a nonempty correspondence from R\RR to R\RR.

Given X\Xsf and A\Asf, we define a Markov decision process (MDP) to be a tuple M=(Γ,β,r,P)\mM = (\Gamma, \beta, r, P) consisting of

  1. a nonempty correspondence Γ\Gamma from X\Xsf to A\Asf, referred to as the feasible correspondence, which in turn defines the feasible state action pairs

G{(x,a)X×A:aΓ(x)},\Gsf \coloneq \setntn{(x, a) \in \Xsf \times \Asf}{a \in \Gamma(x)},
  1. a constant β\beta in (0,1)(0, 1), referred to as the discount factor,

  2. a function rr from G\Gsf to R\RR, referred to as the reward function, and

  3. a stochastic kernel PP from G\Gsf to X\Xsf; that is, PP is a map from G×X\Gsf \times \Xsf to R+\RR_+ satisfying

xXP(x,a,x)=1 for all (x,a) in G.\sum_{x' \in \Xsf} P(x, a, x') = 1 \quad \text{ for all } (x,a) \text{ in } \Gsf.

Here Γ(x)A\Gamma(x) \subset \Asf is the set of actions available to the controller in state xx. Given a feasible state action pair (x,a)(x, a), reward r(x,a)r(x, a) is received, and the next period state xx' is randomly drawn from P(x,a,)P(x,a, \cdot), which is an element of D(X)\dD(\Xsf). The dynamics and reward flow are summarized in Algorithm 5.1.

The Bellman equation corresponding to M\mM is

v(x)=maxaΓ(x){r(x,a)+βxv(x)P(x,a,x)}for all xX.v(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \quad \text{for all } \, x \in \Xsf.

This can be understood as an equation in the unknown function vRXv \in \RR^\Xsf. Below we define the value function vv^* as maximal lifetime rewards and show that vv^* is the unique solution to the Bellman equation in RX\RR^\Xsf.

We can understand the Bellman equation as reducing an infinite-horizon problem to a two-period problem involving the present and the future. Current actions influence (i) current rewards and (ii) expected discounted value from future states. In every case we examine, there is a trade-off between maximizing current rewards and shifting probability mass towards states with high future rewards.

5.1.2Examples

Here we list examples of MDPs. We will see that some models neatly fit the MDP structure, whereas others can be coaxed into the MDP framework by adding states or applying other tricks.

5.1.2.1A Renewal Problem

Rust (1987) ignited the field of dynamic structural estimation by examining an engine replacement problem for a bus workshop. In each period the superintendent decides whether or not to replace the engine of a given bus. Replacement is costly but delaying risks unexpected failure. Rust (1987) solved this trade-off using dynamic programming.

We consider an abstract version of Rust’s problem with binary action AtA_t. When At=1A_t = 1, the state resets to some fixed renewal state xˉ\bar x in a finite set X\Xsf (e.g., mileage resets to zero when an engine is replaced). When At=0A_t = 0, the state updates according to QM(RX)Q \in \mopx (e.g., mileage increases stochastically when the engine is not replaced). Given current state xx and action aa, current reward r(x,a)r(x,a) is received. The discount factor is β(0,1)\beta \in (0,1).

For this problem, the Bellman equation has the form

v(x)=max{r(x,1)+βv(xˉ),  r(x,0)+βxv(x)Q(x,x)}(xX),v(x) = \max \left\{ r(x,1) + \beta v(\bar x), \; r(x,0) + \beta \sum_{x'} v(x')Q(x, x') \right\} \qquad (x \in \Xsf),

where the first term is the value from action 1 and the second is the value of action 0.

To set the problem up as an MDP we set A={0,1}\Asf = \{0,1\} and Γ(x)=A\Gamma(x) = \Asf for all xXx \in \Xsf. We define

P(x,a,x)a1{x=xˉ}+(1a)Q(x,x)((x,a)G,  xX).P(x, a, x') \coloneq a \1\{x' = \bar x\} + (1-a) Q(x, x') \qquad ((x,a) \in \Gsf, \; x' \in \Xsf).

The primitives (Γ,β,r,P)(\Gamma, \beta, r, P) form an MDP. Moreover, the renewal Bellman equation (5.5) is a special case of the MDP Bellman equation (5.4). To verify this we rewrite (5.5) as

v(x)=maxa{0,1}{r(x,a)+β[av(xˉ)+(1a)xv(x)Q(x,x)]},v(x) = \max_{a \in \{0,1\}} \left\{ r(x,a) + \beta \left[ a v(\bar x) + (1-a) \sum_{x'} v(x')Q(x, x') \right] \right\},

Inserting PP from (5.6) into the right-hand side of the last equation recovers the MDP Bellman equation (5.4).

5.1.2.2Optimal Inventory Management

We study a firm where a manager maximizes shareholder value. To simplify the problem, we ignore exit options (so that firm value is the expected present value of profits) and assume that the firm only sells one product. Letting πt\pi_t be profits at time tt and r>0r > 0 be the interest rate, the value of the firm is

V0=Et0βtπt where β11+r.V_0 = \EE \sum_{t \geq 0} \beta^t \pi_t \qquad \text{ where } \quad \beta \coloneq \frac{1}{1+r}.

The firm faces exogenous demand process (Dt)t0 iid ϕD(Z+)(D_t)_{t \geq 0} \iidsim \phi \in \dD(\ZZ_+). Inventory (Xt)t0(X_t)_{t \geq 0} of the product obeys

Xt+1=f(Xt,At,Dt+1)wheref(x,a,d)(xd)0+a.X_{t+1} = f(X_t, A_t, D_{t+1}) \qquad \text{where} \quad f(x,a,d) \coloneq (x - d)\vee 0 + a.

The term AtA_t is units of stock ordered this period, which take one period to arrive. The definition of ff imposes the assumption that firms cannot sell more stock than they have on hand. We assume that the firm can store at most KK items at one time.

With the price of the firm’s product set to one, current profits are given by

πtXtDt+1cAtκ1{At>0}.\pi_t \coloneq X_t \wedge D_{t+1} - c A_t - \kappa \1\{A_t > 0\}.

Here cc is unit product cost and κ\kappa is a fixed cost of ordering inventory. We take the minimum XtDt+1X_t \wedge D_{t+1} because orders in excess of inventory are assumed to be lost rather than back-filled.

We can map our inventory problem into an MDP with state space X{0,,K}\Xsf \coloneq \{0, \ldots, K\} and action space AX\Asf \coloneq \Xsf. The feasible correspondence Γ\Gamma is

Γ(x){0,,Kx},\Gamma(x) \coloneq \{0, \ldots, K - x\},

which represents the set of feasible orders when the current inventory state is xx. The reward function is expected current profits, or

r(x,a)d0(xd)ϕ(d)caκ1{a>0}.r(x, a) \coloneq \sum_{d \geq 0} (x \wedge d) \phi(d) - c a - \kappa \1\{a > 0\}.

The stochastic kernel from the set of feasible state action pairs G\Gsf induced by Γ\Gamma is, in view of (5.9),

P(x,a,x)P{f(x,a,D)=x}whenDϕ.P(x, a, x') \coloneq \PP\{ f(x, a, D) = x' \} \qquad \text{when} \quad D \sim \phi.

The Bellman equation for this optimal inventory problem is

v(x)=maxaΓ(x){r(x,a)+βd0v(f(x,a,d))ϕ(d)},v(x) = \max_{a \in \Gamma(x)} \left\{ r(x,a) + \beta \sum_{d \geq 0} v(f(x, a, d)) \phi(d) \right\},

at each xXx \in \Xsf, where r(x,a)r(x,a) is as given in (5.12) and the aim is to solve for vv. We introduce the Bellman operator

(Tv)(x)=maxaΓ(x){r(x,a)+βd0v(f(x,a,d))ϕ(d)}.(Tv)(x) = \max_{a \in \Gamma(x)} \left\{ r(x,a) + \beta \sum_{d \geq 0} v(f(x, a, d)) \phi(d) \right\}.

This operator maps RX\RR^\Xsf to itself and is designed so that its set of fixed points in RX\RR^\Xsf coincide with solutions to (5.15) in RX\RR^\Xsf.

5.1.2.3Example: Cake Eating

Many dynamic programming problems in economics involve a trade-off between current and future consumption. The simplest example in this class is the “cake eating” problem, where initial household wealth is given but no labor income is received. Wealth evolves according to

Wt+1=R(WtCt)(t0),W_{t+1} = R(W_t - C_t) \qquad (t \geq 0),

where CtC_t is current consumption and RR is the gross interest rate. The agent seeks to maximize

Et0βtu(Ct)given W0=w,\EE \sum_{t \geq 0} \beta^t u(C_t) \quad \text{given } W_0 = w,

subject to 0CtWt0 \leq C_t \leq W_t (implying that the agent cannot borrow). Consumption level CtC_t generates utility u(Ct)u(C_t). Assuming that wealth takes values in a finite set WR+\Wsf \subset \RR_+, the Bellman equation for this problem can be written as

v(w)=max0ww{u(ww/R)+βv(w)}.v(w) = \max_{0 \leq w' \leq w} \left\{ u (w - w'/R) + \beta v(w') \right\}.

In (5.21) we are using w=R(wc)w' = R(w - c) to obtain c=(ww/R)c=(w-w'/R). The household uses (5.21) to trade-off current utility of consumption against the value of future wealth.

5.1.2.4Example: Optimal Stopping

The optimal stopping problem we studied in Chapter 4 can be framed as an MDP. On one hand, doing so allows us to apply results obtained for MDPs to optimal stopping. On the other hand, expressing an optimal stopping problem as an MDP requires an additional state variable, which complicates the exposition. The next exercise helps to illustrate the key ideas.

Let’s focus on the job search problem with Markov state discussed in Section 3.3.1 (although the arguments for the general optimal stopping problem in Section 4.1.1.1 are very similar). As before, W\Wsf is the set of wage outcomes. Since we need the symbol PP for other purposes, we let QQ be the Markov matrix for wages, so that (Wt)t0(W_t)_{t\geq 0} is QQ-Markov on W\Wsf.

To express the job search problem as an MDP, let X={0,1}×W\Xsf = \{0,1\} \times \Wsf be a state space whose typical element is (e,w)(e, w), with ee representing either unemployment (e=0e=0) or employment (e=1e=1) and ww being the current wage offer. An action aA{0,1}a \in \Asf \coloneq \{0, 1\} indicates rejection or acceptance of the current wage offer.

5.1.3Optimality

In this section, we return to the general MDP setting of Paragraph, define optimal policies and state our main optimality result. As was the case for job search, actions are governed by policies, which are maps from states to actions (see, in particular, Section 1.3.1.3, where policies were introduced).

5.1.3.1Policies and Lifetime Values

Let M=(Γ,β,r,P)\mM = (\Gamma, \beta, r, P) be an MDP. The set of feasible policies corresponding to M\mM is

Σ{σAX:σ(x)Γ(x) for all xX}.\Sigma \coloneq \setntn{\sigma \in \Asf^\Xsf} {\sigma(x) \in \Gamma(x) \text{ for all } x \in \Xsf}.

If we select a policy σ\sigma from Σ\Sigma, it is understood that we respond to state XtX_t with action Atσ(Xt)A_t \coloneq \sigma(X_t) at every date tt. As a result, the state evolves by drawing Xt+1X_{t+1} from P(Xt,σ(Xt),)P(X_t, \sigma(X_t), \cdot) at each t0t \geq 0. In other words, (Xt)t0(X_t)_{t \geq 0} is PσP_\sigma-Markov when

Pσ(x,x)P(x,σ(x),x)(x,xX).P_\sigma(x, x') \coloneq P(x, \sigma(x), x') \qquad (x, x' \in \Xsf).

Note that PσM(RX)P_\sigma \in \mopx. Fixing a policy “closes the loop” in the state transition process and defines a Markov chain for the state.

Under the policy σ\sigma, rewards at state xx are r(x,σ(x))r(x, \sigma(x)). If

rσ(x)r(x,σ(x))andExE[    X0=x],r_\sigma(x) \coloneq r(x, \sigma(x)) \quad \text{and} \quad \EE_x \coloneq \EE[ \; \cdot \; \given X_0 = x],

then the lifetime value of following σ\sigma starting from state xx can be written as

vσ(x)=Ext0βtrσ(Xt)where (Xt) is Pσ-Markov with X0=x.v_\sigma (x) = \EE_x \sum_{t \geq 0} \beta^t r_\sigma(X_t) \quad \text{where } (X_t) \text{ is } P_\sigma \text{-Markov with } X_0 = x.

Since β<1\beta < 1, applying Lemma 3.2.1 to this expression yields

vσ=t0βtPσtrσ=(IβPσ)1rσ.v_\sigma = \sum_{t \geq 0} \beta^t P_\sigma^t \, r_\sigma = (I - \beta P_\sigma)^{-1} \, r_\sigma .

Analogous to the optimal stopping case, we call vσv_\sigma the σ\sigma-value function. We also call vσ(x)v_\sigma(x) the lifetime value of policy σ\sigma conditional on initial state xx.

Another way to compute vσv_\sigma is to use the policy operator TσT_\sigma corresponding to σ\sigma, which is defined at vRXv \in \RR^\Xsf by

(Tσv)(x)=r(x,σ(x))+βxv(x)P(x,σ(x),x)(xX).(T_\sigma \, v)(x) = r(x, \sigma(x)) + \beta \sum_{x'} v(x') P(x, \sigma(x), x') \qquad (x \in \Xsf).

(TσT_\sigma is analogous to the policy operator defined for the optimal stopping problem in Section 4.1.1.3.) In vector notation,

Tσv=rσ+βPσv.T_\sigma \, v = r_\sigma + \beta P_\sigma \, v.

The next exercise shows how TσT_\sigma can be put to work.

Computationally, this means that we can pick vRXv \in \RR^\Xsf and iterate with TσT_\sigma to obtain an approximation to vσv_\sigma.

The next exercise extends Exercise 5.8 and aids interpretation of policy operators. It tells us that (Tσkv)(x)(T_\sigma^k \, v)(x) is the payoff from following policy σ\sigma and starting in state xx when lifetime is truncated to the finite horizon kk and vv provides a terminal payoff in each state.

5.1.3.2Defining Optimality

Given MDP M=(Γ,β,r,P)\mM = (\Gamma, \beta, r, P) with σ\sigma-value functions {vσ}σΣ\{v_\sigma\}_{\sigma \in \Sigma}, the value function corresponding to M\mM is defined as vσΣvσv^* \coloneq \vee_{\sigma \in \Sigma} \, v_\sigma, where, as usual, the maximum is pointwise. More explicitly,

v(x)=maxσΣvσ(x)(xX).v^*(x) = \max_{\sigma \in \Sigma} v_\sigma(x) \qquad (x \in \Xsf).

This is consistent with our definition of the value function in the optimal stopping case. It is the maximal lifetime value we can extract from each state using feasible behavior. The maximum in (5.39) exists at each xx because Σ\Sigma is finite.

A policy σΣ\sigma \in \Sigma is called optimal for M\mM if vσ=vv_\sigma = v^*. In other words, a policy is optimal if its lifetime value is maximal at each state.

Our optimality results are easier to follow with some additional terminology. To start, given vRXv \in \RR^\Xsf, we define a policy σΣ\sigma \in \Sigma to be vv-greedy if

σ(x)argmaxaΓ(x){r(x,a)+βxv(x)P(x,a,x)}for all xX.\sigma(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \quad \text{for all } x \in \Xsf.

In essence, a vv-greedy policy treats vv as the correct value function and sets all actions accordingly.

Bellman’s principle of optimality is said to hold for the MDP M\mM if

σΣ is optimal for M    σ is v-greedy.\sigma \in \Sigma \text{ is optimal for } \mM \quad \iff \quad \sigma \text{ is } v^*\text{-greedy}.

The Bellman operator corresponding to M\mM is the self-map TT on RX\RR^\Xsf defined by

(Tv)(x)=maxaΓ(x){r(x,a)+βxv(x)P(x,a,x)}(xX).(Tv)(x) = \max_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \right\} \qquad (x \in \Xsf).

Obviously, Tv=vTv=v if and only if vv satisfies the Bellman equation (5.4).

The last part of Exercise 5.11 tells us that TT is the pointwise maximum of {Tσ}σΣ\{T_\sigma\}_{\sigma \in \Sigma}, which can be expressed as T=σTσT = \vee_\sigma \, T_\sigma. Figure Figure 5.1 illustrates this relationship in one dimension.

T is the pointwise maximum of \{T_\sigma\}_{\sigma \in \Sigma} (one-dimensional setting)

Figure 5.1:TT is the pointwise maximum of {Tσ}σΣ\{T_\sigma\}_{\sigma \in \Sigma} (one-dimensional setting)

5.1.3.3Optimality Theory

We can now state our main optimality result for MDPs.

While Proposition 5.1.1 is a special case of later results (see Section 8.1.3.3), a direct proof is not difficult and we now provide one for interested readers.

Figure Figure 5.2 illustrates Proposition 5.1.1 in an abstract case, where X\Xsf is a singleton {x}\{x\}. We write vv instead of v(x)v(x) for the value of state xx and place vv on the horizontal axis. In the figure, the set of policies is Σ={σ,σ}\Sigma = \{\sigma', \sigma''\}. For given σΣ\sigma \in \Sigma, the map TσT_\sigma is an affine function Tσv=rσ+βPσvT_\sigma \, v = r_\sigma + \beta P_\sigma \, v and the fixed point is vσv_\sigma. The Bellman operator TT is the upper envelope of the functions {Tσ}\{T_\sigma\}, as shown in (ii) of Exercise 5.11. By definition,

  1. vv^* is the largest of these fixed points, which equals vσv_{\sigma''}, and

  2. σ\sigma'' is the optimal policy, since vσ=vv_{\sigma''} = v^*.

In accordance with Proposition 5.1.1, vv^* is also the fixed point of the Bellman operator.

Illustration of optimality for MDPs

Figure 5.2:Illustration of optimality for MDPs

It is important to understand the significance of (iii) in Proposition 5.1.1. Greedy policies are relatively easy to compute, in the sense that solving (5.40) at each xx is easier than trying to directly solve the problem of maximizing lifetime value, since Σ\Sigma is in general far larger than Γ(x)\Gamma(x). Part (iii) tells us that solving the overall problem reduces to computing a vv-greedy policy with the right choice of vv. For optimal stopping problems, that choice is the value function vv^*. Intuitively, vv^* assigns a “correct” value to each state, in the sense of maximal lifetime value the controller can extract, so using vv^* to calculate greedy policies leads to the optimal outcome.

5.1.4Algorithms

In previous chapters we solved job search and optimal stopping problems using value function iteration. In this section, we present a generalization suitable for arbitrary MDPs and then discuss two important alternatives.

5.1.4.1Value Function Iteration

Value function iteration (VFI) for MDPs is similar to VFI for the job search model: We use successive approximation on TT to compute an approximation vkv_k to the value function vv^* and then take a vkv_k-greedy policy. The general procedure is given by Algorithm 5.2.

The fact that the sequence (vk)k0(v_k)_{k \geq 0} produced by VFI converges to vv^* is immediate from Proposition 5.1.1 (as the tolerance τ\tau is taken toward zero). It is also true that the greedy policy produced in the last step is approximately optimal when τ\tau is small, and exactly optimal when kk is sufficiently large. Proofs are given in Chapter 8, where we examine VFI in a more general setting.

VFI is robust, easy to understand and easy to implement. These properties explain its enduring popularity. At the same time, in terms of efficiency, VFI is often dominated by alternative algorithms that we now describe.

5.1.4.2Howard Policy Iteration

Unlike VFI, Howard policy iteration (HPI) computes optimal policies by iterating between computing the value of a given policy and computing the greedy policy associated with that value. The full technique is described in Algorithm 5.3.

A visualization of HPI is given in Figure Figure 5.3, where σ\sigma is the initial choice. Next, we compute the lifetime value vσv_\sigma, and then the vσv_\sigma-greedy policy σ\sigma', and so on. The computation of lifetime value is called the policy evaluation step, whereas the computation of greedy policies is called policy improvement.

HPI algorithm

Figure 5.3:HPI algorithm

HPI has two very attractive features. One is that, in a finite state setting, the algorithm always converges to an exact optimal policy in a finite number of steps, regardless of the initial condition. We prove this fact in a more general setting in Chapter 8. The second is that the rate of convergence is faster than VFI, as will be shown in Section 5.1.4.3.

Figure Figure 5.4 gives another illustration, presented in the one-dimensional setting that we used for Figure Figure 5.2. In this illustration, we imagine that there are many optimal policies, and hence many functions in {Tσ}\{T_\sigma\}, so that their upper envelope, which is the Bellman operator, becomes a smoother curve. The figure shows the update from vσv_\sigma to the next lifetime value vσv_{\sigma'}, via the following two steps:

  1. Take σ\sigma' to be vσv_\sigma-greedy, which means that Tσvσ=TvσT_{\sigma'} v_\sigma = T v_\sigma (see Exercise 5.11).

  2. Take vσv_{\sigma'} to be the fixed point of TσT_{\sigma'}.

The next step, from vσv_{\sigma'} to vσv_{\sigma''} is analogous.

Comparison of this figure with Figure Figure 2.2 suggests that HPI is an implementation of Newton’s method, applied to the Bellman operator. We confirm this in Section 5.1.4.3.

HPI as a version of Newton’s method

Figure 5.4:HPI as a version of Newton’s method

5.1.4.3HPI as Newton Iteration

In discussing the connection between HPI and Newton iteration, one issue is that TT is not always differentiable, as seen in Figure Figure 5.2. But TT is convex, and this lets us substitute subgradients for derivatives. Once we make this modification, HPI and Newton iteration are identical, as we now show.

First, recall that, given a self-map TT from SRnS \subset \RR^n to itself, an n×nn \times n matrix DD is called a subgradient of TT at vSv \in S if

TuTv+D(uv)for all uS.Tu \geq Tv + D (u - v) \quad \text{for all } u \in S.

Figure Figure 5.5 illustrates the definition in one dimension, where DD is just a scalar determining the slope of a tangent line at vv. In the left subfigure, T1T_1 is convex and differentiable at vv, which means that only one subgradient exists (since any other choice of slope implies that the inequality in (5.49) will fail for some uu). In the right subfigure, T2T_2 is convex but nondifferentiable at vv, so multiple subgradients exist.

Subgradients of convex functions

Figure 5.5:Subgradients of convex functions

In the next result, we take (Γ,β,r,P)(\Gamma, \beta, r, P) to be a given MDP and let TT be the associated Bellman operator.

Now let’s consider Newton’s method applied to the problem of finding the fixed point of TT. Since TT is nondifferentiable and convex, we replace the Jacobian in Newton’s method (see (2.13)) with the subgradient. This leads us to iterate on

vk+1=QvkwhereQv(IβPσ)1(TvβPσv).v_{k+1} = Qv_k \quad \text{where} \quad Qv \coloneq (I - \beta P_\sigma)^{-1} (Tv - \beta P_\sigma v).

In the definition of QQ, the policy σ\sigma is vv-greedy. Using Tv=TσvTv = T_\sigma v, the map QQ reduces to Qv(IβPσ)1rσQv \coloneq (I - \beta P_\sigma)^{-1} r_\sigma, which is exactly the update step to produce the next σ\sigma-value function in HPI (i.e., the lifetime value of a vv-greedy policy).

The fact that HPI is a version of Newton’s method suggests that its iterates (vk)k0(v_k)_{k \geq 0} enjoy quadratic convergence. This is indeed the case: Under mild conditions, one can show there exists a constant NN such that, for all k0k \geq 0,

vk+1vkNvkvk12\| v_{k+1} - v_k \| \leq N \| v_k - v_{k-1} \|^2

(see, e.g., Puterman (2005), Theorem 6.4.8). Hence HPI enjoys both a fast convergence rate and the robustness of global convergence.

However, HPI is not always optimal in terms of efficiency, since the size of the constant term in (5.53) also matters. This term can be large because, at each step, the update from vσv_\sigma to vσv_{\sigma'} requires computing the exact lifetime value vσv_{\sigma'} of the vσv_\sigma-greedy policy σ\sigma'. Computing this fixed point exactly can be computationally expensive in high dimensions.

One way around this issue is to forgo computing the fixed point vσv_{\sigma'} exactly, replacing it with an approximation. Section Section 5.1.4.4 takes up this idea.

5.1.4.4Optimistic Policy Iteration

Optimistic policy iteration (OPI) is an algorithm that borrows from both VFI and HPI. In essence, the algorithm is the same as HPI except that, instead of computing the full value vσv_\sigma of a given policy, the approximation TσmvT_\sigma^m v from Exercise 5.9 is used instead. Algorithm 5.4 clarifies.

In the algorithm, the policy operator TσkT_{\sigma_k} is applied mm times to generate an approximation of vσkv_{\sigma_k}. The constant step size mm can also be replaced with a sequence (mk)N(m_k) \subset \NN. In either case, for MDPs, convergence to an optimal policy is guaranteed. We prove this in a more general setting in Chapter 8.

Notice that, as mm \to \infty, the algorithm increasingly approximates HPI, since TσkmvkT_{\sigma_k}^m v_k converges to vσkv_{\sigma_k}. At the same time, if m=1m=1, the reduces to VFI. This follows from Exercise 5.11, which tells us that, when σk\sigma_k is vkv_k-greedy, Tσkvk=TvkT_{\sigma_k} v_k = T v_k. Hence, with intermediate mm, OPI can be seen as a “convex combination” of HPI and VFI.

In almost all dynamic programming applications, there exist choices of m>1m > 1 such that OPI converges faster than VFI. We investigate these ideas in the applications. In some cases, there exist values of mm such that OPI dominates HPI. However, this depends on the structure of the problem and the software and hardware platforms being employed -- see Section 2.1.4.4 and the applications for additional discussion.

5.2Applications

This section gives several applications of the MDP model to economic problems. The applications illustrate the ease with which MDPs can be implemented on a computer (provided that the state and action spaces are not too large).

5.2.1Optimal Inventories

In Section 3.1.1.2 we studied a firm whose inventory behavior was specified to follow S--s dynamics. In Section 5.1.2.2 we introduced a model where investment behavior is endogenous, determined by the desire to maximize firm value. In this section, we show that this endogenous inventory behavior can replicate the S--s dynamics from Section 3.1.1.2.

We saw in Section 5.1.2.2 that the optimal inventory model is an MDP, so the Proposition 5.1.1 optimality and convergence results apply. In particular, the unique fixed point of the Bellman operator is the value function vv^*, and a policy σ\sigma^* is optimal if and only if σ\sigma^* is vv^*-greedy.

We solve the model numerically using VFI. As in Exercise 5.2, we take ϕ\phi to be the geometric distribution on Z+\ZZ_+ with parameter pp. We use the default parameter values shown in Listing 1. The code listing also presents an implementation of the Bellman operator.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
using Distributions

f(x, a, d) = max(x - d, 0) + a  # Inventory update

function create_inventory_model(; β=0.98,     # discount factor
                                  K=40,       # maximum inventory
                                  c=0.2, κ=2, # cost paramters
                                  p=0.6)      # demand parameter
    ϕ(d) = (1 - p)^d * p        # demand distribution
    x_vals = collect(0:K)       # set of inventory levels
    return (; β, K, c, κ, p, ϕ, x_vals)
end

"The function B(x, a, v) = r(x, a) + β Σ_x′ v(x′) P(x, a, x′)."
function B(x, a, v, model; d_max=100)
    (; β, K, c, κ, p, ϕ, x_vals) = model
    revenue = sum(min(x, d) * ϕ(d) for d in 0:d_max) 
    current_profit = revenue - c * a - κ * (a > 0)
    next_value = sum(v[f(x, a, d) + 1] * ϕ(d) for d in 0:d_max)
    return current_profit + β * next_value
end

"The Bellman operator."
function T(v, model)
    (; β, K, c, κ, p, ϕ, x_vals) = model
    new_v = similar(v)
    for (x_idx, x) in enumerate(x_vals)
        Γx = 0:(K - x) 
        new_v[x_idx], _ = findmax(B(x, a, v, model) for a in Γx)
    end
    return new_v
end

Program 1:Solving the optimal inventory model (inventory_dp.jl)

Figure Figure 5.6 exhibits an approximation of the value function vv^*, computed by iterating with TT starting at v1v \equiv 1. Figure Figure 5.6 also shows the approximate optimal policy, obtained as a vv^*-greedy policy:

σ(x)argmaxaΓ(x){r(x,a)+βd0v(f(x,a,d))ϕ(d)}.\sigma^*(x) \in \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{d \geq 0} v^*(f(x, a, d)) \phi(d) \right\}.

The plot of the optimal policy shows that there is a threshold region below which the firm orders large batches and above which the firm orders nothing. This makes sense, since the firm wishes to economize on the fixed cost of ordering. Figure Figure 5.7 shows a simulation of inventory dynamics under the optimal policy, starting from X0=0X_0 = 0. The time path closely approximates the S--s dynamics discussed in Section 3.1.1.2.

The value function and optimal policy for the inventory problem

Figure 5.6:The value function and optimal policy for the inventory problem

Optimal inventory dynamics

Figure 5.7:Optimal inventory dynamics

5.2.2Optimal Savings with Labor Income

As our next example of an MDP, we modify the cake eating problem in Section 5.1.2.3 to add labor income. Wealth evolves according to

Wt+1=R(Wt+YtCt)(t=0,1,),W_{t+1} = R (W_t + Y_t - C_t) \qquad (t = 0, 1, \ldots),

where (Wt)(W_t) takes values in finite set WR+\Wsf \subset \RR_+ and labor income (Yt)(Y_t) is a Markov chain on finite set YR+\Ysf \subset \RR_+ with transition matrix QQ.[1] RR is a gross rate of interest, so that investing dd dollars today returns RdRd next period. Other parts of the problem are unchanged. The Bellman operator can be written as

(Tv)(w,y)=maxwΓ(w,y){u(w+ywR)+βyv(w,y)Q(y,y)}.(Tv)(w, y) = \max_{w' \in \Gamma(w, y)} \left\{ u \left( w + y - \frac{w'}{R} \right) + \beta \sum_{y'} v(w', y') Q(y, y') \right\}.

5.2.2.1MDP Representation

To frame this problem as an MDP, we set the state to x(w,y)x \coloneq (w, y), representing current wealth and income, taking values in the state space XW×Y\Xsf \coloneq \Wsf \times \Ysf. The action is savings ss, which takes values in W\Wsf and equals ww'. The feasible correspondence is the set of feasible savings values

Γ(w,y)={sW:sR(w+y)}.\Gamma(w, y) = \setntn{s \in \Wsf}{s \leq R (w + y)}.

The current reward is utility of consumption r(w,s)=u(w+ys/R)r(w, s) = u(w + y - s/R). The stochastic kernel is

P((w,y),s,(w,y))=1{w=s}Q(y,y).P((w, y), s, (w', y')) = \1\{w' = s\} Q(y, y').

Having framed an MDP, the Proposition 5.1.1 optimality results apply.

5.2.2.2Implementation

To implement the algorithms discussed in Section 5.1.4, we use the Bellman operator (5.56), and the corresponding definition of a vv-greedy policy, which is

σ(w,y)argmaxwΓ(w,y){u(w+ywR)+βyv(w,y)Q(y,y)},\sigma(w, y) \in \argmax_{w' \in \Gamma(w, y)} \left\{ u \left( w + y - \frac{w'}{R} \right) + \beta \sum_{y'} v(w', y') Q(y, y') \right\},

for all (w,y)(w, y). The policy operator for given σΣ\sigma \in \Sigma is

(Tσv)(w,y)=u(w+yσ(w,y)R)+βyv(σ(w,y),y)Q(y,y).(T_\sigma \, v)(w, y) = u \left( w + y - \frac{\sigma(w, y)}{R} \right) + \beta \sum_{y'} v(\sigma(w, y), y') Q(y, y').

Code for implementing the model and these two operators is given in Listing 2. Income is constructed as a discretized AR(1) process using the method from Section 3.1.3. Exponentiation is applied to the grid so that income takes positive values.

The function get_value in Listing 3 uses the expression vσ=(IβPσ)1rσv_\sigma = (I - \beta \, P_\sigma)^{-1} r_\sigma from (5.33) to obtain the value of a given policy σ\sigma. The matrix PσP_\sigma and vector rσr_\sigma take the form

Pσ((w,y),(w,y))=1{σ(w,y)=w}Q(y,y),rσ(w,y)=u(w+yσ(w,y)/R).\begin{aligned} P_\sigma((w, y), (w', y')) & = \1\{\sigma(w, y) = w'\} Q(y, y'), \\ r_\sigma(w, y) & = u(w + y - \sigma(w, y) / R). \end{aligned}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
using QuantEcon, LinearAlgebra, IterTools

function create_savings_model(; R=1.01, β=0.98, γ=2.5,  
                                w_min=0.01, w_max=20.0, w_size=200,
                                ρ=0.9, ν=0.1, y_size=5)
    w_grid = LinRange(w_min, w_max, w_size)  
    mc = tauchen(y_size, ρ, ν)
    y_grid, Q = exp.(mc.state_values), mc.p
    return (; β, R, γ, w_grid, y_grid, Q)
end

"B(w, y, w′, v) = u(R*w + y - w′) + β Σ_y′ v(w′, y′) Q(y, y′)."
function B(i, j, k, v, model)
    (; β, R, γ, w_grid, y_grid, Q) = model
    w, y, w′ = w_grid[i], y_grid[j], w_grid[k]
    u(c) = c^(1-γ) / (1-γ)
    c = w + y - (w′ / R)
    @views value = c > 0 ? u(c) + β * dot(v[k, :], Q[j, :]) : -Inf
    return value
end

"The Bellman operator."
function T(v, model)
    w_idx, y_idx = (eachindex(g) for g in (model.w_grid, model.y_grid))
    v_new = similar(v)
    for (i, j) in product(w_idx, y_idx)
        v_new[i, j] = maximum(B(i, j, k, v, model) for k in w_idx)
    end
    return v_new
end

"The policy operator."
function T_σ(v, σ, model)
    w_idx, y_idx = (eachindex(g) for g in (model.w_grid, model.y_grid))
    v_new = similar(v)
    for (i, j) in product(w_idx, y_idx)
        v_new[i, j] = B(i, j, σ[i, j], v, model) 
    end
    return v_new
end

Program 2:Discrete optimal savings model (finite_opt_saving_0.jl)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
include("finite_opt_saving_0.jl")

"Compute a v-greedy policy."
function get_greedy(v, model)
    w_idx, y_idx = (eachindex(g) for g in (model.w_grid, model.y_grid))
    σ = Matrix{Int32}(undef, length(w_idx), length(y_idx))
    for (i, j) in product(w_idx, y_idx)
        _, σ[i, j] = findmax(B(i, j, k, v, model) for k in w_idx)
    end
    return σ
end

"Get the value v_σ of policy σ."
function get_value(σ, model)
    # Unpack and set up
    (; β, R, γ, w_grid, y_grid, Q) = model
    w_idx, y_idx = (eachindex(g) for g in (w_grid, y_grid))
    wn, yn = length(w_idx), length(y_idx)
    n = wn * yn
    u(c) = c^(1-γ) / (1-γ)
    # Build P_σ and r_σ as multi-index arrays
    P_σ = zeros(wn, yn, wn, yn)
    r_σ = zeros(wn, yn)
    for (i, j) in product(w_idx, y_idx)
            w, y, w′ = w_grid[i], y_grid[j], w_grid[σ[i, j]]
            r_σ[i, j] = u(w + y - w′/R)
        for j′ in y_idx
            P_σ[i, j, σ[i, j], j′] = Q[j, j′]
        end
    end
    # Reshape for matrix algebra
    P_σ = reshape(P_σ, n, n)
    r_σ = reshape(r_σ, n)
    # Apply matrix operations --- solve for the value of σ 
    v_σ = (I - β * P_σ) \ r_σ
    # Return as multi-index array
    return reshape(v_σ, wn, yn)
end

Program 3:Discrete optimal savings model (finite_opt_saving_1.jl)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
include("s_approx.jl")
include("finite_opt_saving_1.jl")

"Value function iteration routine."
function value_iteration(model, tol=1e-5)
    vz = zeros(length(model.w_grid), length(model.y_grid))
    v_star = successive_approx(v -> T(v, model), vz, tolerance=tol)
    return get_greedy(v_star, model)
end

"Howard policy iteration routine."
function policy_iteration(model)
    wn, yn = length(model.w_grid), length(model.y_grid)
    σ = ones(Int32, wn, yn)
    i, error = 0, 1.0
    while error > 0
        v_σ = get_value(σ, model)
        σ_new = get_greedy(v_σ, model)
        error = maximum(abs.(σ_new - σ))
        σ = σ_new
        i = i + 1
        println("Concluded loop $i with error $error.")
    end
    return σ
end

"Optimistic policy iteration routine."
function optimistic_policy_iteration(model; tolerance=1e-5, m=100)
    v = zeros(length(model.w_grid), length(model.y_grid))
    error = tolerance + 1
    while error > tolerance
        last_v = v
        σ = get_greedy(v, model)
        for i in 1:m
            v = T_σ(v, σ, model)
        end
        error = maximum(abs.(v - last_v))
    end
    return get_greedy(v, model)
end

Program 4:Discrete optimal savings model (finite_opt_saving_2.jl)

5.2.2.3Timing

Since all results for MDPs apply, we know that the value function vv^* is the unique fixed point of the Bellman operator in RX\RR^\Xsf, and that VFI, HPI, and OPI all converge. Listing 4 implements these three algorithms. Since the state and action space are finite, HPI is guaranteed to return an exact optimal policy.

Figure Figure 5.8 shows the number of seconds taken to solve the finite optimal savings model under the default parameters when executed on a laptop machine with 20 CPUs running at around 4 GHz. The horizontal axis corresponds to the step parameter mm in OPI (Algorithm 5.4). The two other algorithms do not depend on mm and hence their timings are constant. The figure shows that HPI is an order of magnitude faster than VFI and that OPI is even faster for moderate values of mm.

One reason VFI is slow is that the discount factor is close to one. This matters because the convergence rate for VFI is linear with error size decreasing geometrically in β\beta. In contrast, HPI, being an instance of Newton iteration, converges quadratically (see Section 2.1.4.2). As a result, HPI tends to dominate VFI when the discount factor approaches unity.

Run-times are also dependent on implementation, and relative speed varies significantly with coding style, software, and hardware platforms. In our implementation, the main deficiency is that parallelization is under-utilized. Better exploitation of parallelization tends to favor HPI, as discussed in Section 2.1.4.4.

Timings for alternative algorithms, savings model

Figure 5.8:Timings for alternative algorithms, savings model

5.2.2.4Outputs

Figure Figure 5.9 shows a typical time series for the wealth of a single household under the optimal policy. The series is created by computing an optimal policy σ\sigma^*, generating (Yt)t=0m1(Y_t)_{t=0}^{m-1} as a QQ-Markov chain on Y\Ysf and then computing (Wt)t=0m(W_t)_{t=0}^m via Wt+1=σ(Wt,Yt)W_{t+1} = \sigma^*(W_t, Y_t) for tt running from 0 to m1m-1. Initial wealth W0W_0 is set to 1.0 and m=2000m = 2000.

Time series for wealth

Figure 5.9:Time series for wealth

Figure Figure 5.10 shows the result of computing and histogramming a longer time series, with mm set to 1,000,000. This histogram approximates the stationary distribution of wealth for a large population, each updating via σ\sigma^* and each with independently generated labor income series (Yt)t=0m1(Y_t)_{t=0}^{m-1}. (This is due to ergodicity of the wealth-income process. For a discussion of the connection between stationary distributions and time series under ergodicity see, for example, Sargent & Stachurski (2023).)

Histogram of wealth

Figure 5.10:Histogram of wealth

The shape of the wealth distribution in Figure Figure 5.10 is unrealistic. In almost all countries, the wealth distribution has a very long right tail. The Gini coefficient of the distribution in Figure Figure 5.10 is 0.54, which is too low. For example, World Bank data for 2019 produces a wealth Gini for the US equal to 0.852. For Germany and Japan the figures are 0.816 and 0.627 respectively.

In Section 5.3.3 we discuss a variation on the optimal savings model that can produce a more realistic wealth distribution.

5.2.3Optimal Investment

As our next application, we consider a monopolist facing adjustment costs and stochastically evolving demand. The monopolist balances setting enough capacity to meet demand against costs of adjusting capacity.

5.2.3.1Problem Description

We assume that the monopolist produces a single product and faces an inverse demand function of the form

Pt=a0a1Yt+Zt,P_t = a_0 - a_1 Y_t + Z_t,

where a0,a1a_0, a_1 are positive parameters, YtY_t is output, PtP_t is price, and the demand shock ZtZ_t follows

Zt+1=ρZt+σηt+1,{ηt} iid N(0,1).Z_{t+1} = \rho Z_t + \sigma \eta_{t+1}, \qquad \{\eta_t \} \iidsim N(0, 1).

Current profits are

πtPtYtcYtγ(Yt+1Yt)2.\pi_t \coloneq P_t Y_t - c Y_t - \gamma (Y_{t+1} - Y_t)^2.

Here γ(Yt+1Yt)2\gamma (Y_{t+1} - Y_t)^2 represents costs associated with adjusting production scale, parameterized by γ\gamma, and cc is unit cost of current production. Costs are convex, so rapid changes to capacity are expensive.

The monopolist chooses (Yt)(Y_t) to maximize the expected discounted value of its profit flow, which we write as

Et=0βtπt.\EE \, \sum_{t=0}^{\infty} \beta^t \pi_t.

Here β=1/(1+r)\beta = 1/(1+r), where r>0r > 0 is a fixed interest rate.

A way to start thinking about the optimal time path of output is to consider what would happen if γ=0\gamma = 0. Without adjustment costs there is no intertemporal trade-off, so the monopolist should choose output to maximize current profit in each period. The implied level of output at time tt is

Yˉta0c+Zt2a1.\bar Y_t \coloneq \frac{a_0 - c + Z_t}{2 a_1}.

For γ>0\gamma > 0, we expect the following behavior.

5.2.3.2MDP Representation

We can represent this problem as an MDP. To do so we let Y\Ysf be a grid contained in R+\RR_+ that lists possible output values. To conform to the finite state setting, we discretize the shock process (Zt)(Z_t) using Tauchen’s method, as described in Section 3.1.3. For convenience we again use (Zt)(Z_t) to represent the discrete process, which is a finite Markov chain on ZR\Zsf \subset \RR with transition matrix QQ.

The state space for this MDP is X=Y×Z\Xsf = \Ysf \times \Zsf, while the action space is Y\Ysf. The feasible correspondence is defined by Γ(x)=Y\Gamma(x) = \Ysf, meaning that choice of output is not restricted by the state. Thus, the feasible policy set Σ\Sigma is all σ ⁣:Y×ZY\sigma \colon \Ysf \times \Zsf \to \Ysf.

We write (y,z)(y,z) for the current state, qq for the action (which chooses next period output) and (y,z)(y',z') for the next period state. The current reward function is current profits, which we can write as

r((y,z),q)=(a0a1y+zc)yγ(qy)2.r((y, z), q) = (a_0 - a_1 y + z - c) y - \gamma (q - y)^2.

The stochastic kernel is

P((y,z),q,(y,z))=1{y=q}Q(z,z).P((y, z), q, (y', z')) = \1\{y' = q\} Q(z, z').

The term 1{y=q}\1\{y' = q\} states that next period output yy' is equal to our current choice qq for next period output. With these definitions, the problem defines an MDP and all of the optimality theory for MDPs applies.

5.2.3.3Implementation

The Bellman operator can be expressed as

(Tv)(y,z)=maxyR{r(y,z,y)+βzv(y,z)Q(z,z)}.(Tv)(y, z) = \max_{y' \in \RR} \left\{ r(y, z, y') + \beta \sum_{z'} v(y', z') Q(z, z') \right\}.

Given σΣ\sigma \in \Sigma, we can express the policy operator as

(Tσv)(y,z)=r(y,z,σ(y,z))+βzv(σ(y,z),z)Q(z,z).(T_\sigma \, v)(y, z) = r(y, z, \sigma(y, z)) + \beta \sum_{z'} v(\sigma(y, z), z') Q(z, z').

A vv-greedy policy is a σΣ\sigma \in \Sigma that obeys

σ(y,z)argmaxyY{r(y,z,y)+βzv(y,z)Q(z,z)}for all (y,z)X.\sigma(y, z) \in \argmax_{y' \in \Ysf} \left\{ r(y, z, y') + \beta \sum_{z'} v(y', z') Q(z, z') \right\} \quad \text{for all } (y,z) \in \Xsf.

By combining iteration with the policy operator and computation of greedy policies, we can implement OPI, compute the optimal policy σ\sigma^*, and study output choices generated by this policy. We are particularly interested in how output responds over time to randomly generated demand shocks.

Figure Figure 5.11 shows the result of a simulation designed to shed light on how output responds to demand. After choosing initial values (Y1,Z1)(Y_1, Z_1) and generating a QQ-Markov chain (Zt)t=1T(Z_t)_{t=1}^T, we simulated optimal output via Yt+1=σ(Yt,Zt)Y_{t+1} = \sigma^*(Y_t, Z_t). The default parameters are shown in Listing 5. In the figure, the adjustment cost parameter γ\gamma is varied as shown in the title. In addition to the optimal output path, the path of (Yˉt)(\bar Y_t) as defined in (5.66) is also presented.

The figure shows how increasing γ\gamma promotes smoothing, as predicted in the preceding discussion. For small γ\gamma, adjustment costs have only minor effects on choices, so output closely follows (Yˉt)(\bar Y_t), the optimal path when output responds immediately to demand shocks. Conversely, larger values of γ\gamma make adjustment expensive, so the operator responds relatively slowly to changes in demand.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
using QuantEcon, LinearAlgebra, IterTools
include("s_approx.jl")

function create_investment_model(; 
        r=0.04,                              # Interest rate
        a_0=10.0, a_1=1.0,                   # Demand parameters
        γ=25.0, c=1.0,                       # Adjustment and unit cost 
        y_min=0.0, y_max=20.0, y_size=100,   # Grid for output
        ρ=0.9, ν=1.0,                        # AR(1) parameters
        z_size=25)                           # Grid size for shock
    β = 1/(1+r) 
    y_grid = LinRange(y_min, y_max, y_size)  
    mc = tauchen(z_size, ρ, ν)
    z_grid, Q = mc.state_values, mc.p
    return (; β, a_0, a_1, γ, c, y_grid, z_grid, Q)
end

Program 5:Optimal investment model (finite_lq.jl)

Simulation of optimal output with different adjustment costs

Figure 5.11:Simulation of optimal output with different adjustment costs

Figure Figure 5.12 compares timings for VFI, HPI, and OPI. Parameters are as in Listing 5. As in Figure Figure 5.8, which gave timings for the optimal savings model, the horizontal axis shows mm, which is the step parameter in OPI (see Algorithm 5.4). VFI and HPI do not depend on mm and hence their timings are constant. The vertical axis is time in seconds.

HPI is faster than VFI, although the difference is not as dramatic as was the case for optimal savings. One reason is that the discount factor is relatively small for the optimal investment model (r=0.04r=0.04 and β=1/(1+r)\beta = 1/(1+r), so β0.96)\beta \approx 0.96). Since β\beta is the modulus of contraction for the Bellman operator, this means that VFI converges relatively quickly. Another observation is that, for many values of mm, OPI dominates both VFI and HPI in terms of speed, which is consistent with our findings for the optimal savings model. At m=70m=70, OPI is around 20 times faster than VFI.

Timings for alternative algorithms, investment model

Figure 5.12:Timings for alternative algorithms, investment model

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
using QuantEcon, LinearAlgebra, IterTools

function create_hiring_model(; 
        r=0.04,                              # Interest rate
        κ=1.0,                               # Adjustment cost 
        α=0.4,                               # Production parameter
        p=1.0, w=1.0,                        # Price and wage
        l_min=0.0, l_max=30.0, l_size=100,   # Grid for labor
        ρ=0.9, ν=0.4, b=1.0,                 # AR(1) parameters
        z_size=100)                          # Grid size for shock
    β = 1/(1+r) 
    l_grid = LinRange(l_min, l_max, l_size)  
    mc = tauchen(z_size, ρ, ν, b, 6)
    z_grid, Q = mc.state_values, mc.p
    return (; β, κ, α, p, w, l_grid, z_grid, Q)
end

Program 6:Firm hiring model (firm_hiring.jl)

Optimal shifts in the stock of labor

Figure 5.13:Optimal shifts in the stock of labor

5.3Modified Bellman Equations

Direct application of MDP theory is sometimes suboptimal. For example, we saw in Section 1.3.2.2 that solving the job search problem with iid wage draws is best accomplished by generating a recursion on the continuation value, which reduces dimensionality for iterative solution methods. Separately, in Section 4.2.2.2, we saw how a different manipulation of the Bellman equation also increased efficiency.

Now we aim to study such modifications systematically. We begin by providing other examples of how manipulating a Bellman equation can facilitate computation and analysis. Then we establish a theoretical foundation for this line of analysis, and show how similar ideas can also be applied to policy operators and greedy policies.

(We also treat similar topics at a more advanced and abstract level in Volume 2.)

5.3.1Structural Estimation

As a first illustration of the ideas in this section, we discuss a connection between econometric estimation and dynamic programs. Our focus is on some modifications that econometricians often make to Bellman equations and how they affect computation and optimality.

5.3.1.1What Is Structural Estimation?

Structural estimation is a branch of quantitative social science in which, in a quest to understand observed quantities and prices, researchers attribute Markov decision problems to economic agents. A key step in this approach is to formulate dynamic programs in terms of functional forms and parameters. The econometric challenge is to infer parameters that bring the model outputs as close as possible to actual data.

Structural estimation aims to discover objects that are invariant to hypothetical interventions that the analysis wants to investigate. Examples of such invariant objects are parameters of utility functions, discount factors, and production technologies. Agents inside the model solve their MDPs. A policy intervention that systematically alters the Markov processes that they face will alter agents’ optimal policies, that is, their decision rules. Various examples of such interventions involving aspects of fiscal and monetary policy are described in various chapters of Lucas & Sargent (1981) a compendium of early papers that were written in response to the Lucas (1976) Critique of then prevailing dynamic econometric models.[2]

Efficient solution methods are essential in structural estimation because the underlying dynamic program must be solved repeatedly in order to search the parameter space for a good fit to data. Moreover, these dynamic programs are often high-dimensional, due to shocks to preferences and other random variables that the agents inside the model are assumed to see but that the econometrician does not. When these shocks are persistent, the dimension of the state grows.[3]

In order to maintain focus on dynamic programming, we will not describe the details of the estimation step required for structural estimation (although Section 5.4 contains references for those who wish to learn about that). Instead, we focus on the kinds of dynamic programs treated in structural estimation and techniques for solving them efficiently.

5.3.1.2An Illustration

Let us look at an example of a dynamic program with preference shocks used in structural estimation, which is taken from a study of labor supply by married women Keane et al., 2011. The husband of the decision-maker, a married woman, is already working. The couple has young children and the mother is deciding whether to work. Her utility function is

u(c,d,ξ)=c+(αn+ξ)(1d),u(c, d, \xi) = c + (\alpha n + \xi) (1 - d),

where cc is consumption, α\alpha is a parameter, nn is the number of children, ξ\xi is a preference shock and dd is the action variable. The action is binary, with d=1d=1 representing the decision to work in the current period and d=0d=0 representing the decision not to work.[4]

The budget constraint for the household is

ct=ft+wtdtπndt,c_t = f_t + w_t d_t - \pi n d_t,

where ftf_t is the father’s income, wtw_t is the mother’s wage and π\pi is the cost of child care. Wages depend on human capital hth_t, which increases with experience. In particular,

wt=γht+ηt,withht=ht1+dt1.w_t = \gamma h_t + \eta_t, \quad \text{with} \quad h_t = h_{t-1} + d_{t-1}.

Here ηt\eta_t is random and γ\gamma is a parameter. We assume that (ft)t0(f_t)_{t \geq 0} is FF-Markov on some finite set. In the model, (ξt)t0(\xi_t)_{t \geq 0} and (ηt)t0(\eta_t)_{t \geq 0} are iid. We denote their joint distribution by ϕ\phi.

With constant discount factor β\beta and implied utility

r(f,h,ξ,η,d)f+(γh+η)dπnd+(αn+ξ)(1d),r(f, h, \xi, \eta, d) \coloneq f + (\gamma h + \eta) d - \pi n d + (\alpha n + \xi) (1 - d),

the problem of maximizing expected discounted utility is an MDP with the Bellman equation

v(f,h,ξ,η)=maxd{r(f,h,ξ,η,d)+βf,ξ,ηv(f,h+d,ξ,η)F(f,f)ϕ(ξ,η)}.v(f, h, \xi, \eta) = \max_d \left\{ r(f, h, \xi, \eta, d) + \beta \sum_{f', \xi', \eta'} v(f', h + d, \xi', \eta') F(f, f') \phi(\xi', \eta') \right\}.

While we can proceed directly with a technique such as VFI to obtain optimal choices, we can simplify.

One way is by reducing the number of states. A hint comes from looking at the expected value function

g(f,h,d)f,ξ,ηv(f,h+d,ξ,η)F(f,f)ϕ(ξ,η)g(f, h, d) \coloneq \sum_{f', \xi', \eta'} v(f', h + d, \xi', \eta') F(f, f') \phi(\xi', \eta')

This function depends only on three arguments and, moreover, the choice variable dd is binary. Hence we can break gg down into two functions g(f,h,0)g(f, h, 0) and g(f,h,1)g(f, h, 1), each of which depends only on the pair (f,h)(f, h). These functions are substantially simpler than vv when the domain of (ξ,η)(\xi, \eta) is large. Hence, it is natural to consider whether we can solve our problem using gg rather than vv.

5.3.1.3Expected Value Functions

Rather than address this question within the context of the preceding model, let’s shift to a generic version of the dynamic program used in structural estimation and how it can be solved using expected value methods. Our generic version takes the form

v(y,ϵ)=maxaΓ(y){r(y,ϵ,a)+βyv(y,ϵ)P(y,a,y)ϕ(ϵ) ⁣dϵ}v(y, \epsilon) = \max_{a \in \Gamma(y)} \left\{ r(y, \epsilon, a) + \beta \sum_{y'} \int v(y', \epsilon') P(y, a, y') \phi(\epsilon') \diff \epsilon' \right\}

for all yYy \in \Ysf and ϵE\epsilon \in \Esf. Here Y\Ysf is a finite set, often determined by discretization of a continuous space, whereas E\Esf, the outcome space for ϵ\epsilon, is allowed to be continuous. The state yy will be called the endogenous state and ϵ\epsilon is the preference shock. In practice, ϵ\epsilon will often be a vector of shocks that affect current rewards. The integral can therefore be multivariate and is over all of E\Esf.

The problem represented by (5.79) is a version of a regular MDP, with state x=(y,ϵ)x = (y, \epsilon) taking values in XY×E\Xsf \coloneq \Ysf \times \Esf. If we discretize the space E\Esf, then all the optimality theory for MDPs applies. Instead of taking this approach, however, we draw on our discussion of labor choice in Section 5.3.1.2. In particular, to enhance efficiency, we will work with the expected value function

g(y,a)yv(y,ϵ)P(y,a,y)ϕ(ϵ) ⁣dϵ.g(y, a) \coloneq \sum_{y'} \int v(y', \epsilon') P(y, a, y') \phi(\epsilon') \diff \epsilon'.

There are several potential advantages associated with working with gg rather than vv. One is that the set of actions A\Asf can be much smaller than the set of states that would be created by discretization of the preference shock space E\Esf (especially if ϵt\epsilon_t takes values in a high-dimensional space). Another is that the integral provides smoothing, so that gg is typically a smooth function. This can accelerate structural estimation procedures.

5.3.1.4Optimality via EV Methods

To exploit the relative simplicity of the expected value function, we rewrite the Bellman equation (5.79) as

v(y,ϵ)=maxaΓ(y){r(y,ϵ,a)+βg(y,a)}.v(y, \epsilon) = \max_{a \in \Gamma(y)} \left\{ r(y, \epsilon, a) + \beta g(y, a) \right\}.

Taking expectations of both sides and using (5.80) again gives

g(y,a)=ymaxaΓ(y){r(y,ϵ,a)+βg(y,a)}ϕ(ϵ) ⁣dϵP(y,a,y).g(y, a) = \sum_{y'} \int \max_{a' \in \Gamma(y')} \left\{ r(y', \epsilon', a') + \beta g(y', a') \right\} \phi(\epsilon') \diff \epsilon' P(y, a, y') .

To solve this functional equation we introduce the expected value Bellman operator RR defined at gRGg \in \RR^\Gsf by

(Rg)(y,a)=ymaxaΓ(y){r(y,ϵ,a)+βg(y,a)}ϕ(ϵ) ⁣dϵP(y,a,y).(Rg)(y, a) = \sum_{y'} \int \max_{a' \in \Gamma(y')} \left\{ r(y', \epsilon', a') + \beta g(y', a') \right\} \phi(\epsilon') \diff \epsilon' P(y, a, y').

Here G\Gsf is the set of feasible state action pairs (y,a)(y, a).

In what follows, we let gg^* be the fixed point of RR in RG\RR^\Gsf. Since RR is a contraction map, gg^* can be computed by successive approximation. The next result shows that knowing this fixed point is enough to solve the dynamic program.

We postpone proving Proposition 5.3.1 until Section 5.3.5, where we prove a more general result.

5.3.2The Gumbel Max Trick

Section 5.3.1.3 described how using expected values can reduce dimensionality by smoothing. But there is another feature of an expected value formulation of a Bellman equation that we can take advantage of when we are prepared to impose extra structure on preference shocks. This section provides details.

A real-valued random variable ZZ is said to have a Gumbel distribution (or a “type 1 generalized extreme value distribution”) with mode μR\mu \in \RR if its cumulative distribution function takes the form F(z)=exp(exp(zμ))F(z) = \exp(-\exp(z - \mu)). To denote a random variable with a Gumbel distribution, we write ZG(μ)Z \sim G(\mu). The expectation of ZZ is μ+γ\mu + \gamma, where γ0.577\gamma \approx 0.577 is the Euler--Mascheroni constant.

The Gumbel distribution has the following useful stability property, a proof of which can be found in Huijben et al. (2022).

To exploit Lemma 5.3.2, we continue the discussion in Section 5.3.1.4, but assume now that A={a1,,ak}\Asf = \{a_1, \ldots, a_k\}, that Γ(y)=A\Gamma(y') = \Asf for all yy' (so that actions are unrestricted), that ϵ\epsilon' in (5.83) is additive in rewards and indexed by actions, so that r(y,ϵ,a)=r(y,a)+ϵ(a)r(y', \epsilon', a') = r(y', a') + \epsilon'(a') for all feasible (y,a)(y',a'), and that, conditional on yy', the vector (ϵ(a1),,ϵ(ak))(\epsilon(a_1), \ldots, \epsilon(a_k)) consists of kk independent G(0)G(0) shocks. Thus, each feasible choice returns a rewards perturbed by an independent Gumbel shock.

From these assumptions and Lemma 5.3.2, the term inside the integral in (5.83) satisfies

maxa{r(y,ϵ,a)+βg(y,a)}=maxa{r(y,a)+ϵ(a)+βg(y,a)}G{γ+ln[aexp(r(y,a)+βg(y,a))]}.\begin{aligned} \max_{a'} \left\{ r(y', \epsilon', a') + \beta g(y', a') \right\} & = \max_{a'} \left\{ r(y', a') + \epsilon'(a') + \beta g(y', a') \right\} \\ & \sim G \left\{ -\gamma + \ln \left[ \sum_{a'} \exp\left( r(y', a') + \beta g(y', a') \right) \right] \right\}. \end{aligned}

Recalling our rule for computing mathematical expectations of Gumbel distributed random variables, the expected value Bellman operator RR in (5.83) becomes

(Rg)(y,a)=yln[aexp(r(y,a)+βg(y,a))]P(y,a,y).(Rg)(y, a) = \sum_{y'} \ln \left[ \sum_{a'} \exp\left( r(y', a') + \beta g(y', a') \right) \right] P(y, a, y').

This operator is convenient because the absence of a max operator permits fast evaluation. Notice also that RR is smooth in gg, which suggests that we can use gradient information to compute its fixed points.

Notice how the Gumbel max trick that exploits Lemma 5.3.2 depends crucially on the expected value formulation of the Bellman equation, rather than the standard formulation (5.79). This is because the expected value formulation puts the max inside the expectation operator, unlike the standard formulation, where the max is on the outside.

Variations of the Gumbel max trick have many uses in structural econometrics (see Section 5.4).

5.3.3Optimal Savings with Stochastic Returns on Wealth

We modify the Section 5.2.2 optimal savings problem by replacing a constant gross rate of interest RR by an iid sequence (ηt)t0(\eta_t)_{t \geq 0} with common distribution ϕ\phi on finite set E\Esf. So the consumer faces a fluctuating rate of returns on financial wealth. In each period tt, the consumer knows ηt\eta_t, the gross rate of interest between tt and t+1t+1, before deciding how much to consume and how much to save. Other aspects of the problem are unchanged.

We have two motivations. One is computational, namely, to illustrate how framing a decision in terms of expected values can reduce dimensionality, analogous to the results in Section 5.3.1.4. The other is to generate a more realistic wealth distribution than that generated by the Section 5.2.2.4 optimal savings model.

With stochastic returns on wealth, the Bellman equation becomes

v(w,y,η)=maxwη(w+y){u(w+ywη)+βy,ηv(w,y,η)Q(y,y)ϕ(η)}.v(w, y, \eta) = \max_{w' \leq \eta(w+y)} \left\{ u \left(w+y - \frac{w'}{\eta} \right) + \beta \sum_{y', \eta'} v(w', y', \eta') Q(y, y') \phi(\eta') \right\} .

Both ww and ww' are constrained to a finite set WR+\Wsf \subset \RR_+. The expected value function can be expressed as

g(y,w)y,  ηv(w,y,η)Q(y,y)ϕ(η).g(y, w') \coloneq \sum_{y', \; \eta'} v(w', y', \eta') Q(y, y') \phi(\eta').

In the remainder of this section, we will say that a savings policy σ\sigma is gg-greedy if

σ(y,w,η)argmaxwη(w+y){u(w+ywη)+βg(y,w)}.\sigma(y, w, \eta) \in \argmax_{ w' \leq \eta(w+y)} \left\{ u \left(w+y - \frac{w'}{\eta} \right) + \beta g(y, w') \right\} .

Since it is an MDP, we can see immediately that if we replace vv in (5.91) with the value function vv^*, then a gg-greedy policy will be an optimal one.

Using manipulations analogous to those we used in Section 5.3.1.4, we can rewrite the Bellman equation in terms of expected value functions via

g(y,w)=y,  η  maxwη(w+y){u(w+ywη)+βg(y,w)}Q(y,y)ϕ(η).g(y, w') = \sum_{y', \; \eta'} \; \max_{ w'' \leq \eta'(w'+y')} \left\{ u \left(w'+y' - \frac{w''}{\eta'} \right) + \beta g(y', w'') \right\} Q(y, y') \phi(\eta').

From here we could proceed by introducing an expected value Bellman operator analogous to η\eta in (5.83), proving it to be a contraction map and then showing that greedy policies taken with respect to the fixed point are optimal. All of this can be accomplished without too much difficulty -- we prove more general results in Section 5.3.5.

However, we also know that OPI is, in general, more efficient than VFI. This motivates us to introduce the modified σ\sigma-value operator

(Rσg)(y,w)=y,  η{u(w+yσ(w,y,η)η)+βg(y,σ(w,y,η))}Q(y,y)ϕ(η).(R_\sigma \, g)(y, w') = \sum_{y', \; \eta'} \left\{ u \left( w' +y' - \frac{\sigma(w', y', \eta')}{\eta'} \right) + \beta g(y', \sigma(w', y', \eta')) \right\} Q(y, y') \phi(\eta').

This is a modification of the regular σ\sigma-value operator TσT_\sigma that makes it act on expected value functions.

A suitably modified OPI routine that is adapted from the regular OPI algorithm in Section 5.1.4.4 can be found in Algorithm 5.5. The routine is convergent. We discuss this in greater detail in Section 5.3.5.

1
2
3
4
5
6
7
8
9
10
11
12
13
using QuantEcon, LinearAlgebra, IterTools

function create_savings_model(; β=0.98, γ=2.5,  
                                w_min=0.01, w_max=20.0, w_size=100,
                                ρ=0.9, ν=0.1, y_size=20,
                                η_min=0.75, η_max=1.25, η_size=2)
    η_grid = LinRange(η_min, η_max, η_size)  
    ϕ = ones(η_size) * (1 / η_size)  # Uniform distribution
    w_grid = LinRange(w_min, w_max, w_size)  
    mc = tauchen(y_size, ρ, ν)
    y_grid, Q = exp.(mc.state_values), mc.p
    return (; β, γ, η_grid, ϕ, w_grid, y_grid, Q)
end

Program 7:Optimal savings parameters (modified_opt_savings.jl)

Figure Figure 5.14 shows a histogram of a long wealth time series that parallels Figure Figure 5.10. The only significant difference is the switch to stochastic returns (as previously described). Parameters are as in Listing 7. Now the wealth distribution has a more realistic long right tail (a few observations are in the far right tail, although they are difficult to see). The Gini coefficient is 0.72, which is closer to typical country values recorded in World Bank data (but still lower than the US). In essence, this occurs because return shocks have multiplicative rather than additive effects on wealth, so a sequence of high draws compounds to make wealth grow fast.

Histogram of wealth (stochastic returns)

Figure 5.14:Histogram of wealth (stochastic returns)

5.3.4Q-Factors

QQ-factors assign values to state action pairs. They set the stage for QQ-learning, an application of reinforcement learning, a recursive algorithm for estimating parameters. QQ-learning uses stochastic approximation techniques to learn QQ-factors. Under special conditions QQ-learning eventually learns optimal QQ-factors for a finite MDP.

QQ-learning is connected to the topic of this chapter because it relies on a Bellman operator for the QQ-factor. We discuss that Bellman operator, but we don’t discuss QQ-learning here.

To begin, we fix an MDP (Γ,β,r,P)(\Gamma, \beta, r, P) with state space X\Xsf and action space A\Asf. For each vRXv \in \RR^\Xsf, the QQ-factor corresponding to vv is the function

q(x,a)=r(x,a)+βxv(x)P(x,a,x)((x,a)G).q(x, a) = r(x, a) + \beta \sum_{x'} v(x') P(x, a, x') \qquad ((x,a) \in \Gsf).

We can convert the Bellman equation into an equation in QQ-factors by observing that, given such a qq, the Bellman equation can be written as v(x)=maxaΓ(x)q(x,a)v(x) = \max_{a \in \Gamma(x)}q(x, a). Taking the mean and discounting on both sides of this equation gives

βxv(x)P(x,a,x)=βxmaxaΓ(x)q(x,a)P(x,a,x).\beta \sum_{x'} v(x') P(x, a, x') = \beta \sum_{x'} \max_{a' \in \Gamma(x')}q(x', a') P(x, a, x').

Adding r(x,a)r(x,a) and using the definition of qq again gives

q(x,a)=r(x,a)+βxmaxaΓ(x)q(x,a)P(x,a,x).q(x, a) = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')}q(x', a') P(x, a, x').

This functional equation motivates us to introduce the QQ-factor Bellman operator

(Sq)(x,a)=r(x,a)+βxmaxaΓ(x)q(x,a)P(x,a,x)((x,a)G).(Sq)(x, a) = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')}q(x', a') P(x, a, x') \qquad ((x,a) \in \Gsf).

Let qq^* be the unique fixed point of SS in RG\RR^\Gsf.

Enthusiastic readers might like to try to prove Proposition 5.3.4 directly. We defer the proof until Section 5.3.5, where a more general result is obtained.

5.3.5Operator Factorizations

Our study of structural estimation in Section 5.3.1, optimal savings in Section 5.3.3 and QQ-factors in Section 5.3.4 all involved manipulations of the Bellman and policy operators that presented alternative perspectives on the respective optimization problems. Rather than offering additional applications that apply such ideas, we now develop a general theoretical framework from which to understand manipulations of the Bellman and policy operators for general MDPs. The framework clarifies when and how these techniques can be applied.

5.3.5.1Refactoring the Bellman Operator

Fix an MDP (Γ,β,r,P)(\Gamma, \beta, r, P) with state space X\Xsf and action space A\Asf. As usual, Σ\Sigma is the set of feasible policies, G\Gsf is the set of feasible state, action pairs, TT is the Bellman operator and vv^* denotes the value function. Our first step is to decompose TT. To do this we introduce three auxiliary operators:

Evidently the action of the Bellman operator TT on a given vRXv \in \RR^\Xsf is the composition of these three steps:

  1. take conditional expectations given (x,a)G(x, a) \in \Gsf (applying EE),

  2. discount and adding current rewards (applying DD), and

  3. maximize with respect to current action (applying MM).

As a result, we can write T=MDEMDET = M D E \coloneq M \circ D \circ E (apply EE first, DD second, and MM third). This decomposition is visualized in Figure Figure 5.15. The action of TT is a round trip from the top node, which is the set of value functions.

Multiple Bellman operators (EV = expected value)

Figure 5.15:Multiple Bellman operators (EV == expected value)

If we study Figure Figure 5.15, we can imagine two other round trips. One is a round trip from the set of expected value functions, obtained by the sequence EMDEMD. The other is a round trip from the set of QQ-factors, obtained by the sequence DEMDEM. Let’s name these additional round trips RR and SS respectively, so that, collecting all three,

R=EMD,S=DEM,T=MDE.R = EMD, \quad S = DEM, \quad T = MDE.

Both RR and SS act on functions in RG\RR^\Gsf. The next exercise provides an explicit representation of these operators.

Let’s connect our “refactored” Bellman operators RR and SS to our preceding examples. Inspection of (5.102) confirms that SS is exactly the QQ-factor Bellman operator. In addition, RR is a general version of the expected value Bellman operator defined in (5.83).

While the equalities in Exercise 5.22 can be proved by induction via the logic revealed by (5.104), the intuition is straightforward from Figure Figure 5.15. For example, the relationship Rk=ETk1MDR^k = ET^{k-1}MD states that round-tripping kk times from the space of expected values (EV function space) is the same as shifting to value function space by applying MDMD, round-tripping k1k-1 times using TT, and then shifting one more step to EV function space via EE.

Although the relationships in Exercise 5.22 are easy to prove, they are already useful. For example, suppose that, in a computational setting, RR is easier to iterate with than TT. Then to iterate with TT kk times, we can instead use Tk=MDRk1ET^k = MDR^{k-1}E: We apply EE once, RR k1k-1 times, and MM and DD once each. If kk is large, this might be more efficient.

In the rest of this section, we let \| \cdot \| \coloneq \| \cdot \|_\infty, the supremum norm on either RX\RR^\Xsf or RG\RR^\Gsf.

We can say that EE and MM are nonexpansive on RX\RR^\Xsf and RG\RR^\Gsf respectively, whereas DD is a contraction on RG\RR^\Gsf.

In Section Section 5.3.5.2, we clarify relationships between these operators and prove Proposition 5.3.1 and Proposition 5.3.4.

5.3.5.2Refactorizations and Optimality

From Lemma 5.3.5 we see that RR, SS and TT all have unique fixed points. We denote them by gg^*, qq^* and vv^* respectively, so that

Rg=g,Sq=q,andTv=v.Rg^* = g^*, \quad Sq^* = q^*, \quad \text{and} \quad Tv^* = v^*.

We already know that vv^* is the value function (Proposition 5.1.1). The following results show that the other two fixed points are, like the value function, sufficient to determine optimality.

The results in Proposition 5.3.6 can be written more explicitly as

In the next result and the discussion that follows, given g,qRGg, q \in \RR^\Gsf, we will call σΣ\sigma \in \Sigma

These definitions are exact analogs of the vv-greedy concept, applied to expected value functions and QQ-factors respectively.

Notice that Proposition 5.3.4 is a special case of Corollary 5.3.7.

The results in Corollary 5.3.7 can be understood as “refactored” versions of Bellman’s principle of optimality. A consequence of these results is that we can solve a given MDP by modifying VFI to operate either on expected value functions or on QQ-factors. For example, if we find it more convenient to iterate in expected value space, then (informally) we can proceed as follows:

  1. Fix gRGg \in \RR^\Gsf.

  2. Iterate with RR to obtain gkRkggg_k \coloneq R^k g \approx g^*.

  3. Compute a gkg_k-greedy policy.

Since gkgg_k \approx g^*, the resulting policy will be approximately optimal.

5.3.5.3Refactored OPI

In Chapter 5 we found that VFI is often outperformed by HPI or OPI. Our next step is to apply these methods to modified versions of the Bellman equation, as discussed in Section 5.3.5.2. This allows us to combine advantages of HPI/OPI with the potential efficiency gains obtained by refactoring the Bellman equation.

We illustrate these ideas by producing a version of OPI that can compute QQ-factors and expected value functions. (The same is true for HPI, although we leave details of that construction to interested readers.) To begin, we introduce a new operator, denoted MσM_\sigma, that, for fixed σΣ\sigma \in \Sigma and qRGq \in \RR^\Gsf, produces

(Mσq)(x)q(x,σ(x))(xX).(M_\sigma \, q)(x) \coloneq q(x, \sigma(x)) \qquad (x \in \Xsf).

This operator is the policy analog of the maximization operator MM defined by (Mq)(x)=maxaΓ(x)q(x,a)(Mq)(x) = \max_{a \in \Gamma(x)} q(x, a) in Section 5.3.5.1. Analogous to (5.104), we set

RσEMσD,SσDEMσ,TσMσDE.R_\sigma \coloneq E \, M_\sigma \, D, \quad S_\sigma \coloneq D \,E \,M_\sigma, \quad T_\sigma \coloneq M_\sigma \, D \,E.

You can verify that TσT_\sigma is the ordinary σ\sigma-policy operator (defined in (5.35)). The operators RσR_\sigma and SσS_\sigma are the expected value and QQ-factor equivalents.

Let’s now show that OPI can be successfully modified via these alternative operators. We will focus on the expected value viewpoint (value functions are replaced by expected value functions), which is often helpful in the applications we wish to consider.

Our modified OPI routine is given in Algorithm 5.5. It makes the obvious modifications to regular OPI, switching to working with expected value functions in RG\RR^\Gsf and from iteration with TσT_\sigma to iteration with RσR_\sigma.

Algorithm 5.5 is globally convergent in the same sense as regular OPI (Algorithm 5.4). In fact, if we pick v0RXv_0 \in \RR^\Xsf and apply regular OPI with this initial condition, as well as applying Algorithm 5.5 with initial condition g0Ev0g_0 \coloneq E v_0, then the sequences (vk)k0(v_k)_{k \geq 0} and (gk)k0(g_k)_{k \geq 0} generated by the two algorithms are connected via gk=Evkg_k = E v_k for all k0k \geq 0. If greedy policies are unique, then it is also true that the policy sequences generated by the two algorithms are identical.

Let’s prove these claims, assuming for convenience that greedy policies are unique. Consider first the claim that gk=Evkg_k = E v_k for all k0k \geq 0. This is true by assumption when k=0k=0. Suppose, as an induction hypothesis, that gk=Evkg_k = E v_k holds at arbitrary kk. Let σ\sigma be gkg_k-greedy. Then

σ(x)=argmaxaΓ(x){r(x,a)+βgk(x,a)}=argmaxaΓ(x){r(x,a)+βxvk(x)P(x,a,x)},\sigma(x) = \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta g_k(x, a) \right\} = \argmax_{a \in \Gamma(x)} \left\{ r(x, a) + \beta \sum_{x'} v_k(x') P(x, a, x') \right\},

where the second equality is implied by gk=Evkg_k = E v_k. Hence σ\sigma is both gkg_k-greedy and vkv_k-greedy and so is the next policy selected by both modified and regular OPI. Moreover, updating via Algorithm 5.5 and applying (5.111), we have

gk+1=Rσmgk=ETσm1MσDgk=ETσm1MσDEvk=ETσmvk.g_{k+1} = R_\sigma^m \, g_k = E T_\sigma^{m-1} \, M_\sigma \, D g_k = E T_\sigma^{m-1} \, M_\sigma \, D E v_k = E T_\sigma^m \, v_k.

Since σ\sigma is vkv_k-greedy, TσmvkT_\sigma^m \, v_k is the next function selected by regular OPI. Hence vk+1=Tσmvkv_{k+1} = T_\sigma^m \, v_k. Connecting with the last chain of equalities yields gk+1=Evk+1g_{k+1} = Ev_{k+1}. This completes the proof that gk=Evkg_k = E v_k for all kk. Policy functions generated by the algorithms are identical as well.

The preceding discussion provides a justification for the modified OPI algorithm we adopted in Section 5.3.3.

5.4Chapter Notes

Detailed treatment of MDPs can be found in books by Bellman (1957), Howard (1960), Denardo (1981), Puterman (2005), Bertsekas (2012), Hernández-Lerma & Lasserre (2012), Hernández-Lerma & Lasserre (2012), and Kochenderfer et al. (2022). Further discussion of the connection between HPI and Newton iteration can be found in Section 6.4 of Puterman (2005).

HPI is routinely used in artificial intelligence applications, including during the training of AlphaZero by DeepMind. Further discussion of these variants of HPI and their connection to Newton iteration can be found in Bertsekas (2021) and Bertsekas (2022).

There are several methods for available for accelerating value function iteration, including asynchronous VFI and Anderson acceleration. Due to space constraints, we omit discussion of these topics. Interested readers can find a treatment of asynchronous VFI in Bertsekas (2022). For discussion of Anderson acceleration see, for example, Walker & Ni (2011) or Geist & Scherrer (2018). First-order methods for accelerating VFI are presented in Goyal & Grand-Clement (2023).

Other methods for computing solutions to MDPs include the linear programming (LP) approach and the policy gradient technique, both of which solve a problem of the form

maxσΣxw(x)v(x) s.t. v=rσ+βPσv,\max_{\sigma \in \Sigma} \sum_x w(x) v(x) \quad \st \quad v = r_\sigma + \beta P_\sigma \, v,

for some chosen weight function ww. The LP approach views (5.114) as a linear program and applies various algorithms to the primal and dual problems. See, for example, Puterman (2005) or Ying & Zhu (2020).

The policy gradient method involves approximating σ\sigma and vv in (5.114) using smooth functions with finitely many parameters. These parameters are then adjusted via some version of gradient ascent. A recent trend for high-dimensional MDPs is to approximate the value and policy functions with neural nets. An early exposition can be found in Bertsekas & Tsitsiklis (1996). A more recent monograph is Bertsekas (2021). For research along these lines in the context of economic applications see, for example, Maliar et al. (2021), Hill et al. (2021), Han et al. (2021), Kahou et al. (2021), Kase et al. (2022), and Azinovic et al. (2022).

In some versions of these algorithms, as well as in VFI and HPI, the expectations associated with dynamic programs are computed using Monte Carlo sampling methods. See, for example, Rust (1997), Powell (2007), and Bertsekas (2021). Sidford et al. (2023) combine LP and sampling approaches.

The optimal savings problem is a workhorse in macroeconomics and has been treated extensively in the literature. Early references include Brock & Mirman (1972), Mirman & Zilcha (1975), Schechtman (1976), Deaton & Laroque (1992), and Carroll (1997). For more recent studies, see, for example, Li & Stachurski (2014), Açıkgöz (2018), Light (2018), Lehrer & Light (2018), or Ma et al. (2020). Recent applications involving optimal savings in a representative agent framework include Bianchi (2011), Paciello & Wiederholt (2014), Rendahl (2016), Heathcote & Perri (2018), Paroussos et al. (2019), Erosa & González (2019), Herrendorf et al. (2021), and Michelacci et al. (2022). For more on the long right tail of the wealth distribution (as discussed in Section 5.3.3), see, for example, Benhabib et al. (2015), Krueger et al. (2016), or Stachurski & Toda (2019).

Households solving optimal savings problems are often embedded in heterogeneous agent models in order to study income distributions, wealth distributions, business cycles and other macroeconomic phenomena. Representative examples include Aiyagari (1994), Huggett (1993), Krusell & Smith (1998), Miao (2006), Algan et al. (2014), Toda (2014), Benhabib et al. (2015), Stachurski & Toda (2019), Toda (2019), Light (2020), Hubmer et al. (2020), or Cao (2020).

Exercise §Exercise 5.19 considered optimal savings and consumption in the presence of transient and persistent shocks to labor income. For research in this vein, see, for example, Quah (1990), Carroll (2009), De Nardi et al. (2010), or Lettau & Ludvigson (2014). For empirical work on labor income dynamics, see, for example, Newhouse (2005), Guvenen (2007), Guvenen (2009), or Blundell et al. (2015). For analysis of optimal savings in a very general setting, see Ma et al. (2020) or Ma & Toda (2021).

The optimal investment problem dates back to Lucas & Prescott (1971). Textbook treatments can be found in Stokey & Lucas (1989) and Dixit & Pindyck (2012). Sargent (1980) and Hayashi (1982) used optimal investment problems to connect optimal capital accumulation with Tobin’s qq (which is the ratio between a physical asset’s market value and its replacement value). Other influential papers in the field include Lee & Shin (2000), Hassett & Hubbard (2002), Bloom et al. (2007), Bond & Van Reenen (2007), Bloom (2009), and Wang & Wen (2012). Carruth et al. (2000) contains a survey.

Classic papers about S--s inventory models include Arrow et al. (1951) and Dvoretzky et al. (1952). Optimality of S--s policies under certain conditions was first established by Scarf (1960). Kelle & Milne (1999) study the impact of S--s inventory policies on the supply chain, including connection to the “bullwhip” effect. The connection between S--s inventory policies and macroeconomic fluctuations is studied in Nirei (2006).

The model in Exercise 5.16 is loosely adapted from Bagliano & Bertola (2004).

Rust (1994) is a classic and highly readable reference in the area of structural estimation of MDPs. Keane & Wolpin (1997) provides an influential study of the career choices of young men. Roberts & Tybout (1997) analyze the decision to export in the presence of sunk costs. Keane et al. (2011) provide an overview of structural estimation applied to labor market problems. Gentry et al. (2018) review analysis of auctions using structural estimation. Legrand (2019) surveys the use of structural models to study the dynamics of commodity prices. Calsamiglia et al. (2020) use structural estimation to study school choices. Iskhakov et al. (2020) provide a thoughtful discussion on the differences between structural estimation and machine learning. Luo & Sang (2022) propose structural estimation via sieves.

Theoretical analysis of expected value functions in discrete choice models and other settings can be found in Rust (1994), Norets (2010), Mogensen (2018) and Kristensen et al. (2021). The expected value Gumbel max trick is due to Rust (1987) and builds on work by McFadden (1974). The Gumbel max trick is also used in machine learning methods (see, e.g., Jang et al. (2016)).

In Section 5.3.4 we mentioned QQ-learning, which was originally proposed by Watkins (1989). Tsitsiklis (1994) and Melo (2001) studied convergence of QQ-learning. In related work, Esponda & Pouzo (2021) study MDPs where dynamics are unknown, and where agents update their understanding of transition laws via Bayesian updating.

The theory in Section 5.3.5 on optimality under modifications of the Bellman equation is loosely based on Ma & Stachurski (2021). That paper considers arbitrary modifications in a very general setting.

Footnotes
  1. See Marcet et al. (2007) and Zhu (2020) for more extensive analysis of how adding a labor supply choice can affect outcomes in a consumption-savings model.

  2. Rational expectations econometrics was a response to that Critique. While early work on rational expectations originated from the macroeconomics community (e.g., Hansen & Sargent (1980), Hansen & Sargent (1990)), many of their examples were actually about industrial organization and other microeconomic models. This work was part of a broad process that erased many boundaries between micro and macro theory.

  3. Hansen & Sargent (1980) analyze the implications of such “Shiller errors” for efficient estimation procedures in a class of linear structural models.

  4. Here, the woman is the primary carer of the child; she derives no utility from children in periods in which she works. See Keane et al. (2011) for further discussion.

References
  1. Rust, J. (1987). Optimal replacement of GMC bus engines: An empirical model of Harold Zurcher. Econometrica, 55(5), 999–1033.
  2. Puterman, M. L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Interscience.
  3. Sargent, T., & Stachurski, J. (2023). Economic Networks: Theory and Computation. Cambridge University Press.
  4. Lucas, R., & Sargent, T. (1981). Rational Expectations and Econometric Practice (Vol. 2). University of Minnesota Press.
  5. Lucas, R. E. (1976). Econometric policy evaluation: A critique. Carnegie-Rochester Conference Series on Public Policy, 1, 19–46.
  6. Gillingham, K., Iskhakov, F., Munk-Nielsen, A., Rust, J., & Schjerning, B. (2022). Equilibrium trade in automobiles. Journal of Political Economy, 130(10), 2534–2593.
  7. Keane, M. P., Todd, P. E., & Wolpin, K. I. (2011). The structural estimation of behavioral models: Discrete choice dynamic programming methods and applications. In O. Ashenfelter & D. Card (Eds.), Handbook of Labor Economics (Vol. 4, pp. 331–461). Elsevier.
  8. Huijben, I. A., Kool, W., Paulus, M. B., & Van Sloun, R. J. (2022). A review of the gumbel-max trick and its extensions for discrete stochasticity in machine learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  9. Bellman, R. (1957). Dynamic Programming. In Science. American Association for the Advancement of Science.
  10. Howard, R. A. (1960). Dynamic Programming and Markov Processes. John Wiley & Sons.
  11. Denardo, E. V. (1981). Dynamic Programming: Models and Applications. Prentice Hall PTR.
  12. Bertsekas, D. (2012). Dynamic Programming and Optimal Control (Vol. 1). Athena Scientific.
  13. Hernández-Lerma, O., & Lasserre, J. B. (2012). Discrete-Time Markov Control Processes: Basic Optimality Criteria (Vol. 30). Springer Science & Business Media.
  14. Hernández-Lerma, O., & Lasserre, J. B. (2012). Further Topics on Discrete-Time Markov Control Processes (Vol. 42). Springer Science & Business Media.
  15. Kochenderfer, M. J., Wheeler, T. A., & Wray, K. H. (2022). Algorithms for Decision Making. MIT Press.