In this chapter we study a class of discrete time, infinite horizon dynamic programs called Markov decision processes (MDPs). This standard class of problems is broad enough to encompass many applications, including the optimal stopping problems in Chapter 4. MDPs can also be combined with reinforcement learning to tackle settings where important inputs to an MDP are not known.
5.1Definition and Properties¶
In this section, we define MDPs and investigate optimality.
5.1.1The MDP Model¶
We study a controller who interacts with a state process by choosing an action path to maximize expected discounted rewards
taking an initial state as given. As with all dynamic programs, we insist that the controller is not clairvoyant: He or she cannot choose actions that depend on future states.
To formalize the problem, we fix a finite set , henceforth called the state space, and a finite set , henceforth called the action space. In what follows, a correspondence from to is a function from into , the set of all subsets of . The correspondence is called nonempty if for all . For example, the map defined by is a nonempty correspondence from to .
Given and , we define a Markov decision process (MDP) to be a tuple consisting of
a nonempty correspondence from to , referred to as the feasible correspondence, which in turn defines the feasible state action pairs
a constant in , referred to as the discount factor,
a function from to , referred to as the reward function, and
a stochastic kernel from to ; that is, is a map from to satisfying
Here is the set of actions available to the controller in state . Given a feasible state action pair , reward is received, and the next period state is randomly drawn from , which is an element of . The dynamics and reward flow are summarized in Algorithm 5.1.
The Bellman equation corresponding to is
This can be understood as an equation in the unknown function . Below we define the value function as maximal lifetime rewards and show that is the unique solution to the Bellman equation in .
We can understand the Bellman equation as reducing an infinite-horizon problem to a two-period problem involving the present and the future. Current actions influence (i) current rewards and (ii) expected discounted value from future states. In every case we examine, there is a trade-off between maximizing current rewards and shifting probability mass towards states with high future rewards.
5.1.2Examples¶
Here we list examples of MDPs. We will see that some models neatly fit the MDP structure, whereas others can be coaxed into the MDP framework by adding states or applying other tricks.
5.1.2.1A Renewal Problem¶
Rust (1987) ignited the field of dynamic structural estimation by examining an engine replacement problem for a bus workshop. In each period the superintendent decides whether or not to replace the engine of a given bus. Replacement is costly but delaying risks unexpected failure. Rust (1987) solved this trade-off using dynamic programming.
We consider an abstract version of Rust’s problem with binary action . When , the state resets to some fixed renewal state in a finite set (e.g., mileage resets to zero when an engine is replaced). When , the state updates according to (e.g., mileage increases stochastically when the engine is not replaced). Given current state and action , current reward is received. The discount factor is .
For this problem, the Bellman equation has the form
where the first term is the value from action 1 and the second is the value of action 0.
To set the problem up as an MDP we set and for all . We define
The primitives form an MDP. Moreover, the renewal Bellman equation (5.5) is a special case of the MDP Bellman equation (5.4). To verify this we rewrite (5.5) as
Inserting from (5.6) into the right-hand side of the last equation recovers the MDP Bellman equation (5.4).
5.1.2.2Optimal Inventory Management¶
We study a firm where a manager maximizes shareholder value. To simplify the problem, we ignore exit options (so that firm value is the expected present value of profits) and assume that the firm only sells one product. Letting be profits at time and be the interest rate, the value of the firm is
The firm faces exogenous demand process . Inventory of the product obeys
The term is units of stock ordered this period, which take one period to arrive. The definition of imposes the assumption that firms cannot sell more stock than they have on hand. We assume that the firm can store at most items at one time.
With the price of the firm’s product set to one, current profits are given by
Here is unit product cost and is a fixed cost of ordering inventory. We take the minimum because orders in excess of inventory are assumed to be lost rather than back-filled.
We can map our inventory problem into an MDP with state space and action space . The feasible correspondence is
which represents the set of feasible orders when the current inventory state is . The reward function is expected current profits, or
The stochastic kernel from the set of feasible state action pairs induced by is, in view of (5.9),
The Bellman equation for this optimal inventory problem is
at each , where is as given in (5.12) and the aim is to solve for . We introduce the Bellman operator
This operator maps to itself and is designed so that its set of fixed points in coincide with solutions to (5.15) in .
5.1.2.3Example: Cake Eating¶
Many dynamic programming problems in economics involve a trade-off between current and future consumption. The simplest example in this class is the “cake eating” problem, where initial household wealth is given but no labor income is received. Wealth evolves according to
where is current consumption and is the gross interest rate. The agent seeks to maximize
subject to (implying that the agent cannot borrow). Consumption level generates utility . Assuming that wealth takes values in a finite set , the Bellman equation for this problem can be written as
In (5.21) we are using to obtain . The household uses (5.21) to trade-off current utility of consumption against the value of future wealth.
5.1.2.4Example: Optimal Stopping¶
The optimal stopping problem we studied in Chapter 4 can be framed as an MDP. On one hand, doing so allows us to apply results obtained for MDPs to optimal stopping. On the other hand, expressing an optimal stopping problem as an MDP requires an additional state variable, which complicates the exposition. The next exercise helps to illustrate the key ideas.
Let’s focus on the job search problem with Markov state discussed in Section 3.3.1 (although the arguments for the general optimal stopping problem in Section 4.1.1.1 are very similar). As before, is the set of wage outcomes. Since we need the symbol for other purposes, we let be the Markov matrix for wages, so that is -Markov on .
To express the job search problem as an MDP, let be a state space whose typical element is , with representing either unemployment () or employment () and being the current wage offer. An action indicates rejection or acceptance of the current wage offer.
5.1.3Optimality¶
In this section, we return to the general MDP setting of Paragraph, define optimal policies and state our main optimality result. As was the case for job search, actions are governed by policies, which are maps from states to actions (see, in particular, Section 1.3.1.3, where policies were introduced).
5.1.3.1Policies and Lifetime Values¶
Let be an MDP. The set of feasible policies corresponding to is
If we select a policy from , it is understood that we respond to state with action at every date . As a result, the state evolves by drawing from at each . In other words, is -Markov when
Note that . Fixing a policy “closes the loop” in the state transition process and defines a Markov chain for the state.
Under the policy , rewards at state are . If
then the lifetime value of following starting from state can be written as
Since , applying Lemma 3.2.1 to this expression yields
Analogous to the optimal stopping case, we call the -value function. We also call the lifetime value of policy conditional on initial state .
Another way to compute is to use the policy operator corresponding to , which is defined at by
( is analogous to the policy operator defined for the optimal stopping problem in Section 4.1.1.3.) In vector notation,
The next exercise shows how can be put to work.
Computationally, this means that we can pick and iterate with to obtain an approximation to .
The next exercise extends Exercise 5.8 and aids interpretation of policy operators. It tells us that is the payoff from following policy and starting in state when lifetime is truncated to the finite horizon and provides a terminal payoff in each state.
5.1.3.2Defining Optimality¶
Given MDP with -value functions , the value function corresponding to is defined as , where, as usual, the maximum is pointwise. More explicitly,
This is consistent with our definition of the value function in the optimal stopping case. It is the maximal lifetime value we can extract from each state using feasible behavior. The maximum in (5.39) exists at each because is finite.
A policy is called optimal for if . In other words, a policy is optimal if its lifetime value is maximal at each state.
Our optimality results are easier to follow with some additional terminology. To start, given , we define a policy to be -greedy if
In essence, a -greedy policy treats as the correct value function and sets all actions accordingly.
Bellman’s principle of optimality is said to hold for the MDP if
The Bellman operator corresponding to is the self-map on defined by
Obviously, if and only if satisfies the Bellman equation (5.4).
The last part of Exercise 5.11 tells us that is the pointwise maximum of , which can be expressed as . Figure Figure 5.1 illustrates this relationship in one dimension.

Figure 5.1: is the pointwise maximum of (one-dimensional setting)
5.1.3.3Optimality Theory¶
We can now state our main optimality result for MDPs.
While Proposition 5.1.1 is a special case of later results (see Section 8.1.3.3), a direct proof is not difficult and we now provide one for interested readers.
Figure Figure 5.2 illustrates Proposition 5.1.1 in an abstract case, where is a singleton . We write instead of for the value of state and place on the horizontal axis. In the figure, the set of policies is . For given , the map is an affine function and the fixed point is . The Bellman operator is the upper envelope of the functions , as shown in (ii) of Exercise 5.11. By definition,
is the largest of these fixed points, which equals , and
is the optimal policy, since .
In accordance with Proposition 5.1.1, is also the fixed point of the Bellman operator.

Figure 5.2:Illustration of optimality for MDPs
It is important to understand the significance of (iii) in Proposition 5.1.1. Greedy policies are relatively easy to compute, in the sense that solving (5.40) at each is easier than trying to directly solve the problem of maximizing lifetime value, since is in general far larger than . Part (iii) tells us that solving the overall problem reduces to computing a -greedy policy with the right choice of . For optimal stopping problems, that choice is the value function . Intuitively, assigns a “correct” value to each state, in the sense of maximal lifetime value the controller can extract, so using to calculate greedy policies leads to the optimal outcome.
5.1.4Algorithms¶
In previous chapters we solved job search and optimal stopping problems using value function iteration. In this section, we present a generalization suitable for arbitrary MDPs and then discuss two important alternatives.
5.1.4.1Value Function Iteration¶
Value function iteration (VFI) for MDPs is similar to VFI for the job search model: We use successive approximation on to compute an approximation to the value function and then take a -greedy policy. The general procedure is given by Algorithm 5.2.
The fact that the sequence produced by VFI converges to is immediate from Proposition 5.1.1 (as the tolerance is taken toward zero). It is also true that the greedy policy produced in the last step is approximately optimal when is small, and exactly optimal when is sufficiently large. Proofs are given in Chapter 8, where we examine VFI in a more general setting.
VFI is robust, easy to understand and easy to implement. These properties explain its enduring popularity. At the same time, in terms of efficiency, VFI is often dominated by alternative algorithms that we now describe.
5.1.4.2Howard Policy Iteration¶
Unlike VFI, Howard policy iteration (HPI) computes optimal policies by iterating between computing the value of a given policy and computing the greedy policy associated with that value. The full technique is described in Algorithm 5.3.
A visualization of HPI is given in Figure Figure 5.3, where is the initial choice. Next, we compute the lifetime value , and then the -greedy policy , and so on. The computation of lifetime value is called the policy evaluation step, whereas the computation of greedy policies is called policy improvement.
Figure 5.3:HPI algorithm
HPI has two very attractive features. One is that, in a finite state setting, the algorithm always converges to an exact optimal policy in a finite number of steps, regardless of the initial condition. We prove this fact in a more general setting in Chapter 8. The second is that the rate of convergence is faster than VFI, as will be shown in Section 5.1.4.3.
Figure Figure 5.4 gives another illustration, presented in the one-dimensional setting that we used for Figure Figure 5.2. In this illustration, we imagine that there are many optimal policies, and hence many functions in , so that their upper envelope, which is the Bellman operator, becomes a smoother curve. The figure shows the update from to the next lifetime value , via the following two steps:
Take to be -greedy, which means that (see Exercise 5.11).
Take to be the fixed point of .
The next step, from to is analogous.
Comparison of this figure with Figure Figure 2.2 suggests that HPI is an implementation of Newton’s method, applied to the Bellman operator. We confirm this in Section 5.1.4.3.

Figure 5.4:HPI as a version of Newton’s method
5.1.4.3HPI as Newton Iteration¶
In discussing the connection between HPI and Newton iteration, one issue is that is not always differentiable, as seen in Figure Figure 5.2. But is convex, and this lets us substitute subgradients for derivatives. Once we make this modification, HPI and Newton iteration are identical, as we now show.
First, recall that, given a self-map from to itself, an matrix is called a subgradient of at if
Figure Figure 5.5 illustrates the definition in one dimension, where is just a scalar determining the slope of a tangent line at . In the left subfigure, is convex and differentiable at , which means that only one subgradient exists (since any other choice of slope implies that the inequality in (5.49) will fail for some ). In the right subfigure, is convex but nondifferentiable at , so multiple subgradients exist.

Figure 5.5:Subgradients of convex functions
In the next result, we take to be a given MDP and let be the associated Bellman operator.
Now let’s consider Newton’s method applied to the problem of finding the fixed point of . Since is nondifferentiable and convex, we replace the Jacobian in Newton’s method (see (2.13)) with the subgradient. This leads us to iterate on
In the definition of , the policy is -greedy. Using , the map reduces to , which is exactly the update step to produce the next -value function in HPI (i.e., the lifetime value of a -greedy policy).
The fact that HPI is a version of Newton’s method suggests that its iterates enjoy quadratic convergence. This is indeed the case: Under mild conditions, one can show there exists a constant such that, for all ,
(see, e.g., Puterman (2005), Theorem 6.4.8). Hence HPI enjoys both a fast convergence rate and the robustness of global convergence.
However, HPI is not always optimal in terms of efficiency, since the size of the constant term in (5.53) also matters. This term can be large because, at each step, the update from to requires computing the exact lifetime value of the -greedy policy . Computing this fixed point exactly can be computationally expensive in high dimensions.
One way around this issue is to forgo computing the fixed point exactly, replacing it with an approximation. Section Section 5.1.4.4 takes up this idea.
5.1.4.4Optimistic Policy Iteration¶
Optimistic policy iteration (OPI) is an algorithm that borrows from both VFI and HPI. In essence, the algorithm is the same as HPI except that, instead of computing the full value of a given policy, the approximation from Exercise 5.9 is used instead. Algorithm 5.4 clarifies.
In the algorithm, the policy operator is applied times to generate an approximation of . The constant step size can also be replaced with a sequence . In either case, for MDPs, convergence to an optimal policy is guaranteed. We prove this in a more general setting in Chapter 8.
Notice that, as , the algorithm increasingly approximates HPI, since converges to . At the same time, if , the reduces to VFI. This follows from Exercise 5.11, which tells us that, when is -greedy, . Hence, with intermediate , OPI can be seen as a “convex combination” of HPI and VFI.
In almost all dynamic programming applications, there exist choices of such that OPI converges faster than VFI. We investigate these ideas in the applications. In some cases, there exist values of such that OPI dominates HPI. However, this depends on the structure of the problem and the software and hardware platforms being employed -- see Section 2.1.4.4 and the applications for additional discussion.
5.2Applications¶
This section gives several applications of the MDP model to economic problems. The applications illustrate the ease with which MDPs can be implemented on a computer (provided that the state and action spaces are not too large).
5.2.1Optimal Inventories¶
In Section 3.1.1.2 we studied a firm whose inventory behavior was specified to follow S--s dynamics. In Section 5.1.2.2 we introduced a model where investment behavior is endogenous, determined by the desire to maximize firm value. In this section, we show that this endogenous inventory behavior can replicate the S--s dynamics from Section 3.1.1.2.
We saw in Section 5.1.2.2 that the optimal inventory model is an MDP, so the Proposition 5.1.1 optimality and convergence results apply. In particular, the unique fixed point of the Bellman operator is the value function , and a policy is optimal if and only if is -greedy.
We solve the model numerically using VFI. As in Exercise 5.2, we take to be the geometric distribution on with parameter . We use the default parameter values shown in Listing 1. The code listing also presents an implementation of the Bellman operator.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32using Distributions f(x, a, d) = max(x - d, 0) + a # Inventory update function create_inventory_model(; β=0.98, # discount factor K=40, # maximum inventory c=0.2, κ=2, # cost paramters p=0.6) # demand parameter ϕ(d) = (1 - p)^d * p # demand distribution x_vals = collect(0:K) # set of inventory levels return (; β, K, c, κ, p, ϕ, x_vals) end "The function B(x, a, v) = r(x, a) + β Σ_x′ v(x′) P(x, a, x′)." function B(x, a, v, model; d_max=100) (; β, K, c, κ, p, ϕ, x_vals) = model revenue = sum(min(x, d) * ϕ(d) for d in 0:d_max) current_profit = revenue - c * a - κ * (a > 0) next_value = sum(v[f(x, a, d) + 1] * ϕ(d) for d in 0:d_max) return current_profit + β * next_value end "The Bellman operator." function T(v, model) (; β, K, c, κ, p, ϕ, x_vals) = model new_v = similar(v) for (x_idx, x) in enumerate(x_vals) Γx = 0:(K - x) new_v[x_idx], _ = findmax(B(x, a, v, model) for a in Γx) end return new_v end
Program 1:Solving the optimal inventory model (inventory_dp.jl)
Figure Figure 5.6 exhibits an approximation of the value function , computed by iterating with starting at . Figure Figure 5.6 also shows the approximate optimal policy, obtained as a -greedy policy:
The plot of the optimal policy shows that there is a threshold region below which the firm orders large batches and above which the firm orders nothing. This makes sense, since the firm wishes to economize on the fixed cost of ordering. Figure Figure 5.7 shows a simulation of inventory dynamics under the optimal policy, starting from . The time path closely approximates the S--s dynamics discussed in Section 3.1.1.2.

Figure 5.6:The value function and optimal policy for the inventory problem

Figure 5.7:Optimal inventory dynamics
5.2.2Optimal Savings with Labor Income¶
As our next example of an MDP, we modify the cake eating problem in Section 5.1.2.3 to add labor income. Wealth evolves according to
where takes values in finite set and labor income is a Markov chain on finite set with transition matrix .[1] is a gross rate of interest, so that investing dollars today returns next period. Other parts of the problem are unchanged. The Bellman operator can be written as
5.2.2.1MDP Representation¶
To frame this problem as an MDP, we set the state to , representing current wealth and income, taking values in the state space . The action is savings , which takes values in and equals . The feasible correspondence is the set of feasible savings values
The current reward is utility of consumption . The stochastic kernel is
Having framed an MDP, the Proposition 5.1.1 optimality results apply.
5.2.2.2Implementation¶
To implement the algorithms discussed in Section 5.1.4, we use the Bellman operator (5.56), and the corresponding definition of a -greedy policy, which is
for all . The policy operator for given is
Code for implementing the model and these two operators is given in Listing 2. Income is constructed as a discretized AR(1) process using the method from Section 3.1.3. Exponentiation is applied to the grid so that income takes positive values.
The function get_value in Listing 3 uses the expression from (5.33) to obtain the value of a given policy . The matrix and vector take the form
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40using QuantEcon, LinearAlgebra, IterTools function create_savings_model(; R=1.01, β=0.98, γ=2.5, w_min=0.01, w_max=20.0, w_size=200, ρ=0.9, ν=0.1, y_size=5) w_grid = LinRange(w_min, w_max, w_size) mc = tauchen(y_size, ρ, ν) y_grid, Q = exp.(mc.state_values), mc.p return (; β, R, γ, w_grid, y_grid, Q) end "B(w, y, w′, v) = u(R*w + y - w′) + β Σ_y′ v(w′, y′) Q(y, y′)." function B(i, j, k, v, model) (; β, R, γ, w_grid, y_grid, Q) = model w, y, w′ = w_grid[i], y_grid[j], w_grid[k] u(c) = c^(1-γ) / (1-γ) c = w + y - (w′ / R) @views value = c > 0 ? u(c) + β * dot(v[k, :], Q[j, :]) : -Inf return value end "The Bellman operator." function T(v, model) w_idx, y_idx = (eachindex(g) for g in (model.w_grid, model.y_grid)) v_new = similar(v) for (i, j) in product(w_idx, y_idx) v_new[i, j] = maximum(B(i, j, k, v, model) for k in w_idx) end return v_new end "The policy operator." function T_σ(v, σ, model) w_idx, y_idx = (eachindex(g) for g in (model.w_grid, model.y_grid)) v_new = similar(v) for (i, j) in product(w_idx, y_idx) v_new[i, j] = B(i, j, σ[i, j], v, model) end return v_new end
Program 2:Discrete optimal savings model (finite_opt_saving_0.jl)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38include("finite_opt_saving_0.jl") "Compute a v-greedy policy." function get_greedy(v, model) w_idx, y_idx = (eachindex(g) for g in (model.w_grid, model.y_grid)) σ = Matrix{Int32}(undef, length(w_idx), length(y_idx)) for (i, j) in product(w_idx, y_idx) _, σ[i, j] = findmax(B(i, j, k, v, model) for k in w_idx) end return σ end "Get the value v_σ of policy σ." function get_value(σ, model) # Unpack and set up (; β, R, γ, w_grid, y_grid, Q) = model w_idx, y_idx = (eachindex(g) for g in (w_grid, y_grid)) wn, yn = length(w_idx), length(y_idx) n = wn * yn u(c) = c^(1-γ) / (1-γ) # Build P_σ and r_σ as multi-index arrays P_σ = zeros(wn, yn, wn, yn) r_σ = zeros(wn, yn) for (i, j) in product(w_idx, y_idx) w, y, w′ = w_grid[i], y_grid[j], w_grid[σ[i, j]] r_σ[i, j] = u(w + y - w′/R) for j′ in y_idx P_σ[i, j, σ[i, j], j′] = Q[j, j′] end end # Reshape for matrix algebra P_σ = reshape(P_σ, n, n) r_σ = reshape(r_σ, n) # Apply matrix operations --- solve for the value of σ v_σ = (I - β * P_σ) \ r_σ # Return as multi-index array return reshape(v_σ, wn, yn) end
Program 3:Discrete optimal savings model (finite_opt_saving_1.jl)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40include("s_approx.jl") include("finite_opt_saving_1.jl") "Value function iteration routine." function value_iteration(model, tol=1e-5) vz = zeros(length(model.w_grid), length(model.y_grid)) v_star = successive_approx(v -> T(v, model), vz, tolerance=tol) return get_greedy(v_star, model) end "Howard policy iteration routine." function policy_iteration(model) wn, yn = length(model.w_grid), length(model.y_grid) σ = ones(Int32, wn, yn) i, error = 0, 1.0 while error > 0 v_σ = get_value(σ, model) σ_new = get_greedy(v_σ, model) error = maximum(abs.(σ_new - σ)) σ = σ_new i = i + 1 println("Concluded loop $i with error $error.") end return σ end "Optimistic policy iteration routine." function optimistic_policy_iteration(model; tolerance=1e-5, m=100) v = zeros(length(model.w_grid), length(model.y_grid)) error = tolerance + 1 while error > tolerance last_v = v σ = get_greedy(v, model) for i in 1:m v = T_σ(v, σ, model) end error = maximum(abs.(v - last_v)) end return get_greedy(v, model) end
Program 4:Discrete optimal savings model (finite_opt_saving_2.jl)
5.2.2.3Timing¶
Since all results for MDPs apply, we know that the value function is the unique fixed point of the Bellman operator in , and that VFI, HPI, and OPI all converge. Listing 4 implements these three algorithms. Since the state and action space are finite, HPI is guaranteed to return an exact optimal policy.
Figure Figure 5.8 shows the number of seconds taken to solve the finite optimal savings model under the default parameters when executed on a laptop machine with 20 CPUs running at around 4 GHz. The horizontal axis corresponds to the step parameter in OPI (Algorithm 5.4). The two other algorithms do not depend on and hence their timings are constant. The figure shows that HPI is an order of magnitude faster than VFI and that OPI is even faster for moderate values of .
One reason VFI is slow is that the discount factor is close to one. This matters because the convergence rate for VFI is linear with error size decreasing geometrically in . In contrast, HPI, being an instance of Newton iteration, converges quadratically (see Section 2.1.4.2). As a result, HPI tends to dominate VFI when the discount factor approaches unity.
Run-times are also dependent on implementation, and relative speed varies significantly with coding style, software, and hardware platforms. In our implementation, the main deficiency is that parallelization is under-utilized. Better exploitation of parallelization tends to favor HPI, as discussed in Section 2.1.4.4.

Figure 5.8:Timings for alternative algorithms, savings model
5.2.2.4Outputs¶
Figure Figure 5.9 shows a typical time series for the wealth of a single household under the optimal policy. The series is created by computing an optimal policy , generating as a -Markov chain on and then computing via for running from 0 to . Initial wealth is set to 1.0 and .

Figure 5.9:Time series for wealth
Figure Figure 5.10 shows the result of computing and histogramming a longer time series, with set to 1,000,000. This histogram approximates the stationary distribution of wealth for a large population, each updating via and each with independently generated labor income series . (This is due to ergodicity of the wealth-income process. For a discussion of the connection between stationary distributions and time series under ergodicity see, for example, Sargent & Stachurski (2023).)

Figure 5.10:Histogram of wealth
The shape of the wealth distribution in Figure Figure 5.10 is unrealistic. In almost all countries, the wealth distribution has a very long right tail. The Gini coefficient of the distribution in Figure Figure 5.10 is 0.54, which is too low. For example, World Bank data for 2019 produces a wealth Gini for the US equal to 0.852. For Germany and Japan the figures are 0.816 and 0.627 respectively.
In Section 5.3.3 we discuss a variation on the optimal savings model that can produce a more realistic wealth distribution.
5.2.3Optimal Investment¶
As our next application, we consider a monopolist facing adjustment costs and stochastically evolving demand. The monopolist balances setting enough capacity to meet demand against costs of adjusting capacity.
5.2.3.1Problem Description¶
We assume that the monopolist produces a single product and faces an inverse demand function of the form
where are positive parameters, is output, is price, and the demand shock follows
Current profits are
Here represents costs associated with adjusting production scale, parameterized by , and is unit cost of current production. Costs are convex, so rapid changes to capacity are expensive.
The monopolist chooses to maximize the expected discounted value of its profit flow, which we write as
Here , where is a fixed interest rate.
A way to start thinking about the optimal time path of output is to consider what would happen if . Without adjustment costs there is no intertemporal trade-off, so the monopolist should choose output to maximize current profit in each period. The implied level of output at time is
For , we expect the following behavior.
If is close to zero, then the optimal output path will track the time path of relatively closely, whereas
if is larger, then will be significantly smoother than , as the monopolist seeks to avoid adjustment costs.
5.2.3.2MDP Representation¶
We can represent this problem as an MDP. To do so we let be a grid contained in that lists possible output values. To conform to the finite state setting, we discretize the shock process using Tauchen’s method, as described in Section 3.1.3. For convenience we again use to represent the discrete process, which is a finite Markov chain on with transition matrix .
The state space for this MDP is , while the action space is . The feasible correspondence is defined by , meaning that choice of output is not restricted by the state. Thus, the feasible policy set is all .
We write for the current state, for the action (which chooses next period output) and for the next period state. The current reward function is current profits, which we can write as
The stochastic kernel is
The term states that next period output is equal to our current choice for next period output. With these definitions, the problem defines an MDP and all of the optimality theory for MDPs applies.
5.2.3.3Implementation¶
The Bellman operator can be expressed as
Given , we can express the policy operator as
A -greedy policy is a that obeys
By combining iteration with the policy operator and computation of greedy policies, we can implement OPI, compute the optimal policy , and study output choices generated by this policy. We are particularly interested in how output responds over time to randomly generated demand shocks.
Figure Figure 5.11 shows the result of a simulation designed to shed light on how output responds to demand. After choosing initial values and generating a -Markov chain , we simulated optimal output via . The default parameters are shown in Listing 5. In the figure, the adjustment cost parameter is varied as shown in the title. In addition to the optimal output path, the path of as defined in (5.66) is also presented.
The figure shows how increasing promotes smoothing, as predicted in the preceding discussion. For small , adjustment costs have only minor effects on choices, so output closely follows , the optimal path when output responds immediately to demand shocks. Conversely, larger values of make adjustment expensive, so the operator responds relatively slowly to changes in demand.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16using QuantEcon, LinearAlgebra, IterTools include("s_approx.jl") function create_investment_model(; r=0.04, # Interest rate a_0=10.0, a_1=1.0, # Demand parameters γ=25.0, c=1.0, # Adjustment and unit cost y_min=0.0, y_max=20.0, y_size=100, # Grid for output ρ=0.9, ν=1.0, # AR(1) parameters z_size=25) # Grid size for shock β = 1/(1+r) y_grid = LinRange(y_min, y_max, y_size) mc = tauchen(z_size, ρ, ν) z_grid, Q = mc.state_values, mc.p return (; β, a_0, a_1, γ, c, y_grid, z_grid, Q) end
Program 5:Optimal investment model (finite_lq.jl)

Figure 5.11:Simulation of optimal output with different adjustment costs
Figure Figure 5.12 compares timings for VFI, HPI, and OPI. Parameters are as in Listing 5. As in Figure Figure 5.8, which gave timings for the optimal savings model, the horizontal axis shows , which is the step parameter in OPI (see Algorithm 5.4). VFI and HPI do not depend on and hence their timings are constant. The vertical axis is time in seconds.
HPI is faster than VFI, although the difference is not as dramatic as was the case for optimal savings. One reason is that the discount factor is relatively small for the optimal investment model ( and , so . Since is the modulus of contraction for the Bellman operator, this means that VFI converges relatively quickly. Another observation is that, for many values of , OPI dominates both VFI and HPI in terms of speed, which is consistent with our findings for the optimal savings model. At , OPI is around 20 times faster than VFI.

Figure 5.12:Timings for alternative algorithms, investment model
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16using QuantEcon, LinearAlgebra, IterTools function create_hiring_model(; r=0.04, # Interest rate κ=1.0, # Adjustment cost α=0.4, # Production parameter p=1.0, w=1.0, # Price and wage l_min=0.0, l_max=30.0, l_size=100, # Grid for labor ρ=0.9, ν=0.4, b=1.0, # AR(1) parameters z_size=100) # Grid size for shock β = 1/(1+r) l_grid = LinRange(l_min, l_max, l_size) mc = tauchen(z_size, ρ, ν, b, 6) z_grid, Q = mc.state_values, mc.p return (; β, κ, α, p, w, l_grid, z_grid, Q) end
Program 6:Firm hiring model (firm_hiring.jl)

Figure 5.13:Optimal shifts in the stock of labor
5.3Modified Bellman Equations¶
Direct application of MDP theory is sometimes suboptimal. For example, we saw in Section 1.3.2.2 that solving the job search problem with iid wage draws is best accomplished by generating a recursion on the continuation value, which reduces dimensionality for iterative solution methods. Separately, in Section 4.2.2.2, we saw how a different manipulation of the Bellman equation also increased efficiency.
Now we aim to study such modifications systematically. We begin by providing other examples of how manipulating a Bellman equation can facilitate computation and analysis. Then we establish a theoretical foundation for this line of analysis, and show how similar ideas can also be applied to policy operators and greedy policies.
(We also treat similar topics at a more advanced and abstract level in Volume 2.)
5.3.1Structural Estimation¶
As a first illustration of the ideas in this section, we discuss a connection between econometric estimation and dynamic programs. Our focus is on some modifications that econometricians often make to Bellman equations and how they affect computation and optimality.
5.3.1.1What Is Structural Estimation?¶
Structural estimation is a branch of quantitative social science in which, in a quest to understand observed quantities and prices, researchers attribute Markov decision problems to economic agents. A key step in this approach is to formulate dynamic programs in terms of functional forms and parameters. The econometric challenge is to infer parameters that bring the model outputs as close as possible to actual data.
Structural estimation aims to discover objects that are invariant to hypothetical interventions that the analysis wants to investigate. Examples of such invariant objects are parameters of utility functions, discount factors, and production technologies. Agents inside the model solve their MDPs. A policy intervention that systematically alters the Markov processes that they face will alter agents’ optimal policies, that is, their decision rules. Various examples of such interventions involving aspects of fiscal and monetary policy are described in various chapters of Lucas & Sargent (1981) a compendium of early papers that were written in response to the Lucas (1976) Critique of then prevailing dynamic econometric models.[2]
Efficient solution methods are essential in structural estimation because the underlying dynamic program must be solved repeatedly in order to search the parameter space for a good fit to data. Moreover, these dynamic programs are often high-dimensional, due to shocks to preferences and other random variables that the agents inside the model are assumed to see but that the econometrician does not. When these shocks are persistent, the dimension of the state grows.[3]
In order to maintain focus on dynamic programming, we will not describe the details of the estimation step required for structural estimation (although Section 5.4 contains references for those who wish to learn about that). Instead, we focus on the kinds of dynamic programs treated in structural estimation and techniques for solving them efficiently.
5.3.1.2An Illustration¶
Let us look at an example of a dynamic program with preference shocks used in structural estimation, which is taken from a study of labor supply by married women Keane et al., 2011. The husband of the decision-maker, a married woman, is already working. The couple has young children and the mother is deciding whether to work. Her utility function is
where is consumption, is a parameter, is the number of children, is a preference shock and is the action variable. The action is binary, with representing the decision to work in the current period and representing the decision not to work.[4]
The budget constraint for the household is
where is the father’s income, is the mother’s wage and is the cost of child care. Wages depend on human capital , which increases with experience. In particular,
Here is random and is a parameter. We assume that is -Markov on some finite set. In the model, and are iid. We denote their joint distribution by .
With constant discount factor and implied utility
the problem of maximizing expected discounted utility is an MDP with the Bellman equation
While we can proceed directly with a technique such as VFI to obtain optimal choices, we can simplify.
One way is by reducing the number of states. A hint comes from looking at the expected value function
This function depends only on three arguments and, moreover, the choice variable is binary. Hence we can break down into two functions and , each of which depends only on the pair . These functions are substantially simpler than when the domain of is large. Hence, it is natural to consider whether we can solve our problem using rather than .
5.3.1.3Expected Value Functions¶
Rather than address this question within the context of the preceding model, let’s shift to a generic version of the dynamic program used in structural estimation and how it can be solved using expected value methods. Our generic version takes the form
for all and . Here is a finite set, often determined by discretization of a continuous space, whereas , the outcome space for , is allowed to be continuous. The state will be called the endogenous state and is the preference shock. In practice, will often be a vector of shocks that affect current rewards. The integral can therefore be multivariate and is over all of .
The problem represented by (5.79) is a version of a regular MDP, with state taking values in . If we discretize the space , then all the optimality theory for MDPs applies. Instead of taking this approach, however, we draw on our discussion of labor choice in Section 5.3.1.2. In particular, to enhance efficiency, we will work with the expected value function
There are several potential advantages associated with working with rather than . One is that the set of actions can be much smaller than the set of states that would be created by discretization of the preference shock space (especially if takes values in a high-dimensional space). Another is that the integral provides smoothing, so that is typically a smooth function. This can accelerate structural estimation procedures.
5.3.1.4Optimality via EV Methods¶
To exploit the relative simplicity of the expected value function, we rewrite the Bellman equation (5.79) as
Taking expectations of both sides and using (5.80) again gives
To solve this functional equation we introduce the expected value Bellman operator defined at by
Here is the set of feasible state action pairs .
In what follows, we let be the fixed point of in . Since is a contraction map, can be computed by successive approximation. The next result shows that knowing this fixed point is enough to solve the dynamic program.
We postpone proving Proposition 5.3.1 until Section 5.3.5, where we prove a more general result.
5.3.2The Gumbel Max Trick¶
Section 5.3.1.3 described how using expected values can reduce dimensionality by smoothing. But there is another feature of an expected value formulation of a Bellman equation that we can take advantage of when we are prepared to impose extra structure on preference shocks. This section provides details.
A real-valued random variable is said to have a Gumbel distribution (or a “type 1 generalized extreme value distribution”) with mode if its cumulative distribution function takes the form . To denote a random variable with a Gumbel distribution, we write . The expectation of is , where is the Euler--Mascheroni constant.
The Gumbel distribution has the following useful stability property, a proof of which can be found in Huijben et al. (2022).
To exploit Lemma 5.3.2, we continue the discussion in Section 5.3.1.4, but assume now that , that for all (so that actions are unrestricted), that in (5.83) is additive in rewards and indexed by actions, so that for all feasible , and that, conditional on , the vector consists of independent shocks. Thus, each feasible choice returns a rewards perturbed by an independent Gumbel shock.
From these assumptions and Lemma 5.3.2, the term inside the integral in (5.83) satisfies
Recalling our rule for computing mathematical expectations of Gumbel distributed random variables, the expected value Bellman operator in (5.83) becomes
This operator is convenient because the absence of a max operator permits fast evaluation. Notice also that is smooth in , which suggests that we can use gradient information to compute its fixed points.
Notice how the Gumbel max trick that exploits Lemma 5.3.2 depends crucially on the expected value formulation of the Bellman equation, rather than the standard formulation (5.79). This is because the expected value formulation puts the max inside the expectation operator, unlike the standard formulation, where the max is on the outside.
Variations of the Gumbel max trick have many uses in structural econometrics (see Section 5.4).
5.3.3Optimal Savings with Stochastic Returns on Wealth¶
We modify the Section 5.2.2 optimal savings problem by replacing a constant gross rate of interest by an iid sequence with common distribution on finite set . So the consumer faces a fluctuating rate of returns on financial wealth. In each period , the consumer knows , the gross rate of interest between and , before deciding how much to consume and how much to save. Other aspects of the problem are unchanged.
We have two motivations. One is computational, namely, to illustrate how framing a decision in terms of expected values can reduce dimensionality, analogous to the results in Section 5.3.1.4. The other is to generate a more realistic wealth distribution than that generated by the Section 5.2.2.4 optimal savings model.
With stochastic returns on wealth, the Bellman equation becomes
Both and are constrained to a finite set . The expected value function can be expressed as
In the remainder of this section, we will say that a savings policy is -greedy if
Since it is an MDP, we can see immediately that if we replace in (5.91) with the value function , then a -greedy policy will be an optimal one.
Using manipulations analogous to those we used in Section 5.3.1.4, we can rewrite the Bellman equation in terms of expected value functions via
From here we could proceed by introducing an expected value Bellman operator analogous to in (5.83), proving it to be a contraction map and then showing that greedy policies taken with respect to the fixed point are optimal. All of this can be accomplished without too much difficulty -- we prove more general results in Section 5.3.5.
However, we also know that OPI is, in general, more efficient than VFI. This motivates us to introduce the modified -value operator
This is a modification of the regular -value operator that makes it act on expected value functions.
A suitably modified OPI routine that is adapted from the regular OPI algorithm in Section 5.1.4.4 can be found in Algorithm 5.5. The routine is convergent. We discuss this in greater detail in Section 5.3.5.
1 2 3 4 5 6 7 8 9 10 11 12 13using QuantEcon, LinearAlgebra, IterTools function create_savings_model(; β=0.98, γ=2.5, w_min=0.01, w_max=20.0, w_size=100, ρ=0.9, ν=0.1, y_size=20, η_min=0.75, η_max=1.25, η_size=2) η_grid = LinRange(η_min, η_max, η_size) ϕ = ones(η_size) * (1 / η_size) # Uniform distribution w_grid = LinRange(w_min, w_max, w_size) mc = tauchen(y_size, ρ, ν) y_grid, Q = exp.(mc.state_values), mc.p return (; β, γ, η_grid, ϕ, w_grid, y_grid, Q) end
Program 7:Optimal savings parameters (modified_opt_savings.jl)
Figure Figure 5.14 shows a histogram of a long wealth time series that parallels Figure Figure 5.10. The only significant difference is the switch to stochastic returns (as previously described). Parameters are as in Listing 7. Now the wealth distribution has a more realistic long right tail (a few observations are in the far right tail, although they are difficult to see). The Gini coefficient is 0.72, which is closer to typical country values recorded in World Bank data (but still lower than the US). In essence, this occurs because return shocks have multiplicative rather than additive effects on wealth, so a sequence of high draws compounds to make wealth grow fast.

Figure 5.14:Histogram of wealth (stochastic returns)
5.3.4Q-Factors¶
-factors assign values to state action pairs. They set the stage for -learning, an application of reinforcement learning, a recursive algorithm for estimating parameters. -learning uses stochastic approximation techniques to learn -factors. Under special conditions -learning eventually learns optimal -factors for a finite MDP.
-learning is connected to the topic of this chapter because it relies on a Bellman operator for the -factor. We discuss that Bellman operator, but we don’t discuss -learning here.
To begin, we fix an MDP with state space and action space . For each , the -factor corresponding to is the function
We can convert the Bellman equation into an equation in -factors by observing that, given such a , the Bellman equation can be written as . Taking the mean and discounting on both sides of this equation gives
Adding and using the definition of again gives
This functional equation motivates us to introduce the -factor Bellman operator
Let be the unique fixed point of in .
Enthusiastic readers might like to try to prove Proposition 5.3.4 directly. We defer the proof until Section 5.3.5, where a more general result is obtained.
5.3.5Operator Factorizations¶
Our study of structural estimation in Section 5.3.1, optimal savings in Section 5.3.3 and -factors in Section 5.3.4 all involved manipulations of the Bellman and policy operators that presented alternative perspectives on the respective optimization problems. Rather than offering additional applications that apply such ideas, we now develop a general theoretical framework from which to understand manipulations of the Bellman and policy operators for general MDPs. The framework clarifies when and how these techniques can be applied.
5.3.5.1Refactoring the Bellman Operator¶
Fix an MDP with state space and action space . As usual, is the set of feasible policies, is the set of feasible state, action pairs, is the Bellman operator and denotes the value function. Our first step is to decompose . To do this we introduce three auxiliary operators:
defined by ,
defined by and
defined by .
Evidently the action of the Bellman operator on a given is the composition of these three steps:
take conditional expectations given (applying ),
discount and adding current rewards (applying ), and
maximize with respect to current action (applying ).
As a result, we can write (apply first, second, and third). This decomposition is visualized in Figure Figure 5.15. The action of is a round trip from the top node, which is the set of value functions.
Figure 5.15:Multiple Bellman operators (EV expected value)
If we study Figure Figure 5.15, we can imagine two other round trips. One is a round trip from the set of expected value functions, obtained by the sequence . The other is a round trip from the set of -factors, obtained by the sequence . Let’s name these additional round trips and respectively, so that, collecting all three,
Both and act on functions in . The next exercise provides an explicit representation of these operators.
Let’s connect our “refactored” Bellman operators and to our preceding examples. Inspection of (5.102) confirms that is exactly the -factor Bellman operator. In addition, is a general version of the expected value Bellman operator defined in (5.83).
While the equalities in Exercise 5.22 can be proved by induction via the logic revealed by (5.104), the intuition is straightforward from Figure Figure 5.15. For example, the relationship states that round-tripping times from the space of expected values (EV function space) is the same as shifting to value function space by applying , round-tripping times using , and then shifting one more step to EV function space via .
Although the relationships in Exercise 5.22 are easy to prove, they are already useful. For example, suppose that, in a computational setting, is easier to iterate with than . Then to iterate with times, we can instead use : We apply once, times, and and once each. If is large, this might be more efficient.
In the rest of this section, we let , the supremum norm on either or .
We can say that and are nonexpansive on and respectively, whereas is a contraction on .
In Section Section 5.3.5.2, we clarify relationships between these operators and prove Proposition 5.3.1 and Proposition 5.3.4.
5.3.5.2Refactorizations and Optimality¶
From Lemma 5.3.5 we see that , and all have unique fixed points. We denote them by , and respectively, so that
We already know that is the value function (Proposition 5.1.1). The following results show that the other two fixed points are, like the value function, sufficient to determine optimality.
The results in Proposition 5.3.6 can be written more explicitly as
for all ,
for all , and
for all .
In the next result and the discussion that follows, given , we will call
-greedy if for all , and
-greedy if for all .
These definitions are exact analogs of the -greedy concept, applied to expected value functions and -factors respectively.
Notice that Proposition 5.3.4 is a special case of Corollary 5.3.7.
The results in Corollary 5.3.7 can be understood as “refactored” versions of Bellman’s principle of optimality. A consequence of these results is that we can solve a given MDP by modifying VFI to operate either on expected value functions or on -factors. For example, if we find it more convenient to iterate in expected value space, then (informally) we can proceed as follows:
Fix .
Iterate with to obtain .
Compute a -greedy policy.
Since , the resulting policy will be approximately optimal.
5.3.5.3Refactored OPI¶
In Chapter 5 we found that VFI is often outperformed by HPI or OPI. Our next step is to apply these methods to modified versions of the Bellman equation, as discussed in Section 5.3.5.2. This allows us to combine advantages of HPI/OPI with the potential efficiency gains obtained by refactoring the Bellman equation.
We illustrate these ideas by producing a version of OPI that can compute -factors and expected value functions. (The same is true for HPI, although we leave details of that construction to interested readers.) To begin, we introduce a new operator, denoted , that, for fixed and , produces
This operator is the policy analog of the maximization operator defined by in Section 5.3.5.1. Analogous to (5.104), we set
You can verify that is the ordinary -policy operator (defined in (5.35)). The operators and are the expected value and -factor equivalents.
Let’s now show that OPI can be successfully modified via these alternative operators. We will focus on the expected value viewpoint (value functions are replaced by expected value functions), which is often helpful in the applications we wish to consider.
Our modified OPI routine is given in Algorithm 5.5. It makes the obvious modifications to regular OPI, switching to working with expected value functions in and from iteration with to iteration with .
Algorithm 5.5 is globally convergent in the same sense as regular OPI (Algorithm 5.4). In fact, if we pick and apply regular OPI with this initial condition, as well as applying Algorithm 5.5 with initial condition , then the sequences and generated by the two algorithms are connected via for all . If greedy policies are unique, then it is also true that the policy sequences generated by the two algorithms are identical.
Let’s prove these claims, assuming for convenience that greedy policies are unique. Consider first the claim that for all . This is true by assumption when . Suppose, as an induction hypothesis, that holds at arbitrary . Let be -greedy. Then
where the second equality is implied by . Hence is both -greedy and -greedy and so is the next policy selected by both modified and regular OPI. Moreover, updating via Algorithm 5.5 and applying (5.111), we have
Since is -greedy, is the next function selected by regular OPI. Hence . Connecting with the last chain of equalities yields . This completes the proof that for all . Policy functions generated by the algorithms are identical as well.
The preceding discussion provides a justification for the modified OPI algorithm we adopted in Section 5.3.3.
5.4Chapter Notes¶
Detailed treatment of MDPs can be found in books by Bellman (1957), Howard (1960), Denardo (1981), Puterman (2005), Bertsekas (2012), Hernández-Lerma & Lasserre (2012), Hernández-Lerma & Lasserre (2012), and Kochenderfer et al. (2022). Further discussion of the connection between HPI and Newton iteration can be found in Section 6.4 of Puterman (2005).
HPI is routinely used in artificial intelligence applications, including during the training of AlphaZero by DeepMind. Further discussion of these variants of HPI and their connection to Newton iteration can be found in Bertsekas (2021) and Bertsekas (2022).
There are several methods for available for accelerating value function iteration, including asynchronous VFI and Anderson acceleration. Due to space constraints, we omit discussion of these topics. Interested readers can find a treatment of asynchronous VFI in Bertsekas (2022). For discussion of Anderson acceleration see, for example, Walker & Ni (2011) or Geist & Scherrer (2018). First-order methods for accelerating VFI are presented in Goyal & Grand-Clement (2023).
Other methods for computing solutions to MDPs include the linear programming (LP) approach and the policy gradient technique, both of which solve a problem of the form
for some chosen weight function . The LP approach views (5.114) as a linear program and applies various algorithms to the primal and dual problems. See, for example, Puterman (2005) or Ying & Zhu (2020).
The policy gradient method involves approximating and in (5.114) using smooth functions with finitely many parameters. These parameters are then adjusted via some version of gradient ascent. A recent trend for high-dimensional MDPs is to approximate the value and policy functions with neural nets. An early exposition can be found in Bertsekas & Tsitsiklis (1996). A more recent monograph is Bertsekas (2021). For research along these lines in the context of economic applications see, for example, Maliar et al. (2021), Hill et al. (2021), Han et al. (2021), Kahou et al. (2021), Kase et al. (2022), and Azinovic et al. (2022).
In some versions of these algorithms, as well as in VFI and HPI, the expectations associated with dynamic programs are computed using Monte Carlo sampling methods. See, for example, Rust (1997), Powell (2007), and Bertsekas (2021). Sidford et al. (2023) combine LP and sampling approaches.
The optimal savings problem is a workhorse in macroeconomics and has been treated extensively in the literature. Early references include Brock & Mirman (1972), Mirman & Zilcha (1975), Schechtman (1976), Deaton & Laroque (1992), and Carroll (1997). For more recent studies, see, for example, Li & Stachurski (2014), Açıkgöz (2018), Light (2018), Lehrer & Light (2018), or Ma et al. (2020). Recent applications involving optimal savings in a representative agent framework include Bianchi (2011), Paciello & Wiederholt (2014), Rendahl (2016), Heathcote & Perri (2018), Paroussos et al. (2019), Erosa & González (2019), Herrendorf et al. (2021), and Michelacci et al. (2022). For more on the long right tail of the wealth distribution (as discussed in Section 5.3.3), see, for example, Benhabib et al. (2015), Krueger et al. (2016), or Stachurski & Toda (2019).
Households solving optimal savings problems are often embedded in heterogeneous agent models in order to study income distributions, wealth distributions, business cycles and other macroeconomic phenomena. Representative examples include Aiyagari (1994), Huggett (1993), Krusell & Smith (1998), Miao (2006), Algan et al. (2014), Toda (2014), Benhabib et al. (2015), Stachurski & Toda (2019), Toda (2019), Light (2020), Hubmer et al. (2020), or Cao (2020).
Exercise §Exercise 5.19 considered optimal savings and consumption in the presence of transient and persistent shocks to labor income. For research in this vein, see, for example, Quah (1990), Carroll (2009), De Nardi et al. (2010), or Lettau & Ludvigson (2014). For empirical work on labor income dynamics, see, for example, Newhouse (2005), Guvenen (2007), Guvenen (2009), or Blundell et al. (2015). For analysis of optimal savings in a very general setting, see Ma et al. (2020) or Ma & Toda (2021).
The optimal investment problem dates back to Lucas & Prescott (1971). Textbook treatments can be found in Stokey & Lucas (1989) and Dixit & Pindyck (2012). Sargent (1980) and Hayashi (1982) used optimal investment problems to connect optimal capital accumulation with Tobin’s (which is the ratio between a physical asset’s market value and its replacement value). Other influential papers in the field include Lee & Shin (2000), Hassett & Hubbard (2002), Bloom et al. (2007), Bond & Van Reenen (2007), Bloom (2009), and Wang & Wen (2012). Carruth et al. (2000) contains a survey.
Classic papers about S--s inventory models include Arrow et al. (1951) and Dvoretzky et al. (1952). Optimality of S--s policies under certain conditions was first established by Scarf (1960). Kelle & Milne (1999) study the impact of S--s inventory policies on the supply chain, including connection to the “bullwhip” effect. The connection between S--s inventory policies and macroeconomic fluctuations is studied in Nirei (2006).
The model in Exercise 5.16 is loosely adapted from Bagliano & Bertola (2004).
Rust (1994) is a classic and highly readable reference in the area of structural estimation of MDPs. Keane & Wolpin (1997) provides an influential study of the career choices of young men. Roberts & Tybout (1997) analyze the decision to export in the presence of sunk costs. Keane et al. (2011) provide an overview of structural estimation applied to labor market problems. Gentry et al. (2018) review analysis of auctions using structural estimation. Legrand (2019) surveys the use of structural models to study the dynamics of commodity prices. Calsamiglia et al. (2020) use structural estimation to study school choices. Iskhakov et al. (2020) provide a thoughtful discussion on the differences between structural estimation and machine learning. Luo & Sang (2022) propose structural estimation via sieves.
Theoretical analysis of expected value functions in discrete choice models and other settings can be found in Rust (1994), Norets (2010), Mogensen (2018) and Kristensen et al. (2021). The expected value Gumbel max trick is due to Rust (1987) and builds on work by McFadden (1974). The Gumbel max trick is also used in machine learning methods (see, e.g., Jang et al. (2016)).
In Section 5.3.4 we mentioned -learning, which was originally proposed by Watkins (1989). Tsitsiklis (1994) and Melo (2001) studied convergence of -learning. In related work, Esponda & Pouzo (2021) study MDPs where dynamics are unknown, and where agents update their understanding of transition laws via Bayesian updating.
The theory in Section 5.3.5 on optimality under modifications of the Bellman equation is loosely based on Ma & Stachurski (2021). That paper considers arbitrary modifications in a very general setting.
See Marcet et al. (2007) and Zhu (2020) for more extensive analysis of how adding a labor supply choice can affect outcomes in a consumption-savings model.
Rational expectations econometrics was a response to that Critique. While early work on rational expectations originated from the macroeconomics community (e.g., Hansen & Sargent (1980), Hansen & Sargent (1990)), many of their examples were actually about industrial organization and other microeconomic models. This work was part of a broad process that erased many boundaries between micro and macro theory.
Hansen & Sargent (1980) analyze the implications of such “Shiller errors” for efficient estimation procedures in a class of linear structural models.
Here, the woman is the primary carer of the child; she derives no utility from children in periods in which she works. See Keane et al. (2011) for further discussion.
- Rust, J. (1987). Optimal replacement of GMC bus engines: An empirical model of Harold Zurcher. Econometrica, 55(5), 999–1033.
- Puterman, M. L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Interscience.
- Sargent, T., & Stachurski, J. (2023). Economic Networks: Theory and Computation. Cambridge University Press.
- Lucas, R., & Sargent, T. (1981). Rational Expectations and Econometric Practice (Vol. 2). University of Minnesota Press.
- Lucas, R. E. (1976). Econometric policy evaluation: A critique. Carnegie-Rochester Conference Series on Public Policy, 1, 19–46.
- Gillingham, K., Iskhakov, F., Munk-Nielsen, A., Rust, J., & Schjerning, B. (2022). Equilibrium trade in automobiles. Journal of Political Economy, 130(10), 2534–2593.
- Keane, M. P., Todd, P. E., & Wolpin, K. I. (2011). The structural estimation of behavioral models: Discrete choice dynamic programming methods and applications. In O. Ashenfelter & D. Card (Eds.), Handbook of Labor Economics (Vol. 4, pp. 331–461). Elsevier.
- Huijben, I. A., Kool, W., Paulus, M. B., & Van Sloun, R. J. (2022). A review of the gumbel-max trick and its extensions for discrete stochasticity in machine learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Bellman, R. (1957). Dynamic Programming. In Science. American Association for the Advancement of Science.
- Howard, R. A. (1960). Dynamic Programming and Markov Processes. John Wiley & Sons.
- Denardo, E. V. (1981). Dynamic Programming: Models and Applications. Prentice Hall PTR.
- Bertsekas, D. (2012). Dynamic Programming and Optimal Control (Vol. 1). Athena Scientific.
- Hernández-Lerma, O., & Lasserre, J. B. (2012). Discrete-Time Markov Control Processes: Basic Optimality Criteria (Vol. 30). Springer Science & Business Media.
- Hernández-Lerma, O., & Lasserre, J. B. (2012). Further Topics on Discrete-Time Markov Control Processes (Vol. 42). Springer Science & Business Media.
- Kochenderfer, M. J., Wheeler, T. A., & Wray, K. H. (2022). Algorithms for Decision Making. MIT Press.