While the MDP model from Chapter 5 and Chapter 6 is elegant and widely used, researchers in economics, finance, and other fields are working to extend it. Reasons include:
MDP theory cannot be applied to settings where lifetime values are described by the kinds of nonlinear recursions discussed in Chapter 7.
Equilibria in some models of production and economic geography can be computed using dynamic programming but not all such programming problems fit within the MDP framework.
Dynamic programming problems that include adversarial agents to promote robust decision rules can fail to be MDPs.
To handle such departures from the MDP assumptions, we now construct a more general dynamic programming framework, building on an approach to optimization initially developed by Denardo (1967) and extended by Bertsekas (2022). Further references are provided in Section 8.4.
We start this chapter by building a framework that centers on an abstract representation of the Bellman equation (Section 8.1). We then state optimality results and show how they can be verified in a range of applications. We defer proofs of core optimality results to Chapter 9, where we strip dynamic programs down to their essence by adopting a purely operator-theoretic perspective.
In this section, we introduce and analyze optimality conditions for recursive decision processes that include and extend all dynamic programming frameworks discussed so far. Throughout this chapter, X denotes a finite set.
Here x is the state, a is an action, Γ is a feasible correspondence, and B is an “aggregator” function. We understand Γ(x) as all actions available to the controller in state x. The function v assigns values to states and is a member of some class V⊂RX. This “abstract” Bellman equation generalizes all of the Bellman equations presented in previous chapters.
Our plan is to analyze the Bellman equation (8.1) and state conditions on B and the other primitives that make strong optimality properties hold. As a first step, we introduce two finite sets,
an action spaceA and
a state spaceX.
Given X and A, we define a recursive decision process (RDP) to be a triple R=(Γ,V,B) consisting of
a feasible correspondenceΓ that is a nonempty correspondence from X to A, which in turn defines
Throughout, ⩽ represents the pointwise order on RX.
The definition of the feasible correspondence in (i) is identical to that for the MDP in Chapter 5. As for (ii), we understand V to be a class of functions that assign values to states. In (iii), the interpretation of the aggregator B is:
B(x,a,v)= total lifetime rewards, contingent on current action a, current state x, and using v to evaluate future states.
The monotonicity condition (8.4) is natural: if, relative to v, rewards are at least as high for w in every future state, then the total rewards one can extract under w should be at least as high. The consistency condition in (8.5) ensures that as we consider values of different policies we remain within the value space V.
The MDP framework is a special case of the RDP framework:
The last example is a special case of Example 8.1.1, since the cake eating problem is an MDP (see Section 5.1.2.3). Nonetheless, Example 8.1.2 is instructive because, for cake eating, the MDP construction is tedious (e.g., we need to define a stochastic kernel P even though transitions are deterministic), while the RDP construction is straightforward.
The next example makes a related point.
Example 8.1.1--Example 8.1.4 treated RDPs that can be embedded into the MDP framework. In the remaining examples, we consider models that cannot be represented as MDPs.
Example 8.1.8 is a minimization problem. We treat minimization explicitly in Section 8.3.5, although the shortest path setting can be converted maximization by replacing c(x,x′) with −c(x,x′). This produces an application similar to the cake eating problem in Example 8.1.2 (although discounting is eliminated and network structure shows up in the constraint).
We aim to discuss optimality of RDPs. To prepare for this topic, we now clarify lifetime values associated with different policy choices in the RDP setting.
Let R=(Γ,V,B) be an RDP with state and action spaces X and A, and let Σ be the set of all feasible policies. For each σ∈Σ we introduce the policy operatorTσ as a self-map on V defined by
The RDP policy operator is a direct generalization of the MDP policy operator defined, as well as the optimal stopping policy operator defined.
Consider a given RDP (Γ,V,B) and fix σ∈Σ. If Tσ has a unique fixed point in V, we denote this fixed point by vσ and call it the σ-value function. It is natural to interpret vσ as representing the lifetime value of following policy σ.
The previous examples are linear but the same idea extends to nonlinear recursive preference models as well. To see this, recall the generic Koopmans operator (Kv)(x)=A(x,(Rv)(x)) introduced in Section 7.3.1. Lifetime value is the unique fixed point of this operator whenever it exists. In all of the RDP examples we have considered, the policy operator can be expressed as (Tσv)(x)=Aσ(x,(Rσv)(x)) for some aggregator Aσ and certainty equivalent operator Rσ. Hence Tσ is a Koopmans operator and lifetime value associated with policy σ is the fixed point of this operator.
Let R=(Γ,V,B) be a given RDP with policy operators {Tσ}. Given that our objective is to maximize lifetime value over the set of policies in Σ, we need to assume at the very least that lifetime value is well defined at each policy. To this end, we say that R is well-posed whenever Tσ has a unique fixed point vσ in V for all σ∈Σ.
Let R be an RDP with policy operators {Tσ}σ∈Σ. In what follows, we call Rglobally stable if Tσ is globally stable on V for all σ∈Σ.
Obviously, every globally stable RDP is well-posed.
In Section 8.1.3 we will see that global stability yields strong optimality properties.
Let R=(Γ,V,B) be an RDP. We call Rcontinuous if B(x,a,v) is continuous in v for all (x,a)∈G. In other words, R is continuous if, for any v∈V, any (x,a)∈G and any sequence (vk)k⩾1 in V, we have
Continuity is satisfied by all applications considered in this text. For example, for the RDP generated by an MDP (Example 8.1.1), the deviation ∣B(x,a,vk)−B(x,a,v)∣ is dominated by β∥vk−v∥∞ for all (x,a)∈G. Hence continuity holds.
Below we will see that continuity is useful when considering covergence of certain algorithms.
Since Γ(x) is finite and nonempty at each x∈X, at least one such policy exists. As with policy operators, the notion of greedy policies extends existing definitions from earlier chapters.
Given RDP R=(Γ,V,B), we say that v∈V satisfies the Bellman equation if v(x)=maxa∈Γ(x)B(x,a,v) for all x∈X. The Bellman operator corresponding to R is the map T on V defined by
To solve RDPs for optimal policies, we use two core algorithms: Howard policy iteration (HPI) and optimistic policy iteration (OPI). As in previous chapters, OPI includes VFI as a special case.
To describe HPI we take R=(Γ,V,B) to be a well-posed RDP with feasible policy set Σ, policy operators {Tσ}, and Bellman operator T. In this setting, the HPI algorithm is essentially identical to the one given for MDPs in Section 5.1.4.2, except that vσ is calculated as the fixed point of Tσ, rather than taking the specific form (I−βPσ)−1rσ. The details are in Algorithm 8.1.
Algorithm 8.1 is somewhat ambiguous, since it is not always clear how to implement the instruction “vk← the fixed point of Tσk”. However, if R is globally stable, then each Tσk is globally stable, so an approximation of the fixed point can be calculated by iterating with Tσk. This line of thought leads us to consider optimistic policy iterating (OPI) as a more practical alternative. Algorithm 8.2 states an OPI routine for solving R that generalizes the MDP OPI routine in Section 5.1.4.
In Algorithm 8.2 we require that v0=vσ for some σ∈Σ. This assumption can be dropped in some settings. For practical purposes, however, it is almost always straightforward to initialize OPI with v0=vσ for some simple choice of σ.
When we turn to proofs, it will help to have an operator-theoretic description of HPI and OPI. To this end, we define two operators. The first is H:V→{vσ}, which is defined via
We call H the Howard operator generated by R. Iterating with H implements HPI. In particular, if we fix σ∈Σ and set vk=Hkvσ, then (vk)k⩾0 is the sequence of σ-value functions generated by HPI.[1]
Next, fixing m∈N, we define the operator Wm from V to itself via
(See the previous footnote on the choice of v-greedy policies.) The operator Wm is an approximation of H, since Tσmv→vσ=Hv as m→∞. Iterating with Wm generates the value sequence in OPI. More specifically, we take v0∈{vσ} and generate
(vk,σk)k⩾0where vk=Wmkv0 and σk is vk-greedy.
Let R be a well-posed RDP with policy operators {Tσ} and σ-value functions {vσ}. In this context, we set v∗:=⋁σvσ∈RX and call v∗ the value function of R. By definition, v∗ satisfies
Both of these definitions generalize the definitions we used for MDPs and optimal stopping. In particular, optimality of a policy means that it generates maximum possible lifetime value from every state.
We say that R satisfies Bellman’s principle of optimality if
We can now state our main optimality result for RDPs. In the statement, R is a well-posed RDP with value function v∗.
As OPI includes VFI as a special case (m=1), Theorem 8.1.1 also implies convergence of VFI under the stated conditions.
In terms of applications, Theorem 8.1.1 is the most important optimality result in this book. It provides the core optimality results from dynamic programming and a broadly convergent algorithm for computing optimal policies.
Many traditional treatments of dynamic programming build optimality theory around contractivity (see, e.g., Puterman (2005) or Stokey & Lucas (1989), Section 4.2). Assumptions are constructed so that the policy operators and Bellman operator are all contraction mappings.
While such assumptions are sufficient for Theorem 8.1.1 (since contractivity of the policy operators implies stability), they are not necessary. There are a variety of ways to prove uniqueness and stability of fixed points, including the monotonicity-based methods discussed in Section 7.1.2 and the spectral methods in Section 6.1.3.2. These alternatives will prove useful in settings where contractivity fails, as we shall see in Section 8.2.
Another point worth noting about the conditions in Theorem 8.1.1 is that no assumptions are placed on the Bellman operator. Rather, one only needs to check properties of the policy operators. This is advantageous because, unlike the Bellman operator, the policy operators do not involve maximization.
Up until now, we have focused entirely on stationary policies, in the sense that the same policy is used at every point in time. What if we drop this assumption and admit the option to change policies? Might this lead to higher lifetime values?
In this section, we show that for globally stable RDPs the answer is negative. This finding justifies our focus on stationary policies.
To begin, let R=(Γ,V,B) be a globally stable RDP. Recall from Remark 8.1.1 that, given v∈V, σ∈Σ, k∈N and x∈X, the value (Tσkv)(x) gives finite horizon utility over periods 0,…,k under policy σ, with initial state x and terminal condition v. Extending this idea, it is natural to understand TσkTσk−1⋯Tσ1v as providing finite horizon utility values for the nonstationary policy sequence (σk)k∈N⊂Σ, given terminal condition v∈V. For the same policy sequence, we define its lifetime value via
whenever the limsup is finite and independent of the terminal condition v.
Suppose that this is the case, and hence vˉ is well defined. We claim that vˉ⩽v∗.
Since vˉ is independent of the terminal condition v, we can assume without loss of generality that v∈VΣ. By Theorem 8.1.1, we have Tkv→v∗ as k→∞. Hence, by Exercise 8.11,
We will show that boundedness can be used to obtain optimality results for well-posed RDPs, even without global stability.
Another attractive feature of boundedness is that it permits a reduction of value space, as illustrated by the next two exercises.
Exercise 8.13 implies the reduced RDP (Γ,V^,B) is also well-posed under the stated conditions, and that it contains all the σ-value functions and the value function from the original RDP (Γ,V,B). Hence any optimality results for (Γ,V^,B) carry over to (Γ,V,B).
The next result shows that, when considering optimality, stability can be replaced by boundedness.
Sometimes RDP models can be simplified by transformations over value space. In this section we investigate such transformations. The underlying ideas are related to topological conjugacy of dynamical systems, which we introduced in Section 2.1.1.2.
To begin, let R=(Γ,V,B) and R^=(Γ,V^,B^) be two RDPs with identical state space X, action space A and feasible correspondence Γ. We consider settings where
We call R and R^topologically conjugate under ϕ if ϕ is a homeomorphism ϕ from M to M^ and (8.43) holds.
Here is our main result for this section.
The benefit of Proposition 8.1.3 is that one of these models might be easier to analyze than the other. We apply the proposition to the Epstein--Zin specification in Section 8.1.4.1 and to a smooth ambiguity model in Section 8.3.4. The next exercise will be useful for the proof.
In the next section we will see how these ideas can simplify optimality analysis.
In this section, we show how the preceding optimality results and the notion of topologically conjugacy can be deployed to analyze the Epstein--Zin RDP from Example 8.1.7.
The value of of introducing R^ comes from the fact that R^ is easier to work with than R (just as the modified Epstein--Zin Koopmans operator K^ defined in Section 7.2.3.3 turned out to be easier to work with than the original Epstein--Zin Koopmans operator K introduced in Section 7.2.3.2).
Now we investigate the properties of the simpler RDP R^.
In Section 8.1 we showed that well-posed RDPs have strong optimality properties whenever they are globally stable or bounded, and that VFI and OPI converge whenever they are globally stable. But what conditions are sufficient for these properties? We start with a relatively strict condition based on contractivity and then progress to models that fail to be contractive.
Corollary 8.2.2 tells us that contracting RDPs are globally stable and, as a result, the sequence of functions in V generated by VFI (Algorithm 8.2 with m=1) converges to v∗. However this result is asymptotic and conditions on v0=vσ for some σ∈Σ. We can improve this result in the current setting by leveraging the contraction property:
Since the VFI algorithm terminates when ∥vk−vk−1∥∞ falls below a given tolerance, the result in (8.52) directly provides a quantitative bound on the performance of the policy returned by VFI.
Next we state a useful condition for contractivity that is related to Blackwell’s sufficient condition discussed in Section 2.2.3.3. We say that RDP (Γ,V,B) satisfies Blackwell’s condition if v∈V implies v+λ:=v+λ1 is in V for every λ⩾0 and, in addition, there exists a β∈[0,1) such that
B(x,a,v+λ)⩽B(x,a,v)+βλfor all (x,a)∈G, v∈V and λ∈R+.
8.2.1.4Application: Job Search with Quantile Preferences¶
Consider the job search problem with correlated wage draws first investigated in Section 3.3.1. With finite wage offer set W, wage offer process generated by P∈M(RW) and β∈(0,1), we can frame this as an RDP (Γ,V,B) with V=RW, Γ(w)={0,1} for w∈W and
where τ∈[0,1] and Rτ is the quantile certainty equivalent operator described in Exercise 7.22.
Figure Figure 8.1 shows the reservation wage for a range of τ values, computed using OPI (and taking the smallest w∈W such that σ∗(w)=1). The stationary distribution of P is also shown in the figure, tilted 90 degrees.
The parameters and the code for applying Tσ and evaluating greedy functions is shown in Listing 2. That listing includes the quantile operator Rτ, which is implemented in Listing 1. (Quantiles of discrete random variables can also be computed using functionality contained in Distributions.jl.)
Figure 8.1:The reservation wage as a function of τ
The main message of Figure Figure 8.1 is that the reservation wage rises in τ. In essence, higher τ focuses the attention of the worker on the right tail of the distribution of continuation values. This encourages the worker to take on more risk, which leads to a higher reservation wage (i.e., reluctance to accept a given current offer).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
"Compute the τ-th quantile of v(X) when X ∼ ϕ."
function quantile(τ, v, ϕ)
# Sort v and reorder ϕ accordingly
indices = sortperm(v)
v_sorted = v[indices]
ϕ_sorted = ϕ[indices]
for (i, v_value) in enumerate(v_sorted)
p = sum(ϕ_sorted[1:i]) # sum all ϕ[j] s.t. v[j] ≤ v_value
if p ≥ τ # exit and return v_value if prob ≥ τ
return v_value
end
end
end
Program 1:Conditional quantile operator (quantile_function.jl)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
using QuantEcon
include("quantile_function.jl")
"Creates an instance of the job search model."
function create_markov_js_model(;
n=100, # wage grid size
ρ=0.9, # wage persistence
ν=0.2, # wage volatility
β=0.98, # discount factor
c=1.0, # unemployment compensation
τ=0.5 # quantile parameter
)
mc = tauchen(n, ρ, ν)
w_vals, P = exp.(mc.state_values), mc.p
return (; n, w_vals, P, β, c, τ)
end
"""
The policy operator
(T_σ v)(w) = σ(w) (w / (1-β)) + (1 - σ(w))(c + β (R_τ v)(w))
"""
function T_σ(v, σ, model)
(; n, w_vals, P, β, c, τ) = model
h = c .+ β * R(τ, v, P)
e = w_vals ./ (1 - β)
return σ .* e + (1 .- σ) .* h
end
" Get a v-greedy policy."
function get_greedy(v, model)
(; n, w_vals, P, β, c, τ) = model
σ = w_vals / (1 - β) .≥ c .+ β * R(τ, v, P)
return σ
end
Program 2:Job search with quantile operator (quantile_js.jl)
In this section, we consider a small open economy that borrows in international financial markets in order to smooth consumption and has the option to default. We show that the model is a contractive RDP.
Income (Yt)t⩾0 is exogenous and Q-Markov on finite set Y. A representative household faces budget constraint
where Ct is consumption at time t, q is the price at time t of a risk-free claim on one unit of time t+1 consumption; q is determined outside the model, say international markets; bt measures foreign lending. Purchasing a claim on bt+1 units of time t+1 consumption costs qbt+1. Purchasing bond with negative face value bt+1 pays qbt+1 in current consumption goods and promises to deliver bt+1 next period.
Bond trading is managed by a benevolent government that wants to maximize household utility. Households discount future utility at rate β∈(0,1) and current consumption Ct generates current utility u(Ct). The government faces borrowing constraint bt⩾−m where m⩾0. The government maximizes expected discounted utility for the households.
The government can default on foreign loans. In this case, output available for consumption drops from Yt to h(Yt), where h is a function satisfying h(y)<y for all y. After a country defaults, it temporarily loses access to the international credit market.
At the end of each period during which the country is in default, it regains access to international credit markets with probability θ∈(0,1). With probability 1−θ it remains in financial autarky. When a country regains access to foreign borrowing, its debt is reset to zero.
We can cast this as an RDP by considering the value of each state and action. We set the state space X to be the set of all (y,b,d) in Y×B×{0,1}, where B is a finite subset of [−m,∞) indicating possible choices for bond holdings bt and d is a binary variable indicating whether the country is in default (d=0 means not in default and d=1 means in default).
The value space V is all of RX. The action space is (ba,da)∈B×{0,1} indicating choices for bond holdings and default. The feasible correspondence specifies feasible (ba,da) at given state (y,b,d) and is given by
In other words, if d=0, so the country is not in default, the government can choose any ba∈B and also any da∈{0,1} (i.e., default or not default). If d=1, however, the government has no choices. We represent this situation by ba=0 and da=1.
The value aggregator takes the form
B((y,b,d),(ba,da),v)=value in state (y,b,d) under action (ba,da).
To specify it we decompose the problem across cases for d and da. First consider the case where d=0 (not currently in default) and da=0 (the government chooses not to default). For this case y+b−qba is current consumption, so we set
The term ∑y′v(y′,0,0)Q(y,y′) is the expected value next period when the country is readmitted to international financial markets (with b′=0 and d′=0), whereas the term ∑y′v(y′,0,1)Q(y,y′) is the expected value next period when default continues (with b′=0 and d′=1).
Since B((y,b,1),(ba,0),v) is not feasible (a defaulted country cannot itself directly choose to reenter financial markets), the only other possibility is B((y,b,1),(ba,1),v), which is the expected value when the country remains in default. But this is the same as B((y,b,0),(ba,1),v) specified earlier: The value for a country that stays in default is the same as that for a country that newly enters default.
Many RDPs are not contracting. There is no single method for handling all types of non-contractive RDPs, so we introduce alternative techniques over the next few sections. The first such technique, treated in this section, handles RDPs that contract “eventually,” even though they may fail to contract in one step. We show that these eventually contracting RDPs are globally stable, so all of the fundamental optimality results apply.
One application for these results is the MDP model with state-dependent discounting treated in Chapter 6. This section contains a proof of the main optimality result in that chapter (Proposition 6.2.2).
8.2.2.2Optimality for MDPs with State-Dependent Discounting¶
With Proposition 8.2.4 in hand, we can complete the proof of Proposition 6.2.2, which pertained to optimality properties for MDPs with state-dependent discounting.
Let (Γ,β,r,P) be an MDP with state-dependent discounting, as defined in Section 6.2.1.1. The state space is X and the action space is A. The function β maps G×X to R+. Set
Under the stated assumptions, for each σ∈Σ, the operator Lσ(x,x′)=L(x,σ(x),x′) satisfies ρ(Lσ)<1. Hence R is eventually contracting, as claimed. Since V=RX is closed, Proposition 8.2.4 implies that R is a globally stable RDP. The claims in Proposition 6.2.2 now follow from Theorem 8.1.1.
Theorem 8.1.1 shows that RDPs have excellent optimality properties when all policy operators are globally stable on value space. So far we have looked at conditions for stability based on contractions (Section 8.2.1) and eventual contractions (Section 8.2.2). But sometimes both of these approaches fail and we need alternative conditions.
In this section, we explore alternative conditions based on Du’s theorem. Du’s theorem is well suited to the task of studying stability of policy operators, since it leverages the fact that all policy operators are order-preserving.
Div can be applied to establish optimality properties of regular MDPs. This exercise is redundant in the sense that optimality properties of regular MDPs have already been established using other means. At the same time, some of the arguments developed here will be helpful when we face more sophisticated problems.
To sketch the argument, let R=(Γ,V,B) be an RDP generated by an ordinary MDP (Γ,β,r,P), as discussed in Example 8.1.1. In particular, V=RX, and B(x,a,v)=r(x,a)+β∑x′v(x′)P(x,a,x′). We set r1:=minr and r2:=maxr. Then we fix ϵ>0 and define V via
V^:=[v1,v2] where v1:=1−βr1−ϵ and v2:=1−βr2+ϵ.
In Section 7.2.2 we introduced risk-sensitive preferences and discussed a recursive utility problem. Now we embed risk-sensitive preferences into a dynamic program and apply the preceding optimality results to compute optimal policies.
Consider the risk-sensitive preference RDP in Example 8.1.6, with state space X and action space A. Let V=RX. For (x,a)∈G and v∈V, we can express the aggregator as
Notice that, for each fixed a∈Γ(x), the operator Rθa is an entropic certainty equivalent operator on V (see Example 7.3.2).
The next exercise pertains to quantile preferences rather than risk-sensitive preferences, but the result can be obtained via a relatively straightforward modification of the proof of Proposition 8.3.1.
Here θ is a nonzero parameter and other details are as in Section 3.3.1. We can represent the problem as an RDP with state space W, action space A={0,1}, feasible correspondence Γ(w)=A, value space V:=RW, and value aggregator
If θ<0, then the agent is risk-averse with respect to the gamble associated with continuing and waiting for new wage draws. If θ>0 then the agent is risk-loving with respect to such gambles. For θ≈0, the agent is close to risk-neutral.
Figure Figure 8.2 shows how the continuation value, value function and optimal decision vary with θ. Apart from θ, parameters are identical to those in Listing 4. Indeed, for θ close to zero, as in the middle sub-figure of Figure Figure 8.2, we see that the value function and reservation wage are almost identical to those from the risk-neutral model in Figure Figure 3.5.
Figure 8.2:Job search with risk-sensitive preferences
As expected, a negative value of θ tends to reduce the continuation value and hence the reservation wage, since the agent’s dislike of risk encourages early acceptance of an offer. For positive values of θ the reverse is true, as seen in the bottom sub-figure.
Some problems in economics, finance, and artificial intelligence assume that decisions emerge from a dynamic two-person zero-sum game in which the two agents’ preferences are perfectly misaligned. This can lead to a dynamic program where the Bellman equation takes the form
where B(x,a,d,v) represents lifetime value for the decision maker conditional on her current action a and her adversary’s action d. The decision maker chooses action a∈Γ(x) with the knowledge that the opponent will then choose d∈D(x,a) to minimize her lifetime value.
Condition (a) is a natural monotonicity condition: A uniform increase in continuation values increases current value at all states and actions. Conditions (b) and (c) provide upper and lower bounds. Condition (d) is a concavity condition.
To analyze the decision maker’s problem, we set V:=[v1,v2] and
An immediate corollary of Proposition 8.3.2 is that, under the stated conditions, the decision maker’s problem is a globally stable RDP, and hence the fundamental optimality properties in Theorem 8.1.1 all hold.
In this section, we provide a relatively abstract application of Proposition 8.3.2. Later, in Section 8.3.3, we will see more concrete applications.
The setting we consider is a modified MDP where the adversarial agent’s actions affect the reward function and transition kernel. This leads to a Bellman equation of the form
The choice perturbation d∈D(x,a) is made by the adversary. The object P is a stochastic kernel, in the sense that P(x,a,d,⋅) is a distribution over X for each feasible (x,a,d). We assume that Γ is a nonempty correspondence from X to A and D(x,a) is nonempty for all (x,a)∈G. Let
Until now we have considered agents facing decision problems where outcomes are uncertain but probabilities are known. For example, while the job seeker introduced in Chapter 1 does not know the next period wage offer when choosing her current action, she does know the distribution of that offer. She uses this distribution to determine an optimal course of action. Similarly, the controllers in our discussion of optimal stopping and MDPs used their knowledge of the Markov transition law to determine an optimal policy.
In many cases, the assumption that the decision maker knows all probability distributions that govern outcomes under different actions is debatable. In this section we study lifetime valuations in settings of Knightian uncertainty (Knight, (1921)), which means that outcome distributions are themselves unknown. Some authors refer to Knightian uncertainty as ambiguity.
Below we consider some dynamic problems where decision makers face Knightian uncertainty.
First we study the choices of a decision maker who knows her reward function but distrusts her specification of the stochastic kernel P that describes the evolution of the state. This distrust is expressed by assuming that she knows that P belongs to some class of stochastic kernels from G×X to X. This can lead to aggregators of the form
for (x,a)∈G. As usual, r maps G to R and β∈(0,1). The decision maker can construct a policy that is robust to her distrust of the stochastic kernel by using this aggregator B. Such aggregators arise in the field of robust control.
Positing that the decision maker knows a nontrivial set of stochastic kernels is a way of modeling Knightian uncertainty, as distinguished from risks that are described by known probability distributions.
Returning to the robust control model with aggregator B in (8.108), we take V is as defined in (8.105) and set R=(Γ,V,B). The set P of stochastic kernels is entirely arbitrary.
We conclude from this discussion that the robust control RDP is globally stable. Hence all of the fundamental optimality properties hold.
In this set up, P(x,a) is often large, weakening the constraint on P. At the same time, we introduce the penalty term d(P(x,a,⋅),Pˉ(x,a,⋅)), which can be understood as recording the deviation between a given kernel P and some baseline specification Pˉ.
One interpretation of this setting is that the decision maker begins with a baseline specification of dynamics but lacks confidence in its accuracy. In her desire to choose a robust policy, she imagines herself playing against an adversarial agent. Her adversary can choose transition kernels that deviate from the baseline, but the presence of the penalty term means that extreme deviations are curbed.
It is assumed here that q≺acp, which means that q(x)=0 whenever p(x)=0. We note for future reference that dKL obeys the duality formula for variational inference, which states that, given h∈RX,
In robust control, KL divergence can be used to measure deviation between the baseline specification and alternative specifications. It turns out that, under this measure of divergence, there is a tight relationship between robust control and risk-sensitive preferences.
To illustrate this relationship, we fix θ<0 and set dθ:=−(1/θ)dKL, so that dθ is a simple positive rescaling of the Kullback--Leibler divergence. Using dθ in (8.111) leads to
Hence, for this choice of deviation, the robust control aggregator (8.111) reduces to the risk-sensitive aggregator (see Example 8.1.6) under the baseline transition kernel.
Ju & Miao (2012) propose and study a recursive smooth ambiguity model in the context of asset pricing. A generic discrete formulation of their optimization problem can be expressed in terms of the aggregator
where α,κ,γ are nonzero parameters, Pθ is a stochastic kernel from G to X for each θ in a finite dimensional parameter space Θ, and μ(x,⋅) is a probability distribution over Θ for each x∈X. The distribution μ(x,⋅) represents subjective beliefs over the transition rule for the state.
The aggregator B in (8.119) is defined for x∈X, a∈Γ(x) and v∈I, where I is be the interior of the positive cone of RX. To ensure finite real values, we assume r≫0.
As with the Epstein--Zin case, α parameterizes the elasticity of intertemporal substitution and γ governs risk aversion. The parameter κ captures ambiguity aversion. If κ=γ, the agent is said to be ambiguity neutral.
Returning to (8.119), we focus on the case κ<γ<0<α<1, which includes the calibration used in Ju & Miao (2012). (Other cases can be handled using similar methods and details are left to the reader.) After constructing a suitable value space, we will show that the resulting RDP is globally stable.
As a first step, set r1:=minr, r2:=maxr and fix ϵ>0. Consider the constant functions
In the remainder of this section on smooth ambiguity, we set V=[v1,v2].
Here is our main result for this section. It implies that all optimality and convergence results for R are valid (see, in particular, Theorem 8.1.1).
To prove Proposition 8.3.5, we use a transformation, just as we did with the Epstein--Zin case in Section 8.1.4.1. To this end, we introduce the composite parameters
Until now, all results and applications have concerned maximization of lifetime values. Now is a good time to treat minimization. Throughout this section, R is a well-posed RDP. The pointwise minimum v↓∗:=⋀σvσ is called the min-value function generated by R. We call a policy σ∈Σmin-optimal for R if vσ=v↓∗. A policy σ∈Σ is called v-min-greedy for R if
We say that v∈V obeys the min-Bellman equation if T↓v=v. The algorithm defined by replacing “v-greedy” with “v-min-greedy” in Algorithm 8.1 (HPI) will be called min-HPI.
We can now state the following result, which is analogous to Theorem 8.1.1. In the statement, R is a well-posed RDP with min-value function v↓∗.
Although we omit the details, a min-OPI convergence result directly analogous to the OPI convergence result in (v) of Theorem 8.1.1 also holds (after replacing maximization-based v-greedy policies with v-min-greedy policies).
Theorem 8.3.7 is proved in Section 9.2.3. For now, we consider two applications that involve minimization.
Recall the shortest path problem introduced in Example 8.1.8, where X is the vertices of a graph, E is the edges, c:E→R+ maps a travel cost to each edge (x,x′)∈E, and O(x) is the set of direct successors of x. The aim is to minimize total travel cost to a destination node d. We adopt all assumptions from Exercise 8.17 and assume in addition that c(x,x′)=0 implies x=d. As in Exercise 8.17, we let C(x) be the maximum cost of traveling to d from x along any directed path.
We regard the problem as an RDP R=(O,V,B) with V=[0,C] and
In the present setting, the function v in (8.132) is often called the cost-to-go function, with v(x′) in (8.137) understood as remaining costs after moving to state x′.
While the value aggregator B in (8.132) is simple, the absence of discounting (which is standard in the shortest path literature) means that R is not contracting. Fortunately, R turns out to be concave (in the sense of Section 8.2.3), which allows us to prove
(In the present context, v↓∗ is also known as the minimum cost-to-go function.)
When discussing MDPs we used β to represent the discount factor. Given β, the discount rate or rate of time preference is the value ρ that solves β=1/(1+ρ). The standard MDP assumption β<1 implies this rate is positive. You will recall from Chapter 5 that the condition β<1 is central to the general theory of MDPs, since it yields global stability of the Bellman and policy operators on RX (via the Neumann series lemma or Banach’s fixed-point theorem).
In the previous section, on shortest paths, we studied an RDP with a zero discount rate. Now we go one step further and consider problems with negative rates of time preference. Such preference are commonly inferred when people face unpleasant tasks. Subjects of studies often prefer getting such tasks “over and done with” rather than postponing them. (Negative discount rates are inferred in other settings as well. Section 9.3 provides background and references.)
In this section, we model optimal choice under a negative discount rate. Taking our cue from the preceding discussion, we consider a scenario where a task generates disutility but has to be completed. In particular, we assume that
where X is a finite set and β>1 is some positive constant. The function c gives the cost of transitioning from x to the new state x′
The value aggregator B in (8.137) is the same as the shortest path aggregator (8.132), except for the constant β. To keep the discussion simple, we adopt all other assumptions from the shortest path discussion in Section 8.3.5.1.
We now have an R=(Γ,V,B) with Γ=O, B as in (8.137) and V=[0,C]. The policy operators map V into itself because, for v∈V, we clearly have 0⩽Tσv and, in addition,
The RDP framework adopted in this chapter is inspired by Bertsekas (2022), who in turn credits Mitten (1964) as the first research paper to frame Richard Bellman’s dynamic programming problems in an abstract setting. Denardo (1967) describes key ideas including what we call contracting RDPs (see Section 8.2.1). Denardo (1967) credits Shapley (1953) for inspiring his contraction-based arguments.
The key optimality results from this chapter are new, although closely related results appear in Bertsekas (2022). See, in addition, Bloise et al. (2024), which builds on Bertsekas (2022) and Ren & Stachurski (2021).
The job search application with quantile preferences in Section 8.2.1.4 is based on Castro et al. (2022). The same reference includes a general theory of dynamic programming when certainty equivalents are computed using quantile operators and aggregation is time additive.
The optimal default application in Section 8.2.1.5 is loosely based on Arellano (2008). Influential contributions to this line of work include, Yue (2010), Chatterjee & Eyigungor (2012), Arellano & Ramanarayanan (2012), Cruces & Trebesch (2013), Ghosh et al. (2013), Gennaioli et al. (2014), and Bocola et al. (2019).
At the start of the chapter we motivated RDPs by mentioning that equilibria in some models of production and economic geography can be computed using dynamic programming. Examples include Hsu (2012), Hsu et al. (2014), Antràs & Gortari (2020), Kikuchi et al. (2021) and Tyazhelnikov (2022).
Early references for dynamic programming with risk-sensitive preferences include Jacobson (1973), Whittle (1981), and Hansen & Sargent (1995). Elegant modern treatments can be found in Asienkiewicz & Jaśkiewicz (2017) and Bäuerle & Jaśkiewicz (2024), and an extension to general static risk measures is available in Bäuerle & Glauner (2022). Risk-sensitivity is applied to the study of optimal growth in Bäuerle & Jaśkiewicz (2018), and to optimal divided payouts in Bäuerle & Jaśkiewicz (2017). Risk-sensitivity is also used in applications of reinforcement learning, where the underlying state process is not known. See, for example, Shen et al. (2014), Majumdar et al. (2017) or Gao et al. (2021).
Dynamic programming problems that acknowledge model uncertainty by including adversarial agents to promote robust decision rules can be found in Cagetti et al. (2002), Hansen & Sargent (2011), and other related papers. Al-Najjar & Shmaya (2019) study the connection between Epstein--Zin utility and parameter uncertainty. Ruszczyński (2010) considers risk averse dynamic programming and time consistency.
The smooth ambiguity model in Section 8.3.4 is loosely adapted from Klibanoff et al. (2009) and Ju & Miao (2012). For applications of optimization under smooth ambiguity, see, for example, Guan & Wang (2020) or Yu et al. (2023). Zhao (2020) studies yield curves in a setting where ambiguity-averse agents face varying amounts of Knightian uncertainty over the short and long run.
Readers who wish to see some motivation for the discussion of negative discounting in Section 8.3.5.2 can consult Loewenstein & Sicherman (1991), who found that the majority of workers they surveyed reported a preference for increasing wage profiles over decreasing ones that yield the same undiscounted sum, even when it was pointed out that the latter could be used to construct a dominating consumption sequence. Loewenstein & Prelec (1991) obtained similar results. In summarizing their study, they argue that, in the context of the choice problems that they examined, “sequences of outcomes that decline in value are greatly disliked, indicating a negative rate of time preference” Loewenstein & Prelec, 1991.
In Section 8.3.5.2 we considered dynamic programs with negative discount rates. A more general treatment of such problems can be found in Kikuchi et al. (2021), which also shows how negative discount rate dynamic programs connect to static problems concerning equilibria in production networks and draws connections with Coase’s theory of the firm.
An algorithm that we neglected to discuss is stochastic gradient descent (or ascent) in policy space. Typically policies are parameterized via an approximation architecture that consists of basis functions, activation functions, and compositions of them (e.g., a neural network). In large models, such approximation is used even when the state and action spaces are finite, simply because the curse of dimensionality makes exact representations infeasible. For recent discussions of gradient descent in policy spaces see Nota & Thomas (2019), Mei et al. (2020), and Bhandari & Russo (2022).
For H to be well-defined, we must always select the same v-greedy policy when the operator is applied to v. To this end, we enumerate the policy set Σ and choose the first v-greedy policy. This choice of convention has no effect on convergence results.