Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

9 Approximation and Learning

The theory developed in earlier chapters can be extended to address two important problems faced by applied researchers. One is the need for approximation. The other is how to learn optimal policies from samples, based on limited knowledge. This short chapter addresses these issues. Our aim here is limited, namely, to provide a brief introduction to these large topics and connect them to the ADP theory studied above.

The chapter consists of two sections. Section 9.1 develops the theory of approximate value function iteration, showing that nonexpansive approximation operators preserve contractivity and lead to clean error bounds. It also introduces stochastic approximation, which provides a way to study the convergence properties of iterative algorithms and forms the theoretical foundation for the learning algorithms that follow. Section 9.2 treats simulation-based learning, covering Q-learning, risk-sensitive Q-learning, and policy gradient methods.

9.1Approximation

This section discusses approximate dynamic programming. Approximation is needed because (a) models with continuous states and actions cannot be implemented exactly on computers, and (b) many problems are high-dimensional and hence computationally expensive. We focus on approximate value function iteration.

We begin in Section 9.1.1 with a review of common approximation methods, including local and global approximation. In Section 9.1.2, we show that nonexpansive approximation operators preserve contractivity and derive error bounds for the resulting policies. Section 9.1.3 introduces stochastic approximation, a general framework for finding fixed points from noisy evaluations that underpins the learning algorithms of Section 9.2.

9.1.1Approximation Methods

Suppose that we want to represent a function f ⁣:XRf \colon \Xsf \to \RR using an approximation f^\hat f that can be implemented on a machine using a finite number of parameters. One common approach is to take a normed linear space VV containing ff and nn elements b1,,bnb_1, \ldots, b_n from VV. The approximation then takes the form

f^(x)=i=1nαibi(x)\hat f(x) = \sum_{i=1}^n \alpha_i b_i(x)

The scalars α1,,αn\alpha_1, \ldots, \alpha_n are chosen so that f^\hat f is close to ff. In this setting, the elements bib_i are typically called basis functions.

In practice, one often fixes a sequence of functions (bn)nNV(b_n)_{n \in \NN} \subset V such that, with BnB_n defined as the linear span of b1,,bnb_1, \ldots, b_n for each nn, the union nNBn\bigcup_{n \in \NN} B_n is dense in VV. Next we introduce a sequence of approximation operators (Ln)nN(L_n)_{n \in \NN} such that LnfL_n f has the form of f^\hat f in (9.1). For example, given fVf \in V, the function LnfL_n f might equal the closest element in BnB_n to ff under the specified norm on VV. See, for example, Cheney (2013) or Atkinson & Han (2005).

9.1.1.1Local Approximation

One variation on this idea, often called local approximation, proceeds by first choosing a set of grid points {x1,,xn}\{x_1, \ldots, x_n\} contained in X\Xsf and a set of basis functions {κ1,,κn}\{\kappa_1, \ldots, \kappa_n\} defined on the state space X\Xsf with the property that

i=1nκi(x)=1 for all xXandκj0for all j{1,,n}.\sum_{i=1}^n \kappa_i(x) = 1 \text{ for all } x \in \Xsf \quad \text{and} \quad \kappa_j \geq 0 \quad \text{for all } j \in \{1, \ldots, n\}.

Elements of a family of basis functions satisfying (9.2) are sometimes called weighting functions or kernels, and an approximation of a function ff taking the form

(Lf)(x)=i=1nf(xi)κi(x)(Lf)(x) = \sum_{i=1}^n f(x_i) \kappa_i(x)

is called a local approximator or a kernel averager. The key idea is that (Lf)(x)(Lf)(x), the approximation to f(x)f(x) at xXx \in \Xsf, is a weighted average of the values of ff at the grid points. Typically, κi(x)\kappa_i(x) is relatively large when xx is near xix_i, so that f(xi)f(x_i) has a large weight in determining Lf(x)Lf(x).

Figure 9.1 illustrates the Gaussian kernel averager from Example 9.1.1 in one dimension. The left panel shows the kernel weights κi(x)\kappa_i(x) for six grid points: each κi\kappa_i peaks at xix_i and decays smoothly, and the weights sum to one at every xx. The right panel shows a function ff and its kernel-averager approximation LfLf; the approximation agrees with ff at the grid points and smoothly blends between them.

Gaussian kernel averager: kernels (left) and approximation (right)

Figure 9.1:Gaussian kernel averager: kernels (left) and approximation (right)

We will exploit the following property in Section 9.1.2 to derive error bounds for approximate value function iteration.

9.1.1.2Global Approximation

Another approach involves global approximation methods. Here we drop the partition-of-unity constraint (9.2) from the basis functions and the function values {f(xi)}\{f(x_i)\} are replaced with a more generic set of coefficients {θj}j=1k\{\theta_j\}_{j=1}^k. With the basis denoted by {b1,,bk}\{b_1, \ldots, b_k\}, the approximations take the form

(Gθf)(x)=j=1kθjbj(x).(G_\theta f)(x) = \sum_{j=1}^k \theta_j b_j(x).

The number of basis elements kk is typically smaller than the number of grid points, seeking a parsimonious representation. Coefficients are chosen by fitting. A typical criterion is least squares minimization:

θ^argminθi=1n[(Gθf)(xi)f(xi)]2.\hat \theta \coloneq \argmin_\theta \sum_{i=1}^n [(G_\theta f)(x_i) - f(x_i)]^2.

If Zij=bj(xi)Z_{ij} = b_j(x_i) and y=(f(x1),,f(xn))y = (f(x_1), \ldots, f(x_n))^\top, then θ^\hat \theta takes the familiar form (ZZ)1Zy(Z^\top Z)^{-1} Z^\top y whenever ZZ has full column rank.

9.1.2Error Bounds

In this section, we develop error bounds for approximate value function iteration in the ADP framework of Chapter 3. The key result is that when the approximation operator is nonexpansive, the composition of the Bellman operator with the approximation operator inherits contractivity, and the error in the computed policy can be bounded in terms of the approximation quality.

9.1.2.1Framework

Throughout this section, (V,T)(V, \TT) is an ADP with V=(V,d,)V = (V, d, \preceq) a partially ordered metric space. An operator L ⁣:VVL \colon V \to V is called nonexpansive with respect to dd when

d(Lv,Lw)d(v,w)for all v,wV.d(Lv, Lw) \leq d(v, w) \qquad \text{for all } v, w \in V.

We make the following assumption.

Under Assumption 9.1.1, the conditions of Theorem 3.1.5 are satisfied (with V0=VV_0 = V, since regularity implies semi-regularity with V0=VV_0 = V). As a result, the fundamental optimality properties hold, the Bellman operator TT is a contraction of modulus β\beta on VV, the value function v\vmax is the unique fixed point of TT in VV, and VFI, OPI, and HPI all converge.

Let LL be a self-map on VV. We call LL the approximation operator and define the approximate Bellman operator via T^LT\hat T \coloneq L \circ T.

By Lemma 9.1.2 and the Banach contraction mapping theorem, T^\hat T has a unique fixed point v^V\hat v \in V, and the iterates T^nv0\hat T^n v_0 converge geometrically to v^\hat v for any starting point v0Vv_0 \in V. The fitted value iteration (FVI) algorithm iterates with T^\hat T from an initial condition v0v_0 and, upon termination, returns a vNv_N-greedy policy.

The next two results bound the policy error when the FVI algorithm terminates after NN iterations.

9.1.2.2Error Bounds

Let v0Vv_0 \in V be the initial condition in the FVI routine and define vnT^nv0=(LT)nv0v_n \coloneq \hat T^n v_0 = (LT)^n v_0 for each nn. Let eNd(vN,vN1)e_N \coloneq d(v_N, v_{N-1}) be the step size at termination. Let σ\sigma be a vNv_N-greedy policy, so that TσvN=TvNT_\sigma \, v_N = T v_N.

The bound (9.11) decomposes the error into two terms. The first term βeN\beta \, e_N reflects the accuracy of the iterative process and decreases geometrically with the number of iterations. The second term d(LTvN,TvN)d(LTv_N, Tv_N) is the approximation error at the final iterate---how well LL represents TvNTv_N. Both terms are computable by the algorithm.

The next result provides an alternative bound that replaces the approximation error at the final iterate with the approximation error at the true value function v\vmax.

The bound (9.19) trades a larger constant (1β)2(1-\beta)^{-2} for a simpler approximation term d(Lv,v)d(L\vmax, \vmax), which depends only on how well the approximation operator LL represents the true value function. This is useful for a priori analysis of approximation architectures.

9.1.2.3Order-Preserving Approximations

One attractive special case arises when the approximation operator is not only nonexpansive but also order preserving. In the next result we set T^{LTσ:σΣ}\hat{\TT} \coloneq \setntn{L \circ T_\sigma}{\sigma \in \Sigma}.

In this setting, the approximate model (L(V),T^)(L(V), \hat{\TT}) inherits the structure of an ADP, and the optimality theory from Chapter 2 and Chapter 3 applies directly to it. The error bounds in Theorem 9.1.3 and Theorem 9.1.4 quantify the gap between the optimal policies of the original and approximate models.

9.1.3Stochastic Approximation

In the previous section, we discussed methods for controlling error when iterating with an approximate Bellman operator T^=LT\hat T = L \circ T. In that setting, the operator TT is known. In many applications, however, TT itself cannot be evaluated exactly because it involves an expectation over an unknown or intractable distribution. Instead, we can only observe noisy evaluations of TT. Stochastic approximation provides a framework for finding fixed points in this setting. Here we provide a fast introduction.

Stochastic approximation will drive the convergence theory of the learning algorithms we present below in Section 9.2.

9.1.3.1Fixed Point Iteration and Damping

Stochastic approximation can be thought of as a general technique for solving fixed point problems or root finding problems when information is limited. Here we focus on the fixed point case. In our analysis, T ⁣:ΘΘT \colon \Theta \to \Theta will be a contraction of modulus β\beta on a closed subset Θ\Theta of Rn\RR^n. By Banach’s fixed point theorem, TT has a unique fixed point θˉ\bar \theta and TkθθˉT^k \theta \to \bar \theta for any θΘ\theta \in \Theta.

To start thinking about stochastic approximation, we first recall damped iteration, which fixes α(0,1)\alpha \in (0,1) and updates via

θk+1=θk+α(Tθkθk).\theta_{k+1} = \theta_k + \alpha (T \theta_k - \theta_k).

This amounts to iterating with the map Fθθ+α(Tθθ)F\theta \coloneq \theta + \alpha (T\theta - \theta). Damped iteration is an alternative to regular fixed point iteration that has advantages in some cases.

One setting where damped iteration can converge faster than standard iteration is when the iterates oscillate. Figure 9.2 and Figure 9.4 illustrate this case with the linear map Tθ=AθT\theta = A\theta, where

A=(0.50.60.40.7).A = \begin{pmatrix} 0.5 & 0.6 \\ 0.4 & -0.7 \end{pmatrix}.

The fixed point is the origin. Standard iteration produces a spiraling trajectory, while damped iteration (with α=0.7\alpha = 0.7) converges more smoothly.

Standard iteration

Figure 9.2:Standard iteration

Damped iteration (α = 0.7)

Figure 9.3:Damped iteration (α = 0.7)

Convergence comparison: norm of iterates

Figure 9.4:Convergence comparison: norm of iterates

9.1.3.2The Robbins--Monro Algorithm

Now suppose that TT is a contraction on Θ\Theta with unique fixed point θˉ\bar \theta, but we can only evaluate TT with noise. That is, given input θ\theta, we observe Tθ+WT\theta + W, where WW is a zero-mean random disturbance. We cannot observe TθT\theta or WW separately, only their sum. The Robbins--Monro algorithm Robbins & Monro, 1951 generalizes damped iteration to this noisy setting. It tells us to fix an initial θ0Θ\theta_0 \in \Theta and update via

θk+1=θk+αk(Tθk+Wk+1θk),\theta_{k+1} = \theta_k + \alpha_k (T \theta_k + W_{k+1} - \theta_k),

where the learning rate (αk)(0,1)(\alpha_k) \subset (0,1) satisfies αk0\alpha_k \to 0.

Compare with damped iteration in (9.27): the only differences are (a) the learning rate decreases over time, and (b) the noise term Wk+1W_{k+1} enters the update. The decreasing learning rate ensures that, asymptotically, the noise is averaged away while the signal from TT accumulates.

9.1.3.3Convergence

The following well-known result, due to Tsitsiklis (1994), provides conditions under which the Robbins--Monro iterates converge to θˉ\bar \theta. In the statement, (Fk)k0(\fF_k)_{k \geq 0} is the filtration generated by θ0,W1,,Wk\theta_0, W_1, \ldots, W_k.

The restrictions on (αk)(\alpha_k) in (iii) are called the Robbins--Monro conditions. The restriction αk=\sum \alpha_k = \infty ensures that the learning rates do not decay too fast, so the algorithm can reach θˉ\bar \theta from any starting point. The condition αk2<\sum \alpha_k^2 < \infty ensures that the accumulated noise variance is finite. A standard choice is αk=(k+1)γ\alpha_k = (k+1)^{-\gamma} for γ(1/2,1)\gamma \in (1/2, 1).

9.1.3.4Example: Asset Pricing

To illustrate stochastic approximation for a fixed point problem, we consider a basic asset pricing model. The model is deliberately simple, so that the exact solution can be computed by linear algebra (allowing us to verify numerically that stochastic approximation converges to the correct answer). In the model, the price of the asset satisfies

Vt=βE[Vt+1+Dt+1Ft],V_t = \beta \, \EE [V_{t+1} + D_{t+1} \mid \fF_t],

where β(0,1)\beta \in (0,1) is a discount factor and DtD_t is the dividend at time tt. (See Section 6.3 of Sargent & Stachurski (2025) for background on the model.) Assume that (Xt)(X_t) is PP-Markov on a finite set X\Xsf with transition matrix PP, and Dt+1=d(Xt+1)D_{t+1} = d(X_{t+1}). The equilibrium price function vv is the unique fixed point in RX\RR^\Xsf of the map (Tv)(x)=βx[v(x)+d(x)]P(x,x)(Tv)(x) = \beta \sum_{x'} [v(x') + d(x')] P(x, x').

Setting KβPK \coloneq \beta P, the unique fixed point of TT is v=(IK)1Kdv^* = (I - K)^{-1} Kd.

We can also compute vv^* using stochastic approximation. To do this we use the random map vT^vv \mapsto \hat Tv given by the algorithm below

We understand T^\hat T as an operator that maps an element vv of RX\RR^\Xsf into a random vector T^v\hat Tv in RX\RR^\Xsf. Note that

E[(T^v)(x)]=xβ[v(x)+d(x)]P(x,x)=(Tv)(x).\EE \, [(\hat Tv)(x)] = \sum_{x'} \beta [v(x') + d(x')] P(x, x') = (Tv)(x).

Now we iterate via

vk+1=vk+αk(T^vkvk)v_{k+1} = v_k + \alpha_k (\hat T v_k - v_k)

which parallels the damped iteration in (9.27). With Wk+1T^vkTvkW_{k+1} \coloneq \hat Tv_k - Tv_k we have E[Wk+1Fk]=0\EE [W_{k+1} \mid \fF_k] = 0 and

vk+1=vk+αk(Tvk+(T^vkTvk)vk)=vk+αk(Tvk+Wk+1vk),v_{k+1} = v_k + \alpha_k (Tv_k + (\hat T v_k - Tv_k) - v_k) = v_k + \alpha_k (Tv_k + W_{k+1} - v_k),

so updating obeys the Robbins--Monro rule (9.30). The operator TT is order preserving since vwv \leq w implies TvTwTv \leq Tw (the sum involves nonnegative weights). Conditions (i) and (iii) of Theorem 9.1.8 hold by construction.[1] Thus Theorem 9.1.8 applies, and (vk)(v_k) converges to vv^* with probability one.

Figure 9.5 compares the stochastic approximation solution with the exact linear algebra solution v=(IK)1Kdv^* = (I - K)^{-1} Kd for a specification with X=25|\Xsf| = 25, β=0.96\beta = 0.96, d(x)=exd(x) = \me^x, and (Xt)(X_t) discretized from an AR(1) process with ρ=0.9\rho = 0.9 and σ=0.02\sigma = 0.02. The learning rate is αk=(k+1)0.55\alpha_k = (k+1)^{-0.55}. After 500 iterations (left panel), the SA estimate is uniformly below vv^* but has already captured the shape of the price function. After 50,00050{,}000 iterations (right panel), the SA solution closely matches the exact value.

Batch stochastic approximation vs. linear algebra for asset pricing: partial convergence (left) and near-full convergence (right)

Figure 9.5:Batch stochastic approximation vs. linear algebra for asset pricing: partial convergence (left) and near-full convergence (right)

9.1.3.5A Sequential Version

An alternative way to apply stochastic approximation is to simulate a single trajectory (Xt)t0(X_t)_{t \geq 0} of the PP-Markov chain and update vv sequentially. In particular, at each step tt, only the entry v(Xt)v(X_t) is modified:

vt+1(Xt)=vt(Xt)+αt[(T^svt)(Xt)vt(Xt)],v_{t+1}(X_t) = v_t(X_t) + \alpha_t \big[ (\hat T_s v_t)(X_t) - v_t(X_t) \big],

where

(T^svt)(Xt)β[vt(Xt+1)+d(Xt+1)].(\hat T_s v_t)(X_t) \coloneq \beta [v_t(X_{t+1}) + d(X_{t+1})] .

For states yXty \neq X_t we hold the estimate steady, setting vt+1(y)=vt(y)v_{t+1}(y) = v_t(y). Note that T^s\hat T_s is a single-sample version of our previous version T^\hat T from Section 9.1.3.4. The asynchronous extension of Theorem 9.1.8 noted in Remark 9.1.3 shows that the iterates vtv_t converge to vv^* with probability one, provided every state is visited infinitely often.

The advantage of the sequential version will be clearer when we study Q-learning in Section 9.2.1. In real-world Q-learning applications, sequences of observations are naturally generated by the environment and the agent’s interactions with that environment. Optimal policies can be computed from these sequences without knowing or even estimating underlying dynamics.

Figure 9.6 illustrates the sequential version on the same asset pricing model from Section 9.1.3.4, using per-state learning rates αt=nt(Xt)0.55\alpha_t = n_t(X_t)^{-0.55}, where nt(x)n_t(x) counts the number of visits to state xx by time tt. The left panel shows the SA estimate after 104 updates. Convergence is fastest near the center of the state space, where the stationary distribution of the chain concentrates most of its mass and states are visited most frequently. At the tails, fewer visits lead to slower learning. After 5×1075 \times 10^7 updates (right panel), the SA solution closely matches the exact value.

Sequential stochastic approximation vs. linear algebra for asset pricing: partial convergence (left) and near-full convergence (right)

Figure 9.6:Sequential stochastic approximation vs. linear algebra for asset pricing: partial convergence (left) and near-full convergence (right)

9.2Simulation and Learning

In most of this book we have worked in settings where the underlying model is known. In many applications, the model is unknown and the agent must learn from data generated by interacting with the environment. This section covers two approaches to learning: Q-learning (value-based) and policy gradient methods (policy-based). We emphasize that this section is intended only as a very brief introduction to a vast literature.

9.2.1Q-Learning

Q-learning is a model-free algorithm that learns optimal policies from data, without knowledge of transition probabilities or reward functions. In this section we connect Q-learning to the Q-factor ADP framework developed in Section 3.2.1.2.

9.2.1.1Q-Factors and Model-Free Learning

Recall from Section 3.2.1.2 that, for a finite MDP (Γ,r,β,P)(\Gamma, r, \beta, P) with state space X\Xsf and action space A\Asf, the Q-factor Bellman equation takes the form

q(x,a)=r(x,a)+βxmaxaΓ(x)q(x,a)  P(x,a,x)((x,a)G).q(x, a) = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')} q(x', a') \; P(x, a, x') \qquad ((x,a) \in \Gsf).

As shown in Proposition 5.3.1,

  1. the unique solution q\qmax to (9.38) obeys v(x)=maxaΓ(x)q(x,a)\vmax(x) = \max_{a \in \Gamma(x)} \qmax(x, a), where v\vmax is the value function, and

  2. σ\sigma is an optimal policy whenever σ(x)=argmaxaΓ(x)q(x,a)\sigma(x) = \argmax_{a \in \Gamma(x)} \qmax(x, a) for all xx.

Item (ii) tells us that the optimal policy can be calculated from q\qmax without knowledge of the transition kernel PP. This observation opens the door to learning: if we can approximate q\qmax from samples, we can compute an approximately optimal policy without knowing the model.

9.2.1.2The Q-Learning Algorithm

Q-learning approximates q\qmax by stochastic approximation of the Q-factor Bellman operator. At each step tt, the agent is in state Xt=xX_t = x, takes action At=aA_t = a, observes reward Rt+1=r(x,a)R_{t+1} = r(x, a) and next state Xt+1P(x,a,)X_{t+1} \sim P(x, a, \cdot), and updates the Q-function (also called the Q-table) via

qt+1(x,a)=(1αt)qt(x,a)+αt(Rt+1+βmaxaΓ(Xt+1)qt(Xt+1,a)).q_{t+1}(x, a) = (1 - \alpha_t) \, q_t(x, a) + \alpha_t \left( R_{t+1} + \beta \max_{a' \in \Gamma(X_{t+1})} q_t(X_{t+1}, a') \right).

Here αt(0,1)\alpha_t \in (0,1) is a learning rate. The term in parentheses is a single-sample estimate of the right-hand side of (9.38). The update blends this fresh sample with the current estimate. The agent needs to observe only the tuple (Xt,At,Rt+1,Xt+1)(X_t, A_t, R_{t+1}, X_{t+1}) at each step.

Actions are selected by a behavior policy that balances exploration (visiting new state-action pairs) with exploitation (using the current estimate of q\qmax). A common choice is the ϵ\epsilon-greedy strategy: with probability ϵ\epsilon, choose a random feasible action; otherwise choose argmaxaΓ(x)qt(x,a)\argmax_{a \in \Gamma(x)} q_t(x, a). Because the convergence target q\qmax does not depend on how actions are selected---only on the Bellman equation (9.38)---Q-learning is called an off-policy method.

9.2.1.3Convergence

Convergence of Q-learning follows from the stochastic approximation theory developed in Section 9.1.3. To see this, let SS be the Q-factor Bellman operator defined in (3.13), so that (Sq)(x,a)=r(x,a)+βxmaxaq(x,a)P(x,a,x)(Sq)(x, a) = r(x, a) + \beta \sum_{x'} \max_{a'} q(x', a') P(x, a, x'). By Exercise 3.5, each policy operator SσS_\sigma is a contraction of modulus β\beta on (RG,)(\RR^\Gsf, \| \cdot \|_\infty). Since the supremum norm is sup-nonexpansive, Theorem 3.1.5 gives that SS is also a contraction of modulus β\beta with unique fixed point q\qmax. At each step tt, the agent observes (Xt,At,Rt+1,Xt+1)(X_t, A_t, R_{t+1}, X_{t+1}) and computes the sample

(S^qt)(Xt,At)Rt+1+βmaxaΓ(Xt+1)qt(Xt+1,a).(\hat S q_t)(X_t, A_t) \coloneq R_{t+1} + \beta \max_{a' \in \Gamma(X_{t+1})} q_t(X_{t+1}, a').

With this notation, the Q-learning update (9.39) becomes

qt+1(Xt,At)=qt(Xt,At)+αt[(S^qt)(Xt,At)qt(Xt,At)],q_{t+1}(X_t, A_t) = q_t(X_t, A_t) + \alpha_t \big[ (\hat S q_t)(X_t, A_t) - q_t(X_t, A_t) \big],

which parallels (9.36). Since the fixed point of SS is q\qmax, applying the asynchronous version of Theorem 9.1.8 noted in Remark 9.1.3 leads to the following classical result Watkins & Dayan, 1992Tsitsiklis, 1994:

Theorem 9.2.1 is often referred to as a “tabular” result: it assumes that the Q-table stores a separate value for every (x,a)(x, a) pair. An alternative approach, which scales to high dimensions, is to combine Q-learning with one of the function approximation schemes from Section 9.1. When the scheme in question is neural nets we get deep Q-learning, which has been highly successful in certain environments. See Section 9.3 for more discussion.

9.2.1.4Example: Inventory Management

To illustrate Q-learning, we consider an inventory management problem where a firm manages stock of a single product. Inventory (Xt)t0(X_t)_{t \geq 0} evolves according to

Xt+1=max(XtDt+1,0)+At,X_{t+1} = \max(X_t - D_{t+1}, 0) + A_t,

where DtD_t is iid demand and At{0,,KXt}A_t \in \{0, \ldots, K - X_t\} is the order quantity chosen by the firm. The state space is X={0,,K}\Xsf = \{0, \ldots, K\}. Per-period profits are

π(Xt,At,Dt+1)=pmin(Xt,Dt+1)cAtκ1{At>0},\pi(X_t, A_t, D_{t+1}) = p \min(X_t, D_{t+1}) - c \, A_t - \kappa \1\{A_t > 0\},

where pp is the selling price, cc is the unit ordering cost, and κ\kappa is a fixed cost of placing an order. The firm seeks to maximize expected discounted profits Et0βtπ(Xt,At,Dt+1)\EE \sum_{t \geq 0} \beta^t \pi(X_t, A_t, D_{t+1}).

Now suppose that a manager does not know the demand distribution, the cost parameters, or the transition dynamics. The manager can still learn the optimal ordering policy by Q-learning. At each period, the manager observes the current inventory XtX_t, places an order AtA_t, records the resulting profit, and observes the next-period inventory Xt+1X_{t+1}. These four quantities suffice for the update (9.39).

Figure 9.7 compares the Q-learning solution with the exact VFI solution for this problem, with K=20K = 20, selling price p=1p = 1, geometric demand with parameter θ=0.7\theta = 0.7, c=0.2c = 0.2, κ=0.8\kappa = 0.8, and β=0.98\beta = 0.98. After 20 million steps of interaction, the learned value function and policy closely match those obtained by exact value function iteration with full model knowledge.

Value function and policy: VFI (exact) vs. Q-learning (model-free)

Figure 9.7:Value function and policy: VFI (exact) vs. Q-learning (model-free)

The 20 million steps required in this small example illustrate a general feature of model-free methods: because they do not exploit knowledge of transition probabilities or reward functions, they must extract all structural information from observed data, which can require very long trajectories. In practice, the cost of exploration is compounded by the fact that the agent typically needs to make reasonable decisions at each time step, limiting the extent to which it can explore freely. These challenges motivate both deep Q-learning (which uses function approximation to generalize across states) and the policy gradient methods discussed in Section 9.2.3 (which search directly in policy space).

9.2.2Risk-Sensitive Q-Learning

The Q-learning framework can be extended to accommodate risk-sensitive preferences. As we will see, the extension provides a concrete illustration of the order-reversing FDP theory developed in Section 5.2.3.

9.2.2.1Risk-Sensitive Bellman Equation

We can construct a risk-sensitive version of the inventory problem Bellman equation by replacing the expectation with a certainty equivalent (see, e.g., Section 7.1.1.8). Using the entropic certainty equivalent with parameter γ>0\gamma > 0, the Bellman equation becomes

v(x)=maxaΓ(x)(1γ)ln{xexp[γ(r(x,a)+βv(x))]P(x,a,x)}.v(x) = \max_{a \in \Gamma(x)} \left( -\frac{1}{\gamma} \right) \ln \left\{ \sum_{x'} \exp \left[ -\gamma \left( r(x, a) + \beta \, v(x') \right) \right] P(x, a, x') \right\}.

Larger γ\gamma implies stronger aversion to downside risk.

9.2.2.2The Order-Reversing FDP Structure

Our aim is to derive a Q-factor Bellman equation for the risk-sensitive problem, analogous to (9.38), and then build a Q-learning algorithm from it. One challenge is how to shift expectation to the outside of the maximization step, as in the standard Q-factor equation (see (9.38)). We will use the order-reversing FDP framework of Section 5.2.3 to factor the Bellman equation into a suitable form, enabling single-sample updates.

We set up the FDP (V,F,V^,G)(V, F, \hat V, \GG) as follows. The primary value space is VRXV \coloneq \RR^\Xsf and the subordinate value space is V^R++G\hat V \coloneq \RR_{++}^\Gsf (strictly positive functions on G\Gsf), each with the pointwise partial order. The connecting map F ⁣:VV^F \colon V \to \hat V is defined by

(Fv)(x,a)xexp ⁣[γ(r(x,a)+βv(x))]P(x,a,x).(Fv)(x, a) \coloneq \sum_{x'} \exp\!\left[ -\gamma \left( r(x, a) + \beta \, v(x') \right) \right] P(x, a, x').

Noting that the image of FF lies in R++G\RR_{++}^\Gsf (since the exponential function is strictly positive), we define the family G{Gσ}σΣ\GG \coloneq \{G_\sigma\}_{\sigma \in \Sigma} of maps from V^\hat V to VV by

(Gσq)(x)1γln ⁣(q(x,σ(x))).(G_\sigma \, q)(x) \coloneq -\frac{1}{\gamma} \ln\!\big(q(x, \sigma(x))\big).

The map GσG_\sigma evaluates the Q-function at the action prescribed by σ\sigma and applies ψ1(y)=(1/γ)ln(y)\psi^{-1}(y) = -(1/\gamma)\ln(y) to convert the positive Q-value back to the value scale.

Since Γ(x)\Gamma(x) is finite, the set {Gσq}σΣ\{G_\sigma \, q\}_{\sigma \in \Sigma} has a greatest element for every qV^q \in \hat V. Thus, (V,F,V^,G)(V, F, \hat V, \GG) is an order-reversing FDP.

9.2.2.3Primary and Subordinate Bellman Equations

We have chosen (V,F,V^,G)(V, F, \hat V, \GG) so that the primary ADP for this FDP has the Bellman equation (9.44). To see this, recall that, in the FDP setting, G\Gmax is defined by GqσGσq\Gmax q \coloneq \bigvee_\sigma G_\sigma \, q. Here this gives (Gq)(x)=(1/γ)ln(minaΓ(x)q(x,a))(\Gmax q)(x) = -(1/\gamma)\ln(\min_{a \in \Gamma(x)} q(x, a)), since ψ1\psi^{-1} is decreasing. The primary policy operators TσGσFT_\sigma \coloneq G_\sigma \circ F give

(Tσv)(x)=1γln ⁣(xP(x,σ(x),x)exp ⁣[γ(r(x,σ(x))+βv(x))]).(T_\sigma \, v)(x) = -\frac{1}{\gamma} \ln\!\left( \sum_{x'} P(x, \sigma(x), x') \exp\!\left[-\gamma(r(x, \sigma(x)) + \beta \, v(x'))\right] \right).

With T=GF\tmax = \Gmax \circ F (by Lemma 5.2.15), we recover (9.44).

For the subordinate ADP, by Lemma 5.2.16, the Bellman min-operator is T^=FG\htmin = F \circ \Gmax. Computing explicitly:

(T^q)(x,a)=(F(Gq))(x,a)=xP(x,a,x)exp ⁣[γ ⁣(r(x,a)+β(1γ)ln ⁣(minaq(x,a)))]=xP(x,a,x)[exp(γr(x,a))(minaΓ(x)q(x,a))β].\begin{aligned} (\htmin q)(x, a) &= (F(\Gmax q))(x, a) \\ &= \sum_{x'} P(x, a, x') \exp\!\left[-\gamma\!\left(r(x, a) + \beta \cdot \left(-\tfrac{1}{\gamma}\right) \ln\!\big(\min_{a'} q(x', a')\big)\right)\right] \\ &= \sum_{x'} P(x, a, x') \left[\exp(-\gamma \, r(x, a)) \cdot \left(\min_{a' \in \Gamma(x')} q(x', a')\right)^\beta \right]. \end{aligned}

Thus the subordinate Bellman equation q=T^qq = \htmin q takes the form

q(x,a)=xP(x,a,x)[exp(γr(x,a))(minaΓ(x)q(x,a))β].q(x, a) = \sum_{x'} P(x, a, x') \left[ \exp(-\gamma \, r(x, a)) \cdot \left(\min_{a' \in \Gamma(x')} q(x', a')\right)^\beta \right].

This is the risk-sensitive Q-factor Bellman equation---a fixed point equation in qq alone, and the min-analogue of the standard Q-factor Bellman equation (9.38).

Notice that the expectation (summation over xx') appears on the outside of (9.49) because it originates from the map FF in the FDP factorization. The maps GσG_\sigma act pointwise and do not involve the transition kernel PP. This separation makes Q-learning possible: we can replace the expectation with a single-sample observation, just as in the standard case.

Our original goal is to find an optimal policy for the primary risk-sensitive MDP. The Q-learning algorithm described below operates on the subordinate Q-factor problem, converging to its min-value function q\qmin. The key question is whether q\qmin suffices to recover an optimal policy for the primary problem. Theorem 5.2.18 gives an affirmative answer: any policy σ\sigma satisfying Gσq=GqG_\sigma \, \qmin = \Gmax \, \qmin is optimal for the primary problem (V,T)(V, \TT). Since (Gq)(x)=(1/γ)ln(minaq(x,a))(\Gmax q)(x) = -(1/\gamma)\ln(\min_a q(x,a)), this condition reduces to σ(x)=argminaΓ(x)q(x,a)\sigopt(x) = \argmin_{a \in \Gamma(x)} \qmin(x, a). In other words, once Q-learning has converged, we read off the optimal policy by minimizing the learned Q-function at each state.

9.2.2.4Optimal Behavior Under Risk Sensitivity

Before turning to Q-learning, we illustrate how risk sensitivity affects the optimal policy and the induced dynamics in the inventory model of Section 9.2.1. We apply value function iteration to the risk-sensitive Bellman equation (9.44) for three levels of risk sensitivity, γ{0.01,1.0,2.0}\gamma \in \{0.01, 1.0, 2.0\}, holding all other parameters at the values used for Figure 9.7.

Figure 9.8 shows the resulting optimal order quantities. A more risk-sensitive firm (larger γ\gamma) orders less stock, accepting lower expected sales in exchange for more predictable profits. Figure 9.9 confirms the effect in simulation: inventory paths under the optimal policy stabilize at lower levels as γ\gamma increases.

Optimal policy as a function of inventory, for \gamma \in \{0.01, 1.0, 2.0\}

Figure 9.8:Optimal policy as a function of inventory, for γ{0.01,1.0,2.0}\gamma \in \{0.01, 1.0, 2.0\}

Inventory dynamics under the optimal policy at \gamma \in \{0.01, 1.0, 2.0\}

Figure 9.9:Inventory dynamics under the optimal policy at γ{0.01,1.0,2.0}\gamma \in \{0.01, 1.0, 2.0\}

9.2.2.5The Risk-Sensitive Q-Learning Update

The corresponding Q-learning update replaces the expectation in (9.49) with a single sample:

qt+1(x,a)=(1αt)qt(x,a)+αt[exp(γRt+1)(minaΓ(Xt+1)qt(Xt+1,a))β].q_{t+1}(x, a) = (1 - \alpha_t) \, q_t(x, a) + \alpha_t \left[ \exp(-\gamma \, R_{t+1}) \cdot \left(\min_{a' \in \Gamma(X_{t+1})} q_t(X_{t+1}, a')\right)^\beta \right].

Note several differences from the standard case. First, the optimal policy minimizes rather than maximizes qq. Second, the reward enters through exp(γRt+1)\exp(-\gamma R_{t+1}) rather than additively. Third, the continuation value enters as a power (minaqt)β(\min_{a'} q_t)^\beta rather than a scaled sum βmaxaqt\beta \cdot \max_{a'} q_t. Fourth, to make the calculations, the agent now needs to know γ\gamma as well as β\beta.

Figure 9.10 shows the value function and policy produced by the risk-sensitive Q-learning update (9.52) after 2×1072 \times 10^7 steps on the inventory model at γ=1.0\gamma = 1.0, alongside the exact VFI solution. Q-learning recovers the optimal policy and value function despite the open theoretical question above; see the QuantEcon lectures for implementation details and code.

Risk-sensitive inventory at \gamma = 1.0: VFI vs. Q-learning after 2 \times 10^7 steps

Figure 9.10:Risk-sensitive inventory at γ=1.0\gamma = 1.0: VFI vs. Q-learning after 2×1072 \times 10^7 steps

9.2.3Policy-Based Methods

The methods discussed so far are value-based: they learn a value function (or Q-function) and extract a policy from it. An alternative approach is to parameterize the policy directly and optimize over the parameter space. Such policy gradient methods are widely used in modern reinforcement learning.

9.2.3.1The Policy Gradient Approach

Rather than searching over all feasible policies, we restrict attention to a parameterized family {σ(,θ):θΘ}\{\sigma(\cdot, \theta) : \theta \in \Theta\}, where θΘRm\theta \in \Theta \subset \RR^m is a parameter vector. Given an initial condition x0x_0, define

M(θ)vσ(,θ)(x0),M(\theta) \coloneq v_{\sigma(\cdot, \theta)}(x_0),

the lifetime value of the policy σ(,θ)\sigma(\cdot, \theta) starting from x0x_0. The policy gradient method maximizes M(θ)M(\theta) over θ\theta by gradient ascent:

θn+1=θn+λnθM^(θn),\theta_{n+1} = \theta_n + \lambda_n \nabla_\theta \hat M(\theta_n),

where λn>0\lambda_n > 0 is a step size and M^\hat M is a Monte Carlo approximation to MM described below.

To illustrate, consider the optimal savings problem from Section 1.3: a household chooses a consumption policy to maximize Et0βtu(ct)\EE \sum_{t \geq 0} \beta^t u(c_t), where uu is a smooth utility function, subject to

at+1=R(atct)+Yt+1,ct[0,at],a_{t+1} = R(a_t - c_t) + Y_{t+1}, \qquad c_t \in [0, a_t],

with iid income shocks (Yt)(Y_t) and gross interest rate R>0R > 0. A policy σ(a,θ)\sigma(a, \theta) maps current assets aa to consumption cc, parameterized by θ\theta (for example, the weights of a neural network). The state dynamics are a smooth function of the policy: given a realization of the shock sequence, at+1a_{t+1} is differentiable in θ\theta through ct=σ(at,θ)c_t = \sigma(a_t, \theta).

9.2.3.2Monte Carlo Approximation

In general, M(θ)M(\theta) cannot be evaluated exactly because it involves an expectation over the shock sequence. Instead, we form a Monte Carlo approximation by simulating NN independent paths of length TT under the policy σ(,θ)\sigma(\cdot, \theta). For each path i=1,,Ni = 1, \ldots, N, set a0i=a0a^i_0 = a_0 and generate

cti=σ(ati,θ),at+1i=R(aticti)+Yt+1i,(t=0,,T1),c^i_t = \sigma(a^i_t, \theta), \qquad a^i_{t+1} = R(a^i_t - c^i_t) + Y^i_{t+1}, \qquad (t = 0, \ldots, T-1),

where (Yti)(Y^i_t) are iid draws. The Monte Carlo estimate is

M^(θ)=1Ni=1Nt=0T1βtu(σ(ati,θ)).\hat M(\theta) = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \beta^t u(\sigma(a^i_t, \theta)).

The key observation is that M^(θ)\hat M(\theta) is differentiable in θ\theta whenever uu and σ(,θ)\sigma(\cdot, \theta) are differentiable, because the state dynamics (9.55) are a smooth function of the action. (The shocks Yt+1iY^i_{t+1} do not depend on θ\theta, so they pass through the differentiation.) Each simulated path defines a computational graph from θ\theta through the policy evaluations and state transitions to the cumulative payoff. Modern automatic differentiation libraries can differentiate through these graphs efficiently, providing θM^(θ)\nabla_\theta \hat M(\theta) at cost comparable to evaluating M^(θ)\hat M(\theta) itself. The gradient estimate is substituted into the ascent step (9.54).

A common and effective choice for the policy parameterization is a neural network mapping states to actions, with θ\theta collecting all the weights and biases. The universal approximation theorem Hornik et al., 1989 guarantees that sufficiently expressive networks can approximate any continuous policy. The main advantage of policy gradient methods is that they handle continuous state and action spaces naturally and scale to high-dimensional problems. Discussion of related literature can be found in Section 9.3.

Figure 9.11 applies the policy gradient method to the household problem (9.55) with CRRA utility (γ=1.5\gamma = 1.5), β=0.96\beta = 0.96, R=1.01R = 1.01, and iid lognormal income. The policy is parameterized by a neural network with three hidden layers of six units each, trained for 400 epochs using 100 paths of length 200. The learned consumption policy closely matches the exact solution obtained by the endogenous grid method (EGM; see Section 8.3.3.5). For this low-dimensional problem, EGM is far more efficient; we use it here as a trusted benchmark to verify the policy gradient solution. The advantage of policy gradient methods emerges in high-dimensional settings where traditional grid-based methods are infeasible.

Consumption policy: EGM (exact) vs. policy gradient (neural network)

Figure 9.11:Consumption policy: EGM (exact) vs. policy gradient (neural network)

Interestingly, the policy gradient method recovers the globally optimal policy despite the fact that we are optimizing lifetime value at a single initial condition a0a_0. This raises a natural question: when does maximizing at one initial condition generate an optimal policy --- that is, a policy that is optimal across all states? We address this question in the next section.

9.2.3.3From Local to Global Optimality

In general, vσ(x0)=v(x0)v_\sigma(x_0) = \vmax(x_0) does not imply vσ=vv_\sigma = \vmax. But perhaps implication does hold in some settings? Stachurski et al. (2024) analyze this question for MDPs on general state spaces. Here we present a simplified version for finite MDPs that conveys the core idea. In stating the main result, we recall that a stochastic matrix PσP_\sigma is called irreducible if, for every pair of states x,yXx, y \in \Xsf, there exists an mNm \in \NN with Pσm(x,y)>0P_\sigma^m(x, y) > 0.

Theorem 9.2.2 tells us that policy gradient methods---which optimize at a single initial condition---produce globally optimal policies provided the candidate policy induces irreducible dynamics. Irreducibility ensures that every state is “reachable” from x0x_0, so optimality propagates throughout the state space.

This helps explain the global convergence observed in Figure 9.11 for the household problem (9.55). The income shocks (Yt)(Y_t) are lognormal with unbounded support, so they spread the state across a wide range of asset levels regardless of the consumption policy. This provides a form of irreducibility: every region of the state space is visited with positive probability under any feasible policy, allowing optimality at the initial condition a0a_0 to propagate globally.

9.3Chapter Notes

The curse of dimensionality and the role of approximation in dynamic programming are discussed at length in Powell (2007) and Rust (1997). Theorem 9.1.3 and Theorem 9.1.4 generalize the error bounds for fitted value iteration in Stachurski (2008) from concrete MDPs on Euclidean state spaces to the abstract ADP setting. Li et al. (2026) provide performance guarantees for approximate dynamic programming in the form of ratio bounds. Lee (2026) studies convergence of Q-factor value function iteration.

The use of neural networks for approximate dynamic programming has become popular in recent years. The classic reference is Bertsekas & Tsitsiklis (1996). More recent examples include Maliar et al. (2021), Kase et al. (2022), Han et al. (2026), Pascal (2024), Kamber et al. (2025), Ashwin et al. (2024), and Oguz & Bray (2026). See Scheidegger (2026) for an overview of the field.

The stochastic approximation framework of Section 9.1.3 originates with Robbins & Monro (1951). The convergence result in Theorem 9.1.8 is based on Tsitsiklis (1994), who proves convergence in a more general asynchronous setting. Sutton & Barto (2018) give a comprehensive textbook treatment of stochastic approximation and temporal difference methods in the reinforcement learning context. Szepesvári (2010) provides a concise and mathematically rigorous introduction to reinforcement learning algorithms, including detailed convergence analysis for both value-based and policy-based methods. Agarwal et al. (2021) give a modern theoretical treatment of reinforcement learning with an emphasis on sample complexity and regret bounds.

Q-learning was introduced by Watkins & Dayan (1992), with convergence established by Tsitsiklis (1994). For a more recent discussion of convergence, see Regehr & Ayoub (2021). Deep Q-learning was mentioned in Section 9.2.1. Bertsekas & Tsitsiklis (1996) laid the theoretical foundations for combining neural network function approximation with dynamic programming. The potential of this approach was demonstrated by Mnih et al. (2015), who showed that a deep Q-network agent could reach human-level performance across a range of Atari games without game-specific engineering. Yang et al. (2020) provide a theoretical analysis of deep Q-learning. Wang et al. (2024) provide a recent survey of the deep reinforcement learning literature.

The risk-sensitive Q-learning formulation of Section 9.2.2 connects to the broader literature on risk-sensitive Markov decision processes surveyed in Bäuerle & Jaśkiewicz (2024). Littman & Szepesvari (1996) is a valuable early contribution to the theoretical foundations of Q-learning in potentially nonstandard environments. The use of deep learning for solving dynamic programming problems in economics is explored by Azinovic et al. (2022).

Policy gradient methods, discussed briefly in Section 9.2.3, are treated in depth by Sutton & Barto (2018), Szepesvári (2010), and Agarwal et al. (2021). The discussion in Section 9.2.3.3 is a simplified version of the result in Stachurski et al. (2024), which treats MDPs on general state spaces under an appropriate generalization of irreducibility.

Footnotes
  1. For condition (ii), note that Wk+1(x)=(T^vk)(x)(Tvk)(x)=β[vk(X)+d(X)](Tvk)(x)W_{k+1}(x) = (\hat Tv_k)(x) - (Tv_k)(x) = \beta [v_k(X') + d(X')] - (Tv_k)(x), so Wk+1(x)2β(vk+d)|W_{k+1}(x)| \leq 2\beta(\|v_k\|_\infty + \|d\|_\infty). Hence E[Wk+12Fk]4Xβ2(vk+d)2C(1+vk2)\EE[\|W_{k+1}\|^2 \mid \fF_k] \leq 4 |\Xsf| \beta^2 (\|v_k\|_\infty + \|d\|_\infty)^2 \leq C(1 + \|v_k\|^2) for a constant CC depending only on β\beta, X|\Xsf|, and d\|d\|_\infty.

References
  1. Cheney, W. (2013). Analysis for applied mathematics (Vol. 208). Springer Science & Business Media.
  2. Atkinson, K., & Han, W. (2005). Theoretical numerical analysis (Vol. 39). Springer.
  3. Stachurski, J. (2008). Continuous state dynamic programming via nonexpansive approximation. Computational Economics, 31(2), 141–160.
  4. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
  5. Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.
  6. Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 185–202.
  7. Sargent, T. J., & Stachurski, J. (2025). Dynamic Programming: Finite States. Cambridge University Press.
  8. Watkins, C. J., & Dayan, P. (1992). Q-Learning. Machine Learning, 8, 279–292.
  9. Stachurski, J., Yang, J., & Yang, Z. (2024). Dynamic Programming: From Local Optimality to Global Optimality. arXiv Preprint arXiv:2411.11062.
  10. Powell, W. B. (2007). Approximate Dynamic Programming: Solving the curses of dimensionality (Vol. 703). John Wiley & Sons.
  11. Rust, J. (1997). Using randomization to break the curse of dimensionality. Econometrica: Journal of the Econometric Society, 487–516.
  12. Li, B., Chong, E. K. P., & Pezeshki, A. (2026). Performance Guarantees for Data-Driven Sequential Decision-Making. https://arxiv.org/abs/2603.20553
  13. Lee, D. (2026). Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration. https://arxiv.org/abs/2604.17457
  14. Bertsekas, D., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.
  15. Maliar, L., Maliar, S., & Winant, P. (2021). Deep Learning for Solving Dynamic Economic Models. Journal of Monetary Economics, 122, 76–101. 10.1016/j.jmoneco.2021.07.004