Approximation and Learning - Dynamic Programming Volume II: General States

The theory developed in earlier chapters can be extended to address two important problems faced by applied researchers. One is the need for approximation. The other is how to learn optimal policies from samples, based on limited knowledge. This short chapter addresses these issues. Our aim here is limited, namely, to provide a brief introduction to these large topics and connect them to the ADP theory studied above.

The chapter consists of two sections. Section 9.1 develops the theory of approximate value function iteration, showing that nonexpansive approximation operators preserve contractivity and lead to clean error bounds. It also introduces stochastic approximation, which provides a way to study the convergence properties of iterative algorithms and forms the theoretical foundation for the learning algorithms that follow. Section 9.2 treats simulation-based learning, covering Q-learning, risk-sensitive Q-learning, and policy gradient methods.

9.1Approximation¶

This section discusses approximate dynamic programming. Approximation is needed because (a) models with continuous states and actions cannot be implemented exactly on computers, and (b) many problems are high-dimensional and hence computationally expensive. We focus on approximate value function iteration.

We begin in Section 9.1.1 with a review of common approximation methods, including local and global approximation. In Section 9.1.2, we show that nonexpansive approximation operators preserve contractivity and derive error bounds for the resulting policies. Section 9.1.3 introduces stochastic approximation, a general framework for finding fixed points from noisy evaluations that underpins the learning algorithms of Section 9.2.

9.1.1Approximation Methods¶

Suppose that we want to represent a function $f \colon \Xsf \to \RR$ using an approximation $\hat f$ that can be implemented on a machine using a finite number of parameters. One common approach is to take a normed linear space $V$ containing $f$ and $n$ elements $b_1, \ldots, b_n$ from $V$ . The approximation then takes the form

\hat f(x) = \sum_{i=1}^n \alpha_i b_i(x)

(9.1)

The scalars $\alpha_1, \ldots, \alpha_n$ are chosen so that $\hat f$ is close to $f$ . In this setting, the elements $b_i$ are typically called basis functions.

In practice, one often fixes a sequence of functions $(b_n)_{n \in \NN} \subset V$ such that, with $B_n$ defined as the linear span of $b_1, \ldots, b_n$ for each $n$ , the union $\bigcup_{n \in \NN} B_n$ is dense in $V$ . Next we introduce a sequence of approximation operators $(L_n)_{n \in \NN}$ such that $L_n f$ has the form of $\hat f$ in (9.1). For example, given $f \in V$ , the function $L_n f$ might equal the closest element in $B_n$ to $f$ under the specified norm on $V$ . See, for example, Cheney (2013) or Atkinson & Han (2005).

9.1.1.1Local Approximation¶

One variation on this idea, often called local approximation, proceeds by first choosing a set of grid points $\{x_1, \ldots, x_n\}$ contained in $\Xsf$ and a set of basis functions $\{\kappa_1, \ldots, \kappa_n\}$ defined on the state space $\Xsf$ with the property that

\sum_{i=1}^n \kappa_i(x) = 1 \text{ for all } x \in \Xsf \quad \text{and} \quad \kappa_j \geq 0 \quad \text{for all } j \in \{1, \ldots, n\}.

(9.2)

Elements of a family of basis functions satisfying (9.2) are sometimes called weighting functions or kernels, and an approximation of a function $f$ taking the form

(Lf)(x) = \sum_{i=1}^n f(x_i) \kappa_i(x)

(9.3)

is called a local approximator or a kernel averager. The key idea is that $(Lf)(x)$ , the approximation to $f(x)$ at $x \in \Xsf$ , is a weighted average of the values of $f$ at the grid points. Typically, $\kappa_i(x)$ is relatively large when $x$ is near $x_i$ , so that $f(x_i)$ has a large weight in determining $Lf(x)$ .

Figure 9.1 illustrates the Gaussian kernel averager from Example 9.1.1 in one dimension. The left panel shows the kernel weights $\kappa_i(x)$ for six grid points: each $\kappa_i$ peaks at $x_i$ and decays smoothly, and the weights sum to one at every $x$ . The right panel shows a function $f$ and its kernel-averager approximation $Lf$ ; the approximation agrees with $f$ at the grid points and smoothly blends between them.

Figure 9.1:Gaussian kernel averager: kernels (left) and approximation (right)

Example 9.1.2

Continuous piecewise linear interpolation on a one-dimensional grid can be viewed as a kernel averager. Taking grid points $a = x_1 < x_2 < \cdots < x_n = b$ , define the “hat” basis functions

\kappa_i(x) \coloneq \begin{cases} (x - x_{i-1}) / (x_i - x_{i-1}) & \text{if } x \in [x_{i-1}, x_i] \\ (x_{i+1} - x) / (x_{i+1} - x_i) & \text{if } x \in [x_i, x_{i+1}] \\ 0 & \text{otherwise} \end{cases}

with the obvious modifications at the endpoints. The resulting kernel averager (9.3) is the unique continuous piecewise linear function that agrees with $f$ at all grid points. One easily checks that $\kappa_i \geq 0$ and $\sum_i \kappa_i(x) = 1$ for all $x \in [a, b]$ , so (9.2) holds. These ideas and results extend to continuous piecewise linear interpolation in $\RR^d$ (see, e.g., Stachurski (2008)).

We will exploit the following property in Section 9.1.2 to derive error bounds for approximate value function iteration.

9.1.1.2Global Approximation¶

Another approach involves global approximation methods. Here we drop the partition-of-unity constraint (9.2) from the basis functions and the function values $\{f(x_i)\}$ are replaced with a more generic set of coefficients $\{\theta_j\}_{j=1}^k$ . With the basis denoted by $\{b_1, \ldots, b_k\}$ , the approximations take the form

(G_\theta f)(x) = \sum_{j=1}^k \theta_j b_j(x).

The number of basis elements $k$ is typically smaller than the number of grid points, seeking a parsimonious representation. Coefficients are chosen by fitting. A typical criterion is least squares minimization:

\hat \theta \coloneq \argmin_\theta \sum_{i=1}^n [(G_\theta f)(x_i) - f(x_i)]^2.

If $Z_{ij} = b_j(x_i)$ and $y = (f(x_1), \ldots, f(x_n))^\top$ , then $\hat \theta$ takes the familiar form $(Z^\top Z)^{-1} Z^\top y$ whenever $Z$ has full column rank.

9.1.2Error Bounds¶

In this section, we develop error bounds for approximate value function iteration in the ADP framework of Chapter 3. The key result is that when the approximation operator is nonexpansive, the composition of the Bellman operator with the approximation operator inherits contractivity, and the error in the computed policy can be bounded in terms of the approximation quality.

9.1.2.1Framework¶

Throughout this section, $(V, \TT)$ is an ADP with $V = (V, d, \preceq)$ a partially ordered metric space. An operator $L \colon V \to V$ is called nonexpansive with respect to $d$ when

d(Lv, Lw) \leq d(v, w) \qquad \text{for all } v, w \in V.

(9.4)

We make the following assumption.

Under Assumption 9.1.1, the conditions of Theorem 3.1.5 are satisfied (with $V_0 = V$ , since regularity implies semi-regularity with $V_0 = V$ ). As a result, the fundamental optimality properties hold, the Bellman operator $T$ is a contraction of modulus $\beta$ on $V$ , the value function $\vmax$ is the unique fixed point of $T$ in $V$ , and VFI, OPI, and HPI all converge.

Let $L$ be a self-map on $V$ . We call $L$ the approximation operator and define the approximate Bellman operator via $\hat T \coloneq L \circ T$ .

By Lemma 9.1.2 and the Banach contraction mapping theorem, $\hat T$ has a unique fixed point $\hat v \in V$ , and the iterates $\hat T^n v_0$ converge geometrically to $\hat v$ for any starting point $v_0 \in V$ . The fitted value iteration (FVI) algorithm iterates with $\hat T$ from an initial condition $v_0$ and, upon termination, returns a $v_N$ -greedy policy.

The next two results bound the policy error when the FVI algorithm terminates after $N$ iterations.

9.1.2.2Error Bounds¶

Let $v_0 \in V$ be the initial condition in the FVI routine and define $v_n \coloneq \hat T^n v_0 = (LT)^n v_0$ for each $n$ . Let $e_N \coloneq d(v_N, v_{N-1})$ be the step size at termination. Let $\sigma$ be a $v_N$ -greedy policy, so that $T_\sigma \, v_N = T v_N$ .

The bound (9.5) decomposes the error into two terms. The first term $\beta \, e_N$ reflects the accuracy of the iterative process and decreases geometrically with the number of iterations. The second term $d(LTv_N, Tv_N)$ is the approximation error at the final iterate—how well $L$ represents $Tv_N$ . Both terms are computable by the algorithm.

Proof

Proof of Theorem 9.1.3.

By the triangle inequality,

d(\vmax, v_\sigma) \leq d(\vmax, v_N) + d(v_N, v_\sigma).

(9.6)

We bound each term separately.

Bounding $d(\vmax, v_N)$ . Since $\vmax = T\vmax$ and $T$ is a $\beta$ -contraction,

d(\vmax, v_N) \leq d(T\vmax, Tv_N) + d(Tv_N, v_N) \leq \beta \, d(\vmax, v_N) + d(Tv_N, v_N),

and hence

(1-\beta) \, d(\vmax, v_N) \leq d(Tv_N, v_N).

(9.7)

To bound $d(Tv_N, v_N)$ , we use $v_{N+1} = LTv_N$ :

\begin{aligned} d(Tv_N, v_N) &\leq d(Tv_N, LTv_N) + d(LTv_N, v_N) \\ &= d(Tv_N, LTv_N) + d(LTv_N, LTv_{N-1}) \\ &\leq d(Tv_N, LTv_N) + d(Tv_N, Tv_{N-1}) \\ &\leq d(Tv_N, LTv_N) + \beta \, e_N. \end{aligned}

Here the third line uses nonexpansiveness of $L$ and the fourth uses contractivity of $T$ . Combining with (9.7),

(1-\beta) \, d(\vmax, v_N) \leq \beta \, e_N + d(LTv_N, Tv_N).

(9.8)

Bounding $d(v_N, v_\sigma)$ . Since $\sigma$ is $v_N$ -greedy, $T_\sigma \, v_N = Tv_N$ . Using this and the fact that $T_\sigma$ is a $\beta$ -contraction with fixed point $v_\sigma$ ,

\begin{aligned} d(v_N, v_\sigma) & \leq d(v_N, Tv_N) + d(Tv_N, v_\sigma) \\ & = d(v_N, Tv_N) + d(T_\sigma v_N, T_\sigma v_\sigma) \leq d(v_N, Tv_N) + \beta \, d(v_N, v_\sigma). \end{aligned}

Hence $(1-\beta) \, d(v_N, v_\sigma) \leq d(v_N, Tv_N)$ , and the same bound as above gives

(1-\beta) \, d(v_N, v_\sigma) \leq \beta \, e_N + d(LTv_N, Tv_N).

(9.9)

Combining. Adding (9.8) and (9.9) and using (9.6) yields (9.5). ◻

The next result provides an alternative bound that replaces the approximation error at the final iterate with the approximation error at the true value function $\vmax$ .

The bound (9.10) trades a larger constant $(1-\beta)^{-2}$ for a simpler approximation term $d(L\vmax, \vmax)$ , which depends only on how well the approximation operator $L$ represents the true value function. This is useful for a priori analysis of approximation architectures.

Proof

Proof of Theorem 9.1.4.

We first show that

(1-\beta) \, d(\vmax, v_\sigma) \leq 2 \, d(v_N, \vmax).

(9.11)

To this end, fix $v \in V$ and observe that $d(v, Tv) \leq d(v, \vmax) + d(\vmax, Tv) \leq (1+\beta) d(v, \vmax)$ . Using this at $v = v_N$ and repeating the argument for $d(v_N, v_\sigma)$ from the proof of Theorem 9.1.3, we get $(1-\beta) d(v_N, v_\sigma) \leq (1+\beta) d(v_N, \vmax)$ . Combining with (9.6),

d(\vmax, v_\sigma) \leq d(\vmax, v_N) + \frac{1+\beta}{1-\beta} d(v_N, \vmax) = \frac{2}{1-\beta} d(v_N, \vmax).

This gives (9.11).

Next, let $\hat v$ be the unique fixed point of $\hat T = LT$ in $V$ . By the triangle inequality,

d(\vmax, v_N) \leq d(\vmax, \hat v) + d(\hat v, v_N).

(9.12)

For the first term, using $\vmax = T\vmax$ and $\hat v = LT\hat v$ :

d(\vmax, \hat v) \leq d(\vmax, L\vmax) + d(L\vmax, \hat v) = d(\vmax, L\vmax) + d(LT\vmax, LT\hat v) \leq d(\vmax, L\vmax) + \beta \, d(\vmax, \hat v),

so $(1-\beta) \, d(\vmax, \hat v) \leq d(\vmax, L\vmax)$ . For the second term, using $\hat v = \hat T \hat v$ and $v_{N+1} = \hat T v_N$ :

d(\hat v, v_N) \leq d(\hat T \hat v, \hat T v_N) + d(\hat T v_N, v_N) \leq \beta \, d(\hat v, v_N) + \beta \, e_N,

where $d(\hat T v_N, v_N) = d(v_{N+1}, v_N) \leq \beta \, e_N$ by contractivity of $\hat T$ . Hence $(1-\beta) \, d(\hat v, v_N) \leq \beta \, e_N$ . Combining with (9.12):

(1-\beta) \, d(\vmax, v_N) \leq d(\vmax, L\vmax) + \beta \, e_N.

Substituting into (9.11) yields (9.10). ◻

9.1.2.3Order-Preserving Approximations¶

One attractive special case arises when the approximation operator is not only nonexpansive but also order preserving. In the next result we set $\hat{\TT} \coloneq \setntn{L \circ T_\sigma}{\sigma \in \Sigma}$ .

In this setting, the approximate model $(L(V), \hat{\TT})$ inherits the structure of an ADP, and the optimality theory from Chapter 2 and Chapter 3 applies directly to it. The error bounds in Theorem 9.1.3 and Theorem 9.1.4 quantify the gap between the optimal policies of the original and approximate models.

9.1.3Stochastic Approximation¶

In the previous section, we discussed methods for controlling error when iterating with an approximate Bellman operator $\hat T = L \circ T$ . In that setting, the operator $T$ is known. In many applications, however, $T$ itself cannot be evaluated exactly because it involves an expectation over an unknown or intractable distribution. Instead, we can only observe noisy evaluations of $T$ . Stochastic approximation provides a framework for finding fixed points in this setting. Here we provide a fast introduction.

Stochastic approximation will drive the convergence theory of the learning algorithms we present below in Section 9.2.

9.1.3.1Fixed Point Iteration and Damping¶

Stochastic approximation can be thought of as a general technique for solving fixed point problems or root finding problems when information is limited. Here we focus on the fixed point case. In our analysis, $T \colon \Theta \to \Theta$ will be a contraction of modulus $\beta$ on a closed subset $\Theta$ of $\RR^n$ . By Banach’s fixed point theorem, $T$ has a unique fixed point $\bar \theta$ and $T^k \theta \to \bar \theta$ for any $\theta \in \Theta$ .

To start thinking about stochastic approximation, we first recall damped iteration, which fixes $\alpha \in (0,1)$ and updates via

\theta_{k+1} = \theta_k + \alpha (T \theta_k - \theta_k).

(9.13)

This amounts to iterating with the map $F\theta \coloneq \theta + \alpha (T\theta - \theta)$ . Damped iteration is an alternative to regular fixed point iteration that has advantages in some cases.

Proof

First, $F \bar \theta = \bar \theta + \alpha(T \bar \theta - \bar \theta) = \bar \theta$ . For contractivity,

\begin{aligned} \| F\theta - F\theta' \| & = \| (1-\alpha)(\theta - \theta') + \alpha(T\theta - T\theta') \| \\ & \leq (1-\alpha) \| \theta - \theta' \| + \alpha \| T\theta - T\theta' \| \leq (1-\alpha + \alpha \beta) \| \theta - \theta' \|. \end{aligned}

Since $\beta < 1$ , we have $1-\alpha + \alpha\beta < 1$ . ◻

One setting where damped iteration can converge faster than standard iteration is when the iterates oscillate. Figure 9.2 and Figure 9.4 illustrate this case with the linear map $T\theta = A\theta$ , where

A = \begin{pmatrix} 0.5 & 0.6 \\ 0.4 & -0.7 \end{pmatrix}.

The fixed point is the origin. Standard iteration produces a spiraling trajectory, while damped iteration (with $\alpha = 0.7$ ) converges more smoothly.

Damped iteration (\alpha = 0.7) — Figure 9.3:Damped iteration ( $\alpha = 0.7$ )

Figure 9.4:Convergence comparison: norm of iterates

9.1.3.2The Robbins–Monro Algorithm¶

Now suppose that $T$ is a contraction on $\Theta$ with unique fixed point $\bar \theta$ , but we can only evaluate $T$ with noise. That is, given input $\theta$ , we observe $T\theta + W$ , where $W$ is a zero-mean random disturbance. We cannot observe $T\theta$ or $W$ separately, only their sum. The Robbins--Monro algorithm Robbins & Monro, 1951 generalizes damped iteration to this noisy setting. It tells us to fix an initial $\theta_0 \in \Theta$ and update via

\theta_{k+1} = \theta_k + \alpha_k (T \theta_k + W_{k+1} - \theta_k),

(9.14)

where the learning rate $(\alpha_k) \subset (0,1)$ satisfies $\alpha_k \to 0$ .

Compare with damped iteration in (9.13): the only differences are (a) the learning rate decreases over time, and (b) the noise term $W_{k+1}$ enters the update. The decreasing learning rate ensures that, asymptotically, the noise is averaged away while the signal from $T$ accumulates.

9.1.3.3Convergence¶

The following well-known result, due to Tsitsiklis (1994), provides conditions under which the Robbins–Monro iterates converge to $\bar \theta$ . In the statement, $(\fF_k)_{k \geq 0}$ is the filtration generated by $\theta_0, W_1, \ldots, W_k$ .

The restrictions on $(\alpha_k)$ in (iii) are called the Robbins–Monro conditions. The restriction $\sum \alpha_k = \infty$ ensures that the learning rates do not decay too fast, so the algorithm can reach $\bar \theta$ from any starting point. The condition $\sum \alpha_k^2 < \infty$ ensures that the accumulated noise variance is finite. A standard choice is $\alpha_k = (k+1)^{-\gamma}$ for $\gamma \in (1/2, 1)$ .

9.1.3.4Example: Asset Pricing¶

To illustrate stochastic approximation for a fixed point problem, we consider a basic asset pricing model. The model is deliberately simple, so that the exact solution can be computed by linear algebra (allowing us to verify numerically that stochastic approximation converges to the correct answer). In the model, the price of the asset satisfies

V_t = \beta \, \EE [V_{t+1} + D_{t+1} \mid \fF_t],

where $\beta \in (0,1)$ is a discount factor and $D_t$ is the dividend at time $t$ . (See Section 6.3 of Sargent & Stachurski (2025) for background on the model.) Assume that $(X_t)$ is $P$ -Markov on a finite set $\Xsf$ with transition matrix $P$ , and $D_{t+1} = d(X_{t+1})$ . The equilibrium price function $v$ is the unique fixed point in $\RR^\Xsf$ of the map $(Tv)(x) = \beta \sum_{x'} [v(x') + d(x')] P(x, x')$ .

Setting $K \coloneq \beta P$ , the unique fixed point of $T$ is $v^* = (I - K)^{-1} Kd$ .

We can also compute $v^*$ using stochastic approximation. To do this we use the random map $v \mapsto \hat Tv$ given by the algorithm below

We understand $\hat T$ as an operator that maps an element $v$ of $\RR^\Xsf$ into a random vector $\hat Tv$ in $\RR^\Xsf$ . Note that

\EE \, [(\hat Tv)(x)] = \sum_{x'} \beta [v(x') + d(x')] P(x, x') = (Tv)(x).

Now we iterate via

v_{k+1} = v_k + \alpha_k (\hat T v_k - v_k)

(9.15)

which parallels the damped iteration in (9.13). With $W_{k+1} \coloneq \hat Tv_k - Tv_k$ we have $\EE [W_{k+1} \mid \fF_k] = 0$ and

v_{k+1} = v_k + \alpha_k (Tv_k + (\hat T v_k - Tv_k) - v_k) = v_k + \alpha_k (Tv_k + W_{k+1} - v_k),

so updating obeys the Robbins–Monro rule (9.14). The operator $T$ is order preserving since $v \leq w$ implies $Tv \leq Tw$ (the sum involves nonnegative weights). Conditions (i) and (iii) of Theorem 9.1.8 hold by construction.^[1] Thus Theorem 9.1.8 applies, and $(v_k)$ converges to $v^*$ with probability one.

Figure 9.5 compares the stochastic approximation solution with the exact linear algebra solution $v^* = (I - K)^{-1} Kd$ for a specification with $|\Xsf| = 25$ , $\beta = 0.96$ , $d(x) = \me^x$ , and $(X_t)$ discretized from an AR(1) process with $\rho = 0.9$ and $\sigma = 0.02$ . The learning rate is $\alpha_k = (k+1)^{-0.55}$ . After 500 iterations (left panel), the SA estimate is uniformly below $v^*$ but has already captured the shape of the price function. After $50{,}000$ iterations (right panel), the SA solution closely matches the exact value.

Figure 9.5:Batch stochastic approximation vs. linear algebra for asset pricing: partial convergence (left) and near-full convergence (right)

9.1.3.5A Sequential Version¶

An alternative way to apply stochastic approximation is to simulate a single trajectory $(X_t)_{t \geq 0}$ of the $P$ -Markov chain and update $v$ sequentially. In particular, at each step $t$ , only the entry $v(X_t)$ is modified:

v_{t+1}(X_t) = v_t(X_t) + \alpha_t \big[ (\hat T_s v_t)(X_t) - v_t(X_t) \big],

(9.16)

where

(\hat T_s v_t)(X_t) \coloneq \beta [v_t(X_{t+1}) + d(X_{t+1})] .

For states $y \neq X_t$ we hold the estimate steady, setting $v_{t+1}(y) = v_t(y)$ . Note that $\hat T_s$ is a single-sample version of our previous version $\hat T$ from Section 9.1.3.4. The asynchronous extension of Theorem 9.1.8 noted in Remark 9.1.3 shows that the iterates $v_t$ converge to $v^*$ with probability one, provided every state is visited infinitely often.

The advantage of the sequential version will be clearer when we study Q-learning in Section 9.2.1. In real-world Q-learning applications, sequences of observations are naturally generated by the environment and the agent’s interactions with that environment. Optimal policies can be computed from these sequences without knowing or even estimating underlying dynamics.

Figure 9.6 illustrates the sequential version on the same asset pricing model from Section 9.1.3.4, using per-state learning rates $\alpha_t = n_t(X_t)^{-0.55}$ , where $n_t(x)$ counts the number of visits to state $x$ by time $t$ . The left panel shows the SA estimate after 10⁴ updates. Convergence is fastest near the center of the state space, where the stationary distribution of the chain concentrates most of its mass and states are visited most frequently. At the tails, fewer visits lead to slower learning. After $5 \times 10^7$ updates (right panel), the SA solution closely matches the exact value.

Figure 9.6:Sequential stochastic approximation vs. linear algebra for asset pricing: partial convergence (left) and near-full convergence (right)

9.2Simulation and Learning¶

In most of this book we have worked in settings where the underlying model is known. In many applications, the model is unknown and the agent must learn from data generated by interacting with the environment. This section covers two approaches to learning: Q-learning (value-based) and policy gradient methods (policy-based). We emphasize that this section is intended only as a very brief introduction to a vast literature.

9.2.1Q-Learning¶

Q-learning is a model-free algorithm that learns optimal policies from data, without knowledge of transition probabilities or reward functions. In this section we connect Q-learning to the Q-factor ADP framework developed in Section 3.2.1.2.

9.2.1.1Q-Factors and Model-Free Learning¶

Recall from Section 3.2.1.2 that, for a finite MDP $(\Gamma, r, \beta, P)$ with state space $\Xsf$ and action space $\Asf$ , the Q-factor Bellman equation takes the form

q(x, a) = r(x, a) + \beta \sum_{x'} \max_{a' \in \Gamma(x')} q(x', a') \; P(x, a, x') \qquad ((x,a) \in \Gsf).

(9.17)

As shown in Proposition 5.3.1,

the unique solution $\qmax$ to (9.17) obeys $\vmax(x) = \max_{a \in \Gamma(x)} \qmax(x, a)$ , where $\vmax$ is the value function, and
$\sigma$ is an optimal policy whenever $\sigma(x) = \argmax_{a \in \Gamma(x)} \qmax(x, a)$ for all $x$ .

Item (ii) tells us that the optimal policy can be calculated from $\qmax$ without knowledge of the transition kernel $P$ . This observation opens the door to learning: if we can approximate $\qmax$ from samples, we can compute an approximately optimal policy without knowing the model.

9.2.1.2The Q-Learning Algorithm¶

Q-learning approximates $\qmax$ by stochastic approximation of the Q-factor Bellman operator. At each step $t$ , the agent is in state $X_t = x$ , takes action $A_t = a$ , observes reward $R_{t+1} = r(x, a)$ and next state $X_{t+1} \sim P(x, a, \cdot)$ , and updates the Q-function (also called the Q-table) via

q_{t+1}(x, a) = (1 - \alpha_t) \, q_t(x, a) + \alpha_t \left( R_{t+1} + \beta \max_{a' \in \Gamma(X_{t+1})} q_t(X_{t+1}, a') \right).

(9.18)

Here $\alpha_t \in (0,1)$ is a learning rate. The term in parentheses is a single-sample estimate of the right-hand side of (9.17). The update blends this fresh sample with the current estimate. The agent needs to observe only the tuple $(X_t, A_t, R_{t+1}, X_{t+1})$ at each step.

Actions are selected by a behavior policy that balances exploration (visiting new state-action pairs) with exploitation (using the current estimate of $\qmax$ ). A common choice is the $\epsilon$ -greedy strategy: with probability $\epsilon$ , choose a random feasible action; otherwise choose $\argmax_{a \in \Gamma(x)} q_t(x, a)$ . Because the convergence target $\qmax$ does not depend on how actions are selected—only on the Bellman equation (9.17)—Q-learning is called an off-policy method.

9.2.1.3Convergence¶

Convergence of Q-learning follows from the stochastic approximation theory developed in Section 9.1.3. To see this, let $S$ be the Q-factor Bellman operator defined in (3.5), so that $(Sq)(x, a) = r(x, a) + \beta \sum_{x'} \max_{a'} q(x', a') P(x, a, x')$ . By Exercise 3.2.5, each policy operator $S_\sigma$ is a contraction of modulus $\beta$ on $(\RR^\Gsf, \| \cdot \|_\infty)$ . Since the supremum norm is sup-nonexpansive, Theorem 3.1.5 gives that $S$ is also a contraction of modulus $\beta$ with unique fixed point $\qmax$ . At each step $t$ , the agent observes $(X_t, A_t, R_{t+1}, X_{t+1})$ and computes the sample

(\hat S q_t)(X_t, A_t) \coloneq R_{t+1} + \beta \max_{a' \in \Gamma(X_{t+1})} q_t(X_{t+1}, a').

With this notation, the Q-learning update (9.18) becomes

q_{t+1}(X_t, A_t) = q_t(X_t, A_t) + \alpha_t \big[ (\hat S q_t)(X_t, A_t) - q_t(X_t, A_t) \big],

which parallels (9.16). Since the fixed point of $S$ is $\qmax$ , applying the asynchronous version of Theorem 9.1.8 noted in Remark 9.1.3 leads to the following classical result Watkins & Dayan, 1992Tsitsiklis, 1994:

Theorem 9.2.1 is often referred to as a “tabular” result: it assumes that the Q-table stores a separate value for every $(x, a)$ pair. An alternative approach, which scales to high dimensions, is to combine Q-learning with one of the function approximation schemes from Section 9.1. When the scheme in question is neural nets we get deep Q-learning, which has been highly successful in certain environments. See Section 9.3 for more discussion.

9.2.1.4Example: Inventory Management¶

To illustrate Q-learning, we consider an inventory management problem where a firm manages stock of a single product. Inventory $(X_t)_{t \geq 0}$ evolves according to

X_{t+1} = \max(X_t - D_{t+1}, 0) + A_t,

where $D_t$ is IID demand and $A_t \in \{0, \ldots, K - X_t\}$ is the order quantity chosen by the firm. The state space is $\Xsf = \{0, \ldots, K\}$ . Per-period profits are

\pi(X_t, A_t, D_{t+1}) = p \min(X_t, D_{t+1}) - c \, A_t - \kappa \1\{A_t > 0\},

where $p$ is the selling price, $c$ is the unit ordering cost, and $\kappa$ is a fixed cost of placing an order. The firm seeks to maximize expected discounted profits $\EE \sum_{t \geq 0} \beta^t \pi(X_t, A_t, D_{t+1})$ .

Now suppose that a manager does not know the demand distribution, the cost parameters, or the transition dynamics. The manager can still learn the optimal ordering policy by Q-learning. At each period, the manager observes the current inventory $X_t$ , places an order $A_t$ , records the resulting profit, and observes the next-period inventory $X_{t+1}$ . These four quantities suffice for the update (9.18).

Figure 9.7 compares the Q-learning solution with the exact VFI solution for this problem, with $K = 20$ , selling price $p = 1$ , geometric demand with parameter $\theta = 0.7$ , $c = 0.2$ , $\kappa = 0.8$ , and $\beta = 0.98$ . After 20 million steps of interaction, the learned value function and policy closely match those obtained by exact value function iteration with full model knowledge.

Figure 9.7:Value function and policy: VFI (exact) vs. Q-learning (model-free)

The 20 million steps required in this small example illustrate a general feature of model-free methods: because they do not exploit knowledge of transition probabilities or reward functions, they must extract all structural information from observed data, which can require very long trajectories. In practice, the cost of exploration is compounded by the fact that the agent typically needs to make reasonable decisions at each time step, limiting the extent to which it can explore freely. These challenges motivate both deep Q-learning (which uses function approximation to generalize across states) and the policy gradient methods discussed in Section 9.2.3 (which search directly in policy space).

9.2.2Risk-Sensitive Q-Learning¶

The Q-learning framework can be extended to accommodate risk-sensitive preferences. As we will see, the extension provides a concrete illustration of the order-reversing FDP theory developed in Section 5.2.3.

9.2.2.1Risk-Sensitive Bellman Equation¶

We can construct a risk-sensitive version of the inventory problem Bellman equation by replacing the expectation with a certainty equivalent (see, e.g., Section 7.1.1.8). Using the entropic certainty equivalent with parameter $\gamma > 0$ , the Bellman equation becomes

v(x) = \max_{a \in \Gamma(x)} \left( -\frac{1}{\gamma} \right) \ln \left\{ \sum_{x'} \exp \left[ -\gamma \left( r(x, a) + \beta \, v(x') \right) \right] P(x, a, x') \right\}.

(9.19)

Larger $\gamma$ implies stronger aversion to downside risk.

9.2.2.2The Order-Reversing FDP Structure¶

Our aim is to derive a Q-factor Bellman equation for the risk-sensitive problem, analogous to (9.17), and then build a Q-learning algorithm from it. One challenge is how to shift expectation to the outside of the maximization step, as in the standard Q-factor equation (see (9.17)). We will use the order-reversing FDP framework of Section 5.2.3 to factor the Bellman equation into a suitable form, enabling single-sample updates.

We set up the FDP $(V, F, \hat V, \GG)$ as follows. The primary value space is $V \coloneq \RR^\Xsf$ and the subordinate value space is $\hat V \coloneq \RR_{++}^\Gsf$ (strictly positive functions on $\Gsf$ ), each with the pointwise partial order. The connecting map $F \colon V \to \hat V$ is defined by

(Fv)(x, a) \coloneq \sum_{x'} \exp\!\left[ -\gamma \left( r(x, a) + \beta \, v(x') \right) \right] P(x, a, x').

(9.20)

Noting that the image of $F$ lies in $\RR_{++}^\Gsf$ (since the exponential function is strictly positive), we define the family $\GG \coloneq \{G_\sigma\}_{\sigma \in \Sigma}$ of maps from $\hat V$ to $V$ by

(G_\sigma \, q)(x) \coloneq -\frac{1}{\gamma} \ln\!\big(q(x, \sigma(x))\big).

(9.21)

The map $G_\sigma$ evaluates the Q-function at the action prescribed by $\sigma$ and applies $\psi^{-1}(y) = -(1/\gamma)\ln(y)$ to convert the positive Q-value back to the value scale.

Since $\Gamma(x)$ is finite, the set $\{G_\sigma \, q\}_{\sigma \in \Sigma}$ has a greatest element for every $q \in \hat V$ . Thus, $(V, F, \hat V, \GG)$ is an order-reversing FDP.

9.2.2.3Primary and Subordinate Bellman Equations¶

We have chosen $(V, F, \hat V, \GG)$ so that the primary ADP for this FDP has the Bellman equation (9.19). To see this, recall that, in the FDP setting, $\Gmax$ is defined by $\Gmax q \coloneq \bigvee_\sigma G_\sigma \, q$ . Here this gives $(\Gmax q)(x) = -(1/\gamma)\ln(\min_{a \in \Gamma(x)} q(x, a))$ , since $\psi^{-1}$ is decreasing. The primary policy operators $T_\sigma \coloneq G_\sigma \circ F$ give

(T_\sigma \, v)(x) = -\frac{1}{\gamma} \ln\!\left( \sum_{x'} P(x, \sigma(x), x') \exp\!\left[-\gamma(r(x, \sigma(x)) + \beta \, v(x'))\right] \right).

With $\tmax = \Gmax \circ F$ (by Lemma 5.2.15), we recover (9.19).

For the subordinate ADP, by Lemma 5.2.16, the Bellman min-operator is $\htmin = F \circ \Gmax$ . Computing explicitly:

\begin{aligned} (\htmin q)(x, a) &= (F(\Gmax q))(x, a) \\ &= \sum_{x'} P(x, a, x') \exp\!\left[-\gamma\!\left(r(x, a) + \beta \cdot \left(-\tfrac{1}{\gamma}\right) \ln\!\big(\min_{a'} q(x', a')\big)\right)\right] \\ &= \sum_{x'} P(x, a, x') \left[\exp(-\gamma \, r(x, a)) \cdot \left(\min_{a' \in \Gamma(x')} q(x', a')\right)^\beta \right]. \end{aligned}

Thus the subordinate Bellman equation $q = \htmin q$ takes the form

q(x, a) = \sum_{x'} P(x, a, x') \left[ \exp(-\gamma \, r(x, a)) \cdot \left(\min_{a' \in \Gamma(x')} q(x', a')\right)^\beta \right].

(9.22)

This is the risk-sensitive Q-factor Bellman equation—a fixed point equation in $q$ alone, and the min-analogue of the standard Q-factor Bellman equation (9.17).

Notice that the expectation (summation over $x'$ ) appears on the outside of (9.22) because it originates from the map $F$ in the FDP factorization. The maps $G_\sigma$ act pointwise and do not involve the transition kernel $P$ . This separation makes Q-learning possible: we can replace the expectation with a single-sample observation, just as in the standard case.

Our original goal is to find an optimal policy for the primary risk-sensitive MDP. The Q-learning algorithm described below operates on the subordinate Q-factor problem, converging to its min-value function $\qmin$ . The key question is whether $\qmin$ suffices to recover an optimal policy for the primary problem. Theorem 5.2.18 gives an affirmative answer: any policy $\sigma$ satisfying $G_\sigma \, \qmin = \Gmax \, \qmin$ is optimal for the primary problem $(V, \TT)$ . Since $(\Gmax q)(x) = -(1/\gamma)\ln(\min_a q(x,a))$ , this condition reduces to $\sigopt(x) = \argmin_{a \in \Gamma(x)} \qmin(x, a)$ . In other words, once Q-learning has converged, we read off the optimal policy by minimizing the learned Q-function at each state.

Solution to Exercise 9.2.2

Using the expressions derived above,

(\hat T_\sigma q)(x, a) = \exp(-\gamma \, r(x, a)) \sum_{x'} P(x, a, x') \, q(x', \sigma(x'))^\beta.

Suppose $d(q, q') \leq \delta$ , so that $\me^{-\delta} \leq q(x, a) / q'(x, a) \leq \me^{\delta}$ for all $(x, a) \in \Gsf$ . Then $q(x', \sigma(x'))^\beta$ lies between $\me^{-\beta \delta}$ and $\me^{\beta \delta}$ times $q'(x', \sigma(x'))^\beta$ . Since $P(x, a, \cdot)$ has nonnegative weights summing to one,

\me^{-\beta\delta} \, (\hat T_\sigma q')(x, a) \leq (\hat T_\sigma q)(x, a) \leq \me^{\beta\delta} \, (\hat T_\sigma q')(x, a),

where the $\exp(-\gamma \, r(x, a))$ factor cancels in the ratio. Taking logarithms and the supremum over $(x, a)$ gives $d(\hat T_\sigma q, \hat T_\sigma q') \leq \beta \, \delta = \beta \, d(q, q')$ .

9.2.2.4Optimal Behavior Under Risk Sensitivity¶

Before turning to Q-learning, we illustrate how risk sensitivity affects the optimal policy and the induced dynamics in the inventory model of Section 9.2.1. We apply value function iteration to the risk-sensitive Bellman equation (9.19) for three levels of risk sensitivity, $\gamma \in \{0.01, 1.0, 2.0\}$ , holding all other parameters at the values used for Figure 9.7.

Figure 9.8 shows the resulting optimal order quantities. A more risk-sensitive firm (larger $\gamma$ ) orders less stock, accepting lower expected sales in exchange for more predictable profits. Figure 9.9 confirms the effect in simulation: inventory paths under the optimal policy stabilize at lower levels as $\gamma$ increases.

Optimal policy as a function of inventory, for \gamma \in \{0.01, 1.0, 2.0\} — Figure 9.8:Optimal policy as a function of inventory, for $\gamma \in \{0.01, 1.0, 2.0\}$

Inventory dynamics under the optimal policy at \gamma \in \{0.01, 1.0, 2.0\} — Figure 9.9:Inventory dynamics under the optimal policy at $\gamma \in \{0.01, 1.0, 2.0\}$

9.2.2.5The Risk-Sensitive Q-Learning Update¶

The corresponding Q-learning update replaces the expectation in (9.22) with a single sample:

q_{t+1}(x, a) = (1 - \alpha_t) \, q_t(x, a) + \alpha_t \left[ \exp(-\gamma \, R_{t+1}) \cdot \left(\min_{a' \in \Gamma(X_{t+1})} q_t(X_{t+1}, a')\right)^\beta \right].

(9.23)

Note several differences from the standard case. First, the optimal policy minimizes rather than maximizes $q$ . Second, the reward enters through $\exp(-\gamma R_{t+1})$ rather than additively. Third, the continuation value enters as a power $(\min_{a'} q_t)^\beta$ rather than a scaled sum $\beta \cdot \max_{a'} q_t$ . Fourth, to make the calculations, the agent now needs to know $\gamma$ as well as $\beta$ .

Figure 9.10 shows the value function and policy produced by the risk-sensitive Q-learning update (9.23) after $2 \times 10^7$ steps on the inventory model at $\gamma = 1.0$ , alongside the exact VFI solution. Q-learning recovers the optimal policy and value function despite the open theoretical question above; see the QuantEcon lectures for implementation details and code.

Risk-sensitive inventory at \gamma = 1.0: VFI vs. Q-learning after 2 \times 10^7 steps — Figure 9.10:Risk-sensitive inventory at $\gamma = 1.0$ : VFI vs. Q-learning after $2 \times 10^7$ steps

9.2.3Policy-Based Methods¶

The methods discussed so far are value-based: they learn a value function (or Q-function) and extract a policy from it. An alternative approach is to parameterize the policy directly and optimize over the parameter space. Such policy gradient methods are widely used in modern reinforcement learning.

9.2.3.1The Policy Gradient Approach¶

Rather than searching over all feasible policies, we restrict attention to a parameterized family $\{\sigma(\cdot, \theta) : \theta \in \Theta\}$ , where $\theta \in \Theta \subset \RR^m$ is a parameter vector. Given an initial condition $x_0$ , define

M(\theta) \coloneq v_{\sigma(\cdot, \theta)}(x_0),

the lifetime value of the policy $\sigma(\cdot, \theta)$ starting from $x_0$ . The policy gradient method maximizes $M(\theta)$ over $\theta$ by gradient ascent:

\theta_{n+1} = \theta_n + \lambda_n \nabla_\theta \hat M(\theta_n),

(9.24)

where $\lambda_n > 0$ is a step size and $\hat M$ is a Monte Carlo approximation to $M$ described below.

To illustrate, consider the optimal savings problem from Section 1.3: a household chooses a consumption policy to maximize $\EE \sum_{t \geq 0} \beta^t u(c_t)$ , where $u$ is a smooth utility function, subject to

a_{t+1} = R(a_t - c_t) + Y_{t+1}, \qquad c_t \in [0, a_t],

(9.25)

with IID income shocks $(Y_t)$ and gross interest rate $R > 0$ . A policy $\sigma(a, \theta)$ maps current assets $a$ to consumption $c$ , parameterized by $\theta$ (for example, the weights of a neural network). The state dynamics are a smooth function of the policy: given a realization of the shock sequence, $a_{t+1}$ is differentiable in $\theta$ through $c_t = \sigma(a_t, \theta)$ .

9.2.3.2Monte Carlo Approximation¶

In general, $M(\theta)$ cannot be evaluated exactly because it involves an expectation over the shock sequence. Instead, we form a Monte Carlo approximation by simulating $N$ independent paths of length $T$ under the policy $\sigma(\cdot, \theta)$ . For each path $i = 1, \ldots, N$ , set $a^i_0 = a_0$ and generate

c^i_t = \sigma(a^i_t, \theta), \qquad a^i_{t+1} = R(a^i_t - c^i_t) + Y^i_{t+1}, \qquad (t = 0, \ldots, T-1),

where $(Y^i_t)$ are IID draws. The Monte Carlo estimate is

\hat M(\theta) = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \beta^t u(\sigma(a^i_t, \theta)).

(9.26)

The key observation is that $\hat M(\theta)$ is differentiable in $\theta$ whenever $u$ and $\sigma(\cdot, \theta)$ are differentiable, because the state dynamics (9.25) are a smooth function of the action. (The shocks $Y^i_{t+1}$ do not depend on $\theta$ , so they pass through the differentiation.) Each simulated path defines a computational graph from $\theta$ through the policy evaluations and state transitions to the cumulative payoff. Modern automatic differentiation libraries can differentiate through these graphs efficiently, providing $\nabla_\theta \hat M(\theta)$ at cost comparable to evaluating $\hat M(\theta)$ itself. The gradient estimate is substituted into the ascent step (9.24).

A common and effective choice for the policy parameterization is a neural network mapping states to actions, with $\theta$ collecting all the weights and biases. The universal approximation theorem Hornik et al., 1989 guarantees that sufficiently expressive networks can approximate any continuous policy. The main advantage of policy gradient methods is that they handle continuous state and action spaces naturally and scale to high-dimensional problems. Discussion of related literature can be found in Section 9.3.

Figure 9.11 applies the policy gradient method to the household problem (9.25) with CRRA utility ( $\gamma = 1.5$ ), $\beta = 0.96$ , $R = 1.01$ , and IID lognormal income. The policy is parameterized by a neural network with three hidden layers of six units each, trained for 400 epochs using 100 paths of length 200. The learned consumption policy closely matches the exact solution obtained by the endogenous grid method (EGM; see Section 8.3.3.5). For this low-dimensional problem, EGM is far more efficient; we use it here as a trusted benchmark to verify the policy gradient solution. The advantage of policy gradient methods emerges in high-dimensional settings where traditional grid-based methods are infeasible.

Figure 9.11:Consumption policy: EGM (exact) vs. policy gradient (neural network)

Interestingly, the policy gradient method recovers the globally optimal policy despite the fact that we are optimizing lifetime value at a single initial condition $a_0$ . This raises a natural question: when does maximizing at one initial condition generate an optimal policy — that is, a policy that is optimal across all states? We address this question in the next section.

9.2.3.3From Local to Global Optimality¶

In general, $v_\sigma(x_0) = \vmax(x_0)$ does not imply $v_\sigma = \vmax$ . But perhaps implication does hold in some settings? Stachurski et al. (2024) analyze this question for MDPs on general state spaces. Here we present a simplified version for finite MDPs that conveys the core idea. In stating the main result, we recall that a stochastic matrix $P_\sigma$ is called irreducible if, for every pair of states $x, y \in \Xsf$ , there exists an $m \in \NN$ with $P_\sigma^m(x, y) > 0$ .

Proof

That (ii) implies (i) is immediate. For the converse, suppose $v_\sigma(x) = \vmax(x)$ for some $x \in \Xsf$ . Define $h \coloneq \vmax - v_\sigma$ . By definition of $\vmax$ , we have $h \geq 0$ . We claim that $h = 0$ . Since $v_\sigma$ is the fixed point of $T_\sigma$ and $\vmax = T\vmax \geq T_\sigma \vmax$ , we have

\vmax - v_\sigma \geq T_\sigma \vmax - T_\sigma v_\sigma = \beta P_\sigma (\vmax - v_\sigma),

that is, $h \geq \beta P_\sigma h$ . Iterating gives $h \geq \beta^n P_\sigma^n h$ for all $n \in \NN$ . Evaluating at $x$ ,

0 \leq \beta^n (P_\sigma^n h)(x) \leq h(x) = 0 \qquad \text{for all } n \in \NN,

(9.27)

where the first inequality uses $h \geq 0$ and $P_\sigma^n h \geq 0$ , and the second uses $h \geq \beta^n P_\sigma^n h$ . Hence $(P_\sigma^n h)(x) = 0$ for all $n \geq 1$ .

Now suppose, for the sake of contradiction, that $h(y) > 0$ for some $y \in \Xsf$ . By irreducibility, there exists an $m$ with $P_\sigma^m(x, y) > 0$ . Then

(P_\sigma^m h)(x) = \sum_{y'} h(y') P_\sigma^m(x, y') \geq h(y) P_\sigma^m(x, y) > 0,

contradicting $(P_\sigma^m h)(x) = 0$ . Hence $h = 0$ , proving (ii). ◻

Theorem 9.2.2 tells us that policy gradient methods—which optimize at a single initial condition—produce globally optimal policies provided the candidate policy induces irreducible dynamics. Irreducibility ensures that every state is “reachable” from $x_0$ , so optimality propagates throughout the state space.

This helps explain the global convergence observed in Figure 9.11 for the household problem (9.25). The income shocks $(Y_t)$ are lognormal with unbounded support, so they spread the state across a wide range of asset levels regardless of the consumption policy. This provides a form of irreducibility: every region of the state space is visited with positive probability under any feasible policy, allowing optimality at the initial condition $a_0$ to propagate globally.

9.3Chapter Notes¶

The curse of dimensionality and the role of approximation in dynamic programming are discussed at length in Powell (2007) and Rust (1997). Theorem 9.1.3 and Theorem 9.1.4 generalize the error bounds for fitted value iteration in Stachurski (2008) from concrete MDPs on Euclidean state spaces to the abstract ADP setting. Li et al. (2026) provide performance guarantees for approximate dynamic programming in the form of ratio bounds. Lee (2026) studies convergence of Q-factor value function iteration.

The use of neural networks for approximate dynamic programming has become popular in recent years. The classic reference is Bertsekas & Tsitsiklis (1996). More recent examples include Maliar et al. (2021), Kase et al. (2022), Han et al. (2026), Pascal (2024), Kamber et al. (2025), Ashwin et al. (2024), and Oguz & Bray (2026). See Scheidegger (2026) for an overview of the field.

The stochastic approximation framework of Section 9.1.3 originates with Robbins & Monro (1951). The convergence result in Theorem 9.1.8 is based on Tsitsiklis (1994), who proves convergence in a more general asynchronous setting. Sutton & Barto (2018) give a comprehensive textbook treatment of stochastic approximation and temporal difference methods in the reinforcement learning context. Szepesvári (2010) provides a concise and mathematically rigorous introduction to reinforcement learning algorithms, including detailed convergence analysis for both value-based and policy-based methods. Agarwal et al. (2021) give a modern theoretical treatment of reinforcement learning with an emphasis on sample complexity and regret bounds.

Q-learning was introduced by Watkins & Dayan (1992), with convergence established by Tsitsiklis (1994). For a more recent discussion of convergence, see Regehr & Ayoub (2021). Deep Q-learning was mentioned in Section 9.2.1. Bertsekas & Tsitsiklis (1996) laid the theoretical foundations for combining neural network function approximation with dynamic programming. The potential of this approach was demonstrated by Mnih et al. (2015), who showed that a deep Q-network agent could reach human-level performance across a range of Atari games without game-specific engineering. Yang et al. (2020) provide a theoretical analysis of deep Q-learning. Wang et al. (2024) provide a recent survey of the deep reinforcement learning literature.

The risk-sensitive Q-learning formulation of Section 9.2.2 connects to the broader literature on risk-sensitive Markov decision processes surveyed in Bäuerle & Jaśkiewicz (2024). Littman & Szepesvari (1996) is a valuable early contribution to the theoretical foundations of Q-learning in potentially nonstandard environments. The use of deep learning for solving dynamic programming problems in economics is explored by Azinovic et al. (2022).

Policy gradient methods, discussed briefly in Section 9.2.3, are treated in depth by Sutton & Barto (2018), Szepesvári (2010), and Agarwal et al. (2021). The discussion in Section 9.2.3.3 is a simplified version of the result in Stachurski et al. (2024), which treats MDPs on general state spaces under an appropriate generalization of irreducibility.

Footnotes¶

For condition (ii), note that $W_{k+1}(x) = (\hat Tv_k)(x) - (Tv_k)(x) = \beta [v_k(X') + d(X')] - (Tv_k)(x)$ , so $|W_{k+1}(x)| \leq 2\beta(\|v_k\|_\infty + \|d\|_\infty)$ . Hence $\EE[\|W_{k+1}\|^2 \mid \fF_k] \leq 4 |\Xsf| \beta^2 (\|v_k\|_\infty + \|d\|_\infty)^2 \leq C(1 + \|v_k\|^2)$ for a constant $C$ depending only on $\beta$ , $|\Xsf|$ , and $\|d\|_\infty$ .
↩

References¶

Cheney, W. (2013). Analysis for applied mathematics (Vol. 208). Springer Science & Business Media.
Atkinson, K., & Han, W. (2005). Theoretical numerical analysis (Vol. 39). Springer.
Stachurski, J. (2008). Continuous state dynamic programming via nonexpansive approximation. Computational Economics, 31(2), 141–160.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 185–202.
Sargent, T. J., & Stachurski, J. (2025). Dynamic Programming: Finite States. Cambridge University Press.
Watkins, C. J., & Dayan, P. (1992). Q-Learning. Machine Learning, 8, 279–292.
Stachurski, J., Yang, J., & Yang, Z. (2024). Dynamic Programming: From Local Optimality to Global Optimality. arXiv Preprint arXiv:2411.11062.
Powell, W. B. (2007). Approximate Dynamic Programming: Solving the curses of dimensionality (Vol. 703). John Wiley & Sons.
Rust, J. (1997). Using randomization to break the curse of dimensionality. Econometrica: Journal of the Econometric Society, 487–516.
Li, B., Chong, E. K. P., & Pezeshki, A. (2026). Performance Guarantees for Data-Driven Sequential Decision-Making. https://arxiv.org/abs/2603.20553
Lee, D. (2026). Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration. https://arxiv.org/abs/2604.17457
Bertsekas, D., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.
Maliar, L., Maliar, S., & Winant, P. (2021). Deep Learning for Solving Dynamic Economic Models. Journal of Monetary Economics, 122, 76–101. 10.1016/j.jmoneco.2021.07.004

9 Approximation and Learning

9.1Approximation¶

9.1.1Approximation Methods¶

9.1.1.1Local Approximation¶

9.1.1.2Global Approximation¶

9.1.2Error Bounds¶

9.1.2.1Framework¶

9.1.2.2Error Bounds¶

9.1.2.3Order-Preserving Approximations¶

9.1.3Stochastic Approximation¶

9.1.3.1Fixed Point Iteration and Damping¶

9.1.3.2The Robbins–Monro Algorithm¶

9.1.3.3Convergence¶

9.1.3.4Example: Asset Pricing¶

9.1.3.5A Sequential Version¶

9.2Simulation and Learning¶

9.2.1Q-Learning¶

9.2.1.1Q-Factors and Model-Free Learning¶

9.2.1.2The Q-Learning Algorithm¶

9.2.1.3Convergence¶

9.2.1.4Example: Inventory Management¶

9.2.2Risk-Sensitive Q-Learning¶

9.2.2.1Risk-Sensitive Bellman Equation¶

9.2.2.2The Order-Reversing FDP Structure¶

9.2.2.3Primary and Subordinate Bellman Equations¶

9.2.2.4Optimal Behavior Under Risk Sensitivity¶

9.2.2.5The Risk-Sensitive Q-Learning Update¶

9.2.3Policy-Based Methods¶

9.2.3.1The Policy Gradient Approach¶

9.2.3.2Monte Carlo Approximation¶

9.2.3.3From Local to Global Optimality¶

9.3Chapter Notes¶