9 Approximation and Learning
The theory developed in earlier chapters can be extended to address two important problems faced by applied researchers. One is the need for approximation. The other is how to learn optimal policies from samples, based on limited knowledge. This short chapter addresses these issues. Our aim here is limited, namely, to provide a brief introduction to these large topics and connect them to the ADP theory studied above.
The chapter consists of two sections. Section 9.1 develops the theory of approximate value function iteration, showing that nonexpansive approximation operators preserve contractivity and lead to clean error bounds. It also introduces stochastic approximation, which provides a way to study the convergence properties of iterative algorithms and forms the theoretical foundation for the learning algorithms that follow. Section 9.2 treats simulation-based learning, covering Q-learning, risk-sensitive Q-learning, and policy gradient methods.
9.1Approximation¶
This section discusses approximate dynamic programming. Approximation is needed because (a) models with continuous states and actions cannot be implemented exactly on computers, and (b) many problems are high-dimensional and hence computationally expensive. We focus on approximate value function iteration.
We begin in Section 9.1.1 with a review of common approximation methods, including local and global approximation. In Section 9.1.2, we show that nonexpansive approximation operators preserve contractivity and derive error bounds for the resulting policies. Section 9.1.3 introduces stochastic approximation, a general framework for finding fixed points from noisy evaluations that underpins the learning algorithms of Section 9.2.
9.1.1Approximation Methods¶
Suppose that we want to represent a function using an approximation that can be implemented on a machine using a finite number of parameters. One common approach is to take a normed linear space containing and elements from . The approximation then takes the form
The scalars are chosen so that is close to . In this setting, the elements are typically called basis functions.
In practice, one often fixes a sequence of functions such that, with defined as the linear span of for each , the union is dense in . Next we introduce a sequence of approximation operators such that has the form of in (9.1). For example, given , the function might equal the closest element in to under the specified norm on . See, for example, Cheney (2013) or Atkinson & Han (2005).
9.1.1.1Local Approximation¶
One variation on this idea, often called local approximation, proceeds by first choosing a set of grid points contained in and a set of basis functions defined on the state space with the property that
Elements of a family of basis functions satisfying (9.2) are sometimes called weighting functions or kernels, and an approximation of a function taking the form
is called a local approximator or a kernel averager. The key idea is that , the approximation to at , is a weighted average of the values of at the grid points. Typically, is relatively large when is near , so that has a large weight in determining .
Figure 9.1 illustrates the Gaussian kernel averager from Example 9.1.1 in one dimension. The left panel shows the kernel weights for six grid points: each peaks at and decays smoothly, and the weights sum to one at every . The right panel shows a function and its kernel-averager approximation ; the approximation agrees with at the grid points and smoothly blends between them.

Figure 9.1:Gaussian kernel averager: kernels (left) and approximation (right)
We will exploit the following property in Section 9.1.2 to derive error bounds for approximate value function iteration.
9.1.1.2Global Approximation¶
Another approach involves global approximation methods. Here we drop the partition-of-unity constraint (9.2) from the basis functions and the function values are replaced with a more generic set of coefficients . With the basis denoted by , the approximations take the form
The number of basis elements is typically smaller than the number of grid points, seeking a parsimonious representation. Coefficients are chosen by fitting. A typical criterion is least squares minimization:
If and , then takes the familiar form whenever has full column rank.
9.1.2Error Bounds¶
In this section, we develop error bounds for approximate value function iteration in the ADP framework of Chapter 3. The key result is that when the approximation operator is nonexpansive, the composition of the Bellman operator with the approximation operator inherits contractivity, and the error in the computed policy can be bounded in terms of the approximation quality.
9.1.2.1Framework¶
Throughout this section, is an ADP with a partially ordered metric space. An operator is called nonexpansive with respect to when
We make the following assumption.
Under Assumption 9.1.1, the conditions of Theorem 3.1.5 are satisfied (with , since regularity implies semi-regularity with ). As a result, the fundamental optimality properties hold, the Bellman operator is a contraction of modulus on , the value function is the unique fixed point of in , and VFI, OPI, and HPI all converge.
Let be a self-map on . We call the approximation operator and define the approximate Bellman operator via .
By Lemma 9.1.2 and the Banach contraction mapping theorem, has a unique fixed point , and the iterates converge geometrically to for any starting point . The fitted value iteration (FVI) algorithm iterates with from an initial condition and, upon termination, returns a -greedy policy.
The next two results bound the policy error when the FVI algorithm terminates after iterations.
9.1.2.2Error Bounds¶
Let be the initial condition in the FVI routine and define for each . Let be the step size at termination. Let be a -greedy policy, so that .
The bound (9.11) decomposes the error into two terms. The first term reflects the accuracy of the iterative process and decreases geometrically with the number of iterations. The second term is the approximation error at the final iterate---how well represents . Both terms are computable by the algorithm.
The next result provides an alternative bound that replaces the approximation error at the final iterate with the approximation error at the true value function .
The bound (9.19) trades a larger constant for a simpler approximation term , which depends only on how well the approximation operator represents the true value function. This is useful for a priori analysis of approximation architectures.
9.1.2.3Order-Preserving Approximations¶
One attractive special case arises when the approximation operator is not only nonexpansive but also order preserving. In the next result we set .
In this setting, the approximate model inherits the structure of an ADP, and the optimality theory from Chapter 2 and Chapter 3 applies directly to it. The error bounds in Theorem 9.1.3 and Theorem 9.1.4 quantify the gap between the optimal policies of the original and approximate models.
9.1.3Stochastic Approximation¶
In the previous section, we discussed methods for controlling error when iterating with an approximate Bellman operator . In that setting, the operator is known. In many applications, however, itself cannot be evaluated exactly because it involves an expectation over an unknown or intractable distribution. Instead, we can only observe noisy evaluations of . Stochastic approximation provides a framework for finding fixed points in this setting. Here we provide a fast introduction.
Stochastic approximation will drive the convergence theory of the learning algorithms we present below in Section 9.2.
9.1.3.1Fixed Point Iteration and Damping¶
Stochastic approximation can be thought of as a general technique for solving fixed point problems or root finding problems when information is limited. Here we focus on the fixed point case. In our analysis, will be a contraction of modulus on a closed subset of . By Banach’s fixed point theorem, has a unique fixed point and for any .
To start thinking about stochastic approximation, we first recall damped iteration, which fixes and updates via
This amounts to iterating with the map . Damped iteration is an alternative to regular fixed point iteration that has advantages in some cases.
One setting where damped iteration can converge faster than standard iteration is when the iterates oscillate. Figure 9.2 and Figure 9.4 illustrate this case with the linear map , where
The fixed point is the origin. Standard iteration produces a spiraling trajectory, while damped iteration (with ) converges more smoothly.

Figure 9.2:Standard iteration

Figure 9.3:Damped iteration (α = 0.7)

Figure 9.4:Convergence comparison: norm of iterates
9.1.3.2The Robbins--Monro Algorithm¶
Now suppose that is a contraction on with unique fixed point , but we can only evaluate with noise. That is, given input , we observe , where is a zero-mean random disturbance. We cannot observe or separately, only their sum. The Robbins--Monro algorithm Robbins & Monro, 1951 generalizes damped iteration to this noisy setting. It tells us to fix an initial and update via
where the learning rate satisfies .
Compare with damped iteration in (9.27): the only differences are (a) the learning rate decreases over time, and (b) the noise term enters the update. The decreasing learning rate ensures that, asymptotically, the noise is averaged away while the signal from accumulates.
9.1.3.3Convergence¶
The following well-known result, due to Tsitsiklis (1994), provides conditions under which the Robbins--Monro iterates converge to . In the statement, is the filtration generated by .
The restrictions on in (iii) are called the Robbins--Monro conditions. The restriction ensures that the learning rates do not decay too fast, so the algorithm can reach from any starting point. The condition ensures that the accumulated noise variance is finite. A standard choice is for .
9.1.3.4Example: Asset Pricing¶
To illustrate stochastic approximation for a fixed point problem, we consider a basic asset pricing model. The model is deliberately simple, so that the exact solution can be computed by linear algebra (allowing us to verify numerically that stochastic approximation converges to the correct answer). In the model, the price of the asset satisfies
where is a discount factor and is the dividend at time . (See Section 6.3 of Sargent & Stachurski (2025) for background on the model.) Assume that is -Markov on a finite set with transition matrix , and . The equilibrium price function is the unique fixed point in of the map .
Setting , the unique fixed point of is .
We can also compute using stochastic approximation. To do this we use the random map given by the algorithm below
We understand as an operator that maps an element of into a random vector in . Note that
Now we iterate via
which parallels the damped iteration in (9.27). With we have and
so updating obeys the Robbins--Monro rule (9.30). The operator is order preserving since implies (the sum involves nonnegative weights). Conditions (i) and (iii) of Theorem 9.1.8 hold by construction.[1] Thus Theorem 9.1.8 applies, and converges to with probability one.
Figure 9.5 compares the stochastic approximation solution with the exact linear algebra solution for a specification with , , , and discretized from an AR(1) process with and . The learning rate is . After 500 iterations (left panel), the SA estimate is uniformly below but has already captured the shape of the price function. After iterations (right panel), the SA solution closely matches the exact value.

Figure 9.5:Batch stochastic approximation vs. linear algebra for asset pricing: partial convergence (left) and near-full convergence (right)
9.1.3.5A Sequential Version¶
An alternative way to apply stochastic approximation is to simulate a single trajectory of the -Markov chain and update sequentially. In particular, at each step , only the entry is modified:
where
For states we hold the estimate steady, setting . Note that is a single-sample version of our previous version from Section 9.1.3.4. The asynchronous extension of Theorem 9.1.8 noted in Remark 9.1.3 shows that the iterates converge to with probability one, provided every state is visited infinitely often.
The advantage of the sequential version will be clearer when we study Q-learning in Section 9.2.1. In real-world Q-learning applications, sequences of observations are naturally generated by the environment and the agent’s interactions with that environment. Optimal policies can be computed from these sequences without knowing or even estimating underlying dynamics.
Figure 9.6 illustrates the sequential version on the same asset pricing model from Section 9.1.3.4, using per-state learning rates , where counts the number of visits to state by time . The left panel shows the SA estimate after 104 updates. Convergence is fastest near the center of the state space, where the stationary distribution of the chain concentrates most of its mass and states are visited most frequently. At the tails, fewer visits lead to slower learning. After updates (right panel), the SA solution closely matches the exact value.

Figure 9.6:Sequential stochastic approximation vs. linear algebra for asset pricing: partial convergence (left) and near-full convergence (right)
9.2Simulation and Learning¶
In most of this book we have worked in settings where the underlying model is known. In many applications, the model is unknown and the agent must learn from data generated by interacting with the environment. This section covers two approaches to learning: Q-learning (value-based) and policy gradient methods (policy-based). We emphasize that this section is intended only as a very brief introduction to a vast literature.
9.2.1Q-Learning¶
Q-learning is a model-free algorithm that learns optimal policies from data, without knowledge of transition probabilities or reward functions. In this section we connect Q-learning to the Q-factor ADP framework developed in Section 3.2.1.2.
9.2.1.1Q-Factors and Model-Free Learning¶
Recall from Section 3.2.1.2 that, for a finite MDP with state space and action space , the Q-factor Bellman equation takes the form
As shown in Proposition 5.3.1,
the unique solution to (9.38) obeys , where is the value function, and
is an optimal policy whenever for all .
Item (ii) tells us that the optimal policy can be calculated from without knowledge of the transition kernel . This observation opens the door to learning: if we can approximate from samples, we can compute an approximately optimal policy without knowing the model.
9.2.1.2The Q-Learning Algorithm¶
Q-learning approximates by stochastic approximation of the Q-factor Bellman operator. At each step , the agent is in state , takes action , observes reward and next state , and updates the Q-function (also called the Q-table) via
Here is a learning rate. The term in parentheses is a single-sample estimate of the right-hand side of (9.38). The update blends this fresh sample with the current estimate. The agent needs to observe only the tuple at each step.
Actions are selected by a behavior policy that balances exploration (visiting new state-action pairs) with exploitation (using the current estimate of ). A common choice is the -greedy strategy: with probability , choose a random feasible action; otherwise choose . Because the convergence target does not depend on how actions are selected---only on the Bellman equation (9.38)---Q-learning is called an off-policy method.
9.2.1.3Convergence¶
Convergence of Q-learning follows from the stochastic approximation theory developed in Section 9.1.3. To see this, let be the Q-factor Bellman operator defined in (3.13), so that . By Exercise 3.5, each policy operator is a contraction of modulus on . Since the supremum norm is sup-nonexpansive, Theorem 3.1.5 gives that is also a contraction of modulus with unique fixed point . At each step , the agent observes and computes the sample
With this notation, the Q-learning update (9.39) becomes
which parallels (9.36). Since the fixed point of is , applying the asynchronous version of Theorem 9.1.8 noted in Remark 9.1.3 leads to the following classical result Watkins & Dayan, 1992Tsitsiklis, 1994:
Theorem 9.2.1 is often referred to as a “tabular” result: it assumes that the Q-table stores a separate value for every pair. An alternative approach, which scales to high dimensions, is to combine Q-learning with one of the function approximation schemes from Section 9.1. When the scheme in question is neural nets we get deep Q-learning, which has been highly successful in certain environments. See Section 9.3 for more discussion.
9.2.1.4Example: Inventory Management¶
To illustrate Q-learning, we consider an inventory management problem where a firm manages stock of a single product. Inventory evolves according to
where is iid demand and is the order quantity chosen by the firm. The state space is . Per-period profits are
where is the selling price, is the unit ordering cost, and is a fixed cost of placing an order. The firm seeks to maximize expected discounted profits .
Now suppose that a manager does not know the demand distribution, the cost parameters, or the transition dynamics. The manager can still learn the optimal ordering policy by Q-learning. At each period, the manager observes the current inventory , places an order , records the resulting profit, and observes the next-period inventory . These four quantities suffice for the update (9.39).
Figure 9.7 compares the Q-learning solution with the exact VFI solution for this problem, with , selling price , geometric demand with parameter , , , and . After 20 million steps of interaction, the learned value function and policy closely match those obtained by exact value function iteration with full model knowledge.

Figure 9.7:Value function and policy: VFI (exact) vs. Q-learning (model-free)
The 20 million steps required in this small example illustrate a general feature of model-free methods: because they do not exploit knowledge of transition probabilities or reward functions, they must extract all structural information from observed data, which can require very long trajectories. In practice, the cost of exploration is compounded by the fact that the agent typically needs to make reasonable decisions at each time step, limiting the extent to which it can explore freely. These challenges motivate both deep Q-learning (which uses function approximation to generalize across states) and the policy gradient methods discussed in Section 9.2.3 (which search directly in policy space).
9.2.2Risk-Sensitive Q-Learning¶
The Q-learning framework can be extended to accommodate risk-sensitive preferences. As we will see, the extension provides a concrete illustration of the order-reversing FDP theory developed in Section 5.2.3.
9.2.2.1Risk-Sensitive Bellman Equation¶
We can construct a risk-sensitive version of the inventory problem Bellman equation by replacing the expectation with a certainty equivalent (see, e.g., Section 7.1.1.8). Using the entropic certainty equivalent with parameter , the Bellman equation becomes
Larger implies stronger aversion to downside risk.
9.2.2.2The Order-Reversing FDP Structure¶
Our aim is to derive a Q-factor Bellman equation for the risk-sensitive problem, analogous to (9.38), and then build a Q-learning algorithm from it. One challenge is how to shift expectation to the outside of the maximization step, as in the standard Q-factor equation (see (9.38)). We will use the order-reversing FDP framework of Section 5.2.3 to factor the Bellman equation into a suitable form, enabling single-sample updates.
We set up the FDP as follows. The primary value space is and the subordinate value space is (strictly positive functions on ), each with the pointwise partial order. The connecting map is defined by
Noting that the image of lies in (since the exponential function is strictly positive), we define the family of maps from to by
The map evaluates the Q-function at the action prescribed by and applies to convert the positive Q-value back to the value scale.
Since is finite, the set has a greatest element for every . Thus, is an order-reversing FDP.
9.2.2.3Primary and Subordinate Bellman Equations¶
We have chosen so that the primary ADP for this FDP has the Bellman equation (9.44). To see this, recall that, in the FDP setting, is defined by . Here this gives , since is decreasing. The primary policy operators give
With (by Lemma 5.2.15), we recover (9.44).
For the subordinate ADP, by Lemma 5.2.16, the Bellman min-operator is . Computing explicitly:
Thus the subordinate Bellman equation takes the form
This is the risk-sensitive Q-factor Bellman equation---a fixed point equation in alone, and the min-analogue of the standard Q-factor Bellman equation (9.38).
Notice that the expectation (summation over ) appears on the outside of (9.49) because it originates from the map in the FDP factorization. The maps act pointwise and do not involve the transition kernel . This separation makes Q-learning possible: we can replace the expectation with a single-sample observation, just as in the standard case.
Our original goal is to find an optimal policy for the primary risk-sensitive MDP. The Q-learning algorithm described below operates on the subordinate Q-factor problem, converging to its min-value function . The key question is whether suffices to recover an optimal policy for the primary problem. Theorem 5.2.18 gives an affirmative answer: any policy satisfying is optimal for the primary problem . Since , this condition reduces to . In other words, once Q-learning has converged, we read off the optimal policy by minimizing the learned Q-function at each state.
9.2.2.4Optimal Behavior Under Risk Sensitivity¶
Before turning to Q-learning, we illustrate how risk sensitivity affects the optimal policy and the induced dynamics in the inventory model of Section 9.2.1. We apply value function iteration to the risk-sensitive Bellman equation (9.44) for three levels of risk sensitivity, , holding all other parameters at the values used for Figure 9.7.
Figure 9.8 shows the resulting optimal order quantities. A more risk-sensitive firm (larger ) orders less stock, accepting lower expected sales in exchange for more predictable profits. Figure 9.9 confirms the effect in simulation: inventory paths under the optimal policy stabilize at lower levels as increases.

Figure 9.8:Optimal policy as a function of inventory, for

Figure 9.9:Inventory dynamics under the optimal policy at
9.2.2.5The Risk-Sensitive Q-Learning Update¶
The corresponding Q-learning update replaces the expectation in (9.49) with a single sample:
Note several differences from the standard case. First, the optimal policy minimizes rather than maximizes . Second, the reward enters through rather than additively. Third, the continuation value enters as a power rather than a scaled sum . Fourth, to make the calculations, the agent now needs to know as well as .
Figure 9.10 shows the value function and policy produced by the risk-sensitive Q-learning update (9.52) after steps on the inventory model at , alongside the exact VFI solution. Q-learning recovers the optimal policy and value function despite the open theoretical question above; see the QuantEcon lectures for implementation details and code.

Figure 9.10:Risk-sensitive inventory at : VFI vs. Q-learning after steps
9.2.3Policy-Based Methods¶
The methods discussed so far are value-based: they learn a value function (or Q-function) and extract a policy from it. An alternative approach is to parameterize the policy directly and optimize over the parameter space. Such policy gradient methods are widely used in modern reinforcement learning.
9.2.3.1The Policy Gradient Approach¶
Rather than searching over all feasible policies, we restrict attention to a parameterized family , where is a parameter vector. Given an initial condition , define
the lifetime value of the policy starting from . The policy gradient method maximizes over by gradient ascent:
where is a step size and is a Monte Carlo approximation to described below.
To illustrate, consider the optimal savings problem from Section 1.3: a household chooses a consumption policy to maximize , where is a smooth utility function, subject to
with iid income shocks and gross interest rate . A policy maps current assets to consumption , parameterized by (for example, the weights of a neural network). The state dynamics are a smooth function of the policy: given a realization of the shock sequence, is differentiable in through .
9.2.3.2Monte Carlo Approximation¶
In general, cannot be evaluated exactly because it involves an expectation over the shock sequence. Instead, we form a Monte Carlo approximation by simulating independent paths of length under the policy . For each path , set and generate
where are iid draws. The Monte Carlo estimate is
The key observation is that is differentiable in whenever and are differentiable, because the state dynamics (9.55) are a smooth function of the action. (The shocks do not depend on , so they pass through the differentiation.) Each simulated path defines a computational graph from through the policy evaluations and state transitions to the cumulative payoff. Modern automatic differentiation libraries can differentiate through these graphs efficiently, providing at cost comparable to evaluating itself. The gradient estimate is substituted into the ascent step (9.54).
A common and effective choice for the policy parameterization is a neural network mapping states to actions, with collecting all the weights and biases. The universal approximation theorem Hornik et al., 1989 guarantees that sufficiently expressive networks can approximate any continuous policy. The main advantage of policy gradient methods is that they handle continuous state and action spaces naturally and scale to high-dimensional problems. Discussion of related literature can be found in Section 9.3.
Figure 9.11 applies the policy gradient method to the household problem (9.55) with CRRA utility (), , , and iid lognormal income. The policy is parameterized by a neural network with three hidden layers of six units each, trained for 400 epochs using 100 paths of length 200. The learned consumption policy closely matches the exact solution obtained by the endogenous grid method (EGM; see Section 8.3.3.5). For this low-dimensional problem, EGM is far more efficient; we use it here as a trusted benchmark to verify the policy gradient solution. The advantage of policy gradient methods emerges in high-dimensional settings where traditional grid-based methods are infeasible.

Figure 9.11:Consumption policy: EGM (exact) vs. policy gradient (neural network)
Interestingly, the policy gradient method recovers the globally optimal policy despite the fact that we are optimizing lifetime value at a single initial condition . This raises a natural question: when does maximizing at one initial condition generate an optimal policy --- that is, a policy that is optimal across all states? We address this question in the next section.
9.2.3.3From Local to Global Optimality¶
In general, does not imply . But perhaps implication does hold in some settings? Stachurski et al. (2024) analyze this question for MDPs on general state spaces. Here we present a simplified version for finite MDPs that conveys the core idea. In stating the main result, we recall that a stochastic matrix is called irreducible if, for every pair of states , there exists an with .
Theorem 9.2.2 tells us that policy gradient methods---which optimize at a single initial condition---produce globally optimal policies provided the candidate policy induces irreducible dynamics. Irreducibility ensures that every state is “reachable” from , so optimality propagates throughout the state space.
This helps explain the global convergence observed in Figure 9.11 for the household problem (9.55). The income shocks are lognormal with unbounded support, so they spread the state across a wide range of asset levels regardless of the consumption policy. This provides a form of irreducibility: every region of the state space is visited with positive probability under any feasible policy, allowing optimality at the initial condition to propagate globally.
9.3Chapter Notes¶
The curse of dimensionality and the role of approximation in dynamic programming are discussed at length in Powell (2007) and Rust (1997). Theorem 9.1.3 and Theorem 9.1.4 generalize the error bounds for fitted value iteration in Stachurski (2008) from concrete MDPs on Euclidean state spaces to the abstract ADP setting. Li et al. (2026) provide performance guarantees for approximate dynamic programming in the form of ratio bounds. Lee (2026) studies convergence of Q-factor value function iteration.
The use of neural networks for approximate dynamic programming has become popular in recent years. The classic reference is Bertsekas & Tsitsiklis (1996). More recent examples include Maliar et al. (2021), Kase et al. (2022), Han et al. (2026), Pascal (2024), Kamber et al. (2025), Ashwin et al. (2024), and Oguz & Bray (2026). See Scheidegger (2026) for an overview of the field.
The stochastic approximation framework of Section 9.1.3 originates with Robbins & Monro (1951). The convergence result in Theorem 9.1.8 is based on Tsitsiklis (1994), who proves convergence in a more general asynchronous setting. Sutton & Barto (2018) give a comprehensive textbook treatment of stochastic approximation and temporal difference methods in the reinforcement learning context. Szepesvári (2010) provides a concise and mathematically rigorous introduction to reinforcement learning algorithms, including detailed convergence analysis for both value-based and policy-based methods. Agarwal et al. (2021) give a modern theoretical treatment of reinforcement learning with an emphasis on sample complexity and regret bounds.
Q-learning was introduced by Watkins & Dayan (1992), with convergence established by Tsitsiklis (1994). For a more recent discussion of convergence, see Regehr & Ayoub (2021). Deep Q-learning was mentioned in Section 9.2.1. Bertsekas & Tsitsiklis (1996) laid the theoretical foundations for combining neural network function approximation with dynamic programming. The potential of this approach was demonstrated by Mnih et al. (2015), who showed that a deep Q-network agent could reach human-level performance across a range of Atari games without game-specific engineering. Yang et al. (2020) provide a theoretical analysis of deep Q-learning. Wang et al. (2024) provide a recent survey of the deep reinforcement learning literature.
The risk-sensitive Q-learning formulation of Section 9.2.2 connects to the broader literature on risk-sensitive Markov decision processes surveyed in Bäuerle & Jaśkiewicz (2024). Littman & Szepesvari (1996) is a valuable early contribution to the theoretical foundations of Q-learning in potentially nonstandard environments. The use of deep learning for solving dynamic programming problems in economics is explored by Azinovic et al. (2022).
Policy gradient methods, discussed briefly in Section 9.2.3, are treated in depth by Sutton & Barto (2018), Szepesvári (2010), and Agarwal et al. (2021). The discussion in Section 9.2.3.3 is a simplified version of the result in Stachurski et al. (2024), which treats MDPs on general state spaces under an appropriate generalization of irreducibility.
For condition (ii), note that , so . Hence for a constant depending only on , , and .
- Cheney, W. (2013). Analysis for applied mathematics (Vol. 208). Springer Science & Business Media.
- Atkinson, K., & Han, W. (2005). Theoretical numerical analysis (Vol. 39). Springer.
- Stachurski, J. (2008). Continuous state dynamic programming via nonexpansive approximation. Computational Economics, 31(2), 141–160.
- Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
- Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.
- Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 185–202.
- Sargent, T. J., & Stachurski, J. (2025). Dynamic Programming: Finite States. Cambridge University Press.
- Watkins, C. J., & Dayan, P. (1992). Q-Learning. Machine Learning, 8, 279–292.
- Stachurski, J., Yang, J., & Yang, Z. (2024). Dynamic Programming: From Local Optimality to Global Optimality. arXiv Preprint arXiv:2411.11062.
- Powell, W. B. (2007). Approximate Dynamic Programming: Solving the curses of dimensionality (Vol. 703). John Wiley & Sons.
- Rust, J. (1997). Using randomization to break the curse of dimensionality. Econometrica: Journal of the Econometric Society, 487–516.
- Li, B., Chong, E. K. P., & Pezeshki, A. (2026). Performance Guarantees for Data-Driven Sequential Decision-Making. https://arxiv.org/abs/2603.20553
- Lee, D. (2026). Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration. https://arxiv.org/abs/2604.17457
- Bertsekas, D., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.
- Maliar, L., Maliar, S., & Winant, P. (2021). Deep Learning for Solving Dynamic Economic Models. Journal of Monetary Economics, 122, 76–101. 10.1016/j.jmoneco.2021.07.004