Remaining Proofs - Dynamic Programming Volume I: Finite States

B.1 Chapter 2 Results

Regarding (i), fix $\phi, \psi \in \dD(\Xsf)$ with $\phi \lefsd \psi$ . Pick any $y \in \Xsf$ . By transitivity of partial orders, the function $u(x) \coloneq \1\{y \preceq x\}$ is in $i\RR^\Xsf$ . Hence $\sum_x u(x) \phi(x) \leq \sum_x u(x) \psi(x)$ . Given the definition of $u$ , this is equivalent to $G^\phi(y) \leq G^\psi(y)$ . As $y$ was chosen arbitrarily, we have $G^\phi \leq G^\psi$ pointwise on $\Xsf$ .

Regarding (ii), let $\phi, \psi \in \dD(\Xsf)$ be such that $G^\phi \leq G^\psi$ and let $\Xsf$ be totally ordered by $\preceq$ . We can write $\Xsf$ as $\{x_1, \ldots, x_n\}$ with $x_i \preceq x_{i+1}$ for all $i$ . Pick any $u \in i\RR^\Xsf$ and let $\alpha_i = u(x_i)$ . By Exercise 2.2.32, we can write $u$ as $u(x) = \sum_{i=1}^n s_i \1\{x \succeq x_i\}$ at each $x \in \Xsf$ , where $s_i \geq 0$ for all $i$ . Hence

\sum_{x \in \Xsf} u(x) \phi(x) = \sum_{x \in \Xsf} \sum_{i=1}^n s_i \1\{x \succeq x_i\} \phi(x) = \sum_{i=1}^n s_i \sum_{x \in \Xsf} \1\{x \succeq x_i\} \phi(x) = \sum_{i=1}^n s_i \, G^\phi(x_i).

A similar argument gives $\sum_{x \in \Xsf} u(x) \psi(x) = \sum_{i=1}^n s_i \, G^\psi(x_i)$ . Since $G^\phi \leq G^\psi$ , we have

\sum_{x \in \Xsf} u(x) \phi(x) = \sum_{i=1}^n s_i \, G^\phi(x_i) \leq \sum_{i=1}^n s_i \, G^\psi(x_i) = \sum_{x \in \Xsf} u(x) \psi(x) .

We conclude that $\phi \lefsd \psi$ , as was to be shown. ◻
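The identity behind part (ii) can be checked numerically. The sketch below is illustrative only. It assumes $\Xsf = \{0, 1, 2\}$ with the usual order, takes $G^\phi(y) = \sum_{x \succeq y} \phi(x)$ as in the proof, and assumes that the weights from Exercise 2.2.32 have the form $s_1 = u(x_1)$ and $s_i = u(x_i) - u(x_{i-1})$ for $i \geq 2$. The distributions `phi`, `psi` and the function `u` are made up for the example.

```python
from fractions import Fraction as F

def counter_cdf(dist):
    """G(y) = sum of dist(x) over x >= y, for states 0, ..., n-1."""
    return [sum(dist[y:]) for y in range(len(dist))]

def step_weights(u):
    """Weights s_i with u(x) = sum_i s_i * 1{x >= x_i} (assumed form)."""
    return [u[0]] + [u[i] - u[i - 1] for i in range(1, len(u))]

def expectation(u, dist):
    """sum_x u(x) dist(x)."""
    return sum(ux * px for ux, px in zip(u, dist))

# Hypothetical distributions on {0, 1, 2}; psi puts more mass on high states.
phi = [F(1, 2), F(1, 4), F(1, 4)]
psi = [F(1, 4), F(1, 4), F(1, 2)]
u = [F(0), F(1), F(3)]            # an increasing function with u(x_1) >= 0

G_phi, G_psi = counter_cdf(phi), counter_cdf(psi)
s = step_weights(u)

# sum_x u(x) phi(x) = sum_i s_i G^phi(x_i), and likewise for psi
lhs_phi, rhs_phi = expectation(u, phi), sum(si * g for si, g in zip(s, G_phi))
lhs_psi, rhs_psi = expectation(u, psi), sum(si * g for si, g in zip(s, G_psi))

print(G_phi, G_psi)
print(lhs_phi, rhs_phi, lhs_psi, rhs_psi)
```

Exact rational arithmetic is used so that the two sides of each identity can be compared with `==` rather than with a tolerance.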

B.2 Chapter 6 Results

We adopt the setting of Section 6.1.1.2 and consider the claim

\EE_x \, \sum_{t=0}^\infty \left[ \prod_{i=0}^t \beta_i \right] h(X_t) = \sum_{t=0}^\infty \EE_x \, \left[ \prod_{i=0}^t \beta_i \right] h(X_t),

(B.1)

when $(X_t)$ is $P$ -Markov with initial condition $x$ and $h \in \RR^\Xsf$ . Throughout this discussion the assumption $\rho(L) < 1$ is in force (see Theorem 6.1.1). Unlike the rest of the book, we assume some familiarity with measure theory, at the level of, say, Dudley (2002), Chapters 3 and 4.

To begin the discussion we set

F_T \coloneq \sum_{t=0}^T \delta_t \, h(X_t) \quad \text{and} \quad F \coloneq \sum_{t=0}^\infty \delta_t \, h(X_t) \quad \text{where} \quad \delta_t \coloneq \prod_{i=0}^t \beta_i.

Our first aim is to show that $F$ is a well-defined random variable, in the sense that the sum converges almost surely. Since absolute convergence of real series implies convergence, and since finite expectation implies finiteness almost everywhere, it suffices to show that

\EE_x \, \sum_{t=0}^\infty \delta_t \, |h(X_t)| < \infty.

(B.2)

By the monotone convergence theorem (see, e.g., Dudley (2002), Theorem 4.3.2), we have

\EE_x \, \sum_{t=0}^\infty \delta_t \, |h(X_t)| = \sum_{t=0}^\infty \EE_x \, \delta_t \, |h(X_t)| = \sum_{t=0}^\infty (L^t |h|)(x) ,

where the last equality is by (6.6). Since $\rho(L) < 1$, the series $\sum_{t \geq 0} L^t$ converges, so the right-hand side is finite. Hence (B.2) holds, which in turn confirms that $F$ is well-defined and finite almost surely.
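To see the finiteness claim in a concrete case, the following sketch sums $\sum_{t=0}^{T} L^t |h|$ for a made-up $2 \times 2$ nonnegative matrix `L` and compares the result with the solution of $(I - L) v = |h|$. The matrix and the vector `abs_h` are hypothetical and chosen only so that $\rho(L) = 0.8 < 1$.

```python
def apply(M, v):
    """Matrix-vector product M v for small dense lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

# Hypothetical nonnegative matrix; both rows sum to 0.8, so rho(L) = 0.8.
L = [[0.5, 0.3],
     [0.2, 0.6]]
abs_h = [1.0, 2.0]               # plays the role of |h|

# Partial sums of sum_t L^t |h|
partial = [0.0, 0.0]
term = abs_h[:]
for t in range(200):
    partial = [p + z for p, z in zip(partial, term)]
    term = apply(L, term)

# Closed form: solve (I - L) v = |h| by Cramer's rule (2 x 2 case only)
a, b = 1 - L[0][0], -L[0][1]
c, d = -L[1][0], 1 - L[1][1]
det = a * d - b * c
limit = [(d * abs_h[0] - b * abs_h[1]) / det,
         (a * abs_h[1] - c * abs_h[0]) / det]

print(partial, limit)
```

The partial sums settle on the closed-form value, which is one way to see that the series in (B.2) is finite in this example.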

Now observe that, on the probability one set where $F$ is finite, we have $F_T \to F$ as $T \to \infty$ . Moreover,

|F_T| \leq \sum_{t=0}^T \delta_t \, |h(X_t)| \leq Y \coloneq \sum_{t=0}^\infty \delta_t \, |h(X_t)|,

and, as shown above, $\EE_x \, Y < \infty$ . By the dominated convergence theorem, we now have $\EE_x \, F = \lim_{T \to \infty} \EE_x \, F_T$ , or, equivalently,

\EE_x \, \sum_{t=0}^\infty \delta_t \, h(X_t) = \lim_{T \to \infty} \EE_x \, \sum_{t=0}^T \delta_t \, h(X_t) = \lim_{T \to \infty} \sum_{t=0}^T \EE_x \, \delta_t \, h(X_t) = \sum_{t=0}^\infty \EE_x \, \delta_t \, h(X_t).

Hence (B.1) holds.
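The finite-horizon step in the display above, $\EE_x \sum_{t=0}^T \delta_t h(X_t) = \sum_{t=0}^T \EE_x \delta_t h(X_t)$, can be checked exactly on a small example. The sketch below rests on assumptions that are not stated in this appendix: that $\beta_0 = 1$, that $\beta_t = b(X_{t-1}, X_t)$ for some function $b$, and that $L(x, x') = b(x, x') P(x, x')$, so that $\EE_x \delta_t h(X_t) = (L^t h)(x)$. The matrices `P`, `b` and the vector `h` are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

n = 2
P = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]        # hypothetical Markov matrix
b = [[F(9, 10), F(4, 5)], [F(1, 2), F(19, 20)]]     # hypothetical discount factors b(x, x')
h = [F(1), F(-2)]
L = [[b[i][j] * P[i][j] for j in range(n)] for i in range(n)]   # assumed form of L

def apply(M, v):
    """Matrix-vector product M v."""
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

def expected_F_T(x, T):
    """E_x F_T computed by enumerating every path (x, x_1, ..., x_T)."""
    total = F(0)
    for tail in product(range(n), repeat=T):
        path = (x,) + tail
        prob = F(1)
        for s, s_next in zip(path, path[1:]):
            prob *= P[s][s_next]
        delta, payoff = F(1), h[x]                  # delta_0 = beta_0 = 1 (assumed)
        for t in range(1, T + 1):
            delta *= b[path[t - 1]][path[t]]
            payoff += delta * h[path[t]]
        total += prob * payoff
    return total

def operator_sum(x, T):
    """sum_{t=0}^T (L^t h)(x)."""
    total, v = F(0), h[:]
    for _ in range(T + 1):
        total += v[x]
        v = apply(L, v)
    return total

for T in range(5):
    print(T, expected_F_T(0, T), operator_sum(0, T))
```

Path enumeration grows like $n^T$, so this is only a sanity check for short horizons, not a way to compute the infinite sum.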

B.3 Chapter 7 Results

Proof of uniqueness for Theorem 7.1.3.

We focus on the concave case. Let $I$ be as in Theorem 7.1.3 and suppose that $T$ is an order-preserving concave self-map on $I$ with $T \phi \gg \phi$. By Theorem 7.1.1, $T$ has least and greatest fixed points in $I$. We denote them by $a$ and $b$, respectively. Let

\lambda = \min_{x \in \Xsf} \frac{a(x) - \phi(x)}{b(x) - \phi(x)},

and let $\bar x$ be a minimizer. The ratio is well defined because $b = Tb \geq T\phi \gg \phi$. It follows immediately from its definition that $\lambda$ obeys $0 \leq \lambda \leq 1$ and

a(x) \geq \lambda b(x) + (1-\lambda) \phi(x) \quad \text{for all } x \in \Xsf \text{ with equality at } \bar x.

As a result, using the fact that $T$ is order-preserving and concave, together with $Tb = b$, we have

a = Ta \geq T(\lambda b + (1-\lambda) \phi) \geq \lambda Tb + (1-\lambda) T \phi = \lambda b + (1-\lambda) T \phi.

Suppose now that $\lambda < 1$ . Since $T \phi \gg \phi$ , we get $a \gg \lambda b + (1-\lambda) \phi$ , and evaluating this at $\bar x$ yields

a(\bar x) > \lambda b(\bar x) + (1-\lambda) \phi(\bar x) = a(\bar x),

which is a contradiction. Hence $\lambda=1$ and, therefore, $a \geq b$ . Since all fixed points $\bar u$ of $T$ in $I$ obey $a \leq \bar u \leq b$ , we see that $a = b$ is the unique fixed point of $T$ in $I$ . ◻
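A small numerical example may help fix ideas. The sketch below uses a made-up map $(Tv)(x) = 1 + \sum_{x'} A(x, x') \sqrt{v(x')}$ on the order interval $[\phi, \psi]$ with $\phi = 0$ and $\psi = 4$, which is order-preserving and concave and satisfies $T\phi \gg \phi$. It approximates the least and greatest fixed points by iterating from $\phi$ and from $\psi$ (an assumption that is reasonable here because this $T$ is continuous) and then evaluates the ratio $\lambda$ from the proof. None of these objects come from the text.

```python
import math

# Hypothetical row-stochastic weights
A = [[0.5, 0.5],
     [0.2, 0.8]]

def T(v):
    """Order-preserving, concave map: (Tv)(x) = 1 + sum_x' A(x, x') sqrt(v(x'))."""
    return [1.0 + sum(A[i][j] * math.sqrt(v[j]) for j in range(2)) for i in range(2)]

phi, psi = [0.0, 0.0], [4.0, 4.0]     # T phi = (1, 1) >> phi and T psi = (3, 3) <= psi

def iterate(v, n=200):
    """Apply T to v a total of n times."""
    for _ in range(n):
        v = T(v)
    return v

a = iterate(phi)      # approximates the least fixed point
b = iterate(psi)      # approximates the greatest fixed point

# The ratio from the proof; lambda = 1 corresponds to a = b
lam = min((a[i] - phi[i]) / (b[i] - phi[i]) for i in range(2))

print(a, b, lam)
```

Because the rows of `A` sum to one, the fixed point is constant across states and solves $v = 1 + \sqrt{v}$, that is, $v = (3 + \sqrt{5})/2$.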

B.4 Chapter 9 Results

Let’s now turn to the proof of the core optimality results for ADPs. In what follows, $\aA = (V, \{T_\sigma\})$ is a well-posed ADP with Bellman operator $\tmax$ and $\sigma$ -value functions $\{ v_\sigma \}_{\sigma \in \Sigma}$ . We start with

Lemma B.4.1

If $\aA$ is order stable, then the following statements hold:

(i) $v \in V_u \implies v \preceq \Hmax \, v$ .

(ii) If $\sigma \in \Sigma$ and $\tmax \, v_\sigma = v_\sigma$ , then $v_\sigma = \vmax$ .

(iii) If $v \in V$ and $\Hmax \, v =v$ , then $v = \vmax$ and $\tmax \, \vmax = \vmax$ .

(iv) If $\aA$ is finite, then $\vmax$ exists in $V$ and $\Hmax \, \vmax = \vmax$ . Moreover, for all $v \in V$ , the HPI sequence $(v_k)$ defined by $v_k = \Hmax^k v$ converges to $\vmax$ in finitely many steps.

(v) Fix $v \in V$ and let $(v_k)$ be the HPI sequence defined by $v_k = \Hmax^k v$ for $k \in \NN$ . If $v_{k+1} = v_k$ for some $k \in \NN$ , then $v_k = \vmax$ and every $v_k$ -greedy policy is optimal.

Proof

Regarding (i), fix $v \in V_u$ and let $\tau$ be $v$ -greedy, with $\Hmax v = v_{\tau}$ . Since $v \in V_u$ , we have $v \preceq \tmax \, v = T_\tau \, v$ . This inequality and upward stability of $T_\tau$ yield $v \preceq v_\tau$ . But then $v \preceq \Hmax v$ , as claimed.

Regarding (ii), suppose $\sigma \in \Sigma$ and $\tmax \, v_\sigma = v_\sigma$ . Fix $\tau \in \Sigma$ and note that $v_\sigma = \tmax \, v_\sigma \succeq T_\tau \, v_\sigma$ . Downward stability of $T_\tau$ implies $v_\sigma \succeq v_\tau$ . Since $\tau \in \Sigma$ was arbitrary, $v_\sigma = \vmax$ .

Regarding (iii), fix $v \in V$ with $\Hmax \, v = v$ and let $\sigma$ be such that $\Hmax \, v = v_\sigma$ . Then $v_\sigma = v$ , and, since $\sigma$ is $v$ -greedy, $T_\sigma \, v = \tmax \, v$ . But then $T_\sigma \, v_\sigma = \tmax \, v_\sigma$ , and, since $v_\sigma = T_\sigma \, v_\sigma$ , we have $v_\sigma = \tmax \, v_\sigma$ . Part (ii) now implies $v = v_\sigma = \vmax$ . This proves the first claim. Regarding the second, substituting $v_\sigma = \vmax$ into $v_\sigma = \tmax \, v_\sigma$ yields $\vmax = \tmax \,\vmax$ .

For (iv), fix $v \in V$. It suffices to show that $\Hmax \, \vmax = \vmax$ and that there exists a $K \in \NN$ such that $\Hmax^K v = \vmax$. To this end, let $v_k = \Hmax^k v$ and note that $v_k \in V_\Sigma$ for all $k \geq 1$. Part (i) implies that $v_{k+1} \succeq v_k$ for all $k \in \NN$. Since the sequence $(v_k)$ is contained in the finite set $V_\Sigma$, it must be that $v_{K+1} = v_K$ for some $K \in \NN$ (since otherwise $V_\Sigma$ contains an infinite sequence of distinct points). But then $\Hmax \, v_K = v_{K+1} = v_K$, so $v_K$ is a fixed point of $\Hmax$. Part (iii) now implies that $v_K = \vmax$, and hence $\Hmax \, \vmax = \vmax$.

For (v), let $(v_k)$ be as stated and suppose that $v_{k+1} = v_k$ for some $k \in \NN$ . Then $v_k$ is a fixed point of $\Hmax$ , so, by (iii) above, we have $v_k = \vmax$ . By Bellman’s principle of optimality, every $v_k$ -greedy policy is optimal. ◻
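Parts (iv) and (v) can be illustrated with Howard policy iteration on a tiny example. The sketch below is not the abstract setting of the lemma. It assumes a standard two-state, two-action MDP in which $T_\sigma \, v = r_\sigma + \beta P_\sigma v$, so that $v_\sigma$ solves a linear system and $\Hmax \, v = v_\sigma$ for a $v$-greedy $\sigma$. The rewards, transitions and discount factor are made up, and ties between actions are broken in favor of the lower-indexed action.

```python
from fractions import Fraction as F

beta = F(1, 2)
r = [[F(1), F(0)], [F(0), F(2)]]          # r[x][a], hypothetical rewards
P = [[[F(1), F(0)], [F(0), F(1)]],        # P[x][a][y]: action a moves to state a
     [[F(1), F(0)], [F(0), F(1)]]]
states, actions = range(2), range(2)

def q(v, x, a):
    """Value of taking action a at state x and continuing with v."""
    return r[x][a] + beta * sum(P[x][a][y] * v[y] for y in states)

def bellman(v):
    """Bellman operator: pointwise maximum over actions."""
    return [max(q(v, x, a) for a in actions) for x in states]

def greedy(v):
    """A v-greedy policy (first maximizer at each state)."""
    return [max(actions, key=lambda a: q(v, x, a)) for x in states]

def policy_value(sigma):
    """Solve (I - beta P_sigma) v = r_sigma by Cramer's rule (two states only)."""
    A = [[(1 if x == y else 0) - beta * P[x][sigma[x]][y] for y in states]
         for x in states]
    c = [r[x][sigma[x]] for x in states]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(c[0] * A[1][1] - A[0][1] * c[1]) / det,
            (A[0][0] * c[1] - A[1][0] * c[0]) / det]

def H(v):
    """Howard operator: the value of a v-greedy policy."""
    return policy_value(greedy(v))

# HPI: iterate H until v_{k+1} = v_k, which must happen in finitely many steps
v, steps = [F(0), F(0)], 0
while True:
    v_next = H(v)
    if v_next == v:
        break
    v, steps = v_next, steps + 1

all_values = [policy_value([s0, s1]) for s0 in actions for s1 in actions]
print(v, steps, greedy(v))
```

Exact rational arithmetic makes the stopping rule `v_next == v` meaningful. With floating point numbers one would typically stop when the greedy policy stops changing instead.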

We first prove Proposition 9.2.5 and then return to Theorem 9.2.4.

Proof of Proposition 9.2.5.

Let $\aA$ be max-stable. We need to establish the following claims.

(a) $V_\Sigma$ has a greatest element $\vmax$ ,

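(b) $\vmax$ is the unique fixed point of $\tmax$ in $V$ ,

(c) a policy $\sigma \in \Sigma$ is optimal if and only if it is $\vmax$ -greedy (Bellman's principle of optimality), and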

(d) at least one optimal policy exists.

For claims (a) and (b), we observe that, by max-stability, $\tmax$ has a fixed point $\bar v$ in $V$. By existence of greedy policies, we can find a $\sigma \in \Sigma$ such that $\bar v = \tmax \, \bar v = T_\sigma \, \bar v$. But $T_\sigma$ has a unique fixed point in $V$, equal to $v_\sigma$, so $\bar v = v_\sigma$. Moreover, if $\tau$ is any policy, then $T_\tau \, \bar v \preceq \tmax \, \bar v = \bar v$ and hence, by downward stability, $v_\tau \preceq \bar v$. These facts imply that $\vmax \coloneq \bar v$ is the greatest element of $V_\Sigma$ and a fixed point of $\tmax$. Since the same argument applies to every fixed point of $\tmax$ in $V$, and greatest elements are unique, $\vmax$ is the only fixed point of $\tmax$ in $V$.

Regarding (c), parts (a) and (b) give $\vmax \in V$ and $\tmax \, \vmax = \vmax$ . Now recall that $\sigma$ is optimal if and only if $v_\sigma = \vmax$ . Since $v_\sigma$ is the unique fixed point of $T_\sigma$ , this is equivalent to $T_\sigma \, \vmax = \vmax$ . Since $\tmax \, \vmax = \vmax$ , the last statement is equivalent to $T_\sigma \, \vmax = \tmax \vmax$ , which is, in turn, equivalent to the statement that $\sigma$ is $\vmax$ -greedy.

Part (d) follows directly from (a). ◻

References

Dudley, R. M. (2002). Real analysis and probability (Vol. 74). Cambridge University Press.
