
3 Data Structures

Statistics and bounded rationality

Statistics studies methods of making inferences and decisions when we aren’t sure what is happening. For economists, statistics has long been the place to go prospecting for hypotheses to understand how people behave under conditions of uncertainty and ignorance. It must be our starting place, because we propose to make our agents even more like statisticians or econometricians than they are in rational expectations models.

The purpose of this chapter is to review the workings of that centerpiece of econometrics, least squares regression, and how in many contexts it can be implemented recursively. This review will set the stage for the following chapter on neural networks and artificial intelligence, material that is less familiar to most economists, but which we shall see just implements recursive least squares in various ingenious contexts. In this chapter and the next we shall be looking for devices to hand over to the boundedly rational agents that will be created in Chapter 5.

Representation and estimation

From statistics, economists have borrowed and adapted a set of methods for describing and interpreting relationships within data sets. The task of description has fruitfully been subdivided into two logically distinct but interrelated pieces: representation and estimation. Representation of a relationship means positing a mathematical model that is assumed to have generated the data.

Usually, the data are taken to be a random sample drawn from a particular probability distribution, and the task of representation is to select a tractable model describing that probability distribution, typically in terms of a small number of parameters. The mathematical model chosen to represent the data is sometimes called the ‘data generating mechanism.’ The job of representation is ‘purely mathematical’ and in itself involves no use of statistical inference. Statistical methods are used to estimate the free parameters of the mathematical model, on the basis of the data set under study.

This chapter briefly describes some of the data-generating mechanisms that economists widely use, and also the sorts of procedures that they use to estimate the parameters of those models. My purpose is to convey the flavor of these data-generating mechanisms and statistical procedures, and to set the stage for our subsequent comparisons with neural networks.[1]

Two versions of the linear regression model

Population version

Linear regression, a pillar of econometrics, is a tool for summarizing the linear structure of a vector of random variables. We have a probability distribution function $F(y_t, x_t)$ for $(y_t, x_t)$, where $y_t$ is a scalar and $x_t$ is a $k \times 1$ vector. We assume that the probability distribution has well defined first and second moments. We want to represent the probability distribution in the form

$$y_t = \beta' x_t + e_t,$$

where $\beta' x_t$ approximates $y_t$ in the sense that $e_t = y_t - \beta' x_t$ is as small as possible in the mean square norm $E e_t^2$, where $E$ is the mathematical expectation. The object is to select $\beta$ to minimize $V(\beta) = 0.5\,E e_t^2 = 0.5\,E(y_t - \beta' x_t)^2$.

The first-order condition for minimization of $V(\beta)$ is $V'(\beta) = 0$, where $V'(\beta) = -E x_t(y_t - x_t'\beta)$. Thus, the (population) least squares $\beta$ satisfies the orthogonality condition

$$E x_t(y_t - x_t'\beta) = 0,$$

or

$$\beta = (E x_t x_t')^{-1} E x_t y_t.$$

Notice that the population least squares regression coefficient vector $\beta$ is a mathematical object defined in terms of the population second moments of the distribution $F(y_t, x_t)$. By virtue of the orthogonality condition (2), because $e_t = y_t - x_t'\beta$, the least squares regression represents $y_t$ as the sum of a linear function of $x_t$ and a piece $e_t$ that is orthogonal to $x_t$.

The sample version

We have a sample of observations $\{y_t, x_t\}$ drawn from the distribution $F(y_t, x_t)$. Everything that we know about the moments of $F(y_t, x_t)$ is contained in the sample $\{y_t, x_t\}_{t=1}^T$. From these data, we want to estimate the unknown value of $\beta$ in the model

$$y_t = \beta' x_t + e_t.$$

We can accomplish this by replacing expectations with sample means in formula (3); namely, we use

$$\beta_T = \left(\frac{1}{T} \sum_{t=1}^T x_t x_t'\right)^{-1} \left(\frac{1}{T} \sum_{t=1}^T x_t y_t\right).$$

This is ‘ordinary least squares.’
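As a concrete illustration, the sample-moment formula can be sketched in a few lines of numpy on simulated data. The data-generating process, sample size, and names such as `beta_true` are assumptions made for the example, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a sample {y_t, x_t} from a linear model y_t = beta' x_t + e_t.
T, k = 500, 3
beta_true = np.array([1.0, -2.0, 0.5])      # illustrative true coefficients
x = rng.normal(size=(T, k))                 # rows are x_t'
y = x @ beta_true + 0.1 * rng.normal(size=T)

# Ordinary least squares: replace the population moments E x x' and E x y
# in the formula for beta by their sample counterparts.
Sxx = x.T @ x / T                           # (1/T) sum_t x_t x_t'
Sxy = x.T @ y / T                           # (1/T) sum_t x_t y_t
beta_T = np.linalg.solve(Sxx, Sxy)

# The same estimate from a standard least squares routine, as a cross-check.
beta_lstsq = np.linalg.lstsq(x, y, rcond=None)[0]
```

The moment-based formula and the library routine deliver the same numbers, since both solve the sample orthogonality conditions.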

Vector autoregressions

Following the advice of Sims (1980), macroeconomists often use systems of linear regressions, with one equation for each variable being studied, to represent the dynamics within a collection of economic time series. Each variable is regressed against lagged values of itself and all of the other variables in the model.

Let $z_t$ be an $(n \times 1)$ covariance stationary stochastic process (i.e., one for which the vector of means $E z_t$ is independent of time and the matrix covariances $C_z(k) = E z_t z_{t-k}'$ are well defined and independent of calendar time $t$). For convenience, assume that $E z_t = 0$. Under particular conditions, such a process has the autoregressive representation

$$z_t = \sum_{j=1}^{\infty} A_j z_{t-j} + \epsilon_t,$$

where $\epsilon_t$ is an $(n \times 1)$ vector of least squares residuals or ‘innovations’ that satisfies the extensive orthogonality conditions

$$E \epsilon_t z_{t-j}' = 0_n, \quad j = 1, \ldots, \infty.$$

This model is called a vector autoregression. The force of the extensive orthogonality conditions in the last equation is to decompose $z_t$ into a piece $\sum_{j=1}^{\infty} A_j z_{t-j}$, which is linearly predictable from past values of the vector process itself, and a part $\epsilon_t$, which cannot be predicted linearly from past $z$’s (i.e., it is orthogonal to each element of past $z$’s). The matrices of autoregressive coefficients $A_j$ are determined by the normal equations

$$C_z(k) = \sum_{j=1}^{\infty} A_j C_z(k-j), \quad k \ge 1,$$

which are equivalent to the least squares orthogonality conditions.

The vector autoregressive representation is a workhorse. Following the lead of Sims and Litterman, it is often used to represent systems of interrelated time series for the purposes of describing their dynamic structure and forecasting them. In constructing our models of economic agents, economists often describe their beliefs about the dynamics of the environment in terms of a vector autoregression. We use vector autoregressions to formulate our own forecasting problems, and often model economic actors as doing the same.

Estimation of vector autoregressions

If enough data were available, vector autoregressions could be well estimated by applying ordinary least squares, equation by equation. But economists usually don’t have enough data to use ordinary least squares, so Sims and Litterman have shown how to use modified versions of least squares. I postpone discussing why in practice they deviate from using ordinary least squares in its unadulterated form.
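Equation-by-equation ordinary least squares can be sketched on a simulated bivariate VAR(1); the coefficient matrix `A_true`, the innovation scale, and the sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# A stationary bivariate VAR(1): z_t = A z_{t-1} + eps_t.
A_true = np.array([[0.5, 0.2],
                   [0.1, 0.4]])              # eigenvalues 0.6, 0.3 < 1
T = 4000
z = np.zeros((T, 2))
for t in range(1, T):
    z[t] = A_true @ z[t - 1] + rng.normal(scale=0.5, size=2)

# Equation-by-equation OLS: regress each variable on lagged z.
Z_lag, Z_cur = z[:-1], z[1:]
A_hat = np.empty((2, 2))
for i in range(2):
    A_hat[i] = np.linalg.lstsq(Z_lag, Z_cur[:, i], rcond=None)[0]

# Equivalently, one multivariate least squares regression recovers
# the same coefficient matrix.
A_hat_mv = np.linalg.lstsq(Z_lag, Z_cur, rcond=None)[0].T
```

With each equation sharing the same right-hand-side variables, running the regressions one at a time or jointly is numerically identical.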

Stochastic approximation

Robbins and Monro (1951) considered the problem of finding a value of a vector $\alpha$ that solves the equation

$$E Q(z_t, \alpha) = 0,$$

where $Q$ is a function that is decreasing in $\alpha$, and $\{z_t\}$ is a sequence of vectors of random variables $z_t$ drawn from some sequence of probability distributions $F_t(z^t)$, where $z^t = \{z_t, z_{t-1}, \ldots, z_0\}$.

The stochastic approximation algorithm for computing a sequence of estimates $\alpha_t$ of the value $\alpha^*$ that solves (9) is

$$\alpha_t = \alpha_{t-1} + \gamma_t Q(z_t, \alpha_{t-1}),$$

where $\{\gamma_t\}$ is a nonincreasing sequence of positive numbers satisfying

$$\lim_{t \to \infty} t \gamma_t = 1.$$

Robbins and Monro (1951) and their followers described conditions under which

$$\lim_{t \to \infty} \alpha_t = \alpha^*,$$

where $\alpha^*$ solves either $E Q(z_t, \alpha^*) = 0$ (in the case that $\{z_t\}$ is drawn from a stationary distribution) or $\lim_{t \to \infty} E Q(z_t, \alpha^*) = 0$ (in the case that $\{z_t\}$ is asymptotically stationary).[2]

It has been discovered that the limiting behavior of a sequence $\{\alpha_t\}$ determined by stochastic difference equation (10) is described by an associated differential equation,

$$\frac{d}{d\tau}\alpha_\tau = E Q(z, \alpha_\tau),$$

where $E Q(z, \alpha)$ is the expected value of $Q(z, \alpha)$, evaluated with respect to the asymptotic stationary distribution of $\{z_t\}$.

A heuristic justification for (13) notes that for large values of $t$ algorithm (10) is approximated by

$$\frac{d}{dt}\alpha \approx \frac{\alpha_t - \alpha_{t-1}}{1} \approx \frac{1}{t}\, E Q(z, \alpha_{t-1}),$$

where replacing $Q(z_t, \alpha_{t-1})$ in (10) with an expected value at a fixed $\alpha$, namely, $E(Q(z, \alpha))$, is justified by observing that, for large $t$, $\alpha_t \approx \alpha_{t-1}$, and that the randomness in $z$ will make its variation large relative to the variation in $\alpha_t$. Use the time transformation $\tau(t) = \log(t)$ to write this differential equation as

$$\frac{d}{d\tau}\alpha_\tau \approx E Q(z, \alpha_\tau),$$

which is the ordinary differential equation (13) to be used to approximate the limiting behavior of $\alpha_t$ in (10).

A recursive formulation of the least squares estimate of the mean $E z_t = \mu$ provides a simple example of stochastic approximation. The least squares estimate is the sample mean $\bar{z}_t = (1/t) \sum_{s=1}^t z_s$. Subtracting the sample mean at $t-1$ from both sides of this formula and rearranging gives

$$\bar{z}_t = \bar{z}_{t-1} + (1/t)(z_t - \bar{z}_{t-1}),$$

which is in the form of (10) with the ‘gain’ $\gamma_t = 1/t$. The usual initial condition for this equation is $\bar{z}_0 = 0$.[3],[4]
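The recursive form of the sample mean can be checked directly; the distribution and sample size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(loc=3.0, scale=1.0, size=1000)   # draws with mean mu = 3

# Recursive (gain 1/t) form of the sample mean, started from zbar_0 = 0.
zbar = 0.0
for t, zt in enumerate(z, start=1):
    zbar = zbar + (1.0 / t) * (zt - zbar)
```

After the last observation, the recursion reproduces the batch sample mean exactly (up to floating-point rounding), illustrating that the gain $1/t$ simply re-weights the running average.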

Recursive least squares

The stochastic approximation algorithm can be used to implement the least squares formulas recursively. Suppose that we set $z_t = (y_t, x_t)$, $\alpha = (\beta, R)$, $\gamma_t = 1/t$, and

$$Q(z_t, \alpha) = \begin{pmatrix} R^{-1} x_t (y_t - x_t' \beta) \\ x_t x_t' - R \end{pmatrix}.$$

Then the stochastic approximation scheme becomes

$$\begin{aligned} \beta_t &= \beta_{t-1} + \gamma_t R_t^{-1} x_t (y_t - x_t' \beta_{t-1}) \\ R_t &= R_{t-1} + \gamma_t (x_t x_t' - R_{t-1}). \end{aligned}$$

Starting from appropriate initial conditions $(\beta_0, R_0)$, this is a method for calculating (5). Alternatively, it can be interpreted as a Bayesian procedure for updating estimates starting from a prior distribution summarized by $(\beta_0, R_0)$.
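The recursions can be sketched as follows. To keep $R_t$ invertible from the very first observation, this sketch shifts the gain to $1/(t+1)$ and takes $R_0 = I$, $\beta_0 = 0$ as illustrative initial conditions (so the recursion behaves like updating from a weak prior rather than reproducing ordinary least squares exactly).

```python
import numpy as np

rng = np.random.default_rng(3)

T, k = 2000, 2
beta_true = np.array([1.0, -0.5])            # illustrative true coefficients
x = rng.normal(size=(T, k))
y = x @ beta_true + 0.1 * rng.normal(size=T)

# Recursive least squares: update R_t first, then beta_t using R_t.
beta = np.zeros(k)                           # beta_0 (acts like a prior mean)
R = np.eye(k)                                # R_0 (acts like a prior precision)
for t in range(T):
    gamma = 1.0 / (t + 2)                    # gain 1/(t+1), time starting at 1
    R = R + gamma * (np.outer(x[t], x[t]) - R)
    beta = beta + gamma * np.linalg.solve(R, x[t] * (y[t] - x[t] @ beta))

# Batch OLS for comparison.
beta_ols = np.linalg.lstsq(x, y, rcond=None)[0]
```

With this initialization the recursive estimate differs from batch OLS only through the (vanishing) influence of the initial conditions, so the two agree closely in a long sample.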

Least squares as stochastic Newton procedures

Sometimes we want to minimize a function $V(\theta)$ with respect to $\theta$. The gradient descent method iteratively chooses $\theta_k$ according to

$$\theta_k = \theta_{k-1} - \gamma_k V'(\theta_{k-1})$$

for some positive step-size sequence $\{\gamma_k\}$. Newton’s method chooses $\{\theta_k\}$ according to

$$\theta_k = \theta_{k-1} - \gamma_k V''(\theta_{k-1})^{-1} V'(\theta_{k-1}).$$

For the regression problem, we choose $\theta = \beta$ and $V(\beta) = 0.5\,E(y_t - \beta' x_t)^2$. Since $V'(\beta) = -E x_t(y_t - x_t'\beta)$ and $V''(\beta) = E x_t x_t'$, gradient descent and Newton’s method become, respectively,

$$\beta_k = \beta_{k-1} + \gamma_k E(x_t y_t - x_t x_t' \beta_{k-1})$$

$$\beta_k = \beta_{k-1} + \gamma_k E(x_t x_t')^{-1} E(x_t y_t - x_t x_t' \beta_{k-1}).$$

Notice that, with $\gamma_k = 1$ for all $k$, Newton’s method converges in one step to the population least squares vector $\beta$ given by (3).

A comparison of the population formula (2) with the recursive least squares formula (18) motivates the interpretation of (18) as a stochastic Newton algorithm.
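The one-step property of Newton’s method for the quadratic regression objective can be verified numerically. Since population moments are unavailable, this sketch substitutes sample moments for $E x_t x_t'$ and $E x_t y_t$; the simulated data and starting point are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

T, k = 1000, 3
beta_star = np.array([0.3, -1.0, 2.0])       # illustrative coefficients
x = rng.normal(size=(T, k))
y = x @ beta_star + 0.2 * rng.normal(size=T)

# Sample moments standing in for E x x' and E x y.
Exx = x.T @ x / T
Exy = x.T @ y / T

def newton_step(beta, gamma=1.0):
    # beta_k = beta_{k-1} + gamma (E x x')^{-1} (E x y - E x x' beta_{k-1})
    return beta + gamma * np.linalg.solve(Exx, Exy - Exx @ beta)

beta0 = rng.normal(size=k)                   # arbitrary starting point
beta1 = newton_step(beta0)                   # one full Newton step
beta_ls = np.linalg.solve(Exx, Exy)          # least squares beta from (3)
```

Because the objective is exactly quadratic, a full Newton step ($\gamma_k = 1$) lands on the least squares vector from any starting point, and further steps do not move.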

Nonlinear least squares

Suppose that we want to fit the nonlinear regression model

$$y_t = g(x_t, \beta) + \epsilon_t,$$

using the sample $\{y_t, x_t\}_{t=1}^T$. In population, our problem is to choose $\beta$ to minimize

$$V(\beta) = 0.5\,E(y_t - g(x_t, \beta))^2 = 0.5\,E \epsilon_t(\beta)^2,$$

where $\epsilon_t(\beta) = y_t - g(x_t, \beta)$. In this problem, the least squares orthogonality condition is

$$E \psi_t(\beta)\epsilon_t(\beta) = 0,$$

where $\psi_t(\beta) \equiv -\nabla \epsilon_t(\beta)$ is the gradient of $g(x_t, \beta)$ with respect to $\beta$. Various recursive algorithms are designed to find solutions of (25). Stochastic gradient algorithms iterate on

$$\beta_t = \beta_{t-1} + \gamma_t \psi_t(\beta_{t-1}) \epsilon_t(\beta_{t-1}).$$

Stochastic Newton algorithms iterate on versions of

$$\begin{aligned} \beta_t &= \beta_{t-1} + \gamma_t R_t^{-1} \psi_t(\beta_{t-1}) \epsilon_t(\beta_{t-1}) \\ R_t &= R_{t-1} + \gamma_t (\psi_t(\beta_{t-1}) \psi_t(\beta_{t-1})' - R_{t-1}). \end{aligned}$$

Unlike the linear case, for nonlinear regressions these algorithms are not equivalent to corresponding ‘off-line’ algorithms.[5],[6]
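A stochastic gradient iteration can be sketched on a toy nonlinear model. The choices here — $g(x, b) = \exp(bx)$ with a scalar parameter, a Uniform(0, 1) regressor, and gain $\gamma_t = 1/t$ — are illustrative assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(5)

# Nonlinear model y_t = g(x_t, b) + eps_t with g(x, b) = exp(b * x).
b_true = 0.5
T = 20000
x = rng.uniform(0.0, 1.0, size=T)
y = np.exp(b_true * x) + 0.05 * rng.normal(size=T)

# Stochastic gradient: b_t = b_{t-1} + gamma_t * psi_t * eps_t,
# with psi_t = dg/db = x_t exp(b x_t) evaluated at the current estimate.
b = 0.0
for t in range(1, T + 1):
    g = np.exp(b * x[t - 1])
    psi = x[t - 1] * g                  # gradient of g with respect to b
    b = b + (1.0 / t) * psi * (y[t - 1] - g)
```

Each step nudges the parameter in the direction that reduces the current squared residual, and the shrinking gain averages out the noise over time.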

Classification

Classification with known moments

Following Fisher (1936), least squares regression can be used to find a linear discriminant function for determining to which of two predetermined classes an individual belongs.[7] We are given two populations, $x_1 \in X_1$ and $x_2 \in X_2$, of $k \times 1$ random vectors, each with common covariance matrix $V$, but with different mean vectors $E x_1 = \mu_1$, $E x_2 = \mu_2$. Vectors $x$ will be drawn from a mixture of the two distributions, with equal probability. Our task is to find a rule for classifying an $x$ that is randomly drawn from this mixture of populations $X_1$ and $X_2$, i.e., we want to say whether $x$ is from $X_1$ or from $X_2$. Note that the classification into $X_1$ and $X_2$ is given. For now, we assume that the means $\mu_1$ and $\mu_2$ and the common covariance matrix $V$ are known.

A solution of this problem is attained with the linear discriminant function. We want to find a linear function $\beta' x - \beta_0$, where $\beta$ is a $k \times 1$ vector and $\beta_0$ is a scalar, so that our decisions can be made according to the rule

$$\text{if } \beta' x - \beta_0 \ge 0, \text{ then } x \text{ is a member of } X_1; \quad \text{if } \beta' x - \beta_0 < 0, \text{ then } x \text{ is a member of } X_2.$$

For a given variance of the random variable $\beta' x$, which equals $\beta' V \beta$ and can be interpreted as ‘variance within a population,’ we want to choose $\beta$ to separate the two populations as much as possible. Discrepancy between the two populations is to be measured by the criterion $\beta'(\mu_1 - \mu_2)$. Our goal is to choose $\beta$ to maximize $\beta'(\mu_1 - \mu_2)$, subject to $\beta' V \beta = c$, where $c > 0$ is a constant. The maximizing value of $\beta$ is

$$\beta = \lambda^{-1} V^{-1}(\mu_1 - \mu_2),$$

where $\lambda$ is a Lagrange multiplier on the constraint, which can be set equal to one (which amounts to a choice of the variance $c$ in the constraint).

For a sample equally likely to be drawn from populations $X_1$ and $X_2$, the expected value $E \beta' x$ is $\beta'(\mu_1 + \mu_2)/2$. For the discriminant function, we therefore choose $\beta_0$ in (28) according to $\beta_0 = \beta'(\mu_1 + \mu_2)/2$.

When $X_1$ and $X_2$ are each multivariate normal, the discriminant function (28) has an interpretation in terms of a likelihood ratio test. This is because the log likelihood ratio can be represented as

$$x' V^{-1}(\mu_1 - \mu_2) - 0.5(\mu_1 + \mu_2)' V^{-1}(\mu_1 - \mu_2),$$

so that (28) can be read as stating that $x$ should be assigned to population $X_1$ whenever the likelihood ratio exceeds one (or the log likelihood ratio exceeds zero).[8]

Estimated parameters

When the means $(\mu_1, \mu_2)$ and covariance matrix $V$ are not known a priori, they are estimated by sample means and covariances, where the sample covariance is estimated by pooling observations across the $X_1$ and $X_2$ populations. Sample estimates are substituted into (29) to obtain the sample discriminant function.

Fisher (1936) showed that the linear discriminant function can be derived by a regression of a dummy variable on $x$. For a sample of $k \times 1$ vectors $x_t$, where observations for $t = 1, \ldots, N_1$ are drawn from population $X_1$ and observations $t = N_1 + 1, \ldots, N_1 + N_2$ from population $X_2$, define $y_t = \frac{N_2}{N_1 + N_2}$ for $t = 1, \ldots, N_1$ and $y_t = -\frac{N_1}{N_1 + N_2}$ for $t = N_1 + 1, \ldots, N_1 + N_2$. Then the estimated linear discriminant function can be obtained from an ordinary least squares regression of $y_t$ on $x_t$ for this sample.
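The equivalence between the dummy-variable regression and the plug-in discriminant direction $V^{-1}(\mu_1 - \mu_2)$ can be checked numerically; the means, covariance matrix, and sample sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Two Gaussian populations with a common covariance and different means.
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
C = np.array([[1.0, 0.3], [0.3, 1.0]])          # common covariance V
N1, N2 = 300, 200
X1 = rng.multivariate_normal(mu1, C, size=N1)
X2 = rng.multivariate_normal(mu2, C, size=N2)

# Dummy dependent variable: N2/(N1+N2) for population 1, -N1/(N1+N2) for 2.
X = np.vstack([X1, X2])
y = np.concatenate([np.full(N1, N2 / (N1 + N2)),
                    np.full(N2, -N1 / (N1 + N2))])

# OLS of y on (1, x); the slope part is the estimated discriminant direction.
Xd = np.column_stack([np.ones(len(X)), X])
beta_reg = np.linalg.lstsq(Xd, y, rcond=None)[0][1:]

# Plug-in direction: pooled within-class covariance times mean difference.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
V_pooled = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / (N1 + N2 - 2)
beta_lda = np.linalg.solve(V_pooled, m1 - m2)

def unit(v):
    return v / np.linalg.norm(v)
```

The regression slope and the plug-in discriminant coincide up to a scale factor, so they define the same classification boundary direction.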

Principal components analysis

Population theory

Let $x_t$ again be a $k \times 1$ random vector with second moment matrix $V = E x_t x_t'$. The method of principal components analysis is based on the eigenvector decomposition of $V$, namely, $V = PDP^{-1}$, where $P$ is an orthogonal matrix whose columns are eigenvectors of $V$, and $D$ is the corresponding diagonal matrix of eigenvalues of $V$. This decomposition of $V$ induces a transformation of the $k \times 1$ vector $x_t$ into a $k \times 1$ vector $z_t = P' x_t$ with the following properties:

The first principal component is the linear combination of the $x_t$’s (with the norm of the weights constrained to 1) with the most variance, while the second component is the linear combination (orthogonal to the first one) with the next highest variance, and so on.

Thus, in principal components analysis, we seek a $z_t$ that satisfies

$$z_t = P' x_t,$$

where the components of $z_t$ are mutually orthogonal, so that $E z_t z_t' = D$ is a diagonal matrix; and where successive rows of $P'$ are orthogonal and of unit norm, so that $P'P = I$. The eigenvector decomposition of the covariance matrix $V$ of $x_t$ delivers the appropriate linear transformation $P$ of $x_t$.

The eigenvector $p_1$ that is associated with the largest eigenvalue of $D$ is called the first principal component of $x_t$. This eigenvector solves the problem of maximizing over $p_1$ the second moment $E(p_1' x)^2 = p_1' V p_1$, subject to the unit norm side condition $p_1' p_1 = 1$. The first-order necessary condition for this problem is

$$(V - dI)p_1 = 0,$$

where $d$ is the Lagrange multiplier on the constraint. Evidently, $E(p_1' x)^2 = p_1' V p_1 = d \|p_1\|^2 = d$ is maximized by choosing $d$ to be the largest eigenvalue of $V$ and $p_1$ to be the associated eigenvector. The second principal component maximizes $E(p_2' x)^2$ subject to $p_2' p_2 = 1$ and $p_1' p_2 = 0$, and so on. Furthermore, $d_i$ is the second moment of $z_{it} = p_i' x_t$.

Data reduction

Sometimes principal components analysis is used for building linear models designed to summarize the most important source of (generalized) variance within a data set $x_t$. For example, in economic time series data, the first one or two principal components often account for a dominant proportion of variance. The first principal component is a linear combination of the data along which most of the variation occurs.[9]

Estimation

Estimation of principal components proceeds by substituting the sample moment matrix $T^{-1} \sum_{t=1}^T x_t x_t'$ for $E x_t x_t'$. To estimate principal components, one simply computes the eigenvalues and normalized eigenvectors of the sample moment matrix.[10]
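The whole procedure — sample moment matrix, eigendecomposition, transformed data $z_t = P' x_t$ — can be sketched as follows. The data-generating process, with one dominant direction of variation, is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data whose generalized variance is dominated by one direction.
T, k = 5000, 3
f = rng.normal(scale=3.0, size=T)
load = np.array([0.8, 0.5, 0.3])                 # illustrative direction
x = np.outer(f, load) + rng.normal(scale=0.5, size=(T, k))

# Sample second moment matrix and its eigendecomposition V = P D P'.
V = x.T @ x / T
eigvals, P = np.linalg.eigh(V)                   # ascending order
order = np.argsort(eigvals)[::-1]                # largest eigenvalue first
eigvals, P = eigvals[order], P[:, order]

# z_t = P' x_t has mutually orthogonal components whose second moments
# are the eigenvalues on the diagonal of D.
z = x @ P
D = z.T @ z / T
```

The diagonal of `D` reproduces the eigenvalues, confirming that the transformed components are uncorrelated in the sample and ordered by variance.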

Factor analysis

Factor analysis represents the covariance within a $k \times 1$ vector $x_t$ of observables in terms of their mutual dependence on a smaller $\ell \times 1$ vector $f_t$ of hidden ‘factors,’ where $\ell \ll k$. The second-moment matrix $V = E x_t x_t'$ is restricted to be the sum of a matrix of rank $\ell$ and a diagonal matrix

$$V = LL' + D,$$

where $L$ is a $(k \times \ell)$ matrix and $D$ is a $(k \times k)$ diagonal matrix. The model can also be represented as

$$x_t = L f_t + \epsilon_t,$$

where $E f_t f_t' = I_\ell$, the $(\ell \times \ell)$ identity matrix, $E \epsilon_t \epsilon_t' = D$, and $E f_t \epsilon_t' = 0$. The $(\ell \times 1)$ vector $f_t$ is composed of hidden factors, while the $(k \times 1)$ vector $\epsilon_t$ contains idiosyncratic noises.

The model asserts that all of the covariance within the $x_t$ vector is intermediated via the action of a much smaller number of hidden factors. A classic use of the model is interpreting students’ test scores. Here $x_t$ is a vector of student $t$’s scores on $k$ tests on various subjects, such as history, French, English, algebra, physics, and so on. It is posited that there are two hidden orthogonal factors, ‘mathematical intelligence’ and ‘verbal intelligence,’ that explain the structure of correlations among the test scores. A second example comes from the field of business cycle analysis, where it is possible to read Burns and Mitchell (1946) as asserting that there is one underlying factor called ‘business conditions’ or the ‘business cycle,’ dependence upon which intermediates most or all of the correlation among measures of economic activity at business cycle frequencies.[11]
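The restriction $V = LL' + D$ can be illustrated by simulating the factor representation directly; the loadings, idiosyncratic variances, and one-factor structure are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(8)

# A one-factor model x_t = L f_t + eps_t with illustrative loadings.
k, ell, T = 4, 1, 50000
L = np.array([[0.9], [0.8], [0.7], [0.6]])
D = np.diag([0.3, 0.25, 0.2, 0.35])          # idiosyncratic variances

f = rng.normal(size=(T, ell))                # E f f' = I
eps = rng.normal(size=(T, k)) @ np.sqrt(D)   # E eps eps' = D
x = f @ L.T + eps

# The model-implied second-moment matrix versus its sample counterpart.
V_model = L @ L.T + D
V_sample = x.T @ x / T
```

All off-diagonal covariance among the observables is carried by the single factor, so the sample second-moment matrix converges to the rank-one-plus-diagonal structure $LL' + D$.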

For a given sample $\{x_t\}_{t=1}^T$, let $S_T = \sum_{t=1}^T x_t x_t'/T$. For a Gaussian likelihood function, maximum likelihood estimation seeks values for $L, V$ that satisfy the normal equations

$$V^{-1}(V - S_T)V^{-1}L = 0$$

$$\operatorname{diag}\, V^{-1}(V - S_T)V^{-1} = 0.$$

See Jöreskog (1967) for efficient ‘off-line’ methods of solving these normal equations. In the spirit of stochastic approximation, one might use the ‘on-line’ algorithm

$$\begin{aligned} L_{t+1} &= L_t + (1/t)\left(V_t^{-1}(V_t - x_t x_t')V_t^{-1}L_t\right) \\ V_{t+1} &= V_t + (1/t)\,\operatorname{diag}\left(V_t^{-1}(V_t - x_t x_t')V_t^{-1}\right) \\ D_t &= V_t - L_t L_t'. \end{aligned}$$

Overfitting and choice of parameterization

Economists are familiar with the phenomenon of overfitting. The term describes a circumstance in which a researcher gets a good fit for the data set in hand by estimating so many parameters that out of sample the model does much worse than an alternative model that fits fewer parameters, thereby giving up some ability to fit the sample data in exchange for increased precision of the estimated parameters.[12] Figure 1 and Figure 2 show a standard example in which two ‘wrong’ models (polynomials in time) are fitted to a random walk (i.e., a $y_t$ process $y_t = y_{t-1} + \epsilon_t$, where $\epsilon_t$ is a serially uncorrelated Gaussian variable). Evidently, the higher-order model fits much better within the sample, but will do a much worse job of predicting $y$ if it is extrapolated.

Figure 1: An eighth-order polynomial in time fitted to 21 observations on a random walk.

Figure 2: A first-order polynomial in time fitted to 21 observations on a random walk.
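The experiment behind the two figures can be reproduced in a few lines; the seed and the particular sample path are of course arbitrary, and the out-of-sample comparison uses a fresh continuation of the walk as an illustrative assumption.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(9)

# A random walk y_t = y_{t-1} + eps_t observed for 21 periods.
T = 21
y = np.cumsum(rng.normal(size=T))
t = np.arange(T)

# Fit first- and eighth-order polynomials in time by least squares.
p1 = Polynomial.fit(t, y, deg=1)
p8 = Polynomial.fit(t, y, deg=8)

# In-sample residual sums of squares: the richer model always fits better.
rss1 = np.sum((y - p1(t)) ** 2)
rss8 = np.sum((y - p8(t)) ** 2)

# Out of sample, extrapolate both fits against a fresh continuation of
# the walk; the high-order polynomial typically diverges badly.
t_new = np.arange(T, 2 * T)
y_new = y[-1] + np.cumsum(rng.normal(size=T))
mse1 = np.mean((y_new - p1(t_new)) ** 2)
mse8 = np.mean((y_new - p8(t_new)) ** 2)
```

Because the first-order polynomial is nested in the eighth-order one, the in-sample ranking is guaranteed; the out-of-sample ranking reverses on typical draws, which is the overfitting phenomenon the figures depict.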

Parameterizing vector autoregressions

The autoregressive representation has too many parameters to be useful for applications to the short data sets that economists work with. It imposes little more than that the $C_z(k)$ are well defined, i.e., that the process is covariance stationary. In terms of describing a data-generating mechanism, the representation affords no economies in terms of numbers of parameters vis-à-vis either the entire list of covariance matrices $C_z(k)$ or, equivalently, the spectral density $S_z(\omega) = \sum_{k=-\infty}^{\infty} C_z(k) \exp(-i\omega k)$. (Even though it provides no economies in terms of representation, the vector autoregression might provide insights.)

Applying vector autoregressions requires adopting special versions in which the numbers of parameters are kept small relative to the length of the data sets being studied. Economists have devised differing specializations designed to render the vector autoregressive model applicable.

Rational expectations macroeconomists and econometricians have devised one set of procedures for reducing the number of parameters. In addition to restricting the lag length (instead of fitting an infinite-order vector autoregression, they fit models in which $z_t$ depends on only $m$ lagged values of itself), they typically restrict the number of coefficients that describe the cross-variable dynamics. They use theories that imply a particular class of functions $A_j = A_j(\theta)$ expressing each coefficient matrix in the vector autoregression as a function of a much smaller number of free parameters $\theta$. These parameters are interpreted as describing the preferences, technologies, and information sets of the agents whose behavior is determining $z_t$. Such rational expectations models typically claim to be complete and fully interpreted models in the sense that they purport to describe the covariation through time of all of the variables modelled in ways consistent with general equilibrium theory, with the parameters being economically interpretable as preference, technology, or information parameters.[13]

Sims, Litterman, and their co-workers have invented a different way of coping with the overfitting problem. In principle, their procedures do not restrict the number of parameters (beyond their use of some restrictions on lag lengths), but work by heavily exploiting some of the computational features associated with a recursive form of estimation that is applicable to vector autoregressions. Rather than adopting a function $A_j = A_j(\theta)$ as the rational expectations modellers do, Sims and Litterman carry along all of the elements of the $A_j$’s as free parameters, but restrict the initial coefficients and covariance matrix. Then they update all of the coefficients via recursive least squares. Evidently, their estimation procedures (or sometimes important parts of them) have interpretations as stochastic approximation algorithms.[14] Litterman and Sims have devoted much effort to devising specifications of these initial conditions that are designed to forecast macroeconomic time series well out of the sample used to estimate the parameters. Litterman and Sims’s choice of these initial conditions has involved less formal use of economic theory than that used by rational expectations econometricians in deducing their functions $A_j = A_j(\theta)$.[15]

Statistics for bounded rationality

The boundedly rational agents that we shall put into our example economic environments will all use versions of the stochastic approximation algorithm to estimate decision functions or parameters. This strategy for modelling boundedly rational agents starts from the observation that first-order conditions for stochastic estimation and optimization take exactly the form of the equation being solved by stochastic approximation, namely,

$$E Q(z_t, \alpha) = 0.$$

Further, stochastic approximation is designed to solve this equation under the conditions of limited knowledge (i.e., insufficient knowledge to use the differential calculus or the expectational calculus) with which we want to endow our artificially intelligent agents.

We shall see stochastic approximation algorithms applied over and over again, in superficially different contexts. Sometimes they will be used with versions of boundedly rational agents’ first-order conditions that come from estimation problems like those described in this chapter. Stochastic approximation algorithms will also be used where (9) is interpreted as the ‘Euler equation’ from a boundedly rational agent’s optimum problem.

Before handing stochastic approximation algorithms to our boundedly rational economic agents, we turn in the next chapter to survey some of the themes in the recent literature on neural networks and other forms of artificial intelligence. In reading this material, watch for stochastic approximation methods to make appearances.

Footnotes
  1. A third aspect of data interpretation is the study of the quality of approximation, in which an analyst studies the behavior of some estimates that would be appropriate under the assumption that model A is correct when in truth model B has actually generated the data set. Sims (1972), White (1982), and Hansen and Sargent (1993) have studied the issue of approximation in various contexts.

  2. Lennart Ljung (1977) and Ljung and Söderström (1983) have written extensively about the connection between stochastic approximation algorithms and some associated ordinary differential equations. Among economists, M. Aoki (1974) was one of the first to show the applicability of the stochastic approximation algorithm to study learning. See Ljung, Pflug, and Walk (1992) for a description of recent developments.

  3. Prior information about the mean can be represented by using an initial condition other than 0.

  4. For this estimator, the associated differential equation is $d/dt\,\bar{z} = \mu - \bar{z}$, whose solution is $\bar{z}(t) = \mu + \exp(-t)(\bar{z}(0) - \mu)$, which converges to $\mu$ for any initial value $\bar{z}(0)$.

  5. Kuan and White (1991) show that the recursive estimators are root-$T$ consistent and share the asymptotic distribution of (non-recursive) nonlinear least squares.

  6. In the estimation literature, ‘on-line’ algorithms refer to estimators that have the recursive structure of, for example, stochastic approximation algorithms, the estimator at $t$ being represented as a function of the estimator at $t-1$ and the data observed at $t$. ‘Off-line’ estimators have the property that the estimator at $t$ cannot be expressed in this way; instead, the estimator at $t$ must be written as a function of the entire sample of observations up to time $t$.

  7. See Kendall (1957).

  8. See Anderson (1958).

  9. That is, for such data, the covariance matrix is ‘ill-conditioned’.

  10. The method of Oja (1982) for grouping data amounts to a recursive implementation of principal components analysis.

  11. See Sargent and Sims (1977) for remarks about the history of this interpretation of Burns and Mitchell (1946), and for details about how the static factor analysis model can be made dynamic via its application in the frequency domain in a way to encompass this interpretation of Burns and Mitchell.

  12. Statisticians have dealt with the problem of overfitting by adopting criteria for choosing among parameterizations that penalize models with more parameters. The Schwarz Information Criterion (Schwarz 1978) is one widely used criterion; others are described by Rissanen (1989). Chung-Ming Kuan and Tung Liu (1991) apply and discuss some of these criteria in selecting among univariate models that are designed to predict future exchange rates. They report the results of employing the ‘Predictive Stochastic Complexity’ criterion of Rissanen. This criterion works as follows. Given a function $h(x, \theta)$ designed to ‘forecast’ $y$, and given a sample of $T$ observations, compute the mean square of so-called honest prediction errors, namely, $(T-k)^{-1}\sum_{t=k+1}^{T}(y_t - h(x_t, \hat{\theta}_{t-1}))^2$, where $\hat{\theta}_{t-1}$ is estimated using data only through time $t-1$. The model with the smallest value is the one selected by this criterion. Kuan and Liu find that it is difficult to pin down systematic nonlinearities that can be used to predict exchange rates better than by using a random walk theory of exchange rates.

  13. Work along the lines of Hansen and Sargent (1980, 1981) and Kydland and Prescott (1982) fits within this category.

  14. Doan, Litterman, and Sims (1984) describe a set of procedures for searching over some parameters (or hyperparameters) that pin down the initial conditions of the least squares recursions.

  15. Sargent and Sims (1977) and Litterman and Sargent experimented with a factor analytic method for restricting the dimensionality of the parameter space for vector autoregressions. They used the frequency domain version of the factor analysis model, which assumes that the spectral density matrix of $z_t$ can be written $S_z(\omega) = L(\omega)L(\omega)' + D(\omega)$, where $L(\omega)L(\omega)'$ is a matrix of rank $k$ for each $\omega$, and $D(\omega)$ is a diagonal matrix. This is equivalent to assuming that $z_t$ has the time domain representation $z_t = \sum_{j=-\infty}^{\infty} L_j f_{t-j} + u_t$, where $u_t$ is an $(n \times 1)$ vector process, each component of which is orthogonal at all leads and lags to each component of the $(k \times 1)$ vector of hidden factors $f_t$. Sargent and Sims (1977) and Litterman, Quah, and Sargent (1984) used this model with $k = 1$ to represent and evaluate Burns and Mitchell’s ideas about the business ‘reference cycle.’