
4 Networks and Artificial Intelligence

This chapter describes some artificial devices that display ‘intelligence.’ Some of them memorize and recall patterns. Others represent and learn nonlinear decision rules. Most of the devices come from the recent ‘connectionist’ literature on neural networks. As Halbert White has taught, a student of econometrics will recognize many parallels between the connectionist literature and his own, in terms of problems and methods. The connections between the literatures on neural networks and econometrics provide additional perspective on our characterization of bounded rationality as a program to populate models with devices that mimic econometricians.[1]

The perceptron

The simplest neural network is a single-layer perceptron, a model of the interaction of $k$ input neurons $x_{it}$, $i = 1, \ldots, k$, with one output neuron $y_t$. The neurons are elements $x_i \in X$ and $y \in Y$, where the spaces $X, Y$ can be specified in various ways. For example, for ‘Ising neurons,’ $X = \{-1, 1\}$; for classification perceptrons we take $Y = \{0, 1\}$. The perceptron model is

$$y_t = S\left(\sum_{i=1}^k w_i x_{it}\right)$$

or

$$y_t = S(w' x_t),$$

where $w$ and $x_t$ are each $(k \times 1)$ vectors, and $S$ is a ‘squasher’ function, i.e. a monotonically nondecreasing function that maps $\mathbb{R}$ onto $[0, 1]$. Three popular squasher functions are:

  1. The Heaviside step function:

$$S(z) = \begin{cases} 1, & \text{for } z \ge 0; \\ 0, & \text{for } z < 0. \end{cases}$$

  2. The ‘sigmoid function’:

$$S(z) = \frac{1}{1 + e^{-z}}.$$

  3. Any cumulative distribution function, e.g. the normal c.d.f.:

$$S(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{z} \exp\left(\frac{-x^2}{2\sigma^2}\right) dx.$$

A perceptron is depicted graphically in Figure 1. Input $x_{it}$ is weighted by $w_i$, inputs are summed across $i$, then the sum is squashed to yield output $y_t$. With the Heaviside squasher function, the neuron is either ‘on’ ($y_t = 1$) or ‘off’ ($y_t = 0$). For non-negative inputs, a positive weight $w_i$ from input $i$ to the neuron is called an ‘exciting’ connection, and a negative weight is called an ‘inhibiting’ connection, because such connections make it more or less likely that the neuron will ‘fire.’


Figure 1: A perceptron. Values of two inputs $x_{it}$ for $i = 1, 2$ are multiplied by $w_i$; the results are added together, then operated upon with the ‘squasher’ function $S$ to produce the output $y_t$.
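As a concrete illustration, the perceptron computation just described can be sketched in a few lines of Python. The weights, inputs, and function names below are illustrative assumptions, not taken from the text.

```python
import numpy as np

def heaviside(z):
    """Heaviside squasher: 1 if z >= 0, else 0."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Sigmoid squasher: maps the reals into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(w, x, squasher=heaviside):
    """Weight the inputs, sum them, then squash: y = S(w'x)."""
    return squasher(np.dot(w, x))

# Illustrative weights and inputs (k = 2).
w = np.array([0.8, -0.5])
x = np.array([1.0, 1.0])
y = perceptron(w, x)        # heaviside(0.3) = 1.0: the neuron 'fires'
```

Swapping `sigmoid` for `heaviside` turns the hard on/off output into a value in (0, 1).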

The perceptron as classifier

For a given set of weights $w$, a perceptron acts as a classifier. For example, let there be two classes of people, American football players ($y = 1$) and economists ($y = 0$). Let $x_t$ be a vector of characteristics, say weight and salary, of individual $t$.[2] When we present the perceptron with $x_t$ for economist $t$, we want the perceptron to emit output $y_t = 0$, and when we input the characteristics of a football player, we want the perceptron to say $y_t = 1$. With the Heaviside step function as the squasher, the neuron ‘fires’ ($y_t = 1$) if and only if $w' x_t \ge 0$. With one of the other two squasher functions above, the value of $y_t$ could be interpreted as the probability that an individual is a football player.

Perceptron training

Suppose that we have observations on $k$ characteristics of two distinct predetermined populations, again say economists, whom we have labelled $y_t = 0$, and football players, whom we have labelled $y_t = 1$. We want to fit the perceptron to these data, which means that we want to estimate the weight vector $w$. We have a ‘training sample’ $\{y_t, x_t\}_{t=1}^T$, where $t$ denotes a particular individual. The perceptron training problem is to choose $w$ to minimize $\sum_t (y_t - S(w' x_t))^2$. This is a nonlinear least squares problem. The literature on perceptrons describes various algorithms of the iterative form

$$w_t = w_{t-1} + \gamma_t \nabla S(w_{t-1}, x_t)\,(y_t - S(w_{t-1}' x_t)),$$

where $\{\gamma_t\}$ is a nonincreasing sequence of positive numbers, and $\nabla S(w, x)$ is the gradient of $S$ with respect to $w$. One scheme is to pick $\gamma_t$ equal to a small positive constant, and repeatedly to run the sample $\{y_t, x_t\}_{t=1}^T$ through the algorithm until convergence occurs. Implementations of nonlinear least squares found in the econometrics literature also apply directly.
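A minimal sketch of the training iteration above, assuming a sigmoid squasher (the Heaviside function is not differentiable), a constant small gamma, and synthetic data; all names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic training sample: label 1 when w_true'x > 0.
T, k = 200, 2
w_true = np.array([2.0, -1.0])
x = rng.normal(size=(T, k))
y = (x @ w_true > 0).astype(float)

# On-line iterations w_t = w_{t-1} + gamma * grad S * (y_t - S(w'x_t)).
# For the sigmoid, the gradient with respect to w is S(1 - S) x.
gamma = 0.5
w = np.zeros(k)
for _ in range(50):                 # repeated passes through the sample
    for t in range(T):
        s = sigmoid(w @ x[t])
        w += gamma * s * (1 - s) * x[t] * (y[t] - s)

# The fitted classifier should reproduce most training labels.
fitted = (sigmoid(x @ w) > 0.5).astype(float)
accuracy = (fitted == y).mean()
```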

Perceptrons and discriminant analysis

There is evidently a close connection between a perceptron and a simple linear discriminant function. Early work on the perceptron focused on determining which classes of objects could and could not be separated by a perceptron, with negative early results by Minsky and Papert (1969) contributing to a long period of disenchantment. Minsky and Papert’s negative judgement about the perceptron was based on its inability to represent a nonlinear discriminant function. Enthusiasm for perceptrons revived in the early 1980s when it was recognized that networks of perceptrons could approximate any nonlinear discriminant function.[3]

Feedforward neural networks with hidden units

The perceptron is the building block out of which many types of neural networks are constructed. By arranging banks of perceptrons into rows and linking elements of successive rows via weighted summation operators, we construct a feedforward neural network. Halbert White and various co-workers[4] have shown that feedforward neural networks are best regarded as approximators of nonlinear functions $g$ mapping vectors $x \in X$ into vectors $y \in Y$.

A feedforward neural network with one hidden layer is described by the two equations

$$\begin{aligned} y_t &= \theta_0 + \sum_{j=1}^q \theta_j a_{tj} \\ a_{tj} &= S\left(\sum_{i=1}^k w_{ji} x_{it}\right). \end{aligned}$$

The second equation describes the output $a_{tj}$ of hidden unit $j$, which is simply a perceptron. The first equation generates the output $y_t$ of the network by taking an affine function of the $(q \times 1)$ vector $a_t$ of outputs of the hidden units. The parameters of the network are the weights $w_{ji}$ and $\theta_j$. The network can be represented compactly as

$$y_t = \theta_0 + \sum_{j=1}^q \theta_j S(w_j' x_t).$$

An example of such a network is depicted in Figure 2.


Figure 2: A feedforward neural network. The network takes inputs $x_i$, multiplies them with weights $w_{ji}$ and adds over $i$ to get the input into hidden node $j$, operates on the input $\sum_i w_{ji} x_i$ into each hidden node $j$ with the ‘squasher’ function $S$, then multiplies with $\theta_j$ and adds to get the output $y$.
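The compact representation above can be sketched as follows; the parameter values are illustrative, not from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(theta0, theta, w, x):
    """One-hidden-layer network: y = theta0 + sum_j theta_j * S(w_j'x).

    w is (q x k): row j holds the weights w_{ji} into hidden unit j;
    theta is (q,): the output weights on the hidden units.
    """
    a = sigmoid(w @ x)          # outputs of the q hidden units
    return theta0 + theta @ a

# Illustrative parameters: q = 2 hidden units, k = 2 inputs.
theta0 = 0.1
theta = np.array([1.0, -1.0])
w = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y = feedforward(theta0, theta, w, np.array([0.0, 0.0]))
# At x = 0 both hidden units output S(0) = 0.5, so y = 0.1 + 0.5 - 0.5 = 0.1.
```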

The literature has addressed two issues about models of this class: the issue of approximation or representation, and the issue of estimation. The approximation literature describes the class of functions $g : X \to Y$ that can be arbitrarily well approximated by model (8). Hornik, Stinchcombe, and White (1989) have shown that a very wide class of functions can be approximated by (8).[5] The parameter that controls the quality of approximation is $q$, the number of hidden units. For a given $q$, the best mean-square approximator to a function $g(x)$ is determined by the values of $(\theta, w)$ that minimize the squared norm

$$\left\|g(x) - \theta_0 - \sum_{j=1}^q \theta_j S(w_j' x)\right\|^2.$$

Here $\|\cdot\|^2$ is the $L_2$ norm. For a given $q$, the best approximator can be found by variations on Newton’s method. The approximation literature comforts us by assuring us that, if we select $q$ large enough, we can find a $(\theta, w)$ that makes this mean squared error as small as we might want.

The estimation problem occurs when we are given a sample $\{y_t, x_t\}_{t=1}^T$ and a particular model of the form (8), with fixed $q$, and want to estimate the parameters. This is a version of the nonlinear regression problem. The model to be estimated can be written in the form

$$y_t = g(x_t, \beta) + \epsilon_t,$$

where we have stacked the parameters $(\theta, w)$ into the vector $\beta$. Estimation can proceed by utilizing one of the algorithms based on equations (26) or (27) of the previous chapter. The ‘training’ algorithms discussed in the literature all use versions of these iterations. There exist examples of such ‘on-line’ algorithms that have asymptotic properties equivalent to ‘off-line’ algorithms. However, for small samples one can generally do better by using an ‘off-line’ algorithm.

Recurrent networks

The second equation of (7) can be modified in a way that lets us model dynamics. For example, we can specify

$$a_{tj} = S(w_j' x_t + \delta_j' a_{t-1}),$$

where $a_{t-1}$ is the vector of values of $a_{t-1,j}$, and $\delta_j$ is a vector of additional parameters. This kind of network has been used by Elman (1988), and captures feedback to the hidden units from past values of hidden units. Alternatively, we could also specify

$$a_{tj} = S(w_j' x_t + \gamma_j' y_{t-1}),$$

where $\gamma_j$ is a vector of parameters. In the special case that $w_j = 0$ for all $j$, this is an autonomous dynamic system, one that can be used to represent the systematic part of a nonlinear vector autoregression; this specification is a version of a Hopfield network, to which we now turn.
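A sketch of one step of the Elman-style recursion above, under assumed illustrative shapes and weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_step(w, delta, x_t, a_prev):
    """One step of a_{tj} = S(w_j'x_t + delta_j'a_{t-1}).

    w is (q x k), delta is (q x q), and a_prev is the (q,) vector of
    lagged hidden-unit outputs.
    """
    return sigmoid(w @ x_t + delta @ a_prev)

# Illustrative shapes: q = 2 hidden units, k = 2 inputs.
w = np.array([[0.5, -0.5],
              [1.0,  0.0]])
delta = np.array([[0.2, 0.0],
                  [0.0, 0.2]])
a = np.zeros(2)
for x_t in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    a = elman_step(w, delta, x_t, a)    # hidden state carries history forward
```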

Associative memory

This section describes a class of networks that store memorized patterns and solve some signal extraction problems. The simplest version of such a network is a dynamic system that has locally stable rest points at a predetermined number of memorized patterns. The network is designed so that, when a corrupted version of one of these patterns is injected, the network quickly converges to the closest memorized pattern.[6]

Autoassociation

Suppose that patterns are represented as vectors of length $N$ written in the binary alphabet $\{1, -1\}$, so that a particular pattern is expressed as a vector of 1’s and -1’s of length $N$. There are $p$ patterns that we want memorized. Let these patterns be represented by the vectors $\sigma^1, \sigma^2, \ldots, \sigma^p$, which we arrange into the $(N \times p)$ matrix $\sigma$. We assume that the patterns are orthogonal, by which we mean that $\sigma'\sigma = N \cdot I_p$, where $I_p$ is the $p \times p$ identity matrix. It is assumed that $p$ is small relative to $N$. The $p$ patterns correspond to ‘Platonic ideals.’

When the system is presented with an error-ridden signal in the form of a vector $s$ of length $N$ with elements in $\{-1, 1\}$, we want to know which pattern $\sigma^\mu$ is the closest to the signal, as measured by the distance

$$H_\mu(s) = \sum_{i=1}^{N} (\sigma_i^\mu - s_i)^2, \quad \mu = 1, 2, \ldots, p.$$

That is, we want to know the value of $\mu \in (1, 2, \ldots, p)$ for which the distance given by (13) is smallest. One procedure is simply to compute $H_\mu(s)$ for each value of $\mu$, and then select the one yielding the lowest value of $H_\mu$. For large values of $p$, this can be a time-consuming process.
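The brute-force procedure just described can be sketched directly; the patterns below are illustrative.

```python
import numpy as np

def nearest_pattern(sigma, s):
    """Return the index mu minimizing H_mu(s) = sum_i (sigma_i^mu - s_i)^2.

    sigma is (N x p) with the memorized patterns as columns; s is (N,).
    """
    distances = ((sigma - s[:, None]) ** 2).sum(axis=0)
    return int(np.argmin(distances))

# Two orthogonal patterns of length N = 4 in the alphabet {-1, 1}.
sigma = np.array([[ 1,  1],
                  [ 1, -1],
                  [ 1,  1],
                  [ 1, -1]])
s = np.array([1, 1, 1, 1])          # closer to the first pattern
mu = nearest_pattern(sigma, s)
```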

The idea of associative memory systems is to create a dynamic system

$$s(t+1) = f(s(t))$$

with the following two properties:

(a) Every memorized pattern is a fixed point of $f$:

$$\sigma^k = f(\sigma^k), \quad k = 1, \ldots, p.$$

(b) Every $\sigma^k$ is locally stable. That is, around each $\sigma^k$, there is a neighborhood $N(\sigma^k)$ such that, for $s_0 \in N(\sigma^k)$, iterations on $f$ starting from $s_0$ converge to $\sigma^k$.

If such a dynamic system can be crafted, ‘recalling from memory’ is just the process of iterating on ff starting from a presented pattern. We want the convergence to occur rapidly.

Figure 3 portrays a ‘phase diagram’ of a Hopfield network with four memorized patterns. The arrows in the diagram indicate how the network is locally stable about each of the memorized patterns.[7]


Figure 3: A Hopfield network. The network has four ‘memorized patterns’ that are attractors for the net’s dynamics.

It is easy to create a mapping $f$ with the desired properties. We use the Hopfield network

$$s(t+1) = \text{sgn}(w \, s(t)),$$

where $w$ is an $N \times N$ matrix and $\text{sgn}$ is the ‘sign’ function that maps real-valued vectors into vectors of binary numbers, mapping positive components into +1, negative components into -1. A common way to choose $w$ is according to Hebb’s rule:

$$w = \frac{1}{N} \sigma \sigma'.$$

It is straightforward to show that Hebb’s rule builds in properties (a) and (b). To verify property (a), let us input into the network the $k$th pattern, $\sigma^k$, where $k \in (1, \ldots, p)$. We want $\sigma^k$ to be output, which we have because

$$\sigma^k = \text{sgn}\left(\frac{1}{N}\sigma\sigma'\sigma^k\right) = \text{sgn}\left(\sigma\left(\frac{1}{N}\sigma'\sigma^k\right)\right) = \text{sgn}(\sigma e_k) = \text{sgn}(\sigma^k) = \sigma^k.$$

Here $e_k$ is the $p \times 1$ unit vector with 1 in the $k$th position, 0’s in all other positions.[8]

To study property (b), suppose that we input a noise-corrupted version of pattern $\sigma^k$, call it $s^k$, where $s^k$ agrees with $\sigma^k$ except for sign reversals at $n$ randomly selected positions. We set the dynamic system in motion starting from $s_0 = s^k$, and compute

$$s_1 = \text{sgn}\left(\frac{1}{N}\sigma\sigma' s^k\right) = \text{sgn}\left(\frac{1}{N}\sigma(\sigma' s^k)\right) = \text{sgn}\left(\sigma\frac{1}{N}\begin{bmatrix} r_1 \\ \vdots \\ r_{k-1} \\ N-2n \\ r_{k+1} \\ \vdots \\ r_p \end{bmatrix}\right),$$

where $r_j = \sum_i \sigma_i^j s_i^k$. The $r_j$ terms can be expected to be small because the pattern matrix $\sigma$ is orthogonal, so long as the number of corrupted bits $n$ is small. Thus we have

$$s_1 = \text{sgn}\left(\left(1 - \frac{2n}{N}\right)\sigma^k + \sum_{j \neq k} \frac{1}{N}\sigma^j r_j\right).$$

The terms $\sum_{j \neq k} \sigma^j r_j N^{-1}$ can be expected to have value on the order of $2\sqrt{n(p-1)/N}$, so that, as long as $p \ll N$ and $2n < N$, we have that

$$s_1 = \sigma^k,$$

indicating that convergence to the desired pattern occurs in one step.[9]

indicating that convergence to the desired pattern occurs in one step.[9]
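A minimal sketch of Hebb’s rule and one-step recall, using two illustrative orthogonal patterns with N = 8.

```python
import numpy as np

N = 8
# Two orthogonal patterns of length N = 8 (the columns of sigma),
# so that sigma' sigma = N * I_2.
sigma = np.array([[ 1,  1],
                  [ 1,  1],
                  [ 1, -1],
                  [ 1, -1],
                  [-1,  1],
                  [-1,  1],
                  [-1, -1],
                  [-1, -1]])
w = sigma @ sigma.T / N                 # Hebb's rule: w = sigma sigma' / N

def recall(s):
    """One step of s(t+1) = sgn(w s(t)); ties at 0 are sent to +1."""
    return np.where(w @ s >= 0, 1, -1)

# Corrupt pattern 0 at n = 1 position; recall converges in one step.
s = sigma[:, 0].copy()
s[3] = -s[3]
recovered = recall(s)                   # equals the uncorrupted pattern
```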

Correlated patterns

It is easy to modify Hebb’s rule to accommodate correlated (but linearly independent) patterns. We now drop our earlier assumption that $\sigma'\sigma = N \cdot I_p$, and replace it by the assumption that

$$\sigma'\sigma = NV,$$

where $V$ is a nonsingular matrix. We maintain system (16), but replace Hebb’s rule (17) with the so-called ‘projection rule’:

$$w = \frac{1}{N} \sigma V^{-1} \sigma'.$$

It is easy to check that with this modified rule the rest points of $\text{sgn}[w \, s(t)]$ are $\sigma^1, \sigma^2, \ldots, \sigma^p$, as described, and that the system is locally stable about these rest points. To check the existence of $\sigma^k$ as a fixed point, note that $w\sigma^k = \frac{1}{N}\sigma V^{-1}\sigma'\sigma^k = \sigma^k$.

Heteroassociation

Hopfield networks can be ‘wired’ to remember sequences of patterns. Let $[\sigma^1, \sigma^2, \ldots, \sigma^p]$ represent a temporal ordering of patterns. When $\sigma^k$ (or a corrupted version of it) is ‘input’ into the network, we want the ‘output’ from the network to be $\sigma^{k+1}$, where we understand that $\sigma^{p+1} = \sigma^1$, so that the memorized pattern is to be periodic. This is easily achieved by using a modified version of Hebb’s rule. Let $\sigma = (\sigma^1, \ldots, \sigma^p)$, $\sigma'\sigma = NI_p$, $\sigma_+ = (\sigma^2, \sigma^3, \ldots, \sigma^p, \sigma^1)$. Let

$$w = \frac{1}{N} \sigma_+ \sigma'.$$

Note that the system

$$s(t+1) = \text{sgn}(w \, s(t))$$

has the periodic solution $\sigma^{k+1} = \text{sgn}(w\sigma^k)$, for we have

$$\text{sgn}(w\sigma^k) = \text{sgn}\left(\frac{1}{N}\sigma_+(\sigma'\sigma^k)\right) = \text{sgn}(\sigma_+ e_k) = \text{sgn}(\sigma^{k+1}) = \sigma^{k+1},$$

where again $e_k$ is the unit vector with one in the $k$th place.
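A sketch of sequence recall under the modified Hebb rule, reusing two illustrative orthogonal patterns.

```python
import numpy as np

N = 8
# Two orthogonal patterns; recall them cyclically: pattern 0 -> 1 -> 0.
sigma = np.array([[ 1,  1],
                  [ 1,  1],
                  [ 1, -1],
                  [ 1, -1],
                  [-1,  1],
                  [-1,  1],
                  [-1, -1],
                  [-1, -1]])
sigma_plus = sigma[:, [1, 0]]       # columns advanced by one, wrapping around
w = sigma_plus @ sigma.T / N        # modified Hebb rule: w = sigma_+ sigma' / N

def step(s):
    return np.where(w @ s >= 0, 1, -1)

s1 = step(sigma[:, 0])              # outputs pattern 1
s2 = step(s1)                       # wraps back to pattern 0
```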

Memorized patterns as energy minimizers

Associated with Hebb’s rule (17) is the energy function[10]

$$E(s) = -\frac{1}{2N} (s'\sigma\sigma' s).$$

By construction, the stored patterns $\sigma^1, \sigma^2, \ldots, \sigma^p$ are each local minimizers of this function. In particular, note that

$$E(\sigma^k) = -\frac{N}{2}$$

for $\sigma^k$, $k = 1, \ldots, p$, and that, since $s'\sigma^k \le N$ for all $s$, we know that $\sigma^k$ is a local minimizer. For the ‘projection rule’ $w = \frac{1}{N}\sigma V^{-1}\sigma'$, we have the associated energy function

$$E(s) = -\frac{1}{2N} (s'\sigma V^{-1} \sigma' s).$$

Stochastic networks

Sometimes Hopfield networks are constructed that have spurious local rest points that the designer wants to avoid. When these rest points have small domains of attraction, they can be avoided by using a stochastic network. The idea is to modify the dynamics of the system by ‘shaking’ it at a rate that decreases over time. The purpose of shaking the system is to lower the probability that the system will come to rest at ‘undesirable’ local rest points and descend to better ones. Figure 4 and Figure 5 contain an example of what is hoped for. Introducing a random ‘shaking’ factor into Newton methods leads to what are called ‘simulated annealing’ methods.[11]

The dynamics of a network with neurons $v_i$, $i = 1, \ldots, N$, can be represented

$$v_{i,t+1} = g\left(\sum_{j} w_{ij} v_{jt} - \theta_i\right),$$

where $g$ is one of our ‘squasher’ functions. For example, we might use

$$g(x) = \tanh(x/T)$$

as a squasher function for ‘Ising neurons’ that we want to constrain to take on values in $[-1, 1]$.[12] The variable $T > 0$ is a ‘temperature’ variable that governs the sharpness of the squasher function.


Figure 4: Newton’s method. Motion of a ball rolling down a hill under force of gravity. The ball may come to rest at a local minimum.


Figure 5: A stochastic Newton method. Simulated annealing and ‘stochastic network’ methods ‘shake’ the system so that with probability arbitrarily close to one the ball will eventually escape being trapped in a local minimum.

Combinatorial optimization problems

Neural nets have been used to solve constrained optimization problems that can be represented in the form: find a configuration of neurons $s_i \in \{-1, 1\}$, $i = 1, \ldots, N$, to minimize

$$C(s) = -0.5\, s'ws + 0.5\,\alpha\, s's,$$

where $w$ is a given matrix encoding costs, and $\alpha$ is a penalty variable (or Lagrange multiplier). One approach to this problem would be to use the deterministic ‘local dynamics,’ i.e. those given by the Hopfield network,

$$s(t+1) = \text{sgn}((w - \alpha I)s(t)).$$

However, this algorithm typically fails to find a global minimum, and instead heads for the nearest local minimum.[13]

To get an algorithm with a better chance of descending to a global minimum, the deterministic local dynamics are sometimes replaced by stochastic ones. Let $p(s_i(t+1) = \pm 1)$ denote the probability that $s_i(t+1)$ is plus or minus 1. Then we can generate stochastic dynamics by setting

$$p(s_i(t+1) = \pm 1) = \frac{1}{1 + \exp(\mp 2(w_i's(t) - \alpha s_i(t))/T)},$$

where $T$ is a non-negative temperature variable. This stochastic transition law produces a stochastic process with a stationary distribution over states given by the Boltzmann distribution

$$\text{Prob}(s) = (1/Z) \exp(-C(s)/T),$$

where the normalization factor $Z$ is

$$Z = \sum_{s} \exp(-C(s)/T).$$

As $T$ approaches zero from above, this distribution concentrates more and more probability on the global minimum of the cost function (32).

The method of simulated annealing exploits this property. The algorithm works as follows. For a fixed initial $T = T_0 \gg 0$ and a given initial configuration of neuron values $s$, use the stochastic dynamics, say via simulation methods, to compute a sample from the stationary distribution for that fixed $T$. This represents a sample drawn from the Boltzmann distribution (35). Now define an iterative process in which the system is ‘annealed’; i.e., the temperature is gradually lowered, at a rate slow enough to permit the system approximately to settle into its stationary distribution at each temperature. In particular, lower the temperature, say according to $T_{k+1} = 0.9 T_k$, where $T_k$ is the temperature at the $(k+1)$th step; use the (approximate) stationary distribution from temperature $T_k$ as an initial distribution; and use the transition dynamics at this new temperature to get an estimate of the probability distribution (35) associated with this new temperature. Continue in this way, lowering the temperature toward zero. Notice that as $T$ approaches zero the probability given by (35) piles up near optimizing values of the state vector $s$.

This procedure is computationally intensive, but has the virtue of assuring convergence to a global minimum of the cost function. The computational expense occurs at the step of computing the stationary distributions (35) at each temperature.
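A minimal sketch of the annealing loop, assuming single-neuron Boltzmann updates of the form above, a geometric cooling schedule, and an illustrative cost matrix; parameter values are assumptions, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def cost(s, w, alpha):
    """C(s) = -0.5 s'ws + 0.5 alpha s's."""
    return -0.5 * s @ w @ s + 0.5 * alpha * s @ s

def anneal(w, alpha, T0=5.0, cooling=0.9, sweeps=60):
    N = w.shape[0]
    s = rng.choice([-1, 1], size=N)
    T = T0
    for _ in range(sweeps):
        for i in range(N):
            field = w[i] @ s - alpha * s[i]
            # Boltzmann update: p(s_i = +1) = 1 / (1 + exp(-2*field/T));
            # the clip avoids overflow in exp at very low temperatures.
            p_plus = 1.0 / (1.0 + np.exp(np.clip(-2.0 * field / T, -50.0, 50.0)))
            s[i] = 1 if rng.random() < p_plus else -1
        T *= cooling            # lower the temperature, e.g. T_{k+1} = 0.9 T_k
    return s

# Small illustrative cost matrix whose global minima set all neurons equal.
w = np.ones((4, 4)) - np.eye(4)
s_star = anneal(w, alpha=0.1)
```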

Mean field theory

The idea of mean field theory is to replace the stochastic dynamics with an associated deterministic dynamics that gives the correct direction of movement ‘on average.’ The basic object of analysis is the time average of neuron $i$ at temperature $T$, which we denote $\langle s_i \rangle_T$, or $v_i = \langle s_i \rangle$ for short. The variable $\langle s_i \rangle = v_i$ is the average value of neuron $i$ in the stationary equilibrium given by (35). The approximate mean field dynamics are given by[14]

$$v_{i,t+1} = \tanh\left(\sum_{j} (w_{ij} - \alpha)v_{jt}/T\right).$$

Equation (37) is a set of deterministic difference equations in the average states of the neurons. Notice the similarity between systems (30) and (37).


Figure 6: The function $\tanh(z/T)$ with $T = 1$. Fixed points are at intersections with the dotted line of slope 1 through the origin.


Figure 7: The function $\tanh(z/T)$ with $T = 0.5$. Fixed points are at intersections with the dotted line of slope 1 through the origin. At lower temperatures, the fixed point at 0 becomes unstable, and two stable fixed points near $\pm 1$ appear.

The dynamics of the system (37) are known to exhibit behavior of two sorts, depending on the temperature $T$. At sufficiently high temperatures, the system moves to a steady state at $v_i = 0$ $\forall i$, which can be seen to be the fixed point of the tanh function (see Figure 6 and Figure 7). As the temperature is gradually lowered, a ‘phase transition’ is passed at a critical temperature $T = T_c$, and the system moves to fixed points $v_i = \pm 1$ that approximate a solution of the minimization problem. Peterson and Söderberg describe methods for estimating the critical temperature $T_c$. These methods start from a Taylor series approximation of (37) around $v^*$, which yields the dynamics

$$\epsilon_{i,t+1} = (1/T) \sum_{j} (w_{ij} - \alpha) \epsilon_{jt},$$

where $v_i = v_i^* + \epsilon_i$. The stability of the local dynamics around $v^* = 0$ is determined by the eigenvalue of maximum modulus of the matrix

$$m_T = (w - \alpha I)/T.$$

For $T$ large enough, the dynamics around $v^*$ are evidently stable. However, for nontrivial $w$, there will exist a critical value $T_c$ at which an eigenvalue of $m_T$ will exceed unity in modulus, causing the linear dynamics to destabilize and prompting the nonlinear aspects of the dynamics of system (37) to take over. Peterson and Söderberg describe computationally efficient methods for estimating the value of $T_c$ from the pair $(w, \alpha)$. They also describe corresponding methods for similar but richer problems.[15]

Peterson and Söderberg implement mean field dynamics of the form (37) to minimize cost functions of the form (32).
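The following is not Peterson and Söderberg’s own code, but a minimal sketch of the synchronous mean field iteration (37) on an illustrative cost matrix, showing the two temperature regimes.

```python
import numpy as np

def mean_field(w, alpha, T, v0, iterations=200):
    """Iterate v_{i,t+1} = tanh(sum_j (w_ij - alpha) v_jt / T)."""
    v = v0.copy()
    for _ in range(iterations):
        v = np.tanh((w - alpha) @ v / T)
    return v

N = 4
w = np.ones((N, N)) - np.eye(N)     # illustrative cost matrix
v0 = np.full(N, 0.01)               # small perturbation away from v = 0

v_high = mean_field(w, alpha=0.1, T=10.0, v0=v0)   # above T_c: decays to 0
v_low = mean_field(w, alpha=0.1, T=1.0, v0=v0)     # below T_c: locks near +1
```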

Peterson and Söderberg (1992) provide several examples in which the mean field dynamics provide cheap and high-quality approximate solutions to some big optimization problems.

Shadows of things to come

Mean field theory uses a deterministic system whose dynamics approximate aspects of the average behavior of a random dynamical system. The deterministic system is formed by writing down the random dynamical system, then replacing random variables with their means at carefully chosen spots. This procedure has provided a cheap way of solving or approximately solving some high-dimensional combinatorial optimization problems.

We shall encounter this kind of device again, in different contexts. In particular, we shall see how closely related arguments from the ‘stochastic approximation’ literature have been applied to study the convergence of systems of boundedly rational agents to rational expectations equilibria.

Local and global methods

Newton’s method, which is the basis for many of the estimation methods that we have surveyed, has excellent local convergence properties, but can have difficulties globally. Simulated annealing and stochastic Newton methods are modifications designed to improve the global properties of Newton’s method without sacrificing its good local properties. We now turn to some global search algorithms proposed by Holland that depart more fundamentally from Newton’s method.

The genetic algorithm

The genetic algorithm of John Holland was devised to find maxima of ‘rugged landscapes’ that lack the smoothness that Newton’s method exploits.[16] Holland’s idea was to turn loose on the landscape a population of sexually active artificial beings that would evolve itself to the top of the hill.

The genetic algorithm is a sequence of operations applied to a population of binary strings, which are strings of 0’s and 1’s of length $S$. Recall that each integer $j$ in the collection $[1, 2, \ldots, 2^S - 2, 2^S - 1]$ can be represented as $j = \sum_{i=1}^S k_i 2^{i-1}$, where $k_i$ is the value (0 or 1) in the $i$th position in the string. We can approximate a bounded set of real numbers $x \in X$ by setting $x = j/(b - a)$, where $a, b, S$ are chosen to select the set that we want to approximate.

The algorithm begins with a collection of $N$ binary strings of length $S$. We are given a non-negative function $f : X \to \mathbb{R}$, which we are to maximize. We encode values of $x \in X$, where $X$ may be a vector space, in terms of binary bit strings. We are told how to compute $f(x)$ for any value of $x$ within the domain defined by our encoding.

The algorithm uses the following operators.

  1. Evaluation of fitness. For every value $x_i$ for $i = 1, \ldots, N$, compute the value $f(x_i)$. Then compute the ‘relative fitness’ of $x_i$, defined to be $f(x_i)/\sum_{j=1}^N f(x_j)$.

  2. Fitness-proportional reproduction. Make copies of the population by spinning a ‘biased roulette wheel,’ constructed by dividing a disk into $N$ slices, with the probability of individual $i$’s reproducing being set equal to its relative fitness $f(x_i)/\sum_{j=1}^N f(x_j)$. Spin the wheel $N$ times, each time making a copy of the individual into whose slice the roulette ball comes to rest, thus making $N$ copies. The copies constitute a new population of $N$ individuals.

  3. Mutation. Independently subject each bit of each string to a small probability $p_m$ of being flipped.

  4. Mating. Randomly divide the population of $N$ strings into $N/2$ equal subpopulations of ‘males’ and ‘females.’ Randomly match these subpopulations into $N/2$ pairs for the purpose of ‘mating.’ A pair (sometimes referred to in the literature as ‘the happy couple’) mates by drawing a random number that is uniformly distributed across the integers from 1 to $S - 1$. If the integer $k$ is drawn, the male’s and female’s bit strings are each cut between the bit numbers $k$ and $k + 1$, and two new strings are formed by joining the first $k$ bits of the male string with the last $S - k$ bits of the female string, and vice versa. In this way are formed $N$ ‘children’ strings. Extinguish the parents, and take the $N$ children as the new population.

The algorithm starts by selecting a random sample of $N$ strings, and then applying the four operators sequentially. After a new population is created via the mating operator, the algorithm applies operator 1 again, continuing either for a prespecified number of rounds or else until a stable population of $x$’s emerges. As the ‘solution’ of the original problem, select the fittest member $x$ from the final population. The parameters of the algorithm are $N$, $S$, and the mutation rate $p_m$.
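The four operators can be assembled into a minimal sketch of the algorithm; the fitness landscape, encoding interval, and parameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def decode(bits, a=0.0, b=1.0):
    """Map a bit string to a real number in [a, b]."""
    S = len(bits)
    j = int("".join(map(str, bits)), 2)
    return a + (b - a) * j / (2 ** S - 1)

def genetic_algorithm(f, N=20, S=12, p_m=0.01, rounds=60):
    pop = rng.integers(0, 2, size=(N, S))
    for _ in range(rounds):
        # 1-2. Evaluate fitness, then reproduce in proportion to it
        # (the 'biased roulette wheel').
        fitness = np.array([f(decode(row)) for row in pop])
        pop = pop[rng.choice(N, size=N, p=fitness / fitness.sum())]
        # 3. Mutation: flip each bit with small probability p_m.
        flips = rng.random(pop.shape) < p_m
        pop = np.where(flips, 1 - pop, pop)
        # 4. Mating: pair strings and cross over at a random cut point
        # between bits k and k + 1, with k drawn from 1, ..., S - 1.
        rng.shuffle(pop)
        for i in range(0, N - 1, 2):
            k = rng.integers(1, S)
            pop[i, k:], pop[i + 1, k:] = pop[i + 1, k:].copy(), pop[i, k:].copy()
    fitness = np.array([f(decode(row)) for row in pop])
    return decode(pop[np.argmax(fitness)])

# A smooth illustrative landscape with its maximum at x = 0.7.
f = lambda x: np.exp(-50 * (x - 0.7) ** 2)
x_best = genetic_algorithm(f)
```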

The reproduction operator increases the representation of relatively fit individuals in the population, but does nothing to find a fitter individual. The mutation and mating operators can add new elements to the population, while destroying old ones. If the mutation probability $p_m$ is set too high, it slows or prevents convergence, and degrades the performance of the algorithm because it destroys fit individuals along with the unfit. But when the mutation rate is set to a very low value, mutation alone is a poor mechanism for injecting diversity into the population. The mating operator seems to be a very good device for probabilistically injecting diversity, while giving structures that have proved their fitness a shot at surviving.

This algorithm has proved its value in a variety of applications. It has some features of a parallel algorithm, both in the obvious sense that it simultaneously processes a sample distribution of elements, and in the subtler sense that, instead of processing individuals, it is really processing equivalence classes of individuals. These equivalence classes, which Holland calls schemata, are defined by the lengths of common segments of bit strings. The algorithm is evidently a random search algorithm, one that does not confine its searches locally. The mating operator, which lies at the heart of the algorithm, creates new strings, but only when it operates on a population within which there is already diversity. The beauty of this operator is that, while it creates new values of $x$ (at the cost of destroying old ones), it does so in a way that preserves long sections of ‘genetic structure,’ i.e. segments of bit strings. Depending precisely on how a given problem has been encoded, this feature can embody a useful compromise between the principles of adventure and preservation.

The genetic algorithm is designed to produce a sequence of populations of organisms that moves up a fitness criterion $f$. The ‘individuals,’ i.e. the bit strings, do not ‘learn.’ (Each of them dies after one period.) Only the ‘society’ of the sequence of populations of bit strings can be regarded as learning. This feature makes it difficult to accept the genetic algorithm literally as a model of an individual ‘brain.’ It is much easier to relate to John Holland’s classifier system as a model of an individual’s brain or thought process.[17]

Classifier systems

The brain as a ‘competitive economy’

John Holland conceived the ‘classifier system’ as an evolving collection of potential ‘condition–action’ statements that decide and learn. He used an alphabet for expressing the rules that permits more general statements to co-exist with less general statements. The statements compete with one another for the opportunity to decide. The classifier system combines elements of the genetic algorithm with other features to represent a brain in terms that Holland describes as a competitive economy.

A classifier system consists of a more or less comprehensive list of ‘if-then’ statements called classifiers that map conditions into actions, and a set of rules for interpreting and altering these statements to make decisions through time. A classifier system has a list with a fixed number of classifiers. When a message from the environment enters the classifier system, in general, several of the classifiers will have their ‘if’ parts satisfied, but usually their ‘then’ parts will differ, in which case these classifiers are offering different advice (‘Go on, say hello to her’ versus ‘Mind your own business’). Which classifier makes the decision is determined by an ‘auction’ among the relevant set of classifiers, with classifiers bidding with ‘resources’ accumulated via an accounting process that registers the consequences of past decisions. The accounting system is the vehicle by which the classifier system learns to alter its behavior over time.

More formally, a classifier system consists of the following objects.

  1. Bit strings (classifiers). A classifier is a bit string of fixed length, written over the trinary alphabet $0, 1, \#$, where $\#$ is interpreted as ‘either 0 or 1’ or ‘I don’t care.’ The first part of each bit string is interpreted as encoding a ‘condition’ statement, while the remaining bits encode an ‘action’ statement. The presence of the $\#$ sign accommodates generalization.

  2. A decoding device. When a state occurs in the environment, this device identifies which of a fixed collection of classifiers match in their ‘condition’ parts the condition prevailing in the environment. The device thereby identifies a set of classifiers, from which one is selected to make the decision at that moment.

  3. An accounting system. A measure of value called strength is assigned to each classifier in the system at each point in time. Strengths for each classifier are updated over time in response to the utilities and costs that flow from the environment when the classifier acts. The accounting system computes cumulated averages of realized utilities net of costs. In sequential settings, the accounting system taxes classifiers operating at one stage, and awards the proceeds to the classifier at the immediately preceding stage of the decision tree whose decision moved the system to the position that gave the presently active classifier the opportunity to act. Setting up the accounting system in this way is important to induce decisions whose only rewards are that they facilitate subsequent decisions that will ultimately generate rewards.

  4. An auction system. The auction system determines which of a set of matched classifiers is granted the right to act in any given situation. Two alternative auction principles are:

    • (a) The highest-strength rule makes the decision.

    • (b) The right to make a decision is allocated probabilistically, with the probability that a rule is granted the decision equal to its relative strength.

  5. A device for introducing new classifiers. New classifiers are introduced in several situations.

    • (a) Uncovered situations. The most obvious situation occurs when the environment produces a condition that matches no existing classifier. In this situation, a new classifier is generated whose condition statement matches the existing environmental condition, and whose action part is randomly generated.

    • (b) Try something new. New classifiers are generated and old ones are occasionally extinguished in order to provide room for experimenting with untried actions.

    • (c) Generalize and specialize. New classifiers are synthesized to generalize existing rules (replace 0’s and 1’s with $\#$’s), or to specialize them (replace $\#$’s with 0’s or 1’s).
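A minimal sketch of the matching and auction machinery (components 2 and 4 above) may make the mechanics concrete. It uses the probabilistic auction variant (b); the particular classifiers, strengths, and messages are hypothetical:

```python
import random

def matches(condition, message):
    """A condition matches a message if every position agrees or is '#'."""
    return all(c == '#' or c == m for c, m in zip(condition, message))

def act(classifiers, strengths, message, rng):
    """Return the action of the classifier that wins the 'auction.'

    classifiers: list of (condition, action) pairs; conditions are strings
    over {0, 1, #}. The right to decide is allocated probabilistically in
    proportion to relative strength among the matched rules (variant (b)).
    """
    matched = [i for i, (cond, _) in enumerate(classifiers)
               if matches(cond, message)]
    if not matched:
        # an uncovered situation: here a new classifier would be created
        return None
    winner = rng.choices(matched, weights=[strengths[i] for i in matched])[0]
    return classifiers[winner][1]

# A toy system: a general rule covering '1*' and a more specific rule for '11'
classifiers = [('1#', '0'), ('11', '1')]
strengths = [1.0, 3.0]
rng = random.Random(0)
action = act(classifiers, strengths, '10', rng)  # only the general rule matches
```

For the message ‘10’ only the general rule `('1#', '0')` matches, so its action is taken with certainty; for ‘11’ both rules match and the stronger, more specific rule wins three times out of four.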

A two-armed bandit

An example due to Brian Arthur and Carl P. Simon illustrates a method for studying the limiting behavior of a classifier system in a very simple context. In particular, they showed how a classifier system would cope with a two-armed bandit problem.[18] The $i$th arm pays a random variable $x_{it}$ drawn independently and identically from a distribution $F_i$ with mean $\mu_i$. Assume that $\mu_1 > \mu_2$. A player must pull one arm for each $t$ with $1 \le t \le T$, with his reward being his total payoffs. The player knows neither $F_1$ nor $F_2$.

Arthur let the classifier system consist of the two classifiers,

#1
#0

Here the first entry encodes the ‘condition’ and the second entry encodes the ‘action’; $\#$ means ‘whatever the history of observations,’ 1 means pull the first arm, and 0 means pull the second arm. The conditions of both classifiers are met all of the time. The first classifier plays arm 1 all of the time, and the second plays arm 2 all of the time. Let $\tau_i(t)$, $i = 1, 2$, be a clock recording the cumulative number of times that arm $i$ has been pulled as of time $t$, which equals the cumulative number of times that classifier $i$ has acted. Arthur set up the accounting system

$$S_{i\tau_i} = S_{i\tau_i-1} + (1/\tau_i)(x_{i\tau_i} - S_{i\tau_i-1}),$$

where $S_{i\tau_i}$ is the ‘strength’ of classifier $i$ after it has been rewarded with its payoff when its clock is at $\tau_i$.

Arthur used a random device based on relative strengths to determine which classifier’s advice was followed at time $t$. In particular, at time $t + 1$, the system follows the advice of classifier 1 with probability $\pi_{1t+1}$ given by

$$\pi_{1t+1} = S_{1\tau_1(t)}/(S_{1\tau_1(t)} + S_{2\tau_2(t)}).$$

By using methods of stochastic approximation, Arthur showed that the classifier system eventually ‘probability-matches,’ that is, plays the arms in fractions-over-time that are in proportion to their expected rewards.[19]
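Arthur’s two-classifier system can be simulated directly from the strength-update rule and the auction probability. In this sketch each arm pays its mean deterministically (a simplifying assumption, not Arthur’s setup), which makes the limiting frequencies easy to check; the initial strengths are also an assumption:

```python
import random

def simulate(mu=(3.0, 1.0), T=200_000, seed=0):
    """Arthur's two-classifier system on a two-armed bandit.

    Classifier i always recommends arm i. Strengths follow the recursive
    average S_{i,tau} = S_{i,tau-1} + (1/tau)(x - S_{i,tau-1}), and the
    right to act is allocated with probability equal to relative strength.
    Payoffs here equal the means deterministically (a simplification).
    """
    S = [1.0, 1.0]   # initial strengths (an assumption)
    tau = [0, 0]     # per-classifier clocks
    pulls = [0, 0]
    rng = random.Random(seed)
    for _ in range(T):
        p1 = S[0] / (S[0] + S[1])       # probability classifier 1 acts
        i = 0 if rng.random() < p1 else 1
        x = mu[i]                       # realized payoff
        tau[i] += 1
        S[i] += (x - S[i]) / tau[i]     # recursive average of payoffs
        pulls[i] += 1
    return S, [n / T for n in pulls]

S, freq = simulate()
```

With deterministic payoffs the strengths settle at the means $(3, 1)$ after each classifier’s first act, so arm 1 is thereafter played with probability $3/(3+1) = 0.75$, illustrating probability matching.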

Design decisions

An author of a classifier system controls the list of states or conditions to encode, the scheme for encoding them, and the accounting system that distributes rewards and collects ‘taxes.’[20] Thus, some ‘hard wiring’ goes into the construction of a classifier system, much of it being done with an eye to the particular problem at hand. In some environments, what is not hard-wired is the degree of generalization.

Generality versus discrimination

The problem of learning induces an incompletely understood tradeoff between ‘general’ rules (those with ‘conditions’ that are coarse and therefore are often met) and ‘specific’ rules (those with conditions that are fine and therefore less frequently met). An advantage of ‘general’ rules is that their conditions are frequently encountered, which means that their performance can be assessed frequently. A disadvantage is that they call for the same action for all states that satisfy the condition. In effect, general classifiers give the advice: ‘use a piece-wise constant decision rule over the subset of the state space that I cover.’ Specific decision rules have the opposite advantages and disadvantages. They potentially permit fine-tuning the action to fit the specific point in the condition space, but they pay for that advantage by requiring longer histories of experience to learn.[21]

An interesting property of classifier systems is that they can be set up in ways that permit the degree of generalization or specificity to emerge adaptively. The presence of the $\#$ (or ‘I don’t care’) symbol in the alphabet, together with devices designed either to generalize or specialize,[22] provides this capacity. The literature has some intriguing simulation examples in which different degrees of generalization have emerged in classifier systems, but at the present time little is known about general principles that determine their propensity to generalize.

Summary

In this chapter we have seen recursive least squares dynamics appear in many guises and contexts. There are evidently many fascinating connections between research lines being pursued within the ‘connectionist’ literatures that we have surveyed in this chapter and the econometrics and statistics literatures described in the preceding chapter. It is tempting (and would surely be worthwhile) to pursue those connections more broadly and deeply, but it is time for us to redirect our attention to our main task of discussing economic models of bounded rationality. In the next chapter we hand over some recursive least squares algorithms to artificial agents living within one of several particular economic environments, and watch what happens.

Footnotes
  1. Useful references on some of the material in this chapter are Müller and Reinhardt (1990), Hertz, Krogh, and Palmer (1991), and Kosko (1992).

  2. Most American football players are heavier than the average person. Some economists are also heavier than the average person.

  3. In-Koo Cho (1992) shows that a pair of single-layer perceptrons can represent all subgame-perfect strategy pairs in an infinitely repeated (undiscounted) prisoners’ dilemma game. A key part of his demonstration is that a single-layer perceptron is ‘discriminating enough’ for the situation each player confronts. Cho shrewdly exploits the ease with which a perceptron can implement a ‘trigger strategy’ via a squasher function. For another infinitely repeated two-player game, Cho shows that a single-layer perceptron is too simple (because it is unable to discriminate among situations adequately), but that a network with one hidden layer is able to encode all subgame-perfect equilibrium strategies.

  4. Including Ronald Gallant, Kurt Hornik, and Maxwell Stinchcombe.

  5. They show that, if enough hidden units are included, then any Borel measurable function from one finite dimensional space to another can be approximated arbitrarily well. Barron (1991) showed that to achieve the same approximation rate a feedforward network uses only linearly many parameters ($O(qn)$), while polynomial, spline, and trigonometric expansions use parameters that grow exponentially ($O(q^n)$).

  6. Hertz, Krogh, and Palmer (1991) and Müller and Reinhardt (1990) are good references on the material in this section. Müller and Reinhardt provide an example in which a Hopfield network is used to ‘read’ noise corrupted letters encoded in pixels via ‘Ising neurons.’

  7. A problem in constructing associative memory models is that sometimes the builder inadvertently puts ‘spurious patterns’ into the network, so that the network recalls patterns that weren’t taught to it. Notice that the construction in the text assures that the ‘memorized patterns’ are rest points of the network, but does not preclude the presence of other rest points.

  8. Notice that $\frac{1}{N} \sigma'\sigma_k = e_k$ because $\sigma'\sigma = NI_p$.

  9. There is a literature on ‘response times’ in cognitive psychology in which subjects are asked to classify a specific example (‘pelican’) into one of a specified number of classes (‘central banker,’ ‘bird,’ ‘congressperson’). Their times to respond are recorded and interpreted in terms of the closeness of the particular item to the ideal pattern. The Hopfield network is a useful device for interpreting these experiments.

  10. If the formulas associated with Hebb’s rule remind the economics student of things from econometrics, they should.

  11. This section relies heavily on Peterson and Söderberg (1992).

  12. Or $g(x) = 0.5[1 + \tanh(x/T)]$, if we want the neurons to take values in $(0, 1]$.

  13. This local stability property is ‘good’ for models of associative memory, but ‘bad’ for schemes that iterate on first-order conditions from optimum problems.

  14. See Brock (1992) for several applications of mean field dynamics to models of economics and finance.

  15. Peterson and Söderberg study both synchronous and serial implementations of the dynamics (37). The synchronous dynamics are simply (37). The serial or nonsynchronous dynamics are like the Gauss-Seidel algorithm: in updating neuron $i$, the latest value of neuron $j$ is used on the right side of (37).

  16. Goldberg (1989) is a useful text on genetic algorithms.

  17. Ellen McGrattan has applied genetic algorithms to estimate nonlinear rational expectations models. She first uses genetic algorithms to search for the vicinity of a maximum, then switches to a Newton method.

  18. These results were presented by Arthur (1989b) and Simon in independent oral presentations at the Santa Fe Institute in March 1989.

  19. See Arthur (1989b) for discussions of various ways of altering the classifier system to improve its performance.

  20. In sequential problems, the author must also link classifier sub-systems ‘intertemporally’ in ways that permit learning to experience the rewards of patience. See Marimon, McGrattan, and Sargent (1990) for an example.

  21. This might remind the reader of tradeoffs between parametric and non-parametric estimation strategies in econometrics.

  22. See Marimon, McGrattan, and Sargent (1990).