# Optimization of a Neural Network

William S. Harlan

Feb 1999

# Introduction

Many discussions of neural networks unnecessarily introduce a large vocabulary of specialized terms. In fact, neural networks are a simple system of recursive equations that can be optimized by conventional means. Any good book on optimization provides all the necessary tools [1]. We need only unambiguously write down the equations we are using, identify the objective function, and calculate a few derivatives.

# The Recursive Equations

Neural-network equations map vectors of known data $\mathbf{d}_k$ to corresponding known vectors of parameters $\mathbf{m}_k$, indexed by $k$. The vector of parameters might be actual physical properties or statistical probabilities. Data can be physical measurements or numbers that index possible events. The equations contain unknown coefficients, or weights, that the user adjusts until known parameters are predicted accurately. Then it is hoped that unknown parameters can be estimated from new measurements of data.

Let's look at how simple these equations are. Intermediate vectors $\mathbf{x}_n$ are expressed recursively as functions of other vectors $\mathbf{x}_{n-1}$, usually with a different dimensionality. The first vector $\mathbf{x}_1$ corresponds to the data $\mathbf{d}$, and the final vector $\mathbf{x}_{N+1}$ corresponds to the parameters $\tilde{\mathbf{m}}$ to be estimated. (The index $k$ of each data and parameter pair is suppressed in the recursion.) First, a pre-established non-linear transform $\mathbf{f}_n$ is applied to each element of a vector. Then linear transforms $\mathbf{W}_n$ calculate each sample of the output vector as a weighted sum of samples of the input vector.

$$\mathbf{x}_1 = \mathbf{d} , \tag{1}$$

$$\mathbf{y}_n = \mathbf{f}_n ( \mathbf{x}_n ) , \tag{2}$$

$$\mathbf{x}_{n+1} = \mathbf{W}_n \, \mathbf{y}_n , \quad \text{for } n = 1 , \ldots , N , \tag{3}$$

$$\tilde{\mathbf{m}} = \mathbf{x}_{N+1} . \tag{4}$$
Usually the non-linear functions $\mathbf{f}_n$ apply independently to each element of the vector. Each scalar function $f_n$ is expected to be monotonic over the range of values it will encounter. The derivative $f_n'$ should also be continuous. Usually, the shape is a sigmoid or linear. Because the notation presents no difficulty, I write this non-linear scalar function as a more general vector function.

I could also have reversed the order of linear and non-linear operators. A reversed order is equivalent to setting $\mathbf{f}_1$ and $\mathbf{W}_N$ to identity operations. This form (2)–(3) is somewhat easier to manipulate.

This nonlinear recursion can be abbreviated as

$$\tilde{\mathbf{m}}_k = \mathbf{F} ( \mathbf{W}_1 , \mathbf{W}_2 , \ldots , \mathbf{W}_N ; \, \mathbf{d}_k ) . \tag{5}$$
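As a concrete sketch, the recursion (1)–(5) fits in a few lines of NumPy. The logistic sigmoid, the layer sizes, and the function names here are illustrative choices, not part of the text:

```python
import numpy as np

def sigmoid(x):
    """A typical sigmoid non-linearity f_n, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(weights, d):
    """Apply the recursion x_{n+1} = W_n f_n(x_n) with x_1 = d."""
    x = d
    for W in weights:
        x = W @ sigmoid(x)   # non-linear transform, then weighted sums
    return x                 # the estimate m~ of the parameters

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
d = rng.standard_normal(3)
m_est = forward(weights, d)  # a 2-element vector of estimated parameters
```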

# Perturbations of an Objective Function

The best weights are assumed to minimize the following least-squares objective function $E$, summed over all known pairs of data measurements $\mathbf{d}_k$ and model parameters $\mathbf{m}_k$.

$$E ( \mathbf{W}_1 , \ldots , \mathbf{W}_N ) = \sum_k \left\| \mathbf{m}_k - \mathbf{F} ( \mathbf{W}_1 , \ldots , \mathbf{W}_N ; \, \mathbf{d}_k ) \right\|^2 . \tag{6}$$

A perturbation $\delta \mathbf{W}_n$ of the weights will result in a linearized perturbation of the objective function:

$$E ( \mathbf{W}_1 + \delta \mathbf{W}_1 , \ldots , \mathbf{W}_N + \delta \mathbf{W}_N ) \approx E ( \mathbf{W}_1 , \ldots , \mathbf{W}_N ) + \delta E , \tag{7}$$

$$\delta E = - 2 \sum_k ( \mathbf{m}_k - \tilde{\mathbf{m}}_k )^\top \, \delta \tilde{\mathbf{m}}_k , \tag{8}$$

$$\delta \mathbf{x}_{n+1} = \mathbf{W}_n \operatorname{diag} [ \mathbf{f}_n' ( \mathbf{x}_n ) ] \, \delta \mathbf{x}_n + \delta \mathbf{W}_n \, \mathbf{f}_n ( \mathbf{x}_n ) , \tag{9}$$

$$\delta \mathbf{x}_1 = \mathbf{0} , \qquad \delta \tilde{\mathbf{m}} = \delta \mathbf{x}_{N+1} . \tag{10}$$

Unperturbed variables $\mathbf{x}_n$ retain their original reference values from the original non-linear forward transform (1)–(4).

The perturbations (9) of the vector elements $\delta \mathbf{x}_n$ are expressed as a linear function of the perturbed weights $\delta \mathbf{W}_n$. We can abbreviate the recursive equations (9) as a single linear transform $\mathbf{L}_k$, applied to the collection $\delta \mathbf{W} = ( \delta \mathbf{W}_1 , \ldots , \delta \mathbf{W}_N )$.

$$\delta \tilde{\mathbf{m}}_k = \mathbf{L}_k \, \delta \mathbf{W} , \tag{11}$$

$$\delta E = - 2 \sum_k ( \mathbf{m}_k - \tilde{\mathbf{m}}_k )^\top \mathbf{L}_k \, \delta \mathbf{W} . \tag{12}$$
Gradient optimization also requires the adjoint of this linearization. If the linearized forward transform is expressed as a matrix, then the adjoint transform is just the transpose of this matrix. A matrix would be unwieldy, but the recursive version of the adjoint equations is not.
$$\bar{\mathbf{x}}_{N+1} = \delta \tilde{\mathbf{m}} , \tag{13}$$

$$\delta \mathbf{W}_n = \bar{\mathbf{x}}_{n+1} \, \mathbf{f}_n ( \mathbf{x}_n )^\top , \tag{14}$$

$$\bar{\mathbf{y}}_n = \mathbf{W}_n^\top \, \bar{\mathbf{x}}_{n+1} , \tag{15}$$

$$\bar{\mathbf{x}}_n = \operatorname{diag} [ \mathbf{f}_n' ( \mathbf{x}_n ) ] \, \bar{\mathbf{y}}_n , \tag{16}$$

$$\text{for } n = N , N - 1 , \ldots , 1 . \tag{17}$$
These adjoint perturbations $\delta \mathbf{W}_n$ do not equal those of the forward linearization. The adjoint recursion can also be abbreviated with the linear transform $\mathbf{L}_k^\top$.
$$\delta \mathbf{W} = \sum_k \mathbf{L}_k^\top \, \delta \tilde{\mathbf{m}}_k , \tag{18}$$

$$\nabla_{\mathbf{W}} E = - 2 \sum_k \mathbf{L}_k^\top ( \mathbf{m}_k - \tilde{\mathbf{m}}_k ) . \tag{19}$$
If the linear transform were written as a single large matrix, then indeed the matrix would be identical in the forward and adjoint equations (11) and (18). For optimization, fortunately, we need not construct this matrix explicitly (as would be required by singular-value decomposition). Instead, we need only be able to multiply this linear transform or its adjoint by specific perturbations, using the recursions (9) and (16).
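One way to gain confidence in hand-coded recursions of this kind is the standard dot-product test: apply the linearized forward transform to a random weight perturbation, apply the adjoint to a random output perturbation, and check that the two inner products agree. A minimal sketch, assuming NumPy and sigmoid non-linearities; all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_states(weights, d):
    """Record every x_n of the non-linear recursion, starting at x_1 = d."""
    xs = [d]
    for W in weights:
        xs.append(W @ sigmoid(xs[-1]))
    return xs

def linearized(weights, xs, dweights):
    """Forward linearization: dx_{n+1} = W_n diag(f') dx_n + dW_n f(x_n)."""
    dx = np.zeros_like(xs[0])
    for W, dW, x in zip(weights, dweights, xs):
        s = sigmoid(x)
        dx = W @ (s * (1 - s) * dx) + dW @ s
    return dx

def adjoint(weights, xs, dm):
    """Adjoint recursion: back-propagate dm into weight perturbations."""
    bx = dm
    dweights = [None] * len(weights)
    for n in reversed(range(len(weights))):
        s = sigmoid(xs[n])
        dweights[n] = np.outer(bx, s)            # dW_n = bx_{n+1} f(x_n)^T
        bx = s * (1 - s) * (weights[n].T @ bx)   # bx_n = diag(f') W_n^T bx_{n+1}
    return dweights

rng = np.random.default_rng(1)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
xs = forward_states(weights, rng.standard_normal(3))
dweights = [rng.standard_normal(W.shape) for W in weights]
dm = rng.standard_normal(2)
lhs = linearized(weights, xs, dweights) @ dm
rhs = sum(np.sum(dW * adW) for dW, adW in zip(dweights, adjoint(weights, xs, dm)))
assert np.isclose(lhs, rhs)
```

When the assertion holds for random perturbations, the adjoint code truly transposes the linearization, which catches most indexing mistakes before any optimization is attempted.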

# Optimization

The simplest neural network “training” algorithm adjusts the previous choice of weights by a scaled gradient. This recursive algorithm is called back-propagation.

1. Initialize each weight matrix $\mathbf{W}_n$.
2. Calculate the negative gradient $\Delta \mathbf{W} = 2 \sum_k \mathbf{L}_k^\top ( \mathbf{m}_k - \tilde{\mathbf{m}}_k )$.
3. Replace each $\mathbf{W}_n$ by $\mathbf{W}_n + \epsilon \, \Delta \mathbf{W}_n$.
To reduce the objective function, the perturbation reverses the sign of the gradient. The small scale factor $\epsilon$ is sometimes fixed a priori and never changed. If the scale factor is too small, then many consecutive steps may move in the same direction. If the scale factor is too large, the perturbations may increase rather than decrease the objective function and may even diverge. Although this algorithm is the one most commonly cited, we can easily do better. Let us turn to standard non-linear optimization methods.
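For a single-layer example $\tilde{\mathbf{m}} = \mathbf{W} \, \mathbf{f} ( \mathbf{d} )$, the fixed-step loop above can be sketched directly. The training pairs, the sigmoid, and the scale factor here are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Consistent training pairs generated by a known weight matrix.
rng = np.random.default_rng(2)
W_true = rng.standard_normal((2, 3))
ds = [rng.standard_normal(3) for _ in range(20)]
ms = [W_true @ sigmoid(d) for d in ds]

W = np.zeros((2, 3))        # step 1: initialize the weights
eps = 0.01                  # small fixed scale factor, chosen a priori
for _ in range(5000):
    grad = np.zeros_like(W)
    for d, m in zip(ds, ms):
        r = m - W @ sigmoid(d)                    # residual m_k - m~_k
        grad += -2.0 * np.outer(r, sigmoid(d))    # step 2: gradient of E
    W -= eps * grad                               # step 3: move against it
err = sum(np.sum((m - W @ sigmoid(d)) ** 2) for d, m in zip(ds, ms))
```

The many iterations needed here illustrate the text's warning: with a fixed scale factor, consecutive steps keep moving in the same direction.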

Many steepest-descent algorithms would replace step 3 by a line search to find an optimum scale factor.

1. Initialize each weight matrix $\mathbf{W}_n$.
2. Calculate $\Delta \mathbf{W}$ as before.
3. Find the scale factor $\alpha$ that minimizes $E ( \mathbf{W}_1 + \alpha \, \Delta \mathbf{W}_1 , \ldots , \mathbf{W}_N + \alpha \, \Delta \mathbf{W}_N )$.
4. Replace each $\mathbf{W}_n$ by $\mathbf{W}_n + \alpha \, \Delta \mathbf{W}_n$.
This revised algorithm is guaranteed to find a local minimum where the gradient is zero. Steps can be much larger than with a fixed factor. The line search can use combinations of parabolic fitting and golden sections for speed and robustness.
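A golden-section search of the kind mentioned above might look like the following sketch. The quadratic `phi` stands in for $E ( \mathbf{W} + \alpha \, \Delta \mathbf{W} )$ as a function of $\alpha$, and a bracketing interval is assumed known:

```python
def golden_section(phi, lo, hi, tol=1e-8):
    """Shrink [lo, hi] around a minimum of a unimodal function phi."""
    inv_phi = (5 ** 0.5 - 1) / 2          # 1/golden ratio, about 0.618
    a = hi - inv_phi * (hi - lo)
    b = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if phi(a) < phi(b):
            hi, b = b, a                  # minimum lies in [lo, b]
            a = hi - inv_phi * (hi - lo)
        else:
            lo, a = a, b                  # minimum lies in [a, hi]
            b = lo + inv_phi * (hi - lo)
    return 0.5 * (lo + hi)

# A stand-in objective with its minimum at alpha = 2.
alpha = golden_section(lambda a: (a - 2.0) ** 2 + 1.0, 0.0, 10.0)
```

Parabolic fitting can be layered on top of the same bracketing logic for faster convergence near the minimum.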

A good scale factor can be estimated without an explicit line search.

1. Initialize each $\mathbf{W}_n$.
2. Calculate $\Delta \mathbf{W}$ as before.
3. Calculate the linearized perturbations $\delta \tilde{\mathbf{m}}_k = \mathbf{L}_k \, \Delta \mathbf{W}$.
4. Find the scale factor $\alpha$ that minimizes $\sum_k \| \mathbf{m}_k - \tilde{\mathbf{m}}_k - \alpha \, \delta \tilde{\mathbf{m}}_k \|^2$, or equivalently set $\alpha = \sum_k ( \mathbf{m}_k - \tilde{\mathbf{m}}_k )^\top \delta \tilde{\mathbf{m}}_k \big/ \sum_k \| \delta \tilde{\mathbf{m}}_k \|^2$.
5. Replace each $\mathbf{W}_n$ by $\mathbf{W}_n + \alpha \, \Delta \mathbf{W}_n$.

When the linearization is a good approximation (particularly in the vicinity of a minimum), the scale factor will be very accurate. Each iteration will be much more efficient than a line search. If the scaled perturbation increases the objective function in early non-linear iterations, then a line search can be used until the linearization improves.
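For a least-squares objective, the scale factor that minimizes $\sum_k \| \mathbf{m}_k - \tilde{\mathbf{m}}_k - \alpha \, \delta \tilde{\mathbf{m}}_k \|^2$ has a closed form: the ratio of two inner products. A sketch with stacked residual and perturbation vectors; the values are illustrative:

```python
import numpy as np

def optimal_scale(r, q):
    """Least-squares scale factor minimizing || r - alpha * q ||^2."""
    return float(np.dot(r, q) / np.dot(q, q))

r = np.array([1.0, 2.0, 3.0])   # stacked residuals m_k - m~_k
q = np.array([1.0, 0.0, 1.0])   # stacked linearized perturbations L_k dW
alpha = optimal_scale(r, q)     # (1*1 + 2*0 + 3*1) / (1 + 0 + 1) = 2.0
```

On a purely linear problem this scale factor is exact; on the non-linear network it is only as good as the linearization, which is the point of the paragraph above.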

The convergence of these steepest-descent algorithms can be improved by a PARTAN or Fletcher-Reeves (conjugate-gradient) algorithm, which retains and combines consecutive gradients. Neural-network references use the word “momentum” to describe similar strategies, usually with fixed scale factors.

An even more efficient alternative is to solve the linearized least-squares problem fully in each iteration — a Gauss-Newton approach:

1. Initialize each $\mathbf{W}_n$.
2. Find the perturbations $\delta \mathbf{W}$ that minimize the linearized objective function $\sum_k \| \mathbf{m}_k - \tilde{\mathbf{m}}_k - \mathbf{L}_k \, \delta \mathbf{W} \|^2$.
3. Replace each $\mathbf{W}_n$ by $\mathbf{W}_n + \delta \mathbf{W}_n$.
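In miniature, with a one-parameter model standing in for $\mathbf{F} ( \mathbf{W} ; \mathbf{d} )$, a Gauss-Newton iteration solves the linearized least-squares problem exactly at each step. A sketch under those assumptions, with `numpy.linalg.lstsq` playing the role of the inner linear solver:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 20)
y = np.exp(-3.0 * t)              # data generated with true parameter b = 3

b = 1.0                           # initial guess for the single "weight"
for _ in range(12):
    r = y - np.exp(-b * t)        # residual m - F(W; d)
    J = -t * np.exp(-b * t)       # Jacobian dF/db, the role of L
    # Solve the linearized least-squares problem || r - J db ||^2 exactly.
    db = np.linalg.lstsq(J[:, None], r, rcond=None)[0][0]
    b += db                       # take the full Gauss-Newton step
```

Each iteration is more expensive than a gradient step, but far fewer iterations are needed when the linearization is good, which is the trade-off the Gauss-Newton approach makes.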