\documentclass[12pt]{article}
\usepackage{wsh}
\setlength{\textwidth}{\paperwidth}
\addtolength{\textwidth}{-2in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{\oddsidemargin}
\begin{document}
\bibliographystyle{plain}
\markright{Optimization of a Neural Network --- W.S. Harlan}
\title{Optimization of a Neural Network}
\author{William S. Harlan}
\date{Feb 1999}
\maketitle
\section {Introduction}
Many discussions of neural networks unnecessarily introduce
a large vocabulary of specialized terms. In fact, neural networks
are a simple system of recursive equations that can be optimized
by conventional means. Any good book on optimization
provides all the necessary tools \cite{luenberger}.
We need only unambiguously write down the equations we are using,
identify the objective function, and calculate a few derivatives.
\section{The Recursive Equations}
\newcommand{\xv}{\svector{x}}
\newcommand{\dx}{\delta {\xv}}
\newcommand{\fv}{\svector{f}}
\newcommand{\dv}{\svector{d}}
\newcommand{\pv}{\svector{p}}
\newcommand{\Wt}{\stensor{W}}
\newcommand{\wt}{\stensor{w}}
\newcommand{\Gt}{\stensor{G}}
\newcommand{\dW}{\delta {\Wt}}
\newcommand{\WW}{\Wt^1,\ldots,\Wt^N}
\newcommand{\ww}{\wt^1,\ldots,\wt^N}
Neural-network equations map vectors of known data $\dv^m$
to corresponding known vectors of parameters $\pv^m$, indexed by $m$.
The vector of parameters might be actual physical properties or
statistical probabilities. Data can be physical measurements
or numbers that index possible events. The equations
contain unknown coefficients, or weights, that the user
adjusts until known parameters are predicted accurately.
Then it is hoped that unknown parameters can
be estimated from new measurements of data.

Let's look at how simple these equations are. Intermediate vectors
$\xv^{m,n}$ are expressed recursively
as a function of other vectors $\xv^{m,n-1}$,
usually with a different dimensionality.
The first vector $\xv^{m,0}$ corresponds to data $\dv^m$, and the
final vector $\xv^{m,N}$ corresponds to the parameters $\pv^m$
to be estimated.

First, a pre-established non-linear transform $\fv^n$ is applied to
each element of a vector. Then linear transforms $\Wt^n$ calculate
each sample of the output vector as a weighted sum
of samples from the input vector.
\begin{eqnarray}
\label{eq:recu}
x_i^{m,n} &=& \sum_j W_{ij}^n f_j^n ( \xv^{m,n-1} ), \\
\mbox{or}~ \xv^{m,n} &=& \Wt^n \cdot \fv^n (\xv^{m,n-1})
~ \mbox{(in vector notation).} \\
\mbox{Find parameters }~ \pv^m &\approx& \xv^{m,N} \\
\mbox{from data}~ \dv^m &=& \xv^{m,0} .
\end{eqnarray}

Usually the non-linear functions $\fv^n$ apply independently to each
element of the vector. Each function is expected
to be monotonic over the range $0 \leq f(x ) < f(x' ) \leq 1$
for values $0 \leq x < x' \leq 1$,
and the derivative $d f(x) / d x$ should be continuous.
Usually the shape is sigmoidal or linear. Because the notation presents
no difficulty, I write this non-linear scalar function as a more general
vector function.

I could also have reversed the order of linear and non-linear operators.
A reversed order is equivalent to setting $\fv^1$ and $\Wt^N$ to
identity operations. This form (\ref{eq:recu})
is somewhat easier to manipulate.

This nonlinear recursion can be abbreviated as
\begin{eqnarray}
\pv^m &\approx& \xv^{m,N}(\dv^m ,\WW ) = \xv^{m,N}.
\end{eqnarray}
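The recursion (\ref{eq:recu}) is short enough to state directly in code.
Below is a minimal sketch in Python with NumPy, assuming a sigmoid for
every $\fv^n$, applied elementwise; the function names are illustrative,
not part of any standard library.

```python
import numpy as np

def sigmoid(x):
    # A monotonic nonlinearity with a continuous derivative, as assumed above.
    return 1.0 / (1.0 + np.exp(-x))

def forward(d, weights):
    # x^0 = d; then x^n = W^n . f^n(x^{n-1}).  Every intermediate x^n is
    # kept, because the adjoint recursion derived later reuses them.
    xs = [np.asarray(d, dtype=float)]
    for W in weights:
        xs.append(W @ sigmoid(xs[-1]))
    return xs
```

Each weight matrix may change the dimensionality, exactly as in the text:
`weights[n]` maps the vector at level $n$ to the vector at level $n+1$.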
\section{Perturbations of an Objective Function}
The best weights $\Wt^n$ are assumed to minimize the following least-squares
objective function $F$, summed over all known pairs $m$ of data measurements
$\dv^m$ and model parameters $\pv^m$.
\begin{eqnarray}
\label{eq:obje}
\min_{\{W_{ij}^n\}}
\, F(\WW ) &=& \sum_{m,k} \, [ p_k^m - x_k^{m,N} ( \dv^m, \WW ) ]^2
= \sum_m \, \| \pv^m - \xv^{m,N} \|^2 .
\end{eqnarray}
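On top of such a forward recursion, the objective $F$ of (\ref{eq:obje})
is a one-line sum. A self-contained sketch (sigmoid nonlinearities
assumed; `pairs` holds the known $(\dv^m, \pv^m)$ couples):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(d, weights):
    # x^N(d, W^1, ..., W^N): the nonlinear recursion of the previous section.
    x = np.asarray(d, dtype=float)
    for W in weights:
        x = W @ sigmoid(x)
    return x

def objective(pairs, weights):
    # F = sum over m of || p^m - x^{m,N} ||^2.
    return sum(np.sum((p - predict(d, weights)) ** 2) for d, p in pairs)
```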

A perturbation of the weights $\delta \Wt^n$ will result in
a linearized perturbation $\delta F$ of the objective function:
\begin{eqnarray}
\label{eq:perta}
\delta F(\WW ) &=&
- \sum_{m,k} \, ( p_k^m - x_k^{m,N} ) \delta x_k^{m,N}
= - \sum_{m} \, ( \pv^m - \xv^{m,N} ) \cdot \delta \xv^{m,N} , \\
\label{eq:pertd}
\mbox{where} ~ \dx^{m,0} &=& \svector{0} \\
\label{eq:pertb}
\mbox{and}~ \delta x_i^{m,n} &=&
\sum_{j} \, \delta W_{ij}^n \cdot f_j^n (\xv^{m,n-1}) +
\sum_j W_{ij}^n \sum_k
\frac{\partial}{\partial x_k} f_j^n (\xv^{m,n-1}) \delta x_k^{m,n-1} , \\
\label{eq:pertc}
\mbox{or} ~
\dx^{m,n} &=& \dW^n \cdot \fv^n ( \xv^{m,n-1} )
+\Wt^n \cdot \stensor{\nabla} \fv^n ( \xv^{m,n-1} ) \cdot \dx^{m,n-1} .
\end{eqnarray}
Unperturbed variables retain their original reference values from
the original non-linear forward transform (\ref{eq:recu}).
The perturbations (\ref{eq:pertb}) of the vector elements $\dx^{m,n}$ are
expressed as a linear function of perturbed weights $\dW^n$.

We can abbreviate the recursive equations (\ref{eq:pertb})
as a single linear transform $\Gt^{m,n}$.
\begin{eqnarray}
\label{eq:Gfor}
\delta x_k^{m,N} &=& \sum_{n,i,j} G_{kij}^{m,n} \delta W_{ij}^n \\
\mbox{or} ~
\dx^{m,N} &=& \sum_n \Gt^{m,n} : \dW^n .
\end{eqnarray}
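Neither the recursion (\ref{eq:pertc}) nor the abbreviation
(\ref{eq:Gfor}) requires forming a matrix: the perturbation $\dx^{m,N}$
can be accumulated in one forward sweep. A sketch under the same sigmoid
assumption, where $\stensor{\nabla} \fv^n$ reduces to a diagonal of
elementwise derivatives:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    # Elementwise derivative of the sigmoid.
    s = sigmoid(x)
    return s * (1.0 - s)

def forward(d, weights):
    xs = [np.asarray(d, dtype=float)]
    for W in weights:
        xs.append(W @ sigmoid(xs[-1]))
    return xs

def linearized_dx(d, weights, dWs):
    # dx^0 = 0; dx^n = dW^n . f(x^{n-1}) + W^n . diag(f'(x^{n-1})) . dx^{n-1}.
    # This is the action of sum_n G^{m,n} : dW^n on a weight perturbation.
    xs = forward(d, weights)
    dx = np.zeros_like(xs[0])
    for n, (W, dW) in enumerate(zip(weights, dWs)):
        dx = dW @ sigmoid(xs[n]) + W @ (dsigmoid(xs[n]) * dx)
    return dx
```

A finite-difference check against the nonlinear forward transform
confirms that this is the correct linearization.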

Gradient optimization also requires the adjoint of this linearization.
If the linearized forward transform is expressed as a matrix,
then the adjoint transform is just the transpose of this matrix.
A matrix would be unwieldy, but the recursive version of the
adjoint equations is not.
\begin{eqnarray}
\label{eq:adja}
\delta W_{ij}^n &=& \sum_m \delta x_i^{m,n} f_j^n ( \xv^{m,n-1} ), \\
\mbox{or} ~
\dW^n &=& \sum_m \dx^{m,n} [ \fv^n (\xv^{m,n-1} ) ]^{*} ~~
\mbox{(outer product),} \\
\mbox{where} ~ \delta \xv^{m,N}
&=& - (\pv^m - \xv^{m,N} ) \delta F \\
\label{eq:adjb}
\mbox{and} ~ \delta x_k^{m,n-1} &=& \sum_j
\frac{\partial}{\partial x_k} f_j^n (\xv^{m,n-1})
\sum_i W_{ij}^n \delta x_i^{m,n} , \\
\mbox{or} ~
\dx^{m,n-1} &=& [ \stensor{\nabla} \fv^n ( \xv^{m,n-1} )]^{*}
\cdot (\Wt^n)^{*} \cdot \delta \xv^{m,n} .
\end{eqnarray}
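The adjoint recursion (\ref{eq:adja})--(\ref{eq:adjb}) translates just as
directly. In this sketch (sigmoid nonlinearities assumed) the seed
$\delta \xv^{m,N}$ is taken to be the residual $\pv^m - \xv^{m,N}$, which
produces the gradient direction used for optimization below:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def forward(d, weights):
    xs = [np.asarray(d, dtype=float)]
    for W in weights:
        xs.append(W @ sigmoid(xs[-1]))
    return xs

def adjoint_dW(pairs, weights):
    # Seed dx^{m,N} with the residual p^m - x^{m,N}, then run the adjoint
    # recursion downward:
    #   dW^n += outer(dx at the layer output, f applied to the layer input)
    #   dx at the layer input = diag(f'(input)) . (W^n)^T . dx at the output
    dWs = [np.zeros_like(W) for W in weights]
    for d, p in pairs:
        xs = forward(d, weights)
        dx = p - xs[-1]
        for n in range(len(weights) - 1, -1, -1):
            dWs[n] += np.outer(dx, sigmoid(xs[n]))
            dx = dsigmoid(xs[n]) * (weights[n].T @ dx)
    return dWs
```

Because $F$ is the sum of squared residuals, this direction equals
$-\frac{1}{2}$ times the gradient of $F$, which a finite-difference test
can verify entry by entry.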

The perturbations produced by this adjoint recursion are not equal to
those of the forward linearization, although the same symbols are reused.
The adjoint recursion can also be abbreviated with the
linear transform $\Gt^{m,n}$.
\begin{eqnarray}
\label{eq:Gadj}
\delta W_{ij}^n &=& \sum_{m,k} G_{kij}^{m,n} \delta x_k^{m,N} , \\
\mbox{or} ~
\dW^n &=& \sum_m ( \Gt^{m,n} )^{*} \cdot \dx^{m,N} .
\end{eqnarray}

If the linear transform $\Gt^{m,n}$ were written as a single
large matrix, then indeed the matrix would be identical
in the forward and adjoint equations (\ref{eq:Gfor}) and (\ref{eq:Gadj}).
For optimization, fortunately, we need not construct this matrix
explicitly (as would be required by singular-value decomposition).
Instead, we need only be able to multiply this linear transform
or its adjoint by specific perturbations, using the recursions
(\ref{eq:pertb}) and (\ref{eq:adjb}).
\section{Optimization}
The simplest neural network ``training'' algorithm adjusts the previous
choice of weights by a scaled gradient. This recursive algorithm
is called back-propagation.
\begin{enumerate}
\item Initialize each weight matrix $\Wt^n$.
\item Calculate
$\Delta W_{ij}^n
= \sum_{m,k} G_{kij}^{m,n} [ p_k^m - x_k^{m,N} ( \dv^m , \WW ) ],$ \\
or $\Delta \Wt^n = \sum_m ( \Gt^{m,n} )^{*} \cdot ( \pv^m - \xv^{m,N} )$.
\item Replace each $\Wt^n$ by $\Wt^n + \epsilon \Delta \Wt^n$.
\item Return to step 2.
\end{enumerate}
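The loop above can be sketched end to end; with a small fixed
$\epsilon$, each step along $\Delta \Wt^n$ reduces the objective. The
names and the sigmoid choice are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def forward(d, weights):
    xs = [np.asarray(d, dtype=float)]
    for W in weights:
        xs.append(W @ sigmoid(xs[-1]))
    return xs

def objective(pairs, weights):
    # F = sum over m of || p^m - x^{m,N} ||^2.
    return sum(np.sum((p - forward(d, weights)[-1]) ** 2) for d, p in pairs)

def delta_W(pairs, weights):
    # Step 2: back-propagate the residual p - x^N through the adjoint
    # recursion to obtain the scaled-gradient direction Delta W^n.
    dWs = [np.zeros_like(W) for W in weights]
    for d, p in pairs:
        xs = forward(d, weights)
        dx = p - xs[-1]
        for n in range(len(weights) - 1, -1, -1):
            dWs[n] += np.outer(dx, sigmoid(xs[n]))
            dx = dsigmoid(xs[n]) * (weights[n].T @ dx)
    return dWs

def train(pairs, weights, eps=1e-3, iters=100):
    # Steps 3-4, repeated: replace each W^n by W^n + eps * Delta W^n.
    for _ in range(iters):
        dWs = delta_W(pairs, weights)
        weights = [W + eps * dW for W, dW in zip(weights, dWs)]
    return weights
```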

To reduce the objective function, the perturbation reverses the
sign of the gradient. The small scale factor $\epsilon$ is sometimes
fixed \textit{a priori} and never changed.
If the scale factor is too small, then many consecutive steps may
move in the same direction. If the scale factor is too large,
the perturbations may increase rather than decrease the objective
function and may even diverge. Although this algorithm is most
commonly cited, we can easily do better. Let us turn to
standard non-linear optimization methods.

Many steepest-descent algorithms would replace step 3
by a line search to find an optimum scale factor.
\begin{enumerate}
\item Initialize each weight matrix $\Wt^n$.
\item Calculate $\Delta \Wt^n $ as before.
\item Find $\alpha$ to minimize
$\sum_{m,k} \, [ p_k^m - x_k^{m,N}(\dv^m ,
\Wt^1 + \alpha \Delta \Wt^1 , \ldots ,
\Wt^N + \alpha \Delta \Wt^N )]^2$
\item Replace each $\Wt^n$ by $\Wt^n + \alpha \Delta \Wt^n$.
\item Return to step 2.
\end{enumerate}

This revised algorithm is guaranteed to find a local minimum where the
gradient is zero. Step sizes are large. The line search can use
combinations of parabolic fitting and golden sections for speed
and robustness.
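A plain golden-section search (without the parabolic fitting) takes only
a few lines. In this sketch, `phi` stands for the one-dimensional
function of $\alpha$ minimized in step 3, and a bracketing interval is
assumed given:

```python
import numpy as np

def golden_section(phi, lo, hi, tol=1e-5):
    # Shrink [lo, hi] around a minimizer of phi, assuming phi is unimodal
    # on the interval.  Each pass keeps the golden-ratio spacing.
    # (A production version would cache one phi evaluation per pass.)
    invphi = (np.sqrt(5.0) - 1.0) / 2.0
    c = hi - invphi * (hi - lo)
    d = lo + invphi * (hi - lo)
    while hi - lo > tol:
        if phi(c) < phi(d):
            hi, d = d, c
            c = hi - invphi * (hi - lo)
        else:
            lo, c = c, d
            d = lo + invphi * (hi - lo)
    return 0.5 * (lo + hi)
```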

A good scale factor can also be estimated without
an explicit line search.
\begin{enumerate}
\item Initialize each $\Wt^n$.
\item Calculate $\Delta \Wt^n $ as before.
\item Calculate
$\Delta \xv^{m,N} = \sum_n \Gt^{m,n} : \Delta \Wt^n$ .
\item Find $\alpha$ to minimize
$\sum_{m,k} \, [ p_k^m - x_k^{m,N} - \alpha \Delta x_k^{m,N} ]^2$ , \\
or equivalently $\alpha =
( \sum_{m,k} \, [ p_k^m - x_k^{m,N} ] \Delta x_k^{m,N} ) /
( \sum_{m,k} \Delta x_k^{m,N} \Delta x_k^{m,N} ) .$
\item Replace each $\Wt^n$ by $\Wt^n + \alpha \Delta \Wt^n$.
\item Return to step 2.
\end{enumerate}
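Steps 3 and 4 can be sketched with the forward linearization from before
(sigmoid nonlinearities assumed). The optimal scalar $\alpha$ for the
linearized misfit is simply a ratio of two inner products:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def forward(d, weights):
    xs = [np.asarray(d, dtype=float)]
    for W in weights:
        xs.append(W @ sigmoid(xs[-1]))
    return xs

def linearized_dx(d, weights, dWs):
    # Step 3: dx^n = dW^n . f(x^{n-1}) + W^n . diag(f'(x^{n-1})) . dx^{n-1}.
    xs = forward(d, weights)
    dx = np.zeros_like(xs[0])
    for n, (W, dW) in enumerate(zip(weights, dWs)):
        dx = dW @ sigmoid(xs[n]) + W @ (dsigmoid(xs[n]) * dx)
    return dx

def optimal_alpha(pairs, weights, dWs):
    # Step 4: minimize sum_m || (p - x) - alpha * dx ||^2 over alpha.
    num = 0.0
    den = 0.0
    for d, p in pairs:
        r = p - forward(d, weights)[-1]
        dx = linearized_dx(d, weights, dWs)
        num += r @ dx
        den += dx @ dx
    return num / den
```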

When the linearization is a good approximation (particularly in
the vicinity of a minimum), the scale factor will be very accurate.
Each iteration will be much more efficient than a line search.
If the scaled perturbation increases the objective function in early
non-linear iterations, then a line search can be used until the
linearization improves.

The convergence of these steepest descent algorithms can be improved
by a PARTAN or Fletcher--Reeves algorithm, which retains and combines
consecutive gradients. Neural-network references use the word ``momentum''
to describe similar strategies, usually with fixed scale factors.

An even more efficient alternative is to solve the linearized least-squares
problem fully in each iteration --- a Gauss-Newton approach:
\begin{enumerate}
\item Initialize each $\Wt^n$.
\item Find the perturbations $\{\ww \}$ that minimize \\
$\sum_{m,k} \, [ p_k^m - x_k^{m,N} ( \dv^m , \WW )
- \sum_{n,i,j} G_{kij}^{m,n} w_{ij}^n ]^2$.
\item Replace each $\Wt^n$ by $\Wt^n + \wt^n$.
\item Return to step 2.
\end{enumerate}

Since the objective function in step 2 is a fully quadratic function
of the perturbations $\wt^n$, we can use iterative least-squares
methods like conjugate gradients.
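Step 2 never needs $\Gt^{m,n}$ as an explicit matrix: conjugate
gradients on the normal equations of the linearized problem applies only
the forward and adjoint recursions. A sketch (sigmoid nonlinearities
assumed; the weight perturbations are flattened into one vector for the
solver):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def forward(d, weights):
    xs = [np.asarray(d, dtype=float)]
    for W in weights:
        xs.append(W @ sigmoid(xs[-1]))
    return xs

def jvp(d, weights, dWs):
    # Forward linearization: dx^N = sum_n G^n : dW^n for one data vector.
    xs = forward(d, weights)
    dx = np.zeros_like(xs[0])
    for n, (W, dW) in enumerate(zip(weights, dWs)):
        dx = dW @ sigmoid(xs[n]) + W @ (dsigmoid(xs[n]) * dx)
    return dx

def vjp(d, weights, seed):
    # Adjoint linearization: dW^n = (G^n)* . seed for one data vector.
    xs = forward(d, weights)
    dWs = [np.zeros_like(W) for W in weights]
    dx = np.asarray(seed, dtype=float)
    for n in range(len(weights) - 1, -1, -1):
        dWs[n] = np.outer(dx, sigmoid(xs[n]))
        dx = dsigmoid(xs[n]) * (weights[n].T @ dx)
    return dWs

def flatten(ws):
    return np.concatenate([w.ravel() for w in ws])

def unflatten(v, shapes):
    out, i = [], 0
    for s in shapes:
        n = int(np.prod(s))
        out.append(v[i:i + n].reshape(s))
        i += n
    return out

def gauss_newton_step(pairs, weights, iters=200):
    shapes = [W.shape for W in weights]
    def normal_matvec(v):
        # v -> G* G v, accumulated over all (d, p) pairs.
        dWs = unflatten(v, shapes)
        acc = np.zeros_like(v)
        for d, _ in pairs:
            acc += flatten(vjp(d, weights, jvp(d, weights, dWs)))
        return acc
    # Right-hand side: G* applied to the residuals.
    b = np.zeros(sum(int(np.prod(s)) for s in shapes))
    for d, p in pairs:
        b += flatten(vjp(d, weights, p - forward(d, weights)[-1]))
    # Conjugate gradients on G* G w = G* (p - x).
    w = np.zeros_like(b)
    r = b.copy()
    q = r.copy()
    rs = r @ r
    for _ in range(iters):
        Aq = normal_matvec(q)
        a = rs / (q @ Aq)
        w += a * q
        r -= a * Aq
        rs_new = r @ r
        if rs_new < 1e-20:
            break
        q = r + (rs_new / rs) * q
        rs = rs_new
    return unflatten(w, shapes)
```

The returned perturbations $\wt^n$ minimize the quadratic objective of
step 2, after which the weights are updated and the linearization
repeated.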

If a steepest descent algorithm were applied with an unlimited
number of infinitesimal perturbations, then the changing weights
would continuously decrease the objective function until a local
minimum was reached. Visualize following a path that is directly
downhill at every point. The path cannot ascend, no matter what
may be beyond the next rise.
Practical descent algorithms take large step sizes and may step over
a small local increase in the objective function, to find another
minimum beyond.

To increase the chances of finding the global minimum of
the pseudo-quadratic objective function (\ref{eq:obje}), we could
restart with different initial weights and discard suboptimal
local minima. If different initial weights consistently lead to the same
solution, then the objective function is probably convex and
has only one global minimum.
\bibliography{wsh}
\end{document}