THE DAMPED NEWTON METHOD
--AN ANN LEARNING
ALGORITHM
By
LIYA WANG
Bachelor of Science
China University of Mining and Technology
XuZhou, P.R. of China
1985
Master of Science
Northern Arizona University
Flagstaff, Arizona
May, 1993
Submitted to the Faculty of the
Graduate College of the
Oklahoma State University
in partial fulfillment of
the requirement for
the Degree of
MASTER OF SCIENCE
December, 1995
OKLAHOMA STATE Uf.JIVERSITY
THE DAMPED NEWTON METHOD
--AN ANN LEARNING
ALGORITHM
Thesis Approved:
Dean of the Graduate College
ii
PREFACE
This paper presents a new learning algorithm for training fully-connected, feedforward
artificial neural networks. The proposed learning algorithm will be suitable for training
neural networks to solve approximation problems.
The framework of the new ANN learning algorithm is based on Newton's method for
solving non-linear least squares problems. To improve the stability of the new learning
algorithm, the Levenberg-Marquardt technique for safe-guarding the Gauss-Newton
method is incorporated into the Newton method. This damped version of Newton's
method has been implemented using FORTRAN 77, along with some other well-known
ANN learning algorithms in order to evaluate the performance of the new learning
algorithm. Satisfactory numerical results have been obtained. It is shown that the
proposed new learning algorithm has a better performance than the other algorithms in
dealing with function approximation problems and problems which may require a high
precision of training accuracy.
I would like to express my sincere gratitude to my thesis advisor, Dr. J. P. Chandler,
for his guidance in choosing the topic of this thesis and for his thoughtful. suggestions and
encouragement in pin-pointing and solving problems that I have encountered throughout
my thesis work. Special thanks are also due to Dr. K. M. George and Dr. G. E. Hedrick of
the Department of Computer Science, who served as the committee members for this
iii
thesis. Their patience in reviewing and revising this report is also much appreciated.
Finally, I would like to thank my parents, Qinghuo Wang and ChuenLing Yang, for
rearing me very thoughtfully. My deepest appreciation is extended especially to my
mother, who suffered very much in her life and raised my brother and me under some
difficult circumstances.
IV
TABLE OF CONTENTS
Chapter Page
1. INTRODUCTION 1
1.1 ANN History 1
1.2 ANN Models and Applications 2
1.3 Feedforward ANN 3
1.4 Learning in Feedforward ANN 5
1.5 Learning Algorithms for Feedforward ANN 6
1.6 The Purpose of the Paper 8
2. NON-LINEAR OPTIMIZATION 9
2.1 Fundamental Concepts 9
2.2 General Descent 18
2.3 Steepest Descent Method 21
2.4 The Newton Method 23
2.5 Conjugate Gradient Descent Method 26
2.6 Least Squares Minimization
/', ~ ,.
.. f! .. ~/I.'~·F .. /I~l"' n' - i\/, !,J+ 0', ) ..... • , .... ••••• ~ •• ~ ........ \, ••••.•• ; ••• ~ ••••••••••• 31
2.7 The Levenberg-Marquardt Method 34
3. ANN LEARNING ALGORITHMS 42
3.1 Architecture of Feedforward ANN 42
v
3.2 Dynamic Adaptation in Feedforward ANN
3.3 Propagated Computations in Feedforward ANN
3.4 Classical Learning Algorithms
3.5 Implementation of The Levenberg-Marquardt Method
4. THE DAMPED NEWTON LEARNING ALGORITHM
4.1 The Damped Newton Method
4.2 The Damped Newton Learning Algorithm
5. IMPLEMENTATION AND TEST RESULTS
5.1 Language Implementations
5.2 Neural Network Design
5.3 Test Problems
5.4 Test Results
6. CONCLUSION
6.1 Conclusion
6.2 Recommendation for Future Work
7. REFERENCES
8. APPENDICES
8.1 Program List
vi
48
53
60
68
71
71
72
75
75
78
80
81
88
88
88
90
94
94
LIST OF TABLES
"
Table #
Table 1: 2-parity Test Results
Table 2: 3-parity Test Results
Table 3: 4-parity Test Results
Table 4: Cancer Problem Test Results
Table 5: Building Problem Test Results
Table 6: Heart Problem Test Results
Table 7: GNLMD Performance Test Results on Building Problem
Table 8: NLMD Performance Test Results on Building Problem
Table 9: GNLMD Performance Test Results on Heart Problem
Table 10: NLMD Performance Test Results on Heart Problem
vii
Page
82
83
83
84
84
85
86
86
87
87
LIST OF FIGURES
Chapter.Section.Figure Number
Figure 1.3.1. A simulated neuron
Figure 1.3.2. The Sigmoid function
Figure 1.3.3. A neural network for XOR classification
Figure 3.1.1. Architecture of a three-layer neural network
Figure 3.1.2. The Sigmoid function
Figure 3.1.3. The hyperbolic tangent function
Figure 3.3.1. Part of a neural network
viii
Page
3
4
4
43
45
46
53
1. INTRODUCTION
1.1 ANN History
Artificial Neural Networks (ANNs) emerged in the 1940s when people used a single
threshold device to model the functionality of a biological neuron [20]. Simulating the
network structure of the human brain, those artificial neurons were linked together to
form a network (called artificial neural network, neural network, or neural net, etc.).
While a human brain is characterized by its learning capability, the learning capacity of
the linked neurons was first shown in the book titled "The Organization of Behavior" by
Donald Hebb, published around 1950 [18]. Two early successes in the 1950s and 1960s
were Rosenblatt's Perceptron and Widrow's ADALINE (ADAptive LINear Element),
equipped with the Perceptron learning law and Widrow's learning law respectively [18].
Both were able to learn and to perform accordingly. Despite some setbacks in the late
1960s and 1970s, the ANN researches regained their strength in the 1980s due to the
contributions of many dedicated scholars. The publication .of the famous Error BackPropagation
Algorithm (Backpropagation, Backprop, or BB Algorithm) by Rumelhart,
Hinton, and Williams in 1986 (also independently done by Paul Werbos in 1974 [30]) set
the milestone for the ANN's prosperity beginning in the late 1980s [15]. Today, the
studies of ANNs have become an independent and very broad field, and involve the
endeavors of thousands of scholars and engineers in many different fields allover the
world.
1
1.2 ANN Models and Applications
An artificial neural network is an infonnation processing system [10]. It usually
consists of a large number of artificial neurons. These neurons are linked together through
direct connections to form a network. Currently there are several such network models:
feedforward network model, feedback (recurrent) network model, and cellular model.
One example of the feedforward network is the fully-connected feedforward neural
network. Such networks are very useful in dealing with classification problems and
function approximation problems. The Hopfield network and the Boltzmann machine are
examples of the recurrent models. They are often used in speech processing and pattern
recognition [20]. The Kohonen map is a fonn of the cellular model. Such network is selforganizing
and is mostly used in speech recognition [20].
An ANN can be considered to be a parallel machine. It distributes its computing power
among all neurons in the network. Each neuron,a simple local processing unit, contributes
to the final output of the network. This characteristic of the ANN is very helpful to the
application problems that require massive and high-speed data processing capacities.
Today, ANNs have been used in many fields. Such fields include data encoding and
compression, signal and image processing, speech recognition, pattern classification,
noise filtering, stock market predictions, credit card application processing, and modeling
[2]. Currently, there are several kinds of neural network chips and hundreds of software
packages available for developing neural network applications. Thousands of neural
network applications have been carried out successfully and more are being explored in
new areas and in new fields.
2
1.3 Feedforward ANN
In this paper, we concentrate on one particular ANN model: the fully-connected,
feedforward neural network. In the sections that follow, all neural networks mentioned
will refer to the fully-connected, feedforward neural networks unless otherwise stated.
A fundamental feature of the feedforward neural network is its adaptive behavior, i.e.,
the capability to learn. When a feedforward ANN is employed to tackle a problem, all
that are needed is some training examples of the problem, i.e., some input-output
patterns. Based on those sample patterns, the ANN, when properly set up, not only tries to
learn how to handle those samples correctly, but also summarizes the learning results and
tries to handle samples in the entire problem domain. This capability of the ANN s is
especially useful to solve the kind of problems whose underlying ideas are not clear
and/or there are no fast rules that can easily be applied.
The basic building blocks of a feedforward neural network are the artificial neurons
and the connections. A connection usually has a weight on it to represent the level of
importance of (or contribution from) that connection, except perhaps for the output
connections of the network. A neuron (Figure 1.3.1) is a basic processing unit: it sums up
all its inputs, modulates each by its corresponding connection weight, and then transforms
the result via some activation function to yield the output of the neuron.
Input weight weighted activation
10 sum function
output
...-__ .. o=f(u)
Figure 1.3.1: A simulated neuron.
3
The type of the activation function determines the sort of problems that an ANN can
solve. For example, ANNs with step (threshold) activation functions can only solve
classification problems. In order to approximate non-linear mappings, non-linear
activation functions have to be used. In this paper, the non-linear sigmoidal activation
function will be used ( Figure 1.3.2).
1
0.4 / 0.2
-10 -5 5 10
A.
Figure 1.3.2: The Sigmoid function
input
Figure 1.3.3: a neural network for XOR classification
4
All neurons in a feedforward network are organized into layers or slabs (Figure 1.3.3).
The outputs of the neurons in one layer (or the inputs of the network) are all linked only
to each of the neurons in the next layer, except for the output layer of the network. A
neural network connected this way is called a fully-connected neural network. When
input signals are fed in, the computation of the network is carried out on a layer-by-Iayer
basis, starting at the input layer. This processing pattern continues until the outputs of the
network have been produced. Such a computation process is called a forward pass, which
is a left-to-right pass as shown in a network layout.
1.4 Learning in Feedforward ANN
Learning is an intrinsic requirement of neural networks--an untrained network with
randomly assigned connection weights can do nothing meaningful. The knowledge
representation system employed by an ANN is opaque: the ANN has to learn its own
representation because programming it by hand is not possible. That is, a neural net has to
be trained in order to perform a defined task. Thus, the use of a neural network usually
involves two major stages: the learning stage and the performing stage. During the
learning stage, a neural network produces output for each input sample and the result is
compared with the required output. If a mistake is made or the approximation result is not
desirable, then the net tries to learn from this example and modifies its behavior in order
to make fewer mistakes or better approximation results. Such a way of training is called
supervised training or learning with a teacher. After perhaps hundreds, sometimes even
up to hundreds of thousands of round of repetitions as the net goes through all the training
examples, the trained network will enter the performing stage. In this stage, the ANN is
5
used, based upon its generalization of the training examples, to compute outputs for other
non-example inputs.
While the learning of a neural network is observed as a change of behavior, inside the
neural network, learning takes place in the form of weight changing. Usually, some sort
of optimization function is used to measure the output errors. Then, a learning algorithm
is applied to transform the measured error information into weight changes. The major
learning algorithms that are used for training the feedforward nets are those that enforce
the learning process by means of backpropagation. Those learning algorithms are
discussed in the next section.
1.5 Learning Algorithms for Feedforward ANN
A learning law governs how a neural network enforces learning. The following is a
general outline of the learning procedure used in all of the backpropagation algorithms:
• The network usually starts out with a random set of weights.
• Examples (input-output pairs) are presented at the input side of the network.
• Each input-output pair requires two stages: a forward pass and a backward
pass. The forward pass is to present a sample input to the network and to let
activation flow until they reach the output layer.
• During the backward pass, the network's actual output from the forward pass
is compared with the desired output and error estimates are computed for the
output nodes. Then the network adjusts its weights in a backward fashion,
starting at the output layer (or saves all adjustments and changes weights after
all examples have been presented), in order to reduce the errors.
6
• The process is repeated many times until some error criteria are met.
Several mathematical models in optimization theory can be used in carrying out the
learning process of the above procedure. In that regard, the learning process corresponds
to the minimization process of the models. Among the available optimization models, the
Least Squares model is the one used the most in ANNs. Suppose that n, t, and m are the
dimensions of the input vector, weight vector, and output vector respectively and s is the
number of training examples. For each i=l, ... , m, let Yj (x, w) be the i-th coordinate in
the output vector. Then, the least squares model for training an ANN is
(1.5.1)
where Xj is the j-th input vector, and yj, j is the i-th coordinate of the output vector Yj,
corresponding to the input Xj.
In the above optimization model, learning corresponds to minimizing E(w) with regard
to the weight vector w. There are many ways to achieve this optimization goal and it is
the mathematical methods used in the algorithms that distinguish one from the others.
Currently, most of the gradient-based backpropagation learning algorithms can be traced
to three basic mathematical methods, which are well established in the optimization
theory. They are the Steepest Descent method, the Conjugate Gradient method, and the
Newton method. Many learning algorithms have been successfully built upon those
methods and more modified versions have come out in order to improve the performance
of the original versions. One remarkable method that has been used in some modified
algorithms is the Levenberg-Marquardt method [24], which is a modification of the
Gauss-Newton method (derived from the Newton method). In this paper, we develop a
7
new ANN learning algorithm that will be based on the framework of the LevenbergMarquardt
method. The goal of the paper is stated in the next section.
1.6 The Purpose of the Paper
The aim of this paper is to introduce a new supervised learning algorithm for training
fully-connected, feedforward neural networks. The proposed algorithm is based on the
well-known Newton numerical approximation method and the Levenberg-Marquardt
improvement of the Gauss-Newton method. Specifically, we incorporate Newton's
method with the technique used in the Levenberg-Marquardt method to derive an
improved version of the Newton method.
To investigate further into the subject, we first need some background knowledge. In
theory, there is considerable overlap between the fields of neural networks and statistics.
Many well-known results from the statistical theory of non-linear models, as we shall see
in the next few chapters, can be applied directly to feedforward neural networks. The next
Chapter is devoted to a brief introduction of non-linear optimization theory.
8
2. NON-LINEAR OPTIMIZATION
In this chapter, we will introduce some of the classical minimization methods.
Although, the approaches to derive these methods are different, they have two things in
common. First, all these methods are iterative. That is to say, to locate a minimizer x * of a
function J, a sequence {x(k) } of points is generated. If the sequence converges, then an
estimate of x * to a satisfactory degree of accuracy has been attained. Second, all these
methods are descent methods that contain the following two fundamental steps at each
iteration.
1) Determine a search direction along which a reduction off in function values is
possible.
2) Choose a step size so that minimization does take place.
We shall start, in section 2.2, with a discussion of a general descent method consisting
of the above two basic steps and then study the classical ones in the sections that follow.
The first section below provides some of the background knowledge we will need for the
discussion of later sections.
2.1 Fundamental Concepts
Let 9\ be the set of all real numbers.
Linear and matrix algebra
An mXn matrix A E 9\mxn given by
9
can be expressed compactly as A=[aij]mxn• It has m rows and n columns. An mxl matrix
will be a column vector of dimension m and a lxn matrix will be called a row vector of
dimension n. In our discussion, matrices are denoted by uppercase bold face letters and
vectors by lowercase bold face letters.
The transpose of any matrix is obtained by rewriting all columns as rows. Thus, the
transpose of an mXn matrix are an nXm matrix. In symbol, the transpose of a matrix A is
denoted by AT. Note that for any matrix A, we have
and
where Ae 9tmxn, Be 9tnxl, and AB=[i aikbkj ] •
k=l mxl
An nxn matrix is called a square matrix. A symmetric matrix is a square matrix A such
that
A=AT
A diagonal matrix De 9tnxn is a square matrix whose entries dij=O, for all i=!:j, and is
written as
D=diag(dl1, ... ,dnn).
A diagonal matrix is called the identity matrix if all of its diagonal elements are equal
to 1. An identity matrix is often denoted by I.
10
A square matrix B is said to be the inverse of another square matrix A if
AB=I.
In this case, B can be written as B=A-t, and it could be shown that BA=I. A square
matrix A is a singular matrix if its inverse does not exist. Otherwise, it is said to be non-singular.
A symmetric matrix AE 9\nxn is said to be positive definite (negative definite) if the
quadratic form x T Ax satisfies
A matrix A is positive (or negative) semidefinite if the equality sign is included in the
condition above. Note that the matrices AT A and AA T are always semidefinite for any A.
If a matrix A is positive or negative definite, then its inverse matrix exists.
A system of linear equations, given by
{
allxl + "':+ aln = bl
. ,
amlxl + •.• +amn =bm
can be written in matrix form as
Ax=b,
where A=[aij]mxn, X=(Xi)T, and b=(bDT. The system is solvable if the inverse matrix of A
exists. In case A-I exists, the solution of the system can be written as
To solve a system of linear equations, we need an algorithmic procedure. The
algorithm we shall use later on is based on the well-known Gaussian elimination method.
To reduce the round-off error associated with finite-digit arithmetic, a technique called
11
maximal column pivoting or partial pivoting is incorporated into the algorithm [9], which
is listed below.
Algorithm 2. 1. 1: given an nxn system of linear equations Anxn x=b, with Anxn=( au) the coefficient matrix.
1. set a(i, J)=aij, all i, j, and For i=1, ... , n, set NROW(/)=i.
2. For i=1, ... , n-1, repeat step 3,4,5, and 6.
3. Let p be the smallest integer with i:O; p:o; n, and
I a( NROW(p),i~ = max la(NROW(j), i)l.
'''','''n
4. If a(NROW(p), 1)=0, then print "no unique solution exists" and stop.
5. If NROW(/) * NROW(p), then set
NCOPY = NROW(/), NROW(/)=NROW(p), NROW(p)=NCOPY.
6. For j= i+1, ... , n, do 6.1 and 6.2
6.1. set m(NROW(j), I) = a(NROW(j), i) .
a(NROW(i), i)
6.2. Let ENROWl..!) = ENROWl..!) - m(NROW(j), I) ENRO'N(/).
7. If a(NROW(n), n)=O, then print "no unique solution exists" and stop.
8. Otherwise, set
x = _a.:.,.(N._R_O_W_(:..n.:..:.),_n +-,-1)
n a(NROW(n),n)
9. For i= n -1, ... , 1, set
n
a(NROW(i),n+1)- L a(NROW(i),j)'Xj
j=i+i
xi=------------~------------
a(NROW(i),i)
10. Output the solution x=(x 1, ... , xn).
The inner product of two real numbered n-dimensional vectors x=(xD and w=(wD is
defined as
T T n <x, W>=X W=W X= L WiXi •
i=i
12
A p-norm (or Lp-norm) of a vector x=(xikn is given by
A frequently used norm is the 2-norm, also called the Euclidean norm. In 2 or 3
dimensional real spaces, the 2-norm of a vector is just the length of the vector.
In a Euclidean normed vector space, we have Schwarz's inequality (Cauchy-Schwarz)
1< x, y >1 = IXT yl ~ IIxlI~2I1YII~2 , with
equality holding if and only if X=Ay, for some AE 9t
Multivariate analysis
A set D c 9\n is a convex set if whenever Xl, X2 E D, the line segment
lies entirely in D.
A real-valued functionf :9\n~9\ is continuous at X E 9\n, if any sequence {X(k)} such
that
A function f is said to be continuous on 9\n, iff is continuous at each XE 9\n.
A real-valued vector function F:9\n ~9\m , F(X)=(fl(X), ... ,fm(x))T, is continuous on
9\n, if eachfi is continuous on 9\n, i=l, ... ,m.
Let f :9\n ~9\ be a twice differentiable function. The first partial derivatives of f with
regard to (w.r.t.) Xi, i=l, ... , n, is denoted by
13
(if (x) .
JJ(x):=--, l=1, ... ,n, axi
and the second partial derivative is written as
The gradient g(x) or Vf(x) offis defined upon all the first partials off given by
._ ._(if(x)_ T g(x) . - Vf (x) . - --a;- - (J ti (x), ... ,J nf (x)) .
The Hessian matrix H(x) or V2f(x) of the functionfis defined to be
J~f(x) J IJ 2f(x) JIJnf(x)
H(x):=V2f(x):= ~[(if~X)r =
J 2JJ(x) J 2J 2f(x) J 2J n f(x)
JnJJ(x) J nJ 2f(x) J~f(x) nXn
Let F:9\n~9\m , F(x)=(fi(x), ... ,fm(X))T, be a vector function (written in bold face).
Then, The first order derivative of F W.r.t. x is defined by
v F(x) = (Vfl (x), ... , Vfl (x)).
transpose is defined to be the Jacobian matrix J(x) of F(x), i.e.,
J til (x) J 2fl (x)
J (x) . = aF(x) = J ti2 (x) J 2f2 (x) . ax
The following formulas [10] are useful for differentiating scalar functions with
respect to a vector.
14
(2.1.1) ! (F(x) T G(X)) = [ OF~X) r G(x) + [ iJG~X) r F(x), where F(x) and G(x)
are vector functions.
(2.1.2)
(2.1.3) aax (x T x) = 2x,
(2.1.4) aax (x T Ay) = Ay,
(2.1.5) aax (yT Ax) = (yT Af = AT y,
(2.1.6) aax (x TA x) = (A+AT)X,
Let f9\n~9\ be a function of X=(Xl, ... , xn), having continuous partial derivatives.
Assume xi=h(t) is differentiable for all i=l, ... , n. Then, the Chain rule of differentiation is
(if (t) _ i (if (x) dxi (t)
-at-i=la;;-at·
Taylor Theorem: Let f9\n ~9\ have continuous first (second) partial derivatives and
x * E 9\. Then, for any x near x *, there exists S E (0, 1) such that
J(x) = J(x*) + g(yl (x - x*)
( J (x) = J (x * ) + g(x) T (x - x * ) + -1 (x - x *)T H(y)(x - x * ) ),
2
where, y=x* + Sex-x*), g is the gradient vector ofJand H is the Hessian matrix off
A critical point of a functionf9\n~9\, having continuous first partial derivatives, is a
point x * in the domain of J at where
15
g(x *) =0,
where g is the gradient off The point x * is a strong global minimizer of J if
J(x) > J(x*), for all x in the domain off
The point x * is a strong local minimizer of J if the above condition holds on a convex
subset of the domain of f Note that a minimizer is necessarily a critical point. For a
critical point to be a strong local minimizer off, we have the following theorem [36].
Theorem 2.i.2: If 1. J:9\D---t9\ has continuous first and second partial derivatives in an
open convex set D containing x *;
2. x * is a critical point of J in D;
3. H(x*) is positive definite, where H is the Hessian matrix off,
then, x* is a strong local minimizer ofJover D.
line search
The minimization of a univariate function of the form l/J (a)=J (x+ap) over an interval
(a, b) c 9\ requires the uses of line search techniques. The commonly used such line
search techniques are the Golden Section search method, Bisection method, Quadratic
interpolation method, and Cubic Interpolation method. The following algorithm [28] is
based on the quadratic interpolation method, which we shall use later.
Algorithm 2.1.2: Quadratic Interpolation: given an initial interval (a1, b1), a point C1E (a1, b1), and tolerance.
2. set k=1.
3. set
(Note: if denominator in the above formula is 0, then stop.)
16
fx=f( X *)
if X * < Ck and fx < fc
then set ak+1=ak, bk+1=Ck, Ck+1= X *
fb=f c f c=f x.
else if x * > Ck and fx > fc
then set ak+1=ak, bk+1= X *, Ck+1= Ck.
fb=fx.
else if x * < Ck and fx > fc
then set ak+1= X *, bk+1= bk, Ck+1= Ck.
fa=fx.
else set ak+1=Ck, bk+1=bk, Ck+1= X *,
fa=fc, fc=fx.
4. if bk+1-ak+1< tolerance, or (f(Ck)-f(Ck+1))/f(Ck) < tolerance, then stop
5. otherwise, set k=k+ 1, go to 3.
In the above algorithms, it is assumed that a minimizer lies in the given initial interval.
If this is not available, then methods have to be incorporated into the algorithms to locate
such an interval first. The methods that are commonly in use to find such an interval are
the function comparison method and extrapolation method, which are listed in [28].
Rate of convergence
If a minimization method for minimizing a functionf generates a convergent sequence
{X(k)} that approaches a minimizer x * of J, we then are interested in the speed of
convergence of the algorithm. An algorithm is said to give p-th order rate of convergence
if p is the largest number such that the limit
17
· IIX(k+l) - X t
P=hm p
H= Ilx(k) -xt
exists. An algorithm with first-order rate of convergence is called an algorithm of linear
convergence and if, in addition, P=O, then it is said to be of superlinear convergence.
Having covered some of the basic knowledge that we shall need in later sections, we
now start with the introduction of a basic minimization technique--the general descent
method.
2.2 General Descent
In this section, we discuss a general descent scheme for minimizing a function, which
forms the skeleton of all the algorithms we shall introduce in this chapter. We need the
following definition to start with [36].
Definition 2.2.1: Assumef:9\n~9\ has first partial derivatives at a point x*, and let p E
9\n be a non-zero vector. Then, p is downhill for f at x * if and only if g(x *) T P <0, where g
is the gradient off
Let function 19\n ~ 9\ have continuous second partial derivatives at a point Xo E 9\n,
then by Taylor's theorem, we have
(2.2.1)
where y=Xo + Sap, for some e E (0, 1). It is not hard to show that as a ~ 0, the sign of
the last two terms in (2.2.1) would be dominated by the sign of the term ag(xo) Tp [36].
Now, assuming that p is downhill at xo, i.e., g(XO)Tp < 0, we then have, for a sufficiently
small,
18
jexo+ap) <jexo).
This implies that the function value of j would be reduced if we take an appropriate
step along a downhill direction. Hence, starting with an initial point x(O), we could use the
Taylor expansion of j around x(O) to find another point, say xCI), which results in the
function-value reduction of J, and then, successively repeat the processes for the newly
derived point X(k), k=1, 2, .... This iterative process will generate a sequence { X(k) } that
results in the successive reduction of j in function values. Such consideration yields the
following general descent algorithmic scheme [36]:
Algorithm 2.2. 1: Let an estimate xeD) of an unconstrained minimizer x* of f be given
1. Set k=O.
2. Compute p(k) such that g(X(k)Tp(k) < O.
3. Compute a(k) such that f(x(k)+a(k)p(k)) < f(X(k)).
4. Compute X(k+1) = x(k)+a(k)p(k).
5. If X(k+1) satisfies given convergence criteria, then stop.
6. Set k=k+ 1, go to 2.
In Algorithm 2.2.1, it is readily seen that Step 2 and 3 are the two fundamental
procedures that we have outlined in section 2.1, though detailed methods as how to carry
out those two steps are not given.
Clearly, there are many choices of a(k) and p(k) that satisfy the criteria in Step 2 and 3
of Algorithm 2.2.1. However, the choices of a(k) and p(k) alone are not sufficient to ensure
convergence of {X(k)} to a minimizer of f The conditions to ensure that {X(k)} will
converge to a minimizer ofj can be found in [36].
19
A frequently used technique to determine a(k) in Step 3 above is to estimate a local
minimizer of J (x(k)+ap(k») which is regarded as a function of a. Then, at least
approximately, a(k) would satisfy the requirement
(2.2.2)
Let l/J: 9\ ~ 9\ be defined by
(2.2.3)
Then (2.2.2) is equivalent to finding a local minimizer ark] of l/J, so that, at least
approximately,
(2.2.4)
where l/J' is the first derivative of l/J. Normally, (2.2.4) is a non-linear equation and can
not be solved analytically. Numerically, this can be done by using one of the line search
techniques mentioned in section 2.1. Hereafter, for all the algorithms we will introduce,
we shall use (2.2.2) as the criterion for computing ark] at each iteration.
An important consequence of using (2.2.2) is an equation that shows the relation
between the search direction p(k) and the gradient g(k+l) ofJat x(k+l). We have
By (2.2.4),
(2.2.5)
l/J'(a) =-dJ (X(k) +ap(k))
da
n d d = L-J(X(k) +ap(k))_(X(k) +ap(k))
i=ldxi da
= g(X(k) +ap(k))T p(k).
~-~-
l/J'(a(k)) = 0, which imPJi~: ~hat l"
g(k+!rp(k) = l/J'(a(k)) = 0.
..,....-----
20
The geometric meaning of (2.2.5) is that the two vectors p and g are orthogonal.
Many methods have been introduced for the determination of the p(k),S in Algorithm
2.2.1. The following few sections will discuss some of them in detail.
2.3 Steepest Descent Method
Suppose that the gradient vector g of a given functionj can be calculated analytically,
then we can choose for the vector p in Algorithm 2.2.1 as -g. That is, the descent direction
p(k) at each iteration will be _g(k):= _g(X(k») = -V(J)X(k) .
Assume g(k) '# 0, we have
"'l
I i
., p(k)T~_(k)= _g(k)T g(k) = _~(gi(k»)2 < o.
---.".:~.'_ I
(2.3.1)
Hence, if we choose the search direction p(k)= _g[k1, then p(k) is downhill for j at X(k) for
each integer k > o.
Note that since, by Taylor's theorem, for a(k) sufficient small, we have a truncated first
order Taylor representation ofjthat yields the following approximation.
(2.3.2)
In (2.3.2), we can see that the reduction of j in function values at each iteration
;;;,+"" .{ ~,{h~"{,-\< ~
depends approximately on the magnitude of p(k) Tg(k), which, b;' Schwarz' inequality, is
bounded by the product of the 2-norms (Euclidean norm) of the 2 factors, i.e.,
(2.3.3)
with equality holds if and only if p(k) = A g(k) (AE 9\ ). Hence, if we take p(k) = _g(k), then
~,..,..,~,~---~.-~-
p&;:)g(k) has maximum magnitude. This would result in approximately the maximum
21
· reduction of function values off at each iteration. Such a consideration justifies the name
-st-ee-pe_st .d-es-ce-nt- m-e-thod for the descent method obtained via taking p(k)= _g(k) in
Algorithm 2.2.1. The steepest descent method is contained in the following algorithm
[36].
Algorithm 2.3. 1: Let an estimate x(O) of an unconstrained minimizer x* of f be given
1. Set k=O.
2. Compute p(k) froI'!1J!(~~== :g(X(k)). ~~'
search
5. If X(k+1) satisfies given convergence criteria, then stop.
6. Set k=k+ 1, go to 2.
The proof of the convergence of the sequence {X(k)} to a minimizer off is the same as
that of the general descent method, assuming thatfhas continuous second partials [36].
One advantage of using this steepest descent algorithm is that the sequence {X(k)}
generated by Algorithm 2.3.1 will converge for any given initial estimate x(O) in the
domain of f This enables it to be regarded as a general purpose minimization method.
However, the rate of convergence of Algorithm 2.3.1 is only linear [26], which is
considered slow. The computations involved in Algorithm 2.3.1 are simple except for the
computations required for performing a line search which may vary depending on the
search method used.
22
2.4 Newton's Method
To derive Newton's method for minimization problems, we need a stronger
assumption than that of section 2.3 to start with. In this section, function f"9\ll -? 9\ will
be assumed to have continuous third partial derivatives in a convex neighborhood D of a
critical point x * E 9\ll and its Hessian matrix H(x *) being positive definite. Then, by
Theorem 2.1.2, x* is a local minimizer off. This yields the necessary condition
(2.4.1) g(x*)=o,
where g(X)=(gl(X), ... , gll(X» =(dd(x), ... , d nf(x») is the gradient off.
If we could solve (2.4.1), then the solution would be a minimizer of f However, in
general, (2.4.1) is a system of non-linear equations. To solve it, iterative methods have to
be used.
Consider gi (x)=d i (f(x» over D. Now, each gi (x) has continuous second order partials
by hypothesis. Hence, for an estimate xED, we have, for each xED, the Taylor
expansion off around x,
(2.4.2)
where, Yi=X +8i(x -x), for some 8i E (0,1), all i=l, ... , n.
If x is sufficiently close to X, the last term in (2.4.2) could be dropped. Then, letting
x=X and taking into account of (2.4.1), we would obtain
(2.4.3)
23
Since the approximation (2.4.3) are true for all i=l, ... , n, we then have, using the
Hessian matrix off at x ,
(2.4.4) H(x)(x*- x)=:-g(x).
Hence
(2.4.5)
(2.4.5) suggests that, for an initial estimate x(O) of a solution x * of (2.4.1), we can
approximate x* with arbitrary accuracy by generating the sequence {X(k)} via the iteration
(2.4.6)
'-T,-h--i-s- -i-t-erative procedure for estimating a critical point off is called Newton's method. -"."~'.
Unlike the derivation of the Steepest Descent method which involves using a first order
Taylor approximation of f, ~l1e derivation of the Newton method requires the uses of
sec0I!~LQrcl~r ,Taylor representation off, which approximates f much better than does the
first order one. This yields a much better rate of convergence for Newton's method to
minimize a function than that of the Steepest Descent method, though more conditions
have to be met in order for Newton's method to converge.
Note that Newton's method could be regarded as a special case of the general descent
method of section 2.2 by taking the descent direction
(2.4.7) p(k)= _H(X(k)ylg(X(k»)
~
at each iteration. However, the original Newton method does not include the second
fundamental step for determining a step size of movement at each iteration, as we have
mentioned in section 2.1. This can cause some unstable behavior of the Newton method.
Therefore, to safeguard Newton's method against failures caused by lack of validity of
24
approximating f by the second order Taylor representation, the fundamental step for
choosing ark] has been incorporated into the original Newton method to ensure the
validity of the approximation of f The following algorithm is based on this modified
Newton's method.
Algorithm 2.4. 1: Let an estimate x(O) of an unconstrained minimizer x* of f be given.
1. Set k=O.
3. Compute p(k) by solving the system of linear equations
H(k)p(k)= _g(k). "
/" .... -~.'\
," \
4. Compute {a~~sing one of the line search methods mentioned in 2.1. ,"-
6. If X(k+1) satisfies given convergence criteria, then stop.
7. Set k=k+ 1, go to 2.
The sequence {X(k)} generated by Algorithm 2.4.1 will converge to a minimizer off if
some other conditions are met [36] and it can be shown that the rate of convergence of
Algorithm 2.4.1 (without step 4 in Algorithm 2.4.1) is of order 2, if it does converge. This
is the fastest rate normally encountered in non-linear optimization [28].
However, Newton's method is still subject to the following causes of failure during the
(k+ 1 )-st iteration.
1. The direction p(k) is orthogonal to g(k) or nearly orthogonal to g(k).
2. H(k)-l exists but is not positive definite.
3. H(k)-l does not exist.
To overcome these deficiencies of Newton's method, other techniques have to be
incorporated into Algorithm 2.4.1 [36]. Note that the case of Failure 1 could be detected
25
by using the condition 'lg(k)T p(k)1 ~ ellg(k)IIJp(k)112' for some sufficiently small number E,
, ,
and, if this is the case at an iteration, we then could simply take a steepest descent step,
i.e., taking p(k)= _g(k). We could also safeguard Newton's method against Failure 3 by
taking a steepest descent step whenever the system of linear equations for p(k) is not
solvable. For the remaining failure case, we could take- p(k)=- _p(k) or a steepest descent ,,' (~
step if it does happen. J
As we can see in the above discussion, one basic step that we could take, when
Newton's method fails to yield a downhill step, is i:l~te~p~~Ldes~~Q!..Step,~~~_~ays
~<!QwnhiILstep. Hence, we could combine the Steepest Descent method and the Newton
method to yield a hybrid algorithm which is more stable than Newton's method.
However, we are not going any further in this line of thought.
If the function f is the non-linear least squares function, then we will have a uniform
treatment for safeguarding the Newton method. Such an approach involves the ideas of
Levenberg and Marquardt that we shall discuss shortly.
The next section introduces the conjugate gradient method that is derived from more
profound mathematics.
2.5 Conjugate Gradient Descent Method
The following definition and facts are needed for the establishment of the conjugate
gradient descent method [36].
Definition 2.5.1: Let A be a symmetric nXn matrix. Then the vectors p(i) (i=O, 1, ... )
are A-conjugate if and only if p(i) T Ap(i)=O ( i;f:. j).
26
Theorem 2.5.1: If 1. A is an nXn symmetric positive definite matrix;
2. p(k) *-° (k = 0, ... ,n-l) are A-conjugate;
3. v is any vector in 9ill,
then, (a) p(k) (k = 0, ... , n-l) form a basis for 9ill;
n-l p(i)T Av
(b) v = L C)T (k) p(k). (see [36] for a proof)
'<_'I [j'Oop' Ap
~ .~,
We are interested in a quadratic functionJ: 9ill ~ 9i of the form
where A is an nXn symmetric positive definite matrix, b is an nxl vector, and c is a real
number. Obviously, such a quadratic function has a global minimizer, where the gradient
vector g ofJvanishes. That is, taking the derivative ofJwith respect to x, we obtain
(2.5.1)
g(x) = 'VJ(x) = 'V(~ x T AX) + 'V(b T x) + 'V(e)
=l.(A+AT)x+b, i.e.,
2
g(x) = Ax + b (since A is symmetric).
If x * is a minimizer off, then
(2.5.2) g(x *) =A x * +b=O or equivalently
(2.5.3) x * = _A-1 b, since A is positive definite and hence non-singular.
Now, let us apply the general descent algorithm to the above quadratic function.
Suppose we choose the descent directions p(k), k=O, ... ,n-l, such that
27
(Assumption: )
This gives us
(2.5.4)
1. g(x(O)) T p(O)<O, where x(O) is an initial estimate
2. p(k+l) chosen such that p(k+l)Ap(j)=O, (j=0, ... , k)
X(k+l) = x(O) + I a(j)p (l) , k=O, ... , n-1.
j=O
Plugging (2.5.4) into (2.5.1), we get
(2.5.5)
g(k+l) = g(X(k+I») = AX(k+l) + b
= Ax(O) +b+ Ia(j) Ap(j).
j=O
Since by (2.2.5), g(k+I)T p (k) = 0, multiplying p(k) T to both side of (2.5.5), and then
solving for a(k), by hypothesis 2 above, the resulting solution will simplify to
(2.5.6)
p(k)T Ax(O)
a (k) = _ -=--__ _
p(k)T Ap(k)
p(k)Tb
(k)T (k)' k=O, ... , n-1.
p Ap
Now, taking k=n-l in (2.5.4), we obtain
n-l p (j)T Ax(O) n-l p (j)Tb
x(n) = x(O) - I, p(j) - I, p(j)
j=OP (j)T Ap (j) j=OP (j)T Ap (j)
n-l p (j)T Ax(O) . n-l p (j)T A(A -lb) .
=x(O)-I, p(])-I, pc])
j=OP (j)T Ap (j) j=O P (j)T Ap (j)
=x(O) _x(O) +A-lb (Theorem 25.1)
= A -lb.
By (2.5.3), we have x(n)= x* i.e., x(n) is the minimizer off Hence, the minimizer x*
off can be obtained in n iterations (could be less than n iterations, see [36] ).
i
To generalize this remarkable result to non-quadratic functions, we first note that,
again by Taylor's theorem, a function F3tn -7 9\ which has continuous second partial
28
derivatives at a point x * E 9\D could be well approximated by the truncated second order
Taylor representation, i.e.,
(2.5.7) f (x) == f (x *) + g(x *) T (x - x *) + 21 (x - x *) T H (x* ) (x - x *) .
Assuming x * is a critical point off, we then have g(x *)=0. Hence, (2.5.7) becomes
'!
(2.5.8)
f(x) == f (x *) +21 ~2 ( x-x* ) H( x* ) (x-x*)
1
=_xT Ax+bTx+c,
2
'1
where, A=H(x*), b= -H(x*) x*, and c = (x·) +l:(~!!(~:l~':'
This shows that f can be approximated by a quadratic function and, hence, the above
results about quadratic functions could be used. Upon the assumption that the Hessian
matrix H(x*) is positive definite, the above quadratic function (2.5.8) can be used to find
a minimizer of the original function f, which is readily seen to be the same as the
minimizer of the quadratic function (2.5.8). Such consideration yields the iterative
scheme for the conjugate direction algorithm, which involves the successive
approximation of f by a quadratic function at each iteration. Note that the convergence
property of the quadratic function does not hold for non-quadratic functions, i.e., a
satisfactory minimizer will not generally be located within n iterations. For non-quadratic
functions, a new sequence of p(k),S, (k=O, ... , n-l) should be constructed if the current
approximation by one quadratic function is not sufficiently good. Hence, a reset of the
parameters after every n iterations may seem to be a reasonable strategy to be used in
algorithms for minimizing non-quadratic functions [18].
29
Notice that there are many methods to compute p(k),s such that assumption 2 above
could be satisfied. One way to construct p(k) at ~a--c-h- iter~!!~n is to take p(k) as a linear .. ~ .. -..- .. - -~- .,. --.-.,--~ ------~'--.--.--.-~".~,~. "-'-.'--'--.'.-
combination of _g(k~~~.J!_~~_~ ap.(L!!~~~e the E~e._fQl1.iugq,te. _g!:~4ie.!:l~lP:e.!~.od. Such a
< •••• .-. " ...... ---.----~---,.- ,----.--~- .-----'"-"~~~ "'"
construction is encoded in the following conjugate gradient algorithm.
Algorithm 2.5.1: Let an estimate X(l) of an unconstrained minimizer x* of f be given which is sufficiently
close to x*.
1. Set k=1, p(l)= r(l)= _g(X(l».
2. Compute the Hessian matrix H(k) =H(X(k».
3. Compute ~~k)._~y-.using a line search technique [36] or by using the
{.
(k)T r(k)
following formula a(k) = (~T (k) (k) (or using an approximation of H(k»
p H p
[24].
4. Compute x(k+l)=x(k)+a(k)p(k) and r(k+l)= _g(X(k+l».
I' 5.(~ go to 10.
\ ,=.
6. Compute p(k) using one of the following:
• (Hestenes and Stiefel)
•
•
(Fletcher and Reeves)
(Polak and Ribiere)
{3(k) = r(k+l)T (r(k+l) - r(k»
p(k)T r(k)
8. If convergence is obtained, go to 11.
/'i\, ~7' \ ,. 9. Set k=k+ 1, go to 2.
'\( ,[.1'" '/
,) ',' \. '\ '--
'--' " 10. Set p(k+l)= r(k+l) go to 2.
11. Set XO=X(k+l).
12. Stop.
30
Algorithm 2.5.1 follows the same descent pattern as we have always mentioned, i.e.,
first determine the search direction( Step 6, 7, given an initial pCl)) and then choose a step
size (Step 3, 4). The rate of convergence of this algorithm is superlinear. If the formula
given in Step 3 above is used, the algorithm suffers from the same problems as that of
Newton's method, since the conditions on the Hessian matrix ofJmight be violated at an
iteration. Hence, it might need the same treatments as we have mentioned in the
discussion of the Newton method. Another treatment, which does not require the
computation of the Hessian matrix H(x)--using a line search instead, can be found in
[36].
2.6 Least Squares Minimization
In this section, we specialize the above methods to a particular kind of functions that is
a sum of the squares of non-linear functions. The special form of this kind of functions
enables us to describe the methods in more detail and, as we shall see, to derive a new
method, namely, the Gauss-Newton method.
Consider a function of 1 real variables Xi (i=l, ... ,l) and n real variables Wj (j=1, ... , n)
Y: 9\1 X 9\n --7 9\ be a function. Writing them in vector form, we have
Furthermore, let m vectors Xi (i=1, ... , m) and m real numbers Yi. be given. We are
interested in estimating a local minimizer of E: 9\n --7 9\ defined, as a function of w, by
(2.6.1)
Let
1 m 2
E(w) = - I(Y(xp w) - yJ .
2i=1
31
(2.6.2)
where Vi(W)=Y(xj, w), i= 1, ... , m, and y = (yI, ... , Yrn) T. Then (2.6.1) can be written as
(2.6.3)
Let f: 9\n ~ 9\rn be defined by
(2.6.4) few) = V(w) - y
Then, (2.6.1) can also be written in vector form as
(2.6.5)
1
E(w)=-f(w)T few).
2
The gradient vector g(W)=(gl(W), ... , grn(w)) of E(w) is given by
(2.6.6)
g .(w) = dE = If.(w) dfi(w)
l dw· J dw i J=! i
m
= I,f/w)dJ/w), i = 1, ... , n.
j=!
Using the Jacobian matrix] rnxn off (w), whose element is given by Jilw)=~Ji(w), i=1,
... , m andj=1, ... , n, we have
(2.6.7) g(w)=] (w) Yew).
The Hessian matrix H(w) of E(w) can be decomposed via the following treatment.
Consider the ij-element Hij of H(w),
Now, if we let Ti be the Hessian Matrix of A , i.e.,
32
(2.6.8)
then a decomposition of the Hessian matrix of E(w) is
(2.6.9)
m
H(w) =J(W)T J(w) + Ifi (w)Ti (w)
i=1
=J(W)T J(w)+S(w),
m
where, Sew) = Ifi (W)Ti (w).
;=1
Clearly, the decomposition (2.6.9) of the Hessian matrix displays a considerable
structure and hence gives us more alternatives when applying Newton's method to the
function E(w).
Now, let us apply all the methods introduced in the previous sections to the function
E(w). For the Steepest Descent method and the Conjugate Gradient method, all we need
is the gradient of E(w) at each iteration and hence equation (2.6.7) can be utilized to do
so. For Newton's method, the gradient of E(w) and the Hessian matrix of E(w) are
needed. Thus, equations (2.6.7), (2.6.8), and (2.6.9) can be used for the computations.
One alternative when using the Newton method is that we could drop the term Sew) in
(2.6.9) if E(w) is expanded at a critical point x* for points sufficiently close to x*. Then
the Hessian matrix H(w) of E(w) is approximated by
(2.6.10) H(w) =J (w) TJ(W).
This changed form of the Newton method is called the Gauss-Newton method [36]. In
this method, the Hessian matrix H(w) is computed approximately by using only the first
order partials of E(w), which is less costly in computations. The following is an
algorithmic implementation of the Gauss-Newton modification of the Newton method.
33
Note that, like Algorithm 2.4.1 of the Newton method, the fundamental step for choosing
ark] has been incorporated into the original Gauss-Newton method, which does not have
such a step, to ensure the validity of the approximation of E(w).
Algorithm 2.6.1: Let an estimate w(O) of an unconstrained minimizer w' of E(w) be given.
1. Set k=O.
2. Compute g(k) and jk) via
gik)=d;E(w(k)) and
J;j(k)=dj!i(W(k)), i, j=1, ... , n.
3. Compute p(k) by solving the system of linear equations
(jk)T jk))p(k)= _g(k).
4. Compute U(k+1) by using one of the line search methods mentioned in 2.1.
5. Compute W(k+1) = W(k)+U(k) p(k).
6. If W(k+1) satisfies given convergence criteria, then stop.
7. Set k=k+ 1, go to 2.
However, in the Gauss-Newton version of the Newton method, all the drawbacks of
Newton's method still remain and may become worse since the approximated Hessian
matrix is probably more vulnerable to singular problems. In the next section, we shall see
that the Levenberg-Marquardt method could be very helpful to handle all those
drawbacks.
2.7 The Levenberg-Marquardt Method
The formulation of the Levenberg-Marquardt method, as often seen in the
optimization literature, is based on the Gauss-Newton method for dealing with non-linear
least squares problems. When the Gauss-Newton method is utilized for a minimization
problem, we need to solve the system of linear equations given by
34
(2.7.1)
and then to update the estimate via
(2.7.2)
As we have seen in section 2.6, depending on the matrix jk) Tjk), the Gauss-Newton
method sometimes suffers the following problems in practice:
1. The direction p(k) is orthogonal or nearly orthogonal to g(k).
2. (jk) T jk)r1 does not exist.
Failure 1 above generally results in little or no progress of the minimization process or
something even worse. When case 2 happens, there is no way to update the variables and
hence some alternative methods need to be used to handle the problem. As mentioned
also in section 2.4, the solution to both of the problems is to take a steepest descent step if
either of the cases does occur.
We have seen that one important choice at each iteration of every gradient-based
method is to choose the search direction p(k). Progress will be made if this direction is
close to that of _g(k). Levenberg [23] and Marquardt [24] have described methods for
determining a direction, between p(k) and _g(k) , that will ensure that the process makes
progress even if either of the above cases occur. In their treatment, the essential idea is
that the update ~ (k) = a(k)p(k) at each iteration should be determined by solving one of the
following system of linear equations.
(2.7.3) (jk) Tjk) + A(k)I) ~(k)= _g(k), or
(2.7.4) (jk) Tjk) + A(k)D) ~(k)= _g(k),
35
where A (k»O is a real number and D is a diagonal matrix whose elements are the diagonal
elements of jk)Tjk).
In his treatment [23], Levenberg linearized the functionsfi's in (2.7.5) below by their
(2.7.5)
1 m 2 E(w)=-Lh,
2i=1
first order Taylor expansions to obtain the approximation E(w) of E(w) at an initial point
Wo by the following.
(2.7.6)
Since (2.7.6) generally gives poor approximation when dw is large in absolute value,
Levenberg used the following minimization scheme in order to minimize the sum of
squares of the residuals and to limit the step size simultaneously.
(2.7.7) = 1- n ( )2 E(w)=-E(w)+ Laj dWj , A j=1
where A-I is a positive number expressing the relative importance of the residuals and
increments in the minimization process and aj' s are positive constants that represent the
relative importance of damping the different increments. In his reasoning, A-I can be
determined approximately by
(2.7.8)
n (n (}J,. J2 2La. L-' ·Ii
j=1 J i=1 aw. A-I = I Wo
E(wo)
36
The choices of the weighting constants aj's are arbitrary. One way is to set them all equal
to unity, in which case, it can be shown [23] that the system of equations we need to solve
to determine the step size at each iteration is
(2.7.9)
This method is called Levenberg I. Another strategy ( Levenberg II ) to choose aj' s is to
let
(2.7.10) ( J2
_ n dh ._
aj - ~ :1... ' J-l, ... ,n.
I-lOW·
J Wo
Then, the system of equations we need to solve becomes [23]
(2.7.11)
where A(k) is given by (2.7.8) and D is a diagonal matrix whose elements are the diagonal
elements of jk) T jk).
Levenberg's method (Levenberg I or Levenberg II) sometimes fail because the
minimization process on the first iteration may lead to a very small A that results in a very
long step that is nearly orthogonal to -g to a point far from the global minimum, from
which the method never returns. Marquardt [24] invented a better strategy for selecting A.
He started off with the Gauss-Newton method and came up with the system equations
(2.7.4) whose solutions would determine both of the direction and size of the next step in
the minimization process. In his treatment, he showed that the minimization process
would make progress if A( k) was sufficiently large. The validity of his formulation is
based on the following three theorems cited from [36].
37
Theorem 2.7.1: If
Theorem 2.7.2: If
Theorem 2.7.3: If
1. bO is arbitrary;
2. Q=JT J, where J is an mXn matrix;
3. VE 9\ll and v 7; 0;
6. <1>(~) = IIJ~ + vll~
then t/>(~o)=rnin{t/>(~)}.
~En
1. bO is arbitrary;
2. Q=JT J, where J is an rnxn matrix;
3. VE 9\ll and v 7; 0;
then II~( A )II~ is a continuous monotone decreasing function of A and II~( A )II~ --7 0, as A --70.
1. bO is arbitrary;
2. Q=JT J, where J is an mxn matrix;
3. VE9\ll and v 7; 0;
4. (Q+AI)~ = -Jv
then 1fI is monotone increasing function of A and tp(A) --7 1, as A --700.
Proofs of the above theorems can be found in [36].
Theorem 2.7.1 says that ~ (k) depends upon the choice of A (k) at each iteration. By
(2.6.3), the Gauss-Newton!Levenberg-Marquardt approximation of the E(w) is, over a
sufficiently small neighborhood of a current estimate wof a local minimizer w* of E(w),
38
,I
,,j )
(2.7.12)
E(w+L\) = IIV(w+ L\)-YII~
== E(w) + 2f(w)T J(w)L\ + L\T(J(W)T J(W) + AI)L\ ./'
/
Now we can see that, by Theorem 2.7.2, if 'A, is sufficiently large, then, lI~t, where ~
is the solution of (2.7.3) (i.e., the Levenberg-Marquardt update of W(k»), is sufficiently
small so that the quadratic approximation (2.7.12) becomes valid. This ensures that
E (w + ~) < E (w), that is, the reduction of function values of E( w) at each iteration.
Notice also that, by Theorem 2.7.3, as 'A, increases, the direction of ~ in (2.7.3)
approaches that of -g(w)=f(wf J(w). Hence, even if jk) Tjk) is singular or its
A
solution does not lead to a downhill direction, the direction of the update L\ can still be
made downhill for E(w) at w by taking 'A, sufficiently big. Thus, we can ensure the
reduction of function values at each iteration of the minimization process by properly
setting each 'A,(k) big enough. This way, all the drawbacks of the original Gauss-Newton
method are removed.
Both Levenberg and Marquardt have suggested that the matrix Q=JT J should be
scaled so that its diagonal elements become equal to unity, since the properties of the
gradient methods are scale-variant [24]. It can be shown [23, 24, 28] that this is
equivalent to solving
(2.7.13)
where, D=diag(Qll, ... , Qnn).
39
This treatment makes the method scale-invariant. However, theorem 2.7.3 fails
because of this change (other two theorems still hold). Instead of approaching _lg as
It
A~OO, the direction of the update ~ becomes
(2.7.14)
Nevertheless, we then have
(2.7.15)
i.e., still downhill in the limit. Hence, all previous discussion for (2.7.3) still applies if the
diagonal matrix D is used instead of the identity matrix I.
Following Marquardt's treatment, it should be noticed that setting A (k) equal to zero
leads to the Gauss-Newton method. Hence, at the beginning of each iteration, A(k) is
reduced by the factor v, since a smaller A (k) gives performance of the algorithm that is
more close to that of the Gauss-Newton method.
Summarizing the above discussion, we are now ready to gIve the Levenberg-
Marquardt version of the Gauss-Newton method.
Algorithm 2.7.1: Let an estimate w(O) of an unconstrained minimizer w' of E(w) be given.
1. Set k=O, A(k)=~O.. : 91, v=1O.
2. Compute g(k) and jk) via
g;(k)=d;E(w(k») and
l(k)
3. SetA(k)= --;-.
4. Compute p(k) by solving the system of linear equations
40
7. If W(k+1) satisfies given convergence criteria, then stop.
8. Set k=k+ 1, go to 2.
Note that there is no fundamental step 2, i.e., determining a step size, in Algorithm
2.7.1. Such a step can be observed in the process of increasing the value of A (k) in order to
result in the function-value reduction of E(w). In the algorithm, as A (k) increases, not only
the direction of p(k) is approaching a downhill direction but also the step size that is
decreased by the factor of (A(k)) along that direction (see (2.7.14)). And, in the limit, the
step size becomes infinitesimal along a downhill direction.
We have seen in this section that the Levenberg-Marquardt method is a very effective
technique in overcoming the deficiencies of the Gauss-Newton method. The main task of
this paper is to incorporate such a technique into the Newton method and hence
implement this new method as an ANN learning law for training feedforward neural
networks. Before doing this, we first explore some of the classical ANN learning
algorithms that are based on the optimization models we have developed in this Chapter.
Having developed some of the mathematical models that we will need in the ANN
studies, we next concentrate on the implementations of those models using the ANN
structures. The next chapter gives full treatment of some of the classical gradient-based
ANN learning laws for training fully-connected, feedforward neural networks.
41
3. ANN LEARNING ALGORITHMS
In this chapter, we describe some of the classical learning algorithms for training fullyconnected,
feedforward neural networks in detail. This includes their basic concepts,
learning criteria, propagated computations, and learning algorithms.
3.1 Architecture of Feedforward ANN
In chapter one, we have seen an example of a feedforward neural network with two
layers. In general, such a neural network can have any number of layers. The layers of a
neural network usually have a left-to-right layout with the last one on the right being the
output layer. The number of layers is defined to be the number of layers with weighted
connections. Note that we treat the input layer of the network as just some connection
nodes. The hidden layers of a network are all of the layers except the output layer of the
network and, hence, the number of hidden layers is the number of layers in a network
minus one. In theory, a feedforward neural network with at most two hidden layers can
approximate any function practically encountered. However, no construction method has
yet been found as how to build a three-layer neural network for every given such
function [1 0].
Notation
The notations we will use in this paper are shown in Figure 3.1.1. In the layout, all
neurons in a layer are consecutively indexed, beginning at 1, in an up-down fashion. The
layers are indexed in a left-to-right order and they are identified by square-bracketed
42
W[l] =(W1 .)
J,I W[2] = (W2.)
J,I
W[3] = (W3 .)
J,I
input
output
-~ ....
"
u[o] u[3] 1 1
[0]
u2
[31
u2
:
u[Ol
3 '-, ............ _----
second layer third layer
------ ------
first layer
Figure 3.1.1: Architecture of a three-layered neural network
superscripts. All inputs to a neuron in layer k are denoted as UJk-11 where i=O, 1,2, ... , nk-l
(nk-l = number of neurons in (k-1)-st layer), and in the case of k-1=0, then the u;Ol are the
inputs of the network. Here, we have assumed an extra bias node for each layer, which
connects forward to each neuron in the next layer. Such nodes have no input connections
from the previous layer and have a constant output value of -1, i.e., ubk1 = -1, for all k=O,
1, ... ,K-I. The weight on each of those constant connections corresponds to the bias of the
neuron to which the connection is linked. Note that for each k >2, u;k-l1 is also the output
of neuron i in the (k-1)-st layer. The outputs of the network are (U;K1)T, written in vector
form with K being the number of layers of the network. A weight is marked as Wj~i1, il=O,
where k is the layer index and ''j, i" means that the weight is on the connection from the i-th
neuron in layer k-1 to the j-th neuron in the k-th layer. In vector form, these are denoted
43
as W[k1=( Wj~il) T for the weights in the k-th layer and U[k1=( U~kl) T for the outputs of the k-th
layer and the inputs of the (k+ 1 )-st layer. The weighted sum of the inputs of a neuron,
say neuron), in layer k is denoted by
(3.1.1)
Hence, the output of the neuron} in layer k can be written as
(3.1.2) ujkl = J yl ( V~kl ), ItO, and u6k1 = -1, where,
J}kl is the activation function of that neuron. In vector form, it will be
(3.1.3)
(3.1.4)
The Activation function and its bias input
Perceptrons form a subclass of feedforward neural networks. In a perceptron, the
activation function is a step function. This limits the applications of the perceptron
networks to only classification problems. In order to introduce non-linearity into a neural
network, non-linear activation functions have to be used. It is only the use of non-linear
activation functions that enables multilayer neural networks to solve all kinds of
mapping-approximation problems [37]. Many non-linear function will do the job,
although which one to use depends upon the requirement of the learning algorithm being
used. For gradient-based learning algorithms, the activation functions are required to be
differentiable. The most common choices for feedforward networks are the sigmoidal
44
functions. Two forms of such sigmoidal functions are given below with their graphs
shown in the Figures that follow [2]. Since we can scale the input and output values to
within the interval (0,1) or (-1,1), there is no fundamental difference between the two
except for computational considerations. In this paper, the Sigmoid function is used.
(3.1.5)
(3.1.6)
-10
1
f(x)=--
l+ex
eX _ e-x
f(x) = X -x e +e
-5
0.4 / 0.2
( Sigmoid function)
( Hyperbolic tangent function)
5
x
10
Figure 3.1.2: The Sigmoid function.
45
-10 -5 5 10 I x
-0.5 ),
By (3.1.1) and (3.1.2) above, each neuron in a neural network defines a hyperplane in a
space of dimension ni ' which is the number of variable inputs to that unit [37]. The
position of this hyperplane is determined by the weights. Without a bias, the hyperplane is
constrained to pass through the origin of the hyperspace. This yields some limitations on
the problems that an unbiased ANN can solve. For example, without a bias, a neural
network can not even solve the XOR problems without changing the domains of the input
values.
Initiation of weights and bias
The weights in a neural network are initially chosen to be small random numbers.
Since the activation function is usually active over a small interval and levels out outside
of the interval, the slopes over the rest of the interval are very small. If the initial weights
are too large, the activation functions may saturate at the beginning and the network
46
might get stuck in a very flat plateau or a local minimum near the starting point [16].
Thus, some optimal way to set up the initial random weights is desired. In this paper, the
initial weights of all test neural networks are chosen as random numbers uniformly
-0.5 05
distributed between ------- and ----.--- [10], where the fan-in of
fan - in of that unit fan - in of that unit
a node is the number of inputs including the bias input to that unit.
Computation in feedforward ANNs
As we have seen in Chapter 1, a neural network performs computation tasks on a layer
basis. That is, when all the inputs of network are ready, the neurons in the first layer are
activated and pass the results to the next layer which is in tum waiting for all its inputs to
be provided. This pattern continues until the outputs of the network have all been
produced. The following procedure gives an outline of such a forward pass of
computations in the network, called aforward computation of the network.
1. The weight vectors wlk1 and the activation functions f[kl , k=1, 2, ... , K. are all given, where K is
the number of layers in the network.
2. JOl=x , where x is any given input vector.
3. For k=1, 2, ... , K, compute Jk1= f[k1(vk1) by using (3.1.1) and (3.1.2).
4. y=JKl will be the output of the network.
Mathematically, we think of a neural network as a mapping N:9\n--79\ffi written as
y=N(x), where nand m are the dimensions of the input vector and output vector
respectively. In this treatment, the layered structure of an ANN can be seen as comprising
the recursive computations involved in evaluating the function N at an XE 9\n.
47
3.2 Dynamic Adaptation in Feedforward ANN
A feedforward ANN exhibits dynamic changes of behavior in its training session.
Based on example learning, the error made by the ANN upon an input will be realized by
a pre-defined measuring function called the error function, cost function or energy
function [20]. To correct the error, the error function is investigated for the sources of the
error and the level of error contribution of each source, among all the neurons in the
network. Then, changes of connection-weights are made in order to reduce the erroneous
outputs of the network. This adaptation of behavior ends when the network has produced
outputs close enough to the desired ones or when an optimal point is reached with regard
to some generalization criterion. The following details the concepts involved in the above
discussion of an ANN learning process.
Supervised learning
To train a feedforward ANN, supervised learning is used. This requires that sample
inputs are gathered from the domain of the problem and their correct outputs provided for
the quantitative realization of errors being made by the network. Normally, two such sets
are chosen. One is for the training examples and the other is for the testing of the network
after it has been trained. The number of examples in the training set depends on the
number of weights used in the network. A general rule of thumb is that the number of
samples should be larger than the number of adaptive parameters [37], preferably much
larger. This rule has often been violated in practice which has led to problems of
overfitting in many applications.
48
Another training paradigm is to divide all the samples into 3 sets [26], i.e., a learning
set, a validation set, and a test set. When training an ANN, the learning process will be
stopped when an optimal point is reached regarding the validation set. This way, better
generalization might be achieved.
On-line and off-line learning
On-line learning means that updating of the weights takes place each time an example
is presented to a neural network and errors have been produced. In off-line learning (also
called batch learning), weight updating is postponed until all examples have been
presented once to the network. In the on-line learning mode, the learning process is more
sensitive to each individual example and, hence, it may help in escaping from local
minima of the cost function [10], though no proof of such assertion has been shown yet.
On the contrary, off-line learning introduces some inherent averaging and filtering effects
due to the additive collection of all the weight updates. It is asserted that on-line learning
is faster and more effective than the off-line learning [10], though, from the viewpoint of
optimization theory, the batch learning is more consistent with the optimization models
used to implement learning algorithms.
Learning criterion
In supervised training, a learning criterion is used to measure the performance of a
neural network. Mathematically, this can be achieved by using an error function.
Different learning algorithms implement the learning process using different kinds of
49
error information. For the gradient-based learning laws used in training feedforward
ANNs, the error function should be continuously differentiable to the first, second, or the
third order depending upon the choice of methods used. Some of the common choices are
the sum-of-squared-errors function and the absolute value error function (shown below
respectively). In (3.2.1), the constant 1. is used to simplify the calculation of derivatives
2
of E(w) [10].
and
_ s (m 1 [K] I) (3.2.2) E(W)-~l ~fi (xj,w)-Yj,i ,
where the notations are the same as those used in Chapter 1 and Chapter 2.
Learning algorithms and their evaluation
The role of the learning algorithm is to make the transition from realized errors to
weight changes--thus enforcing the learning process. For the gradient-based learning
algorithms, this is often achieved by employing an error function and an optimization
method to minimize this error function with regard to the connection weights. The
implementation of the minimization methods involves the computations of the partial
derivatives of the error function. To evaluate those partial derivatives, the global error
function has to be parametrized successively via the transformations made by the neurons
in each layer. Such a parametrization process corresponds to a backward pass through the
50
neural network. Then, upon this backward parametrization, various differential items of
information are obtained, and the amount of weight update is approximately computed for
each of the connection weights in the network. The mathematical formulas involved in
this process will be developed in section 3.3.
Different learning algorithms have different learning performances. To compare
among those learning algorithms, standard measurements have been developed. The
commonly used measures are the speed of learning, computation complexity, and the
memory usage. For each of the learning algorithms we will introduce, we shall give,
whenever possible, their performance evaluation in terms of the above three
measurements.
One drawback of all the gradient-based learning algorithms that we will introduce in
this paper is that the minimization process of the error function may get stuck at a local
minimizer of the error function and hence stop making further progress [35]. Currently,
there is no theoretical treatment to overcome this deficiency of the gradient-based
learning algorithms. However, practical experiences have shown that a local minimum is
not very often encountered [35]. One strategy that could be used in dealing with local
minima problems is to run the learning algorithm several times, each time with a new set
of initial random weights [35].
Stopping Criteria
There are several conditions that can be used to stop the learning process of an ANN.
1. The function values (or the RMS values, see below for RMS) are reduced to
51
within tolerance [15], i.e.,
E(W(k») < tolerance.
2. The reduction in function values at iteration k+ 1 is within tolerance [34], i.e.,
3. The reduction in RMS values at iteration k+ 1 is within tolerance, i.e.,
where, RMS(E(w)) :=
E(w)
Number of Training Examples
4. The weight updates at iteration k+ 1 are within tolerance, i.e.,
II~W(kt < tolerance, or
5. ( For classification problems only), each pattern has been recognized to within
tolerance [35], i.e.,
actual output aj (W(k+l), Xl) - desired output OJ (Xl) < tolerance,
for all j over the output dimension and all Xl E training set.
6. The number of iterations exceeds some pre-defined limit.
The stopping condition of a learning process is closely related to the generalizing
ability of a neural network. To improve the generalization performance of the network, a
validation set can be used while training the network and the training can be stopped once
a minimum of the error on the validation set is reached [26]. This technique is often used
in practice to avoid overfitting problems when training a neural network.
52
Generalization
The practical use of ANNs depends on their generalization capability. It is desirable
that a trained ANN generalize the learning results ideally over the entire problem domain.
However, this is not always practically achievable since no theoretical treatment has yet
been found to control effectively the behavior of an ANN over the entire problem
domain. Among the ANNs' generalization problems, overfitting is the one commonly
encountered. To say that a trained ANN is overfitted on the training example means that
the net conforms to the training examples, but with wild meandering outside of the
training set[35], which is not desired. Such problems are usually due to the inadequate
number or range of training examples. To avoid overfitting, a rule of thumb is to have the
number of training examples much greater than the number of trainable weights.
3.3 Propagated Computations in Feedforward ANN
We have seen that an ANN produces its output via forward computations with the uses
of the recursive formulas (3.1.3) and (3.1.4). In this section, we develop mechanics for
the computations of the partial derivatives of the error function, which are needed in the
implementation of the gradient-based learning algorithms.
inputs from
(k-1 )-st layer
W [k 1
J ,n k-1
k-th layer
Figure 3.3.1: Part of a neural network
53
outputs to
(k+ 1 )-st layer
Throughout discussion in the rest of this chapter, we suppose a given neural network
of K layers. Furthermore, we will use the sum-of-squared-errors function E(w) defined by
(3.2.1) in our discussion. Since differentiation is additive, it suffices for us to consider the
following function «3.3.1)) for one input pattern (Xl, Yl) instead [10]. This is actually the
error function used for on-line training. Summing up the derivatives over all the input
samples later on would constitute the off-line training model.
(3.3.1) E(w) _- -1 ( Lm ( f j [Kl (Xl' W) - YI,j )2) -_-1 (f [Kl (Xl' W) - Y/ )T( f [Kl (Xl' W) - Y/ ) .
2 J=l 2
We begin by considering the network locally at a neuron, say neuron j at the k-th layer
(Figure 3.3.1).
By (3.1.2), i.e., U)kl = fYl ( V)kl ), /1:0, and ~kl = -1,
we have
(3.3.2)
and
(3.3.3)
,
"
Note that the derivatives (U)kl) and (U)kl) can be calculated analytically once a
proper activation function is chosen. We call such derivatives the local derivatives [8].
Using the Chain rule and (3.3.2), we have,
(3.3.4)
54
(3.3.5)
(3.3.6)
a,Pl ;:U-~kl(v[kl) av[kl ,
j - VJ j j j _ ( [kl) [k-ll·...Lf) (b (3 3 4) )
dw[.k.l av[.kl dw[k] - uj • ui ,If-V, Y .. .
j,1 j j,1
In (3.3.4), (3.3.5), and (3.3.6), W)~l is given and will be known from the
av[kl
forward computation of the network upon a given input. Hence, the derivatives dw~k] ,
j,1
au[.kl au[k1 -fu ,itO, and ------d=ll, #0, are computable for any neuron in the network once an input
dw.. au . ./,1 I
is fed. Such derivatives are also called local derivatives [8]. To distinguish those local
partial derivatives given by (3.3.5) and (3.3.6) from other partial derivatives that involve
neurons over at least two layers, we make the following definitions [8].
(3.3.7)
au[k1
( [k1)i._ j ( [kl)i
U j . - au[k-1l and Uo := 0, i:;t(),
I
(3.3.8)
au[kl
( [k1)(i)._ .i .
U j . - --[k-l ' J=I=O,
dw ..
j,1
au[kd
In general, the computation of derivatives of the form aut~l ' where kl and k2 are two
I
layer indices with kl > k2, requires propagation through the network layout. We have, by
the Chain Rule,
(3.3.9) (by 3.3.7).
(3.3.9) is a forward propagation formula. The derivative of U ~ kJl with regard to
55
u ~ k 2 1 can be recursively computed by generating the partials
(hPl --iJ for k= k2+1, k2+2, ... , kl' i=FO,
du;
with the base case at k = k2+ 1 reducing (3.3.9) to a local derivative. Note that the
sequence k=k2+1, k2+2, ... , kl' goes forward in the network. If kl = k2, then the derivative
is unity.
Based on (3.3.9), we can derive formula (3.3.10) for computing the partial derivatives
mPd
of the form, Jv(kzl . By the Chain Rule, we get, for kl > k2'
I
(3.3.10)
The recursion of (3.3.10) that could be seen as determined by (3.3.9) is again in a
forward fashion. A special case of (3.3.10) is when kl = k2+1, where
(3.3.11)
(since u.;kzl is not a function of u;kzl if s -::j:. i)
Another recursive formula we shall need is the backpropagation of the partials of the
56
dE
error function, i.e. dU[kl' The base of the recursion is when k=K. In this case, the
]
corresponding derivatives are defined by local derivatives and hence are computable.
That is,
dE _ d (1 m ( [Kl )2)
au[Kl - au[Kl i~ is (XI' w) - Yl,s
] .I
1 [m d(U[K\X w)- Y )J =_ "'2( [Kl( )_ ) s I' 1,01' 2 ;;1 Us XI' W YI,s au[Kl
]
Since U~Kl(XI' w) is not a function of U~Kl for S -::1= j, we have,
(3.3.12) ,j-::l=o.
It is easily seen that (3.3.12) is computable after the forward computation of the
network upon input Xl. Note that following (3.3.12), we have
(3.3.13)
For k < K, we have, again by the Chain Rule,
(3.3.14)
To compute ~;tl ' starting at the output layer, we can use (3.3.14) recursively to
.I
generate the derivatives d~l for k=K, K-1, K-2, ... , kl. The base case in the sequence is
dUj
given by (3.1.12). The recursion of (3.3.14) is readily seen to operate in a backward
fashion.
57
When computing the Hessian matrix of E(w), we need to calculate the term
a2E
----(jl,jz=tO) with kl2k2• By the Chain Rule, we have [6]
av[.ktl av[k21
it h
(3.3.15)
Consider
(by (3.3.11))
( by (3.3.15) )
Thus, we have obtained, for h,jz:f:.O,
(3.3.16)
Each factor in the right-hand side of (3.3.16) is given, local, or computable via a
58
d2E
previous formula, except the terms av[kt+l]d~l~]' whose computations involve the
S h
d2E
recurSIve uses of (3.3.16). Hence, to compute av[ktlav[k we back-propagate the 2 ] ,
it 12
computations of
d2E
av[k] [~]' for k=K, K-l, K-2, ... , kj • . av.
it 12
When k=K, we have the base case
_ d (dE Ja u[iKt ] [K] 'dE d ( [K] 'J av[iKt ]
- au[ktl dU[K] av[~] (Ujt ) + au[K] av[K] (Uh ) av[~]
it it 12 it it 12
Using (3.3.13) and factoring out a term, we then obtain [6]
(3.3.17)
av[K] dE
Since ~~2] and ~ can be calculated via the uses of (3.3.10) and (3.3.14), formula
av. au.! 12 i
(3.3.17) can be used to compute the base case of (3.3.16). Note that the calculation of
59
dv [kll dv [kz 1
l! 12
requires the forward propagation of first-order partial derivatives, the
backward propagation of the first-order partials of E(w), and the second order partials.
Those propagation formulas that we have developed in this section will enable us to
derive the learning algorithms based on the methods we have introduced in Chapter 2.
We carry out the derivation in the next section.
3.4 Classical Learning Algorithms
Based on the classical optimization methods introduced in Chapter 2, we develop their
corresponding learning algorithms for training feedforward ANNs. The only major task
that is needed to yield our learning algorithms from the algorithms of Chapter 2 is the
calculation of all the first (andlor second) partial derivatives of the error function with
regard to the weights. Such computations are characterized by the special recursive form
of the network function given by (3.1.3) and (3.1.4). Since the error function is recursive,
the computations of the derivatives are also recursive.
While the derivation of all formulas in the previous section and the current section are
valid for any proper choice of an activation function, we shall use the Sigmoid function
defined by (3.1.5) as the activation function for each neuron in the network. For
convenience, the Sigmoid is cited below along with its derivatives [6].
(3.4.1)
(3.4.2)
1
f(x) = 1 -x
+e
(Sigmoid function)
f'(x) = 1 .e-X = 1 (1+e-X)-l
(l+e-xt (l+e-x) (l+e-x)
= f(x)(l- f(x))
60
,
f"(x) = (f(x)(I- f(x))) = f'(x)(l- f(x)) - f(x)f'(x)
(3.4.3) = f(x)(l- f(X))2 - f(x)f(x)(I- f(x))
= f(x)(l- f(x))(l- 2f(x))
The first-order and second-order derivatives of the Sigmoid function are needed when
evaluating the local derivatives defined by (3.3.2) and (3.3.3).
Steepest Descent Learning Algorithm
For the Steepest Descent method, we need all the first order partial derivatives of the
error function with regard to a connection weight. By the Chain rule, we have, for a layer
index k<K,
(3.4.4) aE _ aE au~kl _ aE ( [kl )(i) .
aw[.k] - au[kl aw[k] - au[k1 uj ,J ::j:. o.
j,1 j j,1 j
aE
Based on the recursion of au[kl given by (3.3.14), we can see that (3.4.4) is a
j
backward propagation formula. The base case of the recursion (3.4.4) is when k=K, where
we have
(3.4.5)
aE = aE ( [Kl)(i)
aw[Kl au[Kl uj
j,1 j
( by (3.3.12) and (3.3.8) ).
The last expression in (3.4.5) is computable upon a forward computation of the
network. Hence, any derivative a~l of E can be computed by the uses of (3.4.4) and
aw ..
],1
(3.4.5). This yields the gradient of E at Xl. To determine the step size of a downhill move
along the direction of the gradient, one simple ad hoc method is to fix a step size
61
beforehand. A better approach is to use a line search method as mentioned in section 2.1.
Upon the above preliminaries, we are now ready to give the first learning algorithm
that is based on the Steepest Descent method.
Algorithm 3.4.1: Given a set 5={(XI, YI)I X, =input, YI =desired output of X, } of L training patterns
and given a network setup of K layers with input dimension of n and output
dimension of m.
1. Initialize all weights [k'l W j,i as random numbers uniformly distributed between
-----0.-5 -and ------
0.5
fan -in of that unit fan - in of that unit
set stopping Tolerance
set W(1)=( W}~?) and k=1.
2. For each input X, E 5, repeat step 3, 4, 5, and 6.
3. Compute the actual output of the network by using (3.1.3) and (3.1.4).
4. Compute the gradient g(k) of E(w) via the use of (3.4.4), (3.4.5), and (3.4.2),
and set p(k)= _g(k).
5. Determine the step size a(k) by using a line search technique such that
( ) ( , )(k+l) 6. (on-line version) Update the weights w k+1 =( W)~il ) via
(off-line version) Accumulate the weight updates W(k+1) given above over all
input patterns.
7. If all the weights w (k+ 1 ) =( ([k'l)(k+l) W j,i ) is such that the following convergence
criterion is satisfied, then stop.
L
Otherwise, set k=k+ 1 , go to 2.
62
The rate of convergence of Algorithm 3.4.1 is linear. The computational complexity of
this algorithm is fairly simple except for the linear search method used.
The steepest descent method is rarely used today in the field of optimization because
of its slow rate of convergence. However, it is still employed in ANN learning algorithms
due to its simplicity. Various modified versions for speeding up Algorithm 3.4.1 have
been proposed. However, investigation of those speeding up algorithms is out of the
scope of this paper. Further development in this direction can be found in [10].
Algorithm for Computing the Hessian Matrix of E
The derivation of our proposed damped Newton learning algorithm requires the uses of
both the gradient g and the Hessian matrix H of E(w), which might also be needed for
the Conjugate Gradient learning algorithm. The calculation of the gradient g has been
derived above. We now derive the formulas for computing the Hessian matrix H.
a2E
The element of the Hessian matrix H of E(w) is of the form , jl, h:f=O.
aw[kd aw[k2]
11"1 jz"2
Since the Hessian matrix is symmetric, we only need to compute the second derivatives
with kl ~ k2• In the following discussion, without loss of generality, we shall assume kl ~
k2. Note that this means there is no connection from neuron ii in layer kl to neuron i2 in
layer k2•
Considering the first order partial derivative of E(w) locally at w~~:~, we have, by the
Chain Rule and (3.3.4) [6],
63
(3.4.6)
Differentiating (3.4.6) with respect to W[k2] , we obtain, for kl > k2
12"2
This gives us
(3.4.7)
When kl = k2, since U,[k1-1] is not a function of W J[k2] in (3.4.6), we have
1 2,l2
(3.4.8)
Each factor on the right-hand side of (3.4.7) and (3.4.8) can be computed either locally
or by a propagation formula derived in the previous section. Note that the computation of
normally requires multiple forward and backward passes through the
network, the number of which scales with the number of hidden nodes in the network [6].
64
The following algorithm is for the computation of the Hessian matrix of E(w).
Algorithm 3.4.2: (Exact calculation of Hessian Matrix H of E(w»
Assume given an input pattern (XI, Y/)
1. Compute the following
{ U}kll k=1, 2, ... , K, j=0, 1, ... , 11k}, by (3.1.3) and (3.1.4),
,
{ (u}kl) I k=1, 2, ... , K, j=0, 1, ... , 11k}, via (3.4.2),
"
{(u~kl) I k=1, 2, ... , K, j=0, 1, ... , 11k}, by (3.4.3),
{ (u}k1)i I k=1, 2, ... , K, j=0, 1, ... , 11k, i=0, 1, ... , 11k.1} via (3.3.7).
t1iPd
2. Compute { du(~21 I K?k"?~"21} by forward propagation formula (3.3.9).
12
av[kd
3. Compute { dv(~21 I K?k1"2~"21} by forward propagation formula (3.3.10).
h
dE
4. Calculate {--uJ I k=1, 2, ... , K, j=0, 1, ... , Ilk } via the back-propagation
duj
formula (3.3.14).
5. Generate { dv[k!~~[k21 I K?k1"2~"21} by using the backward recursive formula
h 12
(3.3.16).
dE
6. Evaluate { [ktl [k21 I K?k1"2~"21} via (3.4.7) and (3.4.8). aw . . aw .. h,l! 12.'2
7. Obtain the Hessian Matrix H of E(w) by symmetry and stop.
The Gauss Newton Learning Algorithm
The implementation of the Gauss-Newton version of Newton's method requires the
calculation of the gradient of E(w) and the Jacobian matrix J of the vector function
65
defined by the network. The element of the Jacobian matrix is of the form
(3.4.9) ( by (3.3.6) ).
By using (3.3.14), we can compute (3.4.9) via backward propagation after a forward
computation through the network. The following algorithm is the Gauss-Newton learning
algorithm, which is based on Algorithm 2.6.1.
Algorithm 3.4.3: Given a set 5={(X/, Y/)I X, =input, YI =desired output of X, } of L training patterns
and given a network setup of K layers with input dimension of n and output
dimension of m.
1. Initialize all weights [k'] W j,i as random numbers uniformly distributed between
----0.-5 -- an d ---0.5- --
fan - in of that unit fan - in of that unit
set stopping Tolerance
2. For each pattern (XI, Y/) E 5, repeat step 2.1, 2.2, and 2.3, with g(k)=O.
2.1. Using the weight W(k), compute the actual output of the network by
using (3.1.3) and (3.1.4).
2.2. Obtain the gradient g(x/) of E(w) via the use of (3.4.4), (3.4.5), and
(3.3.8),
2.3. sum up g(X/)'S, i.e., g(k)=g(k)+ g(x/)
3. For each pattern (XI, Y/) E 5, repeat step 3.1 and 3.2 with J(k)= o.
3.1. Compute the Jacobian Matrix J(W(k), XI) via (3.4.9).
3.2. sum up J(W(k), X/)'S, i.e., J(k)= J(k)+ J(W(k), XI).
4. Set H(k)= J(k) T J(k) and compute p(k) by solving the system of linear equations
H(k)p(k)= _g(k).
5. Compute a(k+1) one of the line search methods in section 2.1.
66
7. If all the weights w( k+ 1 ) =( ( W}~i' l)(k+l) ) is such that the following convergence
criterion is satisfied, then go to 8.
Otherwise, set k=k+ 1, go to 2.
8 . Set WO=W(k+1) and stop.
Conjugate Descent Learning Algorithm
The following learning algorithm is based on Algorithm 2.5.1 of the conjugate descent
method. It is used for the off-line training model.
Algorithm 3.4.4: Given a set S={(X/, Y/)I X, =input, YI =desired output of X, } of L training patterns
and given a network setup of K layers with input dimension of n and output
dimension of m.
1. Initialize all weights W}~? as random numbers uniformly distributed between
-------0.-5- ---and -----0-.5- ----
fan - in of that unit fan - in of that unit
Set stopping Tolerance
set W(l)=( W}~?) and k=1.
2. For each pattern (XI, Y/) E S, repeat step 2.1, 2.2, 2.3, 2.4, with g(k)=O.
2.1. Using the weight W(k), compute the actual output of the network by
using (3.1.3) and (3.1.4).
2.2. Calculate error information by using (3.3.12), based on the desired
output Y/,
2.3. Obtain the gradient g(x/) of E(w) via the use of (3.4.4), (3.4.5), and
(3.3.8),
2.4. sum up g(X/)'S, i.e., g(k)=g(k)+ g(x/)
3. If k=1, then set p(l)= r(l)= _g(l).
67
4. Compute U(k) by using a line search technique [36]
Ck)T rCk)
or by using the formula aCk) = C0T Ck) Ck) (or using an approximation of
p H p
5. Compute W(k+1)=W(k)+U(k)p(k) and, using step 2 to compute g(k+1), and set
6. If k=O mod n, go to 10.
7. Compute ~(k) using one of the following:
•
•
•
(Hestenes and Stiefel)
(Fletcher and Reeves)
(Polak and Ribiere)
{3Ck) = r Ck+1)T (r(k+l) - rCk»
pCk)T rCk)
() ( , )Ck+l) 9. If all the weights W k+1 =( W)~i 1 ) is such that the following convergence
criterion is satisfied, then go to 11.
L
Otherwise, set k=k+ 1, go to 2.
10. Set p(k+1)= r(k+1) go to 2.
11. Set WO=W(k+1) and stop.
3.5 Implementation of the Levenberg-Marquardt Method
The Levenberg-Marquardt technique is used to improve the stability of the Gauss-
Newton method, as we have seen in Chapter 2. Specifically, instead of solving a system
of linear equation given by
(3.5.1) (step 4, Algorithm 3.4.3),
68
in the treatment of Levenberg and Marquardt, we solve the system determined by
(3.5.2)
where D=diag(Qll, ... , Qnn) for Q=jk) Tjk). Based on Algorithm 2.7.1 and Algorithm
3.4.3, the Levenberg-Marquardt version of the Gauss-Newton method is straightforward.
Algorithm 3.5.1: Given a set S={(X/, Y/)I XI =input, YI =desired output of XI } of L training patterns
and given a network setup of K layers with input dimension of n and output
dimension of m.
1. Initialize all weights [k'J Wj,i as random numbers uniformly distributed between
----0.-5 -- an d ---0.5- --
fan - in of that unit fan - in of that unit
set stopping Tolerance
2. For each pattern (XI, Y/) E S, repeat step 2.1, 2.2, and, 2.3, with g(k)=O.
2.1. Using the weight W(k), compute the actual output of the network by
using (3.1.3) and (3.1.4).
2.2. Obtain the gradient g(x/) of E(w) via the use of (3.4.4), (3.4.6), and
(3.3.8),
2.3. sum up g(X/)'S, i.e., g(k)=g(k)+ g(x/)
3. For each pattern (XI, Y/) E S, repeat step 3.1 and 3.2 with J(k)= o.
3.1. Compute the Jacobian Matrix J(W(k), XI) via (3.4.8).
3.2. sum up J(W(k), X/)'S, i.e., J(k)= J(k)+ J(W(k), XI).
Aik )
4. Set H(k)= J(k)TJ(k) and ",(k)= -v-'
6. Compute p(k) by solving the system of linear equations
(H(k)+ ",(k)D)p(k)= _g(k).
Otherwise, goto 8.
69
8. If all the weights W(k+1) =( (W)~;l)(k+l) is such that the following convergence
criterion is satisfied, then go to 9.
Otherwise, set k=k+ 1 , goto 2.
9. Set wo=W(k+1) and stop.
70
4. THE DAMPED NEWTON LEARNING ALGORITHM
In this Chapter, we develop the proposed ANN learning algorithm. This includes its
foundation and the algorithmic implementation. In Section 2.7, we have seen that the
Levenberg-Marquardt method is a very effective technique to improve the stability of the
Gauss-Newton method. In this Chapter, we incorporate such a technique into Newton's
method in order to improve the convergence properties of the Newton method, and
implement this new method as an ANN learning law for training feedforward neural
networks. The two sections, Section 4.1 and 4.2, can be treated as extensions of Section
2.7 of Chapter 2 and Section 3.5 of Chapter 3 respectively.
4.1 The Damped Newton Method
When Newton method is used for solving non-linear least squares problems, it suffers
the same deficiencies as the Gauss-Newton method we have discussed in Section 2.7 and
it may also fail because of non-positive definiteness of the Hessian matrix of E(w). To
improve the stability of the Newton method, we incorporate the Levenberg-Marquardt
technique into Newton's method. Recall that, at each iteration in the Newton method, we
determine a downhill step Il(k) by solving the system of linear equations given by
(4.1.1)
where g(k) is the gradient of the error function E(w) and H(k) is the Hessian matrix of
E(w). To overcome the singular and non-positive definite problems of the matrix H(k)
that may exist, we solve the following system of linear equations instead.
(4.1.2)
71
where A (k) > 0 and D is a diagonal matrix whose elements are the absolute values of the
diagonal elements of H(k).
The theoretical foundation for the above formulation is related to the three theorems of
Section 2.7. In the three theorems, it is assumed that the matrix Q being symmetric and
positive semidefinite. The latter condition is violated in the above formulation since the
Hessian matrix could be non-positive definite. By adding a positive number to each of the
diagonal elements of the Hessian matrix H, the resulting matrix could be made positive
definite if the added constants are sufficiently large. Hence, upon the restriction that A is
sufficiently large, the three theorems in section 2.7 can still be used to justify the
formulation (4.1.2). All results obtained in the discussion of the Gauss-Newton/
Levenberg-Marquardt method (Section 2.7) can now be applied to this extended method.
This modified version of the Newton method might be called the "Extended" LevenbergMarquardt
method, or the Damped Newton method.
In the next section, we develop an arithmetic implementation of this Damped Newton
method as an ANN learning algorithm for training feedforward neural networks.
4.2 The Damped Newton Learning Algorithm
To implement the damped Newton method as an ANN learning law, we need to
compute the gradient g and the Hessian matrix H of the error function E(w). Based on
our previous work on such computations, a slight modification of the Algorithm 3.5.1
would yield our goal algorithm.
72
The Damped Newton Learning Algorithm
Algorithm 4.2.1: Given a set 5={(XI, YI)I X, =input, YI =desired output of X, } of L training patterns
and given a network setup of K layers with input dimension of n and output
dimension of m.
1. Initialize all weights [k'l W j,i as random numbers uniformly distributed between
----0-5 -- an d ---05- --
fan - in of that unit fan - in of that unit
set stopping Tolerance
2. For each pattern (XI, YI) E 5, repeat step 2.1, 2.2, and 2.3, with g(k)=O.
2.1. Using the weight W(k), compute the actual output of the network by
using (3.1.3) and (3.1.4).
2.2. Obtain the gradient g(xl) of E(w) via the use of (3.4.4), (3.4.5), and
(3.3.8),
2.3. sum up g(XI)'S, i.e., g(k)=g(k)+ g(xl)
3. For each pattern (XI, YI) E 5, repeat step 3.1 and 3.2 with H(k)= o.
3.1. Use Algorithm 3.4.2 to compute the Hessian matrix H(W(k), XI) of E(w).
3.2. Sum up H(W(k), XI)'S, i.e., H(k)= H(k)+ H(W(k), XI).
ACk)
4. Set ;>...(k)= -v-'
5. Compute p(k) by solving the system of linear equations
Otherwise, goto 8.
8. If all the weights w( k +1 ) =( ([k'l)Ck+l) W j,i ) is such that the following convergence
criterion is satisfied, then go to 9.
73
Otherwise, set k=k+ 1 , goto 2.
9. Set WO=W{k+1) and stop.
To test our proposed learning algorithm, we will implement it using the FORTRAN
language. Some other algorithms will also be implemented in order to assess the
performance of the new algorithm. The next chapter deals with those implementations
and tests.
74
5. IMPLEMENTATION AND TEST RESULTS
5.1 Language Implementations
To test the new damped Newton learning algorithm, we implement it using the
standard FORTRAN 77 language. To compare its performance in enforcing the learning
process of ANNs with that of some other methods, the steepest descent learning
algorithm and the Gauss-NewtonJLevenberg-Marquardt learning algorithm are also
programmed. Their names are STEDES for the steepest descent learning algorithm,
GNLMD for the Gauss-NewtonJLevenberg-Marquardt learning algorithm, and NLMD for
the Damped Newton learning algorithm.
The network structure of an ANN is represented by two two-dimensional arrays. The
first one is array LAYER, which contains information such as the number of layers in the
network, the number of nodes in each layer, and some indexing information. One
problem encountered in the implementation is the indexing of neurons and weights. In the
program, the following indexing formulas are used.
• The index of neuron i in layer k is (k:E~umber of nodes in layer sJ + i •
s=o
Note that the first summation in the above formula depends on the layer only, and
for convenience, the summation for each layer is computed early and stored in the
second row of array LAYER. Note also that neuron number 0 in each layer is the
bias node.
• The index of a connection weight w;~l is defined by
75
(k~ ) .~numberOfnOdeinlayers{numberOfnOdeinlayer(S-I)+I) + G -I)· numberofnodesinlayer(k -I) + i
The first summation also depends on the layer only and the values are computed
and stored in the third row of the array LA YER.
The second array representing a neural network is a two-dimensional array, NEURON,
whose first dimension is over the nodes index. It stores the output u[klof each neuron, the
.I
,
first derivative (U)kl) of the activation function, and the partial derivatives of the
form d~l . The storage requirement of NEURON is of order O(n), where n is the total
duj
number of nodes in the network (including input and bias nodes).
All the weights are stored in the first row of the two-dimensional array WEIGHT with
the first dimension over the weight index. The second and the third rows of WEIGHT are
used to store the gradient vector of the error function with regard to a pattern and that of
the error function over all the patterns respectively. This array takes storage of order, at
most, O(mK), where m is the maximum number of nodes in a layer and K is the number
of weighted layers in the network. For the steepest descent learning algorithm, the above
storage are all that it needs. For the Gauss-Newton learning algorithm, more rows of the
array WEIGHT are needed to store the Jacobian matrix of E(w), and a storage for the
Hessian matrix of E(w) is also required. The storage requirement of the Hessian matrix is
of order O(m2K2), with m and K being defined above. For the Damped Newton learning
algorithm, the array HESIAN for the Hessian matrix of E(w) is needed and, in addition, a
76
dU[kd mlkd
storage of order O(n2) is needed to store information of the form J __.I - and
du[~l' dv[~l'
I I
dv[.k~~[~l . The latter array, UUVV, is also needed for the Gauss-NewtonJLevenberg-
.II 12
Marquardt learning algorithm. In the program, other derivatives whose storage are not
allocated are computed when needed. The formulas for computing those pieces of
information are listed below.
" ,
3) (U)kl) = (U)kl) (1-2u)kl)
The program is divided into functional subroutines. The major subroutines are the
following.
INITWT () --initialize all connection weights.
NETOUT () --compute the outputs of all neurons in the network upon an input.
COMGRA () --calculate the gradient of E(w) with regard to an learning pattern.
DEU () --compute the derivatives of the form ~~l .
J
FIDITV () --find an interval for a line search.
QINTER () --perform quadratic interpolation to locate a minimizer (Algorithm 2.1.2).
STEDES () --implement the steepest descent learning algorithm (Algorithm 3.4.1).
RR ( ) --compute the squared sum function value upon a learning example.
77
DUUVV( )
au[kd dv[kd
--compute derivatives of the form aut10J and ----bJ.
i dvi
COMJAC ( ) --compute the Jacobian matrix of the network function and approximate
the Hessian matrix by the product of the Jacobian matrices.
LSOLV () --solve a system of linear equations (Algorithm 2.1.1).
DEW ( ) --comp,ute derivatives of the form dv[k~~[.I0J •
11 12
COMHAS ( ) --compute the Hessian matrix of E(w) for the Damped Newton learning
algorithm.
GNLMD () --implement the Gauss-Newton!Levenberg-Marquardt learning algorithm
(Algorithm 3.5.1).
NLMD () --implement the Damped Newton learning algorithm (Algorithm 4.2.1).
The program is written in FORTRAN 77 with double precision. It is compiled by
using ptx/FORTRAN compiler which is compatible with ANSI Standard FORTRAN 77.
The test results are described in section 5.3.
5.2 Neural Network Design
The design of a neural network is highly problem-oriented. It is the problem that
determines what ANN topology and what stopping condition for training should be used.
The topology of the ANN determines the number of connection weights. Using more
connection weights implies that the ANN might need more training examples. If, in the
training process, the error function values converge to a value above the requirement,
78
then more connection weights (by means of adding more nodes and/or hidden layers) are
needed in order to further reduce the error function values.
Normally, for a given problem, training is done on different topologies of ANNs in
order to determine which one might fit the requirements of the problem. At this stage, the
stopping conditions used in the training might be any of the stopping condition 2, 3, 4 or
6 as mentioned in section 3.2, Chapter 3. Once a set of optimal ANN topologies are
obtained, then the training of the ANNs are concentrated on the generalization of the
ANNs. The generalization ability of the ANN is highly sensitive to when to stop the
training process. A good fitting of the training examples does not mean that the ANN
will generalize well over the entire problem domain. Thus, to obtain better generalization
results, some optimal stopping point has to be set before the training starts. The uses of
validation sets provide some tools for obtaining such a optimal point. In such a training
paradigm, the training process is stopped when the errors on the validation examples are
reduced within tolerance. In this stage, the stopping condition used for the training might
,
be the stopping condition 1, 3, or 5. However, in this paper, we are not concerned with
the generalization of the ANNs, since it depends on the training ability of the learning
algorithms, i.e., the ability to reduce the error function values to within a specified level
of tolerance. Hence, to test our new Damped Newton learning algorithm, we emphasize
on the training side, that is, whether it trains an ANN or not.
79
5.3 Test Problems
The goal of our testing is to explore the feasibility of the proposed new damped
Newton learning algorithm for training fully-connected, feedforward neural networks and,
if feasible, to assess its performance compared with existing learning algorithms.
The first test problem is the parity problem obtained from the benchmark problems for
training neural networks in the artificial intelligence depository at Carnegie Mellon
University (Anonymous FTP: /afs/cs/projectlconnectfbench on ftp.cs.cmu.edu). We test
the learning algorithms on three sub-problems of the parity problem, i.e., the 2-, 3-, and 4-
parity problems. Details of those problems follow.
Number of
Problem Type of Problems Inputs Outputs Examples
2-parity(XOR) Classification 2 1 4
3-parity Classification 3 1 8
4-parity Classification 4 1 16
Note that in this ~-parity problem, the requirement is that the ANN should be able to
classify each pattern correctly up to a given tolerance. Hence, all the examples are used in
the training sessions and generalization of the ANN is not a concern in this problem.
Three more test problems are chosen from the PROBENl [26]. Two of those are
function approximation problems and the other is a classification problem. The following
lists information about those test problems.
Number of
Problem Type of Problems Inputs Outputs Examples
cancer Classification 9 2 699
building Approximation 14 3 4208
heart Approximation 35 1 920
80
5.4 Test Results
The initial set of connection weights in an ANN affects the learning process of the
neural network. Hence, it is necessary to obtain testing results with regard to different set
of initial connection weights. The test results obtained in this section are the results of
several runs, each time with a new set of random connection weights. In the program, a
pseudo-random number generator [32] is used to produce the random numbers needed.
The generator takes an integer seed number as its input and could produce a sequence of
random numbers unique to each seed number. This way, the weight initialization process
could be reproduced so that we can start different learning algorithms with the same ANN
topology and the same set of initial connection weights.
The program has been run on the test problems and numerical results have been
obtained. The testing results are described below according to the following comparison
criteria.
1. Feasibility test: with a fixed stopping criterion, the program was run on the
three test problems. For each problem, ten runs have been
performed for each of the three learning algorithms. The
initial weights of the network are different for each run,
while they are the same for each of the three learning
algorithms at each run (i.e., using the same seed for
generating the pseudo-random numbers). The results
obtained are the average of the ten runs or fewer if learning
fails because of local minimum problems.
81
2. Performance test: the same as above but with varied stopping conditions to
assess the performance of the new algorithm when high
precision over the training set is needed.
The following lists the parameters used in the three learning algorithms.
• initial A=O.OI
• increase/decrease factor ~=5.0
The feasibility test has been done on all of the test problems mentioned in the last
section. For the classification problems, the stopping condition used is No.5, i.e., to
classify each pattern correctly to within a 0.1 tolerance. The results are listed in the
following tables. In the tables, the symbol "L" means a local minimum was encountered,
i.e., patterns can not all be classified correctly within the given tolerance, and "F" implies
that the method failed. In such cases, they are not counted in the averages.
Table 1: Test Results on the 2·parity Problem with a 2·4·1 ANN Topology
seed number for generating initial random weights average
Algorithm 1 17 21 27 40 45 66 78 81 96
Steepest Descent 80 61 68 49 106 103 104 106 35 259 97.1
Gauss-Newton/LM 6 9 6 L 6 15 5 L 2 11 7.5
Damped Newton 21 L 9 15 18 16 17 37 17 19 18.8
Note: 4 training examples are used.
82
Table 2: Test Results on the 3-parity Problem with a 3-6-1 ANN Topology
seed number for generating initial random weights average
Algorithm 1 17 21 27 40 45 66 78 81 96
Steepest Descent 135 402 3660 5541 11821 480 3401 149 4696 189 3047.4
Gauss-Newton/LM 16 6 3 7 5 6 10 12 7 9 8.1
Damped Newton L 28 12 14 28 16 L 15 19 23 19.4
.. Note: 8 trammg examples are used.
Table 3: Test Results on the 3-parity Problem with a 4-8-1 ANN Topology
seed number for generating initial random weights average
Algorithm 1 17 21 27 40 45 66 78 81 96
Steepest Descent 1121 1334 18576 4325 1489 2783 2811 F 23230 2242 3617.2
Gauss-Newton/LM L L 105 132 L L 119 101 18 57 88.7
Damped Newton 21 116 22 34 54 L 77 59 55 119 61.4
. . Note: 16 trammg examples are used .
The stopping conditions used for testing the cancer classification problem and the
other two approximation problems are somewhat complicated. We first run the Gauss-
NewtonlLevenberg-Marquardt learning algorithm (or the Damped Newton) with a
relative RMS stopping condition ( Stopping Condition #3) to determine a point of
convergence. Note that this stopping condition sometimes does not work well when used
in the steepest descent learning algorithm, since the reduction of function values from one
iteration to the next could be very small. Then, upon the knowledge of this point, we used
the stopping condition #1, i.e., learning is stopped when the error function values are
reduced within a given tolerance (which is greater than the function value at the point of
83
convergence). In the following, results of the Steepest Descent learning algorithm are
obtained with a tolerance of 0.001 with stopping condition #1 (using RMS values) and
the results of the other two methods are gotten when the reduction of RMS values is
within tolerance of O.lx 1 0-16 (Stop Condition #3).
Table 4: Test Results on the Cancer Problem with a 9-2 ANN Topology
seed number for generating initial random weights average
Algorithm 1 17 21 27 40 45 66 78 81 96
Steepest Descent 2173 2184 2190 2190 1924 2037 2165 2130 346 2180 1951.9
Gauss-Newton/LM 58 57 58 56 56 56 56 57 56 56 56.6
Damped Newton 12 12 12 12 12 12 13 12 12 12 12.1
. . Note: (1) 525 trammg examples are used .
(2) Local miminum=0.2246622184498266780 (with a 10.15 accuracy).
(3) Training for Steepest Descent algorithm stopped when function values are reduced below 0.231.
Table 5: Test Results on the Building Problem with a 14-3 ANN Topology
seed number for generating initial random weights average
Algorithm 1 17 21 27 40 45 66 78 81 96
Steepest Descent 748 757 757 771 753 760 773 763 747 767 759.6
Gauss-Newton/LM 7 7 7 7 7 7 7 7 7 7 7
Damped Newton 6 6 6 6 6 6 5 5 6 6 5.8
. . Note: (1) 500 trammg examples are used .
(2) Local miminum=0.0602037866342696670 (with a ]0.15 accuracy).
(3) Training for Steepest Descent algorithm stopped when function values are reduced below 0.0603.
84
Table 6: Test Results on the Heart Problem with a 35-1 ANN Topology
seed number for generating initial random weights average
Algorithm 1 17 21 27 40 45 66 78 81 96
Steepest Descent 325 341 331 336 324 325 322 327 335 335 330.1
Gauss-Newton/LM 9 10 9 9 9 9 9 9 9 9 9.1
Damped Newton 5 5 5 5 5 5 6 5 5 6 5.2
Note: (1) 690 training examples are used.
(2) Local miminum=0.1970472022241477990 (with a ]0.15 accuracy).
(3) Training for Steepest Descent algorithm stopped when function values are reduced below 0.198.
One advantage of using the Newton method over the Gauss-Newton method is that the
former converges to a minimum point with quadratic convergence as opposed to the
linear convergence for the Gauss-Newton method. When testing, this property is reflected
by the speed at which the norm of the weight updates approaches O. For the damped
Newton method, its speed of convergence is weaker than that of the Newton method and
is close to quadratic as can be observed in the following tables. The stopping condition
used for the tests is the relative RMS ( Stopping Condition #3) with a much smaller
tolerance of O.lxlO-16 than that used in the above tests. This stronger requirement is
needed when high precision on the training set is required.
85
Table 7: Performance Test Results (the Gauss-NewtonlLevenberg-Marquardt
Algorithm) on the Building Problem with a 14-3 ANN Topology
actual RMS epoch )}II] II!1ttiI2
0.3719656416874160150 0 0.1000000E-01 5.000000
0.0764699001570153846 1 0.2000000E-02 2.851237
0.0603818254426006184 2 0.4000000E-03 0.6581495
0.0602039906224533627 3 0.8000000E-04 0.1705101
0.0602037872064130574 4 0.1600000E-04 0.8345467E-02
0.0602037866363822171 5 0.3200000E-05 0.4318909E-03
0.0602037866342778116 6 0.6400000E-06 0.2682336E-04
0.0602037866342696670 7 0.1280000E-06 0.1654361E-05
0.0602037866342696492 8 0.2560000E-07 0.1031173E-06
. . Note: (1) 500 trammg examples are used .
Table 8: Performance Test Results (the Damped Newton Algorithm) on
the Building Problem with a 14-3 ANN Topology
actual RMS epoch AlII] II!1w112
0.3719656416874160150 0 0.1000000E-01 5.000000
0.0787833370965010093 1 0.2000000E-02 2.805360
0.0613366701825997751 2 0.4000000E-03 0.9584970
0.0602151588752766375 3 0.8000000E-04 0.4732735
0.0602037887783644087 4 0.1600000E-04 0.8338464E-01
0.0602037866342702443 5 0.3200000E-05 0.1488150E-02
0.0602037866342696847 6 0.6400000E-06 o . 1317166E-05
0.0602037866342696670 7 o .1280000E-06 0.2642247E-09
Note: (1) 690 training examples are used.
86
Table 9: Performance Test Results (the Gauss-NewtonlLevenberg-Marquardt
Algorithm) on the Heart Problem with a 35-1 ANN Topology
actual RMS epoch A,LKJ 11~W\12
0.3091762029727637360 0 0.1000000E-01 5.000000
0.2000977683803623730 1 0.2000000E-02 3.142956
0.1970759648661704050 2 0.4000000E-03 0.7461564
0.1970474538977780020 3 0.8000000E-04 0.9303208E-01
0.1970472055096417030 4 0.1600000E-04 0.8621738E-02
0.1970472022720594110 5 0.3200000E-05 0.8791360E-03
0.1970472022248880070 6 0.6400000E-06 0.9898951E-04
0.1970472022241596290 7 0.1280000E-06 0.1136560E-04
0.1970472022241478880 8 0.2560000E-07 o .1339874E-05
0.1970472022241477100 9 0.5120000E-08 0.1799136E-06
0.1970472022241478520 10 0.5120000E-09 0.1718461E-05
0.1970472022241476390 10 0.9765625E-01 0.1655321E-07
Note: (1) 500 traimng examples are used.
Table 10: Performance Test Results (the Damped Newton Algorithm) on the Heart
Problem with a 35-1 ANN Topology
actual RMS epoch A,LKJ 11~W\12
0.3091762029727637360 0 0.1000000E-01 5.000000
0.2000021570874490610 1 0.2000000E-02 3.194229
0.1971521541708801450 2 0.4000000E-03 0.7650341
0.1970474745554628800 3 0.8000000E-04 0.1775413
0.1970472022267991890 4 0.1600000E-04 0.9683954E 02
0.1970472022241477280 5 0.3200000E-05 0.3717560E-04
0.1970472022241477460 6 0.6400000E-06 0.6030849E 08
0.1970472022241477100 6 0.1280000E-06 0.6028837E-08
. . Note: (1) 690 trammg examples are used .
87
6. CONCLUSION
6.1 Conclusion
This study presents a new ANN learning algorithm for training fully-connected,
feedforward neural networks. The new algorithm is based on Newton's method for
solving non-linear least squares problems and the Levenberg-Marquardt technique to
improve the convergence properties of Newton's method. The new algorithm has been
programmed and tested on several real world problems. Satisfactory numerical results
have been obtained. It is shown in this paper that the proposed new learning algorithm is
practically feasibly and has high convergence performance over the existing ANN
learning algorithms when dealing with approximation problems and when high precision
over the training set is required.
6.2 Recommendation for Future Work
Further investigation could be done at several places in the new algorithm. Some
optimal methods could be used in order to improve the performance of the new
algorithm. Those areas are described in the following.
1. Further investigation can be done on some optimal choices of A and the factor fl at
each iteration, which are used in both of the Gauss-NewtonlLevenberg-Marquardt
learning algorithm and the Damped Newton learning algorithm. One approach is
that of Fletcher [5]. Others are covered in references [12], [13], [17] and [33].
88
2. Other paradigms regarding when and how to increase or decrease')... at each iteration
other than that of Marquardt's approach presented in this paper can be
investigated. The references are the same as the above ones.
3. A rigid comparison study is needed to compare thoroughly the performance of the
algorithms mentioned in this paper in order to obtain better understanding of the
learning properties of those algorithms we are concerned with, with regard to
more aspects of the learning process in training ANNs. For example, the GaussNewton
method may never have a neighborhood of convergence at a minimum
point, as has been shown in [31]. Such a problem is not encountered in the test
problems we have seen in this thesis report. Proper test problems need to be found
in order to gain understanding of this divergence behavior of the Gauss-Newton
method and the behavior of Newton's method when tested with this kind of
problems.
89
7. REFERENCES
[1] M. AI-Baali and R. Fletcher, "Variational Methods for Non-Linear LeastSquares",
J. Opl Res. Soc., Vol. 36, No.5, pp405-421, 1985.
[2] Christine T. Altendorf, Estimating Soil Water Content Using Soil Temperatures
and a Neural Network, Ph.D. Dissertation, Oklahoma State University, 1993
[3] Roberto Battiti, "First- and Second-Order Methods for Learning: Between
Steepest Descent and Newton's Method", Neural Computation, Vol.4, 1992,
ppI41-166.
[4] Randall Beer, Chie1, Hillel; and Sterling, Leon, "An Artificial Insect", American
Scientist, Volume 79.
[5] Chris Bishop, "A Fast Procedure for Retraining the Multilayer Perceptron",
International Journal of Neural Systems, Vol. 2, No.3 (1991), pp229-236.
[6] Chris Bishop, "Exact Calculation of the Hessian Matrix for the Multilayer
Perceptron", Neural Computation, Vol.4, 1992, pp494-501.
[7] Chris M. Bishop, "Neural Networks and Their Applications", Rev. Sci. Instrum.
Vol.65, No.6, June, 1994.
[8] Wray L. Buntine and Andreas S. Weigend, "Computing Second Derivatives in
Feed-Forward Networks: A Review", IEEE Transactions on Neural Networks, Vol.
5, No.3, pp.480-488, May,1994.
[9] Richard L. Burden and J. Douglas Faires, Numerical Analysis, 4th Edition, PWSKENT
Publishing Company, Boston, 1988.
90
[10] A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal
Processing, John Wiley & Sons Ltd., Baffins Lane, Chichester, West Sussex P019
IUD, England, 1993.
[11] J. E. Dennis, Jr. and Robert B. Schnabel, Numerical Methods for Unconstrained
Optimization and Nonlinear Equations, Prentice-Hall, Inc., Englewood Cliffs,
New Jersey 07632, 1983.
[12] A. Dold and B. Eckmann, Lecture Notes in Mathematics: Numerical Analysis,
No. 630, Springer-Verlag, Berlin, 1978.
[13] R. Fletcher, Practical Methods of Optimization, Volume 1, John Wiley & Sons,
Chichester, 1980.
[14] A. Forsgren, P.E. Gill, and W. Murray, "Computing Modified Newton Directions
Using A Partial Cholesky Factorization", SIAM J. SCI. COMPUT., Vo1.16, No.1,
pp139-150, January 1995.
[15] Martin T. Hagan and Mohammed B. Menhaj, "Training Feedforward Networks
with the Marquardt Algorithm", IEEE Transactions on Neural Networks, Vol. 5,
No.6, pp.989-994, Nov.,1994.
[16] Martin T. Hagan, Neural Networks, Lecture Notes, Oklahoma State University,
1995.
[17] W. M. Haubler, "A Local Convergence Analysis for the Gauss-Newton and
Levenberg-Morrison-Marquardt Algorithm", Computing, Vo1.31, pp231-244,
1983.
91
[18] Magnus R. Hestenes, Conjugate Direction Methods in Optimization, SpringerVerlag,
175 Fifth Avenue, New Your, NY 10010, USA, 1980.
[19] Robert Hecht-Nielsen, Neurocomputing, Addison-Wesley Publishing Company,
Inc., 1990.
[20] Nicolaos B. Karayiannis and Anastasios N. Venetsanopoulos, Artificial Neural
Networks, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park,
Norwell, Massachusetts, 02061, 1993.
[21] S.Y. Kung, Digital Neural Network, Princeton University, PTR Prentice-Hall,
Englewood Cliffs, New Jersey 07632, 1993.
[22] Charles L. Lawson and Richard J. Hanson, Solving Least Squares Problems,
Prentice-Hall, Inc., Englewood Cliffs, N.J., 1974.
[23] Kenneth Levenberg, "A Method For the Solution of Certain Non-Linear Problems
in Least Squares", Quart. Appl. Math., No.2, pp.164-168, 1944.
[24] Donald W. Marquardt, "An Algorithm for Least-Squares Estimation of Nonlinear
Parameters", J. Soc. Indust. Appl. Math., Vo.l1, No.2, pp.431-441, June, 1963
[25] Martin Fodslette Moller, "A Scaled Conjugate Gradient Algorithm for Fast
Supervised Learning", Neural Networks, Vol. 6, pp. 525-553, 1993.
[26] Lutz Prechelt, PROBEN1-A Set of Neural Network Benchmark Problems and
Benchmarking Rules, Technical Report 21194, University Karlsruhe, 1994,
Anonymous FTP: /pub/neuronJprobenl.tar.gz on ftp.ira.uka.de.
[27] David E. Rumelhart, Geoffery E. Hinton, and Ronald J. Williams, "Learning
representations by back-propagating errors", Nature, Vol.323-9, Oct., 1986.
92
[28] L.B. Scales, Introduction to Non-Linear Optimization, Springer-Verlag New York
Inc., 175 Fifth Avenue, New Your, NY 10010, USA, 1985.
[29] F. Beckley Smith, Jr., and David F. Shanno, "An Improved Marquardt Procedure
for Nonlinear Regressions", Technometrics, VoLl3, No.1, pp.63-74, Feb., 1971.
[30] J.G. Taylor, The Promise of Neural Networks, Springer-Verlag, London, 1993.
[31] D. R. Unruh, Basic Fluid Power Research Program, OSU Report, REF 70-1,
Oklahoma State University, 1970.
[32] Shiang-Huey Wang, "Comparison of Backpropagation Neural Networks and
General Regression Neural Networks", M.S. Thesis, Oklahoma State University,
1994.
[33] Andrew R. Webb, "Functional Approximation by Feed-Forward Networks: A
Least-Squares Approach to Generalization", IEEE Transactions on Neural
Networks, Vol. 5, No.3, pp363-371, May, 1994.
[34] Yuan Wei, "An Accelerated Levenberg-Marquardt Algorithm for Nonlinear Least
Square Problems", M.S. Thesis, Oklahoma State University, 1992.
[35] Patrick H. Winston, Artificial Intelligence, Third Edition, Addison-Wesley
Publishing company, Reading, Massachusetts, 1992.
[36] M.A. Wolfe, Numerical Methods for Unconstrained Optimization, Van Nostrand
[37]
Reinhold Company Ltd., Molly Millars Lane, Wokingham, Berkshire, England,
1978.
FAQ, comp.ai.neural-nets, URL: http://w-w..w.i-p-d.-ir-a.-uka.de/-precheltlFAQ/neural.:-net-
faq.html.
93
8. APPENDICES
8.1 Program List
PROGRAM DRIVER
C***********************************************************************
C
C BY Joseph Wang, Department of Computer Sciences, OSU, 1995
C
C***********************************************************************
C This is the driver of the 3 ANN learning algorithms implemented.
C Before the program is executed, the following 3 data files should be
C set ready. Note that if there are more than one data in a line than
C they should be separated by spaces or commas.
C (1). "net.dat", contains information to specify a network.
C FORMAT: 1st line--number of weighted layers (without input
C layer)
C 2nd line--number of nodes in each layer, starting
C with the input-node layer, all in one
C line.
C
C
C
C
C
C
C
C
C
C
C
C
C
3rd line--an integer seed for generating random
number.
4th line--stopping tolerance.
5th line--choice of learning algorithms.
1= Steepest Descent.
2= Gauss-Newton/Levenberg-Marquadt
algorithm with Identity matrix added.
3= Damped Newton algorithm with Identity
matrix added.
4= Gauss-Newton/Levenberg-Marquadt
algorithm with diagonal matrix added.
5= Damped Newton algorithm with diagonal
matrix added.
C (2). "train.dat", contains training examples.
C FORMAT: 1st line--number of training examples.
C 2nd line--input example, all in one line.
C 3rd line--desired output examples.
C repeat the pattern of 2nd and 3rd lines for each
C example.
C (3). "test.dat", contains test examples
C FORMAT: 1st line--number of testing examples.
C 2nd line--input values, all in one line.
C repeat the 2nd lines for each testing example.
C
C When running,
C STEDES
C GNLMD
C GNLMD
C
C
one of the following routine will be called.
the Steepest Descent learning algorithm
the Gauss-Newton/Levenberg-Marquardt learning algorithm
the Newton/Levenberg-Marquardt learning algorithm
C***********************************************************************
C Important: This program is neural network dependent. That is, the
C program has restriction on the number of neurons in a layer
C and the number of layers in a network. The restriction is
C due to the language deficiency rather than the performance
94
C
C
C
C
C
C
C
C Variables:
of the learning algorithms.
If it is needed to change the capacity of the program, then
the dimensions of all arrays should be adjusted properly.
This only needs to be done for the global variables that
defined below. The dimensions of the arrays in subprograms
will be automatically adjusted by using formal arguments.
C LAYER: to store network structure.
C
C
C
C
C
C
C
C
2 dimension integer array, 1st index over layers (including
the input nodes layer) of the network, the 2nd index are
defined as following:
(*, 1) contains the number of nodes in each layer,
excluding the bias node.
(*, 2) keeps information for indexing neurons in the net.
( *, 3) etgres_-iIl:t:9..~rn~!.t.QP" f or2!lg,§?Si:}]'SLJ~.tl-~LW.ej,ghts .... in .. the-.
net.
C INEX: to store learning examples.
C 2 dimension real array, 1st index over input vector, 2nd
C over learning examples.
C OUTEX: to store the desired outputs of learning examples.
C 2 dimension real array, 1st index over output vector, 2nd
C over learning examples.
C TESTEX: to store test examples.
C 2 dimension real array, 1st index over input vector, 2nd
C over testing examples.
C NEURON: to store information related to neurons.
C 2 dimension real array, 1st index over all neurons/nodes in
C the network, the 2nd index are defined as below
C (*, 1) contains output values of each neuron/node in the
C net.
C (*, 2) contains 1st derivatives of the activation function
C of each neurons.
C (*, 3) stores the partials of E(w) w.r.t. a neuron output
C (*, 4) temporary
C WEIGHT: to store information w.r.t. connection weights.
C 2 dimension real array, 1st index over all connection
C weights in the network, the 2nd index are defined as below.
C Note that the 2nd index should be greater than
C (net output dim. + 4).
C (*, 1) contains weight values of all connections.
C (*, 2) stores the partials of E(w) w.r.t. a weight, i.e.,
C the gradtent of E(w) .
C (*, 3) temporary used by STDDES, COMJAC.
C (*, 4) temporary.
C (*, 5+*) storage for the Jacobian matrix.
C UUVV: to store information w.r.t. 2 neurons.
C 3 dimension real array, contains useful derivatives
C (*,*,1) contains partials of the activation function of a
C neuron w.r.t. the output of another neuron in its
C input path.
C (*,*,2) contains partials of the weighted sum function of
C a neuron w.r.t. the weighted sum of another
C neuron in its input path.
C (*,*,3) stores the 2nd order partials of E(w) w.r.t. two
C weighted sum functions v(k1,j1) and v(k1,j2) with
C k1>=k2.
C (*,*,4) temporary
C HESIAN: the Hessian matrix of E(w).
C 3 dimensional real array. 1st index (row) and 2nd index
C forms the indices of the Hessian matrix. 3 storage space
95
C
C
C INDIM:
C OUTDIM:
C NUMEX:
C NUMTEX:
C NUMUNI:
C WTDIM:
C
C SEED:
C
C
are used.
input dimension.
output dimension.
number of learning examples.
number of testing examples.
number of units in the net, including input and bias nodes.
number of weights in the net, dimension of the weight
vector.
SEED of pseudo-random number generator.
C***********************************************************************
C Declaration
C
INTEGER LAYER(-1:11,3)
DOUBLE PRECISION INEX(35, 800), OUTEX(4,800), TESTEX(35,800),
+ WEIGHT(0:199,10), NEURON(0:49,4),
+ UUVV(0:49,0:49, 4), HESIAN(0:199,0:199,3)
INTEGER INDIM, OUTDIM, NUMEX, NUMTEX, WTDIM, SEED, METHOD
C
DOUBLE PRECISION TOLERA, STPSIZ
C
C The following variables are used to pass the dimension indices
C when calling subroutines, their values should be the same as the
C dimensions numbers specified in the above declarations.
C
C
INTEGER LAIDX1, LAIDX2, WTIDX1, WTIDX2, IEIDX1, IEIDX2,
+ OEIDX1, OEIDX2, TEIDX1, TEIDX2, NUIDX1, NUIDX2,
+ UUIDX1, UUIDX2, UUIDX3
LAIDX1=11
LAIDX2=3
IEIDX1=35
IEIDX2=800
OEIDX1=4
OEIDX2=800
TEIDX1=35
TEIDX2=800
WTIDX1=199
WTIDX2=10
NUIDX1=49
NUIDX2=4
UUIDX1=49
UUIDX2=49
UUIDX3=4
C setup internal network representation
CALL SETNET(LAYER, LAIDX1, LAIDX2, SEED, TOLERA, METHOD)
INDIM=LAYER(-1,2)
OUTDIM=LAYER(-1,3)
K=LAYER (-1, 1)
WTDIM=LAYER(K,3)+LAYER(K,1)*(LAYER(K-1,1)+1)
C setup parameters
CALL SETPAR ( )
C
C read in training examples
CALL SETEX(INEX, IEIDX1, IEIDX2, OUTEX, OEIDX1, OEIDX2,
+ INDIM, OUTDIM, NUMEX)
C
C check net structure and examples
CALL PRTNET(LAYER, LAIDX1, LAIDX2)
96
C CALL PRTEX(INEX, IEIDX1, IEIDX2, OUTEX, OEIDX1, OEIDX2,
C + INDIM, OUTDIM, NUMEX)
C
C initialize all connection weights
C read in an integer SEED for generating pseudo-random number
C PRINT *, 'Input an integer for the random SEED'
C READ *, SEED
CALL INITWT(LAYER, LAIDX1, LAIDX2, WEIGHT, WTIDX1, SEED)
C
C check connection weights
CALL PRTWT(LAYER, LAIDX1, LAIDX2, WEIGHT,WTIDX1)
C
C read in stopping criterion
C PRINT *, 'Input RMS tolerance below'
C READ *, TOLERA
C
C choose a learning algorithm
GO TO 18
16 WRITE (*, 17)
17 FORMAT(/, 'Choose a training method:' ,/,6X,
+ 'l=steepest descent method',/,6X,
+ '2=Gauss-Newton method-(I)' ,/,6X,
+ '3=Damped Newton method-(I) '/,6X,
+ '4=Gauss-Newton method-(D)' ,/,6X,
+ '5=Damped Newton method-(D) ')
READ (*,*) METHOD
C
18 IF (METHOD .EQ. 1) THEN
C The following parameter is not adjustable.
S