NEURAL NETWORK LEARNING ALGORITHMS
BASED ON LIMITED MEMORY
QUASI-NEWTON METHODS
By
YIJUN HUANG
Bachelor of Science
Shanghai Institute of Education
Shanghai, China
1987
Submitted to the Faculty of the
Graduate College of the
Oklahoma State University
in partial fulfillment of
the requirements for
the Degree of
MASTER OF SCIENCE
July, 1997
NEURAL NETWORK LEARNING ALGORITHMS
BASED ON LIMITED MEMORY
QUASI-NEWTON METHODS
Thesis Approved:
Thesis Adviser
Dean of the Graduate College
ACKNOWLEDGMENTS
I wish to express my deep appreciation to my thesis advisor Dr. J. P. Chandler, for
his guidance in choosing the topic of this thesis, his skillful supervision, constructive
suggestions and unceasing encouragement in solving problems that I have encountered
throughout my work.
I extend my sincere gratitude to my other committee members Dr. G. E. Hedrick
and Dr. H. Lu as well. Their patience and guidance in reviewing this thesis were
invaluable.
I am forever indebted to my husband Debao Chen and our daughter Xuejing Chen,
whose wholehearted support made this work possible.
Throughout my life, my mother Zhuang Li had given me immeasurable love and
encouragement. She and my father were extremely proud of my entrance into graduate
school and both hoped to attend my Hooding Convocation. Unfortunately, my mother
passed away last November at age eighty-four and consequently was unable to witness
the completion of my degree.
To her memory I dedicate this thesis.
Chapter
TABLE OF CONTENTS
Page
1. INTRODUCTION 1
2. NEURAL NETWORKS 4
2.1 Feedforward Neural Networks 4
2.2 Problem Formulation 6
3. UNCONSTRAINED OPTIMIZATION 8
3.1 General Optimization 8
3.2 Gradient Methods 9
3.3 Quasi-Newton Methods 12
4. LIMITED MEMORY QUASI-NEWTON METHODS 14
5. COMPARISONS AMONG L-BFGS, PRAXIS AND DFMCG 21
5.1 Comparisons on Various Test Functions 22
5.2 Test Using Osborne Functions 30
5.3 Testing L-BFGS Using Different Numbers of Corrections 33
5.4 Lack of Quadratic Termination Property 34
6. TESTING 36
6.1 Some Aspects of Proben1 36
6.2 Language Implementation 42
6.3 Test Problems 44
6.4 Test Results 45
6.5 Test on XOR Problem 50
7. CONCLUSIONS 52
REFERENCES 53
APPENDIXES 57
APPENDIX A: Program List for DRIVEN.F 57
APPENDIX B: Program List for DRIVEF.F 68
APPENDIX C: Program List for MLBFGS.F 80
Table
LIST OF TABLES
Page
TABLE 5.1 RESULTS OF PRAXIS PROGRAM 23
TABLE 5.2 RESULTS OF DFMCG PROGRAM 23
TABLE 5.3 RESULTS OF L-BFGS PROGRAM 23
TABLE 5.4 ORIGINAL RESULT OF OSBORNE FUNCTION 1 31
TABLE 5.5 L-BFGS RESULT OF OSBORNE FUNCTION 1 31
TABLE 5.6 PRAXIS RESULT OF OSBORNE FUNCTION 1 31
TABLE 5.7 ORIGINAL RESULT OF OSBORNE FUNCTION 2 32
TABLE 5.8 L-BFGS RESULT OF OSBORNE FUNCTION 2 32
TABLE 5.9 PRAXIS RESULT OF OSBORNE FUNCTION 2 32
TABLE 5.10 RESULTS OF OSBORNE FUNCTION 2 FOR DIFFERENT M 33
TABLE 5.11 RESULTS OF A QUADRATIC FUNCTION 35
TABLE 6.1 TEST PROBLEMS 44
TABLE 6.2 RESULTS IN PROBEN1 45
TABLE 6.3 RESULTS OF BUILDING2 (M = 5) 47
TABLE 6.4 RESULTS OF FLARE1 (M = 5) 47
TABLE 6.5 RESULTS OF THYROID1 (M = 5) 47
TABLE 6.6 RESULTS OF XOR PROBLEM WITH 2-4-1 NETWORK 50
TABLE 6.7 RESULTS OF XOR PROBLEM WITH 2-2-1 NETWORK 51
Figure
LIST OF FIGURES
Page
FIGURE 2.1: ARCHITECTURE OF A THREE-LAYER NEURAL NETWORK 5
1. Introduction
Our brains are huge biological neural networks made up of individual neurons that
are extensively interconnected with many synapses. Today's artificial neural networks
have arisen from attempts to model this biological structure with computer software.
Generally speaking, an artificial neural network is an information processing
system. It consists of a large number of artificial neurons which are interconnected
in order to solve a desired computational task.
Currently, neural networks are used in many fields, such as aerospace,
automotive, electronics, financial, insurance, medical, transportation, and so on. For
details of such applications, the readers are referred to [15,21,23,28]. In general, the most
important and the most successful applications of neural networks can be classified as
function approximation and pattern classification [21].
One important aspect of such research is focused on standard numerical
optimization techniques. In neural network design, one usually uses the sum of squares
of other nonlinear functions as the error function, and the objective of a learning
algorithm for a neural network is to minimize the error function, or objective function.
The minimization of this type of objective function is referred to as nonlinear least
squares, which is a very popular category of optimization problem [44,47]. When the
form of the objective function is known, it is often possible to design more efficient
algorithms. The Gauss-Newton method, modified Gauss-Newton methods, and
Marquardt methods are designed specifically for nonlinear least squares problems [47].
All of those methods are very efficient for some problems. Other useful methods of
general numerical optimization applied to neural networks are Newton methods and
variations of them [33,38,48,50], the conjugate gradient methods [12,18], and the quasi-Newton
methods [20]. In general, the quasi-Newton methods are among the most
efficient known general optimization methods. However, the storage requirements
increase as the square of the number of variables of the objective function. Obviously,
they are not suitable for very large neural networks. Limited memory quasi-Newton
methods, which limit the memory requirements in the process of optimization, have been
discussed in [28,29,35]. Liu and Nocedal proposed an efficient limited memory quasi-Newton
algorithm [29].
The main purpose of this thesis is to design a supervised learning algorithm based
on a limited memory quasi-Newton method to train fully-connected feedforward neural
networks.
To evaluate the efficiency of the limited memory quasi-Newton method, we test
this algorithm on various functions from [5,39]. We have to mention that the primary
purpose of the limited memory quasi-Newton method is to minimize high-dimensional
functions, especially functions with more than one thousand independent variables. The
dimensions of the test functions in [5,39] range from two to twenty. We see that the
limited memory quasi-Newton method is very fast and robust for low-dimensional
functions.
This thesis is organized into seven chapters.
In Chapter 2, we briefly explain fully-connected feedforward neural networks and
formulate the learning algorithm as an optimization problem. We only consider fully-connected
feedforward neural networks in this thesis. Chapter 3 gives a brief review of
unconstrained optimization. We concentrate our discussion on gradient methods,
especially the Newton-like methods. Chapter 4 describes some details of the limited
memory quasi-Newton method, and how it can be used in a neural network learning
algorithm. In Chapter 5, we test the limited memory quasi-Newton algorithm on
various test functions from [5,39]. Chapter 6 explains how to implement and test our
learning algorithm. The conclusions are given in Chapter 7.
In Proben1 [43], Prechelt collected a set of problems for neural network learning
in the realm of pattern classification and function approximation. Proben1 contains 15
data sets from 12 different domains, and all of the data sets consist of real-world data.
We choose some data sets from Proben1 to test our algorithm. The rules and conventions
of Proben1 are followed strictly in our implementation.
2. Neural Networks
The main purpose of this chapter is to formulate the learning problem within the
context of nonlinear optimization.
Artificial neural networks are much simpler than the biological neural networks.
However, there are at least two similarities between them. First, both networks are
highlyinterconnected simple computational devices. Second, the connections between
neurons determine the functions of the network. In the remainder of this thesis, "neural
network" always refers to an artificial neural network or ANN.
In the literature, neural network architectures are classified into three basic
categories: feedforward, feedback, and self-organizing neural networks. Although
there are some essential differences among these categories, the common characteristic
of neural networks is an ability to learn. Learning is the process by which a neural system
acquires the ability to carry out certain tasks by adjusting its internal parameters
according to some learning scheme [24].
2.1 Feedforward Neural Networks
In this thesis, we concentrate on one particular neural network category: fully-connected,
feedforward neural networks.
A feedforward neural network can be viewed as a system transforming a set of
input patterns into a set of output patterns. It consists of an input layer, one or more
hidden layers, and an output layer of neurons. Layers are connected by sets of weights. A
neuron in any layer, except for the input layer, of the network is connected to all the
neurons in the previous layer. A neural network connected in this way is referred to as
fullyconnected. The input signal propagates through the network in a forward direction,
from left to right, on a layerbylayer basis. Such neural architectures are called feedforward
since the output of each layer feeds the next layer of units. The network can be
trained to provide a desired response to a given input. The training of a feedforward
neural network often requires the existence of a set of input and output patterns, called the
training set. This kind of learning is called supervised learning [24].
[Figure: input (first) layer, second layer, and third (output) layer; the original diagram is not reproduced here.]
Figure 2.1: Architecture of a Three-Layer Neural Network
2.2 Problem Formulation
Figure 2.1 shows a multilayer feedforward neural network consisting of three layers,
each made up of a number of neurons. The network is fully interconnected from one layer to the
next layer and the connections are represented by lines which are characterized by their
weights. Based on the weights of all the input connections, each neuron computes a
weighted sum of all the inputs and evaluates a nonlinear activation function using the sum
as the argument of the function. In our algorithm, we choose the following widely used
sigmoid function as the activation function:

f(x) = 1 / (1 + e^{-x}).
The result of this function evaluation is the output of the neuron. The objective of a
learning algorithm is to find the optimum weights, to minimize the discrepancy between
the outputs of the neural network at the output layer and the desired outputs corresponding to a
set of specific inputs. The input-output patterns used for the learning are termed
"examples" or "exemplars".
We define:
l ∈ [1, L], where L is the number of layers in the network.
n_l ∈ [1, N_l], where N_l is the number of neurons in layer l.
p ∈ [1, P], where P is the number of examples in the data set.
w^l_{nm} as the weight of the m-th input to neuron n in layer l.
o^l_{pn} as the output of neuron n in layer l for example p.
t_{pn} as the desired output of neuron n in layer L for example p.
Based on the above definitions, if we define our objective function as the sum of the
squares of output errors in the output layer (layer L) of a neural network over a set of
examples, then

E(w) = Σ_{p=1}^{P} Σ_{n=1}^{N_L} ( o^L_{pn} − t_{pn} )².
In the above optimization model, the learning corresponds to minimizing E(w) with
regard to the weight vector w. Thus, the training of a feedforward neural network simply
turns into a numerical optimization problem.
In our actual implementation, we use the squared error percentage which differs
by a constant factor from the above error function. We explain the reason that we choose
the squared error percentage in Chapter 6. However, this slight difference is not essential
from the viewpoint of neural network training.
Our training is batch rather than incremental; that is, the weights are updated only
after the entire training set has been presented.
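The formulation above can be made concrete with a short sketch. The following Python code (an illustration only; the thesis's actual implementation is in FORTRAN) computes the forward pass with sigmoid activations and the batch sum-of-squares error E(w); the tiny 2-2-1 network, its weights, and the bias convention are made-up assumptions:

```python
import math

def sigmoid(x):
    # activation function f(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, example):
    """Propagate one input pattern through a fully-connected feedforward net.

    weights[l][n][m] is the weight of the m-th input to neuron n of layer l;
    the last entry of each neuron's list plays the role of a bias term
    (an assumption made for this sketch).
    """
    outputs = example
    for layer in weights:
        outputs = [sigmoid(sum(w * o for w, o in zip(neuron[:-1], outputs)) + neuron[-1])
                   for neuron in layer]
    return outputs

def batch_error(weights, examples, targets):
    # E(w): sum of squared output errors over the whole training set (batch mode)
    total = 0.0
    for x, t in zip(examples, targets):
        o = forward(weights, x)
        total += sum((oi - ti) ** 2 for oi, ti in zip(o, t))
    return total

# a hypothetical 2-2-1 network on the XOR patterns (weights chosen arbitrarily)
w = [[[0.5, -0.4, 0.1], [-0.3, 0.8, 0.0]],   # hidden layer: 2 neurons, 2 inputs + bias
     [[1.0, -1.0, 0.2]]]                      # output layer: 1 neuron, 2 inputs + bias
xs = [[0, 0], [0, 1], [1, 0], [1, 1]]
ts = [[0], [1], [1], [0]]
print(batch_error(w, xs, ts))
```

A learning algorithm would then adjust the entries of `w` so as to drive `batch_error` toward its minimum.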
3. Unconstrained Optimization
3.1 General Optimization
The nonlinear optimization problem can be stated as follows:
given a set D ⊆ R^n, and given a real function f: D → R, find

min { f(x) : x ∈ D }

and the vector x* ∈ D where the minimum is achieved. Here f refers to the objective
function, and D is called the feasible region. If D = R^n, the optimization problem is said
to be unconstrained. In this case, nothing else needs to be said about how to specify D,
since every vector in R^n is feasible [12,31]. The neural network learning algorithm can
be treated within the context of nonlinear unconstrained optimization problems. This
chapter gives a brief review of unconstrained optimization. Despite the diversity of both
algorithms and problems, all of the algorithms that we discuss in any detail in this chapter
and in Chapter 4 are iterative processes which fit into the same general framework:
General Optimization Algorithm:
Specify some initial guess for the solution vector x_0.
For k = 0, 1, 2, ...
If x_k is optimal, stop.
Determine an improved estimate of the solution: x_{k+1} = x_k + α_k p_k.
Here the vector p_k represents a search direction and the positive scalar α_k is a step length
that determines the point x_{k+1}. For our purpose, the word "optimize" means finding the
value of x that minimizes f. For an unconstrained problem of this form, we require that
the search direction p_k be a descent direction for the function f at the point x_k. This
means that for a small enough step taken along p_k the function value is guaranteed to
decrease; in other words,

f(x_k + α_k p_k) < f(x_k)   for 0 < α_k < ε,

for some ε > 0. With p_k available, we would ideally like to determine the step length
α_k so as to minimize the function in that direction:

min_{α_k > 0} f(x_k + α_k p_k).

This is a problem involving only one variable, the parameter α_k. The restriction α_k > 0
is imposed because p_k is a descent direction. The calculation of α_k is called a line
search, since it corresponds to a search along the line x_k + α_k p_k defined by α_k.
However, it is not always best to minimize f(x_k + α_k p_k) with regard to α_k. We
discuss the line search in Chapter 4. Here we first discuss how to choose the search
direction p_k, although in optimization algorithms the choices of p_k and α_k cannot be
separated in general.
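The general framework of this section can be sketched in a few lines of Python (illustrative only; the quadratic test function and the crude halving rule for α_k below are made-up stand-ins for the line searches discussed later):

```python
def descent(f, grad, x0, steps=100, tol=1e-10):
    """Generic iteration x_{k+1} = x_k + alpha_k * p_k with p_k = -g_k
    (steepest descent) and a crude backtracking step length."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < tol:      # "x_k is optimal, stop"
            break
        p = [-gi for gi in g]                          # a descent direction
        alpha, fx = 1.0, f(x)
        # shrink alpha until the function value actually decreases
        while f([xi + alpha * pi for xi, pi in zip(x, p)]) >= fx and alpha > 1e-16:
            alpha *= 0.5
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
    return x

# minimize f(x) = (x1 - 3)^2 + 2*(x2 + 1)^2, whose minimum is at (3, -1)
f = lambda x: (x[0] - 3) ** 2 + 2 * (x[1] + 1) ** 2
grad = lambda x: [2 * (x[0] - 3), 4 * (x[1] + 1)]
xmin = descent(f, grad, [0.0, 0.0], steps=500)
```

The choice of p_k distinguishes the methods of the next sections; the loop itself stays the same.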
3.2 Gradient Methods
There are various methods to determine the direction vector p_k. The most
popular methods are gradient methods, which use first, and sometimes second,
derivatives of the objective function to compute p_k. The derivatives may be available
analytically or perhaps are approximated in some way. When we discuss their properties
in the following, we assume that the objective function has continuous second
derivatives, whether or not these are explicitly available. As we mentioned in the previous
chapter, we choose the sigmoid function as the activation function, so that the error
function chosen in our training algorithms is infinitely differentiable, and the assumption
of differentiability is always satisfied. In this thesis we design a learning algorithm based
on a limited memory quasi-Newton method, so we concentrate on Newton-like methods
in this chapter. For other optimization methods the readers are referred to any good book
on nonlinear optimization, for example [12,47].
For convenience of further discussion, we first give some terminology and
notation. Let f be a smooth, nonlinear function from R^n to R. The gradient of f is
defined as:

g(x) = ∇f(x) = ( ∂f(x)/∂x_1, ∂f(x)/∂x_2, ..., ∂f(x)/∂x_n )^T.

The Hessian matrix of f is defined as:

B(x) = ∇²f(x) = [ ∂²f(x) / ∂x_i ∂x_j ],   i, j = 1, ..., n.

By definition, a matrix A is positive definite if

x^T A x > 0

for any vector x ∈ R^n, x ≠ 0. We require that the Hessian matrix B be positive definite at
the minimum of f. Then the inverse of the Hessian matrix exists. We denote the inverse
of the Hessian matrix of f as H.
One may choose a search direction in which the objective function decreases
fastest. For this purpose we choose p_k = −g_k, which is the opposite direction of the
gradient vector. The corresponding method is referred to as the steepest descent method.
It seems that the steepest descent method is not a very good method for neural network
learning, since it converges too slowly. Wang [51] compared his damped learning
algorithm and the steepest descent method. Another popular method is Newton's
method, in which one chooses the search direction as

p_k = −H_k g_k,                                                    (3.1)

where H_k = H(x_k), g_k = g(x_k), and H is the inverse Hessian matrix. Newton's method is
not always feasible, since the inverse Hessian matrix may not exist. Even if H_k is
invertible, it may not be positive definite, and hence p_k may not be a descent direction.
To avoid computing any second derivatives of f, the Gauss-Newton method and the
Levenberg-Marquardt method [27,30] were introduced. These methods need a particular
form of the objective function; that is, the objective function must be a sum of squares of some
nonlinear functions. As we mentioned before, one often chooses such functions as the
error function in neural network training; hence, the Levenberg-Marquardt method is
usually considered as a good way to train neural networks [22,51]. However, we do not
discuss these methods in detail in this thesis; their storage requirements are too great for
very large networks.
3.3 Quasi-Newton Methods
Let us return to the general form of Newton's method, with the search vector
calculated at each iteration as in (3.1):

p_k = −H_k g_k.

Here H_k is the precise inverse of the Hessian matrix at x_k. As was mentioned
previously, H_k may not exist or may not be positive definite. To avoid such problems
we use the BFGS update formula [47] to replace the precise inverse of the Hessian matrix with
a positive definite matrix H_k, which is in some way an approximation of the inverse
Hessian matrix.
Let

s_k = x_{k+1} − x_k,   y_k = g_{k+1} − g_k.

We define

H_{k+1} = (I − s_k y_k^T / y_k^T s_k) H_k (I − y_k s_k^T / y_k^T s_k) + s_k s_k^T / y_k^T s_k.   (3.2)

Let

ρ_k = 1 / (y_k^T s_k),   V_k = I − ρ_k y_k s_k^T.

Then (3.2) can be expressed as

H_{k+1} = V_k^T H_k V_k + ρ_k s_k s_k^T,

where we store each H_k explicitly. It is easy to verify that if H_k is positive definite and
y_k^T s_k > 0, then H_{k+1} is positive definite also. We assume that y_k^T s_k > 0 for all k. In
gradient-related methods this can always be done, provided the line search is sufficiently
accurate. The search direction p_k is obtained from the matrix-vector product:

p_k = −H_k g_k.
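One dense BFGS update of the inverse-Hessian approximation can be sketched in Python (an illustration, not the thesis's FORTRAN implementation; the 2x2 data at the bottom are made up):

```python
def bfgs_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation:
    H_new = (I - s y^T / y^T s) H (I - y s^T / y^T s) + s s^T / y^T s,
    where H is a dense n x n list of lists, s = x_{k+1} - x_k,
    and y = g_{k+1} - g_k. Requires the curvature condition y^T s > 0."""
    n = len(s)
    ys = sum(yi * si for yi, si in zip(y, s))
    assert ys > 0, "curvature condition y^T s > 0 violated"
    rho = 1.0 / ys
    # V = I - rho * y s^T
    V = [[(1.0 if i == j else 0.0) - rho * y[i] * s[j] for j in range(n)]
         for i in range(n)]
    # H_new = V^T H V + rho * s s^T
    HV = [[sum(H[i][k] * V[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return [[sum(V[k][i] * HV[k][j] for k in range(n)) + rho * s[i] * s[j]
             for j in range(n)] for i in range(n)]

# hypothetical data for n = 2, starting from H_0 = I
H0 = [[1.0, 0.0], [0.0, 1.0]]
s = [0.5, -0.25]
y = [1.0, 0.5]
H1 = bfgs_update(H0, s, y)
```

By construction the updated matrix satisfies the secant condition H_{k+1} y_k = s_k and stays symmetric, which is what makes it a usable stand-in for the true inverse Hessian.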
The quasi-Newton methods use an iterative process to approximate the inverse
Hessian matrix, so that no explicit expression for the second derivatives is needed for
carrying out a Newton-like search. Although the quasi-Newton algorithms require
slightly more operations to calculate an iterate, and they require somewhat more storage
than the conjugate gradient algorithms do, in almost all cases these additional costs
are outweighed by the advantage of superior speed of convergence.
At first glance, quasi-Newton methods may seem unsuitable for large problems
because the approximate inverse Hessian matrices are generally dense. In the next
chapter, we discuss ways to cut down on storage in order to create limited memory quasi-Newton
methods for large problems.
4. Limited Memory Quasi-Newton Methods
In Chapter 3 we gave a brief review of unconstrained optimization. In particular,
we discussed a quasi-Newton method. When we use a quasi-Newton method, we
construct a sequence of matrices which in some way approximate the inverse Hessian
matrix, instead of storing the precise inverse Hessian matrix at each iteration. In such a
way we avoid using the second derivatives of the objective function, so that we save a
substantial fraction of the computing time. However, it is necessary to have O(n²) storage
locations for each H_k. In neural network training, one may
need to use a large number of weights; sometimes the network has many thousands of
weights. In such a case, storage becomes an issue, since it will be impossible to retain the
matrix in the high-speed storage of a computer.
In this chapter, we describe an algorithm which uses a limited amount of storage
and in which the quasi-Newton matrix is updated continually. At every step the oldest
information contained in the matrix is discarded and replaced by the newest information.
Recall that the BFGS update of H is:

H_{k+1} = V_k^T H_k V_k + ρ_k s_k s_k^T,

where

ρ_k = 1 / (y_k^T s_k),   V_k = I − ρ_k y_k s_k^T.
Let H_0 be a given positive definite matrix. Then the above BFGS update gives:

H_1 = V_0^T H_0 V_0 + ρ_0 s_0 s_0^T,

H_2 = V_1^T H_1 V_1 + ρ_1 s_1 s_1^T
    = V_1^T V_0^T H_0 V_0 V_1 + V_1^T ρ_0 s_0 s_0^T V_1 + ρ_1 s_1 s_1^T,

and in general

H_{k+1} = (V_k^T ··· V_0^T) H_0 (V_0 ··· V_k)
        + ···
        + V_k^T V_{k-1}^T ρ_{k-2} s_{k-2} s_{k-2}^T V_{k-1} V_k
        + V_k^T ρ_{k-1} s_{k-1} s_{k-1}^T V_k
        + ρ_k s_k s_k^T.

Instead of forming H_k explicitly, we now store the previous values of y_j and s_j
separately. Here m is a given integer that represents the maximum number of corrections
that can be stored. Normally we choose 3 ≤ m ≤ 7.
The following algorithm was given by Liu and Nocedal [28].
L-BFGS Algorithm:

Step 1. Choose x_0, m, 0 < β′ < 1/2, β′ < β < 1, and a symmetric and positive
definite starting matrix H_0 (normally we choose a scaled diagonal matrix or I as
H_0). Set k = 0.

Step 2. Compute

p_k = −H_k g_k,
x_{k+1} = x_k + α_k p_k,

where α_k satisfies the Wolfe conditions [52]:

f(x_k + α_k p_k) ≤ f(x_k) + β′ α_k g_k^T p_k,
| g(x_k + α_k p_k)^T p_k | ≤ β | g_k^T p_k |.

(We always try the step length α_k = 1 first.)

Step 3. Let m̂ = min(k, m−1). Update H_0 m̂+1 times using the pairs
(y_j, s_j), j = k−m̂, ..., k; i.e., let

H_{k+1} = (V_k^T ··· V_{k−m̂}^T) H_0 (V_{k−m̂} ··· V_k)
        + ρ_{k−m̂} (V_k^T ··· V_{k−m̂+1}^T) s_{k−m̂} s_{k−m̂}^T (V_{k−m̂+1} ··· V_k)
        + ···
        + ρ_k s_k s_k^T.

(We do not calculate and store the H_k in this step; instead, we use the above
formula to calculate the direction vector p_{k+1} = −H_{k+1} g_{k+1} directly.)

Step 4. Set k := k + 1, go to Step 2.
The stopping criterion of L-BFGS is:

‖g_k‖ ≤ ε · max(1, ‖x_k‖),

where ε is a small positive number supplied by the user.
The L-BFGS algorithm is almost identical in its implementation to the well-known
BFGS method. The only differences are that the amount of storage required by
the algorithm (and thus the cost of an iteration) can be controlled by the user, and that the later
approximations H_k to the inverse Hessian deviate from those of the BFGS method. The user
specifies the number m of BFGS corrections that are to be kept, and provides a sparse
symmetric and positive definite matrix H_0 which approximates the inverse Hessian of f.
During the first m iterations the method is the same as the BFGS method. When the
available storage is used up, i.e., k > m, since the BFGS corrections are stored separately,
we can delete the oldest one to make space for the new one. All subsequent iterations are
of this form: one correction is deleted and a new one inserted. Hence, it requires only
O(mn) storage locations (m << n), in contrast with the usual BFGS algorithm, which requires
O(n²) storage locations, as we discussed before. If there is no previous information, one
can choose the identity matrix I or a scaled diagonal matrix as H_0.
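The storage policy just described (delete the oldest correction, insert the newest) can be sketched with a bounded deque in Python. This illustrates only the bookkeeping, not Nocedal's FORTRAN code; the iterates fed in at the bottom are fake data:

```python
from collections import deque

class CorrectionStore:
    """Keeps at most m correction pairs (s_j, y_j): O(mn) storage in total."""
    def __init__(self, m):
        self.pairs = deque(maxlen=m)   # appending to a full deque drops the oldest

    def update(self, x_old, x_new, g_old, g_new):
        s = [a - b for a, b in zip(x_new, x_old)]   # s_k = x_{k+1} - x_k
        y = [a - b for a, b in zip(g_new, g_old)]   # y_k = g_{k+1} - g_k
        ys = sum(yi * si for yi, si in zip(y, s))
        if ys > 1e-12:          # keep only pairs satisfying y^T s > 0
            self.pairs.append((s, y))

store = CorrectionStore(m=3)
for k in range(5):   # five fake iterates: more updates than the store can hold
    x_old, x_new = [float(k), 0.0], [float(k + 1), 1.0]
    g_old, g_new = [1.0, float(k)], [0.5, float(k + 1)]
    store.update(x_old, x_new, g_old, g_new)
```

After five updates only the three newest pairs remain, mirroring the "delete one, insert one" behavior of the algorithm once k > m.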
It is also known that simple scaling of the variables can improve the performance
of quasi-Newton methods on small problems. For large problems, scaling becomes much
more important. Several scaling methods for the matrix H_0 were introduced in [28]. We
use the following strategy.
In Step 3 of the L-BFGS algorithm, instead of using a fixed H_0, we use H_0^{(k)},
which is a scaled version of the identity matrix I:

H_0^{(k)} = γ_k I,

where

γ_0 = 1 / ‖g_0‖_2,
γ_k = y_{k−1}^T s_{k−1} / ‖y_{k−1}‖_2²,   k = 1, 2, 3, ....

It was shown in [28] that this is a simple and effective way of introducing a scale into the
algorithm.
Since we do not store H_k explicitly, the product H·g must be computed indirectly. The
following recursion performs this computation efficiently [35]. It is essentially the same
as the formula for the usual BFGS method.
In the following algorithm, m is the number of corrections stored, and Iter is the
iteration number.
Recursive Formula to Compute H·g:

1) If Iter ≤ m, set Incr = 0 and Bound = Iter;
   else set Incr = Iter − m and Bound = m.
2) q_Bound = g_Iter.
3) For i = (Bound − 1), ..., 0:
   j = i + Incr;
   α_i = ρ_j s_j^T q_{i+1};   (store α_i)
   q_i = q_{i+1} − α_i y_j.
4) r_0 = H_0 q_0.
5) For i = 0, 1, ..., (Bound − 1):
   j = i + Incr;
   β_i = ρ_j y_j^T r_i;
   r_{i+1} = r_i + s_j (α_i − β_i).
Then H·g = r_Bound.

This formula requires at most 4nm + 2m + n multiplications and 4nm + m additions.
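This recursion (the "two-loop" form) can be sketched in Python as follows; the scaled initial matrix H_0 = γI of the previous section is folded in as a single scalar, and the small test data are made up (the thesis's implementation is in FORTRAN):

```python
def two_loop_hg(g, s_list, y_list, gamma=1.0):
    """Compute H*g via the L-BFGS two-loop recursion.

    s_list and y_list hold the stored correction pairs, oldest first.
    H_0 is taken to be gamma * I (the scaling strategy of the text)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    rho = [1.0 / dot(y, s) for s, y in zip(s_list, y_list)]
    q = list(g)
    alpha = [0.0] * len(s_list)
    for i in reversed(range(len(s_list))):      # backward loop: q_i = q_{i+1} - a_i y_i
        alpha[i] = rho[i] * dot(s_list[i], q)
        q = [qj - alpha[i] * yj for qj, yj in zip(q, y_list[i])]
    r = [gamma * qj for qj in q]                # r_0 = H_0 q_0
    for i in range(len(s_list)):                # forward loop: r_{i+1} = r_i + s_i (a_i - b_i)
        beta = rho[i] * dot(y_list[i], r)
        r = [rj + sj * (alpha[i] - beta) for rj, sj in zip(r, s_list[i])]
    return r                                    # r = H g

# with no stored pairs the result is simply gamma * g
g = [1.0, 2.0]
print(two_loop_hg(g, [], []))   # -> [1.0, 2.0]
```

Each loop touches every stored pair once, which is where the 4nm-flavored operation count comes from.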
At the k-th iteration of the L-BFGS algorithm, we need to find a positive scalar
α_k as a step length, to determine a new point x_{k+1} that is a minimum in the direction p_k
or that gives a sufficient reduction in the function value. This process is called a line search,
and it is a univariate problem.
As we mentioned in Section 3.1, it is not always best to minimize f(x_k + α_k p_k)
with regard to α_k. Instead, we find an α_k that satisfies the conditions:

f(x_k + α_k p_k) ≤ f(x_k) + β′ α_k g_k^T p_k,        (4.1)

| g(x_k + α_k p_k)^T p_k | ≤ β | g_k^T p_k |.        (4.2)

We fix x_k and p_k and let

Φ(α) = f(x_k + α p_k),   α ≥ 0.

Then conditions (4.1) and (4.2) can be formulated as finding α > 0 such that

Φ(α) ≤ Φ(0) + β′ Φ′(0) α        (4.3)

and

| Φ′(α) | ≤ β | Φ′(0) |.        (4.4)
The motivation for requiring conditions (4.3) and (4.4), or (4.1) and (4.2), in a line search
method should be clear. If α is not too small, condition (4.3) forces a sufficient decrease
in the function. However, this condition alone is not sufficient to guarantee convergence,
because it allows arbitrarily small choices of α > 0. Condition (4.4) rules out arbitrarily
small choices of α and usually guarantees that α is near a local minimizer of Φ.
Condition (4.4) is a curvature condition because it implies that

Φ′(α) − Φ′(0) ≥ (1 − β) |Φ′(0)|,

and thus the average curvature of Φ on (0, α) is positive. The curvature condition (4.4)
is particularly important in a quasi-Newton method or a limited memory quasi-Newton
method because it guarantees that a positive definite quasi-Newton update is possible
[11,12].
As final motivation for (4.3) and (4.4), we mention that if a step satisfies these
conditions, then the line search method is convergent for reasonable choices of direction
[1,6,11,12,16,28,42]. In particular, in a quasi-Newton method, we choose p_k = −H_k g_k,
and the line search method is convergent if conditions (4.3) and (4.4) are satisfied.
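The two conditions can be checked directly. A small Python helper (illustrative only; the constants β′ = 1e-4 and β = 0.9 are assumed values commonly used in practice, not taken from the thesis) might look like:

```python
def satisfies_wolfe(f, grad, x, p, alpha, beta_prime=1e-4, beta=0.9):
    """Check the two (strong) Wolfe conditions (4.3)/(4.4) for step length
    alpha along direction p from point x."""
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
    phi0, dphi0 = f(x), dot(grad(x), p)          # phi(0) and phi'(0)
    sufficient_decrease = f(x_new) <= phi0 + beta_prime * alpha * dphi0
    curvature = abs(dot(grad(x_new), p)) <= beta * abs(dphi0)
    return sufficient_decrease and curvature

# one-dimensional example: f(x) = x^2, from x = 1 along p = -g = -2;
# alpha = 0.5 lands exactly on the minimizer and satisfies both conditions,
# while a very large or very tiny alpha fails one of them.
f = lambda x: x[0] ** 2
grad = lambda x: [2 * x[0]]
x, p = [1.0], [-2.0]
```

A practical line search, of course, does not merely test candidate steps but constructs one; this helper only makes the acceptance test concrete.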
There are still many choices of α_k that satisfy the conditions (4.3) and (4.4). Moré
and Thuente [32] designed a very efficient line search algorithm which has been used by
several authors, for example, Liu and Nocedal [28], O'Leary [38], and Gilbert and
Nocedal [16]. The main idea is to combine quadratic and cubic
interpolations to find a suitable α_k. For the details of this search procedure and the
associated convergence theory, the readers are referred to [32].
We designed and implemented our neural network learning algorithm using Liu and
Nocedal's L-BFGS algorithm. Moré and Thuente's line search method is also used. Our
program works fine. Later on, we found Nocedal's L-BFGS FORTRAN program on the
Internet [36], which works even better than ours. We modified our program; in the
current version, we treat Nocedal's L-BFGS code as a subroutine and simply
call it in our neural network learning process.
5. Comparisons among L-BFGS, PRAXIS and DFMCG
The main purpose of this thesis is to design and implement neural network
learning algorithms based on limited memory quasi-Newton methods. In this chapter we
compare the limited memory quasi-Newton method with Brent's optimization method
[5]. We also compare it with a conjugate
gradient method [14,31].
We have to mention that the primary purpose of the limited memory quasi-Newton
method is to minimize high-dimensional functions, especially functions with
more than one thousand independent variables. The main concern is the storage. The
dimensions of the test functions we use in this chapter range from two to twenty. For
such low-dimensional functions, storage is not a problem at all. However, the main
difference between the L-BFGS algorithm and the BFGS algorithm is how they store the
inverse Hessian matrices in Step 3; the remaining parts of the two algorithms are essentially the same.
In fact, if the correction number m is sufficiently large in L-BFGS, then L-BFGS is
identical to BFGS. Hence, we can still see how good the algorithms are.
Powell [42] introduced an optimization algorithm that does not use the derivatives of
the objective function. Brent [5] modified Powell's method and overcame some of the
difficulties observed in the literature. Numerical tests suggested that Brent's proposed
method is faster than Powell's original method and some other previous methods [5].
Brent [5] gave the ALGOL procedure PRAXIS to implement his algorithm. Chandler [7]
gave the FORTRAN procedure PRAXIS, a direct translation of Brent's procedure. The
conjugate gradient method [14,31] is another method to solve large optimization
problems. The storage complexity of the conjugate gradient method is O(n), where n is
the number of variables of the cost function. We have tested a conjugate gradient
FORTRAN procedure, DFMCG, from the IBM Scientific Subroutine Package [53].
In this chapter we mainly compare L-BFGS and PRAXIS on speed and accuracy.
We also roughly compare L-BFGS and DFMCG. In Section 5.1 we compare these three
algorithms on various test functions used in [5]. Section 5.2 summarizes the results of L-BFGS
running on Osborne's functions [39]. In Section 5.3 we run L-BFGS on Osborne
function 2 with different numbers of corrections m. Section 5.4 verifies that L-BFGS
does not have the quadratic termination property.
Recall that the stopping criterion of L-BFGS is

‖g_k‖ ≤ ε · max(1, ‖x_k‖).

Throughout this chapter, we choose ε = 10⁻⁷, unless we mention otherwise.
5.1 Comparisons on Various Test Functions
Most of the functions tested in [5] are actually differentiable. We ran Chandler's
PRAXIS FORTRAN program [7], the DFMCG program [53], and Nocedal's L-BFGS
program [36] on the UNIX system of the Oklahoma State University Computer Science
Department, using the same test functions as in [5], and compared the results. The results of
the PRAXIS program are listed in Table 5.1, the results of the DFMCG program in
Table 5.2, and the results of the L-BFGS program in Table 5.3.
Table 5.1 Results of PRAXIS Program

Function    n   x_0               nf     f(x)
Rosenbrock  2   (-1.2, 1)         155    2.012E-24
Singular    4   (3, -1, 0, 1)     421    5.476E-19
Helix       3   (0.01, 0.01, 0)   201    2.998E-24
Helix       3   (-1, 0, 0)        200    8.886E-25
Cube        2   (-1.2, -1)        234    1.599E-25
Beale       2   (0.1, 0.1)        80     1.595E-25
Watson      9   0^T               1869   1.400E-06
Powell      3   (0, 1, 2)         86     0.00E+00
Wood        4   (-3, -1, -3, -1)  487    2.846E-21
Hilbert     10  (1, ..., 1)       2417   7.602E-17
Tridiag     20  0^T               941    -2.00E+1
Box         3   (0, 10, 20)       154    4.173E-25
Table 5.2 Results of DFMCG Program

Function    n   x_0               nf     f(x)
Rosenbrock  2   (-1.2, 1)         73     4.123E-27
Singular    4   (3, -1, 0, 1)     305    2.198E-18
Helix       3   (0.01, 0.01, 0)   47     5.518E-29
Cube        2   (-1.2, -1)        80     6.024E-27
Beale       2   (0.1, 0.1)        62     5.493E-1
Watson      9   0^T               713    3.479E+0
Powell      3   (0, 1, 2)         59     0.0E+0
Wood        4   (-3, -1, -3, -1)  67     7.68E-26
Hilbert     10  (1, ..., 1)       2688   5.715E-2
Tridiag     20  0^T               111    -2.0E+1
Box         3   (0, 10, 20)       121    3.579E+0
Table 5.3 Results of L-BFGS Program

Function    n   x_0               nf     (n+1)nf   f(x)
Rosenbrock  2   (-1.2, 1)         49     147       1.947E-25
Singular    4   (3, -1, 0, 1)     76     380       7.614E-16
Helix       3   (0.01, 0.01, 0)   23     92        3.276E-19
Cube        2   (-1.2, -1)        64     192       9.917E-16
Beale       2   (0.1, 0.1)        16     48        9.953E-17
Watson      9   0^T               1991   19910     6.527E-6
Powell      3   (0, 1, 2)         20     80        1.110E-15
Wood        4   (-3, -1, -3, -1)  122    610       2.053E-20
Hilbert     10  (1, ..., 1)       109    1199      1.236E-12
Tridiag     20  0^T               98     2058      -2.00E+1
Box         3   (0, 10, 20)       41     164       4.508E-14
In the above tables we use the following conventions:
n is the number of variables.
x_0 is the starting point.
nf is the number of function evaluations.
f(x) is the approximate minimum found.
In the PRAXIS program we do not need to calculate derivatives, while in L-BFGS
the gradient must be calculated. In order to compare L-BFGS and PRAXIS, a
proper weighting factor [5] must be applied to the number of function evaluations in the L-BFGS
program. As in [5], we define:

the weighted number of function evaluations = (n + 1) · nf.

Note that the convergence criteria for PRAXIS are generally tighter than for L-BFGS,
resulting in "better" minima in most cases, although not in all. It is not possible to
use the convergence criterion of L-BFGS in PRAXIS, since PRAXIS does not have the gradient
available.
The following are brief descriptions of each function and a comparison of L-BFGS
and PRAXIS on each function. As in [5], to compare the speeds of the two
programs, we simply compare nf in Table 5.1 and (n + 1) · nf in Table 5.3.
1. Rosenbrock (Rosenbrock [45]):
f(x) = 100(x2 - x1^2)^2 + (1 - x1)^2.
This is a well-known function with a parabolic valley. Descent methods tend to fall into the valley and then follow it around to the minimum of 0 at (1, 1)^T.
The two programs perform similarly in both speed and accuracy.
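As an illustration (a Python sketch of our own, not the FORTRAN 77 code used in this thesis), Rosenbrock's function and its gradient can be written as follows:

```python
# Illustrative sketch: Rosenbrock's function f(x) = 100*(x2 - x1^2)^2 + (1 - x1)^2
# and its gradient, obtained by differentiating f directly.
def rosenbrock(x1, x2):
    return 100.0 * (x2 - x1**2)**2 + (1.0 - x1)**2

def rosenbrock_grad(x1, x2):
    g1 = -400.0 * x1 * (x2 - x1**2) - 2.0 * (1.0 - x1)
    g2 = 200.0 * (x2 - x1**2)
    return g1, g2

# At the minimizer (1, 1) both the function and the gradient vanish.
print(rosenbrock(1.0, 1.0))  # 0.0
```

Gradient-based methods such as LBFGS require both routines; PRAXIS needs only the first.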
2. Singular (Powell [40]):
f(x) = (x1 + 10 x2)^2 + 5(x3 - x4)^2 + (x2 - 2 x3)^4 + 10(x1 - x4)^4.
This function is difficult to minimize and provides a severe test of the stopping criterion, because the Hessian matrix at the minimum (x = 0) is doubly singular. The function varies very slowly near 0 in the two-dimensional subspace {(10λ1, -λ1, λ2, λ2)^T}. For this function, PRAXIS is slightly slower, but slightly more accurate, than LBFGS.
3. Helix (Fletcher and Powell [13]):
f(x) = 100{ [x3 - 10 θ(x1, x2)]^2 + [ (x1^2 + x2^2)^(1/2) - 1 ]^2 } + x3^2,
where
2π θ(x1, x2) = arctan(x2 / x1)     if x1 > 0,
2π θ(x1, x2) = π + arctan(x2 / x1) if x1 < 0.
This function of three variables has a helical valley and a minimum at (1, 0, 0)^T.
Originally, Brent used (-1, 0, 0)^T as the starting point. However, since this function is not differentiable when x1 = 0, LBFGS does not work for this function when we use that starting point. Instead, we use (0.01, 0.01, 0)^T as the starting point in both programs.
For this function, the situation is similar to that for the Singular function: PRAXIS is slightly slower than LBFGS, but more accurate.
4. Cube (Leon [26]):
f(x) = 100(x2 - x1^3)^2 + (1 - x1)^2.
This function is similar to Rosenbrock's, and much the same remarks apply. Here the valley follows the curve x2 = x1^3.
For this function, the situation is also similar to that for the Singular function: PRAXIS is slightly slower than LBFGS, but more accurate.
5. Beale (Beale [2]):
f(x) = sum_{i=1}^{3} [ c_i - x1 (1 - x2^i) ]^2,
where c1 = 1.5, c2 = 2.25, c3 = 2.625. This function has a valley approaching the line x2 = 1, and has a minimum of 0 at (3, 1/2)^T.
For this function, PRAXIS is slightly slower than LBFGS, but more accurate.
6. Watson (Kowalik and Osborne [25]):
f(x) = x1^2 + sum_{i=2}^{30} { sum_{j=2}^{n} (j-1) x_j t_i^(j-2) - [ sum_{j=1}^{n} x_j t_i^(j-1) ]^2 - 1 }^2,
where t_i = (i-1)/29. Here a polynomial
P(t) = x1 + x2 t + ... + x_n t^(n-1)
is fitted, by least squares, to approximate a solution of the differential equation
dz/dt = 1 + z^2, z(0) = 0,
for t in [0, 1]. (The exact solution is z = tan t.) The minimization problem is ill-conditioned, and rather difficult to solve, because of a bad choice of basis functions {1, t, ..., t^(n-1)}. We choose n = 9.
For this function, the two programs have similar accuracy, but PRAXIS is much
faster than LBFGS.
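To make the nested summations concrete, the Watson sum of squares can be transcribed into Python as below. This transcription is ours, for illustration only, and follows the formula as written above:

```python
# Illustrative sketch of Watson's function:
# f(x) = x1^2 + sum_{i=2}^{30} { sum_{j=2}^{n} (j-1) x_j t^(j-2)
#                               - [ sum_{j=1}^{n} x_j t^(j-1) ]^2 - 1 }^2,
# with t = (i-1)/29 and x indexed from 1 in the mathematics.
def watson(x):
    n = len(x)
    total = x[0]**2
    for i in range(2, 31):
        t = (i - 1) / 29.0
        deriv = sum((j - 1) * x[j - 1] * t**(j - 2) for j in range(2, n + 1))
        poly = sum(x[j - 1] * t**(j - 1) for j in range(1, n + 1))
        total += (deriv - poly**2 - 1.0)**2
    return total

# At the origin every bracketed term equals -1, so f(0) = 29.
print(watson([0.0] * 9))  # 29.0
```

The value 29 at the origin gives a quick sanity check of any implementation of this function.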
7. Powell (Powell [41]):
For a description of this function, see Powell [41].
For this function, the two programs have similar speeds, but PRAXIS is more
accurate than LBFGS.
8. Wood (Colville [10]):
f(x) = 100(x2 - x1^2)^2 + (1 - x1)^2 + 90(x4 - x3^2)^2 + (1 - x3)^2 + 10.1[(x2 - 1)^2 + (x4 - 1)^2] + 19.8(x2 - 1)(x4 - 1).
This function is rather like Rosenbrock's, but with four variables instead of two.
For this function, the two programs have similar accuracy, but PRAXIS is faster
than LBFGS.
9. Hilbert:
f(x) = x^T A x,
where A is the n by n Hilbert matrix, i.e.,
a_ij = 1/(i + j - 1)
for 1 ≤ i, j ≤ n. We choose n = 10.
For this function, PRAXIS is slower, but more accurate, than LBFGS.
10. Tridiag (Gregory and Karney [19]):
f(x) = x^T A x - 2 x1,
where
A =
[  1  -1                ]
[ -1   2  -1            ]
[     -1   2  -1        ]
[          .   .   .    ]
[             -1   2  -1]
[                 -1   2]
This function is useful for testing the property of finite convergence on a quadratic function. The minimum f(x*) = -n occurs when x* is the first column of A^(-1), i.e.,
x* = (n, n-1, n-2, ..., 2, 1)^T.
We choose n = 20.
For this function, the two programs have similar accuracy, but PRAXIS is much
faster than LBFGS.
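The finite-convergence claim is easy to verify numerically. The following Python sketch (our own illustration, with a function name of our choosing) evaluates f(x) = x^T A x - 2 x1 for the tridiagonal A above and confirms f = -n at x* = (n, n-1, ..., 1):

```python
def tridiag_f(x):
    # f(x) = x^T A x - 2*x1 with A tridiagonal: diagonal (1, 2, ..., 2)
    # and off-diagonals -1, as in the matrix displayed above.
    n = len(x)
    quad = x[0] * x[0] + 2.0 * sum(xi * xi for xi in x[1:])
    quad -= 2.0 * sum(x[i] * x[i + 1] for i in range(n - 1))
    return quad - 2.0 * x[0]

n = 20
x_star = [float(n - i) for i in range(n)]  # (n, n-1, ..., 2, 1)
print(tridiag_f(x_star))  # -20.0
```

Since A x* equals the first unit vector, x*^T A x* = n and f(x*) = n - 2n = -n, matching the printed value.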
11. Box (Box [4]):
f(x) = sum_{i=1}^{10} { [exp(-i x1 / 10) - exp(-i x2 / 10)] - x3 [exp(-i / 10) - exp(-i)] }^2.
This function has minima of 0 at (1, 10, 1)^T and also along the line {(λ, λ, 0)^T}.
Both programs find the first minimum, and have similar speeds. However, PRAXIS is more accurate than LBFGS.
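Assuming the ten-term form of the sum written above, both families of minima can be checked numerically; this Python sketch is our own illustration:

```python
import math

def box(x1, x2, x3):
    # Box's function as written above, with ten terms in the sum.
    total = 0.0
    for i in range(1, 11):
        term = (math.exp(-i * x1 / 10.0) - math.exp(-i * x2 / 10.0)
                - x3 * (math.exp(-i / 10.0) - math.exp(-float(i))))
        total += term * term
    return total

# f vanishes at (1, 10, 1) and everywhere on the line (lam, lam, 0).
print(box(1.0, 10.0, 1.0))  # 0.0
print(box(2.5, 2.5, 0.0))   # 0.0
```

The second family of minima (x1 = x2, x3 = 0) explains why a minimizer started far from (1, 10, 1) can legitimately converge somewhere else.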
Summary: Brent's algorithm is considered to be a very good one. Overall, it is better than many other algorithms [5]. Our tests show that the LBFGS program is almost as good as PRAXIS, though for some test functions PRAXIS is much better than LBFGS. PRAXIS was tuned extensively by Brent on his suite of test problems, unlike LBFGS, which explains most of any superiority of PRAXIS.
Now we roughly compare LBFGS and DFMCG. LBFGS obtains satisfactory results for all eleven test functions, while DFMCG does not converge for some functions, such as those of Beale, Watson, Hilbert, and Box. For all other functions, it seems that LBFGS is slightly faster and/or more accurate than DFMCG. We do not know why the conjugate gradient method is unstable. To answer a question posted in [54], Chandler [8] made the following comments: "At least one conjugate gradient (CG) method was shown to be unstable: [31], see page 383 in particular. No CG method has ever been shown to be stable, as far as I know. Instability means that small errors such as roundoff are magnified in each succeeding iteration, which can lead to unreliability. CG methods have great difficulty solving moderately ill-conditioned problems efficiently.
As far as I know, no CG method has solved either of the simple nonlinear least squares problems of M. R. Osborne [39] in a competitive time (fewer than 10,000 equivalent function evaluations). Any decent direct search method (such as my STEPIT [7] or Richard Brent's PRAXIS) or quasi-Newton method will solve both of these problems much faster than this. Marquardt's method, developed specifically for least squares problems, also solves them efficiently.
If you want a low-storage method that is reasonably robust, I recommend the limited memory quasi-Newton method developed and programmed by Nocedal. It crunches both Osborne problems with no difficulty."
5.2 Test Using Osborne Functions
Osborne [39] studied a general method for minimizing a sum of squares which has the property that a linear least squares problem is solved at each stage, and which includes the Gauss-Newton, Levenberg, Marquardt, and Morrison methods as special cases. In this section we do not discuss the method of [39], but use its example functions to test LBFGS.
The problem of minimizing a sum of squares arises naturally from the problem of determining parameters x_i, i = 1, 2, ..., p in the model equation
y(t) = F(t, x)
from observations
y_i = y(t_i) + e_i, (i = 1, 2, ..., n),
where the e_i (the experimental errors) are independent, normally distributed random variables with mean zero and standard deviation σ. In the case n > p the appropriate maximum likelihood analysis indicates that x should be estimated by minimizing ||f(x)||^2, where
f_i(x) = y_i - F(t_i, x)
and
||f(x)||^2 = sum_{i=1}^{n} f_i(x)^2.
This problem will be referred to as the model problem, and it is stressed that we have offered a statistical justification for minimizing a sum of squares. Osborne's two test problems are classic practical nonlinear least squares problems.
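The model problem can be stated compactly in code. The sketch below is our own illustration; F here is a placeholder for any model function:

```python
def sum_of_squares(F, x, t_data, y_data):
    # ||f(x)||^2 = sum_i (y_i - F(t_i, x))^2, the model-problem objective.
    return sum((y - F(t, x))**2 for t, y in zip(t_data, y_data))

# Toy check with a linear model F(t, x) = x[0] + x[1]*t and noise-free data:
F = lambda t, x: x[0] + x[1] * t
t_data = [0.0, 1.0, 2.0]
y_data = [1.0, 3.0, 5.0]  # generated by x = (1, 2) with no noise
print(sum_of_squares(F, [1.0, 2.0], t_data, y_data))  # 0.0
```

Any unconstrained minimizer, including LBFGS, can be applied to such an objective once the gradient with respect to x is supplied.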
1. Osborne function 1
In this example, the data values {(t_i, y_i), 1 ≤ i ≤ 33}, which are given in [39], are fitted by the model
F(t, x) = x1 + x2 exp(-t x4) + x3 exp(-t x5).
The result of [39] is copied in the following Table 5.4.
Table 5.4 Original Result of Osborne Function 1
||f(x)||^2  x1  x2  x3  x4  x5
0.546E-4  0.3754  1.9358  -1.4647  0.01287  0.02212
We run LBFGS and PRAXIS on the above example and list the results in the following Tables 5.5 and 5.6.
Table 5.5 LBFGS Result of Osborne Function 1
||f(x)||^2  nf  (n+1)nf  x1  x2  x3  x4  x5
5.465E-5  172  1032  0.3754  1.9358  -1.4647  0.01287  0.02212
Table 5.6 PRAXIS Result of Osborne Function 1
||f(x)||^2  nf  x1  x2  x3  x4  x5
5.465E-5  1268  0.3753  1.9203  -1.4490  0.01284  0.02186
We mention that in this section we set ε = 10^-5 as the stopping criterion in LBFGS.
The three programs have similar accuracy. Since the number of function
evaluations is not available in Table 5.4, we cannot compare the speed of the two
algorithms. However, when we use LBFGS on Osborne function 1, there are only 172
function and gradient evaluations, which is considered very fast. LBFGS and PRAXIS
have similar speed on Osborne function 1.
2. Osborne function 2
In this example, the model has the form
F(t, x) = x1 exp(-x5 t) + x2 exp[-x6 (t - x9)^2] + x3 exp[-x7 (t - x10)^2] + x4 exp[-x8 (t - x11)^2].
The data values {(t_i, y_i), 1 ≤ i ≤ 65} are also given in [39]. The result of [39] is copied in Table 5.7.
Table 5.7 Original Result of Osborne Function 2
||f(x)||^2  x1  x2  x3  x4  x5
0.0402  1.3100  0.4315  0.6336  0.5993  0.7539
x6  x7  x8  x9  x10  x11
0.9056  1.3651  4.8248  2.3988  4.5689  5.6754
We run LBFGS and PRAXIS on the above example and list the results in Tables 5.8 and 5.9.
Table 5.8 LBFGS Result of Osborne Function 2
||f(x)||^2  (n+1)nf  x1  x2  x3  x4  x5
0.04014  2136  1.3100  0.4315  0.6337  0.5996  0.7543
nf  x6  x7  x8  x9  x10  x11
178  0.9038  1.3666  4.8227  2.3988  4.5688  5.6753
Table 5.9 PRAXIS Result of Osborne Function 2
||f(x)||^2  x1  x2  x3  x4  x5
0.04014  1.3100  0.4316  0.6337  0.5994  0.7542
nf  x6  x7  x8  x9  x10  x11
857  0.9043  1.3658  4.8237  2.3987  4.5689  5.6753
Using LBFGS, we need only 178 function and gradient evaluations to complete this problem. PRAXIS is faster than LBFGS for this function. In addition, the final result 0.04014 of both LBFGS and PRAXIS is better than Osborne's original result 0.0402.
5.3 Testing LBFGS Using Different Numbers of Corrections
Previously, we set the number of corrections m = 5 in LBFGS to test the various functions. For a very large problem, one cannot set m too large; otherwise, it will take too much storage. Also, the larger the value of m, the more execution time for each iteration. Normally, we set 3 ≤ m ≤ 7. However, the larger the m, the more information we can store. Hence, it seems that the larger the m, the smaller the number of iterations. We verify this by testing LBFGS on Osborne function 2 with various m and list the results in Table 5.10.
Table 5.10 Results of Osborne Function 2 for Different m
m  Total epochs  nf  ||f(x)||^2
2  344  379  4.014E-2
3  419  446  4.014E-2
4  311  345  4.014E-2
5  246  268  4.014E-2
6  227  253  4.014E-2
7  144  161  4.014E-2
8  117  132  4.014E-2
9  113  130  4.014E-2
10  83  99  4.014E-2
11  83  94  4.014E-2
12  79  91  4.014E-2
100  63  73  4.014E-2
1000  63  73  4.014E-2
We mention that in this section we set ε = 10^-7 as the stopping criterion.
From Table 5.10, we can see that when 3 ≤ m ≤ 13, the larger the m, the smaller the number of iterations and the number of function evaluations. In fact, in the first m iterations, LBFGS and BFGS are identical. Hence, if m is sufficiently large, for example larger than the number of iterations, then the executions of LBFGS and BFGS are exactly the same.
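The storage behavior discussed in this section comes from the two-loop recursion at the heart of LBFGS, which builds the search direction from the last m pairs s_k = w_{k+1} - w_k and y_k = g_{k+1} - g_k without ever forming a matrix. The following Python sketch of that recursion is ours, for illustration only; Nocedal's FORTRAN routine is the implementation actually used in this thesis:

```python
def lbfgs_direction(g, s_hist, y_hist):
    # Two-loop recursion: returns -H*g, where H is the implicit LBFGS
    # inverse Hessian approximation built from the stored (s, y) pairs.
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    q = list(g)
    alphas = []
    for s, y in reversed(list(zip(s_hist, y_hist))):
        alpha = dot(s, q) / dot(y, s)
        alphas.append(alpha)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    if s_hist:  # initial scaling H0 = (s.y / y.y) * I from the newest pair
        s, y = s_hist[-1], y_hist[-1]
        gamma = dot(s, y) / dot(y, y)
        q = [gamma * qi for qi in q]
    for (s, y), alpha in zip(zip(s_hist, y_hist), reversed(alphas)):
        beta = dot(y, q) / dot(y, s)
        q = [qi + (alpha - beta) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]

# With no stored pairs the direction is just steepest descent, -g.
print(lbfgs_direction([1.0, 2.0], [], []))  # [-1.0, -2.0]
```

Only the 2m vectors s and y are stored, which is why memory grows linearly with m while a full quasi-Newton method needs an n by n matrix.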
5.4 Lack of Quadratic Termination Property
Many optimization algorithms possess a property called quadratic termination, which means that they minimize a quadratic function exactly in a finite number of iterations [34]. For example, Newton's method and quasi-Newton methods have the quadratic termination property.
In this section, we verify that the limited memory quasi-Newton method does not have this property.
Let
It is not difficult to prove that this quadratic function is positive definite. We run this
function using different m and list the results in Table 5.11.
Table 5.11 Results of a Quadratic Function
m  # of iterations  nf  f(x)
2  16  21  1.782E-15
3  14  19  1.677E-15
4  15  20  1.5E-15
5  14  19  2.511E-17
6  13  18  3.223E-16
7  13  18  1.805E-16
8  13  18  6.604E-17
9  12  17  1.450E-15
10  12  17  1.420E-15
11  12  17  1.421E-15
12  12  17  1.421E-15
From the above table we see that for small m, LBFGS does not have quadratic termination in n or (n+1) iterations. The authors of [28] pointed out that "Our aim is that the limited memory method resemble BFGS as much as possible, and we disregard quadratic termination properties, which are not very meaningful, in general, for large dimensional problems."
6. Testing
In this chapter, we discuss the testing of our learning algorithm. We use the neural network data in Proben1 [43] to test our algorithm. The rules and conventions in Proben1 are followed strictly.
The scope of the Proben1 problems can be characterized as follows. All problems are suited for supervised learning, since input and output values are separated. All examples within a problem are independent of each other. Some of the problems can be tackled by pattern classification algorithms, while others need the capability of continuous multivariate function approximation. All problems are presented as static problems in the sense that all data to learn from are present at once and do not change during learning. All problems consist of real data from real problem domains.
6.1 Some Aspects of Proben1
1 Training set, validation set, test set [43]
The data used for performing benchmarks on neural network learning algorithms must be split into at least two parts: one part on which the training is performed, called the training data set, and another part on which the performance of the resulting network is measured, called the test data set. The idea is that the performance of a network on the test set estimates its performance in real use. This means that absolutely no information about the test set examples or the test set performance of the network can be available during the training process; otherwise the benchmark is invalid.
In some cases the training data are further subdivided. Some examples are put into the actual training set, others into a so-called validation set [43]. The latter is used as
a pseudo test set in order to evaluate the quality of a network during training. Such an
evaluation is called cross validation [43]. It is necessary due to the overfitting
(overtraining) phenomenon: For two networks trained on the same problem, the one with
larger training set error may actually be better, since the other has concentrated on
peculiarities of the training set at the cost of losing much of the regularity needed for
good generalization. This is a problem in particular when not very many training
examples are available, or when too large a network is used.
A popular and very powerful form of cross validation used in neural networks is early stopping: training proceeds not until a minimum of the error on the training set is reached, but only until a minimum of the error on the validation set is reached during training. Training is stopped at this point, and the current network state is the result of the training run.
The sizes of the training, validation, and test sets in all Proben1 data files are
50%, 25%, and 25% of all examples, respectively.
Our primary goal is to design a learning algorithm for a problem, either a classification problem or a function approximation problem, with a large number of examples. For a problem with a large number of examples, overtraining is not a big problem, provided the network architecture is suitably chosen. In our implementation, we choose data sets with a large number of examples, much larger than the number of weights. Hence, we do not use the validation set; instead, we merge the training set and the validation set into a single training set.
2 Input and output representation
How to represent the input and output attributes of a learning problem in a neural
network implementation of the problem is one of the key decisions influencing the quality
of the solutions one can obtain.
In Proben1, the real-valued attributes are usually rescaled by some linear factors. The integer-valued attributes are most often handled as if they were real-valued. Each input in the data set is a real-valued attribute, and each output in the data set is either a real number for the function approximation problems or an integer 0 or 1 for the classification problems.
3 Error measures
Many different error measures (also called error functions, objective functions, cost functions, or loss functions) can be used for network training. The most commonly used is the squared error
E = sum_i (o_i - t_i)^2,
where o_i and t_i are the actual output and target output at the i-th output node for one example. The above measure gives one error value per example; obviously there are too many values to report. Thus one usually reports either the sum or the average of these values over the set of all examples. The average is called the mean squared error. The author of [43] believed that the mean squared error may have the advantage of being independent of the size of the data set. Note that the mean squared error still depends on the number of output coefficients in the problem representation and on the range of output values used. We thus follow [43] in normalizing for these factors as well, and report a squared error percentage:
E = 100 * (o_max - o_min)/(N * P) * sum_{p=1}^{P} sum_{i=1}^{N} (o_pi - t_pi)^2,
where o_max and o_min are the maximum and minimum values of the output coefficients in the problem representation, N is the number of outputs of the network, and P is the number of examples in the data set considered. However, all the data sets in Proben1 have been normalized such that o_max = 1 and o_min = 0, so the squared error percentage can be simplified to
E = 100/(N * P) * sum_{p=1}^{P} sum_{i=1}^{N} (o_pi - t_pi)^2.
Note that this error function is never used in the field of optimization, but is specific to ANN training. In our algorithm, we use this squared error percentage as the error function, so that one may compare our training results with the results in [43].
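As a concrete illustration (our Python transcription, not the thesis's FORTRAN routine), the simplified squared error percentage can be computed as follows:

```python
def squared_error_percentage(outputs, targets):
    # E = 100/(N*P) * sum_p sum_i (o_pi - t_pi)^2, assuming the Proben1
    # normalization o_max = 1, o_min = 0.
    # `outputs` and `targets` are P lists of N values each.
    P = len(outputs)
    N = len(outputs[0])
    total = sum((o - t)**2
                for out, tar in zip(outputs, targets)
                for o, t in zip(out, tar))
    return 100.0 * total / (N * P)

# Two one-output examples with errors 0.5 and 0.0: E = 100/2 * 0.25 = 12.5.
print(squared_error_percentage([[0.5], [1.0]], [[1.0], [1.0]]))  # 12.5
```

Dividing by both N and P is what makes the reported values comparable across networks with different numbers of outputs and across data sets of different sizes.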
4 Classification measure
The actual target function for classification problems is usually not the continuous
error measure used during training but the classification performance. However, the
classification performance is not the only measure we are interested in. We thus report
the actual error values in addition to the classification performance. Classification
performance is reported in terms of percent of incorrectly classified examples, the percent
classification error. This is better than reporting the percentage of correctly classified
examples, the classification accuracy, because the latter makes important differences
insufficiently clear: an accuracy of 98% is actually twice as good as one of 96%, which is
easier to see if the percent errors are reported (2% compared to 4%).
There are several possibilities for determining the classification a network has computed from the outputs of the network.
In our implementation, we use the following method to determine the classification. We calculate
d_i = |o_i - t_i|,
where o_i and t_i are the actual output and target output at the i-th output node for one example. If there is at least one d_i such that d_i ≥ 0.5, then we reject this example; otherwise we accept it. We could use 0.4 or 0.3 (instead of 0.5) to determine the rejection region. However, in our implementation, the differences are not significant.
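A minimal Python sketch of this rule (ours, with a function name of our choosing) is:

```python
def is_misclassified(outputs, targets, threshold=0.5):
    # Compute d_i = |o_i - t_i| at each output node; the example is
    # rejected (counted as misclassified) if any d_i >= threshold.
    return any(abs(o - t) >= threshold for o, t in zip(outputs, targets))

# One accepted and one rejected example:
print(is_misclassified([0.9, 0.1, 0.2], [1, 0, 0]))  # False
print(is_misclassified([0.4, 0.6, 0.1], [1, 0, 0]))  # True
```

The `threshold` parameter corresponds to the 0.5 (or 0.4, 0.3) rejection boundary discussed above.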
5 Networks used
The neural network structure is one of the most important things to be specified when we use a neural network. No one knows which particular structure is best for any particular problem. Basically, we just try several different structures. Following [43], we mainly choose neural networks with zero, one, or two hidden layers.
To describe the topology, we refer to common topology models. For instance, for the common case of fully-connected layered feedforward networks, the numbers of nodes in each layer from input to output can be given as a sequence. For example, a 14-50-3 network refers to a network with 14 input, 50 hidden, and 3 output nodes. We call this a network with one hidden layer.
6 Stopping criteria
We design a training algorithm based on a limited memory quasi-Newton method to solve both classification problems and approximation problems. Since we intend to
solve problems with a large number of examples, the validation set is not used, and the following stopping criteria are used:
1. The weight update is within tolerance:
||w_{k+1} - w_k||_2 < tolerance.
The tolerance is chosen depending on the problem. We state the tolerance for our test problems in Section 6.4.
2. The number of function evaluations exceeds a predefined limit.
In our implementation, we set this predefined limit at 2000. Some authors use the number of iterations instead of the number of function evaluations; in our implementation, the difference between these two numbers is not large. Stopping on this criterion implies failure to converge, although the results might still be of some use.
In addition to the above two stopping criteria, there is another stopping criterion in the subroutine LBFGS, as we mentioned in Chapter 4.
If a validation set is used in the implementation, then besides the above criteria, the GL_α stopping criterion can be used. Although we do not use a validation set in our implementation, we state the GL_α stopping criterion [43] in the following; interested readers may use it in their implementations.
Let E_va(t) be the squared error percentage over the validation set, measured during epoch t. The value E_opt(t) is defined to be the lowest validation set error obtained in the epochs up to t:
E_opt(t) = min_{t' ≤ t} E_va(t').
Now we define the generalization loss at epoch t to be the relative increase of the validation error over the minimum-so-far (in percent):
GL(t) = 100 * (E_va(t) / E_opt(t) - 1).
A high generalization loss is one candidate reason to stop training. This leads to a class of stopping criteria: stop as soon as the generalization loss exceeds a certain threshold α. We define the class GL_α as
GL_α: stop after the first epoch t with GL(t) > α.
A typical value for α is 5 [43].
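The GL_α rule can be sketched as follows (our Python illustration; as noted above, the thesis implementation does not use a validation set):

```python
def generalization_loss(e_va_history):
    # GL(t) = 100 * (E_va(t) / E_opt(t) - 1), where E_opt(t) is the
    # smallest validation error seen in epochs 1..t.
    e_opt = min(e_va_history)
    return 100.0 * (e_va_history[-1] / e_opt - 1.0)

def stop_gl_alpha(e_va_history, alpha=5.0):
    # GL_alpha: stop after the first epoch with GL(t) > alpha.
    return generalization_loss(e_va_history) > alpha

history = [10.0, 9.0, 9.9]  # validation error per epoch
print(round(generalization_loss(history), 6))  # 10.0
print(stop_gl_alpha(history))                  # True
```

In the example, the validation error has risen 10% above its best value of 9.0, exceeding the threshold α = 5, so training would stop.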
6.2 Language Implementation
We use the FORTRAN 77 language to implement our learning algorithm. The main purpose of a learning process is to train the neural network to have generalization ability; that is, the network should also have small error on data which it has not learned.
We choose one-hidden-layer and two-hidden-layer feedforward neural network architectures. Neural networks without a hidden layer are also used. The propagated computations and notations are exactly the same as in Wang [51], and we do not repeat them here.
Before executing our program, one must prepare an input file named exactly "INPUT.DAT". The format of "INPUT.DAT" is as follows:
The first line: TYPE, SEED
TYPE is an integer, which represents the type of training problem.
TYPE = 1: Function approximation problem.
TYPE = 2: Pattern classification problem.
SEED is an integer, which is used to generate a series of random numbers representing the initial weights of the network.
The second line: NTRAIN, NTEST, NLAYER
NTRAIN is an integer, which represents the number of training examples.
NTEST is an integer, which represents the number of test examples.
NLAYER is an integer, which represents the number of layers (including the input layer) of this network.
The third line: integers
These integers represent the number of nodes (excluding the bias) in each layer, starting from the input layer and ending at the output layer.
In the remainder of the file, each line contains data for one example: the inputs followed by the desired outputs.
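For illustration, a reader could parse a file in this format as follows; this Python sketch is ours and is not part of the FORTRAN 77 program:

```python
def parse_input_dat(text):
    # Parse the INPUT.DAT layout described above:
    #   line 1: TYPE, SEED
    #   line 2: NTRAIN, NTEST, NLAYER
    #   line 3: nodes per layer (input ... output)
    #   rest:   one example per line, inputs then desired outputs.
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    ptype, seed = (int(v) for v in lines[0])
    ntrain, ntest, nlayer = (int(v) for v in lines[1])
    nodes = [int(v) for v in lines[2]]
    examples = [[float(v) for v in ln] for ln in lines[3:]]
    return {"type": ptype, "seed": seed, "ntrain": ntrain,
            "ntest": ntest, "nlayer": nlayer, "nodes": nodes,
            "examples": examples}

# A tiny classification file: a 2-2-1 network and three XOR-style rows.
sample = """2 12345
2 1 3
2 2 1
0 0 0
0 1 1
1 1 0
"""
cfg = parse_input_dat(sample)
print(cfg["nodes"])          # [2, 2, 1]
print(len(cfg["examples"]))  # 3
```

Each example row is split into inputs and outputs later, using the first and last entries of the node-count line.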
The major subroutines of the program and their functions are the following:
LBSET ( ) - define the values of several parameters in common areas.
INPUT ( ) - open and read the input data file.
INWEIT ( ) - initialize all connection weights.
COMOUT ( ) - compute the outputs of the network for one example.
GRADIE ( ) - compute the value and gradient of the error function.
MLBFGS ( ) - implement the limited memory BFGS algorithm. This is a slight modification of Nocedal's LBFGS program [36].
6.3 Test Problems
As we mentioned before, the main purpose of this thesis is to train a neural network with a large number of weights. We choose the three data sets with the largest numbers of examples in Proben1 [43]. All three data sets have a relatively large number of inputs, so that the number of weights may be large, though it depends on the actual design of the network. If the number of inputs and the number of outputs are fixed, then the larger the number of hidden nodes, the larger the number of weights.
Two of the problems are function approximation problems, while the third is a classification problem.
Table 6.1 Test Problems
Problem  Type of problem  # of inputs  # of outputs  # of training examples  # of test examples
building2  Approximation  14  3  3156  1052
flare1  Approximation  24  3  800  266
thyroid1  Classification  21  3  5400  1800
For further comparison, seven different network topologies were used for each problem: a zero-hidden-layer network, one-hidden-layer networks with 4, 16, or 32 hidden nodes, and two-hidden-layer networks with 4+4, 8+8, or 16+8 hidden nodes on the first
and second hidden layers, respectively. For the two approximation problems, we also use some other network topologies. All of these networks have all possible feedforward connections, including a bias connection.
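The weight counts reported in Tables 6.3 through 6.5 can be reproduced by counting connections between adjacent layers plus one bias weight per non-input node. This sketch is our own; note that it assumes connections between adjacent layers only (no shortcut connections), which is what matches the tabulated counts:

```python
def count_weights(nodes):
    # nodes: nodes per layer, e.g. [14, 50, 3] for a 14-50-3 network.
    # Each non-input node has one weight per node in the previous layer
    # plus one bias weight.
    return sum((nodes[k] + 1) * nodes[k + 1] for k in range(len(nodes) - 1))

# building2 has 14 inputs and 3 outputs (Table 6.1); these counts match
# the "# of weights" column in Table 6.3:
print(count_weights([14, 3]))        # 45  (no hidden layer)
print(count_weights([14, 4, 3]))     # 75
print(count_weights([14, 4, 4, 3]))  # 95  (4+4 hidden nodes)
```

The count grows with the product of adjacent layer sizes, which is why adding hidden nodes quickly inflates the number of weights.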
6.4 Test Results
In optimization problems, one uses various methods to find an approximate minimum value of the objective function. Theoretically, the exact minimum of the objective function cannot be known in advance. However, certain known results can serve as a reference.
In Proben1 [43], a few results of neural network learning runs on the data sets are given. The runs were made with linear networks, having only direct connections from inputs to outputs, and with various fully connected multilayer networks with one or two layers of sigmoidal hidden nodes. Training was performed using the RPROP algorithm [43]. We list the average results of [43] in Table 6.2.
We notice that for different network topologies, the propagation from the input layer to the output layer is different. Hence, the minimum values may not be exactly the same. However, by comparing with the results in Table 6.2, we can get a rough idea of how well our program works.
Table 6.2 Results in Proben1
Problem  Total epochs  Training set  Test set  Test set classification
building2  1183  0.23  0.26  -
flare1  71  0.39  0.74  -
thyroid1  491  0.60  1.31  2.32
Training set: minimum squared error percentage on the training set.
Test set: minimum squared error percentage on the test set.
Test set classification: percent of incorrectly classified examples on the test set.
Recall that we first need to set the stopping criteria. The first criterion is that the weight update is within tolerance:
||w_{k+1} - w_k||_2 < tolerance.
Since the number of weights is relatively large, it is not necessary to set the tolerance too small. In our implementation, we set the tolerance = 10^-3, with one exception: for the problem building2 with 16+16 hidden nodes, when tolerance = 10^-3, the program stops too early and does not obtain the desired results. Hence, we set the tolerance = 10^-5 instead of 10^-3 for this problem.
The second stopping criterion is that the number of function evaluations cannot exceed a predefined limit, which we set at 2000.
The stopping criterion of LBFGS is
||g_k|| < ε × max(1, ||w_k||).
We set ε = 10^-4 with a few exceptions. For a similar reason as above, we set ε = 10^-5 for the following network topologies: flare1 without a hidden layer, and thyroid1 with the hidden layers 8+8 and 16+8.
In our actual implementation, for building2 with 16+16 hidden nodes, the program stops when the number of function evaluations reaches the limit of 2000. For all other cases, it stops when either ||Δw||_2 is too small or ||g||_2 is too small.
The results of our tests are listed in the following tables.
Table 6.3 Results of building2 (m = 5)
Hidden nodes  # of weights  Total epochs  nf  ||Δw||_2  ||g||_2  Training set  Test set
None  45  67  75  0.0006483  0.002806  0.3441  0.3427
4  75  759  820  0.008264  0.002349  0.2635  0.2643
8  147  1084  1140  0.0008241  0.01060  0.2472  0.2544
16  291  1331  1421  0.0008560  0.008237  0.2267  0.2443
32  579  1341  1425  0.0009306  0.04055  0.2176  0.2464
4+4  95  1328  1434  0.02195  0.002396  0.2543  0.2598
8+8  219  1631  1747  0.0006908  0.003420  0.2172  0.2402
16+16  563  1801  2000  0.02671  0.01257  0.2073  0.2363
Table 6.4 Results of flare1 (m = 5)
Hidden nodes  # of weights  Total epochs  nf  ||Δw||_2  ||g||_2  Training set  Test set
None  75  138  150  0.004856  0.0002159  0.2915  0.5962
4  115  99  121  0.03944  0.003599  0.2470  0.7380
8  227  162  183  0.03414  0.003334  0.2118  0.7999
16  451  166  188  0.03917  0.004947  0.2209  0.9708
32  899  153  171  0.6937  0.007733  0.2229  0.8296
4+4  135  121  144  0.07549  0.005195  0.2508  0.7387
8+4  251  143  187  0.4097  0.005541  0.2462  0.6216
8+8  299  121  153  0.01822  0.004801  0.2407  0.7591
16+8  563  306  356  0.01908  0.004518  0.2183  1.026
Table 6.5 Results of thyroid1 (m = 5)
Hidden nodes  # of weights  Total epochs  nf  ||Δw||_2  ||g||_2  Training set  Test set  Error rate
None  66  98  113  1.431  0.01917  2.885  3.179  6.500
4  103  236  280  0.05897  0.02529  0.9794  1.346  2.444
16  403  371  420  0.04310  0.02790  0.8786  1.246  2.500
32  803  647  735  0.02569  0.02054  1.445  1.751  4.111
4+4  123  35  41  21.94  0.02343  4.469  4.369  7.278
8+8  275  49  59  53.04  0.007671  4.477  4.391  7.278
16+8  515  611  730  0.003879  0.006455  2.696  2.652  4.944
nf: total number of function evaluations.
Training set: minimum squared error percentage on the training set.
Test set: minimum squared error percentage on the test set.
Error rate: percent of incorrectly classified examples.
1. Building2
It seems that the iterations converge for all network topologies we chose. Except for the network without hidden nodes, the training set error and the test set error are similar to those in Table 6.2. The best training set error and the best test set error are 0.2073 and 0.2363, respectively, which are better than the results in Table 6.2.
2. Flare1
It seems that the iterations converge for every network topology. The best training set error is 0.2118, which is obtained by the network topology with eight hidden nodes, while the best test set error is 0.5962, which is obtained by the network without hidden nodes. We notice that the network topology with the best training set error may not have the best test set error.
3. Thyroid1
It seems that the iterations converge for some network topologies but do not converge for others (none, 4+4, 8+8). The best (smallest) training set error and the best test set error are 1.445 and 1.246, respectively, and the best percent of incorrectly classified examples is 2.444%.
In the following we summarize our results:
First of all, our program handles the large number of weights of a neural network very well. The largest number of weights in our tests is 899. Actually, we believe that our program could work on a neural network with several thousand weights without any problem.
Second, for most of the network topologies which we chose for the three
problems, the optimization procedure ,converges. The other cases presumably would have
converged eventually.
Third, compared with the results in Table 6.2, most of our results are acceptable.
Some results are more accurate than the results in Table 6.2. In particular, we get very
good generalization results on test sets.
Roughly speaking, the greater the number of hidden nodes, the greater the number
of weights, and the greater the number of iterations and function evaluations required. The
average number of iterations over the different topologies for each problem appears
similar to that in Table 6.2. However, we cannot claim that our program is very fast,
since it may take more time per iteration than the other methods. For example, it
took about 30 hours for the problem Thyroid1 with 32 hidden nodes when we ran it on
CSA, a Sequent with 24 Intel 80386 processors. However, our program can handle a very large
number of weights, which the other methods may not be able to do (a conjugate gradient
method may handle a large problem, but it is unstable). In addition, as we mentioned
before, the main difference between a limited memory quasi-Newton method and a quasi-Newton
method is the storage scheme. If we modified the storage part of our program and
stored the inverse Hessian matrices explicitly, then our program could train a network with a
moderate number of weights and run very fast.
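The storage saving referred to above comes from the two-loop recursion of Nocedal [35]: instead of holding an n-by-n inverse Hessian, L-BFGS reconstructs the product of the inverse Hessian with the gradient from the last m correction pairs. The following is a minimal Python sketch of that recursion (it is not the Fortran code used in this thesis, and the names are ours):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_loop_direction(grad, s_list, y_list):
    """L-BFGS search direction -H*grad from the last m correction
    pairs (s, y), without ever forming the inverse Hessian H."""
    q = list(grad)
    rhos = [1.0 / dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest correction pair back to the oldest.
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        a = rho * dot(s, q)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    # Scale by an initial diagonal H0 = gamma * I (a standard choice).
    gamma = (dot(s_list[-1], y_list[-1]) / dot(y_list[-1], y_list[-1])
             if s_list else 1.0)
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair forward to the newest.
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]
```

With no stored pairs the recursion reduces to steepest descent; storing the matrix is replaced by storing 2m vectors.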
6.5 Test on XOR Problem
In this section we test our program on an "exclusive or" (XOR) function of two variables.
This is perhaps the simplest learning problem that is not linearly separable. It therefore
cannot be solved by a network with no hidden layer. For a detailed description and
results see [55].
There are two main forms of learning architectures that have been used by others
to solve this problem [55]. The first is a 2-2-1 network, which has two hidden units,
each connected to both inputs and to the output. The second is a "2-1-1 shortcut
network", which has only a single hidden unit, but also has "shortcut" connections from
the inputs to the output unit. However, we do not consider the shortcut network here.
Some researchers have also investigated this problem with more than two hidden
units [56]. In general, the more hidden units there are, the easier the problem becomes.
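The claim that XOR cannot be solved without a hidden layer can be checked directly: no single threshold unit reproduces the XOR truth table, while AND, for comparison, is easy. A small illustrative Python sketch (the brute-force grid search is ours):

```python
import itertools

# Truth tables: ((x1, x2), target).
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def separable(cases, grid):
    """Brute-force search for a single threshold unit
    w1*x1 + w2*x2 + b > 0 that reproduces every target."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 + b > 0) == bool(t)
               for (x1, x2), t in cases):
            return True
    return False

GRID = [i / 2.0 for i in range(-20, 21)]  # weights from -10.0 to 10.0
```

On this grid, separable(AND, GRID) succeeds, while separable(XOR, GRID) fails; no refinement of the grid changes the XOR result, since XOR is not linearly separable.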
We first call our program directly with 2-2-1 and 2-3-1 networks, but it does not
work very well; it seems to stop at a local minimum. We then test our program with a
2-4-1 network. The results are satisfactory. We list the results in the following table.
Table 6.6 Results of XOR Problem with 2-4-1 Network

# Of Weights   Total Epochs   nf   ||Δw||2   ||g||2     Error      Error Rate
     17             27        55   3.187     4.966D-4   3.292D-4   0.0

WEIGHT = (1.621D-01, 1.400D-01, 4.693D-02, 2.231D-02, 4.414D-02,
          6.472D-02, 9.630D-03, 2.710D-02, 8.472D-02, 8.488D-02,
          8.082D-02, 5.366D-02, 4.286D-02, 5.266D-02, 9.787D-02,
          6.157D-02, 8.823D-02)
error: minimum squared error percentage
error rate: percent of incorrectly classified examples
In our original program, we chose the initial weights between -0.5/fan-in and
0.5/fan-in, where fan-in is the number of nodes in the previous layer [51]. Chandler [8]
suggested that we enlarge the range of the initial weights for this problem; that is, the
initial weights are chosen in [-50, 50]. In addition, we randomly choose the initial
weights and calculate the value of the error function one thousand times. We save the
weights with the smallest squared error percentage, and use these weights as the initial
weights to test the XOR problem with a 2-2-1 network. The results are satisfactory. We
list the results in the following table.
Table 6.7 Results of XOR Problem with 2-2-1 Network

# Of Weights   Total Epochs   nf   ||Δw||2     ||g||2     Error      Error Rate
      9              1         2   1.000D+00   1.576D-9   1.119D-9   0.0

WEIGHT = (2.598D+01, 2.987D+01, 1.286D+01, 8.498D+00, 3.505D+01,
          3.404D+01, 1.236D+01, 2.570D+01, 4.382D+01)
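The initialization scheme described above (draw the weights uniformly in [-50, 50] one thousand times, evaluate the error once per draw, and keep the best draw as the starting point) can be sketched as follows. The toy error function is an assumption for illustration, standing in for the network's squared error percentage:

```python
import random

def best_of_n_init(error_fn, n_weights, n_trials=1000,
                   lo=-50.0, hi=50.0, seed=1):
    """Draw n_trials random weight vectors uniformly in [lo, hi],
    evaluate each once, and keep the one with the smallest error as
    the optimizer's starting point."""
    rng = random.Random(seed)
    best_w, best_e = None, float("inf")
    for _ in range(n_trials):
        w = [rng.uniform(lo, hi) for _ in range(n_weights)]
        e = error_fn(w)
        if e < best_e:
            best_w, best_e = w, e
    return best_w, best_e

# Stand-in error surface (assumed): distance from an arbitrary target.
def toy_error(w):
    return sum((wi - t) ** 2 for wi, t in zip(w, (1.0, 2.0, 3.0)))

w0, e0 = best_of_n_init(toy_error, 3)
```

A larger number of trials can only lower (never raise) the error of the chosen starting weights, which is why the pre-search helps the optimizer avoid poor local minima.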
7. Conclusions
In this thesis, we designed a neural network program based on limited memory
quasi-Newton methods to train fully-connected feedforward neural networks. Our
program can train a neural network with a large number of weights, and it has been tested
on several real-world problems in Proben1 [43]. Compared with the results in Proben1,
our results are satisfactory. In particular, we obtain very good generalization results on
the test sets. Since we do not store the inverse Hessian matrix explicitly, each iteration
may take more time than for other methods. However, if we modify the storage
part of our program, it will run very fast in training a neural network with a moderate
number of weights.
In addition, we tested the subroutine LBFGS on various functions and obtained
very good test results.
Suggestion for further study:
Test our program on real data sets with a very large number of inputs and not very
many examples. If necessary, use the validation set as a pseudo test set.
References
[1] M. Al-Baali, Descent property and global convergence of the Fletcher-Reeves
method with inexact line searches, IMA J. Numer. Anal., 5 (1985) 121-124.
[2] E. M. L. Beale, On an iterative method for finding a local minimum of a function
of more than one variable, Tech. Report No. 25, Statistical Techniques Research
Group, Princeton Univ. (1958).
[3] H. S. M. Beigi and C. J. Li, Learning algorithms for neural networks based on
quasi-Newton methods with self-scaling, J. of Dynamic Systems, Measurement
and Control, 115 (1993) 38-43.
[4] M. J. Box, A comparison of several current optimization methods, and the use of
transformations in constrained problems, Comp. J. 9 (1966) 67-77.
[5] R. P. Brent, Algorithms for Minimization without Derivatives, Prentice-Hall, Inc.,
Englewood Cliffs, N. J., 1973.
[6] R. H. Byrd, J. Nocedal, and Y. Yuan, Global convergence of a class of quasi-Newton
methods on convex problems, SIAM J. Numer. Anal., 24 (1987) 1171-1190.
[7] J. P. Chandler, Anonymous ftp://a.cs.okstate.edu/pub/jpc/praxis.f, praxis.txt,
stepit.f.
[8] J. P. Chandler, Private communication.
[9] A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal
Processing, John Wiley & Sons, Inc., New York, 1993.
[10] A. R. Colville, A comparative study of nonlinear programming codes, IBM New
York Scientific Center Report, 320-2949 (1968).
[11] J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained
Optimization and Nonlinear Equations, Prentice-Hall, 1983.
[12] R. Fletcher, Practical Methods of Optimization, Vol. 1, John Wiley & Sons, Inc.,
Chichester, 1980.
[13] R. Fletcher and M. J. D. Powell, A rapidly convergent descent method for
minimization, Comp. J. 6 (1963) 163-168.
[14] R. Fletcher and C. M. Reeves, Function minimization by conjugate gradients,
Computer J. 7 (1964) 149-154.
[15] J. A. Freeman and D. M. Skapura, Neural Networks: Algorithms, Applications, and
Programming Techniques, Addison-Wesley Publishing Co., New York, 1992.
[16] J. C. Gilbert and J. Nocedal, Global convergence properties of conjugate gradient
methods for optimization, SIAM J. Optimization, 2 (1992) 21-42.
[17] P. E. Gill and W. Murray, Quasi-Newton methods for unconstrained optimization,
J. Inst. Math. Applics. 9 (1972) 91-108.
[18] P. E. Gill, W. Murray and M. H. Wright, Practical Optimization, Academic Press,
London, 1981.
[19] R. T. Gregory and D. L. Karney, A Collection of Matrices for Testing
Computational Algorithms, Interscience, New York, 1969.
[20] A. Griewank and Ph. L. Toint, Partitioned variable metric updates for large
structured optimization problems, Numerische Mathematik 39 (1982) 119-137.
[21] M. T. Hagan, H. B. Demuth, and M. Beale, Neural Network Design, PWS
Publishing Co., Boston, 1996.
[22] M. T. Hagan and M. B. Menhaj, Training feedforward networks with the
Marquardt algorithm, IEEE Transactions on Neural Networks 5 (1994) 989-994.
[23] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College
Publishing Co., 1994.
[24] N. B. Karayiannis and A. N. Venetsanopoulos, Artificial Neural Networks:
Learning Algorithms, Performance Evaluation, and Applications, Kluwer
Academic Publishers, Boston, 1993.
[25] J. S. Kowalik and M. R. Osborne, Methods for Unconstrained Optimization
Problems, Elsevier, New York, 1968.
[26] A. Leon, A comparison of eight known optimizing procedures, Recent Advances in
Optimization Techniques, A. Lavi and T. P. Vogl eds., Wiley & Sons, Inc., New
York, (1966).
[27] K. Levenberg, A Method for the Solution of Certain Non-Linear Problems in
Least Squares, Quart. Appl. Math. 2 (1944) 164-168.
[28] D. C. Liu and J. Nocedal, On the limited memory BFGS method for large scale
optimization, Math. Programming 45 (1989) 503-528.
[29] D. C. Liu and J. Nocedal, Test results of two limited memory methods for large
scale optimization, Technical Report NAM 04, Department of Electrical
Engineering and Computer Science, Northwestern University, 1988.
[30] D. W. Marquardt, An Algorithm for Least-Squares Estimation of Nonlinear
Parameters, J. Soc. Indust. Appl. Math. 11 (1963) 525-553.
[31] W. Miller and D. Spooner, ACM Trans. Math. Software 4 (1978) 369-390.
[32] J. J. More and D. J. Thuente, Line search algorithms with guaranteed sufficient
decrease, ACM Transactions on Mathematical Software 20 (1994) 287-307.
[33] S. G. Nash, Preconditioning of truncated-Newton methods, SIAM Journal on
Scientific and Statistical Computing 6 (1985) 599-616.
[34] S. G. Nash and A. Sofer, Linear and Nonlinear Programming, McGraw-Hill Inc.,
New York, 1996.
[35] J. Nocedal, Updating quasi-Newton matrices with limited storage, Mathematics of
Computation 35 (1980) 773-782.
[36] J. Nocedal, Anonymous ftp://eecs.nwu.edu/pub/lbfgs/lbfgs_um.
[37] L. Nyhoff and S. Leestma, FORTRAN 77 and Numerical Methods for Engineers
and Scientists, Prentice-Hall, New Jersey, 1995.
[38] D. P. O'Leary, A discrete quasi-Newton algorithm for minimizing a function of
many variables, Mathematical Programming 23 (1982) 20-33.
[39] M. R. Osborne, Some aspects of nonlinear least squares calculations, Numerical
Methods for Nonlinear Optimization, F. A. Lootsma ed., Academic Press, New
York, (1971) 171-189.
[40] M. J. D. Powell, An iterative method for finding stationary values of a function of
several variables, Comp. J. 5 (1962) 147-151.
[41] M. J. D. Powell, An efficient method for finding the minimum of a function of
several variables without calculating derivatives, Comp. J. 7 (1964) 155-162.
[42] M. J. D. Powell, Some global convergence properties of a variable metric method
without line searches, Nonlinear Programming, R. W. Cottle and C. E. Lemke,
eds., SIAM-AMS Proceedings, American Mathematical Society, 9 (1976) 53-72.
[43] L. Prechelt, Proben1 - A Set of Neural Network Benchmark Problems and
Benchmarking Rules, Technical Report 21/94, University of Karlsruhe, 1994,
Anonymous ftp://ftp.ira.uka.de/pub/neuron/proben1.tar.gz.
[44] B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University
Press, 1996.
[45] H. H. Rosenbrock, An automatic method for finding the greatest or least value of a
function, Comp. J. 3 (1960) 173-184.
[46] D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning representations by
back-propagating errors, Nature 323 (1986) 533-536.
[47] L. E. Scales, Introduction to Non-Linear Optimization, Springer-Verlag, New
York, 1985.
[48] T. Steihaug, The conjugate gradient method and trust regions in large scale
optimization, SIAM Journal on Numerical Analysis 20 (1983) 626-637.
[49] J. G. Taylor, The Promise of Neural Networks, Springer-Verlag, London, 1993.
[50] Ph. L. Toint, Towards an efficient sparsity exploiting Newton method for
minimization, in: Sparse Matrices and Their Uses, I. S. Duff, ed., Academic Press,
New York, 1981, 57-58.
[51] L. Wang, The Damped Newton Method - An ANN Learning Algorithm, M. S.
Thesis, Computer Science Department, Oklahoma State University, 1995.
[52] M. A. Wolfe, Numerical Methods for Unconstrained Optimization, Van Nostrand
Reinhold Company, Ltd., Molly Millars Lane, Wokingham, Berkshire, England,
1978.
[53] System/360 Scientific Subroutine Package (360A-CM-03X) Version III
Programmer's Manual, IBM Corporation, 1968.
[54] Internet: comp.ai.neural-nets.
[55] Anonymous FTP: ftp.cs.cmu.edu/afs/cs/project/connect/bench/xor.
APPENDIX A: PROGRAM LIST FOR DRIVEN.F
      PROGRAM DRIVEN
C ----------------------------------------------------------------
C WRITTEN BY JOHN P. CHANDLER AND YIJUN HUANG,
C COMPUTER SCIENCE DEPARTMENT, OKLAHOMA STATE UNIVERSITY, 1997.
C ----------------------------------------------------------------
C THIS IS A DRIVER TO IMPLEMENT A NEURAL NETWORK LEARNING
C ALGORITHM BASED ON THE LIMITED MEMORY QUASI-NEWTON METHOD.
C
C THE ACTUAL LIMITED MEMORY BFGS METHOD IS IMPLEMENTED BY
C MEANS OF THE SUBROUTINE MLBFGS, WHICH IS A SLIGHT
C MODIFICATION OF THE ROUTINE LBFGS WRITTEN BY JORGE NOCEDAL.
C
C INPUT FILE FORMAT:
C
C BEFORE THE PROGRAM IS EXECUTED, THE INPUT FILE "INPUT.DAT"
C MUST BE READY.
C
C 1ST LINE - TYPE, SEED
C 2ND LINE - NTRAIN, NTEST, NLAYER
C 3RD LINE - INTEGERS, WHICH REPRESENT THE NUMBER OF NODES (EXCLUDING
C            THE BIAS) IN EACH LAYER, STARTING FROM THE INPUT LAYER.
C THE REMAINDER OF THE FILE - EACH LINE CONTAINS THE DATA FOR ONE
C            EXAMPLE, INPUT DATA FOLLOWED BY THE DESIRED OUTPUT.
C
C VARIABLES:
C
C TYPE   IS AN INTEGER, WHICH REPRESENTS THE TYPE OF
C        THE TRAINING PROBLEM.
C        TYPE = 1: FUNCTION APPROXIMATION PROBLEM.
C        TYPE = 2: PATTERN CLASSIFICATION PROBLEM.
C
C SEED   IS AN INTEGER, WHICH IS USED TO GENERATE
C        RANDOM NUMBERS.
C
C NTRAIN, NTEST, NLAYER ARE ALL INTEGERS, WHICH
C        REPRESENT THE NUMBER OF TRAINING EXAMPLES, THE
C        NUMBER OF TEST EXAMPLES, AND THE NUMBER OF LAYERS
C        (INCLUDING THE INPUT LAYER) OF THE NETWORK,
C        RESPECTIVELY.
C
C M      IS AN INTEGER, WHICH REPRESENTS THE NUMBER OF CORRECTIONS
C        USED IN THE BFGS UPDATE.
C
C LAYIFO IS A TWO-DIMENSIONAL INTEGER ARRAY USED TO STORE THE
C        NETWORK STRUCTURE.
C        THE FIRST INDEX RUNS OVER LAYERS (INCLUDING THE INPUT LAYER).
C        THE 2ND INDEX IS DEFINED AS FOLLOWS:
C        (*,1) CONTAINS THE NUMBER OF NODES IN EACH LAYER,
C              EXCLUDING THE BIAS NODE.
C        (*,2) CONTAINS THE STARTING INDEX OF NEURON FOR
C              EACH LAYER.
C        (*,3) CONTAINS THE STARTING INDEX OF WEIGHT FOR
C              EACH LAYER.
C
C N      IS AN INTEGER, WHICH REPRESENTS THE NUMBER OF WEIGHTS.
C
C WEIGHT AND G ARE DOUBLE PRECISION ARRAYS, WHICH CONTAIN THE VALUES
C        OF THE WEIGHTS AND THE GRADIENT OF THE ERROR FUNCTION,
C        RESPECTIVELY.
C
C F      IS A DOUBLE PRECISION VARIABLE THAT CONTAINS THE VALUE OF
C        THE ERROR FUNCTION IN THE TRAINING PROCESS.
C
C ERTEST IS A DOUBLE PRECISION VARIABLE THAT CONTAINS THE VALUE OF
C        THE ERROR FUNCTION IN THE GENERALIZATION PROCESS WITH THE
C        TEST DATA SET.
C
C ICALL  IS AN INTEGER, WHICH REPRESENTS THE NUMBER OF
C        EVALUATIONS OF F AND G.
C
C TESTIN IS A TWO-DIMENSIONAL DOUBLE PRECISION ARRAY.
C        THE 1ST INDEX CONTAINS THE INPUTS FOR ONE TEST EXAMPLE.
C        THE 2ND INDEX RUNS OVER ALL TEST EXAMPLES IN THIS NETWORK.
C        I.E., EACH COLUMN OF THE ARRAY CONTAINS THE INPUT DATA FOR
C        ONE TEST EXAMPLE.
C
C TESOUT IS A TWO-DIMENSIONAL DOUBLE PRECISION ARRAY.
C        THE 1ST INDEX CONTAINS THE OUTPUT FOR ONE TEST EXAMPLE.
C        THE 2ND INDEX RUNS OVER ALL TEST EXAMPLES IN THIS NETWORK.
C        I.E., EACH COLUMN OF THE ARRAY CONTAINS THE OUTPUT DATA FOR
C        ONE TEST EXAMPLE.
C
C INDATA IS A TWO-DIMENSIONAL DOUBLE PRECISION ARRAY.
C        THE 1ST INDEX CONTAINS THE INPUTS FOR ONE TRAINING EXAMPLE.
C        THE 2ND INDEX RUNS OVER ALL TRAINING EXAMPLES IN THIS
C        NETWORK. I.E., EACH COLUMN OF THE ARRAY CONTAINS THE INPUT
C        DATA OF ONE TRAINING EXAMPLE.
C
C OUDATA IS A TWO-DIMENSIONAL DOUBLE PRECISION ARRAY.
C        THE 1ST INDEX CONTAINS THE OUTPUTS FOR ONE TRAINING EXAMPLE.
C        THE 2ND INDEX RUNS OVER ALL TRAINING EXAMPLES IN THIS
C        NETWORK. I.E., EACH COLUMN OF THE ARRAY CONTAINS THE OUTPUT
C        DATA FOR ONE TRAINING EXAMPLE.
C
C ANN    IS A LOGICAL VARIABLE. SINCE THE SUBROUTINE MLBFGS IS
C        CALLED BY TWO DRIVERS, AND DRIVEN IS USED TO IMPLEMENT THE
C        ARTIFICIAL NEURAL NETWORKS, WE SET ANN = .TRUE.
C        IN THIS DRIVER.
C
C IN1, INTRAI, INTEST, NOUT, N1, N2, L1, L2 ARE ALL POSITIVE
C        INTEGERS, USED TO PASS THE DIMENSION INDICES WHEN CALLING
C        SUBROUTINES.
C
C NEURON IS A TWO-DIMENSIONAL DOUBLE PRECISION ARRAY.
C        THE 1ST INDEX RUNS OVER ALL NEURONS, OR NODES, IN THE
C        NETWORK. THE 2ND INDEX IS DEFINED AS FOLLOWS:
C        (*,1) CONTAINS THE OUTPUT VALUE OF EACH NEURON.
C        (*,2) CONTAINS THE FIRST DERIVATIVE OF THE
C              ACTIVATION FUNCTION OF EACH NEURON.
C        (*,3) STORES THE PARTIAL DERIVATIVE OF THE ERROR
C              FUNCTION E(W) W.R.T. A NEURON OUTPUT.
C
C OTHER VARIABLES AND PARAMETERS ARE DESCRIBED IN THE SUBROUTINES
C LBSET AND MLBFGS.
C 
      IMPLICIT REAL*8 (A-H,O-Z)
      DOUBLE PRECISION WEIGHT(2000),G(2000),DIAG(2000),W(35000)
      DOUBLE PRECISION F,EPS,XTOL,GTOL,STPMIN,STPMAX,TOLERA,ER
      INTEGER IPRINT(2),IFLAG,ICALL,N,M,MP,LP,J,NFMAX,I,K,NOUT
      LOGICAL DIAGCO,ANN,GRD
C
      INTEGER TYPE,NTRAIN,NTEST,NLAYER,LAYIFO(5,3),II,
     *        SEED,IN1,INTRAI,INTEST,NOUT1,N1,N2,L1,L2,NF05
      DOUBLE PRECISION INDATA(25,5400),OUDATA(5,5400),
     *                 TESTIN(25,1800),TESOUT(5,1800),
     *                 NEURON(200,3),ERTEST,ERRATE
C
C
C
      COMMON /LB3/MP,LP,GTOL,STPMIN,STPMAX
      COMMON /IP/ NTRAIN, NLAYER
      XTOL = 1.0D-16
      EPS = 1.0D-4
      TOLERA = 1.0D-3
      N1 = 200
      N2 = 3
      L1 = 5
      L2 = 3
      IN1 = 25
      INTRAI = 5400
      INTEST = 1800
      NOUT = 5
      M = 5
      IPRINT(1) = 50
      IPRINT(2) = 0
C
C WE DO NOT WISH TO PROVIDE THE DIAGONAL MATRICES HK0, AND
C THEREFORE SET DIAGCO TO .FALSE.
C
      DIAGCO = .FALSE.
      ANN = .TRUE.
      ICALL = 0
      IFLAG = 0
      NFMAX = 2000
CALL LBSET
C
C READ IN THE INPUT FILE AND INITIALIZE THE WEIGHTS.
C
      CALL INPUT(INDATA,OUDATA,TESTIN,TESOUT,LAYIFO,NTEST,TYPE,SEED,
     *           L1,L2,IN1,INTRAI,INTEST,NOUT)
      CALL INWEIT(WEIGHT,N,LAYIFO,SEED,L1,L2)
C
C SET BIAS INPUT.
C
      DO 10 K=1, NLAYER
         II = LAYIFO(K, 2)
         NEURON(II, 1) = 1.0
   10 CONTINUE
C
C ---------- MAIN LOOP ----------
C
   20 CONTINUE
      F = 0.D0
      DO 30 I=1,N
         G(I) = 0.0D0
   30 CONTINUE
C
C COMPUTE THE VALUE AND GRADIENT OF THE ERROR FUNCTION.
C
      CALL GRADIE(F,WEIGHT,N,G,INDATA,OUDATA,LAYIFO,NEURON,
     *            N1,N2,L1,L2,IN1,INTRAI,INTEST,NOUT)
C
CALL MLBFGS(N,M,WEIGHT,F,G,DIAGCO,DIAG,IPRINT,EPS,
* XTOL,W,IFLAG,ANN,TOLERA)
C
      IF(IFLAG.EQ.1) THEN
C
C IF IFLAG=1, EVALUATE THE FUNCTION F AND GRADIENT G.
C
         ICALL = ICALL + 1
C
C WE ALLOW AT MOST NFMAX EVALUATIONS OF F AND G
C
IF (ICALL.GT.NFMAX) GO TO 40
GO TO 20
C
C
C IFLAG = 0 INDICATES THAT THE ROUTINE MLBFGS HAS TERMINATED
C SUCCESSFULLY; THEREFORE, EVALUATE THE TEST DATA. OTHERWISE, AN
C ERROR HAS OCCURRED, SO STOP THE EXECUTION OF THE PROGRAM.
C
      ELSE IF (IFLAG .EQ. 0) THEN
         ERTEST = ER(TYPE,TESTIN,TESOUT,NTEST,NEURON,
     *            LAYIFO,WEIGHT,N,NF05,N1,N2,L1,L2,IN1,INTEST,NOUT)
         WRITE(MP,50) ERTEST,EPS
         IF(TYPE .EQ. 2) THEN
            ERRATE = 1.0*NF05/NTEST
            WRITE(MP,60) NF05,ERRATE
         END IF
      END IF
   50 FORMAT(/' ERTEST = ',1PD10.3,2X,'EPS=',1PD10.3)
   60 FORMAT(/' NF05 = ',I5,2X,'ERRATE=',1PD10.3)
   40 CONTINUE
C
      STOP
      END
      SUBROUTINE INPUT(INDATA,OUDATA,TESTIN,TESOUT,LAYIFO,NTEST,
     *                 TYPE,SEED,L1,L2,IN1,INTRAI,INTEST,NOUT)
C 
C OPEN AND READ IN THE FILE "INPUT.DAT". STORE THE NETWORK INFORMATION
C INTO THE ARRAY LAYIFO, AND ALL TRAINING DATA AND TESTING DATA
C INTO INDATA, OUDATA, TESTIN, AND TESOUT, RESPECTIVELY.
C THE FORMAT OF "INPUT.DAT" AND ALL VARIABLES ARE DESCRIBED
C AT THE BEGINNING OF THE MAIN PROGRAM.
C 
      IMPLICIT REAL*8 (A-H,O-Z)
      INTEGER L1,L2,I,IN,J,IN1,INTRAI,INTEST,NOUT,NOUT1,NINPUT
      INTEGER IOERR,TYPE,SEED,NTRAIN,NTEST,LAYIFO(L1,L2),NLAYER
      DOUBLE PRECISION INDATA(IN1,INTRAI),OUDATA(NOUT,INTRAI),
     *                 TESTIN(IN1,INTEST),TESOUT(NOUT,INTEST)
C
COMMON /IP/NTRAIN, NLAYER
C
IN = 5
C
IOERR=O
OPEN (UNIT=IN, FILE='INPUT.DAT', STATUS='OLD', IOSTAT=IOERR)
IF(IOERR .NE. 0) THEN
PRINT 10, IOERR
   10 FORMAT('CANNOT OPEN INPUT DATA FILE, IOERR=',I10)
GOTO 120
END IF
C
      READ(IN,20) TYPE, SEED
   20 FORMAT(2I5)
      PRINT 30, TYPE, SEED
   30 FORMAT(/' TYPE =',I5,5X,'SEED =',I11)
C
C
      READ(IN,40) NTRAIN, NTEST, NLAYER
   40 FORMAT(3I5)
      PRINT 50, NTRAIN, NTEST, NLAYER
   50 FORMAT(/' NTRAIN =',I9,8X,'NTEST =',I9,8X,'NLAYER =',I2)
      READ(IN,60) (LAYIFO(I,1), I=1, NLAYER)
   60 FORMAT(5I5)
      PRINT 70, (LAYIFO(I,1), I=1, NLAYER)
C
C
   70 FORMAT(/' LAYIFO(I,1) = ',8I5)
      LAYIFO(1,2) = 1
      LAYIFO(1,3) = 0
      LAYIFO(2,3) = 1
C
C
      DO 80 I=2, NLAYER
         LAYIFO(I,2) = LAYIFO(I-1,1) + LAYIFO(I-1,2) + 1
   80 CONTINUE
      DO 90 I=3, NLAYER
         LAYIFO(I,3) = LAYIFO(I-1,3) + LAYIFO(I-1,1)*(LAYIFO(I-2,1)+1)
   90 CONTINUE
      NOUT1 = LAYIFO(NLAYER, 1)
      NINPUT = LAYIFO(1,1)
      DO 100 J=1, NTRAIN
         READ(IN,*) (INDATA(I,J),I=1,NINPUT), (OUDATA(I,J),I=1,NOUT1)
100 CONTINUE
C
c
C
      DO 110 J=1, NTEST
         READ(IN,*) (TESTIN(I,J),I=1,NINPUT), (TESOUT(I,J),I=1,NOUT1)
110 CONTINUE
CLOSE (IN)
C
  120 RETURN
      END
      SUBROUTINE INWEIT(WT,N,LAYIFO,SEED,L1,L2)
C ----------------------------------------------------------------
C THIS SUBROUTINE IS USED TO INITIALIZE ALL THE CONNECTION WEIGHTS.
C EACH WEIGHT IS INITIALIZED TO A RANDOM NUMBER BETWEEN -0.5/FANIN
C AND 0.5/FANIN, WHERE FANIN IS THE NUMBER OF NODES (INCLUDING THE
C BIAS NODE) IN THE PREVIOUS LAYER.
C
C THE CALLING STATEMENT IS
C
C     CALL INWEIT(WT,N,LAYIFO,SEED,L1,L2)
C
C WHERE
C
C WT     IS A DOUBLE PRECISION ARRAY, WHICH IS USED TO CONTAIN
C        THE INITIAL WEIGHTS.
C
C LAYIFO IS A TWO-DIMENSIONAL ARRAY OF INTEGERS, WHICH CONTAINS
C        SOME NETWORK INFORMATION, SUCH AS THE NUMBER OF NODES
C        IN EACH LAYER.
C
C SEED   IS AN INTEGER VARIABLE THAT IS USED TO GENERATE RANDOM
C        NUMBERS.
C
      IMPLICIT REAL*8 (A-H,O-Z)
      INTEGER L1, L2, LLL, I, K, NTRAIN, NLAYER
      INTEGER SEED, TEMP, FANIN, LAYIFO(L1,L2), N
      DOUBLE PRECISION WT(1), RANDOM
C
      COMMON /IP/ NTRAIN, NLAYER
C
      TEMP = 0
C
C ITERATE OVER ALL LAYERS (EXCLUDING THE INPUT LAYER).
C
      DO 20 K = 2, NLAYER
C
C FANIN = THE NUMBER OF NODES IN THE PREVIOUS LAYER + 1 (BIAS NODE)
C
         FANIN = LAYIFO(K-1,1)+1
         LLL = LAYIFO(K,1)*FANIN
         DO 10 I = 1,LLL
            WT(TEMP+I) = RANDOM(SEED)/FANIN
   10    CONTINUE
         TEMP = TEMP+I-1
   20 CONTINUE
C
C SET THE TOTAL NUMBER OF WEIGHTS.
C
      N = TEMP
      RETURN
END
      DOUBLE PRECISION FUNCTION RANDOM( SEED )
C ----------------------------------------------------------------
C THIS FUNCTION IS USED TO GENERATE A RANDOM NUMBER BETWEEN -0.5
C AND 0.5.
C REFERENCE: "A PORTABLE RANDOM NUMBER GENERATOR FOR USE IN SIGNAL
C            PROCESSING", SANDIA NATIONAL LABORATORIES TECHNICAL
C            REPORT, BY S. D. STEARNS.
C INPUT:  SEED
C RETURN: A DOUBLE PRECISION RANDOM NUMBER.
C
      IMPLICIT REAL*8 (A-H,O-Z)
      INTEGER SEED
C
      SEED = 2045*SEED + 1
      SEED = SEED - (SEED/1048576)*1048576
      RANDOM = (SEED+1)/1048577.0 - 0.5
C
RETURN
END
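For reference, the RANDOM routine above is a linear congruential generator modulo 2**20 = 1048576. A Python transcription (ours, for checking its behavior outside the Fortran program) is:

```python
def lcg_random(seed):
    """Linear congruential generator modulo 2**20 = 1048576, returning
    a value in roughly (-0.5, 0.5) together with the updated seed."""
    seed = (2045 * seed + 1) % 1048576
    value = (seed + 1) / 1048577.0 - 0.5
    return value, seed

# Draw a few values from an arbitrary starting seed.
seed = 12357
values = []
for _ in range(5):
    v, seed = lcg_random(seed)
    values.append(v)
```

The sequence is fully determined by the seed, which is what makes the weight initialization reproducible from run to run.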
DOUBLE PRECISION FUNCTION INPROD (VEC1, VEC2, DIM)
C
C THIS FUNCTION RETURNS THE INNER PRODUCT OF TWO VECTORS.
C IT IS USED IN SUBROUTINE COMOUl TO CALCULATE THE
C OUTPUT OF THE NETWORK.
C
INTEGER DIM, I
DOUBLE PRECISION VEC1(DIM), VEC2(DIM)
C
INPROD = 0.0
C
C ITERATE OVER ALL ELEMENTS OF THE VECTORS.
C
      DO 10 I=1, DIM
INPROD = INPROD + VEC1(I)*VEC2(I)
10 CONTINUE
C
RETURN
END
      DOUBLE PRECISION FUNCTION SIGMOD(X)
C ----------------------------------------------------------------
C THIS FUNCTION RETURNS THE VALUE OF THE SIGMOID ACTIVATION FUNCTION.
C IT IS USED IN SUBROUTINE COMOU1 TO CALCULATE THE
C OUTPUT OF THE NETWORK.
C
      IMPLICIT REAL*8 (A-H,O-Z)
      DOUBLE PRECISION X
C
      SIGMOD = 1.0/(1.0 + DEXP(-X))
C
      RETURN
END
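SIGMOD, together with the overflow guard that COMOU1 applies before calling it (bounds of roughly ±700 on the argument), corresponds to the following Python sketch (ours, for illustration):

```python
import math

UP, LOW = 700.0, -700.0  # clamp bounds applied before the sigmoid

def sigmoid(x):
    """Logistic activation 1/(1 + exp(-x)); saturate to 1.0 or 0.0
    outside [LOW, UP] so that exp() cannot overflow."""
    if x > UP:
        return 1.0
    if x < LOW:
        return 0.0
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(y):
    """Derivative written in terms of the output: f'(x) = f(x)*(1 - f(x))."""
    return y * (1.0 - y)
```

Expressing the derivative through the output value is what lets the program store it in NEURON(*,2) during the forward pass and reuse it during back-propagation.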
SUBROUTINE COMOU1(A,NEURON,LAYIFO,WT,N,N1,N2,L1,L2,INl,GRD)
C ----------------------------------------------------------------
C THIS SUBROUTINE IS USED TO COMPUTE THE OUTPUT OF THE NETWORK FOR
C EACH NEURON, AND STORE THEM IN ARRAY NEURON(*,1).
C IT IS CALLED BY THE FUNCTION ER TO CALCULATE THE VALUE OF
C THE ERROR FUNCTION ON THE TESTING DATA SET.
C IN ADDITION, IT COMPUTES THE DERIVATIVE OF THE ACTIVATION FUNCTION
C AND STORES THEM IN NEURON(*,2) FOR FURTHER REFERENCE.
C IT IS CALLED BY SUBROUTINE GRADIE TO CALCULATE THE VALUE AND
C GRADIENT OF THE ERROR FUNCTION ON THE TRAINING DATA SET.
C
C THE CALLING STATEMENT IS
C
C     CALL COMOU1(A,NEURON,LAYIFO,WT,N,N1,N2,L1,L2,IN1,GRD)
C
C WHERE
C
C A      IS A DOUBLE PRECISION ARRAY THAT CONTAINS INPUT DATA.
C
C NEURON IS A TWO-DIMENSIONAL DOUBLE PRECISION ARRAY THAT CONTAINS
C        THE OUTPUT OF THE NETWORK AND THE DERIVATIVE OF THE
C        ACTIVATION FUNCTION.
C
C LAYIFO IS A TWO-DIMENSIONAL INTEGER ARRAY WHICH CONTAINS
C        SOME NETWORK INFORMATION.
C
C WT     IS A DOUBLE PRECISION ARRAY THAT CONTAINS THE CURRENT
C        WEIGHTS.
C
C UP AND LOW ARE DOUBLE PRECISION VARIABLES. SINCE THE SIGMOID
C        FUNCTION F(X) WILL OVERFLOW WHEN THE ABSOLUTE VALUE OF X
C        IS TOO LARGE, WE SET THE FUNCTION VALUE TO 1.0
C        OR 0.0 WHEN X IS LARGER THAN UP OR SMALLER THAN LOW.
C
C 
      IMPLICIT REAL*8 (A-H,O-Z)
      INTEGER L1, L2, LLL, I, INDX, K, NDEX, NLAYER, NTRAIN
      INTEGER TEMP, WIDX, N, LAYIFO(L1,L2), N1, N2, IN1, JJ
      DOUBLE PRECISION SUM,WT(N),NEURON(N1,N2),INPROD,SIGMOD,
     *                 A(IN1), UP, LOW
      LOGICAL GRD
      COMMON /IP/ NTRAIN, NLAYER
C
C SET UPPER AND LOWER BOUNDS FOR THE SIGMOID FUNCTION.
C
      UP = 7.0D+2
      LOW = -7.0D+2
C
C COPY ALL INPUT DATA INTO NEURON(*,1). LAYIFO(1,1) = NUMBER OF INPUTS.
C
      LLL = LAYIFO(1,1)+1
      DO 10 I=2, LLL
         NEURON(I,1) = A(I-1)
   10 CONTINUE
C
C FORWARD PROPAGATION COMPUTATION OVER ALL LAYERS.
C
      DO 30 K=2, NLAYER
         INDX = LAYIFO(K,2)
         TEMP = LAYIFO(K,3)
         NDEX = LAYIFO(K-1,2)
         JJ = LAYIFO(K-1,1) + 1
C
C ITERATE OVER ALL NEURONS IN LAYER K.
C NOTE: LAYIFO(K,1) = NUMBER OF NEURONS IN THE K-TH LAYER.
C
         LLL = LAYIFO(K,1)
         DO 20 I=1,LLL
C
C LOCATE THE CORRESPONDING INDEX FOR NEURONS AND WEIGHTS
C
            INDX = INDX+1
            WIDX = TEMP + JJ * (I-1)
C
C COMPUTE THE WEIGHTED SUM WITH FUNCTION INPROD
C
            SUM = INPROD(NEURON(NDEX,1), WT(WIDX), JJ)
C
C THE VALUE OF THE SIGMOID FUNCTION APPROACHES 1.0 OR 0.0 WHEN
C THE ABSOLUTE VALUE OF SUM IS SUFFICIENTLY LARGE.
C
IF(SUM .LE. UP .AND. SUM .GE. LOW) THEN
NEURON (INDX,1) = SIGMOD(SUM)
ELSE IF (SUM .GT. UP) THEN
NEURON (INDX,1) = 1.0
ELSE
NEURON (INDX, 1) = 0.0
END IF
C
C COMPUTE THE DERIVATIVE OF THE SIGMOID ACTIVATION FUNCTION F(X) WHEN
C THIS ROUTINE IS CALLED BY SUBROUTINE GRADIE (GRD = .TRUE.).
C NOTE: F'(X) = F(X)*(1-F(X))
C
            IF (GRD)
     *         NEURON(INDX,2) = NEURON(INDX,1) * (1.0-NEURON(INDX,1))
   20    CONTINUE
   30 CONTINUE
C
RETURN
END
DOUBLE PRECISION FUNCTION ER(TYPE,INARY,OUTARY,NUMBER,NRON,
     * LAYIFO,WT,N,NF05,N1,N2,L1,L2,IN1,INTEST,NOUT)
C ----------------------------------------------------------------
C THIS FUNCTION IS USED TO EVALUATE THE TRAINING RESULT WITH
C THE TEST DATA SET WHEN THE TRAINING PROCESS HAS COMPLETED.
C IT CALCULATES THE NUMBER OF INCORRECTLY CLASSIFIED EXAMPLES
C AND RETURNS THE VALUE OF THE ERROR FUNCTION.
C
C TYPE   IS AN INTEGER VARIABLE THAT SPECIFIES THE TRAINING
C        PROBLEM TYPE: FUNCTION APPROXIMATION IF TYPE = 1,
C        PATTERN CLASSIFICATION IF TYPE = 2.
C
C INARY  IS A TWO-DIMENSIONAL ARRAY. EACH COLUMN CONTAINS THE INPUT
C        DATA FOR AN EXAMPLE.
C
C OUTARY IS A TWO-DIMENSIONAL ARRAY. EACH COLUMN CONTAINS THE DESIRED
C        OUTPUT FOR AN EXAMPLE.
C
C NUMBER IS AN INTEGER VARIABLE THAT INDICATES THE TOTAL NUMBER OF
C        EXAMPLES IN THE TEST DATA SET.
C
C NRON   IS A DOUBLE PRECISION ARRAY THAT CONTAINS THE ACTUAL
C        OUTPUT OF THE NETWORK.
C
C LAYIFO IS A TWO-DIMENSIONAL INTEGER ARRAY WHICH CONTAINS
C        SOME NETWORK INFORMATION.
C
C WT     IS A DOUBLE PRECISION ARRAY THAT CONTAINS THE CURRENT
C        WEIGHTS.
C
C NF05   IS THE NUMBER OF INCORRECTLY CLASSIFIED EXAMPLES.
C        IF THE ABSOLUTE VALUE OF THE DIFFERENCE BETWEEN THE
C        ACTUAL OUTPUT AND THE DESIRED OUTPUT FOR ANY OUTPUT OF AN
C        EXAMPLE IS GREATER THAN OR EQUAL TO 0.5, THEN THIS EXAMPLE
C        IS COUNTED AS INCORRECTLY CLASSIFIED, AND NF05 INCREASES
C        BY 1.
C        NOTE: IT IS USED ONLY FOR PATTERN CLASSIFICATION
C        PROBLEMS (TYPE = 2) AND WITH THE TEST DATA SET WHEN THE
C        TRAINING PROCESS HAS FINISHED.
C
      IMPLICIT REAL*8 (A-H,O-Z)
      INTEGER L1,L2,N1,N2,IN1,INTEST,NOUT
      INTEGER TYPE,NUMOUT,INDX,LAYIFO(L1,L2),N,NUMBER,NF05
      DOUBLE PRECISION WT(N), NRON(N1,N2), TEMP,
     *                 INARY(IN1,INTEST),OUTARY(NOUT,INTEST)
      LOGICAL ACCEPT,GRD
C
      COMMON /IP/ NTRAIN, NLAYER
C
      GRD = .FALSE.
      ER = 0
      NF05 = 0
NUMOUT = LAYIFO(NLAYER, 1)
C
C ITERATE OVER ALL EXAMPLES.
C
      DO 20 M=1, NUMBER
         ACCEPT = .TRUE.
C
C CALL COMOU1 TO COMPUTE THE OUTPUT FOR THE M-TH EXAMPLE
C
         CALL COMOU1(INARY(1,M),NRON,LAYIFO,WT,N,N1,N2,L1,L2,IN1,GRD)
C
C ITERATE OVER ALL OUTPUTS OF THE M-TH EXAMPLE.
C
         DO 10 I=1, NUMOUT
            INDX = LAYIFO(NLAYER, 2) + I
C
C COMPUTE THE DIFFERENCE BETWEEN THE ACTUAL OUTPUT AND THE DESIRED
C OUTPUT.
C
            TEMP = NRON(INDX,1) - OUTARY(I,M)
ER = ER + TEMP**2
C
C COMPUTE NF05, IF NECESSARY.
C
IF(TYPE.EQ.2 .AND. ACCEPT .AND. DABS (TEMP) .GE.O.5DO) THEN
NF05 = NF05+1
ACCEPT = .FALSE.
END IF
10 CONTINUE
20 CONTINUE
C
C COMPUTE THE SQUARED ERROR PERCENTAGE.
C
ER = ER * 100/( NUMBER * NUMOUT )
C
RETURN
END
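The evaluation performed by ER — the squared error percentage used by Proben1, plus the 0.5-threshold misclassification count NF05 — can be restated as a Python sketch (array shapes and names here are ours, not the Fortran interface):

```python
def evaluate(outputs, targets, classification=True):
    """Squared error percentage over all examples and outputs, plus
    the number of examples for which any |actual - desired| >= 0.5
    (the NF05 count used for classification problems)."""
    n_ex = len(outputs)
    n_out = len(outputs[0])
    sq_sum = 0.0
    nf05 = 0
    for out_row, tgt_row in zip(outputs, targets):
        wrong = False
        for o, t in zip(out_row, tgt_row):
            d = o - t
            sq_sum += d * d
            if abs(d) >= 0.5:
                wrong = True
        if classification and wrong:
            nf05 += 1
    return 100.0 * sq_sum / (n_ex * n_out), nf05
```

Note that an example counts as misclassified at most once, no matter how many of its outputs are off by 0.5 or more, which matches the ACCEPT flag in the Fortran loop.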
SUBROUTINE GRADIE (F,X,N,G,INDATA,OUDATA,LAYIFO,NEURON,
     * N1,N2,L1,L2,IN1,INTRAI,INTEST,NOUT)
C
C THIS SUBROUTINE IS USED TO COMPUTE THE VALUE AND GRADIENT OF THE
C ERROR FUNCTION E(W) .
C
C THE CALLING STATEMENT IS
C
C     CALL GRADIE (F,X,N,G,INDATA,OUDATA,LAYIFO,NEURON,
C    *             N1,N2,L1,L2,IN1,INTRAI,INTEST,NOUT)
C
C WHERE
C
C F      IS A DOUBLE PRECISION VARIABLE THAT CONTAINS THE VALUE
C        OF THE ERROR FUNCTION.
C
C X      IS A DOUBLE PRECISION ARRAY THAT CONTAINS THE CURRENT
C        WEIGHTS.
C
C N      IS AN INTEGER, WHICH IS THE NUMBER OF WEIGHTS.
C
C G      IS A DOUBLE PRECISION ARRAY THAT CONTAINS THE
C        GRADIENT AT THE POINT X.
C
C INDATA IS A TWO-DIMENSIONAL ARRAY. EACH COLUMN CONTAINS THE INPUT
C        DATA OF AN EXAMPLE.
C
C OUDATA IS A TWO-DIMENSIONAL ARRAY. EACH COLUMN CONTAINS THE OUTPUT
C        DATA OF AN EXAMPLE.
C
C LAYIFO IS A TWO-DIMENSIONAL INTEGER ARRAY THAT CONTAINS SOME
C        NETWORK INFORMATION.
C
C NEURON IS A TWO-DIMENSIONAL DOUBLE PRECISION ARRAY, THE FIRST
C        COLUMN OF WHICH CONTAINS THE OUTPUT OF EACH NEURON, THE
C        SECOND COLUMN CONTAINS THE DERIVATIVE OF THE ACTIVATION
C        FUNCTION, AND THE THIRD COLUMN CONTAINS THE PARTIAL OF
C        E(W) W.R.T. U(K,J).
C
      IMPLICIT REAL*8 (A-H,O-Z)
      INTEGER L1,L2,LLL,LLLL,LLLLL,I,J,K,KK,M,NDX1,NOUT1
      INTEGER LAYIFO(L1,L2), WDX, NDX, N, N1, N2, IN1,
     *        INTRAI, INTEST, NOUT, NTRAIN, NLAYER
      DOUBLE PRECISION NEURON(N1,N2), OUDATA(NOUT,INTRAI), COEF,
     *                 INDATA(IN1,INTRAI), X(N), G(N), F
      LOGICAL GRD
C
      COMMON /IP/ NTRAIN, NLAYER
C
NOUT1 = LAYIFO(NLAYER, 1)
C
C ITERATE OVER ALL TRAINING EXAMPLES.
C
      GRD = .TRUE.
      DO 80 M=1, NTRAIN
C
C COMPUTE THE OUTPUT OF THE M-TH EXAMPLE.
C
         CALL COMOU1(INDATA(1,M),NEURON,LAYIFO,X,N,N1,N2,L1,L2,IN1,GRD)
C
         DO 10 I=1, NOUT1
            NDX = LAYIFO(NLAYER, 2) + I
            NEURON(NDX,3) = NEURON(NDX,1) - OUDATA(I,M)
            F = F + NEURON(NDX,3)**2
10 CONTINUE
C
C BACKWARD PROPAGATION COMPUTATION OVER ALL LAYERS, STARTING FROM
C THE LAST HIDDEN LAYER.
C
         LLL = NLAYER-1
         DO 40 KK=2, LLL
            K = LLL+2-KK
C
C OVER ALL NEURONS IN LAYER K.
C
            LLLL = LAYIFO(K,1)
            DO 30 J=1, LLLL
               NDX = LAYIFO(K,2) + J
               NEURON(NDX,3) = 0
C
C OVER ALL NEURONS IN THE NEXT LAYER.
C
               LLLLL = LAYIFO(K+1,1)
               DO 20 I = 1, LLLLL
                  NDX1 = LAYIFO(K+1,2)+I
                  WDX = LAYIFO(K+1,3) + (I-1)*(LAYIFO(K,1)+1) + J
C
C COMPUTE THE PARTIAL OF E(W) W.R.T. U(K,J)
C
                  NEURON(NDX,3) = NEURON(NDX,3) +
     *               NEURON(NDX1,2)*NEURON(NDX1,3)*X(WDX)
   20          CONTINUE
   30       CONTINUE
   40    CONTINUE
C
C BACKWARD PROPAGATION COMPUTATION OVER ALL LAYERS AGAIN, STARTING
C FROM THE OUTPUT LAYER.
C
         DO 70 KK=2, NLAYER
            K = NLAYER+2-KK
C
C OVER ALL NEURONS IN THE K-TH LAYER.
C
            LLLL = LAYIFO(K,1)
            DO 60 J=1, LLLL
               NDX = LAYIFO(K,2) + J
C
C OVER ALL NEURONS IN THE PREVIOUS LAYER.
C
               LLL = LAYIFO(K-1,1)+1
               DO 50 I=1, LLL
                  WDX = LAYIFO(K,3) + (J-1)*(LAYIFO(K-1,1)+1) + I-1
                  NDX1 = LAYIFO(K-1,2) + I-1
C
C COMPUTE THE PARTIAL OF E(W) W.R.T. W(I,J,K)
C
                  G(WDX) = G(WDX) +
     *               NEURON(NDX,3)*NEURON(NDX,2)*NEURON(NDX1,1)
50 CONTINUE
60 CONTINUE
70 CONTINUE
80 CONTINUE
C
C COMPUTE THE SQUARED ERROR PERCENTAGE AND THE GRADIENT.
C
      COEF = 200.0 / (NTRAIN * NOUT1)
      F = 0.5 * F * COEF
      DO 90 I=1, N
         G(I) = G(I) * COEF
90 CONTINUE
C
RETURN
END
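As a sanity check on the backward pass above, the same error function (the squared error percentage with COEF = 200/(NTRAIN*NOUT1), so that the factor 0.5*COEF yields 100/(N*K)) can be written for a tiny one-hidden-layer sigmoid network in Python, and its analytic output-layer gradient compared against a central finite difference. The shapes and names are ours, not the thesis code:

```python
import math

def forward(x, W1, b1, W2, b2):
    """Hidden layer of sigmoids feeding sigmoid outputs."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    h = [sig(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    o = [sig(sum(w * hi for w, hi in zip(row, h)) + b)
         for row, b in zip(W2, b2)]
    return h, o

def error(data, W1, b1, W2, b2):
    """Squared error percentage: 100/(N*K) * sum over examples/outputs."""
    n, k = len(data), len(data[0][1])
    s = 0.0
    for x, t in data:
        _, o = forward(x, W1, b1, W2, b2)
        s += sum((oi - ti) ** 2 for oi, ti in zip(o, t))
    return 100.0 * s / (n * k)

def dw2_analytic(data, W1, b1, W2, b2, i, j):
    """Chain-rule gradient of the error w.r.t. W2[i][j], mirroring the
    backward pass for the output layer."""
    n, k = len(data), len(data[0][1])
    g = 0.0
    for x, t in data:
        h, o = forward(x, W1, b1, W2, b2)
        g += 2.0 * (o[i] - t[i]) * o[i] * (1.0 - o[i]) * h[j]
    return 100.0 * g / (n * k)

def dw2_fd(data, W1, b1, W2, b2, i, j, eps=1e-6):
    """Central finite difference w.r.t. W2[i][j]."""
    W2[i][j] += eps
    ep = error(data, W1, b1, W2, b2)
    W2[i][j] -= 2 * eps
    em = error(data, W1, b1, W2, b2)
    W2[i][j] += eps
    return (ep - em) / (2 * eps)

# XOR training set as a small concrete check.
DATA = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
W1, b1 = [[0.7, -0.3], [-0.5, 0.9]], [0.1, -0.2]
W2, b2 = [[0.3, -0.4]], [0.05]
```

Agreement between the analytic and finite-difference gradients to several digits is the standard way to verify a hand-coded back-propagation routine such as GRADIE.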
APPENDIX B: PROGRAM LIST FOR DRIVEF.F
      PROGRAM DRIVEF
C ----------------------------------------------------------------
C THIS IS THE DRIVER TO SOLVE THE FUNCTIONS FROM BRENT'S SUITE OF
C TEST PROBLEMS AND OSBORNE'S FUNCTIONS BY USING THE LIMITED MEMORY
C BFGS METHOD.
C
C REFERENCES:
C
C     "ALGORITHMS FOR MINIMIZATION WITHOUT DERIVATIVES",
C     RICHARD P. BRENT, PRENTICE-HALL, 1973, PAGE 138.
C
C     "SOME ASPECTS OF NONLINEAR LEAST SQUARES CALCULATIONS",
C     M. R. OSBORNE, NUMERICAL METHODS FOR NONLINEAR OPTIMIZATION,
C     F. A. LOOTSMA ED., ACADEMIC PRESS, NEW YORK, 1971, 171-189.
C
C THE SUBROUTINE TESTIN AND FUNCTION FTEST WERE WRITTEN BY
C DR. J. P. CHANDLER, COMPUTER SCIENCE DEPARTMENT,
C OKLAHOMA STATE UNIVERSITY.
C
C THE LIMITED MEMORY BFGS METHOD IS IMPLEMENTED BY MEANS OF THE
C SUBROUTINE MLBFGS, WHICH IS A SLIGHT MODIFICATION OF THE ROUTINE
C LBFGS BY JORGE NOCEDAL.
C
C VARIABLES:
C
C Y1, Y2, AND T ARE ARRAYS USED TO TEST OSBORNE'S FUNCTIONS.
C
C N      IS THE NUMBER OF VARIABLES.
C
C F      IS THE VALUE OF THE FUNCTION.
C
C X AND G ARE ARRAYS OF LENGTH N, WHICH CONTAIN THE VALUES OF THE
C        VARIABLES AND THE GRADIENT AT THE POINT X, RESPECTIVELY.
C
C M      IS THE NUMBER OF CORRECTIONS USED IN THE BFGS UPDATE.
C
C EPS    IS THE TOLERANCE OF THE STOPPING CRITERION OF SUBROUTINE
C        MLBFGS. MLBFGS TERMINATES WHEN ||G|| < EPS * MAX(1, ||X||).
C
C ICALL  IS THE NUMBER OF EVALUATIONS OF F AND G.
C
C ANN    IS A LOGICAL VARIABLE. SINCE THE SUBROUTINE MLBFGS
C        IS CALLED BY TWO DRIVERS, AND DRIVEF IS NOT USED TO
C        IMPLEMENT THE ARTIFICIAL NEURAL NETWORKS, WE SET
C        ANN = .FALSE. HERE; OTHERWISE SET ANN = .TRUE.
C
C OTHER VARIABLES AND PARAMETERS ARE DESCRIBED IN THE
C SUBROUTINES LBSET AND MLBFGS.
C ----------------------------------------------------------------
      IMPLICIT REAL*8 (A-H,O-Z)
      DOUBLE PRECISION X(2000),G(2000),DIAG(2000),W(35000),
     *                 YY1(33),YY2(65),Y1,Y2,T,TOLERA
      DOUBLE PRECISION F,EPS,XTOL,GTOL,STPMIN,STPMAX,PI,FTEST
      INTEGER IPRINT(2),IFLAG,ICALL,N,M,MP,LP,J,JFUNC,MAXJF,JFF,NFMAX
C
      DATA YY1/0.844D0,0.908D0,0.932D0,0.936D0,0.925D0,0.908D0,0.881D0,
     *         0.850D0,0.818D0,0.784D0,0.751D0,0.718D0,0.685D0,0.658D0,
     *         0.628D0,0.603D0,0.580D0,0.558D0,0.538D0,0.522D0,0.506D0,
     *         0.490D0,0.478D0,0.467D0,0.457D0,0.448D0,0.438D0,0.431D0,
     *         0.424D0,0.420D0,0.414D0,0.411D0,0.406D0/
      DATA YY2/1.366D0,1.191D0,1.112D0,1.013D0,0.991D0,0.885D0,0.831D0,
     *         0.847D0,0.786D0,0.725D0,0.746D0,0.679D0,0.608D0,0.655D0,
     *         0.616D0,0.606D0,0.602D0,0.626D0,0.651D0,0.724D0,0.649D0,
     *         0.649D0,0.694D0,0.644D0,0.624D0,0.661D0,0.612D0,0.558D0,
     *         0.533D0,0.495D0,0.500D0,0.423D0,0.395D0,0.375D0,0.372D0,
     *         0.391D0,0.396D0,0.405D0,0.428D0,0.429D0,0.523D0,0.562D0,
     *         0.607D0,0.653D0,0.672D0,0.708D0,0.633D0,0.668D0,0.645D0,
     *         0.632D0,0.591D0,0.559D0,0.597D0,0.625D0,0.739D0,0.710D0,
     *         0.729D0,0.720D0,0.636D0,0.581D0,0.428D0,0.292D0,0.162D0,
     *         0.098D0,0.054D0/
      LOGICAL DIAGCO,ANN
      DATA XTOL,EPS/1.0D-16,1.0D-7/
      COMMON /LB3/MP,LP,GTOL,STPMIN,STPMAX
      COMMON /PRTEST/ PI,JFUNC,MAXJF
      COMMON /OSB/Y1(33),Y2(65),T(65)
      M=5
      IPRINT(1)= 50
      IPRINT(2)= 0
C
C     WE DO NOT WISH TO PROVIDE THE DIAGONAL MATRICES HK0, AND
C     THEREFORE SET DIAGCO TO FALSE.
C
      DIAGCO= .FALSE.
C
C     WE ARE NOT IMPLEMENTING THE ARTIFICIAL NEURAL NETWORKS LEARNING
C     ALGORITHM, THEREFORE SET ANN TO FALSE.
C
      ANN = .FALSE.
      NFMAX = 2000
      TOLERA = 1.0D-3
C
C DEFINE THE DEFAULT VALUES OF SEVERAL PARAMETERS IN COMMON BLOCKS.
C
CALL LBSET
      DO 5 I=1,33
      Y1(I)=YY1(I)
    5 CONTINUE
      DO 6 I=1,65
      Y2(I)=YY2(I)
    6 CONTINUE
C
      DO 50 JFF=1,13
      ICALL=0
      IFLAG=0
JFUNC=JFF
IF(JFUNC .LT. 12) THEN
WRITE(LP,10)JFUNC
10 FORMAT(/' SOLVE PROBLEM NUMBER',I3,
* ' FROM THE TEST SUITE USED BY BRENT.')
END IF
C
IF(JFF.EQ.6) N=9
IF(JFF.EQ.9) N=10
IF(JFF.EQ.10) N=20
C
C INITIALIZATION
C
CALL TESTIN(X,N)
C
20 CONTINUE
   20 CONTINUE
      F= 0.D0
      DO 30 J=1,N
      G(J)=0.D0
30 CONTINUE
C
69
C CALCULATE THE FUNCTION F AND GRADIENT G.
C
IF(JFUNC .LT. 12) THEN
F = FTEST(X,N)
CALL FGRAD(N,X,G)
ELSE
CALL OSBORNE(F,X,N,G)
END IF
C
C
      CALL MLBFGS(N,M,X,F,G,DIAGCO,DIAG,IPRINT,EPS,XTOL,W,IFLAG,
     *            ANN,TOLERA)
IF(IFLAG.LE.O) GO TO 35
ICALL=ICALL + 1
C
C     WE ALLOW AT MOST NFMAX EVALUATIONS OF F AND G
C
      IF (ICALL.GT.NFMAX) GO TO 35
      GO TO 20
C
   35 WRITE (LP, 40)
      WRITE (LP, 45) (X(I), I=1,N)
   40 FORMAT(' VECTOR X= ')
   45 FORMAT(6(2X,1PD10.3))
50 CONTINUE
C
END
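For readers unfamiliar with the reverse-communication pattern used by this driver, the loop above can be paraphrased in Python (an illustrative sketch with hypothetical names, not part of the thesis code): the optimizer returns control whenever it needs the function value and gradient, and the driver evaluates them and calls again.

```python
def quadratic(x):
    """Stand-in for FTEST/FGRAD: f(x) = sum x_i^2, gradient 2x."""
    f = sum(xi * xi for xi in x)
    g = [2.0 * xi for xi in x]
    return f, g

def stepper(x, f, g, step=0.1):
    """One fixed-size gradient step; the returned flag 1 plays the role
    of IFLAG=1, asking the caller to re-evaluate f and g."""
    return [xi - step * gi for xi, gi in zip(x, g)], 1

def drive(x0, nfmax=200, eps=1e-8):
    """Mirror of the Fortran driver loop: evaluate, call, repeat,
    with at most nfmax evaluations (the role of NFMAX above)."""
    x = list(x0)
    for icall in range(nfmax):
        f, g = quadratic(x)                    # user-supplied F and G
        if max(abs(gi) for gi in g) < eps:     # stopping test
            return x, icall
        x, iflag = stepper(x, f, g)            # iflag=1: call again
    return x, nfmax
```

The real MLBFGS keeps its own state between calls and requests evaluations through IFLAG; the toy stepper here is stateless, but the calling pattern in `drive` is the same as in the driver above.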
C
SUBROUTINE TESTIN(X,N)
C
C TESTIN 1.2 SEPTEMBER 1995
C
C     J. P. CHANDLER, COMPUTER SCIENCE DEPARTMENT,
C     OKLAHOMA STATE UNIVERSITY
C
C     INITIALIZE FOR PROBLEM NUMBER JFUNC FROM BRENT'S TEST SUITE.
C
C     "ALGORITHMS FOR MINIMIZATION WITHOUT DERIVATIVES",
C     RICHARD P. BRENT, PRENTICE-HALL 1973, PAGE 138
C
C     CALL SUBROUTINE LBSET BEFORE CALLING TESTIN.
C
C     NOTE THAT FOR JFUNC = 6, 9, OR 10,
C     THE VALUE OF N MUST BE SET BEFORE CALLING TESTIN.
C
C
      IMPLICIT REAL*8 (A-H,O-Z)
      INTEGER N,LP,MP,JFUNC,MAXJF,J
      DOUBLE PRECISION X(N),PI,DATAN,GTOL,STPMIN,STPMAX,Y1,Y2,T
C
COMMON /PRTEST/ PI,JFUNC,MAXJF
COMMON /LB3/MP,LP,GTOL,STPMIN,STPMAX
COMMON /OSB/Y1(33) ,Y2(65),T(65)
C
MAXJF=13
C
PI=4.0DO*DATAN(1.0DO)
C
C
IF(JFUNC.LT.1 .OR. JFUNC.GT.MAXJF) STOP
GO TO (10,30,50,70,90,110,170,190,210,240,270,300,330) ,JFUNC
C
C JFUNC=1
C ROSENBROCK'S TEST FUNCTION
C
70
....
C THE MINIMUM IS F(1.0,1.0)=0.0 .
C
   10 N=2
      X(1)=-1.2D0
      X(2)=1.0D0
      WRITE(LP,20)
   20 FORMAT(' ROSENBROCK TEST FUNCTION')
RETURN
C
C JFUNC=2
C POWELL'S SINGULAR TEST FUNCTION
C
C THE MINIMUM IS F(0.0,0.0,0.0,0.0)=0.0
C
   30 N=4
      X(1)=3.0D0
      X(2)=-1.0D0
      X(3)=0.0D0
      X(4)=1.0D0
      WRITE (LP, 40)
   40 FORMAT(' SINGULAR TEST FUNCTION OF POWELL')
RETURN
C
C JFUNC=3
C HELICAL VALLEY TEST FUNCTION OF FLETCHER AND POWELL
C SINCE THE GRADIENT DOES NOT EXIST AT THE POINT (0,0,0),
C WE CHANGE THE INITIAL VALUE FROM (-1,0,0) TO (0.01,0.01,0).
C THE MINIMUM IS F(1.0,0.0,0.0)=0.0 .
C
   50 N=3
      X(1)=0.01D0
      X(2)=0.01D0
      X(3)=0.0D0
      WRITE(LP,60)
   60 FORMAT(' HELICAL VALLEY TEST FUNCTION OF FLETCHER AND POWELL')
RETURN
C
C JFUNC=4
C LEON'S CUBIC TEST FUNCTION
C
C THE MINIMUM IS F(1.0,1.0)=0.0
C
   70 N=2
      X(1)=-1.2D0
      X(2)=1.0D0
      WRITE (LP, 80)
   80 FORMAT(' CUBIC TEST FUNCTION OF LEON')
RETURN
C
C JFUNC=5
C BEALE'S TEST FUNCTION
C
C THE MINIMUM IS F(3.0,0.5)=0.0
C
   90 N=2
      X(1)=0.1D0
      X(2)=0.1D0
      WRITE(LP,100)
  100 FORMAT(' BEALE TEST FUNCTION')
RETURN
C
C JFUNC=6
C WATSON'S TEST FUNCTION (SEE KOWALIK AND OSBORNE)
C
C FOR N=6, THE MINIMUM IS NEAR
C F(-0.015725,1.012435,-0.232992,1.260430,-1.513729,0.992996)=
C     2.28767005355D-3 .
C FOR N=9, THE MINIMUM IS NEAR
C F(-0.000015,0.999790,0.014764,0.146342,1.00081,-2.617731,
C     4.104403,-3.143612,1.052627)=1.399760138D-6
C
  110 DO 120 J=1,N
      X(J)=0.0D0
  120 CONTINUE
      WRITE(LP,130)N
  130 FORMAT(' WATSON TEST FUNCTION WITH N =',I3)
      RETURN
C
C JFUNC=7
C POWELL'S 1964 TEST FUNCTION
C
C THE MINIMUM IS F(1.0,1.0,1.0)=0.0
C
  170 N=3
      X(1)=0.0D0
      X(2)=1.0D0
      X(3)=2.0D0
      WRITE(LP,180)
  180 FORMAT(' POWELL (1964) TEST FUNCTION')
RETURN
C
C JFUNC=8
C WOOD'S TEST FUNCTION
C
C THE MINIMUM IS F(1.0,1.0,1.0,1.0)=0.0
C
  190 N=4
      X(1)=-3.0D0
      X(2)=-1.0D0
      X(3)=-3.0D0
      X(4)=-1.0D0
      WRITE (LP, 200)
  200 FORMAT(' TEST FUNCTION OF WOOD')
RETURN
C
C JFUNC=9
C BRENT'S HILBERT MATRIX TEST FUNCTION
C
C THE MINIMUM IS F(0.0,0.0,0.0,...,0.0)=0.0
C
C FOR N.GT.10, PRAXIS MAY RUN FOR A VERY, VERY LONG TIME
C AND TERMINATE WITH SOME COMPONENTS OF X(*) FAR FROM ZERO,
C BECAUSE OF THE EXTREME ILL-CONDITIONING OF THIS PROBLEM.
C
C SHOULDN'T ILLCIN BE SET TO 1 FOR THESE FUNCTIONS,
C AT LEAST FOR LARGE VALUES OF N?
C
  210 DO 220 J=1,N
      X(J)=1.0D0
  220 CONTINUE
      WRITE(LP,230)N
  230 FORMAT(' HILBERT MATRIX TEST FUNCTION WITH N =',I3)
RETURN
C
C JFUNC=10
C TRIDIAGONAL MATRIX TEST FUNCTION
C
C THE MINIMUM IS F(N,N-1,N-2,...,2,1) = -N.
C
  240 DO 250 J=1,N
      X(J)=0.0D0
  250 CONTINUE
      WRITE(LP,260)N
  260 FORMAT(' TRIDIAGONAL MATRIX TEST FUNCTION WITH N =',I3)
RETURN
C
C JFUNC=11
C BOX'S TEST FUNCTION
C
C THE MINIMUM IS F(1.0,10.0,1.0)=0.0
C
  270 N=3
      X(1)=0.0D0
      X(2)=10.0D0
      X(3)=20.0D0
WRITE (LP, 280)
280 FORMAT(' TEST FUNCTION OF BOX')
RETURN
C
C JFUNC=12
C OSBORNE 1 FUNCTION
C
C THE MINIMUM IS F(0.3754,1.9358,-1.4647,0.01287,
C     0.02212)=0.546D-4
C
  300 N=5
      X(1)=0.5D0
      X(2)=1.5D0
      X(3)=-1.0D0
      X(4)=1.0D-2
      X(5)=2.0D-2
      DO 310 J=1,33
      T(J)=10.0D0*(J-1)
  310 CONTINUE
WRITE(LP,320)
320 FORMAT(/' TEST FUNCTION OF OSBORNE 1')
RETURN
C
C JFUNC=13
C OSBORNE 2 FUNCTION
C
C THE MINIMUM IS F(1.3100,0.4315,0.6336,0.5993,0.7539,0.9056,
C     1.3651,4.8248,2.3988,4.5689,5.6754)=0.0402
C
  330 N=11
      X(1)=1.3D0
      X(2)=6.5D-1
      X(3)=6.5D-1
      X(4)=7.0D-1
      X(5)=6.0D-1
      X(6)=3.0D0
      X(7)=5.0D0
      X(8)=7.0D0
      X(9)=2.0D0
      X(10)=4.5D0
      X(11)=5.5D0
      DO 340 J=1,65
      T(J)=0.1D0*(J-1)
  340 CONTINUE
      WRITE(LP,350)
  350 FORMAT(/' TEST FUNCTION OF OSBORNE 2')
RETURN
END
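A few of the test problems initialized above can be written out directly, so the documented minima are easy to check (a Python sketch for illustration only; the function names are ours, not part of the thesis code):

```python
def rosenbrock(x):
    # f(x1,x2) = 100(x2 - x1^2)^2 + (1 - x1)^2, minimum f(1,1) = 0
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def leon_cubic(x):
    # f(x1,x2) = 100(x2 - x1^3)^2 + (1 - x1)^2, minimum f(1,1) = 0
    return 100.0 * (x[1] - x[0] ** 3) ** 2 + (1.0 - x[0]) ** 2

def beale(x):
    # Minimum f(3, 0.5) = 0
    return ((1.5 - x[0] * (1.0 - x[1])) ** 2
            + (2.25 - x[0] * (1.0 - x[1] ** 2)) ** 2
            + (2.625 - x[0] * (1.0 - x[1] ** 3)) ** 2)
```

Evaluating each function at its documented minimizer returns zero, which is a quick sanity check on the FORMAT comments above.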
C
      DOUBLE PRECISION FUNCTION FTEST(X,N)
C
C FTEST 1.2          SEPTEMBER 1995
C
C COMPUTE THE VALUE OF FUNCTION NUMBER JFUNC
C FROM BRENT'S SUITE OF TEST PROBLEMS.
C
C J. P. CHANDLER, COMPUTER SCIENCE DEPARTMENT,
C OKLAHOMA STATE UNIVERSITY
C
C "ALGORITHMS FOR MINIMIZATION WITHOUT DERIVATIVES",
C RICHARD P. BRENT, PRENTICE-HALL 1973, PAGES 137-154, 164-166
C
      IMPLICIT REAL*8 (A-H,O-Z)
      INTEGER N,JFUNC,MAXJF,I,IMAX,J,JEVEN,JJ,JJMAX
      DOUBLE PRECISION X(20),Y,PI,DATAN,DEXP,
     *       DSIN,DSQRT,F,P,R,S,T,TERM,U,YY
C
C
COMMON /PRTEST/ PI,JFUNC,MAXJF
IF(JFUNC.LT.1 .OR. JFUNC.GT.MAXJF) STOP
GO TO (10,20,30,40,50,60,140,150,160,190,210),JFUNC
C
C JFUNC=1
C ROSENBROCK'S TEST FUNCTION
C
   10 FTEST=100.0D0*(X(2)-X(1)**2)**2+(1.0D0-X(1))**2
      RETURN
C
C JFUNC=2
C POWELL'S SINGULAR TEST FUNCTION
C
   20 FTEST=(X(1)+10.0D0*X(2))**2+5.0D0*(X(3)-X(4))**2+
     *      (X(2)-2.0D0*X(3))**4+10.0D0*(X(1)-X(4))**4
      RETURN
C
C JFUNC=3
C HELICAL VALLEY TEST FUNCTION OF FLETCHER AND POWELL
C
   30 R=DSQRT(X(1)**2+X(2)**2)
C
      IF(X(1).EQ.0.0D0) THEN
         T=0.25D0
      ELSE
         T=DATAN(X(2)/X(1))/(2.0D0*PI)
      ENDIF
C
      IF(X(1).LT.0.0D0) T=T+0.5D0
      FTEST=100.0D0*((X(3)-10.0D0*T)**2+(R-1.0D0)**2)+X(3)**2
      RETURN
C
C JFUNC=4
C LEON'S CUBIC TEST FUNCTION
C
   40 FTEST=100.0D0*(X(2)-X(1)**3)**2+(1.0D0-X(1))**2
      RETURN
C
C JFUNC=5
C BEALE'S TEST FUNCTION
C
   50 FTEST=(1.5D0-X(1)*(1.0D0-X(2)))**2+
     *      (2.25D0-X(1)*(1.0D0-X(2)**2))**2+
     *      (2.625D0-X(1)*(1.0D0-X(2)**3))**2
      RETURN
C
C JFUNC=6
C WATSON'S TEST FUNCTION (SEE KOWALIK AND OSBORNE)
C
   60 S=X(1)**2+(X(2)-X(1)**2-1.0D0)**2
      DO 90 I=2,30
      YY=(I-1)/29.0D0
      T=X(N)
      JJMAX=N-1
      DO 70 JJ=1,JJMAX
      J=JJMAX+1-JJ
      T=X(J)+YY*T
   70 CONTINUE
      U=(N-1)*X(N)
C
      DO 80 JJ=2,JJMAX
      J=JJMAX+2-JJ
      U=(J-1)*X(J)+YY*U
   80 CONTINUE
      S=S+(U-T*T-1.0D0)**2
   90 CONTINUE
      FTEST=S
      RETURN
C
C JFUNC=7
C POWELL'S 1964 TEST FUNCTION
C
  140 IF(X(2).EQ.0.0D0) THEN
         TERM=0.0D0
      ELSE
         TERM=DEXP(-((X(1)+X(3))/X(2)-2.0D0)**2)
      ENDIF
      FTEST=3.0D0-1.0D0/(1.0D0+(X(1)-X(2))**2)-
     *      DSIN(0.5D0*PI*X(2)*X(3))-TERM
      RETURN
C
C JFUNC=8
C WOOD'S TEST FUNCTION
C
  150 FTEST=100.0D0*(X(2)-X(1)**2)**2+(1.0D0-X(1))**2+
     *      90.0D0*(X(4)-X(3)**2)**2+(1.0D0-X(3))**2+
     *      10.1D0*((X(2)-1.0D0)**2+(X(4)-1.0D0)**2)+
     *      19.8D0*(X(2)-1.0D0)*(X(4)-1.0D0)
      RETURN
C
C JFUNC=9
C BRENT'S HILBERT MATRIX TEST FUNCTION
C
  160 S=0.0D0
      DO 180 I=1,N
      T=0.0D0
      DO 170 J=1,N
      T=T+X(J)/(I+J-1.0D0)
  170 CONTINUE
      S=S+T*X(I)
  180 CONTINUE
      FTEST=S
      RETURN
C
C JFUNC=10
C TRIDIAGONAL MATRIX TEST FUNCTION
C
  190 S=X(1)*(X(1)-X(2))
      IMAX=N-1
      DO 200 I=2,IMAX
      S=S+X(I)*((X(I)-X(I-1))+(X(I)-X(I+1)))
  200 CONTINUE
      FTEST=S+X(N)*(2.0D0*X(N)-X(N-1))-2.0D0*X(1)
      RETURN
C
C JFUNC=11
C BOX'S TEST FUNCTION
C
  210 S=0.0D0
      DO 220 I=1,10
      P=I/10.0D0
      IF(P*X(2).LT.40.0D0) THEN
         TERM=DEXP(-P*X(2))
      ELSE
         TERM=0.0D0
      ENDIF
      S=S+(DEXP(-P*X(1))-TERM-X(3)*(DEXP(-P)-DEXP(-10.0D0*P)))**2
  220 CONTINUE
      FTEST=S
      RETURN
      END
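The Watson sum computed at label 60 above uses Horner's rule twice per sample point, once for the polynomial value and once for its derivative. A Python transcription (our sketch, not part of the listing) makes the structure easier to follow:

```python
def watson(x):
    """Watson test function, mirroring the Fortran loops at label 60."""
    n = len(x)
    s = x[0] ** 2 + (x[1] - x[0] ** 2 - 1.0) ** 2
    for i in range(2, 31):                    # I = 2, ..., 30
        t_i = (i - 1) / 29.0
        # poly = sum_j x_j * t_i**(j-1) by Horner's rule
        poly = x[n - 1]
        for j in range(n - 2, -1, -1):
            poly = x[j] + t_i * poly
        # deriv = sum_j (j-1) * x_j * t_i**(j-2), also by Horner
        deriv = (n - 1) * x[n - 1]
        for j in range(n - 2, 0, -1):
            deriv = j * x[j] + t_i * deriv
        s += (deriv - poly * poly - 1.0) ** 2
    return s
```

At the all-zero starting point used by TESTIN, every one of the 29 residuals equals -1 and the extra term contributes 1, so the starting function value is 30 for any N.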
C
      SUBROUTINE FGRAD(N,X,G)
C
C COMPUTE THE GRADIENT OF FUNCTION NUMBER JFUNC
C FROM BRENT'S SUITE OF TEST PROBLEMS.
C
C "ALGORITHMS FOR MINIMIZATION WITHOUT DERIVATIVES",
C RICHARD P. BRENT, PRENTICE-HALL 1973, PAGES 137-154, 164-166
C
      IMPLICIT REAL*8 (A-H,O-Z)
      INTEGER N,JFUNC,MAXJF,I,IMAX,J,JEVEN,JJ,JJMAX
      DOUBLE PRECISION X(N),G(N),V,PI,DATAN,DEXP,
     *       DCOS,DSQRT,F,P,R,S,T(4),C(3),TERM,U,YY
C
COMMON /PRTEST/ PI,JFUNC,MAXJF
C
IF(JFUNC.LT.1 .OR. JFUNC.GT.MAXJF) STOP
GO TO (10,20,30,40,50,60,140,150,160,190,210) ,JFUNC
C
C JFUNC=1
C ROSENBROCK'S TEST FUNCTION
C
   10 G(2)=2.0D2*(X(2)-X(1)**2)
      G(1)=-2.0D0*(X(1)*G(2)+1.0D0-X(1))
      RETURN
C
C JFUNC=2
C POWELL'S SINGULAR TEST FUNCTION
C
   20 T(1)=2*(X(1)+10*X(2))
      T(2)=40*(X(1)-X(4))**3
      T(3)=4*(X(2)-2*X(3))**3
      T(4)=10*(X(3)-X(4))
      G(1)=T(1)+T(2)
      G(2)=10*T(1)+T(3)
      G(3)=T(4)-2*T(3)
      G(4)=-T(2)-T(4)
      RETURN
C
C JFUNC=3
C HELICAL VALLEY TEST FUNCTION OF FLETCHER AND POWELL
C
   30 U=X(1)**2+X(2)**2
      R=DSQRT(U)
      R=2.0D2*(R-1)/R
      S=DATAN(X(2)/X(1))
      S=2.0D2*(X(3)-5.0D0*S/PI)
      G(3)=S+2.0D0*X(3)
      S=5.0D0*S/(U*PI)
      G(2)=-X(1)*S+R*X(2)
      G(1)=X(2)*S+R*X(1)
      RETURN
C
C JFUNC=4
C LEON'S CUBIC TEST FUNCTION
C
   40 G(2)=2.0D2*(X(2)-X(1)**3)
      G(1)=-3*G(2)*X(1)**2-2*(1-X(1))
      RETURN
C
C JFUNC=5
C BEALE'S TEST FUNCTION
C
   50 C(1)=1.5D0
      C(2)=2.25D0
      C(3)=2.625D0
      DO 52 I=1,3
      T(I)=1-X(2)**I
   52 CONTINUE
      DO 55 I=1,3
      G(1)=G(1)-(C(I)-X(1)*T(I))*T(I)
      G(2)=G(2)+(C(I)-X(1)*T(I))*I*X(1)*X(2)**(I-1)
   55 CONTINUE
      G(1)=2*G(1)
      G(2)=2*G(2)
      RETURN
C
C JFUNC=6
C WATSON'S TEST FUNCTION (SEE KOWALIK AND OSBORNE)
C
   60 DO 100 I=2,30
      YY=(I-1)/29.0D0
      V=X(N)
      JJMAX=N-1
      DO 70 JJ=1,JJMAX
      J=JJMAX+1-JJ
      V=X(J)+YY*V
   70 CONTINUE
      U=(N-1)*X(N)
      DO 80 JJ=2,JJMAX
      J=JJMAX+2-JJ
      U=(J-1)*X(J)+YY*U
   80 CONTINUE
      U=2.0D0*(U-V*V-1.0D0)
      R=1.0D0
      DO 90 JJ=2,N
      G(JJ)=G(JJ)+U*((JJ-1)*R-2.0D0*V*R*YY)
      R=R*YY
   90 CONTINUE
      G(1)=G(1)-U*2.0D0*V
  100 CONTINUE
C
      R=X(2)-X(1)*X(1)-1.0D0
      G(1)=G(1)+2.0D0*X(1)*(1.0D0-2.0D0*R)
      G(2)=G(2)+2.0D0*R
      RETURN
C
C JFUNC=7
C POWELL'S 1964 TEST FUNCTION
C
  140 S=(X(1)+X(3))/X(2)-2.0D0
      R=(X(1)-X(2))/((1.0D0+(X(1)-X(2))**2)**2)
      U=S*DEXP(-S*S)/X(2)
      P=0.5D0*PI
      V=P*DCOS(P*X(2)*X(3))
      G(1)=2.0D0*(R+U)
      G(2)=-2.0D0*(R+U*(X(1)+X(3))/X(2))-X(3)*V
      G(3)=2.0D0*U-X(2)*V
      RETURN
C
C JFUNC=8
C WOOD'S TEST FUNCTION
C
  150 T(1)=X(2)-X(1)*X(1)
      T(2)=X(4)-X(3)*X(3)
      G(1)=-4.0D2*T(1)*X(1)-2.0D0*(1.0D0-X(1))
      G(2)=2.0D2*T(1)+2.0D0*10.1D0*(X(2)-1.0D0)+19.8D0*(X(4)-1.0D0)
      G(3)=-3.6D2*T(2)*X(3)-2.0D0*(1.0D0-X(3))
      G(4)=1.8D2*T(2)+2.0D0*10.1D0*(X(4)-1.0D0)+19.8D0*(X(2)-1.0D0)
      RETURN
" "I
JFUNC=9
BRENT'S HILBERT MATRIX TEST FUNCTION
C
C
C
C
160
170
180
DO 180 1=1, N
DO 170 J=l, N
G(I)=G(I)+2.0DO*X(J)/(I+J1)
CONTINUE
CONTINUE
RETURN
G(1)=2*(X(1)X(2}1)
00 200 1=2, N1
G(I)=2.0DO*(2.0DO*X(I)X(I1)X(I+1)
CONTINUE
G(N)=2*(2*X(N)X(N1)
RETURN
200
CC
JFUNC=10
C TRIDIAGONAL MATRIX TEST FUNCTION
C
190
DO 240 1=1, 10
P=I/10.0DO
T(l)=DEXP(X(l)*P)
T(2)=DEXP(X(2)*P}
T (3) =DEXP (1. OD1*P) DEXP (P)
S=2*(T(1)+T(2)+X(3)*T(3»)
G(l)=G(l)+S*T(l)*P
G(2)=G(2)+S*T(2)*P
G(3)=G(3)+S*T{3)
CONTINUE
RETURN
END
240
CC
JFUNC=l1
C BOX'S TEST FUNCTION
C
210
C
      SUBROUTINE OSBORNE(F,X,N,G)
C
C COMPUTE THE VALUES AND THE GRADIENTS FOR OSBORNE FUNCTIONS 1 AND 2.
C
      IMPLICIT REAL*8 (A-H,O-Z)
      INTEGER N,JFUNC,MAXJF,J,I
      DOUBLE PRECISION X(N),G(N),DEXP,FTX,F,R,S,TEMP(4),PI,Y1,Y2,T
      COMMON /PRTEST/ PI,JFUNC,MAXJF
      COMMON /OSB/Y1(33),Y2(65),T(65)
      IF(JFUNC.LT.12 .OR. JFUNC.GT.13) STOP
      F=0.0D0
      DO 10 I=1,N
      G(I)=0.0D0
   10 CONTINUE
C
C JFUNC=12
C OSBORNE FUNCTION 1
C
      IF(JFUNC .EQ. 12) THEN
         DO 35 J=1,33
         R=DEXP((-1)*X(4)*T(J))
         S=DEXP((-1)*X(5)*T(J))
         FTX=X(1)+X(2)*R+X(3)*S-Y1(J)
         F = F+FTX**2
         G(1)=G(1)+FTX
         G(2)=G(2)+FTX*R
         G(3)=G(3)+FTX*S
         G(4)=G(4)-FTX*X(2)*T(J)*R
         G(5)=G(5)-FTX*X(3)*T(J)*S
   35    CONTINUE
C
C JFUNC=13
C OSBORNE FUNCTION 2
C
      ELSE
         DO 45 J=1,65
         TEMP(1)=DEXP(-X(5)*T(J))
         TEMP(2)=DEXP(-X(6)*(T(J)-X(9))**2)
         TEMP(3)=DEXP(-X(7)*(T(J)-X(10))**2)
         TEMP(4)=DEXP(-X(8)*(T(J)-X(11))**2)
         FTX=0.D0
         DO 40 I=1,4
         FTX=FTX+X(I)*TEMP(I)
   40    CONTINUE
         FTX=FTX-Y2(J)
         F = F+FTX**2
         DO 42 I=1,4
         G(I)=G(I)+FTX*TEMP(I)
   42    CONTINUE
         G(5)=G(5)-FTX*T(J)*X(1)*TEMP(1)
         G(6)=G(6)-FTX*X(2)*TEMP(2)*((T(J)-X(9))**2)
         G(7)=G(7)-FTX*X(3)*TEMP(3)*((T(J)-X(10))**2)
         G(8)=G(8)-FTX*X(4)*TEMP(4)*((T(J)-X(11))**2)
         G(9)=G(9)+FTX*X(2)*TEMP(2)*X(6)*2*(T(J)-X(9))
         G(10)=G(10)+FTX*X(3)*TEMP(3)*X(7)*2*(T(J)-X(10))
         G(11)=G(11)+FTX*X(4)*TEMP(4)*X(8)*2*(T(J)-X(11))
   45    CONTINUE
      END IF
      DO 49 J=1,N
      G(J)=2*G(J)
   49 CONTINUE
C
END
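The OSBORNE subroutine above accumulates half-gradients and doubles them at the end. A Python sketch of the Osborne 1 branch (our transcription, not the thesis code; the zero data vector used in the check below is hypothetical, since only the agreement between the analytic gradient and finite differences matters):

```python
import math

# Sample abscissas T(J) = 10*(J-1) for J = 1..33, as set by TESTIN.
T1 = [10.0 * j for j in range(33)]

def osborne1(x, y):
    """Sum-of-squares value and analytic gradient of the Osborne 1 model
    x1 + x2*exp(-x4*t) + x3*exp(-x5*t) against data y."""
    f = 0.0
    g = [0.0] * 5
    for tj, yj in zip(T1, y):
        r = math.exp(-x[3] * tj)
        s = math.exp(-x[4] * tj)
        ftx = x[0] + x[1] * r + x[2] * s - yj   # residual
        f += ftx ** 2
        g[0] += ftx
        g[1] += ftx * r
        g[2] += ftx * s
        g[3] -= ftx * x[1] * tj * r
        g[4] -= ftx * x[2] * tj * s
    return f, [2.0 * gi for gi in g]            # final factor of 2
```

A central-difference check at the standard starting point confirms the gradient accumulation, which is the same check one would apply to the Fortran.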
APPENDIX C: PROGRAM LIST FOR MLBFGS.F
C
C     THIS SUBROUTINE IS USED TO IMPLEMENT THE LIMITED MEMORY
C     BFGS METHOD.  IT IS A VERY SLIGHT MODIFICATION OF THE ROUTINE
C     LBFGS WRITTEN BY JORGE NOCEDAL.
C     THE MODIFICATIONS ARE:
C     1. REPLACE THE BLOCK DATA LB2 BY THE SUBROUTINE LBSET.
C     2. ADD A LOGICAL VARIABLE ANN IN THE CALLING STATEMENT, SO THAT
C        THE ROUTINE CAN BE CALLED BY DIFFERENT DRIVERS AND USE
C        DIFFERENT STOPPING CRITERIA.
C     3. IF THE SUBROUTINE IS USED TO IMPLEMENT THE NEURAL NETWORK
C        LEARNING ALGORITHM, THEN ANN = TRUE, AND WE ADD A NEW STOPPING
C        CRITERION, I.E., THE SUBROUTINE TERMINATES
C        WHEN ||WK+1 - WK|| < TOLERA.
C
C     ****************
C     LBFGS SUBROUTINE
C     ****************
      SUBROUTINE MLBFGS(N,M,X,F,G,DIAGCO,DIAG,IPRINT,EPS,XTOL,
     *                  W,IFLAG,ANN,TOLERA)
C
      INTEGER N,M,IPRINT(2),IFLAG
      DOUBLE PRECISION X(N),G(N),DIAG(N),W(1),F,EPS,XTOL
      LOGICAL DIAGCO,ANN
C
C        LIMITED MEMORY BFGS METHOD FOR LARGE SCALE OPTIMIZATION
C                          JORGE NOCEDAL
C                        *** JULY 1990 ***
C
C     THIS SUBROUTINE SOLVES THE UNCONSTRAINED MINIMIZATION PROBLEM
C
C                      MIN F(X),    X= (X1,X2,...,XN),
C
C      USING THE LIMITED MEMORY BFGS METHOD. THE ROUTINE IS ESPECIALLY
C      EFFECTIVE ON PROBLEMS INVOLVING A LARGE NUMBER OF VARIABLES. IN
C      A TYPICAL ITERATION OF THIS METHOD AN APPROXIMATION HK TO THE
C      INVERSE OF THE HESSIAN IS OBTAINED BY APPLYING M BFGS UPDATES TO
C      A DIAGONAL MATRIX HK0, USING INFORMATION FROM THE PREVIOUS M
C      STEPS.  THE USER SPECIFIES THE NUMBER M, WHICH DETERMINES THE
C      AMOUNT OF STORAGE REQUIRED BY THE ROUTINE. THE USER MAY ALSO
C      PROVIDE THE DIAGONAL MATRICES HK0 IF NOT SATISFIED WITH THE
C      DEFAULT CHOICE. THE ALGORITHM IS DESCRIBED IN "ON THE LIMITED
C      MEMORY BFGS METHOD FOR LARGE SCALE OPTIMIZATION", BY D. LIU AND
C      J. NOCEDAL, MATHEMATICAL PROGRAMMING B 45 (1989) 503-528.
C
C      THE USER IS REQUIRED TO CALCULATE THE FUNCTION VALUE F AND ITS
C      GRADIENT G. IN ORDER TO ALLOW THE USER COMPLETE CONTROL OVER
C      THESE COMPUTATIONS, REVERSE COMMUNICATION IS USED. THE ROUTINE
C      MUST BE CALLED REPEATEDLY UNDER THE CONTROL OF THE PARAMETER
C      IFLAG.
C
C      THE STEPLENGTH IS DETERMINED AT EACH ITERATION BY MEANS OF THE
C      LINE SEARCH ROUTINE MCVSRCH, WHICH IS A SLIGHT MODIFICATION OF
C      THE ROUTINE CSRCH WRITTEN BY MORE' AND THUENTE.
C
C      THE CALLING STATEMENT IS
C
C          CALL MLBFGS(N,M,X,F,G,DIAGCO,DIAG,IPRINT,EPS,XTOL,W,IFLAG,
C                      ANN,TOLERA)
C
C      WHERE
C
C     N       IS AN INTEGER VARIABLE THAT MUST BE SET BY THE USER TO THE
C             NUMBER OF VARIABLES. IT IS NOT ALTERED BY THE ROUTINE.
C             RESTRICTION: N>0.
C
C     M       IS AN INTEGER VARIABLE THAT MUST BE SET BY THE USER TO
C             THE NUMBER OF CORRECTIONS USED IN THE BFGS UPDATE. IT
C             IS NOT ALTERED BY THE ROUTINE. VALUES OF M LESS THAN 3 ARE
C             NOT RECOMMENDED; LARGE VALUES OF M WILL RESULT IN EXCESSIVE
C             COMPUTING TIME. 3<= M <=7 IS RECOMMENDED. RESTRICTION: M>0.
C
C     X       IS A DOUBLE PRECISION ARRAY OF LENGTH N. ON INITIAL ENTRY
C             IT MUST BE SET BY THE USER TO THE VALUES OF THE INITIAL
C             ESTIMATE OF THE SOLUTION VECTOR. ON EXIT WITH IFLAG=0, IT
C             CONTAINS THE VALUES OF THE VARIABLES AT THE BEST POINT
C             FOUND (USUALLY A SOLUTION).
C
C     F       IS A DOUBLE PRECISION VARIABLE. BEFORE INITIAL ENTRY AND ON
C             A RE-ENTRY WITH IFLAG=1, IT MUST BE SET BY THE USER TO
C             CONTAIN THE VALUE OF THE FUNCTION F AT THE POINT X.
C
C     G       IS A DOUBLE PRECISION ARRAY OF LENGTH N. BEFORE INITIAL
C             ENTRY AND ON A RE-ENTRY WITH IFLAG=1, IT MUST BE SET BY
C             THE USER TO CONTAIN THE COMPONENTS OF THE GRADIENT G AT
C             THE POINT X.
C
C     DIAGCO  IS A LOGICAL VARIABLE THAT MUST BE SET TO .TRUE. IF THE
C             USER WISHES TO PROVIDE THE DIAGONAL MATRIX HK0 AT EACH
C             ITERATION. OTHERWISE IT SHOULD BE SET TO .FALSE., IN WHICH
C             CASE LBFGS WILL USE A DEFAULT VALUE DESCRIBED BELOW. IF
C             DIAGCO IS SET TO .TRUE. THE ROUTINE WILL RETURN AT EACH
C             ITERATION OF THE ALGORITHM WITH IFLAG=2, AND THE DIAGONAL
C             MATRIX HK0 MUST BE PROVIDED IN THE ARRAY DIAG.
C
C     DIAG    IS A DOUBLE PRECISION ARRAY OF LENGTH N. IF DIAGCO=.TRUE.,
C             THEN ON INITIAL ENTRY OR ON RE-ENTRY WITH IFLAG=2,
C             IT MUST BE SET BY THE USER TO CONTAIN THE VALUES OF THE
C             DIAGONAL MATRIX HK0. RESTRICTION: ALL ELEMENTS OF DIAG
C             MUST BE POSITIVE.
C
C     IPRINT  IS AN INTEGER ARRAY OF LENGTH TWO WHICH MUST BE SET BY THE
C             USER.
C
C             IPRINT(1) SPECIFIES THE FREQUENCY OF THE OUTPUT:
C                IPRINT(1) < 0 : NO OUTPUT IS GENERATED,
C                IPRINT(1) = 0 : OUTPUT ONLY AT FIRST AND LAST ITERATION,
C                IPRINT(1) > 0 : OUTPUT EVERY IPRINT(1) ITERATIONS.
C
C             IPRINT(2) SPECIFIES THE TYPE OF OUTPUT GENERATED:
C                IPRINT(2) = 0 : ITERATION COUNT, NUMBER OF FUNCTION
C                                EVALUATIONS, FUNCTION VALUE, NORM OF THE
C                                GRADIENT, AND STEPLENGTH,
C                IPRINT(2) = 1 : SAME AS IPRINT(2)=0, PLUS VECTOR OF
C                                VARIABLES AND GRADIENT VECTOR AT THE
C                                INITIAL POINT,
C                IPRINT(2) = 2 : SAME AS IPRINT(2)=1, PLUS VECTOR OF
C                                VARIABLES,
C                IPRINT(2) = 3 : SAME AS IPRINT(2)=2, PLUS GRADIENT VECTOR.
C
C     EPS     IS A POSITIVE DOUBLE PRECISION VARIABLE THAT MUST BE SET BY
C             THE USER, AND DETERMINES THE ACCURACY WITH WHICH THE SOLUTION
C             IS TO BE FOUND. THE SUBROUTINE TERMINATES WHEN
C                         ||G|| < EPS MAX(1,||X||),
C
C             WHERE ||.|| DENOTES THE EUCLIDEAN NORM.
C
C     XTOL    IS A POSITIVE DOUBLE PRECISION VARIABLE THAT MUST BE SET BY
C             THE USER TO AN ESTIMATE OF THE MACHINE PRECISION (E.G.
C             10**(-16) ON A SUN STATION 3/60). THE LINE SEARCH ROUTINE
C             WILL TERMINATE IF THE RELATIVE WIDTH OF THE INTERVAL OF
C             UNCERTAINTY IS LESS THAN XTOL.
C
C     W       IS A DOUBLE PRECISION ARRAY OF LENGTH N(2M+1)+2M USED AS
C             WORKSPACE FOR LBFGS. THIS ARRAY MUST NOT BE ALTERED BY THE
C             USER.
C
C     IFLAG   IS AN INTEGER VARIABLE THAT MUST BE SET TO 0 ON INITIAL ENTRY
C             TO THE SUBROUTINE. A RETURN WITH IFLAG<0 INDICATES AN ERROR,
C             AND IFLAG=0 INDICATES THAT THE ROUTINE HAS TERMINATED WITHOUT
C             DETECTING ERRORS. ON A RETURN WITH IFLAG=1, THE USER MUST
C             EVALUATE THE FUNCTION F AND GRADIENT G. ON A RETURN WITH
C             IFLAG=2, THE USER MUST PROVIDE THE DIAGONAL MATRIX HK0.
C
C             THE FOLLOWING NEGATIVE VALUES OF IFLAG, DETECTING AN ERROR,
C             ARE POSSIBLE:
C
C              IFLAG=-1  THE LINE SEARCH ROUTINE MCSRCH FAILED. THE
C                        PARAMETER INFO PROVIDES MORE DETAILED INFORMATION
C                        (SEE ALSO THE DOCUMENTATION OF MCSRCH):
C
C                       INFO = 0  IMPROPER INPUT PARAMETERS.
C
C                       INFO = 2  RELATIVE WIDTH OF THE INTERVAL OF
C                                 UNCERTAINTY IS AT MOST XTOL.
C
C                       INFO = 3  MORE THAN 20 FUNCTION EVALUATIONS WERE
C                                 REQUIRED AT THE PRESENT ITERATION.
C
C                       INFO = 4  THE STEP IS TOO SMALL.
C
C                       INFO = 5  THE STEP IS TOO LARGE.
C
C                       INFO = 6  ROUNDING ERRORS PREVENT FURTHER PROGRESS.
C                                 THERE MAY NOT BE A STEP WHICH SATISFIES
C                                 THE SUFFICIENT DECREASE AND CURVATURE
C                                 CONDITIONS. TOLERANCES MAY BE TOO SMALL.
C
C              IFLAG=-2  THE I-TH DIAGONAL ELEMENT OF THE DIAGONAL INVERSE
C                        HESSIAN APPROXIMATION, GIVEN IN DIAG, IS NOT
C                        POSITIVE.
C
C              IFLAG=-3  IMPROPER INPUT PARAMETERS FOR LBFGS (N OR M ARE
C                        NOT POSITIVE).
C
C    MACHINE DEPENDENCIES
C
C        THE ONLY VARIABLES THAT ARE MACHINE-DEPENDENT ARE XTOL,
C        STPMIN AND STPMAX.
C
C    GENERAL INFORMATION
C
C        OTHER ROUTINES CALLED DIRECTLY:  DAXPY, DDOT, LB1, MCSRCH
C
C        INPUT/OUTPUT  :  NO INPUT; DIAGNOSTIC MESSAGES ON UNIT MP AND
C                         ERROR MESSAGES ON UNIT LP.
C
C
C
      DOUBLE PRECISION GTOL,ONE,ZERO,GNORM,DDOT,STP1,FTOL,STPMIN,
     *       STPMAX,STP,YS,YY,SQ,YR,BETA,XNORM,DFNORM,TOLERA
      INTEGER MP,LP,ITER,NFUN,POINT,ISPT,IYPT,MAXFEV,INFO,
     *       BOUND,NPT,CP,I,NFEV,INMC,IYCN,ISCN
      LOGICAL FINISH
C
      COMMON /LB3/MP,LP,GTOL,STPMIN,STPMAX
C
      ONE = 1.0D+0
      ZERO = 0.0D+0
C
C     INITIALIZE
C     ----------
C
      IF(IFLAG.EQ.0) GO TO 10
      GO TO (172,100), IFLAG
   10 ITER= 0
      IF(N.LE.0.OR.M.LE.0) GO TO 196
      IF(GTOL.LE.1.D-04) THEN
         IF(LP.GT.0) WRITE(LP,245)
         GTOL=9.D-01
      ENDIF
C
C     PARAMETERS FOR LINE SEARCH ROUTINE
C
      FTOL= 1.0D-4
      MAXFEV= 20
      NFUN= 1
      POINT= 0
      FINISH= .FALSE.
      IF (DIAGCO) THEN
         DO 30 I=1,N
         IF (DIAG(I).LE.ZERO) GO TO 195
   30    CONTINUE
      ELSE
         DO 40 I=1,N
         DIAG(I) = 1.0D0
   40    CONTINUE
      ENDIF
C
C     THE WORK VECTOR W IS DIVIDED AS FOLLOWS:
C     ---------------------------------------
C     THE FIRST N LOCATIONS ARE USED TO STORE THE GRADIENT AND
C         OTHER TEMPORARY INFORMATION.
C     LOCATIONS (N+1)...(N+M) STORE THE SCALARS RHO.
C     LOCATIONS (N+M+1)...(N+2M) STORE THE NUMBERS ALPHA USED
C         IN THE FORMULA THAT COMPUTES H*G.
C     LOCATIONS (N+2M+1)...(N+2M+NM) STORE THE LAST M SEARCH
C         STEPS.
C     LOCATIONS (N+2M+NM+1)...(N+2M+2NM) STORE THE LAST M
C         GRADIENT DIFFERENCES.
C
C     THE SEARCH STEPS AND GRADIENT DIFFERENCES ARE STORED IN A
C     CIRCULAR ORDER CONTROLLED BY THE PARAMETER POINT.
C
      ISPT= N+2*M
      IYPT= ISPT+N*M
      DO 50 I=1,N
      W(ISPT+I)= -G(I)*DIAG(I)
   50 CONTINUE
      GNORM= DSQRT(DDOT(N,G,1,G,1))
      STP1= ONE/GNORM
C
      IF(IPRINT(1).GE.0) CALL LB1(IPRINT,ITER,NFUN,
     *             GNORM,N,M,X,F,G,STP,FINISH,DFNORM)
C
C     --------------------
C     MAIN ITERATION LOOP
C     --------------------
C
   80 ITER= ITER+1
      INFO=0
      BOUND=ITER-1
      IF(ITER.EQ.1) GO TO 165
      IF (ITER .GT. M)BOUND=M
C
      YS= DDOT(N,W(IYPT+NPT+1),1,W(ISPT+NPT+1),1)
      IF (.NOT.DIAGCO) THEN
         YY= DDOT(N,W(IYPT+NPT+1),1,W(IYPT+NPT+1),1)
         DO 90 I=1,N
         DIAG(I)= YS/YY
   90    CONTINUE
      ELSE
         IFLAG=2
         RETURN
      ENDIF
  100 CONTINUE
      IF (DIAGCO) THEN
         DO 110 I=1,N
         IF (DIAG(I).LE.ZERO) GO TO 195
  110    CONTINUE
      ENDIF
C
C     COMPUTE -H*G USING THE FORMULA GIVEN IN: NOCEDAL, J. 1980,
C     "UPDATING QUASI-NEWTON MATRICES WITH LIMITED STORAGE",
C     MATHEMATICS OF COMPUTATION, VOL.35, NO.151, PP. 773-782.
C     ---------------------------------------------------------
C
      CP= POINT
      IF (POINT.EQ.0) CP=M
      W(N+CP)= ONE/YS
      DO 112 I=1,N
      W(I)= -G(I)
  112 CONTINUE
      CP= POINT
      DO 125 I= 1,BOUND
      CP=CP-1
      IF (CP.EQ. -1)CP=M-1
      SQ= DDOT(N,W(ISPT+CP*N+1),1,W,1)
      INMC=N+M+CP+1
      IYCN=IYPT+CP*N
      W(INMC)= W(N+CP+1)*SQ
      CALL DAXPY(N,-W(INMC),W(IYCN+1),1,W,1)
  125 CONTINUE
C
      DO 130 I=1,N
      W(I)=DIAG(I)*W(I)
  130 CONTINUE
C
      DO 145 I=1,BOUND
      YR= DDOT(N,W(IYPT+CP*N+1),1,W,1)
      BETA= W(N+CP+1)*YR
      INMC=N+M+CP+1
      BETA= W(INMC)-BETA
      ISCN=ISPT+CP*N
      CALL DAXPY(N,BETA,W(ISCN+1),1,W,1)
      CP=CP+1
      IF (CP.EQ.M)CP=0
  145 CONTINUE
C
C
C     STORE THE NEW SEARCH DIRECTION
C     ------------------------------
C
      DO 160 I=1,N
      W(ISPT+POINT*N+I)= W(I)
  160 CONTINUE
C
C     OBTAIN THE ONE-DIMENSIONAL MINIMIZER OF THE FUNCTION
C     BY USING THE LINE SEARCH ROUTINE MCSRCH
C     ----------------------------------------------------
C
  165 NFEV=0
      STP=ONE
      IF (ITER.EQ.1) STP=STP1
      DO 170 I=1,N
      W(I)=G(I)
  170 CONTINUE
  172 CONTINUE
      CALL MCSRCH(N,X,F,G,W(ISPT+POINT*N+1),STP,FTOL,
     *            XTOL,MAXFEV,INFO,NFEV,DIAG)
      IF (INFO .EQ. -1) THEN
         IFLAG=1
         RETURN
      ENDIF
      IF (INFO .NE. 1) GO TO 190
      NFUN= NFUN + NFEV
C
C     COMPUTE THE NEW STEP AND GRADIENT CHANGE
C     ----------------------------------------
C
      NPT=POINT*N
      DO 175 I=1,N
      W(ISPT+NPT+I)= STP*W(ISPT+NPT+I)
      W(IYPT+NPT+I)= G(I)-W(I)
  175 CONTINUE
C
POINT=POINT+1
IF (POINT.EQ.M)POINT=O
C
C TERMINATION TEST
C 
C
      DFNORM=DSQRT(DDOT(N,W(ISPT+NPT+1),1,W(ISPT+NPT+1),1))
C
C     ADD A STOPPING CRITERION FOR THE NEURAL NETWORKS LEARNING
C     ALGORITHM.  THE SUBROUTINE TERMINATES WHEN ||XK+1-XK|| < TOLERA.
C
      IF (ANN) THEN
         IF(DFNORM .LE. TOLERA) THEN
            FINISH = .TRUE.
            GO TO 180
         END IF
      END IF
C
      GNORM= DSQRT(DDOT(N,G,1,G,1))
      XNORM= DSQRT(DDOT(N,X,1,X,1))
      XNORM= DMAX1(1.0D0,XNORM)
      IF (GNORM/XNORM .LE. EPS) FINISH=.TRUE.
  180 IF(IPRINT(1).GE.0) CALL LB1(IPRINT,ITER,NFUN,GNORM,N,M,X,
     *             F,G,STP,FINISH,DFNORM)
      IF (FINISH) THEN
         IFLAG=0
         RETURN
      ENDIF
      GO TO 80
C
C     END OF MAIN ITERATION LOOP. ERROR EXITS.
C     ----------------------------------------
C
  190 IFLAG=-1
      IF(IPRINT(1).GE.0) CALL LB1(IPRINT,ITER,NFUN,
     *             GNORM,N,M,X,F,G,STP,FINISH,DFNORM)
      IF(LP.GT.0) WRITE(LP,200) INFO
      RETURN
  195 IFLAG=-2
      IF(LP.GT.0) WRITE(LP,235) I
      RETURN
  196 IFLAG= -3
      IF(LP.GT.0) WRITE(LP,240)
C
C     FORMATS
C     -------
C
  200 FORMAT(/' IFLAG= -1 '/' LINE SEARCH FAILED. SEE',
     *       ' DOCUMENTATION OF ROUTINE MCSRCH'/' ERROR RETURN',
     *       ' OF LINE SEARCH: INFO= ',I2/
     *       ' POSSIBLE CAUSES: FUNCTION OR GRADIENT ARE INCORRECT'/
     *       ' OR INCORRECT TOLERANCES')
  235 FORMAT(/' IFLAG= -2'/' THE',I5,'-TH DIAGONAL ELEMENT OF THE'/
     *       ' INVERSE HESSIAN APPROXIMATION IS NOT POSITIVE')
  240 FORMAT(/' IFLAG= -3'/' IMPROPER INPUT PARAMETERS (N OR M',
     *       ' ARE NOT POSITIVE)')
  245 FORMAT(/'  GTOL IS LESS THAN OR EQUAL TO 1.D-04'
     *       /' IT HAS BEEN RESET TO 9.D-01')
      RETURN
      END
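MLBFGS computes the product of the inverse-Hessian approximation with the gradient by the two-loop recursion of Nocedal (1980) rather than by forming H. A plain-Python sketch of that recursion (our illustration, not the thesis code), checked against the explicit one-correction BFGS inverse update H1 = (I - rho*s*y^T) H0 (I - rho*y*s^T) + rho*s*s^T with H0 = I:

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def two_loop(g, pairs, h0_diag):
    """Return H*g given correction pairs [(s, y), ...], newest last,
    with H0 the diagonal matrix h0_diag."""
    q = list(g)
    alphas = []
    for s, y in reversed(pairs):              # first loop: newest to oldest
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        alphas.append((a, rho))
        q = [qi - a * yi for qi, yi in zip(q, y)]
    r = [h * qi for h, qi in zip(h0_diag, q)]  # scale by H0
    for (a, rho), (s, y) in zip(reversed(alphas), pairs):
        b = rho * dot(y, r)                   # second loop: oldest to newest
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return r
```

With s = (1, 0), y = (0.5, 0.2) and g = (1, 2), the explicit H1 works out to [[2.16, -0.4], [-0.4, 1]], so H1*g = (1.36, 1.6); the recursion reproduces this, and with no stored pairs it reduces to H0*g.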
      SUBROUTINE LB1(IPRINT,ITER,NFUN,
     *             GNORM,N,M,X,F,G,STP,FINISH,DFNORM)
C
C     *** CHANGE: PRINT DFNORM INSTEAD OF STP
C     -------------------------------------------------------------
C     THIS ROUTINE PRINTS MONITORING INFORMATION. THE FREQUENCY AND
C     AMOUNT OF OUTPUT ARE CONTROLLED BY IPRINT.
C     -------------------------------------------------------------
C
      INTEGER IPRINT(2),ITER,NFUN,LP,MP,N,M,I
      DOUBLE PRECISION X(N),G(N),F,GNORM,DFNORM,STP,GTOL,STPMIN,STPMAX
      LOGICAL FINISH
      COMMON /LB3/MP,LP,GTOL,STPMIN,STPMAX
C
      IF (ITER.EQ.0)THEN
         WRITE(MP,10)
         WRITE(MP,20) N,M
         WRITE(MP,30)F,GNORM
         IF (IPRINT(2).GE.1)THEN
            WRITE(MP,40)
            WRITE(MP,50) (X(I),I=1,N)
            WRITE(MP,60)
            WRITE(MP,50) (G(I),I=1,N)
         ENDIF
         WRITE(MP,10)
         WRITE(MP,70)
      ELSE
         IF ((IPRINT(1).EQ.0).AND.(ITER.NE.1.AND..NOT.FINISH))RETURN
         IF (IPRINT(1).NE.0)