

small (250x250 max)
medium (500x500 max)
Large
Extra Large
large ( > 500x500)
Full Resolution


r ... PENALTY METHODS TO REDUCE OVERFITTING IN ARTIFICIAL NEURAL NETWORKS By ZHONG XIANG LUO Bachelor of Engineering Gezhouba Hydroelectric Engineering University Yichang, China 1982 Master of Engineering Wuhan Hydraulic and Electric Engineering University Wuhan, China 1989 Submitted to the Faculty ofthe Graduate College ofthe Oklahoma State University In partial fulfillment of The requirements for The Degree of MASTER OF SCIENCE December, 1998 PENALTY METHODS TO REDUCE OVERFITTING IN ARTIFICIAL NEURAL NETWORKS Thesis Approval: 'LJ"'='~e~f> man . {oweJ1__ ofthe Graduate College ii ACKNOWLEDGMENTS I wish to express my sincere appreciation to my major advisor, Dr. John P. Chandler, for his invaluable guidance and encouragement throughout this research. Thanks are also due my other graduate committee members, Dr. B. E. Mayfield, and Dr. G. E. Hedrick, for their time and valuable assistance and cooperation. I also want to thank the faculty and staff in Computer Science Department at Oklahoma State University for their support. More over, I would like to express my special thanks and love to my mother, Ying Tong, my uncles, Shu Chen Luo, You Chen Luo, myoId sisters Vue Luo, Yuewo Luo, for their support and W1derstanding throughout my life. Finally, I would like to give my special thanks to my wife, Dr. Ming Yu, myoId daughter, Brooke T. Luo, and my new born baby Rynel Luo, for their love, patience, and support during this research. My wife's encouragement and faith always provided me strength and comfort. Myoid daughter's Jove always makes me happy and helped me get through aU difficulties. My new born baby brings me excitement and joys. 111 Cbapter TABLE OF CONTENTS Page I. INTRODUCTION 1 II. METHODS FOR REDUCING OVERFITTING 5 Overfitting and Generalization in Artificial Neural Networks 5 Methods ofReducing Overfitting 7 Penalty Mechanism and Algorithm 7 Penalty Terms as a Method ofReducing Overfitting 13 III. NEURAL NETWORK ARCHTECTURE AND LEARNING GORITHMS 18 Architectures ofFeedforward Artificial Neural Networks 18 Activation Function 21 Weights in the Neural Network 21 Optimization Algorithm 23 Forward Computations 26 Backpropagation Computation 27 Algorithms for the Penalty Method and Improved Penalty Method 32 VI. METHODS TESTED AND IMPLEMENTATION 37 Neural Network Architecture Design 37 Discussion ofTest Results 40 v . CONCLUSION 45 REFERENCES 47 APPENDIX A 50 Testing Tables 50 APPENDIX B 65 Computer Programs 65 IV LIST OF TABLES T~~ P~e 4 1 Overview ofmethod tested 39 4 2 Performance Comparison ofDifferent M.ethod for Network with 7 hidden nodes ... ... . . . .. . .. . .. . .. .. .. ... ... . .. ... . .. . . . ... ... .. . ..... .. .. 44 4 3 Performance Comparison ofDifferent Method for Network with 8 hidden nodes ... .... .. ... .. . ... ..... .. .. 44 4 4 Performance Comparison of Different Method for Network with 10 hidden nodes 44 A 1 Performance ofTraining and Generalization(RMS) Method A with 7 hidden nodes and A= 0.01 50 A 2 Performance ofTraining and Generalization(RMS) Method B with 7 hidden nodes and A= 0.01 50 A 3 Performance ofTraining and Generalization(RMS) Method B with 7 hidden nodes and A= 0.0001 51 A 4 Performance of Training and Generalization(RMS) Method C with 7 hidden nodes and A= 0.0001 51 A 5 Performance of Training and Generalization(RMS) Method C with 7 hidden nodes and A= 0.01 51 A 6 Performance ofTraining and Generalization(RMS) v Method D with 7 hidden nodes and A= 0.0001 52 A 7 Perfonnance ofTraining and Generalization(RMS) Method D with 7 hidden nodes and A=0.01 52 A 8 Perfonnance ofTraining and Generalization(RMS) Method E with 7 hidden nodes and A= 0.01 52 A 9 Performance ofTraining and Generalization(RMS) Method E with 7 hidden nodes and A= 0.0001 53 A 10 Perfonnance ofTraining and Generalization(RMS) Method A with 8 hidden nodes 53 A 11 Performance ofTraining and Generalization(RMS) Method B with 8 hidden nodes and A= 0.01 53 A 12 Performance ofTraining and Generalization(RMS) Method B with 8 hidden nodes and A=0.0001 54 A 13 Perfonnance ofTraining and Generalization(RMS) Method C with 8 hidden nodes and A= 0.0001 54 A 14 Perfonnance ofTraining and Generalization(RMS) Method C with 8 hidden nodes and A=0.01 54 A 15 Performance ofTraining and Generalization(RMS) Method D with 8 hidden nodes and A= 0.01 55 A 16 Performance ofTraining and Generalization(RMS) Method D with 8 hidden nodes and A= 0.0001 '" 55 A 17 Performance ofTraining and Generalization(RMS) VI Method E with 8 hidden nodes and A. = 0.001 55 A 17 Performance of Training and Generalization(RMS) Method E with 8 hidden nodes and A. = 0.01 56 A 18 Perfonnance ofTraining and Generalization(RMS) Method A with 10 hidd.en nodes , 56 A 19 Performance ofTraining and Generalization(RMS) Method B with 10 hidden nodes and A. = 0.01 .. 57 A 20 Performance ofTraining and Generalization(RMS) Method B with 10 hidden nodes and A. = 0.000 I 57 A 21 Performance ofTraining and Generalization(RMS) Method C with 10 hidden nodes and A. = 0.00001 57 A 22 Performance ofTraining and Generalization(RMS) Method C with 10 hidden nodes and A. =0.01 .. S8 A 23 Perfonnance ofTraining and Generalization(RMS) Method D with 10 hidden nodes and A. = 0.000 I 58 A 24 Performance ofTraining and Generalization(RMS) Method D with 10 hidden nodes and A. = 0.01 58 A 25 Performance ofTraining and Generalization(RMS) Method E with 10 hidden nodes and A. = 0.01 59 A 26 Performan.ce ofTraining and Generalization(RMS) Method E with 10 hidden nodes and A= 0.00001 59 A27 Performance of Training and Generalization (RMS) V1l Method F with 7 hidden nodes and different A(0.008 0.006, 0.004 0.002,0.001,0.0006,0.0001,0.00006,0.00004, 0.00001) 60 A 28 Performance ofTraining and Generalization (RMS) Method G with 7 hidden nodes and A= 0.001. 60 A 29 Performance of Training and Generalization (RMS) Method G with 7 hidden nodes and A= 0.0001 60 A 30 Perfonnance ofTraining and Generalization (RMS) Method G with 7 hidden nodes and A= 0.0002 61 A 31 Performance ofTraining and Generalization (RMS) Method G with 8 hidden nodes and A= 0.0002 61 A 32 Performance ofTraining and Generalization (RMS) Method G with 8 hidden nodes and A= 0.0001........ .. .. 62 A 33 Perfonnance ofTraining and Generalization (RMS) Method G with 10 hidden nodes and A= 0.0001 62 A 34 Performance of Training and Generalization (RMS) Method G with 10 hidden nodes and A= 0.00001 62 A 35 The Training Data Set.. .. .. .. .. .. 63 A 36 The Validation Data Set 64 Vlll LIST OF FIGURES Table Page 2.1 The Relationship Between Training Error and Testing Error................... 6 3.1 A Three Layer Feed forward Network 19 3.2 The Sigmoid Function .. . ... . . . 22 3.3 The Hyperbolic Function .. .. . .... . . .. . . .. .. . .. .. . .. .. .. .. .. . ... .. . .. .. . . . . . ..... 22 IX Chapter I INTRODUCTION Artificial neural networks are computational models of the human brain In contrast with conventional singleprocessor computers, the brain has a multiprocessor architecture that is highly interconnected. This architecture can be described as parallel distributed processing. Parallel distributed processing has many advantages over singleprocessor models for many difficult computer science problems. It allows problems that were once very difficult to solve on a computer to be attacked with relative ease. Neural networks can be trained to develop operational capabilities to respond to an information environment. Supervised learning and unsupervised learning are the two main learning regimes used in neural network training. A supervised learning algorithm adjusts the strengths or weights of the interneuron connections according to the difference between the desired and actual network outputs corresponding to a given input. Thus, supervised learning requires a teacher or supervisor to provide desired or target output signals. Examples of supervised learning algorithms include the delta rule [1], the generalized delta rule or backpropagation algorithm [2] and the LVQ algorithm [3]. Unsupervised learning algorithms do not require the desired outputs to be known. During training, only input patterns are presented to the neural network that automatically adapt the weights of its connections to cluster the input patterns into groups with similar features. Examples of unsupervised learning algorithms include the Kohonen [3] and CarpenterGrossberg Adaptive Resonance Theory (ART) [4] competitive learning algorithms. Neural Networks have been used in many fields including economics, transportation, defense, electronics, manufacturing, medicine, robotics, speech and telecommunications [1]. The Multilayer Perceptron (MLP) will be discussed in this research. MLPs are perhaps the bestknown type of feedforward networks. One of the interesting properties of a feedforward neural network is its capability of learning, i.e., a feedforward neural network can adjust its behavior using information from the environment. When a feedforward neural network is used to solve a problem, it is trained by a set of inputoutput sample data. Based on this data set, the network, when properly trained, wilt not only try to reproduce the sample set correctly, but also to generalize from the training examples to the entire problem domain. A learning algorithm is applied a set of training data, then it is applied to make predictions on new data points. The goal is to maximize its predictive accuracy on the new data points. If it is trained too hard to find the very best fit to the training data, there is a risk that the data noise will be fitted by memorizing various peculiarities of the training data rather than fmding a general predictive rule. For continuous domains, or large discrete ones, it is impossible to provide samples of every possible input. For a 2 large network, if the system simply memorizes the training patterns, it may do quite well dwing the training process but it may give spurious and misleading outputs if the input is slightly different from the sample inputs. This phenomenon is called overfitting. Overfitting is thought to happen when the network has more degrees of freedom than the number of the training samples. Obviously, a network can obtain a good generalization only when the number ofparameters is less than the number of data points in the training set. Unfortunately, it is difficult to find the smallest neural network size that can learn the training data best. Many techniques for reducing overfitting have been developed. The penaltyterm method is one of the most popular methods. The basic approach used in a penaltytenn method is adding penalty terms to the usual error fimction in order to constrain the search and cause weights to decay differentially. By modifying the cost function, the backpropagation will drive unnecessary weights close to zero and, in effect, remove them during training. Even if the weights are not actually removed, the network acts like a smaller system. This thesis focuses on the possibilities of reducing the overfitting by using the penaltyterm method in artificial neural networks. Many penalty terms have been developed to reduce overfitting. Some of them are complicated; some of them include a userdependent constant factor. Each penalty term has different advantages and disadvantages. The question remains of whether there is a penalty term or a combination of penalty terms that can produce superior results and, if there is, what the penalty term could be. This research will compare and summarize different penalty terms through their perfonnance. An improved penalty term method will be proposed in this research. 3 < It is expected that the improved penalty method will improve the generalization performance of neural networks significantly. The paper is organized as follows: In Chapter I, a general introduction to neural networks and the problems of interest is given. In Chapter II, a review of different algorithms for reducing overfiting, especially the penalty term methods, will be conducted. Chapter III will explain the architecture of the neural network that will be discussed in this research, and the application of optimization theory in the algorithm. A new penalty term method will be developed to reduce overfitting in this chapter. In Chapter IV, an overview of methods that will be tested is given. The regular learning algorithm without a penalty term, the penalty method with different penalty terms, and the improved penalty method will be tested. All the methods and different penalty terms tested will be compared with each other through their generalization performance in the research. Finally, the test results will be placed in Appendix A and the source program that is used in the implementation of the penalty term method and the improved penalty term method will be placed in Appendix B. 4 Chapter II METHODS FOR REDUCING OVERFITTING Overfitting and Generalization in Artificial Neural Networks Mathematically, the objective of learning in the neural network is to infer a function from a given sample data set. Learning algorithms are designed essentially to search for a function that best fits the given data in a space of functions. After learning, the neural network is applied on the new data set. If it is trained too hard to find the best fit to the training data, there is a risk that we will fit the noise in to the data by memorizing various peculiarities of the training data rather than finding a general predictive rule [5]. When a network is trained, the weights are modified in order to decrease errors on the training data set. If the network is tested on a new set of data, the errors on the test data set tend to decrease in step with the training error as the network tries to generalize from the training data set to the underlined function. However if the training data is incomplete, it may contain spurious and misleading regularities due to sampling [6]. Figure 21 illustrates this situation schematically. It is generally agreed that overfitting is closely related to the architecture of the networ~ i.e., the size of the network. If training starts with too small a network for the problem, good results cannot be obtained. If the network is too large, it may be 5 vulnerable to overfitting (20). B. Bawn and David Haussler [19] analyzed theoretically the lower and upper bounds on the size of the sample vs. the network size needed to achieve a valid generalization. Subutai Ahmad and Gerald Tesauro [21] analyzed how many training patterns and training cycles are needed for a problem of a given size and difficulty, how to represent the input, and how to choose training examples. In general. overfitting is related to the degree of freedom of neural networks. The degree of freedom of neural networks includes not only the number of weights but also the potential nonlinearity of the network, the architecture and the amount oftirne and the number of data used during training [22]. Error Testing Error Training Error o l TraIDing Time Figure 2_1 The Relationship Between Training Error and Testing Error 6 Methods of Reducing Overfitting There are many methods to reduce overfitting and improve generalization [6] such as pruning methods, stopped training methods and penalty term methods. The pruning method is to train a network that is larger than necessary and then remove parts that are not needed. The large initial size allows the network to learn reasonably quickly with less sensitivity to initial conditions, while the reduced complexity of the trimmed system favors improved generalization. The stopped training method is to estimate the generalization ability during training and stop when it begins to decrease. The simplest method is to divide the data into a training set and a validation set. The training set is used to modify the weights, the validation set is used to estimate the generalization ability, and training is stopped when the error on the validation set begins to rise. The penalty term method is another way to reduce overfitting. The basic approach involves adding penalty terms to the usual error function in order to constrain the search and cause weights to differentially decay. Actually, stopped training and penalty term methods are two widely used categories. The detailed penalty machines and penalty terms are presented in the following section. Penalty Mechanism and Algorithm Penalty Function Methods: Usually, penalty function methods are used in determining a solution ofa constrained nonlinear programming problem [10]. Currently, there is not a universally accepted method of dealing with such a problem. A penalty function method is to replace a constrained problem with one that is unconstrained. The 7 latter problem is then solved using an iterative technique. A general penalty function method, a barrier penalty function method and a quadratic penalty function method are introduced in the foBowing sections. In penalty function methods, the constrained problem is converted into an unconstrained problem by adding a penalty function, p(x), to the objective functionj{x). The resulting unconstrained objective function has the fonn j{x) +fJ p(x), where fJ> O. The function p(x) imposes a penalty of fJ p(x) whenever x does not satisfy the constraints ofthe original problem. Actually, a sequence {f(x) +fJp(x) } of functions are minimized (or maximized). The solution, {Xk}, ofthe sequence will usually approach the solution of the original problem. Normally, each Xlo; is not a feasible solution of the original problem. The process tenninates whenever the required accuracy has been obtained, or whenever some solution, x" , is generated that is a feasible solution of the original problem. In a penalty function method, an expression involving the constraints is added to the objective function. The expression is selected so that the value of the updated objective function is excessively high (or low) at a point x where the problem is infeasible. In general, one penalty function for the problem (21) is function (22) Minimize t{x) subject to gj(x) = bi for i = 1, ,1 gj(x) <= bi for i = 1+1, ,m 8 (21) 1 m k. p(x) =LIb;  g;(x)Ik.+ L(max{O,g,(x)  b,}) ;=1 ;=1+1 (22) where k is a natural number. Notice that p(x) ~ O. In fact p(x) = 0 if and only if x is feasible. Problem (21) could be converted into the form Minimize :t{x) subject to hi(x) = 0 for i = 1,....,m (23) by adding the square of an unrestricted variable to the left side of each inequality constraint, and then moving each bi to the left side of each constraint. A typical penalty function for (22) is (24) where k is a (usually even) natural number. Again notice that p(x) >=0. The remainder ofthis section deals with problem (23). Barrier function methods: A Barrier function method is an improved penalty function method. Again a sequence of functions (f(x) +(l/f3k )b(x)} is minimized (or maximized) and the sequence of solutions {xd nonnally tends to a solution of the original problem. The difference in barrier function is that the solutions, Xk, are all 9 feasible solutions ofthe original problem. The function hex) is called a barrier function because it imposes a penalty near the boundary of the set of feasible solutions of the original problem. For the problem: Minimize fl:x) subject to gi(X)~O for i =1,...,m (25) Notice that problem (25) does not contain any equality constraints. Barrier function methods are similar to penalty function methods in that a barrier function is added to the objective functio~ and the resulting function is minimized. The difference is that the solutions are interior points of F (rather than points exterior to F). The purpose of the barrier function is to prevent the solutions from leaving the interior ofF. Some common barrier functions for Problem (25) are b(x)=_~_l (26) i=1 gj(x) and b(x) = 2:lnlgj(x)1 (27) I~J Notice that b(x) is, in either case, continuous throughout the interior of F. Moreover, b(x)> CX) as x approaches the boundary of F via the interior of F. Rather than solve (2 5), we intend to solve the following problem: 1 Minimize I(x) +/ib(x) subject to each gj(x) < 0 10 (28) where rl>O. The Quadratic penalty function method: Both penalty function and barrier function methods can possess the undesirable property of slow convergence. In [25], the penalty function method is modified using Lagrange multipliers to obtain a more efficient method. The technique is called the method of multipliers and has emerged as an important tool for solving constrained nonlinear programming problems. The quadratic penalty function method is one ofthese methods. It is briefly introduced as foUowing. For the problem Minimize f(x) subject to hlx) = 0 1= 1,....,m (29) where f, h1, ••••hm are continuously differentiable, assume that the set, F, of feasible solutions of (29) is nonempty. The continuity of the hi ensures that F is closed. As mentioned in [10], the Weierstrass theorem guarantees the existence of a solution, x·, of problem (29). In [10], a method for determining x· was suggested, Namely, compute vectors x· and A.. that satisfy O_OoxL(X.",AO)_'lv7'f(xo)T + Aorooxh(x0) and 11 (210) where L(x, A) = f\x)+ATh(x) and hex) = [h,(x)...hm(X)]T. Unfortunately, the system of equations (210) is difficult to solve. Consider a solution x· of (29). Let A" be the corresponding vector of Lagrange multipliers for which equations (210) hold. Notice that whenever xEF, then L(x", A") = f\x"):::;; f(x) = f\x) + A"Th(x) = L(x, A") Thus, min {L(x, A") : x E F} = L(x·, A") and min {f(x) : x E F }= min{L(x, A·) : x E F } (211) This suggests that rather than solve (29), we could solve the problem on the right side of (211), possibly using a penalty function method. That is Minimize ftx) + A"Th(x) + p i:(h;(X»2 2 ;=1 (212) where ~ > O. Of course the problem is that A· is not known at the onset of the problem. The next result suggests an alternate strategy consisting of solving a sequence of problems ofthe fonn where Ak E Rmxl. 1.2 (213) The above discussions concern the penalty function methods and the penalty mechanism. They have some similarity with the penalty term method used in neural network training and can be used to evaluate the penalty terms and penalty mechanism used in neural network training. To evaluate the different penalty terms developed in neural network training, a summary of different penalty terms is presented in the following. Penalty Term Method of Reducing Overfitting A. Weigend et al Penalty Term Weigend et al. [11][13] suggested the following cost function: 2/ 2 "~ (I .. 2" Wi W, Ok) +A~ 2/ 2 kET ieCI+w; Wo (214) where C is the set of all connections and T is the set of training patterns. The second term is the penalty term that represents the complexity ofthe network as a function of the weight magnitudes relative to the constant Woo if Iwil >> wo, then the cost of a weight will approaches A. If I wil « Wo, the cost is close to zero. The value of A depends on the problem. If it is too smal~ it won't have any significant effect; if it is too large, all the weights will be driven to zero. B. Chauvin Penalty Term In [14], Chauvin minimize the cost function 13 (215) where e is a positive monotonic function. The swns are over the set of output units 0, the set of patterns P, and the set of hidden units H. The first term is the normal backpropagation error tenn, the second term measures the average "energy" expended by the hidden units. The parameters Jler and Jlen balance the two terms. The "energy" expended by a unithow much its activity varies over the training patternsis an indication of its importance. If the unit changes a lot, it probably encodes significant information; if it does not change much, it probably does not carry much information. A magnitudeofweights term may also be added to the cost function, giving (216) Since the derivative of the third term with respect to Wij is 2 fJ.wwij, this effectively introduces a weightdecay term into the backpropagation equations. Weights that are not essential to the solution decay to zero and can be removed. C. Ji Penalty Term Ji et al. [18] modifY the error function to minimize the number of hidden nodes and the magnitudes of the weights. A singlehiddenIayer network with one input node and one linear output node is investigated in their research. Beginning with a network having more hidden units than necessary, the output is computed as 14 N g(x,w,B) = LVif(UjXOJ i~1 (217) where Sr is the threshold, f is the sigmoid function II (l +eX ) , and u and v are the input and output weights of ith hidden unit respectively. The significance of a hidden unit is computed based on its input and output weights where cr(w) = ~/(l +~). (218) The error is defined as the sum of 80 , the nonnal sum of squared errors, and 81, tenn measuring node significance. M N iI =1] L[g(x1r;w,B )_ylt]2 +ALLSiSj 1r~1 i~1 j~1 (219) where 1t indexes the training patterns and x1t and y1t are the input and desired output for pattern 1t, and !l and A are learning rate parameters. The El(W) term makes the algorithm favor solutions with fewer significant hidden units. It is suggested the second term be added only after the network has learned the training set sufficiently well because conflict between the two error tenns may cause local minim. 15 D. Bishop Penalty Term Chris M. Bishop [16] proposed another penalty tena For error function (220), the penalty term is given by (221). (220) ( 221) Where Yn and XI denote the components of y and X, respectively, and the parameter A. controls the degree of smoothness of the network mapping. Bishop indicated: ''Unfortunately, the optimum value for A. is problem dependent. It may be found by seeking the minimum error with respect to a crossvalidation data set, or by a variety of techniques based on the statistical properties ofthe training data." [16]. E. Simple Penalty Terms Ishikawa [15] proposed another simple cost function c = L(/, OK)2 +ALI wi! I KeT IJ (222) If Wij > 0, the weight is decremented by A., otherwise, if Wij < 0, then it is incremented by A. Russel Reed [6] described the following simple cost function: c= L(I'0.)2 +A.LWij 2 keT i,j 16 (223) Russel Reed evaluated the simple penalty term: "One ofthe characters ofthe AWij penalty tenn is that it tends to favor vector with many small components over ones with a single large component, even when this is an effective choice." A constant Ais used in most of the penalty terms. There is no criteria to select a A. Weigend et al. [11] indicated that ''the value of Arequires some tuning and depends on the problem. If it is to small, it won't have any significant effect; if it is too large, all the weights will be driven to zero." Bishop [19] indicated that the optimum value for A will be problem dependent, and may be found by seeking the minimum error with respect to a crossvalidation data set, or by a variety of techniques based on the statistical properties of the training data. Ji et al. [18] suggested that the A can be made a function of Eo such as A =Aoe;&0 They suggested the second penalty term be added only after the network has learned the training set sufficiently well, because of the conflict between the two error terms may cause local minimal. Ping Jiang [24] said, ''the optimum point of A is network architecture dependent. We need to choose A to close to optimum point to improve the generalization performance." 17 Chapter III ARTIFICIAL NEURAL NETWORK ARCHTECTURE AND LEARNING ALGORITHMS Architectures of Feedforward Artificial Neural Networks Some artificial neural networks were introduced in Chapter 1. This thesis focuses on the most widely used multilayer feedforward networks. The architecture of a multilayer feedforward network is shown as Figure 31. This type of network arranges neurons in layers. All neurons in a layer are connected to all neurons in the adjacent layers through unidirectional links. These links are represented by synaptic weights. The input layer ofthe network is treated as connection nodes. All the layers except the output layer of the network are hidden layers. So the number of hidden layers is the number of layers in a network minus one. The notations used are shown in Figure 31. All neurons in a layer are consecutively indexed starting from 1, in a topdown fashion. The layers are indexed in. a lefttoright order and are identified by squarebracketed superscripts. All inputs to a neuron in layer k are denoted as a/k  I I, where I = 0, 1, 2,,,,Sk1 (SkI is the number of neurons in the (Kl )th layer). In the case of kl = 0, ajlOl are the inputs of the network. For each layer, we assumed an extra bias node that has a constant output value of1, i.e. 18 nP) n[::!] (21 n[31 a';1 5, 51 ~ 51 ~I P f(lJ fPI [PI R Wi a[OJ , a[ll , W W a[2J s,.R a s, .5, a .,.~ a nPI a~ll a~1 2 P, fill [PI f(3) a[OI all) a a Figure 31 A Three Layer Feedforward Network 19 3o(k] = 1. Notice that for each k > 2, aj[kI] is also the output of neuron I in (k1)th layer. The outputs in the kth layer of the network can be written in vector fonn as aj(k]. A weight is represented as Wj)kJ, where k is the layer index and 'j,i" means that the weight is the connection from the ith neuron in layer kl to the jth neuron in layer k. In vector form, weights can be represented by w1k] = (wikl)T. The nj(kJ represents the weighted sum of a neuron j in layer k. The weighted sum of the inputs of a neuron j in layer k can be expressed as The output of the neuronj in layer k can be expressed as (31) j = 1,2,... nk (32) Where fj1kJ is the activation function of the neuron. We will discuss the activation function in the following section. In vector form, the formulas can be written as (33) (34) where :fkl = (:fkJ)T is a vector ofthe activation function. 20 Activation Function The original activation function is a binary function [18]. This limits the application of perceptron neural networks to classification problems only. In order to solve a general type of mapping application problem, we need to use nonlinear continuous activation functions. There are many nonlinear activation functions that can be used in multilayer networks as long as the functions are differentiable. The most commonly used functions are the sigmoid function and the hyperbolic function which are expressed as 1 Sigmoid function f(x) = 1+ex ex ex Hyperbolic function f(x) = x x e +e (35) (36) The graphs of signoid and hyperbolic functions are shown in Figure 32 and 33. Since we can always scale down the input and output values to the interval (0,1) or (1, I), there is no significant difference between the two functions. The sigmoid function is used in this paper. Weights in the Neural Network The weights in a neural network are initially chosen to be small random numbers. An activation function is active only in a small domain interval as shown in Figure32. If the initial weights are too large, the activation functions may saturate at the beginning of 21 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 <J? "9 "f ~ 0 N Figure 32 The Sigmoid Function 0.8 0.6 0.4 0.2 0 <J? ~ "f ~ C N ...,. 10 a:l 0 ·O.~ 04 0.6 ) 0.8 1 Figure 33 The Hyperbolic Function 22 the training and the network is prone to get stuck in a local minimum near the starting point [19]. In this paper, the initial weights ofall neural networks are chosen as random numbers uniformly distributed between 0.5 fan  in of that node and 0.5 [20], where the fanin of that node IS the number of inputs fan in of that node including bias input to that node. Optimization Algorithm From an optimization point of view, training a network is equivalent to minimizing a global error function, which is a multivariate function that depends on the weights in the network. In this paper we use the Conjugate Gradient Optimization Method. The method is introduced simply as shown below. The Conjugate Gradient Method searches the minimum in the conjugate direction to guarantee the quadratic termination. Suppose that we want to minimize the following function: (37) From the Taylor series we know that the flIst order necessary condition for x· is equal to zero, l.e. VF(x)lx=x' =0 23 (38) Any point that satisfies the above equation is called a stationary point. Even though the above equation is satisfied, there is no guarantee that the local minimum is reached. The second order necessary condition for a strong minimum is that the Hessian matrix to be semidifinite. Sufficient conditions for a strong minimum to exist require the Hessian matrix to be positive defmitely. The conjugate gradient method is to search the mmmlUm in the conjugate direction to guarantee the quadratic termination. The conjugate direction is defined as follows: A set of vectors {Pd is mutually conjugate with respect to a positive defmite Hessian Matrix A if and only if (39) Many vectors that satisfies (39). One set consists of the eigenvalues of A. It can be shown [21] that if we make a sequence of exact linear searches along any set of conjugate directions {Ph P2,...,P.}, then the exact minimum of any quadratic function with n parameters, will be reached in. at most, m searches. Recall that for a quadratic function, the gradient is VF(x) = Ax + d Ifwe calculate the change in the gradient at iteration k+l, we have 24 (310) ~g(k) = g(k+I) _ g(k) =( Ax(k+I) + d ) _(Ax(k) + d ) =MX(k) Based on the Steepest Descent Method [21), we have where a.(k) is chosen to minimize F(x) in the direction p(k). (311 ) (312) We can now restate the conjugate conditions by substituting (310) and (311) to (39). (313) Usually we use steepest descent method to begin the search, i.e. (314) Then at each iteration we need to construct a vector p(k) which. is orthogonal to {~g(l), ~g(2), ...,~g(k)}. It can be simplified [21] by for following form (315) The ~(k) can be chosen by several different methods, which will produce equivalent results for quadratic functions. One ofthe most common choice [21] is (316) 25 The algorithm is as follows: Algorithm 31. The algorithm for the conjugate gradient method is as follows: 1. Set k = 1, guess x(l); 2. Select the first search direction according to the steepest descent method, i.e. p(l) = _gO) 3. Calculate g(k) g(k) = VF(x)I~' 4. Calculate ~(k) according to (316); 5. Calculate p(k) according to (315); 6. Calculate ~X(k) according to (312), i.e. ~X(k) = (X(k+l) _ X(k~ = a(k)p(k) 7. Calculate X(k+l) as the following X(k+l) = x(k) + L\x(k) 8. Ifx(k+') satisfies the convergence criteria, stop. Otherwise, 9. Go to step 3. Forward Computations As we know from Chapter I, the neural network learning process includes two phases: forward computation and backward computation. During forward computation, a set of input data is given to the neurons in the first layer (input layer). These neurons are activated and pass the results to neurons to next layer. The process continues until the output layer is reached and the outputs ofthe network have been calculated. The process can be summarized as follows: 26 Algorithm 32. Forward algorithm 1. Given input vector x, set nO = x· 2. The weight matrix and activation function tiki, k:=l,2, ...,K are known, where k is the number of layers in the network; 4. al"k] is the output ofthe network; Backpropagation Computation We have discussed the forward computation in feedforward artificial neural networks in the last section. We will now formulate the backprogagation computation for feedforward artificial networks. We know that a feedforward artificial neural network changes its behavior (weights) dynamically during the training session. The error made by the network during training is measured by a predefined function called the error function (performance) [22], cost function, or energy function [23]. The error function is used to calculate errors and the distribution of errors among all neurons of a network. Then the connection weights are changed to reduce the error of the network. This dynamic adaptation of weights ends when the error is within a tolerance limit at an optimum point with respect to some optimization criterion. Considering a neural network ofK layers, the general performance function can be shown as: (317) The first tenn is the performance function (error function). The second term is the penalty term. It could be the Weigend penalty term [11], Charvin penalty tenn [14], Ji 27 penalty term [18] or some other form. Q is the number of input/output samples. Pi is the ith input data, tj is the desired ith output, and Wo are constants that are adjusted during training. Because the differentiation is additive it is convenient to consider one input/output sample L In practice, this is used for online training [22]. Summation over the entire input/output samples constitutes offline training [22]. So we have (318) To calculate the gradient element gij, we take the derivative of Ej with respect to Wikl and using the chain rule, we have [k] [k] DEi DEi ofiji gji =~=.::l [kJ·~+Pji v Wji (/ fiji (/Wji where Pji is an element ofthe penalty term and is defined as or ( J 2 1 L N a2y a2y For Bishop's penalty term P = L L :f ' f can be expressed as 2'=ln=1 Ox, aXI 28 (319) (320) (320)' 82 Yn _ (Y(x, + h)  y(x,») (y(x, )  y(x, h») a 2  h2 X, _ y(x/ + h) +Y(Xr  h)  2y(x,)  h2 The Pij can be shown as : ap PiJ=a wij From (31), we have a [il nl _ (iI) ~a, U WII Here we define Sj as: lk) = dE; = .IxJ. [HI P Sj  ow[X] Sj a; + ji JI (320)" (320)''' (321) (322) where Sj is the sensitivity ofE; to change in the jth element ofthe net input at layer [k]. Then (319) becomes [k]=~_ .lkI). [kI) .. gj .'J (x]  Sj a, +PJ1 VWji 29 (323) Using the Jacobian matrix [24], we can derive the recurrence relationship for the sensitivities. a [hi] a [k+l] a [k+l] nl nl nl a [k] a [k] on[k] n. n2 Sk a [k+I] a [k+l] a [k+1) n2 n2 n2 a [k] a [Il] a [k] on[k+l] nl n2 n Sk on[k]  (324) on[k+IJ on[k+l] () [k+J] Sk +1 Sk +1 nk t3 [k] a [k] a [k] nl n2 n Sk the element ij in (324) can be shown as: where aIkl( [kl) .clkl( [hI J) _ nj I nj  () [kl nj So the Jacobian matrix can be written as 30 (325) (326) o lk+l] n _ Wlk+l] d nlkl ) onlle]  "\ (327) where f1k](n\k l) 0 0 0 f[kl(n~k]) 0 0 F[k1(n1k]) = 0 (328) 0 0 tik{n~l) Now the sensitivity recUIsively in matrix form is seen as: [ [Ie] OE [k+I]) i on OE j [k] lk+I) T OE j S = 0 nlk ] = 0 nlk ] 0 nlk+11 = F(n ).(w ). 0 n rkH ] = F(n[k J).(w [k+I I), S!k+ll (329) The sensitivities are propagated backward through the network from the last layer to the fIrst layer. The starting point can be obtained from the output layer. Since .:J .:J lk] ~_~_ [kJ( lkJ) on,l4]  on,[kJ  f n, Sj can be expressed as 31 (330) (331 ) (331) has the following matrix fonn (332) (333) So we can recursively calculate the sensitivities from the last layer to the ftrst layer. Knowing the sensitivities, we can calculate the gradient according to (322). Algorithms of Penalty Method and Improved Penalty Method The penalty method and the improved penalty method used in this research are discussed in the following. The basic approach used in the penalty method involves adding penalty terms to the usual objective function in order to constrain the search and cause weights to differentially decay. By using the penalty method, the neural network generalization error can be reduced [24]. Algoritlun 33 (penalty method): Given a set of S = {(Pi, ti) I Pi is input, tj is desired output of Pi} of d training samples, and given a network of K layers with an input dimension u and an output dimension of v. I. Initialize all weights w(k] = (w/1), 1= 1,2,...,K as random numbers uniformly 05 0.5 distributed between. and c. . f h . Set woo fan  m of that set Ian  In 0 t atset 2. For each sample (Xi, tl) E S, repeat the following steps: Initialize g(k) = o. 32 2.1 Compute the actual outputs of network according to (33) and (34) using the weight wlk ) 2.2 Calculate the gradient g(Xi) according to (33) 2.3 Surn up g(Xl.), 'I.e., g(k)  g(k) + g(X.I) 3. Ifk=l then set p(l) = r(l) = _g(l) 4. Compute a(k) using a line search technique [23]. 5. Compute W(k+l) = W(k) + a(k)p(k) using step 2 to compute g(k+I). 6. Compute ~(k) according to (316). 8. If all the weights are such that the following convergence criterion is satisfied, then go to 9, otherwise set k=k+1 and go to step 2. d < 9. Set w = W(k+l) and stop. Actually the overfitting problem is not exactly a constrained optimization problem because the constrained condition is wUmown. There is not a universally accepted method for a constrained nonlinear optimization problem. Based on the penalty method, an improved penalty method is developed in this research. The main idea is training the network without adding any penalty term. Once the performance function value (RMS) begins to increase, a penalty term is added to the usual error function, and the network training process becomes continuous as same as the penalty method. If the generalization value (RMS) begins to increase again, then stop the 33 training. A performance function value (RMS) is used as a stopping criteria both in the penalty method and the improved penalty method. Algorithm 34 (Improved penalty method 1): Given a set of S = {(Pi" q I Pi is input, ti is desired output of Pi } of d training samples, and given a network of K layers with an input dimension u and an output dimension ofv. I. Initialize all weights w[k] = (w}ll ), 1= 1,2,...,K as random numbers uniformly os OJ distributed between ~ . f h and ~ . f b . Set Wo and A. Ian  mot at set Ian  In 0 t at set 2. For each sample (Xi, ti) E S, repeat the following steps. Initialize g(k) = O. 2.1 Compute the actual outputs ofnetwork according to (33) and (34) using the weight W(k) 2.2 Calculate the gradient g(Xi) according to (33) 2.3 Surn up g(Xi), I..e., g (k)  g (k) + g(X)i 3. Ifk=l then set p(I) = r(l) = _g(J) 4. Compute U(k) using a line search technique [23]. 5. Compute W(k+l) = W(k) + a(k)p(k) using step 2 to compute g(k+I). 6. Compute p(k) according to (316). 8. Before setting the value ofA., ifall the weights are such that the following convergence criterion is satisfied, then set A., otherwise set k=k+1 and go to step 2. After 34 setting the value ofA, if all the weights are such that the following convergence criterion is satisfied, then go to step 9, otherwise set k = k+l and go to step 2. d < d L(E(W(kl)) ,=1 d < to1 9. Set w = W(k+l) and stop. Based on algorithm 34, one more improved penalty method is given as algorithm 35. The main difference between algoritlun 34 and algorithm 35 is that a series ofAis given for a penalty tenn in algorithm 35. The objective function is dynamically changed based on the perfonnance ofeach different penahy parameter A. Algorithm 35 (Improved penalty method 2): Given a set of S = {(Pi, ti) I Pi is input, t i is desired output of Pi } of d training samples, and given a network of K layers with an input dimension u and an output dimension of v. 1. Initialize all weights w1k1 = (w/1), 1= 1,2, ..., K as random numbers os 0.5 uniformly distributed between ~ . f th and ~ . f tha . Set woo Ian  m 0 at set Ian  m 0 tset 2. Select parameter Ai 3. For each sample (Xi, t i) E S, repeat the following steps: Initialize g(k) = o. 3.1 Compute the actual outputs of network according to (33) and (34), using the weight w(k). 3.2 Calculate the gradient g(Xi) according to (33) 3.3 Sum up g(Xi), i.e., g(k) = g(k) + g(Xj) 35 4. 5. 6. 7. 8. 9. Ifk=1 then set p(l) = r(l) = _g(l) Compute a.(k) using a line search technique [23]. Compute p(k) based on equation (316). Compute L E(WA/k.+l~. For each Ai, repeat steps 2 to 9 and obtain Let W(k+l) = WA./ k + l >, WA./k+I) corresponds to the minimum value ofthe error. Ifall the weights are such that the following convergence criterion is satisfied, then go to step 11, otherwise set k = 1<.+1 and go to step 2. 11. Set W = W(k+l) and stop. 36 CHAPTER IV METHODS AND IMPLEMENTATION Neural Network Architecture Design To compare the effectiveness of different penalty methods, the performance of three training methods are studied in this research. These methods are the regular learning algorithm without a penalty tenn, the penalty method with different penalty terms, and the improved penalty method proposed in this research. The performance of each method is calculated using a computer program written in the ANSI Standard FORTRAN 77 language. A small network is tested fIrst. Then the hidden nodes will be added to the network. When the network becomes larger, the generalization error becomes larger and larger. Usually, the generalization error can be reduced by inducing a penalty term [1]. The improved penalty method proposed in this research has proved to be able to reduce the generalization error signifIcantly. Three penalty terms are tested in this research. There are many different types of penalty terms used in neural networks to reduce overfitting. Some of them are very complicated. Some ofthem have a disadvantage in that large weights decay at the same rate as small weights. Some of them include a few of userdependent parameters. The 37 three penalty terms which will be tested in this research are A. W~PW~ 2 A,W;.2 and ( HI ._.2)' JI, (WJi ) + Wo ( J 2 1 P L N a2 y LLL~ 2p p=ll=ln=1 aX~ The performance function value (RMS) is used as the stopping criterion. When to stop the training process is very important for a given problem. Therefore, an optimal stopping point is needed to obtain better generalization performance so that the network has a good generalization performance. This is especially important when a network is overfitting. In this thesis, the sample data are divided into two sets. One is the training set and the other is the validation set. When the network is trained, the generalization performance will be tested at certain numbers ofiterations using the validation set. The weights are initialized with random values which are uniformly distributed between 0.5 and 0.5 [2]. A curve fitting criterion is used to test all the learning algorithms. Tow data sets are used. One is the training data and the other one is the validation data. Both of them contain 49 pairs. The training and validation data sets are listed in table A35 and A36 respectively. For all the methods tested, the same sample data were used. In total seven methods are tested. Method A is the regular method and method B and C are penalty term methods based on Algorithm 33 with different penalty terms. Method D and E are improved penalty term methods based on Algorithm 34 with different penalty terms. Method F is an improved penalty term method based on Algorithm 35. Method G is an simplified Bishop's penalty term method. The overview ofthe methods tested in this paper is shown in Table 41. 38 Table 41 Overview of methods tested  Methods* Penalty Term  A B c D E F G R P P NP NP NP2 P No . HI 2 Wj. Wo [ 1 P L N a2Ynp J2 2P ~U~:;I ax~ • R  Regular method without any penalty term. P  Penalty term method NP  New penalty term method NP2  New penalty term method 2, the objective function will be changed dynamically. 39 Discussion ofTest Results First, a network with two imput nodes, seven hidden nodes and one output node (2/7/1) is tested. The network has 29 weights and methods A, B, C, D, and E are tested. The training and generalization performance of different method is listed in table AI through A9. The training and generalization performance of method A is listed in table AI. It takes about 11 epochs of training to get the training RMS value of 0.070721 land generalization RMS value of 0.0724163. The training and generalization performance of method B is listed in table A2 and A3. It takes 5 epochs of training to get the training RMS value of 0.0706896 and generalization RMS value of 0.0741628. It is found that method B makes the generalization performance slightly decrease (2.35%). The performance of method C is listed in table A4 and A5. It takes 7 epochs to get the training RMS and generalization RMS value of 0.0725740 and 0.0758244. It makes the generalization error increased by 4.49% and the training error increased by 25.07%. The performance of method D is listed in table A6 and A7. It takes about 12 epochs to get the training RMS value of 0.0691364 and generalization RMS value of 0.0698246. It improved the generalization performance by 3.71% and reduced the training error by 2.3%. Comparing with the penalty method (method B), the improved penalty method (method D) improved the generalization and training performance by 6.2% and 2.25% respectively. The performance of method E is listed in table A8 and A9. It has the same training and generalization performance as the method A because the penalty the term used in this method makes the training error increase. Next, the network with two input nodes, eight hidden nodes, and an output node (2/8/1) is tested. Similarly, training and generalization performance ofeach method of A, 40 B, C, D and E are listed in table AlO through A17 respectively. The comparison of the performance of different method is listed in Table 43. Method A takes 8 training epochs to get the training RMS value of 0.0730081 and generalization RMS value of 0.0767580. The perfonnance of method B is listed in table AII and table A12. It takes 6 training epochs to get the training RMS value of 0.071706 and generalization RMS value of 0.0751127. It improved the training and generalization performance by 1.81% and 2.19% respectively. The performance of method C is listed in table A13 and A14. It takes 10 epochs to get the training RMS value of 0.0718936 and generalization RMS 0.0744065. It improved the training and generalization 1.5% and 3.06% respectively. However, the value ofA. should be selected very carefully. Otherwise, it will increase the training and generalization error. The perfonnance of method D is listed in table AIS and A16. It takes about 15 training epochs to get the training RMS value of 0.0681585 and generalization RMS value of 0.0669311. It improved the training and generalization performance by 7.11% and 14.68% respectively. Comparing with the penalty term method, it improved the training and generalization perfonnance by 5.2% and 12.22%. The performance of method E is listed in table A17. It takes ] 5 epochs to get the training RMS value of 0.0718534 and the generalization RMS value of 0.0737869. It improved the training and generalization performance by 1.6% and 4.03% respectively. Thirdly, the network with two input nodes, ten hidden nodes, and an output node (2/10/1) is tested. Methods A, B, C, D and E are tested. The performance of different method is listed in table A18 through A26. The performance of method A is listed in table A18. It takes about 12 epochs to get the training RMS value of 0.0823795 and generalization RMS value of 0.0865625. In this case, the network is overfitting. Method 41 B takes about 13 training epochs to get the training RMS value of 0.0782682 and generalization RMS value of 0.0865625. It improved the training and generalization performance by 5.25% and 6.46% respectively. The performance ofmethod C is listed in table A21 and A22. The performance of method D is listed in table A23 and A24. It takes about 17 epochs to get the training and generalization RMS value of0.0753999 and 0.0763246 respectively. It improved training perfonnance by 9.26% and the generalization performance by 13.41 %. Comparing with the penalty term method, the improved penalty term method improved the training and generalization perfonnance by 3.8% and 6.53% respectively. To test the effectiveness ofthe method F, a series ofA(0.008,0.006,0.004,0.002, 0.001,0.0006,0.0001, 0.00006, 0.00004, 0.00001) are tested in a network with two input nodes, seven hidden nodes, and an output node. The performance of training and generalization of different A is listed in table A27. The best performance is obtained when Aequals 0.0001. It is helpful to use the improved penalty term method 2 to get the best Afrom a set of A values. Once the A is selected, the rest of the training process ofthe improved penalty term method 2 (method F) is as same as the penalty term method. So the weakness ofthe penalty term method still exists in the improved penalty term method 2. Finally, the performance of Bishop's penalty term method G is tested. The performances of different networks are listed in table A28 through table A34. For the net work with two input nodes, seven hidden nodes, and an output node (2/711), Bishop's penalty term method takes 10 training epochs to get the training RMS value of 0.0707547 and generalization RMS value of 0.0724433. The generalization performance is very 42 close to the generalization performance of method A (Table 42). For the network (2/8/1), it takes 13 training epochs to get the training RMS value of 0.0788628 and generalization RMS value of 0.0748189. It improved the generalization performance by 2.5%. For the network with two input nodes, ten hidden nodes, and an output nodes (2110/1), it takes 8 training epochs to get the training RMS value of 0.0823851 and generalization RMS value of 0.0855824. It improved the generalization performance by 1.13%. The performance comparison of different method for different networks is listed in table 42 through tab 44. 43 Table 42 Perfonnance Comparison of Different Method For the Network with 7 hidden nodes Method Epoch Training RMS Generalization RMS A 10 0.0707211 0.0724163 B1 5 0.0706896 0.0741628 B2 6 0.0895382 0.0892582 Cl 7 0.0725740 0.0758244 C2 2 0.0884512 0.0866356 Dl 12 0.0691364 0.0698246 D2 10 0.0707211 0.0724163 El 10 0.0707211 0.0724163 E2 10 0.0707211 0.0724163 G 10 0.0707547 0.0724433 Table 43 Perfonnance Comparison ofDifferent Method For the Network with 8 hidden nodes Method Epoch Training RMS Generalization RMS A 8 0.0730081 ! 0.0767580 B1 6 0.0717076 0.0751127 B2 10 0.0728760 0.0764509 Cl 10 0.0718936 0.0744065 C2 2 0.0865809 0.0864760 Dl 16 0.0681907 0.0677225 D2 15 0.0681585 0.0669311 El 15 0.0718534 0.0737869 G 13 0.0728628 0.0748189 Table 44 Performance Comparison of Different Method for Network with 10 hidden nodes Method Epoch Training RMS Generalization RMS A 12 0.0823795 0.0865625 B1 5 0.0878567 0.0872217 B2 13 0.0782682 0.0813108 Cl 7 0.0823783 0.0855836 C2 2 0.0853115 0.0865137 Dl 17 0.0753999 0.0763246 D2 12 0.0823741 0.0855825 El 10 0.0823793 0.0855825 E2 10 0.0823793 0.0855822 G 8 0.0823851 0.0855824 44 CHAPTER V CONCLUSION Overfitting is a very important issue in artificial neural networks. Penalty term methods are useful way to reduce overfitting. Seven different training algorithms are studied in this research. The following conclusions can be drawn from this study: 1. Overfitting does exist in artificial neural networks. As the neural network becomes larger, the generalization performance becomes worse. It is better to use the smallest network that fits the data. 2. For a network which is not overfitting, the penalty term method has no significant improvement for training and generalization performance of the network. If the penalty term or the constant A is not chosen properly, the penalty term method will decrease the performance significantly. On the other side, the improved penalty method can slightly increase the generalization performance of the network if the penalty term and A are chosen properly. If the penalty term and A are not chosen properly, the improved penalty term method can also be used to train the network and has no risk to decrease the performance. Usually it is difficult to know if the network is overfitting or not. Therefore, it is better to use the improved penalty term method than to use the penalty term method in any situation. 45 3. When the network is overfitting, the penalty method can be used to improve the generalization performance of the networks. Compared with the penalty term method, the improved penalty method improves the training and generalization performance more significantly and has no risk to decrease the performance. 4. Penalty term and the constant A are problem and network architecture dependent. The improved penalty method 2 can be used to chose a Aproperly and improve the perfonnance significantly as well. Future work could be done in several areas as listed below: 1. To investigate the performance of each method, a training data set and a validation set are used in this research. Since the training procedure used in the research can itself lead to some overfitting to the validation set, the performance of each training method may be confirmed by measuring its performance on a third independent set ofdata called a test set. 2. A constant Ais used in most of the penalty methods. There is no criteria to select a A. It is valuable to conduct a method to choose A to close to the optimum point to improve the generalization performance. 3. Another method that can be investigated is an interactive method in which the designer checks the trained network and decides which nodes to remove. Several heuristics are used to identify units that don't constant output over all training patterns. When a number of nodes have highly correlated responses over all patterns, they can be combined into one node. 46 2 Bibliography [1] HechtNielsen, Robert, Neurocomputing, AddisonWesley Publishing Company, 1990. [2] Rumelhart, D. and McClelland, J., Parallel distributed processing: exploitations in the microstructure ofcognition, Volumes 1 and 2, Cambridge: MIT Press, 1986. [3] Kohonen, T., SelfOrganising and Associative Memory (3rd ed.), Berlin: SpringerVerlag, 1989. [4] Albus, 1. S., A new approach to manipulator control: cerebellar model articulation control (CMAC). Trans. ASME, J ofDynamics Syst., Meas. and Contr., 97, 220227, 1975. [5] Dietterich, Tom, Overfitting and Undercomputing in Machine Learning, ACM Computing Survey, Vol. 27, No.3, pp.326327, Sept. 1995. [6] Reed, Russell, Pruning AlgorithmsA Survey, IEEE Transactions on Neural Networks, Vol. 4,No. 5, 1993. [7] Hoerl, Arthur E. and Kennard, Robert W., Applications to Nonorthorgonal Problems, Eechnometrics, Vol. 12, No. I, pp. 6982, Feb. 1970. [8] Subatai., Ahmad and Tesauro, Gerald, Scaling and Generalization in Neural Networks:A Case Study, Advances in Neural Information Processing 1, D.S. Touretzky, Ed. pp. 160168, 1989. [9] Chauvin, Y., Generalization Performance of Overtrained Backpropagation Networks, in Lecture Notes in Computer Science, Edited by L. B. Almieda and C.l Wellekens, SpringerVerlag, 1990. 47 [10] Melvyn, W. 1., Mathematical Programming, An Introduction to Optimization, Marcel Dekker, Inc. 1986. [11] Weigend, AS., Rumelhart, D. E., and Huberman, B. A., Backpropagation, weightelimination and time series prediction, in Proc. 1990 Connectionist Models Summer School, D. Touretzky, 1. Elman, T. Sejnowski, and G. Hinton. Eds., 1990, pp. 105116. [12] Weigend, AS., Rumelhart, D. E., and Huberman, B. A., Generalization by weightelimination applied to currency exchange ra te prediction, in Pmc. Int. Joint Conf. Neural Networks, Vol. I (Seattle), 1991, pp. 837841. [13] Weigend, AS., Rumelhart, D. E., and Huberman, B. A, Generalization by weightelimination with application to forecasting, in Advances in Neural Information Processing (3), R. Lippmann, 1. Moody, and D. Touretzky, Eds., 1991, pp. 875882. [14] Chauvin, Y., A backpropagation algorithm with optimal use of hidden units, I Advances in Neural Information Processing (1), D.S. Touretzky, Ed. (Denver), 1989, pp. 519526. [15] Ishikawa, M., A structural learning algorithm with forgetting of link weights, Tech. Rep. TR907, Electrotechnical Lab., TsukubaCity, Japan, 1990. [16] Bishop, C. M., CurvatureDriven Smoothing: A Learning Algorithm for Feedforward Networks, IEEE Transactions on Neural Networh, Vol. 4, No.5, September 1993. [17] Valiant, L. G., A theory of the learnable, Comrnun. ACM, Vol. 27, No. 11, pp. 11341142, 1984. 48 [18] McCulloch, W.S. and Pitts, W., A Logical Calculus of the Ideas Immanent in Nervous Activity, Bulletin on Math. Bio., 5, 1943. [19] Hagan, Martin T., Neural Network Design, Lecture Notes, Oklahoma State University, 1995. [20] Cichocki, A. and Unbehauen, R., Neural Networks for Optimization and Signal Processing, Wiley, 1993. [21] Scales, L. E., Introduction to Nonlinear Optimization, New York, SpringerVerlag, 1985. [22] Cichocki, A. and Unbehauen, R., Neural Networks for Optimization and Signal Processing, Wiley, 1993. [23] Chauvin, Y., Dynamic Behavior of Constrained BackPropagation Networks, in Advances in Neural Information Processing 2, D.S. Toretzky, Ed. Pp.642649, 1989. [24] Jiang, P., A Penalty Method to Reduce Overfitting in Artificial Neural Networks. Masters degree thesis, Oklahoma State University, 1996. [25] Hestenes, M. R., Multiplier and Gradient Methods, Journal ofOptimization Theory and Applications, No.4, pp. 303320, 1969. 49 APPENDIX A TESTING TABLES Table AI Performance of Training and Generalization (RMS) Method A with 7 hidden nodes Epoch Training RMS Generalization RMS Convergence Error 0 0.214704EOO 0.209518EOO 1 0.812552E01 0.841985EB1 0.133448E00 2 0.80 1644EOl 0.834818E0 I 0.109082E02 3 O.790322E0 I 0.825625EOl 0.113216E02 4 0.779749E01 0.8 16252EO 1 0.105730E02 5 O.750445E01 0.787846EOl 0.293044E02 6 O.736659E0 1 0.774 192EO I 0.137859E02 7 0.7222IOEOl 0.755748£0 I 0.144488E02 8 O.717990EOl 0.732518EOl 0.422016E03 9 0.71 8357EOl O.732768EO I 0.366718E04 10 0.707211 EO 1* 0.724163E01* 0.1 11452E02 11 O. 707575E0 I 0.724437E0 I 0.363737E04  Table A2 Performance ofTraining and Generalization (RMS) Method B with 7 hidden nodes and Ie = 0.01 Epoch Training RMS Generalization RMS Convergence Error 0 0.214709E00 0.209518EOO I 0.896909EOl 0.891520EOl 0.125540E00 2 0.895401£01 0.892551EOl 0.150718E03 3 0.895380EO 1 0.892581£01 0.2 I8302E05 4 0.895385EO I 0.892578EOl 0.506639E06 5 0.895382EOI 0.892583 EOI 0.275671 E06 , 6 0.895382E0 1* 0.892582E0 '* 0.447035E07 7 0.895383 EO I 0.892581EOl 0.596046E07 50 Table A3 Performance of Training and Generalization (RMS) Method B with 7 hidden nodes and A. = 0.000 I Epoch Training RMS Generalization RMS Convergence Error 0 0.2 I4709EOO 0.2095 I8EoO 1 0.813377Eo I 0.842336£0 I 0.13337IEOO 2 0.80 I080Eo I 0.83400IE01 0.122967E02 3 0.782553Eol 0.818001 Eol 0.185277E02 4 O.774735Eo I 0.81o877Eo1 0.781715E03 5 0.706896Eol* 0.741628EoI· 0.678393E02 6 0.711018Eol O.745780E0 I 0.412233E03 Table A4 Performance of Training and Generalization (RMS) Method C with 7 hidden nodes and A. = 0.0001 Epoch Training RMS Generalization RMS Convergence Error 0 0.2 I4709E00 0.209518EoO I 1 0.812634E01. 0.842007E01 0.133440EoO 2 0.801 132Eol 0.834342Eol 0.115024E..Q2 3 0.788925Eol 0.824305Eol 0.122075E02 4 O.778096Eo 1 0.814628E01 0.108288E02 5 0.74428 lEO 1 O.781 553Eo I 0.338145E02 6 0.732613Eo 1 O.769954E0 I 0.1 16679E02 7 O.725740E0 I· 0.758244£01 * 0.687353E03 8 0.729824£0 I O.743964Eo I 0.408381E03 Table A5 Performance ofTraining and Generalization (RMS) Method C with 7 hidden nodes and A. = 0.01 Epoch Training RMS Generalization RMS Convergence Error 0 0.214709E00 0.2095 18EoO ] 0.886283Eo I 0.867396Eol 0.126221 EoO 2 0.884512EOl· 0.866238Eol· 0.177145E03 3 0.884734Eol 0.866356Eol 0.221804E04 51 Table A6 Perfonnanoe of Training and Generalization (RMS) Methoo D with 7 hidden nooes and A. = 0.0001 Epoch Training RMS Generalization RMS Convergence Error 0 0.214709E~0 o.209518EoO 1 0.812552E~1 0.841985E~1 0.133448 2 0.801644E01 0.834818E~1 0.109082E02 3 0.790322Eo 1 0.825625E~ I 0.1 13216E02 4 O.779749Eo I 0.816252E~1 0.105730E02 5 O.750445EO 1 O. 787846EO 1 0.293044E02 6 0.736659EOI 0.774192E01 0.137859E~2 7 0.722210Eol O.755748EO1 O. I44488E02 8 0.71 7990EOl 0.732518EOl 0.422016E03 9 0.718357E01 O.732768Eo 1 0.366718E04 10 0.7072IIEOl 0.724163Eol 0.11 1452E02 11 0.723729EOl O.728283EO 1 0.909194E03 12 0.691364EO 1* 0.698246EO I· 0.323655E02 13 O. 704434E0 1 0.708009E01 0.130697E02 Tab]e A7 Perionnanoe of Training and Generalization (RMS) Methoo D with 7 hidden nooes and A. = 0.0 I Epoch Training RMS Generalization RMS Conver~ence Error 0 0.214709EOO 0.209518EOO 1 0.8 12552EO 1 0.841985E0 1 0.133448 2 0.801644EO] 0.834818E01 0.109082E02 3 O.790322E01 0.825625EO 1 0.1132]6£02 4 0.779749£01 0.816252£01 0.1 05730E~2 5 0.750445£01 0.787846E01 0.293044E02 6 O.736659E0 1 0.774192E01 0.137859E02 7 0.722210EO 1 0.755748E01 0.144488E02 I 8 0.717990E01 0.732518£01 0.422016E03 9 0.718357£01 O.732768E01 0.366718E04 10 0.707211 EOl· 0.724163EOl· 0.1 11452E02 11 O.707750E0 I 0.724481 EO 1 0.422075E04 Table A8 Performance of Training and Generalization (RMS) Methoo E with 7 hidden nooes and A. = 0.01 Epoch Trainin~ RMS Generalization RMS ConverRence Error 0 0.214709EOO 0.209518EOO 1 0.812552EOl 0.841 985EO I 0.133448 2 0.80 1644EO I 0.834818EOl 0.109082E02 3 O.790322E0 I 0.825625E01 0.113216E02 4 0.779749EO I, 0.816252£01 0.105730E02 5 0.750445£01 0.787846EOl 0.293044E02 6 0.736659£01 O.774192EO I 0.137859£02 7 0.722210EOl 0.755748EOI 0.144488E02 8 0.717990EOI O.7325 18EO 1 0,422016E03 9 0.718357EOl O.732768E0 1 0.366718E04 10 0.707211EOI* 0.724163EOl· 0.11 1452E02 11 0.708656EOl 0.724377E01 0.28260 1E04 52 Table A9 Performance ofTraining and Generalization (RMS) Method £ with 7 bidden nodes and A. = 0.0001 Epoch Training RMS Generalization RMS Convergence Error 0 0.2 I4709EOO 0.209518£00 I 0.8 12552EO I 0.841985£01 0.133448 2 0.801644£01 0.834818£01 0.109082£02 3 0.790322£01 0.825625£01 0.113216E02 4 O.779749EO 1 0.816252£01 0.105730£02 5 0.750445£01 0.787846£01 0.293044£02 6 0.736659£01 0.774192£01 0.137859£02 7 0.722210£01 0.755748£01 0.144488£02 8 0.7 17990E01 0.732518£01 0.422016£03 9 0.718357£01 0.732768£01 0.366718E04 10 0.707211£01· 0.724163£01· 0.1 I 1452E02 Il 0.707701£01 0.724475£01 0.415072£04 Table AI0 Perfonnance of Training and Generalization (RMS) Method A with 8 hidden nodes Epoch Training RMS Generalization RMS Convergence Error 0 0.213003£00 0.207839E00 1 0.818643EOl 0.848730E0 1 0.131138 2 0.815437EOl 0.847499E0 1 0.320621 £03 3 0.813113£01 0.846303£01 0.232413£03 4 0.808220EOl 0.842961EOl 0.489302E03 5 0.796452£01 0.833113£01 0.117680£02 6 0.778920EOl 0.81 6870EOl 0.175317E02 7 0.737960£01 O.775554EO I 0.409602£02 8 0.730081 £01· 0.767580EOl· 0.787854E03 9 0.733293£01 0.770622£01 0.321187E03 Table AII Performance ofTraining and Generalization (RMS) Method B with 8 hidden nodes and A. = 0.01 Epoch Training RMS Generalization RMS Convergence Error 0 0.213003EOO 0.207839£00 1 0.819372EOl 0.848970£01 0.131071 2 0.816100£0 I 0.847689EO 1 0.327244E03 3 0.812510£01 0.845630E0 1 0.358932E03 4 0.800752EOI 0.836221E01 0.1 17583E02 5 0.791521£01 0.828127£01 0.923134E03 6 O.717076EOI· 0.751127£01 0.744448£02 7 0.722484£01 0.756384£01 0.540763E03 S3 Table A12 Performance of Training and Generalization (RMS) Method B with 8 hidden nodes and A= 0.000 I Epoch Trainin.l!: RMS Generalizatioo RMS Convenzence Error 0 0.213003EOO 0.207839EOO I 0.818717EOl 0.848753EOl 0.131132 2 0.81 5524EO I 0.847525EO I 0.3 I9220E03 3 0.813135E01 o.846287EO 1 0.238933E03 4 0.807856EO I 0.842638EO 1 0.527889E03 5 0.796023EO 1 0.832680EO I O. 118332E02 6 0.776854EO 1 0.8 I4787E0 1 0.191 688E02 7 0.739080EOl 0.776723E01 0.377746E02 8 O.728923E0 I 0.766369E01 0.101567E02 9 O.730239EO I 0.767609EOI O.13 1637E03 10 0.728760EOl* 0.764509EOl· 0.14790IE03 11 O.735858E0 I 0.771 347EOl 0.709720E03 Table A13 Performance of Training and Generalization (RMS) Method C with 8 hidden nodes and A= 0.0001 Epoch Training RMS Generali.zatiOll RMS Convergence Error 0 0.213oo3E00 0.207839E00 1 0.81869IEOI 0.848744EO I 0.131 134E00 2 0.815431EOI 0.847472EOI 0.325955E03 3 0.812866EOI 0.846118E01 0.256523E03 4 0.806990EO1 0.841 960EO I 0.587605E03 5 0.794732EOI 0.831542E{)1 0.122583E02 6 0.772993EOI 0.811025EOI 0.2 I7392E02 7 O.738339E01 O.776072E{) I 0.346541 E02 8 0.72845IEOI 0.765909EOl 0.988781 E03 9 0.728672EOl 0.7661 29EO1 0.221059E04 10 0.718936EOl· O.744065E{) I· 0.973582E03 II 0.722725EOI 0.747 144EO1 0.378869E03 Table A14 Performance of Training and Generalization (RMS) Method C with 8 hidden nodes and A= 0.01 Epoch Training RMS Generalization RMS Converp,ence Error 0 0.213003EOO 0.207839EOO I 0.866797EOI 0.865098EO I 0.126489E00 2 0.865809E0 I· 0.864760EO I 0.987947E04· 3 0.865913EOl 0.864781E{)1 0.103712E04 54 Table A15 Performance ofTraining and Generalization (RMS) Method 0 with 8 hidden nodes and A= 0.01 Eooch Training RMS Generalization RMS Cooveflzence Error 0 0.213003EOO 0.207839EOO 1 0.8 I8643 EO 1 0.848730EO 1 0.131138 2 0.815437EOI 0.847499EOI 0.32062]E03 3 0.813113EOl 0.846303EOI 0.232413E03 4 0.808220EO I 0.842961EOI 0.489302E03 5 0.796452EO] 0.833113EOI 0.1 I7680E02 6 0.778920EOI 0.816870EO] 0.1753] 7E02 7 0.737960EOI O.775554EO1 0.409602E02 8 0.73008IEOl 0.767580EO 1 0.787854E03 9 0.733293EOI 0.770622EO 1 0.321 187E03 10 0.731537EOl 0.765954EOl 0.175618E03 II 0.759106E01 0.774551 EO I 0.186249E02 12 0.757639EOl 0.773498EOI 0.146680E03 13 0.762574EOI O.775192EO1 0.493556E03 14 O.763435EO1 0.768638EOI 0.860468E04 15 0.747234EOl 0.740842EO] 0.162011E02 16 0.681907EOl* 0.677225EOl* 0.653267E02 17 0.695431 EO 1 0.688462EOl 0.135239E02 Table A16 Performance of Train ing and Generalization (RMS) Method 0 with 8 hidden nodes and A= 0.0001 Eooch Training RMS Generalization RMS Convergence Error 0 0.2 13003EOO 0.207839EOO 1 0.8 18643EOl 0.848730EO I 0.131138 2 0.815437EOl 0.847499EO I 0.320621 E03 3 0.813113EOI 0.846303EO 1 0.232413 E03 4 0.808220EOI 0.842961 EO I 0.489302E03 5 0.796452EOI 0.833113E~01 0.117680E02 6 0.778920E0] 0.816870EOI 0.175317E02 7 O.737960E01 0.775554EOl 0.409602E02 8 0.73008IE01 O.767580EO 1 0.787854E03 9 0.733293E01 O.770622EO1 0.321 I87E03 10 0.759254EOI 0.774626EOI 0.186847E02 II O.757737EO 1 O.773533EO1 0.151746E03 12 0.73 1537EOI 0.765954E01 0.1.75618E03 13 0.713689EOl O.703227EO I 0.510639£02 14 0.695947EOl 0.682037EO 1 0.177421 E02 15 0.681585EOI * 0.669311 EO 1* 0.143620E02 16 0.68 I996EO 1 0.669665EO I 0.410751 E04 55 I, '. Table A17 Performance ofTraining and Generalization (RMS) Method E with 8 bidder! nodes and A. = 0.01 Epoch Training RMS Generalization RMS Convergence Error 0 0.213003E00 0.207839EOO I 0.818643E01 0.848730EOl 0.131\38 2 0.81 5437EOI 0.847499EOl 0.32062\ E03 3 0.8131 13EOI 0.846303EO 1 0.232413E03 4 0.808220£0 \ 0.842961£0\ 0.489302E03 5 0.796452EOl 0.833113£0\ 0.1 I7680E02 6 O.778920EOl 0.816870£0 \ 0.175317E02 7 0.737960EOl 0.775554£01 0.409602£02 8 0.730081EOI 0.767580EOl 0.787854E03 9 0.733293EOl O.770622EO1 0.321 I87E03 10 0.731537£01 0.765954EOl 0.175618£03 II 0.756865E01 0.775078£01 0.187512E02 12 0.747758£01 0.767588£0 I 0.910699E03 13 0.743792£01 O.763368EO I 0.975490£03 14 0.735836EOI O.754677EO 1 0.156695E02 15 0.7 I8534E0 I· 0.737869£01· 0.173014E02 16 O.724389EO I 0.726581 EOI 0.585444£03 Table A18 Performance ofTraining and Generalization (RMS) Method A with 10 hidden nodes Epoch Training RMS Generalization RMS Convergence Error 0 0.210536E00 0.205405£00 1 0.8251IOEOl 0.855849EOl 0.128025 2 0.824249E01 O. 855839EO1 0.861 I38E04 3 0.824023EOl 0.855835E0 I 0.226125E04 4 0.823923EOl 0.855831 EO I 0.999123E05 5 0.823858EOI 0.855828EO I 0.64820 I E05 :1 6 0.823841 EOl 0.855827E0 1 0.171.363E05 7 0.823805E0 I 0.855825E0 I 0.366569E05 8 0.823793E01 0.855825E0 I 0.111759E05 9 0.823794E01 0.856824£01 0.447035E07 10 0.823790E01 0.856824EO I 0.402331 E06 11 0.823793E01 O.866823EOl 0.312924E06 12 0.823795EOl· 0.865625EO1· 0.178814E06 13 0.823806£01 0.855825EO I 0.11 J759E05 56 Table A19 Performance ofTraining and Generalization (RMS) Method B with 10 hjddeo nodes and A. = 0.0 I Epoch TraininR RMS Generalization RMS Convergence Error 0 0.210543 0.205405 1 0.878851£0 I 0.872232E0 1 0.123397 2 0.878564E0 1 0.872218E01 0.293776£04 3 0.878569E0 I 0.872211£01 0.514090E06 4 0.878569E0 I 0.872217EOl 0.00000 5 0.878561£01* 0.872211£01* 0.186265E06 6 0.878574EOI 0.872211£01 0.707805E06 Table A20 Performance of Training and Generalization (RMS) Method B with 10 hidden nodes and A. = 0.0001 Epoch Training RMS Generalization RMS Convergence Error 0 0.210543 0.205405 1 0.825718E01 0.855999Eol 0.127972 2 0.824931 Eo I 0.855999E01 0.787oo5E04 3 0.824664E0 I 0.855993Eo 1 0.266880E04 4 0.824511 EO 1 0.855984E0 I 0.153333E04 5 0.824397Eo1 0.855975Eo I 0.1 I 3398E04 6 O. 824122E01 0.855968Eo 1 0.753254E05 7 0.823246Eol 0.855960Eo 1 0.759959E05 8 0.814179Eol 0.857852E0 I 0.672042E05 9 0.814121Eol 0.835944Eol 0.582635E05 1.0 0.804066Eol 0.825736Eol 0.544631£05 II 0.782809EOl 0.813131 Eol 0.562519E05 12 0.782748Eol 0.813120Eo I 0.648946E05 13 0.782682Eol* 0.813108EoI* 0.693649E05 14 O.782996Eo I 0.813791 EO I 0.90 1520E05 Table A21 Performance ofTraining and Generalization (RMS) Method C with 10 hidden nodes and It. =0.00001 Epoch Training RMS Generalization RMS Convergence Error 0 0.210543E00 0.205405E00 I 0.825111£01 0.855866E0 I 0.128024E00 2 0.824285EOl 0.855855E0 1 0.832081 E04 3 0.824051 EO I 0.855849E0 I 0.234768E04 4 0.823928EOl 0.855844E0 I 0.1221 15E04 5 0.823875E0 I 0.855841 EO I 0.533462E05 6 0.823824E0 1 0.855839E01 0.509620E05 7 0.823805EOl 0.855837EOl 0.195205E05 8 0.823791£0 I 0.855831£01 0.789762E06 9 0.823783E01* 0.855836E0 I· 0.137091 E05 10 0.823790E0 1 0.855836EOI 0.707805E06 57 Table A22 Performance of Training and Generalization (RMS) Method C with 10 hidden nodes and A= 0.01 Epoch Training RMS Generalization RMS Conver~ence Error 0 0.210543EOO 0.205405EOO 1 0.853625EO I 0.865173EOl 0.125381 EOO 2 0.853090EOl 0.865138E01 0.534728E04 3 0.853115EO I* 0.865137EOl * 0.249594E05 4 0.853128EO I 0.865 136EO I 0.130385E05 Table A23 Performance of Training and Generalization (RMS) Method 0 with 10 hidden nodes and A. = 0.000 I Epoch Training RMS Generalization RMS Convergence Error 0 0.210536 0.205405 1 0.82511 OEO1 0.855849EO I 0.128025 2 0.824249EOl 0.855839EO I 0.861138E04 3 0.824023E0 I 0.855835E0 I 0.226125E04 4 0.823923E0 I 0.855831E01 0.999123E05 5 0.823858EOl 0.855828E01 0.648201 E05 6 0.823841 EO I 0.855827E01 0.171363E05 7 0.823805E0 I 0.855825EOl 0.366569E05 8 0.823793E0 I 0.855923EO I 0.1 I I 759E05 9 0.823792E0 I 0.856874EO I 0.447035E07 10 0.823790E01 0.855824EOI 0.402331 E06 II 0.825205EO I 0.850717E01 , 0.786036E05 12 0.814152EOl 0.846813EO I 0.528991 E05 13 0.802311£01 0.835607EO I 0.454485E05 J4 0.800074EOl 0.821903EOl 0.326335E05 15 0.784048EOl 0.8 J5800E0 I 0.263OO5E05 16 0.772402EOl O.794055EO I 0.261515E05 17 0.753999EOI * 0.763246EOI· 0.227243 E05 18 0.766967EOI 0.795789EO I 0.3 I5905E05 Table A24 Performance of Training and Generalization (RMS) Method 0 with 10 hidden nodes and A = 0.0 I Epoch Training RMS Generalization RMS Convergence Error 0 0.210543EOO 0.205405EOO 1 0.82511 OEO I 0.855849EOI 0.128025 2 0.824249E0 1 0.855839EOl 0.861 I38E04 3 0.824023 EO1 0.855835EOl 0.226125E04 4 0.823923E01 0.855831 EO 1 0.999123E05 5 0.823858EOl 0.855828EO I 0.64820 IE05 6 0.823841 EO I O. 855827EO I 0.17 J363E05 7 0.823805EOI 0.855825EO I 0.366569E05 8 0.823793EOl 0.855825E0 I 0.11 1759E05 9 0.823794EOl 0.855824EOl 0.447035E07 to 0.823790E0 1 0.855824EOl 0.402331 E06 II 0.82384IEOl 0.855824 EO 1 0.186265E06 12 0.82374IEOl* 0.855825EOl* 0.566244E06 13 0.823931 EO I 0.855824ED I 0.151992E05 58 Table A25 Performance ofTraining and Generalization (RMS) Method E with 10 hidden nodes and A. = 0.01 Epoch Training RMS Generalization RMS Convergence Error 0 0.210543E00 0.205405E00 1 0.825I1OEOl 0.855849E01 0.128025 2 0.824249E0 1 0.855839EOl 0.861 I 38E04 3 0.824023EOI 0.855835EOl 0.226125E04 4 0.823923E01 0.85583lE0 I 0.999123E05 5 0.823858E01 0.855828EOl 0.64820IE05 6 0.82384IEOI 0.855827E0 1 0.171363E05 7 0.823805E01 0.855825EOl 0.366569E05 8 0.823793EOl 0.855825EOl 0.111759E05 9 0.823794EOl 0.855824EOl 0.447035E07 Table A26 Performance ofTraining and Generalization (RMS) Method E with 10 hidden nodes and A. = 0.00001 Epoch Training RMS Generalization RMS Convergence Error 0 0.210543E00 0.205405E00 1 0.825 I IOEOl 0.855849E0 1 0.128025 2 0.824249E0 J 0.855839E0 1 0.861138E04 3 0.824023EOI 0.855835E0 1 0.226125E04 4 0.823923E0 J 0.855831 EO J 0.999123E05 5 0.823858£01 0.855828E0 J 0.648201 E05 6 0.823841EOl 0.855827EOl 0.171363E05 7 0.823805E01 0.855825EO] 0.366569E05 8 0.823793EOl 0.855825E01 0.1 11759E05 9 0.823794EOl 0.855824E01 0.447035E07 10 0.823790EOl 0.855824E0] 0.402331E06 11 0.823795E0 1 0.855823E01 0.154972E05 12 0.823793EOl 0.855823E0 1 0.178814E06 13 0.823779E0 1 0.855822E0] 0.144541 E05 59 Table A27 Perfonnance ofTraining and Generalization (RMS) Method F with 7 hidden nodes and different A(0.008,0.006 0.004,0.002,0.001,0.0006,0.0001,0.00006, 0.00004, 0.00001) Epoch " Training RMS Generalization RMS Convergence Error i 1 0.80000E02 0.860905EOl I 0.848817E"{) I 0.128730EOO 2 0.60000E02 0.846864EO I i 0.847823EO 1 0.465959E03 3 0.40000E02 0.837784EO 1 I 0.847662E"{) I 0.478327E..{)4 4 0.20000E02 0.828646EOl 0.846507E"{) 1 0.51 I557E04 5 0.10000E02 0.802596E01 0.806985E"{)1 0.211523E02 6 0.60000E03 0.791398EOI O.792753EOI 0.787571 E03 7 0.10000E03* 0.771920E0 I* 0.786071EOl 0.133017E"{)2 8 0.60000E04 O. 774715E..{) 1 0.784241 EOl 0.323653E03 9 0.40000E04 0.774570£01 0.784329E01 0.331238£03 10 O.IOOOOE04 0.774264EO I 0.784434E01 0.333793E03 1J 0.10000E03 0.771808EOl O.783094EO I 0.358023E03 12 0.10000E03 0.771222EOl 0.782939EOl 0.586659E04 13 0.10000E03 0.813377EOl 0.842336E0 I 0.133371 EOO Table A28 Performance of Training and Generalization (RMS) Method G with 7 hidden modes and A= 0.001 Epoch Training RMS Generalization RMS Convergence Error 0 0.214756 0.209518 1 0.81 7576EO I 0.841984EOl 0.132999 2 0.807776EO 1 0.835390EOl 0.979960E03 3 0.798116EOI 0.827304EO I 0.966057E03 4 O.788325EO 1 0.8 I8250E0 I 0.979044E03 5 0.766781EOl O.796450E..{) I 0.215444E02 6 0.752509EOl 0.781470EOl 0.142720E02 7 0.714120EOI* 0.736984EOl* 0.38389IE02 8 0.730710EOI 0.752282EOl 0.165908E02 Table A29 Perfonnance of Training and Generalization (RMS) Method G with 7 hidden modes and A = 0.000 I Epoch Training RMS Generaljzation RMS Convergence Error 0 0.214709 0.209518 I 0.813054£01 0.841985EOl 0.133404 2 0.802277E01 0.834886E0 I 0.107767E02 3 0.791121EOl 0.825798E0 I 0.1 11558E02 4 0.780597£01 0.816429E"{)1 0.105239E02 5 O. 752288EO1 0.788873EOI 0.283096E02 6 0.738405EOI 0.775041 EO 1 O. ]38825E02 7 0.721871 EOl 0.754372EO I O. ]65343E02 8 0.716075E01 0.730340EO] 0.579566E03 9 0.716354EOI 0.730538EOl 0.278950£04 10 0.709259£01 * 0.725106EOl* 0.709549£03 II 0.709502E01 O. 725289EO 1 0.243112£04 60 Table A30 Performance ofTrain.i.ng and Generaljzation (RMS) Method G with 7 hidden modes and A= 0.0002 Epoch Training RMS Generalizatioo RMS Convergence Error 0 0.214705 0.209518 1 0.812652E..Q 1 0.841985E..QI 0.133439 2 0..801 743Eol 0.834814Eo I 0.1 09088E02 3 0.790483EOI O.825664Eo 1 0.11260 1E02 4 0.779875EOI O.816249Eo 1 0.106080E02 5 O.750842EO I O.788082Eo1 0.29033 I E..()2 6 O.736947Eo I 0.774301Eol 0.1 38956E02 7 0.72196IEOl 0.755340Eol 0.149855E02 8 O.717005Eol 0.731551Eol 0.495657E03 9 O. 71 7307E0 I O.731 760Eo I 0.302196E04 10 0.707547Eol* O.724244Eo1* 0.976011E03 11 0.707797Eol 0.724433Eol 0.250116E04 Table A31 Perfonnance of Training and Generalization (RMS) Method G with 8 hidden modes and A= 0.0002 Epoch Training RMS Generalization RMS Convergence Error 0 0.213004 0.207839 I 0.8 I8767EO I 0.848726E01 0.131127 2 0.815572Eol 0.847500Eol 0.319563E03 3 0.813266Eo 1 0.846321Eol 0.230514E03 4 0.808509£01 0.843091£01 0.475705E03 5 0.796952Eol 0.833439EOl 0.1 15574E02 6 0.779820EOl 0.817583Eol 0.171316E02 7 0.738800£01 0.776176Eol 0.410205E02 8 0.730631E01 O.767895Eo 1 0.81 6934E03 9 O.733877E0 I O.770932Eo I 0.324629E03 10 0.73 I922Eo 1 O.766110E01 0.195459E03 II 0.733967E01 O.768062Eo 1 0.204444E03 12 O.731807E0 I 0.750615E01 0.21 5985E03 13 0.732087E01 O.750793EoI 0.279844E04 14 0.730509EOl* 0.749718EoI* O. J57736E03 15 O.730594E0 I O.749775EO I 0.849366E05 61 Table A32 Performance ofTraining and Generalization (RMS) Method G with 8 hidden modes and A= 0.0001 Epoch Training RMS Generalization RMS Convergence Error 0 0.213003 0.207839 I 0.818695Eol 0.848729£0 I 0.131134 2 0.815518£01 0.847507£01 0.3 I7685E03 3 0.813209Eo I 0.846324Eo I 0.230849E03 4 0.8084IIEOI 0.843060£01 0.479840E03 5 0.796791 Eo 1 0.833353 EO 1 0.116200E02 6 0.779562£01 0.8 I 7408E..{) 1 0.172289E02 7 0.738410E01 0.775888£0 I 0.411511£02 8 O.730438E0 I 0.767817EOI 0.7971 75E03 9 0.733867E01 0.771038Eol 0.342883E03 JO O.73 1672E0 I O.765858EO I 0.219509E03 11 0.733556Eol O.767654EO I 0.188418E03 12 0.728855Eol O.748343E0 1 0.470124E03 13 0.728628Eol* 0.748189EOl* 0.227 168E..{)4 14 O. 729582Eo1 O.748835EO1 0.954 196E04 Table A33 Performance of Training and Generalization (RMS) Method G with 10 hidden modes and A= 0.0001 Epoch Training RMS Generalization RMS Convergence Error 0 0.210543 0.205405 1 0.825592£01 0.855851 EO I 0.127984 2 0.824773EOl 0.855839E0 I 0.819638£04 3 0.824542EOl 0.855834E0 I 0.230819E04 4 0.824432E0 I 0.855830EOl 0.109598E04 5 0.824401 Eo 1 0.855829EOI 0.310689E05 6 0.824371 Eol 0.855827EOl O.302494E..{)5 7 0.824355E0 I 0.855827Eo I 0.164658E"{)5 8 0.824354EOI 0.855827E0 1 0.745058E07 9 0.824343EOl· 0.855826Eo 1 0.1 10269E05 10 0.824355EOI 0.855827Eol 0.122935E05 Table A34 Performance of Training and Generalization (RMS) Method G with 10 hidden modes and A= 0.0000 1 Epoch Training RMS Generalization RMS Convergence Error 0 0.210537 0.205405 1 0.825 160Eo1 0.855848E0 1 0.128021 2 0.824300Eol 0.855838E0 I 0.860468E"{)4 3 0.824081 EO 1 0.855834E0 I 0.2 18600E04 4 0.823960E0 I 0.855830EOI 0.120923E04 5 0.823906E0 I 0.855821£01 0.540167E05 6 0.823877E01 0.855826EO 1 0.288337E"{)5 7 0.823852EOl 0.855824£01 0.255555E05 8 0.823851 E..{) 1* 0.855824£0] 0.968575E07 9 0.823867E01 O. 855825E0 1 0.166 L48E05 62 Table A35 The Training Data Set O.OOOOOOOOE+OO O.OOOOOOOOE+OO 0.00000000£+00 0.00000000£+00 0.83333336E0] 0.69444450E02 0.00000000£+00 0.16666661£+00 0.27777780E0] 0.00000000£+00 0.25000000E+00 0.62500000E0] 0.00000000£+00 0.33333334£+00 0.11] 11112£+00 0.00000000£+00 0.4] 666666E+00 0.17361110£+00 0.00000000£+00 0.50000000£+00 0.25000000E+00 0.83333336£01 0.00000000£+00 0.69444450E02 0.83333336E0 1 0.83333336£01 0.13888890£01 0.83333336E0 1 0.16666661£+00 0.34722224E0 1 0.83333336E0 1 0.25000000£+00 0.69444448E0 1 0.83333336E0 1 0.33333334£+00 0.] 1805556E+00 0.83333336E0 1 0.41666666E+00 O. ]8055555£+00 0.83333336EO 1 0.50000000E+00 0.25694445E+OO 0.16666667E+00 O.OOOOOOOOE+00 0.27777780E0 1 0.16666667E+00 0.83333336E0 1 0.34722224E0 1 0.16666661£+00 0.16666661£+00 0.55555560E0 1 0.16666661£+00 0.250oo000E+00 0.90277776E0 1 0.16666661£+00 0.33333334E+00 0.13888890E+00 0.16666661£+00 0.4 1666666E+00 0.20138888E+00 0.16666667£+00 0.50000000E+00 o.27777779E+00 0.25000000E+00 O.OOOoooOOE+OO 0.62500000E0 1 0.25000000E+00 0.83333336E0 I 0.69444448E0 1 0.25000000E+00 0.16666667£+00 O.90277776E0 1 0.25000000£+00 0.25000000£+00 0.125OOO00E+00 0.25000000E+00 0.33333334E+00 0.17361112£+00 0.25000000£+00 0.41666666E+00 0.23611110E+00 0.25000000£+00 0.500oo000E+00 0.3 1250000E+00 0.33333334E+00 0.00000000£+00 0.11111112£+00 0.33333334E+00 0.83333336£01 0.] 1805557E+00 0.33333334E+00 0.16666667£+00 0.13888890E+00 0.33333334£+00 o.250oo000E+00 0.17361112E+00 0.33333334E+00 0.33333334E+00 0.22222224E+00 0.33333334E+00 0.41666666E+00 0.28472221 E+OO 0.33333334E+00 0.500oo000E+00 0.361111] OE+OO 0.41666666E+00 O.OOOOOOOOE+oo 0.17361 JlOE+oo 0.41666666E+00 0.83333336E0 1 0.18055555E+OO 0.41666666£+00 0.16666661£+00 0.20138888E+OO 0.41666666£+00 0.25000000£+00 0.23611 110E+OO 0.41666666£+00 0.33333334£+00 0.28472221E+00 0.4 1666666E+00 0.41666666£+00 0.34722221£+00 0.41666666E+00 o.50000000E+00 0.42361110£+00 0.50000000£+00 O.OOOOOOOOE+OO 0.25000000£+00 0.50000000£+00 0.83333336E0 1 0.25694445£+00 0.500oo000E+00 0.16666661£+00 0.27777779£+00 0.50000000£+00 0.25000000£+00 0.31250000E+00 0.50000000£+00 0.33333334E+OO 0.36] 11110£+00 0.50000000E+00 I 0.41666666£+00 0.423611 ]0£+00 0.500oo000E+00 0.50000000£+00 0.500oo000E+00 63 Table A36 The Validation Data Set 0.99999998E02 O.99999998E02 O.19999999E03 0.99999998E02 O.93333334EO I 0.88] II] ]6E02 0.99999998E02 0.17666666E+00 0.3] 3 11I1 OEO 1 0.99999998E02 O.25999999E+00 0.67699999E0 I 0.99999998E02 0.34333333E+00 0.1 I797778E+00 0.99999998E02 0.42666668E+00 0.18214445E+OO 0.99999998E02 O.50999999E+00 0.260 19999E+OO 0.93333334E01 0.99999998E02 0.881 11106E02 0.93333334E01 0.93333334E0] 0.] 7422222E0] 0.93333334E0] 0.17666666E+00 0.39922219E0 I 0.93333334E0 I 0.25999999E+00 0.76311104E01 0.93333334E0 I 0.34333333E+00 O.12658890E+00 O.93333334E0 1 0.42666668E+OO 0.19075556E+00 0.93333334E0 I 0.50999999E+00 O.26881111E+OO 0.17666666E+00 0.99999998E02 O.313111IOEOI 0.17666666E+00 0.93333334E01 O.39922222E0 I 0.17666666E+00 0.17666666E+OO 0.62422220E0 I 0.17666666E+00 0.25999999E+00 0.98811105EOI 0.1 7666666E+00 O.34333333E+OO 0.14908889E+00 0.17666666E+00 0.42666668E+00 O.21325557E+00 0.17666666E+00 0.50999999E+00 O.29131109E+OO 0.25999999E+00 0.99999998E02 0.67699999EO I o.25999999E+00 0.93333334E0 I 0.76311111 EO I 0.25999999E+00 o.17666666E+00 0.98811105EOI 0.25999999E+00 o.25999999E+00 0.135 I9999E+00 0.25999999E+00 0.34333333E+00 0.1 8547778E+00 0.25999999E+00 0.42666668E+00 0.24964444E+00 0.25999999E+OO 0.50999999E+00 0.32769999E+00 0.34333333E+OO 0.99999998E02 0.1 I797778E+00 0.34333333E+00 0.93333334E0] 0.12658890E+00 0.34333333E+00 0.17666666E+00 0.14908889E+00 0.34333333E+00 0.25999999E+00 0.1 8547778E+00 0.34333333E+00 0.34333333E+00 0.23575556E+00 0.34333333E+00 0.42666668E+00 0.29992223 E+00 0.34333333E+00 0.50999999E+00 0.37797776E+OO 0.42666668E+00 0.99999998E02 0.1 82 I4445E+00 0.42666668E+00 0.93333334EO 1 0.19075556E+00 0.42666668E+00 0.17666666E+00 0.2 I325555E+00 O.42666668E+00 O.25999999E+OO 0.24964444E+00 0.42666668E+00 0.34333333E+00 0.29992223E+00 0.42666668E+00 0.42666668E+00 0.36408889E+00 0.42666668E+00 0.50999999E+OO 0.442 I4442E+OO 0.50999999E+00 0.99999998E02 0.260 I9996E+00 0.50999999E+00 0.93333334EOI 0.2688] 108E+00 0.S0999999E+00 0.17666666E+00 0.29131109E+00 o.s0999999E+00 o.25999999E+00 0.32769996E+00 0.50999999E+00 0.34333333E+00 0.37797776E+00 0.50999999E+00 0.42666668E+OO , 0.442 I4442E+OO 0.50999999E+00 0.S0999999E+00 0.S2019995E+OO 64 APPENDIXB COMPUTER PROGRAMS PROGRAM DRIVER C······*·········*·····*···········*·**············*··•..** •••••• • • • • • • • • TillS DRIVER IS TO GENERATE THE RANDOM WEIGHTS W(MlAYR MNODE, O:MNODE) THE WEIGHT OF EACH LAYER. • P(MNODE)  THE INPUT DATA OF THE SAMPLE. • O(MNODE)  THE OUTPUT CALCULATED FROM THE INPUT DATA SAMPLE. * N(MLAYR, MNODE)  THE WEIGHTED SUM OF THE INPUTS OF A NEURON MNODE IN LAYER MLAYR REF (3.1.l) • A(O:MLAYR O:MNODE)  THE OurPUT OF THE NEURON MNODE IN LAYER MLAYR. REF (3.1.2) • NOnCE THAT A(O,·) REPRESENTS THE INPUT LAYER. A(·,0) REPRESENTS THE BIAS. • NNODE(O:MLAYR)  THE NUMBER OF NODE IN EACH LAYER. • LAYER  THE ACTUAL TOTAL LAYER OF THE NET. (EXCLUDING * THE INPUT LAYER) • MLAYR  THE MAXMUM LAYER A NET CAN HAVE. • MNODE  THE MAXMUM NODE ONE LAYER Of A NET CAN HAVE * LL  SAMPLE INDEX • C C C C C C C C C C C C C C C C C C C C C·**··························*······*······*······**·.......•*.. PARAMETER(MLAYR = 4, MNODE = IOO,MSAMP = 200) DOUBLE PRECISION DRANDOM, W(MLAYR, MNODE, O:MNODE), + SEED,TOL,WO,LAMDA,LAMDA2,P(MSAMP,MNODE),O(MSAMP,MNODE), + N(MLAYRMNODE), A(O:MLAYR,O:MNODE), + SENSI(MLAYR,MNODE),T(MSAMP,MNODE),ERROR2,ERROR1, + G(MLAYR,MNODE,O:MNODE),TG(MLAYR,MNODE,O:MNODE), + FRET,TOLl,ERROR,ERRORV INTEGER K,1,J,LL,NNODE(O:MLAYR),METHOD,LAYER,NSAMP INTEGER NUM,ITERMAXNUM,NWEIG,PSTAT,JUDGE CC THE FOLLOWING DATA IS USED IN CONJUGATE GRADIENT METHOD C DOUBLE PRECISION PP(MLAYR,MNODE,O:MNODE),BETA, + TGO(MLAYR,MNODE,O:MNODE),PPO(MLAYR,MNODE,O:MNODE) PSTAT=lO 65 C C SET UP NETWORK C CALL NETSETUP(LAYER.,MLAYR.,NNODE,SEED,WO,LAMDA,LAMDA2,METHOD) WRITE(*, 123)LAMDA, LAMDA2 123 FORMAT(lx,'LAM1=',FI6.12, 'LAM2=',FI6.12) CALL NETPRINT(LAYER,MLAYR,NNODE,SEED,WO,LAMDA,LAMDA2,METHOD) C C INITIAL WEIGHT WITH RANDOM NUMBER. C CALL INIWEIGHT(W, LAYER, MLAYR,NNODE. MNODE,SEED, + NWEIG) C C PRINT THE NUMBER OF WEIGHT C WRJTE(*,1001)NWEIG 1001 FORMAT(lX,'THE NUMBER OF WEIGHT IS: ',15) CC READ IN TRAINING DATA P(l) AND T(l) C READ IN THE INPUT AND DESIRED OUTPUT OF ONE TRAINING SAMPLE C CALL GETINPUTDATA(P,T,MNODE,NNODE(O),NNODE(LAYER), + MSAMP,NSAMP) CC CALCULATE THE PERFORMANCE FUNCTION C ERROR = SQRT(TEST(MSAMP,MNODE,W,MLAYR,LAYER, + NNODE,LAMDA,WOYNSAMP) PRINT*, 'BEFORE TRAINING GENERALlZAnON ERROR: .,ERROR CC LOOP OVER ITERATION C SET TOLERANCE AND MAXIMUM ITERATION NUMBER C TOL = 4.Of>.. I0 TOLl = 3.5f>..2 MAXITER=IO lTER=O JUIXiE=O C 1000 ITER = ITER + 1 C C ENTER ITERATION C CALL JNITG(LAYER,MLAYR,NNODE,MNODE,TG) CC SUM TOTAL GRADIENT C DO 320 LL=l,NSAMP C C FEEDFORWARD COMPUTAnON C CALL FORWARD(P,O,N,MLAYR,LAYER,MNODE,A, + NNODE,W,LL,MSAMP) CC CALCULATE THE SENSITIVITY MATRIX C 66 CALL SENSITIVITY(SENSl,W,LAYER,MLAYR,NNODE, + MNODE,T,O,N,LL,MSAMP) C C CALCULATE THE GRADLENT OF THE PERFORMANCE FUNCTION C CALL GRAD(SENSI,A,W,LAYER, MLAYR, + NNODE,MNODE,G,WO,LAMDA) C C SUM UP THE TOTAL GRADLENT C CALL SUMGRAD(G,LAYER,MLAYR,NNODE,MNODE,TG) 320 CONTINUE C C FIND THE PERFORMANCE FUNCTION VALUE, BEFORE LINE SEARCH C ERRORI= SQRT(FINDE(P,T,MSAMP,NSAMP,MNODE, + W,MLAYR,LAYER,NNODE,O,LAMDA,WO) INSAMP) TF (MOD(ITER,PSTAT) .EQ. l) THEN C WRITE(*,l600)ITER,ERRORI 1600 FORMAT(IX,'BEFORE LINE SEARCH, ITER #',I5,2X, + 'ERRORl VALUE = ',G25.20) ENDIF C C FfRST START AND RESTART USING STEEPEST DESCENT C IF (ITER .EQ. I .OR. MOD(ITER.,NWEIG) .EQ. 0) THEN CALL GETPP(pP,PPO,TG,MLAYR.,LAYER,MNODE,NNODE, + O.DO) ENDIF C C ASSIGN THE TG TO TGO C CALL ASSIGN(TG,TGO,MLAYR,LAYER.,MNODE,NNODE) CALL ASSIGN(PP,PPO,MLAYR,LAYER,MNODE,NNODE) C C COMPUTE ALGORITHM 3.6.1 (4) AND (5). CALL L1NMJN(FRET,P,T,MSAMP,NSAMP, + MNODE,W,PPO,MLAYR,LAYER,NNODE,O,LAMDA,WO) C C USING STEP 2 TO COMPUTE THE G(K+I) C CALL lNITG(LAYER.,MLAYR,NNODE,MNODE,TG) 00 321 LL=l,NSAMP C C FEEDFORWARD COMPUTATION C CALL FORWARD(P,O,N,MLAYR,LAYER,MNODE,A, + NNODE,W,LL,MSAMP) C C CALCULATE THE SENSITIVITY MATRIX C CALL SENSITIVlTY(SENSI,W,LAYER.,MLAYR,NNODE, + MNODE,T,O,N,LL,MSAMP) C C CALCULATE THE GRADIENT OF THE PERFORMANCE FUNCTION C 67 CALL GRAD(SENSI,A,W,LAYER, MLAYR, + NNODE,MNODE,G,WO,LAMDA) C C SUM UP THE TOTAL GRADlENT C CALL SUMGRA.D(G,LAYER,MLAYR,NNODE,MNODE,TG) 321 CONTINUE C C FIND THE PERFORMANCE FUNCTION VALUE, AFTER LINE SEARCH C ERROR2= SQRT(FINDE(P,T,MSAMP,NSAMP,MNODE, + W,MLAYR,LAYER,NNODE,O,LAMDA,WO) fNSAMP) C IF(MOD(ITER,PSTAT) .EQ. 0) THEN C WRITE(*,1100)ITER,ERR0R2 1100 FORMAT(lX,'AFTER UNE SEARCH, ITER#',J5,2X, + 'ERROR2 VALUE = ',G25.20) C ENDIF C IF(MOD(lTER,PSTAT) .EQ. 1) THEN C IF (ABS(ERROR2  ERROR1) .LT. TOL) THEN ERROR = ABS(ERROR2  ERRORI) C WRITE(*, 101)ERROR 101 FORMAT (IX,'ERROR = ',G25.20) C STOP C ENDIF C VALIDATE THE NETWORK USING VALIDATION SET. C IF(MOD(ITER,PSTAT) .EQ. I) THEN ERRORV = SQRT(TEST(MSAMP,MNODE,W,MLAYR,LAYER, + NNODE,LAMDA,WO)/NSAMP) WRITE(*,1211 )ITER,ERROR2,ERRORV,ERROR 1211 FORMAT(12,',',lx,GI5.6,',',lx,GI5.6,',',lx,GI5.6) 1200 FORMAT(IX,'AFTER TRAINING GENERAUZATION ERROR: ',G25.20) C ENDIF C BETA=FINDBETA(TG,TGO,MLAYR,LAYER,MNODE, + NNODE) CALL GETPP(PP,PPO,TG,MLAYR,LAYER,MNODE,NNODE, + BETA) C C PRINT TG, AFTER STEP 7 C C ASSING TG TO TGO, TOO STORES P(K+1) C CALL ASSIGN(TG,TGO,MLAYR,LAYER,MNODE,NNODE) C IF(JUDGE.EQ.O)THEN IF «ABS(ERROR2  ERRORI) .GT. TOL .OR. ERRORI .GT. TOLl + .OR. ERROR2 .GT. TOLl) .AND. ITER .LT. MAXITER)THEN ERRORI = ERROR2 GOTO 1000 ELSE JUOOE=I LAMDA=LAMDA2 lTER=O 68 C GOTO 1000 ENDIF ENDIF IF(JUDGE.EQ.l)THEN IF «ABS(ERROR2  ERROR1) .GT. TOL .OR. ERRORI .GT. TOLl + .OR. ERROR2 .GT. TOll) .AND.ITER .LT. MAXlTER)THEN ERRORI = ERROR2 GOTO 1000 ENDIF ENDIF WRITE(·,1300) WRITE(·,1301)ITER, MAXITER 1301 FORMAT(lx,'ITER=',15,4x,'MAXITER=',15) 1300 FORMAT(lX,'SOLUTION CONVERGE TO THE TOLERANCE') WRITE(·,1400)ITER,ERROR1,ERROR2,ABS(ERROR2ERRORl) 1400 FORMAT(IX,'ITER= ',15,2X,'ERRORI= ',G25.10,2X,'ERR0R2=', + G25.IO,2X,'ERROR=', G25.10) C C TEST THE NETWORK USING TEST SET C ERROR = SQRT(TEST(MSAMP,MNODE,W,MLAYR,LAYER, + NNODE,LAMDA,WO}'NSAMP) WRlTE(·,1500)ERROR 1500 FORMAT(1 X,'AFTER TRAfNfNG ERROR=',G25.1 0) C STOP END C*·*·**··*·····***····*··**·*·***·····*·*·········*··· *.***.*. SUBROUTINE ASSIGN(ORlG,NEW,MLAYR,NLAYR,MNODE, + NNODE) C****·*·*******·*····*········*·······················.•.••.....•... C TIDS SUBROUTINE IS TO COPY A ORIG MATRIX TO NEW MATRIX. • C IT IS USED TO COPY TG. • C*·······*·*···········*·····························*....•......... INTEGER MLAYR,NLAYR,MNODE,NNODE(O:MLAYR) DOUBLE PRECISION ORlG(MLAYR,MNODE,O:MNODE), + NEW(MLAYR,MNODE,O:MNODE) fNTEGER I,J,K,KK,LL DO 10 K=I,NLAYR KK=NNODE(K) LL=NNODE(Kl) 0020 J=I,KK 00301=0,LL NEW(K,J,I)=ORlG(K,J,I) 30 CONTINUE 20 CONTINUE 10 CONTINUE C RETURN END C·*·····*···*···****·······*·*·**·****·····***········.....****.* FUNCTION BRENT(AX,BX,CX,F,TOL,XMIN) 69 C C·*·····*·*·························*·········*·······•••••••••••• C GIVEN A FUNCTION P, AND GIVEN A BRACKETING • C TRIPLET OF ABSCIESSAS AX, BX, CX(SUCH THAT BX IS .. C BETWEEN AX, AND cx, AND F(BX) IS LESS THAN BOTH .. C F(AX) AND F(CX», TIDS ROUTINE ISOLATES THE MINIMUM • C TO A FRACTIONAL PRECISION OF ABOUT TOL USING BRENTS • C METHOD. THlS ABXCISSA OF THE MINIMUM IS RETURNED AS • C XMIN, AND MUNlMUM FUNCTION VALUE IS RETURNED AS BRENT, • C THE RETURNED FUNCTION VALUE. .. C .. C PARAMETERS: MAXIMUM ALLOWED NUMBER OF ITERATIONS; GOLDEN· C RATIO; AND A SMALL NUMBER THAT PROTECTS AGAINST TRYING .. C TO ACIDEVE FRACTION ACCURACY FOR A MINIMUM THAT HAPPENS .. C TO BE EXACTLY ZERO. .. C·*·*·*·*..··············*****··***·***·*****·*·**·**··.*•••*.***. INTEGER ITMAX OOUBLE PRECISION BRENT, AX,BX,CX,TOL,XMIN,F,CGOLD,ZEPS EXTERNALF PARAMETER(lTMAX=100, CGOLD=.38196600,ZEPS=1.ODl 0) INTEGER ITER OOUBLE PRECISION A,B,D,E,ETEMP,PU,FV,FW,FX,P,Q,R,TOLl ,TOL2, + U,V,W,x,XM A=MIN(AX,CX) B=MAX(AX,CX) V=BX W=V X=V E=O.oo FX=F(X) FV=FX FW=FX DO 11 [TER = 1, ITMAX XM = .500·(A+B) TOLl = TOL·ABS(X) +ZEPS TOL2 = 2.00·TOLl IF(ABS(XXM) .LE. (TOL2  .500·(BA») GOTO 3 IF(ABS(E) .GT. TOLl )THEN R=(XW)*(FXFV) Q=(XV)*(FXFW) P=(XV)·Q(XW)*R Q=2.00*(QR) IF(Q.GT.O) p=p Q=ABS(Q) ETEMP=E E=D IF(ABS(P).GE.ABS(.5DO*Q·ETEMP).0R.P.LE.Q*{AX).OR. + P.GE.Q*(BX»GOTO I D=P/Q U=X+D IF(UA.LT.TOL2 .OR. BV .L1. TOL2)D=DSfGN(TOLl,XMX) GOT02 ENDIF IF(X.GE.XM)THEN E=AX ELSE 70 E=BX ENDrF D=CGOLD*E 2 fF(ABS(D).GE.TOL1)THEN U=X+D ELSE U=X+DSIGN(TOLI ,0) ENDIF FU= F(U) IF(FU.LE.FX)THEN IF(U.GE.X)THEN A=X ELSE B=X ENDIF V=W FV=FW W=X FW=FX X=U FX=FU ELSE IF(U.LT.X)THEN A=U ELSE B=U ENDIF IF(FU.LE.FW .OR. W.EQ.X)THEN V=W FV=FW W=U FW=FU ELSEIF(FU .LE. FV .OR. V.EQ.X .OR. V.EQ. W)THEN V=U FV=FU ENDIF ENDfF 11 CONTINUE C 3 XMIN=X BRENT=FX RETURN END C**************************************************************** * THIS ROUTINE IS TO CONVERT THE 3DfMENSIONAL ARRAYS fNTO IDIMENSIONAL ARRAY. rT IS USED TO APPLY LINE SEARCH ROUTINE * SUBROUTINE CONYERT(TG,MLAYR,MNODE,NLAYR, + NNODE,A,MAXNUM,NUM) C***************************************************************** C * C C c************************·**************************************** rNTEGER MLAYR,MNODE,NLAYR,NNODE(O:MLAYR), + MAXNUM,NUM,I,J,K,KK,LL DOUBLE PRECISION A(MAXNUM),TG(MLAYR,MNODE,O:MNODE) c 71 NUM=O DO 10 K=I,NLAYR KK=NNODE(K) LL=NNODE(KI) DO 20 J=I,KK DO 30 l=l,LL NUM=NUM+I A(NUM)=TG(K.J,I) 30 CONTINUE 20 CONTINUE 10 CONTINUE C RETURN END C****···***··***··**···*·········*····******···****·***.***.**•••• C·**··*··***··****·**··*·******·*·····**····**·****···*•••••**••••• C GIVEN A FUNCTION F AND ITS DERIVATIVE FUNCTION DF, AND * C GIVEN A BRACKETING TRIPLET OF ABSCISSAS AX, BX, CX[SUCH • C THAT BX IS BETWEEN AX AND CX AND F(BX) IS LESS THAN BOTH • C F(AX) AND F(CX)], THIS ROUTINE ISOLATES THE MINIMUM TO A • C FRACTIONAL PRECISION OF ABOUT TOL USING A MODIFICATION OF • C BRENTS METHOD THAT USES DERIVATIVES. THE ABSCISSA OF THE • C MINIMUM IS RETURNED AS XMIN, AND THE MINIMUM FUNCTION • C VALUE IS RETURNED AS DBRENT, THE RETURNED FUNCTION VALUE. • C*·***·*·**··***··*****····**·*········**···*····*····••*••*••••••• FUNCTION DBRENT(AX,BX,CX,F,DF,TOL,XMlN) INTEGER lTMAX DOUBLE PRECISION DBRENT,AX,BX,CX,TOL,XMIN,DF,F,ZEPS EXTERNAL DF,F PARAMETER(JTEM=IOO,ZEPS=J .ODIO) INTEGER ITER DOUBLE PRECISION A,B,D,Dl ,D2,DU,DV,DW,DX,E,FU,FV,FW,FX,OLDE, + TOll, TOL2, U,UJ,U2,V,W,X,XM LOGICAL OK1,0K2 A=MJN(AX,CX) B=MAX(AX,CX) V=BX W=V X=V E=O. FX=F(X) FV=FX FW=FX DX=DF(X) DV=DX DW=DX DO I I ITER=I,ITMAX XM={).S·(A+B) TOll =TOL*ABS(X)+ZEPS TOL2=2.·TOLJ IF(ABS(XXM) .LE. (TOL2  .S·(BA»)GOTO 3 IF(ABS(E) .GT. TOll) THEN D1=2.*(BA) D2=DI IF(DW.NE.DX)Dl=(WX)*DX/(DXDW) 72 IF(DV.NE.DX)D2=(VX)*DX/(DXDV) UI=X+Dl U2=X+D2 OKl=«AUJ)*(UIB).GT.O) .AND. (DX*Dl .LE. 0.) 0K2=«AU2)*(U2B).GT.O) .AND. (DX*D2 .LE. 0.) OLDE=E E=D IF(.NOT. (OK I.OR.OK2»THEN GOl'O I ELSEIF (OKl .AND. OK2)THEN IF(ABS(Dl ).LT.ABS(D2»THEN D=Dl ELSE D=D2 ENDIF ELSEIF (OKI) THEN D=DI ELSE D=D2 ENDIF IF(ABS(D) .GT. ABS(O.S*OLDE»GOTO I U=X+D IF(UA .LT. TOL2 .OR. BU .LT. TOL2)D=SIGN(TOLI,XMX) GOT02 ENDfF IF(DX.GE.O.)THEN E=AX ELSE E=BX ENDIF D=.5*E 2 IF(ABS(D) .GE. TOLI)THEN U=X+D FU=F(U) ELSE U=X+SIGN(TOLI,D) FU=F(U) IF(FU.GT.FX)GOTO 3 ENDIF DU=DF(lJ) IF(FU. LE.FX)THEN IF(U.GE.X) THEN A=X ELSE B=X ENDIF V=W FV=FW DV=DW W=X FW=FX DW=DX X=U FX=FU DX=DU ELSE 73 IF(U.LT.X)THEN A=U ELSE B=U ENDIF IF(FU.LE.FW .OR. W.EQ.X)THEN V=W FV=FW DV=DW W=U FW=FU DW=DU ELSEIF(FU .LE. FV .OR. Y.EQ.X .OR. V.EQ.W)THEN V=U FV=FU DV=DU ENDIF ENDIF 11 CONTINUE 3 XMIN=X DBRENT=FX RETURN END C**************************************·**····**······**.**.*****. • THIS ROUTINE IS TO FIND THE BETA ACCORDING TO (2.4.8)  (2.4.10). * TG(MLAYR, MNODE, O:MNODE) STORES TOTAL GRADIENT * FUNCTION FINDBETA(TG,TGO,MLAYR,NLAYR,MNODE, + NNODE) C********·*********·*****·········**··*·*****·*··*·****.**.* C * C C C C**·***·***·*··***·**·*·****·*·****·**·**····******·******.* C*·****************·*·**********·*********··*··*·*··*·.*.********* END RETURN FINDBETA=SUMISUMI * TillS ROUTINE IS TO FIND THE BETA ACCORDING TO (2.4.8)  (2.4.10). • TG(MLAYR, MNODE, O:MNODE) STORES TOTAL GRADIENT * C C C C C 10 20 CONTINUE CONTINUE FUNCTION FINDBETA(TG,TGO,MLAYR,NLAYR,MNODE, + NNODE) c***********************************··*···*·*·********.**••* * c***·******···***··**·***·******·*********·****·****·*••**•• INTEGER MLAYR,NLAYR,MNODE,NNODE(O:MLAYR) DOUBLE PRECISION TG(MLAYR,MNODE,O:MNODE), + TGO(MLAYR,MNODE,O:MNODE),FINDBETA,SUM,SUM I INTEGER I,J,K,KK,LL C SUM={).DO SUMl=O.DO 00 10 K=1,NLAYR 74 KK=NNODE(K) LL=NNODE(Kl ) DO 20 J=l,KK DO 30 I=O,LL SUM = SUM + TG(K,J,I)*TG(K,J,I) SUMI = SUMI + TGO(K,J,I)*TGO(K,J 1) 30 CONTINUE 20 CONTINUE ]0 CONTINUE C FTNDBETA=SUMJSUM I RElURN END C*·········**·*···*··***·······**····*···*·········*··•••••••••••• •• • THIS FUNCTION IS TO FIND THE PERFORMANCE FUNCTION E(W) REF. (3.6.1). • FINDE  THE PERFORMANCE VALUE. REF (3.6. I) * T(MSAMP,MNODE) THE DESIRED OUTPUT OF THE NET W(MLAYR,MNODE,O:MNODE)WEIGHT MATRlX OF THE NET O(MSAMP,MNODE) THE CALCULATED OUTPUT OF THE NET LAMDA THE CONSTANTIN THE PENALTY TERM. • WO  THE CONSTANTS TN THE PENALTY. * C C C C C C C C FUNCTION FINDE(P,T,MSAMP,NSAMP,MNODE, + W,MLAYR,NLAYR,NNODE,O,LAMDA,WO) C*···*····**·***··**··*··*·**·****··******·····**·****•••••****** * C····*·*··**·***·*·*·*···***·*·***·*****·*··········*·•••••••*••• C INTEGER MSAMP,NSAMP,MNODE,MLAYR,NLAYR, + NNODE(O:MLAYR) DOUBLE PRECISION O(MSAMP,MNODE),T(MSAMP,MNODE), + W(MLAYR,MNODE,O:MNODE),LAMDA,WO,SUM,SUM1,FINDE, + P(MSAMP,MNODE) DOUBLE PRECISION N(MLAY.R,MNODE),A(O: MLAYR,O:MNODE) INTEGER I,J,K,L,KK,LL C C CALCULATE THE PENALTY TERM. C SUM =0.00 SUMI=O.DO 00 100 K=I,NLAYR KK=NNODE(K) LL=NNODE(Kl ) DO 200 J=l,KK DO 300 J=O,LL SUM=SUM+LAMDA·(W(K,J,I)**2/(WO··2+W(K,J,I)··2» CONTINUE CONTINUE CONTINUE 300 200 100 C C CALCULATE THE FIRST TERM C DO 10 L=I,NSAMP CALL FORWARD(P,O,N,MLAYR,NLAYR,MNODE,A,NNODE, + W,L,MSAMP) 75 00 20 K=l,NNODE(NLAYR) SUM1=SUMI +(T(L,K)O(L,K»··2 20 CONTINUE 10 CONTINUE C C FlNDE = O.5DO· (SUM +SUMl) RETURN END C····················································· . • • • • • • • SUBROUTINE FORWAR.D(P,O,N,MLAYR,NLAYR,MNODE,A, + NNODE,W,SN,MSAMP) C·········*·····························*············· . c • cccccc c cc C C C C C C····················································· *•• INTEGER MLAYR,MNODE,NNODE(O:MLAYR),I,J,K, + NLAYR,L,SN,MSAMP,KK,LL OOUBLE PRECISION N(MLAYR,MNODE),A(O:MLAYR,O:MNODE), + W(MLAYR,MNODE,O:MNODE),SUM,P(MSAMP,MNODE), + O(MSAMP,MNODE) C C STORE INPUT DATA INTO A(O,MNODE) C DO 100 1=1, NNODE(O) A(O,I)=P(SN,I) 100 CONTINUE C C STORE THE BIAS C A(O,O) = I.DO C C CALCULATE THE SUM OF THE INPUTS OF A NEURON J IN LAYER K C C LOOP OVER LAYER C LOOP OVER LAYER 00 10 K=I,NLAYR C LOOP OVER CURRENT NODE (TARGET) KK=NNODE(K) LL=NNODE(Kl) 0020J=I,KK C LOOP OVER PRVIOUS NODE (SOURCE) 76 SUM =O.ODO 00301=O,LL SUM = SUM + W(K,J,n*A(KI,I) 30 CONTINUE C C CALCULATE THE SUM OF I NEURON J IN LAYER K C N(K,J) = SUM C C CALCULATE THE OUTPUT OF NEURON J IN LAYER K C A(K,J) = SIGF(N(K,J) 20 CONTINUE C C THE BIAS C A(K,O) =1 10 CONTINUE C C STORED THE OUTPUT IN A(NNODE(NLAYR» C KK=NNODE(NLAYR) DO 200 1= I, KK O(SN,I)=A(NLAYR,I) 200 CONTINUE RETURN END C*·*·**·······················*·······················•••••••••• GIVEN A STARTING POINT P THAT IS A VECTOR OF LENGTH N, FLETCHREEVESPOLAKRIBIERE MJNIMIZATION IS • PERFORMED ON A FUNCTION FUNC, USING ITS GRADIENT AS • CALCULATED BY A ROUTINE DFUNC. THE CONVERGENCE TOLERANCE • ON THE FUNCTION VALUE IS INPUT AS FTOL. RETURNED • QUANTITIES ARE P(THE LOCATION OF THE MJNUMUM), ITER(THE • NUMBER OF ITERATIONS THAT WERE PERFORMED),AND FRET(THE • MINIMUM VALUE OF THE FUNCTION). THE ROUTINE LlNMlN IS • CALLED TO PERFORM LINE MINIMIZATIONS. • PARAMETERS: NMAX IS THE MAXIMUM ANTICIPATED VALUE OF N; • ITMAX IS THE MAXIMUM ALLOWED NUMBER OF ITERATIONS; EPS • [S A SMALL NUMBER TO RECTIFY SPECIAL CASE OF CONVERGING • TO EXACTLY ZERP FUNCTION VALUE. • C C C C C C C C C C C C C SUBROUTINE FRPRMN(P,N,FfOL,ITER,FRET) C····················································· . • C·····················································.........•..* INTEGER ITER,N,NMAX,ITMAX DOUBLE PRECISION FRET,FTOL,P(N),EPS,FUNC EXTERNALFUNC PARAMETER(NMAX=50,ITMAX=200,EPS=I.ODIO) C C USES DFUNC,FUNC,LINMIN C INTEGER ITS,J DOUBLE PRECISION DGG,GAM,GG,G(NMAX),H(NMAX),XI(NMAX) FP= FUNC(P) CALL DFUNC(P,XI) 77 00 II J=I,N G(J)=XI(J) H(J)=G(J) XI(J)=H(J) II CONTINUE 00 14 ITS=I,ITMAX CALL LlNMIN(P,XI,N,FRET) IF(2.*ABS(FRETFP) .LE. ITOL·(ABS(FRET)+ABS(FP)+EPS»)RETURN FP=FUNC(P) CALL DFUNC(P,XI) GG=O.DO OOG=O.DO 00 12 J=I,N GG=GG+G(J)**2 DGG=OOG+(XI(J)+G(J)*XI(J) 12 CONTINUE IF(GG .EQ. O)RETURN GAM=DGG/GG DO 13 J=I,N G(J)=XI(J) H(J)=G(J)+GAM·H(J) XJ(J)=H(J) 13 CONTINUE 14 CONTINUE C RETIJRN END C*********************************************·********.**.********** SUBROUTINE GETINPUTDATA(P,T,MNODE,DIMlN,DIMOUT,MSAMP, + NSAMP) C***********···**·*···***********·*·····***·····**···**•••*••••**••••• C THTS SUBROUTINE IS TO READ THE INPUT DATA FROM • C TRAINING SAMPLE AND THE TARGET OUTPUT DATA. • C·*****·****···***·········**·*·····****·········*····*•••••*.*•••*••• INTEGER MNODE, DIMIN,DIMOUT,I,NSAMP,MSAMP,J DOUBLE PRECISION P(MSAMP,MNODE), T(MSAMP,MNODE) C IN=20 OPEN(UNIT= IN, FILE = 'TRAIN.DAT,STATUS = 'OLD',IOSTAT=fOERR) IF(IOERR .NE. 0) THEN WRITE(*,10) IOERR 10 FORMAT(lX,'CANNOTOPEN NETWORK TRAINING DATA FILE(TRAIN.DAT)', + 15) STOP ENDfF C READ IN NUMBER OF TRAINING SAMPLE READ(IN,*)NSAMP C DO 100 J= I, NSAMP C READ IN THE INPUT DATA READ(IN,*XP(J,I),J=I,DIMlN) C C READ IN THE DESIRED OUTPUT DATA(TARGET DATA) READ(lN,*XT(J,I),l=I,DIMOUT) 100 CONTINUE 78 C CLOSE (UNIT=IN) C RETURN END C**·**····**··**········*··****··*******•••••••••••••••••*•••••••••••••••• SUBROUTINE GETPP(PP,PPO,TG,MLAYR,NLAYR,MNODE, + NNODE, BETA) C*·······*··***·*···*·***··**·*·*········*·*•••••••••••••*•••• *.* C THIS SUBROUTINE IS TO CALCULATE METRIX PPo REF. * C ALGORITHM 3.6.1 (3) AND (7). IT ADDS TIlE PREVIOUS * C GRADIENT TO THE CURRENT GRADIENT ACCORDING TO * C DIFFERENT BETA. REF.(2.4.8)(2.4.IO). STORED THE • C WHOLE GRADIENT IN PP. * C***·*··**·**·**···*****·***********·*·······*·*···.·**.*••••**.* INTEGER MLAYR,NLAYR,MNODE,NNODE(O:MLAYR) DOUBLE PRECISION PP(MLAYR,MNODE,O:MNODE),BETA, + TG(MLAYR,MNODE,O:MNODE), + PPO(MLAYR,MNODE,O:MNODE) INTEGER I,J,K,KK,LL C C CALCULATE THE GRADIENT AND STORE IT IN PP C DO 10 K=I,NLAYR KK=NNODE(K) LL=NNODE(Kl) DO 20 J=l,KK D030I=O,LL PP(K.,J,I) = TG(K,J,I) + BETA • PPO(K,J,I) 30 CONTINUE 20 CONTINUE 10 CONTINUE C RETURN END C*·***··**·*·***··***·****··*···****··**····**·······*••••**•••••• • * * * SUBROUTINE GRAD(SENSI,A,W,NLAYR, MLAYR, + NNODE,MNODE,G,WO,LAMDA) C*··**·*·*····**·*·**··*·**·*·**·····**···*****··*···******••**•••**.** C THIS SUBROUTINE [S TO CALCULATE THE * C GRADIENT OF THE PERFORMANCE WoR.T WEIGHT. C REF. (3.6.6). • C SENSI(MLAYR,MNODE)THE SENSITVJTY MATRIX. REF(3.6.12) C A(O:MLAYR,O:MNODE)  THE OUTPlfT OF A NEURON. REF(3.1.2) * C W(MLAYR,MNODE,O:MNODE) THE WE[GHT MATRJX C G(MLAYR,MNODE,O:MNODE) THE GRADIENT OF THE NET C OF ONE SAMPLE DATEo * C WO  THE CONSTANTS IN PENALTY TERM WOo .* C LAMDA  THE CONSTANT IN THE PENALTY. * C··*··**·*······*····****··**·····*****···**·*·*····*****.***********.* INTEGER NLAYR, MLAYR,MNODE,I,J,K,KK,LL, + NNODE(O:MLAYR) DOUBLE PRECISION SENSI(MLAYR,MNODE),A(O:MLAYR, + O:MNODE),W(MLAYR,MNODE,O:MNODE), 79 + G(MLAYR,MNODE,O:MNODE),WO,LAMDA C C CALCULATE THE GRADIENT OF PERFORMACE FUNCTION W.R.T C WEIGHTS ACCORDfNG TO (3.6.6) C C LOOP OVER LAYER C DO 10 K=l,NLAYR KK=NNODE(K) LL=NNODE(Kl) DO 20 J=I,KK C CBlASTERM C G(K,J,O)=SENSI(K,J)+ + LAMDA *(W(K,J,O) * WO*WO)/«WO*WO + W(K,J,0)**2)**2) DO 30 1=I,LL G(K,J,I)=SENSI(K,J) • A(Kl,1) + + LAMDA • (W(K,J,I) * WO*WO)/«WO*WO + W(K,J,I)**2)**2) 30 CONTINUE 20 CONTINUE 10 CONTINUE C RETURN END C********·*··**********************·**·*·***········******.***.**.** SUBROUTINE INITG(NLAYR,MLAYR,NNODE, + MNODE,TG) C********·*·*********·****************·****************** C THlS FUNCTION IS TO INITIALIZE THE TOTAL • C GRAD!ENT TO O. * C**********************************·*·***·******····***** INTEGER NLAYR, MLAYR,MNODE,I,J,K,KK,LL, + NNODE(O:MLAYR) DOUBLE PRECISION TG(MLAYR,MNODE,O:MNODE) C C INITIALIZE THE TOTAL GRADIENT TO 0 C AND NUMOFSAMPLE TO 0 C DO 10 K=I,NLAYR KK=NNODE(K) LL=NNODE(Kl ) D020J=1,KK 00301=0,I..L TG(K,J,I)= 0 30 CONTINUE 20 CONTINUE 10 CONTINUE C RETURN END C**********·*****************************·***·***·****** SUBROUTINE INIWEIGHT(WEIGHT, NLAYR, MLAYR,NNODE, + MNODE, SEED,NWEIG) 80 • * • • • • • • • WEIGIIT(LAYER, N, O:N) LAYER IN THE LAYER INDEX N,M CORRESPONDING TO W(J,I), I.E., • WEIGHT(LAYER, N, M) IS THE WEIGHT OF THE CONNECTION FROM NODE MOF THE (LAYERI )TH LAYER TO NODE N OF THE LAYERTH LAYER. • WEIGHT(LAYER, N, 0) IS THE BIAS. • INITIALIZE THE WEIGHT OF INPUT LAYER NUMNODE(D  THE NUMBER OF NODE AT LAYER I. NUMNODE(O)  THE NUMBER OF lNPUT (NODE). • NUMNODE(NLAYR)  NUMBER OF NODE IN OUTPUT LAYER NWEJG  THE NUMBER OF WEIGHT • NLAYR  THE NUMBER OF LAYER (INCLUDlNG OUTPUT AND HIDDEN LAYERS). • • C C C C C C C C C C C C C C C C C C c····················································· . • C····**·*······················*······················ . INTEGER MLAYR, MNODE,NWElG INTEGER I,J,K,FANIN,NNODE(O:MLAYR),NLAYR,KK,LL DOUBLE PRECISION WEIGHf(MLAYR, MNODE,O:MNODE), TEMP, + DRANDOM,SEED,TEMP1 CC GENERATE THE RANDOM NUMBER BETWEEN 0.5 TO 0.5 C TEMPI = SEED TEMP = DRANDOM(TEMPI)  .5DO NWEIG= 0 C C LOOP OVER LAYER C DO 10K=I, NLAYR C C CALCULATE THE FANIN OF THE LAYER. C KK=NNODE(K) LL=NNODE(Kl) FANIN = NNODE(KI) + I C C LOOP OVER ALL NEURONS IN CURRENT LAYER C DO 20 J=I, KK C C LOOP OVER ALL NEURONS IN PREVIOUS LAYER C DO 30 1= 0, LL WEIGHT(K,J,I) = TEMP/FANIN NWEIG = NWEJG + I 30 CONTINUE 20 CONTINUE 10 CONTINUE RETURN END 81 C···*···**··**·········*·*···****·**··················.•.. • • • • • • SUBROUTINE LINMfN(FRET,P,T,MSAMP,NSAMP, + MNODE,W,TG,MLAYR,NLAYR,NNODE,O,LAMDA,WO) C···*···*·····*··············*························ . C GIVEN AN NDIMENSIONAL POINT P( I:N) AND AN • C ND1JvfENSIONAL DIRECTION XI(I:N), MOVES AND • C RESETS P TO WHERE THE FUNCTION FUNC(P) TAKES ON C A MINIMUM ALONG THE DIRECTION XI FROM P AND C REPLACES Xl BY THE ACTUAL VECTOR DISPLACEMENT C THAT P WAS MOVED. ALSO RETURNS AS FRET THE VALUE C OF FUNC AT THE RETURND LOCATION P. THIS IS • C ACTUALLY ALL ACCOMPLISHED BY CALLING THE ROUTINES C MNBRAK AND BRENT. • C • C REF "NUMERICAL RECIPJES" C··**····*·····**·*·*·····*···*·········*·············..*...*.... INTEGER MSAMP,NSAMP,MNODE,MLAYR,NLAYR, + NNODE(O:MLAYR) DOUBLE PRECISION O(MSAMP,MNODE),T(MSAMP,MNODE}, + W(MLAYR,MNODE,O:MNODE),TG(MLAYR,MNODE,O:MNODE), + P(MSAMP,MNODE),LAMDA,WO,TOL,FRET INTEGER MSCOM,NSCOM,MNCOM,MLCOM, + NLCOM PARAMETER(MLCOM=4,MNCOM=IOO,MSCOM=2(0) INTEGER NNCOM(O:MLCOM) ooUBLE PRECISION OCOM(MSCOM,MNCOM),TCOM(MSCOM, + MNCOM),WCOM(MLCOM,MNCOM,O:MNCOM), + LAMCOM,WOCOM,TGCOM(MLCOM,MNCOM,O:MNCOM), + PCOM(MSCOM,MNCOM) ooUBLE PRECISION AX,BX,FA,FB,FX,XMIN,XX,BRENT COMMON IFIINSCOM, NLCOM, NNCOM COMMON IF2/PCOM,OCOM,TCOM,WCOM,TGCOM,LAMCOM,WOCOM INTEGER I,J,K,KK,LL EXTERNAL Fl DIM C C INITIALIZE THE PARAMETERS C NSCOM=NSAMP NLCOM=NLAYR LAMCOM=LAMDA WOCOM=WO TOL=1.0D4 00 .10 I=O,NLCOM NNCOM(I)=NNODE(I) 10 CONTINUE 00 50 I=I,NSCOM KK=NNCOM(NLCOM) 0060J=I,KK OCOM(I,J)=O(I,J) TCOM(I,J)=T(I.,J) PCOM(I,J)=P(I,J) 60 CONTINUE 50 CONTINUE C DO 20 K=I,NLCOM 82 KK=NNODE(K) LL=NNODE(KI) 00 30 J=I,KK DO 40 T=O,LL WCOM(K,J,I)=W(K,J,I) TGCOM(K,J,I)=TG(K,J,I) 40 CONTINUE 30 CONTINUE 20 CONTINUE C C USES BRENT,FI DIM,MNBRAK C AX = l.ODO XX =1.000 CALL MNBRAK(AX,XX,BX,FA,FX,FB,Fl DIM) FRET = BRENT(AX,XX,BX,Fl DTM,TOL,XMIN) C C CALCULATE THE TOTAL GRADIENT C DO 70K=1,NLCOM KK=NNODE(K) LL=NNODE(Kl) DO 80 J=l,KK DO 90 I=O,LL TG(K,J,I)=XMIN*TG(K,J,I) W(K,J,I)=W(K,J,l)+TG(K,J,I) 90 CONTINUE 80 CONTINUE 70 CONTINUE C RETURN END C********************·**···****·********··**····*·····•••••••• FUNCTION FIDlM(X) INTEGER MSCOM,NSCOM,MNCOM,MLCOM, + NLCOM PARAMETER(MLCOM=4,MNCOM=IOO,MSCOM=200) lNTEGER NNCOM(O:MLCOM) DOUBLE PRECISION OCOM(MSCOM,MNCOM),TCOM(MSCOM, + MNCOM),WCOM(MLCOM,MNCOM.O:MNCOM), + LAMCOM,WOCOM,TGCOM(MLCOM,MNCOM,O:MNCOM), + PCOM(MSCOM,MNCOM) DOUBLE PRECISION XT(MLCOM,MNCOM,O:MNCOM) C C THE COMMON BLOCK C COMMON IF IINSCOM, NLCOM, NNCOM COMMON IF2/PCOM,OCOM,TCOM,WCOM,TGCOM,LAMCOM,WOCOM DOUBLE PRECISION FlDIM,X EXTERNAL FTNDE C C USES FTNDE C USED BY LTNMlN AS THE FUNCTION PASSED MNBRAK AND BRENT C INTEGER r,J,K,KK,LL 83 DO 100 K=I,NLCOM KK=NNCOM(K) LL=NNCOM(KI} DO 200 J= I,KK DO 300 1=<l,LL XT(K,J,I)=WCOM(K,J,I}+X*TGCOM(K,J, I) 300 CONTINUE 200 CONTINUE 100 CONTINUE FI DIM = FINDE(PCOM,TCOM,MSCOM,}.ISCOM,MNCOM,XT, + MLCOM,NLCOM,NNCOM,OCOM,LAMCOM,WOCOM) RETURN END C******************************************************************* * * * * * * * * * * * * * THJS ROUTINE IS TO INITIALLY BRACKETING AMININUM. REF " NUMERICAL RECIPIES IN FORTRAN, THE ART OF SCJENTIFIC COMPUTING" BY W1LLJAM H. PRESS, ETe. * GIVEN A FUNCTION FUNC AND GIVEN DISTINCT INITIAL POrNTS AX AND BX, THIS ROUTINE SEARCHES IN THE DOWNHlLL DIRECTION (DEFfNED BY THE FUNCTION AS EVALUATED AT THE INITIAL POrNTS) AND RETURNS NEW POINTS AX, BX, CX THAT BRACKET A MINIMUM OF THE FUNCTION. ALSO RETURNED ARE THE FUNCTION VALVES AT THE THREE POrNTS, FA, FB AND Fe. * PARAMETERS: GOLD IS THE DEFAULT RATIO BY WHJCH SUCCESSIVE INTERVALS ARE MAGNIFIED; GUMIT IS THE MAXIMUM MAGNIFICATION FOR A PARABOLlCFJT STEP. * SUBROUTINE MNBRAK(AX,BX,CX,FA,FB,FC,FUNC} C**************************************************************** C * C C C C C C C C C C C C C C C C C**************************************************************** DOUBLE PRECISION AX,BX,CX,FA,FB,FC,FUNC,GOLD,GUMIT,TINY EXTERNAL FUNC PARAMETER (GOLD=I.618034DO,GLlMIT=IOO.DO,TINY=I.D20) DOUBLE PRECISION DUM,FU,Q,R,U,ULIM FA=FUNC(AX) FB=FUNC(BX) IF(FB .GT. FA) THEN DUM=AX AX=BX BX=DUM DUM=FB FB=FA FA=DUM ENDIF C C FIRST GUESS FOR C C CX = BX +GOLD*(BXAX) FC = FUNC(CX) C C INITIALIZE THE ITERATION COUNT 84 C lTER=O rF(FB.GE.FC)THEN R=(BXAX)*(FBFC) Q=(BXCX)*(FBFA) U=BX«BXCX)*Q(BXAX)*R)/(2.*SIGN(MAX(ABS(QR), + TfNY),QR» ULIM=BX + GLIMIT *(CXBX) IF«BXU)*(UCX) .GT. 0) THEN FU = FUNC(U) IF(FU .LT. FC)THEN AX=BX FA=FB BX=U FB=FU RETIJRN ELSE IF(FU .GT. FB) THEN CX=U FC=FU RETURN ENDIF U = CX +GOLD*(CX  BX) FU =FUNC(U) ELSE IF«CXU)*(UULIM) .GT.O)THEN FU =FUNC(U) IF(FU .LT. FC) THEN BX=CX CX=U U = CX + GOLD*(CX  BX) FB=FC FC=FU FU= FUNC(U) ENDIF ELSE !F«U  ULIM)*(ULIM  CX) .GE. O)THEN U=ULlM FU= FUNC(U) ELSE U = CX + GOLD * (CX  BX) FU=FUNC(U) ENDIF AX=BX BX=CX CX=U FA=FB FB= FC FC=FU ITER = ITER + 1 GOTO I ENDrF RETURN END I ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• SUBROUTINE NETPRlNT(LAYER,MLAYR,NNODE,SEED,WO, + LAMDA,LAMDA2,METHOD) c***··***·**·*·*************************··*···*··**·*·••*•••*••*. C THIS SUBROUTINE IS TO PRINT THE NETWORK ARCHJTCTURE AND • C INITIAL PARAMETERS. • 85 C C*···············**····*······························••••••••••• INTEGER LAYER,MLAYR,NNODE(O:MLAYR),METHOD,NSAMP DOUBLE PRECISION SEED, TOL, WO, LAMDA.LAMDA2 C WRITE(·, IO)LAYER 10 FORMAT(1 X,'THE NUMBER OF LAYER IN THE NETWORK IS: ',14) WRJTE(·,20)NNODE(0) 20 FORMAT(lX,'THE INPUT DIMENSION IS ',14) DO 30 1=1, LAYER WRITE(*,40)I,NNODE(I) 40 FORMAT(1X,'THE NUMBER OF NODE IN LAYER ',14, 'IS', ]4) 30 CONTINUE WRfTE(·,60)NNODE(LAYER) 60 FORMAT(IX,'THE OUTPUT DIMENSION IS ',14) IF (METHOD .EQ. 0) THEN WRITE(·,100) 100 FORMAT(IX,'THE PENALTY METHOD IS USED') ELSE IF(METHOD .EQ.I) THEN WRJTE(·,200) 200 FORMAT(lX,'THE STOP TRAINING METHOD IS USED') ELSE WRITE(·,300) 300 FORMAT(1X,'METHOD DATA ERROR') STOP ENDIF C PRINT THE PARAMETERS WRITE(·,50)SEED,WO,LAMDA,LAMDA2 50 FORMAT(I x,'THE SEED IS ·,FlO.4flX, + fIX,'THE WO IS ',FI6.12fIX,'THE LAMDA]S', F16.12, + /lx,'The LAMDA2]S " FI6.12) C RETURN END C·····················································.....•...... SUBROUTINE NETSETUP(LAYER,MLAYR,NNODE,SEED, + WO,LAMDA,LAMDA2,METHOD) C··*·****·***··**···***·**·········*·*·*········*·····* * . C • C THIS SUBROUTINE IS TO READ THE INPUT FILE AND SET UP • C THE NETWORK ARCHITECTURE AND INITIALIZE PARAMETERS • C * C*·*···········**·**··*···*·**···***··················.....*..*.* INTEGER LAYER, MLAYR, MAXNODE, NNODE(O:MLAYR),METHOD, + NSAMP DOUBLE PRECISION SEED,TOL,WO,LAMDA,LAMDA2 IN=50 OPEN(IN, FILE = 'NET.DAT, STATUS = 'OLD', rOSTAT= 10ERR) IF(IOERR .NE. 0) THEN WRITE(·, 10)IOERR 10 FORMAT('CANNOT OPEN NETWORK DATA FILE (NET.DAT), 10ERR= ',110) STOP ENDIF C C READ IN THE NUMBER OF LAYER 86 C READ(IN,·)LAYER C C READ IN THE NUMBER OF NODE IN EACH LAYER.,THE NUMBER OF NODE IN C INPUT LAYER IS IN NNODE(O). C READ(IN,·)(NNODE(]),]=O,LAYER) C C READ IN METHOD, (0 FOR PENALTY METHOD, 1 FOR STOP TRAINlNG METHOD) READ(IN,*) METHOD C C READ IN SEED NUMBER., TOLERANCE, WO AND LAMDA LAMDA2 C READ(IN,*) SEED, WO, LAMDA READ(IN,*) LAMDA2 WRlTE(*, 1111) LAMDA, LAMDA2 1111 FORMATCLAMDA=',F16.12, 'LAMDA2=',FI6.12) CLOSE (UNIT = IN) C RETURN END C···**········**·***···**···*·····*·········*···*·····••••••••••••••••••••••• SUBROUTINE PRJNT3D(A,MLAYR.,NLAYR,MNODE, + NNODE) C···**····******··*****·**************···········*····••••*••••*****. C TlITS SUBROUTINE IS TO COPY A ORIGINAL MATRIX TO NEW MATRIX. • C IT IS USED TO COPY TG. * C·********·******·***·*···****····********·*****··************.*.**** r~HEGERMLAYR,NLAYR.,MNODE,NNODE(O:MLAYR) DOUBLE PRECISION A(MLAYR,MNODE,O:MNODE) INTEGER I,J,K,KK,LL DO 10 K=I ,NLAYR KK=NNODE(K) LL=NNODE(Kl) 0020 J=I,KK 00 30 I=O,LL WRITE(·, 100)K,J,] 100 FORMAT(lX;LAYER # ',15, 'J#', ]5, ']# ',]5) WRITE(*,200)A(K,J,[) 200 FORMAT(lX,'A VALUE: ',EI5.7) 30 CONTINUE 20 CONTINUE 10 CONTINUE RETURN END C***·***************····*·********·**··*·*********·*·*••*••******** * TIDS SUBROUTINE IS TO PRINT THE INPUT DATA Of TRAINING SAMPLE AND THE TARGET OUTPUT DATA. SUBROUTINE PRINTINPUTDATA(P,T,MNODE,OIMJN,DlMOUT,MSAMP, + NSAMP) C**··********************************··*********************.****.**** C * C C·***·*···****··*···**·****···****·***·**···************.**.**••***••• INTEGER MNOOE, DlMIN,DIMOUT,I,NSAMP,MSAMP,J OOUBLE PRECISrON P(MSAMP,MNOOE), T(MSAMP,MNOOE) 87 C C PRINT IN THE INPUT DATA WRITE(*, lOO)NSAMP 100 FORMAT(l)(.'NUMBER OF SAMPLE IS: ',IS) C DO 200 J= 1,NSAMP WRITE{*,300)J 300 FORMAT(I X,'SAMPLE # ',]5) DO 20 1= I,DTMIN WRITE(*,400)P(I,J) 400 FORMAT(IX,'THE INPUT DATA ARE: ',£15.7) 20 CONTINUE C DO 30 1= I.DIMOUT WRITE(*,500)T(I,J) 500 FORMAT( IX,'THE DESIRED OUTPUT DATA ARE: ',E15.7) 30 CONTINUE 200 CONTINUE C RETURN END C***********************************··*·****·*****··************.****.**** FUNCTION DRANDOM(DL) C*****************************************************.**••** C THlS FUNCTION IS TO CREATE A RANOOM NUMBER BETWEEN * C OTOI * C*****·*·********·*····*********·*****·*·****·*·**·********.* DOUBLE PRECISION DL, DRANDOM C 10 DL=DMOD(16807.0DO*DL,2147483647.0DO) DRANDOM=DU2147483648.0OO IF(DRANDOM.LE.O.OOO .OR. DRANDOM.GE.l.ODO) GO TO 10 END C·**··*···*·*··******·***··*··***·**·**·**·***·*····*·**** • • * * * TillS SUBROUTfNE JS TO CALCULATE THE SENSITIVITY DEFINED IN (3.6.6). PLEASE REFER TO (3.6.6){3.6.16) • SENSI(MLAYR,MNODE)THE SENSITIVITY MATRIX. REF(3.6.12) W(MLAYR,MNODE,O:MNODE)WEIGHf MATRIX • T(MSAMP,MNODE}THE DESIRED OUTPUT OF THE NET OUT(MSAMP,MNODE)THE CALCULATED OUTPUT OF THE NET N(MLAYR,MNODE)THE SUMMATION OF THE WEIGHT. REF(3.l.1) SN  THE SAMPLE INDEX. • SUBROUTINE SENSITIVITY(SENSI,W,NLAYR.MLAYR,NNODE, + MNODE,T,OUT,N,SN,MSAMP) C****·*···*·**·*********·****************·*****************.**••***•••*•• C • C C C C C C C C C*******··**···**·**····*···**··***·*·*···***···***··**.*.****••********* INTEGER NLAYR,MLAYR,NNODE{O:MLAYR),I,J,K,KK,LL, + MNODE,MSAMP,SN DOUBLE PRECISION W(MLAYR,MNODE,O:MNODE), + SENSI(MLAYR,MNODE),T(MSAMP,MNODE), + OUT(MSAMP,MNODE),N(MLAYR,MNODE),SUM C C CALCULATE THE SENSITIVITY OF FINAL LAYER (3.6.16) 88 C LOOP OVER LAYER CALCULATE THE SENSITIVITY OF EACH LAYER STARTING FROM THE FINAL LAYER. (3.6.12) KK=NNODE(NLAYR) DO 10I=I,KK SENSI(NLAYR,I) = (T(SN,I)  OUT(SN,I) + • SIGFD(N(NLAYR,I» 10 CONTINUE C C C C C C 00 20 K=NLAYRI,l,I KK=NNODE(K) LL=NNODE(K+I) 0040 1= I,KK. SUM =0.00 00 30 J=l,LL SUM = SUM + SIGFD(N(K,I»·W(K+ 1,J,I)*SENSI(K+ I,J) 30 CONTINUE SENSI(K,I) = SUM 40 CONTINUE 20 CONTINUE C RETURN END C··*·*******·*********···*·*·***·*·**·*···**·*········...*.......•*. SIGMOID TRANSFER FUNCTION INPlIT: DOUBLE PREClSION: X * OUTPUT: OOUBLE PRECISION: SIGF * C C C FUNCTION SIGF(X) C*****·*·*·······*···*·****************** * C**·*···**···········*···**···*·*··*·*··· OOUBLE PRECISION X, SIGF SIGF = 1.00 /(1..00 + EXP(X» RETURN END C······*·····*···*****··**····******·**** FUNCTION SIGFD(X) C···*··**·······***********··*·**····*··· C DERIVATIVE OF SIGMOID FUNCTION * C INPUT: DOUBLE PRECISION: X • C OlITPUT: OOUBLE PRECISION SIGFD • C·**··*********·**··**···*·******·*····*· OOUBLE PRECISION X, SIGFD SIGFD= EXP(X) / «1.oo+EXP(X»**2) RETURN END C**····*****···*···*···*·······*····*····*··· SUBROUTINE SUMGRAD(G,NLAYR,MLAYR,NNODE, + MNODE,TG) C*·······*·*·············***··***··*··*··*···*··**********.*.....*.***.*. C TillS FUNCTION IS TO SUM UP THE GRADIENTS * 89 • • • • C OF EACH EPOCH. • C TG(MLAYR.,MNODE,O:MNODE)  STORES C THE TOTAL GRADIENTS OF NUMOFSAMPLE SAMPLES. C REF. ALGORITHM 3.6.\ (2.2) • C G(MLAYR.,MNODE,O:MNODE)THE GRADIENT OF THE NET OF C ONE SAMPLE. * C TG(MLAYR,MNODE,O:MNODE) THE TOTAL(SUMMATION) GRADIENT C OF ALL SAMPLES. REF. ALG(2.3) • C·········································*·*·········.........•......... INTEGER NLAYR., MLAYR,MNODE,I,J,K,KK,LL, + NNODE(O:MLAYR) DOUBLE PRECISION G(MLAYR.,MNODE,O:MNODE), + TG(MLAYR.,MNODE,O:MNODE) C C CALCULATE THE GRADIENT OF PERFORMACE FUNCTION W.R.T C WEIGHTS ACCORDING TO (3.6.6) C LOOP OVER LAYER C DO IOK=\,NLAYR KK=NNODE(K) LL=NNODE(KI) D020J=1,KK DO 30 I=O,LL TG(K,J,I)=TG(K,J,l) + G(K,J,I) 30 CONTINUE 20 CONTINUE 10 CONTINUE RETURN END C····*·**························***··*··*·**·***····· * *.*. • • • • • • • • SUBROUTINE SUMWEIGHT(P,O,N,MLAYR.,N LAY R,MNODE,A, + NNODE,W,SN,MSAMP) C···*·········*·····················*·····**··········....•.........•.... C THIS SUBROUTJNE IS TO CALCULATE THE SUM OF • C THE INPUTS OF A NEURON J IN LAYER K C PLEASE REFER TO (3.1.1) • C N(MLAYR.,MNODE)STORES THE SUM OF INPUTS OF C NEURON JIN LAYER K • C A(O:MLAYR,MNODE)STORES THE OUTPUT OF C NEURON J IN LAYER K • C A(O:MLAYR,O:MNODE)  STORES THE INPUT DATA. C P(MSAMP,MNODE)  IS THE INPur DATA FROM ONE SAMPLE C T(MSAMP,MNODE)  IS THE DESIRED OUTPUT DATA FROM ONE C SAMPLE • C O(MSAMP,MNODE)  IS THE OUTPUT CALCULATED FROM THE C NE~ • C W(MLAYR.,MNODE,O:MNODE)  THE WEIGHT OF THE NET. C SN  THE SAMPLE INDEX. • C··················*·············*···················· . INTEGER MLAYR,MNODE,NNOD.E(O:MLAYR),I,J,K, + NLAYR,L,SN,MSAMP DOUBLE PRECISION N(MLAYR,MNODE),A(O:MLAYR,O:MNODE), + W(MLAYR,MNODE,O:MNODE),SUM,P(MSAMP,MNODE), + O(MSAMP,MNODE) 90 C STORE INPUT DATA INTO A(O,MNODE) DO 100 1=1, NNODE(O) A(O,I)=P(SN,I) C PRINT *,'SN= ',SN,'P(SN,I)= ',P(SN,I) 100 CONTINUE C STORE THE BIAS A(O,O) = 1.00 C C CALCULATE THE SUM OF THE INPUTS OF A NEURON J IN LAYER K C C LOOP OVER LAYER CLooPOVERLAYER DO 10 K=I,NLAYR C LOOP OVER CURRENT NODE (TARGET) DO 20 J=I,NNODE(K) C LOOP OVER PRVIOUS NODE (SOURCE) SUM =0.000 00 30 I=O,NNODE(Kl) SUM = SUM + W(K,J,I)*A(Kl,l) 30 CONTINUE C CALCULATE THE SUM OF I NEURON J IN LAYER K N(K,J)= SUM C WRlTE (*, 500) K,J,N(K,J) C500 FORMAT(IX,'LAYER #',13,' NODE # ',13,' N = " Fl6.10) C CALCULATE THE OUTPUT OF NEURON J IN LAYER K A(K,J) = SIGF(N(K,J) C WRITE (*,400) K,J,A(K,J) C400 FORMAT(lX,'LAYER # ',I3,'NODE #',I3,' A=', FI6.10) 20 CONTINUE C THE BIAS A(K,O) = 1.D<l 10 CONTINUE C STORED THE OUTPUT IN A(NNODE(NLAYR» DO 2001=1, NNODE(NLAYR) O(SN.I)=A(NLAYR,l) 200 CONTINUE RETURN END C**********·*·*···***********····*··**···**···*···· C THIS FUNCTION IS TO FiND THE PERFORMANCE C FUNCTION E(W) REF. (3.6. I) USING VALIDATION C DATA SET OR TEST DATA SET. C TESTIS THE PERFORMANCE FUNCTION VALUE. REF(3.6.1) C T(MSAMP,MNODE}THE DESIRED OUTPUT OF THE NET C W(MLAYR,MNODE, 0:MNODE)WEIGHT MATRIX THAT C IS UPDATED BY TRAINING SAMPLES. C O(MSAMP,MNODE)THE CALCULATED OUTPUT OF THE C NET C LAMDATHE CONSTANT IN THE PENALTY TERM. C WOTHE CONSTANTS IN THE PENALTY. C·····***··***·***··**···***************·*·**·*···**··••**. 91 FUNCTION TEST(MSAMP,MNODE, + W,MLAYR,NLAYR,NNODE,LAMDA,WO) INTEGER MSAMP,NSAMP,MNODE,MLAYR,NLAYR, + NNODE(O:MLAYR) DOUBLE PRECISION O(MSAMP,MNODE),T(MSAMP,MNODE), + W(MLAYR,MNODE,O:MNODE),LAMDA,WO,SUM,SUMI ,FINDE, + P(MSAMP,MNODE) DOUBLE PRECISION N(MLAYR,MNODE),A(O:MLAYR,O:MNODE) INTEGER I,J,K,L,IN C C READ IN VALIDATIONlfEST OATA SET IN=20 OPEN(UNIT=IN,FlLE='VALlD.DAT,STATUS='OLD',IOSTAT=IOERR) IF(IOERR .NE
Click tabs to swap between content that is broken into logical sections.
Rating  
Title  Penalty Methods to Reduce Overfitting in Artificial Neural Networks  
Date  19981201  
Author  Luo, Zhong Xiang  
Document Type  
Full Text Type  Open Access  
Note  Thesis  
Rights  © Oklahoma Agricultural and Mechanical Board of Regents  
Transcript 
r
...
PENALTY METHODS TO REDUCE OVERFITTING
IN ARTIFICIAL NEURAL NETWORKS
By
ZHONG XIANG LUO
Bachelor of Engineering
Gezhouba Hydroelectric Engineering
University
Yichang, China
1982
Master of Engineering
Wuhan Hydraulic and Electric
Engineering University
Wuhan, China
1989
Submitted to the Faculty ofthe
Graduate College ofthe
Oklahoma State University
In partial fulfillment of
The requirements for
The Degree of
MASTER OF SCIENCE
December, 1998
PENALTY METHODS TO REDUCE OVERFITTING
IN ARTIFICIAL NEURAL NETWORKS
Thesis Approval:
'LJ"'='~e~f> man . {oweJ1__ ofthe Graduate College
ii
ACKNOWLEDGMENTS
I wish to express my sincere appreciation to my major advisor, Dr. John P.
Chandler, for his invaluable guidance and encouragement throughout this research.
Thanks are also due my other graduate committee members, Dr. B. E. Mayfield, and Dr.
G. E. Hedrick, for their time and valuable assistance and cooperation. I also want to
thank the faculty and staff in Computer Science Department at Oklahoma State University
for their support.
More over, I would like to express my special thanks and love to my mother, Ying
Tong, my uncles, Shu Chen Luo, You Chen Luo, myoId sisters Vue Luo, Yuewo Luo,
for their support and W1derstanding throughout my life.
Finally, I would like to give my special thanks to my wife, Dr. Ming Yu, myoId
daughter, Brooke T. Luo, and my new born baby Rynel Luo, for their love, patience, and
support during this research. My wife's encouragement and faith always provided me
strength and comfort. Myoid daughter's Jove always makes me happy and helped me get
through aU difficulties. My new born baby brings me excitement and joys.
111
Cbapter
TABLE OF CONTENTS
Page
I. INTRODUCTION 1
II. METHODS FOR REDUCING OVERFITTING 5
Overfitting and Generalization in Artificial Neural Networks 5
Methods ofReducing Overfitting 7
Penalty Mechanism and Algorithm 7
Penalty Terms as a Method ofReducing Overfitting 13
III. NEURAL NETWORK ARCHTECTURE
AND LEARNING GORITHMS 18
Architectures ofFeedforward Artificial Neural Networks 18
Activation Function 21
Weights in the Neural Network 21
Optimization Algorithm 23
Forward Computations 26
Backpropagation Computation 27
Algorithms for the Penalty Method and Improved Penalty Method 32
VI. METHODS TESTED AND IMPLEMENTATION 37
Neural Network Architecture Design 37
Discussion ofTest Results 40
v . CONCLUSION 45
REFERENCES 47
APPENDIX A 50
Testing Tables 50
APPENDIX B 65
Computer Programs 65
IV
LIST OF TABLES
T~~ P~e
4 1 Overview ofmethod tested 39
4 2 Performance Comparison ofDifferent M.ethod for Network
with 7 hidden nodes ... ... . . . .. . .. . .. . .. .. .. ... ... . .. ... . .. . . . ... ... .. . ..... .. .. 44
4 3 Performance Comparison ofDifferent Method for Network
with 8 hidden nodes ... .... .. ... .. . ... ..... .. .. 44
4 4 Performance Comparison of Different Method for Network
with 10 hidden nodes 44
A 1 Performance ofTraining and Generalization(RMS)
Method A with 7 hidden nodes and A= 0.01 50
A 2 Performance ofTraining and Generalization(RMS)
Method B with 7 hidden nodes and A= 0.01 50
A 3 Performance ofTraining and Generalization(RMS)
Method B with 7 hidden nodes and A= 0.0001 51
A 4 Performance of Training and Generalization(RMS)
Method C with 7 hidden nodes and A= 0.0001 51
A 5 Performance of Training and Generalization(RMS)
Method C with 7 hidden nodes and A= 0.01 51
A 6 Performance ofTraining and Generalization(RMS)
v
Method D with 7 hidden nodes and A= 0.0001 52
A 7 Perfonnance ofTraining and Generalization(RMS)
Method D with 7 hidden nodes and A=0.01 52
A 8 Perfonnance ofTraining and Generalization(RMS)
Method E with 7 hidden nodes and A= 0.01 52
A 9 Performance ofTraining and Generalization(RMS)
Method E with 7 hidden nodes and A= 0.0001 53
A 10 Perfonnance ofTraining and Generalization(RMS)
Method A with 8 hidden nodes 53
A 11 Performance ofTraining and Generalization(RMS)
Method B with 8 hidden nodes and A= 0.01 53
A 12 Performance ofTraining and Generalization(RMS)
Method B with 8 hidden nodes and A=0.0001 54
A 13 Perfonnance ofTraining and Generalization(RMS)
Method C with 8 hidden nodes and A= 0.0001 54
A 14 Perfonnance ofTraining and Generalization(RMS)
Method C with 8 hidden nodes and A=0.01 54
A 15 Performance ofTraining and Generalization(RMS)
Method D with 8 hidden nodes and A= 0.01 55
A 16 Performance ofTraining and Generalization(RMS)
Method D with 8 hidden nodes and A= 0.0001 '" 55
A 17 Performance ofTraining and Generalization(RMS)
VI
Method E with 8 hidden nodes and A. = 0.001 55
A 17 Performance of Training and Generalization(RMS)
Method E with 8 hidden nodes and A. = 0.01 56
A 18 Perfonnance ofTraining and Generalization(RMS)
Method A with 10 hidd.en nodes , 56
A 19 Performance ofTraining and Generalization(RMS)
Method B with 10 hidden nodes and A. = 0.01 .. 57
A 20 Performance ofTraining and Generalization(RMS)
Method B with 10 hidden nodes and A. = 0.000 I 57
A 21 Performance ofTraining and Generalization(RMS)
Method C with 10 hidden nodes and A. = 0.00001 57
A 22 Performance ofTraining and Generalization(RMS)
Method C with 10 hidden nodes and A. =0.01 .. S8
A 23 Perfonnance ofTraining and Generalization(RMS)
Method D with 10 hidden nodes and A. = 0.000 I 58
A 24 Performance ofTraining and Generalization(RMS)
Method D with 10 hidden nodes and A. = 0.01 58
A 25 Performance ofTraining and Generalization(RMS)
Method E with 10 hidden nodes and A. = 0.01 59
A 26 Performan.ce ofTraining and Generalization(RMS)
Method E with 10 hidden nodes and A= 0.00001 59
A27 Performance of Training and Generalization (RMS)
V1l
Method F with 7 hidden nodes and different A(0.008 0.006, 0.004
0.002,0.001,0.0006,0.0001,0.00006,0.00004, 0.00001) 60
A 28 Performance ofTraining and Generalization (RMS)
Method G with 7 hidden nodes and A= 0.001. 60
A 29 Performance of Training and Generalization (RMS)
Method G with 7 hidden nodes and A= 0.0001 60
A 30 Perfonnance ofTraining and Generalization (RMS)
Method G with 7 hidden nodes and A= 0.0002 61
A 31 Performance ofTraining and Generalization (RMS)
Method G with 8 hidden nodes and A= 0.0002 61
A 32 Performance ofTraining and Generalization (RMS)
Method G with 8 hidden nodes and A= 0.0001........ .. .. 62
A 33 Perfonnance ofTraining and Generalization (RMS)
Method G with 10 hidden nodes and A= 0.0001 62
A 34 Performance of Training and Generalization (RMS)
Method G with 10 hidden nodes and A= 0.00001 62
A 35 The Training Data Set.. .. .. .. .. .. 63
A 36 The Validation Data Set 64
Vlll
LIST OF FIGURES
Table Page
2.1 The Relationship Between Training Error and Testing Error................... 6
3.1 A Three Layer Feed forward Network 19
3.2 The Sigmoid Function .. . ... . . . 22
3.3 The Hyperbolic Function .. .. . .... . . .. . . .. .. . .. .. . .. .. .. .. .. . ... .. . .. .. . . . . . ..... 22
IX
Chapter I
INTRODUCTION
Artificial neural networks are computational models of the human brain In
contrast with conventional singleprocessor computers, the brain has a multiprocessor
architecture that is highly interconnected. This architecture can be described as parallel
distributed processing. Parallel distributed processing has many advantages over singleprocessor
models for many difficult computer science problems. It allows problems that
were once very difficult to solve on a computer to be attacked with relative ease.
Neural networks can be trained to develop operational capabilities to respond to
an information environment. Supervised learning and unsupervised learning are the two
main learning regimes used in neural network training.
A supervised learning algorithm adjusts the strengths or weights of the interneuron
connections according to the difference between the desired and actual network
outputs corresponding to a given input. Thus, supervised learning requires a teacher or
supervisor to provide desired or target output signals. Examples of supervised learning
algorithms include the delta rule [1], the generalized delta rule or backpropagation
algorithm [2] and the LVQ algorithm [3].
Unsupervised learning algorithms do not require the desired outputs to be known.
During training, only input patterns are presented to the neural network that automatically
adapt the weights of its connections to cluster the input patterns into groups with similar
features. Examples of unsupervised learning algorithms include the Kohonen [3] and
CarpenterGrossberg Adaptive Resonance Theory (ART) [4] competitive learning
algorithms.
Neural Networks have been used in many fields including economics,
transportation, defense, electronics, manufacturing, medicine, robotics, speech and
telecommunications [1].
The Multilayer Perceptron (MLP) will be discussed in this research. MLPs are
perhaps the bestknown type of feedforward networks. One of the interesting properties
of a feedforward neural network is its capability of learning, i.e., a feedforward neural
network can adjust its behavior using information from the environment. When a
feedforward neural network is used to solve a problem, it is trained by a set of inputoutput
sample data. Based on this data set, the network, when properly trained, wilt not
only try to reproduce the sample set correctly, but also to generalize from the training
examples to the entire problem domain.
A learning algorithm is applied a set of training data, then it is applied to make
predictions on new data points. The goal is to maximize its predictive accuracy on the
new data points. If it is trained too hard to find the very best fit to the training data, there
is a risk that the data noise will be fitted by memorizing various peculiarities of the
training data rather than fmding a general predictive rule. For continuous domains, or
large discrete ones, it is impossible to provide samples of every possible input. For a
2
large network, if the system simply memorizes the training patterns, it may do quite well
dwing the training process but it may give spurious and misleading outputs if the input is
slightly different from the sample inputs. This phenomenon is called overfitting.
Overfitting is thought to happen when the network has more degrees of freedom than the
number of the training samples. Obviously, a network can obtain a good generalization
only when the number ofparameters is less than the number of data points in the training
set. Unfortunately, it is difficult to find the smallest neural network size that can learn the
training data best.
Many techniques for reducing overfitting have been developed. The penaltyterm
method is one of the most popular methods. The basic approach used in a penaltytenn
method is adding penalty terms to the usual error fimction in order to constrain the search
and cause weights to decay differentially. By modifying the cost function, the
backpropagation will drive unnecessary weights close to zero and, in effect, remove them
during training. Even if the weights are not actually removed, the network acts like a
smaller system.
This thesis focuses on the possibilities of reducing the overfitting by using the
penaltyterm method in artificial neural networks. Many penalty terms have been
developed to reduce overfitting. Some of them are complicated; some of them include a
userdependent constant factor. Each penalty term has different advantages and
disadvantages. The question remains of whether there is a penalty term or a combination
of penalty terms that can produce superior results and, if there is, what the penalty term
could be. This research will compare and summarize different penalty terms through
their perfonnance. An improved penalty term method will be proposed in this research.
3
<
It is expected that the improved penalty method will improve the generalization
performance of neural networks significantly. The paper is organized as follows:
In Chapter I, a general introduction to neural networks and the problems of
interest is given.
In Chapter II, a review of different algorithms for reducing overfiting, especially
the penalty term methods, will be conducted.
Chapter III will explain the architecture of the neural network that will be
discussed in this research, and the application of optimization theory in the algorithm. A
new penalty term method will be developed to reduce overfitting in this chapter.
In Chapter IV, an overview of methods that will be tested is given. The regular
learning algorithm without a penalty term, the penalty method with different penalty
terms, and the improved penalty method will be tested. All the methods and different
penalty terms tested will be compared with each other through their generalization
performance in the research.
Finally, the test results will be placed in Appendix A and the source program that
is used in the implementation of the penalty term method and the improved penalty term
method will be placed in Appendix B.
4
Chapter II
METHODS FOR REDUCING OVERFITTING
Overfitting and Generalization in Artificial Neural Networks
Mathematically, the objective of learning in the neural network is to infer a
function from a given sample data set. Learning algorithms are designed essentially to
search for a function that best fits the given data in a space of functions. After learning,
the neural network is applied on the new data set. If it is trained too hard to find the best
fit to the training data, there is a risk that we will fit the noise in to the data by
memorizing various peculiarities of the training data rather than finding a general
predictive rule [5]. When a network is trained, the weights are modified in order to
decrease errors on the training data set. If the network is tested on a new set of data, the
errors on the test data set tend to decrease in step with the training error as the network
tries to generalize from the training data set to the underlined function. However if the
training data is incomplete, it may contain spurious and misleading regularities due to
sampling [6]. Figure 21 illustrates this situation schematically.
It is generally agreed that overfitting is closely related to the architecture of the
networ~ i.e., the size of the network. If training starts with too small a network for the
problem, good results cannot be obtained. If the network is too large, it may be
5
vulnerable to overfitting (20). B. Bawn and David Haussler [19] analyzed theoretically
the lower and upper bounds on the size of the sample vs. the network size needed to
achieve a valid generalization. Subutai Ahmad and Gerald Tesauro [21] analyzed how
many training patterns and training cycles are needed for a problem of a given size and
difficulty, how to represent the input, and how to choose training examples.
In general. overfitting is related to the degree of freedom of neural networks. The
degree of freedom of neural networks includes not only the number of weights but also
the potential nonlinearity of the network, the architecture and the amount oftirne and the
number of data used during training [22].
Error
Testing Error
Training Error
o l TraIDing Time
Figure 2_1 The Relationship Between Training Error and Testing Error
6
Methods of Reducing Overfitting
There are many methods to reduce overfitting and improve generalization [6]
such as pruning methods, stopped training methods and penalty term methods. The
pruning method is to train a network that is larger than necessary and then remove parts
that are not needed. The large initial size allows the network to learn reasonably quickly
with less sensitivity to initial conditions, while the reduced complexity of the trimmed
system favors improved generalization. The stopped training method is to estimate the
generalization ability during training and stop when it begins to decrease. The simplest
method is to divide the data into a training set and a validation set. The training set is
used to modify the weights, the validation set is used to estimate the generalization
ability, and training is stopped when the error on the validation set begins to rise. The
penalty term method is another way to reduce overfitting. The basic approach involves
adding penalty terms to the usual error function in order to constrain the search and cause
weights to differentially decay.
Actually, stopped training and penalty term methods are two widely used
categories. The detailed penalty machines and penalty terms are presented in the
following section.
Penalty Mechanism and Algorithm
Penalty Function Methods: Usually, penalty function methods are used in
determining a solution ofa constrained nonlinear programming problem [10]. Currently,
there is not a universally accepted method of dealing with such a problem. A penalty
function method is to replace a constrained problem with one that is unconstrained. The
7
latter problem is then solved using an iterative technique. A general penalty function
method, a barrier penalty function method and a quadratic penalty function method are
introduced in the foBowing sections.
In penalty function methods, the constrained problem is converted into an
unconstrained problem by adding a penalty function, p(x), to the objective functionj{x).
The resulting unconstrained objective function has the fonn j{x) +fJ p(x), where fJ> O.
The function p(x) imposes a penalty of fJ p(x) whenever x does not satisfy the
constraints ofthe original problem. Actually, a sequence {f(x) +fJp(x) } of functions are
minimized (or maximized). The solution, {Xk}, ofthe sequence will usually approach the
solution of the original problem. Normally, each Xlo; is not a feasible solution of the
original problem. The process tenninates whenever the required accuracy has been
obtained, or whenever some solution, x" , is generated that is a feasible solution of the
original problem. In a penalty function method, an expression involving the constraints
is added to the objective function. The expression is selected so that the value of the
updated objective function is excessively high (or low) at a point x where the problem is
infeasible.
In general, one penalty function for the problem (21) is function (22)
Minimize t{x)
subject to
gj(x) = bi for i = 1, ,1
gj(x) <= bi for i = 1+1, ,m
8
(21)
1 m k.
p(x) =LIb;  g;(x)Ik.+ L(max{O,g,(x)  b,})
;=1 ;=1+1
(22)
where k is a natural number. Notice that p(x) ~ O. In fact p(x) = 0 if and only if x is
feasible.
Problem (21) could be converted into the form
Minimize :t{x)
subject to
hi(x) = 0 for i = 1,....,m
(23)
by adding the square of an unrestricted variable to the left side of each inequality
constraint, and then moving each bi to the left side of each constraint. A typical penalty
function for (22) is
(24)
where k is a (usually even) natural number. Again notice that p(x) >=0. The remainder
ofthis section deals with problem (23).
Barrier function methods: A Barrier function method is an improved penalty
function method. Again a sequence of functions (f(x) +(l/f3k )b(x)} is minimized (or
maximized) and the sequence of solutions {xd nonnally tends to a solution of the
original problem. The difference in barrier function is that the solutions, Xk, are all
9
feasible solutions ofthe original problem. The function hex) is called a barrier function
because it imposes a penalty near the boundary of the set of feasible solutions of the
original problem.
For the problem:
Minimize fl:x) subject to gi(X)~O for i =1,...,m (25)
Notice that problem (25) does not contain any equality constraints. Barrier function
methods are similar to penalty function methods in that a barrier function is added to the
objective functio~ and the resulting function is minimized. The difference is that the
solutions are interior points of F (rather than points exterior to F). The purpose of the
barrier function is to prevent the solutions from leaving the interior ofF.
Some common barrier functions for Problem (25) are
b(x)=_~_l (26)
i=1 gj(x)
and
b(x) = 2:lnlgj(x)1 (27)
I~J
Notice that b(x) is, in either case, continuous throughout the interior of F. Moreover,
b(x)> CX) as x approaches the boundary of F via the interior of F. Rather than solve (2
5), we intend to solve the following problem:
1
Minimize I(x) +/ib(x) subject to each gj(x) < 0
10
(28)
where rl>O.
The Quadratic penalty function method: Both penalty function and barrier
function methods can possess the undesirable property of slow convergence. In [25], the
penalty function method is modified using Lagrange multipliers to obtain a more efficient
method. The technique is called the method of multipliers and has emerged as an
important tool for solving constrained nonlinear programming problems. The quadratic
penalty function method is one ofthese methods. It is briefly introduced as foUowing.
For the problem
Minimize f(x) subject to hlx) = 0 1= 1,....,m (29)
where f, h1, ••••hm are continuously differentiable, assume that the set, F, of feasible
solutions of (29) is nonempty. The continuity of the hi ensures that F is closed. As
mentioned in [10], the Weierstrass theorem guarantees the existence of a solution, x·, of
problem (29).
In [10], a method for determining x· was suggested, Namely, compute vectors x·
and A.. that satisfy
O_OoxL(X.",AO)_'lv7'f(xo)T + Aorooxh(x0)
and
11
(210)
where L(x, A) = f\x)+ATh(x) and hex) = [h,(x)...hm(X)]T. Unfortunately, the system of
equations (210) is difficult to solve.
Consider a solution x· of (29). Let A" be the corresponding vector of Lagrange
multipliers for which equations (210) hold. Notice that whenever xEF, then
L(x", A") = f\x"):::;; f(x) = f\x) + A"Th(x) = L(x, A")
Thus, min {L(x, A") : x E F} = L(x·, A") and
min {f(x) : x E F }= min{L(x, A·) : x E F }
(211)
This suggests that rather than solve (29), we could solve the problem on the right side of
(211), possibly using a penalty function method. That is
Minimize ftx) + A"Th(x) + p i:(h;(X»2
2 ;=1
(212)
where ~ > O. Of course the problem is that A· is not known at the onset of the problem.
The next result suggests an alternate strategy consisting of solving a sequence of
problems ofthe fonn
where Ak E Rmxl.
1.2
(213)
The above discussions concern the penalty function methods and the penalty
mechanism. They have some similarity with the penalty term method used in neural
network training and can be used to evaluate the penalty terms and penalty mechanism
used in neural network training.
To evaluate the different penalty terms developed in neural network training, a
summary of different penalty terms is presented in the following.
Penalty Term Method of Reducing Overfitting
A. Weigend et al Penalty Term
Weigend et al. [11][13] suggested the following cost function:
2/ 2
"~ (I .. 2" Wi W, Ok) +A~ 2/ 2
kET ieCI+w; Wo
(214)
where C is the set of all connections and T is the set of training patterns. The second
term is the penalty term that represents the complexity ofthe network as a function of the
weight magnitudes relative to the constant Woo if Iwil >> wo, then the cost of a weight will
approaches A. If I wil « Wo, the cost is close to zero. The value of A depends on the
problem. If it is too smal~ it won't have any significant effect; if it is too large, all the
weights will be driven to zero.
B. Chauvin Penalty Term
In [14], Chauvin minimize the cost function
13
(215)
where e is a positive monotonic function. The swns are over the set of output units 0, the
set of patterns P, and the set of hidden units H. The first term is the normal backpropagation
error tenn, the second term measures the average "energy" expended by the
hidden units. The parameters Jler and Jlen balance the two terms. The "energy" expended
by a unithow much its activity varies over the training patternsis an indication of its
importance. If the unit changes a lot, it probably encodes significant information; if it
does not change much, it probably does not carry much information.
A magnitudeofweights term may also be added to the cost function, giving
(216)
Since the derivative of the third term with respect to Wij is 2 fJ.wwij, this effectively
introduces a weightdecay term into the backpropagation equations. Weights that are
not essential to the solution decay to zero and can be removed.
C. Ji Penalty Term
Ji et al. [18] modifY the error function to minimize the number of hidden nodes
and the magnitudes of the weights. A singlehiddenIayer network with one input node
and one linear output node is investigated in their research. Beginning with a network
having more hidden units than necessary, the output is computed as
14
N
g(x,w,B) = LVif(UjXOJ
i~1
(217)
where Sr is the threshold, f is the sigmoid function II (l +eX
) , and u and v are the input
and output weights of ith hidden unit respectively.
The significance of a hidden unit is computed based on its input and output
weights
where cr(w) = ~/(l +~).
(218)
The error is defined as the sum of 80 , the nonnal sum of squared errors, and 81,
tenn measuring node significance.
M N iI
=1] L[g(x1r;w,B )_ylt]2 +ALLSiSj
1r~1 i~1 j~1
(219)
where 1t indexes the training patterns and x1t and y1t are the input and desired output for
pattern 1t, and !l and A are learning rate parameters. The El(W) term makes the algorithm
favor solutions with fewer significant hidden units.
It is suggested the second term be added only after the network has learned the
training set sufficiently well because conflict between the two error tenns may cause
local minim.
15
D. Bishop Penalty Term
Chris M. Bishop [16] proposed another penalty tena For error function (220), the
penalty term is given by (221).
(220)
( 221)
Where Yn and XI denote the components of y and X, respectively, and the parameter A.
controls the degree of smoothness of the network mapping. Bishop indicated:
''Unfortunately, the optimum value for A. is problem dependent. It may be found by
seeking the minimum error with respect to a crossvalidation data set, or by a variety of
techniques based on the statistical properties ofthe training data." [16].
E. Simple Penalty Terms
Ishikawa [15] proposed another simple cost function
c = L(/, OK)2 +ALI wi! I
KeT IJ
(222)
If Wij > 0, the weight is decremented by A., otherwise, if Wij < 0, then it is incremented by
A.
Russel Reed [6] described the following simple cost function:
c= L(I'0.)2 +A.LWij 2
keT i,j
16
(223)
Russel Reed evaluated the simple penalty term: "One ofthe characters ofthe AWij
penalty tenn is that it tends to favor vector with many small components over ones with a
single large component, even when this is an effective choice."
A constant Ais used in most of the penalty terms. There is no criteria to select a
A. Weigend et al. [11] indicated that ''the value of Arequires some tuning and depends
on the problem. If it is to small, it won't have any significant effect; if it is too large, all
the weights will be driven to zero." Bishop [19] indicated that the optimum value for A
will be problem dependent, and may be found by seeking the minimum error with respect
to a crossvalidation data set, or by a variety of techniques based on the statistical
properties of the training data. Ji et al. [18] suggested that the A can be made a function
of Eo such as A =Aoe;&0 They suggested the second penalty term be added only after
the network has learned the training set sufficiently well, because of the conflict between
the two error terms may cause local minimal. Ping Jiang [24] said, ''the optimum point of
A is network architecture dependent. We need to choose A to close to optimum point to
improve the generalization performance."
17
Chapter III
ARTIFICIAL NEURAL NETWORK ARCHTECTURE AND LEARNING
ALGORITHMS
Architectures of Feedforward Artificial Neural Networks
Some artificial neural networks were introduced in Chapter 1. This thesis focuses
on the most widely used multilayer feedforward networks. The architecture of a
multilayer feedforward network is shown as Figure 31. This type of network arranges
neurons in layers. All neurons in a layer are connected to all neurons in the adjacent
layers through unidirectional links. These links are represented by synaptic weights. The
input layer ofthe network is treated as connection nodes. All the layers except the output
layer of the network are hidden layers. So the number of hidden layers is the number of
layers in a network minus one.
The notations used are shown in Figure 31. All neurons in a layer are
consecutively indexed starting from 1, in a topdown fashion. The layers are indexed in. a
lefttoright order and are identified by squarebracketed superscripts. All inputs to a
neuron in layer k are denoted as a/k

I I, where I = 0, 1, 2,,,,Sk1 (SkI is the number of
neurons in the (Kl )th layer). In the case of kl = 0, ajlOl are the inputs of the network.
For each layer, we assumed an extra bias node that has a constant output value of1, i.e.
18
nP) n[::!] (21 n[31 a';1
5, 51 ~ 51 ~I
P f(lJ fPI [PI R
Wi a[OJ , a[ll , W W a[2J
s,.R a s, .5, a .,.~ a
nPI a~ll a~1
2
P, fill [PI f(3)
a[OI all)
a a
Figure 31 A Three Layer Feedforward Network
19
3o(k] = 1. Notice that for each k > 2, aj[kI] is also the output of neuron I in (k1)th layer.
The outputs in the kth layer of the network can be written in vector fonn as aj(k]. A
weight is represented as Wj)kJ, where k is the layer index and 'j,i" means that the weight
is the connection from the ith neuron in layer kl to the jth neuron in layer k. In vector
form, weights can be represented by w1k] = (wikl)T. The nj(kJ represents the weighted sum
of a neuron j in layer k. The weighted sum of the inputs of a neuron j in layer k can be
expressed as
The output of the neuronj in layer k can be expressed as
(31)
j = 1,2,... nk (32)
Where fj1kJ is the activation function of the neuron. We will discuss the activation
function in the following section. In vector form, the formulas can be written as
(33)
(34)
where :fkl = (:fkJ)T is a vector ofthe activation function.
20
Activation Function
The original activation function is a binary function [18]. This limits the
application of perceptron neural networks to classification problems only. In order to
solve a general type of mapping application problem, we need to use nonlinear
continuous activation functions. There are many nonlinear activation functions that can
be used in multilayer networks as long as the functions are differentiable. The most
commonly used functions are the sigmoid function and the hyperbolic function which are
expressed as
1
Sigmoid function f(x) = 1+ex
ex ex
Hyperbolic function f(x) = x x
e +e
(35)
(36)
The graphs of signoid and hyperbolic functions are shown in Figure 32 and 33. Since
we can always scale down the input and output values to the interval (0,1) or (1, I), there
is no significant difference between the two functions. The sigmoid function is used in
this paper.
Weights in the Neural Network
The weights in a neural network are initially chosen to be small random numbers.
An activation function is active only in a small domain interval as shown in Figure32. If
the initial weights are too large, the activation functions may saturate at the beginning of
21
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0 Tags
Add tags for Penalty Methods to Reduce Overfitting in Artificial Neural Networks
Comments
Post a Comment for Penalty Methods to Reduce Overfitting in Artificial Neural Networks
Your rating was saved.
you wish to report:
...
Select the collections to add or remove from your search
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
