



FITTING FUNCTIONS AND THEIR DERIVATIVES WITH NEURAL NETWORKS

By ARJPOLSON PUKRITTAYAKAMEE

Bachelor of Engineering, Chulalongkorn University, Bangkok, Thailand, 1997
Master of Science, Oklahoma State University, Stillwater, Oklahoma, 2001

Submitted to the Faculty of the Graduate College of the Oklahoma State University in partial fulfillment of the requirements for the Degree of DOCTOR OF PHILOSOPHY, July, 2009

Dissertation Approved:
Dr. Martin T. Hagan, Dissertation Adviser
Dr. Lionel M. Raff
Dr. Carl D. Latino
Dr. George Scheets
Dr. A. Gordon Emslie, Dean of the Graduate College

ACKNOWLEDGEMENT

I would like to express my deep gratitude to my advisor, Professor Dr. Martin Hagan, for his dedication, support, guidance, patience and friendship throughout my graduate studies. His invaluable feedback and suggestions contributed immeasurably to my research. I also would like to thank my other committee members, Professor Dr. Lionel Raff, Professor Dr. George Scheets and Professor Dr. Carl Latino, for their useful comments and suggestions. I also thank Professor Dr. Ranga Komanduri, Professor Dr. Satish Bukkapatnam, Professor Dr. Paras Agrawal and Professor Dr. Mark Rockley for providing helpful comments and all possible support for the research. My special appreciation goes to Professor Dr. Lionel Raff for his dedication to teaching me the fundamentals of quantum mechanics and molecular dynamics. The research is funded by a grant (DMI0457663) from the National Science Foundation, Division of Mechanical and Manufacturing Innovation (CMMI). I thank Dr. Jocelyn Harrison, Program Director for Materials Processing and Manufacturing, for the interest in and support of this work. Most importantly, I especially thank and dedicate this dissertation to my parents, my grandmother and my sister for their loving support of all my endeavors and their strong encouragement at difficult times.
I also would like to take this opportunity to express my gratitude to all of my teachers at Deves Thampirak School, St. Gabriel's College, Patumwan Demonstration School and Triam Udom Suksa School, and to the professors at Chulalongkorn University and Oklahoma State University, for their immeasurable dedication, support and strong encouragement of my learning. I would like to thank Dr. Prapaporn Kiattikulwattana for her special care, friendship, enthusiastic support and substantial encouragement throughout my doctoral study. My thanks and appreciation are extended to the following people: Rathachai Kaewlai (M.D.), Chokeanan Wechaprukpitak, Dr. Manisra Baramichai, Dr. Nipapan Klunngien, Chainart Vongsittipornrung, Dr. Wisit Kumphai, Kultarika Kwosurat, Rungnapa and Dr. Somsak Kittipiyakul, Pornphan Dulyakarn, Dr. Chanvit Chantrasrisalai, Chayanuch Jakmanon, Arthit Phadungsilp, Xiaochen Hu, Dr. Jakkrit Kunthong, Arttaporn Komolhiran and Katherine McCollom, for their enduring friendship, care, support and encouragement; and Reza Jafari, Milind Malshe and Rutu Narulkar, for being great colleagues and friends as well as providing strong support for the research. My gratitude also goes to Professor Dr. Dusit Kruangam, Dr. Porponth Sichanugrist, Dr. Pavan Siamchai and Dr. Sakon Kittivatcharapong for giving me opportunities to acquire precious experience and for their support in pursuing the doctoral degree. Finally, I thank my friends at Chulalongkorn University, colleagues at Thaicom Satellite and the Thai students at Oklahoma State University for their supportive friendship.
TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
    Overview
    Objectives
    Outline

CHAPTER 2 APPROXIMATION USING NEURAL NETWORKS
    Introduction
    Notation
    Multilayer Feedforward Neural Networks
    Training Algorithms
    Conditions for Function and Derivative Approximation
    Summary

CHAPTER 3 VALIDATION-RELATED METHODS
    Introduction
    Modified validation performance measure
    Early stopping using the derivative information of the training set
    Summary

CHAPTER 4 GRADIENT-BASED COMBINED FUNCTION AND DERIVATIVE APPROXIMATION
    Introduction
    Gradient-Based Combined Function and Derivative Approximation
    Speed Test
    Summary

CHAPTER 5 COMBINED FUNCTION AND DERIVATIVE APPROXIMATION WITH LEVENBERG-MARQUARDT
    Introduction
    CFDA with Levenberg-Marquardt
    Batch calculation
    Memory-save Calculation
    Speed Test
    Summary

CHAPTER 6 A NETWORK PRUNING ALGORITHM FOR CFDA
    Introduction
    Two-Layer Network Response
    CFDA Overfitting
    Pruning Algorithm
    Summary

CHAPTER 7 TRAINING RESULTS ON SIMPLE PROBLEMS
    Introduction
    Evaluation Procedure
    Parameters in CFDA
    Simulation Results
    Summary

CHAPTER 8 A REAL-WORLD APPLICATION
    Introduction
    Molecular Dynamics with Neural Networks
    Simulation results
    Summary

CHAPTER 9 CONCLUSIONS
    Objective
    Summary
    Future work

APPENDIX
    A. PSGEN algorithm
    B. Calculation for the second derivatives of neural networks
    C. Class of H2Br model
    D. Scaled and true derivatives

REFERENCES

LIST OF TABLES

Table 1 Approximation accuracy on problem 1
Table 2 Approximation accuracy on problem 2
Table 3 Approximation accuracy on problem 3
Table 4 Approximation accuracy on problem 1
Table 5 Approximation accuracy on problem 2
Table 6 Approximation accuracy on problem 3
Table 7 Description of test functions
Table 8 Network structure for each problem
Table 9 Approximation accuracy on problem 1 (First set)
Table 10 Approximation accuracy on problem 2 (First set)
Table 11 Approximation accuracy on problem 3 (First set)
Table 12 Approximation accuracy on problem 4 (First set)
Table 13 Number of neurons after pruning (First set)
Table 14 Approximation accuracy on problem 1 (Second set)
Table 15 Approximation accuracy on problem 2 (Second set)
Table 16 Approximation accuracy on problem 3 (Second set)
Table 17 Approximation accuracy on problem 4 (Second set)
Table 18 Number of neurons after pruning (Second set)
Table 19 Approximation RMSE of three-body silicon, 3000 training points
Table 20 Approximation RMSE of three-body silicon, 1500 training points
Table 21 Number of neurons after pruning, for three-body silicon
Table 22 Approximation RMSE of five-body silicon, 3000 training points
Table 23 Approximation RMSE of five-body silicon, 1500 training points
Table 24 Number of neurons after pruning, for five-body silicon
Table 25 Approximation RMSE of H2Br, 3000 training points
Table 26 Approximation RMSE of H2Br, 1500 training points
Table 27 Approximation RMSE of H2Br, 750 training points
Table 28 Approximation RMSE of H2Br, 375 training points
Table 29 Number of neurons after pruning, for H2Br
Table 30 Approximation RMSE of H2Br, after some extrapolation elimination
Table 31 Elements of $S$ and their index $i_S$
Table 32 The binary sequences, combinadics and the associated elements of $P(S)$
Table 33 Combinadics and their index $i_c$
Table 34 Notations for the combinadics
Table 35 Parameters of the model

LIST OF FIGURES

Figure 1) A single neuron
Figure 2) The 3-layer feedforward neural network
Figure 3) Training and validation set
Figure 4) Sum square function error on training and validation set
Figure 5) A training record showing $E_V$ versus $\tilde{E}_V$
Figure 6) Relative execution time (compared to $\partial J_f / \partial x$) for computing $\partial J_d / \partial x$
Figure 7) Relative execution time for computing the gradient and Jacobian of $F_2(x)$, compared to $F_1(x)$
Figure 8) Sketches of $f^1(n_i^1)$ and $\dot{f}^1(n_i^1)$
Figure 9) Effect of the first-layer weight
Figure 10) Effect of the second-layer weight
Figure 11) Neuron's function and derivative responses in two dimensions
Figure 12) A cross section of the neuron's derivative responses
Figure 13) Responses of two neurons
Figure 14) Type-A Overfitting
Figure 15) Distance between a point and a line
Figure 16) Type-B Overfitting
Figure 17) Definition of Buffer Zone
Figure 18) Training points and the buffer zone in a two-dimensional case
Figure 19) Overview of the pruning algorithm
Figure 20) Graph of the four test functions
Figure 21) Impact of $\lambda$ on the approximation accuracy
Figure 22) Error comparison for problem 1
Figure 23) Error comparison for problem 2
Figure 24) Error comparison for problem 3
Figure 25) Error comparison for problem 4
Figure 26) Overfitting in the function and derivative responses
Figure 27) Function and first derivative errors
Figure 28) Responses of the three neurons
Figure 29) The combined responses of the three neurons
Figure 30) Function and derivative errors, right after pruning (with no retraining)
Figure 31) Function and derivative errors of the final network
Figure 32) Function and derivative errors
Figure 33) Responses of the two neurons
Figure 34) The combined responses of the two neurons
Figure 35) Function and derivative errors, right after pruning (with no retraining)
Figure 36) Function and derivative errors of the final network
Figure 37) Function and derivative errors
Figure 38) Responses of the neuron causing the overfitting
Figure 39) Function and derivative errors, right after pruning (with no retraining)
Figure 40) Function and derivative errors of the final network
Figure 41) Function and derivative errors
Figure 42) Responses of the three neurons
Figure 43) The combined responses of the 15 neurons
Figure 44) Function and derivative errors, right after pruning (with no retraining)
Figure 45) Function and derivative errors of the final network
Figure 46) Configuration of three-body silicon
Figure 47) Configuration of five-body silicon
Figure 48) Configuration of H2Br
Figure 49) Extrapolation in a Monte Carlo trial, with $Q = 375$
Figure 50) Error comparison for H2Br with $Q = 375$
Figure 51) The three neuron centers
Figure 52) Function and derivative responses of the two neurons along the cross section
Figure 53) The combined responses
Figure 54) Errors before and right after pruning the two neurons
Figure 55) Function and derivative response of the neuron
Figure 56) Errors before and right after pruning the neuron
Figure 57) Errors before and right after pruning the 21 neurons
Figure 58) Errors before and after performing the pruning algorithm

CHAPTER 1 INTRODUCTION

Overview

Multilayer feedforward neural networks (MFNNs) have been used in many nonlinear regression problems. They have been shown (both mathematically and practically) to be a powerful tool for approximating functions, based on a set of examples. In addition, it has been theoretically proven that their derivatives are capable of approximating the underlying derivatives of the functions. In this research, we focus on the development and the derivation of training algorithms for MFNNs to approximate both functions and their first-order derivatives.

Multilayer feedforward neural networks are capable of simultaneously approximating both a function and its derivatives, as has been proven in [HoSt90], [Horn91], [Ito93], [Li96] and [Pink99]. However, there has not been a large amount of research into the development of algorithms for training multilayer feedforward neural networks to simultaneously approximate both a function and its first derivatives. This will be the focus of our research.

There have been a few methods proposed in the past to train MFNNs for approximating both a function and its derivatives. All of these methods assume a noise-free environment. One approach, called algebraic training, was proposed in [FeSt05]. This method algebraically obtains the network parameters in one function approximation step.
The derivative approximation is carried out in a second stage, where the algorithm iteratively adjusts the network parameters to improve the derivative approximation of the neural network. In [BaEn99], the derivative approximation was performed by adding an extra output unit for each partial derivative to the regular structure of the feedforward neural network that was used to approximate the function. The standard backpropagation procedure was then used to train the proposed network structure. In [HaZa99], an additional network was created to approximate the derivatives. The derivative approximation obtained from the extra network was then combined with the network output used for function approximation using the Taylor series expansion. In our research, a different technique will be used. We will not modify the network structure or create an additional network for derivative approximation. We will use the same network for derivative approximation that we use for function approximation. In addition, unlike algebraic training, the derivative approximation in our method will occur simultaneously with the function approximation. However, like the other methods above, we also assume that the data for this research are noise-free.

Objectives

As previously mentioned, our research goal is to develop new algorithms for training MFNNs to simultaneously approximate both a function and its first derivatives (in a noise-free environment). However, we will also investigate both function and derivative approximation accuracy for three existing algorithms. These existing algorithms will be called the standard methods in this research, and they are:

1. Broyden-Fletcher-Goldfarb-Shanno with Early Stopping (BFGS-ES),
2. Levenberg-Marquardt with Early Stopping (LM-ES), and
3. Gauss-Newton Approximation to Bayesian Regularization (GNBR).

This evaluation is similar to the work of [GaWh92], though we will use different training methods.
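The simultaneous approximation goal above can be pictured as minimizing a combined error measure over both function values and first derivatives. The sketch below is illustrative only: the dissertation's actual CFDA performance index is developed in Chapter 4, and the weighting factor `rho` is a hypothetical placeholder, not the dissertation's parameter.

```python
import numpy as np

# Illustrative combined error measure: sum-of-squares error on function
# values plus a weighted sum-of-squares error on first derivatives.
# "rho" is a hypothetical weighting, not the dissertation's parameter.
def combined_index(y_pred, y_true, dy_pred, dy_true, rho=1.0):
    func_err = np.sum((y_true - y_pred) ** 2)
    deriv_err = np.sum((dy_true - dy_pred) ** 2)
    return func_err + rho * deriv_err

# A perfect function fit with wrong slopes is still penalized:
y = np.array([0.0, 1.0])
dy = np.array([1.0, 1.0])
print(combined_index(y, y, dy + 0.5, dy))  # 0.5 = 0 + 1.0 * (0.25 + 0.25)
```

A training algorithm would then minimize such an index over the network parameters, rather than the function error alone.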
The approximation accuracy and the execution time of the new algorithms will be compared with the standard methods. In this research, five new algorithms are developed. They are:

1. Levenberg-Marquardt with Early Stopping using a modified validation measure (LM-ES1),
2. Levenberg-Marquardt with Early Stopping using the derivative information of the training set (LM-ES2),
3. Combined Function and Derivative Approximation using Broyden-Fletcher-Goldfarb-Shanno optimization (CFDA-BFGS),
4. Combined Function and Derivative Approximation with Levenberg-Marquardt optimization (CFDA-LM), and
5. Combined Function and Derivative Approximation methods with a network-pruning algorithm.

The LM-ES1 and LM-ES2 methods only change the validation performance measure, while standard training algorithms are used to train the neural networks. For CFDA-BFGS and CFDA-LM, the proposed change is in the training performance index. Therefore, the standard calculations cannot be applied. The CFDA methods can be incorporated with the proposed network-pruning algorithm, thus resulting in the fifth algorithm. The results of the standard methods and the new algorithms will be compared on several analytic and real-world problems. The real-world application we focus on in this research is molecular dynamics, where the motion of atoms is simulated. The potential energy of the atoms is a function of atomic configuration, and the forces acting on the atoms are the negative first derivatives of the potential energy. Therefore, molecular dynamics is a good example where both the function and its first-order derivative approximation are of interest.

Outline

In Chapter 2, the mathematical notation and the general concepts of MFNNs will be introduced. The use of neural networks in function approximation will be briefly reviewed. Three standard training algorithms for function approximation will be discussed, i.e. BFGS-ES, LM-ES and GNBR.
The conditions under which the neural networks can simultaneously and uniformly approximate both a function and its derivatives will be provided.

In Chapter 3, two new validation-related methods will be introduced, i.e. LM-ES1 and LM-ES2. The approximation accuracy of these methods and LM-ES will be compared and discussed on analytic problems.

In Chapter 4, the CFDA training performance index is proposed. Two approaches (i.e. batch mode and memory-save method) for calculating the gradient of the performance index (which works with any gradient-based optimization) are proposed and derived. The execution time for computing the CFDA gradient and the standard gradient, under several scenarios, is compared.

In Chapter 5, the CFDA method under the Levenberg-Marquardt framework, named CFDA-LM, is discussed. Two approaches (i.e. batch mode and memory-save method) for minimizing the CFDA performance index are proposed and derived. The comparison of the execution time for the CFDA-LM method and the standard Levenberg-Marquardt algorithm is illustrated.

In Chapter 6, two new types of overfitting in a two-layer network with one output, trained with any CFDA method, are discussed. A network-pruning algorithm to mitigate the overfitting is proposed. A pseudo code for the algorithm is given.

In Chapter 7, we introduce a procedure to test the approximation accuracy of neural networks on four analytic problems. We discuss a way to assign a value to the unknown parameter in the CFDA performance index. We choose CFDA-BFGS for the gradient-based CFDA. The pruning algorithm is applied to CFDA-BFGS and CFDA-LM. We denote CFDA-BFGS with the pruning method and CFDA-LM with the pruning method CFDA-BFGSp and CFDA-LMp, respectively. We test the approximation accuracy of neural networks trained by the standard methods and the CFDA methods (i.e. CFDA-BFGS, CFDA-LM, CFDA-BFGSp and CFDA-LMp). The results are shown and discussed. Examples showing the overfitting and how the pruning algorithm removes it are demonstrated.
In Chapter 8, the background material for molecular dynamics is reviewed. The use of neural networks in molecular dynamics is discussed. The comparison of the approximation accuracy of neural networks trained by GNBR and the CFDA methods is illustrated and discussed for several problems. An example showing the overfitting in a molecular dynamics problem and how the pruning algorithm removes it is demonstrated.

In Chapter 9, the summary of the research and the future work is provided.

CHAPTER 2 APPROXIMATION USING NEURAL NETWORKS

Introduction

The objective of this research is to use multilayer feedforward neural networks (MFNNs) for approximating functions and their first-order derivatives. This chapter serves three purposes. First, the notation used throughout the research will be introduced. Second, we will discuss the concept and the background material of MFNNs, including a review of three existing training algorithms for multilayer feedforward neural networks. These algorithms are Broyden-Fletcher-Goldfarb-Shanno with Early Stopping (BFGS-ES), Levenberg-Marquardt with Early Stopping (LM-ES), and the Gauss-Newton approximation to Bayesian regularization (GNBR). Finally, the conditions under which multilayer feedforward neural networks can simultaneously approximate both a function and its first-order derivatives will be briefly discussed and reviewed.

Notation

This section will introduce mathematical notation that will be used throughout the research. The following definitions provide the meaning of three mathematical operators; see [HoJo94] and [MaNe99].

Definition (3). Let $A$ be an $m \times n$ matrix and $B$ a $p \times q$ matrix. The $mp \times nq$ matrix defined by

$$\begin{bmatrix} a_{1,1}B & \cdots & a_{1,n}B \\ \vdots & & \vdots \\ a_{m,1}B & \cdots & a_{m,n}B \end{bmatrix} \quad (1)$$

is called the Kronecker product of $A$ and $B$, and written $A \otimes B$.
Definition (4). If $A = [a_{i,j}]$ and $B = [b_{i,j}]$ are matrices of the same order, say $m \times n$, then the Hadamard product of $A$ and $B$ is the $m \times n$ matrix

$$A \bullet B = [a_{i,j} b_{i,j}]. \quad (2)$$

Definition (5). Let $A$ be an $m \times n$ matrix and $a_j$ its $j$th column; then $\mathrm{vec}\,A$ is the $mn \times 1$ vector

$$\mathrm{vec}\,A = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}. \quad (3)$$

We will use the following notation, from [MaNe99], for representing the derivatives of a function:

Definition (6). Let $f$ be a differentiable scalar function of a scalar variable $x$. Then the derivative of $f$ is denoted $\dot{f}(x) = \partial f(x)/\partial x$.

Definition (7). Let $f$ be a differentiable scalar function of a $p \times 1$ vector $x$. Then the derivative of $f$ is the $1 \times p$ array

$$\frac{\partial f(x)}{\partial x^T} = \frac{\partial f}{\partial x^T} = \begin{bmatrix} \dfrac{\partial f(x)}{\partial x_1} & \cdots & \dfrac{\partial f(x)}{\partial x_p} \end{bmatrix}. \quad (4)$$

Definition (8). Let $f$ be a differentiable $m \times 1$ real vector function of a $p \times 1$ vector $x$; then the derivative of $f$ is the $m \times p$ matrix

$$\frac{\partial f(x)}{\partial x^T} = \frac{\partial f}{\partial x^T} = \begin{bmatrix} \dfrac{\partial f_1(x)}{\partial x^T} \\ \vdots \\ \dfrac{\partial f_m(x)}{\partial x^T} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \cdots & \dfrac{\partial f}{\partial x_p} \end{bmatrix}. \quad (5)$$

And the derivative of the function $f_i$, where $i = 1, 2, \ldots, m$, with respect to the scalar variable $x_j$, where $j = 1, 2, \ldots, p$, is denoted by

$$\frac{\partial f_i(x)}{\partial x_j} = \frac{\partial f_i}{\partial x_j}. \quad (6)$$

Definition (9). Let $F$ be a differentiable $m \times n$ matrix function of a $p \times q$ matrix of real variables $X$. Then the derivative of $F$ with respect to $X$ is the $mn \times pq$ matrix

$$\frac{\partial\,\mathrm{vec}\,F}{\partial (\mathrm{vec}\,X)^T} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1^T} & \cdots & \dfrac{\partial f_1}{\partial x_q^T} \\ \vdots & & \vdots \\ \dfrac{\partial f_n}{\partial x_1^T} & \cdots & \dfrac{\partial f_n}{\partial x_q^T} \end{bmatrix}. \quad (7)$$

We will also use the notation $1_{m \times n}$ to denote the $m \times n$ matrix, all of whose elements are ones. The $m \times m$ identity matrix will be denoted by the notation $I_m$.

The next section will discuss some notation and background material for MFNNs. A brief discussion of how the MFNN can be used for function approximation will also be presented.

Multilayer Feedforward Neural Networks

We will divide this section into three parts. The first part will briefly discuss the general concept of MFNNs. The notation for MFNNs will be introduced in the second part. Finally, we will describe how MFNNs can be used for function approximation.
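The Kronecker product, Hadamard product, and vec operators defined in the Notation section can be checked numerically. A minimal NumPy sketch (the matrices here are arbitrary examples, not taken from the dissertation):

```python
import numpy as np

# Numerical check of the three matrix operators from the Notation section.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])       # m x n = 2 x 2
B = np.array([[0.0, 1.0]])       # p x q = 1 x 2

K = np.kron(A, B)                # Kronecker product: mp x nq = 2 x 4
H = A * A                        # Hadamard product: elementwise, same order
v = A.reshape(-1, order='F')     # vec: stack the columns of A, length mn

print(K.shape)   # (2, 4)
print(H)         # [[ 1.  4.] [ 9. 16.]]
print(v)         # [1. 3. 2. 4.]  (column by column)
```

Note that NumPy stores arrays row-major by default, so `order='F'` (column-major) is what reproduces the column-stacking convention of the vec operator.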
Background material The term neural networks has been used to refer to a structure consisting of connected nodes (or neurons) in any formation (i.e. parallel, series or both). The inspiration of inventing neural networks was initially based on how a brain works. However, neural networks are now considered a mathematical and statistical model for information and sig F m × n p × q X F X mn × pq ∂vecF ∂(vecX)T  x1 ∂ T ∂ f1 … xq ∂ T ∂ f1 … … x1 ∂ T ∂ fn … xq ∂ T ∂ fn = 1m × n m × n m × m Im 11 nal processing. Neural networks have been theoretically and practically proven to be a powerful tool in many applications, such as function approximation, classification, pattern recognition, novelty detection, filtering, etc. Neural networks have been categorized into several types usually based on how neurons are connected. One of the most widelyused types for function approximation is the MFNN. In this research, we will focus on using MFNNs for function approximation. For ease of reference, the term neural network (or just network) will be used throughout this document to refer to MFNNs. A neural network consists of neurons connected in parallel and series. The structure of a neuron is shown in the following figure. Figure 1) A single neuron From Figure 1), a neuron consists of several components. They are the neuron input , the weight , the bias , the summer, the net input , the transfer function , and the neuron output . The net input is computed as . The net input is fed to the transfer function to produce the neuron output , i.e. . The weight and the bias are called the “network parameters”. When neurons are connected in parallel and cascade, a more complicated structure is obtained. In the parallel structure (layer), each neuron receives the same inputs as the other neurons do but independently produces its own output. 
When neurons are in the cascade structure, each neuron output of a preceding layer is distributed to be a neuron input for every neuron in the layer following it. The general structure is referred to as an $M$-layer neural network. An example of a 3-layer neural network is illustrated in the following figure.

Figure 2) The 3-layer feedforward neural network, with tan-sigmoid first and second layers and a linear output layer: $a^1 = \mathrm{tansig}(W^1 p + b^1)$, $a^2 = \mathrm{tansig}(W^2 a^1 + b^2)$ and $a^3 = \mathrm{purelin}(W^3 a^2 + b^3)$, where the input $p$ is $R \times 1$, $W^1$ is $S^1 \times R$, $W^2$ is $S^2 \times S^1$ and $W^3$ is $S^3 \times S^2$.

Since the complexity drastically increases as we add more neurons and layers, we need notation to refer to a specific neuron. In the next part, we will introduce the notation for the general structure of a multilayer feedforward neural network that will be used throughout this research.

Notation for neural networks

An $M$-layer feedforward neural network structure can be denoted $R - S^1 - S^2 - \cdots - S^M$. This structure notation corresponds to the statement that the neural network consists of $R$ inputs, $S^1$ neurons in the first layer, $S^2$ neurons in the second layer, and so on, until $S^M$ neurons in layer $M$, i.e., the output layer. Layers 1 through $M - 1$ are called hidden layers; layer $M$ is called the output layer. We can consider the network inputs as the neuron outputs of layer $m = 0$.

Suppose we have an $R - S^1 - S^2 - \cdots - S^M$ neural network and $Q$ input/target pairs in the training set. We write $p_r$ to denote element $r$ of the input vector, and write $p_{r,q}$ to denote element $r$ in the $q$th input vector. We write $w^m_{i,j}$ to denote the weight of neuron $i$ at layer $m$ that connects from the output of neuron $j$ at layer $m - 1$. The bias for neuron $i$ at layer $m$ is denoted by $b^m_i$. The weight $w^1_{i,j}$ is the weight which connects from the $j$th element of the input vector to neuron $i$ in the first layer.
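As a concrete check of this indexing, the sketch below (purely illustrative; the 2-3-1 structure and random initialization are made up for the example) allocates the weight matrices $W^m$ and bias vectors $b^m$ for an $R - S^1 - S^2$ network and confirms that $W^m$ is $S^m \times S^{m-1}$ and $b^m$ is $S^m \times 1$.

```python
import numpy as np

def init_params(layer_sizes, rng):
    """layer_sizes = [R, S1, ..., SM]; returns lists of W^m and b^m,
    where W^m is S^m x S^{m-1} and b^m is S^m x 1."""
    Ws, bs = [], []
    for m in range(1, len(layer_sizes)):
        Ws.append(rng.standard_normal((layer_sizes[m], layer_sizes[m - 1])))
        bs.append(rng.standard_normal((layer_sizes[m], 1)))
    return Ws, bs

Ws, bs = init_params([2, 3, 1], np.random.default_rng(0))
print([W.shape for W in Ws])  # [(3, 2), (1, 3)]
print([b.shape for b in bs])  # [(3, 1), (1, 1)]
```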
The net input of neuron $i$ at layer $m$ is denoted by $n^m_i$, and the notation $a^m_i$ denotes the output of neuron $i$ at layer $m$. At the output layer, $a^M_k$ is also denoted by $a_k$ ($a_k = a^M_k$). When the $q$th input vector is applied to the network (i.e., $p_r = p_{r,q}$ for $r = 1, 2, \ldots, R$), the values of $n^m_i$, $a^m_i$ and $a_k$ are denoted by $n^m_{i,q}$, $a^m_{i,q}$ and $a_{k,q}$, respectively. We assume that the transfer function is the same for each neuron in the same layer, and we use $f^m$ to denote the transfer function at layer $m$.

The notation we just introduced can also be used in matrix form. A couple of examples are provided: the notation $p_{r,q}$ means this element is at row $r$ of the $q$th input vector, which is $R \times 1$, and it is also the element at row $r$ and column $q$ of the $R \times Q$ matrix $P$. The notation $a^m_{k,q}$ means this element is at row $k$ of the $S^m \times 1$ vector $a^m_q$, and it is the element at row $k$ and column $q$ of the $S^m \times Q$ matrix $A^m$. The notation $b^m_i$ is at row $i$ of the $S^m \times 1$ vector $b^m$. The notation $w^m_{i,j}$ is at row $i$ of the $S^m \times 1$ vector $w^m_j$, and it is at row $i$ and column $j$ of the $S^m \times S^{m-1}$ matrix $W^m$. The following examples illustrate how the notations $p_{r,q}$, $a^m_{k,q}$, $b^m_i$ and $w^m_{i,j}$ can be expressed in matrix form:

$$p_q^T = \begin{bmatrix} p_{1,q} & p_{2,q} & \cdots & p_{R,q} \end{bmatrix} \quad \text{and} \quad P = \begin{bmatrix} p_1 & p_2 & \cdots & p_Q \end{bmatrix}, \qquad (8)$$

$$(a^m_q)^T = \begin{bmatrix} a^m_{1,q} & a^m_{2,q} & \cdots & a^m_{S^m,q} \end{bmatrix} \quad \text{and} \quad A^m = \begin{bmatrix} a^m_1 & a^m_2 & \cdots & a^m_Q \end{bmatrix}, \qquad (9)$$

$$(b^m)^T = \begin{bmatrix} b^m_1 & b^m_2 & \cdots & b^m_{S^m} \end{bmatrix}, \qquad (10)$$

$$(w^m_j)^T = \begin{bmatrix} w^m_{1,j} & w^m_{2,j} & \cdots & w^m_{S^m,j} \end{bmatrix} \quad \text{and} \quad W^m = \begin{bmatrix} w^m_1 & w^m_2 & \cdots & w^m_{S^{m-1}} \end{bmatrix}. \qquad (11)$$

Note that we use the notation $p^T_r$ to denote the $r$th row vector of the matrix $P$. Similarly, the notation $(w^m_{(i)})^T$ denotes the $i$th row vector of the matrix $W^m$. The following equations illustrate how these are related to the notation previously introduced:

$$p^T_r = \begin{bmatrix} p_{r,1} & p_{r,2} & \cdots & p_{r,Q} \end{bmatrix} \quad \text{and} \quad P = \begin{bmatrix} p^T_1 \\ p^T_2 \\ \vdots \\ p^T_R \end{bmatrix}, \qquad (12)$$

$$(w^m_{(i)})^T = \begin{bmatrix} w^m_{i,1} & w^m_{i,2} & \cdots & w^m_{i,S^{m-1}} \end{bmatrix} \quad \text{and} \quad W^m = \begin{bmatrix} (w^m_{(1)})^T \\ (w^m_{(2)})^T \\ \vdots \\ (w^m_{(S^m)})^T \end{bmatrix}. \qquad (13)$$

Given the input $p_q$, the output of neuron $i$ at layer $m$, i.e., $a^m_{i,q}$, can be computed as

$$a^m_{i,q} = f^m(n^m_{i,q}) \quad \text{and} \quad n^m_{i,q} = \sum_{j=1}^{S^{m-1}} w^m_{i,j} a^{m-1}_{j,q} + b^m_i, \qquad (14)$$

for $i = 1, 2, \ldots, S^m$. At the first layer ($m = 1$), the $i$th neuron output can be calculated as

$$a^1_{i,q} = f^1(n^1_{i,q}) \quad \text{and} \quad n^1_{i,q} = \sum_{r=1}^{R} w^1_{i,r} p_{r,q} + b^1_i, \qquad (15)$$

for $i = 1, 2, \ldots, S^1$. Eq. (14) and Eq. (15) can be written in matrix form as

$$a^m_q = f^m(n^m_q) = \begin{bmatrix} f^m(n^m_{1,q}) & f^m(n^m_{2,q}) & \cdots & f^m(n^m_{S^m,q}) \end{bmatrix}^T \quad \text{and} \quad n^m_q = W^m a^{m-1}_q + b^m, \qquad (16)$$

$$a^1_q = f^1(n^1_q) = \begin{bmatrix} f^1(n^1_{1,q}) & f^1(n^1_{2,q}) & \cdots & f^1(n^1_{S^1,q}) \end{bmatrix}^T \quad \text{and} \quad n^1_q = W^1 p_q + b^1, \qquad (17)$$

for $q = 1, 2, \ldots, Q$. In batch mode, when all of the inputs (i.e., $q = 1, 2, \ldots, Q$) are presented to the network at the same time, Eq. (14) can be expressed as

$$A^m = F^m(N^m) = \begin{bmatrix} f^m(n^m_1) & f^m(n^m_2) & \cdots & f^m(n^m_Q) \end{bmatrix}, \quad \text{where} \qquad (18)$$

$$N^m = W^m A^{m-1} + 1_{1 \times Q} \otimes b^m. \qquad (19)$$

Eq. (15) can be expressed in batch mode as

$$A^1 = F^1(N^1) \quad \text{and} \quad N^1 = W^1 P + 1_{1 \times Q} \otimes b^1. \qquad (20)$$

Since the objective of the research focuses on first-order derivative approximation, we will need notation for the first-order derivatives of neural networks. By Definition (6), we write the derivative of $a_k$ with respect to $p_r$, evaluated at $p_r = p_{r,q}$, as

$$\left. \frac{\partial a_k}{\partial p_r} \right|_{p_r = p_{r,q}} \equiv \frac{\partial a_{k,q}}{\partial p_{r,q}}. \qquad (21)$$

By Definition (7), the derivative of $a_k$ with respect to the input $p$, evaluated at $p = p_q$, is written

$$\left. \frac{\partial a_k}{\partial p^T} \right|_{p = p_q} \equiv \frac{\partial a_{k,q}}{\partial p_q^T}. \qquad (22)$$

The derivative of $a$ with respect to $p_r$, evaluated at $p_r = p_{r,q}$, is denoted by

$$\left. \frac{\partial a}{\partial p_r} \right|_{p_r = p_{r,q}} \equiv \frac{\partial a_q}{\partial p_{r,q}}. \qquad (23)$$

By Definition (8), the derivative of $a$ with respect to the input $p$, evaluated at $p = p_q$, is written

$$\left. \frac{\partial a}{\partial p^T} \right|_{p = p_q} \equiv \frac{\partial a_q}{\partial p_q^T}. \qquad (24)$$

A couple more examples: the derivative of $a^m_j$ with respect to the net input $n^m_j$, evaluated at $n^m_j = n^m_{j,q}$, is expressed as

$$\left. \frac{\partial a^m_j}{\partial n^m_j} \right|_{n^m_j = n^m_{j,q}} \equiv \frac{\partial a^m_{j,q}}{\partial n^m_{j,q}}. \qquad (25)$$

The derivative of $a^m$ with respect to the net input $n^m$, evaluated at $n^m = n^m_q$, is denoted by

$$\left. \frac{\partial a^m}{\partial (n^m)^T} \right|_{n^m = n^m_q} \equiv \frac{\partial a^m_q}{\partial (n^m_q)^T}. \qquad (26)$$

For batch mode, first suppose the input vectors $p_q$ for all $q = 1, 2, \ldots, Q$ are distinct. By Definition (9), we write the derivative of $a$ with respect to the input $p_r$, evaluated at $p_r = p_{r,q}$ for all $q = 1, 2, \ldots, Q$, as

$$\frac{\partial\,\mathrm{vec}\,A}{\partial p_r^T} = \begin{bmatrix} \dfrac{\partial a_1}{\partial p_{r,1}} & 0 & \cdots & 0 \\ 0 & \dfrac{\partial a_2}{\partial p_{r,2}} & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & \dfrac{\partial a_Q}{\partial p_{r,Q}} \end{bmatrix}. \qquad (27)$$

Note that the blocks off the diagonal are zero, since $p_{r,q_1}$ is not related to $p_{r,q_2}$. The derivative of $a$ with respect to the input $p$, evaluated at $p = p_q$ for all $q = 1, 2, \ldots, Q$, is denoted by

$$\frac{\partial\,\mathrm{vec}\,A}{\partial (\mathrm{vec}\,P)^T} = \begin{bmatrix} \dfrac{\partial a_1}{\partial p_1^T} & 0 & \cdots & 0 \\ 0 & \dfrac{\partial a_2}{\partial p_2^T} & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & \dfrac{\partial a_Q}{\partial p_Q^T} \end{bmatrix}, \qquad (28)$$

where, again, the blocks off the diagonal are zeros because each input vector is distinct. The derivative of $a^m$ with respect to the input $p$, evaluated at $p = p_q$ for all $q = 1, 2, \ldots, Q$, is denoted by

$$\frac{\partial\,\mathrm{vec}\,A^m}{\partial (\mathrm{vec}\,P)^T} = \begin{bmatrix} \dfrac{\partial a^m_1}{\partial p_1^T} & 0 & \cdots & 0 \\ 0 & \dfrac{\partial a^m_2}{\partial p_2^T} & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & \dfrac{\partial a^m_Q}{\partial p_Q^T} \end{bmatrix}. \qquad (29)$$

One more example: the derivative of $a^m$ with respect to the net input $n^m$, evaluated at $n^m = n^m_q$ for all $q = 1, 2, \ldots, Q$, is denoted by

$$\frac{\partial\,\mathrm{vec}\,A^m}{\partial (\mathrm{vec}\,N^m)^T} = \begin{bmatrix} \dfrac{\partial a^m_1}{\partial (n^m_1)^T} & 0 & \cdots & 0 \\ 0 & \dfrac{\partial a^m_2}{\partial (n^m_2)^T} & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & \dfrac{\partial a^m_Q}{\partial (n^m_Q)^T} \end{bmatrix}. \qquad (30)$$

In the next section, we will briefly review how neural networks have been used in function approximation problems.

Neural networks and function approximation

An MFNN can be used as a function approximator. This means that the network outputs will be estimates of the response of an unknown function $g$. Given the vector function $g$, we write $g_k$ to denote its $k$th element.
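Before moving on, the batch equations (18)-(20) can be made concrete in a few lines. The sketch below is illustrative only: the 1-2-1 network and its parameter values are made up (chosen so that the output reduces to $\tanh(p)$), and NumPy's column broadcasting plays the role of the Kronecker term $1_{1 \times Q} \otimes b^m$.

```python
import numpy as np

def forward(P, Ws, bs):
    """Batch forward pass, Eqs. (18)-(20): N^m = W^m A^{m-1} + b^m
    (broadcast over the Q columns), A^m = f^m(N^m);
    tanh hidden layers, linear output layer."""
    A = P
    for m, (W, b) in enumerate(zip(Ws, bs), start=1):
        N = W @ A + b
        A = N if m == len(Ws) else np.tanh(N)
    return A

# Hypothetical 1-2-1 network whose output works out to tanh(p).
Ws = [np.array([[1.0], [-1.0]]), np.array([[0.5, -0.5]])]
bs = [np.zeros((2, 1)), np.zeros((1, 1))]
P = np.array([[-1.0, 0.0, 1.0]])   # batch of Q = 3 inputs
A2 = forward(P, Ws, bs)
print(A2.shape)   # (1, 3)
```

By symmetry of the chosen weights, $a^2 = 0.5\tanh(p) - 0.5\tanh(-p) = \tanh(p)$ for this particular network, which gives a convenient closed form against which the forward pass can be checked.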
We will use the notation $g_{k,q}$ to denote the $k$th element of the function response to the $q$th input vector. Note that we also call $g_{k,q}$ the target value. As in the previous matrix notation, $g_q$ and $G$ are the column vector and batch mode representations, respectively.

Now, suppose we want a neural network to approximate a function $g$ mapping from a subset in $\Re^R$ to a subset in $\Re^{S^M}$. Also suppose that a set of examples was drawn from the function $g$, where an example represents a pair of function inputs and corresponding function responses, i.e., $(p_q, g_q)$. Given a sufficient number of neurons and a set of examples (the training set), an $R - S^1 - S^2 - \cdots - S^M$ neural network can be used to approximate the function $g$ over a subset in $\Re^R$. In fact, [HoSt89] theoretically showed that, with a sufficient number of neurons in the hidden layer, 2-layer networks are capable of approximating any Borel measurable function from one finite-dimensional space to another to any desired degree of accuracy, meaning that the networks are a class of universal approximators.

When a neural network is trained, its weights and biases are adjusted so as to minimize some performance index (or objective function), which usually involves the mean square error between the network outputs and the target values. An optimization method is used to find the network parameters that minimize the performance index. The combination of the performance index and the optimization method makes up the training algorithm. Many training algorithms have been developed for neural networks. However, we will only review three algorithms: Broyden-Fletcher-Goldfarb-Shanno with Early Stopping (BFGS-ES), Levenberg-Marquardt with Early Stopping (LM-ES), and the Gauss-Newton approximation to Bayesian Regularization (GNBR).

Training Algorithms

In this section, we will review three existing neural network training algorithms.
We will first discuss the BFGS-ES algorithm, followed by LM-ES, and finally the GNBR training algorithm.

Suppose we want to approximate a function $g$, which maps from a subset in $\Re^R$ to a subset in $\Re^{S^M}$, using an $R - S^1 - S^2 - \cdots - S^M$ neural network. Assume that the number of data points (or examples) in the training set is $Q$. We use the $n \times 1$ vector $x$ to denote the column vector containing all of the network parameters. The total number of network parameters (i.e., the number of elements in the vector $x$) is

$$n = S^1(R + 1) + S^2(S^1 + 1) + \cdots + S^M(S^{M-1} + 1). \qquad (31)$$

BFGS-ES training algorithm

The performance index for this algorithm is the sum of squares of the differences between the network outputs and the target values, i.e., the sum squared function error. The performance index is written as

$$J_f = \sum_{q=1}^{Q} \sum_{k=1}^{S^M} \{ a_{k,q} - g_{k,q} \}^2. \qquad (32)$$

Minimizing the performance index (with respect to the network parameters) is equivalent to forcing the neural network to approximate the function over a subset in $\Re^R$. For this training algorithm, the BFGS optimization method, see [GiMu81], will be used to minimize the performance index. The steps for the BFGS training algorithm are described in the following section. Note that we use the notation $J_f(x_k)$ and $\nabla J_f(x_k)$ to denote the performance index and the gradient of the performance index with respect to $x$, evaluated at $x_k$, respectively. The notation $\|x\|$ denotes the 2-norm of the vector $x$.

Steps for BFGS training algorithm

1. Set $k = 0$. Initialize the network parameter vector $x_0$ by the method in [NgWi90]. Present the training set to the network. Compute the gradient of the performance index with respect to the network parameters, $\nabla J_f(x_0)$. Set the initial search direction $p_0 = -\nabla J_f(x_0)$. Initialize the approximated Hessian matrix $B_0 = I_n$.

2. Minimize the performance index along the search direction, i.e., determine $\alpha_k$ such that $x_k + \alpha_k p_k$ minimizes the performance index $J_f$.

3. Set $x_{k+1} = x_k + \alpha_k p_k$, and evaluate $J_f(x_{k+1})$. Also compute the gradient $\nabla J_f(x_{k+1})$. Terminate the process if $J_f(x_{k+1})$ or $\|\nabla J_f(x_{k+1})\|$ is less than its predefined threshold.

4. Calculate the change in the gradient, $y_k = \nabla J_f(x_{k+1}) - \nabla J_f(x_k)$, and compute the new approximated Hessian matrix:

$$B_{k+1} = B_k + \frac{[\nabla J_f(x_k)][\nabla J_f(x_k)]^T}{[\nabla J_f(x_k)]^T p_k} + \frac{y_k y_k^T}{\alpha_k\, y_k^T p_k}. \qquad (33)$$

5. Compute the new search direction $p_{k+1}$, which is the solution of

$$B_{k+1} p_{k+1} = -\nabla J_f(x_{k+1}). \qquad (34)$$

6. Set $k = k + 1$ and iterate steps 2) to 6).

Note that we will use the algorithm in [NgWi90] to initialize the network parameters for all training algorithms in this research. The backpropagation process is used to compute the gradient of the performance index with respect to the network parameters, $\nabla J_f(x)$. The details of the backpropagation process for training a neural network can be found in several books, such as [HaDe96]. Note also that there are several methods for performing the line search in step 2; we will use the backtracking algorithm [DeSc83].

From Eq. (32), we can see that the performance index $J_f$ is only the sum of squared errors, and thus it is an unregularized performance index. When using an unregularized performance index, it is possible to overfit the training data and then fail to generalize. There are several available techniques that could be used to prevent overfitting. For example, [HeKr91] proposed an approach to perform weight elimination based on the magnitude of the parameters. [SiDo91] proposed a method which adds noise to the function inputs, and [Bish95] showed that this technique is equivalent to Tikhonov regularization [TiAr77]. Another well known technique, called Early Stopping (abbreviated ES), can also be used to prevent overfitting, and it is the technique we will use in this research.
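Before turning to early stopping, the BFGS iteration of steps 1) to 6) can be sketched on a generic objective. This is an illustration only: a simple Armijo backtracking rule stands in for the line search of step 2, the test problem is a made-up quadratic, and the rank-two update is exactly Eq. (33).

```python
import numpy as np

def bfgs(f, grad, x, iters=50):
    """BFGS iteration with a backtracking line search standing in for step 2."""
    B = np.eye(x.size)                 # approximated Hessian B_0 = I_n
    g = grad(x)
    p = -g                             # initial search direction p_0
    for _ in range(iters):
        # Step 2: backtracking line search (Armijo condition).
        alpha = 1.0
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p          # step 3
        g_new = grad(x_new)
        y = g_new - g                  # step 4: change in the gradient
        B = (B + np.outer(g, g) / (g @ p)
               + np.outer(y, y) / (alpha * (y @ p)))   # Eq. (33)
        p = np.linalg.solve(B, -g_new)                 # step 5: Eq. (34)
        x, g = x_new, g_new
        if np.linalg.norm(g) < 1e-10:  # step 3 termination test
            break
    return x

# Quadratic test objective J(x) = 0.5 x^T A x, minimized at the origin.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
x_min = bfgs(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, np.array([1.0, -1.0]))
print(x_min)   # close to [0, 0]
```

Note that the form of Eq. (33) relies on $B_k p_k = -\nabla J_f(x_k)$, which the code maintains by always obtaining the search direction from Eq. (34).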
Early stopping is a widely used technique to prevent overfitting by monitoring the approximation error on a set of data that is not in the training set. This set of data is called the "validation set". The approximation error on both the training and validation sets initially decreases, until at some point the error on the validation set starts to increase while the error on the training set keeps going lower. When the error on the validation set increases for a certain number of iterations, early stopping terminates the training process and returns the network parameters from the point just before the increase in the validation error occurred.

An example below illustrates how the early stopping technique works. Suppose we want to approximate the function $g(p) = \sin(\pi p)$ for $-1 \le p \le 1$, and suppose a set of data points drawn from the graph of the function was provided. We divided the set of data points into two groups: a training set and a validation set. The function and the data points in the training and validation sets are shown in the following figure.

Figure 3) Training and validation sets: the function $g(p)$ on $-1 \le p \le 1$, with the training data and validation data points marked

Now, suppose a 1-5-1 network was used to approximate the function through the BFGS training algorithm. Figure 4) shows the sum squared function error, i.e., $J_f$, on the training and validation sets at each iteration.

Figure 4) Sum squared function error $J_f$ on the training and validation sets versus iteration

From Figure 4), we can see that the validation error initially decreased as the training error decreased. However, at some point around iteration 400, the validation error started increasing, while the training error continued to decrease. In order to prevent overfitting, it is desirable to use the network trained up to iteration 400.
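The stopping rule suggested by this example can be sketched as a small helper. This is illustrative only; `patience` plays the role of the fixed number of iterations for which the validation error is allowed to rise before training is terminated, and the error curve is synthetic.

```python
def early_stopping_index(val_errors, patience):
    """Return the iteration whose parameters should be kept: the index of the
    minimum validation error, once `patience` iterations pass without improvement."""
    best_i, best = 0, float("inf")
    for i, e in enumerate(val_errors):
        if e < best:
            best_i, best = i, e
        elif i - best_i >= patience:
            break                       # validation error did not recover
    return best_i

# Synthetic validation-error curve: falls, then rises after iteration 3.
errs = [1.0, 0.5, 0.3, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6]
print(early_stopping_index(errs, patience=3))  # -> 3
```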
Therefore, if early stopping were used, it would terminate the training process at some iteration after 400 and use the network trained up to iteration 400. The validation error may fluctuate, so we do not stop the training at the first instance of an increase in the validation error. Instead, we monitor the error for a specified number of iterations to be sure that the error does not go back down. For the experiments described in this report, we monitored the error for 500 iterations after a minimum was reached before stopping the training.

We use BFGS-ES to refer to the BFGS training algorithm with early stopping. It is one of the three training algorithms we will use in this research. In the next section, we will review the second training method, which is the LM-ES algorithm.

LM-ES training algorithm

The performance index for this algorithm is the same as in the BFGS-ES training algorithm, i.e., Eq. (32). However, to minimize the performance index, the Levenberg-Marquardt optimization method [Leve44] [Marq63], abbreviated LM, will be used. The LM optimization algorithm is a combination of the Gauss-Newton algorithm and the method of gradient descent. It was designed to approximate Newton's method. To understand the concept of the LM optimization, let us begin with Newton's method. The update equation in Newton's method is

$$x_{k+1} = x_k - [\nabla^2 J_f(x_k)]^{-1} \nabla J_f(x_k), \qquad (35)$$

where $\nabla^2 J_f(x_k)$ is the Hessian matrix evaluated at $x_k$. This means we need to compute the gradient and the Hessian matrix of the performance index $J_f$. To compute the gradient and the Hessian matrix, first suppose the performance index is in the form

$$F(x) = z^T(x) z(x), \qquad (36)$$

then the $j$th element of the gradient $\nabla F(x)$ would be

$$[\nabla F(x)]_j = \frac{\partial F(x)}{\partial x_j} = 2 \sum_{i=1}^{N} z_i(x) \frac{\partial z_i(x)}{\partial x_j}, \qquad (37)$$

where $N$ is the number of elements in the vector $z(x)$. The gradient can be written in matrix form:

$$\nabla F(x) = 2 J^T(x) z(x), \qquad (38)$$

where $J(x) = \partial z(x) / \partial x^T$ is the Jacobian matrix. Next, we need to find the Hessian matrix. The $(k, j)$ element of the Hessian matrix would be

$$[\nabla^2 F(x)]_{k,j} = \frac{\partial^2 F(x)}{\partial x_k \partial x_j} = 2 \sum_{i=1}^{N} \left\{ \frac{\partial z_i(x)}{\partial x_k} \frac{\partial z_i(x)}{\partial x_j} + z_i(x) \frac{\partial^2 z_i(x)}{\partial x_k \partial x_j} \right\}. \qquad (39)$$

The Hessian matrix can then be expressed in matrix form:

$$\nabla^2 F(x) = 2 J^T(x) J(x) + 2 S(x), \qquad (40)$$

where

$$S(x) = \sum_{i=1}^{N} z_i(x) \nabla^2 z_i(x). \qquad (41)$$

If we assume $S(x)$ is small, the approximated Hessian matrix becomes

$$\nabla^2 F(x) \cong 2 J^T(x) J(x). \qquad (42)$$

By substituting Eq. (38) and Eq. (42) into Eq. (35), we obtain the Gauss-Newton method:

$$x_{k+1} = x_k - [J^T(x_k) J(x_k)]^{-1} J^T(x_k) z(x_k), \qquad (43)$$

where $z(x_k)$ and $J(x_k)$ are the vector $z(x)$ and the Jacobian matrix $J(x)$, evaluated at $x_k$, respectively.

One problem with the Gauss-Newton method is that the approximate Hessian matrix $J^T(x) J(x)$ may not be invertible. This problem can be overcome by using the matrix $\tilde H = J^T(x) J(x) + \mu I_n$ in place of the matrix $J^T(x) J(x)$, where $\mu > 0$. Increasing the parameter $\mu$ will make the matrix $\tilde H$ become positive definite, and thus invertible. This leads to the Levenberg-Marquardt algorithm:

$$x_{k+1} = x_k - [J^T(x_k) J(x_k) + \mu_k I_n]^{-1} J^T(x_k) z(x_k). \qquad (44)$$

The LM optimization algorithm requires us to calculate two terms: the vector $z(x)$ and the Jacobian matrix $J(x)$. By equating the performance indices in Eq. (36) and Eq. (32), we have the vector

$$z(x) = \mathrm{vec}\,E, \quad \text{where} \quad E = A - G, \qquad (45)$$

with $N = S^M Q$. Obtaining $z(x)$ is thus simply the feedforward propagation. Computing the Jacobian matrix $J(x)$ in neural networks is, however, more complicated, and it was originally discussed in detail in [HaMe94]. Calculating the Jacobian matrix involves the calculation of the Marquardt sensitivity [HaDe96], using the backpropagation process. The following is a summary of the LM training algorithm.

Steps for LM training algorithm

1. Set $k = 0$. Initialize the network parameter vector $x_0$. Present the training set to the network. Initialize $\mu_0$ to a small value, e.g., 0.001. Set $\mu_{max}$ to a large value, e.g., $10^{10}$. Set $\vartheta > 1$, e.g., 10.

2. Compute $z(x_k)$ and the Jacobian matrix $J(x_k)$ using Eq. (45) and Eq. (38), respectively. Also compute $J_f(x_k)$. Terminate the process if $J_f(x_k)$ or $\|J^T(x_k) z(x_k)\|$ is less than its predefined threshold, or if $\mu_k \ge \mu_{max}$.

3. While $\mu_k < \mu_{max}$, do the following:

a) Compute

$$x_{k+1}^{temp} = x_k - [J^T(x_k) J(x_k) + \mu_k I_n]^{-1} J^T(x_k) z(x_k). \qquad (46)$$

b) Compute $z(x_{k+1}^{temp})$ and the Jacobian matrix $J(x_{k+1}^{temp})$. Compute $J_f(x_{k+1}^{temp})$.

c) If $J_f(x_{k+1}^{temp}) \ge J_f(x_k)$, set $\mu_k = \vartheta \mu_k$ and go back to step 3. Otherwise, go to the next step.

d) If $J_f(x_{k+1}^{temp}) < J_f(x_k)$, set $\mu_{k+1} = \mu_k / \vartheta$ and set $x_{k+1} = x_{k+1}^{temp}$. Then, set $k = k + 1$ and go back to step 2.

4. Terminate the process (in this case it is due to $\mu_k \ge \mu_{max}$).

As the performance index $J_f$ is only the sum of the squared function errors, early stopping will again be used to prevent overfitting. We will denote this algorithm, which minimizes the performance index $J_f$ using the LM optimization with early stopping, as the LM-ES method. In the next section, we will review the Gauss-Newton approximation to Bayesian Regularization (GNBR) training algorithm.

GNBR training algorithm

Unlike the performance index for the BFGS-ES and the LM-ES training algorithms, the performance index of the GNBR training method is a regularized performance index, as shown below:

$$F = \beta \sum_{q=1}^{Q} \sum_{k=1}^{S^M} (g_{k,q} - a_{k,q})^2 + \alpha \sum_{i=1}^{n} x_i^2, \qquad (47)$$

where $x_i$ is a network weight or bias, $n$ is the total number of weights and biases, and $\beta$ and $\alpha$ are scalar values weighting the importance of the two terms. The regularization term in the performance index is $\sum_{i=1}^{n} x_i^2$, which is the sum of squares of the network weights and biases. Regularization is another technique used to prevent overfitting. Although its purpose is the same as that of early stopping, it works in a different way.
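Both the unregularized index of Eq. (32) and the regularized index of Eq. (47) are straightforward to evaluate once the network outputs are available. The sketch below is illustrative only; the outputs, targets, parameter vector, and weighting factors are made-up values.

```python
import numpy as np

def perf_indices(A, G, x, beta=1.0, alpha=0.01):
    """Sum squared function error J_f (Eq. 32) and the regularized
    index F = beta*E_D + alpha*E_W (Eq. 47)."""
    E_D = np.sum((A - G) ** 2)      # sum over outputs k and examples q
    E_W = np.sum(x ** 2)            # sum of squared weights and biases
    return E_D, beta * E_D + alpha * E_W

A = np.array([[0.9, 1.8]])          # network outputs, S^M = 1, Q = 2
G = np.array([[1.0, 2.0]])          # targets
x = np.array([0.5, -0.5, 1.0])      # parameter vector (illustrative)
Jf, F = perf_indices(A, G, x)
print(Jf, F)   # 0.05, 0.05 + 0.01*1.5 = 0.065
```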
The regularized performance index forces the magnitudes of the network parameters to be small, which causes the network output to be smooth. That is, regularization prevents steep fluctuations in the network response.

A problem in using the regularized performance index is that the weighting factors $\beta$ and $\alpha$ are unknown and problem-dependent. If $\alpha$ is too small relative to $\beta$, then we would still observe overfitting, as there is too little impact from the regularization term. In contrast, if $\alpha$ is too large relative to $\beta$, then the network output would be too smooth and would not approximate the function. To overcome this problem, David J. C. MacKay proposed a method in [MacK92] that uses Bayes' rule to automatically choose the weighting factors to balance the function approximation capability against the smoothness of the network output. [FoHa97] later combined MacKay's method with the Levenberg-Marquardt framework to obtain the Gauss-Newton Approximation to Bayesian Regularization, denoted GNBR. The concept of this algorithm can be explained in two parts: first, minimizing the performance index in Eq. (47), and second, choosing the values of $\beta$ and $\alpha$.

From Eq. (36) and Eq. (47), we can rewrite the performance index as

$$F = \beta z^T(x) z(x) + \alpha x^T x = \beta E_D + \alpha E_W. \qquad (48)$$

Now, from Eq. (38), the gradient of the performance index can be written

$$\nabla F = \beta \nabla E_D + \alpha \nabla E_W = 2\beta J_D^T(x) z(x) + 2\alpha J_E^T(x) x. \qquad (49)$$

Since $E_D$ is in fact $J_f$ in Eq. (32), the Jacobian matrix $J_D(x) = J(x)$, which can be obtained by the same backpropagation process as in the Levenberg-Marquardt training algorithm. Since the Jacobian matrix $J_E(x) = \partial x / \partial x^T$, this becomes $J_E(x) = I_n$. Therefore, Eq. (49) turns out to be

$$\nabla F = 2\beta J^T(x) z(x) + 2\alpha x. \qquad (50)$$

Using Eq. (42) and Eq. (49), the Hessian can be approximated and written in matrix form:

$$\nabla^2 F(x) = \beta \nabla^2 E_D + \alpha \nabla^2 E_W \cong 2\beta J^T(x) J(x) + 2\alpha I_n. \qquad (51)$$

By substituting Eq. (50) and Eq. (51) into Eq. (35), the update equation becomes

$$x_{k+1} = x_k - [\beta_k J^T(x_k) J(x_k) + (\alpha_k + \mu_k) I_n]^{-1} [\beta_k J^T(x_k) z(x_k) + \alpha_k x_k]. \qquad (52)$$

Now we have an update equation for the performance index in Eq. (47). Next, we will explain MacKay's method for choosing the values of $\beta$ and $\alpha$.

Under the Bayesian framework of MacKay [MacK92], the network parameters $x$ are considered random variables. The posterior density of the network parameters can be written according to Bayes' rule:

$$P(x \mid D, \beta, \alpha, M) = \frac{P(D \mid x, \beta, M)\, P(x \mid \alpha, M)}{P(D \mid \beta, \alpha, M)}, \qquad (53)$$

where $D$ represents the data set presented to the network, and $M$ is the network model. The term $P(D \mid x, \beta, M)$ is the likelihood function, the term $P(x \mid \alpha, M)$ is the prior density, and the term $P(D \mid \beta, \alpha, M)$ is named the evidence. The parameter $\beta$ is related to the variance of the likelihood, and $\alpha$ is related to the variance of the prior density. If the noise in the model output and the prior density are assumed to follow Gaussian distributions, then the likelihood function and the prior density become

$$P(D \mid x, \beta, M) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D), \quad \text{and} \qquad (54)$$

$$P(x \mid \alpha, M) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W), \qquad (55)$$

respectively. The terms $Z_D(\beta) = (\pi / \beta)^{N/2}$ and $Z_W(\alpha) = (\pi / \alpha)^{n/2}$, where $N = S^M Q$. Considering the evidence as a normalization factor, and substituting Eq. (54) and Eq. (55) into Eq. (53), we obtain the posterior density

$$P(x \mid D, \beta, \alpha, M) = \frac{1}{Z_F(\beta, \alpha)} \exp(-F(x)). \qquad (56)$$

From Eq. (56), we can see that maximizing the posterior density is equivalent to minimizing the regularized objective function $F$.

Now, given the data, we are interested in what the values of $\beta$ and $\alpha$ should be. This means we need to consider $P(\beta, \alpha \mid D, M)$. By using Bayes' rule, it becomes

$$P(\beta, \alpha \mid D, M) = \frac{P(D \mid \beta, \alpha, M)\, P(\beta, \alpha \mid M)}{P(D \mid M)}. \qquad (57)$$

Suppose the prior density $P(\beta, \alpha \mid M)$ is uniform, which corresponds to the statement that we do not know what the values of $\beta$ and $\alpha$ should be. Then, maximizing the posterior $P(\beta, \alpha \mid D, M)$ is equivalent to maximizing the likelihood function $P(D \mid \beta, \alpha, M)$. However, note that the likelihood function in Eq. (57) is the evidence, which is the normalization factor, in Eq. (53). Therefore, the evidence in Eq. (53) can be solved for:

$$P(D \mid \beta, \alpha, M) = \frac{P(D \mid x, \beta, M)\, P(x \mid \alpha, M)}{P(x \mid D, \beta, \alpha, M)}. \qquad (58)$$

By substituting Eq. (54) to Eq. (56) into Eq. (58), we obtain

$$P(D \mid \beta, \alpha, M) = \frac{Z_F(\beta, \alpha)}{Z_D(\beta)\, Z_W(\alpha)}. \qquad (59)$$

From Eq. (59), the only term we do not know is $Z_F(\beta, \alpha)$. However, we can estimate it from a Taylor series expansion, by assuming that the objective function has a quadratic shape in a small region around the minimum point $x^{MP}$. At the minimum point, the gradient of the function is zero. Thus, the objective function approximated around $x^{MP}$ is written as

$$F(x) \cong F(x^{MP}) + \frac{1}{2} (x - x^{MP})^T H^{MP} (x - x^{MP}), \qquad (60)$$

where $H = \beta \nabla^2 E_D + \alpha \nabla^2 E_W$ is the Hessian matrix of $F(x)$, and $H^{MP}$ is the Hessian matrix evaluated at $x^{MP}$. Therefore, the posterior density in Eq. (56) can be written as

$$P(x \mid D, \beta, \alpha, M) \cong \frac{1}{Z_F(\beta, \alpha)} \exp\left( -\left\{ F(x^{MP}) + \frac{1}{2} (x - x^{MP})^T H^{MP} (x - x^{MP}) \right\} \right), \qquad (61)$$

which is rewritten as

$$P(x \mid D, \beta, \alpha, M) \cong \left\{ \frac{1}{Z_F(\beta, \alpha)} \exp(-F(x^{MP})) \right\} \exp\left( -\frac{1}{2} (x - x^{MP})^T H^{MP} (x - x^{MP}) \right). \qquad (62)$$

The multivariate Gaussian density with mean $x^{MP}$ and covariance matrix $(H^{MP})^{-1}$ is expressed as

$$P(x) = \frac{1}{(2\pi)^{n/2} \left| (H^{MP})^{-1} \right|^{1/2}} \exp\left( -\frac{1}{2} (x - x^{MP})^T H^{MP} (x - x^{MP}) \right). \qquad (63)$$

By equating Eq. (62) and Eq. (63), we can solve for $Z_F(\beta, \alpha)$:

$$Z_F(\beta, \alpha) \cong \left[ (2\pi)^n \left| (H^{MP})^{-1} \right| \right]^{1/2} \exp(-F(x^{MP})). \qquad (64)$$

Substituting Eq. (64) into Eq. (59), along with the values of $Z_D(\beta)$ and $Z_W(\alpha)$ from Eq. (54) and Eq. (55), taking the derivative of the log of Eq. (59) with respect to each of $\beta$ and $\alpha$, and setting the derivatives to zero, produces

$$\beta^{MP} = \frac{N - \gamma}{2 E_D(x^{MP})} \quad \text{and} \quad \alpha^{MP} = \frac{\gamma}{2 E_W(x^{MP})}, \qquad (65)$$

where $\gamma = n - 2\alpha^{MP}\, \mathrm{tr}(H^{MP})^{-1}$ is called the effective number of parameters. The parameter $\gamma$ is a measure of how many parameters in the network are effectively used in reducing the error function, and it can range from zero to $n$.
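The reestimation formulas of Eq. (65) can be sketched in a few lines. This is illustrative only: a hypothetical approximate Hessian and made-up values of $E_D$, $E_W$, $N$, and $n$ are supplied directly, rather than being computed from a network Jacobian.

```python
import numpy as np

def reestimate(H, alpha, E_D, E_W, N, n):
    """Eq. (65): effective number of parameters and new beta, alpha."""
    gamma = n - 2.0 * alpha * np.trace(np.linalg.inv(H))
    beta_new = (N - gamma) / (2.0 * E_D)
    alpha_new = gamma / (2.0 * E_W)
    return gamma, beta_new, alpha_new

# Illustrative values: 3 parameters, 10 data errors.
H = np.diag([4.0, 2.0, 1.0])      # stand-in for 2*beta*J^T J + 2*alpha*I_n
gamma, beta, alpha = reestimate(H, alpha=0.1, E_D=0.5, E_W=2.0, N=10, n=3)
print(round(gamma, 3))            # 3 - 0.2*(0.25 + 0.5 + 1.0) = 2.65
```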
Since the estimation of $\beta$ and $\alpha$ requires the calculation of the Hessian matrix of the performance index $F(x)$ at the minimum point $x^{MP}$, [FoHa97] proposed using the approximated Hessian readily available under the Levenberg-Marquardt framework, i.e., Eq. (51), leading to the GNBR training algorithm. The algorithm is summarized below.

Steps for GNBR training algorithm

1. Initialization: same as for the Levenberg-Marquardt training algorithm. Initialize $\alpha_0 = 0$ and $\beta_0 = 1$.

2. Take one step of the LM training algorithm to minimize the objective function in Eq. (47), by using Eq. (52).

3. Compute the effective number of parameters $\gamma_k = n - 2\alpha_k\, \mathrm{tr}(H_k)^{-1}$, where the Hessian matrix $H$ can be approximated by Eq. (51), and $H_k$ is the Hessian evaluated at $x_k$.

4. Compute the new estimates for $\beta$ and $\alpha$:

$$\beta_{k+1} = \frac{N - \gamma_k}{2 E_D(x_k)} \quad \text{and} \quad \alpha_{k+1} = \frac{\gamma_k}{2 E_W(x_k)}. \qquad (66)$$

Then, set $k = k + 1$.

5. Iterate steps 2) to 5) until $\mu_k \ge \mu_{max}$, or until $F(x_k)$ is less than its predefined threshold.

As the GNBR algorithm penalizes the magnitudes of the network parameters as a technique to reduce the overfitting problem, we will not use a validation data set in this algorithm.

Recall that the goal of this research is to approximate both a function and its first-order derivatives using neural networks. [HoSt89] showed that multilayer feedforward neural networks can approximate any Borel measurable function. In the next section, a theoretical discussion of function and derivative approximation with neural networks will be reviewed.

Conditions for Function and Derivative Approximation

Recall that our goal is to approximate both a function and its first-order derivatives. This section will briefly discuss the theoretical conditions under which neural networks can simultaneously approximate both a function and its derivatives. There are several discussions of these conditions, such as [HoSt90], [Horn91], [Ito93] and [Pink99]; however, we will follow [Li96].

We first introduce notation. We let $Z_+^R$ denote the lattice of nonnegative multi-integers in $\Re^R$. For $m = (m_1, m_2, \ldots, m_R) \in Z_+^R$, we set $|m| = m_1 + m_2 + \cdots + m_R$ and

$$D^m = \frac{\partial^{|m|}}{\partial p_1^{m_1}\, \partial p_2^{m_2} \cdots \partial p_R^{m_R}}. \qquad (67)$$

We also write $m^1 \le m^2$ if $m^1_r \le m^2_r$ for all $r = 1, 2, \ldots, R$. Given an open set $\Omega$ of $\Re^R$ (possibly $\Omega = \Re^R$), we write $C^m(\Omega)$ to denote the set consisting of functions with continuous $k$th-order partial derivatives in $\Omega$, for $k \in Z_+^R$ and $k \le m$. We write $f \in \hat C^m(K)$ to imply that, for a compact set $K$ of $\Re^R$, there is an open set $\Omega$ such that $K \subset \Omega$ and $f \in C^m(\Omega)$. We write $M$ to denote a 2-layer neural network with network output $a$, transfer function $f^1$ in the hidden layer and a linear transfer function in the output layer, i.e., $f^2$ is linear.

Given a compact subset $K$ of $\Re^R$, and a function $g \in \hat C^m(K)$ for $m \in Z_+^R$, [Li96] showed that if $f^1 \in C^{|m|}(\Re)$ and $f^1$ is not a polynomial, then $D^k a$ of a network $M$ can uniformly and simultaneously approximate $D^k g$, for $k \in Z_+^R$ and $k \le m$. For example, [Li96] considered a function on $\Re^2$, given by: $g(p_1, p_2) = (p_1 p_2)^{11/3} \sin[1 / (p_1 p_2)]$ if $p_1, p_2 \ne 0$; otherwise $g(p_1, p_2) = 0$. We can verify that $\partial^2 g / \partial p_1^2$ and $\partial^2 g / \partial p_2^2$ are discontinuous at the origin $(0, 0)$; however, $g$, $\partial g / \partial p_1$, $\partial g / \partial p_2$ and $\partial^2 g / \partial p_1 \partial p_2$ are continuous on $\Re^2$. Therefore, $g \in C^{(1,1)}(\Re^2)$. In this situation, a network $M$ can simultaneously and uniformly approximate $g$, $\partial g / \partial p_1$, $\partial g / \partial p_2$ and $\partial^2 g / \partial p_1 \partial p_2$ on any compact set $K$ containing $(0, 0)$, provided that the transfer function $f^1$ in the hidden layer is nonpolynomial and $f^1 \in C^2(\Re)$.

The typical transfer functions in the hidden layers and the output layer of neural networks used for function approximation are sigmoid and linear functions, respectively. Sigmoid functions are nonpolynomial and they are of class $C^\infty$ (infinitely differentiable, or smooth, functions), i.e., all derivative orders are continuous.
Therefore, given that the function and its first-order partial derivatives exist and are continuous, the standard two-layer neural network has the ability to simultaneously approximate the function and its first-order derivatives.

Next, we will introduce notation for first-order derivative values. Similar to the notation defined in Eq. (21) to Eq. (24), the derivative of the scalar function $g_k$ with respect to $p_r$ and with respect to $p$, evaluated at $p_r = p_{r,q}$ and $p = p_q$, is written, respectively,

$$\left. \frac{\partial g_k}{\partial p_r} \right|_{p_r = p_{r,q}} \equiv \frac{\partial g_{k,q}}{\partial p_{r,q}}, \qquad \left. \frac{\partial g_k}{\partial p^T} \right|_{p = p_q} \equiv \frac{\partial g_{k,q}}{\partial p_q^T}. \qquad (68)$$

The derivative of the vector function $g$ with respect to $p_r$ and $p$, evaluated at $p_r = p_{r,q}$ and $p = p_q$, is denoted, respectively, by

$$\left. \frac{\partial g}{\partial p_r} \right|_{p_r = p_{r,q}} \equiv \frac{\partial g_q}{\partial p_{r,q}}, \qquad \left. \frac{\partial g}{\partial p^T} \right|_{p = p_q} \equiv \frac{\partial g_q}{\partial p_q^T}. \qquad (69)$$

For batch mode, similar to Eq. (27) and Eq. (28), we will use the following notation:

$$\frac{\partial\,\mathrm{vec}\,G}{\partial p_r^T} = \begin{bmatrix} \dfrac{\partial g_1}{\partial p_{r,1}} & 0 & \cdots & 0 \\ 0 & \dfrac{\partial g_2}{\partial p_{r,2}} & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & \dfrac{\partial g_Q}{\partial p_{r,Q}} \end{bmatrix} \quad \text{and} \quad \frac{\partial\,\mathrm{vec}\,G}{\partial (\mathrm{vec}\,P)^T} = \begin{bmatrix} \dfrac{\partial g_1}{\partial p_1^T} & 0 & \cdots & 0 \\ 0 & \dfrac{\partial g_2}{\partial p_2^T} & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & \dfrac{\partial g_Q}{\partial p_Q^T} \end{bmatrix}. \qquad (70)$$

Summary

In this chapter, we first stated the objective of the research: to develop procedures for approximating functions and their derivatives. Then, we introduced the operators and notation that will be used throughout the research. Notation and background material for the multilayer feedforward neural network were also provided.

A neural network learns to approximate a function through an optimization process. The combination of an objective function and an optimization process defines a distinct training algorithm. We discussed three existing training algorithms: BFGS-ES, LM-ES and GNBR. Each is capable of forcing a neural network to approximate a function in a different way. The BFGS-ES and LM-ES methods use early stopping as a technique to prevent overfitting, while the GNBR algorithm uses the Bayesian regularizer.
Finally, the conditions under which a neural network can uniformly and simultaneously approximate a function and its derivatives were discussed. One of the conditions requires that the function and its first-order derivatives be continuous. A second condition requires that the transfer function in the hidden layer of the neural network be sufficiently differentiable but not be a polynomial. For example, the typical sigmoid transfer function would be satisfactory.

CHAPTER 3

VALIDATION-RELATED METHODS

Introduction

Recall that our objective is to use a neural network to approximate a function and its first-order derivatives. In this chapter, we will discuss two simple methods for accomplishing this task. These methods are similar to the standard training algorithms discussed in Chapter 2, but some modifications are applied to the early stopping technique. As the Levenberg–Marquardt optimization is chosen to work with these two modified early stopping techniques, the two proposed methods are:

1. Modified validation performance measure (LM–ES1), and
2. Early stopping using the derivative information on the training set (LM–ES2).

As we describe each method, the idea behind it will be discussed. Then, the simulation results for each method, based on the benchmark tests that are described in Chapter 7, will be shown. The results in this chapter can be compared with the simulation results obtained using the standard early stopping technique, which are shown in Chapter 7.

Modified validation performance measure

For the standard early stopping method (discussed in Chapter 2), the validation performance is measured by the sum squared function error.
We proposed using a combination of the sum squared function error and the sum squared derivative error. In other words, the validation performance will be determined by

$$E_V^d = \sum_{q_V=1}^{Q_V}\sum_{k=1}^{S^M}(a_{k,q_V} - g_{k,q_V})^2 + \rho_d \sum_{q_V=1}^{Q_V}\sum_{k=1}^{S^M}\sum_{r=1}^{R}\left(\frac{\partial a_{k,q_V}}{\partial p_{r,q_V}} - \frac{\partial g_{k,q_V}}{\partial p_{r,q_V}}\right)^2, \qquad (71)$$

where $Q_V$ is the number of examples in the validation set, and $\rho_d$ is a scalar factor. If the validation error increases for a certain number of iterations, then the training will be terminated. The unknown parameter in the equation is $\rho_d$, which we will vary to investigate its consequences for the approximation accuracy of neural networks. Another purpose of this parameter is to account for the fact that the scale of the derivative values can be much different from that of the function values.

The procedure for training a neural network using this method is exactly the same as the standard training algorithm with the standard early stopping technique (e.g. BFGS–ES or LM–ES), with two exceptions. First, we need to compute the derivative of the network output with respect to the network inputs, i.e. $\partial\mathbf{a}_q/\partial\mathbf{p}_q^T$ for all $q = 1, 2, \ldots, Q_V$; this calculation is shown in Chapter 4. Second, the performance measure in the standard early stopping technique is now replaced by the new measure $E_V^d$ in Eq. (71). In this research, we choose LM–ES to work with the modified validation performance measure. We denote this method LM–ES1. Note that when $\rho_d = 0$, LM–ES1 is LM–ES.

The approximation accuracy obtained from LM–ES1 for the simple analytic functions (which are introduced in Chapter 7) is shown in the next section. Note again that the procedure used to perform the simulations is described in Chapter 6.

Simulation results

The approximation accuracy obtained from LM–ES1 for the simple analytic functions is shown in the following tables.
The results in Table 1, Table 2 and Table 3 can be compared with the results obtained from the other algorithms in Table 9, Table 10 and Table 11 in Chapter 7, respectively. For ease of reference, we include the results obtained from LM–ES with the standard early stopping (i.e. $\rho_d = 0$). Note that $RMSE_F^{md}$ and $RMSE_D^{md}$ are defined in the simulation procedure in Chapter 7.

Table 1 Approximation accuracy of LM–ES1 on problem 1

  rho_d     RMSE_F^md             RMSE_D^md
            Training    Test      Training    Test
  0         2.48E-06    8.90E-04  6.42E-03    1.74E-02
  1E-06     2.48E-06    8.90E-04  6.42E-03    1.77E-02
  1E-04     2.50E-06    8.90E-04  6.42E-03    1.74E-02
  1E-02     2.52E-06    8.90E-04  5.71E-03    1.38E-02
  1E+00     3.00E-06    9.96E-04  6.51E-03    1.73E-02
  1E+02     3.00E-06    9.96E-04  6.51E-03    1.73E-02

From these results, we can see that the approximation accuracy obtained from LM–ES1 is similar to the results from LM–ES. We conclude that adding the derivative error term to the standard validation performance measure does not improve the approximation accuracy. In the next section, the simulation results obtained from the LM training algorithm with early stopping using the derivative information of the training set, i.e. LM–ES2, will be discussed.
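For reference, the modified validation measure of Eq. (71) is simple to implement once the network's input derivatives are available. A minimal NumPy sketch (the function and array names below are our own illustration, not code from this research):

```python
import numpy as np

def modified_validation_measure(a, g, da, dg, rho_d):
    """Modified validation measure E_V^d of Eq. (71).

    a, g   : S^M x Q_V arrays of network outputs / targets on the validation set
    da, dg : S^M x R x Q_V arrays of first derivatives with respect to the inputs
    rho_d  : scalar factor weighting the derivative error term
    """
    function_error = np.sum((a - g) ** 2)      # sum squared function error
    derivative_error = np.sum((da - dg) ** 2)  # sum squared derivative error
    return function_error + rho_d * derivative_error
```

With `rho_d = 0` the measure reduces to the standard early stopping measure; training would be terminated once this value has increased for a preset number of consecutive iterations.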
Table 2 Approximation accuracy of LM–ES1 on problem 2

  rho_d     RMSE_F^md             RMSE_D^md
            Training    Test      Training    Test
  0         1.14E-05    1.16E-02  1.09E+00    1.94E+00
  1E-06     1.14E-05    1.14E-02  1.09E+00    1.94E+00
  1E-04     4.18E-07    7.44E-03  5.52E-01    1.45E+00
  1E-02     9.04E-07    8.67E-03  9.69E-01    1.54E+00
  1E+00     9.04E-07    8.67E-03  9.69E-01    1.54E+00
  1E+02     9.04E-07    8.67E-03  9.69E-01    1.54E+00

Table 3 Approximation accuracy of LM–ES1 on problem 3

  rho_d     RMSE_F^md             RMSE_D^md
            Training    Test      Training    Test
  0         6.72E-06    2.09E-04  5.67E-03    1.12E-02
  1E-06     6.72E-06    2.09E-04  5.67E-03    1.12E-02
  1E-04     9.93E-06    2.80E-04  1.41E-02    2.69E-02
  1E-02     1.14E-05    2.96E-04  1.38E-02    2.69E-02
  1E+00     1.14E-05    2.97E-04  1.38E-02    2.69E-02
  1E+02     1.14E-05    2.97E-04  1.38E-02    2.69E-02

Early stopping using the derivative information of the training set

In regular early stopping, the training process will be terminated when the validation performance increases. Recall that the standard validation performance measure is

$$E_V = \sum_{q_V=1}^{Q_V}\sum_{k=1}^{S^M}(a_{k,q_V} - g_{k,q_V})^2. \qquad (72)$$

The function error term $e_{k,q_V} \equiv a_{k,q_V} - g_{k,q_V}$ can be approximated by the first-order Taylor series expansion at the point $\mathbf{p}_{t_V}$, and this becomes:

$$e_{k,q_V} \approx e_{k,t_V} + \sum_{r=1}^{R}\frac{\partial e_{k,t_V}}{\partial p_{r,t_V}}(p_{r,t_V} - p_{r,q_V}), \quad\text{where}\quad \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}} = \frac{\partial a_{k,t_V}}{\partial p_{r,t_V}} - \frac{\partial g_{k,t_V}}{\partial p_{r,t_V}}, \qquad (73)$$

where $\mathbf{p}_{t_V}$ is the nearest training point to $\mathbf{p}_{q_V}$. By inserting Eq. (73) into Eq. (72), we obtain

$$E_V \approx \sum_{q_V}\sum_{k}\left\{ e_{k,t_V}^2 + 2e_{k,t_V}\left(\sum_r \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}}(p_{r,t_V} - p_{r,q_V})\right) + \left(\sum_r \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}}(p_{r,t_V} - p_{r,q_V})\right)^2 \right\}. \qquad (74)$$

We may ignore the second term in Eq. (74) if we assume an equal distribution around zero of the training error terms, i.e. $e_{k,t_V}$ and $\partial e_{k,t_V}/\partial p_{r,t_V}$ for all $k = 1, 2, \ldots, S^M$, $t_V = 1, 2, \ldots, Q_V$ and $r = 1, 2, \ldots, R$. Thus the sum of these error terms is approximately zero. The third term in Eq. (74) can be further expanded:

$$\left(\sum_r \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}}(p_{r,t_V} - p_{r,q_V})\right)^2 = \sum_r \left(\frac{\partial e_{k,t_V}}{\partial p_{r,t_V}}(p_{r,t_V} - p_{r,q_V})\right)^2 + \sum_r\sum_{r'\ne r}\left(\frac{\partial e_{k,t_V}}{\partial p_{r,t_V}}(p_{r,t_V} - p_{r,q_V})\right)\left(\frac{\partial e_{k,t_V}}{\partial p_{r',t_V}}(p_{r',t_V} - p_{r',q_V})\right). \qquad (75)$$
From Eq. (75), the validation performance measure could then be approximated by:

$$E_V \approx \sum_{q_V}\sum_k e_{k,t_V}^2 + \sum_{q_V}\sum_k\sum_r \left(\frac{\partial e_{k,t_V}}{\partial p_{r,t_V}}(p_{r,t_V} - p_{r,q_V})\right)^2 + \sum_{q_V}\sum_k\sum_r\sum_{r'\ne r}\left(\frac{\partial e_{k,t_V}}{\partial p_{r,t_V}}(p_{r,t_V} - p_{r,q_V})\right)\left(\frac{\partial e_{k,t_V}}{\partial p_{r',t_V}}(p_{r',t_V} - p_{r',q_V})\right). \qquad (76)$$

It should be noted that the differences of the two inputs, i.e. $(p_{r,t_V} - p_{r,q_V})$ for all $r = 1, 2, \ldots, R$, are constant. In addition, by the assumption of an equal distribution around zero of the training error terms, the last term in Eq. (76) is then approximately zero. Thus, Eq. (76) reduces to

$$E_V \approx \sum_{q_V}\sum_k e_{k,t_V}^2 + \sum_{q_V}\sum_k\sum_r \left(\frac{\partial e_{k,t_V}}{\partial p_{r,t_V}}(p_{r,t_V} - p_{r,q_V})\right)^2. \qquad (77)$$

Again, since the differences of the two inputs are constant, the validation performance measure $E_V$ is proportional to:

$$E_V \propto \sum_{t_V}\sum_k e_{k,t_V}^2 + \sum_{t_V}\sum_k\sum_r \left(\frac{\partial e_{k,t_V}}{\partial p_{r,t_V}}\right)^2, \qquad (78)$$

where the summation index $q_V = 1, 2, \ldots, Q_V$ can be replaced by $t_V = 1, 2, \ldots, Q_V$. From Eq. (78), we can see that the first term gets smaller while the network is being trained. Therefore, the increase in the performance measure $E_V$ may be estimated by the increase of the second term, which is the sum squared derivative error on the training set. This seems to make sense, as large derivative errors in the training data will result in large function errors in the validation data. Therefore, if we define a new validation performance measure whose value depends only on the derivative error of the training set, its increase should imply overfitting and we would then want to terminate the training process.
The new validation measure is written as

$$\tilde{E}_V = \sum_{q=1}^{Q}\sum_{k=1}^{S^M}\sum_{r=1}^{R}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right)^2, \qquad (79)$$

where $Q$ is the number of examples available, i.e. the number of data in the training and validation sets combined. Since the new validation measure uses the derivative information of all the data, we can use all the function information for training. This means that we use all the data in the training process, while overfitting is prevented by using the new validation measure $\tilde{E}_V$. An advantage of having more training examples is better generalization; see [GaWh92] and [AtPa97]. The following summarizes the method.

Steps for early stopping using the derivative information of the training set

1. Change the validation performance measure for early stopping so that it follows Eq. (79).
2. Use all of the available data for the training process, which can be performed by any standard training algorithm, e.g. BFGS or LM, etc. In other words, there is no data division into training and validation sets.
3. The training process is terminated when the new validation measure, i.e. Eq. (79), consecutively increases for a certain number of iterations.

We choose the Levenberg–Marquardt training algorithm (LM) to work with this modified validation performance measure. We denote the method LM–ES2. The method requires the calculation of the derivative of the network output with respect to the network inputs ($\partial\mathbf{a}_q/\partial\mathbf{p}_q^T$ for all $q = 1, 2, \ldots, Q$), and the calculation is shown in Chapter 4. In the next section, we will provide the simulation results obtained from LM–ES2 on the simple analytic problems (introduced in Chapter 7). Recall again that the procedure used to perform the simulations is provided in Chapter 7.

Simulation results

In this section, the simulation results for LM–ES2 on the simple analytic problems are presented in Table 4, Table 5 and Table 6. For ease of reference, we also include the results obtained from LM–ES in the tables.
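The measure and stopping test in the steps above can be sketched as follows; the function names, the `patience` parameter and the array shapes are our own illustration of the idea, not code from this research:

```python
import numpy as np

def derivative_validation_measure(da, dg):
    """E~_V of Eq. (79): sum of squared derivative errors over all Q examples.

    da, dg : S^M x R x Q arrays of network / target first derivatives.
    """
    return np.sum((da - dg) ** 2)

def should_stop(history, patience):
    """True once the measure has increased for `patience` consecutive iterations.

    history : list of E~_V values, one per training iteration.
    """
    if len(history) <= patience:
        return False
    tail = history[-(patience + 1):]
    # every step in the tail must be a strict increase
    return all(tail[i + 1] > tail[i] for i in range(patience))
```

In a training loop, `derivative_validation_measure` would be evaluated after each Levenberg–Marquardt iteration and appended to `history`; the loop terminates when `should_stop` returns true.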
The results in Table 4, Table 5 and Table 6 can be compared with Table 1, Table 2 and Table 3 for LM–ES1, and with Table 9, Table 10 and Table 11 in Chapter 7 for the other training algorithms.

Table 4 Approximation accuracy on problem 1

  Training     RMSE_F^md             RMSE_D^md
  Algorithm    Training    Test      Training    Test
  LM-ES        2.48E-06    8.90E-04  6.42E-03    1.74E-02
  LM-ES2       2.24E-05    3.67E-04  4.78E-03    8.43E-03

Table 5 Approximation accuracy on problem 2

  Training     RMSE_F^md             RMSE_D^md
  Algorithm    Training    Test      Training    Test
  LM-ES        1.14E-05    1.16E-02  1.09E+00    1.94E+00
  LM-ES2       1.76E-05    6.44E-03  2.40E-01    8.23E-01

Table 6 Approximation accuracy on problem 3

  Training     RMSE_F^md             RMSE_D^md
  Algorithm    Training    Test      Training    Test
  LM-ES        6.72E-06    2.09E-04  5.67E-03    1.12E-02
  LM-ES2       1.16E-05    1.82E-04  7.66E-03    1.41E-02

From the results, we can see that LM–ES2 seems to provide better results than the standard LM–ES algorithm for both the function and the derivatives on problem 1 and problem 2. However, that is not the case for problem 3, where the derivative error obtained from LM–ES2 was larger than that from LM–ES. We will analyze the algorithm in the following section, in order to understand why LM–ES2 does not consistently yield better results than LM–ES, even though the training set for LM–ES2 is relatively larger than that for LM–ES.

Algorithm analysis

Recall that, for LM–ES2, overfitting is prevented by terminating the training process once the derivative error over all the training data increases. An advantage of this method is that the number of examples in the training process increases, by moving the examples in the validation set into the training set. The increase in the number of training examples should lead to better generalization (see [GaWh92] and [AtPa97]).

We analyzed the method by training a network with the LM–ES algorithm. However, we also monitored the values of $E_V$ (see Eq. (72)) and $\tilde{E}_V$ (see Eq. (79)) during the optimization process.
The following figure shows the training records.

[Figure 5. A training record showing $\tilde{E}_V$ versus $E_V$ (errors on a logarithmic scale, roughly $10^{-8}$ to $10^{4}$, over about 200 iterations).]

From Figure 5, we can see that the derivative error of the training set, $\tilde{E}_V$, increased (at around the 125th iteration) sooner than the validation performance measure $E_V$ did. This means that an increase in $\tilde{E}_V$ does not directly imply an increase in $E_V$. This contradicts the assumption we made in the previous section, where we assumed that the increase in $E_V$ could be estimated by the increase in $\tilde{E}_V$.

The reason behind this lies in Eq. (78). In Eq. (78), we can see that the regular validation measure $E_V$ is proportional to two terms: the first is the squared training function error, and $\tilde{E}_V$ is the second. This means that even though the value of $\tilde{E}_V$ increases, if the reduction in the training function error counts for more, the regular validation measure will still be lower. From this fact, we conclude that the regular validation measure cannot be exactly estimated by the training derivative error, since one more factor (i.e. the training function error) also counts.

Summary

Recall that our objective is to approximate a function and its first-order derivatives using neural networks. Two new methods were proposed in this chapter: LM–ES1 and LM–ES2. These methods are similar to the standard training algorithms; however, changes were made in the early stopping technique. In the LM–ES1 method, the validation performance measure was changed from the squared function error to a combination of the squared function and derivative errors. The simulation results showed that the effectiveness of this new validation measure is similar to that of the standard early stopping. In the second method (LM–ES2), the validation performance measure was changed to the squared derivative error of the training set. In this method, data division into training and validation sets is no longer needed, unlike in the standard early stopping.
The simulation results showed that the new validation measure sometimes terminates the training process too soon (i.e. sooner than the standard early stopping would), causing worse approximation.

CHAPTER 4

GRADIENT-BASED COMBINED FUNCTION AND DERIVATIVE APPROXIMATION

Introduction

Recall that the objective of this research is to use neural networks to approximate functions and their first-order derivatives. In this chapter, we will propose a new training algorithm to perform this task. This algorithm is designed to work with any gradient-based optimization method (e.g. steepest descent, conjugate gradient, BFGS, etc.). We call this algorithm the Combined Function and Derivative Approximation (CFDA) algorithm. We will start this chapter by introducing the performance index used in the CFDA method and will derive two approaches for the gradient calculation. At the end of the chapter, we will provide some examples to illustrate how fast the new training algorithm is in comparison with the standard algorithm.

Gradient-Based Combined Function and Derivative Approximation

In Chapter 2, the conditions under which a neural network can approximate both a function and its first-order derivatives were discussed. We will introduce the performance index for the CFDA method, assuming these conditions are satisfied. Then, we will focus on the derivation of the gradient of the performance index, which is required for any gradient-based optimization method. We will develop two approaches for the gradient calculation. The first approach assumes the gradient computation is in batch mode, i.e. all $Q$ data points are used at the same time, and this is performed by arranging information in matrices. This may be necessary in some programming languages in order to speed up the calculation. The second method, however, offers a trade-off between the computation time and the required memory.
Performance Index

Assume we want to approximate the function $\mathbf{g}$, which maps a subset of $\Re^R$ to a subset of $\Re^{S^M}$, and its first-order derivatives, by an $M$-layer neural network (under the conditions discussed in Chapter 2). Also suppose $g_k \in C^1$ for all $k = 1, 2, \ldots, S^M$. The proposed performance index is written as:

$$J = J_f + \rho J_d = \frac{1}{QS^M}\sum_{q=1}^{Q}\sum_{k=1}^{S^M}\{a_{k,q} - g_{k,q}\}^2 + \frac{\rho}{Q_d S_d^M R_d}\sum_{q=1}^{Q}\sum_{k=1}^{S^M}\sum_{r=1}^{R}\varphi_{k,r}(q)\left(\frac{\partial a_{k,q}}{\partial p_{r,q}} - \frac{\partial g_{k,q}}{\partial p_{r,q}}\right)^2, \qquad (80)$$

where the term $\rho J_d$ is added to form the new performance index. $\rho$ is a scalar value controlling how important the term $J_d$ is, relative to the term $J_f$. If $\rho = 0$, the new performance index reduces to the standard performance index we discussed in Chapter 2. The function $\varphi_{k,r}(q)$ in the performance index is introduced to cope with the situation where the terms $\partial g_{k,q}/\partial p_{r,q}$ are not available. The function is defined below:

$$\varphi_{k,r}(q) = \begin{cases} 1 & \text{if } \dfrac{\partial g_{k,q}}{\partial p_{r,q}} \text{ is available,} \\ 0 & \text{otherwise.} \end{cases} \qquad (81)$$

Minimizing the performance index will force the neural network to simultaneously approximate both the function and its first-order derivatives. We will present two approaches for calculating the gradient of the performance index. The first puts all the calculations in matrices for batch operation, and the second offers a trade-off between the computation speed and the required memory. The derivations for these two approaches will be shown in the next section.

Gradient Calculation: Batch Operation

The batch mode operation will be discussed in this section. To minimize the performance index using gradient-based optimization methods, we need to compute the derivative of the performance index with respect to each network parameter. This means that we need to compute $\partial J/\partial w_{i,j}^m$ and $\partial J/\partial b_i^m$, where $i = 1, 2, \ldots, S^m$ and $j = 1, 2, \ldots, S^{m-1}$. Note that we will show only how to compute $\partial J_d/\partial w_{i,j}^m$ and $\partial J_d/\partial b_i^m$, since the terms $\partial J_f/\partial w_{i,j}^m$ and $\partial J_f/\partial b_i^m$ can be computed using the standard backpropagation algorithm [HaDe96].
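As a concrete illustration, the index of Eq. (80) with the indicator of Eq. (81) can be sketched in NumPy as follows. The names are our own, and we read the normalization $Q_d S_d^M R_d$ as the number of derivative terms that are actually available, which is an interpretation on our part:

```python
import numpy as np

def cfda_index(a, g, da, dg, phi, rho):
    """CFDA performance index J = J_f + rho * J_d, Eq. (80)-(81).

    a, g   : S^M x Q network outputs / targets
    da, dg : S^M x R x Q first derivatives with respect to the inputs
    phi    : S^M x R x Q indicator, 1 where dg is available (Eq. (81))
    rho    : scalar weight on the derivative term
    """
    SM, Q = a.shape
    Jf = np.sum((a - g) ** 2) / (Q * SM)
    n_avail = np.count_nonzero(phi)          # plays the role of Q_d * S_d^M * R_d
    # entries with phi == 0 are ignored, so unknown dg values never contribute
    sq = np.where(phi > 0, (da - dg) ** 2, 0.0)
    Jd = np.sum(sq) / n_avail if n_avail else 0.0
    return Jf + rho * Jd
```

With `rho = 0` (or `phi` all zeros) the index collapses to the standard mean squared function error, matching the remark after Eq. (80).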
From Eq. (80), by taking the derivative of $J_d$ with respect to $w_{i,j}^m$ (note that the constant term $1/(Q_d S_d^M R_d)$ is temporarily dropped from the term $J_d$ for simplicity), we have:

$$\frac{\partial J_d}{\partial w_{i,j}^m} = \frac{\partial}{\partial w_{i,j}^m}\left(\sum_{q=1}^{Q}\sum_{k=1}^{S^M}\sum_{r=1}^{R}\varphi_{k,r}(q)\left(\frac{\partial a_{k,q}}{\partial p_{r,q}} - \frac{\partial g_{k,q}}{\partial p_{r,q}}\right)^2\right) = 2\sum_{q=1}^{Q}\sum_{k=1}^{S^M}\sum_{r=1}^{R}\frac{\partial e_{k,q}}{\partial p_{r,q}}\times\frac{\partial}{\partial w_{i,j}^m}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right), \qquad (82)$$

where

$$\frac{\partial e_{k,q}}{\partial p_{r,q}} \equiv \varphi_{k,r}(q)\left(\frac{\partial a_{k,q}}{\partial p_{r,q}} - \frac{\partial g_{k,q}}{\partial p_{r,q}}\right). \qquad (83)$$

Since the term $\partial e_{k,q}/\partial p_{r,q}$ can be computed with standard backpropagation, we will focus on the computation of the term $\frac{\partial}{\partial w_{i,j}^m}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right)$. Note that

$$\frac{\partial}{\partial w_{i,j}^m}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right) = \frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial w_{i,j}^m}\right). \qquad (84)$$

From Eq. (84), we can use the chain rule of calculus to compute the term $\partial e_{k,q}/\partial w_{i,j}^m$. This is also a part of the standard backpropagation [HaDe96], which can be computed as follows:

$$\frac{\partial e_{k,q}}{\partial w_{i,j}^m} = \frac{\partial e_{k,q}}{\partial n_{i,q}^m}\times\frac{\partial n_{i,q}^m}{\partial w_{i,j}^m}. \qquad (85)$$

The term $\partial n_{i,q}^m/\partial w_{i,j}^m$ can be calculated using Eq. (14). Thus, Eq. (85) becomes

$$\frac{\partial e_{k,q}}{\partial w_{i,j}^m} = \frac{\partial e_{k,q}}{\partial n_{i,q}^m}\times a_{j,q}^{m-1}. \qquad (86)$$
From Eq. (86), Eq. (84) becomes

$$\frac{\partial}{\partial w_{i,j}^m}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right) = \frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial n_{i,q}^m}\times a_{j,q}^{m-1}\right) = \frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial n_{i,q}^m}\right)\times a_{j,q}^{m-1} + \frac{\partial e_{k,q}}{\partial n_{i,q}^m}\times\frac{\partial a_{j,q}^{m-1}}{\partial p_{r,q}}. \qquad (87)$$

Using Eq. (87), Eq. (82) becomes

$$\frac{\partial J_d}{\partial w_{i,j}^m} = 2\left\{\sum_q\sum_k\sum_r \frac{\partial e_{k,q}}{\partial p_{r,q}}\left(\frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial n_{i,q}^m}\right)\times a_{j,q}^{m-1}\right) + \sum_q\sum_k\sum_r \frac{\partial e_{k,q}}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial n_{i,q}^m}\times\frac{\partial a_{j,q}^{m-1}}{\partial p_{r,q}}\right)\right\}. \qquad (88)$$

By rearranging the summations, we obtain

$$\frac{\partial J_d}{\partial w_{i,j}^m} = 2\left\{\sum_q a_{j,q}^{m-1}\left(\sum_k\sum_r \frac{\partial e_{k,q}}{\partial p_{r,q}}\,\frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial n_{i,q}^m}\right)\right) + \sum_q\sum_r \frac{\partial a_{j,q}^{m-1}}{\partial p_{r,q}}\left(\sum_k \frac{\partial e_{k,q}}{\partial p_{r,q}}\,\frac{\partial e_{k,q}}{\partial n_{i,q}^m}\right)\right\}. \qquad (89)$$

To further compute Eq. (89), define

$$v_{i,q}^m \equiv \sum_k\sum_r \frac{\partial e_{k,q}}{\partial p_{r,q}}\times\frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial n_{i,q}^m}\right), \qquad (90)$$

and

$$u_{qi,r}^m \equiv \sum_k \frac{\partial e_{k,q}}{\partial p_{r,q}}\times\frac{\partial e_{k,q}}{\partial n_{i,q}^m}. \qquad (91)$$

By using Eq. (90) and Eq. (91), Eq. (89) becomes (with the constant term $1/(Q_d S_d^M R_d)$ returned):

$$\frac{\partial J_d}{\partial w_{i,j}^m} = \frac{2}{Q_d S_d^M R_d}\left(\sum_q \{a_{j,q}^{m-1}v_{i,q}^m\} + \sum_q\sum_r \frac{\partial a_{j,q}^{m-1}}{\partial p_{r,q}}\,u_{qi,r}^m\right). \qquad (92)$$

In matrix form, Eq. (92) can also be written as

$$\frac{\partial J_d}{\partial\mathbf{W}^m} = \frac{2}{Q_d S_d^M R_d}\sum_q\left\{\mathbf{v}_q^m(\mathbf{a}_q^{m-1})^T + \mathbf{U}_q^m\left(\frac{\partial\mathbf{a}_q^{m-1}}{\partial\mathbf{p}_q^T}\right)^T\right\}, \qquad (93)$$

where the $S^m \times 1$ vector $\mathbf{v}_q^m$ consists of the terms $v_{i,q}^m$, for all $i = 1, 2, \ldots, S^m$. The $S^m \times R$ matrix $\mathbf{U}_q^m$ consists of the terms $u_{qi,r}^m$, for all $i = 1, 2, \ldots, S^m$ and $r = 1, 2, \ldots, R$. To write Eq. (92) in batch mode, we first define the $S^m Q \times RQ$ matrix $\mathbf{U}^m$:

$$\mathbf{U}^m \equiv \begin{bmatrix}\mathbf{U}_1^m & 0 & \cdots & 0 \\ 0 & \mathbf{U}_2^m & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \mathbf{U}_Q^m\end{bmatrix}. \qquad (94)$$

By Eq. (94), Eq. (92) can be rewritten as

$$\frac{\partial J_d}{\partial\mathbf{W}^m} = \frac{2}{Q_d S_d^M R_d}\left\{\mathbf{V}^m(\mathbf{A}^{m-1})^T + (\mathbf{1}_{1\times Q}\otimes\mathbf{I}_{S^m})\,\mathbf{U}^m\left(\frac{\partial\mathrm{vec}\mathbf{A}^{m-1}}{\partial(\mathrm{vec}\mathbf{P})^T}\right)^T(\mathbf{1}_{Q\times 1}\otimes\mathbf{I}_{S^{m-1}})\right\}, \qquad (95)$$

where the $S^m \times Q$ matrix $\mathbf{V}^m$ is

$$\mathbf{V}^m = \begin{bmatrix}\mathbf{v}_1^m & \mathbf{v}_2^m & \cdots & \mathbf{v}_Q^m\end{bmatrix}. \qquad (96)$$

Note that

$$(\mathbf{1}_{1\times Q}\otimes\mathbf{I}_{S^m})\times\mathbf{U}^m = \begin{bmatrix}\mathbf{U}_1^m & \mathbf{U}_2^m & \cdots & \mathbf{U}_Q^m\end{bmatrix}, \quad\text{and} \qquad (97)$$

$$\left(\frac{\partial\mathrm{vec}\mathbf{A}^{m-1}}{\partial(\mathrm{vec}\mathbf{P})^T}\right)^T(\mathbf{1}_{Q\times 1}\otimes\mathbf{I}_{S^{m-1}}) = \begin{bmatrix}\dfrac{\partial\mathbf{a}_1^{m-1}}{\partial\mathbf{p}_1^T} & \dfrac{\partial\mathbf{a}_2^{m-1}}{\partial\mathbf{p}_2^T} & \cdots & \dfrac{\partial\mathbf{a}_Q^{m-1}}{\partial\mathbf{p}_Q^T}\end{bmatrix}^T \qquad (98)$$

are the $S^m \times RQ$ and $RQ \times S^{m-1}$ matrices, respectively. Note further that Eq. (83) can be written in batch mode as (where $\circ$ denotes the element-by-element product):

$$\frac{\partial\mathrm{vec}\mathbf{E}}{\partial(\mathrm{vec}\mathbf{P})^T} = \mathbf{\Phi}\circ\left\{\frac{\partial\mathrm{vec}\mathbf{A}}{\partial(\mathrm{vec}\mathbf{P})^T} - \frac{\partial\mathrm{vec}\mathbf{G}}{\partial(\mathrm{vec}\mathbf{P})^T}\right\} = \begin{bmatrix}\dfrac{\partial\mathbf{e}_1}{\partial\mathbf{p}_1^T} & 0 & \cdots & 0 \\ 0 & \dfrac{\partial\mathbf{e}_2}{\partial\mathbf{p}_2^T} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \dfrac{\partial\mathbf{e}_Q}{\partial\mathbf{p}_Q^T}\end{bmatrix}. \qquad (99)$$

The $S^M Q \times RQ$ matrix $\mathbf{\Phi}$ is defined as:

$$\mathbf{\Phi} \equiv \begin{bmatrix}\mathbf{\Phi}_1 & 0 & \cdots & 0 \\ 0 & \mathbf{\Phi}_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \mathbf{\Phi}_Q\end{bmatrix}, \qquad (100)$$

where $\mathbf{\Phi}_q$ is the $S^M \times R$ matrix consisting of the elements $\varphi_{k,r}(q)$, for all $k = 1, 2, \ldots, S^M$ and $r = 1, 2, \ldots, R$.
In addition to $\partial J_d/\partial w_{i,j}^m$, we also need to compute the derivative of the performance index with respect to the biases, i.e. $\partial J_d/\partial b_i^m$. In Eq. (82), the term $\frac{\partial}{\partial w_{i,j}^m}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right)$ is changed to $\frac{\partial}{\partial b_i^m}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right)$, and using the fact that

$$\frac{\partial}{\partial b_i^m}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right) = \frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial b_i^m}\right), \qquad (101)$$

thus

$$\frac{\partial J_d}{\partial b_i^m} = \frac{2}{Q_d S_d^M R_d}\sum_{q=1}^{Q}\sum_{k=1}^{S^M}\sum_{r=1}^{R}\frac{\partial e_{k,q}}{\partial p_{r,q}}\times\frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial b_i^m}\right). \qquad (102)$$

As in Eq. (85), the term $\partial e_{k,q}/\partial b_i^m$ can be computed as

$$\frac{\partial e_{k,q}}{\partial b_i^m} = \frac{\partial e_{k,q}}{\partial n_{i,q}^m}\times\frac{\partial n_{i,q}^m}{\partial b_i^m}, \qquad (103)$$

and by Eq. (14), this reduces to

$$\frac{\partial e_{k,q}}{\partial b_i^m} = \frac{\partial e_{k,q}}{\partial n_{i,q}^m}. \qquad (104)$$

Thus, Eq. (102) becomes

$$\frac{\partial J_d}{\partial b_i^m} = \frac{2}{Q_d S_d^M R_d}\sum_q\sum_k\sum_r \frac{\partial e_{k,q}}{\partial p_{r,q}}\times\frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial n_{i,q}^m}\right). \qquad (105)$$

From Eq. (90), Eq. (105) reduces to

$$\frac{\partial J_d}{\partial b_i^m} = \frac{2}{Q_d S_d^M R_d}\sum_q v_{i,q}^m. \qquad (106)$$

Therefore, the $S^m \times 1$ vector $\partial J_d/\partial\mathbf{b}^m$ can be written as

$$\frac{\partial J_d}{\partial\mathbf{b}^m} = \frac{2}{Q_d S_d^M R_d}\sum_q\mathbf{v}_q^m = \frac{2}{Q_d S_d^M R_d}\{\mathbf{V}^m\times\mathbf{1}_{Q\times 1}\}. \qquad (107)$$

From Eq. (95) and Eq. (107), we need to calculate the elements of the matrices $\mathbf{V}^m$, $\mathbf{U}^m$ and $\partial\mathrm{vec}\mathbf{A}^m/\partial(\mathrm{vec}\mathbf{P})^T$. We will first show how to compute $\partial\mathrm{vec}\mathbf{A}^m/\partial(\mathrm{vec}\mathbf{P})^T$, followed by $\mathbf{U}^m$, and finally $\mathbf{V}^m$. We will break these calculations into three sections.

I. Calculation of $\partial\mathrm{vec}\mathbf{A}^m/\partial(\mathrm{vec}\mathbf{P})^T$

An element of this matrix is $\partial a_{j,q}^m/\partial p_{r,q}$. From Eq. (14), by using the chain rule of calculus, we obtain

$$\frac{\partial a_{j,q}^m}{\partial p_{r,q}} = \frac{\partial a_{j,q}^m}{\partial n_{j,q}^m}\times\frac{\partial n_{j,q}^m}{\partial p_{r,q}}. \qquad (108)$$

Using Eq. (14), Eq. (108) becomes

$$\frac{\partial a_{j,q}^m}{\partial p_{r,q}} = \dot{f}^m(n_{j,q}^m)\sum_{l=1}^{S^{m-1}}\left(w_{j,l}^m\times\frac{\partial a_{l,q}^{m-1}}{\partial p_{r,q}}\right), \qquad (109)$$

where

$$\dot{f}^m(n_{j,q}^m) = \frac{\partial a_{j,q}^m}{\partial n_{j,q}^m}. \qquad (110)$$

Eq. (109) can be written in matrix form as

$$\frac{\partial\mathbf{a}_q^m}{\partial\mathbf{p}_q^T} = \dot{\mathbf{F}}^m(\mathbf{n}_q^m)\,\mathbf{W}^m\,\frac{\partial\mathbf{a}_q^{m-1}}{\partial\mathbf{p}_q^T}, \qquad (111)$$
where $\dot{\mathbf{F}}^m(\mathbf{n}_q^m)$ is the $S^m \times S^m$ matrix defined below:

$$\dot{\mathbf{F}}^m(\mathbf{n}_q^m) \equiv \frac{\partial\mathbf{a}_q^m}{\partial(\mathbf{n}_q^m)^T} = \begin{bmatrix}\dot{f}^m(n_{1,q}^m) & 0 & \cdots & 0 \\ 0 & \dot{f}^m(n_{2,q}^m) & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \dot{f}^m(n_{S^m,q}^m)\end{bmatrix}. \qquad (112)$$

In batch mode, we have

$$\frac{\partial\mathrm{vec}\mathbf{A}^m}{\partial(\mathrm{vec}\mathbf{N}^m)^T} = \begin{bmatrix}\dot{\mathbf{F}}^m(\mathbf{n}_1^m) & 0 & \cdots & 0 \\ 0 & \dot{\mathbf{F}}^m(\mathbf{n}_2^m) & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \dot{\mathbf{F}}^m(\mathbf{n}_Q^m)\end{bmatrix}. \qquad (113)$$

Thus, from Eq. (111), the matrix $\partial\mathrm{vec}\mathbf{A}^m/\partial(\mathrm{vec}\mathbf{P})^T$ can be computed as

$$\frac{\partial\mathrm{vec}\mathbf{A}^m}{\partial(\mathrm{vec}\mathbf{P})^T} = \frac{\partial\mathrm{vec}\mathbf{A}^m}{\partial(\mathrm{vec}\mathbf{N}^m)^T}\times(\mathbf{I}_Q\otimes\mathbf{W}^m)\times\frac{\partial\mathrm{vec}\mathbf{A}^{m-1}}{\partial(\mathrm{vec}\mathbf{P})^T}. \qquad (114)$$

From Eq. (114), we can see that computing $\partial\mathrm{vec}\mathbf{A}^m/\partial(\mathrm{vec}\mathbf{P})^T$ requires the calculation of $\partial\mathrm{vec}\mathbf{A}^{m-1}/\partial(\mathrm{vec}\mathbf{P})^T$. Thus, we need to initialize the recursion at $m = 0$. From the fact that $\mathbf{a}^0$ is the input $\mathbf{p}$ (thus $a_{j,q}^0 = p_{j,q}$ and $S^0 = R$), we can compute $\partial a_{j,q}^0/\partial p_{r,q}$ as follows:

$$\frac{\partial a_{j,q}^0}{\partial p_{r,q}} = \begin{cases}1 & \text{if } j = r, \\ 0 & \text{if } j \ne r.\end{cases} \qquad (115)$$

Thus, the $R \times R$ matrix $\partial\mathbf{a}_q^0/\partial\mathbf{p}_q^T$ is

$$\frac{\partial\mathbf{a}_q^0}{\partial\mathbf{p}_q^T} = \frac{\partial\mathbf{p}_q}{\partial\mathbf{p}_q^T} = \mathbf{I}_R. \qquad (116)$$

Therefore, in batch mode, the $RQ \times RQ$ matrix $\partial\mathrm{vec}\mathbf{A}^0/\partial(\mathrm{vec}\mathbf{P})^T$ becomes

$$\frac{\partial\mathrm{vec}\mathbf{A}^0}{\partial(\mathrm{vec}\mathbf{P})^T} = \begin{bmatrix}\mathbf{I}_R & 0 & \cdots & 0 \\ 0 & \mathbf{I}_R & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \mathbf{I}_R\end{bmatrix} = \mathbf{I}_{RQ}. \qquad (117)$$

This completes the calculation for the matrix $\partial\mathrm{vec}\mathbf{A}^m/\partial(\mathrm{vec}\mathbf{P})^T$. Next, we will show the calculation for $\mathbf{U}^m$.

II. Calculation of $\mathbf{U}^m$

We begin by expanding the term $\partial e_{k,q}/\partial n_{i,q}^m$ in the term $u_{qi,r}^m$ in Eq. (91). Using the chain rule of calculus:

$$\frac{\partial e_{k,q}}{\partial n_{i,q}^m} = \sum_{l=1}^{S^{m+1}}\left(\frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}}\right)\left(\frac{\partial n_{l,q}^{m+1}}{\partial a_{i,q}^m}\right)\left(\frac{\partial a_{i,q}^m}{\partial n_{i,q}^m}\right). \qquad (118)$$

From Eq. (14), we obtain

$$\frac{\partial e_{k,q}}{\partial n_{i,q}^m} = \sum_l\left(\frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}}\right)w_{l,i}^{m+1}\,\dot{f}^m(n_{i,q}^m). \qquad (119)$$

If we substitute this expression into Eq. (91) and rearrange the summation, we find

$$u_{qi,r}^m = \sum_k\frac{\partial e_{k,q}}{\partial p_{r,q}}\left(\sum_l\frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}}\,w_{l,i}^{m+1}\,\dot{f}^m(n_{i,q}^m)\right) = \dot{f}^m(n_{i,q}^m)\sum_l w_{l,i}^{m+1}\left(\sum_k\frac{\partial e_{k,q}}{\partial p_{r,q}}\,\frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}}\right). \qquad (120)$$

Using Eq. (91), Eq. (120) becomes

$$u_{qi,r}^m = \dot{f}^m(n_{i,q}^m)\sum_l w_{l,i}^{m+1}\,u_{ql,r}^{m+1}. \qquad (121)$$

Eq. (121) can be written in matrix form as

$$\mathbf{U}_q^m = \dot{\mathbf{F}}^m(\mathbf{n}_q^m)(\mathbf{W}^{m+1})^T\mathbf{U}_q^{m+1}. \qquad (122)$$
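The forward recursion of Eq. (111), initialized with Eq. (116), is easy to exercise numerically. The sketch below does this for a two-layer network with a hyperbolic tangent hidden layer and a linear output layer; the sizes and random weights are our own example, and the result is checked against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
R, S1, S2 = 3, 5, 2                          # inputs, hidden units, outputs
W1, b1 = rng.normal(size=(S1, R)), rng.normal(size=(S1, 1))
W2, b2 = rng.normal(size=(S2, S1)), rng.normal(size=(S2, 1))

def forward_with_input_derivative(p):
    """Propagate a^m and da^m/dp^T layer by layer (Eq. (111), init Eq. (116))."""
    a, dA = p, np.eye(R)                     # a^0 = p, da^0/dp^T = I_R
    for W, b, hidden in [(W1, b1, True), (W2, b2, False)]:
        n = W @ a + b
        a = np.tanh(n) if hidden else n
        fdot = (1.0 - a ** 2) if hidden else np.ones_like(n)  # tanh: f' = 1 - a^2
        dA = fdot * (W @ dA)                 # Fdot(n) W (da^{m-1}/dp^T)
    return a, dA

p = rng.normal(size=(R, 1))
a, dA = forward_with_input_derivative(p)

eps = 1e-6
for r in range(R):                           # central-difference check, column r
    dp = np.zeros((R, 1)); dp[r] = eps
    fd = (forward_with_input_derivative(p + dp)[0] -
          forward_with_input_derivative(p - dp)[0]) / (2 * eps)
    assert np.allclose(dA[:, [r]], fd, atol=1e-6)
```

The elementwise product `fdot * (W @ dA)` is the diagonal matrix multiplication $\dot{\mathbf{F}}^m(\mathbf{n}_q^m)\,\mathbf{W}^m\,\partial\mathbf{a}_q^{m-1}/\partial\mathbf{p}_q^T$ of Eq. (111), applied one example at a time rather than in the full batch arrangement.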
From Eq. (94) and Eq. (113), the batch matrix $\mathbf{U}^m$ can be expressed as

$$\mathbf{U}^m = \frac{\partial\mathrm{vec}\mathbf{A}^m}{\partial(\mathrm{vec}\mathbf{N}^m)^T}\times\{\mathbf{I}_Q\otimes(\mathbf{W}^{m+1})^T\}\times\mathbf{U}^{m+1}. \qquad (123)$$

We need to initialize this equation with $\mathbf{U}^M$. From Eq. (91) with $m = M$, we have

$$u_{qi,r}^M = \sum_k\frac{\partial e_{k,q}}{\partial p_{r,q}}\times\frac{\partial e_{k,q}}{\partial n_{i,q}^M}. \qquad (124)$$

Since $e_{k,q} = a_{k,q} - g_{k,q}$ and $a_{k,q} = a_{k,q}^M$,

$$\frac{\partial e_{k,q}}{\partial n_{i,q}^M} = \frac{\partial a_{k,q}}{\partial n_{i,q}^M} = \begin{cases}\dot{f}^M(n_{i,q}^M) & \text{if } i = k, \\ 0 & \text{if } i \ne k.\end{cases} \qquad (125)$$

Therefore, Eq. (124) reduces to

$$u_{qi,r}^M = \frac{\partial e_{i,q}}{\partial p_{r,q}}\,\dot{f}^M(n_{i,q}^M). \qquad (126)$$

From Eq. (126), $\mathbf{U}_q^M$ can be expressed as

$$\mathbf{U}_q^M = \dot{\mathbf{F}}^M(\mathbf{n}_q^M)\,\frac{\partial\mathbf{e}_q}{\partial\mathbf{p}_q^T}. \qquad (127)$$

Therefore, by Eq. (94) and Eq. (99), the batch matrix $\mathbf{U}^M$ is written as

$$\mathbf{U}^M = \frac{\partial\mathrm{vec}\mathbf{A}^M}{\partial(\mathrm{vec}\mathbf{N}^M)^T}\times\frac{\partial\mathrm{vec}\mathbf{E}}{\partial(\mathrm{vec}\mathbf{P})^T}. \qquad (128)$$

Note that if $f^M$ is the linear function, then $\dot{\mathbf{F}}^M(\mathbf{n}_q^M) = \mathbf{I}_{S^M}$ and the batch matrix $\partial\mathrm{vec}\mathbf{A}^M/\partial(\mathrm{vec}\mathbf{N}^M)^T = \mathbf{I}_{S^M Q}$. This completes the calculation for $\mathbf{U}^m$. Next, we will compute $\mathbf{V}^m$.

III. Calculation of $\mathbf{V}^m$

From Eq. (90), by using the chain rule of calculus for the term $\partial e_{k,q}/\partial n_{i,q}^m$, which is presented in Eq. (119), we obtain

$$v_{i,q}^m = \sum_r\sum_k\frac{\partial e_{k,q}}{\partial p_{r,q}}\times\frac{\partial}{\partial p_{r,q}}\left(\sum_{l=1}^{S^{m+1}}\frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}}\,w_{l,i}^{m+1}\,\dot{f}^m(n_{i,q}^m)\right). \qquad (129)$$

Since the last term in Eq. (129), $\dot{f}^m(n_{i,q}^m)$, does not depend on $l$, it can be brought outside the summation $\sum_l$. Then, Eq. (129) becomes

$$v_{i,q}^m = \sum_r\sum_k\frac{\partial e_{k,q}}{\partial p_{r,q}}\left(\frac{\partial\dot{f}^m(n_{i,q}^m)}{\partial p_{r,q}}\sum_l\frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}}\,w_{l,i}^{m+1}\right) + \sum_r\sum_k\frac{\partial e_{k,q}}{\partial p_{r,q}}\left(\dot{f}^m(n_{i,q}^m)\sum_l\frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}}\right)w_{l,i}^{m+1}\right). \qquad (130)$$

By rearranging the terms in Eq. (130), this results in

$$v_{i,q}^m = \sum_r\frac{\partial\dot{f}^m(n_{i,q}^m)}{\partial p_{r,q}}\sum_l w_{l,i}^{m+1}\left(\sum_k\frac{\partial e_{k,q}}{\partial p_{r,q}}\,\frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}}\right) + \dot{f}^m(n_{i,q}^m)\sum_l w_{l,i}^{m+1}\left(\sum_r\sum_k\frac{\partial e_{k,q}}{\partial p_{r,q}}\,\frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}}\right)\right). \qquad (131)$$

From Eq. (90) and Eq. (91), Eq. (131) reduces to

$$v_{i,q}^m = \sum_r\frac{\partial\dot{f}^m(n_{i,q}^m)}{\partial p_{r,q}}\left(\sum_l w_{l,i}^{m+1}\,u_{ql,r}^{m+1}\right) + \dot{f}^m(n_{i,q}^m)\sum_l w_{l,i}^{m+1}\,v_{l,q}^{m+1}. \qquad (132)$$

To write Eq. (132) for batch mode, first consider the term $\partial\dot{f}^m(n_{i,q}^m)/\partial p_{r,q}$. The term $\dot{f}^m(n_{i,q}^m)$ depends only on the term $f^m(n_{i,q}^m)$, i.e. $\dot{f}^m(n_{i,q}^m)$ does not depend on $f^m(n_{j,q}^m)$ when $i \ne j$, for $i, j = 1, 2, \ldots, S^m$.
Thus, $\dot{f}^m(n_{i,q}^m)$ can be considered the $i$th element of the $S^m \times 1$ vector function:

$$\mathbf{df}^m(\mathbf{n}_q^m) \equiv \begin{bmatrix}\dot{f}^m(n_{1,q}^m) & \dot{f}^m(n_{2,q}^m) & \cdots & \dot{f}^m(n_{S^m,q}^m)\end{bmatrix}^T, \qquad (133)$$

whose entries are the diagonal elements of the matrix $\dot{\mathbf{F}}^m(\mathbf{n}_q^m)$. In other words,

$$\mathbf{df}^m(\mathbf{n}_q^m) = \dot{\mathbf{F}}^m(\mathbf{n}_q^m)\times\mathbf{1}_{S^m\times 1}. \qquad (134)$$

From Eq. (133), further define the $S^m \times Q$ batch matrix:

$$\mathbf{DF}^m(\mathbf{N}^m) \equiv \begin{bmatrix}\mathbf{df}^m(\mathbf{n}_1^m) & \mathbf{df}^m(\mathbf{n}_2^m) & \cdots & \mathbf{df}^m(\mathbf{n}_Q^m)\end{bmatrix}. \qquad (135)$$

It can be obtained from the matrix $\partial\mathrm{vec}\mathbf{A}^m/\partial(\mathrm{vec}\mathbf{N}^m)^T$ by

$$(\mathbf{1}_{1\times Q}\otimes\mathbf{I}_{S^m})\times\frac{\partial\mathrm{vec}\mathbf{A}^m}{\partial(\mathrm{vec}\mathbf{N}^m)^T}\times(\mathbf{I}_Q\otimes\mathbf{1}_{S^m\times 1}) = \begin{bmatrix}\dot{\mathbf{F}}^m(\mathbf{n}_1^m) & \dot{\mathbf{F}}^m(\mathbf{n}_2^m) & \cdots & \dot{\mathbf{F}}^m(\mathbf{n}_Q^m)\end{bmatrix}\times(\mathbf{I}_Q\otimes\mathbf{1}_{S^m\times 1}) = \mathbf{DF}^m(\mathbf{N}^m). \qquad (136)$$

The batch derivative of the matrix $\mathbf{DF}^m(\mathbf{N}^m)$ with respect to the input is:

$$\frac{\partial\mathrm{vec}\mathbf{DF}^m(\mathbf{N}^m)}{\partial(\mathrm{vec}\mathbf{P})^T} = \begin{bmatrix}\dfrac{\partial\mathbf{df}^m(\mathbf{n}_1^m)}{\partial\mathbf{p}_1^T} & 0 & \cdots & 0 \\ 0 & \dfrac{\partial\mathbf{df}^m(\mathbf{n}_2^m)}{\partial\mathbf{p}_2^T} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \dfrac{\partial\mathbf{df}^m(\mathbf{n}_Q^m)}{\partial\mathbf{p}_Q^T}\end{bmatrix}. \qquad (137)$$

We can now represent Eq. (132) in vector form:

$$\mathbf{v}_q^m = \left(\frac{\partial\mathbf{df}^m(\mathbf{n}_q^m)}{\partial\mathbf{p}_q^T}\circ\left\{(\mathbf{W}^{m+1})^T\mathbf{U}_q^{m+1}\right\}\right)\mathbf{1}_{R\times 1} + \dot{\mathbf{F}}^m(\mathbf{n}_q^m)(\mathbf{W}^{m+1})^T\mathbf{v}_q^{m+1}. \qquad (138)$$

Using Eq. (135) and Eq. (137), this can be put in batch matrix form:

$$\mathbf{V}^m = (\mathbf{1}_{1\times Q}\otimes\mathbf{I}_{S^m})\times\left(\frac{\partial\mathrm{vec}\mathbf{DF}^m(\mathbf{N}^m)}{\partial(\mathrm{vec}\mathbf{P})^T}\circ\left\{\mathbf{I}_Q\otimes(\mathbf{W}^{m+1})^T\right\}\mathbf{U}^{m+1}\right)\times(\mathbf{I}_Q\otimes\mathbf{1}_{R\times 1}) + \mathbf{DF}^m(\mathbf{N}^m)\circ\left\{(\mathbf{W}^{m+1})^T\mathbf{V}^{m+1}\right\}. \qquad (139)$$

$\mathbf{V}^M$ is needed to initialize Eq. (139). From Eq. (90), we have

$$v_{i,q}^M = \sum_r\sum_k\frac{\partial e_{k,q}}{\partial p_{r,q}}\times\frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial n_{i,q}^M}\right). \qquad (140)$$

From Eq. (125), this reduces to
$$v_{i,q}^M = \sum_r\frac{\partial e_{i,q}}{\partial p_{r,q}}\times\frac{\partial\dot{f}_{i,q}^M}{\partial p_{r,q}}. \qquad (141)$$

Therefore, $\mathbf{v}_q^M$ can be written as

$$\mathbf{v}_q^M = \left(\frac{\partial\mathbf{df}^M(\mathbf{n}_q^M)}{\partial\mathbf{p}_q^T}\circ\frac{\partial\mathbf{e}_q}{\partial\mathbf{p}_q^T}\right)\mathbf{1}_{R\times 1}. \qquad (142)$$

Using Eq. (99) and Eq. (137), the batch matrix $\mathbf{V}^M$ can then be expressed as

$$\mathbf{V}^M = (\mathbf{1}_{1\times Q}\otimes\mathbf{I}_{S^M})\times\left(\frac{\partial\mathrm{vec}\mathbf{DF}^M(\mathbf{N}^M)}{\partial(\mathrm{vec}\mathbf{P})^T}\circ\frac{\partial\mathrm{vec}\mathbf{E}}{\partial(\mathrm{vec}\mathbf{P})^T}\right)\times(\mathbf{I}_Q\otimes\mathbf{1}_{R\times 1}). \qquad (143)$$

It now remains to compute $\partial\mathrm{vec}\mathbf{DF}^M(\mathbf{N}^M)/\partial(\mathrm{vec}\mathbf{P})^T$. Consider one element of this matrix, i.e. $\partial\dot{f}^m(n_{i,q}^m)/\partial p_{r,q}$. Using the chain rule of calculus:

$$\frac{\partial\dot{f}^m(n_{i,q}^m)}{\partial p_{r,q}} = \frac{\partial\dot{f}^m(n_{i,q}^m)}{\partial a_{i,q}^m}\times\frac{\partial a_{i,q}^m}{\partial p_{r,q}}. \qquad (144)$$

It may seem unusual that we are using $a_{i,q}^m$ as the intermediate variable here. In fact, it is often true that $\dot{f}^m(n_{i,q}^m)$ can be written as a function of $a_{i,q}^m$. For example, if $f^m$ is the hyperbolic tangent sigmoid function, i.e.

$$f^m(n) = \frac{e^n - e^{-n}}{e^n + e^{-n}}, \qquad (145)$$

then the derivative of $f^m$ with respect to $n$, i.e. $\dot{f}^m(n)$, is

$$\dot{f}^m(n) = \frac{\partial f^m}{\partial n} = 1 - \left(\frac{e^n - e^{-n}}{e^n + e^{-n}}\right)^2 = 1 - (f^m(n))^2 = 1 - (a^m)^2. \qquad (146)$$

With Eq. (144), the $S^m \times R$ matrix $\partial\mathbf{df}^m(\mathbf{n}_q^m)/\partial\mathbf{p}_q^T$ can be expressed as

$$\frac{\partial\mathbf{df}^m(\mathbf{n}_q^m)}{\partial\mathbf{p}_q^T} = \frac{\partial\mathbf{df}^m(\mathbf{n}_q^m)}{\partial(\mathbf{a}_q^m)^T}\times\frac{\partial\mathbf{a}_q^m}{\partial\mathbf{p}_q^T}, \qquad (147)$$

where the $S^m \times S^m$ matrix $\partial\mathbf{df}^m(\mathbf{n}_q^m)/\partial(\mathbf{a}_q^m)^T$ is

$$\frac{\partial\mathbf{df}^m(\mathbf{n}_q^m)}{\partial(\mathbf{a}_q^m)^T} = \begin{bmatrix}\dfrac{\partial\dot{f}^m(n_{1,q}^m)}{\partial a_{1,q}^m} & 0 & \cdots & 0 \\ 0 & \dfrac{\partial\dot{f}^m(n_{2,q}^m)}{\partial a_{2,q}^m} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \dfrac{\partial\dot{f}^m(n_{S^m,q}^m)}{\partial a_{S^m,q}^m}\end{bmatrix}. \qquad (148)$$

From Eq. (147), the batch matrix $\partial\mathrm{vec}\mathbf{DF}^m(\mathbf{N}^m)/\partial(\mathrm{vec}\mathbf{P})^T$ can be decomposed to

$$\frac{\partial\mathrm{vec}\mathbf{DF}^m(\mathbf{N}^m)}{\partial(\mathrm{vec}\mathbf{P})^T} = \frac{\partial\mathrm{vec}\mathbf{DF}^m(\mathbf{N}^m)}{\partial(\mathrm{vec}\mathbf{A}^m)^T}\times\frac{\partial\mathrm{vec}\mathbf{A}^m}{\partial(\mathrm{vec}\mathbf{P})^T}. \qquad (149)$$

At the output layer (i.e. $m = M$), Eq. (144), Eq. (147) and Eq. (149) also apply, with $a_{i,q}^M = a_{i,q}$, $\mathbf{a}_q^M = \mathbf{a}_q$ and $\mathbf{A}^M = \mathbf{A}$. This completes the derivation of the batch form of the gradient of the performance index $J_d$ with respect to the network parameters $w_{i,j}^m$ and $b_i^m$. In some programming languages (e.g.
MATLAB), batch algorithms are more efficient than computing element by element. However, performing the calculation in batch mode requires sufficient memory to hold all of the matrix elements at once. Therefore, the larger the matrices, the more memory is needed. The next section will derive an algorithm that is designed to save memory. Before going to the derivation of the memory-save method, let's summarize the algorithm in this section.

STEPS TO COMPUTE $\partial J/\partial\mathbf{W}^m$ AND $\partial J/\partial\mathbf{b}^m$ (BATCH MODE)

1. Given $\mathbf{P}$, compute $\mathbf{A}^m$ and $\partial\mathrm{vec}\mathbf{A}^m/\partial(\mathrm{vec}\mathbf{P})^T$, by using Eq. (114) and Eq. (117).
2. Given $\partial\mathrm{vec}\mathbf{G}/\partial(\mathrm{vec}\mathbf{P})^T$, compute the derivative of the errors $\partial\mathrm{vec}\mathbf{E}/\partial(\mathrm{vec}\mathbf{P})^T$, using Eq. (99).
3. Compute $\mathbf{U}^M$ and $\mathbf{V}^M$, using Eq. (128) and Eq. (143), respectively.
4. Backpropagate $\mathbf{U}^m$ and $\mathbf{V}^m$, by Eq. (123) and Eq. (139), respectively.
5. Compute $\partial J_d/\partial\mathbf{W}^m$ and $\partial J_d/\partial\mathbf{b}^m$ by Eq. (95) and Eq. (107), respectively. Then, compute $\partial J/\partial\mathbf{W}^m$ and $\partial J/\partial\mathbf{b}^m$.
6. Update the weights and biases using any gradient-based optimization technique.

Gradient Calculation: Memory-Save Method

In this section, another approach to compute the gradient of $J_d$ with respect to the network parameters will be discussed. As previously mentioned, this approach will be useful when the machine's memory is insufficient for the batch mode operation. Mainly, the method will break down some matrices into smaller matrices. We will first briefly discuss the concept of how to break down the matrices. This will be followed by the derivation of the gradient.
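Whichever form is used, the computed gradient of $J_d$ should agree with finite differences of the index itself. The sketch below performs such a check for a two-layer tanh/linear network with all $\varphi = 1$; it is our own direct chain-rule implementation for this special case, not the batch matrix procedure of the text, but the resulting gradient is the same quantity:

```python
import numpy as np

rng = np.random.default_rng(0)
R, S1, S2, Q = 2, 4, 1, 5                  # inputs, hidden, outputs, examples
P  = rng.normal(size=(R, Q))               # input batch
dG = rng.normal(size=(S2, R, Q))           # target derivatives dg_k/dp_r
W1, b1 = rng.normal(size=(S1, R)), rng.normal(size=(S1, 1))
W2, b2 = rng.normal(size=(S2, S1)), rng.normal(size=(S2, 1))

def Jd():
    """Derivative term of Eq. (80), all phi = 1, so Q_d S_d^M R_d = Q S^M R."""
    A1 = np.tanh(W1 @ P + b1)
    D  = 1.0 - A1 ** 2                     # hidden-layer f'(n) = 1 - a^2
    dA = np.einsum('kj,jq,jr->krq', W2, D, W1)   # da_k/dp_r per example
    return np.sum((dA - dG) ** 2) / (Q * S2 * R)

def grad_Jd():
    """dJd/dW1, dJd/db1, dJd/dW2 by the chain rule (dJd/db2 = 0)."""
    A1 = np.tanh(W1 @ P + b1)
    D  = 1.0 - A1 ** 2
    DD = -2.0 * A1 * D                     # f''(n) = -2 a f'(n)
    dA = np.einsum('kj,jq,jr->krq', W2, D, W1)
    E  = 2.0 * (dA - dG) / (Q * S2 * R)
    gW2 = np.einsum('krq,jq,jr->kj', E, D, W1)
    gW1 = (np.einsum('krq,ki,iq,lq,ir->il', E, W2, DD, P, W1)
           + np.einsum('klq,ki,iq->il', E, W2, D))
    gb1 = np.einsum('krq,ki,iq,ir->i', E, W2, DD, W1)[:, None]
    return gW1, gb1, gW2

def fd(X, eps=1e-6):
    """Central finite differences of Jd with respect to the array X (in place)."""
    g = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        X[idx] += eps; fp = Jd()
        X[idx] -= 2 * eps; fm = Jd()
        X[idx] += eps
        g[idx] = (fp - fm) / (2 * eps)
    return g

gW1, gb1, gW2 = grad_Jd()
for X, g in [(W1, gW1), (b1, gb1), (W2, gW2)]:
    assert np.max(np.abs(fd(X) - g)) < 1e-5
```

Note that $J_d$ does not depend on the output bias, which matches the observation that only $\partial J_f/\partial b^M$ contributes at the output bias.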
(80), we can see that, when compared with the typical performance index $J_f$, the new term $J_d$ has the additional summation $\sum_{r=1}^{R}$. This extra summation, when manipulated for batch mode, leads to larger matrices than those in the regular backpropagation. The idea is then to form smaller matrices whose sizes do not depend on the extra summation over $r$. First, from Eq. (90) and Eq. (91), redefine

$$\tilde v^m_{r,i,q} \equiv \sum_k\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\times\frac{\partial}{\partial p_{r,q}}\left\{\frac{\partial e_{k,q}}{\partial n^m_{i,q}}\right\}\right), \text{ and} \quad(150)$$

$$\tilde u^m_{r,i,q} \equiv \sum_k\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\times\frac{\partial e_{k,q}}{\partial n^m_{i,q}}\right). \quad(151)$$

Thus, $\tilde v^m_{r,i,q}$ and $\tilde u^m_{r,i,q}$ are the elements at row $i$ and column $q$ of $\tilde V^m_r$ and $\tilde U^m_r$, respectively. From Eq. (90) and Eq. (150), this implies that

$$v^m_{i,q} = \sum_{r=1}^{R}\tilde v^m_{r,i,q}. \quad(152)$$

For the term $\partial a^{m-1}_{j,q}/\partial p_{r,q}$, rather than expressing it as an element of $\partial a^{m-1}_q/\partial p_q^T$, we express it as the element at row $j$ of $\partial a^{m-1}_q/\partial p_{r,q}$. It is also the element at row $l$ (note: $l = (q-1)S^{m-1}+j$) and column $q$ of the matrix $\partial\,\mathrm{vec}A^{m-1}/\partial p_r^T$ (see Eq. (27) for the notation). Note that

$$\left(1_{1\times Q}\otimes I_{S^{m-1}}\right)\times\frac{\partial\,\mathrm{vec}A^{m-1}}{\partial p_r^T} = \begin{bmatrix}\dfrac{\partial a^{m-1}_1}{\partial p_{r,1}} & \dfrac{\partial a^{m-1}_2}{\partial p_{r,2}} & \dots & \dfrac{\partial a^{m-1}_Q}{\partial p_{r,Q}}\end{bmatrix}. \quad(153)$$

By Eq. (153), the following matrices immediately follow:

$$\left(1_{1\times Q}\otimes I_{S^{M}}\right)\times\frac{\partial\,\mathrm{vec}A}{\partial p_r^T} = \begin{bmatrix}\dfrac{\partial a_1}{\partial p_{r,1}} & \dfrac{\partial a_2}{\partial p_{r,2}} & \dots & \dfrac{\partial a_Q}{\partial p_{r,Q}}\end{bmatrix}, \quad(154)$$

$$\left(1_{1\times Q}\otimes I_{S^{M}}\right)\times\frac{\partial\,\mathrm{vec}G}{\partial p_r^T} = \begin{bmatrix}\dfrac{\partial g_1}{\partial p_{r,1}} & \dfrac{\partial g_2}{\partial p_{r,2}} & \dots & \dfrac{\partial g_Q}{\partial p_{r,Q}}\end{bmatrix}, \text{ and} \quad(155)$$

$$\left(1_{1\times Q}\otimes I_{S^{M}}\right)\times\frac{\partial\,\mathrm{vec}E}{\partial p_r^T} = \begin{bmatrix}\dfrac{\partial e_1}{\partial p_{r,1}} & \dfrac{\partial e_2}{\partial p_{r,2}} & \dots & \dfrac{\partial e_Q}{\partial p_{r,Q}}\end{bmatrix}, \quad(156)$$

where

$$\frac{\partial\,\mathrm{vec}E}{\partial p_r^T} = \tilde\Phi_r\bullet\left(\frac{\partial\,\mathrm{vec}A}{\partial p_r^T} - \frac{\partial\,\mathrm{vec}G}{\partial p_r^T}\right). \quad(157)$$

The matrix $\tilde\Phi_r$ is defined:

$$\tilde\Phi_r \equiv \mathrm{diag}\left\{\varphi_r(1),\ \varphi_r(2),\ \dots,\ \varphi_r(Q)\right\}, \quad(158)$$

where $\varphi_r(q)$ is the $r$th column vector of the matrix $\Phi_q$; see Eq. (100). By Eq. (150) and Eq. (151), we can rewrite Eq. (92) as

$$\frac{\partial J_d}{\partial w^m_{i,j}} = \frac{2}{Q_d S^M_d R_d}\sum_r\left\{\sum_q \tilde v^m_{r,i,q}\,a^{m-1}_{j,q} + \sum_q \tilde u^m_{r,i,q}\,\frac{\partial a^{m-1}_{j,q}}{\partial p_{r,q}}\right\}, \quad(159)$$

which can be further expressed as:

$$\frac{\partial J_d}{\partial W^m} = \frac{2}{Q_d S^M_d R_d}\sum_{r=1}^{R}\left\{\tilde V^m_r\,(A^{m-1})^T + \tilde U^m_r\left(\frac{\partial\,\mathrm{vec}A^{m-1}}{\partial p_r^T}\right)^T\times\left(1_{Q\times 1}\otimes I_{S^{m-1}}\right)\right\}. \quad(160)$$

It should be noted that the transpose of the matrix in Eq. (153) produces the matrices in Eq. (160), i.e.

$$\left(\left(1_{1\times Q}\otimes I_{S^{m-1}}\right)\times\frac{\partial\,\mathrm{vec}A^{m-1}}{\partial p_r^T}\right)^T = \left(\frac{\partial\,\mathrm{vec}A^{m-1}}{\partial p_r^T}\right)^T\times\left(1_{Q\times 1}\otimes I_{S^{m-1}}\right). \quad(161)$$

The sizes of the matrices in Eq. (160) are less than or equal to the matrix sizes in Eq.
(95), which was for batch mode. For calculating the term $\partial J_d/\partial b^m$, we can directly use Eq. (107), since $V^m$ can be easily obtained by using Eq. (152):

$$V^m = \sum_{r=1}^{R}\tilde V^m_r. \quad(162)$$

Alternatively, from Eq. (107) and Eq. (162), $\partial J_d/\partial b^m$ can also be directly expressed in the form of $\tilde V^m_r$ as follows:

$$\frac{\partial J_d}{\partial b^m} = \frac{2}{Q_d S^M_d R_d}\sum_r\left(\tilde V^m_r\times 1_{Q\times 1}\right). \quad(163)$$

From Eq. (160) and Eq. (163), we need to compute $\partial\,\mathrm{vec}A^m/\partial p_r^T$, $\tilde U^m_r$ and $\tilde V^m_r$. We will start with $\partial\,\mathrm{vec}A^m/\partial p_r^T$, followed by $\tilde U^m_r$ and $\tilde V^m_r$.

I. Calculation of $\partial\,\mathrm{vec}A^m/\partial p_r^T$

From Eq. (109), by using Eq. (112), we can express $\partial a^m_q/\partial p_{r,q}$ as

$$\frac{\partial a^m_q}{\partial p_{r,q}} = \dot F^m(n^m_q)\,W^m\,\frac{\partial a^{m-1}_q}{\partial p_{r,q}}. \quad(164)$$

Therefore, by Eq. (113) and Eq. (153), $\partial\,\mathrm{vec}A^m/\partial p_r^T$ becomes

$$\frac{\partial\,\mathrm{vec}A^m}{\partial p_r^T} = \frac{\partial\,\mathrm{vec}A^m}{\partial(\mathrm{vec}N^m)^T}\times\left(I_Q\otimes W^m\right)\times\frac{\partial\,\mathrm{vec}A^{m-1}}{\partial p_r^T}. \quad(165)$$

Since this is a forward propagation process, we need to initialize at $m = 0$. Recall that $S^0 = R$. From Eq. (115), we then have the vector

$$\frac{\partial a^0_q}{\partial p_{r,q}} = \frac{\partial p_q}{\partial p_{r,q}} = \begin{bmatrix}0 & \dots & 0 & 1 & 0 & \dots & 0\end{bmatrix}^T, \quad(166)$$

where the one appears at row $r$. Thus, the matrix

$$\frac{\partial\,\mathrm{vec}A^0}{\partial p_r^T} = I_Q\otimes\frac{\partial p_q}{\partial p_{r,q}}, \text{ for any } q. \quad(167)$$

Next, we will derive $\tilde U^m_r$.

II. Calculation of $\tilde U^m_r$

From Eq. (151), using Eq. (119), we have

$$\tilde u^m_{r,i,q} = \sum_l\left\{\dot f^m(n^m_{i,q})\,w^{m+1}_{l,i}\sum_k\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\times\frac{\partial e_{k,q}}{\partial n^{m+1}_{l,q}}\right)\right\}, \quad(168)$$

and by Eq. (151), it reduces to

$$\tilde u^m_{r,i,q} = \dot f^m(n^m_{i,q})\sum_l\left\{w^{m+1}_{l,i}\,\tilde u^{m+1}_{r,l,q}\right\}. \quad(169)$$

Eq. (169) can be expressed in the form of $\tilde U^m_r$, using Eq. (136), as

$$\tilde U^m_r = DF^m(N^m)\bullet\left\{(W^{m+1})^T\,\tilde U^{m+1}_r\right\}. \quad(170)$$

Since this is a backpropagation process, we need to initialize $\tilde U^M_r$. From Eq. (125), Eq.
(151) reduces to

$$\tilde u^M_{r,i,q} = \dot f^M(n^M_{i,q})\,\frac{\partial e_{i,q}}{\partial p_{r,q}}. \quad(171)$$

From Eq. (171), $\tilde U^M_r$ can be written, using Eq. (156), as

$$\tilde U^M_r = DF^M(N^M)\bullet\left\{\left(1_{1\times Q}\otimes I_{S^M}\right)\times\frac{\partial\,\mathrm{vec}E}{\partial p_r^T}\right\}. \quad(172)$$

Next, we will illustrate how to calculate $\tilde V^m_r$.

III. Calculation of $\tilde V^m_r$

From Eq. (150), by similarly following Eq. (129) to Eq. (132), we obtain

$$\tilde v^m_{r,i,q} = \frac{\partial\dot f^m(n^m_{i,q})}{\partial p_{r,q}}\sum_l\left\{w^{m+1}_{l,i}\,\tilde u^{m+1}_{r,l,q}\right\} + \dot f^m(n^m_{i,q})\sum_l\left\{w^{m+1}_{l,i}\,\tilde v^{m+1}_{r,l,q}\right\}. \quad(173)$$

Eq. (173) can be expressed in the form of $\tilde v^m_{r,q}$ as

$$\tilde v^m_{r,q} = \frac{\partial df^m(n^m_q)}{\partial p_{r,q}}\bullet\left\{(W^{m+1})^T\,\tilde u^{m+1}_{r,q}\right\} + \dot F^m(n^m_q)\,(W^{m+1})^T\,\tilde v^{m+1}_{r,q}, \quad(174)$$

where $\tilde u^{m+1}_{r,q}$ is the $q$th column vector of $\tilde U^{m+1}_r$. Then, by Eq. (136), $\tilde V^m_r$ can be expressed as

$$\tilde V^m_r = \left[\left(1_{1\times Q}\otimes I_{S^m}\right)\times\frac{\partial\,\mathrm{vec}DF^m(N^m)}{\partial p_r^T}\right]\bullet\left\{(W^{m+1})^T\,\tilde U^{m+1}_r\right\} + DF^m(N^m)\bullet\left\{(W^{m+1})^T\,\tilde V^{m+1}_r\right\}. \quad(175)$$

Next, we need to initialize at $m = M$. From Eq. (141), the term $\tilde v^M_{r,i,q}$ can be computed as follows:

$$\tilde v^M_{r,i,q} = \frac{\partial\dot f^M(n^M_{i,q})}{\partial p_{r,q}}\times\frac{\partial e_{i,q}}{\partial p_{r,q}}. \quad(176)$$

Thus, $\tilde V^M_r$ can be expressed as

$$\tilde V^M_r = \left(1_{1\times Q}\otimes I_{S^M}\right)\times\left(\frac{\partial\,\mathrm{vec}DF^M(N^M)}{\partial p_r^T}\bullet\frac{\partial\,\mathrm{vec}E}{\partial p_r^T}\right). \quad(177)$$

It now remains to calculate $\partial\,\mathrm{vec}DF^M(N^M)/\partial p_r^T$. Consider one element of this matrix, which is $\partial\dot f^m(n^m_{i,q})/\partial p_{r,q}$. From Eq. (144) and Eq. (147), we obtain

$$\frac{\partial df^m(n^m_q)}{\partial p_{r,q}} = \frac{\partial df^m(n^m_q)}{\partial(a^m_q)^T}\times\frac{\partial a^m_q}{\partial p_{r,q}}. \quad(178)$$

Then, by Eq. (137) and Eq. (149), $\partial\,\mathrm{vec}DF^m(N^m)/\partial p_r^T$ can be decomposed to

$$\frac{\partial\,\mathrm{vec}DF^m(N^m)}{\partial p_r^T} = \frac{\partial\,\mathrm{vec}DF^m(N^m)}{\partial(\mathrm{vec}A^m)^T}\times\frac{\partial\,\mathrm{vec}A^m}{\partial p_r^T}. \quad(179)$$

We have derived an algorithm to compute the gradient of the new performance index $J_d$ with respect to the network parameters $W^m$ and $b^m$ such that the calculation process requires less machine memory than the batch mode operation. This algorithm is summarized below.
STEPS TO COMPUTE $\partial J/\partial W^m$ AND $\partial J/\partial b^m$ (MEMORY-SAVE MODE)
1. Given the $Q$ inputs in $P$, obtain $A^m$. Initialize $r = 1$.
2. Obtain $\partial\,\mathrm{vec}A^m/\partial p_r^T$ by Eq. (165) and Eq. (167).
3. With $\partial\,\mathrm{vec}G/\partial p_r^T$, compute $\partial\,\mathrm{vec}E/\partial p_r^T$ using Eq. (157).
4. Calculate $\tilde U^M_r$ and $\tilde V^M_r$, using Eq. (172) and Eq. (177), respectively.
5. Backpropagate to obtain $\tilde U^m_r$ and $\tilde V^m_r$, using Eq. (170) and Eq. (175), respectively.
6. Compute $\partial J_d/\partial W^m$ and $\partial J_d/\partial b^m$ by Eq. (160) and Eq. (163), respectively.
7. Set $r = r + 1$, and repeat steps 2–7 until $r > R$.
8. Compute $\partial J/\partial W^m$ and $\partial J/\partial b^m$. Update the weights and biases using any gradient-based optimization technique.

In this section, we have shown two approaches for computing the terms $\partial J_d/\partial W^m$ and $\partial J_d/\partial b^m$. The first approach is performed in batch mode, thus requiring sufficient machine memory to simultaneously hold all numeric elements. This approach takes advantage of the faster matrix computation available in some programming languages, e.g. MATLAB. The second approach, however, breaks the batch-mode matrices down so that their sizes are smaller. This approach is useful when the machine's memory is insufficient to perform the batch mode operation, thereby trading computation speed for reduced memory.

Speed Test

In this section, we create a network structure, i.e. an $R$–25–1 network (where $R$ will be varied), to test the speed of the CFDA algorithm. We will compare the computation time between the two approaches we previously derived, i.e. the batch mode and the memory-save approach.
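Before timing the two approaches, the trade-off they make can be shown with a small numerical sketch. All shapes and values below are made up for illustration; the point is only that accumulating the per-$r$ products one at a time gives the same result as one large batch product, while the peak array size no longer grows with $R$:

```python
import numpy as np

rng = np.random.default_rng(0)
R, Q, S = 4, 100, 10                 # toy sizes: input dims, training points, neurons
V = rng.normal(size=(R, S, Q))       # stand-in for the R per-input matrices V~_r
A = rng.normal(size=(Q, S))          # stand-in for the transposed activation matrix

# Batch mode: stack all R blocks into one big product; peak memory grows with R.
V_big = V.transpose(1, 0, 2).reshape(S, R * Q)   # (S, R*Q)
A_big = np.vstack([A] * R)                        # (R*Q, S)
G_batch = V_big @ A_big

# Memory-save mode: accumulate one r at a time; peak memory is independent of R.
G_mem = np.zeros((S, S))
for r in range(R):
    G_mem += V[r] @ A

# Both give the same S x S accumulated matrix.
assert np.allclose(G_batch, G_mem)
```

The batch product is usually faster in array languages because it is one large matrix multiply, but the stacked arrays must fit in RAM; the loop form keeps only one $S\times Q$ block live at a time.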
The following figure shows the relative one-iteration execution times (averaged over 10 runs) for the batch and memory-save methods for computing the gradient of $J_d$, when compared to the time to compute the gradient of $J_f$. Note that the number of training points is fixed at $Q = 75{,}000$. These tests were run on a computer with a processor speed of 2.0 GHz and a memory size of 512 MB.

Figure 6) Relative execution time (compared to $\partial J_f/\partial x$) for computing $\partial J_d/\partial x$

We expected that the time for computing the gradient of $J_d$ would be greater than that of $J_f$. However, from Figure 6), we can see that, initially, the gradient calculation for $J_d$ took less time than that for $J_f$. This is because of the overhead in the MATLAB code. Once the input dimension $R$ increased, the time for computing the gradient of $J_d$ became greater than that for $J_f$. The interesting part, however, occurred when the input dimension was more than seven. For these cases, the memory requirement caused the PC's RAM to overflow, which required data to be sent to disk. Thus, the time for computing the gradient in batch mode became greater than that for the memory-save approach. These results show that the memory-save approach is useful when the data storage requirements exhaust the existing RAM.

Summary

In this chapter, a new training algorithm for approximating a function and its first-order derivatives, called gradient-based CFDA, was proposed. We proposed the new performance index, which includes not only the squared function errors but also the squared derivative errors. Two approaches for the gradient calculation to minimize the performance index were presented. The first method arranges every numeric element into matrices for batch mode operation, in order to expedite the gradient calculation in some programming languages. This approach, however, requires sufficient memory to simultaneously hold all elements. The other approach, called the memory-save method, trades computation time for reduced memory. The computation time under different conditions was measured. The results showed that the computation time for the squared derivative error term defined in the CFDA method was longer than that of standard backpropagation. This is expected, as the gradient computation in the CFDA method is more complicated than that in the standard backpropagation algorithm. The results also illustrated that the memory-save approach is useful in cases where the computer RAM is insufficient to perform the gradient calculation in batch mode.

CHAPTER 5
COMBINED FUNCTION AND DERIVATIVE APPROXIMATION WITH LEVENBERG-MARQUARDT

Introduction

In this chapter, we present a Levenberg-Marquardt algorithm for minimizing the CFDA performance index. Recall from Chapter 2 that optimization with the Levenberg-Marquardt algorithm requires the calculation of the Jacobian matrix. This will be the core work in this chapter. We will begin the chapter with a discussion of the Levenberg-Marquardt framework for CFDA. We will then present two approaches for computing the Jacobian matrix. The first approach performs the calculation in batch mode. The second approach (i.e. the memory-save method) trades execution time for reduced memory. The measured execution times under different conditions will be illustrated at the end of the chapter.

CFDA with Levenberg-Marquardt

Consider a performance index of the form:

$$F(x) = \rho_1 F_1(x) + \rho_2 F_2(x), \quad(180)$$

where

$$F_1(x) = \sum_{i=1}^{N_1} z_i^2(x) = z^T(x)\,z(x) \quad\text{and}\quad F_2(x) = \sum_{i=1}^{N_2}\tilde z_i^2(x) = \tilde z^T(x)\,\tilde z(x). \quad(181)$$

The $j$th element of the gradient is

$$[\nabla F(x)]_j = \frac{\partial F(x)}{\partial x_j} = \rho_1\frac{\partial F_1(x)}{\partial x_j} + \rho_2\frac{\partial F_2(x)}{\partial x_j}. \quad(182)$$

The gradient can be written in matrix form:

$$\nabla F(x) = \rho_1\nabla F_1(x) + \rho_2\nabla F_2(x), \quad(183)$$

and, from Eq. (38) in Chapter 2, this becomes

$$\nabla F(x) = 2\left\{\rho_1 J^T(x)\,z(x) + \rho_2\tilde J^T(x)\,\tilde z(x)\right\}, \quad(184)$$

where the Jacobian matrices are

$$J(x) = \frac{\partial z(x)}{\partial x^T} \quad\text{and}\quad \tilde J(x) = \frac{\partial\tilde z(x)}{\partial x^T}. \quad(185)$$

Next, we want to find the Hessian matrix. The $(k,j)$ element of the Hessian matrix would be

$$[\nabla^2 F(x)]_{k,j} = \frac{\partial^2 F(x)}{\partial x_k\partial x_j} = \rho_1\frac{\partial^2 F_1(x)}{\partial x_k\partial x_j} + \rho_2\frac{\partial^2 F_2(x)}{\partial x_k\partial x_j}. \quad(186)$$

From Eq.
(39) in Chapter 2, this turns out to be

$$[\nabla^2 F(x)]_{k,j} = 2\rho_1\sum_{i=1}^{N_1}\left\{\frac{\partial z_i(x)}{\partial x_k}\frac{\partial z_i(x)}{\partial x_j} + z_i(x)\frac{\partial^2 z_i(x)}{\partial x_k\partial x_j}\right\} + 2\rho_2\sum_{i=1}^{N_2}\left\{\frac{\partial\tilde z_i(x)}{\partial x_k}\frac{\partial\tilde z_i(x)}{\partial x_j} + \tilde z_i(x)\frac{\partial^2\tilde z_i(x)}{\partial x_k\partial x_j}\right\}. \quad(187)$$

Using Eq. (40), the Hessian can then be expressed in matrix form

$$\nabla^2 F(x) = 2\rho_1\left\{J^T(x)J(x) + S(x)\right\} + 2\rho_2\left\{\tilde J^T(x)\tilde J(x) + \tilde S(x)\right\}, \quad(188)$$

where

$$S(x) = \sum_{i=1}^{N_1} z_i(x)\,\nabla^2 z_i(x) \quad\text{and}\quad \tilde S(x) = \sum_{i=1}^{N_2}\tilde z_i(x)\,\nabla^2\tilde z_i(x). \quad(189)$$

If we assume that both $S(x)$ and $\tilde S(x)$ are small, relative to the terms $J^T(x)J(x)$ and $\tilde J^T(x)\tilde J(x)$, then the Hessian matrix can be approximated as

$$\nabla^2 F(x) \cong 2\rho_1 J^T(x)J(x) + 2\rho_2\tilde J^T(x)\tilde J(x) = 2\left\{\rho_1 J^T(x)J(x) + \rho_2\tilde J^T(x)\tilde J(x)\right\}. \quad(190)$$

Therefore, by Eq. (184) and Eq. (190), the Levenberg-Marquardt update is

$$\Delta x_{k+1} = x_{k+1} - x_k = -\left[\rho_1 J^T(x_k)J(x_k) + \rho_2\tilde J^T(x_k)\tilde J(x_k) + \mu_k I_n\right]^{-1}\times\left[\rho_1 J^T(x_k)z(x_k) + \rho_2\tilde J^T(x_k)\tilde z(x_k)\right]. \quad(191)$$

Note that $\mu_k$ is adjusted using the typical Levenberg-Marquardt algorithm, which was presented in Chapter 2. Now, recall from Eq. (80) that the performance index in the CFDA method consists of two terms (i.e. $J_f$ and $J_d$). This is in the same form as the performance index in Eq. (180), if we assign

$$F_1(x) \equiv \sum_{q=1}^{Q}\sum_{k=1}^{S^M}\left\{a_{k,q} - g_{k,q}\right\}^2, \quad(192)$$

$$F_2(x) \equiv \sum_{q=1}^{Q}\sum_{k=1}^{S^M}\sum_{r=1}^{R}\varphi_{k,r}(q)\left(\frac{\partial a_{k,q}}{\partial p_{r,q}} - \frac{\partial g_{k,q}}{\partial p_{r,q}}\right)^2, \quad(193)$$

where $\rho_1 \equiv 1/(QS^M)$ and $\rho_2 \equiv \rho/(Q_d S^M_d R_d)$. Since the algorithm for computing $J(x)$ and $z(x)$ is already known (see [HaMe94] and [HaDe96]), we will focus on the calculation of $\tilde z(x)$ and $\tilde J(x)$. The calculations for these terms are given in the following two sections.
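The mechanics of the update in Eq. (191) can be sketched numerically. The residual functions below are made up (they are not the CFDA errors), and the Jacobians are taken by finite differences; the sketch only shows how two weighted residual vectors combine in one damped Gauss-Newton step:

```python
import numpy as np

# Hypothetical residual vectors standing in for z(x) and z~(x).
def z(x):       return np.array([x[0] - 1.0, x[1] - 2.0, x[0] * x[1] - 2.0])
def z_tilde(x): return np.array([x[0]**2 - 1.0, x[0] + x[1] - 3.0])

def num_jac(f, x, h=1e-6):
    """Forward-difference Jacobian, d f / d x^T."""
    f0 = f(x)
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        xp = x.copy(); xp[j] += h
        J[:, j] = (f(xp) - f0) / h
    return J

def lm_step(x, rho1, rho2, mu):
    """One step of Eq. (191): damped combination of the two Jacobians."""
    J, Jt = num_jac(z, x), num_jac(z_tilde, x)
    A = rho1 * J.T @ J + rho2 * Jt.T @ Jt + mu * np.eye(x.size)
    g = rho1 * J.T @ z(x) + rho2 * Jt.T @ z_tilde(x)
    return x - np.linalg.solve(A, g)

rho1, rho2 = 1.0, 0.5
F = lambda x: rho1 * z(x) @ z(x) + rho2 * z_tilde(x) @ z_tilde(x)
x0 = np.array([0.5, 0.5])
x1 = lm_step(x0, rho1, rho2, mu=0.01)
print(F(x0), F(x1))   # the combined index drops after one step
```

With a small $\mu$ the step behaves like Gauss-Newton on the weighted sum of the two squared-residual terms; a large $\mu$ shrinks it toward a scaled steepest-descent step, which is the usual Levenberg-Marquardt adjustment.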
The next section will show the batch calculation, and the following section will show the memory-save calculation.

Batch Calculation

In this section, we will compute the vector $\tilde z(x)$ and the Jacobian matrix $\tilde J(x)$. Recall from Eq. (193) that we have

$$F_2(x) = \sum_{q=1}^{Q}\sum_{k=1}^{S^M}\sum_{r=1}^{R}\varphi_{k,r}(q)\left(\frac{\partial a_{k,q}}{\partial p_{r,q}} - \frac{\partial g_{k,q}}{\partial p_{r,q}}\right)^2. \quad(194)$$

We can rewrite Eq. (194) as

$$F_2(x) = \sum_{q=1}^{Q}\sum_{k=1}^{S^M}\sum_{r=1}^{R}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right)^2 = \sum_{q=1}^{Q}\mathrm{vec}\left(\frac{\partial e_q}{\partial p_q^T}\right)^T\times\mathrm{vec}\left(\frac{\partial e_q}{\partial p_q^T}\right). \quad(195)$$

Define the following matrix

$$DEP \equiv \left(1_{1\times Q}\otimes I_{S^M}\right)\times\frac{\partial\,\mathrm{vec}E}{\partial(\mathrm{vec}P)^T} = \begin{bmatrix}\dfrac{\partial e_1}{\partial p_1^T} & \dfrac{\partial e_2}{\partial p_2^T} & \dots & \dfrac{\partial e_Q}{\partial p_Q^T}\end{bmatrix}, \quad(196)$$

where the matrix $\partial\,\mathrm{vec}E/\partial(\mathrm{vec}P)^T$ is defined in Eq. (99). Then, Eq. (195) can be expressed as

$$F_2(x) = (\mathrm{vec}DEP)^T\times\mathrm{vec}DEP, \quad(197)$$

where

$$\mathrm{vec}DEP = \begin{bmatrix}\mathrm{vec}\left(\dfrac{\partial e_1}{\partial p_1^T}\right)^T & \mathrm{vec}\left(\dfrac{\partial e_2}{\partial p_2^T}\right)^T & \dots & \mathrm{vec}\left(\dfrac{\partial e_Q}{\partial p_Q^T}\right)^T\end{bmatrix}^T. \quad(198)$$

Now, by equating Eq. (197) to Eq. (181), we can see that

$$\tilde z(x) = \mathrm{vec}DEP. \quad(199)$$

For the calculation of the Jacobian matrix $\tilde J(x)$, we need to have the vector $x$ containing all of the network parameters. There are several ways to define $x$. However, the definition we use is

$$x^T \equiv \begin{bmatrix}(x^1)^T & (x^2)^T & \dots & (x^M)^T\end{bmatrix}, \quad(200)$$

where the value of $n$ (the length of $x$) is shown in Eq. (31). The vector $x^m$ is defined as

$$x^m \equiv \begin{bmatrix}\mathrm{vec}W^m\\ b^m\end{bmatrix}. \quad(201)$$

Therefore, from Eq. (185), the Jacobian matrix $\tilde J(x)$ can be written as

$$\tilde J(x) = \frac{\partial\,\mathrm{vec}DEP}{\partial x^T} = \begin{bmatrix}\dfrac{\partial}{\partial x^T}\,\mathrm{vec}\left(\dfrac{\partial e_1}{\partial p_1^T}\right)\\ \vdots\\ \dfrac{\partial}{\partial x^T}\,\mathrm{vec}\left(\dfrac{\partial e_Q}{\partial p_Q^T}\right)\end{bmatrix}. \quad(202)$$

Using Eq. (200) and Eq. (201), $\partial\,\mathrm{vec}DEP/\partial x^T$ can be written

$$\frac{\partial\,\mathrm{vec}DEP}{\partial x^T} = \begin{bmatrix}\dfrac{\partial\,\mathrm{vec}DEP}{\partial(x^1)^T} & \dfrac{\partial\,\mathrm{vec}DEP}{\partial(x^2)^T} & \dots & \dfrac{\partial\,\mathrm{vec}DEP}{\partial(x^M)^T}\end{bmatrix}, \quad(203)$$

where

$$\frac{\partial\,\mathrm{vec}DEP}{\partial(x^m)^T} = \begin{bmatrix}\dfrac{\partial\,\mathrm{vec}DEP}{\partial(\mathrm{vec}W^m)^T} & \dfrac{\partial\,\mathrm{vec}DEP}{\partial(b^m)^T}\end{bmatrix}. \quad(204)$$

Therefore, by Eq. (202) through Eq. (204), $\tilde J(x)$ can be written as the block matrix

$$\tilde J(x) = \begin{bmatrix}\dfrac{\partial}{\partial(x^1)^T}\,\mathrm{vec}\left(\dfrac{\partial e_1}{\partial p_1^T}\right) & \dots & \dfrac{\partial}{\partial(x^M)^T}\,\mathrm{vec}\left(\dfrac{\partial e_1}{\partial p_1^T}\right)\\ \vdots & & \vdots\\ \dfrac{\partial}{\partial(x^1)^T}\,\mathrm{vec}\left(\dfrac{\partial e_Q}{\partial p_Q^T}\right) & \dots & \dfrac{\partial}{\partial(x^M)^T}\,\mathrm{vec}\left(\dfrac{\partial e_Q}{\partial p_Q^T}\right)\end{bmatrix}, \quad(205)$$

where

$$\frac{\partial}{\partial(x^m)^T}\,\mathrm{vec}\left(\frac{\partial e_q}{\partial p_q^T}\right) = \begin{bmatrix}\dfrac{\partial}{\partial(\mathrm{vec}W^m)^T}\,\mathrm{vec}\left(\dfrac{\partial e_q}{\partial p_q^T}\right) & \dfrac{\partial}{\partial(b^m)^T}\,\mathrm{vec}\left(\dfrac{\partial e_q}{\partial p_q^T}\right)\end{bmatrix}. \quad(206)$$

To obtain $\tilde J(x)$, we need to compute $\partial\,\mathrm{vec}DEP/\partial(x^m)^T$.
However, it is fairly complicated to derive the entire matrix $\partial\,\mathrm{vec}DEP/\partial(x^m)^T$ at once. Therefore, the block $\partial/\partial(x^m)^T\,\mathrm{vec}(\partial e_q/\partial p_q^T)$ will be considered first. Then, the batch matrix will be expressed. Each element in the matrix is $\partial/\partial x^m_l\,(\partial e_{k,q}/\partial p_{r,q})$, where $x^m_l$ is the element at position $l$ of $x^m$. We will first consider the case when $x^m_l$ is an element of the weight matrix $W^m$, and then the case when $x^m_l$ is an element of the bias vector $b^m$. Consider when $x^m_l = w^m_{i,j}$. From Eq. (87) in Chapter 4 and using the fact that

$$\frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial n^m_{i,q}}\right) = \frac{\partial}{\partial n^m_{i,q}}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right)
xl m bm xl m bi = m bi ∂ m ∂ ∂pr, q ∂ek, q ⎝ ⎠ ⎜ ⎟ ⎛ ⎞ ni q , m ∂ ∂ ∂pr, q ∂ek, q ⎝ ⎠ ⎜ ⎟ ⎛ ⎞ = (vecWm)T ∂ ∂ vec pq ∂ T ∂eq ⎝ ⎠ ⎜ ⎟ ⎛ ⎞ 1 1 × Sm – 1 nq ( m)T ∂ ∂ vec pq ∂ T ∂eq ⎝ ⎠ ⎜ ⎟ ⎛ ⎞ ⊗ ⎩ ⎭ ⎪ ⎪ ⎨ ⎬ ⎪ ⎪ ⎧ ⎫ aq ( m – 1)T 1 SMR × Sm ⊗ ⎩ ⎭ ⎨ ⎬ ⎧ ⎫ = • 1 R × Sm – 1 nq ( m)T ∂ ∂eq ⊗ ⎩ ⎭ ⎨ ⎬ ⎧ ⎫ pq ∂ T ∂aq m – 1 ⎝ ⎠ ⎜ ⎟ ⎛ ⎞T 1 SM × Sm ⊗ ⎩ ⎭ ⎨ ⎬ ⎧ ⎫ + • , (bm)T ∂ ∂ vec pq ∂ T ∂eq ⎝ ⎠ ⎜ ⎟ ⎛ ⎞ nq ( m)T ∂ ∂ vec pq ∂ T ∂eq ⎝ ⎠ ⎜ ⎟ ⎛ ⎞ = vecDEP (vecWm)T ∂ ⁄ ∂ 90 (212) The batch matrix for Eq. (211) can be written as . (213) To clarify the expressions in Eq. (212) and Eq. (213), first note, from Eq. (196), that , (214) and ∂vecDEP (vecWm)T ∂  1 1 × Sm – 1 ∂vecDEP (vecNm)T ∂  1Q × 1 I Sm × ( ⊗ ) ⎝ ⎠ ⎜ ⎟ ⎛ ⎞ ⊗ ⎩ ⎭ ⎨ ⎬ ⎧ ⎫ ⎝ ⎜ ⎛ = (Am – 1)T 1 SMR × Sm ⊗ ⎩ ⎭ ⎨ ⎬ ⎧ ⎫ • ⎠ ⎟ ⎞ + 1 R × Sm – 1 ∂vecE (vecNm)T ∂  1Q × 1 I Sm ⎝ × ( ⊗ )⎠ ⊗ ⎛ ⎞ ⎩ ⎭ ⎨ ⎬ ⎧ ⎫ ⎝ ⎜ ⎛ 11 × Q I Sm – 1 ( ⊗ ) vecA∂ m – 1 ∂(vecP)T ×  ⎝ ⎠ ⎜ ⎟ ⎛ ⎞T 1 SM × Sm ⊗ ⎩ ⎭ ⎨ ⎬ ⎧ ⎫ ⎠ ⎟ ⎞ • . ∂vecDEP (bm)T ∂  ∂vecDEP (vecNm)T ∂  1Q × 1 I Sm = × ( ⊗ ) ∂vecDEP (vecNm)T ∂  n1 ( m)T ∂ ∂ vec p1 ∂ T ∂e1 ⎝ ⎠ ⎜ ⎟ ⎛ ⎞ 0 … 0 0 n2 ( m)T ∂ ∂ vec p2 ∂ T ∂e2 ⎝ ⎠ ⎜ ⎟ ⎛ ⎞ … 0 … … … 0 0 … nQ m ( )T ∂ ∂ vec pQ T ∂ ∂eQ ⎝ ⎠ ⎜ ⎟ ⎛ ⎞ = 91 . (215) In addition, we have , and (216) . (217) We can also write ∂vecDEP (vecNm)T ∂  1Q × 1 I Sm × ( ⊗ ) n1 ( m)T ∂ ∂ vec p1 ∂ T ∂e1 ⎝ ⎠ ⎜ ⎟ ⎛ ⎞ n2 ( m)T ∂ ∂ vec p2 ∂ T ∂e2 ⎝ ⎠ ⎜ ⎟ ⎛ ⎞ … nQ m ( )T ∂ ∂ vec pQ T ∂ ∂eQ ⎝ ⎠ ⎜ ⎟ ⎛ ⎞ = ∂vecE (vecNm)T ∂  n1 ( m)T ∂ ∂e1 0 … 0 0 n2 ( m)T ∂ ∂e2 … 0 … … … 0 0 … nQ m ( )T ∂ ∂eQ = ∂vecE (vecNm)T ∂  1Q × 1 I Sm × ( ⊗ ) n1 ( m)T ∂ ∂e1 n2 ( m)T ∂ ∂e2 … nQ m ( )T ∂ ∂eQ = 92 , (218) and the result is shown in Eq. (98). From Eq. (212) and Eq. (213), we will need to calculate , and . (We previously provided the equations for the other terms.) We will first show the derivation for computing the matrix , followed by . Note that the computation for the term is in Chapter 2, Eq. (18), and for the calculation for the term is in Chapter 4, Eq. (114). I. 
Calculation of $\partial\,\mathrm{vec}E/\partial(\mathrm{vec}N^m)^T$

Consider an element of $\partial\,\mathrm{vec}E/\partial(\mathrm{vec}N^m)^T$, which is $\partial e_{k,q}/\partial n^m_{i,q}$. This element is known as the Marquardt sensitivity [HaDe96]. This term was computed in Eq. (118) and Eq. (119) in Chapter 4. Hence, $\partial e_q/\partial(n^m_q)^T$ can be expressed as

$$\frac{\partial e_q}{\partial(n^m_q)^T} = \frac{\partial e_q}{\partial(n^{m+1}_q)^T}\,W^{m+1}\,\dot F^m(n^m_q), \quad(219)$$

where $\dot F^m(n^m_q)$ is defined in Eq. (112). Thus, from Eq. (216) and Eq. (219), the batch matrix $\partial\,\mathrm{vec}E/\partial(\mathrm{vec}N^m)^T$ can be expressed as

$$\frac{\partial\,\mathrm{vec}E}{\partial(\mathrm{vec}N^m)^T} = \frac{\partial\,\mathrm{vec}E}{\partial(\mathrm{vec}N^{m+1})^T}\times\left\{I_Q\otimes W^{m+1}\right\}\times\frac{\partial\,\mathrm{vec}A^m}{\partial(\mathrm{vec}N^m)^T}, \quad(220)$$

where the term $\partial\,\mathrm{vec}A^m/\partial(\mathrm{vec}N^m)^T$ is computed as shown in Eq. (113). Since Eq. (220) is a backpropagation process, we need to initialize it at the output layer, i.e. $m = M$. From Eq. (125), we can write

$$\frac{\partial e_q}{\partial(n^M_q)^T} = \frac{\partial a_q}{\partial(n^M_q)^T} = \dot F^M(n^M_q). \quad(221)$$

We can also write Eq. (221) in batch mode:

$$\frac{\partial\,\mathrm{vec}E}{\partial(\mathrm{vec}N^M)^T} = \frac{\partial\,\mathrm{vec}A}{\partial(\mathrm{vec}N^M)^T}. \quad(222)$$

This completes the computation of $\partial\,\mathrm{vec}E/\partial(\mathrm{vec}N^m)^T$. Next, the calculation of $\partial\,\mathrm{vec}DEP/\partial(\mathrm{vec}N^m)^T$ will be derived.

II. Calculation of $\partial\,\mathrm{vec}DEP/\partial(\mathrm{vec}N^m)^T$

Recall that an element of $\partial\,\mathrm{vec}DEP/\partial(\mathrm{vec}N^m)^T$ is $\partial/\partial n^m_{i,q}\,(\partial e_{k,q}/\partial p_{r,q})$. From Eq. (207) along with Eq. (119), we have

$$\frac{\partial}{\partial n^m_{i,q}}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right) = \frac{\partial}{\partial p_{r,q}}\left\{\dot f^m(n^m_{i,q})\sum_l w^{m+1}_{l,i}\,\frac{\partial e_{k,q}}{\partial n^{m+1}_{l,q}}\right\}. \quad(223)$$

By taking the derivative inside the parentheses and using Eq. (207), this turns out to be

$$\frac{\partial}{\partial n^m_{i,q}}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right) = \frac{\partial\dot f^m(n^m_{i,q})}{\partial p_{r,q}}\sum_l w^{m+1}_{l,i}\,\frac{\partial e_{k,q}}{\partial n^{m+1}_{l,q}} + \dot f^m(n^m_{i,q})\sum_l w^{m+1}_{l,i}\,\frac{\partial}{\partial n^{m+1}_{l,q}}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right). \quad(224)$$

Therefore, using Eq. (133) and Eq. (134), Eq. (224) can be written in matrix form.
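The Marquardt sensitivities that drive these backpropagation recursions can be checked numerically. The sketch below uses a made-up two-layer network (tanh hidden layer, linear output) and compares the sensitivity $\partial a_q/\partial(n^1_q)^T = W^2\,\dot F^1(n^1_q)$, with the diagonal of $\dot F^1$ formed from the activations via $\dot f(n) = 1 - a^2$ as in Eq. (146), against finite differences. All sizes and weights are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
R, S1, S2 = 3, 5, 2                       # toy sizes: inputs, hidden, output neurons
W1, b1 = rng.normal(size=(S1, R)), rng.normal(size=S1)
W2, b2 = rng.normal(size=(S2, S1)), rng.normal(size=S2)

def forward(p):
    n1 = W1 @ p + b1
    a1 = np.tanh(n1)                      # hidden activation; f'(n) = 1 - a^2
    n2 = W2 @ a1 + b2                     # linear output layer
    return n1, a1, n2

p = rng.normal(size=R)
n1, a1, n2 = forward(p)

# Backpropagated sensitivity: element (k, i) is W2[k, i] * (1 - a1[i]^2).
S_sens = W2 * (1.0 - a1**2)               # shape (S2, S1)

# Finite-difference check, column by column.
eps = 1e-6
S_fd = np.zeros_like(S_sens)
for i in range(S1):
    n1p = n1.copy(); n1p[i] += eps
    S_fd[:, i] = (W2 @ np.tanh(n1p) + b2 - n2) / eps

assert np.allclose(S_sens, S_fd, atol=1e-4)
```

The same comparison extends, a layer at a time, to deeper networks: each application of the recursion multiplies by the next $W^{m+1}$ and the diagonal of the activation derivatives.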
Title  Fitting Functions and Their Derivatives with Neural Networks
Date  2009-07-01
Author  Pukrittayakamee, Arjpolson
Department  Electrical Engineering
Full Text Type  Open Access
Abstract  The objective of this work was to study methods of simultaneously approximating functions and their first-order derivatives using multilayer feedforward neural networks. There are a few methods proposed today to simultaneously approximate both functions and their first derivatives, but they require modifications to the network structure. The new method works with any multilayer feedforward neural network, by forming a new performance index that combines both the function error and the first derivative error. We tested and analyzed the results of the proposed method on both analytic and real-world problems. We selected two optimization procedures for the new performance index. The first procedure was for any gradient-based optimization, while the other was implemented under the Levenberg-Marquardt framework. For each procedure, extra backpropagation calculations were derived to force the first derivative response of the neural network to match the desired derivative target. Moreover, we discovered two new types of overfitting from neural networks trained with the proposed performance index. We analyzed and illustrated how the overfitting develops. A network pruning algorithm was proposed to eliminate these types of overfitting. The simulation results tested on four analytic problems and three systems in Molecular Dynamics consistently showed that the approximation accuracy of neural networks trained by the new performance index significantly outperformed the use of standard training methods. In addition, the network generalization was even further improved with the incorporation of the pruning algorithm. We found that the most promising method, yielding the most accurate approximation and the best generalization, was to optimize the new performance index under the Levenberg-Marquardt framework along with the use of the pruning algorithm.
Note  Dissertation
Rights  © Oklahoma Agricultural and Mechanical Board of Regents
iv I also would like to take this opportunity to express my gratitude to all of my teachers at Deves Thampirak School, St. Gabriel’s College, Patumwan Demonstration School and Triam Udom Suksa School, and Professors at Chulalongkorn University and Oklahoma State University for their immeasurable dedication, support and strong encouragement to my learning. I would like to thank Dr. Prapaporn Kiattikulwattana for special care, friendship, enthusiastic support and substantial encouragement throughout my doctoral study. My thanks and appreciation are extended to the following people: Rathachai Kaewlai (MD.), Chokeanan Wechaprukpitak, Dr. Manisra Baramichai, Dr. Nipapan Klunngien, Chainart Vongsittipornrung, Dr. Wisit Kumphai, Kultarika Kwosurat, Rungnapa and Dr. Somsak Kittipiyakul, Pornphan Dulyakarn, Dr. Chanvit Chantrasrisalai, Chayanuch Jakmanon, Arthit Phadungsilp, Xiaochen Hu, Dr. Jakkrit Kunthong, Arttaporn Komolhiran and Katherine McCollom for the enduring friendship, care, support and encouragement; Reza Jafari, Milind Malshe and Rutu Narulkar for being great colleagues and friends as well as providing strong supports to the research. My gratitude also goes to Professor Dr. Dusit Kruangam, Dr. Porponth Sichanugrist, Dr. Pavan Siamchai and Dr. Sakon Kittivatcharapong for giving me opportunities to acquire precious experiences and supports to pursue the doctoral degree. Finally, I thank friends at Chulalongkorn University, colleagues at Thaicom Satellite and Thai students at Oklahoma State University for the supportive friendship. 
v TABLE OF CONTENT CHAPTER 1 INTRODUCTION ...................................................................................................1 Overview......................................................................................................1 Objectives ....................................................................................................2 Outline .........................................................................................................4 CHAPTER 2 APPROXIMATION USING NEURAL NETWORKS...........................................7 Introduction..................................................................................................7 Notation .......................................................................................................7 Multilayer Feedforward Neural Networks.................................................10 Training Algorithms ..................................................................................20 Conditions for Function and Derivative Approximation...........................35 Summary....................................................................................................38 CHAPTER 3 VALIDATIONRELATED METHODS...............................................................40 Introduction................................................................................................40 Modified validation performance measure ................................................40 Early stopping using the derivative information of the training set ..........44 Summary....................................................................................................50 CHAPTER 4 GRADIENTBASED COMBINED FUNCTION AND DERIVATIVE APPROXIMATION ..............................................................................................................51 Introduction................................................................................................51 GradientBased Combined 
Function and Derivative Approximation .......51 Speed Test..................................................................................................79 Summary....................................................................................................80 CHAPTER 5 COMBINED FUNCTION AND DERIVATIVE APPROXIMATION WITH LEVENBERG MARQUARDT ....................................................................................82 Introduction................................................................................................82 CFDA with LevenbergMarquardt ............................................................82 Batch calculation........................................................................................85 Memorysave Calculation........................................................................101 Speed Test................................................................................................111 Summary..................................................................................................112 vi CHAPTER 6 A NETWORK PRUNING ALGORITHM FOR CFDA .....................................113 Introduction..............................................................................................113 TwoLayer Network Response ...............................................................114 CFDA Overfitting ....................................................................................121 Pruning Algorithm ...................................................................................131 Summary..................................................................................................143 CHAPTER 7 TRAINING RESULTS ON SIMPLE PROBLEMS............................................145 Introduction..............................................................................................145 Evaluation Procedure...............................................................................146 Parameters in 
CFDA
Simulation Results
Summary

CHAPTER 8 A REAL-WORLD APPLICATION
Introduction
Molecular Dynamics with Neural Networks
Simulation Results
Summary

CHAPTER 9 CONCLUSIONS
Objective
Summary
Future Work

APPENDIX
A. PSGEN algorithm
B. Calculation for the second derivatives of neural networks
C. Class of H2Br model
D. Scaled and true derivatives

REFERENCES

LIST OF TABLES

Table 1 Approximation accuracy on problem 1
Table 2 Approximation accuracy on problem 2
Table 3 Approximation accuracy on problem 3
Table 4 Approximation accuracy on problem 1
Table 5 Approximation accuracy on problem 2
Table 6 Approximation accuracy on problem 3
Table 7 Description of test functions
Table 8 Network structure for each problem
Table 9 Approximation accuracy on problem 1 (First set)
Table 10 Approximation accuracy on problem 2 (First set)
Table 11 Approximation accuracy on problem 3 (First set)
Table 12 Approximation accuracy on problem 4 (First set)
Table 13 Number of neurons after pruning (First set)
Table 14 Approximation accuracy on problem 1 (Second set)
Table 15 Approximation accuracy on problem 2 (Second set)
Table 16 Approximation accuracy on problem 3 (Second set)
Table 17 Approximation accuracy on problem 4 (Second set)
Table 18 Number of neurons after pruning (Second set)
Table 19 Approximation RMSE of three-body silicon, 3000 training points
Table 20 Approximation RMSE of three-body silicon, 1500 training points
Table 21 Number of neurons after pruning, for three-body silicon
Table 22 Approximation RMSE of five-body silicon, 3000 training points
Table 23 Approximation RMSE of five-body silicon, 1500 training points
Table 24 Number of neurons after pruning, for five-body silicon
Table 25 Approximation RMSE of H2Br, 3000 training points
Table 26 Approximation RMSE of H2Br, 1500 training points
Table 27 Approximation RMSE of H2Br, 750 training points
Table 28 Approximation RMSE of H2Br, 375 training points
Table 29 Number of neurons after pruning, for H2Br
Table 30 Approximation RMSE of H2Br, after some extrapolation elimination
Table 31 Elements of $S$ and their index $i_S$
Table 32 The binary sequences, combinadics and the associated elements of $P(S)$
Table 33 Combinadics and their index $i_c$
Table 34 Notations for the combinadics
Table 35 Parameters of the model

LIST OF FIGURES

Figure 1) A single neuron
Figure 2) The 3-layer feedforward neural network
Figure 3) Training and validation set
Figure 4) Sum square function error on training and validation set
Figure 5) A training record showing $E_V$ versus $\tilde{E}_V$
Figure 6) Relative execution time (compared to $\partial J_f / \partial x$) for computing $\partial J_d / \partial x$
Figure 7) Relative execution time for computing the gradient and Jacobian of $F_2(x)$, compared to $F_1(x)$
Figure 8) Sketches of $f^1(n_i^1)$ and $\dot{f}^1(n_i^1)$
Figure 9) Effect of the first-layer weight
Figure 10) Effect of the second-layer weight
Figure 11) Neuron's function and derivative responses in two dimensions
Figure 12) A cross section of the neuron's derivative responses
Figure 13) Responses of two neurons
Figure 14) Type-A Overfitting
Figure 15) Distance between a point and a line
Figure 16) Type-B Overfitting
Figure 17) Definition of Buffer Zone
Figure 18) Training points and the buffer zone in a two-dimensional case
Figure 19) Overview of the pruning algorithm
Figure 20) Graph of the four test functions
Figure 21) Impact of $\lambda$ on the approximation accuracy
Figure 22) Error comparison for problem 1
Figure 23) Error comparison for problem 2
Figure 24) Error comparison for problem 3
Figure 25) Error comparison for problem 4
Figure 26) Overfitting in the function and derivative responses
Figure 27) Function and first derivative errors
Figure 28) Responses of the three neurons
Figure 29) The combined responses of the three neurons
Figure 30) Function and derivative errors, right after pruning (with no retraining)
Figure 31) Function and derivative errors of the final network
Figure 32) Function and derivative errors
Figure 33) Responses of the two neurons
Figure 34) The combined responses of the two neurons
Figure 35) Function and derivative errors, right after pruning (with no retraining)
Figure 36) Function and derivative errors of the final network
Figure 37) Function and derivative errors
Figure 38) Responses of the neuron causing the overfitting
Figure 39) Function and derivative errors, right after pruning (with no retraining)
Figure 40) Function and derivative errors of the final network
Figure 41) Function and derivative errors
Figure 42) Responses of the three neurons
Figure 43) The combined responses of the 15 neurons
Figure 44) Function and derivative errors, right after pruning (with no retraining)
Figure 45) Function and derivative errors of the final network
Figure 46) Configuration of three-body silicon
Figure 47) Configuration of five-body silicon
Figure 48) Configuration of H2Br
Figure 49) Extrapolation in a Monte Carlo trial, with $Q = 375$
Figure 50) Error comparison for H2Br with $Q = 375$
Figure 51) The three neuron centers
Figure 52) Function and derivative responses of the two neurons along the cross section
Figure 53) The combined responses
Figure 54) Errors before and right
after pruning the two neurons
Figure 55) Function and derivative response of the neuron
Figure 56) Errors before and right after pruning the neuron
Figure 57) Errors before and right after pruning the 21 neurons
Figure 58) Errors before and after performing the pruning algorithm

CHAPTER 1

INTRODUCTION

Overview

Multilayer feedforward neural networks (MFNNs) have been used in many nonlinear regression problems. They have been shown (both mathematically and practically) to be a powerful tool for approximating functions from a set of examples. In addition, it has been theoretically proven that their derivatives are capable of approximating the underlying derivatives of those functions. In this research, we focus on the development and derivation of training algorithms for MFNNs to approximate both functions and their first-order derivatives.

Multilayer feedforward neural networks are capable of simultaneously approximating both a function and its derivatives, as has been proven in [HoSt90], [Horn91], [Ito93], [Li96] and [Pink99]. However, there has not been a large amount of research into the development of algorithms for training multilayer feedforward neural networks to simultaneously approximate both a function and its first derivatives. This will be the focus of our research.

A few methods have been proposed in the past to train MFNNs to approximate both a function and its derivatives. All of these methods assume a noise-free environment. One approach, called algebraic training, was proposed in [FeSt05]. This method algebraically obtains the network parameters in one function approximation step.
The derivative approximation is carried out in a second stage, where the algorithm iteratively adjusts the network parameters to improve the derivative approximation of the neural network. In [BaEn99], the derivative approximation was performed by adding an extra output unit for each partial derivative to the regular structure of the feedforward neural network that was used to approximate the function. The standard backpropagation procedure was then used to train the proposed network structure. In [HaZa99], an additional network was created to approximate the derivatives. The derivative approximation obtained from the extra network was then combined, using a Taylor series expansion, with the output of the network used for function approximation.

In our research, a different technique will be used. We will not modify the network structure or create an additional network for derivative approximation. We will use the same network for derivative approximation that we use for function approximation. In addition, unlike algebraic training, the derivative approximation in our method will occur simultaneously with the function approximation. However, like the other methods above, we also assume that the data for this research are noise-free.

Objectives

As previously mentioned, our research goal is to develop new algorithms for training MFNNs to simultaneously approximate both a function and its first derivatives (in a noise-free environment). However, we will also investigate both function and derivative approximation accuracy for three existing algorithms. These existing algorithms will be called the standard methods in this research, and they are:

1. Broyden-Fletcher-Goldfarb-Shanno with Early Stopping (BFGS-ES),
2. Levenberg-Marquardt with Early Stopping (LM-ES), and
3. Gauss-Newton Approximation to Bayesian Regularization (GNBR).

This evaluation is similar to the work of [GaWh92], though we will use different training methods.
The approximation accuracy and the execution time of the new algorithms will be compared with the standard methods. In this research, five new algorithms are developed. They are:

1. Levenberg-Marquardt with Early Stopping using a modified validation measure (LM-ES1),
2. Levenberg-Marquardt with Early Stopping using the derivative information of the training set (LM-ES2),
3. Combined Function and Derivative Approximation using Broyden-Fletcher-Goldfarb-Shanno optimization (CFDA-BFGS),
4. Combined Function and Derivative Approximation with Levenberg-Marquardt optimization (CFDA-LM), and
5. Combined Function and Derivative Approximation methods with a network-pruning algorithm.

The LM-ES1 and LM-ES2 methods only change the validation performance measure, while standard training algorithms are used to train the neural networks. For CFDA-BFGS and CFDA-LM, the proposed change is in the training performance index. Therefore, the standard calculations cannot be applied. The CFDA methods can be incorporated with the proposed network-pruning algorithm, thus resulting in the fifth algorithm.

The results of the standard methods and the new algorithms will be compared on several analytic and real-world problems. The real-world application we focus on in this research is molecular dynamics, where the motion of atoms is simulated. The potential energy of the atoms is a function of the atomic configuration, and the forces acting on the atoms are the negative first derivatives of the potential energy. Therefore, molecular dynamics is a good example where both the function and its first-order derivative approximation are of interest.

Outline

In Chapter 2, the mathematical notation and the general concepts of MFNNs will be introduced. The use of neural networks in function approximation will be briefly reviewed. Three standard training algorithms for function approximation will be discussed, i.e. BFGS-ES, LM-ES and GNBR.
The conditions under which neural networks can simultaneously and uniformly approximate both a function and its derivatives will be provided.

In Chapter 3, two new validation-related methods will be introduced, i.e. LM-ES1 and LM-ES2. The approximation accuracy of these methods and LM-ES will be compared and discussed on analytic problems.

In Chapter 4, the CFDA training performance index is proposed. Two approaches (i.e. the batch mode and the memory-save method) for calculating the gradient of the performance index (which works with any gradient-based optimization) are proposed and derived. The execution time for computing the CFDA gradient and the standard gradient, under several scenarios, is compared.

In Chapter 5, the CFDA method under the Levenberg-Marquardt framework, named CFDA-LM, is discussed. Two approaches (i.e. the batch mode and the memory-save method) for minimizing the CFDA performance index are proposed and derived. The comparison of the execution time for the CFDA-LM method and the standard Levenberg-Marquardt algorithm is illustrated.

In Chapter 6, two new types of overfitting in a two-layer network with one output, trained with any CFDA method, are discussed. A network-pruning algorithm to mitigate the overfitting is proposed. Pseudo-code for the algorithm is given.

In Chapter 7, we introduce a procedure to test the approximation accuracy of neural networks on four analytic problems. We discuss a way to assign a value to the unknown parameter in the CFDA performance index. We choose CFDA-BFGS for the gradient-based CFDA method. The pruning algorithm is applied to CFDA-BFGS and CFDA-LM. We denote CFDA-BFGS and CFDA-LM with the pruning method by CFDA-BFGSp and CFDA-LMp, respectively. We test the approximation accuracy of neural networks trained by the standard methods and the CFDA methods (i.e. CFDA-BFGS, CFDA-LM, CFDA-BFGSp and CFDA-LMp). The results are shown and discussed. Examples showing the overfitting and how the pruning algorithm removes it are demonstrated.
In Chapter 8, the background material for molecular dynamics is reviewed. The use of neural networks in molecular dynamics is discussed. The approximation accuracy of neural networks trained by GNBR and by the CFDA methods is illustrated and discussed for several problems. An example showing the overfitting in a molecular dynamics problem, and how the pruning algorithm removes it, is demonstrated.

In Chapter 9, a summary of the research and the future work is provided.

CHAPTER 2

APPROXIMATION USING NEURAL NETWORKS

Introduction

The objective of this research is to use multilayer feedforward neural networks (MFNNs) for approximating functions and their first-order derivatives. This chapter serves three purposes. First, the notation used throughout the research will be introduced. Second, we will discuss the concept and the background material of MFNNs, including a review of three existing training algorithms for multilayer feedforward neural networks. These algorithms are Broyden-Fletcher-Goldfarb-Shanno with Early Stopping (BFGS-ES), Levenberg-Marquardt with Early Stopping (LM-ES), and the Gauss-Newton approximation to Bayesian regularization (GNBR). Finally, the conditions under which multilayer feedforward neural networks can simultaneously approximate both a function and its first-order derivatives will be briefly discussed and reviewed.

Notation

This section will introduce the mathematical notation that will be used throughout the research. The following definitions provide the meaning of three mathematical operators; see [HoJo94] and [MaNe99].

Definition (3)

Let $A$ be an $m \times n$ matrix and $B$ a $p \times q$ matrix. The $mp \times nq$ matrix defined by

$$A \otimes B = \begin{bmatrix} a_{1,1}B & \dots & a_{1,n}B \\ \vdots & & \vdots \\ a_{m,1}B & \dots & a_{m,n}B \end{bmatrix} \quad (1)$$

is called the Kronecker product of $A$ and $B$, and written $A \otimes B$.
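As a quick numerical illustration of Definition (3) (a sketch using NumPy, whose `np.kron` implements exactly this block structure; the matrix values are arbitrary):

```python
import numpy as np

# A is 2x2 (m = n = 2), B is 2x1 (p = 2, q = 1); A kron B is mp x nq = 4x2.
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10],
              [20]])

K = np.kron(A, B)

# Build the same matrix directly from the block definition in Eq. (1):
# block (i, j) is a_{i,j} * B.
blocks = [[A[i, j] * B for j in range(A.shape[1])] for i in range(A.shape[0])]
K_manual = np.block(blocks)

assert np.array_equal(K, K_manual)
print(K)
# [[10 20]
#  [20 40]
#  [30 40]
#  [60 80]]
```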
Definition (4)

If $A = [a_{i,j}]$ and $B = [b_{i,j}]$ are matrices of the same order, say $m \times n$, then the Hadamard product of $A$ and $B$ is the $m \times n$ matrix

$$A \bullet B = [a_{i,j} b_{i,j}] . \quad (2)$$

Definition (5)

Let $A$ be an $m \times n$ matrix and $a_j$ its $j$th column; then $\mathrm{vec}A$ is the $mn \times 1$ vector

$$\mathrm{vec}A = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} . \quad (3)$$

We will use the following notation, from [MaNe99], for representing the derivatives of a function:

Definition (6)

Let $f$ be a differentiable scalar function of a scalar variable $x$. Then the derivative of $f$ is denoted

$$\dot{f}(x) = \frac{\partial f(x)}{\partial x} .$$

Definition (7)

Let $f$ be a differentiable scalar function of a $p \times 1$ vector $x$. Then the derivative of $f$ is the $1 \times p$ array

$$\frac{\partial f(x)}{\partial x^T} = \frac{\partial f}{\partial x^T} = \begin{bmatrix} \dfrac{\partial f(x)}{\partial x_1} & \dots & \dfrac{\partial f(x)}{\partial x_p} \end{bmatrix} . \quad (4)$$

Definition (8)

Let $f$ be a differentiable $m \times 1$ real vector function of a $p \times 1$ vector $x$; then the derivative of $f$ is the $m \times p$ matrix

$$\frac{\partial f(x)}{\partial x^T} = \frac{\partial f}{\partial x^T} = \begin{bmatrix} \dfrac{\partial f_1(x)}{\partial x^T} \\ \vdots \\ \dfrac{\partial f_m(x)}{\partial x^T} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dots & \dfrac{\partial f}{\partial x_p} \end{bmatrix} . \quad (5)$$

The derivative of the function $f_i$, where $i = 1, 2, \dots, m$, with respect to the scalar variable $x_j$, where $j = 1, 2, \dots, p$, is denoted by

$$\frac{\partial f_i(x)}{\partial x_j} = \frac{\partial f_i}{\partial x_j} . \quad (6)$$

Definition (9)

Let $F$ be a differentiable $m \times n$ matrix function of a $p \times q$ matrix of real variables $X$. Then the derivative of $F$ with respect to $X$ is the $mn \times pq$ matrix

$$\frac{\partial \mathrm{vec}F}{\partial (\mathrm{vec}X)^T} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1^T} & \dots & \dfrac{\partial f_1}{\partial x_q^T} \\ \vdots & & \vdots \\ \dfrac{\partial f_n}{\partial x_1^T} & \dots & \dfrac{\partial f_n}{\partial x_q^T} \end{bmatrix} . \quad (7)$$

We will also use the notation $1_{m \times n}$ to denote the $m \times n$ matrix, all of whose elements are ones. The $m \times m$ identity matrix will be denoted by the notation $I_m$.

The next section will discuss some notation and background material for MFNNs. A brief discussion of how the MFNN can be used for function approximation will also be presented.

Multilayer Feedforward Neural Networks

We will divide this section into three parts. The first part will briefly discuss the general concept of MFNNs. The notation for MFNNs will be introduced in the second part. Finally, we will describe how MFNNs can be used for function approximation.
Background material

The term neural network has been used to refer to a structure consisting of connected nodes (or neurons) in any formation (i.e. parallel, series or both). The inspiration for neural networks originally came from the way the brain works. However, neural networks are now considered a mathematical and statistical model for information and signal processing. Neural networks have been theoretically and practically proven to be a powerful tool in many applications, such as function approximation, classification, pattern recognition, novelty detection, filtering, etc.

Neural networks have been categorized into several types, usually based on how the neurons are connected. One of the most widely-used types for function approximation is the MFNN. In this research, we will focus on using MFNNs for function approximation. For ease of reference, the term neural network (or just network) will be used throughout this document to refer to MFNNs.

A neural network consists of neurons connected in parallel and in series. The structure of a neuron is shown in the following figure.

Figure 1) A single neuron

From Figure 1), a neuron consists of several components: the neuron input $p$, the weight $w$, the bias $b$, the summer, the net input $n$, the transfer function $f$, and the neuron output $a$. The net input is computed as $n = wp + b$. The net input $n$ is fed to the transfer function $f$ to produce the neuron output $a$, i.e. $a = f(n)$. The weight $w$ and the bias $b$ are called the "network parameters".

When neurons are connected in parallel and cascade, a more complicated structure is obtained. In the parallel structure (layer), each neuron receives the same inputs as the other neurons do, but independently produces its own output. When neurons are in the cascade
When neurons are in the cascade Σ f 1 b p a w n p w b n f a n = wp + b f a a = f(n) w b 12 structure, each neuron output of a preceding layer is distributed to be a neuron input for every neuron in the layer following it. The general structure is referred to as an neural network. An example of a neural network is illustrated in the following figure. Figure 2) The feedforward neural network Since the complexity drastically increases as we have more neurons and layers, we need notation to refer to a specific neuron. In the next part, we will introduce the notation for the general structure of a multilayer feedforward neural networks that will be used throughout this research. Notations for neural networks An feedforward neural network structure can be denoted . This structure notation corresponds to the statement that the neural network consists of inputs, neurons in the first layer, neurons in the second layer and so on, until neurons in layer , i.e. the output layer. The layers 1 through are called hidden layers. Layer is called the output layer. We can consider the network inputs as the neuron outputs of layer . M – layer 3 – layer TanSigmoid Layer TanSigmoid Layer Linear Layer a tansig W p b 1 1 1 = ( + ) a tansig W a b 2 21 2 = ( + ) a purelin W a b 3 32 3 = ( + ) S 1x1 S 2x1 S 3x1 S 1x1 S 2x1 S 3x1 S 1x1 S 2x1 S 3x1 R x1 S 1 xR S2 x S 1 S 3 x S 2 S1 S2 S3 n1 n2 n3 p a1 a2 a3 W1 W2 W3 b1 b2 b1 1 1 3 R Inputs 3 – layer M – layer R – S1 – S2 – … – SM M – layer R S1 S2 SM M M – 1 M m = 0 13 Suppose we have an neural network and input/target pairs in the training set. We write to denote element of the input vector, and write to denote element in the input vector. We write to denote the weight of neuron at layer , that connects from the output of neuron at layer . The bias for neuron at layer is denoted by . The weight is the weight, which connects from the element of the input vector to neuron in the first layer. 
The net input of neuron $i$ at layer $m$ is denoted by $n_i^m$, and the notation $a_i^m$ denotes the output of neuron $i$ at layer $m$. At the output layer, $a_k^M$ is also denoted by $a_k$ ($a_k = a_k^M$). When the $q$th input vector is applied to the network (i.e. $p_r = p_{r,q}$ for $r = 1, 2, \dots, R$), the values of $n_i^m$, $a_i^m$ and $a_k$ are denoted by $n_{i,q}^m$, $a_{i,q}^m$ and $a_{k,q}$, respectively. We assume that the transfer function is the same for each neuron in the same layer; we thus use $f^m$ to denote the transfer function at layer $m$.

The notation we just introduced can also be used in matrix form. A couple of examples are provided: the notation $p_{r,q}$ means this element is at row $r$ of the $q$th input vector, and it is also the element at row $r$ and column $q$ of the $R \times Q$ matrix $P$. The notation $a_{k,q}^m$ means this element is at row $k$ of the $S^m \times 1$ vector $a_q^m$, and it is the element at row $k$ and column $q$ of the $S^m \times Q$ matrix $A^m$. The notation $b_i^m$ is at row $i$ of the $S^m \times 1$ vector $b^m$. The notation $w_{i,j}^m$ is at row $i$ of the $S^m \times 1$ vector $w_j^m$, and it is at row $i$ and column $j$ of the $S^m \times S^{m-1}$ matrix $W^m$. The following examples illustrate how the notations $p_{r,q}$, $a_{k,q}^m$, $b_i^m$ and $w_{i,j}^m$ can be expressed in matrix form:

$$p_q^T = \begin{bmatrix} p_{1,q} & p_{2,q} & \dots & p_{R,q} \end{bmatrix} \text{ and } P = \begin{bmatrix} p_1 & p_2 & \dots & p_Q \end{bmatrix} , \quad (8)$$

$$(a_q^m)^T = \begin{bmatrix} a_{1,q}^m & a_{2,q}^m & \dots & a_{S^m,q}^m \end{bmatrix} \text{ and } A^m = \begin{bmatrix} a_1^m & a_2^m & \dots & a_Q^m \end{bmatrix} , \quad (9)$$

$$(b^m)^T = \begin{bmatrix} b_1^m & b_2^m & \dots & b_{S^m}^m \end{bmatrix} , \quad (10)$$

$$(w_j^m)^T = \begin{bmatrix} w_{1,j}^m & w_{2,j}^m & \dots & w_{S^m,j}^m \end{bmatrix} \text{ and } W^m = \begin{bmatrix} w_1^m & w_2^m & \dots & w_{S^{m-1}}^m \end{bmatrix} . \quad (11)$$

Note that we use the notation $p_r^T$ to denote the $r$th row vector of the matrix $P$. Similarly, the notation $(w_{(i)}^m)^T$ denotes the $i$th row vector of the matrix $W^m$. The following equations illustrate how these are related to the notation previously introduced:

$$p_r^T = \begin{bmatrix} p_{r,1} & p_{r,2} & \dots & p_{r,Q} \end{bmatrix} \text{ and } P = \begin{bmatrix} p_1^T \\ p_2^T \\ \vdots \\ p_R^T \end{bmatrix} , \quad (12)$$

$$(w_{(i)}^m)^T = \begin{bmatrix} w_{i,1}^m & w_{i,2}^m & \dots & w_{i,S^{m-1}}^m \end{bmatrix} \text{ and } W^m = \begin{bmatrix} (w_{(1)}^m)^T \\ (w_{(2)}^m)^T \\ \vdots \\ (w_{(S^m)}^m)^T \end{bmatrix} . \quad (13)$$

Given the input signal $p_q$, the output of neuron $i$ at layer $m$, i.e. $a_{i,q}^m$, can be computed as

$$a_{i,q}^m = f^m(n_{i,q}^m) \text{ and } n_{i,q}^m = \sum_{j=1}^{S^{m-1}} w_{i,j}^m a_{j,q}^{m-1} + b_i^m , \quad (14)$$

for $i = 1, 2, \dots, S^m$. At the first layer ($m = 1$), the $i$th neuron output can be calculated as

$$a_{i,q}^1 = f^1(n_{i,q}^1) \text{ and } n_{i,q}^1 = \sum_{r=1}^{R} w_{i,r}^1 p_{r,q} + b_i^1 , \quad (15)$$

for $i = 1, 2, \dots, S^1$. Eq. (14) and Eq. (15) can be written in matrix form as

$$a_q^m = f^m(n_q^m) = \begin{bmatrix} f^m(n_{1,q}^m) \\ f^m(n_{2,q}^m) \\ \vdots \\ f^m(n_{S^m,q}^m) \end{bmatrix} \text{ and } n_q^m = W^m a_q^{m-1} + b^m , \quad (16)$$

$$a_q^1 = f^1(n_q^1) = \begin{bmatrix} f^1(n_{1,q}^1) \\ f^1(n_{2,q}^1) \\ \vdots \\ f^1(n_{S^1,q}^1) \end{bmatrix} \text{ and } n_q^1 = W^1 p_q + b^1 . \quad (17)$$

In the batch mode, when all of the inputs (i.e. $q = 1, 2, \dots, Q$) are presented to the network at the same time, Eq. (14) can be expressed as

$$A^m = F^m(N^m) = \begin{bmatrix} f^m(n_1^m) & f^m(n_2^m) & \dots & f^m(n_Q^m) \end{bmatrix} , \quad (18)$$

where

$$N^m = W^m A^{m-1} + 1_{1 \times Q} \otimes b^m . \quad (19)$$

Eq. (15) can be expressed in the batch mode as

$$A^1 = F^1(N^1) \text{ and } N^1 = W^1 P + 1_{1 \times Q} \otimes b^1 . \quad (20)$$

Since the objective of the research focuses on first-order derivative approximation, we will need to provide the notation for the first-order derivatives of neural networks. By Definition (6), we write the derivative of $a_k$ with respect to $p_r$, evaluated at $p_r = p_{r,q}$, as

$$\frac{\partial a_{k,q}}{\partial p_{r,q}} \equiv \left. \frac{\partial a_k}{\partial p_r} \right|_{p_r = p_{r,q}} . \quad (21)$$

By Definition (7), the derivative of $a_k$ with respect to the input $p$, evaluated at $p = p_q$, is written as

$$\frac{\partial a_{k,q}}{\partial p_q^T} \equiv \left. \frac{\partial a_k}{\partial p^T} \right|_{p = p_q} . \quad (22)$$

The derivative of $a$ with respect to $p_r$, evaluated at $p_r = p_{r,q}$, is denoted by

$$\frac{\partial a_q}{\partial p_{r,q}} \equiv \left. \frac{\partial a}{\partial p_r} \right|_{p_r = p_{r,q}} . \quad (23)$$

By Definition (8), the derivative of $a$ with respect to the input $p$, evaluated at $p = p_q$, is written
$$\frac{\partial a_q}{\partial p_q^T} \equiv \left. \frac{\partial a}{\partial p^T} \right|_{p = p_q} . \quad (24)$$

A couple more examples: the derivative of $a_j^m$ with respect to the net input $n_j^m$, evaluated at $n_j^m = n_{j,q}^m$, is expressed as

$$\frac{\partial a_{j,q}^m}{\partial n_{j,q}^m} \equiv \left. \frac{\partial a_j^m}{\partial n_j^m} \right|_{n_j^m = n_{j,q}^m} . \quad (25)$$

The derivative of $a^m$ with respect to the net input $n^m$, evaluated at $n^m = n_q^m$, is denoted by

$$\frac{\partial a_q^m}{\partial (n_q^m)^T} \equiv \left. \frac{\partial a^m}{\partial (n^m)^T} \right|_{n^m = n_q^m} . \quad (26)$$

For the batch mode, first suppose the input vectors $p_q$ for all $q = 1, 2, \dots, Q$ are distinct. By Definition (9), we write the derivative of $a$ with respect to the input $p_r$, evaluated at $p_r = p_{r,q}$ for all $q = 1, 2, \dots, Q$, as

$$\frac{\partial \mathrm{vec}A}{\partial p_r^T} = \begin{bmatrix} \dfrac{\partial a_1}{\partial p_{r,1}} & 0 & \dots & 0 \\ 0 & \dfrac{\partial a_2}{\partial p_{r,2}} & \dots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \dots & \dfrac{\partial a_Q}{\partial p_{r,Q}} \end{bmatrix} . \quad (27)$$

Note that the blocks off the diagonal are zero, since $p_{r,q_1}$ is not related to $p_{r,q_2}$. The derivative of $a$ with respect to the input $p$, evaluated at $p = p_q$ for all $q = 1, 2, \dots, Q$, is denoted by

$$\frac{\partial \mathrm{vec}A}{\partial (\mathrm{vec}P)^T} = \begin{bmatrix} \dfrac{\partial a_1}{\partial p_1^T} & 0 & \dots & 0 \\ 0 & \dfrac{\partial a_2}{\partial p_2^T} & \dots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \dots & \dfrac{\partial a_Q}{\partial p_Q^T} \end{bmatrix} , \quad (28)$$

where, again, the blocks off the diagonal are zeros because each input vector is distinct. The derivative of $a^m$ with respect to the input $p$, evaluated at $p = p_q$ for all $q = 1, 2, \dots, Q$, is denoted by

$$\frac{\partial \mathrm{vec}A^m}{\partial (\mathrm{vec}P)^T} = \begin{bmatrix} \dfrac{\partial a_1^m}{\partial p_1^T} & 0 & \dots & 0 \\ 0 & \dfrac{\partial a_2^m}{\partial p_2^T} & \dots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \dots & \dfrac{\partial a_Q^m}{\partial p_Q^T} \end{bmatrix} . \quad (29)$$

One more example: the derivative of $a^m$ with respect to the net input $n^m$, evaluated at $n^m = n_q^m$ for all $q = 1, 2, \dots, Q$, is denoted by

$$\frac{\partial \mathrm{vec}A^m}{\partial (\mathrm{vec}N^m)^T} = \begin{bmatrix} \dfrac{\partial a_1^m}{\partial (n_1^m)^T} & 0 & \dots & 0 \\ 0 & \dfrac{\partial a_2^m}{\partial (n_2^m)^T} & \dots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \dots & \dfrac{\partial a_Q^m}{\partial (n_Q^m)^T} \end{bmatrix} . \quad (30)$$

In the next section, we will briefly review how neural networks have been used in function approximation problems.

Neural networks and function approximation

An MFNN can be used as a function approximator. This means that the network outputs will be estimates of the response of an unknown function $g$. Given the vector function $g$, we write $g_k$ to denote the $k$th element. We will use the notation $g_{k,q}$ to denote the $k$th element of the function response to the $q$th input vector. Note that we also call $g_{k,q}$ the target value. As in the previous matrix notation, $g_q$ and $G$ are the column-vector and batch-mode representations, respectively.

Now, suppose we want a neural network to approximate a function $g$ mapping from a subset in $\Re^R$ to a subset in $\Re^{S^M}$. Also suppose that a set of examples were drawn from the function $g$, where an example represents a pair of function inputs and corresponding function responses, i.e. $(p_q, g_q)$. Given a sufficient number of neurons and a set of examples
We will use the notation to denote the element of the function response to the input vector. Note that we also call the target value. As in previous matrix notation, and are the column vector and batch mode representations, respectively. Now, suppose we want a neural network to approximate a function mapping from a subset in to a subset in . Also suppose that a set of examples were drawn from function , where an example represents a pair of function inputs and corresponding function responses; i.e. . Given a sufficient number of neurons and a set of examples ∂vecAm (vecNm)T ∂  n1 ( m)T ∂ ∂a1 m 0 … 0 0 n2 ( m)T ∂ ∂a2 m … 0 … … … 0 0 … nQ m ( )T ∂ ∂aQ m = g g gk kth gk, q kth qth gk, q gq G g ℜR ℜSM g (pq, gq) 20 (the training set), an neural network can be used to approximate a function over a subset in . In fact, [HoSt89] theoretically showed that, with a sufficient number of neurons in the hidden layer, networks are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy; meaning that the networks are a class of universal approximators. When a neural network is trained, its weights and biases are adjusted so as to minimize some performance index (or objective function), which usually involves the mean square error between the network outputs and the target values. An optimization method is used to find the network parameters that minimize the performance index. The combination of the performance index and the optimization method makes up the training algorithm. There have been many training algorithms developed for neural networks. However, we will only review three algorithms: BroydenFletcherGoldfarbShanno with Early Stopping , LevenbergMarquardt with Early Stopping , and GaussNewton approximation to the Bayesian Regularization . Training Algorithms In this section, we will review three existing neural network training algorithm. 
We will first discuss BFGS-ES, followed by LM-ES, and finally the GNBR training algorithm.

Suppose we want to approximate a function $g$, which maps from a subset in $\Re^R$ to a subset in $\Re^{S^M}$, using an $R - S^1 - S^2 - \dots - S^M$ neural network. Assume that the number of data points (or examples) in the training set is $Q$. We use the $n \times 1$ vector $x$ to denote the column vector containing all of the network parameters. The total number of network parameters (i.e. the number of elements in the vector $x$) is

$$n = S^1(R + 1) + S^2(S^1 + 1) + \dots + S^M(S^{M-1} + 1) . \quad (31)$$

BFGS-ES training algorithm

The performance index for this algorithm is the sum of the squares of the differences between the network outputs and the target values, i.e. the sum square function error. The performance index is written as

$$J_f = \sum_{q=1}^{Q} \sum_{k=1}^{S^M} \{a_{k,q} - g_{k,q}\}^2 . \quad (32)$$

Minimizing the performance index $J_f$ (with respect to the network parameters) is equivalent to forcing the neural network to approximate the function $g$ over a subset in $\Re^R$. For this training algorithm, the BFGS optimization method, see [GiMu81], will be used to minimize the performance index. The steps of the training algorithm are described in the following section. Note that we use the notation $J_f(x_k)$ and $\nabla J_f(x_k)$ to denote the performance index and the gradient of the performance index with respect to $x$, evaluated at $x_k$, respectively. The notation $\|x\|$ denotes the 2-norm of the vector $x$.

Steps for the BFGS training algorithm

1. Set $k = 0$. Initialize the network parameter vector $x_0$ by the method in [NgWi90]. Present the training set to the network. Compute the gradient of the performance index with respect to the network parameters, $\nabla J_f(x_0)$. Set the initial search direction $p_0 = -\nabla J_f(x_0)$. Initialize the approximated Hessian matrix $B_0 = I_n$.

2. Minimize the performance index along the search direction, i.e. determine $\alpha_k$ such that $x_k + \alpha_k p_k$ minimizes the performance index $J_f$.

3. Set $x_{k+1} = x_k + \alpha_k p_k$, and evaluate $J_f(x_{k+1})$.
Also compute the gradient ∇J_f(x_{k+1}). Terminate the process if J_f(x_{k+1}) or ||∇J_f(x_{k+1})|| is less than its predefined threshold.
4. Calculate the change in the gradient, y_k = ∇J_f(x_{k+1}) − ∇J_f(x_k), and compute the new approximated Hessian matrix:

B_{k+1} = B_k + \frac{[\nabla J_f(x_k)][\nabla J_f(x_k)]^T}{[\nabla J_f(x_k)]^T p_k} + \frac{y_k y_k^T}{\alpha_k y_k^T p_k}.    (33)

5. Compute the new search direction p_{k+1}, which is the solution of

B_{k+1} p_{k+1} = -\nabla J_f(x_{k+1}).    (34)

6. Set k = k + 1 and iterate steps 2) to 6).

Note that we will use the algorithm in [NgWi90] to initialize the network parameters for all training algorithms in this research. The backpropagation process is used to compute the gradient of the performance index with respect to the network parameters, ∇J_f(x). The details of the backpropagation process for training a neural network can be found in several books, such as [HaDe96]. Note also that there are several methods to perform the line search in step 2; we will use the backtracking algorithm [DeSc83].

From Eq. (32), we can see that the performance index is only the sum of squared errors, and thus it is an unregularized performance index. When using an unregularized performance index, it is possible to overfit the training data and then fail to generalize. Several techniques are available to prevent overfitting. For example, [HeKr91] proposed an approach that performs weight elimination based on the magnitude of the parameters. [SiDo91] proposed a method that adds noise to the function inputs, and [Bish95] showed that this technique is equivalent to Tikhonov regularization [TiAr77]. Another well-known technique, called early stopping (abbreviated ES), can also be used to prevent overfitting, and it is the technique we will use in this research.
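To make the Hessian update of Eq. (33) concrete, the following NumPy sketch (variable names are ours, not part of the algorithm statement above) performs one update of the approximated Hessian:

```python
import numpy as np

def bfgs_update(B, grad_old, grad_new, p, alpha):
    """One BFGS update of the approximated Hessian (cf. Eq. (33)).

    B        -- current approximated Hessian B_k (n x n)
    grad_old -- gradient at x_k
    grad_new -- gradient at x_{k+1} = x_k + alpha * p
    p        -- search direction, the solution of B_k p = -grad_old
    alpha    -- step length found by the line search
    """
    y = grad_new - grad_old  # change in the gradient (step 4)
    return (B
            + np.outer(grad_old, grad_old) / (grad_old @ p)
            + np.outer(y, y) / (alpha * (y @ p)))
```

When p solves B_k p = −∇J_f(x_k), this form is algebraically equivalent to the standard BFGS rank-two update, and the result satisfies the secant condition B_{k+1}(α_k p_k) = y_k.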
Early stopping is a widely used technique that prevents overfitting by monitoring the approximation error on a set of data that is not in the training set. This set of data is called the "validation set". The approximation error on both the training and validation sets initially decreases, until at some point the error on the validation set starts to increase while the error on the training set keeps going lower. When the error on the validation set increases for a certain number of iterations, early stopping terminates the training process and returns the network parameters from the point just before the increase in the validation error occurred. The example below illustrates how the early stopping technique works.

Suppose we want to approximate the function g(p) = sin(πp) for −1 ≤ p ≤ 1, and suppose a set of data points drawn from the graph of the function was provided. We divided the set of data points into two groups: a training set and a validation set. The function g(p) and the data points in the training and validation sets are shown in the figure below.

Figure 3) Training and validation set

Now, suppose a 1 – 5 – 1 network was used to approximate the function through the BFGS training algorithm. Figure 4) shows the sum squared function error J_f on the training and validation sets at each iteration.

Figure 4) Sum squared function error on training and validation sets

From Figure 4), we can see that the validation error initially decreased as the training error decreased. However, at some point around iteration 400, the validation error started increasing, while the training error continued to decrease. In order to prevent overfitting, it is desirable to use the network trained up to iteration 400.
Therefore, if early stopping were used, it would terminate the training process at some iteration after 400 and use the network trained up to iteration 400. The validation error may fluctuate, so we do not stop the training at the first instance of an increase in the validation error. Instead, we monitor the error for a specified number of iterations to make sure that it does not go back down. For the experiments described in this report, we monitored the error for 500 iterations after a minimum was reached before stopping the training.

We use BFGS–ES to refer to the BFGS training algorithm with early stopping. It is one of the three training algorithms we will use in this research. In the next section, we will review the second training method, the LM–ES algorithm.

LM–ES training algorithm

The performance index for this algorithm is the same as in the BFGS–ES training algorithm, i.e. Eq. (32). However, to minimize the performance index, the Levenberg–Marquardt optimization method [Leve44] [Marq63], abbreviated LM, will be used. The LM optimization algorithm is a combination of the Gauss–Newton algorithm and the method of gradient descent, and it was designed to approximate Newton's method. To understand the concept of LM optimization, let us begin with Newton's method, whose update equation is

x_{k+1} = x_k - [\nabla^2 J_f(x_k)]^{-1} \nabla J_f(x_k),    (35)

where ∇²J_f(x_k) is the Hessian matrix evaluated at x_k. This means we need to compute the gradient and the Hessian matrix of the performance index J_f. To compute them, first suppose the performance index has the form

F(x) = z^T(x) z(x),    (36)

then the jth element of the gradient would be

[\nabla F(x)]_j = \frac{\partial F(x)}{\partial x_j} = 2 \sum_{i=1}^{N} z_i(x) \frac{\partial z_i(x)}{\partial x_j},    (37)

where N is the number of elements in the vector z(x). The gradient can be written in matrix form:

\nabla F(x) = 2 J^T(x) z(x),    (38)

where J(x) = ∂z(x)/∂x^T is the Jacobian matrix. Next, we need to find the Hessian matrix. The (k, j) element of the Hessian matrix would be

[\nabla^2 F(x)]_{k,j} = \frac{\partial^2 F(x)}{\partial x_k \partial x_j} = 2 \sum_{i=1}^{N} \left\{ \frac{\partial z_i(x)}{\partial x_k} \frac{\partial z_i(x)}{\partial x_j} + z_i(x) \frac{\partial^2 z_i(x)}{\partial x_k \partial x_j} \right\}.    (39)
The Hessian matrix can then be expressed in matrix form:

\nabla^2 F(x) = 2 J^T(x) J(x) + 2 S(x),    (40)

where

S(x) = \sum_{i=1}^{N} z_i(x) \nabla^2 z_i(x).    (41)

If we assume S(x) is small, the approximated Hessian matrix becomes

\nabla^2 F(x) \cong 2 J^T(x) J(x).    (42)

By substituting Eq. (38) and Eq. (42) into Eq. (35), we obtain the Gauss–Newton method:

x_{k+1} = x_k - [J^T(x_k) J(x_k)]^{-1} J^T(x_k) z(x_k),    (43)

where z(x_k) and J(x_k) are the vector z(x) and the Jacobian matrix J(x), evaluated at x_k, respectively.

One problem with the Gauss–Newton method is that the approximate Hessian matrix J^T(x)J(x) may not be invertible. This problem can be overcome by using the matrix H̃ = J^T(x)J(x) + μI_n in place of J^T(x)J(x), where μ > 0. Increasing the parameter μ makes the matrix H̃ positive definite, and thus invertible. This leads to the Levenberg–Marquardt algorithm:

x_{k+1} = x_k - [J^T(x_k) J(x_k) + \mu_k I_n]^{-1} J^T(x_k) z(x_k).    (44)

The LM optimization algorithm requires us to calculate two terms: the vector z(x) and the Jacobian matrix J(x). By equating the performance index in Eq. (36) with Eq. (32), we have the vector

z(x) = \mathrm{vec}\,E, \quad \text{where} \quad E = A - G,    (45)

with N = S^M Q. Obtaining z(x) is thus simply the feedforward propagation. Computing the Jacobian matrix in neural networks is, however, more complicated; it was originally discussed in detail in [HaMe94]. Calculating the Jacobian matrix involves the calculation of the Marquardt sensitivities [HaDe96], using the backpropagation process. The following is a summary of the LM training algorithm.

Steps for LM training algorithm
1. Set k = 0. Initialize the network parameter vector x_0. Present the training set to the network.
Initialize μ_0 to a small value, e.g. 0.001. Set μ_max to a large value, e.g. 10^10. Set ϑ > 1, e.g. 10.
2. Compute z(x_k) and the Jacobian matrix J(x_k) using Eq. (45), and compute the gradient using Eq. (38). Also compute J_f(x_k). Terminate the process if J_f(x_k) or ||J^T(x_k) z(x_k)|| is less than its predefined threshold, or if μ_k ≥ μ_max.
3. While μ_k < μ_max, do the following:
a) Compute

x_{k+1}^{temp} = x_k - [J^T(x_k) J(x_k) + \mu_k I_n]^{-1} J^T(x_k) z(x_k).    (46)

b) Compute z(x_{k+1}^{temp}) and the Jacobian matrix J(x_{k+1}^{temp}). Compute J_f(x_{k+1}^{temp}).
c) If J_f(x_{k+1}^{temp}) ≥ J_f(x_k), set μ_k = ϑμ_k and go back to step 3. Otherwise, go to the next step.
d) If J_f(x_{k+1}^{temp}) < J_f(x_k), set μ_{k+1} = μ_k/ϑ and set x_{k+1} = x_{k+1}^{temp}. Then, set k = k + 1 and go back to step 2.
4. Terminate the process (in this case, termination is due to μ_k ≥ μ_max).

As the performance index J_f is only the sum of squared function errors, early stopping will again be used to prevent overfitting. We denote this algorithm, which minimizes the performance index J_f using LM optimization with early stopping, as the LM–ES method. In the next section, we will review the Gauss–Newton approximation to Bayesian regularization (GNBR) training algorithm.

GNBR training algorithm

Unlike the performance index for the BFGS–ES and LM–ES training algorithms, the performance index of the GNBR training method is a regularized performance index, as shown below:

F = \beta \sum_{q=1}^{Q} \sum_{k=1}^{S^M} (g_{k,q} - a_{k,q})^2 + \alpha \sum_{i=1}^{n} x_i^2,    (47)

where x_i is a network weight or bias, n is the total number of weights and biases, and β and α are scalar values weighting the importance of the two terms. The regularization term in the performance index is the sum of squares of the network weights and biases.

Regularization is another technique used to prevent overfitting. Although its purpose is the same as early stopping, it works in a different way.
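As a concrete illustration, the regularized index of Eq. (47) is straightforward to evaluate; the sketch below assumes the outputs and targets are stored as S^M × Q arrays (array shapes and names are ours):

```python
import numpy as np

def regularized_index(A, G, x, beta, alpha):
    """Regularized performance index F = beta*E_D + alpha*E_W (cf. Eq. (47)).

    A, G -- network outputs and targets, arrays of shape (S_M, Q)
    x    -- vector containing all n network weights and biases
    """
    E_D = np.sum((G - A) ** 2)  # sum squared function error
    E_W = np.sum(x ** 2)        # sum squared weights and biases
    return beta * E_D + alpha * E_W
```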
The regularized performance index forces the magnitudes of the network parameters to be small, which causes the network output to be smooth. That is, the regularization prevents steep fluctuations in the network response. A problem with using the regularized performance index is that the weighting factors β and α are unknown and problem-dependent. If α is too small relative to β, then we would still observe overfitting, as there is too little impact from the regularization term. In contrast, if α is too large relative to β, then the network output would be too smooth and would not approximate the function.

To overcome this problem, David J. C. MacKay proposed a method in [MacK92] that uses Bayes' rule to automatically choose the weighting factors, balancing the function approximation capability against the smoothness of the network output. [FoHa97] later combined MacKay's method with the Levenberg–Marquardt framework to obtain the Gauss–Newton approximation to Bayesian regularization, denoted GNBR. The concept of this algorithm can be explained in two parts: first, minimizing the performance index in Eq. (47); and second, choosing the values of β and α.

From Eq. (36) and Eq. (47), we can rewrite the performance index as

F = \beta z^T(x) z(x) + \alpha x^T x = \beta E_D + \alpha E_W.    (48)

Now, from Eq. (38), the gradient of the performance index can be written

\nabla F = \beta \nabla E_D + \alpha \nabla E_W = 2\beta J_D^T(x) z(x) + 2\alpha J_E^T(x) x.    (49)

Since E_D is in fact J_f in Eq. (32), the Jacobian matrix J_D(x) = J(x), which can be obtained by the same backpropagation process as in the Levenberg–Marquardt training algorithm. Since the Jacobian matrix J_E(x) = ∂x/∂x^T, it becomes J_E(x) = I_n. Therefore, Eq. (49) turns out to be

\nabla F = 2\beta J^T(x) z(x) + 2\alpha x.    (50)

Using Eq. (42) and Eq. (49), the Hessian can be approximated and written in matrix form:

\nabla^2 F(x) = \beta \nabla^2 E_D + \alpha \nabla^2 E_W \cong 2\beta J^T(x) J(x) + 2\alpha I_n.    (51)

By substituting Eq. (50) and Eq. (51) into Eq. (35), the update equation becomes

x_{k+1} = x_k - [\beta_k J^T(x_k) J(x_k) + (\alpha_k + \mu_k) I_n]^{-1} [\beta_k J^T(x_k) z(x_k) + \alpha_k x_k].    (52)

Now we have an update equation for the performance index in Eq. (47).
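A minimal sketch of this parameter update, assuming the Jacobian J and error vector z have already been computed by backpropagation (variable names are ours):

```python
import numpy as np

def gnbr_step(x, J, z, beta, alpha, mu):
    """One parameter update of Eq. (52):
    x_{k+1} = x_k - [beta J'J + (alpha + mu) I]^{-1} (beta J'z + alpha x).
    """
    n = x.size
    H = beta * (J.T @ J) + (alpha + mu) * np.eye(n)  # regularized LM Hessian
    g = beta * (J.T @ z) + alpha * x                 # gradient of Eq. (50), up to a factor of 2
    return x - np.linalg.solve(H, g)
```

With α = 0 and μ = 0, this reduces to the Gauss–Newton step of Eq. (43); for an error vector that is linear in x, a single such step lands on the least-squares solution.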
Next, we will explain MacKay's method for choosing the values of β and α. Under the Bayesian framework of MacKay [MacK92], the network parameters x are considered random variables. The posterior density of the network parameters can be written according to Bayes' rule:

P(x | D, \beta, \alpha, M) = \frac{P(D | x, \beta, M)\, P(x | \alpha, M)}{P(D | \beta, \alpha, M)},    (53)

where D represents the data set presented to the network, and M is the network model. The term P(D | x, β, M) is the likelihood function, the term P(x | α, M) is the prior density, and the term P(D | β, α, M) is named the evidence. The parameter β is related to the variance of the likelihood, and α is related to the variance of the prior density. If the noise in the model output and the prior density are assumed to follow Gaussian distributions, then the likelihood function and the prior density become

P(D | x, \beta, M) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D),    (54)

P(x | \alpha, M) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W),    (55)

respectively, where Z_D(β) = (π/β)^{N/2} and Z_W(α) = (π/α)^{n/2}, with N = S^M Q. Considering the evidence as a normalization factor, and substituting Eq. (54) and Eq. (55) into Eq. (53), we obtain the posterior density

P(x | D, \beta, \alpha, M) = \frac{1}{Z_F(\beta, \alpha)} \exp(-F(x)).    (56)

From Eq. (56), we can see that maximizing the posterior density is equivalent to minimizing the regularized objective function F.

Now, given the data, we are interested in what the values of β and α should be. This means we need to consider P(β, α | D, M). By using Bayes' rule, it becomes

P(\beta, \alpha | D, M) = \frac{P(D | \beta, \alpha, M)\, P(\beta, \alpha | M)}{P(D | M)}.    (57)

Suppose the prior density P(β, α | M) is uniform, which corresponds to the statement that we do not know what the values of β and α should be. Then maximizing the posterior P(β, α | D, M) is equivalent to maximizing the likelihood function P(D | β, α, M). However, note that the likelihood function in Eq. (57) is the evidence, i.e. the normalization factor, in Eq. (53). Therefore, the evidence in Eq. (53) can be solved for:

P(D | \beta, \alpha, M) = \frac{P(D | x, \beta, M)\, P(x | \alpha, M)}{P(x | D, \beta, \alpha, M)}.    (58)
By substituting Eq. (54) to Eq. (56) into Eq. (58), we obtain

P(D | \beta, \alpha, M) = \frac{Z_F(\beta, \alpha)}{Z_D(\beta)\, Z_W(\alpha)}.    (59)

From Eq. (59), the only term we do not know is Z_F(β, α). However, we can estimate it from a Taylor series expansion, by assuming that the objective function has a quadratic shape in a small region around the minimum point x^{MP}. At the minimum point, the gradient of the function is zero. Thus, the objective function approximated around x^{MP} is written as

F(x) \cong F(x^{MP}) + \frac{1}{2} (x - x^{MP})^T H^{MP} (x - x^{MP}),    (60)

where H = β∇²E_D + α∇²E_W is the Hessian matrix of F(x), and H^{MP} is the Hessian matrix evaluated at x^{MP}. Therefore, the posterior density in Eq. (56) can be written as

P(x | D, \beta, \alpha, M) \cong \frac{1}{Z_F(\beta, \alpha)} \exp\left( -\left\{ F(x^{MP}) + \frac{1}{2} (x - x^{MP})^T H^{MP} (x - x^{MP}) \right\} \right),    (61)

which is rewritten as

P(x | D, \beta, \alpha, M) \cong \left\{ \frac{1}{Z_F(\beta, \alpha)} \exp(-F(x^{MP})) \right\} \exp\left( -\frac{1}{2} (x - x^{MP})^T H^{MP} (x - x^{MP}) \right).    (62)

The multivariate Gaussian density with mean x^{MP} and covariance matrix (H^{MP})^{-1} is expressed as

P(x) = \frac{1}{(2\pi)^{n/2} |(H^{MP})^{-1}|^{1/2}} \exp\left( -\frac{1}{2} (x - x^{MP})^T H^{MP} (x - x^{MP}) \right).    (63)

By equating Eq. (62) and Eq. (63), we can solve for Z_F(β, α):

Z_F(\beta, \alpha) \cong \left[ (2\pi)^n\, |(H^{MP})^{-1}| \right]^{1/2} \exp(-F(x^{MP})).    (64)

Substituting Eq. (64) into Eq. (59), along with the values of Z_D(β) and Z_W(α) from Eq. (54) and Eq. (55), then taking the derivative of the log of Eq. (59) with respect to each of β and α and setting them to zero, produces

\beta^{MP} = \frac{N - \gamma}{2 E_D(x^{MP})} \quad \text{and} \quad \alpha^{MP} = \frac{\gamma}{2 E_W(x^{MP})},    (65)

where γ = n − 2α^{MP} tr[(H^{MP})^{-1}] is called the effective number of parameters. The parameter γ is a measure of how many parameters in the network are effectively used in reducing the error function, and it can range from zero to n.
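These re-estimation formulas are easy to state in code (a sketch with names of our choosing; in practice the trace of the inverse Hessian comes from the Gauss–Newton approximation of Eq. (51)):

```python
def reestimate_hyperparameters(tr_H_inv, alpha, E_D, E_W, N, n):
    """MacKay re-estimation of beta and alpha (cf. Eq. (65)).

    tr_H_inv -- trace of the inverse Hessian at the current minimum
    N        -- number of targets (S^M * Q)
    n        -- number of network parameters
    """
    gamma = n - 2.0 * alpha * tr_H_inv   # effective number of parameters
    beta_new = (N - gamma) / (2.0 * E_D)
    alpha_new = gamma / (2.0 * E_W)
    return gamma, beta_new, alpha_new
```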
Since the estimation of β and α requires the calculation of the Hessian matrix of the performance index F(x) at the minimum point x^{MP}, [FoHa97] proposed using the approximated Hessian readily available under the Levenberg–Marquardt framework, i.e. Eq. (51), leading to the GNBR training algorithm. The algorithm is summarized below.

Steps for GNBR training algorithm
1. Initialization: same as the Levenberg–Marquardt training algorithm. Initialize α_0 = 0 and β_0 = 1.
2. Take one step of the LM training algorithm to minimize the objective function in Eq. (47), by using Eq. (52).
3. Compute the effective number of parameters γ_k = n − 2α_k tr(H_k^{-1}), where the Hessian matrix H can be approximated by Eq. (51), and H_k is the Hessian evaluated at x_k.
4. Compute the new estimates for β and α:

\beta_{k+1} = \frac{N - \gamma_k}{2 E_D(x_k)} \quad \text{and} \quad \alpha_{k+1} = \frac{\gamma_k}{2 E_W(x_k)}.    (66)

Then, set k = k + 1.
5. Iterate steps 2) to 5) until μ_k ≥ μ_max, or until F(x_k) is less than its predefined threshold.

As the GNBR algorithm penalizes the magnitude of the network parameters as a technique to reduce the overfitting problem, we will not use a validation data set in this algorithm.

Recall that the goal of this research is to approximate both a function and its first-order derivatives using neural networks. [HoSt89] showed that multilayer feedforward neural networks can approximate any Borel measurable function. In the next section, a theoretical discussion of function and derivative approximation with neural networks will be reviewed.

Conditions for Function and Derivative Approximation

Recall that our goal is to approximate both a function and its first-order derivatives. This section will briefly discuss the theoretical conditions under which neural networks can simultaneously approximate both a function and its derivatives. There are several discussions of these conditions, such as [HoSt90], [Horn91], [Ito93] and [Pink99]; however, we will follow [Li96].
We first introduce notation. We let Z_+^R denote the lattice of nonnegative multi-integers in R dimensions. For m = (m_1, m_2, …, m_R) ∈ Z_+^R, we set |m| = m_1 + m_2 + … + m_R and

D^m = \frac{\partial^{|m|}}{\partial p_1^{m_1} \partial p_2^{m_2} \cdots \partial p_R^{m_R}}.    (67)

We also write m^1 ≤ m^2 if m_r^1 ≤ m_r^2 for all r = 1, 2, …, R. Given an open set Ω of ℜ^R (possibly Ω = ℜ^R), we write C^m(Ω) to denote the set of functions having continuous kth-order partial derivatives in Ω, for all k ∈ Z_+^R with k ≤ m. We write f ∈ Ĉ^m(K) to imply that, for a compact set K of ℜ^R, there is an open set Ω such that K ⊂ Ω and f ∈ C^m(Ω). We write a for the output of a two-layer neural network with transfer function f^1 in the hidden layer and a linear transfer function in the output layer, i.e. f^2 is linear.

Given a compact subset K of ℜ^R and a function g ∈ Ĉ^m(K) for m ∈ Z_+^R, [Li96] showed that if f^1 ∈ C^{|m|}(ℜ) and f^1 is not a polynomial, then D^k a of a two-layer network can uniformly and simultaneously approximate D^k g, for k ∈ Z_+^R and k ≤ m.

For example, [Li96] considered a function g on ℜ², given by

g(p_1, p_2) = (p_1 p_2)^{11/3} \sin[1/(p_1 p_2)] \quad \text{if } p_1, p_2 \ne 0; \quad \text{otherwise } g(p_1, p_2) = 0.

We can verify that ∂²g/∂p_1² and ∂²g/∂p_2² are discontinuous at the origin (0, 0); however, g, ∂g/∂p_1, ∂g/∂p_2 and ∂²g/∂p_1∂p_2 are continuous on ℜ². Therefore, g ∈ C^{(1,1)}(ℜ²). In this situation, a two-layer network can simultaneously and uniformly approximate g, ∂g/∂p_1, ∂g/∂p_2 and ∂²g/∂p_1∂p_2 on any compact set K containing (0, 0), provided that the transfer function f^1 in the hidden layer is nonpolynomial and f^1 ∈ C²(ℜ).

The typical transfer functions in the hidden layers and the output layer of neural networks used for function approximation are sigmoid and linear functions, respectively. Sigmoid functions are nonpolynomial, and they are of class C^∞ (infinitely differentiable, or smooth, functions); i.e. derivatives of all orders are continuous.
Therefore, given that the function and its first-order partial derivatives exist and are continuous, the standard two-layer neural network has the ability to simultaneously approximate the function and its first-order derivatives.

Next, we will introduce notation for first-order derivative values. Similar to the notation defined in Eq. (21) to Eq. (24), the derivative of the scalar function g_k with respect to p_r and with respect to p, evaluated at p_r = p_{r,q} and p = p_q, is written, respectively,

\frac{\partial g_{k,q}}{\partial p_{r,q}} \equiv \left. \frac{\partial g_k}{\partial p_r} \right|_{p_r = p_{r,q}}, \qquad \frac{\partial g_{k,q}}{\partial p^T} \equiv \left. \frac{\partial g_k}{\partial p^T} \right|_{p = p_q}.    (68)

The derivative of the vector function g with respect to p_r and with respect to p, evaluated at p_r = p_{r,q} and p = p_q, is denoted, respectively, by

\frac{\partial g_q}{\partial p_{r,q}} \equiv \left. \frac{\partial g}{\partial p_r} \right|_{p_r = p_{r,q}}, \qquad \frac{\partial g_q}{\partial p^T} \equiv \left. \frac{\partial g}{\partial p^T} \right|_{p = p_q}.    (69)

For batch mode, similar to Eq. (27) and Eq. (28), we will use the following block-diagonal notation:

\frac{\partial \mathrm{vec}G}{\partial (p_r)^T} = \mathrm{diag}\left( \frac{\partial g_1}{\partial p_{r,1}}, \frac{\partial g_2}{\partial p_{r,2}}, \ldots, \frac{\partial g_Q}{\partial p_{r,Q}} \right), \qquad \frac{\partial \mathrm{vec}G}{\partial (\mathrm{vec}P)^T} = \mathrm{diag}\left( \frac{\partial g_1}{\partial p_1^T}, \frac{\partial g_2}{\partial p_2^T}, \ldots, \frac{\partial g_Q}{\partial p_Q^T} \right).    (70)

Summary

In this chapter, we first stated the objective of the research: to develop procedures for approximating functions and their derivatives. Then, we introduced the operators and notation that will be used throughout the research, and provided the notation and background material for the multilayer feedforward neural network. A neural network learns to approximate a function through an optimization process; a combination of the objective function and the optimization process defines a distinct training algorithm. We discussed three existing training algorithms: BFGS–ES, LM–ES and GNBR. Each is capable of forcing a neural network to approximate a function in a different way. The BFGS–ES and LM–ES methods use early stopping as a technique to prevent overfitting, while the GNBR algorithm uses a Bayesian regularizer.
∂pr, q ∂gq ∂pr ∂g pr = pr, q ≡ pq ∂ T ∂gq ∂pT ∂g p = pq ≡ ∂vecG ∂(rp)T  ∂pr, 1 ∂g1 0 … 0 0 pr 2 ∂ , ∂g2 … 0 … … … 0 0 … ∂pr, Q ∂gQ = ∂vecG ∂(vecP)T  p1 ∂ T ∂g1 0 … 0 0 p2 ∂ T ∂g2 … 0 … … … 0 0 … pQ T ∂ ∂gQ = BFGS – ES LM – ES GNBR BFGS – ES LM – ES GNBR 39 Finally, the conditions under which a neural network can uniformly and simultaneously approximate a function and its derivatives were discussed. One of the conditions requires that the function and its firstorder derivatives be continuous. A second condition requires that the transfer function in the hidden layer of the neural network be sufficiently differentiable but not be a polynomial. For example, the typical sigmoid transfer function would be satisfactory. 40 CHAPTER 3 VALIDATIONRELATED METHODS Introduction Recall that our objective is to use a neural network to approximate a function and its firstorder derivatives. In this chapter, we will discuss two simple methods for accomplishing this task. These methods are similar to the standard training algorithms discussed in Chapter 2, but some modifications are applied to the early stopping technique. As the LevenbergMarquardt optimization is chosen to work with these two modified early stopping techniques, the two proposed methods are: 1. Modified validation performance measure ( ), and 2. Early stopping using the derivative information on the training set ( ). As we describe each method, the idea behind it will be discussed. Then, the simulation results for each method, based on the benchmark tests that are described in Chapter 7, will be shown. The results in this chapter can be compared with the simulation results obtained using the standard early stopping technique, which are shown in Chapter 7. Modified validation performance measure For the standard early stopping method (discussed in Chapter 2), the validation performance is measured by the sum squared function error. 
We proposed using a combination LM – ES1 LM – ES2 41 of the sum squared function error and the sum squared derivative error. In other words, the validation performance will be determined by , (71) where is the number of examples in the validation set, and is a scalar factor. If the validation error increases for a certain number of iterations, then the training will be terminated. The unknown parameter in the equation is , which we will vary to investigate its consequences to the approximation of neural networks. Another purpose of this parameter is to account for the fact that the scale of the derivative values can be much different from the function values. The procedure for training a neural network using this method is exactly the same as the standard training algorithm with the standard early stopping technique (e.g. or ), with two exceptions. First, we need to compute the derivative of the network output with respect to the network inputs; i.e. , for all , and this calculation is shown in Chapter 4. Second, the performance measure in the standard early stopping technique is now replaced by the new measure in Eq. (71). In this research, we choose to work with the modified validation performance measure. We denote this method . Note that when , is . EVd (ak, qV – gk, qV)2 k = 1 SM Σ qV = 1 QV Σ ρd ∂pr, qV ∂ak, qV ∂pr, qV ∂gk, qV – ⎝ ⎠ ⎜ ⎟ ⎛ ⎞2 r = 1 R Σ k = 1 SM Σ qV = 1 QV = + Σ QV ρd EVd ρd BFGS – ES LM – ES ∂aq pq ⁄ ∂ T q 1 2 … Q= , , , V LM – ES LM – ES1 ρd = 0 LM – ES1 LM – ES 42 The approximation accuracy obtained from for the simple analytic functions (which are introduced in Chapter 7) are shown in the next section. Note again that the procedure to perform the simulation is described in Chapter 6. Simulation results The approximation accuracy obtained from for the simple analytic functions are shown in the following tables. 
The results in Table 1, Table 2 and Table 3 can be compared with the results obtained from other algorithms in Table 9, Table 10 and Table 11 in Chapter 7, respectively. For ease of reference, we include the results obtained from LM–ES, i.e. the standard early stopping (ρ_d = 0). Note that RMSE_F^md and RMSE_D^md are defined in the simulation procedure in Chapter 7.

ρ_d in LM–ES1    RMSE_F^md Training    RMSE_F^md Test    RMSE_D^md Training    RMSE_D^md Test
0                2.48E-06              8.90E-04          6.42E-03              1.74E-02
1E-06            2.48E-06              8.90E-04          6.42E-03              1.77E-02
1E-04            2.50E-06              8.90E-04          6.42E-03              1.74E-02
1E-02            2.52E-06              8.90E-04          5.71E-03              1.38E-02
1E+00            3.00E-06              9.96E-04          6.51E-03              1.73E-02
1E+02            3.00E-06              9.96E-04          6.51E-03              1.73E-02
Table 1 Approximation accuracy on problem 1

From these results, we can see that the approximation accuracy obtained from LM–ES1 is similar to the results from LM–ES. We conclude that adding the derivative error term to the standard validation performance measure does not improve the approximation accuracy. In the next section, the simulation results obtained from the LM training algorithm with early stopping using the derivative information of the training set, i.e. LM–ES2, will be discussed.
ρ_d in LM–ES1    RMSE_F^md Training    RMSE_F^md Test    RMSE_D^md Training    RMSE_D^md Test
0                1.14E-05              1.16E-02          1.09E+00              1.94E+00
1E-06            1.14E-05              1.14E-02          1.09E+00              1.94E+00
1E-04            4.18E-07              7.44E-03          5.52E-01              1.45E+00
1E-02            9.04E-07              8.67E-03          9.69E-01              1.54E+00
1E+00            9.04E-07              8.67E-03          9.69E-01              1.54E+00
1E+02            9.04E-07              8.67E-03          9.69E-01              1.54E+00
Table 2 Approximation accuracy on problem 2

ρ_d in LM–ES1    RMSE_F^md Training    RMSE_F^md Test    RMSE_D^md Training    RMSE_D^md Test
0                6.72E-06              2.09E-04          5.67E-03              1.12E-02
1E-06            6.72E-06              2.09E-04          5.67E-03              1.12E-02
1E-04            9.93E-06              2.80E-04          1.41E-02              2.69E-02
1E-02            1.14E-05              2.96E-04          1.38E-02              2.69E-02
1E+00            1.14E-05              2.97E-04          1.38E-02              2.69E-02
1E+02            1.14E-05              2.97E-04          1.38E-02              2.69E-02
Table 3 Approximation accuracy on problem 3

Early stopping using the derivative information of the training set

In regular early stopping, the training process is terminated when the validation performance increases. Recall that the standard validation performance measure is

E_V = \sum_{q_V=1}^{Q_V} \sum_{k=1}^{S^M} (a_{k,q_V} - g_{k,q_V})^2.    (72)

The function error term e_{k,q_V} ≡ a_{k,q_V} − g_{k,q_V} at the validation point p_{q_V} can be approximated by a first-order Taylor series expansion about the nearest training point p_{t_V}:

e_{k,q_V} \approx e_{k,t_V} + \sum_{r=1}^{R} \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}} (p_{r,t_V} - p_{r,q_V}), \quad \text{where} \quad \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}} = \frac{\partial a_{k,t_V}}{\partial p_{r,t_V}} - \frac{\partial g_{k,t_V}}{\partial p_{r,t_V}}.    (73)

By inserting Eq. (73) into Eq. (72), we obtain

E_V \approx \sum_{q_V} \sum_{k} \left\{ e_{k,t_V}^2 + 2 e_{k,t_V} \sum_{r} \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}} (p_{r,t_V} - p_{r,q_V}) + \left( \sum_{r} \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}} (p_{r,t_V} - p_{r,q_V}) \right)^2 \right\}.    (74)

We may ignore the second term in Eq. (74) if we assume an even distribution around zero of the training error terms, i.e. of e_{k,t_V} and ∂e_{k,t_V}/∂p_{r,t_V} for all k = 1, 2, …, S^M, t_V = 1, 2, …, Q_V and r = 1, 2, …, R; the sum of these error terms is then approximately zero. The third term in Eq. (74) can be further expanded:

\left( \sum_{r} \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}} (p_{r,t_V} - p_{r,q_V}) \right)^2 = \sum_{r} \left( \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}} (p_{r,t_V} - p_{r,q_V}) \right)^2 + \sum_{r} \sum_{r' \ne r} \left( \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}} (p_{r,t_V} - p_{r,q_V}) \right) \left( \frac{\partial e_{k,t_V}}{\partial p_{r',t_V}} (p_{r',t_V} - p_{r',q_V}) \right).    (75)
From Eq. (75), the validation performance measure can then be approximated by

E_V \approx \sum_{q_V} \sum_{k} e_{k,t_V}^2 + \sum_{q_V} \sum_{k} \sum_{r} \left( \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}} (p_{r,t_V} - p_{r,q_V}) \right)^2 + \sum_{q_V} \sum_{k} \sum_{r} \sum_{r' \ne r} \left( \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}} (p_{r,t_V} - p_{r,q_V}) \right) \left( \frac{\partial e_{k,t_V}}{\partial p_{r',t_V}} (p_{r',t_V} - p_{r',q_V}) \right).    (76)

It should be noted that the differences between the two inputs, i.e. p_{r,t_V} − p_{r,q_V} for all r = 1, 2, …, R, are constant. In addition, by the assumption of an even distribution around zero of the training error terms, the last term in Eq. (76) is approximately zero. Thus, Eq. (76) reduces to

E_V \approx \sum_{q_V} \sum_{k} e_{k,t_V}^2 + \sum_{q_V} \sum_{k} \sum_{r} \left( \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}} (p_{r,t_V} - p_{r,q_V}) \right)^2.    (77)

Again, since the differences between the two inputs are constant, the validation performance measure E_V is proportional to

E_V \propto \sum_{t_V} \sum_{k} e_{k,t_V}^2 + \sum_{t_V} \sum_{k} \sum_{r} \left( \frac{\partial e_{k,t_V}}{\partial p_{r,t_V}} \right)^2,    (78)

where the summation index q_V = 1, 2, …, Q_V has been replaced by t_V = 1, 2, …, Q_V. From Eq. (78), we can see that the first term gets smaller as the network is trained. Therefore, an increase in the performance measure E_V may be estimated by an increase in the second term, which is the sum of squared derivative errors on the training set. This seems to make sense, as large derivative errors on the training data will result in large function errors on the validation data. Therefore, if we define a new validation performance measure whose value depends only on the derivative error of the training set, its increase should imply overfitting, and we would then want to terminate the training process.
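Such a measure, the sum of squared derivative errors over the training data, can be computed directly once the network input–output derivatives are available. A minimal sketch (array shapes and names are assumptions of ours):

```python
import numpy as np

def derivative_error_measure(dA, dG):
    """Sum of squared derivative errors over the training set.

    dA, dG -- derivatives of the network outputs and of the targets with
    respect to each input, arrays of shape (S_M, R, Q): one entry per
    output element k, input element r, and example q.
    """
    return float(np.sum((dA - dG) ** 2))
```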
The new validation measure is written as

\tilde{E}_V = \sum_{q=1}^{Q} \sum_{k=1}^{S^M} \sum_{r=1}^{R} \left( \frac{\partial e_{k,q}}{\partial p_{r,q}} \right)^2,    (79)

where Q is the number of examples available, i.e. the number of data points in the training and validation sets combined. Since the new validation measure uses the derivative information of all the data, we can use all of the function information for training. This means that we use all of the data in the training process, while overfitting is prevented by the new validation measure Ẽ_V. An advantage of having more training examples is better generalization; see [GaWh92] and [AtPa97]. The following summarizes the method.

Steps for early stopping using the derivative information of the training set
1. Change the validation performance measure for early stopping so that it follows Eq. (79).
2. Use all of the available data for the training process, which can be performed by any standard training algorithm, e.g. BFGS or LM. In other words, there is no division of the data into training and validation sets.
3. The training process is terminated when the new validation measure, i.e. Eq. (79), increases consecutively for a certain number of iterations.

We choose the Levenberg–Marquardt (LM) training algorithm to work with this modified validation performance measure, and we denote the method LM–ES2. The method requires the calculation of the derivatives of the network outputs with respect to the network inputs (∂a_q/∂p_q^T for all q = 1, 2, …, Q), and this calculation is shown in Chapter 4. In the next section, we will provide the simulation results obtained from LM–ES2 on the simple analytic problems (introduced in Chapter 7). Recall again that the procedure used to perform the simulations is provided in Chapter 7.

Simulation results

In this section, the simulation results for LM–ES2 on the simple analytic problems are presented in Table 4, Table 5 and Table 6. For ease of reference, we also include the results obtained from LM–ES in the tables.
The results in Table 4, Table 5 and Table 6 can be compared with Table 1, Table 2 and Table 3 for LM–ES1, and with Table 9, Table 10 and Table 11 in Chapter 7 for the other training algorithms.

Training Algorithm    RMSE_F^md Training    RMSE_F^md Test    RMSE_D^md Training    RMSE_D^md Test
LM–ES                 2.48E-06              8.90E-04          6.42E-03              1.74E-02
LM–ES2                2.24E-05              3.67E-04          4.78E-03              8.43E-03
Table 4 Approximation accuracy on problem 1

Training Algorithm    RMSE_F^md Training    RMSE_F^md Test    RMSE_D^md Training    RMSE_D^md Test
LM–ES                 1.14E-05              1.16E-02          1.09E+00              1.94E+00
LM–ES2                1.76E-05              6.44E-03          2.40E-01              8.23E-01
Table 5 Approximation accuracy on problem 2

Training Algorithm    RMSE_F^md Training    RMSE_F^md Test    RMSE_D^md Training    RMSE_D^md Test
LM–ES                 6.72E-06              2.09E-04          5.67E-03              1.12E-02
LM–ES2                1.16E-05              1.82E-04          7.66E-03              1.41E-02
Table 6 Approximation accuracy on problem 3

From these results, we can see that LM–ES2 seems to provide better results than the standard LM–ES algorithm for both the function and the derivatives in problem 1 and problem 2. However, this is not the case for problem 3, where the derivative error obtained from LM–ES2 was larger than that from LM–ES. We will analyze the LM–ES2 algorithm in the following section, in order to understand why it does not consistently yield better results than LM–ES, even though its training set is larger.

Algorithm analysis

Recall that, for LM–ES2, overfitting is prevented by terminating the training process once the derivative error over all the training data increases. An advantage of this method is that the number of examples in the training process increases, by moving the examples of the validation set into the training set. The increase in training examples should lead to better generalization (see [GaWh92] and [AtPa97]). We analyzed the LM–ES2 method by training a network with the LM–ES algorithm, while also monitoring the values of E_V (see Eq. (72)) and Ẽ_V (see Eq. (79)) during the optimization process.
The following figure shows the training record.

[Figure 5: A training record showing $\tilde{E}_V$ versus $E_V$; horizontal axis: iteration (0 to 200), vertical axis logarithmic (approximately $10^{-8}$ to $10^{4}$).]

From Figure 5, we can see that the derivative error of the training set, $\tilde{E}_V$, increased (at around the 125th iteration) sooner than the validation performance measure $E_V$ did. This means that an increase in $\tilde{E}_V$ does not directly imply an increase in $E_V$. This contradicts the assumption we made in the previous section, where we assumed that the increase in $E_V$ could be estimated by the increase in $\tilde{E}_V$. The reason behind this lies in Eq. (78). In Eq. (78), we can see that the regular validation measure $E_V$ is proportional to two terms; the first is the squared training function error, and $\tilde{E}_V$ is the second. This means that even though the value of $\tilde{E}_V$ increases, if the reduction in the training function error counts for more, the regular validation measure will still be lower. From this fact, we conclude that the regular validation measure cannot be exactly estimated by the training derivative error, since one more factor (i.e. the training function error) also counts.

Summary

Recall that our objective is to approximate a function and its first-order derivatives using neural networks. Two new methods were proposed in this chapter: LM-ES1 and LM-ES2. These methods are similar to the standard training algorithms; however, changes were made in the early stopping technique. In the LM-ES1 method, the validation performance measure was changed from the squared function error to a combination of the squared function and derivative errors. The simulation results showed that the effectiveness of this new validation measure is similar to that of standard early stopping. In the second method (LM-ES2), the validation performance measure was changed to the squared derivative error of the training set. In this method, dividing the data into training and validation sets is no longer needed, unlike in standard early stopping.
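The LM-ES2 stopping rule summarized above can be sketched in a few lines. The following is an illustrative skeleton, not the dissertation's implementation: `train_step` and `derivative_error` are hypothetical callables (one optimization step, and the squared derivative error over the full training set), and the rule stops at the first increase of that error, as described.

```python
def train_with_derivative_stopping(train_step, derivative_error, max_iter=1000):
    """Illustrative LM-ES2-style early stopping: train on ALL data (no
    validation split) and stop as soon as the derivative error over the
    training set starts to increase, keeping the previous parameters."""
    best_params = None
    prev_err = float("inf")
    for it in range(max_iter):
        params = train_step()             # one optimization step (hypothetical)
        err = derivative_error(params)    # training derivative error
        if err > prev_err:                # first increase -> terminate
            return best_params, it
        best_params, prev_err = params, err
    return best_params, max_iter
```

As the chapter notes, a rule this eager can terminate too soon, since the training derivative error may rise before the true validation measure does.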
The simulation results showed that the new validation measure sometimes terminates the training process too soon (i.e. sooner than standard early stopping would), causing a worse approximation.

CHAPTER 4

GRADIENT-BASED COMBINED FUNCTION AND DERIVATIVE APPROXIMATION

Introduction

Recall that the objective of this research is to use neural networks to approximate functions and their first-order derivatives. In this chapter, we will propose a new training algorithm to perform this task. This algorithm is designed to work with any gradient-based optimization method (e.g. steepest descent, conjugate gradient, BFGS, etc.). We call this algorithm the Combined Function and Derivative Approximation (CFDA) algorithm. We will start this chapter by introducing the performance index used in the CFDA method and will derive two approaches for the gradient calculation. At the end of the chapter, we will provide some examples to illustrate how fast the new training algorithm is in comparison with the standard algorithm.

Gradient-Based Combined Function and Derivative Approximation

In Chapter 2, the conditions under which the neural network can approximate both a function and its first-order derivatives were discussed. We will introduce the performance index for the CFDA method, assuming these conditions are satisfied. Then, we will focus on the derivation of the gradient of the performance index, which is required for any gradient-based optimization method. We will develop two approaches for the gradient calculation. The first approach assumes the gradient computation is in batch mode, i.e. all data are used at the same time, and this is performed by arranging information in matrices. This may be necessary in some programming languages in order to speed up the calculation. The second method, however, offers a trade-off between the computation time and the required memory.
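Before the formal derivation, the combined performance index (a weighted sum of squared function errors and squared derivative errors, formalized as Eq. (80) in the next section) can be sketched numerically. The names, array shapes, and the normalization by the count of available derivative targets are illustrative assumptions, not the dissertation's code:

```python
import numpy as np

def combined_index(a, g, da, dg, phi, rho):
    """Sketch of J = J_f + rho*J_d (cf. Eq. (80)).
    a, g   : (S, Q)    network outputs / function targets
    da, dg : (S, R, Q) output / target derivatives w.r.t. each input
    phi    : (S, R, Q) 1 where the target derivative is available, else 0
    rho    : relative weight of the derivative term
    """
    S, Q = a.shape
    Jf = np.sum((a - g) ** 2) / (Q * S)       # mean squared function error
    n = phi.sum()                             # number of available derivative targets
    # Normalizing by n is a simplification of the Qd*Sd^M*Rd factor in Eq. (80).
    Jd = np.sum(phi * (da - dg) ** 2) / n if n else 0.0
    return Jf + rho * Jd
```

With `rho = 0` the index reduces to the standard mean squared error, as noted for Eq. (80).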
Performance Index

Assume we want to approximate the function $g$, which maps a subset in $\Re^R$ to a subset in $\Re^{S^M}$, and its first-order derivatives, by an $M$-layer neural network (with the conditions discussed in Chapter 2). Also suppose $g_k \in C^1$ for all $k = 1, 2, \ldots, S^M$. The proposed performance index is written as:

$$J = J_f + \rho J_d = \frac{1}{Q S^M} \sum_{q=1}^{Q} \sum_{k=1}^{S^M} \{ a_{k,q} - g_{k,q} \}^2 + \frac{\rho}{Q_d S_d^M R_d} \sum_{q=1}^{Q} \sum_{k=1}^{S^M} \sum_{r=1}^{R} \varphi_{k,r}(q) \left( \frac{\partial a_{k,q}}{\partial p_{r,q}} - \frac{\partial g_{k,q}}{\partial p_{r,q}} \right)^2, \quad (80)$$

where the term $\rho J_d$ is added to form the new performance index. $\rho$ is a scalar value controlling how important the term $J_d$ is, relative to the term $J_f$. If $\rho = 0$, the new performance index reduces to the standard performance index we discussed in Chapter 2. The function $\varphi_{k,r}(q)$ in the performance index is introduced to cope with the situation where the terms $\partial g_{k,q} / \partial p_{r,q}$ are not available. The function is defined below:

$$\varphi_{k,r}(q) = \begin{cases} 1 & ; \ \partial g_{k,q} / \partial p_{r,q} \text{ is available,} \\ 0 & ; \text{ otherwise.} \end{cases} \quad (81)$$

Minimizing the performance index will force the neural network to simultaneously approximate both the function and its first-order derivatives. We will present two approaches for calculating the gradient of the performance index. The first puts all the calculations in matrices for batch operation, and the second offers a trade-off between the computation speed and the required memory. The derivations for these two approaches will be shown in the next section.

Gradient Calculation: Batch Operation

The batch mode operation will be discussed in this section. To minimize the performance index using gradient-based optimization methods, we need to compute the derivative of the performance index with respect to each network parameter. This means that we need to compute $\partial J / \partial w_{i,j}^m$ and $\partial J / \partial b_i^m$, where $i = 1, 2, \ldots, S^m$ and $j = 1, 2, \ldots, S^{m-1}$. Note that we will show only how to compute $\partial J_d / \partial w_{i,j}^m$ and $\partial J_d / \partial b_i^m$, since the terms $\partial J_f / \partial w_{i,j}^m$ and $\partial J_f / \partial b_i^m$ can be computed using the standard backpropagation algorithm [HaDe96]. From Eq.
(80), by taking the derivative of $J_d$ with respect to $w_{i,j}^m$ (note that the constant term $1/(Q_d S_d^M R_d)$ is temporarily dropped from the term $J_d$ for simplicity), we have:

$$\frac{\partial J_d}{\partial w_{i,j}^m} = \frac{\partial}{\partial w_{i,j}^m}\left( \sum_{q=1}^{Q} \sum_{k=1}^{S^M} \sum_{r=1}^{R} \varphi_{k,r}(q) \left( \frac{\partial a_{k,q}}{\partial p_{r,q}} - \frac{\partial g_{k,q}}{\partial p_{r,q}} \right)^2 \right) = 2 \sum_{q=1}^{Q} \sum_{k=1}^{S^M} \sum_{r=1}^{R} \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \frac{\partial}{\partial w_{i,j}^m}\left( \frac{\partial e_{k,q}}{\partial p_{r,q}} \right), \quad (82)$$

where

$$\frac{\partial e_{k,q}}{\partial p_{r,q}} \equiv \varphi_{k,r}(q) \left( \frac{\partial a_{k,q}}{\partial p_{r,q}} - \frac{\partial g_{k,q}}{\partial p_{r,q}} \right). \quad (83)$$

Since the term $\partial e_{k,q} / \partial p_{r,q}$ can be computed with standard backpropagation, we will focus on the computation of the term $\partial / \partial w_{i,j}^m (\partial e_{k,q} / \partial p_{r,q})$. Note that

$$\frac{\partial}{\partial w_{i,j}^m}\left( \frac{\partial e_{k,q}}{\partial p_{r,q}} \right) = \frac{\partial}{\partial p_{r,q}}\left( \frac{\partial e_{k,q}}{\partial w_{i,j}^m} \right). \quad (84)$$

From Eq. (84), we can use the chain rule of calculus to compute the term $\partial e_{k,q} / \partial w_{i,j}^m$. This is also a part of the standard backpropagation [HaDe96], which can be computed as follows:

$$\frac{\partial e_{k,q}}{\partial w_{i,j}^m} = \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \times \frac{\partial n_{i,q}^m}{\partial w_{i,j}^m}. \quad (85)$$

The term $\partial n_{i,q}^m / \partial w_{i,j}^m$ can be calculated using Eq. (14). Thus, Eq. (85) becomes

$$\frac{\partial e_{k,q}}{\partial w_{i,j}^m} = \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \times a_{j,q}^{m-1}. \quad (86)$$

From Eq. (86), Eq. (84) becomes

$$\frac{\partial}{\partial w_{i,j}^m}\left( \frac{\partial e_{k,q}}{\partial p_{r,q}} \right) = \frac{\partial}{\partial p_{r,q}}\left( \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \times a_{j,q}^{m-1} \right) = \frac{\partial}{\partial p_{r,q}}\left( \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \right) \times a_{j,q}^{m-1} + \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \times \frac{\partial a_{j,q}^{m-1}}{\partial p_{r,q}}. \quad (87)$$

Using Eq. (87), Eq. (82) becomes

$$\frac{\partial J_d}{\partial w_{i,j}^m} = 2 \left\{ \sum_q \sum_k \sum_r \frac{\partial e_{k,q}}{\partial p_{r,q}} \frac{\partial}{\partial p_{r,q}}\left( \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \right) a_{j,q}^{m-1} + \sum_q \sum_k \sum_r \frac{\partial e_{k,q}}{\partial p_{r,q}} \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \frac{\partial a_{j,q}^{m-1}}{\partial p_{r,q}} \right\}. \quad (88)$$

By rearranging the summations, we obtain

$$\frac{\partial J_d}{\partial w_{i,j}^m} = 2 \left\{ \sum_q a_{j,q}^{m-1} \left( \sum_k \sum_r \frac{\partial e_{k,q}}{\partial p_{r,q}} \frac{\partial}{\partial p_{r,q}}\left( \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \right) \right) + \sum_q \sum_r \frac{\partial a_{j,q}^{m-1}}{\partial p_{r,q}} \left( \sum_k \frac{\partial e_{k,q}}{\partial p_{r,q}} \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \right) \right\}. \quad (89)$$

To further compute Eq. (89), define

$$v_{i,q}^m \equiv \sum_k \sum_r \frac{\partial e_{k,q}}{\partial p_{r,q}} \frac{\partial}{\partial p_{r,q}}\left( \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \right), \quad (90)$$

and

$$u_{q;i,r}^m \equiv \sum_k \frac{\partial e_{k,q}}{\partial p_{r,q}} \frac{\partial e_{k,q}}{\partial n_{i,q}^m}. \quad (91)$$
By using Eq. (90) and Eq. (91), Eq. (89) becomes (with the constant term $1/(Q_d S_d^M R_d)$ returned):

$$\frac{\partial J_d}{\partial w_{i,j}^m} = \frac{2}{Q_d S_d^M R_d} \left( \sum_q \left\{ a_{j,q}^{m-1} v_{i,q}^m \right\} + \sum_q \sum_r \left\{ \frac{\partial a_{j,q}^{m-1}}{\partial p_{r,q}} u_{q;i,r}^m \right\} \right). \quad (92)$$

In matrix form, Eq. (92) can also be written as

$$\frac{\partial J_d}{\partial W^m} = \frac{2}{Q_d S_d^M R_d} \sum_q \left\{ v_q^m (a_q^{m-1})^T + U_q^m \left( \frac{\partial a_q^{m-1}}{\partial p_q^T} \right)^T \right\}, \quad (93)$$

where the $S^m \times 1$ vector $v_q^m$ consists of the terms $v_{i,q}^m$, for all $i = 1, 2, \ldots, S^m$, and the $S^m \times R$ matrix $U_q^m$ consists of the terms $u_{q;i,r}^m$, for all $i = 1, 2, \ldots, S^m$ and $r = 1, 2, \ldots, R$. To write Eq. (92) in batch mode, we first define the $S^m Q \times RQ$ matrix $U^m$:

$$U^m \equiv \begin{bmatrix} U_1^m & 0 & \cdots & 0 \\ 0 & U_2^m & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & U_Q^m \end{bmatrix}. \quad (94)$$

By Eq. (94), Eq. (92) can be rewritten as

$$\frac{\partial J_d}{\partial W^m} = \frac{2}{Q_d S_d^M R_d} \left\{ V^m (A^{m-1})^T + (1_{1 \times Q} \otimes I_{S^m}) \times U^m \times \left( \frac{\partial \mathrm{vec}\,A^{m-1}}{\partial (\mathrm{vec}\,P)^T} \right)^T \times (1_{Q \times 1} \otimes I_{S^{m-1}}) \right\}, \quad (95)$$

where the $S^m \times Q$ matrix $V^m$ is

$$V^m = \begin{bmatrix} v_1^m & v_2^m & \cdots & v_Q^m \end{bmatrix}. \quad (96)$$

Note that

$$(1_{1 \times Q} \otimes I_{S^m}) \times U^m = \begin{bmatrix} U_1^m & U_2^m & \cdots & U_Q^m \end{bmatrix}, \quad (97)$$

and

$$\left( \frac{\partial \mathrm{vec}\,A^{m-1}}{\partial (\mathrm{vec}\,P)^T} \right)^T \times (1_{Q \times 1} \otimes I_{S^{m-1}}) = \begin{bmatrix} \frac{\partial a_1^{m-1}}{\partial p_1^T} & \frac{\partial a_2^{m-1}}{\partial p_2^T} & \cdots & \frac{\partial a_Q^{m-1}}{\partial p_Q^T} \end{bmatrix}^T \quad (98)$$

are the $S^m \times RQ$ and $RQ \times S^{m-1}$ matrices, respectively. Note further that Eq. (83) can be written in batch mode as:

$$\frac{\partial \mathrm{vec}\,E}{\partial (\mathrm{vec}\,P)^T} = \Phi \bullet \left\{ \frac{\partial \mathrm{vec}\,A}{\partial (\mathrm{vec}\,P)^T} - \frac{\partial \mathrm{vec}\,G}{\partial (\mathrm{vec}\,P)^T} \right\} = \begin{bmatrix} \frac{\partial e_1}{\partial p_1^T} & 0 & \cdots & 0 \\ 0 & \frac{\partial e_2}{\partial p_2^T} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{\partial e_Q}{\partial p_Q^T} \end{bmatrix}. \quad (99)$$

The $S^M Q \times RQ$ matrix $\Phi$ is defined as:

$$\Phi \equiv \begin{bmatrix} \Phi_1 & 0 & \cdots & 0 \\ 0 & \Phi_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Phi_Q \end{bmatrix}, \quad (100)$$

where $\Phi_q$ is the $S^M \times R$ matrix consisting of the elements $\varphi_{k,r}(q)$, for all $k = 1, 2, \ldots, S^M$ and $r = 1, 2, \ldots, R$.
In addition to $\partial J_d / \partial w_{i,j}^m$, we also need to compute the derivative of the performance index with respect to the biases, i.e. $\partial J_d / \partial b_i^m$. In Eq. (82), the term $\partial / \partial w_{i,j}^m (\partial e_{k,q} / \partial p_{r,q})$ is changed to $\partial / \partial b_i^m (\partial e_{k,q} / \partial p_{r,q})$, and using the fact that

$$\frac{\partial}{\partial b_i^m}\left( \frac{\partial e_{k,q}}{\partial p_{r,q}} \right) = \frac{\partial}{\partial p_{r,q}}\left( \frac{\partial e_{k,q}}{\partial b_i^m} \right), \quad (101)$$

thus

$$\frac{\partial J_d}{\partial b_i^m} = \frac{2}{Q_d S_d^M R_d} \sum_{q=1}^{Q} \sum_{k=1}^{S^M} \sum_{r=1}^{R} \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \frac{\partial}{\partial p_{r,q}}\left( \frac{\partial e_{k,q}}{\partial b_i^m} \right). \quad (102)$$

As in Eq. (85), the term $\partial e_{k,q} / \partial b_i^m$ can be computed as

$$\frac{\partial e_{k,q}}{\partial b_i^m} = \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \times \frac{\partial n_{i,q}^m}{\partial b_i^m}, \quad (103)$$

and by Eq. (14), this reduces to

$$\frac{\partial e_{k,q}}{\partial b_i^m} = \frac{\partial e_{k,q}}{\partial n_{i,q}^m}. \quad (104)$$

Thus, Eq. (102) becomes

$$\frac{\partial J_d}{\partial b_i^m} = \frac{2}{Q_d S_d^M R_d} \sum_q \sum_k \sum_r \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \frac{\partial}{\partial p_{r,q}}\left( \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \right). \quad (105)$$

From Eq. (90), Eq. (105) reduces to

$$\frac{\partial J_d}{\partial b_i^m} = \frac{2}{Q_d S_d^M R_d} \sum_q v_{i,q}^m. \quad (106)$$

Therefore, the $S^m \times 1$ vector $\partial J_d / \partial b^m$ can be written as

$$\frac{\partial J_d}{\partial b^m} = \frac{2}{Q_d S_d^M R_d} \sum_q v_q^m = \frac{2}{Q_d S_d^M R_d} \{ V^m \times 1_{Q \times 1} \}. \quad (107)$$

From Eq. (95) and Eq. (107), we need to calculate the elements of the matrices $\partial \mathrm{vec}\,A^m / \partial (\mathrm{vec}\,P)^T$, $U^m$ and $V^m$. We will first show how to compute $\partial \mathrm{vec}\,A^m / \partial (\mathrm{vec}\,P)^T$, followed by $U^m$, and finally $V^m$. We will break these calculations into three sections.

I. Calculation of $\partial \mathrm{vec}\,A^m / \partial (\mathrm{vec}\,P)^T$

An element of this matrix is $\partial a_{j,q}^m / \partial p_{r,q}$. From Eq. (14), by using the chain rule of calculus, we obtain

$$\frac{\partial a_{j,q}^m}{\partial p_{r,q}} = \frac{\partial a_{j,q}^m}{\partial n_{j,q}^m} \times \frac{\partial n_{j,q}^m}{\partial p_{r,q}}. \quad (108)$$

Using Eq. (14), Eq. (108) becomes

$$\frac{\partial a_{j,q}^m}{\partial p_{r,q}} = \dot{f}^m(n_{j,q}^m) \sum_{l=1}^{S^{m-1}} \left( w_{j,l}^m \times \frac{\partial a_{l,q}^{m-1}}{\partial p_{r,q}} \right), \quad (109)$$

where

$$\dot{f}^m(n_{j,q}^m) = \frac{\partial a_{j,q}^m}{\partial n_{j,q}^m}. \quad (110)$$

Eq. (109) can be written in matrix form as

$$\frac{\partial a_q^m}{\partial p_q^T} = \dot{F}^m(n_q^m) W^m \frac{\partial a_q^{m-1}}{\partial p_q^T}, \quad (111)$$
where $\dot{F}^m(n_q^m)$ is the $S^m \times S^m$ matrix defined below:

$$\dot{F}^m(n_q^m) \equiv \frac{\partial a_q^m}{\partial (n_q^m)^T} = \begin{bmatrix} \dot{f}^m(n_{1,q}^m) & 0 & \cdots & 0 \\ 0 & \dot{f}^m(n_{2,q}^m) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \dot{f}^m(n_{S^m,q}^m) \end{bmatrix}. \quad (112)$$

In batch mode, we have

$$\frac{\partial \mathrm{vec}\,A^m}{\partial (\mathrm{vec}\,N^m)^T} = \begin{bmatrix} \dot{F}^m(n_1^m) & 0 & \cdots & 0 \\ 0 & \dot{F}^m(n_2^m) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \dot{F}^m(n_Q^m) \end{bmatrix}. \quad (113)$$

Thus, from Eq. (111), the matrix $\partial \mathrm{vec}\,A^m / \partial (\mathrm{vec}\,P)^T$ can be computed as

$$\frac{\partial \mathrm{vec}\,A^m}{\partial (\mathrm{vec}\,P)^T} = \frac{\partial \mathrm{vec}\,A^m}{\partial (\mathrm{vec}\,N^m)^T} \times (I_Q \otimes W^m) \times \frac{\partial \mathrm{vec}\,A^{m-1}}{\partial (\mathrm{vec}\,P)^T}. \quad (114)$$

From Eq. (114), we can see that computing $\partial \mathrm{vec}\,A^m / \partial (\mathrm{vec}\,P)^T$ requires the calculation of $\partial \mathrm{vec}\,A^{m-1} / \partial (\mathrm{vec}\,P)^T$. Thus, we need to initialize $\partial \mathrm{vec}\,A^0 / \partial (\mathrm{vec}\,P)^T$ (i.e. $m = 0$). From the fact that $a_{j,q}^0$ is the input $p_{j,q}$ (thus $S^0 = R$), we can compute $\partial a_{j,q}^0 / \partial p_{r,q}$ as follows:

$$\frac{\partial a_{j,q}^0}{\partial p_{r,q}} = \begin{cases} 1 & ; \text{ if } j = r, \\ 0 & ; \text{ if } j \neq r. \end{cases} \quad (115)$$

Thus, the $R \times R$ matrix $\partial a_q^0 / \partial p_q^T$ is

$$\frac{\partial a_q^0}{\partial p_q^T} = \frac{\partial p_q}{\partial p_q^T} = I_R. \quad (116)$$

Therefore, in batch mode, the $RQ \times RQ$ matrix $\partial \mathrm{vec}\,A^0 / \partial (\mathrm{vec}\,P)^T$ becomes

$$\frac{\partial \mathrm{vec}\,A^0}{\partial (\mathrm{vec}\,P)^T} = \begin{bmatrix} I_R & 0 & \cdots & 0 \\ 0 & I_R & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & I_R \end{bmatrix} = I_{RQ}. \quad (117)$$

This completes the calculation for the matrix $\partial \mathrm{vec}\,A^m / \partial (\mathrm{vec}\,P)^T$. Next, we will show the calculation for $U^m$.

II. Calculation of $U^m$

We begin by expanding the term $\partial e_{k,q} / \partial n_{i,q}^m$ in the term $u_{q;i,r}^m$ in Eq. (91). Using the chain rule of calculus:

$$\frac{\partial e_{k,q}}{\partial n_{i,q}^m} = \sum_{l=1}^{S^{m+1}} \left( \frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}} \right) \left( \frac{\partial n_{l,q}^{m+1}}{\partial a_{i,q}^m} \right) \left( \frac{\partial a_{i,q}^m}{\partial n_{i,q}^m} \right). \quad (118)$$

From Eq. (14), we obtain

$$\frac{\partial e_{k,q}}{\partial n_{i,q}^m} = \sum_l \left( \frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}} \right) w_{l,i}^{m+1} \dot{f}^m(n_{i,q}^m). \quad (119)$$

If we substitute this expression into Eq. (91) and rearrange the summation, we find

$$u_{q;i,r}^m = \sum_k \left\{ \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \sum_l \left( \frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}} \right) w_{l,i}^{m+1} \dot{f}^m(n_{i,q}^m) \right\} = \dot{f}^m(n_{i,q}^m) \sum_l \left\{ w_{l,i}^{m+1} \sum_k \left( \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}} \right) \right\}. \quad (120)$$

Using Eq. (91), Eq. (120) becomes

$$u_{q;i,r}^m = \dot{f}^m(n_{i,q}^m) \sum_l \left\{ w_{l,i}^{m+1} u_{q;l,r}^{m+1} \right\}. \quad (121)$$

Eq. (121) can be written in matrix form as

$$U_q^m = \dot{F}^m(n_q^m) (W^{m+1})^T U_q^{m+1}. \quad (122)$$

From Eq. (94) and Eq. (113), the batch matrix $U^m$ can be expressed as
$$U^m = \frac{\partial \mathrm{vec}\,A^m}{\partial (\mathrm{vec}\,N^m)^T} \times \{ I_Q \otimes (W^{m+1})^T \} \times U^{m+1}. \quad (123)$$

We need to initialize this equation with $U^M$. From Eq. (91) with $m = M$, we have

$$u_{q;i,r}^M = \sum_k \left( \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \frac{\partial e_{k,q}}{\partial n_{i,q}^M} \right). \quad (124)$$

Since $e_{k,q} = a_{k,q} - g_{k,q}$ and $a_{k,q} = a_{k,q}^M$,

$$\frac{\partial e_{k,q}}{\partial n_{i,q}^M} = \frac{\partial a_{k,q}}{\partial n_{i,q}^M} = \begin{cases} \dot{f}^M(n_{i,q}^M) & ; \text{ if } i = k, \\ 0 & ; \text{ if } i \neq k. \end{cases} \quad (125)$$

Therefore, Eq. (124) reduces to

$$u_{q;i,r}^M = \frac{\partial e_{i,q}}{\partial p_{r,q}} \dot{f}^M(n_{i,q}^M). \quad (126)$$

From Eq. (126), $U_q^M$ can be expressed as

$$U_q^M = \dot{F}^M(n_q^M) \frac{\partial e_q}{\partial p_q^T}. \quad (127)$$

Therefore, by Eq. (94) and Eq. (99), the batch matrix $U^M$ is written as

$$U^M = \frac{\partial \mathrm{vec}\,A}{\partial (\mathrm{vec}\,N^M)^T} \times \frac{\partial \mathrm{vec}\,E}{\partial (\mathrm{vec}\,P)^T}. \quad (128)$$

Note that if $f^M$ is the linear function, then $\dot{F}^M(n_q^M) = I_{S^M}$, and the batch matrix $\partial \mathrm{vec}\,A / \partial (\mathrm{vec}\,N^M)^T = I_{S^M Q}$. This completes the calculation for $U^m$. Next, we will compute $V^m$.

III. Calculation of $V^m$

From Eq. (90), by using the chain rule of calculus for the term $\partial e_{k,q} / \partial n_{i,q}^m$, which is presented in Eq. (119), we obtain

$$v_{i,q}^m = \sum_r \sum_k \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \frac{\partial}{\partial p_{r,q}} \left\{ \sum_{l=1}^{S^{m+1}} \left( \frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}} \right) w_{l,i}^{m+1} \dot{f}^m(n_{i,q}^m) \right\}. \quad (129)$$

Since the last term in Eq. (129), $\dot{f}^m(n_{i,q}^m)$, does not depend on $l$, it can be brought outside the summation $\sum_l$. Then, Eq. (129) becomes

$$v_{i,q}^m = \sum_r \sum_k \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \frac{\partial}{\partial p_{r,q}} \left( \dot{f}^m(n_{i,q}^m) \sum_l \left\{ \frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}} w_{l,i}^{m+1} \right\} \right)$$
$$= \sum_r \sum_k \frac{\partial e_{k,q}}{\partial p_{r,q}} \left( \frac{\partial \dot{f}^m(n_{i,q}^m)}{\partial p_{r,q}} \sum_l \left\{ \frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}} \times w_{l,i}^{m+1} \right\} \right) + \sum_r \sum_k \frac{\partial e_{k,q}}{\partial p_{r,q}} \left( \dot{f}^m(n_{i,q}^m) \sum_l \left\{ \frac{\partial}{\partial p_{r,q}}\left( \frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}} \right) \times w_{l,i}^{m+1} \right\} \right). \quad (130)$$

By rearranging the terms in Eq. (130), this results in

$$v_{i,q}^m = \sum_r \frac{\partial \dot{f}^m(n_{i,q}^m)}{\partial p_{r,q}} \left( \sum_l w_{l,i}^{m+1} \left\{ \sum_k \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}} \right\} \right) + \dot{f}^m(n_{i,q}^m) \sum_l \left( w_{l,i}^{m+1} \left\{ \sum_r \sum_k \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \frac{\partial}{\partial p_{r,q}}\left( \frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}} \right) \right\} \right). \quad (131)$$

From Eq. (90) and Eq. (91), Eq. (131) reduces to

$$v_{i,q}^m = \sum_r \frac{\partial \dot{f}^m(n_{i,q}^m)}{\partial p_{r,q}} \left( \sum_l w_{l,i}^{m+1} u_{q;l,r}^{m+1} \right) + \dot{f}^m(n_{i,q}^m) \sum_l \left\{ w_{l,i}^{m+1} v_{l,q}^{m+1} \right\}. \quad (132)$$

To write Eq. (132) for batch mode, first consider the term $\partial \dot{f}^m(n_{i,q}^m) / \partial p_{r,q}$. The term $\dot{f}^m(n_{i,q}^m)$ depends only on the term $f^m(n_{i,q}^m)$, i.e. $\dot{f}^m(n_{i,q}^m)$ does not depend on $f^m(n_{j,q}^m)$ when $i \neq j$, for $i, j = 1, 2, \ldots, S^m$.
Thus, $\dot{f}^m(n_{i,q}^m)$ can be considered the $i$th element of the $S^m \times 1$ vector function:

$$\mathrm{df}^m(n_q^m) \equiv \begin{bmatrix} \dot{f}^m(n_{1,q}^m) \\ \dot{f}^m(n_{2,q}^m) \\ \vdots \\ \dot{f}^m(n_{S^m,q}^m) \end{bmatrix}, \quad (133)$$

whose elements are the diagonal elements of the matrix $\dot{F}^m(n_q^m)$. In other words,

$$\mathrm{df}^m(n_q^m) = \dot{F}^m(n_q^m) \times 1_{S^m \times 1}. \quad (134)$$

From Eq. (133), further define the $S^m \times Q$ batch matrix:

$$\mathrm{DF}^m(N^m) \equiv \begin{bmatrix} \mathrm{df}^m(n_1^m) & \mathrm{df}^m(n_2^m) & \cdots & \mathrm{df}^m(n_Q^m) \end{bmatrix}. \quad (135)$$

It can be obtained from the matrix $\partial \mathrm{vec}\,A^m / \partial (\mathrm{vec}\,N^m)^T$ by

$$\mathrm{DF}^m(N^m) = (1_{1 \times Q} \otimes I_{S^m}) \times \frac{\partial \mathrm{vec}\,A^m}{\partial (\mathrm{vec}\,N^m)^T} \times (I_Q \otimes 1_{S^m \times 1}). \quad (136)$$

The batch derivative of the matrix $\mathrm{DF}^m(N^m)$ with respect to the input is:

$$\frac{\partial \mathrm{vec}\,\mathrm{DF}^m(N^m)}{\partial (\mathrm{vec}\,P)^T} = \begin{bmatrix} \frac{\partial \mathrm{df}^m(n_1^m)}{\partial p_1^T} & 0 & \cdots & 0 \\ 0 & \frac{\partial \mathrm{df}^m(n_2^m)}{\partial p_2^T} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{\partial \mathrm{df}^m(n_Q^m)}{\partial p_Q^T} \end{bmatrix}. \quad (137)$$

We can now represent Eq. (132) in vector form:

$$v_q^m = \left( \frac{\partial \mathrm{df}^m(n_q^m)}{\partial p_q^T} \bullet \left\{ (W^{m+1})^T U_q^{m+1} \right\} \right) 1_{R \times 1} + \dot{F}^m(n_q^m) (W^{m+1})^T v_q^{m+1}. \quad (138)$$

Using Eq. (135) and Eq. (137), this can be put in batch matrix form:

$$V^m = (1_{1 \times Q} \otimes I_{S^m}) \times \left\{ \frac{\partial \mathrm{vec}\,\mathrm{DF}^m(N^m)}{\partial (\mathrm{vec}\,P)^T} \bullet \left( \{ I_Q \otimes (W^{m+1})^T \} U^{m+1} \right) \right\} \times (I_Q \otimes 1_{R \times 1}) + \mathrm{DF}^m(N^m) \bullet \{ (W^{m+1})^T V^{m+1} \}. \quad (139)$$

$V^M$ is needed to initialize Eq. (139). From Eq. (90), we have

$$v_{i,q}^M = \sum_r \sum_k \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \frac{\partial}{\partial p_{r,q}} \left\{ \frac{\partial e_{k,q}}{\partial n_{i,q}^M} \right\}. \quad (140)$$

From Eq. (125), this reduces to
$$v_{i,q}^M = \sum_r \left( \frac{\partial e_{i,q}}{\partial p_{r,q}} \times \frac{\partial \dot{f}^M(n_{i,q}^M)}{\partial p_{r,q}} \right). \quad (141)$$

Therefore, $v_q^M$ can be written as

$$v_q^M = \left( \frac{\partial \mathrm{df}^M(n_q^M)}{\partial p_q^T} \bullet \frac{\partial e_q}{\partial p_q^T} \right) 1_{R \times 1}. \quad (142)$$

Using Eq. (99) and Eq. (137), the batch matrix $V^M$ can then be expressed as

$$V^M = (1_{1 \times Q} \otimes I_{S^M}) \times \left( \frac{\partial \mathrm{vec}\,\mathrm{DF}^M(N^M)}{\partial (\mathrm{vec}\,P)^T} \bullet \frac{\partial \mathrm{vec}\,E}{\partial (\mathrm{vec}\,P)^T} \right) \times (I_Q \otimes 1_{R \times 1}). \quad (143)$$

It now remains to compute $\partial \mathrm{vec}\,\mathrm{DF}^M(N^M) / \partial (\mathrm{vec}\,P)^T$. Consider one element of this matrix, i.e. $\partial \dot{f}^m(n_{i,q}^m) / \partial p_{r,q}$. Using the chain rule of calculus:

$$\frac{\partial \dot{f}^m(n_{i,q}^m)}{\partial p_{r,q}} = \frac{\partial \dot{f}^m(n_{i,q}^m)}{\partial a_{i,q}^m} \times \frac{\partial a_{i,q}^m}{\partial p_{r,q}}. \quad (144)$$

It may seem unusual that we are using $a_{i,q}^m$ as the intermediate variable here. In fact, it is often true that $\dot{f}^m(n_{i,q}^m)$ can be written as a function of $a_{i,q}^m$. For example, if $f^m$ is the hyperbolic tangent sigmoid function, i.e.

$$f^m(n) = \frac{e^n - e^{-n}}{e^n + e^{-n}}, \quad (145)$$

then the derivative of $f^m$ with respect to $n$, i.e. $\dot{f}^m(n)$, is

$$\dot{f}^m(n) = \frac{\partial f^m}{\partial n} = 1 - \left( \frac{e^n - e^{-n}}{e^n + e^{-n}} \right)^2 = 1 - (f^m(n))^2 = 1 - (a^m)^2. \quad (146)$$

With Eq. (144), the $S^m \times R$ matrix $\partial \mathrm{df}^m(n_q^m) / \partial p_q^T$ can be expressed as

$$\frac{\partial \mathrm{df}^m(n_q^m)}{\partial p_q^T} = \frac{\partial \mathrm{df}^m(n_q^m)}{\partial (a_q^m)^T} \times \frac{\partial a_q^m}{\partial p_q^T}, \quad (147)$$

where the $S^m \times S^m$ matrix $\partial \mathrm{df}^m(n_q^m) / \partial (a_q^m)^T$ is

$$\frac{\partial \mathrm{df}^m(n_q^m)}{\partial (a_q^m)^T} = \begin{bmatrix} \frac{\partial \dot{f}^m(n_{1,q}^m)}{\partial a_{1,q}^m} & 0 & \cdots & 0 \\ 0 & \frac{\partial \dot{f}^m(n_{2,q}^m)}{\partial a_{2,q}^m} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{\partial \dot{f}^m(n_{S^m,q}^m)}{\partial a_{S^m,q}^m} \end{bmatrix}. \quad (148)$$

From Eq. (147), the batch matrix $\partial \mathrm{vec}\,\mathrm{DF}^m(N^m) / \partial (\mathrm{vec}\,P)^T$ can be decomposed to

$$\frac{\partial \mathrm{vec}\,\mathrm{DF}^m(N^m)}{\partial (\mathrm{vec}\,P)^T} = \frac{\partial \mathrm{vec}\,\mathrm{DF}^m(N^m)}{\partial (\mathrm{vec}\,A^m)^T} \times \frac{\partial \mathrm{vec}\,A^m}{\partial (\mathrm{vec}\,P)^T}. \quad (149)$$

At the output layer (i.e. $m = M$), Eq. (144), Eq. (147) and Eq. (149) also apply, with $a_{i,q}^M = a_{i,q}$, $a_q^M = a_q$, and $A^M = A$. This completes the derivation of the batch form of the gradient of the performance index $J_d$ with respect to the network parameters $w_{i,j}^m$ and $b_i^m$. In some programming languages (e.g.
MATLAB) batch algorithms are more efficient than computing element by element. However, performing the calculation in batch mode requires sufficient memory to hold all of the matrix elements at once. Therefore, the larger the matrices, the more memory is needed. The next section will derive an algorithm that is designed to save memory. Before going to the derivation of the memory-save method, let's summarize the algorithm in this section.

STEPS TO COMPUTE $\partial J / \partial W^m$ AND $\partial J / \partial b^m$ (BATCH MODE)

1. Given $P$, compute $A^m$ and $\partial \mathrm{vec}\,A^m / \partial (\mathrm{vec}\,P)^T$, by using Eq. (114) and Eq. (117).
2. Given $\partial \mathrm{vec}\,G / \partial (\mathrm{vec}\,P)^T$, compute the derivative of the errors, $\partial \mathrm{vec}\,E / \partial (\mathrm{vec}\,P)^T$, using Eq. (99).
3. Compute $U^M$ and $V^M$, using Eq. (128) and Eq. (143), respectively.
4. Backpropagate $U^m$ and $V^m$, by Eq. (123) and Eq. (139), respectively.
5. Compute $\partial J_d / \partial W^m$ and $\partial J_d / \partial b^m$ by Eq. (95) and Eq. (107), respectively. Then, compute $\partial J / \partial W^m$ and $\partial J / \partial b^m$.
6. Update the weights and biases using any gradient-based optimization technique.

Gradient Calculation: Memory-Save Method

In this section, another approach to compute the gradient of $J_d$ with respect to the network parameters will be discussed. As previously mentioned, this approach will be useful when the machine's memory is insufficient for the batch mode operation. Mainly, the method will break down some matrices into smaller matrices. We will first briefly discuss the concept of how to break down the matrices. This will be followed by the derivation of the gradient. From Eq.
(80), we can see that, when compared with the typical performance index $J_f$, the new term $J_d$ has the additional summation $\sum_{r=1}^{R}$. This extra summation, when manipulated for batch mode, leads to larger matrices than those in the regular backpropagation. The idea is then to form smaller matrices whose sizes do not depend on the extra summation $\sum_r$. First, from Eq. (90) and Eq. (91), redefine

$$\tilde{v}_{r;i,q}^m \equiv \sum_k \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \frac{\partial}{\partial p_{r,q}}\left\{ \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \right\}, \quad (150)$$

and

$$\tilde{u}_{r;i,q}^m \equiv \sum_k \left( \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \right). \quad (151)$$

Thus, $\tilde{v}_{r;i,q}^m$ and $\tilde{u}_{r;i,q}^m$ are the elements at row $i$ and column $q$ of $\tilde{V}_r^m$ and $\tilde{U}_r^m$, respectively. From Eq. (90) and Eq. (150), this implies that

$$v_{i,q}^m = \sum_{r=1}^{R} \tilde{v}_{r;i,q}^m. \quad (152)$$

For the term $\partial a_{j,q}^{m-1} / \partial p_{r,q}$, rather than expressing it as an element of $\partial a_q^{m-1} / \partial p_q^T$, we express it as the element at row $j$ of $\partial a_q^{m-1} / \partial p_{r,q}$. It is also the element at row $l$ (note: $l = (q-1)S^{m-1} + j$) and column $q$ of the matrix $\partial \mathrm{vec}\,A^{m-1} / \partial p_r^T$ (see Eq. (27) for the notation). Note that

$$(1_{1 \times Q} \otimes I_{S^{m-1}}) \times \frac{\partial \mathrm{vec}\,A^{m-1}}{\partial p_r^T} = \begin{bmatrix} \frac{\partial a_1^{m-1}}{\partial p_{r,1}} & \frac{\partial a_2^{m-1}}{\partial p_{r,2}} & \cdots & \frac{\partial a_Q^{m-1}}{\partial p_{r,Q}} \end{bmatrix}. \quad (153)$$

By Eq. (153), the following matrices immediately follow:

$$(1_{1 \times Q} \otimes I_{S^M}) \times \frac{\partial \mathrm{vec}\,A}{\partial p_r^T} = \begin{bmatrix} \frac{\partial a_1}{\partial p_{r,1}} & \frac{\partial a_2}{\partial p_{r,2}} & \cdots & \frac{\partial a_Q}{\partial p_{r,Q}} \end{bmatrix}, \quad (154)$$

$$(1_{1 \times Q} \otimes I_{S^M}) \times \frac{\partial \mathrm{vec}\,G}{\partial p_r^T} = \begin{bmatrix} \frac{\partial g_1}{\partial p_{r,1}} & \frac{\partial g_2}{\partial p_{r,2}} & \cdots & \frac{\partial g_Q}{\partial p_{r,Q}} \end{bmatrix}, \quad (155)$$

and

$$(1_{1 \times Q} \otimes I_{S^M}) \times \frac{\partial \mathrm{vec}\,E}{\partial p_r^T} = \begin{bmatrix} \frac{\partial e_1}{\partial p_{r,1}} & \frac{\partial e_2}{\partial p_{r,2}} & \cdots & \frac{\partial e_Q}{\partial p_{r,Q}} \end{bmatrix}, \quad (156)$$

where

$$\frac{\partial \mathrm{vec}\,E}{\partial p_r^T} = \tilde{\Phi}_r \bullet \left( \frac{\partial \mathrm{vec}\,A}{\partial p_r^T} - \frac{\partial \mathrm{vec}\,G}{\partial p_r^T} \right). \quad (157)$$

The matrix $\tilde{\Phi}_r$ is defined:

$$\tilde{\Phi}_r \equiv \begin{bmatrix} \phi_r(1) & 0 & \cdots & 0 \\ 0 & \phi_r(2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \phi_r(Q) \end{bmatrix}, \quad (158)$$

where $\phi_r(q)$ is the $r$th column vector of the matrix $\Phi_q$ (see Eq. (100)). By Eq. (150) and Eq. (151), we can rewrite Eq. (92) as

$$\frac{\partial J_d}{\partial w_{i,j}^m} = \frac{2}{Q_d S_d^M R_d} \sum_r \left( \sum_q \left\{ a_{j,q}^{m-1} \tilde{v}_{r;i,q}^m \right\} + \sum_q \frac{\partial a_{j,q}^{m-1}}{\partial p_{r,q}} \tilde{u}_{r;i,q}^m \right), \quad (159)$$

which can be further expressed as:

$$\frac{\partial J_d}{\partial W^m} = \frac{2}{Q_d S_d^M R_d} \sum_{r=1}^{R} \left\{ \tilde{V}_r^m (A^{m-1})^T + \tilde{U}_r^m \left( \frac{\partial \mathrm{vec}\,A^{m-1}}{\partial p_r^T} \right)^T (1_{Q \times 1} \otimes I_{S^{m-1}}) \right\}. \quad (160)$$

It should be noted that the transpose of the matrix in Eq. (153) produces the matrices in Eq. (160), i.e.

$$\left( (1_{1 \times Q} \otimes I_{S^{m-1}}) \times \frac{\partial \mathrm{vec}\,A^{m-1}}{\partial p_r^T} \right)^T = \left( \frac{\partial \mathrm{vec}\,A^{m-1}}{\partial p_r^T} \right)^T \times (1_{Q \times 1} \otimes I_{S^{m-1}}). \quad (161)$$

The sizes of the matrices in Eq. (160) are less than or equal to the matrix sizes in Eq.
(95), which was for batch mode. For calculating the term $\partial J_d / \partial b^m$, we can directly use Eq. (107), since $V^m$ can be easily obtained by using Eq. (152):

$$V^m = \sum_{r=1}^{R} \tilde{V}_r^m. \quad (162)$$

Alternatively, from Eq. (107) and Eq. (162), $\partial J_d / \partial b^m$ can also be directly expressed in the form of $\tilde{V}_r^m$ as follows:

$$\frac{\partial J_d}{\partial b^m} = \frac{2}{Q_d S_d^M R_d} \sum_r \left( \tilde{V}_r^m \times 1_{Q \times 1} \right). \quad (163)$$

From Eq. (160) and Eq. (163), we need to compute $\tilde{V}_r^m$, $\tilde{U}_r^m$ and $\partial \mathrm{vec}\,A^m / \partial p_r^T$. We will start with $\partial \mathrm{vec}\,A^m / \partial p_r^T$, followed by $\tilde{U}_r^m$ and $\tilde{V}_r^m$.

I. Calculation of $\partial \mathrm{vec}\,A^m / \partial p_r^T$

From Eq. (109), by using Eq. (112), we can express $\partial a_q^m / \partial p_{r,q}$ as

$$\frac{\partial a_q^m}{\partial p_{r,q}} = \dot{F}^m(n_q^m) W^m \frac{\partial a_q^{m-1}}{\partial p_{r,q}}. \quad (164)$$

Therefore, by Eq. (113) and Eq. (153), $\partial \mathrm{vec}\,A^m / \partial p_r^T$ becomes

$$\frac{\partial \mathrm{vec}\,A^m}{\partial p_r^T} = \frac{\partial \mathrm{vec}\,A^m}{\partial (\mathrm{vec}\,N^m)^T} \times (I_Q \otimes W^m) \times \frac{\partial \mathrm{vec}\,A^{m-1}}{\partial p_r^T}. \quad (165)$$

Since this is a forward propagation process, we need to initialize at $m = 0$. Recall that $S^0 = R$. From Eq. (115), we then have the vector

$$\frac{\partial a_q^0}{\partial p_{r,q}} = \frac{\partial p_q}{\partial p_{r,q}} = \begin{bmatrix} 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \end{bmatrix}^T, \quad (166)$$

where the one appears at row $r$. Thus, the matrix

$$\frac{\partial \mathrm{vec}\,A^0}{\partial p_r^T} = I_Q \otimes \frac{\partial p_q}{\partial p_{r,q}}, \text{ for any } q. \quad (167)$$

Next, we will derive $\tilde{U}_r^m$.

II. Calculation of $\tilde{U}_r^m$

From Eq. (151), using Eq. (119), we have

$$\tilde{u}_{r;i,q}^m = \sum_l \left\{ \dot{f}^m(n_{i,q}^m) w_{l,i}^{m+1} \sum_k \left( \frac{\partial e_{k,q}}{\partial p_{r,q}} \times \frac{\partial e_{k,q}}{\partial n_{l,q}^{m+1}} \right) \right\}, \quad (168)$$

and by Eq. (151), it reduces to

$$\tilde{u}_{r;i,q}^m = \dot{f}^m(n_{i,q}^m) \sum_l \left\{ w_{l,i}^{m+1} \tilde{u}_{r;l,q}^{m+1} \right\}. \quad (169)$$

Eq. (169) can be expressed in the form of $\tilde{U}_r^m$, using Eq. (136), as

$$\tilde{U}_r^m = \mathrm{DF}^m(N^m) \bullet \{ (W^{m+1})^T \tilde{U}_r^{m+1} \}. \quad (170)$$

Since this is a backpropagation process, we need to initialize $\tilde{U}_r^M$. From Eq. (125), Eq.
(151) reduces to

$$\tilde{u}_{r;i,q}^M = \frac{\partial e_{i,q}}{\partial p_{r,q}} \dot{f}^M(n_{i,q}^M). \quad (171)$$

From Eq. (171), $\tilde{U}_r^M$ can be written, using Eq. (156), as

$$\tilde{U}_r^M = \mathrm{DF}^M(N^M) \bullet \left\{ (1_{1 \times Q} \otimes I_{S^M}) \times \frac{\partial \mathrm{vec}\,E}{\partial p_r^T} \right\}. \quad (172)$$

Next, we will illustrate how to calculate $\tilde{V}_r^m$.

III. Calculation of $\tilde{V}_r^m$

From Eq. (150), by similarly following Eq. (129) to Eq. (132), we obtain

$$\tilde{v}_{r;i,q}^m = \frac{\partial \dot{f}^m(n_{i,q}^m)}{\partial p_{r,q}} \sum_l \left\{ w_{l,i}^{m+1} \tilde{u}_{r;l,q}^{m+1} \right\} + \dot{f}^m(n_{i,q}^m) \sum_l \left\{ w_{l,i}^{m+1} \tilde{v}_{r;l,q}^{m+1} \right\}. \quad (173)$$

Eq. (173) can be expressed in the form of $\tilde{v}_{r;q}^m$ as

$$\tilde{v}_{r;q}^m = \frac{\partial \mathrm{df}^m(n_q^m)}{\partial p_{r,q}} \bullet \left\{ (W^{m+1})^T \tilde{u}_{r;q}^{m+1} \right\} + \dot{F}^m(n_q^m) (W^{m+1})^T \tilde{v}_{r;q}^{m+1}, \quad (174)$$

where $\tilde{u}_{r;q}^{m+1}$ is the $q$th column vector of $\tilde{U}_r^{m+1}$. Then, by Eq. (136), $\tilde{V}_r^m$ can be expressed as

$$\tilde{V}_r^m = \left\{ (1_{1 \times Q} \otimes I_{S^m}) \times \frac{\partial \mathrm{vec}\,\mathrm{DF}^m(N^m)}{\partial p_r^T} \right\} \bullet \{ (W^{m+1})^T \tilde{U}_r^{m+1} \} + \mathrm{DF}^m(N^m) \bullet \{ (W^{m+1})^T \tilde{V}_r^{m+1} \}. \quad (175)$$

Next, we need to initialize at $m = M$. From Eq. (141), the term $\tilde{v}_{r;i,q}^M$ can be computed as follows:

$$\tilde{v}_{r;i,q}^M = \frac{\partial \dot{f}^M(n_{i,q}^M)}{\partial p_{r,q}} \times \frac{\partial e_{i,q}}{\partial p_{r,q}}. \quad (176)$$

Thus, $\tilde{V}_r^M$ can be expressed as

$$\tilde{V}_r^M = \left( (1_{1 \times Q} \otimes I_{S^M}) \times \frac{\partial \mathrm{vec}\,\mathrm{DF}^M(N^M)}{\partial p_r^T} \right) \bullet \frac{\partial \mathrm{vec}\,E}{\partial p_r^T}. \quad (177)$$

It now remains to calculate $\partial \mathrm{vec}\,\mathrm{DF}^M(N^M) / \partial p_r^T$. Consider one element of this matrix, which is $\partial \dot{f}^m(n_{i,q}^m) / \partial p_{r,q}$. From Eq. (144) and Eq. (147), we obtain

$$\frac{\partial \mathrm{df}^m(n_q^m)}{\partial p_{r,q}} = \frac{\partial \mathrm{df}^m(n_q^m)}{\partial (a_q^m)^T} \times \frac{\partial a_q^m}{\partial p_{r,q}}. \quad (178)$$

Then, by Eq. (137) and Eq. (149), $\partial \mathrm{vec}\,\mathrm{DF}^m(N^m) / \partial p_r^T$ can be decomposed to

$$\frac{\partial \mathrm{vec}\,\mathrm{DF}^m(N^m)}{\partial p_r^T} = \frac{\partial \mathrm{vec}\,\mathrm{DF}^m(N^m)}{\partial (\mathrm{vec}\,A^m)^T} \times \frac{\partial \mathrm{vec}\,A^m}{\partial p_r^T}. \quad (179)$$

We have derived an algorithm to compute the gradient of the new performance index $J_d$ with respect to the network parameters $W^m$ and $b^m$ such that the calculation process requires less machine memory than the batch mode operation. This algorithm is summarized below.
STEPS TO COMPUTE $\partial J / \partial W^m$ AND $\partial J / \partial b^m$ (MEMORY-SAVE)

1. Given the $Q$ inputs in $P$, obtain $A^m$. Initialize $r = 1$.
2. Obtain $\partial \mathrm{vec}\,A^m / \partial p_r^T$ by Eq. (165) and Eq. (167).
3. With $\partial \mathrm{vec}\,G / \partial p_r^T$, compute $\partial \mathrm{vec}\,E / \partial p_r^T$ using Eq. (157).
4. Calculate $\tilde{U}_r^M$ and $\tilde{V}_r^M$, using Eq. (172) and Eq. (177), respectively.
5. Backpropagate to obtain $\tilde{U}_r^m$ and $\tilde{V}_r^m$, using Eq. (170) and Eq. (175), respectively.
6. Compute $\partial J_d / \partial W^m$ and $\partial J_d / \partial b^m$ by Eq. (160) and Eq. (163), respectively.
7. Set $r = r + 1$, and repeat steps 2-7 until $r > R$.
8. Compute $\partial J / \partial W^m$ and $\partial J / \partial b^m$. Update the weights and biases using any gradient-based optimization technique.

In this section, we have shown two approaches for computing the terms $\partial J_d / \partial W^m$ and $\partial J_d / \partial b^m$. The first approach is performed in batch mode, thus requiring sufficient machine memory to simultaneously hold all numeric elements. This approach takes advantage of faster computation in some programming languages, e.g. MATLAB. The second approach, however, breaks down the matrices in the batch mode operation so that their sizes are smaller. This approach is useful when the machine's memory is insufficient to perform in batch mode, thereby compromising between the computation speed and the required memory.

Speed Test

In this section, we create a network structure, i.e. an $R$-25-1 network (where $R$ will be varied), to test the speed of the CFDA algorithm. We will compare the computation time between the two approaches we previously derived, i.e. the batch mode and memory-save approaches.
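Both of the approaches being timed share the same core forward recursion for the derivative of the network output with respect to its inputs, $\partial a^m / \partial p^T = \dot{F}^m(n^m) W^m \, \partial a^{m-1} / \partial p^T$ (Eq. (111)), initialized with the identity (Eq. (116)). A minimal per-example sketch, assuming tanh hidden layers and a linear output layer (an illustrative assumption, not the dissertation's MATLAB code):

```python
import numpy as np

def forward_with_input_jacobian(p, weights, biases):
    """Propagate a^m and da^m/dp^T through the layers (cf. Eq. (111)):
        da^m/dp^T = Fdot(n^m) @ W^m @ da^{m-1}/dp^T,
    starting from da^0/dp^T = I_R (cf. Eq. (116))."""
    a = np.asarray(p, dtype=float)
    J = np.eye(a.size)                    # da^0/dp^T = I_R
    M = len(weights)
    for m, (W, b) in enumerate(zip(weights, biases), start=1):
        n = W @ a + b
        if m < M:
            a = np.tanh(n)                # hidden layers: tanh
            fdot = 1.0 - a ** 2           # f'(n) = 1 - tanh(n)^2 (cf. Eq. (146))
        else:
            a, fdot = n, np.ones_like(n)  # linear output layer
        J = fdot[:, None] * (W @ J)       # Fdot is diagonal, so scale rows
    return a, J
```

A finite-difference check of `J` against perturbed forward passes is a quick way to validate an implementation of either gradient approach.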
The following figure shows the relative one-iteration execution times (averaged over 10 runs) for the batch and memory-save methods for computing the gradient of $J_d$, when compared to the time to compute the gradient of $J_f$. Note that we assume the number of training points is fixed at $Q = 75{,}000$. These tests were run on a computer with a processor speed of 2.0 GHz and a memory size of 512 MB.

[Figure 6: Relative execution time (compared to $\partial J_f / \partial x$) for computing $\partial J_d / \partial x$, plotted against the input dimension $R$ (2 to 12) for the batch and memory-save methods; vertical axis logarithmic, from $10^{-1}$ to $10^{3}$.]

We expected that the time for computing the gradient of $J_d$ would be more than that of $J_f$. However, from Figure 6, we can see that, initially, the gradient calculation for $J_d$ took less time than that for $J_f$. This is because of the overhead in the MATLAB code. Once the input dimension increased, the time for computing the gradient of $J_d$ became greater than that for $J_f$. The interesting part, however, occurred when the input dimension was more than seven. In these cases, the memory requirement caused the existing PC's RAM to overflow, which required data to be sent to disk. Thus, the time for computing the gradient in batch mode was more than that for the memory-save approach. These results show that the memory-save approach is useful when the data storage requirements exhaust the existing RAM.

Summary

In this chapter, a new training algorithm for approximating a function and its first-order derivatives, called gradient-based CFDA, was proposed. We proposed the new performance index, which includes not only the squared function errors but also the squared derivative errors. Two approaches for the gradient calculation to minimize the performance index were presented. The first method arranges every numeric element into matrices for batch mode operation, in order to expedite the gradient calculation time in some programming languages.
This approach, however, requires sufficient memory to simultaneously hold all elements. The other approach, called the memory-save method, compromises between the computation time and the required memory. The computation time under different conditions was measured. The results showed that the computation time for the squared derivative error term defined in the CFDA method was longer than for the standard backpropagation. This is expected, as the gradient computation in the CFDA method is more complicated than that in the standard backpropagation algorithm. The results also illustrated that the memory-save approach is useful in cases where computer RAM is insufficient to perform the gradient calculation in batch mode.

CHAPTER 5

COMBINED FUNCTION AND DERIVATIVE APPROXIMATION WITH LEVENBERG-MARQUARDT

Introduction

In this chapter, we present a Levenberg-Marquardt algorithm for minimizing the CFDA performance index. Recall from Chapter 2 that optimization with the Levenberg-Marquardt algorithm requires the calculation of the Jacobian matrix. This will be the core work in this chapter. We will begin this chapter with a discussion of the Levenberg-Marquardt framework for CFDA. We will then present two approaches for computing the Jacobian matrix. The first approach performs the calculation in batch mode. The second approach (i.e. the memory-save method) compromises between the execution time and the required memory. The measured execution time under different conditions will be illustrated at the end of the chapter.

CFDA with Levenberg-Marquardt

Consider a performance index in the form:

$$F(x) = \rho_1 F_1(x) + \rho_2 F_2(x), \quad (180)$$

where

$$F_1(x) = \sum_{i=1}^{N_1} z_i^2(x) = z^T(x) \times z(x) \quad \text{and} \quad F_2(x) = \sum_{i=1}^{N_2} \tilde{z}_i^2(x) = \tilde{z}^T(x) \times \tilde{z}(x). \quad (181)$$

The $j$th element of the gradient is

$$[\nabla F(x)]_j = \frac{\partial F(x)}{\partial x_j} = \rho_1 \frac{\partial F_1(x)}{\partial x_j} + \rho_2 \frac{\partial F_2(x)}{\partial x_j}. \quad (182)$$

The gradient can be written in matrix form:

$$\nabla F(x) = \rho_1 \nabla F_1(x) + \rho_2 \nabla F_2(x), \quad (183)$$

and, from Eq. (38) in Chapter 2, this becomes

$$\nabla F(x) = 2 \{ \rho_1 J^T(x) z(x) + \rho_2 \tilde{J}^T(x) \tilde{z}(x) \}, \quad (184)$$

where the Jacobian matrices are

$$J(x) = \frac{\partial z(x)}{\partial x^T} \quad \text{and} \quad \tilde{J}(x) = \frac{\partial \tilde{z}(x)}{\partial x^T}. \quad (185)$$

Next, we want to find the Hessian matrix. The $(k, j)$ element of the Hessian matrix would be

$$[\nabla^2 F(x)]_{k,j} = \frac{\partial^2 F(x)}{\partial x_k \partial x_j} = \rho_1 \frac{\partial^2 F_1(x)}{\partial x_k \partial x_j} + \rho_2 \frac{\partial^2 F_2(x)}{\partial x_k \partial x_j}. \quad (186)$$

From Eq.
(39) in Chapter 2, this turns out to be

$$[\nabla^2 F(\mathbf{x})]_{k,j} = 2\rho_1 \sum_{i=1}^{N_1} \left\{ \frac{\partial z_i(\mathbf{x})}{\partial x_k}\frac{\partial z_i(\mathbf{x})}{\partial x_j} + z_i(\mathbf{x})\frac{\partial^2 z_i(\mathbf{x})}{\partial x_k \partial x_j} \right\} + 2\rho_2 \sum_{i=1}^{N_2} \left\{ \frac{\partial \tilde{z}_i(\mathbf{x})}{\partial x_k}\frac{\partial \tilde{z}_i(\mathbf{x})}{\partial x_j} + \tilde{z}_i(\mathbf{x})\frac{\partial^2 \tilde{z}_i(\mathbf{x})}{\partial x_k \partial x_j} \right\}.$$   (187)

Using Eq. (40), the Hessian can then be expressed in matrix form:

$$\nabla^2 F(\mathbf{x}) = 2\rho_1 \{\mathbf{J}^T(\mathbf{x})\mathbf{J}(\mathbf{x}) + \mathbf{S}(\mathbf{x})\} + 2\rho_2 \{\tilde{\mathbf{J}}^T(\mathbf{x})\tilde{\mathbf{J}}(\mathbf{x}) + \tilde{\mathbf{S}}(\mathbf{x})\},$$   (188)

where

$$\mathbf{S}(\mathbf{x}) = \sum_{i=1}^{N_1} z_i(\mathbf{x})\,\nabla^2 z_i(\mathbf{x}) \quad \text{and} \quad \tilde{\mathbf{S}}(\mathbf{x}) = \sum_{i=1}^{N_2} \tilde{z}_i(\mathbf{x})\,\nabla^2 \tilde{z}_i(\mathbf{x}).$$   (189)

If we assume that both $\mathbf{S}(\mathbf{x})$ and $\tilde{\mathbf{S}}(\mathbf{x})$ are small, relative to the terms $\mathbf{J}^T(\mathbf{x})\mathbf{J}(\mathbf{x})$ and $\tilde{\mathbf{J}}^T(\mathbf{x})\tilde{\mathbf{J}}(\mathbf{x})$, then the Hessian matrix can be approximated as

$$\nabla^2 F(\mathbf{x}) \cong 2\rho_1 \mathbf{J}^T(\mathbf{x})\mathbf{J}(\mathbf{x}) + 2\rho_2 \tilde{\mathbf{J}}^T(\mathbf{x})\tilde{\mathbf{J}}(\mathbf{x}) = 2\{\rho_1 \mathbf{J}^T(\mathbf{x})\mathbf{J}(\mathbf{x}) + \rho_2 \tilde{\mathbf{J}}^T(\mathbf{x})\tilde{\mathbf{J}}(\mathbf{x})\}.$$   (190)

Therefore, by Eq. (184) and Eq. (190), the Levenberg-Marquardt update is

$$\Delta\mathbf{x}_{k+1} = \mathbf{x}_{k+1} - \mathbf{x}_k = -[\rho_1 \mathbf{J}^T(\mathbf{x}_k)\mathbf{J}(\mathbf{x}_k) + \rho_2 \tilde{\mathbf{J}}^T(\mathbf{x}_k)\tilde{\mathbf{J}}(\mathbf{x}_k) + \mu_k \mathbf{I}_n]^{-1} \times [\rho_1 \mathbf{J}^T(\mathbf{x}_k)\mathbf{z}(\mathbf{x}_k) + \rho_2 \tilde{\mathbf{J}}^T(\mathbf{x}_k)\tilde{\mathbf{z}}(\mathbf{x}_k)].$$   (191)

Note that $\mu_k$ is adjusted using the typical Levenberg-Marquardt algorithm, which was presented in Chapter 2.

Now, recall from Eq. (80) that the performance index in the CFDA method consists of two terms (i.e. $J_f$ and $J_d$). This is in the same form as the performance index in Eq. (180), if we assign

$$F_1(\mathbf{x}) \equiv \sum_{q=1}^{Q} \sum_{k=1}^{S^M} \{a_{k,q} - g_{k,q}\}^2,$$   (192)

$$F_2(\mathbf{x}) \equiv \sum_{q=1}^{Q} \sum_{k=1}^{S^M} \sum_{r=1}^{R} \varphi_{k,r}(q)\left( \frac{\partial a_{k,q}}{\partial p_{r,q}} - \frac{\partial g_{k,q}}{\partial p_{r,q}} \right)^2,$$   (193)

where $\rho_1 \equiv 1/(QS^M)$ and $\rho_2 \equiv \rho/(Q_d S^M R_d)$. Since the algorithm for computing $\mathbf{J}(\mathbf{x})$ and $\mathbf{z}(\mathbf{x})$ is already known (see [HaMe94] and [HaDe96]), we will focus on the calculation of $\tilde{\mathbf{z}}(\mathbf{x})$ and $\tilde{\mathbf{J}}(\mathbf{x})$. The calculations for these terms are given in the following two sections.
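One step of the update in Eq. (191), including the usual $\mu_k$ adjustment (decrease $\mu$ after a successful step; increase it and retry otherwise), can be sketched numerically. This is an illustrative Python/NumPy sketch under simplifying assumptions (dense solves, linear toy residuals, hypothetical function names), not the thesis code.

```python
import numpy as np

def lm_step(x, F, residuals, rho1, rho2, mu, mu_inc=10.0, mu_dec=0.1):
    """One Levenberg-Marquardt iteration for F = rho1*z'z + rho2*z~'z~.

    residuals(x) returns (J, z, J_tilde, z_tilde) at x.
    Returns the updated x and the adjusted mu.
    """
    J, z, Jt, zt = residuals(x)
    A = rho1 * J.T @ J + rho2 * Jt.T @ Jt   # Gauss-Newton Hessian approximation
    g = rho1 * J.T @ z + rho2 * Jt.T @ zt   # half of the gradient in Eq. (184)
    f0 = F(x)
    while mu < 1e12:
        # Damped step: solve (A + mu*I) dx = -g
        dx = -np.linalg.solve(A + mu * np.eye(len(x)), g)
        if F(x + dx) < f0:                  # improvement: accept step, relax mu
            return x + dx, mu * mu_dec
        mu *= mu_inc                        # no improvement: stiffen mu, retry
    return x, mu                            # no improving step found (converged)

# Toy problem with linear residuals z = A1 x - b1 and z~ = A2 x - b2.
rng = np.random.default_rng(2)
A1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
A2, b2 = rng.standard_normal((7, 3)), rng.standard_normal(7)
rho1, rho2 = 1.0, 0.5
F = lambda v: rho1 * np.sum((A1 @ v - b1) ** 2) + rho2 * np.sum((A2 @ v - b2) ** 2)
res = lambda v: (A1, A1 @ v - b1, A2, A2 @ v - b2)

x, mu = np.zeros(3), 0.01
for _ in range(30):
    x, mu = lm_step(x, F, res, rho1, rho2, mu)
```

On this linear toy problem the iterates converge to the weighted least-squares solution of the composite index; for the CFDA network, $\mathbf{J}$ and $\tilde{\mathbf{J}}$ are instead supplied by the Jacobian computations derived in this chapter.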
The next section will show the batch calculation, and the following section will show the memory-save calculation.

Batch calculation

In this section, we will compute the vector $\tilde{\mathbf{z}}(\mathbf{x})$ and the Jacobian matrix $\tilde{\mathbf{J}}(\mathbf{x})$. Recall from Eq. (193) that we have

$$F_2(\mathbf{x}) = \sum_{q=1}^{Q} \sum_{k=1}^{S^M} \sum_{r=1}^{R} \varphi_{k,r}(q)\left( \frac{\partial a_{k,q}}{\partial p_{r,q}} - \frac{\partial g_{k,q}}{\partial p_{r,q}} \right)^2.$$   (194)

We can rewrite Eq. (194) as

$$F_2(\mathbf{x}) = \sum_{q=1}^{Q} \sum_{k=1}^{S^M} \sum_{r=1}^{R} \left( \frac{\partial e_{k,q}}{\partial p_{r,q}} \right)^2 = \sum_{q=1}^{Q} \mathrm{vec}\!\left( \frac{\partial \mathbf{e}_q}{\partial \mathbf{p}_q^T} \right)^{\!T} \times \mathrm{vec}\!\left( \frac{\partial \mathbf{e}_q}{\partial \mathbf{p}_q^T} \right).$$   (195)

Define the following matrix

$$\mathbf{DEP} \equiv (\mathbf{1}_{1\times Q} \otimes \mathbf{I}_{S^M}) \times \frac{\partial\,\mathrm{vec}\,\mathbf{E}}{\partial(\mathrm{vec}\,\mathbf{P})^T} = \begin{bmatrix} \dfrac{\partial \mathbf{e}_1}{\partial \mathbf{p}_1^T} & \dfrac{\partial \mathbf{e}_2}{\partial \mathbf{p}_2^T} & \cdots & \dfrac{\partial \mathbf{e}_Q}{\partial \mathbf{p}_Q^T} \end{bmatrix},$$   (196)

where the matrix $\partial\,\mathrm{vec}\,\mathbf{E}/\partial(\mathrm{vec}\,\mathbf{P})^T$ is defined in Eq. (99). Then, Eq. (195) can be expressed as

$$F_2(\mathbf{x}) = (\mathrm{vec}\,\mathbf{DEP})^T \times \mathrm{vec}\,\mathbf{DEP},$$   (197)

where

$$\mathrm{vec}\,\mathbf{DEP} = \begin{bmatrix} \mathrm{vec}\left(\dfrac{\partial \mathbf{e}_1}{\partial \mathbf{p}_1^T}\right) \\ \vdots \\ \mathrm{vec}\left(\dfrac{\partial \mathbf{e}_Q}{\partial \mathbf{p}_Q^T}\right) \end{bmatrix}.$$   (198)

Now, by equating Eq. (197) to Eq. (181), we can see that

$$\tilde{\mathbf{z}}(\mathbf{x}) = \mathrm{vec}\,\mathbf{DEP}.$$   (199)

For the calculation of the Jacobian matrix $\tilde{\mathbf{J}}(\mathbf{x})$, we need to have the vector $\mathbf{x}$ containing all of the network parameters. There are several ways to define $\mathbf{x}$. However, the definition we use is

$$\mathbf{x}^T \equiv \begin{bmatrix} (\mathbf{x}^1)^T & (\mathbf{x}^2)^T & \cdots & (\mathbf{x}^M)^T \end{bmatrix},$$   (200)

where the value of $n$ is shown in Eq. (31). The vector $\mathbf{x}^m$ is defined as

$$\mathbf{x}^m \equiv \begin{bmatrix} \mathrm{vec}\,\mathbf{W}^m \\ \mathbf{b}^m \end{bmatrix}.$$   (201)

Therefore, from Eq. (185), the Jacobian matrix can be written as

$$\tilde{\mathbf{J}}(\mathbf{x}) = \frac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial \mathbf{x}^T} = \begin{bmatrix} \dfrac{\partial}{\partial \mathbf{x}^T}\mathrm{vec}\left(\dfrac{\partial \mathbf{e}_1}{\partial \mathbf{p}_1^T}\right) \\ \vdots \\ \dfrac{\partial}{\partial \mathbf{x}^T}\mathrm{vec}\left(\dfrac{\partial \mathbf{e}_Q}{\partial \mathbf{p}_Q^T}\right) \end{bmatrix}.$$   (202)

Using Eq. (200) and Eq. (201), $\partial\,\mathrm{vec}\,\mathbf{DEP}/\partial \mathbf{x}^T$ can be written

$$\frac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial \mathbf{x}^T} = \begin{bmatrix} \dfrac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial (\mathbf{x}^1)^T} & \dfrac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial (\mathbf{x}^2)^T} & \cdots & \dfrac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial (\mathbf{x}^M)^T} \end{bmatrix},$$   (203)

where

$$\frac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial (\mathbf{x}^m)^T} = \begin{bmatrix} \dfrac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial (\mathrm{vec}\,\mathbf{W}^m)^T} & \dfrac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial (\mathbf{b}^m)^T} \end{bmatrix}.$$   (204)

Therefore, by Eq. (202) through Eq. (204), $\partial\,\mathrm{vec}\,\mathbf{DEP}/\partial \mathbf{x}^T$ can be written

$$\frac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial \mathbf{x}^T} = \begin{bmatrix} \dfrac{\partial}{\partial (\mathbf{x}^1)^T}\mathrm{vec}\left(\dfrac{\partial \mathbf{e}_1}{\partial \mathbf{p}_1^T}\right) & \cdots & \dfrac{\partial}{\partial (\mathbf{x}^M)^T}\mathrm{vec}\left(\dfrac{\partial \mathbf{e}_1}{\partial \mathbf{p}_1^T}\right) \\ \vdots & & \vdots \\ \dfrac{\partial}{\partial (\mathbf{x}^1)^T}\mathrm{vec}\left(\dfrac{\partial \mathbf{e}_Q}{\partial \mathbf{p}_Q^T}\right) & \cdots & \dfrac{\partial}{\partial (\mathbf{x}^M)^T}\mathrm{vec}\left(\dfrac{\partial \mathbf{e}_Q}{\partial \mathbf{p}_Q^T}\right) \end{bmatrix},$$   (205)

where

$$\frac{\partial}{\partial (\mathbf{x}^m)^T}\mathrm{vec}\left(\frac{\partial \mathbf{e}_q}{\partial \mathbf{p}_q^T}\right) = \begin{bmatrix} \dfrac{\partial}{\partial (\mathrm{vec}\,\mathbf{W}^m)^T}\mathrm{vec}\left(\dfrac{\partial \mathbf{e}_q}{\partial \mathbf{p}_q^T}\right) & \dfrac{\partial}{\partial (\mathbf{b}^m)^T}\mathrm{vec}\left(\dfrac{\partial \mathbf{e}_q}{\partial \mathbf{p}_q^T}\right) \end{bmatrix}.$$   (206)

To obtain $\tilde{\mathbf{J}}(\mathbf{x})$, we need to compute $\partial\,\mathrm{vec}\,\mathbf{DEP}/\partial (\mathbf{x}^m)^T$.
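The stacking in Eqs. (196) through (199) can be checked numerically: vec-ing each per-sample matrix $\partial\mathbf{e}_q/\partial\mathbf{p}_q^T$ and concatenating over $q$ yields $\tilde{\mathbf{z}}(\mathbf{x})$, whose squared norm reproduces the triple sum of Eq. (195). The sketch below (Python/NumPy) uses random matrices in place of actual network derivatives; the helper name is illustrative.

```python
import numpy as np

def dep_columns(de_dp_list):
    """Stack vec(de_q/dp_q^T) over q = 1..Q  -- Eqs. (196), (198), (199).
    Each entry of de_dp_list is an (S^M x R) matrix for one sample.
    Column-major (Fortran-order) flattening matches the vec() operator."""
    return np.concatenate([m.flatten(order="F") for m in de_dp_list])

# Check: F2 computed element-wise (triple sum of Eq. (195)) must equal
# z~' z~ with z~ = vec(DEP) (Eq. (197)).
rng = np.random.default_rng(3)
Q, SM, R = 4, 2, 3
blocks = [rng.standard_normal((SM, R)) for _ in range(Q)]

F2_sum = sum(blocks[q][k, r] ** 2
             for q in range(Q) for k in range(SM) for r in range(R))
z_tilde = dep_columns(blocks)
assert np.isclose(F2_sum, z_tilde @ z_tilde)
```

This is exactly why $\tilde{\mathbf{z}}(\mathbf{x}) = \mathrm{vec}\,\mathbf{DEP}$ in Eq. (199): the derivative-error term is a plain sum of squares over the stacked vector.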
However, it is fairly complicated to derive the entire $\partial\,\mathrm{vec}\,\mathbf{DEP}/\partial (\mathbf{x}^m)^T$ at once. Therefore, the single-sample term $\dfrac{\partial}{\partial (\mathbf{x}^m)^T}\mathrm{vec}\left(\dfrac{\partial \mathbf{e}_q}{\partial \mathbf{p}_q^T}\right)$ will be considered first. Then, the batch matrix will be expressed. Each element in the matrix is $\dfrac{\partial}{\partial x_l^m}\left(\dfrac{\partial e_{k,q}}{\partial p_{r,q}}\right)$, where $x_l^m$ is element $l$ of $\mathbf{x}^m$. We will first consider the case when $x_l^m$ is an element of the weight matrix $\mathbf{W}^m$, and then the case when $x_l^m$ is an element of the bias vector $\mathbf{b}^m$.

Consider the case when $x_l^m = w_{i,j}^m$. From Eq. (87) in Chapter 4, and using the fact that

$$\frac{\partial}{\partial p_{r,q}}\left(\frac{\partial e_{k,q}}{\partial n_{i,q}^m}\right) = \frac{\partial}{\partial n_{i,q}^m}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right),$$   (207)

we have

$$\frac{\partial}{\partial w_{i,j}^m}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right) = \frac{\partial}{\partial n_{i,q}^m}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right) \times a_{j,q}^{m-1} + \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \times \frac{\partial a_{j,q}^{m-1}}{\partial p_{r,q}}.$$   (208)

Next, consider the case when $x_l^m$ is an element of $\mathbf{b}^m$, i.e.
$x_l^m = b_i^m$. From Eq. (101), Eq. (104) and Eq. (207), we have

$$\frac{\partial}{\partial b_i^m}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right) = \frac{\partial}{\partial n_{i,q}^m}\left(\frac{\partial e_{k,q}}{\partial p_{r,q}}\right).$$   (209)

Eq. (208) and Eq. (209) can be written in matrix form:

$$\frac{\partial}{\partial(\mathrm{vec}\,\mathbf{W}^m)^T}\mathrm{vec}\left(\frac{\partial\mathbf{e}_q}{\partial\mathbf{p}_q^T}\right) = \left\{\mathbf{1}_{1\times S^{m-1}} \otimes \frac{\partial}{\partial(\mathbf{n}_q^m)^T}\mathrm{vec}\left(\frac{\partial\mathbf{e}_q}{\partial\mathbf{p}_q^T}\right)\right\} \bullet \left\{(\mathbf{a}_q^{m-1})^T \otimes \mathbf{1}_{S^M R\times S^m}\right\} + \left\{\mathbf{1}_{R\times S^{m-1}} \otimes \frac{\partial\mathbf{e}_q}{\partial(\mathbf{n}_q^m)^T}\right\} \bullet \left\{\left(\frac{\partial\mathbf{a}_q^{m-1}}{\partial\mathbf{p}_q^T}\right)^{\!T} \otimes \mathbf{1}_{S^M\times S^m}\right\},$$   (210)

and

$$\frac{\partial}{\partial(\mathbf{b}^m)^T}\mathrm{vec}\left(\frac{\partial\mathbf{e}_q}{\partial\mathbf{p}_q^T}\right) = \frac{\partial}{\partial(\mathbf{n}_q^m)^T}\mathrm{vec}\left(\frac{\partial\mathbf{e}_q}{\partial\mathbf{p}_q^T}\right).$$   (211)

In batch mode for Eq. (210), the batch matrix $\partial\,\mathrm{vec}\,\mathbf{DEP}/\partial(\mathrm{vec}\,\mathbf{W}^m)^T$ can be expressed as

$$\frac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial(\mathrm{vec}\,\mathbf{W}^m)^T} = \left\{\mathbf{1}_{1\times S^{m-1}} \otimes \left(\frac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial(\mathrm{vec}\,\mathbf{N}^m)^T}\times(\mathbf{1}_{Q\times 1}\otimes\mathbf{I}_{S^m})\right)\right\} \bullet \left\{(\mathbf{A}^{m-1})^T \otimes \mathbf{1}_{S^M R\times S^m}\right\} + \left\{\mathbf{1}_{R\times S^{m-1}} \otimes \left(\frac{\partial\,\mathrm{vec}\,\mathbf{E}}{\partial(\mathrm{vec}\,\mathbf{N}^m)^T}\times(\mathbf{1}_{Q\times 1}\otimes\mathbf{I}_{S^m})\right)\right\} \bullet \left\{\left((\mathbf{1}_{1\times Q}\otimes\mathbf{I}_{S^{m-1}})\times\frac{\partial\,\mathrm{vec}\,\mathbf{A}^{m-1}}{\partial(\mathrm{vec}\,\mathbf{P})^T}\right)^{\!T} \otimes \mathbf{1}_{S^M\times S^m}\right\}.$$   (212)

The batch matrix for Eq. (211) can be written as

$$\frac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial(\mathbf{b}^m)^T} = \frac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial(\mathrm{vec}\,\mathbf{N}^m)^T}\times(\mathbf{1}_{Q\times 1}\otimes\mathbf{I}_{S^m}).$$   (213)

To clarify the expressions in Eq. (212) and Eq. (213), first note, from Eq. (196), that

$$\frac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial(\mathrm{vec}\,\mathbf{N}^m)^T} = \begin{bmatrix} \frac{\partial}{\partial(\mathbf{n}_1^m)^T}\mathrm{vec}\left(\frac{\partial\mathbf{e}_1}{\partial\mathbf{p}_1^T}\right) & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \frac{\partial}{\partial(\mathbf{n}_2^m)^T}\mathrm{vec}\left(\frac{\partial\mathbf{e}_2}{\partial\mathbf{p}_2^T}\right) & \cdots & \mathbf{0} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \frac{\partial}{\partial(\mathbf{n}_Q^m)^T}\mathrm{vec}\left(\frac{\partial\mathbf{e}_Q}{\partial\mathbf{p}_Q^T}\right) \end{bmatrix},$$   (214)

and

$$\frac{\partial\,\mathrm{vec}\,\mathbf{DEP}}{\partial(\mathrm{vec}\,\mathbf{N}^m)^T}\times(\mathbf{1}_{Q\times 1}\otimes\mathbf{I}_{S^m}) = \begin{bmatrix} \frac{\partial}{\partial(\mathbf{n}_1^m)^T}\mathrm{vec}\left(\frac{\partial\mathbf{e}_1}{\partial\mathbf{p}_1^T}\right) \\ \vdots \\ \frac{\partial}{\partial(\mathbf{n}_Q^m)^T}\mathrm{vec}\left(\frac{\partial\mathbf{e}_Q}{\partial\mathbf{p}_Q^T}\right) \end{bmatrix}.$$   (215)

In addition, we have

$$\frac{\partial\,\mathrm{vec}\,\mathbf{E}}{\partial(\mathrm{vec}\,\mathbf{N}^m)^T} = \begin{bmatrix} \frac{\partial\mathbf{e}_1}{\partial(\mathbf{n}_1^m)^T} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \frac{\partial\mathbf{e}_2}{\partial(\mathbf{n}_2^m)^T} & \cdots & \mathbf{0} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \frac{\partial\mathbf{e}_Q}{\partial(\mathbf{n}_Q^m)^T} \end{bmatrix},$$   (216)

and

$$\frac{\partial\,\mathrm{vec}\,\mathbf{E}}{\partial(\mathrm{vec}\,\mathbf{N}^m)^T}\times(\mathbf{1}_{Q\times 1}\otimes\mathbf{I}_{S^m}) = \begin{bmatrix} \frac{\partial\mathbf{e}_1}{\partial(\mathbf{n}_1^m)^T} \\ \vdots \\ \frac{\partial\mathbf{e}_Q}{\partial(\mathbf{n}_Q^m)^T} \end{bmatrix}.$$   (217)

We can also write

$$\left\{(\mathbf{1}_{1\times Q}\otimes\mathbf{I}_{S^{m-1}})\times\frac{\partial\,\mathrm{vec}\,\mathbf{A}^{m-1}}{\partial(\mathrm{vec}\,\mathbf{P})^T}\right\}^{\!T} = \left(\frac{\partial\,\mathrm{vec}\,\mathbf{A}^{m-1}}{\partial(\mathrm{vec}\,\mathbf{P})^T}\right)^{\!T}\times(\mathbf{1}_{Q\times 1}\otimes\mathbf{I}_{S^{m-1}}),$$   (218)

and the result is shown in Eq. (98).

From Eq. (212) and Eq. (213), we will need to calculate $\partial\,\mathrm{vec}\,\mathbf{E}/\partial(\mathrm{vec}\,\mathbf{N}^m)^T$ and $\partial\,\mathrm{vec}\,\mathbf{DEP}/\partial(\mathrm{vec}\,\mathbf{N}^m)^T$. (We previously provided the equations for the other terms.) We will first show the derivation for computing the matrix $\partial\,\mathrm{vec}\,\mathbf{E}/\partial(\mathrm{vec}\,\mathbf{N}^m)^T$, followed by $\partial\,\mathrm{vec}\,\mathbf{DEP}/\partial(\mathrm{vec}\,\mathbf{N}^m)^T$. Note that the computation for the term $\mathbf{A}^m$ is in Chapter 2, Eq. (18), and the calculation for the term $\partial\,\mathrm{vec}\,\mathbf{A}^m/\partial(\mathrm{vec}\,\mathbf{P})^T$ is in Chapter 4, Eq. (114).

I.
Calculation of $\partial\,\mathrm{vec}\,\mathbf{E}/\partial(\mathrm{vec}\,\mathbf{N}^m)^T$

Consider an element of $\partial\,\mathrm{vec}\,\mathbf{E}/\partial(\mathrm{vec}\,\mathbf{N}^m)^T$, which is $\partial e_{k,q}/\partial n_{i,q}^m$. This element is known as the Marquardt sensitivity [HaDe96]. This term was computed in Eq. (118) and Eq. (119) in Chapter 4. Hence, $\partial\mathbf{e}_q/\partial(\mathbf{n}_q^m)^T$ can be expressed as

$$\frac{\partial\mathbf{e}_q}{\partial(\mathbf{n}_q^m)^T} = \frac{\partial\mathbf{e}_q}{\partial(\mathbf{n}_q^{m+1})^T}\,\mathbf{W}^{m+1}\,\dot{\mathbf{F}}^m(\mathbf{n}_q^m),$$   (219)

where $\dot{\mathbf{F}}^m(\mathbf{n}_q^m)$ is defined in Eq. (112). Thus, from Eq. (216) and Eq. (219), the batch matrix $\partial\,\mathrm{vec}\,\mathbf{E}/\partial(\mathrm{vec}\,\mathbf{N}^m)^T$ can be expressed as

$$\frac{\partial\,\mathrm{vec}\,\mathbf{E}}{\partial(\mathrm{vec}\,\mathbf{N}^m)^T} = \frac{\partial\,\mathrm{vec}\,\mathbf{E}}{\partial(\mathrm{vec}\,\mathbf{N}^{m+1})^T}\times\{\mathbf{I}_Q\otimes\mathbf{W}^{m+1}\}\times\frac{\partial\,\mathrm{vec}\,\mathbf{A}^m}{\partial(\mathrm{vec}\,\mathbf{N}^m)^T},$$   (220)

where the term $\partial\,\mathrm{vec}\,\mathbf{A}^m/\partial(\mathrm{vec}\,\mathbf{N}^m)^T$ is computed as shown in Eq. (113). Since Eq. (220) is a backpropagation process, we need to initialize it at the output layer, i.e. $m = M$. From Eq. (125), we can write

$$\frac{\partial\mathbf{e}_q}{\partial(\mathbf{n}_q^M)^T} = \frac{\partial\mathbf{a}_q}{\partial(\mathbf{n}_q^M)^T} = \dot{\mathbf{F}}^M(\mathbf{n}_q^M).$$   (221)

We can also write Eq. (221) in batch mode:

$$\frac{\partial\,\mathrm{vec}\,\mathbf{E}}{\partial(\mathrm{vec}\,\mathbf{N}^M)^T} = \frac{\partial\,\mathrm{vec}\,\mathbf{A}}{\partial(\mathrm{vec}\,\mathbf{N}^M)^T}.$$   (222)

This completes the computation of $\partial\,\mathrm{vec}\,\mathbf{E}/\partial(\mathrm{vec}\,\mathbf{N}^m)^T$. Next, the calculation of $\partial\,\mathrm{vec}\,\mathbf{DEP}/\partial(\mathrm{vec}\,\mathbf{N}^m)^T$ will be derived.

II. Calculation of $\partial\,\mathrm{vec}\,\mathbf{DEP}/\partial(\mathrm{vec}\,\mathbf{N}^m)^T$

Recall that an element of $\partial\,\mathrm{vec}\,\mathbf{DEP}/\partial(\mathrm{vec}\,\mathbf{N}^m)^T$ is $\dfrac{\partial}{\partial n_{i,q}^m}\left(\dfrac{\partial e_{k,q}}{\partial p_{r,q}}\right)$. From Eq. (207) along with Eq. (119), we have Eq. (223). By taking the derivative inside the parentheses and using Eq. (207), this turns out to be Eq. (224). Therefore, using Eq. (133) and Eq. (134), Eq. (224) can be written in matrix form.