NOVELTY DETECTION FOR FUNCTION
APPROXIMATION
By
ARJPOLSON PUKRITTAYAKAMEE
Bachelor of Engineering
Chulalongkorn University
Bangkok, Thailand
1997
Submitted to the Faculty of the
Graduate College of the
Oklahoma State University
in partial fulfillment of
the requirements for
the Degree of
MASTER OF SCIENCE
December, 2001
NOVELTY DETECTION FOR FUNCTION
APPROXIMATION
Thesis Approved:
ACKNOWLEDGMENTS
I would like to thank my advisor, Dr. Martin T. Hagan, for his moral support, guidance,
patience, and friendship throughout my graduate studies. I also would like to thank
my other committee members, Dr. R.G. Ramakumar and Dr. Keith A. Teague, for their useful
suggestions.
I wish to sincerely thank again Dr. Martin T. Hagan for his financial support and his
generosity in letting me experience this project.
Finally, I would like to express my special appreciation to my parents and my sister
for their loving support and strong encouragement at the difficult times.
Chapter
TABLE OF CONTENTS
Page
I. INTRODUCTION 1
II. LITERATURE REVIEW 4
Introduction 4
Single-Input Neuron 4
Multiple-Input Neuron 9
Network Architectures 10
A Layer of Neurons 10
Multiple Layers of Neurons 11
Function Approximation 13
Novelty Detection in Neural Networks 16
Introduction 16
What is Novelty Detection? 17
Algorithms for Novelty Detection 18
Summary 19
III. NEURAL TREE ALGORITHM 20
Introduction 20
Neural Tree Algorithm 20
How the algorithm works 23
Application to Novelty Detection 28
Simulation of a simple example 29
Summary 42
IV. THE GAUSSIAN KERNEL ESTIMATOR 44
Introduction 44
Estimated density for novelty detection 45
Background 46
Histogram 47
Naive estimator 48
Kernel estimator 50
The Gaussian kernel estimator 52
One-dimensional data 52
The generalized model 54
Application to novelty detection 56
Simulation of the simple example #1 56
A way to improve the performance: Joint Density 63
Simulation of the simple example #2 64
Summary 68
V. AUTOASSOCIATIVE MULTILAYER PERCEPTRON 69
Introduction 69
Autoassociative Multilayer Perceptron 69
Application to novelty detection 71
Simulations of the simple example #1 74
A technique to improve efficiency 77
Simulation of the simple example #2 78
Summary 81
VI. MINIMUM DISTANCE ALGORITHM 83
Introduction 83
Minimum Distance Computation 83
Application to novelty detection 84
Simulation of the simple example #1 85
A Technique to Improve Performance: Minimum Weighted Distance 88
Simulation of the simple example #2 92
Calculating the estimated derivative 96
How to speed up distance computation 99
The Kohonen rule 100
How to use the Kohonen rule to speed up distance computation 102
Summary 103
VII. MINIMUM DISTANCE AND OUTLIER DETECTION 104
Introduction 104
Principal component analysis 105
Definition and derivation 105
Outlier Detection using principal components 110
The problem of outlier detection for novelty detection 112
Minimum distance of the composite data set 113
Application to novelty detection 115
Simulation of the simple example 119
Summary 123
VIII. SIMULATION RESULTS 125
Introduction 125
Function approximation I 126
Neural tree 129
The Gaussian kernel estimator (GKE) 133
The estimated density of input 133
The estimated input-output density 137
Minimum distance algorithm 141
The minimum distance of input 141
The minimum weighted distance 145
Minimum distance with outlier detection 149
Result summary 155
Function approximation II 157
Neural Tree 158
The Gaussian kernel estimator (GKE) 161
The estimated density of input 162
The estimated input-output density 164
Minimum distance algorithm 168
Minimum distance 168
Minimum weighted distance 171
Minimum distance and outlier detection 174
Result Summary 180
Summary 182
IX. CONCLUSIONS 184
Summary of the results 184
Recommendations for future work 186
REFERENCES 188
Table
LIST OF TABLES
Page
1 Novelty detection decision vs. approximation error 37
2 Percentage of misclassifications: The first test group 155
3 Percentage of misclassifications: The second test group 155
4 Percentage of misclassifications: Training data 156
5 Percentage of misclassifications: The first test group 180
6 Percentage of misclassifications: The second test group 180
7 Percentage of misclassifications: Training data 181
LIST OF FIGURES
Figure Page
1 A Single Neuron 5
2 Pure Linear Transfer Function 6
3 Effect of Weight 6
4 Effect of Bias 7
5 Hyperbolic Tangent Sigmoid Function 8
6 Effect of Weight 8
7 Effect of Bias 9
8 Multiple-Input Neuron 9
9 A Single-Layer Network 11
10 Three-Layer Network 12
11 An Example of Network for Function Approximation 14
12 Network Outputs and Targets 15
13 Network Outputs and Targets on Testing Data 17
14 4-cell Tree 21
15 Partitioned Cells in Hyperspace 22
16 Initialized Divided Hyperspace 25
17 Tree Structure 26
18 Function F(p) 30
19 Training data 31
20 Error between target and network output after training 31
21 Error between target and network output in the normalized hyperspace 32
22 Testing and Training Data 32
23 Error of the testing data 33
24 Initialized Cells 34
25 An 8-Cell Tree 34
26 Partitioned Cells After Training 35
27 Abnormalities outside the cells 36
28 Error and Abnormalities of the 8-cell tree 38
29 Abnormalities from the 200-cell tree 39
30 Error and abnormalities of the 200-cell tree 39
31 Estimated density using neural tree algorithm 41
32 Donut Shape 42
33 Effect of the smoothing parameter on the naive estimator (a) b = 0.03 (b) b = 0.3 (c) b = 3 50
34 The Gaussian kernel estimator with (a) b = 0.3 (b) b = 1 (c) b = 3 53
35 Estimated Density with (a) b = 0.001 (b) b = 0.01 (c) b = 0.1 (d) b = 1 57
36 Estimated Density and The Errors (a) b = 0.001 (b) b = 0.01 (c) b = 0.1 and (d) b = 1 58
37 Estimated density and approximation error: b = 0.0836 59
38 Novelty Detection: Density of input 60
39 Estimated density of a data set 62
40 Estimated density and approximation error: b = 0.1164 65
41 Novelty detection: Density of input and output 66
42 Small-error and low-density points 67
43 Diagram of an autoassociative novelty detector 73
44 Autoassociative error of training data 74
45 Autoassociative Error and Approximation Errors 75
46 Error and abnormalities 76
47 Autoassociative Error versus Approximation Error 79
48 Novelty Detection 80
49 Minimum Distance versus Error 86
50 Error and abnormalities 87
51 Effect of the weighting factor on R 93
52 Approximation Error and minimum weighted distance 94
53 Novelty detection 95
54 Graphical Representation of the Kohonen Rule 101
55 Data and cluster centers 101
56 A Data Set with Outliers 110
57 A Transformed Data Set 111
58 The value for the last PC of the training data 112
59 The value for the last PC of the testing data 113
60 Approximation Error and minimum distance 115
61 The value and the minimum distance 117
62 The value from the composite training data set 120
63 The value from the composite testing data set 121
64 Novelty Detection 123
65 Example plots of voltage and resistivity: Training data 128
66 Voltage and resistivity: Testing data 129
67 Error and abnormalities: The first test group (Neural Tree) 131
68 Error and abnormalities: The second test group (Neural Tree) 132
69 Estimated density and approximation error: The first test group 134
70 Error and abnormalities: The first test group (GKE: input) 135
71 Error and abnormalities: The second test group (GKE: input) 136
72 Estimated density and approximation error: The first test group 138
73 Error and abnormalities: The first test group (GKE: input and output) 139
74 Error and abnormalities: The second test group (GKE: input and output) 140
75 Minimum distance and approximation error: The first test group 142
76 Error and abnormalities: The first test group (Minimum distance) 143
77 Error and abnormalities: The second test group (Minimum distance) 144
78 The weighting factor and the correlation coefficient: The first test group 145
79 Minimum weighted distance and approximation error: The first test group 146
80 Error and abnormalities: The first test group (Minimum weighted distance) 147
81 Error and abnormalities: The second test group (Minimum weighted distance) 148
82 Variance of principal components: Training data 150
83 Minimum distance and the sum-square value: The first test group 151
84 Minimum distance and approximation error: The first test group 152
85 Error and abnormalities: The first test group (Minimum distance and PCA) 153
86 Error and abnormalities: The second test group (Minimum distance and PCA) 154
87 Data points with large and small approximation error: Training data 158
88 Novel Data: The first test group (Neural Tree) 158
89 Error and abnormalities: The first test group (Neural Tree) 159
90 Novel data: The second test group (Neural Tree) 160
91 Error and abnormalities: The second test group (Neural Tree) 161
92 Estimated density and approximation error: The first test group 162
93 Error and abnormalities: The first test group (GKE: input) 163
94 Error and abnormalities: The second test group (GKE: input) 164
95 Estimated density and approximation error: The first test group 165
96 Error and abnormalities: The first test group (GKE: input and output) 166
97 Error and abnormalities: The second test group (GKE: input and output) 167
98 Minimum distance and approximation error: The first test group 168
99 Error and abnormalities: The first test group (Minimum distance) 169
100 Error and abnormalities: The second test group (Minimum distance) 170
101 The weighting factor and the correlation coefficient: The first test group 171
102 Minimum weighted distance and approximation error: The first test group 172
103 Error and abnormalities: The first test group (Minimum weighted distance) 173
104 Error and abnormalities: The second test group (Minimum weighted distance) 173
105 Minimum distance and the sum-square value: The first test group 176
106 Approximation error and minimum distance: The first test group 177
107 Error and abnormalities: The first test group (Minimum distance and PCA) 178
108 Error and abnormalities: The second test group (Minimum distance and PCA) 179
CHAPTER 1
INTRODUCTION
One of the key tasks for neural networks is function approximation or plant identification.
The ability of neural networks in these applications has been well documented.
However, a major factor that limits the usage of neural networks is the difficulty of assessing
the reliability of the neural network outputs. The procedure for determining whether or not a
neural network generates credible results is known as network validation or novelty detection.
Our objective is to compare the performance of existing novelty detection methods as
well as to find improvements to these techniques.
We will start our work by reviewing the general structure of neural networks for
function approximation. A simple example will be used to show the ability of the neural
network in performing this task. We will then demonstrate the main problem of function
approximation using this example.
After the limitation of neural networks is shown, we will introduce existing novelty-detection
algorithms. The simple example will again be utilized to exhibit the capability of
each novelty detector.
We will propose a procedure to improve the performance of some algorithms, followed
by a demonstration of the ability of the modified novelty detector via the example.
We will show through the simple example and our real-world applications that the modified
methods result in improvements in novelty detection.
Let us now outline the flow of this work. Chapter 2 will serve to review neural network
background material, starting from basic concepts to the general structure. One of the
main objectives of this chapter is to introduce common notation used in later chapters. We
will finish this chapter by introducing the use of neural networks as function approximators.
In Chapter 3, we will introduce an existing algorithm known as the neural tree. We
will explain how we can use this algorithm as a novelty detector. A simple example will be
used to illustrate the capability of this method. We end this chapter by introducing a measure
of algorithm performance.
The most widely used method for novelty detection will be described in Chapter 4.
This method, known as the Gaussian kernel estimator, is based on estimating the probability
density function. A technique for improving the performance will be described. It
will be followed by simulation results.
Chapter 5 will describe another algorithm, the autoassociative multilayer perceptron.
The example will demonstrate the capability of this method. A technique for improving
performance will be described, followed by an application to the simple example.
Then, in Chapter 6, the simplest method, known as the minimum distance algorithm,
will be introduced. An example will be used to show the efficiency of this method.
An improvement to this algorithm, which is called the minimum weighted distance, will be
proposed, followed by an example. The mathematical framework that leads us to the idea
of the minimum weighted distance algorithm will be discussed as well.
In Chapter 7 we explain the idea of principal components and outlier detection, and
explain a new technique that combines the knowledge of outlier detection with the minimum
distance algorithm. This is followed by an example.
Chapter 8 will be devoted to applying the techniques explained in Chapters 3, 4, 6,
and 7 to real-world data. We will provide short explanations of these methods within this
chapter. A performance comparison of these algorithms on real-world data will be summarized
at the end of this chapter.
A summary of the main results and contributions of this thesis, followed by recommendations
for future work, will be contained in Chapter 9.
CHAPTER 2
LITERATURE REVIEW
Introduction
This chapter describes the fundamental concepts of the neural network and introduces
the associated notation. It will start with an analysis of a single-input neuron, which
is the smallest and the most basic component in the neural network, and this leads to more
complicated architectures. The simplest architecture has a single layer of neurons. The
more complex architectures have several layers. The multiple layer network has been widely
used to perform pattern recognition. However, this chapter will focus on how the multiple
layer network can be used for function approximation.
After we introduce the basic neural network concepts, the concept of novelty detection
in neural networks will be explained. We will define what a novelty detector is and will
describe when it will be applied and how it relates to neural network function approximators.
Single-Input Neuron
A neuron is the smallest processing unit in the neural network. It has a scalar input
p and a scalar output a. The input is multiplied by the scalar weight w and then added to
the scalar bias b. The output of the summer, n = wp + b, is then fed to the
input of the transfer function f. The output of the neuron is described by the following
equation.
a = f(n) = f(wp + b)   (1)
The weight and bias of a neuron can be any real value. They are adjustable parameters and
are adapted by some learning rule.
The transfer function of the neuron can be a very simple linear function, or it can be
a more complex nonlinear function such as the hard limit function (hardlim) or the logarithmic
sigmoid function (logsig). The transfer function will be specifically chosen, depending
on the particular problem that the neuron is going to solve.
Figure 1 A Single Neuron
As seen in Equation (1), the weight w controls the slope of the neuron output and
the bias b causes a translation in the neuron output. Next, we will give simple examples to
demonstrate how the weight and bias affect the output of the neuron. The insights provided
by these examples will be helpful when applying neural networks to the practical tasks that
will be described later.
Let the transfer function f be the pure linear function purelin. The relation between
the neuron input and output is
a = f(wp + b) = wp + b   (2)
The following figure demonstrates this relationship for w = 1 and b = 0.
Figure 2 Pure Linear Transfer Function
The following figures show how changes to the weight and bias affect the response. Figure
3 illustrates the effect of changing the weight from 1 to 3. The effect of varying the bias is
shown in Figure 4.
Figure 3 Effect of Weight
Note that when setting b = 2, the graph will shift to the left by b/w. This is shown in Figure
4, where the graph shifts to the left by 2 for b = 2 and w = 1.
Figure 4 Effect of Bias
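The two effects can be sketched in a few lines of Python; the weight and bias values below are illustrative assumptions, not taken from the figures:

```python
import numpy as np

def purelin_neuron(p, w, b):
    """Single-input linear neuron: a = purelin(wp + b) = wp + b."""
    return w * p + b

# The weight controls the slope: w = 3 gives a line three times
# steeper than w = 1.
print(purelin_neuron(np.linspace(-1, 4, 6), w=3.0, b=0.0))

# The bias shifts the graph left by b/w: with w = 1 and b = 2 the
# zero crossing moves from p = 0 to p = -2.
print(purelin_neuron(-2.0, w=1.0, b=2.0))  # → 0.0
```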
In the next example, we let the transfer function f be the hyperbolic tangent sigmoid.
The hyperbolic tangent sigmoid function is a monotonically increasing function
shown in Figure 5. This function has been extensively used for function approximation
problems, as will be explained later in this chapter. The formula for this function is
f(n) = (e^n - e^(-n)) / (e^n + e^(-n))   (3)
Figure 5 Hyperbolic Tangent Sigmoid Function
The function output is bounded to the range [-1, 1], regardless of how large the input is.
The input-output relationship is almost linear in a small region near zero. The effects of
weight and bias are demonstrated in the following figures.
Figure 6 Effect of Weight
Figure 7 Effect of Bias
As in the case of the pure linear function, increasing the weight value increases the slope
of the graph. The bias shifts the center of the graph to the point -b/w.
Multiple-Input Neuron
Commonly, a neuron can have several inputs. R inputs connecting to a neuron are
expressed as p^T = [p1 p2 ... pR]. A neuron with R inputs is shown in Figure 8.
Figure 8 Multiple-Input Neuron
The net input for this neuron is n = w1,1 p1 + w1,2 p2 + ... + w1,R pR + b. Notice that the
number of weights equals the number of input elements R. The first weight subscript indicates
the destination neuron, whereas the second subscript indicates the input element. For
example, w1,5 means that this weight represents the connection to the first neuron from the
fifth input p5. We can write the net input n in matrix form:
n = Wp + b   (4)
Then the neuron output a is written as:
a = f(n) = f(Wp + b)   (5)
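A minimal sketch of this neuron in Python; the three-input weight and input values below are illustrative assumptions:

```python
import numpy as np

def neuron(p, w, b, f=np.tanh):
    """Multiple-input neuron: n = Wp + b, a = f(n).
    Here w is a length-R weight vector and p a length-R input vector."""
    n = w @ p + b   # net input n = w1,1 p1 + ... + w1,R pR + b
    return f(n)

# Hypothetical three-input neuron with a tansig transfer function.
p = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, -0.1])   # w[j] connects input j+1 to the neuron
print(neuron(p, w, b=0.1))
```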
Network Architectures
A single multipleinput neuron may not be able to perform all tasks. Several neurons,
operating in parallel, are called a "layer". A layer of neurons will be discussed next,
followed by a discussion of multiple layers of neurons.
A Layer of Neurons
The architecture of a single layer of S neurons is shown in Figure 9. The architecture
possesses R inputs, and S outputs. Note that the number of inputs R is not necessarily
equal to the number of outputs S.
Figure 9 A Single-Layer Network
The input vector p is connected to each neuron through the weights, which will be
defined as W.
W =
[ w1,1  w1,2  ...  w1,R
  w2,1  w2,2  ...  w2,R
  ...
  wS,1  wS,2  ...  wS,R ]   (6)
A layer of multiple-input neurons is sufficient to solve some problems, such as linear
adaptive filtering or simple pattern recognition tasks. However, this structure is not
adequate for approximating arbitrary functions. The more general multiple layer architecture
will be introduced in the next section.
Multiple Layers of Neurons
This network consists of several layers, connected in series. Each layer has the same
structure as noted earlier. Because this architecture has several layers, a superscript will
be used to identify the layer number. Consequently, a weight can be written as w^k_(i,j), where
k denotes the layer to which the weight belongs. A three-layer network is shown in Figure
10.
Figure 10 Three-Layer Network
The first layer of neurons has R input elements, S^1 neurons and S^1 outputs. The
second layer has S^1 input elements, S^2 neurons, and S^2 outputs. Likewise, the third layer
contains S^2 input elements, S^3 neurons and S^3 outputs. S^1, S^2 and S^3 are the numbers of
neurons in the first, second and third layers, respectively. In the general case the number of
layers is arbitrary.
The weight matrix in the first layer is denoted W^1. This layer weight, W^1,
is a matrix with dimension S^1 x R. Similarly, W^2 and W^3 are the weight matrices in the second
and third layers, respectively. W^2 has dimension S^2 x S^1, while the size of W^3 is
S^3 x S^2. Generally, the weight matrix at layer k, W^k, has dimension S^k x S^(k-1). A network
comprised of R input elements and S^1, S^2, and S^3 neurons will be referred to as an
R - S^1 - S^2 - S^3 network.
The output of each layer is the input to the next layer, and is denoted a^k. For instance,
the output of the second layer is denoted a^2. The last layer of the network is called
an "output layer". The other layers, which are internally connected between the input vector
and the output layer, are commonly called "hidden layers".
This structure of the network is powerful enough to estimate arbitrary functions, as
will be shown below.
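The layer-by-layer composition can be sketched as a generic forward pass; the 2 - 3 - 1 network shape and the random parameters below are illustrative assumptions:

```python
import numpy as np

def forward(p, weights, biases, transfers):
    """Forward pass through a multilayer network: layer k computes
    a_k = f_k(W_k a_{k-1} + b_k), with a_0 = p."""
    a = p
    for W, b, f in zip(weights, biases, transfers):
        a = f(W @ a + b)
    return a

# Hypothetical 2 - 3 - 1 network: tansig hidden layer, purelin output.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
transfers = [np.tanh, lambda n: n]   # purelin output layer
print(forward(np.array([0.5, -0.2]), weights, biases, transfers))
```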
Function Approximation
Multilayer networks have been broadly used as function approximators. For example,
in control systems neural networks are used to mimic plants in order to obtain proper feedback
signals. They have been widely used to compensate for channel fading in
telecommunication systems. Adaptive filtering is another application employing neural
networks. The function approximation abilities of neural networks are discussed below.
The multilayer network has several layers of neurons with S^k neurons in the kth
layer. Normally, the number of layers of a network, N, is two or, at most, three. The number
of neurons in the hidden layer is heuristically specified, and depends on how complex the
function is. The number of neurons in the output layer depends on the number of outputs in
the desired function. Though it seems that the more neurons in the hidden layers, the better
a network can perform, it is possible that an overly complex network can overfit on a finite
training set. Thus, the appropriate number of neurons is dependent on the individual problem.
Suppose that the network is a two-layer 1 - 2 - 1 network, as shown in Figure 11.
Let's assume that the hyperbolic tangent sigmoid is used in the first layer, and the second
layer transfer function is linear.
Figure 11 An Example of Network for Function Approximation
The input/output relation is shown in the following equation:
a^2 = f(p) = f^2(W^2 f^1(W^1 p + b^1) + b^2)   (7)
where f^2(n) = purelin(n) = n and f^1(n) = tansig(n) = (e^n - e^(-n)) / (e^n + e^(-n)).
We have trained this network to approximate the function F(p) = p^4 over the range
p in [-1, 1]. After training, the following weights and biases were obtained:
W^1 = [3.03  -3.03]^T, b^1 = [-3.2  -3.2]^T, W^2 = [1.22  1.22], b^2 = 2.42   (8)
Figure 12 shows the network response as well as the target values p^4. The network outputs
are very close to the targets. This is just an example showing that a multiple-layer network
can be used to approximate arbitrary functions.
Figure 12 Network Outputs and Targets
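As a quick check, the 1 - 2 - 1 network with the parameters of Equation (8) can be evaluated directly. A minimal Python sketch follows; the sign pattern of the parameters is an assumption reconstructed from the scanned text (which drops minus signs), chosen so that the network output is even in p, as p^4 requires:

```python
import numpy as np

# Parameters of the trained 1-2-1 network (Equation (8)); the sign
# pattern is a reconstruction, since the scanned text drops minus signs.
W1 = np.array([[3.03], [-3.03]]); b1 = np.array([-3.2, -3.2])
W2 = np.array([[1.22, 1.22]]);    b2 = np.array([2.42])

def net(p):
    """Two-layer network: tansig hidden layer, purelin output layer."""
    a1 = np.tanh(W1 @ np.atleast_1d(p) + b1)   # a1 = tansig(W1 p + b1)
    return (W2 @ a1 + b2)[0]                   # a2 = purelin(W2 a1 + b2)

for p in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"p = {p:5.2f}   network = {net(p):7.4f}   target p^4 = {p**4:.4f}")
```

Over [-1, 1] this reconstruction tracks p^4 to within a few hundredths, consistent with the close fit shown in Figure 12.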
There are many algorithms, such as the Levenberg-Marquardt algorithm, the Bayesian
regularization algorithm, and the gradient descent algorithm, to train the weights and
biases. Such algorithms are in general called "backpropagation algorithms", and can be
found in many books and papers, such as [HaDeBe96], [HaMe94], [Fahl89], etc. They generally
minimize errors between targets and network outputs. The mathematical derivations
of these algorithms will not be included here since they are not the focus of this project.
Now that we have discussed the basic concepts, we will next introduce the concept
of "novelty detection".
Novelty Detection in Neural Networks
Introduction
Recall from the last section that a network, given a sufficient number of neurons,
can be trained to approximate arbitrary functions. However, the performance of the trained
network will be dependent upon the data set that was provided during the training period.
In other words, a training algorithm minimizes the errors between the targets and network
outputs, for the training data set. When the network is subjected to data that were not in the
training (usually called "testing" data), what will the outputs of the neural network look
like? Will the network still approximate the function accurately? The following section will
descrihe this problem.
Consider the previous example in which the network was trying to approximate the
function F(p) = p^4. The inputs p fed into the network during the training process had the
range [-1, 1]. The outputs of the network eventually looked similar to the targets for this
range of input. Now, suppose that new inputs that are between [-2, 2] are applied to the network.
The results of this test are shown in Figure 13.
Figure 13 Network Outputs and Targets on Testing Data
In the region of input data between [-1, 1], which is the same as that of the training data,
the network performs well, as expected. However, the network performs poorly outside this
region, producing very large errors between targets and outputs.
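This failure outside the training range is easy to reproduce. The sketch below evaluates a 1 - 2 - 1 network of the kind described above (the parameter signs are assumptions, reconstructed from the scanned text, which drops minus signs) at inputs beyond [-1, 1]:

```python
import numpy as np

def net(p):
    """1-2-1 network trained on F(p) = p^4 over [-1, 1]; the parameter
    signs are assumed, reconstructed from the scanned text."""
    a1 = np.tanh(np.array([3.03 * p - 3.2, -3.03 * p - 3.2]))
    return 1.22 * a1.sum() + 2.42

# Inside the training range the approximation error is small...
print(abs(net(1.0) - 1.0 ** 4))
# ...but outside it both tanh units saturate, the network output
# flattens near 2.4, and the target keeps growing toward 16.
print(net(2.0), 2.0 ** 4)
```

The saturation of the hidden-layer sigmoids is exactly why the output flattens instead of following p^4 during extrapolation.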
Unfortunately, in real-world applications, it is difficult to tell when an input vector
falls outside the range of the inputs in the training set. Should we count on the network outputs?
The next section will describe what we can do to alleviate such concerns.
What is Novelty Detection?
As shown in Figure 13, the network did not accurately approximate the function
outside the training range. Therefore, we need to be able to identify when an input vector
falls outside the range of the training data. This is called "Novelty Detection", In other
words, novelty detection methods should have the capability to detect input data that are
·'abnormal". We expect that this data will generate large errors. If we can detect whether or
not inputs are unusual, confidence in the network outputs would be stronger. Novel data
17
(which may cause considerable errors) would be rejected, whereas standard data (giving
desirable outputs) would be accepted.
Considering the example shown above, some might think that novel data could be
easily distinguished by picking out data points outside the training data range (bounded between
[-1, 1] in this case). Then the rest, which occupy the intervals [-2, -1] and [1, 2], would
be novel points.
be novel points. In practical applications, the dimension of the input to the network will be
much larger than that of the input (one) shown in the example, making it much harder to
detect these "unseen" data points. When the dimension of the input is large, the distinction
between interpolation and extrapolation will be much more ambiguous. Therefore, differentiating
the boundaries between normal and abnormal data is much more difficult, making
it harder to decide whether a specific input should be accepted or rejected. In the next section,
some algorithms for novelty detection will be introduced.
Algorithms for Novelty Detection
As discussed thus far, whether or not the output of the network for a particular test
input should be relied on is dependent upon the difference between the test input and the
inputs in the training data. For example, in the above example, if an input was within
the interval [-1, 1], the corresponding output would be accepted, since the network was trained
to perform well in this interval. In contrast, if an input was out of the range [-1, 1], the corresponding
output would be disregarded. In other words, inputs will be identified when they
are not close to any training inputs. This concept leads to some existing algorithms, such as
the neural tree [Mart98] and the minimum-distance computation. They are similar, in that
they will flag any data as "abnormal" when the input vector is far from any training data.
The performance of the minimum-distance algorithm could be improved by applying some
weighting factors, which will be explained in Chapter 6. The algorithm that uses the Gaussian
kernel estimator model [Bish94], as elucidated in Chapter 4, works by computing
the probability of the existence of an input vector in the training data near the test
input vector.
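The minimum-distance idea can be sketched in a few lines; the threshold and data below are illustrative assumptions, and Chapter 6 develops the method properly:

```python
import numpy as np

def min_distance(p, training_inputs):
    """Smallest Euclidean distance from test input p to any training input."""
    return np.min(np.linalg.norm(training_inputs - p, axis=1))

def is_novel(p, training_inputs, threshold):
    """Flag p as novel when it is far from every training input; the
    threshold value is problem-dependent (an assumption here)."""
    return min_distance(p, training_inputs) > threshold

# Training inputs drawn from [-1, 1], as in the p^4 example above.
train = np.linspace(-1, 1, 21).reshape(-1, 1)
print(is_novel(np.array([0.25]), train, threshold=0.3))  # False: interpolation
print(is_novel(np.array([1.8]), train, threshold=0.3))   # True: extrapolation
```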
Unlike the algorithms described above, we can also use an autoassociative multilayer
perceptron for novelty detection [FrGoPr96]. This method, which will be explained in
Chapter 5, is used to recognize input vectors in the training data. An additional neural network
is trained to memorize what the training inputs look like. Finally, the combination of
principal component analysis and a newly defined minimum-distance computation will be
discussed in Chapter 7.
Summary
In this chapter, the multilayer neural network was introduced. We described the
ability of the multilayer network to operate as a very general function approximator. These
multilayer function approximators are very good at interpolating between data points on
which they were trained. However, they are not good at extrapolating outside the training
set. The remainder of this thesis will present algorithms that can be used to detect when a
network is performing an extrapolation.
CHAPTER 3
NEURAL TREE ALGORITHM
Introduction
The neural tree algorithm, originally proposed in [Mart98], is the combination of
an unsupervised-learning competitive network and a binary tree. The method takes advantage
of fast learning, because it only deals with scalar information, unlike competitive networks
that require matrix computation. Therefore, the algorithm generally uses less
training time than competitive learning.
In this chapter, we will start by defining the notation used in this algorithm and by
explaining how it relates to the data distribution. Then the process of learning the data
distribution, i.e. training a tree, will be explained. After the tree is trained, the procedure for
using a neural tree for novelty detection will be explained, and simple computer simulations
will be shown. Finally, we will explain how to measure the effectiveness of a novelty-detection
algorithm.
Neural Tree Algorithm
The neural tree hierarchically partitions a q-dimensional space into cells, separated
by hyperplanes orthogonal to the coordinate axes. As with any common search tree, the
neural tree contains nodes. The node at the top of the tree is called the root node,
while the others are called child nodes, or leaf nodes. Each node stores a "weight" w_ij. The
i-th hyperplane decision boundary is orthogonal to the j-th coordinate axis, and the position
of the hyperplane is at w_ij along axis j. Below the last level of child nodes are the "cells".
A cell represents a certain region in the hyperspace that is hierarchically partitioned by the
weights. Note that an N-cell tree has N - 1 nodes that need to be trained. Figure 14 is an
example of a 4-cell tree structure (having 3 nodes). The "circles" represent nodes, and
"rectangles" represent partitioned cells. From Figure 14, the 2-dimensional hyperspace can
be divided into four cells as illustrated in Figure 15.
Figure 14 4-cell Tree
Figure 15 Partitioned Cells in Hyperspace
The cell C1 occupies the region less than weight w11 along coordinate axis 1. C2 is the
region between w11 and w31 along axis 1 and less than w22 along axis 2. Similarly, C3 is
restricted to the area greater than w31 along axis 1 and less than w22 along axis 2. C4 occupies
the area greater than w11 along axis 1 and greater than w22 along axis 2. Therefore,
the weights wij determine the boundaries of the cells. In the 2-dimensional hyperspace, any
data point pk = [pk1 pk2]^T can be located in a certain cell by first comparing its j-th element
pkj, where j ∈ {1, 2}, with the root node weight w1j. If pkj > w1j, the data point pk will be sent to the
right child node. Otherwise, it will be sent to the left child node. The scalar comparisons
continue until there is no node left to compare. For example, from Figure 14 and Figure
15, assume that pk1 < w11; the data point pk will then be sent to the left child node. However,
since there is no left child node under the root node, the scalar comparison stops and
pk belongs to cell C1.
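The cell-lookup walk just described can be sketched in code. This is an illustrative sketch, not code from [Mart98]; the Node class, the cell labels, and the numeric weight values are assumptions chosen to mirror the tree of Figures 14 and 15.

```python
# Hypothetical sketch of locating a cell by scalar comparisons down a neural
# tree. The weight values below are invented for illustration.

class Node:
    def __init__(self, axis, w, left, right):
        self.axis = axis    # j: coordinate axis the hyperplane is orthogonal to
        self.w = w          # w_ij: position of the hyperplane along that axis
        self.left = left    # subtree or cell label for p[axis] <= w
        self.right = right  # subtree or cell label for p[axis] >  w

def locate_cell(node, p):
    """Follow scalar comparisons down the tree until a cell label is reached."""
    while isinstance(node, Node):
        node = node.right if p[node.axis] > node.w else node.left
    return node  # a cell label, e.g. "C1"

# The 4-cell tree of Figure 14/15: root w11, then w22, then w31 (values assumed).
tree = Node(0, 0.0,                        # w11
            "C1",
            Node(1, 0.0,                   # w22
                 Node(0, 0.5, "C2", "C3"), # w31
                 "C4"))

print(locate_cell(tree, [-0.3, 0.9]))  # left of w11 -> C1
```

Because each step compares a single scalar, the cost of a lookup is just the depth of the tree.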
The following section will describe the algorithm for adjusting the weights. Details
and proofs can be found in [Mart98].
How the algorithm works
In order to train a tree, we need to initialize it first. The initialization process can be
done in several ways. We can randomly select two values, j and w, for initializing a node.
However, the initialized tree may be a very poor fit to the data distribution, which can make
training difficult. Alternatively, all N - 1 nodes of an N-cell tree can be initialized by using
N random samples from the training data. This method will be sensitive to the selection of
the N sampled data points. To reduce the sensitivity to the sampled data, a method called
nominal initialization can be applied, which considers the distribution of the entire training
data set before constructing a tree. This is explained in [MaRoGi85] and [RiGr91]. In
this thesis, we will initialize the tree by sampling the training data set.
We first need to find an axis to which a decision hyperplane will be orthogonal, and
then we need to compute the location of the decision hyperplane on the axis. To locate the
axis, we compare the elements of an incoming data vector and a previous vector that is located
in the same partitioned cell. The element that shows the biggest difference
determines the axis to which the hyperplane will be orthogonal. Then, the location of
the hyperplane on the axis is found by computing the mean of the corresponding element
of the two data points. The initialization process will continue until the desired number of
cells is reached. Note that when the process is done, each partitioned cell will contain only
one data point. An example of the initialization process is given below.
Assume that a four-cell tree is to be trained. Four sample vectors must be drawn
from the training data, and suppose that they are

d1 = [-0.3, 0.3]^T, d2 = [-0.4, 0.8]^T, d3 = [0.8, 0.7]^T, d4 = [0.7, 0.75]^T    (9)
First, the samples d1 and d2 will be compared in order to place the root node. The difference
between the samples is |d1 - d2| = [0.1, 0.5]^T. Therefore, we will place the hyperplane
on the p2 axis at the location ((0.8) + (0.3))/2 = 0.55, making the root node weight
w12 = 0.55. Therefore, d2 is in the cell above the boundary p2 = 0.55, and d1 is in
the cell below the boundary.
Next, we will apply d3 to the initialized node by comparing the second element of
d3, 0.7, with w12. It turns out that 0.7 > 0.55, and thus d3 is in the cell above the boundary
p2 = 0.55, which is the same region as d2. Therefore, we will compute the difference
between d2 and d3, |d2 - d3| = [1.2, 0.1]^T. Since the first element is greater than
the second, we will set up a hyperplane orthogonal to the p1 axis at the location
((-0.4) + (0.8))/2 = 0.2, thereby making w21 = 0.2.
Now, d1 is below the boundary p2 = 0.55, d2 is above p2 = 0.55 and less
than p1 = 0.2, while d3 is above p2 = 0.55 and greater than p1 = 0.2. We will continue
the initialization process by applying d4 to the initialized tree by finding the cell that
d4 falls into. Since 0.75 > 0.55 and 0.70 > 0.2, d4 is in the same region as d3. We will
calculate the difference again, |d3 - d4| = [0.10, 0.05]^T. We therefore put the new hyperplane
on the p1 axis at the location ((0.8) + (0.7))/2 = 0.75, resulting in w31 = 0.75. We
now have the desired number of cells, and we therefore stop the initialization process. Figure
16 shows how the hyperspace is divided into 4 partitioned cells.
"" .."
0.1 ., 0
0.,
0.•
0.'
0.'
c,
<>~ 0
<l.' .,
0
<l.' .."
"'.1
"'.1 0·' C'
\
, .().I 0. .." .." ., 0.' 0' 01 01
Figure 16 Initialized Divided Hyperspace
The corresponding tree structure that will represent the divided hyperspace is shown in Figure
17.
Figure 17 Tree Structure
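The node-placement rule used in the walk-through above (pick the axis of largest elementwise difference, split at the midpoint) can be sketched as follows; the function name is an assumption, and the sample values follow the worked example.

```python
# Illustrative sketch of one node-placement step of the initialization pass.

def place_node(d_prev, d_new):
    """Pick the split axis (the largest elementwise difference between the two
    points sharing a cell) and place the hyperplane at their midpoint."""
    diffs = [abs(a - b) for a, b in zip(d_prev, d_new)]
    axis = diffs.index(max(diffs))
    w = (d_prev[axis] + d_new[axis]) / 2
    return axis, w

# First step of the worked example: d1 and d2 differ most on the second axis,
# so the root weight is w12 = (0.3 + 0.8)/2 = 0.55.
axis, w = place_node([-0.3, 0.3], [-0.4, 0.8])
print(axis, round(w, 4))  # -> 1 0.55
```

Repeating this step each time a new sample lands in an occupied cell yields one new node per sample, which is why N samples produce the N - 1 nodes of an N-cell tree.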
After we have the initialized tree, the next step is to train the weights contained
in the tree. Suppose that a vector pk = [pk1 pk2 ... pkj ... pkq]^T flows to node i, which
stores weight wij, at layer K. If pkj ≤ wij, the vector pk will be sent to the left child node
directly under node i. Likewise, if pkj > wij, the vector will be forwarded to the right child
node directly under node i. Simultaneously, weight wij will be updated according to the
following equation:
wij(t+1) = wij(t) + η(t) [ I_R(i)(pkj)/nr(i) - I_L(i)(pkj)/nl(i) ]    (10)

where η(t) is the learning rate at time t, with

I_L(i)(pkj) = 1 if pkj ∈ L(i), 0 otherwise    (11)

and

I_R(i)(pkj) = 1 if pkj ∈ R(i), 0 otherwise    (12)
where L(i) and R(i) are the unions of all partitioned cells belonging to the left or right subtree
of node i, respectively. The terms nl(i) and nr(i) are the numbers of partitioned cells
associated with the left and right subtrees under node i. Note that the learning rate η(t)
can be adjusted with time. Its value will be reduced during training so that the algorithm
will converge.
The algorithm is a top-to-bottom learning method: the root node is trained first, and
training then proceeds down to the lowest level of children nodes. We will apply the next input vector to the
root node and the method will be repeated. Notice that as long as data are applied to the
tree, the boundaries of the cells will gradually move in accordance with the weight update
in Equation (10). Therefore, the locations of the hyperplanes of the partitioned cells at this
moment are still in transition. The tree will learn the data distribution until the final input
comes in. After training is complete, all the nodes in the tree will contain fixed weight values.
We can refer to a specific cell in the final tree by using the following notation:

C ∈ [(a1, b1), (a2, b2), ..., (aq, bq)]    (13)

The meaning is that the cell C occupies the region from a1 to b1 in the first coordinate
axis, from a2 to b2 in the second axis, and so on. For example, recalling Figure 15, cell C1
represents the region [(-∞, w11), (-∞, ∞)]. The quantization range of the first coordinate
axis is defined as |a1 - b1|, and likewise for any other dimension. Notice that for such unbounded
cells the quantization range is unmeasurable. The following section will describe how to make it measurable
and how the neural tree algorithm relates to novelty detection. The convergence of
the algorithm was proven in [Mart98].
Application to Novelty Detection
After training is complete, the neural tree contains fixed boundaries, which partition
the hyperspace into different cells. We would like to use the trained tree to detect future
inputs that are unlike the inputs used for training (i.e., we are looking for novel inputs).
There are two approaches to identifying these novel data. The first approach is to compute
the probability density of each cell. If the probability density of a cell is low, it indicates that
data within that cell are less likely to occur. Therefore, any data falling in a cell having low
probability density will be more likely to be rejected as novel data. For the second approach,
data will be identified as novel when they are outside of a cell. We will explain these
two approaches in the following paragraphs.
For the first method (identifying novel data by estimating the probability density),
Martinez [Mart98] suggested that the probability of a vector falling into a cell is equal to
1/N, where N is the number of cells. Therefore, in order to obtain the probability density, we
need to divide the probability by the cell area in the two-dimensional space (or volume in
higher-dimensional spaces). That means that if the volume of a cell is large, the probability
density of a vector falling into the cell will be low, thereby making the data in the cell prone
to being rejected as novel. As we mentioned at the end of the last section, some cells occupy
infinite area (in the two-dimensional case). We need to limit the occupied region of
a cell. Martinez suggested using the maximum and minimum values of the training data.
That means that the unmeasurable value will be replaced by -1 or 1 in the normalized hyperspace.
For example, in Figure 15, after training the tree, cell C1 will occupy the region
[(-1, w11), (-1, 1)]. Then, the area or volume of cell C1 that was infinite is now computable.
Therefore, the probability density of cell C1 will then be (1/4)/(|-1 - w11| × |-1 - 1|).
As we explained earlier, the density of a cell will be low if the volume of the cell is high.
And any data falling in the cell will be more likely to be rejected as novel data, since the
chance of such data occurring is small.
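The density computation just described can be sketched as follows, under the stated assumption that infinite cells have been clipped to the normalized hyperspace; the function name and the example weight value are illustrative.

```python
# Sketch of the cell-density estimate: probability 1/N divided by cell volume.

def cell_density(bounds, n_cells):
    """bounds = [(a1, b1), (a2, b2), ...] per Equation (13), after clipping;
    density = (1/N) / volume."""
    volume = 1.0
    for a, b in bounds:
        volume *= abs(b - a)
    return (1.0 / n_cells) / volume

# Cell C1 of Figure 15, clipped: [(-1, w11), (-1, 1)] with, say, w11 = 0.0.
print(cell_density([(-1.0, 0.0), (-1.0, 1.0)], n_cells=4))  # -> 0.125
```

A large clipped volume immediately gives a low density, which is exactly why data in large cells tend to be rejected as novel.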
For the second approach, novel data will be identified when they are outside the cell
boundaries. In this case, the maximum or minimum values of the training data in a cell may be
used to limit the cell size in every dimension. Note that we may use other values, such
as the maximum or minimum plus some margin, to limit the cell size. A further study of how
to choose an appropriate margin may be required. However, we will use zero margin in this
thesis. By applying this technique, every cell size will be fixed and finite. After we perform
this procedure, the tree can be used for novelty detection. After an input vector follows the
tree structure and is located in a cell, the algorithm determines whether the input vector is
more than a certain distance beyond the cell boundary. Abnormalities are identified whenever
input vectors are outside of their cells, regardless of which boundaries they break.
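The boundary test described above can be sketched in code, with the zero margin used in this thesis exposed as a parameter; the representation of a cell as per-axis (min, max) limits is an assumption.

```python
# Sketch of the boundary-based novelty test (zero margin by default).

def is_novel(p, bounds, margin=0.0):
    """Flag p as novel if any coordinate lies outside its cell's limits
    (plus an optional margin), regardless of which boundary is broken."""
    return any(x < lo - margin or x > hi + margin
               for x, (lo, hi) in zip(p, bounds))

bounds = [(-1.0, 0.0), (-1.0, 1.0)]        # a clipped cell like C1
print(is_novel([-0.5, 0.2], bounds))       # inside  -> False
print(is_novel([-0.5, 1.3], bounds))       # outside -> True
```

With a positive margin, points slightly beyond the training data would still be accepted, trading type I errors for type II errors.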
We will show the simulation results of these two cases for novelty detection in the
following example.
Simulation of a simple example
We will begin this section by demonstrating the capability of a neural network for
function approximation. Then, novelty detection using the neural tree algorithm with the
second approach we discussed in the last section will be applied. We will also show the
simulation results using the first approach (density estimation) and will introduce the problem
of utilizing this approach for novelty detection.
The following is a two-dimensional example used to demonstrate novelty detection
employing the neural tree algorithm. This example will be used to demonstrate the other
algorithms as well.
In the example, a two-layer feedforward neural network, with 40 neurons in the
hidden layer and one neuron in the output layer, was trained to approximate the following
function:
t = F(p) = F([p1 p2]^T) = sin(10·sqrt(p1^2 + p2^2)) / (10·sqrt(p1^2 + p2^2))    (14)

Figure 18 is a graph of the function F(p).
Figure 18 Function F(p)
Figure 19 shows where the training data is located.
Figure 19 Training data
Figure 20 shows the error on the 638 training points from Figure 19. We can see that the
network provides an accurate approximation to the function for all training points.
Figure 20 Error between target and network output after training
Figure 21 shows the error between the output of the function and the output of the
function approximator. We can see that the errors outside the training data region are larger than
the errors within the training data region.
Figure 21 Error between target and network output in the normalized hyperspace
Assume that we are going to test the trained network using 437 new data points. The
testing and training points are shown in Figure 22.

Figure 22 Testing and Training Data
Let's apply the 437 test inputs to the trained network. Figure 23 shows the errors
on the testing data.
Figure 23 Error of the testing data
The large errors that occur in Figure 23 are for inputs outside the range of the training
data. We would like to use a neural tree to detect inputs outside the range of the training
data. This would enable us to determine the reliability of the multilayer network output. In
this example we use the tree described in Figure 24 and Figure 25.
Figure 24 Initialized Cells
Figure 25 An 8-Cell Tree
After the tree structure was created, the training data were used to train the tree.
Training took about 0.55 seconds on a 300 MHz PC, with learning rate η(t) = 0.3/t, where
t is the number of data points that have been applied to the tree so far. The learning rate is
decreased during training to ensure convergence. After training, the final weights are as
shown in Equation (15).
w11 = 0.3379
w22 = 0.1010
w31 = 0.7153
w42 = 0.5492
w51 = 0.1398
w61 = 0.1000
w72 = 0.0182    (15)

Figure 26 shows the final partitioned cells.
Figure 26 Partitioned Cells After Training
Every cell in the trained tree covers an infinite area. For example, cell C1 covers
C1 ∈ [(-∞, 0.7153), (-∞, 0.1010)]. Therefore, we need to limit the cell size. In this
example, the maximum or minimum values of the training-data points falling in a cell will
be used to limit the cell sizes. This approach gives cell C1 a finite area, which is denoted
by C1 ∈ [(-1, 0.7153), (-1, 0.1010)]; the minimum value of data within cell C1 in the
first dimension is -1. This is also the smallest value for the second coordinate of training
data in cell C1.
Novelty detection can be implemented after every cell has a bounded area. The algorithm
defines abnormal data as data outside of any cell. Figure 27 illustrates novelty detection
by plotting the training data, the testing data, and the identified abnormalities within the testing
data.
Figure 27 Abnormalities outside the cells
The next step is to test whether data flagged as novel is correlated with large errors
in the multilayer network output. (Recall that the purpose of novelty detection in this thesis
is to identify inputs for which the trained multilayer network is unreliable.) Before doing
so, we will introduce our indicators to measure the performance of novelty detectors.
When performing novelty detection to reject or to accept data, we can make two
types of errors. In the first case, we reject data even though they create small errors. In
the second case, we accept data even though they generate large errors. The following table
illustrates these ideas.
Table 1 Novelty detection decision vs. approximation error

Decision \ Error    Small Error                  Large Error
Accept              Correctly classified data    Misclassified data
Reject              Misclassified data           Correctly classified data
From the above table, one might ask how we decide what is a small error and what is
a large error. Throughout this thesis, we consider an error unacceptable when its value
is greater than 0.15 in the normalized hyperspace, in which targets are bounded within
[-1, 1]. The threshold used to accept or reject data will vary from algorithm to algorithm.
For each algorithm we will indicate the threshold we use.
An indicator we will use to measure the performance of the algorithm is the percentage
of misclassified data points. Clearly, the larger the percentage of misclassified points,
the worse the algorithm. However, keep in mind that the percentage of misclassifications
depends on the definition of large error. For some applications, the error for abnormalities
is required to be very small to guarantee the reliability of the network output. For example,
if the large error (abnormality) is defined as the error greater than 0.01, the percentages of
misclassifications we will show throughout this thesis will be changed as well (since we
defined the large error as greater than 0.15). Note that from now on we will use the term
type I error to represent small-error data that are rejected by novelty detection. On the other
hand, type II error will represent large-error data that are accepted by novelty detection.
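The performance indicator and the two error types can be sketched as follows; the function name and the sample numbers are illustrative, with the 0.15 threshold taken from the text.

```python
# Sketch of the misclassification indicator: an error above 0.15 (normalized
# target space) counts as "large"; a point is misclassified when the
# accept/reject decision disagrees with its error size.

def misclassification_stats(errors, rejected, large=0.15):
    type1 = sum(1 for e, r in zip(errors, rejected) if r and abs(e) <= large)
    type2 = sum(1 for e, r in zip(errors, rejected) if not r and abs(e) > large)
    pct = 100.0 * (type1 + type2) / len(errors)
    return type1, type2, pct

errors   = [0.01, 0.40, 0.05, 0.30]          # approximation errors (invented)
rejected = [True, False, False, True]        # novelty-detector decisions
print(misclassification_stats(errors, rejected))  # -> (1, 1, 50.0)
```

Note how the percentage depends directly on the `large` threshold, which is exactly the sensitivity discussed above.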
From the above example, the percentage of misclassified points for the neural tree
algorithm was 25.17%. Figure 28 illustrates the error of the testing data and the data marked
as abnormal. (The abnormal data points are flagged with an x at the bottom of the figure.)
Figure 28 Error and Abnormalities of the 8-cell tree
We can see in Figure 28 that some large-error points were not identified. This is because
some testing data outside the training data region went undetected (shown in Figure
27). This is due to the fact that we did not have enough cells (we had only 8). Therefore,
increasing the number of cells is a way to detect such data. After increasing the number
of cells to 200, the time used to initialize and train the 200-cell tree was 4.51 seconds. The
time needed to identify abnormalities in the 437 testing points was 1.36 seconds. Figure 29
shows the data that the 200-cell tree marked as abnormal.
Figure 29 Abnormalities from the 200-cell tree
We can see that almost all of the testing data outside the training data region were identified as abnormalities.
Figure 30 shows the error from the function approximator on the testing data and
indicates the data marked as abnormal.
Figure 30 Error and abnormalities of the 200-cell tree
As seen in Figure 30, most of the testing data were identified as abnormal. Although most of
the testing data that fell outside our training data were detected in Figure 29, the percentage of
misclassifications was 46.91%. All of these misclassified points were classified as novel, although
their errors were small (type I error). Although the percentage of misclassifications
for the 200-cell tree was larger than for the 8-cell tree, we should realize that
this percentage is from only one data set. The percentage we show here will not apply to
every data set.
From this chapter through Chapter 7, we will use only one simulated data set to
demonstrate how the novelty detection algorithms work. However, in Chapter 8, we will
apply real world data to various novelty detection algorithms. That chapter will provide
more thorough tests of the algorithms.
From the results so far, as we increase the number of cells in the neural tree
algorithm, more abnormalities will be identified. However, any data point that is close
to the training data but is outside the cell boundaries will be discarded as abnormal. This
will increase the percentage of type I misclassification error.
Thus far, we have flagged novel data when they break cell boundaries. We will now use the
neural tree algorithm to estimate the density for novelty detection.
After training the 200-cell tree, infinite cells will be limited by using the maximum
and minimum values of the input data, which are -1 and 1 in this case. The area of each cell will
be calculated, and the probability density over cell Ci will be computed as (1/200)/Ai,
where Ai is the area of cell Ci. Figure 31 illustrates data having low and high density. The
circles represent data having density less than 0.2, while the blue points represent data
having density greater than or equal to 0.2.

Figure 31 Estimated density using neural tree algorithm

We can see that some training data (in the middle of the figure) have low density
values. This is because some cells (such as those labeled in the figure)
occupy very large areas compared with other cells. Therefore, if we use this approach for
novelty detection, some training data in large cells may be misclassified as novel. This phenomenon
will increase the percentage of type I error.
Although it seems that the neural tree algorithm tends to have very large type I error,
we found that it can also have a significant percentage of type II error in some circumstances.
Consider the data set in Figure 32 (assume they are training samples for a function
approximator).
Figure 32 Donut Shape
If we flag novel data when they fall outside cell boundaries, we can see that there is
no way to detect data inside the inner circle. This is because the data in this region are always
within the cell boundaries (one such cell is shown in Figure 32); therefore, type II errors may be
increased. If we use the density estimation approach, some data inside the inner circle may
be detected and some may not. The final results will depend on how the cells are arranged.
The neural tree algorithm has a high percentage of misclassifications, but its major
advantage is its speed.
Summary
In this chapter, we introduced the fundamental ideas of the neural tree algorithm.
We defined the notation and terminology commonly used with this algorithm (e.g. tree,
node, cell, etc.). We explained how a tree divides the hyperspace, and how a tree is trained
to learn a data distribution. We then explained that, for novelty detection, we had to first
limit the size of the cells. Abnormalities were indicated as data falling into low-density cells
(density estimation approach) or as data located outside of the cells (cell boundary approach).
The ability of this algorithm was then demonstrated using a simple example. We
showed through the simulation results that increasing the number of cells increases the ability
of the neural tree to identify data outside the region of training data. We showed why
the density estimation approach tends to increase type I misclassification.
After we demonstrated the novelty-detection process, we introduced the performance
measure: the percentage of misclassified data points. We suggested that a further
study on how to appropriately limit cells for novelty detection may be necessary.
CHAPTER 4
THE GAUSSIAN KERNEL ESTIMATOR
Introduction
The Gaussian kernel estimator for novelty detection was first adopted by [Bish94],
and was thoroughly described in [Bish95]. This method is based upon the estimation of the
probability density function, as described in many statistics books. Most real-world applications
and research involving novelty detection employ this method, as in [Bish94],
[NaCoRiToTa97], [TaNaToCo99], or [HiAu00].
Within this chapter, we will first give the reason why we are interested in using the
density function estimate for novelty detection. Then we will explain three well-known
methods for density estimation: the histogram, the naive estimator, and the kernel estimator.
A specific kernel function, the univariate Gaussian estimator, will be introduced,
followed by the generalized model for the multivariate case. Next, the procedure for adapting
this algorithm to novelty detection will be described. The algorithm will be demonstrated
through a simulation example. After that, we will propose an idea to improve the
performance of the algorithm by incorporating the network output with the network input.
The improved algorithm will be illustrated with computer simulations. Finally, we will analyze
a problem with the proposed algorithm.
Estimated density for novelty detection
[Bish94] developed the novelty-detection method employing the Gaussian kernel
estimator by considering the error equation for training a function approximator:

SSE = sum_{i=1}^{N} ∫∫ {a(pi, W) - ti}^2 f(pi, ti) dp dt    (16)

where a(pi, W) is the output of a function approximator corresponding to the i-th training-data
input, pi, through the network having weights W, ti is the target, and f(pi, ti) is the
joint probability density function of the input and target. By applying Bayes' rule, we can
replace the joint density with the product of f(ti|pi) and f(pi). Rearranging Equation
(16), we obtain
SSE = sum_{i=1}^{N} ∫ {a(pi, W) - E[ti|pi]}^2 f(pi) dpi + sum_{i=1}^{N} ∫ (E[ti^2|pi] - E[ti|pi]^2) f(pi) dpi    (17)

where E is the expectation operation.
From Equation (17), we can see that the error equation is weighted by f(pi), which
represents the density function of the input data p. After training the function approximator
(minimizing the SSE over a finite data set), we expect that the approximation is accurate in
regions where the density f(pi) is high. On the other hand, there is a small contribution to the
error minimization (Equation (17)) from the regions where f(pi) is low. Consequently, the
approximation should not be precise in these regions, resulting in large errors from the
function approximator. Therefore, the density function f(pi) can be an indicator to predict
when the approximation is not accurate. In other words, we may use the density function
as a novelty detector for function approximation. Unfortunately, the density function f(pi)
is unknown, and therefore we have to estimate it. The next section is dedicated to reviewing
density estimation, from the histogram to the Gaussian kernel estimator.
Background
The probability density function gives a description of the distribution of a random
variable p. Probabilities associated with p can be found by

P(a < p < b) = ∫_a^b f(λ) dλ    (18)
Now, suppose that there is a set of observed data sampled from an unknown probability
density function. The method of predicting the unknown density function from the observed
data points is called "density estimation". Three density estimation methods will be
discussed here, starting with the wellknown histogram, followed by the naive estimator,
and finally leading to the kernel estimator.
Though most applications in the real world deal with multiple variables, we will begin
with the univariate case because of its simplicity. Before starting the discussion, let's
define some notation that will be used throughout this chapter. Assume that a random variable
p consists of N real observations p1, p2, ..., pN sampled from a data set whose underlying
density is unknown and is to be estimated. Then the estimated density function of
these observations will be denoted as f̂.
Histogram
The oldest and most extensively used method for estimating an unknown density
function is the histogram. It is mainly comprised of a series of boxes with heights indicating
how many data are contained in a certain region. To create a histogram, we begin by defining
a set of bins starting from the point λ0. Each bin is defined as the interval
[λ0 + mb, λ0 + (m + 1)b), where m is an integer representing the bin number. Every bin
has width b. Then the histogram is defined as

f̂(λ) = (1/(Nb)) × (number of pi in the same bin as λ)    (19)

The estimated density is constant over each bin. The bin width b is sometimes called the
smoothing parameter. Note that, in a more general form, the bin width can be adjusted
from bin to bin as well.
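Equation (19) can be sketched directly in code; the function name and the sample data are illustrative.

```python
# Sketch of the histogram estimator of Equation (19), with starting point
# lam0 and a fixed bin width b (both choices are up to the user).

import math

def histogram_estimate(data, lam, lam0=0.0, b=1.0):
    """f_hat(lam) = (1/(N*b)) * (number of p_i in the same bin as lam),
    with bins [lam0 + m*b, lam0 + (m+1)*b)."""
    m = math.floor((lam - lam0) / b)                 # bin index of lam
    in_bin = sum(1 for p in data if math.floor((p - lam0) / b) == m)
    return in_bin / (len(data) * b)

data = [0.1, 0.2, 0.7, 1.4, 1.6]
print(histogram_estimate(data, 0.5))  # 3 of 5 points share the bin [0, 1) -> 0.6
```

Shifting `lam0` changes which points share a bin, which is the origin-sensitivity the next method removes.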
The accuracy of the histogram depends upon both the starting value and the bin
width. These values have to be chosen by experience and the choice may cause undesirable
effects, such as misinterpretation of the density estimate. Furthermore, though it is an excellent
tool to represent the approximate density for a single random variable, it does not
work well in high dimensional spaces. Even in two or three dimensions, it is extremely difficult
to create understandable figures. Moreover, the discontinuities between adjacent boxes
are not desirable. The next method we will discuss, known as the naive estimator, is
designed to overcome some of the problems of the histogram.
Naive estimator
The naive estimator is an alternative to the histogram. It eliminates some undesirable
effects of the histogram, especially the choice of the origin value, as will be seen. From Equation
(18), the density function can be rewritten as

f(λ) = lim_{b→0} (1/(2b)) P(λ - b < p < λ + b)    (20)

The right-hand side can be estimated by computing the fraction of observed data falling
within the interval (λ - b, λ + b), and can be written as

P(λ - b < p < λ + b) ≈ (number of pi falling in the interval (λ - b, λ + b)) / N    (21)

Therefore, for a small enough value of b, Equation (20) can be reformulated and used as
an estimate:

f̂(λ) = (number of pi falling in the interval (λ - b, λ + b)) / (2Nb)    (22)
The above equation is a simple form of the naive estimator. The general form is

f̂(λ) = (1/(Nb)) sum_{i=1}^{N} w((λ - pi)/b)    (23)

where w(λ) is called the weight function and can be defined as

w(λ) = 1/2 if |λ| < 1, 0 otherwise    (24)
In Equation (23), we are constructing a box of width 2b over the range (λ - b, λ + b) with
height 1/(2Nb) on each observed data point. After adding them together, the density estimate
is obtained. Therefore, the naive estimator uses the observations as the reference
for each box, meaning that the center of each box is an individual observation. In the case
of the histogram, the origin value has to be initialized before making a box. That means that
the naive estimator is a modified version of the histogram in which each sampled data point
creates its own histogram centered on itself.
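Equations (23) and (24) can be sketched directly; the function names and sample data are illustrative.

```python
# Sketch of the naive estimator: a box of height 1/2 (the weight function of
# Equation (24)) centered on each observation, scaled by 1/(N*b).

def weight(lam):
    """Equation (24)."""
    return 0.5 if abs(lam) < 1 else 0.0

def naive_estimate(data, lam, b):
    """Equation (23): f_hat(lam) = (1/(N*b)) * sum_i w((lam - p_i)/b)."""
    return sum(weight((lam - p) / b) for p in data) / (len(data) * b)

data = [0.0, 0.5, 2.0]
print(naive_estimate(data, 0.25, b=0.5))  # two of the three points lie within b of 0.25
```

No origin needs to be chosen, but the estimate is still a step function, which is the discontinuity the kernel estimator removes.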
The naive estimator eliminates the problem of setting up the origin value. Nevertheless,
the trouble of selecting the smoothing parameter and the discontinuity of the curve
still remain. The discontinuity of the curve is the most undesirable feature, because the derivative of the estimated density is infinite at every point $p_i \pm b$, and is zero elsewhere.
Before going on to another section, let's consider how the estimate is affected by
changes in the smoothing parameter. When the parameter b is reduced, Equation (22) implies
that the interval $2b$ is decreased whereas the ordinate of each box $\frac{1}{2Nb}$ is increased,
making the estimated density more jagged. The graphical presentation will look more noisy
when the parameter is smaller. On the other hand, when the parameter is increased, the discontinuous
characteristics will be reduced but some underlying information will be lost because
of too much overlap of the blocks. Figure 33 demonstrates the effect of varying the
smoothing parameter on the naive estimator.
Figure 33 Effect of the smoothing parameter on the naive estimator (a) b = 0.03 (b) b = 0.3 (c) b = 3
As explained above, the estimated density will be jagged if the smoothing parameter is too small, and it will be obscured if the smoothing parameter is too large. Therefore, the choice of the smoothing parameter is one of the most critical steps in using the naive estimator.
As noted above, the undesirable characteristic of the naive estimator is its discontinuity.
To overcome this problem, we will modify the weight function using a kernel function,
as explained in the next section.
Kernel estimator
In order to obtain a continuous curve, we can modify the weight function. We will
replace the weight function w(A.) with the kernel function K(A.) , which satisfies the condition
\[
\int K(\lambda)\, d\lambda = 1 \tag{25}
\]
By substituting this kernel function into Equation (23), it becomes
\[
\hat{f}(\lambda) = \frac{1}{Nb} \sum_{i=1}^{N} K\!\left(\frac{\lambda - p_i}{b}\right) \tag{26}
\]
The kernel function is normally chosen to be symmetric and nonnegative everywhere. Some examples of kernel functions are the biweight, the triangular, and the normal density function
(the Gaussian function). By selecting a continuous kernel function, the density function
estimate will also be continuous.
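As a sketch of Equation (26), the following Python fragment implements the kernel estimator with a triangular kernel, which is nonnegative, symmetric, and integrates to one (the names are ours, for illustration only):

```python
import numpy as np

def triangular_kernel(u):
    """Triangular kernel: K(u) = 1 - |u| for |u| < 1, else 0."""
    return np.maximum(0.0, 1.0 - np.abs(u))

def kernel_estimator(lam, data, b, kernel=triangular_kernel):
    """Kernel density estimate of Equation (26) for any kernel
    satisfying the normalization condition of Equation (25)."""
    data = np.asarray(data, dtype=float)
    return kernel((lam - data) / b).sum() / (len(data) * b)
```

Passing a Gaussian in place of `triangular_kernel` yields the estimator discussed later in this chapter.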
As with the naive estimator, the kernel estimator can be thought of as adding together all of the kernel curves centered at the observations. In other words, rather than placing a box of width $2b$ and height $\frac{1}{2Nb}$ with a midpoint at each observation, a kernel curve with a specific interval and height constrained by its own properties is located on each observation.
The density estimate at a point A. is obtained by summing all of the individual
kernel functions. Like the naive estimator, the smoothing parameter controls how smooth
the curve is. If the parameter is too large, the kernel functions overlap too much and are
smeared together. If the parameter is too small, the approximate density is just as misleading,
and consists of a spike at each observation.
Therefore, though the estimate is continuous, the problem of how to select the value
of the smoothing parameter still remains. However, the kernel estimator has been one of
the most popular methods employed for density estimation thus far. In the next section, a
certain kernel function, the Gaussian function, will be deployed in order to perform novelty detection.
The Gaussian kernel estimator
In this section, one of the most well-known functions in the statistics, mathematics, and engineering fields, the Gaussian function, will be employed as the kernel function.
We will begin with the univariate case, followed by the generalized model in which any
number of dimensions can be used.
Onedimensional data
We generally choose a kernel function that is differentiable, symmetric and nonnegative.
The Gaussian function with zero mean and variance of one
\[
g(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2} x^2\right) \tag{27}
\]
satisfies all of the above properties. Now, by substituting Equation (27) into the kernel function in Equation (26), it becomes
\[
\hat{f}(\lambda) = \frac{1}{Nb} \sum_{i=1}^{N} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}\left(\frac{\lambda - p_i}{b}\right)^2\right) \tag{28}
\]
The above equation is the Gaussian kernel estimator for univariate density functions.
Notice that the smoothing parameter controlling the width of the Gaussian curve is
the standard deviation in the normal density function, while an observation $p_i$ can be thought of as the mean of the kernel curve. When the standard deviation of a normal density
function increases, the curve expands and the points far away from its mean have comparable
values to points close to the mean. In contrast, when the standard deviation decreases,
the shape becomes more like a spike, and the density at a small distance remote from the
mean is close to zero. Figure 34 shows what the estimated densities look like when the
smoothing parameter is varied. The estimated density computed from Equation (28) is exhibited
by the bold solid line, while individual kernels for the five observations are indicated
by the thin lines. When the smoothing parameter is too small, several peaks pop up and
could be misleading. When the value is too large, it can smooth over much of the detail in the true density.
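A direct transcription of Equation (28) in Python (a minimal sketch; the smoothing parameter b plays the role of the standard deviation, and the names are ours):

```python
import numpy as np

def gaussian_kde_1d(lam, data, b):
    """Univariate Gaussian kernel estimate, Equation (28): each
    observation contributes a normal curve with mean p_i and
    standard deviation b."""
    data = np.asarray(data, dtype=float)
    z = (lam - data) / b
    k = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return k.sum() / (len(data) * b)
```

Doubling b halves the peak height of each individual kernel, which is how a larger smoothing parameter smears out the detail of the estimate.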
Figure 34 The Gaussian kernel estimator with (a) b = 0.3 (b) b = 1 (c) b = 3
In the next section, the multivariate case will be discussed.
The generalized model
Most applications involve high dimensional spaces. The higher the dimension, the
more difficult it will be to make accurate estimates of the density function and to represent
it graphically. However, let's modify our existing notation in order to represent high dimensional
data.
Suppose the random vectors $p_1, p_2, \ldots, p_N$ are sampled from a population with unknown distribution. Also, assume that the size of each vector is $q \times 1$. Then, using the same concept as the univariate case, the kernel estimator in Equation (26) can be rewritten as
\[
\hat{f}(\lambda) = \frac{1}{Nb^q} \sum_{i=1}^{N} K\!\left(\frac{\lambda - p_i}{b}\right) \tag{29}
\]
where the kernel function has to abide by the condition
\[
\int K(\lambda)\, d\lambda = 1 \tag{30}
\]
Similarly, the Gaussian kernel function will be modified to fit into high dimensional spaces
using the following formula.
\[
K(x) = \frac{1}{(2\pi)^{q/2}} \exp\!\left(-\frac{1}{2} x^T x\right) \tag{31}
\]
By substituting Equation (31) into the kernel function in Equation (29), it becomes
\[
\hat{f}(\lambda) = \frac{1}{Nb^q} \sum_{i=1}^{N} \frac{1}{(2\pi)^{q/2}} \exp\!\left(-\frac{1}{2}\,\frac{(\lambda - p_i)^T(\lambda - p_i)}{b^2}\right) \tag{32}
\]
Now, the above equation is a generalization of the Gaussian kernel estimator. The smoothing
parameter b is heuristically chosen, depending on the problem that is going to be
solved. It has to be neither too large nor too small to obtain a desirable result. In the more
general form, the smoothing parameter could be a matrix, like the covariance matrix, and
the density estimate can be written as
\[
\hat{f}(\lambda) = \frac{1}{N(2\pi)^{q/2}|\Sigma|^{1/2}} \sum_{i=1}^{N} \exp\!\left(-\frac{1}{2}(\lambda - p_i)^T \Sigma^{-1} (\lambda - p_i)\right) \tag{33}
\]
where $\Sigma$ is the smoothing-parameter matrix, which can be thought of as the covariance matrix, and has the size $q \times q$. The notation $|\Sigma|$ represents the determinant of the matrix. Notice that (33) is the general form of (32) such that $\Sigma = b^2 I_q$, where $I_q$ is the identity matrix with size $q \times q$.
Note that the use of $\Sigma = b^2 I_q$ implies that the width of the Gaussian kernel placed on each observation is equal in all directions, and that the data in each dimension are uncorrelated with each other. Although a smoothing-parameter matrix should be used to efficiently estimate density functions for data that are not evenly distributed, a constant value will be
used in this thesis for novelty detection.
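Equation (34), the $\Sigma = b^2 I_q$ case of the general estimator, can be sketched in a few lines of Python (names are illustrative only):

```python
import numpy as np

def gaussian_kde(lam, data, b):
    """Multivariate Gaussian kernel estimate with Sigma = b^2 I_q,
    Equation (34).  data is an N x q array of observations; lam is a
    length-q point at which the density is evaluated."""
    data = np.atleast_2d(np.asarray(data, dtype=float))
    n, q = data.shape
    sq = np.sum((np.asarray(lam, dtype=float) - data) ** 2, axis=1)
    coef = (2.0 * np.pi) ** (-q / 2.0) / (n * b ** q)
    return coef * np.exp(-sq / (2.0 * b ** 2)).sum()
```

For $q = 1$ this reduces to the univariate estimator of Equation (28).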
In the next section, we will explain how this density function estimator can be used
for novelty detection.
Application to novelty detection
We will use the estimated density function for novelty detection. The main idea is
the following. The estimated density function describes the distribution of the training data
set. If a new input vector is similar to vectors in the training set, we would expect that the
estimated density function will be relatively large at that point. If a new input vector is unlike
any vector in the training set, then the estimated density function should be small at
that point. Therefore, those inputs that have a small value for the estimated density will be
considered novel inputs.
In the next section we will revisit the example problem described in Chapter 3. We
will use it to test the Gaussian kernel estimate of the density function, Equation (33), with
covariance matrix $\Sigma = b^2 I_q$:
\[
\hat{f}(\lambda) = \frac{(2\pi)^{-q/2}}{Nb^q} \sum_{i=1}^{N} \exp\!\left(-\frac{(\lambda - p_i)^T(\lambda - p_i)}{2b^2}\right) \tag{34}
\]
Note that $p_i$ is a training vector, corresponding to an observation in Equation (33).
Simulation of the simple example #1
The following example illustrates the capability of the Gaussian kernel estimation novelty detector. The regions for the testing and training data were shown in the previous chapter. The input vectors are two-dimensional, and the number of training data points is 638, thus giving $q = 2$ and $N = 638$. Figure 35 shows the estimated density for various values of the smoothing parameter.
Figure 35 Estimated Density with (a) b = 0.001 (b) b = 0.01 (c) b = 0.1 (d) b = 1
Recall from the previous chapter that we want to use a novelty detector in combination
with a multilayer network that has been trained for function approximation. If the
novelty detector flags data as being different from data in the training set, then we expect that the multilayer network may perform poorly on that data. The novelty detector is a warning system for the multilayer network. In this context, we will test the Gaussian kernel estimator on the function approximation problem described in the previous chapter.
Figure 36 plots the estimated density values versus the error of the multilayer network on the 437 test points of the function approximation problem. (See Figure 22 for the location of the test points.) Figure 36 does indicate that the lower the estimated density is, the more likely the error of the multilayer network will be large.
Figure 36 Estimated Density and the Errors (a) b = 0.001 (b) b = 0.01 (c) b = 0.1 and (d) b = 1
From Figure 36, we can see that there is a straight line and a scalar value R shown
on each graph. The straight line represents a linear regression between the two variables, the network error and the estimated density in this case. The regression line will be used
to determine the density function value used to flag abnormal data. We will describe how
to determine this threshold value from the regression line later.
The R value represents the correlation coefficient between the network error and the estimated density ($-1 \le R \le 1$). If the R value is positive ($R > 0$), it means that the error tends to be high when the estimated density is high. On the other hand, if $R < 0$, it indicates
that the error tends to be high when the estimated density is low. In addition, the greater the
magnitude of the correlation coefficient ($|R|$ close to 1), the stronger the correlation between the variables.
Therefore, when the R value is close to zero, it means that there is almost no correlation
between the two variables.
We can see in Figure 36 that the R values were negative, thus implying that when
the estimated density was low, the error tended to be high.
In order to decide which smoothing parameter value should be used for novelty detection, Bishop [Bish94] suggested choosing the b value by averaging the
distance to the ten nearest neighbors over the entire training data. By using this method, we
found that the b value for our example is 0.0836.
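Bishop's rule can be sketched as follows; we assume Euclidean distance and exclude each point from its own neighbor list (the function name is ours):

```python
import numpy as np

def average_knn_distance(data, k=10):
    """Average distance to the k nearest neighbors, averaged over the
    whole training set, used as the smoothing parameter b."""
    data = np.atleast_2d(np.asarray(data, dtype=float))
    # Pairwise Euclidean distances, with the self-distance masked out.
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    d.sort(axis=1)                      # ascending within each row
    k = min(k, data.shape[0] - 1)
    return float(d[:, :k].mean())
```

Applied to the 638 training inputs of this example, the rule gives the b value of 0.0836 quoted above.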
Figure 37 Estimated density and approximation error: b = 0.0836
To use the density estimate for novelty detection, we must choose a threshold below
which the data will be flagged as novel. Based on Figure 36 (c), we found that the regression
line that represents the network error and the estimated density with b = 0.0836 is
\[
\text{density} = -0.967 \times \text{error} + 0.294 \tag{35}
\]
We will use Equation (35) to find the estimated density value (based on this data set) that
produces an approximation error of 0.15. By substituting 0.15 for the error term in Equation
(35), the corresponding estimated density is 0.149. That indicates that, based on this testing
data set, the estimated density for data points generating an error of 0.15 for the function
approximator is on average equal to 0.149. We will use this value of the estimated density
to be the threshold to reject novel data. In other words, any data generating an estimated
density less than 0.149 will be discarded as novel.
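The threshold computation just described can be sketched in Python: fit the regression line relating density to error, then read off the density at the chosen error level (0.15 here, which gives 0.149 for the line of Equation (35)). The names are ours:

```python
import numpy as np

def density_threshold(errors, densities, target_error=0.15):
    """Fit density = m * error + c by least squares and return the
    density corresponding to target_error, used to flag novel data."""
    m, c = np.polyfit(errors, densities, 1)
    return m * target_error + c

def is_novel(density, threshold):
    """A point is flagged as novel when its estimated density falls
    below the threshold."""
    return density < threshold
```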
Figure 38 shows the approximation error of the neural network. The points that are flagged with an x have estimated density values less than 0.149.
Figure 38 Novelty Detection: Density of input
As seen in Figure 36 (c) and Figure 38, the algorithm has a certain capability to pinpoint which data should be discarded. The percentage of misclassified points when employing this algorithm was 28.38%. All of the misclassifications were from small-error points
that were flagged as novel (type I error).
Note from Figure 37 that if we use a density threshold lower than 0.149 to reject novel data, the
percentage of misclassifications will be reduced. This is because the threshold we chose
(based on the regression line) discarded many data points with small errors. We found that
in this example the threshold that minimizes the percentage of misclassifications is around
0.005. For this threshold, the percentage of misclassifications is 7.32%. Around 5.49% out
of the 7.32% are type I misclassifications. Though there is a large difference between the
percentages of misclassifications for the two different thresholds in this example, when we apply the algorithm to our real-world data in Chapter 8, the difference is no larger than 2%.
Although the purpose of using the testing data is to set the threshold, it should be
noted that it is very difficult to have a general value for the threshold that will minimize the
percentage of misclassifications for all data sets. What we can generally say about setting
the threshold is that the higher the threshold, the more likely we will experience type II
error. On the other hand, the lower the threshold, the more likely we will face type I misclassification.
Therefore, the appropriate threshold value will depend on our application.
Even though the percentage of misclassifications in this case is much less than that
for the neural tree, we found that this algorithm sometimes can reject more training data
than any other method. This is due to the fact that the estimated density of some training
data can be lower than that of some testing data. Such training data are located in regions
far away from the majority, thereby reducing the effect of the kernel curves from adjacent
training points. Figure 39 shows the estimated density of a one-dimensional data set with the training data marked as x.
Figure 39 Estimated density of a data set
From the figure, we can see that some training data points, for example at
p = 0.06, have very low density. Its value is even lower than that of some testing data points, such as at p = 0.9 or p = 0.5. That means that these low-density training points are
more prone to being discarded as abnormalities than some testing points, whose errors for
the function approximator may be larger. We can see in Chapter 8 that this algorithm rejects
more training data as novel than any other novelty detector. A way to reduce this problem
is to reduce the value of the smoothing parameter.
As we explained earlier, the smaller the smoothing parameter, the smaller the region the kernel curve will cover. Thus, if we choose to have a small smoothing parameter, the estimated density of the training data will be higher than that of testing points; however, the density of the interpolations will be very low (see Figure 35 (a)). That indicates that we will
discard these small-error points (interpolations) as abnormalities, thus increasing the percentage of misclassifications. Therefore, no matter how large the smoothing parameter, this method tends to reject small-error points (either training data or interpolation points) as
novel data.
A way to improve the performance: Joint Density
Thus far, we have computed the estimated density of the input to the function approximator.
Recall from Equation (16) that the probability density function that we actually
use to minimize the sum-square error is the joint density between the input and the target
of the network. Because the target is assumed to be unknown, it may be difficult to find the
joint density. However, in this section we will propose a procedure for computing the joint density
between network input and output. Then we will use the estimated joint density to develop
an improved novelty detection procedure.
One advantage of using the joint density is that if a network could not minimize the
errors very well on some of the training data, the joint density between the network inputs
and outputs for those training data could be rejected based on comparing with the joint density
between the network inputs and targets. Also, there might be regions close to training-data inputs where the network did not minimize the error. This would cause the network
outputs from that region to be unreliable (computing the density of the network output
would be a good way to indicate such phenomena).
We can represent the joint density of two jointly Gaussian random vectors $x$ and $y$ by creating an augmented vector $z = \begin{bmatrix} x^T & y^T \end{bmatrix}^T$. The new random vector $z$ will also follow the Gaussian distribution. Thus, for any training data, we will create the new composite
vector $r_i = \begin{bmatrix} p_i^T & t_i^T \end{bmatrix}^T$. For testing data, we need to propagate the input $p_j^t$ through the function approximator to get the network output $a_j^t$. Then we will augment the network input and output to create the composite testing data $r_j^t = \begin{bmatrix} (p_j^t)^T & (a_j^t)^T \end{bmatrix}^T$. Then Equation (33) can be rewritten as
\[
\hat{f}(\lambda) = \frac{1}{N(2\pi)^{q_r/2}|\Sigma_r|^{1/2}} \sum_{i=1}^{N} \exp\!\left(-\frac{1}{2}(\lambda - r_i)^T \Sigma_r^{-1} (\lambda - r_i)\right) \tag{36}
\]
where $\Sigma_r$ is the smoothing-parameter matrix and $q_r$ is the size of the composite vector. Note that the structure of $\Sigma_r$ is
\[
\Sigma_r = \begin{bmatrix} \Sigma_p & \Sigma_{pt} \\ \Sigma_{pt}^T & \Sigma_t \end{bmatrix} \tag{37}
\]
where $\Sigma_t$ is the smoothing parameter for the target, and $\Sigma_{pt}$ is the co-smoothing parameter between the network input and target.
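With the diagonal choice $\Sigma_r = b^2 I$ used in the next example, the joint-density estimate reduces to the same form as Equation (34) applied to augmented vectors. A Python sketch (names ours):

```python
import numpy as np

def joint_density(lam, inputs, targets, b):
    """Joint-density estimate of Equation (36) with Sigma_r = b^2 I.
    Training vectors are augmented as r_i = [p_i; t_i]; the test point
    lam should likewise be the augmented [input; network output]."""
    inputs = np.atleast_2d(np.asarray(inputs, dtype=float))
    targets = np.asarray(targets, dtype=float).reshape(len(inputs), -1)
    r = np.hstack([inputs, targets])            # N x (q_p + q_t)
    n, q = r.shape
    sq = np.sum((np.asarray(lam, dtype=float) - r) ** 2, axis=1)
    coef = (2.0 * np.pi) ** (-q / 2.0) / (n * b ** q)
    return coef * np.exp(-sq / (2.0 * b ** 2)).sum()
```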
In the next section, the estimation of the joint density will be illustrated by using
the simple example we have used in previous sections.
Simulation of the simple example #2
In this example we will compute the joint density of training data, and compare with
that of testing data. We will use Equation (36) for estimating the joint density.
We assume that all elements of the augmented training vectors r are uncorrelated.
Therefore, Lr will be a diagonal matrix. By using the same criterion as we did in example
#1 (set the b value equal to the average distance to the ten nearest neighbors in the training data), the b value in this case is 0.1164. In other words, $\Sigma_r$ in this example is equal to
\[
\Sigma_r = \begin{bmatrix} 0.1164^2 & 0 & 0 \\ 0 & 0.1164^2 & 0 \\ 0 & 0 & 0.1164^2 \end{bmatrix} \tag{38}
\]
After we used Equation (36) for computing the estimated joint density, we obtained
the estimate shown in Figure 40.
Figure 40 Estimated density and approximation error: b = 0.1164
We can see from Figure 40 that large-error data clearly have low density. This relationship is clearer here than in example #1. The regression line shown in the figure is
\[
\text{density} = -0.318 \times \text{error} + 0.159 \tag{39}
\]
By substituting 0.15 for the error term, the corresponding density is equal to 0.1113. We will use this value as the threshold. Figure 41 demonstrates the network error, and abnormal data are flagged with an x.
Figure 41 Novelty detection: Density of input and output
The percentage of misclassifications in this case was 23.34%, which is a little bit less than the result in example #1. All of the misclassifications in this case were from type I errors.
We found that the threshold that produces the fewest misclassifications is 0.04. For this threshold, the percentage of misclassifications is 8.46%. Around 8% of the 8.46% are type II errors. This is less than the percentage of misclassifications we obtained for the threshold based on the regression line. However, in Chapter 8, where we apply this technique to real-world data, we found that the difference will not be this large.
From the results we obtained using the threshold of 0.04, the total misclassification is a little larger than in example #1 (when we chose the threshold with the fewest misclassifications). This is due to the fact that there are some small-error data with low density marked in Figure 40. Such data points are shown in the two-dimensional plot in Figure 41.
Figure 42 Small-error and low-density points
From Figure 42, we can see that these data occupied a region where the density of
training inputs is low, compared with the other regions containing training data. However,
the real problem is that the density of the targets for this data is very low, resulting in low
joint density around this region. Therefore, any data around this region will naturally have
low density, compared to the density in the other regions. This phenomenon makes the data
in this region (though their errors are small) more prone to being discarded as novel data
than data in other regions. This is the reason why some small-error points can have low density,
and this is the main problem with this algorithm.
In this section, we proposed a technique to improve the performance of the Gaussian
kernel estimator for novelty detection. We expected that any data generating unreliable output but falling in regions where the density of the input is somewhat high (from the overlapping of kernels) should have low joint density. Unfortunately, we found that data points occurring
in the regions where the density of inputs and targets is low could have low joint
density as well. These data points were therefore detected as novel data.
Summary
We began this chapter by deriving the error equation for training a function approximator. We concluded that the error of the function approximator depends on the joint density between target and input data. However, since the target is unknown, we factorized the joint density using Bayes' rule, and used only the density of the input data. Because the density function of the input data is unknown, we have to estimate it.
We introduced three methods for density estimation, and concluded that the Gaussian
kernel estimator is the most desirable method. However, there is a smoothing parameter
required by the estimator. We demonstrated that if the parameter is not set correctly the
algorithm will perform poorly.
When using the Gaussian kernel estimator, we found that the lower the estimated
density, the more likely we were to find large errors in the function approximator. We then
proposed a way to improve the performance of the novelty detector by estimating the joint
density between network input and output for testing data, and comparing it with the density
between network input and target for training data. The simulation results showed that
the joint density estimate had an improved capability of identifying abnormalities, and reduced
the percentage of misclassified points.
CHAPTER 5
AUTOASSOCIATIVE MULTILAYER PERCEPTRON
Introduction
One of the most common uses of neural networks is to solve pattern recognition
problems. In Chapter 2, the general structure of multilayer neural networks was introduced.
Frosini and Gori proposed a new approach in [FrGo96] for using multilayer networks to
recognize their data and to perform novelty detection. Their application was the detection
of bogus banknotes.
In this chapter, we will use multilayer networks as novelty detectors, to recognize
data unlike training data. We will start this chapter by briefly recapitulating the neural network
structure. The definition of a new novelty detection method will be given, followed by a description
of how we can use such a neural network to identify abnormalities. A simple
simulated example will be used for demonstration. After that, we will propose a technique
to improve the performance of this method. Finally, the simulation outcomes with this new
technique will be shown.
Autoassociative Multilayer Perceptron
Note that multilayer perceptron is another name for multilayer neural network,
which was previously described in Chapter 2. The word autoassociative is used because
the targets of these neural networks are the same as the inputs to the networks.
The major difference between the autoassociative multilayer perceptron and the
multilayer network function approximator is the target output. For the function approximator
network, the target output is the output of the function we wish to approximate. For the
autoassociative network the target output is the same as the network input.
Note that the objective of training a network is to minimize the sum-square error, sse, or mean-square error, mse, between network outputs and targets. Since the output of the autoassociative multilayer perceptron is not a scalar, the mse can be written in vector form as:
\[
\text{mse} = \frac{1}{N}\sum_{i=1}^{N} e_i^T e_i = \frac{1}{N}\sum_{i=1}^{N} (t_i - a_i)^T(t_i - a_i) = \frac{1}{N}\sum_{i=1}^{N} (p_i - a_i)^T(p_i - a_i) \tag{40}
\]
where $N$ indicates the number of training data vectors, $p_i$ is the $i$th input vector, and $a_i$ is the corresponding network output.
From Equation (40), we can see that the quantity $(p_i - a_i)^T(p_i - a_i)$ is the squared error of the $i$th observation. The error itself can be calculated through the 2-norm operation.
\[
\text{norm}(e_i) = \|e_i\| = \left(e_i^T e_i\right)^{1/2} = \left((p_i - a_i)^T(p_i - a_i)\right)^{1/2} \tag{41}
\]
We will call the quantity $\text{norm}(e_i)$ in Equation (41) the "autoassociative error".
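Equations (40) and (41) translate directly into code; a short sketch (function names ours):

```python
import numpy as np

def autoassociative_errors(p, a):
    """Per-sample autoassociative error of Equation (41):
    ||e_i|| = ||p_i - a_i||."""
    p = np.atleast_2d(np.asarray(p, dtype=float))
    a = np.atleast_2d(np.asarray(a, dtype=float))
    return np.linalg.norm(p - a, axis=1)

def mse(p, a):
    """Mean-square error of Equation (40): the mean of the squared
    error norms over the N training vectors."""
    return float(np.mean(autoassociative_errors(p, a) ** 2))
```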
In this section, we defined the structure of the neural network used for novelty detection.
The major difference between the network for novelty detection and that for function
approximation is the number of neurons in the output layer. We will describe in the
next section how we can use this new neural network as our novelty detector for function approximation. The simulation results will also be provided.
Application to novelty detection
Recall that we want to use a novelty detector in combination with a multilayer network
that has been trained for function approximation. If the novelty detector flags data as
being different from data in the training set, we then expect that the multilayer network may
perform poorly on that data. That means that the novelty detector is a warning system for
the multilayer network. In this chapter, we want to test the autoassociative multilayer perceptron on the function approximation problem.
We will create a neural network with the structure described in the previous section
(autoassociative multilayer perceptron). We will use this network for novelty detection.
Since two kinds of multilayer neural networks are in use, we will call the neural network utilized for novelty detection the novelty detection (ND) network. The main idea for using the ND network as a novelty detector for the function approximator is provided in the following.
Although the ND network seems to perform a linear function (the output is the same as its input), Hwang and Cho showed in [HwCh99] that the nonlinear transfer function, e.g. the hyperbolic tangent sigmoid, in the hidden layers is necessary for novelty detection.
They concluded that the nonlinear transfer functions create outputconstrained hyperplanes
on which all output vectors $a_i$ are projected. Therefore, when we minimize the squared error
(train the ND network), the hyperplanes will be moved toward the vicinity of the training
vectors. Consequently, the squared error of the data points within the constrained hyperplanes
will be small, while that of data points far away from such hyperplanes will not be
minimized, and this will result in a large squared error (and autoassociative error). This is the
reason why the ND network can "recognize" training data.
We would like to train the ND network with the data we used to train the function
approximator so that the autoassociative error for these data is small. Therefore, if a new
testing vector is similar to a training vector, the autoassociative error of this vector will be
small. We then expect that the error of the function approximator would be small as well
(since the testing input is similar to a training input). For a new vector much different from
training data, the autoassociative error of this point will be large since it is not in the constrained
hyperplanes, and the error from the function approximator of this data should also
be large. Therefore, those inputs that have a large autoassociative error will be considered
novel points. Figure 43 demonstrates the diagram for the ND network for the function approximator.
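The thesis trains a 2-20-10-2 network with the Levenberg-Marquardt algorithm; as a rough sketch of the same idea only, the fragment below trains a smaller 2-10-2 autoassociator by plain gradient descent on hypothetical data, then compares the autoassociative error of training points with that of a far-away point (all sizes, data, and names are our own assumptions, not the author's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training inputs (targets are identical to the inputs).
P = rng.uniform(-1.0, 1.0, size=(100, 2))

# A small 2-10-2 autoassociative network with a tanh hidden layer.
W1 = rng.normal(0, 0.5, (2, 10)); b1 = np.zeros(10)
W2 = rng.normal(0, 0.5, (10, 2)); b2 = np.zeros(2)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2, h

lr = 0.05
for _ in range(2000):
    A, H = forward(P)
    E = A - P                                # output minus target (= input)
    # Backpropagate the squared error through both layers.
    gW2 = H.T @ E / len(P); gb2 = E.mean(axis=0)
    dH = (E @ W2.T) * (1.0 - H ** 2)
    gW1 = P.T @ dH / len(P); gb1 = dH.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

train_err = np.linalg.norm(forward(P)[0] - P, axis=1)
novel_err = np.linalg.norm(forward(np.array([[5.0, 5.0]]))[0] - [5.0, 5.0], axis=1)
```

Points far from the training region fail to be reproduced and therefore show a much larger autoassociative error, which is exactly the novelty signal used in this chapter.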
Figure 43 Diagram of an autoassociative novelty detector
In the next section, we will revisit the example problem described in the previous
chapters. We will use it to show the ability of this method.
Simulation of the simple example #1
The ND network was designed with a three-layer architecture to recognize the training data shown in Figure 19. We used 20 neurons in the first layer and 10 neurons in the second layer. The number of neurons in the output layer must be two, since the input vectors p have two elements. This made the architecture of the ND network 2-20-10-2.
We then used the Levenberg-Marquardt algorithm [HaDe96] to train the ND network (note that we also tried the Bayesian regularization algorithm and the results were not much different). In this example, the mean-square error was driven down to the order of $10^{-11}$. Figure 44 illustrates the errors of the ND network for the training data.
Figure 44 Autoassociative error of training data
The ND network generates output vectors that are very similar to the training inputs. The maximum autoassociative error for the training data was $6.13 \times 10^{-5}$. After the training process of the ND network is complete, the next step is to apply the testing data $p^t$ to the ND network and to compute the autoassociative errors: the error between the testing data
$p^t$ and the corresponding ND-network outputs. Now, we want to see whether the error from
the function approximator correlates with the autoassociative error. Figure 45 shows the
correlation between the autoassociative errors from the ND network and the errors from the
function approximator on the test data points shown in Figure 22.
Figure 45 Autoassociative Error and Approximation Errors
As seen in Figure 45, the correlation (indicated by the R value) between the autoassociative errors and the errors from the function approximator was somewhat high, implying that there exists some correlation between these two variables.
Once again, to set up the threshold to discard abnormalities, we consider the regression line shown in Figure 45. The equation for the line in Figure 45 is

   autoassociative error = 0.00189 x error + 6.66 x 10^-5        (42)

From Equation (42), the autoassociative error that corresponds to a network error of 0.15 is 3.501 x 10^-4. We will use this value as the threshold to reject novel data. Figure 46 demonstrates the network errors, and novel data are flagged with an x.
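This thresholding rule can be sketched in a few lines; the regression coefficients are those of Equation (42), while the four testing errors below are invented for illustration.

```python
import numpy as np

# Regression line of Equation (42): autoassociative error as a linear
# function of the approximation error (coefficients from the text).
slope, intercept = 0.00189, 6.66e-5

def threshold_for(error_level):
    """Autoassociative-error threshold for a chosen approximation-error level."""
    return slope * error_level + intercept

thr = threshold_for(0.15)   # the 3.501e-4 threshold used in the text

# Flag testing points whose autoassociative error exceeds the threshold.
autoassoc_errors = np.array([1.0e-4, 3.6e-4, 2.0e-4, 9.0e-4])
novel = autoassoc_errors > thr
```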
Figure 46 Error and abnormalities
As with previous methods, there are two types of misclassifications. In a type I misclassification, the autoassociative error is greater than 3.501 x 10^-4 (the data point is flagged as novel), but the approximation error is less than 0.15. In a type II misclassification, the autoassociative error is less than 3.501 x 10^-4, but the approximation error is greater than 0.15. For this test, the total misclassification rate was 6.63%. Most of these misclassifications (4.81%) were of type I.
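A sketch of how the two misclassification rates are counted for a given threshold; the four error pairs here are invented for illustration.

```python
import numpy as np

def misclassification_rates(autoassoc_err, approx_err, thr, err_level=0.15):
    """Type I: flagged as novel (autoassociative error > thr) although the
    approximation error is acceptable (< err_level).
    Type II: accepted although the approximation error is large."""
    flagged = autoassoc_err > thr
    bad = approx_err > err_level
    n = len(approx_err)
    type1 = np.sum(flagged & ~bad) / n
    type2 = np.sum(~flagged & bad) / n
    return type1, type2

# Hypothetical (autoassociative error, approximation error) pairs
# for four testing points.
aa = np.array([4e-4, 1e-4, 5e-4, 2e-4])
ae = np.array([0.05, 0.30, 0.40, 0.02])
t1, t2 = misclassification_rates(aa, ae, thr=3.501e-4)
```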
If we choose a threshold of 4.0 x 10^-4, the percentage of misclassifications is 5.72%. Around 3.41% of this 5.72% is from type I errors. We can see that type I misclassifications are reduced when we increase the threshold. It should be noted that it is very difficult to find a threshold minimizing the percentage of misclassifications for every data set. However, what we can say in general about setting the threshold is that the higher the autoassociative error we choose for the threshold, the more likely we will reduce type I misclassifications (type II errors increase). On the other hand, the lower the autoassociative error we select for the threshold, the more likely we will reduce type II misclassifications (type I errors increase). It will depend on the application what type of misclassifications we can tolerate. For this thesis, we will use the threshold based upon the regression line.
From the simulation results, we can conclude that the autoassociative errors from the ND network are correlated with the network errors. As we can see in Figure 45, some data points generating large approximation errors have small autoassociative errors, while some creating acceptable approximation errors make large autoassociative errors. These data points will increase the percentage of misclassifications. Therefore, in the next section, we will propose a technique to reduce the percentage of misclassifications.
A technique to improve efficiency
In the previous chapter, we described our technique to improve the efficiency of the Gaussian kernel estimator. We will use a similar procedure in this chapter. For our training data, we will incorporate the target with the input to create new training vectors r_i = [p_i t_i]^T, and we will incorporate the output of the function approximator to make new testing data r'_i = [p'_i a'_i]^T. That means that we will train an ND network with the new augmented training data r. Furthermore, in order to get new testing data, we need to propagate our testing inputs p'_i through the function approximator to get the network outputs a'_i, and then stack these two variables. Finally, these new testing data will be applied as the inputs of the ND network. The advantage of using this method is described in the following.
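The construction of the augmented vectors can be sketched as follows. The two-input approximator, its functional form, and the tiny data set are hypothetical stand-ins; for simplicity the targets are taken equal to the approximator's outputs, so the training and testing augmentations coincide here.

```python
import numpy as np

def augment_training(P, T):
    """Stack each training input p_i with its target t_i: r_i = [p_i; t_i]."""
    return np.hstack([P, T])

def augment_testing(P_test, approximator):
    """Stack each testing input with the function approximator's output:
    r'_i = [p'_i; a'_i]."""
    A = approximator(P_test)
    return np.hstack([P_test, A])

# Hypothetical stand-in approximator mapping 2-D inputs to scalar outputs.
approximator = lambda P: np.sin(P[:, :1]) * np.cos(P[:, 1:2])

P = np.array([[0.1, 0.2], [0.3, 0.4]])
T = approximator(P)           # pretend the targets equal the true outputs
R_train = augment_training(P, T)    # 2-D inputs become 3-D training vectors
R_test = augment_testing(P, approximator)
```

The augmented vectors are what the new ND network is trained on, so its constrained hypersurface now lives in the joint input-output space.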
If we have testing data p' close to the training data p, but the function approximator turns out an eccentric output, the ND network will generate a large autoassociative error. For example, assume that the function approximator creates a large error for one of its training points p_i (|t_i - a_i| is big). We also assume that the original ND network can recognize our training data p_i (the output of the ND network is very similar to the input p_i). The autoassociative error from the original ND network is thus small, since p_i is in the constrained hyperplane (since it is a training vector for the ND network). However, when we use the new ND network trained with the augmented training vectors r, the autoassociative error from the ND network for [p_i a_i]^T should be large, since [p_i a_i]^T is not close to our training data and thus is not in the constrained hyperplane (rather, r_i = [p_i t_i]^T is in the constrained hyperplane). Therefore, we would be able to reject this point as abnormal for the function approximator.
In the next section, we will use the new training data set r to train another ND network, and will apply the new testing data r' to the ND network. We will also show the correlation between the autoassociative error and the network error, and will discuss whether or not it is improved over the original method.
Simulation of the simple example #2
In this example, the new data set is now three-dimensional. The number of neurons in the hidden layers remains the same as in the previous example. Therefore, the ND network has the structure 3-20-10-3. We will again use the Levenberg-Marquardt algorithm for training the ND network to recognize the data set r. The mean-square error was driven down to about the same level as in the previous case, and the maximum autoassociative error from the training data was 6.14 x 10^-5.

After the ND network was trained, we applied the testing data set r' to the ND network. Note that we propagate the testing inputs p' (shown in Figure 22) through the function approximator to obtain the network outputs a', and then stack them to obtain the inputs to the ND network. Figure 47 demonstrates the correlation between the errors of the function approximator and the autoassociative errors from this ND network.
Figure 47 Autoassociative Error versus Approximation Error
The correlation coefficient value in this case was higher than the previous case
(0.883 vs. 0.756). Also, the data points generating large errors are more distinguishable
from data points creating small errors (compare Figure 45 and Figure 47).
Again, we consider the regression line in order to set up the threshold for rejecting novel data. From Figure 47, the regression line is

   autoassociative error = 0.0035 x error - 4.06 x 10^-5        (43)

From Equation (43), the autoassociative error corresponding to the network error of 0.15 is 4.844 x 10^-4. We will use this value as the threshold for our novelty detector. In Figure 48 we plot the approximation error on the testing set and indicate with an x all points that have autoassociative error greater than 4.844 x 10^-4.
Figure 48 Novelty Detection
In this case, the total percentage of misclassified points was reduced to 5.72% (compared with 6.63% for the previous method). All of the misclassifications in this case were of type II.
Note that some other threshold may reduce the percentage of misclassifications. For example, if we choose the threshold to be 2 x 10^-4, the percentage of misclassifications is 3.66%. Around 2.97% of the 3.66% were type II errors. There is no general threshold to minimize the percentage of misclassifications for every data set. However, when we reduce the threshold from 4.844 x 10^-4 to 2 x 10^-4, the type II error decreases and the type I error increases. Therefore, the higher the autoassociative error threshold, the more likely type II errors will increase (and type I errors decrease). On the other hand, the lower the autoassociative error threshold, the more likely type I errors will increase (and type II errors decrease).
As demonstrated in this experiment, the technique of augmenting the input vectors with the target output does reduce the percentage of misclassifications. The percentage error for this method is lower than for any other method we tested, and the method does not require us to set parameters (such as the smoothing parameter of the Gaussian kernel estimator). However, the training time for this method can be very long.
Summary
In this chapter, the multilayer neural network architecture was briefly reviewed. We slightly changed the multilayer structure to fit the novelty detection problem. The modified structure has a number of neurons in the output layer equal to the input dimension, and the target output of the novelty detection (ND) network is the same as the network input. The error of the ND network was called the autoassociative error.

We used the autoassociative error to decide whether or not we would discard data as too novel for accurate function approximation. The simulation results showed that when the autoassociative error from the ND network was large, we were more likely to find large errors for the function approximator.
Finally, we introduced a new technique to improve the performance of this algorithm by making use of network outputs and targets (as in the previous chapter). The simulation outcomes demonstrated a better performance in terms of fewer misclassifications.
CHAPTER 6
MINIMUM DISTANCE ALGORITHM
Introduction
In this chapter, a new novelty-detection method will be introduced. The fundamental idea behind this method is based on the 2-norm between training vectors and testing vectors. We will first describe how the new novelty detection method can be employed, followed by simulation results. After that, we will propose a technique to improve the efficiency of this algorithm based on a derivation of the error equation. The simulation results of the modified version will be illustrated. Finally, a procedure to speed up the computation will be described.

Minimum Distance Computation

This algorithm is based on the idea that any two vectors that are close together would contain similar properties (information); therefore, the function approximator should turn out comparable outcomes. The distance between any two vectors will be computed as the 2-norm of the difference between the two vectors. This 2-norm is written as:

   ||a_j - b_j|| = sqrt( (a_j - b_j)^T (a_j - b_j) )        (44)

where a_j and b_j are the two vectors that we will measure the distance between.
As discussed in previous chapters, the objective of this thesis is to develop procedures to detect when a new input to a multilayer network function approximator is unlike the inputs contained in the training set for the approximator. When we have a new input to the multilayer network, we will want to know how close that input is to the inputs contained in the training set. For that purpose we will measure the distance from the new input to every input in the training set. The minimum of these distances will be used to measure the similarity of the new input to the training inputs.
We will explain how the concept of minimum distance can be applied to novelty
detection in the following section. We will also provide computer simulations of this technique.
Application to novelty detection
Recall that we need a novelty detector since we would like to flag data on which the
multilayer function approximation network will perform poorly (generate large error). In
this chapter we will use the minimum distance algorithm as a novelty detector. The main
idea of this method is the following. Because errors on the training data are made small during
the training process, any new data that is similar to some training data (has a small minimum
distance) should also produce a small approximation error. On the other hand, new
data with a large minimum distance to the training set can be expected to produce a large
error for the function approximator. Therefore, we will flag any data for which the minimum
distance is high.
Equation (45) expresses the distances from a testing vector p' to all of the N training data inputs p_i:

   d = [ ||p_1 - p'||
         ||p_2 - p'||
         ...
         ||p_N - p'|| ]        (45)

The minimum distance is the minimum element of the above vector:

   d_m = min(d)        (46)
As we explained earlier, we expect that the bigger the value of d_m, the more likely there will be a large error for the function approximator. Note that if the testing vector is one of the training vectors, the minimum distance d_m will be equal to zero, and we will accept such data, since the error on any training vector should be small.
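The two equations above can be sketched in a few lines; the three training vectors here are hypothetical.

```python
import numpy as np

def minimum_distance(p_test, P_train):
    """Equations (45)-(46): 2-norm distances from a testing vector to
    every training input, then the minimum of those distances."""
    d = np.linalg.norm(P_train - p_test, axis=1)
    return d.min()

P_train = np.array([[0.0, 0.0],
                    [1.0, 0.0],
                    [0.0, 1.0]])

dm_seen = minimum_distance(np.array([1.0, 0.0]), P_train)  # a training vector
dm_new = minimum_distance(np.array([3.0, 4.0]), P_train)   # far from the set
```

As the text notes, a testing vector that coincides with a training vector gives d_m = 0 and is accepted.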
In the next section, we will revisit the example we have used in the previous chapters.
We will use it to test the minimum distance method, Equation (45) and Equation (46),
for novelty detection.
Simulation of the simple example #1
The following simulations will illustrate the results of novelty detection employing the minimum distance algorithm. We use Equation (45) to find the distances of each testing vector to the 638 training vectors (shown in Figure 19), and then Equation (46) is utilized to find the minimum distance. We repeat this method for all of the 437 testing vectors (shown in Figure 22). Figure 49 shows the minimum distance values of all 437 testing vectors versus the errors obtained from the function approximator.
85
'.I
,~ '.'~.
:.:
~,
r:.
:::1')
' ..
'..
R.0.7&2
0.1
0.5
0
0 0
0 0
0 0 0
0 00
00
~, ~ ~, ~ U M OJ
e.....
Figure 49 Minimum Distance versus Error
From the figure, we can see that as the minimum distance value increases, the more likely we are to find large errors. Also, the fairly high correlation coefficient (R value) implies that there exists a correlation between the minimum distance and the error from the function approximator.
To create a threshold to reject novel data, the regression line shown in Figure 49 will be used. We found that the regression line in this case was

   d_m = 0.702 x error + 0.0518        (47)

After substituting 0.15 into the error term, the corresponding minimum distance equals 0.1571. (We will be using 0.15 as an arbitrary point to represent large approximation error.) That indicates that, on the average for this data set, data points generating an approximation error of 0.15 are a distance of 0.1571 from their closest training vectors. We will use this value as the threshold for the novelty detector. In other words, any data generating a minimum distance larger than 0.1571 will be considered novel. Figure 50 demonstrates the network errors for this testing data, and novel data are flagged with an x.
86
0.'
0.1
O.S
0.4
0.3
!
0.2
0.1
0.1 • _ _ _ _ _ _ _ _ _ •
Figure 50 Error and abnormalities
There are two types of misclassifications. In a type I misclassification, the minimum distance is greater than 0.1571 (the data point is flagged as novel), but the approximation error is less than 0.15. In a type II misclassification, the distance is less than 0.1571, but the approximation error is greater than 0.15. For this test case, the total misclassification rate was 12.82%, and most of these misclassifications (12.58%) were of type I.
We found that the threshold that produces the fewest misclassifications in this example is 0.25. The error rate is 6.87%, and around 3.89% (out of 6.87%) are type I errors. Although the purpose of using the testing data is to set up the threshold to reject novel data, it should be noted that it is very difficult to find a general value for the threshold that will minimize the percentage of misclassifications for all data sets. What we can generally say about setting the threshold is that the lower the threshold, the more likely we will experience type I errors. On the other hand, the higher the threshold, the more likely we will have type II errors. Therefore, the threshold of 0.25 tends to give us more type II errors than the threshold of 0.1571. That means that if another test set is quite similar to this test set,
the threshold of 0.25 would be better than 0.1571. In contrast, if another test set is different from this test set (i.e., it has data with large errors and small minimum distances), the threshold of 0.1571 should outperform it. However, in our case, we will use the threshold based on the regression line.
From the simulation results in this example, we can see that there is a relationship
between the minimum distance and the error from the function approximator. However, as
shown in Figure 49, there exist some points creating large approximation errors but having
minimum distances that are similar to data points that have small approximation errors. In
the next section, we will explain a technique to improve the performance of this algorithm.
A Technique to Improve Performance: Minimum Weighted Distance
In the previous chapter, we proposed that instead of just using network inputs, the network outputs a and targets t could also be used for novelty detection. In this chapter, we will modify the minimum distance algorithm in order to include network outputs. First, we will start this section by deriving the error equation for the function approximator, using the Taylor series expansion.

For an input vector p' to the function approximator, the error between the corresponding network output a' and the target t' can be written as:

   e' = t' - a'
      = F(p') - a'        (48)

where F is the function we want to approximate.
Now we will use the Taylor series to expand the function F about the training input vector that is closest to the vector p'. Therefore, Equation (48) can be rewritten as follows:

   e' = F(p_0) + dF(p)^T/dp |_(p = p_0) (p' - p_0) + ... - a'
      = (F(p_0) - a') + dF(p)^T/dp |_(p = p_0) (p' - p_0) + ...        (49)

where p_0 is a training vector, and dF(p)/dp is the first derivative of the function with respect to the input, evaluated at the training vector.

Recall that by using Equation (45) and Equation (46) to find the minimum distance, we are able to indicate which training vector is closest to the testing vector. Let us assume that p_m is the training vector closest to p'. In other words, p_m can be found by computing p_m = argmin(d). Thus, the last expression in Equation (49) can be written as:

   e' = (t_m - a') + dF(p)^T/dp |_(p = p_m) (p' - p_m) + ...        (50)

Now, if we assume that the higher-order terms in Equation (50) can be ignored, then we can rewrite it as:

   e' ~ (t_m - a') + dF(p)^T/dp |_(p = p_m) (p' - p_m)        (51)
We can see from Equation (51) that the error from the function approximator e' does not depend only on the distance between the testing vector p' and the closest training vector p_m, but also on the gradient dF(p)/dp evaluated at the training vector, and on the distance between the target t_m and the network output a' of the testing vector. Therefore, if the underlying function F at the training vector p_m is fairly steep (the derivative at that point is very large), the error of the function approximator could be large even though the minimum distance between the testing vector and the training vector is small. Similarly, if the distance between the target t_m and a' is large, the error can be large even though the distance between the input vectors is small.
Unfortunately, the underlying function F is in general unknown; therefore it is impossible to calculate its gradient (which is a vector of size q x 1) at any training data point. We first considered calculating the derivative of the function approximator instead; however, we found that this did not improve the novelty detection. We will explain in the next section how to calculate the estimated derivative. We will also explain why we cannot use it for novelty detection, and how it can be used for other work.
Though Equation (51) is not directly useful for error estimation, it tells us that the error depends on more than just the distance between the two input vectors. Therefore, we will modify our previous distances d (Equation (45)) to make them depend on another computable term: the distance between the target and the network output. Equation (52) shows our new measurements:

   d_a = [ ||p' - p_1|| + a ||t_1 - a'||
           ||p' - p_2|| + a ||t_2 - a'||
           ...
           ||p' - p_N|| + a ||t_N - a'|| ]        (52)

where a is greater than or equal to zero (a >= 0). We will call the elements of d_a weighted distances, because they include the distance between the targets and the network outputs. As before, we will obtain the minimum value of the weighted distances by computing

   d_m^a = min(d_a)        (53)

We will call d_m^a the minimum weighted distance. The effect of a will be discussed further.
As shown in Equation (52), if we set a = 0, the minimum weighted distance will be equal to the minimum distance. Increasing the weighting factor will strengthen the effect of the difference between targets and network outputs and will weaken the effect of the distance between the input data. For example, assume that p_k is a training data point that generates a large error, thereby making the value ||t_k - a_k|| high. When we substitute p_k for p' in Equation (45), the term ||p' - p_k|| will be zero, thus causing a zero minimum distance (d_m = 0). However, the minimum weighted distance will not equal zero, because a ||t_k - a_k|| will not be zero. Choosing an appropriate value of a will be important for successful novelty detection. We will discuss this further in the next section.
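Equations (52) and (53) can be sketched as follows; the two-point training set is hypothetical and is constructed to reproduce the situation just described, a testing point that equals a training input but receives an eccentric network output.

```python
import numpy as np

def minimum_weighted_distance(p_test, a_test, P_train, T_train, alpha):
    """Equations (52)-(53): input distance plus alpha times the distance
    between each training target and the network output for p_test."""
    d = (np.linalg.norm(P_train - p_test, axis=1)
         + alpha * np.linalg.norm(T_train - a_test, axis=1))
    return d.min()

P_train = np.array([[0.0, 0.0], [1.0, 1.0]])
T_train = np.array([[0.0], [2.0]])

# Testing point equal to a training input, but with an eccentric output:
# the plain minimum distance is zero, while the weighted one is not.
p, a = np.array([0.0, 0.0]), np.array([1.0])
dm0 = minimum_weighted_distance(p, a, P_train, T_train, alpha=0.0)
dm5 = minimum_weighted_distance(p, a, P_train, T_train, alpha=5.0)
```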
We will demonstrate the ability of this method for novelty detection using our previous example. The effect of a on the performance of the novelty detector will be demonstrated through computer simulations.
Simulation of the simple example #2
We will illustrate the use of the minimum weighted distance computation for novelty detection through a simple example. We will show the effect of varying a on the correlation coefficient R between the error of the function approximator and the minimum weighted distance.
First, we need to propagate a testing vector p' through the function approximator to get the network output a'. Equation (52) and Equation (53) are then used to find the minimum weighted distance at a specific value of a for the 437 testing vectors. After that, we compute the correlation coefficient (R) between the two variables, the approximation error and the minimum weighted distance, for the 437 testing data. Figure 51 demonstrates the relationship between the weighting factor and the correlation coefficient.
92
....
..,
Jt:: :..:.1.
• I
Q... r::====..~....._,
0...
0.'
o.
0.7'
0.710:::,~07:":2~07:,.!
W~iMiof (Iphel
Figure 51 Effect of the weighting factor on R
We can see from Figure 51 that as the value of a is increased, the correlation coefficient R rises as well. This is because the effect of the distance between target and network output is added to the distance between the input vectors. And as we expected, when the value of a is too large (more than 5.1 in this case), the R value begins to drop. We expect that the curve shown in Figure 51 may change if we use a different testing set. However, for our simulations, we will choose the weighting factor that maximizes the correlation coefficient on this test set (a = 5.1).
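A sketch of this sweep over the weighting factor follows. The testing-set quantities are synthetic, and for brevity the per-point input and output distances are assumed to be already minimized over the training set, so the resulting curve need not peak at a = 5.1 as Figure 51 does.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical testing-set quantities: approximation errors, input
# distances to the nearest training point, and target/output distances.
n = 200
approx_err = rng.uniform(0.0, 0.7, n)
d_input = np.abs(0.2 * approx_err + 0.05 * rng.standard_normal(n))
d_output = np.abs(approx_err + 0.05 * rng.standard_normal(n))

def correlation_at(alpha):
    """Correlation coefficient R between the approximation error and the
    weighted distance for a given weighting factor alpha."""
    dw = d_input + alpha * d_output
    return np.corrcoef(approx_err, dw)[0, 1]

alphas = np.linspace(0.0, 10.0, 101)
rs = [correlation_at(a) for a in alphas]
best_alpha = alphas[int(np.argmax(rs))]   # the sweep's analogue of a = 5.1
```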
After choosing the value a = 5.1, we plotted the error from the function approximator and the minimum weighted distance in Figure 52.
Figure 52 Approximation Error and minimum weighted distance
As shown in Figure 52, the correlation coefficient is very high, and most data points follow the regression line. Therefore, the regression line in this case may be able to represent these two variables more precisely than the regression line shown in Figure 49. From the result shown in Figure 52, the regression line was

   d_m^a = 4.37 x error + 0.0643        (54)

For the error of 0.15, the minimum weighted distance is equal to 0.7198. That means that, on the average for this data set, the error of 0.15 corresponds to the minimum weighted distance of 0.7198. Therefore, in order to reject any data generating errors greater than 0.15, we will set up the threshold so that any data generating a minimum weighted distance (with a = 5.1) larger than 0.7198 will be considered novel. Figure 53 plots the network errors for the testing data, and novel data are identified with an x.
Figure 53 Novelty detection
Using this method, 5.03% of the points were misclassified. We found that 3.66% out of the 5.03% of misclassifications were type I errors. The rest of the misclassified data points (1.37%) were type II.
If we choose the threshold to be 0.9 rather than 0.7198, we get only 4.34% misclassified data. Most of the misclassifications (around 4.11% out of 4.34%) are type II errors. As we can see, though the total percentage of misclassifications is reduced by a certain amount (4.34% versus 5.03%), the percentage of type II errors increases considerably (4.11% versus 1.37%). It will depend on the application what types of misclassifications one can tolerate. However, in our case, we will stick with the threshold obtained from the regression line.
As we can see from the simulation results, the percentage of misclassifications was reduced after we included the distance between targets and network outputs in the distance measure. The execution time in this case was slightly longer than computing the minimum distance.
Note that although this algorithm worked somewhat well for this data set, we found
that the weighting factor we used in the example was not always best for other data sets.
Therefore, if we computed the percentage of misclassifications on another data set, its value
would not be this low. We will introduce another novelty detection method in the next chapter: minimum distance with outlier detection. We found that a parameter in that method worked well for a variety of data sets.
Recall that we introduced the minimum weighted distance because the error from the function approximator depends on three terms, and the derivative term in Equation (51) is unknown. Furthermore, we briefly mentioned that the estimated derivative is not helpful for novelty detection. In the next section, we will explain in detail how to compute the estimated derivative of the function approximator, the reason why the obtained value is not useful for novelty detection, and the advantage of finding the estimated derivative.
Calculating the estimated derivative
Recall from Equation (48) that we began with the error equation and expanded it using the Taylor series. We then searched for the training vector closest to the testing input, p_m = argmin(d), so that the higher-order terms of the series could be ignored, resulting in Equation (51). Now, since we have the multilayer network to approximate the original function F, the idea is to estimate the derivative of the function F from the function approximator. The estimated derivative will then be substituted into Equation (51) to obtain the estimated error. The method to derive the estimated derivative follows.
Apply the chain rule to the k-layer network (the function approximator):

   dF(p)/dp = d a^k(n^k)/dp
            = (d n^k/dp)^T F'^k(n^k)
            = (d a^(k-1)/dp)^T (W^k)^T F'^k(n^k)
            = (d n^(k-1)/dp)^T F'^(k-1)(n^(k-1)) (W^k)^T F'^k(n^k)
            = (d a^(k-2)/dp)^T (W^(k-1))^T F'^(k-1)(n^(k-1)) (W^k)^T F'^k(n^k)        (55)

where F'^l(n^l) is the diagonal matrix of derivatives of the transfer function f^l at layer l:

   F'^l(n^l) = diag( df^l(n^l_1)/dn^l_1 , df^l(n^l_2)/dn^l_2 , ... , df^l(n^l_{S^l})/dn^l_{S^l} )        (56)
For example, if we use the hyperbolic tangent sigmoid function at layer 1, Equation (56) becomes

   F'^1(n^1) = diag( 1 - (a^1_1)^2 , 1 - (a^1_2)^2 , ... , 1 - (a^1_{S^1})^2 )        (57)
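To make the chain-rule computation concrete, the following sketch (the 2-5-1 network and its random weights are hypothetical, not the thesis's approximator) evaluates the derivative of a two-layer tanh network analytically, using the diagonal tanh-derivative matrix with entries 1 - a^2, and checks it against a finite-difference approximation.

```python
import numpy as np

rng = np.random.default_rng(2)

# A hypothetical 2-5-1 network: tanh hidden layer, linear output layer.
W1, b1 = rng.standard_normal((5, 2)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((1, 5)), rng.standard_normal(1)

def forward(p):
    """Return the network output and the hidden-layer activations."""
    a1 = np.tanh(W1 @ p + b1)
    return W2 @ a1 + b2, a1

def jacobian(p):
    """Chain rule for the derivative of the output with respect to the
    input: W2 @ diag(1 - a1**2) @ W1, the tanh case of the diagonal
    transfer-function derivative matrix."""
    _, a1 = forward(p)
    return W2 @ np.diag(1.0 - a1**2) @ W1

p = np.array([0.3, -0.5])
J = jacobian(p)

# Finite-difference check of the analytic derivative.
eps = 1e-6
J_fd = np.zeros_like(J)
for i in range(2):
    dp = np.zeros(2)
    dp[i] = eps
    J_fd[:, i] = (forward(p + dp)[0] - forward(p - dp)[0]) / (2 * eps)
```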
If we continue to apply the chain rule to the first term in Equation (55), we will eventually obtain the derivative