AN ANALYSIS ON ADAPTIVE APPROXIMATION BASED NEAREST NEIGHBOR SEARCH IN LARGE IMAGE DATABASES

By FAN YANG
Bachelor of Engineering
Tianjin Medical University
Tianjin, P. R. China
1990

Submitted to the Faculty of the Graduate College of the Oklahoma State University in partial fulfillment of the requirements for the Degree of MASTER OF SCIENCE, August 2003
Thesis Approved:
PREFACE
Many techniques provide effective and efficient retrieval of low-dimensional data stored in small image databases. However, as dimensionality increases, the "curse of dimensionality" occurs in high-dimensional image retrieval processes. A content-based image is represented by a set of multidimensional feature vectors. Content-based search is based on similarity search, and relevance feedback improves the retrieval of images. The vector approximation file (VA-File) method reduces I/O cost due to the smaller size of the VA-File compared to the original data file. K Nearest Neighbor search refines the desired image during iterations of feedback. Standard VA-File based K Nearest Neighbor search performs the approximation filter step and the data level filter step to reduce the I/O cost by only visiting the objects resulting from the approximation step. Adaptive VA-File K Nearest Neighbor search finds correlations between two consecutive nearest neighbor search iterations in relevance feedback, and sets more constraints to reduce the number of candidates in the approximation step.

However, adaptive VA-File KNN search does not show that the I/O performance improves in terms of blocks visited. The number of blocks visited determines the I/O cost. In order to analyze the adaptive nearest neighbor search and to evaluate its I/O performance improvement for high-dimensional feature vectors within a large image database, we set up experiments using C/C++ to compare the number of blocks visited in the standard KNN search with the adaptive KNN search. Our experimental results show
that although the number of candidates at the approximation level has been reduced by the adaptive KNN search, the number of blocks visited in the data level filter has not been reduced. That is, the final I/O performance is not improved in the adaptive KNN search when compared with the standard KNN search.

In addition, the analysis of the relation of the precomputed constraints to the feature resolution selected to construct the VA-File is not clear in the adaptive KNN search. We believe that it should be independent of the feature resolution. To verify our analysis, we performed experiments to determine the correlation between the feature resolution and the established upper bound on the next iteration's candidates, the distance precomputed using the KNN with the next iteration's weight matrix. Finally, the experimental results match our analysis: the established upper bound on the next iteration's candidates at the approximation level is independent of the feature resolution we selected to construct the vector approximation file.
ACKNOWLEDGMENTS
I wish to express my sincere appreciation to my major advisor, Dr. Douglas R. Heisterkamp for his intelligent supervision, constructive guidance, inspiration and friendship. My sincere appreciation extends to my other committee members Dr. John P. Chandler and Dr. Nohpill Park, whose guidance, assistance, encouragement, and friendship are also invaluable.
I would also like to give my special appreciation to my family for their love, strong encouragement and support.
Finally, I would like to thank the Oklahoma State University Computer Science Department for supporting me during my graduate study.
TABLE OF CONTENTS
Chapter Page
I. INTRODUCTION 1

II. BACKGROUND 5
   2.1 Text-based Indexing 5
   2.2 Feature Vectors 5
   2.3 Relevance Feedback Technique 6
   2.4 Similarity Matrix 7

III. RELATED WORK 8
   3.1 Approximation File for Feature Vectors 8
   3.2 Structure of VA-File 9
   3.3 Algorithm of Upper Bound and Lower Bound 10
   3.4 VA-File Based KNN Search 12
   3.5 Adaptive VA-File KNN Search Algorithm 13

IV. PROBLEM STATEMENT 15

V. EXPERIMENTS 17
   5.1 Description 17
   5.2 Construction of VA-File 17
   5.3 Procedure of KNN 18
   5.4 Procedure of Evaluation 22
   5.5 Generation of Similarity 23
   5.6 Computation of Kth Distance 23
   5.7 Computation of B_t+1(Q) 27
   5.8 Computation of r_t+1^U(Q) 29
   5.9 Comparisons 30
       5.9.1 ave|N1_stan(Q, W_t)| and ave|N1_adap(Q, W_t)| 30
       5.9.2 ave|B2_stan(Q, W_t)| and ave|B2_adap(Q, W_t)| 32
       5.9.3 Calculating Results of r_t+1^U(Q) 34

VI. CONCLUSION 37

BIBLIOGRAPHY 38

APPENDICES 41
   APPENDIX A: Code for Adaptive VA-File KNN Search 41
   APPENDIX B: Code for Standard VA-File KNN Search 48
   APPENDIX C: Code for Evaluation Procedure 62
LIST OF TABLES
Table Page
1. Notation 8
2. Pseudo Code of Standard VA-File KNN 13
3. The Calculated Values of CN1 and CB1 on Varying VA-Files 32
4. Overall Results of Number of Candidates Resulting from Phase I 30
5. Overall Results of Number of Blocks by Standard and Adaptive Approach 33
6. Overall Results of Number of Blocks by Optimal Approach 33
LIST OF FIGURES
Figure Page

1. VA-File Construction 10
2. r_t^U and γ computed from approximations at different resolutions 17
3. Flowchart of adaptive VA-File KNN search procedure 22
4-1. Comparison of candidates in reweighting procedure 26
4-2. Comparison of blocks in reweighting procedure 26
5-1. Comparison of candidates on adaptive search and standard search 31
5-2. Comparison of blocks on optimal, adaptive and standard search 32
6. Impact of r_t+1^U on VA-File construction 35
CHAPTER I
Introduction
Image retrieval has gained much attention during recent years. The image as a kind of data source is growing fast, and the computer hardware industry provides the technical support for computation over large image databases. However, intelligent techniques that can support fast and efficient browsing and retrieval still need to be improved.
Multimedia information retrieval explores both text-based indexing [1, 25] and content-based multidimensional information processing techniques. For traditional database approaches, the keyword-based search gives a binary decision [2], whereas content-based search implies a similarity search within the database [29].
A content-based image is represented by a set of multidimensional feature vectors. Content-Based Image Retrieval (CBIR) [38] is a technology that uses relevance feedback (RF) to obtain the most desired images [23, 24]. The main idea of RF is to use feedback from the results of searches to refine the search and get more relevant results. Researchers have developed systems to store and retrieve images following different mechanisms. Content-based approaches are used in many application areas such as molecular biology [28], multimedia databases (for image retrieval) [4, 8, 14, 16, 21, 26, 27, 30, 36] and geometric databases (for shape matching) [5, 19]. The CBIR system is one of the most popular systems using low-level visual features like color, shape, texture, object placement and so on to represent the image content. Such systems allow users to perform a query by choosing an image similar to the required images based on low-level
features. Upon seeing the results, the user may further adjust the property, i.e. the similarity matrix, and submit a new query. The user may also label the retrieved images as good or not good and feed them back. The content-based similarity models have the appropriate property that the objects are mapped to vectors of a high-dimensional space. The similarity of objects is defined in terms of their distance in the feature space. The Euclidean distance function is a model used to measure the similarity of two objects [4]. This distance function has been successfully implemented for a variety of similarity models in different domains. Two of many CBIR systems are the color histogram model for color images in IBM's Query By Image Content (QBIC) system [20] and the CANDID system [14]. For efficient query evaluation, various index structures have been proposed for managing the feature vectors in a multidimensional index [3, 6, 9, 11, 15, 17, 21]. However, despite the advances in both feature selection techniques and retrieval techniques, the current systems still have a major difficulty.
For an efficient query process of multidimensional feature vectors, index structures become worse with an increasing number of dimensions of the feature vectors within large databases [31, 32, 37]. Such difficulties are usually called the "curse of dimensionality" [13]. The curse of dimensionality has been shown in many situations, such as increasing dimension and an increasing number of objects in the database. Hence, indexes often fail to outperform sequential scanning methods on sets of multidimensional feature vectors.
The solution called the vector approximation file (VA-File) [31, 32, 33] is an improvement of the sequential scan obtained by quantizing and compressing the file of the feature vectors. The general idea of the VA-File is to store the features not with the full precision
of 32-bit floating point values but with a reduced number of bits. The cells are constructed to lie over the data space and are split by markers. Such markers are determined by the data distribution in the data space. Each cell represents the location of a feature vector.
Standard K Nearest Neighbor (KNN) search applied to the VA-File is a multi-step query process. Candidates produced by two filter steps are exactly evaluated by a subsequent refinement step [32, 33, 34]. The first, approximation step (phase I) performs a kind of approximation of the image objects: the lower bound and the upper bound on the distance of each object in the database to the query object are computed, and the candidates are selected according to the distance between the query and the objects' bounds. Such an approximation can be determined in linear time. This approximation can be evaluated and the candidates selected very efficiently for feature vectors due to the smaller size of the VA-File compared to the original data file. The second step comes after reducing the time of retrieval from secondary storage at the approximation level. The second step, the data level filter step (phase II), computes the actual distance between the candidates and the query object and filters out candidates that are not as good as desired.
Adaptive KNN search states that if the first filter is followed by a technique that can reduce the number of candidates at the approximation level, the disk/page accesses can be reduced [35]. That implies that the number of candidates resulting from phase I determines the disk or page accesses.
The rest of this paper is organized as follows. Section 2 introduces the feature vector of an image object, CBIR techniques for querying content-based images
and the similarity matrix used in relevance feedback. Section 3 discusses the approximation-based indexing structure for high-dimensional feature spaces and the construction of the VA-File for feature vectors. Section 4 analyzes the performance of the standard KNN search and the adaptive KNN search applied in RF, and further discusses the correlation between the feature resolution and the precomputed upper bound on the next iteration's candidates. Section 5 shows the experimental results that compare the number of candidates resulting from phase I and the data blocks visited in phase II by standard KNN and adaptive KNN respectively. The comparison of I/O performance in terms of the number of blocks visited in the adaptive approach and the standard approach is concluded. In addition, we use the results of a comprehensive experiment implemented in C/C++ to show that the precomputed upper bound on the next iteration's candidates is independent of the feature resolution.
CHAPTER II
Background
2.1 Textbased indexing
The development of computer technology shows the importance of multimedia information that must be browsed, retrieved and delivered. Traditional retrieval approaches rely on text annotation for multimedia information; e.g., keywords are attached to images, videos, sounds and speech fragments. Retrieval is performed on the keywords rather than on the original multimedia information. This approach has the advantage of inheriting the efficient technology developed for text retrieval, but there are two serious drawbacks: one is the requirement to attach text descriptors to images; the other is the possibility of inconsistency in keyword assignment. The latter occurs because different users could give the same content-rich image different interpretations.
2.2 Feature Vectors
Techniques for content-based retrieval of multimedia data aim to overcome these difficulties by using numerical features to measure the information content. Since an M-dimensional feature vector represents an image, we can view it as a point in an M-dimensional space. CBIR is favorable since the sets of high-dimensional feature vectors can be generated by feature transformation [16, 18]. Thus, the information used
during the retrieval process is always consistent and independent of different human interpretations.
Given an image, retrieval can be accomplished by extracting the query image's features and computing similarity distances in the feature space between the feature vectors of the query image and the stored images. Then images that match or are the most similar in the database can be retrieved. The result is returned to the user in increasing or decreasing order of similarity. Compared with traditional image retrieval approaches such as keyword annotation, CBIR is more efficient and practical. However, there are two deficiencies in CBIR systems: the gap between high-level concepts and low-level features, and the subjectivity of human perception of visual content. To overcome these deficiencies, the concept of RF associated with CBIR was proposed.
2.3 Relevance Feedback Technique
Relevance feedback is an interactive process in which the user judges the quality of the retrieval. This information is then used to refine the original query, which is resubmitted for a more desired selection. The methods that perform RF on a multi-level image model have been formulated based on the most popular vector model used in information retrieval. Currently, RF is a widely accepted method of improving interactive retrieval effectiveness in image retrieval systems. The main idea of RF is that an initial search is made by the system with a user's query, and a small number of results are returned to the users. The users indicate which of the returned results are relevant. The system then automatically refines the original query based upon those user relevance judgments. The
new feedback query is then compared with the collection of images, returning an
improved set of images to the user. This process can continue to iterate until the user's information need is satisfied. The fundamental goal of these techniques is to estimate the ideal query parameters (both the query vectors and the associated weights) accurately. Most of the
RF can be classified into the following two approaches: query point movement and reweighting techniques [10]. The query point movement method essentially tries to improve the estimation of the "ideal query point" by moving it towards good example points and away from bad example points. This technique is implemented in the MARS
system [22, 24]. The reweighting is a method to adjust the associated weight on each dimension.
2.4 Similarity Matrix
The weight associated with a feature defines the importance of that feature. Intuitively, if all the relevant images have similar values for a feature, the feature can be considered a good indicator of the user's expectation. For example, if the variance of the good examples is low along a principal axis j, then we can increase the weight on this axis. Conversely, if the axis is not very relevant to the input query, we assign it a low weight w_j.
Based on the user's feedback, the similarity matrix is modified to recompute the next set of retrievals. RF based on this model is to find the appropriate weights to model
the user's information request. The Euclidean distance function uses the similarity matrix to calculate the distance [2].
A(F_i)                  Approximation representation of F_i
A(N1_adap(Q, W_t))      Set of approximation representations of N1_adap(Q, W_t)
B2(Q, W_t)              Set of data blocks visited during iteration t in phase II
|B2(Q, W_t)|            Number of data blocks visited during iteration t in phase II
|B2_adap(Q, W_t)|       Number of data blocks visited during iteration t in adaptive phase II
B2_opt(Q, W_t)          Minimal set of blocks visited in phase II
|B2_stan(Q, W_t)|       Number of data blocks visited during iteration t in standard phase II
F_i                     Image object associated with a set of feature vectors, F_i = [f_i1, f_i2, ..., f_iM], i = 1, ..., N
L_i(Q, W_t)             Lower bound under the current iteration's W_t
L_i(Q, W_t+1)           Lower bound under the next iteration's W_t+1
M                       Number of dimensions in each object
N                       Number of image objects
N1(Q, W_t)              Set of candidates resulting during iteration t in phase I
|N1(Q, W_t)|            Number of candidates during iteration t in phase I
|N1_stan(Q, W_t)|       Number of candidates during iteration t in standard-approach phase I
|N1_adap(Q, W_t)|       Number of candidates during iteration t in adaptive-approach phase I
W_t                     Weight matrix during iteration t
w_t,k                   Weight associated with dimension k during iteration t
Q                       [Q_1, Q_2, ..., Q_M], the query object
R_t                     The set of k nearest neighbors for the query Q at iteration t
T                       Number of iterations performed in one KNN procedure
U_i(Q, W_t)             Upper bound under the current iteration's W_t
U_i(Q, W_t+1)           Upper bound under the next iteration's W_t+1
ave|B2_adap(Q, W_t)|    Average number of data blocks visited in the adaptive approach
ave|B2_opt(Q, W_t)|     Average number of the minimal set of blocks visited
ave|B2_stan(Q, W_t)|    Average number of data blocks visited in the standard approach
ave|N1_adap(Q, W_t)|    Average number of candidates resulting in the adaptive approach
ave|N1_stan(Q, W_t)|    Average number of candidates resulting in the standard approach
ave(r_t^U(Q))           The average r_t^U(Q) over a T-iteration KNN procedure
b                       Feature resolution
r_t(Q)                  The distance between Q and the kth object under W_t
r_t+1(Q)                The distance between Q and the kth object under W_t+1
r_t+1^U(Q)              The maximum of the distances of the KNN under W_t+1
t                       Iteration index, t = 1, 2, ..., T
α                       The kth largest upper bound found so far during phase I
αBuf                    The buffer keeping track of α
B_t+1(Q)                The kth largest upper bound of N1(Q, W_t) under W_t+1
B_t(Q, W_t)             The kth largest upper bound among all A(F_i) in phase I under W_t

Table 1: Notation
CHAPTER III
Related Work
3.1 Approximation File for Feature Vector
Most images are high-dimensional vectors. The most important overheads in query processing are the refinements, which require expensive random disk accesses. The idea of the VA-File is to compress the data file into a corresponding VA-File, with a bit-string representing each object [8]. The VA-File was developed as an index structure and comprises two files, a bit approximation version of the points and their exact representation. Both files are unsorted; the positions of the objects in the two files match each other so that we can trace the corresponding object.
For a certain dimensionality, a nearest neighbor query can be processed efficiently by accelerating the sequential scan through data compression. The smaller VA-File contains the compressed data that are determined by the cells in the data space. The cells in each dimension are numbered from 0 to 2^b − 1, where b is the number of bits allocated to one cell of a dimension. Thus the feature resolution is determined by b. A smaller b results in a lower resolution, which gives a coarser and smaller VA-File. Conversely, a larger b results in a higher resolution, which gives a finer and larger VA-File. The VA-File points can be loaded into main memory by a sequential scan. Basically, scanning the VA-File is faster than a sequential scan of the data file because reading a file sequentially from disk takes time linear in the file length.
The resolution trade-off is very important to efficiency, because the number of points to be refined increases when the resolution decreases, i.e., too many points are allocated to one partition, causing overhead in query processing. On the other hand, increasing the feature resolution will result in an inefficient query due to the increase of the file length. The detailed VA-File construction is shown as follows.
3.2 Structure of VAFile
Construction of the VA-File is determined by the selection of the feature resolution, and changing the resolution requires a reconstruction of the VA-File. The feature resolution is a measurement of the cell size in the VA-File.
Figure 1: VAFile Construction
The construction of a VA-File is depicted in Figure 1. The approximation of a point p, A(p), is based on the cell in which p lies in the data space. Denote by b_j the number of bits allocated to each cell on dimension j. There are 2^b_j cells along dimension j, numbered from 0 to 2^b_j − 1, and the points splitting the dimension are markers m[j, 0], ..., m[j, 2^b_j], as shown in Figure 1. Given a point p, its component p_j on dimension j falls in a cell between two markers m[j, A(p, j)] and m[j, A(p, j) + 1]:

    m[j, A(p, j)] <= p_j < m[j, A(p, j) + 1]    (1)

A(p, j) denotes the cell number in dimension j into which p falls. A bit-string of length b = Σ_{j=0}^{M−1} b_j is formed by concatenating the cell numbers in each dimension. Finally, the approximation of p, A(p), is the bit-string of the cell that contains p.

For example, in Figure 1, the approximation of the point p, A(p), is based on the cell to which p is allocated. The number of bits allocated to a cell in dimension x2 is b_2 = 2, so there are 2^2 cells along dimension x2. The cells have numbers from 0 to 3, represented by binary 00 through 11. The points splitting one dimension are markers m[0, 0], ..., m[0, 4]. On dimension 0, p_0 falls in a cell between two markers, m[0, 1] and m[0, 2]. A(p, 0) = 01 is the cell number in dimension 0. A(p, 1) = 01 is the cell number in dimension 1. A(p) is the concatenation of the bit-strings A(p, 0) and A(p, 1). Thus A(p) is the bit-string of the cell containing p, 0101.
3.3 Algorithm of Bounds on the Distance Between Query and Data Point
Given the approximation of a data point, we can compute the lower and upper bounds of the distance between the data point p and the query point q using

    L(p, q) <= distance(p, q) <= U(p, q)    (2)

The lower bound L(p, q) is the distance between the query point and the closest point of the cell where p lies, that is, the distance between q and the closest marker. For a given point p, the difference l_j on each dimension j is obtained from three cases (Equations 5, 6, 7), where p_j denotes the value of p associated with dimension j:

• If p_j lies in a higher cell than q_j does, the lower bound l_j is the difference between the lower marker of p_j and q_j in dimension j.

• If p_j lies in the same cell as q_j does, l_j is zero.

• If p_j lies in a lower cell than q_j does, l_j is the difference between q_j and the higher marker of p_j.

Each l_j is weighted by the weight on its dimension j, and L(p, q, W_t) is calculated using the Euclidean function:

    L(p, q) = sqrt( Σ_j w_j l_j^2 )    (3)

    U(p, q) = sqrt( Σ_j w_j u_j^2 )    (4)

          { q_j − m[j, A(p, j) + 1]   if q_j > m[j, A(p, j) + 1]                    (5)
    l_j = { 0                         if m[j, A(p, j)] <= q_j <= m[j, A(p, j) + 1]  (6)
          { m[j, A(p, j)] − q_j       if q_j < m[j, A(p, j)]                        (7)

          { q_j − m[j, A(p, j)]                                 if q_j > m[j, A(p, j) + 1]
    u_j = { MAX(q_j − m[j, A(p, j)], m[j, A(p, j) + 1] − q_j)   if m[j, A(p, j)] <= q_j <= m[j, A(p, j) + 1]   (8)
          { m[j, A(p, j) + 1] − q_j                             if q_j < m[j, A(p, j)]

An example of the lower bound of p is shown in Figure 1: given the approximation of p, A(p), the lower bound L(p, q) is the distance between q and the upper right vertex of the cell containing p, 0101. The distance(p, q) is a Euclidean function to
calculate the actual distance between p and q. Analogously, we obtain the upper bound
U(p, q).
3.4 VAFile Based KNN Search
Nearest Neighbor Search defines similarity queries with a defined result set size.
The classical nearest neighbor query returns exactly one point object as the result, the
object with the lowest distance to the query point among all points in the database.
Within RF, we do not want only one closest point as the answer to a query, but rather a number of closest points, so a K Nearest Neighbor query is required. The KNN query selects k points from the database such that no point among the remaining points in the database is closer to the query point than any of the selected points.
VAR a: ARRAY OF Approximation; p: ARRAY OF Point;
VAR i, nn: INT; l, u: REAL; heap: HEAP;
FUNCTION VA-STANDARD (q: Point; VAR dist: REAL): INTEGER;

(/* PHASE ONE STANDARD */)
1:  dist := MAXREAL;
2:  FOR i := 1 TO N DO
3:    GetBounds(a[i], q, l, u);
4:    IF l <= dist THEN
5:      IF u < dist THEN dist := u;  // use u to update dist
6:      Push(heap, l, i);            // remember the candidate point
7:    END;
8:  END;
(/* PHASE TWO */)
9:  dist := MAXREAL; nn := 1;
10: Pop(heap, l, i);
11: WHILE l < dist DO
12:   IF dist(p[i], q) < dist THEN
13:     dist := dist(p[i], q); nn := i;
14:   END;
15:   Pop(heap, l, i);
16: END;
17: RETURN nn;
18: END VA-STANDARD;

Table 2: Pseudo Code of Standard VA-File KNN
Table 2 shows the standard approach of the algorithm for KNN search with the VA-File. The search algorithm has two phases:

• Phase I (approximation level step) scans the approximations of the vectors to determine lower and upper bounds on the distance between the current vector and q (see Statement (3)). Using statement (5), if the lower bound of a vector is larger than the kth upper bound seen so far, the vector cannot be among the KNN and is not inserted into the candidate set; otherwise, the vector is a candidate for the KNN.

• In phase II (data level step), the remaining set of vectors is sorted by the lower bounds in increasing order. Using statement (11), phase II ends if a lower bound is encountered that is larger than the kth actual distance seen so far. At this stopping point, the algorithm visits exactly those points whose lower bound is smaller than the kth distance of the KNN.
3.5 Adaptive KNN Search Algorithm
The most important overhead in query processing is the refinement, which requires expensive random disk accesses. The adaptive KNN search algorithm finds the correlations between the KNN in successive iterations to eliminate false candidates resulting from phase I. Denote the expected minimal set of approximations that contains all the KNN in phase I by N1_opt(Q, W_t+1). During RF iterations, the underlying similarity matrix W_t changes to W_t+1. By taking advantage of W_t+1, the adaptive approach proves two necessary conditions for an approximation to be a qualified one in N1_opt(Q, W_t+1).
Necessary Condition 1

For an approximation to be chosen as a candidate of N1_opt(Q, W_t+1), its lower bound must satisfy the inequality L_t(Q, W_t+1) < B_t+1(Q), where B_t+1(Q) is the kth largest upper bound of the candidates in phase I under the weight matrix W_t+1.

Necessary Condition 2

For an approximation to be a qualified one in N1_opt(Q, W_t+1), its lower bound must satisfy the inequality L_t(Q, W_t+1) < r_t+1^U(Q), where r_t+1^U(Q) is the maximum value of the distances between q and the KNN during iteration t under the weight matrix W_t+1 [35].

Based on these two necessary conditions, the adaptive approach sets two constraints on the candidates resulting from phase I in the following iteration. One is the maximum actual distance of the KNN under W_t+1, r_t+1^U(Q), which can be precomputed to filter false candidates. The other is the precomputed kth largest upper bound of the previous iteration's candidate set N1(Q, W_t) under W_t+1, B_t+1(Q) [35]. These constraints reduce the number of candidates resulting from phase I. The improvements were compared by measuring the number of candidates in phase I. Adaptive KNN states that the effectiveness of the two-phase process is mainly determined by phase I, because the number of candidates from phase I determines the cost of disk or page accesses in the entire query process [35]. That is, as the candidate set from phase I decreases, the I/O performance increases. The adaptive method finally claimed to demonstrate a significant overall reduction in the number of disk or page accesses.
CHAPTER IV
Problem Statement
The number of blocks visited is the main measurement of the I/O performance. The adaptive approach does not show that the I/O performance improves in terms of the number of blocks visited. The adaptive approach reduces |N1(Q, W_t)|, but the improvement of the I/O performance in terms of the number of blocks visited, |B2(Q, W_t)|, is unknown. It is necessary to evaluate and compare the I/O performance in terms of blocks for both the standard and adaptive approaches. Hence, we want to demonstrate the correlation between the improvement of I/O performance in terms of blocks visited and the reduction of candidates at the approximation level. We will demonstrate how the approximation level and the data level work during the entire image query procedure, and how useful each is in yielding an I/O performance gain.
In addition, as shown in Figure 2, the adaptive KNN search states that as the feature resolution decreases, i.e., with bigger cells in the VA-File and fewer bits b allocated to each feature partition, the established upper bound on the next iteration's candidates, r_t+1^U(Q), increases in their experiments. We believe that the established upper bound on the distance of the next iteration's KNN, r_t+1^U(Q), should be independent of the feature resolution selected to construct the VA-Files, since r_t+1^U(Q) is determined by the query object, the set of KNN objects of the current iteration, and the next iteration's weight matrix. We construct VA-Files with varying feature resolutions and show r_t+1^U(Q) as a function of feature resolution.
Figure 2: r_t^U and γ computed from approximations at different resolutions, taken from Fig. 5, page 7 of [35]
CHAPTER V
Experiments
5.1 Description
The main purpose of our experiments is to evaluate the I/O performance of the adaptive approach in terms of I/O blocks. The evaluation is conducted by comparing it with the I/O performance of the standard approach in terms of I/O blocks. We also want to investigate the impact of the precomputed constraints on the construction of VA-Files.

The performance of the two KNN approaches is measured in block I/O. The CPU cost is considered negligible when compared to the I/O costs of secondary-memory accesses.
5.2 Construction of VAFile
Our implementation in C/C++ was tested on a dual Pentium III, 800 MHz workstation under the LINUX operating system. Both the VA-File and the data file were stored on the same disk drive. First, we use the raw data file to make the data block file. The raw data is the VisTex set from the MIT Media Lab at whitechapel.media.mit.edu/pub/VisTex. The data file contains 640 image objects, each represented by a 16-dimensional feature vector and a class label. The data file used a block size of 1024 bytes and 43 blocks were generated. There are 15 records per block.
According to the feature resolution b and the data distribution, the marker file was generated, and the values partitioning the data space on each dimension were stored in the marker file [12]. The VA-File was generated from the data file, the marker file, and the feature resolution b. The marker file is a partitioning of the data space. The feature resolution is the number of bits allocated to each cell on one dimension. A VA block file was also generated; each data point in the data file has a corresponding VA representation stored in the VA block file.
5.3 Procedure of KNN
We demonstrate how the approximation level and the data level work with the entire adaptive KNN query procedure in Figure 3.
•ı
In the first iteration phase I, t = 1, given Q, W;, the standard approach is performed to search the KNN (See Table 2). The selected objects' approximations and their associated block index and record index are recorded in a data set N, (Q.
• In phase II of the first iteration, N1(Q, W_1) is processed in increasing order of the lower bounds and is maintained by a heap [7]. First, the block index is used to find the appropriate block in the data block file. Then each object in the block is sequentially scanned. Phase II ends when an object is encountered whose lower bound is larger than the kth actual distance seen so far. At this stopping point, the algorithm has visited exactly those points whose lower bound is smaller than the kth distance of the KNN. Meanwhile, the associated block indices are recorded. Finally, the KNN are recorded in R_1 and the number of blocks visited is recorded in |B2(Q, W_1)|. The pair |N1(Q, W_1)| and |B2(Q, W_1)| is output to the evaluation procedure. Note that the pair r^u_{t+1}(Q) and θ_{t+1}(Q) is not applicable in the first iteration.
• In phase I of the following iterations, 2 <= t <= T, given Q, W_t and N1(Q, W_{t-1}), the adaptive approach initially scans the approximations of the vectors to determine lower and upper bounds on the distance between each vector and Q (see Equations (3) and (4)). r^u_t(Q) is computed using R_{t-1} and W_t (see Equation 16). θ_t(Q) is computed using Q, W_t, and the approximation representation of N1(Q, W_{t-1}), A(N1(Q, W_{t-1})). A feature vector is selected as a candidate if its lower bound is less than a, r^u_t(Q) and θ_t(Q); its approximation may then belong to the KNN and is recorded in N1(Q, W_t). Otherwise, the vector is eliminated. Finally, the pair |N1(Q, W_t)| and |B2(Q, W_t)| and the pair r^u_{t+1}(Q) and θ_{t+1}(Q) are returned to the evaluation procedure.
• In phase II of the following iterations, N1(Q, W_t) is processed in increasing order of the lower bounds. First, the block index is used to find the appropriate block. Then each object in the block is sequentially scanned. Phase II ends when an object is encountered whose lower bound is larger than the kth actual distance seen so far. At this stopping point, the algorithm has visited exactly those points whose lower bound is smaller than the kth distance of the KNN. The associated block indices are recorded. Finally, the KNN are recorded in R_t and the number of blocks visited is recorded in |B2(Q, W_t)|.
The difference between the standard approach and the adaptive approach is that the standard approach does not calculate r^u_{t+1}(Q) and θ_{t+1}(Q) in phase I. Only the pair |N1(Q, W_t)| and |B2(Q, W_t)| of the standard approach is returned to the evaluation procedure for comparison with the adaptive approach; r^u_{t+1}(Q) and θ_{t+1}(Q) are not applicable in the standard approach. C/C++ code for the standard and adaptive approaches is attached in Appendix A and Appendix B respectively.
Figure 3: Flowchart of the Adaptive VA-File KNN Search.
5.4 Procedure of Evaluation
In order to obtain an overall evaluation of both approaches, each data point F_i in the data file is selected once as the query point Q in our experiment. Denote the averages of the candidates resulting from phase I in the standard approach and the adaptive approach by ave|N1stan(Q, W_t)| and ave|N1adap(Q, W_t)| respectively, and the averages of the blocks visited in phase II by ave|B2stan(Q, W_t)| and ave|B2adap(Q, W_t)|.
First, the standard approach was performed on each F_i, and the pair |N1stan(Q, W_t)| and |B2stan(Q, W_t)| was returned by the standard KNN search procedure. The returned values are accumulated in the evaluation procedure. Finally, ave|N1stan(Q, W_t)| and ave|B2stan(Q, W_t)| are computed for comparison with the adaptive approach.
Second, the adaptive approach was performed on each F_i, and the pair |N1adap(Q, W_t)| and |B2adap(Q, W_t)| was returned by the adaptive KNN search procedure. The returned values are accumulated in the evaluation procedure. Finally, ave|N1adap(Q, W_t)| and ave|B2adap(Q, W_t)| are computed for comparison with the standard approach.
To compare the I/O performance, the ratio CN1 = ave|N1stan(Q, W_t)| / ave|N1adap(Q, W_t)| is computed to indicate the candidate improvement in phase I, and the ratio CB2 = ave|B2stan(Q, W_t)| / ave|B2adap(Q, W_t)| is computed to indicate the I/O improvement.
In addition, at the end of the KNN search procedure, we examine the minimal set of blocks needed to accomplish the KNN search. Denote the expected minimal set of blocks that contain the KNN in phase II by B2opt(Q, W_t); B2(Q, W_t) is the set of blocks the KNN search procedure actually visited. At the end of each KNN search procedure, the block indices of the KNN are known and returned to the evaluation procedure. The average size of B2opt(Q, W_t) over all queries, ave|B2opt(Q, W_t)|, is calculated in the evaluation procedure.
The second experiment explores the correlation between the feature resolution b used to construct a VA-File and r^u_{t+1}(Q). VA-Files are constructed with various feature resolutions [12]. For each VA-File, each data point in the data file is selected once as a query point to perform the KNN search. For each query, we recorded the set of KNN during the tth iteration as R_t and computed the established r^u_{t+1}(Q) before the next iteration's phase I. r^u_{t+1}(Q) is calculated in each iteration and returned to the evaluation procedure. Denote the average of r^u_{t+1}(Q) over all queries by ave(r^u_{t+1}(Q)). For an overall view, ave(r^u_{t+1}(Q)) is calculated in the evaluation procedure for further comparison across VA-Files.
C/C++ code is attached in Appendix C.
5.5 Generation of Similarity Matrix
The similarity matrix is generated in a controlled fashion to simulate the re-weighting procedure within relevance feedback. For simplicity, a diagonal matrix was used in our experiment; the method can be further extended to a full matrix. Let W_{t,k} be the value associated with dimension k in iteration t along the diagonal; W_{t,k} was implemented as a one-dimensional array in our experiment. Let Δ be the re-weighting rate. The first weight of the next iteration is set to Δ times the first weight of the previous iteration,

W_{t,1} = W_{t-1,1} · Δ

The sum of the weights over all dimensions is equal to 1, so the remaining weights are normalized,

W_{t,k} = (1 - W_{t,1}) / (M - 1),  k = 2, ..., M

Thus, at each iteration t <= T, the first weight, k = 1, is initialized to W_{t,1} = Δ^(t-T) according to the given t and T; during a T-iteration query procedure the sum of the weights then stays within bounds. For example, in a 5-iteration query procedure, T = 5 and Δ = 10, at the first iteration, t = 1, the first weight is initialized to W_{1,1} = Δ^(t-T) = 0.0001. At the second iteration, t = 2, W_{2,1} is set to 10 times W_{1,1}; thus W_{2,1} = W_{1,1} · 10 = 0.001, which also results from Δ^(t-T) with t = 2, T = 5. We tested the number of candidates in phase I and the blocks visited in phase II of both approaches under various re-weighting procedures. One 16-dimensional weight matrix is generated to simulate the re-weighting similarity matrix at each iteration. Five experiments were performed with re-weighting rates Δ = 1.5, 5, 10, 50 and 100 respectively. The VA-File was constructed with 4 bits per cell. In a 5-iteration KNN re-weighting procedure, the average numbers of candidates for the five re-weighting procedures are shown in Figure 4-1. As the re-weighting rate increases, the number of candidates of both approaches does not change appreciably except when a low re-weighting rate is applied.
Figure 4-1: The number of candidates in a 5-iteration re-weighting procedure by the standard and adaptive approaches on a 4-bit VA-File (b = 4, T = 5).
The average numbers of blocks visited are depicted in Figure 4-2. As the re-weighting rate increases, the numbers of blocks visited by both approaches are identical and are not significantly affected by the re-weighting rate except when a low re-weighting rate is applied.
Figure 4-2: The number of blocks visited in a 5-iteration re-weighting procedure by the standard and adaptive approaches on a 4-bit VA-File (b = 4, T = 5).
In the further experiments we selected Δ = 10 to simulate the re-weighting procedure in RF.
5.6 Computation of Kth Distance
We computed the weighted Euclidean distance between two objects, as well as between an object and the bounds of a cell. Given a diagonal weight matrix W, the distance of two vectors p and q is defined by the weighted Euclidean distance function

Distance(p, q, W) = sqrt( (p - q) W (p - q)^T )    (9)

where (p - q)^T is the transpose of (p - q) [2]. Using Equation 9, the distance between two points requires the expensive square-root function, while the comparison of distances is the main concern in our experiment. Therefore, in order to achieve lower computational complexity, we used the squared distance when comparing two distances,

DistanceSquare(Q, F_i, W_t) = Σ_{k=1}^{M} (Q_k - F_{i,k})^2 W_{t,k}    (10)

In addition, the squares of the upper and lower bounds were used to measure the distance between the query Q and an approximation cell,

LSquare(p, Q) <= DistanceSquare(p, Q) <= USquare(p, Q)    (11)

For simplicity, the above notions are abbreviated as

L(p, Q) <= DistanceSquare(p, Q) <= U(p, Q)    (12)

Denote the kth upper bound found so far during phase I by a, and the buffer recording and maintaining a by aBuf. If p is one of the KNN of Q, then its lower bound satisfies L(p, Q) <= a (see Equation (3)). To find the KNN, both the standard approach and the adaptive approach consider a data point whose lower bound is smaller than a, the kth upper bound maintained in aBuf, as a candidate.
In addition, the adaptive approach computes two constraints, r^u_{t+1}(Q) and θ_{t+1}(Q), and considers a data point as a candidate only if its lower bound is also smaller than both precomputed constraints.
5.7 Computation of θ_{t+1}(Q)
Adaptive KNN search exploits the correlation between two consecutive iterations of RF to reduce the number of candidates resulting from phase I. This two-phase KNN query filters out false candidates at the approximation level (phase I), so that fewer expensive exact evaluations have to be performed in the subsequent data-level step (phase II). Two precomputed constraints are used to maintain a smaller number of candidates across consecutive iterations in phase I without repeatedly searching the entire data file in phase II.
One constraint precomputes the kth largest upper bound of the current candidates under W_{t+1}, θ_{t+1}(Q), to generate an upper bound for the next iteration's candidates. In the adaptive approach's phase I, the approximations of the tth iteration's candidates, A(N1(Q, W_t)), are stored in a data set. In the next iteration, t+1, A(N1(Q, W_t)) is sequentially scanned to compute θ_{t+1}(Q) under W_{t+1}. Using Equation (8), for each approximation A_i(N1(Q, W_t)) in A(N1(Q, W_t)), where i = 0, ..., |A(N1(Q, W_t))| - 1, the upper bound between Q and A_i(N1(Q, W_t)) is calculated as U(A_i(N1(Q, W_t)), Q, W_{t+1}). The kth largest U(A_i(N1(Q, W_t)), Q, W_{t+1}) is returned as θ_{t+1}(Q).
Applying θ_{t+1}(Q) in the adaptive approach filters out few candidates in phase I but incurs space and computation overhead in the entire query procedure. Maintaining A(N1(Q, W_t)) in memory causes some space overhead, and calculating the upper bounds between Q and each element of A(N1(Q, W_t)) in every iteration causes some extra computation overhead. We discuss the usage of θ_{t+1}(Q) along with the other constraint, r^u_{t+1}(Q), in the following subsection.
5.8 Computation of r^u_{t+1}(Q)
Another constraint establishes an upper bound on the next iteration's candidates by using the current iteration's KNN and the next iteration's weight matrix W_{t+1}.
In the adaptive approach's phase I, the KNN recorded in the previous phase II, R_t, are used to compute r^u_{t+1}(Q) under the current W_{t+1} (Equation (13)). The distances between Q and each point in R_t under W_{t+1} are computed, and the largest value is returned as r^u_{t+1}(Q),

r^u_{t+1}(Q) = max { Distance(Q, R_{t,i}, W_{t+1}), i = 1, ..., K }    (13)

In practice, the average of θ_{t+1}(Q) is much larger than the average of r^u_{t+1}(Q) due to the larger bounds of the cells. From a mathematical point of view, θ_{t+1}(Q) constrains only a small number of candidates, so using only r^u_{t+1}(Q) is more practical than using both constraints. In our experiment, 640 queries in a 2-iteration ten-nearest-neighbor search resulted in 65.01 candidates when θ_{t+1}(Q) was calculated versus 65.53 candidates without it. Clearly, r^u_{t+1}(Q) alone filters out most false candidates in phase I and avoids the computation and space overhead caused by θ_{t+1}(Q). On the other hand, neither r^u_{t+1}(Q) nor θ_{t+1}(Q) results in a smaller set of blocks visited in phase II, so neither reduces the I/O cost. In further experiments, we only use r^u_{t+1}(Q) as the constraint to eliminate false candidates.
5.9 Comparisons
We compared the performance of the standard and the adaptive KNN search on five VA-Files with various resolutions, b = 2, 3, 4, 5, 6; thus five standard VA-Files and five adaptive VA-Files were constructed. For overall results, each file performed 2-iteration, 3-iteration, 4-iteration and 5-iteration searches respectively. We selected K = 10 and Δ = 10 for the experiments. We did not evaluate the two approaches in a 1-iteration search procedure since in that case both the adaptive and the standard method perform the standard approach.
5.9.1 Comparison of ave|N1stan(Q, W_t)| and ave|N1adap(Q, W_t)|
           b = 2            b = 3            b = 4            b = 5            b = 6
T      stan    adap     stan    adap     stan    adap     stan    adap     stan    adap
2    202.25  168.19   142.38   97.72   114.06   67.35    99.97   56.07    88.70   52.98
3    266.61  117.29   170.22   66.08   125.44   44.70   102.95   36.29    89.13   33.39
4    289.13   99.81   180.88   54.97   129.88   36.68   104.37   29.27    89.72   26.48
5    300.32   91.11   186.29   49.42   132.14   32.66   105.12   25.76    90.04   23.02

Table 4: Overall Results of the Number of Candidates Resulting from Phase I
We first consider the number of candidates resulting from phase I. For instance, from Table 4, in a 2-iteration ten-nearest-neighbor search on the 4-bit VA-File, the adaptive approach reduced the average number of candidates from ave|N1stan(Q, W_t)| = 114.06 to ave|N1adap(Q, W_t)| = 67.35 (CN1 = 1.69). In a 3-iteration KNN search, T = 3, this set was reduced from 125.44 to 44.70 (CN1 = 2.81); for T = 4, from 129.88 to 36.68 (CN1 = 3.54); for T = 5, from 132.14 to 32.66 (CN1 = 4.05). Note that CN1 increases with the number of iterations in the KNN query procedure.
The results of both approaches in 2-iteration, 3-iteration, 4-iteration and 5-iteration search procedures on the 4-bit VA-File are depicted in Figure 5-1.
Figure 5-1: Comparison of candidates in phase I by the adaptive and standard search in 2-iteration, 3-iteration, 4-iteration and 5-iteration search procedures (b = 4).
Consider another VA-File with b = 2. In a 2-iteration ten-nearest-neighbor search on the 2-bit VA-File, the adaptive approach reduced the average candidate set from ave|N1stan(Q, W_t)| = 202.25 to ave|N1adap(Q, W_t)| = 168.19 (CN1 = 1.20); for T = 3, CN1 = 2.27; for T = 4, CN1 = 2.90; for T = 5, CN1 = 3.30.
Table 3 shows that the five VA-Files with b = 2, 3, 4, 5, 6, in varying-iteration KNN search procedures, T = 2, 3, 4, 5, all have a candidate improvement factor CN1 > 1.
The adaptive approach results in fewer candidates than the standard approach does, providing a modest improvement in the number of candidates resulting from phase I.
              CN1                               CB2
T      b=2    b=3    b=4    b=5    b=6     b=2   b=3   b=4   b=5   b=6
2     1.20   1.46   1.69   1.78   1.67      1     1     1     1     1
3     2.27   2.58   2.81   2.84   2.67      1     1     1     1     1
4     2.90   3.29   3.54   3.57   3.39      1     1     1     1     1
5     3.30   3.77   4.05   4.08   3.91      1     1     1     1     1

Table 3: The Calculated Values of CN1 and CB2 on 2-bit through 6-bit VA-Files
5.9.2 Comparison of ave|B2stan(Q, W_t)| and ave|B2adap(Q, W_t)|
Now we consider the number of blocks visited. If CB2 is also greater than 1, the adaptive approach visits fewer blocks than the standard approach does in phase II; the larger CB2 is, the better the I/O performance improvement of the adaptive method. However, the increased CN1 does not result in a correspondingly increased CB2.
Figure 5-2: Comparison of blocks visited in phase II by the optimal, adaptive and standard search in 2-iteration, 3-iteration, 4-iteration and 5-iteration search procedures (b = 4).
Figure 5-2 shows that the curve of ave|B2stan(Q, W_t)| and the curve of ave|B2adap(Q, W_t)| are identical on the 4-bit VA-File in the 2-iteration, 3-iteration and 4-iteration KNN searches. Neither reaches ave|B2opt(Q, W_t)|.
Table 3 shows the overall results: the five VA-Files with b = 2, 3, 4, 5, 6, in varying-iteration KNN search procedures, T = 2, 3, 4, 5, have a block improvement factor CB2 = 1. Although the candidate improvement factor is greater than 1, CN1 > 1, the block improvement factor remains CB2 = 1. The number of blocks visited in the adaptive approach's phase II is the same as in the standard approach no matter how much CN1 improves.
Table 5 shows the overall experimental result that ave|B2stan(Q, W_t)| and ave|B2adap(Q, W_t)| are the same over a wide range of resolutions.
          b = 2          b = 3          b = 4          b = 5          b = 6
T     stan   adap    stan   adap    stan   adap    stan   adap    stan   adap
2    24.07  24.07   18.38  18.38   13.98  13.98   10.86  10.86    9.34   9.34
3    17.44  17.44   12.34  12.34    9.22   9.22    7.27   7.27    6.31   6.31
4    15.33  15.33   10.37  10.37    7.65   7.65    6.08   6.08    5.29   5.29
5    14.28  14.28    9.39   9.39    6.87   6.87    5.46   5.46    4.78   4.78

Table 5: Overall Results of the Number of Blocks Visited by the Standard and Adaptive Approaches
       b = 2   b = 3   b = 4   b = 5   b = 6
T
2       6.69    6.69    6.68    6.59    6.58
3       4.79    4.79    4.78    4.73    4.73
4       4.15    4.16    4.15    4.12    4.12
5       3.84    3.84    3.84    3.81    3.81

Table 6: Overall Results of the Number of Blocks Visited by the Optimal Approach
The adaptive approach yields fewer candidates but visits the same number of blocks as the standard approach does. Neither approach reaches the expected minimal number of blocks (see Table 6). The adaptive approach reduces |N1(Q, W_t)| but does not improve the I/O performance in terms of the number of blocks visited. The experimental results match our prediction over a wide range of resolutions. As we pointed out in Section 4, the improvement of I/O performance by the adaptive approach is not as significant as the improvement achieved at the approximation level.
5.9.3 Calculated Results of r^u_{t+1}(Q)
In the second experiment, r^u_{t+1}(Q) is measured as a function of the feature resolution. As the first step, four VA-Files were constructed for b = 3, 4, 5, 6 respectively, and the adaptive search performed ten-nearest-neighbor searches on each of the four VA-Files. For a consistent RF, we selected Δ = 10 for all re-weighting procedures.
Figure 6: Impact of r^u_t(Q) on 3-bit, 4-bit, 5-bit and 6-bit VA-File construction, tested in 2-iteration, 3-iteration, 4-iteration and 5-iteration adaptive KNN search.
To obtain an overall result, each VA-File performed 2-iteration, 3-iteration, 4-iteration and 5-iteration query procedures. When T = 1 the experiment performed the standard approach and r^u_t(Q) was not applicable in the adaptive approach.
In the evaluation procedure, each of the 640 data points was selected once as a query point Q. For each query, r^u_t(Q) was returned by the KNN procedure (Figure 3), recorded, and accumulated in the evaluation procedure. The average ave(r^u_t(Q)) was calculated at the end of the evaluation procedure.
As shown in Figure 6, the values of ave(r^u_t(Q)) are exactly the same on the four VA-Files over a wide range of feature resolutions. No matter how b changes, ave(r^u_t(Q)) is constant in a fixed re-weighting query procedure. That is, the established upper bound on the distance of the next iteration's KNN, r^u_{t+1}(Q), is independent of the feature resolution selected to construct the VA-File. The reason is that r^u_{t+1}(Q) is determined by the query object, the set of KNN objects of the current iteration, and the weight matrix updated in the next iteration. For a given query point, the set of KNN objects is the same regardless of the construction of the VA-File and is determined only by the re-weighting procedure, and the actual distances between the query and the KNN are the same when W_t changes in the same sequence. For a given W_t, the KNN are always located closest to Q no matter how the cell size changes in the data space. Compared with the plot of r^u_{t+1}(Q) in Figure 2 [35], our experimental result confirms that the plot of r^u_{t+1}(Q) in the adaptive approach of KNN is in error.
CHAPTER VI
Conclusion
The number of blocks visited is the main measurement of the I/O cost. Although the number of candidates has been reduced at the approximation level of the adaptive KNN search, the number of blocks visited in the data file is not significantly reduced and the final I/O performance is not effectively improved. Even with an increasing number of iterations in relevance feedback, no difference is found between the two approaches in terms of blocks visited at the data level. The I/O performance stays the same, and the intended improvement is not achieved by the adaptive KNN search. For low, middle and high resolutions alike, our prediction matches the experimental results. In addition, the experimental results confirmed our prediction that the established upper bound on the candidates resulting from the approximation level is independent of the feature resolution selected to construct the vector approximation file.
Bibliography
[1] S. Al-Hawamdeh, B. C. Ooi, R. Price, T. H. Tng, and Y. H. Ang, "Nearest neighbour searching in a picture archival system", in Proc. Int. Conf. on Multimedia and Information Systems, November 1991, Jurong, Singapore, Pages 17-34.
[2] E. Angel, Computer Graphics. Boston, MA: Addison-Wesley Publishing Company, Inc., 1990.
[3] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: An efficient and robust access method for points and rectangles", in Proc. of the 1990 ACM SIGMOD International Conference on Management of Data, May 1990, Atlantic City, New Jersey, Pages 322-331.
[4] R. E. Bellman, Adaptive Control Processes: A Guided Tour. Princeton, NJ: Princeton University Press, 1961.
[5] S. Berchtold and H.-P. Kriegel, "S3: Similarity search in CAD database systems", in Proc. of ACM SIGMOD Int. Conf. on Management of Data, August 1997, Tucson, Arizona, Pages 564-567.
[6] P. Ciaccia, M. Patella, and P. Zezula, "M-tree: An efficient access method for similarity search in metric spaces", in Proc. of the International Conference on Very Large Data Bases (VLDB), 1997, Athens, Greece.
[7] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. McGraw-Hill Book Company, 1992.
[8] C. Faloutsos, R. Barber, M. Flickner and J. Hafner, "Efficient and Effective Querying by Image Content", Journal of Intelligent Information Systems, 1994, Vol. 3, Pages 231-262.
[9] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, "Fast subsequence matching in time-series databases", in Proc. of ACM SIGMOD Int. Conf. on Management of Data, August 1994, Minneapolis, Minnesota, Pages 419-429.
[10] H. Ferhatosmanoglu, E. Tuncel, D. Agrawal, and A. E. Abbadi, "Vector approximation based indexing for non-uniform high dimensional data sets", in Proc. of the 9th ACM Int. Conf. on Information and Knowledge Management (CIKM), November 2000, Washington, DC, USA, Pages 202-209.
[11] A. Guttman, "R-trees: A dynamic index structure for spatial searching", in Proc. of ACM SIGMOD Int. Conf. on Management of Data, August 1984, Boston, MA, Pages 47-57.
[12] D. R. Heisterkamp, J. Peng, and H. K. Dai, "Adaptive Quasiconformal Kernel Metric for Image Retrieval", in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, December 2001, Kauai, Hawaii, Vol. 2, Pages 388-393.
[13] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality", in Proc. of the 30th Annual ACM Symposium on Theory of Computing (STOC), 1998, Dallas, Texas, Pages 604-613.
[14] P. M. Kelly, T. M. Cannon, and D. R. Hush, "Query by image example: the CANDID approach", in Proc. SPIE Storage and Retrieval for Image and Video Databases III, Vol. 2420, 1995, Pages 238-248.
[15] N. Katayama and S. Satoh, "The SR-tree: An index structure for high-dimensional nearest neighbor queries", in Proc. ACM SIGMOD International Conference on Management of Data, 1997, Tucson, Arizona, Pages 369-380.
[16] K. Kukich, "Techniques for Automatically Correcting Words in Text", ACM Computing Surveys, 1992, Vol. 24, No. 4, Pages 377-440.
[17] K.-I. Lin, H. V. Jagadish and C. Faloutsos, "The TV-tree: An index structure for high-dimensional data", The VLDB Journal: The International Journal on Very Large Data Bases, Vol. 3(4), October 1994, Pages 517-549.
[18] R. Mehrotra and J. Gary, "Feature-Based Retrieval of Similar Shapes", in Proc. 9th Int. Conf. on Data Engineering, April 1993, Vienna, Austria, Pages 108-115.
[19] R. Mehrotra and J. Gary, "Feature-Index-Based Similar Shape Retrieval", in Proc. of the 3rd Working Conf. on Visual Database Systems, March 1995, Lausanne, Switzerland, Pages 46-65.
[20] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin, "The QBIC project: Querying images by content using color, texture and shape", in Proc. SPIE Conference on Storage and Retrieval of Image and Video Databases, 1993, San Jose, California, Pages 173-187.
[21] J. Nievergelt, H. Hinterberger, and K. C. Sevcik, "The grid file: An adaptable, symmetric multikey file structure", ACM Transactions on Database Systems, March 1984, Pages 38-71.
[22] M. Ortega, Y. Rui, K. Chakrabarti, S. Mehrotra, and T. S. Huang, "Supporting similarity queries in MARS", in Proc. of the 5th ACM International Multimedia Conference, Seattle, Washington, November 1997, Pages 403-413.
[23] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra, "Relevance feedback: A power tool in interactive content-based image retrieval", IEEE Transactions on Circuits and Systems for Video Technology, September 1998, Vol. 8, No. 5, Pages 644-655.
[24] Y. Rui, T. S. Huang and S. Mehrotra, "Content-based image retrieval with relevance feedback in MARS", in Proc. of IEEE Int. Conf. on Image Processing, October 1997, Washington, DC, Pages 815-818.
[25] G. Salton and M. McGill, Introduction to Modern Information Retrieval. New York, NY: McGraw-Hill Book Company, 1983, Pages xi-xv.
[26] S. Sclaroff, L. Taycher, and M. La Cascia, "ImageRover: A content-based image browser for the world wide web", in Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, June 1997.
[27] T. Seidl and H.-P. Kriegel, "Efficient User-Adaptable Similarity Search in Large Multimedia Databases", in Proc. 23rd Int. Conf. on Very Large Data Bases, August 1997, Athens, Greece, Pages 506-515.
[28] B. K. Shoichet, D. L. Bodian and I. D. Kuntz, "Molecular Docking Using Shape Descriptors", Journal of Computational Chemistry, Vol. 13, No. 3, 1992, Pages 380-397.
[29] M. Stricker and M. Orengo, "Similarity of color images", in Proc. SPIE Conference on Storage and Retrieval of Image and Video Databases, 1995, San Jose, California.
[30] S. Volmer, "Tracing images in large databases by comparison of wavelet fingerprints", in Proc. 2nd International Conference on Visual Information Systems, December 1997, San Diego, California, Pages 163-172.
[31] R. Weber, "Similarity Search in High-Dimensional Data Spaces", in Workshop Grundlagen von Datenbanken, June 1998, Konstanz, Germany, Pages 138-142.
[32] R. Weber and S. Blott, "An Approximation-Based Data Structure for Similarity Search", Technical Report 24, ESPRIT project HERMES (no. 9141), October 1997.
[33] R. Weber, H.-J. Schek and S. Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces", in Proc. of the Int. Conf. on Very Large Data Bases, August 1998, New York City, New York, Pages 194-205.
[34] P. Wu, "Search and Indexing of Large Image/Video Databases", Ph.D. thesis, Dept. of Electrical and Computer Engineering, University of California, Santa Barbara, 2001.
[35] P. Wu and B. S. Manjunath, "Adaptive nearest neighbor search for relevance feedback in large image databases", in Proc. ACM Int. Multimedia Conf., October 2001, Ottawa, Canada, Pages 89-97.
[36] H. Sawhney and J. Hafner, "Efficient Color Histogram Indexing", in Proc. Int. Conf. on Image Processing, 1994, Vol. 2, Austin, Texas, Pages 66-70.
[37] P. N. Yianilos, "Locally lifting the curse of dimensionality for nearest neighbor search", in Proc. ACM-SIAM Symposium on Discrete Algorithms, 2000, San Francisco, California, Pages 361-370.
[38] X. S. Zhou and T. S. Huang, "CBIR: from low-level features to high-level semantics", in Proc. SPIE Image and Video Communication and Processing, January 2000, San Jose, California.
APPENDIX A: Code for Adaptive KNN Search
I*~+** +***~***** ~** * * ***
* This program is to implement an adaptive VA 0 je t.
**** *+** ** ****+***~* *~* *~* *~* +* *~+ ** ~* * /
#ifndef VAFILE_HPP__ #include<VAFile.hpp> #endif
#include<futils.hpp> #include<values.h> #include<queue> #include<cstdio> #include<cstring> #include<dhtypes.h> #include<list> #include<set> #include<stack>
namespace DH FileStructures
{
static const double epsilon lelO;
void vAFile: :Create(const char *vafilename)
{
vfilename = vafilename;
vfile.open(vfilename.c_str());
if (!vfile)
throw FileStructureError (TEXT ("vfile open failed") , LOCATION) ;
Size vablocksize;
Size dimension;
Size bits;
vabHeaderSize = vabReadHeader(vfile,vablocksize,dimension,bits,
dfilename,mfilename,title);
if (vabHeaderSize < 0)
throw FileStructureError (TEXT (" failed to r ad vfile header") , LOCATION) ; dataDimension = dimension; mfile.open(mfilename.c str(); if (1 mfile) throw
FileStructureError(TEXT("failed to open mfile") , LOCATION) ; Size mbits,mdim; string mdname, mtitle' if (mrkReadHeader(mfile,mbits,mdim,mdname,mtitle) < 0 ) throw FileStructureError(TEXT("failed to read mfile header") , LOCATION) ;
if (mdname != dfilename) throw FileStructureError (TEXT ("Data names differ in
files") ,LOCATION) ; if (mbits 1= bits) throw FileStructureError(TEXT("Different bit sizes in
files") ,LOCATION);
41ı
if (dataDimension != mdim) throw FileStructureError (TEXT ("Different dimensions in
files") , LOCATION) ;
if (readMarkers(mfile, mdim, markers) <0) throw FileStructureError (TEXT ("Read makers failed"), LOCATION} ;
mfile.close(} ;ı vblock.create«int}vablocksize, (int)dataDimension,bits,ı
&markers) ;
    dfile.open(dfilename.c_str());
    if (!dfile)
        throw FileStructureError(TEXT("Opening datafile failed"), LOCATION);

    blocktype = 0;
    if (*(dfilename.rbegin()) == 'k')
        blocktype = 0;
    else if (*(dfilename.rbegin()) == '1')
        blocktype = 1;
    else if (*(dfilename.rbegin()) == '2')
        blocktype = 2;
    else
    {
        cout << "Unknown block type" << endl;
        exit(0);
    }

    Size dsize;
    switch (blocktype)
    {
    case 0: blkHeaderSize = blkReadHeader(dfile, dsize, dimension); break;
    case 1: blkHeaderSize = bn1ReadHeader(dfile, dsize, dimension, vmin, vmax); break;
    case 2: blkHeaderSize = bn2ReadHeader(dfile, dsize, dimension, vmean, vstddev); break;
    default: cerr << "Unknown normalization type" << endl; exit(13);
    }
    if (dimension != dataDimension)
        throw FileStructureError(TEXT("Data dimension differ in files"), LOCATION);

    dblock.Create(dsize, dataDimension);
    dfile.seekg(0, ios::end);
    long filesize = dfile.tellg();
    long numblocks = (filesize - blkHeaderSize) / dblock.BlockSize();
    maxDataBlockIndex = numblocks;
    vfile.seekg(0, ios::end);
    filesize = vfile.tellg();
    maxVABlockIndex = (filesize - vabHeaderSize) / vblock.BlockSize();
    long maxNumSamples = numblocks * dblock.maxNumRecords;
    maxDataRecords = maxNumSamples;
    mapVAIndexToBlockIndex.resize(maxNumSamples);
    int c = 0;
    for (int b = 0; b < numblocks; ++b)
    {
        for (Index r = 0; r < dblock.maxNumRecords; ++r)
        {
            mapVAIndexToBlockIndex[c] = make_pair(b, r);
            c++;
        }
    }
    validstate = true;
}
// If number = -1, then read the next VA block into memory
long VAFile::ReadVABlock(long blocknumber = -1)
{
    REQUIRE(IsValid());
    if (blocknumber >= 0)
        vfile.seekg(vabHeaderSize + blocknumber * vblock.BlockSize(), ios::beg);
    return vblock.Read(vfile);
}

// Read a data block into memory. If number = -1, read the next block
long VAFile::ReadBlock(long blocknumber = -1)
{
    REQUIRE(IsValid());
    if (blocknumber >= 0)
        dfile.seekg(blkHeaderSize + blocknumber * dblock.BlockSize(), ios::beg);
    return dblock.Read(dfile);
}
// normalize a data vector
void VAFile::NormalizeDataVector(vector<double> &sample)
{
    switch (blocktype)
    {
    case 0: // no normalization
        break;
    case 1: // unit cube
        REQUIRE(sample.size() == vmin.size());
        for (Index i = 0; i < sample.size(); ++i)
        {
            sample[i] = (sample[i] - vmin[i]) / (vmax[i] - vmin[i]);
        }
        break;
    case 2: // zero mean
        REQUIRE(sample.size() == vmean.size());
        for (Index i = 0; i < sample.size(); ++i)
        {
            sample[i] = (sample[i] - vmean[i]) / vstddev[i];
        }
        break;
    default: // should not reach here
        throw FileStructureError(TEXT("Should not reach this location! unknown block type."), LOCATION);
        break;
    }
}
// make the approximation representation of a vector
void VAFile::ApproximateVector(vector<double> &v, vector<byte> &va)
{
    va.resize(v.size());
    for (unsigned int dim = 0; dim < v.size(); ++dim)
        va[dim] = vblock.getApproximation(v[dim], dim);
}
// precalculate the distances between a vector and a VA object
void VAFile::PreCalcMarkerDistances(vector<double> &v, vector<byte> &va,
    vector<vector<double> > &d, DistanceObjects &distobj)
{
    // make sure d is large enough to hold the precalculated distances
    if (!distobj.CanPreCalc())
        return;
    d.resize(v.size());
    for (unsigned int j = 0; j < v.size(); ++j)
        d[j].resize(markers[j].size());

    // precalculate distances to markers
    for (unsigned int dim = 0; dim < v.size(); ++dim)
    {
        for (unsigned int m = 0; m < markers[dim].size(); ++m)
        {
            d[dim][m] = distobj.PreCalc(v[dim], markers[dim][m], dim);
        }
    }
}
// calculate the lower bound and upper bound to a query vector
void VAFile::GetBounds(ApproxVectorRecordView &a, vector<double> &q,
    vector<byte> &qbin, double &lower, double &upper,
    vector<vector<double> > &d, DistanceObjects &distobj)
{
    REQUIRE(q.size() == d.size());
    REQUIRE(int(q.size()) == a.numFields);
    REQUIRE(q.size() == qbin.size());
    static vector<double> tmp;
    tmp.resize(q.size());
    if (distobj.CanPreCalc())
    {
        // use the precalculated distances in d
        for (unsigned int dim = 0; dim < q.size(); ++dim)
        {
            tmp[dim] = (a(dim) < qbin[dim]) ? d[dim][a(dim)+1]
                     : ((a(dim) > qbin[dim]) ? d[dim][a(dim)] : 0);
        }
        lower = distobj.UsePreCalcs(tmp);
        for (unsigned int dim = 0; dim < q.size(); ++dim)
        {
            tmp[dim] = (a(dim) < qbin[dim]) ? d[dim][a(dim)]
                     : ((a(dim) > qbin[dim]) ? d[dim][a(dim)+1] : 0);
        }
        upper = distobj.UsePreCalcs(tmp);
    }
    else
    {
        for (unsigned int dim = 0; dim < q.size(); ++dim)
        {
            tmp[dim] = (a(dim) < qbin[dim]) ? markers[dim][a(dim)+1]
                     : ((a(dim) > qbin[dim]) ? markers[dim][a(dim)] : 0);
        }
        lower = distobj(q, tmp);
        for (unsigned int dim = 0; dim < q.size(); ++dim)
        {
            tmp[dim] = (a(dim) < qbin[dim]) ? markers[dim][a(dim)]
                     : ((a(dim) > qbin[dim]) ? markers[dim][a(dim)+1] : 0);
        }
        upper = distobj(q, tmp);
    }
}
// use the previous KNN results and the current weight to calculate the first
// constraint: the upper bound on candidates
double VAFile::Getrut(vector<double> &q,
    vector<pair<pair<double,vector<double> >,pair<int,int> > > &results,
    WeightedDistanceSQ &tmpwd)
{
    double max = -1;
    double tmpdis = -1;
    for (unsigned int i = 0; i < results.size(); i++)
    {
        tmpdis = tmpwd(q, results[i].first.second);
        if (tmpdis > max)
            max = tmpdis;
    }
    return max;
}
// copy the previous candidates' approximations, calculate their upper bounds
// under the current weight, and return the kth largest value
double VAFile::Gettheta(vector<double> &q, vector<byte> &qbin,
    stack<ApproxVectorRecordView> &myApproxStack,
    vector<vector<double> > &work,
    WeightedDistanceSQ &tmpwd, unsigned int K)
{
    double max = DBL_MAX;
    vector<double> topKupper(K+1, max);
    while (!(myApproxStack.empty()))
    {
        double upperr, lowerr;
        GetBounds(myApproxStack.top(), q, qbin, lowerr, upperr, work, tmpwd);
        if (upperr < topKupper[K-1])
        {
            topKupper[K] = upperr;
            sort(topKupper.begin(), topKupper.end());
        }
        myApproxStack.pop();
    } // after popping, no more elements
    return topKupper[K-1];
}
// compute N1, B2, and rut for one query in one iteration
pair<pair<int,int>,pair<double,double> > VAFile::GetKNN(
    stack<ApproxVectorRecordView> &myApprox, unsigned int t, unsigned int K,
    vector<double> &q,
    vector<pair<pair<double,vector<double> >,pair<int,int> > > &results,
    WeightedDistanceSQ &distobj, set<int,less<int> > &ignoreSet,
    double initUpperBound = 1)
{
    REQUIRE(IsValid());
    if (q.size() != dataDimension)
    {
        char buf[1024];
        sprintf(buf, "q.size = %d != %d = dataDimension", q.size(), dataDimension);
        throw FileStructureError(buf, LOCATION);
    }
    REQUIRE(q.size() == dataDimension);

    // Initialization
    static vector<bool> visitedblock;
    static vector<byte> qbin;
    list<pair<unsigned int,unsigned int> > BRignoreList;
    ApproximateVector(q, qbin);
    PreCalcMarkerDistances(q, qbin, work, distobj);
    pair<double,pair<int,int> > tmppair;
    tmppair.first = DBL_MAX;
    tmppair.second = make_pair(-1, -1);
    vector<pair<double,pair<int,int> > > topKUpperBound(K+1, tmppair);

    // init heap
    priority_queue<pair<double,pair<int,int> > > myheap;
    vfile.seekg(vabHeaderSize, ios::beg);
    unsigned int blockindex = 0;
    int currentPosition = 0;
    int candidate_count = 0;
    int records_count = 0;
    double rut = 0;
    double theta = 0;
    cout << "At iteration: " << t << endl;

    // adaptive phase 1
    if (t != 1)
    {
        rut = Getrut(q, results, distobj);
        theta = Gettheta(q, qbin, myApprox, work, distobj, K);
        // read the VA blocks in the VA block file one by one
        while (ReadVABlock(-1) >= 0)
        {
            for (unsigned int r = 0; r < vblock.numValidRecords; ++r)
            {
                double lower, upper;
                if ((ignoreSet.size() > 0) &&
                    (ignoreSet.find(currentPosition) != ignoreSet.end()))
                {
                    // don't place in the candidate list if in the ignore list
                    BRignoreList.push_back(pair<unsigned int, unsigned int>(blockindex, r));
                    currentPosition++;
                    continue;
                }
                GetBounds(vblock(r), q, qbin, lower, upper, work, distobj);
                // if a vector has a lower bound smaller than rut, theta, and the
                // kth upper bound found so far, it is selected as a candidate
                if (lower <= min(min((topKUpperBound[K-1].first + epsilon), rut), theta))
                {
                    topKUpperBound[K].first = upper;
                    topKUpperBound[K].second.first = blockindex;
                    topKUpperBound[K].second.second = r;
                    sort(topKUpperBound.begin(), topKUpperBound.end());
                    tmppair.first = -1 * lower;
                    tmppair.second.first = blockindex;
                    tmppair.second.second = mapVAIndexToBlockIndex[records_count].first;
                    // record the lower bound, block index, and record index of N1
                    myheap.push(tmppair);
                    // record the approximation of N1 for the next iteration's theta
                    myApprox.push(vblock(r));
                    candidate_count++;
                }
                records_count++;
                currentPosition++;
            }
            blockindex++;
        }
        cout << "In phase 1: number of candidates: " << myheap.size() << endl;
    } // end of adaptive phase 1
    // standard phase 1 is performed at the first iteration
    else
    {
        while (ReadVABlock(-1) >= 0)
        {
            for (unsigned int r = 0; r < vblock.numValidRecords; ++r)
            {
                double lower, upper;
                if ((ignoreSet.size() > 0) &&
                    (ignoreSet.find(currentPosition) != ignoreSet.end()))
                {
                    // don't place in the candidate list if in the ignore list
                    BRignoreList.push_back(pair<unsigned int, unsigned int>(blockindex, r));
                    currentPosition++;
                    continue;
                }
                GetBounds(vblock(r), q, qbin, lower, upper, work, distobj);
                // if a vector has a lower bound smaller than the kth upper
                // bound found so far, it is selected as a candidate
                if (lower <= topKUpperBound[K-1].first + epsilon)
                {
                    topKUpperBound[K].first = upper;
                    topKUpperBound[K].second.first = blockindex;
                    topKUpperBound[K].second.second = r;
                    sort(topKUpperBound.begin(), topKUpperBound.end());
                    tmppair.first = -1 * lower;
                    tmppair.second.first = blockindex;
                    tmppair.second.second = mapVAIndexToBlockIndex[records_count].first;
                    myheap.push(tmppair);
                    // remember the approximation
                    myApprox.push(vblock(r));
                    candidate_count++;
                }
                records_count++;
                currentPosition++;
            }
            blockindex++;
        }
        cout << "In phase 1: number of candidates: " << myheap.size() << endl;
    } // end of standard phase 1
    // Phase 2 is common for the adaptive and standard approaches
    pair<pair<double,vector<double> >,pair<int,int> > tmppair2;
    tmppair2.first = make_pair(DBL_MAX, vector<double>(dataDimension));
    tmppair2.second = make_pair(-1, -1);
    vector<pair<pair<double,vector<double> >, pair<int,int> > >
        topKvectors(2*K + dblock.maxNumRecords, tmppair2);
    int numdatablocksread = 0;
    visitedblock.resize(maxDataBlockIndex);
    fill(visitedblock.begin(), visitedblock.end(), false);

    // pop the candidates from myheap one by one
    while (!(myheap.empty()) &&
           (-1 * (myheap.top().first) < topKvectors[K-1].first.first + epsilon))
    {
        if (!visitedblock[myheap.top().second.second])
        {
            visitedblock[myheap.top().second.second] = true;
            numdatablocksread++;
            if (ReadBlock(myheap.top().second.second) < 0)
                throw FileStructureError(TEXT("Block read failed"), LOCATION);

            // once a block is read, measure the distance to all records in the block
            unsigned int p2blockindex = (unsigned int)(myheap.top().second.first);
            for (unsigned int r = 0; r < dblock.numValidRecords; ++r)
            {
                bool skiprec = false;
                for (list<pair<unsigned int,unsigned int> >::iterator p = BRignoreList.begin();
                     p != BRignoreList.end(); ++p)
                    if ((p->first == p2blockindex) && (p->second == r))
                    {
                        skiprec = true;
                        break;
                    }
                if (skiprec)
                    continue;
                topKvectors[2*K+r].second.first = myheap.top().second.second;
                topKvectors[2*K+r].second.second = r;
                for (unsigned int dim = 0; dim < dataDimension; ++dim)
                {
                    topKvectors[2*K+r].first.second[dim] = dblock(r)(dim);
                }
                topKvectors[2*K+r].first.first =
                    distobj(q, topKvectors[2*K+r].first.second);
            }
            /* topKvectors has been initialized with values, so even if only a
               few values are placed into it we can still sort */
            partial_sort(topKvectors.begin(), topKvectors.begin()+2*K, topKvectors.end());
        }
        myheap.pop();
    }
    cout << "In phase 2: read " << numdatablocksread << " blocks" << endl;
    results.resize(K);
    // copy the pair of distance and vector, and the pair of
    // block index and data index, from topKvectors into results
    for (unsigned int i = 0; i < K; ++i)
    {
        results[i].first.first = topKvectors[i].first.first;
        results[i].first.second = topKvectors[i].first.second;
        results[i].second = topKvectors[i].second;
    }
    return make_pair(make_pair(candidate_count, numdatablocksread),
                     make_pair(rut, theta));
}
} // namespace DH_FileStructures
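The filter step above hinges on GetBounds: from a vector's quantized approximation alone, a lower and an upper bound on its distance to the query are derived, and only vectors whose lower bound beats the kth smallest upper bound survive to the data-level step. The sketch below illustrates that idea in isolation; it is not the thesis code. It assumes uniform cells on [0, 1) and plain squared Euclidean distance instead of the marker arrays and weighted distance object used above, and the names (`Bounds`, `cellBounds`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// A minimal sketch of VA-file cell bounds, assuming uniform cells on [0, 1):
// each dimension is split into `cells` intervals, and a vector is approximated
// by the index of the interval it falls into.
struct Bounds { double lower, upper; };

// Squared Euclidean lower/upper bounds between query q and the cell given by
// approximation a. When the cell lies below the query's cell, the closest
// point is the cell's upper edge; when above, its lower edge; the query's own
// cell contributes a lower bound of 0.
Bounds cellBounds(const std::vector<double>& q,
                  const std::vector<int>& a, int cells)
{
    double w = 1.0 / cells;                       // cell width
    Bounds b = {0.0, 0.0};
    for (std::size_t d = 0; d < q.size(); ++d)
    {
        int qa = static_cast<int>(q[d] / w);      // query's own cell index
        double lo = a[d] * w, hi = (a[d] + 1) * w; // cell edges
        double lb, ub;
        if (a[d] < qa)      { lb = q[d] - hi; ub = q[d] - lo; }
        else if (a[d] > qa) { lb = lo - q[d]; ub = hi - q[d]; }
        else                { lb = 0.0; ub = std::max(q[d] - lo, hi - q[d]); }
        b.lower += lb * lb;
        b.upper += ub * ub;
    }
    return b;
}
```

As in phase 1 above, a vector would be kept as a candidate only if its `lower` does not exceed the kth smallest `upper` seen so far.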
APPENDIX B: Code for Standard KNN Search
/**************************************************
 * This program is to implement the standard VA object.
 **************************************************/

#ifndef VAFILE_HPP__
#include <VAFile.hpp>
#endif

#include <futils.hpp>
#include <values.h>
#include <queue>
#include <cstdio>
#include <cstring>
#include <dhtypes.h>
#include <list>
#include <map>
#include <set>
#include <stack>

namespace DH_FileStructures
{
static const double epsilon = 1e-10;
void VAFile::Create(const char *vafilename)
{
    vfilename = vafilename;
    vfile.open(vfilename.c_str());
    if (!vfile)
        throw FileStructureError(TEXT("vfile open failed"), LOCATION);

    Size vablocksize;
    Size dimension;
    Size bits;
    vabHeaderSize = vabReadHeader(vfile, vablocksize, dimension, bits,
                                  dfilename, mfilename, title);
    if (vabHeaderSize < 0)
        throw FileStructureError(TEXT("failed to read vfile header"), LOCATION);
    dataDimension = dimension;

    mfile.open(mfilename.c_str());
    if (!mfile)
        throw FileStructureError(TEXT("failed to open mfile"), LOCATION);
    Size mbits, mdim;
    string mdname, mtitle;
    if (mrkReadHeader(mfile, mbits, mdim, mdname, mtitle) < 0)
        throw FileStructureError(TEXT("failed to read mfile header"), LOCATION);
    if (mdname != dfilename)
        throw FileStructureError(TEXT("Data names differ in files"), LOCATION);
    if (mbits != bits)
        throw FileStructureError(TEXT("Different bit sizes in files"), LOCATION);
    if (dataDimension != mdim)
        throw FileStructureError(TEXT("Different dimensions in files"), LOCATION);
    if (readMarkers(mfile, mdim, markers) < 0)
        throw FileStructureError(TEXT("Read markers failed"), LOCATION);
    mfile.close();

    vblock.Create((int)vablocksize, (int)dataDimension, bits, &markers);
    dfile.open(dfilename.c_str());
    if (!dfile)
        throw FileStructureError(TEXT("Opening datafile failed"), LOCATION);

    blocktype = 0;
    if (*(dfilename.rbegin()) == 'k')
        blocktype = 0;
    else if (*(dfilename.rbegin()) == '1')
        blocktype = 1;
    else if (*(dfilename.rbegin()) == '2')
        blocktype = 2;
    else
    {
        cout << "Unknown block type" << endl;
        exit(0);
    }

    Size dsize;
    switch (blocktype)
    {
    case 0: blkHeaderSize = blkReadHeader(dfile, dsize, dimension); break;
    case 1: blkHeaderSize = bn1ReadHeader(dfile, dsize, dimension, vmin, vmax); break;
    case 2: blkHeaderSize = bn2ReadHeader(dfile, dsize, dimension, vmean, vstddev); break;
    default: cerr << "Unknown normalization type" << endl; exit(13);
    }
    if (dimension != dataDimension)
        throw FileStructureError(TEXT("Data dimension differ in files"), LOCATION);

    dblock.Create(dsize, dataDimension);
    dfile.seekg(0, ios::end);
    long filesize = dfile.tellg();
    long numblocks = (filesize - blkHeaderSize) / dblock.BlockSize();
    maxDataBlockIndex = numblocks;
    vfile.seekg(0, ios::end);
    filesize = vfile.tellg();
    maxVABlockIndex = (filesize - vabHeaderSize) / vblock.BlockSize();
    long maxNumSamples = numblocks * dblock.maxNumRecords;
    maxDataRecords = maxNumSamples;
    mapVAIndexToBlockIndex.resize(maxNumSamples);
    int c = 0;
    for (int b = 0; b < numblocks; ++b)
    {
        for (Index r = 0; r < dblock.maxNumRecords; ++r)
        {
            mapVAIndexToBlockIndex[c] = make_pair(b, r);
            c++;
        }
    }
    validstate = true;
}
// If number = -1, then read the next VA block into memory
long VAFile::ReadVABlock(long blocknumber = -1)
{
    REQUIRE(IsValid());
    if (blocknumber >= 0)
        vfile.seekg(vabHeaderSize + blocknumber * vblock.BlockSize(), ios::beg);
    return vblock.Read(vfile);
}

// Read a data block into memory. If number = -1, read the next block
long VAFile::ReadBlock(long blocknumber = -1)
{
    REQUIRE(IsValid());
    if (blocknumber >= 0)
        dfile.seekg(blkHeaderSize + blocknumber * dblock.BlockSize(), ios::beg);
    return dblock.Read(dfile);
}
// normalize a data vector
void VAFile::NormalizeDataVector(vector<double> &sample)
{
    switch (blocktype)
    {
    case 0: // no normalization
        break;
    case 1: // unit cube
        REQUIRE(sample.size() == vmin.size());
        for (Index i = 0; i < sample.size(); ++i)
        {
            sample[i] = (sample[i] - vmin[i]) / (vmax[i] - vmin[i]);
        }
        break;
    case 2: // zero mean
        REQUIRE(sample.size() == vmean.size());
        for (Index i = 0; i < sample.size(); ++i)
        {
            sample[i] = (sample[i] - vmean[i]) / vstddev[i];
        }
        break;
    default: // should not reach here
        throw FileStructureError(TEXT("Should not reach this location! unknown block type."), LOCATION);
        break;
    }
}
// make the approximation representation of a vector
void VAFile::ApproximateVector(vector<double> &v, vector<byte> &va)
{
    va.resize(v.size());
    for (unsigned int dim = 0; dim < v.size(); ++dim)
        va[dim] = vblock.getApproximation(v[dim], dim);
}
// precalculate the distances between a vector and a VA object
void VAFile::PreCalcMarkerDistances(vector<double> &v, vector<byte> &va,
    vector<vector<double> > &d, DistanceObjects &distobj)
{
    // make sure d is large enough to hold the precalculated distances
    if (!distobj.CanPreCalc())
        return;
    d.resize(v.size());
    for (unsigned int j = 0; j < v.size(); ++j)
        d[j].resize(markers[j].size());

    // precalculate distances to markers
    for (unsigned int dim = 0; dim < v.size(); ++dim)
    {
        for (unsigned int m = 0; m < markers[dim].size(); ++m)
        {
            d[dim][m] = distobj.PreCalc(v[dim], markers[dim][m], dim);
        }
    }
}
// calculate the lower bound and upper bound to a query vector
void VAFile::GetBounds(ApproxVectorRecordView &a, vector<double> &q,
    vector<byte> &qbin, double &lower, double &upper,
    vector<vector<double> > &d, DistanceObjects &distobj)
{
    REQUIRE(q.size() == d.size());
    REQUIRE(int(q.size()) == a.numFields);
    REQUIRE(q.size() == qbin.size());
    static vector<double> tmp;
    tmp.resize(q.size());
    if (distobj.CanPreCalc())
    {
        // use the precalculated distances in d
        for (unsigned int dim = 0; dim < q.size(); ++dim)
        {
            tmp[dim] = (a(dim) < qbin[dim]) ? d[dim][a(dim)+1]
                     : ((a(dim) > qbin[dim]) ? d[dim][a(dim)] : 0);
        }
        lower = distobj.UsePreCalcs(tmp);
        for (unsigned int dim = 0; dim < q.size(); ++dim)
        {
            tmp[dim] = (a(dim) < qbin[dim]) ? d[dim][a(dim)]
                     : ((a(dim) > qbin[dim]) ? d[dim][a(dim)+1] : 0);
        }
        upper = distobj.UsePreCalcs(tmp);
    }
    else
    {
        for (unsigned int dim = 0; dim < q.size(); ++dim)
        {
            tmp[dim] = (a(dim) < qbin[dim]) ? markers[dim][a(dim)+1]
                     : ((a(dim) > qbin[dim]) ? markers[dim][a(dim)] : 0);
        }
        lower = distobj(q, tmp);
        for (unsigned int dim = 0; dim < q.size(); ++dim)
        {
            tmp[dim] = (a(dim) < qbin[dim]) ? markers[dim][a(dim)]
                     : ((a(dim) > qbin[dim]) ? markers[dim][a(dim)+1] : 0);
        }
        upper = distobj(q, tmp);
    }
}
// compute N1 and B2 for one query in one iteration
pair<pair<int,int>,pair<double,double> > VAFile::GetKNN(
    stack<ApproxVectorRecordView> &myApprox, unsigned int t, unsigned int K,
    vector<double> &q,
    vector<pair<pair<double,vector<double> >,pair<int,int> > > &results,
    WeightedDistanceSQ &distobj, set<int,less<int> > &ignoreSet,
    double initUpperBound = 1)
{
    REQUIRE(IsValid());
    if (q.size() != dataDimension)
    {
        char buf[1024];
        sprintf(buf, "q.size = %d != %d = dataDimension", q.size(), dataDimension);
        throw FileStructureError(buf, LOCATION);
    }
    REQUIRE(q.size() == dataDimension);

    // Initialization
    static vector<bool> visitedblock;
    static vector<byte> qbin;
    list<pair<unsigned int,unsigned int> > BRignoreList;
    ApproximateVector(q, qbin);
    PreCalcMarkerDistances(q, qbin, work, distobj);
    pair<double,pair<int,int> > tmppair;
    tmppair.first = DBL_MAX;
    tmppair.second = make_pair(-1, -1);
    vector<pair<double,pair<int,int> > > topKUpperBound(K+1, tmppair);

    priority_queue<pair<double,pair<int,int> > > myheap;
    vfile.seekg(vabHeaderSize, ios::beg);
    unsigned int blockindex = 0;
    int currentPosition = 0;
    int candidate_count = 0;
    int records_count = 0;
    double rut = 0;
    double theta = 0;
    cout << "At iteration: " << t << endl;
    // standard phase 1
    while (ReadVABlock(-1) >= 0)
    {
        for (unsigned int r = 0; r < vblock.numValidRecords; ++r)
        {
            double lower, upper;
            if ((ignoreSet.size() > 0) &&
                (ignoreSet.find(currentPosition) != ignoreSet.end()))
            {
                // don't place in the candidate list if in the ignore list
                BRignoreList.push_back(pair<unsigned int, unsigned int>(blockindex, r));
                currentPosition++;
                continue;
            }
            GetBounds(vblock(r), q, qbin, lower, upper, work, distobj);
            /* if a vector has a lower bound smaller than the kth upper bound
               found so far, it is selected as a candidate */
            if (lower <= topKUpperBound[K-1].first + epsilon)
            {
                topKUpperBound[K].first = upper;
                topKUpperBound[K].second.first = blockindex;
                topKUpperBound[K].second.second = r;
                sort(topKUpperBound.begin(), topKUpperBound.end());
                tmppair.first = -1 * lower;
                tmppair.second.first = blockindex;
                tmppair.second.second = mapVAIndexToBlockIndex[records_count].first;
                // maintain the candidates in a heap
                myheap.push(tmppair);
                myApprox.push(vblock(r));
                candidate_count++;
            }
            records_count++;
            currentPosition++;
        }
        blockindex++;
    }
    cout << "In phase 1: number of candidates: " << myheap.size() << endl;
    // end of standard phase 1
    // Phase 2 is common for the adaptive and standard approaches
    pair<pair<double,vector<double> >,pair<int,int> > tmppair2;
    tmppair2.first = make_pair(DBL_MAX, vector<double>(dataDimension));
    tmppair2.second = make_pair(-1, -1);
    vector<pair<pair<double,vector<double> >, pair<int,int> > >
        topKvectors(2*K + dblock.maxNumRecords, tmppair2);
    int numdatablocksread = 0;
    visitedblock.resize(maxDataBlockIndex);
    fill(visitedblock.begin(), visitedblock.end(), false);

    // pop the candidates in increasing order of lower bound
    while (!(myheap.empty()) &&
           (-1 * (myheap.top().first) < topKvectors[K-1].first.first + epsilon))
    {
        if (!visitedblock[myheap.top().second.second])
        {
            visitedblock[myheap.top().second.second] = true;
            numdatablocksread++;
            if (ReadBlock(myheap.top().second.second) < 0)
                throw FileStructureError(TEXT("Block read failed"), LOCATION);

            // once a block is read, measure the distance to all records in the block
            unsigned int p2blockindex = (unsigned int)(myheap.top().second.first);
            for (unsigned int r = 0; r < dblock.numValidRecords; ++r)
            {
                bool skiprec = false;
                for (list<pair<unsigned int,unsigned int> >::iterator p = BRignoreList.begin();
                     p != BRignoreList.end(); ++p)
                    if ((p->first == p2blockindex) && (p->second == r))
                    {
                        skiprec = true;
                        break;
                    }
                if (skiprec)
                    continue;
                topKvectors[2*K+r].second.first = myheap.top().second.second;
                topKvectors[2*K+r].second.second = r;
                for (unsigned int dim = 0; dim < dataDimension; ++dim)
                {
                    topKvectors[2*K+r].first.second[dim] = dblock(r)(dim);
                }
                topKvectors[2*K+r].first.first =
                    distobj(q, topKvectors[2*K+r].first.second);
            }
            partial_sort(topKvectors.begin(), topKvectors.begin()+2*K, topKvectors.end());
        }
        myheap.pop();
    }
    cout << "In phase 2: read " << numdatablocksread << " blocks" << endl;
    results.resize(K);
    // copy the pair of distance and vector, and the pair of
    // block index and data index, from topKvectors into results
    for (unsigned int i = 0; i < K; ++i)
    {
        results[i].first.first = topKvectors[i].first.first;
        results[i].first.second = topKvectors[i].first.second;
        results[i].second = topKvectors[i].second;
    }
    return make_pair(make_pair(candidate_count, numdatablocksread),
                     make_pair(rut, theta));
}
} // namespace DH_FileStructures
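Both phase-1 listings push candidates with the lower bound negated (tmppair.first = -1 * lower). Since std::priority_queue is a max-heap, negating the key makes the candidate with the smallest lower bound surface first, which is exactly the order phase 2 pops them in. A minimal sketch of that idiom with hypothetical data (not the thesis's structures):

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Pop (lowerBound, id) pairs in increasing lower-bound order by pushing the
// negated bound into a max-heap, mirroring the -1*lower trick in the listings.
std::vector<int> idsByIncreasingBound(
        const std::vector<std::pair<double, int> >& candidates)
{
    std::priority_queue<std::pair<double, int> > heap; // max-heap on .first
    for (std::size_t i = 0; i < candidates.size(); ++i)
        heap.push(std::make_pair(-candidates[i].first, candidates[i].second));
    std::vector<int> order;
    while (!heap.empty())
    {
        order.push_back(heap.top().second); // -heap.top().first is the bound
        heap.pop();
    }
    return order;
}
```

The same effect could be had with a comparator and std::greater, but negating the key keeps the default pair ordering, which is what the thesis code relies on.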
APPENDIX C: Code for Evaluation Procedure
/**************************************************
 * This program is used to evaluate the adaptive and standard
 * approaches and to compute the average number of candidates, the
 * average number of blocks visited, the average number of optimal
 * blocks, and the average of the established upper bound.
 **************************************************/
#include <VAFile.hpp>
#include <iostream>
#include <fstream>
#include <strstream>
#include <string>
#include <algorithm>
#include <exception>
#include <values.h>
#include <set>
#include <dhtypes.h>
#include <stack>
// CoreLinux using Assertion.hpp
#include <Common.hpp>

// resource file parsing
#include <ResourceConfig.h>
using namespace DH_FileStructures;
using namespace DH_Utilities;
using namespace corelinux;
// read the query data one line at a time from the data file
bool readLine(ifstream &fin, vector<double> &data)
{
    std::string buf;
    if (getline(fin, buf))
    {
        std::istrstream line(buf.c_str());
        long j = 0;
        double tmp;
        while (line >> tmp)
        {
            if (j < (long)data.size())
                data[j] = tmp;
            else
                data.push_back(tmp);
            j++;
        }
    }
    else
        return false;
    return true;
}
const string optionInFile("in va file");
const string optionInFileQueries("in query file");
const string optionInFileTags("in tag file");
const string optionOutFileStats("out stats file");
const string optionK("k");
const string optionVerify("verify");
const string optionCalcStats("calc stats");
const string optionT("T");
const string optionDelta("Delta");
int main(int argc, char *argv[])
{
    if (argc < 2)
    {
        cerr << "Invalid command lines:\n"
             << "Usage: searchVAFile configFile\n";
        exit(1);
    }
    string inVAName;        // input va file name
    string inQueriesName;   // input query file name
    string inTagName;       // input tag file name
    string outStatsName;    // output statistics file name
    int K = 1;              // number of feature vectors to be retrieved
    int calcStats = 0;      // statistics file off
    bool CALC_STATS = false;
    set<int,less<int> > TrainingSet;
    unsigned int NumIterations = 0;  // number of iterations
    int delta = 1;          // reweighting rate

    try
    {
        ResourceConfiguration RC(argv[1]);
        RC.GetValue(optionInFile, inVAName);
        RC.GetValue(optionInFileQueries, inQueriesName);
        RC.GetValue(optionInFileTags, inTagName);
        if (RC.HasOption(optionK))
            RC.GetValue(optionK, K);
        REQUIRE(K > 0);
        if (RC.HasOption(optionT))
            RC.GetValue(optionT, NumIterations);
        REQUIRE(NumIterations > 0);
        if (RC.HasOption(optionDelta))
            RC.GetValue(optionDelta, delta);
        REQUIRE(delta > 0);
        if (RC.HasOption(optionCalcStats))
        {
            RC.GetValue(optionVerify, calcStats);
            RC.GetValue(optionOutFileStats, outStatsName);
        }
        CALC_STATS = (calcStats != 0) ? true : false;
    }
    catch (Assertion &e)
    {
        cerr << "Assertion Failed!" << endl;
        cerr << "  " << e.getFile() << endl;
        cerr << "  " << e.getLine() << endl;
        cerr << "  " << e.getWhy() << endl;
        cerr << "  " << e.getUnwind() << endl;
        exit(999);
    }
    catch (...)
    {
        cout << "Uncaught Error" << endl;
    }
    cout << "Search VAFile Run: " << endl;
    cout << "  vafile  = " << inVAName << endl;
    cout << "  queries = " << inQueriesName << endl;
    cout << "  tags    = " << inTagName << endl;
    cout << "  K       = " << K << endl;
    cout << "  T       = " << NumIterations << endl;
    cout << "  Delta   = " << delta << endl;
    cout << endl;
    if (CALC_STATS)
        cout << "  stat file = " << outStatsName << endl;

    cerr << "Creating VAFile Object" << endl;
    VAFile vafObject(inVAName.c_str());        // construct a VAFile object
    ifstream testdata(inQueriesName.c_str());  // input query data file
    vector<double> q;                          // query data
    cerr << "Reading test data" << endl;
    readLine(testdata, q);
    int qindex = 0;
    pair<int,int> cres(0, 0);           // for N1 and B2
    pair<double,double> cresrt(0, 0);   // for rut and theta
    pair<pair<int,int>,pair<double,double> > cress(cres, cresrt);
    int sumOptBlocks = 0;          // sum of optimal blocks
    int sumNumberCandidates = 0;   // sum of candidates
    int sumBlocksRead = 0;         // sum of blocks visited
    int numberSamples = 0;         // number of query data
    double sumRut = 0;             // sum of rut
    double sumTheta = 0;           // sum of theta
    while (testdata)
    {
        try
        {
            cerr << "searching for KNN: on query " << numberSamples << endl;
            // record the KNN distance and vector in the first pair,
            // and the block index and record index in the second pair
            vector<pair<pair<double,vector<double> >,pair<int,int> > > results;
            // for recording the approximations of N1
            stack<ApproxVectorRecordView> myApprox;
            cerr << "creating Distance Object" << endl;
            for (unsigned int iter = 1; iter <= NumIterations; iter++)
            {
                // simulate the generation of a new weight
                vector<double> currW;
                currW.resize(q.size());
                // avoid going out of bounds of the sum of the weights
                currW[1] = 1 / pow(delta, NumIterations - iter);
                // normalize the rest of the weights
                for (unsigned int di = 2; di <= currW.size() - 1; di++)
                {
                    currW[di] = (1 - currW[1]) / (currW.size() - 2);
                }
                // construct a squared weighted distance object
                WeightedDistanceSQ wd(currW);
                cerr << "At iteration " << iter << endl;
                // call the KNN procedure to compute N1, B2, and rut
                cress = vafObject.GetKNN(myApprox, iter, K, q, results,
                                         wd, TrainingSet, 1);
                // store the unique optimal block indexes
                set<int, less<int> > optblockset;
                for (unsigned int j = 0; j < results.size(); j++)
                {
                    optblockset.insert(results[j].second.first);
                }
                cout << "opt blocks = " << optblockset.size() << endl;
                cerr << "finished KNN search" << endl;
                sumOptBlocks += optblockset.size();
                sumNumberCandidates += cress.first.first;
                sumBlocksRead += cress.first.second;
                sumRut += cress.second.first;
                sumTheta += cress.second.second;
                cout << "SEARCH RESULTS: " << cress.first.first
                     << " " << cress.first.second << " "
                     << cress.second.first << " " << cress.second.second
                     << endl << endl;
            }
            numberSamples++;
        }
        catch (Assertion &e)
        {
            cerr << "Assertion Failed!" << endl;
            cerr << "  " << e.getFile() << endl;
            cerr << "  " << e.getLine() << endl;
            cerr << "  " << e.getWhy() << endl;
            cerr << "  " << e.getUnwind() << endl;
            exit(999);
        }
        catch (FileStructureError &e)
        {
            cerr << "File Structure Error!" << endl;
            cerr << "  " << e.getFile() << endl;
            cerr << "  " << e.getLine() << endl;
            cerr << "  " << e.getWhy() << endl;
            cerr << "  " << e.getUnwind() << endl;
            exit(999);
        }
        catch (...)
        {
            cout << "uncaught error";
        }
        readLine(testdata, q);
        qindex++;
    }
    cout << endl;
    cout << "STATS OF RUN:" << endl;
    cout << "  Average number of candidates = "
         << double(sumNumberCandidates) / NumIterations / numberSamples << endl;
    cout << "  Average number of blocks read = "
         << double(sumBlocksRead) / NumIterations / numberSamples << endl;
    cout << "  Average number of optimal blocks = "
         << double(sumOptBlocks) / NumIterations / numberSamples << endl;
    cout << "  Average rut = "
         << sumRut / (NumIterations - 1) / numberSamples << endl;
    cout << "  Average theta = "
         << sumTheta / (NumIterations - 1) / numberSamples << endl;
    exit(0);
}
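The readLine helper above parses whitespace-separated doubles with the long-deprecated <strstream> header. A self-contained sketch of the same buffer-reusing parse with std::istringstream (the helper name parseQueryLine is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parse one line of whitespace-separated doubles into `data`, overwriting
// existing slots and growing the vector only when the line is longer than
// before -- the same buffer-reuse pattern as readLine in Appendix C.
bool parseQueryLine(const std::string& line, std::vector<double>& data)
{
    std::istringstream in(line);
    std::size_t j = 0;
    double tmp;
    while (in >> tmp)
    {
        if (j < data.size())
            data[j] = tmp;
        else
            data.push_back(tmp);
        ++j;
    }
    return j > 0; // false on an empty or non-numeric line
}
```

Note that, as in the original helper, a line shorter than the previous one leaves stale trailing values in the buffer; the caller is expected to know the fixed feature dimensionality.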
VITA
Fan Yang
Candidate for the Degree of
Master of Science
Thesis: AN ANALYSIS ON ADAPTIVE APPROXIMATION BASED NEAREST NEIGHBOR SEARCH IN LARGE IMAGE DATABASES
Major Field: Computer Science
Biographical:
Personal Data: Born in Tianjin, P. R. China, October 24, 1968, son of Sanlill Yang and Yilin Sun.
Education: Graduated from Tianjin No. 20 High School, Tianjin, P. R. China in July 1986. Received Bachelor of Engineering degree in Biomedical Engineering from Tianjin Medical University, Tianjin, P. R. China in July 1990. Completed the requirements for the Master of Science degree with a major in Computer Science at Oklahoma State University in August, 2003.
Experience: Teaching Assistant for the Computer Science Department at Oklahoma State University from August 2000 to May 2003.
Professional Membership: Member of the local chapter of ACM at OSU.