DYNAMIC SPI SCHEMES
By
ZILI FA 1
Bachelor of Science
Hangzholl University
Hangzhou, Zhejiang', P.R. of China
1991
Master of Science
Oklahoma State University
Stillwater, Oklahoma
1994
Submitted to the Faculty of the
Graduate College of the
Oklahoma State Univer ·it.y
in part.ial fulfillment of
t.he requirements for
the Degree of
MASTER OF SCIENCE
December, 1996
DYNAMI . PI
Thp.'i:s Approved:
II
HE lIE
PREFACE
This paper presents two dynamic SPI schemes: Indt='x Dynamic SPI and
Directory Dynamic SPI r. SPIK is a new indexing technique in database design,
which is of practical importance in various fields.
The dynamic SPI s modify the static SPI s in two way': 1) The size of iudt='x
file is flexible through operations on data. 2) Data overflo'w and data spar 'ent='ss are
elirninated at the same time. Index Dynamic SPI introduces the dynamic index t.reE'
structure', t.he shapes of which are dynamic through Hodes split and combint='. Directory
Dyanmic SPIN evolves from extendible hashing, combinillg with static SPI s
propert.ies. It introduces directory structure, instead of index t.ree in index file. Al '0.
the data file is composed of bucket units, so the flexibility of data file is accomplisllf'd
in Dirc~ct.ory Dynamic SPI . The performance of thf' t.wo dynamic SPINs a.r('
analyzf'd. t.hf'fl compared wit.h stat.ic SPINs.
1 would likf' to exprel:)s my sincere grat.itude to my major adviser, Dr. G. E.
Hf'drick, who introduced mt' to t.his interC'sting projpct, and const.ant.ly gave In£' illtf'lligent
guidance. Many thanks also I.(J t.he other committee memhf'rs, Dr. K. M.
Gf'orge and Dr John P. Chandler, for th(~il' helpful advisement nnd suggest.ions.
I wish to express my :::;incere gratitude to those who provided suggestions and
a.~ 'il:)tance for this study: Dr. Allen Divali, Mr. Yunpeng Zhang, and Mrs. Ying
Fan. My deepest appreciation is extended to my parent.s whose encouragement and
understanding were invaluable throughout the study.
III
TABLE OF CONTENTS
Chapter
I. I TRODUCTIO
Page
1
II.
III.
DEFINITIO SAND TERMI OLOGY",.,.,.,., , , , , .
STATIC SPI SCHEMES ., ,.' ,.,." , .. 5
The CSPI function , , ,.', .. " .. '.... 5
The SSPIN function '.'"""" 7
The R_SPIN function .,.,."'.,,.,,'.','., 10
The limitations of R_SPIN .. " ..... ,.'., .. , ", .. , 11
IV. DYNAMIC SPIN SCHEMES 1.5
Index Dynamic SPIN , , " ".. 15
Preliminaries 15
General Description . . . . . . . . . . . . . . . . . . . . . . . .. 19
Space Utilization , . , " 23
Directory Dynamic SPI 2'
Preliminaries " ,.... 2G
Gneral Description , , , , . . .. 26
Splitting Control " :J 1
Bucket Struct.mE' . . . . . . . . . . . . . . . . . . . . . . . . . .. :j;j
Comparisons Between Dynamic PINs and R_SPI , 34
V. IMPLEMENTATION ,., 36
Data Structmes ", .. ','.,.,.' .. '.... 36
Index Dynamic SPIN '"" '"",.,. 37
Directory Dynamic SPI ,.,.".. . .... , . . . .. :3
VI. PERFORMANCE ANALYSIS FOR SPI 40
Testing Program . , , . ' . , , . , . , . ' 40
PerformancE' Analysis for R_SPI , . ,. . ... , 41
Performance Analysis for D_SPIN ",................. 42
Performance Analysis for T SPI .. . . . . . . . . . . . . . . . . .. 45
Ch~~r P~
VII. SUM ARY. CO CL SIO A D SUGGESTED F T RE WORK 4
A SELECTED BIBLIOGRAPHY. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 50
APPE DIX A  GLOSSARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 53
v
Figure
1.
LIST OF FIGURES
Transformation between different dimensional arrays .
Page
7
2. Transformation values for the array arr[2][4] 9
3. Tree structures for R..sPI 11
4. Different shapes of trees in an index file 16
o A leveled structure for the sparse situation 17
6. S0quential number::; of index for the nodes at the ::;ame level 17
7. A nonsparse (dense) tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 18
8. Different initial tree::; 19
9. Trees with splitted nodes .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 20
12. The whole picture for data indexing and retriev
10
11.
Insertion sitnation for index dynamic SPIN
The final re::iult. of the index tor e in D_SPI
21
23
24
13. Kf'Y tran formation::; in extensible ha::;hing . . . . . . . . . . . . . . . . . . .. 27
14. Key t.ransformations in T _SPI 28
IS Direct.ory of order d=3 with four buckets. . . . . . . . . . . . . . . . . . . .. 30
16. T _SPI transformations ,....... 31
17. Distrihution of keys after splitting buckf't D . . . . . . . . . . . . . . . . . .. 32
18. Data st.rncture for DSPI : . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 37
19. Experiment.al results using RSPI function 41
20. The re::;ult of experiment using R_SPIN function. . . . . . . . . . . . . . .. 42
Vi
Figure Page
21. The result of experiment llsing DSPI function 43
22. An experiment using DSPI function 44
23. A search experiment using D_SPI function 45
24. The result of experiment using T.sPIN function 46
25. The result of experiment using distribution of T _SPIN 47
Vll
CHAPTER I
I TRODUCTIO
Data indexing is important in database design and management. In larg database'
the indexing techniques are critical for the fundamental operations of search, in 'ert
update, and delete. Research in this field has generated several structures and algorithms
induding those for R_trees[l] and for B_trees[2].
Coburn propo::;ed a new data indexing technique, called the Single Point Index
Network(SPI )[4]. The significant difference between SPI and other data indexing
techniques is that SPIN supports layered data relationships using a multidimensional
approach. The layered data structure is defined as abstract layers of data using the
data structure opf'rations responsible for performing t.he fundamental operations of
search, insert, delete, and update. In an pnvironment. of n"lat.ional dataha.'j(,s, slIdl a
la.verE'd data strncture means a data structure s1Ipporting intf'rnal mappings het,wef'lI
fields in different databases. The process of snch mappings is also called "horizontal
int.egration [3]," which means the incorporation at onf' t.ime of information from more
than one functional area. Horizontal integration requires that. the computer work in
mllitidimensional dat.a spaces and use multidimensional data struct.mes.
Acwrding to Cobllrn[4]' SPIN can perform fundamf'ntal operations on multidimensional
data spaces while other many data structures apparently cannot do t.he
same work with the same spef'd and fh~xihility[3]. The SPI t.edmiqllP pvolves from
performing a modification to the multidimensional array data strllct,me, bpcallsP thc·
111llltidimensional array is the most. wmmon multidimensional data struct.ure. SPI
consists of a series of transformation functions which convert multidimensional data
1
2
spaces into equivalent linear data spaces. This allows the omputer to operat on a
multidimensional data space as though it were a linear space. However, th SPI
schemes proposed by Coburn have properties of stati structure. Th y r quiI' that
data storage space or index storage space be allocated statically. Thi means that
when a file exceeds the allocated space, an overflow happens; con 'equently the entire
file must be restructured at great expense. In addition ov rflow handling scheme'
often increase the retrieval time a::; files approach their space limits. To over orne
t.hes0 problems, dynamic SPIN schemes have been proposed. The file structmes of
dynamic SPI s and their associated algorithms adapt t.hemselves to changes in the
size of the file, so expensive periodic database reorganization is avoided. In this thesis,
we introduce two dynamic SPIN schemes, index dynamic SPIN and direct.ory
dynamic SPIN, with special emphases on the various design issues.
Tlw goals of this thesis are to provide two basic designs of the dynamic SPI
structures and algorithms, to outline some ofthe techniques that are being developeu.,
to implement the proposed dynamic SPINs, and to show how the various df'sign
parameters relate to their performancf's.
In t.he thesis, Chapt.er II defines some technical terms which will he llsed in
the t.hesis. Chapter III explaint; static SPIN schrmes, including propertirs of st.atiC"
SPI IS and the limit.ations of stat.ic SPINs. Chaptf'r IV descrihes two dynamic SPI r
sclwmes: index dynamic SPIN and directory dynamic SPI , somr design issues ar
also discussed in this chapter. Chapter V implements two dynamic SPIN schemes.
Chapter VI analyses the performance of dynamic SPINs. Chapter VII is the summary,
conclllsions and directions for future work. Appendix A is the glossary.
CHAPTER II
DEFI ITIO SAD TERMI OLOGY
To make the presentation of dynamic SPI s preci 'e, thi' hapter define' some
terIllti. Implicit in these definitions is the assumption that the volume of data is large
and operations on data are in main storage,
1. A dp.Tl.sp. b'ep. means a tree whosE' nodE"s all contain data. In other words, a tree i1l
full, Conversely, a sparsp. tree means a tree whose nodes do not all contain data.
A majority of its nodes are empty, In database design a sparse tree implies
that some storage space is wasted, and a dense tree implie::; that an overflow
may happen on the index tree.
2. static SPIN mf'ans the SPI packagf' dewlopped by 'ohmn[3]. This SPl
package inclndes three basic SPIN schemes: C_SPIN, S_SPIN and RSP1N.
ThE" first two SPINs tran::;form multidim .nsional data ::;trtlctm('::; into a )j('
dimen::;ional data structures. R_SPIN, based on 1'.hf' S_SPIN's tran::;formation.
furtlwr transforms sparse data stmctme::; into df'nse da.t.a structme::;. All these
transformat.ions have static properties; e.g., tht' sizp of index file is fixed.
3. dynamir SPIN means the SPI T package developped by the SPI [(:>::;earch
gronpl. Tbpre are several dynamic SPIN algorithms proposed so far. III this
thpsi::;, two dynamic SPI schemes are presented: index dynamic SPIN (D_SPIN
for short), and directory dynamic SPL (TSPIN for short.). The sizp of index
file i::; flexible in dynamic SPI ::;chemes.
IT1J!s [('search group COllsists of Dr. G. E. Hedrick, Dr. R. A. DiVali, M. Z. Fall, am! Y. ZIJanl.\"
3
4
4. load factor means a number A. whose value is between 0 and 1. and who e
value indicates how full (the load) the storage tree (or table) i . An empty tr e
(table) h&; a load factor of 0 (A=O); a full tree (table) has a load factor of 1
(A=1).
5. In SPI , overflow means an attempt to insert data into a tree with a load factor
of 1. A memor, error occurs if overflow is not handled properly.
6. mdix/tree] means a tree that has a fixed number of [possibly empty] branches
from each node. A radix n tree has n branches at each node.
7. A bur:.ket (OT page) corresponds to one or several physical sectors of a secondary
storage device s11ch &; a disk. The capacity of a bucket is brecords in t.his thesis.
8. Spar:.r: utilizatl.On is the ratio between nand m*b, where n is the number of
rerords in the file, m is the number of pages used, and b i::; the capacity in
records of the page.
CHAPTER. III
STATIC SPI SCHEMES
Coburn[4] developed three basic SPI. functions: C_SPIN. S_SPIN and R_SPIN.
All three functions support multidimensional data spaces and have th 'tatic property.
The basic mechanism behind these functions is that a key is transformed into an index
value. The index value is used to find an address of records in the data file. SPIN use'
a series of algorithms derived from thf' Fundamental Principle of Counting(FPC). The
FPC states:
Given a series of m operations 1, 2,3, .. , m. If the first operation can be
performed in mr ways, the second in m2 ways, and so on until the mth
opf'ration, which can be performed in mn ways, t.hen the Immher of ways
the m operations can be performed i '
(I )
Till" C_SPI Function
Coburn uses a multidimensional array to represent a mllitidimen 'ional data space.
For instance, in the C language, the declaration "int arr[3] [3] [3]" represents an array
of 27 integer' in three dimensions with three indices in each dimension. When
using Cj:;PIN, this array represents the three leveled spaces, corresponding to three
dimensions of the array.
The problem addressed by C_SPIN is: "How can one access the multidimensional
array?" Traditionally, the programmer uses cursors which are translated to pointer
5
6
variables to navigate through the dimension of a multidimensional array. This mal<
partial key search, storage, and retrieval operations impossibl . Al 0, pro edure for
allocating multidimensional memory dynamically are generally unavailable.
One solution is to convert the ml1lt.idimen~ionalarray into a corresponding single
dimensional array. With a one dimensional array, the shortcoming' of accessing a
multidimensional array with a pointer variable can be avoided. Another reason for
using the single dimensional array is that the digital computer is a sequential devi e,
but multidimensional arrays are not stmctured sequentially. One dimpnsional arrays
are sequential. Coburn states that. nonsequential data structure can pose special
storage and rf'trieval difficulties for t.he computers[3].
Given t.he array "arr[3][3][3]" it follows from formula 1 t.hat there are 27 valid
multidimensional subscript combinations(keys) for this array. In general, given an
dimensional array with 11", indices in dimensioni and i = 0,1,2, ... , N  1, by formula
1, C_SPIN transforms the no *71,1 *71,2 * ... *nNl subscript combinations into the set.
of 'equential whole flnmhers k such that k = 0,1,2, ... , no * 17.1 * 71,2 * ... * (nNl  1).
Thi' transformation has t.ht" following properties[4]:
1. The transformation is well ordered. That is, given two distinct suh 'cript. combinations(
keys), say k] and k2, which are passed to the Cj;PINfunction with the
proper argument.s, and two return values 0 1 and O2 from C_SP1N corresponding
to k1 and k2 respectively. it follows that if k] > k2 , then 0 1 > O2 ,
2. The transformation is onetoone and onto. For each distinct 'ub 'cript combinat.
ion passed to C_SPIN, there is exactly one return valuE' and t.hat retllfn
value is unique.
:3. The transformation is sequential. A given multidimensional array contains keys
that are base 10 integer values, it follows that, given k] and k2 such that k1 < k2 ,
if there does not exist an kk such that kl < k~, < k2 , then t.he corresponding
[0][0][0] <> 0
[0][0][1] <> 1
[0][1][0] <> 2
[0][1][1] <> 3
[1][0][0] <> 4
[1][0][1] <> 5
[1][1][0] <> 6
[1][1][1] <> 7
7
Figure 1: Transformation hetween multidimensional array and single dimensional
array
return values 0 1 and O2 have the following relationship: O2 = 0 1 + 1.
The Figure 1 clearly shows the above properties, given a multidimensional array
"int arr[2] [2] [2]" .
While C_SPIN provides a good way to access multidimensional array' in main
storage there are still some prohlems with C_SPIN. The first prohlem is that C_SPIN
is restricted to dimension size' no larger t.han five. The r .ason b t.he formula that
C_SPIN employs to conversion grows exponentially when t.lw nllmher of dimensions
increases. The second problem is that access is deni .d t.o the interior dimf'flsions of
a multiclirnensional array structure, therefore, partial suhscript combinat.ions cannot
bf' indE'xecl. ThE'se limitations lead to the S_SPIN function.
The S_SPI Function
/:l,C;PIN function iteratively applies C_SPIN function one dimension at a timE'. The
formula S_SPIN uses in transforming a multidimensional array into a singly dimensional
array is:
recnumber[i + 1] = retnumberIi] * lOex[i] + k[i + 1]  recnumber[i] * kr[i] (2)
where,
recnumber[i+l]=the index for th (i+1)st leve1. 2
rec_nllmber[i]=the index for t.he previou~ level or iteration.
max[i]=the number of indices in the ith dimension.
ex[i]s are exponents computed as followings:
if(max[i+1]>O && max[i+1]<=1O) ex[i]=l
else if(max[i+1]>10 && max[i+1]<=lOO) ex[i]=2
else if(max[i+1»10o && max[i+1]<=lOOO) ex[i]=3
kli]=the value of the. multidimensional subscript within the ith dimension.
krli]s are values computed as followings:
if(max[i+1]>0 && max[i+1]<=lO) kr[i]=lOmax[i+1]
else if(max[i+l] >10 && max[i+l]<=lOO) kr[i]=100max[i+l]
else if(max[i+1]>lOO && max[i+1]<=lOOO) kr[i]=1000max[i+1]
iC (O,1,2,3,,,.,N2):
As an example, the following illustration using thf' array "arr[2][4]" applie' the
formula 2
=2:
max[0]=2, max[1]=4;
ex[o]=l, since max[l] <=10:
kr[O]=10max[1]=104=6.
Eight valid subscript combinations can be passed to S'_SPIN. They are (0,0),
(0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, :3). By formula 2, Tec_number[l] =
recnumber[O] * 101 + k[l]  reenumber[O] * kr[O]; where, recJlllmher[O] is the index
20nc le:'is thau the llumber of the dilILCW;ioIl withiu a 1II1l1l'idiU1CIIHiollal array is callcd the level
of the array.
Subs ript
Combination
(0, 0)
(0, 1)
(0, 2)
(0, 3)
(1, 0)
(I, 1)
(1, 2)
(1, 3)
Righthand side computation
°* 101 + 0  0 * 6
0* 101 + 1  0 * 6
o* 101 + 2  0 * 6
0* 101 + 3  0 * 6
1 * 101 + 0  1 * 6
1 * 101 +1  1 * 6
1 * 101 + 2  1 * 6
1 * 101 + 3  1 * 6
Index
o
1
2
3
4
5
6
7
9
Figure 2: Transformation values for the array arr[2][4]
for the first dimension, it is zero or one in this example. k[l] is the index value in the
first level, it is 0, or 1, or 2 in the example. Fignre 2 shows the transformation.
The formula 2, a recursivE' function, generates a seqUf'ncf' of return values a1
each If vel. For example, given an array "arr[I][2][3][4]," when applying formula 2 1'.0
calculate the index value for this array, it also generates iudex vahlE' for "arr[1][2], '
and "arr[1][2][3]," as well as "arr[1][2][3][4]." Consequently, S_S'PIN permit.s complltation
of part.ial subscript combinations (p rtial k0.Ys). Then are somf' limitations
of S_S'PIN One, is that the level order is fixed. The level order cannot hc' changed
without restarting the entire computation. The other prohlem is, since the level order
is fixf'd, accessing level n requires compnting values for all levels throllgh and
including level nl first. It is not possible to access the level we waut direct.ly. Partial
subscript combinations always proceeds from the first level to the desired level; It
is unidirectional. For example, given the array "arr[I][2][3][4][5]''' we cannot access
a partial subscript combination such as "[4][5]," "[2][3][4].' In a relational databa.':i€
environment, the relationship reflected by such combination
The RSPI Function
common.
10
Another prohlem C_SPIN and S_SPIN cannot olve is representation of spar e data.
Database designer~ try to avoid the problem of representation of 'par e data. For
example, suppose an airline owns 100 planes and flies to 100 citie . It is reasonable
to assume that the route that each plane flie~ does not pass through all 100 cities.
A multidimensional array to represent this would he "arr[100][100]." Using such an
array would require using 10000 (100*100) bytes of storage. If ea·h plane flie' to at
most 3 cities on a route, we would like to declare the array "arr[lOO] [3],' becansl" it
would require using only 300 bytes of storage, a savings of 9700 bytes.
R_S'PIN can kf'ep track of which nodes at one level are mapped to which nodes of
the next level. This helps to solve the problem of sparse data representation. Given
a multidimensional array, such as "p[20][40][50]," first, we assume we know that:
a) whether there is a sparse situation; and b) the degree of sparsen ss. Suppo 'e,
after eliminating sparseness, we can reduce th . number of mappings in the "p" array
to "p[20][3][2] " which is a dense array. We open 'W index file constructing a tree
structure for "p[20][3][2]" (Figure 3).
We construct a onetomany mapping. From "p[20][3][2]" each element in lev.J
zero is mapped to three elements in level one; each element in 1 vel one is mapp d t.o
two elements in level two(Figure 3). Therl" are 20 trees in Figure 3, corresponding to
the first dimension in "p[20] [3] [21" which has 20 elements in dimension one. At the
beginning, all nodes are initialized to zero except those at level zero, into which are put.
into index values taken from the first dimension of "p[20][3][2]" (Figure 3). If R_SPIN
operates on t.he subscript combination "p[3][20[8]," RSPIN first uses formula 2 to
calculate the index values corresponding to "p[3][24]" and "p[3][24][8]," then R_SPIN
stores the index values in the first empty node at the corresponding level. We call
11
.......................... , 19
Figure 3: Tree structures for R_SPI
t.hese valuE'S V values.''3
ext R_SPIN uses the location value of the nodes to calculate the index values
111 t.he dense array by employing formula 2. For example, if V value of the array
"p[3] [24] [8] , are stored in t.he first node at each level, then the subscript combination
of location values is "k[3J[O][O].
R_SPIN rE't.rieve· index values for a subscript ombination hy mat.chillg t.he value
compnted by formnla 2 with tho' st.orC'd in t.he index filC'. For <'xampk, if the arra
"p[3][14][ ]" is pa::lsed to R_SPIN, it will E'mploy formnla 2 t.o ca1c'lllat,('> levrl onC'
index value using "3" and "14" a.':l snbscripts. Then it will comparf' t.hC' V val1l0
stored at. each one of t.he thref' nodes of leVf~1 one of t.he tn'f' which is ind xed :i. H il
matches one of those values, it will go t.o next level to determine whether the index
valnes calculated using "3", "14" and "8" matches the V value' stored at t.hat. level.
R_SPIN significantly reduces the number of mappings into sparse array. This creatE'S
a very efficient way of storing and retrieving key'.
The limitations of RSPI
The major deficiency with R_SPINis the inability to predict accurately the sparsenes'
3Thf' first empty node at each level means the first available node at ccrtain subtrec of that level,
rather thau all nodes of tllat level.

12
arbitrary multidimensional arrays. Failure to make such prediction accurately lead'
to an overflow. The following example shows the overflow problem.
An airline example
For example, suppose we have a lE'velstructurE'd airline network with the airline
l'Olltes emanating from a:
a. very big city:
b, big city;
c. middlesized city; or a
d. small city.
We use a multidimensional array to express this structure as: P[a][b][c][d]. Suppose,
we have an airline from a to b, b to c, and c to d. At level two going from b
to c, we may have such questions as: From a given big city; e.g., Dallas how many
airline routes are there to middlesized cities? This leads to the "distribution of keys'
problem.
Distribute of keys
A statf'ment of thf' "distriblitioIJ of k ys" problem follows. On aV<'rage, how
man) airlines are therf' from one of thp big cities to the middlesized cities? Wp
must know the probability of Dallas (a hig city) having five or more airlines to the
next. level (middlesized cities). We also mnst. know what. t.he probability of one big
city having five or more airlines to the next. level (middlesized citie.) We examine
several sample data sets then deterrninp the distribntion of keys for these data sets.
For example, we can choose Honston, Chicago, ... , etc., to dE'termine whet.her f'a·h
of these cities has five or more airlines to the next level. We compute an average
probability for each of these cities having five or more airlines each to the next level
of cities. If the samples are typical and representative, then we apply the result to
t.he entire data set even thongh the contents of t.he data set change dynamically.
13
From this example we surmIse:
1. The distribution of keys can be predicted only in a detail application nVlronment.
2. The number of actual mappings assigned to a certain node at a certain level
of some leveled data structure depends on the prediction for the distribution of
keys. For example, if Dallas ha' a .95 probability of having five airlines fly to
the next level, and we assign only fom airlines to Dallas then there is a very
high risk of an overflow situation .
3. ThE' probability differences for overflow exist not. only among different levels,
but also among different nodes at the same level. For example, suppose Honston
is at the same level as Dallas. Houston's probability of having five airlines fly to
the next level might be .85. One reason is that the actual number of airlines for
Houston and Dallas might be different. Dallas might have six airlines; Houston
rnight have eight airlines.
4. In a dd,ailE'd appli 'ation rovironm0nt, t.h0 overflow prohlPm m'ty hp pr0dict,C'd,
but the problem still exists. Even wit.h a .99 prohabilit.y of having lJ() ow~rflow
for a certain node, the risk of overflow is still there. For example, slIppose Dalla.'3
has .95 probability for five airlinf's, .99 prohability for six airlines. If we assign
six airlines t.o Dallas, there is still a .en prohahility of overflow. One solnl.ioH is t.o
assign more positions for airlines to Dallas. In snch case, the risk of overflow may
approach zero, but space is wasted in a matrix representation of t.he data. In
01. her words Dallas uses only a small number of the airlines as 'igned. Complete
elimination of inefficient (sparse) data storage is R_SPINs target., hut that also
causes the overflow problem. It is a dilemma: complet.e elimination of sparseness
causes overflow; complete elimination of overflow requires sparse data to be
o

14
tored in oversized arrays.
5. R_SPI employs formula 2 twice. Fir t, it u es formula 2 to calculat an index
value. Sinc formula 2 has an e'timated time complexity of
for some machine defined constant c, and nnmber of dimensions n its repeat d
use during a single operation makes the method relatively inefficient.
6. The sparse situation that occurs in this example is: there are many middl sized
cities, but not. all of them are connected to all of the big itie '.
o

CHAPTER IV
DY AMIC SPI SCHEMES
In this chapter: we introduce two dynamic SPI schemes: index dynamic SPI
(D_SPINfor short) and directory dynamic SPI (T_SPIN for ·hort). Both algorithms
evolved from t.he limitations of the static SPI s. We explore some de 'ign i 'sue " such
as space utilization, splitting control, and bucket stmcture. We abo compare the
dynamic SPI s with t.he static SPI s.
Index Dynamic SPIN
From the limitations of the R_SPIN, we design a new index SPI , which dynamically
changes its nnmber of nodes when overflow happens. It eliminates the dilemma
pre 'ented in Chapter III.
Preliminarie .
Mappings between levels
The example in last chapter shows that different uodes at. the same level may
have different numbers of mappings, since the nodes may have different probabilities
of overflow. The trees in R_SPINs index file are not the same. A level may look likp
those shown in Figure 4. In Figure 4, tree one has three children at level one and each
child at level one has two children at level two. For tree two, there are two 'hildren
at. level one  one has two children at level two, the other has three children at level
two.
15


16
Fignre 4: Different shapes of tr es in an index file
Sparse data mappings
It is common in sparse storage situations wit.h R_SPIN to have different number'
of mappings at the same leveL Suppose a teacher teaches three lasses each class has
a maximum 30 students. A multidimensional array to express the teacher's cl8..':i'e'
with t.he students in each class requires a mult.idimensional array, V[3] [30]. However, if
only 10 student' enrolled in t.he first. class, 20 student.s enrolled in the second cla:..;s and
27 st ndents enrolled in the third cI8..':i " then there is a sparse data storagr si t.nation.
Each nnmber at t.hf' first lrvel ( 0, 1, 2) h8..':i a difff'rent mapping to tlw second If'vel
(10, 20, 27).
Even when there is the same number of mappings for each node at the same
level, the nodes commonly have different contf'nts. Thr 30 st.udents iII class one will
be different. from those in cl8..':is t.wo, as well 8..':i from those in clas,' three, There xist.
situations that we have the sam contents for multiple mappings, For example, some
st.udents may enroll in more than one class. Taking the abovp informatioIl t.ogether,
d. levelstructnred graph shows the sparse torage situation clearly.
Figure 5 shows four node" at level one and nine nodes at level two, Each node
at level one rna" have 1, 2 or 3 mappings to level two. Some mappings have the same
o

17
Level 2
o
Figure 5: A leveled structure for the sparse ~it.uation
o 1 2
Level 2
1 1
Figure 6: Sequential natural 111lmhers indexing th(' 1I0<!0S at. thr same lew']
content (Two or more pointers at level on. point to the ~ame node at. I vel two). Al '0,
~ome nodes at level two are unused.
Tree structured arrays
Figurt~ 6 shows a tree structure corresponding to the multidimensional array
A[3][2][2]. The subscript key values appear beside the nodes.
Formula 2 (Chapter III) calculate' the index values at level one. It r ,'\llt· in:
[0] [0] =0 [1] [1] =3
[OJ [1] =1 [2J [oJ =4
1
Figure 7: A nospar e (dense) tree
[1] [0] =2 [2] [1] =5
These index values corre 'pond to nodes as shown in Figure 6.
Level one index values are sequential number' progressing from t.h leftmo t
node to the rightmost node. The index values for level t.wo al'o may be sequential
numbers. Index values fill the nodes (Figure 6).
Formula 2 yields:
level two
[0] [0] [0] =0 [1] [1] [0]=6
[0] [oJ [1] =1 [1J [1J [1] =7
[OJ [1J [OJ =2 [2J [oJ [0] =8
[oJ [1] [1] =3 [2J [OJ [1] =9
[1] [0] [OJ =4 [2J [1] [oJ =10
[1J [0] [1] =5 [2J [1] [1J =11
Formula 2 and Figure 6 are consistent. Both reflect the transformation properties,
which are stated in Chapter III.
Let the tree i, (Figure 7) be dense. At level j, the fir·t node" index valuE' is k
t he next node will bt' k+1, next to next: k+ 1+ 1, ... , until th last node (rightmost
Unary tree
Binary tree
Figure 8: Different initial trees
19
node) at level j. The values appear as:
k, k + 1, k + 2, k + 3, k + 4, ...
We also know that the index values for tree iI, which is at the left side of tree
i, at level j, The last node's index value at level j for tree iI is k1. Continuing to
the left, we have all index values at level j:
.... k  4, k  3 k  2, k  1.
General description
Given a multidimensional array (or comhination of keys), P[20] [30] [40], we df't.ermine
the maximum number of actual mappings at each level t.o pliminat.es sparseness.
This does not pliminate overflow. In ot.her words, we may choose p[20][1][1], whi h
cert.ainly eliminate sparseness, but also has an overflow prohlem. Overflow is not a
prohlem during initialization. Figure 8 shows several structural choices for eliminating
sparse data storage.
If it is unknown whether there is a sparse storage problem, or if it is known there
is a sparse storage problem and the prohability of overflow is unpredictable, then, the
unary t.ree may he the best choice t.o guarant.ef' dense storage utilizat.ion Binary trees
20
Figure 9: Trees with splitted nodes
also can be used in certain applications 4 If we can predict the probabilities, we may
choose a conservative number of mappings between levels to minimizing the ri 'k of
overflow, while w(' also eliminate sparse data storage problems.
The reason for choosing i1.mong different shapes of trees is to initialize those
trees efficiently. Any tree will grow after initialization. A tree growing from a binary
structure allows faster acces::; than a tree growing from unary stru tUf(~. There. is no
significant increase in access speed when using a tree with radix gI'f~ater than two.
The binary tree is the starting tree in this example; i.e., p[20][2][21. Each node
splits into two parts shown in Figure 9.
Using the preliminaries described above, we fill in the index value::; the left par1.
of the node, shown in Figme 9. At the same level, indexes are sequential numbers.
The index value for the root node is trivial, since t.he number in dimension one i::;
always the same after eliminating spars storage; i.e., from P[20)[30)[40] to p[20][2][2] ,
the number 20 does not change.
Given a level two datum with key k1 and with index [3)[14][81 (kJl3][14][8]), we
employ formula 2 to calculate the V values for k1 . Suppose Vi is the V value for
4A billary tree call he used in any application. It it> particularly ut>cful ill applications such as
the airlille problem.
rt
21
4
Figure 10: Insertion situation for index dynamic SPI
level one, VI = kd3][14]. We put VI into the right part of t.he first available node at
level one, shown in Figure 9. The index value for \Ii is just the number in the left
part of node, In this case, it is six.5 Let Vll stands for the V value for level two,
VlJ = kd3][14][8]. We pnt. V] I in the first availahle child node of \Ii. We also can
obtain the index value for Vll easily. In this case, it is twelve.
Suppose we want. to insert k2 [3][17][21]. V values( V2 , V21 ) will he Pllt in the
corresponding nodes, shown in Figllfe 9. ext., suppose we insert k3 [3][19][7]. mce
t h('rf' is no empty node at lev('l one availahle, therf' is an ovrrfiow sitllatiolJ.
To handle t.he ovrrftow situation, we create an empty nodp at level one, which is
also connected to the root node 3 (Figure 10),
The newly created node is split into two parts as before. The index value for
t he new node is the maximnm index value at level one (which is 39) plus one; i.P. ..,
39+1=40. We put !lumber 40 in the left part of the new node. The right part of
new node is still used to store V values. All other nodes are left unchanged. ow
the maximum index value for level one is 40 (Figure 10). If tree 4 has oVf~rflow at
lpvrl onf', then we create a new node, split it, put the index value 41 (40+1) into the
.)wC' du HOt. 1Ie'C'd to II:i(' FOflllllia 2 to l'aklllale illd('x vallie'S. The' iudex vallle:i arc already ill
pla('C'.
22
left part of the node, and put the V value into the right part of the node (Figure
10). Index values for level one are no longer sequential numbers. However this new
method observes a prime principle of SPI : a distinct V value is associated with
a distinct index value. The index values are used to r triev. data records from a
database.
\Ve discuss several common operations for this new algorithm.
1. Insertion
Insertion is discussed above. An inserted node is filled with two number::>. The
left part of the node is its index value, the right part of node is its V value.
Also, the left part of each node should never be empty. An index value should
always occupy the lefthand "data area."
2. Searching
This uses the R_SPIN method for searching. It. calculates V values for the
key, or combination of keys. then uses the ca1culatf'd V valu ::> to ::>carch the
corresponding tree, nodl' by node, level by If'vel. If t.lw calculated V vahLP
matches the V value stored in the node, then search i::> ::>llccessfllJ. The index
value of the node is retllmed. The search is unsllccflssflll when a V value of 0 is
deleted.
3. Deletion
Logically, if the deleted node s index value is X, then we mllst find all nodes
whose index value is larger than X, then reduce their value' by one. A prohlem
arisE'S with those nodes whose index values are larger than X. They can he
located randomly on both sides of the deleted node. We must search every
node at that level, and check whether the index value is larger than X. One way
to avoid this tedious work is lazy deletion. \Vhen we want to delete a node, we
rt
23
...............~ .
K: index value;
V: V value;
Figure 11: The final result of the index tree in index dynamic SPI
mark it as deleted, but leave it physically in place. When the marked nodes
reach a threshold, we start the physical deletion mechanism (garbage collection)
physically deleting all marked nodes during a single search.
After building these trees, a po::;sible final result is shown in FigurE' 11. The
shape of each tree is different,
This new method has t'f'veral advantages:
1. It completely eliminate sparse data storage prohlems
2. It eliminates overflow.
3. It. employs formula 2 only once,
Like most traditional tree alg'Orithms all trees' shapes are changing (growing or
shrinking) dynamically, according to the operations performed.
Space Utilization
Assume that data is stored on magnetic disks. Data is fetched from magnetic
disk drives in pages of a fixed size. Further assume we have built a very large index
file of more than 100,000 trees on that. disk already. Each fetch of a page takes ahout
«
24
ex va liP
record
II
"""
II
II
II
" i ./ I"I
li+l1/ " /' II
,~ """"
Return re
V value search Return ind
Figure 12: The whole picture for data indexing and retrieving
10 milliseconds on average on the fastest disk drive:::;. Thi:::; is the time it takes for
the disk arm to move to the correct cylinder, for the disk to revolve until the head
is over the correct place in the track, and for the data to be transferred to the main
storage. For a search operation, given a combination of keys index dynamic SPIN
::icheme calculates a V value, then the program accesses the index file to try to find
the :::;ame V value. The index file consi:::;ts of :::;everal pages (hlock:::;). The pagp t.hat.
contain:::; t.he tree number is brought into main storage.
Snppose each page contains 30 trees. Since the root numbers of trees are kept. as
sequential numbers, retrieval of the tree with the index valuf' matching the V value
is direct. The index value points to a location in a data file which is also on the disk
and brokrn int.o pages by the operating system6 . What is associated with each index
value may be a data record, or another file pointer, depending on t.he application.
Page fetch or :::;wapping is the same as above. The total situation is shown in FigllH"
12.
No matter how large a page is, the root numbers of the tree are seqllential
natural numbers, as are the index values. This kind of structure is useful sine a
6 All index values are arrallged illto scqncIltial IlIlIl1bcrs ill all pages.
•
25
digital computer is a sequential device.
Directory Dynamic SPI
First we presented preliminaries of directory dynamic SPI (T_SPIN for short).
Then we gave a general description and several design issues about T_SPIN.
Preliminaries
A General Design Method for SPIN
The transformation mechanism behind R_SPIN is studied first. Given a parse
multidimensional arra', such as P[20] [30) [40], and its dense form p[20] [3] [2] (different
from D_SPIN, R_SPIN gives us both a sparse form and a dense form before we can
go through the R_SPIN function), we have an index file containing 20 tree tructure
of the form shown in Figure 3.3. All nodes initially contain the value zero. When
the subscript combination, such as P[3][24][8], is passed to R_SPIN, R_SPIN use'
the formula 2 to calculate t.ht. corresponding index value, snch as V. This is t.ht' fi rst
transformation, which transforms p[3][24][ ](a mult.idimensional dat.aspace) t.o integer
number V (a singly dimensional data space). Such transformation is not enough, since
t.he result V is in the sparsE' dat.a space. \Ve need to transform V (big numher, Hparsp
form) int.o n (small number, clenHe form). In ordpr to do t.hat., we st.ores t.he V in thl'
first. empty node at the correr:;poncling level. In thir:; case, it's level 2. Then R_SPIN
uses the location value of that node into which it. already put V , to calculate the
dense form index value n using the formllia 2. R_SPIN returns the valup n.
\Ve can see from R_S'PIN that the whole process actually is divid.d into two
steps:
First step, the transformation from a multidimensional data space to a single
dimensional data space.
26
Second step, the transformation from a sparse data pace, which i . already single
dimensional, to a dense data space.
DjIPIN also shows this twostep transformation. First tep D_SPIN i the
same as R_SPIN; Second step. because D_SPIN already stored the index value(n
value in this example) in the left. part of node, it retrieve' thE" n value from the
node, instead of calculating it. So, if we work in a spars. multidimensional data
space, the aboVE' twostep transformation is necessary if we want to index and process
data efficit'ntly. Besides R_SPIN and D_SPIN, various techniques can be Ilsed in
the process of transformation. All t.hosE' techniqnes either have a stat.ic prop .rty
as does R_SPIN, or a dynamic property, as does D_SPIN. In the view of de 'ign,
the twostep transformation gives us a general design method to design a new SPI
t.echnique. In other words, WE' only need to design a transformation t chnique for
each transformation step. The TSPIN proposed in this section is designed according
to the twostep transformation design method.
General Description
T_SPIN is evolved from extensible hashing[8]. Extensibl hashing is a ml"thod
of organizing a file so that it has the following three propert.ies[13]:
1. The file will automat.ically expand as necessary to accommodatf' new records.
Thp ('xpansion will not requirt' a reorganizat.ion of the file.
2. The file will automatically contract so that the prohability of the load factor
dropping below 50% is negligibly small. As with the expansion, thf' contraction
will not require a reorganization of the file.
3. The file structure will allow retrieval of a record by primary key with one access
to the file.
27
Primary 1 2 Directory 3
File
key H(key) index ~ pointer
1: Hashing function( SSPIN function in directory dynamic SPI
2: Extract fir~t d digit~
3: Table look up
Figure 13: Key transformations in extensible hashing
Obviously, extensible hashing file is a dynamic file structure. The sequence of
transformations by extensible hashing is ~hown in Figure 13. The first transformation
is a hashing function that maps the keys randomly onto some fixed addre 's pace
represented by the range of the hashing function. The first few digits of this result
are then extracted for use as an index into a directory. The directory contains point.ers
that point. to the file.
Considering the twostep transformation rn .thod in SPI design, we can seC' that
the first transformation in Figure 13 actnally is doing t.he work for t.he ~econd ~tf'P
in the two tep transformation: tran~forming a single dimensional sparse data spa f
into a dense data space.
In order to combine the dynamic properties of extensible hashing with SPIN
techniques, we must add the first step of twostep transformation to Figure 13, whi 11
is, a t.ransformation from a multidimensional data space to a singly dimensional data
space. We employ formula 2 to do this \'\lork. How.ver, we do not need any tr e
structurf's in this step, because the results of first step are arranged linearly so t.hat
they can be easily hashed in the second step transformation.
ow, the whole picture of transforrnations is shown in Figure 14.
28
Irnultidimensional
data
space
11
~ingle dimensiona 2 kIense data 3 ~irectory
data r space r
space H(key) index
nula 2 2: Hashing function 14
file
ract d digits 4: Table look up
pointer
1: Fan
3: Ext
Figure 14: Key transformations in T _SPIN
There are several design issues related to the transformations in Figure 14.
1. In Figure 4.11, only first three steps transformation belongs to T_SPIN, because
t.he fourth step (table look up) is heyond a indexi ng technique, and it will be
handled by file system.
2. The file pointers do not point to individual records, like R_SPIN or D_SPIN, hut
rather to blocks of records called buckets. A bucket is a large block of I'f'cords,
all of which are read with one physi al read operat.ion. The mf'thod of placing
and locating records within the bucket is not important. sincf' no additional
phy 'ical I/O operations are reqllired[23]. Buckets may he added to and delf'ted
from the file at any time.
3, Thf' ha:'hing fun tion used with T_SPIN can he any hashing function as long
as it sat.isfies the following fom properties[13]:
(a) uniform and random distribution of keys over the range of the function;
(b) small variations in the key will cause large variations in the value of the
function:
(c) synonyms occur no more than random probabilities would allow;

29
(d) the range of the hashing function he close to a power of 2.
4. The selection of the range of the hashing function is omewhat arbitrary and i
not tied to the number of records in the file. The range of hashing function is the
range of address space of dense data. For a dynamic ::icheme we cannot know
in advance precisely how large the dense spacE' need to 1>e[4]. Th 'el.chon
of the range of the hashing function actually depends on the given practical
application. In R_SPIN, such arbitrary selection of the range of dense data space
leads to the overflow problem; in D_SPIN, there is no this problem, becausE' the
index trees (which actually reflects the range of dense data space) dynamically
expands or contracts. In this case, T_SPIN, the arbitrary selection may lead to
synonyms, which means more than one record is mapped to the same location
by the hashing function. Considering the basic unit in a data file is the bucket
in T_SPIN, if synonyms happen, we can put t.wo records in the same bucket.
This means we map them to the same location.
.) According to the properties of the hashing function, it i.. convenient. to choose
a rangE' of hashing function equal to the first prime n11lnher small0.r than a
"round" binary nmnber that is a pOWE'1 of 2[7]. For examplf', t.he largest prime
number less t.han 216 is 65,521.
6. The third transformation in Figure 15 extracts a relatively small integer from
H(key) by using the first few digits of the hashing function. There are several
arbitrary choices in the selection of digit·. It is advantageous to USf' as small
a base as possible' hence, binary will he used as the base and hits will he the
digits. Another choice is which digits to use. Conventionally, the highorder
digits are selected[7].
7. Thl? digits extracted from H(key) are then llsed as an index into a onl? dimen
30
d=3
A H(key)=OO...
H(key)=010...
H(key)=011 ...
H(k .y)=1...
Buckets with records
Directory
Figure 15: Directory of order d=3 with four huckets
sional array of file pointers. This array is called the directory and contains 2rl
entries, one for each combination of d digits from H{key).
8. The number of digits extracted from the hashing function value, the number
of entries in the directory, and t.he number of bucket.s in the file all will change
automat.ically as the file expands and contracts. Consequently, it. is necessary
to store some parameters to indicate the current state of the file. Specifieally,
the number of digits, d, used to illdex into the directory are st.ored wit.h th('
directory as shown in Figure 16.
The transformation shown ill Figure 16 uses the first three binary digits from
the hashing function to partition the address space of the hashing function into eight.
eqnal segments. These eight segments correspond to eight entries in the directory.
For example, suppose H(key)=0110100101100101 in binary. The fir t three digits,
011, have a value of 3. By using 3 as an index into t.he directory, we find a pointer
t.hat points to bucket C.(The first element in thf' directory has an index of z ro.)
The complete sequence of ret.rieving a record using T_SPIN consists of six steps.
First, the formula 2 is applied to produce a single dimensional data{key). Second, the
key is hashed to produce H{key). Third, the first d digits{bits) are extracted from

31
multidimensional 1
I:!irectory ~ data  file
space index pointer
1: The TSPI function
2: Table look up
Figure 16: TSPIN transformations
H(key) to form an index into the directory. Forth, the index is used to locate the
appropriate bucket pointer in the directory. Fifth, t.he pointer is used to read the
bucket. into primary memory. Sixth, the desired record is located within the bucket.
The T_SPIN Function
From Figure 15, we see that the first t.hff~e step::; of key transformation are act.
nally functional operations. The outpnt of formula 2 is the input. of the hashing
function, and the OlltpUt. of thE" hashing function i::; the input of thf' extraction function.
This chain of fanchon call::; sugge::;t::; that. we may comhine thf' three functions
into one function to eliminate the overhead of function calls. The T_SPIN function is
the result of such combination. The input. of the T_SPIN function is the multidimensional
array and the number of digi ts we want t.o draw (d value) l and output of the
T_SPIN function is the digits extracted from the hashing result (the hashing process
is embedded in the T_SPIN function).
After combining the function calls together, the key transformation is simplified
as shown in figure 17.
Splitting Control
c
32
d=3
Figure 17: Dititribut.ion of keys among buckets after splitting bucket D.
The reason we use huckets instead of records as basic units in data file is to
allow the data file to expand or cont.ract gracefully as the number of records varies.
In other words, we want to keep the perfect dynamic property for TJ;PIN. In this
section, we discuss thf' split.t.ing control in data file.
A rule is imposed on the buckets that. sets a minimum and maximum load factor
for each hucket.. Typically, these are 50o/c and 100% retipf'ct.ively[13]. A change' ill
th( file structure is triggered whenever these limits are violated hy t.hf' addit.ion or
delet.ion of a record.
Consider the small file of Figme 16. SUPPOSf' that. a record is to he added which
maps into hucket D If hucket. D is aln~ady full, there is no room for t.he new record.
This triggers an expansion of th . file. A new bllcket, E, is added to the file. Half of
t.he pointers that point t.o bucket D are changed to point to bllcket E. The records
in bucket. 0 that are reached through the pointers that were changed must be moved
to bucket E. This will be approximately half of the records t.hat were in huckf't D.
This leaves both buckets approximately half full and there is ample room for the new
rf'cord.
The result of this split is shown in Figurf' 17. Before the plit, all records for
_.
c

33
which H(key)=1... were in bucket D. ow thos where H(ke )=10... are in buck t D
and those where H(key)=l1... are in bucket E. Th parameter dl shown with ach
bucket in Figure 17 indicates the number of digits of H(key) whose value i' commoll
to all records in the bucket. This must always be equal to or Ie s than the number of
digit.s, d, used to index int.o t.he direct.ory.
The process for contracting the file is the reverse of enlarging it. Buckets must
be combined if t.hree conditions are true[9]. First, the average load factor for the two
buckets cannot exceed 50%. Ot.herwise, there would not be room in the combined
bucket, for all t.he records. Second, the buckets to be combined must have the same
value of dl. Third, the keys of the records in both buckets mllst share a common value
of the first (dll) digits of H(key). These last two conditions are nec ssary so that
the records of the combined bucket will share a common value of the first. d' digit' of
H(key).
Bucket Structure
Besides the splitting control, another important de 'ign is::me is thf' huckC't sLrtlcture.
However, we don't need to pay much attentioll to bucket st.rucl,me for two
reasons. First whatever method is llsed to organize the hncket. internally will !Jol
affect the number of physical I/O operat.ion::; and so will not. have a significant. impact
on most file operations. Second, there are many feasible solutions, with no clear
preference between them[7].
Structures as simple as a sequentialchronological organization with a bit map
are feasible. To find the desired record, each record in the bucket is examined in
sequence and its key is compared to the given key until a match is found or the end
of the bucket is reached. The ordered relative file could also he llsed for the internal
bucket structure and would permit a binary search to be used to find the desired
record. The problem of insertions and deletions is solved hy moving hlocks of records
>.
:'·f :... 'Xl.:,i
::'::1
~.. ,
2~;1'
:;)
i:l/
_"1
~:lj
34
as necessary to make room for a new record or close up a gap when an old r cord
is removed. Thus, the reorganization is continuous and can be done ntirel within
main storage.
Comparisons Between Dynamic SPI sAnd RSPI
ThE' comparisons between dynamic SPINs and R_SPIN in this section focu' on
t.he design issues. The performance comparisons are covered in Chapter VI.
There are several differences between dynamic SPIN' and R_SPIN:
1. TSPIN, D_SPIN and R_SPIN can all be analysed by twostep transformation
method. But they have different operation::; on second step.
I SPI I First step I Second step
3. Different overflow problem handling techniques:
(a) R_SPIN: gives a message when overflow happens;
2. The contents of index file and basic units of data file are different among the
R_SPIN Formula 2 Formula 2
D_SPIN Formula 2 embedded in the nodes
T SPIN Formula 2 hashing function
R_SPIN tree structures records
DSPIN tree structures records
T_SPIN directory table buckets
three SPINs.
I SPI I Contents of index file I Basi' unit, of data file I
(b) D_SPIN: solves the overflow problem by dynamically expanding or CO][
tracting the tree structures index file;
(c) T_SPIN: solves the overflow problem by allowing mo[(~ than one index key
to be mapped to the same directory entry. (As long as the first few d digits
of index keys are t.he same, they all belong to thE' same directory entry.)
4. Different index retrieving method:
35
(a) R_SPIN: traces the corresponding nodes in the same tree stru ture, and
retrieve the information the nodes contain.
(b) D_SPIN: the same as R_SPIN.
(c) T_SPIN: one directory table corresponding to each level of multidimensional
array, and each entry in the directory table contains the file pointer
to a certain bucket.
5. Different dat.a file structures:
(a) R_SPIN: records are basic units of data file; many insertions and deletions
may cause data file reorganization.
(c) T_SPIN: the data file gracefully expands or contracts according the insertion
or deletion operatiolls.
>.~
;.,~
;",. :Iii :t.• %:.
~... 2~;1·
:;l
.i:.I,,'
~::I
CHAPTER V
IMPLEME TATIO I
An implementation of the two dynamic schemes has been done under NIX
and written in C. Bot.h implementations are based on the twostep transformation
method (Chapter IV). The implementation utilizes the assumption that all operation'
on data occur in main storage. In this chapter, we first present data tructures for
both schemes, then we give an implemf'ntation steps for each dynamic scheme.
Data Structures
Index Dynamic SPIN
Thf' data structure for a single nodI" in the index is shown in Figure 18. The
node contains three integer numbers and two structur , point.er '. The int.eger lefLpart.
cont.ains the K value, which is the initial sequential int.eger nnmber. '1'he int.eger
righLpart contains the V valw\ which is the computed index numher. The intf'ger
mIm_oLchild contains the number of childnm the node has. The two stmct.m . pointers
are children pointer and parent. pointer. A node may have more t.han one childrf'n.
Except the top node, each node has one parent node.
Directory Dynamic SPIN
The data structure for directory dynamic SPIN is the same as R_SPIN. Both
use multidimensional array as their basic dat.a structure.
Index Dynamic SPIN
The D_SPIN function implements the basic design idea in Chapter IV. It assumes
t.hat all operations on dat.a occurs in main st.orage. The st.eps of this program are:
36
§.!
:LI
,_~...,.. _..,.
.:~)
..s..'I.,,'
~~~!

37
struct {
int leftpart·
int rightpart;
int num_oLchild;
stru t tree. *child [ ].
strllct tree. *parent;
}
Figure 18: Data structure for D_SPIN
Step IP 1 (Initialization) Compute the parameters fOT the formula 2.
Step IP 2 (Compute index value) In sparse situat'ion, compute the index
value at each level. Put the result in an integer array "rect."
Step IP 3 (Check root node) Check whether the Toot node is NULL; 'if it it,
it needs to be initialized, then go to IP 4. Otherwise, go to Step IP 8.
Step IP 4 (Root node initialization) Initialize the root node. Also, inib.al1.zp
t.hp. rhild nodes for root. Thp. lefLpart of earh. node is assigned sPJj'uent,jal intpgpr'
number from zero to thp maX1.m:um number of the nodes. The TighLpart of parh nodp
is assigned zpro. Assign the pointers of r:h.ild nodes fa a temporarily point ,r army,
"tempi." Tempi points to level zero. Set r.ounter of th.is level fo the rwmber of nodes
at th.is levP[ plus one.
Spt level J=O.
Step IP 5 (Go down one level) Create a tempomrily pointer array "temp2",
whirh 'is a pointer to node at level one, The lefLpart of each node at level onp 1..'1
th,p sequp.ntial integer numbf'.r· jTom zero to the maximum numhf'.r of the nodes. Thp
"'ighLpart of f'.ach node is initialized to Zf'.ro. Set counter of f;his levf'.l to the number
of nodes at tins level plus one.
:;...
~,I
3
Step IP 6 (Connect level j and level j+l) Allocate memory space for child
nodes of tempi. Assign nodes of temp2 to the address of child nodes of templ' As ign
pointers of nodes of tempi fo the pointers of parent nodes of temp2.
Step IP 7 (Go down next level) Reinitialize nodes of tempi; A ign pointer
of nodes oftemp2 to pointers of tempI. Then, go to IP 5, ifj<=L (L is level numb r).
If j>L, go to step IP 8.
Step IP 8 (Retrieve nodes or add new nodes to the trees) Locate the tree
to be retrieved or added by using the subscript at level zero. Set the root of selected
tree to subroot. Then, if the execution mode 'is "r" (retrieve) do Step IP 9; if the
executwn mode is "a" (add), do Step IP IO.
Step IP 9 (Retrieve index nodes from the tree) Starting from the first
node at level I, if the righLpart value of the nodes of the subroot's children matches
the index subscript at levell, then, go to next level. We repeat this step until we rear:h
thp. last level of the tree. At any levd of the trep., zf fhp m.dex s·ubscn:pt. does not mat.ch
any nodes at that level, it means that there zs no surh index val1.Lp. t.o be retrieved in
thp. index trp.e, and return 1.
Step IP 10 (Add new index nodes to the tree) Starf1:ng from thp first node
at lp.vel 1. if we can find the matched node, which means the inde.'E subscript valup
at levpl one already exists, then we go to neJ;t. level. If we cannot fi.nd tlw matched
nodp, wp create a new pat.h whir;h wntains one new no&: at each lp;'uel. Set left_p0:,. t.
of thp new nodp. to the r:ount.eT value at this levp.l. Update r:ounter value by adding
one. Assign righLpart of new nodp. th.e index subscript valup.. S'et numbp.T of r;hildrP.'n.
of th.e new node to oru'.. Then, we go to next level, and Tepeat th1:s search.
Directory Dynamic SPI
Thp T_SPIN function implements the basic design idea in Chapter IV. It assume'
t.hat all operations on data occurs in main storage. The steps of this program are:
~.
;;:,.
,~..,. _ .1'
:., ,I
~f J ..Ii ~.
"
1 39
Step DP 1 (Initialization) Initiahze the parameters f01' formula 2. Set level
i=O. Create an integer array "kree" to contain the index valu s returned.
Step DP 2 (Compute the index value) From level 0 to lev l L, compute the
'i.ndp.x value at each lp.vel by using formula 2. p.ut the results into 'krec." This index
value is equal to V value in R_SPIN, which is the result of conversion from the parse
multidimensional subsrripts to a one dimensional index value.
Step DP 3 (Define various parameters) Define the size of hash table, which
will bp. used in hash furu:t'i.on. Also define the value of d, which 'is an integer numbp.r
specifY'ing how many digits w'ill be extracted from the hashing T'esult.
Step DP 4 (Hashing the index value) Given a hashing function .. hash the
index value at each level, Store the results into "hsesult, " which is an integer array.
In this prograrn, the hashing funct'ion 'used is:
hash(.T) = :rm.odH_SIZE
Step DP 5 (Extract frist d digits from the "h_result") The hashing result.
in Step DP 4 is in decimal formed. It l,S transformed into lnnar'y format. Thp.n, d
is e:L'fracted by d d1,g1,ts from its h'l.gh order end. Put the rp.sult 'into integer array
~.
•••
.,.. _.,
).,  ,'
..' ., I
JII
~I : • I
~. A' ),
CHAPTER VI
PERFORMA CE A ALYSIS FOR SPI
Testing Program
The testing program experimentally examines the average time complexity of the
DjJPIN and T_SPIN functions. It measures the average running time for insertion
and search operations. In order to compare the performances of dynamic SPI s with
R_SPIN, the average running time of R_SPIN is also tested. The steps of the te t
program are listed below. In t.he following steps, the test function means D_SPIN,
T_SPIN or R_SPIN function.
Step TP 1 (Generate permutations) Recursively generate disbnct multidimpnsional
suhscnpt combinations (or permutations), wh·ich. is used dur·i.ng tp:sting.
They are stored in main storage.
Step TP 2 (Set total test times) Sft a number "N" fot th.e tot;al n:umber
of t.psting r.ycles. In ear.h testing r;yde, a gpn,Prated permutalum. is passed to t.hp test
f1J.nrtion.
Step TP 3 (Initialize test function) There are many stahr; paramP.tfTs invalved
m the test funr:t.ioTl, and their r.omputation takes some time. So the test fun{'
hon must be in'itialized before tests begin.
Step TP 4 (Measures the overhead) Sinr;e this program lests the avemgp
time behavior of test function, the test functwn is repeated within a test1:ng loop.
In this step, a loop is implemented with all the neCf~ssary instrudions except the
functional call of test funr:tion.
40
~.
;.".
.,
'0•• 1
~II
"1' . "_I,i "I
).
41
000091
0.0009
Auerage
0.00089
Time 0.00088
(sec.) 0.00087
0.00086
000085
0 0 0 0 0 0 0
0 0 0 0 0 Cl 0
V (D co Cl N v (0
~ ... ...... ...
y = 1E05Ln(x) + 0.0009
R1 = 0.978B
o 0 0 0 0 000
00000°0°0 .c..o... °N CC''II VN C(D'I <Nxl N C") C")
Total No. of Permutations
Figure 19: The result of experiment using RSPIN function
Step TP 5 (Execution loop) This loop tests all the necessary in tructions
and execution call to test function. Record the testing time.
Step TP 6 (Caculate average running time) Subtmct overhead time from
the total exaction time, then divide the result by number uN".
Performance Analysis for R_SPIN
The original R_SPIN function is modified so t.hat it operat.es on t.he main st.oragE'.
We asSl1II1E' that the dense dimension size (t.reE' branch size) is pight so that gi Vf>n
a tot.al number of permutations (3200 in this examplp), th . prohahility oj' overflow
situat.ion is relatively small. All graphs in this chapt.er are created hy Microsoft. Excel.
Figure 6.1 shows the experimental result.s of R_S'PIN with a permutation definition
of fiVE' dimensions and dimensional size of eight in each dimension.
In Figure 19, "Total o. of Permutations" refers to the different total permuta
ion numbers used in this experiment, and "Average Time" means the average
insertion time.
Figure 19 shows that. the average insertion time of R_SPIN function is logarithmic.
The trendline fnnction is logarithmic function, The Rsquare of the trendline is
0.9788 (close to 1), which means the simulation of trendline is good.
...
• ••
r·)
c'l.E, ::,
.;:.. _.. ) ..  .1 1
~. , .
..j .
42
+ r_spin
Log. (r _spin)
y =3E05Ln(x) + 0.0008
R2 =0.9512
0.0009
000089
0.00088
0.00087
0.00086
0.00085
0.00084
0.00083
0.00082
0.00081
00008 +1ttttt+1ttttt+1t+ttt+tt+t++1
Average
search
time
(sec.)
o a a a 0
00000
... ("l ltl to 0')
N N N N N
Permutations
Figure 20: The result of experiment. u:;ing RJ3PI function
This average insertion time does not include the initialization time for index
trees. It only counts the insertion time on initialized index trees. The theoretical
proof shows that average time complexity for insertion or search on this kind of tree
structures (incomplete tree) has logarithmic behavior[12J.
The search operation in R_SPIN
.."
.··4r,~
c:~
=':1
The search experiments in this example use some of t.he permutations t.hat.
...
.·..f .. •i,
were inserted in earlier. At'; a resnlt, all search operat.ions in thf' pxpf'rinWllLs an'
Sllccpssful. Figure 20 shows the re:::mlts of experiment with dimensional size eight.
.,
"
The f'xperiment. indicates that. the average search time of t.he Rj;P1N fanction ha.':i
logarit hmic t.ime complexi ty.
Performance Analysis for D_SPI
.,
:~ .,
'J
\Ve assume that the operations occur in main storage, the binary t.ree i . the initial
index data structure for D_SPIN, the level nnmber for index tree is fom (dimension
number is five). Figure 21 shows the experimental results of D_SPIN.
Figure 21 shows the average insertion time of D_SPIN is a logarithmic time
complexity. The function of trendline is logarithmic time complexity. The value of
Rsquare is closp to one (0.9553), which means a good simulation of the trendline.
43
o 0 o N ... v
y = 0.0001ln(x) + 0.0009
Rl =0.9553
00000
ONvWCl:)
(Y) (Y) (Y) (Y) (Y)
+d_spin
Log. (d_spin)
0.0014
0.00135
0.0013
Allerage 0.00125
Time 00012
(sec.) 0.00115
0.0011
0.00105
0.001 +it++tt++t+t+i+++t++t+t+t+t+i
Total No. of Permutations
Figure 21: The result of experiment using D_SPI function
In view of algorithm design, there are several factors contributed to this logarithmic
average insertion time.
1. Initialization time is not included in the test re ult. The initialization time
means time for creating initial tree structures. It usually cost much more than
that for operations (insertion or search) on initialized trees. In D_SPI . besides
.,,,
.'.
the initialization time, there is abo ::lome time needed t.o reate new nodf's when •1 '
o\'erfiow happens. The total number of thosE' new nodes depends on how many
nodes are created and initialized at the beginning and how often the overHow
happens. Generally, the more new nodes are initialized at the heginning, t.he
.,
les::l new nodes are needed t.o be created during execution.
2. In practical applications, smce the probability of overflow can be predicted
roughly. the number of initialized nodes should at least count 50% of total permutat.
ions so that less time will be spent on creating new nodes and general
performance of D_S'PIN is acceptable. In this example, it. counts 37%. If there
is no overflow t.hen it is reasonable for DSPJN having a logarithmic execution
behavior, because the D3PIN execute' on the same index tree strncture as
R_S'PIN, and R_SPIN has been proved to have a logarithmic average time com
44
ooN
("')
o 0 o 0 co 0
N (Y)
y =O.o003x + 0.001
R2 =0.9985
oo
v
N
000 0 0 0
000 0 0 0
N "'" <D co 0 N
N N
+ d_spin
Linear (d_spin)
ooo
oo
co
oo
<D
0.009375
0008375
0007375
0.006375
0.005375
0.004375
0.003375
0.002375
0.001 375 ~F++ttt++iI!tt+t++iI!+tt1+t+t
oo
"'"
Average
Time
(sec.)
Total No. of Permutations
Figure 22: An experiment using D.BPIN function
pk~xity. If the time for creating new nodes does not count very much in total
execution time, it can be expected that averagf' insertion time of D_SPIN has
logarithmic behavior.
3. In the implementation of D_SPIN, when an overflow at the same level happens, .."
.. J
not only a new node at that level is created, but also a path of child nodes is .:,
". creatf'd. So, it f'ff('ctively saves thf' time for creating new node' wheu overflow .1
.1
happens on thf' same path repeatedly.
4. In some cases, if th~ time needed for creating Hew nodes count.s a milch higger
percentagf' in total execution time, then it can be (!xpect.p.u that the average
insertion time of D_SPIN is linear. FigurE' 22 confirms this expectation.
In Figure 22, the number of initialized nodes only counts 3,8% of Lotal permu ..'
tations. We see that trendline is a linear functioIl.
The search operation in D_SPIN
The search experiments in this example 11 'e some of the permutatiow; that were
inserted in earlier. As a result,) ail search operations in the experiment.' are successful.
Figure 23 shows the results of experiment with dimensional size two. The experiment
45
y =4E05Ln(x) + 0.001
R2 =0.9262
+d_spin
000114 Log. (d_spin)
000112
Auerage 0.0011
search 0.00108
time 0.00106
(sec.)
0.00104
0.00102
0.001 1fI1+++t++t++ttt+t1++T++++tti
o a a a a 0 a a 0 000 0 0 w 00 a N ~ W 00 0 N ~ W 00 0 N
~ N N N N N M M M ~ M v ~
Permutations
Figure 23: A search experiment using DSPI function
indicates that. the average search time of the D_SPIN function has log'arithmic tim
complexity.
Compare D_SPIN with R_SPIN
If we compare the trendline function of Dj]PIN with that of R_SPIN for insertion "'1
operation, we find the fixed parts of both function::; are the same; but for variable
parts (value of slope), thE' D_SPIN's is ten t.imes larger than the R_SPINs. In other
words, RSPI is ten tim s faster than D_SPIN. I' .1 ,r
y = O.OOOILn(.T) + 0.0009 .... .d_s]Yin (3)
y = IE  05Ln(x) + 0.0009.....r _spin (4) I
,.I I
Considering the static property of R_SPIN and dynamic property of D_SPIN: i i , ' ,.
it is reasonable that execution of R_SPIN i:) several orders of magnitude faster t.han
execution of D_SPIN.
Performance Analysis for T_SPI :
As stated in Chapter IV, t.he TS'PIN function is actually a transformation function.
It transforms a sparse multidimensional array into a dense multidimensional
array, which is further t.ransformed into d binary digits. The data structure the
0.00213 +t_spln
Linear Ct_spin)
46
0.00212
Average
Time
(sec.)
000211
0.0021 j I I I I I I I I I I I I I I I I I I I I
CJ CJ CJ CJ CJ CJ CJ CJ CJ CJ CJ
CJ CJ CJ CJ CJ CJ CJ CJ CJ CJ CJ "" (D ro CJ ~ <t' (D ro CJ ('oj "" ~ ~ ~ ~ ~ N ('oj N
y =5E08x + 0.0021
R2 =0.0348
I I I I I I I I
CJ CJ CJ CJ
CJ CJ CJ CJ
CD 1X:l CJ N N N ('") (Y')
Total No. of Permutaions
Figure 24: The result of experiment using T _SPIN function
T_SPIN function use~ it) mllitidimentlional array, which is similar to having only one
path in D_SPIN's tree structures. So, given a certain level of multidimensional array.
a constant average running time can he expected. This expectation is confirmed by
experiments. Figure 24 shows the experiment results.
In Figure 24, the slope of trendline function is very small (5EO ) and can be
considered closE' to zero. So, the average execution t.ime of T_S'PIN can be considered
to be constant. In this case, it is 0,0071 seconds. We also see from Figllff' 24 that largpr
the nnmher of inplIt data is, more thf' average nmning time i::; clo~p to 0.0021. Onp
reason for this constant time complexit.y is that thf' T_SPIN dof'~ not kepp directory
table which is necessary for data record I'ptrieving and searc.h. Thf' olltpll1 of T_SPIN
is the index value for the directory table. The management of directory table is left
for file system of database.
The other way to measure the performance of T_SPJN is to analyze the di tribution
of its output. The output of T_SPIN is the d binary digits, which is f(~quired
to evenly di~.:;tributed in the directory table, so that the load factor of pach bucket is
closed to each other. One important factor to influence the values of d hinary digits
is the hashing function used in T_SPIN. In this test program, T_SPIN uses a typical
hashing function:
'.
•
,1
j'Ir
J
I I
d'
47
35
rr
I:
30
c:: 25
o •
~ ~ 20
01=
~ '0 15 a
10
5
o
o  0  8 0 ~ : § 8 6
dbinilfydigits (d=")
5 80 o
Figure 25: The result of experiment using distribution of ontput in T _SPI
hash.(:r) = .7:modH_51ZE
Figure 25 shows the experimental results of T_SPIN, using d=4 at level 4.
From Figure 25, we see that the distribution of output is good. 0 directory
entry has t.oo many or too few index numbers.
I
il
I
'I J ,
~I
·11
I
,I
,f
I
I
, '
CHAPTER VII
SUMMARY, CO CLUSIO , A D SUGGESTED FUT RE WORK
Index dynamic SPI (D_SPIN) and directory dynamic SPI (LSPIN) are two
dynamic indexing techniques that do not require complet file reorganization as a
result of overflow. They can be very useful for applications where overflow oc ur'
often, and the frequency of overflow cannot be predicted accurately. They also eliminate
the sparse situation in index files so storage space is saved when thE' index dat.a
file is largE'. The design of both dynamic SPI
mrt hod, which is presented in chapter IV.
Conclusion
. follows the twostep transformation
II
D_SPIN employ an index tree structme in the index filE'. It keeps all necessary
informat.ion in nodE'S of t.he treE'. ThE' advant.age of 1his nwthod is that it. kepps
thE' layereel data 'tmctmE'. The disadvantagE' of this method is that. it must cr('~tl',E'
new node' and a new path to overcome overflow; therehy, reducing t.he pfficieflcy of
operations when compared to thE' static SPINs. T_SPIN employs a hash function in
its tran 'formation processes. The output of T_SP1N is d hinary digits, which will b0
fmther used hy filE' syst m t.o retrieve specific hucket. The advant.age of T_SPJN is
t hat it int.roduces bucket concept, and it. keeps data file dynamic since huckets in t.h0
dat.a file split on overflow. The disadvantage of T_SPIN is that it sacrificE'S execution
spE'ed since it needs to hashing and extract the outputs instead of directly rf'trieving
Olltputs as does R_SPIN.
The empirical results presented ahove demonstrate that a tradeoff exists hetween
dynamic SPIl s and static SPI s. If timE' is a rnajor factor, static SPI s show
48
01'
49
better performance since tatic method axe almost ten time faster than dynami
SPINs in the test example. However if eliminating oYerflow and saving torage space
is a premium, then dynamic SPINs show better performan e because ther . i neither
overflow nor sparse data storages in dynamic SPI
Suggested Future Work
The work in this thesis is limited to the indexing techniqu which i only one
link in whole database design chain. It is necessary to measure the performance
of dynamic SPI s in practical database design and application. For example, after
getting results from T_SPIN one must design and implement huckets file pointers,
directory tables, ... , etc.. so t.hat the test constitutes a complete picture of a database
design. This part of work is left to future study.
'.I
J!
11
II
d
!
I
A SELECTED BIBLIOGRAPHY
1. Guttman, A., Rtrees: a dynamic index structure for spat?:al searching, Proceedings
of ACM SIGMOD (Special InterestGroup on the anagem nt of Data),
June, 1984 pp.4757
2. Bayer, Rand 1. Schkolnick, Conc'urrency of operations on Btre s, A ta Informatica
9, 1977, pp.121
3. Coburn, Ty K. C SPIN toolkit, Oklahoma City, OK: ca.1991.
4. Coburn, Ty K. An introduction to SPIN hashing: an approach to managing
mult·i.dimensional data spaces. Tinker AFB, OK: Unpublished technical report
ca. 1991.
5. Coburn, Ty K. An introductwn to the S_SPIN hash function: making mon';
out of the mult1.dimensional army. IEEE (Institute of Electrical and Electric
Engineers) NAECON(National Avionics Engineering Conference) 1994.
6. Fan, Z.M., Y. Zhang, RA. DiVall and G.E. Hedrick Overflow analysis in SPIN.
Computer Science Dept., OSU: Unpulished technical report, OSUCSTR9601,
1996.
7. Fagin, R, J. Nievergelt, . Pippf'ngf'r, and H.R. Strong Extensi.hlp hashing  A
fast access method for dynamir: files. ACM Transactions On Datahase Systrms,
Vol 4. pp. 315344. Sept. 1979.
8. Larson, P. A. Dynamic hash:i.ng. I3IT 18, pp. 184201, 197~.
9. Flajolf't, P. On the performanep pvaluatwn of extendihlp ha hing and t;rip. sp.an·!J.ing.
ACTA Informatica, Vol 20, pp. 34S369, 1983.
10. Korth, H. F. and A. Silberschatz Datahasp. System Concepts. ew York: McGrawHill,
Inc., c1991.
11 Zhang, Y., R.A. DivalI, M.Z. Fan and G.E. Hf'drick An Expp;rimental Analysis
of a New MultzDimentional Stor'age And Retrieval Mdhod. Computer Science
Dept., OSU: Unplllished technical report, OSUCSTR9504, 1995.
12. DivalI, RA. Y. Zhang, M.Z. Fan and G.E. Hedrick. A Theoretical Analys'l.s
of a New Multidimentional Storage and Retrieval Meth.od. Computer Science
Dept.., OSU: Unplllished technical report (in preparation), 1995.
50
t
q
'.
II
",'..I.
,r
51
13. Harbron, T. R. File systems: structures and algorithms. Englewood Cliff: J:
Prentice Hall Inc. c1987.
14. Aho, A.V., J.E. Hopcroft and J.D. ' llman Data Structures and Algorithms.
Reading, MA: AddisonWesley 1983.
15. Aoe, J., Y. Yamamoto, and R. Shimada, A Pmctical method for reducing spar
matrices with invanant entries, International Journal of Computer Math matics,
Vol. 12, pp. 97111, Nov. 1982.
16. Boyer, R.S. and J.S. Moore, A fast string searching algorithm, Communication
of the ACM, Vol. 20, 0.10, pp. 762772, Oct. 1977.
17. Buehrer, D.J. and YW. Fan, SLtrees: An index1,ng structure for objectoriented
databases, The Journal of Systems and Software, Vo1.32, 0.3, pp. 237249,
Mar. 1996.
18. Chang, Y. and Lee C., Climbing hashing for extensible hashing, Information
Science, Vo1.86, No.3, pp. 7799, Sept. 1995.
19. Cormack, G.V., R. .S. Horspool and M. Kaiserswerth, Pmetical perfect hashing,
The Computer Journal, Vol. 28, pp. 5458, Jan. 19 5.
20. Jacobs, D.W., The spare requirements of indexing 'under persper.tive pro.jedions.
IEEE Transactions on Information Theory, Vol.18, pp. 330333, Mar. 1996.
21. Jaeschke, G., Rer:iprocal hashing: A mdhod for gwn.erating minimal perfect hashing
fundions, Communication of thE' ACM, Vo1.24 pp. 829 33, Dec. 19 1.
22. Jonge, W.D., A.S. TeOf~nbaum, and R.D. Reit, Two arr:p.ss mp.th.ods usm.g com.part.
binary tn);es, IEEE Trans. Software Engineering, Vol. SE13, pp. 799 10,
Jul. 1987.
23. Knnih, D.E., Thp.. Art of ComputeT Progmmming, Vol. III: Bor'ling and 8,,0:/"('h.ing.
Reading, MA: Adcli~OIl Wesley, 1977.
24. Knut.h, D.E., J.R. Morris, and V.R Pratt, Fast. pattf'r"'n. matrhing in str'ings,
SIAM Journal of Computer, Vol. 6, No.2, pp. 323349, Jun. 1977.
25. Kumar, V. and J. Mullins, An integmted data structure with multipleaccpss
paths for databasesystems and its performancp. Data and Knowledge Engineering,
Vo1.16, 0.1, pp. 5172 Jul. 1995.
26. Maly, R., Compressed Trees, Communication of the ACM, Vol. 19, o. 7, pp.
409415, Jul. 1976.
27. Orenstein, J.A., Multidimenswnal trips used for' associativp seard/;/,nq, Information
Processing Letters, Vol. 14, o. 4, pp. 150157, ,Jun. 1982.
•q
I
"t
"
:1
52
28. Sheil, B.A., Median split trees: A fa.st lookup technique for fr quently occurring
keys, Communication of the ACM, Vol. 21. pp. 947959, Nov. 1978.
29. Standish, T.A., Data Structure Techniques. Reading, MA: Addi onWesle .
1980.
30. Tarjan, R.E. and Yao A.C., Storing a sparse table, Commun. ACM, Vol. 22,
pp. 606611, ov. 1979.
31. Wirth N. Algonthms and Da.ta Structures. Englewood Cliffs J: PrenticeHall,
1986.
APPE DIX A
Glossary
Any computer program whose o~iective is the solution of a practical pTOblem.
application environment
An environment imputed to the SPIN package by the application program.
dense[tree])
A trep. whose nodes all conta:m data.
dynamic SPIN
Any method that is contained 'I.n the SPIN package that allows the structure of the
data to change dynamically.
key distribution
The way the keys of data objects an'. distr'ibuted throughout flu'. data spar. . FrequeTl.t:ly
expressp.d as a probabihty distribution fundion.
level [in SPIN}
The part of a storage tree that corresponds to a gwen index in a m:ulf'idimp.nsi,onal
array.
load factor
A number, A, whose valup. 'is between 0 and 1, and whose value indir:ates how full (the
load) the storagp. tree(or table} 18. An empty trp.e (table) has a load far' tOT of 0 (A=O),
a full trep. (table) has a load factor' of 1.
multidimensional array
Any array with rnm'e than onp. dimension.
overflow
53
54
In SPIN, a.n attempt to insert data 'into a tree with a load factor of 1.
radix[tree]
A tree that has a .fixed number of (possibly empty) branch s from each node. A radix
n tree has n bmnches at each node.
sparse [data]
Multidimensional data in which the number of zero or empty storage lor.a.tions greatly
pxr;peds the number of nonzero, nonempty storage location.
static SPIN
Any method that is used within the SPIN package and whose st.oragp strudUTP TPmains
fixed after initial allocatio'll..
SPIN
81ngle P01.Tl.t Index Network. A data structure for data organizatwn and retrieval.
SPIN mapping
Any mapping of keys, index values, or data values within one of the SPIN methods.
data file
A ./iff': is a colledi01l of rpcoTds.
bucket (or page)
A bucket. (or page) corresponds to one or sf':veral physical sp.rt.ors of a sp.r:ondury storag
devu:e such as a disk. Let the capaC1.ty of u bucket be b Tf':r:ords.
space utilization
Spacf': util7.zat'l.On 'IS the mtw bet,wp.en nand m*b, where n is the numbpr of f·p.cords in
the fllp, m 1,S the number of pages used, and b is the capacity of tlu~. pagp.
I,
VITA
ZILI FAN
Candidate for the Degree of
Master of Science
Thesis: DY AMIC SPI SCHEMES
Major Field: Computer Science
Biographical:
Education: Graduated from Hangzhou Second High School, Hangzhou, China
in July, 1987; received Bachelor of Science degree in Management Science
from Hangzhou University, Hangzhou, China in July, 1991; received
Master of Science degree in Economics from Oklahoma State University,
Stillwater, Oklahoma in December, 1994.
Completed the requirements for the Master of Science degree with a major
in Compllter Science at Oklahoma State Universit.y in Dccemher, 1996.
Experience: Employed hy Oklahoma Stat.e University, Departmf'nt of Complll.('I'
Science ru:; a graduat.e research assitant., 1995 t.o 1996
I I