A GENETIC ALGORITHM TO EXTRACT NTUPLES
FROM A GIVEN SET OF WORDS
By
RAVI B. MANDADI
Bachelor of Engineering
Osmania University
Hyderabad, India
1989
Submitted to the Faculty of the
Graduate College of the
Oklahoma State University
in partial fulfillment of
the requirements for
the Degree of
MASTER OF SCIENCE
May, 1995
A GENETIC ALGORITHM TO EXTRACT NTUPLES
FROM A GIVEN SET OF WORDS
Thesis Approved:
I{ e
Dean of the Graduate College
ii
ACKNOWLEDGMENTS
I wish to thank my major adviser, Dr. Blayne Mayfield for his guidance,
support, and inspiration. I feel that I would have not made it through these last three
years without his friendship and encouragement. I would also like to thank my other
committee members Dr. J. P. Chandler, for his constructive suggestions and supportive
guidance, and Dr. Lu for her time.
More over, I wish to thank Dr. George Sabbagh for his support, friendship, and
for allowing me to use his computer and necessary software to write my thesis. My
paper would not be complete without giving my gratitude to Mrs. Griffin for helping me
with my grammar.
iii
Chapter
1. INTRODUCTION
TABLE OF CONTENTS
page
1
2. LITERATURE REVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 9
2.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 9
2.2 Terms and Notations 10
2.3 The Role of Mating/Crossover 12
2.4 The Role of Mutation . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 13
2.5 Natural Selection (Biased reproduction) 13
2.6 Fitness Concept 14
2.7 Need for extraction of ntuples 17
2.7.1 Application to perfect hashing schemes. . . . . . . . . . . .. 18
2.7.2 Matrix Compression . . . . . . . . . . . . . . . .. 20
2.8 Possibility of a genetic solution . . . . . . . . . . . . . . . . . . . . . .. 21
3. DESIGN AND IMPLEMENTATION ISSUES . . . . . . . . . . . . . . . . . .. 24
3.1 Description of the Genetic Approach . . . . . . . . . . . . . . . . . . .. 24
3.1.1 Outline of the Genetic Algorithm 25
3.1.2 Coding of Strings . . . . . . . . . . . . . . . . . . . . . . . . .. 26
3.1.3 Creation of the Initial Population 28
3.1.3.1 Random Initialization . . . . . . . . . . . . . . . . .. 28
3.1.3.2 Biased Initialization . . . . . . . . . . . . . . . . . .. 29
3.1.3.2.1 Probability that any two Ituples will be
identical . . . . . . . . . . . . . . . . . . . . .. 30
3.1.3.2.2 Probability that any two 2tuples will be
identical . . . . . . . . . . . . . . . . . . . . .. 31
3.1.3.2.3 The Generalization . . . . . . . . . . . . .. 31
3.1.4 The Genetic Operators: Reproduction, Crossover, and
Mutation 32
3.1.5 The Evaluation Scheme . . . . .. 33
3.1.6 Enforcement of the Constraints 37
3.1.7 Conditions leading to the Termination of the
Genetic Algorithm 38
3.1.8 Implementation Issues 40
3.1.8.1 Approach 1: The C program 40
3.1.8.1.1 Data Structures. . . . . . . . . . . . . . .. 41
iv
3.1.8.1.2 The Driver/Main program 42
3.1.8.2 Approach 2: Using the LibGA application tool .. 43
3.1.8.2.1 The Main Program 43
3.1.8.2.2 The Configuration file. . . . . . . . . . .. 45
3.1.8.2.3 The User Data file. . . . . . . . . . . . .. 45
3.2 Description of the Combinatorial/Exhaustive Search Approach 46
3.3 General description of the Trial and Error Approach 47
3.4 About the Implementation Platform. . . . . . . . . . . . . . . . . . . .. 48
4. RESULTS AND OBSERVATIONS 49
4.1 The Performance of the Genetic Algorithm . . . . . . . . . . . . . . .. 50
4.2 The Execution Time. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 50
4.3 The Solution Quality 56
4.4 Biased Initialization Vs. Random Initialization . . . . . . . . . . . . .. 59
4.5 Comparison of the two implementations of the genetic algorithm . .. 62
4.6 Additional graphs presented in Appendix A . . . . . . . . . . . . . . .. 64
5. SUMMARY AND CONCLUSIONS. . . . . . . . . . . . . . . . . . . . . . . .. 66
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 66
5.2 Future Work 67
BIBLIOGRAPHY
APPENDIXES
Appendix AGraphs
69
72
Appendix BSource code of the genetic algorithm
implemented in C 77
Appendix CSource code for the genetic algorithm implemented using
libGA application 96
v
Figure Page
1. The genetic representation of a tuple . . . . . . . . . . . . . . . . . . . . . . . . .. 5
2. The genetic cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 6
3. The crossover operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 12
4. The mutation operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 13
5. Outline of the genetic algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . .. 26
6. Genetic code 27
7. A sample solution space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 29
8. Performance graph for words of length 6 . . . . . . . . . . . . .. 51
9. Performance graph for words of length 8 52
10. Performance graph for words of length 10 . . . . . . . . . . . . . . . . . . . .. 53
11. Performance graph #1 for words of increasing length 54
12. Performance graph #2 for words of increasing length 55
13. Solution quality for words of length 6 56
14. Solution quality for words of length 10 . . . . . . . . . . . . . . . . . . . . . .. 57
15. Solution quality for words of increasing length . . . . . . . . . . . . . . . . .. 58
16. Comparison of performance of two initialization techniques 60
17. Solution quality corresponding to the graph of Figure 16 . . . . . . . . . . .. 61
18. Convergence when not using elitism 63
9. Convergence when using elitism . . . . . . . . . . . . . . . . . . . . . . . . . .. 64
20. Performance graph #3 for words of increasing length . . . . . . . . . . . . .. 72
21. Solutions corresponding to the graph in figure 20 73
22. Solutions corresponding to graph in figure 12 . . . . . . . . . . . . . . . . . .. 74
23. Comparison of performance of two initialization techniques 75
24. Solution quality corresponding to the graph of figure 23 75
25. Comparison of performance of two initialization techniques 76
26. Solution quality corresponding to the graph of figure 25 76
vii
CHAPTER 1
1. INTRODUCTION
Living organisms are perfect problem solvers due to their versatility. These
organisms come by their abilities through the apparently undirected mechanisms of
evolution and natural selection. Genetic algorithms are search algorithms that are based
on the mechanics of natural selection and natural genetics. By harnessing the powers of
natural selection and evolution, genetic algorithms are able to breed solutions to problems
whose structures are not fully understood. Although genetic algorithms in their present
form were invented by John Holland [HOL73], the idea of simulating biological
evolution for optimization dates back to the 196Os. Genetic algorithms are optimized by
maximizing a fitness function or minimizing a penalty function.
Genetic algorithms make it possible to explore a far greater range of potential
solutions to a problem than do conventional programs through the use of two primary
processes: natural selection and reproduction. Genetic algorithms model the way in
which biological genetic processes seem to operate. This is accomplished by using three
processes known as genetic operators. The definition of the genetic operators in the
context of genetic algorithms was obtained by studying the natural evolution of biological
beings or organisms. Evolution is a powerful process that created the form of life as we
know today. A few billion years ago life existed in the form of extremely primitive
organisms that contained very little genetic information. These biological organisms
1
2
exhibited only primitive functions that were enough to sustain them. From generation
to generation, these organisms became more complex and better able to adapt themselves
to the changing environment. The algorithm that achieves this function uses three natural
biological processes: reproduction, mutation, and mating. Following is a description of
the roles of these three biological processes in natural evolution and how these processes
are adapted to solve problems in the field of genetic algorithms.
The biological environment in genetic algorithms is simulated by use of a fitness
function. The environment in biological evolution provides the standards which organisms
must meet in order to survive. As stated by Goldberg [GOL89], "In natural populations,
fitness is determined by a creature's ability to survive predators, pestilence, and other
obstacles to adulthood and subsequent reproduction." Fitness plays a similar role in
genetic algorithms. Just as organisms in a biological environment evolve to improve
their ability to survive, chromosomes in genetic algorithms evolve to attain higher fitness
values and become more adept at solving the problem. The fitness of a chromosome is
a measure of its ability to solve the problem. Chromosomes in the context of genetic
algorithms consist of a group of string values, each of which represents a characteristic
called a gene. The chromosome represents a possible solution to the problem. The
fitness function is the criterion for choosing the best chromosomes to use in creating a
new generation. This selection strategy is based on the Darwinian principle of the
survival of the fittest. Just as surviving organisms reproduce in nature, the selected
chromosomes will be mated to reproduce. The mating or crossover process is
accomplished by combining different parts of different chromosomes to produce offspring
3
chromosomes with characteristics derived from their parents. Occasionally one of the
string values of a randomly selected chromosome is altered. This process simulates a
process which is the biological equivalent of a mutation. This evolution process, in
short, consists of repeatedly applying the three genetic operators, mating, mutation, and
selection, until a good solution is obtained, i.e, in biological terms, when the
chromosomes evolve sufficiently to contain enough of the desired features or
characteristics. This condition for termination signifies the convergence to a solution,
since at this point the population will be dominated by superior chromosomes.
Genetic algorithms are used in optimization problems, in which afitnessfunction
facilitates the selection of the best chromosomes in a population. The solution to an
optimization problem is to maximize an objective/fitness function J(x) on a set X, where
X is the entire solution space. Fitness came from an analogy with Darwinism that
underlies this whole approach of simulating an evolution to guarantee the survival of the
fittest. The design and implementation of the objective function is problem dependent and
is described in Chapter 3. The process of solving this optimization problem using
evolution is explained in the following five steps. (1) First the population size is fixed
at some value n. Then n elements Xi are chosen at random from X to form the
individuals of the first generation. (2) For each of the individuals Xi' the value J (xJ is
computed. (3) Then for each of the individuals Xi' the survival probability Pi is computed
as the ratio of the individual's fitness to the sum total of the fitnesses of all the
individuals. A random number generator generates each individual Xi with the probability
Pi and also selects individuals to populate the next generation. This process is explained
4
in Chapter 3. The formula for Pi actually means that survival probability is proportional
to fitness, F(xi). (4) At this point, the n individuals that will populate the next generation
are determined. Some of these individuals remain the same as in the previous generation
and others are computed. A new individual is obtained by running the random individual
generator twice to get two individuals Xi and Xl and then combining these two individuals
according to some combination rule. The combination rule involves the use of genetic
operators whose operation and implementation are described in Chapter 2. This process
is repeated until an entirely new generation of size n is obtained. (5) Steps 2 through 4
are repeated with the individuals of the new generation.
It is desirable that each generation contain individuals that solve the problem
better than their predecessors. In other words, each population set will contain
individuals whosefitness is greater than that of their predecessors. Thus each generation
produces successively better solutions to the problem. According to the schema theorem
explained by Michalewicz [MIC92], progressive improvement of the solutions is
guaranteed when binary string representations of the solution are used. The key to a
successful genetic algorithm is to determine appropriate fitness criteria.
Genetic algorithms have been applied successfully to a wide range of realworld
problems such as, scheduling, machine learning, partitioning of graphs, the traveling
salesman problem, and the transportation problem to name a few. The genetic codes
used in solving each of these problems were different, since the representation issues are
problem specific. There are no rules for designing a genetic code.
The extraction of ntuples plays an important role in designing a perfect hashing
5
scheme. Various hashing schemes that use ntuples to build perfect hash functions are
described in Chapter 2. The problem of extracting these ntuples from a given word list,
using a genetic algorithm is addressed in this paper and is stated as follows. From a
given list of words, it is required to extract a list of ntuples. Each ntuple is an ordered
list of n characters picked from each word of the list. Characters must be picked from
distinct positions in a word and the same positions used for all the words in the list.
Furthermore, the ntuples must be unique and the number of characters picked to form
the tuples must be minimal. Since the ntuples are potential solutions to the problem, a
genetic representation of an ntuple must be defined. The genetic representation of the
ntuple is defined by a binary string's', of length equal to the word length of the shortest
word, such that if sri] = 1 then characters from position i are selected to form the tuple.
The size of the tuple is the number of 1's in the string's'. For example, consider the
words, CONST, LABEL, VALKE, PACKED. The genetic code, 010100, would
represent the list of 2tuples, (O,S), (A,E), (A,K), (A,K), obtained by selecting
characters at positions 2 and 4 from the word list. The genetic code (chromosome) for
this selection is pictured in Figure 1.
00100
The 1's indicate the selection of characters
In pOsitions 2 and 4
Figure 1: The genetic representation of a tuple
6
The fitness function is designed to rate these tuples according to their size and
overlap. The overlap in this context is the total number of duplicate entries among the
list of tuples. For example the overlap of the list of tuples mentioned above is 2 since
there are two duplicate entries, namely (A,K) and (A,K).
The initial generation will contain a population of a random mix of these
chromosomes. The three genetic operators, selection, mating and mutation, are applied
randomly to this population to obtain the new population. These genetic operators are
described in Chapter 2. This process is repeated to create each new generation until a
terminating condition is reached, such as a maximum number of generations, or until the
individuals of the new generation reach a desired level of fitness. The fitness in this
context refers to the ability of a member of the generation to solve the problem. While
it may seem that this search for a solution is random, in fact, the improvement of fitness
values in each generation indicates that the genetic algorithm provides an effective
directed search technique. The basic genetic cycle that creates a new generation by
applying the three genetic ,operators on an existing population is shown in Figure 2.
old
population
new
popUlation
Figure 2: The genetic cycle
7
The goal of the fitness function is to select not only the tuples with least amount
of overlap but also of the smallest size. As in this case, many practical problems require
that more than one constraint be satisfied. Constraints are usually classified as equality
or inequality relations. The goal of the fitness function is to minimize the overlap of the
tuples; subject to the constraint that among tuples of the same overlap, smallest tuples
are selected. The overlap as well as the size must be minimal for a good solution, but
in case of a tie, the overlap is used for rating the solution. The implementation of
constraints (explained in Chapter 3) on solutions play an important role in designing a
chromosome representation of solutions to the problem.
Decoders are used by fitness functions to identify chromosomes that violate one
or more constraints, during the selection process. The decoder identifies and isolates the
individuals that violate constraints and repairs or destroys them. Constraints that cannot
be violated can be implemented by imposing great penalties in the fitness function on
individuals that violate them, by imposing moderate penalties, or by creating decoders
of the representation that avoid creating individuals that violate the constraint. Each of
these methods has its advantages and disadvantages. If a high penalty is incorporated
into the evaluation routine and the domain is one in which production of an individual
violating the constraint is likely, then there exists a risk of creating a genetic algorithm
that spends most of its time evaluating illegal individuals. Furthermore, there is a risk
of premature convergence to a solution that is not optimal, since the likely paths to better
individuals require the production of illegal individuals as intermediate structures, and
the penalties for violating the constraint make it unlikely that such intermediate structures
8
will reproduce. On the other hand, imposing moderate penalties may result in producing
individuals that violate the constraint but are rated better than those that do not because
the rest of the evaluation function can be satisfied better by accepting the moderate
constraint penalty than by avoiding it. A decoder built into the evaluation procedure can
be designed to intelligently avoid building an illegal individual but running it is frequently
computationintensive. In the implementation of the constraints discussed in Chapter 3,
a repair procedure is designed and incorporated in the evaluation scheme. This
procedure admits illegal individuals but then immediately identifies them and sets a
severe penalty on such individuals to avoid violation of the constraint.
Chapter 2 presents a review of related work and the history of genetic algorithms.
It also defines the genetic operators and terms that are commonly used. The design and
implementation of the combinatorial approach to solving the problem is explained in
Chapter 3. In Chapter 4, the performance of this approach is compared with the
performance of the genetic approach and observations are made. The execution time to
arrive at the solution is computed for both methods and graphed for various randomly
generated data sets. The quality of the output in terms of tuple size and overlap size is
evaluated and compared for the solutions obtained by both these methods.
CHAPrER2
2. LITERATURE REVIEW
2.1 History
Biological analogies have been part of the science and the lore of computation
since the 1940s. The concept of evolution in molecular biology has been extended to
include computing. Landauer [LAD?1] made it clear that there are parallels between
evolutionary and computer processes.
According to Lander [LAN91], the field of biology is rapidly becoming much
more computational and analytical. Molecular biologists determined that the gene is
made up of DNA, deoxyribonucleic acid. Lander states that the DNA double helix, to
computer scientists, is a clever, robust, information storage and transmission system.
Damian [DAM91] describes the structure of a DNA double helix as a polymer consisting
of four elements, A, T, C, and G (adenine, thymine, cytosine, and guanine). According
to him, the four letter alphabet of DNA can encode messages of arbitrary complexity.
Fast matching algorithms have been applied in sequence analysis [PAV92] of DNA
sequences consisting of the four elements mentioned above.
Analogies between computing and biology are not a mere coincidence. Both
biological genes and computers record, copy, and disseminate information. Douglas
Hofstadter of Indiana University showed this clearly [HOF85] by demonstrating that the
action of DNA and RNA during the reproduction of the living cell can be interpreted as
9
10
an example of a selfreproducing Turing machine. Lawrence Hunter [HUN91] says that
the early AI systems functioned in the field of molecular biology and were later adapted
and modified to apply in different fields of computing such as, computational linguistics,
that can be applied to text and DNA sequences, neural networks, qualitative modelling,
hierarchical pattern recognition, casebased reasoning, visual processing, expert systems,
knowledgebased systems, minimal length encoding, etc.
The beginnings of genetic algorithms can be traced back to the early 1950s.
Goldberg [GOL85] says that during this time several biologists used computers for
simulations of biological systems. The work done in the late 1960s and early 1970s at
the University of Michigan under the direction of John Holland led to genetic algorithms
as they are known today. Holland was inspired by a Darwinian notion of evolution in
which only the fittest survive. He proposed that a learning machine's search for a good
learning strategy be organized as the breeding of many strategies in a population of
candidates, rather than as the construction and refinement of a single strategy. Holland
and his students called their searches reproductive plans which later became popularly
known as genetic algorithms, after Holland's seminal book published in 1975 [HOL75].
In 1989 David Goldberg of the University of Alabama published "Genetic Algorithms"
[GOL89] a book, that demonstrated a solid scientific basis for the field and cited more
than 73 successful applications.
2.2 Terms and notations
The term chromosome is analogous to a biological organism. Just as the process
11
of natural evolution works on a population consisting of biological organisms, so the
computer model of evolution works on a population of chromosomes, which in the
context of computers, denotes a string of characteristics. A gene is the basic building
component of a chromosome, which in the context of computing is a single element of
the string of characteristics representing the chromosome.
Just as surviving organisms reproduce in nature, so do selected chromosomes in
a genetic pool. The selection of the chromosomes that are mated to produce offspring
is based on their high fitness values. This process of selecting the chromosomes that
yield better fitness when input to a fitness function is called natural selection or biased
reproduction. Fitness function, also known as evaluation function, plays the role of the
environment, rating potential solutions in terms of their fitness. Consider for example
the problem of optimizing a simple function of one variable, f(x) = x22x+ 1, in the
domain [1,10]. If the evaluation function is chosen to be equivalent to the function f,
then the 3 chromosomes, vI = (01012), v2 = (10102), v3 = (01102) would be rated as
follows,
Fitness (vI) = f(vl) = f(010I 2) = f(5) = 16
Fitness (v2) = f(v2) = f(10102) = f(10) = 81
Fitness (v3) = f(v3) = f(01102) = f(6) = 25
The process of mating involves combining different parts of different
chromosomes to produce new chromosomes, also known as crossover. This process is
explained in the next section.
Mutation is the alteration of the value of an arbitrary gene of a chromosome.
12
The concept of evolution is simulated on a computer by random applications of these
three operators: selection, crossover, and mutation.
2.3 The Role Of Mating/Crossover
The process of mating ensures the mixing and recombination among the genes of
their offspring. Thus this process is responsible for transferring the combined properties
Parents Children
IIIIIIIIII
Cross Site
Crossover/Mating: Crossover never
changes the value of genes. It simply
rearranges existing gene values in
different ways
Figure 3: The crossover operation
III lli:: 111111
III IIIIII
of the mating genes of the current generation to their offspring in the next generation.
The conventional scheme of mating is the swapping of characteristics between two
chromosomes across a random point within the length of both chromosomes. Many
variations of this scheme are permitted as long as a monotonic increase in the quality of
genes in succeeding generations is ensured on an average over all the chromosomes in
a population. The process is demonstrated in Figure 3.
13
2.4 The Role Of Mutation
Mutation in the genetic algorithm is an arbitrary change in a given characteristic
in the chromosome. This process is demonstrated in Figure 4. The mutation rate is set
at a very low level as its main function is to restart a process of evolution that has
stalled. Mutation also helps to maintain pattern diversity and introduce new
characteristics into the population. Pattern diversity in the population corresponds to the
IIIIIIIIII
IIII i~i IIIII
Parent
Child
Mutation: Mutation changes the gene
value of an arbitrarily selected gene of a
chromosome.
Figure 4: The mutation operation
breadth of the searched domain. Loss of this diversity results in an undesirable,
premature convergence of the algorithm to a nonoptimal solution.
2.5 Natural Selection (Biased reproduction)
The process of natural selection determines which members of a population
survive to reproduce and is done by using a biased, randomselection methodology.
14
Parents are randomly selected from the current population in such a way that the fittest
genes in the population have the greatest chance of being selected. Holland [HOL75]
suggested a scheme for this type of selection which involved a linear search through a
roulette wheel with slots weighted in proportion to string fitness values. The roulette
wheel implements the natural selection process in a way that incorporates the Darwinian
notion of the survival of the fittest. Using natural selection of the fittest points in a
solution space forces the algorithm to move in the most promising direction in its overall
search.
2.6 Fitness Concept
The genetic algorithm exploits the higherpayoff, or target, regions of the solution
space due to the fact that successive generations of reproduction and crossover produce
increasing numbers of strings in those regions. A region is a schema which is defined
as follows. Let 0 be the set of all strings of some fixed length over some alphabet E,
then I0 I = IE In. let 'It be the power set of (}. Then the schema H is defined by Vose
[VOS9l] as a function that maps into the set {true, false} according to the rule,
H(x) = true iff x E: M where H is the schema describing the subset (of 0) M E: 'It.
For example let (} be the set of all strings of length 2 over the alphabet E = {O, I}. Then
the set 0 = rOO, 01, 10, II} and the power set of 0, '1' = {ef>, rOO}, {Ol}, {10}, {II},
{oo,Ol}, {oo,lO}, {oo,ll}, {Ol,lO}, {Ol,ll}, {IO,II}, {oo,Ol,IO} ,
{oo,OI,IO,II}}. Any element of '1' represents a schema. For instance {Ol, II} E: v is
15
described by the schema *1 which maps binary strings of length two into the set {true,
false} as follows,
*1 (x) = true iff x E: {01, II}.
The * in the above equation represents a don't care symbol which can take on any single
character from the alphabet. This notation was adopted by Vose [VOS91]. As an
example the schema 1*0* would contain the strings, 1000, 1001, 1100, 1101.
According to Holland [HOL92], a genetic algorithm casts a net over a landscape
of potential solutions, where the rate at which the genetic algorithm samples different
regions corresponds to the regions' average elevation or fitness. The algorithm favors
the fittest genes as parents, and so aboveaverage genes will have more offspring in the
next generation. The remarkable ability of genetic algorithms to focus their attention on
the most promising parts of a solution space is a direct outcome of their ability to
combine genes containing partial solutions.
The choice of a fitness function depends on the nature of the optimization
problem. Consider for example the problem of solving the two simultaneous equations
ax + by = c and dx + ey = f. The solution domain is the set of equations made up
of either the addition, subtraction, or multiplication of the six constants a, b, c, d, e and
f. Since the chromosome should represent any equation in the solution domain, it is
denoted by a binary tree of nodes containing the coefficients. If we choose values for the
coefficients, such as a=l, b=2, c=4, d=O.5, e=2 and f=3, then the solution is x =
2. So let the fitness of a chromosome be defined as the sum of the squares of the
differences between the correct answer 2 and the selected chromosome. The best
16
chromosome will have a fitness of 0 and the worst will be a large number. The best
chromosome will represent the solution for all values of a through f with a high
probability.
Game theory suggests that each player should minimize the maximum damage the
other player can inflict. For example, in the simple game known as prisoner's dilemma,
Michalewicz [MIC92] describes two prisoners that are held in seperate cells and are
unable to communicate with each other. Each prisoner is asked to defect and to betray
the other prisoner. If both defect, they are both tortured. If one defects, he is rewarded
and the other prisoner is punished. Thus, if anyone prisoner selfishly chooses defection,
he is guaranteed a higher payoff than if he chooses cooperation. But if both defect at the
same time then both are worse off than if they had cooperated. The prisoner's dilemma
is to decide whether to defect or cooperate with the other prisoner. Applying the genetic
algorithm to such problems requires translating possible strategies into strings. One
simple way is to base the next response on the outcome of the last three plays. The value
of each gene would be either 1 or 0 depending on whether the preferred response to its
corresponding history was cooperation or defection. The fitness of each string could be
heuristically constructed as the average of the payoffs its strategy received after repeated
play.
For problems in which various factors affect the optimization of the solution, the
fitness function might be constructed as the weighted sum of all the factors, the weights
being dependent on the nature of the problem.
Thus there are no general rules for the design of the fitness function. The fitness
17
function for an optimization problem is usually designed at a heuristic and intuitive level,
taking into consideration the factors that play a role in determining an optimal solution
to the problem.
2.7 Need for Extraction of ntuples
In this paper, an ntuple is an ordered list obtained by extracting characters at n
different fixed positions in a word. This implies that the word has to be at least n
characters long. Let w = Cl~C3 ••• Cp be a word containing characters C1 through cp,
where p > n. An ntuple extracted from this word is defined as, In = (Ch cj , Ct, ...)
where 1 < i < j < k < ... s p and Iln I = n. For an arbitrary list of words, an ntuple
is extracted from each word with the same values of i, j, k,.. to form a set of ntuples.
The set of ntuples extracted from an arbitrary list of words is said to be unique
when no two tuples in the set are identical. Chang [CHI91] states that finding a good
heuristic algorithm to extract a unique ntuple for an arbitrary list of words with the least
amount of required time still remains an open problem. In his paper Chang [CHI91]
says, "Up to now researchers have proposed many perfect hashing schemes using
extracted ntuples. They all used trial and error to find the needed ntuples." In this
paper Chang introduces a letteroriented perfect hashing scheme with a space complexity
of :
o (4w fn/21 )
N
18
,where UJ is the cardinality of the set of characters appearing in all extracted ntuples, N
denotes the number of words hashed and n is the tuple length. The time spent in finding
a perfect hashing function is based on the time complexity of the Matrix Compression
algorithm used, which is :
where p the compression rate, 1 < p < 1.
From this equation it is apparent that the space needed by this hashing scheme
depends on the size of the extracted tuples n. The genetic algorithm used to extract these
tuples attempts to optimize this parameter n. Numerous hashing schemes can be cited
where extraction of ntuples from an arbitrary list of words played a key role. The only
method suggested for the extraction of these ntuples involved repeated attempts at
guessing a possible ntuple on a trial and error basis. An overview of various hashing
schemes that use extracted ntuples is given below.
2.7.1 Annlication to Perfect Hashina Schemes
Up to now researchers have proposed many perfect hashing schemes using
extracted ntuples. Their performance relies on the low cardinality of the extracted ntuples.
The cardinality of a tuple is the number of characters that it contains. The
disadvantage of perfect hashing functions to date has been that the process for
19
constructing them is slow. Cichelli's algorithm [CIC80] searches a table that contains a
calculated value for each letter in the word (key) alphabet. Cichelli's algorithm has the
advantage of simplicity but it also has an exponential expected time complexity, O(er
),
where r is the number of keys in the set to be stored. Values associated with the letters
from the key are used in determining the storage address for the corresponding record.
The minicycle algorithm described by Sager [SAG85] grew out of an attempt to optimize
Cichelli's algorithm for generating perfect hash functions. The minicycle algorithm,
which is currently the most computationally efficient, has a worst case time complexity
of O(f) observed by Marshall Brain [BRA89].
Marshall Brain [BRA89] presented a near perfect hashing scheme on large word
sets based on a modification of Cichelli's algorithm. The improved procedure was the
result of examining the original algorithm for the causes of its sluggish performance and
then modifying them. Similar to the original Cichelli's algorithm, a table that contains
a calculated value for each letter in the key's alphabet was maintained. Then the hashing
function,
h(key)  tablevaluel!f}irstklter(key) +tablevalue_l!f_fastklter(key)
string length(key)
was used. An improvement was made in the ability to backtrack intelligently in case of
a collision, which is the case when the first and last characters of two equallength words
are the same. Though this improved algorithm had the advantage of efficiency and
20
simplicity, the great disadvantage was the larger storage usage than other methods.
In a new approach, described by Chang [CHA91], to design a perfect hashing
scheme, the assigned address of each keyword was determined as the following form,
address = VI (the keyword's ith character) + V2 (the keyword's jth character)
where VI and V2 were two integervalued functions defined on the set of twentysix
English letters. The letter value assignments for VI and V2 were determined by a letteroriented
mergingandexchanging algorithm. The loading factor depended heavily on the
extracted pairs. When a pair of extracted letters was not sufficient to distinguish all the
keys, this scheme failed.
Chang and Lee [CHA86] describe a minimal perfect hashing function suitable for
letteroriented keys, based on the Chinese Remainder Theorem. This hash function was
a result of modifying the algorithm proposed by Chang [CHA84] to handle letteroriented
keys. This scheme depended on the extraction of unique 2tuples from the letteroriented
keys. The 2tuples were extracted from the set of keys by picking characters
from two arbitrary character positions in the key.
2.7.2 Matr~Compression
The hashing schemes mentioned above require the use of a matrix compression
technique. The ntuples are extracted from the keys and stored in a matrix. The
cardinality of a set is the number of elements in the set. It is a measure of the size of
the set. The smaller the cardinality of the extracted ntuples, the greater the sparseness
21
of the matrix and hence the more effective are the compression techniques when applied
to the matrix. The trial and error schemes used for the extraction of ntuples in these
schemes left the choice of cardinality of the tuples to the discretion of the person making
these repeated trials. The proposed genetic scheme will take into consideration the
cardinality of the ntuples as a factor in determining a fitness function for the genes.
An efficient implementation of a trie structure using a new internal array structure
suggested by Aoe, Mormoto and Sato [AOE92], called the double array, combines the
fast access of a matrix form with the compactness of a list form. This paper suggests
the possibility of applying the algorithms presented for updating the double array to the
reduction of static sparse matrices.
Tarjan and Yao present a scheme for storing a sparse table ofn entries [TAR79],
each an integer between 0 and N1, with an access time of O(lognN) and a storage of
O(n).
2.8 Possibility of a Genetic Solution
The problem of extracting ntuples that uniquely identify the key space is a
combinatorial problem. Genetic algorithms are stochastic search procedures based on the
randomized operators, mating, crossover and mutation. Ever since Holland gave the
schema analysis [HOL73], genetic algorithms have been successfully applied to solving
a wide variety of hard combinatorial problems in areas such as scheduling and
transportation problems described by Goldberg [GOL89], partitioning graphs described
22
by Kernighan [KER70], and the traveling salesman problem described by Lin [LIN65].
The search behavior of the genetic algorithm has been modeled in a Markovian
framework by Arunkumar [ARU93] and strong convergence of the search to a solution
was proved. Genetic algorithms employ a randomized heuristic search strategy for which
a priori complete knowledge of the features of the domain are not required, which makes
it favorable for attempting to solve the combinatorial problem of the extraction of ntuples.
The search strategy adopted in genetic algorithms in the context of combinatorial
optimization is described by Arunkumar [ARU93]. The algorithm begins with an initial
generation of a uniformly random population of genetic codes that should represent the
solution patterns. The solution patterns are the syntactic encoding of a solution. This
is an issue of representing the problem in a genetic schema, which in many cases is not
trivial, as observed by Michalewicz [MIC92], depending on the nature of the problem.
The nature of the problem at hand lends itself easily to a genetic solution because of the
ease with which the solution patterns can be constructed, as explained in Chapter 3.
Battle and Vose [BAT93] conclude that schemata more general than Holland's can also
be made to direct a genetic search, and that a duality exists between problem
representations and the schemata that are relevant for their optimization. This duality
provides a theoretical framework in which to interpret problem representations.
Holland defined a schema as describing a subset of strings (in a population) with
similarities at certain string positions [HOL73]. Vose [VOS91] generalized this notion
of the schema in genetic algorithms by defining it as a predicate. Battle and Vose
23
[BAT93] generalized Holland's schemata in a manner which preserves the algebraic
structure, which behaves according to the Schemata Theorem given by Vose [VOS91].
According to Michalewicz [MIC92], the theoretical foundations of genetic
algorithms rely on a binary string representation of solutions and on the notion of
schema. According to the schema theorem as stated in Chapter 3 of [MIC92], above
average solution strings grow exponentially in subsequent generations of the genetic
algorithm, resulting in convergence to a solution. Since the nature of the problem being
addressed in this paper is such that it favors binary string representations of its solutions,
a genetic algorithm designed to solve this problem will guarantee a convergence to a
solution, as it is backed up by the Schema theorem.
CHAPTER 3
3. DESIGN AND IMPLEMENTATION ISSUES
This section describes three independent approaches to solving the problem of
extracting a set of unique tuples from a given word list. The combinatorial and trial and
error approaches are used as standards to compare and evaluate the performance of the
genetic approach, which is the focus of this paper. The genetic algorithm used to solve
the problem is implemented by two different approaches: writing a C program and using
a genetic application tool called LibGA. This is done in order to investigate the
usefulness of the application tool. The results obtained by these two different approaches
are compared.
Since the terms tuple, overlap, and cardinality will be referred to on more than
one occasion, their definitions are given. Tuple is defined as an ordered list of
characters. Cardinality refers to the length of a tuple and overlap refers to the presence
of duplicate entries in the list.
3.1 Description of the Genetic Approach
The problem of extracting a unique set of tuples from a given word list involves
a search without any prior knowledge about the search domain. A solution is
characterized as good according to the following two criteria; low cardinality and low
24
25
overlap of the tuples. Since the extracted tuples will be used in a perfect hashing
scheme, the solution must put more stress on realizing the latter of the two criteria. The
genetic approach to the problem as described in this paper was designed to do just this.
Genetic algorithms are well known to be applied to optimization problems. The design
of the fitness evaluation scheme, which will be described later in this section, relates
directly to the measure of overlap of the tuples. The goal of the genetic algorithm was
to effectively search for better candidates for the solution using only the payoff or fitness
of each candidate. This search required no auxiliary information in order to work and
hence suited this problem very nicely.
3.1.1 outline of the Genetic AI&orithm
The genetic algorithm described in this paper is a classical genetic algorithm,
which means that it operates on binary strings. Hence a mapping between the potential
solutions and the binary representation is required. This process, called the coding of
strings, is explained in detail in section 3.1.2. The genetic algorithm begins with a
randomly generated initial population ofbinary strings. This scheme of forming the initial
population is described in section 3.1.3. Then the genetic operators, explained in section
3.1.4, are applied. This application basically involves the copying and swapping of
partial strings of the original population to form a new population. The new population
thus created is evaluated using an evaluation scheme, explained in detail in section 3.1.5.
This process is repeated as shown in Figure 5, until a termination is reached. The
26
conditions leading to a termination of the cycle are explained in section 3.1.7.
Old Population~~
Apply Genetic
Operators Old Population
=
New Population
New Population
Evaluate~~
Figure 5: Outline of the genetic algorithm
During the genetic cycle some of the chromosomes are just copied into the new
population and others are modified by the genetic operators and then introduced into the
new population. When implementing a fixed size population it is required to replace
existing chromosomes by the new modified ones. Different implementations can use
different techniques of performing this replacement process. In the genetic program
described by Goldberg[GOL89], two chromosomes selected for mating are invariantly
replaced by their resulting offspring. In the application tool called libGA, which is
discussed in section 3.1.8.2, the replacement strategy is different. Of the four
chromosomes involved in the mating operation, i.e. the two parents and their two
offspring, only the best two chromosomes are selected for the new population.
3.1.2 CodinK of StrinKs
27
The coding of strings that represent potential solutions facilitates the mapping of
the parameters of the optimization problem to a string in the population. The parameters
of optimization are the tuple length and the amount of overlap. Thus a string should map
to a list of tuples of a known length and overlap. Consider the binary string 01101. In
this string the l's occur at the character positions 2, 3, and 5. The tuples that
correspond to this string are obtained by picking out the characters at these positions
from each word in the word list. For example, Figure 6 shows the mapping of the binary
string 01101 to a set of tuples for a given word list. In general, for a binary string
XtX2X3•• Xn where Xi is either a 0 or 1 for all i in (l,n) and n is the length of the words in
the data set, the tuples that this string represents are obtained by picking characters from
each word at positions i for which Xi = 1. The length of the tuple is the number of l' s
in the string. The measure of overlap is represented by the hashlength, which is
computed from the tuples that the string represents.
e r
n t
c h
r n
s e
[lLTl 0
off
9 r a
tea
1 e a
tho
rnmaps to these tuples I
l
f f r
rat
e a h
e a n
hoe
Figure 6 : Genetic code
28
3.1.3 Creation of the Initial Population
The population size is stored in the variable popsize, which is input to the
program during execution. The initial population contains chromosomes selected from
the solution space. The complete solution space consists of 2n 1 chromosomes, where
n is the word length of the words in the input list. For example, if we consider words
of length 3 then the complete solution space consists of the chromosomes,
001,010,011,100,101,110,111. In general, the solution space is much larger than the
population size. Hence the initial population can contain only a few chromsomes from
the solution space. This selection process is implemented using two methods: a random
initialization, and a biased initialization.
3.1.3.1 Random Initialization
The initial population contains chromosomes selected randomly from the entire
solution space. The logic for creation of the initial population follows,
for i = 1 to popsize
x = a random number in the range [1 ..2D1]
chrom = a binary string equivalent of x
insert chrom into the population
end loop
If the population size is smaller than or equal to the solution space, then the initial
29
population is constructed to contain the whole solution space.
3.1.3.2 Biased Initialization
In the biased initialization method, the initial population will consist of k1 number
of Ituples , k2 number of 2tuples, k3 number of 3tuples, .. ,leu number of ntuples. The
values k1, k2, k3, •• ,leu are computed from the following equation:
p rftrft~
k. ::: _i_ • ps, where Pi
• n rftl
'lJPi
i=1
,where ps is the population size, Pi is the probability that two tuples of size i will not be
identical, r is the alphabet size, n is the word length of the input word list. The
expression for Pi is derived below.
4 groups
Figure 7: A sample solution space
30
Consider the simple case of an input word set consisting of words of length 4,
formed from the alphabets: 0,1. There are altogether 24 possible words that can be
formed from this alphabet which constitutes the solution space, as shown in Figure 7.
There are 21 possible ways of constructing tuples of size 1. However there are totally 24
tuples. Therefore, the 24 Ituples consists of 2 groups of 24/21 (= 8) identical Ituples,
as seen in Figure 7. Similarly it can be said that the 24 = 16 possible words will consist
of
4 groups of 24
/ 22 (=4) identical 2tuples
8 groups of 24
/ 23 (=2) identical 3tuples
1 group of 24
/ 24 (= 1) identical 4tuples
3.1.3.2.1 Probability that any two Ituples will be identical The total
number of ways of selecting 2 Ituples from a list of 16 =
16
C = 120
2
The number of ways of selecting 20's =
8C
= 28
2
,since there are 80's altogether in the list of Ituples.
Similarly the number of ways of selecting 2 l's from a group of 8 l's is also = 28
31
Therefore the probability of two Ituples being identical =
number of ways of selecting 20's + number of ways of selecting 21's
Total number of ways of selecting two Ituples
8
2C'
2 = 282 = S6
16 120 120
C2
3.1.3.2.2 Probability that any two 2tuples will be identical Since there are
4 groups of 4 identical 2tuples, by a similar treatise it can be seen that this
probability evaluates to :
4
4C'
2
16
C2
64
120
24
120
3.1.3.2.3 A Generalization: Probability that any two ktuples will be identical
For a list of words of word length n and alphabet size r, the total number of distinct
tuples that can be formed is fl. This is the size of the entire solution domain. Following
along the line of intuition presented in the previous two sections, it can be seen that this
entire solution domain of size fl will contain rk groups of flk: identical tuples. The
probability of two ktuples being identical, 1~k~n, is given by :
32
(number of groups of identi~a1ktu.ples)(probability of selecting two tuples from a group)
Total number of ways of selecting two tuples
,.
C2
The probability that two ktuples will be different:
r,,kl
1
r"l
r"r,,k
r"1
3.1.4 The Genetic Operators; Reproduction, Crossover, and Mutation
The functioning of each of the three genetic operators depends on random choice.
This random choice is implemented in three functions: Random, Rnd, andflip. The first
of these returns a pseudorandom number between 0 and 1. Rnd returns an integer value
between specified lower and upper limits, and flip returns a boolean true value with
specified probability.
In the genetic algorithm, reproduction is implemented in the function select as a
linear search through a roulette wheel with slots weighted in proportion to string fitness
33
values. Select returns the population index value corresponding to the selected
individual.
A routine crossover takes two parent strings called parentI and parent2 and
generates two offspring strings called chiidl and child2. The probabilities of crossover
and mutation, pcross and pmutation, are passed to crossover, along with the string length
lchrom. At the beginning of the routine crossover, function flip is used to determine
whether the crossover operation will be performed on the current pair of parent
chromosomes. The function flip uses pcross as an argument to determine whether a
cross is called for. The crossing site is selected randomly using the function Rnd, which
returns a pseudorandom integer between I and lchrom. If no cross is to be performed,
then the mutation operator visits each bit and calls the flip function with argument
pmutation to determine whether that bit value will be altered.
3.1.5 The Evaluation Scheme
Literature confirms that there is no deterministic method other than an exhaustive
search to solve this problem. Although genetic algorithms use probabilistic transition
rules to guide their search, this use does not suggest that the search is some simple
random search. The Genetic algorithm uses the fitness function as a tool to guide the
random search toward regions of the search space with likely improvements. The genetic
approach was designed with the prospect of achieving better performance than a totally
random approach ( trial and error) at one extreme and a totally deterministic approach
34
(exhaustive search) at the other.
The population is a collection of candidate solutions that are represented by binary
strings. Each binary string represents a set of tuples. The mapping of a binary string
to the set of tuples that it represents is accomplished easily by picking characters from
the word list at positions corresponding to the 1's in the binary string. For example,
consider the list of words shown in Figure 6. The highlighted characters of the word
correspond to the 1's in the binary string. These characters form the tuples that the
binary string represents, namely {ffr, rat, eah, aan, hoe}. The evaluation scheme assigns
a fitness value to each binary string by computing the measure ofoverlap present in the
set of tuples that are mapped to it. This measure of overlap, also called hashlength, is
computed as follows. Consider an individual of a population represented by the binary
string 01011. Let the input word list be stored in an array of strings WN, where N is the
total number of words in the list. Then the set of tuples that are mapped to the binary
string 01011, obtained by extracting characters from positions 2,4, and 5, are denoted
by the list:
T1 = (W12' W14' W1S)
T2 = (W22, W24 , W2S)
By way of frequency count, this set of N tuples T = {T1,T2, •• , TN} can be partitioned
into a set of groups G = {G1, G2, •• , Gw} with the following properties:
Each group Gi is a set of one or more tuples.
35
all the tuples in each group are identical.
If i =1= j, then Gi n Gj = ~, the null set.
If Ord (X) represents the number of elements in the set X, then
w
lJOrd(Gj) = Ord(1) = N
1
When all the tuples are unique, as seen from the above equation, W = N
which means that each group will contain only one tuple.
The hashlength of the set of tuples T, corresponding to the binary string is evaluated by
the summation :
W Ord{GJ}
hashlength(1)=lJ lJ k
j=1 1=1
When there is no overlap among the tuples, all the tuples are unique and W =
N and Ord(Gj) = 1, resulting in hashlength = N, the number of words in the input word
list. The fitness function for this case must be evaluated to its maximum value, which
is 1. Thus the fitness function is computed as :
N fitness(7)  
hashlength(7)
When there is an overlap, the hashlength > N and the fitness evaluates to a fraction.
The smaller this fraction is, the higher the overlap.
The crossover/mating operation is designed to achieve a guided search toward
regions of the search space with likely improvement as illustrated in the following
36
scenario. Consider the following two individuals/chromosomes that may have been
selected for mating: 101100 and 010011. After the mating operation, the overlap of the
offspring mayor may not be significantly less than that of the parents. Upon crossover
the best parts or contributors to fitness of each chromosome may be combined into one,
resulting in a higher fitness value of one of the offspring. This scenario may not take
place, but when it does, the search moves significantly towards a better solution.
The evaluation of the fitness is based solely on the computation of the hashlength
according to the above equation for fitness(). The hashlength depends directly on the
number of duplicate entries among the list of tuples. The higher the number of
duplicates, the greater the value of the hashlength. It is clear that choosing more letters
from the word list to form the tuples will not make the hashlength worse. Stated in
another way, if T1 is the set of tuples mapped by the chromosome 011000 and TI is
the set of tuples mapped by the chromosome 011011 then the following is always true:
hashlength (T1) ~ hashlength (TI)
This happens because the latter of the two chromosomes considers two additional
character positions to form the tuple TI. If hashlength(TI) is greater than
hashlength(T1) then the tuple set TI should be given a greater fitness value than TI.
This is indeed the case in the evaluation scheme. But in the case where the two
hashlengths are identical, the tuple set TI should be given a greater fitness value because
the tuple size for T1 is less than that for TI. The evaluation scheme does not recognize
this situation and ends up assigning a higher fitness to TI. This situation is rectified by
a repair algorithm that keeps the cardinality, or tuple size, under check. This algorithm
37
is explained in section 3.1.3.
The genetic operators are applied to selected candidates of the population and the
resulting offspring are introduced into the new population. The probability of selection
of a candidate is designed to be proportional to its fitness. This selection is repeated
until the new population reaches a desired size, which is at least the size of the old
population. As this process is repeated again and again, duplicate copies of a particular
candidate will dominate the population. Theoretically this leads to a complete domination
by the individual. At this point the genetic algorithm is terminated and the solution is
represented by this dominant individual.
3.1.6 Enforcement of the Constraints
The evaluation scheme as it stands serves the purpose of minimizing only the
overlap present in the extracted tuples. The genetic search must incorporate the
constraint on the length of the tuple; otherwise the search suffers a potential risk of
favoring tuples of high cardinality and ending up with the redundant solution of a tuple
set that is identical to the word list itself. The constraint must be enforced to ensure that
the search will be directed towards solutions whose combination of overlap and tuplelength
is minimal. This is stated more formally as follows: Let P denote a population
of size n. Then Pi denotes the ith individual of the population P. F(PJ, O(PJ, and L(PJ
denote the fitness, a measure of overlap, and the length of the tuples that are represented
by the individual Pi.. For all i in {I,n} F(PJ < every element in the set S : {F(Pj ) I
38
O(Pj) = O(PJ & L(Pj ) > L(PJ, 1<j <n and j < > i}. In the implementation, the
elements of the set S are assigned a very low fitness value so that they eventually die out.
These candidates are undesirable since they violate the tuplelength constraint.
The implementation of this constraint is explained. A table called RestrainLength
is used to update the smallest hashlength for each tuple size. For example, consider all
tuple sets of size one extracted from a list of words of length four. There are four such
tuple sets obtained by string coding the four chromosomes, 1000, 0100, 0010, 0001.
The process of string coding was explained earlier. As each of these four chromosomes
are evaluated in the course of the genetic search, the first entry of the table
RestrainLength will be updated to the smallest hashlength. This update algorithm is
described below:
chrom :represents the chromosome being evaluated
card (chrom) :represents the cardinality of the chromosome chrom
hashlength (chrom) :represents the hashlength of the tuples mapped by the
chromosome chrom
if hashlength(chrom) < RestrainLength(card(chrom» then
RestrainLength(card(chrom» = hashlength(chrom)
If any of the entries above the updated entry has a hashlength that is less than or equal
to the hashlength of the updated entry, then the chromosome corresponding to the
updated entry is undesirable and assigned a low fitness value.
3.1.7 Conditions leadine to Termination of the Genetic A1eorithm
39
The genetic algorithm forms a new population from the existing one on each
iteration of the genetic cycle. The population created on each cycle is known as a
generation. Theoretically the population representing the nth generation as n approaches
infinity will contain chromosomes that are perfect solutions to the problem. In practice
, however, the cycle is terminated when one of the following two conditions occur.
1. A solution is found.
This occurs when all the individuals of the population attain the same
fitness value, and is referred to as convergence to a solution. This does
not necessarily mean that all the individuals of the population are
identical. The genetic algorithm may find more than one solution, if it
exists. Elitism is used to improve the performance of the genetic algorithm
by speeding up the rate of convergence. Use of elitism ensures that each
successive generation is as good or better than the generations that precede
it. This is implemented by maintaining the best few chromosomes in each
generation. Occasionally the application of genetic operators on good
chromosomes yield poor successors, resulting in a drop in the fitness of
the succeeding generation. Use of elitism ensures a monotonic increase
in the fitness of a generation because the best few chromosomes are
reintroduced into the population. Implementation of elitism in the C
program implementation is carried out by replacing the parents by their
offspring only if the offspring are better.
2. A solution is not found.
40
A counter called gencount is used to keep track of the number of
generations created. When gencount exceeds MAXGEN, a preset constant,
the cycle is terminated. This indicates that a solution was not found
within the maximum number of iterations allowed.
3.1.8 Implementation Issues
This section describes the two approaches used to implement the genetic
algorithm. In the first approach, a C program is written to implement all the aspects of
the genetic algorithm. The code is written entirely from a scratch and includes the
design and implementation of the data structures, the code for the genetic operators, the
code for the evaluation scheme, the code for various randomized operators, and other
support modules. In the second approach use of a genetic application package helps to
drastically reduce the amount of coding. Only the code for the evaluation scheme and
the main driver routine has to be constructed.
3.1.8.1 Approach 1: The C pro&ram
The implementations of the data structure for the population and the genetic
operators are identical to the implementations presented by Goldberg [GOL89]. The
source code for the genetic algorithm is written in the 'C' programming language. The
choice of a binary string as a data structure for the chromosome well suited the
41
transformation of the problem into genetic coding as explained in the section on solution
coding. In the genetic algorithm the probability of mutation pmutation is set at a much
lower value then the probability of crossover pcross. Mutation by itself is a random
walk through the string space. When used sparingly with reproduction and crossover,
it prevents the premature convergence to a nonoptimal solution. This is due to the fact
that the potentially better solutions that cannot be generated by recombination of existing
ones will be introduced by use of the mutation operator, thus preventing convergence to
an otherwise optimal solution.
The actual population size is set as an input parameter to the genetic algorithm,
so that its value can be altered easily. The strength of the genetic approach comes from
the fact that the population size is usually much smaller than the actual solution domain.
The genetic search focusses on the more promising parts of the solution space instead of
on the entire solution space. As an input parameter, the population size can be set at
different values for different data sets to study its effects on the convergence of the
genetic algorithm.
3.1.8.1.1 Data Structures The primary data structure for the genetic algorithm
is a string population. The population is constructed as an array of individuals where
each individual contains the genotype (the artificial chromosome or bit string), and the
fitness (evaluation function) value along with other auxiliary information such as the
cross point and the parents involved in the mating. A number of constants are defined:
the maximum population size, MAXPOP, and the maximum string length, MXLNGm.
42
These set upper bounds on the population size and the string length. The type population
is an array of type individual, which is indexed between 0 and MAXPOPl. Type
individual is a record composed of a type Chromosome called chrom, and a real variable
called fitness. These represent the artificial chromosome, and the string fitness value.
Type chromosome is itself an array of type gene, which is indexed between 0 and
MXLNGTHl. Gene is another name for the character type to represent a single gene
value of '1' or '0'.
In the genetic algorithm, the genetic operators are applied to the entire population
in each generation. In the implementation, the offspring of the genetic operations replace
the parents in the new population. The population size remains fixed. The computer
implementation uses a number of important global variables: pcross, pmutation, and
sumfitness. The variables pcross and pmutation are the probabilities of crossover and
mutation respectively. The sumfitness variable is the sum of the population fitness
values.
The array RestrainLength of size MXLNGTH is defined to be of type long. This
array is used to update the maximum fitness at each tuple length that accumulates during
the genetic cycle. The information in this array is used in the evaluation function to keep
the tuple length under check. This is explained in detail in a later section.
3.1.8.1.2 The Driver/Main program The driver/main section of the program
serves as a breeding ground for each generation to evolve into the next through a
repeated cycle or loop. The generation counter is incremented, the function generation
43
is called to generate a new generation, and the population is advanced: oldpop =
newpop. In addition, the main program is responsible for reading in the input data for
the problem, randomly initializating the first population, calculating of relevant statistics,
and reporting various parameters for plotting graphs. As part of the statistics, the best
k strings accumulated over every three generations are reintroduced into every fourth
generation. This procedure, called elitism, ensures the monotonic increase in the average
fitness level of the population over the generations. The value of k is selected as the
bigger of 1 and LPopulation_size/25J.
3.1.8.2 Approach 2: Usin& the LibGA annlication tool
When using the LibGA application tool [COR93], only the code for the evaluation
scheme and the main driver routine need to be written. The code for the evaluation
scheme is written in a function called objJun() which is declared in the main program
main(). This main program consists of essentially two parts, namely, the initialization
and the execution of the genetic algorithm. The objJun takes a chromosome as an
argument and returns a real fitness value. The chromosome is defined as a string of bits
using a configuration file. The process of evaluating the fitness of a chromosome
remains the same as before. The only difference lies in the coding of the main program,
the configuration file, and the user data file. These are explained below.
3.1.8.2.1 The Main proeram The main program consists of two parts, the
44
initialization and the execution. The initialization of the genetic algorithm is
accomplished by the function GA_conjig ("tupleapp.cfg", objJun). This function sets
the configuration of the genetic algorithm to the contents of the file tupleapp. cfg and sets
the function objJun as the evaluation scheme for the genetic algorithm. The function
objJun takes a data structure of type Chrom_Ptr as the argument. This data structure is
shown in Appendix A. The chromosome is stored in an array of type Gene_Ptr. Other
useful information is stored along with the chromosome, such as its length and fitness.
A population consists of a group of chromosomes. So the data structure for the
chromosome contains an index that is used to reference this chromosome in the group.
The configuration of the genetic algorithm is stored in a data structure of type
GA_Info_Ptr, which is returned by the function GA_conjig. The configuration file
contains the settings for the data type of the gene, the chromosome length, the population
size, the selection method, the genetic algorithm type, the crossover method, the
mutation method, etc. The configuration is set to a default setting unless changed. In
the configuration file tupleapp. cfg, the following settings are used:
user data the name of the input data file that contains the word list
data_type data type of a gene is a bit representing 1 or 0
initpool random (initial population is generated at random)
length length of the chromosome is set from the input data file
pool_size the population size is set from the input data file
The remaining conjiguration is at the default settings.
Every application will have its own configuration file where the configuration settings
45
for that application are stored. The configuration file used for this application is called
tupleapp.c[g. As mentioned earlier, one of the settings called user_data is used to store
the name of the input data file. This input data file will contain the word list from which
the tuples will be extracted. A function called readwords () reads the input word list from
the data file user_data. The execution of the genetic cycle is started by simply calling the
function GA_run (ga_info). The argument ga_info is a data structure of type
GA_Info_Ptr that is initialized to the contents of the configuration file.
3.1.8.2.2 The Confipration file The configuration file consists of settings that
are used to initialize the data structure GA_Info_Ptr. The components of this data
structure are given in Appendix A. Some of the components are explained briefly. The
entry user_data is used to store the name of the input data file. Data_type is an integer
flag representing the data type of the gene. The different data types allowed for a gene
are, bit, int, and real. The entry chrom_len refers to the length of the chromosome. A
boolean minimize is used to represent the type of optimization that is needed. If minimize
is true then the genetic algorithm minimizes the objJun; otherwise the objJun is
maximized. The crossover rate and the mutation rate are contained in the entries x rate
and mu_rate respectively.
3.1.8.2.3 The User Data file The user data file is the file that contains the input
word list. The first line of this file contains the number of words, the length of each
word, and the alphabets used to form the words. This is followed by the word list. The
46
function readwords reads the contents of this file and stores the words in an array of type
string. In the evaluation of the fitness of a chromosome, the tuples that it represents are
extracted from these words. The evaluation scheme, as explained in section 3.1.5 takes
these tuples as input to determine the fitness of the chromosome.
3.2 Description of the Combinatorial/Exhaustive Search Approach
The combinatorial approach systematically eliminates all possible nonsolutions
starting from tuples of cardinality 1 and ending at the whole word length. At any point
in this process as soon as a solution is found the program terminates immediately. A
solution is one for which the overlap is O. A nonsolution is one for which an overlap
exists. The extent of this overlap is not evaluated as it serves no purpose to this search
method in yielding a solution. All possible combinations of tuples of size 1 are tested
first. Then all combinations of tuples of size 2 are tested and so on until tuples of size
w are tested, where w is the length of the whole word. Thus in the worst case the
number of trials made computes to :
where w is the length of the word. This approach guarantees finding an optimal solution
if one exists.
When the number of words and the alphabet size used to form the words are
known in advance, the combinatorial approach can be modified to improve its
performance. Consider a list of words constructed from an alphabet of size r. The total
47
number of unique tuples of size 2 that can be constructed is r. Hence if the total number
of words in the list is greater than r, any set of tuples of size 2 extracted from the list
will have duplicates. So any tuple of size 2 cannot be a solution. In general any tuple
of size k, for which rt < total number of words in the input list, is not a solution and
can be skipped. This technique is incorporated into the combinatorial algorithm to
improve the efficiency of the exhaustive search.
3.3 General Description of the Trial and Error Approach
The trial and error approach to finding a solution is a completely random process
in which the solution is sought by repeated attempts or guesses at the tuples hoping to
hit the mark. This approach does not guarantee a solution even if one exists. Also the
solution, if found, is not necessarily the optimal one.
The basic technique of the trial and error approach consists of picking characters
at random positions from each word in the word list. Each such attempt, called a trial,
can be represented as a sequence of character positions. For example, one such trial
sequence [3,5,4] represents the set of tuples of size 3 obtained by extracting characters
from positions 3, 4 and 5 from the word list. The tuples obtained on each trial are
evaluated to determine the overlap. If the overlap is 1 then the solution is found and no
more trials are made; otherwise, another trial to guess the solution is made. No
assumption or knowledge of the word list is used in taking a guess at the solution. A
specified upper limit on the number of trials on each tuple size is used. When this limit
48
is exceeded then trial sequences of the next larger size are generated and so on. This
process is repeated until either a solution is found or the limit on the total number of
trials allowed is exceeded.
3.4 About the Implementation Platform
The machine used is a Sequent Symmetry S81 with 24 80386 processors running
at 20 MHz each and with a RAM of 108 Mbytes. Each processor also has both a 80387
and a Weitek floating point coprocessor.
CHAPrER4
4. RESULTS AND OBSERVATIONS
The problem is to extract a unique set of tuples from a given word list. A tuple
is a sequence of characters picked from a word in the list. All the tuples are extracted
by picking characters in the same sequence from each word in the list. The tuples must
not only be distinct but also obtained by picking as few characters as possible. A
genetic algorithm is implemented to solve this problem. The results obtained by this
algorithm are displayed in the form of graphs. From these results the performance of
the genetic algorithm is observed. The input data consists of a list of words. Each word
in the list is constructed by randomly selecting characters from a given alphabet set. For
the purpose of evaluation of the genetic algorithm, the length of the words and the
number of words in a data set are varied.
The genetic algorithm implemented using the C programming language is called
gengrph. Another implementation called the tupleapp uses the libGA application tool.
The results obtained by these two implementations are compared in section 4.5. The
performance of the genetic algorithm gengrph is compared to the combinatorial algorithm
combon, that uses an exhaustive search technique to find an optimal solution to the
problem. This is discussed in sections 4.2 and 4.3. Also the effect of using a biased
initialization of the initial population, based on the probabilities established in section
3.1.3.2 of chapter 3, on the performance is observed in section 4.4.
49
50
4.1 The Performance of the Genetic Algorithm
The performance of the genetic algorithm is evaluated in terms of the execution
time required to find a solution and the quality of the solution found. The execution time
is measured in microseconds and represents the cpu time required to find a solution. The
combon uses an exhaustive search and guarantees to find an optimal solution. The
solution obtained by the genetic algorithm is optimal if the solution's fitness is 1.00 and
the size of the solution tuple is same as that of the solution obtained by combon. The
two types of graphs presented in the following sections display results of comparing the
execution time and solution quality obtained by both the genetic and combinatorial
algorithms.
4.2 The Execution Time
The execution time required to find a solution is compared for both the methods:
combon and gengrph. The solution obtained by the combinatorial algorithm combon, is
guaranteed to be optimal but the exhaustive search that it employs is time consuming.
The performance of gengrph is studied to determine under what circumstances it will
outperform the combinatorial algorithm combon. This section discusses the execution
times, but the quality of the solutions obtained from these runs are discussed in the next
section.
The execution time plotted on the yaxis is representative of an average over
51
several runs. On each of these runs a different set of words of the same length and
number is used. The execution times for all these runs are averaged. The graphs that
depict this performance use two types of xaxes: one in which the word length is fixed
and the number of words varied, the other in which the number of words is fixed and
the word length varied.
For the graph shown in Figure 8, the size of the input data set increases along the
xaxis. For all the data sets on the xaxis the word length remains fixed at 6. Each
point on the yaxis is computed as an average of the execution times obtained on 40
separate runs. For example consider the fourth point on the gengrph curve.
Ward lenGth • e. Popul.tlon Slze."O
6.+1 ..,
6.+ 7
O.+D
• gt ng rp h
• comboll
o 10D 20D 300 ACO 600
Num bar 01 Ward. In the Input d.ta set
Figure 8: Performance Graph for words of length 6
The xcoordinate of this point corresponds to the input data containing 100 words. The
52
ycoordinate of this point was obtained by generating 40 different data sets each
containing a 100 words and taking the average of the execution times for each one of
these data sets. The graphs shown in Figures 9 and 10 are constructed in exactly the
same fashion, except that the word length used in Figure 9 is 8 and that used in Figure
10 is 10. The graph in Figure 8 shows that the combon method outperforms the gengrph.
We also see from Figures 8, 9, and 10 that as the number of words in the
WDrd length • a, Pop ul.t1an Size • 4Q
7e.7 r.
• gengrp h
• com bon
t e.7
Oe.O
r"
~'
/"""
I
ti
.,;'
/
/
. ,J ; _l~ \ .~
,/ ).........
,.....
,I "._'. I __ ~ ...... ....... .'.....• ..., / ; . .... .JII".~ ..... ...J
o 100 200 300 .00 600
Num b.r afWor41 III the Illput 411•• et
Figure 9: Performance graph for words of length 8
input list increases, the gengrph curve rises more rapidly than the combon curve.
From Figure 8 to Figure 9, the gap between the performance of the two methods has
reduced. From Figure 9 to Figure 10, the gap has reduced even more. From this
observation, it seems that although the execution time of both the methods increase
linearly with input data size, the increase in word length favors the performance of the
53
genetic method gengrph. It was noted in section 3.1.3 of Chapter 3 that the size of the
entire solution domain increases exponentially with word length. Since the combon uses
an exhaustive search mechanism, it comes as no surprise to see why the performance of
the genetic algorithm gengrph is rapidly catching up to the performance of the combon.
The genetic search is a biased form of the trial and error search because more trials are
made in regions of the search space that look promising than in those that are not.
Ward lenGth. 10: Populltlon Slze ...O
8."1
1.+1 
6 ... 1
.E
~ a.+7 
o~u~
2.+7 
1.+7 
0.+0 
~
I \
• gin g rp h I \
• combon I \
,..,,,I \~
I
I
/
1I
I
I
I
I"'..j ~,
/ ,I \
or',~,...J....._.•./.." .... • II
~~ /'
",',,~ ~ .,. .r"· .. /' ..... ~. ..
o 100 200 300 400 500
N um b • r 0 f Word sin th. In put d.t. s. t
Figure 10: Performance graph for words of length 10
Thus the increase in the word length does not seem to effect the performance of gengrph
as much as it effects cornbon. This phenomenon is seen in the graph shown in Figure
11. The graph in this figure uses a varying word length along the xaxis, and a fixed
54
size for number of words in the input data set. As before, each point on the yaxis is
computed as an average over 40 separate runs. From this graph, a marked difference
in the performance of the two methods is seen. The gengrph appears to outperform the
combon after a certain word length is crossed. The combon shows an exponential rate
of growth of the execution time, where as the genetic algorithm remains almost
unaffected. The graph in Figure 12 is constructed in a similar fashion using input words
of increasing length, but the number of words used in all the data sets is now fixed at
600. The gengrph curve in this graph exhibits the same form as before, running almost
N umb. r 0 f word I = .. 00, Pop u I. tI 0 n SIZI = 40
3.+8
21+ 8 
., 21+ 8 
'V
C
0u
•, e 1.+ 8 
~
!,
•E
i= 1.+ 8 
I::
0 ;:;
:::3
U.. )( w 6.+7 
01+0 
o
•
•
ge n g rph
combon
10 16 20 26
Word Length
Figure 11: Performance graph #1 for words of increasing length
55
level in a zigzag fashion across the figure. The combon curve rises exponentially across
the figure, as expected. The genetic algorithm outperforms the combinatorial algorithm
when words of larger length are considered.
Num berofwordl =600, Population Size =.40
3.+8
• 3.+8  • g. n g rph J
I
• combon I
i
~ 2.+8  ,.
• ,I
'sV; /
0 ,. u•
2e+8  / ,e
,
u I
~ •I • 1e+ 8  / ., E
F 1\ ~ I~~ I c0
.,
0A. / ~"'>'/\ /"J
a 1.+ 8  •)C w
6.+7
,I . .....( ~ V
/ ~.....j'
0.+0  =
o 6 10 15
Word Length
20 25
Figure 12: Performance graph #2 for words of increasing length
It is not enough to note a significant improvement of the performance relating to
execution time. The question is whether the genetic algorithm finds acceptable solutions
to the problem with this efficiency. The quality of the solutions obtained by the genetic
algorithm is discussed in the next section.
56
4.3 The Solution Quality
The quality of the solution is measured in terms of solution tuple size and the
solution tuple's fitness. The quality of the solutions obtained by the genetic algorithm
is determined by comparing it with the optimal solutions obtained by the combinatorial
algorithm.
The graph in Figure 13 shows the sizes of the solution tuples obtained by both the
methods: gengrph, and combon. These solutions were obtained in the execution times
6..
.
~ 3
~.:
a:
s.;.
t=
c
.C.o.
~ 2 
, 
o
_ g.ngrph
J. com~."
o 100 200 300 400 eoo
Num b.r of Wo rd. in th. Input data ••t
Figure 13: Solution quality for words of length 6
represented in Figure 8. The graph has the same xaxis as in the graph of Figure 8, but
the yaxis represents the average of the solution tuple sizes obtained on 40 separate runs
57
for each input data set on the xaxis. It is seen in the graph that the optimal solution
W 0 r d Ie nath ~ 10, It 0 Pu I.t I0 " &1ze ~ 40
6..
~ gengrph
..   combon
1 
o
o 100 200 300 400 100
Num ber of Words In the Input data set
Figure 14: Solution quality for words of length 10
tuple size seems to increase with the size of the input data set. This is expected, since
more the number of words, the higher the chance of overlap among the extracted tuples.
Hence tuples must use more characters to avoid overlap, resulting in an increase of the
solution tuple size. The graph in Figure 14 is constructed in the same fashion except that
all the data sets on xaxis now use a larger word length of 10. A match between the size
of the solution tuples obtained by gengrph and combon indicates that the genetic
algorithm found an optimal solution. By comparing the graphs of Figures 13 and 14, it
58
is seen that the latter of the two shows more mismatches between the solution tuple sizes.
This seems to be the affect of choosing input data sets containing words of higher length.
This affect is seen more clearly in the graph of Figure 15. In this graph the xaxis
represents input data sets of increasing word length. The yaxis represents the average
of the solution tuple sizes obtained over 40 separate runs of each data set. Towards the
higher word lengths, the mismatches between the solution tuple sizes seem to occur more
frequently. Thus the quality of the solutions obtained by the genetic algorithm worsens
as the word length increases. Considering the efficiency of the genetic algorithm on
data sets of higher word lengths, as demonstrated in the previous section, this loss of
solution quality is acceptable. For the most part, the solutions obtained by the genetic
N um b. r 0 f VII 0 rd. • .. 0 O. Pop u I. t Ion S I Z 6 • 4 0
7r.
6 ....  gengrph
... combon
6 
:I
en
• 4  a.
Fc0
:p
:I , 
~
(I)
2 
1 
o . , . " . ," . " " "
2 4 e 8 10 12 14 1 e 18 20 22
Word Length
Figure 15: Solution quality for words of increasing length
59
algorithm, when not optimal, differ in length from the optimal by utmost 3 characters.
In the graph of Figure 14, the xaxis has 19 points. As mentioned before, the
average solution tuple size over 40 independent data sets was computed for each of these
points on the xaxis. This gives a total of 19x40 or 760 runs for this graph alone.
Similarly the graph of Figure 13 represents the outcome of another 760 runs. Similar
runs were performed for two additional graphs of this type, for which the word lengths
were fixed at 8 and 12. Thus the grand total number of runs performed was 4x760 or
3040. A shell program, designed to perform this simulation, keeps track of the number
of times the genetic algorithm yielded a solution of fitness value less than 1.0. These are
the times that the genetic algorithm failed to find a solution. For this simulation set this
happened 4 times, yielding an almost perfect success rate of 3036/3040 or 99.86 %.
4.4 Biased Initialization Vs Random Initialization
Two methods of creating the initial population are implemented, namely random
initialization and biased initialization. This was discussed in section 3.1.3 of chapter 3.
The effect of using biased initialization on the performance in contrast to the random
initialization is observed.
In the graph shown in Figure 16, the xaxis represents the data sets with varying
word lengths and the yaxis represents the execution time in microseconds. The
execution time was computed as an average over 40 separate runs using the
corresponding data set on the xaxis. The word length of the data sets increase along the
60
xaxis, but the number of words in all the data sets if set to 400. Figure 17, shows the
quality of the solutions obtained by these runs.
From Figure 16, we see that the performance of both the implementations, as far
as the execution time is concerned, does not differ much. Just as the gengrph curve
shown in Figures 11 and 12 , the curves in this graph exhibit the same shape and form.
2e.8 ,
• gen grph using rando m Initialization
• glngrph using blas.d Initialization
28.8 
..g
~ 1 e+ 8 
u
I),eu~•
1 e. 8 E
i=
c
:8
::3
U
I) an ,.·7 
0.·0 
•,
II
II
II
,II
I I
II
I I
I I ! I
,IIl\
I \ I \
..: ... ,, ty/"\\\~ .,••11':" z \ I
/ ,''''• V \ ..:Ws .,."
• MI .. \t~., ~
a 10 11 20
Word Length
25 30
Figure 16: Comparison of performance of two initialization techniques
The increase in word length does not seem to effect the execution time, which stays
almost level.
61
The real differences in the performance between these two implementations of the
genetic algorithm is seen in the next graph in Figure 17, which shows the solution tuple
sizes obtained. The xaxis is same as that of the graph in Figure 16. The yaxis
represents the size of the solution tuples. These sizes were computed as an average over
40 separate runs using the corresponding data sets on the xaxis. The solutions obtained
by the implementation that uses biased initialization seem to be better than the ones
obtained by the other implementation. This is due to the fact that the solution tuple sizes
8...,....."
~ aengrph using rlndom Initialization
IIIIiIIiIIiI!Ii gengrph uling bl.led Inltlallzltion
6 
• 
•N
iii • Q.
:J
... 3 
c0
:;:I
:::J
'0
CI)
2 

I 10 11
Word length
20 26
Figure 17: Solution quality corresponding to the graph of Figure 16
62
are smaller for the genetic algorithm that uses biased initialization even when the fitness
of the solutions for both the implementations are equal.
Consider for example the data set corresponding to the word length of 10. The
graph in Figure 17 shows that the solution obtained by the implementation that uses the
biased initialization is much better, even though, as seen in Figure 16, it takes more
execution time than the implementation that uses random initialization. Now consider
the data set corresponding to the word length of 9. From Figure 16, it is seen that the
implementation using random initialization takes more execution time than the other
implementation. However, the solution tuple size obtained by both are the same. So it
appears that even though the genetic algorithm that uses biased initialization occasionally
takes longer to find a solution, it finds a better solution. This cannot be said for the
genetic algorithm that uses random initialization.
4.5 Comparison of the two implementations of the genetic algorithm
The libGA implementation of the genetic algorithm is called tupleapp and the C
programming language implementation is called gengrph. The convergence to the
solution by both these implementations is expressed in a graph form. The graph
represents the standard deviation and average of the fitness of all the chromosomes in
each generation. The point where the standard deviation curve reaches the zero level
represents the point of convergence. At this point the average curve represents the
fitness of the solution to which the genetic algorithm converged.
63
The graph in Figure 18 shows the convergence to a solution by both the
implementations. in this graph, both implementations do not use elitism. It is seen that
both the implementations seem to converge at the same rate. The graph in Figure 19
shows the same phenomenon, but both the implementations now use elitism. Although
both the implementations still converge at the same rate, a marked improvement is noted
from the graphs shown in Figure 18.
Convergence (with Elitism turned off)
1 gptt~ .. ·tA·o,••• otftft·t,•••c.ttt·"ft.···c
• average fitness (gengrphj
• standard deviation (gengrph)
avera g e fitn e ss (tu p Ie a p p)
" standard deviation (tupleapp)
o
~
~~\r m A~ $$5$ C••• , ~. 4R
,. ~ ; ••••• • 0•••••' ..
o 20 40
Gen.ratlon
60 80
Figure 18: Convergence when not using elitism
1
o
64
C on vergence (with Elitism turned on)
'I
'I • average fitness (gengrph)
/
.i", • standard deviation (gengrph)
... average fitness (tupleapp)
'Y standard deviation (tupleapp)
o 1 2 3 4 6 6 7 8 9
Generation
Figure 19: Convergence when elitism is used
4.6. Additional graphs presented in Appendix A.
In section 4.2, it was mentioned that the performance graphs use two types of xaxes:
one in which the word length is fixed and the number of words varied, the other
in which the number of words is fixed and the word length varied. The graphs in
65
Figures 8, 9, and 10 represent the first kind. A computation of the success rate of the
genetic algorithm for these 3 graphs was given in section 4.3. Appendix A shows graphs
that represent the solution quality for the performance graphs shown in Figures 11 and
12. The computation of the success rate of the genetic algorithm for these set of graphs
is also given in the appendix A. In section 4.4, a comparison of the performance of the
two initialization techniques was presented. The graph in Figure 16 compared the
execution times and the graph in Figure 17 compared the solution qualities. Appendix
A shows more graphs of these type.
CHAPTERS
S CONCLUSIONS AND FUTURE WORK
S.1 Conclusions
A genetic algorithm was implemented to solve the problem of extracting tuples
from word sets. Two implementations were realized, one using the C programming
language, and the other using the libGA application tool. The performance of these two
implementations were similar. The implementation of genetic algorithms is easier using
the libGA application tool, since only the fitness function needs to be coded. To
implement more complex genetic program, the libGA application tool does not provide
the flexibility needed to alter the data structures and provide decoders and repair
algorithms. The libGA application tool can handle implementations of only classical
genetic algorithms. The C programming implementation of the genetic algorithm used
two separate techniques of initializing the population. The first used a random
initialization that involves a random selection of points from the entire solution domain.
The second used a biased initialization that involves selection of points based on
estimated probabilities of overlap for each separate tuple size. When using the random
initialization technique the genetic algorithm converged to a solution in less time,
however the biased version seemed to converge to a better solution. The execution time
of the genetic algorithm to converge to a solution increased with the population size used.
66
But using larger population sizes yielded better solutions. The performance of the
genetic search was compared to an exhaustive search.
The genetic search seemed to perform better under certain conditions and the
exhaustive search under other conditions. For smaller word lengths the exhaustive search
outperformed the genetic search. As words of larger length were considered the
performance of the genetic search improved until finally it outperformed the exhaustive
search. The solution domain increases exponentially with the word length. The genetic
search seemed to perform better for larger solution domain. This performance was not
gained without paying a price. The price paid was the length of the solution tuple. The
solution tuple lengths were larger than the optimal solution tuple lengths for larger word
lengths, but this was under tolerable limits. The gain in the performance of the genetic
algorithm outweighed the loss in the optimality of the solution tuples. For word lengths
of 20 and more the length of the solution tuple was larger than the optimal length by
utmost 3 or 4 characters. Although the exhaustive search guarantees an optimal solution,
this search is combinatorially explosive. The genetic algorithm on the other hand shows
an approximate linear increase in the search time and finds a solution with 85% chance
of being optimal. Even when word length was fixed and the input word list was
increased, the genetic search eventually outperformed the exhaustive search.
5.2 Future Work
The population size effects the performance. Larger populations yield better
67
68
solutions but also take longer time to converge. Choosing an appropriate population size
for this problem is an issue that can be addressed in future. Also the affect of using
different variations in the implementations of the genetic operators on the performance
can be explored. The evaluation scheme can be modified to incorporate the probabilities
that were computed in section 3.1.3.2 of Chapter 3. A suggestion would be to use these
probabilities to form a biased fitness evaluation scheme.
[ARU93]
[AOE92]
[BAT93]
[BOR91]
[BRA89]
[CEL93]
[CHA91]
[CHI91]
[CHA86]
[CHA84]
[CIC80]
[DAM91]
BIBLIOGRAPHY
S. Arunkumar and T. Chockalingam, "Genetic Search Algorithms and
their andomized Operators" , Computers and Mathematics with
Applications, Vol 25, (5), (March 1993), pp 91100
JunIchi Aoe, Katsushi Morimoto and Takashi Sato, "An Efficient
Implementation of Trie Structures", SoftwarePractice and Experience,
Vol 22, (9), (April 1992), pp 695721.
David L. Battle and Michael D. Vose, "Isomorphisms of genetic
algorithms", Artificial Intelligence, Vol 60, (March 1993), PP 155165
Jurgen Borstler, Ulrich Moncke, and Reinhard Wilhelm, "Table
Compression for Tree Automata", ACM Transactions on Programming
Languages and Systems, Vol 13, (3), (July 1991), pp 295314.
Marshall D. Brain and Alan. L. Tharp, "Nearperfect hashing of large
word sets", SoftwarePractice and Experience, Vol 19, (10), (October
1989), pp 967978.
Joe Celko, "Genetic Algorithms and Database Indexing", Dr. Dobb's
Journal, Vol 18,(4), (April 1993), PP 3034
C. C. Chang, C. Y. Chen and J. K. Jan, "On the Design of a MachineIndependent
Perfect Hashing Scheme", The Computer Journal, Vol 34
(5), (1991), PP 469474.
ChinChen Chang and TzongChen Wu, "A Letteroriented Perfect
Hashing Scheme Based upon Sparse Table Compression", SoftwarePractice
and Experience, Vol 21, (1), (January 1991), PP 3549.
C. C. Chang and R. C. T. Lee, "A letteroriented Minimal Perfect
Hashing Scheme", The Computer Journal, Vol 29, (3), (1986), PP 277281
C. C. Chang, "The study of an ordered minimal perfect hashing scheme" ,
Communications of the ACM, Vol. 27, (4), (1984), PP 384387.
Richard J. Cichelli, "Minimal Perfect Hash Functions Made Simple",
Communications of the ACM, Vol. 23, (1), (January 1980), pp 1719.
Eric S. Lander, Robert Landridge and Damian M. Saccocio, "A report on
69
[GOL89]
[GOL85]
[HOL92]
[HOF85]
[HOL75]
[HOL73]
[HUN91]
[JAS80]
[KER70]
[LAN91]
[LAD71]
[LIN70]
70
computing in molecular biology; mapping and interpreting biological
information", Communications of the ACM, Vol 34, (November 1991) pp
3239.
David E. Goldberg, "Genetic Algorithms in Search, Optimization, and
Machine Learning", Reading, Mass: AddisonWesley.
David E. Goldberg, "Genetic Algorithm and Rule Learning in Dynamic
Control Systems", Proceedings of the First International Conference on
Genetic Algorithms, Lawrence Erlbaum Associates, Hillsdale, NJ, (1985),
pp. 815.
John H. Holland, "Genetic Algorithms", Scientific American, Vol
267,(1), (July 1992), pp 6672.
Douglas R. Hofstadter, "The genetic code: arbitrary?", in Metamagical
Themas, New York: Basic Books, PP 671699.
John Holland, "Adaptation in Natural and Artificial Systems", University
of Michigan Press, 1975.
John H. Holland, "Genetic algorithms and the optimal allocation of trials" ,
SIAM J. Computing, Vol 2, (2), (June 1973), pp 5056
Lawrence Hunter, "Artificial Intelligence and Molecular Biology", AI
Magazine, Vol 11, (January 1991), pp 2736.
G. Jasechke and G. Osterburg, "On Chichelli's Minimal Perfect Hash
Functions Method", Communications of the ACM, Vol. 23, (12),
(December 1980), pp 728729.
B. W. Kernighan and S. Lin, "An Efficient heuristic procedure for
partitioning graphs", Bell Systems Technical Journal, Vol 49, (2), (1970),
pp 291308
Eric S. Lander, Robert Landridge and Damian M. Saccocio, "Computing
in molecular biology: mapping and interpreting biological information" ,
Computer, Vol 24, (November 1991), pp 613
R. Landauer and J. W. Woo, "Minimal Energy Dissipation and Maximal
Error for the Computational Process", J. Anpl. Ehys., Vol 42, (1971),
pp2301+.
Shen Lin, "Computer Solutions of the Traveling Salesman Problem" , Bell
[LIU91]
[MIC92]
[PAV92]
[SPI92]
[SAG85]
[TAR79]
[VOS91]
[WAL89]
71
Systems Technical Journal, Vol, (December 1965), PP 22452265.
Joseph W. H. Liu, "A Generalized Envelope Method for Sparse
Factorization by Rows", ACM Transactions on Mathematical Software,
Vol 17, (1), (March 1991), pp 112129.
Zbigniew Michalewicz, "Genetic Algorithms + Data Structures =
Evolution Programs", Sprin&erVerla& Berlin Heidelberg, (January 1992)
Pevzner A. Pavel, "Multiple alignment, communication cost, and graph
matching", SIAM Journal on Al1l1lied Mathematics, Vol. 52, (December
1992), pp 17631779.
Richard Spillman, "Genetic Algorithms", Dr. Dobb's Journal, Vol 18,(4),
(July 1992), pp 2630
Thomas J. Sager, "A Polynomial Time Generator for Minimal Perfect
Rash Functions", Communications of the ACM, Vol. 28, (5), (May
1985), pp 523532.
Robert Endre Tarjan and Andrew ChiCRih Yao, "Storing a Sparse
Table", Communications of the ACM, Vol. 22, (11), (November 1979),
pp 606611.
Michael D. Vose, "Generalizing the notion of schema in genetic
algorithms", Artificial Intelli&ence, Vol 50, (1991), pp 385396
Walbridge, Charles T., "Genetic Algorithms : What Computers Can
Learn from Darwin", Technology Review, Vol 92 (1), (Jan 1989), pp 4653
APPENDIX A
The graph in Figure 20 shows the comparison of execution times required to find a
solution by the genetic search and exhaustive search. Word length varies along the xaxis.
The number of words in each data set is fixed at 800. This graph is an
addition to the graphs of figures 11 and 12, in which the size of the data sets were
fixed at 400 and 600 words respectively. The graph on the next page shows the
quality of the solutions obtained for the same data sets as depicted in Figure 20.
The graphs in figures 23 through 26 compare the performance between the two
implementations of the genetic algorithm which use random and biased initialization
respectively. Graphs in figures 23 and 25 compare the execution times and the
graphs in figures 24 and 26 compare the solution qualities.
N umb. r 0 f word I =800, Pop uI. t Ion S Iz. =.0
~.+8 .. /
31+8 ,. • g. ng rp h I
• combon I
. 31+8 r; ." I c0
21+ 8 "/
u
41)
'"I I \ I e I til u~
21+8 I
I •E
F 1 1+ 8 c0
+i
:::J
U•)C 11+ 8 w I
I
I
6e+7 j JIll
/~
01+0
o 6 to t6 20
Word L. n g th
26 30
Figure 20: Performance graph #3 for words of increasing length
72
Numberofwordl =800, Population Size =40
7
e   g.ngrph
..... combon
6
~
N
(i)
t> .. 
Q.
~
tt:
0 ;;
~ 3 
15
(I)
2 
73
o
4 6 8 1 0 1 2 14 1 e 18 20 22 24
Wo rd Len g th
Figure 21: Solutions corresponding to the graph in figure 20
74
Numberofwords =600, Population Size =40
6
6 
.. 
•N
Ui
•Q.
~ 3  c0~
::::I
'0
(I)
2 
1 
o
6
~ gengrph
... combon
10 16
Wo rd Len g th
20 26
Figure 22: Solutions corresponding to graph in figure 12
The graphs shown in figures 11, 12, and 20 belong to the category in which the length
of the words in the input data sets increase along the xaxis. The only difference
between these graphs is the size of the input data set. For the graph in figure 11 all the
data sets have 400 words. Similarly for the graph in figure 12 the size of input data sets
are set at 600 words, and for figure 20 the size is set at 800 words. Each of these
graphs are constructed from 24*40 independent runs, because each point on the y axis
is computed as an average over 40 runs and there alltogether 24 points. Thus the total
number of runs performed in all the three graphs is 3*24*40 = 2880. The number of
times the genetic algorithm failed to find a solution of fitness 1.00 is 57, which gives a
success rate of (288057)/2880 = 98.02%.
2.+8
• g.n arp h U 11"1 ra" do m 1"ltl.llzatlo n
• Sit n Ct r p h u.1 n Ct b la. t dI In Itl a lI.a t Ion
75
; 1 1e+8 ¥,~
CI
~•
1 .... 8 E
F
I:
.2 ;
u•
.; 6.+7 
o I 10 11 20
WO rd Len gttl
21 30
Figure 23: Comparison of performance of two initialization techniques.
All input data sets contain 400 words
...,....
 gl"_ rph ualnl randem 1"ltI.llz.tlon
 ._n.rph u.IIIJ Ifla••41 In IUalilatlon
6
...•i
~ 3
c::: o
:II
:3 i
2
Figure 24: Solution quality corresponding to the graph of Figure 23
2e+8 ~
76
1 .... 8 
i 1e+8
o
i,~
1.+8 
~•
E
F 8.+7
e
.2
';
u:: '.+7 UI
• gen,rp h U ling ran do m Initlalll.tlo n
• GI "_, rp h u sin I b la •• d In ItlBlilatlon
f\ i
Ii 1\ ,A l (;
I \i; I\'\/ ~ I \
,. \ ; \; ! ~ \! 1\ \. \!/ J \, r h\ t. I IJIII \ ,I1t\ I\.i I \ I •" \I l..\. ;
I '.\\ :\ ; \ ~! '\ I ·; " I
I \ ,,!J \,f \ !l: ! I\'\~i ,~I"
1\~. ~ ~ "/ \. •/
.' , 'i/ \/. ~."'~I
o. + 0 + .....,......_..4
o I 10 11 20
WO rd Len gth
21 30
Figure 25: Comparison of performance of two initialization techniques.
All input data sets contain 600 words
T
I
5
..=. "  D. .a
i
:s 3
o
CI)
2
o 5 10
WO rd len gth
16 20
Figure 26: Solution quality corresponding to the graph of figure 25
APPENDIXB
The source code of the genetic algorithm gengrph implemented in C
Uinclude <stdlib.h>
#include <stdio.h>
Uinclude <time.h >
Uinclude <math.h >
Uinclude "const.h"
Uinclude "def.h"
Uinclude "rand.h"
Uinclude "print.h"
Uinclude "eval.h"
Uinclude "genop.h"
/*Constants used */
/*Data Structures defined*/
/*Random number generation related functions*/
/*Output functions*/
/*Functions related to fitness evaluation */
/*Functions related to genetic operations */
/****************************************************
Driver Module of the Genetic Algorithm.
Creates a new generation from an existing one. The
population continues to evolve through the generations
until a solution is found, or the population is dominated
by a single chromosome/individual.
*****************************************************/
main (argc,argv)
int argc;
char *argvD;
{
Population pop, npop; /* old and new populations */
int maxndx,
minndx,
gencnt;
/* index to the chromosome of maximum fitness*/
/* index to the chromosome of minimum fitness*/
/* Generation counter */
float
int
avg, av, stdev; /* average and standardeviation */
i, j,
lchrom, /* length of chromosome */
nofwords, /* total number of words in the input file*/
popsize, /* population size */
stop, stpndx; /* stop is used to store a boolean 1 or 0
for termination condion. */
77
78
char letters[ORDINAL], /* The alphabet used by the input word set */
words[MAXNUMBEROFTUPLES][MXLNGTH]; /* the list of
input words */
float scale [MAXPOP]; /*cumulative sumfitness used in selection
process*/
/* The command line syntax : consists of the file name that contains the
input data, the population size of a generation, and a boolean value for elitism
of 1 or O. If elitism is 1 then elitism is applied otherwise no elitism is
applied*/
if (argc !=4 ) {
printf ("\nFormat : %s <input file> <pop size> <elitism> \n", argv[O]);
exit (1);
}
clock (); /* Clock is reset to O. The next time clock is invoked upon termination
of the genetic algorithm to determine the amount of cpu time used to
execute the program. */
MXPOP = atoi (argv[2]);
ELIT = atoi (argv[3]);
/* This module generates the initial pool of chromosomes that the first
generation will consist of. The initial pool is created using either a random
initialization or a biased initialization. */
Initialize (pop, &popsize, scale, argv[l], &lchrom,
letters, words, &nofwords);
gencnt = 1; /* Start the generation counter */
do {
/* This module creates a new population by applying the genetic operators:
reproduction, crossover, and mutation to selected chromosomes
of the old population */
Generation (pop, npop, &popsize, lchrom, scale,
words, nofwords);
79
/* The old population is replaced by the new population */
for (i=O;i <popsize;i+ +) pop[i] = npop[i];
/* Increment the generation counter to indicate the creation of the next
generation*/
gencnt++;
/* Compute relevant statistics for the new generation*/
statistics (popsize, &maxndx, &avg, &minndx, pop);
/* This module tests for the termination of the Genetic Cycle. If
the module returns with a stop value of 1 then the cycle/loop
terminates */
Terminate (pop, popsize, &stop, maxndx);
} while (!stop);
printf ("\n %s", pop[maxndx].chrom); /* The genetic program converged to this
solution */
}
flle: const.h (constants used)
31
500
26
/*Maximum population size*/
/* Number of unique alphabets contained
by the list of words */
/* Maximum String Length of all the
words */
Udefine MAXNUMBEROFTUPLES 1001 /*Maximum number of tuples*/
Udefine MAXPOP
Udefine ORDINAL
Udefine MXLNGTH
file: def.h (Data structures used)
typedef int BOOL;
typedef char Allele;
typedef Allele Chromosome[MXLNGTH];
typedef struct {
Chromosome chrom;
float x;
float fitness;
int parentI, parent2, xsite;
int count;
} Individual;
typedef Individual Population[MAXPOP];
int nmutation, ncross, MXPOP = 0, ELIT=O;
float pcross, pmutation, sumfitness; 1* probability of crossover, mutation. *1
long RestrainLength [MXLNGTH+1]; 1* used in the fitness function to
update the smallest hashlength for each
tuple size *1
tlIe: rand.h
long seed = 1.0;
/*****************************************************
Function : Random 0
Fetch a single random number between 0.0 and 1.0
*****************************************************/
float RandomO
{
float random;
long temp_seed; 1* temp_seed is declared as long ie, 32 bit accuracy */
int aa= 16807; 1* { 7A5 } *1
long m=2147483647; /* {2 A31 1 } *1
int q= 127773; 1* {m 1 aa}*/
int r =2836; /* {m mod a} */
temp_seed=aa*(seed % q)  r* «int)(seed/q»;
80
if (temp_seed > = 0) 1* if the temp_seed is greater than
zero seed is made equal to temp_seed
otherwise m is added to temp_seed to
make it positive. *1
seed = temp_seed;
else
seed =(temp_seed + m);
random = (float)seed/m;
return(random);
1* seed is divided by m to get the
random number . To preserve the
accuracy of random the float value
is taken *1
1* returning the Random number *1
pop;
*popsize;
scaleD;
*filename;
*lchrom;
letters[];
words[][MXLNGTH];
*nw;
}
/*****************************************************
Function: Rnd (low .. high)
Returns an integer selected uniformly and pseudorandomly
between upper and lower limits.
*****************************************************/
int Rnd (low, high)
int low, high;
{
int i;
if (low> = high)
i = low;
else
{
i = (RandomO * (highlow+1) + low);
if (i > high)
i = high;
}
return i;
}
rIle: genop.h
/*********************************************************
The Initialization Module
This function is responsible for the initialization of all
the parameters, and data structures needed for use by
other modules. Among various other initializations, the
gene pool for the first generation is initialized here.
The input data, the word list, is read from the input file.
**********************************************************/
Initialize (pop, popsize, scale, filename, lchrom, letters,
words, nw)
Population
int
float
char
int
char
char
int
{
81
int i, end, nofletters, nofindivid, n, tuplesize, pndx;
char msg [MXLNGTH], ch;
float objfunc 0, sum, power 0, find_series_sum 0;
FILE *fp;
/* Initialize the probilities of application of the genetic operators*/
82
pmutation = 0.001;
nmutation = 0;
pcross = 0.25;
ncross = 0;
/* Probability of mutation/mutation rate */
/* Counter to keep track of the number of
mutations performed */
/* Probability of crossover/crossover rate */
/* Counter to keep track of the number of
crossover operations performed */
/* Opening the input file that contains the list of words */
if «fp = fopen(filename, "rN» == NULL) {
printf ("\nError in opening the input file containing the words.\nlf
);
exit (1);
}
/* The data structure: RestrainLength, is used in the eval.h header file
to keep check on the size of the tuples. */
for (i = 1; i < = MXLNGTH; i++)
RestrainLength[i] = 80000L;
/* Read in the header section of the input file, which contains the
the parameters: nw (number of words ), lchrom (word length),
and the letters (the alphabet comprising the words) */
fscanf (fp,"%d %d If ,ow ,lchrom);
i = 0;
while «ch=getc(fp» != '\n') letters[i+ +] = ch;
nofletters = i;
letters[i] = '\0';
/* Read in the words from the input file*/
i = 0;
while (fscanf (fp, "%8", words[i]) != EOF) i+ +;
fclose (fp);
/* Computing the size of the complete solution domain. If the
lchrom is 5 then the population will consist of chromosomes
ranging from: 00001 to 11111, which totals to 64 1 = 63 */
for (end=O,i=O;i<*lchrom;i+ +)
83
end = (end< <1) IOxO1;
/* if size of solution space is greater than the maximum population
size then population size is set at MXPOP, otherwise population
size is set at the size of solution domain.
Also if size of solution domain > MXPOP then the initial population
will consist of MXPOP randomly selected entries from entire solution
domain, otherwise the initial population will be the entire solution
domain. */
if (end > MXPOP){
if (INITYPE = = O){ /*Implementing Random Initialization */
*popsize = MXPOP;
for (i = O;i < *popsize; i+ +) {
convertobin (Rnd(l,end), msg, *lchrom);
strcpy (pop[i].chrom, msg);
}
}
else {
/* Implementing Biased initialization based on probability of overlap
among tuples calculated as explained in section 3.1.3.2.
From section 3.1.3.2 it was stated that the probability that two ktuples
will not be identical is (t't'")/(t'I). The number of ktuples that will
be introduced into the population will be biased by the weight given by
r lI_r ll1
weight = =
II E(rllr lJ
;
1=1
II E (ll/r~
k=1
So the number of ktuples = weigth x total number of words*/
*popsize = MXPOP;
pndx = 0;
nofindivid = 0;
/* This function computes the denominator of the expression for weight
given above*/
sum = find_series_sum (nofletters, *lchrom);
for (tuplesize= 1; tuplesize< = *lchrom; tuplesize+ +) {
/* the number of tuples of size tuplesize that will be introduced into the
population is computed */
n = *popsize * «II/power(nofletters, tuplesize»/sum);
/* these tuples are now introduced into the population */
for (i = O;i < n; i++) {
createAtupleofsize (tuplesize, msg, *lchrom);
strcpy (pop[pndx+ + ].chrom, msg);
}
}
if (pndx < *popsize1)
for (i =pndx;i<*popsize;i+ +) {
createAtupleofsize (*lchrom,msg,*lchrom);
strcpy (pop[i].chrom, msg);
}
}
}
else {
*popsize = end;
for (i=O;i < end;i+ +) {
convertobin (i+ 1, msg, *lchrom);
strcpy (pop[i].chrom, msg);
}
}
/* Initialize the sum total of the fitness of all the chromosomes */
for (sumfitness = 0, i = 0; i < *popsize; i+ +) {
pop[i].fitness = objfunc (pop[i].chrom, *lchrom,
words, *nw);
sumfitness + = pop[i].fitness;
}
/* Initialize the cumulative sum of the fitness of all the chromosomes */
scale[O] = pop[O].fitness;
for (i= 1;i<*popsize;i+ + )
scale[i] = scale[i1]+pop[i].fitness;
}
/*****************************************************
Generation 0
Create a new generation through select, crossover,
and mutation.
Generation size remains same when an even numbered popsize
is considered. If the popsize is odd then the resulting
new generation will contain one extra chromosome.
84
******************************************************/
Generation (popold, popnew, popsize, lchrom, scale, words, nw)
Population popold, popnew;
int *popsize, lchrom;
float scaleD;
char wordsD[MXLNGTH];
int nw;
{
int j, mateI, mate2, jcross, i;
Allele Mutation 0;
float objfuncO;
j = 0;
do {
matel = Select (*popsize, sumfitness, popold, scale);
mate2 = Select (*popsize, sumfitness, popold, scale);
Crossover (popold[mateI] .chrom, popold[mate2].chrom,
popnew[j].chrom, popnew[j+1].chrom, lchrom);
popnew[j].fitness = objfunc (popnew[j].chrom, lchrom, words, nw);
popnew[j+1].fitness =
objfunc (popnew[j+I].chrom, lchrom, words, nw);
/* Replace worst child with parentI if parentI is better */
if (popnew[j].fitness < popnew[j+I).fitness) {
if (popold[mateI].fitness > popnew[j].fitness)
popnew[j] = popold[mateI];
}
else {
if (popold[mateI].fitness > popnew[j+I].fitness)
popnew[j +1] = popold[mateI];
}
/* Replace worst child with parent2 if parent2 is better */
if (popnew[j).fitness < popnew[j+I].fitness) {
if (popold[mate2].fitness > popnewU].fitness)
popnew[j] = popold[mate2);
}
else {
if (popold[mate2] .fitness > popnew[j+1) .fitness)
popnew[j+1] = popold[mate2);
}
85
j += 2;
} while (j < *popsize);
*popsize = j;
scale[O] = popnew[O].fitness;
for (i= 1;i<*popsize;i++)
scale[i] = scale[iI]+popnew[i].fitness;
}
/*****************************************************
Function : Mutation 0
******************************************************/
Allele Mutation (alleleval)
Allele alleleval;
{
int mutate;
mutate = Flip (pmutation);
if (mutate) {
if (alleleval = '0') return ' I';
else return '0';
}
else
return (alleleval);
}
/*****************************************************
Crossover 0
Cross 2 parent strings, place in 2 child strings
*******************************************************/
Crossover (parentI, parent2, childI , child2, lchrom)
Chromosome parentI, parent2, childI , child2;
int lchrom;
{
int jcross;
int pent, ccnt=O;
char tmpstore[MXLNGTH];
Allele Mutation 0;
if (Flip (pcross» {
jcross = Rnd(l, lchromI);
}
86
else {
jcross = lchrom;
}
for (pcnt=O;pcnt <jcross;pcnt+ +) {
childl[ccnt] = Mutation (parentl[pcnt]);
child2[ccnt] = Mutation (parent2[pcnt]);
ccnt++;
}
for (pcnt=jcross;pcnt<lchrom;pcnt++) {
child2[ccnt] = Mutation (parentl[pcnt]);
childl[ccnt] = Mutation (parent2[pcnt]);
ccnt+ +;
}
childl[ccnt] = child2[ccnt] = '\0';
}
/*****************************************************
Function : Select 0
Selects an individual from a Population biased by the fitness
of this individual. (Anologous to Selection on a Roulette Wheel
*******************************************************/
int Select (popsize, sumfitness, pop, scale)
int popsize;
float sumfitness;
Population pop;
float scaleD;
{
int i;
float Rand;
Rand = RandomO * sumfitness;
for (i=O;i <popsize;i+ +)
if (Rand < scale[i])
return i;
return 1; /* error */
}
/*****************************************************
Function : Flip (probability)
Returns result of a simulated biased coin toss
87
*****************************************************/
BOOL Flip (probability)
float probability;
{
if (probability = = 1.0)
return 1;
else
return (RandomO < = probability);
}
/****************************************************
Funtion takes an integer as the input argument,
and converts it to a binary string equivalent.
****************************************************/
convertobin (n,msg,l)
char msg[];
int n,l;
{
int i;
for (i=ll;i> =O;i) {
if (n&l)
msg[i] = '1';
else
msg[i]  '0';
n=n> > 1;
}
}
/***************************************************
Module responsible for determining whether the
termination of the Genetic cyle should be performed
****************************************************/
Terminate (pop, popsize, terminat, maxndx)
Population pop;
int popsize, *terminat, maxndx;
{
int i, count;
*terminat = 0;
count = 0;
/* if population is dominated by a single chromosome type
88
89
then terminate. i.e When all the chromosomes in the population
have identical fitness values, the genetic cycle is terminated*/
for (i = 0; i < popsize; i++)
if (pop[i].fitness = = pop[maxndx].fitness)
count+ +;
if (count> = popsize)
*terminat = 1;
}
/* Computes the maximum, minimum, average, and standard deviation of fitness
values of all the chromosomes in the population */
statistics (popsize, maxndx, avg, minndx, pop)
int popsize, *maxndx, *avg, *minndx;
Population pop;
{
int i;
sumfitness = pop[O].fitness;
*minndx = 0;
*maxndx = 0;
for (i = l;i < popsize; i++) {
sumfitness + = pop[i].fitness;
if (pop[i].fitness > pop[*maxndx].fitness) *maxndx = i;
if (pop[i].fitness < pop[*minndx].fitness) *minndx = i;
}
*avg = sumfitness/popsize;
}
/* Finds the length of the tuple represented by the chromosome. This carried out
by counting the number of 1's in the chromosome. This value is used to plot the
solution quality in terms of tuple length. */
findtuple_length (tuple, lchrom)
Chromosome tuple;
int lchrom;
{
int i, len = 0;
for (i = O;i < lchrom; i++)
if (tuple[i] == '1 ') len++;
return len;
}
1* Function is used to compute the probabilities mentioned in section 3.1.3.2.
These probabilities are used to implement a biased initialization of the initial
population. *1
float find_series_sum (r, n)
int r, n;
{
int i;
float sum = 0, power 0;
for (i=l;i< =n;i++) {
sum = sum + (1  1/power(r,i»;
}
return (sum);
}
1* Computes the value of a raised to the power of n. This function is also used in
the computation of probabilities mentioned in section 3.1.3.2*1
float power (a, n)
int a, n;
{
int i;
float prod = 1.0;
for (i = 0; i < n; i++) prod = prod * a;
return prod;
}
1******************************************************
This function creates a random tuple of size k
******************************************************1
createAtupleofsize (k, tuple, I)
int k;
char *tuple;
int I;
{
int pos[MXLNGTH], j, i, ndx, ps, len;
len = I;
for (i = 0; i < 1; i + +) {
pos[i] = 0;
tuple[i] = '0';
}
90
91
for (i=O; i < k; i++) {
ps = Rnd (0, II);
while (pos[ps] == 1) {
ps++;
if (ps == len) {
ps = 0;
}
}
tuple[ps] = '1';
pos[ps] = 1;
1·,
}
}
rde: eval.h
/*********************************************************
The FITNESS Evaluation Module
The objfunc 0 takes a chromosome as an argument and assigns it a fitness
value in the range 0.. 1. The higher the fitness value the closer the
chromosome is to the solution. For fitness value of 1, the chromosome
represents the solution.
The chromosome represents a set of tuples. The fitness value gives a measure
of the number of duplicate entries existing in the list of tuples. This
is computed as an average hashlength = nw/hashlength.
for example : if the chromosome 0101 represents the tuples : aa ab cd aa ba cd
ba ba. Then a frequency count of this list will tell us that there are
2 aa's, 1 ab, 2 cd's, and 3 ba's in the list of tuples. Then the hashlength
is computed as
(1 +2) + (1) + (1 +2) + (1 +2+3).
********************************************************************/
float objfunc (candidate, length, words, nw)
Chromosome candidate;
int length;
char wordsO[MXLNGTH];
int nw;
{
92
int i, fscount=O, fs[MXLNGTH], j, overlap;
long hashlength, find_average_hashlengthO;
float fitness;
char tuples[MAXNUMBEROFTUPLES][MXLNGTH];
/* Construct the set of tuples that are represented by the chromosome
This is done by picking characters from the word list, at positions indicative
of the presence of l' s in the chromosome */
for (i = 0; i < length; i++)
if (candidate[i] == '1') {
fs [fscount+ + l = i;
if (fscount > = MXLNGTH) {
printf ("\nThe size of the tuple has exceeded the maximum limit!. It);
printf ("\nAborting process. It);
exit (1);
}
}
for (i = 0; i < nw; i++)
{
for (j = 0; j < fscount; j++)
tuples[i+ 1lU] = words[i][fsU]];
tuples[i+1lUl = '\0';
}
heap_sort (tuples, nw);
hashlength = find_average_hashlength (tuples, nw);
/* RestrainLength is used in the method of selecting
the better tuple among two tuples with the same fitness value but
different tuple size.
RestrainLength keeps track of the maximum fitness value for each
tuple length size. If there exists a smaller tuple for which the
fitness is = to the fitness of this chromosome, then this chromosome
is undesirable and hence given a very low fitness. */
if (hashlength < RestrainLength[fscountl)
RestrainLength[fscount] = hashlength;
for (i = 1; i < fscount; i++) {
if (RestrainLength[fscount] == RestrainLength[i]) {
RestrainLength[fscount] = 80000L;
fitness = 1.0/80000L;
return fitness;
}
}
fitness = (float)nw/hashlength;
return fitness;
}
/*******************************************************
Computes the hashlength of the tuples. T is list of
tuples, and n is the number of tuples in this list.
The function first identifies groups from the list, such
that each group consists of one or more identical tuples
and no two groups have any tuples in common.
Then the hashlength is computed as sum of all sumseries for
each group. sumseries for a group is the natural sum of its
size i.e. if group size is 4, then sumseries for this
group computes to 1+2+3+4.
**********************************************************/
long find_average_hashlength (T,n)
char T[][MXLNGTH];
int n;
{
int i, overlap = 1;
long sum, sumseriesO;
char comparewith[MXLNGTH];
sum = 0;
strcpy (comparewith, T[l]);
for (i = 2; i < = n; i+ +) {
if (strcmp (comparewith, T[i]) == 0)
overlap++;
else {
strcpy (comparewith, T[i]);
sum + = sumseries(overlap);
93
overlap = 1;
}
}
sum + = sumseries(overlap);
return sum;
}
/******************************************************
Computes the natural sum of its argument. If argument
is 5 then this function computes the natural sum
1+2+3+4+5.
******************************************************/
long sumseries (uplimit)
int uplimit;
{
int i;
long sum;
sum = 0;
for (i = 1; i < = uplimit; i++)
sum += i;
return sum;
}
/*******************************************************
Prints the list of tuples T. n is the number of tuples
in the list
*******************************************************/
printlist (T, n)
char TO[MXLNGTH];
int n;
{
int i;
printf (n\nThe tuples: \nn);
for (i = 1; i < = n; i++) {
printf (n %s ", T[i]);
if (i%5 == 0) printf("\n");
}
}
/*******************************************************
Splits the tuples into the groups by use of sorting.
*******************************************************/
heap_sort (T, n)
char TO[MXLNGTH];
94
int n;
{
int i;
char tmp[MXLNGTH];
make_heap (T, n);
for (i = n; i > = 2; i) {
strcpy (tmp,T[I]); strcpy (T[I],T[i]); strcpy (T[i],tmp);
sift_down (T, 1, iI);
}
}
make_heap (T, n)
char T[][MXLNGTH];
int n;
{
int i;
for (i = n/2; i > = 1; i)
sift_down (T, i, n);
}
sift_down (T, i, n)
char TO[MXLNGTH];
int i, n;
{
int k, j;
char tmp[MXLNGTH];
k = i;
do {
j = k;
if «2*j < = n) && (strcmp(T[2*j], T[k]» 0» k = 2*j;
if «2*j < n) && (strcmp (T[2*j+l],T[k]) > 0» k = 2*j + 1;
strcpy (tmp,T[k]);
strcpy (T[k] ,TU]);
strcpy (TU],tmp);
} while (j != k)