MULTIPLE SEQUENCE ALIGNMENT
I
WITH EVOLUTIONARY TREES
By
KELIU
Bachelor of Science
Central University of Economics
Beijmg, China
1994
Submitted to the Faculty ofthe
Graduate College ofthe
Oklahoma State University
in partial fulfillment of
the requirements for
the Degree of
MASTER OF SCIENCE
December 2001
MULTIPLE SEQUENCE ALIGNMENT
WITH EVOLUTIONARY TREES
Thesis Approved:
~~~ A.~_____
De~ad~
11
PREFACE
Computation of multiple sequence alignments is one of the major open problems
in computational molecular biology. The purpose of this study was to provide a new
method for calculation of multiple amino acid sequence alignments. The new method is a
progressive method. Multiple alignments are built by the successive application of
pairwise alignment algorithms. The new method assumes that the sequences are related
with each other and that there exists some unknown evolutionary tree that corresponds to
the multiple sequence alignment.
In this new method, the distance between two groups is defined as the distance
between the two corresponding consensus sequences. Group to group alignment is also
calculated using the alignment of the consensus sequences. One advantage of this new
method is that it is simple and efficient, and in many cases generates reasonable results.
Another advantage of this method is that the scoring can be done with reference to this
evolutionary tree, even though the tree structure itselfmay remain unknown.
1Jl
ACKNOWLEDGEMENTS
I would like to convey my sincere appreciation to my major advisor Dr. Mansur
H. Samadzadeh for his intelligent supervision, constructive guidance, inspiration, and
friendship. My sincere appreciation extends to my other committee members, Dr. G. E.
Hedrick and Dr. John P. Chandler, for their guidance and assistance. My sincere
appreciation also extends to Dr. Ulrich Melcher for all ofhis advice and guidance.
I would also like to express my special appreciation and gratitude to my husband,
Kunhong Xiao, for his encouragement, support, and love. I would also like to extend
many thanks to my parents, sisters, and all my friends for their support and
encouragement.
IV
Chapter
TABLE OF CONTENTS
Page
I. INTRODUCTION " ...... . ..... . ... .. . .. . 1
1.1 DNA and Protein Sequences......... .. .. .. 1
1.2 Sequence Alignment and Homology. 2
1.3 Amino Acid Similarity Scoring Systems....... .. . . 3
1.3.1 DayhoffMutation Data Matrix...................................... 4
1.3.2 BLOSUM Matrix................. 5
II. PAIRWISE ALIGNMENT WITH DYNAMIC PROGRAMMING '" 7
2.1 Introduction to Pairwise Sequence Alignment.. ... ... ... ...... ... ....... ... .. ... . 7
2.1.1 Definition of Pairwise Alignment.... .. ... .... .. ... . ... .. . . ... ... ... 7
2.1.2 Value of an Alignment..... 8
2.1.3 Number of Alignments..... .. . 8
2.2 Computing Optimal Alignment with Dynamic Programming.... 9
2.2.1 Optimal Structure and Recursive Solution ofPairwise Alignment... 10
2.2.2 Bottom-up Computation ofPairwise Alignment..... 11
2.2.3 Traceback................. 13
2.3 Sequence Alignment Using Affine Gap Costs........... 15
III. MULTIPLE SEQUENCE ALIGNMENT........ 18
3.1 Scoring a Multiple Alignemnt. '" 18
3.2 Exact Methods with the Sum-of-Pairs (SP) Score...................... 20
3.2.1 N-Dimensional Dynamic Programming Method '" 20
3.2.2 MSA Method....... 21
3.3. Progressive Alignment Method.................................. 24
3.3.1 Feng-Doolittle Algorithm.. 25
3.3.2 Clustal W.... 27
N. lliENEWMETHOD.................................................................... 31
4.1 Construct a Similarity Matrix.. . .. ... .... 31
4.2 Construct an Evolutionary Tree........... .. 32
4.2.1 Calculating the Consensus Sequences.............. .. 34
4.2.2 Reconstructing the Similarity Matrix..... .. ... ... . .. 35
v
Chapter Page
4.2.3 Constructing Multiple Sequence Alignment. 36
V. IMPLEMENTATION AND RESULTS...... 38
5.1 Implementation........................................ 38
5.2 Results................ 39
5.3 Discussion. 42
5.3 Future Work ,. 43
REFERENCES.................................................................................. 45
j
APPENDICES.. 51
APPENDIX A-GLOSSARy....................................................... 52
APPENDIX B - CODE LISTINGS................................................. 54
VI
r
Table
LIST OF TABLES
Page
I. Three-letter Abbreviations and Single-letter Symbols ofAmino Acids [Voet
and Voet 95].......... 2
II. Log Odds Matrix for 250 PAMs [Dayhoff et al. 78]................................. 5
m. Blosum 62 Substitution Matrix [Henikoff and Henikoff92]....................... 6
IV. A Sample Dynamic Programming Table for Two Sequences.... . 13
V. A Sample Traceback Table for Dynamic Programming..... 14
VI. Examples ofRange of Affine Gap Penalties...................................... .... 15
VII. Similarity Martix Calculated for Example in Figure 5.. ..... ....... .... ..... ..... .. 32
VIII Similarity Martix After the First Iteration for Example in Figure 5. . ...... .. ..... 36
Vll
t"
LIST OF FIGURES
1. Basic Dynamic Programming Method for Sequence Alignment [Waterman
95]........................................... 12
2. A Sample Phylogenetic Tree [Fitch-Margoliash 67].. 26
3. Neighbor-Joining Method [Saitou and Nei 87]..... 28
4. Star'-like Tree and Neighbor-Joining in the NJ Method.... 29
5. A Sample of a Set ofSequences. . .. .. .. .. ... ... .. . ..... .. .. .. . .. . .. . ... .. ... ...... ..... 32
6. The New Algorithm: Constructing an Evolutionary Tree '" 33
7. Process of Constructing an Evolutionary Tree for the Example in Figure 5...... 37
8. Alignment of Eight Protein Sequences by Clustal W.......... 40
9. Alignment of Eight Protein Sequences by the New Method.. 41
Vlll
CHAPTER I
INTRODUCTION
Multiple sequence alignments, which are important tools in studying proteins, are
processes that consist of taking a group of sequences and identifying structurally or
functionally homologous amino acids. The infonnation they provide is very useful in
designing experiments to test and modify the function of specific proteins, in predicting
the function and structure of proteins, and in identifying new members of protein families
[Karlin and Altschul 90] [Barton and Rassell 93]. This chapter introduces some basic
biological concepts that are related to this study.
1.1 DNA and Protein Sequences
All infonnation of life is encoded in the different types of biological sequences:
DNA sequences and protein sequences [Voet and Voet 95].
DNA (deoxyribonucleic acid) is a class of nucleic acids, which is the hereditary
molecule in all-cellular life fonns. DNA directs its own replication during cell division
and directs the transcription of complementary molecules of RNA (ribonucleic acid),
which then direct the synthesis of protein sequences [Stryer 95]. DNA is actually a linear
string of four different kinds of nucleotides: Adenine (A), cytosine (C), guanine (G), and
thymine (T) [Lehninger et al. 93].
Proteins are the fundamental building blocks of life. They are either the molecular
machines responsible for virtually all of the chemical transformations that cells are
capable of, or much of the structure of a cell. A human contains more than 100,000
different proteins [Marks et al. 96]. Protein is also a linear string of 20 kinds of amino
acids. These 20 kinds of amino acids are the basic structure units of proteins, and they are
often designated by either a three-letter abbreviation or a one-letter symbol.
Table I below illustrates the abbreviation ofthe 20 kinds ofarnino acids [Yoet and
Voet 95]. In this thesis, the discussion is focused on protein sequences.
Table I. Three-Letter Abbreviations and Single-Letter Symbols of Amino Acids
[Voet and Voet 95]
Alanine
Arginine
Asparagine
Aspartic acid
Cysteine
Glutamine
Glutamic acid
Glycine
Histidine
Isoleucine
Ala
Arg
Asn
Asp
Cys
GIn
Glu
Gly
His
lIe
A
R
N
D
C
Q
E
G
H
I
Leucine
Lysine
Methionine
Phenylalanine
Proline
Serine
Threonine
Tryptophan
Tyrosine
Valine
Leu
Lys
Met
Phe
Pro
Ser
Thr
Trp
Tyr
Val
L
K
M
F
P
S
T
W
y
V
1.2 Sequence Alignment and Homology
A sequence alignment is a one-to-one matching of two or more sequences so that
each character in one sequence is associated with a single character of the other sequence
or with a null character '-' (gap), showing where these sequences are similar and where
they differ.
Aligning two or more sequences for companson IS an essential step in
determining if they are homologous. Homologous sequences are sequences descended by
2
evolution from the same sequence of a common ancestor. The sequences of the
homologous proteins are identical at the time they originated. But proteins are continually
undergoing the stochastic process ofmutation [King and Wilson 75}.
The simplest and most frequent mutation is the substitution of one or more amino
acids by other amino acids. Insertion and deletion of one or more amino acids are also
common. Consequently, closely related proteins generally differ only by replacements of
one amino acid by another at a few positions in the sequence. Less frequently are
differences in the total number of the amino acids, which are due to the deletion or
insertion of residues within the sequences [Doolittle 81]. The more closely related they
are evolutionarily, the more similar their proteins sequences are found to be. As less
closely related species are compared, the differences among their proteins sequences
increase [Alberts et a1. 94].
1.3 Amino Acid Similarity Scoring Systems
Sequence alignment algorithms are limited by the underlying model of sequence
similarity, which is typically based on a set of scoring rules [George et a1. 90].
If we put the similarity scores that we give to 210 possible amino-acid pairs (190
pairs of different amino acids plus 20 pairs of identical amino acids) into a matrix, this is
frequently called a scoring matrix [Barton 96]. In a similarity scoring matrix, higher
values are assigned to more similar pairs and lower values to dissimilar pairs.
The simplest similarity matrix, which is a unitary matrix, assigns values of+1 for
identities and 0 for non-identities. Over years, many different matrices have been
proposed ([McLachlan 71] [Dayhoff et a1. 78] [Feng et a1. 85] [Risler et a1. 88] [George
3
et aI. 90] [Gonnet et a1. 92] [Henikoff and Keniko.ff 92] [Altschul 93]). Among such
scoring systems, the widely used scoring matrices have been the Dayhoff mutation data
matrix [Dayhoffet a1. 78] and Blosum matrix [Henikoffand Henikoff92].
1.3.1 Dayhoff Mutation Data Matrix
Possibly the most widely used scheme for scoring ammo acid paIrs is that
developed by Dayhoff and co-workers [Dayhoff et a1. 78] [Schwartz and Dayhoff 78] and
its recent variations [Gonnet et a1. 92]. From a study of observed residue replacements in
71 sets of closely related proteins, Dayhoff and co-workers constructed the PAM model
of molecular evolution. PAM stands for Percent Accepted Mutations and were inferred
from the types of changes observed in these closely related proteins [Dayhoff et a1. 78].
They combined the information about the individual kinds of mutations and about
the relative mutability of the amino acids into one distance-dependent "mutation
probability matrix", M. Each element Mij ofM gives the probability of the amino acid in
column} mutating to the amino acid in row i after a given evolutionary interval.
When used for the comparison of protein sequences, the mutation probability
matrix is usually normalized by dividing each element Mij by the relative frequency of
exposure to mutation of the amino acid i. This operation results in the symmetrical
"relatedness odds matrix" with each element giving the probability of amino acid
replacement per occurrence of i per occurrence of}. The resulting matrix is the "log-odds
matrix", which is frequently referred to as "Dayhoffs matrix" [Altschul 91].
Table II shows PAM-250 matrix of Dayhoff, which is known to be most useful to
reproduce accurate protein sequence alignments [George et a1. 90].
4
Table II. Log Odds Matrix for 250 PAMs [Dayhoff et al. 78]
C 12 S 0 "'"2
T -2 1 3
P -3 1 0 6
A -2 1 1 1 2
G -3 1 0 -1 1 5
N -4 1 0 -1 0 0 0 -5 0 0 -1 0 1 '2 4h
E -5 0 0 -1 0 0 134
a -5 -1 -1 0 0 -1 1 224
H -3 -1 -1 0 -1 -2 2. 1 1 3 6 ~'" R -4 0 -1 0 -2 -3 0 -1 -1 1 2
K -5 0 0 -1 -1 -2 1 0 0 1 0
M -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6h I -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5
L -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 426
V -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4
F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9~ Y 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10
W -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 ' o 0 17
C S T P A G N 0 E Q H R K M I L V F Y WI
Elements are shown multiplied by 10. The neutral score is zero. A score of -10 means that the
pair would be expected to occur only one-tenth as frequently in related sequences as random
chance would predict, and a score of +2 means that the pair would be expected to occur 1.6
times as frequently. The order of the amino acids has been arranged to illustrate the patterns in
the mutation data.
1.3.2 BLOSUM Matrix
An alternative approach to estimating target frequencies and the corresponding
log-odds matrices has been advanced by Henikoff and Henikoff [Henikoff and Henikoff
92]. They examine multiple alignments of distantly related protein regions directly, rather
than extrapolate from closely related sequences, hence the tenn BLOSUM is from
BLOcks SUbstitution Matrix. A number of tests suggest that the "BLOSUM" matrices
5
produced by this method generally are superior to the PAM matrices for detecting
biological relationships [Pearson 95] [Henikoff and Henikoff93].
Different levels of the BLOSUM matrix can be created by differentially
weighting the degree of similarity between sequences. For example, a BLOSUM 62
matrix is calculated from protein blocks such that if two sequences are more than 62%
identical, the contribution of these sequences is weighted to sum to one. In this way the
contributions of multiple entries of closely related sequences is reduced [Pearson 95].
The BLOSUM 62 matrix is given in Table m.
Table m. Blosum 62 Substitution Matrix [Henikoff and Henikoff 92]
A C 0 E F G H I K L M N P Q R S T V W Y
A 4
C 0 9
0 -2 -3 6
E -1 -4 2 5
F -2 -2 -3 -3 6
G 0 -3 -1 -2 -3 6
H -2 -3 -1 0 -1 -2 8
I -1 -1 -3 -3 0 -4 -3 4
K -1 -3 -1 1 -3 -2 -1 -3 5
L -1 -1 -4 -3 0 -4 -3 2 -2 4
M -1 -1 -3 -2 0 -3 -2 1 -1 2 5 I
N -2 -3 1 0 -3 0 1 -3 0 -3 -2 6
P -1 -3 -1' -1 -4 -2 -2 -3 -1 -3 -2 -2 7
Q -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5
0 ·3 -2 -1 0 1
i
R -1 -3 -2 -3 -2 0 2 -2 5
S 1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4
T 0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5
V 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4
W -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11
Y -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7
BLOSUM 62 matrix is calculated from comparisons of sequences with no less than 62%
divergence. Identities are assigned the most positive scores. Frequently observed substitutions
also receive positive scores and seldom observed substitutions are given negative scores.
6
CHAPTER II
PAIRWISE SEQUENCE ALIGNMENT WITH DYNAMIC PROGRAMMING
2.1 Introduction to Pairwise Sequence Alignment
2.1.1 Definition ofPairwise Alignment
A pairwise alignment of two sequences S\ and S2 is produced when spaces (or
gaps), '-', are inserted either into or at the ends of SI and S2 to fonn the new sequences,
S; and S~, both of which must be of the same length. The two resulting sequences are
then placed one above the other [Barton 96].
respectively. Then the alignment is
a;a; ...a~ (S;)
b;b; .. h~ (S;)
Ignoring gap characters, S; and S; are exactly sequences 8\ and S2, respectively. Amino
acid pairs ofthe alignments, (ai, bi), can end in one of three ways.
1) (a, -) corresponds to the insertion of a into the first sequence, or the deletion
of a from the second sequence.
2) (a, b) corresponds to an identity or match if a = b, or a substitution or
mismatch ifa c:f. b.
3) (-, b) corresponds to an insertion/deletion ofb.
7
No alignment terms (-, -) are allowed, as there is no point in matching two
deletions. Therefore, max(n, m) ~ k ~ n + m [Gusfield 97].
2.1.2 Value of an Alignment
There are two ways to quantify similarity of the alignment of two sequences, the
similarity measures and the distance. In most cases, distance and similarity measures are
interchangeable in the sense that a small distance means high similarity, and vice versa.
Similarity measures are used in this chapter.
For a given alignment A of81 and S2, let S; and S; denote the aligned sequences,
and let k denote the (equal) length of the two sequences S; and S; in A. Let sea, b)
denotes the value (or score) obtained by aligning character a against character b using a
scoring matrix (scoring matrices were discussed briefly in Chapter I). The value of the
alignment A is defined as
K
V =Ls(ai',bi').
;=1
Given a scoring matrix, the pairwise alignment problem is to find an optimal
alignment maximizing the total alignment value V. The optimal alignment will emphasize
matches (or similarities) while penalizing mismatches or inserted spaces.
2.1.3 Number of Alignments
From the definition for the sequence alignment discussed in Section 2.1.1, we
know that there are many ways to align two sequences.
8
Theorem 2.1 [Waterman 95]
Define f (m, n) = number of alignments of one sequence of m characters with
another of n characters, for m, n > 0, then
f(m,n)=f(m-l,n)+ f(m-1,n-l)+ f(m,n-I)
with boundary conditions
f(m,O) =f(n,O) =1
Proof. On the end ofthe alignment, if am is deleted, then there existf(m-t, n) alignments
of the earlier part of the sequence. If am and bn are aligned,f(m-I, n-l) alignments result.
Ifbn is deleted, thenf(m, n-l) alignments result.
Watennan gave a solution forf(n, n) as follows [Watennan 95].
f(n,n) - (1 + .fi)2n+1 n-1f2 asn -+ ex)
where - is for limiting behavior
According to the above equation, two sequences of length 1000, for example, have
f(1 000,1 000) ~ (1 +.fi)20011000-112 = 10764
.4...
alignments. Obviously, one cannot examine all alignments to compute the optimal one.
2.2 Computing Optimal Alignment with Dynamic Programming
•
Dynamic programming is typically applied to optimization problems [Connen et
al. 92] [Bellman 57]. Dynamic programming solves problems by combining the solution
to sub-problems, and it solves every sub-problem only once and then saves its answer in a
table, thereby avoiding the work of re-computing the answer every time the sub-problem
is encountered.
9
The first use of the dynamic programming approach for sequence alignment was
reported by Needleman and Wunsch in 1970 [Needleman and Wunsch 70] and later in
slightly modified fonns by others [Sellers 74] [Sankoffand KruskaI83].
2.2.1 Optimal Structure and Recursive Solution ofPairwise Alignment
Dynamic programming methods assume the principle of optimality, stating that
each part of a globally optimal solution is itself an optimal solution to its corresponding
partial problem. The optimal structure of the sequence pairwise alignment problem is
defmed in Theorem 2.2 below.
Theorem 2.2 [Gusfield 97]
Given 81 = ala2 ...am and 82 = btb2...bn, then alignment of 8t and 82 can end in
one of three ways: (am, -), (am, bn), and (-, bn) (Section 2.1.1). Defined V[i, j] ( 1 ~ i ~ m, 1
~ j ~ n) as the value of the optimal alignment of prefixes ala2 ...aj and b1b2...bj . The
optimal substructure ofan optimal pairwise alignment can be stated as follows.
1) If the optimal alignment ends in (am, bn), then V[m, n] = V[m - 1, n - 1] + s(am, bn).
2) If the optimal alignment ends in (am, -), then V(m, n] = V[m - 1, n] + s(am, -).
3) If the optimal alignment ends in (-, bn), then ~m, n] = V[m, n -1] + s(-, bn).
Proof.
1) If the optimal alignment ends in (am, bn), then the initial part of the alignment
aja2 ...am-t and btb2...bn-t must itself be optimal. Suppose for the purpose of contradiction
there is an optimal alignment of ala2 ...am-l and btb2...bn-t with similarity score V[rn - 1,
n - 1]. Then aligning am and bn to the initial part of the optimal alignment produces a
10
similarity score of 81 and 82, V'[m - 1, n - 1] + s(am• bn), which is greater than V[m, n].
This is a contradiction to our assumption that V[m, n] is an optimal similarity score.
2) If the optimal alignment ends in (am. -), then the initial part of the alignment of
3) The case (-, bn) is identical in reasoning with the case (am, -). •
Based on Theorem 2.2, the recursive solution to the optimal alignment problem
can be easily induced.
Theorem 2.3 [Waterman 95]
alignment of prefixes a1a2 .. .aj and b1b2••• bj . Also set
j ;
V[O, 0]=0, V[O,j] = Ls(-,bk ), and V[i,O] = Ls(ak ,-).
k~ k~
Then, for 1 ~ i ~ m and 1 ~j ~ n, the general recurrence is
V[i, j] =max (V[i -1,j -IJ+ s(a;, b), V[i -1, j] +s(a;. -), V[i, j -IJ + s(-, bj )}.
2.2.2 Bottom-up Computation ofPairwise Alignment
Based on the recursive equation given by Theorem 2.3, we could easily code the
recursive relations and base conditions for V[i,jJ as a recursive procedure. This top-down
recursive approach requires exponential time complexity due to the massive number of
redundant recursive calls to the procedure [Cormen et a1. 92].
Typically, the bottom-up computation is organized with a dynamic programming
table of size (m +1) x (n +1) . The table holds the values of V[i,jJ for all choices of i and
j, where °~ i ~ m and 0 ~j ~ n. Sequence 81 corresponds to the vertical axis of the table
II
while sequence S2 corresponds to the horizontal axis. The values in row zero and column
zero are filled in directly from the base conditions. After that, the remaining m x h'Subtable
is filled in one row at a time, in order of increasing i. Within each row, the cells are
filled in order of increasing}. Figure I below shows the bottom-up computation
algorithm to Theorem 2.3.
1 procedure S(a1a2 ...am , bJb2 ...bn , m, n)
2 nO,O] ~ 0
3 fOf} ~ 1 to n do
4 V[O,}] ~ V[O,} -1] + s(-, bj )
5 for i ~ I to m do
6 V[i, 0] ~ V[i - 1, 0] + s(a,,-)
7 fOf} ~ 1 to n do
8 V[i,}] ~max{Vti-l,} -1]+s(ap bj ),V[i-l,}]+s(ai , -), V[i,} -l]+s(-, bj )}
9 retufn V[m, n]
Figure 1. Basic Dynamic Programming Method for Sequence Alignment [Waterman 95]
Table N below shows the computation of the optimal similarity alignment
between ABCNJRQCLCRPN and AJCJNRCKCRBP. The scoring matrix used is a
unitary scoring matrix. Gaps are given the same penalty as a mismatch.
In this table, when computing the values for a specific cell (i,j), only cells (i - I,)
-1), (i,) - I), and (i - I,}) are examined. Hence, to fill in one cell takes a constant
number of arithmetic operations and comparisons. There are O(mn) cells in the table, so
the dynamic programming algorithm for computing the alignment score between two
sequences oflength m and n can be computed in O(mn) time [Waterman 95].
12
Table IV. A Sample Dynamic Programming Table for Two Sequences
o 1 2 3 4 5 6 7 8 9 10 11 12 13
o
1
2
3
4
5
6
7
8
9
10
11
12
A B C N J R Q C L C R P M
0 0 0 0 0 0 0 0 0 a 0 0 0 0
A 0 1 1 1 1 1 1 1 1 1 1 1 1 1
J 0 1 1 1 1 2 2 2 2 2 2 2 2 2
C 0 1 1 2 2 2 2 2 3 3 3 3 3 3
J 0 1 1 2 2 3 3 3 3 3 3 3 3 3
N 0 1 1 2 3 3 3 3 3 3 3 3 3 3
R 0 1 1 2 3 3 4 tEftJ4 4 4 4 4
C 0 1 1 2 3 3 4 ,
K
C
R
B
P
The operation of successive summations began at the first row of the array and proceeded rowby-
row towards the last row. The operation has been partially completed in row 7. The highlighted
cell in this row is the site of the cell operation, which consists of a search along the enclosed cells
indicated by borders for the largest value, which is 4. The sum of this value and the value of s(C,
C), which is 1, is added to the cell from which the search began.
2.2.3 Traceback
The computation described above calculates an optimal similarity score without
specifying the alignment or alignments that produce the score. The optimally matched
alignments can be back-traced, guided by a direction matrix whose i, jth element is a
pointer indicating the paths through which the maximum value of 1I1i,)1 is chosen [Smith
et al. 81]. Pointers in each cell are set to:
{
upper left cell( ",",") if V [i,j] = V [i -1 ,j - 1] + S (a;, hj )
upper cell ("1''') if V [i,i] = V £i -1,1] + s (aj, -)
left cell ("~") if V [i,i} = V [i,i - I] + s (-, bj )
13
The optimal alignment is recovered from the path by interpreting:
{
"I'\." as a match if aj =bj and as a substitution if Qj :I: bj
"1'" as a deletion ofaj from 81 (a space, -, inserted into 82)
"+-" as an insertion of bj into 82 (a space, -, inserted into 81)
For the example mentioned in Table IV, we can trace back paths from cell (12,
13) to cell (0, 0) (see Table V) and construct optimal alignments at the same time. One of
the optimal alignments, whose pathway is illustrated in Table V, is as follows.
ABCNJ-RQCLCR-PM
I I I I I I I I
AJC-JNR-CKCRBP-where
'I' shows the identities in the optimal alignment.
Table V. A Sample Traceback Table for Dynamic Programming
A B C N J R Q C L C R P M
0 +-0 +-0 f-() f-() +-0 f-() f-() +-0 f-() +-0 +-0 +-0 f-()
A 11) ~1 .....1 ....1 .....1 ....1 .....1 ....1 ....1 ....1 .....1 .....1 .....1 .....1
J 11) 11 "'+-t1 "'.....1'1 "'.....1'1 ~ .....2 ....2 t-2 "'2 t-2 ....2 "'2 "'2
C 11) 1'1 "'+-t1 ~I "'2 "'12 "''''12 .......12 "3 "'3 "'.....3 ....3 "'3 "'3
J to 11 "'....t1 12 "'....12 "3 .....3 ....3 ....13 ........13 "''''13 "'<-t3 ........13 .......13
N 11) t1 ......t1 12 "3 "'1'3 "''''13 .......13 ......13 "'<-t3 ........13 "'....13 ...<-t3 ......13
R 11) 1'1 "'....t1 12 13 "''''1'3 "41 +-4 +-4 +-4 +-4 "'+-4 +-4 +-4
C 11) 11 ...+-t1 ~12 13 "'....13 14 .......14 '\5 4-5 "'4-5 4-5 4-5 4-5
K 11) 1'1 ......t1 12 13 ........13 14 ........14 15 "'<--15 ........15 "'<--15 "'.....1'5 ........1'5
C to 1'1 ...+-t1 ~ 13 "'.....1'3 14 "'.....14 "'1'5 "'<--15 ~ ~ ~ <-6
R 10 1'1 ........t1 12 13 "'.....13 "'14 "'+-14 15 ........1'5 115 1\7 ....7 "'7
B 11) 11 ~ <--t2 13 "'....13 14 "'.....14 1'5 ........15 115 1'7 "'....17 "'....17
P 10 11 12 "'....12 13 ........1'3 14 "'.....14 1'5 .......1'5 115 17 ~I <-8
This is a completed table with traceback pointers for the example mentioned in Table IV. One of
the pathways that could form the maximum match is illustrated.
14
2.3 Sequence Alignment Using Affine Gap Costs
So far, we have treated the gap symbol •-' as yet another character, denoting an
individual insertion or deletion. Each gap is given a constant weight independent of the
number of spaces in the gap. However, this view is not always adequate.
Frequently in sequence evolution, deletion or insertion of several adjacent letters
are not the sum of single deletions or insertions but the result of one event [Stryer 95].
Let g(k) be the indel weight for an indel of k bases, and g(1) be the gap penalty of one
space. It is reasonable that g(k) ~ kg(1) holds [Altschul and Erickson 86].
Mfine gap penalty means that we charge a certain set-up cost for introducing a
new gap, whereas extending an existing gap is less expensive [Fitch and Smith 83].
Affine gap penalty is of the form
g(k) = a +fJ(k -1).
where a and J3 are two constants denote the gap open penalty and the gap extension
penalty, respectively. Some examples of range of a and ~ in use for proteins are given in
Table VI.
Table VI. Examples of Range of Affine Gap Penalties
Protein Scoring Matrix Gap Open Penalty Gap Extension Penalty
PAM 250 12-16 4-16
Blosum 62 9-12 3-12
Waterman et al. [Waterman et al. 76] have generalized the basic dynamic
programming algorithm so that any set of gap penalty functions satisfying g(k)::;; kg(l)
can be used.
15
Theorem 2.4 [Waterman et at. 76]
Given 81 = alaz...am and 8z = b1bz...bn, define V[i.j] as the value of the optimal
alignment of prefixes a\al ...aj and b\bl ...bj • Also set
V(O, 0] = 0, V[O,j] = - g(j), and VU, 0] = -g(i).
Then
{
V[i-l,j-I]+S(ai>bj ), }
V[i,j] = max maxlSkSj r[.i,j -.k] - g(k)1 .
max'S!Si tTI-I,J] - g(l)}
Proof.
The proof is the same as in Theorem 2.2, except here the alignments can end in
deletion of up to i letters of the sequence aj ...ai. So, instead of V(i -l,j] + s(a,,-) , which
is the same as V(i-I,j] - g (1), we have max1SlS;{V(i -I,j]- g(/)}. •
The main disadvantage of Theorem 2.4 is that it takes L;)i + j)= O(m 2n+ n2m),
or O(n3
) if m = n, number of steps to find the optimal alignment. Gotoh presented an
algoritlun that allows linear gap costs but runs in essentially mn steps [Gotoh 82].
Theorem 2.5 [Gotoh 82]
Let g(k) = a +{3 (k -1) for constants a and {3. Set E [0, 0] = F [0, 0] = V [0, 0] = 0,
E[O,j] =F [O,Jl = V [0, J1 = - g(j), and E [t, 0] = F [i, 0] = V [i, 0] = - g(l). Then if
E[i,j] =max{V[i,j -1] -a,E[i,j -1] - p},
and
F[i,j] = max{V[t -1,j] - a,F[i -I,j] - p},
then
V[i,j] = max{V[i -I,j -1) +s(a;,bj ),E[i,j],F(i,j]}.
16
Proof.
The identities that we need to establish areE[i,j]=max{V[i,j-k]-g(k)}and
IS,tSj
F[i,j] = max{V[i -I,j] - g(l)}. The proof for E [i,j] follows.
IsIs;
AssumeE[i,/] = max{V[i,/]- g(k)} for 0 ~ / < j , then
lSkS/
max{V[i,j -k]- g(k)}
lS.tSj
=max{max-f[i,j - k] - g(k)1V[i,j -I] - g(I)}
2S.tSj
= max{ ma~ f"U,(j-I)-(k-I)]-g(k-I)}-p,V[i,j-I]-a}
IS.t-IS}-1
=max{v[i,j -1] -a,E[i,j -1]- p}= E[i,j]
•
Examination of the recurrences in Theorem 2.5 shows that for any pair (i,J), each
of the terms V [i,j], E [i,j], and F [i,j] is evaluated by a constant number ofreferences to
the previously computed values, arithmetic operations, and comparisons. Hence O(mn)
time suffices to fill in all the (n + I) x (m + I) cells in the dynamic programming table
[Altschul and Erickson 86].
All alignment algorithms discussed in this chapter are global alignments. Global
denotes the feature that all letters ofthe two involved sequences must be accounted for in
the alignment [Pearson and Lipman 88]. An alternative method is local alignment, also
called approximate pattern matching, which places into correspondence a segment from
each of the two sequences being compared [Barton 90] [Altschul et al. 90]. Related work
in studying global and local pairwise alignment can be found in [Guo 00].
17
CHAPTER ill
MULTIPLE SEQUENCE ALIGNMENT METHODS
3.1 Scoring a Multiple Alignment
A global multiple alignment of k > 2 sequences S = {Sl. S2, ..., Sk} is a natural
generalization of pairwise alignment [Barton 96]. Spaces are inserted into or at either end
of each of the k sequences so that the resulting sequences have the same length I. Then
the sequences are arrayed in k rows of I columns each, so that each character and space of
each sequence is in a unique column. The following is an example of multiple alignment
[Gusfield 97].
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG-ATLVCLISDFYPGA--
VTVAWKADS-AALGCLVKDYFPEP--
VTVSWNSG--VSLTCLVKGFYPSD--
IAVEWESNG--
In the alignment above, eight fragments from immunoglobulin (antibody) protein
sequences are displayed together. Their alignment highlights conserved residues (like 'C'
at column 4 and 'w' at column 21), conserved regions (in particular, "Q.PG" at the end
ofthe first four sequences), and more sophisticated patterns.
18
Although the notion of multiple alignment is easily extended from pairwise
alignment, the score or goodness of a multiple alignment is not as easily generalized
[Watennan 95].
The scoring system for a multiple alignment should take into account an
important feature of multiple alignments: the fact that the sequences are not independent,
but instead are related by a phylogenetic tree [Feng and Doolittle 87]. An idealized way
to score a multiple alignment would therefore be to specify a complete model of
molecular sequence evolution. Since there is not enough data to parameterize the
complex evolutionary model, simplifying assumptions must be made to score a multiple
alignment [Gusfield 97].
To date, there is no objective function that has been as well accepted for multiple
alignments as similarity has been for pairwise alignment [Gusfield 97]. Different methods
implement different scoring functions, and most of the algorithms may not implement
any scoring functions [Watennan 95]. The goodness of those methods is judged by the
biological meaning of the alignments that they produce, and so the biological insight of
the evaluator is of critical importance.
Two of the most widely used multiple sequence alignment methods, the exact
method and the progressive alignment method, are introduced in Section 3.2 and 3.3
below.
19
3.2 Exact Methods with the Sum-oI-Pairs (SP) Score
The sum-of-pairs (SP) scoring method assumes that the individual columns of an
alignment are statistically independent [Gusfield 97].The overall SP score for a multiple
alignment can be defmed as the sum of the scores for each column ofthe alignment.
Definition 3.1 [Waterman 95]
Given a multiple alignment M of length I, let SCM) denote the overall SP score,
SCM;) denote the SP score for column i, and sea, b) denote the similarity score of the
amino acid a and b. Let Mt denote the amino acid at column i and row k of M. Assume
gap cost g(k) = kg(l) for a gap of length k, then
SCM) = L:S(M/)
Is,sl
whereS(M;) =Ls(M: ,M:)
k<h
•
The SP alignment problem is to compute a global multiple alignment with the
maximum SP similarity score or with the minimum distance (or cost).
The SP score was first introduced by Carrillo and Lipman [Carrillo and Lipman
88], and was subsequently used by Bacon and Anderson [Bacon and Anderson 86],
Murata [Murata et al. 85], Gotoh [Gotoh 86], and Gupta [Gupta et al. 95].
3.2.1 N-Dimensional Dynamic Programming Method
The SP problem can be solved exactly (optimally) via dynamic programming
[Murata et al. 85]. However, O(nd
) time and space is required to align d sequences of
length at most n using the dynamic programming method. Hence this method is practical
for only a small number of sequences.
20
What follow are dynamic programming recurrences for the case of three strings,
[Murata et a1. 85]. Extension to larger number of strings is straightforward.
Theorem 3.1 [Murata et al. 85]
nt, nz, and n3, respectively. Let sea, b) denote the similarity score for amino acids a and b,
and g(1) denote the gap penalty of one space. Let V(ni, nz, n3) be the optimal SP score for
V(i -I,} -1,k -1) +sea; ,bj ) +s(aj,c/c) +s(bj,c/c);
V(i -I,} -1,k) +s(apbj ) - 2g(I);
V(i -l,},k -1) + s(apc,t) - 2g(1);
V(i,},k) =max V(i,} -1,k -1) +s(bj,c/c) - 2g(1);
V(i -l,},k) - 2g(I);
V(i,} -1,k) - 2g(l);
V(i,},k -1) - 2g(l);
•
From the above theorem, Vent, nz, n3) can be evaluated in O(ntnZn3) time [Murata et
a1.85].
3.2.2 MSA method
Carrillo and Lipman [Carrillo and Lipman 88] suggested a way to reduce some of
the work needed in computing the optimal SP multiple alignment in practice. An
extension of their idea was implemented in the program called MSA [Lipman et a1. 89].
Additional refinements to MSA were reported by Gupta [Gupta et a1. 95].
The MSA program implements a complex variation of Dijkstra's single-source
shortest-paths algorithm [Dijkstra 59] [Gupta et a1. 95]. MSA formulated the multiple
21
alignment problem as a single-source (s), single-destination (t) shortest-paths problem in
the weighted, directed graph G = (V, E) corresponding to a multiple alignment table. The
number of vertices is the product of the sequence lengths. A path P starting at the source
vertex s = <0, ...,0> that ends at vertex t = <nl, ..., nk> represents an alignment of
Sj[l. ..ni] for l~ i ~ k. There is a one-to-one correspondence between edges in the path
and the columns in the alignment.
For the problem of aligning three sequences of n characters each, the directed
graph has roughly n3 nodes and 7n3 edges, and the shortest-path algorithm will explore
each one of the edges [Connen et aL92]. As Carrillo and Lipman stated, to solve the SP
alignment problem, not all alignments has to be considered [Carrillo and Lipman 88].
The MSA program adopts the Carrillo and Lipman method [Carrillo and
Lipman88] to calculate a lower bound L and an upper bound U to restrict the number of
paths to be considered.
L, the lower bound on the cost of the multiple alignment of S, can be defmed as
the sum ofdistances of optimal pairwise alignments for each pair in S.
Let S = SlS2 .. .Sn denote n sequences. Let d(Sj, Sj) denote the distance for the
optimal alignment of sequences Sj and ~, then,
L ="Id(Sj,Sj) where I ~ i,j ~ n
i<j
This value ofL is a lower bound because it assumes that each pair of sequences is aligned
optimally, independent ofthe other sequences.
To get the value ofthe upper bound, U, one option is for the user to define one. If
U is not provided by the user, the program computes a heuristic multiple sequence
22
alignment from pairwise alignments, and U is then induced from the cost ofthat heuristic
alignment.
MSA searches only among alignments whose cost IS ~ U. Theorem 3.2 IS
implemented by MSA to restrict the paths to be searched.
Theorem 3.2 [Carrillo and Lipman88]
Let S = SIS2.. .Sn denote n sequences, and d(Sj, Sj) denote the minimum distance
between sequences Sj and 8.i. Let M be a multiple sequence alignment of minimum cost,
and c(M;j) denote the cost of the i,jth sequences ofthe alignment M. Then
(1)
Proof:
c(M)::5;U
=> c(M)-L ~ U-L
=> L(c(Mj,)-d(Sj,Sj»::5;U-L
j<j
=> c(Mj,j) - d(SjlS) ~ U - L
=> c(Mj,j) ~ d(Sj,Sj)+U-L
•
Inequality (1) above theorem restricts consideration to those paths whose
projections onto the planes defined by all pairs Sj and Sj have cost at most d(Sj, ~) + U-L.
The value of U is critical in the program because if U is too small, the program
will fail to find any feasible solution, and, if U is too big, more time and space will be
required for program execution.
It is reported that MSA can align six strings of lengths around 200 characters in a
practical amount of time [Lipman et al. 89]. MSA performs better if one begins with an
23
extremely good value for U. However, it is not likely that MSA will make it practical to
optimally align tens or hundreds of sequences [Lipman et al. 89].
3.3 Progressive Alignment Method
The most widely used approach to multiple sequence alignment is progressive
alignment. A progressive alignment method utilizes pairwise alignment algorithm
iteratively to achieve the multiple alignment of a set of protein sequences [Feng and
Doolittle 96].
The progressive method is based on the concept of evolutionary trees [Stryer 95] ..
The dominative view of the evolution of life is that the high level history of life is ideally
organized and displayed as a rooted, directed tree. In the progressive method, usually a
guide tree is generated based on cluster analysis from paiIWise alignment between all
pairs of sequences. Guide tree is a rooted binary tree whose leaves represent sequences
and whose interior nodes represent alignments. The root node represents a complete
multiple alignment. [Feng and Doolittle 96].
Progressive alignment is heuristic: it docs not directly optimize any global scoring
function of alignment correctness. The most important heuristic of progressive alignment
algoritlun is to align the most similar pairs of sequences first. These are the most reliable
alignments.
The advantage of progressive alignment is that it is fast and efficient, and in many
cases the resulting alignments are reasonable. Many authors have proposed methods
based on progressive alignment [Feng and Doolittle 87] [Taylor 88] [Higgins et al. 96]
[Taylor 96]. The published methods vary in the way they choose the specified order to do
24
the alignment, and in the procedure used to align and score the sequences against existing
alignments. Two of the most popular progressive methods, Feng-Doolittle algorithm and
Clustal W, are introduced in the following sections.
3.3.1 Feng-Doolittle Algorithm
The Feng-Doolittle algorithm [Feng and Doolittle 87] was one of the first
progressive alignment algorithms.
The Feng-Doolittle algorithm consists ofthe following three steps.
Step 1. Calculate a matrix ofN(N-l)/2 distances between all pairs ofN sequences
by standard pairwise alignment, converting similarity scores to distances.
Distance of a pair of aligned sequences is converted from similarity scores. In
Feng-Doolittle algorithm, distances are approximate values calculated by Poisson
equation [Feng et al. 85]:
D= - InS
where
S = S,ea/(i,j)-Srand(i,j) xlOO
Siden (i, j) - Srand (i, j)
(1)
(2)
S,eal(i, j) is the observed similarity score for sequences being aligned. Siden(i, J) is the
average of the score of aligning either sequence to itself. SranlJ, J) is the expected score
for aligning two random sequences of the same length and residue composition. Random
sequences are determined by a computer-intensive random shuffling or an approximate
calculation provided by Feng and Doolittle [Feng and Doolittle 96]. Thus, S is a
normalized percentage similarity; it is expected to roughly decay exponentially towards
zero with increasing evolutionary distance.
25
Step 2. Construct a guide tree from the distance matrix using the clustering
algorithm [Fitch and Margoliash 67].
The Feng-Doolittle algorithm constructs a guide tree based directly on the Fitch-
Margoliash algorithm [Fitch and Margoliash 67], which is one of the fast clustering
algorithms that build evolutionary trees from distance matrices. First, the smallest
mutation distance score is identified and the two sequences involved are connected as
sibling nodes. The parent node is the union ofthe two identified sequences. A new matrix
is constructed that contains the average distances between members of the first pair
(parent node) and the remaining members of the set. This procedure is continued until all
scores have been incorporated. Figure 2 shows a guide tree built from a mutation distance
table using the Fitch-Margoliash algorithm.
.,
""
..
"
"
"""
j' I.•
I
Mutation Distance Table Guide Tree
18
A
B
B
24
C
28
32
10 ~
A B c
Figure 2. A Sample Phylogenetic Tree [Fitch and Margoliash 67]
Step 3. Starting from the first node added to the tree created in Step 2, align all
child nodes in the order that they were added to the tree until all sequences have been
aligned.
Sequence to sequence alignment is computed by the usual dynamic programming
algorithm. A sequence is added to an existing alignment by aligning it to each sequence
in the alignment in turn and the highest pairwise alignment score determines how the
26
sequence will be added. For aligning an alignment to an alignment, all sequence pairs
between the two groups are tried and the best pairwise sequence alignment determines
the final alignment.
After an alignment is completed, gap symbols are replaced with neutral elements
(Xs). The key to this alignment is that new gaps can be incorporated into either sequence,
but the earlier gaps are preserved. This is called the rule of "once a gap, always a gap"
by Feng and Doolittle [Feng and Doolittle 87].
3.3.2 Clustal W
The first Crustal program was written by Higgins and Sharp in 1988 [Higgins and
Sharp 88] [Higgins and Sharp 89]. Clustal V [Higgins et al. 91], which is a newer release,
incorporated profile alignments and the facility to generate trees from the alignments by
using the fast Neighbor-Joining method. Clustal W [Thompson et al. 94], the third
generation, made some internal modifications to Clustal V, such as sequence weighting,
the incorporation of position-specific gap penalties, and flexible weight matrix choices.
All these various additional heuristics contribute to its accuracy.
The algorithm is as follows:
Step 1. Construct a distance matrix of all pairs by fast, approximate alignment
algorithm [Wilbur and Lipman 83], followed by converting the similarity scores to
evolutionary distances using the model of Kimura [Kimura 80].
The distances calculated are related to the degree of divergence (defined as 100%
- percent identity) between the sequences. Clustal W uses Kimura's method [Kimura 80]
to calculate protein distances.
27
Distance = - In (1.0 -K- (K* KJ5.0»
where K is the degree of divergence (percent divergence / 100).
Step 2. Construct a guide tree using a neighbor-joining (N!) clustering algorithm
by Saitou and Nei [Saitou and Nei 87].
The neighbor-joining (NJ) method works on the distance matrix between all pairs
of sequences to be analyzed. The NJ method works by constructing a tree T by steps,
keeping a list L of active nodes in this tree [Saitou and Nei 87].
The NJ method calculates distance between two clusters as follows.
D.. =d . . -Cr. +r.) I,) l,) I J
where
rj = 1 L di,k where ILl denotes the size of the set L,
IL 1-2 kEL
The NJ algorithm is given below in Figure 3.
Initialization:
Define T to be the set of leaf nodes, one for each given sequence, and let L = T.
Iteration:
Pick a pair i,) in L for which Dij is minimal.
Defme a new node k and set diem =(dim + d jm + dij)/ 2, for all m in L.
Add k to T with edges of lengths dik =(dij + ri - rj ) and djk =dI} - djk , joining k
to i and), respectively.
Remove i and) from L, and add k.
Termination:
When L consists of two leaves i and), add the remaining edge between i and)
with length dij.
Figure 3. Neighbor-Joining Method [Saitou and Nei 87]
28
The principle of the NJ method is to fInd pairs of neighbors that minimize the
total branch length at each stage of clustering of neighbors. A pair of neighbors is a pair
of nodes connected through a single interior node in an unrooted, bifurcating tree.
The NJ method constructs a tree starting with a star-like tree, as given in Figure
4(a). In practice, some pairs of nodes are more closely related to each other than other
pairs are. In the tree given in Figure 4(b), there is only one interior branch, XY, which
connects the paired nodes I and 2 and the others that are connected by a single node Y.
Any pair of nodes can take the position of 1 and 2 in the tree, and there are N(N-I )/2
ways of choosing them. Among these possible pairs of nodes, the one that gives the
smallest sum ofbranch lengths is chosen. This pair is then regarded as a single node. This
procedure is continued until all N-3 interior branches are found.
8
8 7
~/ 7
1 I 6
~X_Y~
2 X 6
/ ~ / ~ 5
2
3 5 4
4 3
(a) (b)
Figure 4. Star-like Tree and Neighbor-Joining in the NJ Method.
(a) A star-like tree with no hierarchical structure.
(b) A tree in which nodes 1 and 2 are clustered.
29
Step 3. Progressively align at nodes in order of decreasing similarity, usmg
sequence-sequence, sequence-profile, and profile-profile alignment [Gribskov et al. 87].
After the tree is constructed, alignments are built progressively following the
branch order in the tree. At each alignment stage, the algorithm of Myers and Miller
[Myers and Miller 88] is used.
Sequence-alignment and alignment-alignment alignment are calculated using the
profile alignment method, which is a simple extension of the profile method [Gribskov et
al. 87]. In profile alignment, the alignment is treated as a single sequence, except that the
score is calculated at aligned positions as the average weight matrix score of all the
residues in one alignment versus all those in the other. Any gaps that are introduced are
placed in all ofthe sequences of an alignment at the same position.
30
CHAPTER IV
THE NEW METHOD
The algorithm which implements the new method follows the general strategy of
iteratively merging two multiple alignments of two subsets of sequences into a single
multiple alignment of the union of those subsets. This algorithm builds an evolutionary
tree from sequence data while simultaneously constructing an evolutionary infonnative
multiple alignment.
The algorithm has two steps:
1) Construct a similarity matrix of all pairs.
2) Construct an evolutionary tree while simultaneously construct the multiple alignment.
4.1 Construct a Similarity Matrix
To construct the similarity matrix, we need to calculate a matrix of n(n-l )/2
similarity scores between all pairs of n sequences. Similarity scores of pairwise
alignments are computed using Gotoh's algorithm [Gotoh 82], which was introduced in
Chapter II.
Since the alignment score between two sequences of lengths not more than m can
be computed in O(m2
) time using a dynamic programming algorithm [Connen et al. 92],
the time complexity for constructing the similarity matrix of d sequences of lengths at
most m is O(ffm2
) and the space required is O(m2
) [Cormen et al. 92].
31
The example below is a set of sequences S. This example will be used through. all
the following sections to illustrate the new method.
51: RPCVCPVLRQAAQQVLQRQIIQGPQQLRRLFAA
52: RPCACPVLRQVVQQALQRQIIQGPQQLRRLFAA
53: KPCLCPKQAAVKQAAHQQLYQGQLQGPKQVRRAFRLL
54: KPCVCPRQLVLRQAAHLAQQLYQGQRQVRRAFVA
55: KPCVCPRQLVLRQAAHQQLYQGQRQVRRLFAA
Figure 5. A Sample ofa Set of Sequences
The similarity matrix for the set S is as given in Table VII below. PAM-250 matrix
[Dayhoff et al. 78] (described in Chapter nis used as the scoring system. Gap open
penalty is 12 and gap extension penalty is 4.
Table VII. Similarity Matrix Calculated for the Example in Figure 5
81 8 2 8 3 84 8 5 I
8 1 - 160 66 83 85 I
82 160 - 60 75 91
83 66 60 - 86 94
84 83 75 86 - 147
85 85 91 94 147 -
4.2 Construct an Evolutionary Tree
Same as the progressive alignment method, this new method is based on the
concept of evolutionary trees [Stryer 95]. The evolutionary tree is a rooted binary tree.
Let S be a set of n sequences. The evolutionary tree for S is a tree T with the following
properties:
1) T contains n leaves, one for each sequence in S.
32
2) Each internal node of T presents a sequence alignment and has at I
children.
3) The root contains the final multiple alignment.
The new algorithm builds the tree T based on cluster analysis of the ~
matrix. It begins with a set of n leaves and perfonns a sequence of n-l
operations to create the final evolutionary tree, and simultaneously to cons
multiple sequence alignment.
The new algorithm is given below in Figure 6.
Initialization:
Define T to be the set of leaf nodes, one for each given sequence.
Iteration:
1) Pick a pair i,} from T for which sCi,}) is maximal according to the ~
matrix.
2) Construct a new node A, which has i and j as its left child and ri~
respectively, and contains the alignment of i and j. Calculate the c
string Ac for the alignment. Reconstruct the similarity matrix, setting i
s(Ac, m), for all minT.
3) Add A to T.
Termination:
After n-l loops, T will be a single full binary Tree. The root of Twill coni
final multiple alignement.
Figure 6. The New Method: Constructing an Evolutionary Tree
33
For the first step~ it takes O(n2
) time to pick the maximum score from the
similarity matrix for each iteration, so for n-l iterations, it takes time O(n3
).
Calculating the consensus string, re-constructing the similarity matrix, and
computing multiple sequence alignment, which are the three critical points of the new
algorithm given in Figure 6, are described in details in the following three subsections.
4.2.1 Calculating the Consensus Sequences
The consensus sequence is a consensus representation of the critical common
features of a set of sequences [Barton and Anderson 86].
There is not known method to find the consensus sequence. In terms of multiple
alignments, the problem of fmding consensus sequence is based on definition of the
consensus character.
Given a multiple alignment M of a set of sequences S~ the consensus character of
column i of M is the character that minimizes the summed distance to it from all the
characters in column i.
Since the alphabet is finite~ a consensus character for each column ofM exists and
can be found by enumeration. As one simple special case~ if the pairwise scoring scheme
scores a match with a zero and a mismatch or a space opposite a character with a one,
then the consensus character in column i is the plurality character (i.e., the character
occurring most often in column i). Note that the character can be a space.
A consensus character represents the critical characteristic of a column in a
multiple alignment, but if the column is not highly conserved, the resulting consensus
34
character cannot faithfully reflect the characteristic of the column. A better solution
would be to change the consensus character to the neutral character "X" [Melcher 00].
The consensus sequence 8M derived from alignment M is the concatenation of the
consensus characters for each column of M. Note that 8M need not be from M and
generally will not be [Barton and Anderson 86].
It is common in the computational biology to compute a multiple alignment M
and then represent those sequences by the consensus sequence derived from M [Gusfield
97]. It is then natural to use the goodness of the consensus sequence as a way to evaluate
the goodness ofthe multiple alignment M.
In the new algorithm, using consensus sequences as representatives of multiple
alignments, the similarity score between two clusters can be computed by calculating the
similarity score between two corresponding consensus sequences.
It takes O(mn2
) time to compute a consensus sequence of a group of n sequences
with length m.
4.2.2 Reconstructing the Similarity Matrix
Let T be the set of clusters of sequences defined in Figure 6. In each iteration, we
first pick one pair of clusters, i and}, from T. Then a new node A, which contains the
alignment of i and}, is created and added to T.
Accordingly, row i, row}, column i, and column} are removed from the similarity
matrix (Section 4.1). A new row and a new column for A are added to the similarity
matrix. Then we have to compute the similarity scores between A and all other nodes in
the matrix.
35
Table VIII. Similarity Matrix After the First Iteration for the Example in Figure 5
A1 53 54 55
A1 - 64 79 83
53 64 - 86 94
54 79 86 - 147
55 83 94 147 -
For our example in Figure 5, after the first iteration, Table VII will be updated to
Table VID. 8\ and 82will be removed from the table, and AI, parent of 81 and 82, will be
added to the table.
4.2.3 Constructing Multiple Sequence Alignment
Every internal node of tree T (Figure 6) contains the result of the pairwise
alignment of its two children. The pairwise alignments are calculated as follows.
Sequence-to-sequence alignment is done using the standard pairwise alignment
algorithm [Gotoh 82]. If any of the children contains an alignment, the consensus
sequence for the alignment is calculated using the method illustrated in Section 4.2.1.
Gap symbols in the resulting consensus sequences are replaced with neutral elements
(Xs). So, sequence-alignment and alignment-alignment alignments can also be calculated
using the standard pairwise alignment algorithm.
Given two child nodes {A, B}, which are both alignments of sequences, it takes
five steps to construct the alignment ofA and B.
1. Calculate the consensus sequences for the alignment ofA and B. We refer to those
two consensus sequences as a and b. Replace spaces in sequences a and b with neutral
elements, e.g., Xs.
36
2. Calculate an optimal pairwise alignment of sequences a and b. The sequences in
this alignment with the gaps are called a' and b'.
3. Insert all gaps from a' that are not already in a into all sequences in A.
4. Insert all gaps from b' that are not already in b into all sequences in B.
5. Merge A and B.
In the previous sections, we have discussed the details of the new algorithm given
in Figure 6. For n sequences, the algorithm will take n-l iterations to construct the
evolutionary tree. For our example, the algorithm proceeds as shown in Figure 7.
(a) ~~[IJ~[IJ (b) ~GJ GJGJ
s, s,
(0) ~GJ~ (d)
s, S, S. S,
(e)
Figure 7. Process of Constructing an Evolutionary Tree for the Example in Figure 5
37
CHAPTER V
IMPLEMENTATION AND RESULTS
5.1 Implementation
The algorithm has been implemented in the C++ language. The source code listed
In Appendix B has been compiled and tested under Microsoft Virtual C++ 6.0 on
Windows 98, and under the Solaris 7 system.
To use this program under Unix, put all source files under one directory, and enter
"make msa" after the command prompt. A new executable file named ')usa" will be
created. To run the program, simply enter "msa" after the prompt and follow the
instructions.
To run this program through a web browser such as Netscape or Microsoft IE, use
the following URL:
http://gradweb.cs.okstate.edu/~lke/msa/intro.html
You will be asked to provide your choices of gap open penalty, gap extension
penalty, type of the scoring matrix, and the input sequences.
1. Input Format
Sequences can be in upper or lower cases. The only symbols recognized are 20
amino acid characters: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y. All
other symbols will be ignored.
38
The program recognizes and uses the FASTA fonnat ([pearson and Lipman 88]).
The sequences are delimited by an angle bracket ">" in column 1. The text immediately
after the ">" is used as a title. Every thing on the following line until the next ">" or the
end of file is one sequence.
2. Gap Open Penalty
The default is 12. Reduce this to encourage gaps of all sizes, and increase it to
discourage them.
3. Gap Extension Penalty
The default is 4. Reduce this to encourage longer gaps, and increase it to shorten
them.
4. Scoring Matrix
The default is Dayhoff PAM 250 matrix [Dayhoff et al.78]. We also offer a
Blosum 62 matrix [Henikoff and HeniKoff92] and an identity matrix.
5.2 Results
To show a real example we took eight sequences of bacterial ponn [Jeanteur et al.
91]. The lengths of the sequences range from 329 to 347. Figures 8 and 9 illustrate the
eight protein sequences aligned by the Clustal W and this method, respectively.
It can be observed that both methods have correctly identified most of the
matched units and discovered the overall relationships among the sequences. The
alignment by this method is slightly more compact than by the Clustal W method. The
fonner is 392 columns long and the latter is 394 columns long.
39
7 AEIYNKDSNKLDLYGKVNAKHYFSSN-----DADDGDTTYARLGFKGETQINDQLTGFGQWEYEFK
8 AEIYNKDSNKLDLYGKVNAKHyFSSN-----DADDGDTTYARLGFKGETQINDQLTGFGQWEYEFK
5 AEVYNKDGNKLDLYGKVDGLHYFSDN-----KDVDGDQTYMRLGFKGETQVTDQLTGYGQWEYQIQ
6 AEIYNKDGNKLDLFGKVDGLHYFSDD-----KGSDGDQTYMRIGFKGETQVNDQLTGYGQWEYQIQ
3 AEVYNKNANKLDVYGKlKAMHYFSDY-----DSKDGDQTYVRFGIKGETQINEDLTGYGRWESEFS
4 AEVYNKNGNKLDVYGKVK-MHYSDD------DTKDGDQTYVRFGFKGETQINDQLTGYGRWEAEFA
2 AEIYNKDGNKLDVYGKVKAMHYMSDN-----ASKDGDQSYIRFGFKGETQINDQLTGYGRWEAEFA
1 AEIYNKDGNKVDLYGKAVGLHYFSKGNGENSYGGNGDMTYARLGFKGETQINSDLTGYGQWEYNFQ
**:***: .**:*: :** ** :** :* *'*'*****: .. :***:*:** : :
7 GNRAESQGSSK-DKYRLAFAGLKEGDYGSIDYGRNYGVAY-DIGAWTDVLPEFGGDTWTQTDVFMT
8 GNRAESQGSSK-DKYRLAFAGLKFGDYGSIDYGRNYGVAY-DlGAWTDVLPEFGGDTWTQTDVFMT
5 GNSAENENNSW-T--RVAFAGLKEQDVGSFDYGRNLGVVY-DVTSWTDVLPEFGGDTYG-SDNFMQ
6 GNQTEGSNDSW-T--RVAFAGLKEADAGSFDYGRNYGVTY-DVRSWTDVLPEFGGSTYG-ADNFMQ
3 GNKTESDSSQ---KTRLAFAGVKLKNYGSFDYGRNLGALY-DVEAWTDMFPEFGGDSSAQTDNFMT
4 GNKAESDSSQ---KTRLAFAGLKLKDFGSLDYGRNLGALY-DVEAWTDMFPEFGGDSSAQTDNFMT
2 GNKAESDTAQQ--KTRLAFAGLKYKDLGSFDYGRNLGALY-DVEAWTDMFPEFGGDSSAQTDNFMT
1 GNNSEGADAQTGNKTRLAFAGLKYADVGSFDYGRNYGVVYYDALGYTDMLPEFGG-TDAYSDDFFV
** : *. *:****:* : **:***** * * * .:**::*****: : * *:
7 GRTTGFATYRNNDFFGLVDGLNFAAQYQGKNDRSDFD-------N----YTEGNGDGFGFSATYEY
8 QRATGVATYRNNDFFGLVDGLNFAAQYQGKNDRSDFD-------N----YTEGNGDGFGFSATYEY
5 QRGNGYATYRNTDFFGLVDGLDFALQYQGKNGNPSGEGFTSGVTNNGRDALRQNGDGVGGSITYDY
6 QRGNGYATYRNTDFFGLVDGLDFALQYQGKNGSVSGE-----NTN-GRSLLNQNGDGYGGSLTYAI
3 KRASGLATYRNTDFFGLVDGLDLTLQYQGKNEGREVK------------KQ--NGDGVGTSLSYDF
4 KRASGLATYRNTDFFGAIDGLDMTLQyQGKNENRDAK------------KQ--NGDGFGTSLTYDF
2 KRASGLATYRNTDFFGVIDGLNLTLQYQGKNENRDVK------------KQ--NGDGFGTSLTYDF
1 GRVGGVATYRNSNFFGLVDGLNFAVQYLGKNERDTAR------------RS--NGDGVGGSISYEY
* * *****. :*** :***::: ** *** **** * * :*
7 EG--FGIGATYAKSDRTDTQVNAGKVLPEVFASGKNAEVWAAGLKYDANNIYLATTYSETQNMTVF
8 EG--FGIGATYAKSDRTDTQVNAGKVLPEVFASGKNAEVWAAGLKYDANNIYLATTYSETQNMTVF
5 EG--FGIGGAISSSKRTDAQN-----TAAYIGNGDRAETYTGGLKYDANNIYLAAQYTQTYNATRV
6 GEGYFSVGGAITTSKRTADQNNT--ANARLYGNGDRATVYTGGLKYDANNIYLAAQYSQT-NATRF
3 GGSDFAVSAAYTSSDRTNDQN------LLARGQGSKAEAWATGLKYDANNIYLATMYSETRKMTPI
4 GGSDFAVSGAYTNSDRTNAQN------LLARGQGQKAEAWATGLKYDANDIYLAAMYSETRNMTPI
2 GGSDFAISGAYTNSDRTNEQN------LQSRGTGKRAEAWATGLKYDANNIYLATFYSETRKMTPI
1 EG--FGIVGAYGAADRTLNQE------AQPLGNGKKAEQWATGLKYDANNIYLAANYGETRNATPI
*.: .: : . ** * * * :: *******: ****: * : * : *
7 ADHF---------VANKAQNFEAVAQYQFDFGLRPSVAYLQSKGK-DLSVWG-----DQDLVKYVD
8 ADH----------VANKAQNFEAVAQYQFDFGLRPSVAyLQSKGK-DLSVWG-----DQDLVKYVD
5 GS--------LFGWANKAQNFEAVAQYQFSFGLRPSLAYLQSKGK-NLGRGY----DDEDILKYVD
6 GTSNGSNPSTSYGFANKAQNFEVVAQYQFSFGLRPSVAYLQSKGK-DISNGYGASYGDQDIVKYVD
3 SG----------GFANKAQNFEAVAQyQFDFGLRPSLGYVLSKGK-DIEGVG-----SEDLVNYID
4 SG----------GFANKAQNFEVVAQYQFDFGLRPSLGYVQSKGK-DLEGIG-----DEDLVNYID
2 TG----------GFANKTQNFEAVAQYQFDFGLRPSLGYVLSKGK-DIEGIG-----DEDLVNYID
1 TNKFTN----TSGFANKTQDVLLVAQYQFDFGLRPSIAYTKSKSKADVEGIG-----DVDLVNYFE
***:*:. ******.******:.* **.*:: *: :: *.:
7 VGAT-¥YFNKNMSTFVKYKINLLDKNDFTKALGVSTDDIVAVGLVYQF----------------
8 VGAT-YYFNKNMSTFVKYKINLLDKNDFTKALGVSTDDIVAVGLVYQF----------------
5 VGATyyyFNKNMSTYVDyKINLLDDNQFTRDAGINTDNIVALGLVYQF----------------
6 VGAT-¥YFNKNMSTYVDyKINLLDKNDFTRDAGINTDDIVALGLVYQF----------------
3 VGLT-¥YFNKNMDAFVDYKINQLKSD---NKLGINDDDIVALGMTYQF----------------
4 VGAT-¥YFNKNMSAFVDYKINQIKDD---NKLGVNDDDIVALGMTYQFNYTQINAASVGLRHKF
2 VGAT-¥YFNKNMSAFVDYKINQLDSD---NKLNINNDDIVAVGMTyQF----------------
1 VGAT-YYFNKNMSTYVDYIINQIDSD---NKLGVGSDDTVAVGIVYQF----------------
** * *******.: :*.* ** : .. : *: **:*: .***
Figure 8. Alignment of Eight Protein Sequences by Clustal W. (,., indicates a match tuple).
40
1 AEIYNKDGNKVDLYGKAVGLHYFSKGNGENSYGGNGDMTYARLGFKGETQINSDLTGYGQWEYNFQ
2 AEIYNKDGNKLDVYGKVKAMHYMSDNASKD-----GDQSYIRFGFKGETQINDQLTGYGRWEAEFA
3 AEVYNKNANKLDVYGKlKAMHYFSDYDSKD-----GDQTYVRFGIKGETQINEDLTGYGRWESEFS
4 AEVYNKNGNKLDVYGKVK-MHY-SDDDTKD-----GDQTYVRFGFKGETQINDQLTGYGRWEAEFA
5 AEVYNKDGNKLDLYGKVDGLHYFSDNKDVD-----GDQTYMRLGFKGETQVTDQLTGYGQWEYQIQ
6 AEIYNKDGNKLDLFGKVDGLHYFSDDKGSD-----GDQTYMRIGFKGETQVNDQLTGYGQWEYQIQ
7 AEIYNKDSNKLDLYGKVNAKHYFSSNDADD-----GDTTYARLGFKGETQINDQLTGFGQWEYEFK
8 AEIYNKDSNKLDLYGKVNAKHYFSSNDADD-----GDTTYARLGFKGETQINDQLTGFGQWEYEFK
** *** ** * ** ** $ ** * * * ***** *** * **
1 GNNSEGADAQTGNKTRLAFAGLKYADVGSFDYGRNYGVVYYDALGYTDMLPEFGGTD-AYSDDFFV
2 GNKAES-D-TAQQKTRLAFAGLKYKDLGSFDYGRNLGALY-DVEAWTDMFPEFGGDSSAQTDNFMT
3 GNKTES-D-SSQ-KTRLAFAGVKLKNYGSFDYGRNLGALY-DVEAWTDMFPEFGGDSSAQTDNFMT
4 GNKAES-D-SSQ-KTRLAFAGLKLKDFGSLDYGRNLGALY-DVEAWTDMFPEFGGDSSAQTDNFMT
5 GNSAE---NENNSWTRVAFAGLKFQDVGSFDYGRNLGVVY-DVTSWTDVLPEFGGDTYG-SDNFMQ
6 GNQTE---GSNDSWTRVAFAGLKFADAGSFDYGRNYGVTY-DVRSWTDVLPEFGGSTYG-ADNFMQ
7 GNRAES-QGSSKDKYRLAFAGLKFGDYGSIDYGRNYGVAY-DIGAWTDVLPEFGGDTWTQTDVFMT
8 GNRAES-QGSSKDKYRLAFAGLKFGDYGSIDYGRNYGVAY-DIGAWTDVLPEFGGDTWTQTDVFMT
** * * **** * ** ***** * * * ** ***** * *
1 GRVGGVATYRNSNFFGLVDGLNFAVQYLGKNE-RDTAR---------RS----NGDGVGGSISYEY
2 KRASGLATYRNTDFFGVIDGLNLTLQYQGKNENRDVKK--------------QNGDGFGTSLTYDF
3 KRASGLATYRNTDFFGLVDGLDLTLQYQGKNEGREVKK--------------QNGDGVGTSLSYDF
4 KRASGLATYRNTDFFGAIDGLDMTLQYQGKNENRDAKK--------------QNGDGFGTSLTYDF
5 QRGNGYATYRNTDFFGLVDGLDFALQYQGKNGNPSGEGFTSGVTNNGRDALRQNGDGVGGSITYDY
6 QRGNGYATYRNTDFFGLVDGLDFALQYQGKNGSVSGEN------TNGRSLLNQNGDGYGGSLTYAI
7 GRTTGFATYRNNDFFGLVDGLNFAAQYQGKNDRSDFDNYT-----------EGNGDGFGFSATYEY
8 QRATGVATYRNNDFFGLVDGLNFAAQYQGKNDRSDFDNYT-----------EGNGDGFGFSATYEY
* * ***** *** *** ** *** **** * * *
1 EG--FGIVGAYGAADRTLNQE------AQPLGNGKKAEQWATGLKYDANNIYLAANYGETRN---A
2 GGSDFAISGAYTNSDRTNEQ----NLQSR--GTGKRAEAWATGLKYDANNIYLATFYSETRKM---
3 GGSDFAVSAAYTSSDRTNDQ----NLLAR--GQGSKAEAWATGLKYDANNIYLATMYSETRKM---
4 GGSDFAVSGAYTNSDRTNAQ----NLLAR--GQGQKAEAWATGLKYDANDIYLAAMYSETRNM---
5 -EG-FGIGGAISSSKRTDAQNTAA-----YIGNGDRAETYTGGLKYDANNIYLAAQYTQT-- --Y-
6 GEGYFSVGGAITTSKRTADQNNTAN--ARLYGNGDRATVYTGGLKYDANNIYLAAQYSQTNATRFG
7 EG--FGIGATYAKSDRTDTQVNAGKVLPEVFASGKNAEVWAAGLKYDANNIYLATTYSETQNM---
8 EG--FGIGATYAKSDRTDTQVNAGKVLPEVFASGKNAEVWAAGLKYDANNIYLATTYSETQNM---
* ** * * * ******* **** * ...
1 TPITNKFTNTSGFANKTQDVLLVAQYQFDFGLRPSIAYTKSKSKA---DVEGI-GDVDLVNYFEVG
2 ---TPITGGF---ANKTQNFEAVAQYQFDFGLRPSLGYVLSKGKDI----EGI-GDEDLVNYIDVG
3 ---TPISGGF---ANKAQNFEAVAQYQFDFGLRPSLGYVLSKGKDI----EGV-GSEDLVNYIDVG
4 ---TPISGGF---ANKAQNFEVVAQYQFDFGLRPSLGYVQSKGKDL----EGI-GDEDLVNYIDVG
5 --NATRVGSLFGWANKAQNFEAVAQYQFSFGLRPSLAYLQSKGKNLGRGYD----DEDILKYVDVG
6 TSNGSNPSTSYGFANKAQNFEVVAQYQFSFGLRPSVAYLQSKGKDISNGYGASYGDQDIVKYVDVG
7 ---TVFADHFV--ANKAQNFEAVAQYQFDFGLRPSVAYLQSKGKDL----SVW-GDQDLVKYVDVG
8 ---TVFADH-V--ANKAQNFEAVAQYQFDFGLRPSVAYLQSKGKDL----SVW-GDQDLVKYVDVG
*** * ****** ****** * ** * * * **
1 ATYY-FNKNMSTYVDyIINQIDSDN---KLGVGSDDTVAVGIVYQF----------------
2 ATYY-FNKNMSAFVDyKINQLDSDN---KLNINNDDIVAVGMTYQF----------------
3 LTYY-FNKNMDAFVDYKINQLKSDN---KLGINDDDIVALGMTyQF----------------
4 ATYY-FNKNMSAFVDYKINQIKDDN---KLGVNDDDIVALGMTYQFNYTQINAASVGLRHKF
5 ATYYYFNKNMSTYVDYKINLLDDNQFTRDAGINTDNIVALGLVYQF----------------
6 ATYY-FNKNMSTYVDYKINLLDKNDFTRDAGINTDDIVALGLVYQF----------------
7 ATYY-FNKNMSTFVKYKINLLDKNDFTKALGVSTDDIVAVGLVYQF----------------
8 ATYY-FNKNMSTFVKYKINLLDKNDFTKALGVSTDDIVAVGLVYQF----------------
*** ***** * * ** * ** * ***
Figure 9. Alignment ofEight Protein Sequences by the New Method.
41
The score obtained with the new algorithm is very close to the score obtained
from ClustaJ W. The result of the new algorithm is slightly better than that of Clustal W
at some positions. As a specific example, the new algorithm found the conserved residue
'S' at column 24.
With the above example we do not intend to prove that the new method performs
generally better than Clustal W, but we do want to show that the new method is usable
and produces good results.
5.3 Discussion
In this study a new method was presented for multiple sequence alignment.
Compared to other iterative algorithms, this method has two major properties.
First, this algorithm calculates the distance between two groups of sequences in
an efficient and biologically meaningful way. The Feng-Doolittle's algorithm, which was
introduced in Section 3.3.1, uses two original sequences, which have the minimum
distance between two groups. The problem of the Feng-DoolittIe's algorithm is that the
original sequence cannot always present the major characteristics of a group, so the result
may not be accurate for general cases. In the ClustalW algorithm, the group-group
alignment is based on the average distance, and the new multiple alignment is created by
aligning profiles. The disadvantage of Clustal W is that it needs a lot of memory space to
store the profiles in order to compute the profile alignments.
In the new method, a consensus sequence is produced as the representative of a
group of sequences. The distance between two groups is defined as the distance between
the two corresponding consensus sequences. Since by definition the consensus sequence
42
minimizes the summed distance to it from all the sequences in the group, the consensus
sequence is the center of the group. Therefore, the new algorithm solves the problem of
Feng-Doolittle's algorithm because the consensus sequence, instead of an original
sequence, more faithfully reflects the variations of a group of sequences. Moreover, since
a group of sequences can be represented by a single sequence, the computation of the
distance and alignment between two groups can be simplified to sequence-sequence
distance and sequence-sequence alignment. Compared to Clustal W, it is more time and
space efficient.
Second, this algorithm combines the steps of constructing the evolutionary tree
and the computation of the multiple alignments. The progressive method calculates a
guide tree first, then the multiple alignments are built based on the tree structure. In the
new algorithm, multiple alignment can be done with reference to the evolutionary tree,
even though the tree structure itselfmay remain unknown.
In conclusion, this algorithm is fast and efficient, and can produce reasonable and
meaningful results. It also has the advantages of easy implementation and relatively
simple data structures.
5.4 Future Work
In the new algorithm, each internal node of the evolutionary tree is represented by
a consensus sequence, which inevitably leads to a loss of some information. Other
objective functions may be added to keep more infonnation about the alignment.
Additional infonnation such as predicted secondary structures or propensities of gap
formation may be supplemented to the consensus sequence as well.
43
The implementation of the new algorithm uses Gotoh's algorithm to compute
pairwise sequence alignments, which requires O(mn) space to align two sequences of
lengths m and n. The new algorithm also stores a similarity table of size n*n to calculate
the closely related pairs among n sequences. Further refinement may be made to reduce
the space requirements.
44
REFERENCES
[Alberts et al. 94] Bruce Alberts, Dennis Bray, Julian Lewis, Martin Raff, Keith Roberts,
and James D. Watson, Molecular Biology ofthe Cells, Galdand Publishers, New York,
NY,1994.
[Altschul and Erickson 86] Stephen F. Altschul and Bruce W. Erickson, "Optimal
Sequence Alignment Using Affine Gap Costs", Bulletin ofMathematical Biology, Vol.
48, No. 5/6, pp. 603-616, August 1986.
[Altschul 89] Stephen F. Altschul, "Gap Costs for Multiple Sequence Alignment",
Journal ofTheoretical Biology, Vol. 138, No.3, pp. 297-309, June 1989.
[Altschul 91] Stephen. F. Altschul, "Amino Acid Substitution Matrices from an
Information Theoretic Perspective", Journal ofMolecular Biology, Vol. 219, No.3, pp.
555-565, June 1991.
[Altschul 93] Stephen F. Altschul, "A Protein Alignment Scoring System Sensitive at All
Evolutionary Distances", Journal ofMolecular Evolution, Vol. 36, No.3, pp. 290-300,
March 1993.
[Bacon and Anderson 86] David 1. Bacon and Wayne F. Anderson, "Multiple Sequence
Alignment", Journal of Molecular Biology, Vol. 191, No.2, pp. 153-161, September
1986.
[Barton 90] Geoffrey J. Barton, "Protein Multiple Sequence Alignment and Flexible
Pattern Matching", In Methods in Enzymology. Volume 183: Molecular Evolution:
Computer Analysis ofProtein and Nucleic Acid Sequences, Russell F. Doolittle (Ed.),
Academic Press, pp. 403-428, San Diego, CA, 1990.
[Barton 96] Geoffrey J. Barton, "Protein Sequence Alignment and Database Scanning",
In Protein Structure Prediction: A Practical Approach, Michael J. E. Steinberg (Ed.),
Oxford University Press, New York, NY, 1996.
[Barton and Russell 93] Geoffrey J. Barton and Robert B. Russell, "Protein Structure
Prediction", Nature, Vol. 361, No. 6412, pp. 505-506, February 1993.
[Bellman 57] Richard E. Bellman, Dynamic Programming, Princeton University Press,
Princeton, NJ, 1957.
45
[Carrillo and Lipman 88] Humberto Carrillo and David Lipman, "The Multiple Sequence
Alignment Problem in Biology", SIAM Journal on Applied Mathematics, Vol. 48, No.
5, pp. 1073-1082, October 1988.
[Cormen et al. 92} Thomas H. Connen, Charles E. Leiserson, and Ronald L. Rivest,
Introduction to Algorithms, MIT Press and McGraw-Hill, Cambridge, MA, 1992.
[Dayhoff et al. 78] Margaret O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, "A Model of
Evolutionary Change in Proteinstl
, In Atlas ofProtein Sequence and Structure, Volume
5, Supplement 3, M. O. Dayhoff (ed.), National Biomedical Research Foundation, pp.
345 - 352, Washington D. C., 1978.
[Dijkstra 59] E. W. Dijkstra, "A Note on Two Problems in Connexion with Gaps",
Numerische Mathematik, Vol. 1, No.1, pp. 269-271, November 1959.
[Doolittle 81] Russell F. Doolittle, "Similar Amino Acid Sequence: Chance or Common
Ancestry?", Science, Vol. 214, No. 4517, pp. 385-396, October 1981.
[Durbin et al. 98] Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison,
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids,
Cambridge University Press, Cambridge, UK, 1998.
[Eddy 95] Sean R. Eddy, "Multiple Alignment Using Hidden Markov Models",
Proceedings of the Third International Conference on Intelligent Systems for
Molecular Biology, pp. 114-120, Cambridge, UK, July 1995.
[Feng et al. 85] D. F. Feng, M. S. Johnson, and R. F. Doolittle, "Aligning Amino Acid
Sequences: Comparison of Commonly Used Methods", Journal of Molecular
Evolution, Vol. 21, No.2, pp. 112-125, February 1985.
[Feng and Doolittle 87] Da-fei Feng and Russell F. Doolittle, "Progressive Sequence
Alignment as a Prerequisite to Correct Phylogenetic Trees", Journal of Molecular
Evolution, Vol. 25, No.4, pp. 351-360, August 1987.
[Feng and Doolittle 96] Da-fei Fend and Russell F. Doolittle, "Progressive Alignment of
Amino Acid Sequences and Construction of Phylogenetic Trees from Them", In
Methods in Enzymology, Volume 266: Computer Methods for Macromolecular
Sequences Analysis, Russell F. Doolittle (Ed.), Academic Press, pp. 368-382, San
Diego, CA, 1996.
[Fitch and Margoliash 67] Walter M. Fitch and Emanuel Margoliash, "Construction of
Phylogenie Trees", Science, Vol. 155, No. 3760, pp. 279-284, January 1967.
[Fitch and Smith 83] Walter M. Fitch and Temple F. Smith, "Optimal Sequence
Alignments", Proceedings of the National Academy of Sciences, USA, Vol. 80, No.5,
pp. 1382-1386, March 1983.
46
[George et aI. 90] David G. George, Winona C. Barker, and Lois T. Hunt, "Mutation
Data Matrix and Its Uses", In Methods in Enzymology, Volume 183: Molecular
Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Russell F.
Doolittle (Ed.), Academic Press, pp. 333-351, San Diego, CA, 1990.
[Gilbert 91] Walter Gilbert, "Towards a Paradigm Shift in Biology", Nature, Vol. 349,
No. 6305, p. 99, January 1991.
[Gonnet et al. 92] Gaston H. Gonnet, Mark A. Cohen. and Steven A. Benner, "Exhaustive
Matching of the Entire Protein Sequence Database", Science, Vol. 256, No. 5062, pp.
1443-1445, June 1992.
[Gotoh 82] Osamu Gotoh, "An Improved Algorithm for Matching Biological
Sequences", Journal of Molecular Biology, Vol. 162, No.3, pp. 705-708, December
1982.
[Gotoh 86] Osamu Gotoh, "Alignment of Three Biological Sequences with an Efficient
Traceback Procedure", Journal of Theoretical Biology, Vol. 121, No.3, pp. 327-337,
August 1986.
[Gribskov et al.87] M. Gribskov, A. McLacWan, and E. Eisenberg, "Profile Analysis
Detection of Distantly Related Proteins", Proceedings of the National Academy of
Sciences, USA, Vol. 88, No. 11, pp. 4355-4358, November 1987.
[Guo 00] Yanwen Guo, "Dynamic Programming and Its Application to Pairwise Biology
Sequence Alignment", Master of Science Thesis, Computer Science Department.
Oklahoma State University, May 2000.
[Gupta et at. 95] Sandeep K. Gupta, John D. Kececioglu, and Alejandro A. Schaffer,
"Improving the Practical Space and Time Efficiency of the Shortest-Paths Approach to
Sum-of-Pairs Multiple Sequence Alignment", Journal of Computational Biology, Vol.
2, No.3, pp. 459-472, March 1995.
[Gusfield 97] Dan Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge
University Press, New York, NY, 1997.
[Henikoff and Henikoff 92] Steven Henikoff and Jorja G. Henikoff, "Amino Acid
Substitution Matrices from Protein Blocks", Proceedings ofthe National Academy of
Sciences, USA, Vol. 89, No. 22, pp. 10915 - 10919, November 1992.
[Henikoff and Henikoff 93] Steven Henikoff and JoIja G. Henikoff, "Performance
Evaluation of Amino Acid Substitution Matrices", Proteins, Vol. 17, No.1, pp. 49-61,
January 1993.
47
."..
[Higgins and Sharp 88] Desmond G. Higgins and Paul M. Sharp, "CLUTAL: a Package
for Perfonning Multiple Sequence Alignment on a Microcomputer", Gene, Vol. 73,
No.1, pp. 237-244, December 1988.
[Higgins and Sharp 89] Desmond G. Higgins and Paul M. Sharp, "Fast and Sensitive
Multiple Sequence Alignment on a Microcomputer", Computer Applications in the
Biosciences: CAB/OS, Vol. 5, No.1, pp. 151-153, December 1989.
[Higgins et al. 91] Desmond G. Higgins, A. J. Bleasby, and R. Fuchs, "CLUSTAL V:
Improved Software for Multiple Sequence Alignment", Computer Applications in the
Biosciences: CAB/OS, Vol. 8, No.1, pp. 189-191, December 1991.
[Higgins et al. 96] Desmond G. Higgins, Julie D. Thompson, and Toby J. Gibson, "Using
CLUSTAL for Multiple Sequence Alignments", In Methods in Enzymology, Volume
266: Computer Methods for Macromolecular Sequences Analysis, Russell F. Doolittle
(Ed.), Academic Press, pp. 383-401, San Diego, CA, 1996.
[Jeanteur et al. 91] D. Jeanteur, J. H. Lakey, and F. Pattus, "The Bacterial Porin
Superfamily: Sequence Alignment and Structure Prediction", Molecular Microbiology,
Vol. 5, No.4, pp. 2153-2164, October 1991.
[Karlin and Altschul 90] Samuel Karlin and Stephen F. Altschul, "Methods for Assessing
Statistical Significance of Molecular Sequence Features by Using General Scoring
Schemes", Proceedings ofthe National Academy ofSciences. USA, Vol. 87, No.6, pp.
2264-2268, March 1990.
[Kimura 80] Motoo Kimura, "A Simple Method for Estimating Evolutionary Rates of
Base Substitutions Through Comparative Studies of Nucleotide Sequences", Journal
ofMolecular Biology, Vol. 16, No.2, pp. 111-120, December 1980.
[King and Wilson 75] M. C. King and A. C. Wilson, "Evolution at Two Levels in
Humans and Chimpanzees", Science, Vol. 188, No. 4184, pp. 107-116, April 1975.
[Krogh et al. 94] Anders Krogh, Michael Brown, I. Saira Mian, Kimmen Sjolander, and
David Haussler, "Hidden Markov Models in Computational Biology: Applications to
Protein Modeling", Journal of Molecular Biology, Vol. 235, No.5, pp. 1501-1531,
February 1994.
[Lehninger et al. 93) Albert L. Lehninger, David L. Nelson, and Mechael M. Cox,
Principles ofBiochemistry, Worth Publishers, New York, NY, 1993.
[Lipman et al. 89] David J. Lipman, Stephen F. Altschul, and John D. Kececioglu, "A
Tool for Multiple Sequence Alignment", Proceedings of the National Academy of
Sciences USA, Vol. 86, No.6, pp. 4412-4415, June 1989.
48
[Marks et a1. 96] Dawn B. Marks, Allan D. Marks, and Colleen M. Smith, Basic Medical
Biochemistry, Williams & Wilkins, Baltimore, MD, 1996.
[Mclachlan 71] A. D. Mclachlan, "Tests for Comparing Related Amino-Acid Sequences
Cytochrome c and Cytochrome CS51", Journal of Molecular Biology, Vol. 61, No.2,
pp. 409-424, October 1971.
[Melcher 00] Ulrich Melcher, "The '30K' Superfamily of Viral Movement Proteins",
Journal ofGeneral Virology, Vol. 81, No.1, pp. 257-266, January 2000.
[Murata et al. 85] M. Murata, J. S. Richardson, and Joel L. Sussman, " Simultaneous
Comparison of Three Protein Sequences", Proceedings of the National Academy of
Sciences, USA, Vol. 82, No.4, pp. 3073-3077, May 1985.
[Myers and Miller 88] Eugene W. Myers and Webb Miller, "Optimal Alignments in
Linear-Space", Computer Applications in the Biosciences: CABIOS, Vol. 4, No.1,
pp.ll- 17, August 1988.
[Needleman and Wunsch 70] Saul B. Needleman and Christian D. Wunsch, "A General
Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two
Proteins", Journal ofMolecular Biology, Vol. 48, No.3, pp. 443-453, March 1970.
[Pearson and Lipman 88] William R. Pearson and David J. Lipman, "hnproved Tools for
Biological Sequence Comparison", Proceedings ofthe National Academy ofSciences,
USA, Vol. 85, No.8, pp. 2444-2448, April 1988.
[Pearson 95) William R. Pearson, "Comparison of Methods for Searching Protein
Sequence Databases", Protein Science, Vol. 4, No.6, pp. 1145-1160, June 1995.
[Risler et al. 88] J. L. Risler, H. Delacroix, and A. Henaut, "Amino Acid Substitutions in
Structurally Related Proteins: A Pattern Recognition Approach: Determination of a
New and Efficient Scoring Matrix", Journal ofMolecular Biology, Vol. 204, No.4, pp.
1019-1029, December 1988.
[Saitou and Nei 87] Naruya Saitou and Masatoshi Nei, ''The Neighbor-joining Method: A
New Method for Reconstructing Phylogenetic Trees", Molecular Biology and
Evolution, Vol. 4, No.4, pp. 406-425, July 1987.
[Sankoff and Kruskal 83] D. Sankoff and J. B. Kruskal, Time Warps, String Edits and
Macromolecules: The Theory and Practice of Sequence Comparison, Addison Wesley,
MA,1983.
[Schwartz and Dayhoff 78] R. M. Schwartz and M. O. Dayhoff, "Matrices for Detecting
Distant Relationships", In Atlas of Protein Sequence and Structure, Volume 5,
Supplement 3, M. O. Dayhoff (ed.), National Biomedical Research Foundation, pp.
353 - 358, Washington D. C., 1978.
49
[Sellers 74] Peter H. Sellers, "On the Theory of Computation of Evolutionary Distances",
SIAM Journal on Applied Mathematics, Vol. 26, No.4, pp. 787-793, June 1974.
[Smith et at. 81] T. F. Smith, M. S. Waterman, and W. M. Fitch, "Comparative
Biosequence Metrics", Journal of Molecular Evolution, Vol. 18, No.1, pp. 38-46,
December 1981.
[Smith and Watennan 81] T. F. Smith and M. S. Waterman, "Identification of Common
Molecular Subsequences", Journal ofMolecular Biology, Vol. 147, No.1, pp. 195-197,
March 1981.
[States and Boguski 92] David J. States and Mark S. Boguski, "Similarity and
Homology", In Sequence Analysis Primer, Michael Gribskov and John Devereux
(Eds.), Oxford University Press, New York, NY, 1992.
[Stryer 95] Lubert Stryer, Biochemistry, W. H. Freeman, New York, NY, 1995.
[Taylor 88] William R. Taylor, "A Flexible Method to Align Large Numbers of
Biological Sequences", Journal ofMolecular Evolution, Vol. 28, No.1, pp. 161-169,
December 1988.
[Taylor 96] William R. Taylor, "Hierarchical Method to Align Large Numbers of
Biological Sequences", In Methods in Enzymology, Volume 266: Computer Methods
for Macromolecular Sequences Analysis, Russell F. Doolittle (Ed.), Academic Press,
pp. 456-474, San Diego, CA, 1996.
[Thompson et al. 94] Julie D. Thompson, Desmond G. Higgins, and Toby J. Gibson,
"CLUSTUL W: Improving the Sensitivity of Progressive Multiple Sequence
Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight
Matrix Choice", Nucleic Acids Research, Vol. 22, No. 22, pp. 4673-4680, November
1994.
[Voet and Voet 95] Danald Voet and Judith G. Voet, Biochemistry, John Wiley & Sons,
New York, NY, 1995.
[Waterman et al. 76] M. S. Waterman, T. F. Smith, and W. A. Beyer, "Some Biological
Sequence Metrics", Advances in Mathematics, Vol. 20, No.3, pp. 367-387, June 1976.
[Waterman 95] Michael S. Waterman, Introduction to Computational Biology: Maps
Sequences and Genomes, Chapmen and Hall, London, UK, 1995.
[Wilbur and Lipman 83] W. J. Wilbur and David J. Lipman, "Rapid Similarity Searches
of Nucleic Acid and Protein Data Banks", Proceedings of The National Academy of
Sciences USA, Vol. 80, pp. 726-730, February 1983.
50
APPENDICES
51
'.
APPENDIX A
GLOSSARY
Alignment
Amino Acid
Consensus Sequence
FASTA
Gap
A one-to-one matching oftwo sequences so that each character
in one sequence is associated with a single character of the
other sequence or with a null character (gap).
Any of a class of 20 molecules that are combined to form
proteins in living things, also called "residue".
A single sequence that represents the content of a group of
aligned sequences.
A program that compares a protein sequence to another protein
sequence or to a protein database, or a DNA sequence to
another DNA sequence or a DNA library.
A space inserted into a sequence to enhance its alignment with
another sequence. Gap are often represented by a '.' or '-'
character.
Gap Extension Penalty The length-dependent term, J, of a gap penalty of the form w =
g + lx. Where w is the gap penalty, g is the length-independent
term, and x the length of the gap.
Gap Opening Penalty The length-independent term, g, of a gap penalty of the form w
= g + lx. (see gap extension penalty).
Global Alignment An optimal alignment that includes all characters from each
sequence. Global alignment may miss short regions of high
local similarity. Global alignments are most useful for closely
related sequences of known homology.
Homology Similarity attributable to descent from a common ancestor.
Homology is an all or none relationship. In molecular biology,
homology is often inferred from a high degree of sequence
similarity.
52
Multiple Alignment Simultaneous alignment ofmore than two sequences.
PAM Percent accepted mutations.
Phylogenetic Tree A hierarchical branching diagram based on morphological
similarities where each branch represents a postulated clade or
monophyletic group, a group comprised of all the sampled
descendants of a single ancestral lineage.
Profile A profile is a position specific scoring table that represents the
infonnation from a family of related sequences.
Progressive Alignment A multiple alignment algorithm in which the sequences are
first clustered and then added one by one, in order of
decreasing similarity, to the growing multiple alignment.
Protein A large molecule composed of one or more chains of amino
acids in specific order. Examples are honnones, enzymes, and
antibodies.
Scoring Matrix In a dynamic programming alignment, the score matrix
indicates the quality of the alignment ending at each possible
pair of residues.
Sequence The order of amino acids in a protein molecule.
Unitary Matrix Also known as an identity matrix. A scoring system in which
only identical characters received a positive similarity score.
53
APPENDIXB
CODE LISTINGS
The package for the new algorithm comes as seven C++ source files, five header
files, and one makefile. The source code for the package is listed as indicated below.
File Name Page
1. Makefile 55
2. main.cpp 56
3. Matrices.cpp 58
4. PairAlign.h 60
5. PairAlign.cpp 62
6. GuideTrec.h 65
7. GuideTree.cpp 66
8. ScoreMat.h 68
9. ScoreMat.cpp 69
10. MultipleAlign.h 71
11. MultipleAlign.cpp 72
12. Common.h 77
13. Common.cpp 7S
54
## File 1: makefile
install: msa
clean:
rm *.0
OBJECTS Common.o PairAlign.o Matrices.o GuideTree.o
ScoreMat.o MultipleAlign.o main.o
\
HEADERS = Common.h PairAlign.h GuideTree.h \
MultipleA1ign.h ScoreMat.h
CC
CFLAGS
LFLAGS
msa :
.c.o
g++
-c -g
-0 -1m
$ (OBJECTS)
S(CC) -0 $@ $ (OBJECTS) $ (LFLAGS)
$(CC) $ (CFLAGS) $?
55
II File 2: main.cpp
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
using namespace std;
#include <stdlib.h>
#include "PairAlign.h"
#include "MultipleAlign.h"
II Simple interface of the program.
II get parameters. open input file, output file and log file.
int main(char ** args)
{
cout « "Enter input file name: "
string inputFile;
cin » inputFile;
ifstream inFile;
inFile.open(inputFile.c str(), ios: :in);
if (! inFile) { -
cerr « inputFile « " could not be opened." « endl;
exit (1) ;
string matName [3] = {"PAM 250 Mutation Matrix",
"BLOSUM 62 Substitution Matrix", "Unitary Matrix" };
int mat = 0;
int gapOpen = 12;
int gapExt = 4;
int change;
do {
cout « endl « "1. Change Scoring Matrix:
« matName[mat]« endl
« "2. Change Gap Open Penalty: "«gapOpen «endl
« "3. Change Gap Extension Penalty: "« gapExt «endl
« "4. OK, Comput the Multiple Alignment Now" « endl
« "5. Exit" «endl« "Enter your choice: ";
cin » change;
switch (change) {
case 1: int m;
cout « endl «"scoring Matrix options: "«endl
«"1. PAM 250 Mutation Matrix" «endl
«"2. Blosum 62 Substitution Matrix" «endl
«"3. Unitary Matrix"«endl
< <" Enter your choice: ";
cin » m;
if (m != 1 && m 1= 2 && m != 3)
cout « "Your choice " « m
«"is not a valid number" « endl;
else
mat
break;
case 2:
m - 1;
56 -
new PAM250(); break;
new Blosum62(); break;
new Unitary(); break;
cout « "Enter gap opening penalty: ";
cin » gapOpen;
break;
case 3:
cout «"Enter gap extension penalty: ";
cin » gapExt;
break;
case 4: break;
case 5: exit(l);
default: cout « "Not a valid number. Try again.\n";
}
} while ( change != 4 );
Matrices *sub;
switch (mat) {
case 0: sub
case 1: sub
case 2: sub
}
string outputFile = inputFile + ".out";
of stream outFile;
outFile.open(outputFile.c str(), ios: :out);
if (! outFile) { -
cerr « outputFile « " could not be opened." « endl;
exit (1) ;
string Log = inputFile + ". log";
of stream 10gFile;
10gFile.open(Log.c str(), ios: :out);
if (!logFile) { -
cerr « Log « II could not be opened." « endl;
exit (1) ;
multiAlign MA(inFile, outFile, 10gFile, sub, gapOpen, gapExt);
inFile . close () ;
outFile.close() ;
10gFile. close () ;
cout « "The final result is printed to file "« outputFile
« endl « "and the log file is "« Log « endl;
return 0;
57
IIFile 3: Matrices.cpp
#include IPairAlign.h"
II This function maps the inputMatrix, which is ordered by
II residuesOrder, to a 26*26 score matrix, which is ordered
I I by alphabet.
void Matrices::buildScore(int inputMatrixJ] [baseSize])
{
for (int i = 0; i < baseSize; i++)
char a = residuesOrder[i];
for (int j = 0; j<=i; j++) {
char b = residuesOrder[j];
score[a-'A'] [b-'A'] = score[b-'A'] [a-'A']
= inputMatrix [i] [j] ;
II PAM250 Scoring Matrix
PAM250: : PAM250 ()
int
1*
{{
{
{{
{
{{
{{{
{{{
{
{
{{
{
{{
};
}
M[baseSize] [baseSize] {
A R N D C Q E G H ILK M F PST W Y V X *1
2 } ,
-2, 6 },
0, 0, 2 },
0,-1, 2,4 },
-2,-4,-4,-5,12 },
0,1, 1,2,-5,4 },
0,-1, I, 3,-5, 2, 4 },
1,-3, 0, 1,-3,-1, 0, 5 },
-1, 2, 2, 1,-3, 3, 1,-2, 6 },
-1,-2,-2,-2,-2,-2,-2,-3,-2, 5 },
_2,_3,_3,-4,-6,-2,-3,-4,-2,2, 6 },
-I, 3, 1, 0,-5, I, 0,-2, 0,-2,-3, 5 },
-I, 0,-2,-3,-5,-1,-2,-3,-2, 2, 4, 0, 6 },
-4,-4,-4,-6,-4,-5,-5,-5,-2, 1, 2,-5, 0, 9 },
1 0 -1 -1 -3 0,-1,-1, 0,-2,-3,-1,-2,-5, 6 },
l' 0' I' 0' 0'_1,0,1,-1,-1,-3,0,-2,-3, I, 2 },
I
, -I' 0' 0' _2' -I, 0, 0,-1, 0,-2, 0,-1,-3, 0, I, 3 },
, , , , , 5 17 }
-6, 2 , -4 , -7 , -8 , -5 , -7 , -7,-3,-5,-2,-3,-4, 0,-6,-2,- , 0 10 , }
- 3 ,-4, -2 , -4 , 0 , -4 , -4 , _5,0,-1,-1,-4,-2,7,-5,-3,-3" 6 2 4,}
O
-2 -2 -2 -2 -2 -2,-1,-2,4, 2,-2, 2,-1,-1,-1, 0,- ,- , . '}
, , , , , , 0 4 -2 -1 -1
O
-1 0 -1 -3 -1 _1,_1,_1,_1,_1,_1,_1,_2,_1, 0, ,-, , ,
, f , I , I
buildScore (M) ;
II Blosum62 Mutation Matrix
Blosum62: :Blosum62()
{
58
} ;
}
int M[baseSize] (baseSize] {
1* A R N D C Q E G H ILK M F PST W Y v x *1
{ 4 },
{-I, 5 },
{-2, 0, 6 },
{-2,-2, 1, 6 },
{ 0,-3,-3,-3, 9 },
{-1, 1, 0, 0,-3, 5 },
{-1, 0, 0, 2,-4, 2, 5 },
{ 0,-2, 0,-1,-3,-2,-2, 6 },
{-2, 0, 1,-1,-3, 0, 0,-2, 8 },
{-1,-3,-3,-3,-1,-3,-3,-4,-3, 4 },
{-1,-2,-3,-4,-1,-2,-3,-4,-3, 2, 4 },
{-1, 2, 0,-1,-3, 1, 1,-2,-1,-3,-2, 5 },
{-1,-1,-2,-3,-1, 0,-2,-3,-2, I, 2,-1, 5 },
{-2,-3,-3,-3,-2,-3,-3,-3,-1, 0, 0,-3, 0, 6 },
{-1,-2,-2,-I,-3,-I,-1,-2,-2,-3,-3,-1,-2,-4, 7 },
{ 1,-1, 1,0,-1, 0, 0, 0,-1,-2,-2, 0,-1,-2,-1,4 },
{ 0,-1, 0,-1,-1,-1,-1,-2,-2,-1,-1,-1,-1,-2,-1, 1, 5 },
{-3,-3,-4,-4,-2,-2,-3,-2,-2,-3,-2,-3,-1, 1,-4,-3,-2,11 },
{-2,-2,-2,-3,-2,-1,-2,-3, 2,-1,-1,-2,-1, 3,-3,-2,-2, 2, 7 },
{ 0,-3,-3,-3,-1,-2,-2,-3,-3, 3, 1,-2, 1,-1,-2,-2, 0,-3,-1, 4 },
{ 0,-1,-1,-1,-2,-1,-1,-1,-1,-1,-1,-1,-1,-1,-2, 0, 0,-2,-1,-1,-1}
buildScore (M) j
II Identity Scoring Matrix
Uni tary: : Uni tary ( )
{
int M[baseSizel [baseSize]
1* A R N D C Q E G H ILK M F PST W Y V x*1
A */ { 1 },
R *1 { 0,1 },
N *1 { 0,0,1 },
D *1 { 0,0,0,1 },
C *1 { 0,0,0,0,1 },
Q *1 { 0,0,0,0,0,1 },
E *1 { 0,0,0,0,0,0,1 },
G *1 { 0,0,0,0,0,0,0,1 },
H *1 { 0,0,0,0,0,0,0,0,1 },
I *1 { 0,0,0,0,0,0,0,0,0,1 },
L *1 { 0,0,0,0,0,0,0,0,0,0,1 },
K *1 { 0,0,0,0,0,0,0,0,0,0,0,1 },
M *1 { 0,0,0,0,0,0,0,0,0,0,0,0,1 },
F *1 { 0,0,0,0,0,0,0,0,0,0,0,0,0,1 },
P *1 { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 },
S *1 { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 },
T *1 { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 },
W *1 { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 },
Y *1 {0,0,0,0,0,0,0,o,o,0,0,0,o,o,0,0,0,0,1 },
V *1 {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,O,O,1 },
X *1 { 0,0,0,0,0,0,0,o,o,0,0,0,0,0,0,0,0,O,O,O,1}
1* A R N D C Q E G H ILK M F PST W Y V X *1
/*
1*
1*
1*
1*
1*
1*
1*
1*
1*
/*
/*
1*
1*
1*
/*
1*
/*
/*
1*
1*
} ;
}
buildScore (M) ;
59
II File 4: PairAlign.h
#ifndef
#define
PAlRALIGN H
PAlRALIGN H
#include "Cammon.h"
II The class of Matrices
class Matrices {
public:
void buildScore (int [] [baseSize) ) ;
II a, b should both be upper case letters.
int getScore(int a, int b) canst { return score[a-'A'] [b-'A'];}
protected:
int score [26) [26) ;
};
class PAM250 : public Matrices (
public:
PAM250() ;
} ;
class Blosum62 : public Matrices {
public:
Blosum62() ;
} ;
class Unitary : public Matrices {
public:
Unitary ( ) ;
} ;
II Traceback Pointer
class tbPtr {
public:
tbPtr(int cl, int c2 , int c3): k(cl) I i(c2), j (c3) { }
int k;
int i;
int ji II absolute coordinates
} ;
II Pairwise Alignment with affine gap costs
class pairAlign (
public:
pairAlign(Matrices *
_pairAlign () { }
inti inti const char * const char * bool);
60
II Return an array containing alignment of maximal score
char **getAlign( ) { return A; )
II Return the score of the best alignment
int getScore() { return score; }
int max (int xl, int x2) { return (xl> x2 ? xl : x2); }
int max(int xl, int x2, int x3) { return max(x1, max(x2, X3))i }
int max(int xl, int x2, int x3, int x4)
{ return max(max(x1, x2), max (x3, x4); }
private:
void fastAlign(Matrices *,int,int, canst char * canst char *);
void freeTB();
} ;
#endif
unsigned int m, ni
int **Vi
tbPtr ***TB[3);
char *A[2] i
int score;
II lengths of input sequences s1 and s2
II dynamic programming table
II the 3 m*n traceback pointer table
II the final alignment
II contains V[m] [n]
61
IIFile 5: PairAlign.cpp
#include <string>
using namespace std;
#include <stdlib.h>
#include <assert.h>
#include IPairAlign.h"
II takes a given scoring matrix, the gap open penalty d, gap extension
II penalty e, and the input sequences sl and s2, calculate the dynamic
II programming table V and the traceback table TB.
pairAlign: :pairAlign
(Matrices *sub, int d, int e, const char *sl, canst char *s2, bool
traceback)
{
if (!traceback) {
fastAlign(sub, d, e, sl, s2);
return;
m = strlen{sl);
n strlen(s2);
int **E = allocTable(m, n);
int **F = allocTable (m, n);
V = allocTable(m, n);
unsigned int i, j;
for ( i = 0; i < 3; i ++) {
tbPtr **Tmp = (tbPtr **)calloc( (m+1)*(n+1), sizeof(tbPtr
*) ) ;
TB[i] = (tbPtr ***)calloc(m+1, sizeof(tbPtr **»;
for ( j = 0; j <= m; j++)
TB[i] [j] = Tmp + j*(n+1);
II use Gotoh' algorithm, record to the traceback table as well
E[O] [OJ = F[O] [0] = V[O] [0] = 0;
for (i=l; i<=m; i++) (
E[i] [0] = F[i] [OJ = V[iJ [0] = -d - e * (i-1);
TB[O] [i] [OJ new tbPtr(O, i-1, 0);
TB [1] [i] [0 J = new tbPt r (1, i - 1, 0);
( j =1; j <=n; j ++) {
E[O] [jJ = FEO] [jJ
TB [OJ [0] [j J new
TB [2J [OJ [j J = new
}
for
V[OJ [j] = -d - e * (j-1);
tbPt r (0 , 0, j - 1) ;
tbPtr(2, 0, j-l);
int val;
for (i=l; i<=ffi; i++)
for (j=l; j<=n; j++) (
int 6 = sub->getScore(sl[i-1J, 62[j-1J) i
val E[i] [jJ = max(V[iJ [j-1J -d, E[i] [j-l] -e);
62
if (val ! = E [i] [j -1] -e)
TB[2] [i] [j] new tbPtr(O, i, j-1)i
else
val = F[i] [j] = max(V[i-l] [j]-d, F[i-1] [j]-e);
if (val != F[i-1] [j]-e)
TB[I] [i] [j] new tbPtr(O, i-1, j);
t,
else
TB [ 2] [i] [j]
TB [ 1] [i] [j]
new tbPtr(2, i, j-1) i
new tbPtr(l, i-I, j) i
val = Veil [j] = max(V[i-l] [j-l]+s, E[i] [j], F{i] [j])i
if (val == E[i] [j])
TB[O] [i] [j] = new tbPtr(2, i, j);
else if ( val == F[i] [j])
TB(O] [i) [j] new tbPtr(l, i, j) i
else
score V em] [n] i
TB [0] [i] [j] new tbPtr(O, i-1, j-l) i
II array A will contain the alignment
A[O) allocString(m+n)i
A[l] = allocString(m+n)i
tbPtr *tb;
i = mi j = ni
unsigned int k = 0, x = 0;
whi 1 e ( i ! = 0 I I j ! = 0 ) {
tb = TB[k) [i] [j];
if (i != tb->i ) I j 1= tb->j) {
A[O] [x] (i tb->i)?
A[l] [x) = (j == tb->j) ?
X++i
k tb->ki
i = tb->ii
j tb->j;
reverseString(A[O]) ;
reverseString (A [1] ) ;
freeTable(E, m, n);
freeTable(F, m, n)i
freeTable(V, m, n);
freeTB();
II free the memory allocated for matrix TB
void pairAlign::freeTB()
I ,
, - I
81 [i-I] ;
s2[j-l];
for(int i = 0;
for (int j
i < 3; i++)
0; j <= mj j ++)
63
for(int k = 0; k <= n; k++)
delete TB [i] [j] [k) ;
for(int 1 = 0; 1 < 3; 1++)
free (TB [1] [0] ) ;
II align without traceback, compute the similarity score only
void pairAlign::fastAlign
(Matrices *sub, int d, int e, const char *s1, const char *s2)
{
m = strlen(s1);
n = strlen(s2);
int **E = allocTable(m, n);
int **F = allocTable(m, n);
V = allocTable(m, n);
II use Gotoh' algorithm, record to the traceback table as well
unsigned int i, j;
E[O] [0) = F[O] [0) = V[O] [0) = 0;
for (i=1; i<=m; i++) {
E[i] [0] = F[i] [0] = V[i] [0]
}
for (j=l; j<=n; j++) {
E[O] [j] = F[O] [j] = V[O] [j]
-d - e * (i -1) ;
-d - e * (j-l);
for (i=l; i<=m; i++)
for (j=l; j<=n; j++) {
int s = sub->getScore(sl[i-l], s2[j-l]);
E [i] [j) max (V [ i] [j - 1] - d , E [i] [j - 1) - e) ;
F [ i] [j] max (V [ i - 1] [j] - d , F [ i - 1] [j] - e) ;
V[i] [j] max(V[i-l] [j-l]+s, E[i] [j], F[i] [j));
}
score = V[m) [n] ;
freeTable(E, m, n);
freeTable(F, m, n);
freeTable(V, m, n);
64
II File 6: GuideTree.h
#ifndef
#define
GUIDETREE H
GUIDETREE H
#include "Common.h"
struct treeNode {
int Number;
vector<seqNode *> *alignmenti
string *consensusi
I I Node Number
II the alignment it contains
treeNode(int n, vector<seqNode *> *a, string &con)
: Number(n), alignment (a)
{ consensus = new string (con) i }
treeNode(int n, seqNode * seq) : Number(n)
{ alignment = new vector<seqNode *>i
alignment->push_back(seq);
consensus = new string(*(seq->seq)) i
}
void prntNode(ostream & fout = cout);
};
class guideTree {
public:
void initTree(vector<seqNode *> *) i
void addNode(treeNode *);
treeNode *deleteNode(int i);
treeNode *finalAlign();
private:
vector<treeNode *> *treeListi
};
#endif
65
treeList->push_back(node) ;
II File 7: GuideTree.cpp
#include <iostream>
#include <vector>
#include <string>
using namespace std;
#include <assert.h>
#include "GuideTree.h"
#include "Common.h"
II initialize tree T to be the set of all sequences
void guideTree::initTree(vector<seqNode *> *inputSeq)
int num = inputSeq->size();
treeList new vector<treeNode *>;
for (int i = 0; i < num; i++) {
treeNode *node = new treeNode(i, (*inputSeq) [i]);
treeList->push_back(node);
II add a new node into the tree
void guideTree: :addNode(treeNode * node)
{
}
II delete node #i and return the node, return null if Ii not found
treeNode *guideTree: :deleteNode(int i)
(
treeNode * iNode;
vector<treeNode *>::iterator iter= treeList->begin();
for (; iter != treeList->end(); iter++)
if ( (*iter) - >Number == i) (
iNode = *iter;
treeList->erase(iter) ;
return iNode;
}
return NULL;
II return the final resulting alignment
treeNode *guideTree: :finaIAlign()
{
assert (treeList->size() == 1);
return treeList->front();
66
// print a node: number, sequence/alignment, consensus string
void treeNode: :prntNode(ostream &fout)
fout « "Node Number: " « Number « endl
« "alignment; " « endl j
int n = alignment->size()j
for (int i = 0; i < ni i++)
fout « * ( (*alignment) [i] - >seq) « endl;
fout « "consensus string: "«endl«*consensus«endl«endl;
67
II File 8: ScoreMat.h
#ifndef
#define
SCOREMAT H
SCOREMAT H
class scoreMat {
public:
scoreMat( Matrices * s, int d, int e)
sub(s), alpha(d), beta(e), maxScore(INT MIN) { }
bool initMat(vector<seqNode *> *);
void updateMat(int i, int j, int z, string seq);
void printTbl(ostream &fout = cout);
int getMaxSeql() { return maxSeql; }
int getMaxSeq2() { return maxSeq2; }
private:
int **Matrix;
string **lookupTbl;
int N;
int maxScore;
int maxSeql;
int maxSeq2;
Matrices *sub;
int alpha;
int beta;
};
#endif
II similarity score matrix: n*n table
II look up table mapping seq# and seqs
II size of Matrix/vector/lookupTbl
II maximum similarity score in Matrix
II sequence#l which has the maxScore
II sequence#2 which has the maxScore
II mat
II d
II e
68
II File 9: ScoreMat.cpp
#include <iostream>
#include <iomanip>
#include <string>
#include <vector>
using namespace std;
#include <stdlib.h>
#include <limits.h>
#include "PairAlign.h"
#include "ScoreMat.h"
#include "Common.h"
II Initial Matrix to be an N*N table of similarity scores
II between all pairs of N sequences
II lookupTbl is a reference table for Matrix
bool scoreMat: :initMat{vector<seqNode *> *Seqs)
{
N = Seqs->size();
if (!N)
return false;
Matrix = allocTable(N-l, N-l);
lookupTbl = (string **)malloc(sizeof(string *)*N);
int i;
for (i = 0; i < N; i++)
lookupTbl [i] = new string(* «*Seqs) [i] ->seq));
for (i = 0; i < N; i++) {
for ( int j = i+l; j < N; j++) {
pairAlign *pair = new pairAlign
(sub, alpha, beta, (*lookupTbl[i]) .c_str(),
(*lookupTbl [j]) .c_str(), false);
int score = pair->getScore();
Matrix[i] [j] = Matrix[j] [i] = score;
if ( score >= maxScore ) {
maxScore = score;
maxSeql i;
maxSeq2 = j;
}
Matrix [i] [i]
return true;
II print the similarity matrix
void scoreMat: :printTbl(ostream &fout)
fout « endl;
for ( int i = 0; i < N; i++) {
for (int j 0; j < N; j++)
69
if (Matrix[i] [j] == INT_MIN) fout « setw(4) «
else fout « setw(4) « Matrix[i] [j];
}
fout « endl;
"_",,
II delete childl and child2 from Matrix and add parent to the matrix.
II seq is the consensus string of the parent
void scoreMat: :updateMat
(int childl, int child2, int parent, string seq)
{
lookupTbl[childl]
lookupTbl[parent]
int i;
lookupTbl[child2] = NULL;
new string(seq) ;
for (i = 0; i < N; i++)
Matrix [childl] [i) = Matrix [child2] [i)
Matrix[i] [childl] = Matrix[i] [child2]
for (i = 0; i < N; i++) {
if i! = parent && lookupTbl [i] ) {
pairAlign *pair = new pairAlign(sub, alpha, beta,
(*lookupTbl[i]) .c_str(), seq.c_str() I flase);
int score = pair->getScore();
Matrix [i) [parent] = Matrix [parent] [i) = score;
maxScore = INT_MIN;
for (i = 0; i < N; i + + )
for (int j = i + l; j < N; j ++) {
int score = Matrix [i) [j] ;
if ( score >= maxScore ) {
maxScore = score;
maxSeql i;
maxSeq2 = j;
70
1/ File 10: MultipleAlign.h
#ifndef
#define
MSA H
MSA H
#include "GuideTree.h"
class multiAlign
{
public:
multiAlign(istream & in, ostream & out, ostream & Log,
Matrices * m, int d, int e)
: mat(m), alpha (d) , beta(e) { buildMSA(in, out, Log);}
void buildMSA(istream & in, ostream & out, ostream & Log);
private:
void getlnput(istream & in, ostream & Log);
void stripSeq(string & s);
void prntResult(vector<seqNode *>*, ostream &, ostream &);
vector<seqNode *> *getAlign(treeNode *, treeNode *, ostream &);
void insertGaps(string &, const char*, vector<seqNode *>*);
string &consensus(vector<seqNode *> *align);
Matrices *mat;
int alpha;
int beta;
vector<seqNode *> *inputSeqs;
} ;
#endif
71
II File 11: MultipleAlign.cpp
#include <iostream>
#include <iomanip>
#include <string>
#include <vector>
using namespace std;
#include <limits.h>
#include <stdlib.h>
#include <assert.h>
#include "PairAlign.h"
#include "ScoreMat.h"
#include "MultipleAlign.h"
#include "GuideTree.h"
#include "Common.h"
II the whole process of calculating the multiple sequence alignment
void multiAlign: :buildMSA(istream & in, ostream & out, ostream & Log)
get Input (in, Log};
II step 1. build scoring matrix (n*n table)
scoreMat *nnTbl ~ new scoreMat(mat, alpha, beta};
assert (nnTbl) ;
nnTbl->initMat(inputSeqs} ;
Iistep 2. construct the tree and do alignment.
II Initialize the tree
guideTree *tree ~ new guideTree(};
tree->initTree(inputSeqs) ;
II n-l iteration
int n ~ inputSeqs->size();
for ( int i ~ 0; i < n - 1; i++)
112.1 get the pair has maximum score
nnTbl->printTbl(Log) ;
int i = nnTbl->getMaxSeql(};
int j ~ nnTbl->getMaxSeq2();
Log « endl « "max score: ("
«i« ", "« j «")"« endl« endl;
112.2 remove node i and j from th~ tree
treeNode *iNode ~ tree->deleteNode(i);
treeNode *jNode = tree->deleteNode(j);
iNode->prntNode(Log) ;
jNode->prntNode(Log) ;
112.3 construct a new node
vector<seqNode *> *align = getAlign(iNode, jNode, Log);
string conStr ~ consensus (align) ;
treeNode *A = new treeNode(i, align, conStr);
Log « endl « "new node: It;
A->prntNode(Log) ;
72
112.4 update scoring table
nnTbl->updateMat(i, j, i, conStr);
112.5 add the new node to the tree
tree->addNode(A) ;
}
II step 3. print the result
prntResult(tree->finalAlign()->alignment, out, Log);
}
Ilread input with PASTA format.
void multiAlign: :getInput(istream & in, ostream & Log)
const int MaxLength = 255;
char Line[MaxLength+l] = " ";
inputSeqs = new vector<seqNode *>;
while (in.getline(Line, MaxLength) && Line[O] != '>');
if (Line [0] ! = '>') II empty or invalid format
return;
bool WasSeqLine = false;
int seqNumber = OJ
string seq = "";
Log « "Input sequences: "« endl;
while ( in.getline(Line, MaxLength) ) {
if (Line[O] == '>') {
if (WasSeqLine) {
stripSeq (seq) ;
if (!seq.empty(») {
seqNode *Node
new seqNode(seqNumber, seq);
inputSeqs->push_back(Node) ;
Log « seqNumber «": II « seq « endl;
seqNumber++;
}
}
WasSeqLine
seq = "";
false;
else
seq += Linej
WasSeqLine truej
II not a new sequence
}
}llwhile getLine
if (Line[O]!='>')
seq += Line;
stripSeq (seq) ;
if (! seq. empty ( )) {
seqNode *Node = new seqNode(seqNumber, seq);
inputSeqs->push_back(Node);
73
Log « seqNumber «": " « seq « endl;
seqNumber++;
II print the final alignment to the output file and the log file
void multiAlign::prntResult(vector<seqNode *> *result, ostream & out,
ostream & cout)
{
vector<seqNode *> *orderResult new vector<seqNode *>;
int i, n=result->size();
for ( i = 0; i < n; i ++ ) {
int number = (*result) [i]->Num;
vector<seqNode *>: :iterator iter = orderResult->begin();
fort; iter 1= orderResult->end(); iter++)
if( number < (*iter)->Num) break;
orderResult->insert(iter, (*result) [i);
int length = strlen((*orderResult) [O]->seq->c str(»;
int start = 0, len;
const int oneline = 60;
while (length> 0) {
len = length > oneline ? oneline length;
out « endl « " 11 ;
cout « endl « " " ;
for ( i = 1; i <= (len+l)/10; i++) {
cout « setw(lO) « start + i*lO;
out « setw(lO) « start + i*lO;
out « endl;
cout « endl;
for ( i = 0; i < n; i++) {
string current;
current.assign(* ((*orderResult) [i) ->seq), start, len);
out « setw(4) « i «" "« current « endl;
cout « setw(4) « i <e" "« current « endl;
start += lenj
length -= oneline;
II change string s to all capital, discard invalid letters
void multiAlign: : stripSeq (string & s)
{
bool valid [26] ;
int i;
for (i=O; i<26; i++)
valid[i] = false;
for (i=O; i<baseSize; i++)
valid[residuesOrder[i]-'A']
74
true;
int m
int j
for (
s.size();
0;
i = 0; i < m; i++)
char c = s[i);
if (islower (c) )
c = toupper(c);
if (isupper(c) && valid[c-'A'])
s [j++) = c;
}
fort; j < m; j++)
s[j] = '\0';
II compute the alignment of two nodes A and B using standard
1/ pairwise alignment
vector<seqNode *> *multiAlign: :getAlign(treeNode *A, treeNode *B,
ostream & Log)
{
string *AC
string *Bc
new string(*(A->consensus));
new string(*(B->consensus));
pairAlign *pair = new pairAlign
(mat, alpha, beta, Ac->c_str(), Bc->c_str());
char **result = pair->getAlign();
Log « "pairalign result: II «endl«
result [0] « endl « result[l) « endl « endl;
insertGaps(*Ac, result [0] , A->alignment);
insertGaps(*Bc, result[l) , B->alignment);
A->alignment->insert(A->alignment->begin(),
B->alignment->begin(), B->alignment->end());
return A->alignment;
II insert new gaps in after which are not in string before into the
1/ same position of all sequences in vector Beqs
void multiAlign: : insertGaps
(string &before, const char *after, vector<seqNode *> *seqs)
int n = seqs->size();
for (int i = 0; i < strlen (after); i++) {
assert(i <= strlen(before.c str()}};
if (before [i] ! = after [i]) T
assert(after[i] == '-' );
before.insert(i, "_II);
for ( int j = 0; j < n; j++)
(*seqs) [j]->seq->insert(i, 11 .. 11);
75
~I
II calculate the consensus string of an alignment align
string & multiAlign::consensus(vector<seqNode *> *align)
{
int nUmSeqs = align->size();
int seqLength = strlen( (*align) (O)->seq->c_str();
char *conSeq = new char [seqLength+l) ;
int charSum(26);
int column;
for column 0; column < seqLength; column++) {
int k;
for ( k = 0; k < 26; k++)
charSum(k) = INT MINi
for (k = 0; k < numSeqs; k++) {
char ch = (*align) (k)->seq->at(column);
if (ch == '-') ch = 'X';
charSum(ch-'A') = OJ
for (int rowl = 0; rowl < numSeqs; rowl++)
for (int row2 = rowl+l; row2 < numSeqs; row2++)
char a = (*align) (rowl)->seq->at(column);
if ( a == ,_, ) a = I X' ;
char b = (*align) (row2)->seq->at (column);
if (b== '-') b= 'X';
int score = mat->getScore(a, b);
charSum(a-'A') += score;
charSum(b-'A') += score;
}
int maxSum INT_MIN;
char conChar = '*';
for (int c = 0; c < 26; c++) {
if (charSum[c) > maxSum)
conChar = c + 'A';
maxSum = charSum(c) i
}
if (maxSum < 0)
conChar = ' X' ;
conSeq[column] = conChar;
}
conSeq(column] =
string *conStr =
return *conStr;
'\0 ';
new string (conSeq) ;
I I its parent
76
II File 12: Common.h
Ihfndef
Itdefine
COMMON H
COMMON H
Itinclude <string>
using namespace std;
const int baseSize = 21; II 20 residues + 'X'
const char residuesOrder[26] = "ARNDCQEGHILKMFPSTWYVX"i
struct seqNode {
int Num;
string *seqi
seqNode(int n, string s)
{ seq = new string(s) i }
} i
char *allocString(int size) i
char *reverseString(char *s);
int **allocTable(int, int) i
int freeTable(int**, int, intI i
#endif
Num(n)
77
// File 13: Common.cpp
#include <string>
using namespace std;
#include <stdlib.h>
#include <assert.h>
#include "Common.h"
// allocate and clear memory for a string of given size
char *allocString(int size}
{
char *sptr = (char *}malloc(sizeof(char)*(size+l});
for (int i = 0; i <= size; i++)
sptr[i] = '\0';
return sptr;
}
// return the reverse string of s
char *reverseString(char *s}
int size strlen(s}i
char Ci
for (int i = 0, j = size - 1; i <= size/2 - 1; i++, j--) {
C = s [i] ;
sri] s[j];
s[j] = Ci
return s;
// allocate a (m+l}*(n+l) tablei row 0 and column 0 will not be used
int **allocTable(int m, int n)
int *Tmp = (int *)malloc(sizeof(int)*(m+l)*(n+l));
assert (Tmp) ;
int **Table = (int **}malloc(sizeof(int*)*(m+l}}i
assert (Table) i
for ( int i = Oi i <= mi i++)
Table [i] Tmp + i * (n+l) ;
return Table;
}
// free the memory allocated for Table
int freeTable(int **Table, int m, int n)
{
free(Table[O]) i
return EXIT_SUCCESSi
78
tV
VITA
Ke Liu
Candidate for the Degree of
Master of Science
Thesis: MULTIPLE SEQUENCE ALIGNMENT WITH EVOLUTIONARY TREES
Major Field: Computer Science
Biographical:
Personal Data: Born in Hunan, China, January 28,1973, daughter of Changsheng
Liu and Xiaokun Yin.
Education: Received Bachelor of Science in Finance from Central University of
Economics, Beijing, China in July 1994. Completed the requirements for
the Master of Science degree in Computer Science at the Computer
Science department at Oklahoma State University in December 2001.