



COMPARATIVE STUDY ON BLOCKING METHODS IN RECORD LINKAGE

By

JIYOUNG SHIN

Master of Science in Computer Science
Oklahoma State University
Stillwater, Oklahoma
2009

Submitted to the Faculty of the Graduate College of the
Oklahoma State University in partial fulfillment of
the requirements for the Degree of
MASTER OF SCIENCE
December, 2009

Thesis Approved:

Dr. K. M. George, Thesis Adviser
Dr. John P. Chandler
Dr. Nohpill Park
Dr. A. Gordon Emslie, Dean of the Graduate College

ACKNOWLEDGMENTS

First of all, I would like to express my deepest appreciation to my advisor, Dr. K. M. George, for advising me through this journey. Without his guidance and persistent help, I could not have successfully finished this thesis. I also want to thank my committee members, Dr. John P. Chandler and Dr. Nohpill Park. I gladly express my gratitude to Dr. Nohpill Park for providing all the valuable support to carry out the work, and I owe special thanks to Dr. Chandler for his valuable advice during the work. I extend my sincere thanks to Dr. NohJin Park, Sunny Choi, Linju Jose, QiuRui Zhu, and the other colleagues who supported me throughout. I also want to express my appreciation to my two roommates, Sujin You and Hyojin Kim. Finally, I thank my family most of all for supporting me in every way. I thank God.

TABLE OF CONTENTS

I. INTRODUCTION
II.
REVIEW OF LITERATURE
  Section 2.1 Record Linkage
  Section 2.2 EM Algorithm
  Section 2.3 Blocking Methods
    Sub Section 2.3.1 Standard Blocking Method
    Sub Section 2.3.2 Sorted Neighborhood Blocking Method
    Sub Section 2.3.3 Bigram Indexing Blocking Method
    Sub Section 2.3.4 Canopy Clustering Blocking Method
III. METHODOLOGIES
  Section 3.1 Data Type
  Section 3.2 Evaluation Method
  Section 3.3 Algorithm
  Section 3.4 Additional Information
IV. SIMULATIONS
  Section 4.1 Environment
  Section 4.2 Simulation
  Section 4.3 Results
V.
CONCLUSIONS AND FUTURE WORKS
REFERENCES

LIST OF TABLES

Table 1 Standard Blocking Method Measurement by Distance
Table 2 Sorted Neighborhood Blocking Method Measurement by Window Size
Table 3 Bigram Indexing Blocking Method Measurement by Threshold Values
Table 4 Canopy Clustering Blocking Method Measurement by Threshold Values

LIST OF FIGURES

Figure 1 Sorted Neighborhood Blocking Method
Figure 2 Bigram Indexing Blocking Method
Figure 3 Canopy Clustering Blocking Method
Figure 4 Evaluation of Effectiveness
Figure 5 Implementation Flow Chart
Figure 6 Reduction Ratio
Figure 7 Pair Completeness
Figure 8 F Score

CHAPTER I

INTRODUCTION

We live in a world where our information is stored in various places. At hospitals, restaurants, fitness clubs, or government offices, people request that our information be stored in their databases. However, even within the same organization it is very hard to match the information, that is, to decide whether the records in a database represent the same entity or not.
Moreover, when information is shared, the organizations have to link the databases by indicating which record from one database relates to which record in the other database. Addressing this need, record linkage was first mentioned by Dunn in 1946, who said that our "Books of Life" need to be linked into one volume for registrar or medical purposes [11]. After that, Newcombe introduced a probabilistic record linkage formula, and later Fellegi and Sunter formalized the theory [1]. Record linkage methodology and software have been developed by the U.S. Bureau of the Census, mainly to match the existing census data with the post-enumeration survey [4]. As the records to be compared come from different files, they might have different data structures or fields, so we might need to work on the data itself before comparing files. There is also the possibility that some data fields are missing or misspelled, so data standardization needs to be taken care of. In our study, we do not cover the standardization step. We assume that the data has been standardized, but that missing or misspelled values were not corrected during standardization; the data we use for the simulation will contain missing or misspelled records.

There are several approaches to record linkage. The most straightforward is a rules-based approach, in which reasonable rules are developed and then refined as common exceptions are found [10]. A very popular approach has been probabilistic record linkage [10], and it is the approach we use in our study. A probabilistic approach compares the data with statistics rather than through a straightforward one-to-one comparison, which reduces the number of comparisons for large databases. In this approach, each file consists of a number of fields (components) and a number of records (observations). The components or attributes common to both files are useful for matching.
The components do not contain equal amounts of information, and their error rates vary, so weights are defined for each field from the error probabilities using a log-likelihood ratio. Files that are too large for all possible record pair comparisons can be partitioned into mutually exclusive and exhaustive blocks, designed to increase the proportion of matched pairs observed while decreasing the number of record pair comparisons to the block level. The blocking procedure is implemented by sorting the two files on one or more variables [1]. There are several blocking methods, which are ways to put records that have a high possibility of being the same entity into the same block and to do the comparison within the block. In our paper, we focus on four blocking methods and compare their accuracy and effectiveness: the basic sorted method (standard blocking), the basic sorted-neighborhood method, the bigram-based blocking method, and the canopy clustering method.

The remaining document is organized as follows. Chapter 2 summarizes the main principles of record linkage, the EM algorithm, and the four blocking methods. Chapter 3 gives an overview of the system implemented in this project, including a description of the data sets and evaluation methods. Chapter 4 explains findings based on the evaluation method results of our experiments. Finally, chapter 5 concludes with our findings and discusses future work.

CHAPTER II

REVIEW OF LITERATURE

The objective of the record linkage or matching process is to find and link the records corresponding to the same individual. In this section, we will go over the basic ideas of record linkage using formulas from Fellegi and Sunter's paper [1].

2.1 Record Linkage

Define two disjoint sets M and U, for Matched and Unmatched, formed from the cross product A x B of two files A and B which are filled with records.
A record pair is a member of M if that pair represents a true match; otherwise it is a member of U. Thus, the record linkage process tries to classify each record pair of A × B into M or U.

A × B = {(a, b); a ∈ A, b ∈ B}    (2.1)
M = {(a, b); a = b, a ∈ A, b ∈ B}    (2.2)
U = {(a, b); a ≠ b, a ∈ A, b ∈ B}    (2.3)

Suppose the records corresponding to members of A and B are denoted α(a) and β(b). We express each pair as a comparison vector γ[α(a), β(b)] (for convenience we denote it as γ). Because each pair has several fields to be compared, the comparison vector can also be split into the specific comparisons of each ith field, γ_i[α(a), β(b)], i = 1, ..., K:

γ[α(a), β(b)] = {γ_1[α(a), β(b)], ..., γ_K[α(a), β(b)]}    (2.4)

The comparison vector can be represented as ones and zeroes. If the contents of the ith field (for example, both records' first names) are exactly the same or almost the same, we assign 1; otherwise we assign 0. This is applied to all of the individual fields that are common to both records. For example, the comparison vector "11011110" means the pair has eight fields, six of which match and two of which do not. All possible realizations of the comparison vector γ are represented as Γ, which is defined as the comparison space. By observing a comparison vector γ for each pair and mapping it into a random decision function, we decide whether the pair is a matched pair, an unmatched pair, or a possible link. A linkage rule L is a rule mapping the comparison space Γ onto a set of random decision functions D = {d(γ)}. Each decision function gives, for a γ, the probabilities of taking each decision: matched (A1), unmatched (A3), or possible link (A2). The possible link A2 is used when it is hard to conclude whether the record pair is definitely matched or unmatched, and further observation is needed.
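The 0/1 comparison vector can be made concrete with a short sketch. The thesis implementation itself is in Java; the following Python fragment is only illustrative, uses made-up four-field records, and substitutes exact string equality for the "exactly the same or almost the same" test:

```python
def comparison_vector(rec_a, rec_b):
    """Return the agreement pattern gamma for one record pair:
    1 where the fields agree, 0 where they do not."""
    return [1 if x == y else 0 for x, y in zip(rec_a, rec_b)]

# hypothetical field layout: first name, last name, city, zip code
a = ["jiyoung", "shin", "stillwater", "74074"]
b = ["jiyoung", "shim", "stillwater", "74075"]
gamma = comparison_vector(a, b)  # [1, 0, 1, 0]
```

A real matcher would replace the equality test with an approximate string comparator so that "almost the same" fields still count as agreements.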
d(γ) = {P(A1 | γ), P(A2 | γ), P(A3 | γ)};  γ ∈ Γ    (2.5)

where Σ_{i=1..3} P(Ai | γ) = 1, which means the sum of the three decision probabilities is one.

We calculate weights for each field and use them as a measure of how much each field contributes to the total probabilities for each pair. The contribution of each field can be prioritized by using weights to measure the probability of making an accurate match. Fellegi and Sunter extended the measurement of weights into mathematical formulas for the record linkage process using a log-likelihood ratio [1]. When we decide whether a pair is matched or unmatched, we simply need to compare the probability that the pair is matched with the probability that it is unmatched. If P(A1 | γ), i.e. P(M | γ), is bigger than P(A3 | γ), i.e. P(U | γ), then the pair is considered matched, and unmatched if it is the other way around. For this, Fellegi and Sunter use a likelihood ratio of the two conditional probabilities of the pair given that it is linked or not linked. The likelihood ratio for a certain comparison vector γ is defined as

R = m(γ) / u(γ)    (2.6)

If R is bigger than 1, we consider the comparison vector to belong to the match class; if R is less than 1, we consider it to belong to the unmatched class. To make the calculation easier and clearer, we take the logarithm of R and call it the weight:

w = log( m(γ) / u(γ) )
  = log( m(γ_1) / u(γ_1) ) + log( m(γ_2) / u(γ_2) ) + ... + log( m(γ_K) / u(γ_K) )

so that the weight of the ith field is

w_i = log( m(γ_i) / u(γ_i) )

When we suppose all the fields are independent of each other, the decision rule is based on the sum of the weights Σ_{i=1..K} w_i, where K is the number of fields in the record comparison vector γ. Here m(γ) and u(γ) are the conditional probabilities of a γ given that the pair is a true match or a true non-match.
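Under the independence assumption, R and the composite weight w = log R can be computed directly from per-field probabilities. A minimal Python sketch; the m and u values below are invented purely for illustration:

```python
import math

def likelihood_ratio(gamma, m, u):
    """R = m(gamma)/u(gamma) under field independence (eq. 2.6):
    an agreeing field contributes m_i/u_i, a disagreeing one
    (1 - m_i)/(1 - u_i)."""
    r = 1.0
    for g, mi, ui in zip(gamma, m, u):
        r *= (mi / ui) if g == 1 else ((1 - mi) / (1 - ui))
    return r

def composite_weight(gamma, m, u):
    """w = log R, equivalently the sum of the per-field log ratios."""
    return math.log(likelihood_ratio(gamma, m, u))

m = [0.9, 0.8]  # hypothetical P(field agrees | pair in M)
u = [0.1, 0.2]  # hypothetical P(field agrees | pair in U)
w = composite_weight([1, 1], m, u)  # log(9) + log(4) > 0: match class
```

A positive weight pushes the pair toward the match class, a negative weight toward the non-match class, mirroring the R > 1 / R < 1 rule.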
m(γ) = P{γ[α(a), β(b)] | (a, b) ∈ M}    (2.7)
u(γ) = P{γ[α(a), β(b)] | (a, b) ∈ U}    (2.8)

In the same way as for equation (2.4), m(γ) and u(γ) can be split into fields, and they can be calculated by multiplying all of the fields' probabilities:

m(γ) = m(γ_1) · m(γ_2) ··· m(γ_K)    (2.9)
u(γ) = u(γ_1) · u(γ_2) ··· u(γ_K)    (2.10)

For the ith field of the jth pair, the matched and unmatched probabilities can be represented as

m_i = P(γ_i^j = 1 | γ^j ∈ M)    (2.11)
u_i = P(γ_i^j = 1 | γ^j ∈ U)    (2.12)

These can also be written as m_i = Pr{component i agrees | r ∈ M} and u_i = Pr{component i agrees | r ∈ U} for the set of all record pairs r. In our study, we use the EM algorithm in the linkage rule to estimate the variables m_i and u_i. After the values are calculated, they can be used to compute m(γ) and u(γ) as below:

P(γ^j | M) = Π_{i=1..n} m_i^{γ_i^j} (1 − m_i)^{1 − γ_i^j}    (2.13)
P(γ^j | U) = Π_{i=1..n} u_i^{γ_i^j} (1 − u_i)^{1 − γ_i^j}    (2.14)

These are the estimated true positive and true negative probabilities of the comparison vector of each pair. For a pair, if the comparison value for a field is 1, we multiply by m_i (or u_i); if it is 0, we multiply by one minus that value as the probability of the field.
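Formulas (2.13) and (2.14) are the same product, evaluated with either the m_i or the u_i. A small Python sketch with invented probabilities:

```python
def pattern_probability(gamma, probs):
    """P(gamma | class) = prod_i p_i^g_i * (1 - p_i)^(1 - g_i),
    i.e. equations (2.13)/(2.14) with probs = m or probs = u."""
    p = 1.0
    for g, pi in zip(gamma, probs):
        p *= pi if g == 1 else (1.0 - pi)
    return p

m = [0.9, 0.8, 0.95]  # hypothetical m_i values
u = [0.1, 0.2, 0.05]  # hypothetical u_i values
p_m = pattern_probability([1, 0, 1], m)  # 0.9 * (1 - 0.8) * 0.95
p_u = pattern_probability([1, 0, 1], u)  # 0.1 * (1 - 0.2) * 0.05
```

For this pattern p_m far exceeds p_u, so the likelihood ratio would favor the match class despite the one disagreeing field.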
Expanding the composite weight with equations (2.13) and (2.14),

w = log( m(γ^j) / u(γ^j) ) = log( P(γ^j | M) / P(γ^j | U) )

  = log( Π_{i=1..n} m_i^{γ_i^j} (1 − m_i)^{1 − γ_i^j} / Π_{i=1..n} u_i^{γ_i^j} (1 − u_i)^{1 − γ_i^j} )

  = log( m_1^{γ_1^j} (1 − m_1)^{1 − γ_1^j} / u_1^{γ_1^j} (1 − u_1)^{1 − γ_1^j} ) + ... + log( m_n^{γ_n^j} (1 − m_n)^{1 − γ_n^j} / u_n^{γ_n^j} (1 − u_n)^{1 − γ_n^j} )

The weight of the ith field in the jth pair is

w_i = log( m_i^{γ_i^j} (1 − m_i)^{1 − γ_i^j} / u_i^{γ_i^j} (1 − u_i)^{1 − γ_i^j} )

The weights for the components are:

if component i agrees, in other words when γ_i^j = 1, then w_i = log( m_i / u_i )

if component i disagrees, in other words when γ_i^j = 0, then w_i = log( (1 − m_i) / (1 − u_i) )

As we described in the decision rule section, there are three possible choices: a pair is matched, unmatched, or a possible link. Therefore, we need two threshold values to divide the record pairs into three categories. For the decision procedure, the weights of the individual fields are added up, which means all of the field weights w_i of each pair are summed. If the composite weight is over the upper threshold value, the pair is considered matched; if the value is less than the lower threshold value, the pair is considered unmatched; and if the value is between the two threshold values, the pair is considered undecided. According to Jaro, the threshold values can be calculated from the cumulative sums of Pr(γ | M) and Pr(γ | U) and the error values μ and λ [4]. The upper threshold value, which is the maximum weight for a non-match, is the point where the sum of Pr(γ | M) does not exceed λ, the probability that a matched pair is classified as unmatched. The lower threshold value is the minimum weight for a match, the point where one minus the sum of Pr(γ | U) does not exceed μ, the probability that an unmatched pair is classified as matched. Pr(γ | M) and Pr(γ | U) can be calculated using formulas (2.13) and (2.14).
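The two cases of w_i, and the composite weight as their sum, reduce to a few lines. The probabilities below are invented for illustration:

```python
import math

def field_weights(m, u):
    """Agreement weight log(m_i/u_i) and disagreement weight
    log((1 - m_i)/(1 - u_i)) for every field."""
    agree = [math.log(mi / ui) for mi, ui in zip(m, u)]
    disagree = [math.log((1 - mi) / (1 - ui)) for mi, ui in zip(m, u)]
    return agree, disagree

def sum_field_weights(gamma, agree, disagree):
    """Composite weight: pick the agreement or disagreement weight per
    field according to the 0/1 pattern and sum them."""
    return sum(a if g == 1 else d
               for g, a, d in zip(gamma, agree, disagree))

agree, disagree = field_weights([0.9, 0.8], [0.1, 0.2])
w = sum_field_weights([1, 0], agree, disagree)  # log(9) + log(0.25)
```

Since m_i > u_i for a useful field, agreement weights come out positive and disagreement weights negative, which is why the composite weight separates likely matches from likely non-matches.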
For the linkage rule, as we mentioned in the threshold section, we need to consider two types of errors: μ is the probability that an unmatched pair will be classified as matched, and λ is the probability that a matched pair will be classified as unmatched. We use these errors in calculating the thresholds for decision making. μ can be calculated as the sum, over all possible comparison vectors in Γ, of the product of the unmatched probability of γ and the conditional probability of deciding "matched" for that γ; the error λ can be calculated in a similar way. As Sauleau, Paumier and Buemi stated, Fellegi–Sunter theory suggests an optimal linkage rule for given error levels μ and λ that reduces A2; that is, the goal of an optimal linkage rule is to make a clear decision as to whether a record pair is linked or not, reducing the cost caused by manual decision making for possibly matched pairs [12].

μ = Σ_{γ ∈ Γ} u(γ) · P(A1 | γ)
λ = Σ_{γ ∈ Γ} m(γ) · P(A3 | γ)

According to Fellegi and Sunter, the best linkage decision rule L at the (μ, λ) error levels on the comparison space Γ, denoted L(μ, λ, Γ), is

d(γ) = (1, 0, 0)  if  T_μ ≤ m(γ) / u(γ)
d(γ) = (0, 1, 0)  if  T_λ < m(γ) / u(γ) < T_μ
d(γ) = (0, 0, 1)  if  m(γ) / u(γ) ≤ T_λ

where, with the comparison vectors of Γ indexed in decreasing order of the ratio m(γ)/u(γ),

T_μ = m(γ_n) / u(γ_n),  T_λ = m(γ_n′) / u(γ_n′)

and μ = Σ_{i=1..n} u(γ_i), λ = Σ_{i=n′..N_Γ} m(γ_i), with n < n′.

This formula embodies the concepts of record linkage explained in the previous sections. We can simply say that record linkage is a process of deciding whether two records are matched or not; in other words, of choosing one of the options from the decision function d(γ). To make the decision, we need to calculate composite weights for every comparison vector, and the comparison vectors are ordered according to their composite weights.
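The three-way decision with two thresholds can be stated compactly. In practice the thresholds would be derived from the chosen error levels μ and λ as described above; in this sketch they are supplied directly, and all values are invented:

```python
def classify(weight, lower, upper):
    """Fellegi-Sunter style three-way decision on a composite weight:
    A1 (link), A2 (possible link, needs clerical review), A3 (non-link)."""
    if weight >= upper:
        return "matched"        # decision A1
    if weight <= lower:
        return "unmatched"      # decision A3
    return "possible link"      # decision A2

decisions = [classify(w, -2.0, 3.0) for w in (5.1, 0.4, -4.2)]
```

Widening the gap between the two thresholds trades fewer classification errors for more undecided pairs that must be reviewed manually, which is exactly the cost the optimal rule minimizes.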
To calculate the composite weights, we use the EM algorithm to estimate the needed parameters. The threshold values are decided by using the error values at the points n and n′ where the error values are reasonable. After deciding the threshold values, the pairs whose comparison vectors have composite weights above the upper threshold are matched pairs, and the pairs with composite weights below the lower threshold are unmatched pairs. The pairs with composite weights between the thresholds are the ones on which the algorithm is unable to make a clear decision.

2.2 EM Algorithm

The EM (expectation maximization) algorithm, proposed by Dempster [3], is applied to estimate the weight parameters in Fellegi and Sunter's formulas. As Dempster mentioned, the EM algorithm is a maximum likelihood estimation algorithm for incomplete data [3]; this makes it appropriate for record linkage, where some record contents may be missing or misspelled. The EM algorithm repeats the procedure of computing an expected value and maximizing it, until the procedure converges [6]. In this section, we describe the estimation of m_i and u_i using the EM algorithm in the way Jaro proposed [4]. γ is defined as a comparison vector that is a vector function on records a and b. The function γ^j[a, b], j = 1, ..., N is the vector function for the jth pair; we also write it as γ^j. Each jth pair γ^j has several fields to be compared: γ_i^j, j = 1, ..., N, i = 1, ..., n, where n is the number of fields and N is the number of pairs.

γ[a, b] = {γ^1[a, b], ..., γ^N[a, b]},  a ∈ A, b ∈ B

In our case, as Jaro says, we use the function such that γ_i^j = 1 if the ith field agrees on the jth pair, and γ_i^j = 0 if the ith field does not agree on the jth pair, for i = 1, ..., n and j = 1, ..., N.
If we formulate this statement, it can be defined as below:

m_i = Pr{γ_i^j = 1 | r ∈ M}
u_i = Pr{γ_i^j = 1 | r ∈ U}

Therefore, γ^j, j = 1, ..., N is a vector of ones and zeroes. The EM algorithm is a repeating process of estimating parameters and maximizing them until we get accurate values for the parameter set Φ = (m, u, p). The parameter m represents the m_i and u represents the u_i. The parameter p is the proportion of matched pairs over the total number of matched and unmatched pairs:

p = |M| / |M ∪ U|

The complete data vector is (γ, g), where g_j = (g_m(γ^j), g_u(γ^j)). Hence g_j = (1, 0) iff γ^j ∈ M and g_j = (0, 1) iff γ^j ∈ U. As mentioned previously, the matched and unmatched probabilities of a pair can be calculated as below, and these formulas are used in the parameter estimation. They are defined under the assumption that the fields do not affect each other. The first formula is the probability of a true match for a pair, and the second is the probability of a true non-match:

P(γ^j | M) = Π_{i=1..n} m_i^{γ_i^j} (1 − m_i)^{1 − γ_i^j}
P(γ^j | U) = Π_{i=1..n} u_i^{γ_i^j} (1 − u_i)^{1 − γ_i^j}

Using γ, we estimate ĝ_m, ĝ_u in the E (expectation) step, then we calculate m̂_i, û_i, and p̂ in the M (maximization) step. These steps are described below. First, we set starting values for the estimates of the unknown parameter set Φ = (m̂, û, p̂). The algorithm is not sensitive to the starting values, except that the m̂ values should be greater than the corresponding û values. The m̂_i, û_i, and p̂ can be any value in the interval (0, 1). Next, we get the γs for all paired records. For example, the γs of pairs with five fields might look like γ = [{1, 1, 0, 1, 1}, {0, 1, 1, 0, 0}, ...]. We plug the γ values into ĝ_m, ĝ_u.
E step. Estimate the g_j value with (ĝ_m(γ^j), ĝ_u(γ^j)):

ĝ_m(γ^j) = p̂ Π_{i=1..n} m̂_i^{γ_i^j} (1 − m̂_i)^{1 − γ_i^j} / [ p̂ Π_{i=1..n} m̂_i^{γ_i^j} (1 − m̂_i)^{1 − γ_i^j} + (1 − p̂) Π_{i=1..n} û_i^{γ_i^j} (1 − û_i)^{1 − γ_i^j} ]

ĝ_u(γ^j) = (1 − p̂) Π_{i=1..n} û_i^{γ_i^j} (1 − û_i)^{1 − γ_i^j} / [ p̂ Π_{i=1..n} m̂_i^{γ_i^j} (1 − m̂_i)^{1 − γ_i^j} + (1 − p̂) Π_{i=1..n} û_i^{γ_i^j} (1 − û_i)^{1 − γ_i^j} ]

M step. Put the estimated ĝ_m, ĝ_u values into m̂_i, û_i, and p̂:

m̂_i = Σ_{j=1..N} ĝ_m(γ^j) γ_i^j / Σ_{j=1..N} ĝ_m(γ^j)

û_i = Σ_{j=1..N} ĝ_u(γ^j) γ_i^j / Σ_{j=1..N} ĝ_u(γ^j)

p̂ = Σ_{j=1..N} ĝ_m(γ^j) / N

We repeat the E and M steps until we get precise values for the parameters, that is, until the estimated parameters show no significant difference from the previous estimates.

2.3 Blocking Methods

Researchers have looked at large scale data comparisons and their implementation with record linkage on real world data such as census data or medical data [4, 12]. However, real world data can be huge; for example, if ten thousand records are in each of two files, then ten thousand times ten thousand comparisons are needed. Therefore, blocking techniques are used to reduce the cost of the quadratic comparisons [8]. Blocking methods were used by Kelley, who compared two blocking methods based on the likeness of components for big data sets [2]. By using probabilities, the related components are put into the same blocks and compared among themselves, while the unrelated components are considered unmatched pairs and never have a chance to be compared. In 1989, Jaro tested a record linkage system with blocking methods on the real data sets of the census of Tampa, Florida and an independent post-enumeration survey (PES) [4]. Further, various blocking methods have been investigated and compared as to their effectiveness [5, 6, 8].
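The E and M steps above can be combined into a compact sketch. This is only an illustrative Python version (the thesis implementation is in Java): it runs a fixed number of iterations instead of testing convergence, and the synthetic patterns below are invented so that roughly 20% of the pairs behave like true matches:

```python
def em_fellegi_sunter(patterns, iters=30, m0=0.9, u0=0.1, p0=0.5):
    """Estimate (m, u, p) from 0/1 agreement patterns with EM, in the
    style of Jaro's formulation. Starting values satisfy m0 > u0, as
    the text requires; patterns holds one comparison vector per pair."""
    n_fields = len(patterns[0])
    m, u, p = [m0] * n_fields, [u0] * n_fields, p0
    for _ in range(iters):
        # E step: g_m for each pair, the posterior P(pair in M | gamma).
        gm = []
        for g in patterns:
            pm, pu = p, 1.0 - p
            for gi, mi, ui in zip(g, m, u):
                pm *= mi if gi else (1.0 - mi)
                pu *= ui if gi else (1.0 - ui)
            gm.append(pm / (pm + pu))
        # M step: re-estimate m_i, u_i, p from the expected memberships.
        sm = sum(gm)
        su = len(patterns) - sm
        m = [sum(w * g[i] for w, g in zip(gm, patterns)) / sm
             for i in range(n_fields)]
        u = [sum((1 - w) * g[i] for w, g in zip(gm, patterns)) / su
             for i in range(n_fields)]
        p = sm / len(patterns)
    return m, u, p

# 20 fully-agreeing pairs (true matches) among 100 two-field pairs
patterns = [[1, 1]] * 20 + [[0, 0]] * 80
m, u, p = em_fellegi_sunter(patterns)
```

On this toy input the estimates settle near m_i ≈ 1, u_i ≈ 0, and p ≈ 0.2, the true match proportion.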
In our study, we focus on four blocking methods to be compared: the standard blocking method (sorting), sorted neighborhood blocking, bigram indexing, and canopy clustering. One procedure common to all of the methods is creating candidate keys. Finding the common fields of the compared records is the first thing to do for matching. Among them, the various fields contribute to matching at different levels. For example, a field such as sex has only two value states and consequently cannot impart enough information to identify a match uniquely. Conversely, a field such as surname imparts more information, but it may frequently be reported or keyed incorrectly [4]. Therefore, we create a candidate key from significant fields, either combining parts of the fields or using a field as it is, as the key for each component. In this section, the four methods are explained in detail.

2.3.1 Standard Blocking Method

In this section, the standard blocking method is presented based on Jaro's paper [4]. The basic sorted method sorts the two files on the same key variables (one or more variables), groups records with the same values of those variables into blocks, and compares records within the blocks. We could also compare records in neighboring blocks for better accuracy when no matched record is found. In this method, the most important step is choosing proper keys, which means we need to find variables with low error. If we set a key with a high error probability, then we can get many false matches. A weak point of this method is that we cannot control the number of records within a block; so if, for example, most records are clustered into one block, the method is not useful for reducing the time for record comparisons. Therefore, we need to choose proper sorting keys with low error probabilities that distribute records uniformly into blocks.
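A minimal sketch of the basic method, assuming a hypothetical key function (the zip code field of invented four-field records) and two toy files:

```python
from collections import defaultdict

def standard_blocking(records, key_fn):
    """Group records into blocks by blocking key; comparisons are then
    restricted to pairs that share a key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return blocks

def candidate_pairs(blocks_a, blocks_b):
    """Cross-file candidate pairs, one block at a time."""
    return [(a, b)
            for key, recs_a in blocks_a.items()
            for a in recs_a
            for b in blocks_b.get(key, [])]

zip_key = lambda rec: rec[3]  # hypothetical field layout
file_a = [("ann", "smith", "main st", "74074"),
          ("bob", "jones", "elm st", "74075")]
file_b = [("anne", "smith", "main st", "74074"),
          ("carl", "lee", "oak st", "73019")]
pairs = candidate_pairs(standard_blocking(file_a, zip_key),
                        standard_blocking(file_b, zip_key))
# only the two 74074 records are compared: 1 pair instead of 2 x 2 = 4
```

The sketch also makes the method's weakness visible: if one key value dominated, its block would still generate nearly quadratic comparisons.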
2.3.2 Sorted Neighborhood Blocking Method

The sorted neighborhood method is based on Hernandez and Stolfo's paper [7]. We can think of the sorted neighborhood method as one step beyond the basic sorted method, reducing some of its defects and adding the idea that, among the sorted records, neighboring records have a higher matching possibility than records at a great distance. The procedure first joins the two files and sorts them on candidate keys. The candidate keys can be made by concatenating parts of some chosen variables. For example, one way of creating a key could be the concatenation of the first three consonants of a last name, the first three letters of a first name, the first three digits of a Social Security number, and the first four digits of a zip code: shnjiy6407407. After the keys are created, the records are sorted by the candidate keys. Then we set a fixed-size window, which we consider as a block, and we do the comparison within the window. The window slides until no records are left, sliding the front record out and adding the record that follows the last record of the window. Put differently, for a fixed window of size w, each time a new record enters the window, the first record slides out, and the newly entered record is compared with the other w−1 records to find a matching record. This method is an improved version of the basic sorted method: it takes care of the low-error-probability key choice by concatenating parts of variables into a key, and it also normalizes the size of the blocks by fixing the window size.

Figure 1 Sorted Neighborhood Blocking Method

2.3.3 Bigram Indexing Blocking Method

The bigram indexing blocking method has been studied based on the record linkage software described in Febrl's manual [10].
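The sliding-window comparison of section 2.3.2 above can be sketched as follows. The records and key function are invented, and for brevity the sketch does not filter out pairs that come from the same file:

```python
def sorted_neighborhood_pairs(records, key_fn, w):
    """Sort the merged records by candidate key, then pair each record
    with the following records inside a window of size w."""
    srt = sorted(records, key=key_fn)
    return [(srt[i], srt[j])
            for i in range(len(srt))
            for j in range(i + 1, min(i + w, len(srt)))]

# toy candidate keys standing in for whole records
merged = ["smj740", "smi740", "jno731", "smi741"]
pairs = sorted_neighborhood_pairs(merged, lambda r: r, w=3)
```

Each record is compared with at most w−1 neighbors, so the number of candidate pairs grows linearly with the number of records instead of quadratically.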
The bigram indexing method is less sensitive to typographical errors or missing information than the previous two blocking methods. Its features are, first, that it allows each record to be in multiple blocks if needed, and second, that it uses an inverted index. A usual index makes it easy to find needed information: in document indexing, the index of a document lists parts of its contents. In an inverted index, by contrast, a piece of content maps to the set of documents that contain it. The bigram indexing blocking method starts by choosing blocking keys. Some significant variables, or parts of them, are chosen and combined into a string that forms a preformed block key. The string is then split into a bigram list; a bigram is a substring of two letters. For example, if we create the string as the combination of the first two letters of the last name ("sh"), the first two letters of the first name ("ji"), and the first two digits of the zip code ("74") for record index 7, then the preformed block key of this example record will be "shji74". The created string is divided into the bigram list {"sh", "hj", "ji", "i7", "74"}; we consider the digits as strings in our study. The bigram list is sorted alphabetically as {"74", "hj", "i7", "ji", "sh"}. We multiply the total number of bigrams, which is 5 in this case, by a threshold value that can be given manually as a decimal number between 0 and 1; let us set the threshold to 0.8 in this case. The result is 4, and the blocking keys are then all possible combinations of four bigrams, preserving the sequence of the alphabetically ordered bigram list. The blocking keys for this record will be {"74hji7ji", "74hji7sh", "74i7jish", "74hjjish", "hji7jish"}. This example record will be indexed into the blocks having these five created blocking keys.
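The key-generation steps just described reduce to a few lines. This sketch reproduces the "shji74" example with threshold 0.8:

```python
from itertools import combinations

def bigram_block_keys(key_string, threshold):
    """Blocking keys for one record: split the preformed key string into
    bigrams, sort them alphabetically, and join every combination of
    int(number_of_bigrams * threshold) bigrams in sorted order."""
    bigrams = sorted(key_string[i:i + 2] for i in range(len(key_string) - 1))
    k = int(len(bigrams) * threshold)
    return {"".join(combo) for combo in combinations(bigrams, k)}

keys = bigram_block_keys("shji74", 0.8)
# 5 bigrams, 5 * 0.8 = 4, so C(5, 4) = 5 blocking keys
```

Because `itertools.combinations` emits tuples in the order of the sorted input, the sequence of the alphabetically ordered bigrams is preserved in each key, as the method requires.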
This process is repeated for all records. The length of the preformed key string and the threshold value are critical for this blocking method. If the preformed string is too long and the threshold value is too small, there is less possibility of missing linked records in the same blocks, but there could be too many blocks and too many expensive comparisons later on. If the string is too short and the threshold value is too big, fewer blocking keys will be created, and the possibility of missing linked records in the same block increases.

Figure 2 Bigram Indexing Blocking Method

2.3.4 Canopy Clustering Blocking Method

The canopy clustering blocking method has been studied based on McCallum, Nigam and Ungar's paper [9]. Canopy clustering is a method that considers each record as a point. It blocks records by approximately calculating distances between records, and it uses an inverted index table for cheap distance calculation. The canopy clustering method also allows each record to fall into more than one canopy, or block. The inverted index table is created to be used in the calculation of a distance metric. We choose preformed keys for the bigram indexes that have low error probability, such as diagnosis, last name, or first name. Record ids are entered into the proper blocks as the contents of the table. After creating the inverted index table, we randomly pick a center point (a record) from the pool of all records; using the inverted index table, the distance between the selected point and the rest of the points can be calculated quickly. The distance metric can be calculated as the number of indexes each record has in common with the center record.
Before that, we set two threshold values for records: one for a high probability that a record is in the same cluster as the center (the tight threshold), and one for a lower probability (the loose threshold). The threshold values can be set by users or based on the results of previous experience or tests. For a randomly picked center point, we calculate the distance metric to all other points; if the distance metric of a point is bigger than the loose threshold value, the point is put into the same canopy as the center point. If the distance metric of a point is also bigger than the tight threshold value, the point is later removed from the pool of candidate centers, but points whose distance metric is bigger than the loose threshold yet less than the tight threshold still have a chance to be in other canopies and remain in the pool. We repeat calculating the distance metrics and putting points into canopies until no more points are available in the pool. For example, if nine points {A, B, C, D, E, F, G, H, I} are spread with the distances from each other as in the picture, and we pick centers randomly in the order A > F > H > I > E, then five canopies, which are blocks, will be created as below.

Figure 3 Canopy Clustering Blocking Method

CHAPTER III

METHODOLOGIES

In this section, we describe the datasets, evaluation methods, and algorithms used in this study. The theory of record linkage involves the use of algorithms to match a large number of records quickly. For the speed problem, blocking methods reduce the quadratic number of comparisons: likely matches are grouped into blocks, which are portions of the whole set of records. Especially for large data sets, various blocking methods have been investigated and compared for their effectiveness [5, 6, 8].
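The canopy construction of section 2.3.4 can be sketched as follows. Note that the "distance metric" of that section is really a similarity (a count of shared index entries), so larger means closer; the records, thresholds, and the pairwise bigram count standing in for the inverted-index computation are all illustrative only:

```python
import random

def shared_bigrams(a, b):
    """Cheap similarity: the number of bigrams two strings share. An
    inverted index would yield the same counts without building the
    sets pairwise."""
    return len({a[i:i + 2] for i in range(len(a) - 1)}
               & {b[i:i + 2] for i in range(len(b) - 1)})

def canopy_clusters(records, sim_fn, loose, tight, seed=0):
    """Canopy clustering sketch: pick a random center; records with
    similarity >= loose join its canopy, and those >= tight also leave
    the pool of future centers. Requires tight >= loose."""
    rng = random.Random(seed)
    pool = list(records)
    canopies = []
    while pool:
        center = pool.pop(rng.randrange(len(pool)))
        canopy = [center]
        nxt = []
        for r in pool:
            s = sim_fn(center, r)
            if s >= loose:
                canopy.append(r)
            if s < tight:      # still a candidate center for later rounds
                nxt.append(r)
        pool = nxt
        canopies.append(canopy)
    return canopies

clusters = canopy_clusters(["smith", "smyth", "jones"], shared_bigrams,
                           loose=2, tight=3)
```

Here "smith" and "smyth" share two bigrams, so whichever is picked first pulls the other into its canopy, while "jones" always ends up alone; records between the two thresholds can appear in several canopies.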
Matching records requires estimation of the parameters needed for any record linkage activity, which was the basic algorithmic approach of the matcher in early software. Record linkage software using probabilistic approaches has been developed, such as Link King, LinkageWiz, Febrl, and TAILOR. Among probabilistic approaches, Jaro stated that "The EM algorithm is highly stable and the least sensitive to the starting values of any of the methods studied" [4]. In this project, we chose the EM algorithm to estimate parameters in the record linkage procedure. Jaro introduced a record linkage procedure done among the records within their own blocks using the EM algorithm [4]. McCallum, Nigam and Ungar proposed the canopy clustering blocking method with the EM algorithm [9]. Many comparative studies of blocking methods have been done. However, there has not been an attempt to compare the effectiveness of the four blocking methods described in the previous section combined with the EM algorithm. In this project, we do a comparative study to evaluate the effectiveness of the combined schemes of blocking methods and the EM algorithm for large data sets (shown in Figure 4). We implement the record linkage algorithm with the four blocking methods in the Java programming language and test them using the data sets. We also evaluate the implementation with three already developed measure formulas from the record linkage toolbox TAILOR [7].

Figure 4 Evaluation of effectiveness

3.1 Data Type

In this project, four datasets are used for comparative analysis. The data type of the datasets is U.S. mailing addresses, downloadable from [16]. The U.S. mailing addresses were generated by the database generator DBGen, an open source database generator used to create large mailing address datasets [7].
The contents of each mailing address record are twelve fields: given name, surname, street name, street number, street address, suburb, postcode, state, birth date, age, phone number, and social security number. For the U.S. mailing address data, four data sets are used, with two files per data set serving as input for each run. The combined sizes of the datasets are 1000, 2500, 5000 and 10000 records. The two files of each set contain duplicates with some missing, misspelled, or misallocated fields, in random order. Only one duplicate of each entity is possible in the other data file. Because the four datasets have similar features and differ only in size, we can compare the blocking methods' effectiveness depending on dataset size. The 1000-record dataset, the smallest of the four used in our implementation, is used as test data to estimate the best parameter or threshold values needed in each blocking method.

Our goal is to compare two individual records from different files and conclude whether or not they indicate the same entity. For each mailing address dataset, the two files are used for the comparison. There are further assumptions about the data sets. We assume there is no absolutely reliable stand-alone identification for each record, because data can be missing or misspelled. To distinguish records, we give each record a number as an index, and we put 'a' or 'b' in front of the indexes to avoid comparisons among records within the same file. For simplicity, in our study we assume each record's fields are independent of each other, and we treat numbers as strings. The address files are in CSV format: each line stands for one record and fields are divided by commas. We assume every delimiter is used correctly. Each input file is read into an ArrayList in Java, representing each record as a String array with one array element per field. For missing fields, we leave the components empty.
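The CSV handling just described can be sketched as follows; this is an illustrative simplification (the class and method names are our own), assuming a plain comma-separated format with no quoted fields.

```java
import java.util.*;

// Sketch of turning one CSV line into a record array, leaving missing
// fields as empty strings, and prefixing the record index with 'a' or 'b'
// so that records from the same file are never compared with each other.
public class RecordLoader {
    public static String[] parseLine(String line, int fieldCount) {
        // split(",", -1) keeps trailing empty fields, so missing values survive
        String[] raw = line.split(",", -1);
        String[] record = new String[fieldCount];
        for (int i = 0; i < fieldCount; i++) {
            record[i] = i < raw.length ? raw[i].trim() : "";
        }
        return record;
    }

    public static String makeId(char filePrefix, int index) {
        return filePrefix + Integer.toString(index); // e.g. "a17" or "b17"
    }

    public static void main(String[] args) {
        String[] r = parseLine("john,smith,,main st", 4); // third field is missing
        System.out.println(Arrays.toString(r));
        System.out.println(makeId('a', 17));
    }
}
```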
3.2 Evaluation Methods

To accomplish the goal of this study, we need to measure the four blocking methods. We chose three formulae from the TAILOR toolbox [7]: reduction ratio, pair completeness, and F score.

First, the reduction ratio calculates how much the expensive comparisons are reduced, where an expensive comparison means comparing every possible field for a pair. In other words, it is one minus the ratio of the number of pairs compared after blocking to the total number of pairs, so a value closer to one is better. It shows how much the blocking methods contribute to reducing expensive comparisons among records. The second measure is pair completeness, which measures how correctly the decisions about whether record pairs match are made. It shows how well the blocking methods and the EM algorithm work together to reach the right decisions: pair completeness is the fraction of all true duplicates that are correctly classified. For the tradeoff between reduction ratio and pair completeness, we chose the F score, which measures the harmonic relationship between the two other measures. These measures have been used in various comparative studies of blocking methods [5, 6, 8]. The formulae used for computation of the measures are given below:

Reduction Ratio (RR) = 1 − Potential duplicates / All pairs

Pair Completeness (PC) = Correctly classified duplicates / Total true duplicates

F score = (2 × RR × PC) / (RR + PC)

In our study, we calculate the potential duplicates in the reduction ratio formula by adding up only the expensive comparisons (record pair comparisons for every field) remaining after the blocking process. For pair completeness, we created a file with the true matching pairs (true duplicates) for each dataset. We consider a pair matched if more than 80 percent of the total variables agree, and a field agrees if the two strings have a distance value over 0.8 in the Jaro-Winkler distance calculation.
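The three measures can be computed as follows; this is a minimal sketch with illustrative names and example counts, not the thesis code.

```java
// Sketch of the three evaluation measures used in this study:
// reduction ratio, pair completeness, and their harmonic F score.
public class Measures {
    public static double reductionRatio(long potentialDuplicates, long allPairs) {
        return 1.0 - (double) potentialDuplicates / allPairs;
    }

    public static double pairCompleteness(long correctlyClassified, long totalTrueDuplicates) {
        return (double) correctlyClassified / totalTrueDuplicates;
    }

    public static double fScore(double rr, double pc) {
        return 2.0 * rr * pc / (rr + pc); // harmonic mean of RR and PC
    }

    public static void main(String[] args) {
        // example counts: 1000 pairs survived blocking out of 500000 possible,
        // and 480 of 500 true duplicates were correctly classified
        double rr = reductionRatio(1000, 500000);
        double pc = pairCompleteness(480, 500);
        System.out.printf("RR=%.4f PC=%.4f F=%.4f%n", rr, pc, fScore(rr, pc));
    }
}
```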
The Jaro-Winkler distance method is discussed further in the additional information section. The number of correctly classified duplicates is obtained by comparing our results after the whole process with the true matched pair files, and the number of total true duplicates in the pair completeness formula is the number of pairs in each corresponding true matched pair file.

3.3 Algorithms

For the simulation, we programmed Compare_Blocking_Methods in the Java language. As the big picture of our study, it reads the record files in and calls the four blocking methods with a list of record arrays. For each of the four blocking methods, we created a class. The program is run on the four datasets mentioned in the previous section, and the results are compared by the measures mentioned before. Depending on the chosen block type, the corresponding blocking method is called. Each blocking method receives the whole set of records as a Java ArrayList together with the chosen key field or fields. For all four blocking methods, we create a candidate key, which is a concatenation of parts of some significant fields. In our study, the candidate key's length is six: the first two letters of each of three fields. For the U.S. mailing address data type, first name, last name, and postcode are chosen as the significant fields. The candidate key is added at the end of each record as a new field.

For Standard_Blocking and Sorted_Neighborhood_Blocking, we sort the whole set of records according to the candidate keys using the merge sort algorithm. In Standard_Blocking, the sorted records are partitioned into blocks by a significant variable. For Sorted_Neighborhood_Blocking, we partition the sorted records by putting a window-size number of records into each block sequentially; to form the next block, the window slides the first record of the current block out and slides the next record outside the block in.
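The candidate-key construction and sliding window just described can be sketched as follows; the class and field values are illustrative only.

```java
import java.util.*;

// Sketch of candidate-key construction (first two letters of three chosen
// fields, giving a six-character key) and of the sorted-neighborhood
// sliding window over the key-sorted records.
public class SortedNeighborhood {
    public static String candidateKey(String f1, String f2, String f3) {
        return prefix(f1) + prefix(f2) + prefix(f3);
    }

    private static String prefix(String s) {
        return s.length() >= 2 ? s.substring(0, 2) : s;
    }

    // Slide a window of size w over the sorted record ids; each window is one
    // block, and consecutive windows overlap by w - 1 records.
    public static List<List<String>> windows(List<String> sortedIds, int w) {
        List<List<String>> blocks = new ArrayList<>();
        for (int i = 0; i + w <= sortedIds.size(); i++) {
            blocks.add(new ArrayList<>(sortedIds.subList(i, i + w)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // first name, last name, postcode -> "josm74"
        System.out.println(candidateKey("john", "smith", "74075"));
        System.out.println(windows(Arrays.asList("a1", "a2", "b1", "b2"), 3));
    }
}
```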
For both Bigram_Indexing_Blocking and Canopy_Clustering_Blocking, we use the created candidate keys as preformed keys from which the real blocking keys are created. Using the preformed key of each record, we create the inverted index table as explained in the bigram indexing blocking method. For each preformed key, we find all possible bigrams, which are its two-letter substrings in order. We sort the bigrams alphabetically and count the total number of bigrams for that record. We decide how many bigrams are used for each key (three in our study) by multiplying the total number of bigrams (four: the length of a preformed key minus one) by the given threshold value. The inverted index table keys are then all possible in-order combinations of that many bigrams. We repeat this process for the whole set of records and add each record id to the contents of its created indexes. In Bigram_Indexing_Blocking, the records in the same column of the table are put into the same block.

For Canopy_Clustering_Blocking, the inverted index table is used to calculate distances among records. We put all the record ids into a Vector serving as the pool. After randomly choosing a center record, we calculate the distance of each record to the center by adding up the number of index keys the record has in common with the center record. We set two threshold values, a loose threshold and a tight threshold. For both thresholds, we use the percentage of the number of common indexes over the maximum possible number of common indexes between a center record and a compared record. Records whose percentage is bigger than the loose threshold are put into the same block as the center. Among those, records whose percentage is also bigger than the tight threshold are removed from the record id pool, so they cannot be picked as center points later. We used the Vector type in Java for the center pool. This process is repeated until no candidate record id is left in the vector.
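The blocking-key creation for bigram indexing can be sketched as follows. This is an illustration under stated assumptions: the rounding rule (ceiling) and the five-character example key are our own choices, made so that the key yields the four bigrams mentioned above.

```java
import java.util.*;

// Sketch of blocking-key creation for bigram indexing: take all bigrams of
// the preformed key, sort them alphabetically, and emit every in-order
// combination of k of them, where k = ceil(threshold * number of bigrams).
public class BigramKeys {
    public static List<String> blockingKeys(String preformedKey, double threshold) {
        List<String> bigrams = new ArrayList<>();
        for (int i = 0; i + 2 <= preformedKey.length(); i++) {
            bigrams.add(preformedKey.substring(i, i + 2));
        }
        Collections.sort(bigrams);
        int k = (int) Math.ceil(threshold * bigrams.size());
        List<String> keys = new ArrayList<>();
        combine(bigrams, 0, k, "", keys);
        return keys;
    }

    // Recursively build every combination of k bigrams in sorted order.
    private static void combine(List<String> bigrams, int start, int k,
                                String acc, List<String> out) {
        if (k == 0) { out.add(acc); return; }
        for (int i = start; i <= bigrams.size() - k; i++) {
            combine(bigrams, i + 1, k - 1, acc + bigrams.get(i), out);
        }
    }

    public static void main(String[] args) {
        // "josm7" has four bigrams (jo, os, sm, m7); threshold 0.6 gives k = 3,
        // so each record is indexed under C(4,3) = 4 blocking keys
        System.out.println(blockingKeys("josm7", 0.6));
    }
}
```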
For all the blocking methods, after dividing the records into blocks, we do the linkage procedure within each block, which means finding matching pairs. In the linkage procedure, we chose a probabilistic approach, which uses comparison vectors and makes decisions based on the probabilities of matched, unmatched, and undecidable for each comparison vector. To do so, we chose the EM algorithm, one way to estimate the parameters needed to calculate the probabilistic composite weights used for making decisions. Every record within a block is paired with every other. Each pair is transformed into a comparison vector, an expression of ones and zeroes: when a field of a pair is similar, the value is one; when it is not similar, the value is zero. We save these comparison vectors in arrays of type double. The comparison vectors are used to calculate the parameters m_i, u_i, and p with the EM algorithm; these are related to the weight of each field. We pass the comparison vector arrays into the EM_Algorithm class. We set the initial values of m_i to 0.8 and u_i to 0.2 for every field and p to 0.5; as Jaro stated in [4], the EM algorithm is the least sensitive to the initial values, and the only requirement is that m_i be bigger than u_i. Then we plug these initial values into the formulas to calculate g_m and g_u, and the calculated g_m and g_u are used to calculate new values of m_i, u_i, and p. We then compare the previous m_i, u_i, and p with the newly calculated ones. If the difference is bigger than a threshold value (0.0000000005 in our study), we repeat the same procedure until the difference becomes less than the threshold. After the parameter estimation, the values are used to calculate the composite weights for all comparison vectors. The calculated m_i and u_i are the probabilities that a field agrees (is one) when a pair is matched and unmatched, respectively.
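One EM iteration over the comparison vectors can be sketched as follows, under the conditional-independence assumption stated above. This is a simplified illustration in the spirit of Jaro [4], not the thesis's EM_Algorithm class; all names are our own.

```java
// Sketch of one EM iteration for estimating m_i, u_i, and p from the
// comparison vectors (1 = field agrees, 0 = field disagrees).
public class EMStep {
    // Expectation: posterior probability g_m that each pair is a match,
    // given current m, u, and match proportion p.
    public static double[] expectation(int[][] gamma, double[] m, double[] u, double p) {
        double[] g = new double[gamma.length];
        for (int j = 0; j < gamma.length; j++) {
            double pm = p, pu = 1 - p;
            for (int i = 0; i < m.length; i++) {
                pm *= gamma[j][i] == 1 ? m[i] : 1 - m[i];
                pu *= gamma[j][i] == 1 ? u[i] : 1 - u[i];
            }
            g[j] = pm / (pm + pu); // g_u is simply 1 - g[j]
        }
        return g;
    }

    // Maximization: re-estimate m_i and u_i from the posteriors; returns new p.
    public static double maximize(int[][] gamma, double[] g, double[] m, double[] u) {
        double sumG = 0;
        for (double v : g) sumG += v;
        for (int i = 0; i < m.length; i++) {
            double mNum = 0, uNum = 0;
            for (int j = 0; j < gamma.length; j++) {
                mNum += g[j] * gamma[j][i];
                uNum += (1 - g[j]) * gamma[j][i];
            }
            m[i] = mNum / sumG;
            u[i] = uNum / (gamma.length - sumG);
        }
        return sumG / gamma.length;
    }

    public static void main(String[] args) {
        int[][] gamma = {{1,1,1},{1,1,0},{0,0,0},{0,1,0}};
        double[] m = {0.8, 0.8, 0.8}, u = {0.2, 0.2, 0.2};
        double[] g = expectation(gamma, m, u, 0.5);
        double p = maximize(gamma, g, m, u);
        System.out.println("p after one iteration: " + p);
    }
}
```

In the full procedure this expectation/maximization pair is repeated until m_i, u_i, and p stop changing by more than the convergence threshold.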
To calculate the composite weight for a comparison vector, if the comparison vector is "110" over three fields, then we add log(m_0) minus log(u_0), add log(m_1) minus log(u_1), and add log(1 − m_2) minus log(1 − u_2). The whole set of comparison vectors is sorted in descending order of the calculated composite weights. There are three categories into which a pair can be classified: matched, unmatched, and undecided. In our study, we are concerned only with the matched and unmatched decisions, because we do not want to do any manual comparisons and we are comparing the efficiency of the blocking methods under the same conditions. We use the probability of a true match for a comparison vector, P(γ_j | M), and the probability of a true nonmatch, P(γ_j | U), for decision making. Our decision is made by going through the comparison vectors from the highest composite weight to the lowest: if the true matching probability is bigger than the true nonmatching probability for a comparison vector, we consider the pairs with that comparison vector matched; once a comparison vector fails this condition, we stop looking at the comparison vectors below it and consider the rest of the pairs unmatched. Then we print the matched pairs into a file and evaluate the results.
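The composite-weight calculation just described can be sketched as follows; the class is illustrative, and the ±30 cap mirrors the substitution for infinite weights discussed in the additional information section.

```java
// Sketch of the composite-weight calculation for one comparison vector:
// agreeing fields contribute log(m_i) - log(u_i), disagreeing fields
// contribute log(1 - m_i) - log(1 - u_i).
public class CompositeWeight {
    static final double CAP = 30.0; // stand-in for +/- infinity, as in the thesis

    public static double weight(int[] gamma, double[] m, double[] u) {
        double w = 0;
        for (int i = 0; i < gamma.length; i++) {
            double num = gamma[i] == 1 ? m[i] : 1 - m[i];
            double den = gamma[i] == 1 ? u[i] : 1 - u[i];
            double wi = Math.log(num) - Math.log(den);
            if (Double.isInfinite(wi)) wi = wi > 0 ? CAP : -CAP; // m or u hit 0 or 1
            w += wi;
        }
        return w;
    }

    public static void main(String[] args) {
        double[] m = {0.9, 0.9, 0.9}, u = {0.1, 0.1, 0.1};
        // "110": two agreements, one disagreement
        System.out.println(weight(new int[]{1, 1, 0}, m, u));
    }
}
```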
The important algorithms are listed below:

Compare_Blocking_Methods()
Input: one or two record files and blocking type
Output: a set of matched record pairs
  Read files
  If blocking type is standard blocking then
    Call Standard_Blocking()
  Else if blocking type is sorted neighborhood blocking then
    Call Sorted_Neighborhood_Blocking()
  Else if blocking type is bigram blocking then
    Call Bigram_Indexing_Blocking()
  Else if blocking type is canopy clustering then
    Call Canopy_Clustering_Blocking()
  End If
  Evaluate the results from the blocking methods

Algorithm 1 Compare_Blocking_Methods Algorithm

Standard_Blocking()
Input: records and key field
Output: matched pairs
  Create candidate keys
  Sort records by candidate keys
  Calculate similarity between keys
  Divide records into blocks depending on similarities of key contents
  Pair up all records within each block
  Transform record pairs into comparison vectors
  Calculate composite weights using the EM algorithm within each block
  Decide whether record pairs are matched or not
  Print matched pairs

Algorithm 2 Standard_Blocking Algorithm

Sorted_Neighborhood_Blocking()
Input: records
Output: matched pairs
  Create candidate keys
  Sort records by candidate keys
  Set window size as block size
  Divide records into window-size blocks
  Calculate composite weights using the EM algorithm within each block
  Decide whether record pairs are matched or not
  Print matched pairs

Algorithm 3 Sorted_Neighborhood_Blocking Algorithm

Bigram_Indexing_Blocking()
Input: records
Output: matched pairs
  Create preformed keys
  Create blocking keys
  Create inverted index table
  Put records in the same column of the inverted index table into a block
  Calculate composite weights using the EM algorithm within each block
  Decide whether record pairs are matched or not
  Print matched pairs

Algorithm 4 Bigram_Indexing_Blocking Algorithm

Canopy_Clustering_Blocking()
Input: records
Output: matched pairs
  Create preformed keys
  Create blocking keys
  Create inverted index table
  Calculate distances among records using the inverted index table
  Set threshold values
  Partition records into blocks according to the distances among them
  Calculate composite weights using the EM algorithm within each block
  Decide whether record pairs are matched or not
  Print matched pairs

Algorithm 5 Canopy_Clustering_Blocking Algorithm

EM_Algorithm()
Input: record pairs represented as ones and zeroes (comparison vectors); one for a matched field and zero for an unmatched field
Output: m_i, u_i, p
  Initialize m_i, u_i, p values
  BEGIN_EM:
    If old and new estimated values are similar
      Goto END_EM
    EXPECTATION:
      Compute g_m
      Compute g_u
    MAXIMIZATION:
      Compute m_i
      Compute u_i
      Compute p
    Compare old and new estimated m_i, u_i, and p values
    Goto BEGIN_EM
  END_EM

Algorithm 6 EM_Algorithm

3.4 Additional Information

In this paragraph, we discuss string comparison. In the blocking methods and the record linkage procedure, we need to compare strings. When we partition records into blocks, string comparisons are needed in the standard blocking method because we block records with the same candidate keys. We compare numbers the same way as other strings. We also need string comparisons when we create the comparison vectors used in the EM algorithm and the composite weight calculation. When creating comparison vectors, we assign a one when two strings are similar, to give some allowance for misspelled fields. For the string comparison, we use the Jaro-Winkler distance method, a string similarity measure based on matching characters and transpositions between two strings [13]. In our study, an exact match requires two strings to have a Jaro-Winkler distance value of one, and for a similar match we set a threshold of 0.8: if the Jaro-Winkler distance value is over 0.8, we consider the two strings similar. We use an already developed Jaro-Winkler distance implementation written in Java, downloadable from [15].

In this paragraph, we discuss topics related to weight calculations.
As we stated in the previous chapter, we assume all the variables are independent. We calculate the weights with the values from the EM algorithm. As the result of the EM calculation, we get the m_i and u_i values. However, we use only the m_i values from the result and take u_i as one minus m_i, because the u_i values calculated by the EM algorithm are derived only from the pairs inside the blocks and miss the unmatched pairs excluded by blocking when we consider the whole set of record pairs. For convenience, we take logarithms in the composite weight process. The weight formula for the ith field is m_i/u_i when the ith comparison vector field is one and (1 − m_i)/(1 − u_i) when it is zero, so we add the log of m_i and subtract the log of u_i for an agreeing field:

w_i = log(m_i) − log(u_i)    (when the ith field of the comparison vector is one)
w_i = log(1 − m_i) − log(1 − u_i)    (when the ith field is zero)

and the composite weight is the sum of the field weights:

W = w_1 + w_2 + ... + w_n

In these formulas, if u_i = 0, m_i = 0, 1 − u_i = 0, or 1 − m_i = 0, then w_i = ∞ or −∞. In those cases, we substitute a certain big number, 30 in our study (with the appropriate sign), so that the weight can still be calculated for each record.

In this paragraph, we explain what we did to improve Standard_Blocking. As we stated in the previous chapter, records or entities cannot be identified by only one field. Therefore, in the standard blocking method, we cannot depend on only one key field for blocking records. Other papers suggest running the program in multiple passes [7, 8]: false nonmatches can be reduced by multiple passes with a different blocking variable for each pass.
In our study, to avoid multiple passes, we create a candidate key field by combining the first two letters of three selected fields, rather than choosing a single variable as the candidate key, in order to reduce false nonmatches. In the Standard_Blocking method we also partition records into blocks containing not only records with exactly matching candidate keys but also records with similar keys, as judged by the distance values of the Jaro-Winkler distance method. In addition to sorting the whole set of records according to the created candidate key, we partition the sorted records according to a significant field that is not the same key used for sorting, grouping records that have similar values for that field as judged by the Jaro-Winkler distance. Our standard blocking method thus uses two different kinds of variables, one for sorting and one for blocking, instead of making multiple passes over different variables.

In this paragraph, we discuss how we implemented the distance comparisons in Canopy_Clustering_Blocking. In the canopy clustering blocking method, we need to calculate distances among record points, that is, to count how many created inverted index table indexes records have in common. In our study, we created three hash tables to simplify the distance calculation process: one mapping a record id to the inverted index table indexes created for that record, one mapping a record id to its record array, and the inverted index table itself (mapping a created inverted index to the ids of the records corresponding to that index). When a record is randomly selected as a center for the distance calculation, we get the created inverted indexes of the center record from the first hash table. From the center record, we calculate distances only for the records that have at least one common inverted index with the center record.
We track the record ids stored as the contents of the center record's inverted indexes in the inverted index table and increase the distance counts of those ids. If a record has more common inverted indexes than the loose threshold value, we put the record into the same block as the center record.

Figure 5 Implementation Flow Chart

CHAPTER IV

SIMULATIONS

4.1 Environment

In record linkage, we need to quickly match a large number of records between two databases. Records are grouped into blocks to be matched efficiently. Especially for large datasets, a specific algorithm, including a blocking method with its parameters, is necessary to optimize the results. In order to evaluate the effectiveness and efficiency of the blocking methods, we implement four types of blocking methods for data linkage, namely the standard blocking, sorted neighborhood blocking, bigram indexing blocking, and canopy clustering blocking methods, on four different sizes of datasets. In this simulation, parameters must be estimated for each blocking method to get the best results: i) for the standard blocking method, we need to fix the Jaro-Winkler distance that decides how similar candidate keys must be for records to be put into the same block; ii) for the sorted neighborhood blocking method, we need to set the window size; iii) for the bigram indexing blocking method, we have to fix the threshold value that decides how many bigrams are combined to create inverted index keys; iv) for the canopy clustering blocking method, we need to fix the tight and loose threshold values used in blocking records by the distances of record points from a center point. In this study, we used the dataset of size 1000 to estimate the parameters for each blocking method. Table 1 through Table 4 present the measurement results for each blocking method combined with the EM algorithm under various parameters.
Table 1 Standard Blocking Method Measurement by Distance

Distance            1.0          0.9          0.8          0.7          0.6          0.5
Reduction Ratio     0.998288     0.998108     0.998084     0.997112     0.99006      0.978084
Pair Completeness   0.856        0.946        0.948        0.954        0.972        0.992
F score             0.921684795  0.971355674  0.9723975    0.97508      0.980947     0.984993

Table 1 shows the measurements of the standard blocking method obtained by varying the distance parameter from 0.5 to 1. As shown in Table 1, distances 0.5 and 0.6 are better than the other parameter values. We choose 0.6 as the distance parameter, because 0.5 could produce blocks that are too large for datasets bigger than the test set.

Table 2 Sorted Neighborhood Blocking Method Measurement by Window Size

Window Size         30           20           15           10
Reduction Ratio     0.126864     0.608284     0.779288     0.901492
Pair Completeness   0.996        0.994        0.988        0.986
F score             0.225061172  0.754715514  0.8713198    0.941854

Table 2 shows the measurements of the sorted neighborhood blocking method obtained by varying the window size from 10 to 30 in steps of 5. In this study, the window size is set to 15, because even though 10 got a better F score than 15, a window of 10 may produce too many false unmatched pairs on bigger datasets by failing to place matched pairs in the same block.

Table 3 Bigram Indexing Blocking Method Measurement by Threshold Values

Threshold           0.4          0.6          0.8
Reduction Ratio     0.9652       0.979124     0.990244
Pair Completeness   1            0.996        0.974
F score             0.982291879  0.987489903  0.9820548

In Table 3, the threshold value used in creating the bigram inverted indexes varies from 0.4 to 0.8. We choose the threshold value 0.6 as the parameter, because its F score is best.

Table 4 Canopy Clustering Blocking Method Measurement by Threshold Values

Threshold (tight_loose)  0.6_0.4   0.6_0.5   0.7_0.4   0.7_0.6   0.8_0.4   0.8_0.6   0.9_0.4   0.9_0.6
Reduction Ratio          0.99768   0.998076  0.99768   0.998076  0.997672  0.998068  0.997672  0.998068
Pair Completeness        0.97      0.958     0.97      0.958     0.97      0.958     0.97      0.958
F score                  0.983645  0.977627  0.983645  0.977627  0.983641  0.977624  0.983641  0.977624

Table 4 presents the measurement results of the canopy clustering blocking method obtained by changing the tight and loose threshold values. The tight threshold runs between 0.6 and 0.9 and the loose threshold runs between 0.4 and 0.6, always less than the tight threshold. In this study, we selected 0.4 for the loose threshold and 0.7 for the tight threshold: this combination gives good results on the evaluation measures and is less likely than the other values to produce oversized blocks on huge datasets.

4.2 Simulations

As mentioned in the previous section, we use three measurements, reduction ratio, pair completeness, and F score, to compare the efficiency of the four blocking methods combined with the EM algorithm. The four blocking methods to be compared are as follows. First, the standard blocking method sorts records by a created candidate key and blocks the sorted records by a significant variable; in our study we used first name. Second, the sorted neighborhood blocking method blocks the sorted records into windows of a fixed size. Third, the bigram indexing blocking method blocks records appearing in the same column of the inverted index table. Lastly, the canopy clustering blocking method blocks records by measuring their distances from each other.

The graphs in Figures 6, 7 and 8 show the simulation results. In the graphs, the horizontal axis represents the size of the dataset used in the simulation (1000, 2500, 5000 and 10000), and the vertical axis represents the value of each measure: reduction ratio, pair completeness, and F score. The values of all three measurements range from zero to one, and a value closer to one is better. For the reduction ratio, a value closer to one means fewer comparisons performed out of the maximum total number of comparisons. For pair completeness, a value closer to one means more true matched pairs are caught by the blocking and EM algorithm methods.
For the F score, a value closer to one means the harmonic relationship between the reduction ratio and pair completeness is better.

Figure 6 Reduction Ratio

Figure 7 Pair Completeness

Figure 8 F score

4.3 Results

In the reduction ratio graph in Figure 6, the sorted neighborhood blocking method gets the lowest values, but the values increase as the data size increases. The bigram indexing method has fairly good results, mostly over 0.98, with a small influence from data size. The standard blocking method and the canopy clustering method get the highest values regardless of data size.

In the pair completeness graph in Figure 7, the bigram indexing blocking method gets the best result regardless of database size, which means it correctly identifies more matched pairs than the other methods. Except for the standard blocking method, the methods get good results, with values over 0.95.

In the F score graph in Figure 8, the bigram indexing blocking method shows the strongest result, and the canopy clustering method also gives a strong result regardless of database size. The standard blocking method gives a fairly good outcome regardless of data size, with values over 0.96. For the sorted neighborhood blocking method, the results get better as the size of the dataset grows.

CHAPTER V

CONCLUSIONS AND FUTURE WORK

We have presented record linkage, an entity matching technology, with blocking methods plus the EM algorithm to speed up the matching process for a large number of records.
For speedup purposes, records should be grouped into blocks effectively and efficiently. The EM algorithm is known to be more stable than the other studied methods for estimating the parameters needed in the probabilistic record linking process. In this study, three measurements, reduction ratio, pair completeness and F score, are used to compare the efficiency of four blocking methods combined with the EM algorithm: 1) the standard blocking method, 2) the sorted neighborhood blocking method, 3) the bigram indexing blocking method, and 4) the canopy clustering blocking method.

Simulation results showed that the bigram indexing blocking method is the most reliable of the blocking methods regardless of database size. The canopy clustering blocking method is also a reliable blocking method. The effectiveness of the sorted neighborhood blocking method depends on the size of the data. The standard blocking method, even after several modifications to improve it, still gets worse results than the other blocking methods. The reason could be that standard blocking allows a record to be in only one block, unlike the other blocking methods, which allow a record to be in multiple blocks.

The key point of a blocking method plus the EM algorithm is putting a record into as many blocks as possible that have a high probability of containing a matching record, reducing false nonmatches and increasing pair completeness. At the same time, the blocking methods have to find only such high-probability blocks, to reduce expensive comparisons among records and increase the reduction ratio, and each block needs enough records to obtain better parameter estimates in the EM algorithm and increase the F score. In the future, we will propose a blocking method based on varying the number of variables involved in the blocking methods plus the EM algorithm, because the EM algorithm involves parameter estimation both pairwise and fieldwise.
In addition, we can work on Record Linkage for multi matching records, not only one match for each entity. 45 REFERENCES [1] I. P. Fellegi and A. B. Sunter, “A Theory for Record Linkage”, Journal of the American Statistical Association, 1969, Vol. 64, pp. 11831210. [2] R. P. Kelley, “Blocking Considerations for Record Linkage Under Conditions of Uncertainty”, Proceedings of the social Statistics Section, American Statistical Association, 1984, pp. 602605. [3] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society. Series B (Methodological), 1977, Vol. 39, No. 1, pp.138. [4] M. A. Jaro, “Advances in RecordLinkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida”, Journal of the American Statistical Association, Jun. 1989, Vol. 84, No. 406, pp. 414420. [5] R. Baxter, P. Christen, and T. Churches. “A comparison of fast blocking methods for record linkage”, In ACM SIGKDD workshop on Data Cleaning, Record Linkage and Object Consolidation, Washington DC, 2003, pp. 25–27. [6] P. Christen. “Towards parameterfree blocking for scalable record linkage”, Technical Report TRCS0703, The Australian National University, Canberra, 2007. [7] M. A. Hernandez, S. J. Stolfo, “RealWorld Data is Dirty: Data Cleansing and the Merge/Purge Problem”, Data Mining and Knowledge Discovery, 1998, Vol. 2, No. 1, pp. 9 37. [8] P. Lehti and P. Fankhauser, “A Precise Blocking Method for Record Linkage”, In Proceedings of the 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK’05), Aug. 2005, pp. 210–220. 46 [9] A. McCallum, K. Nigam and L. H. Ungar, “Efficient clustering of highdimensional data sets with application to reference matching”, KDD, 2000, pp. 169178. [10] P & T Cristen: Febrl  Freely extensible biomedical record linkage (Manual, release 0.3) p.9 [11] H. L. Dunn. "Record Linkage" (PDF). American Journal of Public Health, 1946, Vol. 36, No. 12, pp. 
1412–1416.
[12] E. A. Sauleau, J. P. Paumier, and A. Buemi, "Medical Record Linkage in Health Information Systems by Approximate String Matching and Clustering", BMC Medical Informatics and Decision Making, 2005, 5:32.
[13] W. E. Winkler, "The State of Record Linkage and Current Research Problems", Statistical Society of Canada, Proceedings of the Section on Survey Methods, 1999, pp. 73–79.
[14] http://www.cs.utexas.edu/users/ml/riddle/data.html
[15] http://aliasi.com/lingpipe/demos/tutorial/stringCompare/readme.html
[16] http://sourceforge.net/projects/febrl/

VITA

JIYOUNG SHIN
Candidate for the Degree of Master of Science

Thesis: COMPARATIVE STUDY ON BLOCKING METHODS IN RECORD LINKAGE
Major Field: Computer Science

Biographical:
Personal Data: Born in the city of Ulsan, Republic of Korea.
Education: Completed the requirements for the Master of Science in Computer Science at Oklahoma State University, Stillwater, Oklahoma in December, 2009. Completed the requirements for the Bachelor of Science in Computer Science at West Texas A&M University, Canyon, Texas in May, 2004.
Experience: Published the paper "Exploratory Data Analysis with Bivariate Dependence Functions", MSV 2006: 217–223.
Professional Memberships:

ADVISER'S APPROVAL: Dr. K. M. George

Name: JIYOUNG SHIN    Date of Degree: December, 2009
Institution: Oklahoma State University    Location: Stillwater, Oklahoma
Title of Study: COMPARATIVE STUDY ON BLOCKING METHODS IN RECORD LINKAGE
Pages in Study: 46    Candidate for the Degree of Master of Science
Major Field: Computer Science

Scope and Method of Study: Record linkage methods match related records from different datasets. For speed-up purposes, records should be grouped efficiently into blocks by a blocking method whose parameters must be estimated. The EM algorithm is more stable than other methods for estimating the parameters needed in the probabilistic record linkage process.
In this study, three measurements (reduction ratio, pair completeness, and F score) are used to compare the efficiency of four blocking methods combined with the EM algorithm. The blocking methods are 1) the Standard Blocking Method, 2) the Sorted Neighborhood Blocking Method, 3) the Bigram Indexing Blocking Method, and 4) the Canopy Clustering Blocking Method. Measures are obtained via simulation.

Findings and Conclusions: In the simulation, we find that the bigram indexing blocking method is the most reliable compared to the other blocking methods, regardless of database size. The canopy clustering blocking method is also reliable. The effectiveness of the sorted neighborhood blocking method depends on the size of the data. In spite of several modifications to improve it, the standard blocking method still performs worse than the other blocking methods.
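The three measures named in the abstract can be computed as below. This is a sketch using the standard definitions from the blocking literature (e.g. Baxter, Christen and Churches [5]); the counts are illustrative, not the study's results.

```python
# Blocking-quality measures: reduction ratio (RR), pair completeness (PC),
# and their harmonic mean (F score), as commonly defined in the blocking
# literature. Numbers below are illustrative only.
def reduction_ratio(n_candidate_pairs, n_total_pairs):
    return 1.0 - n_candidate_pairs / n_total_pairs

def pair_completeness(n_true_matches_in_candidates, n_true_matches):
    return n_true_matches_in_candidates / n_true_matches

def f_score(rr, pc):
    return 2 * rr * pc / (rr + pc)

rr = reduction_ratio(5_000, 1_000_000)   # blocking kept 5,000 of 10^6 pairs
pc = pair_completeness(95, 100)          # 95 of 100 true matches survived blocking
print(round(rr, 3), round(pc, 2), round(f_score(rr, pc), 3))  # 0.995 0.95 0.972
```

A good blocking method keeps RR high (few candidate pairs) without letting PC drop (few lost true matches); the F score trades the two off in a single number.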
Title  Comparative Study On Blocking Methods In Record Linkage 
Date  20091201 
Author  Shin, Jiyoung 
Keywords  Bigram Indexing Blocking, Canopy Clustering Blocking, EM algorithm, Record Linkage, Sorted Neighborhood Blocking, Standard Blocking Method 
Department  Computer Science 
Full Text Type  Open Access 
Note  Thesis 
Rights  © Oklahoma Agricultural and Mechanical Board of Regents 
Transcript
V. CONCLUSIONS AND FUTURE WORKS
REFERENCES

LIST OF TABLES
Table 1  Standard Blocking Method Measurement by Distance
Table 2  Sorted Neighborhood Blocking Method Measurement by Window Size
Table 3  Bigram Indexing Blocking Method Measurement by Threshold Values
Table 4  Canopy Clustering Blocking Method Measurement by Threshold Values

LIST OF FIGURES
Figure 1  Sorted Neighborhood Blocking Method
Figure 2  Bigram Indexing Blocking Method
Figure 3  Canopy Clustering Blocking Method
Figure 4  Evaluation of Effectiveness
Figure 5  Implementation Flow Chart
Figure 6  Reduction Ratio
Figure 7  Pair Completeness
Figure 8  F Score

CHAPTER I
INTRODUCTION

We live in a world where our information is stored in various places. At hospitals, restaurants, fitness clubs, or government offices, people request that our information be stored in their databases. However, even within the same organization it is very hard to match the information, that is, to determine whether the records in a database represent the same entity or not.
When information is shared, the databases must be linked by indicating which record in one database relates to which record in the other. Motivated by this need to link databases, record linkage was first mentioned by Dunn in 1946, who wrote that our "Books of Life" need to be linked into one volume for registrar or medical purposes [11]. After that, Newcombe introduced a probabilistic record linkage formula, and later Fellegi and Sunter formalized the theory [1]. Record-linkage methodology and software have been developed by the U.S. Bureau of the Census mainly to match the existing census data with the post-enumeration survey [4]. As the records to be compared come from different files, they might have different data structures or fields, so we might need to work on the data itself before comparing files. There is also the possibility that some data fields are missing or misspelled, so data standardization needs to be taken care of. In our study, we do not cover the standardization step. We assume that the data has been standardized, but that missing or misspelled values were not corrected during standardization; the data we use for the simulation will contain missing or misspelled records. There are several approaches to record linkage. The most straightforward is a rules-based approach, in which reasonable rules are developed and then refined as common exceptions are found [10]. A very popular approach has been probabilistic record linkage [10], and we use the probabilistic approach in our study. A probabilistic approach compares the data with statistics rather than by straightforward one-on-one comparison, which reduces the number of comparisons for large databases. In this approach, each file consists of a number of fields or components and a number of records or observations. The components or attributes common to both files are useful for matching.
The components do not contain equal amounts of information, and error rates vary, so weights are defined for each field from the error probabilities using a log-likelihood ratio. For files too large for all possible record pair comparisons, the files can be partitioned into mutually exclusive and exhaustive blocks designed to increase the proportion of matched pairs observed while decreasing the number of record pair comparisons to a block level. The blocking procedure is implemented by sorting the two files on one or more variables [1]. There are several blocking methods, which are ways of putting records that have a high possibility of being the same entity into the same block and doing the comparison within the block. In this paper, we focus on four blocking methods and compare their accuracy and effectiveness. The four blocking methods are the basic sorted method (standard blocking), the sorted neighborhood method, the bigram indexing method, and the canopy clustering method.

The remaining document is organized as follows. Chapter 2 summarizes the main principles of record linkage, the EM algorithm, and the four blocking methods. Chapter 3 gives an overview of the system implemented in this project, including a description of the data sets and evaluation methods. Chapter 4 explains the findings based on the evaluation results of our experiments. Finally, Chapter 5 concludes our findings and discusses future work.

CHAPTER II
REVIEW OF LITERATURE

The objective of the record linkage or matching process is to find and link the records corresponding to the same individual. In this section, we review the basic ideas of record linkage using the formulas from Fellegi and Sunter's paper [1].

2.1 Record Linkage

Define two disjoint sets M (Matched) and U (Unmatched) formed from the cross product A x B of two files A and B, each filled with records.
A record pair is a member of M if that pair represents a true match; otherwise it is a member of U. Thus, the record linkage process tries to classify each record pair of A x B into M and U.

A × B = {(a, b); a ∈ A, b ∈ B}    (2.1)
M = {(a, b); a = b, a ∈ A, b ∈ B}    (2.2)
U = {(a, b); a ≠ b, a ∈ A, b ∈ B}    (2.3)

Suppose the records corresponding to members of A and B are denoted α(a) and β(b). We express each pair as a comparison vector γ[α(a), β(b)] (for convenience we denote it as γ). Because each pair has several fields to be compared, the comparison vector can be split into field-specific comparisons γ_i[α(a), β(b)], i = 1, ..., K:

γ[α(a), β(b)] = {γ_1[α(a), β(b)], ..., γ_K[α(a), β(b)]}    (2.4)

The comparison vector can be represented as ones and zeroes. If the contents of the i-th field (for example, both records' first names) are exactly the same or almost the same, we assign 1; otherwise we assign 0. This is applied to all of the individual fields common to both records. For example, a comparison vector "11011110" means the pair has eight fields, with six fields matching and two fields not matching. All the possible realizations of the comparison vector γ are represented as Γ, which is defined as the comparison space. By observing a comparison vector γ for each pair and mapping the comparison vector into a random decision function, we decide whether the pair is a matched pair, an unmatched pair, or a possible link. A linkage rule L is a rule mapping the comparison space Γ onto a set of random decision functions D = {d(γ)}. Each decision function gives, for a γ, the probabilities of taking each decision: matched A1, unmatched A3, or possible link A2. The possible link A2 is used when it is hard to conclude whether the record pair is definitely matched or unmatched, and further observation is needed.
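The field-by-field construction of a binary comparison vector γ can be sketched as below. This is a minimal illustration: the field names and the exact-equality test are assumptions, since a real system may also count approximately equal values as agreement.

```python
# Build a binary comparison vector gamma for one record pair.
# Assumption: a record is a dict of field values, and a field "agrees"
# when the two values match exactly (real systems may use fuzzy matching).

FIELDS = ["last_name", "first_name", "zip"]  # hypothetical field list

def comparison_vector(rec_a, rec_b, fields=FIELDS):
    return [1 if rec_a[f] == rec_b[f] else 0 for f in fields]

a = {"last_name": "shin", "first_name": "jiyoung", "zip": "74074"}
b = {"last_name": "shin", "first_name": "jiyong",  "zip": "74074"}
print(comparison_vector(a, b))  # [1, 0, 1]
```

The resulting vectors are exactly the γ's that the decision function d(γ) classifies into A1, A2, or A3.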
d(γ) = {P(A_1 | γ), P(A_2 | γ), P(A_3 | γ)}; γ ∈ Γ    (2.5)

where Σ_{i=1}^{3} P(A_i | γ) = 1, which means the sum of the three decision probabilities is one.

We calculate weights for each field and use them as a measure of how much each field contributes to the total probabilities for each pair. The contribution of each field can be prioritized by using weights to measure the probability of making an accurate match. Fellegi and Sunter extended the measurement of weights into mathematical formulas for the record linkage process using a log-likelihood ratio [1]. When we decide whether a pair is matched or unmatched, we simply need to compare the probabilities that the pair is matched and unmatched. If P(A_1 | γ) or P(M | γ) is bigger than P(A_3 | γ) or P(U | γ), then the pair is considered matched, and unmatched if it is the other way around. For this, Fellegi and Sunter use a likelihood ratio of the two conditional probabilities of the pair when it is known to be linked or not linked. The likelihood ratio for a certain comparison vector γ can be defined as

R = m(γ) / u(γ)    (2.6)

If R is bigger than 1, we consider the comparison vector to belong to the matched class, and if R is less than 1, we consider it to belong to the unmatched class. To make the calculation easier and clearer, we take the logarithm of R and call it the weight:

w = log( m(γ) / u(γ) )
  = log( m(γ_1) / u(γ_1) ) + log( m(γ_2) / u(γ_2) ) + ... + log( m(γ_k) / u(γ_k) )

and, for each field,

w_i = log( m(γ_i) / u(γ_i) )

When we suppose all the fields are independent of each other, the decision rule is based on the sum of the weights of the fields, Σ_{i=1}^{k} w_i, where k is the number of fields in the record comparison vector γ. m(γ) and u(γ) are the conditional probabilities of a γ given that the pair is a true match or a true non-match.
m(γ) = P{γ[α(a), β(b)] | (a, b) ∈ M}    (2.7)
u(γ) = P{γ[α(a), β(b)] | (a, b) ∈ U}    (2.8)

As in equation (2.4), m(γ) and u(γ) can be split by field and calculated by multiplying all of the field probabilities:

m(γ) = m(γ_1) · m(γ_2) ··· m(γ_k)    (2.9)
u(γ) = u(γ_1) · u(γ_2) ··· u(γ_k)    (2.10)

For each i-th field of the j-th pair, the matched and unmatched probabilities can be represented as

m_i = P(γ_i^j = 1 | γ^j ∈ M)    (2.11)
u_i = P(γ_i^j = 1 | γ^j ∈ U)    (2.12)

These formulas can also be written as m_i = Pr{component i agrees | r ∈ M} and u_i = Pr{component i agrees | r ∈ U} for the set of all record pairs r. In our study, we use the EM algorithm for the linkage rule to estimate the variables m_i and u_i. After the values are calculated, they can be used to compute m(γ) and u(γ) of (2.7) and (2.8) as below:

P(γ^j | M) = ∏_{i=1}^{n} m_i^{γ_i^j} (1 − m_i)^{1−γ_i^j}    (2.13)
P(γ^j | U) = ∏_{i=1}^{n} u_i^{γ_i^j} (1 − u_i)^{1−γ_i^j}    (2.14)

These are the estimated true positive and true negative probabilities of the comparison vector of each pair. For each field of a pair, if the comparison vector entry for that field is 1, we multiply by m_i or u_i; if it is 0, we multiply by one minus that value as the probability of the field.
w = log( m(γ^j) / u(γ^j) ) = log( P(γ^j | M) / P(γ^j | U) )
  = log( ∏_{i=1}^{n} m_i^{γ_i^j} (1 − m_i)^{1−γ_i^j} / ∏_{i=1}^{n} u_i^{γ_i^j} (1 − u_i)^{1−γ_i^j} )
  = log( m_1^{γ_1^j} (1 − m_1)^{1−γ_1^j} / u_1^{γ_1^j} (1 − u_1)^{1−γ_1^j} ) + ... + log( m_n^{γ_n^j} (1 − m_n)^{1−γ_n^j} / u_n^{γ_n^j} (1 − u_n)^{1−γ_n^j} )

The weight of the i-th field in the j-th pair is

w_i = log( m_i^{γ_i^j} (1 − m_i)^{1−γ_i^j} / u_i^{γ_i^j} (1 − u_i)^{1−γ_i^j} )

The weights for the components are:
if component i agrees, in other words when γ_i^j = 1, then w_i = log( m_i / u_i );
if component i disagrees, in other words when γ_i^j = 0, then w_i = log( (1 − m_i) / (1 − u_i) ).

As we described in the decision rule section, there are three choices that can be made: a pair is matched, unmatched, or a possible link. Therefore, we need two threshold values to divide the record pairs into three categories. In the decision procedure for each record pair, the weights of the individual fields w_i are summed into a composite weight. If the composite weight is over the upper threshold value, the pair is considered matched; if the value is less than the lower threshold value, the pair is considered unmatched; and if the value lies between the two threshold values, the pair is considered undecided. According to Jaro, the threshold values can be calculated from the cumulative sums of Pr(· | M), one minus the cumulative sums of Pr(· | U), and the error values μ and λ [4]. The upper threshold value, the maximum weight for a non-match, is where the cumulative sum of Pr(· | M) does not exceed λ, the probability that a matched pair is classified as unmatched. The lower threshold value, the minimum weight for a match, is where one minus the cumulative sum of Pr(· | U) does not exceed μ, the probability that an unmatched pair is classified as matched. Pr(· | M) and Pr(· | U) can be calculated by using formulas 2.13 and 2.14.
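The agreement/disagreement weights and the three-way threshold decision can be sketched as below. The m_i, u_i values and the two thresholds are illustrative assumptions, not values from the study (in practice the m_i, u_i come from the EM algorithm and the thresholds from the error levels μ and λ).

```python
import math

# Illustrative per-field probabilities (assumed, not from the study):
# m[i] = P(field i agrees | match), u[i] = P(field i agrees | non-match).
m = [0.95, 0.90, 0.85]
u = [0.10, 0.05, 0.20]

def composite_weight(gamma):
    """Sum log(m_i/u_i) for agreeing fields, log((1-m_i)/(1-u_i)) otherwise."""
    w = 0.0
    for g, mi, ui in zip(gamma, m, u):
        w += math.log(mi / ui) if g == 1 else math.log((1 - mi) / (1 - ui))
    return w

def decide(gamma, lower=-2.0, upper=2.0):  # hypothetical threshold values
    w = composite_weight(gamma)
    if w > upper:
        return "matched"
    if w < lower:
        return "unmatched"
    return "possible link"

print(decide([1, 1, 1]))  # all fields agree: "matched"
print(decide([0, 0, 0]))  # all fields disagree: "unmatched"
```

Agreement on a field contributes a positive weight when m_i > u_i, disagreement a negative one, so the composite weight separates likely matches from likely non-matches.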
For the linkage rule, as mentioned in the threshold section, we need to consider two types of errors: μ is the probability that an unmatched pair will be classified as matched, and λ is the probability that a matched pair will be classified as unmatched. We use these errors in the decision making when calculating the thresholds. μ can be calculated as the sum, over all possible comparison vectors in Γ, of the product of the unmatched probability u(γ) and the conditional probability of deciding matched given that comparison vector γ; the error λ can be calculated in a similar way. As Sauleau, Paumier and Buemi state, Fellegi-Sunter theory suggests an optimal linkage rule for given error levels μ and λ that reduces A2; that is, the goal of an optimal linkage rule is to make a clear decision as to whether a record pair is linked or not, reducing the cost caused by manual decision making for possibly matched pairs [12].

μ = P(A_1 | U) = Σ_{γ∈Γ} u(γ) · P(A_1 | γ)
λ = P(A_3 | M) = Σ_{γ∈Γ} m(γ) · P(A_3 | γ)

According to Fellegi and Sunter, the best linkage decision rule L at the (μ, λ) error levels on the comparison space Γ, denoted L(μ, λ, Γ), is

d(γ) = (1, 0, 0)  if  T_μ ≤ m(γ)/u(γ)
d(γ) = (0, 1, 0)  if  T_λ < m(γ)/u(γ) < T_μ
d(γ) = (0, 0, 1)  if  m(γ)/u(γ) ≤ T_λ

where T_μ = m(γ_n)/u(γ_n) and T_λ = m(γ_{n′})/u(γ_{n′}), with μ = Σ_{i=1}^{n} u(γ_i), λ = Σ_{i=n′}^{N_Γ} m(γ_i), and n < n′.

This formula embodies the concepts of record linkage explained in the previous sections. We can simply say that record linkage is a process of deciding whether two records are matched or not; in other words, choosing one of the options from the decision function d(γ). To make the decision, we need to calculate composite weights for every comparison vector, and the comparison vectors are ordered in ascending order according to their composite weights.
To calculate the composite weights, we use the EM algorithm to estimate the required parameters. The threshold values are decided by using the error values at the points n and n′ where the error values are reasonable. After deciding the threshold values, the pairs whose comparison vectors have composite weights above the upper threshold value are matched pairs, and the pairs with composite weights below the lower threshold value are unmatched pairs. The pairs with composite weights between the thresholds are ones on which the algorithm is unable to make a clear decision.

2.2 EM Algorithm

The EM (expectation maximization) algorithm, proposed by Dempster [3], is applied to estimate the weight parameters in Fellegi and Sunter's formula. As Dempster mentions, the EM algorithm is a maximum likelihood estimation algorithm for incomplete data [3]; this makes it appropriate for record linkage, since some record contents may be missing or misspelled. The EM algorithm repeats a procedure of computing an expected value and maximizing that value until the procedure converges [6]. In this section we describe the estimation of m_i and u_i using the EM algorithm in the way Jaro proposed [4]. γ is defined as a comparison vector, a vector function on records a and b. The function γ^j[a, b], j = 1, ..., N is the vector function for the j-th pair; we also write it as γ^j. Each j-th pair γ^j has several fields to be compared: γ_i^j, j = 1, ..., N, i = 1, ..., n, where n is the number of fields and N is the number of pairs.

γ[a, b] = {γ^1[a, b], ..., γ^N[a, b]}, a ∈ A, b ∈ B

In our case, as Jaro says, we use the function where γ_i^j = 1 if the i-th field agrees on the j-th pair, and γ_i^j = 0 if the i-th field does not agree on the j-th pair, for i = 1, ..., n and j = 1, ..., N.
If we formulate this statement, it can be defined as

m_i = Pr{γ_i^j = 1 | r ∈ M}
u_i = Pr{γ_i^j = 1 | r ∈ U}

Therefore, γ^j, j = 1, ..., N is a vector of ones and zeroes. The EM algorithm is a repeating process of estimating parameters and maximizing them until we get more accurate values for the parameter set Φ = (m, u, p). The parameter m represents the m_i and u represents the u_i. The parameter p is the proportion of matched pairs over the total number of matched and unmatched pairs:

p = |M| / |M ∪ U|

The complete data vector is (γ, g), where g^j = (g_m(γ^j), g_u(γ^j)). Hence g^j = (1, 0) iff γ^j ∈ M and g^j = (0, 1) iff γ^j ∈ U. As mentioned previously, P(γ^j | M) and P(γ^j | U) can be calculated as below, and these formulas are used in the parameter estimation calculation. They are defined under the assumption that the fields do not affect each other. The first is the probability of a true match for a pair, and the second is the probability of a true non-match:

P(γ^j | M) = ∏_{i=1}^{n} m_i^{γ_i^j} (1 − m_i)^{1−γ_i^j}
P(γ^j | U) = ∏_{i=1}^{n} u_i^{γ_i^j} (1 − u_i)^{1−γ_i^j}

Using γ, we estimate ĝ_m, ĝ_u in the E (estimation) step, then we calculate m̂_i, û_i, and p̂ in the M (maximization) step. These steps are described below. First, we set starting values for the estimates of the unknown parameter set Φ = (m̂, û, p̂). The algorithm is not sensitive to the starting values, except that the m̂ values should be greater than the corresponding û values. The m̂_i, û_i, and p̂ can be any values in the interval (0, 1). Next, we compute the γ's for all paired records. For example, the γ's of pairs with five fields might look like γ = [{1, 1, 0, 1, 1}, {0, 1, 1, 0, 0}, ...]. We plug the γ values into ĝ_m, ĝ_u.
E step: estimate the g^j value with (ĝ_m(γ^j), ĝ_u(γ^j)):

ĝ_m(γ^j) = p̂ ∏_{i=1}^{n} m̂_i^{γ_i^j} (1 − m̂_i)^{1−γ_i^j} / [ p̂ ∏_{i=1}^{n} m̂_i^{γ_i^j} (1 − m̂_i)^{1−γ_i^j} + (1 − p̂) ∏_{i=1}^{n} û_i^{γ_i^j} (1 − û_i)^{1−γ_i^j} ]

ĝ_u(γ^j) = (1 − p̂) ∏_{i=1}^{n} û_i^{γ_i^j} (1 − û_i)^{1−γ_i^j} / [ p̂ ∏_{i=1}^{n} m̂_i^{γ_i^j} (1 − m̂_i)^{1−γ_i^j} + (1 − p̂) ∏_{i=1}^{n} û_i^{γ_i^j} (1 − û_i)^{1−γ_i^j} ]

M step: put the estimated ĝ_m, ĝ_u values into m̂_i, û_i, and p̂:

m̂_i = Σ_{j=1}^{N} ĝ_m(γ^j) γ_i^j / Σ_{j=1}^{N} ĝ_m(γ^j)

û_i = Σ_{j=1}^{N} ĝ_u(γ^j) γ_i^j / Σ_{j=1}^{N} ĝ_u(γ^j)

p̂ = Σ_{j=1}^{N} ĝ_m(γ^j) / N

We repeat the E and M steps until we get precise values for the parameters, that is, until the estimated parameters show no significant difference from the previous estimates.

2.3 Blocking Methods

Researchers have looked at large scale data comparisons and their implementation with record linkage on real world data such as census data or medical data [4, 12]. However, real world data can be huge: for example, if ten thousand records are in each of two files, then ten thousand times ten thousand comparisons are needed. Therefore, blocking techniques are used to reduce the cost of the quadratic comparisons [8]. Blocking methods were used by Kelley, who compared two blocking methods based on the likeness of components for big data sets [2]. By using probabilities, related components are put into the same blocks and compared among themselves, while unrelated components are considered unmatched pairs and never have a chance to be compared. In 1989 Jaro tested a record linkage system with blocking methods on the real data sets of the census of Tampa, Florida and an independent post-enumeration survey (PES) [4]. Further, various blocking methods have been investigated and compared as to their effectiveness [5, 6, 8].
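The E and M steps of Section 2.2 can be sketched in Python as below. This is a minimal implementation under the conditional independence assumption; the starting values, toy data, and convergence tolerance are illustrative choices, not the study's settings.

```python
def em_estimate(gammas, n_fields, m0=0.9, u0=0.1, p0=0.5, tol=1e-6, max_iter=200):
    """Estimate (m_i, u_i, p) from binary comparison vectors via EM."""
    m = [m0] * n_fields          # start with m-hat above u-hat, as required
    u = [u0] * n_fields
    p = p0
    for _ in range(max_iter):
        # E step: expected match membership g_m(gamma^j) for every pair
        gm = []
        for g in gammas:
            pm, pu = p, 1 - p
            for i, gi in enumerate(g):
                pm *= m[i] if gi else (1 - m[i])
                pu *= u[i] if gi else (1 - u[i])
            gm.append(pm / (pm + pu))
        # M step: re-estimate m_i, u_i, p from the expected memberships
        sm = sum(gm)
        su = len(gammas) - sm
        new_m = [sum(w * g[i] for w, g in zip(gm, gammas)) / sm for i in range(n_fields)]
        new_u = [sum((1 - w) * g[i] for w, g in zip(gm, gammas)) / su for i in range(n_fields)]
        new_p = sm / len(gammas)
        changed = max(abs(a - b) for a, b in zip(new_m + new_u + [new_p], m + u + [p]))
        m, u, p = new_m, new_u, new_p
        if changed < tol:
            break
    return m, u, p

# Toy data: matches mostly agree on both fields, non-matches mostly disagree.
data = [(1, 1)] * 40 + [(1, 0)] * 5 + [(0, 0)] * 50 + [(0, 1)] * 5
m, u, p = em_estimate(data, n_fields=2)
```

On this toy data the estimates separate cleanly: the m̂_i end up well above the û_i, and p̂ lands near the true proportion of agreeing pairs.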
In our study, we focus on four blocking methods to be compared: the standard blocking method (sorting), sorted neighborhood blocking, bigram indexing, and canopy clustering. One procedure common to all of the methods is creating candidate keys. Finding the common fields of the records being compared is the first step of matching. Among those fields, different fields contribute to matching at different levels. For example, a field such as sex has only two value states and consequently cannot impart enough information to identify a match uniquely. Conversely, a field such as surname imparts more information, but it may frequently be reported or keyed incorrectly [4]. Therefore, we create a candidate key from significant fields, either combining parts of the fields or using a field as-is as the key for each component. In this section, the four methods are explained in detail.

2.3.1 Standard Blocking Method

In this section, the standard blocking method is described based on Jaro's paper [4]. The basic sorted method sorts the two files on the same key variables (one or more variables), groups records with the same values of those variables into blocks, and compares records within the blocks. We can also compare records in neighboring blocks for better accuracy when no matched record is found. In this method, the most important step is choosing proper keys, which means we need to find variables with low error. If we set a key with a high error probability, we can get many false matches. A weak point of this method is that we cannot control the number of records within a block: for example, if most records are clustered into one block, the method is not useful for reducing record comparison time. Therefore, we need to choose proper sorting keys with low error probabilities that distribute records uniformly into blocks.
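A minimal sketch of standard blocking follows; the dict-based records and the surname-consonant key are assumptions for illustration, not the study's actual key choice.

```python
from collections import defaultdict

# Standard blocking: group records by a blocking key; only pairs that fall
# inside the same block are ever compared.
def standard_blocks(records, key_func):
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_func(rec)].append(rec)
    return blocks

# Hypothetical key: the first three consonants of the surname, so common
# vowel typos ("Shin" vs "Shinn") still land in the same block.
def surname_key(rec):
    consonants = [c for c in rec["last_name"].lower() if c not in "aeiou"]
    return "".join(consonants[:3])

records = [{"last_name": "Shin"}, {"last_name": "Shinn"}, {"last_name": "Park"}]
blocks = standard_blocks(records, surname_key)
print(sorted(blocks))  # ['prk', 'shn']
```

Note the weakness described above: nothing bounds a block's size, so a skewed key can funnel most records into one block and lose the speed-up.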
2.3.2 Sorted Neighborhood Blocking Method

The sorted neighborhood method is based on Hernandez and Stolfo's paper [7]. We can think of the sorted neighborhood method as one step beyond the basic sorted method, reducing some of its defects and adding the idea that, among the sorted records, neighboring records have a higher matching possibility than records at a great distance. The procedure first joins the two files and sorts them by candidate keys. The candidate keys can be made by concatenating parts of some chosen variables. For example, one way of creating a key could be the concatenation of the first three consonants of a last name, the first three letters of a first name, the first three digits of a Social Security number, and the first four digits of a zip code: shnjiy6407407. After the keys are created, the records are sorted by the candidate keys. Then we set a fixed-size window, which we treat as a block, and we do the comparison within it. The window slides over the sorted records until no record is left, sliding the front record out and adding the record that follows the last record of the block. Put differently, for a fixed window size w, when a new record enters the window, the first record slides out, and the newly entered record is compared with the previous w − 1 records to find a matching record. This method is an improved version of the basic sorted method: it takes care of the low-error-probability key choice by concatenating parts of variables into a key, and it normalizes the size of blocks by fixing the window size.

Figure 1: Sorted Neighborhood Blocking Method

2.3.3 Bigram Indexing Blocking Method

The bigram indexing blocking method is described based on the record linkage software in Febrl's manual [10].
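The sliding-window comparison of Section 2.3.2 can be sketched as below; the candidate keys and the window size w are illustrative assumptions.

```python
# Sorted neighborhood: sort by candidate key, then compare each record
# only with its w-1 predecessors in the sorted order.
def sorted_neighborhood_pairs(records, key_func, w):
    ordered = sorted(records, key=key_func)
    pairs = []
    for j in range(1, len(ordered)):
        for i in range(max(0, j - w + 1), j):  # the previous w-1 records
            pairs.append((ordered[i], ordered[j]))
    return pairs

# Hypothetical candidate keys already concatenated from name/ID parts.
keys = ["shnjiy640", "shnjiy641", "prksun740", "chosun741"]
pairs = sorted_neighborhood_pairs(keys, key_func=lambda k: k, w=2)
print(len(pairs))  # with w=2 each record meets only its direct neighbor: 3
```

Growing w trades more comparisons for a better chance of catching matches whose keys sort slightly apart.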
The bigram indexing method is less sensitive to typo errors or some missing information compared to the previous two blocking methods. The features of this method are, first, this method allows each record to be in multiple blocks if needed, and it uses an inverted index. The usual indexing system made it easy to find needed information, so for document indexing, the contents are a list of parts of the document, but for the inverted indexing system, a part or a content of information maps a set of documents which contain the part. The process of bigram indexing blocking method starts with choosing blocking keys. Some significant variables are chosen or parts of some variables are chosen, and they are combined into a string that is a preformed block key. The string will be split into bigram lists; the bigram is a substring with two letters. For example, if we create the string as a combination of the first two letters of the last name “sh”, the first two letters of the first name “ji” and the first two numbers of the zip code “74” for record index 7, then the preformed block key of this example record will be “shji74.” The created string will be divided into a bigram list as {“sh”, “hj”, “ji”, “i7”, “74”}. We consider the numbers also as strings in our study. The bigram list will be sorted alphabetically as {“74”, “hj”, “i7”, “ji”, “sh”}. We multiply the total number of possible bigrams, which is 5 in this case, with a threshold value that can be given manually as a decimal number between 0 to 1, and let us set the threshold as 0.8 in this case. The result will be 4, and then the blocking keys will be all possible combination of four bigrams with considering the sequence of the alphabetic ordered listed bigrams. The blocking keys for this record will be {“74hji7ji”, “74hji7sh”, “74i7jish”, “74hjjish”, “hji7jish”}. This example reco these five created blocking keys. 
This process is repeated for all records. The length of the preformed key string and the threshold value are critical for this blocking method. If the preformed string is too long and the threshold value is too small, then there is less possibility of missing linked records in the same blocks, but there could be too many blocks and too many expensive comparisons later on. If the length is too short and the threshold value is too big, then fewer blocking keys are created, and the possibility of missing linked records in the same block increases.

Figure 2 Bigram Indexing Blocking Method

2.3.4 Canopy Clustering Blocking Method

The canopy clustering blocking method has been studied based on McCallum, Nigam and Ungar's paper [9]. Canopy clustering is a method that considers each record as a point. It blocks records by approximately calculating distances between records, and it uses an inverted index table for cheap distance calculation. The canopy clustering method also allows each record to fall into more than one canopy, or block. An inverted index table is created to be used in the calculation of a distance metric. We choose preformed keys for the bigram indexes from variables that would have low error probability, such as diagnoses, last name, or first name. Record ids are entered into the proper blocks as the contents of the table. After creating the inverted index table, we randomly pick a center point (a record) from the pool of all records; using the inverted index table, the distance between the selected point and the rest of the points can be calculated fast. The distance metric can be calculated as the number of indexes each record has in common with the center record.
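The inverted index table that supports this cheap distance calculation can be sketched as a map from blocking key to record ids. This is a minimal illustration under assumed names; the blocking keys per record are taken as precomputed.

```java
import java.util.*;

// Sketch of the inverted index table: blocking key -> record ids.
// The blocking keys for each record are assumed to be precomputed;
// all names here are illustrative.
public class InvertedIndex {
    static Map<String, List<String>> build(Map<String, List<String>> recordToKeys) {
        Map<String, List<String>> index = new HashMap<>();
        for (Map.Entry<String, List<String>> e : recordToKeys.entrySet()) {
            for (String key : e.getValue()) {
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> recs = new LinkedHashMap<>();
        recs.put("a1", Arrays.asList("74hj", "hjji"));
        recs.put("b7", Arrays.asList("74hj"));
        // a1 and b7 share the "74hj" column, so they would share a block
        System.out.println(build(recs).get("74hj")); // [a1, b7]
    }
}
```

With this table, the number of common indexes between two records can be counted by walking only the columns of one record's keys, rather than comparing the records field by field.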
Before that, we set two threshold values: one for a high probability that a record belongs in the same cluster as the center (the tight threshold) and one for a lower probability (the loose threshold). The threshold values can be set by users or from the results of previous experience or tests. For a randomly picked center point, we calculate the distance metric for all other points. If the distance metric of a point is bigger than the loose threshold value, the point is put into the same canopy as the center point. If the distance metric of a point is also bigger than the tight threshold value, the point is removed from the pool of candidate center points; points whose distance is bigger than the loose threshold but smaller than the tight threshold still have a chance to be in other canopies and remain in the pool. We repeat calculating the distance metrics and putting points into canopies until no more points are available in the pool. For example, suppose nine points {A, B, C, D, E, F, G, H, I} are spread with the distances from each other shown in the picture below, and we pick centers randomly in the order A, F, H, I, E; then five canopies, which are blocks, will be created as below.

Figure 3 Canopy Clustering Blocking Method

CHAPTER III

METHODOLOGIES

In this section, we describe the datasets, evaluation methods, and algorithms used in this study. The theory of record linkage involves the use of algorithms to match a large number of records quickly. For the speed problem, blocking methods reduce the quadratic number of comparisons. Likely matches are grouped into blocks, which are portions of the whole set of records. Especially for large data sets, various blocking methods have been investigated and compared for their effectiveness [5, 6, 8].
Matching records requires the estimation of parameters needed for any record linkage activity, which was the basic algorithmic approach taken by the matcher in early software. Record linkage software using probabilistic approaches has been developed, such as Link King, LinkageWiz, Febrl, and TAILOR. Among probabilistic approaches, Jaro stated that "The EM algorithm is highly stable and the least sensitive to the starting values of any of the methods studied" [4]. In this project, we chose the EM algorithm to estimate parameters in the record linkage procedure. Jaro introduced a record linkage procedure performed among the records within their own blocks using the EM algorithm [4]. McCallum, Nigam and Ungar proposed the canopy clustering blocking method with the EM algorithm [9]. Many comparative studies of blocking methods have been done. However, there has not been an attempt to compare the effectiveness of the four blocking methods described in the previous section combined with the EM algorithm. In this project, we carry out a comparative study to evaluate the effectiveness of the combined schemes of blocking methods and the EM algorithm for large data sets (shown in Figure 4). We implement the record linkage algorithms in the Java programming language and test them using real data sets. We evaluate the implementation with three already developed measure formulas from the record linkage toolbox TAILOR [9].

Figure 4 Evaluation of effectiveness

3.1 Data

In this project, four datasets are used for comparative analysis. The data type of the datasets is U.S. mailing addresses, created by the database generator DBGen and downloadable from [16]. DBGen is an open source database generator used to create large mailing address datasets [33].
Each mailing address record has twelve fields: given name, surname, street name, street number, street address, suburb, postcode, state, birth date, age, phone number, and social security number. For the U.S. mailing address data, eight data sets are used, with two files per data set as input for each run. The combined sizes of the datasets vary: 1000, 2500, 5000, and 10000 records. The two files of each set contain duplicates with some missing, misspelled, or misallocated fields, in random order. Only one duplicate of each entity is possible in the other data file. Because the four datasets have similar features and differ only in size, we can compare the blocking methods' effectiveness depending on dataset size. The 1000-record dataset, the smallest of the four, is used as test data to estimate the best parameter or threshold values needed in each blocking method. Our goal is to compare two individual records from different files and conclude whether or not they indicate the same entity. For each mailing address dataset, two files are used for the comparison. There are further assumptions about the data sets. We assume there is no absolutely reliable stand-alone identification for each record, because data could be missing or misspelled. To distinguish the records, we give each record a number as an index, and we put 'a' or 'b' in front of the indexes to avoid comparisons among records within the same file. For simplicity, we assume in our study that each record's fields are independent of each other, and we consider numbers as strings. The address files are in CSV format. Each line stands for one record, and fields are separated by commas. We assume every delimiter is used correctly. Each input file is read into a Java ArrayList of String arrays, each array representing one record and each array element one field. For missing fields, we leave the components empty.
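The reading step just described can be sketched as follows. This is a minimal illustration of the parsing only (lines are assumed to be already read in, and the class name is an assumption); the `-1` split limit preserves trailing empty fields so missing components stay empty.

```java
import java.util.*;

// Sketch of parsing CSV address lines into a list of String arrays,
// one array per record. File I/O is omitted; names are illustrative.
public class CsvRecords {
    static List<String[]> parse(List<String> lines) {
        List<String[]> records = new ArrayList<>();
        for (String line : lines) {
            records.add(line.split(",", -1));   // empty components stay empty
        }
        return records;
    }

    public static void main(String[] args) {
        List<String[]> recs = parse(Arrays.asList("jiyoung,shin,,74074"));
        System.out.println(recs.get(0).length); // 4 fields, one of them empty
    }
}
```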
3.2 Evaluation Methods

To accomplish the goal of this study, we need to measure the four record linking methods. We chose three formulae from the TAILOR toolbox [7]: reduction ratio, pair completeness, and F score. First, the reduction ratio calculates how many expensive comparisons are avoided, an expensive comparison being the comparison of every possible field for a pair. In other words, it is one minus the ratio of the number of pairs compared after blocking to the total number of pairs, so a value closer to one is better. It shows how much the blocking methods contribute to reducing expensive comparisons among records. The second measure is pair completeness, which measures how correctly decisions are made in choosing whether record pairs are matched or not. It shows how well the blocking methods and the EM algorithm worked together to reach the right decisions. Pair completeness is the fraction of all duplicates that are correctly classified. For the tradeoff between the two measures, we chose the F score, which measures the harmonic relationship between reduction ratio and pair completeness. These measures have been used in various comparative studies of blocking methods [5, 6, 8]. The formulae used for computation of the measures are given below:

Reduction Ratio (RR) = 1 − Potential duplicates / All pairs

Pair Completeness (PC) = Correctly classified duplicates / Total true duplicates

In our study, we calculate the potential duplicates in the reduction ratio formula by counting only the expensive comparisons (record pair comparisons over every field) remaining after the blocking process. For pair completeness, we created a file with the true matching pairs (true duplicates) for each dataset. We consider a pair matched if more than 80 percent of the total variables agree, and a field agrees if the two strings have a distance value over 0.8 in the Jaro Winkler distance calculation.
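The three measures reduce to a few arithmetic steps, sketched below with toy counts (the counts and class name are illustrative; real values come from the blocking runs). The F score uses the harmonic-mean form 2·RR·PC/(RR+PC) stated in this study.

```java
// Sketch of the three TAILOR-style measures used in the study.
// Inputs here are toy counts; names are illustrative assumptions.
public class Measures {
    static double reductionRatio(long potentialDuplicates, long allPairs) {
        return 1.0 - (double) potentialDuplicates / allPairs;
    }

    static double pairCompleteness(long correctlyClassified, long totalTrueDuplicates) {
        return (double) correctlyClassified / totalTrueDuplicates;
    }

    // Harmonic mean of reduction ratio and pair completeness.
    static double fScore(double rr, double pc) {
        return 2 * rr * pc / (rr + pc);
    }

    public static void main(String[] args) {
        double rr = reductionRatio(2_000, 500_000);   // 0.996
        double pc = pairCompleteness(470, 500);       // 0.94
        System.out.printf("RR=%.3f PC=%.3f F=%.3f%n", rr, pc, fScore(rr, pc));
    }
}
```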
The Jaro Winkler distance method is discussed further in the additional information section. The number of correctly classified duplicates is obtained by comparing our results after the whole process with the true matched pair files, and the total number of true duplicates in the pair completeness formula is the number of pairs in the corresponding true matched pair file.

F score = 2 · RR · PC / (RR + PC)

3.3 Algorithms

For the simulation, we programmed Compare_Blocking_Methods in the Java language. In broad terms, it reads the record files in and calls the four blocking methods with a list of record arrays. For each of the four blocking methods we created a class. The program is run on the four datasets mentioned in the previous section, and the results are compared using the measures described above. Depending on the chosen block type, the corresponding blocking method is called. Each blocking method receives the whole set of records as a Java ArrayList together with the chosen key field or fields. For all four blocking methods, we create a candidate key that is a concatenation of parts of some significant fields. In our study, the candidate key's length is six: the two front letters of each of three fields. For the U.S. mailing address data type, first name, last name, and postcode are chosen as the significant fields. The candidate key is appended to each record as a new field. For Standard_Blocking and Sorted_Neighborhood_Blocking, we sort the whole set of records by the candidate keys using the merge sort algorithm. In Standard_Blocking, the sorted records are partitioned into blocks by a significant variable. In Sorted_Neighborhood_Blocking, we partition the sorted records by putting a window-size number of records into each block sequentially; to form the next block, the first record of the current block slides out and the next record outside the block slides in.
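The candidate-key and sliding-window steps just described can be sketched together. This is an illustrative version under assumed names and field positions, not the study's class; for brevity it uses the library sort and emits index pairs rather than performing the field comparisons.

```java
import java.util.*;

// Sketch of candidate-key creation (two front letters of three fields)
// and sorted-neighborhood window pairing. Field positions, names, and
// the use of the library sort are illustrative assumptions.
public class SortedNeighborhoodSketch {
    // record fields assumed here: {firstName, lastName, postcode}
    static String candidateKey(String[] r) {
        return prefix(r[1]) + prefix(r[0]) + prefix(r[2]);
    }

    static String prefix(String s) {
        return s.length() >= 2 ? s.substring(0, 2) : s; // short/missing fields pass through
    }

    // Pair each record with its w-1 predecessors in sorted-key order.
    static List<int[]> windowPairs(List<String[]> records, int w) {
        records.sort(Comparator.comparing(SortedNeighborhoodSketch::candidateKey));
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            for (int j = Math.max(0, i - w + 1); j < i; j++) {
                pairs.add(new int[]{j, i});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String[]> recs = new ArrayList<>(Arrays.asList(
                new String[]{"jiyoung", "shin", "74074"},
                new String[]{"jiyoun", "shin", "74074"},
                new String[]{"john", "doe", "12345"}));
        System.out.println(candidateKey(recs.get(0))); // shji74
        System.out.println(windowPairs(recs, 2).size()); // 2 adjacent pairs
    }
}
```

Only the pairs inside each window reach the expensive field-by-field comparison stage, which is where the reduction ratio gains come from.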
For both Bigram_Indexing_Blocking and Canopy_Clustering_Blocking, we use the created candidate keys as preformed keys from which to create the actual blocking keys. Using the preformed key of each record, we create the inverted index table as explained in the description of the bigram indexing blocking method. For each preformed key, we find all bigrams, i.e., its substrings of two letters in order, sort them alphabetically, and count them. We decide how many bigrams are used for each key, which is three in our study, by multiplying the total number of bigrams (five: the length of a preformed key minus one) by the given threshold value. The inverted index table keys are all sequential combinations of the calculated number of bigrams. We repeat this process for all records and enter each record id into the contents of its created indexes. In Bigram_Indexing_Blocking, the records in the same column of the table are put into the same block. For Canopy_Clustering_Blocking, the inverted index table is used to calculate the distance among records. We put all the record ids into a vector as the pool. After randomly choosing a center record, we calculate the distance to each other record as the number of index keys it has in common with the center record. We set two threshold values, a loose threshold and a tight threshold. For both thresholds, we use the percentage of the number of common indexes over the maximum possible number of common indexes between a center record and a compared record. The records whose percentage is bigger than the loose threshold are put into the same block; among those, the records whose percentage is also bigger than the tight threshold are removed from the record id pool and are not picked as later center points. We used the Vector type in Java for the center pool. This process is repeated until no candidate record id is left in the vector.
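The canopy loop described above can be sketched as follows. This is a simplified illustration, not the study's implementation: it uses raw common-index counts rather than the percentage form, a deterministic pool order rather than random center picks, and assumed names throughout.

```java
import java.util.*;

// Minimal canopy-clustering sketch using common-index counts as the
// cheap distance metric. The study uses percentage thresholds and
// random center selection; this version simplifies both.
public class CanopySketch {
    static List<Set<Integer>> canopies(Map<Integer, Set<String>> recordKeys,
                                       int loose, int tight) {
        List<Set<Integer>> result = new ArrayList<>();
        Deque<Integer> pool = new ArrayDeque<>(recordKeys.keySet());
        Set<Integer> removed = new HashSet<>();
        while (!pool.isEmpty()) {
            int center = pool.poll();
            if (removed.contains(center)) continue; // no longer a center candidate
            Set<Integer> canopy = new HashSet<>();
            for (int id : recordKeys.keySet()) {
                Set<String> common = new HashSet<>(recordKeys.get(id));
                common.retainAll(recordKeys.get(center));
                int dist = common.size();            // bigger means more similar here
                if (dist >= loose) canopy.add(id);   // joins this canopy
                if (dist >= tight) removed.add(id);  // removed from the center pool
            }
            result.add(canopy);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> keys = new LinkedHashMap<>();
        keys.put(1, new HashSet<>(Arrays.asList("sh", "hj", "ji")));
        keys.put(2, new HashSet<>(Arrays.asList("sh", "hj", "74")));
        keys.put(3, new HashSet<>(Arrays.asList("xx", "yy", "zz")));
        System.out.println(canopies(keys, 1, 2)); // two canopies: {1,2} and {3}
    }
}
```

Note that records between the two thresholds stay in the pool, so they can appear in several canopies, which is exactly what lets one record be blocked with more than one potential match.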
For all of the blocking methods, after dividing the records into blocks, we perform the linkage procedure, i.e., finding matching pairs, within each block. For the linkage procedure we chose the probabilistic approach, which uses comparison vectors and makes the decision matched, unmatched, or undecidable for each comparison vector based on probabilities. To do so we chose the EM algorithm, which is one way to estimate the parameters needed to calculate the probabilistic composite weights used for making decisions. Every pair of records within a block is compared. Each pair is transformed into a comparison vector, an expression of ones and zeroes: when a field of a pair is similar, the value is one, and when it is not similar, the value is zero. We save these comparison vectors in arrays of type double. The comparison vectors are used to calculate the parameters mi, ui, and p with the EM algorithm; these parameters are related to the weight of each field. We pass the comparison vector arrays into the EM_Algorithm class. We set the initial values as mi = 0.8 and ui = 0.2 for every field and p = 0.5. As Jaro stated in [4], the EM algorithm is the least sensitive to the initial values; the only requirement is that mi be bigger than ui. We then plug these initial values into the formulas to calculate gm and gu, and the calculated gm and gu are used to recalculate mi, ui, and p. We compare the previous mi, ui, and p with the newly calculated ones. If the difference is bigger than a threshold value, which is 0.0000000005 in our study, we repeat the procedure until the difference becomes less than the threshold. After the parameter estimation, these values are used to calculate the composite weights for all comparison vectors. The calculated mi and ui are the probabilities that a field agrees (i.e., its comparison value is one) when a pair is matched and when it is unmatched, respectively.
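The iteration just described can be sketched as below. This is an illustrative implementation of the standard record-linkage EM under the conditional-independence assumption stated earlier, not the study's EM_Algorithm class; the initial values and convergence threshold mirror the ones given above, and the toy data in main is invented. It assumes both classes remain represented (the responsibility sum stays strictly between 0 and n).

```java
// Sketch of EM estimation for the parameters m_i, u_i, p from
// comparison vectors, under conditional independence of fields.
public class EmAlgorithm {
    double[] m, u;   // P(field i agrees | matched), P(field i agrees | unmatched)
    double p;        // P(a pair is matched)

    EmAlgorithm(int fields) {
        m = new double[fields];
        u = new double[fields];
        java.util.Arrays.fill(m, 0.8);   // initial values from the study
        java.util.Arrays.fill(u, 0.2);
        p = 0.5;
    }

    void fit(double[][] gamma, double eps) {
        int n = gamma.length, f = m.length;
        double diff;
        do {
            // E step: g_m[j] = P(matched | gamma_j); g_u[j] = 1 - g_m[j]
            double[] gm = new double[n];
            for (int j = 0; j < n; j++) {
                double pm = p, pu = 1 - p;
                for (int i = 0; i < f; i++) {
                    pm *= gamma[j][i] == 1 ? m[i] : 1 - m[i];
                    pu *= gamma[j][i] == 1 ? u[i] : 1 - u[i];
                }
                gm[j] = pm / (pm + pu);
            }
            // M step: re-estimate m_i, u_i, p from the responsibilities
            double sumGm = 0;
            for (double g : gm) sumGm += g;
            double[] newM = new double[f], newU = new double[f];
            for (int i = 0; i < f; i++) {
                for (int j = 0; j < n; j++) {
                    newM[i] += gm[j] * gamma[j][i];
                    newU[i] += (1 - gm[j]) * gamma[j][i];
                }
                newM[i] /= sumGm;
                newU[i] /= (n - sumGm);
            }
            double newP = sumGm / n;
            diff = Math.abs(newP - p);
            for (int i = 0; i < f; i++) {
                diff = Math.max(diff, Math.abs(newM[i] - m[i]));
                diff = Math.max(diff, Math.abs(newU[i] - u[i]));
            }
            m = newM; u = newU; p = newP;
        } while (diff > eps);
    }

    public static void main(String[] args) {
        // toy comparison vectors: two likely matches, two likely nonmatches
        double[][] gamma = {{1,1,1},{1,1,0},{0,0,0},{0,1,0}};
        EmAlgorithm em = new EmAlgorithm(3);
        em.fit(gamma, 5e-10);
        System.out.printf("p=%.3f%n", em.p);
    }
}
```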
To calculate the composite weight for a comparison vector, say the three-field vector "110", we add the log of m0 minus the log of u0, add the log of m1 minus the log of u1, and add the log of (1 − m2) minus the log of (1 − u2). The whole set of comparison vectors is sorted by the calculated composite weights in descending order. There are three categories into which a pair can be classified: matched, unmatched, and undecided. In our study, we are concerned only with the matched and unmatched decisions, because we do not want to do any manual comparisons and we are comparing the efficiency of the blocking methods under the same conditions. We use the probability of a true match for a comparison vector, P(γj | M), and the probability of a true nonmatch, P(γj | U), for decision making. Proceeding from the comparison vector with the highest composite weight to the lowest, if the true match probability is bigger than the true nonmatch probability for a comparison vector, we consider the pairs with that comparison vector as matched pairs; at the first comparison vector that does not satisfy this condition, we stop examining further comparison vectors and consider the rest of the pairs unmatched. We then print the matched pairs into a file, and we evaluate the results.
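The composite-weight computation for one vector can be sketched as follows. This is an illustration with assumed names and invented m/u values; the log cap shown is one way of applying the large finite constant (30) that the study uses in place of infinite weights.

```java
// Sketch of the composite-weight calculation for one comparison vector,
// using the log-ratio form described above. Names and the sample m/u
// values are illustrative.
public class CompositeWeight {
    static double weight(double[] gamma, double[] m, double[] u) {
        double w = 0;
        for (int i = 0; i < gamma.length; i++) {
            double wi = gamma[i] == 1
                    ? safeLog(m[i]) - safeLog(u[i])          // field agrees
                    : safeLog(1 - m[i]) - safeLog(1 - u[i]); // field disagrees
            w += wi;
        }
        return w;
    }

    static double safeLog(double x) {
        // cap log(0) at a large finite magnitude instead of -infinity
        return x <= 0 ? -30 : Math.log(x);
    }

    public static void main(String[] args) {
        double[] m = {0.9, 0.9, 0.9};
        double[] u = {0.1, 0.1, 0.1};
        // two agreeing fields add log(9) each, one disagreeing field
        // subtracts log(9), leaving log(9) ≈ 2.1972
        System.out.printf("%.4f%n", weight(new double[]{1, 1, 0}, m, u));
    }
}
```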
The important algorithms are listed below:

Compare_Blocking_Methods()
Input: one or two record files and blocking type
Output: a set of matched record pairs
  Read files
  If blocking type is standard blocking then
    Call Standard_Blocking()
  Else if blocking type is sorted neighborhood blocking then
    Call Sorted_Neighborhood_Blocking()
  Else if blocking type is bigram blocking then
    Call Bigram_Indexing_Blocking()
  Else if blocking type is canopy clustering then
    Call Canopy_Clustering_Blocking()
  End If
  Evaluate the results from the blocking methods

Algorithm 1 Compare_Blocking_Methods Algorithm

Standard_Blocking()
Input: records and key field
Output: matched pairs
  Create candidate keys
  Sort records by candidate keys
  Calculate similarity between keys
  Divide records into blocks depending on similarities of key contents
  Pair up all records within each block
  Transform record pairs into comparison vectors
  Calculate composite weights using the EM algorithm within each block
  Make decision whether record pairs are matched or not
  Print matched pairs

Algorithm 2 Standard_Blocking Algorithm

Sorted_Neighborhood_Blocking()
Input: records
Output: matched pairs
  Create candidate keys
  Sort records by candidate keys
  Set window size as block size
  Divide records into window-size blocks
  Calculate composite weights using the EM algorithm within each block
  Make decision whether record pairs are matched or not
  Print matched pairs

Algorithm 3 Sorted_Neighborhood_Blocking Algorithm

Bigram_Indexing_Blocking()
Input: records
Output: matched pairs
  Create preformed keys
  Create blocking keys
  Create inverted index table
  Put records in the same column of the inverted index table into one block
  Calculate composite weights using the EM algorithm within each block
  Make decision whether record pairs are matched or not
  Print matched pairs

Algorithm 4 Bigram_Indexing_Blocking Algorithm

Canopy_Clustering_Blocking()
Input: records
Output: matched pairs
  Create preformed keys
  Create blocking keys
  Create inverted index table
  Calculate distance
among records using the inverted index table
  Set threshold values
  Partition records into blocks according to the distances among them
  Calculate composite weights using the EM algorithm within each block
  Make decision whether record pairs are matched or not
  Print matched pairs

Algorithm 5 Canopy_Clustering_Blocking Algorithm

EM_Algorithm()
Input: record pairs represented as ones and zeroes (comparison vectors); one for a matched field and zero for an unmatched field
Output: m_i, u_i, p
  Initialize m_i, u_i, p values
  BEGIN_EM:
    IF old and new estimated values are similar
      Goto END_EM
    EXPECTATION:
      Compute g_m
      Compute g_u
    MAXIMIZATION:
      Compute m_i
      Compute u_i
      Compute p
    Compare old and new estimated m_i, u_i, and p values
    Goto BEGIN_EM
  END_EM:

Algorithm 6 EM_Algorithm

3.4 Additional Information

In this paragraph, we discuss string comparison. In the blocking methods and the record linkage procedure, we need to compare strings. When we partition records into blocks, string comparison is needed in the standard blocking method because we block records with the same candidate keys. We compare numbers the same way as other strings. We also need string comparison when we create the comparison vectors used in the EM algorithm and the composite weight calculation. When creating the comparison vectors, we assign a one when two strings are merely similar, to give some allowance for misspelled fields. For the string comparison, we use the Jaro Winkler distance method, a string similarity measure based on matching characters and transpositions between two strings [13]. In our study, an exact match requires a Jaro Winkler distance value of one, and for a similar match we use a threshold value of 0.8: if the Jaro Winkler distance value is over 0.8, we consider the two strings similar. We use an already developed Jaro Winkler distance implementation written in Java, downloadable from [15]. In this paragraph, we discuss topics related to weight calculations.
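The Jaro Winkler similarity used for the 0.8 threshold above can be sketched as follows. This is an independent illustrative re-implementation, not the downloaded library the study actually uses.

```java
// Sketch of the Jaro-Winkler similarity used for the 0.8 field
// threshold. Independent re-implementation for illustration only.
public class JaroWinkler {
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int range = Math.max(s1.length(), s2.length()) / 2 - 1;
        boolean[] used1 = new boolean[s1.length()], used2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {           // count matching characters
            int lo = Math.max(0, i - range), hi = Math.min(s2.length() - 1, i + range);
            for (int j = lo; j <= hi; j++) {
                if (!used2[j] && s1.charAt(i) == s2.charAt(j)) {
                    used1[i] = used2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        int t = 0, k = 0;                                  // count half-transpositions
        for (int i = 0; i < s1.length(); i++) {
            if (!used1[i]) continue;
            while (!used2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) t++;
            k++;
        }
        double m = matches;
        return (m / s1.length() + m / s2.length() + (m - t / 2.0) / m) / 3.0;
    }

    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int prefix = 0;                                    // common prefix, capped at 4
        while (prefix < Math.min(4, Math.min(s1.length(), s2.length()))
                && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1 - j);
    }

    public static void main(String[] args) {
        System.out.printf("%.4f%n", jaroWinkler("MARTHA", "MARHTA")); // 0.9611
        System.out.println(jaroWinkler("shin", "shim") > 0.8);        // similar under 0.8 rule
    }
}
```

Under the 0.8 rule, "shin" and "shim" would count as an agreeing field despite the typo, which is exactly the allowance described above.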
As we stated in the previous chapter, we assume all the variables are independent. We calculate the weights with the values from the EM algorithm. The EM calculation produces an mi value and a ui value. However, we use only the mi value from the result and take ui as one minus mi, because the ui values calculated by the EM algorithm leave out the unmatched pairs that fall into no block when we consider all record pairs. For convenience in the composite weight process, we take logarithms in the weight calculation formula, so we add the log of the mi value and subtract the log of the ui value, since the weight for the ith field is mi/ui when the ith comparison vector field is one and (1 − mi)/(1 − ui) when it is zero. In this case, if ui = 0, mi = 0, 1 − ui = 0, or 1 − mi = 0, then ωi = ∞ or −∞. In those cases, we substitute a certain big number, 30 in our study, to be able to calculate the weight for each record. The first formula below calculates the weight for one field (for the agreeing case) and the second calculates the composite weight from the field weights:

ωi = log(mi) − log(ui)

ω = ω1 + ω2 + ... + ωn

In this paragraph, we explain what we did to improve Standard_Blocking. As we stated in the previous chapter, records or entities cannot be identified by only one field. Therefore, in the standard blocking method, we cannot depend on only one key field for blocking records. Other papers suggest running the programs in multiple passes [7, 8]. False nonmatches can be reduced by multiple passes with different blocking variables for each pass.
In our study, to avoid multiple passes, we create a candidate key field by combining the first two letters of three selected fields rather than choosing a single variable as the candidate key, which reduces the false nonmatches; in the Standard_Blocking method we also partition records into blocks not only by exactly matching candidate keys but by similar keys, as judged by the distance values of the Jaro Winkler distance method. In addition to sorting the whole set of records by the created candidate key, we partition the sorted records according to a significant field that is different from the key used for sorting, grouping records whose values for that field are similar as judged by the Jaro Winkler distance. Instead of multiple passes over different variables, the standard blocking method thus uses two different kinds of variables, one for sorting and one for blocking. In this paragraph, we discuss how we implemented the distance comparisons in Canopy_Clustering_Blocking. In the canopy clustering blocking method, we need to calculate the distance among record points, that is, to count how many created indexes of the inverted index table they have in common. In our study, we created three hash tables to simplify the distance calculation: one with a record id as the key and that record's inverted index table indexes as the contents; one with a record id as the key and the record array as the contents; and the inverted index table itself (a created inverted index as the key and the ids of the records corresponding to that index as the contents). When a record is randomly selected as a center to calculate the distance, we get the created inverted indexes for the center record from the first hash table. From the center record, we calculate distances only for the records that have at least one inverted index in common with the center record.
We track the record ids stored as contents of the center record's inverted indexes in the inverted index table and increase the distance for each common inverted index. If a record has more common inverted indexes than the loose threshold value, then we put the record into the same block as the center record.

Figure 5 Implementation Flow Chart

CHAPTER IV

SIMULATIONS

4.1 Environment

In record linkage, we need to quickly match a large number of records between two databases. Records are grouped into blocks to be matched efficiently. Especially for large datasets, a specific algorithm, including a blocking method with its parameters, is necessary to optimize the results. In order to evaluate the effectiveness and efficiency of the blocking methods, we implement four types of blocking for data linkage, standard blocking, sorted neighborhood blocking, bigram indexing blocking, and canopy clustering blocking, with four different sizes of datasets. In running these simulations, a parameter must be estimated for each blocking method to get the best results: i) for the standard blocking method, we need to fix the Jaro Winkler distance that decides how similar candidate keys must be for records to be put into the same block; ii) for the sorted neighborhood blocking method, we need to set the window size; iii) for the bigram indexing blocking method, we have to fix the threshold value that decides how many bigrams are combined to create the inverted index keys; iv) for the canopy clustering blocking method, we need to fix the tight and loose threshold values used in blocking records by the distances of record points from a center point. In this study, we used the dataset of size 1000 to estimate the parameters for each blocking method. Tables 1 through 4 present the measurement results for each blocking method combined with the EM algorithm under various parameters.
Table 1 represents the measurements of the standard blocking method as the distance parameter varies from 0.5 to 1.

Table 1 Standard Blocking Method Measurement by Distance

Distance          | 1           | 0.9         | 0.8       | 0.7      | 0.6      | 0.5
Reduction Ratio   | 0.998288    | 0.998108    | 0.998084  | 0.997112 | 0.99006  | 0.978084
Pair Completeness | 0.856       | 0.946       | 0.948     | 0.954    | 0.972    | 0.992
F score           | 0.921684795 | 0.971355674 | 0.9723975 | 0.97508  | 0.980947 | 0.984993

As shown in Table 1, distances 0.5 and 0.6 are better than the other parameter values. We choose 0.6 as the distance parameter, because 0.5 could produce larger blocks than 0.6 on the bigger datasets.

Table 2 represents the measurement of the sorted neighborhood blocking method, varying the window size used for partitioning records as a parameter from 10 to 30.

Table 2 Sorted Neighborhood Blocking Method Measurement by Window Size

Window Size       | 30          | 20          | 15        | 10
Reduction Ratio   | 0.126864    | 0.608284    | 0.779288  | 0.901492
Pair Completeness | 0.996       | 0.994       | 0.988     | 0.986
F score           | 0.225061172 | 0.754715514 | 0.8713198 | 0.941854

In this study, the window size is set to 15 for the sorted neighborhood blocking method: even though 10 got a better F score than 15, a window of 10 may miss matched pairs in the same block and so produce too many false unmatched pairs on the bigger datasets.

In Table 3, the threshold value used in creating the bigram inverted indexes varies from 0.4 to 0.8.

Table 3 Bigram Indexing Blocking Method Measurement by Threshold Values

Threshold         | 0.4         | 0.6         | 0.8
Reduction Ratio   | 0.9652      | 0.979124    | 0.990244
Pair Completeness | 1           | 0.996       | 0.974
F score           | 0.982291879 | 0.987489903 | 0.9820548

We choose the threshold value 0.6 as the parameter, because its F score is the best.

Table 4 represents the measurement results of the canopy clustering blocking method as the tight and loose threshold values change. The tight threshold value runs between 0.6 and 0.9 and the loose threshold value runs between 0.4 and 0.6, always less than the tight threshold value.

Table 4 Canopy Clustering Blocking Method Measurement by Threshold Values (tight_loose)

Threshold         | 0.6_0.4  | 0.6_0.5  | 0.7_0.4  | 0.7_0.6  | 0.8_0.4  | 0.8_0.6  | 0.9_0.4  | 0.9_0.6
Reduction Ratio   | 0.99768  | 0.998076 | 0.99768  | 0.998076 | 0.997672 | 0.998068 | 0.997672 | 0.998068
Pair Completeness | 0.97     | 0.958    | 0.97     | 0.958    | 0.97     | 0.958    | 0.97     | 0.958
F score           | 0.983645 | 0.977627 | 0.983645 | 0.977627 | 0.983641 | 0.977624 | 0.983641 | 0.977624

In this study, we selected 0.4 for the loose threshold and 0.7 for the tight threshold. This combination scores well on the evaluation measures and is less likely than other values to produce oversized blocks on huge datasets.

4.2 Simulations

As mentioned in the previous section, we use the three measurements reduction ratio, pair completeness, and F score to compare the efficiency of the four blocking methods combined with the EM algorithm. The four blocking methods to be compared are as follows. First, the standard blocking method sorts records by a created candidate key and blocks the sorted records by a significant variable; in our study we used first name. Second, the sorted neighborhood blocking method blocks the sorted records into windows of a fixed size. Third, the bigram indexing blocking method blocks records that fall in the same column of the inverted index table. Lastly, the canopy clustering blocking method blocks records by measuring their distances from each other.

The graphs in Figures 6, 7 and 8 show the simulation results. In the graphs, the horizontal axis represents the size of the dataset used in the simulation, 1000, 2500, 5000, or 10000, and the vertical axis represents the value of each measure: reduction ratio, pair completeness, or F score. The values of all three measurements range from zero to one, and values closer to one are better. For the reduction ratio, a value closer to one means fewer comparisons are carried out relative to the maximum total number of comparisons. For pair completeness, a value closer to one means more true matched pairs are caught by the blocking and EM algorithm methods.
For the F score, values closer to one mean the harmonic relationship between reduction ratio and pair completeness is better.

Figure 6 Reduction Ratio

Figure 7 Pair Completeness

Figure 8 F score

4.3 Results

In the reduction ratio graph in Figure 6, the sorted neighborhood blocking method gets the lowest values, but its values increase as the data size increases. The bigram indexing method has fairly good results, mostly over 0.98, with only a small influence from data size. The standard blocking method and the canopy clustering method get the highest values regardless of data size. In the pair completeness graph in Figure 7, the bigram indexing blocking method gets the best result regardless of database size, which means it correctly predicts more matched pairs than the other methods. Except for the standard blocking method, the methods get good results, with values over 0.95. In the F score graph in Figure 8, the bigram indexing blocking method shows the strongest result, and the canopy clustering method also gives a strong result regardless of database size. The standard blocking method gives a fairly good outcome regardless of data size, with values over 0.96. For the sorted neighborhood blocking method, the results improve as the dataset gets bigger.

CHAPTER V

CONCLUSIONS AND FUTURE WORKS

We have presented the entity matching technology record linkage with blocking methods plus the EM algorithm to speed up the matching process for a large number of records.
For speedup purposes, records should be grouped into blocks effectively and efficiently. The EM algorithm is known to be more stable than the other methods for estimating the parameters needed in the probabilistic record linking process. In this study, three measurements, reduction ratio, pair completeness, and F score, are used to compare the efficiency of four blocking methods combined with the EM algorithm: 1) the standard blocking method, 2) the sorted neighborhood blocking method, 3) the bigram indexing blocking method, and 4) the canopy clustering blocking method. Simulation results showed that the bigram indexing blocking method is the most reliable of the blocking methods regardless of database size. The canopy clustering blocking method is also reliable. The effectiveness of the sorted neighborhood blocking method depends on the size of the data. The standard blocking method, even after several modifications to improve it, still gets worse results than the other blocking methods. The reason could be that standard blocking allows a record to be in only one block, unlike the other blocking methods, which allow a record to be in multiple blocks. The key point of combining a blocking method with the EM algorithm is putting a record into as many blocks as possible that have a high probability of containing a matching record, which reduces false nonmatches and increases pair completeness. The blocking methods must be able to find only the blocks with such high probability, in order to reduce expensive comparisons among records, increase the reduction ratio, and support better parameter estimation in the EM algorithm. We also need enough records within each block to estimate the parameters of the EM algorithm well and increase the F score. In the future, we will propose a blocking method based on involving various numbers of variables in the blocking methods plus the EM algorithm, because the EM algorithm involves parameter estimation both pairwise and fieldwise.
In addition, we can work on Record Linkage for multiple matching records, not only one match for each entity.
VITA

JIYOUNG SHIN

Candidate for the Degree of Master of Science

Thesis: COMPARATIVE STUDY ON BLOCKING METHODS IN RECORD LINKAGE

Major Field: Computer Science

Biographical:

Personal Data: Born in the city of Ulsan, Republic of Korea

Education: Completed the requirements for the Master of Science in Computer Science at Oklahoma State University, Stillwater, Oklahoma, in December, 2009. Completed the requirements for the Bachelor of Science in Computer Science at West Texas A&M University, Canyon, Texas, in May, 2004.

Experience: Published the paper "Exploratory Data Analysis with Bivariate Dependence Functions", MSV 2006, pp. 217–223.

Professional Memberships:

ADVISER'S APPROVAL: Dr. K. M. George

Name: JIYOUNG SHIN   Date of Degree: December, 2009

Institution: Oklahoma State University   Location: Stillwater, Oklahoma

Title of Study: COMPARATIVE STUDY ON BLOCKING METHODS IN RECORD LINKAGE

Pages in Study: 46   Candidate for the Degree of Master of Science

Major Field: Computer Science

Scope and Method of Study: Record linkage methods match related records from different datasets. To speed up the process, records should be grouped into blocks efficiently by a blocking method, and the parameters of the linkage model must be estimated. The EM Algorithm is more stable than other methods for estimating the parameters needed for the probabilistic record linking process.
In this study, three measurements, reduction ratio, pair completeness, and F score, are used to compare the efficiency of four blocking methods combined with the EM Algorithm. The blocking methods are 1) the Standard Blocking Method, 2) the Sorted Neighborhood Blocking Method, 3) the Bigram Indexing Blocking Method, and 4) the Canopy Clustering Blocking Method. The measures are obtained via simulation.

Findings and Conclusions: In the simulation, we find that the bigram indexing blocking method is the most reliable of the blocking methods, regardless of database size. The canopy clustering blocking method is also reliable. The effectiveness of the sorted neighborhood blocking method depends on the size of the data. In spite of several modifications to improve it, the standard blocking method still gets worse results than the other blocking methods.


