



ADAPTING MASKING TECHNIQUES FOR ESTIMATION PROBLEMS INVOLVING NON-MONOTONIC RELATIONSHIPS IN PRIVACY-PRESERVING DATA MINING

By

MOHAMMAD SAAD ALAHMADI

Bachelor of Science in Computer Science
King Fahd University of Petroleum and Minerals
Dhahran, Saudi Arabia
1996

Master of Science in Telecommunications Management
Oklahoma State University
Stillwater, OK
2001

Submitted to the Faculty of the Graduate College of Oklahoma State University in partial fulfillment of the requirements for the Degree of DOCTOR OF PHILOSOPHY
July, 2006

ADAPTING MASKING TECHNIQUES FOR ESTIMATION PROBLEMS INVOLVING NON-MONOTONIC RELATIONSHIPS IN PRIVACY-PRESERVING DATA MINING

Dissertation Approved:

Dr. Rathindra Sarathy
Dr. Ramesh Sharda
Dr. Dursun Delen
Dr. Goutam Chakraborty
Dr. A. Gordon Emslie, Dean of the Graduate College

ACKNOWLEDGMENTS

In the name of Allah, the Most Merciful, the Most Compassionate.

Thanks, glorification and praises to Almighty God (Allah) for His uncountable favors and bounties upon me and my family. Among them are His assistance in finishing my doctoral program successfully and the blessing of a good, supportive, motivational Ph.D. committee chaired by Dr. Rathindra Sarathy, to whom I owe all respect and admiration. Dr. Sarathy was always there for me when I needed him the most. In our fruitful meetings and discussions, I learned from him what I would not have learned from any other source or setting. My high appreciation and thanks are extended to all other members of my Ph.D. committee: Drs. Ramesh Sharda, Goutam Chakraborty and Dursun Delen. The courses they taught and the ideas they raised were the best toolkit for finishing my Ph.D. journey in an enjoyable and successful manner.

I would also like to thank and show the deepest appreciation for the dearest and closest souls to my soul, my parents, Saad and Hamida, who taught me that there is no limit to success, and that bigger ideas and goals work best in this life.
They never stopped supporting and encouraging me and always kept me in their prayers. If I live the rest of my life as a servant at their feet, I will still have repaid them nothing for what they have done for me. My thanks are extended to my beloved family: Maha, Asim, Shymaa, Ammar, Yasser, and Saad. Without their unconditional and unlimited patience, love, understanding and support, none of what I have done would have been possible. This is not my work alone; it is the fruit of our collaborative work as a family. Congratulations to all of us on our graduation.

I am also grateful and thankful to my brothers, sisters, and all of my relatives and friends in the USA and in Saudi Arabia who supported me and did not forget me in their good prayers. Many of them shared the joy of graduation with me face-to-face, by phone or by email. Knowing how long the list would be prevents me from listing individual names.

Additionally, I would like to thank Dr. Ali Amiri, my professor and friend, for his support and encouragement during my doctoral studies, as well as my research partners, Drs. Ramesh Sharda, Rick Wilson, and Peter Rosen, for the fruitful, cooperative effort that resulted in several publications. In addition, I would like to thank all faculty, staff, and my peers and colleagues, especially my friends Ashish Gupta and Han Li, for the professional academic environment they created in the Department of Management Science and Information Systems (MSIS) in the William S. Spears School of Business at Oklahoma State University. I would also like to acknowledge and thank Dr. Andrew Robinson for developing and providing the equivalence package for running (model-validation) equivalence tests. Last but not least, I would like to thank the lovely secretary of our department, Tane Kester, for the tremendous amount of time and effort she put into editing this manuscript.
TABLE OF CONTENTS

Chapter ........ Page

Acknowledgments ........ iii
Table of Contents ........ v
List of Tables ........ ix
List of Figures ........ xiii
Acronyms and Abbreviations ........ xix
Mathematical Symbols and Notations ........ xxii

I. INTRODUCTION ........ 1
   I.1. PPE Problem – Definition and Research Scope ........ 5
   I.2. PPE Problem – Importance and Requirements ........ 7
   I.3. Motivation Example ........ 11
   I.4. Summary and Outline ........ 14
II. LITERATURE REVIEW ........ 16
   II.1. Related Work in Privacy-Preserving Data Mining (PPDM) ........ 16
      II.1.1. Review of Privacy-Preserving Estimation (PPE) Literature ........ 18
      II.1.2. Data-Centric Approach (DCA) for Privacy-Preserving Data Mining ........ 21
   II.2. Related Work in Masking Methods ........ 23
      II.2.1. Advantages of Using Masking Methods for PPE ........ 27
      II.2.2. Optimal Masking Methods ........ 28
         II.2.2.1. Conditional Independence Theory ........ 28
         II.2.2.2. Practical Limitation of the Optimal Procedure ........ 31
      II.2.3. Recent Masking Methods ........ 32
   II.3. Impact of Current Masking Methods on Non-Monotonic Relationships ........ 35
      II.3.1. Challenges in Developing Practical PPE Masking Methods for Non-Monotonic Relationships ........ 37
   II.4. Research Questions ........ 38
III. RELATIONSHIP-BASED MASKING: THEORETICAL BASIS ........ 40
   III.1. Conditional Expectation: The Formal Definition of Relationships in PPE ........ 41
   III.2. Artificial Neural Network (ANN) Approaches for Estimating Conditional Expectations ........ 44
   III.3. The Roles of the Residuals (r) ........ 47
      III.3.1. Role of Residuals in Defining Relationships among Attributes ........ 47
      III.3.2. Role of Residuals in Guiding Security Requirements ........ 54
      III.3.3. Summary ........ 56
IV. IMPLEMENTATION OF RELATIONSHIP-BASED MASKING ........ 57
   IV.1. NMEGADP Perturbation and NMEGADP Shuffling Masking Methods ........ 57
   IV.2. Assumptions of the Relationship-Based Masking Methods ........ 65
V. ASSESSMENT MEASURES FOR THE RBM APPROACH ........ 68
   V.1. Data Security Measures in PPE ........ 68
   V.2. Data Utility Measures in PPE ........ 69
   V.3. Equivalence Tests for Validating Models as Data Utility Measures ........ 75
VI. ASSESSING THE RBM APPROACH WHEN THE RELATIONSHIPS AMONG CONFIDENTIAL ATTRIBUTES ARE LINEAR ........ 84
   VI.1. Illustration of the Effectiveness of the NMEGADP Approach using the Motivation Example ........ 84
      VI.1.1. Data Utility ........ 85
      VI.1.2. Data Security ........ 87
   VI.2. How a Snooper May Compromise RBM Masked Data ........ 92
   VI.3. How the Characteristics of Original Datasets Determine the Characteristics of Masked Attributes ........ 99
   VI.4. Assessing the Impact of the Characteristics of Original Datasets on the Effectiveness of the RBM Approach ........ 102
      VI.4.1. Original Datasets and their Characteristics ........ 102
      VI.4.2. Data Utility ........ 117
      VI.4.3. Data Security ........ 123
VII. ASSESSING THE RBM DATA UTILITY WHEN THE RELATIONSHIPS AMONG CONFIDENTIAL ATTRIBUTES ARE NONLINEAR ........ 133
   VII.1. Datasets ........ 135
   VII.2. Data Utility ........ 138
      VII.2.1. Relationship 1: E(X1|S)s vs. E(Y1|S)s ........ 140
      VII.2.2. Relationship 2: E(X2|S)s vs. E(Y2|S)s ........ 144
      VII.2.3. Relationship 3: E(X1|X2)x2 vs. E(Y1|Y2)x2 ........ 147
      VII.2.4. Relationship 4: E(X2|X1)x1 vs. E(Y2|Y1)x1 ........ 151
      VII.2.5. Relationship 5: E(X1|S,X2)sx2 vs. E(Y1|S,Y2)sx2 ........ 155
      VII.2.6. Relationship 6: E(X2|S,X1)sx1 vs. E(Y2|S,Y1)sx1 ........ 159
   VII.3. Summary ........ 162
VIII. CONCLUSIONS ........ 164
   VIII.1. Main Findings and Conclusions ........ 164
      VIII.1.1. Data Utility ........ 164
      VIII.1.2. Data Security ........ 168
   VIII.2. Possible Opportunities and Limitations ........ 170
   VIII.3. Contributions of this Study ........ 172
REFERENCES ........ 175
APPENDICES ........ 189
   Appendix A – SDL/Relationship Match Framework ........ 189
   Appendix B – The Relationship between the Covariance Matrices of Confidential Attributes X, Conditional Expectations E(X|S) and Residuals r ........ 192
   Appendix C – Relationship-Based NMEGADP Masking Algorithms ........ 199
   Appendix D – Extended Results Related to the Motivation Example ........ 207
   Appendix E – Graphical Pilot Study – Comparisons for PPE Masking Methods ........ 221
   Appendix F – Dataset: NM.01 (Check Mark) ........ 226
   Appendix G – Dataset: NM.02 ........ 230
   Appendix H – Dataset: NM.03 ........ 234
   Appendix I – Dataset: NM.04 ........ 238
   Appendix J – Dataset: NM.05 ........ 242
   Appendix K – Dataset: MNL.01 ........ 246
   Appendix L – Dataset: MNL.02 ........ 250
   Appendix M – Dataset: MNL.03 ........ 254
   Appendix N – Dataset: NM.01.S1 – Check Mark Dataset with One S with Non-Monotonic Relationships among Residuals ........ 258
   Appendix O – Dataset: ME.L.S1 – Motivation Example with One S ........ 262
   Appendix P – Important Results Relating the Characteristics of Masked Attributes to the Characteristics of Original Data ........ 266

LIST OF TABLES

Table ........ Page

Table 1. First 5 and last 5 records of the store dataset (motivation example) ........ 12
Table 2. Possible different combinations of relationships to be maintained when a confidential attribute is a dependent variable in the store motivation example (2 X and 2 S); k is the model number ........ 72
Table 3. Prediction-based (MSE) data utility and data security measures for the NMEGADP perturbed store dataset ........ 86
Table 4. First 5 and last 5 records of the two-variable dataset, to check the piecewise CC security of NMEGADP masking procedures ........ 91
Table 5. Var(X1) = Var(u1) + Var(r1) ........ 106
Table 6. Var(X2) = Var(u2) + Var(r2) ........ 106
Table 7. Cov(X1,X2) = Cov(u1,u2) + Cov(r1,r2) ........ 107
Table 8. Covariance between X1 and Y1 and its relationship to the variance of u1 ........ 107
Table 9. Covariance between X2 and Y2 and its relationship to the variance of u2 ........ 107
Table 10. Calculating the regression coefficients of X1 = b0 + b1Y1 based on the characteristics of original data ........ 108
Table 11. Calculating the regression coefficients of X2 = b0 + b1Y2 based on the characteristics of original data ........ 108
Table 12. Equivalence Tests using Linear Regression to Calculate the Compared Fitted Values ........ 121
Table 13. Equivalence Tests using LSSVM to Calculate the Compared Fitted Values ........ 122
Table 14. Parameter-Based data utility measures for E(X1|X2) vs. E(Y1|Y2) ........ 123
Table 15. Possible snooper scenario to compromise X1 ........ 126
Table 16. Possible snooper scenario to compromise X2 ........ 127
Table 17. Canonical correlation security measures (CC) using the best predictor E(X|S) for the three ME-related datasets ........ 128
Table 18. Corr(X1,Y1): its upper bound and its relationship to Corr(X1, E(X1|S)) and the security index (SI) ........ 128
Table 19. Corr(X2,Y2): its upper bound and its relationship to Corr(X2, E(X2|S)) and the security index (SI) ........ 128
Table 20. Dataset characteristics ........ 137
Table 21. Percentage of minimum equivalence regions of significant equivalence tests for mean and slope of 1 for relationships (fitted values) E(X1|S)s vs. E(Y1|S)s ........ 140
Table 22. Slope of linear regression of relationships (fitted values) E(X1|S)s vs. E(Y1|S)s ........ 143
Table 23. R2 of linear regression and correlation of relationships (fitted values) E(X1|S)s vs. E(Y1|S)s ........ 143
Table 24. Percentage of minimum equivalence regions of significant equivalence tests for mean and slope of 1 for relationships (fitted values) E(X2|S)s vs. E(Y2|S)s ........ 146
Table 25. Slope of linear regression of relationships (fitted values) E(X2|S)s vs. E(Y2|S)s ........ 147
Table 26. R2 of linear regression and correlation of relationships (fitted values) E(X2|S)s vs. E(Y2|S)s ........ 147
Table 27. Percentage of minimum equivalence regions of significant equivalence tests for mean and slope of 1 for relationships (fitted values) E(X1|X2)x2 vs. E(Y1|Y2)x2 ........ 148
Table 28. Slope of linear regression of relationships (fitted values) E(X1|X2)x2 vs. E(Y1|Y2)x2 ........ 150
Table 29. R2 of linear regression and correlation of relationships (fitted values) E(X1|X2)x2 vs. E(Y1|Y2)x2 ........ 151
Table 30. Percentage of minimum equivalence regions of significant equivalence tests for mean and slope of 1 for relationships (fitted values) E(X2|X1)x1 vs. E(Y2|Y1)x1 ........ 152
Table 31. Slope of linear regression of relationships (fitted values) E(X2|X1)x1 vs. E(Y2|Y1)x1 ........ 154
Table 32. R2 of linear regression and correlation of relationships (fitted values) E(X2|X1)x1 vs. E(Y2|Y1)x1 ........ 155
Table 33. Percentage of minimum equivalence regions of significant equivalence tests for mean and slope of 1 for relationships (fitted values) E(X1|S,X2)sx2 vs. E(Y1|S,Y2)sx2 ........ 156
Table 34. Slope of linear regression of relationships (fitted values) E(X1|S,X2)sx2 vs. E(Y1|S,Y2)sx2 ........ 158
Table 35. R2 of linear regression and correlation of relationships (fitted values) E(X1|S,X2)sx2 vs. E(Y1|S,Y2)sx2 ........ 158
Table 36. Percentage of minimum equivalence regions of significant equivalence tests for mean and slope of 1 for relationships (fitted values) E(X2|S,X1)sx1 vs. E(Y2|S,Y1)sx1 ........ 159
Table 37. Slope of linear regression of relationships (fitted values) E(X2|S,X1)sx1 vs. E(Y2|S,Y1)sx1 ........ 161
Table 38. R2 of linear regression and correlation of relationships (fitted values) E(X2|S,X1)sx1 vs. E(Y2|S,Y1)sx1 ........ 162
Table 39. MDA classification example using the motivation example dataset and its masked copies ........ 172
Table 40. Hypothetical example demonstrating "Shuffle A by B" for generating a shuffled variable C ........ 203
Table 41. A list of the four NMEGADP masking methods and their characteristics ........ 204
Table 42. Sample of the Motivation Example Dataset ........ 207
Table 43. Motivation Example – Original dataset Pearson correlations ........ 213
Table 44. Motivation Example – NMEGADP perturbed dataset Pearson correlations ........ 213
Table 45. Motivation Example – NMEGADP shuffled dataset Pearson correlations ........ 213
Table 46. Motivation Example – Original dataset rank-order (Spearman) correlations ........ 215
Table 47. Motivation Example – NMEGADP perturbed dataset rank-order (Spearman) correlations ........ 215
Table 48. Motivation Example – NMEGADP shuffled dataset rank-order (Spearman) correlations ........ 215
Table 49. Original dataset – Fitting a wrong regression model (Linear Regression Case: X1|S1) ........ 217
Table 50. NMEGADP perturbed dataset – Fitting a wrong regression model (Linear Regression Case: X1|S1) ........ 217
Table 51. NMEGADP shuffled dataset – Fitting a wrong regression model (Linear Regression Case: X1|S1) ........ 217
Table 52. Motivation Example – Data Utility assessment by fitting nonlinear Parametric Regression Models: (X|S), (Y|S), and (Yshf|S) ........ 218
Table 53. Motivation Example – Data Utility assessment by fitting nonlinear Parametric Regression Models: (X1|X2), (Y1|Y2), and (Y1_shf|Y2_shf) ........ 219
Table 54. Motivation Example – Data Utility assessment by fitting nonlinear Parametric Regression Models: (X1|S,X2), (Y1|S,Y2), and (Y1_shf|S,Y2_shf) ........ 219
Table 55. Classification: Original Dataset: IV: S2, DV: S1 X1 X2 ........ 220
Table 56. Classification: NMEGADP Perturbed Dataset: IV: S2, DV: S1 Y1 Y2 ........ 220
Table 57. Classification: NMEGADP Shuffled Dataset: IV: S2, DV: S1 Y1shf Y2shf ........ 220

LIST OF FIGURES

Figure ........ Page

Figure 1. Research scope of the study ........ 7
Figure 2. Motivation Example: Non-monotonic relationships between Age (S1) and Expenditure$ (X1) ........ 13
Figure 3. Privacy-Preserving Data Mining (PPDM) Literature ........ 17
Figure 4. Impact of current masking methods (EGADP and data shuffling) on non-monotonic relationships ........ 36
Figure 5. Diagrams of the two-variable dataset (aimed to check the piecewise CC security of NMEGADP masking procedures) and its masked datasets ........ 92
Figure 6. Motivation Example (ME.L) Dataset ........ 109
Figure 7. ME.L dataset: u1 vs. u2 and r1 vs. r2 scatter plots (LSSVM) ........ 109
Figure 8. ME.L dataset: u1 vs. u2 and r1 vs. r2 scatter plots (piecewise linear) ........ 110
Figure 9. ME.L.MV1 Dataset ........ 110
Figure 10. ME.L.MV1 dataset: u1 vs. u2 and r1 vs. r2 scatter plots (LSSVM) ........ 111
Figure 11. ME.L.MV1 dataset: u1 vs. u2 and r1 vs. r2 scatter plots (piecewise linear) ........ 111
Figure 12. ME.L.MV2 Dataset ........ 112
Figure 13. ME.L.MV2 dataset: u1 vs. u2 and r1 vs. r2 scatter plots (LSSVM) ........ 112
Figure 14. ME.L.MV2 dataset: u1 vs. u2 and r1 vs. r2 scatter plots (piecewise linear) ........ 113
Figure 15. ME.L: X1 & Y1 vs. S1 scatter plot and the linearity of relationships between X1 & Y1 ........ 113
Figure 16. ME.L.MV1: X1 & Y1 vs. S1 scatter plot and the linearity of relationships between X1 & Y1 ........ 114
Figure 17. ME.L.MV2: X1 & Y1 vs. S1 scatter plot and the linearity of relationships between X1 & Y1 ........ 114
Figure 18. ME.L: X2 & Y2 vs. S1 scatter plot and the linearity of relationships between X2 & Y2 ........ 115
Figure 19. ME.L.MV1: X2 & Y2 vs. S1 scatter plot and the linearity of relationships between X2 & Y2 ........ 115
Figure 20. ME.L.MV2: X2 & Y2 vs. S1 scatter plot and the linearity of relationships between X2 & Y2 ........ 116
Figure 21. Comparing the linearity of relationships between X1 & Y1 for the three datasets on a unified scale ........ 116
Figure 22. Comparing the linearity of relationships between X2 & Y2 for the three datasets on a unified scale ........ 117
Figure 23. ME.L.MV1 – Masked using masking method 1 ........ 129
Figure 24. ME.L.MV1 – Masked using masking method 2 ........ 129
Figure 25. ME.L.MV1 – Masked using masking method 3 ........ 130
Figure 26. ME.L.MV1 – Masked using masking method 4 ........ 130
Figure 27. ME.L.MV2 – Masked using masking method 1 ........ 131
Figure 28. ME.L.MV2 – Masked using masking method 2 ........ 131
Figure 29. ME.L.MV2 – Masked using masking method 3 ........ 132
Figure 30. ME.L.MV2 – Masked using masking method 4 ........ 132
Figure 31. Motivation Example (ME.L) Dataset – E(X1|S)s vs. E(Y1|S)s ........ 143
Figure 32. Motivation Example (ME.L) Dataset – E(X2|S)s vs. E(Y2|S)s ........ 146
Figure 33. Motivation Example (ME.L) Dataset – E(X1|X2)x2 vs. E(Y1|Y2)x2 ........ 149
Figure 34. Motivation Example (ME.L) Dataset – E(X2|X1)x1 vs. E(Y2|Y1)x1 ........ 153
Figure 35. Motivation Example (ME.L) Dataset – E(X1|S,X2)sx2 vs. E(Y1|S,Y2)sx2 ........ 157
Figure 36. Motivation Example (ME.L) Dataset – E(X2|S,X1)sx1 vs. E(Y2|S,Y1)sx1 ........ 160
Figure 37. Impact of multivalued data on learning the conditional expectation ........ 171
Figure 38. SDL/Relationship Match Framework ........ 191
Figure 39. General schema of how the Relationship-Based Masking (RBM) approach works ........ 200
Figure 40. Classification of the NMEGADP masking methods ........ 201
Figure 41. Motivation Example (ME.L) – Original dataset ........ 208
Figure 42. Motivation Example – Predicted values E(X1|S)s vs. E(X2|S)s ........ 209
Figure 43. Motivation Example – Residuals r1 vs. r2 ........ 209
Figure 44. Motivation Example – The NMEGADP Perturbation (masking method 1) masked dataset ........ 210
Figure 45. Motivation Example – The NMEGADP Shuffling (masking method 2) masked dataset ........ 210
Figure 46. Motivation Example (ME.L) – Masked using masking method 3 ........ 211
Figure 47. Motivation Example (ME.L) – Masked using masking method 4 ........ 211
Figure 48. Graphical Pilot Study: Linear relationships (bivariate normal dataset) ........ 221
Figure 49. Graphical Pilot Study: Monotonic nonlinear relationships I ........ 222
Figure 50. Graphical Pilot Study: Monotonic nonlinear relationships II ........ 223
Figure 51. Graphical Pilot Study: Non-Monotonic relationships (U-shape data) ........ 224
Figure 52. Graphical Pilot Study: Non-Monotonic relationships (3-cluster data) ........ 225
Figure 53. NM.01 Dataset ........ 226
Figure 54. NM.01 Dataset – Predicted values E(X1|S)s vs. E(X2|S)s ........ 227
Figure 55. NM.01 Dataset – Residuals r1 vs. r2 ........ 227
Figure 56. NM.01 Dataset – Masked using masking method 1 ........ 228
Figure 57. NM.01 Dataset – Masked using masking method 2 ........ 228
Figure 58. NM.01 Dataset – Masked using masking method 3 ........ 229
Figure 59. NM.01 Dataset – Masked using masking method 4 ........ 229
Figure 60. NM.02 Dataset ........ 230
Figure 61. NM.02 Dataset – Predicted values E(X1|S)s vs. E(X2|S)s ........ 231
Figure 62. NM.02 Dataset – Residuals r1 vs. r2 ........ 231
Figure 63. NM.02 Dataset – Masked using masking method 1 ........ 232
Figure 64. NM.02 Dataset – Masked using masking method 2 ........ 232
Figure 65. NM.02 Dataset – Masked using masking method 3 ........ 233
Figure 66. NM.02 Dataset – Masked using masking method 4 ........ 233
Figure 67. NM.03 Dataset ........ 234
Figure 68. NM.03 Dataset – Predicted values E(X1|S)s vs. E(X2|S)s ........ 235
Figure 69. NM.03 Dataset – Residuals r1 vs. r2 ........ 235
Figure 70. NM.03 Dataset – Masked using masking method 1 ........ 236
Figure 71. NM.03 Dataset – Masked using masking method 2 ........ 236
Figure 72. NM.03 Dataset – Masked using masking method 3 ........ 237
Figure 73. NM.03 Dataset – Masked using masking method 4 ........ 237
Figure 74. NM.04 Dataset ........ 238
Figure 75. NM.04 Dataset – Predicted values E(X1|S)s vs. E(X2|S)s ........ 239
Figure 76. NM.04 Dataset – Residuals r1 vs. r2 ........ 239
Figure 77. NM.04 Dataset – Masked using masking method 1 ........ 240
Figure 78. NM.04 Dataset – Masked using masking method 2 ........ 240
Figure 79. NM.04 Dataset – Masked using masking method 3 ........ 241
Figure 80. NM.04 Dataset – Masked using masking method 4 ........ 241
Figure 81. NM.05 Dataset ........ 242
Figure 82. NM.05 Dataset – Predicted values E(X1|S)s vs. E(X2|S)s ........ 243
Figure 83. NM.05 Dataset – Residuals r1 vs. r2 ........ 243
Figure 84. NM.05 Dataset – Masked using masking method 1 ........ 244
Figure 85. NM.05 Dataset – Masked using masking method 2 ........ 244
Figure 86. NM.05 Dataset – Masked using masking method 3 ........ 245
Figure 87. NM.05 Dataset – Masked using masking method 4 ........ 245
Figure 88. MNL.01 Dataset ........ 246
Figure 89. MNL.01 Dataset – Predicted values E(X1|S)s vs. E(X2|S)s ........ 247
Figure 90. MNL.01 Dataset – Residuals r1 vs. r2 ........ 247
Figure 91. MNL.01 Dataset – Masked using masking method 1 ........ 248
Figure 92. MNL.01 Dataset – Masked using masking method 2 ........ 248
Figure 93. MNL.01 Dataset – Masked using masking method 3 ........ 249
Figure 94. MNL.01 Dataset – Masked using masking method 4 ........ 249
Figure 95. MNL.02 Dataset ........ 250
Figure 96. MNL.02 Dataset – Predicted values E(X1|S)s vs. E(X2|S)s ........ 251
Figure 97. MNL.02 Dataset – Residuals r1 vs. r2 ........ 251
Figure 98. MNL.02 Dataset – Masked using masking method 1 ........ 252
Figure 99. MNL.02 Dataset – Masked using masking method 2 ........ 252
Figure 100. MNL.02 Dataset – Masked using masking method 3 ........ 253
Figure 101. MNL.02 Dataset – Masked using masking method 4 ........ 253
Figure 102. MNL.03 Dataset ........ 254
Figure 103. MNL.03 Dataset – Predicted values E(X1|S)s vs. E(X2|S)s ........ 255
Figure 104. MNL.03 Dataset – Residuals r1 vs. r2 ........ 255
Figure 105. MNL.03 Dataset – Masked using masking method 1 ........ 256
Figure 106. MNL.03 Dataset – Masked using masking method 2 ........ 256
Figure 107. MNL.03 Dataset – Masked using masking method 3 ........ 257
Figure 108. MNL.03 Dataset – Masked using masking method 4 ........ 257
Figure 109. NM.01.S1 Dataset ........ 258
Figure 110. NM.01.S1 Dataset – Predicted values E(X1|S)s vs. E(X2|S)s ........ 259
Figure 111. NM.01.S1 Dataset – Residuals r1 vs. r2 ........ 259
Figure 112. NM.01.S1 Dataset – Masked using masking method 1 ........ 260
Figure 113. NM.01.S1 Dataset – Masked using masking method 2 ........ 260
Figure 114. NM.01.S1 Dataset – Masked using masking method 3 ........ 261
Figure 115. NM.01.S1 Dataset – Masked using masking method 4 ........ 261
Figure 116. ME.L.S1 Dataset ........ 262
Figure 117. ME.L.S1 Dataset – Predicted values E(X1|S)s vs. E(X2|S)s ........ 263
Figure 118. ME.L.S1 Dataset – Residuals r1 vs. r2 ........ 263
Figure 119. ME.L.S1 Dataset – Masked using masking method 1 ........ 264
Figure 120. ME.L.S1 Dataset – Masked using masking method 2 ........ 264
Figure 121. ME.L.S1 Dataset – Masked using masking method 3 ........ 265
Figure 122. ME.L.S1 Dataset – Masked using masking method 4 ........ 265

ACRONYMS AND ABBREVIATIONS

ANN  Artificial Neural Networks
CADP  Correlated-Noise Additive Data Perturbation (also known as Kim's method)
CC  Canonical Correlation
CDF  Cumulative Distribution Function
CGADP  General Additive Data Perturbation: the Copula Approach
CI  Confidence Interval
CNMEGADP  Non-Monotonic Exact General Additive Perturbation: the Copula Approach
DCA  Data-Centric Approach
DM  Data Mining
DV  Dependent Variable
EC  European Community
EGADP  Enhanced (or Exact) General Additive Data Perturbation
ET  Equivalence Tests
FDA  Food and Drug Administration
GADP  General Additive Data Perturbation
IPSO  Information Preserving Statistical Obfuscation
IV  Independent Variable
LSSVM  Least Squares Support Vector Machines
MDA  Multiple Discriminant Analysis
MLP  Multilayer Perceptron Neural Networks
MR  Multiple Regression
MSE  Mean Square(d) Error or (Mean) Sum of Square(d) Errors
NMEGADP  Non-Monotonic EGADP
PDF  Probability Distribution Function
PPDM  Privacy-Preserving Data Mining
PPE  Privacy-Preserving Estimation (Regression)
PTTE  The Paired t-Test for Equivalence
RBM  Relationship-Based Masking
RBT  Rotation-Based Transformation
S2MLP  Secure 2-party Multivariate Linear Regression
S2MSA  Secure
2party Multivariate Statistical Analysis SADP Simple Additive Data Perturbation SDC Statistical Disclosure Control SDL Statistical Disclosure Limitation SVM Support Vector Machines TOST The Two OneSided tTest xxii MATHEMATICAL SYMBOLS AND NOTATIONS S The set of nonconfidential numeric and categorical attributes X The set of confidential numeric attributes Y The set of masked numeric attributes V The set of random variate, which is used to generate independent unscaled noise set b for masking confidential attributes X (after scaling: e) and generating Y Si An nonconfidential numeric and categorical attribute i Xi A confidential numeric or categorical attribute i Yi A masked numeric and categorical attribute i Vi A columnvector random variate, which is used to generate independent unscaled noise bi for masking attribute Xi (after scaling: ei) and generating Yi p The number of nonconfidential attributes S q The number of confidential attributes X (or masked attributes Y) d The total number of attributes in a dataset (equals to p + q) e The set of columnvector independent noise resulted from scaling b to have the same covariance matrix of r (i.e. åe =år ) xxiii ei A columnvector independent noise corresponding to ri resulted from scaling b to have the same covariance matrix of r (i.e. åe =år ) b The set of columnvector residuals resulted from regressing random variate V on both nonconfidential and confidential attributes (S and X). bi A columnvector residuals resulted from regressing a random variate Vi on both nonconfidential and confidential attributes (S and X). r The set of columnvector residuals resulted from regressing confidential attributes X on nonconfidential attributes S. ri A columnvector residuals resulted from regressing one confidential attribute Xi on nonconfidential attribute S. E(.) Expected value E(..) 
Conditional expectation e Independent noise or residuals resulted usually from a regression of a random variable on a set of independent variables eint The initial prediction error a snooper gets by learning the conditional expectations E(XS) enew The new prediction error a snooper gets by trying to improve his prediction and condition on both E(XS) and Y år The covariance matrix of the residuals resulted from regressing X on S åb The covariance matrix of the unscaled independent noise terms resulted from regressing V on S and X åe The covariance matrix of the scaled independent noise terms and specified to be equal to år xxiv f(x) Strictly a real (deterministic) mathematical function that represents the exact component of the relationship between an independent variable x a dependent variable y and with no random element involved x An independent random variable (IV) y A dependent random variable (DV) Corr(.,.) The Pearson correlation measures between two random variables. It is also used as a possible datautility measure for linear relationships. I The input to Artificial Neural Networks ANN O The output to Artificial Neural Networks ANN A a dependent random variable ai The actual values of a dependent random variable A aˆi The corresponding predicted (or fitted) values using a specific nonlinear regression model a The mean of the dependent variable A R2 The coefficient of determination 0 H The null hypothesis a H The alternative hypothesis o m The population mean of observations p m The population mean of models predications xxv d The equivalence margin of significance tests 0 d The equivalence margin of the regression intercept using the regressionbased validation test for model validation 1 d The equivalence margin the regression slope using the regressionbased validation test for model validation a The test size (alphalevel) for significance tests or equivalence tests 1 CHAPTER I. 
INTRODUCTION

The maturity of current information technology, especially telecommunications, storage and database technology, facilitates the collection, transmission and storage of huge amounts of raw data, unimagined until a few years ago. For the raw data to be utilized, they must be processed and transformed into information and knowledge that have added value, such as helping to accomplish the task at hand more effectively and efficiently. Data mining techniques and algorithms attempt to aid decision making by analyzing stored data to find useful patterns and to build decision-support models. These extracted patterns and models help to reduce the uncertainty in decision-making environments.

Statisticians and researchers conduct surveys and collect datasets that usually contain tens of records. These datasets are considered to be very large when they contain a few hundred records (Hand, 1998). Traditional statistical techniques are the main (and the most suitable) tools for analyzing these datasets. The main objective of the analysis is to make inferences and estimate population parameters from collected samples.

Frequently, statistical agencies must release samples of datasets to external researchers. However, these datasets may contain sensitive information about previously surveyed human subjects. This raises many questions about the privacy and confidentiality of individuals in released datasets. Privacy and confidentiality have recently become critical issues and a central concern for many people (Grupe et al., 2002). Sometimes these concerns result in people refusing to respond and to share personal information or, worse, providing wrong responses.

Many laws emphasize the importance of privacy and define the limits of legal uses of collected data. In the healthcare domain, for example, the U.S. Department of Health and Human Services (DHHS) added new standards and regulations to the Health Insurance Portability and Accountability Act of 1996 (HIPAA).
The new standards aim to protect "the privacy of certain individually identifiable health data" (CDC, 2003). Grupe et al. (2002, Exhibit 1, p. 65) listed a dozen privacy-related acts and pieces of legislation issued between 1970 and 2000 in the United States.

On the other hand, these acts and concerns limit, legally and/or ethically, the release of datasets for purposes (sometimes legitimate ones) such as conducting research in the academic domain or obtaining competitive advantage in the business domain. In some cases, statistical offices face a dilemma that can be called a "war of acts." While they must protect the privacy of individuals in their datasets, they are also legally required to disseminate these datasets. The conflict between the objectives of the Privacy Act of 1974 and the Freedom of Information Act is just one example of this dilemma (Fienberg, 1994).

This has led to an evolution in the field of statistical disclosure limitation (SDL). SDL methods attempt to find a balance between data utility (valid analytical results) and data security (privacy and confidentiality of individuals). In general, these methods try to either (a) limit access to the values of sensitive attributes (mainly at the individual level), or (b) mask the values of confidential attributes in datasets while maintaining the general statistical characteristics of the datasets (such as the mean, standard deviation, and covariance matrix). Data perturbation methods for microdata are one class of masking methods. Since most statistical analysis methods are based on linear models, data perturbation methods generally aim to maintain linear relationships (Muralidhar et al., 1999).

When datasets are large, traditional statistical analysis techniques may not be the appropriate tools to use, for two reasons (Hand, 1998; 2000; Hand et al., 2000).
First, traditional statistical tools become unsuitable for making sense of the data and for making inferences about the population from large datasets; for instance, almost any small difference in a large dataset becomes significant. Second, large datasets may suggest that the data were not collected for inference (parameter estimation) about the population, and that another type of analysis might be more appropriate. In most cases, a significant amount of collected data is generated as a consequence of unplanned activities (e.g., transactional databases) rather than planned activities (e.g., experimental or survey designs). Therefore, as the size of datasets grows exponentially, the use of other (size-matched) analytical tools such as data mining becomes more appropriate.

Examples of large datasets are abundant. MarketTouch, a company located in Georgia, supports direct marketers with data and analytical tools (DMReview.com, 2004). It has a six-terabyte database called the Real America Database (RADB®), which provides information about more than 93 million households and 200 million individuals, and it is updated monthly with more than 20 million records. Statistical agencies also experience this phenomenon of rapidly growing datasets. The Census Bureau (2001) reported that the collected Census 2000 data consist of "information about the 115.9 million housing units and 281.4 million people across the United States." These large sizes suggest the need for analytical tools that are suitable for large datasets, and again, data mining tools naturally come into play. In fact, the Census Bureau (accessed 2004) has started providing programs with data mining capabilities, such as DataFerrett (Federated Electronic Research, Review, Extraction and Tabulation Tool), which can be used to analyze and extract data from TheDataWeb, a repository of datasets that can be accessed freely online or bought offline. These datasets cover more than 95 subjects.
Data mining techniques may pose more significant threats to privacy and confidentiality than traditional statistical analytical techniques. Domingo-Ferrer and Torra (2003) made a connection between SDL methods and some artificial intelligence (AI) tools (note that statistics and AI are among the main fields contributing to data mining). They suggested there are two possible risks that AI (or, equivalently, DM) tools pose to the privacy and confidentiality of released masked datasets: disclosure threats and re-identification threats.

From a disclosure-threat perspective, DM tools can be used to aggregate or combine masked copies of a specific original dataset (Domingo-Ferrer and Torra, 2003). The goal is to reverse the masking effect and rebuild the original dataset, which raises a confidentiality issue. This may pose a great threat when simple, unsophisticated SDL techniques, such as the simple additive data perturbation (SADP) method (Traub et al., 1984), are used and many masked copies are released. DM tools are also used to enforce data integrity and consistency in distributed databases by re-identifying different records belonging to the same individual. Contrary to DM, SDL methods aim to avoid identity disclosure. Thus, from a re-identification-threat perspective, DM tools can be used to re-identify individuals in masked datasets (raising a privacy issue). Domingo-Ferrer and Torra (2003) suggested that SDL methods should consider the existence of DM tools and assess their impact on disclosure (value disclosure) and re-identification (identity disclosure) threats.

These concerns about privacy and confidentiality when DM tools are used have led to the birth of privacy-preserving data mining (PPDM). The main goal of PPDM is to find useful patterns and build accurate models from datasets without accessing the individuals' precise original values in the records of those datasets (Agrawal and Srikant, 2000).
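The aggregation risk described above can be sketched concretely (our illustration, not an example taken from the cited literature): when every release adds fresh, independent noise to the same confidential values, as in SADP, a snooper who obtains many masked copies can average them and cancel the noise. All values and parameters below are hypothetical.

```python
import random
import statistics

def sadp_mask(values, noise_sd, rng):
    # Simple Additive Data Perturbation: Y_i = X_i + e_i with e_i ~ N(0, sd^2)
    return [x + rng.gauss(0.0, noise_sd) for x in values]

rng = random.Random(7)
x = [250.0, 310.5, 198.2, 402.9]   # hypothetical confidential values
copies = [sadp_mask(x, noise_sd=25.0, rng=rng) for _ in range(500)]

# A snooper averages the released copies attribute by attribute:
recovered = [statistics.mean(c[i] for c in copies) for i in range(len(x))]
# Each average lands within a few units of the original confidential value,
# even though every single release looks heavily perturbed.
```

The standard error of the average shrinks as 1/sqrt(k) with k released copies, which is why releasing many independently perturbed copies of the same data undermines SADP-style masking.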
In many cases, PPDM algorithms employ one or more methods to protect the data. One protection method used is the Simple Additive Data Perturbation (SADP) method (Traub et al., 1984), which has undesirable characteristics in terms of data utility and data security (Muralidhar et al., 1999). Most of the newer and more sophisticated data perturbation and masking methods, such as CGADP (Sarathy et al., 2002), IPSO (Burridge, 2003), EGADP (Muralidhar and Sarathy, 2005b) and data shuffling (Muralidhar and Sarathy, 2003a; 2005a), have not been investigated in the PPDM domain. The only exception is the GADP method (Muralidhar et al., 1999), which appears in a few privacy-preserving classification studies (Islam and Brankovic, 2004; Wilson and Rosen, 2002; 2003). This study will discuss the possibilities of using some of these masking methods for monotonic relationships in privacy-preserving estimation (PPE), as well as how these methods can be adapted and extended for the more general case involving non-monotonic relationships.

I.1. PPE Problem – Definition and Research Scope

The goal of this study is to investigate the possibility of using and adapting some data masking (mainly data perturbation and data shuffling) methods to develop new privacy-preserving estimation (PPE) algorithms for numeric variables involving non-monotonic relationships. Estimation (or regression) tries to estimate and quantify the relationships among variables in datasets. The main goal is to investigate the impact of newer data perturbation methods on different types of relationships in datasets. The required criterion is simple yet powerful: any perturbation (or masking) method used should not hurt or alter (to the extent possible) the structure and type of the relationships among variables existing in the original datasets. The result of testing such impact could be either preservation or destruction of the original relationships in the masked datasets.
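The preservation criterion above can be illustrated with a deliberately simplified, one-variable sketch of the residual-replacement idea that recurs throughout this dissertation (the notation E(X|S), r and e follows the symbols list; everything else, including the curve and all parameters, is our illustrative assumption, and the fitted curve is simply assumed known here rather than learned): keep an estimate of the conditional expectation E(X|S), and swap the residuals r for independent noise e of matching variance.

```python
import random

def residual_replacement_mask(s_vals, x_vals, fitted_fn, seed=0):
    """Mask X by keeping an estimate of E(X|S) and replacing the
    residuals r = X - E(X|S) with independent noise e of equal variance."""
    rng = random.Random(seed)
    fitted = [fitted_fn(s) for s in s_vals]
    r = [xv - f for xv, f in zip(x_vals, fitted)]
    var_r = sum(v * v for v in r) / len(r)      # residual variance (mean ~ 0)
    e = [rng.gauss(0.0, var_r ** 0.5) for _ in r]
    return [f + n for f, n in zip(fitted, e)]   # Y = E(X|S) + e

rng = random.Random(1)
s = [rng.uniform(20, 60) for _ in range(1000)]
# Hypothetical non-monotonic relationship plus noise (our assumption):
x = [400 - 0.3 * (v - 40) ** 2 + rng.gauss(0, 20) for v in s]
y = residual_replacement_mask(s, x, lambda v: 400 - 0.3 * (v - 40) ** 2)
# Y traces the same inverted-U shape in S as X, but shares no
# record-level noise with X, so individual values are not released.
```

In practice the conditional expectation is unknown and must be learned from the data (the role the dissertation later assigns to artificial neural networks), and the actual methods operate on multiple confidential attributes jointly; this sketch only shows why replacing residuals, rather than the fitted component, can leave a non-monotonic relationship intact.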
Unfortunately, current data perturbation methods do not preserve all possible relationships, as will be seen later. At best, they preserve monotonic nonlinear relationships, along with simpler (linear) relationships. This study proposes four new PPE masking algorithms, based on the concepts of data perturbation and data shuffling, to preserve the more difficult non-monotonic relationships. The focus and scope of this study within the space of privacy-preserving data mining (PPDM) techniques is represented in Figure 1.

[Figure 1. Research scope of the study: a diagram locating the study within data mining (directed techniques: estimation and classification; undirected techniques: association rules and clustering) and within relationship types (linear/nonlinear, monotonic/non-monotonic), marking the focus of the SDL/relationship-match framework and of the adapted masking techniques.]

I.2. PPE Problem – Importance and Requirements

There are many important real-life problems that require building estimation models for continuous variables. Many of these models are related to human subjects and involve confidential attributes. In the following paragraphs, we shed some light on aspects illustrating the importance of, and the need to protect, the privacy and confidentiality of human-related data used in regression models. Before we start, we want to resolve a terminology issue. While the term "estimation" appears more frequently in the engineering literature, other fields use the term "regression" (Ridgeway, 2003). Therefore, we use the terms estimation and regression (technique(s)/model(s)/problem(s)) interchangeably.

The need to build estimation models (prediction models for continuous variables) occurs frequently in different domains. Berry and Linoff (2000; 2004) have mentioned many typical examples, including estimating the number of children in a family, the total household income of a family, real estate value, and a customer's lifetime value.
Estimation is also indirectly related to classification. Instead of classifying customers as "respondents" or "non-respondents," for example, estimation techniques can be used to assign a probability for the expected level of responsiveness (Berry and Linoff, 2004). This is very useful in marketing campaigns when budgets are insufficient to target all possible respondents: the expected respondents with the highest probability of responsiveness can then be targeted first.

In large datasets, which typify DM datasets, the frequency of each class of categorical and binary variables is usually high. On the other hand, a real-number salary figure, for example, in a released dataset can be unique to the degree that it can single out the real identity of a de-identified whole record. Categorical and binary attributes therefore automatically have more security against disclosure risks than continuous variables. Because of this, Domingo-Ferrer et al. (2001) indicated that while non-perturbative masking methods such as sampling (Skinner et al., 1994), which are based on the concepts of sample uniqueness and population uniqueness, may suit categorical microdata, they cannot be used for numeric microdata. Hence, continuous attributes should attract more attention for protection against disclosure risks than categorical attributes.

Regression models that involve allocating huge amounts of money based on information about human subjects are sometimes required. Ridgeway (2003) briefly mentioned a real-world case that exemplifies the importance of such estimation models. Medicare (2004) is a federal health insurance program that covers elderly people (65 years or older), younger people with disabilities, and those suffering from End-Stage Renal Disease (ESRD). The program is administered nationwide by the Centers for Medicare and Medicaid Services (CMS).
CMS is required by the 1997 Balanced Budget Act to develop a "prospective payment system" to allocate a $4.3 billion budget to healthcare facilities that help Medicare-insured inpatients to rehabilitate. Part of the system is a (regression) "cost model" built from different features (attributes) of patients to predict the cost of rehabilitation. These attributes come from two main sources: CMS provides attributes such as age, reason for hospitalization, and cost of care, while secondary sources provide functional-ability scores measuring the motor and cognitive abilities of patients. The cost model is used to predict the cost of rehabilitation per patient, and healthcare facilities are reimbursed accordingly.

Regression models are frequently used in business problems. Linear regression is the best tool when all relationships among the variables in a dataset are linear, which is guaranteed when the dataset is normally distributed. However, most business datasets are non-normal (Zhang, 2004), which opens the door to all possible forms of relationships, including nonlinear (monotonic or non-monotonic) relationships.

Clearly, estimation and regression models for both linear and nonlinear relationships are important for many applications. In many cases, accessing sensitive data about human subjects might be necessary to build such models. Consequently, concerns about privacy and confidentiality automatically arise. However, there is a definite shortage of research on privacy-preserving estimation (PPE) techniques, especially for nonlinear and non-monotonic relationships, as we will see in the PPE-related literature (Subsection II.1.1) in the next chapter.

In summary, estimation (regression) is one of the four main pillars of data mining (DM) techniques (see Figure 1 above) and is frequently used in different applications.
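The hazard of relying on purely linear tools when relationships are non-monotonic can be made concrete with a minimal, self-contained illustration (ours, not drawn from the dissertation): a perfectly deterministic parabola has a Pearson correlation of exactly zero with its input, so a purely linear analysis would report no association at all.

```python
def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

x = list(range(-50, 51))   # symmetric about zero
y = [v * v for v in x]     # deterministic but non-monotonic
print(pearson(x, y))       # → 0.0: the linear measure sees nothing
```

By symmetry, every positive term in the covariance sum is cancelled by a matching negative term, so the relationship is invisible to correlation and to ordinary linear regression even though y is a deterministic function of x.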
In many cases, individuals' privacy and confidentiality are a greater issue for numeric values than for categorical values (Domingo-Ferrer et al., 2001; 2003). Indeed, converting numeric values into categorical ones is one way to protect privacy and confidentiality. Alternatively, masking methods such as data perturbation can be used. Accordingly, the first of three main requirements that masking methods need to satisfy in the context of PPE is:

· Requirement I: Masked datasets must allow accurate estimation models to be built, while preserving individuals' privacy and confidentiality.

Different types of relationships can exist in a dataset. For instance, multivariate normal datasets guarantee that all existing relationships among variables are linear. For this special case, some existing masking methods are readily available and can perfectly preserve linear relationships. However, most (business) datasets contain nonlinear relationships (Zhang, 2004), which can be monotonic or non-monotonic (Fisher, 1970). "A truth about data mining not widely discussed is that the relationships in data the miner seeks are either very easy to characterize, or very, very hard" (Pyle, 2003). This leads us to the second requirement:

· Requirement II: Masking methods must preserve all possible types of relationships among variables (non-monotonic or monotonic; linear or nonlinear).

AlAhmadi et al. (2004) suggested an approach, called the "Data-Centric" Approach in PPDM, for developing new privacy-preserving data mining (PPDM) algorithms. Unlike many current PPDM algorithms (Agrawal and Srikant, 2000; Thuraisingham, 2005), this approach suggests that any new PPDM algorithm should focus only on altering the original datasets, without changing standard data mining algorithms. This should be done in a way that does not hurt the validity of the required analyses (data utility) while maximizing data security.
This approach is important for reasons that will become apparent in Subsection II.1.2, titled "Data-Centric Approach (DCA) for Privacy-Preserving Data Mining." The third requirement is:

· Requirement III: PPE methods must be Data-Centric.

I.3. Motivation Example

Boyens et al. (2002) have indicated that more businesses have started to outsource data mining tasks to specialized data mining and knowledge discovery consulting companies. In other cases, some organizations, such as statistical offices, are required by law to release datasets to outsiders such as researchers and data miners. In addition, some businesses need to release datasets to specific parties because of an alliance, although they (i.e., the businesses) may have the expertise needed to build and run different data mining models by themselves. In all of these cases, the data utility and data security concerns are important.

To motivate our discussion, we provide a hypothetical example. A store wants to release a 1,000-record dataset to an allied market analysis firm while protecting the privacy and confidentiality of its customers. The dataset consists of four main variables (along with other identifier fields such as Name and Address). Two of these variables are non-confidential and two are confidential. The two non-confidential attributes are the age of the customer (numeric, between 20 and 60 years) and the gender (binary; 0: male, 1: female), denoted by S1 and S2, respectively. The two confidential attributes are the annual expenditure in $ (numeric) and the debt in $ (numeric), denoted by X1 and X2, respectively. The first five and the last five records of the dataset are shown in Table 1. The main analysis required by the market analysis firm is regression and estimation modeling.

Table 1. First 5 and last 5 records of the store dataset (motivation example)

                Non-Confidential             Confidential
  NO       Age (S1)   Gender (S2)   Expenditure $ (X1)   Debt $ (X2)
  1         53.11          0              258.57            514.74
  2         57.37          1              211.35            569.33
  3         22.06          0              224.45            609.51
  4         25.46          1              272.98            547.63
  5         51.25          1              292.63            516.33
  :           :            :                :                 :
  996       37.16          0              391.54            413.41
  997       50.72          1              301.59            500.72
  998       28.71          1              296.49            531.81
  999       27.71          0              308.91            537.00
  1000      29.62          1              298.04            469.97
  Min       20.009         0              172.610           383.490
  Max       59.969         1              444.190           632.800
  Mean      40.264         0.468          296.775           504.692
  STD       11.884         0.499           59.299            59.054

The store is required, ethically and legally, to protect the privacy and confidentiality of its customers (data security). This is also very important for maintaining customer trust and loyalty, and for retaining profitable customers. At the same time, and from a different perspective, the store must enable the allied market analysis firm to obtain accurate regression models from the released masked dataset (data utility). As an initial precaution, the store de-identifies the dataset by removing identifier fields such as Name and Address.

Figure 41 (Appendix D) shows the relationships among these four variables. The figure clearly indicates the existence of nonlinear, non-monotonic relationships (e.g., the relationship between S1: age and X1: expenditure; see Figure 2). In Section II.3 in the next chapter and in Appendix E, we show that current masking techniques are unable to maintain relationships of the type shown in Figure 2 (i.e., non-monotonic relationships).

[Figure 2. Motivation example: non-monotonic relationship between Age (S1), on the horizontal axis, and Expenditure $ (X1), on the vertical axis.]

I.4. Summary and Outline

In this chapter, we introduced privacy-preserving estimation (PPE) involving non-monotonic relationships and specified its scope. We also discussed its importance from a practical point of view.
In addition, we outlined the general requirements of privacy-preserving data mining (PPDM) and the specific requirements of privacy-preserving estimation (PPE).

In the next chapter (Chapter II), the relevant literature related to PPDM is briefly reviewed, with a focus on the limited PPE literature. Then, the related work and concepts in masking methods are presented, including the advantages of using masking methods in PPDM. Optimal masking methods are defined, and the difficulty of their practical implementation is explained. Some recent masking methods are discussed, and their limitations in dealing with non-monotonic relationships are demonstrated. The chapter concludes by listing possible research questions in PPE.

Chapter III lays the theoretical basis for our proposed masking methods and the tools they require. The chapter starts by defining a "relationship" in the context of estimation and regression problems. Then, the theoretical role of Artificial Neural Networks (ANN) in learning such relationships is presented. Finally, the roles of the residuals left after removing the conditional expectations E(X|S) in defining and guiding the data utility and data security requirements are discussed.

Chapter IV proposes four new masking methods for preserving non-monotonic relationships by adapting and extending some existing masking methods. The chapter discusses their implementation and their assumptions.

Chapter V treats the subject of assessing the success of the proposed masking methods in terms of data utility and data security. Some possible new and existing data utility and data security measures for measuring the effectiveness of PPE masking methods are listed.

Chapter VI assesses the effectiveness of Relationship-Based Masking (RBM) when the relationships among confidential attributes are linear. It starts by demonstrating the use of some of the RBM data utility and data security measures on the motivation example.
Additionally, it briefly examines how a snooper may try to compromise a masked dataset involving non-monotonic relationships. Then it explains how the characteristics of the original datasets determine the characteristics of the masked attributes. Finally, this determination of characteristics is demonstrated empirically.

Chapter VII discusses the effectiveness of the RBM approach in terms of data utility when the relationships among confidential attributes are nonlinear. Eleven simulated datasets are used, and the data utility assessment is done using the measures proposed in Section V.3. Although the focus is on data utility, for reasons explained there, the chapter briefly discusses the subject of data security.

Chapter VIII concludes this work by summarizing the main findings and results of this research. Additionally, the limitations of the proposed methods are discussed, some possible future research directions in PPE are given, and the possibility of using the proposed masking methods for other PPDM techniques, such as privacy-preserving classification, is suggested.

CHAPTER II.

LITERATURE REVIEW

In this chapter, we start by reviewing the literature related to Privacy-Preserving Data Mining (PPDM), focusing mainly on Privacy-Preserving Estimation (PPE). Then, we review the main related concepts in statistical disclosure control (SDC), and we review some data perturbation and data shuffling masking methods. Next, we investigate the possibility of using these masking methods for PPE and assess their impact on non-monotonic relationships. We conclude the chapter by articulating some of the possible research questions in PPE.

II.1. Related Work in Privacy-Preserving Data Mining (PPDM)

The data mining "algorithm" (or "technique," in the terminology of Berry and Linoff (2000; 2004)) is one of five different dimensions that can be used to classify privacy-preserving data mining methods (Verykios et al., 2004b).
Similar to the classification of data mining (DM) techniques proposed by Berry and Linoff (2000; 2004), privacy-preserving data mining (PPDM) techniques can be classified as: (a) directed PPDM techniques: privacy-preserving estimation and privacy-preserving classification (both can be called prediction techniques), and (b) undirected PPDM techniques: privacy-preserving association rules and privacy-preserving clustering.

While the field of privacy-preserving data mining is new, relatively good progress has been made on privacy-preserving classification and association rules. Examples of privacy-preserving classification are Agrawal and Srikant (2000), Du and Zhan (2002; 2003), Du et al. (2004), Islam and Brankovic (2004), Johnsten and Raghavan (2000; 2001), Kantarcioglu and Clifton (2004b), Kantarcioglu and Vaidya (2003), Lindell and Pinkas (2002), Vaidya and Clifton (2004), Vaidya et al. (2004), and Yang et al. (2005). Examples of privacy-preserving association rules are Ashrafi et al. (2003; 2004), Evfimievski et al. (2002), Evfimievski et al. (2004), Kantarcioglu and Clifton (2004a), Oliveira and Zaïane (2003a), Oliveira et al. (2004), Rizvi and Haritsa (2002), Saygin et al. (2002), Vaidya and Clifton (2002), Verykios et al. (2004a), and Zhang et al. (2004). Some progress has been made in privacy-preserving clustering. Examples include Klusch et al. (2003), Lin et al. (2004), Merugu and Ghosh (2003a; 2003b), Oliveira and Zaïane (2003b; 2004a; 2004b), and Vaidya and Clifton (2003). However, there has been little research in privacy-preserving estimation (regression).
Figure 3 shows an abstract view of the privacy-preserving data mining (PPDM) literature, broken down by PPDM technique.

[Figure 3. Privacy-Preserving Data Mining (PPDM) Literature — the citations listed above, grouped into four columns by technique: Association Rules, Classification, Clustering, and Estimation (Du et al., 2004; Karr et al., 2004; Reiter, 2003; Sanil et al., 2004).]

II.1.1. Review of Privacy-Preserving Estimation (PPE) Literature

Sanil et al. (2004) proposed an algorithm for computing the exact coefficients of a multiple linear regression on the union of a vertically-distributed (or partitioned) dataset without sharing original values. This algorithm is applicable when there is a single shared, nonconfidential dependent variable, and the unshared confidential, independent variables are owned by more than two parties (agents) involved in the estimation process. It is based on Powell’s algorithm (Brent, 2002; Powell, 1964) for finding the minimum of a multivariable quadratic function (as a numerical solution to a series of one-dimensional minimization problems) without calculating its derivative. In addition, the algorithm of Sanil et al.
(2004) utilizes the secure summation algorithm (Benaloh, 1987; Clifton et al., 2002), which is considered to be part of secure multiparty computation in the cryptography literature (Schneier, 1996, pp. 551-552). The secure summation algorithm is used to share a statistical summary (a total), populated partially by every party, without revealing how much each party contributes to that statistic. This total is needed for estimating the regression coefficients iteratively. By the end of the proposed regression algorithm, each party can accurately calculate, from the global regression model, the coefficients of the variables they own and share them with the other parties.

Karr et al. (2004) dealt with the case of building a multiple linear regression on the union of a horizontally-distributed dataset. They suggested two approaches. The first approach, called the secure data integration procedure, is used to integrate horizontally-distributed datasets from more than two parties (agents) into one dataset while protecting the identity of the data source. The integrated dataset could then be shared among the cooperating parties. Each party could locally run linear regression analysis (or any other statistical analysis) and any of its diagnostics on the integrated dataset. This approach clearly does not meet the minimum requirements for protecting the privacy and confidentiality of surveyed human subjects, because it does not mask confidential attributes and aims only to protect the identity of the data sources (i.e., the identity of the involved parties, not the identity or confidentiality of surveyed human subjects). A second approach is based on the additive nature of linear regression analysis.
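Both ingredients just mentioned — the secure summation primitive and the additivity that the second approach exploits — can be sketched briefly. The following is an illustrative sketch only, with invented party data and dimensions, not the authors' actual protocols:

```python
import random

import numpy as np

def secure_sum(local_values, modulus=2**61):
    """Ring-based secure summation sketch (cf. Benaloh, 1987; Clifton et al., 2002).

    Party 1 masks its value with a secret random offset R; the running total
    is passed around the ring, each party adding its own value; party 1 then
    removes R. No party learns any other party's individual contribution.
    """
    R = random.randrange(modulus)              # secret offset, known only to party 1
    running = (R + local_values[0]) % modulus  # party 1 starts the ring
    for v in local_values[1:]:                 # each remaining party adds its value
        running = (running + v) % modulus
    return (running - R) % modulus             # party 1 removes the offset

# --- Additivity exploited for horizontally-partitioned linear regression ---
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(40, 3)), rng.normal(size=(60, 3))  # two parties' rows
beta = np.array([1.5, -2.0, 0.5])                            # invented "true" model
y1 = X1 @ beta + rng.normal(scale=0.01, size=40)
y2 = X2 @ beta + rng.normal(scale=0.01, size=60)

# Only locally-computed sufficient statistics are combined (in the protocol,
# entry by entry via secure summation) -- the data rows themselves never move.
XtX = X1.T @ X1 + X2.T @ X2
Xty = X1.T @ y1 + X2.T @ y2
beta_hat = np.linalg.solve(XtX, Xty)           # global least-squares coefficients
```

Because X'X and X'y decompose into per-party sums, `beta_hat` equals the least-squares estimate that would be computed on the (never actually pooled) union of the rows.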
Instead of sharing and integrating the original unmasked records of the distributed datasets, the statistics required to calculate the least-squares estimators of the linear regression coefficients are shared and integrated in a secure manner using the secure summation algorithm (Benaloh, 1987; Clifton et al., 2002; Schneier, 1996). In this approach, diagnostics cannot be calculated directly. Karr et al. (2004) proposed two approaches to resolve this issue, based on whether the computations of the diagnostics are additive with respect to the parties. When the computations of the diagnostics are additive, as for R2 and correlations, locally-calculated numerators and denominators of the diagnostic measures are shared using the secure summation algorithm to calculate the global diagnostic measures. When this is not possible, as in the case of sharing the residuals, simulated residuals — derived from synthetic independent variables and mimicking the relationships among the original residuals and independent variables — are integrated and shared using the secure data integration procedure. The procedure for creating synthetic residuals is very similar to the procedure proposed by Reiter (2003), which is briefly discussed below.

Remote regression servers (cf. Duncan and Mukherjee, 2000; Keller-McNulty and Unger, 1998; Schouten and Cigrang, 2003) are access-limitation methods for protecting microdata while enabling users to build linear regression models. Instead of running the regression analysis locally, users submit a request for the regression analyses they require, and the server returns the results in terms of regression coefficients and standard errors. Although this approach has the advantage of building linear regression models using original values, users do not usually have any means of checking the fit of their models. Reiter (2003) proposed a method to enable users to check the fit of their models while limiting the disclosure risks.
The proposed method is based on releasing artificial, simulated (marginal-wise) dependent and independent variables, residuals and fitted values that mimic the original relationships of the built models. These synthetic variables can then be used, much like the original variables, to assess the fit of the regression models. He suggested that this approach could also be used to provide diagnostics for other remotely-built generalized linear models.

The computations of many multivariate methods, including multivariate linear regression, depend on matrix computations such as matrix multiplication and matrix inversion. Based on this well-known fact, Du et al. (2004) proposed some protocols, called secure two-party matrix computation protocols, which enable two agents to collaboratively run matrix computations without knowing about or accessing the other party’s original, sensitive values, and without the involvement of a third party. They suggest that secure versions of some multivariate statistical analyses could be formulated using these secure matrix computation building blocks. This approach is generally called Secure 2-party Multivariate Statistical Analysis (S2MSA). These protocols are used to reformulate some specific multivariate statistical analysis problems, including the Secure 2-party Multivariate Linear Regression (S2MLP) problem, when the dependent variable is known to both agents.

As we have seen, not much has been done in privacy-preserving estimation and regression when we need to release a dataset for general analysis. The focus of current PPE methods is linear relationships and linear regression methods. Our work deals with a more difficult problem that involves nonmonotonic relationships. Most prior work deals with partitioned (vertically or horizontally) datasets, while our work is about centralized datasets, whose masked copies will be released in whole for regression tasks.
The above methods require the involvement of all parties every time they must run a regression model, while our proposed methods release whole masked datasets and do not require ongoing cooperation between the data owner and the data analyst. Many of the above methods assume, and restrict their applicability to, the existence of a single shared dependent variable, while our methods allow building regression models using any numeric variable as a dependent variable. The methods mentioned are proposed mainly for regression tasks, while there is initial evidence that our approach can be used for other analyses (see Section VIII.2, titled “Possible Opportunities and Limitations”).

II.1.2. Data-Centric Approach (DCA) for Privacy-Preserving Data Mining

Many PPDM approaches modify some existing DM algorithm(s) while masking the data (Thuraisingham, 2005); see, for example, Agrawal and Srikant (2000). Enforcing the use of a modified algorithm with a masked dataset to ensure accurate results is not a good idea, for many reasons. First, data miners usually employ more than one algorithm to mine a dataset. Examining all data mining algorithms, as well as modifying them, is not feasible. Second, once a dataset is released, there is no guarantee as to which algorithm might be applied. Using a non-prescribed, standard algorithm may lead to incorrect conclusions and actions. Instead, as suggested by Al-Ahmadi et al. (2004), datasets should be protected or masked without reference to a specific DM algorithm. More recently, Oliveira and Zaïane (2004c) supported the concept of a Data-Centric Approach (DCA) in their standardization suggestions to PPDM researchers and developers.
Oliveira and Zaïane (2004a) applied the DCA concept practically in developing a new PPDM clustering algorithm called Rotation-Based Transformation (RBT), which modifies only the data, such that applying standard distance-based clustering algorithms to the normalized original datasets and to the transformed datasets produces the same results. Hence, data perturbation and masking methods can be a good starting point for implementing the Data-Centric Approach (DCA) in PPDM. Next, we review the latest developments in data masking methods and see how we can use some of them in estimation problems. In Appendix A, we also provide a simple framework, called the SDL/Relationship Match Framework, for choosing a specific data perturbation method (or, in general, any masking method) to mask a specific dataset for building estimation models. The match is based on the most difficult type of relationship existing in the original dataset that the chosen masking method can preserve and reproduce in the masked dataset.

II.2. Related Work in Masking Methods

Statistical Disclosure Limitation (SDL) techniques, also known as Statistical Disclosure Control (SDC) techniques, are a set of techniques that aim to control the amount of disclosed information about sensitive attributes at the level of individual records in disseminated datasets, while providing datasets that remain statistically valid and analyzable overall. The main goals of SDL techniques are twofold: (a) to minimize disclosure risks by protecting the privacy (identity disclosure) and confidentiality (value disclosure) of individuals surveyed or included in released datasets, and simultaneously (b) to maximize data utility (preserving the original (statistical) characteristics of datasets) at the aggregate level (Willenborg and Waal, 2001). The phrase “disclosure limitation” in SDL (or “disclosure control” in SDC) indicates two things. First, there is a specific amount of information, usually at an aggregate level, in the original dataset that one wants to reveal.
Second, one wants to make sure that no further information (mainly sensitive or confidential) can be learned. In fact, when a dataset is released, some sort of disclosure automatically occurs, and total elimination of disclosure is impossible (Fienberg, 1994). Therefore, there are always possible disclosure risks associated with any released dataset. Hence, methods to assess the different possible disclosure risks of any masking method are needed. Similarly, methods are also needed to assess data utility. Thus, when a masking method is proposed, existing or new measures that quantify data utility and disclosure risk are used to prove the effectiveness of the masking method.

Two natural types of disclosure one wants to prevent are disclosure of the identities of individuals (identity disclosure) and of the exact values of their confidential attributes (exact value disclosure). The problem of disclosure control is complicated by a second type of value disclosure: partial value disclosure. In this case, the exact values of confidential attributes are protected, but good statistical estimates of them can be gained by accessing the masked attributes (Adam and Wortmann, 1989; Dalenius, 1977; Muralidhar et al., 1999; Muralidhar and Sarathy, 1999; Sarathy and Muralidhar, 2002). The only way to completely eliminate disclosure risks is to avoid releasing datasets altogether (given that other physical security measures are implemented). In this situation, unfortunately, no data utility is achieved (Fienberg, 1994). Datasets need to be disseminated on many occasions for many good reasons. In the case of statistical databases, permissible queries should get responses. Two important reasons, among others, are that dissemination is either (a) legally required, as in the case of the Bureau of the Census, or (b) research-motivated. For example, a hospital releases information about patients with a specific disease to a medical research institute developing a cure for the disease.
In the data mining field, datasets can be released to an external DM and knowledge discovery consulting company in an effort to gain some competitive advantage. There is an increasing trend in many businesses to outsource data mining tasks (Boyens et al., 2002).

There are basically two general approaches to disclosure control: limiting access to sensitive attributes, or masking original confidential attributes (Adam and Wortmann, 1989; Willenborg and Waal, 2001). Query restrictions used in statistical databases (Adam and Wortmann, 1989) are one example of access limitation approaches. In this approach, a query returns only an aggregate statistic, such as a SUM. The number of records used to build the aggregate statistic should exceed a specific threshold for the result to be sent to the user. In addition, successive queries should not have a large overlap. However, in many cases, this is not sufficient to prevent disclosure. When the results of permissible queries are restricted to aggregate information, inferential disclosure can occur when a snooper learns unrevealed information by combining the results of more than one query. A snooper is a legitimate user who misuses access privileges to obtain unauthorized content (Adam and Wortmann, 1989; Muralidhar et al., 1999; Muralidhar et al., 2001). For example, a snooper can combine the results of the following two queries, which can be issued by different users, to learn the salary of a company’s president: query 1, “total salaries the company pays,” and query 2, “total salaries the company pays except for the president” (Clifton, 2003). Although the result of each query by itself does not represent a privacy or confidentiality threat, combining the two leads to exact disclosure. One proposed solution (which may not be practical) is to track all answered queries of all users, to avoid answering a query that would increase the amount of information released and lead to disclosure.
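The salary example can be made concrete with a toy sketch (the table and figures below are invented for illustration):

```python
# Hypothetical salary table; each aggregate query alone passes a size threshold.
salaries = {"president": 950_000, "cfo": 400_000, "manager": 150_000, "analyst": 90_000}

query_1 = sum(salaries.values())                                   # total salaries
query_2 = sum(v for k, v in salaries.items() if k != "president")  # total except president

# Neither aggregate alone discloses an individual value, but the snooper's
# difference of the two reveals the president's exact salary.
print(query_1 - query_2)  # 950000
```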
Malvestuto and Moscarini (2003) discussed the problem of inferring confidential values using repeated queries in multidimensional databases, as well as a graphical auditing solution (the answer map) for the problem.

Willenborg and Waal (2001) devoted a book to the discussion of the principles underlying masking techniques. They differentiated between two types of datasets that need to be protected: tabular data (aggregated data) and microdata (individual records). Perturbative and non-perturbative masking methods are reviewed for each dataset type. The former replace original values with fabricated ones, while the latter utilize original values. Willenborg and Waal also reviewed data utility and data security measures for each dataset type. Duncan and Pearson (1991) suggested that masked microdata should be released instead of aggregated data (such as aggregate-statistic responses to restricted queries, or tabular data) to maximize data utility and to allow for different types of analyses. Muralidhar and Sarathy (2005b) adopted this viewpoint and pointed out that access to microdata is a requirement for data mining analysis tasks. Clifton (2003) also supported this opinion.

One important category of perturbative SDL methods for microdata is data perturbation methods. In data perturbation, a noise term is used to change original confidential attributes and generate masked ones. There are mainly two types of perturbation methods: multiplicative and additive. Multiplicative data perturbation methods generate masked values by multiplying original unmasked confidential values by an error term with a mean equal to 1 and a small variance (Kim and Winkler, 2003; Muralidhar et al., 1995). Additive data perturbation methods, on the other hand, add an error term with mean 0 and a specific variance (or covariance matrix) to the confidential attributes.
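A minimal sketch of the two noise models follows; all distribution parameters are illustrative choices, not values from the literature:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50_000, scale=12_000, size=10_000)  # confidential attribute

# Multiplicative perturbation: error term with mean 1 and a small variance.
x_mult = x * rng.normal(loc=1.0, scale=0.05, size=x.size)

# Additive perturbation: error term with mean 0 and a chosen variance.
x_add = x + rng.normal(loc=0.0, scale=3_000, size=x.size)

# Exact values change, while aggregate characteristics are approximately kept.
print(round(x.mean()), round(x_mult.mean()), round(x_add.mean()))
```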
Examples of the latest additive data perturbation methods are GADP (Muralidhar et al., 1999), CGADP (Sarathy et al., 2002), IPSO (Burridge, 2003), and EGADP (Muralidhar and Sarathy, 2005b). One of the advantages of data perturbation methods is that they automatically safeguard against exact value disclosure (Adam and Wortmann, 1989; Muralidhar and Sarathy, 1999). However, their ability to protect against partial value disclosure varies from one method to another. Dalenius (1977) defined disclosure risk as the increase in the snooper's knowledge after accessing the masked attributes. To account for the worst-case scenario, the assumption should be that the snooper has maximum knowledge about the confidential attributes, represented in the form of their distributions (Muralidhar and Sarathy, 2003c).

The remainder of this section is divided into three subsections. The first subsection discusses the advantages of using masking methods in developing PPE and PPDM methods. The second subsection discusses the Conditional Independence Theory for developing optimal (in terms of data utility and data security) masking methods, and the practical limitations of this theory. The third briefly discusses the latest related masking methods and their optimality status. Then, in the main section to follow, we test the performance of some of these methods on the store dataset in the motivation example to see whether they can be used for PPE involving nonmonotonic relationships. We then list some difficulties in developing a new PPE method for nonmonotonic relationships. Finally, we conclude this chapter by compiling a list of possible PPE research questions.

II.2.1. Advantages of Using Masking Methods for PPE

Data can be classified as aggregate data (tabular summaries) and microdata (full data). While access limitation methods for microdata can achieve some security goals, full access to microdata is important for DM.
Indeed, one of the biggest barriers facing data mining projects today is the “inability to release data” due to privacy concerns (Clifton, 2003). Masking methods are a sound choice, since they allow the release of microdata without any access restriction. They have a rich literature and are built on a solid theoretical basis (cf. Adam and Wortmann, 1989; Willenborg and Waal, 2001). Equally important, they are rigorous techniques for maintaining privacy and confidentiality while maximizing data utility (mainly for statistical analysis) (Muralidhar et al., 1999; Muralidhar and Sarathy, 2003a; 2005a; Sarathy et al., 2002). Moreover, masking techniques automatically provide protection against exact value disclosure (Adam and Wortmann, 1989; Muralidhar and Sarathy, 1999). From another perspective, they are a good starting point for practically implementing the Data-Centric Approach (DCA) in PPDM (Al-Ahmadi et al., 2004), as discussed earlier.

II.2.2. Optimal Masking Methods

In this section, we discuss the optimality requirements of masking methods based on the Conditional Independence Theory. Then we briefly discuss its practical limitations.

II.2.2.1 Conditional Independence Theory

The ultimate goals of any masking method are: (a) to maximize the data security of the masked dataset (minimize disclosure risks, both value and identity risks), and (b) to maximize the data utility (minimize information loss, or maximize data accuracy) of datasets after the protection. Many earlier perturbation methods were developed on the premise that there is always a tradeoff between these two goals. Examples include the simple additive data perturbation (SADP) method (Traub et al., 1984) and correlated-noise additive data perturbation (CADP), known as Kim’s method (Fuller, 1993; Kim, 1986; Tendick, 1991).
Muralidhar and Sarathy (2003c) developed a theoretical framework for perturbation methods, called the Conditional Independence Theory, which eliminates the need for a tradeoff between data utility and disclosure risks (beyond a specific level defined by the characteristics of the data). The framework requires that the perturbed values Y be generated (a) only from the conditional distribution f(X|S), and (b) independently of the original confidential attributes X given the nonconfidential attributes S. The importance of these conditions, once they are met, lies in the fact that both the maximum data utility and the maximum data security (minimum disclosure risk) requirements are automatically and simultaneously satisfied. Muralidhar and Sarathy (2003c) detailed these requirements of the Conditional Independence Theory and their consequences in terms of marginal, joint and conditional distributions. The use of the conditional distribution f(X|S) in the data masking literature is not new, and it has been examined in different contexts: multiple imputation (Little, 1993; Rubin, 1993), categorical data (Fienberg et al., 1998), and disclosure risk measures (Willenborg and Waal, 2001).

Muralidhar and Sarathy (2003c) proved that the two required conditions of the Conditional Independence Theory, once met, satisfy both the data utility and the disclosure risk requirements, as follows. The first condition is that the perturbed values Y should be generated from the conditional distribution f(X|S):

Y ~ f(X|S). (2.1)

Second, Y should be independent of X given S:

f(X,Y|S) = f(X|S) f(Y|S). (2.2)

In terms of data utility requirements, two characteristics of the original dataset should be preserved in the perturbed dataset. First, the joint distribution of the nonconfidential attributes S and the perturbed attributes Y should equal the joint distribution of the nonconfidential attributes S and the confidential attributes X: f(Y,S) = f(X,S).
The Conditional Independence Theory assures this: since Y is generated from f(X|S), f(Y|S) = f(X|S), and therefore

f(Y,S) = f(Y|S) f(S) = f(X|S) f(S) = f(X,S). (2.3)

Second, the marginal distribution of the perturbed attributes Y should be the same as the marginal distribution of the confidential attributes X: f(Y) = f(X). Again, the Conditional Independence Theory satisfies this requirement, as can be seen using (2.3):

f(Y) = ∫ f(Y,s) ds = ∫ f(X,s) ds = f(X), (2.4)

where the integrals are taken over the domain of S. Preserving the joint distribution of the perturbed dataset to be the same as that of the original dataset (i.e., f(Y,S) = f(X,S)) maintains all relationships among the variables. This suggests that applying any analysis that depends on relationships among variables to the perturbed dataset should provide results similar (if not identical) to those obtained by applying the same analysis to the original dataset. In addition, maintaining identical marginal distributions in the perturbed and original datasets produces the same univariate analysis results for each corresponding variable.

In terms of data security requirements, there are two assumptions regarding intruders: (a) they have full access to the nonconfidential attributes S, and (b) they have maximum knowledge about the confidential attributes, namely the conditional distribution f(X|S) of the confidential attributes X given the nonconfidential attributes S. Both assumptions collectively account for the worst-case scenario: the snooper does a good job of trying to breach the confidentiality and privacy of individuals in masked datasets even before their release. In this context, disclosure risk is defined as the increase in the snooper's ability (or the reduction of his uncertainty) to predict confidential attributes once the masked dataset is released (Dalenius, 1977; Duncan and Lambert, 1986). Thus, accessing the released masked dataset (i.e.
S and Y) might cause an increase in the snooper's prediction power, since there is now more information (i.e., Y) to use. However, the Conditional Independence Theory requires that Y be independent of X given S, which makes the following relationship true: f(X|S,Y) = f(X|S). This can be proven using (2.2):

f(X|S,Y) = f(S,X,Y) / f(S,Y) = [f(X,Y|S) f(S)] / [f(Y|S) f(S)] = [f(X|S) f(Y|S)] / f(Y|S) = f(X|S).

Therefore, snoopers do not gain any incremental knowledge (beyond what they already know about the confidential attributes X from the nonconfidential attributes S) by accessing the masked attributes Y. As we have seen, the Conditional Independence Theory, once satisfied, guarantees maximum data utility and data security. For an interesting discussion of whether the Conditional Independence Theory is sufficient to reach the optimal level of data utility and data security, refer to the comments made by Polettini and Stander (2003) and the rejoinder by Muralidhar and Sarathy (2003b).

II.2.2.2 Practical Limitation of the Optimal Procedure

Utilizing the concept of conditional independence to develop masking methods for every possible dataset (including nonnormal ones) proves to be a difficult problem (Muralidhar and Sarathy, 2003c). The main reason is the difficulty of learning the true multivariate conditional density function f(X|S). In practice, f(X|S) is usually unknowable and difficult to estimate. This limits the practical applicability of the Conditional Independence Theory in developing optimal masking methods. Nevertheless, some special cases (such as the case of multivariate normal data) have been addressed in the SDL literature, in which optimal masking methods have been built based on this theory. In the next subsection, we discuss some of the latest (PPE-related) masking methods. In addition, we explain which masking methods (and in which settings) satisfy the Conditional Independence Theory.
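For the multivariate normal special case just mentioned, generating masked values from f(X|S) is straightforward. A bivariate sketch (one confidential attribute X, one nonconfidential attribute S, with an illustrative correlation of 0.6 — invented for the example, not taken from any of the cited studies):

```python
import numpy as np

rng = np.random.default_rng(2)

mu = np.zeros(2)
cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])                 # Cov(X, S) = 0.6
X, S = rng.multivariate_normal(mu, cov, size=20_000).T

# For a bivariate normal, f(X|S=s) is normal with these parameters:
slope = cov[0, 1] / cov[1, 1]                # regression slope of X on S
cond_var = cov[0, 0] - cov[0, 1] ** 2 / cov[1, 1]

# Draw Y from f(X|S), independently of X given S -- conditions (2.1) and
# (2.2) of the conditional independence framework.
Y = slope * S + rng.normal(scale=np.sqrt(cond_var), size=S.size)

# f(Y,S) matches f(X,S) up to sampling error, so Corr(Y,S) = Corr(X,S),
# while Y carries no information about X beyond what S already provides.
print(round(np.corrcoef(X, S)[0, 1], 2), round(np.corrcoef(Y, S)[0, 1], 2))
```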
II.2.3. Recent Masking Methods

Muralidhar et al. (1999) proposed a perturbation method called the General Additive Data Perturbation (GADP) method, which avoids the problems of the early perturbation methods. GADP is designed mainly for maintaining linear relationships. The ideal situation is when datasets are normally distributed, which assures that all relationships among variables are linear (Kotz et al., 2000). Burridge (2003) and Muralidhar and Sarathy (2005b) reported that GADP suffers from sampling error that affects its performance when applied to small datasets. Burridge (2003) suggested a new perturbation method, called the Information Preserving Statistical Obfuscation (IPSO) method, which does not suffer from this problem. The idea behind IPSO lies in capturing the sufficient statistics (Anderson, 2003; Johnson and Wichern, 1998; Lehmann and Casella, 1998) of original, normally distributed datasets and utilizing them to produce the perturbed values in masked datasets. Muralidhar and Sarathy (2005b) recognized that although IPSO does a good job when dealing with small datasets, it has a data security problem. They proposed a variant of the GADP method that uses sufficient statistics to avoid sampling error in small datasets. The new approach is called the Enhanced (or Exact) General Additive Data Perturbation (EGADP) method. EGADP maintains exact linear relationships, even in small datasets.

All three of the above masking methods are proposed for linear relationships. Sarathy et al. (2002) proposed a new method, called the General Additive Data Perturbation method with the copula approach (CGADP), which can maintain monotonic (linear or nonlinear) pairwise relationships. A copula is a function that joins a group of marginal distribution functions into one multivariate (copula) distribution (Jouini and Clemen, 1996; Nelsen, 1999; Schweizer, 1991; Sklar, 1959).
CGADP utilizes a multivariate normal copula (Clemen and Reilly, 1999; Joe, 1997) to transform nonnormal distributions into normal ones that capture the monotonic dependence structure (rank-order correlation) of the original datasets. GADP or EGADP can then be applied to these transformed normal datasets to produce (normally distributed) masked attributes. Finally, the masked attributes are transformed back to their original marginals.

Muralidhar and Sarathy (2003a; 2005a) proposed a new masking method, called data shuffling, which combines the advantages of perturbation methods and data-swapping methods. Like data perturbation methods, the new approach maximizes data security and data utility in the case of linear or monotonic nonlinear relationships. Like data-swapping methods, it keeps the marginal distributions of the masked confidential attributes exactly the same as the marginal distributions of the original confidential attributes. In addition, data shuffling has the advantage of efficient implementation by utilizing “only rank order data” (Muralidhar and Sarathy, 2005a, p. 1). There are two approaches to implementing data shuffling: parametric and nonparametric. The parametric approach is similar to the CGADP perturbation method (Sarathy et al., 2002), with an additional step: rank-ordering the original values of the confidential attributes X according to the rank order of the perturbed values Y to obtain the final shuffled values. The nonparametric data shuffling approach has the advantage of bypassing the problem of identifying unknown marginal distributions: it utilizes the empirical CDF as an estimate of the unknown marginal distributions.

From an optimality perspective, GADP and EGADP applied to multivariate normal data satisfy the Conditional Independence Theory. Therefore, they are optimal in terms of maximizing data utility and data security.
Although IPSO provides ideal data utility by preserving exactly all (linear) relationships in normal data, it has a problem in the data security aspect, since masked attributes Y are not generated independently from f(X|S) (Muralidhar and Sarathy, 2005b). When applied to non-normal data, GADP and EGADP provide ideal security but not ideal data utility. They only reproduce linear relationships in masked data. This is generally adequate for a majority of statistical analyses of masked data, but not for data mining. Many studies (Fuller, 1993; Sarathy et al., 2002; Sullivan and Fuller, 1989) concluded that applying additive perturbation methods (including the above sophisticated methods) to non-normal datasets reduces the accuracy and utility of masked datasets because nonlinear relationships are not preserved. Hence, GADP and EGADP may not be suitable for general data mining. Because of the known difficulty of estimating the true conditional density function f(X|S) in non-normal datasets, developing optimal masking methods based on the Conditional Independence Theory is not feasible in this case (Muralidhar and Sarathy, 2003c). Nevertheless, preserving some characteristics of non-normal datasets is possible. For example, CGADP and data shuffling can maintain monotonic (linear or nonlinear) pairwise relationships. Both CGADP and data shuffling provide optimal data security but not optimal data utility because they do “not utilize the true conditional density f(X|S)” (Muralidhar and Sarathy, 2003c). Instead, they approximate f(X|S) with a multivariate normal copula distribution. Nevertheless, the level of data utility provided may be sufficient for some data mining applications.

II.3. Impact of Current Masking Methods on Non-Monotonic Relationships

Lechner and Pohlmeier (2004) investigated how some disclosure limitation masking methods affect the estimation of nonlinear data models using econometric estimation techniques.
Two disclosure limitation methods were studied: blanking and noise addition. The researchers used three econometric estimation techniques: the SIMEX method, the calibration method, and a semiparametric sample selectivity estimator. Lechner and Pohlmeier (2004) concluded that the disclosure limitation techniques, to varying degrees, make it difficult to obtain nonlinear models and other estimates from masked datasets similar to those obtainable from original datasets. They further pointed out that while the effects of masking methods on the estimators of linear (regression) models are well understood, “nonlinear regression techniques coping with implications of data masking are still at their infancy.” We picked two advanced data masking methods that have been proven to provide maximum data utility and data security: EGADP (Muralidhar and Sarathy, 2005b) for linear relationships, and (C-EGADP-based) data shuffling (Muralidhar and Sarathy, 2003a; 2005a) for monotonic nonlinear relationships. Neither has been used in the PPDM arena. These methods were applied to the store dataset (see Table 1 in Section I.3). While they provide good security measures for the store dataset, these methods do not produce useful masked datasets for the required regression tasks because of the existence of non-monotonic relationships. For example, the relationship between the age and expenditure variables (see Figure 2 in Section I.3) is not preserved in the masked data. The destructive effect of these methods on this type of relationship is demonstrated in Figure 4. Sarathy et al. (2002) pointed out the limitations of these methods in the case of non-monotonic relationships.

Figure 4. Impact of current masking methods (EGADP and data shuffling) on non-monotonic relationships

Clearly, there are no masking methods capable of preserving non-monotonic relationships in masked datasets as they exist in original datasets.
However, the literature on SDL and data perturbation methods provides sophisticated masking methods for maintaining monotonic relationships (linear and nonlinear) and achieving high levels (sometimes optimal) of data utility and data security. Our goal is to adapt and extend some of these methods for the case of non-monotonic relationships. In addition, we want to develop or adopt some measures for data utility and data security. These measures will be used to investigate the effectiveness of our proposed methods. Next, we will briefly discuss some challenges in developing a PPE masking method for non-monotonic relationships.

II.3.1. Challenges in Developing Practical PPE Masking Methods for Non-Monotonic Relationships

SDL masking methods were developed based on the concept that maintaining some aggregate correlation-based measures of original datasets in masked datasets would preserve the corresponding relationships captured by the aggregate measures. For example, GADP (Muralidhar et al., 1999), IPSO (Burridge, 2003) and EGADP (Muralidhar and Sarathy, 2005b) try to maintain the Pearson correlation matrix (along with the mean vector and covariance matrix). This ensures that all original linear relationships are reproduced in the masked datasets. On the other hand, CGADP (Sarathy et al., 2002) and data shuffling (Muralidhar and Sarathy, 2003a; 2005a) try to maintain the Spearman rank-order correlation matrix. This ensures that all original monotonic nonlinear (and linear, in the case of multivariate normal datasets) relationships are reproduced in the masked datasets. One challenge in developing a new PPE masking approach for non-monotonic relationships is that there is no aggregate measure capable of capturing non-monotonic relationships that we can use to reproduce this type of relationship in masked datasets. Thus, a different strategy is needed for developing the new approach.
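The failure of aggregate measures on non-monotonic relationships is easy to demonstrate numerically. In the following small sketch (illustrative data, not the store dataset), the Pearson correlation between x and a noiseless quadratic function of x is essentially zero, even though the relationship is deterministic:

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 201)     # symmetric grid around zero
y_lin = 2.0 * xs + 1.0               # monotonic (linear) dependence
y_quad = xs ** 2                     # strong, but non-monotonic, dependence

pearson = lambda a, b: float(np.corrcoef(a, b)[0, 1])

# The linear link is fully captured by the aggregate measure ...
print(pearson(xs, y_lin) > 0.99)        # True
# ... but the deterministic quadratic link is invisible to it.
print(abs(pearson(xs, y_quad)) < 1e-6)  # True
```

By the same symmetry argument, the Spearman rank-order correlation (the Pearson correlation of the ranks) also vanishes here, so a masking method that replicates either matrix would treat the pair as unrelated.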
Since PPE, and regression methods in general, are about quantifying the relationships among variables (Rud, 2001), “relationships” will be the basis for developing our new PPE approach. In the first section of Chapter III, we will provide a formal regression-related definition of relationships in PPE. Another challenge that arises is the need to test the effectiveness of the newly developed PPE method in terms of data utility and data security. Data security measures are based on general concepts and, therefore, existing security measures can be used. However, data utility measures are specific to the task at hand (Domingo-Ferrer et al., 2001; 2003). For PPE, it is necessary to maintain relationships among the variables in masked datasets in the manner they exist in original datasets. For monotonic relationships, data utility (in masking techniques) can easily be checked by comparing the relevant aggregate measures of the original and masked datasets. Similarity between these aggregate measures indicates the success of the masking methods in terms of data utility. Again, there is no aggregate measure for non-monotonic relationships; therefore, we need to take a different path in developing suitable data utility measures. Once more, we will use the concept of relationships as the basis for our data utility measures. The new proposed data utility measures, along with adopted security measures, are discussed in Chapter V.

II.4. Research Questions

This study raises and addresses the following questions:

· Question 1: Since there is no aggregate measure for non-monotonic relationships, how should relationships be defined (and later measured) in PPE and PPDM?

· Question 2: Is it possible to develop or adapt new masking methods in PPE to maintain non-monotonic relationships while maximizing data utility and data security?

· Question 3: What is the theoretical basis for these masking methods?

· Question 4: What is data utility in PPE? What is data security in PPE?
How can we quantify data utility and data security in PPE (success measures)?

· Question 5: Can the new methods preserve other types of relationships (linear or monotonic nonlinear)?

· Question 6: What are the assumptions of these new methods? What will happen if these assumptions are violated?

· Question 7: How can we establish the validity of the new methods (experiment design)?

· Question 8: How well do these methods compare with existing methods?

CHAPTER III. RELATIONSHIP-BASED MASKING: THEORETICAL BASIS

The previous chapter discussed the need for maintaining and reproducing the different types of possible relationships in masked datasets, as they exist in original datasets. In addition, it was seen that existing masking methods do not preserve all types of relationships in masked datasets. This chapter introduces four interrelated Relationship-Based Masking (RBM) methods for reproducing different types of relationships in masked datasets. The RBM approach is a three-stage approach: identifying relationships, analyzing residuals, and applying masking. These three stages are interrelated. The theoretical basis for the RBM methods lies mainly in the first two stages (relationships and residuals). Hence, the focus of this chapter is on the first two stages (and the tools related to them), while the subject of the following chapter is mainly the third stage (along with the first two stages). First, a formal definition of a relationship in the estimation and regression context is needed. In addition, we need an effective, proven way to learn and capture the different possible types of relationships that may exist in a dataset. It is also critical to understand the roles that residuals play in defining and guiding data utility and data security requirements in order to develop effective masking methods. Section III.1 draws attention to the formal statistical definitions of a relationship and discusses the one that is usually used in estimation and regression modeling.
In addition, this section specifies the class[1] of relationships that the RBM approach tries to learn from original datasets, and whether it is data-utility or data-security driven. Section III.2 investigates the capability of Artificial Neural Networks (ANN) in capturing relationships, including non-monotonic ones. Section III.3 concludes this chapter by investigating the roles of the residuals r left after fitting confidential attributes X (as dependent variables) to non-confidential attributes S (as independent variables) in defining and guiding data utility and data security requirements.

III.1. Conditional Expectation: The Formal Definition of Relationships in PPE

Macnaughton (2002) compiles and discusses seven definitions of a relationship between two random variables. One of these definitions that is particularly suited to this study is that there is a relationship between variables x and y if the values of y can be expressed as:

y = f(x) + e (3.1)

“where e is usually viewed as being independent of x” and represents the random component of the relationship. f(x) is strictly a real (deterministic) mathematical function that represents the exact component of the relationship with no random element involved. A strict mathematical function means that it maps from one or more values of x to only one value of y (but not to more than one value of y). For example, the function y = f(x) = x² is a mathematical function because it maps multiple values of x, such as (-2, 2), to exactly one value of y (4).

[1] We may use the term “class” when we refer to a relationship based on the confidentiality level of its involved dependent and independent variables. Examples include the class of relationships between X and S and the class of relationships among confidential attributes X, or E(Xi|Xj). In addition, we may use the term “type” or “shape” when we talk about the shape of a relationship: linear, monotonic nonlinear, or non-monotonic.
Thus, a mathematical function is either a one-to-one (1-1) mapping or a many-to-one (M-1) mapping, but not a one-to-many (1-M) mapping. The former two cases are called single-valued mappings, while the latter case is called a multi-valued mapping (Bishop, 1995). The existence of multi-valued mapping in data, which violates the definition of a mathematical function, impacts the effectiveness of our proposed masking methods in terms of data utility, as we will discuss later. Muralidhar and Sarathy (2005a) suggest that perturbed variables Y satisfy the minimum security requirements when they are generated using a function g(S, e) of only the non-confidential attributes S and a noise term e that is independent of the original confidential attributes X given the non-confidential attributes S. Notice the similarity of formula (3.1) with the function g(S, e) when one chooses the function g to be the addition of independent/orthogonal noise e to a random function f of the non-confidential attributes S:

Y = g(S, e) = f(S) + e (3.2)

where e is independent of X (or e ⊥ X) given S as a security requirement, ⊥ denotes “orthogonal to”, and f(S) is any random function of S. Equation (3.2) only states the data security requirements of perturbation methods; it does not consider or guarantee data utility. Data utility can be ensured if the function f(S) is carefully chosen. Macnaughton (2002) suggested that the mathematical form of f is usually determined by data analysis, although theoretical considerations can also be taken into account. He further suggested that such an approach usually leads to choosing the form of f as the best estimate of the conditional expectation E(Y|S) in Equation (3.2) (note that we are using the terminology of data perturbation: S, X and Y). Actually, E(Y|S) cannot be calculated initially since the masked values Y are not available. Nevertheless, the conditional expectation E(Y|S) is specified to be E(X|S) to maximize data utility.
Thus, in our context, formula (3.2) can be expressed as:

Y = E(X|S) + e (3.3)

where e is independent of X given S (or e ⊥ S, X) and e resembles the characteristics of r obtained from:

X = E(X|S) + r (3.4)

where r is independent of S (or r ⊥ S) by definition. Equation (3.3) is the essence of EGADP (Muralidhar and Sarathy, 2005b). EGADP can preserve linear characteristics of original datasets perfectly, even for very small datasets, while providing maximum data security (Section II.3). However, it cannot, as we saw earlier, preserve nonlinear relationships. Estimating the conditional expectations E(X|S) accurately, and accordingly the “relationships” between X and S by the regression definition, is critical for maintaining data utility. We conclude this section by discussing the difference between independence and orthogonality (uncorrelatedness). Independence implies orthogonality; the reverse is not always true (Hunter, 1972). When we learn relationships E(X|S) from multivariate normal data, the residuals r are always guaranteed to be independent of E(X|S) or any function of S (see Rhodes (1971), Property 8, p. 692). When the distributions of the data are non-normal, the only condition guaranteed at all times is that the residuals r are orthogonal (uncorrelated) to E(X|S) or any function of S (see Rhodes (1971), Proposition 2b, p. 690 and Bickel and Doksum (2001), Proposition 1.4.1 (a) and (b), p. 34). Nevertheless, “(w)hen two variables are uncorrelated, they may be (and often are) independent” (Schield, 1995). This means that the residuals r are often independent of E(X|S) or any function of S regardless of the data distribution. In our discussion, we do not make a distinction between these two concepts, and we use independence and orthogonality interchangeably.

III.2. Artificial Neural Network (ANN) Approaches for Estimating Conditional Expectations

There are many approaches that can be used to learn relationships (conditional expectations) in datasets.
For example, one can try to fit a polynomial of an appropriate degree to the data. However, this approach suffers from two limitations. First, it is a parametric approach that requires a prejudgment of the relationship before trying to fit it to the data. This proves difficult, especially if there is no theoretical guidance (Fisher, 1970). Second, it is affected by the curse of dimensionality (Bishop, 1995). Similarly, nonlinear regression requires the specification of the functional form before the estimation of parameters can begin. On the other hand, other approaches such as artificial neural networks (ANN) do not assume any prespecified form of the relationship. In addition, although ANNs also face the problem of the curse of dimensionality, they are less affected compared to the polynomial approach (Bishop, 1995). There are many characteristics of neural network architectures that make them appealing for estimating nonlinear and non-monotonic conditional expectations. First, ANN approaches have been proven to be universal approximators of nonlinear and non-monotonic continuous functions (Bishop, 1995; Hagan et al., 1996). Second, and more interestingly, they can be used to approximate the conditional expectation of the network output O given the network input I (i.e., E(O|I)) when the minimized error function is the (mean) sum of squared errors (MSE) and the network is used to map input I to output O (Bishop, 1995; Saerens, 1996; 2000). Saerens (1996) suggests that this is a fundamental mathematical statistics result based on estimation theory and can be found in many books, such as Deutsch (1965), Meditch (1969) and Shao (1999). Bishop (1995) proves the above fact mathematically for the case of multilayer perceptron (MLP) neural networks. Saerens (1996) does the same and emphasizes the importance of the assumption that the minimum error is indeed reached after the training. He also points out that many researchers arrive at the same conclusion.
A few examples for the continuous case are White (1989) and Wan (1990). Examples for the binary case are Bourlard and Wellekens (1989), Ruck et al. (1990), Gish (1990), and Shoemaker (1991). For a general review, refer to Richard and Lippmann (1991). There are two more interesting facts about using the MSE function for estimating conditional expectations. First, the ability of neural networks to learn conditional expectations is not affected by the characteristics of the noise (e.g., normal vs. non-normal) in the data (Saerens, 1996). Second, the ability to learn the conditional expectation when using MSE is not specific to multilayer perceptron (MLP) neural networks; it is independent of the neural network architecture (Bishop, 1995). Actually, this fact is applicable even to architectures that are not classified as neural networks, as long as they try to minimize MSE and succeed in approaching the true minimum (Bishop, 1995; Saerens, 1996). (Multiple) linear regression, which utilizes the concept of least squared errors, does a good job of estimating linear conditional expectations in normally distributed datasets (where all relationships are linear). However, when applied to datasets containing nonlinear relationships, it does a poor job. This is because it assumes that relationships are linear and can be discovered by fitting lines that minimize the squared error. In this sense, the global minimum of the squared errors, which might be obtained by fitting curves to the data, is not reached. As a result, (multiple) linear regression cannot be used as a mechanism to estimate conditional expectations when the relationships are not linear. This shows the importance of the assumption that the minimum error is indeed achieved after the estimation process stops. ANN algorithms do not assume any functional form that may hamper learning the true conditional expectation. The above discussion is important to our context.
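The fact that an MSE minimizer recovers the conditional expectation does not depend on neural-network machinery at all. The following sketch (synthetic data, no ANN involved, all names illustrative) fits one constant per input value by least squares; the fit matches the non-monotonic conditional expectation E(O|I) = (I - 2)² even though the noise is non-normal.

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete input I with many repeated observations per value, and a
# non-monotonic conditional expectation E(O|I) = (I - 2)^2.
i_vals = rng.integers(0, 5, 20_000)
true_cond_exp = (np.arange(5) - 2.0) ** 2          # [4, 1, 0, 1, 4]
# Zero-mean but non-normal (uniform) noise: the noise shape should not matter.
o_vals = true_cond_exp[i_vals] + rng.uniform(-1.0, 1.0, i_vals.size)

# The MSE-minimizing prediction per input value is the group mean.
fitted = np.array([o_vals[i_vals == i].mean() for i in range(5)])

mse = lambda pred: float(((o_vals - pred[i_vals]) ** 2).mean())

print(np.abs(fitted - true_cond_exp).max() < 0.05)   # True: fit ~ E(O|I)
print(mse(fitted) < mse(fitted + 0.1))               # True: the mean is the argmin
```

Shifting the fitted values in any direction raises the MSE, which is exactly Saerens's point: whatever the architecture, the minimizer of the squared-error criterion is the conditional mean.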
We can use any neural network architecture based on MSE to estimate the required conditional expectation of the confidential attributes X given the non-confidential attributes S (i.e., E(X|S)) by mapping the input S to the output X, regardless of the form of the conditional expectation (monotonic or (single-valued mapped) non-monotonic) or the characteristics of the error term. The main assumption is that the training procedure actually reaches, or at least approaches, the global error minimum. MLP neural networks suffer from the trap of local minima, which may limit their ability to reach the global minimum of the minimized error function. Consequently, the real conditional expectation may not be learned accurately. However, there are other neural network architectures that can be used instead and do not suffer from the same problem. The Support Vector Machine (SVM) (Vapnik, 2000) and the Least Squares Support Vector Machine (LS-SVM) (Suykens et al., 2002) are different neural network architectures (Schölkopf and Smola, 2003). SVM and LS-SVM are kernel-based methods (Schölkopf et al., 1999; Shawe-Taylor and Cristianini, 2004). They utilize the kernel trick to implicitly map an input space into a higher-dimensional space called the feature space. Then, well-known linear (and nonlinear) learning algorithms can be applied in the feature space to learn relationships and patterns. The two main tasks, among others, that can be run in the feature space are regression and classification. SVM and LS-SVM do not suffer from local minima that hinder optimization algorithms from reaching the global minimum of the cost function in the error space. This is because the cost function in the dual space (the main optimization space for (LS-)SVM cost functions) has a quadratic form with a unique global minimum (Suykens et al., 2002). In this study, we use LS-SVM (Suykens et al., 2002) and its Matlab toolbox implementation, called LS-SVMlab1.5 (2003), to learn conditional expectations.
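For concreteness, the LS-SVM regression step reduces to solving a single bordered linear system, which is why no local minima arise. The following is a bare-bones numpy sketch of that formulation, not the LS-SVMlab toolbox used in this study; the kernel width sigma and regularization gamma are illustrative choices.

```python
import numpy as np

def lssvm_regress(s_train, x_train, gamma=10.0, sigma=0.5):
    """Fit LS-SVM regression by solving its single bordered linear system:

        [ 0   1^T         ] [ b     ]   [ 0 ]
        [ 1   K + I/gamma ] [ alpha ] = [ x ]
    """
    n = s_train.size
    sq = (s_train[:, None] - s_train[None, :]) ** 2
    K = np.exp(-sq / (2.0 * sigma ** 2))            # RBF kernel matrix
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], x_train)))
    b, alpha = sol[0], sol[1:]

    def predict(s_new):
        k_new = np.exp(-((s_new[:, None] - s_train[None, :]) ** 2)
                       / (2.0 * sigma ** 2))
        return k_new @ alpha + b

    return predict

rng = np.random.default_rng(2)
s = np.sort(rng.uniform(-3.0, 3.0, 200))
x = s ** 2 + rng.normal(0.0, 0.3, s.size)       # non-monotonic E(X|S) = S^2

predict = lssvm_regress(s, x)
grid = np.linspace(-2.0, 2.0, 9)
err = float(np.abs(predict(grid) - grid ** 2).max())
print(err < 0.5)    # the learned surface tracks the quadratic shape
```

One linear solve replaces the iterative, local-minimum-prone training of an MLP, which is the practical appeal noted above.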
However, any learning mechanism that is theoretically capable of capturing and learning non-monotonic conditional expectations (relationships) can be used in our proposed Relationship-Based NMEGADP masking approach (RBM).

III.3. The Roles of the Residuals (r)

III.3.1. The Residuals' Role in Defining Relationships among Attributes

The residuals r obtained from estimating E(X|S) (see Equation (3.4) in Section III.1) play important roles in defining two main groups (classes) of relationships: (a) relationships between non-confidential attributes S and confidential attributes X (i.e., E(X|S)), and (b) relationships among confidential attributes X (i.e., E(Xi|Xj) or E(Xj|Xi)). For simplicity and without loss of generality, we limit our discussion to two confidential attributes X (Xi and Xj, where i, j = 1, …, q and i ≠ j) and the non-confidential attributes S. The residuals r (ri and rj) are obtained from the following two equations:

Xi = E(Xi|S) + ri (3.5)

and

Xj = E(Xj|S) + rj. (3.6)

In our proposed masking methods, masked variables Y (Yi and Yj) are generated as follows:

Yi = E(Yi|S) + ei = E(Xi|S) + ei (3.7)

Yj = E(Yj|S) + ej = E(Xj|S) + ej (3.8)

where e (ei and ej) are noise terms that try to preserve certain characteristics of the residuals r (ri and rj) to maximize data utility. In addition, the noise terms e are generated to satisfy security requirements and to avoid providing extra information about the confidential attributes X beyond what is known about them from the non-confidential attributes S. In the proposed masking methods, we specify E(Yi|S) = E(Xi|S) and E(Yj|S) = E(Xj|S) (see Equations (3.7) and (3.8)) so that relationships between confidential attributes X (Xi and Xj) and non-confidential attributes S are automatically reproduced in masked datasets between masked attributes Y (Yi and Yj) and non-confidential attributes S. The only required condition is that e should be orthogonal to S or any function of S, similar to the residuals r.
Except for the orthogonality, the added noise terms e are not required to mimic every aspect or characteristic of the residuals r to preserve this class of relationships. However, relationships among confidential attributes X (E(Xi|Xj) and E(Xj|Xi))[2] are not directly reproduced in masked datasets by only fixing the conditional expectations E(X|S) while masking. In addition to the role of the conditional expectations E(X|S), E(Xi|Xj) depends heavily on the characteristics of the added noise (ei and ej). Ideally, we want the characteristics of the added noise e (ei and ej) to be exactly the same as the characteristics of the original residuals r (ri and rj) in terms of two related requirements: (a) orthogonality (mainly for maintaining relationships between X and S in masked datasets), and (b) joint distribution (mainly for maintaining relationships among confidential attributes X in masked datasets). Although the former is achievable, the latter is not always feasible. Therefore, we want to get as close as possible to the ideal case to maximize the data utility. But before we discuss these data utility requirements further, we explore whether other possible types of residuals, resulting from different forms of estimation or (conditional) expectations, can be used in masking methods to maintain relationships (especially among confidential attributes X) without violating other requirements. First, let us consider the set of residuals that results from subtracting the expectations of the confidential attributes E(X) from the confidential attributes X. In this case, all relationships among X are completely captured by this set of residuals. However, this set of residuals captures none of the relationships between X and S, and there is no easy way to relate them to the conditional expectations E(X|S).

[2] For simplicity of discussion, we shall only talk about E(Xi|Xj) initially, with the assumption that it is a single-valued mapping.
Another idea is to use the residuals from fitting E(Xi|S, Xj). One problem with these residuals is that they do not satisfy an important security requirement: we cannot use any function of the confidential attributes X directly in generating the masked attributes Y (Muralidhar and Sarathy, 2003c; 2006b). Clearly, residuals obtained from E(Xi|Xj) are not suitable either, since they inherit the problems of the last two types of residuals: they carry no information about the non-confidential attributes S, and they violate the security requirement of avoiding conditioning on confidential attributes X. Therefore, we can only use functions of the non-confidential attributes S as specified in Equations (3.5) and (3.6). In this case, E(Xi|Xj) is captured by the two (orthogonal) components in Equations (3.5) and (3.6): E(X|S) and r. Since E(X|S) is the part held fixed during masking (refer to Equations (3.7) and (3.8)), the added noise set e (i.e., the dynamic or changed part) should mimic the characteristics of the residual set r to maintain the relationships among the masked attributes Y as they exist among the confidential attributes X. There are two dimensions of similarity: orthogonality and joint distributions. In Equations (3.5) and (3.6), the residuals r (ri and rj) are orthogonal to S and any function of S when the conditional expectations E(X|S) are correctly estimated (Bickel and Doksum, 2001, Proposition 1.4.1 (a) and (b), p. 34; Rhodes, 1971, Proposition 2b, p. 690). Notice that both E(Xi|S) and E(Xj|S) are functions of S. This is the main concept in mean square estimation. It is called “the orthogonality principle” (Gray and Davisson, 2004; Papoulis and Pillai, 2002). Interestingly, the orthogonality principle generally holds true when the conditional expectations (e.g., E(Xi|S) and E(Xj|S)) are nonlinear functions of the data (here, S), where it is called the “Nonlinear Orthogonality Rule” (Papoulis and Pillai, 2002).
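Putting the pieces of this section together, the masking scheme in Equations (3.7) and (3.8) can be sketched end to end on synthetic data. In this sketch, a simple binned conditional mean stands in for the LS-SVM estimator of E(X|S), and the noise e is generated by permuting the residuals, an illustrative choice that reproduces the marginal distribution of r but is not the dissertation's exact noise-generation procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: non-confidential S, confidential X with a
# non-monotonic conditional expectation E(X|S) = 3*cos(pi*S/2).
s = rng.uniform(-2.0, 2.0, 10_000)
x = 3.0 * np.cos(np.pi * s / 2.0) + rng.normal(0.0, 0.4, s.size)

# Stage 1: estimate E(X|S); a binned conditional mean stands in for LS-SVM.
edges = np.linspace(-2.0, 2.0, 41)
bin_idx = np.clip(np.digitize(s, edges) - 1, 0, 39)
bin_means = np.array([x[bin_idx == b].mean() for b in range(40)])
cond_exp = bin_means[bin_idx]                  # fitted E(X|S) per record

# Stage 2: residuals r = X - E(X|S).
resid = x - cond_exp

# Stage 3: masked Y = E(X|S) + e, with e decoupled from X given S.
e = rng.permutation(resid)                     # same marginal as r
y = cond_exp + e

corr = lambda a, b: float(np.corrcoef(a, b)[0, 1])
print(np.allclose(np.sort(e), np.sort(resid)))            # True: marginal of r kept
print(abs(corr(y, cond_exp) - corr(x, cond_exp)) < 0.02)  # True: X-S link reproduced
print(abs(corr(e, resid)) < 0.05)                         # True: e ~ uncorrelated with r
```

The checks at the end mirror the two requirements above: e is (near) orthogonal to functions of S, so the X-S relationship carries over to Y, while e carries essentially no record-level information about the original residuals.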
From a data utility perspective, for e to resemble the characteristics of r, e should be orthogonal to S or any function of S. Otherwise, relationships between X and S will not be reproduced in masked datasets between Y and
Abstract

The goal of this research is to adapt data masking techniques for the estimation of confidential, continuous numeric data involving non-monotonic relationships in privacy-preserving data mining (PPDM). Estimation techniques are an important class of data mining (DM) techniques frequently used in real-world problems. In many cases, estimation involves confidential numeric data, raising issues of privacy and confidentiality. Masking techniques can be used to protect the confidentiality of numeric data. However, existing masking methods do not preserve non-monotonic relationships. They replicate correlation-based aggregate measures in masked data to reproduce corresponding monotonic relationships. One challenge is that there are no aggregate measures for capturing non-monotonic relationships. Hence, this study proposes a new approach called Relationship-Based Masking (RBM) that utilizes relationships (conditional expectations), rather than aggregate measures such as covariance matrices, as a basis for masking. Four new adapted RBM methods are proposed. Other challenges that arise in the case of non-monotonic relationships include the measurement of data utility and data security. This study adopts existing security measures based on Mean Squared Errors (MSE) and proposes new data utility and data security measures. The theories behind these new methods are also described, and their effectiveness is established empirically in terms of data utility and disclosure risks using simulated numerical datasets. Findings and conclusions: The optimality of RBM in terms of data security (based on the characteristics of original datasets) is demonstrated theoretically and empirically. Additionally, RBM can maintain certain classes of relationships. This study shows theoretically and empirically that RBM can maintain original relationships in masked data when the relationships among confidential attributes are linear, regardless of the types of relationships between non-confidential and confidential attributes. It also demonstrates empirically the effectiveness of RBM in maintaining other classes of relationships. However, RBM is not able to maintain all possible classes of relationships. Finally, interesting results relating the characteristics of masked data to the characteristics of original data, prior to masking, are shown theoretically and empirically.
Transcript  ADAPTING MASKING TECHNIQUES FOR ESTIMATION PROBLEMS INVOLVING NONMONOTONIC RELATIONSHIPS IN PRIVACYPRESERVING DATA MINING By MOHAMMAD SAAD ALAHMADI Bachelor of Science in Computer Science King Fahd University of Petroleum and Minerals Dhahran, Saudi Arabia 1996 Master of Science in Telecommunications Management Oklahoma State University Stillwater, OK 2001 Submitted to the Faculty of the Graduate College of the Oklahoma State University in partial fulfillment of the requirements for the Degree of DOCTOR OF PHILOSOPHY July, 2006 ii ADAPTING MASKING TECHNIQUES FOR ESTIMATION PROBLEMS INVOLVING NONMONOTONIC RELATIONSHIPS IN PRIVACYPRESERVING DATA MINING Dissertation Approved: Dr. Rathindra Sarathy Dr. Ramesh Sharda Dr. Dursun Delen Dr. Goutam Chakraborty Dr. A. Gordon Emslie Dean of the Graduate College iii ACKNOWLEDGMENTS In the name of Allah, the Most Merciful, the Most Compassionate Thanks, glorification and praises to Almighty God (Allah) for His uncountable favors and bounties upon me and my family. Among them are His assistance in finishing my doctoral program successfully and the blessing of a good, supportive, motivational Ph.D. committee chaired by Dr. Rathindra Sarathy to whom I owe all respect and admiration. Dr. Sarathy was always there for me when I needed him the most. In our fruitful meetings and discussions, I learned from him what I would not have learned from any other source or setting. My high appreciation and thanks are extended to all other members on my Ph.D. committee: Drs. Ramesh Sharda, Goutam Chakraborty and Dursun Delen. The courses they taught and ideas they raised were the best toolkit for me to finish my Ph.D. journey in an enjoyable and successful manner. I would also like to thank and show the deepest appreciation for the dearest and closest souls to my soul  my parents, Saad and Hamida, who taught me that there is no limit for success, and that bigger ideas and goals work the best in this life. 
They never stopped supporting and encouraging me and always kept me in their prayers. If I live the rest of my life as a servant at their feet, I will still have repaid them nothing for what they have done for me.

My thanks are extended to my beloved family: Maha, Asim, Shymaa, Ammar, Yasser, and Saad. Without their unconditional and unlimited patience, love, understanding and support, none of what I have done would have been possible. This is not my work alone; it is the fruit of our collaborative work as a family. Congratulations to all of us on our graduation.

I am also grateful and thankful to my brothers, sisters, and all of my relatives and friends in the USA and in Saudi Arabia who supported me and did not forget me in their good prayers. Many of them shared the joy of graduation with me, either face-to-face, by phone or by email. Knowing how long the list would be prevents me from listing individual names.

Additionally, I would like to thank Dr. Ali Amiri, my professor and friend, for his support and encouragement during my doctoral studies, as well as my research partners, Drs. Ramesh Sharda, Rick Wilson, and Peter Rosen, for the fruitful, cooperative effort that resulted in several publications. In addition, I would like to thank all faculty, staff, and my peers and colleagues, especially my friends Ashish Gupta and Han Li, for the professional academic environment they created in the Department of Management Science and Information Systems (MSIS) in the William S. Spears School of Business at Oklahoma State University.

I would also like to acknowledge and thank Dr. Andrew Robinson for developing and providing the equivalence package for running (model-validation) equivalence tests. Last but not least, I would like to thank the lovely secretary of our department, Tane Kester, for the tremendous amount of time and effort she put into editing this manuscript.
v TABLE OF CONTENTS Chapter Page Acknowledgments...................................................................................... iii Table of Contents........................................................................................ v List of Tables ............................................................................................. ix List of Figures..........................................................................................xiii Acronyms and Abbreviations .................................................................. xix Mathematical Symbols and Notations .................................................... xxii I. INTRODUCTION ......................................................................................................... 1 I.1. PPE Problem – Definition and Research Scope ................................ 5 I.2. PPE Problem – Importance and Requirements.................................. 7 I.3. Motivation Example ........................................................................ 11 I.4. Summary and Outline...................................................................... 14 II. LITERATURE REVIEW............................................................................................ 16 II.1. Related Work in PrivacyPreserving Data Mining (PPDM) ........... 16 II.1.1. Review of PrivacyPreserving Estimation (PPE) Literature 18 II.1.2. DataCentric Approach (DCA) for PrivacyPreserving Data Mining.................................................................................. 21 II.2. Related Work in Masking Methods................................................. 23 II.2.1. Advantages of Using Masking Methods for PPE............. 27 II.2.2. Optimal Masking Methods ............................................... 28 II.2.1.2 Conditional Independence Theory ........................... 28 II.2.2.2 Practical Limitation of the Optimal Procedure......... 31 II.2.3. 
Recent Masking Methods ................................................. 32 II.3. Impact of Current Masking Methods on NonMonotonic Relationships.................................................................................... 35 vi Chapter Page II.3.1. Challenges in Developing Practical PPE Masking Methods for NonMonotonic Relationships ....................................... 37 II.4. Research Questions.......................................................................... 38 III. RELATIONSHIPBASED MASKING: THEORETICAL BASIS........................... 40 III.1. Conditional Expectation: The Formal Definition of Relationships in PPE .................................................................................................. 41 III.2. Artificial Neural Networks (ANN) Approaches for Estimating Conditional Expectations................................................................. 44 III.3. The Roles of the Residuals (r) ......................................................... 47 III.3.1. Residuals Role in Defining Relationships among Attributes 47 III.3.2. Role of Residuals in Guiding Security Requirements ...... 54 III.3.3. Summary........................................................................... 56 IV. IMPLEMENTATION OF RELATIONSHIPBASED MASKING.......................... 57 IV.1. NMEGADP Perturbation and NMEGADP Shuffling Masking Methods ........................................................................................... 57 IV.2. Assumptions of the RelationshipBased Masking Methods............ 65 V. ASSESSMENT MEASURES FOR THE RBM APPROACH.................................... 68 V.1. Data Security Measures in PPE....................................................... 68 V.2. Data Utility Measures in PPE.......................................................... 69 V.3. Equivalence Tests for Validating Models as Data Utility Measures75 VI. 
ASSESSING THE RBM APPROACH WHEN THE RELATIONSHIPS AMONG CONFIDENTIAL ATTRIBUTES ARE LINEAR........................................................... 84 VI.1. Illustration of the Effectiveness of NMEGADP Approach using the Motivation Example ........................................................................ 84 VI.1.1. Data Utility........................................................................ 85 VI.1.2. Data Security..................................................................... 87 VI.2. How a Snooper May Compromise RBM Masked Data .................. 92 VI.3. How the Characteristics of Original Datasets Determine the Characteristics of Masked Attributes............................................... 99 vii Chapter Page VI.4. Assessing the Impact of the Characteristics of Original Datasets on the Effectiveness of the RBM Approach....................................... 102 VI.4.1. Original Datasets and their Characteristics..................... 102 VI.4.2. Data Utility...................................................................... 117 VI.4.3. Data Security................................................................... 123 VII. ASSESSING THE RBM DATA UTILITY WHEN THE RELATIONSHIPS AMONG CONFIDENTIAL ATTRIBUTES ARE NONLINEAR................................ 133 VII.1. Datasets....................................................................................... 135 VII.2. Data Utility.................................................................................. 138 VII.2.1. Relationship 1: E(X1S)s vs. E(Y1S)s........................ 140 VII.2.2. Relationship 2: E(X2S)s vs. E(Y2S)s ........................ 144 VII.2.3. Relationship 3: E(X1X2)x2 vs. E(Y1Y2)x2 .................. 147 VII.2.4. Relationship 4: E(X2X1)x1 vs. E(Y2Y1)x1 ................. 151 VII.2.5. Relationship 5: E(X1SX2)sx2 vs. E(Y1SY2)sx2 .......... 155 VII.2.6. Relationship 6: E(X2SX1)sx1 vs. E(Y2SY1)sx1 .......... 159 VII.3. 
Summary..................................................................................... 162 VIII. CONCLUSIONS.............................................................................................. 164 VIII.1. Main Findings and Conclusions.................................................. 164 VIII.1.1. Data Utility................................................................ 164 VIII.1.2. Data security ............................................................. 168 VIII.2. Possible Opportunities and Limitations...................................... 170 VIII.3. Contributions of this Study......................................................... 172 REFERENCES ............................................................................................................... 175 APPENDICES ................................................................................................................ 189 Appendix A – SDL/Relationship Match Framework ............................. 189 Appendix B – The Relationship between the Covariance Matrices of Confidential Attributes X, Conditional Expectations E(XS) and Residuals r...................................................................................... 192 Appendix C – RelationshipBased NMEGADP Masking Algorithms . 199 viii Chapter Page Appendix D– Extended Results Related to the Motivation Example..... 207 Appendix E – Graphical Pilot Study – Comparisons for PPE Masking Methods ......................................................................................... 221 Appendix F – Dataset: NM.01 (Check Mark ) ....................................... 226 Appendix G – Dataset: NM.02 ............................................................... 230 Appendix H – Dataset: NM.03 ............................................................... 234 Appendix I – Dataset: NM.04................................................................. 
238 Appendix J – Dataset: NM.05................................................................. 242 Appendix K – Dataset: MNL.01............................................................. 246 Appendix L – Dataset: MNL.02 ............................................................. 250 Appendix M – Dataset: MNL.03 ............................................................ 254 Appendix N– Dataset: NM.01.S1 – Check Mark Dataset with One S with NonMonotonic Relationships among Residuals .......................... 258 Appendix O – Dataset: ME.L.S1 – Motivation Example with One S.... 262 Appendix P – Important Results Relating the Characteristics of Masked Attributes to the Characteristics of Original Data ......................... 266 ix LIST OF TABLES Table Page Table 1. First 5 and last 5 records of the store dataset (motivation example) .................. 12 Table 2. Possible different combinations of relationships to be maintained when a confidential attribute is a dependent variable in the store motivation example (2 X and 2 S). k is the model number............................................................................ 72 Table 3. Predictionbased (MSE) data utility and data security measures for the NMEGADP perturbed store dataset ............................................................................ 86 Table 4. First 5 and last 5 records of the twovariable dataset, to check the piecewise CC security of NMEGADP masking procedures ...................................................... 91 Table 5. Var(X1) = Var(u1) + Var(r1)............................................................................ 106 Table 6. Var(X2) = Var(u2) + Var(r2)............................................................................ 106 Table 7. Cov(X1,X2) = Cov(u1,u2) + Cov(r1,r2).............................................................. 107 Table 8. Covariance between X1 and Y1 and its relationship to the variance of u1 ......... 107 Table 9. 
Covariance between X2 and Y2 and its relationship to the variance of u2 ......... 107 Table 10. Calculating the regression coefficients of X1 = b0 + b1Y1 based on the characteristics of original data ............................................................................ 108 Table 11. Calculating the regression coefficients of X2 = b0 + b1Y2 based on the characteristics of original data ............................................................................ 108 Table 12. Equivalence Tests using Linear Regression to Calculate the Compared Fitted Values ................................................................................................................. 121 Table 13. Equivalence Tests using LSSVM to Calculate the Compared Fitted Values 122 Table 14. ParameterBased data utility measures for E(X1X2) vs. E(Y1Y2) ................... 123 Table 15. Possible snooper scenario to compromise X1 ................................................. 126 Table 16. Possible snooper scenario to compromise X2 ................................................. 127 x Table Page Table 17. Canonical correlation security measures (CC) using the best predictor E(XS) for the three MERelated datasets....................................................................... 128 Table 18. Corr(X1,Y1): its upper bound and its relationship to Corr(X1, E(X1S)) and the security index (SI)............................................................................................... 128 Table 19. Corr(X2,Y2): its upper bound and its relationship to Corr(X2, E(X2S)) and the security index (SI)............................................................................................... 128 Table 20. Datasets characteristics................................................................................... 137 Table 21. Percentage of minimum equivalence regions of significant equivalence tests for mean and slope of 1 for relationships (fitted values) E(X1S)s vs. E(Y1S)s...... 
140 Table 22. Slope of linear regression of relationships (fitted values) E(X1S)s vs. E(Y1S)s ............................................................................................................. 143 Table 23. R2 of linear regression and correlation of relationships (fitted values) E(X1S)s vs. E(Y1S)s ....................................................................................... 143 Table 24. Percentage of minimum equivalence regions of significant equivalence tests for mean and slope of 1 for relationships (fitted values) E(X2S)s vs. E(Y2S)s .... 146 Table 25. Slope of linear regression of relationships (fitted values) E(X2S)s vs. E(Y2S)s ............................................................................................................. 147 Table 26. R2 of linear regression and correlation of relationships (fitted values) E(X2S)s vs. E(Y2S)s ....................................................................................... 147 Table 27. Percentage of minimum equivalence regions of significant equivalence tests for mean and slope of 1 for relationships (fitted values) E(X1X2)x2 vs. E(Y1Y2)x2 ............................................................................................................................. 148 Table 28. Slope of linear regression of relationships (fitted values) E(X1X2)x2 vs. E(Y1Y2)x2 ........................................................................................................ 150 Table 29. R2 of linear regression and correlation of relationships (fitted values) E(X1X2)x2 vs. E(Y1Y2)x2 .................................................................................... 151 Table 30. Percentage of minimum equivalence regions of significant equivalence tests for mean and slope of 1 for relationships (fitted values) E(X2X1)x1 vs. E(Y2Y1)x1 ............................................................................................................................. 152 Table 31. 
Slope of linear regression of relationships (fitted values) E(X2X1)x1 vs. E(Y2Y1)x1 ........................................................................................................ 154 xi Table Page Table 32. R2 of linear regression and correlation of relationships (fitted values) E(X2X1)x1 vs. E(Y2Y1)x1 ............................................................................. 155 Table 33. Percentage of minimum equivalence regions of significant equivalence tests for mean and slope of 1 for relationships (fitted values) E(X1SX2)sx2 vs. E(Y1SY2)sx2 .................................................................................................... 156 Table 34. Slope of linear regression of relationships (fitted values) E(X1SX2)sx2 vs. E(Y1SY2)sx2 .................................................................................................... 158 Table 35. R2 of linear regression and correlation of relationships (fitted values) E(X1SX2)sx2 vs. E(Y1SY2)sx2 ..................................................................... 158 Table 36. Percentage of minimum equivalence regions of significant equivalence tests for mean and slope of 1 for relationships (fitted values) E(X2SX1)sx1 vs. E(Y2SY1)sx1 .................................................................................................... 159 Table 37. Slope of linear regression of relationships (fitted values) E(X2SX1)sx1 vs. E(Y2SY1)sx1 .................................................................................................... 161 Table 38. R2 of linear regression and correlation of relationships (fitted values) E(X2SX1)sx1 vs. E(Y2SY1)sx1 ..................................................................... 162 Table 39. MDA classification example using the motivation example dataset and its masked copies ..................................................................................................... 172 Table 40. 
Hypothetical example demonstrating “Shuffle A by B” for generating a shuffled variable C............................................................................................................ 203 Table 41. A list of the four NMEGADP masking methods and their characteristics.... 204 Table 42. Sample of the Motivation Example Dataset ................................................... 207 Table 43. Motivation Example  Original dataset Pearson correlations ......................... 213 Table 44. Motivation Example  NMEGADP perturbed dataset Pearson correlations. 213 Table 45. Motivation Example  NMEGADP shuffled dataset Pearson correlations ... 213 Table 46. Motivation Example  Original dataset rankorder (Spearman) correlations . 215 Table 47. Motivation Example  NMEGADP perturbed dataset rankorder (Spearman) correlations.......................................................................................................... 215 Table 48. Motivation Example  NMEGADP shuffled dataset rankorder (Spearman) correlations.......................................................................................................... 215 xii Table Page Table 49. Original dataset  Fitting a wrong regression model (Linear Regression Case: X1S1) ................................................................................................................... 217 Table 50. NMEGADP perturbed dataset  Fitting a wrong regression model (Linear Regression Case: X1S1)....................................................................................... 217 Table 51. NMEGADP shuffled dataset  Fitting a wrong regression model (Linear Regression Case: X1S1)....................................................................................... 217 Table 52. Motivation Example  Data Utility assessment by fitting nonlinear Parametric Regression Models: (XS), (YS), and (YshfS).................................................... 218 Table 53. 
Motivation Example  Data Utility assessment by fitting nonlinear Parametric Regression Models: (X1X2), (Y1Y2), and (Y1_shfY2_shf) ................................... 219 Table 54. Motivation Example  Data Utility assessment by fitting nonlinear Parametric Regression Models: (X1SX2), (Y1SY2), and (Y1_shfSY2_shf)............................. 219 Table 55. Classification: Original Dataset: IV: S2, DV: S1 X1 X2 ................................... 220 Table 56. Classification: NMEGADP Perturbed Dataset: IV: S2, DV: S1 Y1 Y2............ 220 Table 57. Classification: NMEGADP Shuffled Dataset: IV: S2, DV: S1 Y1shf Y2shf ....... 220 xiii LIST OF FIGURES Figure Page Figure 1. Research scope of the study ................................................................................ 7 Figure 2. Motivation Example: Nonmonotonic relationships between Age (S1) and Expenditure$ (X1) ................................................................................................. 13 Figure 3. Privacy Preserving Data Mining PPDM Literature........................................... 17 Figure 4. Impact of current masking methods (EGADP and data shuffling) on nonmonotonic relationships........................................................................................ 36 Figure 5. Diagrams of the twovariable dataset (aimed to check the piecewise CC security of NMEGADP masking procedures) and its masked datasets. ........................... 92 Figure 6. Motivation Example (ME.L) Dataset .............................................................. 109 Figure 7. ME.L dataset: u1 vs. u2 and r1 vs. r2 scatter plots (LSSVM).......................... 109 Figure 8. ME.L dataset: u1 vs. u2 and r1 vs. r2 scatter plots (piecewise linear) .............. 110 Figure 9. ME.L.MV1 Dataset ......................................................................................... 110 Figure 10. ME.L.MV1 dataset: u1 vs. u2 and r1 vs. r2 scatter plots (LSSVM) .............. 111 Figure 11. 
ME.L.MV1 dataset: u1 vs. u2 and r1 vs. r2 scatter plots (piecewise linear)... 111 Figure 12. ME.L.MV2 Dataset ....................................................................................... 112 Figure 13. ME.L.MV2 dataset: u1 vs. u2 and r1 vs. r2 scatter plots (LSSVM) .............. 112 Figure 14. ME.L.MV2 dataset: u1 vs. u2 and r1 vs. r2 scatter plots (piecewise linear)... 113 Figure 15. ME.L: X1 & Y1 vs. S1 scatter plot and the linearity of relationships between X1 & Y1..................................................................................................................... 113 Figure 16. ME.L.MV1: X1 & Y1 vs. S1 scatter plot and the linearity of relationships between X1 & Y1................................................................................................. 114 Figure 17. ME.L.MV2: X1 & Y1 vs. S1 scatter plot and the linearity of relationships xiv between X1 & Y1................................................................................................. 114 Figure 18. ME.L: X2 & Y2 vs. S1 scatter plot and the linearity of relationships between X2 & Y2..................................................................................................................... 115 Figure 19. ME.LMV1: X2 & Y2 vs. S1 scatter plot and the linearity of relationships between X2 & Y2................................................................................................. 115 Figure 20. ME.LMV2: X2 & Y2 vs. S1 scatter plot and the linearity of relationships between X2 & Y2................................................................................................. 116 Figure 21. Comparing the linearity of relationships between X1 & Y1 for the three datasets on a unified scale ................................................................................................ 
116 Figure 22 Comparing the linearity of relationships between X2 & Y2 for the three datasets on a unified scale ................................................................................................ 117 Figure 23. ME.L.MV1 – Masked using masking method 1 ........................................... 129 Figure 24. ME.L.MV1 – Masked using masking method 2 ........................................... 129 Figure 25. ME.L.MV1 – Masked using masking method 3 ........................................... 130 Figure 26. ME.L.MV1 – Masked using masking method 4 ........................................... 130 Figure 27. ME.L.MV2 – Masked using masking method 1 ........................................... 131 Figure 28. ME.L.MV2 – Masked using masking method 2 ........................................... 131 Figure 29. ME.L.MV2 – Masked using masking method 3 ........................................... 132 Figure 30. ME.L.MV2 – Masked using masking method 4 ........................................... 132 Figure 31. Motivation Example (ME.L) Dataset – E(X1S)s vs. E(Y1S)s ..................... 143 Figure 32. Motivation Example (ME.L) Dataset – E(X2S)s vs. E(Y2S)s ..................... 146 Figure 33. Motivation Example (ME.L) Dataset – E(X1X2)x2 vs. E(Y1Y2)x2............... 149 Figure 34. Motivation Example (ME.L) Dataset – E(X2X1)x1 vs. E(Y2Y1)x1............... 153 Figure 35. Motivation Example (ME.L) Dataset – E(X1SX2)sx2 vs. E(Y1SY2)sx2 ....... 157 Figure 36. Motivation Example (ME_L) Dataset – E(X2SX1)sx1 vs. E(Y2SY1)sx1 ...... 160 Figure 37. Impact of multivalued data on learning the conditional expectation........... 171 Figure 38. SDL/Relationship Match Framework............................................................ 191 xv Figure 39. General schema of how RelationshipBased Masking (RBM) approach works ............................................................................................................................. 200 Figure 40. 
Classification of the NMEGADP masking methods.................................... 201 Figure 41. Motivation Example (ME.L) – Original dataset............................................ 208 Figure 42. Motivation Example – Predicted values E(X1S)s vs. E(X2S)s.................... 209 Figure 43. Motivation Example – Residuals r1 vs. r2 ..................................................... 209 Figure 44. Motivation Example – The NMEGADP Perturbation (masking method 1) masked dataset .................................................................................................... 210 Figure 45. Motivation Example – The NMEGADP Shuffling (masking method 2) masked dataset .................................................................................................... 210 Figure 46. Motivation Example (ME.L) – Masked using masking method 3................ 211 Figure 47. Motivation Example (ME.L) – Masked using masking method 4................ 211 Figure 48. Graphical Pilot Study: Linear relationships (bivariate normal dataset) ........ 221 Figure 49. Graphical Pilot Study: Monotonic nonlinear relationships I......................... 222 Figure 50. Graphical Pilot Study: Monotonic nonlinear relationships II........................ 223 Figure 51. Graphical Pilot Study: NonMonotonic relationships (Ushape data) .......... 224 Figure 52. Graphical Pilot Study: NonMonotonic relationships (3cluster data).......... 225 Figure 53. NM.01 Dataset............................................................................................... 226 Figure 54. NM.01 Dataset – Predicted values E(X1S)s vs. E(X2S)s ............................ 227 Figure 55. NM.01 Dataset – Residuals r1 vs. r2.............................................................. 227 Figure 56. NM.01 Dataset – Masked using masking method 1...................................... 228 Figure 57. NM.01 Dataset – Masked using masking method 2...................................... 228 Figure 58. 
NM.01 Dataset – Masked using masking method 3...................................... 229 Figure 59. NM.01 Dataset – Masked using masking method 4...................................... 229 Figure 60. NM.02 Dataset............................................................................................... 230 Figure 61. NM.02 Dataset – Predicted values E(X1S)s vs. E(X2S)s ............................ 231 xvi Figure 62. NM.02 Dataset – Residuals r1 vs. r2.............................................................. 231 Figure 63. NM.02 Dataset – Masked using masking method 1...................................... 232 Figure 64. NM.02 Dataset – Masked using masking method 2...................................... 232 Figure 65. NM.02 Dataset – Masked using masking method 3...................................... 233 Figure 66. NM.02 Dataset – Masked using masking method 4...................................... 233 Figure 67. NM.03 Dataset............................................................................................... 234 Figure 68. NM.03 Dataset – Predicted values E(X1S)s vs. E(X2S)s ............................ 235 Figure 69. NM.03 Dataset – Residuals r1 vs. r2.............................................................. 235 Figure 70. NM.03 Dataset – Masked using masking method 1...................................... 236 Figure 71. NM.03 Dataset – Masked using masking method 2...................................... 236 Figure 72. NM.03 Dataset – Masked using masking method 3...................................... 237 Figure 73. NM.03 Dataset – Masked using masking method 4...................................... 237 Figure 74. NM.04 Dataset............................................................................................... 238 Figure 75. NM.04 Dataset – Predicted values E(X1S)s vs. E(X2S)s ............................ 239 Figure 76. NM.04 Dataset – Residuals r1 vs. r2.............................................................. 239 Figure 77. 
NM.04 Dataset – Masked using masking method 1...................................... 240 Figure 78. NM.04 Dataset – Masked using masking method 2...................................... 240 Figure 79. NM.04 Dataset – Masked using masking method 3...................................... 241 Figure 80. NM.04 Dataset – Masked using masking method 4...................................... 241 Figure 81. NM.05 Dataset............................................................................................... 242 Figure 82. NM.05 Dataset – Predicted values E(X1S)s vs. E(X2S)s ............................ 243 Figure 83. NM.05 Dataset – Residuals r1 vs. r2.............................................................. 243 Figure 84. NM.05 Dataset – Masked using masking method 1...................................... 244 Figure 85. NM.05 Dataset – Masked using masking method 2...................................... 244 Figure 86. NM.05 Dataset – Masked using masking method 3...................................... 245 xvii Figure 87. NM.05 Dataset – Masked using masking method 4...................................... 245 Figure 88. MNL.01 Dataset ............................................................................................ 246 Figure 89. MNL.01 Dataset – Predicted values E(X1S)s vs. E(X2S)s.......................... 247 Figure 90. MNL.01 Dataset – Residuals r1 vs. r2 ........................................................... 247 Figure 91. MNL.01 Dataset – Masked using masking method 1 ................................... 248 Figure 92. MNL.01 Dataset – Masked using masking method 2 ................................... 248 Figure 93. MNL.01 Dataset – Masked using masking method 3 ................................... 249 Figure 94. MNL.01 Dataset – Masked using masking method 4 ................................... 249 Figure 95. MNL.02 Dataset ............................................................................................ 250 Figure 96. 
MNL.02 Dataset – Predicted values E(X1S)s vs. E(X2S)s.......................... 251 Figure 97. MNL.02 Dataset – Residuals r1 vs. r2 ........................................................... 251 Figure 98. MNL.02 Dataset – Masked using masking method 1 ................................... 252 Figure 99. MNL.02 Dataset – Masked using masking method 2 ................................... 252 Figure 100. MNL.02 Dataset – Masked using masking method 3 ................................. 253 Figure 101. MNL.02 Dataset – Masked using masking method 4 ................................. 253 Figure 102. MNL.03 Dataset .......................................................................................... 254 Figure 103. MNL.03 Dataset – Predicted values E(X1S)s vs. E(X2S)s........................ 255 Figure 104. MNL.03 Dataset – Residuals r1 vs. r2 ......................................................... 255 Figure 105. MNL.03 Dataset – Masked using masking method 1 ................................. 256 Figure 106. MNL.03 Dataset – Masked using masking method 2 ................................. 256 Figure 107. MNL.03 Dataset – Masked using masking method 3 ................................. 257 Figure 108. MNL.03 Dataset – Masked using masking method 4 ................................. 257 Figure 109. NM.01.S1 Dataset ....................................................................................... 258 Figure 110. NM.01.S1 Dataset – Predicted values E(X1S)s vs. E(X2S)s..................... 259 Figure 111. NM.01.S1 Dataset – Residuals r1 vs. r2....................................................... 259 xviii Figure 112. NM.01.S1 Dataset – Masked using masking method 1............................... 260 Figure 113. NM.01.S1 Dataset – Masked using masking method 2............................... 260 Figure 114. NM.01.S1 Dataset – Masked using masking method 3............................... 261 Figure 115. 
NM.01.S1 Dataset – Masked using masking method 4 ........ 261
Figure 116. ME.L.S1 Dataset ........ 262
Figure 117. ME.L.S1 Dataset – Predicted values E(X1|S)s vs. E(X2|S)s ........ 263
Figure 118. ME.L.S1 Dataset – Residuals r1 vs. r2 ........ 263
Figure 119. ME.L.S1 Dataset – Masked using masking method 1 ........ 264
Figure 120. ME.L.S1 Dataset – Masked using masking method 2 ........ 264
Figure 121. ME.L.S1 Dataset – Masked using masking method 3 ........ 265
Figure 122. ME.L.S1 Dataset – Masked using masking method 4 ........ 265

ACRONYMS AND ABBREVIATIONS

ANN	Artificial Neural Networks
CADP	Correlated-Noise Additive Data Perturbation (also known as Kim's method)
CC	Canonical Correlation
CDF	Cumulative Distribution Function
CGADP	General Additive Data Perturbation: the Copula Approach
CI	Confidence Interval
CNMEGADP	Non-Monotonic Exact General Additive Perturbation: the Copula Approach
DCA	Data-Centric Approach
DM	Data Mining
DV	Dependent Variable
EC	European Community
EGADP	Enhanced (or Exact) General Additive Data Perturbation
ET	Equivalence Tests
FDA	Food and Drug Administration
GADP	General Additive Data Perturbation
IPSO	Information Preserving Statistical Obfuscation
IV	Independent Variable
LSSVM	Least Squares Support Vector Machines
MDA	Multiple Discriminant Analysis
MLP	Multilayer Perceptron Neural Networks
MR	Multiple Regression
MSE	Mean Square(d) Error or (Mean) Sum of Square(d) Errors
NMEGADP	Non-Monotonic EGADP
PDF	Probability Distribution Function
PPDM	Privacy-Preserving Data Mining
PPE	Privacy-Preserving Estimation (Regression)
PTTE	The Paired t-Test for Equivalence
RBM	Relationship-Based Masking
RBT	Rotation-Based Transformation
S2MLP	Secure 2-party Multivariate Linear Regression
S2MSA	Secure 2-party Multivariate Statistical Analysis
SADP	Simple Additive Data Perturbation
SDC	Statistical Disclosure Control
SDL	Statistical Disclosure Limitation
SVM	Support Vector Machines
TOST	The Two One-Sided t-Test

MATHEMATICAL SYMBOLS AND NOTATIONS

S	The set of non-confidential numeric and categorical attributes
X	The set of confidential numeric attributes
Y	The set of masked numeric attributes
V	The set of random variates used to generate the independent unscaled noise set b for masking the confidential attributes X (after scaling: e) and generating Y
Si	A non-confidential numeric or categorical attribute i
Xi	A confidential numeric or categorical attribute i
Yi	A masked numeric or categorical attribute i
Vi	A column-vector random variate used to generate the independent unscaled noise bi for masking attribute Xi (after scaling: ei) and generating Yi
p	The number of non-confidential attributes S
q	The number of confidential attributes X (or masked attributes Y)
d	The total number of attributes in a dataset (equal to p + q)
e	The set of column-vector independent noise terms resulting from scaling b to have the same covariance matrix as r (i.e., Σe = Σr)
ei	A column vector of independent noise corresponding to ri, resulting from scaling bi to have the same covariance matrix as r (i.e., Σe = Σr)
b	The set of column-vector residuals resulting from regressing the random variates V on both the non-confidential and confidential attributes (S and X)
bi	A column vector of residuals resulting from regressing a random variate Vi on both the non-confidential and confidential attributes (S and X)
r	The set of column-vector residuals resulting from regressing the confidential attributes X on the non-confidential attributes S
ri	A column vector of residuals resulting from regressing one confidential attribute Xi on the non-confidential attributes S
E(.)	Expected value
E(.|.)	Conditional expectation
ε	Independent noise or residuals, usually resulting from a regression of a random variable on a set of independent variables
ε_int	The initial prediction error a snooper gets by learning the conditional expectations E(X|S)
ε_new	The new prediction error a snooper gets by trying to improve his prediction, conditioning on both E(X|S) and Y
Σr	The covariance matrix of the residuals resulting from regressing X on S
Σb	The covariance matrix of the unscaled independent noise terms resulting from regressing V on S and X
Σe	The covariance matrix of the scaled independent noise terms, specified to be equal to Σr
f(x)	Strictly a real (deterministic) mathematical function that represents the exact component of the relationship between an independent variable x and a dependent variable y, with no random element involved
x	An independent random variable (IV)
y	A dependent random variable (DV)
Corr(.,.)	The Pearson correlation between two random variables; also used as a possible data-utility measure for linear relationships
I	The input to an Artificial Neural Network (ANN)
O	The output of an Artificial Neural Network (ANN)
A	A dependent random variable
ai	The actual values of the dependent random variable A
âi	The corresponding predicted (or fitted) values using a specific nonlinear regression model
ā	The mean of the dependent variable A
R²	The coefficient of determination
H0	The null hypothesis
Ha	The alternative hypothesis
μo	The population mean of observations
μp	The population mean of model predictions
δ	The equivalence margin of significance tests
δ0	The equivalence margin of the regression intercept using the regression-based validation test for model validation
δ1	The equivalence margin of the regression slope using the regression-based validation test for model validation
α	The test size (alpha level) for significance tests or equivalence tests

CHAPTER I.
INTRODUCTION

The maturity of current information technology, especially telecommunications, storage and database technology, facilitates the collection, transmission and storage of huge amounts of raw data, unimagined until a few years ago. For the raw data to be utilized, they must be processed and transformed into information and knowledge that have added value, such as helping to accomplish the task at hand more effectively and efficiently. Data mining techniques and algorithms attempt to aid decision making by analyzing stored data to find useful patterns and to build decision-support models. These extracted patterns and models help to reduce the uncertainty in decision-making environments.

Statisticians and researchers conduct surveys and collect datasets that usually contain tens of records. These datasets are considered to be very large when they contain a few hundred records (Hand, 1998). Traditional statistical techniques are the main (and the most suitable) tools for analyzing these datasets. The main objective of the analysis is to make inferences and estimate population parameters from collected samples.

Frequently, statistical agencies must release samples of datasets to external researchers. However, these datasets may contain sensitive information about previously surveyed human subjects. This raises many questions about the privacy and confidentiality of individuals in released datasets. Privacy and confidentiality have recently become critical issues and a central concern for many people (Grupe et al., 2002). Sometimes these concerns result in people refusing to respond and share personal information, or worse, providing wrong responses.

Many laws emphasize the importance of privacy and define the limits of legal uses of collected data. In the healthcare domain, for example, the U.S. Department of Health and Human Services (DHHS) added new standards and regulations to the Health Insurance Portability and Accountability Act of 1996 (HIPAA).
The new standards aim to protect "the privacy of certain individually identifiable health data" (CDC, 2003). Grupe et al. (2002, Exhibit 1, p. 65) listed a dozen privacy-related acts and pieces of legislation issued between 1970 and 2000 in the United States. On the other hand, these acts and concerns limit, either legally and/or ethically, the release of datasets for reasons (sometimes legitimate) such as conducting research in the academic domain or obtaining competitive advantage in the business domain. In some cases, statistical offices face a dilemma of what can be called a "war of acts." While they must protect the privacy of individuals in their datasets, they are also legally required to disseminate these datasets. The conflict between the objectives of the Privacy Act of 1974 and the Freedom of Information Act is just one example of this dilemma (Fienberg, 1994).

This has led to an evolution in the field of statistical disclosure limitation (SDL). SDL methods attempt to find a balance between data utility (valid analytical results) and data security (privacy and confidentiality of individuals). In general, these methods try to either (a) limit access to the values of sensitive attributes (mainly at the individual level), or (b) mask the values of confidential attributes in datasets while maintaining the general statistical characteristics of the datasets (such as the mean, standard deviation, and covariance matrix). Data perturbation methods for microdata are one class of masking methods. Since most statistical analysis methods are based on linear models, data perturbation methods generally aim to maintain linear relationships (Muralidhar et al., 1999). When datasets are large, traditional statistical analysis techniques may not be the appropriate tools to use, for two reasons (Hand, 1998; 2000; Hand et al., 2000).
First, traditional statistical tools become unsuitable for making sense of the data and for making inferences about the population for large datasets; for instance, almost any small difference in a large dataset becomes significant. Second, large datasets may suggest that the data were not collected for inference (parameter estimation) about the population, and that another type of analysis might be more appropriate. In most cases, a significant amount of collected data is generated as a consequence of unplanned activities (e.g., transactional databases) rather than planned activities (e.g., experiment or survey designs). Therefore, as the size of datasets grows exponentially, the use of other (size-matched) analytical tools such as data mining becomes more appropriate.

Examples of large datasets are abundant. MarketTouch, a company located in Georgia, supports direct marketers with data and analytical tools (DMReview.com, 2004). It has a six-terabyte database called the Real America Database (RADB®), which provides information about more than 93 million households and 200 million individuals. It is updated monthly with more than 20 million records. Statistical agencies also experience this phenomenon of rapidly growing datasets. The Census Bureau (2001) reported that the collected Census 2000 data consist of "information about the 115.9 million housing units and 281.4 million people across the United States." These large sizes suggest the need for analytical tools that are suitable for large datasets, and again, data mining tools naturally come into play. In fact, the Census Bureau (accessed 2004) has started providing programs with data mining capabilities, such as DataFerrett (Federated Electronic Research, Review, Extraction and Tabulation Tool), which can be used to analyze and extract data from TheDataWeb, a repository of datasets that can be accessed freely online or bought offline. These datasets cover more than 95 subjects.
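The first point above, that almost any small difference becomes statistically significant in a large dataset, can be illustrated with a short simulation. The numbers here are hypothetical: two populations whose means differ by a practically negligible 0.01 standard deviations.

```python
import numpy as np

rng = np.random.default_rng(3)

def z_stat(n):
    """Two-sample z statistic for a tiny true mean difference (0.01 SD)."""
    a = rng.normal(0.00, 1.0, size=n)
    b = rng.normal(0.01, 1.0, size=n)   # practically negligible difference
    return (b.mean() - a.mean()) / np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)

print(round(z_stat(1_000), 2))       # modest sample: usually insignificant (|z| < 1.96)
print(round(z_stat(4_000_000), 2))   # huge sample: far beyond the 1.96 cutoff
```

With n in the millions, the expected z statistic is around 14, so the trivial difference is declared "highly significant" even though it has no practical meaning; this is one reason parameter-estimation machinery built for small samples fits large datasets poorly.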
Data mining techniques may lead to more significant threats to privacy and confidentiality compared to traditional statistical analytical techniques. Domingo-Ferrer and Torra (2003) made a connection between SDL methods and some AI (artificial intelligence) tools (note that statistics and AI are among the main fields contributing to data mining). They suggested there are two possible risks of AI (or, equivalently, DM) tools regarding the privacy and confidentiality of released masked datasets: disclosure and re-identification threats. From a disclosure threat perspective, DM tools can be used to aggregate or combine masked copies of a specific original dataset (Domingo-Ferrer and Torra, 2003). The goal is to reverse the masking effect and rebuild the original dataset, which raises a confidentiality issue. This may pose a great threat when simple, unsophisticated SDL techniques, such as the simple additive data perturbation (SADP) method (Traub et al., 1984), are used and many masked copies are released. DM tools are also used to enforce data integrity and consistency in distributed databases by re-identifying different records belonging to the same individual. Contrary to DM, SDL methods aim to avoid identity disclosure. Thus, from a re-identification threat perspective, DM tools can be used to re-identify individuals in masked datasets (raising a privacy issue). Domingo-Ferrer and Torra (2003) suggested that SDL methods should consider the existence of DM tools to assess their impact on disclosure (value disclosure) and re-identification (identity disclosure) threats.

These concerns about privacy and confidentiality when DM tools are used have led to the birth of privacy-preserving data mining (PPDM). The main goal of PPDM is to find useful patterns and build accurate models from datasets without accessing individuals' precise original values in the records of datasets (Agrawal and Srikant, 2000).
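The disclosure threat described above, in which a snooper combines several independently masked copies to cancel out additive noise, can be simulated in a few lines. The data and noise levels here are hypothetical, and the masking is a plain add-zero-mean-noise scheme in the spirit of SADP, not a faithful implementation of any published method.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(300, 60, size=500)          # hypothetical confidential values

# Suppose the agency releases k independently perturbed copies Y_j = X + e_j.
k = 50
copies = [x + rng.normal(0, 60, size=x.size) for _ in range(k)]

# A snooper simply averages the copies; the zero-mean noise cancels out.
estimate = np.mean(copies, axis=0)

err_one = np.abs(copies[0] - x).mean()     # error using a single masked copy
err_avg = np.abs(estimate - x).mean()      # error after averaging all copies
print(err_one, err_avg)                    # averaging sharply reduces the error
```

Because the noise standard deviation shrinks by a factor of about the square root of k, fifty copies cut the snooper's per-record error roughly sevenfold, which is why repeated releases under simple additive perturbation are dangerous.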
In many cases, PPDM algorithms employ one or more methods to protect the data. One protection method used is the Simple Additive Data Perturbation (SADP) method (Traub et al., 1984), which has undesirable characteristics in terms of data utility and data security (Muralidhar et al., 1999). Most of the newer and more sophisticated data perturbation and masking methods, such as CGADP (Sarathy et al., 2002), IPSO (Burridge, 2003), EGADP (Muralidhar and Sarathy, 2005b) and data shuffling (Muralidhar and Sarathy, 2003a; 2005a), have not been investigated in the PPDM domain. The only exception is the GADP method (Muralidhar et al., 1999), which appears in a few privacy-preserving classification studies (Islam and Brankovic, 2004; Wilson and Rosen, 2002; 2003). This study will discuss the possibilities of using some of these masking methods for monotonic relationships in privacy-preserving estimation (PPE), as well as how these methods can be adapted and extended for the more general case involving non-monotonic relationships.

I.1. PPE Problem – Definition and Research Scope

The goal of this study is to investigate the possibility of using and adapting some data masking (mainly data perturbation and data shuffling) methods to develop new privacy-preserving estimation (PPE) algorithms for numeric variables involving non-monotonic relationships. Estimation (or regression) tries to estimate and quantify the relationships among variables in datasets. The main goal is to investigate the impact of newer data perturbation methods on different types of relationships in datasets. The required criterion is simple yet powerful: any perturbation (or masking) method used should not hurt or alter (to the extent possible) the structure and type of the relationships among variables existing in original datasets. The result of testing such impact could be either preservation or destruction of the original relationships in masked datasets.
Unfortunately, current data perturbation methods do not preserve all possible relationships, as will be seen later. At best, they preserve monotonic nonlinear relationships, along with simpler (linear) relationships. This study proposes four new PPE masking algorithms, based on the concepts of data perturbation and data shuffling, to preserve the more difficult non-monotonic relationships. The focus and scope of this study within the space of privacy-preserving data mining (PPDM) techniques is represented in Figure 1.

[Figure 1 diagram: data mining (DM) is divided into directed DM (prediction: estimation and classification) and undirected DM (association rules and clustering); relationships are classified as linear or nonlinear, and monotonic or non-monotonic. The diagram keys mark the focus of the SDL/relationship match framework and the focus of the adapted masking techniques.]
Figure 1. Research scope of the study

I.2. PPE Problem – Importance and Requirements

There are many important real-life problems that require building estimation models for continuous variables. Many of these models are related to human subjects and involve confidential attributes. In the following paragraphs, we shed some light on aspects illustrating the importance of, and the need to protect, the privacy and confidentiality of human-related data used in regression models. Before we start, we want to resolve some terminology issues. While the term "estimation" appears more frequently in the engineering literature, other fields use the term "regression" (Ridgeway, 2003). Therefore, we may use the terms estimation and regression (technique(s)/model(s)/problem(s)) interchangeably.

The need to build estimation models (prediction models for continuous variables) occurs frequently in different domains. Berry and Linoff (2000; 2004) have mentioned many typical examples, including estimating the number of children in a family, the total household income of a family, real estate value, and a customer's lifetime value.
Estimation is also indirectly related to classification. Instead of classifying customers as "respondents" or "non-respondents," for example, estimation techniques can be used to assign a probability of the expected level of responsiveness (Berry and Linoff, 2004). This is very useful in marketing campaigns when budgets are insufficient to target all possible respondents. Then, expected respondents with the highest probability of responsiveness can be targeted first.

In large datasets, which typify DM datasets, the frequency of each class in categorical and binary variables is usually high. On the other hand, a real-number salary figure, for example, in a released dataset can be unique to the degree that it can single out the real identity of a de-identified whole record. Categorical and binary attributes thus automatically have more security against disclosure risks than continuous variables. Because of this, Domingo-Ferrer et al. (2001) indicated that while non-perturbative masking methods such as sampling (Skinner et al., 1994), which are based on the concepts of sample uniqueness and population uniqueness, may suit categorical microdata, they cannot be used for numeric microdata. Hence, continuous attributes should attract more attention for protection against disclosure risks than categorical attributes.

Regression models that involve allocating huge amounts of money based on information about human subjects are sometimes required. Ridgeway (2003) briefly mentioned a real-world case which exemplifies the importance of such estimation models. Medicare (2004) is a federal health insurance program that covers elderly people (65 years or older), younger people with disabilities, and those suffering from End-Stage Renal Disease (ESRD). There are many Centers for Medicare and Medicaid Services (CMS) nationwide.
CMS is required by the 1997 Balanced Budget Act to develop a "prospective payment system" to allocate a $4.3 billion budget to healthcare facilities that help Medicare-insured inpatients to rehabilitate. Part of the system is a (regression) "cost model" built using different features (attributes) of patients to predict the cost of rehabilitation. These attributes come from two main sources. CMS provides attributes such as age, reason for hospitalization, and cost of care. Secondary sources provide functional ability scores measuring the motor and cognitive abilities of patients. The cost model is used to predict the cost of rehabilitation per patient, and healthcare facilities are reimbursed accordingly.

Regression models are frequently used in business problems. Linear regression is the best tool when all relationships among variables in a dataset are linear. This is guaranteed when the dataset is normally distributed. However, most business datasets are non-normal (Zhang, 2004), which opens the door to all possible forms of relationships, including nonlinear (monotonic or non-monotonic) relationships. Clearly, estimation and regression models for both linear and nonlinear relationships are important for many applications. In many cases, accessing sensitive data about human subjects might be necessary to build such models. Consequently, concerns about privacy and confidentiality automatically arise. However, there is a definite shortage of research related to privacy-preserving estimation techniques (PPE), especially for nonlinear and non-monotonic relationships, as we will see in the PPE-related literature (Subsection II.1.1) in the next chapter.

In summary, estimation (regression) is one of the four main pillars of data mining (DM) techniques (see Figure 1 above) and is frequently used for different applications.
In many cases, individuals' privacy and confidentiality become a greater issue for numeric values than for categorical values (Domingo-Ferrer et al., 2001; 2003). Indeed, converting numeric values into categorical ones is one way to protect privacy and confidentiality. Alternatively, masking methods such as data perturbation can be used. Accordingly, the first of three main requirements that masking methods need to satisfy in the context of PPE is:

· Requirement I: Masked datasets must allow accurate estimation models to be built, while preserving individuals' privacy and confidentiality.

Different types of relationships can exist in a dataset. For instance, multivariate normal datasets guarantee that all existing relationships among variables are linear. For this special case, some existing masking methods are readily available and can perfectly preserve linear relationships. However, most (business) datasets contain nonlinear relationships (Zhang, 2004), which can be monotonic or non-monotonic (Fisher, 1970). "A truth about data mining not widely discussed is that the relationships in data the miner seeks are either very easy to characterize, or very, very hard" (Pyle, 2003). This leads us to the second requirement:

· Requirement II: Masking methods must preserve all possible types of relationships among variables (non-monotonic or monotonic; linear or nonlinear).

Al-Ahmadi et al. (2004) suggested an approach, called the "Data-Centric" Approach in PPDM, for developing new privacy-preserving data mining (PPDM) algorithms. This approach suggests, unlike many current PPDM algorithms (Agrawal and Srikant, 2000; Thuraisingham, 2005), that any new PPDM algorithm should focus only on altering and changing original datasets without changing standard data mining algorithms. This should be done in a way that does not hurt the validity of the required analyses (data utility) while maximizing data security.
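Requirement II can be motivated with a small synthetic demonstration. The data are hypothetical (a quadratic, non-monotonic dependence of a confidential variable on a non-confidential one, similar in shape to the Age/Expenditure example used later), and the masking shown is a deliberately simplified linear-fit-plus-scaled-noise scheme for illustration, not a faithful implementation of any published perturbation method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical non-monotonic relationship: the response peaks in the middle.
s = rng.uniform(20, 60, size=1_000)                          # non-confidential
x = 450 - 0.4 * (s - 40) ** 2 + rng.normal(0, 20, 1_000)     # confidential

# Simplified masking that only "sees" linear structure: replace x by its
# linear fit on s plus noise scaled to the linear-fit residual variance.
b1, b0 = np.polyfit(s, x, deg=1)
resid = x - (b0 + b1 * s)
y = b0 + b1 * s + rng.normal(0, resid.std(), size=s.size)

def quad_r2(u, v):
    """R^2 of a quadratic fit of v on u (captures the non-monotonic shape)."""
    fit = np.polyval(np.polyfit(u, v, 2), u)
    return 1 - np.sum((v - fit) ** 2) / np.sum((v - v.mean()) ** 2)

print(round(quad_r2(s, x), 2))   # high: strong non-monotonic pattern in original
print(round(quad_r2(s, y), 2))   # near zero: the pattern is destroyed in masked data
```

The linear summary statistics (means, linear correlation with s) carry over to the masked variable, yet any analyst fitting a curved model to the masked data would conclude there is no relationship at all, which is exactly the failure mode Requirement II rules out.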
This approach is important for reasons that will be apparent in Subsection II.1.2, titled "Data-Centric Approach (DCA) for Privacy-Preserving Data Mining." The third requirement is:

· Requirement III: PPE methods must be Data-Centric.

I.3. Motivation Example

Boyens et al. (2002) have indicated that more businesses have started to outsource data mining tasks to specialized data mining and knowledge discovery consulting companies. In other cases, some organizations, such as statistical offices, are required by law to release datasets to outsiders such as researchers and data miners. In addition, some businesses need to release datasets to specific parties because of an alliance, although they (i.e., the businesses) may have the expertise needed to run and build different data mining models by themselves. In all of these cases, the data utility and data security concerns are important.

To motivate our discussion, we provide a hypothetical example. A store wants to release a 1,000-record dataset to an allied market analysis firm while protecting the privacy and confidentiality of its customers. The dataset consists of four main variables (along with other identifier fields such as Name and Address). Two of these variables are non-confidential and two are confidential. The two non-confidential attributes are the age of customers (numeric, between 20 and 60 years) and the gender (binary; 0: male or 1: female), denoted by S1 and S2, respectively. The two confidential attributes are the annual expenditure in $ (numeric) and the debt in $ (numeric), denoted by X1 and X2, respectively. The first five and the last five records of the dataset are shown in Table 1. The main analysis required by the market analysis firm is regression and estimation modeling.

Table 1.
First 5 and last 5 records of the store dataset (motivation example)
(Age and Gender are non-confidential attributes; Expenditure and Debt are confidential attributes.)

NO      Age (S1)   Gender (S2)   Expenditure $ (X1)   Debt $ (X2)
1       53.11      0             258.57               514.74
2       57.37      1             211.35               569.33
3       22.06      0             224.45               609.51
4       25.46      1             272.98               547.63
5       51.25      1             292.63               516.33
:       :          :             :                    :
996     37.16      0             391.54               413.41
997     50.72      1             301.59               500.72
998     28.71      1             296.49               531.81
999     27.71      0             308.91               537
1000    29.62      1             298.04               469.97
Min     20.009     0             172.610              383.490
Max     59.969     1             444.190              632.800
Mean    40.264     0.468         296.775              504.692
STD     11.884     0.499         59.299               59.054

The store is required ethically and legally to protect the privacy and confidentiality of its customers (data security). This is also very important for maintaining customer trust and loyalty, and for retaining profitable customers. At the same time, and from a different perspective, the store is required to enable the allied market analysis firm to obtain accurate regression models from the released masked dataset (data utility). As an initial precaution, the store de-identifies the dataset by removing identifier fields such as Name and Address. Figure 41 (Appendix D) shows the relationships among the four variables. The figure clearly indicates the existence of nonlinear, non-monotonic relationships (e.g., the relationship between S1: age and X1: expenditure; see Figure 2). In Section II.3 in the next chapter and in Appendix E, we show that current masking techniques are unable to maintain relationships of the type shown in Figure 2 (i.e., non-monotonic relationships).

[Figure 2 scatter plot: Age (horizontal axis, roughly 16 to 64) vs. Expenditure $ (vertical axis, roughly 100 to 500).]
Figure 2. Motivation example: non-monotonic relationship between Age (S1) and Expenditure $ (X1)

I.4. Summary and Outline

In this chapter, we introduced privacy-preserving estimation (PPE) involving non-monotonic relationships and specified its scope. We also discussed its importance from a practical point of view.
In addition, we outlined the general requirements of privacy-preserving data mining (PPDM) and the specific requirements of privacy-preserving estimation (PPE).

In the next chapter (Chapter II), the relevant literature related to PPDM is briefly reviewed. The focus is on the limited literature of PPE. Then, the related work and concepts in masking methods are presented. This includes the advantages of using masking methods in PPDM. Optimal masking methods are defined, and the difficulty of their practical implementation is explained. Some recent masking methods are discussed, and their limitations in dealing with non-monotonic relationships are demonstrated. The chapter concludes by listing possible research questions in PPE.

Chapter III lays the theoretical basis for our proposed masking methods and the tools they need. The chapter starts by defining a "relationship" in the context of estimation and regression problems. Then, the theoretical role of Artificial Neural Networks (ANN) in learning such relationships is presented. Finally, the roles of the residuals left after removing the conditional expectations E(X|S) in defining and guiding the data utility and data security requirements are discussed.

Chapter IV proposes four new masking methods for preserving non-monotonic relationships by adapting and extending some existing masking methods. The chapter discusses their implementation and their assumptions.

Chapter V treats the subject of assessing the success of the proposed masking methods in terms of data utility and data security. Some possible new and existing data utility and data security measures for measuring the effectiveness of PPE masking methods are listed.

Chapter VI assesses the effectiveness of Relationship-Based Masking (RBM) when the relationships among confidential attributes are linear. It starts by demonstrating the use of some of the RBM data utility and data security measures on the motivation example.
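The residual decomposition mentioned above for Chapter III, removing a learned conditional expectation E(X|S) and replacing the residuals with independent noise of matching variance, can be previewed informally. This is a loose sketch on synthetic data, not the dissertation's actual algorithms: a cubic polynomial stands in here for the ANN used to learn E(X|S).

```python
import numpy as np

rng = np.random.default_rng(2)

s = rng.uniform(20, 60, size=1_000)                          # non-confidential
x = 450 - 0.4 * (s - 40) ** 2 + rng.normal(0, 20, 1_000)     # confidential

# Learn the conditional expectation E(X|S) with a flexible model
# (cubic polynomial here; an ANN plays this role in the dissertation).
e_x_given_s = np.polyval(np.polyfit(s, x, deg=3), s)
r = x - e_x_given_s                  # residuals after removing E(X|S)

# Masked values: keep the learned E(X|S), but replace the residuals with
# independent noise scaled to the residual variance. The independence of
# the noise serves data security; keeping E(X|S) serves data utility.
y = e_x_given_s + rng.normal(0, r.std(), size=s.size)

# The masked attribute still tracks the original non-monotonic curve:
corr_shape = np.corrcoef(e_x_given_s, y)[0, 1]
print(round(corr_shape, 2))
```

Because the noise is generated independently of the original residuals, individual masked values reveal nothing beyond what E(X|S) already reveals, while the non-monotonic shape survives for the analyst, which is the trade-off Chapters V through VII measure.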
Additionally, it briefly examines how a snooper may try to compromise a masked dataset involving non-monotonic relationships. Then it explains how the characteristics of original datasets determine the characteristics of masked attributes. Finally, this determination of characteristics is empirically demonstrated.

Chapter VII discusses the effectiveness of the RBM approach in terms of data utility when the relationships among confidential attributes are nonlinear. Eleven simulated datasets are used. The data utility assessment is done using the measures proposed in Section V.3. Although the focus is on data utility, for reasons explained there, the chapter briefly discusses the subject of data security.

Chapter VIII concludes this work by summarizing the main findings and results of this research. Additionally, the limitations of the proposed methods are discussed, and some possible future research directions in PPE are given. Finally, the possibility of using the proposed masking methods for other PPDM techniques, such as privacy-preserving classification, is suggested.

CHAPTER II.
LITERATURE REVIEW

In this chapter, we start by reviewing the literature related to privacy-preserving data mining (PPDM). The focus is mainly on privacy-preserving estimation (PPE). Then, we review the main related concepts in statistical disclosure control (SDC). We also review some data perturbation and data shuffling masking methods. Next, we investigate the possibility of using these masking methods for PPE and assess their impact on non-monotonic relationships. We conclude this chapter by articulating some of the possible research questions in PPE.

II.1. Related Work in Privacy-Preserving Data Mining (PPDM)

The data mining "algorithm" (or "technique," in the terminology of Berry and Linoff (2000; 2004)) is one of five different dimensions that can be used to classify privacy-preserving data mining methods (Verykios et al., 2004b).
Similar to the classification of data mining (DM) techniques proposed by Berry and Linoff (2000; 2004), privacy-preserving data mining (PPDM) techniques can be classified as: (a) directed PPDM techniques: privacy-preserving estimation and privacy-preserving classification (both can be called prediction techniques), and (b) undirected PPDM techniques: privacy-preserving association rules and privacy-preserving clustering.

While the field of privacy-preserving data mining is new, relatively good progress has been made on privacy-preserving classification and association rules. Examples of privacy-preserving classification are Agrawal and Srikant (2000), Du and Zhan (2002; 2003), Du et al. (2004), Islam and Brankovic (2004), Johnsten and Raghavan (2000; 2001), Kantarcioglu and Clifton (2004b), Kantarcioglu and Vaidya (2003), Lindell and Pinkas (2002), Vaidya and Clifton (2004), Vaidya et al. (2004), and Yang et al. (2005). Examples of privacy-preserving association rules are Ashrafi et al. (2003; 2004), Evfimievski et al. (2002), Evfimievski et al. (2004), Kantarcioglu and Clifton (2004a), Oliveira and Zaïane (2003a), Oliveira et al. (2004), Rizvi and Haritsa (2002), Saygin et al. (2002), Vaidya and Clifton (2002), Verykios et al. (2004a), and Zhang et al. (2004). Some progress has been made in privacy-preserving clustering. Examples include Klusch et al. (2003), Lin et al. (2004), Merugu and Ghosh (2003a; 2003b), Oliveira and Zaïane (2003b; 2004a; 2004b), and Vaidya and Clifton (2003). However, there has been little research in privacy-preserving estimation (regression).
Figure 3 shows an abstract view of the privacy-preserving data mining (PPDM) related literature broken down by PPDM technique.

[Figure 3. Privacy-Preserving Data Mining (PPDM) Literature — a taxonomy of the citations listed above into four groups: Association Rules, Classification, Clustering, and Estimation (Du et al., 2004; Karr et al., 2004; Reiter, 2003; Sanil et al., 2004).]

II.1.1. Review of Privacy-Preserving Estimation (PPE) Literature

Sanil et al. (2004) proposed an algorithm for computing the exact coefficients of multiple linear regression on the union of a vertically distributed (or partitioned) dataset without sharing original values. This algorithm is applicable when there is a single shared, non-confidential dependent variable, and the unshared confidential, independent variables are owned by more than two parties (agents) involved in the estimation process. It is based on Powell’s algorithm (Brent, 2002; Powell, 1964) for finding the minimum (as a numerical solution of a series of one-dimensional minimization problems) of a multivariable quadratic function without calculating its derivative. In addition, the algorithm of Sanil et al.
(2004) utilizes the secure summation algorithm (Benaloh, 1987; Clifton et al., 2002), which is considered part of secure multiparty computation in the cryptography literature (Schneier, 1996, pp. 551-552). The secure summation algorithm is used to share a statistical summary (a total), populated partially by every party, without revealing how much each party contributes to that statistic. This total is needed for estimating the regression coefficients iteratively. By the end of the proposed regression algorithm, each party can calculate accurately, from the global regression model, the coefficients of the variables they own and share them with the other parties. Karr et al. (2004) dealt with the case of building a multiple linear regression on the union of a horizontally distributed dataset. They suggested two approaches. The first approach, called the secure data integration procedure, is used to integrate horizontally distributed datasets from more than two parties (agents) into one dataset while protecting the identity of the data source. The integrated dataset could then be shared among the cooperating parties. Each party could locally run linear regression analysis (or any other statistical analysis) and any of its diagnostics on the integrated dataset. This approach clearly does not meet the minimum requirements for protecting the privacy and confidentiality of surveyed human subjects, because it does not mask confidential attributes and aims only to protect the identity of the data sources (i.e., the identity of the involved parties, not the identity or confidentiality of surveyed human subjects). A second approach is based on the additive nature of linear regression analysis.
Instead of sharing and integrating the original unmasked records of the distributed datasets, the statistics required to calculate the least squares estimators of the linear regression coefficients are shared and integrated in a secure manner using the secure summation algorithm (Benaloh, 1987; Clifton et al., 2002; Schneier, 1996). In this approach, diagnostics cannot be calculated directly. Karr et al. (2004) proposed two approaches to resolve this issue, based on whether the computations of the diagnostics are additive with respect to the parties. When the computations of the diagnostics are additive, as for R2 and correlations, locally calculated numerators and denominators of the diagnostic measures are shared using the secure summation algorithm to calculate the global diagnostic measures. When this is not possible, as in the case of sharing the residuals, simulated residuals derived from synthetic independent variables and mimicking the relationships among the original residuals and independent variables are integrated and shared using the secure data integration procedure. The procedure for creating synthetic residuals is very similar to the procedure proposed by Reiter (2003), which is briefly discussed below. Remote regression servers (cf. Duncan and Mukherjee, 2000; Keller-McNulty and Unger, 1998; Schouten and Cigrang, 2003) are access-limitation methods for protecting microdata while enabling users to build linear regression models. Instead of running the regression analysis locally, users submit a request for the regression analyses they require, and the server returns the results in terms of regression coefficients and standard errors. Although this approach has the advantage of building linear regression models using original values, users do not usually have any means of checking the fit of their models. Reiter (2003) proposed a method to enable users to check the fit of their models while limiting the disclosure risks.
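To make the additive-sharing idea concrete, here is a minimal sketch (the data, party names, and function names are illustrative inventions, not Karr et al.'s actual protocol code) in which two parties holding disjoint rows pool locally computed least-squares statistics through a ring-based secure summation, so the global coefficients are recovered without any row leaving its owner:

```python
import random

def secure_sum(values, modulus=10**9):
    # Ring-based secure summation sketch (Benaloh, 1987; Clifton et al., 2002):
    # party 1 adds a random mask R, every party adds its own value modulo
    # `modulus`, and party 1 removes R at the end, so no intermediate running
    # total reveals any single party's contribution.
    r = random.randrange(modulus)
    running = (r + values[0]) % modulus
    for v in values[1:]:
        running = (running + v) % modulus
    return (running - r) % modulus

def local_stats(xs, ys):
    # Statistics each party computes on its own rows. Simple regression
    # y = b0 + b1*x is used for brevity; the multiple-regression case pools
    # the entries of X'X and X'y in the same element-wise way.
    return (len(xs), sum(xs), sum(ys),
            sum(x * x for x in xs), sum(x * y for x, y in zip(xs, ys)))

def pooled_fit(parts):
    # Pool each statistic across parties via secure summation, then solve
    # the normal equations for the global least-squares coefficients.
    n, sx, sy, sxx, sxy = (secure_sum(list(t)) for t in zip(*parts))
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = (sy - b1 * sx) / n
    return b0, b1

# Two parties holding disjoint rows generated from y = 2x + 1.
party1 = local_stats([0, 1, 2], [1, 3, 5])
party2 = local_stats([3, 4], [7, 9])
b0, b1 = pooled_fit([party1, party2])   # recovers intercept 1, slope 2
```

Note that only the five pooled totals are ever exchanged; the individual records stay local, which is exactly why residual-based diagnostics need the separate treatment described above.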
The proposed method is based on releasing artificial, simulated (marginally-wise) dependent and independent variables, residuals, and fitted values that mimic the original relationships of the built models. These synthetic variables can then be used, just like the original variables, to assess the fit of the regression models. He suggested that this approach could be used to provide diagnostics for other possible remotely built generalized linear models. The computations of many multivariate methods, including multivariate linear regression, depend on matrix computations such as matrix multiplication and matrix inversion. Based on this well-known fact, Du et al. (2004) proposed protocols called secure two-party matrix computation protocols, which enable two agents to collaboratively run matrix computations without knowing about or accessing the other party’s original, sensitive values and without the involvement of a third party. They suggest that secure versions of some multivariate statistical analyses could be formulated using these secure matrix computation building blocks. This approach is generally called Secure 2-party Multivariate Statistical Analysis (S2MSA). These protocols are used to reformulate some specific multivariate statistical analysis problems, including the Secure 2-party Multivariate Linear Regression (S2MLP) problem, in which the dependent variable is known to both agents. As we have seen, not much has been done in privacy-preserving estimation and regression when we need to release a dataset for general analysis. The focus of current PPE methods is linear relationships and linear regression methods. Our work deals with a more difficult problem that involves non-monotonic relationships. Most prior work deals with partitioned (vertically or horizontally) datasets, while our work is about centralized datasets, whose masked copies will be released in whole for regression tasks.
The above methods require the involvement of all parties every time they must run a regression model, while our proposed methods release the whole masked datasets and do not require ongoing cooperation between the data owner and the data analyst. Many of the above methods assume, and restrict their applicability to, the existence of a single shared dependent variable, while our methods allow building regression models using any numeric variable as a dependent variable. The methods mentioned are proposed mainly for regression tasks, while there is initial evidence that our approach can be used for other analyses (see Section VIII.2, titled “Possible Opportunities and Limitations”).

II.1.2. Data-Centric Approach (DCA) for Privacy-Preserving Data Mining

Many PPDM approaches modify some existing DM algorithm(s) while masking the data (Thuraisingham, 2005); see, for example, Agrawal and Srikant (2000). Enforcing the use of a modified algorithm with a masked dataset to ensure accurate results is not a good idea, for many reasons. First, data miners usually employ more than one algorithm to mine a dataset. Examining all data mining algorithms, as well as modifying them, is not feasible. Second, once a dataset is released, there is no guarantee as to which algorithm might be applied. Using a non-prescribed, standard algorithm may lead to incorrect conclusions and actions. Instead, as suggested by Al-Ahmadi et al. (2004), datasets should be protected or masked without reference to a specific DM algorithm. More recently, Oliveira and Zaïane (2004c) supported the concept of the Data-Centric Approach (DCA) in their standardization suggestions to PPDM researchers and developers.
Oliveira and Zaïane (2004a) applied the DCA concept practically in developing a new PPDM clustering algorithm called Rotation-Based Transformation (RBT), which modifies only the data, such that applying standard distance-based clustering algorithms to both the normalized original datasets and the transformed datasets produces the same results. Hence, data perturbation and masking methods can be a good starting point for implementing the Data-Centric Approach (DCA) in PPDM. Next, we review the latest developments in data masking methods and see how we can use some of them in estimation problems. In Appendix A, we also provide a simple framework, called the SDL/Relationship Match Framework, for choosing a specific data perturbation method (or, in general, any masking method) to mask a specific dataset for building estimation models. The match is based on the type of the most difficult relationships existing in original datasets that the chosen masking method can preserve and reproduce in masked datasets.

II.2. Related Work in Masking Methods

Statistical Disclosure Limitation (SDL) techniques, known also as Statistical Disclosure Control (SDC) techniques, are a set of techniques that aim to control the amount of disclosed information about sensitive attributes at the level of individual records in disseminated datasets, while providing valid, overall statistically analyzable datasets. The main goals of SDL techniques are twofold: (a) to minimize disclosure risks by protecting the privacy (identity disclosure) and confidentiality (value disclosure) of individuals surveyed or included in released datasets, and simultaneously (b) to maximize data utility (preserving the original (statistical) characteristics of datasets) at the aggregate level (Willenborg and Waal, 2001). The phrase “disclosure limitation” in SDL (or “disclosure control” in SDC) indicates two things. First, there is a specific amount of information, usually at an aggregate level, in the original dataset that one wants to reveal.
Second, one wants to make sure that no further information (mainly sensitive or confidential) can be learned. In fact, when a dataset is released, some sort of disclosure automatically occurs, and total elimination of disclosure is impossible (Fienberg, 1994). Therefore, there are always possible disclosure risks associated with any released dataset. Hence, methods to assess the different possible disclosure risks of any masking method are needed. Similarly, methods are also needed to assess data utility. Thus, when a masking method is proposed, existing or new measures that quantify data utility and disclosure risk are used to prove the effectiveness of the masking method. Two natural types of disclosure one wants to prevent are the identities of individuals (identity disclosure) and the exact values of their confidential attributes (exact value disclosure). The problem of disclosure control is complicated by a second type of value disclosure: partial value disclosure. In this case, the exact values of confidential attributes are protected, but good statistical estimates can be gained by accessing the masked attributes (Adam and Wortmann, 1989; Dalenius, 1977; Muralidhar et al., 1999; Muralidhar and Sarathy, 1999; Sarathy and Muralidhar, 2002). The only way to completely eliminate disclosure risks is to avoid releasing datasets (given that other physical security measures are implemented). In this situation, unfortunately, no data utility is achieved (Fienberg, 1994). Datasets need to be disseminated on many occasions, for many good reasons. In the case of statistical databases, permissible queries should get responses. Two important reasons, among others, are that dissemination is either (a) legally required, as in the case of the Bureau of the Census, or (b) research-motivated. For example, a hospital releases information about patients with a specific disease to a medical research institute developing a cure for the disease.
In the data mining field, datasets can be released to an external DM and knowledge discovery consulting company in an effort to gain some competitive advantage. There is an increasing trend in many businesses to outsource data mining tasks (Boyens et al., 2002). There are basically two general approaches to disclosure control: limiting access to sensitive attributes, or masking the original confidential attributes (Adam and Wortmann, 1989; Willenborg and Waal, 2001). Query restrictions used in statistical databases (Adam and Wortmann, 1989) are one example of the access-limitation approaches. In this approach, a query returns only an aggregate statistic such as the SUM. The number of records used to build the aggregate statistic should be more than a specific threshold for the result to be sent to the user. In addition, successive queries should not have a large overlap. However, in many cases, this is not sufficient to prevent disclosure. When the results of permissible queries are restricted to aggregate information, inferential disclosure can occur when a snooper learns unrevealed information by combining the results of more than one query. A snooper is a legitimate user who misuses access privileges to obtain unauthorized content (Adam and Wortmann, 1989; Muralidhar et al., 1999; Muralidhar et al., 2001). For example, a snooper can combine the results of the following two queries, which can be issued by different users, to learn the salary of a company’s president: query 1, “total salaries the company pays,” and query 2, “total salaries the company pays except for the president” (Clifton, 2003). Although the result of each query by itself does not represent a privacy or confidentiality threat, combining the two leads to exact disclosure. One proposed solution (which may not be practical) is to track all answered queries of all users, to avoid answering a query that would increase the amount of information released and lead to disclosure.
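The president's-salary example can be made concrete in a few lines (every name and figure below is fabricated purely for illustration):

```python
# Hypothetical payroll table; names and amounts are invented for illustration.
salaries = {"president": 500_000, "analyst_a": 60_000, "analyst_b": 55_000}

# Query 1: total salaries the company pays.
q1 = sum(salaries.values())

# Query 2: total salaries the company pays, except for the president.
q2 = sum(v for k, v in salaries.items() if k != "president")

# Each aggregate looks harmless on its own, but their difference is an
# exact disclosure of one individual's confidential value.
inferred_president_salary = q1 - q2
```

Both queries clear a records-count threshold individually; only their combination is dangerous, which is why query auditing must consider the whole history of answered queries rather than each query in isolation.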
Malvestuto and Moscarini (2003) discussed the problem of inferring confidential values through repeated queries in multidimensional databases, as well as a graphical auditing solution (the answer map) for the problem. Willenborg and Waal (2001) devoted a book to the discussion of the principles related to masking techniques. They differentiated between two types of datasets that need to be protected: tabular data (aggregated data) and microdata (individual records). Perturbative and non-perturbative masking methods are reviewed for each dataset type. The former replace original values with fabricated ones, while the latter utilize the original values. Willenborg and Waal also reviewed data utility and data security measures for each dataset type. Duncan and Pearson (1991) suggested that masked microdata should be released instead of aggregated data (such as aggregate-statistic responses to restricted queries, or tabular data) to maximize data utility and to allow for different types of analyses. Muralidhar and Sarathy (2005b) adopted this viewpoint and pointed out that accessing microdata is a requirement for data mining analysis tasks. Clifton (2003) also supported this opinion. One important category of perturbative SDL methods for microdata is data perturbation methods. In data perturbation, a noise term is used to change the original confidential attributes and generate masked ones. There are mainly two types of perturbation methods: multiplicative and additive data perturbation methods. Multiplicative data perturbation methods generate masked values by multiplying the original unmasked confidential values by an error term with a mean equal to 1 and a small variance (Kim and Winkler, 2003; Muralidhar et al., 1995). On the other hand, additive data perturbation methods add an error term with mean 0 and a specific variance or covariance matrix to the confidential attributes.
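The two noise models can be sketched as follows (the column, the noise distributions' normality, and the standard deviations are arbitrary illustrative choices):

```python
import random

def perturb_additive(values, sd):
    # Masked value = x + e, with e ~ N(0, sd^2): zero-mean additive noise.
    return [x + random.gauss(0.0, sd) for x in values]

def perturb_multiplicative(values, sd):
    # Masked value = x * e, with e ~ N(1, sd^2): mean-1 multiplicative noise.
    return [x * random.gauss(1.0, sd) for x in values]

def mean(values):
    return sum(values) / len(values)

random.seed(42)
incomes = [50_000.0] * 10_000                       # toy confidential column
masked_add = perturb_additive(incomes, sd=2_000.0)
masked_mult = perturb_multiplicative(incomes, sd=0.04)
# Both schemes hide the exact individual values while leaving the mean
# (approximately) intact at the aggregate level.
```

This is the sense in which perturbation trades record-level accuracy for aggregate-level utility; which higher moments and relationships survive depends on how the noise covariance is chosen, which is what the methods reviewed next address.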
Examples of the latest additive data perturbation methods are GADP (Muralidhar et al., 1999), CGADP (Sarathy et al., 2002), IPSO (Burridge, 2003), and EGADP (Muralidhar and Sarathy, 2005b). One of the advantages of data perturbation methods is that they automatically safeguard against exact value disclosure (Adam and Wortmann, 1989; Muralidhar and Sarathy, 1999). However, their ability to protect against partial value disclosure varies from one method to another. Dalenius (1977) defined disclosure risk as the increase in the snooper’s knowledge after accessing the masked attributes. To account for the worst-case scenario, the assumption should be that the snooper has maximum knowledge about the confidential attributes, represented in the form of their distributions (Muralidhar and Sarathy, 2003c). The remainder of this section is divided into three subsections. The first subsection discusses the advantages of using masking methods in developing PPE and PPDM methods. The second subsection discusses the Conditional Independence Theory for developing optimal (in terms of data utility and data security) masking methods, and the practical limitation of this theory. The third briefly discusses the latest related masking methods and their optimality status. Then, in the main section to follow, we test the performance of some of these methods on the store dataset in the motivating example to see whether they can be used for PPE involving non-monotonic relationships. We then list some difficulties in developing a new PPE method for non-monotonic relationships. Finally, we conclude this chapter by compiling a list of possible PPE research questions.

II.2.1. Advantages of Using Masking Methods for PPE

Data can be classified as aggregate data (tabular summaries) and microdata (full data). While access-limitation methods for microdata can achieve some security goals, full access to microdata is important for DM.
Actually, one of the biggest barriers facing data mining projects today is the “inability to release data” due to privacy concerns (Clifton, 2003). Masking methods are a sound choice, since they allow the release of microdata without any access restriction. They have a rich literature and are built on a solid theoretical basis (cf. Adam and Wortmann, 1989; Willenborg and Waal, 2001). Equally important, they are rigorous techniques for maintaining privacy and confidentiality while maximizing data utility (mainly for statistical analysis) (Muralidhar et al., 1999; Muralidhar and Sarathy, 2003a; 2005a; Sarathy et al., 2002). Moreover, masking techniques automatically provide protection against exact value disclosure (Adam and Wortmann, 1989; Muralidhar and Sarathy, 1999). From another perspective, they are good starting points for practically implementing the Data-Centric Approach (DCA) in PPDM (Al-Ahmadi et al., 2004), as discussed earlier.

II.2.2. Optimal Masking Methods

In this section, we discuss the optimality requirements of masking methods based on the Conditional Independence Theory. Then we briefly discuss its practical limitations.

II.2.2.1. Conditional Independence Theory

The ultimate goals for any masking method are: (a) to maximize the data security of the masked dataset (minimize disclosure risks, both value and identity risks), and (b) to maximize the data utility (minimize information loss, or maximize data accuracy) of datasets after the protection. Many prior perturbation methods were developed based on the concept that there is always a tradeoff between these two goals. Examples include the simple additive data perturbation (SADP) method (Traub et al., 1984) and the correlated-noise additive data perturbation (CADP) method, known as Kim’s method (Fuller, 1993; Kim, 1986; Tendick, 1991).
Muralidhar and Sarathy (2003c) developed a theoretical framework for perturbation methods, called the Conditional Independence Theory, which eliminates the need for a tradeoff between data utility and disclosure risks (beyond a specific level defined by the characteristics of the data). The framework suggests that the perturbed values Y should be generated (a) only from the conditional distribution f(X|S), and (b) independently of the original confidential attributes X, given the non-confidential attributes S. The importance of these conditions, once they are met, lies in the fact that both the maximum data utility and data security (minimum disclosure risk) requirements will be automatically and simultaneously satisfied. Muralidhar and Sarathy (2003c) detailed these requirements of the conditional independence theory and their consequences in terms of marginal, joint and conditional distributions. The use of the conditional distribution f(X|S) in the data masking literature is not new, and it has been examined in different contexts: multiple imputation (Little, 1993; Rubin, 1993), categorical data (Fienberg et al., 1998), and disclosure risk measures (Willenborg and Waal, 2001). Muralidhar and Sarathy (2003c) proved that the two required conditions of the conditional independence theory, once met, satisfy both the data utility and disclosure risk requirements, as follows. The first condition is that the perturbed values Y should be generated from the conditional distribution f(X|S):

Y ~ f_{X|S}(X|S). (2.1)

Second, Y should be independent of X given S:

f_{X,Y|S}(X,Y|S) = f_{X|S}(X|S) f_{Y|S}(Y|S). (2.2)

In terms of data utility requirements, two characteristics of the original dataset should be preserved in the perturbed dataset. First, the joint distribution of the non-confidential attributes S and the perturbed attributes Y should equal the joint distribution of the non-confidential attributes S and the confidential attributes X: f(Y,S) = f(X,S).
The theory of conditional independence assures this: since Y is generated from f(X|S), f(Y|S) = f(X|S), and therefore

f(Y,S) = f(Y|S) f(S) = f(X|S) f(S) = f(X,S). (2.3)

Second, the marginal distribution of the perturbed attributes Y should be the same as the marginal distribution of the confidential attributes X: f(Y) = f(X). Again, the theory of conditional independence satisfies this requirement, as can be seen using (2.3):

f(Y) = ∫_S f(Y,s) ds = ∫_S f(X,s) ds = f(X). (2.4)

Preserving the joint distribution of the perturbed dataset to be the same as that of the original dataset (i.e., f(Y,S) = f(X,S)) maintains all relationships among the variables. This suggests that applying any analysis that depends on relationships among variables to the perturbed dataset should provide results similar (if not identical) to those obtained by applying the same analysis to the original dataset. In addition, maintaining identical marginal distributions for the perturbed and original datasets produces the same results for univariate analysis of the original and masked datasets, for each corresponding variable. In terms of data security requirements, there are two assumptions regarding intruders: (a) they have full access to the non-confidential attributes S, and (b) they have maximum knowledge about the confidential attributes, namely the conditional distribution of the confidential attributes X given the non-confidential attributes S: f(X|S). Both assumptions collectively account for the worst-case scenario: the snooper does everything possible to breach the confidentiality and privacy of individuals in masked datasets, even before their release. In this context, disclosure risk is defined by the increase in the snooper’s ability (or the reduction of his uncertainty) to predict confidential attributes once the masked dataset is released (Dalenius, 1977; Duncan and Lambert, 1986). Thus, accessing the released masked dataset (i.e.
S and Y) might cause an increase in the snooper’s prediction power, since there is now more information (i.e., Y) to use. However, the theory of conditional independence requires that Y is independent of X given S, and this makes the following relationship true: f(X|S,Y) = f(X|S). This can be proven using (2.2):

f(X|S,Y) = f(S,X,Y) / f(S,Y) = [f(X,Y|S) f(S)] / [f(Y|S) f(S)] = [f(X|S) f(Y|S)] / f(Y|S) = f(X|S),

where (2.2) is applied in the third step. Therefore, snoopers do not gain any incremental knowledge (beyond what they already know about the confidential attributes X from the non-confidential attributes S) by accessing the masked attributes Y. As we have seen, the Conditional Independence Theory, once met, guarantees maximum data utility and data security. For an interesting discussion about whether the Conditional Independence Theory is sufficient to reach the optimal level of data utility and data security, refer to the comments made by Polettini and Stander (2003) and the rejoinder by Muralidhar and Sarathy (2003b).

II.2.2.2. Practical Limitation of the Optimal Procedure

Utilizing the concept of conditional independence to develop masking methods for every possible dataset (including non-normal ones) proves to be a difficult problem (Muralidhar and Sarathy, 2003c). The main reason is the difficulty of learning the true multivariate conditional density function f(X|S). In practice, f(X|S) is usually unknowable and difficult to estimate. This limits the practical applicability of the Conditional Independence Theory in developing optimal masking methods. Nevertheless, some special cases (such as the case of multivariate normal data) have been addressed in the SDL literature, in which optimal masking methods have been built based on this theory. In the next subsection, we discuss some of the latest (PPE-related) masking methods. In addition, we explain which masking methods (and in which settings) satisfy the Conditional Independence Theory.
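The multivariate normal special case can be illustrated numerically. The sketch below is my own, for the standardized bivariate normal setting where f(X|S) is known in closed form (X|S=s ~ N(ρs, 1−ρ²)); drawing Y from f(X|S) with fresh noise satisfies conditions (2.1) and (2.2), so the masked data reproduce corr(X,S) (utility) while corr(X,Y) falls to about ρ², meaning Y reveals nothing about X beyond what S already does (security):

```python
import math
import random

random.seed(7)
rho, n = 0.6, 20_000

# Standardized bivariate normal pair: S non-confidential, X confidential,
# so the true conditional is X | S=s ~ N(rho*s, 1 - rho^2).
s = [random.gauss(0.0, 1.0) for _ in range(n)]
x = [rho * si + math.sqrt(1 - rho**2) * random.gauss(0.0, 1.0) for si in s]

# Conditions (2.1)-(2.2): draw Y from f(X|S) using fresh, independent noise,
# so Y is independent of X given S.
y = [rho * si + math.sqrt(1 - rho**2) * random.gauss(0.0, 1.0) for si in s]

def corr(a, b):
    # Sample Pearson correlation.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)

# corr(y, s) matches corr(x, s) (about rho), while corr(x, y) is only
# about rho**2: the snooper gains nothing about X from Y beyond S.
```

This is exactly the behavior GADP-style methods achieve for multivariate normal data; the practical limitation above is that, outside such special cases, the conditional f(X|S) needed for the first step is unknown.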
II.2.3. Recent Masking Methods

Muralidhar et al. (1999) proposed a new perturbation method, called the General Additive Data Perturbation (GADP) method, that avoids the problems of the early perturbation methods. GADP is mainly designed for maintaining linear relationships. The ideal situation is when datasets are normally distributed, which assures that all relationships among variables are linear (Kotz et al., 2000). Burridge (2003) and Muralidhar and Sarathy (2005b) reported that GADP suffers from sampling error that affects its performance when it is applied to small datasets. Burridge (2003) suggested a new perturbation method, called the Information Preserving Statistical Obfuscation (IPSO) method, which does not suffer from the same problem. The idea behind IPSO lies in capturing the sufficient statistics (Anderson, 2003; Johnson and Wichern, 1998; Lehmann and Casella, 1998) of original, normally distributed datasets and utilizing them to produce the perturbed values in masked datasets. Muralidhar and Sarathy (2005b) recognized that although IPSO does a good job when dealing with small datasets, it has a problem with data security. They proposed a variant of the GADP method that uses sufficient statistics to avoid sampling error in small datasets. The new approach is called the Enhanced (or Exact) General Additive Data Perturbation (EGADP) method. EGADP maintains exact linear relationships, even in small datasets. All three of the above masking methods are proposed for linear relationships. Sarathy et al. (2002) proposed a new method, called the General Additive Data Perturbation method: the Copula approach (CGADP), which can maintain monotonic (linear or nonlinear) pairwise relationships. A copula is a function that joins a group of marginal distribution functions into one multivariate (copula) distribution (Jouini and Clemen, 1996; Nelsen, 1999; Schweizer, 1991; Sklar, 1959).
CGADP utilizes a multivariate normal copula (Clemen and Reilly, 1999; Joe, 1997) to transform non-normal distributions into normal ones, which capture the monotonic dependence structure (rank-order correlation) of the original datasets. Then GADP or EGADP can be applied to these transformed normal datasets to produce (normally distributed) masked attributes. Finally, the masked attributes are transformed back to their original marginals. Muralidhar and Sarathy (2003a; 2005a) proposed a new masking method, called data shuffling, which combines the advantages of perturbation methods and data-swapping methods. Like data perturbation methods, the new approach maximizes data security and data utility in the case of linear or monotonic nonlinear relationships. Like data-swapping methods, it maintains the marginal distributions of the masked confidential attributes exactly as the marginal distributions of the original confidential attributes. In addition, data shuffling has the advantage of efficient implementation by utilizing “only rank order data” (Muralidhar and Sarathy, 2005a, p. 1). There are two approaches to implementing data shuffling: parametric and nonparametric. The parametric approach is similar to the CGADP perturbation method (Sarathy et al., 2002), with an additional step: rank-ordering the original values of the confidential attributes X according to the rank order of the perturbed values Y to get the final shuffled values. The nonparametric data shuffling approach has the advantage of bypassing the problem of identifying unknown marginal distributions. It utilizes the empirical CDF as an estimate of the unknown marginal distributions. From an optimality perspective, GADP and EGADP applied to multivariate normal data satisfy the Conditional Independence Theory. Therefore, they are optimal in terms of maximizing data utility and data security.
Although IPSO provides ideal data utility by preserving exactly all (linear) relationships in normal data, it has a problem on the data security side, since the masked attributes Y are not generated independently from f(X|S) (Muralidhar and Sarathy, 2005b). When applied to non-normal data, GADP and EGADP provide ideal security but not ideal data utility: they reproduce only the linear relationships in the masked data. This is generally adequate for a majority of statistical analyses of masked data, but not for data mining. Many studies (Fuller, 1993; Sarathy et al., 2002; Sullivan and Fuller, 1989) concluded that applying additive perturbation methods (including the sophisticated methods above) to non-normal datasets reduces the accuracy and utility of the masked datasets, because nonlinear relationships are not preserved. Hence, GADP and EGADP may not be suitable for general data mining. Because of the known difficulty of estimating the true conditional density function f(X|S) in non-normal datasets, developing optimal masking methods based on the Conditional Independence Theory is not feasible in this case (Muralidhar and Sarathy, 2003c). Nevertheless, preserving some characteristics of non-normal datasets is possible. For example, CGADP and data shuffling can maintain monotonic (linear or nonlinear) pairwise relationships. Both CGADP and data shuffling provide optimal data security but not optimal data utility, because they do “not utilize the true conditional density f(X|S)” (Muralidhar and Sarathy, 2003c). Instead, they approximate f(X|S) with a multivariate normal copula distribution. Nevertheless, the level of data utility provided may be sufficient for some data mining applications.

II.3. Impact of Current Masking Methods on Non-Monotonic Relationships

Lechner and Pohlmeier (2004) investigated how some disclosure limitation masking methods affect the estimation of nonlinear data models using econometric estimation techniques.
Two disclosure limitation methods were studied: blanking and noise addition. The researchers used three econometric estimation techniques: the SIMEX method, the calibration method, and a semiparametric sample selectivity estimator. Lechner and Pohlmeier (2004) concluded that the disclosure limitation techniques, to varying degrees, make it difficult to obtain nonlinear models and other estimates from masked datasets similar to those obtainable from the original datasets. They further pointed out that while the effects of masking methods on the estimators of linear (regression) models have been well understood, "nonlinear regression techniques coping with implications of data masking are still at their infancy." We picked two advanced data masking methods that have been proven to provide maximum data utility and data security: EGADP (Muralidhar and Sarathy, 2005b) for linear relationships, and (CEGADP-based) data shuffling (Muralidhar and Sarathy, 2003a; 2005a) for monotonic nonlinear relationships. Neither has been used in the PPDM arena. These methods were applied to the store dataset (see Table 1 in Section I.3). While they provide good security measures for the store dataset, these methods do not produce useful masked datasets for the required regression tasks because of the existence of nonmonotonic relationships. For example, the relationship between the age and expenditure variables (see Figure 2 in Section I.3) is not preserved in the masked data. The destructive effect of these methods on this type of relationship is demonstrated in Figure 4. Sarathy et al. (2002) pointed out the limitations of these methods in the case of nonmonotonic relationships.

Figure 4. Impact of current masking methods (EGADP and data shuffling) on nonmonotonic relationships

Clearly, there are no masking methods capable of preserving nonmonotonic relationships in masked datasets as they exist in original datasets.
However, the literature on SDL and data perturbation methods provides sophisticated masking methods for maintaining monotonic relationships (linear and nonlinear) and achieving high levels (sometimes optimal) of data utility and data security. Our goal is to adapt and extend some of these methods for the case of nonmonotonic relationships. In addition, we want to develop or adopt some measures for data utility and data security. These measures will be used to investigate the effectiveness of our proposed methods. Next, we will briefly discuss some challenges in developing a PPE masking method for nonmonotonic relationships.

II.3.1. Challenges in Developing Practical PPE Masking Methods for Non-Monotonic Relationships

SDL masking methods were developed based on the concept that maintaining some aggregate correlation-based measures of original datasets in masked datasets would preserve the corresponding relationships captured by the aggregate measures. For example, GADP (Muralidhar et al., 1999), IPSO (Burridge, 2003) and EGADP (Muralidhar and Sarathy, 2005b) try to maintain the Pearson correlation matrix (along with the mean vector and covariance matrix). This ensures that all original linear relationships are reproduced in the masked datasets. On the other hand, CGADP (Sarathy et al., 2002) and data shuffling (Muralidhar and Sarathy, 2003a; 2005a) try to maintain the Spearman rank-order correlation matrix. This ensures that all original monotonic nonlinear (and linear, in the case of multivariate normal datasets) relationships are reproduced in the masked datasets. One challenge in developing a new PPE masking approach for nonmonotonic relationships is that there is no aggregate measure capable of capturing nonmonotonic relationships that we could use to reproduce this type of relationship in masked datasets. Thus, a different strategy is needed for developing the new approach.
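The failure of aggregate measures on nonmonotonic relationships is easy to demonstrate numerically. The sketch below (illustrative only; the Spearman coefficient is computed naively from ranks, ignoring tie corrections) shows that Spearman correlation captures a monotonic nonlinear relationship perfectly, while both Pearson and Spearman are essentially zero for a deterministic U-shaped relationship such as y = x²:

```python
import numpy as np

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    # Spearman rank-order correlation = Pearson correlation of the ranks
    # (naive ranking; no tie correction)
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return pearson(ra, rb)

x = np.linspace(-1, 1, 201)   # symmetric around zero
y_mono = np.exp(x)            # monotonic nonlinear relationship
y_nonmono = x ** 2            # nonmonotonic (U-shaped) relationship

r_mono = spearman(x, y_mono)          # 1.0: rank order fully preserved
r_pear = pearson(x, y_nonmono)        # ~0: no linear association
r_spear = spearman(x, y_nonmono)      # ~0: no monotonic association
```

Even though y is an exact function of x in the U-shaped case, no correlation-based aggregate registers the relationship, so matching such a matrix in the masked data cannot preserve it.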
Since PPE, and regression methods in general, are about quantifying the relationships among variables (Rud, 2001), "relationships" will be the basis for developing our new PPE approach. In the first section of Chapter III, we will provide a formal regression-related definition of relationships in PPE. Another challenge that arises is the need to test the effectiveness of the newly developed PPE method in terms of data utility and data security. Data security measures are based on general concepts and, therefore, existing security measures can be used. However, data utility measures are specific to the task at hand (Domingo-Ferrer et al., 2001; 2003). For PPE, it is necessary to maintain relationships among the variables in masked datasets in the manner they exist in original datasets. For monotonic relationships, data utility (in masking techniques) can easily be checked by comparing the respective aggregate measures of the original and masked datasets. Similarity between these aggregate measures indicates the success of the masking methods in terms of data utility. Again, there is no aggregate measure for nonmonotonic relationships; therefore, we need to take a different path in developing suitable data utility measures. Once more, we will use the concept of relationships as the basis for our data utility measures. The new proposed data utility measures, along with adopted security measures, are discussed in Chapter V.

II.4. Research Questions

This study raises and addresses the following questions:

· Question 1: Since there is no aggregate measure for nonmonotonic relationships, how should relationships be defined (and later measured) in PPE and PPDM?
· Question 2: Is it possible to develop/adapt new masking methods in PPE to maintain nonmonotonic relationships while maximizing data utility and data security?
· Question 3: What is the theoretical basis for these masking methods?
· Question 4: What is data utility in PPE? What is data security in PPE?
How can we quantify data utility and data security in PPE (success measures)?
· Question 5: Can the new methods preserve other types of relationships (linear or monotonic nonlinear)?
· Question 6: What are the assumptions of these new methods? What will happen if these assumptions are violated?
· Question 7: How can we establish the validity of the new methods (experiment design)?
· Question 8: How well do these methods compare with existing methods?

CHAPTER III. RELATIONSHIP-BASED MASKING: THEORETICAL BASIS

The previous chapter discussed the need for maintaining and reproducing the different types of possible relationships in masked datasets, as they exist in original datasets. In addition, it was seen that existing masking methods do not preserve all types of relationships in masked datasets. This chapter introduces four interrelated Relationship-Based Masking (RBM) methods for reproducing different types of relationships in masked datasets. The RBM approach is a three-stage approach: identifying relationships, analyzing residuals, and applying masking. These three stages are interrelated. The theoretical basis for the RBM methods lies mainly in the first two stages (relationships and residuals). Hence, the focus of this chapter is on the first two stages (and the tools related to them), while the subject of the following chapter is mainly the third stage (along with the first two stages). First, a formal definition of a relationship in the estimation and regression context is needed. In addition, we need an effective, proven way to learn and capture the different possible types of relationships that may exist in a dataset. It is also critical to understand the roles that residuals play in defining and guiding data utility and data security requirements in order to develop effective masking methods. Section III.1 draws attention to the formal statistical definitions of a relationship and discusses the one that is usually used in estimation and regression modeling.
In addition, this section specifies the class¹ of relationships that the RBM approach tries to learn from original datasets, and whether it is data-utility or data-security driven. Section III.2 investigates the capability of Artificial Neural Networks (ANN) in capturing relationships, including nonmonotonic ones. Section III.3 concludes this chapter by investigating the roles of the residuals r, left after fitting confidential attributes X (as dependent variables) to nonconfidential attributes S (as independent variables), in defining and guiding data utility and data security requirements.

III.1. Conditional Expectation: The Formal Definition of Relationships in PPE

Macnaughton (2002) compiles and discusses seven definitions of a relationship between two random variables. One of these definitions that is particularly suited to this study is that there is a relationship between variables x and y if the values of y can be expressed as: y = f(x) + e (3.1) "where e is usually viewed as being independent of x" and represents the random component of the relationship. f(x) is strictly a real (deterministic) mathematical function that represents the exact component of the relationship, with no random element involved. A strict mathematical function means that it maps from one or more values of x to only one value of y (but not to more than one value of y). For example, y = f(x) = x² is a mathematical function because it maps multiple values of x, such as (−2, 2), to exactly one value of y (4).

¹ We may use the term "class" when we refer to a relationship based on the confidentiality level of its involved dependent and independent variables. Examples include the class of relationships between X and S and the class of relationships among confidential attributes X, or E(Xi|Xj). In addition, we may use the term "type" or "shape" when we talk about the shape of a relationship: linear, monotonic nonlinear, or nonmonotonic.
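The single-valued versus multi-valued distinction above can be made concrete with the same y = x² example (a trivial sketch for illustration):

```python
# f(x) = x^2 is a (many-to-one) mathematical function: several inputs may
# share one output, but no input ever has more than one output
f = lambda x: x ** 2
same_output = (f(-2) == f(2) == 4)

# the inverse relation of f is one-to-many (multi-valued): the output 4
# maps back to both -2 and 2, so the inverse is NOT a mathematical function
preimages_of_4 = [x for x in (-2, -1, 0, 1, 2) if f(x) == 4]
```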
Thus, a mathematical function is either a one-to-one (1-1) mapping or a many-to-one (M-1) mapping, but not a one-to-many (1-M) mapping. The former two cases are called single-valued mappings, while the latter case is called a multi-valued mapping (Bishop, 1995). The existence of multi-valued mapping in data, which violates the definition of a mathematical function, impacts the effectiveness of our proposed masking methods in terms of data utility, as we will discuss later. Muralidhar and Sarathy (2005a) suggest that perturbed variables Y satisfy the minimum security requirements when they are generated using a function g(S, e) of only the nonconfidential attributes S and a noise term e that is independent of the original confidential attributes X given the nonconfidential attributes S. Notice the similarity of formula (3.1) with the function g(S, e) when one chooses the function g to be the addition of independent/orthogonal noise e to a random function f of the nonconfidential attributes S: Y = g(S, e) = f(S) + e (3.2) where e is independent of X (or e ⊥ X) given S as a security requirement, ⊥ denotes "orthogonal to", and f(S) is any random function of S. Equation (3.2) only specifies the data security requirements of perturbation methods; it does not consider or guarantee data utility. Data utility can be ensured if the function f(S) is carefully chosen. Macnaughton (2002) suggested that the mathematical form of f is usually determined by data analysis, although theoretical considerations can also be taken into account. He further suggested that such an approach usually leads to choosing the form of f as the best estimate of the conditional expectation E(Y|S) in Equation (3.2) (note that we are using the terminology of data perturbation: S, X and Y). Actually, E(Y|S) cannot be calculated initially since the masked values Y are not available. Nevertheless, the conditional expectation E(Y|S) is specified to be E(X|S) to maximize data utility.
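The scheme Y = E(X|S) + e can be sketched end to end for the simplest case. This is an illustrative sketch only, using one nonconfidential attribute, a linear E(X|S) fitted by least squares, and fresh normal noise matched to the residual spread; the real methods use more general estimators and noise models:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: one nonconfidential attribute S, one confidential X
n = 500
S = rng.normal(size=n)
X = 2.0 * S + rng.normal(scale=0.5, size=n)   # linear relationship for illustration

# Step 1: estimate the conditional expectation E(X|S); here a least-squares line
A = np.column_stack([np.ones(n), S])
coef, *_ = np.linalg.lstsq(A, X, rcond=None)
EX_given_S = A @ coef

# Step 2: residuals r = X - E(X|S); orthogonal to S by construction of OLS
r = X - EX_given_S

# Step 3: generate noise e independently of X (fresh draws with the residual
# standard deviation) and release Y = E(X|S) + e
e = rng.normal(scale=r.std(), size=n)
Y = EX_given_S + e

# the masked data preserve the X-S relationship: refitting Y on S recovers
# approximately the same coefficients
coef_y, *_ = np.linalg.lstsq(A, Y, rcond=None)
```

Because e is drawn without reference to X, an intruder observing Y learns nothing about X beyond what S already reveals, which is exactly the security condition in Equation (3.2).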
Thus, in our context, formula (3.2) can be expressed as: Y = E(X|S) + e (3.3) where e is independent of X given S (or e ⊥ S, X) and e resembles the characteristics of r obtained from: X = E(X|S) + r (3.4) where r is independent of S (or r ⊥ S) by definition. Equation (3.3) is the essence of EGADP (Muralidhar and Sarathy, 2005b). EGADP can preserve the linear characteristics of original datasets perfectly, even for very small datasets, while providing maximum data security (Section II.3). However, as we saw earlier, it cannot preserve nonlinear relationships. Estimating the conditional expectations E(X|S) accurately, and accordingly the "relationships" between X and S by the regression definition, is critical for maintaining data utility. We conclude this section by discussing the difference between independence and orthogonality (being uncorrelated). Independence implies orthogonality; the reverse is not always true (Hunter, 1972). When we learn relationships E(X|S) from multivariate normal data, the residuals r are always guaranteed to be independent of E(X|S) or any function of S (see Rhodes (1971), Property 8, p. 692). When the distributions of the data are non-normal, the only condition guaranteed at all times is that the residuals r are orthogonal (uncorrelated) to E(X|S) or any function of S (see Rhodes (1971), Proposition 2b, p. 690, and Bickel and Doksum (2001), Proposition 1.4.1 (a) and (b), p. 34). Nevertheless, "(w)hen two variables are uncorrelated, they may be (and often are) independent" (Schield, 1995). This means that the residuals r are often independent of E(X|S) or any function of S regardless of the data distribution. In our discussion, we do not make a distinction between these two concepts and we use independence and orthogonality interchangeably.

III.2. Artificial Neural Networks (ANN) Approaches for Estimating Conditional Expectations

There are many approaches that can be used to learn relationships (conditional expectations) in datasets.
For example, one can try to fit a polynomial of an appropriate degree to the data. However, this approach suffers from two limitations. First, it is a parametric approach that requires a prejudgment of the relationship before trying to fit it to the data. This proves difficult, especially if there is no theoretical guidance (Fisher, 1970). Second, it is affected by the curse of dimensionality (Bishop, 1995). Similarly, nonlinear regression requires the specification of the functional form before the estimation of parameters can begin. On the other hand, other approaches such as artificial neural networks (ANN) do not assume any prespecified form of the relationship. In addition, although ANNs also face the problem of the curse of dimensionality, they are less affected compared to the polynomial approach (Bishop, 1995). There are many characteristics of neural network architectures that make them appealing for estimating nonlinear and nonmonotonic conditional expectations. First, ANN approaches have been proven to be global function estimators of nonlinear and nonmonotonic continuous functions (Bishop, 1995; Hagan et al., 1996). Second, and more interestingly, they can be used to approximate the conditional expectation of the network output O given the network input I (i.e., E(O|I)) when the minimized error function is the (mean) sum of squared errors (MSE) and the network is used to map input I to output O (Bishop, 1995; Saerens, 1996; 2000). Saerens (1996) suggests that this is a fundamental mathematical statistics result based on estimation theory and can be found in many books, such as Deutsch (1965), Meditch (1969) and Shao (1999). Bishop (1995) proves the above fact mathematically for the case of multilayer perceptron (MLP) neural networks. Saerens (1996) does the same and emphasizes the importance of the assumption that the minimum error is indeed reached after the training. He also points out that many researchers arrive at the same conclusion.
A few examples for the continuous case are White (1989) and Wan (1990). Examples for the binary case are Bourlard and Wellekens (1989), Ruck et al. (1990), Gish (1990), and Shoemaker (1991). For a general review, refer to Richard and Lippmann (1991). There are two more interesting facts about using the MSE function for estimating conditional expectations. First, the ability of neural networks to learn conditional expectations is not affected by the characteristics of the noise (e.g., normal vs. non-normal) in the data (Saerens, 1996). Second, the ability to learn the conditional expectation when using MSE is not specific to multilayer perceptron (MLP) neural networks; it is independent of the neural network architecture (Bishop, 1995). In fact, this result even applies to architectures that are not classified as neural networks, as long as they try to minimize MSE and succeed in approaching the real minimum (Bishop, 1995; Saerens, 1996). (Multiple) linear regression, which utilizes the concept of least squared errors, does a good job of estimating linear conditional expectations in normally distributed datasets (where all relationships are linear). However, when applied to datasets containing nonlinear relationships, it does a poor job. This is because it assumes that relationships are linear and can be discovered by fitting lines that minimize the squared error. In this sense, the global minimum of the squared errors, which might be obtained by fitting curves to the data, is not reached. As a result, (multiple) linear regression cannot be used as a mechanism to estimate conditional expectations when the relationships are not linear. This shows the importance of the assumption that the minimum error is indeed achieved after the estimation process stops. ANN algorithms do not assume any functional form that may hamper learning the true conditional expectation. The above discussion is important to our context.
We can use any neural network architecture based on MSE to estimate the required conditional expectation of the confidential attributes X given the nonconfidential attributes S (i.e., E(X|S)) by mapping the input S to the output X, regardless of the form of the conditional expectation (monotonic or (single-valued mapped) nonmonotonic) or the characteristics of the error term. The main assumption is that the training procedure actually reaches, or at least approaches, the global error minimum. MLP neural networks suffer from the trap of local minima, which may limit their ability to reach the global minimum of the minimized error function. Consequently, the real conditional expectation may not be learned accurately. However, there are other neural network architectures that can be used instead and do not suffer from the same problem. The Support Vector Machine (SVM) (Vapnik, 2000) and the Least Squares Support Vector Machine (LS-SVM) (Suykens et al., 2002) are different neural network architectures (Schölkopf and Smola, 2003). SVM and LS-SVM are kernel-based methods (Schölkopf et al., 1999; Shawe-Taylor and Cristianini, 2004). They utilize the kernel trick to implicitly map an input space into a higher-dimensional space called the feature space. Then, well-known linear (and nonlinear) learning algorithms can be applied in the feature space to learn relationships and patterns. The two main tasks, among others, that can be run in the feature space are regression and classification. SVM and LS-SVM do not suffer from local minima that hinder optimization algorithms from reaching the global minimum of the cost function in the error space. This is because the cost function in the dual space (the main optimization space for (LS-)SVM cost functions) has a quadratic form with a unique global minimum (Suykens et al., 2002). In this study, we use LS-SVM (Suykens et al., 2002) and its Matlab toolbox implementation, LS-SVMlab 1.5 (2003), to learn conditional expectations.
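LS-SVMlab is a Matlab toolbox, but the LS-SVM regression solution itself reduces to solving one linear system (Suykens et al., 2002), which is why it has a unique global minimum. The sketch below is an illustration in Python with an RBF kernel; the hyperparameter values (gamma, sigma) are arbitrary choices for this toy example, not toolbox defaults:

```python
import numpy as np

def rbf_kernel(A, B, sigma=0.5):
    # Gaussian (RBF) kernel matrix between 1-D sample arrays A and B
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(s, x, gamma=100.0, sigma=0.5):
    """Solve the LS-SVM regression linear system
        [ 0   1^T          ] [b    ]   [0]
        [ 1   K + I/gamma  ] [alpha] = [x]
    The dual cost is quadratic, so this solution is the unique global minimum."""
    n = len(s)
    K = rbf_kernel(s, s, sigma)
    top = np.concatenate([[0.0], np.ones(n)])
    body = np.column_stack([np.ones(n), K + np.eye(n) / gamma])
    M = np.vstack([top, body])
    sol = np.linalg.solve(M, np.concatenate([[0.0], x]))
    b, alpha = sol[0], sol[1:]
    return lambda q: rbf_kernel(q, s, sigma) @ alpha + b

# nonmonotonic conditional expectation E(X|S) = S^2, plus noise
rng = np.random.default_rng(1)
s = np.sort(rng.uniform(-2, 2, 200))
x = s ** 2 + rng.normal(scale=0.1, size=200)

predict = lssvm_fit(s, x)
grid = np.array([-1.5, 0.0, 1.5])
est = predict(grid)          # should track E(X|S) = s^2 on the grid
```

Unlike an MLP, there is no iterative training to get trapped in a local minimum; the quality of the fit is governed only by the kernel and regularization choices.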
However, any learning mechanism that is theoretically capable of capturing and learning nonmonotonic conditional expectations (relationships) can be used in our proposed Relationship-Based NM-EGADP masking approach (RBM).

III.3. The Roles of the Residuals (r)

III.3.1. The Residuals' Role in Defining Relationships among Attributes

The residuals r obtained from estimating E(X|S) (see Equation (3.4) in Section III.1) play important roles in defining two main groups (classes) of relationships: (a) relationships between nonconfidential attributes S and confidential attributes X (i.e., E(X|S)), and (b) relationships among confidential attributes X (i.e., E(Xi|Xj) or E(Xj|Xi)). For simplicity and without loss of generality, we limit our discussion to two confidential attributes X (Xi and Xj, where i, j = 1, …, q and i ≠ j) and the nonconfidential attributes S. The residuals r (ri and rj) are obtained from the following two equations: Xi = E(Xi|S) + ri (3.5) and Xj = E(Xj|S) + rj (3.6). In our proposed masking methods, the masked variables Y (Yi and Yj) are generated as follows: Yi = E(Yi|S) + ei = E(Xi|S) + ei (3.7) and Yj = E(Yj|S) + ej = E(Xj|S) + ej (3.8) where e (ei and ej) are noise terms that try to preserve certain characteristics of the residuals r (ri and rj) to maximize data utility. In addition, the noise terms e are generated to satisfy security requirements and to avoid providing extra information about the confidential attributes X beyond what is known about them from the nonconfidential attributes S. In the proposed masking methods, we specify E(Yi|S) = E(Xi|S) and E(Yj|S) = E(Xj|S) (see Equations (3.7) and (3.8)) so that relationships between the confidential attributes X (Xi and Xj) and the nonconfidential attributes S are automatically reproduced in the masked datasets between the masked attributes Y (Yi and Yj) and the nonconfidential attributes S. The only required condition is that e should be orthogonal to S or any function of S, similar to the residuals r.
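The bivariate scheme of Equations (3.5)–(3.8) can be sketched as follows. This is an illustrative sketch only: the conditional expectation E(X|S) = sin(S) is taken as known rather than estimated, and the noise pair (ei, ej) is drawn from a normal distribution matched to the residuals' estimated joint moments, an assumption that holds here by construction but not in general:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
S = rng.normal(size=n)

# two confidential attributes sharing E(Xi|S) = E(Xj|S) = sin(S),
# with correlated residuals (ri, rj)
cov = [[1.0, 0.8], [0.8, 1.0]]
r = rng.multivariate_normal([0.0, 0.0], cov, size=n)
Xi = np.sin(S) + r[:, 0]
Xj = np.sin(S) + r[:, 1]

# noise e: fresh draws matching the residuals' estimated joint distribution,
# generated independently of X (normal residual model assumed for the sketch)
e = rng.multivariate_normal(r.mean(axis=0), np.cov(r.T), size=n)
Yi = np.sin(S) + e[:, 0]    # Equation (3.7) with E(Yi|S) = E(Xi|S)
Yj = np.sin(S) + e[:, 1]    # Equation (3.8) with E(Yj|S) = E(Xj|S)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]
```

Because the noise pairs reproduce the joint distribution of the residual pairs, the relationship among the masked attributes (corr(Yi, Yj)) tracks the original corr(Xi, Xj), while each noise term stays (approximately) orthogonal to S.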
Except for the orthogonality, the added noise terms e are not required to mimic every aspect or characteristic of the residuals r to preserve this class of relationships. However, relationships among confidential attributes X (E(Xi|Xj) and E(Xj|Xi))² are not directly reproduced in masked datasets by only fixing the conditional expectations E(X|S) while masking. In addition to the role of the conditional expectations E(X|S), E(Xi|Xj) depends heavily on the characteristics of the added noise (ei and ej). Ideally, we want the characteristics of the added noise e (ei and ej) to be exactly the same as the characteristics of the original residuals r (ri and rj) in terms of two related requirements: (a) orthogonality (mainly for maintaining relationships between X and S in masked datasets), and (b) joint distribution (mainly for maintaining relationships among confidential attributes X in masked datasets). Although the former is achievable, the latter is not always feasible. Therefore, we want to get as close as possible to the ideal case to maximize data utility. But before we discuss these data utility requirements further, we explore whether other possible types of residuals, resulting from different forms of estimation or (conditional) expectations, can be used in masking methods to maintain relationships (especially among confidential attributes X) without violating other requirements. First, let us consider the set of residuals that results from subtracting the expectations of the confidential attributes E(X) from the confidential attributes X. In this case, all relationships among X are completely captured by this set of residuals. However, this set of residuals captures none of the relationships between X and S, and there is no easy way to relate them to the conditional expectations E(X|S).

² For simplicity of discussion, we shall only talk about E(Xi|Xj) initially, with the assumption that it is a single-valued mapping.
Another idea is to use the residuals from fitting E(Xi|S, Xj). One problem with these residuals is that they do not satisfy an important security requirement: we cannot use any function of the confidential attributes X directly in generating the masked attributes Y (Muralidhar and Sarathy, 2003c; 2006b). Clearly, residuals obtained from E(Xi|Xj) are not suitable either, since they inherit the problems of the last two types of residuals: they do not carry any information about the nonconfidential attributes S, and they violate the security requirement of avoiding conditioning on confidential attributes X. Therefore, we can only use functions of the nonconfidential attributes S, as specified in Equations (3.5) and (3.6). In this case, E(Xi|Xj) is captured by the two (orthogonal) components in Equations (3.5) and (3.6): E(X|S) and r. Since E(X|S) is the fixed part during masking (refer to Equations (3.7) and (3.8)), the added noise set e (i.e., the dynamic or changed part) should mimic the characteristics of the residual set r to maintain the relationships among the masked attributes Y as they exist among the confidential attributes X. There are two dimensions of similarity: orthogonality and joint distributions. In Equations (3.5) and (3.6), the residuals r (ri and rj) are orthogonal to S and any function of S when the conditional expectations E(X|S) are correctly estimated (Bickel and Doksum, 2001, Proposition 1.4.1 (a) and (b), p. 34; Rhodes, 1971, Proposition 2b, p. 690). Notice that both E(Xi|S) and E(Xj|S) are functions of S. This is the main concept in mean square estimation; it is called "the orthogonality principle" (Gray and Davisson, 2004; Papoulis and Pillai, 2002). Interestingly, the orthogonality principle generally holds true even when the conditional expectations (e.g., E(Xi|S) and E(Xj|S)) are nonlinear functions of the data (here, S), where it is called the "Nonlinear Orthogonality Rule" (Papoulis and Pillai, 2002).
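The orthogonality principle, and the gap between orthogonality and independence discussed in Section III.1, can both be checked numerically. In the sketch below (illustrative; all distributional choices are arbitrary), the residuals around the correct nonmonotonic E(X|S) are uncorrelated with S and with nonlinear functions of S, yet they are not independent of S because their variance grows with |S|:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
S = rng.uniform(-2, 2, size=n)

# noise with E[r|S] = 0 but with variance depending on S (heteroscedastic,
# deliberately NOT independent of S)
r = rng.standard_normal(n) * (1 + 0.2 * np.abs(S))
X = np.sin(3 * S) + r                  # nonmonotonic E(X|S) = sin(3S)
resid = X - np.sin(3 * S)              # residuals from the correct E(X|S)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# orthogonality principle: residuals uncorrelated with S and functions of S
checks = [corr(resid, S), corr(resid, S ** 2), corr(resid, np.exp(S))]

# but uncorrelated is weaker than independent: the squared residuals are
# visibly correlated with |S|, revealing the dependence in the variance
dependence = corr(resid ** 2, np.abs(S))
```

This is exactly the situation described above for non-normal data: orthogonality to every function of S is guaranteed by the (nonlinear) orthogonality rule, while full independence is not.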
From a data utility perspective, for e to resemble the characteristics of r, e should be orthogonal to S or any function of S. Otherwise, relationships between X and S will not be reproduced in masked datasets between Y and 