



SIMULTANEOUS INFERENCE IN GENERALIZED LINEAR MODEL SETTINGS

By AMY WAGLER

Bachelor of Science in Mathematics, University of Texas of the Permian Basin, Odessa, Texas, USA, 1995

Master of Science in Statistics, Oklahoma State University, Stillwater, Oklahoma, USA, 2003

Submitted to the Faculty of the Graduate College of Oklahoma State University in partial fulfillment of the requirements for the Degree of DOCTOR OF PHILOSOPHY, July, 2007

COPYRIGHT © By AMY WAGLER, July, 2007

Dissertation Approved:

Dr. Melinda McCann, Dissertation Advisor
Dr. Stephanie Monks
Dr. Mark Payton
Dr. Wade Brorsen
Dr. A. Gordon Emslie, Dean of the Graduate College

ACKNOWLEDGMENTS

There are many I would like to thank for assistance, reassurance and inspiration during the time I worked on this dissertation. First, I would like to thank the entire Department of Statistics at Oklahoma State University, including all of the faculty members, staff, and students. I appreciate your kindness and concern for me. Of the faculty members, I would specifically like to thank my committee members, Dr. Stephanie Monks, Dr. Mark Payton, and Dr. Wade Brorsen, for the time and effort they put into this research project. Most importantly, I would like to thank my advisor, Dr. Melinda McCann, for her support and encouragement, perseverance in research, and friendship. I was fortunate to have an advisor such as you. I would also like to thank Trecia Kippola, Tracy Morris, and Janae Nicholson, who listened to my many complaints and were just good friends. The most emphatic thanks I reserve for my family, who are always loving and supportive. My Mom, Dad, and sisters for believing I could do this even when I didn’t believe. Most of all, I would like to thank Ron, my husband, and Olive, my daughter. When I needed help, encouragement, or a babysitter, my husband was always there.
When I was frustrated and feeling down, a smile from Olive was all I needed to put things in perspective. Thank you and I love you both.

TABLE OF CONTENTS

1 INTRODUCTION
2 The GLM and Estimated Quantities
  2.1 The Expected Mean Response
  2.2 The Odds Ratio
  2.3 Other Quantities
  2.4 Interval Estimation from GLMs
    2.4.1 Logistic Regression Model
    2.4.2 Reference Coding for Logistic Regression
    2.4.3 Loglinear or Poisson Model
3 Inferences on Quantities Estimated from a GLM
  3.1 Inference on the Mean Response
    3.1.1 Previous Methods
    3.1.2 Proposed Methods
  3.2 Bounds for Simultaneously Estimating Functions of Model Parameters
    3.2.1 Previous Methods
    3.2.2 Proposed Methods
4 Simulations
  4.1 Expected Response Function Simulations
    4.1.1 Expected Response Function Simulation Results
  4.2 Functions of the Parameters Simulations
    4.2.1 Functions of the Parameters Simulation Results
5 CONCLUSIONS
  5.1 Application of Proposed Methods
    5.1.1 Diabetes among Pima Women
    5.1.2 Depression in Adolescents
  5.2 Overall Conclusions
    5.2.1 Recommendations for Estimating the Mean Response
    5.2.2 Recommendations for Estimating the Parameters
  5.3 Future Research
BIBLIOGRAPHY
A FIGURES
B Moments of Binomial and Poisson Random Variables
  B.1 The Binomial Random Variable
  B.2 The Poisson Random Variable
C Restricted Scheffé Methodology
D SCR Methodology
E Fortran Code: Piegorsch-Casella PMLE for the Mean Response
F Fortran Code: SCR PMLE for the Mean Response
G Fortran Code: Piegorsch-Casella PMLE for the Parameters
H Fortran Code: SCR PMLE for the Parameters

LIST OF TABLES

1.1 Maternal Stress Relative Risks and 95% Confidence Intervals
2.1 Sample Contingency Table
2.2 Examples of Estimated Odds Ratios
5.1 Intervals for Proportion of Pima Women with Diabetes
5.2 Intervals for Odds Ratio for Depression in Adolescents

LIST OF FIGURES

4.1 MLE and PMLE SCR Intervals with wide range
4.2 pMLE Intervals Logit-Wide Range B=1
4.3 MLE and PMLE SCR Intervals with wide range
4.4 MLE and PMLE SCR Intervals with wide range
4.5 MLE and PMLE SCR Intervals with wide range
4.6 MLE and PMLE SCR Intervals with narrow range
4.7 MLE and PMLE SCR Intervals with narrow range
4.8 MLE and PMLE SCR Intervals with narrow range
4.9 MLE and PMLE SCR Intervals with narrow range
4.10 MLE and PMLE SCR Intervals with narrow range
4.11 MLE and PMLE SCR Intervals with narrow range
4.12 MLE and PMLE SCR Intervals with narrow range
4.13 MLE and PMLE SCR Intervals with narrow range
4.14 MLE and PMLE SCR Intervals with narrow range
4.15 MLE and PMLE SCR Intervals with narrow range
4.16 MLE and PMLE SCR Intervals for the Parameters
4.17 MLE and PMLE SCR Intervals for the Parameters
A.1 MLE Intervals Logit-Wide Range B=1
A.2 MLE Intervals Logit-Wide Range B=2
A.3 MLE Intervals Logit-Wide Range B=3
A.4 MLE Intervals Logit-Wide Range B=4
A.5 MLE Intervals Logit-Narrow Range B=1
A.6 MLE Intervals Logit-Narrow Range B=2
A.7 MLE Intervals Logit-Narrow Range B=3
A.8 MLE Intervals Logit-Narrow Range B=4
A.9 MLE Intervals Poisson-Wide Range B=1
A.10 MLE Intervals Poisson-Wide Range B=2
A.11 MLE Intervals Poisson-Wide Range B=3
A.12 MLE Intervals Poisson-Narrow Range B=1
A.13 MLE Intervals Poisson-Narrow Range B=2
A.14 MLE Intervals Poisson-Narrow Range B=3
A.15 pMLE Intervals Logit-Wide Range B=2
A.16 pMLE Intervals Logit-Wide Range B=3
A.17 pMLE Intervals Logit-Wide Range B=4
A.18 pMLE Intervals Logit-Narrow Range B=1
A.19 pMLE Intervals Logit-Narrow Range B=2
A.20 pMLE Intervals Logit-Narrow Range B=3
A.21 pMLE Intervals Logit-Narrow Range B=4
A.22 pMLE Intervals Poisson-Wide Range B=1
A.23 pMLE Intervals Poisson-Wide Range B=2
A.24 pMLE Intervals Poisson-Wide Range B=3
A.25 pMLE Intervals Poisson-Narrow Range B=1
A.26 pMLE Intervals Poisson-Narrow Range B=2
A.27 pMLE Intervals Poisson-Narrow Range B=3
A.28 MLE Intervals for the Parameters - Logit
A.29 MLE Intervals for the Parameters - Poisson
A.30 pMLE Intervals for the Parameters - Logit
A.31 pMLE Intervals for the Parameters - Poisson

CHAPTER 1

INTRODUCTION

Modern epidemiological and medical research routinely employs generalized linear modeling. These models can be helpful in understanding what behaviors or traits can influence the incidence of a particular disease or characteristic. For example, logistic regression models provide a means of relating the incidence of some trait or disease to a set of possible predictor variables, while loglinear models help us understand associations between a trait and predictor variables. After building a generalized linear model (GLM), one typically wishes to estimate particular quantities of interest such as response probabilities, odds ratios, or relative risks. Customarily, these are reported via confidence intervals or confidence bounds using some prespecified level of significance for each inference. For example, using a loglinear model, one could report 100(1 − α)% confidence intervals for each relative risk resulting from the model. Using one-at-a-time intervals is appropriate when the investigators are not making overall conclusions about the quantities of interest.
For example, if the aforementioned loglinear model was estimated and 100(1 − α)% confidence intervals for the relative risks were reported, conclusions about each individual relative risk could be made, but any statements comparing these relative risks would inflate the assumed α error rate. Research on simultaneous estimation procedures for quantities from generalized linear models has received little attention beyond very routine treatments, such as making Bonferroni adjustments to the usual confidence interval or constructing Scheffé intervals. However, recent advances made in simultaneous inference for linear models may be applied in the generalized linear model setting. Additionally, further improvements may be made by utilizing some unique properties of the estimated parameters from generalized linear models. I plan to present the justification for employing these simultaneous inference methods in the generalized linear model setting and, via simulation, compare their performance. Thus, the objective of this study is to develop simultaneous interval-based procedures that will estimate functions of linear combinations of the parameters of a generalized linear model. Specifically, this includes simultaneously estimating the expected response function, odds ratios, and relative risks from generalized linear models. Overall, attention is focused on quantities that are estimated from a GLM, not the estimation of the GLM itself. Obviously, the performance of any of these procedures will be influenced by how well the model is estimated, but this dissertation will assume the model is well estimated. Additionally, all of the procedures developed in this paper involve constructing interval estimates. Often, procedures that account for multiplicity employ hypothesis tests to make overall conclusions about a set of data.
It is more appropriate in the applications I will discuss to use simultaneous intervals instead of stepwise procedures, since I wish to not only detect statistical differences between quantities, but also to assess the practical significance of these differences. Thus, all methods discussed are interval-based procedures.

Before presenting the details of generalized linear models, a practical example of the implementation of a GLM may provide a frame of reference. A 2003 study from the American Journal of Epidemiology explored the relationship between maternal stress and preterm birth [1]. Several previously identified sources of maternal stress, such as high incidence of life events, increased anxiety, living in a dangerous neighborhood, and increased perception of stress, were explored for any association with preterm births. Specifically, the study focused on predicting the prevalence of preterm birth among pregnant women aged 16 or older from two prenatal clinics in central North Carolina. Upon admission to the study, women were asked to complete questionnaires in addition to completing a psychological instrument. Also, several blood, urine, and genital tract tests were conducted in order to assess the physical health of the candidates. In all, 2,029 women were eligible, recruited, and completed the preliminary tests in order to participate in the study. Of these participants, 231 delivered preterm, less than 37 weeks gestation.

Table 1.1: Maternal Stress Relative Risks and 95% Confidence Intervals

  Life Events Stress   RR     95% CI
  No Stress            1.00   (reference)
  Med-Low Stress       1.5    (1.0, 2.2)
  Med-High Stress      1.4    (0.9, 2.1)
  High Stress          1.8    (1.2, 2.7)

A loglinear model was employed to assess the relationship between the sources and levels of maternal stress and preterm birth. As a result, the authors considered the resulting model relative risks for each individual stress factor or level of a stress factor and its association with preterm birth.
Additionally, 95% confidence intervals were computed for each relative risk. The relative risk will be discussed in detail later, but note that for this study each relative risk is the risk of preterm birth for an individual with one particular maternal stress factor relative to the risk of preterm birth for an individual with none of the other identified sources of maternal stress present. Thus a large relative risk for a particular source of maternal stress indicates a strong association between that stress factor and preterm birth. In general, other quantities derived from the generalized linear model may also be of interest. Table 1.1 contains results from the preterm birth study discussed, though these results have been simplified from the actual implemented model for ease of presentation. In particular, the relative risks for different levels of stress due to general life events are presented. As previously discussed, note that the 95% confidence intervals for each relative risk presented in Table 1.1 estimate the risk of preterm birth for each individual subject to a particular level of a maternal stress factor (life event stress) with reference to the control (the case with no identified source of maternal stress). For this scenario, one-at-a-time inferences are reasonable if the researcher wishes to answer questions such as, “how does the presence of one source of maternal stress affect the risk of preterm birth?” Note that this question is only concerned with the presence of a particular stress factor and how it affects the incidence of preterm births. If one wishes to make any overall conclusions comparing how the multiple sources of maternal stress affect the risk of preterm birth, then another estimation procedure that accounts for multiplicity needs to be implemented.
For instance, in the preterm birth study, the researchers reported the above relative risks and confidence intervals, and then remarked that “(w)omen in the highest negative life events impact quartile had the highest risk (RR=1.8, 95% CI: 1.2, 2.7); however, the middle categories did not show increasing risk with increasing measures of stress.” This kind of conclusion is inappropriate given that the researchers only computed one-at-a-time 95% confidence intervals for the relative risks. Therefore, this is a case that would benefit from simultaneous inference on the relative risks. As another example where simultaneous inference would be appropriate, consider the case where the researcher wishes to identify the set of the sources and levels of maternal stress that are significantly associated with preterm birth. In this case the one-at-a-time intervals are again inappropriate. In order to make this kind of conclusion, the researcher needs to determine which groups of relative risks are significantly different from 1. If the researcher additionally wants to determine the practical significance of the differences between the varying sources and levels of maternal stress and the control, then confidence intervals with a multiplicity adjustment are required. Stepwise procedures are not adequate as they only determine where statistically significant differences exist, but do not provide a way to estimate the scale of these differences. Additionally, it is often desirable to make conclusions such as “if the subject has one level of a predictor variable, then he is twice as likely to have the disease as if he has any other level of that predictor variable”. Many other examples of similar conclusions could be given, but generally, these conclusions are comparing one parameter to another and the desired outcome is to somehow relate these parameters.
Thus, if one wishes to make any comparisons of these parameters, it is desirable to control the overall type I error rate by accounting for the multiplicity of inferences. In the following chapters, I will present the motivation for simultaneous inference of certain parameters and outline both the current methodologies and my proposed methodologies. Additionally, the simulation results of the proposed methods are presented and analyzed. Specifically, in chapter two, I present the generalized linear model and the typical quantities that are estimated from the model, and discuss why simultaneous inference of these quantities is essential in some situations. Additionally, I review some methods for computing one-at-a-time confidence intervals on various quantities resulting from generalized linear models. In chapter three, I outline the current methodologies used for simultaneously estimating various functions resulting from generalized linear models, and I propose four new methods to estimate these parameters from GLMs. In chapter four, I summarize how I evaluated these new methods using simulation and present the simulation results. Finally, I propose some future research questions regarding simultaneous estimation of a GLM and present some applications of the new methods in the concluding chapter.

CHAPTER 2

The GLM and Estimated Quantities

There are several generalized linear models (GLM) that permit estimates, such as the odds ratio or relative risk, where multiplicity adjustments often seem warranted. Some of these models include the logistic regression model, loglinear model, Poisson regression model, and the probit or complementary log-log model. In general, a GLM can be expressed as

  Y_i = g^{-1}(x_i'β + ε_i),  i = 1, ..., n    (2.1)

or alternatively,

  φ_i = g(E(Y_i | x_i)) = x_i'β,  i = 1, ..., n    (2.2)

where g links the expected response, E(Y_i | x_i), to φ_i, with x_i the vector of covariates corresponding to Y_i, β the k × 1 vector of regression parameters, and ε_i independently and identically distributed random variables. In the later sections, Y = (Y_1, ..., Y_n)' is the vector of responses and X = (x_1, ..., x_n)' is the full rank matrix of predictor variables. Each GLM corresponds to a particular link function g, typically called the canonical link when it transforms the mean to the natural parameter. In general, the maximum likelihood estimate (MLE) of the regression parameters is denoted β̂ = (β̂_1, ..., β̂_k). The MLE is asymptotically multivariate normal with mean β and covariance matrix

  V = (X'WX)^{-1}    (2.3)

where W is a diagonal matrix with diagonal elements w_i = (∂μ_i/∂φ_i)^2 / var(Y_i) for μ_i = E(Y_i | x_i) and φ_i in (2.2). Thus,

  β̂ ·∼ N_k(β, V).    (2.4)

We can estimate the covariance matrix, V, by

  V̂ = ĉov(β̂) = (X'ŴX)^{-1}    (2.5)

with Ŵ the matrix W evaluated at β = β̂. Further results will require the estimated standard error of x_i'β̂ for a given x_i vector with k known elements. This is given by

  σ̂_GLM(x_i) = (x_i' V̂ x_i)^{1/2},  i = 1, ..., n.    (2.6)

At times I will need to refer to a linear model in this proposal. A linear model is generally given by

  Y_i = x_i'θ + ε_i,  i = 1, ..., n    (2.7)

or alternatively,

  E(Y) = Xθ    (2.8)

where θ is the vector of regression parameters, Y and X are as previously defined, and ε_i ∼ iid N(0, σ²_LM). The MLEs for θ = (θ_1, ..., θ_k) are denoted θ̂ and

  θ̂ ∼ N_k(θ, σ²_LM F)    (2.9)

where σ²_LM is the variance of the model residuals for a linear model and F = (X'X)^{-1}. Whenever a linear model is referenced in this paper, the notation presented above will be utilized.

As discussed previously, the objective of this research is not to merely estimate a GLM, but to simultaneously estimate quantities derived from a GLM. There are many natural quantities that can be of interest when modeling data with a GLM.
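As a concrete illustration of (2.3)-(2.6), the sketch below evaluates V̂ = (X'ŴX)^{-1} and σ̂_GLM(x_i) for a small logistic model. The design matrix, binomial denominators, and the value of β̂ are all hypothetical, and no model fitting is actually performed; the point is only the covariance arithmetic.

```python
import math

# Hypothetical fitted logistic model: intercept plus one covariate,
# binomial denominators m_i, and an assumed MLE beta_hat.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
m = [20, 20, 20, 20]            # binomial denominators m_i
beta_hat = [-1.0, 0.6]          # assumed MLE (illustrative, not fitted)

def pi_hat(xi):
    """Fitted response probability for covariate vector xi."""
    eta = sum(b * x for b, x in zip(beta_hat, xi))
    return 1.0 / (1.0 + math.exp(-eta))

# W_hat = diag[m_i * pi_hat_i * (1 - pi_hat_i)] for the logistic model
w = [mi * pi_hat(xi) * (1.0 - pi_hat(xi)) for mi, xi in zip(m, X)]

# Entries of the 2x2 matrix X' W_hat X
a = sum(wi * xi[0] * xi[0] for wi, xi in zip(w, X))
b = sum(wi * xi[0] * xi[1] for wi, xi in zip(w, X))
d = sum(wi * xi[1] * xi[1] for wi, xi in zip(w, X))

# Invert [[a, b], [b, d]] to obtain V_hat = (X' W_hat X)^(-1)
det = a * d - b * b
V_hat = [[d / det, -b / det], [-b / det, a / det]]

def se_glm(xi):
    """sigma_hat_GLM(x_i) = sqrt(x_i' V_hat x_i), as in (2.6)."""
    quad = (xi[0] * (V_hat[0][0] * xi[0] + V_hat[0][1] * xi[1])
            + xi[1] * (V_hat[1][0] * xi[0] + V_hat[1][1] * xi[1]))
    return math.sqrt(quad)

print(round(se_glm([1.0, 1.5]), 4))
```

With more than two parameters the same computation would use a general matrix inverse; the hand-coded 2x2 inverse is only to keep the sketch dependency-free.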
These include measures such as the expected mean response, the odds ratio, the relative risk, and possibly others. While the focus of this paper is on the estimation of these quantities from GLMs, it should be mentioned that they may be estimated directly from the data. I will first present these measures in general, and then discuss them specifically in the context of generalized linear models.

2.1 The Expected Mean Response

The expected mean response is a generic term for either the response probability or the mean response. Depending on the sampling distribution of the data, either one or the other is of interest. For example, if we assume binomial sampling, the expected mean response function is the probability of a success for a given level of the predictor variables, or the response probability. When a Poisson sampling scheme is assumed, the expected mean response is the average for a particular cell in the contingency table, or the mean response. A response probability is the proper quantity of interest if one wishes to understand the probability of developing a disease or another characteristic for a given set of predictor variables that are believed to be associated with the disease. For example, a response probability could be used to inform a particular patient of their probability of developing a particular disease given their history and profile. With respect to the preterm birth example, if a doctor has a patient known to be experiencing a major life event, such as a death in the family, then she could ascertain that patient’s risk of preterm birth and take appropriate measures. Conversely, the mean response might be used in a situation where a clinician has recorded a host of risk factors for a particular disease and wishes to predict how many of the subjects will develop the disease. This communicates how many patients on average will or will not develop a certain characteristic.
In the context of the preterm birth example, the Poisson mean response could be used to estimate how many subjects out of the total sample size will experience a preterm birth. The Poisson mean response is simply estimated directly from the frequencies given in a contingency table when it is not estimated from a model. In general, a one-at-a-time 100(1 − α)% confidence interval for the response probability is given by

  π̂_i ± z_{α/2} (v̂ar(π̂_i))^{1/2}    (2.10)

where var(π̂_i) = π_i(1 − π_i)/m_i and v̂ar(π̂_i) = π̂_i(1 − π̂_i)/m_i. (Note that confidence intervals on the mean response for Poisson sampling distribution models are not typically computed.) This one-at-a-time confidence interval for π_i is often employed to estimate an expected mean response for binomial or multinomial sampling scenarios. If the researcher simply wants to know how a particular level of the predictor variables affects the incidence of disease, this is all that needs to be calculated. First, consider the case where the predictor variable is categorical, as in the preterm birth example. Suppose a clinician wishes to estimate the risk of preterm birth for a particular patient in her clinic. Then the one-at-a-time interval would be adequate. Alternatively, consider a scenario where a researcher wants to make some kind of overall conclusion about the relationship between all the sources and levels of maternal stress and preterm birth. For example, suppose the researcher wishes to compare the risk of preterm birth for all the maternal stress factors and their levels. In order to simultaneously estimate these differences, the researcher would need to employ some kind of procedure that accounts for the multiple inferences being made. Additionally, it would be of practical interest to identify a group of maternal stress factors and levels that are “most associated” with preterm birth and another group of maternal stress factors and levels that are “least associated” with preterm births.
This too would necessitate a procedure that adjusts for multiplicity while also providing interval estimates of the response probabilities for each stress factor. Finally, consider the case where the researcher would want to compare the probability of preterm birth for each maternal stress factor or level with the probability of preterm birth for a control or reference level. In this example, the reasonable reference level would be subjects that have no identified sources of maternal stress. Again, the appropriate procedure would adjust for multiplicity. Now consider the case where the predictor variable is continuous. Often, with a continuous predictor variable, a specific range of the domain is of particular interest. For example, if we added a continuous measure of each patient’s prepregnancy body mass index (BMI) in the preterm birth study, we might have particular interest in BMIs less than 19.8 (underweight), 19.8 to 26.0 (normal weight), 26.0 to 29.0 (overweight), and over 29.0 (obese). It may be of interest to compare the expected number of preterm birth cases for subjects within these BMI groups within a particular maternal stress factor group. In order to make conclusions such as the obese patients have the largest number of preterm birth cases, interval estimates need to be used that account for the multiple inferences being made. The methods I propose will adjust for this kind of multiplicity.

2.2 The Odds Ratio

The odds ratio is a widely used measure in epidemiological and medical applications. The odds ratio is generally defined as the ratio of the odds of a characteristic (or disease) occurring in one group to the odds of it occurring in another group. With reference to the preterm birth study, odds ratios could have been computed that would estimate the relative odds of preterm birth for a particular level of a maternal stress factor to the odds for those mothers with no identifiable stress factors.
Thus, an odds ratio of 2.11 for mothers who live in a neighborhood perceived to be dangerous would be interpreted as: the odds of delivering a preterm infant when living in a neighborhood that is perceived to be dangerous is 2.11 times greater than the odds of having a preterm infant when a subject is not exposed to any identifiable sources of maternal stress.

Table 2.1: Sample Contingency Table

          X = x_1   X = x_2
  Y = 1      a         b      m_1
  Y = 0      c         d      m_2
            n_1       n_2      n

The sample odds ratio can easily be computed from the raw data and is given by η̂ = ad/(bc) for counts as given in Table 2.1, irrespective of which sampling model (binomial, multinomial, or Poisson) is assumed for the cell counts. For large samples, again under all sampling models, the log odds ratio, log(η̂), is asymptotically normal with mean log(η) and estimated standard error

  σ̂_log(η̂) = (1/a + 1/b + 1/c + 1/d)^{1/2}.

Thus, a 100(1 − α)% one-at-a-time large sample confidence interval for the log odds ratio is given by

  log(η̂) ± z_{α/2} σ̂_log(η̂).    (2.11)

Exponentiating the lower and upper bounds of this interval yields confidence bounds for the odds ratio. It is common to see one-at-a-time confidence intervals for odds ratios reported along with their point estimates. Suppose a researcher wants to report the estimated odds ratio for a particular patient profile with confidence limits. For example, in the preterm birth example, she may want to report the odds ratio of preterm birth for those exposed to a particular maternal stress factor compared to those with no identifiable maternal stress factors. In this case, the one-at-a-time intervals are appropriate. Alternatively, consider the case where the researcher wants to identify which, if any, of the maternal stress factors or levels of a stress factor are statistically associated with a preterm birth or to identify a set of stress factors or level of a stress factor whose association with preterm birth is larger than that for no stress factors present.
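The one-at-a-time interval (2.11) is straightforward to compute from a 2x2 table. The sketch below does so for hypothetical counts a, b, c, d laid out as in Table 2.1; both the counts and the 95% level are illustrative only.

```python
import math

# Illustrative 2x2 counts as in Table 2.1 (hypothetical numbers):
# a, b are the Y = 1 row; c, d are the Y = 0 row.
a, b, c, d = 30, 15, 70, 85

odds_ratio = (a * d) / (b * c)                 # eta_hat = ad/(bc)

# Large-sample interval for log(eta_hat), as in (2.11)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # sigma_hat_log(eta_hat)
z = 1.96                                       # z_{alpha/2} for alpha = 0.05

# Exponentiate the log-scale bounds to get bounds on the odds ratio
lo = math.exp(math.log(odds_ratio) - z * se_log_or)
hi = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```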
The one-at-a-time intervals will not suffice for these kinds of questions as there are multiple inferences being made. In order to control the type I error rate, a simultaneous estimation procedure should be utilized. Additionally, the researcher may wish to compare the odds of preterm birth for any maternal stress factor to the odds of preterm birth for all other sources of maternal stress. In order to do this, a multiple comparison procedure must also be utilized as the researcher actually wants to compare the probabilities of preterm birth across all the possible sources of maternal stress. It is tempting to make overall conclusions about the odds ratios when reporting estimated odds ratios via one-at-a-time confidence intervals. However, the error rate associated with these overall conclusions based on multiple one-at-a-time intervals is not controlled, or even known. In this case, a method that simultaneously estimates the parameters is warranted.

2.3 Other Quantities

Another quantity frequently reported is the relative risk. The relative risk communicates the risk of developing a disease at one level of the predictor variable relative to another level of the predictor variable. An example of a study employing relative risks is the preterm birth study. Using the counts in Table 2.1, the estimated relative risk would be given by γ̂ = (a/n_1)/(b/n_2). Suppose a researcher reports that the estimated relative risk of preterm birth is 1.75, given the subject lives in a neighborhood perceived to be dangerous. Thus the proportion of those that experience a preterm birth among those that live in the dangerous neighborhood is estimated to be 1.75 times the proportion of those who experience a preterm birth among those with no identified sources of stress.
Again, the log scale is often utilized, and large sample derivations show that the log of the sample relative risk, log(γ̂), is asymptotically normal with mean log(γ) and estimated standard error

  σ̂_log(γ̂) = ((1 − π̂_1)/(π̂_1 n_1) + (1 − π̂_2)/(π̂_2 n_2))^{1/2}

where π̂_i for i = 1, 2 is the estimated probability of disease among those in group i and n_i is the sample size for group i = 1, 2. Thus, the 100(1 − α)% confidence interval for the log relative risk is

  log(γ̂) ± z_{α/2} σ̂_log(γ̂).    (2.12)

These resulting bounds may be exponentiated in order to obtain confidence limits on the relative risk. It is sufficient to report an estimated relative risk via a one-at-a-time confidence interval when a researcher only needs to understand how one level of the predictor variable affects the incidence of disease. The preterm birth study reported relative risks and the associated confidence intervals, thus only individual inferences about each source of maternal stress or level of a maternal stress factor relative to the case with no source of stress can be made. However, suppose a researcher wishes to pick out which risk factor or level of a risk factor contributes most to a disease or condition, or obtain ranking information for the sources and levels of maternal stress with respect to risk of disease or condition. As estimation is still also of interest, multiplicity adjustments need to be made to the confidence intervals. In addition to the relative risk, other quantities should be considered as well. For example, many epidemiological researchers find the attributable proportion a useful measure. Suppose we have a disease and several risk factors for that disease. Then the attributable proportion would be the probability that a diseased individual exposed to the given risk factor has the disease because of that risk factor [2]. This is of interest when there are multiple risk factors for a disease.
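Interval (2.12) can be sketched in the same way as the odds ratio interval. The counts below are hypothetical: a of n_1 subjects in group 1 and b of n_2 subjects in group 2 experience the event.

```python
import math

# Hypothetical group counts (illustrative only)
a, n1 = 30, 100    # events and sample size, group 1
b, n2 = 15, 100    # events and sample size, group 2

p1, p2 = a / n1, b / n2
rr = p1 / p2                                   # gamma_hat = (a/n1)/(b/n2)

# Estimated standard error of log(gamma_hat), as in (2.12)
se_log_rr = math.sqrt((1 - p1) / (p1 * n1) + (1 - p2) / (p2 * n2))
z = 1.96                                       # z_{alpha/2} for alpha = 0.05

# Exponentiate the log-scale bounds to get bounds on the relative risk
lo = math.exp(math.log(rr) - z * se_log_rr)
hi = math.exp(math.log(rr) + z * se_log_rr)

print(f"RR = {rr:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```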
Thus, this measure is of particular interest in case-control studies where the incidence of disease is related to several risk factors, as it allows the researcher to understand how much the disease could be reduced by eliminating a particular risk factor. One-at-a-time confidence intervals can also be utilized to estimate the attributable proportions. Model-based confidence interval formulas can be computed on the usual attributable proportion. Often transformations of the relative risk are utilized to compute bounds on the attributable proportion since they can be more efficient asymptotically. However, the MLE-based interval performs adequately [3] and is more easily adjusted for simultaneous inference in the sequel. Thus, a one-at-a-time confidence interval for the attributable proportion, denoted $\kappa$, is given by
$$\hat{\kappa} \pm z_{\alpha/2} \times \widehat{\mathrm{var}}(\hat{\kappa})^{1/2} \qquad (2.13)$$
where $z_{\alpha/2}$ is the z critical value that gives $100(1-\alpha)\%$ confidence. (Details for computing $\widehat{\mathrm{var}}(\hat{\kappa})$ are given in [4].) Again, this interval is all that is required in many applications. However, if the researcher wishes to compare the attributable proportions for a group of risk factors, an adjustment for multiplicity would be necessary. This might be necessary if, for example, one wished to understand which risk factor should be the main focus for prevention of the disease. Here we would want to identify the largest attributable proportion and focus on disease prevention via reducing the effect of that risk factor.

2.4 Interval Estimation from GLMs

Though we have introduced notation for both linear models and GLMs, the rest of this section focuses on the particular GLMs utilized to illustrate the results in this paper. While the methods derived apply to any GLM, particular attention will be devoted to the logistic and Poisson models due to their applicability and popularity.

2.4.1 Logistic Regression Model

The logistic regression model is widely used in epidemiological and health science applications.
The predictor in a logistic regression model can be either a single variable or a vector of variables. Thus, let $x_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$ be a vector of predictor variables for $i = 1, \ldots, n$, where $n$ is the total number of observations and $k$ is the number of predictor variables. Thus, $x_i$ is the $i$th vector of predictor variables. Recall that for qualitative covariates, the $x_i$'s would be defined as appropriate indicator variables. For example, in the preterm birth study $x_i$ could be the maternal stress vector of predictor variables with binary elements indicating the presence of a particular source, or level of a source, of maternal stress. Thus, if the $i$th case is a patient exposed only to dangerous neighborhoods as a source of maternal stress, we would code $x_i = (1, 1, 0, 0, 0, \ldots, 0)$, where the first element is a 1 for the intercept term, the second element is a 1 to indicate the presence of stress in the form of a dangerous neighborhood, and the other elements of the vector are 0, indicating the patient was not exposed to the other sources of maternal stress. A logistic regression model assumes that the probability of a success for the $i$th observation is $\pi(x_i)$ where
$$\pi(x_i) = P[Y_i = 1] = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}} = \frac{e^{\beta_1 x_{i1} + \ldots + \beta_k x_{ik}}}{1 + e^{\beta_1 x_{i1} + \ldots + \beta_k x_{ik}}}, \quad i = 1, \ldots, n. \qquad (2.14)$$
The matrix $X$, as previously defined, contains information relating to the predicted value, $Y$, for the model. Alternatively, we can express this model as
$$\phi(x_i) = \mathrm{logit}[\pi(x_i)] = \ln\left[\frac{\pi(x_i)}{1 - \pi(x_i)}\right] = x_i'\beta, \quad i = 1, \ldots, n. \qquad (2.15)$$
Now let the MLE of $\pi(x_i)$ be denoted $\hat{\pi}(x_i)$. We will make the usual assumption that the $Y_i$ random variables are independent and binomially distributed with parameters $m_i$ (assumed known) and $\pi(x_i)$ given by (2.14), $i = 1, \ldots, n$. Thus, $W = \mathrm{diag}[m_i\pi(x_i)(1 - \pi(x_i))]$, $i = 1, \ldots, n$, and the asymptotic distribution of the MLE of $\beta$ is given by (2.4). For $\hat{W} = \mathrm{diag}[m_i\hat{\pi}(x_i)(1 - \hat{\pi}(x_i))]$, $i = 1, \ldots,$
$n$, the estimated covariance matrix of $\hat{\beta}$ is given by (2.5). When it is assumed that a logistic regression model is appropriate, the typical quantities of interest are the coefficients of the regression model, i.e., the log odds ratios $\beta_i$, $i = 1, \ldots, k$, and the response probabilities $\pi(x_i)$. These quantities relate to what was generally referred to as the expected response function. In particular, the expected response function for a logistic regression model is the response probability, since this model assumes binomial sampling. When considering response probabilities for single experimental units, one-at-a-time confidence intervals seem appropriate. (Confidence intervals provide additional information about the precision of the estimated response probability, so they are often preferable to point estimates.) An appropriate confidence interval on $\mathrm{logit}(\pi(x_i)) = x_i'\beta$ is computed by
$$x_i'\hat{\beta} \pm z_{1-\alpha/2}\,\hat{\sigma}_{GLM}(x_i) \qquad (2.16)$$
where $z_{1-\alpha/2}$ is a z-percentile and $\hat{\sigma}_{GLM}(x_i)$ is given by (2.6) with $\hat{W}$ as previously defined. Let the upper and lower limits of (2.16) be denoted by $U_{LOGIT}$ and $L_{LOGIT}$, respectively. Then we can apply the anti-logit to obtain bounds on the response probability. Thus, a $100(1-\alpha)\%$ confidence interval for the response probability is given by
$$\left(\frac{\exp(L_{LOGIT})}{1 + \exp(L_{LOGIT})},\; \frac{\exp(U_{LOGIT})}{1 + \exp(U_{LOGIT})}\right). \qquad (2.17)$$
Another quantity of interest from a logistic regression model is the odds ratio. One-at-a-time large-sample confidence intervals can easily be constructed on the log odds ratios, as they are linear functions of the $k$ logistic regression coefficients, $\beta$. Thus we may utilize the asymptotic multivariate normal distribution of the maximum likelihood estimates of these $k$ logistic regression coefficients, given by (2.4), to obtain large-sample confidence intervals for the appropriate odds ratios. For illustration, suppose a particular odds ratio is given by $\exp(c_i\beta)$ for $c_i = (c_{i1}, \ldots, c_{ik})$, a vector of appropriate constants.
Then a one-at-a-time large-sample confidence interval for this particular odds ratio is given by
$$\left(\exp\{c_i\hat{\beta} - z_{\alpha/2}\,\hat{\sigma}_{GLM}(c_i)\},\; \exp\{c_i\hat{\beta} + z_{\alpha/2}\,\hat{\sigma}_{GLM}(c_i)\}\right) \qquad (2.18)$$
where $\hat{\sigma}_{GLM}(c_i)$ is given by (2.6). Typically, in epidemiological applications, the logistic regression model employed for computing the model-based odds ratios utilizes reference coding. When reference coding, as explained below, is utilized, special care must be taken in interpreting the model-based odds ratios.

2.4.2 Reference Coding for Logistic Regression

If a logistic regression model is employed for a categorical predictor variable, the design coding typically used necessitates that one of the levels of $x$ be a reference level. The odds ratios that result from the model coefficients are then interpreted as comparisons to that reference level. Most often the reference level is a true control, but at times the reference level is arbitrary. When the reference level is informative, we may wish to: (1) estimate all the odds ratios relative to the reference level simultaneously, thereby allowing the researcher to assess the practical significance of any observed difference from the reference level while also providing the ability to evaluate which non-reference levels are significantly greater than or less than the reference level, and (2) make comparisons for a prespecified set of contrasts of the odds ratios. If the reference level is arbitrary, it seems reasonable to simultaneously compute all odds ratios, or all odds ratio differences, and then emulate one of the two scenarios described above. Again, if we wish to assess the practical significance of any estimates, we need to estimate these quantities simultaneously rather than utilize a stepwise procedure.
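The one-at-a-time intervals (2.14)-(2.18) above can be sketched numerically as follows. This is a hedged illustration, not the dissertation's own code: `logistic_prob`, `prob_ci`, and `odds_ratio_ci` are hypothetical helper names, and `cov_beta` stands for the estimated covariance matrix $(X'\hat{W}X)^{-1}$ obtained from a fitted model.

```python
import numpy as np

def logistic_prob(x, beta):
    """Response probability (2.14): exp(x'beta) / (1 + exp(x'beta))."""
    eta = float(np.dot(x, beta))
    return np.exp(eta) / (1.0 + np.exp(eta))

def prob_ci(x, beta_hat, cov_beta, z=1.959964):
    """Wald interval (2.16) on the logit, mapped through the anti-logit (2.17)."""
    eta = float(x @ beta_hat)
    se = float(np.sqrt(x @ cov_beta @ x))        # sigma_hat_GLM(x), as in (2.6)
    antilogit = lambda t: np.exp(t) / (1.0 + np.exp(t))
    return antilogit(eta - z * se), antilogit(eta + z * se)

def odds_ratio_ci(c, beta_hat, cov_beta, z=1.959964):
    """One-at-a-time interval (2.18) for the odds ratio exp(c'beta)."""
    est = float(c @ beta_hat)
    se = float(np.sqrt(c @ cov_beta @ c))
    return np.exp(est - z * se), np.exp(est + z * se)
```

Here `z=1.959964` is the standard normal 0.975 quantile, giving 95% one-at-a-time confidence.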
Note that for both of the above cases, when there is only one categorical predictor variable $x$, all inference procedures performed on the odds ratios resulting from the logistic regression model are equivalent to any similar analysis performed on the crude data in contingency table format. Differences will occur in models with multiple covariates. The standard method for computing the odds ratios resulting from a logistic regression model using reference coding for the design matrix is to exponentiate the appropriate linear combinations of the estimated regression coefficients. For example, if we have a logit model such as (2.15), where there are $k$ levels for our single predictor variable, then we can utilize the explanatory variables $x_1, \ldots, x_{k-1}$ with the covariate vector at the reference level of our predictor variable equal to 0, that is, $x_1 = \ldots = x_{k-1} = 0$. Thus, $x_1, \ldots, x_{k-1}$ would be defined as indicator variables for the $k-1$ non-reference levels of our predictor variable. When this model is assumed, we can interpret $e^{\hat{\beta}_1}$ as follows: the odds that $Y = 1$ for the first non-reference level are estimated to be $e^{\hat{\beta}_1}$ times the odds for the reference level. Table 2.2 illustrates other estimated odds ratios and their corresponding interpretations.

Table 2.2: Examples of Estimated Odds Ratios

$e^{\hat{\beta}_1}$ — the odds ratio comparing the first non-reference level to the reference level
$e^{\hat{\beta}_2}$ — the odds ratio comparing the second non-reference level to the reference level
...
$e^{\hat{\beta}_2 - \hat{\beta}_1}$ — the odds ratio comparing the second non-reference level to the first non-reference level
...
$e^{\hat{\beta}_{k-1} - \hat{\beta}_{k-2}}$ — the odds ratio comparing the $(k-1)$th non-reference level to the $(k-2)$th non-reference level

The estimated odds ratios defined in Table 2.2 could then be utilized to construct confidence intervals that would aid in interpreting the model.

2.4.3 Loglinear or Poisson Model

Another model often employed in epidemiological studies is the loglinear model.
The loglinear model relates the counts of a Poisson or multinomial distribution to a set of covariates. It may assume the total sample size is random or fixed, depending on whether the model assumes Poisson or multinomial sampling, respectively. For an $I \times J$ contingency table, let $N = I \times J$. Note that the number of cells in a contingency table, $N$, is distinct from the sample size or number of observations, denoted $n$, although they can be equal. Whenever the number of observations, $n$, is fixed, we have multinomial sampling for $Y_i$, $i = 1, \ldots, N-1$. However, when the sample size $n$ is not fixed, we usually assume Poisson sampling for $Y_i$, $i = 1, \ldots, N$. For ease of notation, let $n^* = N - 1$ for multinomial sampling and $n^* = N$ for Poisson sampling. Then the loglinear model is
$$\log(\mu(x_i)) = x_i'\beta, \quad i = 1, \ldots, n^* \qquad (2.19)$$
where $E(Y) = \mu = (\mu(x_1), \ldots, \mu(x_{n^*}))'$ is the vector of expected counts of the respective cells of the contingency table, $x_i$ is a $1 \times k$ vector of covariates as described in (2.2), and $\beta$ is a $k$-dimensional vector of model parameters. A loglinear model may also be expressed as
$$\log(\mu(x_i)) = \sum_{j=1}^{k} \beta_j x_{ij}, \quad i = 1, \ldots, n^*, \qquad (2.20)$$
where each $x_{ij}$ is the covariate value corresponding to $\beta_j$ for the $i$th level of $Y$, $i = 1, \ldots, n^*$ and $j = 1, \ldots, k$. Recall the assumption that $Y_i$ is a Poisson or multinomial random variable. Thus, the expectation of any $Y_i$ is a positive value, $\mu(x_i)$, for $i = 1, \ldots, n^*$. The derivation of the large-sample distribution of the model parameters depends on the sampling assumptions. When $n$ is not fixed, we assume Poisson sampling. Then the MLE $\hat{\beta}$ is asymptotically normal with mean $\beta$ and covariance matrix $V = (X'\mathrm{diag}(\mu)X)^{-1}$. Notice that $W = \mathrm{diag}[\mu]$. Thus, the estimated covariance matrix of $\hat{\beta}$ is given by $\hat{V} = \widehat{\mathrm{cov}}(\hat{\beta}) = [X'\mathrm{diag}(\hat{\mu})X]^{-1}$. For Poisson sampling, we have
$$\hat{\beta} \,\dot\sim\, N(\beta, (X'\mathrm{diag}(\mu)X)^{-1}). \qquad (2.21)$$
Alternatively, when $n$, the overall sample size, is fixed, we assume multinomial sampling.
Typically, under multinomial sampling, we have interest in the cell probabilities, $\hat{\pi} = \hat{\mu}/n$. Here $\hat{\pi}$ is asymptotically multivariate normal with mean $\pi$, and the covariance matrix of the parameter estimates is $V = \mathrm{cov}(\hat{\beta}) = \{X'[\mathrm{diag}(\mu) - (\mu\mu'/n)]X\}^{-1} = \{nX'[\mathrm{diag}(\pi) - \pi\pi']X\}^{-1}$. Notice that $W = \mathrm{diag}(\mu) - (\mu\mu'/n)$. Additionally, when we have one multinomial sample, the estimated covariance matrix for the regression parameters is given by $\hat{V} = \widehat{\mathrm{cov}}(\hat{\beta}) = \{X'[\mathrm{diag}(\hat{\mu}) - (\hat{\mu}\hat{\mu}'/n)]X\}^{-1} = \{nX'[\mathrm{diag}(\hat{\pi}) - \hat{\pi}\hat{\pi}']X\}^{-1}$. Thus, for multinomial sampling,
$$\hat{\beta} \,\dot\sim\, N(\beta, \{X'[\mathrm{diag}(\mu) - (\mu\mu'/n)]X\}^{-1}). \qquad (2.22)$$
Notice that the asymptotic normality of the parameters holds for both Poisson and multinomial sampling. When Poisson sampling is assumed, the expected response function is the mean cell count, $\mu$. Alternatively, when multinomial sampling is assumed, the expected response function is $\pi$. All inferences on the model parameters, or any functions of the model parameters, can be made via the asymptotic distributions previously stated. I will focus on the case where Poisson sampling is assumed, as it is the customary assumption. Additionally, when Poisson sampling is assumed, the loglinear model is often referred to as a Poisson model. Intervals for response probabilities from multinomial loglinear models could be formed in a manner similar to that described for logistic regression, but this is rarely done with loglinear models. Instead, focus is usually on the estimated relative risks. When utilizing a loglinear model, the relative risk yields point estimates that are often more applicable to clinical situations than the odds ratio; thus we consider the relative risk here. The use of the relative risk is very common in epidemiological applications, thus discussion of the relative risk will focus on these types of scenarios.
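The two estimated covariance matrices above can be sketched as follows (a hedged illustration with hypothetical helper names; `mu_hat` is the vector of fitted cell means from the loglinear model):

```python
import numpy as np

def poisson_cov(X, mu_hat):
    """Poisson sampling (2.21): V_hat = (X' diag(mu_hat) X)^{-1}, i.e. W = diag(mu)."""
    W = np.diag(np.asarray(mu_hat, dtype=float))
    return np.linalg.inv(X.T @ W @ X)

def multinomial_cov(X, mu_hat, n):
    """Multinomial sampling with fixed total n:
    V_hat = {X'[diag(mu_hat) - mu_hat mu_hat'/n] X}^{-1}."""
    mu = np.asarray(mu_hat, dtype=float)
    W = np.diag(mu) - np.outer(mu, mu) / n
    return np.linalg.inv(X.T @ W @ X)
```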
Estimating the relative risk from a loglinear model is particularly easy, since it may be shown that the estimated relative risk is simply $\exp(\hat{\beta}_1)$, where $\hat{\beta}_1$ is the slope coefficient for $x_1$, the predictor variable indicating presence of the intervention. Other models, such as the logistic model, could be used similarly to estimate the relative risk, although other models do not always yield simple formulas. Consider, for instance, the case where we are estimating the relative risk from a loglinear model. The estimated relative risks would be of the form $\exp(\hat{\beta}_j)$, where $\hat{\beta}_j$ is the estimated slope coefficient for the covariate $x_j$, $j = 1, \ldots, k$. Let $c_j$ be a vector with the $j$th element equal to 1 and all other elements equal to 0. A confidence band is formed by
$$\left(\exp\{\hat{\beta}_j - z_{\alpha/2}\,\hat{\sigma}_{GLM}(c_j)\},\; \exp\{\hat{\beta}_j + z_{\alpha/2}\,\hat{\sigma}_{GLM}(c_j)\}\right)$$
where $\hat{\sigma}_{GLM}(c_j)$ is given by (2.6) and $W$ for Poisson sampling is as given in (2.21). Other models using alternative links could also be considered. For example, other GLMs are formed by utilizing the probit link, where $g = \Phi^{-1}(\pi(x))$, and the complementary log-log link, where $g = \log(-\log(1 - \pi(x)))$. These both assume a binomial sampling scenario, and the usual focus is on the resulting probability of success, $\pi(x)$.

CHAPTER 3

Inferences on Quantities Estimated from a GLM

When utilizing GLMs, several quantities may be of interest. For example, the expected response, odds ratio, or relative risk may be estimated via the GLM. This section focuses on the case where the expected response function is of primary concern. All the methods discussed utilize the fact that GLMs may be expressed as
$$g(E(Y_i|x_i)) = x_i'\beta \qquad (3.1)$$
where $Y_i$ is the response for the $i$th observation, $x_i = (x_{i1}, \ldots, x_{ik})$ is the vector of appropriate covariate values for the $i$th observation, $\beta = (\beta_1, \ldots, \beta_k)$ is the vector of parameters, and $g$ is the canonical link.
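As a small aside on the link functions mentioned above, a minimal sketch of the complementary log-log pair $g$ and $g^{-1}$ (the probit pair would use $\Phi^{-1}$ and $\Phi$ analogously; function names are hypothetical):

```python
import math

def cloglog(p):
    """Complementary log-log link: g(p) = log(-log(1 - p))."""
    return math.log(-math.log(1.0 - p))

def inv_cloglog(eta):
    """Its inverse link: p = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))
```

Applying $g^{-1}$ to confidence limits on $x'\beta$, as done with the anti-logit in (2.17), carries bounds back to the probability scale for these links as well.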
(Specific assumptions and details on this model are given in equations (2.1) and (2.2).)

3.1 Inference on the Mean Response

This section will focus on the response probability, or estimated mean response, $\pi(x_i) = E(Y_i|x_i)$, $i = 1, \ldots, n$, in a GLM assuming binomial or multinomial sampling. Alternatively, if we assume Poisson sampling, we would have interest in the expected cell counts, $\mu(x_i) = E(Y_i|x_i)$, $i = 1, \ldots, n$. This general methodology can be extended to the Poisson sampling scheme provided our inferences are on $\mu(x_i)$ rather than $\pi(x_i)$. Suppose we have a covariate $X$ with $k$-dimensional domain $I$ in a GLM of the form $g(E(Y|x_i)) = x_i'\beta$, $i = 1, \ldots, n$. Then let $\mathcal{X} \subset I$ be a compact subset of the domain which is of special interest. The intention of this section is to bound the expected response function, $E(Y|x_i)$, for all $x_i \in \mathcal{X}$ using confidence bounds on a GLM. The subset $\mathcal{X}$ can be a set of the domain that is of special interest, or it may be selected to answer a particular question. Discussion is restricted to the case where there is one covariate, but the methodologies may be extended to cases with many covariates. Even in the single-covariate case, $\mathcal{X}$ may be a matrix if, for example, the model employs reference coding.

3.1.1 Previous Methods

Two primary approaches for simultaneously estimating the mean response function are discussed. The first is a conventional approach utilizing bounds similar to the well-known Scheffé bounds. The second is a modern approach utilizing solutions referred to as tube formulas for constructing simultaneous intervals. Scheffé bounds are a well-known methodology in simultaneous inference, and are widely applied in linear models and generalized linear models. Some of the regularity conditions necessary for applying Scheffé bounds include that the sample size is sufficiently large and that the domain of the predictor variable is fixed [5].
Under these regularity conditions, the maximum likelihood estimates (MLEs) of a linear model are multivariate normal with mean vector $\theta$ and covariance matrix equal to the inverse of the Fisher information matrix, $\sigma^2_{LM}F^{-1}$, where $F = X'X$. (See (2.8) for details on model assumptions.) In a standard regression model, Scheffé bounds are often utilized to obtain simultaneous intervals. These bounds are simultaneous for all $x_i \in \mathbb{R}^k$ and thus are conservative for any finite set of such comparisons. Alternatively, Casella and Strawderman [6] derived Scheffé-type bounds for a regression model with restrictions assumed on the domain. These intervals are exact for the restricted domain. The advantage of assuming these restrictions is that the usual Scheffé bounds are conservative when the entire domain is not used. Piegorsch and Casella [7] utilized the Casella and Strawderman (CS) method to obtain simultaneous bounds on a logistic regression model. Specifically, they obtained Scheffé-type bounds on the $x_i'\beta$ in a logistic regression utilizing a restricted predictor variable domain of rectangular form. The method originally developed by Casella and Strawderman, and later extended by Piegorsch and Casella, is less conservative than the usual Scheffé bounds because it restricts the predictor variable space. It is desirable at this point to reparameterize the model so that it is in the so-called diagonalized form (Casella and Strawderman [6]). This will simplify the calculations used hereafter. By the Spectral Theorem for symmetric matrices [8], the matrix $F$ may be decomposed, given that $F$ is symmetric. Thus, a linear model may be diagonalized by noting that $F = UDU'$, where $D = \mathrm{diag}(\lambda_i)$ is a diagonal $k \times k$ matrix of the ordered eigenvalues of $F$ and $U = (u_1, \ldots, u_k)$ is the $k \times k$ matrix of corresponding orthonormal eigenvectors. Now define $Z_{n \times k} = XUD^{-1/2}$ and $\eta_{k \times 1} = D^{1/2}U'\theta$, where $UU' = I$ since the eigenvectors $u_i$ are orthonormal.
Thus, $Z\eta = [XUD^{-1/2}][D^{1/2}U'\theta] = XUU'\theta = XI\theta = X\theta$, so the model may be written as $Y = Z\eta + \epsilon$. Note that $\hat{\eta} = D^{1/2}U'\hat{\theta}$ is distributed $N_k(\eta, \sigma^2_{LM}I)$, given that (2.4) holds. Casella and Strawderman [6] consider bounding linear models of the form $Y_i = x_i\theta + \epsilon_i$, with the usual restrictions on $\epsilon$ (see (2.7)) and with a domain for $x_i$ of the form
$$\Theta_{x_i} = \left\{x_i : \sum_{j=1}^{r} x_{ij}^2 \geq q^2 \sum_{j=r+1}^{k} x_{ij}^2\right\}$$
where $q$ is a fixed constant. When $r = 1$ these are cone-shaped regions, and if $r > 1$ there is no easy visualization of the space. Casella and Strawderman achieve exact results for bounding linear models over domains of this general form. Alternatively, both Casella and Strawderman [6] and Piegorsch and Casella [7] consider a more defined set of interval constraints on $X$ of the form
$$R_{x_i} = \{a_{11} < x_{i1} < a_{12},\; a_{21} < x_{i2} < a_{22},\; \ldots,\; a_{k1} < x_{ik} < a_{k2}\} \subset \Theta_{x_i}$$
for a specified $q$. These intervals would be of particular interest in many experimental settings and thus are assumed for the remainder of this section. The goal of the restricted-Scheffé procedure developed by Casella and Strawderman is to bound the regression function for all $x_i \in \Theta_{x_i}$. Thus, keeping in mind the objective of inference on $E(Y_i|x_i) = x_i'\theta$, consider
$$S(\Theta_{x_i}) = \{\theta : (x_i'\hat{\theta} - x_i'\theta)^2 \leq d^2\sigma^2_{LM}\,x_i'F^{-1}x_i \;\; \forall x_i \in \Theta_{x_i}\} \qquad (3.2)$$
where $d$ is an arbitrary constant. Casella and Strawderman derive a procedure to calculate the value of $d$ that yields
$$P[S(\Theta_{x_i})] = 1 - \alpha. \qquad (3.3)$$
Their derivation involves considering a domain for $Z$ similar to $\Theta_{x_i}$,
$$\Theta_{z_i} = \left\{z_i : \sum_{j=1}^{r} z_{ij}^2 \geq q^2 \sum_{j=r+1}^{k} z_{ij}^2\right\}$$
where $q$ is a fixed constant. Thus, Casella and Strawderman prove that for a specified $d$,
$$P[S(\Theta_{z_i})] = P[\{\eta : (z_i'\hat{\eta} - z_i'\eta)^2 \leq d^2\sigma^2_{LM}\,z_i'z_i \;\; \forall z_i \in \Theta_{z_i}\}] = 1 - \alpha \qquad (3.4)$$
where $\hat{\eta}$ is the MLE under the spectral decomposition. Notice that the only difference between the sets $S(\Theta_{x_i})$ and $S(\Theta_{z_i})$ is the space we are operating in. Recall the form assumed for the domain of interest, $R_{x_i}$. This is a convex set in $\mathbb{R}^k$.
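The diagonalization just described can be sketched numerically. This is a hedged illustration assuming $F$ is taken to be $X'X$, so that $Z'Z = I$ and $Z\eta$ reproduces $X\theta$; the function name is hypothetical.

```python
import numpy as np

def diagonalize(X):
    """Diagonalized form: with F = X'X = U D U', set Z = X U D^{-1/2}
    so that Z'Z = I, and Z eta = X theta for eta = D^{1/2} U' theta."""
    F = X.T @ X
    lam, U = np.linalg.eigh(F)                # eigenvalues, orthonormal eigenvectors
    D_inv_half = np.diag(1.0 / np.sqrt(lam))  # D^{-1/2}; requires F nonsingular
    Z = X @ U @ D_inv_half
    return Z, U, lam
```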
Thus, the image of this set, $R_{z_i}$, will also be convex, since a linear map preserves convexity. Note that for $\gamma = (\hat{\eta} - \eta)/\sigma_{LM}$, the set $S(\Theta_{z_i})$ may be written
$$S(\Theta_{z_i}) = \{\gamma : (\gamma'z_i)^2 \leq d^2\,z_i'z_i \;\; \forall z_i \in \Theta_{z_i}\}. \qquad (3.5)$$
Assume this form of $S(\Theta_{z_i})$ henceforth. Since we have a domain of the form $\Theta_{z_i}$ and wish to obtain a Scheffé-type probability band, via Theorem 1 in [6] we have
$$P(S(\Theta_{z_i})) = P(\chi^2_k \leq d^2) + P(E_{r,s}(b, d^2)) \qquad (3.6)$$
where $E_{r,s}(b, d^2) = \{(\chi^2_r, \chi^2_s) : \chi^2_r + \chi^2_s \geq d^2,\; (a\chi_r + b\chi_s)^2 \leq d^2,\; \chi^2_r \leq q^2\chi^2_s\}$, $a^2 + b^2 = 1$, and $\chi^2_r$ and $\chi^2_s$ are independent chi-square random variables. Also note that $q$ is a fixed constant determined by $\Theta_{z_i}$, and $b$, $r$, and $s$ are determined by the value of $d$ and the parameters of the problem. (For specific details see Casella and Strawderman [6].) A solution for the quantity $d$ which yields appropriate simultaneous intervals may be found by setting the right-hand side of (3.6) equal to $1 - \alpha$ and solving for $d$. Casella and Strawderman applied this theorem to a linear model, achieving exact results for all $x_i \in \Theta_{x_i}$, and thus a conservative solution for all $x_i \in R_{x_i} \subset \Theta_{x_i}$. The details of the derivation of the appropriate $\Theta_{z_i}$ (and hence $\Theta_{x_i}$) are given by Casella and Strawderman [6]. The resulting restricted-Scheffé intervals are of the form $\hat{Y} \pm d\,\hat{\sigma}(x_i)$, where $d$ is a critical value determined by an algorithm described in Appendix C. Piegorsch and Casella applied this procedure specifically to logistic regression. However, it has not been applied to a generic GLM, and it is unclear how these bounds will perform for other generalized linear models. Note that although this method is still conservative, it is less conservative than the conventional Scheffé bounds, as it does not cover the entire predictor variable space. As another alternative to the Scheffé-type bounds, Sun, Loader, and McCormick (2000) [9] (SLM) proposed a solution for simultaneously estimating the mean response for the general class of GLMs with all $x_i$ in a compact set.
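For comparison with the restricted-domain critical value $d$ solving (3.6), the unrestricted Scheffé value satisfies $P(\chi^2_k \leq d^2) = 1 - \alpha$ alone. A sketch of the unrestricted baseline, assuming scipy is available (the restricted $d$, which is smaller, requires the Casella–Strawderman algorithm of Appendix C and is not reproduced here):

```python
from scipy.stats import chi2

def scheffe_d(alpha, k):
    """Unrestricted Scheffe critical value: d with P(chi2_k <= d^2) = 1 - alpha.
    The restricted-domain d of (3.6) is smaller for the same alpha."""
    return float(chi2.ppf(1.0 - alpha, k)) ** 0.5
```

For $k = 1$ this reduces to the usual two-sided normal critical value, which is one way to sanity-check an implementation.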
This general method of bounding a regression function, called simultaneous confidence regions (SCR), can accommodate a variety of linear and nonlinear models. Specifically, the SCR bounds can be applied when there are heteroscedastic and nonadditive error terms, as is the case for many GLMs. The SCR bounds utilize error expansions to approximate the noncoverage probability for a GLM. They are far less conservative than Scheffé solutions and perform exceptionally well for moderate sample sizes. The SCR bounds are based on applying the so-called tube formula due to Naiman [10], with various possible adjustments. The tube formula provides a lower bound for the coverage probability of a confidence band of a regression function over a specified closed set. However, the tube formula assumes the error distribution of the model is normal. Clearly, this is not a valid assumption if we have a GLM, although it does provide a starting point for constructing confidence bounds, as the large-sample error distribution is approximately normal. Obviously, this assumption will be problematic for smaller sample sizes. The basic tube-formula methodology is described by Naiman in his 1986 paper [10]. This paper outlines a solution for constructing simultaneous confidence bands on polynomial regression models of the form
$$Y_i = \sum_{j=1}^{k} \theta_j f_j(x_i) + e_i, \quad i = 1, \ldots, n, \quad x_i \in I. \qquad (3.7)$$
Here it is assumed that $I$ is a closed interval in $\mathbb{R}$, that $e_i \sim \mathrm{iid}\; N(0, \sigma^2_{LM})$ with $\sigma^2_{LM}$ unknown, and that the $\theta_j$ ($j = 1, \ldots, k$) are unknown constants. The vector $f(x_i) = (f_1(x_i), \ldots, f_k(x_i))'$ maps from $I$ to $\mathbb{R}^k$. Naiman's intent is to provide simultaneous confidence bounds on $E(Y_i|x_i) = \theta'f(x_i)$ for all $x_i \in I$, where an estimate $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_k)'$ is available such that $\hat{\theta}$ is distributed $N(\theta, \sigma^2_{LM}F)$ with $\sigma^2_{LM}$ unknown and $F$ known, and $s^2_{LM}$ is an independent estimator of $\sigma^2_{LM}$ such that $\nu s^2_{LM}/\sigma^2_{LM} \sim \chi^2_\nu$ with $\nu = n - k$.
In order to understand how Naiman derives these bounds, consider another mapping, $\gamma$, from $I$ to the unit sphere $S^{k-1}$ centered at the origin of $\mathbb{R}^k$, such that $\gamma$ is piecewise differentiable and
$$\Lambda(\gamma) = \int_I \|\gamma'(x)\|\,dx \qquad (3.8)$$
is finite. Since $\gamma$ maps $I$ to the unit sphere $S^{k-1}$, it will be considered in place of the primary mapping $f(x_i)$. Specifically, it is the projection of $f(x_i)$ on $S^{k-1}$: $\gamma(x) = \|Pf(x_i)\|^{-1}Pf(x_i)$ for $x_i \in I$, where $P$ is a $k \times k$ matrix such that $F = P'P$. The quantity $\Lambda(\gamma)$ is called the path length, as it measures the length of the path $\gamma$, a continuous mapping essentially connecting the points in the image of $f$. The image of the path in $S^{k-1}$ is denoted by $\Gamma(\gamma) = \{\gamma(x) : x \in I\}$. The goal is to bound $\Gamma$ with a tube via bounding $\Lambda(\gamma)$, which equivalently bounds the regression function, $f(x_i)$, on $I$, as they share the same domain. Regarding the image of the path function, $\Gamma(\gamma)$, Naiman demonstrates that
$$\mu(\Gamma(\gamma)_g) \leq \min\left(\frac{\Lambda(\gamma)}{2\pi}F_{k-2,2}\left[\frac{2(g^{-2} - 1)}{k - 2}\right] + \frac{1}{2}F_{k-1,1}\left[\frac{g^{-2} - 1}{k - 1}\right],\; 1\right) \qquad (3.9)$$
where $g \in [0, 1]$, $\Gamma_g = \{u \in S^{k-1} : c(u) \geq g\}$ (a set of points in $S^{k-1}$ that surround $\Gamma$) with $c(u) = \sup\{u'v : v \in \Gamma\}$ for any $u \in S^{k-1}$, and $\mu$ is the uniform measure. Naiman then applies these results to obtain confidence bands of the form $\hat{\theta}'f(x_i) \pm d\,\hat{\sigma}_{LM}(f(x_i)'Ff(x_i))^{1/2}$. The intervals are formed utilizing the critical value $d$, which is determined by setting
$$1 - \int_0^{1/d} \min\left(\frac{\Lambda(\gamma)}{2\pi}F_{k-2,2}\left[\frac{2((dt)^{-2} - 1)}{k - 2}\right] + \frac{1}{2}F_{k-1,1}\left[\frac{(dt)^{-2} - 1}{k - 1}\right],\; 1\right)f_T(t)\,dt \qquad (3.10)$$
equal to $1 - \alpha$ and solving for $d$. Here $f_T(t)$ is the density of a random variable $T$ with $rT^2 \sim F_{\nu,r}$. Utilizing these tube-formula bounds, SLM form simultaneous bounds on the expected response function for a GLM. Recall that Naiman derived these bounds assuming normally distributed residuals. Clearly, generalized linear models only have normally distributed residuals asymptotically.
Thus, the tube formulas were originally applied directly, via the asymptotic normality of the residuals, to obtain asymptotic simultaneous confidence bands. Modifications were then made to the usual tube-based bounds to improve them for small to moderate sample sizes. The following description outlines how to apply the tube-formula bounds to GLMs. Let the maximum likelihood estimate (MLE) of a predicted response for a GLM at $x_i$ be denoted by $\hat{Y}_i$. Ultimately, the interval desired is of the form
$$I_d(x_i) = \left(g^{-1}(x_i'\hat{\beta} - d\,\hat{\sigma}_{GLM}(x_i)),\; g^{-1}(x_i'\hat{\beta} + d\,\hat{\sigma}_{GLM}(x_i))\right) \quad \forall x_i \in \mathcal{X}$$
where $[\hat{\sigma}_{GLM}(x_i)]^2$ is the asymptotic variance of $x_i'\hat{\beta}$ and $\mathcal{X}$ is a particular compact subset of the domain. The tube formulas will enable us to find a value $d$ such that
$$P[g(E(Y_i|x_i)) \in I_d(x_i) \text{ for all } x_i \in \mathcal{X}] \geq 1 - \alpha. \qquad (3.11)$$
Applying the tube formula directly to solve for $d$ yields what SLM term a naive SCR. We will utilize the notation $d_{TUBE}$ to indicate a critical value calculated in this manner. This solution performs adequately when the asymptotic distribution of the residuals is nearly normal. However, this method will not attain the desired confidence level when the sample size is relatively small, as the residuals are typically nonnormal discrete random variables for GLMs. In order to improve the small-sample performance, the authors consider some modifications to the tube formula. They begin by approximating the sampling distribution of the residual via construction of expansions on the estimated model. This approximation of the sampling distribution is utilized to obtain a critical point for the confidence interval formula that is adjusted with respect to the bias introduced by the MLEs. Consider the random process
$$W_n(x_i) = \frac{g(\hat{Y}_i) - g(E(Y_i|x_i))}{\hat{\sigma}_G(x_i)}$$
where $x_i = (x_{i1}, \ldots, x_{ik})$ is the $i$th row of $X = (x_1, \ldots, x_n)'$ and $[\hat{\sigma}_G(x_i)]^2$ is the asymptotic variance of $g(\hat{Y}_i)$. This process converges in distribution to a Gaussian random field.
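Given a critical value, the band $I_d(x_i)$ above is mechanical to evaluate. A hedged sketch (the function name is hypothetical, and `d` stands in for $d_{TUBE}$, which would come from solving the tube formula; it is treated here as a plain input):

```python
import numpy as np

def scr_interval(x, beta_hat, cov_beta, d, inv_link):
    """Naive SCR band I_d(x): inverse link applied to x'beta_hat -/+ d*sigma_hat,
    where sigma_hat is the asymptotic standard error of x'beta_hat."""
    eta = float(x @ beta_hat)
    se = float(np.sqrt(x @ cov_beta @ x))
    return inv_link(eta - d * se), inv_link(eta + d * se)
```

In practice this would be evaluated over a grid of $x$ values covering the compact set of interest, all sharing the single critical value $d$.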
Let $W(x_i)$ be a random variable with the same distribution as the limiting distribution of $W_n(x_i)$. Then the bias behaves like $W_n(x_i) - W(x_i)$. This expression for the bias may be bounded via inverse Edgeworth expansions. SLM propose three corrections that can aid in correcting the bias introduced by estimating the regression parameters in a GLM with MLEs. These three solutions are based on the inverse Edgeworth expansion of the random process given by
$$W_n(x_i) = W(x_i) - p_2(x_i, W_n(x_i)) \qquad (3.12)$$
where the subtracted term can be thought of as a correction for the bias of the process. It is based on the centered moments of the process $W_n(x_i)$, denoted $\kappa_i$ for $i = 1, 2, 3, 4$, and is given by
$$p_2(x_i, Z) = -Z\left\{\frac{1}{2}\left[\kappa_2(x_i) - 1 + \kappa_1^2(x_i)\right] + \frac{1}{24}\left[\kappa_4(x_i) + 4\kappa_1(x_i)\kappa_3(x_i)\right](Z^2 - 3) + \frac{1}{72}\kappa_3^2(x_i)(Z^4 - 10Z^2 + 15)\right\} = O(n^{-1}). \qquad (3.13)$$
The $\kappa_i$'s may be computed as detailed by Hall (1992, [11]); details are provided in Appendix D. The inverse Edgeworth expansions (3.13) are then also utilized to account for the bias typically observed in the MLEs of generalized linear models. A first version of a corrected SCR, denoted SCR1, is a solution where the bias term, $p_2(x_i, W_n(x_i))$, is bounded. First consider the supremum of the bias term, $R'_p = \sup_{x_i \in \mathcal{X}}\{p_2(x_i, W_n(x_i))\} = O_p(1/n)$. We want to find a positive constant $r'_p$ such that $P[R'_p \leq r'_p] \to 1 - \alpha$ as $n \to \infty$. Details are given in Sun, Loader, and McCormick [9] and Hall [11]. Additionally, specific calculation procedures are described in Appendix D. These $r'_p$ values are then used to correct the bias in the choice of $d$ obtained via the tube-formula method. Namely, the new interval is given by
$$\left(g^{-1}(x_i'\hat{\beta} - d_{SCR1}\,\hat{\sigma}_{GLM}(x_i)),\; g^{-1}(x_i'\hat{\beta} + d_{SCR1}\,\hat{\sigma}_{GLM}(x_i))\right) \qquad (3.14)$$
where the new critical point, $d_{SCR1}$, is equal to $d_{TUBE} - r'_p$, with $d_{TUBE}$ the aforementioned solution. Another version of the corrected SCR, SCR2, considers the modified process
$$W_n^0(x_i) = \frac{W_n(x_i) - \kappa_1(x_i)}{\sqrt{\kappa_2(x_i)}},$$
such that $W_n^0(x_i) = W(x_i) - q_2(x_i, W_n^0(x_i))$ with $q_2$ similar to $p_2$.
The tube formula is then applied to $W_n^0(x_i)$. Bounding this normalized process, $W_n^0(x_i)$, further corrects the bias. Doing this results in confidence bounds on $E(Y_i|x_i)$ which improve on the tube method applied directly to $W_n(x_i)$. It is of interest to note that this method corrects the bias via a first-level approximation. The resulting confidence region is of the form
$$\left(g^{-1}(x_i'\hat{\beta} - d_{SCR2}\,\hat{\sigma}_{GLM}(x_i)),\; g^{-1}(x_i'\hat{\beta} + d_{SCR2}\,\hat{\sigma}_{GLM}(x_i))\right) \qquad (3.15)$$
where $d_{SCR2}$ is this bias-corrected solution for the critical value. Note that this is simply a critical value, like $d$, that is corrected for the bias. SLM term this a two-sided corrected SCR. Recall that it utilizes the modified Gaussian process that corrects the bias inherent in the MLEs and finds a critical value that adjusts for that bias. A last solution, called the centered SCR (SCR3), begins by estimating the mean and variance of the Gaussian process $W_n(x_i)$. These are the centered moments, given by $\hat{\kappa}_1(x_i)$ and $\hat{\kappa}_2(x_i)$, respectively. These essentially shift and rescale the confidence region so it is no longer biased. The tube-based critical value $d_{TUBE}$ is again involved, so the final interval is
$$\left(g^{-1}((x_i'\hat{\beta})^* - d_{TUBE}\,\hat{\sigma}_i^*),\; g^{-1}((x_i'\hat{\beta})^* + d_{TUBE}\,\hat{\sigma}_i^*)\right) \qquad (3.16)$$
where $(x_i'\hat{\beta})^* = x_i'\hat{\beta} - \hat{\kappa}_1(x_i)\hat{\sigma}_{GLM}(x_i)$ and $\hat{\sigma}_i^* = \hat{\sigma}_{GLM}(x_i)\sqrt{\hat{\kappa}_2(x_i)}$. These formulas are given in Appendix D.

3.1.2 Proposed Methods

Expanding on the methodologies presented in the previous section, I have developed two new approaches for estimating a mean response function over a specified compact set via confidence regions. The first proposed method is based on the restricted-Scheffé bounds developed by Casella and Strawderman and further refined by Piegorsch and Casella. In Piegorsch and Casella [7], the authors derive and implement conservative simultaneous bounds on the response probabilities of logistic regression models for rectangular domains. I have generalized these bounds to the expected response function for any GLM.
Outlined below is the method by which these bounds may be computed. For any GLM, let the anti-link function be $g^{-1}$, so that

$$E(Y_i|x_i) = g^{-1}(x_i'\beta), \quad i = 1, \ldots, n, \quad (3.17)$$

where $E(Y_i|x_i)$ denotes the response probability (logistic regression) or the mean response (log-linear models or Poisson regression) for the covariate levels specified by $x_i$ (complete model specifications are given in (2.2)). Recall that the MLE of $\beta$, $\hat\beta$, is asymptotically normal with mean $\beta$ and $k \times k$ covariance matrix $V = (X'WX)^{-1}$, with $W$ defined in (2.3). Applying the restricted-Scheffé procedure of Casella and Strawderman to GLMs yields appropriately conservative simultaneous confidence intervals for the mean response. Casella and Strawderman assumed that the MLEs of the regression parameters were normally distributed with a specified mean vector and covariance matrix. In our case we have only asymptotic normality of the MLEs; therefore the probability in (3.6) is not exactly $1 - \alpha$ but instead converges to $1 - \alpha$ as $n \to \infty$.

We will also require slightly different definitions for $S(\mathcal{X}_{x_i})$ and $S(\mathcal{Z}_{z_i})$. Recall that for linear models $S(\mathcal{X}_{x_i}) = \{\theta : (x_i'\hat\theta - x_i'\theta)^2 \le d^2\sigma^2_{LM}\,x_i'Fx_i \ \forall x_i \in \mathcal{X}_{x_i}\}$. Here, however, $S(\mathcal{X}_{x_i}) = \{\beta : (x_i'\hat\beta - x_i'\beta)^2 \le d^2\,x_i'Vx_i \ \forall x_i \in \mathcal{X}_{x_i}\}$. Notice that the inequalities in both sets have an upper bound given by the variance of $x_i'\hat\beta$ or $x_i'\hat\theta$, respectively. $S(\mathcal{Z}_{z_i})$ has a similar definition for GLMs, and the diagonalization described in section 4.1 applies with $F = V$. Thus, for any $S(\mathcal{Z}_{z_i})$ of the form (3.5), generalized appropriately for a GLM,

$$P(S(\mathcal{Z}_{z_i})) \to P(\chi^2_k \le d^2) + P(E_{r,s}(b, d^2)) \quad (3.18)$$

as $n \to \infty$, where

$$E_{r,s}(b, d^2) = \{(\chi^2_r, \chi^2_s) : \chi^2_r + \chi^2_s \ge d^2,\; (a\chi_r + b\chi_s)^2 \le d^2,\; \chi^2_r \le q^2\chi^2_s\}, \quad (3.19)$$

with $a^2 + b^2 = 1$ and $\chi^2_r$ and $\chi^2_s$ independent chi-square random variables. Note that $q$ is a fixed constant determined by the particular $\mathcal{X}_{x_i}$. (Recall we will choose the smallest $\mathcal{X}_{x_i} \supset R_{x_i}$.)
Additionally, the constants $b$, $r$, and $s$ are determined by the value of $d$ and the parameters of the problem. (Details of the computation of $b$, $r$, and $s$ specifically for GLMs are given in Appendix C.) Notice that the coverage probability of the set $S(\mathcal{Z}_{z_i})$ is the sum of the usual coverage probability of the Scheffé set, $P(\chi^2_k \le d^2)$, and a probability that adjusts for the restricted domain.

Recall that these bounds are derived assuming a domain of the form $\mathcal{X}_{x_i}$. If a set of the form $R_{x_i}$ is of interest, an approximate answer may still be found as in Piegorsch and Casella: to find a bound for a domain of the form $R_{x_i}$, the smallest set of the form $\mathcal{X}_{x_i}$ containing $R_{x_i}$ is established. (This procedure is detailed in Appendix C.) We may consider sets on either the $X$ domain, $\mathcal{X}_{x_i}$ and $R_{x_i}$, or their equivalent sets on the $Z$ domain, $\mathcal{Z}_{z_i}$ and $R_{z_i}$. Recall that the method of Casella and Strawderman provides simultaneous bounds for a linear model on a transformed domain of the form $\mathcal{Z}_{z_i}$, and thus equivalently for $\mathcal{X}_{x_i}$. The adapted method of Piegorsch and Casella computes these bounds for domains of the form $R_{x_i}$ in logistic regression models. I propose extending these bounds for use in any generalized linear model with a canonical link. The simultaneous bounds may be transformed from $x_i'\beta$, $x_i \in R_{x_i}$, to the expected response function via the anti-link function.

In order to apply the Casella-Strawderman results to GLMs, we must first show that the probability of the set $S(\mathcal{X}_{x_i})$ converges to $1 - \alpha$.

Corollary 3.1 If $\hat\beta$ is asymptotically normal with mean vector $\beta$ and covariance matrix $V$, then

$$P(S(\mathcal{Z}_{z_i})) \to P(\chi^2_k \le d^2) + P(E_{r,s}(b, d^2)) \text{ as } n \to \infty, \quad (3.20)$$

with $E_{r,s}(b, d^2)$ given by (3.19), where $q$ is determined by the particular $\mathcal{Z}_{z_i}$ considered, together with appropriate constants $b$, $r$, and $s$.

Proof: Recall $\mathcal{Z}_{z_i} = \{z_i : \sum_{j=1}^{r} z_{ij}^2 \ge q^2 \sum_{j=r+1}^{k} z_{ij}^2\}$ for a specified constant $q$. Here $\mathcal{Z}_{z_i}$ is the diagonalized form of $\mathcal{X}_{x_i}$.
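The limiting coverage probability on the right-hand side of (3.18) is straightforward to approximate by simulation once the constants are in hand. The sketch below is a hypothetical Monte Carlo check, not the Appendix C algorithm; the values of $k$, $r$, $s$, $b$, and $q$ passed in are placeholders that would in practice come from the computations described in Appendix C.

```python
import numpy as np

def restricted_scheffe_coverage(d, k, r, s, b, q, n_sim=200_000, seed=1):
    """Monte Carlo estimate of P(chi2_k <= d^2) + P(E_{r,s}(b, d^2)), where

        E_{r,s}(b, d^2) = {chi2_r + chi2_s >= d^2,
                           (a*chi_r + b*chi_s)^2 <= d^2,
                           chi2_r <= q^2 * chi2_s},   a^2 + b^2 = 1,

    with chi2_r, chi2_s independent chi-square variables (eq. 3.19)."""
    rng = np.random.default_rng(seed)
    a = np.sqrt(1.0 - b**2)
    c2r = rng.chisquare(r, n_sim)      # independent chi-square draws
    c2s = rng.chisquare(s, n_sim)
    in_E = ((c2r + c2s >= d**2)
            & ((a * np.sqrt(c2r) + b * np.sqrt(c2s))**2 <= d**2)
            & (c2r <= q**2 * c2s))
    chi2_k = rng.chisquare(k, n_sim)
    return np.mean(chi2_k <= d**2) + np.mean(in_E)
```

A root-finder applied to this function (in $d$) would then give the critical value $d_{CS}$ for a target $1 - \alpha$, up to Monte Carlo error.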
Theorem 1 of Casella and Strawderman [6] gives exact equality of the two probabilities in (3.20) under exact normality of $\hat\beta$. Consequently, under asymptotic normality we have (3.20).

Unfortunately, (3.20) requires $V$ known. Consider the following corollary to the Casella and Strawderman theorem, which holds for GLMs.

Corollary 3.2 If $\hat\beta$ is asymptotically normal with mean vector $\beta$ and covariance matrix $V$, estimated by $\hat V$, then (3.20) still holds for $S(\mathcal{Z}_{z_i})$ utilizing $\hat V$ instead of $V$.

Proof: Let $\hat V = (X'\hat W X)^{-1}$. Since $\hat\beta \stackrel{p}{\to} \beta$, we have $\hat W = W(\hat\beta) \stackrel{p}{\to} W$, and thus $\hat V = (X'\hat W X)^{-1} \stackrel{p}{\to} V = (X'WX)^{-1}$. Consequently the convergence in (3.20) also holds when $V$ is estimated by $\hat V$.

Now let $d_{CS}$ be the value of $d$ such that the probability on the right-hand side of (3.20) is $1 - \alpha$. Then

$$P\big(g^{-1}(x_i'\hat\beta - d_{CS}\hat\sigma^*_i) \le E(Y_i|x_i) \le g^{-1}(x_i'\hat\beta + d_{CS}\hat\sigma^*_i)\big) \to 1 - \alpha \text{ as } n \to \infty, \quad (3.21)$$

where $\hat\sigma^*_i = (x_i'\hat V x_i)^{1/2}$. We have theoretically demonstrated that the restricted-Scheffé bounds of Casella and Strawderman may be applied to GLMs, but the computational details remain to be specified; a detailed computational algorithm for the restricted-Scheffé bounds for any GLM is given in Appendix C.

As an alternative, we can also apply a different estimate of $\beta$, $\hat\beta^*$, to the simultaneous confidence regions (SCRs) developed by SLM [9] and to the restricted-Scheffé bounds for GLMs. Recall that SLM derived four SCR bounds. First, the tube-based bounds were applied to the maximum likelihood estimators by appealing to their asymptotic normal distribution ($d_{TUBE}$). Second, the bias was bounded and then the tube-formula solution was applied ($d_{SCR1}$). Third, the SCR bounds were derived for a modified process which accounts for the bias in the distribution of the MLE ($d_{SCR2}$). And fourth, the random process was centered and rescaled to correct the bias before applying the tube-based bounds (centered SCR).
The various solutions all attempt to correct the bias of the maximum likelihood estimates for GLMs. In particular, when the sample size is small the MLEs are highly non-normal, and the adjustments to the tube formulas are especially helpful.

I propose estimating simultaneous SCR bounds, and restricted-Scheffé bounds, that are based not on the MLE but on an alternative to the MLE, the penalized maximum likelihood estimator (pMLE). The pMLE is a bias-corrected estimate that is closely related to the usual MLE. The tube-formula bounds can then be applied utilizing the pMLE estimates rather than the MLEs; in particular, the naive SCR and centered SCR (SCR3) bounds can easily be applied. The difference between the proposed method and the methods of Sun et al. (2000) [9] is that utilizing the pMLE does not merely "correct" the bias, but prevents the bias from occurring (to first order) a priori. No additional bias-correcting procedures are necessary, since the pMLE has little bias from the start. Additionally, this method eliminates inestimable model parameters ($0$ or $\infty$) due to zero cell counts in GLMs. This difficulty was not resolved by the methodologies presented by SLM, and can be particularly troublesome with small to moderate sample sizes. Also, recall that the bias corrections employed in some of the SCR bounds were only bias-reducing asymptotically; this procedure is bias-reducing for any sample size.

The penalized maximum likelihood estimate (pMLE) was developed by David Firth [12] as an alternative to the MLE. Firth developed these estimators for use in models, such as GLMs, where the typical MLE is known to be biased. Specifically, all of the derivations depend on the model belonging to the general exponential class. These distributions have the general form

$$f(t, \theta) = \exp\{[t\theta - K(\theta)]/a(\phi) + c(t, \phi)\}, \quad (3.22)$$

which, when $\phi$, the dispersion parameter, is known, simplifies to $f(t, \theta) = \exp\{t\theta - K(\theta)\}$ [13].
Although this is an unusual form for an exponential class model, it lends itself quite well to a derivation later in this section, and it can be shown to be equivalent to the standard form under the correct reparameterization.

The general pMLE procedure penalizes the score function via the Jeffreys invariant prior for the parameter of interest. This penalty yields estimates of the regression parameters that are unbiased to first order. Typically, when computing an MLE, the first derivative of the log-likelihood, often called the score function, is computed and set equal to 0, yielding the MLE as the solution. For estimating a set of parameters $\theta = (\theta_1, \ldots, \theta_k)$, let the usual vector of score functions be denoted $U(\theta) = (U_1(\theta), \ldots, U_k(\theta))'$, where $U_r(\theta)$ is the derivative of the log-likelihood with respect to the $r$th parameter. Firth proposes shifting the score function to correct the bias present in most MLEs for GLMs. The shift is determined by the estimated bias, $b(\theta)$, and the information matrix, $i(\theta)$. The shifted, or penalized, score function $U^*(\theta) = U(\theta) - i(\theta)b(\theta) = (U^*_1(\theta), \ldots, U^*_k(\theta))'$ is set equal to 0, and the penalized MLE, $\theta^*$, is the solution to $U^*(\theta) = 0$. The details of estimating the bias are given in Firth [12], but as previously noted, the calculations are based on utilizing a Jeffreys prior for the parameters of interest, $\theta$.

Firth applies this estimator to the logistic regression model, but the methodology can be applied to any exponential class model, and specifically to any GLM. Firth states that for a GLM with a canonical link the modified score function is given by

$$U^*_r(\theta) = U_r(\theta) + \frac{1}{2\phi}\sum_{i=1}^{n} \frac{\kappa_{3i}}{\kappa_{2i}}\, h_i x_{ir}, \quad r = 1, \ldots, k, \quad (3.23)$$

where $U_r(\theta)$ is the derivative of the log-likelihood with respect to the $r$th parameter, $n$ is the number of observations $Y_i$, $\phi$ is the dispersion parameter in (3.22), and $h_i$ is the $i$th diagonal of the hat matrix (the leverage).
The leverage, or the $i$th diagonal of the hat matrix, is a measure of the distance between each observation's covariate values and their mean. (See McCullagh and Nelder [13] for an explicit definition.) Additionally, $\kappa_{ti}$ is the $t$th ($t = 2, 3$) cumulant (equal to the $t$th central moment for $t = 2, 3$) of $Y_i$, $i = 1, \ldots, n$. The cumulants may be calculated using the cumulant generating function $K_i(s) = \log(M_i(s))$, where $M_i(s)$ is the moment generating function for the distribution of the dependent variable $Y_i$. Each cumulant is found by taking the corresponding derivative of $K_i(s)$ with respect to $s$ and then setting $s = 0$; for example, $\kappa_{2i} = K^{(2)}_i(0)$, where $K^{(2)}_i(s)$ is the second derivative of the cumulant generating function. Note that the third cumulant, $\kappa_{3i}$, drives the bias estimate. Justification for this formula is outlined by McCullagh and Nelder (1989) [13] in chapter 15 of Generalized Linear Models.

Specifically, assume a binomial logit model with response variables $Y_i \sim BIN(m_i, \pi_i)$, $i = 1, \ldots, n$, where the $m_i$, $i = 1, \ldots, n$, are assumed known and the $Y_i$ are independent. Notice that in this scenario $k = n$. The moment generating function is $M_i(s) = (\pi_i e^s + (1 - \pi_i))^{m_i}$, so the cumulant generating function is $K_i(s) = m_i \log(\pi_i e^s + (1 - \pi_i))$. Thus the first, second, and third derivatives of the cumulant generating function are

$$K'_i(s) = \frac{m_i \pi_i e^s}{\pi_i e^s + (1 - \pi_i)},$$

$$K^{(2)}_i(s) = \frac{m_i \pi_i e^s}{\pi_i e^s + (1 - \pi_i)} - \frac{m_i \pi_i^2 e^{2s}}{(\pi_i e^s + (1 - \pi_i))^2},$$

and

$$K^{(3)}_i(s) = \frac{m_i \pi_i e^s}{\pi_i e^s + (1 - \pi_i)} - \frac{3 m_i \pi_i^2 e^{2s}}{(\pi_i e^s + (1 - \pi_i))^2} + \frac{2 m_i \pi_i^3 e^{3s}}{(\pi_i e^s + (1 - \pi_i))^3},$$

respectively. Consequently, the second and third cumulants are $\kappa_{2i} = m_i\pi_i(1 - \pi_i)$ and $\kappa_{3i} = m_i\pi_i(1 - \pi_i)(1 - 2\pi_i)$. Additionally, the likelihood function is given by $L(\pi) = \prod_{i=1}^{n}\pi_i^{y_i}(1 - \pi_i)^{m_i - y_i}$, so the usual score function, the first derivative of the log-likelihood, is $U_r(\pi) = \sum_{i=1}^{n}(y_i - m_i\pi_i)x_{ir}$, $r = 1, \ldots, n$.
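The two cumulants above can be sanity-checked numerically by differentiating $K_i(s) = m_i\log(\pi_i e^s + 1 - \pi_i)$ with finite differences and comparing against the closed forms; the values $m = 7$ and $\pi = 0.3$ below are arbitrary illustrative choices.

```python
import numpy as np

m, p = 7.0, 0.3                                      # arbitrary binomial parameters
K = lambda s: m * np.log(p * np.exp(s) + (1.0 - p))  # cumulant generating function

h = 1e-2
# central finite differences for K''(0) and K'''(0)
kappa2 = (K(h) - 2*K(0.0) + K(-h)) / h**2
kappa3 = (K(2*h) - 2*K(h) + 2*K(-h) - K(-2*h)) / (2*h**3)

# closed forms from the text:
#   kappa2 = m*pi*(1 - pi),  kappa3 = m*pi*(1 - pi)*(1 - 2*pi)
```

Both finite-difference values agree with $m\pi(1-\pi) = 1.47$ and $m\pi(1-\pi)(1-2\pi) = 0.588$ to within the $O(h^2)$ truncation error.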
Therefore, the penalized score function is

$$U^*_r(\pi) = \sum_{i=1}^{n}(y_i - m_i\pi_i)x_{ir} + \frac{1}{2\phi}\sum_{i=1}^{n}(1 - 2\pi_i)h_i x_{ir}, \quad r = 1, \ldots, n. \quad (3.24)$$

In the case of a single binomial trial the dispersion parameter is $\phi = 1$ (McCullagh and Nelder (1989) [13]), so (3.24) simplifies to

$$U^*_r(\pi) = \sum_{i=1}^{n}\Big\{\Big(y_i + \frac{h_i}{2}\Big) - (h_i + m_i)\pi_i\Big\}x_{ir}, \quad r = 1, \ldots, n.$$

Note that in models with no bias present in the MLE, such as a simple linear regression model, the score function is not penalized, as the third cumulant is zero.

Now we may apply the tube-formula confidence bounds to the penalized MLEs rather than the MLEs. All calculations follow the previous SCR interval descriptions, since the information matrix of the pMLE is the usual information matrix of the MLE (see Firth (1993) [12]); consequently no alteration of the methodology is necessary. We simply replace $\hat\beta$ with the pMLE of $\beta$, $\hat\beta^*$. We need not consider the first two SCR methods, as they adjust for the bias; rather, we will simply consider the critical value $d_{TUBE}$ applied to the pMLE $\hat\beta^*$. Additionally, the centered SCR could be applied, with possible improvement, as this interval is recentered and rescaled. The CS bounds can also be calculated utilizing $\hat\beta^*$ instead of $\hat\beta$; we refer to these as the pCS bounds in the sequel. No other modifications are necessary.

We will refer to our SCR bounds utilizing the pMLEs as bias-prevented SCRs, or pSCRs. The first of these is given by

$$\big(g^{-1}(x_i'\hat\beta^* - d_{TUBE}\,\hat\sigma(x_i)),\; g^{-1}(x_i'\hat\beta^* + d_{TUBE}\,\hat\sigma(x_i))\big), \quad i = 1, \ldots, n, \quad (3.25)$$

where $\hat\beta^*$ is the penalized maximum likelihood estimate, $\hat\sigma(x_i)$ is given by (2.6), and $d_{TUBE}$ is obtained as described in section 3.1.1. The Casella-Strawderman results could also be applied to an interval of the form (3.25) with $d_{TUBE}$ replaced by $d_{CS}$, where $d_{CS}$ is obtained as described in section 3.1.1. Finally, the centered SCR may be utilized to yield the second pSCR.
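The penalized score (3.24) lends itself to a simple Fisher-scoring scheme: at each iterate, compute the leverages from the current weights and solve for the step using the usual information matrix. The following is a minimal sketch for the binomial logit with canonical link and $\phi = 1$; it illustrates the estimating equation above and is not the implementation used in this dissertation (the function name and interface are hypothetical).

```python
import numpy as np

def firth_logit(X, y, m=None, n_iter=50, tol=1e-8):
    """Firth-penalized MLE for a binomial logit model, a sketch of eq. (3.24).

    X : (n, k) design matrix (include an intercept column if desired)
    y : (n,) observed successes;  m : (n,) trials (defaults to all ones)
    """
    n, k = X.shape
    m = np.ones(n) if m is None else np.asarray(m, float)
    beta = np.zeros(k)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = m * pi * (1.0 - pi)                    # iterative weights kappa_2i
        XtWX = X.T @ (w[:, None] * X)              # information matrix
        # leverages h_i: diagonal of W^{1/2} X (X'WX)^{-1} X' W^{1/2}
        WX = np.sqrt(w)[:, None] * X
        h = np.einsum('ij,ij->i', WX @ np.linalg.inv(XtWX), WX)
        # penalized score: sum_i {(y_i + h_i/2) - (h_i + m_i) pi_i} x_i
        U = X.T @ (y + h / 2.0 - (h + m) * pi)
        step = np.linalg.solve(XtWX, U)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

With separated data (for example, all failures at $x = 0$ and all successes at $x = 1$) the ordinary MLE diverges to $\pm\infty$, whereas the penalized score above still has a finite root, which is exactly the zero-cell-count advantage noted in the text.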
This interval is of the form

$$\big(g^{-1}((x_i'\hat\beta)^* - d_{pSCR2}),\; g^{-1}((x_i'\hat\beta)^* + d_{pSCR2})\big), \quad i = 1, \ldots, n, \quad (3.26)$$

for $(x_i'\hat\beta)^* = x_i'\hat\beta^* - \hat\kappa^*_1(x_i)\sqrt{x_i'\hat V x_i}$ and $d_{pSCR2} = d_{TUBE}\sqrt{x_i'\hat V x_i\,\hat\kappa^*_2(x_i)}$, where $\hat\kappa^*_1(x_i)$ and $\hat\kappa^*_2(x_i)$ are now based on the estimator $\hat\beta^*$. The formulas for these moments of the Gaussian field are given in Appendix D. Since $\hat\beta^*$ attempts to eliminate the bias, it is reasonable to expect that confidence regions based on this estimator will attain the desired level of confidence at smaller sample sizes than the SCR bounds of SLM. For moderate to large samples, the pSCR bounds should be very similar to SLM's corrected and centered SCR bounds.

3.2 Bounds for Simultaneously Estimating Functions of Model Parameters

Often an estimate of something other than the expected response, whether a probability or a mean, is desired. In previous sections, quantities such as the odds ratio, relative risk, and attributable proportion were discussed. Odds ratios and relative risks have immediate clinical application and are often easier for non-statisticians to understand than expected responses. As discussed previously, all of these quantities may be estimated directly from the data. However, it is at times preferable to estimate these quantities via generalized linear models; specifically, they may all be expressed as functions of the parameters of GLMs. Consequently, it is possible to utilize all of the methodologies discussed in section 3.1 to simultaneously estimate quantities such as the odds ratio or relative risk. In this section I propose methods that utilize these simultaneous bounds for GLMs to estimate such quantities, thereby accounting for multiplicity of inference. Since the odds ratio, relative risk, and attributable proportion were described previously, I will not review these quantities; rather, we begin by reviewing some relevant methods for simultaneous estimation.
3.2.1 Previous Methods

When quantities such as odds ratios or relative risks are of interest, we are often utilizing discrete predictor variables to predict a binary response variable. While it is possible to have continuous predictors and compute quantities such as the odds ratio, the use and application of the odds ratio in those situations is less obvious and the need for multiplicity adjustment less apparent. Consequently, attention will first focus on the case of categorical predictors.

Researchers have previously explored simultaneous estimation of various sets of parameters. For example, in 1996 McCann and Edwards [14] proposed a procedure to simultaneously estimate $p$ contrasts of $k$ unknown parameters. This method utilizes Naiman's inequality, discussed previously, to obtain conservative simultaneous confidence regions for the $p$ contrasts of interest. These bounds outperform the existing competing conservative bounds in many scenarios. However, the McCann-Edwards (ME) method applies only to linear models. I propose adapting the SCR and related bounds from section 3.1 to simultaneously bound $p$ contrasts of $k$ unknown parameters from generalized linear models in an analogous manner. First I review the ME method for linear models, and then detail the proposed method for GLMs.

Assume we have a regression model of the form $Y = X'\theta$, where $Y$ is an $n \times 1$ vector of responses, $X$ is a $k \times n$ matrix of predictor variables, and $\theta$ is a $k \times 1$ vector of regression parameters as described by (2.7). Assume that the MLE of $\theta$, $\hat\theta$, is asymptotically multivariate normal with mean $\theta$ and covariance matrix $\sigma^2_{LM}F$, with $F$ assumed known and of full rank. Also assume that an estimate of $\sigma^2_{LM}$ exists and is given by $\hat\sigma^2_{LM}$, where $\hat\sigma^2_{LM}$ is independent of $\hat\theta$ and satisfies $\nu\hat\sigma^2_{LM}/\sigma^2_{LM} \sim \chi^2_\nu$. Thus we have $k$ unknown parameters to estimate via the usual MLE, $\hat\theta = (\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k)$. Now suppose $p$ linear combinations of the regression parameters are of interest.
Let $C$ be a $p \times k$ matrix of constants such that $C\theta$ is the vector of these linear combinations of $\theta$. Given the distributional assumptions on $\hat\theta$, $C\hat\theta$ is multivariate normal with mean $C\theta$ and covariance matrix $\sigma^2_{LM}CFC'$. Thus, if a single contrast is given by $c_j'\theta$, where $C = (c_1, \ldots, c_p)'$ and each $c_j' = (c_{j1}, \ldots, c_{jk})$, then we can form exact simultaneous interval estimates for each contrast via the formula

$$c_j'\hat\theta \pm d\,\hat\sigma_j, \quad j = 1, \ldots, p, \quad (3.27)$$

where $\hat\sigma_j = \hat\sigma_{LM}(c_j'Fc_j)^{1/2}$ and $d$ is a $p$-dimensional multivariate-$t$ quantile with $\nu$ degrees of freedom and correlation matrix $R$, with $R$ the correlation matrix corresponding to $CFC'$. The path-length inequality proposed by McCann and Edwards provides a conservative solution for $d$ that outperforms the existing conservative solutions in many cases. Note that their solution only provides conservative intervals, as obtaining the multivariate-$t$ quantile providing an exact interval is generally an intractable problem. The following theorem, given by McCann and Edwards (1996) in [14], details this solution.

Theorem 3.1 Let $T$ have a $p$-dimensional multivariate-$t$ distribution with degrees of freedom $\nu$ and underlying correlation matrix $R$ of rank $r$. The probability $P(|T_j| \le d,\ j = 1, \ldots, p)$ is bounded below by the expression

$$1 - \int_0^{1/d} \min\Big(F_{r-2,2}\big[2((dt)^{-2} - 1)/(r - 2)\big]\,(\Lambda/\pi) + F_{r-1,1}\big[((dt)^{-2} - 1)/(r - 1)\big],\; 1\Big) f_T(t)\,dt,$$

with

$$\Lambda = \sum_{j=1}^{p-1}\cos^{-1}(r_{j,j+1}),$$

where $F_{m,n}$ is the distribution function of an $F$ random variable with $m$ and $n$ degrees of freedom, and $f_T$ is the density function of a random variable $T$ such that $rT^2 \sim F_{r,\nu}$. If $d$ is such that the foregoing expression is at least $1 - \alpha$, then the intervals (3.27) will be conservative simultaneous $(1 - \alpha)100\%$ confidence intervals.

This inequality determines a value of $d$ that depends on the correlation structure $R$ and the path length $\Lambda$. The path length also depends on the ordering of the indices $1, 2, \ldots, p$, and the smallest value of $d$ is obtained for the optimal ordering.
ME recommend estimating the optimal ordering via the nearest-neighbor algorithm (Townsend, 1987), as no exact solution exists and this method often provides the optimal ordering. It is noted that as the path length $\Lambda$ approaches infinity or zero, the value of $d$ approaches $(rF_{r,\nu,\alpha})^{1/2}$ (Scheffé's critical value) or $t_{\alpha/2,\nu}$ (the one-at-a-time critical value), respectively. Simulations show that when the degrees of freedom are low and the number of comparisons is high, the ME method outperforms other existing conservative solutions. Note that the ME method simply applies Naiman's inequality to an interval and parameterization carefully chosen to contain the quantities of interest. However, the ME method is strictly valid only for linear models. I propose adapting the SCR bounds and their counterparts for generalized linear models in an analogous manner, in order to make simultaneous inference on linear combinations of the parameters of GLMs, such as odds ratios or relative risks.

3.2.2 Proposed Methods

In order to estimate linear combinations of the GLM parameters, I propose applying the SCR and PC methodologies in a fashion similar to the ME bounds. Recall that the only requirement for applying the SCR-type bounds is normality of the regression parameter estimates, which holds asymptotically for GLMs. Application of these bounds to various quantities of interest is outlined in detail below.

3.2.2.1 A General Set of Comparisons of the Model Parameters

Consider the general setup and estimators for GLMs presented in Chapter 2. In section 3.1 the SCR bands were applied to GLMs to estimate the expected response function simultaneously over a closed set $\mathcal{X}$. Now these bounds, along with the pSCR, CS, and pCS bounds, may be applied to simultaneously estimate a set of $p$ linear combinations of the regression parameters of a GLM. These linear combinations will each be of the form $c_j'\beta$, where $c_j$ is a $k \times 1$ vector for $j = 1, \ldots, p$. Let $C = (c_1, \ldots
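The nearest-neighbor ordering heuristic can be sketched as follows: from each possible starting contrast, greedily step to the unused contrast with the largest correlation (smallest angle), and keep the ordering with the smallest total path length $\Lambda$. This is a hypothetical greedy illustration of the idea, not Townsend's exact algorithm, and the correlation matrix `R` below stands in for the $R$ corresponding to $CFC'$.

```python
import numpy as np

def path_length(R, order):
    """Lambda = sum of arccos(r_{j,j+1}) along the given ordering."""
    r = np.clip([R[order[j], order[j + 1]] for j in range(len(order) - 1)], -1, 1)
    return float(np.sum(np.arccos(r)))

def nearest_neighbor_order(R):
    """Greedy nearest-neighbour ordering of the p contrasts: from each start,
    repeatedly move to the unused index with the largest correlation to the
    current one; return the ordering with the smallest path length."""
    p = R.shape[0]
    best, best_len = None, np.inf
    for start in range(p):
        order, used = [start], {start}
        while len(order) < p:
            cur = order[-1]
            candidates = [j for j in range(p) if j not in used]
            nxt = max(candidates, key=lambda j: R[cur, j])
            order.append(nxt)
            used.add(nxt)
        length = path_length(R, order)
        if length < best_len:
            best, best_len = order, length
    return best, best_len
```

For an AR(1)-style correlation matrix $r_{ij} = 0.9^{|i-j|}$ the heuristic recovers the natural ordering, whose path length $3\cos^{-1}(0.9)$ is the minimum.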
, c_p)'$ be a $p \times k$ matrix, so that $C\beta$ is the vector of the $p$ linear combinations of interest. Utilizing the SCR methodologies, the expected response function is simultaneously estimated by bounds with at least $100(1 - \alpha)\%$ coverage $\forall x_i \in \mathcal{X}$. As the results detailed in section 3.1 can be applied to simultaneously estimate the expected response function, they can be applied to $\mathcal{X}$ to obtain simultaneous intervals on $g^{-1}(c_j'\beta)$, provided $c_j \in \mathcal{X}$ for $j = 1, \ldots, p$. These intervals may then be transformed via the link $g$ to obtain simultaneous bounds on the $c_j'\beta$. Generally, the bounds will be of the form

$$c_j'\hat\beta \pm d \times \hat\sigma_{GLM}(c_j), \quad j = 1, \ldots, p, \quad (3.28)$$

where $\hat\sigma_{GLM}(c_j)$ is given by (2.6). The following theorems detail the use of the aforementioned critical values to obtain $100(1 - \alpha)\%$ coverage for a fixed set of $p$ linear combinations of the parameters. Note that we will have eight possible solutions for confidence bands on a fixed set of $p$ linear combinations of the parameters, since the usual CS, the pCS, the four SCR, and the two pSCR intervals may all be applied.

Recall the domains, denoted $R_{x_i}$, presented in section 3.1 pertaining to the CS method for linear models. Let $R^*_{x_i}$ be the smallest hyperrectangle of the CS form that contains the $c_j$, $\forall j = 1, \ldots, p$. A set of this form is utilized in the following theorem.

Theorem 3.2 Under the GLM setting described in (2.1), the asymptotic simultaneous coverage probability of the bands (3.28) has a lower bound of $1 - \alpha$ for $d = d_{CS}$ when $d_{CS}$ is computed for $R_{x_i} = R^*_{x_i}$. The same holds for the bands (3.28) when $\hat\beta = \hat\beta^*$, the pMLE of $\beta$.

Proof: Note that this result holds for any $x_i \in \mathcal{X}_{x_i}$, and that $\mathcal{X}_{x_i} \supset R_{x_i} = R^*_{x_i}$, where the vectors $c_j = (c_{j1}, \ldots, c_{jk})$, $j = 1, \ldots, p$, are embedded in the hyperrectangle $R^*_{x_i}$. Thus each $c_j$, $j = 1, \ldots, p$, is contained in $\mathcal{X}_{x_i}$, and consequently utilizing $d = d_{CS}$ in (3.28) guarantees at least $100(1 - \alpha)\%$ simultaneous coverage asymptotically for the $p$ intervals of interest.
Also note that the limiting distributions of $\hat\beta^*$ and $\hat\beta$ are identical; thus the asymptotic coverage of the bands (3.28) based on $\hat\beta^*$ is the same as that based on $\hat\beta$.

Now let $\mathcal{X}^*$ be the smallest compact subset of the domain such that $c_j \in \mathcal{X}^*$, $\forall j = 1, \ldots, p$.

Theorem 3.3 Under the GLM setting described in (2.1), the asymptotic simultaneous coverage probability of the bands (3.28) has a lower bound of $1 - \alpha$ for $d = d_{TUBE}$, $d = d_{SCR1}$, and $d = d_{SCR2}$, where these critical values are computed for $\mathcal{X} = \mathcal{X}^*$. For $\hat\beta = \hat\beta^*$, the pMLE of $\beta$, the same holds for $d = d_{TUBE}$.

Proof: Note that this result holds for any $x_i \in \mathcal{X} = \mathcal{X}^*$. The vectors $c_j = (c_{j1}, \ldots, c_{jk})$, $j = 1, \ldots, p$, are embedded in the set $\mathcal{X}^*$. Thus $c_j \in \mathcal{X} = \mathcal{X}^*$ $\forall j$, and consequently utilizing $d = d_{TUBE}$, $d = d_{SCR1}$, or $d = d_{SCR2}$ in (3.28) guarantees at least $100(1 - \alpha)\%$ simultaneous coverage asymptotically for the intervals. Moreover, since the limiting distributions of $\hat\beta^*$ and $\hat\beta$ are identical, the asymptotic coverage of the bands (3.28) based on $\hat\beta^*$ is the same as that based on $\hat\beta$.

Theorem 3.4 Under the GLM setting described in (2.1), the asymptotic simultaneous coverage probability of the bands

$$c_j'\hat\beta - \hat\kappa_1(c_j)\hat\sigma_{GLM}(c_j) \pm d_{TUBE}\,\hat\sigma_{GLM}(c_j)\sqrt{\hat\kappa_2(c_j)}, \quad (3.29)$$

where $\hat\sigma_{GLM}(c_j)$ is given by (2.6), has a lower bound of $1 - \alpha$, with $\hat\kappa_1(c_j)$ and $\hat\kappa_2(c_j)$ defined as stated in section 3.1 and $\mathcal{X} = \mathcal{X}^*$. The same holds for $\hat\beta = \hat\beta^*$, the pMLE of $\beta$, with $\hat\kappa_1(c_j)$ and $\hat\kappa_2(c_j)$ defined appropriately for $\hat\beta^*$ and $\mathcal{X}$ again equal to $\mathcal{X}^*$.

Proof: Note that this result holds for any $x_i \in \mathcal{X}^*$. The vectors $c_j = (c_{j1}, \ldots, c_{jk})$, $j = 1, \ldots, p$, are embedded in the set $\mathcal{X}^*$. Thus $c_j \in \mathcal{X} = \mathcal{X}^*$ $\forall j$, and consequently the intervals in (3.29) utilizing $\hat\beta$, the usual MLE of $\beta$, guarantee at least $100(1 - \alpha)\%$ simultaneous coverage asymptotically. Since the limiting distributions of $\hat\beta^*$ and $\hat\beta$ are identical, the asymptotic coverage of the bands (3.29) based on $\hat\beta^*$ is the same as that based on $\hat\beta$.
3.2.2.2 Illustrations of Simultaneous Procedures for Particular Scenarios

The simultaneous procedures for estimating any specified combination of the model parameters of a GLM include the following: the SCR (all four forms), PC (restricted-Scheffé), pPC (pMLE restricted-Scheffé), and both pSCR bounds. The relevant bounding procedures will be demonstrated for the odds ratio, relative risk, and attributable proportion, but note that other quantities of interest could also be estimated. Since the model estimated is assumed to be a GLM, any of the proposed eight confidence bounds may be applied to the general expected response function. Following the procedure described in section 3.1, one finds the bounds for any GLM on a restricted domain and then applies the link function. For example, a GLM is generally given by $g(E(Y_i|x_i)) = x_i'\beta$, $i = 1, \ldots, n$. We will outline the procedure for some common link functions, the log and the logit. Note that it is possible to simultaneously estimate any specified set of linear combinations of the regression parameters for a specified GLM using any of the eight methods described previously in this chapter.

As an example, recall the preterm birth study discussed in Chapter 1. This study employed a log-linear model in which the reference level was the case with no apparent source of maternal stress. Recall the model was given by (2.14); thus let $\beta = (\beta_0, \beta_1, \ldots, \beta_k)$ be the $(k + 1) \times 1$ parameter vector, where for this example $k = 3$. In this study, the relative risks for each level of the source of maternal stress, compared back to the cases with no identified maternal stress factor, were of primary interest, and, as discussed previously, the overall conclusions made in the study merit simultaneous estimation of these relative risks. We focus specifically on the variable indicating "Life Event" stress (see Table 1.1).
This factor had four overall levels for the independent variable indicating the presence of any "Life Event" maternal stress. To utilize the procedure outlined in section 3.2.2.1, we need to define the matrix $C_{3\times 4} = (c_1, \ldots, c_3)'$. Specifically, let $c_j = (c_{j1}, c_{j2}, \ldots, c_{j4})$, where $c_{j,i} = 0$ for $i \ne j + 1$ and $c_{j,j+1} = 1$, $i = 1, \ldots, 4$; $j = 1, \ldots, 3$. Since $\beta_j$ is the log of the relative risk for the $j$th level of the independent variable, $C\beta = (\beta_1, \beta_2, \beta_3)'$ is the vector of log relative risks for "Life Event" stress in the preterm birth study. The $C$ matrix yielding $C\beta$ equal to the log relative risks for other reference-coded Poisson models would be defined similarly. Notice that in our example $\mathcal{X}$ will be the three-dimensional subspace where $x_1 = 0$ and $\sum_{i=2}^{4} x_i \le 1$. (Note that $c_j$, $j = 1, \ldots, 3$, is contained in this $\mathcal{X}$.) Asymptotic simultaneous $100(1 - \alpha)\%$ confidence bands for the log relative risks of interest in the preterm birth study can now be formed by utilizing (3.28) or (3.29) with an appropriate critical value and $\hat\beta$ or $\hat\beta^*$ as warranted. To obtain asymptotic simultaneous confidence bands on the relative risks themselves, the bands on the log relative risks are exponentiated. These intervals allow us to make overall conclusions with a specified asymptotic error rate.

Alternatively, now suppose relative risks are of interest using a slightly different model. Specifically, model-based relative risks can be estimated utilizing a Poisson regression on a proportion. Thus, assume a simple model of the form $\log(\pi(x_i)) = \beta_0 + x_i'\beta = x_i^{*\prime}\beta^*$ for $i = 1, \ldots, n$, where $\beta^*_{(k+1)\times 1} = (\beta_0, \beta_1, \ldots, \beta_k)'$ and $x^*_i = (1, x_{i1}, \ldots, x_{ik})'$ is a predictor vector of dimension $k + 1$. Then the relative risk is easily estimated by exponentiating any one of the regression parameters, $e^{\beta_i}$ ($i = 1, \ldots, k$).
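The contrast construction for the preterm-birth example can be sketched as follows. The estimates, standard errors, and critical value below are made-up illustrative numbers, not values from the study; only the shape of $C$ follows the definition above.

```python
import numpy as np

# Build the 3 x 4 contrast matrix C with c_{j,j+1} = 1 and zeros elsewhere,
# which picks (beta_1, beta_2, beta_3) out of beta = (beta_0, ..., beta_3).
k, p = 4, 3
C = np.zeros((p, k))
for j in range(p):
    C[j, j + 1] = 1.0

beta_hat = np.array([-0.2, 0.35, 0.10, 0.55])  # illustrative estimates only
se = np.array([0.12, 0.15, 0.11])              # illustrative sigma_GLM(c_j)
d = 2.39                                        # illustrative critical value

log_rr = C @ beta_hat                           # log relative risks C*beta_hat
lower = np.exp(log_rr - d * se)                 # exponentiate the simultaneous
upper = np.exp(log_rr + d * se)                 # bands to get bands on the RRs
```

The same pattern, with a different $d$ (from the CS, tube, or pSCR solutions) and possibly $\hat\beta^*$ in place of $\hat\beta$, covers all eight proposed procedures.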
Thus, in order to estimate the relative risks, let the matrix $C$ be defined as described in 5.2.1, with $c_{j,i} = 0$ when $i \ne j + 1$ and $c_{j,j+1} = 1$, $i = 1, \ldots, k + 1$; $j = 1, \ldots, p$. Again we can obtain asymptotic simultaneous $100(1 - \alpha)\%$ confidence bands for the relative risks by utilizing (3.28) or (3.29) with an appropriate critical value and $\hat\beta$ or $\hat\beta^*$ as warranted. Here $\mathcal{X}$ would be similar to that for the skin cancer study. To complete the calculations, the resulting bounds are exponentiated, as they were for the odds ratio calculations.

Once relative risks are estimated from a Poisson regression model, it may be helpful to additionally estimate the attributable proportions. Recall that the point estimate of a relative risk is given by $\gamma_j = \exp(c_j'\beta)$ for $j = 1, \ldots, p$. As the attributable proportion is a one-to-one increasing function of the relative risk, the following holds:

$$\kappa_j = 1 - \frac{1}{\gamma_j} = \frac{\gamma_j - 1}{\gamma_j}, \quad j = 1, \ldots, p,$$

where $\kappa$ is the attributable proportion and $\gamma$ is the relative risk. Suppose the relative risk interval has lower and upper limits denoted $L_{RR}$ and $U_{RR}$, respectively. Then the limits on the attributable proportion are given by

$$\Big(\frac{L_{RR} - 1}{L_{RR}},\; \frac{U_{RR} - 1}{U_{RR}}\Big). \quad (3.30)$$

If $(L_{RR}, U_{RR})$ were obtained with simultaneous coverage for a specified set, then the same simultaneous coverage properties hold for the equivalent set of attributable proportions.

Note that all the previous examples assumed the estimated quantities refer back to the reference level. However, this general methodology can be extended to make other kinds of comparisons between the estimated quantities. For example, if we consider a logistic model with reference coding, we could consider an alternative set of odds ratios for joint estimation. Recall that each $e^{\beta_i}$ is the odds ratio for the $i$th level compared to the reference level (or control).
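Because $\kappa = (\gamma - 1)/\gamma$ is increasing in $\gamma$, the interval map in (3.30) is a one-liner; the helper name below is hypothetical.

```python
def attributable_proportion_interval(l_rr, u_rr):
    """Map relative-risk limits (L_RR, U_RR) to attributable-proportion limits
    via the increasing transform kappa = (gamma - 1)/gamma, as in (3.30)."""
    return (l_rr - 1.0) / l_rr, (u_rr - 1.0) / u_rr
```

For instance, a simultaneous relative-risk interval of $(2, 4)$ maps to the attributable-proportion interval $(0.5, 0.75)$, and the simultaneous coverage carries over because the transform is monotone.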
Alternatively, suppose it were of interest to estimate the odds ratios comparing the first non-reference level of the covariate to every other non-reference level. These odds ratios could be estimated via $e^{\beta_1 - \beta_j}$ for $j = 2, \ldots, k$. Thus the contrast matrix $C$ would have rows of the form $c_j = (0, 1, 0, \ldots, 0, -1, 0, \ldots, 0)$. Then via (3.28) or (3.29) we could estimate the log odds ratios simultaneously for an appropriate $d$ and $\hat\beta$ or $\hat\beta^*$ as warranted; simply exponentiating the results would give simultaneous bands for this particular set of odds ratios.

CHAPTER 4 Simulations

In order to evaluate the performance of the proposed restricted-Scheffé bounds, pMLE Scheffé bounds, and pMLE SCR bounds, I conducted Monte Carlo simulations. I ran two main simulation studies: simulations for the simultaneous estimation of the expected response function, and simulations for the simultaneous estimation of a linear function of the regression parameters, which easily gives simultaneous intervals on the odds ratios, relative risks, or attributable proportions via transformation. Some recommendations are provided on the choice of intervals for various scenarios. Additionally, an example of this kind of transformation is given in section 5.2.

4.1 Expected Response Function Simulations

These simulations assume only one predictor variable, so the vector of parameters is of the general form $\beta = (\beta_0, \beta_1)'$. The estimated coverage, or alternatively the estimated error, was recorded for each scenario simulated. I simulated scenarios for various values of the parameter $\beta$, the sample size $n$, and the predictor variable $x$. Simulations focused on the most commonly utilized GLMs, logistic regression and Poisson models, although other models could be investigated. Simulations included the following regression parameters: (1) $\beta = (-1, -0.5)$, (2) $\beta = (0, 1)$, (3) $\beta = (2, 4)$, and (4) $\beta = (-0.5, -1)$.
The sample sizes were n = 10, 25, and 50. The domain, X, for both logit and Poisson models is continuous. For the logit model, I generated points equally spaced over wide and narrow intervals. First, appropriate values of π were chosen that would determine either a wide or narrow range for the domain. For a wide domain, πL = 0.1 and πU = 0.9, and for a narrow domain, πL = 0.25 and πU = 0.75. The X interval endpoints were then determined by inverting the GLM so that x = (logit(π) − β0)/β1, where β0 and β1 are the parameters. Thus, for the logit model, xL = (logit(πL) − β0)/β1 and xU = (logit(πU) − β0)/β1 are the endpoints of the domain. Then let x1 = xL and xn = xU and let the remaining points be equally spaced between x1 and xn. Thus, X = (x1, x2, . . . , xn) is a vector of a countable set of points contained in X = {x : xL ≤ x ≤ xU}. In order to simulate the response, a set of response variables Yi, i = 1, . . . , n, was generated in the following manner. First, we generated Uniform(0,1) random variables Ui, i = 1, . . . , n. Then, when generating data for a logit model, we let the response variable Yi be 1 if Ui < π(xi) and 0 otherwise. Note that π(xi) = 1/(1 + e^(−(β0 + β1·xi))) for i = 1, . . . , n. For the Poisson model, I generated a set of uniform random variables for the X values. The wide domain was distributed Uniform(0,1), while the narrow domain was distributed Uniform(0,0.5). To generate the response variable for a Poisson regression model, Yi is a Poisson random variable generated to have mean μ(xi) = e^(β0 + β1·xi) for each i = 1, . . . , n. For both models, I then used this set of Yi values and the X ∈ X data set to compute the estimated parameter values and estimated covariance matrix. These were also utilized to obtain the equations for the bounds over X for each method evaluated. Note that we are not generating binomial, multinomial, or Poisson random variables and fitting the model to the generated observations. Rather, we are assuming the model holds exactly.
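The data-generation scheme for the logit model just described can be sketched as follows. The dissertation's simulations were implemented in Fortran (Appendices E through H); this Python translation is illustrative only, using one of the parameter scenarios listed above. The Poisson case is analogous, with uniform X values and Poisson draws in place of the uniform comparison.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def generate_logit_sample(beta0, beta1, n, pi_lo, pi_hi, rng):
    """Equally spaced x over the domain obtained by inverting the GLM,
    x = (logit(pi) - beta0) / beta1; Bernoulli responses drawn by
    comparing Uniform(0,1) variates to pi(x)."""
    x_lo = (logit(pi_lo) - beta0) / beta1
    x_hi = (logit(pi_hi) - beta0) / beta1
    xs = [x_lo + i * (x_hi - x_lo) / (n - 1) for i in range(n)]
    ys = []
    for x in xs:
        pi_x = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        ys.append(1 if rng.random() < pi_x else 0)
    return xs, ys

rng = random.Random(2007)
xs, ys = generate_logit_sample(-1.0, -0.5, 25, 0.1, 0.9, rng)
print(len(xs), all(y in (0, 1) for y in ys))  # prints: 25 True
```

Note that the model is taken to hold exactly: the responses are generated directly from π(x), not fitted from separately generated binomial data.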
For situations where the logistic and Poisson models do not hold well, these methods could perform quite poorly. For each sample I will note whether the estimated bounds cover the true response function over X, a finite set contained in X. I will then estimate the simultaneous coverage with the empirical coverage of my simulated samples. Additionally, in order to determine the number of simulated samples required to estimate the error to within ±0.005, we calculated a lower bound on the number of simulations. If we are willing to tolerate 5% type I error (α = 0.05), a lower bound on the number of simulations may be determined via [(0.05)(0.95)/ρ]^(1/2) < 0.005, where ρ is the number of simulations [7]. This yields ρ > 1900. Thus we ran 5000 simulations to reduce error.

4.1.1 Expected Response Function Simulation Results

The most significant distinction among all the competing intervals with regard to the empirical confidence level was the estimator used, MLE or pMLE. See Figure 4.1 for a plot comparing two MLE intervals to two pMLE intervals.

[Figure 4.1: MLE and pMLE SCR intervals with wide range; Logit Model (W) with B=(1, 0.5); empirical confidence level versus sample size for Tube MLE, Tube pMLE, PC MLE, and PC pMLE.]

Generally, the pMLE-based intervals achieved the desired level of confidence at all sample sizes. In contrast, the MLE-based intervals as a group only reached the desired level of confidence for some cases of n = 25 or 50. Clearly, the ability of the pMLE intervals to achieve the desired confidence level at any sample size is an improvement over any of the usual MLE intervals. Though we see this improvement in the reliability of the pMLE-based intervals, the pMLE-based intervals are extremely conservative, particularly at small sample sizes. Again, see Figure 4.1 for an example. This could be due to the computation of the biasing constant used to shift the likelihood equations when solving for the pMLE.
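The simulation-count bound quoted above, [(0.05)(0.95)/ρ]^(1/2) < 0.005, can be checked directly:

```python
import math

def simulation_lower_bound(p, margin):
    """rho must exceed p*(1-p)/margin^2 so that the Monte Carlo
    standard error sqrt(p*(1-p)/rho) falls below the desired margin."""
    return p * (1.0 - p) / margin ** 2

bound = simulation_lower_bound(0.05, 0.005)
print(round(bound))  # prints: 1900

# With 5000 runs, the margin is comfortably below 0.005:
print(math.sqrt(0.05 * 0.95 / 5000) < 0.005)  # prints: True
```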
At the larger sample sizes the pMLE bias adjustment is more precise, while at small sample sizes this adjustment is more conservative. Recall that we applied the pMLE estimator to the restricted-Scheffé, naive SCR, and SCR3 intervals. Interestingly, the re-shifted and re-scaled SCR interval (SCR3) does not perform as well as the naive tube-based SCR interval or any of the other intervals. See Figure 4.2 for an example; similar behavior to Figure 4.2 was observed for other parameter sets.

[Figure 4.2: pMLE intervals, Logit Model (W) with B=(1, 0.5); empirical confidence level versus sample size for SCR1, SCR2, PC, and Scheffé.]

Clearly, when using a bias-preventing estimator, trying to correct the remaining bias is not helpful, and in fact is often detrimental. Though the pMLE SCR3 intervals are not usually a good alternative to the MLE-based intervals, the other pMLE intervals (restricted-Scheffé and naive SCR) perform far better than the MLE intervals in all cases. See both Figure 4.1 and Figure 4.2 for examples. As the sample size increases to moderately large sizes (n = 25, 50, or 100), the MLE-based intervals as a whole do reach the desired level of confidence, and the pMLE-based estimators' empirical confidence level decreases to a level much closer to the intended level of confidence (see Figure 4.1); in many cases the pMLE-based intervals (either PC or naive SCR) are actually less conservative than the usual MLE-based intervals. While the pMLE-based intervals were intended to address poor coverage at small sample sizes, these intervals also appear to improve upon the conservative nature of the usual MLE-based intervals at moderate sample sizes. Thus, at these moderate to large sample sizes the pMLE-based intervals still attain the desired level of confidence but, in general, do not overreach the desired confidence level as the usual MLE-based intervals often do.
Note that so far I have presented only one parameter case for the logit model. Recall that I investigated four sets of parameters for the logit model and three sets of parameters for the Poisson model. The plots comparing the two superior pMLE intervals to the corresponding MLE intervals are displayed in Figures 4.3 to 4.15 in this section. Generally, for the logit model simulations, we see behavior very similar to the plots analyzed previously. However, for some choices of the regression parameters, a sample size of 100 was required to achieve the desired confidence level with the naive tube MLE interval. See Figures 4.3 to 4.5. In regard to the domain used in the estimation of the interval, when a wide domain was assumed for estimation of the GLM, the MLE-based intervals needed larger sample sizes to attain the desired level of confidence than when the domain was narrow. This held for the logit model in particular (see Figure 4.3). However, this trend did not translate to the pMLE-based intervals in general, where the domain of interest really only affected how conservative the intervals were.
The choice of the four parameter sets did not significantly change the overall trends observed for the competing intervals.

[Figures 4.3 to 4.15: MLE and pMLE SCR intervals (Tube MLE, Tube pMLE, PC MLE, PC pMLE) for the logit and Poisson models over the wide (W) and narrow (N) ranges and the remaining parameter sets; empirical confidence level versus sample size.]

In contrast, the choice of the link function did. The logistic models were in general less conservative than the Poisson models (compare Figures 4.1 and 4.10). The performance of the MLE and competing pMLE intervals was most distinct when investigating the logistic regression model. Conversely, the Poisson model empirical confidence levels did not vary greatly across the sample sizes when the pMLE-based intervals were utilized. Even the MLE-based intervals were conservative at times for this link function, and in many cases attained the desired level of confidence even at the smallest sample sizes (see Figure 4.10 as an example). Thus, many of the comments made about the logistic regression models do not apply when considering Poisson regression models. In general, little distinction can be made among the competing intervals. At times, the MLE-based intervals appear to be a little less conservative than the pMLE-based intervals, but often that difference is negligible. I believe that the conservative nature of all the methods for estimating the mean response of a Poisson model needs to be addressed.
This conservative behavior may be due to the tendency of Poisson regression models to overestimate the variance of the regression parameters (see [15] and [16]). A sandwich estimator of the variance should be studied to potentially correct this problem [17]. A sandwich variance estimator, along with generalized estimating equations (GEEs) for estimating the model parameters, can correct a misspecified variance function for Poisson models. The sandwich estimator is a robust variance estimator that consistently estimates the true variance when the parametric model fails to hold. Since the parameters estimated from GEEs are asymptotically normal when the sandwich covariance matrix is utilized, we may directly apply the usual MLE-based interval methods. Though the MLE intervals are not of primary concern, it should be noted that, among the MLE methods, the Scheffé is, as expected, the most conservative of all. Yet little distinction can be made among the other methods, except that the SCR3 intervals tend not to reach the desired level of confidence as quickly as the other SCR intervals. See Appendix A for examples. However, these trends do not translate to the intervals utilizing the pMLE estimator since: 1) not all SCR intervals are employed using this estimator and thus cannot be compared, and 2) the bias-preventing estimate does drastically change the behavior of each interval overall.

4.2 Functions of the Parameters Simulations

In order to estimate the error associated with the confidence regions for estimating the set of regression parameters, and hence odds ratios, relative risks, or attributable proportions, I will again assume only one predictor variable. In general, the single predictor variable is assumed to be categorical. However, using reference cell coding for one categorical predictor entails utilizing several binary predictor variables. This will be taken into account in the simulations.
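Reference cell coding as described above can be sketched as follows; this is a hypothetical helper, not taken from the dissertation's Fortran code. A categorical predictor with k + 1 levels becomes k binary columns, with the reference level mapped to all zeros.

```python
def reference_code(levels, reference):
    """Map each observation's categorical level to an intercept column
    plus k binary indicators, one per non-reference level; the
    reference level is coded as all zeros."""
    nonref = sorted(set(levels) - {reference})
    rows = [[1] + [1 if lev == nr else 0 for nr in nonref] for lev in levels]
    return rows, nonref

X, cols = reference_code(["ctrl", "a", "b", "a", "ctrl"], "ctrl")
print(cols)  # prints: ['a', 'b']
print(X[0])  # prints: [1, 0, 0]  (a reference-level row)
```

Under this coding, each slope coefficient is the log odds ratio (or log relative risk) of its level against the reference, which is why comparisons with a control reduce to simultaneous estimation of the slope parameters.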
When evaluating the odds ratio we also need to consider what type of multiple comparisons could be of interest. I will consider only contrasts corresponding to comparisons with a control. For example, comparisons with a control would require simultaneously estimating the odds ratio for each nonreference level against the reference level from a logistic regression model when reference cell coding is utilized. This entails simultaneously estimating all slope parameters. Recall that the control is the reference level; thus, if we have k + 1 regression parameters, there will be k comparisons, or k odds ratios, to be simultaneously estimated. Thus we are simultaneously estimating e^β_i for every i = 1, . . . , k. Similarly, the relative risks for each level of a predictor variable with reference to the control could be investigated for a Poisson regression model. The attributable proportion also could be examined for both a Poisson regression model and a logistic regression model. These too are one-to-one functions of the k slope parameters. We investigated both logistic and Poisson regression models for evaluating the limits on the estimated odds ratios, relative risks, or attributable proportions. In this scenario the generation of the X data set and Yi, i = 1, . . . , n, was identical to that described in section 4.1. However, in this case we evaluated the estimated coverage of the simultaneous confidence bounds for the discrete set of interest only. This coverage was estimated in a manner analogous to that for the expected response. Namely, the data were generated, the model was estimated, and finally the intervals for the slope parameters were constructed. Each time an interval captured the true parameter value, a success was recorded, and the number of captures out of all k comparisons was recorded. This was repeated 5000 times and the average empirical confidence level was recorded.
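The coverage-estimation loop has the general shape below. The interval method here is a deliberate placeholder (an ordinary Wald interval for a single proportion), standing in for the simultaneous constructions in the text, just to show the bookkeeping: generate, refit, build the interval, record a capture, and regenerate degenerate samples, as described in the following section.

```python
import math
import random

def wald_interval(y, n, z=1.96):
    """Placeholder interval (Wald interval for a proportion), standing
    in for the simultaneous interval constructions in the text."""
    p = y / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def coverage_estimate(n_sims, n, p_true, rng):
    """Monte Carlo estimate of interval coverage.  Samples whose
    response vector is all 0's or all 1's are discarded and
    regenerated, mirroring the screening step in the simulations."""
    captures = regenerated = done = 0
    while done < n_sims:
        y = sum(1 for _ in range(n) if rng.random() < p_true)
        if y == 0 or y == n:          # ill-fitting sample: regenerate
            regenerated += 1
            continue
        lo, hi = wald_interval(y, n)
        captures += lo <= p_true <= hi
        done += 1
    return captures / n_sims, regenerated

rng = random.Random(7)
cov, regen = coverage_estimate(5000, 50, 0.5, rng)
print(round(cov, 2))  # empirical confidence level, near 0.95
```

For the simultaneous problem, the capture check would instead ask whether all k intervals contain their true slope parameters in a given replicate.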
For the purpose of the simulations, I considered k = 4, where k is the number of estimated parameters (slope parameters in this special case). The sample sizes considered are n = 50, 100, 200, and 300, and α = 0.05. Also, as in section 4.1, I recorded the empirical confidence level for each method considered. At times, the estimated covariance matrix was near-singular. This means that the covariance matrix was estimated to be such that, when the calculations for computing its inverse were begun, an error occurred in the LU factorization, and an error was returned by the Fortran routine. Additionally, the model is ill-fitting if the response vector is either all 1's or all 0's. Thus, cases where the response vector was all 0's or all 1's, or the covariance matrix was singular or near-singular, were recorded and the data were regenerated. When n = 50, there were 252 such cases that were thrown out and the data regenerated. When n = 100, 200, or 300, no cases of near-singular matrices or all-0 or all-1 response vectors occurred.

4.2.1 Functions of the Parameters Simulation Results

As with the intervals on the mean response, performance of the intervals on the parameters was most affected by the estimator utilized, MLE or pMLE. In general, when the pMLE estimator was used for any interval method, the desired level of confidence was reached at any sample size (see Figure 4.16 or Figure 4.17).

[Figure 4.16: MLE and pMLE SCR intervals for the parameters, Logit Model with B=(1, 0.5, 0.25, 0.5, 0.25).]
[Figure 4.17: MLE and pMLE SCR intervals for the parameters, Poisson Model with B=(1, 0.5, 0.25, 0.5, 0.25).]

In contrast, the MLE-based intervals did not in general attain the desired confidence level until either n = 200 or n = 300 for most cases studied.
At the smallest sample size (n = 50), the pMLE-based interval attained the desired level of confidence and was not overly conservative (see Figure 4.16 or Figure 4.17). As the sample size increased, the pMLE-based interval's empirical confidence level slowly approached the desired confidence level, just as we saw with the mean response simulations. At the largest sample size simulated (n = 300), the pMLE-based interval had an empirical confidence level between those of the various MLE intervals. In general, only the MLE Scheffé and PC intervals are more conservative than the two pMLE intervals, while all the MLE SCR intervals are a little less conservative. Thus, whether a practitioner utilizes the MLE- or pMLE-based intervals makes little difference at these larger sample sizes.

CHAPTER 5
CONCLUSIONS

5.1 Application of Proposed Methods

Using practical examples, with emphasis placed on implementation of the procedures and interpretation of the results, I will now illustrate how to utilize the proposed intervals. One example focuses on estimating the mean response of a GLM, while the other utilizes a GLM with reference-coded binary predictors to estimate a set of odds ratios.

5.1.1 Diabetes among Pima Women

Diabetes is a common disease among females of the Pima culture in Arizona. A study was conducted to better understand the incidence of diabetes in the population of Pima women [18]. One explanatory variable that is believed to be associated with diabetes in young Pima women, age 24 and younger, is the plasma glucose concentration. The glucose concentration was measured with an oral glucose tolerance test on the 51 Pima women aged 24 or younger. A binary variable indicates the presence of diabetes in the young women (0 = no diabetes and 1 = diabetes), so a simple logistic regression model is reasonable. Table 5.1 contains the estimated probability of diabetes for women with high glucose readings (140 and above).
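The point estimates in Table 5.1 come from the fitted pMLE model reported below, logit(π̂) = −10.826 + 0.0701X. A small sketch reproduces them by inverting the logit; the coefficients are as printed in the study, so the reproduced values agree with the table only up to rounding.

```python
import math

def pi_hat(glucose, b0=-10.826, b1=0.0701):
    """Estimated probability of diabetes from the fitted pMLE model
    logit(pi) = b0 + b1 * glucose, inverted to pi = 1/(1 + e^-eta)."""
    eta = b0 + b1 * glucose
    return 1.0 / (1.0 + math.exp(-eta))

for g in (140, 154, 199):
    print(g, round(pi_hat(g), 3))
# each printed value is within 0.005 of the corresponding Table 5.1 entry
```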
Individual 95% confidence intervals were calculated for each proportion, as well as naive tube (SCR1) intervals with the restriction that the glucose reading was greater than 140. Note that each proportion in Table 5.1 uses the notation π_glucose level, with the estimated proportion giving the probability of diabetes for an individual Pima woman with that glucose level.

Table 5.1: Intervals for Proportion of Pima Women with Diabetes

  Point Estimate    95% MLE Individual          95% pMLE SCR1 Simultaneous
  for π             Confidence Interval for π   Confidence Interval for π
  π̂140 = 0.268      (0.109, 0.523)              (0.101, 0.561)
  π̂142 = 0.297      (0.120, 0.567)              (0.108, 0.598)
  π̂143 = 0.312      (0.125, 0.589)              (0.111, 0.616)
  π̂148 = 0.391      (0.155, 0.693)              (0.128, 0.703)
  π̂151 = 0.443      (0.175, 0.748)              (0.140, 0.750)
  π̂154 = 0.495      (0.197, 0.796)              (0.151, 0.792)
  π̂177 = 0.831      (0.417, 0.971)              (0.259, 0.962)
  π̂188 = 0.914      (0.540, 0.990)              (0.322, 0.984)
  π̂199 = 0.958      (0.658, 0.996)              (0.391, 0.994)

A discussion of the results follows the presented interval estimates in Table 5.1. First note that, as expected, the individual confidence intervals are narrower than the simultaneous intervals. Thus, using the individual confidence intervals, it will be easier to reject certain proportions as the true value. Note that the proportion estimates given in Table 5.1 are from the pMLE-estimated model. The MLE model was logit(π̂) = −10.840 + 0.0703X, while the pMLE model was logit(π̂) = −10.826 + 0.0701X. The change in the length and center of the intervals could lead to very different conclusions given certain research questions.

5.1.2 Depression in Adolescents

A study was conducted on the classification of depression (high or low) among adolescents between the ages of 12 and 18 with either learning disabilities (LD) or serious emotional disturbances (SED) [19]. Six risk factors for high levels of depression were considered (as a combination of the age (12-14, 15-16, 17-18) factor and the group (LD and SED) factor).
Thus, there were a total of 6 levels for a single categorical predictor variable. In the text Epidemiological Research Methods [19], it is suggested that a logistic regression model is appropriate for these data. The estimated odds ratios are provided to ascertain any differences in the level of depression for the different groups. Simultaneous estimation of the odds ratios from the logistic regression model should be considered here, since it is reasonable to make conclusions about the group with the highest or lowest odds of high levels of depression. As suggested in the text [19], the group with the lowest risk of high levels of depression (17-18, SED) was the referent or control category. The reference coding takes this into account so that every log odds ratio refers back to that baseline category. The estimated log odds ratios for the 5 estimated slope coefficients are given in Table 5.2. Note that β_age,condition and θ_age,condition refer to the parameter or odds ratio, respectively, for an adolescent of a particular age group and condition. Additionally, Table 5.2 contains the estimated model odds ratios comparing the odds of high levels of depression for each risk category with reference to the 17-18, SED category, along with the individual and restricted-Scheffé (PC) simultaneous intervals. All point estimates utilize the pMLE-estimated model. Note that the 95% individual confidence intervals demonstrate an odds ratio significantly different from 1 for the case where the adolescent was aged 12-14 and seriously emotionally disturbed. Once simultaneous adjustments are made, the significant association no longer exists. This is an example of a case where the one-at-a-time intervals and simultaneous intervals would contradict, and care should be taken about which methodology is appropriate.
For instance, in order to say that the odds of high levels of depression are highest among the group aged 12 to 14 with serious emotional disturbances, simultaneous intervals would need to be computed. Clearly, that conclusion should not be made for this study, since it is not supported by the simultaneous intervals.

Table 5.2: Intervals for Odds Ratio for Depression in Adolescents

  Point Estimate           95% MLE Individual          95% pMLE PC Simultaneous
  for Odds Ratio           Confidence Interval for θ   Confidence Interval for θ
  θ̂12-14,LD  = 1.939       (0.812, 4.627)              (0.384, 8.926)
  θ̂12-14,SED = 4.683       (1.642, 13.383)             (0.671, 29.874)
  θ̂15-16,LD  = 1.616       (0.651, 4.102)              (0.300, 8.053)
  θ̂15-16,SED = 1.458       (0.520, 4.088)              (0.222, 9.189)
  θ̂17-18,LD  = 1.844       (0.696, 4.884)              (0.307, 10.381)
  θ̂17-18,SED = 1.00        (reference)

5.2 Overall Conclusions

The proposed pMLE-based intervals, including the pScheffé, pMLE restricted-Scheffé, and pSCR1 intervals, did improve small-sample estimation over the usual MLE-based intervals. As demonstrated, the usual MLE-based intervals often could not attain the desired level of confidence at what was considered a small sample size for the varying models. In contrast, the pMLE intervals did attain the desired level of confidence. However, the penalty of using the pMLE-based intervals is that they are very conservative at these small sample sizes. As the sample size increases to moderate levels, the distinction between the MLE- and pMLE-based intervals lessens, but the pMLE intervals, as a whole, tend to be less conservative than the MLE-based intervals.

5.2.1 Recommendations for Estimating the Mean Response

When selecting an interval method for a set of comparisons, attention should be paid to what the set of predictor variables is. For instance, when there are many comparisons and interest is focused on the entire estimation space, the Scheffé intervals would ensure the desired level of confidence.
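The contradiction between the one-at-a-time and simultaneous conclusions can be seen by scanning which intervals in Table 5.2 exclude an odds ratio of 1; the values below are copied from the table.

```python
# (age group, condition): (individual MLE interval, simultaneous pMLE PC interval)
intervals = {
    ("12-14", "LD"):  ((0.812, 4.627),  (0.384, 8.926)),
    ("12-14", "SED"): ((1.642, 13.383), (0.671, 29.874)),
    ("15-16", "LD"):  ((0.651, 4.102),  (0.300, 8.053)),
    ("15-16", "SED"): ((0.520, 4.088),  (0.222, 9.189)),
    ("17-18", "LD"):  ((0.696, 4.884),  (0.307, 10.381)),
}

def excludes_one(interval):
    """An odds ratio differs significantly from 1 at the interval's
    confidence level exactly when the interval does not contain 1."""
    lo, hi = interval
    return hi < 1.0 or lo > 1.0

sig_individual = [k for k, (ind, _) in intervals.items() if excludes_one(ind)]
sig_simultaneous = [k for k, (_, sim) in intervals.items() if excludes_one(sim)]
print(sig_individual)    # prints: [('12-14', 'SED')]
print(sig_simultaneous)  # prints: []
```

Only the individual interval for the 12-14, SED group excludes 1; no simultaneous interval does, which is why no group can be declared to have the highest odds of high depression.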
However, if a restricted interval of the predictor variable is of interest, or a finite number of discrete points embedded in a continuous domain is of interest, then the restricted-Scheffé or one of the SCR methods, respectively, is most advantageous. In summary, the recommendation is to use pMLE intervals for all small to moderate sample sizes. Specifically, the pMLE SCR1 intervals were the least conservative of all the pMLE-based intervals and are thus the best choice. However, the restricted-Scheffé pMLE intervals are a good second choice, offering far greater computational ease. Then, for moderate and larger sample sizes, the pMLE-based intervals are again recommended, though the behavior of these intervals has not been studied for any sample size greater than 50. Again, the SCR1 pMLE interval is the overall best choice, though even less distinction may be made between the pMLE intervals at the larger sample sizes.

5.2.2 Recommendations for Estimating the Parameters

Overall, for small to moderate sample sizes, the SCR1 pMLE interval attains the desired level of confidence while not being as conservative as the PC pMLE interval. Thus, for most cases where the sample size is greater than 50, I recommend utilizing a naive tube critical point with the pMLE estimators. For sample sizes smaller than 50, the cautious choice would be the PC pMLE interval. This is a slightly conservative interval, yet it attains the desired level of confidence, unlike any competing interval. I would not recommend using the MLE-based intervals at smaller sample sizes. If the sample size is moderate to large (n = 200 or 300), there is little observed difference between the pMLE and MLE intervals, and thus, for convenience, the MLE intervals may justifiably be utilized.

5.3 Future Research

There are many avenues related to this research topic to explore in the future. Some of these include performing in-depth simulations of these methods for more complex GLMs.
Namely, if a GLM has a predictor matrix X that is a mix of categorical and continuous predictors, then nothing is known about how the aforementioned simultaneous procedures would behave. There are a host of other issues to explore as well when we have such a predictor matrix. For instance, what multiple comparison techniques are applicable, what should we compare, and how do we adjust for the other variables in the model? Additionally, for the single-predictor case, other configurations of the contrast matrix C should be considered. For example, these other forms of C could be utilized to assess how the methods perform for all-pairwise types of comparisons. Also, GLMs with interaction and quadratic terms need to be explored.
Title: Simultaneous Inference in Generalized Linear Model Settings
Date: 2007-07-01
Author: Wagler, Amy Elizabeth
Keywords: Generalized Linear Models; Simultaneous Inference; Multiple Comparisons
Department: Statistics
Full Text Type: Open Access
Abstract: Generalized Linear Models (GLMs) are utilized in a variety of statistical applications. Many times the estimated quantities from the models are of primary interest. These estimated quantities may include the mean response, odds ratio, relative risk, or attributable proportion. In these cases overall conclusions about these quantities may be desirable. Currently few sophisticated methods exist to simultaneously estimate these quantities from a GLM. I propose several methods of estimating these quantities simultaneously and compare them to the existing methods. Intervals for the expected response of the GLM and any set of linear combinations of the GLM are explored. Most existing methods emphasize the simultaneous estimation of the expected response; few consider estimation of the sets of regression parameters, and hence quantities such as the odds ratio or relative risk. Additionally, almost all intervals employ maximum likelihood estimators (MLEs) for the model parameters. MLEs are often biased estimators for GLMs, particularly at small sample sizes. Thus, another set of intervals is proposed that utilizes an alternative estimator for the parameters, the penalized maximum likelihood estimator (pMLE). This estimator is very similar to the usual MLE, but it is shifted in order to account for the bias typically present in the MLE for GLMs. Various critical values of the simultaneous intervals are explored for both the MLE- and pMLE-based intervals. Emphasis is placed on scenarios where the sample size is small relative to the number of parameters being estimated. Simulation studies compare the various intervals and suggest general recommendations. The proposed pMLE-based intervals exhibit superior performance, particularly at small and moderate sample sizes. While the usual MLE-based intervals typically do not attain the desired level of confidence at small sample sizes, the pMLE-based intervals do. Additionally, at moderate to large sample sizes the pMLE-based intervals are, in many cases, less conservative than the usual MLE-based intervals.
Note: Dissertation
Rights: © Oklahoma Agricultural and Mechanical Board of Regents
Transcript  SIMULTANEOUS INFERENCE IN GENERALIZED LINEAR MODEL SETTINGS By AMY WAGLER Bachelor of Science in Mathematics University of Texas of the Permian Basin Odessa, Texas, USA 1995 Master of Science in Statistics Oklahoma State University Stillwater, Oklahoma, USA 2003 Submitted to the Faculty of the Graduate College of Oklahoma State University in partial fulfillment of the requirements for the Degree of DOCTOR OF PHILOSOPHY July, 2007 COPYRIGHT c By AMY WAGLER July, 2007 SIMULTANEOUS INFERENCE IN GENERALIZED LINEAR MODEL SETTINGS Dissertation Approved: Dr. Melinda McCann Dissertation Advisor Dr. Stephanie Monks Dr. Mark Payton Dr. Wade Brorsen Dr. A. Gordon Emslie Dean of the Graduate College iii ACKNOWLEDGMENTS There are many I would like to thank for assistance, reassurance and inspiration during the time I worked on this dissertation. First, I would like to thank the entire Department of Statistics at Oklahoma State University including all of the faculty members, staff, and students. I appreciate your kindness and concern for me. Of the faculty members, I would specifically like to thank my committee members, Dr. Stephanie Monks, Dr. Mark Payton, and Dr. Wade Brorsen, for the time and effort they put into this research project. Most importantly, I would like to thank my advisor, Dr. Melinda McCann, for her support and encouragment, perseverance in research, and friendship. I was fortunate to have an advisor such as you. I would also like to thank Trecia Kippola, Tracy Morris, and Janae Nicholson who listened to my many complaints and were just good friends. The most emphatic thanks I reserve for my family, who are always loving and supportive. My Mom, Dad, and sisters for believing I could do this even when I didn’t believe. Most of all, I would like to thank Ron, my husband, and Olive, my daughter. When I needed help, encouragement, or a babysitter, my husband was always there. 
When I was frustrated and feeling down, a smile from Olive was all I needed to put things in perspective. Thank you and I love you both.

TABLE OF CONTENTS

1 INTRODUCTION
2 The GLM and Estimated Quantities
  2.1 The Expected Mean Response
  2.2 The Odds Ratio
  2.3 Other Quantities
  2.4 Interval Estimation from GLMs
    2.4.1 Logistic Regression Model
    2.4.2 Reference Coding for Logistic Regression
    2.4.3 Loglinear or Poisson Model
3 Inferences on Quantities Estimated from a GLM
  3.1 Inference on the Mean Response
    3.1.1 Previous Methods
    3.1.2 Proposed Methods
  3.2 Bounds for Simultaneously Estimating Functions of Model Parameters
    3.2.1 Previous Methods
    3.2.2 Proposed Methods
4 Simulations
  4.1 Expected Response Function Simulations
    4.1.1 Expected Response Function Simulation Results
  4.2 Functions of the Parameters Simulations
    4.2.1 Functions of the Parameters Simulation Results
5 CONCLUSIONS
  5.1 Application of Proposed Methods
    5.1.1 Diabetes among Pima Women
    5.1.2 Depression in Adolescents
  5.2 Overall Conclusions
    5.2.1 Recommendations for Estimating the Mean Response
    5.2.2 Recommendations for Estimating the Parameters
  5.3 Future Research
BIBLIOGRAPHY
A FIGURES
B Moments of Binomial and Poisson Random Variables
  B.1 The Binomial Random Variable
  B.2 The Poisson Random Variable
C Restricted Scheffé Methodology
D SCR Methodology
E Fortran Code: Piegorsch-Casella PMLE for the Mean Response
F Fortran Code: SCR PMLE for the Mean Response
G Fortran Code: Piegorsch-Casella PMLE for the Parameters
H Fortran Code: SCR PMLE for the Parameters

LIST OF TABLES

1.1 Maternal Stress Relative Risks and 95% Confidence Intervals
2.1 Sample Contingency Table
2.2 Examples of Estimated Odds Ratios
5.1 Intervals for Proportion of Pima Women with Diabetes
5.2 Intervals for Odds Ratio for Depression in Adolescents

LIST OF FIGURES

4.1 MLE and PMLE SCR Intervals with wide range
4.2 pMLE Intervals Logit-Wide Range B=1
4.3 MLE and PMLE SCR Intervals with wide range
4.4 MLE and PMLE SCR Intervals with wide range
4.5 MLE and PMLE SCR Intervals with wide range
4.6 MLE and PMLE SCR Intervals with narrow range
4.7 MLE and PMLE SCR Intervals with narrow range
4.8 MLE and PMLE SCR Intervals with narrow range
4.9 MLE and PMLE SCR Intervals with narrow range
4.10 MLE and PMLE SCR Intervals with narrow range
4.11 MLE and PMLE SCR Intervals with narrow range
4.12 MLE and PMLE SCR Intervals with narrow range
4.13 MLE and PMLE SCR Intervals with narrow range
4.14 MLE and PMLE SCR Intervals with narrow range
4.15 MLE and PMLE SCR Intervals with narrow range
4.16 MLE and PMLE SCR Intervals for the Parameters
4.17 MLE and PMLE SCR Intervals for the Parameters
A.1 MLE Intervals Logit-Wide Range B=1
A.2 MLE Intervals Logit-Wide Range B=2
A.3 MLE Intervals Logit-Wide Range B=3
A.4 MLE Intervals Logit-Wide Range B=4
A.5 MLE Intervals Logit-Narrow Range B=1
A.6 MLE Intervals Logit-Narrow Range B=2
A.7 MLE Intervals Logit-Narrow Range B=3
A.8 MLE Intervals Logit-Narrow Range B=4
A.9 MLE Intervals Poisson-Wide Range B=1
A.10 MLE Intervals Poisson-Wide Range B=2
A.11 MLE Intervals Poisson-Wide Range B=3
A.12 MLE Intervals Poisson-Narrow Range B=1
A.13 MLE Intervals Poisson-Narrow Range B=2
A.14 MLE Intervals Poisson-Narrow Range B=3
A.15 pMLE Intervals Logit-Wide Range B=2
A.16 pMLE Intervals Logit-Wide Range B=3
A.17 pMLE Intervals Logit-Wide Range B=4
A.18 pMLE Intervals Logit-Narrow Range B=1
A.19 pMLE Intervals Logit-Narrow Range B=2
A.20 pMLE Intervals Logit-Narrow Range B=3
A.21 pMLE Intervals Logit-Narrow Range B=4
A.22 pMLE Intervals Poisson-Wide Range B=1
A.23 pMLE Intervals Poisson-Wide Range B=2
A.24 pMLE Intervals Poisson-Wide Range B=3
A.25 pMLE Intervals Poisson-Narrow Range B=1
A.26 pMLE Intervals Poisson-Narrow Range B=2
A.27 pMLE Intervals Poisson-Narrow Range B=3
A.28 MLE Intervals for the Parameters - Logit
A.29 MLE Intervals for the Parameters - Poisson
A.30 pMLE Intervals for the Parameters - Logit
A.31 pMLE Intervals for the Parameters - Poisson

CHAPTER 1

INTRODUCTION

Modern epidemiological and medical research routinely employs generalized linear modeling. These models can be helpful in understanding what behaviors or traits can influence the incidence of a particular disease or characteristic. For example, logistic regression models provide a means of relating the incidence of some trait or disease to a set of possible predictor variables, while loglinear models help us understand associations between a trait and predictor variables. After building a generalized linear model (GLM), one typically wishes to estimate particular quantities of interest such as response probabilities, odds ratios, or relative risks. Customarily, these are reported via confidence intervals or confidence bounds using some prespecified level of significance for each inference. For example, using a loglinear model, one could report 100(1 − α)% confidence intervals for each relative risk resulting from the model. Using one-at-a-time intervals is appropriate when the investigators are not making overall conclusions about the quantities of interest.
For example, if the aforementioned loglinear model was estimated and 100(1 − α)% confidence intervals for the relative risks were reported, conclusions about each individual relative risk could be made, but any statements comparing these relative risks would inflate the assumed α error rate. Research on simultaneous estimation procedures for quantities from generalized linear models has received little attention beyond very routine treatments, such as making Bonferroni adjustments to the usual confidence intervals or constructing Scheffé intervals. However, recent advances made in simultaneous inference for linear models may be applied in the generalized linear model setting. Additionally, further improvements may be made by utilizing some unique properties of the estimated parameters from generalized linear models. I plan to present the justification for employing these simultaneous inference methods in the generalized linear model setting and, via simulation, compare their performance. Thus, the objective of this study is to develop simultaneous interval-based procedures that will estimate functions of linear combinations of the parameters of a generalized linear model. Specifically, this includes simultaneously estimating the expected response function, odds ratios, and relative risks from generalized linear models. Overall, attention is focused on quantities that are estimated from a GLM, not the estimation of the GLM itself. Obviously, the performance of any of these procedures will be influenced by how well the model is estimated, but this dissertation will assume the model is well estimated. Additionally, all of the procedures developed in this paper involve constructing interval estimates. Often, procedures that account for multiplicity employ hypothesis tests to make overall conclusions about a set of data.
It is more appropriate in the applications I will discuss to use simultaneous intervals instead of stepwise procedures, since I wish to not only detect statistical differences between quantities, but also to assess the practical significance of these differences. Thus, all methods discussed are interval-based procedures. Before presenting the details of generalized linear models, a practical example of the implementation of a GLM may provide a frame of reference. A 2003 study from the American Journal of Epidemiology explored the relationship between maternal stress and preterm birth [1]. Several previously identified sources of maternal stress, such as high incidence of life events, increased anxiety, living in a dangerous neighborhood, and increased perception of stress, were explored for any association with preterm births. Specifically, the study focused on predicting the prevalence of preterm birth among pregnant women aged 16 or older from two prenatal clinics in central North Carolina. Upon admission to the study, women were asked to complete questionnaires in addition to completing a psychological instrument. Also, several blood, urine, and genital tract tests were conducted in order to assess the physical health of the candidates. In all, 2,029 women were eligible, recruited, and completed the preliminary tests in order to participate in the study. Of these participants, 231 delivered preterm, less than 37 weeks gestation. A loglinear model was employed to assess the relationship between the sources and levels of maternal stress and preterm birth. As a result, the authors considered the resulting model relative risks for each individual stress factor or level of a stress factor and its association with preterm birth.

Table 1.1: Maternal Stress Relative Risks and 95% Confidence Intervals

Life Events Stress    RR     95% CI
No Stress             1.00
Med-Low Stress        1.5    (1.0, 2.2)
Med-High Stress       1.4    (0.9, 2.1)
High Stress           1.8    (1.2, 2.7)
Additionally, 95% confidence intervals were computed for each relative risk. The relative risk will be discussed in detail later, but note that for this study each relative risk is the risk of preterm birth for an individual with one particular maternal stress factor relative to the risk of preterm birth for an individual with none of the other identified sources of maternal stress present. Thus a large relative risk for a particular source of maternal stress indicates a strong association between that stress factor and preterm birth. In general, other quantities derived from the generalized linear model may also be of interest. Table 1.1 contains results from the preterm birth study discussed, though these results have been simplified from the actual implemented model for ease of presentation. In particular, the relative risks for different levels of stress due to general life events are presented. As previously discussed, note that the 95% confidence intervals for each relative risk presented in Table 1.1 estimate the risk of preterm birth for each individual subject to a particular level of a maternal stress factor (life event stress) with reference to the control (the case with no identified source of maternal stress). For this scenario, one-at-a-time inferences are reasonable if the researcher wishes to answer questions such as, "how does the presence of one source of maternal stress affect the risk of preterm birth?" Note that this question is only concerned with the presence of a particular stress factor and how it affects the incidence of preterm births. If one wishes to make any overall conclusions comparing how the multiple sources of maternal stress affect the risk of preterm birth, then another estimation procedure that accounts for multiplicity needs to be implemented.
For instance, in the preterm birth study, the researchers reported the above relative risks and confidence intervals, and then remarked that "(w)omen in the highest negative life events impact quartile had the highest risk (RR=1.8, 95% CI: 1.2, 2.7); however, the middle categories did not show increasing risk with increasing measures of stress." This kind of conclusion is inappropriate given that the researchers only computed one-at-a-time 95% confidence intervals for the relative risks. Therefore, this is a case that would benefit from simultaneous inference on the relative risks. As another example where simultaneous inference would be appropriate, consider the case where the researcher wishes to identify the set of the sources and levels of maternal stress that are significantly associated with preterm birth. In this case the one-at-a-time intervals are again inappropriate. In order to make this kind of conclusion, the researcher needs to determine which groups of relative risks are significantly different from 1. If the researcher additionally wants to determine the practical significance of the differences between the varying sources and levels of maternal stress and the control, then confidence intervals with a multiplicity adjustment are required. Stepwise procedures are not adequate as they only determine where statistically significant differences exist, but do not provide a way to estimate the scale of these differences. Additionally, it is often desirable to make conclusions such as "if the subject has one level of a predictor variable, then he is twice as likely to have the disease as if he has any other level of that predictor variable". Many other examples of similar conclusions could be given, but generally, these conclusions are comparing one parameter to another and the desired outcome is to somehow relate these parameters.
Thus, if one wishes to make any comparisons of these parameters, it is desirable to control the overall type I error rate by accounting for the multiplicity of inferences. In the following chapters, I will present the motivation for simultaneous inference of certain parameters and outline both the current methodologies and my proposed methodologies. Additionally, the simulation results of the proposed methods are presented and analyzed. Specifically, in chapter two, I present the generalized linear model and the typical quantities that are estimated from the model, and discuss why simultaneous inference of these quantities is essential in some situations. Additionally, I review some methods for computing one-at-a-time confidence intervals on various quantities resulting from generalized linear models. In chapter three, I outline the current methodologies used for simultaneously estimating various functions resulting from generalized linear models, and I propose four new methods to estimate these parameters from GLMs. In chapter four, I summarize how I evaluated these new methods using simulation and present the simulation results. Finally, I propose some future research questions regarding simultaneous estimation of a GLM and present some applications of the new methods in the concluding chapter.

CHAPTER 2

The GLM and Estimated Quantities

There are several generalized linear models (GLMs) that permit estimates, such as the odds ratio or relative risk, where multiplicity adjustments often seem warranted. Some of these models include the logistic regression model, loglinear model, Poisson regression model, and the probit or complementary log-log model. In general, a GLM can be expressed as

$Y_i = g^{-1}(x_i'\beta + \epsilon_i), \quad i = 1, \ldots, n$   (2.1)

or alternatively,

$\phi_i = g(E(Y_i \mid x_i)) = x_i'\beta, \quad i = 1, \ldots, n$   (2.2)

where $g$ links the expected response, $E(Y_i \mid x_i)$, to $\phi_i$, with $x_i$ the vector of covariates corresponding to $Y_i$, $\beta$ the $k \times 1$ vector of regression parameters, and $\epsilon_i$ independently and identically distributed random variables. In the later sections, $Y = (Y_1, \ldots, Y_n)'$ is the vector of responses and $X = (x_1, \ldots, x_n)'$ is the full rank matrix of predictor variables. Each GLM corresponds to a particular link function $g$, typically called the canonical link when it transforms the mean to the natural parameter. In general, the maximum likelihood estimate (MLE) of the regression parameters is denoted $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_k)$. The MLE is asymptotically multivariate normal with mean $\beta$ and covariance matrix

$V = (X'WX)^{-1}$   (2.3)

where $W$ is a diagonal matrix with diagonal elements $w_i = (\partial\mu_i/\partial\phi_i)^2/\mathrm{var}(Y_i)$ for $\mu_i = E(Y_i \mid x_i)$ and $\phi_i$ in (2.2). Thus,

$\hat\beta \mathrel{\dot\sim} N_k(\beta, V)$.   (2.4)

We can estimate the covariance matrix, $V$, by

$\hat V = \widehat{\mathrm{cov}}(\hat\beta) = (X'\hat W X)^{-1}$   (2.5)

with $\hat W$ equal to $W$ evaluated at $\beta = \hat\beta$. Further results will require the estimated standard deviation of $x_i'\hat\beta$ for a given $x_i$ vector with $k$ known elements. This is given by

$\hat\sigma_{GLM}(x_i) = \sqrt{x_i' \hat V x_i}, \quad i = 1, \ldots, n$.   (2.6)

At times I will need to refer to a linear model in this proposal. A linear model is generally given by

$Y_i = x_i'\theta + \epsilon_i, \quad i = 1, \ldots, n$   (2.7)

or alternatively,

$E(Y) = X\theta$   (2.8)

where $\theta$ is the vector of regression parameters, $Y$ and $X$ are as previously defined, and $\epsilon_i \sim \text{iid } N(0, \sigma^2_{LM})$. The MLEs for $\theta = (\theta_1, \ldots, \theta_k)$ are denoted $\hat\theta$ and

$\hat\theta \sim N_k(\theta, \sigma^2_{LM} F)$   (2.9)

where $\sigma^2_{LM}$ is the variance of the model residuals for a linear model and $F = (X'X)^{-1}$. Whenever a linear model is referenced in this paper, the notation presented above will be utilized. As discussed previously, the objective of this research is not to merely estimate a GLM, but to simultaneously estimate quantities derived from a GLM. There are many natural quantities that can be of interest when modeling data with a GLM.
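To make the MLE and its estimated covariance concrete, the following is a minimal sketch (in Python rather than the Fortran used in Appendices E-H; the function name and the test data are mine) that fits a binomial-logit GLM by Fisher scoring and returns $\hat\beta$ together with $\hat V = (X'\hat W X)^{-1}$ as in (2.5), using the logistic weights $w_i = m_i\pi(x_i)(1-\pi(x_i))$ discussed later in Section 2.4.1:

```python
import numpy as np

def fit_logistic_glm(X, y, m, tol=1e-10, max_iter=50):
    """Fit a binomial-logit GLM by Fisher scoring (IRLS).

    X : (n, k) full-rank design matrix
    y : (n,) observed success counts
    m : (n,) known binomial sample sizes

    Returns the MLE beta_hat and V_hat = (X' W_hat X)^{-1},
    the estimated covariance matrix of beta_hat as in (2.5).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted response probabilities
        w = m * pi * (1.0 - pi)                # logistic GLM weights w_i
        # Newton/Fisher-scoring step: (X'WX)^{-1} times the score X'(y - m*pi)
        step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - m * pi))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Recompute the weights at the converged beta to form V_hat
    pi = 1.0 / (1.0 + np.exp(-X @ beta))
    w = m * pi * (1.0 - pi)
    V_hat = np.linalg.inv(X.T @ (w[:, None] * X))
    return beta, V_hat
```

For an intercept-only model, this reproduces the closed-form answers $\hat\beta_1 = \mathrm{logit}(\sum y_i / \sum m_i)$ and $\hat V = 1/\sum m_i\hat\pi(1-\hat\pi)$, which is a convenient correctness check.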
These include measures such as the expected mean response, the odds ratio, the relative risk, and possibly others. While the focus of this paper is on the estimation of these quantities from GLMs, it should be mentioned that they may be estimated directly from the data. I will first present these measures in general, and then discuss them specifically in the context of generalized linear models.

2.1 The Expected Mean Response

The expected mean response is a generic term for either the response probability or the mean response. Depending on the sampling distribution of the data, either one or the other is of interest. For example, if we assume binomial sampling, the expected mean response function is the probability of a success for a given level of the predictor variables, or the response probability. When a Poisson sampling scheme is assumed, the expected mean response is the average for a particular cell in the contingency table, or the mean response. A response probability is the proper quantity of interest if one wishes to understand the probability of developing a disease or another characteristic for a given set of predictor variables that are believed to be associated with the disease. For example, a response probability could be used to inform a particular patient of their probability of developing a particular disease given their history and profile. With respect to the preterm birth example, if a doctor has a patient known to be experiencing a major life event, such as a death in the family, then she could ascertain that patient's risk of preterm birth and take appropriate measures. Conversely, the mean response might be used in a situation where a clinician has recorded a host of risk factors for a particular disease and wishes to predict how many of the subjects will develop the disease. This communicates how many patients on average will or will not develop a certain characteristic.
In the context of the preterm birth example, the Poisson mean response could be used to estimate how many subjects out of the total sample size will experience a preterm birth. The Poisson mean response is simply estimated directly from the frequencies given in a contingency table when it is not estimated from a model. In general, a one-at-a-time 100(1 − α)% confidence interval for the response probability is given by

$\hat\pi_i \pm z_{\alpha/2}(\widehat{\mathrm{var}}(\hat\pi_i))^{1/2}$   (2.10)

where $\mathrm{var}(\hat\pi_i) = \pi_i(1 - \pi_i)/m_i$ and $\widehat{\mathrm{var}}(\hat\pi_i) = \hat\pi_i(1 - \hat\pi_i)/m_i$. (Note that confidence intervals on the mean response for Poisson sampling distribution models are not typically computed.) This one-at-a-time confidence interval for $\pi_i$ is often employed to estimate an expected mean response for binomial or multinomial sampling scenarios. If the researcher simply wants to know how a particular level of the predictor variables affects the incidence of disease, this is all that needs to be calculated. First, consider the case where the predictor variable is categorical, as in the preterm birth example. Suppose a clinician wishes to estimate the risk of preterm birth for a particular patient in her clinic. Then the one-at-a-time interval would be adequate. Alternatively, consider a scenario where a researcher wants to make some kind of overall conclusion about the relationship between all the sources and levels of maternal stress and preterm birth. For example, suppose the researcher wishes to compare the risk of preterm birth for all the maternal stress factors and their levels. In order to simultaneously estimate these differences, the researcher would need to employ some kind of procedure that accounts for the multiple inferences being made. Additionally, it would be of practical interest to identify a group of maternal stress factors and levels that are "most associated" with preterm birth and another group of maternal stress factors and levels that are "least associated" with preterm births.
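The one-at-a-time interval (2.10) is simple enough to sketch directly; this is a minimal illustration (function name mine), shown with the study's overall counts (231 preterm deliveries out of 2,029 participants) purely as illustrative inputs:

```python
import math
from statistics import NormalDist

def proportion_wald_ci(successes, m, alpha=0.05):
    """One-at-a-time 100(1 - alpha)% interval (2.10) for a response
    probability: pi_hat +/- z_{alpha/2} * sqrt(pi_hat * (1 - pi_hat) / m)."""
    pi_hat = successes / m
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)          # z_{alpha/2}
    half = z * math.sqrt(pi_hat * (1.0 - pi_hat) / m)
    return pi_hat - half, pi_hat + half
```

With the counts above, the interval comes out to roughly (0.100, 0.128) around $\hat\pi \approx 0.114$.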
Identifying such groups would likewise necessitate a procedure that adjusts for multiplicity while also providing interval estimates of the response probabilities for each stress factor. Finally, consider the case where the researcher would want to compare the probability of preterm birth for each maternal stress factor or level with the probability of preterm birth for a control or reference level. In this example, the reasonable reference level would be subjects that have no identified sources of maternal stress. Again the appropriate procedure would adjust for multiplicity. Now consider the case where the predictor variable is continuous. Often, with a continuous predictor variable, a specific range of the domain is of particular interest. For example, if we added a continuous measure of each patient's prepregnancy body mass index (BMI) in the preterm birth study, we might have particular interest in BMIs less than 19.8 (underweight), 19.8 to 26.0 (normal weight), 26.0 to 29.0 (overweight), and over 29.0 (obese). It may be of interest to compare the expected number of preterm birth cases for subjects within these BMI groups within a particular maternal stress factor group. In order to make conclusions such as "the obese patients have the largest number of preterm birth cases," interval estimates need to be used that account for the multiple inferences being made. The methods I propose will adjust for this kind of multiplicity.

2.2 The Odds Ratio

The odds ratio is a widely used measure in epidemiological and medical applications. The odds ratio is generally defined as the ratio of the odds of a characteristic (or disease) occurring in one group to the odds of it occurring in another group. With reference to the preterm birth study, odds ratios could have been computed that would estimate the relative odds of preterm birth for a particular level of a maternal stress factor to the odds for those mothers with no identifiable stress factors.
Thus, an odds ratio of 2.11 for mothers who live in a neighborhood perceived to be dangerous would be interpreted as: the odds of delivering a preterm infant when living in a neighborhood that is perceived to be dangerous are 2.11 times greater than the odds of having a preterm infant when a subject is not exposed to any identifiable sources of maternal stress.

Table 2.1: Sample Contingency Table

        X = x1   X = x2
Y = 1   a        b        m1
Y = 0   c        d        m2
        n1       n2       n

The sample odds ratio can easily be computed from the raw data and is given by $\hat\eta = \frac{ad}{bc}$ for counts as given in Table 2.1, irrespective of which sampling model (binomial, multinomial, or Poisson) is assumed for the cell counts. For large samples, again under all sampling models, the log odds ratio $\log(\hat\eta)$ is asymptotically normal with mean $\log(\eta)$ and estimated standard error

$\hat\sigma_{\log(\hat\eta)} = \left(\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}\right)^{1/2}$.

Thus, a 100(1 − α)% one-at-a-time large sample confidence interval for the log odds ratio is given by

$\log(\hat\eta) \pm z_{\alpha/2}\,\hat\sigma_{\log(\hat\eta)}$.   (2.11)

Exponentiating the lower and upper bounds of this interval yields confidence bounds for the odds ratio. It is common to see one-at-a-time confidence intervals for odds ratios reported along with their point estimates. Suppose a researcher wants to report the estimated odds ratio for a particular patient profile with confidence limits. For example, in the preterm birth example, she may want to report the odds ratio of preterm birth for those exposed to a particular maternal stress factor compared to those with no identifiable maternal stress factors. In this case, the one-at-a-time intervals are appropriate. Alternatively, consider the case where the researcher wants to identify which, if any, of the maternal stress factors or levels of a stress factor are statistically associated with a preterm birth, or to identify a set of stress factors or levels of a stress factor whose association with preterm birth is larger than that for no stress factors present.
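As an aside, the one-at-a-time odds ratio interval obtained by exponentiating (2.11) can be sketched as follows (a minimal illustration; the function name and the balanced test table are mine):

```python
import math
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, alpha=0.05):
    """Sample odds ratio eta_hat = ad/(bc) from a 2x2 table laid out as
    Table 2.1, with the large-sample 100(1 - alpha)% interval obtained by
    exponentiating the log-scale bounds (2.11)."""
    eta_hat = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)        # sigma_hat for log(eta_hat)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    lo = math.exp(math.log(eta_hat) - z * se)
    hi = math.exp(math.log(eta_hat) + z * se)
    return eta_hat, lo, hi
```

A balanced table (a = b = c = d) gives $\hat\eta = 1$ with bounds symmetric about 1 on the log scale, a quick sanity check on the implementation.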
One-at-a-time intervals will not suffice for these kinds of questions, as there are multiple inferences being made. In order to control the type I error rate, a simultaneous estimation procedure should be utilized. Additionally, the researcher may wish to compare the odds of preterm birth for any maternal stress factor to the odds of preterm birth for all other sources of maternal stress. In order to do this, a multiple comparison procedure must also be utilized, as the researcher actually wants to compare the probabilities of preterm birth across all the possible sources of maternal stress. It is tempting to make overall conclusions about the odds ratios when reporting estimated odds ratios via one-at-a-time confidence intervals. However, the error rate associated with these overall conclusions based on multiple one-at-a-time intervals is not controlled, or even known. In this case, a method that simultaneously estimates the parameters is warranted.

2.3 Other Quantities

Another quantity frequently reported is the relative risk. The relative risk communicates the risk of developing a disease at one level of the predictor variable relative to another level of the predictor variable. An example of a study employing relative risks is the preterm birth study. For counts as in Table 2.1, the estimated relative risk is given by $\hat\gamma = \frac{a/n_1}{b/n_2}$. Suppose a researcher reports that the estimated relative risk of preterm birth is 1.75, given the subject lives in a neighborhood perceived to be dangerous. Thus the proportion of those that experience a preterm birth among those that live in the dangerous neighborhood is estimated to be 1.75 times the proportion of those who experience a preterm birth among those with no identified sources of stress.
Again, the log scale is often utilized, and large sample derivations show that the log of the sample relative risk, $\log(\hat\gamma)$, is asymptotically normal with mean $\log(\gamma)$ and estimated standard error

$\hat\sigma_{\log(\hat\gamma)} = \left(\frac{1 - \hat\pi_1}{\hat\pi_1 n_1} + \frac{1 - \hat\pi_2}{\hat\pi_2 n_2}\right)^{1/2}$

where $\hat\pi_i$ for $i = 1, 2$ is the estimated probability of disease among those in group $i$ and $n_i$ is the sample size for group $i = 1, 2$. Thus, the 100(1 − α)% confidence interval for the log relative risk is

$\log(\hat\gamma) \pm z_{\alpha/2}\,\hat\sigma_{\log(\hat\gamma)}$.   (2.12)

These resulting bounds may be exponentiated in order to obtain confidence limits on the relative risk. It is sufficient to report an estimated relative risk via a one-at-a-time confidence interval when a researcher only needs to understand how one level of the predictor variable affects the incidence of disease. The preterm birth study reported relative risks and the associated confidence intervals; thus only individual inferences about each source of maternal stress or level of a maternal stress factor relative to the case with no source of stress can be made. However, suppose a researcher wishes to pick out which risk factor or level of a risk factor contributes most to a disease or condition, or to obtain ranking information for the sources and levels of maternal stress with respect to risk of disease or condition. As estimation is still also of interest, multiplicity adjustments need to be made to the confidence intervals. In addition to the relative risk, other quantities should be considered as well. For example, many epidemiological researchers find the attributable proportion a useful measure. Suppose we have a disease and several risk factors for that disease. Then the attributable proportion is the probability that a diseased individual with the given risk factor has the disease because of that risk factor [2]. This is of interest when there are multiple risk factors for a disease. Thus, this measure is of particular interest in case-control studies, where the incidence of disease is related to several risk factors, as it allows the researcher to understand how much the disease could be reduced by eliminating a particular risk factor.
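Returning briefly to the relative risk: the exponentiated form of interval (2.12) can be sketched in the same style as the earlier intervals (a minimal illustration; the function name and test counts are mine):

```python
import math
from statistics import NormalDist

def relative_risk_ci(a, n1, b, n2, alpha=0.05):
    """Sample relative risk gamma_hat = (a/n1)/(b/n2), with counts as in
    Table 2.1, and the 100(1 - alpha)% interval obtained by exponentiating
    the log-scale bounds (2.12)."""
    p1, p2 = a / n1, b / n2                      # pi_hat_1, pi_hat_2
    gamma_hat = p1 / p2
    se = math.sqrt((1 - p1) / (p1 * n1) + (1 - p2) / (p2 * n2))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    lo = math.exp(math.log(gamma_hat) - z * se)
    hi = math.exp(math.log(gamma_hat) + z * se)
    return gamma_hat, lo, hi
```

Because the interval is symmetric about $\log(\hat\gamma)$ on the log scale, the product of the exponentiated bounds equals $\hat\gamma^2$, another easy correctness check.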
Thus, this measure is of particular interest in casecontrol studies where the incidence of disease is related to several risk factors as it allows the researcher to understand how much the disease could be reduced by eliminating a particular risk factor. Oneatatime confidence intervals can also be utilized to estimate the attributable proportions. Modelbased confidence interval formulas can be computed on the usual 13 attributable proportion. Often transformations of the relative risk are utilized to compute bounds on the attributable proportion since they can be more efficient asymptotically. However, the MLEbased interval performs adequately [3] and is more easily adjusted for simultaneous inference in the sequel. Thus, a oneatatime confidence interval for the attributable proportion, denoted κ, is given by, κˆ ± z /2 × vˆar(κˆ)1/2 (2.13) where z /2 is the z critical value that gives 100(1 − α)% confidence. (Details for computing vˆar(κˆ) are given in [4].) Again, this interval is all that is required in many applications. However, if the researcher wishes to compare the attributable proportions for a group of risk factors, an adjustment for multiplicity would be necessary. This might be necessary if, for example, one wished to understand which risk factor should be focused on most for prevention of the disease. Here we would want to identify the largest attributable proportion and focus on disease prevention via reducing the effect of that risk factor. 2.4 Interval Estimation from GLMs Though we have introduced notation for both linear models and GLMs, the rest of this section focuses on the particular GLMs utilized to illustrate the results in this paper. While the methods derived apply to any GLM, particular attention will be devoted to the logistic and Poisson models due to their applicability and popularity. 2.4.1 Logistic Regression Model The logistic regression model is widely used in epidemiological and health science applications. 
The predictor variable in a logistic regression model can be either a single variable or a vector of variables. Thus, let xi = (xi1, xi2, . . . , xik) be a vector of predictor variables for i = 1, . . . , n where n is the total number of observations, and k 14 is the number of predictor variables. Thus, xi is the ith vector of predictor variables. Recall that for qualitative covariates, the xi’s would be defined as appropriate indicator variables. For example, in the preterm birth study xi could be the maternal stress vector of predictor variables with binary elements indicating the presence of a particular source or level of a source of maternal stress. Thus, if the ith case is a patient exposed only to dangerous neighborhoods as a source of maternal stress, we would code xi = (1, 1, 0, 0, 0, . . . , 0) where the first element has a 1 for the intercept term, the second place has a 1 to indicate the presence of stress in the form of a dangerous neighborhood, and the other elements of the vector are 0 indicating the patient was not exposed to the other sources of maternal stress. A logistic regression model assumes that the probability of a success for the ith observation is π(xi) where π(xi) = P[Yi = 1] = ex′ i 1 + ex′ i = e 1xi1+...+ kxik 1 + e 1xi1+...+ kxik , i = 1, . . . , n. (2.14) The matrix X, as previously defined, contains information relating to the predicted value, Y , for the model. Alternatively, we can express this model as φ(xi) = logit[π(xi)] = ln[ π(xi) 1 − π(xi) ] = x′ iβ, i = 1, . . . , n. (2.15) Now let the MLE of π(xi) be denoted ˆπ(xi). We will make the usual assumption that the Yi random variables are independent and binomially distributed with parameters mi (assumed known) and π(xi) given by (2.14), i = 1, . . . , n. Thus, W = diag[miπ(xi)(1 − π(xi))], i = 1 . . . , n and the asymptotic distribution of the MLE of β is given by (2.4). For ˆW = diag[miˆπ(xi)(1 − ˆπ(xi))], i = 1, . . . 
, n, the estimated covariance matrix of β̂ is given by (2.5). When it is assumed that a logistic regression model is appropriate, the typical quantities of interest are the coefficients of the regression model, or the log odds ratios, β_i, i = 1, . . . , k, and the response probabilities, π(x_i). These quantities relate to what was generally referred to as the expected response function. In particular, the expected response function for a logistic regression model is the response probability, since this model assumes binomial sampling. When considering response probabilities for single experimental units, one-at-a-time confidence intervals seem appropriate. (Confidence intervals provide additional information about the precision of the estimated response probability, so they are often preferable to point estimates.) An appropriate confidence interval on logit(π(x_i)) = x_i'β is computed by

x_i'β̂ ± z_{α/2} σ̂_GLM(x_i)   (2.16)

where z_{α/2} is the upper α/2 standard normal percentile and σ̂_GLM(x_i) is given by (2.6) with Ŵ as previously defined. Let the upper and lower limits of (2.16) be denoted by U_LOGIT and L_LOGIT, respectively. Then we can apply the antilogit and obtain bounds on the response probability. Thus, a 100(1 − α)% confidence interval for the response probability is given by

( exp(L_LOGIT) / [1 + exp(L_LOGIT)], exp(U_LOGIT) / [1 + exp(U_LOGIT)] ).   (2.17)

Another quantity of interest from a logistic regression model is the odds ratio. One-at-a-time large sample confidence intervals can easily be constructed on the log odds ratios, as they are linear functions of the k logistic regression coefficients, β. Thus we may utilize the asymptotic multivariate normal distribution of the maximum likelihood estimates of these k logistic regression coefficients, given by (2.4), to obtain large sample confidence intervals for the appropriate odds ratios. For illustration, suppose a particular odds ratio is given by exp(c_i'β) for c_i = (c_{i1}, . . . , c_{ik}), a vector of appropriate constants.
Then a one-at-a-time large sample confidence interval for this particular odds ratio is given by

( exp{c_i'β̂ − z_{α/2} σ̂_GLM(c_i)}, exp{c_i'β̂ + z_{α/2} σ̂_GLM(c_i)} )   (2.18)

where σ̂_GLM(c_i) is given by (2.6). Typically, in epidemiological applications, the logistic regression model employed for computing the model-based odds ratios utilizes reference coding. When reference coding, as explained below, is utilized, special care must be taken in interpreting the model-based odds ratios.

2.4.2 Reference Coding for Logistic Regression

If a logistic regression model is employed for a categorical predictor variable, the design coding typically used necessitates that one of the levels of x be a reference level. Then the odds ratios that result from the model coefficients are observed and compared to that reference level. Most often the reference level is a true control, but at times the reference level is arbitrary. When the reference level is informative, we may wish to: (1) estimate all the odds ratios relative to the reference level simultaneously, thereby allowing the researcher to assess the practical significance of any observed difference from the reference level while also providing the ability to evaluate which nonreference levels are significantly greater than or less than the reference level; and (2) make comparisons for a prespecified set of contrasts of the odds ratios. If the reference level is arbitrary, it seems reasonable to simultaneously compute all odds ratios or all odds ratio differences and then emulate one of the two scenarios described above. Again, if we wish to assess the practical significance of any estimates, we need to estimate these quantities simultaneously rather than utilize a stepwise procedure.
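The one-at-a-time interval (2.18) is straightforward to compute once β̂ and its estimated covariance are available. The sketch below is illustrative only: the coefficient values and the (diagonal) covariance matrix are hypothetical numbers, not output from any fit discussed in this paper.

```python
import numpy as np
from scipy.stats import norm

def odds_ratio_ci(c, beta_hat, V_hat, alpha=0.05):
    """One-at-a-time Wald interval exp{c'beta-hat +/- z_{alpha/2} sigma(c)}, as in (2.18)."""
    z = norm.ppf(1 - alpha / 2)
    log_or = c @ beta_hat                 # estimated log odds ratio c'beta-hat
    se = np.sqrt(c @ V_hat @ c)           # sigma_GLM(c), the SE of c'beta-hat
    return np.exp(log_or - z * se), np.exp(log_or + z * se)

# Hypothetical reference-coded fit with two nonreference levels:
beta_hat = np.array([0.50, 1.10])         # log odds ratios versus the reference level
V_hat = np.diag([0.04, 0.05])             # illustrative covariance of beta-hat

# Comparing nonreference level 2 to nonreference level 1 uses c = (-1, 1),
# i.e. the odds ratio exp(beta_2 - beta_1).
lo, hi = odds_ratio_ci(np.array([-1.0, 1.0]), beta_hat, V_hat)
```

The same function covers the comparisons in Table 2.2: a contrast vector with a single 1 gives an odds ratio against the reference level, while differences of indicators compare two nonreference levels.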
Note that for both above cases, when there is only one categorical predictor variable x, all inference procedures performed on the odds ratios resulting from the logistic regression model are equivalent to any similar analysis performed on the crude data in contingency table format. Differences will occur in models with multiple covariates. The standard method for computing the odds ratios resulting from a logistic regression model using reference coding for the design matrix is to exponentiate the appropriate linear combinations of the estimated regression coefficients. For example, if we have a logit model such as (2.15), where there are k levels for our single predictor variable, then we can utilize the explanatory variables x_1, . . . , x_{k−1} with the covariate vector at the reference level of our predictor variable equal to 0, that is, x_1 = . . . = x_{k−1} = 0. Thus, x_1, . . . , x_{k−1} would be defined as indicator variables for the k − 1 nonreference levels of our predictor variable. When this model is assumed, we can interpret e^{β̂_1} as follows: the odds that Y = 1 for the first nonreference level are e^{β̂_1} times the odds for the reference level. Table 2.2 illustrates other estimated odds ratios and their corresponding interpretations.

Table 2.2: Examples of Estimated Odds Ratios

e^{β̂_1}  —  the estimated odds ratio comparing the first nonreference level to the reference level
e^{β̂_2}  —  the estimated odds ratio comparing the second nonreference level to the reference level
...
e^{β̂_2 − β̂_1}  —  the estimated odds ratio comparing the second nonreference level to the first nonreference level
...
e^{β̂_k − β̂_{k−1}}  —  the estimated odds ratio comparing the kth nonreference level to the (k − 1)th nonreference level

The estimated odds ratios defined in Table 2.2 could then be utilized to construct confidence intervals that would aid in interpreting the model.

2.4.3 Loglinear or Poisson Model

Another model often employed in epidemiological studies is the loglinear model.
The loglinear model relates the counts of a Poisson or multinomial distribution to a set of covariates. It may assume the total sample size is random or fixed, depending on whether the model assumes Poisson or multinomial sampling, respectively. For an I × J contingency table let N = I × J. Note that the number of cells in a contingency table, N, is distinct from the sample size or number of observations, denoted n, although they can be equal. Whenever the number of observations, n, is fixed, we have multinomial sampling for Y_i, i = 1, . . . , N − 1. However, when the sample size n is not fixed, we usually assume Poisson sampling for Y_i, i = 1, . . . , N. For ease in notation let n* = N − 1 for multinomial sampling and n* = N for Poisson sampling. Then the loglinear model is

log(μ(x_i)) = x_i'β, i = 1, . . . , n*   (2.19)

where E(Y) = μ = (μ(x_1), . . . , μ(x_{n*}))' is the vector of expected counts of the respective cells of the contingency table, x_i is a 1 × k vector of covariates as described in (2.2), and β is a k-dimensional vector of model parameters. A loglinear model may also be expressed as

log(μ(x_i)) = Σ_{j=1}^{k} β_j x_{ij}, i = 1, . . . , n*,   (2.20)

where each x_{ij} is the covariate value corresponding to β_j for the ith level of Y, i = 1, . . . , n*, and j = 1, . . . , k. Recall the assumption that Y_i is a Poisson or multinomial random variable. Thus, the expectation of any Y_i is a positive value, μ(x_i), for i = 1, . . . , n*. The derivation of the large sample distribution of the model parameters depends on the sampling assumptions. When n is not fixed, we assume Poisson sampling. Then the MLE β̂ is asymptotically normal with mean β and covariance matrix V = (X'diag(μ)X)^{−1}. Notice that W = diag[μ]. Thus, the estimated covariance matrix of β̂ is given by V̂ = côv(β̂) = [X'diag(μ̂)X]^{−1}. For Poisson sampling, we have

β̂ ·∼ N(β, (X'diag(μ)X)^{−1}).   (2.21)

Alternatively, when n, the overall sample size, is fixed we assume multinomial sampling.
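Before turning to the multinomial details, note that the Poisson-sampling quantities above are simple to compute directly: the fitted cell means follow from (2.19) as μ̂(x_i) = exp(x_i'β̂), and the estimated covariance of β̂ is (X'diag(μ̂)X)^{−1}, as in (2.21). A sketch with an illustrative 2 × 2 table design (the coefficient values are hypothetical):

```python
import numpy as np

def poisson_fit_summary(X, beta_hat):
    """Fitted means and estimated covariance for a loglinear (Poisson) model:
    mu-hat from (2.19), V-hat = (X' diag(mu-hat) X)^{-1} as in (2.21)."""
    mu_hat = np.exp(X @ beta_hat)
    # X * mu_hat[:, None] computes diag(mu_hat) @ X without forming the diagonal matrix
    V_hat = np.linalg.inv(X.T @ (X * mu_hat[:, None]))
    return mu_hat, V_hat

# Illustrative design: intercept, row indicator, column indicator for a 2x2 table
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
beta_hat = np.array([2.0, 0.5, -0.3])     # hypothetical fitted coefficients
mu_hat, V_hat = poisson_fit_summary(X, beta_hat)
```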
Typically, under multinomial sampling, we have interest in the cell probabilities, π̂ = μ̂/n. Here the π̂ are asymptotically multivariate normal with mean π, and the covariance matrix of the regression parameters is V = cov(β̂) = {X'[diag(μ) − (μμ'/n)]X}^{−1} = {nX'[diag(π) − ππ']X}^{−1}. Notice that W = diag(μ) − (μμ'/n). Additionally, when we have one multinomial sample, the estimated covariance matrix for the regression parameters is given by V̂ = côv(β̂) = {X'[diag(μ̂) − (μ̂μ̂'/n)]X}^{−1} = {nX'[diag(π̂) − π̂π̂']X}^{−1}. Thus, for multinomial sampling,

β̂ ·∼ N(β, (X'[diag(μ) − (μμ'/n)]X)^{−1}).   (2.22)

Notice that the asymptotic normality of the parameters holds for both Poisson and multinomial sampling. When Poisson sampling is assumed, the expected response function is the mean cell count, μ. Alternatively, when multinomial sampling is assumed, the expected response function is π. All inferences on the model parameters or any functions of the model parameters can be made via the asymptotic distributions previously stated. I will focus on the case where Poisson sampling is assumed, as it is the customary assumption. Additionally, when Poisson sampling is assumed, the loglinear model is often referred to as a Poisson model. Intervals for response probabilities from multinomial loglinear models could be formed in a manner similar to that described for logistic regression, but this is rarely done with loglinear models. Instead, focus is usually on the estimated relative risks. When utilizing a loglinear model, the relative risk yields point estimates that are often more applicable to clinical situations than the odds ratio; thus we consider relative risk here. The use of the relative risk is very common in epidemiological applications, and thus discussion of the relative risk will focus on these types of scenarios.
Estimating the relative risk from a loglinear model is particularly easy to implement, since it may be shown that the estimated relative risk is simply exp(β̂_1), where β̂_1 is the slope coefficient for x_1, the predictor variable indicating presence of the intervention. Other models, such as the logistic model, could be used similarly to estimate the relative risk, although other models do not always yield simple formulas. Consider, for instance, the case where we are estimating the relative risk from a loglinear model. The estimated relative risks would be of the form exp(β̂_j), where β̂_j is the estimated slope coefficient for the covariate x_j, j = 1, . . . , k. Let c_j be a vector with the jth element equal to 1 and all other elements equal to 0. A confidence band is formed by

( exp{β̂_j − z_{α/2} σ̂_GLM(c_j)}, exp{β̂_j + z_{α/2} σ̂_GLM(c_j)} )

where σ̂_GLM(c_j) is given by (2.6) and W for Poisson sampling is given previously by equation (2.21). Other models using alternative canonical links could also be considered. For example, other GLMs are formed by utilizing the probit link, where g = Φ^{−1}(π(x)), and the complementary log-log link, where g = log(−log(1 − π(x))). These both assume a binomial sampling scenario, and the usual focus is on the resulting probability of success, π(x).

CHAPTER 3

Inferences on Quantities Estimated from a GLM

When utilizing GLMs, several quantities may be of interest. For example, the expected response, odds ratio, or relative risk may be estimated via the GLM. This section focuses on the case where the expected response function is of primary concern. All the methods discussed utilize the fact that GLMs may be expressed as

g(E(Y_i | x_i)) = x_i'β   (3.1)

where Y_i is the response for the ith observation, x_i = (x_{i1}, . . . , x_{ik}) is the vector of appropriate covariate values for the ith observation, β = (β_1, . . . , β_k) is the vector of parameters, and g is the canonical link.
(Specific assumptions and details on this model are given in equations (2.1) and (2.2).)

3.1 Inference on the Mean Response

This section will focus on the response probability, or estimated mean response, π(x_i) = E(Y_i | x_i), i = 1, . . . , n, in a GLM assuming binomial or multinomial sampling. Alternatively, if we assume Poisson sampling, we would have interest in the expected cell counts, μ(x_i) = E(Y_i | x_i), i = 1, . . . , n. This general methodology can be extended to the Poisson sampling scheme provided our inferences are on μ(x_i) rather than π(x_i). Suppose we have a covariate X with k-dimensional domain I in a GLM of the form g(E(Y | x_i)) = x_i'β, i = 1, . . . , n. Then let X ⊂ I be a compact subset of the domain which is of special interest. The intention of this section is to bound the expected response function, E(Y | x_i), for all x_i ∈ X using confidence bounds on a GLM. The subset X can be a set of the domain that is of special interest, or it may be selected to answer a particular question. Discussion is restricted to the case where there is one covariate, but the methodologies may be extended to cases with many covariates. Even in the single covariate case, X may be a matrix if, for example, the model employs reference coding.

3.1.1 Previous Methods

Two primary approaches for simultaneously estimating the mean response function are discussed. The first is a conventional approach utilizing bounds similar to the well-known Scheffé bounds. The second is a modern approach utilizing solutions referred to as tube formulas for constructing simultaneous intervals. Scheffé bounds are a well-known methodology in simultaneous inference, and are widely applied in linear models and generalized linear models. Some of the regularity conditions necessary for applying Scheffé bounds include that the sample size is sufficiently large and that the domain for the predictor variable is fixed [5].
Under these suitable regularity conditions, the maximum likelihood estimates (MLEs) of a linear model are multivariate normal with mean vector β and covariance matrix equal to the inverse of the Fisher information matrix, σ²_LM F^{−1}, where F = X'X. (See (2.8) for details on model assumptions.) In a standard regression model, Scheffé bounds are often utilized to obtain simultaneous intervals. These bounds are simultaneous for all x_i ∈ R^k and thus are conservative for any finite set of such comparisons. Alternatively, Casella and Strawderman [6] derived Scheffé-type bounds for a regression model with restrictions assumed on the domain. These intervals are exact for this restricted domain. The advantage of assuming these restrictions is that the usual Scheffé bounds are conservative when the entire domain is not used. Piegorsch and Casella [7] utilized the Casella and Strawderman (CS) method to obtain simultaneous bounds on a logistic regression model. Specifically, they obtained Scheffé-type bounds on the x_i'β in a logistic regression utilizing a restricted predictor variable domain of rectangular form. The method originally developed by Casella and Strawderman, and later extended by Piegorsch and Casella, is less conservative than the usual Scheffé bounds as it restricts the predictor variable space. It is desirable at this point to reparameterize the model so that it is in the so-called diagonalized form (Casella and Strawderman [6]). This will simplify the calculations used hereafter. By the Spectral Theorem for symmetric matrices [8], the matrix F may be decomposed, given that F is symmetric. Thus, a linear model may be diagonalized by noting that F = UDU', where D = diag(λ_i) is a diagonal k × k matrix of the ordered eigenvalues of F, and U = (u_1, . . . , u_k) is the k × k matrix of corresponding orthonormal eigenvectors. Now define Z_{n×k} = XUD^{−1/2} and η_{k×1} = D^{1/2}U'θ, where UU' = I since each column, u_i, is orthonormal.
Thus, Zη = [XUD^{−1/2}][D^{1/2}U'θ] = XUU'θ = XIθ = Xθ, so the model may be written as Y = Zη + ε. Note that η̂ = D^{1/2}U'θ̂ is distributed N_k(η, σ²_LM I), given that (2.4) holds. Casella and Strawderman [6] consider bounding linear models of the form Y_i = x_i'θ + ε_i with the usual restrictions on ε (see (2.7)) and with a restricted domain for x_i of the form Ω_x = {x_i : Σ_{j=1}^{r} x_{ij}² ≥ q² Σ_{j=r+1}^{k} x_{ij}²}, where q is a fixed constant. When r = 1 these are cone-shaped regions, and if r > 1 there is no easy visualization of the space. Casella and Strawderman achieve exact results for bounding linear models for domains of this general form. Alternatively, both Casella and Strawderman [6] and Piegorsch and Casella [7] consider a more defined set of interval constraints on X, which are of the form R_x = {x_i : a_{11} < x_{i1} < a_{12}, a_{21} < x_{i2} < a_{22}, . . . , a_{k1} < x_{ik} < a_{k2}} ⊂ Ω_x for a specified q. These intervals would be of particular interest in many experimental settings and thus are assumed for the remainder of this section. The goal of the restricted-Scheffé procedure developed by Casella and Strawderman is to bound the regression function for all x_i ∈ Ω_x. Thus, keeping in mind the objective of inference on E(Y_i | x_i) = x_i'θ, consider

S(Ω_x) = {θ : (x_i'θ̂ − x_i'θ)² ≤ d²σ²_LM x_i'F^{−1}x_i ∀ x_i ∈ Ω_x}   (3.2)

where d is an arbitrary constant. Casella and Strawderman derive a procedure to calculate the value of d that yields

P[S(Ω_x)] = 1 − α.   (3.3)

Their derivation involves considering a domain for Z similar to Ω_x, namely Ω_z = {z_i : Σ_{j=1}^{r} z_{ij}² ≥ q² Σ_{j=r+1}^{k} z_{ij}²}, where q is a fixed constant. Thus, Casella and Strawderman prove that for a specified d,

P[S(Ω_z)] = P[{η : (z_i'η̂ − z_i'η)² ≤ d²σ²_LM z_i'z_i ∀ z_i ∈ Ω_z}] = 1 − α   (3.4)

where η̂ is the MLE under the spectral decomposition. Notice that the only difference between the sets S(Ω_x) and S(Ω_z) is the space we are operating in. Recall the form assumed for the domain of interest, R_x. This is a convex set in R^k.
Thus, the image of this set, R_z, will also be convex, since a linear map preserves convexity. Note that for γ = (η̂ − η)/σ_LM, the set may be written

S(Ω_z) = {γ : (γ'z_i)² ≤ d² z_i'z_i ∀ z_i ∈ Ω_z},   (3.5)

where Ω_z is the restricted domain on the Z scale. Assume this form of S(Ω_z) henceforth. Since we have a domain of the form Ω_z and wish to obtain a Scheffé-type probability band, via Theorem 1 in [6] we have

P(S(Ω_z)) = P(χ²_k ≤ d²) + P(E_{r,s}(b, d²))   (3.6)

where E_{r,s}(b, d²) = {(χ²_r, χ²_s) : χ²_r + χ²_s ≥ d², (aχ_r + bχ_s)² ≤ d², χ²_r ≤ q²χ²_s}, a² + b² = 1, and χ²_r and χ²_s are independent chi-square random variables. Also note that q is a fixed constant determined by Ω_z, and b, r, and s are determined by the value of d and the parameters of the problem. (For specific details see Casella and Strawderman [6].) A solution for the quantity d which yields appropriate simultaneous intervals may be found by setting the right-hand side of (3.6) equal to 1 − α and solving for d. Casella and Strawderman applied this theorem to a linear model, achieving exact results for all x_i ∈ Ω_x and thus yielding a conservative solution for all x_i ∈ R_x ⊂ Ω_x. The details of the derivation of the appropriate Ω_z (and hence Ω_x) are given by Casella and Strawderman [6]. The resulting restricted-Scheffé intervals are of the form Ŷ ± dσ̂(x_i), where d is a critical value determined by an algorithm which is described in Appendix C. Piegorsch and Casella applied this procedure specifically to logistic regression. However, it has not been applied for use in a generic GLM, and it is unclear how these bounds will perform for other generalized linear models. Note that although this method is still conservative, it is less conservative than the conventional Scheffé bounds, since it does not cover the entire predictor variable space. As another alternative to the Scheffé-type bounds, Sun, Loader, and McCormick (2000) [9] (SLM) proposed a solution for simultaneously estimating the mean response for the general class of GLMs with all x_i in a compact set.
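For orientation, the unrestricted Scheffé critical value, the conservative baseline that the restricted procedure improves upon, solves P(χ²_k ≤ d²) = 1 − α; the restricted-Scheffé d additionally accounts for the E_{r,s}(b, d²) term in (3.6) and is therefore smaller. The baseline value is immediate to compute (this is only the unrestricted baseline, not the restricted algorithm of Appendix C):

```python
import numpy as np
from scipy.stats import chi2, norm

def scheffe_d(k, alpha=0.05):
    """Unrestricted Scheffé critical value: the d with P(chi2_k <= d^2) = 1 - alpha."""
    return np.sqrt(chi2.ppf(1 - alpha, df=k))

# With k = 2 parameters the simultaneous critical value already exceeds the
# pointwise z_{alpha/2}, reflecting the price of covering all of R^k at once.
d = scheffe_d(2)
z = norm.ppf(0.975)
```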
This general method of bounding a regression function, called simultaneous confidence regions (SCR), can account for a variety of linear and nonlinear models. Specifically, the SCR bounds can be applied when there are heteroscedastic and nonadditive error terms, as is the case for many GLMs. The SCR bounds utilize error expansions to approximate the noncoverage probability for a GLM. They are far less conservative than Scheffé solutions and perform exceptionally well for moderate sample sizes. The SCR bounds are based on applying the so-called tube formula due to Naiman [10] with various possible adjustments. The tube formula provides a lower bound for the coverage probability of a confidence band of a regression function over a specified closed set. However, the tube formula assumes the error distribution of the model is normal. Clearly, this is not a valid assumption if we have a GLM, although it does provide a starting point for constructing confidence bounds, as the large sample error distribution is approximately normal. Obviously, this assumption will be problematic for smaller sample sizes. The basic tube-formula methodology is described by Naiman in his 1986 paper (Naiman [10]). This paper outlines a solution for constructing simultaneous confidence bands on polynomial regression models of the form

Y_i = Σ_{j=1}^{k} θ_j f_j(x_i) + e_i, i = 1, . . . , n, x_i ∈ I.   (3.7)

Here it is assumed that I is a closed interval in R, that the e_i are iid N(0, σ²_LM) with σ²_LM unknown, and that the θ_j (j = 1, . . . , k) are unknown constants. The vector f(x_i) = (f_1(x_i), . . . , f_k(x_i))' maps from I to R^k. Naiman's intent is to provide simultaneous confidence bounds on E(Y_i | x_i) = θ'f(x_i) for all x_i ∈ I, where an estimate θ̂ = (θ̂_1, . . . , θ̂_k)' is available such that θ̂ is distributed N(θ, σ²_LM F) with σ²_LM unknown and F known, and s²_LM is an independent estimator of σ²_LM such that νs²_LM/σ²_LM ∼ χ²_ν with ν = n − k.
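Naiman's setting (3.7) is ordinary polynomial regression, so the ingredients above (θ̂, the matrix F, and the independent variance estimate s²_LM on ν = n − k degrees of freedom) can all be obtained by least squares. A minimal sketch with simulated data; the data, degree, and coefficient values are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 30, 3                              # quadratic model: f(x) = (1, x, x^2)
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([x**j for j in range(k)])
theta = np.array([1.0, -2.0, 0.5])        # hypothetical true coefficients
y = X @ theta + rng.normal(scale=0.1, size=n)

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
F = np.linalg.inv(X.T @ X)                # theta-hat ~ N(theta, sigma^2_LM * F)
resid = y - X @ theta_hat
s2 = resid @ resid / (n - k)              # nu * s^2 / sigma^2 ~ chi^2 with nu = n - k
```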
In order to understand how Naiman derives these bounds, consider alternatively another mapping, γ, from I to the unit sphere S^{k−1} centered at the origin of R^k, such that γ is piecewise differentiable and

κ(γ) = ∫_I ‖γ'(x)‖ dx   (3.8)

is finite. Since γ maps I to the unit sphere, S^{k−1}, it will be considered in place of the primary mapping f(x_i). Specifically, it is the projection of f(x_i) on S^{k−1}. Note that γ(x) = ‖Pf(x_i)‖^{−1} Pf(x_i) for x_i ∈ I, where P is a k × k matrix such that F = P'P. The quantity κ(γ) is called the path length, as it measures the length of the path γ, a continuous mapping essentially connecting the points in the image of f. The image of the path in S^{k−1} is then denoted by Γ(γ) = {γ(x) : x ∈ I}. The goal is to bound Γ(γ) with a tube, which equivalently bounds the regression function f(x_i) on I, as they share the same domain. Regarding the image of the path function, Γ(γ), Naiman demonstrates that

μ(Γ(γ)^{(g)}) ≤ min{ [κ(γ)/(2π)] F_{k−2,2}[2(g^{−2} − 1)/(k − 2)] + (1/2) F_{k−1,1}[(g^{−2} − 1)/(k − 1)], 1 }   (3.9)

where g ∈ [0, 1], Γ(γ)^{(g)} = {u ∈ S^{k−1} : c(u) ≥ g} (a set of points in S^{k−1} that surround Γ(γ)) with c(u) = sup{u'v : v ∈ Γ(γ)} for any u ∈ S^{k−1}, and μ is the uniform measure. Naiman then applies these results to obtain confidence bands of the form θ̂'f(x_i) ± d σ̂_LM (f(x_i)'F f(x_i))^{1/2}. The intervals are formed utilizing the critical value d, which is determined by setting

1 − ∫_0^{1/d} min{ [κ(γ)/(2π)] F_{k−2,2}[2((dt)^{−2} − 1)/(k − 2)] + (1/2) F_{k−1,1}[((dt)^{−2} − 1)/(k − 1)], 1 } f_T(t) dt   (3.10)

equal to 1 − α and solving for d. Here f_T(t) is the density of a random variable T such that rT² ∼ F_{·,r}. Utilizing these tube-formula bounds, SLM form simultaneous bounds on the expected response function for a GLM. Recall that Naiman derived these bounds assuming normally distributed residuals. Clearly, generalized linear models only have normally distributed residuals asymptotically.
Thus, the tube formulas were originally applied directly, via the asymptotic normality of the residuals, to obtain asymptotic simultaneous confidence bands. Modifications were then made to the usual tube-based bounds to improve them for small to moderate sample sizes. The following description outlines how to apply the tube-formula bounds to GLMs. Let the maximum likelihood estimate (MLE) of a predicted response for a GLM at x_i be denoted by Ŷ_i. Ultimately, the interval desired is of the form

I_d(x_i) = ( g^{−1}(x_i'β̂ − dσ̂_GLM(x_i)), g^{−1}(x_i'β̂ + dσ̂_GLM(x_i)) ) ∀ x_i ∈ X

where [σ̂_GLM(x_i)]² is the asymptotic variance of x_i'β̂ and X is a particular compact subset of the domain. The tube formulas will enable us to find a value d such that

P[E(Y_i | x_i) ∈ I_d(x_i) for all x_i ∈ X] ≥ 1 − α.   (3.11)

Applying the tube formula directly to solve for d yields what SLM term a naive SCR. We will utilize the notation d_TUBE to indicate a critical value calculated in this manner. This solution performs adequately when the asymptotic distribution of the residuals is nearly normal. However, this method will not attain the desired confidence level when the sample size is relatively small, as the residuals are typically nonnormal discrete random variables for GLMs. In order to improve the small sample performance, the authors consider some modifications to the tube formula. They begin by approximating the sampling distribution of the residual via construction of expansions on the estimated model. This approximation of the sampling distribution will be utilized to obtain a critical point for the confidence interval formula that is adjusted with respect to the bias introduced by the MLEs. Consider the random process

W_n(x_i) = [g(Ŷ_i) − g(E(Y_i | x_i))] / σ̂_G(x_i)

where x_i = (x_{i1}, . . . , x_{ik}) is the ith row of X = (x_1, . . . , x_n)' and [σ̂_G(x_i)]² is the asymptotic variance of g(Ŷ_i). This converges in distribution to a Gaussian random field.
Let W(x_i) be a random variable with the same distribution as the limiting distribution of W_n(x_i). Then the bias behaves like W_n(x_i) − W(x_i). This equivalent expression of the bias may be bounded via inverse Edgeworth expansions. SLM propose three corrections that can aid in correcting the bias introduced from estimating the regression parameters in a GLM with MLEs. These three solutions are based on the inverse Edgeworth expansion of the random process given by

W_n(x_i) = W(x_i) − p_2(x_i, W_n(x_i))   (3.12)

where the subtracted term can be thought of as a correction for the bias of the process. It is based on the centered moments of the process W_n(x_i), denoted κ_i(x_i) for i = 1, 2, 3, 4, and is given by

p_2(x_i, Z) = −Z{ (1/2)[κ_2(x_i) − 1 + κ_1²(x_i)] + (1/24)[κ_4(x_i) + 4κ_1(x_i)κ_3(x_i)](Z² − 3) + (1/72)κ_3²(x_i)(Z⁴ − 10Z² + 15) } = O(n^{−1}).   (3.13)

The κ_i's may be computed as detailed by Hall (1992, [11]). Details are provided in Appendix D. The inverse Edgeworth expansions (3.13) are then also utilized to account for the bias typically observed in the MLEs of generalized linear models. A first version of a corrected SCR, denoted SCR1, is a solution where the bias term, p_2(x_i, W_n(x_i)), is bounded. First consider the supremum of the bias term, R'_p = sup_{x_i ∈ X} {p_2(x_i, W_n(x_i))} = O_p(1/n). We want to find a positive constant r'_p such that P[R'_p ≤ r'_p] → 1 − α as n → ∞. Details are given in Sun, Loader, and McCormick [9] and Hall [11]. Additionally, specific calculation procedures are described in Appendix D. These r'_p values are then used to correct the bias in the choice of d via the tube-formula method. Namely, our new interval is given by

( g^{−1}(x_i'β̂ − d_SCR1 σ̂_GLM(x_i)), g^{−1}(x_i'β̂ + d_SCR1 σ̂_GLM(x_i)) )   (3.14)

where the new critical point d_SCR1 is equal to d_TUBE − r'_p, with d_TUBE the aforementioned solution. Another version of the corrected SCR, SCR2, considers the modified process W⁰_n(x_i) = [W_n(x_i) − κ_1(x_i)] / √κ_2(x_i), such that W⁰_n(x_i) = W(x_i) − q_2(x_i, W⁰_n(x_i)) with q_2 similar to p_2.
The tube formula is then applied to W⁰_n(x_i). Bounding this normalized process, W⁰_n(x_i), further corrects the bias. Doing this results in confidence bounds on E(Y_i | x_i) which are an improvement over the tube method applied directly to W_n(x_i). It is of interest to note that this method corrects the bias via a first level approximation. The resulting confidence region is of the form

( g^{−1}(x_i'β̂ − d_SCR2 σ̂_GLM(x_i)), g^{−1}(x_i'β̂ + d_SCR2 σ̂_GLM(x_i)) )   (3.15)

where d_SCR2 is this bias-corrected solution for the critical value. Note that this is just a critical value, like d, that is corrected for the bias. SLM term this a two-sided corrected SCR. Recall that it utilizes the modified Gaussian process that corrects the bias inherent in the MLEs and finds a critical value that adjusts for that bias. A last solution, called the centered SCR (SCR3), begins by estimating the mean and variance of the Gaussian process W_n(x_i). These are the centered moments, given by κ̂_1(x_i) and κ̂_2(x_i), respectively. These essentially shift and rescale the confidence region so it is no longer biased. The tube-based critical value d_TUBE is again involved, so the final interval is

( g^{−1}((x_i'β̂)* − d_TUBE σ̂*_i), g^{−1}((x_i'β̂)* + d_TUBE σ̂*_i) )   (3.16)

where (x_i'β̂)* = x_i'β̂ − κ̂_1(x_i)σ̂_GLM(x_i) and σ̂*_i = σ̂_GLM(x_i)√κ̂_2(x_i). These formulas are given in Appendix D.

3.1.2 Proposed Methods

Expanding on the methodologies presented in the previous section, I have developed two new approaches for estimating a mean response function over a specified compact set via confidence regions. The first proposed method is based on the restricted-Scheffé bounds developed by Casella and Strawderman and further refined by Piegorsch and Casella. In Piegorsch and Casella [7], the authors derive and implement conservative simultaneous bounds on the response probabilities of logistic regression models for rectangular domains. I have generalized these bounds on the expected response function for any GLM.
Outlined below is the method by which these bounds may be computed. For any GLM, let the inverse link function be g^{−1}, so that

E(Y_i | x_i) = g^{−1}(x_i'β), i = 1, . . . , n   (3.17)

where E(Y_i | x_i) denotes the response probability (logistic regression) or the mean response (loglinear models or Poisson regression) for the specified covariate levels given by x_i. (Complete model specifications are given in (2.2).) Recall that the MLE of β, β̂, is asymptotically normal with mean β and k × k covariance matrix V, where V = (X'WX)^{−1} with W defined in (2.3). Applying the restricted-Scheffé procedure of Casella and Strawderman to GLMs yields appropriate conservative simultaneous confidence intervals for the mean response. Casella and Strawderman assumed that the MLEs of the regression parameters were normally distributed with a specified mean vector and covariance matrix. For our case we only have asymptotic normality of the MLEs, and therefore the probability in (3.6) is not exactly 1 − α but instead converges to 1 − α as n → ∞. We will also require a slightly different definition for S(Ω_x) and S(Ω_z), where Ω_x denotes the restricted predictor domain as before. Recall that S(Ω_x) = {θ : (x_i'θ̂ − x_i'θ)² ≤ d²σ²_LM x_i'F^{−1}x_i ∀ x_i ∈ Ω_x} for linear models. Here, however, S(Ω_x) = {β : (x_i'β̂ − x_i'β)² ≤ d² x_i'Vx_i ∀ x_i ∈ Ω_x}. Notice that the inequalities in both sets have an upper bound given by the variance of x_i'β̂ or x_i'θ̂, respectively. S(Ω_z) will have a similar definition for GLMs, and the diagonalization described previously applies with F = V^{−1}. Thus, for any S(Ω_z) of the form (3.5), generalized appropriately for a GLM,

P(S(Ω_z)) → P(χ²_k ≤ d²) + P(E_{r,s}(b, d²))   (3.18)

as n → ∞, where

E_{r,s}(b, d²) = {(χ²_r, χ²_s) : χ²_r + χ²_s ≥ d², (aχ_r + bχ_s)² ≤ d², χ²_r ≤ q²χ²_s},   (3.19)

a² + b² = 1, and χ²_r and χ²_s are independent chi-square random variables. Note that q is a fixed constant determined by the particular Ω_x. (Recall we will choose the smallest Ω_x ⊃ R_x.)
Additionally, the constants b, r, and s are determined by the value of d and the parameters of the problem. (Details of the computation of b, r, and s specifically for GLMs are given in Appendix C.) Notice that the coverage probability of the set S(Ω_z) is the sum of the usual coverage probability of the Scheffé set, P(χ²_k ≤ d²), and a probability that adjusts for the restricted domain. Recall that these bounds are derived assuming a domain of the form Ω_x. If a set of the form R_x is of interest, an approximate answer may still be found as in Piegorsch and Casella. In order to find a bound for a domain of the form R_x, the smallest set of the form Ω_x is established such that Ω_x contains R_x. (This procedure is detailed in Appendix C.) We may consider sets on either the X domain, Ω_x and R_x, or their equivalent sets on the Z domain, Ω_z and R_z. Recall that the method of Casella and Strawderman provides simultaneous bounds on a linear model for a transformed domain of the form Ω_z, and thus equivalently for Ω_x. The adapted method of Piegorsch and Casella computes these bounds for domains of the form R_x in logistic regression models. I propose extending these bounds for use in any generalized linear model with a canonical link. The simultaneous bounds may be transformed from x_i'β, x_i ∈ R_x, to the expected response function via the inverse link function. In order to apply the Casella-Strawderman results to GLMs, we must first show that the probability of the set S(Ω_x) converges to 1 − α.

Corollary 3.1 If β̂ is asymptotically normal with mean vector β and covariance matrix V, then

P(S(Ω_z)) → P(χ²_k ≤ d²) + P(E_{r,s}(b, d²)) as n → ∞,   (3.20)

with E_{r,s}(b, d²) given by (3.19), where q is determined by the particular Ω_z considered, with appropriate constants b, r, and s.

Proof: Recall Ω_z = {z_i : Σ_{j=1}^{r} z_{ij}² ≥ q² Σ_{j=r+1}^{k} z_{ij}²} for a specified constant q. Here Ω_z is the diagonalized form of Ω_x.
Theorem 1 of Casella and Strawderman [6] gives exact equality of the two sides of (3.20) under exact normality of β̂. Consequently, under asymptotic normality we have (3.20).

Unfortunately, (3.20) requires V to be known. Consider the following corollary to the Casella and Strawderman theorem, which holds for GLMs.

Corollary 3.2 If β̂ is asymptotically normal with mean vector β and covariance matrix V, estimated by V̂, then (3.20) still holds for S(𝒵_i) utilizing V̂ instead of V.

Proof: Let V̂ = (X′ŴX)⁻¹. Since β̂ →p β, we have Ŵ = W(β̂) →p W. Thus V̂ = (X′ŴX)⁻¹ →p V = (X′WX)⁻¹. Consequently the convergence in (3.20) also holds when V is estimated by V̂.

Now let d_CS be the value of d such that the probability on the right-hand side of (3.20) is 1 − α. Then

P(g⁻¹(x_i′β̂ − d_CS σ̂*_i) ≤ E(Y_i | x_i) ≤ g⁻¹(x_i′β̂ + d_CS σ̂*_i)) → 1 − α as n → ∞, (3.21)

where σ̂*_i = (x_i′V̂x_i)^{1/2}. We have demonstrated theoretically that the restricted-Scheffé bounds of Casella and Strawderman may be applied to GLMs, but the computational details remain. A detailed computational algorithm for the restricted-Scheffé bounds for any GLM is given in Appendix C.

As an alternative to the restricted-Scheffé bounds, we can also apply an alternative estimate of β, β̂*, to the simultaneous confidence regions (SCRs) developed by SLM [9] and to the restricted-Scheffé bounds for GLMs. Recall that SLM derived four SCR bounds. First, the tube-based bounds were applied to the maximum likelihood estimators by appealing to their asymptotic normal distribution (d_TUBE). Second, the bias was bounded and then the tube-formula solution was applied (d_SCR1). Third, the SCR bounds were derived for a modified process which accounts for the bias in the distribution of the MLE (d_SCR2). And fourth, the random process was centered and rescaled to correct the bias before applying the tube-based bounds (the centered SCR).
The various solutions all attempt to correct the bias of the maximum likelihood estimates for GLMs. In particular, when the sample size is small the MLEs are highly nonnormal, and the adjustments to the tube formulas are especially helpful. I propose estimating SCR bounds and restricted-Scheffé bounds that are based not on the MLE, but on an alternative: the penalized maximum likelihood estimator (pMLE). The pMLE is a bias-corrected estimate that is closely related to the usual MLE. The tube-formula bounds can then be applied utilizing the pMLE estimates rather than the MLEs; in particular, the naive SCR and centered SCR (SCR3) bounds are easily applied. The difference between the proposed method and the methods of Sun et al. (2000) [9] is that utilizing the pMLE does not merely "correct" the bias, but prevents the bias (to first order) a priori. No additional bias-correcting procedures are necessary since the pMLE has little bias from the start. Additionally, this method eliminates inestimable model parameters (0 or ∞) due to zero cell counts in GLMs. This difficulty was not resolved by the methodologies presented by SLM, and can be particularly troublesome with small to moderate sample sizes. Also, recall that the bias corrections employed in some of the SCR bounds are bias-reducing only asymptotically; this procedure is bias-reducing for any sample size.

The penalized maximum likelihood estimate (pMLE) was developed by David Firth [12] as an alternative to the MLE for use in models, such as GLMs, where the typical MLE is known to be biased. Specifically, all of the derivations depend on the model belonging to the general exponential class. These distributions have the general form

f(t, θ) = exp{[tθ − K(θ)]/a(φ) + c(t, φ)}, (3.22)

which, when the dispersion parameter φ is known, simplifies to f(t, θ) = exp{tθ − K(θ)} [13].
Although this is an unusual form for an exponential class model, it lends itself well to a derivation later in this section, and it can be shown to be equivalent to the standard form under the correct reparameterization. The general pMLE procedure penalizes the score function via the Jeffreys invariant prior for the parameters of interest; this penalty yields estimates of the regression parameters that are unbiased to first order. Typically, when computing an MLE, the first derivative of the log likelihood, often called the score function, is set equal to 0, and the MLE is the solution. For estimating a set of parameters θ = (θ_1, . . . , θ_k), let the usual vector of score functions be denoted U(θ) = (U_1(θ), . . . , U_k(θ))′, where U_r(θ) is the derivative of the log likelihood with respect to the rth parameter. Firth proposes shifting the score function to correct the bias present in most MLE estimates for GLMs. The shift is determined by the estimated bias, b(θ), and the information matrix, i(θ). The shifted, or penalized, score function

U*(θ) = U(θ) − i(θ)b(θ) = (U*_1(θ), . . . , U*_k(θ))′

is set equal to 0, and the pMLE, θ*, is the solution to U*(θ) = 0. The details of estimating the bias are given in Firth [12]; as previously noted, the calculations are based on a Jeffreys prior for the parameters of interest, θ. Firth applies this estimator to the logistic regression model, but the methodology can be applied to any exponential class model, and specifically to any GLM. Firth states that for a GLM with a canonical link the modified score function is given by

U*_r(θ) = U_r(θ) + (1/(2φ)) Σ_{i=1}^{n} (κ_{3i}/κ_{2i}) h_i x_{ir}, (r = 1, . . . , k), (3.23)

where U_r(θ) is the derivative of the log likelihood for the rth parameter, n is the number of observations Y_i, φ is the dispersion parameter in (3.22), and h_i is the ith diagonal of the hat matrix (the leverage).
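For concreteness, (3.23) can be solved by ordinary Fisher scoring once the cumulant ratio κ_{3i}/κ_{2i} is available; for a binomial logit model with unit trials (m_i = 1, φ = 1) that ratio works out to 1 − 2π_i, as derived below. The following is a minimal sketch, not the dissertation's implementation; the function name and toy data are illustrative. Note that the toy data are completely separated, so the ordinary MLE diverges, yet the pMLE is finite.

```python
import numpy as np

def firth_logistic(X, y, max_iter=100, tol=1e-10):
    """Sketch: solve Firth's penalized score U*(beta) = 0 by Fisher scoring
    for a binomial logit model with unit trials (m_i = 1, phi = 1).

    U*_r(beta) = sum_i [(y_i - pi_i) + h_i * (1/2 - pi_i)] x_ir,
    where h_i is the ith leverage (diagonal of the hat matrix).
    """
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)                        # Fisher-information weights
        info = X.T @ (w[:, None] * X)              # X' W X
        info_inv = np.linalg.inv(info)
        # leverages: diagonal of W^{1/2} X (X'WX)^{-1} X' W^{1/2}
        h = w * np.einsum('ij,jk,ik->i', X, info_inv, X)
        score = X.T @ ((y - pi) + h * (0.5 - pi))  # penalized score U*
        step = info_inv @ score
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Completely separated toy data: the ordinary MLE diverges, the pMLE does not.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
beta_pmle = firth_logistic(X, y)
```

Fisher scoring with the unpenalized information matrix is the standard way Firth's estimator is computed in practice; a step-halving guard could be added for harder data sets.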
The leverage, the ith diagonal of the hat matrix, measures the distance between each observation's covariate vector and the covariate mean. (See McCullagh and Nelder [13] for an explicit definition.) Additionally, κ_{ti} is the tth (t = 2, 3) cumulant, or tth central moment, of Y_i, i = 1, . . . , n. The cumulants may be calculated from the cumulant generating function K_i(s) = log(M_i(s)), where M_i(s) is the moment generating function of the dependent variable Y_i. Each cumulant is found by differentiating K_i(s) with respect to s and then evaluating at s = 0; for example, κ_{2i} = K_i^{(2)}(0), where K_i^{(2)}(s) is the second derivative of the cumulant generating function. Note that the third cumulant, κ_{3i}, drives the bias estimate. Justification for this formula is outlined by McCullagh and Nelder (1989) [13] in chapter 15 of Generalized Linear Models.

Specifically, assume a binomial logit model with response variables Y_i ∼ BIN(m_i, π_i), i = 1, . . . , n, where the m_i, i = 1, . . . , n, are assumed known and the Y_i are independent. Notice that in this scenario k = n. The moment generating function is M_i(s) = (π_i e^s + (1 − π_i))^{m_i}, so the cumulant generating function is K_i(s) = m_i log(π_i e^s + (1 − π_i)). Thus the first, second, and third derivatives of the cumulant generating function are

K_i′(s) = m_i π_i e^s / (π_i e^s + (1 − π_i)),

K_i^{(2)}(s) = m_i π_i e^s / (π_i e^s + (1 − π_i)) − m_i π_i² e^{2s} / (π_i e^s + (1 − π_i))²,

and

K_i^{(3)}(s) = m_i π_i e^s / (π_i e^s + (1 − π_i)) − 3 m_i π_i² e^{2s} / (π_i e^s + (1 − π_i))² + 2 m_i π_i³ e^{3s} / (π_i e^s + (1 − π_i))³,

respectively. Consequently, the second and third cumulants are κ_{2i} = m_i π_i (1 − π_i) and κ_{3i} = m_i π_i (1 − π_i)(1 − 2π_i). Additionally, the likelihood function is L(π_i) = π_i^{y_i} (1 − π_i)^{m_i − y_i}, i = 1, . . . , n. Thus the usual score function, the first derivative of the log likelihood, is U_r(π) = Σ_{i=1}^{n} (y_i − m_i π_i) x_{ir}, r = 1, . . . , n.
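The closed-form cumulants just derived can be checked numerically by differentiating the cumulant generating function with central finite differences at s = 0. A small sketch (Python; names are illustrative, and only the definitions in the text are assumed):

```python
import numpy as np

def binomial_cgf(s, m, p):
    """K(s) = m * log(p*e^s + 1 - p), the binomial cumulant generating function."""
    return m * np.log(p * np.exp(s) + 1.0 - p)

def cumulants_2_3(m, p, eps=1e-3):
    """Second and third cumulants via central differences of K at s = 0."""
    K = lambda s: binomial_cgf(s, m, p)
    k2 = (K(eps) - 2.0 * K(0.0) + K(-eps)) / eps**2
    k3 = (K(2 * eps) - 2.0 * K(eps) + 2.0 * K(-eps) - K(-2 * eps)) / (2.0 * eps**3)
    return k2, k3

m, p = 10, 0.3
k2, k3 = cumulants_2_3(m, p)
# Closed forms from the text: m*p*(1-p) = 2.1 and m*p*(1-p)*(1-2p) = 0.84
```

The finite-difference values agree with κ₂ = mπ(1 − π) and κ₃ = mπ(1 − π)(1 − 2π) to several decimal places.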
Therefore, the penalized score function is

U*_r(π) = Σ_{i=1}^{n} (y_i − m_i π_i) x_{ir} + (1/(2φ)) Σ_{i=1}^{n} (1 − 2π_i) h_i x_{ir}, r = 1, . . . , n. (3.24)

In the case of a single binomial trial, the dispersion parameter is φ = 1 (McCullagh and Nelder (1989) [13]), so (3.24) simplifies to

U*_r(π) = Σ_{i=1}^{n} {(y_i + h_i/2) − (h_i + m_i)π_i} x_{ir} for r = 1, . . . , n.

Note that in models with no bias present in the MLE, such as a simple linear regression model, the score function is not penalized, as the third cumulant is zero.

Now we may apply the tube-formula confidence bounds to the penalized MLEs rather than the MLEs. All calculations follow the previous SCR interval descriptions, since the information matrix of the pMLEs is the usual information matrix of the MLEs (see Firth (1993) [12]); consequently no alteration of the methodology is necessary. We are simply replacing β̂ with the pMLE, β̂*. We do not need to consider the first two SCR methods, as they adjust for the bias. Rather, we will simply consider the critical value d_TUBE applied to the pMLE β̂*. Additionally, the centered SCR could be applied with possible improvement, as this interval is recentered and rescaled. The CS bounds can also be calculated utilizing β̂* instead of β̂; we refer to these as the pCS bounds in the sequel. No other modifications are necessary. We will refer to our SCR bounds utilizing the pMLEs as bias-prevented SCRs, or pSCRs. The first of these is given by

(g⁻¹(x_i′β̂* − d_TUBE σ̂(x_i)), g⁻¹(x_i′β̂* + d_TUBE σ̂(x_i))), i = 1, . . . , n (3.25)

where β̂* is the penalized maximum likelihood estimate, σ̂(x_i) is given by (2.6), and d_TUBE is obtained as described in section 3.1.1. The Casella–Strawderman results could also be applied to an interval of the form (3.25) with d_TUBE replaced by d_CS, where d_CS is obtained as described in section 3.1.1. Finally, the centered SCR may be utilized to yield the second pSCR.
This interval is of the form

(g⁻¹((x_i′β̂)* − d_pSCR2), g⁻¹((x_i′β̂)* + d_pSCR2)), i = 1, . . . , n (3.26)

for (x_i′β̂)* = x_i′β̂* − κ̂*_1(x_i)√(x_i′V̂x_i) and d_pSCR2 = d_TUBE √(x_i′V̂x_i) √(κ̂*_2(x_i)), where κ̂*_1(x_i) and κ̂*_2(x_i) are now based on the estimator β̂*. The formulas for these moments of the Gaussian field are given in Appendix D. Since β̂* attempts to eliminate the bias, it is reasonable to expect that confidence regions based on this estimator will attain the desired level of confidence at smaller sample sizes than the SCR bounds of SLM. For moderate to large samples, the pSCR bounds should be very similar to SLM's corrected and centered SCR bounds.

3.2 Bounds for Simultaneously Estimating Functions of Model Parameters

Often an estimate of something other than the expected response, whether a probability or a mean, is desired. In previous sections, quantities such as the odds ratio, relative risk, and attributable proportion were discussed. Odds ratios and relative risks have immediate clinical application and are often easier for nonstatisticians to understand than expected responses. As discussed previously, all of these quantities may be estimated directly from the data. However, it is sometimes preferable to estimate these quantities via generalized linear models; specifically, they may all be expressed as functions of the parameters of GLMs. Consequently, it is possible to utilize all of the methodologies discussed in section 3.1 to simultaneously estimate quantities such as the odds ratio or relative risk. In this section I propose methods that utilize these simultaneous bounds for GLMs to estimate such quantities, thereby accounting for multiplicity of inference. Since the odds ratio, relative risk, and attributable proportion were described previously, I will not review these quantities. Rather, we begin by reviewing some relevant methods for simultaneous estimation.
3.2.1 Previous Methods

When quantities such as odds ratios or relative risks are of interest, we are often utilizing discrete random variables to predict a binary response variable. While it is possible to have continuous predictors and compute quantities such as the odds ratio, the use and application of the odds ratio in these situations is less obvious and the need for a multiplicity adjustment is less apparent. Consequently, attention will first focus on the case of categorical predictors.

Researchers have previously explored simultaneous estimation of various sets of the parameters. For example, in 1996, McCann and Edwards [14] proposed a procedure to simultaneously estimate p contrasts of k unknown parameters. This method utilizes Naiman's inequality, discussed previously, to obtain conservative simultaneous confidence regions for the p contrasts of interest. These bounds outperform the existing competing conservative bounds in many scenarios. However, the McCann–Edwards (ME) method applies only to linear models. I propose adapting the SCR and related bounds from section 3.1 to simultaneously bound p contrasts of k unknown parameters from generalized linear models in an analogous manner. First, I will review the ME method for linear models, and then detail the proposed method for GLMs.

Assume we have a regression model of the form Y = X′θ, where Y is an n × 1 vector of responses, X is a k × n matrix of predictor variables, and θ is a k × 1 vector of regression parameters, as described by (2.7). Assume that the MLE for θ, θ̂, is asymptotically multivariate normal with mean θ and covariance matrix σ²_LM F, with F assumed known and of full rank. Also assume that an estimate of σ²_LM exists and is given by σ̂²_LM, where σ̂²_LM is independent of θ̂ and is such that νσ̂²_LM/σ²_LM ∼ χ²_ν. Thus, we have k unknown parameters to estimate via the usual MLE, θ̂ = (θ̂_1, θ̂_2, . . . , θ̂_k). Now suppose p linear combinations of the regression parameters are of interest.
Let C be a p × k matrix of constants such that Cθ is a vector of these linear combinations of θ. Given the distributional assumptions on θ̂, Cθ̂ is multivariate normal with mean Cθ and covariance matrix σ²_LM CFC′. Thus, if a single contrast is given by c_j′θ, where C = (c_1, . . . , c_p)′ and each c_j′ = (c_{j1}, . . . , c_{jk}), then we can form exact simultaneous interval estimates for each contrast via the following formula:

c_j′θ̂ ± d σ̂_j, j = 1, . . . , p (3.27)

where σ̂_j = σ̂_LM (c_j′Fc_j)^{1/2} and d is a p-dimensional multivariate t quantile with ν degrees of freedom and correlation matrix R, with R the correlation matrix corresponding to CFC′. The path-length inequality proposed by McCann and Edwards provides a conservative solution for d that outperforms the existing conservative solutions in many cases. Note that their solution only provides conservative intervals, as obtaining the multivariate-t quantile providing an exact interval is generally an intractable problem. The following theorem, given by McCann and Edwards (1996) [14], details this solution.

Theorem 3.1 Let T have a p-dimensional multivariate-t distribution with degrees of freedom ν and underlying correlation matrix R of rank r. The probability P(|T_j| ≤ d, j = 1, . . . , p) is bounded below by the expression

1 − ∫_0^{1/d} min{ F_{r−2,2}[(s((dt)⁻² − 1))/(r − 2)] (Λ/π) + F_{r−1,1}[((dt)⁻² − 1)/(r − 1)], 1 } f_T(t) dt,

with Λ = Σ_{j=1}^{p−1} cos⁻¹(r_{j,j+1}), where F_{m,n} is the distribution function of an F random variable with m and n degrees of freedom and f_T is the density function of a random variable T such that rT² ∼ F_{ν,r}. If d is such that the foregoing expression is at least 1 − α, then the intervals (3.27) will be conservative simultaneous (1 − α)100% confidence intervals.

This inequality determines a value of d that depends on the correlation structure R and the path length Λ. The path length also depends on the ordering of the indices 1, 2, . . . , p, and yields the smallest value of d for the optimum ordering.
ME recommend estimating the optimum ordering via the nearest-neighbor algorithm (Townsend, 1987), as no exact solution exists and this method often provides the optimum ordering. It is noted that as the path length approaches infinity or zero, the value of d approaches (rF_{α;r,ν})^{1/2} (Scheffé's critical value) or t_{α/2,ν} (the one-at-a-time critical value), respectively. Simulations show that when the degrees of freedom are low and the number of comparisons is high, the ME method outperforms other existing conservative solutions. Note that the ME method simply applies Naiman's inequality to an interval and parameterization carefully chosen to contain the quantities of interest. However, the ME method is strictly valid only for linear models. I propose adapting the SCR bounds and their counterparts for generalized linear models in an analogous manner, in order to make simultaneous inference on linear combinations of the parameters from GLMs, such as odds ratios or relative risks.

3.2.2 Proposed Methods

In order to estimate linear combinations of the GLM parameters, I propose applying the SCR and PC methodologies in a fashion similar to the ME bounds. Recall that the only requirement for applying the SCR-type bounds is normality of the regression parameter estimates, which holds asymptotically for GLMs. Application of these bounds to various quantities of interest is outlined in detail below.

3.2.2.1 A General Set of Comparisons of the Model Parameters

Consider the general setup and estimators for GLMs presented in Chapter 2. In section 3.1 the SCR bands were applied to GLMs to estimate the expected response function simultaneously over a closed set 𝒳. Now these bounds, along with the pSCR, CS, and pCS bounds, may be applied to simultaneously estimate a set of p linear combinations of the regression parameters from a GLM. These linear combinations will each be of the form c_j′β, where c_j is a k × 1 vector for j = 1, . . . , p. Let C = (c_1, . . .
, c_p)′ be a p × k matrix, so that Cβ is the vector of the p linear combinations of interest. Utilizing the SCR methodologies, the expected response function is simultaneously estimated by bounds with at least 100(1 − α)% coverage ∀ x_i ∈ 𝒳. As the results detailed in section 3.1 can be applied to simultaneously estimate the expected response function over 𝒳, they yield simultaneous intervals on g⁻¹(c_j′β) provided c_j ∈ 𝒳 for j = 1, . . . , p. These intervals may then be transformed to obtain simultaneous bounds on the c_j′β via the link g. Generally, the bounds will be of the form

c_j′β̂ ± d σ̂_GLM(c_j), j = 1, . . . , p (3.28)

where σ̂_GLM(c_j) is given by (2.6). The following theorems detail the use of the aforementioned critical values to obtain 100(1 − α)% coverage for a fixed set of p linear combinations of the parameters. Note that we will have eight possible solutions for confidence bands on a fixed set of p linear combinations of the parameters, since the usual CS, the pCS, the four SCR, and the two pSCR intervals may all be applied.

Recall the domains, denoted R_{x_i}, presented in section 3.1 pertaining to the CS method for linear models. Let R*_{x_i} be the smallest hyperrectangle of the CS form that contains the c_j, ∀ j = 1, . . . , p. A set of this form is utilized in the following theorem.

Theorem 3.2 Under the GLM setting described in (2.1), the asymptotic simultaneous coverage probability of the bands (3.28) has a lower bound of 1 − α for d = d_CS when d_CS is computed for R_{x_i} = R*_{x_i}. The same holds for the bands (3.28) when β̂ is replaced by β̂*, the pMLE of β.

Proof: Note that this result holds for any x_i ∈ 𝒳_i and that 𝒳_i ⊃ R_{x_i} = R*_{x_i}, where the vectors c_j = (c_{j1}, . . . , c_{jk}), j = 1, . . . , p, are embedded in the hyperrectangle R*_{x_i}. Thus each c_j, j = 1, . . . , p, is contained in 𝒳_i, and consequently utilizing d = d_CS in (3.28) guarantees at least 100(1 − α)% simultaneous coverage asymptotically for the p intervals of interest.
Also note that the limiting distributions of β̂* and β̂ are identical. Thus the asymptotic coverage of the bands (3.28) based on β̂* is the same as that of those based on β̂.

Now let 𝒳* be the smallest compact subset of the domain such that c_j ∈ 𝒳*, ∀ j = 1, . . . , p.

Theorem 3.3 Under the GLM setting described in (2.1), the asymptotic simultaneous coverage probability of the bands (3.28) has a lower bound of 1 − α for d = d_TUBE, d = d_SCR1, and d = d_SCR2, where these critical values are computed for 𝒳 = 𝒳*. For β̂ = β̂*, the pMLE of β, the same holds for d = d_TUBE.

Proof: Note that this result holds for any x_i ∈ 𝒳 = 𝒳*. The vectors c_j = (c_{j1}, . . . , c_{jk}), j = 1, . . . , p, are embedded in the set 𝒳*. Thus c_j ∈ 𝒳 = 𝒳* ∀ j, and consequently utilizing d = d_TUBE, d = d_SCR1, or d = d_SCR2 in (3.28) guarantees at least 100(1 − α)% simultaneous coverage asymptotically for the intervals. Moreover, since the limiting distributions of β̂* and β̂ are identical, the asymptotic coverage of the bands (3.28) based on β̂* is the same as that of those based on β̂.

Theorem 3.4 Under the GLM setting described in (2.1), the asymptotic simultaneous coverage probability of the bands

c_j′β̂ − κ̂_1(c_j)σ̂_GLM(c_j) ± d_TUBE σ̂_GLM(c_j)√(κ̂_2(c_j)) (3.29)

where σ̂_GLM(c_j) is given by (2.6), has a lower bound of 1 − α, with κ̂_1(c_j) and κ̂_2(c_j) defined as stated in section 3.1 and 𝒳 = 𝒳*. The same holds for β̂ = β̂*, the pMLE of β, with κ̂_1(c_j) and κ̂_2(c_j) defined appropriately for β̂* and 𝒳 again equal to 𝒳*.

Proof: Note that this result holds for any x_i ∈ 𝒳*. The vectors c_j = (c_{j1}, . . . , c_{jk}), j = 1, . . . , p, are embedded in the set 𝒳*. Thus c_j ∈ 𝒳 = 𝒳* ∀ j, and consequently the intervals in (3.29) utilizing β̂, the usual MLE of β, guarantee at least 100(1 − α)% simultaneous coverage asymptotically. Since the limiting distributions of β̂* and β̂ are identical, the asymptotic coverage of the bands (3.29) based on β̂* is the same as that of those based on β̂.
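Once a critical value d and the estimated covariance matrix of β̂ are in hand, the bands (3.28) reduce to a few lines of linear algebra. A minimal sketch (Python; here σ̂_GLM(c_j) is taken to be (c_j′V̂c_j)^{1/2}, matching the variance structure used above, and d is assumed precomputed — function and variable names are illustrative):

```python
import numpy as np

def simultaneous_contrast_bands(C, beta_hat, V_hat, d):
    """Bands of the form (3.28): c_j' beta_hat +/- d * sigma_hat(c_j).

    C        : (p, k) matrix whose rows are the contrast vectors c_j'
    beta_hat : (k,) estimated regression parameters (MLE or pMLE)
    V_hat    : (k, k) estimated covariance matrix of beta_hat
    d        : simultaneous critical value (e.g. d_CS or d_TUBE)
    """
    est = C @ beta_hat
    # standard errors sqrt(c_j' V_hat c_j), one per contrast
    se = np.sqrt(np.einsum('jk,kl,jl->j', C, V_hat, C))
    return est - d * se, est + d * se
```

For a canonical-link model, exponentiating these limits gives simultaneous bands for odds ratios (logit) or relative risks (log link), as described in the illustrations that follow.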
3.2.2.2 Illustrations of Simultaneous Procedures for Particular Scenarios

The simultaneous procedures for estimating any specified combination of the model parameters from a GLM include the following: SCR (all four forms), PC (restricted-Scheffé), pPC (pMLE restricted-Scheffé), and both pSCR bounds. The relevant bounding procedures will be demonstrated for the odds ratio, relative risk, and attributable proportion, but other quantities of interest could also be estimated. Since the estimated model is assumed to be a GLM, it is possible to apply any of the proposed eight confidence bounds to the general expected response function. Following the procedure described in section 3.1, one could find the bounds for any GLM on a restricted domain and then apply the link function. For example, a GLM is generally given by

g(E(Y_i | x_i)) = x_i′β, i = 1, . . . , n.

We will outline the procedure for some common link functions, the log and the logit. Note that it is possible to simultaneously estimate any specified set of linear combinations of the regression parameters for a specified GLM using any of the eight methods described previously in this chapter.

As an example, recall the preterm birth study discussed in Chapter 1. This study employed a loglinear model in which the reference level was the case with no apparent source of maternal stress. Recall the model was given by (2.14); thus let β = (β_0, β_1, . . . , β_k) be the (k + 1) × 1 parameter vector, where for this example k = 3. In this study, the relative risks for each level of the source of maternal stress compared back to the cases with no identified maternal stress factor were of primary interest and, as discussed previously, the overall conclusions made in the study merit simultaneous estimation of these relative risks. We focused specifically on the variable indicating "Life Event" stress (see Table 1.1).
This factor had four overall levels for the independent variable indicating the presence of any "Life Event" maternal stress. To utilize the procedure outlined in section 3.2.2.1, we need to define the matrix C_{3×4} = (c_1, . . . , c_3)′. Specifically, let c_j = (c_{j1}, c_{j2}, . . . , c_{j4}), where c_{j,i} = 0 for i ≠ j + 1 and c_{j,j+1} = 1, i = 1, . . . , 4; j = 1, . . . , 3. Since β_j is the log of the relative risk for the jth level of the independent variable, Cβ = (β_1, β_2, β_3)′ is the vector of log relative risks for "Life Event" stress in the preterm birth study. The C matrix that yields Cβ equal to the log relative risks for other reference-coded Poisson models would be defined similarly. Notice that in our example 𝒳 will be the three-dimensional subspace where x_1 = 0 and Σ_{i=2}^{4} x_i ≤ 1. (Note that c_j, j = 1, . . . , 3, is contained in this 𝒳.) Asymptotic simultaneous 100(1 − α)% confidence bands for the log relative risks of interest in the preterm birth study can now be formed by utilizing (3.28) or (3.29) with an appropriate critical value and β̂ or β̂* as warranted. To obtain asymptotic simultaneous confidence bands on the relative risks themselves, the bands on the log relative risks are exponentiated. These intervals allow us to make overall conclusions with a specified asymptotic error rate.

Alternatively, suppose relative risks are of interest using a slightly different model. Specifically, model-based relative risks can be estimated utilizing a Poisson regression on a proportion. Thus, assume a simple model of the form

log(π(x_i)) = β_0 + x_i′β = x*_i′β* for i = 1, . . . , n,

where β*_{(k+1)×1} = (β_0, β_1, . . . , β_k)′ and x*_i = (1, x_{i1}, . . . , x_{ik})′ is a predictor vector of dimension k + 1. Then the relative risk is easily estimated by exponentiating any one of the regression parameters, e^{β_i} (i = 1, . . . , k).
Thus, in order to estimate the ith relative risk, let the matrix C be defined as described in section 3.2.2.1, with each c_{j,i} = 0 when i ≠ j + 1 and c_{j,j+1} = 1, i = 1, . . . , k + 1; j = 1, . . . , p. Again we can obtain asymptotic simultaneous 100(1 − α)% confidence bands for the relative risks by utilizing (3.28) or (3.29) with an appropriate critical value and β̂ or β̂* as warranted. Here 𝒳 would be similar to that for the skin cancer study. To complete the calculations, the resulting bounds would be exponentiated, as they were for the odds ratio calculations.

Once relative risks are estimated from a Poisson regression model, it may be helpful to additionally estimate the attributable proportions. Recall that the relative risk is given by γ_j = exp(c_j′β) for j = 1, . . . , p. As the attributable proportion is a one-to-one increasing function of the relative risk, the following holds:

κ_j = 1 − 1/γ_j = (γ_j − 1)/γ_j, j = 1, . . . , p,

where κ is the attributable proportion and γ is the relative risk. Suppose the relative risk interval has lower and upper limits denoted L_RR and U_RR, respectively. Then the limits on the attributable proportion are given by

((L_RR − 1)/L_RR, (U_RR − 1)/U_RR). (3.30)

If (L_RR, U_RR) were obtained with simultaneous coverage for a specified set, then the same simultaneous coverage properties hold for the equivalent set of attributable proportions.

Note that all the previous examples assumed the estimated quantities refer back to the reference level. However, this general methodology can be extended to make other kinds of comparisons between the estimated quantities. For example, if we consider a logistic model with reference coding, we could consider an alternative set of odds ratios for joint estimation. Recall that every e^{β_i} is the odds ratio for the ith level compared to the reference level (or control).
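For the reference-coded setting above, the construction of C, the exponentiation to the relative-risk scale, and the attributable-proportion transform (3.30) can be sketched as follows. The fitted quantities below are placeholders for illustration only, not values from the preterm birth study, and the critical value d is assumed precomputed:

```python
import numpy as np

# Contrast matrix picking off beta_1, ..., beta_p from (beta_0, ..., beta_k):
# c_{j,j+1} = 1 and all other entries 0, as described above.
k, p = 3, 3
C = np.zeros((p, k + 1))
C[np.arange(p), np.arange(1, p + 1)] = 1.0

# Hypothetical fitted quantities (placeholders, for illustration only).
beta_hat = np.array([-0.5, 0.2, 0.4, 0.1])   # (beta_0, beta_1, beta_2, beta_3)
se = np.array([0.10, 0.12, 0.15])            # sigma_hat(c_j), j = 1, ..., p
d = 2.5                                      # simultaneous critical value

log_rr_lo = C @ beta_hat - d * se
log_rr_hi = C @ beta_hat + d * se
rr_lo, rr_hi = np.exp(log_rr_lo), np.exp(log_rr_hi)  # relative-risk bands

# Attributable-proportion limits via (3.30): (L-1)/L and (U-1)/U.
ap_lo, ap_hi = (rr_lo - 1.0) / rr_lo, (rr_hi - 1.0) / rr_hi
```

Because κ = (γ − 1)/γ is increasing in γ, the simultaneous coverage of the relative-risk bands carries over unchanged to the attributable-proportion bands.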
Alternatively, suppose it was of interest to estimate the odds ratios comparing the first nonreference level of the covariate to every other nonreference level. These odds ratios could be estimated via e^{β_1 − β_j} for j = 2, . . . , k. Thus, the contrast matrix C would have rows of the form c_j = (0, 1, 0, . . . , 0, −1, 0, . . . , 0). Then via (3.28) or (3.29) we could estimate the log odds ratios simultaneously for an appropriate d and β̂ or β̂* as warranted. Simply exponentiating the results gives simultaneous bands for this particular set of odds ratios.

CHAPTER 4 Simulations

In order to evaluate the performance of the proposed restricted-Scheffé bounds, pMLE Scheffé bounds, and pMLE SCR bounds, I conducted Monte Carlo simulations. I ran two main simulation studies: simulations for the simultaneous estimation of the expected response function, and simulations for the simultaneous estimation of a linear function of the regression parameters, which easily gives simultaneous intervals on the odds ratios, relative risks, or attributable proportions via transformation. Some recommendations are provided on the choice of intervals for various scenarios. Additionally, an example of this kind of transformation is given in section 5.2.

4.1 Expected Response Function Simulations

These simulations assume only one predictor variable, so that the vector of parameters is of the general form β = (β_0, β_1)′. The estimated coverage, or alternatively the estimated error, was recorded for each scenario simulated. I simulated scenarios for various values of the parameter β, the sample size n, and the predictor variable x. Simulations focused on the most commonly utilized GLMs, logistic regression and Poisson models, although other models could be investigated. Simulations included the following regression parameters: (1) β = (−1, −0.5), (2) β = (0, 1), (3) β = (2, 4), and (4) β = (−0.5, −1).
The sample sizes were n = 10, 25, and 50. The domain 𝒳 for both logit and Poisson models is continuous. For the logit model, I generated points equally spaced over wide and narrow intervals. First, values of π were chosen to determine either a wide or a narrow range for the domain: for a wide domain, π_L = 0.1 and π_U = 0.9, and for a narrow domain, π_L = 0.25 and π_U = 0.75. The interval endpoints were then determined by inverting the GLM, so that x = (logit(π) − β_0)/β_1, where β_0 and β_1 are the parameters. Thus, for the logit model, x_L = (logit(π_L) − β_0)/β_1 and x_U = (logit(π_U) − β_0)/β_1 are the endpoints of the domain. Then let x_1 = x_L and x_n = x_U, with the remaining points equally spaced between x_1 and x_n. Thus X = (x_1, x_2, . . . , x_n) is a finite set of points contained in 𝒳 = {x : x_L ≤ x ≤ x_U}.

In order to simulate the response, a set of response variables Y_i, i = 1, . . . , n, was generated in the following manner. First, we generated Uniform(0,1) random variables U_i, i = 1, . . . , n. Then, when generating data for a logit model, we let the response variable Y_i be 1 if U_i < π(x_i) and 0 otherwise, where π(x_i) = 1/(1 + e^{−(β_0 + β_1 x_i)}) for i = 1, . . . , n. For the Poisson model, I generated a set of uniform random variables for the X values: the wide domain was distributed Uniform(0,1), while the narrow domain was distributed Uniform(0,0.5). To generate the response variable for a Poisson regression model, Y_i is a Poisson random variable with mean μ(x_i) = e^{β_0 + β_1 x_i} for each i = 1, . . . , n. For both models, I then used this set of Y_i values and the design points in 𝒳 to compute the estimated parameter values and estimated covariance matrix. These were also utilized to obtain the equations for the bounds over 𝒳 for each method evaluated. Note that we are not generating binomial, multinomial, or Poisson random variables and fitting the model to the generated observations; rather, we are assuming the model holds exactly.
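The logit-model design just described can be sketched as follows (Python; the function name and seed handling are illustrative, not the dissertation's code):

```python
import numpy as np

def logit(p):
    """log-odds transform."""
    return np.log(p / (1.0 - p))

def make_logit_sample(beta, n, pi_lo=0.1, pi_hi=0.9, rng=None):
    """Sketch of the simulation design above: the domain endpoints are found
    by inverting the logit model, x = (logit(pi) - b0)/b1, the design points
    are equally spaced, and each binary response is 1 when a Uniform(0,1)
    draw falls below pi(x)."""
    rng = np.random.default_rng(rng)
    b0, b1 = beta
    x_lo = (logit(pi_lo) - b0) / b1
    x_hi = (logit(pi_hi) - b0) / b1
    x = np.linspace(x_lo, x_hi, n)                # equally spaced design points
    pi = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))     # true response probabilities
    y = (rng.uniform(size=n) < pi).astype(int)    # Bernoulli responses
    return x, pi, y
```

By construction, the true probabilities at the two endpoints of the grid equal π_L and π_U; the Poisson design differs only in drawing the x values uniformly and generating Poisson counts with mean e^{β₀+β₁x}.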
For situations where the logistic and Poisson models do not work well, these methods could perform quite poorly. For each sample I note whether the estimated bounds cover the true response function at every point of X, the finite set contained in 𝒳. I then estimate the simultaneous coverage by the empirical coverage over the simulated samples. Additionally, in order to determine the number of simulated samples required to estimate the error to within ±0.005, we calculated a lower bound on the number of simulations. If we are willing to tolerate 5% type I error (α = 0.05), a lower bound on the number of simulations may be determined via [(0.05)(0.95)/ρ]^{1/2} < 0.005, where ρ is the number of simulations [7]. This yields ρ > 1900; thus we ran 5000 simulations to reduce error.

4.1.1 Expected Response Function Simulation Results

The most significant distinction among the competing intervals with regard to the empirical confidence level was the estimator used, MLE or pMLE. See Figure 4.1 for a plot comparing two MLE intervals to two pMLE intervals.

[Figure 4.1: MLE and pMLE SCR intervals with wide range. Logit model with β = (1, 0.5); empirical confidence level versus sample size (n = 10, 25, 50) for the Tube MLE, Tube pMLE, PC MLE, and PC pMLE intervals.]

Generally, the pMLE-based intervals achieved the desired level of confidence at all sample sizes. In contrast, the MLE-based intervals as a group reached the desired level of confidence only for some cases with n = 25 or 50. Clearly, the ability of the pMLE intervals to achieve the desired confidence level at any sample size is an improvement over the usual MLE intervals. Though the pMLE-based intervals are more reliable, they are also extremely conservative, particularly at small sample sizes (again, see Figure 4.1 for an example). This could be due to the computation of the biasing constant used to shift the likelihood equations when solving for the pMLE.
At the larger sample sizes the pMLE bias adjustment is more precise, while at small sample sizes this adjustment is more conservative. Recall that we applied the pMLE estimator to the restricted-Scheffé, naive SCR, and SCR3 intervals. Interestingly, the reshifted and rescaled SCR interval (SCR3) does not perform as well as the naive tube-based SCR interval or any of the other intervals. See Figure 4.2 for an example; behavior similar to that in Figure 4.2 was observed for other parameter sets.

[Figure 4.2: pMLE intervals; logit model (wide range) with B=(1, 0.5). The plot shows empirical confidence level versus sample size (n=10, 25, 50) for the SCR1, SCR2, PC, and Scheffé intervals.]

Clearly, when using a bias-preventing estimator, trying to correct the remaining bias is not helpful, and in fact is often detrimental. Though the pMLE SCR3 intervals are not usually a good alternative to the MLE based intervals, the other pMLE intervals (restricted-Scheffé and naive SCR) perform far better than the MLE intervals in all cases. See both Figure 4.1 and Figure 4.2 for examples. As the sample size increases to moderately large sizes (n=25, 50, or 100), the MLE based intervals as a whole do reach the desired level of confidence, and the pMLE based estimators' empirical confidence level decreases to a level much closer to the intended level of confidence (see Figure 4.1); in many cases the pMLE based intervals (either PC or naive SCR) are actually less conservative than the usual MLE based intervals. While the pMLE based intervals were intended to address poor coverage at small sample sizes, these intervals also appear to improve upon the conservative nature of the usual MLE based intervals at moderate sample sizes. Thus, at these moderate to large sample sizes the pMLE based intervals still attain the desired level of confidence but, in general, do not overreach the desired confidence level as the usual MLE based intervals often do.
Note that so far I have presented only one parameter case for the logit model. Recall that I investigated four sets of parameters for the logit model and three sets of parameters for the Poisson model. The plots comparing the two superior pMLE intervals to the corresponding MLE intervals are displayed in Figures 4.3 to 4.15 in this section. Generally, for the logit model simulations, we see behavior very similar to that in the plots analyzed previously. However, for some choices of the regression parameters, a sample size of 100 was required to achieve the desired confidence level with the naive tube MLE interval. See Figures 4.3 to 4.5. With regard to the domain used in the estimation of the interval, when a wide domain was assumed for estimation of the GLM, the MLE based intervals needed larger sample sizes to attain the desired level of confidence than when the domain was narrow. This held for the logit model in particular (see Figure 4.3). However, this trend did not translate to the pMLE based intervals in general, where the domain of interest really only affected how conservative the intervals were.
The choice of the four parameter sets did not significantly change the overall trends observed for the competing intervals. Each plot below shows the empirical confidence level versus sample size (n=10, 25, 50) for the Tube MLE, Tube pMLE, PC MLE, and PC pMLE intervals.

[Figure 4.3: MLE and pMLE SCR intervals; logit model (wide range) with B=(0, 1)]
[Figure 4.4: MLE and pMLE SCR intervals; logit model (wide range) with B=(0.5, 0.25)]
[Figure 4.5: MLE and pMLE SCR intervals; logit model (wide range) with B=(2, 4)]
[Figure 4.6: MLE and pMLE SCR intervals; logit model (narrow range) with B=(1, 0.5)]
[Figure 4.7: MLE and pMLE SCR intervals; logit model (narrow range) with B=(0, 1)]
[Figure 4.8: MLE and pMLE SCR intervals; logit model (narrow range) with B=(0.5, 0.25)]
[Figure 4.9: MLE and pMLE SCR intervals; logit model (narrow range) with B=(2, 4)]
[Figure 4.10: MLE and pMLE SCR intervals; Poisson model (wide range) with B=(1, 0.5)]
[Figure 4.11: MLE and pMLE SCR intervals; Poisson model (wide range) with B=(0, 1)]
[Figure 4.12: MLE and pMLE SCR intervals; Poisson model (wide range) with B=(0.5, 0.25)]
[Figure 4.13: MLE and pMLE SCR intervals; Poisson model (narrow range) with B=(1, 0.5)]
[Figure 4.14: MLE and pMLE SCR intervals; Poisson model (narrow range) with B=(0, 1)]
[Figure 4.15: MLE and pMLE SCR intervals; Poisson model (narrow range) with B=(0.5, 0.25)]

In contrast, the choice of the link function did. The logistic models were in general less conservative than the Poisson models (compare Figures 4.1 and 4.10). The performance of the MLE and competing pMLE intervals was most distinct when investigating the logistic regression model. Conversely, the Poisson model empirical confidence levels did not vary greatly across the sample sizes when the pMLE based intervals were utilized. Even the MLE based intervals were conservative at times for this link function, and for many cases they attained the desired level of confidence at the smallest sample sizes (see Figure 4.10 as an example). Thus, many of the comments made about the logistic regression models do not apply when considering Poisson regression models. In general, little distinction can be made among the competing intervals. At times, the MLE based intervals appear to be a little less conservative than the pMLE based intervals, but often that difference is negligible. I believe that the conservative nature of all the methods for estimating the mean response of a Poisson model needs to be addressed.
This conservative behavior may be due to the tendency of Poisson regression models to overestimate the variance of the regression parameters (see [15] and [16]). A sandwich estimator of the variance should be studied to potentially correct this problem [17]. A sandwich variance estimator, along with generalized estimating equations (GEEs) for estimating the model parameters, can correct a misspecified variance function for Poisson models. The sandwich estimator is a robust variance estimator that consistently estimates the true variance when the parametric model fails to hold. Since the parameters estimated from GEEs are asymptotically normal when utilizing the sandwich covariance matrix, we may directly apply the usual MLE based interval methods. Though the MLE intervals are not of primary concern, it should be noted that among the MLE methods, as expected, the Scheffé interval is the most conservative of all. Yet little distinction can be made among the other methods, except that the SCR3 intervals tend not to reach the desired level of confidence as quickly as the other SCR intervals. See Appendix A for examples. However, these trends do not translate to the intervals utilizing the pMLE estimator since: 1) not all SCR intervals are employed using this estimator and thus cannot be compared, and 2) the bias-preventing estimate does drastically change the behavior of each interval overall.

4.2 Functions of the Parameters Simulations

In order to estimate the error associated with the confidence regions for estimating the set of regression parameters, and hence odds ratios, relative risks, or attributable proportions, I will again assume only one predictor variable. In general, the single predictor variable is assumed to be categorical. However, using reference cell coding for one categorical predictor entails utilizing several binary predictor variables. This will be taken into account in the simulations.
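Reference cell coding as just described can be sketched as follows (a minimal illustration; the function name and use of NumPy are my own):

```python
import numpy as np

def reference_cell_design(levels, control):
    """Build a design matrix for one categorical predictor using
    reference cell coding: the control level maps to all-zero dummy
    columns, so each slope is a log odds ratio versus the control."""
    codes = [lv for lv in sorted(set(levels)) if lv != control]
    X = np.zeros((len(levels), 1 + len(codes)))
    X[:, 0] = 1.0                          # intercept column
    for i, lv in enumerate(levels):
        if lv != control:
            X[i, 1 + codes.index(lv)] = 1.0
    return X
```

With k+1 levels, one of which is the control, this yields k binary predictor columns and hence k slope parameters to be simultaneously estimated.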
When evaluating the odds ratio we also need to consider what type of multiple comparisons could be of interest. I will consider only contrasts corresponding to comparisons with a control. For example, comparisons with a control would require simultaneously estimating the odds ratio for each non-reference level with the reference level from a logistic regression model when reference cell coding is utilized. This entails simultaneously estimating all slope parameters. Recall that the control is the reference level; thus, if we have k+1 regression parameters, there will be k comparisons, or k odds ratios, to be simultaneously estimated. Thus we are simultaneously estimating e^(βi) for every i = 1, . . . , k. Similarly, the relative risks for each level of a predictor variable with reference to the control could be investigated for a Poisson regression model. The attributable proportion also could be observed for both a Poisson regression model and a logistic regression model. These too are one-to-one functions of the k slope parameters. We investigated both logistic and Poisson regression models for evaluating the limits on the estimated odds ratios, relative risks, or attributable proportions. In this scenario the generation of the X data set and Yi, i = 1, . . . , n, was identical to that described in Section 4.1. However, in this case we evaluated the estimated coverage of the simultaneous confidence bounds for the discrete set of interest only. This coverage was estimated in an analogous manner to that for the expected response. Namely, the data were generated, the model was estimated, and finally the intervals for the slope parameters were constructed. Each time the interval captured the true parameter value, a success was recorded, and the number of captures out of all k comparisons was recorded. This was repeated 5000 times and the average empirical confidence level was recorded.
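The coverage-estimation loop just described can be sketched as follows. This is a schematic only: the toy interval constructor (k = 4 normal means with known unit variance and Bonferroni-adjusted 95% intervals) stands in for the actual GLM fit and SCR/PC interval computations, which are not reproduced here, and all names are my own.

```python
import numpy as np

def empirical_simultaneous_coverage(true_beta, interval_fn, n_sims=2000, rng=None):
    """Fraction of simulated samples whose intervals capture all k true
    slope parameters at once (one reading of the recording scheme above)."""
    rng = np.random.default_rng(rng)
    hits = 0
    for _ in range(n_sims):
        lo, hi = interval_fn(rng)              # one simulated sample's intervals
        if np.all((lo <= true_beta) & (true_beta <= hi)):
            hits += 1
    return hits / n_sims

# Toy stand-in for "fit the model, then build simultaneous intervals".
TRUE_BETA = np.array([0.5, 0.25, 0.5, 0.25])   # k = 4 slope parameters

def toy_intervals(rng, n=50):
    z = 2.4977                                  # two-sided normal critical value for alpha/k = 0.05/4
    est = TRUE_BETA + rng.normal(scale=1 / np.sqrt(n), size=TRUE_BETA.size)
    half = z / np.sqrt(n)
    return est - half, est + half
```

Replacing `toy_intervals` with a routine that regenerates the data, fits the GLM, and builds SCR or PC bounds reproduces the simulation design described above.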
For the purpose of the simulations, I considered k = 4, where k is the number of estimated parameters (slope parameters in this special case). The sample sizes considered are n=50, 100, 200, and 300, and α = 0.05. Also, as in Section 4.1, I recorded the empirical confidence level for each method considered. At times, the estimated covariance matrix was near-singular. This means that the covariance matrix was estimated to be a quantity such that, when the calculations for computing the inverse of the matrix were begun, an error occurred in the LU factorization. An error is returned in this case by the Fortran compiler. Additionally, the model is ill-fitting if the response vector is either all 1's or all 0's. Thus, cases where the response vector was all 0's or all 1's, or where the covariance matrix was singular or near-singular, were recorded and the data were regenerated. When n=50, there were 252 such cases that were thrown out and the data regenerated. When n=100, 200, or 300, no cases of near-singular matrices or all-0 or all-1 response vectors occurred.

4.2.1 Functions of the Parameters Simulation Results

As with the intervals on the mean response, performance of the intervals on the parameters was most affected by the estimator utilized, MLE or pMLE. In general, when the pMLE estimator was used for any interval method, the desired level of confidence was reached at any sample size (see Figure 4.16 or Figure 4.17).

[Figure 4.16: MLE and pMLE SCR intervals for the parameters; logit model with B=(1, 0.5, 0.25, 0.5, 0.25). The plot shows empirical confidence level versus sample size (n=50, 100, 200, 300) for the Tube MLE, Tube pMLE, PC MLE, and PC pMLE intervals.]

[Figure 4.17: MLE and pMLE SCR intervals for the parameters; Poisson model with B=(1, 0.5, 0.25, 0.5, 0.25).]

In contrast, the MLE based intervals did not in general attain the desired confidence level until either n = 200 or n = 300 for most cases studied.
At the smallest sample size (n = 50), the pMLE based interval attained the desired level of confidence and was not overly conservative (see Figure 4.16 or Figure 4.17). As the sample size increased, the pMLE based interval's empirical confidence level slowly approached the desired confidence level, just as we saw with the mean response simulations. At the largest sample size simulated (n = 300), the pMLE based interval had an empirical confidence level between those of the various MLE intervals. In general, only the MLE Scheffé and PC intervals are more conservative than the two pMLE intervals, while all the MLE SCR intervals are a little less conservative. Thus, whether a practitioner utilizes the MLE or pMLE based intervals makes little difference at these larger sample sizes.

CHAPTER 5

CONCLUSIONS

5.1 Application of Proposed Methods

Using practical examples, with emphasis placed on implementation of the procedures and interpretation of the results, I will now illustrate how to utilize the proposed intervals. One example focuses on estimating the mean response of a GLM, while the other utilizes a GLM with reference-coded binary predictors to estimate a set of odds ratios.

5.1.1 Diabetes among Pima Women

Diabetes is a common disease among females of the Pima culture in Arizona. A study was conducted to better understand the incidence of diabetes in the population of Pima women [18]. One explanatory variable that is believed to be associated with diabetes in young Pima women, age 24 and younger, is the plasma glucose concentration. The glucose concentration was measured with an oral glucose tolerance test on the 51 Pima women aged 24 or younger. A binary variable indicates presence of diabetes in the young women (0=no diabetes and 1=diabetes); thus, a simple logistic regression model is reasonable. Table 5.1 contains the estimated probability of diabetes for women with high glucose readings (140 and above).
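The point estimates in Table 5.1 come from inverting the fitted link. As a quick sketch (coefficients taken from the fitted pMLE model reported below, logit(π̂) = −10.826 + 0.0701X; since the printed coefficients are rounded, the check is only approximate):

```python
import math

def inv_logit(eta):
    """Inverse logit: converts a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Fitted pMLE coefficients as printed in the text (rounded).
b0, b1 = -10.826, 0.0701

pi_140 = inv_logit(b0 + b1 * 140)  # roughly 0.27, cf. Table 5.1's 0.268
pi_199 = inv_logit(b0 + b1 * 199)  # roughly 0.96, cf. Table 5.1's 0.958
```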
Individual 95% confidence intervals were calculated for each proportion, as well as naive tube or SCR1 intervals with the restriction that the glucose level was greater than 140. Note that each proportion in Table 5.1 uses the notation π_glucoselevel, with the estimated proportion giving the probability of diabetes for an individual Pima woman with that glucose level.

Table 5.1: Intervals for Proportion of Pima Women with Diabetes

Point Estimate    95% MLE Individual            95% pMLE SCR1 Simultaneous
for π             Confidence Intervals for π    Confidence Intervals for π
π̂140 = 0.268      (0.109, 0.523)                (0.101, 0.561)
π̂142 = 0.297      (0.120, 0.567)                (0.108, 0.598)
π̂143 = 0.312      (0.125, 0.589)                (0.111, 0.616)
π̂148 = 0.391      (0.155, 0.693)                (0.128, 0.703)
π̂151 = 0.443      (0.175, 0.748)                (0.140, 0.750)
π̂154 = 0.495      (0.197, 0.796)                (0.151, 0.792)
π̂177 = 0.831      (0.417, 0.971)                (0.259, 0.962)
π̂188 = 0.914      (0.540, 0.990)                (0.322, 0.984)
π̂199 = 0.958      (0.658, 0.996)                (0.391, 0.994)

A discussion of the results follows the presented interval estimates in Table 5.1. First note that, as expected, the individual confidence intervals are narrower than the simultaneous intervals. Thus, using the individual confidence intervals, it will be easier to reject certain proportions as the true value. Note that the proportion estimates given in Table 5.1 are from the pMLE estimated model. The MLE model was logit(π̂) = −10.840 + 0.0703X, while the pMLE model was logit(π̂) = −10.826 + 0.0701X. The change in the length and center of the intervals could lead to very different conclusions given certain research questions.

5.1.2 Depression in Adolescents

A study was conducted on the classification of depression (high or low) among adolescents between the ages of 12 and 18 with either learning disabilities (LD) or serious emotional disturbances (SED) [19]. Six risk factors for high levels of depression were considered (as a combination of the age (12-14, 15-16, 17-18) factor and the group (LD and SED) factor).
Thus, there were a total of 6 levels for a single categorical predictor variable. In the text Epidemiological Research Methods [19], it is suggested that a logistic regression model is appropriate for these data. The estimated odds ratios are provided to ascertain any differences in the level of depression for the different groups. Simultaneous estimation of the odds ratios from the logistic regression model should be considered here, since it is reasonable to make conclusions about the group with the highest or lowest odds of high levels of depression. As suggested in the text [19], the group with the lowest risk of high levels of depression (17-18, SED) was the referent or control category. The reference coding takes this into account, so that every log odds ratio refers back to that baseline category. The estimated odds ratios corresponding to the 5 estimated slope coefficients are given in Table 5.2. Note that β_age,condition and θ_age,condition refer to the parameter or odds ratio, respectively, for an adolescent of a particular age group and condition. Additionally, Table 5.2 contains the estimated model odds ratios comparing the odds of high levels of depression for each risk category with reference to the 17-18, SED category, along with the individual and restricted-Scheffé or PC simultaneous intervals. All point estimates utilize the pMLE estimated model. Note that the 95% individual confidence intervals demonstrated an odds ratio significantly different from 1 for the case where the adolescent was aged 12-14 with serious emotional disturbances. Once simultaneous adjustments are made, the significant association no longer exists. This is an example of a case where the one-at-a-time intervals and simultaneous intervals contradict one another, and care should be taken about which methodology is appropriate.
For instance, in order to say that the odds of high levels of depression are highest among the group aged 12 to 14 with serious emotional disturbances, simultaneous intervals would need to be computed. Clearly, that conclusion should not be made for this study, since it is not supported by the simultaneous intervals.

Table 5.2: Intervals for Odds Ratio for Depression in Adolescents

Point Estimate        95% MLE Individual           95% pMLE PC Simultaneous
for Odds Ratio        Confidence Interval for θ    Confidence Interval for θ
θ̂12−14,LD = 1.939     (0.812, 4.627)               (0.384, 8.926)
θ̂12−14,SED = 4.683    (1.642, 13.383)              (0.671, 29.874)
θ̂15−16,LD = 1.616     (0.651, 4.102)               (0.300, 8.053)
θ̂15−16,SED = 1.458    (0.520, 4.088)               (0.222, 9.189)
θ̂17−18,LD = 1.844     (0.696, 4.884)               (0.307, 10.381)
θ̂17−18,SED = 1.00     (reference)                  (reference)

5.2 Overall Conclusions

The proposed pMLE based intervals, including the pScheffé, pMLE restricted-Scheffé, and pSCR1 intervals, did improve small sample estimation over the usual MLE based intervals. As demonstrated, the usual MLE based intervals often could not attain the desired level of confidence for what was considered a small sample size for the varying models. In contrast, the pMLE intervals did attain the desired level of confidence. However, the penalty for using the pMLE based intervals is that these intervals are very conservative at these small sample sizes. As the sample size increases to moderate levels, the distinction between the MLE and pMLE based intervals lessens, but the pMLE intervals, as a whole, tend to be less conservative than the MLE based intervals.

5.2.1 Recommendations for Estimating the Mean Response

When selecting an interval method for a set of comparisons, attention should be paid to what the set of predictor variables is. For instance, when there are many comparisons and interest is focused on the entire estimation space, the Scheffé intervals would ensure the desired level of confidence.
However, if a restricted interval of the predictor variable is of interest, or a finite number of discrete points embedded in a continuous domain is of interest, then the restricted-Scheffé or one of the SCR methods, respectively, is most advantageous. In summary, the recommendation is to use pMLE intervals for all small to moderate sample sizes. Specifically, the pMLE SCR1 intervals were the least conservative of all the pMLE based intervals and are thus the best choice. However, the restricted-Scheffé pMLE intervals are a good second choice, offering far greater computational ease. Then, for moderate and larger sample sizes, the pMLE based intervals are again recommended, though the behavior of these intervals has not been studied for any sample size greater than 50. Again, the SCR1 pMLE interval is the overall best choice, though even less distinction may be made between the pMLE intervals at the larger sample sizes.

5.2.2 Recommendations for Estimating the Parameters

Overall, for small to moderate sample sizes, the SCR1 pMLE interval attains the desired level of confidence while not being as conservative as the PC pMLE interval. Thus, for most cases where the sample size is greater than 50, I recommend utilizing a naive tube critical point with the pMLE estimators. For sample sizes smaller than 50, the cautious choice would be the PC pMLE interval. This is a slightly conservative interval, yet it attains the desired level of confidence, unlike any other competing interval. I would not recommend using the MLE based intervals at smaller sample sizes. If the sample size is moderate to large (n=200 or 300), there is little observed difference between the pMLE and MLE intervals and thus, for convenience, the MLE intervals may justifiably be utilized.

5.3 Future Research

There are many avenues to explore in the future that are related to this research topic. Some of these include performing in-depth simulations of these methods for more complex GLMs.
Namely, if a GLM has a predictor matrix, given by X, that is a mix of categorical and continuous predictors, then nothing is known about how the aforementioned simultaneous procedures would behave. There are a host of other issues to explore as well when we have a predictor matrix such as X. For instance, what multiple comparison techniques are applicable, what should we compare, and how do we adjust for the other variables in the model? Additionally, for the single predictor variable case, other configurations of the contrast matrix C should be considered. For example, these other forms of C could be utilized to assess how the methods perform for all-pairwise types of comparisons. Also, GLMs with interaction and quadratic terms need to be explored. This ent


