Missing covariates

12 messages 8 people Latest: Jul 05, 2001

Missing covariates

From: Atul Bhattaram Venkatesh Date: July 02, 2001 technical
From: bvatul <bvatul@ufl.edu> Subject: Missing covariates Date: Mon, 02 Jul 2001 10:22:08 -0400

Hello All,

Could somebody please clarify this: I am analysing a data set in which I have covariates (age, ht, wt, crcl) for 70% of the patients. Is it a good idea to substitute the median values of these covariates for the missing covariates, i.e., in patients for whom I don't have the covariates? If we substitute the median values and analyse the covariates, is it still necessary to center the covariates? Are there any reported papers where the addition of missing covariates has led to misinterpretation of data? When should we substitute the missing covariates with the median values and when should we not?

Thanks,
Atul

Re: Missing covariates

From: Jogarao Gobburu Date: July 02, 2001 technical
From: "Jogarao Gobburu 301-594-5354 FAX 301-480-3212" <GOBBURUJ@cder.fda.gov> Subject: Re: Missing covariates Date: Mon, 02 Jul 2001 11:06:24 -0400 (EDT)

Dear Atul,

Dr. Schafer presented a tutorial on 'multiple imputation (MI) for missing data problems' at the recent PAGE 2001 meeting in Basel. According to him (originally after Don Rubin), one can use the multivariate (covariate) distribution to simulate the missing values several times and produce the statistics of interest (e.g., the coefficient for the influence of wt on CL and its confidence limits). Though direct experience with this approach is limited and there are no publications in our field (pkpd) to my knowledge, the MI approach seems to be theoretically more appealing than the crude 'median/mean' approach. It might be worthwhile for you to consider using MI and share your findings with the rest of this group.

Some related references from other fields:

1. Rubin DB. Inference and missing data. Biometrika, 63, 581-592 (1976).
2. Schafer JL and Yucel R. Computational strategies for multivariate linear mixed effect models with missing values. J. Computational and Graphical Statistics (under review).
3. Lavori PW, Dawson R and Shera D. A multiple imputation strategy for clinical trials with truncation of patient data. Stat. in Med., 14, 1913-1925 (1995).
4. Schafer JL. Multiple imputation: a primer. Stat. Methods in Med. Research, 8, 3-15 (1999).

Centering is done for convenience; it does not alter statistical inference.

Regards,
Joga Gobburu
Pharmacometrics, CDER, FDA.

RE: Missing covariates

From: Kenneth G. Kowalski Date: July 02, 2001 technical
From: "KOWALSKI, KENNETH G. [PHR/1825]" <kenneth.g.kowalski@pharmacia.com> Subject: RE: Missing covariates Date: Mon, 2 Jul 2001 10:33:43 -0500 Dear Joga, I don't believe the MI theory has been worked out when we are interested in exploring the covariate relationships via model building. I had discussed this very situation with John Barnard (a colleague of Don Rubin's) at a recent ASA chapter workshop on MI. I suppose the theory is worked out if one is willing to fix the model such as a full model. Ken

Re: Missing covariates

From: Diane Mould Date: July 02, 2001 technical
From: "diane r mould" <drmould@attglobal.net> Subject: Re: Missing covariates Date: Mon, 2 Jul 2001 11:51:15 -0400

Dear Atul,

Substitution of the median values for missing covariates is not a good idea, particularly when you have a large percentage of missing data. If the missing data are replaced with the median, the median of the covariate distribution is preserved, but its other aspects are not (i.e., variance, quantiles, etc.). This can lead to problems (e.g., accepting a false covariate or rejecting a true one) when covariate models are tested. The use of joint functions is a lot better than replacing missing data with the median. If the use of a joint function is not feasible (they can be difficult to get to run and often take a long time to converge), try doing multiple imputation.

If you are not sure how to implement these methods, please let me know and I can help out a bit, as I have been struggling a lot lately with these missing data problems.

Best Regards,
diane
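
The point about the median being preserved while the variance is not can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the weights are hypothetical draws from a Normal(70, 15) kg distribution, about 30% are deleted completely at random, and the function names are invented for this example.

```python
import random
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def simulate_weights(n=5000, frac_missing=0.3, seed=1):
    """Hypothetical weights (kg); each value is missing with probability frac_missing."""
    rng = random.Random(seed)
    full = [rng.gauss(70.0, 15.0) for _ in range(n)]
    return [None if rng.random() < frac_missing else w for w in full]


with_gaps = simulate_weights()
observed = [w for w in with_gaps if w is not None]
imputed = impute_median(with_gaps)

median_observed = statistics.median(observed)
median_imputed = statistics.median(imputed)
# Ratio of the covariate variance after imputation to the variance of the observed values.
variance_ratio = statistics.variance(imputed) / statistics.variance(observed)

print(f"median: observed {median_observed:.1f}, after imputation {median_imputed:.1f}")
print(f"variance after imputation / observed variance: {variance_ratio:.2f}")
```

Under these assumptions the median is unchanged, while the variance of the filled-in covariate shrinks by roughly the fraction of values that were imputed.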

RE: Missing covariates

From: Hui C. Kimko Date: July 02, 2001 technical
From: "Hui C. Kimko" <koh@georgetown.edu> Subject: RE: Missing covariates Date: Mon, 2 Jul 2001 12:09:15 -0400

Dear Atul,

What you are curious about now is exactly what I am thinking about. The answers to your questions depend on your dataset. Of course, if you are in the phase of building a model to describe your data, it would be difficult to know the impact of using many different ways of handling missing data; sensitivity analyses may give you some hints. So, here I am including some general advice from the FDA.

Guidance for Industry, Population Pharmacokinetics, Page 12-13
http://www.fda.gov/cder/guidance/1852fnl.pdf

B. Handling Missing Data

After assembling data for population analysis, the issue of any missing covariate data should be addressed. Missing data will not automatically invalidate the results provided a good-faith effort is made to capture the missing data and adequate documentation is made regarding why data are unavailable. However, missing data represent a potential source of bias. Thus, every effort should be made to fulfill the protocol requirements concerning the collection and management of data, thereby reducing the amount of missing data.

Many subjects may be rich in covariate data, and some may be missing only a small sample of covariates. Excluding all subjects with any covariate data missing in some situations will vastly decrease the sample size. Extreme caution should be taken, but in certain situations, it may be better to impute missing covariate values for some subjects rather than to delete those subjects from the data set. Some simple methods of imputation, including the use of median, mean, or mode for missing values, may be biased and inefficient when predictors are correlated (34). A better method uses maximum likelihood procedures for predicting each predictor from all other predictors. Another method for consideration is multiple imputation, in which several imputed data sets are analyzed to remove the optimistic bias from estimates of precision caused by imputing data and treating it as though it were actually observed (35). However, the performance of imputation techniques in this context is not well-studied, nor is there a wealth of experience on their use. Moreover, imputation of missing covariates adds another layer of assumptions to the model. Imputation procedures should be described, and a detailed explanation provided of how such imputations were done and the underlying assumptions made. The sensitivity of the results of the analysis to the method of imputation of missing data should be tested, especially if the number of missing values is substantial.

Sometimes missing concentration data can become a problem in a longitudinal population PK study that is conducted for a long time. If there is a pattern to the missing data, appropriate statistical procedures should be used to address the problem. Such procedures should be described in the population PK analysis report. However, if the concentration data are missing randomly, the process that caused the missing data can be ignored and the observed data can be analyzed without regard to the missing data (36, 37).

Good luck ! :-)
Huicy

***********************************
Hui C. Kimko, PhD
Center for Drug Development Science
Georgetown University Medical Center
voice: 202 687 4332 fax: 202 687 0193

Re: Missing covariates

From: Lewis B. Sheiner Date: July 02, 2001 technical
From: LSheiner <lewis@c255.ucsf.edu> Subject: Re: Missing covariates Date: Mon, 02 Jul 2001 09:32:04 -0700

This has been discussed before ... (the discussion at http://www.cognigencorp.com/nonmem/nm/99sep112000.html is most relevant)

It is generally NOT a good idea to simply impute (the word used in the statistical literature) the missing covariates as you have described. There are at least three problems.

1. You are making up data and treating them as though they were observed. Your standard errors, etc., will reflect this, and hence will be too "optimistic"; that is, you will think you have learned more from the data than you in fact have.

2. By substituting the mean or median, you weaken any covariate relationship that really exists. Imagine, for example, that you were analyzing dose (X) vs concentration at 6 hr post dose (Y), but were missing X in half the cases. The correlation between X and Y estimated from the imputed data if all unknown doses were fixed to the median would be about half that estimated from only the "complete" cases ... Which do you think is more likely to be right?

3. The missing data may be missing for reasons that correlate with the values themselves (e.g., data sets with BQLs deleted: the data are missing BECAUSE their values are low), so-called "non-ignorable missingness." If so, then the median of the observed values is actually a biased estimate of the median of the missing ones.

There are ways to deal with missing data in a principled way (see above referenced archive site), avoiding problems 1 and 2 above, but they can be a bit tricky. Problem 3 is very difficult to deal with, usually requiring additional data or assumptions.

LBS.
--
Lewis B Sheiner, MD (lewis@c255.ucsf.edu)
Professor: Lab. Med., Bioph. Sci., Med.
Box 0626, UCSF, SF, CA, 94143-0626
415-476-1965 (v), 415-476-2796 (fax)
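
Problem 2 in the message above (median substitution weakening a real covariate relationship) can be illustrated with a short simulation. This is a sketch under invented assumptions, not anything from the thread's data: doses are hypothetical Normal(100, 20) draws, the 6 hr concentration is linear in dose with Gaussian noise, and half of the doses are deleted completely at random.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def attenuation_demo(n=20000, frac_missing=0.5, seed=2):
    """Return (r from complete cases, r after median imputation of missing doses)."""
    rng = random.Random(seed)
    dose = [rng.gauss(100.0, 20.0) for _ in range(n)]          # hypothetical doses
    conc = [0.05 * d + rng.gauss(0.0, 0.5) for d in dose]       # hypothetical concentrations
    missing = [rng.random() < frac_missing for _ in range(n)]   # missing completely at random

    # Complete-case analysis: drop subjects whose dose is unknown.
    cc_dose = [d for d, m in zip(dose, missing) if not m]
    cc_conc = [c for c, m in zip(conc, missing) if not m]
    r_complete = pearson(cc_dose, cc_conc)

    # Median imputation: keep every subject, fix unknown doses to the observed median.
    med = statistics.median(cc_dose)
    imputed_dose = [med if m else d for d, m in zip(dose, missing)]
    r_imputed = pearson(imputed_dose, conc)
    return r_complete, r_imputed


r_complete, r_imputed = attenuation_demo()
print(f"complete cases: r = {r_complete:.2f}")
print(f"median imputed: r = {r_imputed:.2f}")
print(f"ratio of r: {r_imputed / r_complete:.2f}, ratio of r^2: {(r_imputed / r_complete) ** 2:.2f}")
```

In this particular setup the correlation after imputation is about 0.7 times the complete-case value (a factor of the square root of one half), so the squared correlation, i.e. the fraction of variance explained, is roughly halved. The exact size of the attenuation depends on the fraction missing and on the missingness mechanism.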

RE: Missing covariates

From: Leonid Gibiansky Date: July 02, 2001 technical
From: "Gibiansky, Leonid" <gibianskyl@globomax.com> Subject: RE: Missing covariates Date: Mon, 2 Jul 2001 13:10:51 -0400

What would the group say about this method?

For categorical covariates, you create a separate category "missing" and treat it like any other level:

    TCL=THETA(1)               ; FOR MISSING VALUES
    IF(SEX.EQ.0) TCL=THETA(2)  ; FOR MALES
    IF(SEX.EQ.1) TCL=THETA(3)  ; FOR FEMALES
    CL=TCL*EXP(ETA(1))

For continuous covariates, you still use "missing" as a level for a new continuous/categorical covariate:

    TCL=THETA(1)                           ; FOR MISSING VALUES
    IF(WT.GT.0) TCL=THETA(2)*WT**THETA(3)  ; for non-missing, where negative WT codes for missing weight
    CL=TCL*EXP(ETA(1))

You need to assume the same variability for subjects with missing and non-missing data, which is probably not quite correct. Alternatively, you may try to give different variability to the population with missing covariates. This can be messy if you have a lot of covariates, but in simple cases it should work, in my opinion.

Any comments?

Thanks,
Leonid
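
For readers who do not use NONMEM, the logic of the two fragments above can be restated as a short Python sketch. It is only a transliteration for illustration: the function names, the zero-based `theta` list (NONMEM's THETA(1) is `theta[0]` here) and the example parameter values are assumptions, not part of the original post.

```python
import math


def typical_cl_sex(sex, theta):
    """Typical clearance with a separate 'missing' level for sex (0 = male, 1 = female)."""
    if sex == 0:
        return theta[1]      # males
    if sex == 1:
        return theta[2]      # females
    return theta[0]          # any other code is treated as missing


def typical_cl_wt(wt, theta):
    """Typical clearance with a power model in weight; non-positive wt codes for missing."""
    if wt > 0:
        return theta[1] * wt ** theta[2]
    return theta[0]          # single typical value for subjects with missing weight


def individual_cl(tcl, eta):
    """Individual clearance with log-normal between-subject variability."""
    return tcl * math.exp(eta)


# Hypothetical parameter values, for illustration only.
theta_wt = [5.0, 0.2, 0.75]
print(typical_cl_wt(70.0, theta_wt))   # subject with observed weight
print(typical_cl_wt(-1.0, theta_wt))   # subject with missing weight
```

The same `eta` distribution is applied to both groups here, which mirrors the assumption of equal variability mentioned in the message.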

Re: Missing covariates

From: Jogarao Gobburu Date: July 02, 2001 technical
From: "Jogarao Gobburu 301-594-5354 FAX 301-480-3212" <GOBBURUJ@cder.fda.gov> Subject: Re: Missing covariates Date: Mon, 02 Jul 2001 13:14:32 -0400 (EDT)

Dear Ken,

Thanks for sharing your experience on this topic. At the end of the MI experiment one would have a distribution of coefficient estimates or p-values. Let us consider two cases: in case 1, most p-values are below 0.05 (alpha=5%); in case 2, the p-values are distributed such that several fall above 0.05. For case 2, the results suggest that there may not be any definitive method for filling in missing information OR just that the relationship is not important. For case 1, it is probably reasonable to conclude that the covariate is important. Of course, this information is supportive evidence to any mechanistic reason that one might have, in which case the parameter estimate becomes the point of interest. What, according to your discussions, would be the downsides of using MI when you are not certain about the influence of a particular covariate?

Regards,
Joga
Pharmacometrics, CDER, FDA.

Re: Missing covariates

From: Lewis B. Sheiner Date: July 02, 2001 technical
From: LSheiner <lewis@c255.ucsf.edu> Subject: Re: Missing covariates Date: Mon, 02 Jul 2001 12:07:46 -0700

Of course that is fine, but it implies a different model than the original one. The marginal likelihood methods, i.e., the methods that integrate the likelihood over the distribution of the missing data -- all of those I discussed as "principled" -- assume that the SAME conditional (on the covariates) likelihood applies to all subjects. Thus, if you believe that CL differs between sexes (as in your example), then using the marginal likelihood, you essentially "assign" a fraction (equal to the probability of being male) of the likelihood from each individual with a missing sex covariate to the class of men, and the remaining fraction to the class of women.

In your model, you create a third class which is neither male nor female, but "missing sex" (and introduce an additional parameter for their mean clearance), and do not demand that the clearance of this group be the probability-weighted average of that for males and females.

Other than increasing model complexity, this poses a problem for a causal or predictive model (as opposed to a descriptive model), namely: how do you predict CL for a new individual? If you know his/her sex, presumably you use the sex-specific value; if you don't, presumably you use the new 3rd class. Now, if the reasons for missingness are the same for the new patient as they were for those originally studied, this will be a reasonable empirical model (it is certainly not a mechanistic one, as it depends on a non-mechanistic covariate: missingness of data). If not, though, ...

LBS.
--
Lewis B Sheiner, MD (lewis@c255.ucsf.edu)
Professor: Lab. Med., Bioph. Sci., Med.
Box 0626, UCSF, SF, CA, 94143-0626
415-476-1965 (v), 415-476-2796 (fax)
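
The marginal likelihood idea described above can be written down for the simplest possible case. The sketch below is a deliberately reduced illustration, with assumptions that are not in the original message: each subject contributes a single observed log-clearance (rather than concentration data), the conditional likelihood is normal with a sex-specific mean and common standard deviation, and all numerical values and function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def conditional_likelihood(log_cl, sex, mean_log_cl, sd):
    """Likelihood given sex; the same conditional model applies to every subject."""
    return normal_pdf(log_cl, mean_log_cl[sex], sd)


def marginal_likelihood(log_cl, p_male, mean_log_cl, sd):
    """Likelihood when sex is missing: sum the conditional likelihood over both sexes."""
    return (p_male * conditional_likelihood(log_cl, "M", mean_log_cl, sd)
            + (1.0 - p_male) * conditional_likelihood(log_cl, "F", mean_log_cl, sd))


# Hypothetical values, for illustration only.
mean_log_cl = {"M": math.log(10.0), "F": math.log(8.0)}
sd = 0.3
p_male = 0.6
y = math.log(9.5)  # observed log-clearance of a subject whose sex was not recorded

lik_male = conditional_likelihood(y, "M", mean_log_cl, sd)
lik_female = conditional_likelihood(y, "F", mean_log_cl, sd)
lik_missing = marginal_likelihood(y, p_male, mean_log_cl, sd)

print(f"L(y | male) = {lik_male:.3f}, L(y | female) = {lik_female:.3f}")
print(f"marginal L(y) with sex missing = {lik_missing:.3f}")
```

No extra "missing sex" parameter appears: the subject with unknown sex informs the male and female parameters in proportion to `p_male` and `1 - p_male`, which is the contrast with the third-class model drawn in the message.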

RE: Missing covariates

From: Kenneth G. Kowalski Date: July 02, 2001 technical
From: "KOWALSKI, KENNETH G. [PHR/1825]" <kenneth.g.kowalski@pharmacia.com> Subject: RE: Missing covariates Date: Mon, 2 Jul 2001 14:54:24 -0500

Dear Joga,

The downside to model building with MI is that for each imputed data set (I guess) we would need to apply the model building procedure. In so doing, we know that we will get optimistic standard errors (whole point of doing MI) and thus probably also more false positives, since standard errors and hypothesis tests are related. So, how do we apply the model building procedure to assess which covariates are included in the final model for each imputed data set? After all, it is during the averaging process of the multiple imputations that we adjust the standard errors. I don't think the theory has been worked out on how to combine estimates across different final models obtained from the model building procedure applied to each imputed data set when the final models can have different covariate parameters excluded or restricted to zero.

If, on the other hand, we fit the full model to each imputed data set, then MI should work nicely. However, how do we go from this full model with MI-adjusted estimates and standard errors to a more parsimonious final model that we might want to use for predictions and simulation? It is something for Rubin et al. to think about.

Ken
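
The "averaging process" in which the standard errors are adjusted is usually done with Rubin's combining rules: the pooled estimate is the mean of the per-imputation estimates, and the pooled variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The sketch below assumes the fixed-model case discussed in the message (the same model fitted to every imputed data set); the function name and the numbers are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, std_errors):
    """Combine one parameter's estimates from m imputed data sets with Rubin's rules."""
    m = len(estimates)
    pooled_estimate = statistics.fmean(estimates)
    within = statistics.fmean([se ** 2 for se in std_errors])  # mean within-imputation variance
    between = statistics.variance(estimates)                    # between-imputation variance (m - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, math.sqrt(total)


# Hypothetical weight-on-CL exponents and standard errors from m = 5 imputed data sets.
estimates = [0.70, 0.78, 0.74, 0.81, 0.72]
std_errors = [0.10, 0.11, 0.10, 0.12, 0.10]

pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
naive_se = math.sqrt(statistics.fmean([se ** 2 for se in std_errors]))

print(f"pooled estimate = {pooled_estimate:.3f}")
print(f"pooled SE = {pooled_se:.3f} (average within-imputation SE = {naive_se:.3f})")
```

The pooled standard error is larger than the typical single-imputation standard error whenever the estimates differ across imputations. The rules apply to one parameter in one fixed model, so they do not by themselves answer the question raised above about combining different final models.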

RE: Missing covariates

From: Leonid Gibiansky Date: July 02, 2001 technical
From: "Gibiansky, Leonid" <gibianskyl@globomax.com> Subject: RE: Missing covariates Date: Mon, 2 Jul 2001 16:00:05 -0400

Thanks for the comments.

My motivation for the model below was the following: if we have missing covariates (and we assume them missing randomly), the best way would be to exclude the subjects with missing covariates. However, if one subject has missing gender, another has missing WT, a third has missing creatinine clearance, etc., we may end up removing all of them, which is not appropriate unless we have an unlimited amount of data. The next best thing would be to remove the influence of the missing covariates on the population model. That is precisely what I am trying to achieve.

By creating the class "missing" for gender, I assume that we have a comparable number of males and females with missing gender (not necessarily equal proportions, but the same proportions as in the population under study). Then CL for missing gender would be some average of CL for males and females. This is like a hybrid model: a covariate model for those who have covariate values and a base model for those who have missing values. Variability of the parameters for the "missing" class should be higher (similar to the higher variability in the base model compared with the covariate model), but in simple cases we may either neglect this difference or take it into account by creating different variability for subjects with missing covariates. I think that the procedure uses most of the covariate information that is present in the data (except correlation between covariates), and does not impute any new information.

In the prediction part, I would assume that we have covariate values for any particular subject. If not, I use the "base model approximation" for the subjects with missing data, similar to how it was done during the model building.

I think the proposed procedure may have its value somewhere between the more complicated ones (based on multiple imputation or model-based imputation) and the naive one (with the imputation of the median values). This offers a relatively straightforward way of modeling, and excludes situations when, e.g., the imputed median WT for one or two very heavy individuals damages the actual (or creates a fictitious) covariate relationship.

Of course, I completely agree that if the "missingness" is systematic, and the reasons for missingness are not the same for the new patient as they were for those originally studied, everything will fall apart. But the same can be said about any modeling procedure: it is based on the current data, and may fail if the new population is significantly different from the one used for the model building.

Thanks,
Leonid

Re: Missing covariates

From: Smith Brian P Date: July 05, 2001 technical
From: SMITH_BRIAN_P@Lilly.com Subject: Re: Missing covariates Date: Thu, 05 Jul 2001 12:38:52 -0500

I have used Leonid's method many times in my analyses. It is a pragmatic way of dealing with missing covariates. Let me describe his method in a different way, and I think you will see that it is quite usable and makes sense.

Another way you could think of this, if you had a linear model, is to set all of the missing values to zero and create a dummy variable which is 1 if the value is missing and 0 otherwise. Then consider the model

    Cl = a + b*wt + c*miss

When the value is missing you get

    Cl = a + c

When the value is not missing you get

    Cl = a + b*wt

If you were to set a + c = a + b*wt and solve for wt, you get wt = c/b, so in essence you are letting your model estimate the average of the missing individuals' weights, which is c/b. This is superior to just imputing the median or mean weight. First, your missing values may have systematically smaller or larger weights than the group that is not missing. Second, this method uses up a degree of freedom in order to estimate that average; thus you are paying a penalty for having missing values.

Now, Leonid uses a power model in his example. He also uses if-then code, which I seldom use. But, in his example, when the value is missing

    Cl = theta(1)

When the value is not missing you get

    Cl = theta(2)*wt**(theta(3))

Set them equal and solve for wt. Thus, your estimate of the average weight of the individuals with missing weights is (theta(1)/theta(2))**(1/theta(3)).

What I would do, which is equivalent to Leonid's model, is fit

    Cl = exp(theta(1) + theta(2)*miss + theta(3)*lnwt)

Let lnwt=0 when missing and create a dummy variable as described above. Then theta(2)/theta(3) becomes the estimate of the average lnwt. The exponential of this quantity would become the estimate of the average weight for an individual with missing weight.

The same method is equally applicable when you have categorical data, and if you only had 2 classes (like gender), you could estimate the proportion of missing observations that were male and female.

Sincerely,
Brian Smith

Go to Subject: 'Centering (was: Missing Covariates)'
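
The linear version of the missing-indicator model above can be tried on simulated data. The sketch below is an illustration under invented assumptions: clearance is generated as 1 + 0.05*wt plus noise, the 300 subjects whose weight is hidden are systematically heavier (mean 90 kg) than the 700 with recorded weight (mean 70 kg), and the function names are hypothetical. It also uses the fact that ordinary least squares for Cl = a + b*wt + c*miss decouples when wt is set to 0 for every missing row: a and b come from the complete rows alone, and a + c equals the mean Cl of the missing rows.

```python
import random
import statistics


def fit_missing_indicator(cl, wt):
    """Least-squares fit of Cl = a + b*wt + c*miss, with wt = None marking missing weights."""
    obs = [(w, y) for w, y in zip(wt, cl) if w is not None]
    cl_missing = [y for w, y in zip(wt, cl) if w is None]

    # Simple linear regression on the complete rows gives a and b.
    mean_w = statistics.fmean([w for w, _ in obs])
    mean_y = statistics.fmean([y for _, y in obs])
    b = (sum((w - mean_w) * (y - mean_y) for w, y in obs)
         / sum((w - mean_w) ** 2 for w, _ in obs))
    a = mean_y - b * mean_w

    # For the missing rows the fitted value is a + c, so a + c is their mean Cl.
    c = statistics.fmean(cl_missing) - a
    return a, b, c


rng = random.Random(3)
wt_recorded = [rng.gauss(70.0, 10.0) for _ in range(700)]
wt_hidden = [rng.gauss(90.0, 10.0) for _ in range(300)]   # true weights, never shown to the fit

cl = [1.0 + 0.05 * w + rng.gauss(0.0, 0.2) for w in wt_recorded + wt_hidden]
wt = wt_recorded + [None] * len(wt_hidden)

a, b, c = fit_missing_indicator(cl, wt)
implied_weight = c / b
true_mean_hidden = statistics.fmean(wt_hidden)

print(f"a = {a:.2f}, b = {b:.3f}, c = {c:.2f}")
print(f"implied mean weight of missing subjects (c/b) = {implied_weight:.1f} kg")
print(f"actual mean of the hidden weights = {true_mean_hidden:.1f} kg")
```

In this setup c/b lands near the mean of the hidden weights rather than near the median of the recorded ones, which is the first advantage claimed in the message. The result relies on the simulated relationship being truly linear and identical in both groups.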