RE: Missing covariates
From: "Gibiansky, Leonid" <gibianskyl@globomax.com>
Subject: RE: Missing covariates
Date: Mon, 2 Jul 2001 16:00:05 -0400
Thanks for the comments.
My motivation for the model below was the following: if we have missing covariates (and we assume they are missing randomly), the best approach would be to exclude the subjects with missing covariates. However, if one subject has missing gender, another has missing WT, a third has missing creatinine clearance, etc., we may end up removing all of them, which is not appropriate unless we have an unlimited amount of data. The next best thing would be to remove the influence of the missing covariates on the population model. That is precisely what I am trying to achieve.
By creating a "missing" class for gender, I assume that we have comparable numbers of males and females with missing gender (not necessarily equal proportions, but the same proportions as in the population under study). Then CL for missing gender would be some average of CL for males and females. This is like a hybrid model: a covariate model for those who have covariate values and a base model for those who have missing values. Variability of the parameters for the "missing" class should be higher (similar to the higher variability in the base model compared with the covariate model), but in simple cases we may either neglect this difference or take it into account by creating different variability for subjects with missing covariates. I think that the procedure uses most of the covariate information that is present in the data (except the correlation between covariates) and does not impute any new information.
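To make the hybrid idea concrete, here is a minimal Python sketch (an illustration only, not NONMEM code). Everything named in it is hypothetical: the function names, the parameters tvcl, theta_female, theta_missing, omega_known and omega_missing, the multiplicative form of the gender effect, and the log-normal between-subject variability are all assumptions made for the example. The "missing" class gets its own estimated multiplier and, optionally, its own variance.

```python
import math


def typical_cl(tvcl, gender, theta_female, theta_missing):
    """Typical clearance with a separate class for missing gender.

    gender is "M", "F", or None (missing). Males are the reference class.
    theta_missing is estimated from the data and is expected to land
    somewhere between the male (1.0) and female (theta_female) multipliers.
    """
    if gender == "M":
        return tvcl
    if gender == "F":
        return tvcl * theta_female
    return tvcl * theta_missing  # "missing" class


def omega_cl(gender, omega_known, omega_missing):
    """Between-subject variance of CL; optionally larger for the missing class."""
    return omega_missing if gender is None else omega_known


def individual_cl(typical, eta):
    """Individual clearance under an assumed log-normal random effect."""
    return typical * math.exp(eta)
```

Setting omega_missing equal to omega_known corresponds to the simple case where the difference in variability is neglected.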
In the prediction part, I would assume that we have covariate values for any particular subject. If not, I would use the "base model approximation" for the subjects with missing data, similar to how it was done during model building.
I think the proposed procedure may have its value somewhere between the more complicated ones (based on multiple imputation or model-based imputation) and the naive one (imputation of the median values). It offers a relatively straightforward way of modeling and avoids situations where, e.g., the median WT imputed for one or two very heavy individuals damages the actual covariate relationship (or creates a fictitious one).
Of course, I completely agree that if the "missingness" is systematic, and the reasons for missingness are not the same for the new patient as they were for those originally studied, everything will fall apart. But the same can be said about any modeling procedure: it is based on the current data, and may fail if the new population is significantly different from the one used for the model building.
Thanks
Leonid