Re: Missing covariates
From: LSheiner <lewis@c255.ucsf.edu>
Subject: Re: Missing covariates
Date: Mon, 02 Jul 2001 09:32:04 -0700
This has been discussed before ... (the discussion at
http://www.cognigencorp.com/nonmem/nm/99sep112000.html
is most relevant)
It is generally NOT a good idea to simply impute (the word used in the statistical literature) the missing covariates as you have described. There are at least three problems.
1. You are making up data and treating them as though they were observed. Your standard errors, etc., will reflect this, and hence will be too "optimistic"; that is, you will think you have learned more from the data than you in fact have.
2. By substituting the mean or median, you weaken any covariate relationship that really exists. Imagine, for example, that you were analyzing dose (X) vs concentration at 6 hr post dose (Y), but were missing X in half the cases. If all unknown doses were fixed to the median, the covariance between X and Y estimated from the imputed data would be about half that estimated from only the "complete" cases, and the correlation would shrink by a factor of about 0.7 (the square root of one half, assuming the doses are missing at random) ...
Which do you think is more likely to be right?
3. The missing data may be missing for reasons that correlate with the values themselves (e.g., data sets with BQL (below the quantification limit) observations deleted: the data are missing BECAUSE their values are low), so-called "non-ignorable missingness." If so, then the median of the observed values is actually a biased estimate of the median of the missing ones.
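Problems 1 and 2 can be checked with a small simulation. The sketch below is only an illustration under assumptions that are not in the original post: doses are normally distributed, concentration is linear in dose plus noise, and dose is missing completely at random in about half the cases. The function and variable names are hypothetical.

```python
import math
import random
import statistics


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def pearson(xs, ys):
    """Pearson correlation coefficient."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


def simulate(n=20000, seed=1):
    rng = random.Random(seed)
    # Assumed data-generating model (illustrative values only).
    dose = [rng.gauss(100.0, 20.0) for _ in range(n)]
    conc = [0.05 * d + rng.gauss(0.0, 1.0) for d in dose]
    # Dose is missing completely at random in about half the cases.
    missing = [rng.random() < 0.5 for _ in range(n)]

    # "Complete" cases: keep only the records with an observed dose.
    cc_dose = [d for d, m in zip(dose, missing) if not m]
    cc_conc = [c for c, m in zip(conc, missing) if not m]

    # Median imputation: every unknown dose is set to the observed median.
    med = statistics.median(cc_dose)
    imp_dose = [med if m else d for d, m in zip(dose, missing)]

    # Apparent standard error of the mean dose, treating imputed
    # values as if they had been observed.
    se_cc = statistics.stdev(cc_dose) / math.sqrt(len(cc_dose))
    se_imp = statistics.stdev(imp_dose) / math.sqrt(n)

    return {
        "cov_ratio": covariance(imp_dose, conc) / covariance(cc_dose, cc_conc),
        "corr_ratio": pearson(imp_dose, conc) / pearson(cc_dose, cc_conc),
        "se_ratio": se_imp / se_cc,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the covariance from the imputed data comes out near half the complete-case value, the correlation shrinks by a factor near 0.7, and the apparent standard error of the mean dose is about half the complete-case one, even though no new information has been added.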
There are principled ways to deal with missing data (see the archive discussion referenced above) that avoid problems 1 and 2, but they can be a bit tricky. Problem 3 is very difficult to deal with, usually requiring additional data or assumptions.
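To see why problem 3 is hard, consider the BQL case in a toy simulation. This is a sketch under assumed values (lognormal concentrations and a quantification limit of 1.0, neither of which comes from the original post), and the names are hypothetical.

```python
import random
import statistics


def bql_medians(n=10000, lloq=1.0, seed=2):
    """Return medians of the observed, the deleted (BQL), and all values."""
    rng = random.Random(seed)
    # Assumed concentration distribution (illustrative only).
    values = [rng.lognormvariate(0.5, 1.0) for _ in range(n)]
    # Values below the quantification limit are deleted, so they are
    # missing BECAUSE they are low.
    observed = [v for v in values if v >= lloq]
    deleted = [v for v in values if v < lloq]
    return (
        statistics.median(observed),
        statistics.median(deleted),
        statistics.median(values),
    )


if __name__ == "__main__":
    obs, gone, full = bql_medians()
    print(f"median of observed values: {obs:.2f}")
    print(f"median of deleted values:  {gone:.2f}")
    print(f"median of all values:      {full:.2f}")
```

The observed median is necessarily above the limit while every deleted value is below it, so imputing the observed median overstates the missing values. Nothing in the observed data alone reveals by how much, which is why extra data or assumptions are needed.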
LBS.
--
Lewis B Sheiner, MD (lewis@c255.ucsf.edu)
Professor: Lab. Med., Bioph. Sci., Med.
Box 0626, UCSF, SF, CA, 94143-0626
415-476-1965 (v), 415-476-2796 (fax)