RE: Coding for missing data values
From: "Bachman, William (MYD)"
Subject: RE: [NMusers] Coding for missing data values
Date: Tue, June 29, 2004 9:02 am
"Why did I choose this code?" Actually, the bottom line is that I gave two
methods that are commonly employed: assuming the mean (or using some other
imputation algorithm) for the missing value, or letting the data decide
whether that is a valid assumption. (Remember that the original question
was: how do you code missing covariates?) Choose whichever you want. In
actual practice, I've most often assumed the mean for missing covariates,
and frankly the choice of method usually has no significant effect on the
model.
However, I've decided to be the devil's advocate in the finest Holfordesque
tradition. Without additional prior information, there is no reason to
assume that the missing values are random and representative of the
population, and from a statistical viewpoint it is not rigorous to do so.
For example, they may have come from a pediatric population (less blood
drawn, fewer tests, high likelihood of different parameters), from a site
with less rigorous procedures (just skipped the test, possibly also less
attention to sampling times, resulting in more variability), from a sicker
subset, or any number of other scenarios. Assuming the mean for them
introduces systematic bias under these scenarios. Allowing the parameter
to be estimated could prove or disprove the validity of the assumption.
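To put illustrative numbers on the bias argument, here is a minimal Python
sketch (not NONMEM code). Every value in it is hypothetical except the
centering value 5.4, which is borrowed from the TECL code in this post. The
additive covariate model and the scenario of an unmeasured subgroup with
lower TECL are assumptions chosen only to show the mechanism.

```python
# Hypothetical illustration: mean imputation when missingness is not random.
THETA_CL = 10.0    # hypothetical typical CL at the reference covariate value
THETA_COV = 0.5    # hypothetical slope of CL on the covariate
COV_MEAN = 5.4     # mean of the measured covariate (centering value in the post)


def typical_cl(cov):
    """Additive covariate model: CL shifts linearly with distance from the mean."""
    return THETA_CL + (cov - COV_MEAN) * THETA_COV


# Assumption: the unmeasured subjects (say, a pediatric subset) truly have
# a lower covariate value than the measured population.
true_cov_unmeasured = 3.4

cl_true = typical_cl(true_cov_unmeasured)   # what the model should predict
cl_imputed = typical_cl(COV_MEAN)           # what mean imputation predicts
bias = cl_imputed - cl_true                 # systematic error for this subgroup

print(f"true typical CL:     {cl_true:.2f}")
print(f"mean-imputed CL:     {cl_imputed:.2f}")
print(f"systematic bias:     {bias:.2f}")
```

The error does not average out across the subgroup: every unmeasured subject
is shifted in the same direction, which is why it is a bias rather than
extra noise.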
The other reason for coding the way I did was the interpretation of the
thetas. In retrospect, this is how I would code it today:
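IF(TECL.EQ.0) THEN
  ; TECL missing (coded as 0): THETA(2) is the shift in CL for these subjects
  TVCL=THETA(1) + THETA(2) + (COVn-x.x)*THETA(n) + ...
ELSE
  ; TECL measured: effect centered at 5.4
  TVCL=THETA(1)+(TECL-5.4)*THETA(3) + (COVn-x.x)*THETA(n) + ...
ENDIF
CL=TVCL*EXP(ETA(1))  ; log-normal between-subject variability on CL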
Then theta(1) is "basically" the population typical value, theta(2) is
the difference in CL between those without and those with measured TECL,
and theta(3), theta(n) represent the influence of TECL, COVn ... At the
conclusion of the modeling exercise, test all thetas for significance.
If any are not significant, remove them from the model. (If theta(2) is
not significantly different from zero, you have shown that the population
without measured TECL can be adequately represented by the mean TECL; get
rid of it.)
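The role of theta(2) can be checked with a small Python sketch that mirrors
the IF/ELSE logic (again, not NONMEM code). The theta values are hypothetical,
the other covariates (COVn) are left out for brevity, and TECL = 0 is taken
to be the missing-value code, as in the IF test.

```python
TECL_MEAN = 5.4  # centering value used in the post's code


def tvcl(tecl, theta1, theta2, theta3):
    """Typical CL following the post's IF/ELSE; tecl == 0 flags a missing value."""
    if tecl == 0:
        # TECL not measured: typical value plus a shift for this subgroup
        return theta1 + theta2
    # TECL measured: linear effect centered at the mean
    return theta1 + (tecl - TECL_MEAN) * theta3


theta1, theta3 = 10.0, 0.8  # hypothetical estimates

# With theta2 = 0, a subject with missing TECL gets exactly the typical CL of
# a subject whose measured TECL equals the mean, i.e. mean imputation.
at_mean = tvcl(TECL_MEAN, theta1, 0.0, theta3)
missing_no_shift = tvcl(0, theta1, 0.0, theta3)

# With a nonzero theta2 (hypothetical -1.5), the unmeasured subgroup differs.
missing_with_shift = tvcl(0, theta1, -1.5, theta3)

print(at_mean, missing_no_shift, missing_with_shift)
```

So dropping theta(2) from the model is the same as adopting the mean-imputation
assumption, which is why testing it for significance tests that assumption.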
Let the data drive the model to its simplest form rather than assuming it
a priori. If a simpler model is warranted, the data will tell you that and
the prudent modeler will listen to the data. Also, give 10 analysts a set
of data and you will get 10 differently coded models.