RE: Coding for missing data values

From: William Bachman Date: June 29, 2004 technical Source: cognigencorp.com
From: "Bachman, William (MYD)"
Subject: RE: [NMusers] Coding for missing data values
Date: Tue, June 29, 2004 9:02 am

"Why did I choose this code?"

Actually, the bottom line is that I gave two methods that are commonly employed: assuming the mean for the missing value (or using some other imputation algorithm), or letting the data decide whether this is a valid assumption. (Remember that the original question was: how do you code missing covariates?) Choose whichever you want. In actual practice, I've most often assumed the mean for missing covariates, and frankly the choice of method usually has no significant effect on the model.

However, I've decided to be the devil's advocate in the finest Holfordesque tradition. There is no reason to assume that the missing values are random and representative of the population. Unless you have additional prior information, that assumption is not rigorous from a statistical viewpoint. For example, the subjects with missing values may have come from a pediatric population (less blood drawn, fewer tests, high likelihood of different parameters), from a site with less rigorous procedures (the test was simply skipped, and sampling times may also have received less attention, resulting in more variability), from a sicker subset, or from any number of other scenarios. Assuming the mean for them introduces systematic bias under these scenarios. Allowing the parameter to be estimated could prove or disprove the validity of the assumption.

The other reason for coding the way I did was the interpretation of the thetas. In retrospect, this is how I would code it today:

    IF (TECL.EQ.0) THEN
      TVCL=THETA(1) + THETA(2) + (COVn-x.x)*THETA(n) + ...
    ELSE
      TVCL=THETA(1) + (TECL-5.4)*THETA(3) + (COVn-x.x)*THETA(n) + ...
    ENDIF
    CL=TVCL*EXP(ETA(1))

Then theta(1) is "basically" the population typical value, theta(2) captures the difference in CL between those with and without a measured TECL, and theta(3), theta(n) represent the influence of TECL, COVn ...

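To make the roles of theta(1), theta(2) and theta(3) concrete, here is a minimal Python sketch of the typical-value part of the coding above. It is only an illustration, not part of the original post: the function name `typical_cl` and all numeric theta values are hypothetical, the other covariate terms (COVn) are left out, and 5.4 is assumed to be the mean TECL used for centering. It shows that when theta(2) is zero, a subject with missing TECL gets the same typical CL as a subject whose TECL was imputed at 5.4.

```python
TECL_REF = 5.4  # centering value from the post (assumed here to be the mean TECL)


def typical_cl(tecl, theta1, theta2, theta3):
    """Typical clearance for one subject.

    tecl == 0 flags a missing TECL, mirroring IF (TECL.EQ.0) in the post.
    Other covariate terms (COVn) are omitted from this sketch.
    """
    if tecl == 0:
        # Missing TECL: typical value plus a shift estimated from the data.
        return theta1 + theta2
    # Measured TECL: typical value plus the centered covariate effect.
    return theta1 + (tecl - TECL_REF) * theta3


# Hypothetical parameter values, for illustration only.
theta1, theta3 = 10.0, 0.5

imputed = typical_cl(TECL_REF, theta1, 0.0, theta3)   # TECL set to the mean
missing_no_shift = typical_cl(0, theta1, 0.0, theta3)  # theta(2) = 0
missing_shift = typical_cl(0, theta1, -2.0, theta3)    # theta(2) = -2

print(imputed, missing_no_shift, missing_shift)
```

With theta(2) fixed at zero the two approaches coincide; a nonzero estimate of theta(2) is what would signal that the subjects without TECL differ from the rest.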
At the conclusion of the modeling exercise, test all thetas for significance. If any are not significant, remove them from the model. (If theta(2) is zero, you have shown that the population without measured TECL can be adequately represented by the mean TECL, so drop theta(2).) Let the data drive the model to the simplest model rather than assuming it a priori. If a simpler model is warranted, the data will tell you that, and the prudent modeler will listen to the data. Also, give 10 analysts a set of data and you will get 10 differently coded models.
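The post does not say which significance test to use. One common choice, assumed here, is a likelihood ratio test on the difference in objective function value (OFV) between the model with theta(2) fixed to zero and the model with theta(2) estimated, treating the difference as chi-square distributed with 1 degree of freedom. A minimal Python sketch under that assumption follows; the function names and the OFV numbers are hypothetical.

```python
import math


def lrt_p_value(ofv_reduced, ofv_full):
    """P-value for removing ONE parameter, assuming the OFV difference
    follows a chi-square distribution with 1 degree of freedom."""
    delta = ofv_reduced - ofv_full
    if delta <= 0:
        # The fuller model did not improve the fit at all.
        return 1.0
    # Chi-square survival function for 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(delta / 2.0))


def keep_parameter(ofv_reduced, ofv_full, alpha=0.05):
    """True if the parameter is significant at level alpha."""
    return lrt_p_value(ofv_reduced, ofv_full) < alpha


# Hypothetical OFVs: theta(2) fixed to 0 versus theta(2) estimated.
print(keep_parameter(1003.0, 1000.0))  # drop of 3.0
print(keep_parameter(1005.0, 1000.0))  # drop of 5.0
```

Under this assumption, an OFV drop of about 3.84 corresponds to p = 0.05 for a single parameter, so a drop of 3.0 would argue for removing theta(2) and a drop of 5.0 for keeping it.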
Jun 28, 2004 Bharath Muralidharan Coding for missing data values
Jun 28, 2004 William Bachman RE: Coding for missing data values
Jun 28, 2004 Nick Holford RE: Coding for missing data values
Jun 28, 2004 William Bachman RE: Coding for missing data values
Jun 28, 2004 Nick Holford RE: Coding for missing data values
Jun 29, 2004 Anthe Zandvliet RE: Coding for missing data values
Jun 29, 2004 William Bachman RE: Coding for missing data values
Jun 29, 2004 Nick Holford RE: Coding for missing data values
Jun 29, 2004 Nick Holford RE: Coding for missing data values