RE: likelihood model for missing SEX data
From: Jakob Ribbing jakob.ribbing@farmbio.uu.se
Subject: RE: [NMusers] likelihood model for missing SEX data
Date: Tue, 11 Jul 2006 17:13:11 +0200
Dear all,
Nick's example was enough to convince me that estimating mixture and
covariate effect simultaneously does not cause bias as I thought (and I now
understand why). In most cases, I would use his approach rather than the one
I suggested.
The latter, to impute covariates outside of NONMEM may still be useful,
basically with the same arguments for and against as for using the GAM in
Xpose instead of stepwise covariate modelling within NONMEM, but with the
additional benefit that you do not have to worry about modelling the
relation between covariates (or the NONMEM-limit in #mixtures). I went ahead
and did a quick try, but implemented so that we do not have to obtain from
NONMEM the individual probability of a certain SEX, and instead use the
Empirical-Bayes Estimates (EBE) of CL to estimate the probability of SEX.
This approach would allow imputation of several missing covariates using
covariates and EBE:s, but in this example only use the EBE of CL:
1. Change the covariate model for SEX as if it was continuous between 0-1
and impute missing values with the mean. Fit the model in NONMEM and output
the EBE of CL (CL) in a table file
2. Re-impute the missing SEX so that the relation between CL and SEX is as
strong in the "missing subjects" as in the non-missing. Since this step
involves an element of randomness (sampling from the residuals that are
obtained from fitting CL to SEX on the subjects where the covariate is not
missing), this could be repeated several times (multiple imputation), each
time creating a dataset with imputed SEX.
3. Fit a NONMEM model to each dataset created in step 2. These have a more
accurate imputation of SEX (than the nave imputation in step 1). If the
effect of SEX is large and the information on CL is sparse, step 2-3 may
have to be re-iterated, using the updated CL from this step. Whether that is
necessary or not will be seen already in step 1: The effect of SEX will not
change much since in this case we are not using any independent variables
(covariates) in the imputation. The information on CL can be measured as the
shrinkage in the eta on CL.
4. If several datasets have been imputed in step 2 and the parameters differ
they are averaged (simple arithmetic mean). The variability between datasets
is only the variability that is due to the imputation and the SE:s of the
SEX-effect and IIV-CL is difficult to obtain (whereas any of Nick's models
could be bootstrapped).
In case anyone is interested, Ill go through the different steps to explain
what I did in practice for Nick's dataset-example (apologies for any typos
and miss-formattings but I am in a hurry):
STEP 1: For a real example, I would impute with the mean SEX in the included
patient population. Since this was unknown, I used the mean in the subjects
with non-missing SEX (same as Nick). Dropping a few columns from
Nick's/Anthe's datafile (that has missing sex), this model file (run0.mod)
could be used:
#########################################################
$PROB RUN# 115 (PHENOBARBITAL POPULATION PK MODEL)
$INPUT ID TIME AMT WT APGR DV EVID SEX
$DATA cp_missingsex.csv IGNORE=I
$SUBROUTINES ADVAN1
$PK
ISEX = SEX
IF (ISEX.LT.0) ISEX=0.5517 ;Mean in non-missing
DUMMY = THETA(1);NOT ESTIMATED IN NONMEM ANYMORE
TVCL = THETA(2)*(1+THETA(4)*ISEX)
CL = TVCL*EXP(ETA(1))
TVV = THETA(3)
V = TVV*EXP(ETA(2))
K = CL/V
S1 = V
$ERROR
IPRED = F
W = 1
Y = IPRED + THETA(5)*EPS(1)
IRES = DV - IPRED
IWRES = IRES/W
$THETA
(0, 0.424,1) FIX ; TH1: NOT ESTIMATED IN NONMEM
(0, 0.005) ; TH2: POPCL FOR MALES (male set as reference)
(0, 1.5) ; POPV
(-1,0.4) ; CLSEX (Female Fractional change in CL)
(0, 2.5) ; SD
$OMEGA
0.16 ; PPV_CL
0.16 ; PPV_V
$SIGMA
1 FIX ; EPS1
$EST MAXEVAL=9999 PRINT=20 METHOD=COND
$COV PRINT=E
;-------Table as the input-datafile but with columns for EBE:s added:-------
$TABLE ID TIME AMT WT APGR DV EVID SEX CL
ONEHEADER NOPRINT NOAPPEND FILE="imptab1"
#########################################################
There is a moderate (29%) shrinkage in the eta on CL which may be a problem
since the empirical-Bayes estimates of this parameter are used for the
imputation. Shrinkage (1-std(ETA1i)/sqrt(OMEGA1)) is easily calculated using
the PsN-tool[1] execute when running NONMEM:
$ execute -shrinkage run0.mod
Compared to estimating parameters on only non-missing subjects, the
imprecise estimates of IIV on CL and V change substantially (-9% and 23% on
std-scale, respectively) to the better (closer to the values obtained when
SEX is available in all 59 subjects). The typical value of CL for male and
female increase with 5% and 4%, respectively. This is probably due to the
fact that the missing population has (by random chance) a larger difference
between sex and a higher fraction of females (than in the non-missing).
STEP 2:
Run this Splus function (code at the end of this e-mail) to get ten datasets
with imputation of sex:
impute.sex(num.imp=10)
STEP 3:
Create run1.mod run10.mod as run0.mod but using the ten new datafiles
(ebtab1-10, respectively). Estimate the ten models in NONMEM using PsN:
$ execute *.mod -shrinkage -threads=5 -dir MultipleImputed
STEP 4:
Since I used PsN, the parameter estimates from the ten models are neatly
tabled in: "MultipleImputed/raw_results.csv" and can easily be averaged. I
used a simple arithmetic mean of all parameters.
Some parameters are hardly affected by the multiple imputation, where as
others are more variable. Looking at the SE due to imputation, the only
parameter which changes more than one SE from run0.mod (mean imputation) is
IIV on CL which is slightly decreased. This is because the imputed SEX is
better than the mean-imputed, which leaves less variability in CL as random,
but the difference was small in this case (IIV 38.2% -> 37.6%). The effect
of SEX on CL is highly variable between imputations (mean 14%, SE 4%) but
the mean is in the same ball park as when estimating the effect on only the
non-missing (11.6%).
At this stage one could consider to reiterate steps 2 and 3, due to the
problem of shrinkage. In this case, since the estimated effect of SEX is
small and shrinkage did not decrease in the step3-models (average 29%, as
for run0.mod) I stop here.
In conclusion, compared to Nick's solutions this could be a powerful
approach if many important covariates are missing and the relation between
covariates is difficult to model. I am sure that someone else could come up
with better code than the one I used for the multiple imputation in S-plus
(below).
Have a nice summer!
Jakob
"impute.sex" <- function (nm.tab="imptab1", num.imp=1, new.datafile="ebtab",
cat.imp=F, condMean=F, shrink.imputed=F, seed.imputation=1) {
#If many covariates/EBE:s are used for imputation,
#use: shrink.imputed=T
#If no EBE:s are used in the imputation,
#use: condMean=T and num.imp=1
library("Hmisc")
set.seed(i=seed.imputation)
coeff.bias <- NULL
org.data <- read.table(nm.tab, skip=1, header=T)
#Set missing/imputed values to NA before re-imputing
org.data$SEX[org.data$SEX != 0 &
org.data$SEX != 1] <- NA
data <- org.data
data <- data[!duplicated(data$ID),]#Assuming no IOV
if(!any(names(data)=="LNCL")) data$LNCL <- log(data$CL)
missing <- is.na(data$SEX)
CL <- data$CL
LNCL <- data$LNCL
SEX <- as.numeric(data$SEX)
NASEX <- SEX
#----Impute one dataset at a time (if multiple imputation)------------
for (imp.no in 1:num.imp) {
if (condMean) {#Will produce biased result if DV:s are used!!!!
print("WARNING: Using a mean that is conditional on the DV")
print("without a random component will bias the importance")
print("of imputed covariates!!!")
xtrans <- transcan(~LNCL+SEX, imputed=T,
shrink=shrink.imputed,
data=data)
data$ISEX <- impute(xtrans, SEX, data=data)
}
else {#Better to use the aregImpute(). -Approx Bayesian bootstrap
xtrans <- transcan(~LNCL+SEX, imputed=T,
shrink=shrink.imputed,
data=data, n.impute=1)
data$ISEX <- data$SEX
data$ISEX <- round(impute(xtrans, SEX, data=data,
imputation=1),
digits=4)
data$ISEX[missing & data$ISEX==1] <- 0.9999
data$ISEX[missing & data$ISEX==0] <- 0.0001
}
#For imputation to categorical instead of continuous SEX:
if (cat.imp) {
data$ISEX <- ifelse(unclass(data$ISEX)[1:n.subj]<0.5,0,1)
}
if (imp.no<=10) { #Produce diagnostics
print(wo.imputation <- lm(data$LNCL~NASEX,
na.action=na.exclude))
print(with.imputation <- lm(data$LNCL~data$ISEX))
print(plot(jitter(as.numeric(data$ISEX[data$ISEX==0 |
data$ISEX==1]), factor=0.25),
data$LNCL[data$ISEX==0 | data$ISEX==1],
xlab="Observed(o) or imputed(I) SEX",
ylab="Empirical-Bayes Estimate of LN(CL)"))
text(as.numeric(data$ISEX[data$ISEX!=0 & data$ISEX!=1]),
data$LNCL[data$ISEX!=0 & data$ISEX!=1],
labels="I", col=4, adj=0.5)
print(plot(as.numeric(data$ISEX),data$LNCL,
xlab="Observed (0/1) or imputed SEX",
ylab="Empirical Bayes Estimate of LN(CL)"))
}
coeff.bias <- c(coeff.bias,
100*(exp(with.imputation$coefficients[2]) -
exp(wo.imputation$coefficients[2]))/
(exp(wo.imputation$coefficients[2])-1))
#----Write imputed data to file----------------
imp.file <- paste(new.datafile, imp.no, sep="")
imp.dataset <- merge(org.data,
data[,is.element(el=names(data),
set=c("ID", "ISEX"))],
all=T, by=c("ID"))
imp.dataset$SEX <- imp.dataset$ISEX
write.table(imp.dataset[,names(imp.dataset)!="ISEX"],
file = imp.file, sep = ",",
dimnames.write="colnames")
print(paste("Imputed datafile created:", imp.file))
}#----End of multiple-imputation loop---------
print (coeff.bias)
if (num.imp >=10) {#Just a check, don't trust too much...
print ("The estimated bias due to random imputation (%):")
print (signif(mean(coeff.bias),digits=2))
}
invisible()
}
1. Lindbom L, Pihlgren P, Jonsson EN. PsN-Toolkit--a collection of
computer intensive statistical methods for non-linear mixed effect modeling
using NONMEM. Comput Methods Programs Biomed. Sep 2005;79(3):241-257.
_______________________________________________________