Validation of a Population Model
From time to time there has been discussion about the need to "validate" a
population analysis. It is fair to say that Sam Vozeh was an "early"
exponent of this idea (see the latest NONMEM bibliography: Data Analyses
item 36, and Methodological item 21). At NONMEM intermediate level
workshops late in 93 the subject came up repeatedly for discussion. The
matter was also discussed at the COST conference in November.
Below, I present a few ideas for using NONMEM IV for model validation.
They are simple and straightforward, and I hope they will help you. I want
to thank Mats Karlsson for his review and help with this discussion.
In addition, though, please pause for just a few minutes and consider this
question. What else, if anything, do you think should be done in the way
of model validation, and how might NONMEM be improved in this regard? It
may be difficult to answer this question. You might intuit that a more
sophisticated approach could be undertaken than what I shall describe, pos-
sibly one involving more sophisticated statistics. However, perhaps not
being a statistician, or perhaps not having sufficient time to work out the
details, you do not know *exactly* what to suggest, and, of course, until
one sees an approach fully worked out, one really doesn't know whether one
has a good idea. Some of us at the NONMEM Project Group have already given this
matter some thought and believe that model validation may indeed be a
tricky business. Still, if you think you have a cogent idea regarding a
more sophisticated approach, I'd like to hear from you. If you reply, I'll
not be critical; indeed, I'll probably not have the time to offer a
critique. I'll acknowledge receipt of your message, and I'll be thankful
that you offered something for us to think about. I may also need to dis-
cuss something further with you, in which case, I'll contact you directly.
On the other hand, most of you are in a position to recognize that some of
what I'm about to describe might be implemented more easily with certain
changes to NONMEM. Comments of this sort are also welcome and would be
appreciated. Do ask yourself first, though, whether what I'm about to
describe is so difficult to implement currently that the user can indeed be
much better served by changing NONMEM. If you would like to share your
thoughts with others on the users net, please of course do so. However, I
do not regularly read mail on the net; so to share your thoughts with me
too (or with me alone if you prefer), please respond directly to me at
stuart@c255.ucsf.edu.
____________________________________________________________________________
INTRODUCTION
By model validation, I mean obtaining information about how well a popula-
tion model describes a set of data (called the "validation" or "test" set),
none of which was used to develop the model itself. The data set used to
develop the model is called the "index" or "learning" set, and the model
that is developed may be called "the fitted model", or simply "the model".
I shall take the term "the model" to mean a model form *together with* a
set of values for the model parameters.
A validation data set can be entered into a NONMEM problem, as can any
index population data set. The parameter values of the model (THETA,
OMEGA and SIGMA) can be entered as initial estimates, or an input MSF can
be used which contains these values from the earlier problem which fit the
model to the index data. (The same MSF can be used with different data
sets.) The Estimation Step will not be implemented, so these population
parameter values will not change. Predictions and the like will be com-
puted using these values.
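As a sketch, a validation problem might look as follows. The data items,
structural model, error model, and the numerical values shown are all
assumptions for illustration only; your own index-fit results would be
substituted.

```
$PROB  Validate fitted model against the test data set
$DATA  VALID.DTA                     ; the validation ("test") data set
$INPUT ID TIME AMT DV WT AGE         ; assumed data items
$SUBROUTINES ADVAN1 TRANS2           ; assumed structural model, as in the index fit
$PK
   CL=THETA(1)+THETA(2)*WT+ETA(1)    ; assumed parameter models
   V=THETA(3)+ETA(2)
   S1=V
$ERROR
   Y=F+F*ERR(1)
$THETA 1.2 0.05 10                   ; final estimates from the index fit (illustrative)
$OMEGA 0.1 0.2                       ; (an input MSF, via $MSFI, could be used instead)
$SIGMA 0.04
$ESTIMATION MAXEVALS=0               ; Estimation Step not implemented;
                                     ; parameter values remain as given above
```

Tables and scatterplots of predictions, residuals, etc. may then be appended
to this problem as usual.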
The basic idea, of course, is to see how close model predictions are to the
validation data. "Closeness" may be judged in clinical, rather than sta-
tistical, terms. That is, the investigator might ask "do the differences I
see mean that there can arise some clinical problem if I try to use this
model?" It's important to make the distinction between clinical satisfaction
and statistical satisfaction. The suggestions made herein are offered in
the spirit of helping with the question of clinical satisfaction. They are
really just a couple of very simple ones.
There are two types of predictions one might make of observations from an
individual in a validation set. The first is a prediction made without the
benefit of any other observations from the individual that might be
available, i.e. without using "feedback observations". This is the type of
prediction made for a patient from whom no observations have yet been
made, or, if they have been made, they are being ignored. It is the "popu-
lation prediction", i.e. the one given by the population model using the
value eta=0. It is available in a NONMEM table under the PRED heading, and
can be included in a scatterplot by using the label PRED. The second type
of prediction is made using some observations from the individual ("feedback
observations"), but different ones from those being predicted. I shall dis-
cuss this type of prediction also, and remind you how they may be obtained.
I'll call them "feedback predictions". Another term for feedback predic-
tions might be "intraindividual predictions". As the number of feedback
observations increases, feedback predictions depend increasingly on the
feedback observations, and less so on much of the population model, so that
the nature of what is being validated changes.
POPULATION PREDICTIONS
Consider first the population predictions. A simple graphical approach may
suffice to examine how close predictions are to the validation data. Of
course, if necessary, the numbers themselves can be examined in detail in a
table. DV vs PRED is one helpful plot. When generating this plot,
remember to include the option UNIT in the $SCATTERPLOT record in order to
obtain a line with slope=1 on the scatterplot. Some people may prefer to
plot (or plot in addition) RES versus a covariate such as ID or AGE, or
perhaps RES vs PRED. A plot like RES vs AGE allows a judgement to be made
regarding the (clinical) adequacy of the model for different subpopula-
tions. The numbers can be summarized with various statistics, but for
clinical purposes, probably as much, if not more, information is presented
by these simple pictures as is presented by the statistics. Often the best
statistics are pictures. Descriptive statistics that help assess all
interesting features of certain pictures may be difficult to formulate;
interesting information can easily "get lost" in the summarization. How-
ever, I do not mean to suggest that in general this is not a feasible or
useful activity. Formal statistics are certainly going to be required when
it is important to assign a statistical error level to some inference.
Some further discussion about a more formal statistical approach is given
below.
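In NM-TRAN terms, the plots just mentioned can be requested with records like
the following (AGE is an assumed covariate in the data set):

```
$SCATTERPLOT DV VS PRED UNIT      ; UNIT adds the line with slope=1
$SCATTERPLOT RES VS PRED
$SCATTERPLOT RES VS AGE           ; adequacy across subpopulations
$SCATTERPLOT RES VS ID
```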
Another plot to help examine the closeness of predictions to observations
is one which must be developed outside of NONMEM, using a graphics package
with a table output by NONMEM. This is a plot of the data vs covariate,
superimposed on a plot of PRED vs covariate. (There are a number of good
graphics packages, and it is not a goal of the NONMEM Project Group at this
time to include their capabilities into NONMEM itself. This is not to say
that certain NONMEM improvements using printer-plots, which are the type of
plots output by NONMEM, could not or should not be undertaken.)
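For such an externally produced plot, a table containing the needed items can
be output by NONMEM and then read by the graphics package; e.g. (with TIME
standing in for whatever covariate is of interest):

```
$TABLE ID TIME DV PRED    ; plot DV vs TIME superimposed on PRED vs TIME
```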
With these plots, there will be considerable differences (errors) between
predictions and observations, and it may be of some clinical concern to
know whether these differences are within the realm of error described by
the population model. After all, if the population model does not describe
the observed errors well enough with the validation data, then how reliable
is the model; might its application with yet a second validation data set
yield errors which collectively appear different from those seen with the
index set *and* from those seen with the first validation set? However, it
is first important to observe how large the errors can be. After all, if,
for example, the errors are smaller than the model describes, yet remain
small enough across all data sets, then the model may be unreliable in the
way just described, but nonetheless quite adequate.
Population standard deviations measure the magnitudes of the errors, where
the errors are regarded as arising due to *both* random intraindividual and
random interindividual effects. They, just as the predictions, are numbers
computed solely from the model. That is, they are independent of the
observations, although they do depend on the values of the *independent*
variables found in the data set. A population s.d. might be regarded as
"a prediction of an unsigned error".
Weighted residuals (WRES) have been a part of NONMEM output since the
program's inception. The WRES are the RES expressed in (i.e. as fractions
of) population standard deviation units. The way this is done is, however,
a little complicated; see below. A plot of WRES vs a covariate (e.g. ID)
can be used to make some assessment about whether the errors follow the
description given for them under the population model. Ideally, the WRES
should be homogeneously scattered about the zero line on the WRES axis,
provided the description is a good one. For the scatter to be homogeneous,
the amount of scatter in the WRES in each little region along the covariate
axis, should be about the same. Ideally, all the WRES are almost indepen-
dent of each other, even within the data from a given individual. That is,
when viewing the data from a number of individuals, the correlation one
usually sees within the data from a given individual, due to interindivi-
dual differences, should not be seen within the WRES from a given indivi-
dual, or at least should not be seen as nearly as much, provided (once
again) the model well-describes (the variability and covariability of) the
differences. Were this not so, the scatter in the plot of WRES vs ID would
not appear to be homogeneous.
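The WRES plots discussed here require nothing beyond ordinary $SCATTERPLOT
records, e.g.:

```
$SCATTERPLOT WRES VS ID     ; scatter should be homogeneous about the zero line
$SCATTERPLOT WRES VS PRED
```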
The WRES are also routinely computed with the index data. The nominal
homogeneous scatter of the WRES allows them to be used with these data during
the model building process to detect a covariate influence that has not
been modeled well enough. If one sees a trend in a plot of the WRES vs the
covariate, rather than a homogeneous scatter, this means that the model is
not describing the variability well enough; the model might be improved by
more appropriate use of the covariate. The same is true with the valida-
tion data, the only difference being that there may be no further opportun-
ity for model improvement using validation data; the validation data set
may have been set aside *only for validation* purposes. On the other
hand, suppose, for example, that the different values of AGE in the index
data lie in a small range, so that an AGE influence was not detected with
the index data, but that the values in the validation data are more widely
dispersed, and, using this data, an influence seems to exist. Then one
might contemplate changing the model *during the validation process* to
include AGE. Perhaps one might take the point of view that the model with
AGE distributed as with the index data can still be "validated", whereas
the model with AGE distributed as with the validation data must remain
"provisional".
Suppose that for every individual in the validation set, there is just one
DV item. Of course, by choosing one measurement per individual, this can
always be arranged. Then with a given individual, there is just one RES
and one WRES. The WRES is a simple normalization of the RES, viz. the RES
divided by the population standard deviation of the observation. In this
case there is another plot that might be considered, one allowing a view of the
data themselves along with an indication of a range within which the data
might occur, as predicted by the model. This plot, as with another men-
tioned above, must be developed outside of NONMEM, using a graphics package
with a table output by NONMEM. It is a plot of the data vs covariate,
superimposed on a plot of PRED +-2 population s.d. vs covariate. By PRED
+- 2 population s.d., I mean that both the numbers PRED +2 population s.d.
and PRED -2 population s.d. are plotted. A population standard deviation
may be obtained using the formula population s.d. = RES/WRES. However, it
should be understood that the number 2 has no magic meaning. Were (i) the
data normally distributed, (ii) the estimates being used for the population
parameters error-free, and (iii) the model description of the errors a good
one, then about 95% of the data should lie within the PRED +- 2 population
s.d. range. Since, though, assumption (i) need not hold (it is *not a
part* of the model), nor need (ii) hold, one can use this plot only in
a qualitative way. In general, it may be more difficult to use this plot
than a plot of WRES vs covariate, because the errors are only implicitly
shown and are not adjusted to be homogeneously scattered (about the PRED
curve).
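One way to construct this plot: output RES and WRES in a table, and, outside
NONMEM, recover the population s.d. and form the two delimiting curves. A
sketch (the arithmetic following the $TABLE record would be done in the
graphics package or other external program; TIME is an assumed covariate):

```
$TABLE ID TIME DV PRED RES WRES
; outside NONMEM, for each record:
;   SD = RES/WRES                  (the population s.d. of the observation)
;   plot DV vs TIME, superimposed on PRED+2*SD and PRED-2*SD vs TIME
```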
A little care may need to be taken with the computation of PRED +- 2 s.d.
Suppose the index data (and presumably the validation data) are distributed
in a rather skewed way (and suppose the data are concentrations, so that
they are all nonnegative). Then it would be reasonable to have developed a
population model for transformed data (for example, log transformed data).
The structural model for the transformed data might have been obtained by
taking a structural model for the untransformed data and applying the
identical transformation to it (the "transform both sides" method). With
this technique, predictions of untransformed data are obtainable, but the
NONMEM ELS-type objective function would have been used to fit the differ-
ences between transformed values of these predictions and transformed
(index) data, and thus applied to differences that are approximately sym-
metrically distributed. Then the plot in question (using the validation
data) could be one of transformed data or one of untransformed data. If,
though, untransformed data are plotted, then the two data-delimiting curves
on which these data are superimposed are the *inverse transformed* PRED +-
2 s.d. curves, where PRED +- 2 s.d. refers here to those curves that would
be used with the transformed data. To understand this, it may be instruc-
tive to consider the plot if a transformation had not been used to fit the
index data. In this case, the validation plot would suggest that much
smaller errors occur at low concentrations than are predicted by the model,
which, of course, would be the case. Indeed, the plot would suggest that
there can be negative concentrations.
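With the transform-both-sides method and a log transformation, the abbreviated
code might look as follows; here it is assumed that DV in the data set holds
log-transformed concentrations.

```
$ERROR
   Y=LOG(F)+ERR(1)    ; prediction transformed exactly as the data were
```

PRED is then a prediction of log concentration, and the data-delimiting
curves for a plot of untransformed data are the exponentials of PRED+2*SD
and PRED-2*SD, computed outside NONMEM as described above.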
When there is more than one DV item per individual, a WRES cannot be
associated with a particular DV data item. In this case, each WRES for a
given individual is a linear combination of the RES for that individual and
thus is a function of all the data from that individual. In this case,
there is no way (of which I know) to examine "predicted unsigned errors"
and the data in the same plot.
Vozeh (NONMEM bibliography: Data Analyses item 36) suggests computing the
average RES (ARES) for each individual, i.e. the arithmetic average of all
the RES from the individual. This is just the difference between the aver-
age observation and the average prediction. The population standard devia-
tion of the ARES, given by the model, can be computed, and a plot of the
average observation vs covariate, superimposed on a plot of the average
PRED +-2 population standard deviations, can be made. This plot does give
one the ability to view observations, although they are *average observa-
tions*, along with predicted unsigned errors. However, there is a problem
interpreting the average observation and the ARES, since these are averages
over a number of nonrandom heterogeneous conditions. For example, time,
and dosage too perhaps, vary over an individual's record. Moreover, the
ARES is but a single summary statistic of all the errors from the indivi-
dual; some information about these errors will not be reflected in this
statistic. It is one particular linear combination of the errors. In con-
trast, each WRES from the individual is also a linear combination of the
errors, and there are as many of these as there are observations from the
individual. From all the WRES from the individual (and the population
variance-covariance matrix of the data from this individual), all the
errors can, in principle, be recomputed. For these reasons, and for the
others mentioned above, it is not clear to me that one would not be just as
well off with a plot of the WRES.
Vozeh suggests performing a t-test, based on taking the average of the ARES
across the individuals, to test the hypothesis that the model describes the
errors in the validation data set. (The test uses the population s.d.'s of
the different ARES values, and so, strictly speaking, is not a t-test, but
is a z-test.) A test of this sort, testing the same hypothesis, can just
as well be based on the average WRES over the entire set of WRES. (The
population s.d. of each WRES is 1). Actually, neither test is an efficient
one to test the hypothesis. Both tests are primarily ones of the
hypothesis that the population average ARES or population average WRES is
0, and only secondarily do they test that the assumed population s.d.'s
truly describe the magnitudes of the errors. Moreover, they incorrectly
assume that the estimates for the population parameter values are given
without error. Finally, the very idea that a proper test of the hypothesis
could be useful is questionable. How badly must the errors be described by
the model before a clinical problem of some sort arises?
FEEDBACK PREDICTIONS
Feedback predictions of data from a given individual are based on estimates
of individual-specific (PK/PD) parameter values for the individual, which
in turn are based on data from that individual (feedback observations)
other than those which are being predicted. The parameter estimates are
obtained by means of Bayesian estimation. If the amount of data from the
individual is very small, the estimates of many of the individual-specific
parameters are essentially the estimates for the typical individual that
are given by the population model. This is because as the amount of data
decreases, individual-specific parameter Bayesian estimates become increas-
ingly constrained by the "Bayes penalty term" in the Bayes objective func-
tion to be "close to" the typical individual parameter estimates. As the
amount of data per individual increases, the Bayesian term becomes less
influential, and the individual-specific Bayesian parameter estimates
become ELS estimates.
If the amount of data per individual is large, the Bayesian estimates
depend little on the population model, except for its basic structural form
(and except for the intraindividual error model; see below). Hopefully,
one has a pretty good idea about the basic structural form before embarking
on the analysis of the index data. If, though, there remains some ambi-
guity about this, or simply to check that things are as anticipated and
that no mistakes have been made, the basic form can be validated by compar-
ing the observations of the validation set with their feedback predictions.
One way to do this is to run NONMEM with multiple problems, as many as
there are individuals in the validation data set. With each problem, the
basic structural model is fit to the data from a different individual, and
thus feedback predictions for the individual's data are obtained. Bayesian
estimation can be used (see e.g. NONMEM Guide II, section C), but, as men-
tioned above, with much data per individual, the Bayes penalty term is
negligible. It is sufficient simply to use ELS (the usual and default
method for fitting single-subject data).
All the separate ELS fits can be done within one NONMEM run by stacking all
the problem specifications in a single control stream. In this case, the
$DATA record with each problem points to the entire validation data set.
However, with each problem, either the RECORDS option on the $DATA record
must give the number of data records for that individual, or a FINISH
record must terminate these particular data records (see Guide IV, chap.
III, sec. B.5). Each problem involves single-subject data, and with such
data, the ID data items must differ between observation data records. NM-
TRAN automatically generates appropriate ID data items, different from
those included in the population data set (see Guide IV, chapter II, sec-
tion C.4).
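A skeleton of such a stacked control stream follows. The record counts, data
items, structural model, and initial estimates are assumptions; note also
that $SUBROUTINES and the abbreviated code appear only with the first
problem, since they apply to the entire run.

```
$PROB  Validation subject 1: single-subject ELS fit
$DATA  VALID.DTA RECORDS=12   ; 12 data records for this subject (assumed count);
                              ; alternatively, a FINISH record in the data set
$INPUT ID TIME AMT DV
$SUBROUTINES ADVAN1 TRANS2
$PK
   CL=THETA(1)
   V=THETA(2)
   S1=V
$ERROR
   Y=F+F*ERR(1)
$THETA (0,1) (0,10)
$OMEGA 0.04                   ; with single-subject data there is a single level
                              ; of random effect (see Guide II)
$ESTIMATION                   ; ELS, the usual method for single-subject data

$PROB  Validation subject 2
$DATA  VALID.DTA RECORDS=9    ; reading continues with the next subject's records
$INPUT ID TIME AMT DV
$THETA (0,1) (0,10)
$OMEGA 0.04
$ESTIMATION
; ... one such problem per individual in the validation set
```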
However, an even simpler way to obtain feedback predictions, one which can
be used with any amount of data per individual, is to actually use Bayesian
estimation, implemented as follows. Construct a single problem, using the
entire validation data set. Include POSTHOC and MAXEVALS=0 on the $ESTIMA-
TION record. By setting MAXEVALS=0, the Estimation Step is not imple-
mented, i.e. the population parameter values are just those given as ini-
tial estimate values (or are those in the MSF output from the run fitting
the model to the index data). The use of POSTHOC, however, directs Baye-
sian estimation to be carried out with each individual's record; Bayesian
estimates of all the eta's, for all individuals, are obtained, conditional
on the population parameter values. Predictions based on these estimates
are displayable in a table or scatterplot by including a statement such as
IPRED=F (in $ERROR), and using the label IPRED in the $TABLE or $SCATTER-
PLOT record. The name IPRED (for intraindividual predictions) is one I
made up; you could use some other name in your abbreviated code. Intrain-
dividual residuals, obtainable by also including the statement IRES=DV-F
(in $ERROR), can also be plotted. (The estimates of an individual's eta's
are essentially the estimates of that individual's parameters. For exam-
ple, if the parameter is modeled as theta1 + theta2*WT + eta, then theta1
and theta2 are fixed in the computation of the estimate, as is WT (assuming
WT doesn't change within the individual), and so the difference between the
estimate of the parameter and the estimate of the eta is just the constant
theta1 + theta2*WT.)
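Putting these pieces together, a single problem producing feedback
predictions for the whole validation set might look like this; the data
items, structural model, error model, and numerical values are again
assumptions for illustration.

```
$PROB  Feedback (POSTHOC) predictions for the validation set
$DATA  VALID.DTA
$INPUT ID TIME AMT DV WT
$SUBROUTINES ADVAN1 TRANS2
$PK
   CL=THETA(1)+THETA(2)*WT+ETA(1)
   V=THETA(3)+ETA(2)
   S1=V
$ERROR
   IPRED=F                        ; feedback (intraindividual) prediction
   IRES=DV-F                      ; intraindividual residual
   Y=F+F*ERR(1)
$THETA 1.2 0.05 10                ; final estimates from the index fit (illustrative)
$OMEGA 0.1 0.2                    ; (or use an input MSF via $MSFI)
$SIGMA 0.04
$ESTIMATION POSTHOC MAXEVALS=0    ; Bayesian etas; no population estimation
$TABLE ID TIME DV IPRED IRES
$SCATTERPLOT DV VS IPRED UNIT
```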
The accuracy of the intraindividual error model can also be assessed. Just
as there are population standard deviations, there are intraindividual
standard deviations. These measure the magnitudes of the errors (the
intraindividual errors) between an individual's observations and their
feedback predictions due to random intraindividual effects only. They are
computed under the intraindividual error model for the individual's data,
which, in this discussion, is understood to be a model form together with
parameter values *given by the Bayesian estimates*. With a heteroscedastic
intraindividual error model which depends on feedback predictions, not only
does the s.d. vary for different observations within the individual, but
even when two individuals have the same values for *all* independent vari-
ables, the s.d. can differ for the observations between the two individu-
als. This is because the feedback observations are generally different
between the two individuals, and therefore, so are the feedback predic-
tions.
Intraindividual residuals can be transformed to intraindividual weighted
residuals, i.e. can be expressed in intraindividual standard deviation
units. Unlike the situation with population weighted residuals, each
intraindividual weighted residual can always be associated with a single
observation; it is the residual divided by the s.d. To obtain intraindi-
vidual weighted residuals, include the statement IWRES=IRES/H, when Y is
given by F+H*ERR. Plots of the data along with a predicted range can also
always be made. In this regard, note that the s.d. itself is given by
IRES/IWRES.
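For an error model written as Y=F+H*ERR(1), the corresponding $ERROR code
might be as below; H is named W here simply as a conventional choice of
user variable, and its particular form (a constant-CV model) is an
assumption.

```
$ERROR
   W=F                ; the H of the text: s.d. proportional to the prediction
   Y=F+W*ERR(1)
   IPRED=F
   IRES=DV-F
   IWRES=IRES/W       ; residual in intraindividual s.d. units
```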
To some extent, residuals and weighted residuals can be pooled over indivi-
duals. That is, a plot can include (for example) the residuals from all
individuals. However, consider the fact that feedback predictions vary
between individuals, if for no other reason than that the Bayesian esti-
mates vary between individuals. One might wonder whether this variability,
or, more precisely, this source of variability, might affect the accuracy
of the plot. Population predictions, in contrast, do not vary between
individuals due to variability in parameter estimates. It seems intuitive
to me that as long as the "precision" of the Bayesian estimates is fairly
constant across individuals, interindividual variability in feedback pred-
ictions should not affect the accuracy of the plots. A problem might
arise, though, when the feedback observations are obtained under very dif-
ferent designs for different individuals. Moreover, the designs could
differ in systematic, rather than random ways. Then the precision of the
Bayesian estimates might vary considerably with the individual and in a
nonrandom way. We have no experience that counsels us about this situa-
tion. If necessary (or even in addition to pooled plots), residual and
weighted residual plots can be stratified by individual (using list 3 of
the $SCATTERPLOT record).
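Pooled and stratified versions of these plots, in NM-TRAN terms (with TIME
as an assumed covariate):

```
$SCATTERPLOT IWRES VS TIME          ; pooled over individuals
$SCATTERPLOT IWRES VS TIME BY ID    ; stratified: one plot per individual
```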