From: "Iyer, Ganesh R [PRI]" <GIyer1@prius.jnj.com>
Subject: Bootstrap resampling!
Date: Fri, 23 Mar 2001 08:06:09 -0500
Dear NonMem Users
This is something that came to my mind regarding the Bootstrap procedures we regularly perform to understand the stability and performance of a PK/PD model. I had 30 subjects with both PK and PD data that I was trying to model, and I wanted to assess the stability and performance of my model. I was doing this using the Bootstrap (thanks to Ruedi and Vladimir for providing me with the script file for S-plus!) to get a better feel for what the underlying distribution of my parameters really is. The bootstrap procedure that was followed samples subjects at random but leaves each subject's concentration data unchanged. In one of his papers (JPB, 1997), Ene Ette nicely writes, "By resampling the original data with replacement, artificial samples are generated on which the inference of interest can be made as for the original sample." What we see is that subjects are resampled but each subject's concentrations from the original data set travel with that subject, and the assumption is that our bootstrap data sets follow a similar distribution to the original data set. I am just wondering whether it's a good idea to sample residuals at random rather than subjects, although the residuals depend on the concentration measurements?
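The subject-level resampling described above can be sketched as follows. This is a minimal Python illustration (not the S-plus script mentioned above); the data layout and names are hypothetical:

```python
import random

def bootstrap_subjects(records, seed=None):
    """Resample subjects with replacement; each subject's full profile
    (times, concentrations, covariates) is copied unchanged.
    `records` maps subject ID -> list of observation rows."""
    rng = random.Random(seed)
    ids = sorted(records)
    picks = [rng.choice(ids) for _ in range(len(ids))]
    # re-number so a subject drawn twice appears as two distinct IDs
    return {new_id: list(records[old_id])
            for new_id, old_id in enumerate(picks, start=1)}

# toy example: 3 subjects, 2 (time, concentration) observations each
data = {1: [(1, 10.0), (2, 5.0)],
        2: [(1, 20.0), (2, 9.0)],
        3: [(1, 15.0), (2, 7.0)]}
boot = bootstrap_subjects(data, seed=0)
```

Each bootstrap data set has the same number of subjects as the original, and every resampled "subject" is an intact copy of one original profile.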
I would appreciate your help on this topic.
Many Thanks in Advance,
Ganesh
------------------------------
Ganesh R. Iyer
Clinical Pharmacokinetics and Pharmacology Dept
RW Johnson Pharmaceutical Research Institute
Raritan, New Jersey
Bootstrap resampling!
25 messages
13 people
Latest: Apr 07, 2001
From: "Paul Williams" <pwilliams@uop.edu>
Subject: Re: Bootstrap resampling!
Date: Fri, 23 Mar 2001 07:59:07 -0800
If I am understanding you correctly, you have two choices: 1] take ALL of a subject's data [including the concentrations] and copy it into the bootstrap data set, or 2] bootstrap the weighted residuals.
If you choose option 1 you would have to take all of a subject's data, including dose and concentration history, covariates, etc., and copy it [i.e., resample with replacement] into the bootstrap data set. It would not make sense to take the concentrations from a 100 kg individual and assign them to a 50 kg individual, especially if clearance or volume were related to size.
How are you using the bootstrap? Are you using the less synthetic percentile bootstrap, are you bias correcting your original estimates and are you using the bootstrap to internally validate the model?
Cheers!
Paul Williams
From: Nick Holford <n.holford@auckland.ac.nz>
Subject: Re: Bootstrap resampling!
Date: Sat, 24 Mar 2001 13:48:45 +1200
Paul,
Paul Williams wrote:
> How are you using the bootstrap? Are you using the less synthetic percentile bootstrap, are you bias correcting your original estimates and are you using the bootstrap to internally validate the model?
Would you please explain why you are asking these questions? And please define what you mean by "less synthetic percentile", "bias correcting" and "internally validate"?
Thanks,
Nick
--
Nick Holford, Divn Pharmacology & Clinical Pharmacology
University of Auckland, 85 Park Rd, Private Bag 92019, Auckland, New Zealand
email:n.holford@auckland.ac.nz tel:+64(9)373-7599x6730 fax:373-7556
http://www.phm.auckland.ac.nz/Staff/NHolford/nholford.htm
From: harry.mager.hm@bayer-ag.de
Subject: Antwort: Bootstrap resampling!
Date: Sat, 24 Mar 2001 15:09:30 +0100
Dear Ganesh,
Resampling residuals in order to evaluate a model is a well-established procedure, although resampling from the set of profiles (in this case actually resampling subjects) may be preferable, since it applies in more general settings. The former procedure relies on strong assumptions about the distribution of the residuals; with the latter, the assumptions may be weaker.
However, I am not quite sure that classical bootstrapping of a model, regardless of whether residuals or entire observation vectors are resampled, tells you that much about "model stability" in a pharmacometric sense. It is certainly not a "validation", and the results, for example the empirical variances of the model parameter estimates calculated from the bootstrap samples, are just (other) estimates of the parameter variances; if you like, you may compare them to those obtained in the original (Nonmem) model.
Bottom line: In case you do not trust your (Nonmem) variances of the parameter estimates, bootstrap is an option. If you want to explore the stability of the model (robustness against small changes in the data etc.), alternative methods (e.g., cross validation) might be considered.
Best regards,
Harry Mager
__________________________________________
Bayer AG
PH-PD-GCP Biometry & Pharmacometry
D-42096 Wuppertal / Bldg. 470
Telefon: +49 (0) 202-36-8891
Telefax: +49 (0) 202-36-4788
eMail: Harry.Mager.HM@Bayer-AG.de
From: "Paul Williams" <pwilliams@uop.edu>
Subject: Re: Bootstrap resampling!
Date: Tue, 27 Mar 2001 09:18:14 -0800
Less synthetic percentile bootstrap:
There are two approaches to applying the bootstrap method. The first is the standard bootstrap, which assumes some type of distribution (usually the normal distribution) throughout the entire modeling process (both development and model checking) and therefore relies on the formulae that are used to calculate means, standard errors, 95% CIs, etc. The percentile bootstrap is less reliant on formulae that are a function of an assumed distribution: in the end it ranks the element(s) of interest, takes the 2.5th percentile element and the 97.5th percentile element, and constructs the 95% confidence interval as the distance between these two.
For example, I was previously interested in a ppk model for an antifungal agent: Cl = theta1 * clcr + theta2. I wanted the 95% CI for theta1. I constructed 1000 bootstrap data sets and estimated the model for each of them. Rather than plugging the 1000 values into a formula that assumes a normal distribution to calculate the standard error and then the 95% CI, I ranked the 1000 values for theta1 and took the 25th as the lower boundary of the 95% CI and the 975th as the upper boundary. It should be noted that when using the percentile method one must construct at least 1000 data sets and re-estimate the model on all 1000.
So I call this a "less synthetic" approach because (1) it is less reliant on formulae and underlying assumptions about distributions, and (2) the intervals come directly from a ranking of the data, not from a series of calculations. The percentile bootstrap can have the advantage of avoiding nonsense estimates, which may sometimes come about when the standard normal distribution is assumed. For example, I have occasionally had results indicating that the lower boundary of a 95% CI for a coefficient of variation for inter-subject variability was intractable (i.e., would be less than 0, which would not make sense). This won't happen with the percentile bootstrap.
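The ranking step described above can be sketched in a few lines of Python. The 1000 theta1 values here are simulated stand-ins, not estimates from any real model:

```python
import random

random.seed(1)
# stand-in for 1000 bootstrap estimates of theta1 (values are hypothetical)
theta1 = sorted(random.gauss(0.05, 0.01) for _ in range(1000))

# naive percentile bootstrap: rank the estimates, then take the 25th and
# 975th ranked values as the 2.5th and 97.5th percentiles
lower, upper = theta1[24], theta1[974]
ci95 = (lower, upper)
```

No distributional formula is involved: the interval endpoints are simply two of the ranked bootstrap estimates themselves.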
Why do I say internal validation?
I would divide model validation into two types: 1] external validation, which is the most stringent approach, and 2] internal validation methods such as the bootstrap or cross-validation. Internal validation methods are attractive when it is difficult to obtain a new data set for an external validation (such as in pediatrics or rare diseases) or when drug approval should proceed expeditiously (such as treatments for AIDS). The FDA "Guidance for Industry: Population Pharmacokinetics" has called these methods internal validation (see page 16 of the Guidance) and has recognized their appropriateness. Although the process is intriguing, I don't want to go into the exact mechanism used to internally validate a model but would refer you to Efron and Gong, "A leisurely look at the bootstrap, the jackknife and cross-validation", The American Statistician 1983;37:36-48, and Ene Ette's paper "Stability and performance of a population pharmacokinetic model", J Clin Pharmacol 1997;37:486-495. So, I call these internal validation because the validation comes from the data that were originally used to estimate the model.
Bias Correction: please see Efron's text "An Introduction to the Bootstrap" chapter 10 [ISBN = 0-412-04231-2].
Why would I ask these questions? I was being eclectic and interested in Ganesh's approach to the bootstrap.
A comment for the good and welfare of all: It does not seem to me that bootstrapping residuals is the appropriate approach for population PK or PD modeling. I have looked at this and the within subject residuals are correlated for population models. The exception would be if cross-sectional sampling was done. So it seems to me that one is restructuring the entire data set when the residuals from subject A are assigned to subject B. Also, sampling of residuals assumes that we know the population model(s) with certainty. I am not sure one can make such an assumption. The safe approach is to randomly sample individuals (with entire data associated with each individual) with replacement to create bootstrap data sets.
Cheers to all!
Paul
From: Nick Holford <n.holford@auckland.ac.nz>
Subject: Re: Bootstrap resampling!
Date: Wed, 28 Mar 2001 10:01:59 +1200
Paul,
Paul Williams wrote:
>
> Less synthetic percentile bootstrap:
>
A few months ago we discussed the bootstrap terminology in nmusers and the terms parametric vs non-parametric bootstrap were mentioned. Parametric might also be called "all simulation" or "totally synthetic" because all the bootstrap data sets are simulated from a parametric model for the fixed and random effects. The non-parametric bootstrap involves resampling from the original data and typically the sampling unit is an individual subject (when applied to mixed effect models).
The next thing to consider is what to do with the 1000 estimates of a parameter such as clearance (CL) which you have obtained from analysing 1000 bootstrap data sets. At this point it does not matter if you used the parametric or non-parametric bootstrap method. You may be interested in a confidence interval for the estimate. (I prefer the term "credible interval" which is popular among BUGSy (Bayesian) types -- both are CI so you can interpret CI as you prefer).
There are two basic approaches here. The first we might call parametric and the second non-parametric. The parametric approach first computes the standard error from the empirical distribution of 1000 CL estimates then uses some formula such as CI=mean+/-1.96*SE to predict the asymptotic 95% CI. As you point out this can sometimes give you crazy results e.g. negative CL values if SE is large. The non-parametric approach is the one you describe below where you rank the estimates and pick the CL values that are at the 2.5% and 97.5% quantiles. This is much more robust and cannot predict negative CL values. It may also produce an asymmetrical CI which is fine by me but impossible using the simple parametric approach.
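The two intervals contrasted above can be computed side by side. This Python sketch uses simulated, deliberately skewed replicates (no real CL estimates are implied) to show how the normal approximation can predict an impossible negative value while the quantile interval cannot:

```python
import random
import statistics

random.seed(2)
# hypothetical bootstrap replicates of a strictly positive parameter such
# as CL, drawn from a skewed distribution to make the point
cl = sorted(random.lognormvariate(0.0, 1.5) for _ in range(1000))

# "parametric" approach: CI = mean +/- 1.96*SE, where the SE is the
# standard deviation of the bootstrap replicates
mean = statistics.fmean(cl)
se = statistics.stdev(cl)
para_lo, para_hi = mean - 1.96 * se, mean + 1.96 * se

# "non-parametric" approach: naive quantile (rank) interval
quant_lo, quant_hi = cl[24], cl[974]
```

Because every replicate is positive, the quantile interval is necessarily positive (and here asymmetric about the mean), while the normal approximation extends below zero.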
So I think your term "less synthetic percentile" is what one might call the non-parametric naive quantile approach. There are other, less naive, methods. Davison and Hinkley have written an excellent book (Bootstrap Methods and Their Application, Cambridge Series in Statistical and Probabilistic Mathematics, No. 1, by A. C. Davison and D. V. Hinkley), which Steve Duffull "stole" from me last November so I cannot refer to it right now, which describes these in detail. It includes "studentizing" the quantiles by incorporating the standard error of the estimate of CL for each replication. Of course this is rarely feasible using NONMEM because getting the covariance step to run for every replication is very unlikely!
Paul Williams wrote:
> There are two approaches to applying the bootstrap method. [...]
[stuff deleted -- NH]
Paul Williams wrote:
> A comment for the good and welfare of all: It does not seem to me that bootstrapping residuals is the appropriate approach for population PK or PD modeling. I have looked at this and the within subject residuals are correlated for population models. The exception would be if cross-sectional sampling was done. So it seems to me that one is restructuring the entire data set when the residuals from subject A are assigned to subject B. Also, sampling of residuals assumes that we know the population model(s) with certainty. I am not sure one can make such an assumption. The safe approach is to randomly sample individuals (with entire data associated with each individual) with replacement to create bootstrap data sets.
I agree. The sampling unit needs to be the subject, not the residual, in order to preserve the within-subject correlations and to allow for heteroscedasticity. The residual sampling approach assumes that the residual has the same typical size for all observations.
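The heteroscedasticity point can be illustrated with a toy example (a Python sketch with simulated proportional-error data; no real data set is implied). Pooling residuals and pasting them back at random gives the low concentrations noise sized for the high ones:

```python
import random

random.seed(4)
# predictions spanning a wide range, with proportional (heteroscedastic)
# residual error: the residual SD grows with the prediction F
f = [1.0, 2.0, 5.0, 10.0, 20.0] * 40
y = [fi * (1 + random.gauss(0, 0.2)) for fi in f]
resid = [yi - fi for yi, fi in zip(y, f)]

# residual-resampling bootstrap: paste randomly drawn residuals onto every
# prediction, large and small alike -- the variance no longer scales with F
boot = [fi + random.choice(resid) for fi in f]
```

Comparing the resampled and original residuals at the lowest prediction level shows how badly the proportional error structure is destroyed.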
Nick
--
Nick Holford, Divn Pharmacology & Clinical Pharmacology
University of Auckland, 85 Park Rd, Private Bag 92019, Auckland, New Zealand
email:n.holford@auckland.ac.nz tel:+64(9)373-7599x6730 fax:373-7556
http://www.phm.auckland.ac.nz/Staff/NHolford/nholford.htm
From: "Stephen Duffull" <sduffull@pharmacy.uq.edu.au>
Subject: RE: Bootstrap resampling!
Date: Wed, 28 Mar 2001 08:48:13 +1000
Hi
Nick wrote:
> methods. Davison has written an excellent book
> (Bootstrap Methods and Their Application
> (Cambridge Series in Statistical and
> Probabilistic Mathematics , No 1) by A. C.
> Davison, D. V. Hinkley) (which Steve Duffull
> "stole" from me last November so I cannot refer
> to it right now) which describes these in detail.
Nick is right - I do have his book.
After a quick read I am beginning to believe that if "credible" estimates of SE or CI are required, then it might be easier just to use an MCMC method from the start (e.g. via WinBUGS) than to model first and then add BS (bootstrap) :-)
Regards
Steve
=================
Stephen Duffull
School of Pharmacy
University of Queensland
Brisbane, QLD 4072
Australia
Ph +61 7 3365 8808
Fax +61 7 3365 1688
http://www.uq.edu.au/pharmacy/duffull.htm
From: "Banken, Ludger {PDBS~Basel}" <LUDGER.BANKEN@Roche.COM>
Subject: RE: Bootstrap resampling!
Date: Wed, 28 Mar 2001 11:40:30 +0200
I do not fully agree with Paul and Nick that the sampling unit for bootstrapping should be the subject. In population PK/PD data there are two sources of random variation, inter- and intra-subject variation. A good approach to handling the inter-subject variability is to resample among the subjects. But I believe that additional resampling is necessary to take the intra-subject variability into account, resulting in a two-stage bootstrap. Without it, the intra-subject variability is treated as nonexistent. I must admit that I don't know of a reasonable method that is free of assumptions. But it might even be better to use a non-optimal method than none at all. The proposal would need to be investigated, by simulation or conceptual work, before it is used in practice.
Does anybody know whether such a two-stage bootstrap method is used somewhere, or whether it has been discussed in the literature?
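One hypothetical form of the two-stage idea might be sketched as below. This is a Python illustration only: the within-subject step assumes residuals are exchangeable within a subject, which is exactly the kind of strong assumption noted above, and the whole scheme would need investigation before practical use:

```python
import random

def two_stage_bootstrap(pred, resid, seed=None):
    """Hypothetical two-stage bootstrap sketch.
    Stage 1: resample subjects with replacement (inter-subject variability).
    Stage 2: within each drawn subject, resample that subject's own
    residuals with replacement and add them back onto the predictions
    (intra-subject variability). Assumes within-subject exchangeability.
    `pred`/`resid` map subject ID -> lists of model predictions/residuals."""
    rng = random.Random(seed)
    ids = sorted(pred)
    boot = {}
    for new_id, old_id in enumerate((rng.choice(ids) for _ in ids), start=1):
        boot[new_id] = [p + rng.choice(resid[old_id]) for p in pred[old_id]]
    return boot

# toy inputs: 2 subjects, 2 observations each (values are made up)
pred  = {1: [10.0, 5.0], 2: [20.0, 9.0]}
resid = {1: [0.5, -0.5], 2: [1.0, -1.0]}
boot = two_stage_bootstrap(pred, resid, seed=0)
```

Note that stage 2 requires a fitted model (to supply predictions and residuals), so unlike the plain subject-resampling bootstrap it is conditional on the model being approximately correct.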
Best regards,
Ludger
Ludger Banken
PDBS, 74/3.OG W
Hoffmann-La Roche, Ltd.
CH-4070 Basel, Switzerland
Phone: ++41 61 68 87363; Fax: ++41 61 68 814525;
E-mail: Ludger.Banken@roche.com
From: Nick Holford <n.holford@auckland.ac.nz>
Subject: Re: Bootstrap resampling!
Date: Thu, 29 Mar 2001 13:51:42 +1200
Ludger,
Thanks for your insightful comments. You are quite correct that the subject=sampling unit does not randomize the within-subject variability. But we need to distinguish exactly what this involves.
There is true within-subject variability, which we might model as between-occasion variability if the subject is studied on more than one occasion. This is an important level of variability that would need to be considered, but I do not know how to arrange the sampling to do it.
There is also residual unknown variability (also unfortunately called intra-individual variability). This is of course nearly always modelled quite independently of the individual so it seems wrong to call it intra-individual when it is estimated from residual error across all individuals. Once again it is hard to know how to sample this and still preserve the heteroscedasticity that is so typical of PK data.
What assumptions do you want to make for a reasonable method of sampling?
Nick
--
Nick Holford, Divn Pharmacology & Clinical Pharmacology
University of Auckland, 85 Park Rd, Private Bag 92019, Auckland, New Zealand
email:n.holford@auckland.ac.nz tel:+64(9)373-7599x6730 fax:373-7556
http://www.phm.auckland.ac.nz/Staff/NHolford/nholford.htm
From: "Gibiansky, Leonid" <gibianskyl@globomax.com>
Subject: RE: Bootstrap resampling!
Date: Thu, 29 Mar 2001 07:52:33 -0500
There was one more use of the bootstrap procedure (that I learned from a Lewis Sheiner talk) that was not mentioned in this discussion (or was mentioned only briefly). It was proposed for investigating the covariate model. The idea is to disassociate patients and their covariates. For example, when you study the importance of gender, you take patients with all their history (concentrations, dosing, all the other covariates except gender) and randomly bootstrap the gender covariate for each patient from the "observed" set of gender values for the population under study. Then you fit the model with the gender covariate and without it, and see how often gender appears to be significant. It is clear that for a bootstrap sample constructed in this manner gender is random and should not be significant. The number of false-positive results can then be used to characterize the significance of the covariate and to see how often it appears significant just by chance. I have not tried the procedure, but found it a very elegant, although time-consuming, way to investigate the covariate model, especially for small data sets.
Is there anybody in the group who has tried this approach? If so, could you describe your experience?
Thanks,
Leonid Gibiansky
From: "Diane Mould" <DRMOULD@ATTGLOBAL.NET>
Subject: Re: Bootstrap resampling!
Date: Thu, 29 Mar 2001 09:12:35 -0500
Dear Leonid
The procedure that you describe sounds a bit like the randomization test procedure. Is this what you are referring to? That would be an interesting way to evaluate covariate models.
Best Regards
Diane
From: "Gibiansky, Leonid" <gibianskyl@globomax.com>
Subject: RE: Bootstrap resampling!
Date: Thu, 29 Mar 2001 09:23:53 -0500
Dear Diane,
I am not sure what it is called; I've heard about it only once and liked it a lot. Now that you ask, I recall that Lewis described it in somewhat different terms, and the description below came out of a discussion of his talk with Tom Ludden (I am trying to give proper credit; if I missed someone, it is not intentional). Probably Lewis was describing the randomization test procedure. Could you please let me know what this test is about and how it is done?
Thanks,
Leonid
From: "KOWALSKI, KENNETH G. [PHR/1825]" <kenneth.g.kowalski@pharmacia.com>
Subject: RE: Bootstrap resampling!
Date: Thu, 29 Mar 2001 08:42:39 -0600
Yes, what Leonid is describing is a form of randomization test. It is a way to get a post-hoc assessment of the alpha level (false-positive rate) of the test. If one completely disassociates the covariate from the patient, then for bootstrap data sets testing, say, the effect of gender, we should get a false-positive rate of about 5% if we are using the standard chi-square critical value of 3.84. However, in the presence of model misspecification the actual false-positive rate can be much higher. I have a simulation example where the false-positive rate is as high as 25% when using a chi-square test nominally controlled at the 5% significance level in the presence of known model misspecification (see Kowalski and Hutmacher, Stats. in Med. 2001;20:75-91). By disassociating the covariates from the patients in the bootstrapping, one can essentially characterize the empirical distribution of the test statistic under the null hypothesis (e.g., no gender difference) and assess whether the distribution of bootstrap test statistics really follows the chi-square distribution.
Ken
From: "Jogarao Gobburu 301-594-5354 FAX 301-480-3212" <GOBBURUJ@cder.fda.gov>
Subject: Re: Bootstrap resampling!
Date: Thu, 29 Mar 2001 10:28:31 -0500 (EST)
Dear ALL,
Yes, we have some experience with this procedure. Recently, John Lawrence (FDA) and I performed some experiments to find out the exact significance level of including a covariate, say weight (WT), in a nested model. One simulates under the null hypothesis; that is, it does not matter whose weight is what. The WTs are swapped among the IDs, and the 'permuted' data are then fitted to the alternative model (i.e., with WT as a covariate). The distribution of log-likelihood ratios so generated is used to find the 'conditional' significance level, instead of the 'asymptotic' p-value (chi-square distribution, df=1).
This procedure goes by three different names: permutation test, randomization test, and rerandomization test (I have the reference somewhere, but cannot find it right now!). We will be presenting this at PAGE 2001 and eventually to a wider audience by other means.
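The permutation scheme described above can be illustrated on a toy problem. This Python sketch uses a simple linear-regression likelihood ratio as a stand-in for a NONMEM deltaOBJ (data are simulated with CL unrelated to WT, i.e. under the null; all names and values are hypothetical):

```python
import math
import random

def lrt_stat(y, x):
    """-2*log likelihood ratio for adding covariate x to an intercept-only
    normal model -- a simple stand-in for the NONMEM deltaOBJ."""
    n = len(y)
    my = sum(y) / n
    ss0 = sum((yi - my) ** 2 for yi in y)                  # null fit
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    ss1 = sum((yi - my - b * (xi - mx)) ** 2
              for xi, yi in zip(x, y))                     # fit with covariate
    return n * math.log(ss0 / ss1)

random.seed(3)
wt = [random.gauss(70, 10) for _ in range(30)]             # hypothetical WTs
cl = [random.gauss(5, 1) for _ in range(30)]               # CL unrelated to WT
observed = lrt_stat(cl, wt)

# randomization test: swap WT among the IDs, refit, and build the null
# distribution of the test statistic
null = []
for _ in range(500):
    perm = wt[:]
    random.shuffle(perm)
    null.append(lrt_stat(cl, perm))
p_value = sum(s >= observed for s in null) / len(null)
```

The 'conditional' p-value is simply the fraction of permuted fits whose statistic equals or exceeds the observed one, with no appeal to the chi-square distribution.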
Regards,
Joga Gobburu
Pharmacometrics,
CDER, FDA.
From: Nick Holford <n.holford@auckland.ac.nz>
Subject: Re: Bootstrap resampling! -> Randomization test
Date: Fri, 30 Mar 2001 07:04:06 +1200
Leonid,
You are describing the randomization test procedure. I don't know if it is strictly correct to call it a bootstrap method (Steve Duffull has promised to return Davison's book and I will try to check this when it comes back). Two links which explain the RT and put it in perspective are:
http://garnet.acns.fsu.edu/~pkelly/resampling.html
http://davidmlane.com/hyperstat/B143907.html
The purpose of the RT procedure is to determine the 'true' probability that you would reject the null hypothesis (e.g. H0: gender does not explain the variability in CL) when the null hypothesis was in fact true. The relevance to NONMEM users is very strong. If you use the FOCE method, several people (e.g. Mats Karlsson's group in Uppsala) have found that the chi-square distribution is pretty good for predicting a P value (none of this has been published as far as I know, but if anyone knows of a publication please tell us). But if you use FO then you need a larger difference in the objective function (deltaOBJ) to correctly reject the null hypothesis. Anybody using NONMEM is doing this kind of hypothesis test all the time. If you use FO and are relying on a deltaOBJ of 3.84 for a one-parameter difference between the models, then you are almost certainly drawing conclusions with a P value > 0.05. A test case I did with a one-compartment model indicated that the P value associated with a deltaOBJ of 3.84 was 0.08 and that I should use a deltaOBJ of 4.8 to reject the null with alpha=0.05.
I have recently (this week) implemented the randomization test (RT) as part of Wings for NONMEM ( http://www.geocities.com/wfn2k/). I am still doing some testing but hope to release it by the beginning of next week. It will let you run the RT with any NMTRAN control stream and data set with a command such as:
nmrt wt theopd 1000
where nmrt is the RT command, wt is the covariate to be randomized, theopd is the runname (the NMTRAN control stream) and 1000 is the number of replications. In order to evaluate the P value you will need to run at least 1000 replications so this procedure is only practical for NONMEM runs that complete within a few minutes or less.
Nick
Nick Holford, Divn Pharmacology & Clinical Pharmacology
University of Auckland, 85 Park Rd, Private Bag 92019, Auckland, New Zealand
email:n.holford@auckland.ac.nz tel:+64(9)373-7599x6730 fax:373-7556
http://www.phm.auckland.ac.nz/Staff/NHolford/nholford.htm
From: "Gibiansky, Leonid" <gibianskyl@globomax.com>
Subject: RE: Bootstrap resampling! -> Randomization test
Date: Thu, 29 Mar 2001 14:24:54 -0500
Thank you to all who replied to my mail. Now I know the name of the procedure and the references (those web sites are really useful). Even if this is not strictly a bootstrap, it is in line with bootstrap ideas. To continue the discussion: has anyone used this procedure to motivate exclusion of a covariate from the model in a real PK or PK/PD modeling project? Was it useful, or was the decision pretty much obvious even from the diagnostic plots of the two (full and restricted) models on the original data?
Thanks again,
Leonid
From: "KOWALSKI, KENNETH G. [PHR/1825]" <kenneth.g.kowalski@pharmacia.com>
Subject: RE: Bootstrap resampling! -> Randomization test
Date: Thu, 29 Mar 2001 13:53:43 -0600
Nick,
To add to what you are saying, simulation studies have shown that you get an inflated alpha (false-positive error rate) when you simulate data and then fit a model that is different from the model you used to simulate the data, i.e., model misspecification. Stu Beal has pointed out that even the FOCE method runs into problems when the residual variance depends on the response, such as
Y=F*(1+EPS(1)) or
Y=F*EXP(EPS(1))+EPS(2)
If we simulate data with residual variance structures as above then we will get inflated alphas even with FOCE... in this setting we need to use FOCE w/INTERACTION, because there is an interaction between the etas (embedded in F) and the epsilons.
However, if we simulate and model the residual variance structure as
Y=LOG(F) + EPS(1)
then FOCE should work fine (i.e., there is no interaction between the etas embedded in F and EPS(1)) and should not lead to inflated alphas, assuming that other aspects of the model are specified correctly.
Ken
From: "Jogarao Gobburu 301-594-5354 FAX 301-480-3212" <GOBBURUJ@cder.fda.gov>
Subject: Re: Bootstrap resampling! -> Randomization test
Date: Thu, 29 Mar 2001 15:22:06 -0500 (EST)
Dear Ken,
Just one more twist: when the data are sparse (Nobs < Npar), the FO method is 'reasonably' similar to FOCE (F+ERR(1)) / FOCE-INTER (F*EXP(ERR(1))). But when the data are dense, FOCE (with or without interaction) is more clearly superior to FO. Nevertheless, the most reliable p-values can be derived using permutations. Our analyses show that even when the FOCE (+/- INTER) method is used, the distribution of log-likelihood ratios is not strictly a chi-square distribution, but it is close enough!
So, one really need not perform parametric bootstrapping or permutations to derive the significance levels if the appropriate estimation method is used, for most routine problems. Only when precise p-values are required (e.g., for effectiveness claims) need we use these computationally intensive methods.
Regards,
Joga Gobburu
Pharmacometrics,
CDER, FDA.
From: "KOWALSKI, KENNETH G. [PHR/1825]" <kenneth.g.kowalski@pharmacia.com>
Subject: RE: Bootstrap resampling! -> Randomization test
Date: Thu, 29 Mar 2001 14:50:22 -0600
Dear Joga,
I agree that we need to be concerned about which estimation method we use (e.g., FO, FOCE or FOCE-INT), but we also have to consider other forms of model misspecification when assessing the validity of the chi-square distribution for the likelihood ratio test (LRT). For example, I have done simulations showing that when we design a patient study with sparse sampling, say over a narrow range of sampling times within a dosing interval, which may only support fitting a one-compartment model when we know, say from dense sampling in healthy volunteers, that the kinetics obey two compartments, we can get inflated alphas regardless of the estimation method used, because we still have misspecification of the structural model. We have shown (Kowalski and Hutmacher, Stats in Med 2001;20:75-91) that although such an approximation (a one-compartment approximation over a narrow range of data simulated with two-compartment kinetics) can provide fairly accurate estimates of CL/F and of the covariate effect on CL/F, the LRT for the covariate effect can have an inflated alpha level. Thus, any form of model misspecification (not just that related to the estimation method) can result in the failure of the chi-square distribution to adequately describe the LRT under the null.
Ken
From: "Stephen Duffull" <sduffull@pharmacy.uq.edu.au>
Subject: RE: Bootstrap resampling! -> Randomization test
Date: Fri, 30 Mar 2001 14:16:13 +1000
Hi all
To pose some simple questions: if model misspecification affects the validity of the likelihood ratio test (which is not an altogether unexpected finding), and when analysing actual data (rather than simulated data) we always have model misspecification, then the LRT will almost never be valid in real life. It would therefore seem appropriate to ignore deltaOBJ values in a statistical sense altogether (i.e., not use them for hypothesis testing). Perhaps we should ask ourselves: when do we need statistical verification of our model? For non-nested models, where the LRT is inappropriate anyway, have we always required statistical verification? Certainly most publications that discuss non-nested models do not supply statistical validation for their choice.
Do we need statistical verification when the covariate effect is biologically plausible and biologically significant - or do we need it when biological plausibility or significance cannot be assessed (if so then was the covariate that important anyway)?
I would certainly be interested in any thoughts that the group may have on this, particularly in light of the discussion about bootstrap and RT.
Regards
Steve
=================
Stephen Duffull
School of Pharmacy
University of Queensland
Brisbane, QLD 4072
Australia
Ph +61 7 3365 8808
Fax +61 7 3365 1688
http://www.uq.edu.au/pharmacy/duffull.htm
From: Nick Holford <n.holford@auckland.ac.nz>
Subject: Re: Bootstrap resampling! -> Randomization test
Date: Fri, 30 Mar 2001 16:56:19 +1200
Steve,
You bring up a new phrase, "statistical verification". I don't really know what this means. I don't need new phrases. What I need is to be able to make a decision about whether one model is "better" than another. I need this almost every day. In practice I use deltaOBJ. If it's bigger than 10 then I am convinced, especially with FOCE. If it's between 4 and 6 I am willing to accept that the model may be better than the other, especially if the parameters look reasonable.
I cannot rely on mechanistic or biological plausibility alone. It is plausible that women have a different clearance from men, but I need the data to convince me, and would rely on deltaOBJ. On the other hand, I routinely use an allometric size model for clearance and volume, and if the OBJ gets worse when I add this in I still keep the allometric model, but then ask what other covariate confounded with weight is worsening the fit (ignoring the thorny issue of a local minimum).
For some special cases I may want to be more concerned about how to achieve a certain alpha level and then the randomization test offers the means to figure it out. Given the run times it is certainly not something one can apply to every model building hypothesis.
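The randomization-test idea can be sketched in a few lines. The following is a toy illustration, not a NONMEM run: the clearance numbers and grouping are made up, and a difference in group means stands in for deltaOBJ (in the real procedure one would refit the model with the covariate column permuted across subjects and record the change in objective function each time). The resampling logic is the same: permute the covariate, recompute the statistic, and count how often chance matches or beats the observed value.

```python
import random

def randomization_test(values, groups, n_perm=10000, seed=0):
    """Empirical p-value: how often does a random relabelling of
    subjects produce a test statistic at least as extreme as the
    observed one?  The statistic here is an absolute difference in
    group means, a stand-in for deltaOBJ."""
    rng = random.Random(seed)

    def stat(g):
        a = [v for v, x in zip(values, g) if x]
        b = [v for v, x in zip(values, g) if not x]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = stat(groups)
    count = 0
    for _ in range(n_perm):
        g = groups[:]
        rng.shuffle(g)          # permute the covariate across subjects
        if stat(g) >= observed:
            count += 1
    return observed, count / n_perm

# Hypothetical clearances for two covariate groups (made-up numbers)
cl = [5.1, 4.8, 5.5, 5.0, 6.9, 7.2, 6.5, 7.0]
grp = [True, True, True, True, False, False, False, False]
obs, p = randomization_test(cl, grp)
```

The empirical p-value is exactly the "certain alpha level" being discussed: it comes from the data's own permutation distribution rather than from an asymptotic chi-square assumption.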
Nick
--
Nick Holford, Divn Pharmacology & Clinical Pharmacology
University of Auckland, 85 Park Rd, Private Bag 92019, Auckland, New Zealand
email:n.holford@auckland.ac.nz tel:+64(9)373-7599x6730 fax:373-7556
http://www.phm.auckland.ac.nz/Staff/NHolford/nholford.htm
From: "Stephen Duffull" <sduffull@pharmacy.uq.edu.au>
Subject: RE: Bootstrap resampling! -> Randomization test
Date: Fri, 30 Mar 2001 16:07:28 +1000
Nick
Statistical verification is not intended to be a loaded phrase. I am just asking whether we need to perform an LRT, achieving some nominal P-value, in order to be happy with our choice of models?
Steve
Date: Fri, 30 Mar 2001 08:44:48 +0200
From: Mats Karlsson <Mats.Karlsson@biof.uu.se>
Subject: Re: Bootstrap resampling!
Dear Joga and others,
We (Ulrika Wählby et al.) have presented results on the randomisation test for covariates and LRT behaviour (FO, FOCE, FOCE INTER, LAPLACIAN) at previous PAGE meetings, and an article on this is in press in J Pharmacokinet Pharmacodyn. In addition to what has been said here on nmusers, it appears that the LRT is also sensitive to misspecification of the distribution of the random effects: a non-symmetric distribution means inflated significance levels. This is even more pronounced for OMEGA parameters. Ulrika will be presenting more on this at this year's PAGE meeting.
It is easy to sympathize with both Joga's position (we only need exact p-values sometimes) and Steve's (we don't need them at all), but for efficiency of model building the LRT is very useful. It is somewhat unfortunate that a cut-off of 10, as Nick and many others use, sometimes gives an actual alpha of 0.001 and for other data sets/models a value of 0.2. Therefore, if one is to trust the LRT, especially with the FO method, I think some understanding of its properties is in order. Hopefully our manuscript can help readers understand some of the factors influencing the actual significance level of the LRT.
Best regards,
Mats
--
Mats Karlsson, PhD
Professor of Pharmacometrics
Div. of Pharmacokinetics and Drug Therapy
Dept. of Pharmaceutical Biosciences
Faculty of Pharmacy
Uppsala University
Box 591
SE-751 24 Uppsala
Sweden
phone +46 18 471 4105
fax +46 18 471 4003
mats.karlsson@farmbio.uu.se
From: Lewis B Sheiner <lewis@c255.ucsf.edu>
Subject: Re: Bootstrap resampling! -> Randomization test
Date: Fri, 30 Mar 2001 10:47:55 -0800
I can't give an answer, but I can provide a point of view.
The testing paradigm only makes sense when a proponent is advocating a conclusion to a skeptical observer (who can, of course, be the proponent's alter ego). When this is the set-up, the skeptical observer gets to specify the rules. Currently they are to offer data that are extremely unlikely if the claim does not hold. To be logically tight, the claim-not-holding assertion should be sharp (e.g., the means of two populations are exactly equal), which of course makes the claim rather diffuse (the means are unequal ... but what they are is not addressed). When the paradigm is used to "test" a scientific hypothesis in the service of making sure it is tentatively acceptable, the diffuseness doesn't matter; the experiment being evaluated was set up, presumably, to provide a sharp qualitative challenge. For public acceptance of remedies, there may be some problem with the diffuseness (for example, that we wind up unsure of exactly how much treatment benefit to expect).
So a simple answer to your question is: we need valid (i.e., correct performance under the null) hypothesis testing procedures whenever we are in a testing situation (as above).
As scientists (as opposed to advocators), I believe there is another use of testing, and this is the one that RA Fisher advocated: we are so eager to find patterns in our data that we need some reality check on this tendency. So an hypothesis test of the null hypothesis that "nothing interesting is going on" may be useful, if it is not rejected, to encourage us to abandon fruitless quests to find signal where there is likely only noise.
LBS
From: SMITH_BRIAN_P@Lilly.com
Subject: Re: Bootstrap resampling! -> Randomization test
Date: Sat, 07 Apr 2001 10:40:45 -0500
I just love the original question and Lewis's reply. I am going to add my 2 cents below as well.
Sincerely,
Brian Smith
I just want to point out that in hypothesis-testing situations there is no reason the null hypothesis has to be that there is no treatment difference. There is no reason, for instance, in a test of means, that the null hypothesis cannot be mu1 - mu2 = 5. What is interesting is that a 95% confidence interval covers exactly the values for which the null hypothesis cannot be rejected. That is, if my 95% confidence interval were (2, 7), then we would fail to reject the null hypothesis that mu1 - mu2 = 2.1 (at the 0.05 level) and we would fail to reject the null hypothesis that mu1 - mu2 = 6.9 (at the 0.05 level). (In fact we would fail to reject all null hypotheses between 2 and 7.) However, we would reject the null hypothesis that the difference was equal to 1.9. The confidence interval can thus define how much treatment benefit to expect. In essence, the confidence interval is generated by the same mechanism as hypothesis tests.
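That duality is easy to demonstrate numerically. Here is a minimal sketch, with hypothetical numbers chosen so the 95% interval comes out as (2, 7) to match the example above: a null value is rejected by the two-sided z-test exactly when it falls outside the interval.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)   # ~1.96, the two-sided 5% critical value

def ci(est, se):
    """95% confidence interval for a normally distributed estimate."""
    return est - z * se, est + z * se

def rejected(null, est, se):
    """Two-sided z-test of H0: true difference == null, at the 5% level."""
    return abs(est - null) / se > z

# Hypothetical estimate and standard error chosen so the CI is (2, 7)
est = 4.5
se = 2.5 / z
lo, hi = ci(est, se)

# Duality: a null value is rejected exactly when it lies outside the CI
for d in (1.9, 2.1, 6.9, 7.1):
    assert rejected(d, est, se) == (d < lo or d > hi)
```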
> So a simple answer to your question is: we need valid (i.e., correct
> performance under the null) hypothesis testing procedures whenever we
> are in a testing situation (as above).
And since confidence intervals are developed by a similar mechanism to hypothesis tests, we equivalently need the correct coverage probabilities for confidence intervals.
> As scientists (as opposed to advocators), I believe there is another
> use of testing, and this is the one that RA Fisher advocated: we are so eager to
> find
> patterns in our data that we need some reality check on this tendency.
> So an hypothesis test of the null hypothesis that "nothing interesting is
> going on" may be useful, if it is not rejected, to encourage us to abandon
> fruitless quests to find signal where there is likely only noise.
I agree 100%. If we understand what a clinically significant effect is, we can in addition use confidence intervals to further help us discern if anything interesting is or could be going on.
> Stephen Duffull wrote:
>>
>> Hi all
>>
>> To pose some simple questions. If model misspecification
>> affects the validity of the likelihood ratio test (which is
>> not an altogether unexpected finding) and when analysing
>> actual data (rather than simulated data) we always have
>> model misspecification, then the LRT will almost never be
>> valid in real life. Therefore it would seem appropriate to
>> ignore deltaOBJ values in a statistical sense altogether (ie
>> not use them for hypothesis testing). Perhaps we should ask
>> ourselves: when do we need statistical verification of our
>> model? For non-nested models, when the LRT is inappropriate
>> anyway, have we always required statistical verification?
>> Certainly most publications that discuss non-nested models
>> do not supply statistical validation for their choice.
>>
Well, as I have tried to make clear, hypothesis tests are not the only type of "statistical verification" that can be done. Whether or not you have model misspecification, the "hypothesis test" you are talking about is still valid. A p-value tells you the following: given that model A is the true model, what is the chance that I would get the fit of model B, or something more extreme? The only way we can easily extract a p-value from this situation is with nested models, in which case the p-value just becomes a function of the difference of the -2*ln(L)'s. That is, if we believe that the likelihood expresses the degree of fit of a model (NONMEM users implicitly assume this, since it is a maximum-likelihood-based program), then a p-value is just a reformulation of the likelihoods in a form that is more easily interpretable. (Note: the -2*ln(L) test is asymptotic, which means that the p-value only approaches its correct value as the sample size approaches infinity. Just because it is not exact, however, does not mean that it is not useful.)
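For the common one-extra-parameter case, that function is just the upper tail of a chi-square distribution with 1 df evaluated at the drop in -2*ln(L) (deltaOBJ in NONMEM terms). A minimal sketch using only the standard library, exploiting the fact that a chi-square(1) variable is a squared standard normal:

```python
from math import erfc, sqrt

def lrt_pvalue_df1(delta_objf):
    """Asymptotic LRT p-value for one extra parameter (df = 1).
    If X ~ chi-square(1) then X = Z**2 with Z standard normal, so
    P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    return erfc(sqrt(delta_objf / 2.0))

# The familiar cut-offs: deltaOBJ = 3.84 gives p ~ 0.05,
# and deltaOBJ = 10.83 gives p ~ 0.001
```

Of course, the whole point of the thread is that with FO (and even FOCE under misspecification) the true null distribution of deltaOBJ may be far from chi-square(1), which is what the randomization test estimates empirically.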
It is harder to justify one non-nested model over another. The major principle is that if two models have the same degree of complexity, the one with the largest likelihood (smallest -2*ln(L)) is preferred. This principle obviously cannot be applied in a vacuum. Suppose that the goal is future dose adjustment. Model a) uses an easily obtainable covariate but has a slightly smaller likelihood than model b), which captures the same quantity in a more complex fashion. It is unlikely that clinicians will use measure b) in clinical practice, but a) would not be a problem; model a) is preferable. There is also the case in which one model simply makes more sense than another but has a slightly smaller likelihood; go with the one that makes sense. Nevertheless, I think we can conceive of "tests" for non-nested models.
In the case where you have non-nested models with differing degrees of complexity, you have a problem. This is why AIC and BIC exist. The problem is that in the statistics literature there seem to be 50 of these sorts of criteria. Which is correct? No one knows. Let us suppose that you pick your model with BIC. Model A gives you a BIC of 272 and model B gives 280. A is smaller, so we prefer A. But should we? Is 272 really different from 280? BIC, and the difference in BIC between two models, are statistics!! Guess what: we could do bootstrapping or randomization testing on the difference in the BICs. Stephen states correctly that we do not apply statistical criteria to justify non-nested models. This is mostly because it would be hard to do. With modern computing power this is no longer such an impediment. Maybe we should start producing p-values to justify non-nested models.
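A toy sketch of that idea follows. Everything here is hypothetical: made-up data and two simple non-nested one-parameter models (y = a*x versus y = b*sqrt(x)) fitted by least squares, with BIC computed from the residual sum of squares as n*ln(RSS/n) + k*ln(n). The point is only the mechanics: resample the data with replacement (the analogue of resampling subjects), refit both models each time, and look at the bootstrap distribution of the BIC difference.

```python
import random
from math import log, sqrt

def fit_bic(xs, ys, f):
    """Least-squares fit of y = a*f(x) (one parameter, closed form)
    and its BIC = n*ln(RSS/n) + k*ln(n) with k = 1."""
    fx = [f(x) for x in xs]
    a = sum(u * y for u, y in zip(fx, ys)) / sum(u * u for u in fx)
    rss = sum((y - a * u) ** 2 for u, y in zip(fx, ys))
    rss = max(rss, 1e-300)       # guard against log(0) on degenerate resamples
    n = len(xs)
    return n * log(rss / n) + 1 * log(n)

def bootstrap_delta_bic(xs, ys, n_boot=2000, seed=1):
    """Bootstrap distribution of BIC(linear) - BIC(sqrt) obtained by
    resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    pairs = list(zip(xs, ys))
    deltas = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        bx = [p[0] for p in sample]
        by = [p[1] for p in sample]
        deltas.append(fit_bic(bx, by, lambda x: x) - fit_bic(bx, by, sqrt))
    return deltas

# Made-up data generated near y = 2x, so the linear model should win
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 15.9]
deltas = bootstrap_delta_bic(xs, ys)

# How often does the preference flip sign across resamples?  A small
# fraction says the data distinguish the two models firmly.
flip = sum(d > 0 for d in deltas) / len(deltas)
```

In a real population analysis the resampling unit would be the subject and the refit would be the NONMEM run, so the cost scales with run time, which is exactly the practical impediment mentioned above.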
I would claim that the likelihood is the quantity that justifies your model. Practical concerns aside, the likelihood ought to be the sole judge of a model's fitness, or lack thereof.
>> Do we need statistical verification when the covariate
>> effect is biologically plausible and biologically
>> significant - or do we need it when biological plausibility
>> or significance cannot be assessed (if so then was the
>> covariate that important anyway)?
>>
This is why I would suggest that you look at confidence intervals as well. There are two reasons a covariate is not significant: 1) there really is nothing going on, or 2) your data are not powerful enough to detect a meaningful difference. A very wide confidence interval indicates that 2) is possible; a narrow confidence interval indicates 1).
>> I would certainly be interested in any thoughts that the
>> group may have on this, particularly in light of the
>> discussion about bootstrap and RT.