RE: Simulation vs. actual data

Source: cognigencorp.com
From: "Kowalski, Ken" Ken.Kowalski@pfizer.com
Subject: RE: [NMusers] Simulation vs. actual data
Date: Thu, July 14, 2005 1:22 pm

Nick,

To make distinctions between the 3 different types of statistical intervals we need only consider two major classes of variability: sampling variability and parameter uncertainty. Here sampling variability can have multiple sources of variation due to sampling of subjects (BSV), different occasions within a subject (WSV), and sampling of observations (RUV). Moreover, parameter uncertainty can be multivariate and can be thought of as the trial-to-trial variation, in that we would get a different set of population parameter estimates each time we repeat the trial. I still like my simple univariate normal mean model example, which has only one source of sampling variability and one parameter, to illustrate the differences in these intervals. In the univariate normal example the sampling variability is estimated as the sample standard deviation (SD), the population mean parameter is estimated by the sample mean (Ybar), and the parameter uncertainty is the standard error of the mean (SE = SD/sqrt(N)). With these definitions I'll take another stab at defining the general forms of the 3 types of statistical intervals for the univariate normal mean model case and draw analogies to the more complex population modeling setting.

Degenerative Tolerance Interval: Ybar +/- k*SD. This is the form of a degenerate tolerance interval where we assume the population mean (mu) and standard deviation (sigma) are known without uncertainty, given by the estimates Ybar and SD, respectively. It is degenerative in that we put all the probability mass on our point estimates (Ybar and SD), where a degenerate distribution is one that takes a single value with probability 1. The inference for this interval is on the individual Y's, since SD is the standard deviation of the individual Y's.
If we assume the Y's are normal with Ybar and SD known (without uncertainty), we know from our first course in probability and statistics that Ybar +/- 1*SD, Ybar +/- 2*SD, and Ybar +/- 3*SD contain approximately 68, 95, and 99.7% of the individual Y values, respectively.

Degenerative Tolerance Interval: Population PK Example. To draw an analogy to population modeling and construction of degenerative tolerance intervals, consider the following example. Suppose we conduct a single-dose PK study with N=12 healthy subjects at each of 5 doses, with dense sampling of n time points (observations) per subject. If we fit a population PK model to estimate the thetas, Omega, and Sigma, and then use the model (and estimates without uncertainty) to simulate individual Y values for, say, 1000 hypothetical subjects at each dose for each time point, we can then calculate Ybar and SD at each dose/time point across the 1000 hypothetical subjects and construct degenerative tolerance intervals of the form Ybar +/- k*SD. However, typically what we do is compute Ybar as our point estimate and construct an asymmetric tolerance interval using the percentile method (i.e., instead of using +/- k*SD to determine the endpoints of the interval, the appropriate lower and upper quantiles from the empirical distribution of the simulated Y values are used).

Prediction Interval for a Single Future Observation: Ybar +/- k*SD*sqrt(1 + 1/N). To calculate the variability in a single future observation (Ynew) we add the sampling variance (SD^2) for the new observation and the parameter uncertainty variance (SE^2) to obtain Var(Ynew) = SD^2 + SE^2 = (SD^2)*(1 + 1/N), and hence SD(Ynew) = SD*sqrt(1 + 1/N). Note that when we ignore parameter uncertainty to obtain a degenerative prediction interval, or when N is very large such that SE is near 0, the resulting interval collapses to the same as a degenerative tolerance interval, since SD(Ynew) = SD in this setting. This may in part explain some of the confusion we have with the terminology.
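The univariate formulas can be checked numerically in a few lines of Python/numpy; the sample values below are illustrative stand-ins, not data from any real study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sample of N observations (values are made up)
N = 12
y = rng.normal(loc=100.0, scale=20.0, size=N)

ybar = y.mean()
sd = y.std(ddof=1)        # sample standard deviation
se = sd / np.sqrt(N)      # standard error of the mean
k = 1.96                  # normal quantile for ~95% coverage

# Degenerative tolerance interval: treats Ybar and SD as known
tol = (ybar - k * sd, ybar + k * sd)

# Prediction interval for one future observation: adds SE^2 to SD^2
half = k * sd * np.sqrt(1 + 1 / N)
pred = (ybar - half, ybar + half)

# The prediction interval is wider by the factor sqrt(1 + 1/N)
print(tol, pred)
```

With N=12 the widening factor sqrt(1 + 1/12) is only about 1.04, which illustrates why the two intervals are so easily conflated when N is moderate or large.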
The distinctions between these intervals collapse when we don't take into account parameter uncertainty.

Prediction Interval for a Single Future Observation: Population PK Example. To draw an analogy to population modeling using the population PK example I just described, consider taking into account parameter uncertainty by assuming the posterior distribution of the parameter estimates follows a multivariate normal distribution with mean vector equal to the parameter estimates and covariance matrix of the estimates from the COV step output. Suppose now we obtain 1000 sets of population parameters (thetas, Omegas, and Sigmas) from a parametric simulation from this multivariate normal distribution. Furthermore, if we simulate a single hypothetical subject's profile for each of the 1000 sets of population parameters, we now have a set of 1000 individual Y's at each dose/time point. If we calculate SD(Ynew) across the 1000 individual values, we have taken into account both sampling variation (Omegas and Sigma) and parameter uncertainty (multivariate normal posterior distribution). The resulting SD(Ynew) will be larger than the SD calculated for the degenerate tolerance interval example. However, suppose that for the 1000 sets of population parameters we set them all to the same estimates obtained from the PK model fit (i.e., ignore parameter uncertainty). If we now simulate a single hypothetical subject's profile for each of the 1000 sets of exactly the same population parameters, the degenerative prediction interval results in SD(Ynew) = SD, since it only takes into account the sampling variation in Omega and Sigma, the same as for the degenerative tolerance interval.

Prediction Interval for a Mean of M Future Observations: Ybar +/- k*SD*sqrt(1/M + 1/N). To calculate the variability in the mean of M future observations (Ybar-new = Sum(Ynew-i)/M, i=1,...,M) we note that the sampling variance for Ybar-new is (SD^2)/M.
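The contrast between the degenerative and uncertainty-aware prediction intervals can be sketched with a minimal one-parameter stand-in; theta_hat, cov_theta, omega, and sigma below are hypothetical placeholders for a fit's COV step output, not values from any real model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for a model fit: one fixed effect with its
# squared standard error (as a 1x1 covariance matrix from a COV step),
# plus between-subject (Omega) and residual (Sigma) variances.
theta_hat = np.array([10.0])
cov_theta = np.array([[0.25]])
omega, sigma = 0.09, 0.04
n_draws = 1000

# With parameter uncertainty: a fresh theta for each replicate, drawn
# from the (assumed) multivariate normal posterior.
thetas = rng.multivariate_normal(theta_hat, cov_theta, size=n_draws)[:, 0]
y_uncert = (thetas
            + rng.normal(0, np.sqrt(omega), n_draws)   # BSV
            + rng.normal(0, np.sqrt(sigma), n_draws))  # RUV

# Degenerate version: the same point estimate in every replicate.
y_degen = (theta_hat[0]
           + rng.normal(0, np.sqrt(omega), n_draws)
           + rng.normal(0, np.sqrt(sigma), n_draws))

# SD(Ynew) with uncertainty exceeds the degenerate SD
print(y_uncert.std(ddof=1), y_degen.std(ddof=1))
```

Percentile-method intervals come straight from the two simulated samples (e.g., np.percentile(y_uncert, [2.5, 97.5])), mirroring the asymmetric intervals described above.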
To obtain the variance of Ybar-new taking into account this sampling variance as well as the uncertainty in Ybar, we have Var(Ybar-new) = (SD^2)/M + SE^2 = (SD^2)*(1/M + 1/N), and hence SD(Ybar-new) = SD*sqrt(1/M + 1/N). The inference for this statistical interval is on the mean of a fixed number (M) of future values rather than on individual observations. However, when M=1 this interval reduces to the prediction interval on individual observations. When we perform a PPC where the statistic, T, is a mean of the individual observations, we are usually conditioning on some fixed known value of M.

Prediction Interval for a Mean of M Future Observations: Population PK Example. Going back to the population PK example, suppose we want to perform a PPC where we use our single-dose population PK model to predict the mean PK profile where we have M=12 subjects in each of 5 dose groups (i.e., an internal PPC where we condition on the design of the study that we used to develop the model). To perform the PPC taking into account parameter uncertainty, we obtain 1000 sets of population parameters (thetas, Omegas, and Sigmas) by sampling from the multivariate normal posterior distribution. Our statistic (T) of interest is the mean of the Y's for each dose/time point across the M=12 subjects. Thus, for each of the 1000 sets of population parameter values we simulate M=12 individual Y's for each dose/time point. We then calculate the mean across the M=12 individual Y values, where T = Ybar-new, and we then have 1000 samples of the posterior distribution of Ybar-new. If we calculate the sample standard deviation of the means, i.e., SD(Ybar-new), we can construct a prediction interval for Ybar-new and overlay the observed statistic, Ybar (N=12).
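The PPC loop described above (draw parameters, simulate M=12 subjects, average) can be sketched for a single dose/time point; all the numeric values below are hypothetical stand-ins rather than estimates from any real fit:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical single-parameter stand-in: estimate, its SE, and the
# sampling variances (Omega, Sigma); none of these come from a real fit.
theta_hat, se_theta = 10.0, 0.5
omega, sigma = 0.09, 0.04
n_draws, M = 1000, 12

# One posterior draw of the parameters per replicate, then M=12
# simulated subjects per replicate; T is the mean of the M subjects.
thetas = rng.normal(theta_hat, se_theta, n_draws)
y = (thetas[:, None]
     + rng.normal(0, np.sqrt(omega), (n_draws, M))
     + rng.normal(0, np.sqrt(sigma), (n_draws, M)))
t_stat = y.mean(axis=1)  # 1000 samples of T = Ybar-new

# Percentile-method prediction interval for the mean of M subjects;
# for a PPC, overlay the observed mean on this distribution.
lo, hi = np.percentile(t_stat, [2.5, 97.5])
```

A histogram of t_stat with the observed mean drawn as a reference line gives the graphical form of the PPC described in the next paragraph.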
In practice we typically use the percentile method to construct the prediction interval, and specifically for PPC we might show the histogram for the posterior predictive distribution of the statistic and overlay the observed statistic as a reference line on the histogram. If we don't take into account parameter uncertainty, then the resulting prediction interval and PPC can be thought of as degenerative and will necessarily result in a narrower distribution of the statistic.

Confidence Interval on the Population Mean: Ybar +/- k*SD*sqrt(1/N) = Ybar +/- k*SE. The confidence interval on the population mean can be thought of as a prediction interval where M is infinitely large such that 1/M goes to zero. In this setting the only uncertainty that is reflected is the parameter uncertainty in Ybar (i.e., SE).

Confidence Interval on the Population Mean: Population PK Example. To illustrate the construction of a confidence interval using the population PK example, we can repeat the process as above for the prediction interval for M observations, where we take M to be arbitrarily large, say M=2000 hypothetical subjects. Again we simulate 1000 sets of parameter values from the multivariate normal posterior distribution, and for each set we simulate M=2000 individual Y's for each dose/time point. We then compute the sample mean (Ybar-pop) of the M=2000 subjects for each of the 1000 sets of population parameters. We can then construct a confidence interval for the population mean by computing the sample standard deviation of the set of 1000 Ybar-pop means (each averaged over M=2000 hypothetical subjects) or by using the percentile method. Note that in the simulations we take into account the sampling variation (Omegas and Sigmas) of the individual Y's, but when we average over a sufficiently large M we effectively average out this sampling variation, such that the only uncertainty reflected in SD(Ybar-pop) is due to parameter uncertainty.
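The large-M construction can be demonstrated numerically; reusing the same hypothetical stand-in values as in a one-parameter toy model (not from any real fit), the simulated SD(Ybar-pop) should converge to the parameter uncertainty alone:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in values; M is made large so that sampling
# variation (Omega, Sigma) averages out within each replicate.
theta_hat, se_theta = 10.0, 0.5
omega, sigma = 0.09, 0.04
n_draws, M = 1000, 2000

thetas = rng.normal(theta_hat, se_theta, n_draws)
y = (thetas[:, None]
     + rng.normal(0, np.sqrt(omega), (n_draws, M))
     + rng.normal(0, np.sqrt(sigma), (n_draws, M)))
ybar_pop = y.mean(axis=1)  # one population mean per parameter draw

# With large M, SD(Ybar-pop) approaches the parameter uncertainty
# alone (here se_theta = 0.5), not sqrt(se^2 + omega + sigma).
print(ybar_pop.std(ddof=1))
```

The theoretical SD here is sqrt(se^2 + (omega + sigma)/M), so with M=2000 the sampling-variation term contributes essentially nothing, which is exactly the collapse to a confidence interval described above.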
Also note that the population mean (Ybar-pop) and the typical individual prediction (Yhat | ETAs=0) are not the same; thus, if we don't take into account the sampling variation and instead calculate the typical individual prediction for each of the 1000 sets of population parameter values, the resulting confidence interval of the mean of the typical individual predictions is not the same as the confidence interval of the population mean prediction. This is a particularly important distinction when working with noncontinuous models.

With regard to whether we should take into account parameter uncertainty, I agree that there are many instances where parameter uncertainty is inconsequential relative to sampling variation in constructing tolerance and prediction intervals from our population models, particularly when all of the parameters and variance components are well estimated with good precision (small standard errors). The problem comes in when we fit models to "small-ish" datasets or datasets that cover a limited range (e.g., a limited range of doses or exposures), such that one or more of the parameters are estimated with poor precision (large standard errors). Given the complexity of our population models and the multivariate nature of the posterior distribution, it is difficult to always know when parameter uncertainty is large relative to sampling variation and should be accounted for when constructing tolerance and prediction intervals. I know of no other way to truly assess the importance of parameter uncertainty than to do it both ways (with and without parameter uncertainty). Of course, if you've gone through the trouble of doing it both ways, the purist in me says to report the interval that includes parameter uncertainty even if it doesn't make much difference.
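The gap between the population mean and the typical individual prediction can be seen with a toy nonlinear model; a log-normal case (not the specific model discussed above, and with purely illustrative numbers) makes the point in a few lines:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical nonlinear case: Y = exp(theta + eta), eta ~ N(0, omega);
# theta and omega are illustrative values only.
theta, omega = 2.0, 0.25
etas = rng.normal(0.0, np.sqrt(omega), 100_000)

pop_mean = np.exp(theta + etas).mean()  # population mean prediction
typical = np.exp(theta)                 # typical individual (ETAs = 0)

# E[exp(theta + eta)] = exp(theta + omega/2) > exp(theta), so the two
# predictions differ whenever the model is nonlinear in eta.
print(pop_mean, typical)
```

Averaging simulated individuals (with the ETAs drawn) and plugging in ETAs=0 answer different questions, so confidence intervals built from the two quantities are not interchangeable.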
If you are willing to assume that the posterior distribution of the parameters is multivariate normal, with mean vector corresponding to the parameter estimates and covariance matrix of the estimates from the COV step output, then parametric simulations taking into account parameter uncertainty are not computationally burdensome to perform. However, taking into account parameter uncertainty from a parametric sampling from the multivariate normal to construct tolerance and prediction intervals may require additional custom coding, since we don't have automated utilities that do this... at least not that I'm aware of.

On the other hand, if you are unwilling to make the parametric assumption that the posterior distribution is multivariate normal, and thus are faced with performing nonparametric bootstrapping to construct an empirical posterior distribution as Juanjo recommends, then this will be considerably more computationally burdensome because of the need to re-estimate the parameters for each bootstrap dataset. Nevertheless, if one is planning to go ahead and perform nonparametric bootstrapping anyway to construct bootstrap confidence intervals for the parameter estimates, then you've already done the hard work of generating the empirical posterior distribution of the parameters, so why not use it to account for parameter uncertainty when constructing tolerance and prediction intervals?

One of the reasons that I like to take into account parameter uncertainty when performing a PPC is that when I do a degenerative posterior predictive check, the empirical distribution may underestimate the variation in the statistic (T), such that the observed statistic may fall outside the distribution.
If parameter uncertainty is substantial, taking it into account will result in a wider empirical distribution of the statistic, such that the observed statistic may now lie reasonably within the distribution, and thus I might be more willing to accept the model as having adequate predictive performance for this statistic.

My apologies for such a long message, and kudos to everyone who read it this far.

Ken
Jun 14, 2005 Toufigh Gordi Simulation vs. actual data
Jun 14, 2005 Nick Holford Re: Simulation vs. actual data
Jun 14, 2005 Liping Zhang Re: Simulation vs. actual data
Jun 15, 2005 Kenneth Kowalski RE: Simulation vs. actual data
Jun 25, 2005 Nick Holford Re: Simulation vs. actual data
Jul 05, 2005 Kenneth Kowalski RE: Simulation vs. actual data
Jul 12, 2005 Nick Holford Re: Simulation vs. actual data
Jul 12, 2005 Juan Jose Perez Ruixo RE: Simulation vs. actual data
Jul 12, 2005 Nick Holford Re: Simulation vs. actual data
Jul 13, 2005 Juan Jose Perez Ruixo RE: Simulation vs. actual data
Jul 14, 2005 Kenneth Kowalski RE: Simulation vs. actual data
Jul 14, 2005 Juan Jose Perez Ruixo RE: Simulation vs. actual data
Jul 14, 2005 Nick Holford Re: Simulation vs. actual data
Jul 15, 2005 Kenneth Kowalski RE: Simulation vs. actual data
Jul 16, 2005 Kenneth Kowalski RE: Simulation vs. actual data