RE: order of covariate inclusion -> avoiding stepwise approaches
From: mark.e.sale@gsk.com
Subject: RE: [NMusers] order of covariate inclusion -> avoiding stepwise approaches
Date: 9/26/2003 9:33 AM
My own perspective, which many of you have heard already.
My concern is usually more with getting the model right than with any statistical test. As
I understand it, the primary problems with stepwise regression are:
1. Inflated type 1 error, and the related inflated R-squared values, downward-biased standard
errors for parameters, etc. You are basically data dredging, and if you look at enough random
effects you'll find one. This concerns me only somewhat, since post-hoc models should always be
regarded as hypothesis generation, in a statistical sense.
From Frank Harrell - Regression Modeling Strategies: "Stepwise variable selection ... if this procedure
had just been proposed as a statistical method, it would most likely be rejected because it violates
every principle of statistical estimation and hypothesis testing."
Also, I typically want the "best" model (whatever that means), so I worry more about type 2 errors than type 1 errors.
2. (More important in my view) - confounding between variables, nicely demonstrated by this paper:
Interaction between structural, statistical and covariate models in population pharmacokinetic analysis
(Wade JR, Beal SL, Sambol NC J Pharmacokinetics and Biopharmaceutics, 1994 Vol 22 (2) 165-177)
Basically, this means that the answer you get depends on how you get there. IMHO, the only way to
get the "best" answer is with a formal search of the models that are considered plausible or of
interest (either based on previous data or biology). A formal search of the plausible models does not
address the issue of inflated type 1 error (in fact it may make it worse). But again, I'm usually not
concerned (much) about that; I want the best model. Penalties can be introduced to address issues of
parsimony and Bayesian priors.
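
As a rough illustration of the data dredging concern in point 1 (and of why searching more candidates may make it worse), here is a small Python simulation sketch. Everything in it is an assumption made for illustration: the response and the candidate covariates are pure noise, each covariate is screened with a simple correlation test at a nominal 5% level (Fisher z approximation), and the function names are made up. It is not a population PK analysis.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def any_false_positive(n_subjects, n_covariates, rng, z_crit=1.96):
    """Screen noise covariates against a noise response.

    Returns True if at least one covariate looks "significant" at the
    nominal two-sided 5% level (Fisher z approximation).
    """
    y = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
    for _ in range(n_covariates):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        z = math.atanh(pearson_r(x, y)) * math.sqrt(n_subjects - 3)
        if abs(z) > z_crit:
            return True
    return False


def familywise_error_rate(n_subjects=50, n_covariates=10, n_trials=1000, seed=1):
    """Fraction of simulated data sets with at least one spurious covariate."""
    rng = random.Random(seed)
    hits = sum(
        any_false_positive(n_subjects, n_covariates, rng) for _ in range(n_trials)
    )
    return hits / n_trials


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        # Simulated rate next to the independent-tests approximation 1 - 0.95**k.
        print(k, familywise_error_rate(n_covariates=k), round(1 - 0.95 ** k, 3))
```

With roughly independent tests, the chance of at least one false positive is about 1 - 0.95^k for k candidates, so the simulated rate should climb quickly as k grows, even though no real effect exists.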
This method is:
Objective
Can be predefined (although that only partly helps with inflated type 1)
Includes all effects (e.g., compartments, omega terms, residual errors, lag times, etc.), not just covariates.
Robust - it apparently will invariably find the "best" model among those considered.
Fast - we are currently running this using distributed computing on 1000 computers.
Can be put in a Bayesian framework with prior knowledge (although this isn't currently implemented). This may
represent a compromise between Marc G's comment that we could just build a full covariate model (which assumes that we
know the model - completely informative prior), and a more traditional (hypothesis testing - uninformative prior) view.
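
A minimal Python sketch of what a predefined search over plausible models with a parsimony penalty might look like. This is only an illustration under stated assumptions, not the implementation described above: the search here is plain exhaustive enumeration (the actual search algorithm is not specified in this post), the search space and parameter counting are hypothetical, the penalty of 2 per parameter is an AIC-style choice picked for the example, and toy_objective is a made-up stand-in for a real model fit that would return -2 log-likelihood.

```python
import itertools

# Hypothetical predefined search space: each key is a model feature,
# each value the options considered plausible for it.
SEARCH_SPACE = {
    "compartments": (1, 2),
    "weight_on_cl": (False, True),
    "age_on_v": (False, True),
    "residual_error": ("additive", "proportional"),
}


def enumerate_models(space):
    """Yield every combination of options as a dict (one candidate model)."""
    keys = list(space)
    for combo in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, combo))


def n_parameters(model):
    """Toy parameter count for a candidate (illustrative bookkeeping only)."""
    n = 2 * model["compartments"]  # e.g. CL and V, plus Q and V2 if 2 compartments
    n += int(model["weight_on_cl"]) + int(model["age_on_v"])
    n += 1  # one residual error parameter
    return n


def toy_objective(model):
    """Stand-in for a real fit: returns a MADE-UP -2 log-likelihood."""
    ofv = 1000.0
    if model["compartments"] == 2:
        ofv -= 40.0
    if model["weight_on_cl"]:
        ofv -= 15.0
    if model["age_on_v"]:
        ofv -= 0.5
    if model["residual_error"] == "proportional":
        ofv -= 8.0
    return ofv


def search(space, objective, penalty_per_parameter=2.0):
    """Score every candidate as objective + penalty, best (lowest) first."""
    scored = []
    for model in enumerate_models(space):
        score = objective(model) + penalty_per_parameter * n_parameters(model)
        scored.append((score, model))
    scored.sort(key=lambda pair: pair[0])
    return scored


best_score, best_model = search(SEARCH_SPACE, toy_objective)[0]
print(best_score, best_model)
```

Because the space is fixed before any fitting and every candidate is scored the same way, the result does not depend on the order in which effects are tried. In a real analysis each call to the objective would be a separate model run, which is why such a search parallelizes naturally across many machines.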
Mark Sale M.D.
Global Director, Research Modeling and Simulation
GlaxoSmithKline
5 Moore Drive
RTP NC, 27709
919-483-1808