From: Ken.Kowalski@pfizer.com
Subject: RE: [NMusers] order of covariate inclusion -> avoiding stepwise a pproaches -> abandoning exploratory analysis?
Date: 9/29/2003 2:37 PM
Leonid,
I think you are missing my point. I have nothing against investigating a
large number of covariates (e.g., >=30) if they truly provide scientifically
relevant and independent information. However, typically what happens when
we have a long list of covariates is that many covariates may be collinear
and hence are redundant. Such redundancy, where nuisance covariates are
correlated with the important mechanistic covariates, can wreak havoc with any
model-building procedure and mask our ability to discern the true covariate
effects. Forward selection procedures are particularly vulnerable because they
can be blind to the collinearity issue: they can often find a good-fitting
model without running into the stability and over-parameterization issues that
a full model would confront. However, as I've said previously, just because
forward selection can find a good-fitting model doesn't mean it found the
right one. For example, a nuisance covariate that is highly correlated with
the true covariate may, by chance, be selected first; because of the order of
testing, the true covariate may never be evaluated again and hence is excluded
in favor of the nuisance covariate. I can't tell you how many times a modeler
has said they had a difficult time interpreting the covariate effects in a
final model selected by a stepwise procedure because they felt certain
excluded covariates were more scientifically plausible. I'm merely suggesting
that we be a bit more discriminating in developing our list of plausible
covariates to investigate, so that we end up with a set that is the most
scientifically plausible and independent.
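
To see how the order of testing can lock out the true covariate, here is a
toy simulation (Python/numpy rather than NONMEM; the variable names and values
are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(7)
n = 50
x_true = rng.normal(size=n)                  # the mechanistic covariate
x_nuis = x_true + 0.2 * rng.normal(size=n)   # collinear nuisance (correlation ~0.98)
y = x_true + rng.normal(size=n)              # the response depends only on x_true

def sse(X, y):
    # residual sum of squares from an ordinary least-squares fit with intercept
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return resid @ resid

# Step 1 of forward selection: the single covariate with the best fit wins.
# Depending on the random sample, the nuisance covariate can win this step;
# the true covariate then adds almost nothing and may never enter the model.
print("SSE, x_true alone:", sse(x_true[:, None], y))
print("SSE, x_nuis alone:", sse(x_nuis[:, None], y))
print("further drop from adding x_true to x_nuis:",
      sse(x_nuis[:, None], y) - sse(np.column_stack([x_nuis, x_true]), y))

With a pair this highly correlated, which covariate wins the first step is
essentially a coin flip, and the second step then sees almost no additional
improvement in fit.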
The main reason we use a combination of forward selection/backward
elimination with a higher alpha level for inclusion (e.g., alpha = 0.05) is
precisely to help mitigate the problem with forward selection alone. By
increasing the alpha level for inclusion we allow for "bigger" models to be
tested before pruning to a parsimonious model using backward elimination
with a smaller alpha level for exclusion (e.g., alpha = 0.01 or 0.001). Of
course, if we increase the alpha level for inclusion towards 1.0 the
combination forward selection/backward elimination procedure will collapse
to a purely backward elimination procedure. Moreover, setting the alpha level
for inclusion to as little as 0.20 may be enough for forward selection to
build bigger models that encounter the same ill-conditioning problems due to
collinearity that we observe with the full model (if we don't become a bit
more discriminating in our choice of covariates).
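
A minimal sketch of this combined procedure, using the likelihood-ratio
cutoffs that correspond to these alpha levels (again Python on simulated data;
in practice each "fit" would be a NONMEM run, and the helper functions and
covariate names here are only stand-ins):

import numpy as np
from scipy.stats import chi2

def ofv(cols, y):
    # -2 log-likelihood up to a constant for an OLS fit with intercept:
    # with normal errors and the MLE variance, -2LL = n*log(SSE/n) + const.
    X = np.column_stack([np.ones(len(y))] + list(cols))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return len(y) * np.log(resid @ resid / len(y))

def stepwise(cands, y, a_in=0.05, a_out=0.01):
    sel = []                                    # names of included covariates
    while True:                                 # forward selection, looser alpha
        base = ofv([cands[k] for k in sel], y)
        trial = {k: base - ofv([cands[j] for j in sel] + [cands[k]], y)
                 for k in cands if k not in sel}
        if not trial:
            break
        best, drop = max(trial.items(), key=lambda kv: kv[1])
        if drop <= chi2.ppf(1 - a_in, 1):       # require drop > 3.84 (alpha = 0.05, 1 df)
            break
        sel.append(best)
    while sel:                                  # backward elimination, stricter alpha
        full = ofv([cands[k] for k in sel], y)
        rise = {k: ofv([cands[j] for j in sel if j != k], y) - full for k in sel}
        worst, d = min(rise.items(), key=lambda kv: kv[1])
        if d >= chi2.ppf(1 - a_out, 1):         # weakest term still exceeds 6.63 (alpha = 0.01)
            break
        sel.remove(worst)
    return sel

rng = np.random.default_rng(1)
n = 100
cands = {"WT": rng.normal(size=n), "AGE": rng.normal(size=n),
         "SEX": rng.integers(0, 2, size=n).astype(float)}
y = 0.8 * cands["WT"] + rng.normal(size=n)
print(stepwise(cands, y))                       # typically ['WT'] on these data

Pushing a_in toward 1.0 makes the forward pass accept everything, which is
exactly the collapse to pure backward elimination described above.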
I agree that we could err on the other side as well and eliminate important
explanatory covariates if we are too discriminating. Nevertheless, we need
to use our best scientific judgement as well as good statistical principles
to really uncover the important covariates. Blindly ignoring the
limitations of forward selection and just turning the crank to allow the
algorithm to identify covariate effects is risky. That is not to say that
I'm against forward selection, just that we need to know when it is
appropriate to use it and when it's not. I still maintain it is better to
identify where the redundancies are and eliminate the least plausible
covariates when redundancies exist. Certainly if a less plausible covariate
is fairly independent of the other covariates then we don't need to
eliminate it. I have no problem using forward selection to identify a
parsimonious model once we've streamlined the list to those covariates that
provide the most independent information, removing redundancies by dropping
the least plausible covariates. The problem is knowing when we are in a
situation with a lot of redundancy. Building a full model and examining the
diagnostics from that fit (e.g., the COV step) will help us recognize when we
need to deal with the collinearity.
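
Short of fitting the full model, one can also screen the covariate set
directly for redundancy. A minimal sketch, again with simulated data and
illustrative covariate names (this is a complement to, not a substitute for,
the COV step diagnostics):

import numpy as np

rng = np.random.default_rng(3)
n = 200
wt   = rng.normal(70, 10, n)                 # body weight (kg)
ht   = rng.normal(170, 4, n)                 # height (cm)
bsa  = 0.007184 * wt**0.425 * ht**0.725      # Du Bois BSA, derived largely from WT
crcl = rng.normal(90, 20, n)                 # creatinine clearance (mL/min)

X = np.column_stack([wt, bsa, crcl])
R = np.corrcoef(X, rowvar=False)             # pairwise correlation matrix
print(np.round(R, 2))                        # WT-BSA correlation is high (~0.95)
vif = np.diag(np.linalg.inv(R))              # variance inflation factors
print(np.round(vif, 1))                      # VIF above ~10 is a common warning sign

Here BSA is essentially a transformation of WT, so only the more mechanistically
plausible of the two needs to be carried forward.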
Ken