Re: General question on modeling
Mark & list,
I'm a newbie to the list. I hope I'm not duplicating anything
mentioned yesterday (the archive seems to become available with some
delay), but this is a topic I'm also very much interested in, so I'd
like to share my current view on it (I'd be happy to hear dissenting
as well as agreeing opinions).
> On Monday 19 March 2007 19:32, Mark Sale - Next Level Solutions wrote:
> Dear Colleagues,
> I've lately been reviewing the literature on model building/selection
> algorithms. The structural
> first, then variances/forward addition/backward elimination is
> generally mentioned in a number of places
> [...] Can anyone point
> me to any rigorous discussion of this model building strategy?
There can be no rigorous general (i.e. problem-independent) statement
about the superiority of any variable or model selection strategy over
another:
* Wolpert, D.H. and W.G. Macready, 1997. No free lunch theorems for
optimization. IEEE Transactions on Evolutionary Computation 1(1):67-82
(cf. http://citeseer.ist.psu.edu/wolpert95no.html and
http://en.wikipedia.org/wiki/No-free-lunch_theorem).
Thus, the only justification for advocating the use of a particular
strategy _without making use of problem-specific knowledge_ is the
empirical observation that it often works well in practice. Other
approaches besides forward addition/backward elimination also often
work well. An up-to-date overview (opening a whole journal special
issue on variable selection):
* An Introduction to Variable and Feature Selection
Isabelle Guyon, André Elisseeff;
Journal of Machine Learning Research 3(Mar):1157--1182, 2003.
http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf
More or less subtle forms of overfitting always play a role in model
selection, and with limited data it is generally not possible to
simultaneously select an optimal model _and_ obtain optimally accurate
performance estimates, whether one relies on p-values, AIC/BIC/...,
(double-)bootstrap-, or (double-)cross-validation-based procedures.
However, the "double" versions, which resample the entire modeling
process, help a lot in obtaining more reliable estimates when doing a
lot of "data dredging".
Harrell's (fantastic) book was mentioned by some previous posters. In
my personal opinion and experience, it is a bit too negative about
stepwise variable selection and its simplified variant, univariable
screening (e.g. on pp. 56-60). In fact, Guyon/Elisseeff and many
others have noted that greedy search strategies (such as
forward/backward selection) are "particularly computationally
advantageous and robust against overfitting", as compared to many more
sophisticated approaches.
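The computational advantage of greedy forward selection is easy to see in a sketch: it needs on the order of p * k score evaluations (p features, k selected) instead of 2**p for exhaustive subset search. The loop and the toy scoring function below are purely illustrative (all names hypothetical, the "score" stands in for whatever criterion one uses, e.g. a penalized likelihood):

```python
# Generic greedy forward selection: start from the empty set and
# repeatedly add the single feature that most improves score();
# stop as soon as no addition helps. Illustrative sketch only.

def forward_select(features, score):
    selected, best = [], score(())
    remaining = list(features)
    while remaining:
        cand = max(remaining, key=lambda f: score(tuple(selected) + (f,)))
        new = score(tuple(selected) + (cand,))
        if new <= best:
            break  # no candidate improves the criterion; stop greedily
        selected.append(cand)
        remaining.remove(cand)
        best = new
    return selected, best

# Toy criterion: reward truly relevant covariates, charge an AIC-like
# penalty of 0.1 per included term (feature names are hypothetical).
RELEVANT = {"age", "weight"}

def toy_score(subset):
    return sum(1.0 for f in subset if f in RELEVANT) - 0.1 * len(subset)

picked, final = forward_select(["age", "sex", "weight", "crcl"], toy_score)
# picked contains exactly the two relevant covariates
```

With four features this makes at most 4 + 3 + 2 + 1 score evaluations instead of 16 subsets; the gap grows exponentially with p, which is one reason these greedy strategies remain practical.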
Finally, for me, three important eye-openers on modeling, model
uncertainty, and model selection in general (the first two also
referenced in Harrell's book) were:
* Model Specification: The Views of Fisher and Neyman, and Later Developments
E. L. Lehmann
Statistical Science 5(2) (1990), pp. 160-168.
* Model uncertainty, data mining and statistical inference
C. Chatfield
Journal of the Royal Statistical Society A 158 (1995), pp. 419-466.
* Statistical modeling: the two cultures (+ lots of discussion
articles in the same issue)
Leo Breiman
Statistical Science 16 (2001), pp. 199-231.
I hope this didn't sound too disappointing. Put positively, the fact
that very few generic things can be said about the model selection
process can be considered a "full employment theorem" for modelers...
:)
Cheers,
Tobias.
--
Tobias Sing
Computational Biology and Applied Algorithmics
Max Planck Institute for Informatics
Saarbrücken, Germany
Phone: +49 681 9325 315
Fax: +49 681 9325 399
http://www.tobiassing.net