Re: General question on modeling

From: Tobias Sing
Date: March 21, 2007
Source: mail-archive.com
Mark & list,

I'm a newbie to the list. I hope I'm not duplicating anything mentioned yesterday (the archive seems to become available with a delay), but this is a topic I'm also very much interested in, so I'd like to share my current view on it (I'd be happy to hear both dissenting and agreeing opinions).
> On Monday 19 March 2007 19:32, Mark Sale - Next Level Solutions wrote:
> Dear Colleagues,
> I've lately been reviewing the literature on model building/selection
> algorithms. The structural first, then variances/forward addition/backward
> elimination is generally mentioned in a number of places [...]
> Can anyone point me to any rigorous discussion of this model building
> strategy?

There can be no rigorous general (i.e. problem-independent) statement about the superiority of any variable or model selection strategy over another:

* Wolpert, D.H. and W.G. Macready, 1997. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation (cf. http://citeseer.ist.psu.edu/wolpert95no.html and http://en.wikipedia.org/wiki/No-free-lunch_theorem).

Thus, the only justification for advocating the use of a particular strategy _without making use of problem-specific knowledge_ is the empirical observation that it often works well in practice.

Other approaches besides forward addition/backward elimination also often work well. An up-to-date overview (opening a whole journal special issue on variable selection):

* An Introduction to Variable and Feature Selection. Isabelle Guyon, André Elisseeff; Journal of Machine Learning Research 3(Mar):1157--1182, 2003. http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf

More or less subtle forms of overfitting always play a role in model selection, and with limited data it is generally not possible to simultaneously select an optimal model _and_ obtain optimally accurate performance estimates, whether by relying on p-values, AIC/BIC/..., or (double-)bootstrap- or (double-)cross-validation-based procedures. However, the "double" versions, which resample the entire modeling process, help a lot in obtaining more reliable estimates when doing a lot of "data dredging".

Harrell's (fantastic) book was mentioned by some previous posters.
In my personal opinion and experience, it is a bit too negative about stepwise variable selection or the simplified version of univariable screening (e.g. on pp. 56-60). In fact, Guyon/Elisseeff and many others have mentioned that greedy search strategies (such as forward/backward selection) are "particularly computationally advantageous and robust against overfitting", as compared to many more sophisticated approaches.

Finally, for me, three important eye-openers on modeling, model uncertainty, and model selection in general (the first two also referenced in Harrell's book) were:

* Model Specification: The Views of Fisher and Neyman, and Later Developments. E. L. Lehmann. Statistical Science 5:2 (1990), pp. 160-168.
* Model uncertainty, data mining and statistical inference. C. Chatfield. Journal of the Royal Statistical Society A 158 (1995), pp. 419-466.
* Statistical modeling: the two cultures (+ lots of discussion articles in the same issue). Leo Breiman. Statistical Science 16 (2001), pp. 199-231.

I hope this didn't sound too disappointing. Put positively, the fact that very few generic things can be said about the model selection process can be considered a "full employment theorem" for modelers... :)

Cheers,
Tobias.

--
Tobias Sing
Computational Biology and Applied Algorithmics
Max Planck Institute for Informatics
Saarbrucken, Germany
Phone: +49 681 9325 315
Fax: +49 681 9325 399
http://www.tobiassing.net
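Editorial illustration (not part of the original message): the point about resampling the entire modeling process can be sketched in a few lines of Python. The sketch below is a simplified stand-in, not a full double cross-validation: it uses univariable screening by correlation as the selection step and compares doing that step once on all data with repeating it inside each training fold. A full "double" procedure would additionally choose the model with an inner resampling loop. All function names, the simulated pure-noise data, and the sizes n=40 and p=200 are assumptions made for the example.

```python
import random
import statistics


def k_folds(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def screen(X, y, rows):
    """Univariable screening: pick the column most correlated with y on `rows`."""
    ys = [y[i] for i in rows]
    n_cols = len(X[0])
    return max(range(n_cols), key=lambda j: abs_corr([X[i][j] for i in rows], ys))


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = 0.0 if sxx == 0 else sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def cv_mse(X, y, folds, select_inside):
    """Cross-validated mean squared error.

    select_inside=True repeats the screening step on every training set;
    select_inside=False screens once on all rows before cross-validating.
    """
    all_rows = [i for f in folds for i in f]
    fixed = None if select_inside else screen(X, y, all_rows)
    sq_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in all_rows if i not in held]
        j = screen(X, y, train) if select_inside else fixed
        slope, icpt = fit_line([X[i][j] for i in train], [y[i] for i in train])
        sq_errors += [(y[i] - (icpt + slope * X[i][j])) ** 2 for i in held_out]
    return statistics.fmean(sq_errors)


rng = random.Random(0)
n, p = 40, 200
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]  # pure noise: no column is truly predictive
folds = k_folds(n, 5, rng)
naive = cv_mse(X, y, folds, select_inside=False)
honest = cv_mse(X, y, folds, select_inside=True)
print(f"selection outside CV: {naive:.3f}   selection inside CV: {honest:.3f}")
```

On pure-noise data like this, the estimate with selection done outside the resampling loop will typically look better than it should, because the held-out rows already influenced which column was picked. Repeating the selection inside each fold removes that particular leak, at the cost of a noisier and usually less flattering number.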
Mar 19, 2007 Mark Sale General question on modeling
Mar 19, 2007 Anthony J. Rossini Re: General question on modeling
Mar 19, 2007 Nick Holford Re: General question on modeling
Mar 19, 2007 Paul Hutson Re: General question on modeling
Mar 19, 2007 Stephen Duffull RE: General question on modeling
Mar 20, 2007 Nick Holford Re: General question on modeling
Mar 20, 2007 Stephen Duffull RE: General question on modeling
Mar 20, 2007 Mark Sale RE: General question on modeling
Mar 20, 2007 Paul Hutson Re: General question on modeling
Mar 20, 2007 Michael Fossler General question on modeling
Mar 20, 2007 Peter Bonate General question on modeling
Mar 20, 2007 Michael . Looby RE: General question on modeling
Mar 20, 2007 Michael Fossler General question on modeling
Mar 20, 2007 James G Wright RE: General question on modeling
Mar 20, 2007 Tim Bergsma Re: General question on modeling
Mar 20, 2007 Alison Boeckmann Re: General question on modeling
Mar 20, 2007 Marc Gastonguay Re: General question on modeling
Mar 21, 2007 Tobias Sing Re: General question on modeling
Mar 21, 2007 Mark Sale RE: General question on modeling