RE: Simple parallel benchmark for Nonmem 7.2 with large Bayes problem

From: Xavier Woot de Trixhe
Date: May 23, 2011
Source: mail-archive.com
Hi Nick and Ron,

During beta testing done by Exprimo it became clear that speed might not be the best heuristic for judging 7.2.

First, there was a trade-off between consistency and speed in the choice of compiler options, as some options produced different results between the MPI and non-MPI runs. An added benefit is that with these options NONMEM 7.1 and NONMEM 7.2 produced IDENTICAL results. The "default" install of NONMEM has, as far as I know, rarely (if ever) produced identical results across version numbers. Putting the lingering issue of cross-version consistency to rest is in my opinion well worth waiting a few extra seconds.

Second, NONMEM 7.2 includes a semi-dynamic resizing step. This step forces NONMEM to (re)compile a bigger part of the code before the actual NONMEM execution can take place. Here flexibility prevailed over speed. I certainly will not miss having to tweak the SIZES file and keeping reg, big, huge, etc. installations of NONMEM.

Finally, the speed increase of an MPI run is so dramatic that comparing speed is moot for all but the simplest models with limited data. For those the -trskip and -prskip options may help. (Untested)

To add to the overall discussion on parallel NONMEM:
- FPI should be avoided at all costs. Sharing data using files is far from efficient. (Maybe for huge problems distributed over different geographical sites?)
- The communication overhead for MPI executions over different machines will probably be extremely dependent on the ssh settings (the protocol used to communicate between two machines).
- Hyperthreading* is open for debate and speculation... Critics will recommend not using it, as it may give a false impression of having more resources than you actually have, adds to the overhead, requires specially designed code, etc. Others -like me- will argue that having the hardware do some crude load balancing is better than leaving it all to the OS.
- Overall I would not recommend MPI over different machines, especially in a multi-user production environment. I am somewhat surprised that you did not see any difference when using multiple machines instead of one.

To Ron I have a comment: I would not expect big differences between FPI and MPI on a workstation, but rather on a cluster. The reason is that on a workstation the working directories are located on the local disk for both MPI and FPI. So on the one hand FPI will have slightly higher disk IO, while MPI has an added overhead from the MPI daemon. The big difference would be in a cluster environment: FPI needs to use working directories located on a network drive, while MPI can use local drives. Disk IO is distributed over the nodes when using MPI, while it is centralized when using FPI. I speculate that the performance of FPI will be strongly dependent on the IO of your cluster file system, while network latency will be the defining factor for MPI.

Kind regards,
Xavier

*Hyperthreading creates two virtual cores per physical core; an i7 with HT seems to have 8 cores although it only has 4 physical.
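Xavier's and Mark's point about overhead can be illustrated with a toy timing model (all numbers below are hypothetical illustrations, not NONMEM measurements): each iteration's computation divides across workers, while a fixed per-iteration communication or file-IO cost does not, so fast models gain little from extra cores.

```python
# Toy model: total wall time for `iters` iterations split over n workers,
# where each iteration pays a fixed overhead c (MPI daemon or FPI file
# polling) that does NOT shrink with n. Hypothetical numbers only.

def run_time(t_iter, c, n, iters=1000):
    """Wall time: computation divides by n, overhead c does not."""
    return iters * (t_iter / n + c)

def speedup(t_iter, c, n):
    """Observed speedup of n workers versus a single worker."""
    return run_time(t_iter, c, 1) / run_time(t_iter, c, n)

# A slow model (10 s/iteration) tolerates 0.1 s overhead well...
print(round(speedup(10.0, 0.1, 8), 2))   # close to the ideal 8x
# ...while a fast model (0.2 s/iteration) is dominated by overhead.
print(round(speedup(0.2, 0.1, 8), 2))    # far below 8x
```

This is the same effect Ron describes below: with a short time per iteration the overhead swamps the computation, and adding cores can even slow the run down once per-worker costs grow with n.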
Quoted reply history
From: [email protected] [mailto:[email protected]] On Behalf Of Nick Holford Sent: 21 May 2011 12:17 To: nmusers Subject: Re: [NMusers] Simple parallel benchmark for Nonmem 7.2 with large Bayes problem Ron, I haven't had a chance to try out the final NONMEM 7.2 release. Did you compare NONMEM 7.2 run times with NONMEM 7.1 and NONMEM 6 without parallelization? I found NONMEM 6 was the fastest and NONMEM 7.2 (beta) was slower than NONMEM 7.1 with single core runs (WinSvr2003, Intel 11.1, 8 Intel cores) Nick On 21/05/2011 12:03 p.m., Ron Keizer wrote: Dieter, the observation that the 8-core run is slower than the 4-core run is probably not due CPU hyperthreading, as you suggest. The CPU loads that you report also suggest otherwise. I agree with Mark that it is more likely due to the short time per iteration, i.e. the relatively high amount of overhead compared to the actual calculations. We noticed the same when using FPI. Use MPI or test a slower model and this effect will probably disappear. We also did some benchmarking, and noticed that NM7.2 can do pretty efficient parallelization. Our conclusions: - MPI is much more efficient than FPI, especially for faster problems - The efficiency with MPI seems to hold across estimation methods (FOCE / BAYES / SAEM) and models (8 tested), around 90% when using 5 cores. See results below. - Parallelization efficiency depends on e.g. time per iteration, transfer type, number of individuals in dataset. - parallelization (MPI) was still efficient at higher numbers of cores. We tested up to 7 cores on 1 machine. In some basic tests, performance over network-nodes seemed as good as when running on a single machine, although fair benchmarking is difficult on a production cluster. We tested using the gfortran compiler, on a dedicated 8-core machine running Linux. 
best regards, Ron

--
Nick Holford, Professor Clinical Pharmacology
Dept Pharmacology & Clinical Pharmacology
University of Auckland, 85 Park Rd, Private Bag 92019, Auckland, New Zealand
tel: +64(9)923-6730 fax: +64(9)373-7090 mobile: +64(21)46 23 53
email: [email protected]
http://www.fmhs.auckland.ac.nz/sms/pharmacology/holford
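Ron's "around 90% when using 5 cores" refers to parallel efficiency: the achieved speedup divided by the ideal linear speedup, i.e. efficiency = T1 / (n * Tn). A minimal sketch of that calculation (the timings below are made-up illustrations, not the actual benchmark data):

```python
# Parallel efficiency from wall-clock timings: efficiency = T1 / (n * Tn).
# A value of 1.0 means perfect linear scaling. Timings here are invented
# for illustration, not taken from the NM7.2 benchmarks in this thread.

def efficiency(t_serial, t_parallel, n):
    """Fraction of ideal n-fold speedup actually achieved on n cores."""
    return t_serial / (n * t_parallel)

# E.g. a run taking 1000 s serially and 222 s on 5 cores:
print(round(efficiency(1000.0, 222.0, 5), 2))  # ~0.90, i.e. ~90% efficient
```

Comparing this number across core counts is a quick way to see where the overhead Xavier describes starts to dominate: efficiency near 1.0 means the run is compute-bound, while a steep drop as n grows points to communication or IO costs.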
May 20, 2011 Dieter Menne Simple parallel benchmark for Nonmem 7.2 with large Bayes problem
May 20, 2011 Mark Sale RE: Simple parallel benchmark for Nonmem 7.2 with large Bayes problem
May 21, 2011 Ron Keizer Re: Simple parallel benchmark for Nonmem 7.2 with large Bayes problem
May 21, 2011 Nick Holford Re: Simple parallel benchmark for Nonmem 7.2 with large Bayes problem
May 23, 2011 Ron Keizer Re: Simple parallel benchmark for Nonmem 7.2 with large Bayes problem
May 23, 2011 Xavier Woot de Trixhe RE: Simple parallel benchmark for Nonmem 7.2 with large Bayes problem