Simple parallel benchmark for Nonmem 7.2 with large Bayes problem

6 messages · 5 people · Latest: May 23, 2011
Here are some quick-and-dirty results of my first benchmark with parallel processing in NONMEM 7.2.

Running Win7, 64 bit, Intel i7, with 4 CPUs (and 4 hyperthreading cores). One computer only. Using file message passing; I could not get MPI to work in this configuration.

call nmfe72 mtl_KPreM2Pre_T2L2_.ctl -parafile=fpiwini8.pnm [nodes]=(1 or 4 or 8)

10 iterations of a very large Bayes problem (which should not profit from multiple cores, according to the manual):

nodes  time
1      45 s
4      25 s
8      40 s

So about a factor of 2 between 1 and 4 cores. It is not surprising that 8 gives worse values, because these are not real CPUs. More surprising is the fact that with 8 "CPUs" I have 100% load on all of them (huh?), while with 4 CPUs I have the expected 50%.

Dieter
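The speedup and parallel efficiency implied by these timings can be checked with a few lines of arithmetic (this is just a restatement of the numbers above, not part of the original post):

```python
# Speedup and efficiency implied by the reported wall times
# (45 s on 1 node, 25 s on 4 nodes, 40 s on 8 nodes).

times = {1: 45.0, 4: 25.0, 8: 40.0}  # nodes -> wall time in seconds

for nodes, t in sorted(times.items()):
    speedup = times[1] / t
    efficiency = speedup / nodes * 100  # % of ideal linear scaling
    print(f"{nodes} node(s): speedup {speedup:.2f}x, efficiency {efficiency:.0f}%")
    # 4 nodes: speedup 1.80x, efficiency 45%
```

So even the "factor of 2" case is only running at about 45% parallel efficiency, which is consistent with the overhead-dominated behaviour discussed in the replies below.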
Dieter,

We never expected parallel NONMEM to perform well with problems of this size. The benefit, in our early benchmarks, really starts with problems that run at least 20 minutes. The math is pretty simple: basically, if a function evaluation takes more than about half a second (note that a "typical" NONMEM run may have 3000 function evaluations), it is worth sending out to multiple processes. That was our conclusion with the file-based method; MPI might be more efficient (but I'm told that behind the curtains they both do pretty much the same thing: the OS buffers data blocks of this size very well, and the data never actually goes to the physical disc). Our early benchmarks were also with multiple computers, across a 100 Mb/s LAN. Likely there is also better performance with the very clever load balancing and dynamic sizing that Bob Bauer has put into the new release. But don't expect any benefit with 1-minute runs; there is I/O overhead involved with sending out the data, even on the same CPU. Note that our benchmarks had a base run time of 6 hours. See our poster at http://2009.go-acop.org/acop2009/posters .

Mark

Mark Sale MD
President, Next Level Solutions, LLC
www.NextLevelSolns.com
919-846-9185
A carbon-neutral company
See our real time solar energy production at:
http://enlighten.enphaseenergy.com/public/systems/aSDz2458
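The half-second rule of thumb can be illustrated with a toy cost model: assume each dispatched function evaluation carries a fixed I/O overhead on top of its share of the compute. The overhead value below is an illustrative assumption, not a measured NONMEM figure:

```python
# Toy model of the rule of thumb above: parallelization pays off only once
# each function evaluation is long enough to amortize the dispatch overhead.

def run_time(n_evals, t_eval, n_workers, overhead):
    """Estimated wall time when evaluations are farmed out to workers."""
    per_eval = t_eval / n_workers + overhead  # compute share + assumed I/O cost
    return n_evals * per_eval

n_evals = 3000     # "typical" run, per the message above
overhead = 0.05    # assumed seconds of dispatch overhead per evaluation

for t_eval in (0.01, 0.1, 0.5, 2.0):
    serial = n_evals * t_eval
    parallel = run_time(n_evals, t_eval, 4, overhead)
    print(f"t_eval={t_eval:>4}s  serial={serial:7.0f}s  4 workers={parallel:7.1f}s")
```

With these (assumed) numbers, a 0.01 s evaluation is actually slower on 4 workers than run serially, while a 0.5 s evaluation already gains roughly a factor of three, matching the intuition in the message.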
-------- Original Message --------
Subject: [NMusers] Simple parallel benchmark for Nonmem 7.2 with large Bayes problem
From: "Dieter Menne" <[email protected]>
Date: Fri, May 20, 2011 3:36 pm
To: "nmuser list" <[email protected]>
Dieter,

the observation that the 8-core run is slower than the 4-core run is probably not due to CPU hyperthreading, as you suggest. The CPU loads that you report also suggest otherwise. I agree with Mark that it is more likely due to the short time per iteration, i.e. the relatively high amount of overhead compared to the actual calculations. We noticed the same when using FPI. Use MPI or test a slower model and this effect will probably disappear.

We also did some benchmarking, and noticed that NM7.2 can do pretty efficient parallelization. Our conclusions:

- MPI is much more efficient than FPI, especially for faster problems.
- The efficiency with MPI seems to hold across estimation methods (FOCE / BAYES / SAEM) and models (8 tested): around 90% when using 5 cores. See results below.
- Parallelization efficiency depends on e.g. time per iteration, transfer type, and number of individuals in the dataset.
- Parallelization (MPI) was still efficient at higher numbers of cores. We tested up to 7 cores on 1 machine.
- In some basic tests, performance over network nodes seemed as good as when running on a single machine, although fair benchmarking is difficult on a production cluster.

We tested using the gfortran compiler, on a dedicated 8-core machine running Linux.
best regards,
Ron

--
Ron Keizer, PharmD PhD
Post-doctoral fellow
Pharmacometrics Research Group
Uppsala University

table 1: multicore efficiency

| tt  | n cores | time_FOCE | %   | time_BAYES | %   |
|-----+---------+-----------+-----+------------+-----|
| -   |       1 |  13462.69 | 100 |    5283.78 | 100 |
| FPI |       2 |   7269.35 |  54 |    3096.51 |  58 |
| FPI |       3 |   5081.05 |  38 |    2470.52 |  46 |
| FPI |       4 |   4211.93 |  31 |    2709.43 |  51 |
| FPI |       5 |   3667.43 |  27 |    2729.80 |  51 |
| FPI |       6 |   3464.34 |  26 |    3254.91 |  61 |
|-----+---------+-----------+-----+------------+-----|
| -   |       1 |  13462.69 | 100 |    5283.78 | 100 |
| MPI |       2 |   7122.48 |  53 |    2731.38 |  51 |
| MPI |       3 |   4826.77 |  36 |    1853.94 |  35 |
| MPI |       4 |   3705.35 |  28 |    1464.69 |  27 |
| MPI |       5 |   2976.36 |  22 |    1179.11 |  22 |
| MPI |       6 |   2519.89 |  19 |    1011.94 |  19 |

table 2: efficiency across different models (distributed to 5 cores, t in sec)

| mo | model  | est   | n_ind | iter |   t_orig |  t_mpi5 |    t% |  eff% |
|----+--------+-------+-------+------+----------+---------+-------+-------|
| M1 | ADVAN6 | FOCEI |     9 |   16 |   5863.0 | 1881.88 | 32.10 | 62.31 |
| M2 | ADVAN6 | FOCEI |   454 |   28 |   4485.3 |  930.38 | 20.74 | 96.42 |
| M3 | ADVAN6 | FOCEI |   412 |   20 |   363.84 |   78.23 | 21.50 | 93.02 |
| M4 | ADVAN6 | FOCE  |   105 |  486 | 13616.83 | 2979.52 | 21.88 | 91.40 |
| M5 | ADVAN6 | FOCEI |    42 |   45 | 14183.92 | 3167.56 | 22.33 | 89.56 |
| M6 | ADVAN6 | FOCEI |    39 |   43 |  4698.34 |  992.52 | 21.12 | 94.67 |
| M7 | ADVAN6 | FOCE  |   100 |   29 |    33249 | 7493.82 | 22.54 | 88.74 |
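The eff% column in table 2 is consistent with the usual definition of parallel efficiency, eff% = t_orig / (n_cores × t_parallel) × 100. A quick check against model M2 reproduces the reported values:

```python
# Parallel efficiency as reported in table 2: serial time divided by
# (core count x parallel time). Checked against model M2 (5 cores).

def efficiency(t_serial, t_parallel, n_cores):
    """Parallel efficiency in percent (100 = ideal linear scaling)."""
    return t_serial / (n_cores * t_parallel) * 100

t_pct = 930.38 / 4485.3 * 100          # t% column: parallel time as % of serial
eff = efficiency(4485.3, 930.38, 5)    # eff% column for 5 cores
print(f"t% = {t_pct:.2f}, eff% = {eff:.2f}")  # t% = 20.74, eff% = 96.42
```

The same formula reproduces the other rows as well (e.g. M1: 100 / (0.321 × 5) ≈ 62.3%).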
Ron,

I haven't had a chance to try out the final NONMEM 7.2 release. Did you compare NONMEM 7.2 run times with NONMEM 7.1 and NONMEM 6 without parallelization?

I found NONMEM 6 was the fastest, and NONMEM 7.2 (beta) was slower than NONMEM 7.1, with single-core runs (WinSvr2003, Intel 11.1, 8 Intel cores).

Nick
--
Nick Holford, Professor Clinical Pharmacology
Dept Pharmacology & Clinical Pharmacology
University of Auckland, 85 Park Rd, Private Bag 92019, Auckland, New Zealand
tel: +64(9)923-6730  fax: +64(9)373-7090  mobile: +64(21)46 23 53
email: [email protected]
http://www.fmhs.auckland.ac.nz/sms/pharmacology/holford
Nick,

my own experiences are not so clear-cut. I have seen problems where NM6 performed best, others where NM7.1 performed best, and others where NM7.2 (beta) performed best. Differences in speed may be as high as 50%. I haven't identified a cause for these differences yet, and I only tested using gfortran (and a NM7.2 beta), with ADVAN6/8 problems. I haven't tested the final NM7.2 yet either.

Ron

--
Ron Keizer, PharmD PhD
Post-doctoral researcher
Pharmacometrics Research Group
Uppsala University
Hi Nick and Ron,

During beta testing done by Exprimo it became clear that speed might not be the best heuristic to judge 7.2.

First, a tradeoff between consistency and speed had to be made in the choice of compiler options, as some options resulted in different results between the MPI and non-MPI runs. An added benefit is that using these options, NONMEM 7.1 and NONMEM 7.2 produced IDENTICAL results. The "default" install of NONMEM has, as far as I know, rarely (if ever) produced identical results across version numbers. Putting the lingering issue of cross-version consistency to rest is in my opinion well worth waiting a few extra seconds.

Second, NONMEM 7.2 includes a semi-dynamic resizing step. This step forces NONMEM to (re)compile a bigger part of the code before the actual NONMEM execution can take place. Here flexibility prevailed over speed. I certainly will not miss having to tweak the SIZES file and keeping reg, big, huge, etc. installations of NONMEM.

Finally, the speed increase of an MPI run is so dramatic that comparing speed is moot for all but the simple models with limited data. For those, the -trskip and -prskip options may help. (Untested)

To add to the overall discussion on parallel NONMEM:

- FPI should be avoided at all costs. Sharing data using files is far from efficient. (Maybe for huge problems distributed over different geographical sites?)
- The communication overhead for MPI executions over different machines will probably be extremely dependent on the ssh settings (the protocol used to communicate between two machines).
- Hyperthreading* is open for debate and speculation... Critics will recommend not using it, as it may give a false impression of having more resources than you actually have, adds to the overhead, requires specially designed code, etc. Others, like me, will argue that having the hardware do some crude load balancing is better than leaving it all to the OS.
- Overall I would not recommend MPI over different machines, especially in a multi-user production environment. I am somewhat surprised that you did not see any difference when using multiple machines instead of one.

To Ron I have a comment: I would not expect the big differences between FPI and MPI on a workstation, but rather on a cluster. The reason is that on a workstation the working directories are located on the local disk for both MPI and FPI. So on the one hand FPI will have slightly higher disk I/O, while MPI has an added overhead from the MPI daemon. The big difference would be in a cluster environment: FPI needs to use working directories located on a network drive, while MPI can use local drives. Disk I/O is distributed over the nodes when using MPI, while it is centralized when using FPI. I speculate that the performance of FPI will be strongly dependent on the I/O of your cluster file system, while network latency will be the defining factor for MPI.

Regards,
Xavier

* Hyperthreading creates two virtual cores per physical core... an i7 with HT seems to have 8 cores although it only has 4 physical.