Here are some quick-and-dirty results of my first benchmark with parallel
processing in NONMEM 7.2.
Running Win7 64-bit on an Intel i7 with 4 physical cores (plus 4
hyperthreading cores). One computer only.
Using file message passing. Could not get MPI to work in this configuration.
call nmfe72 mtl_KPreM2Pre_T2L2_.ctl -parafile=fpiwini8.pnm [nodes]=N
(with N = 1, 4, or 8)
10 iterations of a very large Bayes problem (which should not profit from
multiple cores, according to the manual)
nodes time
1 45 s
4 25 s
8 40 s
So about a factor of 2 between 1 and 4 cores.
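For reference, a small Python sketch (not part of the original post) that turns the reported wall times into speedup and per-core efficiency:

```python
# Speedup and per-core efficiency implied by the reported run times.
times = {1: 45, 4: 25, 8: 40}  # nodes -> run time in seconds
t1 = times[1]
for n, t in sorted(times.items()):
    speedup = t1 / t          # serial time / parallel time
    efficiency = speedup / n  # fraction of ideal linear scaling
    print(f"{n} node(s): speedup {speedup:.2f}x, per-core efficiency {efficiency:.0%}")
```

The 4-node run gives a 1.8x speedup (45% per-core efficiency); the 8-node run drops to 1.1x.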
It is not surprising that 8 gives worse values, because these are not real
CPUs. More surprising is that with 8 "CPUs" I see 100% load on all of them
(huh?), while with 4 CPUs I see the expected 50%.
Dieter
Simple parallel benchmark for Nonmem 7.2 with large Bayes problem
6 messages
5 people
Latest: May 23, 2011
Dieter,

We never expected parallel NONMEM to perform well with problems of this size. The benefit, in our early benchmarks, really starts with problems that run at least 20 minutes. The math is pretty simple: basically, if a function evaluation takes more than about half a second (note that a "typical" NONMEM run may have 3000 function evaluations), it is worth sending out to multiple processes. That was our conclusion with the file-based method; MPI might be more efficient (but I'm told that behind the curtains they both do pretty much the same thing, since the OS buffers data blocks of this size very well and the data never actually goes to the physical disc). Our early benchmarks were also with multiple computers, across a 100 Mb/s LAN. There is likely also better performance with the very clever load balancing and dynamic sizing that Bob Bauer has put into the new release. But don't expect any benefit with 1-minute runs; there is I/O overhead involved in sending out the data, even on the same CPU. Note that our benchmarks had a base run time of 6 hours. See our poster at http://2009.go-acop.org/acop2009/posters .

Mark

Mark Sale MD
President, Next Level Solutions, LLC
www.NextLevelSolns.com
919-846-9185
A carbon-neutral company
See our real time solar energy production at:
http://enlighten.enphaseenergy.com/public/systems/aSDz2458
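Mark's half-second rule of thumb can be sketched as a back-of-the-envelope model (not from the original post): distributing an evaluation pays off only when its run time dominates the per-dispatch I/O overhead. The overhead value used here is a hypothetical placeholder, not a measured NONMEM figure.

```python
def parallel_pays_off(t_eval, n_workers, overhead=0.25):
    """True if splitting one evaluation across n_workers beats running it serially.

    overhead is a hypothetical per-dispatch cost in seconds (I/O, process
    communication); the 0.25 s default is an illustrative assumption only.
    """
    return t_eval / n_workers + overhead < t_eval

print(parallel_pays_off(0.1, 4))  # short evaluation: overhead dominates
print(parallel_pays_off(1.0, 4))  # longer evaluation: worth distributing
```

With this toy model, evaluations well under the overhead cost never benefit, which matches the observation that 1-minute runs see no gain.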
-------- Original Message --------
Subject: [NMusers] Simple parallel benchmark for Nonmem 7.2 with large Bayes problem
From: "Dieter Menne" <[email protected]>
Date: Fri, May 20, 2011 3:36 pm
To: "nmuser list" <[email protected]>
Dieter,
the observation that the 8-core run is slower than the 4-core run is probably not due to CPU hyperthreading, as you suggest. The CPU loads that you report also suggest otherwise. I agree with Mark that it is more likely due to the short time per iteration, i.e. the relatively high amount of overhead compared to the actual calculations. We noticed the same when using FPI. Use MPI or test a slower model and this effect will probably disappear.
We also did some benchmarking, and noticed that NM7.2 can do pretty efficient parallelization. Our conclusions:
- MPI is much more efficient than FPI, especially for faster problems.
- The efficiency with MPI seems to hold across estimation methods (FOCE / BAYES / SAEM) and models (8 tested): around 90% when using 5 cores. See results below.
- Parallelization efficiency depends on e.g. time per iteration, transfer type, and number of individuals in the dataset.
- Parallelization (MPI) was still efficient at higher numbers of cores. We tested up to 7 cores on 1 machine. In some basic tests, performance over network nodes seemed as good as when running on a single machine, although fair benchmarking is difficult on a production cluster.
We tested using the gfortran compiler, on a dedicated 8-core machine running Linux.
best regards,
Ron
--
-----------------------------------
Ron Keizer, PharmD PhD
Post-doctoral fellow
Pharmacometrics Research Group
Uppsala University
-----------------------------------
table 1: multicore efficiency (times in sec; tt = transfer type; % = run time relative to 1 core)
| tt | n cores | time_FOCE | % | time_BAYES | % |
|-----+---------+-----------+-----+------------+-----|
| - | 1 | 13462.69 | 100 | 5283.78 | 100 |
| FPI | 2 | 7269.35 | 54 | 3096.51 | 58 |
| FPI | 3 | 5081.05 | 38 | 2470.52 | 46 |
| FPI | 4 | 4211.93 | 31 | 2709.43 | 51 |
| FPI | 5 | 3667.43 | 27 | 2729.8 | 51 |
| FPI | 6 | 3464.34 | 26 | 3254.91 | 61 |
|-----+---------+-----------+-----+------------+-----|
| - | 1 | 13462.69 | 100 | 5283.78 | 100 |
| MPI | 2 | 7122.48 | 53 | 2731.38 | 51 |
| MPI | 3 | 4826.77 | 36 | 1853.94 | 35 |
| MPI | 4 | 3705.35 | 28 | 1464.69 | 27 |
| MPI | 5 | 2976.36 | 22 | 1179.11 | 22 |
| MPI | 6 | 2519.89 | 19 | 1011.94 | 19 |
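The parallel efficiency implied by table 1 can be recomputed from the timings as t1 / (n * tn); a small Python sketch (added for illustration, using the 5-core rows) makes the FPI-vs-MPI conclusion concrete:

```python
# Parallel efficiency at 5 cores, t1 / (5 * t5), from the table 1 timings.
t1 = {"FOCE": 13462.69, "BAYES": 5283.78}           # 1-core run times (s)
t5 = {("FPI", "FOCE"): 3667.43, ("FPI", "BAYES"): 2729.80,
      ("MPI", "FOCE"): 2976.36, ("MPI", "BAYES"): 1179.11}
for (tt, est), t in sorted(t5.items()):
    eff = 100 * t1[est] / (5 * t)
    print(f"{tt} {est}: efficiency {eff:.0f}%")
```

This reproduces the "around 90% when using 5 cores" figure for MPI (both FOCE and BAYES), against roughly 73% (FOCE) and 39% (BAYES) for FPI.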
table 2: efficiency across different models (distributed to 5 cores, t in sec; t% = t_mpi5 relative to t_orig; eff% = parallel efficiency)
| mo | model | est | n_ind | iter | t_orig | t_mpi5 | t% | eff% |
|----+--------+-------+-------+------+----------+---------+-------+-------|
| M1 | ADVAN6 | FOCEI | 9 | 16 | 5863.0 | 1881.88 | 32.1 | 62.31 |
| M2 | ADVAN6 | FOCEI | 454 | 28 | 4485.3 | 930.38 | 20.74 | 96.42 |
| M3 | ADVAN6 | FOCEI | 412 | 20 | 363.84 | 78.23 | 21.5 | 93.02 |
| M4 | ADVAN6 | FOCE | 105 | 486 | 13616.83 | 2979.52 | 21.88 | 91.4 |
| M5 | ADVAN6 | FOCEI | 42 | 45 | 14183.92 | 3167.56 | 22.33 | 89.56 |
| M6 | ADVAN6 | FOCEI | 39 | 43 | 4698.34 | 992.52 | 21.12 | 94.67 |
| M7 | ADVAN6 | FOCE | 100 | 29 | 33249 | 7493.82 | 22.54 | 88.74 |
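The last two columns of table 2 follow directly from the timings: t% is the 5-core time as a percentage of the serial time, and eff% = t_orig / (5 * t_mpi5). A short Python check (added for illustration, using two rows from the table):

```python
# Recompute the derived columns of table 2 from the raw timings.
rows = {"M2": (4485.30, 930.38),    # (t_orig, t_mpi5) in seconds
        "M4": (13616.83, 2979.52)}
for model, (t_orig, t_mpi5) in rows.items():
    t_pct = 100 * t_mpi5 / t_orig          # "t%" column
    eff_pct = 100 * t_orig / (5 * t_mpi5)  # "eff%" column
    print(f"{model}: t% = {t_pct:.2f}, eff% = {eff_pct:.2f}")
```

For M2 this gives t% = 20.74 and eff% = 96.42, matching the table.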
On 5/20/11 9:36 PM, Dieter Menne wrote:
> [...]
Ron,
I haven't had a chance to try out the final NONMEM 7.2 release.
Did you compare NONMEM 7.2 run times with NONMEM 7.1 and NONMEM 6 without parallelization?
I found NONMEM 6 was the fastest, and NONMEM 7.2 (beta) was slower than NONMEM 7.1, with single-core runs (WinSvr2003, Intel Fortran 11.1, 8 Intel cores).
Nick
On 21/05/2011 12:03 p.m., Ron Keizer wrote:
> [...]
--
Nick Holford, Professor Clinical Pharmacology
Dept Pharmacology & Clinical Pharmacology
University of Auckland, 85 Park Rd, Private Bag 92019, Auckland, New Zealand
tel: +64(9)923-6730  fax: +64(9)373-7090  mobile: +64(21)46 23 53
email: [email protected]
http://www.fmhs.auckland.ac.nz/sms/pharmacology/holford
Nick,
my own experiences are not so clear-cut. I have seen problems where NM6 performed best, others where NM7.1 performed best, and others where NM7.2 (beta) performed best. Differences in speed may be as high as 50%. I haven't identified a cause for these differences yet, and I have only tested with gfortran (and an NM7.2 beta), on ADVAN6/8 problems. I haven't tested the final NM7.2 yet either.
Ron
--
-----------------------------------
Ron Keizer, PharmD PhD
Post-doctoral researcher
Pharmacometrics Research Group
Uppsala University
-----------------------------------
On 05/21/2011 12:16 PM, Nick Holford wrote:
> [...]
Hi Nick and Ron,
During beta testing done by Exprimo it became clear that speed might not
be the best heuristic for judging 7.2.
First, a tradeoff between consistency and speed had to be made in the
choice of compiler options, as some options resulted in different results
between the MPI and non-MPI runs.
An added benefit is that with these options NONMEM 7.1 and NONMEM 7.2
produced IDENTICAL results.
The "default" install of NONMEM has, as far as I know, rarely (if ever)
produced identical results across version numbers.
Putting the lingering issue of cross-version consistency to rest is, in my
opinion, well worth waiting a few extra seconds.
Second, NONMEM 7.2 includes a semi-dynamic resizing step.
This step forces NONMEM to (re)compile a bigger part of the code before
the actual NONMEM execution can take place.
Here flexibility prevailed over speed. I certainly will not miss having to
tweak the SIZES file and keeping reg, big, huge, etc. installations of
NONMEM.
Finally, the speed increase of an MPI run is so dramatic that comparing
speed is moot for all but the simplest models with limited data.
For those the -trskip and -prskip options may help. (Untested)
To add to the overall discussion on parallel NONMEM:
- FPI should be avoided at all costs. Sharing data using files is far
from efficient (maybe useful for huge problems distributed over
different geographical sites?).
- The communication overhead for MPI executions over different
machines will probably be extremely dependent on the ssh settings
(the protocol used to communicate between two machines).
- Hyperthreading* is open for debate and speculation...
Critics will recommend not using it, as it may give a false impression of
having more resources than you actually have, adds to the overhead, and
requires specially designed code, etc. Others, like me, will argue that
having the hardware do some crude load balancing is better than leaving
it all to the OS.
- Overall I would not recommend MPI over different machines,
especially in a multi-user production environment. I am somewhat
surprised that you did not see any difference when using multiple
machines instead of one.
To Ron I have a comment:
I would not expect the big differences between FPI and MPI on a
workstation, but rather on a cluster.
The reason is that on a workstation the working directories are located
on the local disk for both MPI and FPI. So on the one hand FPI will have
slightly higher disk IO, while on the other MPI has an added overhead from
the MPI daemon. The big difference would be in a cluster environment: FPI
needs to use working directories located on a network drive, while MPI can
use local drives. Disk IO is distributed over the nodes when using MPI,
while it is centralized when using FPI. I speculate that the performance
of FPI will be strongly dependent on the IO of your cluster file system,
while the network latency will be the defining factor for MPI.
Kind regards,
Xavier
*Hyperthreading creates two virtual cores per physical core... an i7 with
HT appears to have 8 cores although it only has 4 physical ones.
From: [email protected] On Behalf Of Nick Holford
Sent: 21 May 2011 12:17
To: nmusers
Subject: Re: [NMusers] Simple parallel benchmark for Nonmem 7.2 with large Bayes problem
> [...]