RE: Simple parallel benchmark for Nonmem 7.2 with large Bayes problem
Hi Nick and Ron,
During the beta testing done by Exprimo it became clear that speed might
not be the best heuristic for judging 7.2.
First, there was a tradeoff between consistency and speed in the choice
of compiler options, as some options produced different results between
the MPI and non-MPI runs.
An added benefit is that with these options NONMEM 7.1 and NONMEM 7.2
produced IDENTICAL results.
The "default" install of NONMEM has, as far as I know, rarely (if ever)
produced identical results across version numbers.
Putting the lingering issue of cross-version consistency to rest is, in
my opinion, well worth waiting a few extra seconds.
Second, NONMEM 7.2 includes a semi-dynamic resizing step.
This step forces NONMEM to (re)compile a bigger part of the code before
the actual NONMEM execution can take place.
Here flexibility prevailed over speed. I certainly will not miss having
to tweak the SIZES file and keeping reg, big, huge, etc... installations
of NONMEM.
Finally, the speed increase of an MPI run is so dramatic that comparing
speed is moot for all but the simplest models with limited data.
For those the -trskip and -prskip options may help. (Untested)
To add to the overall discussion on parallel NONMEM:
- FPI should be avoided at all costs. Sharing data using files
is far from efficient (Maybe for huge problems distributed over
different geographical sites?)
- The communication overhead for MPI executions across different
machines will probably be extremely dependent on the ssh settings
(the protocol used to communicate between two machines).
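To make the ssh-settings point concrete, here is a hypothetical sketch of the kind of OpenSSH client configuration that affects launch latency when MPI starts remote workers over ssh. The host names (node01, node02) and the machinefile layout are illustrative assumptions, not taken from any NONMEM installation; the ssh options themselves are standard OpenSSH directives.

```shell
# ~/.ssh/config -- reuse one TCP connection for repeated remote launches
# (hypothetical node names; adapt to your cluster)
Host node01 node02
    Compression no            # on a local LAN, compression only adds CPU overhead
    ControlMaster auto        # multiplex sessions over a single connection
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m        # keep the master connection alive between launches

# MPICH-style machinefile: one line per host, slots per physical core
# node01:4
# node02:4
```

Connection multiplexing (ControlMaster/ControlPersist) avoids a full ssh handshake per remote process, which is exactly the per-launch overhead that can dominate short runs across machines.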
- Hyperthreading* is open for debate and speculation...
Critics will recommend not using it, as it may give a false impression
of having more resources than you actually have, adds to the overhead,
requires specially designed code, etc... Others -like me- will argue
that having the hardware do some crude load balancing is better than
leaving it all to the OS.
- Overall I would not recommend MPI over different machines,
especially in a multi-user production environment. I am somewhat
surprised that you did not see any difference when using multiple
machines instead of one.
I have one comment for Ron:
I would not expect big differences between FPI and MPI on a
workstation, but rather on a cluster.
The reason is that on a workstation the working directories are located
on the local disk for both MPI and FPI. So on the one hand FPI will have
slightly higher disk IO, while MPI has an added overhead from the MPI
daemon. The big difference would be in a cluster environment: FPI needs
to use working directories located on a network drive, while MPI can use
local drives. Disk IO is distributed over the nodes when using MPI,
while it is centralized when using FPI. I speculate that the performance
of FPI will be strongly dependent on the IO of your cluster file system,
while network latency will be the defining factor for MPI.
Kind regards,
Xavier
*Hyperthreading creates two virtual cores per physical one... an i7
with HT seems to have 8 cores although it only has 4 physical.
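As a small illustration of the footnote above, here is a hedged Python sketch that contrasts the logical core count (which includes hyperthreads) with the physical core count on Linux. It assumes /proc/cpuinfo is available and falls back to the logical count otherwise; the function name is my own, not part of any standard library.

```python
import os

def physical_core_count():
    """Count unique (physical id, core id) pairs from /proc/cpuinfo.

    On a hyperthreaded CPU this is smaller than os.cpu_count(), which
    counts logical processors. Falls back to os.cpu_count() when the
    file is missing or lacks topology fields (e.g. in some containers).
    """
    try:
        cores = set()
        phys = None
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("physical id"):
                    phys = line.split(":")[1].strip()
                elif line.startswith("core id"):
                    # "core id" follows "physical id" within each processor block
                    cores.add((phys, line.split(":")[1].strip()))
        return len(cores) or os.cpu_count()
    except OSError:
        return os.cpu_count()

logical = os.cpu_count()           # counts hyperthreads as cores
physical = physical_core_count()   # e.g. 4 on an i7 that reports 8 logical
print(f"logical: {logical}, physical: {physical}")
```

On an HT-enabled i7 of the era discussed in the thread, this would typically print twice as many logical as physical cores; the OS scheduler sees only the logical count, which is why benchmarking past the physical core count can be misleading.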
Quoted reply history
From: [email protected] [mailto:[email protected]]
On Behalf Of Nick Holford
Sent: 21 May 2011 12:17
To: nmusers
Subject: Re: [NMusers] Simple parallel benchmark for Nonmem 7.2 with
large Bayes problem
Ron,
I haven't had a chance to try out the final NONMEM 7.2 release.
Did you compare NONMEM 7.2 run times with NONMEM 7.1 and NONMEM 6
without parallelization?
I found NONMEM 6 was the fastest and NONMEM 7.2 (beta) was slower than
NONMEM 7.1 with single core runs (WinSvr2003, Intel 11.1, 8 Intel
cores)
Nick
On 21/05/2011 12:03 p.m., Ron Keizer wrote:
Dieter,
the observation that the 8-core run is slower than the 4-core run is
probably not due to CPU hyperthreading, as you suggest. The CPU loads
that you report also suggest otherwise. I agree with Mark that it is
more likely due to the short time per iteration, i.e. the relatively
high amount of overhead compared to the actual calculations. We noticed
the same when using FPI. Use MPI or test a slower model and this effect
will probably disappear.
We also did some benchmarking, and noticed that NM7.2 can do pretty
efficient parallelization. Our conclusions:
- MPI is much more efficient than FPI, especially for faster problems
- The efficiency with MPI seems to hold across estimation methods (FOCE
/ BAYES / SAEM) and models (8 tested), around 90% when using 5 cores.
See results below.
- Parallelization efficiency depends on e.g. time per iteration,
transfer type, number of individuals in dataset.
- Parallelization (MPI) was still efficient at higher numbers of cores.
We tested up to 7 cores on 1 machine. In some basic tests, performance
over network nodes seemed as good as when running on a single machine,
although fair benchmarking is difficult on a production cluster.
We tested using the gfortran compiler, on a dedicated 8-core machine
running Linux.
best regards,
Ron
--
Nick Holford, Professor Clinical Pharmacology
Dept Pharmacology & Clinical Pharmacology
University of Auckland,85 Park Rd,Private Bag 92019,Auckland,New Zealand
tel:+64(9)923-6730 fax:+64(9)373-7090 mobile:+64(21)46 23 53
email: [email protected]
http://www.fmhs.auckland.ac.nz/sms/pharmacology/holford