RE: NONMEM/PsN benchmark for SGE expansion
Dear Julia,
I am very happy to see that Merck is leveraging parallelized computing so extensively.
We did some systematic testing on parallelized modeling and published this recently:
AAPS Journal 2011, DOI: http://dx.doi.org/10.1208/s12248-011-9258-9
http://www.springerlink.com/content/c215433172002281/
The results on the efficiency of parallelization were generally in good agreement with the testing that Chee and Bob presented for parallelized S-ADAPT earlier.
This was using the Importance Sampling EM algorithm (pmethod=4 in S-ADAPT;
equivalent to method IMPMAP in NONMEM).
For this example, parallelizing on 8 threads yielded a 6.9-fold faster estimation, and parallelizing on 48 threads yielded a 23-fold faster estimation. As the datasets had 48 subjects, each thread received one subject in the latter case, and about 50% of the estimation time was spent distributing data through the network.
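To put these numbers in perspective, parallel efficiency is simply the observed speedup divided by the thread count. A minimal sketch (the helper function is my own, not part of NONMEM or S-ADAPT):

```python
def parallel_efficiency(speedup: float, n_threads: int) -> float:
    """Parallel efficiency = observed speedup / number of threads."""
    return speedup / n_threads

# Figures from the benchmark above:
print(f"{parallel_efficiency(6.9, 8):.0%}")   # 8 threads  -> 86%
print(f"{parallel_efficiency(23, 48):.0%}")   # 48 threads -> 48%
```

The drop from 86% to 48% efficiency reflects the growing share of time spent on network communication as more threads are added.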
The benefit of parallelizing increases significantly:
1) If the algorithm has a large (>99%) parallelizable fraction that can be distributed among worker nodes. (IMPMAP is very well suited for this, MCMC is not; FOCE should have a smaller parallelizable fraction than IMPMAP.)
Example: A program with a 50% parallelizable fraction can only be accelerated to 2-fold the single-threaded speed, no matter how many cores one has.
2) If the dataset has many subjects. This is most critical for industry.
3) If the model is complex and requires differential equations. (Parallelizing a one-compartment model is unlikely to yield much benefit due to network traffic.)
4) Bootstrap analyses are ideal for distribution across the network. Each bootstrap run is best executed single-threaded, though, as one can parallelize with 100% efficiency across the 1000 bootstrap replicates.
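The effect of the parallelizable fraction in point 1 is Amdahl's law, which caps the achievable speedup by the serial remainder. A short sketch (the function name is my own):

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Amdahl's law: maximum speedup on n_cores when only
    parallel_fraction of the work can be distributed."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)

# 50% parallelizable: capped at 2-fold, regardless of core count.
print(round(amdahl_speedup(0.50, 1_000_000), 3))  # 2.0

# A >99% parallelizable algorithm scales much further on 48 cores:
print(round(amdahl_speedup(0.99, 48), 1))  # 32.7
```

This is why a highly parallelizable algorithm such as IMPMAP benefits far more from a large cluster than one with a substantial serial portion.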
Some additional thoughts:
a) The larger your cluster, the more important it is to invest in the model code and dataset debugger before a model is compiled, since you do not want to manually shut down the 2000 simultaneously running exe-files that might not have closed properly. This is one of the key reasons why we invested significant time in developing a free pre-processor for S-ADAPT.
b) If you have 2000 nodes, it may be worth considering launching jobs from several master nodes. You can run into trouble both with the available RAM and with the network traffic if everything needs to funnel through one master node, even if you use 4x InfiniBand networking, for example.
c) Creating a queuing system to prioritize jobs from different users and projects may help. Your computational chemistry group most likely has a system like this.
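As a toy illustration of point c), prioritizing jobs by project can be modeled with a priority queue; in practice SGE's own scheduling policies would handle this. The project names and priority scheme below are purely hypothetical:

```python
import heapq
import itertools

# Hypothetical priority scheme: lower number = scheduled first.
PROJECT_PRIORITY = {"regulatory": 0, "research": 1, "exploratory": 2}

_seq = itertools.count()  # tie-breaker: first-come, first-served

def submit(queue, project, user, job_id):
    """Push a job onto the priority queue."""
    heapq.heappush(queue, (PROJECT_PRIORITY[project], next(_seq),
                           project, user, job_id))

def next_job(queue):
    """Pop the highest-priority job."""
    _, _, project, user, job_id = heapq.heappop(queue)
    return project, user, job_id

queue = []
submit(queue, "exploratory", "alice", "run101")
submit(queue, "regulatory", "bob", "run7")
submit(queue, "research", "carol", "boot42")
print(next_job(queue))  # ('regulatory', 'bob', 'run7')
```

A real grid-engine setup would instead express such policies through queue and project configuration, but the ordering logic is the same.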
d) Saving and analyzing intermediate results is most critical for large parallelized jobs.
Hope this provides some useful ideas. Overall, I think that for (complex) models requiring differential equations, parallelization will decide whether a project is feasible in the time available. This is why we almost always parallelize.
Best wishes
Juergen
Jürgen B. Bulitta, Ph.D., Senior Scientist,
Ordway Research Institute,
150 New Scotland Avenue, Albany, NY 12208, USA
Phone: +1 (518) 641-6418, Fax: +1 (518) 641-6304
Email: [email protected]
http://www.ordwayresearch.org/profile_bulitta.html
Quoted reply history
From: [email protected] [mailto:[email protected]] On
Behalf Of Ivashina, Julia
Sent: Friday, March 25, 2011 11:43 AM
To: [email protected]
Subject: [NMusers] NONMEM/PsN benchmark for SGE expansion
Dear all,
We would like to benchmark our new SGE cluster, and would appreciate hearing from anyone who has performed a similar task and can share their findings.
We use NONMEM 7.1.2 with PsN 3.2.12 in two cluster environments.
Our older environment consists of 9 quad-core machines (about 40 work nodes, counting the head node), and the newer one of over 2000 work nodes with 512 CPUs each.
These are the questions we'd like to answer:
· What is a reasonable time one should expect to shave off by moving a PK/PD analysis from the smaller cluster to the bigger one?
· What type of analysis is the most sensitive to an increase in the number of work nodes?
· What should be the expected gain from increasing the number in -threads 50-fold?
· What parts of NONMEM/PsN are the most optimized for parallel execution?
· What are the scenarios where the gain from parallelization is the biggest?
The initial bootstrap test we've done showed some progress, although the model we chose did not run 50 times faster (2000/40 = 50). Some of the reasons: pre-processing (creation of the bootstrap samples), Fortran compiler work, and combining of the results are not spread across work nodes. Since the compute time for each job was small (5-10 seconds), the overhead of job submission was more significant.
We also use the vpc, npc, cdd, llp, sse and scm analyses, so we would like to get some ideas on the parallelization capability of these functions. Any benchmarking results or ideas that you can share are very much appreciated.
Thank you,
Julia