Dear all,
We would like to benchmark our new SGE cluster, and appreciate anyone who has
performed a similar task and can share the findings.
We use NONMEM 7.1.2 with PsN 3.2.12 in two cluster environments.
Our older environment consists of 9 quad-core machines (about 40 work nodes,
counting the head node), and the newer one has over 2000 work nodes (512 CPUs each).
These are the questions we'd like to answer:
* What is a reasonable time one should expect to shave off by moving
PK/PD analysis from the smaller cluster to the bigger one?
* What type of analysis is the most sensitive to an increase in number of
work nodes?
* What should be the expected gain from increasing the number in -threads
50 times?
* What parts of NONMEM/PsN are the most optimized for parallel execution?
* What are the scenarios where gain from parallelization is the biggest?
The initial bootstrap test we've done showed some speedup, although the model
we chose did not run 50 times faster (2000/40 = 50).
Some of the reasons: pre-processing (creation of bootstrap samples), Fortran
compilation, and combining of the results are not spread across work nodes.
Since the compute time for each job was small (5-10 seconds), the
overhead of job submission was more significant.
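For intuition, the shortfall can be sketched with a toy model (our own illustration, not taken from PsN or SGE; the wave-based scheduling assumption and all numbers are hypothetical): serial PsN work plus many short jobs, each paying a fixed submission overhead, quickly caps the achievable speedup.

```python
import math

# Toy model (illustrative, not measured): a run consists of serial PsN work
# (sample creation + result collection) plus n_jobs NONMEM runs executed in
# "waves" across the available nodes, each job paying a fixed submission
# overhead on the grid.

def wall_time(n_jobs, job_secs, nodes, submit_secs, serial_secs):
    waves = math.ceil(n_jobs / nodes)
    return serial_secs + waves * (job_secs + submit_secs)

# 1000 bootstrap samples, 7 s per run, 2 s submission overhead per job,
# 60 s of serial pre-/post-processing (all numbers hypothetical).
t_40 = wall_time(1000, 7, 40, 2, 60)      # 60 + 25 * 9 = 285 s
t_2000 = wall_time(1000, 7, 2000, 2, 60)  # 60 + 1 * 9 = 69 s
print(round(t_40 / t_2000, 2))  # ~4.13x, far below the naive 2000/40 = 50x
```

With longer-running jobs the per-job overhead and serial fraction shrink in relative terms, which is why the larger cluster pays off mainly for CPU-intensive runs.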
We also use the vpc, npc, cdd, llp, sse and scm analyses, so we would like to
get some ideas on the parallelization capability of these functions. Any
benchmarking results or ideas that you can share are very much appreciated.
Thank you,
Julia
Notice: This e-mail message, together with any attachments, contains
information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
New Jersey, USA 08889), and/or its affiliates (direct contact information
for affiliates is available at
http://www.merck.com/contact/contacts.html) that may be confidential,
proprietary, copyrighted and/or legally privileged. It is intended solely
for the use of the individual or entity named on this message. If you are
not the intended recipient, and have received this message in error,
please notify us immediately by reply e-mail and then delete it from
your system.
NONMEM/PsN benchmark for SGE expansion (4 messages, 3 people; latest: Apr 01, 2011)
Hi Julia,
On 3/25/2011 4:42 PM, Ivashina, Julia wrote:
> * What type of analysis is the most sensitive to an increase in
> number of work nodes?
As far as I know, the version of nonmem that you have installed does not support parallel processing. Legend has it that this feature will show up in the next version of nonmem (a beta version is said to exist... ;-) ).
> * What should be the expected gain from increasing the number in
> -threads 50 times?
As you've experienced, this depends on a number of factors; your distributed computing facility will bring the largest benefits for very CPU-intensive jobs. These could be models with very large datasets or PKPD models relying on numerical integration (ADVAN6, ADVAN8, ADVAN13).
Smaller runs will spend a comparatively higher proportion of the runtime compiling nonmem and pushing data across the network. The way you've set up the grid file system plays a role here too. I'm not too surprised that you don't see much benefit when runtimes are as short as 5-10 seconds.
> * What parts of NONMEM/PsN are the most optimized for parallel
> execution?
Bootstrap and VPC/NPC are the scripts that are most suited for running on a grid. Use the command line parameters -samples and -threads with the bootstrap command, and -samples and -n_simulation_models (and possibly -threads) with VPC/NPC.
See: http://psn.sourceforge.net/pdfdocs/npc_vpc_userguide.pdf and http://psn.sourceforge.net/pdfdocs/bootstrap_userguide.pdf
> * What are the scenarios where gain from parallelization is the biggest?
In my experience, your gains will be biggest when executing many long-running jobs (e.g. a complete model history, a bootstrap, or an NPC/VPC with the -n_simulation_models parameter).
Kind regards,
--
Paul Matthias Diderichsen, PhD
Quantitative Solutions B.V.
+31 624 330 706
Dear Julia,
I am very happy to see that Merck is leveraging parallelized computing so
significantly.
We did some systematic testing on parallelized modeling and published this
recently.
AAPS Journal 2011, DOI: http://dx.doi.org/10.1208/s12248-011-9258-9
http://www.springerlink.com/content/c215433172002281/
The results on the efficiency of parallelization were generally in good
agreement with the testing that Chee and Bob presented for parallelized
S-ADAPT earlier. This was using the Importance Sampling EM algorithm
(pmethod=4 in S-ADAPT; equivalent to method IMPMAP in NONMEM).
For this example, parallelizing on 8 threads yielded a 6.9 times faster
estimation, and parallelizing on 48 threads yielded a 23 times faster
estimation. As the datasets had 48 subjects, each thread received one subject
in the latter case, and about 50% of the estimation time was spent
distributing data through the network.
The benefit of parallelizing increases significantly:
1) If the algorithm has a large (>99%) parallelizable fraction that can be
distributed among worker nodes. (IMPMAP is very well suited for this, MCMC is
not; FOCE should have a smaller parallelizable fraction than IMPMAP.)
Example: a program with a 50% parallelizable fraction can only be accelerated
to 2-fold the single-threaded speed, no matter how many cores one has.
2) If the dataset has many subjects. This is most critical for industry.
3) If the model is complex and requires differential equations.
(Parallelizing a one-compartment model is unlikely to yield much benefit due
to network traffic.)
4) Bootstrap analyses are ideal to be distributed on the network. Bootstrap
runs are best run single-threaded, though, as one can parallelize with 100%
efficiency across the 1000 bootstrap replicates.
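The cap in the 50% example is Amdahl's law. As a sketch (the formula is standard and not specific to NONMEM or S-ADAPT; the implied-fraction calculation is our own back-of-envelope reading of the 23x-on-48-threads result above):

```python
# Amdahl's law: speedup on n threads when a fraction p of the work
# parallelizes perfectly and the remainder stays serial.

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

print(round(amdahl_speedup(0.50, 10**6), 2))  # ~2.0: 50% parallel caps at 2x
print(round(amdahl_speedup(0.99, 48), 1))     # ~32.7 on 48 threads

# Inverting the formula for the 23x speedup observed on 48 threads gives the
# implied parallelizable fraction of that workload (ignoring network costs):
p_implied = (1 - 1 / 23) / (1 - 1 / 48)
print(round(p_implied, 3))  # ~0.977
```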
Some additional thoughts:
a) The larger your cluster, the more important it is to invest in the model
code and dataset debugger before a model is compiled, since you do not want
to manually shut down the 2000 simultaneously running exe-files that might
not have closed properly. This is one of the key reasons why we invested
significant time in developing a free pre-processor for S-ADAPT.
b) If you have 2000 nodes, it may be worth considering launching jobs from
several master nodes. You can run into trouble both with the available RAM
and with the network traffic if everything needs to funnel through one master
node, even if you use 4x InfiniBand networking, for example.
c) Creating a queuing system to prioritize jobs from different users and
projects may help. Your computational chemistry group must have a system like
this.
d) Saving and analyzing intermediary results is most critical for large
parallelized jobs.
Hope this provides some useful ideas. Overall, I think for (complex) models
that require differential equations, parallelizing will decide whether a
project is feasible or not in the time available. This is why we almost
always parallelize.
Best wishes
Juergen
Jürgen B. Bulitta, Ph.D., Senior Scientist,
Ordway Research Institute,
150 New Scotland Avenue, Albany, NY 12208, USA
Phone: +1 (518) 641-6418, Fax: +1 (518) 641-6304
Email: [email protected]
http://www.ordwayresearch.org/profile_bulitta.html
From: [email protected] [mailto:[email protected]] On
Behalf Of Ivashina, Julia
Sent: Friday, March 25, 2011 11:43 AM
To: [email protected]
Subject: [NMusers] NONMEM/PsN benchmark for SGE expansion
Hi Jurgen (and nmusers),
A short comment on your additional thought b:
On 4/1/2011 8:33 AM, Jurgen Bulitta wrote:
> b) If you have 2000 nodes, it may be worth to consider launching jobs
> from several
> master nodes. You can run into trouble both with the available RAM and
> with the
> network traffic, if everything needs to funnel through one master node,
> even if you
> use 4x Infiniband networking, for example.
I think this is primarily a consideration for a cluster (based on MOSIX) rather than for a grid (based on OGE/SGE). On MOSIX, every process is started on the/a master node and then migrates away to the "best" (in terms of resources) computing node. On OGE, job specifications are stored in a queue and started directly on the best computing node.
Of course, the master in an OGE grid should be reliable (consider one or several shadow masters), but to my understanding, RAM and network bandwidth are less of a bottleneck on OGE than on MOSIX. (For that, MOSIX has other advantages...)
Kind regards,
--
Paul Matthias Diderichsen, PhD
Quantitative Solutions B.V.
+31 624 330 706