Dear all,
We would like to benchmark our new SGE cluster, and appreciate anyone who has
performed a similar task and can share the findings.
We use NONMEM 7.1.2 with PsN 3.2.12 in two cluster environments.
Our older environment consists of 9 quad-core machines (about 40 work nodes,
counting the head node), and the newer one has over 2000 work nodes (512 CPUs each).
These are the questions we'd like to answer:
* What is a reasonable time one should expect to shave off by moving
PK/PD analysis from the smaller cluster to the bigger one?
* What type of analysis is the most sensitive to an increase in number of
work nodes?
* What should be the expected gain from increasing the number in -threads
50 times?
* What parts of NONMEM/PsN are the most optimized for parallel execution?
* What are the scenarios where gain from parallelization is the biggest?
The initial bootstrap test we've done showed some speedup, although the model
we chose did not run 50 times faster (2000/40 = 50).
Some of the reasons: pre-processing (creation of bootstrap samples), Fortran
compilation, and combining of the results are not spread across work nodes.
Since the compute time for each job was small (5-10 seconds), the
overhead of job submission was more significant.
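For intuition, the shortfall can be sketched with a toy model (our own illustration, not taken from PsN or SGE; the wave-based scheduling assumption and all numbers are hypothetical): serial PsN work plus many short jobs, each paying a fixed submission overhead, quickly caps the achievable speedup.

```python
import math

# Toy model (illustrative, not measured): a run consists of serial PsN work
# (sample creation + result collection) plus n_jobs NONMEM runs executed in
# "waves" across the available nodes, each job paying a fixed submission
# overhead on the grid.

def wall_time(n_jobs, job_secs, nodes, submit_secs, serial_secs):
    waves = math.ceil(n_jobs / nodes)
    return serial_secs + waves * (job_secs + submit_secs)

# 1000 bootstrap samples, 7 s per run, 2 s submission overhead per job,
# 60 s of serial pre-/post-processing (all numbers hypothetical).
t_40 = wall_time(1000, 7, 40, 2, 60)      # 60 + 25 * 9 = 285 s
t_2000 = wall_time(1000, 7, 2000, 2, 60)  # 60 + 1 * 9 = 69 s
print(round(t_40 / t_2000, 2))  # ~4.13x, far below the naive 2000/40 = 50x
```

With longer-running jobs the per-job overhead and serial fraction shrink in relative terms, which is why the larger cluster pays off mainly for CPU-intensive runs.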
We also use the vpc, npc, cdd, llp, sse and scm analyses, so we would like to
get some ideas on the parallelization capability of these functions. Any
benchmarking results or ideas that you can share are very much appreciated.
Thank you,
Julia
Notice: This e-mail message, together with any attachments, contains
information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
New Jersey, USA 08889), and/or its affiliates (direct contact information
for affiliates is available at
http://www.merck.com/contact/contacts.html) that may be confidential,
proprietary, copyrighted and/or legally privileged. It is intended solely
for the use of the individual or entity named on this message. If you are
not the intended recipient, and have received this message in error,
please notify us immediately by reply e-mail and then delete it from
your system.
NONMEM/PsN benchmark for SGE expansion (4 messages, 3 people; latest: Apr 01, 2011)
Hi Julia,
On 3/25/2011 4:42 PM, Ivashina, Julia wrote:
> * What type of analysis is the most sensitive to an increase in
> number of work nodes?
As far as I know, the version of nonmem that you have installed does not support parallel processing. Legend has it that this feature will show up in the next version of nonmem (a beta version is said to exist... ;-) ).
> * What should be the expected gain from increasing the number in
> -threads 50 times?
As you've experienced, this depends on a number of factors; your distributed computing facility will bring the largest benefits for very CPU-intensive jobs. These could be models with very large datasets or PKPD models relying on numerical integration (ADVAN6, ADVAN8, ADVAN13).
Smaller runs will spend a comparatively higher proportion of the runtime compiling nonmem and pushing data across the network. The way you've set up the grid file system plays a role here too. I'm not too surprised that you don't see much benefit when runtimes are as short as 5-10 seconds.
> * What parts of NONMEM/PsN are the most optimized for parallel
> execution?
Bootstrap and VPC/NPC are the scripts that are most suited for running on a grid. Use the command line parameters -samples and -threads with the bootstrap command, and -samples and -n_simulation_models (and possibly -threads) with VPC/NPC.
See: http://psn.sourceforge.net/pdfdocs/npc_vpc_userguide.pdf and http://psn.sourceforge.net/pdfdocs/bootstrap_userguide.pdf
> * What are the scenarios where gain from parallelization is the biggest?
In my experience, your gains will be biggest when executing many long-running jobs (e.g. a complete model history, a bootstrap, or an NPC/VPC with the -n_simulation_models parameter).
Kind regards,
--
Paul Matthias Diderichsen, PhD
Quantitative Solutions B.V.
+31 624 330 706
Dear Julia,
I am very happy to see that Merck is leveraging parallelized computing so
significantly.
We did some systematic testing on parallelized modeling and published this
recently.
AAPS Journal 2011, DOI: http://dx.doi.org/10.1208/s12248-011-9258-9
http://www.springerlink.com/content/c215433172002281/
The results on the efficiency of parallelization were generally in good
agreement with the testing that Chee and Bob presented for parallelized
S-ADAPT earlier. This was using the Importance Sampling EM algorithm
(pmethod=4 in S-ADAPT; equivalent to method IMPMAP in NONMEM).
For this example, parallelizing on 8 threads yielded a 6.9 times faster
estimation, and parallelizing on 48 threads yielded a 23 times faster
estimation. As the datasets had 48 subjects, each thread received one subject
in the latter case, and about 50% of the estimation time was spent
distributing data through the network.
The benefit of parallelizing increases significantly:
1) If the algorithm has a large (>99%) parallelizable fraction that can be
distributed among worker nodes. (IMPMAP is very well suited for this, MCMC is
not; FOCE should have a smaller parallelizable fraction than IMPMAP.)
Example: a program with a 50% parallelizable fraction can only be accelerated
to 2-fold the single-threaded speed, no matter how many cores one has.
2) If the dataset has many subjects. This is most critical for industry.
3) If the model is complex and requires differential equations.
(Parallelizing a one-compartment model is unlikely to yield much benefit due
to network traffic.)
4) Bootstrap analyses are ideal to be distributed on the network. Bootstrap
runs are best run single-threaded, though, as one can parallelize with 100%
efficiency across the 1000 bootstrap replicates.
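The cap in the 50% example is Amdahl's law. As a sketch (the formula is standard and not specific to NONMEM or S-ADAPT; the implied-fraction calculation is our own back-of-envelope reading of the 23x-on-48-threads result above):

```python
# Amdahl's law: speedup on n threads when a fraction p of the work
# parallelizes perfectly and the remainder stays serial.

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

print(round(amdahl_speedup(0.50, 10**6), 2))  # ~2.0: 50% parallel caps at 2x
print(round(amdahl_speedup(0.99, 48), 1))     # ~32.7 on 48 threads

# Inverting the formula for the 23x speedup observed on 48 threads gives the
# implied parallelizable fraction of that workload (ignoring network costs):
p_implied = (1 - 1 / 23) / (1 - 1 / 48)
print(round(p_implied, 3))  # ~0.977
```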
Some additional thoughts:
a) The larger your cluster, the more important it is to invest in the model
code and dataset debugger before a model is compiled, since you do not want
to manually shut down the 2000 simultaneously running exe-files that might
not have closed properly. This is one of the key reasons why we invested
significant time in developing a free pre-processor for S-ADAPT.
b) If you have 2000 nodes, it may be worth considering launching jobs from
several master nodes. You can run into trouble both with the available RAM
and with the network traffic if everything needs to funnel through one master
node, even if you use 4x InfiniBand networking, for example.
c) Creating a queuing system to prioritize jobs from different users and
projects may help. Your computational chemistry group must have a system like
this.
d) Saving and analyzing intermediary results is most critical for large
parallelized jobs.
Hope this provides some useful ideas. Overall, I think for (complex) models
that require differential equations, parallelizing will decide whether a
project is feasible or not in the time available. This is why we almost
always parallelize.
Best wishes
Juergen
Jürgen B. Bulitta, Ph.D., Senior Scientist,
Ordway Research Institute,
150 New Scotland Avenue, Albany, NY 12208, USA
Phone: +1 (518) 641-6418, Fax: +1 (518) 641-6304
Email: [email protected]
http://www.ordwayresearch.org/profile_bulitta.html
From: [email protected] [mailto:[email protected]] On
Behalf Of Ivashina, Julia
Sent: Friday, March 25, 2011 11:43 AM
To: [email protected]
Subject: [NMusers] NONMEM/PsN benchmark for SGE expansion
Hi Jurgen (and nmusers),
A short comment on your additional thought b:
On 4/1/2011 8:33 AM, Jurgen Bulitta wrote:
> b) If you have 2000 nodes, it may be worth to consider launching jobs
> from several
> master nodes. You can run into trouble both with the available RAM and
> with the
> network traffic, if everything needs to funnel through one master node,
> even if you
> use 4x Infiniband networking, for example.
I think this is primarily a consideration for a cluster (based on MOSIX) rather than for a grid (based on OGE/SGE). On MOSIX, every process is started on the/a master node and then migrates away to the "best" (in terms of resources) computing node. On OGE, job specifications are stored in a queue and started directly on the best computing node.
Of course, the master in an OGE grid should be reliable (consider one or several shadow masters), but to my understanding, RAM and network bandwidth are less of a bottleneck on OGE than on MOSIX. (For that, MOSIX has other advantages...)
Kind regards,
--
Paul Matthias Diderichsen, PhD
Quantitative Solutions B.V.
+31 624 330 706