Issues with NONMEM 7.2/MPI

3 messages 2 people Latest: Nov 06, 2013

Issues with NONMEM 7.2/MPI

From: Michael Mayer Date: November 04, 2013 technical

Hi all,

just wondering if any of you has seen an issue like this. Occasionally, when we run parallel NONMEM 7.2 jobs, we receive messages in stdout similar to the "TIMEOUT FROM WORKER5" and "RESUBMITTING JOB TO LOCAL" lines in the excerpt below:

    iteration 9 OBJ= -905.810529070325 eff.= 1198. Smpl.= 3000. Fit.= 0.99553
    Elapsed estimation time in seconds: 399.70
    Last iteration, set up for variance assessment
    TIMEOUT FROM WORKER5
    RESUBMITTING JOB TO LOCAL
    iteration 10 OBJ= -905.807559328067 eff.= 1216. Smpl.= 3000. Fit.= 0.99559
    OPTIMIZATION NOT TESTED
    Elapsed covariance time in seconds: 1306.23

The NONMEM run seems to be finished, but the MPI processes keep spinning and the non-interactive job never terminates. The only correlation with this behavior we have found so far is the appearance of the above message.

We are using "PARSE_TYPE=4 PARSE_NUM=20 TIMEOUTI=100 TIMEOUT=10" in the pnm file.

Has anyone seen the above behavior? My original guess was that this would occur if one of the nodes crashed, so that NONMEM could no longer communicate with the MPI threads running on that node. In the case above, however, all 10 threads were running on the same node. The other guess was that one thread took longer than the TIMEOUT value (10 mins), but every thread took less than 3 mins.

Our system: RHEL5, mpich2-1.4.1p1, Intel Fortran 11.

Many thanks for any insight.

Michael.
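To check which captured runs contain the message quoted above, a minimal sketch along these lines may help. It assumes the stdout of a run has been saved as text; the function name find_worker_timeouts is hypothetical and not part of NONMEM.

```python
import re

# Matches the message quoted in the post, e.g. "TIMEOUT FROM WORKER5".
TIMEOUT_RE = re.compile(r"TIMEOUT FROM WORKER\s*(\d+)")


def find_worker_timeouts(log_text):
    """Return the worker numbers that reported a timeout, in order of appearance."""
    return [int(match.group(1)) for match in TIMEOUT_RE.finditer(log_text)]


# Excerpt copied from the stdout shown in the post.
sample = """\
Last iteration, set up for variance assessment
TIMEOUT FROM WORKER5
RESUBMITTING JOB TO LOCAL
OPTIMIZATION NOT TESTED
"""

print(find_worker_timeouts(sample))
```

An empty list means the captured output contains no such message.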

Re: Issues with NONMEM 7.2/MPI

From: Ekaterina Gibiansky Date: November 04, 2013 technical

Hi Michael,

We had this happening and, following a conversation with Bob Bauer, increased TIMEOUT. According to him:

"TIMEOUT=10 means it waits up to 10 minutes for a worker to finish its part of the OBJ function call. You seem to have a very tough job, so set the TIMEOUT to something bigger. It does not hurt to set TIMEOUT to 1000000 or something really big. There is no inefficiency from it. The manager continues polling the worker until the worker says it is done, anyway. It does mean that if there was some difficulty with a worker computer, and your TIMEOUT is very long, then the manager waits a very long time before it decides to reroute the work."

Regards,
Katya

Ekaterina Gibiansky, Ph.D.
CEO&CSO, QuantPharm LLC
Web: www.quantpharm.com
Email: EGibiansky at quantpharm.com
Tel: (301)-717-7032
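The trade-off described in the quote can be illustrated with a toy model. This is only a sketch of a generic poll-until-done loop with a timeout, not NONMEM's actual implementation; the names FakeClock, wait_for_worker and simulate are hypothetical, and a simulated clock (in minutes) is used so the example runs instantly.

```python
class FakeClock:
    """Simulated clock so the example does not really wait (units: minutes)."""

    def __init__(self):
        self.now = 0.0

    def sleep(self, minutes):
        self.now += minutes


def wait_for_worker(is_done, timeout, clock, poll_interval=0.5):
    """Poll is_done() until it returns True or `timeout` minutes have elapsed.

    Returns (finished, minutes_waited). finished=False stands for the point
    where a manager would give up on the worker and reroute the work.
    """
    start = clock.now
    while True:
        if is_done():
            return True, clock.now - start
        if clock.now - start >= timeout:
            return False, clock.now - start
        clock.sleep(poll_interval)


def simulate(finish_at, timeout):
    """Run one wait against a worker that finishes at `finish_at` (None = never)."""
    clock = FakeClock()
    is_done = lambda: finish_at is not None and clock.now >= finish_at
    return wait_for_worker(is_done, timeout, clock)


# A healthy worker that needs 3 minutes: the wait is the same for any
# timeout larger than 3, so a very large value costs nothing here.
print(simulate(finish_at=3.0, timeout=10))
print(simulate(finish_at=3.0, timeout=1_000_000))

# A worker that never answers: the manager waits the full timeout
# before it can reroute, which is the cost of a very large value.
print(simulate(finish_at=None, timeout=10))
```

In this model the timeout only matters when a worker fails to report back, which matches the reasoning in the quote.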


RE: Issues with NONMEM 7.2/MPI

From: Michael Mayer Date: November 06, 2013 technical

Hi Katya,

I just tested your suggested solution and it works. For now we will increase the timeout from 10 to 100 and keep monitoring how that goes with our jobs.

BTW: The timeout only seems to occur when we run this control stream with 10 cores. With 20 cores, no timeout is seen even with TIMEOUT=10. In this case I assume the workload is better distributed, although total compute time is only reduced by 35 % instead of the 50 % one would expect for "ideal" parallelization efficiency.

Many thanks for your help.

Michael.
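The 35 % versus 50 % figures can be turned into a relative speedup and efficiency for the step from 10 to 20 cores. The following is a small worked calculation using only the two percentages from the message; the function name relative_speedup is made up for illustration.

```python
def relative_speedup(time_reduction):
    """Speedup implied by a fractional reduction in run time (0.35 means 35 %)."""
    return 1.0 / (1.0 - time_reduction)


observed = relative_speedup(0.35)  # measured going from 10 to 20 cores
ideal = relative_speedup(0.50)     # perfect scaling when doubling the cores
efficiency = observed / ideal      # fraction of the ideal gain achieved

print(f"observed speedup: {observed:.2f}x")
print(f"ideal speedup:    {ideal:.2f}x")
print(f"relative efficiency of doubling: {efficiency:.0%}")
```

So doubling the cores gave roughly a 1.54x speedup where 2x would be ideal, or about 77 % of the ideal gain.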

