Issues with NONMEM 7.2/MPI
Hi all,
I am wondering whether any of you has seen an issue like this.
Occasionally, when we run parallel NONMEM 7.2 jobs, we receive messages in
stdout similar to the TIMEOUT/RESUBMITTING lines in the excerpt below:
iteration 9 OBJ= -905.810529070325 eff.= 1198. Smpl.= 3000. Fit.= 0.99553
Elapsed estimation time in seconds: 399.70
Last iteration, set up for variance assessment
TIMEOUT FROM WORKER5
RESUBMITTING JOB TO LOCAL
iteration 10 OBJ= -905.807559328067 eff.= 1216. Smpl.= 3000. Fit.= 0.99559
OPTIMIZATION NOT TESTED
Elapsed covariance time in seconds: 1306.23
The NONMEM run seems to have finished, but the MPI processes keep spinning and
the non-interactive job never terminates. The only correlation with this
behavior that we have found so far is the appearance of the above timeout
message.
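In case it helps anyone check their own runs, here is a minimal sketch of how a captured stdout could be scanned for this message so that affected jobs can be flagged. Only the message text is taken from the output above; the function name find_worker_timeouts and the assumption that stdout has been saved as plain text are purely illustrative.

```python
import re

# Matches the manager-side message seen in the stdout excerpt above,
# e.g. "TIMEOUT FROM WORKER5"; the worker number is captured.
TIMEOUT_RE = re.compile(r"TIMEOUT FROM WORKER\s*(\d+)")


def find_worker_timeouts(log_text):
    """Return (line_number, worker_id) for every timeout message in log_text.

    Hypothetical helper: it only inspects text that has already been
    captured and does not talk to NONMEM or MPI in any way.
    """
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append((lineno, int(match.group(1))))
    return hits


if __name__ == "__main__":
    # Excerpt of the stdout shown above, used as sample input.
    sample = (
        "Last iteration, set up for variance assessment\n"
        "TIMEOUT FROM WORKER5\n"
        "RESUBMITTING JOB TO LOCAL\n"
    )
    for lineno, worker in find_worker_timeouts(sample):
        print(f"line {lineno}: timeout reported for worker {worker}")
```

A run whose stdout yields a non-empty list would be a candidate for the hanging behavior described here; whether the message is actually the cause remains an open question.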
We are using "PARSE_TYPE=4 PARSE_NUM=20 TIMEOUTI=100 TIMEOUT=10" in the pnm
file.
Has anyone seen the above behavior? My original guess was that this would
occur if one of the nodes crashed, so that NONMEM could no longer communicate
with the MPI threads running on that node. In the case above, however, all 10
threads were running on the same node. The other guess was that one thread took
longer than the TIMEOUT value (10 min), but every thread finished in under
3 min.
Our system: RHEL5, mpich2-1.4.1p1, Intel Fortran 11.
Many thanks for any insight.
Michael.