Parallel benchmarks
Nick,

I'm a little surprised at your choice of benchmarks for parallel NONMEM. The original motivation for this was a run that was taking 2 to 3 weeks, and we thought it would be nice if we could get that down to 2 or 3 days (the best we had at the time was an 8-processor server). You are correct: if your benchmark for parallel NONMEM is getting a 2-minute run down to 1 minute, you'll be disappointed; it will indeed be "paralyzed". I/O time across multiple computers in our development hardware (using FPI) is on the order of 0.1 seconds per machine, less with MPI, and less still on a single computer. I maintain that parallel execution is helpful if the ratio of function-evaluation time to I/O time is at least 2; real benefit occurs when this ratio is > 10. On our development hardware this works out to about a 15-minute run time for a 3000-function-evaluation model.

Specifically: assume the CPU time for OBJ evaluation (on one "core") is Tcpu, the I/O time (time to send the data out, read it on the other end, plus any "wait state" used in the FPI method) is IO, and n is the number of processes. Then the total time (TT) for a function evaluation is

  TT = Tcpu/n + IO*n

That is, the actual computation time is split between n processes, but the I/O time goes up with the number of processes - you need to send the data to multiple machines/processes, and we assume this cannot be parallelized (actually it can, sort of; that is mostly a hardware issue). This all assumes perfect load balancing.

To minimize this time, take the derivative:

  dTT/dn = -Tcpu*n^-2 + IO

Setting this to zero (to find the minimum):

  Tcpu*n^-2 = IO

and solving for n:

  n = sqrt(Tcpu/IO)

This relationship has held up quite well in our benchmarks.

Mark

Mark Sale MD
President, Next Level Solutions, LLC
www.NextLevelSolns.com
919-846-9185
A carbon-neutral company
See our real-time solar energy production at:
http://enlighten.enphaseenergy.com/public/systems/aSDz2458
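The timing model in Mark's post can be sketched in a few lines of code. This is illustrative only, not NONMEM code: the function names and the 10-second Tcpu value are assumptions; only the 0.1 s per-machine I/O figure comes from the post.

```python
# Sketch of the timing model above:
#   TT(n) = Tcpu/n + IO*n   -> minimized at n_opt = sqrt(Tcpu/IO)
import math

def total_time(tcpu, io, n):
    """Total wall time for one objective-function evaluation split over n processes."""
    return tcpu / n + io * n

def optimal_processes(tcpu, io):
    """Process count minimizing total_time, from setting dTT/dn = 0."""
    return math.sqrt(tcpu / io)

# IO = 0.1 s per machine is the FPI figure quoted above; Tcpu is a made-up example.
tcpu, io = 10.0, 0.1
n_opt = optimal_processes(tcpu, io)
print(n_opt)                               # -> 10.0
print(total_time(tcpu, io, round(n_opt)))  # -> 2.0
```

With these numbers the model says ten processes is the sweet spot; adding more would let the IO*n term dominate.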
-------- Original Message --------
Subject: Re: [NMusers] MPI installation on Win 7/ 64 bit
From: Nick Holford <[email protected]>
Date: Tue, May 24, 2011 3:38 am
To: nmusers <[email protected]>
Rik,

The speed differences I noted between NM6, NM7.1 and NM7.2 beta were obtained with the same compiler (Intel 11.1) with the same compiler options for ifort in setup.bat:

/Gs /nologo /nbs /w /Ob1gyti /Qprec_div /4Yportlib /traceback

except I do not use /traceback because it makes the exe bigger and slows down execution. When I get around to installing 7.2 I will try using /Gs /nologo /nbs /w /fp:strict as you suggest. So far I haven't seen anybody report their comparisons of NM6, NM7.1, NM7.2 without using the paralysing options.

Here are my results with NM7.2 beta. The test problem had 376 subjects and 1942 observations with an ADVAN6 differential-equation-defined model with one DE using 15 THETAs, 3 OMEGAs and one SIGMA. Dual-core processor, Windows XP, solid-state disk (SSD).

Run (C_C Smax400)    Test+Tcov  Test  Tcov
toctwamby_ivf_NM6    1.02       1.02  na
toctwamby_ivf_NM7    1.28       0.46  0.82
toctwamby_ivf_mpi    1.06       0.40  0.66
toctwamby_ivf_NM72   1.49       0.53  0.96
toctwamby_ivf_fpi    1.85       0.83  1.02
toctwamby_gf_NM6     1.20       1.20  na
toctwamby_gf_NM7     2.54       0.93  1.61
toctwamby_gf_mpi     0.95       0.37  0.58
toctwamby_gf_NM72    1.31       0.48  0.83
toctwamby_gf_fpi     1.88       0.92  0.96

ivf = Intel 11.1, gf = gfortran
Test is estimation time (and covariance time for NM6) in minutes
Tcov is covariance time for NM7 in minutes

In general NM7.2 executes more slowly than NM7.1 and NM6. FPI is slower than NM6 despite using an SSD, which should give fast file access times.

On 24/05/2011 8:59 a.m., Rik Schoemaker wrote:

Dear all,
In contrast to Dieter's findings, I can assure you that it is entirely possible
to set up MPI on a Windows 7 64-bit installation if you use Intel Visual
Fortran.
If you use the following compiler settings:

set op=/Gs /nologo /nbs /w /fp:strict            (Windows 7)
-fp-model strict -Gs -nologo -nbs -w -static     (Linux/Mac OSX)
you will get identical results for NONMEM 7.1 and NONMEM 7.2, with and without
MPI, on Windows 7 64 bit, Linux and Mac OSX, provided you have the same
Fortran compiler version (we use 11.1). I can assure you that this is
definitely not the case when you use the default settings.
This dependency on settings could perhaps explain the difference in speed
that Nick noticed between different NONMEM versions: the minimisation path
is not reproducible and so some runs which converged perfectly before now
fail to do so and the other way around.
As far as gain in speed is concerned, I have two examples. We have a dual
hex-core machine that supports hyperthreading to 24 cores. The first fairly
intensive problem takes 29:40 min without parallel processing. If we use MPI
with 12 nodes
we go down to 2:57 min, a 10-fold decrease! Using 20 cores we go up to 3:40
which means we lose some speed. For a much smaller problem I get the
following figures:
Nodes  Time (s)
 0     183   (non-parallel)
 1     179
 2     101
 3      74
 4      63
 5      54
 6      48
 7      44
10      38
12      35
14      36
16      39
18      38
20      42
24      47
So for this run we have a maximum 5.2-fold increase in speed (here the
overhead is taking its toll) and the optimum is 12 cores, so there is no
gain in using hyper-threaded cores.
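Rik's figures line up reasonably with the TT = Tcpu/n + IO*n model quoted in this thread. A back-of-envelope check (the Tcpu and IO values below are my own illustrative estimates inferred from the observed optimum, not measured values from the post):

```python
# Back-of-envelope fit of TT(n) = Tcpu/n + IO*n to Rik's scaling table.
# At the model's minimum, n_opt = sqrt(Tcpu/IO) and TT(n_opt) = 2*sqrt(Tcpu*IO);
# Tcpu and IO are inferred from the observed optimum (12 nodes, 35 s).

def total_time(tcpu, io, n):
    """Predicted wall time for one run split over n nodes."""
    return tcpu / n + io * n

n_opt, tt_min = 12, 35.0
io = tt_min / (2 * n_opt)   # ~1.46 s of overhead per node
tcpu = io * n_opt ** 2      # ~210 s of single-core work

print(round(total_time(tcpu, io, 12), 1))  # 35.0 (matches the table by construction)
print(round(total_time(tcpu, io, 2), 1))   # 107.9, vs 101 s observed
print(round(total_time(tcpu, io, 24), 1))  # 43.8, vs 47 s observed
```

The model reproduces the U-shape of the table: past the optimum, each extra node adds more I/O overhead than it removes computation.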
Cheers,
Rik
-----Original Message-----
From: [email protected] [mailto:[email protected]] On
Behalf Of Dieter Menne
Sent: 23 May 2011 9:43 PM
To: nmuser list
Subject: [NMusers] MPI installation on Win 7/ 64 bit
The solution was to install the 32-bit version of MPICH2; both the version
coming with Nonmem and the version from
http://www.mcs.anl.gov/research/projects/mpich2/
work ok.
I got many personal mails telling me that my 64-bit version of MPICH was not
running. As I had noted, it was running
mpiexec -hosts 2 localhost computername.exe
DIETERPC
DIETERPC
but it did not play well with nonmem compiled with gfortran.
Here the summary with MPICH for a short Bayes run (20 iterations) and an i7
with 4 real cores
CPUs  Time (s)
 1    101
 2     55
 4     33   (50% CPU time)
 8     35   (100% CPU time)
So almost perfect scaling with 2 CPUs and, as I would expect, no improvement
beyond 4. The 100% CPU time indicated is simply bogus.
Dieter
I am trying to get the MPI feature in 7.2 running.
-- My system: Windows 7, 64 bit, German. gfortran 4.6.0
-- All single-CPU tests work ok.
-- File passing works ok:
nmfe72 foce_parallel.ctl foce_parallel.res -parafile=fpiwini8.pnm [nodes]=4
Surprisingly slow (32 seconds vs. 3 seconds with one thread), but
never mind.
-- Test if smpd/mpiexec is working (computername.exe in directory)
smpd -start
MPICH2 Daemon (C) 2003 Argonne National Lab started.
mpiexec -hosts 1 localhost computername.exe
DIETERPC
### Everything works ok up to here
nmfe72 foce_parallel.ctl foce_parallel.res -parafile=mpiwini8.pnm [nodes]=4
doing nmtran
WARNINGS AND ERRORS (IF ANY) FOR PROBLEM 1
(WARNING 2) NM-TRAN INFERS THAT THE DATA ARE POPULATION.
CREATING MUMODEL ROUTINE...
1 Datei(en) kopiert. (1 file(s) copied.)
Finished compiling fsubs
USING PARALLEL PROFILE mpiwini8.pnm
MPI TRANSFER TYPE SELECTED
Completed call to gfcompile.bat
Starting MPI version of nonmem execution ...
C:\tmp\test1\output konnte nicht gefunden werden (could not be found)
----
All subdirectories are created ok, but the process returns immediately
with the above message.
--
Nick Holford, Professor Clinical Pharmacology
Dept Pharmacology & Clinical Pharmacology
University of Auckland,85 Park Rd,Private Bag 92019,Auckland,New Zealand
tel:+64(9)923-6730 fax:+64(9)373-7090 mobile:+64(21)46 23 53
email: [email protected]
http://www.fmhs.auckland.ac.nz/sms/pharmacology/holford