Parallel benchmarks

From: Mark Sale
Date: May 24, 2011
Source: mail-archive.com
Nick,

I'm a little surprised at your choice of benchmarks for parallel NONMEM. The original motivation for this was a run that was taking 2 to 3 weeks, and we thought it would be nice if we could get that down to 2 or 3 days (the best we had at the time was an 8-processor server). You are correct: if your benchmark for parallel NONMEM is getting a 2-minute run down to 1 minute, you'll be disappointed; it will indeed be "paralyzed". I/O time across multiple computers in our development hardware (using FPI) is on the order of 0.1 seconds per machine, less with MPI, less still on a single computer.

I maintain that parallel execution is helpful if the ratio of function-evaluation time to IO time is at least 2; real benefit occurs when this ratio is > 10. On our development hardware this works out to about a 15-minute run time for a 3000-function-evaluation model.

Specifically: assume the CPU time for an OBJ evaluation (on one "core") is Tcpu, the IO time (time to send the data out, read it on the other end, plus any "wait state" used in the FPI method) is IO, and n is the number of processes. Then the total time (TT) for a function evaluation is

  TT = Tcpu/n + IO*n

That is, the actual computation time is split among n processes, but the IO time goes up with the number of processes - you need to send the data to multiple machines/processes, and we assume this cannot be parallelized (actually it can, sort of; that's mostly a hardware issue). This all assumes perfect load balancing.

To minimize this time, take the derivative:

  dTT/dn = -Tcpu/n^2 + IO

Set it to zero (to find the minimum):

  Tcpu/n^2 = IO

and solve for n:

  n = sqrt(Tcpu/IO)

This relationship has held up quite well in our benchmarks.

Mark

Mark Sale MD
President, Next Level Solutions, LLC
www.NextLevelSolns.com
919-846-9185
A carbon-neutral company
See our real-time solar energy production at:
http://enlighten.enphaseenergy.com/public/systems/aSDz2458
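[Editor's note: Mark's overhead model can be sketched in a few lines. The numbers below are illustrative only (not taken from the post); they just show that with a CPU-to-IO ratio of 100 the model predicts an optimum of 10 processes and a 5-fold speedup.]

```python
import math

def total_time(tcpu, io, n):
    """Mark's model: computation splits across n processes,
    while the (serialized) IO cost grows linearly with n."""
    return tcpu / n + io * n

def optimal_n(tcpu, io):
    """Setting dTT/dn = -Tcpu/n**2 + IO = 0 gives n* = sqrt(Tcpu/IO)."""
    return math.sqrt(tcpu / io)

# Illustrative numbers: 10 s of CPU work per objective-function
# evaluation, 0.1 s of IO overhead per process.
tcpu, io = 10.0, 0.1
n_star = optimal_n(tcpu, io)          # sqrt(10/0.1) = 10 processes
best = total_time(tcpu, io, n_star)   # 10/10 + 0.1*10 = 2.0 s
print(n_star, best)                   # -> 10.0 2.0 (5-fold speedup over 10 s serial)
```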
Quoted reply history
-------- Original Message --------
Subject: Re: [NMusers] MPI installation on Win 7/ 64 bit
From: Nick Holford <[email protected]>
Date: Tue, May 24, 2011 3:38 am
To: nmusers <[email protected]>

Rik,

The speed differences I noted between NM6, NM7.1 and the NM7.2 beta were obtained with the same compiler (Intel 11.1) with the same compiler options for ifort in setup.bat:

  /Gs /nologo /nbs /w /Ob1gyti /Qprec_div /4Yportlib /traceback

except that I do not use /traceback because it makes the exe bigger and slows down execution. When I get around to installing 7.2 I will try using /Gs /nologo /nbs /w /fp:strict as you suggest. So far I haven't seen anybody report their comparisons of NM6, NM7.1 and NM7.2 without using the paralysing options.

Here are my results with the NM7.2 beta. The test problem had 376 subjects and 1942 observations with an ADVAN6 differential-equation model with one DE using 15 THETAs, 3 OMEGAs and one SIGMA. Dual-core processor, Windows XP, solid-state disk (SSD).

Run (C_C Smax400)    Test+Tcov  Test  Tcov
toctwamby_ivf_NM6    1.02       1.02  na
toctwamby_ivf_NM7    1.28       0.46  0.82
toctwamby_ivf_mpi    1.06       0.40  0.66
toctwamby_ivf_NM72   1.49       0.53  0.96
toctwamby_ivf_fpi    1.85       0.83  1.02
toctwamby_gf_NM6     1.20       1.20  na
toctwamby_gf_NM7     2.54       0.93  1.61
toctwamby_gf_mpi     0.95       0.37  0.58
toctwamby_gf_NM72    1.31       0.48  0.83
toctwamby_gf_fpi     1.88       0.92  0.96

ivf = Intel 11.1, gf = gfortran
Test is estimation time (and covariance time for NM6) in minutes.
Tcov is covariance time for NM7 in minutes.

In general NM7.2 executes more slowly than NM7.1 and NM6. FPI is slower than NM6 despite using an SSD, which should give fast file access times.

On 24/05/2011 8:59 a.m., Rik Schoemaker wrote:

Dear all,

In contrast to Dieter's findings, I can assure you it is very well possible to set up MPI on a Windows 7 64-bit installation if you use Intel Visual Fortran.
If you use the following compiler settings:

  set op=/Gs /nologo /nbs /w /fp:strict          (Windows 7)
  -fp-model strict -Gs -nologo -nbs -w -static   (Linux/Mac OSX)

you will get identical results for NONMEM 7.1 and NONMEM 7.2, with and without MPI, on Windows 7 64-bit, Linux and Mac OSX, provided you have the same Fortran compiler version (we use 11.1). I can assure you that this is definitely not the case when you use the default settings. This dependency on settings could perhaps explain the difference in speed that Nick noticed between different NONMEM versions: the minimisation path is not reproducible, and so some runs which converged perfectly before now fail to do so, and the other way around.

As far as gain in speed is concerned, I have two examples. We have a dual hex-core machine that supports hyperthreading to 24 cores. The first fairly intensive problem takes 29:40 min without parallel processing. If we use MPI with 12 nodes we go down to 2:57 min, a 10-fold decrease! Using 20 cores we go up to 3:40, which means we lose some speed.

For a much smaller problem I get the following figures:

Nodes  Time (sec)
 0     183   (non-parallel)
 1     179
 2     101
 3      74
 4      63
 5      54
 6      48
 7      44
10      38
12      35
14      36
16      39
18      38
20      42
24      47

So for this run we have a maximum 5.2-fold increase in speed (here the overhead is taking its toll) and the optimum is 12 cores, so there is no gain in using hyper-threaded cores.

Cheers,
Rik

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Dieter Menne
Sent: 23 May 2011 9:43 PM
To: nmuser list
Subject: [NMusers] MPI installation on Win 7/ 64 bit

The solution was to install the 32-bit version of MPICH2; both the version coming with NONMEM and the version from http://www.mcs.anl.gov/research/projects/mpich2/ work ok. I got many personal mails telling me that my 64-bit version of MPICH was not running.
As I had noted, it was running

  mpiexec -hosts 2 localhost computername.exe DIETERPC DIETERPC

but it did not play well with NONMEM compiled with gfortran. Here is a summary with MPICH for a short Bayes run (20 iterations) on an i7 with 4 real cores:

CPUs  Time (s)
1     101
2      55
4      33   (50% CPU time)
8      35   (100% CPU time)

So almost perfect scaling with 2 CPUs and, as I would expect, no improvement beyond 4. The 100% CPU time indicated is simply bogus.

Dieter

I am trying to get the MPI feature in 7.2 running.

-- My system: Windows 7, 64 bit, German. Gfortran 4.6.0.
-- All single-CPU tests work ok.
-- File passing works ok:

  nmfe72 foce_parallel.ctl foce_parallel.res -parafile=fpiwini8.pnm [nodes]=4

Surprisingly slow (32 seconds vs. 3 seconds with one thread), but never mind.

-- Test if smpd/mpiexec is working (computername.exe in directory):

  smpd -start
  MPICH2 Daemon (C) 2003 Argonne National Lab started.
  mpiexec -hosts 1 localhost computername.exe DIETERPC

### Everything works ok up to here

  nmfe72 foce_parallel.ctl foce_parallel.res -parafile=mpiwini8.pnm [nodes]=4
  doing nmtran
  WARNINGS AND ERRORS (IF ANY) FOR PROBLEM 1
  (WARNING 2) NM-TRAN INFERS THAT THE DATA ARE POPULATION.
  CREATING MUMODEL ROUTINE...
  1 file(s) copied.
  Finished compiling fsubs
  USING PARALLEL PROFILE mpiwini8.pnm
  MPI TRANSFER TYPE SELECTED
  Completed call to gfcompile.bat
  Starting MPI version of nonmem execution ...
  C:\tmp\test1\output konnte nicht gefunden werden (could not be found)

All subdirectories are created ok, but the process returns immediately with the above message.

--
Nick Holford, Professor Clinical Pharmacology
Dept Pharmacology & Clinical Pharmacology
University of Auckland, 85 Park Rd, Private Bag 92019, Auckland, New Zealand
tel:+64(9)923-6730 fax:+64(9)373-7090 mobile:+64(21)46 23 53
email: [email protected]
http://www.fmhs.auckland.ac.nz/sms/pharmacology/holford
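[Editor's note: Rik's "much smaller problem" figures can be checked with a short script. This sketch uses only the timings from Rik's table above; it computes the speedup of each node count relative to the non-parallel run and confirms the 5.2-fold optimum at 12 nodes.]

```python
# Rik's measured wall times (seconds), keyed by number of MPI nodes
# (0 = the non-parallel run).
times = {0: 183, 1: 179, 2: 101, 3: 74, 4: 63, 5: 54, 6: 48,
         7: 44, 10: 38, 12: 35, 14: 36, 16: 39, 18: 38, 20: 42, 24: 47}

serial = times[0]
speedup = {n: serial / t for n, t in times.items() if n > 0}
best_n = max(speedup, key=speedup.get)

print(best_n, round(speedup[best_n], 1))  # -> 12 5.2
```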