A few years ago there was a post about benchmarking results for NONMEM on a dual-core CPU ( http://huxley.phor.com/nonmem/nm/99nov212005.html ). Given the relatively recent release of Xeon quad-core processors, I wanted to know if anybody has compared NONMEM runs on a machine with two dual-core processors to NONMEM runs on a quad-core CPU, or even NONMEM runs on a computer with two quad-core CPUs. Has anyone confirmed that having four or eight cores provides linear speedup when running four or eight NONMEM jobs? Alternatively, if anyone has confirmed that the speedup is not linear, what is the approximate speedup, and what was the model number of the CPU(s)?
If a similar topic has been discussed recently (in January or February) on this mailing list, could someone please re-post the information? I just joined in March 2007, and the archives seem to contain no messages from 2007.
Thanks,
Steve
Linear speedup of NONMEM on quad-core CPUs?
7 messages
4 people
Latest: Mar 14, 2007
Steve,
I have just set up NONMEM 6 on a 4GB Core(2) Quad system running XP64. I
don't yet have benchmarks, but I have noted activity on all four CPUs using
the "/Qparallel" option with the Intel Fortran Compiler. I look forward to
hearing of others' experiences.
Cheers... Brian
Quoted reply history
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Steve Chapel
Sent: Friday, March 09, 2007 12:08 PM
To: [email protected]
Subject: [NMusers] Linear speedup of NONMEM on quad-core CPUs?
That's really not my question. My question was about speedup of multiple NONMEM runs, not one NONMEM run. Let me rephrase the question.
Let's say I have eight NONMEM jobs to run each week. Each NONMEM job takes one hour to run. I go to a computer and start one NONMEM job, and when it is finished, I start another, and so on. After eight hours, all eight NONMEM jobs are run.
The next week, I get a great idea. Instead of using one computer, I can use eight computers. I start all eight NONMEM jobs at the same time, and after only one hour they are all done. I have achieved eightfold (linear) speedup in running eight jobs by using eight computers.
The next week, I make a further realization. The computers I was running the NONMEM jobs on are dual-core, so I need to use only four computers. I start two NONMEM jobs on each of the four computers, and after one hour all the jobs are done. The benefit is that this week I needed only four computers to be available.
It might occur to me that all I really need is one computer with two quad-core processors. I could start all eight NONMEM jobs simultaneously on just one computer. The question is, has anyone actually tried this? Does it run all eight NONMEM jobs in the same time it would take to run one NONMEM job? In other words, has going from one core to eight cores enabled an eightfold (linear) speedup in running eight NONMEM jobs? If not, how much speedup might I expect from an eight-core computer?
-- Steve
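The scenario above can be written down as a small idealized model. This is only a sketch: the function names are made up for illustration, each job is taken to be one hour (an assumption chosen so that eight sequential jobs take eight hours), and the model assumes that cores do not slow each other down, which is exactly the point in question.

```python
import math


def wall_time_hours(n_jobs, hours_per_job, n_cores):
    """Ideal wall-clock time when every core runs one job at a time.

    Assumes identical jobs and no contention between cores (hypothetical
    best case, not a measurement).
    """
    if n_jobs < 1 or n_cores < 1 or hours_per_job < 0:
        raise ValueError("need n_jobs >= 1, n_cores >= 1, hours_per_job >= 0")
    batches = math.ceil(n_jobs / n_cores)  # rounds of jobs run back to back
    return batches * hours_per_job


def ideal_speedup(n_jobs, hours_per_job, n_cores):
    """Ratio of one-core wall time to n_cores wall time under the same model."""
    serial = wall_time_hours(n_jobs, hours_per_job, 1)
    return serial / wall_time_hours(n_jobs, hours_per_job, n_cores)


if __name__ == "__main__":
    for cores in (1, 2, 4, 8):
        print(cores, wall_time_hours(8, 1, cores), ideal_speedup(8, 1, cores))
```

Any measured speedup below the value returned by ideal_speedup would point to a shared resource (bus, cache, memory) becoming a bottleneck.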
Brian M. Sadler wrote:
> Steve,
>
> I have just set up NONMEM 6 on a 4GB Core(2) Quad system running XP64. I
> don't yet have benchmarks, but I have noted activity on all four CPUs using
> the "/Qparallel" option with the Intel Fortran Compiler. I look forward to
> hearing of others' experiences.
>
> Cheers... Brian
Steve,
It really, really should be the case that speedup for multiple
simultaneous runs is linear. In looking at it for many years, NONMEM
execution speed really is consistently proportional to benchmarks like
specfp95. It seems that disk I/O is trivial; the entire data set can
typically be put into cache on modern machines. I have noted
differences between "cheaper" 2.8 GHz dual-core machines (Dell E510)
and "better" 2.8 GHz machines I've gotten (from Gateway). But, if you
look at the specfp95 results ( http://www.spec.org/cpu95/results/cfp95.html ),
there are differences between machines using the same CPU - I can't
claim to understand why. Memory should not be an issue - NONMEM
typically uses less than 5 MB of memory.
I have done what you ask (I think) in two stages, but not the whole
thing:
A dual-core machine does increase run speed (1/time) linearly for 2
processes (note that dual-core CPUs typically have a slightly slower
clock speed) - this is what I currently run.
A 4-processor machine (single core - a Proliant 4-processor server
running Windows Server 2000) does increase run speed (1/time)
linearly for four processes.
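A check of this kind can be sketched as a small timing harness: start n identical processes at once and compare the wall time with that of a single process. The stand-in command below is a CPU-bound Python loop, not a NONMEM run; the function names and the stand-in are assumptions for illustration, and a real test would substitute whatever command starts a NONMEM job on the machine in question.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for one NONMEM run: a short CPU-bound loop.
STAND_IN = [sys.executable, "-c", "sum(i * i for i in range(200_000))"]


def time_concurrent(n_procs, cmd=STAND_IN):
    """Start n_procs copies of cmd together; return seconds until all exit."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(n_procs)]
    for p in procs:
        p.wait()
    if any(p.returncode != 0 for p in procs):
        raise RuntimeError("at least one process failed")
    return time.perf_counter() - start


def throughput_ratio(n_procs, cmd=STAND_IN):
    """n_procs * (time for one run) / (time for n_procs concurrent runs).

    A value close to n_procs suggests near-linear scaling on this machine.
    """
    t_one = time_concurrent(1, cmd)
    t_many = time_concurrent(n_procs, cmd)
    return n_procs * t_one / t_many


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, round(throughput_ratio(n), 2))
```

With a workload this short, process start-up time dominates, so the ratio is only meaningful when the stand-in is replaced by runs lasting minutes or more.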
The Intel quad core is just two dual-core processors stuck together
with a single front side bus; they don't share cache or registers.
This probably is better for NONMEM than the AMD approach, sharing
registers, since separate NONMEM runs obviously don't need to share
anything. (The Intel approach is worse for games, since latency to
cache memory is worse.)
But a 4-processor dual-core machine will cost you > $12,000, and will
not use less power than 4 dual-core boxes - why go to the quad
processor? (Trust me, it won't make less noise either.) You can buy 4
dual-core boxes, set up a LAN, and map the c: drive on one "main"
machine to all the machines (so from the "main" machine, everything
looks like it is happening on the local drive, when in fact execution
is happening on the other machines), then use remote desktop to
control all 4 computers from one monitor/mouse/keyboard. A dual-core
Dell is about $700. Best price for quad core right now is about $2000
(i.e., more $/GHz than dual core). The current Intel quad core is
intended for servers, and is expensive. The desktop version is due out
late this year - it should be cheaper, and prices will probably come
down when AMD comes out with their quad-core CPU.
Brian,
Your observation (if I understand correctly that you are talking
about running only one NONMEM run) is a little surprising; NONMEM is
single threaded. So the current approach to parallel computing
(multithreading) isn't going to happen. The parallel option on the
Intel compiler can, in theory, "unroll" loops in Fortran. But, in
reality, the code has to be specifically written to do this, and NONMEM
certainly is not. I tried this, in collaboration with Silicon Graphics
about 10 years ago (who claimed to have the best parallel compiler
around, right before they went out of business), and got zero
parallelization for a single run of NONMEM. But this was a long time
ago; maybe Intel figured out something new.
Mark
Mark Sale MD
Next Level Solutions, LLC
www.NextLevelSolns.com
Quoted reply history
> -------- Original Message --------
> Subject: Re: [NMusers] Linear speedup of NONMEM on quad-core CPUs?
> From: Steve Chapel <[EMAIL PROTECTED]>
> Date: Mon, March 12, 2007 11:30 am
> To: [email protected]
I've actually run this experiment (not on a quad-core, but on a
four-CPU cluster). I have no reason to believe the quad-core would
behave differently than the cluster, assuming appropriate software on
the quad-core.
N = number of CPUs
O = observed run time in minutes for multiple identical runs
D = observed decrease in run time (minutes) with N CPUs
T = theoretical decrease in run time (minutes) with N CPUs
N     O     D     T
1   159     -     0
2    79    80    80
3    53   106   106
4    40   119   119
The equations for the reduction in processing time were also derived and
presented at AAPS in 2003, "Use of a Linux Cluster with PDx-Pop and
NONMEM V to Streamline Population Analysis", W. Bachman and W. Knebel.
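The T column above is consistent with run time scaling as 1/N, that is, T = O(1) * (1 - 1/N). That formula is an assumption on my part that happens to reproduce the table; the derivation in the AAPS presentation is not reproduced here. A short sketch that recomputes D, T, and the speedup from the observed times:

```python
def theoretical_decrease(single_cpu_minutes, n_cpus):
    """Ideal reduction in run time if time scales as 1/n_cpus (assumed model)."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return single_cpu_minutes * (1 - 1 / n_cpus)


# Observed run times in minutes, copied from the table above.
observed = {1: 159, 2: 79, 3: 53, 4: 40}

if __name__ == "__main__":
    baseline = observed[1]
    for n, minutes in observed.items():
        decrease = baseline - minutes              # column D
        ideal = theoretical_decrease(baseline, n)  # column T before rounding
        speedup = baseline / minutes               # linear scaling gives n
        print(f"N={n}  D={decrease}  T={ideal:.2f}  speedup={speedup:.2f}")
```

The speedups come out at about 2.0, 3.0, and 4.0 for two, three, and four CPUs, which is what linear scaling predicts.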
Mark,
The 4-core "activity" was noted in the Windows Task Manager and may,
according to the guy who built my machine, be an anomaly in the way the Task
Manager reports activity on each processor. I am just starting my evaluation
of this computer's performance and will share my experience with the group
once I have more objective results.
Cheers... Brian
Quoted reply history
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Mark Sale - Next Level Solutions
Sent: Monday, March 12, 2007 1:04 PM
To: Steve Chapel
Cc: [email protected]
Subject: RE: [NMusers] Linear speedup of NONMEM on quad-core CPUs?
I didn't even think about disk I/O. I was more concerned about front side bus activity being a bottleneck. I found that NONMEM used only 2.4 MB of memory, but of course this would depend on computer architecture, compiler options, sizes of arrays, and so on. My impression is that the bus should not be a bottleneck, because the 4 MB shared cache of each dual core should be able to hold the most frequently accessed memory, as you point out. It's good that someone has determined that this really is the case.
As for why to go the quad-core route, I'm going to need a very reliable system, so I want servers with Xeons anyway. Noise is not a concern, because these servers are going to get their own room. Comparing the cost and power of two single quad-core systems with one system containing two quad-core CPUs, it looks like the single dual-CPU system will cost less and should use less power. If performance is no worse, it makes economic sense to pack as many cores into each box as possible.
-- Steve