Hi, I am running a molecular dynamics code on an Intel Xeon CPU W3680 @ 3.33 GHz, which has 6 cores.
A single calculation finishes in 82 seconds, but if I run two or three instances simultaneously,
each takes up to 150 seconds, which is almost twice as slow. Each run consumes only 1-2% of memory (6 GB total), so I don't
think it is a memory shortage or swapping. I tried "-heap-arrays" and "ulimit -s unlimited" to change the stack and heap memory
used for dynamic allocation, but neither helped; multiple simultaneous runs still perform very poorly.
Is there any way to get better performance with multiple runs? Any comments would be appreciated.
Byoungseon
PS.
#ifort -v
Version 12.0.0
cpu: Xeon CPU W3680 @ 3.33 GHz, 6 cores. Same results with or without Hyper-Threading.
os: Ubuntu 10.04, 64-bit
compile options: ifort -xHost -O3 -ipo -no-prec-div
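For concreteness, the timing test might look like the following; "md_run" and the input files are placeholders for the actual binary and data:

    # Baseline: one instance, ~82 s on this machine.
    /usr/bin/time -p ./md_run input1.dat > run1.log

    # Two concurrent instances: each reportedly takes up to ~150 s.
    /usr/bin/time -p ./md_run input1.dat > run1.log 2> time1.txt &
    /usr/bin/time -p ./md_run input2.dat > run2.log 2> time2.txt &
    wait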
2 Replies
Your processes are contending with each other, but the question is where: memory bandwidth, file I/O (are you colliding on an NFS filesystem for data files?), or cache? It's hard to say.
You'll want to run some performance monitors and watch CPU, memory usage, disk I/O, and network I/O to see what pegs when you run multiple copies. You could try a tool like Ganglia to help.
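For example, standard Linux tools can show where the contention appears (a sketch; iostat and sar come from the sysstat package on Ubuntu). Run these in one terminal while launching the copies in another, and compare the counters against the single-run baseline:

    vmstat 1        # CPU (us/sy/id), swap-in/out (si/so), and free memory
    iostat -x 1     # per-device disk utilization (%util) and wait times
    sar -n DEV 1    # network throughput, in case NFS is involved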
Also, you might consider downloading an eval copy of VTune Amplifier XE from http://software.intel.com/en-us/articles/intel-software-evaluation-center/
There are too many variables; you'll have to narrow down the application bottlenecks first.
ron
Would you agree that your application probably makes heavy use of memory references? Even though you have sufficient memory for multiple copies, they would contend for the memory bus and L3 cache. Did you try pinning each job to a different core with taskset, preferably avoiding using both cores 0 and 1, or 2 and 3?
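A sketch of that pinning, assuming the MD binary is called "md_run"; check /proc/cpuinfo ("physical id" / "core id") to see which logical CPUs share a physical core before choosing IDs:

    # Pin each copy to its own physical core so two jobs don't
    # land on Hyper-Threading siblings of the same core.
    taskset -c 0 ./md_run input1.dat > run1.log &
    taskset -c 2 ./md_run input2.dat > run2.log &
    taskset -c 4 ./md_run input3.dat > run3.log &
    wait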
It's often more efficient to thread a single job rather than run multiple copies on one CPU; of course you wouldn't expect linear speedup, but you would hope to use the cache more effectively.
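With ifort 12.0 that could look like the following; "md.f90" and "md_run" are placeholders, and -openmp only helps if the source contains OpenMP directives (-parallel instead asks the compiler to auto-parallelize):

    # Build one threaded binary instead of running several serial copies.
    ifort -xHost -O3 -ipo -no-prec-div -openmp md.f90 -o md_run
    # or: ifort -xHost -O3 -ipo -no-prec-div -parallel md.f90 -o md_run

    # Run one job with as many threads as physical cores.
    export OMP_NUM_THREADS=6
    ./md_run input1.dat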
