Hi, I am running a molecular dynamics code on an Intel Xeon CPU W3680 @ 3.33 GHz, which has 6 cores.
A single calculation finishes in 82 seconds, but if I run two or three instances simultaneously,
each takes up to 150 seconds, which is almost twice as slow. Each run consumes only 1-2% of memory (6 GB total), so I don't
think it is a memory shortage or swapping. I tried "-heap-arrays" and "ulimit -s unlimited" to change the stack and heap memory
used for dynamic allocation, but neither helped; multiple simultaneous runs still perform very poorly.
Is there any way to get better performance with multiple runs? Any comments would be appreciated.
Byoungseon
PS.
#ifort -v
Version 12.0.0
cpu: Xeon CPU W3680 @ 3.33 GHz, 6 cores. Same results with or without Hyper-Threading.
os: Ubuntu 10.04, 64-bit
compile options: ifort -xHost -O3 -ipo -no-prec-div
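For concreteness, the timing test might look like the following; "md_run" and the input files are placeholders for the actual binary and data:

    # Baseline: one instance, ~82 s on this machine.
    /usr/bin/time -p ./md_run input1.dat > run1.log

    # Two concurrent instances: each reportedly takes up to ~150 s.
    /usr/bin/time -p ./md_run input1.dat > run1.log 2> time1.txt &
    /usr/bin/time -p ./md_run input2.dat > run2.log 2> time2.txt &
    wait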
2 Replies
Your processes are contending with each other, but the question is where: memory bandwidth, file I/O (are you colliding on an NFS filesystem for data files?), or cache? It's hard to say.
You'll want to run some performance monitors and watch CPU, memory usage, disk I/O, and network I/O to see what pegs when you run multiple copies. You could try a tool like Ganglia to help.
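For example, standard Linux tools can show where the contention appears (a sketch; iostat and sar come from the sysstat package on Ubuntu). Run these in one terminal while launching the copies in another, and compare the counters against the single-run baseline:

    vmstat 1        # CPU (us/sy/id), swap-in/out (si/so), and free memory
    iostat -x 1     # per-device disk utilization (%util) and wait times
    sar -n DEV 1    # network throughput, in case NFS is involved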
Also, you might consider downloading an eval copy of VTune Amplifier XE from http://software.intel.com/en-us/articles/intel-software-evaluation-center/
There are too many variables; you'll have to narrow down the application bottlenecks first.
ron
Would you agree that your application probably makes heavy use of memory references? Even though you have sufficient memory for multiple copies, they would contend for the memory bus and L3 cache. Did you try pinning each job to a different core with taskset, preferably avoiding using both cores 0 and 1, or 2 and 3?
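A sketch of that pinning, assuming the MD binary is called "md_run"; check /proc/cpuinfo ("physical id" / "core id") to see which logical CPUs share a physical core before choosing IDs:

    # Pin each copy to its own physical core so two jobs don't
    # land on Hyper-Threading siblings of the same core.
    taskset -c 0 ./md_run input1.dat > run1.log &
    taskset -c 2 ./md_run input2.dat > run2.log &
    taskset -c 4 ./md_run input3.dat > run3.log &
    wait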
It's often more efficient to thread a single job rather than run multiple copies on one CPU; of course you wouldn't expect linear speedup, but you would hope to use the cache more effectively.
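With ifort 12.0 that could look like the following; "md.f90" and "md_run" are placeholders, and -openmp only helps if the source contains OpenMP directives (-parallel instead asks the compiler to auto-parallelize):

    # Build one threaded binary instead of running several serial copies.
    ifort -xHost -O3 -ipo -no-prec-div -openmp md.f90 -o md_run
    # or: ifort -xHost -O3 -ipo -no-prec-div -parallel md.f90 -o md_run

    # Run one job with as many threads as physical cores.
    export OMP_NUM_THREADS=6
    ./md_run input1.dat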
