Dear all,
I recently tested the application programs I use (VASP and CP2K) on Opteron clusters and compared the performance of Intel MPI with generic MPICH. I am very impressed that the Intel MPI library performs better; for example, on 8 CPUs it gives about a 30% performance gain. We have several batches of platforms, with 2-core, 4-core, and 16-core nodes. The 4-core and 2-core nodes are linked homogeneously by a gigabit network.
However, I encountered a significant performance loss once more than a certain number of CPUs were involved in the calculation. For example, VASP running on 4-core nodes:
ncpus    time (seconds)
2        6219
4        4059
8        2911
12       2134
16       11319
Any suggestions, comments, or pointers in the right direction will be appreciated.
Best regards,
William
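To put numbers on the slowdown, the timings in the post can be turned into speedup and parallel efficiency relative to the 2-CPU run. This is only a sketch: the pairing of CPU counts and times is read from the table in the post, and the function name `scaling_table` is made up for illustration.

```python
# Timings from the post: (number of CPUs, wall time in seconds).
# The pairing of counts and times is read from the posted table (an assumption).
timings = [(2, 6219), (4, 4059), (8, 2911), (12, 2134), (16, 11319)]


def scaling_table(timings):
    """Return (ncpus, time, speedup, efficiency) rows relative to the first run."""
    base_n, base_t = timings[0]
    rows = []
    for n, t in timings:
        speedup = base_t / t                 # how much faster than the baseline run
        efficiency = speedup / (n / base_n)  # 1.0 would be ideal scaling
        rows.append((n, t, speedup, efficiency))
    return rows


if __name__ == "__main__":
    print(f"{'ncpus':>5} {'time(s)':>8} {'speedup':>8} {'efficiency':>10}")
    for n, t, s, e in scaling_table(timings):
        print(f"{n:>5} {t:>8} {s:>8.2f} {e:>10.2f}")
```

Read this way, efficiency falls at every step up to 12 CPUs, and the 16-CPU run is slower than the 2-CPU baseline (a speedup of roughly 0.55), so the problem is a collapse between 12 and 16 CPUs rather than a gradual flattening.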
7 Replies
What does ncpus=14 give you?
What does ncpus=15 give you?
At a certain threshold, your system may start shuffling tasks and data from system to system. Installing instrumentation code in your application might indicate what is happening as you transition to the higher runtimes. Instrumenting your code will not introduce additional threads to monitor the application.
Jim Dempsey
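As a rough illustration of the kind of instrumentation suggested above, here is a minimal sketch of a per-phase wall-clock timer. It is written in Python only to show the idea; VASP itself is Fortran, and the phase labels ("compute", "communicate") and the helper names `timed` and `report` are hypothetical. The timer runs in the calling thread, so it adds no monitoring threads.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall time per labelled phase (labels are hypothetical).
phase_totals = defaultdict(float)


@contextmanager
def timed(label):
    """Add the wall time spent inside the with-block to phase_totals[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[label] += time.perf_counter() - start


def report():
    """Return (label, seconds) pairs sorted by total time, largest first."""
    return sorted(phase_totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    with timed("compute"):
        sum(i * i for i in range(100_000))  # stand-in for local work
    with timed("communicate"):
        time.sleep(0.01)                    # stand-in for a message exchange
    for label, seconds in report():
        print(f"{label:<12} {seconds:.4f} s")
```

Comparing such per-phase totals between a 12-CPU and a 16-CPU run would show whether the extra time goes into computation or into communication.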
Quoting - yimwlihpc.a-star.edu.sg
I encountered a significant performance loss once more than a certain number of CPUs were involved in the calculation. For example, VASP running on 4-core nodes:
ncpus    time (seconds)
2        6219
4        4059
8        2911
12       2134
16       11319
Any suggestions, comments, or pointers in the right direction will be appreciated.
Just a guess, but your application may be saturating your network. You could try measuring network utilization.
If that is the case, you could try increasing the granularity of processing and making the communication more distributed. For example, replace a fully connected topology between processes with a more distributed tree-based or cluster-based one.
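One way to measure network utilization on a Linux node is to sample the byte counters in `/proc/net/dev` twice and compare the rate with the link capacity. The sketch below assumes a Linux system and a nominal gigabit link; the helper names `parse_net_dev` and `utilization` are made up for this example.

```python
import os
import time

GIGABIT_BYTES_PER_SEC = 1e9 / 8  # nominal gigabit Ethernet line rate (assumed link speed)


def parse_net_dev(text):
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 9:
            continue
        # Field 0 is received bytes; field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def utilization(before, after, seconds, iface, capacity=GIGABIT_BYTES_PER_SEC):
    """Fraction of link capacity used (rx, tx) between two counter samples."""
    rx = (after[iface][0] - before[iface][0]) / seconds
    tx = (after[iface][1] - before[iface][1]) / seconds
    return rx / capacity, tx / capacity


if __name__ == "__main__":
    path = "/proc/net/dev"
    if os.path.exists(path):  # only present on Linux
        with open(path) as f:
            first = parse_net_dev(f.read())
        time.sleep(1.0)
        with open(path) as f:
            second = parse_net_dev(f.read())
        for iface in first:
            if iface in second:
                rx, tx = utilization(first, second, 1.0, iface)
                print(f"{iface:<10} rx {rx:6.1%}  tx {tx:6.1%}")
```

Running this on a compute node while the 16-CPU job is active would show whether the link is sitting near its capacity during the slow run.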
Quoting - yimwlihpc.a-star.edu.sg
Dear all,
I recently tested the application programs I use (VASP and CP2K) on Opteron clusters and compared the performance of Intel MPI with generic MPICH. I am very impressed that the Intel MPI library performs better; for example, on 8 CPUs it gives about a 30% performance gain. We have several batches of platforms, with 2-core, 4-core, and 16-core nodes. The 4-core and 2-core nodes are linked homogeneously by a gigabit network.
However, I encountered a significant performance loss once more than a certain number of CPUs were involved in the calculation. For example, VASP running on 4-core nodes:
ncpus    time (seconds)
2        6219
4        4059
8        2911
12       2134
16       11319
Any suggestions, comments, or pointers in the right direction will be appreciated.
Best regards,
William
Did you try setting OMP_NUM_THREADS=1? This will turn off the OpenMP parallelization in your code. The -O3 optimization turns on parallelization by the compiler. This is a new feature with version 10 of the compiler.
Quoting - tim18
Quoting - john.low@uop.com
Did you try setting OMP_NUM_THREADS=1? This will turn off the OpenMP parallelization in your code. The -O3 optimization turns on parallelization by the compiler. This is a new feature with version 10 of the compiler.
After further research, I believe that this problem lies with the multithreaded MKL libraries and VASP. The multithreading hurts the performance of VASP; I have seen slowdowns by a factor of 2 to 10. If you set OMP_NUM_THREADS=1 or link against the sequential (non-multithreaded) library (mkl_sequential), you should get better performance from VASP.
By the way, why doesn't Intel update the page on compiling VASP on its web site? The current version is totally out of date!
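A plausible reading of the threading advice above, not confirmed in this thread, is oversubscription: if every MPI process on a node also lets the math library start several threads, the node ends up running more threads than it has cores. The sketch below builds a single-threaded environment for a launch and checks that arithmetic. The helper names, the `./vasp` executable path, and the rank count are placeholders; setting `MKL_NUM_THREADS` alongside `OMP_NUM_THREADS` is an extra precaution assumed here, not something the post asks for.

```python
import os


def single_threaded_env(base_env=None):
    """Copy an environment and pin OpenMP and MKL to one thread per process."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    env["MKL_NUM_THREADS"] = "1"  # MKL-specific setting, added as a precaution (assumption)
    return env


def is_oversubscribed(ranks_per_node, threads_per_rank, cores_per_node):
    """True if the node would run more threads than it has cores."""
    return ranks_per_node * threads_per_rank > cores_per_node


if __name__ == "__main__":
    env = single_threaded_env()
    # Hypothetical launch line; the executable name and rank count are placeholders.
    command = ["mpirun", "-np", "8", "./vasp"]
    print("would run:", " ".join(command))
    print("OMP_NUM_THREADS =", env["OMP_NUM_THREADS"])
    # Example: 4 ranks on a 4-core node, each starting 4 library threads.
    print("4 ranks x 4 threads on 4 cores oversubscribed:", is_oversubscribed(4, 4, 4))
    # To actually launch, one could pass the environment to subprocess:
    #   subprocess.run(command, env=env, check=True)
```

With one thread per rank, 4 ranks on a 4-core node use exactly 4 threads, which is the situation the post recommends.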
Thank you for the heads up on the issue. I've notified the engineering team of the issue and look forward to them making the update and letting you all know.
This is the page that I'm having them look at:
http://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-using-intel-mkl-in-vasp/
If there is another place to make the update please let me know.
Quoting - john.low@uop.com
After further research, I believe that this problem lies with the multithreaded MKL libraries and VASP. The multithreading hurts the performance of VASP; I have seen slowdowns by a factor of 2 to 10. If you set OMP_NUM_THREADS=1 or link against the sequential (non-multithreaded) library (mkl_sequential), you should get better performance from VASP.
By the way, why doesn't Intel update the page on compiling VASP on its web site? The current version is totally out of date!
Quoting - Aaron Tersteeg (Intel)
Hi,
The VASP user note in MKL KB has been updated to the latest MKL 10.2 version.
We will be updating the performance results with MKL 10.2 on Nehalem soon.
--Vipin
