multinode performance

Dear all,

Recently I tested the application programs (VASP and CP2K) that I use on Opteron clusters, and compared the performance of Intel MPI and generic MPICH. I am very impressed that the Intel MPI library outperforms MPICH; for example, on 8 CPUs a performance gain of about 30% can be obtained. We have several batches of platforms, with 4-core, 2-core, and 16-core nodes. The 4-core and 2-core nodes are linked homogeneously by a gigabit network.

I encountered a significant performance loss once a certain number of CPUs is involved in the calculation. For example, VASP running on 4-core nodes:

ncpus    time (seconds)
2        6219
4        4059
8        2911
12       2134
16       11319
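Reading the reported timings as 2 CPUs → 6219 s, 4 → 4059 s, 8 → 2911 s, 12 → 2134 s, and 16 → 11319 s, the implied speedup and parallel efficiency can be computed as follows (a quick sketch using the 2-CPU run as the baseline, since no single-CPU time was reported):

```python
# Speedup and parallel efficiency relative to the 2-CPU run
# (no 1-CPU time was reported, so 2 CPUs serves as the baseline).
times = {2: 6219, 4: 4059, 8: 2911, 12: 2134, 16: 11319}

def speedup(ncpus, baseline=2):
    return times[baseline] / times[ncpus]

def efficiency(ncpus, baseline=2):
    # Ideal speedup from the baseline is ncpus / baseline.
    return speedup(ncpus, baseline) / (ncpus / baseline)

for n in sorted(times):
    print(f"{n:2d} cpus: speedup {speedup(n):.2f}, efficiency {efficiency(n):.0%}")
```

By this measure efficiency falls from roughly 77% at 4 CPUs to about 7% at 16 CPUs, i.e. the 16-CPU run is slower in absolute terms than the 2-CPU run.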

Any suggestions, comments, or pointers in the right direction will be appreciated.

Best regards,
William
7 Replies


What does ncpus=14 give you?
What does ncpus=15 give you?

At a certain threshold, your system may start shuffling tasks and data from system to system. Installing instrumentation code in your application might indicate what is happening as you transition to the higher runtimes. Instrumenting your code will not introduce additional threads to monitor the application.
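One lightweight way to instrument an application along these lines (a generic sketch, not tied to VASP or to any particular profiler) is to accumulate wall-clock time per phase, then compare the per-phase breakdown across node counts:

```python
import time
from contextlib import contextmanager

# Minimal per-phase wall-clock instrumentation. In an MPI code each
# rank would keep its own `timings` and the logs would be compared
# across ranks and node counts afterwards.
timings = {}

@contextmanager
def phase(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start

with phase("setup"):
    total = sum(i * i for i in range(100_000))  # stand-in for real work

with phase("solve"):
    time.sleep(0.01)  # stand-in for the communication-heavy step

for name, secs in timings.items():
    print(f"{name}: {secs:.4f} s")
```

A phase whose share of the total grows sharply as CPUs are added is the natural suspect for the slowdown.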

Jim Dempsey
Valued Contributor I




Just a guess, but your application may be saturating your network. You could try measuring network utilization.

If so, you could try increasing the granularity of processing and making it more distributed; for example, replace a fully connected topology between processes with a more distributed tree-based or cluster-based one.
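To illustrate the difference (a back-of-the-envelope sketch, not VASP's actual communication pattern): a fully connected exchange among n processes involves a link for every pair of processes, so the link count grows quadratically, while a reduction tree needs only n - 1 links:

```python
# Link counts for two communication topologies among n processes.
def fully_connected_links(n):
    # Every pair of processes exchanges data directly: n*(n-1)/2 links.
    return n * (n - 1) // 2

def tree_links(n):
    # A reduction tree connects each non-root process to one parent.
    return n - 1

for n in (4, 8, 12, 16):
    print(f"n={n:2d}: fully connected {fully_connected_links(n):3d}, tree {tree_links(n):2d}")
```

At 16 processes that is 120 pairwise links versus 15, which is why all-to-all patterns are the first thing to suspect when a gigabit network saturates.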


Beginner


Did you try setting OMP_NUM_THREADS=1? This will turn off the OpenMP parallelization in your code. The -O3 optimization turns on parallelization by the compiler; this is a new feature in version 10 of the compiler.
Black Belt


Did you try setting OMP_NUM_THREADS=1? This will turn off the OpenMP parallelization in your code. The -O3 optimization turns on parallelization by the compiler; this is a new feature in version 10 of the compiler.
Nothing has been said about which compiler was in use. -O3 turns on auto-vectorization in current GNU compilers, as -O2 or -O3 do for Intel compilers; that doesn't involve any threading or OpenMP. As far as I know, the Intel and Sun compilers are the only ones available to work with Intel MPI that have an auto-parallel option, and it is invoked separately from the other options.
Beginner

Quoting - tim18

Nothing has been said about which compiler was in use. -O3 turns on auto-vectorization in current GNU compilers, as -O2 or -O3 do for Intel compilers; that doesn't involve any threading or OpenMP. As far as I know, the Intel and Sun compilers are the only ones available to work with Intel MPI that have an auto-parallel option, and it is invoked separately from the other options.

After further research I believe that this problem is with the multithreaded MKL libraries and VASP. The multithreading hurts the performance of VASP; I have noticed performance decreases by a factor of 2 to 10. If you set OMP_NUM_THREADS=1 or use the sequential (not multithreaded) library (mkl_sequential), you should get better performance from VASP.
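For example, the environment variable can be set in the job script before mpirun, or, as sketched below, injected into a launched subprocess (a generic illustration; the actual mpirun/VASP command line will differ):

```python
import os
import subprocess
import sys

# Run a child process with OMP_NUM_THREADS forced to 1, leaving the
# rest of the environment untouched. Replace the command below with
# the real mpirun/vasp invocation in practice.
env = dict(os.environ, OMP_NUM_THREADS="1")
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"],
    env=env, capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # the child process sees OMP_NUM_THREADS=1
```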

By the way, why doesn't Intel update the page on compiling VASP on their web site? The current version is totally out of date!
Employee

Thank you for the heads-up. I've notified the engineering team of the issue and look forward to them making the update and letting you all know.
This is the page that I'm having them look at:
If there is another place to make the update please let me know.


Hi,

The VASP user note in the MKL KB has been updated for the latest MKL 10.2 version.
We will be updating the performance results with MKL 10.2 on Nehalem soon.

--Vipin