Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

multinode performance

yimwlihpc_a-star_edu
Dear all,

Recently I have tested the application programs (VASP and CP2K) that I use on Opteron clusters, and compared the performance of Intel MPI and generic MPICH. I am very impressed that the Intel MPI library outperforms MPICH; e.g., for 8 CPUs about a 30% performance gain can be obtained. We have several batches of platforms, with 4-core nodes, 2-core nodes, and 16-core nodes. The 4-core and 2-core nodes are linked homogeneously by a gigabit network.

I encountered a significant performance loss once a certain number of CPUs is involved in the calculation. For example, VASP running on 4-core nodes:

ncpus   time (seconds)
2       6219
4       4059
8       2911
12      2134
16      11319
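
To make the falloff concrete, here is a small sketch in plain C that computes speedup and parallel efficiency from the timings above, using the 2-CPU run as the baseline (only the numbers in the table are assumed):

#include <stdio.h>

/* Speedup and efficiency relative to the 2-CPU run, from the table above. */
int main(void) {
    const int    ncpus[] = {2, 4, 8, 12, 16};
    const double secs[]  = {6219.0, 4059.0, 2911.0, 2134.0, 11319.0};
    const int    n = sizeof(ncpus) / sizeof(ncpus[0]);

    for (int i = 0; i < n; ++i) {
        double speedup = secs[0] / secs[i];          /* vs. the 2-CPU time          */
        double eff     = speedup / (ncpus[i] / 2.0); /* vs. ideal linear scaling    */
        printf("%2d CPUs: %8.0f s  speedup %5.2fx  efficiency %3.0f%%\n",
               ncpus[i], secs[i], speedup, eff * 100.0);
    }
    return 0;
}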

Any suggestions and comments, or a pointer in the right direction, will be appreciated.

Best regards,
William
7 Replies
jimdempseyatthecove
Honored Contributor III

What does ncpus=14 give you?
What does ncpus=15 give you?

At a certain threshold, your system may start shuffling tasks and data from system to system. Installing instrumentation code into your application might indicate what is happening as you transition to the higher runtimes. Instrumenting your code will not introduce additional threads to monitor the application.
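
A minimal sketch of that kind of instrumentation, assuming the application is (or can call into) C MPI code; the timed region and the work inside it are placeholders:

#include <mpi.h>
#include <stdio.h>

/* Per-rank timing of one code region using MPI_Wtime.
 * "exchange" is a placeholder name; wrap any suspect phase the same way. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();

    /* ... the communication/computation phase being investigated ... */
    MPI_Barrier(MPI_COMM_WORLD);   /* stands in for the real work here */

    double t1 = MPI_Wtime();
    double local = t1 - t0, tmax, tmin;

    /* Compare the slowest and fastest ranks: a large spread suggests
     * load imbalance or ranks waiting on the network. */
    MPI_Reduce(&local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("exchange: min %.3f s  max %.3f s\n", tmin, tmax);

    MPI_Finalize();
    return 0;
}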

Jim Dempsey
Dmitry_Vyukov
Valued Contributor I

I encountered a significant performance loss once a certain number of CPUs is involved in the calculation. For example, VASP running on 4-core nodes:

ncpus   time (seconds)
2       6219
4       4059
8       2911
12      2134
16      11319

Any suggestions and comments, or a pointer in the right direction, will be appreciated.


Just a guess, but your application may be saturating your network. You might try measuring network utilization.

If so, you may try to increase the granularity of the processing and make the communication more distributed. For example, replace a fully connected topology between processes with a more distributed tree-based or cluster-based one.
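
As a sketch of the idea in C MPI (not tied to VASP or CP2K in any way): a root that receives a contribution from every rank point-to-point concentrates all of that traffic on one node, while a collective such as MPI_Reduce is typically implemented with a tree and spreads it out:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double part = (double)rank;   /* each rank's partial result (placeholder) */

    /* "Fully connected toward the root": rank 0 talks to every other rank,
     * so its network link becomes the bottleneck as size grows. */
    double naive = part;
    if (rank == 0) {
        for (int src = 1; src < size; ++src) {
            double v;
            MPI_Recv(&v, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            naive += v;
        }
    } else {
        MPI_Send(&part, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    /* Tree-based alternative: the MPI library combines values pairwise,
     * so no single node receives size-1 messages. */
    double total;
    MPI_Reduce(&part, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("naive sum %.1f, reduced sum %.1f\n", naive, total);

    MPI_Finalize();
    return 0;
}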


john_low
Beginner
Dear all,

Recently I have tested the application programs (VASP and CP2K) that I use on Opteron clusters, and compared the performance of Intel MPI and generic MPICH. I am very impressed that the Intel MPI library outperforms MPICH; e.g., for 8 CPUs about a 30% performance gain can be obtained. We have several batches of platforms, with 4-core nodes, 2-core nodes, and 16-core nodes. The 4-core and 2-core nodes are linked homogeneously by a gigabit network.

I encountered a significant performance loss once a certain number of CPUs is involved in the calculation. For example, VASP running on 4-core nodes:

ncpus   time (seconds)
2       6219
4       4059
8       2911
12      2134
16      11319

Any suggestions and comments, or a pointer in the right direction, will be appreciated.

Best regards,
William

Did you try setting OMP_NUM_THREADS=1? This will turn off the OpenMP parallelization in your code. The -O3 optimization turns on parallelization by the compiler. This is a new feature with version 10 of the compiler.
TimP
Honored Contributor III

Did you try setting OMP_NUM_THREADS=1? This will turn off the OpenMP parallelization in your code. The -O3 optimization turns on parallelization by the compiler. This is a new feature with version 10 of the compiler.
Nothing has been said about which compiler was in use. -O3 turns on auto-vectorization in current GNU compilers, as -O2 or -O3 do for Intel compilers. That doesn't involve any threading or OpenMP. As far as I know, the Intel and Sun compilers are the only ones available to work with Intel MPI that have an auto-parallel option, and it is invoked separately from the other options.
john_low
Beginner
Quoting - tim18

Did you try setting OMP_NUM_THREADS=1? This will turn off the OpenMP parallelization in your code. The -O3 optimization turns on parallelization by the compiler. This is a new feature with version 10 of the compiler.
Nothing has been said about which compiler was in use. -O3 turns on auto-vectorization in current GNU compilers, as -O2 or -O3 do for Intel compilers. That doesn't involve any threading or OpenMP. As far as I know, the Intel and Sun compilers are the only ones available to work with Intel MPI that have an auto-parallel option, and it is invoked separately from the other options.

After further research I believe that this problem is with the multithreaded MKL libraries and VASP. The multithreading hurts the performance of VASP; I have noticed performance decreases by a factor of 2 to 10. If you set OMP_NUM_THREADS=1 or use the sequential (non-multithreaded) library (mkl_sequential), you should get better performance from VASP.
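
A sketch of forcing single-threaded MKL from a C program; the dgemm call is only a stand-in for whatever BLAS/LAPACK work the application hands to MKL, and the link line in the comment is one example of linking the sequential layer:

#include <stdio.h>
#include <mkl.h>   /* mkl_set_num_threads, mkl_get_max_threads, cblas_dgemm */

/* Keep MKL from spawning threads under each MPI rank. Alternatives:
 * link the sequential layer (e.g. -lmkl_intel_lp64 -lmkl_sequential -lmkl_core)
 * or set OMP_NUM_THREADS=1 / MKL_NUM_THREADS=1 in the job environment. */
int main(void) {
    mkl_set_num_threads(1);

    /* Small stand-in BLAS call: C = A * B with 2x2 matrices. */
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    printf("C = [%g %g; %g %g], MKL threads = %d\n",
           C[0], C[1], C[2], C[3], mkl_get_max_threads());
    return 0;
}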

By the way, why doesn't Intel update the page on compiling VASP on their web site? The current version is totally out of date!
AaronTersteeg
Employee
Thank you for the heads-up on the issue. I've notified the engineering team and look forward to them making the update and letting you all know.
This is the page that I'm having them look at:
If there is another place to make the update, please let me know.


After further research I believe that this problem is with the multithreaded MKL libraries and VASP. The multithreading hurts the performance of VASP; I have noticed performance decreases by a factor of 2 to 10. If you set OMP_NUM_THREADS=1 or use the sequential (non-multithreaded) library (mkl_sequential), you should get better performance from VASP.

By the way, why doesn't Intel update the page on compiling VASP on their web site? The current version is totally out of date!

VipinKumar_E_Intel
Hi,

The VASP user note in the MKL knowledge base (KB) has been updated for the latest MKL 10.2 version.
We will be updating the performance results with MKL 10.2 on Nehalem soon.

--Vipin