Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Hybrid mode faster than just mpi

L__D__Marks
New Contributor I

I am using a code (Wien2k) which makes extensive use of LAPACK/ScaLAPACK via the MKL library, and can also work in hybrid mode with OpenMP+MPI. In my prior experience, and that of others, the hybrid mode with 2 OpenMP threads was slightly slower, perhaps by 10%.

On a 64-core Gold 6338 node it is very different: with 2 OpenMP threads and the rest MPI it is ~1.6 times faster! I cannot explain this, and I am wondering whether this somehow relates to the architecture or is a bug with using all 64 cores for MPI.

For reference I am using the 2021.1.1 versions of MKL/compiler/impi, as later ones don't work for reasons I have not been able to determine (a large program for matrix eigensolving hangs).

I can provide a way to reproduce this, but it would involve transferring a large code & some control files.

26 Replies
VidyalathaB_Intel
Moderator

Hi Laurence,


Thanks for reaching out to us.


>>I can provide a way to reproduce this..

It would be a great help if you can provide us with a sample reproducer code and steps to reproduce it so that we can check it from our end as well.


Regards,

Vidya.


L__D__Marks
New Contributor I

The attached can hopefully be used to reproduce the issue. Use tar -xj (bzip2) to decompress, then look at the README file inside. Contact me if it is not clear or does not want to compile/run.

L__D__Marks
New Contributor I

The Intel version is 2021.1.1. Later versions have worse problems, hanging with no information.

 

$ uname -a
Linux qnode1058 3.10.0-1160.71.1.el7.x86_64 #1 SMP Wed Jun 15 08:55:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

 

[lma712@qnode1058 ~]$ head -20 /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
stepping : 6
microcode : 0xd000363
cpu MHz : 2000.000
cache size : 49152 KB
physical id : 0
siblings : 32
core id : 0
cpu cores : 32
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes

VidyalathaB_Intel
Moderator

Hi Laurence,


Thanks for providing us with the details.

Could you please let us know whether any additional dependencies need to be installed in this case?

If so, please provide the required information so that we can proceed further.


Regards,

Vidya.


L__D__Marks
New Contributor I

Nothing beyond ifort/icc/impi, tcsh & standard Linux.

L__D__Marks
New Contributor I

For reference, I am seeing the 6338 as about 1.5 times slower (normalized to the number of cores) than a 6130 in pure mpi, with a speedup of about 1.75 using hybrid.

 

I spoke to a colleague in Cambridge, UK and they have seen something similar, in fact far worse -- it can be a factor of 10. You are probably going to hear multiple grumbles from major supercomputer users around the world on similar issues.

VidyalathaB_Intel
Moderator

Hi Laurence,

 

I tried following the steps provided in the README, and this is the error I get when running this step:

x lapw1 -p -orb -up

[Screenshot: VidyalathaB_Intel_0-1669976929335.png]

Could you please let me know what I am missing here and help me resolve this issue?

 

Regards,

Vidya.

 

L__D__Marks
New Contributor I
The message "mpirun not found" means that your PATH is not set up correctly. You need to source the oneAPI setvars.sh. It would also be wise to run "ldd lapw1_mpi" to check that you have the right libraries.
VidyalathaB_Intel
Moderator

Hi Laurence,


Yes, I've already sourced the oneAPI setvars.sh script, but I'm still getting the mpirun command not found error (in the screenshot in my previous post you can see that mpirun --version works fine, as I've already set up the oneAPI environment).


Do you have any idea what causes this error? For example, is there any script that affects the oneAPI environment setup?


Please let me know so that I can proceed further.


Regards,

Vidya.


L__D__Marks
New Contributor I
I do not think that anything in the PATH should be overwritten, but I am sitting in an airport waiting for a flight, so I cannot fully check. I suggest two things:
1) Before doing "x lapw1 -up -p -orb", do "which mpirun". If mpirun is not found, then you have an incomplete oneAPI installation and will need to add the impi package.
2) If you find mpirun, edit lapw1para and after:

#which def-file are we using?

if ($#argv < 1) then
echo usage: $0 deffile
exit
endif

Add (line 134) "which mpirun".

Let me know if the first works but the second fails. If that is the case I will try to replicate it myself.
VidyalathaB_Intel
Moderator

Hi Laurence,


Could you please let us know approximately how long the step x lapw1 -up -p -orb should take to finish?

I ran it for about 2.5 hours and it was still running.


Regards,

Vidya.


L__D__Marks
New Contributor I

If you have done cp M2 .machines, where M2 is (with nodes edited)

granularity:1
omp_lapw1:2
1:node01:32 node02:32 node03:32
extrafine

then that should take 27 minutes. M1 should take about 45 minutes. If you instead used

granularity:1
omp_lapw1:1
1:node01:64
extrafine

then in principle it should take about 90 minutes. It will either crash somewhere in impi, or run forever for reasons I do not understand.

Please ensure that you have not oversubscribed the number of MPI processes, as then they compete/conflict and it may never finish.
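
One way to check for oversubscription is to multiply the MPI ranks listed per host by the OpenMP thread count and compare the result with the physical cores on each node. The sketch below does this for a .machines-style text. It assumes the weight:host:count layout and the omp_lapw1 key shown above; the function names and the 64-core limit are illustrative assumptions and are not part of Wien2k.

```python
def cores_requested(machines_text):
    """Return {host: mpi_ranks * omp_threads} for a .machines-style text.

    Assumes lines of the form "<weight>:host:count host:count ..." and an
    optional "omp_lapw1:<threads>" line (hypothetical parser, not Wien2k code).
    """
    omp_threads = 1
    ranks = {}
    for raw in machines_text.splitlines():
        line = raw.strip()
        if line.startswith("omp_lapw1:"):
            omp_threads = int(line.split(":", 1)[1])
        elif line and line[0].isdigit():
            # Drop the leading weight, then read each host:count token.
            _, hosts = line.split(":", 1)
            for token in hosts.split():
                host, _, count = token.partition(":")
                ranks[host] = ranks.get(host, 0) + int(count or 1)
    return {host: n * omp_threads for host, n in ranks.items()}


def oversubscribed(machines_text, cores_per_host=64):
    """Return the hosts whose requested cores exceed cores_per_host."""
    requested = cores_requested(machines_text)
    return {h: n for h, n in requested.items() if n > cores_per_host}


if __name__ == "__main__":
    m2 = "granularity:1\nomp_lapw1:2\n1:node01:32 node02:32 node03:32\nextrafine\n"
    print(cores_requested(m2))
    print(oversubscribed(m2))
```

An empty result from oversubscribed means that no host is asked for more than the assumed 64 cores.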

VidyalathaB_Intel
Moderator

Hi Laurence,


If possible, could you please try to isolate the issue in a small sample reproducer that shows the performance difference you are observing with hybrid mode? That would make it easier to address the issue quickly.


Regards,

Vidya.



L__D__Marks
New Contributor I
I provided you with a "real" reproducer that represents hard-core scientific computing with multicore parallel programming. It does require some setup of a proper Linux system. That is life.

It is not appropriate to make a toy version; it would not be representative.

I assume that you still cannot get it to run. What is your .machines file? What CPU are you using? What is your network?
VidyalathaB_Intel
Moderator

Hi Laurence,

 

Could you please let me know if I'm on the right track in replicating the issue that you are observing?

Here is a screenshot of the output of the x lapw1 -p -orb -up command (5th step in the README):

[Screenshot: VidyalathaB_Intel_0-1672138234923.png]

Here is a screenshot of the output of the x lapw1 -p -orb -up command (8th step in the README):

[Screenshot: VidyalathaB_Intel_1-1672138313880.png]

 

Regards,

Vidya.

 

L__D__Marks
New Contributor I
I think you have replicated the issue. After each of the two steps please do "tail *.output1up_1" (see below), which gives more readable output in its last two lines.

[lma712@quser21 PtF]$ tail *output1up_1
0.6516724 0.6517651 0.6518586 0.6521098 0.6522035
0.6522145 0.6523540 0.6524470 0.6525784 0.6528393
0 EIGENVALUES BELOW THE ENERGY -11.00000
********************************************************

NUMBER OF K-POINTS: 2
===> TOTAL CPU TIME: 1979.8 (INIT = 16.8 + K-POINTS = 1963.0)
> SUM OF WALL CLOCK TIMES: 1020.1 (INIT = 17.1 + K-POINTS = 1003.0)
Maximum WALL clock time: 2230.08410000001
Maximum CPU time: 4212.38736000000
L__D__Marks
New Contributor I

For reference, with just mpi the last two lines I have are

Maximum WALL clock time: 3520.51289999999
Maximum CPU time: 3510.76150400000

And combining omp & mpi
Maximum WALL clock time: 2228.30399999999
Maximum CPU time: 4237.83427500000

 

WALL is the elapsed time (in seconds), which is what matters. I expect your numbers to be similar.
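
As a rough sanity check on the figures quoted in this post, the hybrid speedup and the CPU-to-WALL ratio can be computed directly. This is only arithmetic on the four numbers above; the variable and function names are illustrative.

```python
# Maximum WALL / CPU times (seconds) as quoted in the post.
wall_mpi_only = 3520.51289999999
cpu_mpi_only = 3510.76150400000
wall_hybrid = 2228.30399999999
cpu_hybrid = 4237.83427500000


def ratio(numerator, denominator):
    """Plain ratio helper, kept separate so each figure is labelled below."""
    return numerator / denominator


# Speedup of hybrid (OpenMP + MPI) over pure MPI, based on elapsed time.
hybrid_speedup = ratio(wall_mpi_only, wall_hybrid)

# CPU/WALL: close to 1 for single-threaded ranks, close to 2 with 2 threads.
cpu_per_wall_mpi = ratio(cpu_mpi_only, wall_mpi_only)
cpu_per_wall_hybrid = ratio(cpu_hybrid, wall_hybrid)

print(f"hybrid speedup over pure MPI: {hybrid_speedup:.2f}x")
print(f"CPU/WALL, pure MPI: {cpu_per_wall_mpi:.2f}")
print(f"CPU/WALL, hybrid:   {cpu_per_wall_hybrid:.2f}")
```

The speedup comes out at roughly 1.6, consistent with the figure in the original question, and the hybrid CPU/WALL ratio is close to 2, as expected when each rank keeps 2 OpenMP threads busy.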

VidyalathaB_Intel
Moderator

Hi Laurence,

 

Could you please check the screenshot below and let me know if this is the expected result (I think there is some difference)?

[Screenshot: VidyalathaB_Intel_0-1672247566991.png]

 

Regards,

Vidya.

 

L__D__Marks
New Contributor I

Something is very odd with your numbers, as WALL should be only slightly larger than CPU with pure MPI, and about half of CPU in hybrid. Thoughts:

1) Do you have a fast InfiniBand or a slow Ethernet interconnect?

2) Is hyperthreading on (it can get in the way)?

3) Was anyone else running on those nodes, beyond the standard OS?

Please do "grep -ie time Up_1 Up_2" and find a way to get me the output. It might give me a clue.

Currently confused!

VidyalathaB_Intel
Moderator

Hi Laurence,

 

Apologies for the delay.

 

Please find the output of grep -ie time Up_1 Up_2 attached.

In my case hyperthreading is disabled, and I'm running the application on a cluster of nodes connected by an Intel Omni-Path interconnect.

 

Regards,

Vidya.

 
