Solved: Hybrid mode faster than just mpi

L__D__Marks · ‎11-17-2022

I am using a code (Wien2k) which extensively exploits lapack/scalapack via the mkl library, and can also work in hybrid mode with openmp+mpi. In my prior experience, and that of others, the hybrid mode with 2 openmp threads was slightly slower, perhaps 10%.

With a 64 core Gold 6338 it is very different, with 2 openmp & the rest mpi ~1.6 times faster! I cannot explain this, and I am wondering whether this somehow relates to the architecture or is a bug with using all 64 mpi.

For reference I am using 2021.1.1 versions of mkl/compiler/impi as later ones don't work for reasons I have not been able to determine (large program for matrix eigensolving hangs).

I can provide a way to reproduce this, but it would involve transferring a large code & some control files.

L__D__Marks · ‎02-02-2023

Slightly embarrassing!

Before posting in November I double checked the timings, talked to a colleague in Cambridge who had seen something similar on other code and also checked with the local sys_admins for the cluster at Northwestern. Everyone indicated it was real, with other info on the internet that was similar.

When you asked for timings with 2 nodes (attached), the full mpi was not 1.5 times slower. I rechecked and with 3 nodes it is now also not abnormally slower. Investigating, I was told that not long after my tests & posting a switch on the cluster relevant to the nodes I was using died, and one of the nodes I was using also died. I do not know specifics, and I suspect both have now been recycled. My guess is that there were major hardware problems at the time of the test which were being patched by some slow fault-tolerant algorithms, but costing a lot of time. A guess only, and I have no idea why this should affect pure mpi more than hybrid.

Attached in a tar is the output with 2 nodes of grep -ie time Up_1 and Up_2. I slightly edited PtF.klist (also attached) so the calculations would be faster. I am also attaching the M1 and M2 to use for these. It will still be interesting to know what your timings are.


VidyalathaB_Intel · ‎11-18-2022

Hi Laurence,

Thanks for reaching out to us.

>>I can provide a way to reproduce this..

It would be a great help if you can provide us with a sample reproducer code and steps to reproduce it so that we can check it from our end as well.

Regards,

Vidya.

L__D__Marks · ‎11-18-2022

The attached can hopefully be used to reproduce the issue. Use tar -xj (bzip2) to decompress, then look at the README file inside. Contact me if it is not clear or does not want to compile/run.

L__D__Marks · ‎11-22-2022

The Intel version is 2021.1.1. Later versions have worse problems, hanging with no information.

$ uname -a

Linux qnode1058 3.10.0-1160.71.1.el7.x86_64 #1 SMP Wed Jun 15 08:55:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

[lma712@qnode1058 ~]$ head -20 /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
stepping : 6
microcode : 0xd000363
cpu MHz : 2000.000
cache size : 49152 KB
physical id : 0
siblings : 32
core id : 0
cpu cores : 32
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
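A side note on reading this output: comparing the siblings and cpu cores fields is one quick way to see whether hyperthreading is active on a socket, since equal values mean one hardware thread per core. The following is a minimal Python sketch, assuming the standard /proc/cpuinfo field names shown above; the helper name hyperthreading_active is hypothetical.

```python
def hyperthreading_active(cpuinfo_text):
    """Return True if 'siblings' exceeds 'cpu cores' for the first logical CPU."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        # Keep only the first occurrence of each field (first logical CPU).
        fields.setdefault(key.strip(), value.strip())
    siblings = int(fields["siblings"])
    cores = int(fields["cpu cores"])
    return siblings > cores


# Fields copied from the output above.
sample = """\
processor : 0
model name : Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
physical id : 0
siblings : 32
cpu cores : 32
"""

print(hyperthreading_active(sample))
```

On a live node the text would come from reading /proc/cpuinfo rather than from a pasted sample.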

VidyalathaB_Intel · ‎11-28-2022

Hi Laurence,

Thanks for providing us with the details.

Could you please let us know if there are any additional dependencies needed to be installed here in this case?

If yes, please provide us with the required information so that we can proceed further in this case.

Regards,

Vidya.

L__D__Marks · ‎11-28-2022

Nothing beyond ifort/icc/impi, tcsh & standard Linux.

L__D__Marks · ‎11-28-2022

For reference, I am seeing the 6338 as about 1.5 times slower (normalized to the number of cores) than a 6130 in pure mpi, with a speedup of about 1.75 using hybrid.

I spoke to a colleague in Cambridge, UK and they have seen something similar, in fact far worse -- it can be a factor of 10. You are probably going to hear multiple grumbles from major supercomputer users around the world on similar issues.

VidyalathaB_Intel · ‎12-02-2022

Hi Laurence,

I tried following the steps provided in README and this is the error I'm getting when trying to run this step

x lapw1 -p -orb -up

Could you please let me know what I am missing here and help me to resolve this issue?

Regards,

Vidya.

L__D__Marks · ‎12-02-2022

The message "mpirun not found" means that your PATH is not set up correctly. You need to source the oneapi setvars.sh. It would also be wise to do "ldd lapw1_mpi" to check that you have the right libraries.

VidyalathaB_Intel · ‎12-06-2022

Hi Laurence,

Yes, I've already sourced the oneAPI setvars.sh script, but I'm still getting the mpirun command not found error (from the screenshot in my previous post you can see mpirun --version is working fine as I've already set up the oneAPI environment).

Do you have any idea about this error? For example, is there any script that affects the oneAPI environment setup?

Please do let me know so that I can proceed further.

Regards,

Vidya.

L__D__Marks · ‎12-06-2022

I do not think that anything should be overwritten in the PATH, but I am sitting in an airport waiting for a flight, so I cannot fully check. I suggest two things:
1) Before doing "x lapw1 -up -p -orb" do "which mpirun". If mpirun is not found then, you have an incomplete oneapi. You will need to add the impi package.
2) If you find mpirun, edit lapw1para and after:

#which def-file are we using?

if ($#argv < 1) then
echo usage: $0 deffile
exit
endif

Add (line 134) "which mpirun".

Let me know if the first works but the second fails. If that is the case I will try and replicate it myself.
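The same PATH check can also be made from any process started in the environment that launches the job, which helps separate "mpirun is not installed" from "a script changed PATH". This is a minimal Python sketch of that idea, not part of Wien2k; the function name find_launcher is hypothetical.

```python
import os
import shutil


def find_launcher(name="mpirun"):
    """Return the full path of `name` as resolved through PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_launcher()
    if location is None:
        # Print PATH entry by entry so a missing oneAPI directory is easy to spot.
        print("mpirun not found; PATH entries are:")
        for entry in os.environ.get("PATH", "").split(os.pathsep):
            print("  ", entry)
    else:
        print("mpirun resolves to", location)
```

Running it once in the interactive shell and once from inside the job script shows whether the two environments differ.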

VidyalathaB_Intel · ‎12-14-2022

Hi Laurence,

Could you please let us know how much time it would take approximately to finish this step x lapw1 -up -p -orb?

I tried running it for about 2.5 hrs but still, it kept running.

Regards,

Vidya.

L__D__Marks · ‎12-14-2022

If you have done cp M2 .machines, where M2 is (with nodes edited)

granularity:1
omp_lapw1:2
1:node01:32 node02:32 node03:32
extrafine

That should take 27 minutes. M1 should take about 45 minutes. If you used instead

granularity:1
omp_lapw1:1
1:node01:64

extrafine

In principle it should take about 90 minutes. It will either crash somewhere in impi, or run forever for reasons I do not understand.

Please ensure that you did not oversubscribe the number of mpi processes, as then they compete/conflict and it may never stop.
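To check for oversubscription, multiply the mpi processes requested on each node by the openmp threads and compare with the cores available. Below is a minimal Python sketch of that arithmetic. It assumes only the .machines layout of the two files quoted above (an omp_lapw1:N line, and lines of the form weight:host:count host:count ...); the function names are hypothetical and it is not a full parser for the format.

```python
from collections import Counter


def threads_per_node(machines_text):
    """Return {host: mpi ranks x openmp threads} requested in a .machines file."""
    omp = 1
    ranks = Counter()
    for raw in machines_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, rest = line.partition(":")
        if key == "omp_lapw1":
            omp = int(rest)
        elif key.isdigit():
            # Assumed form: weight:host:count [host:count ...]
            for item in rest.split():
                host, _, count = item.partition(":")
                ranks[host] += int(count) if count else 1
        # Other keywords (granularity, extrafine, ...) do not affect the count.
    return {host: n * omp for host, n in ranks.items()}


def oversubscribed(machines_text, cores_per_node):
    """Return only the hosts asked to run more threads than they have cores."""
    demand = threads_per_node(machines_text)
    return {h: t for h, t in demand.items() if t > cores_per_node}


M2 = """\
granularity:1
omp_lapw1:2
1:node01:32 node02:32 node03:32
extrafine
"""

print(threads_per_node(M2))
print(oversubscribed(M2, cores_per_node=64))
```

For the M2 file this gives 64 threads per node, which exactly fills a 64 core node; asking for 64 mpi processes per node with omp_lapw1:2 would request 128 and be flagged.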

VidyalathaB_Intel · ‎12-22-2022

Hi Laurence,

If possible, could you please try to isolate the issue into a small sample reproducer for the performance difference that you are observing with hybrid mode, so that it would be easier to address it quickly?

Regards,

Vidya.

L__D__Marks · ‎12-22-2022

I provided you with a "real" reproducer, that represents hard-core scientific computing with multicore parallel programming. It does require some setup of a proper Linux system. That is life.

It is not appropriate to make a toy version, it will not be representative.

I assume that you still cannot get it to run. What is your .machines file? What CPU are you using? What is your network?

VidyalathaB_Intel · ‎12-27-2022

Hi Laurence,

Could you please let me know if I'm on right track in replicating the issue that you are observing?

Here is the screenshot of the output of x lapw1 -p -orb -up command (5th step in Readme)

Here is the screenshot of the output of x lapw1 -p -orb -up command (8th step in Readme)

Regards,

Vidya.

L__D__Marks · ‎12-27-2022

I think you have replicated the issue. After each of the two steps please do "tail *.output1up_1" (see below), which will give a more readable output in the last two lines.

[lma712@quser21 PtF]$ tail *output1up_1
0.6516724 0.6517651 0.6518586 0.6521098 0.6522035
0.6522145 0.6523540 0.6524470 0.6525784 0.6528393
0 EIGENVALUES BELOW THE ENERGY -11.00000
********************************************************

NUMBER OF K-POINTS: 2
===> TOTAL CPU TIME: 1979.8 (INIT = 16.8 + K-POINTS = 1963.0)
> SUM OF WALL CLOCK TIMES: 1020.1 (INIT = 17.1 + K-POINTS = 1003.0)
Maximum WALL clock time: 2230.08410000001
Maximum CPU time: 4212.38736000000

L__D__Marks · ‎12-27-2022

For reference, with just mpi the last two lines I have are

Maximum WALL clock time: 3520.51289999999
Maximum CPU time: 3510.76150400000

And combining omp & mpi
Maximum WALL clock time: 2228.30399999999
Maximum CPU time: 4237.83427500000

WALL is the expired time (in seconds), which is what matters. I expect your numbers to be similar.
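The arithmetic on these four numbers can be written out as a short Python sketch. It uses only the values quoted above; the variable and function names are just for illustration.

```python
# "Maximum WALL clock time" and "Maximum CPU time" in seconds, as quoted above.
pure_mpi = {"wall": 3520.51289999999, "cpu": 3510.76150400000}
hybrid = {"wall": 2228.30399999999, "cpu": 4237.83427500000}


def speedup(baseline_wall, new_wall):
    """How many times faster the new run finished in elapsed time."""
    return baseline_wall / new_wall


def cpu_to_wall(run):
    """Ratio of CPU time to elapsed time for one run."""
    return run["cpu"] / run["wall"]


print(f"hybrid speedup over pure mpi: {speedup(pure_mpi['wall'], hybrid['wall']):.2f}")
print(f"CPU/WALL, pure mpi: {cpu_to_wall(pure_mpi):.2f}")
print(f"CPU/WALL, hybrid:   {cpu_to_wall(hybrid):.2f}")
```

The WALL ratio comes out at about 1.58, in line with the roughly 1.6 times reported at the start of the thread. The CPU/WALL ratio is close to 1 for pure mpi and close to 2 for hybrid, which is what one would expect if the CPU figure sums over the 2 openmp threads.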

VidyalathaB_Intel · ‎12-28-2022

Hi Laurence,

Could you please check the screenshot below and let me know if this is the expected result (I guess there is some difference)?

Regards,

Vidya.

L__D__Marks · ‎12-29-2022

Something is very odd with your numbers, as WALL should be only slightly larger than CPU with pure MPI, and about 1/2 of CPU in hybrid. Thoughts:

1) Do you have a fast InfiniBand or a slow Ethernet interconnect?

2) Is hyperthreading on (it can get in the way)?

3) Was anyone else running on those nodes, beyond standard OS?

Please do "grep -ie time Up_1 Up_2" and find a way to get me the output. It might give me a clue.

Currently confused!

VidyalathaB_Intel · ‎01-05-2023

Hi Laurence,

Apologies for the delay.

Please find the output of grep -ie time Up_1 Up_2 attached.

In my case the HT is disabled and I'm running the application on a cluster of nodes connected by Intel Omni-Path interconnect.

Regards,

Vidya.