I am using a code (Wien2k) that extensively exploits LAPACK/ScaLAPACK via the MKL library, and can also run in hybrid mode with OpenMP+MPI. In my prior experience, and that of others, hybrid mode with 2 OpenMP threads was slightly slower, perhaps by 10%.
With a 64-core Gold 6338 it is very different: with 2 OpenMP threads and the rest MPI it is ~1.6 times faster! I cannot explain this, and I am wondering whether it somehow relates to the architecture or is a bug when using all 64 MPI ranks.
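For context, here is a minimal sketch of the environment knobs that typically distinguish the two launch modes with Intel MPI + MKL (Wien2k's parallel scripts drive the actual mpirun lines themselves, so the variable choices below are illustrative, not the code's own settings):

```shell
# Pure-MPI baseline: 64 ranks, 1 thread each
#   export OMP_NUM_THREADS=1
# Hybrid mode: 32 ranks x 2 OpenMP threads per rank
export OMP_NUM_THREADS=2
export MKL_NUM_THREADS=$OMP_NUM_THREADS   # keep MKL threading consistent with OpenMP
export I_MPI_PIN_DOMAIN=omp               # Intel MPI: pin each rank to a domain
                                          # sized by OMP_NUM_THREADS
echo "threads per rank: $OMP_NUM_THREADS"
```

Without pinning, hybrid ranks can migrate across cores and the comparison becomes noisy, so fixing the domain layout is usually part of any pure-MPI vs. hybrid timing test.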
For reference, I am using the 2021.1.1 versions of mkl/compiler/impi, as later ones don't work for reasons I have not been able to determine (a large matrix-eigensolving program hangs).
I can provide a way to reproduce this, but it would involve transferring a large code & some control files.
Thanks for reaching out to us.
>>I can provide a way to reproduce this..
It would be a great help if you can provide us with a sample reproducer code and steps to reproduce it so that we can check it from our end as well.
The Intel version is 2021.1.1. Later versions have worse problems, hanging with no information.
$ uname -a
Linux qnode1058 3.10.0-1160.71.1.el7.x86_64 #1 SMP Wed Jun 15 08:55:08 UTC 2022 x86_64 x86_64 x86_64
[lma712@qnode1058 ~]$ head -20 /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
stepping : 6
microcode : 0xd000363
cpu MHz : 2000.000
cache size : 49152 KB
physical id : 0
siblings : 32
core id : 0
cpu cores : 32
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
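One thing the fields above already tell us: "siblings" equals "cpu cores" (32 and 32), which means hyperthreading is disabled on this node. A minimal sketch of that check, with the values hard-coded from the paste (on a live node you would grep them out of /proc/cpuinfo instead):

```shell
# Hyperthreading check from the /proc/cpuinfo fields quoted above:
siblings=32   # "siblings" field (logical CPUs per package)
cores=32      # "cpu cores" field (physical cores per package)
if [ "$siblings" -eq "$cores" ]; then
  ht_state="off"
else
  ht_state="on"
fi
echo "hyperthreading is $ht_state"
```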
Thanks for providing us with the details.
Could you please let us know if there are any additional dependencies that need to be installed in this case?
If yes, please provide us with the required information so that we can proceed further.
For reference, I am seeing the 6338 as about 1.5 times slower (normalized to the number of cores) than a 6130 in pure MPI, with a speedup of about 1.75 using hybrid.
I spoke to a colleague in Cambridge, UK and they have seen something similar, in fact far worse -- it can be a factor of 10. You are probably going to hear multiple grumbles from major supercomputer users around the world on similar issues.
I tried following the steps provided in the README, and this is the error I get when trying to run this step:
x lapw1 -p -orb -up
Could you please let me know what I am missing here and help me resolve this issue?
Yes, I've already sourced the oneAPI setvars.sh script, but I'm still getting the "mpirun command not found" error (from the screenshot in my previous post you can see mpirun --version works fine, as I've already set up the oneAPI environment).
Do you have any idea about this error, i.e. is there any script that affects the oneAPI environment setup?
Please do let me know so that I can proceed further.
1) Before doing "x lapw1 -up -p -orb", do "which mpirun". If mpirun is not found, then you have an incomplete oneAPI installation; you will need to add the impi package.
2) If you find mpirun, edit lapw1para and after:
#which def-file are we using?
if ($#argv < 1) then
echo usage: $0 deffile
Add (line 134) "which mpirun".
Let me know if the first works but the second fails. If that is the case I will try and replicate it myself.
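The first check above can be scripted like this (a sketch; the setvars.sh path is the usual default install location, adjust it to yours):

```shell
# Step 1: confirm the oneAPI environment actually exports mpirun.
# source /opt/intel/oneapi/setvars.sh   # uncomment/adjust for your install
if command -v mpirun >/dev/null 2>&1; then
  msg="mpirun found: $(command -v mpirun)"
else
  msg="mpirun not found - the impi package is missing from this oneAPI install"
fi
echo "$msg"
```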
Could you please let us know approximately how much time it would take to finish this step, x lapw1 -up -p -orb?
I tried running it for about 2.5 hrs, but it kept running.
If you have done cp M2 .machines, where M2 is (with nodes edited)
1:node01:32 node02:32 node03:32
That should take 27 minutes. M1 should take about 45 minutes. If you used instead
In principle it should take about 90 minutes. It will either crash somewhere in impi, or run forever for reasons I do not understand.
Please ensure that you did not oversubscribe the number of MPI processes, as then they compete/conflict and it may never stop.
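A quick arithmetic sanity check for the oversubscription point: ranks times threads must not exceed the physical cores per node (64 on the Gold 6338). A minimal sketch with the hybrid counts used above:

```shell
# Oversubscription check: MPI ranks x OpenMP threads vs. physical cores.
ranks=32     # MPI processes per node in the hybrid run
threads=2    # OpenMP threads per rank
cores=64     # physical cores on the Gold 6338 node (HT off)
used=$(( ranks * threads ))
if [ "$used" -le "$cores" ]; then
  echo "ok: $used processes for $cores cores"
else
  echo "oversubscribed: $used processes for $cores cores"
fi
```

With a pure-MPI run on the same node the check is 64 x 1; anything like 64 ranks x 2 threads would oversubscribe and can hang exactly as described.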
If possible, could you please try to isolate the issue that you are facing in the form of a sample reproducer code to reproduce the performance issue that you are observing with hybrid mode so that it would be easier to address the issue quickly?
It is not appropriate to make a toy version; it would not be representative.
I assume that you still cannot get it to run. What is your .machines file? What CPU are you using? What is your network?
Could you please let me know if I'm on right track in replicating the issue that you are observing?
Here is the screenshot of the output of x lapw1 -p -orb -up command (5th step in Readme)
Here is the screenshot of the output of x lapw1 -p -orb -up command (8th step in Readme)
[lma712@quser21 PtF]$ tail *output1up_1
 0.6516724 0.6517651 0.6518586 0.6521098 0.6522035
0.6522145 0.6523540 0.6524470 0.6525784 0.6528393
0 EIGENVALUES BELOW THE ENERGY -11.00000
NUMBER OF K-POINTS: 2
===> TOTAL CPU TIME: 1979.8 (INIT = 16.8 + K-POINTS = 1963.0)
> SUM OF WALL CLOCK TIMES: 1020.1 (INIT = 17.1 + K-POINTS = 1003.0)
Maximum WALL clock time: 2230.08410000001
Maximum CPU time: 4212.38736000000
For reference, with just MPI the last two lines I have are
Maximum WALL clock time: 3520.51289999999
Maximum CPU time: 3510.76150400000
And combining OpenMP & MPI
Maximum WALL clock time: 2228.30399999999
Maximum CPU time: 4237.83427500000
WALL is the elapsed time (in seconds), which is what matters. I expect your numbers to be similar.
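From the two "Maximum WALL clock time" values quoted above (rounded to whole seconds), the hybrid speedup works out as follows, consistent with the ~1.6x reported at the start of the thread:

```shell
# Hybrid speedup from the wall-clock times quoted above (seconds, rounded):
pure_mpi_wall=3520   # pure-MPI "Maximum WALL clock time"
hybrid_wall=2228     # 2 OpenMP threads + MPI
# integer arithmetic: speedup scaled by 100
speedup_pct=$(( pure_mpi_wall * 100 / hybrid_wall ))
echo "wall-clock speedup ~ ${speedup_pct}/100, i.e. about 1.57x"
```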
Something is very odd with your numbers, as WALL should be only slightly larger than CPU with pure MPI, and about half the CPU time in hybrid. Thoughts:
1) Do you have a fast InfiniBand or a slow Ethernet interconnect?
2) Is hyperthreading on (it can get in the way)?
3) Was anyone else running on those nodes, beyond standard OS?
Please do "grep -ie time Up_1 Up_2" and find a way to get me the output. It might give me a clue.