Intel® oneAPI Math Kernel Library

super slow blacs_gridinit() with intel parallel studio

Kerry_K_
Novice

I've narrowed down a code scaling issue on my university cluster to the Intel MKL ScaLAPACK library function blacs_gridinit() and was wondering if others have seen this same problem. When compiled and run with Intel Parallel Studio (tested with both the 2020 and 2019 versions), the attached minimum working example takes an inordinate amount of time (10 s to execute blacs_gridinit() on 960 cores), whereas it takes only about 0.85 s on 960 cores when compiled with gcc.
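The test code boils down to a timed call to blacs_gridinit(), roughly along these lines (a sketch, not the exact attached file; the grid-shape selection and the use of MPI_Wtime for timing are illustrative assumptions):

 program test_gridinit
   implicit none
   include 'mpif.h'
   integer :: iam, nprocs, ictxt, nprow, npcol, myrow, mycol
   double precision :: t0, t1

   ! blacs_pinfo initializes MPI if it has not been initialized yet
   call blacs_pinfo(iam, nprocs)

   ! pick a roughly square process grid (e.g. 960 -> 30 x 32)
   nprow = int(sqrt(dble(nprocs)))
   do while (mod(nprocs, nprow) /= 0)
      nprow = nprow - 1
   end do
   npcol = nprocs / nprow

   ! get the default system context and time only the grid creation
   call blacs_get(-1, 0, ictxt)
   t0 = mpi_wtime()
   call blacs_gridinit(ictxt, 'R', nprow, npcol)
   t1 = mpi_wtime()

   call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)
   if (iam == 0) write(*,'(a,3i7,f10.3)') &
        ' nprocs,nprow,npcol,blacs_gridinit (s):', nprocs, nprow, npcol, t1 - t0

   call blacs_gridexit(ictxt)
   call blacs_exit(0)
 end program test_gridinit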

Details:

Intel Parallel Studio:

$ srun --pty --nodes=40 --ntasks-per-node=24 --exclusive --time=4:00:00  --account emlab /bin/bash
$ which mpif90
/rigel/opt/parallel_studio_xe_2020/compilers_and_libraries_2020.4.304/linux/mpi/intel64/bin/mpif90
$ rm *.o; make  <...> 
$ mpirun -np 960  test_gridinit
 nprocs,nprow,npcol,blacs_gridinit (s):    960     30     32    10.504
 

GCC: 

$ module load openmpi/gcc/64/2.0.1  scalapack/openmpi/gcc/64/2.0.2
$ rm *.o; make <...> 
$ mpirun -np 960  test_gridinit
 nprocs,nprow,npcol,blacs_gridinit (s):    960     30     32     0.856

 

 

Kerry_K_
Novice

Note that line 31 in the test code has a typo: it should end with "t1-t0". But even so, gcc is still 10x faster.

Gennady_F_Intel
Moderator

It looks very strange, actually, as the performance shouldn't depend on which compiler was used to build the executable. Did you try Intel MPI and re-run this case?

Kerry_K_
Novice

The last number is the wall time in seconds, which shows that gcc + OpenMPI ran more than 10x faster than ifort + Intel MPI when specifically timing the blacs_gridinit() function. It seems like there is some connection issue with the Intel installation on the cluster, as the slowdown occurs when running on close to (or all of) the cores in a given Slurm allocation. But it is specific to the Intel compilers: on the same allocation I can recompile with gcc + OpenMPI and the code scales fine up to the full number of cores allocated.

Here's another set of tests, this time run on two 24-core nodes connected with EDR InfiniBand and compiled with the Intel Parallel Studio compilers.


 nprocs,nprow,npcol,blacs_gridinit (s):      2      2      1     0.002
 nprocs,nprow,npcol,blacs_gridinit (s):      4      2      2     0.003
 nprocs,nprow,npcol,blacs_gridinit (s):      8      2      4     0.002
 nprocs,nprow,npcol,blacs_gridinit (s):     12      4      3     0.003
 nprocs,nprow,npcol,blacs_gridinit (s):     24      4      6     0.006
 nprocs,nprow,npcol,blacs_gridinit (s):     36      6      6     2.100
 nprocs,nprow,npcol,blacs_gridinit (s):     48      6      8     3.106
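
(The scan above just re-runs the same binary with increasing process counts, along these lines; the exact invocation is an assumption.)

 for n in 2 4 8 12 24 36 48; do
     mpirun -np $n test_gridinit
 done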

 

Kerry_K_
Novice

This just happens for blacs_gridinit in mkl_scalapack_lp64.  All other MPI calls in my code run as fast as expected.  Here's the linking information from the Makefile:

 FC      = mpiifort  
 FFLAGS  = -O2 -mkl=sequential 
 LIB     =  -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64  -lm -ldl
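
Expanded into a single build command (assuming the source is in test_gridinit.f90; the file name here is just for illustration), that's effectively:

 mpiifort -O2 -mkl=sequential test_gridinit.f90 -o test_gridinit \
     -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lm -ldl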
   

I've run the Intel MPI Benchmarks and don't see this same type of slowdown, even when running on the exact same Slurm allocation with 48 cores. For example:

$ mpirun -np 48 IMB-MPI1 Bcast
...
#----------------------------------------------------------------
# Benchmarking Bcast 
# #processes = 48 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.18         0.06
            1         1000         1.85         3.98         2.77
            2         1000         0.94         4.85         2.61
            4         1000         1.06         4.20         2.60
            8         1000         0.95         4.67         2.76
           16         1000         0.94         3.51         2.17
           32         1000         0.94         3.53         2.19
           64         1000         0.96         3.82         2.34
          128         1000         0.98         4.31         2.39
          256         1000         1.01         4.08         2.52
          512         1000         1.30         4.38         2.85
         1024         1000         1.15         4.69         2.98
         2048         1000         1.32         5.58         3.59
         4096         1000         1.66         6.86         4.52
         8192         1000         2.51        10.40         7.10
        16384         1000         4.56        14.68        10.55
        32768         1000         8.56        23.06        17.47
        65536          640        24.28        36.86        32.40
       131072          320        44.98        65.65        59.69
       262144          160       132.78       166.78       154.80
       524288           80       234.22       294.71       268.65
      1048576           40       470.87       534.97       506.52
      2097152           20      1028.70      1724.68      1376.94
      4194304           10      1994.20      4423.80      4087.09

 

So it seems like the compilers are otherwise working as fast as expected, and it's just a problem specifically related to blacs_gridinit(). In some cases that function takes up to a minute to initialize the grid on ~960 cores, which is quite ironic since I can then use ScaLAPACK on that grid to solve a 30000 x 30000 dense linear system in much less time than the initialization took.

Kerry_K_
Novice

I meant to say blacs_gridinit() in mkl_blacs_intelmpi_lp64.

Gennady_F_Intel
Moderator

How did you build this example with gcc? It would be interesting to check that case with -np 48 as well.
