Hi, my name is Edward Hutter and I am a graduate student at the University of Illinois. I have been benchmarking my code against MKL's ScaLAPACK routines for Householder QR factorization, PDGEQRF and PDORGQR, on Stampede2, and after extensive testing I have noticed a possible bug. At 512 nodes on Stampede2, using 2D processor grids of 4096x4 and 2048x8, there is a hang. This happens regardless of matrix size. We have tested a 262144 x 2048 matrix with a 2048x8 processor grid and a 1048576 x 512 matrix with a 4096x4 processor grid over block sizes 1, 2, 4, 8, 16, 32, and 64. All of these configurations were tested individually, and each one hangs. Note that other processor grid dimensions at 512 nodes, such as 1024x16 and 512x32, work correctly. We also tried both failing processor grids at 1024 nodes with 16 ppn and at 512 nodes with 32 ppn. Both hang.
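For reference, here is a minimal Python sketch (not part of the attached test code) that works out the per-process local matrix size for the two failing configurations. The `numroc` function is my own re-implementation of the arithmetic in ScaLAPACK's NUMROC tool routine, and `local_shape` is a hypothetical helper name; I assume the source process is 0 and look only at the process in grid position (0, 0).

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows/columns of a block-cyclically distributed dimension owned by
    process `iproc` (re-implementation of ScaLAPACK's NUMROC arithmetic)."""
    mydist = (nprocs + iproc - isrcproc) % nprocs  # distance from source process
    nblocks = n // nb                              # number of whole blocks
    count = (nblocks // nprocs) * nb               # blocks every process receives
    extrablks = nblocks % nprocs                   # leftover whole blocks
    if mydist < extrablks:
        count += nb                                # one extra whole block
    elif mydist == extrablks:
        count += n % nb                            # the trailing partial block
    return count


def local_shape(m, n, nb, grid_rows, grid_cols):
    """Local (rows, cols) held by the process at grid position (0, 0)."""
    return (numroc(m, nb, 0, 0, grid_rows), numroc(n, nb, 0, 0, grid_cols))


# (matrix rows, matrix cols, grid rows, grid cols) for the two hanging cases
FAILING = [
    (262144, 2048, 2048, 8),
    (1048576, 512, 4096, 4),
]

if __name__ == "__main__":
    for m, n, pr, pc in FAILING:
        for nb in (1, 2, 4, 8, 16, 32, 64):
            print(m, n, f"{pr}x{pc}", "nb =", nb, "local =", local_shape(m, n, nb, pr, pc))
```

Both grids use 4096 x 4 = 2048 x 8 = 16384 processes, which matches 512 nodes at 32 ppn or 1024 nodes at 16 ppn. The sketch only shows that the local pieces are small and evenly divided for every tested block size; it says nothing about the cause of the hang.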
I have attached a simple test code that reproduces the hang. I provide the source file, the Makefile, and a script with which to launch a job. Simply enter "make" while in the Intel environment. The specific modules loaded are: intel/17.0.4, impi/17.0.3, git/2.9.0, autotools/1.1, python/2.7.13, xalt/2.1.2, and TACC. The script launches two configurations. The parameters are, in order: 1) number of rows in the matrix, 2) number of columns in the matrix, 3) block size, 4) number of iterations of PDGEQRF and PDORGQR, 5) always leave as zero, 6) number of rows in the processor grid (the number of columns is inferred from this and the total number of processes), 7) always leave as zero. Other details, such as compilation flags, are in the Makefile.
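To make the parameter order concrete, here is a small Python sketch of how the seven parameters map to a run configuration, including how the number of processor grid columns is inferred. This is an illustration only and is not the attached source; `RunConfig` and `parse_run_args` are hypothetical names, and the block size and iteration count in the usage line are example values.

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    rows: int        # 1) number of rows in the matrix
    cols: int        # 2) number of columns in the matrix
    block_size: int  # 3) block size
    iterations: int  # 4) iterations of PDGEQRF and PDORGQR
    grid_rows: int   # 6) number of rows in the processor grid
    grid_cols: int   # inferred: total processes / grid_rows


def parse_run_args(args, total_procs):
    """Interpret the seven positional parameters (hypothetical helper)."""
    if len(args) != 7:
        raise ValueError("expected exactly 7 parameters")
    rows, cols, nb, iters, zero_a, grid_rows, zero_b = (int(a) for a in args)
    if zero_a != 0 or zero_b != 0:
        raise ValueError("parameters 5 and 7 must be left as zero")
    if grid_rows <= 0 or total_procs % grid_rows != 0:
        raise ValueError("grid rows must evenly divide the total process count")
    return RunConfig(rows, cols, nb, iters, grid_rows, total_procs // grid_rows)


if __name__ == "__main__":
    # Example values only: 16384 processes, block size 64, one iteration.
    print(parse_run_args("262144 2048 64 1 0 2048 0".split(), 16384))
```

With 16384 processes, passing 2048 as the sixth parameter gives a 2048x8 grid and passing 4096 gives a 4096x4 grid.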
It has been suggested to me that the issue could be in the Intel MPI that I have loaded on Stampede2. Just to note, I have tested the LibSci implementation of ScaLAPACK on Blue Waters, which uses a different MPI, and it doesn't have this bug. Reference ScaLAPACK (Netlib) also has this bug on Stampede2 when loaded with Intel's MPI module. I will look into whether loading a different MPI module (possibly mvapich) on Stampede2 causes MKL ScaLAPACK to run correctly.
> The latest version of MKL is v.2019. Could you check this version on this cluster?
> Did you see the problem when the number of nodes is less than 512, e.g. 128, 64, or 32? Did you try this?
The latest version of MKL on Stampede2 is under compilers_and_libraries_2018.2.199, and that isn't set as the default MKL module. I don't think the issue is in MKL, but more likely in Intel's MPI. I am currently testing to see if this bug can be reproduced with the mvapich2 MPI module and MKL.
No, the hang appeared only for certain 2D processor grid dimensions, such as 4096x4 and 2048x8, when run with 512 nodes. Other processor grid dimensions at 512 nodes worked fine, and I had no problems with any node count less than 512.
I will get back to you once I get the results of my experiment described above.
I just verified that the hang still occurs with mvapich2 as the MPI module instead of IMPI. So I guess the problem must be in MKL? It seems like a very strange bug, but I have spent a lot of time verifying which ScaLAPACK configurations work and which don't. The failing configurations are listed in my original post.