Intel® oneAPI Math Kernel Library

MKL ScaLAPACK P?GEMM subroutines return wrong results under specific conditions

luise
Beginner

Hello everyone,

I have encountered a problem with the ScaLAPACK routine PZGEMM for certain combinations of the process grid dimensions (nprow, npcol), the global matrix sizes (N1, N2), and the block sizes of the 2D block-cyclic distribution (mb, nb). The problem occurs when the matrix multiplication is applied to submatrices that do not start at global row/column index 1 (i.e. IA /= 1 or JA /= 1).

You can reproduce the issue with the basic program test_pvgemm.F90, which is available on GitHub in the P_GEMM_check repository. This program defines an N2 x N2 distributed matrix B and an (N1+N2) x (N1+N2) distributed matrix C. It then calls the subroutine p?gemm to perform the following operation:

B = B - C21 * C12 

where C12 and C21 are the submatrices:

C21 = C(N1+1:N1+N2, 1:N1)

C12 = C(1:N1, N1+1:N1+N2)

All entries of matrix C are set to 1, and all entries of matrix B are initially set to N1. Every entry of the product C21 * C12 is a sum of N1 ones, i.e. N1, so all entries of B are expected to end up equal to 0.

The call to the subroutine p?gemm is:

...
call pdgemm('N', 'N', N2, N2, N1, &
-1.0_dp, C, N1 + 1, 1, Cdesc, C, 1, N1 + 1, Cdesc, &
1.0_dp, B, 1, 1, Bdesc)
...
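
For reference, the operation requested by this call can be modelled serially. The following is a minimal pure-Python sketch, not part of the original reproducer: the helper name `reference_update` and the small N2 are illustrative assumptions. It uses the same 1-based starting indices as the call, (IA, JA) = (N1+1, 1) for C21 and (IB, JB) = (1, N1+1) for C12.

```python
def reference_update(n1, n2):
    """Serial model of B := B - C21 * C12 with the values used in the test."""
    n = n1 + n2
    c = [[1.0] * n for _ in range(n)]          # all entries of C are 1
    b = [[float(n1)] * n2 for _ in range(n2)]  # all entries of B start at N1

    # 1-based starting indices, mirroring the p?gemm arguments
    ia, ja = n1 + 1, 1   # sub(A) = C(N1+1:N1+N2, 1:N1)  = C21
    ib, jb = 1, n1 + 1   # sub(B) = C(1:N1, N1+1:N1+N2)  = C12
    alpha, beta = -1.0, 1.0

    for i in range(n2):
        for j in range(n2):
            acc = 0.0
            for k in range(n1):
                acc += c[ia - 1 + i][ja - 1 + k] * c[ib - 1 + k][jb - 1 + j]
            b[i][j] = alpha * acc + beta * b[i][j]
    return b


b = reference_update(n1=20, n2=9)
max_abs = max(abs(x) for row in b for x in row)
print("largest |B(i,j)| after the update:", max_abs)
```

Any nonzero entry in the distributed result therefore points at the library call rather than at the expected values.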

The test fails (some values in the returned matrix B are not zero, and are displayed on the standard output) when N1, N2, mb, nb, nprow and npcol are combined in certain ways, such as:

  • mb = 4 ; nb = 4 ; N1 = 20 ; N2 > 1025; nprow = 2; npcol = 2
  • mb = 4 ; nb = 4 ; N1 = 20 ; N2 > 1537; nprow = 3; npcol = 4
  • mb = 32; nb = 32; N1 = 160; N2 > 2049; nprow = 4; npcol = 4
  • ...
The error occurs with MKL versions 2022.2.0, 2024.0 and 2025.0 (and probably others) when linking with Intel MPI, libmkl_blacs_intelmpi_lp64 and libmkl_scalapack_lp64. The error is silent: the code runs and exits normally.
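
One pattern in the combinations listed above is that each boundary value of N2 equals 512 * nprow + 1. The sketch below, in Python, reproduces the arithmetic of ScaLAPACK's NUMROC to show how many rows of B each process row owns at those boundary values. It assumes the distribution starts on process row 0 (RSRC = 0), which the post does not state. Whether the 512-row boundary is actually related to the failure is only a guess.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows/columns of a block-cyclically distributed dimension owned by
    process iproc (same arithmetic as ScaLAPACK's NUMROC)."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                    # number of full blocks
    count = (nblocks // nprocs) * nb     # full rounds of the cyclic deal
    extra = nblocks % nprocs             # leftover full blocks
    if mydist < extra:
        count += nb                      # one more full block
    elif mydist == extra:
        count += n % nb                  # the trailing partial block
    return count


# (mb, N2, nprow) taken from the three combinations listed in the post
cases = [(4, 1025, 2), (4, 1537, 3), (32, 2049, 4)]
local_rows = {}
for mb, n2, nprow in cases:
    # RSRC = 0 is an assumption, see the lead-in
    rows = [numroc(n2, mb, p, 0, nprow) for p in range(nprow)]
    local_rows[(mb, n2, nprow)] = rows
    print(f"mb={mb} N2={n2} nprow={nprow}: local rows of B = {rows}, "
          f"512*nprow+1 = {512 * nprow + 1}")
```

In all three cases process row 0 holds 513 local rows and the other process rows hold 512.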
 
We found that the problem disappears with the reference ScaLAPACK source code: when we link against a manually compiled ScaLAPACK 2.2.0 (downloaded from Netlib), the subroutine p?gemm returns the expected values and the test program finishes without reporting any erroneous values in all the cases we checked.

 

Cordially,

Sergio Llorente.

 

P.S.: This issue is similar to the one described in this post on the Intel forums (which was apparently fixed in MKL 2019u1).

Ruqiu_C_Intel
Moderator

Hello Sergio Llorente,


Thank you for raising your concern.

We will investigate this issue and will update here if there is any progress.


Regards,

Ruqiu


Ruqiu_C_Intel
Moderator

Hello Sergio Llorente,


I tested oneMKL 2025.0.1, oneMKL 2024.2, and oneMKL 2024.0, and your reproducer passed with all three versions. The logs are included below for reference. Is there anything I missed?


# MKL_VERBOSE=1 ./test_pzgemm

Procs = 1; Grid = (1 x 1)

mb = 4; nb = 4; N1 = 20; N2 = 1025

MKL_VERBOSE oneMKL 2025 Patch 1 Product build 20241031 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX-2) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), Lnx 2.10GHz lp64 intel_thread

MKL_VERBOSE ZGEMM(N,N,1025,1025,20,0x7ffe25daae38,0x7f7cec25b380,1045,0x7f7cec2acc80,1045,0x7ffe25daae48,0x7f7ceb05a280,1025) 28.98ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1

TEST PASSED!



# MKL_VERBOSE=1 ./test_pzgemm


Procs = 1; Grid = (1 x 1)

mb = 4; nb = 4; N1 = 20; N2 = 1025

MKL_VERBOSE oneMKL 2024.0 Update 2 Patch 1 Product build 20240722 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX-2) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), Lnx 2.10GHz lp64 intel_thread

MKL_VERBOSE ZGEMM(N,N,1025,1025,20,0x7fffc1870e48,0x7f4acef5e380,1045,0x7f4acefafc80,1045,0x7fffc1870e58,0x7f4acdd5d280,1025) 29.17ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1

TEST PASSED!


# MKL_VERBOSE=1 ./test_pzgemm


Procs = 1; Grid = (1 x 1)

mb = 4; nb = 4; N1 = 20; N2 = 1025

MKL_VERBOSE oneMKL 2024.0 Product build 20231011 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX-2) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), Lnx 2.10GHz lp64 intel_thread

MKL_VERBOSE ZGEMM(N,N,1025,1025,20,0x7ffe3b23fce8,0x7fab3075e380,1045,0x7fab307afc80,1045,0x7ffe3b23fcf8,0x7fab2f55d280,1025) 46.00ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1

TEST PASSED!


Regards,

Ruqiu


luise
Beginner

Hello Ruqiu,

 

The problem occurs when more than one process is used and for certain process grids. Please try:

$ MKL_VERBOSE=1 mpirun -n 4 ./test_pzgemm

This is my output (with MKL 2025.0):

Procs = 4; Grid = (2 x 2)
mb = 4; nb = 4; N1 = 20; N2 = 1025
MKL_VERBOSE oneMKL 2025 Product build 20241009 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), EVEX-encoded AES and Carry-Less Multiplication Quadword instructions, Lnx 2.30GHz lp64 intel_thread
MKL_VERBOSE ZGEMM(N,N,512,513,4,0x7fff63286a98,0x1018090,512,0x10300a0,4,0x7fff63286aa8,0x7f61a47c6280,513) 5.90ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,1,513,4,0x7fff63286a98,0x1020090,1,0x10300a0,4,0x7fff63286aa8,0x7f61a47c8280,513) 40.16us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE oneMKL 2025 Product build 20241009 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), EVEX-encoded AES and Carry-Less Multiplication Quadword instructions, Lnx 2.30GHz lp64 intel_thread
MKL_VERBOSE ZGEMM(N,N,512,512,4,0x7ffd4b84b0a8,0x83d070,512,0x845080,4,0x7ffd4b84b0b8,0x7fb36d7ce280,513) 5.89ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,1,512,4,0x7ffd4b84b0a8,0x82d070,1,0x845080,4,0x7ffd4b84b0b8,0x7fb36d7d0280,513) 41.27us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE oneMKL 2025 Product build 20241009 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), EVEX-encoded AES and Carry-Less Multiplication Quadword instructions, Lnx 2.30GHz lp64 intel_thread
MKL_VERBOSE ZGEMM(N,N,512,512,4,0x7fff51eafd28,0x122c070,512,0x123c0c0,4,0x7fff51eafd38,0x7f8dec3d6280,512) 5.95ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE oneMKL 2025 Product build 20241009 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), EVEX-encoded AES and Carry-Less Multiplication Quadword instructions, Lnx 2.30GHz lp64 intel_thread
MKL_VERBOSE ZGEMM(N,N,512,513,4,0x7ffff35918a8,0x1762070,512,0x17820c0,4,0x7ffff35918b8,0x7f1ad8fce280,512) 6.06ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,512,4,0x7ffd4b84b0a8,0x82d070,512,0x84d0c0,4,0x7fb3d390a410,0x7fb36d7ce280,513) 602.84us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,1,512,4,0x7ffd4b84b0a8,0x835070,1,0x84d0c0,4,0x7fb3d390a410,0x7fb36d7d0280,513) 3.40us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,513,4,0x7ffff35918a8,0x1772070,512,0x177a080,4,0x7f1b3f10a410,0x7f1ad8fce280,512) 730.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,512,4,0x7fff51eafd28,0x121c070,512,0x1234080,4,0x7f8e5250a410,0x7f8dec3d6280,512) 758.19us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,513,4,0x7fff63286a98,0x1028090,512,0x10380e0,4,0x7f620a90a410,0x7f61a47c6280,513) 711.08us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,1,513,4,0x7fff63286a98,0x1018090,1,0x10380e0,4,0x7f620a90a410,0x7f61a47c8280,513) 4.76us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,512,4,0x7ffd4b84b0a8,0x83d070,512,0x845080,4,0x7fb3d390a410,0x7fb36d7ce280,513) 198.66us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,1,512,4,0x7ffd4b84b0a8,0x82d070,1,0x845080,4,0x7fb3d390a410,0x7fb36d7d0280,513) 3.04us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,513,4,0x7ffff35918a8,0x1762070,512,0x17820c0,4,0x7f1b3f10a410,0x7f1ad8fce280,512) 191.97us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,512,4,0x7fff51eafd28,0x122c070,512,0x123c0c0,4,0x7f8e5250a410,0x7f8dec3d6280,512) 192.12us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,513,4,0x7fff63286a98,0x1018090,512,0x10300a0,4,0x7f620a90a410,0x7f61a47c6280,513) 259.47us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,1,513,4,0x7fff63286a98,0x1020090,1,0x10300a0,4,0x7f620a90a410,0x7f61a47c8280,513) 3.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,513,4,0x7fff63286a98,0x1028090,512,0x10380e0,4,0x7f620a90a410,0x7f61a47c6280,513) 287.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,1,513,4,0x7fff63286a98,0x1018090,1,0x10380e0,4,0x7f620a90a410,0x7f61a47c8280,513) 4.14us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,512,4,0x7ffd4b84b0a8,0x82d070,512,0x84d0c0,4,0x7fb3d390a410,0x7fb36d7ce280,513) 287.43us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,1,512,4,0x7ffd4b84b0a8,0x835070,1,0x84d0c0,4,0x7fb3d390a410,0x7fb36d7d0280,513) 4.24us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,513,4,0x7ffff35918a8,0x1772070,512,0x177a080,4,0x7f1b3f10a410,0x7f1ad8fce280,512) 269.43us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,512,4,0x7fff51eafd28,0x121c070,512,0x1234080,4,0x7f8e5250a410,0x7f8dec3d6280,512) 278.26us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,513,4,0x7ffff35918a8,0x1762070,512,0x17820c0,4,0x7f1b3f10a410,0x7f1ad8fce280,512) 228.02us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,512,4,0x7fff51eafd28,0x122c070,512,0x123c0c0,4,0x7f8e5250a410,0x7f8dec3d6280,512) 232.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,513,4,0x7fff63286a98,0x1018090,512,0x10300a0,4,0x7f620a90a410,0x7f61a47c6280,513) 241.55us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,1,513,4,0x7fff63286a98,0x1020090,1,0x10300a0,4,0x7f620a90a410,0x7f61a47c8280,513) 2.64us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,512,512,4,0x7ffd4b84b0a8,0x83d070,512,0x845080,4,0x7fb3d390a410,0x7fb36d7ce280,513) 260.02us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,1,512,4,0x7ffd4b84b0a8,0x82d070,1,0x845080,4,0x7fb3d390a410,0x7fb36d7d0280,513) 2.86us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
B(1025, 5) = (4.00000000000000,0.000000000000000E+000)
B(1025, 6) = (4.00000000000000,0.000000000000000E+000)
B(1025, 7) = (4.00000000000000,0.000000000000000E+000)
B(1025, 8) = (4.00000000000000,0.000000000000000E+000)
B(1025, 13) = (4.00000000000000,0.000000000000000E+000)
...
TEST FAILED in proc (0, 1) with 512 errors
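
The reported entries can be cross-checked against the block-cyclic layout. The Python sketch below assumes the descriptor uses RSRC = CSRC = 0 (not stated in the thread). The helper names `owner` and `numroc` are illustrative; `numroc` follows the arithmetic of ScaLAPACK's NUMROC.

```python
def owner(g, nb, nprocs, src=0):
    """Process coordinate owning 1-based global index g (block-cyclic)."""
    return ((g - 1) // nb + src) % nprocs


def numroc(n, nb, iproc, isrcproc, nprocs):
    """Local row/column count, same arithmetic as ScaLAPACK's NUMROC."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb
    count = (nblocks // nprocs) * nb
    extra = nblocks % nprocs
    if mydist < extra:
        count += nb
    elif mydist == extra:
        count += n % nb
    return count


mb = nb = 4
nprow = npcol = 2
n1, n2 = 20, 1025

# Entries of B printed as wrong in the output above: row 1025, these columns
bad_row, bad_cols = 1025, [5, 6, 7, 13]
owners = {(owner(bad_row, mb, nprow), owner(j, nb, npcol)) for j in bad_cols}
print("owning process(es):", owners)

# Columns of B held by process column 1, i.e. the length of row 1025 there
cols_on_pcol1 = numroc(n2, nb, 1, 0, npcol)
print("local columns on process column 1:", cols_on_pcol1)

# Local rows on process row 0, where global row 1025 lives
rows_on_prow0 = numroc(n2, mb, 0, 0, nprow)
print("local rows on process row 0:", rows_on_prow0)

# A residual of 4 means only N1 - 4 of the N1 unit products were subtracted
subtracted = n1 - 4
print("products subtracted per wrong entry:", subtracted)
```

Under these assumptions, every listed entry lives on process (0, 1), and that process holds 512 columns of row 1025, which matches the 512 reported errors. Process row 0 holds 513 local rows, consistent with the paired ZGEMM calls with M = 512 and M = 1 in the log. The residual of 4 corresponds to 16 of the 20 products being applied, i.e. one block of width 4 appears to be missing for that last row. This is an interpretation of the output, not a confirmed diagnosis.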

Regards,

Sergio.

Ruqiu_C_Intel
Moderator

Hi Sergio,


We reproduced the problem with 4 processes; the test passed with other process counts, for example 2, 3, 5, 6 ...

We will investigate the issue further and post an update here once there is progress.


Regards,

Ruqiu

