Solved: Overwritting input matrix in gemm

ddavobsc · ‎04-04-2025

Hello, I am trying to use gemm to make some complex computations, and sometimes, if the result matrix is one of the input matrices, it gives an incorrect result.

I haven't seen any mention of this in the documentation:

// Linked to Intel MKL
#include <cblas.h>

// Lets say we have 3 square matrices of size `size`
// This works
cblas_tgemm<T>(layout, CblasNoTrans, CblasNoTrans, size, size, size, 1.0, A, size, B, size, 0.0, C, size);

// This also seems to work
cblas_tgemm<T>(layout, CblasNoTrans, CblasNoTrans, size, size, size, 1.0, A, size, B, size, 0.0, A, size);

// But this gives incorrect results
cblas_tgemm<T>(layout, CblasNoTrans, CblasNoTrans, size, size, size, 1.0, A, size, B, size, 0.0, B, size);

Ethan_F_Intel · ‎04-14-2025

To start off, I want to mention that this is not how the function is intended to be used so any incorrect results are to be expected. Using cblas_?gemm this way is not supported. The output matrix should not be the same as the input matrix. This is because they all can technically have different dimensions. A is an m by k matrix, B is a k by n matrix, and C is an m by n matrix. The only situation where using the input matrix to store the output might technically pass is when they all are the same size. Even still, in my investigations I could not find a situation where [A = A*B] worked but [B = A*B] didn't. Either both work, or neither do. Here's an example:

C <- 1.0 * ( A * B ) + 0.0 * C

A <- 1.0 * ( A * B ) + 0.0 * A

B <- 1.0 * ( A * B ) + 0.0 * B

Amxk
0.840188 0.394383 0.783099 0.79844
0.911647 0.197551 0.335223 0.76823
0.277775 0.55397 0.477397 0.628871
0.364784 0.513401 0.95223 0.916195

Bkxn
0.635712 0.717297 0.141603 0.606969
0.0163006 0.242887 0.137232 0.804177
0.156679 0.400944 0.12979 0.108809
0.998925 0.218257 0.512932 0.839112

Cmxn
1.46082 1.1867 0.684279 1.58231
1.40269 1.00398 0.59376 1.39331
0.888607 0.662464 0.499886 1.19373
1.30467 0.968114 0.715646 1.50668

Amxk
0.840188 0.394383 0.783099 0.79844
0.911647 0.197551 0.335223 0.76823
0.277775 0.55397 0.477397 0.628871
0.364784 0.513401 0.95223 0.916195

Bkxn
0.635712 0.717297 0.141603 0.606969
0.0163006 0.242887 0.137232 0.804177
0.156679 0.400944 0.12979 0.108809
0.998925 0.218257 0.512932 0.839112

A (result)
1.46082 1.1867 0.684279 1.58231
1.40269 1.00398 0.59376 1.39331
0.888607 0.662464 0.499886 1.19373
1.30467 0.968114 0.715646 1.50668

Amxk
0.840188 0.394383 0.783099 0.79844
0.911647 0.197551 0.335223 0.76823
0.277775 0.55397 0.477397 0.628871
0.364784 0.513401 0.95223 0.916195

Bkxn
0.635712 0.717297 0.141603 0.606969
0.0163006 0.242887 0.137232 0.804177
0.156679 0.400944 0.12979 0.108809
0.998925 0.218257 0.512932 0.839112

B (result)
1.46082 1.1867 0.684279 1.58231
1.40269 1.00398 0.59376 1.39331
0.888607 0.662464 0.499886 1.19373
1.30467 0.968114 0.715646 1.50668

It made no difference whether the result was stored in A or B. It worked either way with correct results. While this may have succeeded for this small example, Intel MKL does not claim to support this behavior. Once the matrices get larger in size (like 256x256 on my machine) the result becomes incorrect.

View solution in original post

Ethan_F_Intel · ‎04-14-2025