- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hello forum,

I am trying to write a test program to port a cblas_gemm call to cblas_s8u8u32 call for a quantization evaluation experiment.

Below is the test program, and the output obtained. I was looking for some help as the output of gemm and s8u8u32 does not match. So any review comments of the test program will be of great help to debug.

Thanks,

Ram

***** test program ******

#include <stdio.h>

#include <time.h>

#include <stdlib.h>

#include <string.h>

#include <math.h>

#include <mkl.h>

#include <mkl_cblas.h>

#include <mkl_blas.h>

void PrintMatrix(float* pMatrix, const size_t nR, const size_t nC, const CBLAS_ORDER Order) {

unsigned int i, j;

if (Order == CblasRowMajor) {

for (i = 0; i < nR; i++) {

for (j = 0; j < nC; j++) {

fprintf(stderr,"%f \t ", pMatrix[i * nC + j]); // !!!

}

fprintf(stderr,"\n"); // !!!

}

} else {

for (i = 0; i < nR; i++) {

for (j = 0; j < nC; j++) {

fprintf(stderr,"%f \t ", pMatrix[i + j* nR ]); // !!!

}

fprintf(stderr,"\n"); // !!!

}

}

fprintf(stderr,"\n"); // !!!

}

int main(void)

{

const int m = 4;

const int n = 8;

const int k = 1;

float A[] = {0.1,0.2,0.3,0.4,0.5,0.1,0.2,0.3, 0.1,0.2,0.3,0.4,0.5,0.1,0.2,0.3, 0.1,0.2,0.3,0.4,0.5,0.1,0.2,0.3, 0.1,0.2,0.3,0.4,0.5,0.1,0.2,0.1};

float B[] = { 0.1, 0.2, 0.3, 0.4 };

float alpha = 1.0, beta = 0.0;

float *a1 = (float *) malloc(m * n *sizeof(float)); //4x8

MKL_INT8 *a1_mkl_int8 = (MKL_INT8 *) malloc(m * n *sizeof(MKL_INT8)); //4x8

float *b1 = (float *) malloc(m * 1 * sizeof(float)); // 4x1

MKL_INT8 *b1_mkl_int8 = (MKL_INT8 *) malloc(m * 1 *sizeof(MKL_INT8)); //4x1

float *c1 = (float *) malloc(n * 1 * sizeof(float)); // 8x1

MKL_INT32 *c1_mkl_int32 = (MKL_INT32 *) malloc(n * 1 *sizeof(MKL_INT32)); //8x1

memcpy(a1, A, m * n * sizeof(float));

memcpy(b1, B, m * 1 * sizeof(float));

memset(c1, 0, n * 1 * sizeof(float));

cblas_sgemm(CblasColMajor, CblasTrans, CblasNoTrans, n,k,m, alpha, a1, m, b1, n, beta, c1, n); // 8x4 multiplied to 4x1 gives 8x1

fprintf(stderr,"Result after sgemm\n");

PrintMatrix(c1, 8, k, CblasRowMajor);

for(int i=0;i<m;i++){

for(int j=0;j<n;j++){

MKL_INT8 x = (MKL_INT8) (*(a1+(i*m)+j) * 100);

if(x > 127) x = 127;

if(x < -127) x = -127;

*(a1_mkl_int8+(i*m)+j) = x;

}

}

for(int j=0;j<m;j++){

MKL_INT8 y = (MKL_INT8) (*(b1+j) * 100);

if(y > 127) y = 127;

if(y < -127) y = -127;

*(b1_mkl_int8+j) = y;

}

const MKL_INT8 ao = 0; const MKL_INT8 bo = 0; MKL_INT32 co = 0; CBLAS_OFFSET offsetc=CblasFixOffset;

cblas_gemm_s8u8s32(CblasColMajor, CblasTrans, CblasNoTrans, offsetc,

n, k, m, alpha, a1_mkl_int8, m, ao, b1_mkl_int8, n, bo, beta, c1_mkl_int32, n, &co);

for(int j=0;j<n;j++) {

float y = (float) (*(c1_mkl_int32+j));

y = y / 10000;

*(c1+j) = y;

}

fprintf(stderr,"Result after gemm_s8u8s32\n");

PrintMatrix(c1, 8, k, CblasRowMajor);

free(a1); free(b1); free(c1); free(a1_mkl_int8);

return 0;

}

***** compile instructions *******

source /opt/intel/bin/compilervars.sh intel64

g++ -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_tbb_thread -lmkl_core -ltbb -lstdc++ -lpthread -lm -ldl repro.cpp

****** output *********

Result after sgemm

0.300000

0.250000

0.300000

0.250000

0.300000

0.250000

0.300000

0.170000

Result after gemm_s8u8s32

0.300000

0.250000

0.300000

0.250000

0.300000

0.000000

0.000000

0.000000

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

There is a significant error in your C code: on lines 57 and 60, you should have written

(i*n)+j

instead of

(i*m)+j

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

You have written code to perform a BLAS GEMM operation in two different ways. The results do not agree, so either of those, or both, contain errors -- errors in argument sequence, types, etc., or errors in values.

From a reader's perspective, however, the problem definition is absent! One could look at the code and suspect whether what you really wish to perform is an GEMV operation.

Please state what you wish to do, and look at the MKL example source files in .../examples/cblas/source to see if any of them could be used as the basis of code to perform your calculation.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Sure. Here is a more elaborate problem definition.

I am trying to figure out how to port gemm api to s8u8s32 api. So as a first step I have scaled up the input matrices by 100 and stored them in MKL_INT8 type matrices. I have retained most of the other arguments as-is from the gemm api and passed the same to s8u8s32 api. After calling the api, I am have converted the output vector c to floating-point after scaling down by 10000. And expect that the output of gemm and s8u8s2. But the output partially match. So I am looking for some help in getting the arguments to s8u8s32 right, which I suspect could be the problem.

Also, I couldn't find the gemv version of the s8u8s32 interface. Please let me know if there is one and I can try that instead

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi,

Thanks for reaching out to us.

We are forwarding this issue to the concerned team.

Warm Regards,

Abhishek

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Abhishek,

Thanks for the response. Does your response mean that this could likely be a bug in the API implementation?

Please let me know so that I can stop debugging the test program and wait for a resolution from the concerned team.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

A second look at your program revealed a few oddities.

1. MKL_INT8 is an 8-bit __unsigned__ integer type. Given that, I do not see what is the purpose and behavior of code such as

if (x < -127) x = -127;

when x is of type MKL_INT8.

2. When your code is compiled with ICL 2021.1.2 on Windows and the resulting EXE is run repeatedly, the last three numbers output are not zero as in your report, but appear to be garbage and vary from run to run.

3. When your code is compiled and run using ICX 2021.1.2 on Windows, we get the following error:

__Intel MKL ERROR: Parameter 10 was incorrect on entry to cblas_gemm_s8u8s32.__

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Yes. As noted by you in point 2, the problem I am concerned is that the last 3 numbers of the output is zero and does not match the gemm output.

I am using ubuntu linux with Intel MKL version 2020-0-088.

Let me know if you need additional information.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

RamGowda,

I use MKL quite frequently, but mostly in Fortran code. I have often felt that the handling of arrays of rank greater than 1 is more complicated and error prone in C than in Fortran. Thus, I wrote a short Fortran program to run your test problem, and it gives consistent and correct results. For what it may be worth, here is the source code.

```
program xgemms8u8s32
use blas95
implicit none
character*1 :: transa='T',transb='N', ofa = 'F', ofb = 'F', ofc = 'F'
integer :: m = 8, n = 1, k = 4
real :: alpha = 1.0, beta = 0.0
integer*1,allocatable :: ia(:,:),ib(:,:)
integer, allocatable :: ic(:,:)
integer*1 :: oa, ob, oc
integer :: lda,ldb,ldc
real, allocatable :: a(:,:), b(:,:), c(:,:)
!
allocate(ia(k,m),ib(k,n),ic(m,n))
ia = reshape( &
[10,20,30,40, 50,10,20,30, &
10,20,30,40, 50,10,20,30, &
10,20,30,40, 50,10,20,30, &
10,20,30,40, 50,10,20,10], [k,m])
ib = reshape([10,20,30,40],[k,n])
oa=0; ob=0; oc=0
lda = k; ldb = k; ldc = m
call gemm_s8u8s32(transa, transb, ofc, m,n,k, alpha, &
ia,lda,oa, ib,ldb,ob, beta, ic,ldc,oc)
print *,'Results from gemm_s8u8s32:'
print '(1x,8I8)',ic
allocate(a(k,m),b(k,n),c(m,n))
a = ia*0.01; b = ib*0.01
deallocate(ia,ib,ic)
call gemm(a,b,c,transa)
print *,'Results from gemm:'
print '(1x,8F8.4)',c
deallocate(a,b,c)
end program
```

You may compile and link the program using the command

`ifort -mkl xgemms8u8s32.f90 -lmkl_blas95_lp64`

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

There is an example file cblas_gemm_s8u8s32x.c in the OneAPI/mkl/latest/examples/cblas/source directory. If you compile that file with the common_func.c file in the same directory and link to the MKL libraries, and run the resulting program with the name of a data file containing the data below, you will see the correct results for your data. Building and running this code may help you to find out what to fix in your own source code.

```
: C B L A S _ I G E M M EXAMPLE PROGRAM DATA
8 1 4 :Values of m, n, k
1.0 0.0 :Values of alpha, beta
111 111 101 173 :Values of transA, transB, layout, offsetc(N,N,Row)
0 0 :Values of ao, bo
0 :Values of co
10 20 30 40
50 10 20 30
10 20 30 40
50 10 20 30
10 20 30 40
50 10 20 30
10 20 30 40
50 10 20 10 :Values of array A(m x k)
10
20
30
40 :Values of array B(k x n)
0
0
0
0
0
0
0
0 :Values of array C(m x n)
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

There is a significant error in your C code: on lines 57 and 60, you should have written

(i*n)+j

instead of

(i*m)+j

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Thanks for spotting the bug. The code works as expected after fixing the error.

Cheers,

Ram

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page