Solved: cblas_gemm to cblas_s8u8s32 porting

ramgowda · ‎01-11-2021

Hello forum,

I am trying to write a test program to port a cblas_gemm call to cblas_s8u8u32 call for a quantization evaluation experiment.

Below is the test program, and the output obtained. I was looking for some help as the output of gemm and s8u8u32 does not match. So any review comments of the test program will be of great help to debug.

Thanks,

Ram

***** test program ******

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <mkl.h>
#include <mkl_cblas.h>
#include <mkl_blas.h>

void PrintMatrix(float* pMatrix, const size_t nR, const size_t nC, const CBLAS_ORDER Order) {
unsigned int i, j;
if (Order == CblasRowMajor) {
for (i = 0; i < nR; i++) {
for (j = 0; j < nC; j++) {
fprintf(stderr,"%f \t ", pMatrix[i * nC + j]); // !!!
}
fprintf(stderr,"\n"); // !!!
}
} else {
for (i = 0; i < nR; i++) {
for (j = 0; j < nC; j++) {
fprintf(stderr,"%f \t ", pMatrix[i + j* nR ]); // !!!
}
fprintf(stderr,"\n"); // !!!
}
}
fprintf(stderr,"\n"); // !!!
}

int main(void)
{
const int m = 4;
const int n = 8;
const int k = 1;

float A[] = {0.1,0.2,0.3,0.4,0.5,0.1,0.2,0.3, 0.1,0.2,0.3,0.4,0.5,0.1,0.2,0.3, 0.1,0.2,0.3,0.4,0.5,0.1,0.2,0.3, 0.1,0.2,0.3,0.4,0.5,0.1,0.2,0.1};
float B[] = { 0.1, 0.2, 0.3, 0.4 };

float alpha = 1.0, beta = 0.0;

float *a1 = (float *) malloc(m * n *sizeof(float)); //4x8
MKL_INT8 *a1_mkl_int8 = (MKL_INT8 *) malloc(m * n *sizeof(MKL_INT8)); //4x8
float *b1 = (float *) malloc(m * 1 * sizeof(float)); // 4x1
MKL_INT8 *b1_mkl_int8 = (MKL_INT8 *) malloc(m * 1 *sizeof(MKL_INT8)); //4x1
float *c1 = (float *) malloc(n * 1 * sizeof(float)); // 8x1
MKL_INT32 *c1_mkl_int32 = (MKL_INT32 *) malloc(n * 1 *sizeof(MKL_INT32)); //8x1
memcpy(a1, A, m * n * sizeof(float));
memcpy(b1, B, m * 1 * sizeof(float));
memset(c1, 0, n * 1 * sizeof(float));
cblas_sgemm(CblasColMajor, CblasTrans, CblasNoTrans, n,k,m, alpha, a1, m, b1, n, beta, c1, n); // 8x4 multiplied to 4x1 gives 8x1
fprintf(stderr,"Result after sgemm\n");
PrintMatrix(c1, 8, k, CblasRowMajor);

for(int i=0;i<m;i++){
for(int j=0;j<n;j++){
MKL_INT8 x = (MKL_INT8) (*(a1+(i*m)+j) * 100);
if(x > 127) x = 127;
if(x < -127) x = -127;
*(a1_mkl_int8+(i*m)+j) = x;
}
}
for(int j=0;j<m;j++){
MKL_INT8 y = (MKL_INT8) (*(b1+j) * 100);
if(y > 127) y = 127;
if(y < -127) y = -127;
*(b1_mkl_int8+j) = y;
}

const MKL_INT8 ao = 0; const MKL_INT8 bo = 0; MKL_INT32 co = 0; CBLAS_OFFSET offsetc=CblasFixOffset;
cblas_gemm_s8u8s32(CblasColMajor, CblasTrans, CblasNoTrans, offsetc,
n, k, m, alpha, a1_mkl_int8, m, ao, b1_mkl_int8, n, bo, beta, c1_mkl_int32, n, &co);
for(int j=0;j<n;j++) {
float y = (float) (*(c1_mkl_int32+j));
y = y / 10000;
*(c1+j) = y;
}

fprintf(stderr,"Result after gemm_s8u8s32\n");
PrintMatrix(c1, 8, k, CblasRowMajor);

free(a1); free(b1); free(c1); free(a1_mkl_int8);

return 0;
}

***** compile instructions *******

source /opt/intel/bin/compilervars.sh intel64
g++ -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_tbb_thread -lmkl_core -ltbb -lstdc++ -lpthread -lm -ldl repro.cpp

****** output *********

Result after sgemm
0.300000
0.250000
0.300000
0.250000
0.300000
0.250000
0.300000
0.170000

Result after gemm_s8u8s32
0.300000
0.250000
0.300000
0.250000
0.300000
0.000000
0.000000
0.000000

mecej4 · ‎01-17-2021

There is a significant error in your C code: on lines 57 and 60, you should have written

(i*n)+j

instead of

(i*m)+j

View solution in original post

mecej4 · ‎01-12-2021

You have written code to perform a BLAS GEMM operation in two different ways. The results do not agree, so either of those, or both, contain errors -- errors in argument sequence, types, etc., or errors in values.

From a reader's perspective, however, the problem definition is absent! One could look at the code and suspect whether what you really wish to perform is an GEMV operation.

Please state what you wish to do, and look at the MKL example source files in .../examples/cblas/source to see if any of them could be used as the basis of code to perform your calculation.

ramgowda · ‎01-12-2021

Sure. Here is a more elaborate problem definition.

I am trying to figure out how to port gemm api to s8u8s32 api. So as a first step I have scaled up the input matrices by 100 and stored them in MKL_INT8 type matrices. I have retained most of the other arguments as-is from the gemm api and passed the same to s8u8s32 api. After calling the api, I am have converted the output vector c to floating-point after scaling down by 10000. And expect that the output of gemm and s8u8s2. But the output partially match. So I am looking for some help in getting the arguments to s8u8s32 right, which I suspect could be the problem.

Also, I couldn't find the gemv version of the s8u8s32 interface. Please let me know if there is one and I can try that instead

AbhishekD_Intel · ‎01-13-2021

Hi,

Thanks for reaching out to us.

We are forwarding this issue to the concerned team.

Warm Regards,

Abhishek

ramgowda · ‎01-14-2021

Hi Abhishek,

Thanks for the response. Does your response mean that this could likely be a bug in the API implementation?

Please let me know so that I can stop debugging the test program and wait for a resolution from the concerned team.

mecej4 · ‎01-15-2021

A second look at your program revealed a few oddities.

1. MKL_INT8 is an 8-bit unsigned integer type. Given that, I do not see what is the purpose and behavior of code such as

if (x < -127) x = -127;

when x is of type MKL_INT8.

2. When your code is compiled with ICL 2021.1.2 on Windows and the resulting EXE is run repeatedly, the last three numbers output are not zero as in your report, but appear to be garbage and vary from run to run.

3. When your code is compiled and run using ICX 2021.1.2 on Windows, we get the following error:

Intel MKL ERROR: Parameter 10 was incorrect on entry to cblas_gemm_s8u8s32.

ramgowda · ‎01-15-2021

Yes. As noted by you in point 2, the problem I am concerned is that the last 3 numbers of the output is zero and does not match the gemm output.

I am using ubuntu linux with Intel MKL version 2020-0-088.

Let me know if you need additional information.

mecej4 · ‎01-16-2021

RamGowda,

I use MKL quite frequently, but mostly in Fortran code. I have often felt that the handling of arrays of rank greater than 1 is more complicated and error prone in C than in Fortran. Thus, I wrote a short Fortran program to run your test problem, and it gives consistent and correct results. For what it may be worth, here is the source code.

program xgemms8u8s32
use blas95
implicit none
character*1 :: transa='T',transb='N', ofa = 'F', ofb = 'F', ofc = 'F'
integer :: m = 8, n = 1, k = 4
real :: alpha = 1.0, beta = 0.0
integer*1,allocatable :: ia(:,:),ib(:,:)
integer, allocatable :: ic(:,:)
integer*1 :: oa, ob, oc
integer :: lda,ldb,ldc
real, allocatable :: a(:,:), b(:,:), c(:,:)
!
allocate(ia(k,m),ib(k,n),ic(m,n))
ia = reshape( &
   [10,20,30,40, 50,10,20,30, &
    10,20,30,40, 50,10,20,30, &
    10,20,30,40, 50,10,20,30, &
    10,20,30,40, 50,10,20,10], [k,m])
ib = reshape([10,20,30,40],[k,n])
oa=0; ob=0; oc=0
lda = k; ldb = k; ldc = m
call gemm_s8u8s32(transa, transb, ofc, m,n,k, alpha, &
   ia,lda,oa, ib,ldb,ob, beta, ic,ldc,oc)
print *,'Results from gemm_s8u8s32:'
print '(1x,8I8)',ic

allocate(a(k,m),b(k,n),c(m,n))
a = ia*0.01; b = ib*0.01
deallocate(ia,ib,ic)
call gemm(a,b,c,transa)
print *,'Results from gemm:'
print '(1x,8F8.4)',c
deallocate(a,b,c)

end program

You may compile and link the program using the command

ifort -mkl xgemms8u8s32.f90 -lmkl_blas95_lp64

mecej4 · ‎01-16-2021

There is an example file cblas_gemm_s8u8s32x.c in the OneAPI/mkl/latest/examples/cblas/source directory. If you compile that file with the common_func.c file in the same directory and link to the MKL libraries, and run the resulting program with the name of a data file containing the data below, you will see the correct results for your data. Building and running this code may help you to find out what to fix in your own source code.

:  C B L A S _ I G E M M  EXAMPLE PROGRAM DATA
8 1 4                 :Values of m, n, k
1.0 0.0               :Values of alpha, beta
111 111 101 173       :Values of transA, transB, layout, offsetc(N,N,Row)
0 0                   :Values of ao, bo
0                     :Values of co
10   20   30   40
50   10   20   30
10   20   30   40
50   10   20   30
10   20   30   40
50   10   20   30
10   20   30   40
50   10   20   10    :Values of array A(m x k)
10
20
30
40                   :Values of array B(k x n)
0
0
0
0
0
0
0
0                    :Values of array C(m x n)

mecej4 · ‎01-17-2021

There is a significant error in your C code: on lines 57 and 60, you should have written

(i*n)+j

instead of

(i*m)+j

ramgowda · ‎01-18-2021

Thanks for spotting the bug. The code works as expected after fixing the error.

Cheers,

Ram