bug of sgemm, 2015 mkl gives wrong result, but 2013 gives correct

志强_赵_ · ‎03-30-2015

The compile option of 2015:

/work1/soft/intel2015/composer_xe_2015.0.090/bin/intel64/ifort -i8 -openmp main.f90 -o i64_2015 -L /work1/soft/intel2015/composer_xe_2015.0.090/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_intel_thread -Wl,-rpath /work1/soft/intel2015/mkl/lib/intel64

The option of 2013

/work1/soft/intel/composer_xe_2013.0.079/bin/intel64/ifort -i8 -openmp main.f90 -o i64_2013 -L /work1/soft/intel/composer_xe_2013.0.079/mkl/lib/intel64/ -lmkl_intel_ilp64 -lmkl_core -lmkl_intel_thread -Wl,-rpath /work1/soft/intel/composer_xe_2013.0.079/mkl/lib/intel64

The 2013 gives 2.0000, but 2015 gives 0.000

The code is shown below

program testi8
  implicit none
  integer,parameter :: REL = 4
  integer,  parameter :: num  = 2500000000, one = 1
  real(REL),dimension(:,:),allocatable :: a, b, c
  integer :: i, j

  print*, "allocate arrays..."
  allocate(a(1, num ), b(num, 1), c(1,1))
  a = 0.0_REL
  b = 0.0_REL
  c = 0.0_REL

  print*, "value arrays..."
  do i = 1, num
     if(i.eq.1) then
        a(1,1) = 1.0_REL
     else if(mod(i, 2).eq.0) then
        a(1, i) = 1.0_REL
     else
        a(1, i) = -1.0_REL
     end if
     b(i, 1) = 1.0_REL
  end do

  call sgemm('n', 'n', one, one, num, 1.0_REL, a, one, b, num, 0.0_REL, c, one)

  print*, "result"
  print*, c

end program testi8

Gennady_F_Intel · ‎03-31-2015

thanks, we will check

Gennady_F_Intel · ‎03-31-2015

I see the same result too.

We will try to investigate the cause of the issue and will back with the updates.

thanks, Gennady

志强_赵_ · ‎03-31-2015

In this case，If you set the a array to 1.0. Then call the sgemm. The desired result should be 2.5e9.

but both 2013 and 2015 give the wrong result. The 2013 say it's 1.6777e7. The 2015 say it's 1.6e9.

I think this is another situation. Please check it.

Gennady Fedorov (Intel) wrote:

I see the same result too.

We will try to investigate the cause of the issue and will back with the updates.

thanks, Gennady

Murat_G_Intel · ‎04-02-2015

These differences seem to be due to the round-off errors in floating point operations. Vectorization changes the order of operations, and instead of accessing the elements sequentially, vectorization may process the elements in strides. For example, in your use case (essentially a dot product) instead of summing the element-wise vector-multiplication sequentially, a vector register is used to accumulate the sums in different chunks. The elements of the vector register is accumulated to get the final result. This causes round off errors even for the cases where one would expect the summation will cancel each other such as (1, 1, -1, -1, 1, -1, 1, ...).

To verify this claim, I created a C code that demonstrates this issue when vectorization is turned on (-O2 and above):

#include <stdlib.h>
float sum(float* x, long long len) {
    long long i;
    float res = 0.;
#pragma vector aligned
    for (i=0; i<len; ++i)
        res += x;
    return res;
}

void main(int argc, char* argv[])
{
    long long i;
    long long N = 100;
    if (argc>1) N = 1000LL*atoi(argv[1]);
    float* a =  _mm_malloc(sizeof(float)*N, 128);
    printf("N=%ld \n", N);
    if (N<2) return;

    a[0] = 1;
    a[1] = 1;
    for (i=2; i<N; ++i) {
        if (i%2) a = 1;
        else a = -1;
    }
    printf("Result = %f\n", sum(a, N));
    _mm_free(a);
}

icc main.c -O2 -xAVX -vec-report=6

$ ./a.out 20000
N=20000000
Result = 2.000000

$ ./a.out 200000
N=200000000
Result = 0.000000

If you compile the same program with -O1, we always get 2.0:

./a.out 2000000
N=2000000000
Result = 2.000000