Why is the VM library so slow on newer CPU models?

祁__志国 · 01-04-2019

Hi, I ran into an odd problem: a calculation written as a plain for loop is more efficient than the same calculation done with the VM library of MKL. I tested several examples, and they all show the same kind of result.

For example, on an Intel(R) Xeon(R) Gold 6148 CPU with Intel Parallel Studio 2018u5, running 'test' gives:

./test
Time for normal distribution
   serial:   5s
   vector (HA):   12s
   vector (LA):   11s
   vector (EP):   12s

The compile flags are:

-O3 -xHost -DMKL_ILP64 -I${MKLROOT}/include

And the link flags are:

-Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -ldl

I also used VTune to profile the run. It shows that the time spent in vdExp alone is almost equal to the time spent in the for loop of the serial part. The source code and Makefile are in the tar package. The OS is CentOS 7.3.

I don't understand why the VM library runs so slowly. Is there anything wrong with the flags or the code?

Gennady_F_Intel · 01-05-2019

What is the problem size? Please check with the latest MKL 2019 u1.

>> the calculation which using for ... loop is more efficiency than use the VM library of MKL.

Actually, in the loop case the SVML implementations of these functions are called, and for short and medium problem sizes we expect SVML to perform better.

祁__志国 · 01-06-2019

Thanks for your reply.

The problem size is large, I think. There are two nested loops:

for(i=0;i<30000; i++)
  for(j=0;j<100000; j++)

So the total is 30000 x 100000 = 3e9 evaluations.

The for loop version runs slower than the VM version in only one case: when the code is compiled with gcc and linked with MKL.

Hinds__David · 04-01-2019

The VML version of your code is memory bound, which is why you see no difference in performance for HA/LA/EP modes. You would need to rearrange the VML version of the code to be cache-blocked to have better performance -- try tests with smaller values of nx to see how this matters. For your inner loops to fit in L2 cache, you should limit nx to something like 10000 or less.

The serial version of the code has much better memory utilization because it makes just one pass through the x and y_serial vectors for each pass of the outer loop. The VML version also has to write and read the three temporary vectors.