
Why is the VM library so slow on newer CPU models?

Hi, I have run into an odd problem: a calculation written as a plain for loop is more efficient than the same calculation done with the VM library of MKL. I tested several examples, and they all show the same kind of result.

For example, on an Intel(R) Xeon(R) Gold 6148 CPU with Intel Parallel Studio 2018u5, running 'test' gives:

Time for normal distribution
    serial:    5s
    vector (HA):    12s
    vector (LA):    11s
    vector (EP):    12s

The compile flags are:

-O3  -xHost -DMKL_ILP64 -I${MKLROOT}/include

And for link:

-Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -ldl

I also used VTune to profile the run. It shows that the time spent in vdExp alone is almost equal to the time of the whole for loop in the serial part. The source code and Makefile are in the tar package. The OS is CentOS 7.3.

I don't understand why the VM library runs so slowly. Is there anything wrong with the flags or the code?




3 Replies

What is the problem size?  Please check with the latest MKL 2019 u1. 

>> the calculation which using for ... loop is more efficiency than use the VM library of MKL.

Actually, in the loop case the SVML implementations of these functions are called, and for short and medium problem sizes we expect SVML to perform better.



Thanks for your reply.

The size is big, I think. There are two loops.

for (i = 0; i < 30000; i++)
    for (j = 0; j < 100000; j++)

So it is 30000x100000 = 3e9.

The for loop code would run slower than the SVML version only in one case: when the code is compiled with gcc and linked with MKL.



The VML version of your code is memory bound, which is why you see no difference in performance for HA/LA/EP modes.  You would need to rearrange the VML version of the code to be cache-blocked to have better performance -- try tests with smaller values of nx to see how this matters.  For your inner loops to fit in L2 cache, you should limit nx to something like 10000 or less. 
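To put rough numbers on the cache argument, here is a small sketch of the working-set arithmetic. It assumes five vectors of nx doubles are live in the inner loop (x, the output, and three temporaries) and a 1 MiB L2 cache per core, which is what I would expect on this CPU generation; both numbers are assumptions, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes touched by one sweep over nvec vectors of nx doubles each.
   The assert guards against a zero vector count and size_t overflow. */
size_t working_set_bytes(size_t nx, size_t nvec)
{
    assert(nvec > 0 && nx <= SIZE_MAX / (nvec * sizeof(double)));
    return nx * nvec * sizeof(double);
}

/* Returns 1 if the working set fits in a cache of cache_bytes, else 0. */
int fits_in_cache(size_t nx, size_t nvec, size_t cache_bytes)
{
    return working_set_bytes(nx, nvec) <= cache_bytes;
}
```

With nx = 100000 and five vectors the working set is about 4 MB, which does not fit in a 1 MiB L2; with nx = 10000 it is about 400 KB, which does.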

The serial version of the code has much better memory utilization because it makes just one pass through the x and y_serial vectors for each pass of the outer loop.  The VML version also has to write and read the three temporary vectors.