Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

VML performance very slow

mklvml
Beginner
1,196 Views

I wrote this small subroutine that compares simple vector mathematical functions, performed either with a loop:

f(i) = a(i) + b(i)

or direct:

f = a + b

or using Intel MKL VML:

vdAdd(n,a,b,f)

The timing results for n=50000000 are:

VML    0.9 sec

direct 0.4 sec

loop   0.4 sec

And I don't understand why VML takes twice as long as the other methods! (The loop is sometimes faster than direct.)

I used threaded MKL with 1 or 2 threads on an Intel Core 2 Duo, but the result stays the same.

Flags: /O3 /MT /Qopenmp /heap-arrays0

The subroutine can be found under http://paste.ideaslabs.com/show/L6dVLdAOIf and is called via

program test

use vmltests
implicit none

call vmlTest()

end program
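
The paste link above is no longer reachable, so here is a minimal sketch of what such a benchmark module might look like. It is a reconstruction under stated assumptions: system_clock stands in for the original StartTime/StopTime helpers, and vdAdd is the MKL VML routine from the post.

module vmltests
   implicit none
contains
   subroutine vmlTest()
      integer, parameter :: n = 50000000
      real(8), allocatable :: a(:), b(:), f(:)
      integer(8) :: t0, t1, rate
      integer :: i

      allocate (a(n), b(n), f(n))
      call random_number(a)
      call random_number(b)
      call random_number(f)              ! touch f so its pages are committed

      call system_clock(t0, rate)        ! explicit loop
      do i = 1, n
         f(i) = a(i) + b(i)
      end do
      call system_clock(t1)
      print *, 'loop   ', real(t1 - t0, 8) / rate, ' sec'

      call system_clock(t0)              ! Fortran array syntax
      f = a + b
      call system_clock(t1)
      print *, 'direct ', real(t1 - t0, 8) / rate, ' sec'

      call system_clock(t0)              ! MKL VML
      call vdAdd(n, a, b, f)
      call system_clock(t1)
      print *, 'VML    ', real(t1 - t0, 8) / rate, ' sec'
   end subroutine vmlTest
end module vmltests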
0 Kudos
10 Replies
Gennady_F_Intel
Moderator
1,196 Views
What version of MKL are you using?
0 Kudos
Ilya_B_Intel
Employee
1,197 Views
Vector math functions work best when the data is in the L2 cache, and n=50000000 is way out of L2 cache.
Your example code is not just a(i)+b(i), it is f(i)=a(i)*b(i)+c(i)*d(i)+a(i), which you replace by several MKL VML calls. MKL VML walks through all of this memory for each call, while the compiler-optimized code groups the computation and walks through this memory only once.
In order to overcome this limitation you may apply a common optimization technique named blocking:
Do i = 1, 50000
   j = (i-1)*1000 + 1                     ! start index of this block
   call vdMul(1000, a(j), b(j), e(j))
   call vdMul(1000, c(j), d(j), f(j))
   call vdAdd(1000, f(j), e(j), f(j))
   call vdAdd(1000, f(j), a(j), f(j))
End do
Each block will be within L2 cache.
When/if you try more complex functions you will run into yet another effect: by default the compiler will use less accurate functions than MKL. Use the vmlSetMode function to set the MKL VML accuracy to the same level:
mode = VML_LA
mode = VMLSETMODE(mode)
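For completeness, a minimal sketch of how that call fits into a Fortran source, assuming the MKL 10.x convention of including mkl_vml.f90 (which provides the MKL_VML module and the VML_LA constant; check the include directory of your MKL installation for the exact layout):

include 'mkl_vml.f90'

program set_vml_accuracy
   use mkl_vml
   implicit none
   integer :: mode
   ! Switch VML to low-accuracy (LA) mode; the previous mode is returned.
   mode = vmlsetmode(VML_LA)
end program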
And yes, which MKL version are you using?
0 Kudos
mklvml
Beginner
1,197 Views
MKL 10.2.

Thank you for these insights!
0 Kudos
mklvml
Beginner
1,197 Views
The L2 problem does not explain why the MKL functions do not scale at all on a dual-core processor!

If the number of MKL threads is set to 2, the CPU usage simply doubles (with no speedup)!
0 Kudos
Sergey_M_Intel2
Employee
1,197 Views
Hi mklvml,

Let me try to provide insights on the testcase execution efficiency.

The first fact that I've noticed is that you're interested in measuring the performance of
f(i)=a(i)*b(i)+c(i)*d(i)+a(i), where i goes from 1 to 50M, in double precision. I couldn't find a reference to the compiler you use for performance evaluation, but I found that you use MKL 10.2. You also refer to an Intel Core 2 Duo processor. Please correct me if I misinterpreted you.

First, I should say that modern Intel processors can execute multiply and add instructions within the same processor cycle. That is, c(i)*d(i)+a(i) can be executed in one cycle. As soon as the result (let it be tmp(i)) of this operation is ready, the processor can issue a(i)*b(i)+tmp(i) within the same cycle.

Next, if you use a modern optimizing compiler for x86, such as the Intel Fortran or C++ compiler, the compiler is capable of vectorizing the code by using vector SSE2 instructions. As a result, the processor will execute two consecutive loop iterations in parallel, e.g. the i-th and (i+1)-th. Modern compilers can also unroll the loop and schedule instructions in such a way that the latency of computing tmp(i) is hidden by other computations (from other loop iterations).

The point is that if you use a smart enough compiler, then a(i)*b(i)+c(i)*d(i)+a(i) is not executed literally as it is written.
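To make that concrete, here is a sketch of the single-pass loop the compiler gets to optimize, contrasted with its multi-pass VML equivalent (the array names follow the testcase; assume real(8) arrays a, b, c, d, e, f of length n):

! One pass over memory: the compiler can vectorize this with SSE2
! and pair each multiply with an add in the same cycle.
do i = 1, n
   f(i) = a(i)*b(i) + c(i)*d(i) + a(i)
end do

! Four passes over memory: each VML call streams the full-length
! vectors through the cache again.
call vdMul(n, a, b, e)
call vdMul(n, c, d, f)
call vdAdd(n, f, e, f)
call vdAdd(n, f, a, f)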

Let us have a look at what happens when you call vdAdd from Intel MKL. MKL's vector add executes only add operations on the vector elements. That simple fact means that by calling VML add you leave the CPU's multiply unit idle for a long time. On the next step you call vector multiply, which leaves the CPU's add unit idle. I would recommend looking for other MKL primitives that better balance the use of the add and multiply units, e.g. the dot product functions in MKL or similar.
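For instance, the standard BLAS level-1 dot product, available in MKL as ddot, issues one multiply and one add per element pair, keeping both units busy. A minimal sketch:

program dot_example
   implicit none
   integer, parameter :: n = 1000
   real(8) :: a(n), b(n), result
   real(8), external :: ddot        ! BLAS level-1 routine, provided by MKL

   call random_number(a)
   call random_number(b)

   ! result = sum of a(i)*b(i): a multiply and an add per element.
   result = ddot(n, a, 1, b, 1)
   print *, result
end program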

A few notes about the threading efficiency of vector primitives. Let's have a look at the VML performance and accuracy charts available with the MKL documentation:
http://software.intel.com/sites/products/documentation/hpc/mkl/vml3/functions/mul.html
http://software.intel.com/sites/products/documentation/hpc/mkl/vml3/functions/add.html
You can notice a few interesting facts in that data:
1) Threading adds non-negligible overhead to the function execution time, which is especially noticeable at shorter vector lengths.
2) Because of that overhead, threading only pays off when the vector is big enough, which conflicts with the objective of using shorter vectors to fit into the L2 cache.

Please note again that a modern CPU can issue 2 adds and 2 multiplies every cycle; these are really tiny performance primitives. Threading is not free; it is typically quite expensive. People typically do threading at the highest possible level (the application level). So I'm basically not surprised that you're not seeing performance gains.

Please don't hesitate to contact me if you need more details,
Regards,
Sergey
0 Kudos
Ilya_B_Intel
Employee
1,197 Views
Additional inputs.

I took your testcase and reduced it to the original question: what if we take only one addition? This is the case where no mul+add pairing is possible, so MKL should give results similar to the compiler-generated code.
Call StartTime(time(:,1))
call vdAdd(n,c,d,f)
Call StopTime(time(:,1))

Call StartTime(time(:,2))
Do i=1,n
f(i)=c(i)+d(i)
End do
Call StopTime(time(:,2))

Call StartTime(time(:,3))
f=c+d
Call StopTime(time(:,3))
Main finding:
The slowest time is not for the VML call but for whichever call is measured first. So if you measure the direct call first, it will be the slowest.
The reasons:
1) Your timing routine has first-call initialization, so the first measurement includes the initialization of the timing routine, not just the computation. To remove that effect, we place a dummy timing measurement first:
Call StartTime(time(:,4))
Call StopTime(time(:,4))
2) Your output array is allocated but not yet touched, which skews the first measurement (the first write pays the page-allocation cost). For a fair comparison we put something in it first:
call random_number(f)
call random_number(c)
call random_number(d)
3) I am not aware of your memory limits, but 3 such double precision arrays with 50M elements take 1.2 GB.
In our example we remove the allocation of the a, b, and e arrays.
Now we run the new example and the timings are very close:
VML 0.3679440 0.5000000
Loop 0.3649440 0.5000000
direct 0.3659451 0.5000000
Another question: threading.
The vdAdd and vdMul functions are threaded starting with MKL 10.3.
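On a version where they are threaded, the MKL thread count can be set at run time with the standard MKL service routine (a sketch; the MKL_NUM_THREADS environment variable works as well):

! Ask MKL to use at most 2 threads for subsequent calls.
call mkl_set_num_threads(2)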
0 Kudos
Malav_S_
Beginner
1,197 Views

Hello,

I am facing much the same problem as mklvml mentioned above. I want to use Intel VML functions in my subroutines, which are written in Fortran 90. I wrote a code to test the timing difference, using multiplication as the operation on arrays generated with the random number generator. The typical array size in my subroutines is 10^6. The results of my code are below.

I am asking this again because I cannot find the code attached by mklvml, and it is difficult to follow the comments without having a look at the code. Also, in my case I want to be sure about the timing improvement before applying it to my subroutines.

So please do share your comments on it.

The output that I get is :

t3 - t2 (DO loop) = 8 sec

t4 - t3 (VML function) = 49 sec

As per the output, it seems that DO loops are faster than the VML function.
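
Since the code itself is not attached, here is a minimal sketch of this kind of multiplication timing test, reconstructed as an assumption on my part, with the first-touch and timer warm-up fixes from earlier in the thread applied (the t2/t3/t4 names mirror the output above):

program vml_mul_timing
   implicit none
   integer, parameter :: n = 1000000
   real(8), allocatable :: a(:), b(:), f(:)
   real(8) :: t1, t2, t3, t4
   integer :: i

   allocate (a(n), b(n), f(n))
   call random_number(a)
   call random_number(b)
   call random_number(f)        ! touch the output array up front

   call cpu_time(t1)            ! dummy measurement to absorb any
   call cpu_time(t1)            ! first-call timer overhead

   call cpu_time(t2)
   do i = 1, n
      f(i) = a(i)*b(i)
   end do
   call cpu_time(t3)

   call vdMul(n, a, b, f)
   call cpu_time(t4)

   print *, 't3 - t2 (DO loop)      =', t3 - t2, ' sec'
   print *, 't4 - t3 (VML function) =', t4 - t3, ' sec'
end program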

0 Kudos
TimP
Honored Contributor III
1,197 Views

-msse4.1 appears to improve performance on my Westmere, though I don't know why.

The VML code runs fastest at about 4 threads (out of the default 24), while the in-line code runs on 1 thread with nontemporal stores.  Evidently, the case is limited by memory bandwidth and cache issues.

0 Kudos
Bernard
Valued Contributor I
1,197 Views

 

Yes, and moreover HT enables two hardware threads to execute at the same time because of the doubled architectural state.  The execution units of the CPU (the FP and SIMD stacks) are shared between those two threads, and if there are no instruction interdependencies, only one of those threads at a time can *issue fmul and fadd uops tagged with its thread ID.

*issue - the scheduler will issue thread-ID-tagged uops.

0 Kudos
TimP
Honored Contributor III
1,196 Views

To summarize some of the above:

Current CPUs aren't like those of 30 years ago, when it made sense to combine library calls like VML in the absence of threading and caching.  You can achieve much better performance by allowing a compiler to see more of the picture and eliminate unnecessary memory traffic.

Sergey suggested you should use (and compile for) an AVX2 CPU.  This could increase the performance margin a compiler can achieve vs. a series of VML calls.  If you are using a Core 2 Duo (I missed the hints about that), VML doesn't have much latitude to use too many threads or choose an ineffective instruction set.  You may still want to compile for SSE4.1 if you have one of the later Core 2 Duos that supports it, and you are willing to spend a few minutes looking at compiler options.
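With the Intel Fortran compiler, that might look like the following (a sketch using the Windows-style flag spelling, matching the /O3-style options quoted at the top of the thread):

ifort /O3 /QxSSE4.1 /Qopenmp test.f90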

0 Kudos
Reply