I wrote this small subroutine that compares simple vector math operations, performed either with a loop:
f(i) = a(i) + b(i)
or directly, with array syntax:
f = a + b
or using Intel MKL VML:
The timing results for n=50000000 are:
VML 0.9 sec
And I don't understand why VML takes twice as long as the other methods! (The loop is sometimes faster than the direct form.)
I used threaded MKL with 1 or 2 threads on an Intel Core 2 Duo, but the result stays the same.
Flags: /O3 /MT /Qopenmp /heap-arrays0
The subroutine can be found at http://paste.ideaslabs.com/show/L6dVLdAOIf and is called via
Let me try to provide insights on the testcase execution efficiency.
The first thing I noticed is that you're interested in measuring the performance of the expression
f(i)=a(i)*b(i)+c(i)*d(i)+a(i), where i goes from 1 to 50M, in double precision. I couldn't find a reference to the compiler you use for the performance evaluation, but I found that you use MKL 10.2. You also refer to an Intel Core 2 Duo processor. Please correct me if I misinterpreted you.
First, I should say that modern Intel processors can execute multiply and add instructions within the same processor cycle. That is, c(i)*d(i)+a(i) can be executed in one cycle. As soon as the result of this operation (call it tmp(i)) is ready, the processor can issue a(i)*b(i)+tmp(i) within the same cycle.
Next, if you use a modern optimizing compiler for x86, such as the Intel Fortran or C++ compiler, the compiler is capable of vectorizing the code with SSE2 vector instructions. As a result, the processor executes two consecutive loop iterations in parallel, e.g. the i-th and the (i+1)-th. Modern compilers can also unroll the loop and schedule instructions so that the latency of computing tmp(i) is hidden by other computations (from other loop iterations).
The point is that if you use a smart enough compiler, a(i)*b(i)+c(i)*d(i)+a(i) is not executed literally as it is written.
Now let us look at what happens if you call vdAdd from Intel MKL. The MKL vector add performs only an add on the vector elements. That simple fact means that by calling the VML add you leave the CPU multiply unit underutilized for a long time. In the next step you call the vector multiply, which leaves the CPU add unit underutilized. I would recommend looking for other MKL primitives that better balance the use of the add and multiply units, e.g. the dot product functions in MKL or similar ones.
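To make the contrast above concrete, here is a minimal C sketch (C rather than Fortran, so it can be compiled without MKL). The routines vec_mul and vec_add are hand-written stand-ins for element-wise vector primitives, not MKL calls, and the names eval_passes and eval_fused are made up for illustration. The first version walks memory four times and needs two scratch arrays; the second gives the compiler the whole expression in one loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written stand-ins for element-wise vector primitives (NOT MKL calls). */
void vec_mul(size_t n, const double *x, const double *y, double *r) {
    for (size_t i = 0; i < n; ++i) r[i] = x[i] * y[i];
}

void vec_add(size_t n, const double *x, const double *y, double *r) {
    for (size_t i = 0; i < n; ++i) r[i] = x[i] + y[i];
}

/* Library style: four separate passes over memory, two scratch arrays.
   Each pass uses only the multiplier or only the adder. */
void eval_passes(size_t n, const double *a, const double *b,
                 const double *c, const double *d,
                 double *t1, double *t2, double *f) {
    assert(t1 != t2);          /* the scratch arrays must be distinct */
    vec_mul(n, a, b, t1);      /* t1 = a*b          */
    vec_mul(n, c, d, t2);      /* t2 = c*d          */
    vec_add(n, t1, t2, t1);    /* t1 = a*b + c*d    */
    vec_add(n, t1, a, f);      /* f  = t1 + a       */
}

/* Fused style: one pass, no scratch arrays; the compiler sees the whole
   expression and can vectorize, unroll and interleave adds with multiplies. */
void eval_fused(size_t n, const double *a, const double *b,
                const double *c, const double *d, double *f) {
    for (size_t i = 0; i < n; ++i)
        f[i] = a[i] * b[i] + c[i] * d[i] + a[i];
}
```

Both functions compute the same values; the difference is only in how many times the data travels between memory and the CPU and in how much of the expression the compiler can schedule at once.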
A few notes about the threading efficiency of vector primitives. Let's have a look at the VML Performance and Accuracy charts available with the MKL documentation.
You can notice a few interesting facts in that data:
1) Threading adds non-negligible overhead to the function execution time, especially noticeable at shorter vector lengths.
2) Because of that overhead, threading only makes sense when the vector is big enough, which conflicts with the objective of using shorter vectors that fit into the L2 cache.
Please notice again that a modern CPU can issue 2 adds and 2 muls every cycle. These are really tiny performance primitives. Threading does not come for free; it is typically quite expensive. People typically tend to do threading at the highest possible level (the application level). So I'm basically not surprised that you're not seeing performance gains.
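The "shorter vectors that fit into L2 cache" idea from point 2 can be sketched as follows, again in C with hand-written stand-ins (vec_mul, vec_add) instead of MKL calls. The name eval_blocked and the limit MAX_BLOCK = 4096 are assumptions made for this illustration, not tuned values: with 4096 doubles per slice (32 KiB), the two scratch buffers plus the slices of a, b, c, d and f come to about 224 KiB per block.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_BLOCK = 4096 };  /* hypothetical upper bound: 4096 doubles = 32 KiB */

/* Hand-written stand-ins for element-wise vector primitives (NOT MKL calls). */
void vec_mul(size_t n, const double *x, const double *y, double *r) {
    for (size_t i = 0; i < n; ++i) r[i] = x[i] * y[i];
}

void vec_add(size_t n, const double *x, const double *y, double *r) {
    for (size_t i = 0; i < n; ++i) r[i] = x[i] + y[i];
}

/* Evaluate f = a*b + c*d + a one block at a time, so the temporaries
   stay in cache between the primitive calls instead of going to memory. */
void eval_blocked(size_t n, size_t block,
                  const double *a, const double *b,
                  const double *c, const double *d, double *f) {
    double t1[MAX_BLOCK], t2[MAX_BLOCK];     /* small, reusable scratch */
    assert(block > 0 && block <= MAX_BLOCK);

    for (size_t start = 0; start < n; start += block) {
        size_t len = (n - start < block) ? (n - start) : block;
        vec_mul(len, a + start, b + start, t1);   /* t1 = a*b       */
        vec_mul(len, c + start, d + start, t2);   /* t2 = c*d       */
        vec_add(len, t1, t2, t1);                 /* t1 = a*b + c*d */
        vec_add(len, t1, a + start, f + start);   /* f  = t1 + a    */
    }
}
```

Note the trade-off described above: each primitive call now works on a short vector, which is exactly the size range where per-call and threading overheads weigh most.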
Please don't hesitate to contact me if you need more details,
I am facing much the same problem as mklvml described above. I want to use the Intel VML functions in my subroutines, which are written in Fortran 90. I wrote some code to test the timing difference, using multiplication as the operation on arrays filled by a random number generator. The typical array size in my subroutines is 10^6. The results of my code are given below.
I am asking about this again because I cannot find the code attached by mklvml, and it is difficult to follow the comments without looking at the code. Also, in my case I want to be sure about the timing improvement before applying it to my subroutines.
So please do share your comments on it.
The output that I get is:
t3 - t2(Do loop) = 8 sec
t4 - t3 (Vml Function )= 49 sec
As per the output, it seems that do loops are faster than the VML function.
-msse4.1 appears to improve performance on my Westmere, though I don't know why.
The vml runs fastest at about 4 threads (out of the default 24), while the in-line code is running 1 thread with nontemporal store. Evidently, the case is limited by memory bandwidth and cache issues.
Yes, and moreover HT lets two hardware threads execute at the same time because the architectural state is duplicated. The CPU's execution units (the FP and SIMD stack) are shared between those two threads, and if there are no instruction interdependencies, only one of those threads can at any given time *issue fmul and fadd thread ID tagged uops.
*issue - Scheduler will issue thread ID tagged uops.
To summarize some of the above:
Current CPUs aren't like those of 30 years ago, where it made sense to combine library calls like VML, in the absence of threading and caching. You can achieve much better performance by allowing a compiler to see more of the picture and eliminate unnecessary memory traffic.
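As a rough illustration of the memory traffic point, the sketch below counts the bytes read and written for the expression discussed earlier in the thread, f(i)=a(i)*b(i)+c(i)*d(i)+a(i), in double precision. It is a back-of-the-envelope model with made-up function names (pass_bytes, fused_bytes, split_bytes): it assumes every array in every pass is streamed from or to memory once, and it ignores cache reuse and write-allocate traffic.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes moved by one pass that reads `reads` arrays and writes `writes`
   arrays of n doubles each (simple streaming model, no cache reuse). */
double pass_bytes(size_t n, int reads, int writes) {
    assert(reads >= 0 && writes >= 0);
    return (double)n * (double)sizeof(double) * (double)(reads + writes);
}

/* One fused loop: reads a, b, c, d and writes f. */
double fused_bytes(size_t n) {
    return pass_bytes(n, 4, 1);
}

/* Four separate element-wise calls (mul, mul, add, add):
   each reads two arrays and writes one. */
double split_bytes(size_t n) {
    return 4.0 * pass_bytes(n, 2, 1);
}
```

For n = 50000000 this model gives 2.0e9 bytes for the fused loop and 4.8e9 bytes for the four separate calls, so on a bandwidth-limited machine the sequence of library calls starts with a 2.4x handicap before any other overhead is counted.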
Sergey suggested you should use (and compile for) an AVX2 CPU. This could increase the margin of performance a compiler could achieve vs. a series of VML calls. If you are using a Core 2 Duo (I missed the hints about that), VML doesn't have much latitude to use too many threads or choose an ineffective instruction set. You may still want to compile for SSE4.1 if you have one of the later Core 2 Duos that support it, and you are willing to spend a few minutes looking at compiler options.