I wrote a small subroutine that compares a simple vector operation performed three ways: with an explicit loop,
f(i) = a(i) + b(i)
with Fortran array syntax ("direct"),
f = a + b
or using Intel MKL VML.
The timing results for n=50000000 are:
VML 0.9 sec
And I don't understand why VML takes twice as long as the other methods! (The loop is sometimes even faster than the direct array syntax.)
I used threaded MKL with 1 or 2 threads on an Intel Core 2 Duo, but the result stays the same.
Flags: /O3 /MT /Qopenmp /heap-arrays0
The subroutine can be found under http://paste.ideaslabs.com/show/L6dVLdAOIf and called via
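Since the pasted code is no longer reachable, here is a minimal sketch of the kind of comparison described (the structure, names, and timing calls are my own reconstruction, not the original attachment; vdadd is VML's double-precision vector add, and linking against MKL is required):

```fortran
! Hypothetical reconstruction of the benchmark described above:
! explicit loop vs. array syntax vs. MKL VML vdadd.
subroutine compare_add(n)
  implicit none
  integer, intent(in) :: n
  real(8), allocatable :: a(:), b(:), f(:)
  real(8) :: t0, t1
  integer :: i

  allocate (a(n), b(n), f(n))
  call random_number(a)
  call random_number(b)

  call cpu_time(t0)
  do i = 1, n                ! explicit loop
     f(i) = a(i) + b(i)
  end do
  call cpu_time(t1)
  print *, 'loop:  ', t1 - t0, ' sec'

  call cpu_time(t0)
  f = a + b                  ! array syntax ("direct")
  call cpu_time(t1)
  print *, 'direct:', t1 - t0, ' sec'

  call cpu_time(t0)
  call vdadd(n, a, b, f)     ! MKL VML vector add (link with MKL)
  call cpu_time(t1)
  print *, 'VML:   ', t1 - t0, ' sec'
end subroutine compare_add
```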
I am facing much the same problem as mklvml mentioned above. I want to use Intel VML functions in my subroutines, which are written in Fortran 90. I wrote a code to test the timing difference, using multiplication as the operation on arrays filled by the random number generator. The typical array size in my subroutines is 10^6. The results of my code are below.
I am asking this again because I cannot find the code attached by mklvml, and it is difficult to follow the comments without seeing the code. In my case I also want to be sure of the timing improvement before applying it to my subroutines.
So please share your comments.
The output that I get is:
t3 - t2 (do loop)      = 8 sec
t4 - t3 (VML function) = 49 sec
From this output it seems that do loops are faster than the VML function.
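For reference, a minimal version of such a multiplication timing test (my own sketch, not the missing attachment; vdmul is VML's double-precision vector multiply) could look like:

```fortran
! Hypothetical sketch of the described timing test:
! do-loop multiply vs. VML vdmul at n = 10**6.
program time_mul
  implicit none
  integer, parameter :: n = 10**6
  real(8), allocatable :: a(:), b(:), f(:)
  real(8) :: t2, t3, t4
  integer :: i

  allocate (a(n), b(n), f(n))
  call random_number(a)
  call random_number(b)

  call cpu_time(t2)
  do i = 1, n
     f(i) = a(i) * b(i)      ! do loop
  end do
  call cpu_time(t3)

  call vdmul(n, a, b, f)     ! VML function (link with MKL)
  call cpu_time(t4)

  print *, 't3 - t2 (do loop)      =', t3 - t2
  print *, 't4 - t3 (VML function) =', t4 - t3
end program time_mul
```

Note that for arrays of 10^6 elements a single pass completes in well under a second, so each kernel should be repeated many times and averaged; otherwise timer resolution and one-off overheads dominate the measurement.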
-msse4.1 appears to improve performance on my Westmere, though I don't know why.
The VML runs fastest at about 4 threads (out of the default 24), while the in-line code runs 1 thread with nontemporal stores. Evidently, the case is limited by memory bandwidth and cache issues.
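On the in-line side, a nontemporal store can be requested from the Intel compiler with a directive (a sketch; !DIR$ VECTOR NONTEMPORAL is Intel-specific), so the output array bypasses the cache instead of evicting useful data:

```fortran
! Intel-compiler directive: write f() with streaming (nontemporal)
! stores, so the output stream does not pollute the cache.
!DIR$ VECTOR NONTEMPORAL (f)
do i = 1, n
   f(i) = a(i) + b(i)
end do
```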
Yes, and moreover HT allows two hardware threads to execute at the same time because the architectural state is duplicated. The execution units of the CPU (the FP and SIMD stacks) are shared between those two threads, and if there are no instruction interdependencies, only one of those threads at a time can *issue fmul and fadd uops.
*issue: the scheduler issues uops tagged with the thread ID.
To summarize some of the above:
Current CPUs aren't like those of 30 years ago, when it made sense to package operations into library calls like VML, in the absence of threading and caching. You can achieve much better performance by allowing a compiler to see more of the picture and eliminate unnecessary memory traffic.
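As an illustration (my own, using the VML names vdmul and vdadd), when the compiler sees the whole computation it can fuse operations so intermediates stay in registers, whereas a sequence of VML calls forces each intermediate through memory:

```fortran
! Two VML-style passes: the intermediate tmp makes an
! extra round trip through memory.
call vdmul(n, a, b, tmp)     ! tmp = a * b
call vdadd(n, tmp, c, f)     ! f   = tmp + c

! One fused loop: each element of a*b stays in a register,
! and the compiler can vectorize the whole expression.
do i = 1, n
   f(i) = a(i)*b(i) + c(i)
end do
```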
Sergey suggested you should use (and compile for) an AVX2 CPU. This could increase the performance margin a compiler can achieve over a series of VML calls. If you are using a Core 2 Duo (I missed the hints about that), VML doesn't have much latitude to use too many threads or to choose an ineffective instruction set. You may still want to compile for SSE4.1 if you have one of the later Core 2 Duos that supports it, and you are willing to spend a few minutes looking at compiler options.