OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

Built-in math function vectorization!

janez-makovsek
New Contributor I
Hi!

I have been running benchmarks comparing Intel VML (from the MKL library) math functions with the current OpenCL implementation. There was no measurable difference between the AMD and Intel drivers. However, the math functions (sin, cos, exp, log, ...) are not (auto-)vectorized and reach only about 10% of the speed of Intel VML. I hope you will find some ways to improve on that. It really puts an OpenCL-based solution at a huge disadvantage compared to, say, an algorithm written in C++.
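For reference, the VML baseline being compared against is the array sine from MKL, roughly as below (a minimal sketch; the wrapper function and buffer names are illustrative):

#include <mkl.h>

/* Single-precision array sine from MKL/VML: dst[i] = sin(src[i]) for i = 0..len-1 */
void vml_sin_baseline(const float *src, float *dst, MKL_INT len)
{
    vsSin(len, src, dst);
}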

Thanks!
Atmapuri
Sion_B_Intel
Employee
Hi,
Thanks for the info.
Can you please show one of the OpenCL kernels that did not vectorize, so we can check the reason?

Thanks,
Sion
janez-makovsek
New Contributor I
__kernel void vdSin(__global const float *Src, const int SrcIdx,
                    __global float *Dst, const int DstIdx)
{
    // element-wise sine: Dst[DstIdx + i] = sin(Src[SrcIdx + i])
    int gid = get_global_id(0);
    Dst[DstIdx + gid] = sin(Src[SrcIdx + gid]);
}
janez-makovsek
New Contributor I
Hi!

After going through more testing I noticed that there is a vdSin8 function called by the Intel OpenCL driver. So there is some vectorization going on, but the results are still rather poor. I went further and did some threading analysis. I noticed that for such short kernels, the CPU memory bandwidth quickly becomes saturated. The TBB threading library then has threads waiting at full throttle (doing nothing), and the performance penalty comes from having to switch to the next thread to process the next part of the vector, even though it would be faster to let the same thread continue.

To put it bluntly:
Running on a single thread yields 2-3x faster code than running on multiple threads.
I am also not sure that no further performance improvement is possible for vectors longer than 8.
The performance penalties are particularly large for a simple add kernel:

a = b + c;

It runs 6x slower than a simple for-loop on a single thread in Intel C++, regardless of the vector length (as long as it is smaller than the L2 CPU cache). The timings exclude the time needed to get the data to and back from the OpenCL driver.
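For reference, the add kernel measured above is of the following form (a minimal sketch; the buffer and index arguments are assumed to mirror the sin kernel posted earlier):

__kernel void vAdd(__global const float *Src, const int SrcIdx,
                   __global float *Dst, const int DstIdx)
{
    // one read-modify-write per element, so the kernel is bandwidth bound
    int gid = get_global_id(0);
    Dst[DstIdx + gid] += Src[SrcIdx + gid];
}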

An attempt to thread a = b + c; on 4 cores in C++ results in code 4x slower than on a single core.

Conclusion:
If the kernel is in any way bandwidth limited, it will run up to 6x slower than comparable C++ code (because of the problems with threading). On the Core i7 architecture, the "add" hits the bandwidth limit on one core, and the "sin" function can run on 2 cores (assuming the result and source data reside in the CPU cache) before bandwidth is exhausted. The catch is that "sin" and similar functions are relatively expensive computationally, and to avoid bandwidth saturation you would need to do a lot of basic +/-/* math operations without reading from or writing to main memory. Even for the C++ compiler that is often hard to achieve (keeping all variables in registers).

Regards!
Atmapuri
Doron_S_Intel
Employee
Hello Atmapuri,

Thanks for sharing your experiments and observations with us. It's certainly interesting. For completeness, could you elaborate which global / local sizes you've used? It seems reasonable to expect that for a large enough global size, even a bandwidth limited kernel will begin showing benefit from threading. One trick you could try (that's mostly applicable to such experimentation, and isn't recommended as a general optimization step) is manually setting a relatively large local (work group) size. I suspect you may see considerably better performance in your test scenario.
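A minimal host-side sketch of this suggestion, assuming a 1D kernel and a global size that is a multiple of the chosen local size (variable names are illustrative):

size_t global_size = 1 << 18;   // total number of work items
size_t local_size  = 256;       // deliberately large work group for the experiment
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global_size, &local_size,
                                    0, NULL, NULL);
clFinish(queue);                // wait for completion so the timing is meaningful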

Please keep us updated if you decide to pursue this further, as locating such problems will allow us to focus our optimization efforts for future releases.

Thanks,
Doron Singer
Doron_S_Intel
Employee
Huh, it seems something knocked the Intel hat off my head :-)
janez-makovsek
New Contributor I
Dear Doron,

I tried changing the local work size but it had no effect (in the sense of improving the performance).

Thanks!
Atmapuri
Doron_S_Intel
Employee
Curious. Could you supply some ballpark numbers for what sizes you've played with?
janez-makovsek
New Contributor I
Hi!

Measured functions (Intel IPP/VML, threading disabled):

vdSin(...)         // for (i = 0; i < Len; i++) a[i] = sin(c[i]);
ippsAdd_32f_I(...) // for (i = 0; i < Len; i++) a[i] += b[i];

OpenCL kernels:

(First kernel)  Dst = sin(Src);
(Second kernel) Dst += Src;

Vector length = 2^18 (262,144 elements)
Iteration count = 3000 (total running time 2+ seconds)
CPU: Core i7 860

Timings for OpenCL (normalized to one iteration):
Local work size (1): 4360us
Local work size (2): 3338us
Local work size (4): 1720us
Local work size (8): 1720us
...
Local work size (256): ~1720us

Timings for IPP (MKL) (no threading): 1230 us
The timings for IPP are very stable, while the OpenCL timings can vary by +/- 30%.

Not so bad for OpenCL?
Not quite. We still need to enable threading for Intel IPP (threading in the calling code, not IPP's internal threading):

Timings for IPP (MKL) (2 threads): 595us
Timings for IPP (MKL) (4 threads): 387us

OpenCL / optimal = 1720 / 387 ≈ 4.4x

It is true that the Intel OpenCL driver is about 2x faster than AMD's, but the overall performance is still abysmal. It still makes more sense to run C++ code on 1 core than to use 4 cores with OpenCL.

Thanks!
Atmapuri

P.S.
You may not agree with my reasoning about why this happens, but the numbers are what they are.
Doron_S_Intel
Employee
Would you be so kind as to run with local size set to something like 1 << 14? What you're looking for is to cause very few work groups (8-32, depending on your system) to be created by artificially inflating the local work size. In extreme scenarios, this might help.
janez-makovsek
New Contributor I
Dear Doron, it is you who are being kind. The maximum accepted work group size is 1024. Setting the value to anything larger results in "invalid work group size" being reported by the driver. But even with the local work size set to 1024, there is no measurable benefit over a local work size of 4. If there is anything I can do to help you find a way to improve performance, I would be happy to.
Doron_S_Intel
Employee
Ah, right. I forgot about that limit. As a last resort, could you try unrolling the work item code? Instead of each kernel invocation calculating a single sine, have a loop within the kernel iterating over several such calculations. What I'm trying to figure out is whether the poor performance you're experiencing is due to a large overhead from the underlying tasking system or due to some other issue, so we can better address it in the future.
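A sketch of the kind of unrolling suggested here, where each work item processes several consecutive elements instead of one (the global size must then be divided by ITEMS_PER_WI; names are illustrative):

#define ITEMS_PER_WI 64

__kernel void vdSinUnrolled(__global const float *Src, __global float *Dst)
{
    int base = get_global_id(0) * ITEMS_PER_WI;
    // each work item computes ITEMS_PER_WI sines in a row
    for (int i = 0; i < ITEMS_PER_WI; i++)
        Dst[base + i] = sin(Src[base + i]);
}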
janez-makovsek
New Contributor I
Timings:
* OpenCL, c = sin(b), with Intel implicit kernel vectorization (all four cores): 900us
* Intel IPP, sin function only (single core): 900us
* Intel IPP, sin function only (all four cores): 320us
* OpenCL with internal for-loop with 64 iterations (all four cores): 1000us
* OpenCL with float8, no internal loop (all four cores): 850us
* OpenCL with float16, no internal loop (all four cores): 830us
* OpenCL with float16 with internal loop (all four cores): 700us

I was told in this newsgroup that internal for-loops are not vectorized. There is clearly some room left for longer implicit vectors (not only 8 elements), but that is not the main reason for the slow code. No matter how you write it, it is the generated code inside the kernel that is slow (compared to Intel IPP/VML).
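For reference, the float16-with-internal-loop variant timed above was presumably along these lines (a sketch; the exact kernel is not shown in the thread):

__kernel void vdSin16x4(__global const float16 *Src, __global float16 *Dst)
{
    int base = get_global_id(0) * 4;
    // 4 iterations of float16 = 64 floats per work item; sin() is overloaded for vector types
    for (int i = 0; i < 4; i++)
        Dst[base + i] = sin(Src[base + i]);
}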

Another thing: in another post some weeks ago in this newsgroup it was established that kernel enqueuing has an overhead of about 250us, compared to 80us (AMD CPU; 15us for GPU), due to clSetKernelArg slowness. If we correct the timings for this known slowdown (which is said to be fixed), we get:

* OpenCL, c = sin(b), with Intel implicit kernel vectorization (all four cores): 650us
* Intel IPP, sin function only (single core): 900us
* Intel IPP, sin function only (all four cores): 320us
* OpenCL with internal for-loop with 64 iterations (all four cores): 750us
* OpenCL with float8, no internal loop (all four cores): 600us
* OpenCL with float16, no internal loop (all four cores): 580us
* OpenCL with float16 with internal loop with 4 iterations (all four cores): 450us

Conclusion: float16 with 4 loop iterations has a ~50% advantage over float8, and Intel could consider increasing the length of the implicit vector (by 8x?). (I assume it is currently 8, based on the function names visible in the asm.)

Further increasing the length of the for-loop working on the float16 vectors has no effect. The sine function itself, even when properly vectorized, still seems to be 50% slower than the Intel IPP implementation.

So there is no single big culprit; the performance is eaten away little by little.
ARNON_P_Intel
Employee
This discussion is very interesting.
Nevertheless, we should all take into account that Intel provides developers a choice of parallel models for exactly this reason: to enable developers to mix and match libraries and APIs to best fit the problem domain and the developer's expertise.
We will continue to improve our OpenCL performance as well as the other tools we offer.
Regards,
Arnon
janez-makovsek
New Contributor I
Dear Arnon,

Thank you for your reply. I just tested the Intel OpenCL Gold release (June 29th, 2011). The kernel launch overhead (the time needed to launch a kernel from a queue many hundreds of entries long) has increased from 250us to 600us in this release and is approaching a millisecond. Compared to the 15us it takes to launch a kernel on an ATI GPU and 80us on an AMD CPU, that looked pretty bad.

OK, another try: the queue length before it is flushed. It turns out that AMD requires several hundred kernels to be enqueued to achieve the shortest kernel launch overhead. Today I tried to find the best queue length for Intel, and it turns out that clFinish should be called as soon as the queue is longer than 6. In that case a kernel takes only 45us to launch. When the queue is flushed for every kernel (depth 1), the launch time is 70us. Allowing the queue to grow beyond 10 only makes the kernel launch overhead bigger (reaching 600us+ at a depth of 1000).
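A sketch of the flushing pattern described above (the depth of 6 is the empirical value from these measurements; variable names are illustrative):

const int kMaxQueueDepth = 6;
int pending = 0;
for (int i = 0; i < num_kernels; i++) {
    clEnqueueNDRangeKernel(queue, kernels[i], 1, NULL,
                           &global_size, NULL, 0, NULL, NULL);
    if (++pending >= kMaxQueueDepth) {
        clFinish(queue);   // drain the queue before it grows any deeper
        pending = 0;
    }
}
clFinish(queue);           // drain whatever is left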

This is actually very nice :) Going from a complete dump to (almost) pure gold.

> mix and match libraries and APIs to best fit the problem domain and the developer's expertise

The best is to know how to make it go fast.

Thanks!
Atmapuri
Doron_S_Intel
Employee
Hello again Atmapuri,

Your experimentation is quite invaluable. One other suggestion I'd like to make: if you're curious about the launch time for a large number of clEnqueueXXX calls, try the benchmark with an out-of-order queue as well as an in-order queue. The out-of-order queue is expected to perform better for a large number of independent commands.
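For reference, an out-of-order queue is requested at queue creation time, roughly like this (a sketch using the OpenCL 1.1 host API current at the time):

cl_int err;
cl_command_queue ooo_queue = clCreateCommandQueue(
    context, device,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,   // commands may execute out of order
    &err);
// independent kernels can then be enqueued without implicit ordering;
// use events (or clFinish) only where ordering actually matters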

Thanks,
Doron Singer
janez-makovsek
New Contributor I
Dear Doron,

I see. I had not noticed that the Intel driver supports out-of-order execution. I have seen that on AMD GPUs this is (typically?) not available. The number of applications where this is possible is not very large, though.

Thanks!
Atmapuri