OpenCL* for CPU

Comparing OpenMP and OpenCL performance!

janez-makovsek
New Contributor I
Hi!

For an OpenCL kernel running on a PC to achieve optimal performance on a quad-core machine, it would use a global_work_size of 4 with a local_work_size of 1 and have a vectorized for-loop inside the kernel. Each of the four launched kernel instances would process one quarter of the array(s) within its own thread.
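Schematically, that partitioning would look something like this (a simplified sketch; the argument names and chunking arithmetic are illustrative, and Len is assumed divisible by 4):

// Illustrative sketch: each of the 4 work-items handles one
// contiguous quarter of the arrays.
__kernel void add_chunked(__global const float *a,
                          __global const float *b,
                          __global float *c,
                          int Len)
{
    int gid   = get_global_id(0);          // 0..3 when global_work_size = 4
    int chunk = Len / get_global_size(0);  // one quarter of the data
    int start = gid * chunk;
    for (int i = start; i < start + chunk; i++)
        c[i] = a[i] + b[i];
}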

I ran some comparisons for the c = a + b kernel. The Intel C++ compiler with threading still runs this code about 5x faster than the current OpenCL. Specifically, kernel-internal for-loops like:

__kernel void test( .... )
{
    ....
    for (i = 0; i < Len; i++)
    {
        c[i] = a[i] + b[i];
    }
}

are not vectorized, which is the main reason for this. If these for-loops are left out and global_work_size is increased instead, there is function-call overhead per kernel invocation (which basically defeats SSE), and the same 5x slowdown appears as expected.

Are there any plans for Intel's OpenCL driver to vectorize kernel-internal for-loops? (That would bring the code's speed on par with C++.)

Thanks!
Atmapuri
Sion_B_Intel
Employee
Hi,
Thanks for your comments.
Currently, the OpenCL compiler vectorizes only the "outer" loop, in the sense of bringing together several work-items. Therefore, to gain SSE utilization, you can increase the number of work-items per work-group (vectorization happens inside a work-group).
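For example, written one element per work-item, the implicit vectorizer can pack adjacent work-items into SSE lanes (a minimal sketch; names are illustrative):

__kernel void add_per_item(__global const float *a,
                           __global const float *b,
                           __global float *c)
{
    int i = get_global_id(0);  // one element per work-item; neighboring
    c[i] = a[i] + b[i];        // work-items map onto SSE lanes
}

Launched with a large global_work_size, this gives the compiler whole work-groups to vectorize.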

As for future plans, we are considering several optimization possibilities. We will take your request into account.

Thank you,
Sion
janez-makovsek
New Contributor I
Hi!

I tried the outer-loop vectorization first (without specifying local_work_size), and the speed was about 8x slower than OpenMP C++. You mention in your documentation that it is best to leave local_work_size undefined. Notice also that the inner for-loop, which as you say is not vectorized, runs only 5x slower than OpenMP C++.

The non-vectorized inner loop therefore seems to run faster than the vectorized outer one (though both are much slower than they could be).

Thanks!
Atmapuri
Maxim_S_Intel
Employee

> it would be using a global_work_size of 4

Hi, what you are probably seeing here is a threading issue, not a code-generation or vectorization issue. Having just 4 work-items in the whole NDRange (and thus 4 work-groups) reduces the efficiency of parallelization, since it is better for the number of work-groups to be much larger than the number of logical cores. That allows efficient load balancing, task stealing, etc. In contrast, with just 4 work-items in the NDRange it is not even guaranteed that all available threads will be used.

I would advise trying ~1024 work-items in the NDRange, together with NULL for the local size; this will give you about 16-32 work-groups, which should be fine for a 4-core CPU.
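On the host side, that launch would look roughly like this (a sketch; queue/kernel setup and error checking omitted):

size_t global_work_size = 1024;     /* ~1024 work-items in the NDRange */
cl_int err = clEnqueueNDRangeKernel(
    queue, kernel,
    1,                  /* work_dim                                      */
    NULL,               /* global_work_offset                            */
    &global_work_size,  /* global_work_size                              */
    NULL,               /* local_work_size: NULL lets the runtime choose */
    0, NULL, NULL);     /* no wait list, no event                        */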
janez-makovsek
New Contributor I
Thanks! I tried various global and local work sizes before posting and noticed no major difference in run times (maybe 20%, while testing reasonable local_work_size values). The fastest run was not with NULL for local_work_size, though (using a Core i7 860).

>Having just 4 work-items in the whole NDRange

If each work-item does a considerable amount of work within the kernel's internal for-loop, I believe that is not an issue. I assume Intel still recommends, in general, not launching more threads doing FPU (SSE) math than there are cores, for highest efficiency and speed. Even with Hyper-Threading, that would still mean only 8 threads on a 4-core machine.

> probably here you see issue with threading

That I agree with. I did some more tests and can confirm that code generation for the outer loop is actually fairly efficient. The bottleneck is clSetKernelArg, which on competing platforms (AMD CPU or AMD GPU) takes between 1 µs and 2 µs, while with Intel's current drivers it takes between 20 µs and 300 µs (all measured over 8 calls).
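For reference, the measurement was of this shape (a simplified sketch, assuming a POSIX clock; kernel and buffer setup are omitted):

#include <CL/cl.h>
#include <stdio.h>
#include <time.h>

/* Sketch: time 8 clSetKernelArg calls. 'kernel' and 'buf' are assumed
   to have been created elsewhere. */
static void bench_set_kernel_arg(cl_kernel kernel, cl_mem buf)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 8; i++)
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6
              + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("8 x clSetKernelArg: %.1f us\n", us);
}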