Memory alignment requirements!

janez-makovsek · 07-04-2011

Hi!

The current Intel auto-vectorization works only if the GlobalWorkSize is divisible by the length of the vector to be vectorized. This restriction is similar to the memory alignment requirement. Although it is not difficult to allocate an aligned vector, doing so can easily incur the overhead of an additional copy operation, thus nullifying the possible gains from proper alignment and increasing memory usage. Please consider the following enhancement with regard to this issue:

1.) Process the elements of the unaligned array up to the point of alignment and use faster code from there on.
2.) Process the elements after the final aligned address with slower code.

I tried to do that manually and achieved pretty good timings compared to the case of aligned address and size, but the code is long and tedious to write within the same kernel.
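For readers who want to see the shape of the two steps above, here is a minimal sketch in plain host-side C rather than in an OpenCL kernel. It is not the code I used. The names `VEC_WIDTH`, `head_count` and `scale_in_place` are made up for illustration, the 4-float (16-byte) vector width is an assumption, and the pointer is assumed to be at least float-aligned.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_WIDTH 4                            /* assumed: 4 floats per vector */
#define VEC_BYTES (VEC_WIDTH * sizeof(float))  /* assumed: 16-byte alignment   */

/* Number of leading elements to process one by one before p reaches a
 * VEC_BYTES boundary (hypothetical helper; p must be float-aligned). */
static size_t head_count(const float *p, size_t n)
{
    size_t misalign = (size_t)((uintptr_t)p % VEC_BYTES);
    if (misalign == 0)
        return 0;
    size_t head = (VEC_BYTES - misalign) / sizeof(float);
    return head < n ? head : n;
}

/* Multiplies x[0..n) by a in place in three phases:
 * scalar head, aligned blocks of VEC_WIDTH, scalar tail. */
static void scale_in_place(float *x, size_t n, float a)
{
    size_t i = 0;
    size_t head = head_count(x, n);

    /* 1.) elements before the first aligned address */
    for (; i < head; ++i)
        x[i] *= a;

    /* aligned body: fixed-width blocks, a stand-in for the fast vector path */
    for (; i + VEC_WIDTH <= n; i += VEC_WIDTH)
        for (size_t k = 0; k < VEC_WIDTH; ++k)
            x[i + k] *= a;

    /* 2.) elements after the last full aligned block */
    for (; i < n; ++i)
        x[i] *= a;
}
```

Calling `scale_in_place(buf + 1, 10, 2.0f)` on a 16-byte aligned `buf` would handle 3 elements in the head, 4 in the aligned body and 3 in the tail. Inside one kernel the same split has to be expressed through work-item indices, which is where it becomes long and tedious.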

Thanks!
Atmapuri

Sion_B_Intel · 07-04-2011

Hi,
Thanks for your input. We are considering many options for improving the OpenCL compiler technology; keep in mind that for most OpenCL developers, portable code is most important.

Best regards,
Sion

janez-makovsek · 07-05-2011

Dear Sion,
I appreciate your (Intel's) attention to all customers. The suggested optimization does not affect portability. On the other hand, current OpenCL users may in fact be focusing on the GPU and using the CPU only as a fallback device. But I am also sure that it is not in Intel's best interest to keep it that way. OpenCL, the way it was designed, opened up a few new ideas which are also applicable to the CPU-only world. For example, Microsoft tried for 10 years to speed up .NET applications and has reached the level of JavaScript with its VS2010.NET release (due to all the constraints). There are also many languages/platforms that cannot afford extensive investment in compilers that generate performance-efficient code and use the latest Intel instruction sets. This is where OpenCL for the CPU comes in. It provides a free (!) tool to accelerate performance-sensitive code.

The issue blocking progress in this direction is the overhead of the OpenCL API, which was built with the assumption of physically separate memory between CPU and GPU and the assumption of inherently threaded code. Both of these features are good because they allow scaling. What is missing is support for lower overhead (when calling OpenCL code) on the CPU.

Regards!
Atmapuri

maxim_milakov · 07-05-2011

>overhead of the OpenCL API, which was built with the assumption of physically separate memory between CPU and GPU

Actually, OpenCL defines the enqueueMapBuffer function, and Intel advises using it. By the way, the competitor, with its recent processors with integrated GPUs, also advises using this function.

janez-makovsek · 07-05-2011

In C++, calling a single-element function like sin(float a) is much faster than calling it via the OpenCL API. That is what I mean by overhead. It's a different perspective. Even enqueueMapBuffer is far from ideal in this case. There is no technical reason why OpenCL (with extensions) should not be able to provide the same low overhead for CPU devices.

Regards!
Atmapuri

maxim_milakov · 07-06-2011

Do you mean that OpenCL shows significantly lower performance in processing the sin(float a) function in comparison with generic host code? Well, if that is true when calling this function 1,000,000 times (for different elements), then it is strange and an investigation is required. But if you are talking about just a single call, then it is fine. OpenCL is for relatively large parallel computation tasks.

janez-makovsek · 07-06-2011

That is my point. OpenCL need not be for large parallel computation tasks only when it comes to CPU devices.
Atmapuri