The current Intel auto-vectorization works only if the GlobalWorkSize is divisible by the length of the vector being vectorized. This is a restriction similar to the memory-alignment one. Although it is not difficult to allocate an aligned buffer, doing so can easily incur the overhead of an additional copy operation, nullifying the possible gains from proper alignment and increasing memory usage. Please consider the following enhancement with regard to this issue:
1.) Process the elements of the unaligned array up to the first aligned address, then use the faster vectorized code from there on.
2.) Process the elements after the last aligned address with the slower scalar code.
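To illustrate the two steps above, here is a sketch of the prologue/body/epilogue ("loop peeling") pattern in plain C. This is not compiler output, just a hand-written illustration; the vector width `VEC`, the `scale` operation, and all names are made up for the example:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed vector width (e.g. 4 floats = 128-bit SSE register). */
#define VEC 4

/* Multiply every element of `a` by `s`, peeling unaligned edges. */
void scale(float *a, size_t n, float s)
{
    size_t i = 0;

    /* Prologue: scalar loop until the pointer reaches VEC*sizeof(float)
       alignment (or the array ends). */
    while (i < n && ((uintptr_t)(a + i) % (VEC * sizeof(float))) != 0)
        a[i++] *= s;

    /* Body: aligned loop over whole vectors; the inner loop has a
       fixed trip count and an aligned base, so it can be vectorized. */
    size_t body_end = i + ((n - i) / VEC) * VEC;
    for (; i < body_end; i += VEC)
        for (size_t j = 0; j < VEC; ++j)
            a[i + j] *= s;

    /* Epilogue: scalar loop for the remaining tail elements. */
    for (; i < n; ++i)
        a[i] *= s;
}
```

Every element is processed exactly once; only the middle section needs the fast aligned code path, which is exactly what the suggested compiler enhancement would generate automatically.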
I tried to do that manually and achieved pretty good timings compared to the fully aligned case (aligned address and size), but the code is long and tedious to write within the same kernel.
Thanks for your input. We are considering many options to improve our OpenCL compiler technology, and remember that for most OpenCL developers portable code is the top priority.
I appreciate your (Intel's) attention to all customers. The suggested optimization does not affect portability. On the other hand, current OpenCL users may in fact be focusing on the GPU and using the CPU only as a fallback device, but I am also sure it is not in Intel's best interest to keep it that way. OpenCL, the way it was designed, opened up a few new ideas which are applicable even to the exclusive world of CPU-only computing. For example, Microsoft tried for 10 years to speed up .NET applications and, due to all the constraints, has only reached the level of JavaScript with its VS2010 .NET release. There are also many languages/platforms that cannot afford extensive investment into compilers that generate performance-efficient code and use Intel's latest instruction sets. This is where OpenCL for the CPU comes in: it provides a free (!) tool to accelerate performance-sensitive code.
The issue blocking the drive in this direction is the overhead of the OpenCL API, which was built with the assumption of physically separate memory between CPU and GPU and the assumption of inherently threaded code. Both of these features are good because they allow scaling. What is missing is support for a lower-overhead way to call OpenCL code on the CPU.
> overhead of the OpenCL API, which was built with the assumption of physically separate memory between CPU and GPU
Actually, OpenCL defines the clEnqueueMapBuffer function, and Intel advises using it. By the way, the competitor, with its recent processors with integrated GPUs, also advises using this function.
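For reference, a sketch of the map/unmap pattern is below. It requires an OpenCL runtime to actually run; `queue`, `buf`, and `size` are assumed to have been set up already (with `buf` ideally created using `CL_MEM_ALLOC_HOST_PTR` so the map can be zero-copy on shared-memory devices):

```c
#include <CL/cl.h>

/* Sketch: map a buffer into host address space instead of copying it.
   On a CPU device (or an integrated GPU sharing memory with the host),
   the map can be a zero-copy operation. */
void fill_buffer(cl_command_queue queue, cl_mem buf, size_t size)
{
    cl_int err;
    float *p = (float *)clEnqueueMapBuffer(queue, buf,
                                           CL_TRUE,       /* blocking map */
                                           CL_MAP_WRITE,
                                           0, size,
                                           0, NULL, NULL, &err);
    if (err != CL_SUCCESS || p == NULL)
        return;

    /* Write through the mapped pointer directly; no extra copy. */
    for (size_t i = 0; i < size / sizeof(float); ++i)
        p[i] = (float)i;

    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}
```

The key point is that on shared-memory hardware the mapped pointer can refer to the same physical memory the device uses, which sidesteps the "separate memory" overhead mentioned above.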
Do you mean that OpenCL shows significantly lower performance when evaluating the sin(float a) function in comparison with generic host code? Well, if that is true when calling this function 1,000,000 times (on different elements), then it is strange and warrants investigation. But if you are talking about just a single call, then it is expected: OpenCL is meant for relatively large parallel computation tasks.
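As a hypothetical illustration of the 1,000,000-element case, the whole workload maps to a single kernel enqueue, so the fixed API overhead is paid once and amortized across all work-items (the kernel and its names are made up for the example):

```c
/* Hypothetical OpenCL C kernel: one enqueue with a global work size of
   1,000,000 computes sin() for every element; each work-item handles
   one element. */
__kernel void sin_all(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = sin(in[i]);
}
```

By contrast, wrapping a single sin() call in an enqueue pays the full launch overhead for one element, which is why such a comparison against plain host code is not meaningful.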