In some cases our auto-vectorizer estimates that using 128-bit wide registers is likely to generate faster code. In other cases, AVX simply does not implement certain operations (such as integer operations).
Yes, for the BitonicSort sample GROUP_SIZE is set to 1, which is far from optimal. Other samples differ:
* The EigenValue and NBody samples set the group size to min(256, value-from-clGetKernelWorkGroupInfo).
* The RadixSort sample uses min(64, value-from-clGetKernelWorkGroupInfo).
The OpenCL driver reports 1024 as the maximum work group size, so the real values used are 256 and 64. This did not change between Intel OpenCL 1.1 and 1.5.
Quote: In some cases our auto-vectorizer estimates that using 128-bit wide
registers is likely to generate faster code. In other cases, AVX simply
does not implement certain operations (such as integer operations).
For simple float vector addition, the 256-bit vaddps should be way faster than the 128-bit one.
We have investigated the reported BitonicSort and RadixSort issues and found that we do have a performance regression in our 1.5 Gold release vs. 1.1 when executing kernels with small work group sizes. We are working on eliminating most of this regression, and one of our future releases will include the fix.
To avoid this phenomenon we recommend using larger work group sizes; the sweet spot is a work group size > 64.
As a side comment, I would recommend using large work group sizes in general, as this will probably perform better with our implementation. You can read more about this in the optimization guide attached to this release.
We are still investigating the EigenValue regression. However, we were not able to reproduce the NBody regression.