
OpenCL SDK 1.5: 2-3 times slower than 1.1

alex-tutubalin
Beginner
I've tested the new 1.5 SDK with some selected OpenCL samples from the AMD APP SDK 2.5. These samples allow easy selection of the platform/device, so they are very useful for quick-and-dirty tests.

The results are discouraging:

BitonicSort (-x 16777216): 10.85 / 30.61 sec (1.1 / 1.5 versions)
EigenValue (-x 20480): 10.75 / 23.5 sec
Nbody (-x 102400): 6.87 / 20.0 sec
RadixSort (-x 102400000): 9.08 / 12.75 sec

Not-so-bad result:
SimpleConvolution (-x 8192 -y 8192 -m 16): 9.83 / 9.99 sec

Good result (the only one of the six samples tested):
Histogram (-x 20480 -y 32768 -i 3): 0.92 / 0.83 sec

The Histogram sample is the only one that was not vectorized (the Offline Compiler prints a 'Kernel ... not vectorizes' message for it).

The samples were run with the platform set to Intel (-p 2 on my machine) and timing on (-t). The times are for kernel + data transfer.

The machine is an i7-2600K (so, AVX) @ 4.5 GHz, Windows 7/x64. Three OpenCL SDKs are installed (NVIDIA, AMD and Intel).
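
As an aside, if one wants to separate the kernel time from the transfer time, event profiling is the standard way to do it. A minimal sketch, assuming the context, device and kernel are already created (the helper name is made up for illustration):

#include <CL/cl.h>

/* Minimal sketch: time only the kernel via event profiling, so the
   data-transfer cost is excluded. Assumes ctx/dev/kernel exist. */
static double kernel_seconds(cl_context ctx, cl_device_id dev,
                             cl_kernel kernel, size_t global_size)
{
    cl_int err;
    /* Profiling must be requested when the queue is created. */
    cl_command_queue q = clCreateCommandQueue(ctx, dev,
                                              CL_QUEUE_PROFILING_ENABLE, &err);
    cl_event evt;
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global_size,
                           NULL, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong t0 = 0, t1 = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(t1), &t1, NULL);
    clReleaseEvent(evt);
    clReleaseCommandQueue(q);
    return (t1 - t0) * 1e-9; /* profiling timestamps are in nanoseconds */
}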

Nadav_Rotem__Intel_
Alex,

Thanks for the feedback. Generally, you should expect better performance with new releases. We are working on reproducing your setup locally and investigating the problem.

Thanks,
Nadav
alex-tutubalin
Beginner
Also, it does not look like AVX is really being used.
I've tested with a vector sum kernel:
__kernel void vsum(__global const float *a, __global const float *b, __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

The inner loop of the vectorized version is 128-bit:
__Vectorized_.vsum: # @__Vectorized_.vsum
.... initialization ....
LBB3_1: # %SyncBB
# =>This Inner Loop Header: Depth=1
mov R10D, DWORD PTR [RDX]
add R10D, DWORD PTR [RSI]
movsxd R10, R10D
vmovups XMM0, XMMWORD PTR [R8 + 4*R10]
vmovups XMM1, XMMWORD PTR [R9 + 4*R10]
vaddps XMM0, XMM1, XMM0
vmovups XMMWORD PTR [RDI + 4*R10], XMM0
add RDX, 32
inc RAX
cmp RAX, RCX
jb LBB3_1

For some kernels I see a 256-bit load, then a vextractf128 from YMM to XMM, then 128-bit code again.
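
One way to probe this would be an explicit float8 variant of the same kernel (just a sketch, not from the SDK samples), to see whether spelling out an 8-wide vector changes the generated code:

__kernel void vsum8(__global const float *a, __global const float *b, __global float *c)
{
    /* Each work-item handles 8 floats; the global size must be 1/8 of
       the element count, and the buffers a multiple of 8 elements. */
    size_t gid = get_global_id(0);
    float8 va = vload8(gid, a);
    float8 vb = vload8(gid, b);
    vstore8(va + vb, gid, c);
}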
Doron_S_Intel
Employee
Hello Alex,

Do you have SP1 installed for Windows 7?

Thanks,
Doron Singer
alex-tutubalin
Beginner
Yes, SP1 is installed (and all current updates too)
Nadav_Rotem__Intel_
Alex,

Please notice that the AMD sample uses group size = 1, which means that it always runs non-vectorized code.
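
For reference, the group size in question is the local_work_size argument on the host side. A minimal sketch, assuming the queue and kernel already exist (the function name is illustrative):

#include <CL/cl.h>

/* Sketch: the work-group size the runtime sees is whatever the host
   passes as local_work_size. Assumes queue and kernel already exist. */
void enqueue_variants(cl_command_queue queue, cl_kernel kernel)
{
    size_t global_size = 16777216;
    size_t local_one = 1;   /* as in the AMD sample: non-vectorized path */
    size_t local_big = 256; /* gives the vectorizer work-items to pack */

    /* Group size = 1 (what the sample does): */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                           &local_one, 0, NULL, NULL);
    /* Larger explicit group size (global_size must be a multiple of it): */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                           &local_big, 0, NULL, NULL);
    /* Or pass NULL and let the implementation choose: */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                           NULL, 0, NULL, NULL);
}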

Nadav
Nadav_Rotem__Intel_
In some cases our auto-vectorizer estimates that using 128-bit wide registers is likely to generate faster code. In other cases, AVX simply does not implement certain operations (such as integer operations).
alex-tutubalin
Beginner

Yes, for the BitonicSort sample the GROUP_SIZE is set to 1, which is very suboptimal.
Other samples differ:
* The EigenValue and NBody samples set the group size to min(256, value from clGetKernelWorkGroupInfo)
* The RadixSort sample uses min(64, value from clGetKernelWorkGroupInfo).

The OpenCL driver reports 1024 as the max work group size, so the real values used are 256 and 64. This has not changed between Intel OpenCL 1.1 and 1.5.
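
In code form, that selection looks roughly like this (a standard clGetKernelWorkGroupInfo query; the helper name is illustrative):

#include <CL/cl.h>

/* Sketch of the samples' selection: query the per-kernel maximum work
   group size and clamp it to a cap (256 or 64 in the samples above). */
size_t pick_group_size(cl_kernel kernel, cl_device_id device, size_t cap)
{
    size_t max_wg = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    return (max_wg < cap) ? max_wg : cap; /* min(cap, max_wg) */
}
/* With the driver reporting 1024, this yields 256 and 64 respectively. */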

alex-tutubalin
Beginner
Quote: In some cases our auto-vectorizer estimates that using 128-bit wide registers is likely to generate faster code. In other cases, AVX simply does not implement certain operations (such as integer operations).


For simple float vector addition, a 256-bit vaddps should be way faster than a 128-bit one.
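
To make the comparison concrete, here is a standalone sketch with AVX intrinsics (plain C, not the SDK's generated code); the 256-bit loop issues half as many instructions for the same work, although a streaming add like this is often memory-bound in practice:

#include <stddef.h>
#include <immintrin.h>

/* Sketch: the same float addition at 128-bit and 256-bit width.
   Assumes n is a multiple of 8 and AVX is available. */
void vsum_sse(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {      /* 4 floats per iteration */
        _mm_storeu_ps(c + i,
            _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    }
}

void vsum_avx(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {      /* 8 floats per iteration */
        _mm256_storeu_ps(c + i,
            _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
    }
}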
alex-tutubalin
Beginner
The good news:

For my own code (with timers added for microbenchmarks) I see about a 20% speed-up for Intel OpenCL 1.5 over v1.1.

Boaz_O_Intel
Employee
Hi Alex,

We have investigated the reported BitonicSort and RadixSort issue and found that we do have a performance regression in 1.5 Gold vs. 1.1 when it comes to executing kernels with small work group sizes. We are working on eliminating most of this regression, and one of our future releases will include the fix for this issue.

To avoid this phenomenon, we recommend using larger work group sizes; the sweet spot would be at a work group size > 64.

As a side comment, I would recommend using large work group sizes in general, as this will probably be more optimal for our implementation. You can read more about this in the optimization guide attached to this release.

We are still investigating the EigenValue regression. However, we weren't able to reproduce the NBody regression.

Thanks for helping us improve the product,
Boaz
alex-tutubalin
Beginner
Thanks!

Will wait for the next releases!