Beginner

OpenCL SDK 1.5: 2-3 times slower than 1.1

I've tested the new 1.5 SDK with some selected OpenCL samples from the AMD APP SDK 2.5. These samples allow easy selection of platform/device, so they are very useful for quick-and-dirty tests.

The results are discouraging:

BitonicSort (-x 16777216): 10.85 s for Intel OpenCL SDK 1.1 vs 30.61 s for version 1.5
EigenValue (-x 20480): 10.75 / 23.5 s (1.1 / 1.5)
NBody (-x 102400): 6.87 / 20.0 s
RadixSort (-x 102400000): 9.08 / 12.75 s

Not-so-bad result:
SimpleConvolution (-x 8192 -y 8192 -m 16): 9.83 / 9.99 s

Good result (the only one of the six samples tested):
Histogram (-x 20480 -y 32768 -i 3): 0.92 / 0.83 s

The Histogram sample is the only one not vectorized (the Offline Compiler reports a "Kernel ... not vectorized" message for it).

The samples were run with the platform set to Intel (-p 2 on my machine) and timing enabled (-t on). The times include kernel execution plus data transfer.

The machine is an i7-2600K (so, AVX) @ 4.5 GHz, running Windows 7 x64. Three OpenCL SDKs are installed (NVIDIA, AMD, and Intel).

11 Replies

Alex,

Thanks for the feedback. Generally, you should expect better performance with new releases. We are working on reproducing your setup locally and investigating the problem.

Thanks,
Nadav
Beginner

Also, it does not look like AVX is really being used.
I've tested with vector sum kernel:
__kernel void vsum(__global const float *a, __global const float *b, __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

The inner loop of the vectorized version is 128-bit:
__Vectorized_.vsum: # @__Vectorized_.vsum
.... initialization ....
LBB3_1: # %SyncBB
# =>This Inner Loop Header: Depth=1
mov R10D, DWORD PTR [RDX]
add R10D, DWORD PTR [RSI]
movsxd R10, R10D
vmovups XMM0, XMMWORD PTR [R8 + 4*R10]
vmovups XMM1, XMMWORD PTR [R9 + 4*R10]
vaddps XMM0, XMM1, XMM0
vmovups XMMWORD PTR [RDI + 4*R10], XMM0
add RDX, 32
inc RAX
cmp RAX, RCX
jb LBB3_1

For some kernels I see a 256-bit load, then an extract128 from YMM to XMM, then 128-bit code again.
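To make concrete what the disassembly above is doing, here is a plain-C sketch of the work the 128-bit inner loop performs: each iteration mirrors one vaddps on a 4-float XMM register, whereas a full 256-bit (YMM) loop would handle 8 floats per step. This is an illustrative equivalent, not the compiler's actual output, and the function name is mine:

```c
#include <stddef.h>

/* Illustrative C equivalent of the 128-bit vectorized vsum loop:
   each outer iteration corresponds to one vaddps on a 4-float XMM
   register. A 256-bit (YMM/AVX) version would step by 8 instead. */
static void vsum_xmm_equiv(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4)          /* 4 floats = 128 bits */
        for (size_t lane = 0; lane < 4; ++lane)
            c[i + lane] = a[i + lane] + b[i + lane];
    for (; i < n; ++i)                  /* scalar tail for leftovers */
        c[i] = a[i] + b[i];
}
```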
Employee

Hello Alex,

Do you have SP1 installed for Windows 7?

Thanks,
Doron Singer
Beginner

Yes, SP1 is installed (and all current updates too)

Alex,

Please notice that the AMD sample uses group size = 1, which means that it always runs non-vectorized code.

Nadav

In some cases our auto-vectorizer estimates that using 128-bit wide registers is likely to generate faster code. In other cases, AVX simply does not implement certain operations (such as integer operations).
Beginner


Yes, for the BitonicSort sample GROUP_SIZE is set to 1, which is far from optimal.
The other samples differ:
* The EigenValue and NBody samples set the group size to min(256, value-from-clGetKernelWorkGroupInfo).
* The RadixSort sample uses min(64, value-from-clGetKernelWorkGroupInfo).

The OpenCL driver reports 1024 as the maximum work-group size, so the real values used are 256 and 64. This did not change between Intel OpenCL 1.1 and 1.5.
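The selection logic in those samples can be sketched as follows. This is a hypothetical reconstruction, not the samples' exact code; in the real samples, kernel_max would come from clGetKernelWorkGroupInfo queried with CL_KERNEL_WORK_GROUP_SIZE:

```c
#include <stddef.h>

/* Sketch: cap a sample's preferred local (work-group) size by the
   maximum the kernel/device reports. Names are illustrative. */
static size_t pick_local_size(size_t preferred, size_t kernel_max)
{
    return preferred < kernel_max ? preferred : kernel_max;
}
```

With the driver reporting 1024, EigenValue/NBody end up with min(256, 1024) = 256 and RadixSort with min(64, 1024) = 64, matching the values above.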

Beginner

Quote: In some cases our auto-vectorizer estimates that using 128-bit wide registers is likely to generate faster code. In other cases, AVX simply does not implement certain operations (such as integer operations).


For a simple float vector addition, the 256-bit vaddps should be way faster than the 128-bit one.
Beginner

The good news:

For my own code (with timers added for microbenchmarks) I see about a 20% speed-up for Intel OpenCL 1.5 over v1.1.

Employee

Hi Alex,

We have investigated the BitonicSort and RadixSort issues you reported and found that we do have a performance regression in our 1.5 Gold release vs. 1.1 when executing kernels with small work-group sizes. We are working on eliminating most of this regression, and one of our future releases will include the fix.

To avoid this phenomenon, we recommend using larger work-group sizes; the sweet spot is a work-group size > 64.

As a side comment, I would recommend using large work-group sizes in general, as this will probably be more optimal for our implementation. You can read more about this in the optimization guide attached to this release.

We are still investigating the EigenValue regression. However, we weren't able to reproduce the NBody regression.

Thanks for helping us improve the product,
Boaz
Beginner

Thanks!

Will wait for the next releases!