OpenCL* for CPU

OpenCL SDK 1.5: 2-3 times slower than 1.1

alex-tutubalin
Beginner
I've tested the new 1.5 SDK with a few selected OpenCL samples from the AMD APP SDK 2.5. These samples allow easy selection of platform/device, so they are very useful for quick-and-dirty tests.

The results are discouraging:

BitonicSort (-x 16777216): 10.85 sec with Intel OpenCL SDK 1.1 vs. 30.61 sec with 1.5
EigenValue (-x 20480): 10.75/23.5 sec (1.1/1.5)
NBody (-x 102400): 6.87/20.0 sec
RadixSort (-x 102400000): 9.08/12.75 sec

Not-so-bad result:
SimpleConvolution (-x 8192 -y 8192 -m 16): 9.83/9.99 sec

Good result (the only one of the six samples tested):
Histogram (-x 20480 -y 32768 -i 3): 0.92/0.83 sec

The Histogram sample is also the only one that is not vectorized (per the 'Kernel ... not vectorized' message from the Offline Compiler).

The samples were run with the platform set to Intel (-p 2 on my machine) and timing enabled (-t on). The reported times are for kernel execution plus data transfer.

The machine is an i7-2600K (so AVX-capable) @ 4.5 GHz, running Windows 7 x64. Three OpenCL SDKs are installed (NVIDIA, AMD, and Intel).
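
For reference, a minimal host-side sketch (plain OpenCL 1.1 API, error checking omitted) of picking the Intel platform by name rather than by a positional index such as -p 2, since the index depends on which SDKs happen to be installed:

#include <CL/cl.h>
#include <string.h>

static cl_platform_id find_intel_platform(void)
{
    cl_platform_id platforms[8];
    cl_uint count = 0;

    /* First call returns the number of installed platforms */
    clGetPlatformIDs(0, NULL, &count);
    if (count > 8)
        count = 8;
    clGetPlatformIDs(count, platforms, NULL);

    /* Scan the platform names for the Intel implementation */
    for (cl_uint i = 0; i < count; ++i) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        if (strstr(name, "Intel") != NULL)
            return platforms[i];
    }
    return NULL; /* no Intel platform installed */
}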

Nadav_Rotem__Intel_
Alex,

Thanks for the feedback. Generally, you should expect better performance with new releases. We are working on reproducing your setup locally and investigating the problem.

Thanks,
Nadav
alex-tutubalin
Beginner
Also, it doesn't look like AVX is really being used.
I've tested with vector sum kernel:
__kernel void vsum(__global const float *a, __global const float *b, __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

The inner loop of the vectorized version is 128-bit:
__Vectorized_.vsum: # @__Vectorized_.vsum
.... initialization ....
LBB3_1: # %SyncBB
# =>This Inner Loop Header: Depth=1
mov R10D, DWORD PTR [RDX]
add R10D, DWORD PTR [RSI]
movsxd R10, R10D
vmovups XMM0, XMMWORD PTR [R8 + 4*R10]
vmovups XMM1, XMMWORD PTR [R9 + 4*R10]
vaddps XMM0, XMM1, XMM0
vmovups XMMWORD PTR [RDI + 4*R10], XMM0
add RDX, 32
inc RAX
cmp RAX, RCX
jb LBB3_1

For some kernels I see a 256-bit load, then a vextractf128 from YMM to XMM, then 128-bit code again.
Doron_S_Intel
Employee
Hello Alex,

Do you have SP1 installed for Windows 7?

Thanks,
Doron Singer
alex-tutubalin
Beginner
Yes, SP1 is installed (and all current updates too).
Nadav_Rotem__Intel_
Alex,

Please notice that the AMD sample uses group size = 1, which means that it always runs non-vectorized code.
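
For illustration, a sketch with hypothetical queue/kernel handles: the local work size is the sixth argument of clEnqueueNDRangeKernel, and a value of 1 leaves the implicit vectorizer no neighbouring work-items to pack into SIMD lanes:

size_t global = 16777216;
size_t local  = 1;   /* what the AMD sample requests: scalar, non-vectorized path */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

local = 64;          /* larger groups (or passing NULL) let work-items map to SIMD lanes */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);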

Nadav
Nadav_Rotem__Intel_
In some cases our auto-vectorizer estimates that using 128-bit wide registers is likely to generate faster code. In other cases, AVX simply does not implement certain operations (such as 256-bit integer operations).
alex-tutubalin
Beginner

Yes, for the BitonicSort sample the GROUP_SIZE is set to 1, which is very suboptimal.
The other samples differ:
* The EigenValue and NBody samples set the group size to min(256, value from clGetKernelWorkGroupInfo).
* The RadixSort sample uses min(64, value from clGetKernelWorkGroupInfo).

The OpenCL driver reports 1024 as the maximum work-group size, so the actual values used are 256 and 64. This did not change between Intel OpenCL 1.1 and 1.5.
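
A sketch of the two queries these samples combine (device and kernel handles assumed to already exist):

size_t dev_max = 0, krn_max = 0;

/* Device-wide limit; reported as 1024 here */
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(dev_max), &dev_max, NULL);

/* Per-kernel limit, the value the samples clamp against */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(krn_max), &krn_max, NULL);

size_t local = (krn_max < 256) ? krn_max : 256; /* EigenValue/NBody cap */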

alex-tutubalin
Beginner
Quote: In some cases our auto-vectorizer estimates that using 128-bit wide registers is likely to generate faster code. In other cases, AVX simply does not implement certain operations (such as 256-bit integer operations).


For simple float vector addition, a 256-bit vaddps should be way faster than a 128-bit one.
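
To make the comparison concrete, a hand-written C intrinsics sketch (illustration only, not compiler output) of the 128-bit loop next to the 256-bit loop the hardware could run; assumes n is a multiple of 8 and AVX support:

#include <immintrin.h>

/* 4 floats per iteration: the 128-bit pattern the 1.5 SDK emits */
void vsum128(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
}

/* 8 floats per iteration: what a 256-bit vaddps loop looks like */
void vsum256(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
}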
alex-tutubalin
Beginner
The good news:

For my own code (with timers added for microbenchmarking) I see about a 20% speed-up with Intel OpenCL 1.5 over v1.1.

Boaz_O_Intel
Employee
Hi Alex,

We have investigated the reported BitonicSort and RadixSort issues and found that we do have a performance regression in our 1.5 Gold vs. 1.1 when it comes to executing kernels with small work-group sizes. We are working on eliminating most of this performance regression, and one of our future releases will include the fix for this issue.

To avoid this phenomenon, we recommend using larger work-group sizes; the sweet spot is a work-group size > 64.

As a side comment, I would recommend using large work-group sizes in general, as this will usually be more optimal for our implementation. You can read more about it in the optimization guide that is attached to this release.

We are still investigating the EigenValue regression. However, we weren't able to reproduce the NBody regression.

Thanks for helping us improve the product,
Boaz
alex-tutubalin
Beginner
Thanks!

Will wait for the next releases!