I've tested the new 1.5 SDK with some selected OpenCL samples from the AMD APP SDK 2.5. These samples allow easy selection of platform/device, so they are very useful for quick-and-dirty tests.
The results are discouraging:
BitonicSort (with -x 16777216 option): 10.85 sec for Intel OpenCL SDK 1.1 and 30.61 sec for version 1.5
EigenValue (-x 20480): 10.75/23.5 sec (1.1/1.5 versions)
Nbody (-x 102400): 6.87/20.0
RadixSort (-x 102400000): 9.08/12.75
Not-so-bad result:
SimpleConvolution: (-x 8192 -y 8192 -m 16): 9.83/9.99
Good result (the only one of the six samples tested):
Histogram (-x 20480 -y 32768 -i 3): 0.92/0.83
The Histogram sample is the only one 'not vectorized' (the Offline Compiler reports a "Kernel ... not vectorized" message for it).
The samples were run with the platform set to Intel (-p 2 on my machine) and timing on (-t on). The times include kernel + data transfer.
The machine is an i7-2600K (so, AVX) @ 4.5 GHz, Windows 7 x64. Three OpenCL SDKs are installed (NVIDIA, AMD, and Intel).
Alex,
Thanks for the feedback. Generally, you should expect better performance with new releases. We are working on reproducing your setup locally and investigating the problem.
Thanks,
Nadav
Also, it does not look like AVX is really being used.
I've tested with vector sum kernel:
__kernel void vsum (__global const float *a, __global const float *b, __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
The inner loop of the vectorized version is 128-bit:
__Vectorized_.vsum: # @__Vectorized_.vsum
.... initialization ....
LBB3_1: # %SyncBB
# =>This Inner Loop Header: Depth=1
mov R10D, DWORD PTR [RDX]
add R10D, DWORD PTR [RSI]
movsxd R10, R10D
vmovups XMM0, XMMWORD PTR [R8 + 4*R10]
vmovups XMM1, XMMWORD PTR [R9 + 4*R10]
vaddps XMM0, XMM1, XMM0
vmovups XMMWORD PTR [RDI + 4*R10], XMM0
add RDX, 32
inc RAX
cmp RAX, RCX
jb LBB3_1
For some kernels I see a 256-bit load, then an extract128 from YMM to XMM, then 128-bit code again.
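For reference, here is the kernel's work as a plain-C loop (a sketch for illustration only): the 128-bit XMM loop above performs 4 of these float adds per iteration, while a full-width 256-bit YMM loop would perform 8.

```c
#include <stddef.h>

/* Plain-C reference of the vsum kernel: one iteration per work-item.
   The 128-bit XMM loop shown above does 4 such adds per iteration;
   a 256-bit YMM loop would do 8. */
void vsum_ref(const float *a, const float *b, float *c, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid)
        c[gid] = a[gid] + b[gid];
}
```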
Hello Alex,
Do you have SP1 installed for Windows 7?
Thanks,
Doron Singer
Yes, SP1 is installed (and all current updates too)
Alex,
Please notice that the AMD sample uses group size = 1, which means that it always runs non-vectorized code.
Nadav
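A toy model of why group size 1 defeats vectorization (a sketch, not the actual compiler heuristic, assuming the implicit vectorizer packs consecutive work-items of a work-group into SIMD lanes):

```c
#include <stddef.h>

/* Toy model: with a SIMD width of `width` lanes, a work-group of
   `local_size` work-items yields local_size / width full-width
   vectorized iterations plus a scalar remainder. */
size_t vector_iterations(size_t local_size, size_t width)
{
    return local_size / width;   /* full-width SIMD iterations */
}

size_t scalar_remainder(size_t local_size, size_t width)
{
    return local_size % width;   /* leftover work-items run scalar */
}
```

With local_size = 1 and width = 4, vector_iterations is 0, so every work-item falls into the scalar tail and the vectorized kernel body is never entered.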
In some cases our auto-vectorizer estimates that using 128-bit wide registers is likely to generate faster code. In other cases, AVX simply does not implement certain operations (such as integer operations).
Yes, for the BitonicSort sample GROUP_SIZE is set to 1, which is very suboptimal.
Other samples differ:
* The EigenValue and NBody samples set the group size to min(256, value from clGetKernelWorkGroupInfo)
* The RadixSort sample uses min(64, value from clGetKernelWorkGroupInfo).
The OpenCL driver reports 1024 as the max work-group size, so the real values used are 256 and 64. This did not change between Intel OpenCL 1.1 and 1.5.
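The samples' clamping logic can be sketched as follows (a hypothetical helper; `max_from_query` stands for the value the samples read via clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE):

```c
#include <stddef.h>

/* Pick the actual work-group size the way the samples do:
   min(cap, value reported by clGetKernelWorkGroupInfo). */
size_t choose_group_size(size_t cap, size_t max_from_query)
{
    return cap < max_from_query ? cap : max_from_query;
}
```

With the driver reporting 1024, a cap of 256 (EigenValue, NBody) yields 256 and a cap of 64 (RadixSort) yields 64.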
Quote: In some cases our auto-vectorizer estimates that using 128-bit wide registers is likely to generate faster code. In other cases, AVX simply does not implement certain operations (such as integer operations).
For simple float vector addition, a 256-bit vaddps should be way faster than a 128-bit one.
The good news:
for my own code (with timers added for microbenchmarks) I see about a 20% speed-up for Intel OpenCL 1.5 over v1.1.
Hi Alex,
We have investigated the reported BitonicSort and RadixSort issues and found that we do have a performance regression in our 1.5 Gold vs. 1.1 when it comes to executing kernels with small work-group sizes. We are working on eliminating most of this regression, and one of our future releases will include a fix for it.
To avoid this phenomenon we recommend using larger work-group sizes; the sweet spot would be at a work-group size > 64.
As a side comment, I would recommend using large work-group sizes in general, as this will probably be more optimal for our implementation. You can read more about it in the optimization guide attached to this release.
We are still investigating the EigenValue regression; however, we weren't able to reproduce the NBody regression.
Thanks for helping us improve the product,
Boaz
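One practical note when moving to larger work groups: OpenCL 1.x requires the global size to be a multiple of the local size, so the global size usually has to be padded up (a common host-side helper, not from the samples; padded work-items are then masked out inside the kernel):

```c
#include <stddef.h>

/* Round the global work size up to the next multiple of the chosen
   local size, as OpenCL 1.x clEnqueueNDRangeKernel requires
   global_size % local_size == 0. */
size_t round_up_global(size_t global_size, size_t local_size)
{
    return (global_size + local_size - 1) / local_size * local_size;
}
```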
Thanks!
Will wait for the next releases!