
Intel OpenCL SDK is actually faster (originally: slower) compared to AMD APP SDK 2.4

kugeleck
Beginner
Hi,
I'm doing some simple OpenCL tests and I noticed that my OpenCL code executes much slower (originally: faster; see the edit below) with the AMD APP SDK 2.4. I'm running 64-bit Linux with an Intel Core i5 750 @ 2.67GHz and an NVIDIA GeForce GTX 550 Ti.
When I run the following OpenCL program (vector and global work size is 4096, local work size is 256):
[plain]#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel __attribute__((vec_type_hint(double)))
void add(__global __read_only double *a,
         __global __read_only double *b,
         __global __write_only double *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}[/plain]
I get these execution times on average:
INTEL: 1.45s
AMD: 0.45s
NVIDIA: 1.1s (probably due to overhead)
native: 2.35s (single CPU)

What could be the reason?

EDIT: The times I measured were actually bogus; correct numbers can be found in post #18. I was also using clWaitForEvents instead of clFinish. According to the "Tips and Tricks for Kernel Development", it is better to use clFinish with the Intel OpenCL SDK.

Thanks
zhaopeng
Beginner
I have also gotten similar results. The performance of my median filter sample using the Intel SDK is half of that using the AMD SDK.
However, Linux support in the Intel OpenCL SDK is still in a preview state; maybe it will be optimized in the future.
gzahl
Beginner
Hey,
I have no experience at all (yet) in OpenCL development, but take a look at "Avoiding Spurious Operations in Kernel Code".

Maybe this could help your performance?

Cheers
Manuel
Maxim_S_Intel
Employee
Indeed, as pointed out by gzahl, using unsigned int as the type for a work-item id introduces the overhead of down-casting the actual 64-bit value to 32 bits. The best practice is to use size_t, which works on both 32- and 64-bit platforms.
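For instance:

[plain]size_t i = get_global_id(0);      /* matches get_global_id's return type, no cast */
uint j = (uint)get_global_id(0);  /* forces a 64-bit to 32-bit down-cast on x64 */[/plain]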

However, with an input size of just 4096 elements, various overheads might dominate the overall time,
especially if you include data movement costs in your measurements, as the queueing/synchronization/etc. associated with clEnqueue*Buffer can be high compared to the actual time for transferring those 4096 doubles. I would guess that the kernel execution time is far less than the data transfer cost for this kernel.

Could you please specify whether data transfer time is included? Did the situation change when you increased the input size?
I would also kindly ask zhaopengintel to elaborate on the same questions regarding the median sample.

kugeleck
Beginner
@zhaopengintel
Good to know that I'm not the only one with such results.

@gzahl
I read the "Tips and Tricks for Kernel Development" and changed my code accordingly. I also updated my post.

@Maxim Shevtsov
I increased the number of elements to about 1/2 million and it looks like vector additions are just too simple. The execution times are now:
INTEL: 2.7s
AMD: 0.85s
NVIDIA: 1.2s
native: 325s (single CPU)
As you can see, the trend seems to be the same: the AMD APP SDK 2.4 still provides better results than the Intel OpenCL SDK. What are your experiences with the AMD APP SDK 2.4? Do you have any idea when the final version of the Intel OpenCL SDK will arrive?

I'm going to implement FFT to get a better idea of how the different SDKs/devices compare to each other.
Evgeny_F_Intel
Employee
Thanks for the report.

Can you provide more information about the OS you are using?
dade916
Beginner
I'm testing the Intel OpenCL SDK on an i7 860 (plus 2x HD 5870 + 1x HD 5850) and the results I'm obtaining are a bit different: I'm observing only a 16% difference in performance. I'm also using some AMD-specific compiler options (e.g. "-fno-alias") when running on the AMD OpenCL platform, so the difference seems OK to me.

However, I'm using http://www.luxrender.net/wiki/LuxMark as the test (i.e. the kernel code is a few thousand lines).


kugeleck
Beginner
@Evgeny Fiksman
I'm using openSUSE 11.3 with kernel version 2.6.33.7-rt x86_64. Since the SDK depends on libnuma, I installed libnuma 2.0.4-rc2 from the repository.

I also get similar results on a different system with Ubuntu 10.10 and kernel 2.6.35 x86_64; libnuma is also 2.0.4-rc2. The processor is a Core 2 Duo P8600 @ 2.4GHz and the graphics card is an NVIDIA Quadro NVS 160M.
kugeleck
Beginner
@dade916
Which version of LuxMark are you using? The prebuilt one (LuxMark 1.0) does not show my CPU as an OpenCL device. Which scene are you rendering? The default one (LuxBall HD) or something else?

Just out of curiosity: how many points do you get with the HD 5850?
Evgeny_F_Intel
Employee
@kugeleck
We do expect some small differences in execution due to Linux kernel configuration differences, but I don't think that's the case here.
Do you create buffers with the CL_MEM_USE_HOST_PTR flag? If so, please make sure the pointer is aligned to CL_DEVICE_MEM_BASE_ADDR_ALIGN.
If not, please also post your host code.
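For illustration, a minimal sketch of allocating a suitably aligned host pointer (assuming POSIX; device, context, size, and error come from the usual setup, and error checking is omitted):

[plain]#include <stdlib.h>   /* posix_memalign */
#include <CL/cl.h>

cl_uint align_bits = 0;
clGetDeviceInfo(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN,
                sizeof(align_bits), &align_bits, NULL);

/* The alignment is reported in bits; posix_memalign takes bytes. */
void *host_ptr = NULL;
posix_memalign(&host_ptr, align_bits / 8, size);

cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                               size, host_ptr, &error);[/plain]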
zhaopeng
Beginner
Surprised by so many quick replies, thanks! My median sample is very simple and not optimized for specific hardware. Yesterday I just did a quick test with 50 passes of the filter over a 1000*1122 RGB image. The total computation time (not including data transfer time) is 4.58294m on my notebook with a Core 2 Duo P8600 and openSUSE 11.4; using the AMD SDK, the result is 1.86533m. Now I think I have found the reason: because there is no float3 type in the OpenCL C language, I do the calculation separately for R, G and B using the scalar float type.

So I wrote another kernel using float4 to calculate RGBA and ran a test on an RGBA image of the same size. The performance is very good: 0.781254m, while the AMD SDK result is 1.19419m. My impression is that the Intel OpenCL SDK is optimized for vector data types, and the performance guide also suggests avoiding extracting vector components. Am I right?
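For illustration, the difference can be sketched with a pair of trivial kernels (made up for this post, not my actual median filter):

[plain]// Scalar version: three separate loads/stores per RGB pixel,
// which leaves little for the implicit vectorizer to work with.
__kernel void brighten_rgb(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    out[3*i+0] = in[3*i+0] * 1.1f;
    out[3*i+1] = in[3*i+1] * 1.1f;
    out[3*i+2] = in[3*i+2] * 1.1f;
}

// float4 version: one vector load/store per RGBA pixel.
__kernel void brighten_rgba(__global const float4 *in, __global float4 *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * 1.1f;
}[/plain]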
kugeleck
Beginner
@Evgeny Fiksman
I'm using QtOpenCL [1] as a C++ wrapper for OpenCL. With that my call ends up looking like this [2]:
[bash]clCreateBuffer(
    contextId,
    CL_MEM_READ_WRITE,
    size,
    NULL,      // no host pointer: the implementation allocates the buffer
    &error);[/bash]
If I understand clCreateBuffer [3] and what you wrote correctly, then this should not be my problem. Since the OpenCL implementation allocates the buffer for me, I don't need to worry about alignment.


@zhaopengintel
Would it be possible to get your example code? Or could you maybe run some other example that you have and post your results? This would give me a chance to find out whether the C++ wrapper I'm using is the problem, since we seem to have identical notebook processors.

Thanks
Maxim_S_Intel
Employee
I increased the number of elements to about 1/2 million and it looks like vector additions are just too simple. The execution times are now:
INTEL: 2.7s
AMD: 0.85s
NVIDIA: 1.2s
native: 325s (single CPU)

Notice that the native performance did change with the increased input size, but the OpenCL numbers almost didn't. This might indicate that your measurement routine does not block until the kernel has completed. In other words, you are probably measuring just the overhead of calling clEnqueueNDRangeKernel (which is asynchronous and just puts your job into the queue), not the real kernel execution.

The proper sequence needs to include either explicitly waiting on some event, or calling clFinish (which also ensures that everything previously submitted has completed):

[plain]ulong start = get_current_time(); // either rdtsc or system timing functions
clEnqueueNDRangeKernel(...);
clFinish(queue);
ulong end = get_current_time();
ulong exec_time = end - start;[/plain]

You might also want to use OpenCL profiling events, which give you the ability to get a more precise breakdown of the execution costs.

dade916
Beginner
@kugeleck: I'm using the latest source available from http://src.luxrender.net/luxrays/ (I'm one of the authors). I'm not aware of any change to the LuxMark code since the v1.0 release, so that shouldn't be an issue.

I have the Intel SDK installed side by side with the AMD SDK on my Linux box (Ubuntu 10.10, 64-bit) and have no problem recognizing both platforms.

However, the LuxMark GUI doesn't support multiple OpenCL platforms, so you have to edit a configuration file and select the platform to use/test by hand.
Just open the "luxmark/scenes/luxball/render-hdr.cfg" file and edit the following line:

# Select the OpenCL platform to use (0=first platform available, 1=second, etc.)
opencl.platform.index = 0

For instance, just change the "0" to "1" to switch from the first OpenCL platform to the second.

P.S. My aggregate (i.e. i7 860 + 2x HD 5870 + 1x HD 5850) LuxMark score is 11816. Sorry, I don't have the 5850-only score at hand at the moment.



Evgeny_F_Intel
Employee
@zhaopengintel
It's good to hear that you have improved performance.
OpenCL 1.1 defines the float3 type; it's one of the differences from 1.0.

The implicit vectorization module should also handle scalar types; you can use the Offline Compiler tool to see whether it worked. If it didn't, we would appreciate it if you could share your kernel so we can understand the issue.
kugeleck
Beginner
@dade916
I tried to build LuxMark from the latest source, which requires recent Boost libraries. I built those but then had trouble with CMake. In the end I found a PPA with Boost 1.44.

When I now run LuxMark I get:
[plain][LuxRays] OpenCL Platform 0: Advanced Micro Devices, Inc.
[LuxRays] OpenCL Platform 1: Intel Corporation
RUNTIME ERROR: Unable to find an appropiate OpenCL platform[/plain]
I already changed opencl.cpu.use, opencl.gpu.use and opencl.platform.index, but that did not help.

Since I also run Ubuntu 10.10 on my laptop, it must be possible to get this running.
jogshy
New Contributor I
Probably AMD is converting your doubles to floats; that could be the reason :p
Change those doubles to floats and test again :p
dade916
Beginner
@kugeleck: I have tried even the pre-compiled LuxMark v1.0 binaries available on the site and they seem to work fine with the Intel SDK. Your error is printed by the following code:
[bash]if ((platforms.size() == 0) || (openclPlatformIndex >= (int)platforms.size()))
  throw std::runtime_error("Unable to find an appropiate OpenCL platform");[/bash]
The Intel and AMD OpenCL platforms are clearly recognized, as printed in your log message, so it looks like "openclPlatformIndex" (i.e. opencl.platform.index) ends up with a value greater than 1. The file parser is a bit picky: maybe there is some extra character on the line, or opencl.platform.index is set multiple times (the last one read wins)?
kugeleck
Beginner
@Maxim Shevtsov
I changed my code and am now also using profiling events. It now looks like this:
[plain]queue = clCreateCommandQueue(..., CL_QUEUE_PROFILING_ENABLE, ...);

gettimeofday(&before, ...);

clEnqueueNDRangeKernel(queue, ..., &event);

clWaitForEvents(1, &event); // variant 1
clFinish(queue);            // variant 2

gettimeofday(&after, ...);

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, ..., &start, ...);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, ..., &end, ...);

time = end - start;
time_with_overhead = (after.tv_sec * 1000000 + after.tv_usec) -
                     (before.tv_sec * 1000000 + before.tv_usec);

clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ..., kernel_work_group_size, ...);[/plain]
I now get quite different times with a global work size of 8192:
[plain]time (kernel)
LWS   | 128 | 256 | 512 | 1024 | kernel_work_group_size
------+-----+-----+-----+------+-----------------------
NVIDIA|  5  | <5  | >5  | >5   |         1024
INTEL |15/12|15/12|14/12|14/12 |          512
AMD   |28/30|22/25|20/23|20/23 |         1024

time_with_overhead
LWS   | 128 | 256 | 512 | 1024
------+-----+-----+-----+-----
NVIDIA|28/30|28/30|28/30|29/30
INTEL |21/21|21/20|20/20|20/20
AMD   |53/62|41/51|36/45|33/43

LWS = local work size
measurements are in s; the values are for variant 1 / variant 2[/plain]
I did those tests with real time scheduling (SCHED_RR + max_priority).

What I don't understand, though, is that I get different execution times depending on whether I use clWaitForEvents or clFinish. The Intel SDK seems to like clFinish, and the AMD SDK seems to work better with clWaitForEvents. Shouldn't the execution time of the kernel be the same in either case?

Since the local work size has a big effect on the execution time, it seems I'm going to have to profile every kernel I develop. Is there already a nice tool for that which runs the kernel many times and then graphically displays the times, or at least gives some statistical information like mean, deviation, ...? In the meantime, I could sketch a minimal harness myself; see below.
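A minimal sketch of such a harness (assuming the queue was created with CL_QUEUE_PROFILING_ENABLE as above; profile_kernel and its parameters are made up, and error checking is omitted):

[plain]#include <math.h>
#include <stdio.h>
#include <CL/cl.h>

/* Run the kernel `runs` times and report the mean/deviation of the
   event-profiled execution time. */
void profile_kernel(cl_command_queue queue, cl_kernel kernel,
                    size_t global_size, size_t local_size, int runs)
{
    double sum = 0.0, sum_sq = 0.0;
    for (int r = 0; r < runs; ++r) {
        cl_event evt;
        cl_ulong t0 = 0, t1 = 0;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, &local_size, 0, NULL, &evt);
        clFinish(queue); /* block until the kernel has completed */
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                sizeof(t1), &t1, NULL);
        clReleaseEvent(evt);
        double t = (double)(t1 - t0); /* nanoseconds */
        sum += t;
        sum_sq += t * t;
    }
    double mean = sum / runs;
    double stddev = sqrt(sum_sq / runs - mean * mean);
    printf("%d runs: mean %.0f ns, stddev %.0f ns\n", runs, mean, stddev);
}[/plain]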
Evgeny_F_Intel
Employee
@kugeleck
The Intel SDK prefers clFinish() because in that case we can apply some optimizations that minimize thread-switching overhead. With clWaitForEvents() we currently can't perform such optimizations.

The Intel SDK doesn't provide such a tool, but you will find tips on how to develop efficient kernels in the Optimization Guide.

BTW, it also includes an explanation of the clFinish() usage.
