Hi,
I'm doing some simple OpenCL tests and I noticed that my OpenCL code executes much faster with the AMD APP SDK 2.4 than with the Intel OpenCL SDK. I'm running 64-bit Linux with an Intel Core i5 750 @ 2.67GHz and an NVIDIA GeForce GTX 550 Ti.
When I run the following OpenCL program (vector size and global work size are 4096, local work size is 256):
[plain]
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel __attribute__((vec_type_hint(double)))
void add(__global const double *a,
         __global const double *b,
         __global double *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}
[/plain]
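The host side boils down to something like this (a sketch with assumed variable names; in my real code I go through a C++ wrapper and error handling is omitted here):
[plain]
size_t global = 4096;  // total number of work-items
size_t local  = 256;   // work-group size
cl_event event;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &buf_c);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &event);
clFinish(queue);  // block until the kernel has actually completed
[/plain]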
I get these execution times on average:
INTEL: 1.45s
AMD: 0.45s
NVIDIA: 1.1s (probably due to overhead)
native: 2.35s (single CPU)
What could be the reason?
The times I measured were actually bogus. Correct numbers can be found in post #18. I was also using clWaitForEvents instead of clFinish. According to the "Tips and Tricks for Kernel Development" it is better to use clFinish with the Intel OpenCL SDK.
Thanks
I have gotten similar results. The performance of my median filter sample using the Intel SDK is half of that using the AMD SDK.
However, the Linux support of the Intel OpenCL SDK is still in a preview state. Maybe it will be optimized in the future.
Hey,
I have no experience at all (yet) in OpenCL development, but take a look at
"Avoiding Spurious Operations in Kernel Code"
Maybe this could help your performance?
Cheers
Manuel
Indeed, as gzahl pointed out, using unsigned int as the type for the work-item id introduces the overhead of down-casting the actual 64-bit value to 32 bits. The best practice is to use size_t, which works on both 32- and 64-bit platforms.
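To illustrate, a minimal kernel sketch (a hypothetical copy kernel, not from the code in this thread):
[plain]
__kernel void copy(__global const float *in, __global float *out)
{
    // uint i = get_global_id(0);  // the 64-bit id would be down-cast to 32 bits
    size_t i = get_global_id(0);   // matches the platform width, no cast needed
    out[i] = in[i];
}
[/plain]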
However, with an input size of just 4096 elements, various overheads might dominate the overall time.
This is especially true if you include data movement costs in your measurements, as the queueing/synchronization/etc. associated with clEnqueue*Buffer can be high compared to the actual time for transferring those 4096 doubles. I suspect that the kernel execution time is far less than the data transfer cost for this kernel.
Could you please specify whether data transfer time is included? Did the situation change when you increased the input size?
I would also kindly ask zhaopengintel to elaborate on the same questions about the median sample.
@zhaopengintel
Good to know that I'm not the only one with such results.
@gzahl
I read the "Tips and Tricks for Kernel Development" and changed my code accordingly. I also updated my post.
@Maxim Shevtsov
I increased the number of elements to about 1/2 million and it looks like vector additions are just too simple. The execution times are now:
INTEL: 2.7s
AMD: 0.85s
NVIDIA: 1.2s
native: 325s (single CPU)
As you can see, the trend seems to be the same. The AMD APP SDK 2.4 still provides better results than the Intel OpenCL SDK. What are your experiences with the AMD APP SDK 2.4? Do you have any idea when the final version of the Intel OpenCL SDK will arrive?
I'm going to implement FFT to get a better idea of how the different SDKs/devices compare to each other.
Thanks for the report.
Can you provide more information about the OS you are using?
I'm testing the Intel OpenCL SDK on an i7 860 (plus 2x 5870 + 1x 5850) and the results I'm obtaining are a bit different: I'm observing only a 16% difference in performance. I'm also using an AMD-specific compiler option (i.e. "-fno-alias") when running on the AMD OpenCL platform, so the difference seems OK to me.
However, I'm using http://www.luxrender.net/wiki/LuxMark as the test (i.e. the kernel code is a few thousand lines).
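The option is passed at program build time, roughly like this (a sketch; the is_amd_platform flag is a placeholder for however you detect the platform):
[plain]
// Pass vendor-specific compiler options only on the matching platform.
const char *options = is_amd_platform ? "-fno-alias" : "";
cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
[/plain]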
@Evgeny Fiksman
I'm using openSUSE 11.3 with kernel version 2.6.33.7-rt x86_64. Since the SDK depends on libnuma, I installed libnuma 2.0.4-rc2 from the repository.
I also get similar results on a different system with Ubuntu 10.10 and kernel 2.6.35 x86_64. libnuma is also 2.0.4-rc2. The processor is a Core 2 Duo P8600 @ 2.4GHz and the graphics card is an NVIDIA Quadro NVS 160M.
@dade916
Which version of LuxMark are you using? The prebuilt one (LuxMark 1.0) does not show my CPU as an OpenCL device. Which scene are you rendering, the default one (LuxBall HD) or something else? Just out of curiosity: how many points do you get with the HD 5850?
@kugeleck
We do expect some small differences in execution due to Linux kernel configuration differences, but I think that's not your case.
Do you create buffers with the CL_MEM_USE_HOST_PTR flag? If so, please make sure that the pointer is aligned to CL_DEVICE_MEM_BASE_ADDR_ALIGN.
If this is not the case, please also post your host code.
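For reference, an aligned allocation for CL_MEM_USE_HOST_PTR could look roughly like this on Linux (a sketch; error handling omitted):
[plain]
// Query the required alignment; the value is reported in bits.
cl_uint align_bits = 0;
clGetDeviceInfo(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN,
                sizeof(align_bits), &align_bits, NULL);

// Allocate host memory with that alignment (posix_memalign takes bytes).
void *host_ptr = NULL;
posix_memalign(&host_ptr, align_bits / 8, size);

cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            size, host_ptr, &err);
[/plain]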
Surprisingly many quick replies, thanks! My median sample is very simple and not optimized for specific hardware. Yesterday I just ran a quick test applying the filter 50 times to a 1000*1122 RGB image. The total computation time (not including data transfer time) is 4.58294m on my notebook with a Core 2 Duo P8600 and openSUSE 11.4, and with the AMD SDK the result is 1.86533m. Now I think I have found the reason: because there is no float3 type in the OpenCL C language, I do the calculation separately for R, G and B using the scalar float type.
So I wrote another kernel using float4 to calculate RGBA and ran a test with an RGBA image of the same size. The performance is very good: 0.781254m, while with the AMD SDK the result is 1.19419m. My impression is that the Intel OpenCL SDK is optimized for vector data types, and the performance guide also suggests avoiding extracting vector components. Am I right?
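To illustrate the idea, a simplified sketch (not my actual filter code):
[plain]
// Scalar version: R, G and B are loaded and processed one float at a time.
// Vector version: one float4 load/store covers all RGBA channels at once,
// which maps well onto the CPU's SIMD lanes.
__kernel void scale(__global const float4 *src, __global float4 *dst)
{
    size_t i = get_global_id(0);
    dst[i] = src[i] * 0.5f;  // operates on all four channels together
}
[/plain]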
@Evgeny Fiksman
I'm using QtOpenCL [1] as a C++ wrapper for OpenCL. With that my call ends up looking like this [2]:
[bash]
clCreateBuffer(contextId, CL_MEM_READ_WRITE, size, 0, &error);
[/bash]
If I understand clCreateBuffer [3] and what you wrote correctly, then this should not be my problem. Since the OpenCL implementation allocates the buffer for me, I don't need to worry about alignment.
@zhaopengintel
Would it be possible to get your example code? Or could you maybe run some other example that you have and post your results? This would give me a chance to find out whether the C++ wrapper I'm using is the problem, since we seem to have identical notebook processors.
Thanks
Notice that the native performance did change with the increased input size, but the OpenCL times almost didn't. This might indicate that your measurement routine does not block to ensure the kernel has completed. In other words, you are probably measuring just the overhead of calling clEnqueueNDRangeKernel (which is asynchronous and just puts your job into the queue), not the real kernel execution.
The proper sequence would need to include either explicit waiting on some event, or calling clFinish (which also ensures that everything previously submitted has completed):
[plain]
ulong start = get_current_time();  // either rdtsc or system timing functions
clEnqueueNDRangeKernel(...);       // asynchronous: only enqueues the work
clFinish(...);                     // blocks until the queue has drained
ulong end = get_current_time();
ulong exec_time = end - start;
[/plain]
You might also want to use OpenCL profiling events, which give you the ability to get a more precise breakdown of the execution costs.
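For example (a sketch; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and that the event comes from the clEnqueueNDRangeKernel call):
[plain]
// Per-stage timestamps for one kernel launch, in nanoseconds.
cl_ulong queued, submitted, started, ended;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_QUEUED,
                        sizeof(cl_ulong), &queued, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_SUBMIT,
                        sizeof(cl_ulong), &submitted, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &started, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &ended, NULL);
// (ended - started) is the pure execution time;
// (started - queued) is the queueing/submission overhead.
[/plain]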
@kugeleck: I'm using the latest source available from http://src.luxrender.net/luxrays/ (I'm one of the authors). I'm not aware of any change to the LuxMark code since the v1.0 release, so that shouldn't be an issue.
I have installed the Intel SDK side by side with the AMD SDK on my Linux (Ubuntu 10.10 64bit) and I have no problem recognizing both platforms.
However, the LuxMark GUI doesn't have support for multiple OpenCL platforms, so you have to edit a configuration file and select the platform to use/test by hand.
You only have to open the "luxmark/scenes/luxball/render-hdr.cfg" file and edit the following line:
# Select the OpenCL platform to use (0=first platform available, 1=second, etc.)
opencl.platform.index = 0
For instance, just change the "0" to "1" to switch from the first OpenCL platform to the second available.
P.S. My aggregate (i.e. i7 860 + 2x HD 5870 + 1x HD 5850) LuxMark score is 11816. Sorry, I don't have the 5850-only score at hand at the moment.
@zhaopengintel
It's good to hear that you have improved the performance.
OpenCL 1.1 defines the float3 type; it's one of the differences from 1.0.
The implicit vectorization module should also handle scalar types; you can use the Offline Compiler tool to see whether it worked. In case it didn't, we would appreciate it if you could share your kernel so we can understand the issue.
@dade916
I tried to build LuxMark from the latest source, which requires recent Boost libraries. I built those but then had trouble with CMake. In the end I found a PPA with Boost 1.44.
When I now run LuxMark I get:
[plain]
[LuxRays] OpenCL Platform 0: Advanced Micro Devices, Inc.
[LuxRays] OpenCL Platform 1: Intel Corporation
RUNTIME ERROR: Unable to find an appropiate OpenCL platform
[/plain]
I already changed opencl.cpu.use, opencl.gpu.use and opencl.platform.index but that did not help.
Since I also run Ubuntu 10.10 on my laptop it must be possible to get this running.
Probably AMD is converting your doubles to floats, that's the reason :p
Change those doubles to floats and test again :p
@kugeleck: I have even tried the pre-compiled LuxMark v1.0 binaries available on the site and they seem to work fine with the Intel SDK. Your error is printed by the following code:
[bash]
if ((platforms.size() == 0) || (openclPlatformIndex >= (int)platforms.size()))
    throw std::runtime_error("Unable to find an appropiate OpenCL platform");
[/bash]
The Intel and AMD OpenCL platforms are clearly recognised, as printed in your log, so it looks like "openclPlatformIndex" is greater than 1, i.e. opencl.platform.index seems to be set to a value greater than 1. The file parser is a bit picky: maybe there is some extra character on the line, or opencl.platform.index is set multiple times (the last one read wins)?
@Maxim Shevtsov
I changed my code and am now also using profiling events. It now looks like this:
[plain]
queue = clCreateCommandQueue(..., CL_QUEUE_PROFILING_ENABLE, ...);

gettimeofday(&before, ...);
clEnqueueNDRangeKernel(queue, ..., &event);
clWaitForEvents(1, &event); // variant 1
clFinish(queue);            // variant 2
gettimeofday(&after, ...);

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, ..., &start, ...);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, ..., &end, ...);

time = end - start;
time_with_overhead = (after.tv_sec * 1000000 + after.tv_usec)
                   - (before.tv_sec * 1000000 + before.tv_usec);

clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ..., kernel_work_group_size, ...);
[/plain]
I now get quite different times with a global work size of 8192:
[plain]
LWS    | 128  | 256  | 512  | 1024 | kernel_work_group_size
-------+------+------+------+------+-----------------------
NVIDIA |  5   | <5   | >5   | >5   | 1024
INTEL  |15/12 |15/12 |14/12 |14/12 |  512
AMD    |28/30 |22/25 |20/23 |20/23 | 1024

time_with_overhead
-------+------+------+------+------
NVIDIA |28/30 |28/30 |28/30 |29/30
INTEL  |21/21 |21/20 |20/20 |20/20
AMD    |53/62 |41/51 |36/45 |33/43

LWS = local work size; measurements are in s; the values are for variant 1/variant 2.
[/plain]
I did those tests with real-time scheduling (SCHED_RR + max_priority).
What I don't understand, though, is that I get different execution times depending on whether I use clWaitForEvents or clFinish. The Intel SDK seems to like clFinish and the AMD SDK seems to work better with clWaitForEvents. Shouldn't the execution time of the kernel be the same in either case?
Since the local work size has a big effect on the execution time, it seems like I'm going to have to profile every kernel I develop. Is there already a nice tool for that which runs the kernel many times and then graphically displays the times, or at least gives some statistical information like mean, deviation, ...?
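Otherwise I guess I'll script it myself, roughly like this (a sketch; run_kernel_once() is a hypothetical helper wrapping the enqueue + clFinish sequence from above and returning seconds):
[plain]
#define RUNS 100
double samples[RUNS], sum = 0.0, sq = 0.0;
for (int r = 0; r < RUNS; ++r) {
    samples[r] = run_kernel_once(lws);  // hypothetical: one timed launch
    sum += samples[r];
}
double mean = sum / RUNS;
for (int r = 0; r < RUNS; ++r)
    sq += (samples[r] - mean) * (samples[r] - mean);
double stddev = sqrt(sq / (RUNS - 1));  // needs <math.h>
printf("LWS %zu: mean %.6f s, stddev %.6f s\n", lws, mean, stddev);
[/plain]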
@kugeleck
The Intel SDK prefers clFinish() because in that case we can apply some optimizations that minimize thread-switching overhead. With clWaitForEvents() we currently can't perform such optimizations.
The Intel SDK doesn't provide such a tool, but you can find tips on how to develop efficient kernels in the Optimization Guide.
BTW, it also includes an explanation of the clFinish() usage.