Bugs in Intel OpenCL 1.5

Kloeckner__Andreas · ‎11-19-2011

Hi there,

I'm running into a few bugs with the Intel OpenCL SDK v 1.5. I am the maintainer of PyOpenCL, and the 1.5 SDK will not survive PyOpenCL's test suite. (the 1.1 SDK did just fine)

A few issues I noticed:

if you ask for a kernel name that doesn't exist in a program, a C++ exception comes at you through the (supposedly C-only) OpenCL interface (PyOpenCL tests for proper behavior in this case), which (by string search) must originate in the Intel CL implementation.
Passing NULL for local_size in EnqueueNDRange causes failures (NaN output and such). Note that the CL standard allows this.

When running the PyOpenCL test suite with the Intel 1.5 and AMD 2.5 CPU implementations, the Intel implementation crashes at unpredictable times within Intel::OpenCL::CPUDevice::NDRange::Release (inside a thread it spawned, apparently). Here's a backtrace for that failure:

#0  0x00007ffff03851e3 in Intel::OpenCL::CPUDevice::NDRange::Release() () from /home/andreas/pack/intel-opencl-1.5/vendors/intel/libcpu_device.so
#1  0x00007ffff20f80b9 in Intel::OpenCL::TaskExecutor::in_order_executor_task::execute() () from /home/andreas/pack/intel-opencl-1.5/vendors/intel/libtask_executor.so
#2  0x00007ffff1dbe4e4 in tbb::internal::custom_scheduler<:INTERNAL::INTELSCHEDULERTRAITS>::local_wait_for_all(tbb::task&, tbb::task*) () from /home/andreas/pack/intel-opencl-1.5/vendors/intel/libtbb.so.2
#3  0x00007ffff1dbc1c8 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /home/andreas/pack/intel-opencl-1.5/vendors/intel/libtbb.so.2
#4  0x00007ffff1dbb11b in tbb::internal::market::process(rml::job&) () from /home/andreas/pack/intel-opencl-1.5/vendors/intel/libtbb.so.2
#5  0x00007ffff1db944c in tbb::internal::rml::private_worker::run() () from /home/andreas/pack/intel-opencl-1.5/vendors/intel/libtbb.so.2
#6  0x00007ffff1db93c6 in tbb::internal::rml::private_worker::thread_routine(void*) () from /home/andreas/pack/intel-opencl-1.5/vendors/intel/libtbb.so.2
#7  0x00007ffff7bc7b40 in start_thread (arg=) at pthread_create.c:304
#8  0x00007ffff6a5a36d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#9  0x0000000000000000 in ?? ()

For definiteness, I'm running on Debian unstable with glibc 2.13, and my remarks concern PyOpenCL git version "a4e6bf847aa6891aea084e42cadf4c254e67e4a7", which you can get from here: https://github.com/inducer/pyopencl

Andreas

Doron_S_Intel · ‎11-19-2011

Thanks for your report.
The first issue is known and will be fixed in a future release.
The second one sounds like it might be a bug in the kernel you're trying to run (NULL local size is certainly supported and should work) -- is it possible to share some minimal reproduction with stripped-down kernel code and input/expected output?
The third one might be a memory overrun in the kernel code. Have you tried making sure that all writes are to indices in the expected range? You can use the printf extension to print out a case of out of bounds index or write.

Thanks,
Doron Singer

mbf1 · ‎12-01-2011

Just downloaded and started playing with Intel's OpenCL 1.5 with my OpenCL development tools still in production...Wanted to report some initial findings from some of my battery of tests that I'm doing (against all OpenCL vendor implementations on a variety of 32-bit and 64-bit OSes and hardware devices):

1) For my C#/C++ wrapper code generated by a private program I wrote, and I had to shim my Device.GetInfo (Device::GetInfo in C++) method like this (this was a regression from Intel's OpenCL 1.1 gold build, for which this call worked and didn't return an Invalid Value return code):

[csharp]status = m_dll.clGetDeviceInfo( m_device, (int)paramName, UIntPtr.Zero, null, ref valueSize );
// Above line existed already unmodified
if( status != (int) StatusCodeEnum.Success /*QUICK HACK FOR INTEL OpenCL 1.5:*/ && valueSize.Equals( UIntPtr.Zero ) )[/csharp]

2) I also had to shim my Mem.GetInfo like this (in the OpenCL 1.1 gold build also -- I just hadn't signed up for a forum account or reported it yet):

[csharp]status = m_dll.clGetMemObjectInfo( m_buffer, (int)paramName, UIntPtr.Zero, null, ref valueSize );
IntelHack( ref status, ref valueSize );
if( status != (int) StatusCodeEnum.Success )
// rest of method...

private void IntelHack( ref StatusCodeEnum status, ref UIntPtr valueSize ) {
if( status == StatusCodeEnum.InvalidValue && IsIntelPlatform() ) {
   status = StatusCodeEnum.Success;
   valueSize = new UIntPtr( (uint) UIntPtr.Size );
} }
[/csharp]

3) I wrote some code to test passing vector types (signed, unsigned, and floating point types of 8, 16, 32, and 64 bits as appropriate) with all vector sizes (2, 3, 4, 8, and 16). It generates kernel methods and generates appropriate values (using C#/.NET reflection to convert math types from strings to do this in a generic way). Intel's OpenCL implementation fails on all 64-bit sized vector types of length 3. It fails in 2 ways:

3a) Intel's clSetKernelArg( ... ) expects 3 * 8 bytes as the length instead of 4 * 8 bytes as specified by OpenCL and implemented correctly by all other vendors I have tested. The error returned isCL_INVALID_ARG_SIZE.

3b) The values are not properly consumed. For an example, here's one of my (50) generated kernels - 10 types * 5 vector sizes, 3 of which (long3, ulong3, double3) fail on Intel in my testing:

[cpp]__kernel void testLong3( __global int* result, __global long* out, long a0, long a1, long a2, long3 a )
{
   result[ 0 ] = (a0 == a.s0) && (a1 == a.s1) && (a2 == a.s2) ? 1 : 0;
   out[ 0 ] = a0;
   out[ 1 ] = a1;
   out[ 2 ] = a2;
   out[ 3 ] = a.s0;
   out[ 4 ] = a.s1;
   out[ 5 ] = a.s2;
   out[ 6 ] = a0 - a.s0;
   out[ 7 ] = a1 - a.s1;
   out[ 8 ] = a2 - a.s2;
}[/cpp]

When this kernel is called in a manner like this:

[cpp]int resultBuffer[ 1 ] = { 0 };
long outBuffer[ 9 ] = { 0, 0, 0, 0, 0, 0, 0, 0, 0 };
testLong3( resultBuffer, outBuffer, 0, 1, 2, new long3( 0, 1, 2 ) );[/cpp]

Then I get back something like this:

[cpp]resultBuffer = { 0 };
outBuffer = { 0, 1, 2, N, 0, 1, 0-N, 1, 1 };
[/cpp]

Where N is some unreproducible value.

I suspect that this is a memory alignment issue internal to Intel's implementation. I would think it's an issue with my implementation of the test except that the other 47 kernels pass on Intel, and all 50 of these tests pass on NVIDIA (3 devices, 2 OSes), and all 50 pass on AMD (CPU, 2 OSes). AMD (GPU) fails on the 3x 32-bit sized type3 values. I need to re-run this test on my Apple OpenCL (Core2 Duo/GeForce 320M Laptop) because I broke the underlying support code for that test before I ran it last on that box, and all 50 tests failed.

4) Intel's OpenCL 1.1 and 1.5 both fail to compile this boring example kernel I produced to test global atomic operations (all other OpenCL implementations tested correctly compile and execute this kernel):

[cpp]#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
__kernel void Sum1( __global const int* buffer, __global volatile int* sum )
{
   const int i = get_global_id( 0 );
   atom_add( sum, buffer[ i ] );
}[/cpp]

There are 2 related errors: Intel's Compiler doesn't like #pragma OPENCL EXTENSION ... and 2: atom_add is not enabled by this pragma. atomic_add (new to OpenCL 1.1 if my memory is correct), builds and runs correctly, but as far as I'm aware, these extensions serve for OpenCL 1.0 programs to run on OpenCL 1.1. I haven't yet gone through and added compiler tests for all the other OpenCL 1.0, 1.1, and soon 1.2 extensions to my tests or all of the values, but I expect that Intel's compiler needs to recognize all #pragma's with OPENCL EXTENSION as specified in the standard, especially those which its platform/devices claim to support.

Thanks!

-Mike

Evgeny_F_Intel · ‎12-01-2011

Hi Mike,

Thanks for the comprehensive information.
We will work out to analyze and work out these issues.

Evgeny

Kloeckner__Andreas · ‎12-01-2011

The second issue was on my side--I was understanding the CL spec in such a way that local_size==NULL would only run the specified global number of work items. But it seems that rounding up to the nearest multiple of the chosen group size is what the intel implementation does. The spec isn't quite clear here, perhaps.

The third issue persists, however, and occurs only on Intel when combined with AMD CPU--though within Intel CL code. (It does not occur on any other combination of OS/impl/device that I have tested--and those were quite a few: Apple/Linux, AMD GPU/AMD CPU/Nvidia Tesla/Nvidia Fermi, notably not on Intel alone). You should be able to reproduce it on Linux with the recently released PyOpenCL 2011.2.

It also seems that something in Intel's OpenCL implementation handles SIGINT (the compiler?), which it clearly shouldn't.

Andreas

Evgeny_F_Intel · ‎12-02-2011

Thank you for more detailed information.
1. Could you provide a code sample for the second issue?
2. Could you elaborate more on "combined with AMD CPU--though within Intel CL code"?

Evgeny

Kloeckner__Andreas · ‎12-02-2011

Sure.

1. Suppose you want to run on 22137 global work items, and you pass local_size==NULL, then it appears that the Intel implementation will round up the global size to the nearest multiple of the group size it chose, so that you have to pass your *intended* global size and check your global id against it.

2. The PyOpenCL test suite enumerates all CL platforms and devices it can find and then runs its tests on all of them. When AMD and Intel CPU CL implementations are present, the crash I describe occurs.

Andreas

Evgeny_F_Intel · ‎12-02-2011

Thanks again.

As for (1) I can't reproduce this issue in C/C++. Could you please provide C/C++ code that expose the issue.

As for (2):
a. Do you have Callbacks in your code?
b. Which version of OpenCL header you are using for compilation AMD or Intel? How the callback function are defined in phyton? If you can, please proved C/C++ code that exposes this issue.

Evgeny