Please allow kernels to be called directly through a function pointer, without going through clEnqueueNDRangeKernel. This would allow the following features:
1.) Zero overhead for thread start/stop, because the kernel runs on the current application thread. This makes it possible to accelerate "short" kernels that may need to run on one thread.
2.) Allows custom threading to be implemented by the caller.
3.) Allows Intel to inject the latest high-performance instructions into any language via the OpenCL interface while keeping the DLL-style calling approach. People write performance-sensitive code once in OpenCL, and for the next generation of CPU Intel simply releases a driver update.
4.) Makes the kernels debuggable with the full range of Intel debugging tools.
5.) Provides a method to properly debug any OpenCL code.
We share most of your requirements above and plan an OpenCL EXT extension that executes NDRange and task commands in a single-threaded fashion, on the host thread that calls the corresponding clEnqueue commands. This extension is planned for our next major release.
I'm not sure I fully understood your questions - please don't hesitate to ask again if I've missed the point.
To use the extension, create another command queue and pass the property CL_QUEUE_IMMEDIATE_EXECUTION_ENABLE_INTEL, which is defined in the cl_ext.h header file bundled with the SDK. Commands enqueued to that command queue will execute in the direct manner described above.
To enqueue to this queue from multiple threads, I recommend first of all that you combine the mode above with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when you create the queue, so that threads don't block each other (with an in-order queue they would have to, because OpenCL in-order semantics require commands to execute one after another). The cl_command_queue handle can then be shared freely between threads.