Hi!
I measured the minimum time to run a kernel (which defines the meaningful size of the smallest job) for the AMD and Intel OpenCL drivers. I get about 55us for the AMD (CPU) drivers and 155us for the Intel (CPU) drivers. I am somewhat stumped by these delays, as the GPU has only a 15us overhead and there is a PCI bus in between.
I have also tested that I can get 2us thread start/stop times on the CPU using C++. Two microseconds is about equal to the thread slice time. It would have been reasonable for an OpenCL kernel launch to take 4-8us on the CPU, for example, but 155us is a lot. (Times were measured by averaging the execution time of 3000 kernels without copying buffers, but including 8 calls to clSetKernelArg.)
Is there some way to improve on this measurement?
Thanks!
Atmapuri
Hello Atmapuri,
Thanks for reporting this issue. While we are constantly striving to minimize overhead for every mode of execution, note that our current implementation of clEnqueueNDRangeKernel focuses on scalability and performance for kernels that run for more than a few clocks.
Does this test reflect an actual use case you're interested in, or is it more of a synthetic benchmark to get a feel for how long operations take in OpenCL?
Thanks,
Doron Singer
@Atmapuri
Please also try to measure only the NDRange enqueue, without the clSetKernelArg calls.
Thanks.
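Separating the two costs amounts to hoisting the argument setup out of the timed loop. A rough host-side sketch against the OpenCL C API (context, queue, kernel, and buffer creation omitted; `queue`, `kernel`, and `buf` are assumed to already exist, and error checking is left out for brevity):

```cpp
#include <CL/cl.h>
#include <chrono>

// Average time, in microseconds, for one NDRange launch alone:
// arguments are set once, outside the timed loop.
double time_ndrange_only_us(cl_command_queue queue, cl_kernel kernel,
                            cl_mem buf, int iters) {
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);  // done once, not per launch

    size_t gws = 1024;  // arbitrary small global work size
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
        clFinish(queue);  // include submission + completion of each launch
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / iters;
}
```

Comparing this number against the same loop with the 8 clSetKernelArg calls moved back inside shows how much of the per-launch overhead is argument setting versus the enqueue itself.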
I have run more tests and can confirm that the bottleneck is clSetKernelArg. Timing a batch of 8 calls takes between 1 and 2us on competing platforms (AMD, CPU or GPU) and between 20 and 300us with the current drivers from Intel (CPU). The measurements with Intel's drivers vary a lot, with an average of probably around 100us.
Dear Doron,
>kernels taking more than a few clocks.
I would consider a kernel that takes a minimum of 150us to run to be more than a few clocks. It is possible to compute 30 FFTs of 1024 points each on the CPU within this time. (I don't know why it takes so much work to launch a kernel.)
>Does this test reflect an actual use case you're interested in, or is it more of a synthetic benchmark to try and get a feel for how long operations take in OpenCL?
It is an actual use case. Kernel launch times define, to a great extent, which algorithms are suitable for OpenCL acceleration. In my view, OpenCL kernel launch times should be comparable with OpenMP thread launch times in C++.
Some people see OpenCL running on the CPU only as an emergency fallback from the GPU. From the tests that I ran, it makes a lot of sense to use OpenCL code as cross-platform, portable, high-performance code.
Well-written OpenCL kernels for the CPU come within 2-5x of the speed of a high-end GPU on Intel's latest (Sandy Bridge) CPUs when using double-precision math.
Thanks!
Atmapuri
Hello again Atmapuri,
Thanks for your reply, and I'm glad to hear you're seeing a benefit from executing OpenCL on the CPU. Please let us know if you encounter any more obstacles in the form of functional or performance bugs.
To the topic at hand, we've certainly noted the issue with clSetKernelArg and will work towards improving its performance in the future.
Thanks again,
Doron Singer