OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

Comparing minimum kernel running time overhead!

janez-makovsek
New Contributor I
Hi!

I measured the minimum time to run a kernel (which defines the meaningful size of the smallest job) for the AMD and Intel OpenCL drivers. I get about 55us for the AMD drivers (CPU) and 155us for the Intel drivers (CPU). I am somewhat stumped by these delays, as the GPU has an overhead of only 15us even though there is a PCI bus in between.

I have also tested that I can get thread start/stop times of 2us on the CPU using C++. Two microseconds is about equal to the thread slice time. It would have been reasonable for an OpenCL kernel launch to take 4-8us on the CPU, for example, but 155us is a lot. (Times were measured as the average execution time of 3000 kernel launches, without copying buffers but including 8 calls to clSetKernelArg.)
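For readers who want a comparable CPU-side baseline, here is a minimal sketch of how a thread start/stop time could be measured in C++. It is not the code used for the 2us figure above. The function name `average_thread_launch_us` and its `iterations` parameter are made up for illustration, and it creates a fresh `std::thread` on every iteration, which usually costs more than waking a pooled thread, so the number it reports is probably an upper bound.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: average cost, in microseconds, of starting and
// joining one std::thread whose body does no work.
double average_thread_launch_us(int iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::thread worker([] {});  // empty body: only launch overhead is timed
        worker.join();              // wait for it before starting the next one
    }
    const std::chrono::duration<double, std::micro> elapsed = clock::now() - start;

    return elapsed.count() / iterations;
}
```

Calling `average_thread_launch_us(3000)` would mirror the 3000-launch averaging described above. Results depend heavily on the OS and scheduler.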

Is there some way to improve on this measurement?

Thanks!
Atmapuri
5 Replies
Doron_S_Intel
Employee
Hello Atmapuri,

Thanks for reporting this issue. While we constantly strive to minimize overhead for every mode of execution, it should be noted that our current implementation of clEnqueueNDRangeKernel focuses on scalability and on the performance of kernels taking more than a few clocks.

Does this test reflect an actual use case you're interested in, or is it more of a synthetic benchmark to try and get a feel for how long operations take in OpenCL?

Thanks,
Doron Singer
Evgeny_F_Intel
Employee
@Atmapuri
Please also try measuring only the NDRange enqueue, without the clSetKernelArg calls.
Thanks.
janez-makovsek
New Contributor I
I have run more tests and can confirm that the bottleneck is clSetKernelArg. A batch of 8 calls takes between 1 and 2us on competing platforms (AMD, CPU or GPU) and between 20 and 300us with the current drivers from Intel (CPU). The measurements with Intel's drivers vary a lot, with an average probably around 100us.
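A sketch of how the two phases could be timed separately is given below. It is an assumption about the measurement setup, not the code actually used for the numbers above. `PhaseTimes` and `time_phases` are hypothetical names, and the OpenCL calls are left out so that the snippet compiles without an SDK: in a real test, `setup` would wrap the eight clSetKernelArg calls and `launch` would wrap clEnqueueNDRangeKernel followed by clFinish.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical result type: average time per iteration for each phase.
struct PhaseTimes {
    double setup_us;   // e.g. the batch of clSetKernelArg calls
    double launch_us;  // e.g. clEnqueueNDRangeKernel + clFinish
};

// Runs setup() then launch() `iterations` times and accumulates the time
// spent in each phase separately, so one cannot hide inside the other.
PhaseTimes time_phases(int iterations,
                       const std::function<void()>& setup,
                       const std::function<void()>& launch) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;
    using micros = std::chrono::duration<double, std::micro>;

    micros setup_total{0.0};
    micros launch_total{0.0};
    for (int i = 0; i < iterations; ++i) {
        const auto t0 = clock::now();
        setup();
        const auto t1 = clock::now();
        launch();
        const auto t2 = clock::now();
        setup_total += t1 - t0;
        launch_total += t2 - t1;
    }
    return {setup_total.count() / iterations, launch_total.count() / iterations};
}
```

Reading the clock is not free either, so for phases in the low-microsecond range it may be worth timing an empty `setup` and `launch` first and subtracting that baseline.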
janez-makovsek
New Contributor I
Dear Doron,

>kernels taking more than a few clocks.

I would consider kernels running for a minimum of 150us to be more than a few clocks. It is possible to compute 30 FFTs with 1024 points on the CPU within this time. (I don't know why it takes so much math to launch a kernel.)
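As a rough check of what that figure implies, the arithmetic can be sketched as follows. It assumes the conventional estimate of 5 N log2(N) floating-point operations for a complex FFT of N points, which is an assumption added here and not something measured in this thread. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Conventional operation-count estimate for one complex FFT of n points.
double fft_flops(int n) {
    assert(n > 1);
    return 5.0 * n * std::log2(static_cast<double>(n));
}

// Sustained rate, in GFLOP/s, needed to finish `fft_count` FFTs of `n`
// points within `window_us` microseconds.
double required_gflops(int fft_count, int n, double window_us) {
    assert(window_us > 0.0);
    const double flops_per_us = fft_count * fft_flops(n) / window_us;
    return flops_per_us / 1e3;  // 1 flop/us = 1 MFLOP/s, so divide by 1000
}
```

Under that estimate, `required_gflops(30, 1024, 150.0)` works out to about 10.24 GFLOP/s, or roughly 51,200 operations per 5us FFT.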

>Does this test reflect an actual use case you're interested in, or is it more
>of a synthetic benchmark to try and get a feel for how long operations take in OpenCL?

It is an actual use case. Kernel launch times define to a great extent which algorithms are suitable for OpenCL acceleration. In my view, OpenCL kernel launch times should be comparable with OpenMP thread launch times in C++.

Some people see OpenCL running on the CPU only as an emergency fallback from the GPU. From the tests that I ran, it makes a lot of sense to use OpenCL code as cross-platform, portable, high-performance code.

Well-written OpenCL kernels for the CPU come within 2-5x of the speed of a high-end GPU on Intel's latest (Sandy Bridge) CPUs when using double-precision math.

Thanks!
Atmapuri
Doron_S_Intel
Employee
Hello again Atmapuri,

Thanks for your reply, and I'm glad to hear you're seeing a benefit from executing OpenCL on the CPU. Please let us know if you encounter any more obstacles in the form of functional or performance bugs.

To the topic at hand, we've certainly noted the issue with clSetKernelArg and will work towards improving its performance in the future.

Thanks again,
Doron Singer