Hi!
I measured the minimum time to run a kernel (which defines the meaningful size of the smallest job) for the AMD and Intel OpenCL drivers. I get about 55us for the AMD (CPU) drivers and 155us for the Intel (CPU) drivers. I am somewhat stumped by these delays, as the GPU has only a 15us overhead and there is a PCI bus in between.
I have also tested that I can get 2us thread start/stop times on the CPU using C++. Two microseconds is about equal to the thread slice time. It would have been reasonable for an OpenCL kernel launch to take 4-8us on the CPU, for example, but 155us is a lot. (Times were measured by averaging the execution time of 3000 kernels without copying buffers, but including 8 calls to clSetKernelArg.)
Is there some way to improve on this measurement?
Thanks!
Atmapuri
Hello Atmapuri,
Thanks for reporting this issue. While we are constantly striving to minimize overhead for every mode of execution, note that our current implementation of clEnqueueNDRangeKernel focuses on scalability and performance for kernels that run for more than a few clocks.
Does this test reflect an actual use case you're interested in, or is it more of a synthetic benchmark to get a feel for how long operations take in OpenCL?
Thanks,
Doron Singer
@Atmapuri
Please also try to measure only the NDRange enqueue, without the clSetKernelArg calls.
Thanks.
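Separating the two costs amounts to hoisting the argument setup out of the timed loop. A rough host-side sketch against the OpenCL C API (context, queue, kernel, and buffer creation omitted; `queue`, `kernel`, and `buf` are assumed to already exist, and error checking is left out for brevity):

```cpp
#include <CL/cl.h>
#include <chrono>

// Average time, in microseconds, for one NDRange launch alone:
// arguments are set once, outside the timed loop.
double time_ndrange_only_us(cl_command_queue queue, cl_kernel kernel,
                            cl_mem buf, int iters) {
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);  // done once, not per launch

    size_t gws = 1024;  // arbitrary small global work size
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
        clFinish(queue);  // include submission + completion of each launch
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / iters;
}
```

Comparing this number against the same loop with the 8 clSetKernelArg calls moved back inside shows how much of the per-launch overhead is argument setting versus the enqueue itself.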
I have run more tests and can confirm that the bottleneck is clSetKernelArg. Timing a batch of 8 calls takes between 1 and 2us on competing platforms (AMD, CPU or GPU) and between 20 and 300us with the current drivers from Intel (CPU). The measurements with Intel's drivers vary a lot, with an average of probably around 100us.
Dear Doron,
>kernels taking more than a few clocks.
I would consider a kernel that takes a minimum of 150us to run to be more than a few clocks. It is possible to compute 30 FFTs of 1024 points each on the CPU within this time. (I don't know why it takes so much work to launch a kernel.)
>Does this test reflect an actual use case you're interested in, or is it more of a synthetic benchmark to try and get a feel for how long operations take in OpenCL?
It is an actual use case. Kernel launch times define, to a great extent, which algorithms are suitable for OpenCL acceleration. In my view, OpenCL kernel launch times should be comparable with OpenMP thread launch times in C++.
Some people see OpenCL running on the CPU only as an emergency fallback from the GPU. From the tests that I ran, it makes a lot of sense to use OpenCL code as cross-platform, portable, high-performance code.
Well-written OpenCL kernels for the CPU come within 2-5x of the speed of a high-end GPU on Intel's latest (Sandy Bridge) CPUs when using double-precision math.
Thanks!
Atmapuri
Hello again Atmapuri,
Thanks for your reply, and I'm glad to hear you're seeing a benefit from executing OpenCL on the CPU. Please let us know if you encounter any more obstacles in the form of functional or performance bugs.
To the topic at hand, we've certainly noted the issue with clSetKernelArg and will work towards improving its performance in the future.
Thanks again,
Doron Singer