Richard_S_7

Beginner

06-13-2017
03:54 AM

53 Views

OpenCL overhead on empty kernel

Hello,

I am currently comparing my own implementation of GEMV in OpenCL to MKL. I am benchmarking very small input sizes, for example 2x64. On my system MKL takes around 0.001 ms for this input size and my kernel takes around 0.003 ms.

When executing a completely empty kernel I get a runtime of around 0.0025 ms. Where does this overhead come from, and why doesn't MKL seem to have it? I am timing my OpenCL kernel via OpenCL events and MKL with the dsecnd() function that MKL provides.

Thanks in advance!

3 Replies

OpenCL kernel enqueue and launch will have some overhead. Ideally the operation would be larger so this overhead would be a relatively small part of the overall execution time. OpenCL's advantages are convenient threading for CPU and access to accelerator HW like GPU and FPGA. OpenCL may be able to help more when scheduling many of these small operations, or with larger input sizes.
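
As a rough back-of-the-envelope illustration of that point, the sketch below treats the empty-kernel time as a fixed per-launch cost. The function names are hypothetical, and the assumption that one launch can cover a whole batch of small operations is a simplification, not a measured property of any OpenCL runtime.

```cpp
#include <cassert>

// Share of the measured kernel time that is fixed launch overhead.
// overhead_ms: runtime of an empty kernel; total_ms: runtime of the real kernel.
double overhead_share(double overhead_ms, double total_ms) {
    assert(total_ms > 0.0 && overhead_ms >= 0.0 && overhead_ms <= total_ms);
    return overhead_ms / total_ms;
}

// Time per operation if ONE launch (one fixed overhead) covers `batch` operations.
// Assumption: the overhead is paid once per launch and the work scales linearly.
double amortized_ms_per_op(double overhead_ms, double work_ms, int batch) {
    assert(batch > 0 && overhead_ms >= 0.0 && work_ms >= 0.0);
    return overhead_ms / batch + work_ms;
}
```

With the numbers from the question, `overhead_share(0.0025, 0.003)` is roughly 0.83, so most of the measured time would be launch cost rather than GEMV work under this model.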

Jeffrey_M_Intel1

Employee

06-13-2017
02:33 PM

Take into account that MKL's API is highly optimized for larger data sets (matrices, vectors).
MKL also has overhead, and a simple classic matrix multiplication algorithm (a triple for-loop whose processing core is less than 10 lines of code) outperforms MKL's **sgemm** for matrices up to 2,048x2,048.
>>...On my system the MKL runs around 0.001 ms for this input size and my kernel runs around 0.003 ms...
How many times did you execute the test to get these numbers?

SKost

Valued Contributor II

06-20-2017
11:59 AM

Richard_S_7

Beginner

06-22-2017
04:08 AM

Thank you for your feedback.

Sergey Kostrov wrote:

MKL also has overheads and a simple classic matrix multiplication algorithm ( triple-for-loop / processing core is less than 10 code lines ) outperforms MKL's sgemm for matrices up to 2,048x2,048.

Could you supply the classic matrix multiplication algorithm as described by you? I would highly appreciate it!
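
For context, the textbook triple-loop form that the quote seems to describe usually looks something like the sketch below. This is not Sergey's code. The name `matmul_naive`, the row-major layout, and the `std::vector<float>` storage are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive C = A * B for row-major matrices.
// A is m x k, B is k x n, and the returned C is m x n.
std::vector<float> matmul_naive(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {          // rows of A
        for (std::size_t j = 0; j < n; ++j) {      // columns of B
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {  // dot product along k
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}
```

Calling `matmul_naive(a, b, 2, 2, 2)` with `a = {1, 2, 3, 4}` and `b = {5, 6, 7, 8}` gives `{19, 22, 43, 50}`.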

Sergey Kostrov wrote:

How many times did you execute the test to get these numbers?

I did 3 warm-up runs and then averaged the runtime of the following 5 runs. The profiling was done with the C++ chrono library (std::chrono) and, alternatively, with the MKL function dsecnd(). Both profiling methods produced the same results.
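
The procedure above can be sketched roughly as follows. This is not my actual benchmark code: `mean_runtime_ms` is a hypothetical helper, and timing the whole loop with `std::chrono::steady_clock` instead of each run separately is a choice made here for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Calls `work` `warmup` times untimed, then returns the mean wall-clock
// time in milliseconds over `runs` timed calls.
double mean_runtime_ms(const std::function<void()>& work, int warmup, int runs) {
    assert(warmup >= 0 && runs > 0);
    for (int i = 0; i < warmup; ++i) {
        work();  // warm-up: caches, lazy initialization, clock ramp-up
    }
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < runs; ++i) {
        work();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / runs;
}
```

With `warmup = 3` and `runs = 5` this matches the counts described above. For an OpenCL kernel, `work` would need to block until the kernel has finished (for example by calling `clFinish` on the queue), otherwise only the enqueue call is timed. At microsecond scale, five runs may also be too few to average out timer resolution and noise.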
