OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

OpenCL overhead on empty kernel

Richard_S_7
Beginner

Hello,

I am currently comparing my own implementation of GEMV in OpenCL against MKL. I am benchmarking very small input sizes, for example 2x64. On my system the MKL runs around 0.001 ms for this input size and my kernel runs around 0.003 ms.

When executing a completely empty kernel I get a runtime of around 0.0025 ms. Where does this overhead come from, and why doesn't MKL seem to have it? I am benchmarking my OpenCL kernel via OpenCL events, and MKL with the dsecnd() function supplied by MKL.
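
(For reference, event-based kernel timing of this kind typically looks like the simplified sketch below; the names are illustrative and it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE. This is not the original poster's exact code.)

// Simplified sketch of event-based kernel timing (illustrative names only).
// Assumes `queue` was created with CL_QUEUE_PROFILING_ENABLE and `kernel`
// is the already-built empty kernel.
#include <CL/cl.h>

double time_kernel_ms(cl_command_queue queue, cl_kernel kernel)
{
    size_t global = 64;                       // arbitrary small NDRange
    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr,
                           0, nullptr, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong start = 0, end = 0;              // device timestamps in nanoseconds
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, nullptr);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, nullptr);
    clReleaseEvent(evt);
    return (end - start) * 1e-6;              // convert ns to ms
}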

Thanks in advance!

Jeffrey_M_Intel1
Employee

OpenCL kernel enqueue and launch have some overhead. Ideally the operation would be large enough that this overhead is a relatively small part of the overall execution time. OpenCL's advantages are convenient threading on the CPU and access to accelerator hardware such as GPUs and FPGAs. OpenCL may be able to help more when scheduling many of these small operations together, or with larger input sizes.
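
(As a rough illustration of amortizing the launch overhead, not Intel-provided code: enqueue many small launches and synchronize only once, then compare the average per-launch time with the single-launch number.)

// Illustrative sketch only: average per-launch cost when many small
// launches are batched before a single synchronization point.
#include <CL/cl.h>
#include <chrono>
#include <cstdio>

void batched_launch_cost(cl_command_queue queue, cl_kernel kernel, int iterations)
{
    size_t global = 64;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr,
                               0, nullptr, nullptr);
    clFinish(queue);                           // wait once for the whole batch
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("average per launch: %f ms over %d launches\n",
                ms / iterations, iterations);
}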

SergeyKostrov
Valued Contributor II
Take into account that MKL's API is highly optimized for bigger data sets (matrices, vectors). MKL also has overheads, and a simple classic matrix multiplication algorithm (triple-for-loop; the processing core is less than 10 code lines) outperforms MKL's sgemm for matrices up to 2,048x2,048.

>>...On my system the MKL runs around 0.001 ms for this input size and my kernel runs around 0.003 ms...

How many times did you execute the test to get these numbers?
Richard_S_7
Beginner

Thank you for your feedback.

Sergey Kostrov wrote:

MKL also has overheads, and a simple classic matrix multiplication algorithm (triple-for-loop; the processing core is less than 10 code lines) outperforms MKL's sgemm for matrices up to 2,048x2,048.

Could you share the classic matrix multiplication algorithm you describe? I would highly appreciate it!
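
(For reference, a "classic" triple-for-loop kernel usually has the form sketched below; this is only an illustration of the general algorithm, not the code Sergey had in mind.)

// Illustrative sketch of a classic triple-for-loop single-precision
// matrix multiplication: C = A * B, all matrices N x N, row-major.
void sgemm_naive(int n, const float* A, const float* B, float* C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}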

Sergey Kostrov wrote:

How many times did you execute the test to get these numbers?

I did 3 warm-up runs and calculated the average runtime of the 5 following runs. The profiling was done using the C++ chrono library and, alternatively, the MKL function dsecnd(). Both profiling methods produced the same results.
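
(The host-side timing described here would look roughly like the sketch below; cblas_dgemv is assumed as the measured workload and the names are illustrative, not the original poster's code.)

// Illustrative sketch: timing a small GEMV call with both C++ <chrono>
// and MKL's dsecnd(). cblas_dgemv is an assumed workload.
#include <chrono>
#include <cstdio>
#include <mkl.h>

double time_gemv_ms(int m, int n, const double* A, const double* x, double* y)
{
    double s0 = dsecnd();
    auto   t0 = std::chrono::steady_clock::now();

    cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n,
                1.0, A, n, x, 1, 0.0, y, 1);   // y = A * x

    auto   t1 = std::chrono::steady_clock::now();
    double s1 = dsecnd();

    double chrono_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("dsecnd: %f ms, chrono: %f ms\n", (s1 - s0) * 1e3, chrono_ms);
    return chrono_ms;
}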
