OpenCL overhead on empty kernel

Richard_S_7 · ‎06-13-2017

Hello,

I am currently comparing my own implementation of GEMV in OpenCL to the MKL. I am benchmarking very small input sizes, for example 2x64. On my system the MKL runs around 0.001 ms for this input size and my kernel runs around 0.003 ms.

When executing a completely empty kernel I get a runtime of around 0.0025 ms. Where does this overhead come from, and why doesn't the MKL seem to have it? I am timing my OpenCL kernel via OpenCL events and the MKL with the dsecnd() function that the MKL supplies.

Thanks in advance!

Jeffrey_M_Intel1 · ‎06-13-2017

OpenCL kernel enqueue and launch will have some overhead. Ideally the operation would be larger so this overhead would be a relatively small part of the overall execution time. OpenCL's advantages are convenient threading for CPU and access to accelerator HW like GPU and FPGA. OpenCL may be able to help more when scheduling many of these small operations, or with larger input sizes.
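To put rough numbers on this, here is a minimal sketch that assumes the measured time is simply a fixed launch overhead plus the useful work. That model is a simplification, and the function name `overhead_fraction` is hypothetical. The 0.0025 ms figure is the empty-kernel time reported in the question.

```cpp
#include <cassert>
#include <cmath>

// Share of the total runtime taken by a fixed launch overhead,
// assuming total = overhead + work (a simplified model, not a measurement).
double overhead_fraction(double overhead_ms, double work_ms) {
    const double total_ms = overhead_ms + work_ms;
    if (total_ms <= 0.0) {
        return 0.0;  // nothing was measured, so no meaningful share
    }
    return overhead_ms / total_ms;
}
```

With the numbers from the question, `overhead_fraction(0.0025, 0.0005)` is about 0.83, where 0.0005 ms is the difference between the 0.003 ms kernel and the 0.0025 ms empty kernel. With 5 ms of work per launch the same overhead would be about 0.05% of the total, which is why larger operations hide it.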

SergeyKostrov · ‎06-20-2017

Take into account that MKL's API is highly optimized for bigger data sets ( matrices, vectors ). MKL also has overheads and a simple classic matrix multiplication algorithm ( triple-for-loop / processing core is less than 10 code lines ) outperforms MKL's sgemm for matrices up to 2,048x2,048.

Richard_S_7 wrote:

...On my system the MKL runs around 0.001 ms for this input size and my kernel runs around 0.003 ms...

How many times did you execute the test to get these numbers?
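For readers unfamiliar with the term, a classic triple-loop matrix multiplication usually looks something like the sketch below. This is not Sergey's code. The function name `matmul_naive`, the row-major layout and the use of `std::vector<float>` are all assumptions made for illustration, and no performance claim is attached to it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B with row-major storage (an assumed layout).
// A is m x k, B is k x n, C is resized to m x n.
void matmul_naive(const std::vector<float>& a,
                  const std::vector<float>& b,
                  std::vector<float>& c,
                  std::size_t m, std::size_t k, std::size_t n) {
    c.assign(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {          // rows of A and C
        for (std::size_t j = 0; j < n; ++j) {      // columns of B and C
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {  // inner (dot-product) dimension
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

The processing core is the three nested loops, which stays under ten lines. Calling it with n = 1 gives a matrix-vector product, the GEMV case discussed in this thread.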

Richard_S_7 · ‎06-22-2017

Thank you for your feedback.

Sergey Kostrov wrote:

MKL also has overheads and a simple classic matrix multiplication algorithm ( triple-for-loop / processing core is less than 10 code lines ) outperforms MKL's sgemm for matrices up to 2,048x2,048.

Could you supply the classic matrix multiplication algorithm you described? I would highly appreciate it!

Sergey Kostrov wrote:

How many times did you execute the test to get these numbers?

I made 3 warm-up runs and calculated the average runtime of the 5 following runs. The profiling was done using the C++ chrono library and, alternatively, the MKL function dsecnd(). Both profiling methods produced the same results.
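The measurement procedure described above (untimed warm-up runs followed by an averaged set of timed runs) could be sketched with `std::chrono` roughly as follows. This is not the code used for the numbers in this thread. The helper name `mean_runtime_ms` is hypothetical, and timing the whole batch with `steady_clock` instead of each run separately is a design choice made for this example.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Calls fn() `warmup` times without timing, then `runs` times with timing,
// and returns the mean wall-clock time per timed call in milliseconds.
template <typename Fn>
double mean_runtime_ms(Fn&& fn, std::size_t warmup, std::size_t runs) {
    if (runs == 0) {
        return 0.0;  // avoid dividing by zero when nothing is timed
    }
    for (std::size_t i = 0; i < warmup; ++i) {
        fn();  // warm-up: fills caches, triggers lazy initialisation
    }
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < runs; ++i) {
        fn();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(runs);
}
```

The setup in this post would correspond to `mean_runtime_ms(work, 3, 5)`, where `work` is whatever callable wraps the operation being measured. Timing the batch as a whole keeps the clock's own resolution and call cost from dominating when a single run takes only about a microsecond, and a larger `runs` value tends to give steadier averages at that scale.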