allanmac1 · ‎02-13-2018

I'm observing a ~25% performance drop in a kernel that runs ~4.5 msecs of pure compute followed by one final write to an OpenGL renderbuffer.

The measurement is via a cl_event.

Whether the kernels are loaded as binaries or as source doesn't matter.

I'm wondering how performance can vary so much when the kernel hasn't changed and its only interaction with "external" resources is a single store to a 1 megapixel renderbuffer?

Michael_C_Intel1 · ‎02-23-2018

Hi Allanmac,

Thanks for sharing your experience.

Can you post a representative source sample that shows the performance issue, so folks can understand the nature of the drop you're describing? If you could supply an example, we may be able to give some feedback on your functional workflow. Hopefully we can see whether your experience is in line with expectations or whether there is something that can be improved.

Can you identify any interop examples you may be referencing?

Can you share your platform info?

CPU SKU
OS
OCL Driver Version
OCL Implementation feature set you are building and running against (e.g. 1.2, 2.0, etc.)
OCL SDK version in use?

I'm assuming you used clEnqueueNDRangeKernel(..) and clGetEventProfilingInfo(..) on the host side to queue and get timer data on the kernel?
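For readers following along, the measurement described above can be sketched roughly as follows. This is a minimal sketch and not the original poster's code: the function names, the 1-D NDRange, and the `USE_OPENCL` build flag are all hypothetical, and it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE. The OpenCL part is guarded so the timing helpers also compile on a machine without the CL headers.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a pair of device timestamps (nanoseconds, the unit used by
 * OpenCL event profiling) into elapsed milliseconds. */
static double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns >= start_ns); /* the end timestamp should not precede the start */
    return (double)(end_ns - start_ns) / 1.0e6;
}

/* Relative change in runtime in percent; a positive value means slower. */
static double runtime_change_percent(double baseline_ms, double current_ms)
{
    return (current_ms - baseline_ms) / baseline_ms * 100.0;
}

#ifdef USE_OPENCL /* hypothetical flag: define it when CL/cl.h is available */
#include <CL/cl.h>

/* Enqueue a 1-D kernel and return its device execution time in ms,
 * or a negative value on error. Assumes the queue was created with
 * CL_QUEUE_PROFILING_ENABLE; otherwise the profiling queries fail. */
static double kernel_time_ms(cl_command_queue queue, cl_kernel kernel,
                             size_t global_size, size_t local_size)
{
    cl_event ev;
    cl_ulong t0 = 0, t1 = 0;
    cl_int err;

    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                 &global_size, &local_size,
                                 0, NULL, &ev);
    if (err != CL_SUCCESS)
        return -1.0;

    /* Profiling data is only valid once the command has completed. */
    err = clWaitForEvents(1, &ev);
    if (err == CL_SUCCESS)
        err = clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                      sizeof t0, &t0, NULL);
    if (err == CL_SUCCESS)
        err = clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                      sizeof t1, &t1, NULL);
    clReleaseEvent(ev);

    return err == CL_SUCCESS ? elapsed_ms((uint64_t)t0, (uint64_t)t1) : -1.0;
}
#endif
```

Comparing `kernel_time_ms` results across two driver versions with `runtime_change_percent` gives one number for the regression; averaging several runs would likely make that number more stable.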

Thanks,

-MichaelC

Ben_A_Intel · ‎02-26-2018

Hi allanmac,

Have you tried the Intercept Layer for OpenCL Applications?

https://github.com/intel/opencl-intercept-layer

The HostPerformanceTiming and DevicePerformanceTiming capabilities might help to narrow down where the extra time is being spent. Or, if you know exactly which kernel is taking extra time, it might give some hints as to why - maybe it was compiled for a different SIMD size, or maybe it spilled out of the register file, etc.

Give it a try and let us know if it helps.

-- Ben

allanmac1 · ‎02-26-2018

Thanks Ben,

My example program tries both source and precompiled binary kernels and I dump the spill mem and local mem sizes. No changes to either.

The kernels are 99% mundane loads, FMAs and local memory accesses up until the very end when they perform write_imageh(half4) stores.

The same performance regression appears on an HD 530 (SKL), 630 (KBL) and 505 (Apollo Lake).

I'll dig deeper into this soon and the Intercept Layer looks useful!

I'll see if I can get you a binary reproducer.

-Allan

Michael_C_Intel1 · ‎02-26-2018

Hi Allan,

We can't address binary reproducers on the forum... Could you provide a generalized representative source case to reproduce?

Thanks,

-MichaelC

allanmac1 · ‎02-26-2018

MICHAEL C. (Intel) wrote:
Could you provide a generalized representative source case to reproduce?

That will take much longer as it's a big kernel. Maybe at some point in the future.

-ASM

allanmac1 · ‎04-10-2018

The 23.20.16.5018 "NEO" Win10 OpenCL driver has solved my regression.

I'm seeing kernel runtimes similar to or slightly faster than circa .4771 drivers.

Significant performance regression *after* the 15.46.05.4771 driver