OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

Significant performance regression *after* the 15.46.05.4771 driver

allanmac1
Beginner

I'm observing a ~25% performance drop in a kernel that does ~4.5 ms of pure compute followed by one final write to an OpenGL render buffer.

The measurement is via a cl_event.  

Whether the kernels are loaded as binaries or as source doesn't matter.

I'm wondering how performance can vary so much when the kernel hasn't changed and its only interaction with "external" resources is a single store to a 1-megapixel renderbuffer.

 

6 Replies
Michael_C_Intel1
Moderator

Hi Allanmac,

Thanks for sharing your experience.

Can you post a representative source sample of the performance issue so folks can understand the nature of the drop you're describing? If you can supply an example, we may be able to give feedback on your functional workflow and see whether your experience is in line with expectations or whether something can be improved.

Can you identify any interop examples you may be referencing?

Can you share your platform info?

  • CPU SKU
  • OS
  • OCL Driver Version
  • OCL implementation feature set you are building and running against (e.g. 1.2, 2.0)
  • OCL SDK version in use?

I'm assuming you used clEnqueueNDRangeKernel(..) and clGetEventProfilingInfo(..) on the host side to queue and get timer data on the kernel?
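For readers following along, here is a minimal sketch of that kind of event-based timing. It assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and that `evt` is the event returned by clEnqueueNDRangeKernel(..). The helper names (`elapsed_ms`, `slowdown_pct`, `kernel_time_ms`) and the `USE_OPENCL` build flag are hypothetical and are not taken from the original poster's program.

```c
#include <stdint.h>

/* Convert two nanosecond timestamps into elapsed milliseconds.
   OpenCL profiling counters are reported in nanoseconds. */
double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    return (double)(end_ns - start_ns) / 1e6;
}

/* Relative slowdown of current vs. baseline in percent (positive = slower). */
double slowdown_pct(double baseline_ms, double current_ms)
{
    return (current_ms - baseline_ms) / baseline_ms * 100.0;
}

#ifdef USE_OPENCL /* hypothetical flag: define it when building against an OpenCL SDK */
#include <CL/cl.h>

/* Device execution time of the command behind evt, in ms.
   Returns a negative value if any OpenCL call fails. */
double kernel_time_ms(cl_event evt)
{
    cl_ulong t0 = 0, t1 = 0;

    if (clWaitForEvents(1, &evt) != CL_SUCCESS)
        return -1.0;
    if (clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                sizeof(t0), &t0, NULL) != CL_SUCCESS)
        return -1.0;
    if (clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                sizeof(t1), &t1, NULL) != CL_SUCCESS)
        return -1.0;

    return elapsed_ms(t0, t1);
}
#endif
```

Running the same binary under both driver versions and feeding the two timings to `slowdown_pct` gives a single number to compare. The START/END pair covers device execution only, so queueing and submission delays are excluded.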

 

Thanks,

-MichaelC

Ben_A_Intel
Employee

Hi allanmac,

Have you tried the Intercept Layer for OpenCL Applications?

https://github.com/intel/opencl-intercept-layer

The HostPerformanceTiming and DevicePerformanceTiming capabilities might help to narrow down where the extra time is being spent. Or, if you know exactly which kernel is taking extra time, it might give some hints as to why - maybe it was compiled for a different SIMD size, or maybe it spilled out of the register file, etc.

Give it a try and let us know if it helps.

-- Ben

allanmac1
Beginner

Thanks Ben,

My example program tries both source and precompiled binary kernels and I dump the spill mem and local mem sizes.  No changes to either.
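A rough sketch of such a dump-and-compare step is below. It is not the poster's actual code: `struct kernel_stats`, `kernel_stats_equal`, `query_kernel_stats` and the `USE_OPENCL` build flag are hypothetical names. The spill size is read only when the headers define CL_KERNEL_SPILL_MEM_SIZE_INTEL, which is assumed to come from the Intel extension definitions in CL/cl_ext.h.

```c
#include <stdint.h>

/* Per-kernel resource usage as reported by the driver, in bytes. */
struct kernel_stats {
    uint64_t local_mem_bytes;
    uint64_t private_mem_bytes;
    uint64_t spill_mem_bytes;
};

/* Returns 1 if two snapshots (e.g. old vs. new driver) match, else 0. */
int kernel_stats_equal(const struct kernel_stats *a,
                       const struct kernel_stats *b)
{
    return a->local_mem_bytes == b->local_mem_bytes &&
           a->private_mem_bytes == b->private_mem_bytes &&
           a->spill_mem_bytes == b->spill_mem_bytes;
}

#ifdef USE_OPENCL /* hypothetical flag: define it when building against an OpenCL SDK */
#include <CL/cl.h>
#include <CL/cl_ext.h>

/* Fills *out for the given kernel/device pair; returns 0 on success. */
int query_kernel_stats(cl_kernel k, cl_device_id dev, struct kernel_stats *out)
{
    cl_ulong local_mem = 0, private_mem = 0, spill_mem = 0;

    if (clGetKernelWorkGroupInfo(k, dev, CL_KERNEL_LOCAL_MEM_SIZE,
                                 sizeof(local_mem), &local_mem, NULL) != CL_SUCCESS)
        return -1;
    if (clGetKernelWorkGroupInfo(k, dev, CL_KERNEL_PRIVATE_MEM_SIZE,
                                 sizeof(private_mem), &private_mem, NULL) != CL_SUCCESS)
        return -1;
#ifdef CL_KERNEL_SPILL_MEM_SIZE_INTEL
    /* Intel-specific query; spill_mem stays 0 if the driver rejects it. */
    clGetKernelWorkGroupInfo(k, dev, CL_KERNEL_SPILL_MEM_SIZE_INTEL,
                             sizeof(spill_mem), &spill_mem, NULL);
#endif
    out->local_mem_bytes = local_mem;
    out->private_mem_bytes = private_mem;
    out->spill_mem_bytes = spill_mem;
    return 0;
}
#endif
```

Identical snapshots across driver versions, as reported here, suggest the slowdown does not come from extra spilling or a change in local memory usage, though they do not rule out other code-generation differences.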

The kernels are 99% mundane loads, FMAs and local memory accesses up until the very end when they perform write_imageh(half4) stores.

The same performance regression appears on an HD 530 (SKL), 630 (KBL) and 505 (Apollo Lake).

I'll dig deeper into this soon and the Intercept Layer looks useful!

I'll see if I can get you a binary reproducer.

-Allan

Michael_C_Intel1
Moderator

Hi Allan,

We can't address binary reproducers on the forum... Could you provide a generalized representative source case to reproduce?

Thanks,

-MichaelC

allanmac1
Beginner

MICHAEL C. (Intel) wrote:
Could you provide a generalized representative source case to reproduce?

That will take much longer as it's a big kernel.  Maybe at some point in the future.

-ASM

allanmac1
Beginner

The 23.20.16.5018 "NEO" Win10 OpenCL driver has solved my regression.  

I'm seeing kernel runtimes similar to or slightly faster than with the .4771-era drivers.
