I'm observing a ~25% performance drop in a kernel that is ~4.5 ms of pure compute with one final write to an OpenGL renderbuffer.
The measurement is via a cl_event.
Whether the kernels are loaded as binaries or as source doesn't matter.
I'm wondering how performance can vary so much when the kernel hasn't changed and its only interaction with "external" resources is a single store to a 1-megapixel renderbuffer.
Hi Allanmac,
Thanks for sharing your experience.
Can you post a representative source sample of the performance issue so folks can understand the nature of the drop you're describing? With an example, we may be able to give feedback on your workflow and see whether your experience is in line with expectations or whether something can be improved.
Can you identify any interop examples you may be referencing?
Can you share your platform info?
- CPU SKU
- OS
- OCL Driver Version
- OCL implementation feature set you are building and running against (e.g., 1.2, 2.0)
- OCL SDK version in use?
I'm assuming you used clEnqueueNDRangeKernel(..) and clGetEventProfilingInfo(..) on the host side to queue and get timer data on the kernel?
Thanks,
-MichaelC
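For readers following along, here is a minimal sketch of the arithmetic behind an event-based measurement like the one discussed above. It assumes the two timestamps were already read from the kernel's `cl_event` with `clGetEventProfilingInfo` (using `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END`, which report nanoseconds) on a queue created with profiling enabled. The helper names `elapsed_ms` and `slowdown_percent` are hypothetical and are not taken from the original poster's program.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a pair of device timestamps (in nanoseconds) to elapsed
 * milliseconds. The values are assumed to come from
 * CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END. */
double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns >= start_ns);
    return (double)(end_ns - start_ns) / 1.0e6;
}

/* Relative increase in runtime of `current_ms` over `baseline_ms`,
 * in percent. A positive result means the current run is slower. */
double slowdown_percent(double baseline_ms, double current_ms)
{
    assert(baseline_ms > 0.0);
    return (current_ms - baseline_ms) / baseline_ms * 100.0;
}
```

Comparing the same kernel across two driver versions this way (for example, a 4.0 ms baseline against a 5.0 ms run gives 25%) keeps host-side overhead such as enqueue latency and GL/CL synchronization out of the number, since only the device's start and end timestamps are used.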
Thanks Ben,
My example program tries both source and precompiled binary kernels, and I dump the spill memory and local memory sizes. Neither changes.
The kernels are 99% mundane loads, FMAs and local memory accesses up until the very end, when they perform write_imageh(half4) stores.
The same performance regression appears on an HD 530 (SKL), 630 (KBL) and 505 (Apollo Lake).
I'll dig deeper into this soon and the Intercept Layer looks useful!
I'll see if I can get you a binary reproducer.
-Allan
Hi Allan,
We can't address binary reproducers on the forum. Could you provide a generalized, representative source case that reproduces the issue?
Thanks,
-MichaelC
MICHAEL C. (Intel) wrote:
Could you provide a generalized representative source case to reproduce?
That will take much longer as it's a big kernel. Maybe at some point in the future.
-ASM
The 23.20.16.5018 "NEO" Win10 OpenCL driver has solved my regression.
I'm seeing kernel runtimes similar to or slightly faster than circa .4771 drivers.