OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

Apparent memory leak and performance problems

Elmar
Beginner

Dear all,

My company is developing a scientific application with ~30,000 registered users. We have encountered some problems with the OpenCL support in Windows 8.1, using the latest driver 10.18.14.4080 and testing with the built-in GPU of a Core i7-4770. The application is 32-bit.

1) A memory-leak-like issue: The application has a non-performance-critical loop which looks like this:

Create kernels
for (i = 0; i < 100; i++) {
    Create OpenCL buffers
    Run kernels
    Free OpenCL buffers
}

Unfortunately, this loop only makes it to the 3rd iteration; then the OpenCL buffer creation fails with CL_MEM_OBJECT_ALLOCATION_FAILURE. In the Windows Task Manager, I can see the memory usage go through the roof.

Of course I am only reporting this because the same code runs fine when using OpenCL devices from AMD or nVIDIA (where the task manager shows a constant number during all 100 iterations).

I also implemented a counter, which I increment for each allocation (clCreateProgramWithSource, clCreateKernel, clCreateBuffer, clCreateImage) and decrement for each deallocation (clReleaseProgram, clReleaseKernel, clReleaseMemObject). This counter does not increase with each loop iteration, so I'm pretty sure the leak or fragmentation problem is not in my code.
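For what it's worth, the counting described above can be sketched as thin wrappers around the create/release calls (the wrapper names are mine, not from the original application):

```c
/* Minimal sketch of the allocation-counting approach; the wrapper
   names are illustrative, not from the original application. */
#include <CL/cl.h>

static int g_live_objects = 0;   /* net allocations minus releases */

static cl_mem counted_create_buffer(cl_context ctx, cl_mem_flags flags,
                                    size_t size, void *host_ptr,
                                    cl_int *err)
{
    cl_mem buf = clCreateBuffer(ctx, flags, size, host_ptr, err);
    if (buf != NULL)
        g_live_objects++;        /* count only successful creations */
    return buf;
}

static cl_int counted_release_mem(cl_mem buf)
{
    cl_int err = clReleaseMemObject(buf);
    if (err == CL_SUCCESS)
        g_live_objects--;
    return err;
}
```

If g_live_objects stays constant across iterations, the application is at least balancing its creates and releases, as described above.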

If your driver team wants to investigate this, I can of course send you the application for testing (I can't really extract a minimum code example).


2) A performance issue: For the two performance critical parts of my application (not the loop above), a Radeon R9 290X executes the kernels 18x and 12x as fast as the HD4600 GPU in the Core i7 4770. But in 3DMark11, the Radeon runs only 3.5x as fast (http://www.futuremark.com/hardware/gpu/Intel+HD+Graphics+4600/review). So somewhere, I have a massive performance loss. Now I have three questions:

a) My code uses a lot of barrier(CLK_LOCAL_MEM_FENCE) calls after copying data from global memory or when sharing data between work items. On AMD and nVIDIA GPUs, these barriers are optimized away by the compiler, because I declare kernels with __attribute__((reqd_work_group_size(CL_WGSIZE,1,1))), where CL_WGSIZE matches the warp/wavefront size. But the Intel compiler seems to include the barrier code (the kernels break if I remove some barriers). Is there any way / work-group size to help the compiler? I tried work-group sizes 32 and 64, but no luck.
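For reference, the pattern described above looks roughly like this (a sketch; the kernel body and the CL_WGSIZE value are illustrative):

```c
/* OpenCL C sketch of the reqd_work_group_size + barrier pattern
   described above. Kernel name and body are illustrative only. */
#define CL_WGSIZE 16

__kernel __attribute__((reqd_work_group_size(CL_WGSIZE, 1, 1)))
void example(__global const float *in, __global float *out)
{
    __local float tile[CL_WGSIZE];
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    tile[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);   /* required unless the compiler can
                                       prove the whole work-group runs
                                       in lock-step at this width */
    out[gid] = tile[CL_WGSIZE - 1 - lid];   /* data shared across items */
}
```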

b) On AMD and nVIDIA GPUs, I can obtain the assembler code generated for the kernels and check that everything has been translated as expected. Is there any way to check on Intel GPUs as well what the compiler did, in order to identify bottlenecks?

c) What is the preferred, not excessively expensive way to identify bottlenecks? Our software development is done on Linux; the Windows executables are cross-compiled.

Many thanks for your help,
Elmar

 

Robert_I_Intel
Employee

Dear Elmar,

1. Typically, we do not recommend creating and destroying buffers in a loop. It is much better to create the buffers at the beginning of the application, run your kernels repeatedly, and then release the buffers at the very end of your application. Nevertheless, your usage is legitimate, so I would be very interested in getting a reproducer. You could send it to me in a private message if you would like.

2. Please send me your kernel code for analysis. You could try work group sizes 8 or 16, since your kernel is most likely compiled in SIMD8 (large kernel) or SIMD16 (regular kernel) fashion.
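One way to confirm the compiled SIMD width rather than guessing is to query the preferred work-group size multiple (a sketch; the kernel and device handles are assumed to exist already):

```c
/* Sketch: query the SIMD-friendly work-group size multiple for a
   built kernel. `kernel` and `device` are assumed to exist. */
#include <CL/cl.h>

size_t preferred = 0;
cl_int err = clGetKernelWorkGroupInfo(kernel, device,
        CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
        sizeof(preferred), &preferred, NULL);
/* On Intel HD Graphics this typically matches the SIMD width
   (8, 16, or 32) that the kernel was compiled for. */
```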

Unfortunately, we currently don't provide an assembly view, though we do provide a textual SPIR version, which you can check. You can download Intel(R) INDE https://software.intel.com/en-us/intel-inde and then follow the advice in these articles: https://software.intel.com/en-us/articles/getting-started-with-opencl-development-on-windows-with-in... and https://software.intel.com/en-us/articles/using-spir-for-fun-and-profit-with-intel-opencl-code-build...

For bottleneck identification, you can start with the Intel(R) OpenCL(TM) Code Builder tools (see the link above), and for deeper analysis, download a version of Intel(R) VTune(TM) Amplifier https://software.intel.com/en-us/intel-vtune-amplifier-xe - you can obtain a 30-day trial to see if it works for you. See this article on using VTune for OpenCL analysis: https://software.intel.com/en-us/articles/intel-vtune-amplifier-xe-getting-started-with-opencl-perfo...

 



Elmar
Beginner

Dear Robert,

many, many thanks for your fast and very helpful reply. I sent you the files to reproduce the leak problem by private message, hoping that I did nothing wrong that only worked by accident with AMD/nVIDIA drivers.

I also tried a work group size of 16, but then the kernel took 45% longer to execute. I can unfortunately not try work group size 8 (my kernel needs at least 16 work items to function).

Thanks also for the hint to look at the SPIR version.

Best regards,
Elmar

Robert_I_Intel
Employee

Elmar, 

Our driver developer did the analysis and here is what he found (things did work by accident :)):

This is an application issue caused by a lack of synchronization between the host part that performs the clEnqueue commands and the GPU that is executing these tasks.

  • Each subsequent iteration in this application executes non-blocking calls ONLY.
  • There is no command like clFinish, clWaitForEvents, or a blocking clEnqueueReadBuffer in subsequent iterations.

It looks like the application developer makes several wrong assumptions. He/she should take into consideration that:

  • The OpenCL spec says: “After the memobj reference count becomes zero and commands queued for execution on a command-queue(s) that use memobj have finished, the memory object is deleted.” – there is no check for execution progress in this application.
  • OCL objects created on an integrated graphics card (e.g. with clCreateBuffer) consume application address space (32-bit in this case). Discrete cards use separate memory for this purpose (clCreateBuffer doesn’t consume application address space – these allocations are not visible in Task Manager).
  • Other vendors probably block some OpenCL calls to avoid a too-large task queue. 100 iterations executed with completely non-blocking calls will require at least 20 GB of memory (two of the clCreateBuffer calls set the allocation size to ~100 MB).

This scenario is not valid according to the OpenCL specification; handling it would require introducing a blocking mechanism in the driver for non-blocking calls.

Everything works fine with clFinish injected at the end of each iteration.
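The fixed loop can be sketched roughly as follows (buffer sizes, argument setup, and launch dimensions are placeholders, not taken from the application):

```c
/* Sketch of the corrected loop with clFinish per iteration.
   `ctx`, `queue`, `kernel`, `size`, `global`, and `local` are
   assumed to exist; details are placeholders. */
#include <CL/cl.h>

for (int i = 0; i < 100; i++) {
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);
    clSetKernelArg(kernel, 0, sizeof(buf), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                           0, NULL, NULL);
    clReleaseMemObject(buf);  /* drops the refcount to zero ...        */
    clFinish(queue);          /* ... but per the spec, the memory is
                                 only reclaimed once the queued work
                                 using it has completed               */
}
```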

Elmar
Beginner

Dear Robert,

damn, I indeed forgot a clFinish. Apologies to your driver team for wasting their time on this triviality. After two years of struggling with compiler bugs and crashes in other vendors' OpenCL implementations, I got desperate and reported this too early...

Best regards,

Elmar

 
