OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.
1663 Discussions

neural-style/torchcl with intel-opencl-r3.0

Chernov__Alexey
Beginner
321 Views

Hi,

I ran into an issue with "github.com/jcjohnson/neural-style" and "torchcl" (github.com/hughperkins/distro, branch distro-cl) on intel-opencl-r3.0.

It runs 7 times faster than Beignet 1.1.1, but processing stops after 90-100 iterations with error code CL_OUT_OF_HOST_MEMORY (-6), whereas Beignet works stably.

At image size 500x500, the computer has 32 GB of RAM, the OS uses ~1.5 GB, and torch uses ~10 GB (~5 GB resident), yet your driver returns "out of host memory". Can you explain that? In the same situation Beignet uses ~5 GB (~0.8 GB resident).

It looks like the error occurs at the same number of iterations regardless of image size (250x250 or 500x500 makes no difference). I don't see memory use growing significantly across iterations.

Does intel-opencl-r3.0 have its own logging system to figure out what triggers the "out of memory" error? And what else can I do in this situation?

P.S.: "torchcl" is written for GPUs and doesn't follow your recommendation to avoid duplicating all buffers in memory (https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics).
It may also have memory/object leaks, but it somehow works with Beignet without "out of memory" errors.
So I suspect that "intel-opencl-r3.0" may have its own issues besides that.

HW: i3-6300
OS: Ubuntu 16.04 with kernel 4.4 (also tried 4.8 with your patch for i915; nothing changed).


6 Replies
Chernov__Alexey
Beginner
321 Views

I didn't mention it the first time: I only use the GPU driver from intel-opencl-r3.0, because CPU processing is slow and consumes a lot of power. CPU mode works past 100 iterations without errors.

> P.S.: That "torchcl" is written for GPUs
I meant standalone GPUs with their own RAM.

Jeffrey_M_Intel1
Employee
321 Views

Thanks for this report. I think I've set up enough to replicate it. 'th neural_style.lua' completes in CPU mode but crashes before 100 iterations with the clnn backend. I will let you know what we find.

Chernov__Alexey
Beginner
321 Views

Thanks for the fast reply. Here are some details I found:

$ gdb torchcl/install/bin/luajit -ex "catch throw"

(gdb) run neural_style.lua -backend clnn -gpu 0 -print_iter 10 <...image options...>

Iteration 10 / 1000
Iteration 20 / 1000
..
Iteration 80 / 1000
Iteration 90 / 1000

(gdb) bt 3
#0  in __cxa_throw ()
#1  in EasyCL::checkError at torchcl/opencl/cltorch/src/EasyCL/EasyCL.cpp:538
#2  in CLWrapper::copyToHost at torchcl/opencl/cltorch/src/EasyCL/CLWrapper.cpp:74

torchcl/opencl/cltorch/src/EasyCL/CLWrapper.cpp:

void CLWrapper::copyToHost() {
    if(!onDevice) {
        throw std::runtime_error("copyToHost(): not on device");
    }
//    cl->finish();
    cl_event event = NULL;
    error = clEnqueueReadBuffer(*(cl->queue), devicearray, CL_TRUE, 0, getElementSize() * N, getHostArray(), 0, NULL, &event);    
    cl->checkError(error);
    cl_int err = clWaitForEvents(1, &event);
    clReleaseEvent(event);
    if (err != CL_SUCCESS) {
        throw std::runtime_error("wait for event on copytohost failed with " + easycl::toString(err) );
    }
    deviceDirty = false;
}

So the error comes from clEnqueueReadBuffer.
When I changed the code to this:

void CLWrapper::copyToHost() {
    if(!onDevice) { throw std::runtime_error("copyToHost(): not on device"); }
//    cl->finish();
    void *ptr = clEnqueueMapBuffer(*(cl->queue), devicearray, CL_TRUE, CL_MAP_READ, 0, getElementSize() * N, 0, NULL, NULL, &error);
    cl->checkError(error);
    ::memcpy(getHostArray(), ptr, getElementSize() * N);
    clEnqueueUnmapMemObject(*(cl->queue), devicearray, ptr, 0, NULL, NULL);
    deviceDirty = false;
}

... neural-style stops with the same error after 100 iterations, but in a different place:

    torchcl/install/share/lua/5.1/optim/lbfgs.lua:152:
    clblasSdot() failed with -6 at torchcl/opencl/cltorch/src/lib/THClBlas.cpp:186

But I haven't figured out which CL call the -6 error comes from this time.
clblasSdot() is implemented in torchcl/opencl/cltorch/src/clMathLibraries/clBLAS/src/library/blas/xdot.c and can return an error from many CL functions.

 

Ben_A_Intel
Employee
321 Views

I'm seeing events being created that are never released.  After running for enough iterations this eventually results in the OUT_OF_MEMORY error.  Many of the events that are never released come from enqueuing "Sdot_kernel".  I'm tracking down where this kernel is being enqueued so I can figure out where the event should be released, but I wanted to post my findings so far.
 

Ben_A_Intel
Employee
322 Views

I added my findings to the torch-cl issue you created (thanks!):

https://github.com/hughperkins/distro-cl/issues/14

The fastest solution will be to fix the event leak in torch-cl, so hopefully we can make progress on this issue.  In the meantime though, we're also looking at ways to improve our event handling so it's more resilient to memory leaks in the future.

Chernov__Alexey
Beginner
321 Views

Big thanks! I had seen this source file but didn't notice this event. I've submitted a pull request with the fix to hughperkins/clBLAS.

It looks like OpenCL code needs special handling for this kind of "out of events" error.
