OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.
1663 Discussions

neural-style/torchcl with intel-opencl-r3.0

Chernov__Alexey
Beginner
321 Views

Hi,

I ran into an issue with "github.com/jcjohnson/neural-style" and "torchcl" (github.com/hughperkins/distro, branch distro-cl) on intel-opencl-r3.0.

It runs 7 times faster than Beignet 1.1.1, but processing stops after 90-100 iterations with error code CL_OUT_OF_HOST_MEMORY (-6), whereas Beignet works stably.

At image size 500x500, the computer has 32 GB of RAM, the OS uses ~1.5 GB, and torch uses ~10 GB (~5 GB resident), yet your driver returns "out of host memory". Can you explain that? In the same situation Beignet uses ~5 GB (~0.8 GB resident).

It looks like the error occurs at the same number of iterations regardless of image size (250x250 or 500x500 makes no difference). I don't see memory use growing significantly across iterations.

Does intel-opencl-r3.0 have its own logging system to figure out what triggers the "out of memory" error? And what else can I do in this situation?

P.S.: "torchcl" is written for GPUs and doesn't follow your recommendation to avoid duplicating all buffers in memory (https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics).
It may also have memory/object leaks, but it somehow works with Beignet without "out of memory" errors.
So I suspect that "intel-opencl-r3.0" may have its own issues besides that.

HW: i3-6300
OS: Ubuntu 16.04 with kernel 4.4 (also tried 4.8 with your patch for i915; nothing changed).


6 Replies
Chernov__Alexey
Beginner
321 Views

I didn't mention it the first time: I only use the GPU driver from intel-opencl-r3.0, because CPU processing is slow and consumes a lot of power. CPU mode works past 100 iterations without errors.

> P.S.: That "torchcl" is written for GPUs
I meant standalone GPUs with their own RAM.

Jeffrey_M_Intel1
Employee
321 Views

Thanks for this report. I think I've set up enough to replicate it. 'th neural_style.lua' completes in CPU mode but crashes before 100 iterations with the clnn backend. I will let you know what we find.

Chernov__Alexey
Beginner
321 Views

Thanks for the fast reply. Here are some details I found:

$ gdb torchcl/install/bin/luajit -ex "catch throw"

(gdb) run neural_style.lua -backend clnn -gpu 0 -print_iter 10 <...image options...>

Iteration 10 / 1000
Iteration 20 / 1000
..
Iteration 80 / 1000
Iteration 90 / 1000

(gdb) bt 3
#0  in __cxa_throw ()
#1  in EasyCL::checkError at torchcl/opencl/cltorch/src/EasyCL/EasyCL.cpp:538
#2  in CLWrapper::copyToHost at torchcl/opencl/cltorch/src/EasyCL/CLWrapper.cpp:74

torchcl/opencl/cltorch/src/EasyCL/CLWrapper.cpp:

void CLWrapper::copyToHost() {
    if(!onDevice) {
        throw std::runtime_error("copyToHost(): not on device");
    }
//    cl->finish();
    cl_event event = NULL;
    error = clEnqueueReadBuffer(*(cl->queue), devicearray, CL_TRUE, 0, getElementSize() * N, getHostArray(), 0, NULL, &event);    
    cl->checkError(error);
    cl_int err = clWaitForEvents(1, &event);
    clReleaseEvent(event);
    if (err != CL_SUCCESS) {
        throw std::runtime_error("wait for event on copytohost failed with " + easycl::toString(err) );
    }
    deviceDirty = false;
}

So the error comes from clEnqueueReadBuffer.
When I changed the code to this:

void CLWrapper::copyToHost() {
    if(!onDevice) { throw std::runtime_error("copyToHost(): not on device"); }
//    cl->finish();
    void *ptr = clEnqueueMapBuffer(*(cl->queue), devicearray, CL_TRUE, CL_MAP_READ, 0, getElementSize() * N, 0, NULL, NULL, &error);
    cl->checkError(error);
    ::memcpy(getHostArray(), ptr, getElementSize() * N);
    clEnqueueUnmapMemObject(*(cl->queue), devicearray, ptr, 0, NULL, NULL);
    deviceDirty = false;
}

... neural-style stops with the same error after 100 iterations, but in a different place:

    torchcl/install/share/lua/5.1/optim/lbfgs.lua:152:
    clblasSdot() failed with -6 at torchcl/opencl/cltorch/src/lib/THClBlas.cpp:186

But I haven't figured out which CL call the -6 error comes from this time.
clblasSdot() is implemented in torchcl/opencl/cltorch/src/clMathLibraries/clBLAS/src/library/blas/xdot.c and can return an error from many CL functions.

 

Ben_A_Intel
Employee
321 Views

I'm seeing events being created that are never released.  After running for enough iterations this eventually results in the OUT_OF_MEMORY error.  Many of the events that are never released come from enqueuing "Sdot_kernel".  I'm tracking down where this kernel is being enqueued so I can figure out where the event should be released, but I wanted to post my findings so far.
 

Ben_A_Intel
Employee
322 Views

I added my findings to the torch-cl issue you created (thanks!):

https://github.com/hughperkins/distro-cl/issues/14

The fastest solution will be to fix the event leak in torch-cl, so hopefully we can make progress on this issue.  In the meantime though, we're also looking at ways to improve our event handling so it's more resilient to memory leaks in the future.

Chernov__Alexey
Beginner
321 Views

Big thanks! I had seen this source file but didn't notice this event. I've submitted a pull request with the fix to hughperkins/clBLAS.

It looks like OpenCL code needs special handling for this kind of "out of events" error.
