OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

Hang on Linux when initializing OpenCL context from TBB thread

e4lam
Beginner

Hi, I'm getting a hang/deadlock when initializing an Intel OpenCL context from a TBB thread on Linux. On Windows, it's fine. Please try the source code below.

 

// ocl_test.cpp
//
// Compile with: g++ -std=c++14 -I/path/to/opencl/include -I/path/to/tbb/include ocl_test.cpp -o ocl_test -L. -lOpenCL -ltbb -lpthread
//
// cl.hpp used from: https://www.khronos.org/registry/OpenCL/api/2.1/cl.hpp
//
// Run with (from the directory containing libintelocl.so):
// $ env LD_LIBRARY_PATH=. LD_PRELOAD=libintelocl.so ./ocl_test
//
// This hangs on Linux under various Intel processors but not Windows
//
#include <iostream>
#include <future>
#include <atomic>
#include <vector>

#define __CL_ENABLE_EXCEPTIONS
#define CL_TARGET_OPENCL_VERSION 120
#include "cl.hpp"

#ifdef _WIN32
#ifdef _DEBUG
    #pragma comment(lib, "OpenCL_d")
    #pragma comment(lib, "tbb_debug")
#else
    #pragma comment(lib, "OpenCL")
    #pragma comment(lib, "tbb")
#endif
#endif

//#define TEST_STDTHREAD
#define TEST_TBB

#ifdef TEST_TBB
#include <tbb/tbb.h>

// Task wrapper around a callable: execute() runs f(), and as long as f()
// returns true it returns a continuation task so that f() keeps getting re-run.
template <typename F>
class LambdaTask : public tbb::task
{
public:
    LambdaTask(const F& f) : myF(f) {
    }
private:
    tbb::task* execute() override {
        if (myF()) {
            return new(tbb::task::allocate_continuation()) LambdaTask(myF);
        }
        return nullptr;
    }
    F myF;
};

template <typename F>
static void EnqueueLambda(const F& f) {
    tbb::task::enqueue(
            *new(tbb::task::allocate_root()) LambdaTask<F>(f) );
}

template <typename F>
static void SpawnAndWaitLambda(const F& f) {
    tbb::task::spawn_root_and_wait(
            *new(tbb::task::allocate_root()) LambdaTask<F>(f) );
}
#endif

int main() {

    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    if (platforms.empty()) {
        std::cout << "No platforms found.\n";
        return 1;
    }

    for (const cl::Platform &platform : platforms) {
        std::cout << "Available platform: "
                  << platform.getInfo<CL_PLATFORM_NAME>() << "\n";
    }

    cl::Platform platform = platforms[0];
    std::cout << "Using platform: " << platform.getInfo<CL_PLATFORM_NAME>()
              << "\n";

    std::vector<cl::Device> devices;
    platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);
    if (devices.empty()) {
        std::cout << "No devices found.\n";
        return 1;
    }

    cl::Device device = devices[0];
    std::cout << "Using device: " << device.getInfo<CL_DEVICE_NAME>() << "\n";

    auto create_context = [&]() {
        std::cout << "Attempting to create context" << std::endl;
        cl::Context context({device});
        std::cout << "Created context" << std::endl;
    };

#if defined(TEST_STDTHREAD)

    std::cout << "TEST_STDTHREAD" << std::endl;
    auto task = std::async(std::launch::async, [&]() { create_context(); });
    task.wait();
    std::cout << "Done" << std::endl;

#elif defined(TEST_TBB)

    std::cout << "TEST_TBB" << std::endl;

    std::atomic_bool done(false);

    EnqueueLambda([&]() {
        create_context();
        done = true;
        return false;
    });

#if 1
    // Try waiting inside scheduler loop
    SpawnAndWaitLambda([&]() {
        if (done)
        {
            std::cout << "Done" << std::endl;
            return false;
        }
        return true;
    });
#else
    // This also fails if we just spin loop
    while (!done)
        ;
    std::cout << "Done" << std::endl;
#endif

#else
    std::cout << "TEST_SERIAL" << std::endl;
    create_context();
    std::cout << "Done" << std::endl;
#endif

    return 0;
}

 

Michael_C_Intel1
Moderator

Hi e4lam,

 

Thanks for the reproducer... Can you confirm your platform OS distro and CPU SKU? Which OpenCL (libintelocl.so) runtime are you using? From where did you acquire it for Windows and Linux respectively?

 

-MichaelC

Michael_C_Intel1
Moderator

Any particular reason you LD_PRELOAD instead of relying on the ICD loader library?

 

Thanks,

-MichaelC

e4lam
Beginner

I'll have to check again when I get back to work on Monday, but it's quite reproducible across several CPUs on Linux (Mint, Ubuntu, RHEL). From the stack traces, it looks like a TBB interaction problem and has nothing to do with the particular CPU, since execution never gets that far. In one instance it looked like an internal barrier was getting confused because we saw two OCL TBB tasks in the same thread (possibly due to task stealing).

The LD_PRELOAD is there to ensure that the Intel OpenCL CPU driver is used instead of the GPU driver. The LD_LIBRARY_PATH is there to ensure that the same TBB library is used, to rule out a bad interaction with a different TBB library. This is a minimal reproducer from our attempts to rule out as many possibilities as possible.
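
For what it's worth, going through the ICD loader and filtering for a CPU device would look something like the sketch below. This is just our assumption about how to avoid the GPU drivers without preloading; it presumes the CPU runtime's .icd file is registered under /etc/OpenCL/vendors.

// Sketch: pick the Intel CPU device through the ICD loader instead of LD_PRELOAD.
#include <iostream>
#include <vector>
#include "cl.hpp"

int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    for (const cl::Platform &p : platforms) {
        std::vector<cl::Device> devices;
        // Platforms without a CPU device return CL_DEVICE_NOT_FOUND; skip them.
        if (p.getDevices(CL_DEVICE_TYPE_CPU, &devices) != CL_SUCCESS || devices.empty())
            continue;
        std::cout << "CPU device: " << devices[0].getInfo<CL_DEVICE_NAME>()
                  << " (" << p.getInfo<CL_PLATFORM_NAME>() << ")\n";
        return 0;
    }
    std::cout << "No CPU device found through the ICD loader.\n";
    return 1;
}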

We tested with the latest Intel OpenCL driver and Intel TBB 2019 headers.

 

Michael_C_Intel1
Moderator

e4lam,

Thanks for the detail on loading.

When you get the opportunity, can you share the following:

  • TBB version?
    • 2019 initial release or 2019 Update X or a specific version string would be useful for Windows and Linux...
  • From where did you acquire TBB? The performance primitives site? threadingbuildingblocks.org/GitHub? A system package repository manager (which one)?
    • Custom build or prebuilt TBB binaries?
  • The version strings reported by clinfo for the platforms? (One way to print these from the reproducer is sketched below.) On CPU RT versions and distributions:

There are two deployments of the CPU Runtime for Windows. One is a standalone package available from the Intel Registration Center; the other comes with the graphics driver for Intel Graphics. Vendors release drivers at different cadences; Intel's first-party drivers are typically deployed on systems like NUCs, or in rare cases where the system vendor supports them. Which one did you get it from?

On Linux, there are two deployments as well. One is experimental, for SYCL... The other is a standalone package available from the Intel Registration Center.

Confirming the runtime stack will help expedite finding a resolution.
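
For reference, something along these lines would print the strings I'm after. It's just a sketch, assuming the classic tbb/tbb_stddef.h version macros and the same cl.hpp queries already used in your reproducer; build it the same way, with -ltbb.

// Sketch: print OpenCL platform version strings plus TBB header/runtime versions.
#include <iostream>
#include <vector>
#include "cl.hpp"
#include <tbb/tbb_stddef.h>   // TBB_VERSION_MAJOR/MINOR, TBB_INTERFACE_VERSION

int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    for (const cl::Platform &p : platforms) {
        std::cout << p.getInfo<CL_PLATFORM_NAME>() << "\n"
                  << "  platform version: " << p.getInfo<CL_PLATFORM_VERSION>() << "\n"
                  << "  platform vendor:  " << p.getInfo<CL_PLATFORM_VENDOR>() << "\n";
    }
    std::cout << "TBB headers: " << TBB_VERSION_MAJOR << "." << TBB_VERSION_MINOR
              << " (interface " << TBB_INTERFACE_VERSION << ")\n"
              << "TBB runtime interface: " << TBB_runtime_interface_version() << "\n";
    return 0;
}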

Thanks,

-MichaelC

 

e4lam
Beginner

For both Linux and Windows, it is the standalone Intel OpenCL driver. In all of our tests, the systems were using NVIDIA or AMD discrete video cards, which is why we had to ensure that the CPU driver is the one running.

As for TBB, we use the open-source releases, built ourselves from GitHub. However, I'll note again that in this setup we've ensured we're running against the same TBB shared libraries as libintelocl.so, albeit compiled against the headers from GitHub.
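
(For completeness, one way to double-check which libtbb.so a process actually loads would be something like the sketch below; it assumes glibc's dladdr and the classic TBB_runtime_interface_version() entry point from tbb/tbb_stddef.h. Build with -ltbb -ldl; g++ defines _GNU_SOURCE, which dladdr needs.)

// Sketch: report which shared library actually provides the TBB entry points at runtime.
#include <dlfcn.h>
#include <iostream>
#include <tbb/tbb_stddef.h>

int main() {
    Dl_info info;
    if (dladdr(reinterpret_cast<void *>(&TBB_runtime_interface_version), &info)
            && info.dli_fname) {
        std::cout << "TBB loaded from: " << info.dli_fname << "\n";
    } else {
        std::cout << "Could not resolve the TBB library path.\n";
    }
    return 0;
}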

Are you saying that you've tried the sample and cannot reproduce?

e4lam
Beginner

Here's the clinfo output from one of our machines with libintelocl.so preloaded.

Michael_C_Intel1
Moderator

Hi e4lam,

Release notes reference page

 

To my recollection, the TBB libraries included with CPU RT 18.1 may have custom symbols (at least for Windows). They were built for the CPU RT only and are not promoted for direct user application linkage. So, LD_LIBRARY_PATH=. for the user application worries me as far as TBB access goes.

One thing they did mention in the release notes is that if the application needs to use TBB directly, it should use a higher version of TBB... You mentioned you are using 2019 headers... Can you deploy and link against the corresponding libs on Linux? Per the guidance, this means 2018 initial release or newer headers and libs are needed... as opposed to just the headers.

See the 'Known Issues' (section 6 page 8) of the release notes.

Do you get a similar error when you link against TBB 2018 initial release or newer?

 

Sidebar:

Even though the release notes suggest 'functionality and performance' may vary, the sighting is still appreciated. Users will hit cases where products that seemingly should run together have issues, and it's useful to pass those cases along to the TBB and CPU RT teams. Both teams have been responsive to TBB with CPU RT issues in the past.

-MichaelC

e4lam
Beginner

This minimal reproducer was our attempt to trim things down from our production scenario. For our production scenarios, we have tried older configurations of Intel OpenCL and TBB where the application links against a separate TBB. We'll try again but we haven't found any combination that works so far on Linux.

What is the best channel to contact the teams that you mention? I was hoping that this forum was already a communication channel for the CPU RT team.

 

Thanks!

 

 

 

e4lam
Beginner

PS. A careful reading of the known limitations suggests that it's *supposed* to work whenever you use a newer OR the *same* version of TBB, which is what the minimal example above does.

Linux* OS case: Make sure there is no other Threading Building Blocks (TBB) library in your OpenCL™ host application library search path on Linux* OS. Intel® CPU Runtime for OpenCL™ Applications was tested only with Intel® Threading Building Blocks (Intel® TBB) libraries included in the package. In case OpenCL™ host application intentionally uses features of a standalone Threading Building Blocks (TBB) make sure that it is of a higher version than the library version in the package and is found earlier in the shared library search procedure. If standalone Threading Building Blocks (TBB) libraries are loaded functionality and performance may vary.
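
(A minimal runtime check of that "same or newer" condition, assuming the standard TBB_INTERFACE_VERSION macro and TBB_runtime_interface_version() function from classic TBB, could look like the sketch below; build with -ltbb.)

// Sketch: verify at startup that the loaded libtbb.so is the same interface
// version as the headers we compiled against, or newer.
#include <cstdio>
#include <tbb/tbb_stddef.h>

int main() {
    const int compiled = TBB_INTERFACE_VERSION;            // from the headers
    const int loaded   = TBB_runtime_interface_version();  // from the loaded libtbb.so
    std::printf("TBB interface: compiled against %d, loaded %d\n", compiled, loaded);
    if (loaded < compiled) {
        std::printf("Loaded TBB is older than the headers.\n");
        return 1;
    }
    return 0;
}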

Michael_C_Intel1
Moderator

Hi e4lam,

Your feedback is appreciated. I suggest this forum as the most appropriate channel at this time.

The reproducer applies to the second part of the release notes section cited: specifically, the case where the host application intentionally uses features of a standalone Threading Building Blocks.

If you can cite observed behavior when building and linking your application against the TBB 2018 initial release or later, I'd like to take that to the CPU RT dev team as necessary.

Sometimes the dev teams themselves monitor these forums.

Thanks,

-MichaelC
