Intel® oneAPI Base Toolkit
Support for the core tools and libraries within the base toolkit that are used to build and deploy high-performance data-centric applications.

Two questions: segfault when selecting a device, and best usage of multiple devices

RN1
New Contributor I

Hi,

I have two questions regarding oneAPI and a multi-device strategy.

First, the segfault.

What could cause a random segfault when using two device selectors in parallel (one per thread)?

 

>>> bt
#0  0x00007ffff4c9c38b in __intel_avx_rep_memcpy () from /opt/intel/inteloneapi/ipp/latest/../../compiler/latest/linux/compiler/lib/intel64/libintlc.so.5
#1  0x00007ffff77035f4 in cl::sycl::detail::pi::initialize() () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#2  0x00007ffff7731b4d in cl::sycl::detail::platform_impl::get_platforms() () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#3  0x00007ffff77b9709 in cl::sycl::platform::get_platforms() () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#4  0x00007ffff77a349a in cl::sycl::device::get_devices(cl::sycl::info::device_type) () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#5  0x00007ffff77ab8aa in cl::sycl::device_selector::select_device() const () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#6  0x000000000041488c in cl::sycl::queue::queue (this=this@entry=0x7ffff444ec28, DeviceSelector=..., PropList=...) at /opt/intel/inteloneapi/compiler/2021.1-beta07/linux/include/sycl/CL/sycl/queue.hpp:59
#7  0x0000000000408621 in gaussian_cpu (opts=..., gaussian=0x7fffffffc0a8) at /home/user/oneapi/gaussian.cpp:671
#8  0x00000000004174cf in std::__invoke_impl<void, void (*)(bool, Options, Gaussian*), bool, Options, Gaussian*> (__f=@0x7fffec00c3e8: 0x0, __args=@0x7fffec00c350: 0x0, __args=@0x7fffec00c350: 0x0, __args=@0x7fffec00c350: 0x0) at /usr/lib64/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../include/c++/10.2.0/bits/invoke.h:60
#9  std::__invoke<void (*)(bool, Options, Gaussian*), bool, Options, Gaussian*> (__fn=@0x7fffec00c3e8: 0x0, __args=@0x7fffec00c350: 0x0, __args=@0x7fffec00c350: 0x0, __args=@0x7fffec00c350: 0x0) at /usr/lib64/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../include/c++/10.2.0/bits/invoke.h:95
#10 std::thread::_Invoker<std::tuple<void (*)(bool, Options, Gaussian*), bool, Options, Gaussian*> >::_M_invoke<0ul, 1ul, 2ul, 3ul> (this=0x7fffec00c350) at /usr/lib64/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../include/c++/10.2.0/thread:264
#11 std::thread::_Invoker<std::tuple<void (*)(bool, Options, Gaussian*), bool, Options, Gaussian*> >::operator() (this=0x7fffec00c350) at /usr/lib64/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../include/c++/10.2.0/thread:271
#12 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(bool, Options, Gaussian*), bool, Options, Gaussian*> > >::_M_run (this=0x7fffec00c348) at /usr/lib64/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../include/c++/10.2.0/thread:215
#13 0x00007ffff7a42c24 in std::execute_native_thread_routine (__p=0x50b9f0) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#14 0x00007ffff7f5f3e9 in start_thread () from /usr/lib/libpthread.so.0
#15 0x00007ffff74f6293 in clone () from /usr/lib/libc.so.6

 

 

The output is sometimes correct (the whole computation runs in parallel and produces correct results), but other times it just segfaults; the gdb backtrace above is from one of those failing runs:

 

gpu selector
cpu selector
Segmentation fault (core dumped)

 

 

I only paste the oneAPI-relevant lines here. In this example (gaussian_both), I use a pointer to `device_selector` to instantiate either the CPU or the GPU selector (the same function is called once per device, and each call instantiates one selector), and I free the pointer at the end.

 

void gaussian_both(bool cpu, Opts opts, Gaussian* gaussian){
    // using opts and gaussian

  { // oneapi scope

    device_selector *sel;

    if (cpu) {
      sel = new cpu_selector();
    } else {
      sel = new gpu_selector();
    }

    {
      std::lock_guard<std::mutex> lk(*opts.m);
      // splitting the workload
      // the first work chunk goes to the first device to enter, etc.
      // calculate the offset and size for this work chunk
    }
    cl::sycl::property_list prop_list =
        cl::sycl::property_list{cl::sycl::property::queue::enable_profiling()};
    queue q(*sel, prop_list); // <--------- SEGFAULT happens here (gaussian.cpp:671)
    if (debug) {
      if (cpu) {
        printf("CPU is: %s\n", q.get_device().get_info<sycl::info::device::name>().c_str());
      } else {
        printf("GPU is: %s\n", q.get_device().get_info<sycl::info::device::name>().c_str());
      }
    }
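    // (size, offset, cols, rows, filterWidth, debug and buf_input are defined in the omitted parts of this function)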


    auto R = sycl::range<1>(size);
    auto Rinput = sycl::range<1>(gaussian->_total_size);
    auto Rfilter = sycl::range<1>(gaussian->_filter_total_size);

    sycl::buffer<cl_float, 1> buf_filterWeight(gaussian->_b.data(), Rfilter);
    sycl::buffer<cl_uchar4, 1> buf_blurred((gaussian->_c.data() + offset), R);

    auto submit_event = q.submit([&](handler &h) {
        auto input = buf_input.get_access<sycl::access::mode::read>(h);
        auto filterWeight = buf_filterWeight.get_access<sycl::access::mode::read>(h);
        auto blurred = buf_blurred.get_access<sycl::access::mode::write>(h);

        h.parallel_for(R, [=](item<1> it) {
          auto tid = it.get_linear_id();

          int r = tid / cols; // current row
          int c = tid % cols; // current column
          int idx2tid = r * cols + c; // current pixel index

          int middle = filterWidth / 2;
          float blurX = 0.f; // will contain the blurred value
          float blurY = 0.f; // will contain the blurred value
          float blurZ = 0.f; // will contain the blurred value
          int width = cols - 1;
          int height = rows - 1;

          for (int i = -middle; i <= middle; ++i) // rows
          {
            for (int j = -middle; j <= middle; ++j) // columns
            {
              int h = r + i;
              int w = c + j;

              if (h > height || h < 0 || w > width || w < 0) {
                continue;
              }
              int idx = w + cols * h; // current pixel index
              float pixelX = (input[idx].s[0]);
              float pixelY = (input[idx].s[1]);
              float pixelZ = (input[idx].s[2]);

              idx = (i + middle) * filterWidth + j + middle;
              float weight = filterWeight[idx];

              blurX += pixelX * weight;
              blurY += pixelY * weight;
              blurZ += pixelZ * weight;
            }
          }
          blurred[tid].s[0] = (unsigned char) cl::sycl::round(blurX);
          blurred[tid].s[1] = (unsigned char) cl::sycl::round(blurY);
          blurred[tid].s[2] = (unsigned char) cl::sycl::round(blurZ);
        });
      });

    delete sel;  // free the device selector (new)
  } // oneapi scope

}

int main(){
  // ...
  if (use_cpu_and_gpu){
    std::thread t1(gaussian_both, true, opts, &gaussian);
    gaussian_both(false, opts, &gaussian);
    t1.join();
  }
}

 

 

 

// Execution failing in the gpu case:

>>> bt
#0  0x00007ffff4c9c3a6 in __intel_avx_rep_memcpy () from /opt/intel/inteloneapi/ipp/latest/../../compiler/latest/linux/compiler/lib/intel64/libintlc.so.5
#1  0x00007ffff7703654 in cl::sycl::detail::pi::initialize() () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#2  0x00007ffff7731b4d in cl::sycl::detail::platform_impl::get_platforms() () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#3  0x00007ffff77b9709 in cl::sycl::platform::get_platforms() () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#4  0x00007ffff77a349a in cl::sycl::device::get_devices(cl::sycl::info::device_type) () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#5  0x00007ffff77ab8aa in cl::sycl::device_selector::select_device() const () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#6  0x000000000041468c in cl::sycl::queue::queue (this=this@entry=0x7fffffffbeb0, DeviceSelector=..., PropList=...) at /opt/intel/inteloneapi/compiler/2021.1-beta07/linux/include/sycl/CL/sycl/queue.hpp:59
#7  0x0000000000409720 in do_gaussian_both (cpu=cpu@entry=false, opts=..., gaussian=gaussian@entry=0x7fffffffc0a8) at /home/user/oneapi/gaussian.cpp:988
#8  0x000000000040eace in process_gaussian (cpu=false, opts=..., gaussian=0x7fffffffc0a8) at /home/user/oneapi/gaussian.cpp:1748
#9  main (argc=<optimized out>, argv=<optimized out>) at /home/user/oneapi/gaussian.cpp:1940

// Execution failing in the cpu case:

>>> bt
#0  0x00007ffff4c9c38b in __intel_avx_rep_memcpy () from /opt/intel/inteloneapi/ipp/latest/../../compiler/latest/linux/compiler/lib/intel64/libintlc.so.5
#1  0x00007ffff77035f4 in cl::sycl::detail::pi::initialize() () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#2  0x00007ffff7731b4d in cl::sycl::detail::platform_impl::get_platforms() () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#3  0x00007ffff77b9709 in cl::sycl::platform::get_platforms() () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#4  0x00007ffff77a349a in cl::sycl::device::get_devices(cl::sycl::info::device_type) () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#5  0x00007ffff77ab8aa in cl::sycl::device_selector::select_device() const () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#6  0x000000000041468c in cl::sycl::queue::queue (this=this@entry=0x7ffff444eca0, DeviceSelector=..., PropList=...) at /opt/intel/inteloneapi/compiler/2021.1-beta07/linux/include/sycl/CL/sycl/queue.hpp:59
#7  0x0000000000409720 in do_gaussian_both (cpu=true, opts=..., gaussian=0x7fffffffc0a8) at /home/user/oneapi/gaussian.cpp:988
#8  0x00000000004172cf in std::__invoke_impl<void, void (*)(bool, Options, Gaussian*), bool, Options, Gaussian*> (__f=@0x7fffec00c408: 0x0, __args=@0x7fffec00c370: 0x0, __args=@0x7fffec00c370: 0x0, __args=@0x7fffec00c370: 0x0) at /usr/lib64/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../include/c++/10.2.0/bits/invoke.h:60
#9  std::__invoke<void (*)(bool, Options, Gaussian*), bool, Options, Gaussian*> (__fn=@0x7fffec00c408: 0x0, __args=@0x7fffec00c370: 0x0, __args=@0x7fffec00c370: 0x0, __args=@0x7fffec00c370: 0x0) at /usr/lib64/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../include/c++/10.2.0/bits/invoke.h:95
#10 std::thread::_Invoker<std::tuple<void (*)(bool, Options, Gaussian*), bool, Options, Gaussian*> >::_M_invoke<0ul, 1ul, 2ul, 3ul> (this=0x7fffec00c370) at /usr/lib64/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../include/c++/10.2.0/thread:264
#11 std::thread::_Invoker<std::tuple<void (*)(bool, Options, Gaussian*), bool, Options, Gaussian*> >::operator() (this=0x7fffec00c370) at /usr/lib64/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../include/c++/10.2.0/thread:271
#12 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(bool, Options, Gaussian*), bool, Options, Gaussian*> > >::_M_run (this=0x7fffec00c368) at /usr/lib64/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../include/c++/10.2.0/thread:215
#13 0x00007ffff7a42c24 in std::execute_native_thread_routine (__p=0x50b9f0) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#14 0x00007ffff7f5f3e9 in start_thread () from /usr/lib/libpthread.so.0
#15 0x00007ffff74f6293 in clone () from /usr/lib/libc.so.6

 

The location gaussian.cpp:988 contains the `queue q(*sel)` constructor call.

I also tried without a prop_list, just `queue q(*sel);`, and the segfault still happens.

 

I also tried with two independent functions, each with its own specific selector. It gives the same segfault. For example:

 

void do_gaussian_cpu(bool cpu, Opts opts, Gaussian* gaussian){
  // using opts and gaussian

  { // oneapi scope
    cpu_selector sel;

    {
      std::lock_guard<std::mutex> lk(*opts.m);
      // splitting the workload
      // the first work chunk goes to the first device to enter, etc.
      // calculate the offset and size for this work chunk
    }

    queue q(sel);

    auto R = sycl::range<1>(size);
    auto Rinput = sycl::range<1>(gaussian->_total_size);
    auto Rfilter = sycl::range<1>(gaussian->_filter_total_size);

    sycl::buffer<cl_float, 1> buf_filterWeight(gaussian->_b.data(), Rfilter);
    sycl::buffer<cl_uchar4, 1> buf_blurred((gaussian->_c.data() + offset), R);

    auto submit_event = q.submit([&](handler &h) {
                                   // ... the same as before ...
                                 });

  }
}
void do_gaussian_gpu(bool cpu, Opts opts, Gaussian* gaussian){
  // using opts and gaussian

  { // oneapi scope
    gpu_selector sel;

    {
      std::lock_guard<std::mutex> lk(*opts.m);
      // splitting the workload
      // the first work chunk goes to the first device to enter, etc.
      // calculate the offset and size for this work chunk
    }

    queue q(sel);

    auto R = sycl::range<1>(size);
    auto Rinput = sycl::range<1>(gaussian->_total_size);
    auto Rfilter = sycl::range<1>(gaussian->_filter_total_size);

    sycl::buffer<cl_float, 1> buf_filterWeight(gaussian->_b.data(), Rfilter);
    sycl::buffer<cl_uchar4, 1> buf_blurred((gaussian->_c.data() + offset), R);

    auto submit_event = q.submit([&](handler &h) {
                                   // ... the same as before ...
                                 });

  }
}

int main(){
  // using two independent funcs, one for cpu, the other for gpu
  if (use_cpu_and_gpu){
    std::thread t1(do_gaussian_cpu, true, opts, &gaussian);
    do_gaussian_gpu(false, opts, &gaussian);
    t1.join();
  }
}

 

 

In this case, I sometimes get the segfault at the `queue q(sel);` of the GPU function, and other times in the CPU function. So, the same behavior, just spread across two different functions.

 

It only happens when I use both devices (`use_cpu_and_gpu`), not when I use only the cpu or only the gpu.

 

My second question is: is the way I use oneAPI to compute parts of the problem dynamically, in parallel on two devices, appropriate? I am looking for the best oneAPI strategy to achieve good performance.

My strategy consists of using std::threads, one per device (each device gets its own queue, buffers and kernel), running in parallel and synchronizing with a std::mutex (to split the work).

I did some tests and, apparently, my strategy is almost as good as using a single function with everything inside (with the split defined at compile time), but I don't know how to achieve this dynamic behavior using the intended oneAPI approach. That's why I am asking.

It is important for me to know the best way to do this with oneAPI, since I am seeing considerable overheads when running larger problems. I want to be able to hand out small chunks of work per device dynamically, so my thread-based strategy seems correct, but I would like confirmation from Intel/oneAPI.
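To make the question concrete, here is a minimal sketch of the kind of dynamic split I have in mind (it is not my real gaussian code; names like `worker`, `next_offset` and `chunk` are only illustrative): one std::thread per device, where each thread repeatedly takes the next chunk from a shared offset protected by the mutex.

#include <CL/sycl.hpp>
#include <algorithm>
#include <mutex>
#include <thread>
#include <vector>

// One worker per device: reserve the next chunk under the mutex, process it, repeat.
void worker(sycl::queue q, std::vector<float> &data,
            std::size_t &next_offset, std::mutex &m, std::size_t chunk) {
  for (;;) {
    std::size_t begin, end;
    {
      std::lock_guard<std::mutex> lk(m);        // reserve the next chunk
      if (next_offset >= data.size()) return;   // nothing left to do
      begin = next_offset;
      end = std::min(begin + chunk, data.size());
      next_offset = end;
    }
    // Buffer over just this chunk; its destruction at the end of the iteration
    // writes the results back into the host vector.
    sycl::buffer<float, 1> buf(data.data() + begin, sycl::range<1>(end - begin));
    q.submit([&](sycl::handler &h) {
      auto a = buf.get_access<sycl::access::mode::read_write>(h);
      h.parallel_for<class chunk_kernel>(sycl::range<1>(end - begin),
                                         [=](sycl::id<1> i) { a[i] *= 2.0f; }); // placeholder kernel
    });
    q.wait(); // finish this chunk before asking for the next one
  }
}

int main() {
  std::vector<float> data(1 << 20, 1.0f);
  std::size_t next_offset = 0;
  std::mutex m;
  const std::size_t chunk = 1 << 16;

  sycl::queue cpuQ(sycl::cpu_selector{});
  sycl::queue gpuQ(sycl::gpu_selector{});

  std::thread t(worker, cpuQ, std::ref(data), std::ref(next_offset), std::ref(m), chunk);
  worker(gpuQ, data, next_offset, m, chunk);
  t.join();
}

With this pattern the split is decided at runtime (whichever device finishes a chunk simply takes the next one). I don't know whether this is the intended oneAPI way, or whether the per-chunk buffer creation is part of the overhead I describe below.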

 

#include <CL/sycl.hpp>
#include <iostream>
#include <iomanip>
#include <cmath>     // std::sin, std::cos (verify) and std::fabs (may_verify)
#include <ctime>
#include <sys/time.h>
#include <vector>
#include <cstdlib>

using namespace std;

typedef float op_type;

void verify(vector<op_type>& v1, int N){
  for (int i = 0; i<N; ++i) {
    for (int ii = 1; ii < 10000; ii++) {
      op_type tanval = (std::sin(v1[i]) * ii) / (std::cos(v1[i]) * ii);
      op_type secval = 1.0 / std::cos(v1[i]);
      v1[i] = (secval * secval) - (tanval * tanval);
    }
  }
}

void may_verify(vector<op_type>& v1, int N){
  char *check_str = getenv("CHECK");
  if (check_str != nullptr && std::string(check_str) == "y") {
    std::vector<op_type> validate_vector(N);
    for (int i = 0; i < N; i++) {
      validate_vector[i] = static_cast<op_type>(i);
    }
    // std::vector<op_type> validate_vector(N, N);
    verify(validate_vector, N);
    int show = 5;
    bool wrong = false;
    for (int i = 0; i<show; ++i){
      if (std::fabs(v1[i] - validate_vector[i]) > 0.001){
        std::cout << " " << i << " v1 " << v1[i] << " != validate " << validate_vector[i] << "\n";
        wrong = true;
        break;
      }
    }
    if (wrong){
      // validate_vector == v1 ? std::cout << "Vector addition: Success\n" : std::cout << "Vector addition: Failure\n";
      std::cout << "Vector addition: Failure\n";
    } else {
      std::cout << "Vector addition: Success\n";
    }
  }
}

//#include <cstdlib>
//#include "sycl_exceptions.hpp"

int main(int argc, char *argv[]) {
  if (argc < 2 || argc > 3) {
    std::cout << "usage: <size> <gpu proportion>\n";
    return 1;
  }
  float size = argc > 1 ? atoi(argv[1]) : (1024 * 100);
  float gpuProp = argc > 2 ? atof(argv[2]) : 0.5;

  const int N = size; // Originally was 1024 * 1000
  int Ngpu = N * gpuProp;
  int Ncpu = N - Ngpu;
  auto Rcpu = sycl::range<1>(Ncpu);
  auto Rgpu = sycl::range<1>(Ngpu);
  std::vector<op_type> v1(N);
//  std::vector<float> v2(N);
  for (int i = 0; i < N; i++) {
    v1[i] = static_cast<op_type>(i);
//    v2[i] = i;
  }
//struct timeval start, end, start1, end1;
  struct timeval start, compute, end;
  gettimeofday(&start, NULL);
  ios_base::sync_with_stdio(false);
//sycl::queue cpuQ(sycl::cpu_selector{},exception_handler);
  sycl::queue cpuQ(sycl::cpu_selector{});
//sycl::queue gpuQ(sycl::gpu_selector{},exception_handler);
  sycl::queue gpuQ(sycl::gpu_selector{});
  {
    //Splitting input data into 2 halves for CPU and iGPU
    sycl::buffer<op_type, 1> bufcpuv1(v1.data(), Rcpu);
    sycl::buffer<op_type, 1> bufgpuv1((v1.data() + (Ncpu)), Rgpu);
    std::cout << "Running on: " << gpuQ.get_device().get_info<sycl::info::device::name>() << " (size: " << Ngpu
              << ") and "
              << cpuQ.get_device().get_info<sycl::info::device::name>() << " (size: " << Ncpu << ")\n";
    gettimeofday(&compute, NULL);
    ios_base::sync_with_stdio(false);
    gpuQ.submit([&](sycl::handler &h) {
      auto agpuv1 = bufgpuv1.get_access<sycl::access::mode::read_write>(h);
      //auto agpuv2 = bufgpuv2.get_access<sycl::access::mode::read_write>(h);
      h.parallel_for(Rgpu, [=](sycl::id<1> i) {
//        agpuv1[i]+=agpuv2[i];
        for (int ii = 1; ii < 10000; ii++) {
          op_type tanval = (sycl::sin(agpuv1[i]) * ii) / (sycl::cos(agpuv1[i]) * ii);
          op_type secval = 1.0 / sycl::cos(agpuv1[i]);
          agpuv1[i] = (secval * secval) - (tanval * tanval);
        }
      });
    });
    cpuQ.submit([&](sycl::handler &h) {
      auto acpuv1 = bufcpuv1.get_access<sycl::access::mode::read_write>(h);
      //auto acpuv2 = bufcpuv2.get_access<sycl::access::mode::read_write>(h);
      h.parallel_for(Rcpu, [=](sycl::id<1> i) {
        //acpuv1[i]+=acpuv2[i];
        for (int ii = 1; ii < 10000; ii++) {
          op_type tanval = (sycl::sin(acpuv1[i]) * ii) / (sycl::cos(acpuv1[i]) * ii);
          op_type secval = 1.0 / sycl::cos(acpuv1[i]);
          acpuv1[i] = (secval * secval) - (tanval * tanval);
        }
      });
    });
  }
  gettimeofday(&end, NULL);
  double time_taken;
  time_taken = (end.tv_sec - start.tv_sec) * 1e6;
  time_taken = (time_taken + (end.tv_usec - start.tv_usec)) * 1e-6;
  cout << "Time taken by queue is : " << fixed << time_taken << setprecision(6) << " sec " << "\n";
  time_taken = (end.tv_sec - compute.tv_sec) * 1e6;
  time_taken = (time_taken + (end.tv_usec - compute.tv_usec)) * 1e-6;
  cout << "Time taken by kernel is : " << fixed << time_taken << setprecision(6) << " sec " << "\n";
  std::cout << "Sample values on GPU and CPU\n";

  may_verify(v1, N);
  return 0;
}
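For reference, I compile and run this sample with something like `dpcpp conc_chk.cpp -o conc_chk` followed by `./conc_chk 102400 0.95` (the two arguments are the problem size and the GPU proportion; setting the environment variable CHECK=y enables the result verification). The file name conc_chk.cpp is just what I use locally.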

 

I paste above the conc_chk sample ("2pointers prop" variant; static strategy: one oneAPI scope for two devices). Done this way I cannot use dynamic strategies, but at least it shows that even with a static workload split the performance is not good enough (too much overhead when using two devices).

 

conc_chk with 1024 problem size.

- only in CPU: 1.15s

- only in GPU: 0.96s

- parallel CPU and GPU managed by 1 OneAPI scope (60% GPU, 40% CPU): 1.89s

- parallel CPU and GPU managed by 1 OneAPI scope (90% GPU, 10% CPU): 1.80s

- parallel CPU and GPU managed by 1 OneAPI scope (10% GPU, 90% CPU): 2.10s

- parallel CPU and GPU managed by 2 OneAPI scopes 2 threads (60% GPU, 40% CPU): 2.04s

- parallel CPU and GPU managed by 2 OneAPI scopes 2 threads (90% GPU, 10% CPU): 1.85s

- parallel CPU and GPU managed by 2 OneAPI scopes 2 threads (10% GPU, 90% CPU): 2.12s

As we can see, the minimums are quite near (1.80s vs 1.85s).

 

with 102400 as problem size:

- only in CPU: 27.27s

- only in GPU: 2.09s

- parallel CPU and GPU managed by 1 OneAPI scope (80% GPU, 20% CPU): 7.23s

- parallel CPU and GPU managed by 1 OneAPI scope (95% GPU, 5% CPU): 3.41s

- parallel CPU and GPU managed by 2 OneAPI scopes 2 threads (80% GPU, 20% CPU): 7.64s

- parallel CPU and GPU managed by 2 OneAPI scopes 2 threads (95% GPU, 5% CPU): 3.90s

You can see more overhead with 2 scopes, but it is the only way I know to hand out chunks dynamically at runtime.

 

It only starts to approach the break-even point when the problem size reaches at least 614400, giving:

- only in GPU: 8.59s

- GPU (98%) and CPU (2%) in 1 scope: 8.66s

- GPU (98%) and CPU (2%) in 2 scopes: 8.62s

But even at a problem size of 1024000, using both devices struggles to be worthwhile:

- only in GPU: 13.80s

- GPU (98%) and CPU (2%) in 1 scope: 13.77s <--- Finally won!

- GPU (98%) and CPU (2%) in 2 scopes: 13.85s <--- Not yet


More experiments with different problems:


Let me explain with the gaussian example, on my i5, with the simplest strategy, i.e. splitting into two chunks:

- CPU computes everything: 9.1s    (cpu bandwidth 492 units/ms)

- GPU computes everything: 1.8s    (gpu bandwidth 2861 units/ms)

- If I give 5% of the workload to the CPU and 95% to the GPU, the problem is computed in 3.8s.

  Looking at the internal times: the CPU needs 3.8s (bandwidth 71 u/ms), the GPU 3.3s (bandwidth 2749 u/ms).

So, with this parallel execution it seems oneAPI suffers high overheads, reducing the CPU bandwidth drastically (from 492 to 71 u/ms), on top of the general management cost.

I tried many different workload splits (3%, 8%, 10%, ...) and it is even worse.

I can only see an advantage from parallel execution if I increase the problem size considerably (runs of 10-15 s), and not always even then. It seems something is still missing.

 

6 Replies
RahulV_intel
Moderator

Hi,

 

Instead of using `device_selector *sel`, I'd recommend declaring `sycl::queue q;` and then, based on the flag (`bool cpu`), assigning it inside the if/else condition as `q = sycl::queue(sycl::cpu_selector{})` or `q = sycl::queue(sycl::gpu_selector{})`.

 

Code snippet:

 

    sycl::queue q;    // Instead of device_selector *sel;

    if (cpu) {
      q = sycl::queue(sycl::cpu_selector{});
    } else {
      q = sycl::queue(sycl::gpu_selector{});
    }

    {
      std::lock_guard<std::mutex> lk(*opts.m);
      // splitting the workload
      // the first work chunk goes to the first device to enter, etc.
      // calculate the offset and size for this work chunk
    }

    cl::sycl::property_list prop_list =
        cl::sycl::property_list{cl::sycl::property::queue::enable_profiling()};

    if (debug) {
      if (cpu) {
        printf("CPU is: %s\n", q.get_device().get_info<sycl::info::device::name>().c_str());
      } else {
        printf("GPU is: %s\n", q.get_device().get_info<sycl::info::device::name>().c_str());
      }
    }
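(If the profiling property is still needed, the same property list can also be passed directly when the queue is constructed, e.g. `q = sycl::queue(sycl::cpu_selector{}, prop_list);`, with prop_list declared before the if/else.)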

 

 

Also refer to the attached code (vec-add sample), in which I've used a single function to perform vector addition that runs on the CPU, the GPU, or both. I've run it close to 20-30 times and I have not seen any segfault (for the code sample that I've attached).

 

Coming to your second question, multi-device concurrent execution only makes sense if your input data size is large enough and the kernel is compute-intensive, so that the gains outweigh the overheads incurred.

 

Thanks,

Rahul

RN1
New Contributor I

Hi, Rahul.

Perfect, this apparently solves issue 1. I have run around 20 executions and none failed. Interesting.

Now, the most important one: issue 2.

 

RN1
New Contributor I



"Coming to your second question, multi-device concurrent execution will only make sense if your input data size is extremely large enough and the kernel is compute intensive, so that it alleviates the cost of overheads incurred."

Thank you. So I assume that is the expected behavior with oneAPI at the moment.
I can see that oneAPI struggles with buffer movements. How can I reduce the buffer overheads (or other overheads) that oneAPI incurs?

Should I use a different strategy for multi-device, instead of, e.g.:

sycl::buffer<op_type, 1> buf1(v1.data(), Rcpu);
auto agpuv1 = bufgpuv1.get_access<sycl::access::mode::read_write>(h);

In problems like gaussian blur or matrix multiplication, I can see the overheads grow with the problem size, compared with raw OpenCL, which tolerates multi-device parallel execution without much overhead.

RahulV_intel
Moderator

Hi,


An alternative approach to buffers would be to use USM in your code. Data transfers between host and device (in either direction) can be accomplished either via buffers/accessors or via USM. You may refer to the oneAPI programming guide for more details on USM.


Link: https://software.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html
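As a rough illustration (a simplified sketch, not based on your gaussian code, and assuming the device and toolchain support USM shared allocations), USM code looks like this:

#include <CL/sycl.hpp>

int main() {
  sycl::queue q(sycl::gpu_selector{});
  const size_t N = 1024;

  // Shared allocation: visible to both host and device (assumes USM support).
  float *data = sycl::malloc_shared<float>(N, q);
  for (size_t i = 0; i < N; ++i)
    data[i] = static_cast<float>(i);

  q.submit([&](sycl::handler &h) {
    h.parallel_for<class usm_scale>(sycl::range<1>(N), [=](sycl::id<1> i) {
      data[i] *= 2.0f;   // the kernel reads/writes the shared pointer directly
    });
  });
  q.wait();              // explicit synchronization instead of buffer destruction

  sycl::free(data, q);
  return 0;
}

The programming guide linked above describes the different USM allocation types (host, device, shared) and their synchronization requirements in more detail.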


Regarding concurrent execution on USM, I will get back to you on the other thread that you have posted. ( https://community.intel.com/t5/Intel-oneAPI-Base-Toolkit/Partition-data-USM/m-p/1216294#M623)



Thanks,

Rahul



RahulV_intel
Moderator

Hi,

 

I have posted the answer on concurrent USM execution in the other thread ( https://community.intel.com/t5/Intel-oneAPI-Base-Toolkit/Partition-data-USM/m-p/1216294#M623).

 

Let me know if I can close this thread from my end.

 

Thanks,

Rahul

 

RahulV_intel
Moderator

Hi,


We are closing this thread assuming that the solution provided has helped. Intel will no longer monitor this thread. Any further replies on this thread will be considered community only.



Thanks,

Rahul

