Intel® oneAPI Base Toolkit
Support for the core tools and libraries within the base toolkit that are used to build and deploy high-performance data-centric applications.

Partition data USM

RN1
New Contributor I

Hi,

Can USM share a buffer between two devices (CPU and GPU)?

I would like to avoid data transfers and instead have two independent devices compute on regions of the same buffer.

As I understand it, the example here https://software.intel.com/content/www/us/en/develop/articles/onemkl-random-number-generators-api-usage-example-for-monte-carlo-simulation.html uses two independent buffers (x for the GPU with USM, y for the CPU).

I am trying to find out how I can use USM (or some other efficient zero-copy mechanism) with the same buffer across two queues/devices.

What is the approach?


Edit: I tried different things in a basic example, and I get wrong computed results when partitioning the work between the CPU and GPU:

{
  // ...
  const int N = size; // Originally was 1024 * 1000
  int Ngpu = N * gpuProp;
  int Ncpu = N - Ngpu;
  auto Rcpu = sycl::range<1>(Ncpu);
  auto Rgpu = sycl::range<1>(Ngpu);

  sycl::queue cpuQ(sycl::host_selector{});
  sycl::queue gpuQ(sycl::gpu_selector{});


  auto v1_ptr = (op_type*)sycl::malloc_shared(N * sizeof(op_type), gpuQ);
  for (int i = 0; i < N; i++) {
    v1_ptr[i] = static_cast<op_type>(i);
  }



  ios_base::sync_with_stdio(false);

  {
    std::cout << "Running on: " << gpuQ.get_device().get_info<sycl::info::device::name>() << " (size: " << Ngpu
              << ") and "
              << cpuQ.get_device().get_info<sycl::info::device::name>() << " (size: " << Ncpu << ")\n";
    gettimeofday(&compute, NULL);
    ios_base::sync_with_stdio(false);
    auto ev1 = gpuQ.submit([&](sycl::handler &h) {
      h.parallel_for(Rgpu, [=](sycl::id<1> i) {
        auto idx = i.get(0) + Ncpu;
        {
          static const CONSTANT char FMT[] = "GPU [%d] r\n";
          sycl::intel::experimental::printf(FMT, idx);
        }
        
        for (int ii = 1; ii < 10000; ii++) {
          op_type tanval = (sycl::sin(v1_ptr[idx]) * ii) / (sycl::cos(v1_ptr[idx]) * ii);
          op_type secval = 1.0 / sycl::cos(v1_ptr[idx]);
          v1_ptr[idx] = (secval * secval) - (tanval * tanval);
        }
      });
    });
    auto ev2 = cpuQ.submit([&](sycl::handler &h) {
      h.parallel_for(Rcpu, [=](sycl::id<1> i) {
        auto idx = i.get(0);
        {
          static const CONSTANT char FMT[] = "CPU [%d] r\n";
          sycl::intel::experimental::printf(FMT, idx);
        }

        for (int ii = 1; ii < 10000; ii++) {
          op_type tanval = (sycl::sin(v1_ptr[idx]) * ii) / (sycl::cos(v1_ptr[idx]) * ii);
          op_type secval = 1.0 / sycl::cos(v1_ptr[idx]);
          v1_ptr[idx] = (secval * secval) - (tanval * tanval);
        }
      });
    });
    ev1.wait();
    ev2.wait();
  }
  // ...
}

I ran basic examples with 10 values, giving 50% to each device, and I get wrong values at some positions (not all). Here I am using the same pointer directly from both queues, which is what I am asking about in this post.

 

8 Replies
RahulV_intel
Moderator

Hi,

 

Do the wrong values you get at some positions change with every run?

 

Kindly try the following scenarios and let me know:

  • Instead of calling wait() at the end, call wait() individually right after each submit (even though this removes concurrency); I just want to check whether that gives correct results.
  • Try splitting v1_ptr into separate pointers for the CPU and iGPU, and check whether the results are right in that scenario. (I understand this is not what you are looking for, but please try it once.)

 

It would be really helpful if you could attach the complete source code so that I can try it out on my end.

 

 

Thanks,

Rahul

 

RN1
New Contributor I

Hello, Rahul!

First question: The same program can produce different results on different executions:

$ CHECK=y ./build/conc_chk_2pointers_prop_usm 8
CPU [0] r
CPU [1] r
CPU [2] r
CPU [3] r
GPU [4] r
GPU [5] r
GPU [6] r
GPU [7] r
Running on: Intel(R) Gen9 (size: 4) and SYCL host device (size: 4)
Time taken by queue is : 0.449407 sec
Time taken by kernel is : 0.449383 sec
Sample values on GPU and CPU
 [2] v1: 2.000000 != validate vector: 1.000000
 [3] v1: 3.000000 != validate vector: 1.000000
Vector addition: Failure

$ CHECK=y ./build/conc_chk_2pointers_prop_usm 8
CPU [0] r
CPU [1] r
CPU [2] r
CPU [3] r
GPU [4] r
GPU [5] r
GPU [6] r
GPU [7] r
Running on: Intel(R) Gen9 (size: 4) and SYCL host device (size: 4)
Time taken by queue is : 0.444678 sec
Time taken by kernel is : 0.444659 sec
Sample values on GPU and CPU
 [0] v1: 0.000000 != validate vector: 1.000000
 [2] v1: 2.000000 != validate vector: 1.000000
 [3] v1: 3.000000 != validate vector: 1.000000
Vector addition: Failure

 

Item 1: With two wait() calls, one after each queue submit, I get correct results, like so (as you say, not in parallel):

 

  {
    std::cout << "Running on: " << gpuQ.get_device().get_info<sycl::info::device::name>() << " (size: " << Ngpu
              << ") and "
              << cpuQ.get_device().get_info<sycl::info::device::name>() << " (size: " << Ncpu << ")\n";
    gettimeofday(&compute, NULL);
    ios_base::sync_with_stdio(false);

    auto ev1 = gpuQ.submit([&](sycl::handler &h) {
      h.parallel_for(Rgpu, [=](sycl::id<1> i) {
        auto idx = i.get(0) + Ncpu;
        {
          static const CONSTANT char FMT[] = "GPU [%d] r\n";
          sycl::intel::experimental::printf(FMT, idx);
        }
        for (int ii = 1; ii < 10000; ii++) {
          op_type tanval = (sycl::sin(v1_ptr[idx]) * ii) / (sycl::cos(v1_ptr[idx]) * ii);
          op_type secval = 1.0 / sycl::cos(v1_ptr[idx]);
          v1_ptr[idx] = (secval * secval) - (tanval * tanval);
        }
      });
    });
    ev1.wait();
    may_verify(v1_ptr, N); // wrong

    auto ev2 = cpuQ.submit([&](sycl::handler &h) {
      h.parallel_for(Rcpu, [=](sycl::id<1> i) {
        auto idx = i.get(0);
        {
          static const CONSTANT char FMT[] = "CPU [%d] r\n";
          sycl::intel::experimental::printf(FMT, idx);
        }

        for (int ii = 1; ii < 10000; ii++) {
          op_type tanval = (sycl::sin(v1_ptr[idx]) * ii) / (sycl::cos(v1_ptr[idx]) * ii);
          op_type secval = 1.0 / sycl::cos(v1_ptr[idx]);
          v1_ptr[idx] = (secval * secval) - (tanval * tanval);
        }
      });
    });
    ev2.wait();
    may_verify(v1_ptr, N); // ok
  }

 

Item 2: Doing this, I get wrong results:

 

  auto v1_ptr = (op_type*)sycl::malloc_shared(N * sizeof(op_type), gpuQ);
  auto vcpu_ptr = (op_type*)v1_ptr;
  auto vgpu_ptr = (op_type*)(v1_ptr + Ncpu);
  printf("cpu_ptr: %p\n", vcpu_ptr);
  printf("gpu_ptr: %p\n", vgpu_ptr);
  printf("ncpu: %d\n", Ncpu);
  printf("ngpu: %d\n", Ngpu);
  for (int i = 0; i < N; i++) {
    v1_ptr[i] = static_cast<op_type>(i);
  }
  struct timeval start, compute, end;
  gettimeofday(&start, NULL);
  ios_base::sync_with_stdio(false);
  {
    std::cout << "Running on: " << gpuQ.get_device().get_info<sycl::info::device::name>() << " (size: " << Ngpu
              << ") and "
              << cpuQ.get_device().get_info<sycl::info::device::name>() << " (size: " << Ncpu << ")\n";
    gettimeofday(&compute, NULL);
    ios_base::sync_with_stdio(false);

    auto ev1 = gpuQ.submit([&](sycl::handler &h) {
      h.parallel_for(Rgpu, [=](sycl::id<1> i) {
        auto idx = i.get(0); // + Ncpu;
        {
          static const CONSTANT char FMT[] = "GPU [%d] r\n";
          sycl::intel::experimental::printf(FMT, idx);
        }
        for (int ii = 1; ii < 10000; ii++) {
          op_type tanval = (sycl::sin(vgpu_ptr[idx]) * ii) / (sycl::cos(vgpu_ptr[idx]) * ii);
          op_type secval = 1.0 / sycl::cos(vgpu_ptr[idx]);
          vgpu_ptr[idx] = (secval * secval) - (tanval * tanval);
        }
      });
    });
    ev1.wait();
    // may_verify(v1_ptr, N); // enable to be correct

    auto ev2 = cpuQ.submit([&](sycl::handler &h) {
      h.parallel_for(Rcpu, [=](sycl::id<1> i) {
        auto idx = i.get(0);
        {
          static const CONSTANT char FMT[] = "CPU [%d] r\n";
          sycl::intel::experimental::printf(FMT, idx);
        }

        for (int ii = 1; ii < 10000; ii++) {
          op_type tanval = (sycl::sin(vcpu_ptr[idx]) * ii) / (sycl::cos(vcpu_ptr[idx]) * ii);
          op_type secval = 1.0 / sycl::cos(vcpu_ptr[idx]);
          vcpu_ptr[idx] = (secval * secval) - (tanval * tanval);
        }
      });
    });
    ev2.wait();
    // may_verify(v1_ptr, N); // ok
  }

 

What is weird is that if I verify right after the first ev1.wait(), I get good results.

So it seems the verification fetches the results to the host side, and they then become visible to the CPU queue.

But if I omit may_verify, the results are wrong (the GPU values are not correct).

On the other hand, if I use the pattern

gpuQ.submit()
cpuQ.submit()
ev1.wait()
ev2.wait()

the results are wrong no matter what.

I paste the whole code for this example below; you can change as much of it as you want.

My purpose is to learn Intel's most efficient way to do a data-parallel computation with oneAPI using the CPU and GPU at the same time. This is really important, since I want to compare against other technologies and publish the results. You can modify this example:

#include <CL/sycl.hpp>
#include <iostream>
#include <iomanip>
#include <ctime>
#include <sys/time.h>

#include <vector>
#include <cstdlib>

using namespace std;

#ifdef __SYCL_DEVICE_ONLY__
#define CONSTANT __attribute__((opencl_constant))
#else
#define CONSTANT
#endif

typedef float op_type;

void verify(op_type* v1, int N){
  for (int i = 0; i<N; ++i) {
    for (int ii = 1; ii < 10000; ii++) {
      op_type tanval = (std::sin(v1[i]) * ii) / (std::cos(v1[i]) * ii);
      op_type secval = 1.0 / std::cos(v1[i]);
      v1[i] = (secval * secval) - (tanval * tanval);
    }
  }
}
void verify(std::vector<op_type>& v1, int N){
  verify(v1.data(), N);
}

void may_verify(op_type* v1, int N){
  char *check_str = getenv("CHECK");
  if (check_str != nullptr && std::string(check_str) == "y") {
    std::vector<op_type> validate_vector(N);
    for (int i = 0; i < N; i++) {
      validate_vector[i] = static_cast<op_type>(i);
    }
    // std::vector<op_type> validate_vector(N, N);
    verify(validate_vector, N);
    int show = 5;
    int total_errors = 0;
    bool wrong = false;
    for (int i = 0; i<N; ++i){
      if (abs(v1[i] - validate_vector[i]) > 0.001){
        wrong = true;
        if (show > 0){
          std::cout << " [" << i << "] v1: " << v1[i] << " != validate vector: " << validate_vector[i] << "\n";
        }
        total_errors++;
        show--;
      }
    }
    if (wrong){
      // validate_vector == v1 ? std::cout << "Vector addition: Success\n" : std::cout << "Vector addition: Failure\n";
      std::cout << "Total errors: " << total_errors << "\n";
      std::cout << "Vector addition: Failure\n";
    } else {
      std::cout << "Vector addition: Success\n";
    }
  }
}

void may_verify(std::vector<op_type>& v1, int N){
  may_verify(v1.data(), N);
}


int main(int argc, char *argv[]) {
  if (argc < 2 || argc > 3) {
    std::cout << "usage: <size> <gpu proportion>\n";
    return 1;
  }
  int size = argc > 1 ? atoi(argv[1]) : (1024 * 100);
  float gpuProp = argc > 2 ? atof(argv[2]) : 0.5;

  const int N = size; // Originally was 1024 * 1000
  int Ngpu = N * gpuProp;
  int Ncpu = N - Ngpu;
  auto Rcpu = sycl::range<1>(Ncpu);
  auto Rgpu = sycl::range<1>(Ngpu);

  sycl::queue cpuQ(sycl::host_selector{});
  sycl::queue gpuQ(sycl::gpu_selector{});

  auto v1_ptr = (op_type*)sycl::malloc_shared(N * sizeof(op_type), gpuQ);
  auto vcpu_ptr = (op_type*)v1_ptr;
  auto vgpu_ptr = (op_type*)(v1_ptr + Ncpu);
  printf("cpu_ptr: %p\n", vcpu_ptr);
  printf("gpu_ptr: %p\n", vgpu_ptr);
  printf("ncpu: %d\n", Ncpu);
  printf("ngpu: %d\n", Ngpu);
  for (int i = 0; i < N; i++) {
    v1_ptr[i] = static_cast<op_type>(i);
  }
  struct timeval start, compute, end;
  gettimeofday(&start, NULL);
  ios_base::sync_with_stdio(false);
  {
    std::cout << "Running on: " << gpuQ.get_device().get_info<sycl::info::device::name>() << " (size: " << Ngpu
              << ") and "
              << cpuQ.get_device().get_info<sycl::info::device::name>() << " (size: " << Ncpu << ")\n";
    gettimeofday(&compute, NULL);
    ios_base::sync_with_stdio(false);

    auto ev1 = gpuQ.submit([&](sycl::handler &h) {
      h.parallel_for(Rgpu, [=](sycl::id<1> i) {
        auto idx = i.get(0); // + Ncpu;
        {
          static const CONSTANT char FMT[] = "GPU [%d] r\n";
          sycl::intel::experimental::printf(FMT, idx);
        }
        for (int ii = 1; ii < 10000; ii++) {
          op_type tanval = (sycl::sin(vgpu_ptr[idx]) * ii) / (sycl::cos(vgpu_ptr[idx]) * ii);
          op_type secval = 1.0 / sycl::cos(vgpu_ptr[idx]);
          vgpu_ptr[idx] = (secval * secval) - (tanval * tanval);
        }
      });
    });

    auto ev2 = cpuQ.submit([&](sycl::handler &h) {
      h.parallel_for(Rcpu, [=](sycl::id<1> i) {
        auto idx = i.get(0);
        {
          static const CONSTANT char FMT[] = "CPU [%d] r\n";
          sycl::intel::experimental::printf(FMT, idx);
        }

        for (int ii = 1; ii < 10000; ii++) {
          op_type tanval = (sycl::sin(vcpu_ptr[idx]) * ii) / (sycl::cos(vcpu_ptr[idx]) * ii);
          op_type secval = 1.0 / sycl::cos(vcpu_ptr[idx]);
          vcpu_ptr[idx] = (secval * secval) - (tanval * tanval);
        }
      });
    });
    ev1.wait();
    may_verify(v1_ptr, N); // wrong
    ev2.wait();
    may_verify(v1_ptr, N); // wrong
    // may_verify(v1_ptr, N); // ok
  }
  gettimeofday(&end, NULL);
  double time_taken;
  time_taken = (end.tv_sec - start.tv_sec) * 1e6;
  time_taken = (time_taken + (end.tv_usec - start.tv_usec)) * 1e-6;
  cout << "Time taken by queue is : " << fixed << time_taken << setprecision(6) << " sec " << "\n";
  time_taken = (end.tv_sec - compute.tv_sec) * 1e6;
  time_taken = (time_taken + (end.tv_usec - compute.tv_usec)) * 1e-6;
  cout << "Time taken by kernel is : " << fixed << time_taken << setprecision(6) << " sec " << "\n";
  std::cout << "Sample values on GPU and CPU\n";

  may_verify(v1_ptr, N);
  return 0;
}

This is just an example; of course, when I run the comparison I will remove the printfs and other scaffolding. But I need to know the appropriate way to achieve the best performance in co-execution with oneAPI.

RahulV_intel
Moderator

Hi,

 

Thanks for trying it out. I need to investigate the code sample further and will get back to you as soon as possible.

 

 

Thanks,

Rahul

 

RahulV_intel
Moderator

Hi,

 

Please find the sample code below for multi-device USM. You may alter the gpu_split percentage. I have used two wait() calls at the end, one for each event, and it does seem to give correct results.

 

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
//#include <cstdlib>
//#include "sycl_exceptions.hpp"
constexpr int N = 1024;

bool validate_res(float testarr[N]) {
    for(int i=0;i<N;i++) {
        if(testarr[i]<0.95 || testarr[i]>1.05) {
            std::cout<<"failed at index: "<<i<<"Value: "<<testarr[i]<<"\n";
            return false;
        }
    }
    return true;
}

int main() {
    float gpu_split = 80; //GPU split percent
    int gpu_range = N * gpu_split/100;
    int cpu_range = N - (gpu_range);
    std::cout<<"Input size: "<<N;
    std::cout<<"\nGPU split percent: "<<gpu_split<<"\nCPU split percent: "<<100 - gpu_split<<"\n";
    //auto R = sycl::range<1>(N/2);
    //sycl::queue cpuQ(sycl::cpu_selector{},exception_handler);
    sycl::queue cpuQ(sycl::cpu_selector{});
    //sycl::queue gpuQ(sycl::gpu_selector{},exception_handler);
    sycl::queue gpuQ(sycl::gpu_selector{});

    std::cout<<"Running on: "<<gpuQ.get_device().get_info<sycl::info::device::name>()<<" and "
    <<cpuQ.get_device().get_info<sycl::info::device::name>()<<"\n";

    // Host USM: the device can read it directly (zero-copy), no explicit transfer
    int *gpuval = (int *) sycl::malloc_host(gpu_range * sizeof(int), gpuQ);

    for(int i=0; i<gpu_range; i++)
        gpuval[i] = i;

    int *cpuval = new int[cpu_range];

    for(int i=0; i<cpu_range; i++)
        cpuval[i] = i;

    // Shared USM: both kernels write disjoint ranges of the same allocation
    float *res = (float *) sycl::malloc_shared(N * sizeof(float), gpuQ);

    auto ev1 = gpuQ.submit([&](sycl::handler &h) {
        h.parallel_for(sycl::range<1>(gpu_range), [=](sycl::id<1> i) {
            res[i] = sycl::sin((float)gpuval[i])*sycl::sin((float)gpuval[i]) + sycl::cos((float)gpuval[i])*sycl::cos((float)gpuval[i]);
        });
    });

    auto ev2 = cpuQ.submit([&](sycl::handler &h) {
        h.parallel_for(sycl::range<1>(cpu_range), [=](sycl::id<1> i) {
            res[i + gpu_range] = sycl::sin((float)cpuval[i])*sycl::sin((float)cpuval[i]) + sycl::cos((float)cpuval[i])*sycl::cos((float)cpuval[i]);
        });
    });

    ev1.wait();
    ev2.wait();

   // for(int i=0; i<20; i++)
     //   std::cout<<res[i]<<" "<<res[i + gpu_range]<<"\n";

    validate_res(res) ? std::cout<<"Success\n" : std::cout<<"Failure\n";

    delete [] cpuval;
    sycl::free(gpuval, gpuQ);
    sycl::free(res, gpuQ);
//    std::cout<<"\n";

return 0;
}

 

I have used malloc_host since it provides zero-copy access for the device (for reads).

Let me know if you face any issues.

 

Thanks,

Rahul

RN1
New Contributor I

Thanks! Any idea why it throws CL_INVALID_VALUE at line 51, i.e., just when using the CPU queue (`auto ev2 = cpuQ.submit([&](sycl::handler &h) {`)?

Input size: 1024
GPU split percent: 80
CPU split percent: 20
[New Thread 0x7ffff3087640 (LWP 3771)]
[New Thread 0x7fffeed61640 (LWP 3772)]
[New Thread 0x7fffdbc5b640 (LWP 3773)]
[New Thread 0x7fffd37f6640 (LWP 3774)]
[New Thread 0x7fffcbfff640 (LWP 3775)]
[New Thread 0x7fffd33f5640 (LWP 3776)]
Running on: Intel(R) Gen9 and Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  Native API failed. Native API returns: -30 (CL_INVALID_VALUE) -30 (CL_INVALID_VALUE)

Thread 1 "conc_chk_usm_in" received signal SIGABRT, Aborted.
0x00007ffff7433615 in raise () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff7433615 in raise () from /usr/lib/libc.so.6
#1  0x00007ffff741c862 in abort () from /usr/lib/libc.so.6
#2  0x00007ffff7a0986a in __gnu_cxx::__verbose_terminate_handler ()
    at /build/gcc/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007ffff7a15d9a in __cxxabiv1::__terminate (handler=<optimized out>)
    at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x00007ffff7a15e07 in std::terminate ()
    at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
#5  0x00007ffff7a160ae in __cxxabiv1::__cxa_throw (obj=<optimized out>,
    tinfo=0x7ffff780d910 <typeinfo for cl::sycl::runtime_error>,
    dest=0x7ffff7623d20 <cl::sycl::exception::~exception()>)
    at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
#6  0x00007ffff76246af in void cl::sycl::detail::plugin::checkPiResult<cl::sycl::runtime_error>(_pi_result) const () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#7  0x00007ffff7785d4d in cl::sycl::detail::ExecCGCommand::enqueueImp() ()
   from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#8  0x00007ffff7774fda in cl::sycl::detail::Command::enqueue(cl::sycl::detail::EnqueueResultT&, cl::sycl::detail::BlockingT) () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#9  0x00007ffff778cb91 in cl::sycl::detail::Scheduler::addCG(std::unique_ptr<cl::sycl::detail::CG, std::default_delete<cl::sycl::detail::CG> >, std::shared_ptr<cl::sycl::detail::queue_impl>) ()
   from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#10 0x00007ffff77b2d4f in cl::sycl::handler::finalize() ()
   from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#11 0x00007ffff77bfbfc in cl::sycl::detail::queue_impl::submit_impl(std::function<void (cl::sycl::handler&)> const&, std::shared_ptr<cl::sycl::detail::queue_impl>, cl::sycl::detail::code_location const&) ()
   from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#12 0x00007ffff77bf826 in cl::sycl::detail::queue_impl::submit(std::function<void (cl::sycl::handler&)> const&, std::shared_ptr<cl::sycl::detail::queue_impl>, cl::sycl::detail::code_location const&) ()
   from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#13 0x00007ffff77beac3 in cl::sycl::queue::submit_impl(std::function<void (cl::sycl::handler&)>, cl::sycl::detail::code_location const&) () from /opt/intel/inteloneapi/compiler/latest/linux/lib/libsycl.so.1
#14 0x0000000000403bf8 in cl::sycl::queue::submit<main::$_1> (this=0x7fffffffc400, CGF=..., CodeLoc=...)
    at /opt/intel/inteloneapi/compiler/2021.1-beta07/linux/include/sycl/CL/sycl/queue.hpp:185
#15 main () at /home/user/usm_intel.cpp:51

 

 

RN1
New Contributor I
RahulV_intel
Moderator

Hi,

 

That's great! Since your issue is resolved, could you let me know if I can close this thread from my end?

 

Thanks,

Rahul

RahulV_intel
Moderator

Hi,


Since your issue is resolved, I'll go ahead and close this thread. Intel will no longer monitor this thread. However, it will remain open for community discussion.


Thanks,

Rahul

