Intel® oneAPI Data Parallel C++
Support for Intel® oneAPI DPC++ Compiler, Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and GDB*

Request for Intel OpenCL Offline Compiler (OCLOC)

Viet-Duc
Novice

 

Hi,

 

I am trying to perform ahead-of-time (AOT) compilation following the instructions here:

https://software.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-gui...

https://software.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-onea...

Using the following compiler options:

dpcpp -fsycl-targets=spir64_gen-unknown-unknown-sycldevice -Xs "-device Gen9,dg1,Gen12HP" vector-add.cpp

The error message is:

dpcpp: error: unable to execute command: Executable "ocloc" doesn't exist!

As I understand it, the NDA queue is not configured to accept interactive jobs (qsub -I).

If I compile on the login node and run on Xe_HP, the performance is not as good as expected. For instance, the result of the memory bandwidth measurement is approximately 40% of the theoretical maximum of 800 GB/s.

Would you make ocloc available system-wide? If not, is there a way to install it locally?

Regards.

Gopika_Intel
Moderator

Hi,

Thank you for reaching out. We are checking this internally and will get back to you as soon as we have an update.

Regards

Gopika


Viet-Duc
Novice

Thanks for moving the post to the appropriate forum.

 

I am not sure whether I can share performance numbers for NDA hardware here. Nevertheless, I need to make my case.

The measurement is done using BabelStream, which is an extension of the STREAM benchmark for heterogeneous platforms.

The triad kernel implemented in SYCL is as follows:

template <class T>
void SYCLStream<T>::triad()
{
  const T scalar = startScalar;
  queue->submit([&](handler &cgh)
  {
    // a is written, b and c are read only: a[i] = b[i] + scalar * c[i]
    auto ka = d_a->template get_access<access::mode::write>(cgh);
    auto kb = d_b->template get_access<access::mode::read>(cgh);
    auto kc = d_c->template get_access<access::mode::read>(cgh);
    // Plain range<1> launch: the runtime chooses the work-group size
    cgh.parallel_for<triad_kernel>(range<1>{array_size}, [=](id<1> idx)
    {
      ka[idx] = kb[idx] + scalar * kc[idx];
    });
  });
  queue->wait(); // block until the kernel has finished
}

This implementation does not use USM, and the work-group size is determined by the OpenCL runtime, which may affect performance.

 

Result for DG1 is below:

Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using SYCL device Intel(R) Iris(R) Xe MAX Graphics [0x4905]
Driver: 21.11.19310
Function    MBytes/sec  Min (sec)   Max         Average
Triad       59800.403   0.01347     0.01361     0.01354 

This translates to 86% of theoretical bandwidth.

 

Result for Xe_HP is below:

...
Using SYCL device Intel(R) Graphics [0x0205]
Driver: 21.12.019357+embargo458
Function    MBytes/sec  Min (sec)   Max         Average
Triad       282888.513  0.00285     0.00368     0.00345

This translates to only 36% of the theoretical bandwidth. Tuning the work-group size may give roughly a 10% additional gain.

Thus, I am trying to use AOT compilation, since the NDA queue does not accept interactive jobs.

 

Your insights on this issue are much appreciated.

Thanks.

Jie_L_Intel
Employee

The DevCloud does not have ocloc installed right now. If you have your own development machine that can run oneAPI, you could try installing it by following the installation guide:

https://software.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-onea...

