Hi,
I am trying to perform ahead-of-time (AOT) compilation following the instructions here:
I am using the following compiler options:
dpcpp -fsycl-targets=spir64_gen-unknown-unknown-sycldevice -Xs "-device Gen9,dg1,Gen12HP" vector-add.cpp.
The error message is:
dpcpp: error: unable to execute command: Executable "ocloc" doesn't exist!
As I understand it, the NDA queue is not configured to accept interactive jobs (qsub -I).
If I compile on the login node and run on Xe_HP, the performance is lower than expected. For instance, the measured memory bandwidth is approximately 40% of the theoretical maximum of 800 GB/s.
Would you make ocloc available system-wide? If not, is there a way to install it locally?
Regards.
The DevCloud does not have ocloc installed right now. If you have your own development machine on which oneAPI can be installed, you could try AOT compilation there by following the installation guide.
Hi,
Thank you for reaching out. We are checking this internally. We will get back to you as soon as we get an update.
Regards
Gopika
Thanks for moving the post to the suitable forum.
I am not sure whether I can share performance numbers for NDA hardware here. Nevertheless, I need to make my case.
The measurement is done using BabelStream, which is an extension of the STREAM benchmark for heterogeneous platforms.
The triad kernel implemented in SYCL is as follows:
template <class T>
void SYCLStream<T>::triad()
{
  const T scalar = startScalar;
  queue->submit([&](handler &cgh)
  {
    // Buffer accessors: a is written, b and c are only read.
    auto ka = d_a->template get_access<access::mode::write>(cgh);
    auto kb = d_b->template get_access<access::mode::read>(cgh);
    auto kc = d_c->template get_access<access::mode::read>(cgh);
    // One work-item per element; no work-group size is specified here.
    cgh.parallel_for<triad_kernel>(range<1>{array_size}, [=](id<1> idx)
    {
      ka[idx] = kb[idx] + scalar * kc[idx];
    });
  });
  // Block until the kernel has finished so the timing is meaningful.
  queue->wait();
}
This implementation does not use USM, and the work-group size is chosen by the OpenCL runtime; either may affect performance.
Result for DG1 is below:
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using SYCL device Intel(R) Iris(R) Xe MAX Graphics [0x4905]
Driver: 21.11.19310
Function MBytes/sec Min (sec) Max Average
Triad 59800.403 0.01347 0.01361 0.01354
This translates to 86% of theoretical bandwidth.
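For reference, the reported triad rate can be reproduced from the array size and the minimum time. The sketch below is only an illustration, not BabelStream's own code: the function names are hypothetical, and the element count of 2^25 doubles is an assumption inferred from the "Array size: 268.4 MB" line above. Triad reads b and c and writes a, so one pass moves three arrays' worth of data, which matches the "Total size: 805.3 MB" line.

```cpp
#include <cstddef>

// Bytes moved by one triad pass: read b, read c, write a (three arrays).
// Hypothetical helper, not part of BabelStream.
constexpr double triad_bytes(std::size_t elements, std::size_t element_size) {
    return 3.0 * static_cast<double>(elements) * static_cast<double>(element_size);
}

// Sustained bandwidth in MBytes/sec (1 MB = 1e6 bytes, as in the output above).
constexpr double bandwidth_mbytes_per_sec(double bytes, double seconds) {
    return bytes / seconds / 1.0e6;
}

// Assumption: 2^25 elements of type double (8 bytes each), i.e. 268.4 MB per array.
constexpr std::size_t kAssumedElements = std::size_t{1} << 25;
constexpr double kDg1MinSeconds = 0.01347;  // "Min (sec)" from the DG1 run above
```

Calling bandwidth_mbytes_per_sec(triad_bytes(kAssumedElements, sizeof(double)), kDg1MinSeconds) gives roughly 59.8 GB/s, in line with the reported Triad figure; the small difference is presumably because the printed minimum time is rounded.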
Result for Xe_HP is below:
...
Using SYCL device Intel(R) Graphics [0x0205]
Driver: 21.12.019357+embargo458
Function MBytes/sec Min (sec) Max Average
Triad 282888.513 0.00285 0.00368 0.00345
This translates to only 36% of the theoretical bandwidth. Tuning the work-group size may give a marginal gain of about 10%.
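The efficiency figure is just the measured rate divided by the peak. A minimal sketch of that arithmetic, assuming the 800 GB/s theoretical maximum quoted at the start of this thread and decimal units (1 GB = 1000 MB); the function name is hypothetical:

```cpp
// Fraction of theoretical peak, in percent.
// measured is in MBytes/sec (as printed by the benchmark), peak is in GBytes/sec.
constexpr double percent_of_peak(double measured_mbytes_per_sec,
                                 double peak_gbytes_per_sec) {
    return measured_mbytes_per_sec / (peak_gbytes_per_sec * 1000.0) * 100.0;
}

// Values taken from this thread: Xe_HP triad result and the quoted 800 GB/s peak.
constexpr double kXeHpTriadMBps = 282888.513;
constexpr double kXeHpPeakGBps = 800.0;
constexpr double kXeHpPercent = percent_of_peak(kXeHpTriadMBps, kXeHpPeakGBps);
```

With these inputs kXeHpPercent comes out at a little over 35%, which is consistent with the roughly 36% stated above.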
Thus, I am trying to use AOT compilation, since the NDA queue does not accept interactive jobs.
Your insights on this issue are much appreciated.
Thanks.