- Marcar como novo
- Marcador
- Subscrever
- Silenciar
- Subscrever fonte RSS
- Destacar
- Imprimir
- Denunciar conteúdo inapropriado
Windows 10
IDE
Visual Studio 19
Compiler
Intel(R) oneAPI DPC++ compiler
Toolkit
oneAPI Base Toolkit version 2021.1
Problem
- Random call instructions are generated in critical loops when using parallel_for.
- No loop optimization is applied when using single_task.
Example program,
#include <CL/sycl.hpp>
#include <iostream>
namespace {
__declspec(noinline) void add(
float const* a, size_t a_stride,
float const* b, size_t b_stride,
float* c, size_t c_stride,
size_t w, size_t h
) {
for (size_t y = 0; y < h; ++y) {
for (size_t x = 0; x < w; ++x) {
c[y * c_stride + x] = a[y * a_stride + x] + b[y * b_stride + x];
}
}
}
__declspec(noinline) void add_single_task(
sycl::queue& q,
float const* a, size_t a_stride,
float const* b, size_t b_stride,
float* c, size_t c_stride,
size_t w, size_t h
) {
auto e = q.submit([&](sycl::handler& cgh) {
cgh.single_task([=]() {
for (size_t y = 0; y < h; ++y) {
for (size_t x = 0; x < w; ++x) {
c[y * c_stride + x] = a[y * a_stride + x] + b[y * b_stride + x];
}
}
});
});
e.wait_and_throw();
}
__declspec(noinline) void add_parallel_for(
sycl::queue& q,
float const* a, size_t a_stride,
float const* b, size_t b_stride,
float* c, size_t c_stride,
size_t w, size_t h
) {
auto e = q.submit([&](sycl::handler& cgh) {
cgh.parallel_for(sycl::range<2>(h, w), [=](sycl::id<2> index) {
auto y = index[0];
auto x = index[1];
c[y * c_stride + x] = a[y * a_stride + x] + b[y * b_stride + x];
});
});
e.wait_and_throw();
}
sycl::queue get_queue() {
// Note:
// Initialization of a queue is very slow.
// Hence, we'll be caching it.
static auto q = sycl::queue(sycl::cpu_selector());
return q;
}
} // namespace
int main(int argc, char* argv[]) {
try {
auto q = get_queue();
}
catch (std::exception const& e) {
std::cout << "Queue initialization failed with " << e.what() << '\n';
return -1;
}
try {
constexpr auto w = size_t(1024);
constexpr auto h = size_t(1024);
auto a = std::vector<float>(w * h, 1);
auto b = std::vector<float>(w * h, 2);
auto c = std::vector<float>(w * h);
auto q = get_queue();
for (size_t i = 0; i < 30; ++i) {
add(a.data(), w, b.data(), w, c.data(), w, w, h);
add_single_task(q, a.data(), w, b.data(), w, c.data(), w, w, h);
add_parallel_for(q, a.data(), w, b.data(), w, c.data(), w, w, h);
}
}
catch (std::exception const& e) {
std::cout << "Add failed with " << e.what() << '\n';
return -1;
}
return 0;
}
Intel Advisor reports,
- add_parallel_for,
- call tree,
- assembly, - add_single_task,
- source insights,
Are those call instructions necessary and is there a way to request the AOT/JIT compiler to not generate them? And for single_task, shouldn't optimization be applied as well (was simply interested to see how it fare against the host compiler optimization, add)?
- Marcar como novo
- Marcador
- Subscrever
- Silenciar
- Subscrever fonte RSS
- Destacar
- Imprimir
- Denunciar conteúdo inapropriado
Thanks for sharing!
/Zi tells the compiler to generate full debugging information and call instructions are generated when it enabled. Without this option, there are no callq instructions. Please, note that /Zi is not recommended to use in the release code. So, this is not a bug.
Also, please, note that O0-O3 options are also supported by dpcpp \ icx compilers as well.
Link copiado
- Marcar como novo
- Marcador
- Subscrever
- Silenciar
- Subscrever fonte RSS
- Destacar
- Imprimir
- Denunciar conteúdo inapropriado
Hi,
Thanks for reaching out to us!
We are escalating this thread to the Subject Matter Expert(SME) who will guide you further.
Have a Good day
Thanks & Regards
Goutham
- Marcar como novo
- Marcador
- Subscrever
- Silenciar
- Subscrever fonte RSS
- Destacar
- Imprimir
- Denunciar conteúdo inapropriado
I reproduced the second issue (No loop optimization is applied when using single_task). Since this is a missed performance opportunity, I suppose it might be escalated as a feature request. I am presently investigating it and will get back to you shortly with an update.
However, I wasn't able to reproduce callq inside add_parallel_for. Could you please share the compiler options you used? Which compiler driver, dpcpp or icx\icpx, did you use?
- Marcar como novo
- Marcador
- Subscrever
- Silenciar
- Subscrever fonte RSS
- Destacar
- Imprimir
- Denunciar conteúdo inapropriado
Using default settings on release mode in Visual Studio.
/O2 which should translate to -O3 by clang-cl.
Was reproducible in,
-
Intel(R) Core(TM) i5-7300HQ
-
Intel(R) Xeon(R) Gold 6230N
- Marcar como novo
- Marcador
- Subscrever
- Silenciar
- Subscrever fonte RSS
- Destacar
- Imprimir
- Denunciar conteúdo inapropriado
If it's still not reproducible, try removing the add inside main, leaving only add_single_task and add_parallel_for.
- Marcar como novo
- Marcador
- Subscrever
- Silenciar
- Subscrever fonte RSS
- Destacar
- Imprimir
- Denunciar conteúdo inapropriado
No, it is not reproducible, even with O3. Could you please share a compiler command line and other steps to reproduce?
What do you mean by 'which should translate to -O3 by clang-cl'?
- Marcar como novo
- Marcador
- Subscrever
- Silenciar
- Subscrever fonte RSS
- Destacar
- Imprimir
- Denunciar conteúdo inapropriado
Steps:
- Create a new project, select Empty Project.
- Change Platform Toolset from Visual Studio 2019 (v142) to Intel(R) oneAPI DPC++ Compiler.
- Change x86 to x64 and Debug to Release.
- Paste the script and build.
* DPC++ Console Application template project should work too (without having to go through steps 1 and 2).
For reference, these are the arguments passed to the compiler.
/O2 /Zi /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /WX- /MD /EHsc /W1 /nologo /Fo"x64\Release\"
As to "which should translate to -O3 by clang-cl", as seen above, I'm using /O2 which is a msvc option, not a gcc/clang option. However, clang-cl should translate it to -O3.
I have also attached a zipped file which contains the project and executable along with Intel Advisor's report.
- Marcar como novo
- Marcador
- Subscrever
- Silenciar
- Subscrever fonte RSS
- Destacar
- Imprimir
- Denunciar conteúdo inapropriado
Thanks for sharing!
/Zi tells the compiler to generate full debugging information and call instructions are generated when it enabled. Without this option, there are no callq instructions. Please, note that /Zi is not recommended to use in the release code. So, this is not a bug.
Also, please, note that O0-O3 options are also supported by dpcpp \ icx compilers as well.
- Marcar como novo
- Marcador
- Subscrever
- Silenciar
- Subscrever fonte RSS
- Destacar
- Imprimir
- Denunciar conteúdo inapropriado
Yes, that works.
And the generated assembly is much much better, see attached image.
Finally, is there any way to use intel loop directives FPGA loop directives for a CPU/GPU kernel, like [[intel::invdep]]?
Using the same example, c[y * c_stride + x] = a[y * a_stride + x] + b[y * b_stride + x], I believe the optimizer detects possible memory aliasing here (y * stride), and disables other loop transformations, like loop unrolling, permutation, tiling, etc.
- Marcar como novo
- Marcador
- Subscrever
- Silenciar
- Subscrever fonte RSS
- Destacar
- Imprimir
- Denunciar conteúdo inapropriado
Hi,
Thanks for the confirmation!
As this issue has been resolved, we will no longer respond to this thread.
If you require any additional assistance from Intel, please start a new thread.
Any further interaction in this thread will be considered community only.
Have a Good day.
Thanks & Regards
Goutham

- Subscrever fonte RSS
- Marcar tópico como novo
- Marcar tópico como lido
- Flutuar este Tópico para o utilizador atual
- Marcador
- Subscrever
- Página amigável para impressora