Intel® oneAPI DPC++/C++ Compiler
Talk to fellow users of Intel® oneAPI DPC++/C++ Compiler and companion tools like Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and Intel® Distribution for GDB*

CPU backend performance issues.

ChoonHo
Beginner
2,663 Views
OS

Windows 10

IDE

Visual Studio 19

Compiler

Intel(R) oneAPI DPC++ compiler

Toolkit

oneAPI Base Toolkit version 2021.1

Problem

  1. Random call instructions are generated in critical loops when using parallel_for.
  2. No loop optimization is applied when using single_task.

Example program,

#include <CL/sycl.hpp>

#include <iostream>

namespace {

  __declspec(noinline) void add(
    float const* a, size_t a_stride,
    float const* b, size_t b_stride,
    float* c, size_t c_stride,
    size_t w, size_t h
  ) {
    for (size_t y = 0; y < h; ++y) {
      for (size_t x = 0; x < w; ++x) {
        c[y * c_stride + x] = a[y * a_stride + x] + b[y * b_stride + x];
      }
    }
  }

  __declspec(noinline) void add_single_task(
    sycl::queue& q,
    float const* a, size_t a_stride,
    float const* b, size_t b_stride,
    float* c, size_t c_stride,
    size_t w, size_t h
  ) {
    auto e = q.submit([&](sycl::handler& cgh) {

      cgh.single_task([=]() {
        for (size_t y = 0; y < h; ++y) {
          for (size_t x = 0; x < w; ++x) {
            c[y * c_stride + x] = a[y * a_stride + x] + b[y * b_stride + x];
          }
        }
        });
      });

    e.wait_and_throw();
  }

  __declspec(noinline) void add_parallel_for(
    sycl::queue& q,
    float const* a, size_t a_stride,
    float const* b, size_t b_stride,
    float* c, size_t c_stride,
    size_t w, size_t h
  ) {
    auto e = q.submit([&](sycl::handler& cgh) {

      cgh.parallel_for(sycl::range<2>(h, w), [=](sycl::id<2> index) {
        auto y = index[0];
        auto x = index[1];

        c[y * c_stride + x] = a[y * a_stride + x] + b[y * b_stride + x];
        });
      });

    e.wait_and_throw();
  }

  sycl::queue get_queue() {
    // Note:
    // Initialization of a queue is very slow.
    // Hence, we'll be caching it.
    static auto q = sycl::queue(sycl::cpu_selector());
    return q;
  }

}  // namespace

int main(int argc, char* argv[]) {
  try {
    auto q = get_queue();
  }
  catch (std::exception const& e) {
    std::cout << "Queue initialization failed with " << e.what() << '\n';
    return -1;
  }

  try {
    constexpr auto w = size_t(1024);
    constexpr auto h = size_t(1024);

    auto a = std::vector<float>(w * h, 1);
    auto b = std::vector<float>(w * h, 2);
    auto c = std::vector<float>(w * h);

    auto q = get_queue();

    for (size_t i = 0; i < 30; ++i) {
      add(a.data(), w, b.data(), w, c.data(), w, w, h);
      add_single_task(q, a.data(), w, b.data(), w, c.data(), w, w, h);
      add_parallel_for(q, a.data(), w, b.data(), w, c.data(), w, w, h);
    }
  }
  catch (std::exception const& e) {
    std::cout << "Add failed with " << e.what() << '\n';
    return -1;
  }

  return 0;
}

 

Intel Advisor reports,

  1. add_parallel_for,

    - call tree,
    add_parallel_for_call_stack.png

    - assembly,
    add_parallel_for_assembly.png

  2. add_single_task,

    - source insights,
    add_single_task_insights.PNG

Are those call instructions necessary and is there a way to request the AOT/JIT compiler to not generate them? And for single_task, shouldn't optimization be applied as well (was simply interested to see how it fare against the host compiler optimization, add)?

0 Kudos
1 Solution
Alina_S_Intel
Employee
2,543 Views

Thanks for sharing!


/Zi tells the compiler to generate full debugging information and call instructions are generated when it enabled. Without this option, there are no callq instructions. Please, note that /Zi is not recommended to use in the release code. So, this is not a bug.


Also, please, note that O0-O3 options are also supported by dpcpp \ icx compilers as well.


View solution in original post

0 Kudos
9 Replies
GouthamK_Intel
Moderator
2,632 Views

Hi,

Thanks for reaching out to us!

We are escalating this thread to the Subject Matter Expert(SME) who will guide you further.

Have a Good day


Thanks & Regards

Goutham


0 Kudos
Alina_S_Intel
Employee
2,604 Views

I reproduced the second issue (No loop optimization is applied when using single_task). Since this is a missed performance opportunity, I suppose it might be escalated as a feature request. I am presently investigating it and will get back to you shortly with an update.


However, I wasn't able to reproduce callq inside add_parallel_for. Could you please share the compiler options you used? Which compiler driver, dpcpp or icx\icpx, did you use?


0 Kudos
ChoonHo
Beginner
2,585 Views

Using default settings on release mode in Visual Studio.

/O2 which should translate to -O3 by clang-cl.

Was reproducible in,

  • Intel(R) Core(TM) i5-7300HQ

  • Intel(R) Xeon(R) Gold 6230N

 

0 Kudos
ChoonHo
Beginner
2,583 Views

If it's still not reproducible, try removing the add inside main, leaving only add_single_task and add_parallel_for.

 

0 Kudos
Alina_S_Intel
Employee
2,558 Views

No, it is not reproducible, even with O3. Could you please share a compiler command line and other steps to reproduce?

What do you mean by 'which should translate to -O3 by clang-cl'?


0 Kudos
ChoonHo
Beginner
2,552 Views

Steps:

  1. Create a new project, select Empty Project.
  2. Change Platform Toolset from Visual Studio 2019 (v142) to Intel(R) oneAPI DPC++ Compiler.
  3. Change x86 to x64 and Debug to Release.
  4. Paste the script and build.

* DPC++ Console Application template project should work too (without having to go through steps 1 and 2).

For reference, these are the arguments passed to the compiler.

/O2 /Zi /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /WX- /MD /EHsc /W1 /nologo /Fo"x64\Release\" 

 

As to "which should translate to -O3 by clang-cl", as seen above, I'm using /O2 which is a msvc option, not a gcc/clang option. However, clang-cl should translate it to -O3.

 

I have also attached a zipped file which contains the project and executable along with Intel Advisor's report.

 

0 Kudos
Alina_S_Intel
Employee
2,544 Views

Thanks for sharing!


/Zi tells the compiler to generate full debugging information and call instructions are generated when it enabled. Without this option, there are no callq instructions. Please, note that /Zi is not recommended to use in the release code. So, this is not a bug.


Also, please, note that O0-O3 options are also supported by dpcpp \ icx compilers as well.


0 Kudos
ChoonHo
Beginner
2,537 Views

Yes, that works.

And the generated assembly is much much better, see attached image.

Finally, is there any way to use intel loop directives FPGA loop directives for a CPU/GPU kernel, like [[intel::invdep]]?

Using the same example, c[y * c_stride + x] = a[y * a_stride + x] + b[y * b_stride + x], I believe the optimizer detects possible memory aliasing here (y * stride), and disables other loop transformations, like loop unrolling, permutation, tiling, etc.

 

0 Kudos
GouthamK_Intel
Moderator
2,114 Views

Hi, 

Thanks for the confirmation!

As this issue has been resolved, we will no longer respond to this thread. 

If you require any additional assistance from Intel, please start a new thread. 

Any further interaction in this thread will be considered community only. 

Have a Good day.


Thanks & Regards

Goutham


0 Kudos
Reply