Solved: CPU backend performance issues.

ChoonHo · ‎12-16-2020

Windows 10

IDE

Visual Studio 19

Compiler

Intel(R) oneAPI DPC++ compiler

Toolkit

oneAPI Base Toolkit version 2021.1

Problem

Random call instructions are generated in critical loops when using parallel_for.
No loop optimization is applied when using single_task.

Example program,

#include <CL/sycl.hpp>

#include <iostream>

namespace {

  __declspec(noinline) void add(
    float const* a, size_t a_stride,
    float const* b, size_t b_stride,
    float* c, size_t c_stride,
    size_t w, size_t h
  ) {
    for (size_t y = 0; y < h; ++y) {
      for (size_t x = 0; x < w; ++x) {
        c[y * c_stride + x] = a[y * a_stride + x] + b[y * b_stride + x];
      }
    }
  }

  __declspec(noinline) void add_single_task(
    sycl::queue& q,
    float const* a, size_t a_stride,
    float const* b, size_t b_stride,
    float* c, size_t c_stride,
    size_t w, size_t h
  ) {
    auto e = q.submit([&](sycl::handler& cgh) {

      cgh.single_task([=]() {
        for (size_t y = 0; y < h; ++y) {
          for (size_t x = 0; x < w; ++x) {
            c[y * c_stride + x] = a[y * a_stride + x] + b[y * b_stride + x];
          }
        }
        });
      });

    e.wait_and_throw();
  }

  __declspec(noinline) void add_parallel_for(
    sycl::queue& q,
    float const* a, size_t a_stride,
    float const* b, size_t b_stride,
    float* c, size_t c_stride,
    size_t w, size_t h
  ) {
    auto e = q.submit([&](sycl::handler& cgh) {

      cgh.parallel_for(sycl::range<2>(h, w), [=](sycl::id<2> index) {
        auto y = index[0];
        auto x = index[1];

        c[y * c_stride + x] = a[y * a_stride + x] + b[y * b_stride + x];
        });
      });

    e.wait_and_throw();
  }

  sycl::queue get_queue() {
    // Note:
    // Initialization of a queue is very slow.
    // Hence, we'll be caching it.
    static auto q = sycl::queue(sycl::cpu_selector());
    return q;
  }

}  // namespace

int main(int argc, char* argv[]) {
  try {
    auto q = get_queue();
  }
  catch (std::exception const& e) {
    std::cout << "Queue initialization failed with " << e.what() << '\n';
    return -1;
  }

  try {
    constexpr auto w = size_t(1024);
    constexpr auto h = size_t(1024);

    auto a = std::vector<float>(w * h, 1);
    auto b = std::vector<float>(w * h, 2);
    auto c = std::vector<float>(w * h);

    auto q = get_queue();

    for (size_t i = 0; i < 30; ++i) {
      add(a.data(), w, b.data(), w, c.data(), w, w, h);
      add_single_task(q, a.data(), w, b.data(), w, c.data(), w, w, h);
      add_parallel_for(q, a.data(), w, b.data(), w, c.data(), w, w, h);
    }
  }
  catch (std::exception const& e) {
    std::cout << "Add failed with " << e.what() << '\n';
    return -1;
  }

  return 0;
}

Intel Advisor reports,

add_parallel_for,

- call tree,

- assembly,
add_single_task,

- source insights,

Are those call instructions necessary and is there a way to request the AOT/JIT compiler to not generate them? And for single_task, shouldn't optimization be applied as well (was simply interested to see how it fare against the host compiler optimization, add)?

Alina_S_Intel · ‎12-30-2020

Thanks for sharing!

/Zi tells the compiler to generate full debugging information and call instructions are generated when it enabled. Without this option, there are no callq instructions. Please, note that /Zi is not recommended to use in the release code. So, this is not a bug.

Also, please, note that O0-O3 options are also supported by dpcpp \ icx compilers as well.

View solution in original post

GouthamK_Intel · ‎12-22-2020

Hi,

Thanks for reaching out to us!

We are escalating this thread to the Subject Matter Expert(SME) who will guide you further.

Have a Good day

Thanks & Regards

Goutham

Alina_S_Intel · ‎12-24-2020

I reproduced the second issue (No loop optimization is applied when using single_task). Since this is a missed performance opportunity, I suppose it might be escalated as a feature request. I am presently investigating it and will get back to you shortly with an update.

However, I wasn't able to reproduce callq inside add_parallel_for. Could you please share the compiler options you used? Which compiler driver, dpcpp or icx\icpx, did you use?

ChoonHo · ‎12-27-2020

Using default settings on release mode in Visual Studio.

/O2 which should translate to -O3 by clang-cl.

Was reproducible in,

Intel(R) Core(TM) i5-7300HQ
Intel(R) Xeon(R) Gold 6230N

ChoonHo · ‎12-27-2020

If it's still not reproducible, try removing the add inside main, leaving only add_single_task and add_parallel_for.

Alina_S_Intel · ‎12-29-2020

No, it is not reproducible, even with O3. Could you please share a compiler command line and other steps to reproduce?

What do you mean by 'which should translate to -O3 by clang-cl'?

ChoonHo · ‎12-29-2020

Steps:

Create a new project, select Empty Project.
Change Platform Toolset from Visual Studio 2019 (v142) to Intel(R) oneAPI DPC++ Compiler.
Change x86 to x64 and Debug to Release.
Paste the script and build.

* DPC++ Console Application template project should work too (without having to go through steps 1 and 2).

For reference, these are the arguments passed to the compiler.

/O2 /Zi /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /WX- /MD /EHsc /W1 /nologo /Fo"x64\Release\"

As to "which should translate to -O3 by clang-cl", as seen above, I'm using /O2 which is a msvc option, not a gcc/clang option. However, clang-cl should translate it to -O3.

I have also attached a zipped file which contains the project and executable along with Intel Advisor's report.

Alina_S_Intel · ‎12-30-2020

Thanks for sharing!

/Zi tells the compiler to generate full debugging information and call instructions are generated when it enabled. Without this option, there are no callq instructions. Please, note that /Zi is not recommended to use in the release code. So, this is not a bug.

Also, please, note that O0-O3 options are also supported by dpcpp \ icx compilers as well.

ChoonHo · ‎12-30-2020

Yes, that works.

And the generated assembly is much much better, see attached image.

Finally, is there any way to use intel loop directives FPGA loop directives for a CPU/GPU kernel, like [[intel::invdep]]?

Using the same example, c[y * c_stride + x] = a[y * a_stride + x] + b[y * b_stride + x], I believe the optimizer detects possible memory aliasing here (y * stride), and disables other loop transformations, like loop unrolling, permutation, tiling, etc.

GouthamK_Intel · ‎05-20-2021

Hi,

Thanks for the confirmation!

As this issue has been resolved, we will no longer respond to this thread.

If you require any additional assistance from Intel, please start a new thread.

Any further interaction in this thread will be considered community only.

Have a Good day.

Thanks & Regards

Goutham