Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

oneMKL blas - performance regression on Intel CPUs

IgorBaratta
Beginner

I'm running a simple axpy through the oneMKL BLAS interface, and it is much slower than a non-optimized SYCL kernel.

#include "oneapi/mkl.hpp"
#include <chrono>
#include <iostream>

using namespace cl;
using namespace std::chrono;

// Run benchmarks
int main(int argc, char** argv) {
  using T = double;

  sycl::queue queue{sycl::cpu_selector{}};
  constexpr std::size_t size = 1e9;
  T* x = sycl::malloc_device<T>(size, queue);
  T* y = sycl::malloc_device<T>(size, queue);
  queue.fill(x, T{1.0}, size).wait();
  queue.fill(y, T{2.0}, size).wait();

  T alpha = 3.;
  int num_iter = 5;
  for (int i = 0; i < num_iter; i++) {
    auto start = high_resolution_clock::now();
    oneapi::mkl::blas::axpy(queue, size, alpha, x, 1, y, 1).wait();
    auto end = high_resolution_clock::now();
    double t = duration_cast<duration<double>>(end - start).count();
    std::cout << i << " oneMKL: " << t << " seconds" << std::endl;
  }

  for (int i = 0; i < num_iter; i++) {
    auto start = high_resolution_clock::now();
    auto e = queue.submit([&](sycl::handler& h) {
      h.parallel_for(sycl::range<1>{size}, [=](sycl::item<1> it) {
        const std::size_t idx = it.get_id();
        // Same operation as axpy: y := alpha * x + y
        y[idx] = alpha * x[idx] + y[idx];
      });
    });
    e.wait();
    auto end = high_resolution_clock::now();
    double t = duration_cast<duration<double>>(end - start).count();
    std::cout << i << " SYCL: " << t << " seconds" << std::endl;
  }

  sycl::free(x, queue);
  sycl::free(y, queue);

  return 0;
}

Command used to compile (taken from the Intel® oneAPI Math Kernel Library Link Line Advisor):

dpcpp -Ofast -L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_tbb_thread -lmkl_core -lsycl -lOpenCL -lpthread -lm -ldl  -DMKL_ILP64  -I"${MKLROOT}/include" test.cpp

Version:

Intel(R) oneAPI DPC++/C++ Compiler 2022.0.0 (2022.0.0.20211123)

Output:

Ice Lake - Model name: Intel(R) Xeon(R) Platinum 8368Q CPU @ 2.60GHz

0 oneMKL: 1.35613 seconds
1 oneMKL: 1.5168 seconds
2 oneMKL: 1.4051 seconds
3 oneMKL: 1.38451 seconds
4 oneMKL: 1.40654 seconds

0 SYCL: 0.12582 seconds
1 SYCL: 0.125947 seconds
2 SYCL: 0.126261 seconds
3 SYCL: 0.128162 seconds
4 SYCL: 0.123251 seconds

This happens both with the installation via Spack and with the offline installer.

Similar results when running the code on DevCloud:

Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz

0 oneMKL: 2.77735 seconds
1 oneMKL: 2.2834 seconds
2 oneMKL: 2.05315 seconds
3 oneMKL: 2.44329 seconds
4 oneMKL: 1.96935 seconds

0 SYCL: 0.513233 seconds
1 SYCL: 0.494699 seconds
2 SYCL: 0.512073 seconds
3 SYCL: 0.50423 seconds
4 SYCL: 0.494641 seconds

Am I missing something? 

IgorBaratta
Beginner

In my project, I'm using the following CMake commands for linking, so I reckon the issue is not just a linking problem.

#CXX=dpcpp
find_package(MKL CONFIG REQUIRED)

target_compile_options(${PROJECT_NAME} PUBLIC $<TARGET_PROPERTY:MKL::MKL_DPCPP,INTERFACE_COMPILE_OPTIONS>)
target_include_directories(${PROJECT_NAME} PUBLIC $<TARGET_PROPERTY:MKL::MKL_DPCPP,INTERFACE_INCLUDE_DIRECTORIES>)
target_link_libraries(${PROJECT_NAME} PUBLIC $<LINK_ONLY:MKL::MKL_DPCPP>)

But I still get the same performance regression (compared to the C interface).
It's worth mentioning that this issue is not unique to axpy: I observe the same behaviour for other Level 1 BLAS functions when using the SYCL interface.

VidyalathaB_Intel
Moderator

Hi,

Thanks for reaching out to us.

We tried reproducing the issue on our end with two different processors and observed that on one CPU the timings are comparable.

Here are the results:

Device: Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz 

0 oneMKL: 0.801352 seconds
1 oneMKL: 0.77536 seconds
2 oneMKL: 0.784574 seconds
3 oneMKL: 0.773554 seconds
4 oneMKL: 0.772544 seconds

0 SYCL: 0.753969 seconds
1 SYCL: 0.754054 seconds
2 SYCL: 0.753803 seconds
3 SYCL: 0.753249 seconds
4 SYCL: 0.80803 seconds

But when we tried on this CPU, the issue was reproducible:

Device: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz

0 oneMKL: 1.82411 seconds
1 oneMKL: 2.82107 seconds
2 oneMKL: 4.32416 seconds
3 oneMKL: 4.77222 seconds
4 oneMKL: 2.69084 seconds

0 SYCL: 0.986935 seconds
1 SYCL: 0.994626 seconds
2 SYCL: 0.961732 seconds
3 SYCL: 0.997382 seconds
4 SYCL: 0.966748 seconds

Could you please let us know the OS details on which you are working?

Regards,

Vidya.

IgorBaratta
Beginner

Hi,
Thanks for your reply.

I've tested the code on DevCloud (which I assume uses Ubuntu 18.04 or 20.04), and also on our local cluster running CentOS 8.

A third system I'm using runs on Red Hat 8.

Best,

Igor

VidyalathaB_Intel
Moderator

Hi Igor,


Thanks for providing us with the details.

We are working on your issue and will get back to you soon.


Regards,

Vidya.


Gennady_F_Intel
Moderator

Igor,

It might be an optimization problem affecting all Level 1 functions; we will check this case.

-Gennady


Khang_N_Intel
Employee

Hi Igor,


We have had this issue resolved. The fix will be included in the upcoming oneMKL 2022.1 release.

This release will be announced soon.


Best regards,

Khang


Khang_N_Intel
Employee

Hi Igor,


The issue has been fixed in oneMKL 2022.1.

The Intel(r) oneAPI Base Toolkit 2022.2 (containing oneMKL 2022.1) has been released.


Could you verify that the issue is fixed on your end?


Thanks,

Khang


Khang_N_Intel
Employee

Hi Igor,


Since the fix has been implemented in oneMKL 2022.1, and that version has been available for quite some time, I am going to close this thread.


This thread will no longer be monitored.


Best regards,

Khang
