Intel® oneAPI Math Kernel Library

oneMKL blas - performance regression on Intel CPUs

IgorBaratta
Beginner

I'm running a simple axpy through the oneMKL BLAS SYCL interface, and it is far slower than a naive, unoptimized SYCL kernel (roughly 4x to 11x in the timings below).

#include "oneapi/mkl.hpp"
#include <CL/sycl.hpp>
#include <chrono>
#include <cstddef>
#include <iostream>

using namespace cl;
using namespace std::chrono;

// Run benchmarks
int main(int argc, char** argv) {
  using T = double;

  sycl::queue queue{sycl::cpu_selector{}};
  constexpr std::size_t size = 1e9;
  T* x = sycl::malloc_device<T>(size, queue);
  T* y = sycl::malloc_device<T>(size, queue);
  queue.fill(x, T{1.0}, size).wait();
  queue.fill(y, T{2.0}, size).wait();

  T alpha = 3.;
  int num_iter = 5;
  for (int i = 0; i < num_iter; i++) {
    auto start = high_resolution_clock::now();
    oneapi::mkl::blas::axpy(queue, size, alpha, x, 1, y, 1).wait();
    auto end = high_resolution_clock::now();
    double t = duration_cast<duration<double>>(end - start).count();
    std::cout << i << " oneMKL: " << t << " seconds" << std::endl;
  }

  for (int i = 0; i < num_iter; i++) {
    auto start = high_resolution_clock::now();
    auto e = queue.submit([&](sycl::handler& h) {
      // Same update as oneMKL axpy: y <- alpha * x + y (x is read-only).
      h.parallel_for(sycl::range<1>{size}, [=](sycl::item<1> it) {
        const std::size_t i = it.get_id();
        y[i] = alpha * x[i] + y[i];
      });
    });
    e.wait();
    auto end = high_resolution_clock::now();
    double t = duration_cast<duration<double>>(end - start).count();
    std::cout << i << " SYCL: " << t << " seconds" << std::endl;
  }

  sycl::free(x, queue);
  sycl::free(y, queue);

  return 0;
}
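
For reference, BLAS axpy computes y <- alpha * x + y and leaves x unchanged, so a hand-written kernel meant as a like-for-like comparison should update y rather than x. Below is a minimal host-side sketch in plain C++ that could be used to validate either version on a small problem size. The name `reference_axpy` is made up for illustration and is not part of oneMKL or SYCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a oneMKL function): y <- alpha * x + y.
// x is read-only; only y is modified, matching the BLAS axpy definition.
template <typename T>
void reference_axpy(T alpha, const std::vector<T>& x, std::vector<T>& y) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] = alpha * x[i] + y[i];
  }
}
```

With x filled with 1.0, y filled with 2.0, and alpha = 3.0 as in the benchmark, every element of y is 5.0 after one call.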

 

 

 

Command used to compile (from the Intel® oneAPI Math Kernel Library Link Line Advisor):

dpcpp -Ofast -L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_tbb_thread -lmkl_core -lsycl -lOpenCL -lpthread -lm -ldl  -DMKL_ILP64  -I"${MKLROOT}/include" test.cpp

 

 

Version:

Intel(R) oneAPI DPC++/C++ Compiler 2022.0.0 (2022.0.0.20211123)

 

Output:

Ice Lake - Model name: Intel(R) Xeon(R) Platinum 8368Q CPU @ 2.60GHz

 

 

0 oneMKL: 1.35613 seconds
1 oneMKL: 1.5168 seconds
2 oneMKL: 1.4051 seconds
3 oneMKL: 1.38451 seconds
4 oneMKL: 1.40654 seconds

0 SYCL: 0.12582 seconds
1 SYCL: 0.125947 seconds
2 SYCL: 0.126261 seconds
3 SYCL: 0.128162 seconds
4 SYCL: 0.123251 seconds
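
One way to put these timings in context is to convert them to effective memory bandwidth. The sketch below assumes each axpy element costs three 8-byte memory accesses (read x, read y, write y) and ignores write-allocate traffic and cache effects. `effective_bandwidth_gbs` is a hypothetical helper name, not a library function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: effective bandwidth in GB/s (1 GB = 1e9 bytes) for one axpy call.
// Assumption: 3 memory accesses per element (load x, load y, store y).
inline double effective_bandwidth_gbs(std::size_t n, std::size_t bytes_per_element,
                                      double seconds) {
  assert(seconds > 0.0);
  const double bytes_moved =
      3.0 * static_cast<double>(n) * static_cast<double>(bytes_per_element);
  return bytes_moved / seconds / 1.0e9;
}
```

Under this model, 1e9 doubles means 24 GB moved per call, so about 0.126 s corresponds to roughly 190 GB/s and about 1.4 s to roughly 17 GB/s.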

 

 

 

This happens with both the Spack installation and the offload installer.

 

I get a similar result running the code on Intel DevCloud:

Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz

 

 

0 oneMKL: 2.77735 seconds
1 oneMKL: 2.2834 seconds
2 oneMKL: 2.05315 seconds
3 oneMKL: 2.44329 seconds
4 oneMKL: 1.96935 seconds

0 SYCL: 0.513233 seconds
1 SYCL: 0.494699 seconds
2 SYCL: 0.512073 seconds
3 SYCL: 0.50423 seconds
4 SYCL: 0.494641 seconds

 

 

 

Am I missing something? 

8 Replies

IgorBaratta
Beginner

In my project, I'm using the following CMake commands for linking, so I reckon the issue is not only with linking.

#CXX=dpcpp
find_package(MKL CONFIG REQUIRED)

target_compile_options(${PROJECT_NAME} PUBLIC $<TARGET_PROPERTY:MKL::MKL_DPCPP,INTERFACE_COMPILE_OPTIONS>)
target_include_directories(${PROJECT_NAME} PUBLIC $<TARGET_PROPERTY:MKL::MKL_DPCPP,INTERFACE_INCLUDE_DIRECTORIES>)
target_link_libraries(${PROJECT_NAME} PUBLIC $<LINK_ONLY:MKL::MKL_DPCPP>)

But I still get the same performance regression (compared to the C interface).
It's worth mentioning that this issue is not unique to axpy: I observed the same behaviour for other level 1 BLAS functions when using the SYCL interface.
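
Since the same slowdown shows up for other level 1 routines, it can help to time every routine the same way. Here is a minimal sketch in plain C++, assuming a hypothetical helper `best_time_seconds` (not part of oneMKL or SYCL) that takes any callable, runs untimed warm-up calls, and reports the fastest of several timed repetitions:

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: runs `f` `warmup` times untimed, then `reps` times timed,
// and returns the fastest wall-clock time in seconds.
// For asynchronous SYCL work, `f` must block until completion (e.g. call .wait()).
template <typename F>
double best_time_seconds(F&& f, int reps, int warmup = 1) {
  assert(reps > 0 && warmup >= 0);
  for (int i = 0; i < warmup; ++i) f();
  double best = std::numeric_limits<double>::infinity();
  for (int i = 0; i < reps; ++i) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto end = std::chrono::steady_clock::now();
    const double t = std::chrono::duration<double>(end - start).count();
    if (t < best) best = t;
  }
  return best;
}
```

The callable could wrap the axpy call from the original post, including its .wait(), or the equivalent call for another level 1 function.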

VidyalathaB_Intel
Moderator

Hi,

 

Thanks for reaching out to us.

We tried reproducing the issue on our end on two different processors and observed that on one CPU the timings are nearly identical.

Here are the results:

Device: Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz 

 

 

0 oneMKL: 0.801352 seconds
1 oneMKL: 0.77536 seconds
2 oneMKL: 0.784574 seconds
3 oneMKL: 0.773554 seconds
4 oneMKL: 0.772544 seconds

0 SYCL: 0.753969 seconds
1 SYCL: 0.754054 seconds
2 SYCL: 0.753803 seconds
3 SYCL: 0.753249 seconds
4 SYCL: 0.80803 seconds

 

 

On the following CPU, however, the issue is reproducible:

Device: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz

 

 

0 oneMKL: 1.82411 seconds
1 oneMKL: 2.82107 seconds
2 oneMKL: 4.32416 seconds
3 oneMKL: 4.77222 seconds
4 oneMKL: 2.69084 seconds

0 SYCL: 0.986935 seconds
1 SYCL: 0.994626 seconds
2 SYCL: 0.961732 seconds
3 SYCL: 0.997382 seconds
4 SYCL: 0.966748 seconds
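
The oneMKL timings on this CPU range from about 1.8 s to 4.8 s, so a median is a more robust summary than any single run. The following is a minimal sketch in plain C++; `median` and `median_slowdown` are hypothetical helper names introduced here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: median of a non-empty list of timings
// (taken by value so the caller's data is left untouched).
inline double median(std::vector<double> t) {
  assert(!t.empty());
  std::sort(t.begin(), t.end());
  const std::size_t mid = t.size() / 2;
  return (t.size() % 2 == 1) ? t[mid] : 0.5 * (t[mid - 1] + t[mid]);
}

// Hypothetical helper: how many times slower `slow` is than `fast`, by median.
inline double median_slowdown(const std::vector<double>& slow,
                              const std::vector<double>& fast) {
  return median(slow) / median(fast);
}
```

For the Gold 6128 timings above, the medians are 2.82107 s for oneMKL and 0.986935 s for the SYCL kernel, a slowdown of roughly 2.9x.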

 

 

Could you please let us know which OS you are working on?

 

 

Regards,

Vidya.

IgorBaratta
Beginner

Hi,
Thanks for your reply.

I've tested the code on DevCloud (which I assume uses Ubuntu 18.04 or 20.04). I also tested it on our local cluster with CentOS 8.

A third system I'm using runs Red Hat 8.

 

Best,

Igor

VidyalathaB_Intel
Moderator

Hi Igor,


Thanks for providing us with the details.

We are working on your issue and will get back to you soon.


Regards,

Vidya.


Gennady_F_Intel
Moderator

Igor,

It might be an optimization problem affecting all level 1 (L1) BLAS functions; we will check this case.

-Gennady


Khang_N_Intel
Employee

Hi Igor,


This issue has been resolved. The fix will be in the upcoming oneMKL release, 2022.1.

This release will be announced soon.


Best regards,

Khang


Khang_N_Intel
Employee

Hi Igor,


The issue has been fixed in oneMKL 2022.1.

The Intel(R) oneAPI Base Toolkit 2022.2 (containing oneMKL 2022.1) has been released.


Could you verify that the issue is fixed on your end?


Thanks,

Khang


Khang_N_Intel
Employee

Hi Igor,


Since the fix has been implemented in oneMKL 2022.1, and that version has been available for quite some time, I am going to close this thread.


This thread will no longer be monitored.


Best regards,

Khang

