Intel® oneAPI Math Kernel Library

Is oneMKL axpy optimally tuned for Intel GPUs?

LaurentPlagne
Novice

I wonder if I can improve the performance of the following snippet, which I would like to use to assess the bandwidth of Intel GPUs:

#include <CL/sycl.hpp>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <random>
#include "mkl_sycl.hpp"
#include "dpc_common.hpp"


using namespace cl::sycl;
using namespace std;

constexpr size_t NITER=100; //amortize device/host communication
using Scalar=float;

template <class T>
void bench_axpy(size_t N){

  std::vector<T> a(N,1);
  std::vector<T> b(N,2);
  gpu_selector device_selector;
  queue q(device_selector, dpc_common::exception_handler);
  
  auto start=std::chrono::high_resolution_clock::now();
  {  // Begin buffer scope
    buffer<T, 1> buf_a(a.data(), range<1>(N)); // DPC++ buffers wrapping the host vectors
    buffer<T, 1> buf_b(b.data(), range<1>(N));

    const T alpha=0.5;
    try{
        for (size_t iter=0; iter<NITER; iter++) {
            mkl::blas::axpy(q, N, alpha, buf_a, 1, buf_b, 1);
        }
    }
    catch(cl::sycl::exception const& e) {
        std::cout << "\t\tCaught synchronous SYCL exception during AXPY:\n"
          << e.what() << std::endl;
    }
  }  // End buffer scope: buffer destruction waits for the queued kernels and copies b back to the host
  auto end = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> elapsed_seconds = end-start;
  double time = elapsed_seconds.count();
  double GBs=double(3*N)*sizeof(T)*NITER/(time*1.e9); // axpy traffic: 2 reads + 1 write per element
  std::cout <<"GBs="<<GBs<<std::endl; 
}


int main(int argc, char* argv[]) {

  bench_axpy<float>(2 << 27); // 2^28 elements, about 1 GiB per float vector

  return 0;
}

I compile with:

dpcpp -O3 -fsycl -std=c++17 -DMKL_ILP64 -g -DNDEBUG -lOpenCL -lsycl -lmkl_sycl -lmkl_core -lmkl_sequential -lmkl_intel_lp64 ../src/portable_main.cpp

and obtain:

GBs=23.09 on my machine with a UHD630 (and no VRAM).

Is it possible to improve this?
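
For reference, a hand-written SYCL kernel doing the same y = alpha*x + y on the same buffers would be a useful baseline. Here is a minimal sketch (the function name axpy_plain and the access modes are only illustrative, not taken from the code above); calling it in place of mkl::blas::axpy inside the NITER loop would show whether the library call itself adds any overhead:

#include <CL/sycl.hpp>

// Hand-written axpy: y = alpha*x + y, one work-item per element.
// Intended as a drop-in replacement for the mkl::blas::axpy call above.
template <class T>
void axpy_plain(cl::sycl::queue &q, cl::sycl::buffer<T, 1> &x,
                cl::sycl::buffer<T, 1> &y, T alpha, size_t n) {
  q.submit([&](cl::sycl::handler &h) {
    auto xa = x.template get_access<cl::sycl::access::mode::read>(h);
    auto ya = y.template get_access<cl::sycl::access::mode::read_write>(h);
    h.parallel_for(cl::sycl::range<1>(n),
                   [=](cl::sycl::id<1> i) { ya[i] += alpha * xa[i]; });
  });
}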

 

 

GouthamK_Intel
Moderator

Hi,

Thanks for reaching out to us!

Since your issue is related to oneMKL, we are moving this query to the Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library forum for a faster response.


Regards,

Goutham


Gennady_F_Intel
Moderator

You could try to check the achievable bandwidth on this particular system by running a stream benchmark (e.g., BabelStream).
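
If building BabelStream is inconvenient, even a plain device-side copy kernel gives a rough ceiling to compare against. A minimal sketch, assuming the same DPC++ toolchain as in the original post (sizes and names are only illustrative):

#include <CL/sycl.hpp>
#include <chrono>
#include <iostream>
#include <vector>

// Stream-copy style bandwidth check: repeatedly copy src to dst on the device
// and report GB/s, counting one read and one write per element.
int main() {
  constexpr size_t N = 1 << 27;   // 128M floats, i.e. 512 MiB per vector
  constexpr size_t NITER = 100;
  std::vector<float> src(N, 1.0f), dst(N, 0.0f);

  cl::sycl::queue q{cl::sycl::gpu_selector{}};
  {
    cl::sycl::buffer<float, 1> bsrc(src.data(), cl::sycl::range<1>(N));
    cl::sycl::buffer<float, 1> bdst(dst.data(), cl::sycl::range<1>(N));

    auto t0 = std::chrono::high_resolution_clock::now();
    for (size_t it = 0; it < NITER; ++it) {
      q.submit([&](cl::sycl::handler &h) {
        auto in  = bsrc.get_access<cl::sycl::access::mode::read>(h);
        auto out = bdst.get_access<cl::sycl::access::mode::write>(h);
        h.parallel_for(cl::sycl::range<1>(N),
                       [=](cl::sycl::id<1> i) { out[i] = in[i]; });
      });
    }
    q.wait();  // make sure all copies have finished before stopping the clock
    auto t1 = std::chrono::high_resolution_clock::now();
    double s = std::chrono::duration<double>(t1 - t0).count();
    std::cout << "copy GB/s = "
              << 2.0 * N * sizeof(float) * NITER / (s * 1e9) << std::endl;
  }
  return 0;
}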


LaurentPlagne
Novice
Hi, thank you very much for your answer! I will post a stream benchmark result as soon as I get my laptop back.

I suspect that in this case this kernel actually exhausts the RAM bandwidth.
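
(As a rough order-of-magnitude check, and assuming the laptop uses dual-channel DDR4-2666, which is only a guess: the theoretical peak would be about 2 channels × 8 bytes × 2.666 GT/s ≈ 42.7 GB/s of system memory shared between the CPU and the iGPU, so 23 GB/s from a single streaming kernel would not be far from that ceiling.)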

My question was more about the optimality of this kernel for performing axpy on all Intel GPUs (including GPUs with VRAM).

Thank you again.
Gennady_F_Intel
Moderator (accepted solution)

As we only have the Beta version of oneMKL at this moment, it is too early to speak about the “optimality of this kernel for performing axpy on all Intel GPUs…”. I think we could get back to this performance query after the release timeframe.

