Intel® oneAPI Math Kernel Library

Is oneMKL axpy optimally tuned for Intel GPUs?

LaurentPlagne
Novice

I wonder if I can improve the performance of the following snippet, which I would like to use to assess the bandwidth of Intel GPUs:

#include <CL/sycl.hpp>
#include <chrono>
#include <cmath>
#include <cstring>
#include <fstream>
#include <iostream>
#include <random>
#include <stdio.h>
#include "mkl_sycl.hpp"
#include "dpc_common.hpp"


using namespace cl::sycl;
using namespace std;

constexpr size_t NITER=100; //amortize device/host communication
using Scalar=float;

template <class T>
void bench_axpy(size_t N){

  std::vector<T> a(N,1);
  std::vector<T> b(N,2);
  gpu_selector device_selector;
  queue q(device_selector, dpc_common::exception_handler);
  
  auto start=std::chrono::high_resolution_clock::now();
  {  // Begin buffer scope
    buffer buf_a(&a[0], range(N));// Create buffers using DPC++ class buffer
    buffer buf_b(&b[0], range(N));

    const T alpha=0.5;
    try{
        for (size_t iter=0; iter<NITER; iter++) {
            mkl::blas::axpy(q, N, alpha, buf_a, 1, buf_b, 1);
        }
    }
    catch(cl::sycl::exception const& e) {
        std::cout << "\t\tCaught synchronous SYCL exception during AXPY:\n"
          << e.what() << std::endl;
    }
  } // End buffer scope: the buffer destructors wait for the queued kernels and copy the results back to the host
  auto end = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> elapsed_seconds = end-start;
  double time = elapsed_seconds.count();
  double GBs=double(3*N)*sizeof(T)*NITER/(time*1.e9);//2R+1W
  std::cout <<"GBs="<<GBs<<std::endl; 
}


int main(int argc, char* argv[]) {

  bench_axpy<float>(2<<27); // 2<<27 = 2^28 elements, i.e. 1 GiB per float vector

  return 0;
}

I compile with:

dpcpp -O3 -fsycl -std=c++17 -DMKL_ILP64 -g -DNDEBUG -lOpenCL -lsycl -lmkl_sycl -lmkl_core -lmkl_sequential -lmkl_intel_lp64 ../src/portable_main.cpp

and obtain:

GBs=23.09 on my machine with a UHD 630 (and no VRAM).

Is it possible to improve this?
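
For comparison, a hand-written SYCL axpy along the lines of the sketch below (the axpy_handwritten and axpy_kernel names are just illustrative; it reuses the headers and using-declarations from the snippet above) could be timed in place of the mkl::blas::axpy call inside the same buffer scope, to see whether the library call already saturates the memory bandwidth:

template <class T> class axpy_kernel; // kernel name, forward-declared

template <class T>
void axpy_handwritten(queue &q, size_t N, T alpha,
                      buffer<T, 1> &buf_a, buffer<T, 1> &buf_b) {
  q.submit([&](handler &h) {
    auto a = buf_a.template get_access<access::mode::read>(h);
    auto b = buf_b.template get_access<access::mode::read_write>(h);
    // b = alpha*a + b : reads a and b, writes b (same 2R+1W traffic as axpy)
    h.parallel_for<axpy_kernel<T>>(range<1>(N),
                                   [=](id<1> i) { b[i] = alpha * a[i] + b[i]; });
  });
}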
6 Replies
GouthamK_Intel
Moderator

Hi,

Thanks for reaching out to us!

Since your issue is related to oneMKL, we are moving this query to the Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library forum for a faster response.


Regards,

Goutham


Gennady_F_Intel
Moderator

You could check the achievable bandwidth on this particular system by running a stream benchmark (e.g., BabelStream).


LaurentPlagne
Novice
Hi, thank you very much for your answer! I will post stream benchmark results as soon as I get my laptop back.

I suspect that in this case this kernel actually exhausts the RAM bandwidth.
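
Back-of-the-envelope, and assuming a typical dual-channel DDR4-2666 configuration for this laptop (I would have to confirm the exact memory speed): the theoretical peak is about 2 channels x 8 bytes x 2.666 GT/s ≈ 42.7 GB/s, so the 23 GB/s I measure would be roughly half of it; a stream benchmark should show how much of that peak is actually reachable on this machine.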

My question was more about the optimality of this kernel for performing axpy on every Intel GPU (including GPUs with VRAM).

Thank you again.
Gennady_F_Intel
Moderator

Since oneMKL is in Beta at this moment, it is too early to speak about the “optimality of this kernel for performing axpy on every Intel GPU…”. I think we can get back to this performance query after the release timeframe.

