Intel® oneAPI Math Kernel Library

Is oneMKL axpy optimally tuned for Intel GPUs?

LaurentPlagne
Novice

I wonder if I can improve the performance of the following snippet, which I would like to use to assess the bandwidth of Intel GPUs:

#include <CL/sycl.hpp>
#include <chrono>
#include <iostream>
#include <vector>
#include "mkl_sycl.hpp"
#include "dpc_common.hpp"


using namespace cl::sycl;
using namespace std;

constexpr size_t NITER=100; //amortize device/host communication
using Scalar=float;

template <class T>
void bench_axpy(size_t N){

  std::vector<T> a(N,1);
  std::vector<T> b(N,2);
  gpu_selector device_selector;
  queue q(device_selector, dpc_common::exception_handler);
  
  auto start=std::chrono::high_resolution_clock::now();
  {  // Begin buffer scope
    buffer buf_a(&a[0], range(N));// Create buffers using DPC++ class buffer
    buffer buf_b(&b[0], range(N));

    const T alpha=0.5;
    try{
        for (size_t iter=0; iter<NITER; iter++) {
            mkl::blas::axpy(q, N, alpha, buf_a, 1, buf_b, 1);
        }
    }
    catch(cl::sycl::exception const& e) {
        std::cout << "\t\tCaught synchronous SYCL exception during AXPY:\n"
          << e.what() << std::endl;
    }
  }  // End buffer scope: the buffer destructors wait for the kernels and copy the data back to the host
  auto end = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> elapsed_seconds = end-start;
  double time = elapsed_seconds.count();
  double GBs=double(3*N)*sizeof(T)*NITER/(time*1.e9); // axpy traffic per call: 2 reads + 1 write per element
  std::cout <<"GBs="<<GBs<<std::endl; 
}


int main(int argc, char* argv[]) {

  bench_axpy<float>(2<<27);

  return 0;
}


I compile with:

dpcpp -O3 -fsycl -std=c++17 -DMKL_ILP64 -g -DNDEBUG -lOpenCL -lsycl -lmkl_sycl -lmkl_core -lmkl_sequential -lmkl_intel_lp64 ../src/portable_main.cpp

and obtain:

GBs=23.09 on my machine with a UHD 630 (and no VRAM).

Is it possible to improve this?
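
For reference, a back-of-the-envelope check of that figure, assuming N = 2<<27 = 268,435,456 floats as in main above (note the 3*N formula only counts the nominal 2-reads-plus-1-write kernel traffic, not the host/device copies at buffer creation and destruction, which NITER=100 is meant to amortize):

3 * 268,435,456 * 4 bytes ≈ 3.22 GB of traffic per axpy call
3.22 GB * 100 iterations ≈ 322 GB in total
322 GB / 23.09 GB/s ≈ 14 s spent inside the buffer scope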

6 Replies
GouthamK_Intel
Moderator

Hi,

Thanks for reaching out to us!

Since your issue is related to oneMKL, we are moving this query to the Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library forum for a faster response.


Regards

Goutham


Gennady_F_Intel
Moderator

You could try to check the achievable bandwidth on this particular system by running a stream benchmark (e.g., BabelStream).
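
For a quick sanity check without building a separate benchmark, a minimal SYCL triad kernel along the following lines should give a comparable peak-bandwidth estimate. This is only a sketch, not BabelStream itself: the vector size, iteration count, and kernel name are illustrative, and it reuses the same buffer-scope timing idea as the snippet above.

#include <CL/sycl.hpp>
#include <chrono>
#include <iostream>
#include <vector>

using namespace cl::sycl;

int main() {
  constexpr size_t N = 1 << 26;   // illustrative size (~64M floats, ~256 MB per vector)
  constexpr size_t NITER = 100;
  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);
  queue q{gpu_selector{}};

  auto start = std::chrono::high_resolution_clock::now();
  {
    buffer<float, 1> buf_a(a.data(), range<1>(N));
    buffer<float, 1> buf_b(b.data(), range<1>(N));
    buffer<float, 1> buf_c(c.data(), range<1>(N));
    for (size_t iter = 0; iter < NITER; iter++) {
      q.submit([&](handler& h) {
        auto A = buf_a.get_access<access::mode::read>(h);
        auto B = buf_b.get_access<access::mode::read>(h);
        auto C = buf_c.get_access<access::mode::write>(h);
        h.parallel_for<class triad_kernel>(range<1>(N), [=](id<1> i) {
          C[i] = A[i] + 0.5f * B[i];  // triad: 2 reads + 1 write per element
        });
      });
    }
    q.wait();
  }  // buffer destructors copy the results back to the host
  auto end = std::chrono::high_resolution_clock::now();
  double time = std::chrono::duration<double>(end - start).count();
  std::cout << "triad GB/s = " << double(3 * N) * sizeof(float) * NITER / (time * 1.e9) << std::endl;
  return 0;
}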


LaurentPlagne
Novice
Hi, thank you very much for your answer! I will post a stream benchmark result as soon as I get my laptop back.

I suspect that in this case the kernel actually saturates the RAM bandwidth.

My question was more about the optimality of this kernel for performing axpy on all Intel GPUs (including GPUs with VRAM).

Thank you again.
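
For completeness, which device (and how much global memory) the queue actually ends up using can be printed with a small query like the one below. This is just a sketch using the standard SYCL 1.2.1 info descriptors; the gpu_selector mirrors the one in the benchmark above.

#include <CL/sycl.hpp>
#include <iostream>

using namespace cl::sycl;

int main() {
  queue q{gpu_selector{}};
  auto dev = q.get_device();
  std::cout << "device:                " << dev.get_info<info::device::name>() << "\n"
            << "global memory (bytes): " << dev.get_info<info::device::global_mem_size>() << "\n"
            << "max compute units:     " << dev.get_info<info::device::max_compute_units>() << std::endl;
  return 0;
}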
Gennady_F_Intel
Moderator (accepted solution)

Since we only have the Beta version of oneMKL at this moment, it is too early to speak about the “optimality of this kernel for performing axpy on every Intel GPU…”. I think we could get back to this performance query after the release timeframe.

