I wonder if I can improve the performance of the following snippet, which I would like to use to assess the bandwidth of Intel GPUs:
#include <CL/sycl.hpp>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <random>
#include <vector>
#include "mkl_sycl.hpp"
#include "dpc_common.hpp"
using namespace cl::sycl;
using namespace std;
constexpr size_t NITER = 100; // amortize device/host communication

template <class T>
void bench_axpy(size_t N) {
  std::vector<T> a(N, 1);
  std::vector<T> b(N, 2);
  gpu_selector device_selector;
  queue q(device_selector, dpc_common::exception_handler);
  auto start = std::chrono::high_resolution_clock::now();
  { // Buffer scope: destruction at the closing brace waits for the kernels
    // to finish and copies data back, so the timing includes all transfers.
    buffer buf_a(&a[0], range(N)); // Create buffers using the DPC++ buffer class
    buffer buf_b(&b[0], range(N));
    const T alpha = 0.5;
    try {
      for (size_t iter = 0; iter < NITER; iter++) {
        mkl::blas::axpy(q, N, alpha, buf_a, 1, buf_b, 1);
      }
    } catch (cl::sycl::exception const& e) {
      std::cout << "\t\tCaught synchronous SYCL exception during AXPY:\n"
                << e.what() << std::endl;
    }
  }
  auto end = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> elapsed_seconds = end - start;
  double time = elapsed_seconds.count();
  double GBs = double(3 * N) * sizeof(T) * NITER / (time * 1.e9); // 2 reads + 1 write per axpy
  std::cout << "GBs=" << GBs << std::endl;
}
int main(int argc, char* argv[]) {
  bench_axpy<float>(2 << 27); // 2^28 elements, i.e. 1 GiB per float vector
  return 0;
}
I compile with:
dpcpp -O3 -fsycl -std=c++17 -DMKL_ILP64 -g -DNDEBUG -lOpenCL -lsycl -lmkl_sycl -lmkl_core -lmkl_sequential -lmkl_intel_lp64 ../src/portable_main.cpp
and obtain GBs=23.09 on my machine with a UHD630 (which has no dedicated VRAM).
Is it possible to improve this?
Hi,
Thanks for reaching out to us!
Since your issue is related to oneMKL, we are moving this query to the Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library forum for a faster response.
Regards
Goutham
You could try to check the achievable bandwidth on this particular system by running a STREAM benchmark (e.g. BabelStream).
I suspect that in this case this kernel actually exhausts the RAM bandwidth.
My question was more about the optimality of this kernel for performing axpy on every Intel GPU (including GPUs with dedicated VRAM).
Thank you again.
Since we only have the Beta version of oneMKL at the moment, it is too early to speak about the "optimality of this kernel for performing axpy on every Intel GPU". I think we could get back to this performance query after the release timeframe.