<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re:Does oneMKL axpy optimally tuned for Intel GPUs ? in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1203400#M29941</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Thanks for reaching out to us!&lt;/P&gt;&lt;P&gt;Since your issue is related to oneMKL, we are moving this query to the &lt;B&gt;Intel® oneAPI Math Kernel Library &amp;amp; Intel® Math Kernel Library&lt;/B&gt; forum for a faster response.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Goutham&lt;/P&gt;&lt;BR /&gt;</description>
    <pubDate>Mon, 24 Aug 2020 08:38:36 GMT</pubDate>
    <dc:creator>GouthamK_Intel</dc:creator>
    <dc:date>2020-08-24T08:38:36Z</dc:date>
    <item>
      <title>Does oneMKL axpy optimally tuned for Intel GPUs ?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1202382#M29940</link>
      <description>&lt;P&gt;I wonder whether I can improve the performance of the following snippet, which I would like to use to assess the memory bandwidth of Intel GPUs:&lt;/P&gt;</description>
&lt;LI-CODE lang="cpp"&gt;#include &amp;lt;CL/sycl.hpp&amp;gt;
#include &amp;lt;chrono&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;vector&amp;gt;
#include "mkl_sycl.hpp"
#include "dpc_common.hpp"


using namespace cl::sycl;
using namespace std;

constexpr size_t NITER=100; // amortize device/host communication over many calls

template &amp;lt;class T&amp;gt;
void bench_axpy(size_t N){

  std::vector&amp;lt;T&amp;gt; a(N,1);
  std::vector&amp;lt;T&amp;gt; b(N,2);
  gpu_selector device_selector;
  queue q(device_selector, dpc_common::exception_handler);
  
  auto start=std::chrono::high_resolution_clock::now();
  {  // Begin buffer scope
    buffer buf_a(&amp;amp;a[0], range(N));// Create buffers using DPC++ class buffer
    buffer buf_b(&amp;amp;b[0], range(N));

    const T alpha=0.5;
    try{
        for (size_t iter=0; iter&amp;lt;NITER; iter++) {
            mkl::blas::axpy(q, N, alpha, buf_a, 1, buf_b, 1);
        }
    }
    catch(cl::sycl::exception const&amp;amp; e) {
        std::cout &amp;lt;&amp;lt; "\t\tCaught synchronous SYCL exception during AXPY:\n"
          &amp;lt;&amp;lt; e.what() &amp;lt;&amp;lt; std::endl;
    }
  }  // End buffer scope: buffer destructors block until the kernels finish and data is copied back to the host
  auto end = std::chrono::high_resolution_clock::now();
  std::chrono::duration&amp;lt;double&amp;gt; elapsed_seconds = end-start;
  double time = elapsed_seconds.count();
  double GBs=double(3*N)*sizeof(T)*NITER/(time*1.e9); // 2 reads (a,b) + 1 write (b) per element
  std::cout &amp;lt;&amp;lt;"GBs="&amp;lt;&amp;lt;GBs&amp;lt;&amp;lt;std::endl; 
}


int main(int argc, char* argv[]) {

  bench_axpy&amp;lt;float&amp;gt;(2&amp;lt;&amp;lt;27);

  return 0;
}
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I compile with:&lt;/P&gt;
&lt;P&gt;dpcpp -O3 -fsycl -std=c++17 -DMKL_ILP64 -g -DNDEBUG -lOpenCL -lsycl -lmkl_sycl -lmkl_core -lmkl_sequential -lmkl_intel_lp64 ../src/portable_main.cpp&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;and obtain:&lt;/P&gt;
&lt;P&gt;GBs=23.09 on my machine with a UHD630 (and no VRAM).&lt;/P&gt;
&lt;P&gt;Is it possible to improve this ?&lt;/P&gt;
</description>
      <pubDate>Fri, 21 Aug 2020 08:53:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1202382#M29940</guid>
      <dc:creator>LaurentPlagne</dc:creator>
      <dc:date>2020-08-21T08:53:47Z</dc:date>
    </item>
    <item>
      <title>Re:Does oneMKL axpy optimally tuned for Intel GPUs ?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1203400#M29941</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Thanks for reaching out to us!&lt;/P&gt;&lt;P&gt;Since your issue is related to oneMKL, we are moving this query to the &lt;B&gt;Intel® oneAPI Math Kernel Library &amp;amp; Intel® Math Kernel Library&lt;/B&gt; forum for a faster response.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Goutham&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 24 Aug 2020 08:38:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1203400#M29941</guid>
      <dc:creator>GouthamK_Intel</dc:creator>
      <dc:date>2020-08-24T08:38:36Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Does oneMKL axpy optimally tuned for Intel GPUs ?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1204776#M29974</link>
      <description>No hints?</description>
      <pubDate>Thu, 27 Aug 2020 19:49:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1204776#M29974</guid>
      <dc:creator>LaurentPlagne</dc:creator>
      <dc:date>2020-08-27T19:49:02Z</dc:date>
    </item>
    <item>
      <title>Re:Does oneMKL axpy optimally tuned for Intel GPUs ?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1204916#M29976</link>
      <description>&lt;P&gt;You could try to check the achievable bandwidth on this particular system by running a stream benchmark (e.g. &lt;A href="http://uob-hpc.github.io/BabelStream/" rel="noopener noreferrer" target="_blank"&gt;BabelStream&lt;/A&gt;).&lt;/P&gt;</description>
      <pubDate>Fri, 28 Aug 2020 08:12:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1204916#M29976</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2020-08-28T08:12:11Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Does oneMKL axpy optimally tuned for Intel GPUs ?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1204920#M29978</link>
      <description>Hi, thank you very much for your answer! I will post a stream benchmark as soon as I get my laptop back.&lt;BR /&gt;&lt;BR /&gt;I suspect that in this case the kernel actually saturates the RAM bandwidth.&lt;BR /&gt;&lt;BR /&gt;My question was more about the optimality of this kernel for performing axpy on every Intel GPU (including GPUs with VRAM).&lt;BR /&gt;&lt;BR /&gt;Thank you again.</description>
      <pubDate>Fri, 28 Aug 2020 08:30:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1204920#M29978</guid>
      <dc:creator>LaurentPlagne</dc:creator>
      <dc:date>2020-08-28T08:30:18Z</dc:date>
    </item>
    <item>
      <title>Re:Does oneMKL axpy optimally tuned for Intel GPUs ?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1204925#M29979</link>
      <description>&lt;P&gt;As we have only the Beta version of oneMKL at this moment, it is too early to speak about the “optimality of this kernel for performing axpy on every Intel GPUs…”. I think we could get back to this performance query after the release timeframe.&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 28 Aug 2020 08:44:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1204925#M29979</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2020-08-28T08:44:00Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Does oneMKL axpy optimally tuned for Intel GPUs ?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1204951#M29980</link>
      <description>Fair enough. Thank you again.</description>
      <pubDate>Fri, 28 Aug 2020 11:06:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Does-oneMKL-axpy-optimally-tuned-for-Intel-GPUs/m-p/1204951#M29980</guid>
      <dc:creator>LaurentPlagne</dc:creator>
      <dc:date>2020-08-28T11:06:15Z</dc:date>
    </item>
  </channel>
</rss>

