I wonder if I can improve the performance of the following snippet, which I would like to use to assess the bandwidth of Intel GPUs:
#include <CL/sycl.hpp>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <random>
#include <vector>
#include "mkl_sycl.hpp"
#include "dpc_common.hpp"
using namespace cl::sycl;
using namespace std;
constexpr size_t NITER = 100; // amortize device/host communication

template <class T>
void bench_axpy(size_t N) {
  std::vector<T> a(N, 1);
  std::vector<T> b(N, 2);
  gpu_selector device_selector;
  queue q(device_selector, dpc_common::exception_handler);
  auto start = std::chrono::high_resolution_clock::now();
  { // Buffer scope: destruction at the closing brace waits for the kernels
    // to finish and copies data back, so the timing includes all transfers.
    buffer buf_a(&a[0], range(N)); // Create buffers using the DPC++ buffer class
    buffer buf_b(&b[0], range(N));
    const T alpha = 0.5;
    try {
      for (size_t iter = 0; iter < NITER; iter++) {
        mkl::blas::axpy(q, N, alpha, buf_a, 1, buf_b, 1);
      }
    } catch (cl::sycl::exception const& e) {
      std::cout << "\t\tCaught synchronous SYCL exception during AXPY:\n"
                << e.what() << std::endl;
    }
  }
  auto end = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> elapsed_seconds = end - start;
  double time = elapsed_seconds.count();
  double GBs = double(3 * N) * sizeof(T) * NITER / (time * 1.e9); // 2 reads + 1 write per axpy
  std::cout << "GBs=" << GBs << std::endl;
}
int main(int argc, char* argv[]) {
  bench_axpy<float>(2 << 27); // 2^28 elements, i.e. 1 GiB per float vector
  return 0;
}
I compile with:
dpcpp -O3 -fsycl -std=c++17 -DMKL_ILP64 -g -DNDEBUG -lOpenCL -lsycl -lmkl_sycl -lmkl_core -lmkl_sequential -lmkl_intel_lp64 ../src/portable_main.cpp
and obtain GBs=23.09 on my machine with a UHD630 (which has no dedicated VRAM).
Is it possible to improve this?
Hi,
Thanks for reaching out to us!
Since your issue is related to oneMKL, we are moving this query to the Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library forum for a faster response.
Regards
Goutham
You could try to check the achievable bandwidth on this particular system by running a STREAM benchmark (e.g. BabelStream).
I suspect that in this case this kernel actually exhausts the RAM bandwidth.
My question was more about the optimality of this kernel for performing axpy on every Intel GPU (including GPUs with dedicated VRAM).
Thank you again.
Since we only have the Beta version of oneMKL at the moment, it is too early to speak about the "optimality of this kernel for performing axpy on every Intel GPU". I think we could get back to this performance query after the release timeframe.