Hi,
I have a question about the VTune diagnostics for memory-bound problems on GPU.
I measure the observed bandwidth of a vector kernel (MemoryBoundKernel.hpp) for large (2<<27 elements) vectors of floats a and b:
    h.parallel_for(global_range, [=](id<1> i) {
      acc_b[i] += acc_a[i];
    });
  }); // end submit
I obtain 19 GB/s on my laptop (9300H with UHD 630), which I suspect is close to the maximum bandwidth of this machine.
What I find surprising is that the VTune GPU analysis highlights EU occupancy (in red) but does not highlight RAM bandwidth saturation (not in red).
Am I missing something obvious?
Hi,
Thanks for reaching out to us!
As your query is related to VTune, we are redirecting your post to the VTune forum so that VTune experts can guide you better.
Thanks & Regards
Goutham
Self-answer: I have replaced my kernel (20 GB/s)
q.submit([&](auto &h) { // Submit command group for execution
  // Create accessors; b is both read and written by +=, so it needs read_write
  auto acc_a = buf_a.template get_access<access::mode::read>(h);
  auto acc_b = buf_b.template get_access<access::mode::read_write>(h);
  auto global_range = range<1>(vsize); // Define the global range
  h.parallel_for(global_range, [=](id<1> i) {
    acc_b[i] += alpha * acc_a[i];
  });
}); // end submit
with the oneMKL axpy (26 GB/s):
mkl::blas::axpy(q, vsize, alpha, buf_a, 1, buf_b, 1);
and now VTune correctly highlights (in red) the DRAM bandwidth bound (83.5%).
Although I don't know how my kernel could be improved, VTune correctly identified that there was room for improvement.
BTW, what exactly does the 83.5% figure mean?
The metric is accurately described in the documentation, except for how the default threshold values (low, medium, high) are computed.
The low, medium, and high thresholds are just default values. You can change them by moving the sliders at the bottom of the graph. The defaults are evenly distributed.
Thank you!
BTW, do you know how the maximum system GPU bandwidth is evaluated?
I use the GPU version of oneMKL saxpy (on arrays of 2<<27 elements) and obtain 26 GB/s on my laptop (repeated 100 times to eliminate the device/host communication), while VTune sets the default maximum GPU bandwidth to 35 GB/s.
Is it OK to assume that oneMKL saxpy should saturate the available bandwidth?
Hi Plagne,
We are checking on this with our SME and will get back to you soon.
Thanks,
Adweidh