Solved: VTune question about a memory-bound problem on GPU

LaurentPlagne · ‎08-17-2020

Hi,

I have a question about the VTune diagnostics for a memory-bound problem on a GPU.

I measure the observed bandwidth of a vector kernel (MemoryBoundKernel.hpp) on large vectors of floats a and b (2<<27 elements each):

     h.parallel_for(global_range, [=](id<1> i) {
         acc_b[i] += acc_a[i];
     });
 });  // end submit

I obtain 19 GB/s on my laptop (9300H with UHD630), which I suspect is close to the maximum bandwidth of this machine.
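
For reference, a bandwidth figure like this is usually derived by dividing the bytes the kernel must move by the elapsed time. The sketch below assumes that `b[i] += a[i]` costs one read of `a`, one read of `b`, and one write of `b` per element (12 bytes for floats). The function name and the sample timing are hypothetical and are not taken from MemoryBoundKernel.hpp.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) of the update b[i] += a[i]
// over n floats. Assumption: each element costs 2 reads + 1 write.
inline double effective_bandwidth_gbs(std::size_t n, double seconds) {
    assert(seconds > 0.0);
    const double bytes_moved = 3.0 * sizeof(float) * static_cast<double>(n);
    return bytes_moved / seconds / 1e9;
}
```

For example, `effective_bandwidth_gbs(std::size_t{1} << 27, 0.1)` gives about 16.1 GB/s. If the measurement counts traffic differently (for instance, ignoring the read of `b`), the reported figure changes accordingly.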

What I find surprising is that the VTune GPU analysis highlights EU occupancy (in red) but does not highlight RAM bandwidth saturation (not in red).

Am I missing something obvious?

GouthamK_Intel · ‎08-17-2020

Hi,

Thanks for reaching out to us!

As your query is related to VTune, we are redirecting your post to the VTune forum so that VTune experts can guide you better.

Thanks & Regards

Goutham

Plagne__Laurent · ‎08-18-2020

Self-answer: I have replaced my kernel (20 GB/s)

 q.submit([&](auto &h) {  // Submit command group for execution
     // Create accessors: a is only read; b is both read and written by +=
     auto acc_a = buf_a.template get_access<access::mode::read>(h);
     auto acc_b = buf_b.template get_access<access::mode::read_write>(h);

     auto global_range = range<1>(vsize);  // Global range: one work-item per element

     h.parallel_for(global_range, [=](id<1> i) {
         acc_b[i] += alpha * acc_a[i];
     });
 });  // end submit

with the oneAPI MKL axpy (26 GB/s)

 mkl::blas::axpy(q, vsize, alpha, buf_a, 1, buf_b, 1);

and now VTune correctly highlights (in red) that the code is DRAM bandwidth bound (83.5%).

Although I don't know how my kernel could be improved, VTune correctly identified that there was room for improvement.
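
For readers comparing the two versions: axpy computes y = alpha*x + y, so with unit strides it performs the same update as the handwritten kernel. Below is a plain host-side sketch of that semantics. The name `saxpy_ref` is illustrative, positive strides are assumed, and this is not how oneMKL implements the routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics of single-precision axpy: y[i*incy] += alpha * x[i*incx].
// Host-only illustration; assumes positive strides and large-enough vectors.
inline void saxpy_ref(std::size_t n, float alpha,
                      const std::vector<float>& x, std::size_t incx,
                      std::vector<float>& y, std::size_t incy) {
    assert(incx > 0 && incy > 0);
    assert(n == 0 || ((n - 1) * incx < x.size() && (n - 1) * incy < y.size()));
    for (std::size_t i = 0; i < n; ++i) {
        y[i * incy] += alpha * x[i * incx];
    }
}
```

Since both versions move the same data per element, the difference between 20 GB/s and 26 GB/s comes from how the work is scheduled on the device, not from the amount of memory traffic.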

BTW, what exactly does the 83.5% figure mean?

Plagne__Laurent · ‎08-18-2020

The metric is accurately described in the documentation, except for how the default threshold values (low, medium, high) are computed:

https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/memory-bound/dram-bound/dram-bandwidth-bound.html

Kevin_O_Intel1 · ‎08-19-2020

The low, medium, and high thresholds are just the default values. You can change them by moving the sliders at the bottom of the graph. The defaults are evenly distributed.
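
To make "evenly distributed" concrete, the sketch below assumes the low/medium/high boundaries sit at one third and two thirds of the maximum bandwidth, and treats a "bandwidth bound" percentage as the share of samples that land in the high band. Both the boundaries and the helper names are assumptions for illustration; VTune's actual computation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Band { Low, Medium, High };

// Classify one bandwidth sample against evenly spaced thresholds
// (assumed at 1/3 and 2/3 of max_gbs; not VTune's documented rule).
inline Band classify(double sample_gbs, double max_gbs) {
    assert(max_gbs > 0.0);
    const double ratio = sample_gbs / max_gbs;
    if (ratio < 1.0 / 3.0) return Band::Low;
    if (ratio < 2.0 / 3.0) return Band::Medium;
    return Band::High;
}

// Percentage of samples that fall in the High band.
inline double high_band_percent(const std::vector<double>& samples_gbs,
                                double max_gbs) {
    if (samples_gbs.empty()) return 0.0;
    std::size_t high = 0;
    for (double s : samples_gbs) {
        if (classify(s, max_gbs) == Band::High) ++high;
    }
    return 100.0 * static_cast<double>(high) /
           static_cast<double>(samples_gbs.size());
}
```

Under these assumptions, a 26 GB/s sample against a 35 GB/s maximum falls in the high band, and moving the thresholds would change which samples count as high.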

LaurentPlagne · ‎08-19-2020

Thank you!

BTW, do you know how the maximum system GPU bandwidth is evaluated?

I use the GPU version of oneMKL saxpy (on arrays of size 2<<27) and obtain 26 GB/s on my laptop (the call is repeated 100 times to make the device/host communication negligible), while VTune sets the default maximum GPU bandwidth to 35 GB/s.

Is it OK to assume that oneMKL saxpy should saturate the available bandwidth?
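
The effect of the 100 repetitions, and the gap to the 35 GB/s default, can be sketched with simple arithmetic. The model below assumes a fixed one-off host/device cost plus a constant per-call kernel time; the function names and the sample timings are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Apparent bandwidth in GB/s when a one-off cost (e.g. host/device transfer)
// is amortized over `reps` identical kernel calls. Simplified model.
inline double amortized_bandwidth_gbs(double bytes_per_call, double kernel_s,
                                      double one_off_s, std::size_t reps) {
    assert(reps > 0 && kernel_s > 0.0 && one_off_s >= 0.0);
    const double total_s = one_off_s + static_cast<double>(reps) * kernel_s;
    return static_cast<double>(reps) * bytes_per_call / total_s / 1e9;
}

// Fraction of a reference maximum bandwidth that a measurement reaches.
inline double utilization(double measured_gbs, double max_gbs) {
    assert(max_gbs > 0.0);
    return measured_gbs / max_gbs;
}
```

With a hypothetical 0.1 s kernel and a 0.1 s one-off cost, a single call reports half of the kernel's true rate, while 100 calls come within about 1% of it. The measured 26 GB/s is roughly 74% of the 35 GB/s default.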

Adweidh_Intel · ‎08-18-2020

Hi Plagne,

We are checking on this with our SME and will get back to you soon.

Thanks,

Adweidh