Analyzers
Support for Analyzers (Intel VTune™ Profiler, Intel Advisor, Intel Inspector)

Vtune question for memory bound problem on GPU

LaurentPlagne
Novice
722 Views

Hi,

I have a question about the VTune diagnostic for a memory-bound problem on the GPU.

I measured the observed bandwidth of a vector kernel (MemoryBoundKernel.hpp) on large vectors of floats a and b (2<<27 elements each):

    h.parallel_for(global_range, [=](id<1> i) {
        acc_b[i] += acc_a[i];
    });
}); // end submit

 

I obtain 19 GB/s on my laptop (9300H with UHD 630), which I suspect is close to the maximum bandwidth of this machine.
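For reference, a figure like this can be derived from the kernel's wall-clock time. The sketch below is only an illustration: the post does not say how the bytes were counted, so it assumes the common convention that `b[i] += a[i]` moves three floats per element (read a, read b, write b) and that 1 GB is 1e9 bytes. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed accounting for b[i] += a[i]: read a, read b, write b.
constexpr std::size_t kAccessesPerElement = 3;

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for one kernel pass
// over n floats that took `seconds` of wall-clock time.
double effective_bandwidth_gbs(std::size_t n, double seconds) {
    assert(seconds > 0.0);
    const double bytes = static_cast<double>(kAccessesPerElement) *
                         static_cast<double>(n) * sizeof(float);
    return bytes / seconds / 1e9;
}
```

Under a different counting convention (for example, counting only one read and one write), the same timing would give a lower figure, so the convention matters when comparing against a tool's reported maximum.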

What I find surprising is that the VTune GPU analysis highlights EU occupancy (in red) but does not highlight RAM bandwidth saturation.

Am I missing something obvious?

 

 

6 Replies
GouthamK_Intel
Moderator
704 Views

Hi,

Thanks for reaching out to us!

As your query is related to VTune, we are redirecting your post to the VTune forum so that VTune experts can guide you better.

 

Thanks & Regards

Goutham

Plagne__Laurent
Beginner
671 Views

Self-answer: I replaced my kernel (20 GB/s)

 q.submit([&](auto &h) { // Submit command group for execution
     // Create accessors; b is both read and written by +=, so it needs read_write
     auto acc_a = buf_a.template get_access<access::mode::read>(h);
     auto acc_b = buf_b.template get_access<access::mode::read_write>(h);

     auto global_range = range<1>(vsize); // Define the global range

     h.parallel_for(global_range, [=](id<1> i) {
         acc_b[i] += alpha * acc_a[i];
     });
 }); // end submit

 

with the oneAPI MKL axpy (26 GB/s):

 mkl::blas::axpy(q, vsize, alpha, buf_a, 1, buf_b, 1);

and now VTune correctly highlights (in red) the DRAM Bandwidth Bound metric (83.5%).

Although I don't know how my kernel could be improved, VTune correctly identified that there was room for improvement.

BTW, what exactly does the 83.5% figure mean?

 

[Attached screenshot: saxpy_vtune.png]

 

Plagne__Laurent
Beginner
667 Views

The metric is accurately described in the documentation, except for how the default threshold values (low, medium, high) are computed.

https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metr...

Kevin_O_Intel1
Employee
639 Views

The low, medium, and high thresholds are just the default values. You can change them by moving the sliders at the bottom of the graph. The defaults are evenly distributed.
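As a rough illustration of what "evenly distributed" thresholds could mean, the sketch below splits the range from 0 to a maximum bandwidth into three equal bands and reports the fraction of samples that land in the high band. This is an assumption made for illustration, not VTune's actual implementation; the names `classify` and `high_fraction` and the three-equal-bands reading are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Level { Low, Medium, High };

// Classify one bandwidth sample against thresholds that split
// [0, max_gbs] into three equal bands (assumed, for illustration only).
Level classify(double sample_gbs, double max_gbs) {
    assert(max_gbs > 0.0);
    if (sample_gbs < max_gbs / 3.0) return Level::Low;
    if (sample_gbs < 2.0 * max_gbs / 3.0) return Level::Medium;
    return Level::High;
}

// Fraction of equally spaced time samples that fall in the High band.
double high_fraction(const std::vector<double>& samples_gbs, double max_gbs) {
    if (samples_gbs.empty()) return 0.0;
    std::size_t high = 0;
    for (double s : samples_gbs) {
        if (classify(s, max_gbs) == Level::High) ++high;
    }
    return static_cast<double>(high) / static_cast<double>(samples_gbs.size());
}
```

With a 35 GB/s maximum, these assumed bands put the high threshold at about 23.3 GB/s, so a sample at 26 GB/s would count as high. Moving the sliders corresponds to changing the two cut points.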

LaurentPlagne
Novice
615 Views

Thank you !

BTW, do you know how the maximum system GPU bandwidth is evaluated?

I use the GPU version of oneMKL saxpy (on arrays of size 2<<27) and obtain 26 GB/s on my laptop (the call is repeated 100 times to amortize the device/host communication), while VTune sets the default maximum GPU bandwidth to 35 GB/s.

Is it OK to assume that oneMKL saxpy should saturate the available bandwidth?
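To make the two numbers above concrete, here is a small sketch of why repeating the call hides a one-time transfer cost, and how the achieved figure compares with a reference maximum. The function names and the timing model (a fixed transfer time plus a constant time per call) are assumptions for illustration, not measurements from this thread.

```cpp
#include <cassert>

// Apparent bandwidth in GB/s when a one-time host/device transfer is
// included in the timed region along with `repeats` identical calls.
double apparent_bandwidth_gbs(double bytes_per_call, int repeats,
                              double seconds_per_call, double transfer_seconds) {
    assert(repeats > 0 && seconds_per_call > 0.0 && transfer_seconds >= 0.0);
    const double total_bytes = bytes_per_call * repeats;
    const double total_seconds = transfer_seconds + seconds_per_call * repeats;
    return total_bytes / total_seconds / 1e9;
}

// Achieved bandwidth as a fraction of a reference maximum.
double fraction_of_max(double achieved_gbs, double max_gbs) {
    assert(max_gbs > 0.0);
    return achieved_gbs / max_gbs;
}
```

In this model the apparent bandwidth approaches the per-call bandwidth as the repeat count grows, and `fraction_of_max(26.0, 35.0)` comes out at roughly 0.74, i.e. the saxpy run reaches about three quarters of the 35 GB/s default.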

Adweidh_Intel
Moderator
665 Views

Hi Plagne,


We are checking on this with our SME and will get back to you soon.


Thanks,

Adweidh

