LaurentPlagne
Novice
492 Views

Vtune question for memory bound problem on GPU

Hi,

I have a question about the VTune diagnostics for a memory-bound problem on a GPU.

I measure the observed bandwidth of a vector kernel (MemoryBoundKernel.hpp) on large vectors of floats a and b (2<<27 elements each):

    h.parallel_for(global_range, [=](id<1> i) {
        acc_b[i] += acc_a[i];
    });
});  // end submit

 

I obtain 19 GB/s on my laptop (9300H with UHD630), which I suspect is close to the maximum bandwidth of this machine.

What I find surprising is that the VTune GPU analysis highlights EU occupancy (in red) but does not highlight RAM bandwidth saturation (not in red).

Am I missing something obvious?

 

 

GouthamK_Intel
Moderator
475 Views

Hi,

Thanks for reaching out to us!

As your query is related to VTune, we are redirecting your post to the VTune forum so that VTune experts can guide you better.

 

Thanks & Regards

Goutham

Plagne__Laurent
Beginner
442 Views

Self-answer: I replaced my kernel (20 GB/s)

 q.submit([&](auto &h) {  // Submit command group for execution
    // Create accessors; b is both read and written by +=, so it needs read_write
    auto acc_a = buf_a.template get_access<access::mode::read>(h);
    auto acc_b = buf_b.template get_access<access::mode::read_write>(h);

    auto global_range = range<1>(vsize);  // Define the global range

    h.parallel_for(global_range, [=](id<1> i) {
        acc_b[i] += alpha * acc_a[i];
    });
 });  // end submit

 

with the oneAPI MKL axpy (26 GB/s)

 mkl::blas::axpy(q, vsize, alpha, buf_a, 1, buf_b, 1);

and now VTune correctly highlights (in red) that the run is DRAM bandwidth bound (83.5%).

Although I don't know how my kernel could be improved, VTune correctly identified that there was room for improvement.

BTW, what exactly does the 83.5% figure mean?

 

[Attached image: saxpy_vtune.png]

 

Plagne__Laurent
Beginner
437 Views

The metric is accurately described in the documentation, except for how the default threshold values (low, medium, high) are computed.

https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metr...

Kevin_O_Intel1
Employee
410 Views

The low, medium, and high thresholds are just the default values. You can change them by moving the sliders at the bottom of the graph. The defaults are evenly distributed.

LaurentPlagne
Novice
386 Views

Thank you!

BTW, do you know how the maximum system GPU bandwidth is evaluated?

I use the GPU version of oneMKL saxpy (on arrays of 2<<27 elements) and obtain 26 GB/s on my laptop (repeated 100 times to amortize the device/host communication), while VTune sets the default maximum GPU bandwidth to 35 GB/s.

Is it OK to assume that oneMKL saxpy should saturate the available bandwidth?

Adweidh_Intel
Moderator
435 Views

Hi Plagne,


We are checking on this with our SME and will get back to you soon.


Thanks,

Adweidh