Vtune question for memory bound problem on GPU

LaurentPlagne
Novice
1,779 Views

Hi,

I have a question about the VTune diagnostics for a memory-bound problem on GPU.

I measure the observed bandwidth of a vector kernel (MemoryBoundKernel.hpp) for large vectors of floats a and b (2<<27 elements each):

    h.parallel_for(global_range, [=](id<1> i) {
        acc_b[i] += acc_a[i];
    });
});  // end submit

 

I obtain 19 GB/s on my laptop (9300H with UHD630), which I suspect is close to the maximum bandwidth of this machine.

What I find surprising is that the VTune GPU analysis highlights EU occupancy (in red) but does not highlight the RAM bandwidth saturation (not in red).

Am I missing something obvious?

 

 

6 Replies
GouthamK_Intel
Moderator
1,761 Views

Hi,

Thanks for reaching out to us!

As your query is related to VTune, we are redirecting your post to the VTune forum so that VTune experts can guide you better.

 

Thanks & Regards

Goutham

Plagne__Laurent
Beginner
1,728 Views

Self-answer: I have replaced my kernel (20 GB/s)

 q.submit([&](auto &h) {  // Submit command group for execution
    // Create accessors: a is only read, b is both read and written by +=
    auto acc_a = buf_a.template get_access<access::mode::read>(h);
    auto acc_b = buf_b.template get_access<access::mode::read_write>(h);

    auto global_range = range<1>(vsize);  // Global range: one work-item per element

    h.parallel_for(global_range, [=](id<1> i) {
        acc_b[i] += alpha * acc_a[i];
    });
});  // end submit

 

by oneAPI MKL axpy (26 GB/s)

 mkl::blas::axpy(q, vsize, alpha, buf_a, 1, buf_b, 1);

and now VTune correctly highlights (in red) the DRAM bandwidth bound (83.5%).

Although I don't know how my kernel could be improved, VTune correctly identified that there was room for improvement.

BTW, what exactly does the 83.5% figure mean?

 

[Attached screenshot: saxpy_vtune.png]

 

Kevin_O_Intel1
Employee
1,696 Views

The low, medium, and high thresholds are just the default values. You can change them by moving the sliders at the bottom of the graph. The defaults are evenly distributed.
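
As an illustration of what evenly distributed thresholds could look like, here is a minimal sketch. It is not VTune's implementation: it assumes the range from 0 to the maximum bandwidth is split into three equal bands, and the function name `classify_bandwidth` is hypothetical.

```cpp
#include <string>

// Hypothetical helper, NOT VTune's actual logic.
// Assumption: "evenly distributed" means [0, max_gbs] is cut into three
// equal bands labeled low, medium, and high.
std::string classify_bandwidth(double observed_gbs, double max_gbs) {
    const double low_limit = max_gbs / 3.0;           // upper edge of "low"
    const double medium_limit = 2.0 * max_gbs / 3.0;  // upper edge of "medium"
    if (observed_gbs < low_limit) return "low";
    if (observed_gbs < medium_limit) return "medium";
    return "high";
}
```

Under this assumption and a 35 GB/s maximum (the default mentioned elsewhere in this thread), 19 GB/s lands in the medium band and 26 GB/s in the high band. Moving the sliders corresponds to changing the two limits.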

LaurentPlagne
Novice
1,672 Views

Thank you !

BTW, do you know how the maximum system GPU bandwidth is evaluated?

I use the GPU version of oneMKL saxpy (on arrays of size 2<<27) and obtain 26 GB/s on my laptop (the call is repeated 100 times to amortize the device/host communication), while VTune sets the default maximum GPU bandwidth to 35 GB/s.

Is it OK to assume that oneMKL saxpy should saturate the available bandwidth?

Adweidh_Intel
Moderator
1,722 Views

Hi Plagne,


We are checking on this with our SME, will get back to you soon.


Thanks,

Adweidh

