About GPU profile log from command line

k_higashi · ‎08-06-2020

Hi.

I have two questions about what I see in the profile log.
I'm using the DPC++ compiler with Intel oneAPI.

Development environment:
OS: Windows 10 Home (64-bit)
CPU: Intel Core i7-1065G7 1.3GHz
GPU: Intel Iris Plus Graphics
I have installed the Intel oneAPI Base Toolkit beta Update 8.

I ran the following command.
test_matrix.exe is my executable, built with the DPC++ compiler.

advixe-cl --collect=roofline --profile-gpu --project-dir=C:\Test\Release --search-dir src:r=C:\Test\src -- C:\Test\Release\test_matrix.exe

The following warning is displayed during the survey analysis.

advixe: Collection started. To stop the collection, either press CTRL-C or enter from another console window: advixe-cl -r C:\Test\Release\e000\hs008 -command stop.
advixe: Warning: [Instrumentation Engine]: GTPin: GTPin didn't find any kernels... Exiting without doing anything.
advixe: Collection stopped.

・What is the cause of this warning?
In my executable, the matrix operation runs on the GPU as DPC++ parallel code.
Is it not being profiled correctly?

・The GFLOPS value displayed in the log is 0 for both the survey analysis and the trip counts analysis.

Output log example:

Elapsed Time: 5.23s
Total CPU time: 3.83101
Time in 1 vectorized loop: 0.298428
GFLOPS: 0

・Is the GPU code not being profiled correctly?
Is there a way to verify that the result is correct?

Best regards.

AntonT · ‎08-06-2020

Hi,

Regarding your first question: this is fine for the first 'survey' step of the collection.

For the second one: what is the size of the multiplied matrices? Please note that the kernel has to run for at least 10 ms (longer is better).
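One way to check the run time against the 10 ms guideline, and to get a rough GFLOPS figure to compare with the log, is to time the kernel by hand. The following is a minimal sketch, not code from this thread: the helper names (`time_ms`, `matmul_flop`, `gflops`, `matmul`) are hypothetical, the multiply runs on the host as a stand-in for the DPC++ kernel, and the 2*N^3 operation count assumes a plain dense N x N multiply.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Wall-clock time of one call to `work`, in milliseconds.
template <typename F>
double time_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Floating-point operations in a dense n x n multiply (assumption):
// n*n output elements, each needing n multiplies and n adds.
double matmul_flop(std::size_t n) {
    const double d = static_cast<double>(n);
    return 2.0 * d * d * d;
}

// Convert an operation count and a time in ms to GFLOPS.
// flop per ms equals 1e3 flop per second, so dividing by 1e6 gives 1e9 flop/s units.
double gflops(double flop, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return flop / elapsed_ms / 1e6;
}

// Naive host-side multiply c = a * b for row-major n x n matrices.
// This is only a stand-in for the real GPU kernel.
void matmul(const std::vector<float>& a, const std::vector<float>& b,
            std::vector<float>& c, std::size_t n) {
    c.assign(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

Passing a lambda that calls `matmul` to `time_ms`, and then the result to `gflops(matmul_flop(n), elapsed)`, gives both numbers. For a real DPC++ kernel the timed callable would need to submit the kernel and then wait for it to finish (for example with `queue::wait()`), otherwise only the submission is timed.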

BR, Anton

k_higashi · ‎08-06-2020

Thank you for your answer.

>first question

OK.

>second question

The size of the multiplied matrices is 1024.
I also tried a matrix size of 4096, but the GFLOPS value was still 0.

Output log (excerpt):

Elapsed Time: 14.74s
Total CPU time: 12.5268
Time in 2 vectorized loops: 12.02
GFLOPS: 0

I think the process takes more than 10 ms.

・Another question (about the reason for the restriction):

>that the kernel has to run at least 10ms (longer is better).

I would like to know why the kernel needs to run for more than 10 ms.

If the matrix is small (e.g. N=256x256) and the processing time is short,
is it impossible to run the Advisor roofline analysis?

Best regards.

AntonT · ‎08-07-2020

Hi,

Can you share your source code?

Advisor needs about 10 ms so that at least a couple of time-sampling hits land inside the kernel. In other words, a longer kernel gives more reliable results.
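The arithmetic behind this can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the sampling interval and the sustained GFLOPS rate are made-up parameters (the thread does not give Advisor's actual interval or the GPU's real throughput), and the 2*N^3 operation count assumes a plain dense multiply.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Whole sampling intervals that fit into the kernel's run time.
// interval_ms is a hypothetical value, not Advisor's documented interval.
long expected_sample_hits(double kernel_ms, double interval_ms) {
    assert(kernel_ms >= 0.0 && interval_ms > 0.0);
    return static_cast<long>(std::floor(kernel_ms / interval_ms));
}

// Rough run time in ms of a dense n x n multiply (2*n^3 operations)
// at an assumed sustained rate given in GFLOPS.
double estimated_kernel_ms(std::size_t n, double assumed_gflops) {
    assert(assumed_gflops > 0.0);
    const double d = static_cast<double>(n);
    const double flop = 2.0 * d * d * d;
    // flop / (GFLOPS * 1e9) is seconds; multiply by 1e3 for milliseconds.
    return flop / (assumed_gflops * 1e9) * 1e3;
}
```

With an assumed rate of 100 GFLOPS, `estimated_kernel_ms(256, 100.0)` is roughly a third of a millisecond, so a kernel of that size could finish between two samples and record no hits, while `estimated_kernel_ms(1024, 100.0)` is above 20 ms. The real figures depend on the hardware and on the kernel itself.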

BR, Anton

k_higashi · ‎08-16-2020

Hi.

I have attached the source code as a zip file (TestCodeDCP_IntelAdvisor.zip).

Development: Microsoft Visual Studio Professional 2019 Version 16.5.5

I want to measure the performance of the following GPU parallel processing part.
(src\multiply.cpp, lines 26-53)

Best regards.

Mariya_P_Intel · ‎06-08-2021

Hi @k_higashi, could you please try to use Advisor 2021.2.0 and let us know the result?

https://software.intel.com/content/www/us/en/develop/articles/oneapi-standalone-components.html#advisor

Thanks, Mariya

Gopika_Intel · ‎06-20-2021

Hi,

We have not heard from you in a while. If you need any additional information, please submit a new question as this thread will no longer be monitored.

Regards

Gopika