k_higashi
Beginner
375 Views

About GPU profile log from command line

Hi.

I have two questions about what I see in the profile log.
I'm using the DPC++ compiler from Intel oneAPI.

Development environment:
OS: Windows 10 Home (64-bit)
CPU: Intel Core i7-1065G7 1.3GHz
GPU: Intel Iris Plus Graphics
Toolkit: Intel oneAPI Base Toolkit beta Update 8

I ran the following command.
test_matrix.exe is my executable, built with the DPC++ compiler.

advixe-cl --collect=roofline --profile-gpu --project-dir=C:\Test\Release --search-dir src:r=C:\Test\src -- C:\Test\Release\test_matrix.exe

 

<Question 1>

The following warning is displayed during the survey analysis.

advixe: Collection started. To stop the collection, either press CTRL-C or enter from another console window: advixe-cl -r C:\Test\Release\e000\hs008 -command stop.
advixe: Warning: [Instrumentation Engine]: GTPin: GTPin didn't find any kernels... Exiting without doing anything.
advixe: Collection stopped.

・What causes this warning?
・In my executable, the matrix operation is executed on the GPU by DPC++ parallel processing. Does the warning mean it is not being profiled correctly?

<Question 2>

The GFLOPS value displayed in the log is 0 for both the survey analysis and the tripcounts analysis.

Output log example:

Elapsed Time: 5.23s
Total CPU time: 3.83101
Time in 1 vectorized loop: 0.298428
GFLOPS: 0

・Does this mean it is not being profiled correctly?
・Is there a way to check that the result is correct?

Best regards.
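
One rough way to sanity-check a reported GFLOPS value is to estimate it by hand. The sketch below assumes the kernel is a plain dense N x N matrix multiply, which costs about 2*N^3 floating-point operations; the actual kernel in test_matrix.exe may differ. The function names are hypothetical and are not part of Advisor or DPC++.

```cpp
#include <cstdint>

// Assumption: naive dense N x N matrix multiply.
// There are N*N output elements, each needing N multiplies and N adds.
constexpr std::uint64_t matmul_flops(std::uint64_t n) {
    return 2 * n * n * n;
}

// Achieved GFLOPS for a given operation count and a measured time in seconds.
constexpr double gflops(std::uint64_t flops, double seconds) {
    return static_cast<double>(flops) / seconds / 1e9;
}
```

For N = 1024 this gives 2,147,483,648 operations, so a kernel that finished in one second would be running at roughly 2.1 GFLOPS. Any kernel that really performs this work in finite time has a non-zero rate, so a reported 0 suggests the floating-point work was not attributed to the kernel rather than that none was done.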

AntonT
Employee
363 Views

Hi,

Regarding your first question: this warning is fine for the first 'survey' step of the collection.

For the second one: what is the size of the multiplied matrices? Please note that the kernel has to run for at least 10 ms (longer is better).

BR, Anton

k_higashi
Beginner
345 Views

Thank you for your answer.
 
>first question
OK.
 
>second question
The size of the multiplied matrices is 1024.
I also tried a matrix size of 4096, but the GFLOPS value was still 0.
 
Output log (excerpt):
Elapsed Time: 14.74s
Total CPU time: 12.5268
Time in 2 vectorized loops: 12.02
GFLOPS: 0
I think the processing takes more than 10 ms.
 
・Another question (about the reason for the restriction):
>that the kernel has to run at least 10ms (longer is better).

I would like to know why the kernel needs to run for more than 10 ms.
 
If the matrix size is small (e.g. N=256x256) and the processing time is short,
is it impossible to run the Advisor roofline analysis?
 
Best regards.
AntonT
Employee
336 Views

Hi,

Can you share your source code?

Advisor needs the kernel to run for at least 10 ms so that at least a couple of time-sampling hits land inside it. In other words, a longer kernel gives more reliable results.

BR, Anton

 

k_higashi
Beginner
320 Views

Hi.

I have attached the source code as a zip file. (TestCodeDCP_IntelAdvisor.zip)

Development environment: Microsoft Visual Studio Professional 2019 Version 16.5.5

I want to measure the performance of the following GPU parallel processing part.
(src\multiply.cpp, lines 26-53)

Best regards.
