Solved: Re: VTune gpu-hotspots for Iris Xe queue

Viet-Duc · 10-23-2021

Dear Support Team,

I am testing gpu-hotspots on DevCloud following the article here:

https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/methodologies/software-optimization-for-intel-gpus.html

I was able to reproduce the result with the Gen9 queue. For the Iris Xe MAX queues, which contain 2 and 4 GPUs respectively, VTune reports that 'Trace GPU programming APIs' does not support multiple GPU adapters. Please see the attached file.

My questions are as follows:

1. How can I select only one Xe GPU for hotspot analysis?

2. In the 'Analyze Multiple GPUs' section here:

https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/accelerators-group/gpu-offload-analysis.html

Quote: "You can also find this information on your Windows (see Task Manager) or Linux (run lspci) system"

Which lspci value should I use, and how can I pass it to the VTune command line?


Viet-Duc · 10-24-2021

Here is the text output of vtune:

Elapsed Time: 5.211s
    GPU Time: 0.009s
0:27:0.0 : DG1 [Iris Xe MAX Graphics]
    GPU Tile 0
        EU Array Stalled/Idle: 100.0%
         | The percentage of time when the EUs were stalled or idle is high,
         | which has a negative impact on compute-bound applications.
         |
            Occupancy: 0.0%
             | Several factors including shared local memory, use of memory
             | barriers, and inefficient work scheduling can cause a low value
             | of the occupancy metric.
             |

                Hottest GPU Computing Tasks with Low Occupancy
                Computing Task         Total Time  Global Size  Local Size  SIMD Width  Occupancy(%)  SIMD Utilization(%)
                ---------------------  ----------  -----------  ----------  ----------  ------------  -------------------
                _ZTSZ4mainEUlT_E54_37      0.297s  2048 x 2048     512 x 1          32          0.0%               100.0%
            Sampler Busy: 0.0%

                Hottest GPU Computing Tasks with High Sampler Usage
                Computing Task  Total Time
                --------------  ----------
            DRAM Bandwidth Bound: 0.0%

                Hottest GPU Computing Tasks Bound by DRAM Bandwidth
                Computing Task  Total Time
                --------------  ----------
0:32:0.0 : DG1 [Iris Xe MAX Graphics]
    GPU Tile 0
        EU Array Stalled/Idle: 100.0%
         | The percentage of time when the EUs were stalled or idle is high,
         | which has a negative impact on compute-bound applications.
         |
            Occupancy: 0.0%
             | Several factors including shared local memory, use of memory
             | barriers, and inefficient work scheduling can cause a low value
             | of the occupancy metric.
             |

                Hottest GPU Computing Tasks with Low Occupancy
                Computing Task  Total Time  Global Size  Local Size  SIMD Width  Occupancy(%)  SIMD Utilization(%)
                --------------  ----------  -----------  ----------  ----------  ------------  -------------------
            Sampler Busy: 0.0%

                Hottest GPU Computing Tasks with High Sampler Usage
                Computing Task  Total Time
                --------------  ----------
            DRAM Bandwidth Bound: 0.0%

                Hottest GPU Computing Tasks Bound by DRAM Bandwidth
                Computing Task  Total Time
                --------------  ----------

I guess '0:27:0.0' and '0:32:0.0' are the PCIe designations of the Xe GPUs. It seems that the first one is selected, yet 'EU Array Stalled/Idle' is 100%, i.e. the GPU is not used at all. Opening the result with vtune-gui shows the error message from my opening post.

So unless there is a way to select only one Xe GPU, it may not be possible to view the result with vtune-gui.
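
Reading the per-adapter metrics out of a text summary like the one above can be scripted. The following is a minimal Python sketch, not part of VTune: the regular expressions and the function names `metrics_per_adapter` and `idle_adapters` are illustrative assumptions, written only against the layout of the output shown in this post (abbreviated here as `SAMPLE`).

```python
import re

# Adapter header as it appears in the summary, e.g.
# "0:27:0.0 : DG1 [Iris Xe MAX Graphics]"
ADAPTER = re.compile(r"^(\d+:\d+:\d+\.\d+) : (.+)$")
# Percentage metric line, e.g. "        EU Array Stalled/Idle: 100.0%"
METRIC = re.compile(r"^\s*([A-Za-z][A-Za-z0-9 /]*): ([\d.]+)%\s*$")

# Abbreviated copy of the summary posted above.
SAMPLE = """\
Elapsed Time: 5.211s
    GPU Time: 0.009s
0:27:0.0 : DG1 [Iris Xe MAX Graphics]
    GPU Tile 0
        EU Array Stalled/Idle: 100.0%
            Occupancy: 0.0%
0:32:0.0 : DG1 [Iris Xe MAX Graphics]
    GPU Tile 0
        EU Array Stalled/Idle: 100.0%
            Occupancy: 0.0%
"""


def metrics_per_adapter(summary: str) -> dict[str, dict[str, float]]:
    """Group percentage metrics under the adapter header that precedes them."""
    result: dict[str, dict[str, float]] = {}
    current = None
    for line in summary.splitlines():
        adapter = ADAPTER.match(line)
        if adapter:
            current = result.setdefault(adapter.group(1), {})
            continue
        metric = METRIC.match(line)
        if metric and current is not None:
            current[metric.group(1)] = float(metric.group(2))
    return result


def idle_adapters(metrics: dict[str, dict[str, float]], threshold: float = 100.0) -> list[str]:
    """Return adapters whose 'EU Array Stalled/Idle' is at or above threshold."""
    return [
        bdf
        for bdf, values in metrics.items()
        if values.get("EU Array Stalled/Idle", 0.0) >= threshold
    ]


if __name__ == "__main__":
    parsed = metrics_per_adapter(SAMPLE)
    for bdf, values in parsed.items():
        print(bdf, values)
    print("fully idle:", idle_adapters(parsed))
```

With the summary above, both adapters end up in the idle list, which matches the observation that neither GPU appears to have recorded any work.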

JaideepK_Intel · 10-25-2021

Hi,

Thank you for posting in Intel Communities.

We are investigating your issue on our end. Could you please share your VTune version so that we can better understand the problem?

Thanks,

Jaideep

Viet-Duc · 10-27-2021

Dear Jaideep,

Sorry for the late follow-up. Thanks for taking a look at this problem.

DevCloud has version 2021.7.1. Since DevCloud does not support X11 forwarding, I installed version 2021.8.0 locally.

I couldn't find version 2021.7.1 to download, since the download page always offers the latest one.

Nevertheless, I could still load the Gen9 results, as mentioned in the original post, so I don't think the version is the issue here.

From the help message of 'collect':

$ vtune -help collect gpu-hotspots
...
collect-programming-api

  Analyze DPC++, OpenCL, and Intel Media SDK programs running on Intel
  Processor Graphics. This option may affect the performance of your
  application on the CPU side.

  Default value: true
  Possible values: true false

For multiple GPUs, this option is disabled, as shown in the error message. Thus:

I can run SYCL code without issue and obtain correct results.
I can still profile SYCL code, but the metrics all report '0.0%', as shown in my second post.
I cannot load the multi-GPU results into VTune.

Since I don't have an Iris card on my side, I could not check whether VTune will work with a single-tile card.

I hope this helps your investigation.

Regards.

JaideepK_Intel · 10-28-2021

Hi,

Sorry for the delay. Yes, you can select a single Xe GPU for hotspot analysis.

There are two ways to run the analysis.

Method 1 (using the web server: https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/configuration-recipes/using-vtune-server-with-vs-code-intel-devcloud.html )

1) Log into the DevCloud login node:

ssh devcloud

2) Request an interactive session on a compute node:

qsub -I -l nodes=1:quad_gpu:ppn=2

Note: Do not close the terminal after this step; closing it releases your compute node.

3) Open a new terminal and forward the port from your local machine to the login node:

ssh -L 127.0.0.1:55001:127.0.0.1:55001 devcloud

4) From the login node, establish an SSH connection to the compute node with one more SSH tunnel:

ssh -L 127.0.0.1:55001:127.0.0.1:55001 s000-n000

Note: Replace s000-n000 with your compute node name.

5) Start the VTune server on the compute node:

vtune-backend --web-port=55001 --enable-server-profiling

You can now use VTune through the web server, where a Target GPU dropdown is available (see the attached image). Click HOW (the icon at the top right) to select the analysis type.
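
The two tunnels in Method 1 chain the same port from the local machine to the login node and then to the compute node, where vtune-backend listens. A small Python sketch that assembles the three commands for a given port and node name follows. The helper name `tunnel_commands` is hypothetical, and `devcloud` and `s000-n000` are the placeholders used in the steps above.

```python
def tunnel_commands(port: int, login_host: str, compute_node: str) -> list[str]:
    """Return the commands for the two-hop tunnel, in the order they are run.

    Each hop forwards 127.0.0.1:<port> on the near side to 127.0.0.1:<port>
    on the far side, so the same port number is used end to end.
    """
    forward = f"127.0.0.1:{port}:127.0.0.1:{port}"
    return [
        f"ssh -L {forward} {login_host}",    # run on the local machine
        f"ssh -L {forward} {compute_node}",  # run on the login node
        # run on the compute node
        f"vtune-backend --web-port={port} --enable-server-profiling",
    ]


if __name__ == "__main__":
    # Placeholder host alias and node name taken from the steps above.
    for command in tunnel_commands(55001, "devcloud", "s000-n000"):
        print(command)
```

Using one port number for both hops keeps the forwarding rules identical, so only the destination host changes between the two ssh commands.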

Method 2 (using the command line)

1) Run the command below to view the available GPUs:

vtune -help collect gpu-hotspots

The output lists the target-gpu knob with the ID of each GPU in the form <domain:bus:device.function> (for example, 0:27:0.0). Use this ID to select the desired GPU.

2) Run your analysis with that ID:

vtune -collect gpu-offload -knob target-gpu=0:27:0.0 /<Path to the executable>

Note: The example above collects gpu-offload; the target-gpu knob is passed the same way for gpu-hotspots.

For details on the -knob option, please refer to the documentation below:

https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/command-line-interface/command-line-interface-reference/knob.html
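
On the lspci question from the opening post: lspci prints PCI addresses in hexadecimal. The Python sketch below assumes that the target-gpu knob expects the same fields in decimal. That assumption is not confirmed anywhere in this thread, so check the converted value against the IDs listed by vtune -help collect gpu-hotspots before relying on it. The function names and the sample address 0000:1b:00.0 are hypothetical.

```python
import re

# Full PCI address as printed by `lspci -D`, e.g. "0000:1b:00.0".
# All four fields are hexadecimal in lspci output.
PCI_ADDRESS = re.compile(
    r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def lspci_to_target_gpu(address: str) -> str:
    """Convert an `lspci -D` address to domain:bus:device.function in decimal.

    Assumption: the target-gpu knob takes decimal fields. Verify the result
    against the IDs printed by `vtune -help collect gpu-hotspots`.
    """
    match = PCI_ADDRESS.match(address.strip())
    if match is None:
        raise ValueError(f"not a full PCI address: {address!r}")
    domain, bus, device, function = (int(field, 16) for field in match.groups())
    return f"{domain}:{bus}:{device}.{function}"


def build_vtune_command(analysis: str, target_gpu: str, app: list[str]) -> list[str]:
    """Assemble the argument list for a collection restricted to one GPU."""
    return [
        "vtune", "-collect", analysis,
        "-knob", f"target-gpu={target_gpu}",
        "--", *app,
    ]


if __name__ == "__main__":
    gpu = lspci_to_target_gpu("0000:1b:00.0")  # hypothetical lspci address
    print(gpu)
    print(" ".join(build_vtune_command("gpu-hotspots", gpu, ["./a.out"])))
```

Under that assumption, a hexadecimal bus of 1b becomes 27 and 20 becomes 32, which would correspond to the two adapter IDs seen earlier in the thread.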

Thanks,

Jaideep

Viet-Duc · 11-01-2021

Hi again Jaideep,

Thanks for the detailed answers.

I thought the IDs listed in the help message were just examples, not the real IDs of the Xe GPUs.

I've tried both methods and have now obtained results.

For reference, I include the result below:

0:27:0.0 : DG1 [Iris Xe MAX Graphics]
    GPU Tile 0
        EU Array Stalled/Idle: 77.9%
         | The percentage of time when the EUs were stalled or idle is high,
         | which has a negative impact on compute-bound applications.
         |
            GPU L3 Bandwidth Bound: 43.0%

                Hottest GPU Computing Tasks Bound by GPU L3 Bandwidth
                Computing Task  Total Time
                --------------  ----------
            Occupancy: 98.8%

                Hottest GPU Computing Tasks with Low Occupancy
                Computing Task  Total Time  Global Size  Local Size  SIMD Width  Occupancy(%)  SIMD Utilization(%)
                --------------  ----------  -----------  ----------  ----------  ------------  -------------------

The EU Array Stalled/Idle, GPU L3 Bandwidth Bound, and Occupancy values show that the Xe GPU is running an actual workload.
I can transfer the result to my local machine and open it with vtune-gui 2021.8.0. Everything can be viewed except the memory hierarchy, probably because of the version difference between DevCloud and my local installation.

I will mark this question solved. Once again, thanks very much for your time.

Regards.

JaideepK_Intel · 11-01-2021

Hi Viet-Duc,

Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.

Thanks,

Jaideep