Support for Analyzers (Intel VTune™ Profiler, Intel Advisor, Intel Inspector)

loop-mode=loop-only with amplxe-cl: most time is [Outside any loop]


I am trying to profile my code using the command line version of VTune (version 2013 update 17) on Linux.  I need to get the hot loops and not just the hot functions.  The report when using -loop-mode=loop-only or -loop-mode=loop-and-function shows most of the runtime as [Outside any loop] in the module [unknown].  There are also some loops from other libraries.  For example:

Function                          Function Stack      Module                CPU Time:Self
--------------------------------  ------------------  --------------------  -------------
[Outside any loop]                                    [Unknown]             0.029
parse_line                                                                  0.004
_nss_files_parse_pwent                                                      0.004
____strtoull_l_internal                                                     0.003
func@0x3eba4ae460                                                           0.001
__strchr_sse2                                                               0.001
_nss_files_getpwuid_r                                                       0.001
[Loop@0x4a9010 in func@0x4a9010]                      bash                  0.001
                                  [Outside any loop]  [Unknown]             0.001

I have also tested the same code with the 2015 GUI version of VTune on my laptop (Windows), and I can see the loops with line numbers, so I am not sure if my problem is related to the older version or the fact that I am using command line.   How can I see the individual loop line numbers with 2013 amplxe-cl?





Hi Diana:

It would seem that you are missing debug info.  Are any of your own modules (i.e., modules that you developed and built) listed in the results?  I see some libraries, but they appear to be system or standard libraries.  Is your binary built with debug info and not stripped?  (Run 'file <executable>' to see whether it has been stripped.)
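As a rough illustration of that check, here is a minimal sketch that classifies the text printed by 'file'. The helper name strip_status is hypothetical, and the sketch assumes 'file' reports ELF binaries with the usual "stripped" / "not stripped" wording.

```shell
# Hypothetical helper: classify one line of `file` output.
# "not stripped" means the symbol table is still present; it does not by
# itself prove that DWARF debug sections exist.
strip_status() {
    case "$1" in
        *"not stripped"*) echo "not stripped" ;;
        *stripped*)       echo "stripped" ;;
        *)                echo "unknown" ;;
    esac
}

# Possible usage, with $bin standing in for your executable:
#   strip_status "$(file "$bin")"
#   readelf -S "$bin" | grep '\.debug_'   # lists DWARF sections, if any
```

The readelf line is an extra check: a binary built with -g normally has .debug_* sections, while one that merely keeps its symbol table does not.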

Also, what does the report look like without the "loop-mode" option?  What module and function has the most time?

Finally, if you build and profile the tachyon sample included in the product (see <installation-directory>/samples/en/C++), you can see what the output *should* look like.  For example, here is the output using -loop-mode=loop-only:

Function                                                 Module                     CPU Time:Self  CPU Time:Idle:Self  CPU Time:Poor:Self  CPU Time:Ok:Self  CPU Time:Ideal:Self  CPU Time:Over:Self  Overhead Time:Self  Spin Time:Self
-------------------------------------------------------  -------------------------  -------------  ------------------  ------------------  ----------------  -------------------  ------------------  ------------------  --------------
[Loop at line 580 in grid_intersect]                     tachyon_find_hotspots.bak          4.980                   0               2.410             0.130                2.440                   0                   0               0
[Loop at line 144 in render_one_pixel]                   tachyon_find_hotspots.bak          0.951                   0               0.180             0.030                0.741                   0                   0               0
[Outside any loop]                                       [Unknown]                          0.900               0.460               0.440                 0                    0                   0                   0           0.890
[Loop at line 559 in grid_intersect]                     tachyon_find_hotspots.bak          0.320                   0               0.170             0.020                0.130                   0                   0               0
[Loop@0x3560e1ec69 in __libc_start_main]                              0.270               0.020               0.220             0.030                    0                   0                   0           0.260
[Loop at line 561 in grid_intersect]                     tachyon_find_hotspots.bak          0.229                   0               0.110                 0                0.119                   0                   0               0
[Loop at line 634 in grid_intersect]                     tachyon_find_hotspots.bak          0.200                   0               0.130             0.010                0.060                   0                   0               0
[Loop at line 113 in intersect_objects]                  tachyon_find_hotspots.bak          0.140                   0               0.030                 0                0.110                   0                   0               0
[Loop at line 111 in shader]                             tachyon_find_hotspots.bak          0.070                   0               0.020                 0                0.050                   0                   0               0
[Loop at line 178 in tachyon_video::on_process]          tachyon_find_hotspots.bak          0.050               0.050                   0                 0                    0                   0                   0               0
[Loop at line 202 in thread_trace$omp$parallel_for@197]  tachyon_find_hotspots.bak          0.030                   0               0.010             0.010                0.010                   0    



Hi MrAnderson,

Thanks for your help.

I have recompiled the example code with -g:

icc -g testapp.c -o testapp.x

When I profile this on Xeon, I do see the correct output for the loop profile:

amplxe-cl -collect advanced-hotspots -r test1 ./testapp.x
amplxe-cl -R callstacks -r test1  -loop-mode=loop-only

Function                    Function Stack              Module     CPU Time:Self
--------------------------  --------------------------  ---------  -------------
[Loop at line 37 in func2]                              testapp.x  2.566
                            [Loop at line 36 in func2]  testapp.x  2.566
                            [Loop at line 34 in func2]  testapp.x  0
                            [Outside any loop]          [Unknown]  0

[Loop at line 36 in func2]                              testapp.x  0.012
                            [Loop at line 34 in func2]  testapp.x  0.012
                            [Outside any loop]          [Unknown]  0

[Outside any loop]                                      [Unknown]  0.005

[Loop at line 14 in func1]                              testapp.x  0.004
                            [Loop at line 12 in func1]  testapp.x  0.004
                            [Outside any loop]          [Unknown]  0

However, when I try to run the same code on Xeon Phi (native mode), I am still seeing [Outside any loop] as the main/only contributor.

icc -g -mmic testapp.c -o testapp.mic

amplxe-cl -collect knc-hotspots -r test1 -- ssh mic0 /tmp/testapp.mic
amplxe-cl -R callstacks -r test1  -loop-mode=loop-only

Function            Function Stack  Module     CPU Time:Self
------------------  --------------  ---------  -------------
[Outside any loop]                  [Unknown]  91.647

The function-only report is:

Function                   Module                   CPU Time:Self
-------------------------  -----------------------  -------------
[testapp.mic]              testapp.mic                     46.175
[vmlinux]                  vmlinux                         44.653
[sep3_15]                  sep3_15                          0.385
[]                                                          0.296
[]                                                          0.067
[]                                                          0.040
[micscif]                  micscif                          0.007
[coi_daemon]               coi_daemon                       0.005
[]                                                          0.005
[]                                                          0.005
[dma_module]               dma_module                       0.002
[]                                                          0.002
[]                                                          0.002
[sep_mic_server3.15]       sep_mic_server3.15               0.002


I am not sure why the Xeon Phi version treats the entire program as one function.  Adding -fno-inline did not seem to make a difference either.
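To put a number on the difference between the two loop-only reports, one option is to compute the share of top-level self time attributed to [Outside any loop]. The sketch below is only an illustration: outside_loop_share is a hypothetical helper, and it assumes the text layout shown in this post (top-level rows start in the first column, call-stack rows are indented, and CPU Time:Self is the last field).

```shell
# Hypothetical helper: print the percentage of top-level self time that a
# loop-mode report attributes to [Outside any loop].
# Reads the report from the files given as arguments, or from stdin.
outside_loop_share() {
    awk '
        # Skip call-stack rows (indented), rule lines, the header, and blanks.
        /^[ \t]/ || /^-+/ || /^Function/ || NF == 0 { next }
        { total += $NF }
        /^\[Outside any loop\]/ { outside += $NF }
        END {
            if (total > 0) printf "%.1f\n", 100 * outside / total
            else print "n/a"
        }
    ' "$@"
}

# Possible usage, assuming the report is written to stdout:
#   amplxe-cl -R callstacks -r test1 -loop-mode=loop-only | outside_loop_share
```

Applied to the reports above, this gives roughly 0.2 for the Xeon run and 100.0 for the Xeon Phi run. Any extra lines the tool prints around the table would need to be filtered out first.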



Hi Diana:

I apologize for the delay.  I was checking with the team to make sure I was accurate in my understanding of the issues.

Currently, there are several issues blocking loop analysis on Xeon Phi systems.  It is a combination of incorrect debug information and the algorithm used to analyze the info.  We expect a fix for both shortly.  I will try to remember to update this thread when the fixes are available.  Until then, you will be limited to function-hotspot analysis on Xeon Phi.


FYI, VTune Amplifier XE 2015 Update 1, which was released about two weeks ago, addresses the algorithm part of this issue, so loop analysis on Xeon Phi should improve.

I don't have any info on changes to the debug info, but will look into it.


Thanks for getting back to me.  I will look into whether we can use the 2015 version on the Xeon Phi system.


Diana G. wrote:

[testapp.mic] testapp.mic 46.175


A function name in brackets, like [testapp.mic], can indicate that the testapp.mic module was not found during finalization. If that is the case, the finalization output should contain a message about testapp.mic. I suggest specifying the search path for this module and refinalizing the result: amplxe-cl -finalize -r test1 -search-dir=<path to directory where testapp.mic is located>.
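To spot such rows in a function-only report, a small filter along these lines may help. It is a sketch under assumptions: unresolved_modules is a hypothetical helper, and it assumes the three-column layout quoted above, where the bracketed function name matches the module name.

```shell
# Hypothetical helper: list modules whose only "function" is the bracketed
# module name itself, e.g. "[testapp.mic]  testapp.mic  46.175".
# Reads the report from the files given as arguments, or from stdin.
unresolved_modules() {
    awk 'NF >= 3 && $1 == ("[" $2 "]") { print $2 }' "$@"
}

# Possible follow-up for each module you built yourself, using the
# refinalize command from this reply:
#   amplxe-cl -finalize -r test1 -search-dir=<path to directory where the module is located>
```

Kernel and driver modules such as vmlinux will also show up in this list. That is usually expected, so only the application's own binaries need a search path.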