loop-mode=loop-only with amplxe-cl: most time is [Outside any loop]

Diana_G_ · ‎10-13-2014

I am trying to profile my code using the command line version of VTune (version 2013 update 17) on Linux. I need to get the hot loops and not just the hot functions. The report when using -loop-mode=loop-only or -loop-mode=loop-and-function shows most of the runtime as [Outside any loop] in the module [unknown]. There are also some loops from other libraries. For example:

Function                          Function Stack      Module                CPU Time:Self
--------------------------------  ------------------  --------------------  -------------
[Outside any loop]                                    [Unknown]             0.029

parse_line                                            libnss_files-2.12.so  0.004

_nss_files_parse_pwent                                libc-2.12.so          0.004

____strtoull_l_internal                               libc-2.12.so          0.003

func@0x3eba4ae460                                     libcrypto.so.1.0.1e   0.001

__strchr_sse2                                         libc-2.12.so          0.001

_nss_files_getpwuid_r                                 libnss_files-2.12.so  0.001

[Loop@0x4a9010 in func@0x4a9010]                      bash                  0.001
                                  [Outside any loop]  [Unknown]             0.001

I have also tested the same code with the 2015 GUI version of VTune on my laptop (Windows), and I can see the loops with line numbers, so I am not sure if my problem is related to the older version or the fact that I am using command line. How can I see the individual loop line numbers with 2013 amplxe-cl?

David_A_Intel1 · ‎10-13-2014

Hi Diana:

It would seem that you are missing debug info. Are any of your modules (i.e., modules that you developed/built) listed in the results? I see some libraries, but they appear to be system or standard libraries. Is your binary built with debug info and not stripped? (Execute 'file <executable>' to check whether it has been stripped.)
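A minimal sketch of that check, assuming a POSIX shell. The helper name strip_status and the binary name ./myapp are placeholders for illustration and are not part of VTune. The helper only classifies one line of 'file' output. Keep in mind that "not stripped" means the symbol table is still present, which is a hint, not proof, that -g debug info is there.

```shell
#!/bin/sh
# Classify one line of `file` output by its strip status.
# "not stripped" must be tested first, because it also contains "stripped".
strip_status() {
  case "$1" in
    *"not stripped"*) echo "not stripped" ;;
    *stripped*)       echo "stripped" ;;
    *)                echo "unknown" ;;
  esac
}

# Hypothetical usage (./myapp is a placeholder binary name):
#   strip_status "$(file ./myapp)"
```

If the result is "stripped", rebuilding with -g and without a strip step is the first thing to try before looking at the profiler itself.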

Also, what does the report look like without the "loop-mode" option? What module and function has the most time?

Finally, if you build and run the tachyon sample included in the product (see <installation-directory>/samples/en/C++), you should be able to see what the output *should* look like. For example, here is the output using -loop-mode=loop-only:

Function                                                 Module                     CPU Time:Self  CPU Time:Idle:Self  CPU Time:Poor:Self  CPU Time:Ok:Self  CPU Time:Ideal:Self  CPU Time:Over:Self  Overhead Time:Self  Spin Time:Self
-------------------------------------------------------  -------------------------  -------------  ------------------  ------------------  ----------------  -------------------  ------------------  ------------------  --------------
[Loop at line 580 in grid_intersect]                     tachyon_find_hotspots.bak          4.980                   0               2.410             0.130                2.440                   0                   0               0
[Loop at line 144 in render_one_pixel]                   tachyon_find_hotspots.bak          0.951                   0               0.180             0.030                0.741                   0                   0               0
[Outside any loop]                                       [Unknown]                          0.900               0.460               0.440                 0                    0                   0                   0           0.890
[Loop at line 559 in grid_intersect]                     tachyon_find_hotspots.bak          0.320                   0               0.170             0.020                0.130                   0                   0               0
[Loop@0x3560e1ec69 in __libc_start_main]                 libc-2.12.so                       0.270               0.020               0.220             0.030                    0                   0                   0           0.260
[Loop at line 561 in grid_intersect]                     tachyon_find_hotspots.bak          0.229                   0               0.110                 0                0.119                   0                   0               0
[Loop at line 634 in grid_intersect]                     tachyon_find_hotspots.bak          0.200                   0               0.130             0.010                0.060                   0                   0               0
[Loop at line 113 in intersect_objects]                  tachyon_find_hotspots.bak          0.140                   0               0.030                 0                0.110                   0                   0               0
[Loop at line 111 in shader]                             tachyon_find_hotspots.bak          0.070                   0               0.020                 0                0.050                   0                   0               0
[Loop at line 178 in tachyon_video::on_process]          tachyon_find_hotspots.bak          0.050               0.050                   0                 0                    0                   0                   0               0
[Loop at line 202 in thread_trace$omp$parallel_for@197]  tachyon_find_hotspots.bak          0.030                   0               0.010             0.010                0.010                   0

Diana_G_ · ‎10-14-2014

Hi MrAnderson,

Thanks for your help.

I have recompiled the example code with -g:

icc -g testapp.c -o testapp.x

When I profile this on Xeon, I do see the correct output for the loop profile:

amplxe-cl -collect advanced-hotspots -r test1 ./testapp.x
amplxe-cl -R callstacks -r test1  -loop-mode=loop-only

Function                    Function Stack              Module     CPU Time:Self
--------------------------  --------------------------  ---------  -------------
[Loop at line 37 in func2]                              testapp.x  2.566
                            [Loop at line 36 in func2]  testapp.x  2.566
                            [Loop at line 34 in func2]  testapp.x  0
                            [Outside any loop]          [Unknown]  0

[Loop at line 36 in func2]                              testapp.x  0.012
                            [Loop at line 34 in func2]  testapp.x  0.012
                            [Outside any loop]          [Unknown]  0

[Outside any loop]                                      [Unknown]  0.005

[Loop at line 14 in func1]                              testapp.x  0.004
                            [Loop at line 12 in func1]  testapp.x  0.004
                            [Outside any loop]          [Unknown]  0

However, when I try to run the same code on Xeon Phi (native mode), I am still seeing [Outside any loop] as the main/only contributor.

icc -g -mmic testapp.c -o testapp.mic

amplxe-cl -collect knc-hotspots -r test1 -- ssh mic0 /tmp/testapp.mic
amplxe-cl -R callstacks -r test1  -loop-mode=loop-only

Function            Function Stack  Module     CPU Time:Self
------------------  --------------  ---------  -------------
[Outside any loop]                  [Unknown]  91.647

The function-only report is:

Function                   Module                   CPU Time:Self
-------------------------  -----------------------  -------------
[testapp.mic]              testapp.mic                     46.175
[vmlinux]                  vmlinux                         44.653
[sep3_15]                  sep3_15                          0.385
[libc-2.14.90.so]          libc-2.14.90.so                  0.296
[libcrypto.so.1.0.0]       libcrypto.so.1.0.0               0.067
[libnss_files-2.14.90.so]  libnss_files-2.14.90.so          0.040
[micscif]                  micscif                          0.007
[coi_daemon]               coi_daemon                       0.005
[ld-2.14.90.so]            ld-2.14.90.so                    0.005
[libpthread-2.14.90.so]    libpthread-2.14.90.so            0.005
[dma_module]               dma_module                       0.002
[libmessage_mic.so]        libmessage_mic.so                0.002
[libpam.so.0.83.1]         libpam.so.0.83.1                 0.002
[sep_mic_server3.15]       sep_mic_server3.15               0.002

I am not sure why the Xeon Phi version treats the entire program as one function. Adding -fno-inline did not seem to make a difference either.

David_A_Intel1 · ‎10-15-2014

Hi Diana:

I apologize for the delay. I was checking with the team to make sure I was accurate in my understanding of the issues.

Currently, there are several issues blocking loop analysis on Xeon Phi systems: a combination of incorrect debug information and the algorithm used to analyze that information. We expect a fix for both shortly. I will try to remember to update this thread when the fixes are available. Until then, you will be limited to function-hotspot analysis on Xeon Phi.

David_A_Intel1 · ‎11-24-2014

FYI, VTune Amplifier XE 2015 Update 1, which was released about two weeks ago, addresses the algorithm part of this problem, so loop analysis on Xeon Phi should improve.

I don't have any info on changes to the debug info, but will look into it.

Diana_G_ · ‎11-25-2014

Thanks for getting back to me. I will look into whether we can use the 2015 version on the Xeon Phi system.

Denis_M_Intel2 · ‎11-25-2014

Diana G. wrote:

[testapp.mic] testapp.mic 46.175

A function name in brackets, like [testapp.mic], can indicate that the testapp.mic module was not found during finalization. If that is the case, the finalization output should contain a message about testapp.mic. I would suggest specifying the search path for this module and refinalizing the result: amplxe-cl -finalize -r test1 -search-dir=<path to directory where testapp.mic is located>.
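To spot this pattern quickly in a saved function-only report, something like the following sketch may help. It assumes a POSIX shell with awk and a report laid out like the one quoted above (function name in the first whitespace-separated column, module in the second). The helper name unresolved_modules is hypothetical and not part of amplxe-cl.

```shell
#!/bin/sh
# Read a function-only report on stdin and print every module whose
# "function" is just the module name in brackets, for example:
#   [testapp.mic]  testapp.mic  46.175
# Assumption: columns are whitespace-separated, function first, module second.
unresolved_modules() {
  awk '$1 == ("[" $2 "]") { print $2 }'
}

# Hypothetical usage, with the report previously saved to report.txt:
#   unresolved_modules < report.txt
```

Every module this prints is a candidate for the -search-dir refinalization described above. If all modules are listed, as in the report quoted earlier in the thread, it suggests that no symbols were resolved for any of them.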