Analyzers

Intel Advisor cannot display assembly.

Lin_L_Intel
Employee

I have a few questions about Intel Advisor. I am using 2021.1 Beta08.

1. I am testing a simple C++ application on a Linux server machine, built with the Intel Xmain compiler with the -g option enabled. This gives me my application binary, test.exe.

I collect Advisor Roofline data with the following command line:

advixe-cl --collect="roofline" --stacks --project-dir="advisor-o2" -- ./test.exe

Then I download the result directory and open it with the Advisor GUI on my local machine. I set the project properties so that the <Binary/Symbol Search directory> and <Source Search directory> point to where I put test.exe and the source file test.cpp. However, the Advisor GUI can only display the source when I select a loop/function; it cannot display the corresponding assembly and shows "Intel Advisor cannot show assembly code for the selected function/loop". Why?
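For reference, a hedged sketch of how the same search directories might also be passed on the advixe-cl command line (the exact src:/bin: prefix syntax can differ between versions, so please check advixe-cl --help; the paths are placeholders):

advixe-cl --collect="roofline" --stacks --project-dir="advisor-o2" \
          --search-dir src:r=/path/to/source --search-dir bin:r=/path/to/binary \
          -- ./test.exe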

2. I also have a question about how Advisor collects data; it does not seem to work the same way as VTune, and it takes much longer than VTune for an application that runs over 30 minutes. We are collecting data for SPEC benchmarks, which create 112 processes on a 56-core machine with HT enabled, and those 112 processes run the same executable at the same time. Since the collection overhead cannot be ignored, we must collect the data for all 112 processes at the same time, which makes the collection take too long to finish. I wonder what Advisor does to collect the information and why it takes so long.

Zakhar_M_Intel1
Employee

For your question #2:

Advisor CPU Roofline analysis consists of two passes:

  1. A non-intrusive survey, with ~1x overhead (it provides the "seconds" in particular), and
  2. An intrusive second pass (FLOPs/trip counts), with a 4x-50x slowdown. By default it would be about 4x, but you are using the "stacks" mode, which makes it much slower. The second pass is based on Pin. We use this slower method because it makes it possible to attribute data (even DDR traffic in some cases) to loops/functions, and because it is the only method that provides a precise FLOP count on the CPU (taking mask registers and data flow into account).

This two-pass scheme makes it possible to capture timing so that it is NOT affected by the slower second step. So there should be no worries about "non-representative behavior" effects caused by the slower trip-counts analysis, because what matters is a representative first pass only.
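For reference, a minimal sketch of how the two passes can be run explicitly instead of the combined roofline shortcut (the project directory and binary names are just the ones used in this thread; see advixe-cl --help collect for your version):

# 1st pass: lightweight survey (provides the timing)
advixe-cl --collect=survey --project-dir=./advisor-o2 -- ./test.exe
# 2nd pass: trip counts and FLOPs (the slower, Pin-based part)
advixe-cl --collect=tripcounts --flop --stacks --project-dir=./advisor-o2 -- ./test.exe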

Here is a list of recommendations to make your analysis faster:

  1. If your SPEC run is MPI-based, you can pick and choose only a few ranks to be profiled by Advisor, while the other ranks run without Advisor. With the two-pass scheme and almost no slowdown in the first pass, you can likely afford this. You can "pick and choose" ranks for Advisor profiling using the Intel MPI -gtool option (see the sketch after this list).
  2. Remove the --stacks option to make the analysis roughly 5 times faster. This does not reduce Roofline "coverage", although it can certainly make the Roofline harder to interpret for outermost loop nests and functions.
  3. There are further methods to reduce the overhead, as described at https://software.intel.com/content/www/ru/ru/develop/articles/managing-overhead-of-intel-advisor-analyses.html. Some of them do not fully apply to "Roofline with stacks" (in particular, "resume after"/"duration" should not be used), but basic itt pause/resume and module selection (also sketched below) should still be partially helpful.
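A few hedged sketches of the above; the rank number, module name, and project directories are placeholders, and exact option spellings can vary between Advisor versions (check advixe-cl --help):

# 1. profile only MPI rank 0 with Advisor; all other ranks run without it (Intel MPI -gtool)
mpirun -n 112 -gtool "advixe-cl --collect=roofline --project-dir=./advisor-rank0:0" ./test.exe

# 2. the same collection without --stacks (much faster second pass)
advixe-cl --collect=roofline --project-dir=./advisor-o2 -- ./test.exe

# 3. restrict instrumentation to the module(s) of interest
advixe-cl --collect=roofline --module-filter-mode=include --module-filter=test.exe \
          --project-dir=./advisor-o2 -- ./test.exe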

 

Lin_L_Intel
Employee

My concern about the intrusive 2nd pass is: will the result of "running the application on only one core (one process) and collecting Advisor data for it" be different from "running the same application on 112 cores (all cores used, 112 processes) but collecting Advisor data for only one core (one process)"? Will the 2nd-pass result (shown in the Roofline chart) be affected?

Zakhar_M_Intel1
Employee

Yes, there might be a difference. Let me try to summarize:

First of all, the Roofline depends on both the 1st and the 2nd pass. The 1st pass provides the "seconds", the 2nd pass provides the operation counts. So in theory you can configure the 1st and 2nd pass differently with respect to the number of processes profiled, but it should be done very carefully, so I would not go that way for now.

Secondly, consider 3 general cases:

1. Run the user app on 1 core, and Advisor (as a result) also on 1 core. This is not recommended if you want to characterize real multi-threaded application behavior, because with 111 logical cores staying idle you will get different, non-representative benchmarks ("roofs") and overall application behavior.

2. Run the user app on ALL cores and Advisor on just 1 core. This should be the BEST method, because your application behavior (and the Advisor benchmarks) will be representative, while the extra Advisor-driven pressure on the system will be minimal.

3. Run the user app on ALL 112 "cores" and Advisor on ALL cores too. This will normally keep the results somewhat representative, but the extra Advisor-driven pressure on the system will be huge, especially because Hyper-Threading is involved. This Advisor resource-utilization pressure will certainly slow down the 2nd pass a lot and may even make the timing from the 1st pass less representative compared to option 2.

 

For option 2, I think your additional concern is that the processes (1 with Advisor and the other 111 without it) will not behave uniformly. In most cases this is not an issue, because in the first pass (where uniformity matters most of all) the overhead is small, so all 112 processes should behave essentially as they do without Advisor. It is only in the 2nd pass that process #1 will behave very differently (slower), but you likely do not care much, because the Roofline's timing data (from step 1!) will not be affected, and time is where representativeness and uniformity matter most. Note also that the 2nd pass's job is to count operations and bytes, and those usually remain the same even if the processes interact somewhat differently because of different timing.

In either case, you can double-check this by comparing the OP and byte counts between option 2 and option 3 in Advisor (a sketch of one way to dump them follows below).

Also, in case HT is not critical for you, I would at least use #processes = #physical cores. But maybe your use case requires HT, and that is another reason to consider option 2.
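A rough sketch of what option 2 could look like for a non-MPI, standalone binary, plus one way to dump the counts for comparison. The core pinning, directory names, and binary name are placeholders, and the available report columns depend on the Advisor version:

# 111 plain copies of the workload, one per remaining logical core
for i in $(seq 1 111); do
  taskset -c $i ./test.exe &
done
# the 112th copy runs under Advisor on core 0
taskset -c 0 advixe-cl --collect=roofline --project-dir=./advisor-opt2 -- ./test.exe
wait

# dump the per-loop data to CSV for diffing against the option-3 run
advixe-cl --report=survey --format=csv --report-output=./option2.csv --project-dir=./advisor-opt2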

 

Lin_L_Intel
Employee

Hi Zakhar, that is a very detailed explanation. Regarding the concern about option 2, I understand your point that the non-uniformity will not affect things much because the timing comes from the 1st pass. But if the timing is different in the 2nd pass, won't it affect the memory pressure or computation pressure, which would lead to differences in the Roofline reports (the performance of each loop/function)? I mean the position of each dot in the Roofline chart.

Zakhar_M_Intel1
Employee

Hello,

The generic memory and compute (FLOP count) characteristics are not affected; they are invariant with respect to timing. As for more advanced data such as the DDR vs. LLC traffic distribution (which you did not use in your run), Advisor obtains it via cache-simulator technology, which is also independent of timing.

As an indirect analogy, that is actually how SDE and some simulators work: they slow things down enormously, but the resulting data can be very precise.

If you are still not sure, you can do a one-time run with option 2 and compare it against your existing data from option 3. Of course, you should compare the FLOP (operation) counts, not the FLOP/s values.
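If it helps the comparison, recent Advisor versions can also export the Roofline chart itself to HTML from the command line, so the dot positions of the two runs can be compared side by side (a hedged sketch; the project directory names are placeholders and the report type/options may differ in your version):

advixe-cl --report=roofline --project-dir=./advisor-opt2 --report-output=./roofline-opt2.html
advixe-cl --report=roofline --project-dir=./advisor-opt3 --report-output=./roofline-opt3.html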

 

 

Ruslan_M_Intel
Employee

You could try to create a snapshot (a packed snapshot is even better) of the result with the binaries included: https://software.intel.com/content/www/us/en/develop/documentation/advisor-user-guide/top/results/creating-a-read-only-result-snapshot.html
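For example, a packed snapshot that caches the binaries and sources can be created on the server and then copied to the local machine (a sketch; the snapshot name is a placeholder, see advixe-cl --help snapshot for your version):

advixe-cl --snapshot --project-dir=./advisor-o2 --pack --cache-sources --cache-binaries -- ./advisor-o2-snapshot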

Lin_L_Intel
Employee

I don't understand the reason for creating a snapshot. I cannot even open the binary in the current workspace when I open the result, so why would I create a snapshot of it? If I create a snapshot, will the binary be recognized?

Ruslan_M_Intel
Employee

I'm sorry, I was not clear: I was referring to the original result (not the downloaded copy). Instead of downloading the raw result directory, you could copy a packed snapshot of it. I hope that in this case you would not hit the problem.

Lin_L_Intel
Employee

I still cannot open the binaries in Advisor.

Ruslan_M_Intel
Employee

Are you able to use Advisor Beta10 instead? Possibly there is a bug in Beta08.

AthiraM_Intel
Moderator

Hi,

Could you please let us know whether all your queries have been clarified? Can we close this case?

Thanks.
AthiraM_Intel
Moderator

Hi,

We will discontinue monitoring this issue; please raise a new thread if you have further questions.

Thanks.