I have a few questions for Intel Advisor. I am using 2021.1.Beta08.
1. I am testing a simple C++ application on a Linux server machine, built with the Intel Xmain compiler with the -g option enabled. This gives me my application, test.exe.
I collect advisor roofline data with command line:
advixe-cl --collect="roofline" --stacks --project-dir="advisor-o2" -- ./test.exe
Then I download the result directory and open it with the Advisor GUI on my local machine. In the project properties I set <Binary/Symbol Search directory> and <Source Search directory> to the locations of test.exe and the source file test.cpp. However, when I select a loop/function, the Advisor GUI can only display the source file; it cannot display the corresponding assembly and shows "Intel Advisor cannot show assembly code for the selected function/loop". Why?
2. Besides, I have another question about how Advisor collects data; it does not seem to work the same way as VTune. It takes much longer than VTune for an application that runs over 30 minutes. When we collect data for SPEC benchmarks, they create 112 processes on a 56-core machine with HT enabled, and those 112 processes run the same exe file at the same time. Since the collection overhead cannot be ignored, we must collect data for all 112 processes at the same time, which makes the collection take too long to finish. I just wonder what Advisor does to collect information, and why it takes so long.
For your question #2:
Advisor CPU Roofline analysis consists of 2 passes:
1. Survey: a low-overhead pass that measures timing ("seconds") for each loop/function.
2. Trip Counts/FLOP: a more intrusive, instrumented pass that counts operations and bytes; this pass runs noticeably slower.
This 2-pass scheme makes it possible to capture time so that it is NOT affected by the slower 2nd pass. So there should be no worries about "non-representative behavior" effects caused by the slower trip-counts analysis, because for timing, what matters is the representative first pass only.
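As a sketch, the `--collect=roofline` shortcut shown earlier runs both passes back to back; they can also be invoked explicitly (project directory and executable path are placeholders from the example above):

```shell
# Pass 1: Survey -- low-overhead timing collection
advixe-cl --collect=survey --project-dir=advisor-o2 -- ./test.exe

# Pass 2: Trip Counts/FLOP -- instrumented, slower; counts operations and bytes
advixe-cl --collect=tripcounts --flop --stacks --project-dir=advisor-o2 -- ./test.exe
```

Running the passes separately makes it visible which of the two dominates your total collection time.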
Here is a list of recommendations to make your analysis faster:
You could try to create a snapshot (Packed Snapshot is even better) of the result with binaries included. https://software.intel.com/content/www/us/en/develop/documentation/advisor-user-guide/top/results/cr...
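A minimal sketch of creating a packed snapshot on the server (the project directory matches the earlier example; the snapshot name is a placeholder):

```shell
# Pack the result together with cached sources and binaries into a single
# archive that can be copied to another machine and opened in the GUI
advixe-cl --snapshot --project-dir=advisor-o2 \
  --pack --cache-sources --cache-binaries -- my_roofline_snapshot
```

Because the snapshot caches the binaries and sources, the GUI on your local machine should not need the search directories to resolve them.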
I don't understand the reason for creating a snapshot. I mean, I can't even open the binary in the current workspace when I open the result, so why would I create a snapshot for it? If I create a snapshot, would the binary be recognized?
I'm sorry, I didn't mention that I was referring to the original result (not the downloaded copy). Instead of downloading the raw result, you could copy a packed snapshot. I hope in that case you wouldn't hit the problem.
My concern with the intrusive 2nd pass is this: will the result of "running the application on only one core (one process) and collecting Advisor data for it" differ from "running the same application on 112 cores (all cores used, 112 processes) but collecting Advisor data for only one core (one process)"? Will the 2nd pass result (shown in the Roofline figure) be affected?
Yes, there might be the difference. Let me try to summarize:
First of all, the Roofline depends on both the 1st and 2nd passes. The 1st pass provides "seconds"; the 2nd pass provides operation counts. So in theory you can configure the 1st and 2nd passes differently with respect to the number of processes profiled, but this should be done very carefully, so I'd not go that way for now.
Secondly, consider 3 general cases:
1. Run the user app on 1 core, and Advisor (as a result) also on 1 core. This is not recommended if you want to characterize real multi-threaded application behavior, because with 111 logical cores staying idle you'll get different, non-representative benchmarks ("roofs") and overall application behavior.
2. Run the user app on ALL cores and Advisor on just 1 core. This should be the BEST method, because your application behavior (and Advisor's benchmarks) will be representative, while the extra Advisor-driven pressure on the system will be minimal.
3. Run the user app on ALL 112 "cores" and Advisor on ALL cores too. This will normally keep results somewhat representative, but the extra Advisor-driven pressure on the system will be huge, especially with Hyper-Threading involved. This resource-utilization pressure from Advisor will certainly slow down the 2nd pass a lot, and may even make the timing from the 1st pass less representative compared to option 2.
For option 2, I think your additional concern is that the processes (1 with Advisor and the other 111 without it) will not behave uniformly. In most cases this is not an issue. For the first pass, where uniformity matters most of all, the overhead is small, so all 112 processes should behave just as they would without Advisor. It is only in the 2nd pass that process #1 will behave very differently (slower), but you likely do not care much, because the Roofline's timing data (from pass 1!) will not be affected, and time is where representativeness and uniformity matter most. Note also that the 2nd pass's job is to count operations and bytes, and these usually remain the same even if the processes interact somewhat differently because of different timing.
In either case, you can double-check this by comparing the OP and Byte counts between option 2 and option 3 in Advisor.
Also, in case HT is not critical for you, I'd at least use #processes = #physical cores. But maybe your use case requires HT, and that's another reason to consider option 2.
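Option 2 above can be sketched as a small launcher script: 111 copies of the application run unprofiled while a single copy runs under Advisor (the executable name and project directory are taken from the earlier example; adapt them to your SPEC harness):

```shell
#!/bin/sh
# Launch 111 unprofiled copies in the background to keep the machine
# fully loaded and the benchmarks ("roofs") representative
for i in $(seq 1 111); do
  ./test.exe &
done

# Profile only the 112th copy; its pass-1 timing stays representative
# because its overhead is small while the system is under realistic load
advixe-cl --collect=roofline --stacks \
  --project-dir=advisor-one-proc -- ./test.exe

# Wait for all background copies to finish
wait
```

The trade-off, as discussed above, is that the profiled process will slow down during the 2nd pass, but the operation and byte counts it reports should be largely unaffected.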
Hi Zakhar, that's a very detailed explanation. Regarding the concern with option 2, I understand your point that non-uniformity will not affect things much because the timing comes from the 1st pass. But I wonder: if the timing is different in the 2nd pass, won't that affect the memory or computation pressure, leading to differences in the Roofline report (the performance of each loop/function)? I mean the position of each dot in the Roofline figure.