I'm looking at some assembly language output by icc -S (v16) and see lines like this:
vmovups %zmm0, 960(%rsp,%rdx) #300.3 c7 stall 2
What is the "c7 stall 2" part telling me?
Is there a document somewhere that explains the more esoteric aspects of an assembly listing?
There's no documentation available for the asm listing per se, and I've passed your feedback to the team. As for the annotation: the number after the "c" is the cycle number, and "stall N" indicates that there is an N-cycle stall prior to execution of the instruction, as predicted by the compiler's instruction scheduler. Hope that helps.
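In case it helps anyone who wants to post-process these listings, here is a minimal Python sketch that pulls the fields out of one annotated line. Since the format is undocumented, the regular expression is inferred only from the single sample line in this thread; reading "#300.3" as source line and column is my assumption, and `parse_annotation` is a made-up helper name, not part of any Intel tool.

```python
import re

# Assumed comment layout, inferred only from the sample line in this thread:
#   <instruction>  #<line>.<col> c<cycle> [stall <n>]
ANNOTATION = re.compile(r"#(\d+)\.(\d+)\s+c(\d+)(?:\s+stall\s+(\d+))?")

def parse_annotation(asm_line):
    """Return (source_line, column, cycle, stall), or None if no annotation.

    A missing "stall N" suffix is reported as a stall of 0.
    """
    m = ANNOTATION.search(asm_line)
    if m is None:
        return None
    src_line, col, cycle, stall = m.groups()
    return int(src_line), int(col), int(cycle), int(stall or 0)

sample = "vmovups %zmm0, 960(%rsp,%rdx) #300.3 c7 stall 2"
print(parse_annotation(sample))  # (300, 3, 7, 2)
```

Running it over every line of the .s file and summing the stall field per source line would be one rough way to see where the scheduler predicts the most waiting.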
Just curious -- exactly what options did you use to get the compiler to generate these cycle and stall counts? I have never seen them before and I have not found any options that enable them....
In this case the compiler line was:
icc -std=c99 -O3 trans.c -xMIC-AVX512 -S
It would be nice if I could use this information to tweak performance but I'm not really sure what to make of it. The "cycle counts" are always odd (as in X mod 2 = 1 :^)); sometimes there are four or five instructions in a row with the same count; sometimes there are gaps of as much as 6 between consecutive counts (e.g. c79 -> c85). Stalls seem to be mostly on "moves" (of one form or another) and an assortment of perm/shuf and gathers.
In any case, it's curious.
The cycle counts and stall information are emitted only for MIC. I'll touch base with the product team to find out more about these annotations and will update you accordingly. Much appreciated.
The article at https://software.intel.com/en-us/mic-developer/programming is a good read for getting a clearer picture of how the predicted cycle counts and stall information relate to the basic MIC architecture. A white paper that's also very relevant is https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-cop... which explains what each core is capable of, the number of HW threads, etc. The compiler assumes two actively running threads, so each thread runs in every other cycle. That is why the cycle counts are always odd: the compiler assumes that cycles c1, c3, c5, etc. will be used by the thread.
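To make the odd-cycle numbering concrete, here is a small Python sketch of that reading. It is only an illustration of the explanation above, not a documented model of the scheduler: it assumes the thread's k-th issue slot (counting from 0) falls on cycle 2k+1, and both function names are made up for this example.

```python
def issue_cycles(n_slots):
    """Cycles available to one thread when two threads alternate: c1, c3, c5, ..."""
    return [2 * k + 1 for k in range(n_slots)]

def skipped_slots(prev_cycle, next_cycle):
    """Number of this thread's issue slots left unused between two odd cycle counts."""
    return (next_cycle - prev_cycle) // 2 - 1

print(issue_cycles(4))        # [1, 3, 5, 7]
print(skipped_slots(79, 85))  # 2
print(skipped_slots(7, 9))    # 0
```

Under that assumption, the c79 -> c85 gap mentioned earlier corresponds to two unused slots for the thread, while back-to-back counts such as c7 -> c9 leave none.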
Understanding the details of the cycle counts and stalls, including why there can be gaps in the cycle counts and what causes the stalls, requires a detailed understanding of the microarchitecture. I am waiting to hear from the group whether there's an optimization guide available similar to the manual at http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimizat... and will let you know as soon as I find out.
Please file any issues you find where the cycle count and stall information looks wrong so it can be fixed. That said, something that looks suspicious may have a good explanation: the system is only capable of executing 2 instructions per cycle, and some instruction sequences aren't subjected to scheduling, etc.
Hope the above helps.