I ran into this while analyzing one of my test programs with VTune.
I'm using icc 11.0 with optimizations disabled via the -O0 flag.
VTune shows that 45% of the cache misses are caused by an incl instruction.
I can't find an incl instruction in the Intel instruction manual.
Does anybody know what the incl instruction does, and why VTune attributes so many cache misses to this instruction?
I'm having difficulty reading your .png; it seems that your incl instruction (which simply increments a counter) is a branch target, so the events reported there would have been initiated prior to branching to that instruction.
Actually, VTune tends to attribute "MEM_LOAD_RETIRED.L1D_LINE_MISS.events" samples to the instruction _after_ the one that actually takes longer to execute. That is simply how sampling works: the service routine captures CS:EIP from the interrupt stack, and by that time the captured instruction pointer (EIP) already points to the next instruction. So the issue is not the increment (incl) but the indirect memory reference made by the mov [movl -20(%rbp), %eax].
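To illustrate the attribution described above, here is a minimal sketch in C, not anything VTune itself provides. Given the start addresses of the instructions in a disassembly listing and a sampled instruction pointer, it returns the index of the instruction just before the sampled one. The function name, the address array, and the addresses used with it are hypothetical. Note that it looks only at address order, so if the sampled instruction is a branch target, the instruction that really executed before it may be somewhere else entirely.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: 'starts' holds instruction start addresses in
 * ascending order, as read from a disassembly listing. Returns the index
 * of the instruction immediately before the one at 'sampled_ip', or -1 if
 * 'sampled_ip' is the first instruction or is not an instruction start.
 * Assumption: straight-line code; branch targets are not handled.
 */
long preceding_instruction(const uint64_t *starts, size_t n, uint64_t sampled_ip)
{
    for (size_t i = 1; i < n; i++) {
        if (starts[i] == sampled_ip) {
            return (long)(i - 1);
        }
    }
    return -1;
}
```

With a listing such as movl at 0x400500 followed by incl at 0x400503 (made-up addresses), a sample recorded at 0x400503 would be traced back to index 0, the movl.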
In "incl", the l signifies a long (32-bit) operand: the basic instruction is "inc", and AT&T assembler syntax appends an operand-size suffix, one of byte (b), word (w), long (l) or quad (q). A plain "inc" with no suffix is accepted when the size can be inferred from a register operand. The primary use of "inc" is to implement counters, by adding 1 to the destination operand (here, apparently a memory location addressed relative to the base pointer register %rbp).
In the "Intel 64 and IA-32 Architectures Software Developer's Manual" you will only find the basic instruction, "inc". The manual uses Intel syntax, so the size-suffixed AT&T form "incl" is not listed.
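As a rough sketch of where such an instruction comes from, consider the small C function below. The function is hypothetical and is not the original poster's code. When compiled without optimization, a compiler typically keeps the local counter in a stack slot, so the increment is likely to show up as an incl on an %rbp-relative address. The exact instruction and offset depend on the compiler and its version.

```c
#include <stddef.h>

/* Hypothetical example: counts the nonzero entries of an int array. */
int count_nonzero(const int *data, size_t n)
{
    int count = 0;      /* at -O0 this usually lives in a stack slot */
    for (size_t i = 0; i < n; i++) {
        if (data[i] != 0) {
            /* 32-bit increment of a stack variable; unoptimized code may
               use something like "incl -20(%rbp)" (offset is illustrative) */
            count++;
        }
    }
    return count;
}
```

Because count is an int (32 bits), the increment takes the l suffix. A long counter on a 64-bit Linux target would use incq instead.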
Could you quote the SAV (Sample After Value) chosen for MEM_LOAD_RETIRED.L1D_LINE_MISS.events?
Use the precise events to focus on the instructions that cause high L1 and L2 miss counts, and also check which instructions are causing branch mispredictions.
From your asm code, it seems you have compiled the application without any optimization flags (-On); any reason for doing so?
Could you try compiling your application with -O3 or -O2, and let the code use the SSE registers rather than the x87 stack?