Actually, VTune often attributes "MEM_LOAD_RETIRED.L1D_LINE_MISS" event samples to the instruction _next_ to the one that actually takes longer to execute. This is how sampling works: the service routine captures CS:EIP from the interrupt stack, and at that time the captured instruction pointer (EIP) points to the next instruction. So the issue is not the increment (incl) but the memory load through indirect addressing in the mov [movl -20(%rbp), %eax].
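To make the attribution rule above concrete, here is a minimal sketch in C. The helper name skid_adjusted_index and the idea of passing in a sorted list of instruction start addresses are my own assumptions for illustration; this is not something VTune exposes. It also only holds for straight-line code: if the instruction executed just before the sample was a taken branch, the textually preceding instruction is not the one to blame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a VTune API): given the start addresses of the
 * instructions in a disassembly listing, sorted ascending, and the IP that
 * the sample was attributed to, return the index of the instruction that
 * precedes it in the listing. Returns -1 if the sampled IP is the first
 * instruction or is not an instruction boundary in this listing.
 *
 * Assumption: straight-line code, so "previous in the listing" is also
 * "previously executed".
 */
static int skid_adjusted_index(const uint64_t *starts, size_t n,
                               uint64_t sampled_ip)
{
    assert(starts != NULL);

    for (size_t i = 0; i < n; i++) {
        if (starts[i] == sampled_ip) {
            return (i == 0) ? -1 : (int)(i - 1);
        }
    }
    return -1; /* sampled IP not found in the listing */
}
```

For example, if the listing has the movl at index 0 and the incl at index 1, a sample reported on the incl address would map back to index 0, the movl.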
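In "incl", the l suffix signifies a long (32-bit) operand: the basic instruction is "inc", and AT&T syntax appends an operand-size suffix, one of byte (b), word (w), long (l) or quad (q). A plain "inc" without a suffix does not mean single byte; the assembler infers the size from a register operand. The primary use of "inc" is to implement counters, by adding 1 to the destination operand (here, since incl takes a 32-bit operand, most likely a stack slot addressed relative to the base pointer %rbp rather than the %rbp register itself).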
In the "Intel 64 and IA-32 Architectures Software Developer's Manual", you will only find the basic instruction, "INC"; the suffixed form "incl" is AT&T assembler syntax and is not listed separately.
Could you quote the SAV (Sample After Value) chosen for the MEM_LOAD_RETIRED.L1D_LINE_MISS events?
Use the Precise Events to focus on the instructions that cause high L1 & L2 misses; also check which instructions are causing branch mispredictions.
Looking at your asm code, it seems you have compiled the application without any optimization flags (-On); is there any reason for doing so?
Could you try compiling your application with -O3 or -O2, and let the code use SSE registers rather than the x87 stack?
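If it helps to see the effect in isolation, here is a small stand-in kernel you could compile both ways and compare the generated assembly. This is only a sketch: the function sum_array and the file name kernel.c are hypothetical and not taken from your application, and the command lines in the comment assume GCC (other compilers spell these options differently).

```c
#include <stddef.h>

/*
 * Hypothetical kernel, not the poster's code: a counted loop over doubles.
 *
 * Assuming GCC, compare the two assembly listings:
 *   gcc -O0 -S kernel.c -o kernel_O0.s
 *   gcc -O2 -msse2 -mfpmath=sse -S kernel.c -o kernel_O2.s
 *
 * Without optimization, locals such as the loop counter are typically kept
 * in stack slots (addressed off %rbp) and reloaded on every iteration; with
 * -O2 they usually stay in registers.
 */
double sum_array(const double *values, size_t count)
{
    double total = 0.0;

    for (size_t i = 0; i < count; i++) {
        total += values[i];
    }
    return total;
}
```

In the unoptimized listing you would expect to see loads like movl/movq from offsets of %rbp around the counter increment, which is the pattern discussed above.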