Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring
This community is designed for sharing of public information. Please do not share Intel or third-party confidential information here. Store-forwarding Restriction on Data Availability


In the Intel Architecture Optimization Manual ( Section says: 

The value to be stored must be available before the load operation can be completed. If this restriction is violated, the execution of the load will be delayed until the data is available. This delay causes some execution resources to be used unnecessarily, and that can lead to sizable but non-deterministic delays. However, the overall impact of this problem is much smaller than that from violating size and alignment requirements.      In modern microarchitectures, hardware predicts when loads are dependent on and get their data forwarded from preceding stores. These predictions can significantly improve performance. However, if a load is scheduled too soon after the store it depends on or if the generation of the data to be stored is delayed, there can be a significant penalty. 

I think I might be running up against this in a couple cases, and would like to better understand the details of what's happening.   What pattern would the performance counters show if this is happening? 

There are two similar test cases online that I haven't seen well explained:

In this case, a volatile variable is being updated in a tight loop:  mov [mem] -> reg; inc reg; mov reg -> [mem].   There are all sorts of performance oddities that happen.  Eli discovered that adding an extraneous call into the loop made it faster.  Others have found that adding nops has a similar effect.  I've even found that two functions with identical assembly instructions can have significantly different execution speed depending on...  Well, I have no idea:  it might be register asymmetries, loop alignment, or phases of the moon.  I've discovered that one version has10x more IDQ_UOPS_NOT_DELIVERED.CORE events than the other.   Incredibly, this one with more such more such events is consistently _faster_ than the one without.

While computing a histogram of bytes, Yann discovered that performance improved greatly if he wrote to multiple tables and summed them at the end.  He surmised that this was due to a write commit delay, but this didn't seem to be the case.  Others (including me) concluded it was probably related to store-to-load-forwarding, but the performance counter numbers didn't really bear this out.   Speculative loads were then explored, but don't seem to quite cover it either.  Symptomatically, it appears that stores are being issued many times before being retired, but that there is no increase in the number of machine clears.    Is there a mechanism that would notice mispredicted speculative loads, and re-execute specific affected instructions without causing a pipeline flush?  

Any and all insight would be appreciated.





0 Kudos
6 Replies
Black Belt

 >>>Eli discovered that adding an extraneous call into the loop made it faster.  Others have found that adding nops has a similar effec>>>

Adding more workload to the loop and achieving faster code execution seems counter intuitive, but the author of that blog named  Eli made interesting statement about possible reason for perceived code execution speed up. From his point of view call and ret instruction are executed when there is pipeline stall while CPU is waiting for the data. I suppose that at the hardware level call instruction can be translated into uops of the following machine code instructions in pseudocode: mov cs:eip [jump_target], push cs:esp, <ret_address> , jmp cs:eip . I think that jmp instruction can be executed probably by Port7 as an unconditional jump. Still I do not know which resources are utilized by uops which are loading internally eip register and pushing return address.


I posted a simplified version of my question here:

And put up sample code here:

I'm fairly sure that the timing difference has something do with the internal details of the Loop Stream Detector or Decoded Instruction Cache combined with the exact timings of when the store and load are issued.   Alignment of the loop plays a definite role, extending to granularities much greater than a cache line.  But I still haven't been able to put all the pieces together to figure out what's really happening, or which performance counter measurements can best distinguish between the fast and the slow cases.


Black Belt

>>>I'm fairly sure that the timing difference has something do with the internal details of the Loop Stream Detector or Decoded Instruction Cache combined with the exact timings of when the store and load are issued. >>>

I am not sure if really LSD is used because of call/ret instructions which change code control flow.On the other hand probably  call/ret are just like unconditional jumps to known beforehand  target.

Btw, if you have VTune installed you can profile two version of the loop one with call instruction and one without  and hope that profiler will sched some light on that mystery.

Black Belt
   400880:       48 8b 15 c1 07 20 00    mov    0x2007c1(%rip),%rdx  # 601048 <counter>
   400887:       48 83 c2 01             add    $0x1,%rdx
   40088b:       83 e8 01                sub    $0x1,%eax
   40088e:       48 89 15 b3 07 20 00    mov    %rdx,0x2007b3(%rip)  # 601048 <counter>
   400895:       75 e9                   jne    400880 <loop8+0x20>

>>>And put up sample code here:>>>

Please perform if possible VTune analysis of every function. Start with Basic Hotspot Analysis. It is very hard to precisely explain what really is happening during various functions execution.

I suppose that posted above loop was processed by LSD  beside speeding up execution by reducing macro-instructions fetch/decoding cycles I cannot explain such a big variation during the execution of various functions.


Hi Ilya --

Thanks for your continued attention.  I've managed to come up with an even more succinct example of what I think is the same issue:

; Minimal example, see also
; To build (Linux):
;   nasm -felf64 func.asm
;   ld func.o
; Then run:
;   perf stat -r10 ./a.out
; On Haswell and Sandy Bridge, observed runtime varies 
; ~15% depending on whether sub or sbb is used in the loop
section .text
global _start
  push qword 0h       ; put counter variable on stack
  jmp loop            ; jump to function
align 64              ; function alignment.
  mov rcx, 1000000000
align 64              ; loop alignment.
  mov rax, [rsp]
  add rax, 1h
  mov [rsp], rax
; sbb rcx, 1h         ; which is faster: sbb or sub?
  sub rcx, 1h         ; switch, time it, and find out
  jne l               ; (rot13 spoiler: foo vf snfgre ol 15%)
fin:                  ; If that was too easy, explain why.
  mov eax, 60
  xor edi, edi        ; End of program. Exit with code 0

It's a mindbender, but I think I'm starting to understand it.  I've been looking at it with 'likwid' and 'perf'.  I'll try to explore it with VTune as well.  


Black Belt

>>>Thanks for your continued attention. >>>

You are welcome.

Regarding my answer #5 I think that counter variable has high temporal locality , but I do not know if that can be responsible for execution time variation. Try to check with VTune if that code has also Front-End stalls. On the other hand I am not sure if there are a significant Front-End stalls because of aferomentioned temporal locality of the counter variable. I suppose that bulk of the processing is done on cached value.Variation in time measurement can be due to code sharing execution units and cache in HyperThreading scenario. Moreover there are various more priviledged threads and ISRs which can preempt your code at various moments during the program execution.