Contention among atomic operations

ASing3 · ‎07-28-2018

Hi,

Atomic instructions are used to modify shared memory variables.

1. What happens if atomic instructions on two cores that access the same variable are triggered at the same time? Is there any reference that explains in detail how atomic instruction contention is handled?

2. If multiple threads on different cores execute atomic instructions that modify the same variable, how are these requests ordered?

thanks

Ajit

McCalpinJohn · ‎07-30-2018

If two (or more) logical processors attempt to perform an atomic update to the same memory location "at the same time", the hardware will ensure that the operations take place sequentially. If there is no additional code enforcing an ordering on the operations, then they will occur in some order that the hardware decides.

In most cases, attempting to apply the concept "at the same time" results in confusion, because that is not how the hardware works. For example, in processors with inclusive L3 caches, each L3 cache slice is limited to processing one command per cycle. Even if the cores could be synchronized to ensure that the requests got onto the ring in the same cycle, they would likely not arrive at the target L3 in the same cycle. Even if it were possible that the commands were received by the L3 in the same cycle (e.g., one arriving on the clockwise ring and one on the counterclockwise ring), the hardware would pick an order in which to process them. (I can't tell from the available documentation whether the "even" and "odd" properties of the rings prevent arrival of two commands at the same destination in the same cycle, but even if the commands can arrive in the same cycle, choosing one to go first is not rocket science.)

There are different types of "atomic" operations used by different architectures. In some architectures, operations that are not chosen to go first will be stalled (then retried by the hardware until they succeed), while in other architectures they will "fail" (for software-based retry). In an Intel processor, for example, a locked ADD instruction will be retried by the hardware if the target memory location is busy, while a locked "compare and exchange" operation must be checked to see if it succeeded (so the software must notice the failure and retry the operation).
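
The two styles described above can be sketched in portable C++. This is only an illustration, not a statement about what any particular compiler emits: on x86-64 a `fetch_add` is typically compiled to a LOCK-prefixed instruction that the hardware completes without software involvement, while `compare_exchange_weak` reports failure and leaves the retry to the caller. The function names and the `run_contended` helper are hypothetical, chosen for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Single atomic read-modify-write. From the software's point of view it
// cannot fail; contending cores are serialized by the hardware.
inline void add_with_fetch_add(std::atomic<long>& counter) {
    counter.fetch_add(1, std::memory_order_relaxed);
}

// Compare-and-exchange. The operation can fail if another thread changed
// the value first, so the software must check the result and retry.
inline void add_with_cas(std::atomic<long>& counter) {
    long expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_relaxed)) {
        // On failure, `expected` has been refreshed with the current value,
        // so the loop simply tries again with the new value.
    }
}

// Hypothetical helper: num_threads threads each increment one shared
// counter iters times, using either style, and the final value is returned.
long run_contended(int num_threads, int iters, bool use_cas) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, iters, use_cas] {
            for (int i = 0; i < iters; ++i) {
                if (use_cas) {
                    add_with_cas(counter);
                } else {
                    add_with_fetch_add(counter);
                }
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter.load();
}
```

Either way no increment is lost; the difference is only in who performs the retry. The order in which the threads' updates land is still whatever the hardware decides.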

ASing3 · ‎07-30-2018

Dr. McCalpin

Thanks for your reply. I need to model the interaction of atomic operations for my work. What I have observed in my experiments (on a Xeon E7 8890) is that cores nearer to the core currently holding the lock are more likely to get the lock than cores that are farther away. Given that the cores are connected over a ring bus, I assume that after the core holding the lock releases it, control flows to the cores in the order in which they are connected to the ring, hence the preference for cores nearer to the core currently holding the lock. Which of the two rings is chosen is probably decided at random.

Is my assumption correct?

thanks

Ajit

jimdempseyatthecove · ‎07-31-2018

Hardware threads sharing an L1 (i.e. HyperThreads of the same core) will tend to observe the unlocked condition first, then threads sharing an L2 (e.g. Core Duo, where 2 cores share an L2), then threads sharing an L3, then threads sharing the LLC, then threads on shorter QPI/UPI/NUMA node paths (in shorter-to-longer path order). Not all CPUs have all of the features listed above.

Note use of "tend".

If you require fairness, then you will have to finesse this with coding rather than using a simplified lock.

Jim Dempsey
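
One common way to get fairness "with coding" is a ticket lock, which hands the lock out in arrival order regardless of which core is closest to the releasing core. The following is a minimal sketch, not something proposed in this thread; the class and helper names are hypothetical, and a production lock would need more care about spinning and cache-line placement.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Minimal FIFO ticket lock (illustrative sketch).
class TicketLock {
public:
    void lock() {
        // Taking a ticket is one atomic increment; the hardware orders
        // contending threads here, and that order becomes the lock order.
        const unsigned my_ticket =
            next_ticket_.fetch_add(1, std::memory_order_relaxed);
        // Wait until our ticket is served. yield() keeps the sketch usable
        // when there are more threads than cores.
        while (now_serving_.load(std::memory_order_acquire) != my_ticket) {
            std::this_thread::yield();
        }
    }

    void unlock() {
        // Only the lock holder writes now_serving_, so a plain load followed
        // by a release store is sufficient.
        now_serving_.store(
            now_serving_.load(std::memory_order_relaxed) + 1,
            std::memory_order_release);
    }

private:
    std::atomic<unsigned> next_ticket_{0};
    std::atomic<unsigned> now_serving_{0};
};

// Hypothetical helper: each thread increments a plain (non-atomic) counter
// iters times under the lock; the final count is returned.
long count_with_ticket_lock(int num_threads, int iters) {
    TicketLock lock;
    long shared = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&lock, &shared, iters] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++shared;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return shared;
}
```

The trade-off is that strict FIFO handoff can cost throughput, because the next ticket holder may be the thread farthest from the cache line.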

McCalpinJohn · ‎07-31-2018

The result probably depends on the relative location of the cores as well as the location of the L3 (or CHA) responsible for the physical address being accessed. You may be able to change the behavior by changing the memory address being used. Intel does not document the mapping of physical addresses to L3 or CHA slices, but there is good reason to believe that the 64 cache lines in every (aligned) 4KiB page have at least one cache line mapped to each L3 (or CHA).
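
To experiment with different addresses, one option is to allocate a 4KiB-aligned block and place the contended variable on a chosen cache line within it. The sketch below assumes 64-byte cache lines and 4KiB pages; the type and function names are hypothetical. It does not reveal which L3 or CHA slice a line maps to, since that mapping is undocumented; it only lets you vary the line.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineBytes = 64;    // assumed cache line size
constexpr std::size_t kPageBytes = 4096;       // assumed (small) page size
constexpr std::size_t kLinesPerPage = kPageBytes / kCacheLineBytes;  // 64

// One counter per cache line, padded so counters never share a line.
struct alignas(kCacheLineBytes) PaddedCounter {
    std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLineBytes, "one line each");

// A page-aligned block holding one counter on each of its 64 cache lines.
struct alignas(kPageBytes) CounterPage {
    PaddedCounter lines[kLinesPerPage];
};
static_assert(sizeof(CounterPage) == kPageBytes, "exactly one page");

// Returns the counter that lives on cache line `line_index` of the page.
inline std::atomic<long>& counter_on_line(CounterPage& page,
                                          std::size_t line_index) {
    return page.lines[line_index % kLinesPerPage].value;
}

// Byte offset of an address within its 4KiB page. Because the block is
// page-aligned, this offset is the same for the virtual and physical address.
inline std::size_t offset_in_page(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kPageBytes;
}
```

Running the same contention experiment with `counter_on_line(page, k)` for each `k` from 0 to 63 would show whether the observed preference changes with the address.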

If you need some kind of "fairness" that the hardware does not naturally provide, then you will almost certainly have to implement your own software infrastructure that monitors the "fairness" and adjusts the behavior in response. For example, if a thread is getting the lock "too often", the software might need to add a delay between that thread's attempts to do atomic operations on a contended location. If this is managed at a relatively coarse granularity, then the overhead should be small -- but the logic will probably be ugly....
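
A rough sketch of that kind of coarse-grained monitoring follows. It is only one possible policy, and every name and constant in it (`kCheckInterval`, `kMaxLead`, the 50-microsecond delay) is a hypothetical choice for illustration. Each thread keeps its own success count on a private cache line and, once every `kCheckInterval` operations, delays itself if it is far ahead of the slowest thread.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

constexpr long kCheckInterval = 64;  // check fairness once per 64 operations
constexpr long kMaxLead = 256;       // allowed lead over the slowest thread

// Per-thread progress counter, padded to avoid false sharing.
struct alignas(64) PaddedCount {
    std::atomic<long> value{0};
};

// True if thread `self` has completed more than kMaxLead operations beyond
// the thread that has completed the fewest.
bool is_too_far_ahead(const std::vector<PaddedCount>& counts,
                      std::size_t self) {
    const long mine = counts[self].value.load(std::memory_order_relaxed);
    long slowest = mine;
    for (const auto& c : counts) {
        slowest = std::min(slowest, c.value.load(std::memory_order_relaxed));
    }
    return mine - slowest > kMaxLead;
}

// Each thread performs iters atomic increments on one contended location,
// throttling itself when it gets too far ahead. Returns the final total.
long run_throttled(int num_threads, long iters) {
    std::atomic<long> contended{0};
    std::vector<PaddedCount> counts(static_cast<std::size_t>(num_threads));
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < counts.size(); ++t) {
        workers.emplace_back([&contended, &counts, t, iters] {
            for (long i = 0; i < iters; ++i) {
                contended.fetch_add(1, std::memory_order_relaxed);
                const long done =
                    counts[t].value.fetch_add(1, std::memory_order_relaxed) + 1;
                // Coarse-grained check keeps the monitoring overhead small.
                if (done % kCheckInterval == 0 && is_too_far_ahead(counts, t)) {
                    std::this_thread::sleep_for(std::chrono::microseconds(50));
                }
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return contended.load();
}
```

The delay is a single bounded sleep rather than a wait loop, so a thread that is ahead is slowed down but can never be blocked indefinitely by a stalled peer.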