Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Memory store retirement

Wang_Jaff
Beginner
In the context of Nehalem/Sandy Bridge CPU architecture.

One thread (bound to core 1) writes data to the same cache line in sequence:

while(1)
{
    ..........
    Write A
    Write B
    Write C
}

Another thread (bound to core 2) reads the same data in the same sequence:

while(1)
{
    ...........
    Read A
    Read B
    Read C
    ...........
}

It is assumed that the cache line in which A, B, and C reside is marked as Shared (S) before the first core
starts writing A, B, and C.

The really interesting question is: when is C going to be retired (written to the L1 cache) so that it becomes visible to the second core?
Logically it should be retired right after the write to C appears in the store buffer, but I could imagine the CPU might not drain stores from the store buffer as soon as they appear; it might wait until, say, the buffer holds at least two entries to retire, so it can combine B and C in one shot.


In my dev environment I have an issue with the update to C being read with an 80 ns delay, so knowing how store retirement works might help me find a way to improve the latency.
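For reference, one way to measure that kind of store-to-load visibility delay is sketched below. This is only an illustration, not the exact harness behind the 80 ns figure, and it assumes an invariant TSC that is synchronized across cores (which Nehalem/Sandy Bridge parts provide):

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

std::atomic<uint64_t> stamp{0};          // the writer's TSC value plays the role of "Write C"

void writer()
{
    stamp.store(__rdtsc(), std::memory_order_release);
}

void reader()
{
    uint64_t t;
    while ((t = stamp.load(std::memory_order_acquire)) == 0)
        ;                                // spin until the writer's store becomes visible
    std::printf("store-to-load visibility: %llu TSC ticks\n",
                (unsigned long long)(__rdtsc() - t));
}

int main()
{
    std::thread r(reader);               // start the reader first so it is already spinning
    std::thread w(writer);
    w.join();
    r.join();
    return 0;
}
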
Hussam_Mousa__Intel_
New Contributor II
This code is not multithread-safe, and no guarantees or predictions can be made about the order of retirement without locks or sync points in place.

Store buffers are per-hardware-thread structures and provide ordering only at that level. In the case of your writer thread's code, there is no requirement for when A, B, and C need to be retired, since there are no read instructions in that same thread. In fact, it is equally possible that the loop will continue entirely without writing back to cache at all, since the hardware will notice that the writes to A, B, and C are never read before being re-written. It is likely that the cache write-back (which will become visible to the read thread) only happens due to a completely unrelated event, like a pre-emption of the first thread.
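To make that concrete, the simplest guaranteed sync point is a lock around the group of accesses; a minimal illustration (not the poster's code, names are made up):

#include <mutex>

int A, B, C;
std::mutex m;

void writer_iteration()
{
    std::lock_guard<std::mutex> lk(m);   // sync point: the three stores become
    A = 1;                               // visible to the reader as one unit
    B = 2;                               // once the lock is released
    C = 3;
}

void reader_iteration()
{
    std::lock_guard<std::mutex> lk(m);   // sees either all or none of a writer iteration
    int a = A, b = B, c = C;
    (void)a; (void)b; (void)c;
}
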
SergeyKostrov
Valued Contributor II
Quoting garkus

...When is C going to be retired (written to the L1 cache) so that it becomes visible to the second core?..

I'd like to understand why you don't use any synchronization objects. What about the integrity of the data for each read operation?
Wang_Jaff
Beginner
Apologies, the provided pseudo-code was not exactly correct; A, B, and C ought to be integer/double/char. Basically it is like this:

One thread (bound to core 1) writes data to the same cache line in sequence:

Initially C=0
............

while(1)
{
    Read C
    full_compiler_fence
    if (C == 0)
    {
        Write A
        Write B
        full_compiler_fence
        Write C = 1
        (store fence here has no impact on latency at all)
    }
}

Another thread (bound to core 2) reads the same data in the same sequence:

while(1)
{
    Read C
    full_compiler_fence
    if (C == 1)
    {
        Read A
        Read B
        full_compiler_fence
        Write C = 0
    }
}


My understanding is that writes/reads of a properly aligned integer/double/char are atomic, guaranteed to be retired in the best possible manner (I want to speed it up as much as possible, though), and not reordered.
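In real code I would expect this handshake to look roughly like the sketch below, written with C++11 atomics instead of compiler fences (an illustration only; the names and types are made up). On x86 the acquire loads and release stores compile to plain MOVs, so they should not add latency, but they make the ordering guarantees explicit:

#include <atomic>
#include <thread>

// A, B and the flag C deliberately share one cache line, as in the example above.
struct Shared {
    int A = 0;
    int B = 0;
    std::atomic<int> C{0};               // 0: writer's turn, 1: reader's turn
} shared;

void writer()
{
    for (;;) {
        if (shared.C.load(std::memory_order_acquire) == 0) {
            shared.A = 1;
            shared.B = 2;
            shared.C.store(1, std::memory_order_release);   // publishes A and B with the flag
        }
    }
}

void reader()
{
    for (;;) {
        if (shared.C.load(std::memory_order_acquire) == 1) {
            int a = shared.A;
            int b = shared.B;
            (void)a; (void)b;
            shared.C.store(0, std::memory_order_release);   // hand the line back to the writer
        }
    }
}

int main()
{
    std::thread w(writer), r(reader);    // pin them to cores 1 and 2 with OS affinity calls if desired
    w.join();
    r.join();
    return 0;
}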

Though Hussam noted that a write might not be retired ASAP, which does strike me a bit. In practice, on a Nehalem Core i5 I achieved 350 CPU ticks of latency between the write to C in thread 1 and the read of C in thread 2. Any sort of synchronization only slows it down. Should the code still be synchronized to make it rock-solid?

Also, is it still somehow possible to speed up write retirement in my example (a store memory fence after the write to C in thread 1 makes no difference at all)?

Regards,
Nikolay

SergeyKostrov
Valued Contributor II
Quoting garkus
...Also, is it still somehow possible to speed up write retirement in my example (a store memory fence after the write to C in thread 1 makes no difference at all)?

I saw performance improvements (a couple of percent) when the priority of the process/thread was changed from Normal to Real-Time.
However, I'm not sure it will help in your case because you've already "pushed everything hard". I would simply try it.
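A minimal sketch of that change on Windows (the OS is an assumption here, since the thread does not say which one is used):

#include <windows.h>

int main()
{
    // Raise the whole process to the Real-Time priority class. This typically
    // requires the "increase scheduling priority" privilege; without it Windows
    // silently falls back to HIGH_PRIORITY_CLASS.
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);

    // Raise the current thread within that class.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

    // ... run the writer/reader threads here ...
    return 0;
}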

Best regards,
Sergey
SergeyKostrov
Valued Contributor II
Quoting garkus
Also, is it still somehow possible to speed up ... my example?


Hi Nikolay,

Since you have just two threads, I think it would be very interesting to compare your test case, which uses the 'full_compiler_fence' approach,
with a test case that uses Dekker's algorithm for synchronization of two threads (it doesn't use any external synchronization objects):

http://en.wikipedia.org/wiki/Dekker's_algorithm
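A minimal sketch of that algorithm with C++11 atomics (an illustration only; sequentially consistent operations are used because the algorithm needs the store of our own flag to be ordered before the load of the other thread's flag, which plain x86 stores and loads do not guarantee):

#include <atomic>

// Dekker's algorithm for two threads (ids 0 and 1).
std::atomic<bool> wants_to_enter[2] = {{false}, {false}};
std::atomic<int>  turn{0};

void lock(int i)
{
    const int other = 1 - i;
    wants_to_enter[i].store(true);           // default memory_order_seq_cst
    while (wants_to_enter[other].load()) {
        if (turn.load() != i) {
            wants_to_enter[i].store(false);  // back off
            while (turn.load() != i) { }     // wait for our turn
            wants_to_enter[i].store(true);
        }
    }
}

void unlock(int i)
{
    turn.store(1 - i);                       // give the turn away
    wants_to_enter[i].store(false);
}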

Best regards,
Sergey

Hussam_Mousa__Intel_
New Contributor II
Hi Nikolay,

Synchronization strategies aside (even though in your case these will account for the majority of the performance and correctness characteristics), it is important to understand at what architectural level these memory objects are shared and how the hardware ensures memory coherence.

When you say core 0 and core 1, there are three different situations this can apply to:
1- Cores 0 and 1 are SMT siblings on the same physical core, each running on a hardware thread.
2- Cores 0 and 1 are separate physical cores on the same physical processor (package).
3- Cores 0 and 1 are cores on different physical processors (i.e., in a system with two or more processors).

In the case of #1, the shared objects can reside either in the L1 cache or the L2 cache. They can also be forwarded directly within the back-end pipeline from the store to the load registers (this is likely for reads/writes within the same thread; I am not 100% sure whether it happens for SMT thread siblings).

For #2, they can be shared at the LLC level, and snoop mechanisms can also forward them from the L2 caches.

For #3, they are synchronized at the LLC level, using remote snooping for direct LLC-to-LLC forwarding (but that will add latency compared with sharing within the same local LLC).

Regardless of which thread layout is the case, the writer thread will constantly be marking the cache line as 'Exclusive' (locked for write) or 'Modified' (just written to), so the reader thread will almost always miss its L1 and possibly L2 caches as it waits for the most recent value to be forwarded from the writer thread's cached copy.

An interesting experiment is to enable SMT and then pin your two threads to the hardware threads of the same core. I suspect they will share at the L1 level, and you should be able to get much lower latency.
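A sketch of that pinning on Windows (the OS and the logical-processor numbering are assumptions; check the actual topology, for example with the Coreinfo utility, before relying on it):

#include <windows.h>
#include <thread>

// Stand-ins for the writer/reader loops discussed above.
void writer() { /* ... write A, B, then C = 1 ... */ }
void reader() { /* ... wait for C == 1, read A and B, set C = 0 ... */ }

int main()
{
    // On many Intel topologies with SMT enabled, logical processors 0 and 1
    // are the two hardware threads of the same physical core.
    std::thread w([] { SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << 0); writer(); });
    std::thread r([] { SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << 1); reader(); });
    w.join();
    r.join();
    return 0;
}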

-Hussam