IPIs and weak memory ordering

Michael4 · ‎05-26-2010

Hi all!

I was wondering if it was possible for an IPI to "overtake" a memory write.

For example:
1. CPU A writes some global variable (and the write happens to stay in the store buffer for a long time)
2. CPU A sends an IPI to CPU B
3. CPU B's IPI ISR reads the global variable

Is it theoretically possible in this scenario that the store buffer of CPU A has not been drained to the cache/memory when CPU B takes the interrupt and thus reads an old value of the variable?
I.e. is an explicit synchronisation instruction needed?

I couldn't find any information on that in chapter 8.2 (Memory Ordering) of the Software Developer's Manual Vol. 3. And while chapter 11.10 (Store Buffer) says that the store buffer is drained whenever an "exception or interrupt is generated", I suspect this only refers to the CPU receiving the interrupt, not the one sending it.

Cheers
Michael

Dmitry_Vyukov · ‎05-27-2010

You could consult the Linux kernel sources. As far as I remember, arch/x86 uses no special instructions to ensure memory visibility before sending an IPI. In any case, the instruction that waits for the store buffer to drain is MFENCE.
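For illustration, here is a minimal sketch of the "write, fence, then signal" pattern, assuming a portable C11 full fence in place of a hand-written MFENCE (on x86, compilers typically emit MFENCE or a LOCK-prefixed instruction for it). The names `shared_value`, `send_ipi_stub` and `publish_and_notify` are hypothetical; the stub only records the call and stands in for whatever routine actually programs the local APIC. Whether the fence is architecturally required on x86 is exactly the open question of this thread, so treat it as a conservative choice rather than a statement about the hardware.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Shared data that CPU B's IPI handler is expected to read. */
static _Atomic int shared_value;

/* Hypothetical stand-in for the platform's IPI send routine.
 * It only records that it was called and which value was visible then. */
static bool ipi_sent;
static int value_seen_at_send;

static void send_ipi_stub(int target_cpu)
{
    (void)target_cpu; /* unused in this sketch */
    value_seen_at_send = atomic_load_explicit(&shared_value, memory_order_relaxed);
    ipi_sent = true;
}

/* CPU A: write the value, issue a full barrier, then send the IPI. */
static void publish_and_notify(int value, int target_cpu)
{
    atomic_store_explicit(&shared_value, value, memory_order_relaxed);

    /* Full barrier between the data write and the IPI request. */
    atomic_thread_fence(memory_order_seq_cst);

    send_ipi_stub(target_cpu);
}
```

Calling `publish_and_notify(42, 1)` writes the value, fences, and only then invokes the (stubbed) IPI send.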

Dmitry_Vyukov · ‎05-27-2010

However, that was indeed possible on some architectures in the past, so your concern is not unfounded. Here is an excerpt from the book "Is Parallel Programming Hard, And, If So, What Can You Do About It?":

C.9 Advice to Hardware Designers
There are any number of things that hardware designers
can do to make the lives of software people
difficult. Here is a list of a few such things that we
have encountered in the past, presented here in the
hope that it might help prevent future such problems:
...
3. Inter-processor interrupts (IPIs) that ignore
cache coherence.
This can be problematic if the IPI reaches its
destination before all of the cache lines in the
corresponding message buffer have been committed
to memory.

jimdempseyatthecove · ‎05-27-2010

You could use MFENCE as Dmitry suggests, or, if you set up single-producer/single-consumer messaging, you can use a present/taken structure. A sketch follows:

message_t* volatile messageAtoB = NULL; // volatile so the spin loops re-read it

// code on A
void SendMessageToB(message_t* message)
{
    // check for prior message not taken
    // should seldom occur
    while (messageAtoB)
        _mm_pause(); // not taken (rework this code for failures)
    messageAtoB = message;
    IPI(signalB);
}

...

// code on B
message_t* ReadMessageFromA()
{
    while (!messageAtoB)
        _mm_pause(); // not present (rework this code for failures)
    message_t* p = messageAtoB;
    messageAtoB = NULL; // A will eventually observe we took the message
    return p;
}

Expand the sketch to use a ring buffer and to issue the IPI on first fill.
Also flesh out the error detection for lost and/or spurious interrupts.

Note, the above is a sketch and not necessarily the code you would implement.

message_t* volatile messageAtoB = NULL;
message_t* volatile newMessageForB = NULL; // set by B's IPI handler
// code on A
void SendMessageToB(message_t* message)
{
    // check for prior message not taken
    // should seldom occur
    while (messageAtoB)
    {
        if (newMessageForB == NULL)
            IPI(signalB); // B has no message pending: signal again
        _mm_pause(); // not taken (rework this code for failures)
    }
    messageAtoB = message;
    if (newMessageForB == NULL)
        IPI(signalB);
}

...

// code on B (pseudocode for the IPI handler; uses newMessageForB declared above)
IPIscan:
    push rax
    ...
    if (messageAtoB)
    {
        newMessageForB = messageAtoB;
        messageAtoB = NULL; // A will eventually observe we took the message
    }
    ...
    pop rax
    iret

Something along the above ought to work.

Jim Dempsey
www.quickthreadprogramming.com
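For comparison, the present/taken idea can also be written with C11 atomics, which makes the intended ordering explicit instead of relying on plain loads and stores. This is only a sketch under assumptions: the `message_t` layout and the names `mailbox_a_to_b`, `try_send_to_b` and `try_take_from_a` are hypothetical, the IPI itself is left out, and both functions are non-blocking so the caller decides how to handle "slot still full" or "nothing there" (for example after a spurious interrupt).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message type, for illustration only. */
typedef struct message {
    int payload;
} message_t;

/* Single slot: NULL means "taken/empty", non-NULL means "present". */
static _Atomic(message_t *) mailbox_a_to_b = NULL;

/* Sender (CPU A): returns false if the previous message was not taken yet.
 * The release ordering makes earlier writes to *msg visible to a receiver
 * that obtains the pointer with acquire ordering. */
static bool try_send_to_b(message_t *msg)
{
    message_t *expected = NULL;
    return atomic_compare_exchange_strong_explicit(
        &mailbox_a_to_b, &expected, msg,
        memory_order_release, memory_order_relaxed);
}

/* Receiver (CPU B, e.g. from its IPI handler): returns the message and
 * marks the slot as taken, or returns NULL if nothing was present. */
static message_t *try_take_from_a(void)
{
    return atomic_exchange_explicit(&mailbox_a_to_b, (message_t *)NULL,
                                    memory_order_acquire);
}
```

In this arrangement A would call `try_send_to_b()` and send the IPI only when it returns true, and B's handler would call `try_take_from_a()` and ignore a NULL result.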

Michael4 · ‎06-07-2010

Thanks for your replies.
Yes, I have also seen that Linux assumes that such a behaviour is not possible.
Nevertheless, I was wondering if this assumption is justified.
In other words: which part of the Software Developer's Manual guarantees that I'm allowed to assume that?
I suspect this information is missing in the manual and therefore suggest it should be updated.

Just to clarify my interest in this topic:
I'm not just writing some code which I want to work correctly.
I'm developing a formal multiprocessor execution model for x86 CPUs in which I have to formally state whether such a behaviour is possible or not. And I have to justify such a formalisation with a reference to the Software Developer's Manual.

Changbin · ‎04-02-2024

It's possible, and a barrier is required for such architectures. See the smp_call implementation of the Linux kernel.

void __smp_call_single_queue(int cpu, struct llist_node *node)
{
        ...
        /*
         * The list addition should be visible to the target CPU when it pops
         * the head of the list to pull the entry off it in the IPI handler
         * because of normal cache coherency rules implied by the underlying
         * llist ops.
         *
         * If IPIs can go out of order to the cache coherency protocol
         * in an architecture, sufficient synchronisation should be added
         * to arch code to make it appear to obey cache coherency WRT
         * locking and barrier primitives. Generic code isn't really
         * equipped to do the right thing...
         */
        if (llist_add(node, &per_cpu(call_single_queue, cpu)))
                send_call_function_single_ipi(cpu);
}

There is also a fence (mfence + lfence) before sending an IPI over x2APIC:

static void x2apic_send_IPI(int cpu, int vector)
{
	u32 dest = per_cpu(x86_cpu_to_apicid, cpu);

	/* x2apic MSRs are special and need a special fence: */
	weak_wrmsr_fence();
	__x2apic_send_IPI_dest(dest, vector, APIC_DEST_PHYSICAL);
}

Changbin · ‎04-08-2024

See my full answer here: https://stackoverflow.com/questions/76352933/will-memory-write-be-visible-after-sending-an-ipi-on-x86/78264953#78264953

BSD4dot2 · ‎08-13-2025

Are barriers required in the case of a legacy xAPIC?

As the SDM points out "(Note: The MMIO-based xAPIC interface is mapped by system software as an un-cached region. Consequently, read/writes to the xAPIC-MMIO interface have serializing semantics in the xAPIC mode.)"

This seems to imply that all stores issued by the IPI sender should be visible to the IPI receiver even without any explicit atomics or barriers. Is that correct?