Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

IPIs and weak memory ordering

Michael4
Beginner
1,602 Views
Hi all!

I was wondering if it was possible for an IPI to "overtake" a memory write.

For example:
1. CPU A writes some global variable (and the write happens to stay in the store buffer for a long time)
2. CPU A sends an IPI to CPU B
3. CPU B's IPI ISR reads the global variable

Is it theoretically possible in this scenario that the store buffer of CPU A has not been drained to the cache/memory when CPU B takes the interrupt and thus reads an old value of the variable?
I.e. is an explicit synchronisation instruction needed?

I couldn't find any information on that in chapter 8.2 (Memory Ordering) of the Software Developer's Manual Vol. 3. And while chapter 11.10 (Store Buffer) says that the store buffer is drained whenever an "exception or interrupt is generated", I suspect this only refers to the CPU receiving the interrupt, not the one sending it.

Cheers
Michael
Dmitry_Vyukov
Valued Contributor I
1,602 Views
I think you could consult the Linux kernel sources. As far as I remember, arch/x86 uses no special instructions to ensure memory visibility before sending an IPI. In any case, the instruction that waits for the store buffer to drain is MFENCE.
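To illustrate where such a fence would sit, here is a minimal sketch in portable C11, assuming a hypothetical `send_ipi_stub()` in place of the real (privileged) IPI send routine. The names `shared_value`, `publish_then_notify` and `ipi_sent_count` are made up for this example. On x86, compilers typically implement the sequentially consistent fence with MFENCE or a locked instruction. Whether the fence is strictly required before an IPI on x86 is the open question of this thread; the sketch only shows the conservative variant.

```c
#include <assert.h>
#include <stdatomic.h>

static _Atomic int shared_value;   /* the global variable from the question */
static int ipi_sent_count;         /* bookkeeping for the stub only */

/* Hypothetical stand-in for the platform's IPI send routine. */
static void send_ipi_stub(int target_cpu)
{
    assert(target_cpu >= 0);
    ipi_sent_count++;
}

/* Step 1: write the variable. Step 2: fence. Step 3: send the IPI. */
void publish_then_notify(int value, int target_cpu)
{
    atomic_store_explicit(&shared_value, value, memory_order_relaxed);

    /* Full fence: the store above is ordered before anything that follows. */
    atomic_thread_fence(memory_order_seq_cst);

    send_ipi_stub(target_cpu);
}
```

A caller on CPU A would invoke `publish_then_notify(42, 1)`; the ISR on CPU B would then read `shared_value`.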


Dmitry_Vyukov
Valued Contributor I
1,602 Views
However, that was indeed possible on some architectures in the past, so your concern is not unfounded. Here is an excerpt from the book "Is Parallel Programming Hard, And, If So, What Can You Do About It?":

C.9 Advice to Hardware Designers
There are any number of things that hardware designers
can do to make the lives of software people
difficult. Here is a list of a few such things that we
have encountered in the past, presented here in the
hope that it might help prevent future such problems:
...
3. Inter-processor interrupts (IPIs) that ignore
cache coherence.
This can be problematic if the IPI reaches its
destination before all of the cache lines in the
corresponding message buffer have been committed
to memory.

jimdempseyatthecove
Honored Contributor III
1,602 Views
You could use MFENCE as Dmitry suggests, or, if you set up single-producer, single-consumer messaging, you can use a present/taken structure. A sketch follows:

// volatile so the compiler re-reads the pointer on every loop iteration
message_t* volatile messageAtoB = NULL;

// code on A
void SendMessageToB(message_t* message)
{
    // check for prior message not taken
    // should seldom occur
    while (messageAtoB)
        _mm_pause(); // not taken (rework this code for failures)
    messageAtoB = message;
    IPI(signalB);
}

...

// code on B
message_t* ReadMessageFromA()
{
    while (!messageAtoB)
        _mm_pause(); // not present (rework this code for failures)
    message_t* p = messageAtoB;
    messageAtoB = NULL; // A will eventually observe we took the message
    return p;
}

Expand the sketch to use a ring buffer and to issue the IPI only on first fill.
Also flesh out the error handling for a lost interrupt and/or a spurious interrupt.

Note, the above is a sketch and not necessarily the code you would implement.
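The ring-buffer expansion mentioned above could look roughly like the following C11 sketch. It is not Jim's code: the names `spsc_ring_t`, `ring_init`, `ring_push`, `ring_pop` and the slot count are assumptions made for illustration. All atomic accesses use the default sequentially consistent ordering, so that either B sees the new message before it finds the ring empty, or A sees that B had drained the ring and reports that an IPI is needed. The check errs on the side of an occasional unnecessary IPI.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 8u   /* assumed capacity; must be a power of two */

typedef struct {
    void *slot[RING_SLOTS];
    _Atomic size_t head;   /* next slot to read; written only by consumer B */
    _Atomic size_t tail;   /* next slot to write; written only by producer A */
} spsc_ring_t;

void ring_init(spsc_ring_t *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer (CPU A). Returns false if the ring is full.
 * On success, *need_ipi is true when B had consumed everything that was
 * queued before this message, so A should send the IPI. */
bool ring_push(spsc_ring_t *r, void *msg, bool *need_ipi)
{
    assert((RING_SLOTS & (RING_SLOTS - 1)) == 0);
    size_t tail = atomic_load(&r->tail);
    if (tail - atomic_load(&r->head) == RING_SLOTS)
        return false;                          /* full */
    r->slot[tail & (RING_SLOTS - 1)] = msg;
    atomic_store(&r->tail, tail + 1);          /* publish the message */
    /* Re-read head after publishing to avoid a lost wake-up. */
    *need_ipi = (atomic_load(&r->head) == tail);
    return true;
}

/* Consumer (CPU B), e.g. called in a loop from the IPI handler.
 * Returns NULL when the ring is empty. */
void *ring_pop(spsc_ring_t *r)
{
    size_t head = atomic_load(&r->head);
    if (head == atomic_load(&r->tail))
        return NULL;                           /* empty */
    void *msg = r->slot[head & (RING_SLOTS - 1)];
    atomic_store(&r->head, head + 1);          /* hand the slot back to A */
    return msg;
}
```

A would call `ring_push()` and send the IPI only when `need_ipi` comes back true; B's handler would call `ring_pop()` until it returns NULL.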

// volatile so the compiler re-reads the pointers on every access
message_t* volatile messageAtoB = NULL;
message_t* volatile newMessageForB = NULL;

// code on A
void SendMessageToB(message_t* message)
{
    // check for prior message not taken
    // should seldom occur
    while (messageAtoB)
    {
        if (newMessageForB == NULL)
            IPI(signalB);
        _mm_pause(); // not taken (rework this code for failures)
    }
    messageAtoB = message;
    if (newMessageForB == NULL)
        IPI(signalB);
}

...

// code on B (pseudocode for the interrupt handler;
// newMessageForB is the variable declared above)
IPIscan:
    push rax
    ...
    if (messageAtoB)
    {
        newMessageForB = messageAtoB;
        messageAtoB = NULL; // A will eventually observe we took the message
    }
    ...
    pop rax
    iret

Something along the above ought to work.

Jim Dempsey
www.quickthreadprogramming.com
Michael4
Beginner
1,602 Views
Thanks for your replies.
Yes, I have also seen that Linux assumes that such behaviour is not possible.
Nevertheless, I was wondering whether this assumption is justified.
That is: which part of the Software Developer's Manual guarantees that I'm allowed to assume that?
I suspect this information is missing in the manual and therefore suggest it should be updated.

Just to clarify my interest in this topic:
I'm not just writing some code which I want to work correctly.
I'm developing a formal multiprocessor execution model for x86 CPUs in which I have to formally state whether such a behaviour is possible or not. And I have to justify such a formalisation with a reference to the Software Developer's Manual.
Changbin
Beginner
749 Views
It's possible, and a barrier is required for such architectures. See the smp_call implementation of the Linux kernel.

 

void __smp_call_single_queue(int cpu, struct llist_node *node)
{
        ...
        /*
         * The list addition should be visible to the target CPU when it pops
         * the head of the list to pull the entry off it in the IPI handler
         * because of normal cache coherency rules implied by the underlying
         * llist ops.
         *
         * If IPIs can go out of order to the cache coherency protocol
         * in an architecture, sufficient synchronisation should be added
         * to arch code to make it appear to obey cache coherency WRT
         * locking and barrier primitives. Generic code isn't really
         * equipped to do the right thing...
         */
        if (llist_add(node, &per_cpu(call_single_queue, cpu)))
                send_call_function_single_ipi(cpu);
}
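The "add to the list, and send the IPI only if the list was empty" pattern in the excerpt can be imitated in user-space C11 as follows. This is an illustrative analogue, not the kernel's llist code: `sketch_node`, `sketch_list`, `sketch_list_add` and `sketch_list_take_all` are hypothetical names. The release ordering on the compare-and-swap and the acquire ordering on the exchange are what make the node contents visible to whoever takes the list.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_node {
    struct sketch_node *next;
};

struct sketch_list {
    _Atomic(struct sketch_node *) first;
};

/* Push a node onto the list. Returns true if the list was empty before
 * the push, i.e. the caller is the one who should send the IPI. */
bool sketch_list_add(struct sketch_node *n, struct sketch_list *l)
{
    struct sketch_node *old =
        atomic_load_explicit(&l->first, memory_order_relaxed);
    do {
        n->next = old;   /* on CAS failure, old is reloaded and we retry */
    } while (!atomic_compare_exchange_weak_explicit(
                 &l->first, &old, n,
                 memory_order_release, memory_order_relaxed));
    return old == NULL;
}

/* Detach the whole list in one step, as an IPI handler would.
 * Entries come back newest first. */
struct sketch_node *sketch_list_take_all(struct sketch_list *l)
{
    return atomic_exchange_explicit(&l->first, (struct sketch_node *)NULL,
                                    memory_order_acquire);
}
```

Only the push that finds the list empty triggers a notification; later pushes rely on the handler draining everything in one `sketch_list_take_all()` call.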

 

There is also a fence (MFENCE followed by LFENCE) before sending an IPI over x2APIC:

static void x2apic_send_IPI(int cpu, int vector)
{
	u32 dest = per_cpu(x86_cpu_to_apicid, cpu);

	/* x2apic MSRs are special and need a special fence: */
	weak_wrmsr_fence();
	__x2apic_send_IPI_dest(dest, vector, APIC_DEST_PHYSICAL);
}
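For readers without the kernel sources at hand, the fence in question can be sketched like this. The MFENCE; LFENCE sequence is what the post above describes. The function names, the stub standing in for the privileged WRMSR, and the non-x86 fallback are assumptions added so the example compiles and runs in user space.

```c
#include <stdatomic.h>

/* Recorded by the stub so the sketch can be exercised without real hardware. */
static unsigned last_dest, last_vector;

/* Hypothetical stand-in for the privileged WRMSR to the x2APIC ICR. */
static void write_icr_stub(unsigned dest, unsigned vector)
{
    last_dest = dest;
    last_vector = vector;
}

/* MFENCE followed by LFENCE on x86; a full C11 fence elsewhere
 * (the fallback is only there to keep the sketch portable). */
static inline void wrmsr_fence_sketch(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("mfence; lfence" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Order earlier stores before the write that triggers the IPI. */
void send_ipi_fenced_sketch(unsigned dest, unsigned vector)
{
    wrmsr_fence_sketch();
    write_icr_stub(dest, vector);
}
```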

 
