Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Ordering of writes to shared memory with multiple processors

Yeager__David
Beginner

Hi,

In the Intel Architecture Software Developer's manual, volume 3, section 8.2.2, it says:

"In a multiple-processor system, the following ordering principles apply:

-Writes by a single processor are observed in the same order by all processors.

-Writes from an individual processor are NOT ordered with respect to the writes from other processors."

On a multi-processor system with context switching, there is no guarantee that any given portion of a program will execute on any given processor.

Say x and y are variables in shared memory. If the program in a single thread does this:

x = 0; //happens on processor 0
y = 0; //happens on processor 0
x = 1; //happens on processor 0
//context switch 
y = 1; //happens on processor 1

Is it possible for other processors to see x = 0 and y = 1 at the same time at any point here? Or do context switches on Linux include memory barriers to prevent this from happening?
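(To make the question concrete, here is a minimal C++11 sketch of the scenario; the thread structure and the relaxed atomics are just my illustration, and nothing in it forces an actual migration between cores.)

// Litmus test for the question: can an observer see y == 1 while x == 0?
// Relaxed atomics are used so the stores/loads compile to plain MOVs on
// x86-64; the test probes the hardware ordering, not the C++ memory model.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};

void writer() {
    x.store(0, std::memory_order_relaxed);
    y.store(0, std::memory_order_relaxed);
    x.store(1, std::memory_order_relaxed);
    // (a migration to another core could happen here)
    y.store(1, std::memory_order_relaxed);
}

void observer() {
    // Read y first, then x: if the writes are seen in program order,
    // observing y == 1 implies the earlier x = 1 is also visible.
    int ry = y.load(std::memory_order_relaxed);
    int rx = x.load(std::memory_order_relaxed);
    if (ry == 1 && rx == 0)
        std::printf("writes observed out of order: y==1 but x==0\n");
}

int main() {
    std::thread t1(writer), t2(observer);
    t1.join();
    t2.join();
    return 0;
}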

jimdempseyatthecove
Honored Contributor III

Potentially, yes. The precise order of line 3 (x = 1) and line 5 (y = 1) is unknown from your sketch code. For example,

a) on processor 1, what was executed prior to y = 1 on line 5?
b) are x and y in the same cache line?
c) what was the value of y prior to line 2 (y = 0)?
d) had the code on the 3rd processor retrieved y prior to line 2 on processor 0 and/or line 5 on processor 1?
e) was the code written such that x and y are volatile or atomic on all processors?
f) ...

BTW "context switch" is likely not what you intended to say as it states that the hardware thread switches from one thread/process context to another thread/process context.

Jim Dempsey

McCalpinJohn
Honored Contributor III

I can't imagine that it is possible to migrate a process from one core to another without a full pipeline flush (which includes a memory barrier), but I have not been able to find a specific reference to this in either volume 3 of the SWDM or in the Linux kernel source.

Yeager__David
Beginner

Thanks for your replies. Even if there is a pipeline flush and memory barrier when migrating a process to a different core, to what extent does that memory barrier affect all other cores? I.e., maybe it's just a pipeline flush that enforces memory ordering only on the first and second cores the thread ran on, but not on the other cores, which could still observe the writes out of order?

Let's assume that x and y are both 8-byte aligned and can sit on different cache lines, are not declared volatile, and live in shared memory. So instead let's make them part of a struct in shared memory:

struct my_struct_type *a = shared_mem_ptr;

//Thread running on processor 0
a->x = 0; 
a->y = 0; 
a->x = 1; 
//Same thread migrates to processor 1
a->y = 1; 

Now let's say a different process is watching this happen on processor 2. Can that process on processor 2 see a->y = 1 get written before a->x = 1? I.e., can the write on line 8 (a->y = 1) become visible before the write on line 6 (a->x = 1)?

I'm doing all sorts of things with shared memory and want to know if I'm breaking anything by assuming ordered writes by the same thread will be viewed in order by other processes on multi-processor x86-64 systems. 

jimdempseyatthecove
Honored Contributor III

In order for a thread to migrate from one logical processor to another (same core, same socket/different core, or different socket), the O/S (kernel) must take an interrupt on the former logical processor, perform a context switch of the thread state, and then resume that state on the new logical processor. This is likely hundreds of instructions, if not thousands.

As long as line 6 (a->x = 1) did not cause a page fault, the write of x will complete before/during the interrupt. Note that it is unclear whether the write of x will occur before the write of the program counter onto the interrupt stack; I assume it will, excepting the page-fault situation or some other memory fault associated with writing to x. Given that line 4 (a->x = 0) did not fault (an assumption on my part), and assuming there is no additional context switch before line 6 completes (the execute phase; the write to RAM/LLC may still be pending), the write order will be preserved and observable by all other threads... provided that the other threads are coded to properly observe the changes in x and y. IOW, provided no additional thread is also manipulating x and/or y, and keeping in mind whether or not x and y share the same cache line.

For example, consider an observer thread issuing

   if(a->x && a->y) ...

This, or any similar evaluation of x and y by another software thread, might not observe the state of x and y as known by the first thread at the moment it executes line 8 (a->y = 1). I.e., there is a time interval between the fetch of x and the fetch of y before the && operation; during that evaluation by the observer thread, the producer thread may be at some arbitrary point in the above code.

It might be better if you show the exact code you are having problems with.

Jim Dempsey

Yeager__David
Beginner

Jim, thanks for your thoughtful response. In my case I'm referring to regular user mode programs where page faults can happen, and so from what I understand you're essentially saying that two random writes from one thread can be viewed by other threads out of order if other provisions aren't made. 

It's scary, because the memory consistency model is very important for concurrent programming, and the Intel manual clearly states that writes from the same core will be observed in program order by other cores. That statement carries a lot of weight for developers and can easily be misunderstood. Since threads/processes can migrate between cores, my conclusion for now is that this guarantee is basically meaningless to most developers building regular user-mode apps: there is no guarantee that any two writes will happen on the same core, and no guarantee that page faults won't happen, unless other provisions are made.

Maybe operating systems effectively include memory barriers that enforce this write order globally across all cores when threads migrate cores, and we just aren't aware of it. I'll have to investigate this further. This sounds like a trivial detail but it's actually a very important one that affects concurrent programming on most servers today, and the community should be very clear as to what rules apply wrt threads and write order guarantees.

For now I think I'll have to be conservative and just use memory barriers.
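(To be concrete about what I mean by "just use memory barriers", here is a minimal sketch using C++11 atomics; the names are placeholders and the fence is the conservative part.)

#include <atomic>

std::atomic<int> x{0}, y{0};   // placeholder shared fields

void writer_conservative() {
    x.store(1, std::memory_order_relaxed);
    // Full barrier: on x86-64 this compiles to MFENCE (or a locked RMW),
    // so the store to x is globally visible before the store to y,
    // regardless of which core either store executes on.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    y.store(1, std::memory_order_relaxed);
}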

 

jimdempseyatthecove
Honored Contributor III

David,

I in no way implied that memory was written inconsistently from the perspective of the writing thread, or from the perspective of the reading thread, provided that the reading thread could perform an atomic read of both x and y of object a... at the appropriate time. Most programming errors come from looking only at what the producer (writer) does and what the consumer does, without regard to the interrelation of when they perform their activities.

Take, for example, a simple barrier for use in a multi-threaded program. Can you spot the error in BadBarrier below?

// PAUSE() was not defined in the original post; it is assumed here to be a
// spin-wait hint such as _mm_pause().
#include <atomic>
#include <cstdint>
#include <immintrin.h>
#define PAUSE() _mm_pause()

struct BadBarrier
{
 std::atomic<intptr_t> c1;
 BadBarrier() { c1 = 0; }
 void here(intptr_t iThread, intptr_t nThreads)
 {
  if(iThread)
  {
   // Not master thread
   // indicate this thread reached barrier
   c1.fetch_add(1);
   while (c1.load())
   {
    PAUSE();
   }
   // here after master detected all other threads at barrier
  }
  else
  {
   // (iThread==0)
    // wait for all other threads to reach the barrier
   while((c1.load()+1)!=nThreads)
   {
    PAUSE();
   }
   // here when all other threads of team at barrier
   // release all other threads
   c1.store(0);
  }
 }
};

struct GoodBarrier
{
 std::atomic<intptr_t> c1;
 std::atomic<intptr_t> c2;
 GoodBarrier() { c1 = 0; c2 = 0; }
 void here(intptr_t iThread, intptr_t nThreads)
 {
  if(iThread)
  {
   // Not master thread
   // indicate this thread reached barrier
   c1.fetch_add(1);
   while (c1.load())
   {
    PAUSE();
   }
   // here after master detected all other threads at barrier
   // indicate this thread no longer observing c1
   c2.fetch_add(1);
   // now wait for team master thread
   while(c2.load())
   {
    PAUSE();
   }
  }
  else
  {
   // (iThread==0)
    // wait for all other threads to reach the barrier
   while((c1.load()+1)!=nThreads)
   {
    PAUSE();
   }
   // here when all other threads of team at barrier
   // release all other threads
   c1.store(0);
   // wait for all other threads to acknowledge release
   // and subsequently, no longer using GoodBarrier object
   while((c2.load()+1)!=nThreads)
   {
    PAUSE();
   }
   // all threads no longer using GoodBarrier object
   c2.store(0);
  }
 }
};

 

jimdempseyatthecove
Honored Contributor III

Example using barrier (bad or good):

#include <omp.h>   // OpenMP runtime API (omp_get_thread_num / omp_get_num_threads)

BadBarrier barrier;
#pragma omp parallel
{
  intptr_t iThread = omp_get_thread_num();
  intptr_t nThreads = omp_get_num_threads();
  while(true)
  {
    // code here
    barrier.here(iThread, nThreads);
  } // while(true)
} // end omp parallel

The point of this discussion is not to argue that one should use their own barrier over the supplied OpenMP barrier, but rather to illustrate a common pitfall of focusing only on the "producer" (iThread = 0) and the "observer" (iThread != 0) individually, rather than on the timing between them.

Have you determined the problem with the code in BadBarrier?

Jim Dempsey

Yeager__David
Beginner

Jim, I'm really sorry but I'm going to have to ignore your example, because it distracts us from my very basic question, which we should not be overthinking. I really should have rephrased my initial question as follows:

 

Intel's memory consistency model guarantees writes by the same processor are always viewed in original program order by other processors. This is true in all scenarios, under all conditions. It also states that the original order is not guaranteed for writes performed by different processors on the same system, when observed by other processors on that system.

 

My question is a very simple one: can the same guarantee be made for threads or processes? Are writes in one thread always viewed in original program order by other threads in all scenarios? Are writes in one process always viewed in original program order by other processes in all scenarios? Let's assume we're talking about Linux.

 

This is a first-principles, foundational question for concurrent programming, and there should be a clear-cut answer here: either it is always guaranteed, or it is not.

 

We know that threads and processes can migrate between cores, so the answer isn't obvious from the Intel manual, since two stores from the same thread can happen on different cores. Maybe the answer is more OS-dependent and I should pose this question in a Linux forum, even though the answer is extremely relevant to people in this forum.

 

 

I don't want to focus on any one example, since we should all know the answer to this in a general sense. But as a practical example, think of lockless data structures where you fill out the data structure first and then make the pointer to it visible to other threads at the end (sketched below). If it's correct for the new object to be either visible or invisible to other threads, but incorrect for it to be visible while only partly filled out, and write order is guaranteed to be preserved, then you don't need to add memory barriers or locks to this operation, since updating the 8-byte-aligned pointer at the end is already an atomic operation.
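(For reference, a minimal sketch of the pattern I mean, using C++11 atomics; the struct and the names are placeholders. On x86-64 the release store and the acquire load below compile to ordinary MOVs, so the pattern leans on exactly the store-ordering guarantee being discussed, while the release/acquire markings also keep the compiler from reordering the initialization past the publication.)

#include <atomic>
#include <cstdint>

struct Node {
    uint64_t payload;
    // ... other fields, all filled in before publication
};

std::atomic<Node*> g_published{nullptr};   // 8-byte aligned pointer slot

// Producer: fill the object out completely, then make it visible.
void publish(Node* n) {
    n->payload = 42;                                    // initialize first
    g_published.store(n, std::memory_order_release);    // plain MOV on x86-64
}

// Consumer: sees either nullptr or a fully initialized object.
Node* try_consume() {
    return g_published.load(std::memory_order_acquire); // plain MOV on x86-64
}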

 

Answering "yes, write order can be guaranteed under certain conditions" (Jim) implies to me that no, it is not guaranteed in all scenarios.

 

So far here is my interpretation of your answers based on my initial question, and please correct me if I got this wrong:

 

Jim: No, it is not guaranteed in all scenarios.

 

John: Yes, I think it is guaranteed in all scenarios.

 

Since it seems that both of you are extremely knowledgeable in this area, I am unable to come to a conclusion here and will have to simply be conservative and assume the answer is that there is no guarantee, while at the same time doing more investigation to determine a more definitive answer.

jimdempseyatthecove
Honored Contributor III

>>My question is a very simple one: Can the same guarantee be made for threads or processes? Are writes in one thread always viewed in original program order by other threads in all scenarios?

Multiple writes from one thread to memory will be observable either a) in the same order, or b) as being performed at the same time (typically when the two writes fall within the same cache line). The problem is the subtle (grammatical) distinction between observable and observed. The BadBarrier example illustrates this quite nicely (once you know the problem). My answer is: the write order (or combining of writes) is correct; it is the programmer's responsibility to code in a manner such that the observation is made correctly.

Since it appears that you are likely not going to take the time to figure out the issue with BadBarrier, I will explain it.

When only 1 or 2 threads are in the team the barrier works as intended. Assume the case of 3 or more threads.

As threads enter the barrier the non-master threads are performing

t0: increment entry counter
loop:
t1: if counter==0 break
t2: waste non-zero small amount of time gently
goto loop

After all of the non-master threads complete their increments and the master observes that the count has reached its expected value, the master zeros the counter, intending to release the other threads. This works most of the time. Where it doesn't work is when a released thread exits the pseudo-code loop above, completes the next iteration of the while(true) loop in the barrier-usage example earlier in this thread, and re-enters the same barrier before one of its team members has observed the counter going to 0 from the prior entry to the barrier; then you have a deadlock at best, or threads executing out of phase at worst. Note that in the error condition the writes were performed in the proper order and were observable, but they were not observed as intended.

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III

To answer your potential next question:

The "waste non-zero small amount of time gently" compute time is so much less than the time to execute the code in the while loop that this problem case will never occur. Eh?

This isn't the entire picture of what is happening with your code. Consider what happens if, immediately after t1 in the above loop, the hardware thread gets interrupted. When this occurs, the wait time is extended quite considerably, for the duration of the interrupt and possibly for the preempting threads to complete their time quanta.

Jim Dempsey

Yeager__David
Beginner

Jim, thanks for your answer:

"Multiple writes from one thread to memory will be observable either a) in same order, or b) being performed at the same time"

Now I believe we have a consensus with John's answer to my original question: it is not possible for other processors to see those writes out of order.

Many thanks to both of you for your answers.

 

 

McCalpinJohn
Honored Contributor III

I finally dug through enough of the Linux kernel source to find at least one place where a memory barrier is happening in a context switch....

The function names and locations probably vary by kernel version, but for CentOS 7.4 (kernel 3.10.0-693) the "context_switch()" function is in "kernel/sched/core.c" in the kernel source tree. There are a couple of paths that the code might take, but the function "switch_mm()" looks important. This function is defined in arch/x86/include/asm/mmu_context.h, where the comments include a discussion of the required ordering functionality. The specific issue discussed there is related to page-table ordering, which is not constrained by the ordinary LOCK or FENCE operations. The comments note that load_cr3() is serializing, and (as discussed in Section 8.3 of Volume 3 of the Intel Architectures SW Developer's Manual, document 325384-067) serializing instructions

[...] force the processor to complete all modifications to flags, registers, and memory by previous instructions and to drain all buffered writes to memory before the next instruction is fetched and executed.

This explains why I was not able to find any FENCE instructions in the context switching code -- the load to CR3 is required to set up the page tables for the migrated process on its new core, and the serialization properties are a superset of the memory fence requirement discussed here.

I don't see the corresponding code that creates the memory fence on the core that the task was running on before the task switch, but I am confident that it is hiding in there somewhere... It is possible that simply entering the kernel (as part of the de-scheduling process) is enough to force a memory fence, but it is a lot of work to dig through all the relevant code.....

Yeager__David
Beginner

John, thank you for looking into this. So from what you have explained here, a memory barrier must happen both on the core where the task was originally running and on the core the task migrates to, so that a third observer core will see the writes in original program order. Your explanation that the MOV to CR3 required on the core the task migrates to is a serializing operation, and therefore a full memory barrier, makes sense to me.

I've also looked around and have not yet been able to determine that the required memory barrier or equivalent will always happen on the first core in Linux in this scenario, but will update this thread if I do find that proof. 

 

McCalpinJohn
Honored Contributor III

The memory barrier on the core where the process *was* running is actually the only one that matters.  All stores made while running on the original core have to be made globally visible before starting the process on another core, or the process might not see its *own* stores in program order!  Obviously, this can't be allowed to happen, so something must provide the memory fence functionality.

After a bit more looking, I found the relevant words in Section 11.10 of Volume 3 of the Intel Architectures SWDM:

The processor ensures that [...] the contents of the store buffer are always drained to memory in the following situations:

  - When an exception or interrupt is generated.

So the store buffers are flushed and all prior stores are made globally visible when the OS interrupts the process.  Since it must interrupt the process on its original logical processor before starting it on another processor, visibility is guaranteed, and so ordering is also guaranteed.

jimdempseyatthecove
Honored Contributor III

John,

An observation here...

Assuming (hypothetically) that the store buffers are not flushed (or have not yet completed flushing), the cache coherency system is designed such that all other cores/HW threads will have the associated cache line(s) of that store invalidated, and (presumably) will be inhibited from loading them until the associated store has completed. IOW, an explicit (or implicit, via CR3) flush is superfluous. The only potential issue that I can think of (in this hypothetical case) is if, at the start of the interrupt, a DMA transfer is initiated (or redirected) such that the DMA (bypassing the cache system) gets the memory contents of the location prior to the completion of the flush. All other cores, as well as the same core, would observe the data in order (or concurrently).

Note: this discussion relates to Intel (and other CPUs with valid cache coherency systems). ARM or other multi-core systems may have different behavior.

Jim Dempsey

Yeager__David
Beginner

Just a thought about John's comment: "All stores made while running on the original core have to be made globally visible before starting the process on another core, or the process might not see its *own* stores in program order!"

Remember that we're talking about another process on a third core that is observing the writes, and that third core can be on a different socket. For the process to see its own stores, they just need to be visible to the new core the process migrated to.

But as John said, an interrupt must happen on the original core for the task to move to another core, and it sounds like an interrupt makes all previous stores on that core globally visible to all other SMP processors once it completes. I'll assume that interrupt (or its equivalent) also happens when the thread/process makes a syscall and the kernel then migrates the task to another core before returning from that syscall. I guess this is the proof! Thanks guys!
