Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Are memory stores atomic on modern multicore systems?

azru0512
Beginner
1,833 Views

Consider the example code below,

int msg = 0;

P1:           P2:
msg = 1;      r = msg;

Is it OK to write such code? Is there any possibility that P2 reads some "intermediate" value (i.e., neither 0 nor 1)?

I have researched this topic for a while. However, I am still not sure whether we are allowed to write the above code.

Some said we should use a mutex.

Some said that a native-word store is atomic.

Some said an aligned native-word store is atomic.

Which one is right? Can programmers just write the above code and assume that a native-word / aligned native-word store is atomic?

Any comment will be appreciated.

27 Replies
jimdempseyatthecove
Honored Contributor III
When the size of int (which may be 2, 4, or 8 bytes) resides within an alignment of 2, 4, or 8 bytes respectively, then the store is atomic (on current Intel Architecture). On earlier Intel Architecture (80386, 80486, maybe the first Pentium), this was restricted to 2 or 4 bytes.

When the int is not naturally aligned (aligned to its size) but still resides entirely within one cache line, the store will be atomic on some processors. However, do not rely on this: the guarantee of atomicity is for naturally aligned stores of 1, 2, 4, or 8 bytes within one cache line (and 16 bytes under some circumstances).

Jim Dempsey
azru0512
Beginner

Thanks for your reply.

Let me ask another way:

int msg = 0;

P1:           P2:
msg = 1;      r = msg;

Can we write the above code and assume that "msg = 1" is done atomically, no matter what compiler we use and no matter what platform we run on?

Or should we instead rely on the guarantees provided by programming languages or special APIs?

azru0512
Beginner
It seems atomic reads / writes depend on a lot of things. See the link below.

http://www.lambdacs.com/cpt/FAQ.html#Q377
RafSchietekat
Valued Contributor III
"Is it OK to write such code?"
Short answer: no, use an atomic type instead, that's what they're there for.

Jim is right, of course (well, I assume), but that stuff is an atomic implementer's business, not an atomic user's.
jimdempseyatthecove
Honored Contributor III
I suggest you also use volatile

volatile int msg = 0;


The reason being, without volatile, the compiler is free to rearrange or eliminate the store sequences in P1 as long as doing so does not affect P1, IOW without regard to the effects on P2. Likewise for the code executed in P2, without regard to P1.

Also, if msg can be updated simultaneously (msg used for 2-way message passing), then use InterlockedExchange (or an equivalent asm statement).

*** BAD

int msg = 0;

P1:                     P2:
msg = 1;                r = msg;
while (msg)             msg = 0;
    _mm_pause();        ...

Without volatile, the P1 while loop will not see P2 returning msg back to 0.

*** good

volatile int msg = 0;

P1:                     P2:
msg = 1;                r = msg;
while (msg)             msg = 0;
    _mm_pause();        ...

Now while(msg) gets a fresh copy of msg on each iteration.

Jim Dempsey
ARCH_R_Intel
Employee
Often atomicity is not enough. Often order of operations is critical too. E.g., if "msg==1" means that "there is other data that is ready to read", the write "msg=1" has to guarantee that earlier writes become visible before "msg==1" becomes visible. See http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/ for a discussion of why volatile is useless for such guarantees.
azru0512
Beginner
Yes, atomicity and ordering are both critical.

However, I found an interesting paper, "FastForward for Concurrent Threaded Pipelines", in which the authors propose a lock-free queue and claim that its correctness depends only on atomicity.

The authors proposed the enqueue and dequeue functions as follows,

// enqueue function
put_nonblock(...) {
    if (NULL == queue[head]) {
        queue[head] = ptr;
        head = NEXT(head);
    }
}

// dequeue function
get_nonblock(...) {
    if (NULL != queue[tail]) {
        ptr = queue[tail];
        queue[tail] = NULL;
        tail = NEXT(tail);
    }
}


The basic idea of FastForward is to use the queue entry itself to indicate the empty or full condition, rather than using head and tail. Therefore, head and tail are no longer shared variables.

I quoted here what the authors said about the correctness of FastForward.

...notice that this optimization also allows a CLF implementation that only depends on the coherence protocol (and not the consistency model) to ensure mutual exclusion.

To see this, recall that stores are atomic on all coherent memory systems, thus the getter and putter will never update the same node simultaneously since the getter waits for a non-NULL value and the setter a NULL value.

I am surprised that FastForward can rely only on atomicity, as the authors claim. But I don't know whether the claim is right. Any comment will be appreciated.
RafSchietekat
Valued Contributor III
They should still use atomic: if it's not needed, it probably also will not cost anything, and otherwise, well...

Meanwhile I hope nobody is thinking "volatile" anymore, because it's either a nonportable equivalent for atomic, or it isn't useful at all.
Dmitry_Vyukov
Valued Contributor I
> The reason being, without volatile, the compiler is free to rearrange or eliminate the store sequences by P1 as long as it does not affect P1.

Where do you get the info?

According to ISO C/C++ volatile has nothing to do with ordering. Even if a var is volatile, compilers are free to rearrange stores around them. For sure.

Dmitry_Vyukov
Valued Contributor I
> I am surprised that FastForward can rely only on atomicity, as the authors claim

For the queue construction itself they indeed rely only on atomicity, not on ordering. But that fact is irrelevant here.
You should look at how they synchronize the passing of user data rather than at the synchronization of the queue cells.

Dmitry_Vyukov
Valued Contributor I
The question is meaningless without context. There are way too many answers to it. What is the context (language, compiler, hardware) you are interested in?


Dmitry_Vyukov
Valued Contributor I
Hey, wait, the following code is just incorrect:

if (NULL != queue[tail]) {
ptr = queue[tail];
queue[tail] = NULL;
tail = NEXT(tail);
}

... unless it's C++, where the code can have any meaning... or Microsoft C with the vars declared volatile... anyway, it's incorrect without fences.

azru0512
Beginner
Assume we implement FastForward in C. Since the authors of FastForward did not mention anything about "volatile", let us assume that all the variables are of simple data types (i.e., no volatile or other qualifiers).

FastForward uses the queue entry to indicate the empty or full condition. In our case, NULL means the queue entry (slot) is empty. I repost the enqueue and dequeue functions here.

// enqueue function
put_nonblock(...) {
    if (NULL == queue[head]) {
        queue[head] = ptr;
        head = NEXT(head);
    }
}

// dequeue function
get_nonblock(...) {
    if (NULL != queue[tail]) {
        ptr = queue[tail];
        queue[tail] = NULL;
        tail = NEXT(tail);
    }
}


Note that head and tail are NOT shared variables anymore.

The enqueue function first checks whether the queue entry is empty, using NULL as the indicator. If the queue entry is empty, it inserts data into the entry and advances head.

The dequeue function does a similar thing. After retrieving data from the queue entry, it uses NULL to mark the entry as an empty slot.

The correctness of FastForward relies on atomic store operations, as claimed by the authors. Since you said it's incorrect without fences, could you show me how ordering can break the code?

Thanks.
robert-reed
Valued Contributor II
Maybe I've been out of the loop too long, or I'm missing some precondition here, but I don't see how this works using ordinary C/C++. Saying memory stores are atomic on modern multicore systems isn't saying much; unless you've got a queue entry bridging two cache lines, most memory stores are going to be a cache line at a time. Am I missing something in this scenario?

[cpp]// enqueue function                Thread 1                          Thread 2
put_nonblock(...) {
  if (NULL == queue[head]) {     Finds queue[head] NULL        Instant later does the same
    queue[head] = ptr;           queue[head] = ptr1            Overwrites ptr1 with ptr2 
    head = NEXT(head);           head = head2                  head = head3
  }
}[/cpp]

And likewise in the dequeue function? Without an atomic binding of the read and the write (in this case, the read and write of queue[head]), how do you avoid races?


azru0512
Beginner
Hmm... FastForward is designed for the single-producer, single-consumer scenario. I forgot to mention this.
RafSchietekat
Valued Contributor III
If the ptr values are going to actually point to something, you'll obviously need a release store in the enqueue and a load-acquire in the dequeue, and atomics can provide that. For simple data items, you still have to know for certain that the loads and stores are indivisible, so either you implicitly trust the compiler to do what you expect (odds are good, but not 100.00%, and a static analysis tool worth its salt will surely complain about races, because by definition that's what you'll have), or you still use atomics (which can be relaxed, i.e., no fences, and they won't cost you anything), or inline assembly (that's like a roll-your-own atomic).

If you want to play with fire anyway, how certain are you that the compiler won't optimise away access to the circular buffer? Odds are very good, but again not 100.00% (maybe the compiler is super smart but not super super smart?), so you'll have to implicitly trust the compiler to do what you expect a second time: out-of-order stores or out-of-order loads wouldn't upset the algorithm, but they should still go to memory in a finite amount of time, so maybe it would be a good idea (though not optimal) to make the buffer at least volatile, if you allow any kind of optimisation by the compiler.

But still, the only reason not to use atomics is if you're somehow allergic to them, because they're the right tool for the job and they won't cost you anything (you'll have to set the fence yourself, though: relaxed for simple data, an elementary fence for pointers, because C++0x seems to want to have sequential consistency as the default, which seems like selling a race bike with training wheels that you'll probably want to take off before an actual race). Dmitriy, what's your take on that?

(Added) Interestingly, with pointers to data the producer only needs to release on the store (the load is just a NULL), and the consumer only needs to acquire on the load (the store is just a NULL), so apparently default memory semantics on atomic operations are not always the final choice. Also, the producer shouldn't assume that a single compare-and-swap will be better, because that'll cost an arm and a leg on x86 and the algorithm doesn't require inter-operation atomicity here; I don't think anybody has invented single negative-compare-and-swaps for the consumer, though?

(Corrected typo after seeing #18: traning->training)

(Edit for clarity: shouldn't->wouldn't)
Dmitry_Vyukov
Valued Contributor I
It should look as follows (C1x atomic primitives are used):

[cpp]size_t head;
size_t tail;
atomic_address queue[QUEUE_LENGTH];

put_nonblock(...)
{
    if (atomic_load_explicit(&queue[head], memory_order_relaxed) == 0)
    {
        atomic_store_explicit(&queue[head], ptr, memory_order_release);
        head = NEXT(head);
    }
}

get_nonblock(...)
{
    if (atomic_load_explicit(&queue[tail], memory_order_relaxed) != 0)
    {
        ptr = atomic_load_explicit(&queue[tail], memory_order_consume);
        atomic_store_explicit(&queue[tail], 0, memory_order_relaxed);
        tail = NEXT(tail);
    }
}[/cpp]

> The enqueue function first checks whether the queue entry is empty, using NULL as the indicator. If the queue entry is empty, it inserts data into the entry and advances head.

It does not work that way.

> Since you said it's incorrect w/o fences, could you show me how ordering can break the code?

For sure. Consider:

// this is part of the user's code
msg->data = 47;
// and this is put_nonblock()
if (queue[head] == 0)
{
    queue[head] = msg;
    head = NEXT(head);
}

Either the C/C++ compiler OR the hardware is free to reorder the code as:

if (queue[head] == 0)
{
    queue[head] = msg;
    head = NEXT(head);
}
msg->data = 47;

As you can see, single-threaded semantics are preserved by the transformation, and that is all that both the C/C++ compiler and the hardware preserve by default.
However, multi-threaded semantics are definitely broken by the transformation: the consumer will receive a partially constructed message.

The same kind of transformation can occur on the consumer side too (hence the memory_order_consume ordering constraint). However, the details are much more wicked. You may think of it as:

if (queue[tail] != 0)
{
    msg = queue[tail];
    queue[tail] = 0;
    tail = NEXT(tail);
    data = msg->data;
    printf("%d", data);
}

transformed to:

data = *(0x12345678);
if (NULL != queue[tail])
{
    ptr = queue[tail];
    if (ptr != 0x12345678)
        data = *ptr;
    queue[tail] = NULL;
    tail = NEXT(tail);
    printf("%d", data);
}

That is also a perfectly correct single-threaded optimization, yet it breaks multi-threaded code.

Dmitry_Vyukov
Valued Contributor I
> C++0x seems to want to have sequential consistency as the default, which seems like selling a race bike with training wheels that you'll probably want to take off before an actual race). Dmitriy, what's your take on that?

Yeah, race bike with training wheels. LOL
Why in the world would one want atomic operations with seq_cst everywhere? I guess mutexes can be faster than that, because they do not have to be globally ordered.
And personally I do not want any defaults for atomics; on the contrary, I want to be as explicit and as verbose as possible.


RafSchietekat
Valued Contributor III

"ptr = atomic_load_explicit(&queue[tail], memory_order_consume);"
I think that should be memory_order_acquire, shouldn't it? But I still don't "get" memory_order_consume, so maybe it's just me...

Dmitry_Vyukov
Valued Contributor I

"ptr = atomic_load_explicit(&queue[tail], memory_order_consume);"
I think that should be memory_order_acquire, shouldn't it? But I still don't "get" memory_order_consume, so maybe it's just me...

It may be either way, depending on the user-level semantics provided by the queue.

consume is faster, and it is enough in 99% of real-world usages.
