Consider the example code below:

int msg = 0;

P1:              P2:
msg = 1;         r = msg;

Is it OK to write such code? Is there any possibility that P2 reads some "intermediate" value (i.e. neither 0 nor 1)?
I have researched this topic for a while. However, I am still not sure if we are allowed to write the above code.
Some said we should use a mutex.
Some said that a native-word store is atomic.
Some said that an aligned native-word store is atomic.
Which one is right? Can programmers just write the above code and assume that a native-word / aligned native-word store is atomic?
Any comment will be appreciated.
Even when an int is not naturally aligned (aligned to its size), as long as the entire int resides within one cache line, the store will be atomic on some processors. However, do not rely on this: the guarantee of atomicity is for naturally aligned data within one cache line, for sizes of 1, 2, 4, and 8 bytes (and 16 bytes under some circumstances).
Jim Dempsey
Thanks for your reply.
Let me ask in another way:

int msg = 0;

P1:              P2:
msg = 1;         r = msg;

Can we write the above code and assume that "msg = 1" is done atomically, no matter what compiler we use and no matter what platform we run on?
Or should we instead rely on the guarantees provided by programming languages or special APIs?
http://www.lambdacs.com/cpt/FAQ.html#Q377
Short answer: no. Use an atomic type instead; that's what they're there for.
Jim is right, of course (well, I assume), but that stuff is an atomic implementer's business, not an atomic user's.
volatile int msg = 0;
The reason being, without volatile, the compiler is free to rearrange or eliminate the stores performed by P1 as long as doing so does not affect P1; in other words, without regard to the effects on P2. Likewise for the code executed by P2, without regard to P1.
Also, if msg can be updated simultaneously (msg used for two-way message passing), then use InterlockedExchange (or an equivalent asm statement).

*** BAD

int msg = 0;

P1:                  P2:
msg = 1;             r = msg;
while (msg)          msg = 0;
    _mm_pause();     ...

Without volatile, the P1 while loop will not see P2 returning msg back to 0.

*** good

volatile int msg = 0;

P1:                  P2:
msg = 1;             r = msg;
while (msg)          msg = 0;
    _mm_pause();     ...

Now while(msg) gets a fresh copy of msg on each iteration.
Jim Dempsey
However, I found an interesting paper, "FastForward for Concurrent Threaded Pipelines", in which the authors propose a lock-free queue and claim that the correctness of the queue depends only on atomicity.
The authors present the enqueue and dequeue functions as follows:
// enqueue function
put_nonblock(...) {
    if (NULL == queue[head]) {
        queue[head] = ptr;
        head = NEXT(head);
    }
}

// dequeue function
get_nonblock(...) {
    if (NULL != queue[tail]) {
        ptr = queue[tail];
        queue[tail] = NULL;
        tail = NEXT(tail);
    }
}
The basic idea of FastForward is to use the queue entry itself to indicate the empty or full condition, rather than comparing head and tail. Therefore, head and tail are no longer shared variables.
I quote here what the authors say about the correctness of FastForward:

> ...notice that this optimization also allows a CLF implementation that only depends on the coherence protocol (and not the consistency model) to ensure mutual exclusion. To see this, recall that stores are atomic on all coherent memory systems; thus the getter and putter will never update the same node simultaneously, since the getter waits for a non-NULL value and the setter a NULL value.

I am surprised that FastForward can rely only on atomicity, as the authors claim. But I don't know whether their claim is right or not. Any comment will be appreciated.
Meanwhile, I hope nobody is thinking "volatile" anymore, because it's either a nonportable equivalent of an atomic, or it isn't useful at all.
Where did you get that info?
According to ISO C/C++, volatile has nothing to do with ordering. Even if a variable is volatile, compilers are free to rearrange ordinary stores around it. For sure.
And the queue-cell manipulation itself indeed relies only on atomicity, not on ordering. But that fact is irrelevant here.
You should look at how they synchronize the passing of user data rather than the synchronization of the queue cells.
> And the queue-cell manipulation itself indeed relies only on atomicity, not on ordering. But that fact is irrelevant here. You should look at how they synchronize the passing of user data rather than the synchronization of the queue cells.
Hey, wait, the following code is just incorrect:
if (NULL != queue[tail]) {
    ptr = queue[tail];
    queue[tail] = NULL;
    tail = NEXT(tail);
}
... that is, unless it's C++ (where, pre-C++11, such racy code has no defined meaning at all)... or Microsoft C with the variables declared volatile... in any case, it's incorrect without fences.
FastForward uses the queue entry to indicate the empty or full condition. In our case, NULL means the queue entry (slot) is empty. I repost the enqueue and dequeue functions here.
// enqueue function
put_nonblock(...) {
    if (NULL == queue[head]) {
        queue[head] = ptr;
        head = NEXT(head);
    }
}

// dequeue function
get_nonblock(...) {
    if (NULL != queue[tail]) {
        ptr = queue[tail];
        queue[tail] = NULL;
        tail = NEXT(tail);
    }
}
Note that head and tail are NOT shared variables anymore.
The enqueue function first checks whether the queue entry is empty, using NULL as the indicator. If the entry is empty, it inserts the data into the entry and advances head.
The dequeue function does a similar thing: after retrieving the data from the queue entry, it stores NULL to mark the entry as an empty slot.
The correctness of FastForward relies on atomic store operations, as claimed by the authors. Since you said it's incorrect w/o fences, could you show me how ordering can break the code?
Thanks.
[cpp]
// enqueue function           Thread 1                    Thread 2
put_nonblock(...) {
  if (NULL == queue[head]) {  // finds queue[head] NULL   // an instant later does the same
    queue[head] = ptr;        // queue[head] = ptr1       // overwrites ptr1 with ptr2
    head = NEXT(head);        // head = head2             // head = head3
  }
}
[/cpp]
And likewise in the dequeue function? Without an atomic binding of the read and the write (in this case, the read and the write of queue[head]), how do you avoid races?
(Added) Interestingly, with pointers to data the producer only needs to release on the store (the load is just a NULL), and the consumer only needs to acquire on the load (the store is just a NULL), so apparently default memory semantics on atomic operations are not always the final choice. Also, the producer shouldn't assume that a single compare-and-swap will be better, because that'll cost an arm and a leg on x86 and the algorithm doesn't require inter-operation atomicity here; I don't think anybody has invented single negative-compare-and-swaps for the consumer, though?
(Corrected typo after seeing #18: traning->training)
(Edit for clarity: shouldn't->wouldn't)
[cpp] size_t head;
size_t tail;
atomic_address queue [QUEUE_LENGTH];
put_nonblock(...)
{
if (atomic_load_explicit(&queue[head], memory_order_relaxed) == 0)
{
atomic_store_explicit(&queue[head], ptr, memory_order_release);
head = NEXT(head);
}
}
get_nonblock(...)
{
if (atomic_load_explicit(&queue[tail], memory_order_relaxed) != 0)
{
ptr = atomic_load_explicit(&queue[tail], memory_order_consume);
atomic_store_explicit(&queue[tail], 0, memory_order_relaxed);
tail = NEXT(tail);
}
}
[/cpp]
> The enqueue function first checks whether the queue entry is empty, using NULL as the indicator. If the entry is empty, it inserts the data into the entry and advances head.

It does not work that way.

> Since you said it's incorrect w/o fences, could you show me how ordering can break the code?
For sure. Consider:
// this is part of the user's code
msg->data = 47;
// and this is put_nonblock()
if (queue[head] == 0)
{
    queue[head] = msg;
    head = NEXT(head);
}
Either the C/C++ compiler OR the hardware is free to reorder the code as:
if (queue[head] == 0)
{
    queue[head] = msg;
    head = NEXT(head);
}
msg->data = 47;
As you can see, single-threaded semantics are preserved by the transformation, and that is all that both the C/C++ compiler and the hardware preserve by default.
However, multi-threaded semantics are definitely broken by the transformation: the consumer will receive a partially constructed message.
The same kind of transformation can occur on the consumer side too (hence the memory_order_consume ordering constraint), although the details there are much more wicked. You may think of it as:
if (queue[tail] != 0)
{
    msg = queue[tail];
    queue[tail] = 0;
    tail = NEXT(tail);
    data = msg->data;
    printf("%d", data);
}
transformed to:
data = *(0x12345678);
if (NULL != queue[tail])
{
    ptr = queue[tail];
    if (ptr != 0x12345678)
        data = *ptr;
    queue[tail] = NULL;
    tail = NEXT(tail);
    printf("%d", data);
}
That is also a perfectly correct single-threaded optimization that breaks multi-threaded code.
Yeah, race bike with training wheels. LOL
Why in the world would one want atomic operations with seq_cst everywhere? I would guess mutexes can be faster than that, because they do not have to be globally ordered.
And personally I do not want any defaults for atomics, on the contrary, I want to be as explicit and as voluble as possible.
"ptr = atomic_load_explicit(&queue[tail], memory_order_consume);"
I think that should be memory_order_acquire, shouldn't it? But I still don't "get" memory_order_consume, so maybe it's just me...
> "ptr = atomic_load_explicit(&queue[tail], memory_order_consume);"
> I think that should be memory_order_acquire, shouldn't it? But I still don't "get" memory_order_consume, so maybe it's just me...
It may be either way, depending on the user-level semantics provided by the queue.
consume is faster, and it is enough in 99% of real-world usages.
