Could double-checked locking accidentally happen to work?

RafSchietekat · ‎07-07-2009

With the implementation used in "C++ and the Perils of Double-Checked Locking" (Example 3), could it be that with g++ optimisation turned off on x64 running 32-bit or 64-bit Linux it turns out that the first read is in fact always a load-acquire and the final assignment a release-store? I would use atomic operations myself, but what do you say to somebody who thinks that it's a bigger hassle to adopt, e.g., Threading Building Blocks for its atomics support,than to trust that the problem will not manifest itself? Maybe it does turn out that everything is aligned (even 64-bit pointers on 64-bit Linux), that the compiler always uses single load/store instructions, that loads are always ordered even if this is not officially documented, and that I'm making a fuss over nothing at all?

anthony_williams · ‎07-07-2009

Quoting - Raf Schietekat

With the implementation used in "C++ and the Perils of Double-Checked Locking" (Example 3), could it be that with g++ optimisation turned off on x64 running 32-bit or 64-bit Linux it turns out that the first read is in fact always a load-acquire and the final assignment a release-store? I would use atomic operations myself, but what do you say to somebody who thinks that it's a bigger hassle to adopt, e.g., Threading Building Blocks for its atomics support,than to trust that the problem will not manifest itself? Maybe it does turn out that everything is aligned (even 64-bit pointers on 64-bit Linux), that the compiler always uses single load/store instructions, that loads are always ordered even if this is not officially documented, and that I'm making a fuss over nothing at all?

The nature of undefined behaviour is that anything may happen --- including the possibility that everything may behave as you naively expect. It is therefore entirely possible that this code may turn out to be safe on some particular compiler/compiler setting/platform combination.

However, I would strongly recommend against relying on this, as the smallest change could affect the behaviour so that it is no longer what you may expect.

In particular, the compiler may choose to generate the write to pInstance before it invokes the constructor of the new Singleton object --- it has to make that store anyway, and it provides somewhere to write the value whilst it calls the constructor.

If the compiler does do so, then it doesn't matter whether the instruction is a release store or not --- another thread might see the non-NULL value before the constructor has completed, thus hitting a race condition.

If you are going to rely on compiler and platform specific behaviour, then you are better off using the compiler-specific atomic primitives such as the gcc __sync_xxx builtins. Such primitives will have a documented consequence on the compiler behaviour, and might therefore provide the requisite guarantees.

RafSchietekat · ‎07-07-2009

"However, I would strongly recommend against relying on this, as the smallest change could affect the behaviour so that it is no longer what you may expect."
You're preaching to the choir. :-)

"In particular, the compiler may choose to generate the write to pInstance before it invokes the constructor of the new Singleton object"
Even with optimisation turned off? Remember, I'm trying to crack through "if it's not broken, don't fix it", so I'll have to be convincing.

"If you are going to rely on compiler and platform specific behaviour, then you are better off using the compiler-specific atomic primitives such as the gcc __sync_xxx builtins. Such primitives will have a documented consequence on the compiler behaviour, and might therefore provide the requisite guarantees."
Seems like overkill: they're documented as "full barrier", which is too expensive (I only need release-store and load-acquire,which shouldn't take more thana single MOV of properly aligned data).

jimdempseyatthecove · ‎07-07-2009

When you (your program) assume a non-NULL pointer to an object is a pointer to a fully constructed object then it is your obligation (in multi-threaded program) to new/construct to a temporary pointer not visible to other threads. Once constructed then copy pointer to location visible to other threads (copy using CAS in the event that the empty pointer may be going under multiple constructions).

Jim Dempsey

RafSchietekat · ‎07-07-2009

Quoting - jimdempseyatthecove

When you (your program) assume a non-NULL pointer to an object is a pointer to a fully constructed object then it is your obligation (in multi-threaded program) to new/construct to a temporary pointer not visible to other threads. Once constructed then copy pointer to location visible to other threads (copy using CAS in the event that the empty pointer may be going under multiple constructions).

But what will g++ do with optimisation turned off? Does that temporary pointer really make a difference? Optimisation does not treat variables like real people, so a temporary pointer would be nothing more than a dangerous illusion, but this is specifically not the situation here.

I don't know what you mean with "multiple constructions", but the transfer definitely needs atomic release-store and load-acquire between the thread that makes the object and the thread that uses it; the question is whether that is already (accidentally) in place here.

jimdempseyatthecove · ‎07-07-2009

thread A

if(objectPointer) objectPointer->DoWork();

thread B

if(!objectPointer) objectPointer = new Object; // has non-trivial ctor

Thread A could potentially issue objectPointer->DoWork(); prior to completion of construction

----------------------------------

thread B

if(!objectPointer) objectPointer = getNewObject();
...
__declspec(noinline)
Object* getNewObject()
{
return new Object;
}

Above is safe provided only one instance of thread B code runs
If/when multiple threads perform above, you could have multiple new's overwriting one pointer.
-----------------------------
threadB

if(!objectPointer)getNewObject(&objectPointer);
...
__declspec(noinline)
Object* getNewObject()
{
Object* temp = new Object;
if(CAS(objectPointer, 0, temp))
return;
delete temp;
}

Above is safe even if multiple attempts to allocate begin.
Note, the errant allocation could be preserved by placing into an additional location for use later.
(assuming that pointer is NULL).
-------------------

jimdempseyatthecove · ‎07-07-2009

I should add the first case may avoid the potential to call a partially built object if the ctor executes on a register/internal temp locationholding the partially built object but _prior_ to use as rvalue in "objectPointer=".
But reliance on this could be implementation dependent.

Jim

RafSchietekat · ‎07-07-2009

#5.1: In general, I most definitely agree, but the question is whether it's a problem here (g++, optimisation turned off).

#5.2: Would a non-optimised compilation otherwise inline functions? If not, this is irrelevant.

#5.3: This seems broken in general because there is no acquire on the first test of objectPointer, so the question remains whether it would work in my situation. Furthermore, I'm not sure that the CAS will be cheaper than a lock (the cost is already amortised down to zero, so it would only save a bit of space (hmm...)), and maybe the transient Object candidates could be problematic (it's supposed to be a Singleton).

#6: That's exactly my question: this code has seemed to work fine for years on a different platform, so do I have a point questioning it for use specifically on (x64, Linux 32 and maybe later 64, g++ -O0) if nothing bad seems to be happening here either and ifthis codecan always be fixed later?

Help me out here: I'm looking for a solid argument to convince the non-believers that something should be done, or otherwise something that can help me sleep easy after giving up the fight. :-)

jimdempseyatthecove · ‎07-08-2009

>>#5.3: This seems broken in general because there is no acquire on the first test of objectPointer

Not so. The objectPointer being shared and changeable should be attributed with volatile. Regardless of volatile, acquire,or not you will have a window between a test sees NULL and where new object pointer is inserted into objectPointer. All acquire (on non-volatile) will do is force the compiler to not use registered copy of pointer should one be present (same as volatile for read). Typically a function will havea first reference of a pointer and the first reference will tend to have the NULL test. Therefore theacquire (and volatile) may be immaterial. Shouldyou have subsequent test, where prior contents may be registered, then acquire (or volatile) will avoid an unnecessary call into the allocation function. However, the allocation function will properly avoid tromping on prior allocation due to CAS(but will exhibit additional overhead of unnecessary new). As mentioned though the extraneous copy can be buffered in the function (as static pointer). When the objects are long lived the occasional (if any) superfluous new overhead will not be noticed (noneed to buffer extreanousobject). When the objects are short lived (and potential for numerous extra allocations) then insert the code to maintain and useextraneous object.

>>CAS will be cheaper than a lock
When lock is required, lock generally uses CAS (although there are ways to reduce the number of CASs in the event where your thread was the prior owner).

Jim

RafSchietekat · ‎07-08-2009

"Not so."
An acquire is absolutely necessary because otherwise the "client" may still be looking at old data where the referent is supposed to be. If x86(-64) hardware continues to behave like it does now even though this is not officially documented (last time I looked anyway), the acquire is implicit and all is well... but only if the O.S.+linker+compiler turn this into a single aligned MOV for whatever size a pointer is, so that the read is also atomic in the original sense of the word (indivisible). It has to be (at least) an atomic load/acquire, there's no getting around that.

"When lock is required, lock generally uses CAS (although there are ways to reduce the number of CASs in the event where your thread was the prior owner)."
You could also use XCHG instead of LOCK CMPXCHG for just locking, but, whether or not that's cheaper at the time of locking (I don't know), the average time attributable to locking over the total number of uses of the pointer approaches zero anyway.

Dmitry_Vyukov · ‎07-09-2009

Quoting - Raf Schietekat

But what will g++ do with optimisation turned off?

It's incorrect question. The correct question will be "What g++ vX.XX do with optimization turned off?" So in what version are you interested? ;)
I can not say for gcc, however Intel C++ does some optimizations in debug build, at least some code looks not in the way I would expect to be "straightforward translation" of C++ code.

RafSchietekat · ‎07-09-2009

So you're giving me just enough information to stress me out, and not enough to force a change... :-/

I guess if we'll just keep our fingers crossed, and occasionally light a candle, that ought to do the trick.

jimdempseyatthecove · ‎07-09-2009

Quoting - Raf Schietekat

"Not so."
An acquire is absolutely necessary because otherwise the "client" may still be looking at old data where the referent is supposed to be. If x86(-64) hardware continues to behave like it does now even though this is not officially documented (last time I looked anyway), the acquire is implicit and all is well... but only if the O.S.+linker+compiler turn this into a single aligned MOV for whatever size a pointer is, so that the read is also atomic in the original sense of the word (indivisible). It has to be (at least) an atomic load/acquire, there's no getting around that.

"When lock is required, lock generally uses CAS (although there are ways to reduce the number of CASs in the event where your thread was the prior owner)."
You could also use XCHG instead of LOCK CMPXCHG for just locking, but, whether or not that's cheaper at the time of locking (I don't know), the average time attributable to locking over the total number of uses of the pointer approaches zero anyway.

XCHG inforces a LOCK whether you ask for it or not.

typically a lock (changing a 0 to 1) would use XCHG whatever with 1 and test the return value. If return value was 0 then your XCHG set the lock, if non-zero then lock was already held (or you just got beat out).

You can insert a test memory lock wordfor 0 and bne (branch not equal) to lock fail just prior to issuing the XCHG. Although this is more work on your code part, it will avoid invalidating that cache line in all systems holding that cache line. And in turn save them time and you time when the other threads return the favor.

Jim Dempsey

jimdempseyatthecove · ‎07-09-2009

>>An acquire is absolutely necessary because otherwise the "client" may still be looking at old data where the referent is supposed to be.

And volatile assures the memory is read (as does acquire by way of use of volatile).

Now should C++0x "improve" on a long standing standard for volatile and depreciate it, then volatile no longer means volatile.

Jim