Interlocked operations

dpotages · ‎01-29-2007

Hi,
I'm working with Vs2003, targeting multiple platforms and processors. I'd like to replace some of the windows' CriticalSections used in our code something faster. As using volatile (especially with 2003) is not very safe, i went for interlocked operations.
Since the code i was suppose to patch was using something that looks like

while( !m_bBoolean ) ;
m_bBoolean = true;
// ...
volatile bool m_bBoolean;

I introduced a new class to only need to replace the type of the value itself:

template< typename T > class VolatileType
{
enum { LOCKED = 1, UNLOCKED = 0 };
public:
__forceinline VolatileType() : _lock(UNLOCKED) { set(0); }
__forceinline VolatileType( const T& value ) : _lock(UNLOCKED) { set( value ); }
__forceinline volatile VolatileType& operator=( const T& value ) { set( value ); return *this; }
__forceinline operator T() const { volatile T res = get(); return res; }

T get() const
{
volatile T res;
AcquireLock();
{
res = _data;
}
ReleaseLock();
return res;
}

void set( T value )
{
AcquireLock();
{
_data = value;
}
ReleaseLock();
}

private:
VolatileType( const VolatileType& );
VolatileType& operator=( const VolatileType& );

void AcquireLock() const
{
LONG volatile *pLock = const_cast(&_lock);
while( InterlockedCompareExchange( pLock, LOCKED, UNLOCKED ) == LOCKED );
//_ReadWriteBarrier();
}

void ReleaseLock() const
{
LONG volatile *pLock = const_cast(&_lock);
InterlockedExchange( pLock, UNLOCKED );
}

__declspec(align(32)) volatile LONG _lock;
_volatile T _data;
};

typedef VolatileType VolatileBool;

Sadly, that doesn't work at all, and i probably missed or misunderstood something very easy. Am i completely mis-using interlocked operations (most likely), or is windows/the compiler not doing his job (i wish

)
Any tip is welcome!

Thanks in advance,

/David

jimdempseyatthecove · ‎01-29-2007

When performing an operation like get() or set() in your sample code you do not require interlocked operations. You do require interlocked or locked operations when performing a non-atomic operation. The non-atomic operation may be as simple as

++A;

or

A = A + B;

or

if((pNode = Head)
Head = Head->Next;
return pNode;

Using a lock flag, you would lock before the operation and unlock after the operation.
There are interlocked instructions that perform some of the common operations, such as ADD. For some reason MS VS 2005 did not include InterlockedAdd (Intel ICC may have), but you have an alternate in the InterlockedCompareAndExchange type of instructions. i.e. add to temp then attempt to swap old value with result.

Lock(HeadLock);
if((pNode = Head) != NULL)
Head = Head->Next;
Unlock(HeadLock);
return pNode;

In the last case above, where you delink a linked list you would like to lock the Head for exclusive use on the 1st two lines. The above coding introduces an inconvience to the other threads sharing Head should the one owning the lock get context switched out.

Jim

dpotages · ‎01-30-2007

"When performing an operation like get() or set() in your sample code you do not require interlocked operations"

Well since i'm targetting multiprocessors systems, i think i need to, especially according to an msdn article that says:
"With Visual Studio 2005, volatile variable access creates a hardware memory barrier when needed. (...) With Visual Studio 2003, volatile variable access prevents compiler re-ordering only."

So to have a memory barrier, i need to use interlocked operations (as the needed intrinsics are not available in vs2003) or use assembly, or did i misunderstand something?

/david

jimdempseyatthecove · ‎01-30-2007

David,

Memory barriers are only required when the physical sequence of reads and/or writes through the memory interface are important to the entity on the other side of the memory interface. The entity being another processor or I/O device.

Many processor designs (including Intel) permit out of order reads as well as out of order writes. Only when physical order or reads and/or writes are important would you require the assurance of ordering. This implies all other cases support out of order physicalread/write while seemingly (logically) executing in order.

Note that I haven't mentioned cache. Most MP systems, including Intel,havecache coherency. This means, if one processor performs a physical write to the memory bus and if the memory data were cached in other processors then that cached data is either a) invalidated or b) updated (depending on cache design). This means that the next read from memory, by any processor, will obtain the new value.

If your application is simply monitoring a flag, or simply setting a flag then it need only to read or write the memory without regard to memory barrier. The governing term here is "read or write the memory". This means optimizations by the compiler should not interfere with the reading or writing of the memory.

Volatile, pre-VS2005, assured that optimizations would noteliminate seemingly redundant read or writes from your code. e.g. moving the read (and rereading) of a variable from inside of a loop to outside of a loop. This method did not assure sequencing of read and/or writes by either the compiler or the processor.

Example. Assume you were to write a MP and multi-threaded application that were to communicate via something line a simulatedparallel port device. i.e. data register, and control register

timeOutCount = 0;

DataReg = data;
while(!(ControlReg&Done))
{
if(++timeOutCount > Limit)Error(PortTimeOut);
}
DataReg = 0;

If ControlReg were not volatile and not reset in the check timeout block then the compiler could conceivably perform the test once. So ControlReg must be volatile.

If DataReg were not volatile, and not referenced in the check for timeout block, then the optimizations of the compiler could omit the "DataReg = data;" and simply set DataReg to 0.Therefor you need volatile on DataReg as well.

The pre-VS2005 use of volatile in the above codewould not assure the order a) write DataReg, b) Read ControlReg. The pre-VS2005 compiler may have assured the logical sequence of instructions (code bytes) but would not have assured the processor performed the sequence in order to the memory interface. Without this assurance, it would have been possible (permit able)for the processor to have fetched the ControlReg first then write the DataReg second. e.g. Data is 8-bits, DataReg is unsigned 16-bits, high byte being Control Register and zeroed concurrent with write of Data to DataReg. In this case, out of order read (vs write) could potentially see Done set for the prior write as opposed to for the current write.

The newer VS2005 code assures physical sequencing consistent with logical sequencing through theuse of memory barriers. The barriers are only required in ci rcumstances where sequencing is important. Formerly the programmer had to explicitly insert the memory barriers, now, memory barriers are implicit.

In addition to memory barriers, at the appropriate times, the programmer must be made aware of situations were a given operation is atomicon a single processor but not necessarily so on a multi-processor system. Example, the instruction to add a register to a memory location

add [ecx], eax

or

inc [ecx]

On a single processor those instructions are atomic in the sense that the instruction is not interruptible. To the memory interface though, it is two operations: Read,then Write (with the processor performing the add or inc between the Read and Write). On a multi-processor system, two or more processors could perform

Time: n, n+1, n+2
CPU0:Read, modify, Write
CPU1: Read, modify, Write

Where one processor undoes the operation of the other processor. e.g. INC by both results in +1 and not +2.

To avoid this, the processor (Intel and others) have a prefix or other meansthat can be added to the instruction that permits atomic Read/Modify/Write. For Intel this is the LOCK prefix. This prefix is used by the _Interlocked... series of intrinsic functions.

Memory barriers should be used where needed. And memory barriersare not cure all forbad programming.

Also, memory barriers are not the same as interlocked operations.

1) A memory barrier does not assure a processor atomic operation which uses multiple memory read/writes, is atomic with regard to other processors sharing the same memory.

2) The LOCK prefix (used in _Interlocked... functions) assures atomic Read/Modify/Write with respect to all processors sharing the same memory, but does not assure a memory barrier within the processor issuing the LOCK. e.g. the processor may have performed a read-ahead and which is now cached.

3) Memory barriers (Read Barrier, Write Barrier, RW Barrier) are implemented by separate instruction sequences. These barriers need not be included in the _Interlocked... functions. The compiler will (should) know if a barrier is required and if barrier inside intrinsic function, if not then insert appropriate barrier (according to rules defined by compiler writer). If the auto generated barrier is not adequate then the programmer must explicitly use the barriers as required.

Jim Dempsey

dpotages · ‎01-31-2007

Once again, thanks alot Jim, it's now much more clear to me.
Basically i was posting this as i had some issues with this code which gave alot of ThreadChecker errors on my multicore if i'm not using the CRITICAL_SECTION implementation. I tried to explicitely add memory barriers, didn't help. I'll reread several times your post and see if i can find out my stupid mistake :)

/david

jimdempseyatthecove · ‎01-31-2007

David,

After a cursory overview of your code I have some comments.

When using critical section the block incorporaties a spinlock (you set the value to 4000). Therefor, when attempting to obtain a lock, andwhen lock is own by other thread, the attempt takes more than 4000 iterations, the thread making the attempt to lock is suspended.

When using your own interlock flag there is no "after so many attempts suspend the thread". Therefor when attempting to obtain a lock, andwhen lock is own by other thread, the attempt computes until lock is free.

Since your for loops in main and Job acquire a lock for the duration of the loop the two loops will be serialized. This may be what you intended for tour test application.

However, also note that the body of each loop contains the constructor for csh. The constructor performs Enter() on each iteration. From MSDN

After a thread has ownership of a critical section,
it can make additional calls to EnterCriticalSection
or TryEnterCriticalSection without blocking its execution.

However, note that the code when using _Interlocked... will block on lock flag regardless if the lock is owned by the caller. The following should run with the ThreadChecker.

{
 CriticalSectionHolder csh( s_CS );
  for( size_t i = 0; i < 5000000; ++i )
  {
   m_iValue = 1;
  }
}

Jim Dempsey

dpotages · ‎01-31-2007

JimDempseyAtTheCove:
When using critical section the block incorporaties a spinlock (you set the value to 4000). Therefor, when attempting to obtain a lock, andwhen lock is own by other thread, the attempt takes more than 4000 iterations, the thread making the attempt to lock is suspended.

When using your own interlock flag there is no "after so many attempts suspend the thread". Therefor when attempting to obtain a lock, andwhen lock is own by other thread, the attempt computes until lock is free.

Since your for loops in main and Job acquire a lock for the duration of the loop the two loops will be serialized. This may be what you intended for tour test application.

It is indeed intended :)

JimDempseyAtTheCove:
However, also note that the body of each loop contains the constructor for csh. The constructor performs Enter() on each iteration. From MSDN
After a thread has ownership of a critical section,
it can make additional calls to EnterCriticalSection
or TryEnterCriticalSection without blocking its execution. 
However, note that the code when using _Interlocked... will block on lock flag regardless if the lock is owned by the caller.

Indeed (i also mentionned it in a comment in my code), but each loop also contains the destructor for csh.

JimDempseyAtTheCove:
The following should run with the ThreadChecker.
{
 CriticalSectionHolder csh( s_CS );
  for( size_t i = 0; i < 5000000; ++i )
  {
   m_iValue = 1;
  }
}
Jim Dempsey

Well it also gives one datarace error (instead of 5000000) when using the interlocked version..

jimdempseyatthecove · ‎01-31-2007

David,

Look at the following code. I haven't compiled and tested it so there may be a typeo.

#define _WIN32_WINNT 0x0500
#define WIN32_LEAN_AND_MEAN
#include

#include

// Comment out both #defines for interference
// Un-comment one of #defines for seperate tests
// #define USE_InterlockedAdd
// #define USE_InterlockedCompareExchange

longLoopLimit = 5000000;
volatile longm_ivalue;

// ---- Thread job ----
DWORD WINAPI Job( LPVOID )
{
 for( size_t i = 0; i < LoopLimit; ++i )
 {
#if defined(USE_InterlockedAdd)
 _InterlockedAdd(&m_iValue, 1);
#elif defined(USE_InterlockedCompareExchange)
 while (true)
 {
 long temp = m_iValue;
 if(_InterlockedCompareExchange(&m_iValue, temp+1, temp) == temp) break;
 }
#else
 m_iValue = m_iValue + 1;
#endif
 }
 return 0;
}

// ---- Main thread ----
int main()
{

 if( CreateThread( 0, 0, &Job, 0, 0, 0 ) == 0 )
 return -1;

 for( size_t i = 0; i < LoopLimit; ++i )
 {
#if defined(USE_InterlockedAdd)
 _InterlockedAdd(&m_iValue, 1);
#elif defined(USE_InterlockedCompareExchange)
 while (true)
 {
 long temp = m_iValue;
 if(_InterlockedCompareExchange(&m_iValue, temp+1, temp) == temp) break;
 }
#else
 m_iValue = m_iValue + 1;
#endif
 }
 if(m_iValue == LoopLimit*2)
 printf("Success
");
 else
 printf("Fail
");
 return 0;
}

Note,

On my release of VS2005 C++ the _InterlockedAdd intrinsic seems to be missing although it is documented. Intel's C++ may have it.

If your application requires simple atomic operations and where the operation can use _Interlocked... then do so by all means (more efficient code).

Some operations cannot be performed in an atomic manner using the supplied _Interlocked... functions. For these you will need a lock flag as you intended to do with your class declarations.

Jim Dempsey

jimdempseyatthecove · ‎01-31-2007

I think you misunderstood the problem described with my second comment.

As originally written each of your for loops perform a LOCK on each iteration. The unlock is only performed on the exit of the scope of the loop. Therefore

Using CriticalSection the 2nd, 3rd, ... 5000000th lock by the lock owner succeeds and the loop finishes (and exits the CriticalSection)

Using lock flag, the 2nd attempt to lock by the owner of the lock blocks and the loop never completes (as it is waiting for itself to release the lock).

Jim Dempsey

dpotages · ‎02-01-2007

JimDempseyAtTheCove:
As originally written each of your for loops perform a LOCK on each iteration. The unlock is only performed on the exit of the scope of the loop. Therefore

Actually this is wrong. The lock and unlock are performed at every iteration of the loop. The liftime csh is one iteration.

You can try this code:

#include

using std::cout;
using std::endl;

struct Lock
{
Lock() { cout << "Constructor" << endl; }
~Lock() { cout << "Destructor" << endl; }
};

int main()
{
cout << "Before loop" << endl;
for( size_t i = 0; i < 5; ++i )
{
cout << "Before constructor" << endl;
Lock a;
cout << "Before destructor" << endl;
}
cout << "After loop" << endl;
}

/david

jimdempseyatthecove · ‎02-01-2007

Thanks David for pointing this out. (Teaching an old dog anew trick.)

This behavior (distructor inside loop) must have changed since the earlier days of C++. Probably around the time when they decided if the declaration of an initiator in a for expression belongs inside the scope of the loop as opposed to outside the scope of the braces.

Therefore, in your application,this means that the interlocked flag verses the critical section is not causing a block. Or should I say should not cause a block. Does the loop holding the 1st lock (using Interlocked) terminate? If not, then you may have to trace the assembly code to see what is happening.

If the loop completes, thenthe Thread Checker must be indicating a problem where your interpretation of problem is different. i.e. it is indicating a thread inyour code is waiting an excissively long time in a tight loop.

Jim Dempsey

dpotages · ‎03-12-2007

Hi,

I was away from the forums for some time, deadlines/milestones/etc :)

Anyway Thread Checker was reporting some race conditions simply because i did not read the documentation completely (doh), as i was not using the __itt_notify_sync_xxx functions..
After using them, everything's fine :)

/david