Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Out-of-order execution and semaphores question

jacquesbas
Beginner

I have a problem with an application. My application does multithreaded tree search. The entire tree is shared by all threads, and I use a semaphore at each node to synchronize access to the data stored in that node. The semaphore is a single byte that may contain 1 = free or 0 = busy.

The semaphore's code is:

mov al, 1 // 1 = free
mov dl, 0 // 0 = busy
lea esi, node.semaphore_byte
lock cmpxchg [esi], dl
jnz xxx // This branch is taken when the semaphore is busy
// Semaphore was free, busy now.
mov ebx, [some Address in the node]

To free the semaphore:

lea esi, node.semaphore_byte
lock dec byte ptr [esi]
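
(For reference, the intended protocol in portable C11 atomics - a sketch with invented names, not my real code - would be:)

#include <stdatomic.h>
#include <stdbool.h>

typedef struct node {
    _Atomic unsigned char semaphore_byte;   /* 1 = free, 0 = busy */
    /* ... the rest of the 64-byte node ... */
} node_t;

static bool node_try_lock(node_t *n) {
    unsigned char expected = 1;                       /* expect free */
    return atomic_compare_exchange_strong(&n->semaphore_byte,
                                          &expected, 0 /* make busy */);
}

static void node_unlock(node_t *n) {
    atomic_store(&n->semaphore_byte, 1);              /* back to free */
}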

In the manual "Intel 64 and IA-32 Architectures, Software Developer's Manual, Volume 3A: System Programming Guide, Part 1", subsection "7.1.2.2 Software Controlled Bus Locking", the following note is given:

" NOTE Do not implement semaphores using the WC memory type. Do not perform non-temporal stores to a cache line containing a location used to implement a semaphore."

I guess "normal" RAM allocated by a GlobalAlloc() Windows API call is not write-combining (WC), so the first part should not be a problem.

I don't understand the second part: "Do not perform non-temporal stores to a cache line containing a location used to implement a semaphore". Does that mean the semaphores must be stored in 64-byte cache lines containing nothing else? (Since my node size is 64 bytes, that would double the RAM usage.) Is my semaphore code correct? (It works "almost always".) I may have out-of-order RAM access problems, or perhaps I have to hunt the bug somewhere else ...

Please, help.

jimdempseyatthecove
Honored Contributor III

Assuming your 64-byte nodes, including the semaphore, are aligned at 64-byte addresses:

From the description of your code, the only time (outside of program initialization) you perform a write into a node (into a cache line) is after setting the lock. However, if you permit a different thread to read through a locked node, that code must be sensitive to the possibility that some data in the node is in flux, and the code must be written appropriately.

By the way, your code states:

mov al, 1 // 1 = free
mov dl, 0 // 0 = busy
lea esi, node.semaphore_byte
lock cmpxchg [esi], dl
jnz xxx // This branch is taken when the semaphore is busy
// Semaphore was free, busy now.

** meaning semaphore_byte is now 0
** then to free the semaphore:

lea esi, node.semaphore_byte
lock dec byte ptr [esi]

** meaning the semaphore_byte is now -1 !! which is not 1 and not = free
** unless the intervening code set semaphore_byte to 2

Consider unlock using

lea esi, node.semaphore_byte
SFENCE
mov byte ptr [esi],1

The advantage is that the code owning the lock can exit more quickly.
The disadvantage is that the code releasing the lock has a higher chance of reacquiring it in the near future (before other HW threads see the release).

Why the choice of 1 = free / 0 = busy?
Typically locks are expressed with 0 = free and x = state x (1 = phase 1, 2 = phase 2/phase 1 complete, 3 = phase 3/phase 2 complete), or simply != 0 meaning busy.

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I
Quoting - jacquesbas

I guess "normal" RAM allocated by a GlobalAlloc() Windows API call is not write-combining (WC), so the first part should not be a problem.


Yes, you are right. The normal memory type is WB (write-back). You can hardly allocate memory of another type without noticing it. On Windows you can get memory of the WC type with VirtualProtectEx(PAGE_WRITECOMBINE) or of the NC (non-cacheable) type with VirtualProtectEx(PAGE_NOCACHE).
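
For illustration, the non-default types must be requested explicitly; a minimal sketch (VirtualAlloc accepts the same protection modifiers):

#include <windows.h>

void *alloc_wc_page(void)
{
    /* WC must be asked for; ordinary allocations come back as WB */
    return VirtualAlloc(NULL, 4096, MEM_RESERVE | MEM_COMMIT,
                        PAGE_READWRITE | PAGE_WRITECOMBINE);
    /* PAGE_READWRITE | PAGE_NOCACHE would give the NC (uncacheable) type */
}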

Dmitry_Vyukov
Valued Contributor I
Quoting - jacquesbas
I don't understand the second part: " Do not perform non-temporal stores to a cache line containing a location used to implement a semaphore " Does that mean, the semaphores must be stored in 64 byte cache lines containing nothing else? (Since my node size is 64 bytes that would double the RAM usage.) Is my semaphore code correct? (It works "almost always") I may have out of order RAM access problems, or perhaps I have to hunt the bug somewhere else ...

No, this means that you must not use store instructions that perform non-temporal stores (they go through the write-combining buffers). These instructions are suffixed with "NT" - movntq, etc.
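
For example, this (purely illustrative) function performs exactly the kind of store the note warns about:

#include <emmintrin.h>   /* SSE2 */

void nt_store_example(int *p, int v)
{
    _mm_stream_si32(p, v);   /* compiles to MOVNTI: a non-temporal store */
}

Keep stores like this away from any cache line that holds a semaphore.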


Dmitry_Vyukov
Valued Contributor I

lea esi, node.semaphore_byte
SFENCE
mov byte ptr [esi],1


SFENCE in the unlock is required on x86 if and only if you are using non-temporal stores (MOVNTQ, etc.) inside the critical section.
Otherwise I would strongly recommend removing the SFENCE.

Dmitry_Vyukov
Valued Contributor I
Quoting - jacquesbas

The semaphore's code is:

mov al, 1 // 1 = free
mov dl, 0 // 0 = busy
lea esi, node.semaphore_byte
lock cmpxchg [esi], dl
jnz xxx // This branch is taken when the semaphore is busy
// Semaphore was free, busy now.
mov ebx, [some Address in the node]

To free the semaphore:

lea esi, node.semaphore_byte
lock dec byte ptr [esi]


As for the code, it looks correct.

As Jim mentioned, I would also recommend replacing the 'LOCK DEC' in the unlock with a plain MOV. A MOV on x86 provides both atomicity and the required memory ordering:

lea esi, node.semaphore_byte
mov byte ptr [esi], 0

This will significantly reduce the cost of the unlock. Regarding fairness: usually fairness is not required at such a low level, and I guess that, since you are doing tree traversal, you will not re-lock the same node in the near future (~100 cycles) after the unlock.
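
In C11 terms the difference between the two unlock flavors looks roughly like this (a sketch with an invented variable, not the code under discussion):

#include <stdatomic.h>

_Atomic unsigned char sem;

void unlock_locked(void) {        /* like LOCK INC/XADD: a full barrier */
    atomic_fetch_add(&sem, 1);
}

void unlock_plain(void) {         /* compiles to a plain MOV on x86 */
    atomic_store_explicit(&sem, 1, memory_order_release);
}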

jacquesbas
Beginner

1. jimdempseyatthecove wrote:
> lea esi, node.semaphore_byte
> lock dec byte ptr [esi]

> ** meaning the semaphore_byte is now -1 !! which is not 1 and not = free
> ** unless the intervening code set semaphore_byte to 2

Oops, yes it should have been "lock inc byte ptr [esi]"

2. Dmitriy Vyukov wrote:
> .. Normal memory type is WB (write-back).
> .. No, this means that you must not use store instructions that perform non-temporal
> stores (they go through the write-combining buffers). These instructions are suffixed
> with "NT" - movntq, etc.

Ok. Thank you. So that is not the problem.

3. jimdempseyatthecove wrote:

> Assuming your 64-byte nodes, including the semaphore, are aligned at 64-byte addresses:

The nodes themselves are 64-byte aligned; the semaphore is just 1 byte inside the node.

> However, if you permit a different thread to read through a locked node, that code
> must be sensitive to the possibility that some data in the node is in flux and the
> code written appropriately.

The reason a lock is needed is that the tree grows and only a thread that has won the lock can add new children to the node. The threads that just read (which is the most frequent case) and occasionally write data to fields of the node not related to the tree arrangement do NOT use the lock. Of course, when a value is modified, another thread may read either the new value or the previous value. That is not a problem. Fields are aligned, so those operations should be atomic.

Conclusion:

1. The semaphore mechanism is correct, but

lea esi, node.semaphore_byte
SFENCE // Is this required? I don't use instructions with "NT".
mov byte ptr [esi],1

is more efficient.

2. There is no problem with sharing "locked access" and "non-locked access", other than that the data read may be outdated. There is nothing special in the fact that the lock is in one particular byte of the same cache line (= node) where other threads may read or write other fields of the node, except for the pointer to the next child, which is the critical field protected by the lock.


Are my conclusions correct?

jimdempseyatthecove
Honored Contributor III

Because in this case your node does not span cache lines, the sfence can be removed (provided you perform no non-temporal stores).

However, at a later date, should you extend the node size to span cache lines, or should your lock code (not shown) double lock (lock a secondary node), then the sfence may be required on some platforms. Some platforms==multi-socket. The cache coherency system will maintain coherency within a cache line but it will not necessarily maintain the write sequence order of multiple cache lines across sockets. (Someone at Intel may contradict this but you should also consult the engineers at AMD.)

The presence or absence of sfence will depend on the code enclosed in the protected region. Examine your code and leave copious cautionary comments above and below the protected code section for future reference.

If the code permits sfence to be removed, then by all means remove the sfence.

Also, as an additional optimization, see if you can combine the release of the semaphore with the last write of adjacent data (both/all in the same aligned word, dword, qword, dqword).

If possible, run stress tests on various platforms, with sanity checks in your test application but not in the code under test. The routine being tested should be "as it will be used in the release version" and must not contain the sanity checks, as these would affect the running characteristics of the code.

Jim Dempsey
jimdempseyatthecove
Honored Contributor III

>>The reason a lock is needed is that the tree grows and only a thread that has won the lock can add new children to the node. The threads that just read (which is the most frequent case) and occasionally write data to fields of the node not related to the tree arrangement do NOT use the lock.

A problem you may have is if the locked process affects the geometry of the tree other than by simply adding a branch, e.g. if the lock code is deleting or moving nodes, your traversal code (running in another thread) might break. Care must be taken to ensure the lock code and the unlocked traversing code can run concurrently without problems.

Note, if you change your semaphore_byte state values you can use

0=free
1=in use but not bunging up tree
2=in use but *** caution - bunging up tree

Then the traversal code can take appropriate actions.
This is not to say your current code has this sensitivity problem.

Jim Dempsey
Dmitry_Vyukov
Valued Contributor I
Quoting - jacquesbas

The reason a lock is needed is that the tree grows and only a thread that has won the lock can add new children to the node. The threads that just read (which is the most frequent case) and occasionally write data to fields of the node not related to the tree arrangement do NOT use the lock. Of course, when a value is modified, another thread may read either the new value or the previous value. That is not a problem. Fields are aligned, so those operations should be atomic.


You must be very careful here. Atomicity is only one aspect of a memory model; there are also the ordering and the visibility of memory accesses. Since you are stepping onto lock-free ground (not all accesses to shared data are protected by mutexes), you must take all of these aspects into account.
In particular, the C/C++ compiler or the hardware can reorder memory accesses, so reader threads may see partially constructed or inconsistent nodes.
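
For example, in C11 terms safe publication of a new child would pair a release store with an acquire load; a sketch with invented names:

#include <stdatomic.h>

typedef struct node node_t;
struct node {
    int visits;                       /* invented payload field */
    _Atomic(node_t *) first_child;
};

void publish_child(node_t *parent, node_t *child) {
    child->visits = 0;                /* finish building the node first... */
    atomic_store_explicit(&parent->first_child, child,
                          memory_order_release);   /* ...then publish it */
}

node_t *read_child(node_t *parent) {
    /* the acquire pairs with the release above, so a reader never sees
       a partially constructed node through this pointer */
    return atomic_load_explicit(&parent->first_child, memory_order_acquire);
}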

jimdempseyatthecove
Honored Contributor III

Also, I forgot to mention

A common technique is to place the 1-bit (or 2-bit) semaphore in the least significant bits of a pointer contained within the object being protected. In this way you do not have a byte packing problem, and the flag occupies otherwise free space. In your node the low 6 bits (64-byte-aligned nodes) are available, but in the general case you have 2 bits available on a 32-bit platform and 3 bits on a 64-bit platform (when using intptr_t-aligned nodes).

An additional benefit is that you can atomically, and conditionally, obtain/store a branch link pointer while simultaneously locking/unlocking a node, as sketched below.
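
A sketch of the idea in C11 atomics (invented names; assumes 64-byte-aligned nodes, so the low pointer bits are known to be zero):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_BIT ((uintptr_t)1)     /* bit 0 of an aligned child pointer */

typedef struct node node_t;
struct node { _Atomic uintptr_t child; };   /* pointer + lock flag */

static node_t *child_of(uintptr_t v) { return (node_t *)(v & ~LOCK_BIT); }

static bool node_try_lock(node_t *n) {
    uintptr_t unlocked = atomic_load(&n->child) & ~LOCK_BIT;
    /* succeeds only if the lock bit was clear; sets it atomically */
    return atomic_compare_exchange_strong(&n->child, &unlocked,
                                          unlocked | LOCK_BIT);
}

static void node_unlock_with_child(node_t *n, node_t *c) {
    /* a single store publishes the new link and releases the lock */
    atomic_store_explicit(&n->child, (uintptr_t)c, memory_order_release);
}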

Jim Dempsey
jacquesbas
Beginner

jimdempseyatthecove wrote:

> A problem you may have is if the locked process affects the geometry of the tree other than
> by simply adding a branch, e.g. if the lock code is deleting or moving nodes, your traversal
> code (running in another thread) might break. Care must be taken to ensure the lock code and
> the unlocked traversing code can run concurrently without problems.

During the search, the tree may only grow; no nodes are moved or deleted. It is a Monte-Carlo Tree Search: when a node has been visited a given number of times, it is expanded. Only when all children have been initialized is the parent's pointer to the first child updated. Immediately after that, the lock is released. Using the pointer as a lock could be a good idea, as you mention, but I prefer not to modify other parts of the code that use the pointer and would then have to read it ANDed with a mask.

Dmitriy Vyukov wrote:

> In particular, the C/C++ compiler or the hardware can reorder memory accesses,
> so reader threads may see partially constructed or inconsistent nodes.

Yes. The algorithm is robust enough to accept small errors. The data in the node are results of simulations. In the worst case, a simulation result is partially counted or not counted at all. The probability of a thread collision is small: a simulation is slow (100 - 500 microseconds) while updating the results takes just a few clock cycles. The error is small compared to the simulation noise.

Thank you both for your comprehensive answers.

Dmitry_Vyukov
Valued Contributor I
but in the general case you have 2 bits available on a 32-bit platform and 3 bits on a 64-bit platform

Actually it's possible to steal up to 25 (!) bits from a pointer on Win64, i.e. to compress a pointer to 39 (!) bits.
Win64 uses a 44-bit address space. The 4 lsb bits are zero anyway (does Win64 always align allocations on 16 bytes?). And 1 msb bit is meaningless if you are not going to pass pointers between user and kernel space.
These are the hacks that allowed Microsoft to port the SList API to Win64. For example, given a 128-bit CAS (DWCAS on a 64-bit platform), we are able to pack 3 pointers plus an 11-bit additional structure, or 2 pointers plus a 50-bit additional structure.
You can see details here:
http://www.alex-ionescu.com/?p=50
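
Purely as an illustration of the compression (taking on faith the 44-bit, 16-byte-aligned, user-mode pointers described above - verify on your own target):

#include <stdint.h>

/* pack: drop the 4 always-zero low bits and the user/kernel msb -> 39 bits */
static uint64_t pack_ptr(void *p) {
    return ((uint64_t)(uintptr_t)p >> 4) & ((UINT64_C(1) << 39) - 1);
}

static void *unpack_ptr(uint64_t v) {
    return (void *)(uintptr_t)(v << 4);
}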


jimdempseyatthecove
Honored Contributor III
Quoting - Dmitriy Vyukov
but in the general case you have 2 bits available on a 32-bit platform and 3 bits on a 64-bit platform

Actually it's possible to steal up to 25 (!) bits from a pointer on Win64, i.e. to compress a pointer to 39 (!) bits.
Win64 uses a 44-bit address space. The 4 lsb bits are zero anyway (does Win64 always align allocations on 16 bytes?). And 1 msb bit is meaningless if you are not going to pass pointers between user and kernel space.
These are the hacks that allowed Microsoft to port the SList API to Win64. For example, given a 128-bit CAS (DWCAS on a 64-bit platform), we are able to pack 3 pointers plus an 11-bit additional structure, or 2 pointers plus a 50-bit additional structure.
You can see details here:
http://www.alex-ionescu.com/?p=50



Dmitriy,

Applications often have a lifetime much longer than the hardware platform. During my tenure as a programmer, addressing bits have grown by 26 bits or more (I used to program on a system with 12 bits of addressing capability). Demand for more addressable memory will drive the number of bits upwards. Although I cannot envision (dream yes, envision no) a system with 64 bits of addressable memory in one box, this is not to say that the application you write today will not run on a future system that uses the address space for more than accessing the lump of RAM in your system box. In a future O/S the upper address bits could potentially be an IP address or a similar concept to express "this memory address over there->".

If (when) you take over the "unused" bits in a pointer, your technique must be flexible enough to detect which bits are indeed "unused" and take action accordingly, e.g. use CPUID or an equivalent system call to physically query what is used (and, by assumption, what is unused).

In the case of flags, these are more or less fixed in the application, so they will tend to be required to reside in bits "always known (future and past) to be unused". The remaining bits are to be assumed elastic, with the potential to shrink in number (although not necessarily in any given run instance of the application).

Further, depending on the O/S, negative addresses may be used. An example of this is the stack on Linux, which starts at ?? -4, -8 ?? in the virtual memory address space. I haven't looked at BOS; it could be -4096. Whatever it is, it works down from the negative-address direction.

Goto http://appft1.uspto.gov/netahtml/PTO/srchnum.html
or
http://www.uspto.gov if the above link doesn't work

Patents | Publication Number Search (right box middle)

And enter in 20080228784

That is a publication of a U.S. Patent application that I filed relating to this subject.

In the application, the technique was for simulating a DCAS operation using CAS (extended to double of whatever you have for your CAS size).

A typical use for DCAS is to (attempt, with protection, to) swap a pointer and a sequence counter as a measure to avoid the ABA problem, against which a single CAS is ineffective. In the mentioned application, the code would determine which bits were available (always 0, always 1) and then borrow those bits for use as an ABA-like sequence counter (there is more to the application than that).
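
As an illustration of the general idea (not the patented mechanism itself), a sequence counter kept in the always-zero low bits of an aligned pointer lets one CAS cover both the pointer and the counter:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TAG_MASK ((uintptr_t)63)    /* 64-byte-aligned nodes: 6 spare bits */

_Atomic uintptr_t head;             /* tagged pointer: ptr | counter */

/* Note: a 6-bit counter only narrows the ABA window, it cannot close it. */
static bool swap_head(void *expect_ptr, uintptr_t expect_tag, void *new_ptr) {
    uintptr_t expect = (uintptr_t)expect_ptr | expect_tag;
    uintptr_t next   = (uintptr_t)new_ptr | ((expect_tag + 1) & TAG_MASK);
    return atomic_compare_exchange_strong(&head, &expect, next);
}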

If you are considering using "unused bits", you might want to consult that document (caution: it is written as a patent application, so it does not read like a programming sample).

Jim Dempsey


jimdempseyatthecove
Honored Contributor III

I forgot to note

The unused bits in the address are not necessarily the bits above the physical address, as virtual memory systems typically provide a virtual address space larger than the physical address space. When you query your system, you must determine the structure of the page-table system, not the amount of physical RAM.

Jim Dempsey
Dmitry_Vyukov
Valued Contributor I

Applications often have a lifetime much longer than the hardware platform. During my tenure as a programmer, addressing bits have grown by 26 bits or more (I used to program on a system with 12 bits of addressing capability). Demand for more addressable memory will drive the number of bits upwards. Although I cannot envision (dream yes, envision no) a system with 64 bits of addressable memory in one box, this is not to say that the application you write today will not run on a future system that uses the address space for more than accessing the lump of RAM in your system box. In a future O/S the upper address bits could potentially be an IP address or a similar concept to express "this memory address over there->".


Not only do addressing bits change over time, but instruction sets do too. In the future we may find ourselves in a "share nothing" world with hardware support for efficient message passing, scheduling and garbage collection. Or with hardware support for full-fledged transactional memory. Or just with extended atomic RMW instructions like z/Arch's PLO.
So I am considering it just as a solution to a concrete, present-day problem...

Dmitry_Vyukov
Valued Contributor I
Goto http://appft1.uspto.gov/netahtml/PTO/srchnum.html
or
http://www.uspto.gov if the above link doesn't work

Patents | Publication Number Search (right box middle)

And enter in 20080228784

That is a publication of a U.S. Patent application that I filed relating to this subject.

In the application, the technique was for simulating a DCAS operation using CAS (extended to double of whatever you have for your CAS size).



I remember you mentioning this patent, but unfortunately I have not studied it yet (yes, patent language does not make for easy reading).

Although... well... I usually use low-level synchronization techniques for speed and scalability, not for lock-free (forward progress) properties (btw, is your emulated DWCAS lock-free or wait-free?). So 3 CASes look like overkill; my private rule is: if a lock-free algorithm issues more than 1 CAS (atomic RMW), then one had better just use a spin-mutex.

Btw, the high-level description of your technique recalls the STM by Tim Harris and Keir Fraser - it's lock-free and built directly from CAS:
http://www.cl.cam.ac.uk/research/srg/netos/lock-free/

jimdempseyatthecove
Honored Contributor III

>>(btw, is your emulated DWCAS lock-free or wait-free?).

It is lock-free and wait-free, and it is useful on systems that support only CAS but where DCAS is required (or that support up to DCAS but where QCAS is required, ...).

>>So 3 CASes look like overkill; my private rule is: if a lock-free algorithm issues more than 1 CAS (atomic RMW), then one had better just use a spin-mutex

The problem with "just spin-mutex" is that sooner or later the O/S will intervene on the task holding the lock. The lock holder could get suspended for a relatively long time. This causes all the other threads waiting for the lock to spin for this relatively long time. How you write the spin code determines how long you wait. For example, an overly aggressive spin-mutex with _mm_pause on a system with oversubscription of threads is bad news, as the thread holding the lock may be context switched out and will wait for the thread on the spin-mutex to finish its timeslice (use the less aggressive SwitchToThread to relieve this). And there may be multiple threads trying to get the lock. Oh, and before you assume that your application will not subscribe more threads than cores - please consider that your application is not running alone on the system. Other applications and system services virtually assure that your system is oversubscribed to one extent or another.

The wait-free portion of the code permits the 2nd (or any subsequent) thread to complete the 1st thread's task without waiting (the 1st thread will take notice of this and abort the redundant work).

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I
>>So 3 CASes look like overkill; my private rule is: if a lock-free algorithm issues more than 1 CAS (atomic RMW), then one had better just use a spin-mutex

The problem with "just spin-mutex" is that sooner or later the O/S will intervene on the task holding the lock. The lock holder could get suspended for a relatively long time. This causes all the other threads waiting for the lock to spin for this relatively long time. How you write the spin code determines how long you wait. For example, an overly aggressive spin-mutex with _mm_pause on a system with oversubscription of threads is bad news, as the thread holding the lock may be context switched out and will wait for the thread on the spin-mutex to finish its timeslice (use the less aggressive SwitchToThread to relieve this). And there may be multiple threads trying to get the lock. Oh, and before you assume that your application will not subscribe more threads than cores - please consider that your application is not running alone on the system. Other applications and system services virtually assure that your system is oversubscribed to one extent or another.


Yes-yes, I meant a spin-lock with passive spinning (i.e. pthread_yield/SwitchToThread). So if all my threads start executing pthread_yield/SwitchToThread, control must come back to the preempted thread in a reasonable amount of time.
Some UNIX systems provide a schedctl() syscall that can be used to notify the OS scheduler that a thread is in a critical section and that it is better not to preempt it now. The notification is very lightweight - just an increment of a particular variable (which the OS scheduler examines at the end of the time slice). This trick is used by some STM implementations, and it actually significantly reduces the probability of the scenario you mentioned:
http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=0630&db=man&fname=/usr/share/catman/p_man/cat2/ftn/schedctl.z
(see the SETHINTS command)
And note that while there is a possibility of thread preemption inside the critical section, a tremendous number of operations will execute much faster. A thread switch on modern OSes takes around 1,000 cycles (not counting cache draining, etc.), and a CAS on a non-cached location takes 200-500 cycles. So it will not be very easy to justify eliminating the possibility of thread blocking at the cost of 2 additional CASes on the fast path...
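
For example, a passive-spin lock might look like this (a Windows-flavored sketch, invented names):

#include <windows.h>
#include <emmintrin.h>   /* _mm_pause */

void spin_lock(volatile LONG *lock) {
    int spins = 0;
    /* InterlockedExchange returns the previous value; 0 means we got it */
    while (InterlockedExchange(lock, 1) != 0) {
        if (++spins < 64) {
            _mm_pause();         /* short active spin */
        } else {
            SwitchToThread();    /* passive spin: let the lock holder run */
            spins = 0;
        }
    }
}

void spin_unlock(volatile LONG *lock) {
    *lock = 0;                   /* a plain store is the release on x86 */
}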


Dmitry_Vyukov
Valued Contributor I
The wait-free portion of the code permits the 2nd (or any subsequent) thread to complete the 1st thread's task without waiting (the 1st thread will take notice of this and abort the redundant work).


Now it recalls the Harris/Fraser STM even more - it also uses a helping technique. Although their algorithm is only lock-free (not wait-free), because if you are using CAS, well, it's difficult to end up without cycles.
Hmm... and their algorithm from "A Practical Multi-Word Compare-and-Swap Operation" has been in the public domain since 2002 or so...

jimdempseyatthecove
Honored Contributor III

>>A thread switch on modern OSes takes around 1,000 cycles (not counting cache draining, etc.), and a CAS on a non-cached location takes 200-500 cycles. So it will not be very easy to justify eliminating the possibility of thread blocking at the cost of 2 additional CASes on the fast path...

A thread wants a resource protected by a semaphore; assume CAS is used and the thread needs to execute for 10 cycles before release using a lazy write of the semaphore. So the execution takes 210-510 cycles in the general case. Occasionally, the O/S will intercede in the thread holding the lock before the release. When this happens: +1,000 cycles to context switch the lock-holding thread out and schedule the interloper thread, +1 ms or longer (4,000,000 cycles) to run the interloper thread, +1,000 cycles to get the lock-holding thread running again (assuming only one interloper thread).

The costs to the threads not holding the lock will be just as high. On an 8-core system, 7 of the cores could end up wasting 1 ms each of the computational resources of the system (the other core is running the 1 ms of the interloper thread).

When the frequency of this interaction is low, it might be tolerable to code using locks, as the average throughput will tend to be higher. However, when the worst-case scenario is not tolerable, you must code using a wait-free technique (or code such that the locks are never or seldom required).

Jim
