Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Out-of-order execution and semaphores question

jacquesbas
Beginner

I have a problem with an application. My application does multithreaded tree search. The entire tree is shared by all threads, and I use a semaphore at each node to synchronize access to the data stored in that node. The semaphore is a single byte that may contain 1 = free or 0 = busy.

The semaphore's acquire code is:

mov al, 1                       // AL = expected value: 1 = free
mov dl, 0                       // DL = new value: 0 = busy
lea esi, node.semaphore_byte
lock cmpxchg [esi], dl          // if [esi] == AL, store DL and set ZF
jnz xxx                         // branch taken when the semaphore is busy
// Semaphore was free, is busy now.
mov ebx, [some address in the node]

To free the semaphore:

lea esi, node.semaphore_byte
lock inc byte ptr [esi]         // 0 (busy) -> 1 (free)
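
For reference, here is a minimal C++ sketch of the same acquire/release pattern using MSVC compiler intrinsics rather than inline assembly (an illustration under that toolchain assumption, not the poster's code):

#include <intrin.h>

// Acquire: atomically swap 1 (free) -> 0 (busy); returns true on success.
// _InterlockedCompareExchange8 returns the value it found at *sem;
// finding 1 means the semaphore was free and has now been set to 0.
bool try_acquire(volatile char* sem)
{
    return _InterlockedCompareExchange8((char volatile*)sem, 0, 1) == 1;
}

// Release: publish 1 (free) with a locked exchange.
void release(volatile char* sem)
{
    _InterlockedExchange8((char volatile*)sem, 1);
}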

In the manual "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1", subsection "7.1.2.2 Software Controlled Bus Locking", the following note appears:

" NOTE Do not implement semaphores using the WC memory type. Do not perform non-temporal stores to a cache line containing a location used to implement a semaphore."

I guess "normal" RAM allocated by a GlobalAlloc() Windows API call is not write-combining (WC), so the first part should not be a problem.

I don't understand the second part: "Do not perform non-temporal stores to a cache line containing a location used to implement a semaphore." Does that mean the semaphores must be stored in 64-byte cache lines containing nothing else? (Since my node size is 64 bytes, that would double the RAM usage.) Is my semaphore code correct? (It works "almost always".) I may have out-of-order RAM access problems, or perhaps I have to hunt the bug somewhere else ...
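
For what it's worth, a non-temporal (streaming) store is one that bypasses the cache, such as the SSE2 intrinsic below; the manual's note forbids aiming such stores at the semaphore's cache line. A minimal illustration (hypothetical, not code from the question):

#include <emmintrin.h>  // SSE2

// _mm_stream_si32 writes toward memory, bypassing the cache hierarchy.
// This is the kind of store that must NOT target the cache line holding
// a semaphore byte.
void non_temporal_store_example(int* p)
{
    _mm_stream_si32(p, 42);  // non-temporal store of the value 42
}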

Please, help.

1 Solution
jimdempseyatthecove
Honored Contributor III

Because in this case your node does not span cache lines, the sfence can be removed (provided no non-temporal stores are performed).

However, at a later date, should you extend the node size to span cache lines, or should your lock code (not shown) double-lock (lock a secondary node), then the sfence may be required on some platforms. Some platforms == multi-socket systems. The cache coherency system will maintain coherency within a cache line, but it will not necessarily maintain the write sequence order of multiple cache lines across sockets. (Someone at Intel may contradict this, but you should also consult the engineers at AMD.)

The presence or absence of sfence will depend on the code enclosed in the protected region. Examine your code and leave copious cautionary comments above and below the protected code section for future reference.

If the code permits sfence to be removed, then by all means remove the sfence.
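
If the sfence does turn out to be needed (node spanning cache lines on a multi-socket system), here is a minimal sketch of where it would sit in the release path, assuming a plain (lazy) releasing store and MSVC intrinsics (hypothetical names):

#include <intrin.h>

// Release path for a protected region whose writes span multiple cache lines.
// The sfence orders those data stores ahead of the store that marks the
// semaphore free, per the multi-socket caveat above.
void release_multi_line(volatile char* sem)
{
    _mm_sfence();   // order the preceding data stores ahead of the release
    *sem = 1;       // plain store marks the semaphore free
}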

Also, as an additional optimization, see if you can combine the release of the semaphore with the last write of adjacent data (both/all in the same aligned word, dword, qword, or dqword), as in the sketch below.
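
A sketch of that combined release, under the assumption that the semaphore byte shares one aligned dword with the last piece of adjacent data (the layout and names are hypothetical):

#include <stdint.h>

// The semaphore byte and a 3-byte status field packed into one aligned dword.
struct alignas(4) Tail
{
    volatile uint32_t word;  // byte 0: semaphore (1 = free), bytes 1-3: status
};

// One aligned 32-bit store publishes the final status bytes and releases the
// semaphore at the same time (aligned dword stores are atomic on IA-32/Intel 64;
// this relies on MSVC's x86 volatile-store ordering).
void release_with_data(Tail& t, uint32_t status24)
{
    t.word = (status24 << 8) | 1u;  // low byte = 1 marks the semaphore free
}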

If possible, run stress tests on various platforms with sanity checks in your test application but not in the code under test. The routine being tested should be "as it will be used in the release version" and must not contain the sanity checks, as these would affect the running characteristics of the code.

Jim Dempsey

28 Replies
Dmitry_Vyukov
Valued Contributor I
Quoting - jimdempseyatthecove

A thread that wants a resource protected by a semaphore (assume CAS is used) needs to execute for 10 cycles before releasing it with a lazy write of the semaphore, so the execution takes 210-510 cycles in the general case. Occasionally, the O/S will preempt the thread holding the lock before the release. When this happens: +1,000 cycles to context-switch the lock-holding thread out and schedule the interloper thread, +1 ms or longer (4,000,000 cycles) to run the interloper thread, +1,000 cycles to get the lock-holding thread running again (assuming only one interloper thread).

The costs to the threads not holding the lock will be just as high. On an 8-core system, 7 of the cores could end up wasting 1 ms each of the computational resources of the system (the other core is running the 1 ms of the interloper thread).

I don't get how the 4,000,000 wasted cycles appeared.
While a thread is preempted inside a critical section, other threads can (1) do some useful work (no wasted cycles) or (2) execute yield() while waiting for the mutex (~1,000 wasted cycles). Yield() will schedule another thread, which in turn can do (1) or (2). So if there is at least one processor not loaded with useful work, the preempted thread will be scheduled again. So the maximum we can waste is N*1,000 cycles, where N is bounded by (number of threads - number of processors).
The worst-case scenario looks like:
Thread 1 is executing -> thread 1 is preempted inside the critical section -> thread 2 is executing -> thread 2 yields (1,000 cycles) -> thread 3 is executing -> thread 3 yields (1,000 cycles) -> ... -> thread N is executing -> thread N yields (1,000 cycles) -> thread 1 is executing again


jimdempseyatthecove
Honored Contributor III

The 4,000,000 cycles appeared because the programmer, wanting the lowest programming latency, was:

a) not using _mm_pause, monitor, SwitchToThread, Sleep(0), or some other technique - burning power.
b) using _mm_monitor on the semaphore to suspend the thread and save power, without a context switch, until a write to the semaphore by the other thread (or an O/S context switch at the end of the time slice or at pre-emption)
c) using _mm_pause to reduce power consumption, without a context switch, until a write to the semaphore by the other thread (or an O/S context switch at the end of the time slice or at pre-emption)

I was pointing out that under some circumstances, by programming for the lowest latency, the programmer may inadvertently be setting themselves up to experience the longest latency. The spin-wait loop should be coded to include a yield (SwitchToThread, Sleep(0), ...) after a short period of failed low-latency examination of the semaphore; a sketch of such a loop follows below. Most runtime libraries take this type of action (try n times, then yield). When programmers write their own routine, they might overlook the possibility of a context switch occurring inside their critical section.
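
A minimal sketch of such a try-n-times-then-yield loop, assuming Windows/MSVC and the 1 = free / 0 = busy convention from the question:

#include <windows.h>
#include <intrin.h>

void acquire(volatile char* sem)
{
    for (;;)
    {
        // Low-latency phase: a bounded number of polite spins.
        for (int i = 0; i < 100; ++i)
        {
            if (*sem == 1 &&  // cheap non-locked peek before the locked RMW
                _InterlockedCompareExchange8((char volatile*)sem, 0, 1) == 1)
                return;       // got it: 1 (free) -> 0 (busy)
            _mm_pause();      // ease pipeline and power while spinning
        }
        // Still busy after the spin budget: yield so the O/S can run
        // someone else - possibly the preempted lock holder.
        SwitchToThread();
    }
}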

It would be interesting to run a test to actually measure the assumed 1,000 cycles. To measure the SwitchToThread overhead I would suggest the following test:

Select a good single-core (no HT) processor and run:

while(!done)
{
    Lock critical section
    add 1 to count (not using InterlockedAdd)
    Unlock critical section
}

The done flag is set in a ^C handler.

Then run the test with differing thread counts.

Once you have the baseline context-switch overhead value, run the test on 2-core, 4-core, 6-core, 8-core, 2-socket, 4-socket, ... systems under differing loads of non-test program threads (PKZIPping a large folder, MPEG-encoding a raw audio file, defragmenting the disk).
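
A runnable sketch of that counter-loop harness, assuming Windows/MSVC (the thread-count sweep and the charting of final counts are left out for brevity):

#include <windows.h>
#include <stdio.h>

CRITICAL_SECTION g_cs;
volatile LONG g_done = 0;
unsigned __int64 g_count = 0;

BOOL WINAPI CtrlHandler(DWORD type)  // ^C sets the done flag
{
    if (type == CTRL_C_EVENT) { g_done = 1; return TRUE; }
    return FALSE;
}

DWORD WINAPI Worker(LPVOID)
{
    while (!g_done)
    {
        EnterCriticalSection(&g_cs);   // Lock critical section
        ++g_count;                     // add 1 to count (not InterlockedAdd)
        LeaveCriticalSection(&g_cs);   // Unlock critical section
    }
    return 0;
}

int main()
{
    const int nThreads = 2;            // vary this per the test plan above
    HANDLE threads[64];
    InitializeCriticalSection(&g_cs);
    SetConsoleCtrlHandler(CtrlHandler, TRUE);
    for (int i = 0; i < nThreads; ++i)
        threads[i] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    WaitForMultipleObjects(nThreads, threads, TRUE, INFINITE);
    printf("count = %llu\n", g_count);
    DeleteCriticalSection(&g_cs);
    return 0;
}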

Jim
jimdempseyatthecove
Honored Contributor III

*** moderator: I wanted to edit my last post. After editing and clicking on Save I got:
DB ERROR: Cannot add or update a child row: a foreign key constraint fails ...

I wanted to add that the final values of the counter in the counting loop should be charted, and also that a non-threaded loop that omits the Lock/Unlock should be added.

Jim
Dmitry_Vyukov
Valued Contributor I

Quoting - jimdempseyatthecove

The 4,000,000 cycles appeared because the programmer, wanting the lowest programming latency, was:

a) not using _mm_pause, monitor, SwitchToThread, Sleep(0), or some other technique - burning power.
b) using _mm_monitor on the semaphore to suspend the thread and save power, without a context switch, until a write to the semaphore by the other thread (or an O/S context switch at the end of the time slice or at pre-emption)
c) using _mm_pause to reduce power consumption, without a context switch, until a write to the semaphore by the other thread (or an O/S context switch at the end of the time slice or at pre-emption)

I was pointing out that under some circumstances, by programming for the lowest latency, the programmer may inadvertently be setting themselves up to experience the longest latency. The spin-wait loop should be coded to include a yield (SwitchToThread, Sleep(0), ...) after a short period of failed low-latency examination of the semaphore. Most runtime libraries take this type of action (try n times, then yield). When programmers write their own routine, they might overlook the possibility of a context switch occurring inside their critical section.

Well, OK, I just have to rephrase my initial statement: if a lock-free algorithm issues more than one CAS (atomic RMW), then one is better off using a plain spin-mutex with a correctly implemented passive spin.
Anyway, there are many more things which require extreme attention and must not be overlooked in a lock-free algorithm :)

Dmitry_Vyukov
Valued Contributor I
Quoting - jimdempseyatthecove

It would be interesting to run a test to actually measure the assumed 1,000 cycles. To measure the SwitchToThread overhead I would suggest the following test:

Select a good single-core (no HT) processor and run:

while(!done)
{
    Lock critical section
    add 1 to count (not using InterlockedAdd)
    Unlock critical section
}

The done flag is set in a ^C handler.

Then run the test with differing thread counts.

Once you have the baseline context-switch overhead value, run the test on 2-core, 4-core, 6-core, 8-core, 2-socket, 4-socket, ... systems under differing loads of non-test program threads (PKZIPping a large folder, MPEG-encoding a raw audio file, defragmenting the disk).


I had measured SwitchToThread() overhead on WinXP in the best case (best in the sense that it yields the lowest overhead), just out of curiosity - to see the order of magnitude. I don't remember all the details of the test (except that it was synthetic), but the number I got was ~700-750 cycles.

jimdempseyatthecove
Honored Contributor III
Quoting - Dmitriy Vyukov
I had measured SwitchToThread() overhead on WinXP in the best case (best in the sense that it yields the lowest overhead), just out of curiosity - to see the order of magnitude. I don't remember all the details of the test (except that it was synthetic), but the number I got was ~700-750 cycles.


Was this with 1 thread or 2 threads?

With 1 thread and no waiting thread, the switch in and out may take a shortcut (i.e., bypass the state save and state restore). With 2 threads on the same HW thread, it would include the state save and state restore.

Although the typical cost of SwitchToThread would be the fast path (no waiting thread), the worst-case figure would be nice to have when choosing how to write your code.

Jim Dempsey
Dmitry_Vyukov
Valued Contributor I
Quoting - jimdempseyatthecove

Was this with 1 thread or 2 threads?

With 1 thread and no waiting thread, the switch in and out may take a shortcut (i.e., bypass the state save and state restore). With 2 threads on the same HW thread, it would include the state save and state restore.

Although the typical cost of SwitchToThread would be the fast path (no waiting thread), the worst-case figure would be nice to have when choosing how to write your code.


AFAIR, it was 2 threads on 1 core constantly switching from one to the other:

void thread_1_and_2_func()
{
    barrier();
    for (int i = 0; i != 10000000; ++i)
        SwitchToThread();
    barrier();
}

So state save/restore was taken into account.

I believe the worst-case cost is determined by cache misses on the thread's data. Hmm... I think it's possible to model this:

int* data1; // thread 1's data
int* data2; // thread 2's data - MUST map to the same locations in cache!

void thread_1_and_2_func()
{
    barrier();
    for (int i = 0; i != 10000000; ++i)
    {
        for (int j = 0; j != DATA_SIZE; ++j)
            load(data1[j]); // thread 2 must load data2[j]
        SwitchToThread();
    }
    barrier();
}


jimdempseyatthecove
Honored Contributor III

>>I believe worst case cost is determined by cache misses on thread's data.

Correct; the same is true for a non-locked peek at a semaphore (e.g., in a mailbox). On a single-socket Core i7 (4-core, 8 HW threads) this is not so bad because of the unified L3 cache: the miss may go from L2 to L3. On a non-unified cache system (multi-socket) the penalty is higher, as you reach out to RAM, or some cache architectures may reach through to the other socket's cache; in either case it takes longer.

The design of your system has to take into consideration whether you are spending most of your time looking for tasks to perform or actually performing the tasks. When actually performing work (a task), you may flush the cache entries for the mailbox (semaphore) out of the cache, and you will take the miss on task completion when you look at the mailboxes for the next task to work on.
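
As a sketch of that non-locked peek at a mailbox semaphore, with the slot padded to its own cache line (the MailBox layout and names are hypothetical illustrations, not code from this thread):

#include <intrin.h>

// Hypothetical mailbox slot padded to a full 64-byte cache line so that
// polling one slot does not false-share with its neighbors.
struct alignas(64) MailBox
{
    volatile char semaphore;  // 1 = free, 0 = busy (convention from this thread)
    char payload[63];         // data guarded by the semaphore
};

// Cheap non-locked peek first; only attempt the bus-locked RMW when the
// peek suggests the semaphore is free. The peek costs at most a cache miss.
bool try_acquire(MailBox& mb)
{
    if (mb.semaphore != 1)
        return false;  // plain read, no lock prefix
    return _InterlockedCompareExchange8((char volatile*)&mb.semaphore, 0, 1) == 1;
}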