Intel® oneAPI Threading Building Blocks

More bugs in concurrent_queue

Dmitry_Vyukov
Valued Contributor I
TBB 2.1 (20080825)

I've finally applied Relacy Race Detector to concurrent_queue's signaling mechanism (the __TBB_NO_BUSY_WAIT_IN_CONCURRENT_QUEUE + _WIN32 version), and it revealed two interesting issues.

I. The n_waiting_consumers and n_waiting_producers variables are never decremented. So once even a single blocking wait has happened, every subsequent signaling operation executes a CRITICAL_SECTION lock/unlock and a SetEvent() kernel call. And signaling is on the fast path of both push and pop operations!
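
For illustration, the signaling side presumably boils down to something like the sketch below. Only the variable names are taken from the real code; the surrounding structure is made up and is not the actual TBB source. Because n_waiting_consumers only ever grows, the test stays true forever once a single pop has blocked:

#include <windows.h>

// Illustrative sketch only -- not the TBB implementation.
struct queue_rep {
    LONG             n_waiting_consumers;   // incremented but never decremented (the bug)
    CRITICAL_SECTION mtx_items_avail;
    HANDLE           var_wait_for_items;    // wakeup event for blocked consumers
};

void signal_items_available( queue_rep& r ) {
    if( r.n_waiting_consumers > 0 ) {        // true forever after the first blocked pop
        EnterCriticalSection( &r.mtx_items_avail );
        SetEvent( r.var_wait_for_items );    // kernel call on what should be the fast path
        LeaveCriticalSection( &r.mtx_items_avail );
    }
}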

II. Consider the following code, which is at the heart of the blocking/signaling mechanism:
for( ;; ) {
    LeaveCriticalSection( &r.mtx_items_avail );
    WaitForSingleObject( r.var_wait_for_items, INFINITE );
    EnterCriticalSection( &r.mtx_items_avail );
    if( r.n_consumers_to_wakeup > 0 && r.consumer_wait_generation != my_generation )
        break;
}

If the expression in the 'if' statement evaluates to 'false', this cycle basically degenerates into a busy loop: the thread constantly acquires the critical section, checks the event's state (which is signaled), releases the critical section, and so on, eating its whole time slice. I recommend at least inserting a Sleep(0) call when the 'if' test fails.
Wooyoung_K_Intel
Employee

Thank you very much for spotting the bug. The fix will be in the next developer release.

As for 2, you are correct that for some threads it is a busy-wait loop. But adding Sleep(0) has its own problem: if we add it, the thread would be inside the critical section at the point where it is about to call Sleep(0), and you don't want a thread to yield the processor while it is inside the critical section. Besides, the code is on the slow path. And, in order for a thread to reach the 'if' statement, it has to wait until it enters the critical section. So I think the chance that a thread repeats the execution of the loop body without going to sleep, due to the 'if' statement evaluating to false, seems very small...

best regards,

Dmitry_Vyukov
Valued Contributor I
MADwkim3:

As for 2, you are correct that for some threads it is a busy-wait loop. But adding Sleep(0) has its own problem: if we add it, the thread would be inside the critical section at the point where it is about to call Sleep(0), and you don't want a thread to yield the processor while it is inside the critical section. Besides, the code is on the slow path. And, in order for a thread to reach the 'if' statement, it has to wait until it enters the critical section. So I think the chance that a thread repeats the execution of the loop body without going to sleep, due to the 'if' statement evaluating to false, seems very small...


Well, it's on the slow path to some degree, but it's hit every time a thread tries to dequeue an item while the queue is empty. And in some scenarios the queue size fluctuates around zero, so this path is hit constantly.

What about "to reach the 'if' statement, it has to wait until it enters the critical section". This thread just now released this critical section, so is has very god chances to re-enter critical section. If there are no concurrent contending threads, then it can spin until end of time slice.

As for sleeping inside the critical section: to avoid that, the loop would have to look like this, with Sleep(0) called only after the critical section has been released:

LeaveCriticalSection( &r.mtx_items_avail );
for( ;; ) {
    WaitForSingleObject( r.var_wait_for_items, INFINITE );
    EnterCriticalSection( &r.mtx_items_avail );
    if( r.n_consumers_to_wakeup > 0 && r.consumer_wait_generation != my_generation )
        break;
    LeaveCriticalSection( &r.mtx_items_avail );
    Sleep(0);
}



Wooyoung_K_Intel
Employee

Certainly that is one way to do it.

And you are correct that some applications may use the concurrent queue such that the queue size remains near 0. But then, most of the threads (consumers) would be waiting at

WaitForSingleObject(r.var_wait_for_items, INFINITE).

When a producer inserts an item, all the consumers will wake up (which I was not able to avoid with the current implementation), and the first one that enters the critical section decrements r.n_consumers_to_wakeup and resets the event. The subsequent threads that enter the critical section see 'false' in the if statement and go back to sleep at the above WaitForSingleObject() call because the event is already reset. So, could you point out what we gain by adding Sleep(0)?

Dmitry_Vyukov
Valued Contributor I
MADwkim3:

And you are correct that some applications may use the concurrent queue such that the queue size remains near 0. But then, most of the threads (consumers) would be waiting at

WaitForSingleObject(r.var_wait_for_items, INFINITE).

When a producer inserts an item, all the consumers will wake up (which I was not able to avoid with the current implementation), and the first one that enters the critical section decrements r.n_consumers_to_wakeup and resets the event.

Not the *first* thread, but the *last* thread from the *previous* generation. And until that moment the event stays signaled, and all threads from the *current* generation will be spinning until the end of their time slices.

Assume that the last thread from the previous generation is preempted by a thread from the current generation. In this situation the event won't be reset for a substantial amount of time.

Or assume that the thread from the previous generation has lower priority than a thread from the current generation. Again, the event won't be reset for a substantial amount of time.


MADwkim3:

The subsequent threads that enter the critical section see 'false' in the if statement and go back to sleep at the above WaitForSingleObject() call because the event is already reset. So, could you point out what we gain by adding Sleep(0)?

The event may not be reset instantly. And in any case it's a busy-spin loop, which must be reported to the OS.


Dmitry_Vyukov
Valued Contributor I

Btw, it may be interesting: this is exactly the kind of bug that was found in the Singularity OS kernel:

ftp://ftp.research.microsoft.com/pub/tr/TR-2007-149.pdf

(page 13, fig 6)

"Violation of the good samaritan property. Under certain cases, this loop results in the thread spinning idly till time-slice expiration... The developers responsible for this code immediately recognized this scenario as a serious bug. In practice, this bug resulted in sluggish I/O behavior during the boot process, a behavior that was previously known to occur but very hard to debug."

That bug was found with the CHESS software verification system. The bug in TBB was found with Relacy Race Detector, which is also able to detect violations of the good samaritan property.


Wooyoung_K_Intel
Employee

I guess WaitForSingleObject() has a call to Sleep(0) or something? And if it does not, it should?

If your application oversubscribes and creates more threads than the number of available cores, then you end up paying a far bigger penalty than the overhead you described here.

And, you seem to know a lot more about Windows than I do. Why can "events" not be reset instantly???

Dmitry_Vyukov
Valued Contributor I
MADwkim3:

I guess WaitForSingleObject() has a call to Sleep(0) or something? And if it does not, it should?

Why? I don't think so. In the non-blocking case, WaitForSingleObject() must be as fast as possible, I think.


MADwkim3:

Why can "events" not be reset instantly???


Because only the *last* thread from the previous generation will reset the event.

Wooyoung_K_Intel
Employee

Why not? It is a good place to do a context switch, as you suggested, because the thread voluntarily notifies the OS of its willingness to yield.

As for the second point, I think we should make an assumption that EnterCriticalSection() is reasonably fair so that threads who come out of their sleep may enter the critical section with equal probability. Otherwise, Sleep(0) does not help at all because it will immediately return if no other thread is in the scheduling queue. Another assumption is that the queue size is near 0, which means there are only occasional pushes. (Otherwise, only occasionally does the slow path in question execute.)

1. If producers come at a well-spaced, regular pace, the number of consumers of the same generation waiting at WaitForSingleObject() is mostly 1. So, I think ResetEvent() would be called fairly quickly.

2. If producers occasionally come in bursts, yes, some consumers wake up only to find that they have to go back to sleep while the signal has not been cleared yet. They will come back and wait to enter the critical section. But most consumers that woke up will break out of the inner for loop (there are many items to pick up). And WaitForSingleObject() is a system call, so other consumers will have a fairly good chance to enter the CS before the unlucky one comes back and tries to re-enter it.

Did I miss something?

Dmitry_Vyukov
Valued Contributor I
MADwkim3:

Why not? It is a good place to do a context switch, as you suggested, because the thread voluntarily notifies the OS of its willingness to yield.

WaitForSingleObject() can be considered a yield only if it actually blocks; otherwise it's just a normal call that must execute as fast as possible. Many programs use WaitForSingleObject() on semaphores or events in such a way that it doesn't block in 99% of cases.
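
For example, a very common pattern (just an illustration, unrelated to the TBB code) is a semaphore whose count tracks the number of queued items. As long as producers keep up, WaitForSingleObject() finds the semaphore already signaled and returns immediately, so it cannot be counted on as a yield point:

#include <windows.h>
#include <queue>

// Illustration only: a trivial semaphore-counted queue. In the common case the
// semaphore count is positive and WaitForSingleObject() does not block at all.
struct sem_queue {
    std::queue<int>  items;
    CRITICAL_SECTION cs;
    HANDLE           sem;   // count == number of items in the queue

    sem_queue() {
        InitializeCriticalSection( &cs );
        sem = CreateSemaphore( NULL, 0, MAXLONG, NULL );
    }
    ~sem_queue() {
        CloseHandle( sem );
        DeleteCriticalSection( &cs );
    }

    void push( int x ) {
        EnterCriticalSection( &cs );
        items.push( x );
        LeaveCriticalSection( &cs );
        ReleaseSemaphore( sem, 1, NULL );   // bump the count
    }

    int pop() {
        // Returns immediately whenever the count is positive, i.e. in the common case.
        WaitForSingleObject( sem, INFINITE );
        EnterCriticalSection( &cs );
        int x = items.front();
        items.pop();
        LeaveCriticalSection( &cs );
        return x;
    }
};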


MADwkim3:

As for the second point, I think we should make an assumption that EnterCriticalSection() is reasonably fair...


It's not fair, nor is it intended to be fair. It's at too low a level; there is simply no such thing as fairness there. We can be sure that any sane low-level mutex allows the thread that just released it to re-acquire it an unbounded number of times.

Fair mutexes are substantially more expensive, and real fairness is needed quite rarely. For example, the order of request processing in a web server must be fair, but that's not the level of OS mutexes, it's the application logic level.



Dmitry_Vyukov
Valued Contributor I

MADwkim3:

...so that threads who come out of their sleep may enter the critical section with equal probability. Otherwise, Sleep(0) does not help at all because it will immediately return if no other thread is in the scheduling queue.


Ok. Assume a uniprocessor system. Two threads are released from the event: one from the previous generation and one from the current generation. The OS decides to schedule the thread from the current generation first. Please describe what will prevent that thread from eating its whole time slice, senselessly bouncing between the mutex and the event.
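
To spell the scenario out step by step (T_prev and T_cur are just names I'm giving the two threads):

// Uniprocessor, two runnable consumers:
//   T_prev -- from the previous generation, the one that must reset the event
//   T_cur  -- from the current generation
//
// 1. A producer calls SetEvent(); both threads become runnable.
// 2. The OS happens to schedule T_cur first.
// 3. T_cur: the 'if' test is false -> LeaveCriticalSection() ->
//    WaitForSingleObject() returns at once because the event is still signaled ->
//    EnterCriticalSection() -> the 'if' test is false again -> ...
// 4. Nothing in step 3 yields, so T_cur spins until its time slice expires;
//    only then does T_prev get to run and reset the event.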

MADwkim3:

1. If producers come at a well-spaced, regular pace, the number of consumers of the same generation waiting at WaitForSingleObject() is mostly 1. So, I think ResetEvent() would be called fairly quickly.

See the example above. There is only one thread in the generation, but the other thread still eats its whole time slice.


MADwkim3:

2. If producers occasionally come in bursts, yes, some consumers wake up only to find that they have to go back to sleep while the signal has not been cleared yet. They will come back and wait to enter the critical section. But most consumers that woke up will break out of the inner for loop (there are many items to pick up). And WaitForSingleObject() is a system call, so other consumers will have a fairly good chance to enter the CS before the unlucky one comes back and tries to re-enter it.



I don't see why threads will necessarily be waiting for the critical section. A thread that has just released the critical section always has a very good chance of reacquiring it.
Also, if threads are oversubscribed to processors (even a little, for example 4 processors and 6 threads), then the 'right' thread may not be scheduled for a very long time, especially since the 'wrong' threads never yield, and especially since a 'wrong' thread can have higher priority.

Dmitry_Vyukov
Valued Contributor I
I've provided some reasons why it's good to insert a Sleep() call into a busy-spinning loop. If you don't agree, can you provide some reasons why it's good NOT to insert a Sleep() call into a busy-spinning loop?
It's clear to me that inserting the Sleep() call is at least no worse than not inserting it.

Wooyoung_K_Intel
Employee

I am sure you know I did not mean 'fair' in the sense of expensive fair mutexes. From the EnterCriticalSection() API documentation on MSDN:

"There is no guarantee about the order in which waiting threads will acquire ownership of the critical section."

I don't object to the idea of putting 'pause/Sleep(0)/SwitchToThread' in a busy-wait loop in general. In fact, I think we ought to add one wherever possible. In this particular case, however, I feel it is not necessary. I think Sleep(0) returns immediately if no other threads are in the scheduling queue. If EnterCriticalSection unfairly favors the thread that just left the critical section, and WaitForSingleObject() returns immediately at no cost if the event is signaled, then does Sleep(0) help at all? And it looks like WaitForSingleObject is a kernel call and takes quite a few cycles even if the event is signaled. So, if a few threads are competing at the door to the critical section, I would assume that with high probability the one that just left the critical section would not be the winner.

Yes, if the platform is oversubscribed and there are more threads than available cores, adding Sleep(0) definitely helps. But then, do you think you get the performance improvement you expect in such circumstances? I would dare to say that using a fat lock around std::queue would be faster, especially when the queue size is always near 0.
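
Just to make that comparison concrete, by a 'fat lock around std::queue' I mean something along these lines (an illustrative sketch only; the class name and the non-blocking pop are arbitrary choices, not TBB code):

#include <windows.h>
#include <queue>

// Every operation takes one coarse lock; there is no signaling machinery at all.
template<typename T>
class coarse_locked_queue {
    std::queue<T>    q;
    CRITICAL_SECTION cs;
public:
    coarse_locked_queue()  { InitializeCriticalSection( &cs ); }
    ~coarse_locked_queue() { DeleteCriticalSection( &cs ); }

    void push( const T& item ) {
        EnterCriticalSection( &cs );
        q.push( item );
        LeaveCriticalSection( &cs );
    }

    // Non-blocking pop: returns false if the queue is empty.
    bool pop_if_present( T& item ) {
        EnterCriticalSection( &cs );
        bool have_item = !q.empty();
        if( have_item ) {
            item = q.front();
            q.pop();
        }
        LeaveCriticalSection( &cs );
        return have_item;
    }
};

(Callers that need a blocking pop would have to add their own waiting on top of this.)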

Dmitry_Vyukov
Valued Contributor I
MADwkim3:

I don't object to the idea of putting 'pause/Sleep(0)/SwitchToThread' in a busy-wait loop in general. In fact, I think we ought to add one wherever possible. In this particular case, however, I feel it is not necessary. I think Sleep(0) returns immediately if no other threads are in the scheduling queue. If EnterCriticalSection unfairly favors the thread that just left the critical section, and WaitForSingleObject() returns immediately at no cost if the event is signaled, then does Sleep(0) help at all?

Consider my example with only 1 processor/core. Or my example with threads with different priorities.

Can you comment on my examples? How will the situation be resolved without a yield?


MADwkim3:

And it looks like WaitForSingleObject is a kernel call and takes quite a few cycles even if the event is signaled. So, if a few threads are competing at the door to the critical section, I would assume that with high probability the one that just left the critical section would not be the winner.

Yes, if the platform is oversubscribed and there are more threads than available cores, adding Sleep(0) definitely helps. But then, do you think you get the performance improvement you expect in such circumstances?

Yes.

Consider the case where the OS decides to schedule the 'right' thread after the 'wrong' thread on the same core. Until the 'wrong' thread calls yield, or its time slice ends, the event will not be reset.

Can you comment on this example? How will the situation be resolved without a yield?


MADwkim3:

I would dare to say that using a fat lock around std::queue would be faster, especially when the queue size is always near 0.




Maybe, but TBB provides only one queue implementation ;)
