Intel® oneAPI Threading Building Blocks

Cancellation performance

RafSchietekat
Valued Contributor III

Just wondering academically: is it more efficient to use an atomic (relaxed, of course), task cancellation, or both together, to stop a simple parallel_for loop?

With an atomic, cancellation inside the loop Bodies can take effect almost immediately (unless the code deliberately loads the atomic only once every x iterations), but parallel_for will still start a Body invocation for every chunk.

With task cancellation, the parallel_for is stopped before spawning more tasks, but cancellation propagation takes longer than coherent-cache chatter, and has to run its course once started.

Assume that the parallel_for is simple (not a lot of code to touch for an atomic, no recursive parallelism), and has coarse granularity and/or uses auto_partitioner (not that many chunks). Think quick_sort_pretest_body in include/tbb/parallel_sort.h, or similar with a lambda (even less coding for that atomic).
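
For concreteness, here is a minimal sketch of the two variants I have in mind (the predicate found_what_we_wanted and the function names are just placeholders, not real code from anywhere):

    #include <atomic>
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <tbb/task_group.h>

    // Placeholder for whatever condition makes the loop want to stop early.
    inline bool found_what_we_wanted(int i) { return i == 123456; }

    // Variant 1: a relaxed atomic flag, polled by every Body invocation.
    void atomic_variant(int n) {
        std::atomic<bool> stop{false};
        tbb::parallel_for(tbb::blocked_range<int>(0, n),
            [&](const tbb::blocked_range<int>& r) {
                for (int i = r.begin(); i != r.end(); ++i) {
                    if (stop.load(std::memory_order_relaxed)) return;   // near-immediate exit
                    if (found_what_we_wanted(i)) stop.store(true, std::memory_order_relaxed);
                }
            });   // note: a Body still starts for every chunk, it just returns quickly
    }

    // Variant 2: explicit task cancellation through a task_group_context.
    void cancellation_variant(int n) {
        tbb::task_group_context ctx;
        tbb::parallel_for(tbb::blocked_range<int>(0, n),
            [&](const tbb::blocked_range<int>& r) {
                for (int i = r.begin(); i != r.end(); ++i)
                    if (found_what_we_wanted(i)) { ctx.cancel_group_execution(); return; }
            },
            tbb::auto_partitioner(), ctx);
        // Chunks that have not started yet are skipped after cancellation; chunks already
        // running finish unless they also poll ctx.is_group_execution_cancelled().
    }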

My intuition says that there's little benefit, if any, from task cancellation in such a situation... or that it may even be a net loss.

jimdempseyatthecove
Honored Contributor III

Raf,

Are you most interested in the efficiency of a) the thread/task performing the cancellation, or b) the task being side-swiped by the cancellation?

I do not know if this feature has been added to TBB, but one might consider an iterator with a cancellation property. The iterator would have an isCanceled() member function that could be tested in next() and operator++(), tested in user code before issuing a new level of parallelism, and polled periodically in compute sections of the code. Essentially, you place the atomic "cancel" variable/flag/exception inside the iterator.
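
Something along these lines, purely illustrative (not an existing TBB type; the names are made up):

    #include <atomic>
    #include <cstddef>

    // Hypothetical "cancelable" index iterator carrying a shared cancel flag.
    class CancelableIndexIterator {
    public:
        CancelableIndexIterator(std::size_t pos, std::atomic<bool>* cancel)
            : pos_(pos), cancel_(cancel) {}

        // Polled by operator++() and by user code before spawning more parallelism.
        bool isCanceled() const { return cancel_->load(std::memory_order_relaxed); }
        void cancel() const { cancel_->store(true, std::memory_order_relaxed); }

        CancelableIndexIterator& operator++() { ++pos_; return *this; }
        std::size_t operator*() const { return pos_; }

        // A canceled iterator compares equal to everything, so loops terminate early.
        bool operator!=(const CancelableIndexIterator& other) const {
            return !isCanceled() && pos_ != other.pos_;
        }

    private:
        std::size_t pos_;
        std::atomic<bool>* cancel_;
    };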

However you do this, you will add overhead. My preference is that the overhead be left to the discretion of the programmer.

Jim Dempsey

RafSchietekat
Valued Contributor III

Paragraph by paragraph, and an afterthought:

I'm not sure how that helps. If propagation takes time, it will also affect how soon the tasks to be cancelled are reached. And the cancelled tasks cannot be abandoned; they will have to be joined. So latency and aggregate delay seem to be interlinked?

Nice idea, and it could also be self-contained in a developer-provided Range, with is_divisible() also requiring that the cancellation hasn't happened yet, which shouldn't be too confusing.
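
Something like this, just to make the idea concrete (a hypothetical wrapper around blocked_range; none of this exists in TBB, and the names are made up):

    #include <atomic>
    #include <cstddef>
    #include <tbb/blocked_range.h>

    // Hypothetical Range: behaves like blocked_range, except that is_divisible()
    // also requires that nobody has raised the shared cancel flag, so an algorithm
    // already in flight stops producing new subranges once cancellation is requested.
    template <typename Value>
    class cancellable_range {
    public:
        cancellable_range(Value first, Value last, std::atomic<bool>& cancel,
                          std::size_t grainsize = 1)
            : range_(first, last, grainsize), cancel_(&cancel) {}

        cancellable_range(cancellable_range& other, tbb::split s)   // splitting constructor
            : range_(other.range_, s), cancel_(other.cancel_) {}

        bool cancelled() const { return cancel_->load(std::memory_order_relaxed); }
        void cancel() const { cancel_->store(true, std::memory_order_relaxed); }

        bool empty() const { return range_.empty(); }
        bool is_divisible() const { return !cancelled() && range_.is_divisible(); }

        Value begin() const { return range_.begin(); }
        Value end() const { return range_.end(); }

    private:
        tbb::blocked_range<Value> range_;
        std::atomic<bool>* cancel_;
    };

The Body would presumably still poll cancelled() between iterations; the Range only guarantees that no further splitting happens once the flag is up.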

I don't see any remaining overhead? Well, except for adding the special range type...

Meanwhile, if the choice is just between task cancellation and a cancellation atomic in the described situation, it still seems more efficient to let the hardware communicate the cancellation, and to assume that threads with work from the loop can speed through the O(log n) tasks normally required to execute it. Chances are that few if any threads have stolen work if the system was fairly busy (which is when efficiency counts most), so why bother sequentially pushing the cancellation to all arenas? Hmm, still not easily decided without actually testing, which seems like a big effort for a mostly academic question...

jimdempseyatthecove
Honored Contributor III

>> it still seems more efficient to let the hardware communicate the cancellation

How? TBB is not pthreads, where you could terminate a thread. The cancelable iterator with isCanceled() would provide for graceful termination.

Jim


RafSchietekat
Valued Contributor III

I acknowledged that ("Nice idea", with a way for a developer to quickly terminate existing algorithms by just disabling further Range splitting), but "Meanwhile" was just continuing the original, more limited question (without a fancy new Range). By "hardware" I meant the coherent cache carrying the atomic (relaxed access, probably best in a cache line of its own, but that approach still has to start all the chunks), as opposed to explicit cancellation propagation by way of the task group context tree. Sorry for not making that clearer.

I only just realised that there's an optimisation that avoids propagation if the context has no children. So the solution in parallel_sort.h would in fact be optimal. Sorry again!

I thought it was interesting, but probably not worth a forum thread. Maybe I should just go explore something entirely new instead. :-)
