I recently switched from Microsoft's PPL to TBB for Linux compatibility. The PPL library only provides an STL-like way to pass a lambda function to parallel_for(). The TBB library, however, offers more ways to write a parallel_for() task. For example, I've tried three ways to run a task:
1, parallel_for() with blocked_range() and a body function;
2, parallel_for() with first, last iterator and a lambda function;
3, parallel_for() with first, last iterator and a body function.
Is there any reference that explains the differences between them?
Furthermore, I use cancel_group_execution() to terminate the task when it reaches a specific state, and what I found astonished me. The code is uploaded as an attachment, and its output looks something like this:
MSVC14, x64, Release under Windows 10:
parallel_for() with blocked_range() & body function:
cancel_group_execution() called.
is_cancelled() true.
steps = 1000007, after cancel_group_execution()
steps = 1000007, after parallel_for()

parallel_for() with first, last indicator & lambda function:
cancel_group_execution() called.
is_cancelled() true.
steps = 1031723, after cancel_group_execution()
steps = 1562496, after parallel_for()

parallel_for() with first, last indicator & body function:
cancel_group_execution() called.
is_cancelled() true.
steps = 1000000, after cancel_group_execution()
steps = 1562496, after parallel_for()
The output in debug mode is much the same, except that the program takes a bit longer to finish.
When parallel_for() is used with first/last iterators, the task keeps running for a while after cancel_group_execution() is called. Using parallel_for() with blocked_range() seems to work much better than the other two forms.
So I have three questions:
1, what is the difference between using blocked_range() and first/last iterators, and between a body function and a lambda function? The latter pair seems to make no difference, but the former does.
2, are there any mistakes in the code I've uploaded?
3, if the code is right, why does the runtime behavior differ between the different usages of parallel_for() with cancel_group_execution()?
There is no implementation difference between these parallel_for APIs. In fact, the "first, last, lambda" and "first, last, body" calls use the same API function, as a lambda is just a way to specify a body functor. And the implementation simply redirects both of them to the range-based parallel_for.
In your code I do not see any mistakes related to the use of TBB. As far as I can tell, the behavior of the three tests is almost equivalent (see below); I would only suggest making the actual body function the same (called with different parameters) to make the equivalence obvious. Calling the combine() method of tbb::combinable at each iteration would be bad for performance in real code, but for the sake of testing it is fine. One small nit is that you compute the "sums" before calling cancel_group_execution() but print it as "steps after cancel_group_execution()", which is not quite correct. Another is that you declared N for the loop size at the beginning but ended up not using it.
I see two possible reasons for the observed difference in behavior. First, TBB initializes threads lazily, i.e. only when some tasks are available. So the very first call (in your case, the range-based one) initiates thread creation, and by the time the worker threads are ready to start the job, the calling thread might have already executed it up to the cancellation point. So it's possible there was no actual parallelism in the first call, and you see it as if the cancellation was immediate. But for the second and third calls, the worker threads already existed and could take some tasks and execute some iterations.
Another, likely more important difference is that when you cancel the task you immediately return from the body/lambda, but the effect of this return is different. For the range-based form, the return skips processing the remainder of the range, while for the iterator-based API it only stops the current iteration. A running TBB task, however, usually executes multiple iterations, and it will not stop until all of them are done, because checking the cancellation state between iterations would hurt the performance of code that does not use cancellation. Thus your second and third calls will keep calling the body/lambda. If you used the same processing function all three times and only returned from it (as I suggested above), you would get the same behavior in all three cases (barring the thread creation difference I mentioned).
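The effect of the early return can be modeled without TBB at all. The sketch below is a serial, simplified stand-in for what happens inside one task: Chunk, steps_range_form and steps_index_form are hypothetical names, the plain bool plays the role of the cancellation state, and the plain for loop in the index form stands for the library-side loop that calls the body once per index.

```cpp
#include <cstddef>

// Hypothetical stand-in for the sub-range handed to one task.
struct Chunk {
    std::size_t begin;
    std::size_t end;
};

// Range form: the user body owns the loop, so "return" leaves the whole chunk.
std::size_t steps_range_form(Chunk chunk, std::size_t cancel_at) {
    std::size_t steps = 0;
    auto body = [&steps, cancel_at](Chunk r) {
        for (std::size_t i = r.begin; i != r.end; ++i) {
            ++steps;
            if (i == cancel_at) return;  // skips the remainder of the chunk
        }
    };
    body(chunk);
    return steps;
}

// Index form: the loop lives outside the user function, so "return" ends
// only the current iteration and the next index is still processed.
std::size_t steps_index_form(Chunk chunk, std::size_t cancel_at, bool check_flag) {
    std::size_t steps = 0;
    bool cancelled = false;  // models the cancellation state
    auto f = [&](std::size_t i) {
        if (check_flag && cancelled) return;  // optional early-out per iteration
        ++steps;
        if (i == cancel_at) {
            cancelled = true;
            return;  // ends this iteration only
        }
    };
    // Models the library-side loop: no cancellation check between iterations.
    for (std::size_t i = chunk.begin; i != chunk.end; ++i) f(i);
    return steps;
}
```

For a chunk of 100 indices with cancellation at index 9, the range form counts 10 steps and the unchecked index form counts all 100. Passing check_flag = true, i.e. testing the cancellation state at the top of every iteration, brings the index form back to 10 steps, at the cost of one extra check per iteration.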
Hope this helps. By the way, TBB provides a compatibility header for MS PPL that adds class/function names to namespace Concurrency to simplify migration; see include/tbb/compat/ppl.h and the documentation at https://www.threadingbuildingblocks.org/docs/help/reference/appendices/ppl_compatibility.html