Intel® oneAPI Threading Building Blocks

Creating new partitioners

Bartlomiej1
Beginner
Hello,

It's been a while since I've been here... I have the following question: does it make sense to create my own partitioner for tbb::parallel_for? It does not seem to be something that is encouraged...

My problem is as follows: I'm migrating my TBB-based code to the MIC architecture. There is a parallel_for loop, and it would make sense to force some of the iterations to be executed on the same cores. Simplifying the actual code slightly, let us assume we have an array int L[N], where N is typically much higher than the number of threads, and a parallelized loop tbb::parallel_for(1, N, 1, /* lambda function here */); performing some computations. Iterations with the same value of L should be executed by hyper-threads of the same core, to optimize cache affinity. In other words, it is what the fourth argument of the upc_forall function does in UPC... How can I obtain this?

There is tbb::affinity_partitioner, but it seems unlikely that it would guess how to schedule the threads (how would it know about the L[] array???). I have found this post: https://software.intel.com/en-us/blogs/2013/10/31/applying-intel-threading-building-blocks-observers-for-thread-affinity-on-intel but it does not seem helpful either.

Thanks in advance for any help.

Best regards,
Bartlomiej
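To make the pattern concrete, here is a minimal sketch of the loop in question (the array size and the computation are placeholders, not the actual code):

    #include <tbb/parallel_for.h>

    const int N = 10000;  // typically much higher than the number of threads
    int L[N];             // iterations with equal L[i] share data

    void run() {
        // As written today, TBB decides freely which thread, and hence
        // which core, executes each iteration.
        tbb::parallel_for(1, N, 1, [](int i) {
            // ... computation using L[i] ...
        });
    }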
7 Replies
RafSchietekat
Valued Contributor III

That looks like something you might want to do if you know that you have a specific number of hardware threads dedicated to the job at hand, with high-latency bulk communication between them, which sort of goes against the idea of fine-grained adaptive division of work that TBB wants to provide. Do tell if I'm mistaken!

Of course you can still do that, and you don't even have to create a new kind of Range: just make the input blocked_range the domain of the element values of the array (or a transformation of them that is more evenly distributed), use a large enough grainsize, and then in each chunk of the domain handle all elements whose values fall in that chunk. You might also want nontemporal loads of the elements, to avoid bumping those common values out of the cache and so defeating the whole purpose... although that only increases the risk of a data-movement bottleneck, because each chunk needs to read all array elements. Please let us know what you find if you still want to go there.
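A minimal sketch of that scheme, assuming the values in L[] are bounded by some MAX_VALUE (an assumption; process() is a placeholder):

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <tbb/partitioner.h>

    const int N = 10000;
    const int MAX_VALUE = 256;  // assumed upper bound on the values in L[]
    int L[N];

    void run() {
        // Parallelize over the domain of values rather than over indices,
        // so all elements with the same value land in the same chunk and
        // hence on the same thread.
        tbb::parallel_for(
            tbb::blocked_range<int>(0, MAX_VALUE, /*grainsize=*/32),
            [](const tbb::blocked_range<int>& values) {
                // Each chunk scans the whole array and handles only the
                // elements whose value falls inside its part of the domain.
                for (int i = 0; i < N; ++i)
                    if (L[i] >= values.begin() && L[i] < values.end())
                        ; // process(i);  -- placeholder for the real work
            },
            tbb::simple_partitioner());  // respects the chosen grainsize
    }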

However, I would recommend instead to make a collection with all unique values (perhaps with parallel_sort() and parallel_scan()), create a lookup table (with parallel_for()), and then look up the results (also with parallel_for()).

(Added) Well, that's only if the result of the heavy lifting for those element values can be shared, of course.
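Under that assumption, the three phases might look roughly like this (heavy() stands in for the expensive per-value computation; a binary search replaces an explicit lookup table for brevity):

    #include <tbb/parallel_for.h>
    #include <tbb/parallel_sort.h>
    #include <algorithm>
    #include <vector>

    // Stand-in for the expensive computation that depends only on the value.
    double heavy(int value) { return value * 0.5; }

    std::vector<double> compute_by_lookup(const std::vector<int>& L) {
        // 1. Collect the unique values (parallel_sort plus std::unique here;
        //    a parallel_scan-based compaction would also work).
        std::vector<int> uniq(L);
        tbb::parallel_sort(uniq.begin(), uniq.end());
        uniq.erase(std::unique(uniq.begin(), uniq.end()), uniq.end());

        // 2. Build the table: one heavy computation per unique value.
        std::vector<double> table(uniq.size());
        tbb::parallel_for(std::size_t(0), uniq.size(), [&](std::size_t k) {
            table[k] = heavy(uniq[k]);
        });

        // 3. Look up the shared result for every element.
        std::vector<double> out(L.size());
        tbb::parallel_for(std::size_t(0), L.size(), [&](std::size_t i) {
            std::size_t k = std::lower_bound(uniq.begin(), uniq.end(), L[i])
                            - uniq.begin();
            out[i] = table[k];
        });
        return out;
    }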

RafSchietekat
Valued Contributor III

Am I sort of on track here? I don't think I've hit anything on the head yet, but let's see where this leads. The only thing I'm still fairly confident about is that a custom Range is not the simplest solution here.

About affinity_partitioner: that's only useful if you execute the same loop more than once and want the same element to be handled by the same thread, and hence core (unless the operating system plays a trick on you). I don't see any use for it here.
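For completeness, the pattern where affinity_partitioner does pay off is reusing the same partitioner object across repeated executions of the same loop (the per-element work here is a placeholder):

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <tbb/partitioner.h>

    void repeated_sweeps(double* a, std::size_t n, int sweeps) {
        tbb::affinity_partitioner ap;  // must be reused by every execution
        for (int s = 0; s < sweeps; ++s)
            tbb::parallel_for(
                tbb::blocked_range<std::size_t>(0, n),
                [=](const tbb::blocked_range<std::size_t>& r) {
                    for (std::size_t i = r.begin(); i != r.end(); ++i)
                        a[i] *= 0.5;  // placeholder for the per-element work
                },
                ap);  // replays each chunk on the thread that ran it before
    }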

About the post you found: that would be useful to avoid the above-mentioned trick, and it might still be useful here, but it is indeed not sufficient by itself: it only addresses the problem that, for TBB, affinity is with a thread, on the assumption that thread migration is sufficiently rare for that to mean fairly good affinity with a core.

Another idea, though quite costly in preparation, might be to rearrange the data as an array of pairs of index and value, sort it by value, and then run parallel_for() over that. Still, it might save some tuning compared to my first suggestion above (more chunks mean more wasted iterations, fewer chunks mean less parallel slack), and you also wouldn't have to worry about evicting useful data from the cache.
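A sketch of that preparation (names are placeholders): after sorting, equal values are adjacent, so a chunk of the sorted array mostly keeps them on one thread:

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <tbb/parallel_sort.h>
    #include <utility>
    #include <vector>

    void run(const std::vector<int>& L) {
        // Rearrange the data as (value, original index) pairs...
        std::vector<std::pair<int, int>> pairs(L.size());
        for (std::size_t i = 0; i < L.size(); ++i)
            pairs[i] = std::make_pair(L[i], static_cast<int>(i));
        // ...sort by value (the first pair member)...
        tbb::parallel_sort(pairs.begin(), pairs.end());
        // ...and run parallel_for over the sorted array, so identical values
        // mostly end up in the same chunk.
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, pairs.size()),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t k = r.begin(); k != r.end(); ++k)
                    ; // process(pairs[k].second);  -- placeholder work
            });
    }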

Bartlomiej1
Beginner

Dear Raf,

Thanks! Now I understand how the affinity_partitioner works (at least).

And, no, it would not be helpful.

Let me re-phrase my problem in more detail:

(i) We have N iterations to be executed.

(ii) If N is smaller than 60 (the number of cores), they can be assigned to virtually any thread.

(iii) But if N is higher (which is likely), it might be important to assign some iterations to the same core. Please note: CORE, not thread! They can be executed by different hyper-threads, as long as those work on the same core.

Reordering the elements of L[] is not a problem, but it does not seem to help much. Yes, we could use tbb::task_arena on a subset of iterations: an arena for each possible value of L[], with all threads in an arena forced to the same core. Yet I do not like this solution, and you'll probably agree with me.

 

Best regards,

Bartlomiej

PS. On second thought, it would not even be simple to force all threads in an arena to the same core, as we do not have an analog of the "observer" for an arena...


RafSchietekat
Valued Contributor III

I can guess that the shared data is for writing; otherwise it wouldn't matter if it resided on multiple caches. For simplicity, I'll also assume that the work for each element is sequential (so none of it can be stolen).

Then, having elements with the same value executing on the same thread also means they're executing on the same core. One problem there is that there might not be enough separate values to provide enough parallel slack, compared to being able to split those values among different threads on the same core, and I don't see an easy solution for that. Another problem is that, even with data reordering, identical values might end up in different chunks, and therefore possibly in different threads, and therefore possibly executed on different cores, so this might actually involve a custom Range after all. But the first problem might not be present, and the second problem can be overcome.

I also don't think you can do anything with arenas at this point: you can't set affinities, but they would also require you to explicitly divide the work among arenas and then you might as well use basic threads instead of TBB.

Vladimir_P_1234567890

BTW, you can try to use affinity_partitioner plus thread affinity (https://software.intel.com/en-us/blogs/2013/10/31/applying-intel-threading-building-blocks-observers-for-thread-affinity-on-intel), so you can pin workers to hardware threads or cores.
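The observer from that blog post boils down to something like this (a rough sketch assuming Linux and a naive one-worker-per-hardware-thread mapping; the actual post handles the mapping more carefully):

    #include <tbb/task_scheduler_observer.h>
    #include <atomic>
    #include <sched.h>  // Linux CPU affinity

    class pinning_observer : public tbb::task_scheduler_observer {
        std::atomic<int> next_slot;
    public:
        pinning_observer() : next_slot(0) { observe(true); }
        void on_scheduler_entry(bool /*is_worker*/) override {
            // Pin the thread entering the scheduler to one hardware thread;
            // a real implementation would map slots to cores deliberately.
            int slot = next_slot++;
            cpu_set_t mask;
            CPU_ZERO(&mask);
            CPU_SET(slot, &mask);
            sched_setaffinity(0, sizeof(mask), &mask);
        }
    };

    // Usage: create one instance before the parallel work starts, e.g.
    // pinning_observer pinner;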

But I'm not sure that this will help.

--Vladimir

RafSchietekat
Valued Contributor III

Vladimir, can you explain how the partitioner would be used here? My understanding was that it would just work for repeated parallel range executions (for, reduce, ...), but I don't see what it would be able to do here.

Maybe task::set_affinity() would be useful. Of course, that would mean working with tasks for individual elements.

But we're mostly going in blind here, without data on the number of elements, the distribution of values, the amount of work, or the expected savings from affinity. The suggestions should be taken at face value; what counts in the end is the actual measured difference in performance.

Bartlomiej1
Beginner
Thanks for all your helpful responses. Actually, I realized there are other bottlenecks in my project; the cache is probably not the most important one. Still, some improvements for MIC in future TBB versions would be welcome... ;-) Best regards