about the number of grains

kelkch · ‎12-13-2007

I found that parallel_for always tries (when possible) to adjust the grainsize to a smaller value so that the whole range is split into 2^N chunks. Is there a reason it has to work this way? As I understand it, a 4-core processor with Hyper-Threading theoretically offers 8 hardware threads, but that does not mean a program can get all of them; it depends on what the operating system is doing at the time. This causes me no problem, I would just like to know. For simplicity, parallel_for could have taken the primitive approach of generating however many threads the program invokes. Many thanks.

robert-reed · ‎12-14-2007

TBB relies on the OS to tell it how many threads to use, unless the number is specified explicitly when the task_scheduler_init object is created. It creates a pool of that many threads and doesn't vary their number. The scheduler also limits the amount of parallelism to avoid overwhelming the machine (see Arch's comments in another TBB Forum thread for more details). The region-splitting approach mimics the method used in Cilk: it tries to focus individual threads on separate regions of memory so that they don't step on each other and thrash active cache lines across the processors. The range splitting is intended to keep the working set small for each thread and to divide the work so that all available threads stay busy (maximizing load balance).

ARCH_R_Intel · ‎12-18-2007

The number of chunks is not necessarily a power of two. E.g., parallel_for( blocked_range<int>(0,13,1), body ) will create 13 pieces of work. Ideally, you set the grainsize (or let auto_partitioner do it) so that there are many more chunks than processors, which lets the scheduler balance load. The recursive algorithm used by parallel_for does not create all those chunks at once, so having many more chunks does not incur a space penalty.
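To see why the chunk count need not be a power of two, here is a minimal sketch of recursive halving in plain C++. It is not TBB code: `Chunk`, `split_range`, and `chunks_for` are hypothetical names, and the rule it assumes (split at the midpoint while a range holds more than grainsize iterations) is a simplified model of how a blocked_range is divided when no partitioner intervenes. It also collects every chunk eagerly for inspection, which the real parallel_for does not do.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Half-open index range [begin, end); a simplified stand-in for blocked_range.
using Chunk = std::pair<std::size_t, std::size_t>;

// Recursively halve [begin, end) while it holds more than `grainsize`
// iterations, appending each final (unsplit) chunk to `out`.
// Precondition: begin < end and grainsize >= 1.
void split_range(std::size_t begin, std::size_t end,
                 std::size_t grainsize, std::vector<Chunk>& out) {
    if (end - begin <= grainsize) {
        out.emplace_back(begin, end);  // small enough: this is one piece of work
        return;
    }
    const std::size_t mid = begin + (end - begin) / 2;  // assumed midpoint split
    split_range(begin, mid, grainsize, out);
    split_range(mid, end, grainsize, out);
}

// Convenience wrapper: returns all leaf chunks for [begin, end).
// An empty range yields no chunks; a grainsize of 0 is treated as 1.
std::vector<Chunk> chunks_for(std::size_t begin, std::size_t end,
                              std::size_t grainsize) {
    std::vector<Chunk> out;
    if (begin >= end) return out;
    if (grainsize == 0) grainsize = 1;
    split_range(begin, end, grainsize, out);
    return out;
}
```

Under these assumptions, chunks_for(0, 13, 1) produces 13 single-iteration chunks, matching the example above, while a grainsize of 4 produces four chunks of sizes 3, 3, 3, and 4.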

The only thing to watch out for is creating chunks so small that parallel scheduling overheads dominate useful work. Because of the way the TBB scheduler works, the parallel overheads tend to be independent of the number of hardware threads available, so you can tune the chunk size based on 1-thread runs, or let auto_partitioner do it for you.