static auto_partitioner ap;
So for 2-3 the problem is gone, but for 5, 6, 7 it still exists. And I lose the speedup factor of 7-8 at nThreads=8, 9.
And anyway, why shouldn't I use a grain size here?
How big should my tasks be to justify the other approach? One task has about 1000 instructions, so one portion of about 500 tasks has about 500,000 instructions. Is that small?
In the attachment you can see all the data. This is for the first case described in #1. From it I suppose I can just forget about HT, so I didn't check it for the case without grain size.
Now, for the case without grain size, I see that with 14-16 threads I really do get a speedup of about 7. So you were right :).
It would be interesting to hear an explanation of such a difference between these two cases... But that is not the point now.
The results in #3 were just fluctuations.
I collected statistics and got more or less normal results (see tbbtime_1auto_fit1000_re1.png in the attachment).
I also tried to write my own procedure, which doesn't use the standard tbb::range and can divide the work into any number of parts, not only 2^N (see parallel_for_simple.h in the attachment). The results are a little worse than with auto_partitioner (tbbtime_new_fit10_re100.png). But when I pinned each thread to a fixed CPU, I got a very good and explainable dependence (tbbtime_new_fixthr_fit10_re100.png).
So, Dmitrij, Raf, thank you for your help and the interesting discussion.
But there is one problem. While writing the new procedure I ran into the following: when I do
TpfTask *a = new( allocate_child() ) TpfTask(currentTask,currentTask+portSize1-1,fBody);
cout << a << endl;
if (a) delete a;
I get the following:
*** glibc detected *** tbbVc: free(): invalid pointer: 0x00007f7373573940 ***
Is it that pointers allocated with allocate_child() must not be deleted by hand, or what else could the problem be here?