I use tbb::pipeline to run statistics processing in parallel.
The code looks like:
pipeline.add_filter(st1); // pick task day to process
pipeline.add_filter(st2); // actual processing, parallel
pipeline.add_filter(st3); // reduce, seq processing
st1 - very fast stage
st2 - really slow, about 1 second per run on my hardware
st3 - 0.2...0.3 seconds per run
I added debug points to every filter to collect timestamps while tasks execute.
In the test run there were 3 days to process,
with a maximum of 4 tokens in flight in the pipeline,
on a Pentium 4 D with HT (2 threads),
Linux 2.6.18 kernel, Debian etch.
calcs is my 2nd stage,
and reduce the 3rd.
A value of 0.2 means the stage was performed by 2 threads, 0.1 means only one thread.
So I wonder why TBB does not run reduce for the 2nd day immediately after reduce for the 1st day is done.
It waits until all filters are done, and only then picks a task to execute.
That seems strange to me.
Is there any way to run a stage as soon as the previous stage is done, to maximize CPU usage?
I would guess that at the second stage, day 3 was processed before day 2. But the third stage, which is ordered, can't start "reducing" day 3 before day 2. I think you could easily check this guess by extending the information collected at the debug points with some data-specific info (e.g. the number of the day being processed).
As far as I understand, hyperthreads aren't quite "even"; one of the threads only gets a chance to execute when some processor units aren't used by the other one. So I wouldn't be surprised if in your case the main thread started to process day 1, the TBB worker thread took day 2 but made slow progress (due to HT), then the main thread completed day 1, took day 3, and, having a kind of priority on the processor resources, completed day 3 before the worker thread finished day 2.
If that's the case, I wonder whether adding a pause/yield point right before taking a new token from the pipeline would help the second thread gain priority on processor resources and complete its job earlier.
I believe that adding yield or pause operations to the main thread won't help. The OS does not distinguish between logical CPUs in HT systems (at least that was the case some time ago). Therefore, when the main thread relinquishes its time quantum, the system will see that another thread is already working, and so it will resume the main thread. During all this time the processing will be happening in the same (main) pipeline of the CPU, and so the second thread will remain in the secondary (low-priority) CPU pipeline.
I think the problem could be solved by increasing the maximal number of tokens in flight. E.g. if the hyper-thread works at 15% of the main one's speed, then 7 or 8 tokens will assure an acceptable balance. You could play with the number of tokens in the range 6-15 and find the value resulting in the maximal throughput.