"But i dont know how to implement separate thread for output filter within pipeline? may i know how to do that?"
tbb_thread and concurrent_queue? (Added) I mean a tbb_thread outside of the pipeline; the last filter would just add the work to the queue for delivery to the tbb_thread.
"If i use pipeline, is it possible to use parallel_for also?"
Nesting parallel_for inside pipeline is a good way of creating parallel slack (opportunities to execute things in parallel), assuming the units of work are sufficient to avoid excessive parallel overhead, which seems to be satisfied from your description.
Note that your pipeline will behave as if stages 1 and 2 are just one stage, if that helps to understand what is going on. In computation-bound work, assumed by TBB, that is probably a good thing, but it also means that any latency in getting more data is going to result in a stalled worker thread, i.e., no background subtraction work being done during that time. I wouldn't be surprised if the solution ended up being several threads, with TBB only used for a parallel_for to do the background subtraction, which it should do very well. (Added) Or reasonably well: I wanted to balance off the gloom with an accolade, but I'm sure that design could still be improved.
"I think if i use parallel_for inside pipeline to read those two image then it will be too overhead for system. is it?"
You would probably read the data once and then let parallel_for process it one range at a time, so you wouldn't be reading the data multiple times (although there would be some inter-cache traffic, but I'm not sure how that could be avoided). The overhead you should be concerned with would be in task creation (don't use a simple_partitioner with a grainsize of one pixel: always process one or several full horizontal lines at once by using a single-dimensional vertically oriented blocked_range), and in the barrier to wind down parallel_for processing (although auto_partitioner should behave well enough for work that is fairly uniformly distributed like this seems to be).
You seem set on waiting inside the third stage, ignoring my advice, don't you?
If you care about the order of the frames, you should only add them to a queue in an ordered serial stage.
Each stage takes a cookie and returns a (potentially different) cookie, typed void* because it is typically a pointer to data. You have to be prepared for more than one data item in flight at any one time (otherwise the pipeline would degenerate into a simple loop), i.e., multiple frames may exist at the same time, but if you pass the correct values in each stage, the ordered-serial output filter will automagically process the frames in the same order as the serial input filter.
The key to providing sufficient processing to keep ahead of the output filter is efficient buffer management and dividing the work efficiently between available threads to minimize the time spent in the background subtraction process, which it sounds like varies in the amount of work based on the input data (presumably rapidly changing frames require more processing to determine what is the background). It also sounds like just making the middle filter parallel did not provide enough processing to accomplish the goal. I go back to an idea I suggested originally, of employing a parallel_for within the middle filter, using frame banding like Raf suggested or some similar process to cleanly divide the work among available threads. This would be an important addition to provide load balancing between the lightly processed frames and the more heavily processed ones--if each frame is represented as a single, indivisible task, there's no opportunity to balance the frames among the threads.
Really? I didn't pick up on that. Guess I don't know what background subtraction really does (I thought each frame would be processed relative to a still reference frame), or why the work would vary significantly from frame to frame.
#14 "i think in this situation i can use pipeline instead of parallel_for."
A pipeline is a great way to divide a big workload, but perhaps not if latency is a concern, or if simultaneously processed frames compete for the cache (400x400 pixels at 3 bytes per pixel is nearly half a megabyte). In such a situation you'd want to limit the number of frames "in flight", and use parallel_for on horizontal strips (or "bands"?) to provide parallel slack. Actually, there may also be an affinity concern, in that each strip may need to be processed entirely on a single core (TBB doesn't do NUMA yet, I think, so staying with the same core seems like the best tactic in such situations). Perhaps Robert knows whether affinity_partitioner does anything for parallel_for inside a pipeline, but without such assurance I might even opt for parallel_for instead of a pipeline (not just inside one), and then, unless I'm mistaken, affinity_partitioner would still help to coordinate strips of the varying frames with the corresponding strips of the reference frame (or whatever related constraint applies).
Does that make sense?