"But i dont know how to implement separate thread for output filter within pipeline? may i know how to do that?"
tbb_thread and concurrent_queue? (Added) I mean a tbb_thread outside of the pipeline; the last filter would just add the work to the queue for delivery to the tbb_thread.
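A minimal sketch of that hand-off, assuming standard C++ stand-ins: std::thread in place of tbb_thread, and a small mutex-guarded queue in place of concurrent_queue. The names Frame, FrameQueue, and run_delivery are hypothetical, and the loop in run_delivery only simulates what the last filter would do for each frame.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical frame handle; a real program would carry image data.
struct Frame { int id; };

// Small blocking queue (a stand-in for tbb::concurrent_queue, not its API).
class FrameQueue {
public:
    void push(Frame f) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(f);
        }
        cv_.notify_one();
    }
    // Signals that no more frames will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }
    // Blocks until a frame is available; returns false once closed and drained.
    bool pop(Frame& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Frame> q_;
    bool closed_ = false;
};

// Simulates the last pipeline filter handing frames to a delivery thread
// that lives outside the pipeline. Returns the ids in delivery order.
std::vector<int> run_delivery(int frame_count) {
    assert(frame_count >= 0);
    FrameQueue queue;
    std::vector<int> delivered;
    std::thread consumer([&] {
        Frame f;
        // Any slow output (display, disk, network) would happen here,
        // so no pipeline worker is ever blocked on it.
        while (queue.pop(f)) delivered.push_back(f.id);
    });
    // The last filter only enqueues and returns immediately.
    for (int i = 0; i < frame_count; ++i) queue.push(Frame{i});
    queue.close();
    consumer.join();
    return delivered;
}
```

Because a single producer pushes in order and a single consumer pops, the delivery order matches the push order; with a parallel last stage that would no longer hold.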
"If i use pipeline, is it possible to use parallel_for also?"
Nesting parallel_for inside pipeline is a good way of creating parallel slack (opportunities to execute things in parallel), assuming the units of work are sufficient to avoid excessive parallel overhead, which seems to be satisfied from your description.
Note that your pipeline will behave as if stages 1 and 2 are just one stage, if that helps to understand what is going on. In computation-bound work, assumed by TBB, that is probably a good thing, but it also means that any latency in getting more data is going to result in a stalled worker thread, i.e., no background subtraction work being done during that time. I wouldn't be surprised if the solution ended up being several threads, with TBB only used for a parallel_for to do the background subtraction, which it should do very well. (Added) Or reasonably well: I wanted to balance off the gloom with an accolade, but I'm sure that design could still be improved.
"I think if i use parallel_for inside pipeline to read those two image then it will be too overhead for system. is it?"
You would probably read the data once and then let parallel_for process it one range at a time, so you wouldn't be reading the data multiple times (although there would be some inter-cache traffic, but I'm not sure how that could be avoided). The overhead you should be concerned with would be in task creation (don't use a simple_partitioner with a grainsize of one pixel: always process one or several full horizontal lines at once by using a single-dimensional vertically oriented blocked_range), and in the barrier to wind down parallel_for processing (although auto_partitioner should behave well enough for work that is fairly uniformly distributed like this seems to be).
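To make the "full horizontal lines" advice concrete, here is a rough sketch in standard C++, assuming 8-bit grayscale images stored row by row and a simple absolute-difference threshold as the background subtraction. The names subtract_rows and subtract_background, the rows_per_chunk parameter, and the threshold rule are all hypothetical. One std::thread per chunk is used only as a stand-in; with TBB, subtract_rows would be the body called by parallel_for over a one-dimensional range of rows, and the scheduler would decide how to split it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <thread>
#include <vector>

// Processes the full rows [row_begin, row_end): marks a pixel 255 when it
// differs from the background by more than threshold, otherwise 0.
void subtract_rows(const std::vector<std::uint8_t>& frame,
                   const std::vector<std::uint8_t>& background,
                   std::vector<std::uint8_t>& mask,
                   int width, int row_begin, int row_end, int threshold) {
    for (int y = row_begin; y < row_end; ++y) {
        for (int x = 0; x < width; ++x) {
            std::size_t i = static_cast<std::size_t>(y) * width + x;
            int diff = std::abs(int(frame[i]) - int(background[i]));
            mask[i] = diff > threshold ? 255 : 0;
        }
    }
}

// Splits the image into chunks of whole rows and processes them concurrently.
// Chunks write to disjoint parts of mask, so no locking is needed.
std::vector<std::uint8_t> subtract_background(
        const std::vector<std::uint8_t>& frame,
        const std::vector<std::uint8_t>& background,
        int width, int height, int rows_per_chunk, int threshold) {
    assert(width > 0 && height > 0 && rows_per_chunk > 0);
    assert(frame.size() == static_cast<std::size_t>(width) * height);
    assert(background.size() == frame.size());
    std::vector<std::uint8_t> mask(frame.size());
    std::vector<std::thread> workers;
    for (int y = 0; y < height; y += rows_per_chunk) {
        int end = std::min(y + rows_per_chunk, height);
        workers.emplace_back([&, y, end] {
            subtract_rows(frame, background, mask, width, y, end, threshold);
        });
    }
    // Joining every worker plays the role of the barrier at the end of parallel_for.
    for (auto& w : workers) w.join();
    return mask;
}
```

The input images are read once and shared read-only; each chunk touches only its own rows of the output, which keeps the per-task overhead small relative to the work done.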
You seem set on waiting inside the third stage despite my advice, aren't you?
If you care about the order of the frames, you should only add them to a queue in an ordered serial stage.
Each stage takes a cookie and returns a (potentially different) cookie, typed void* because it is typically a pointer to data. You have to be prepared for more than one data item in flight at any one time (otherwise the pipeline would degenerate into a simple loop), i.e., multiple frames may exist at the same time, but if you pass the correct values in each stage, the ordered-serial output filter will automagically process the frames in the same order as the serial input filter.
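As a conceptual model only (this is not TBB's implementation, and OrderedSink is a hypothetical name), the effect of an ordered-serial output stage can be pictured as a reorder buffer: frames may finish the parallel middle stages in any order, but each one is released only when every earlier frame has already been emitted. The sketch below is single-threaded and assumes each frame carries the sequence number it received at the input stage.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Reorder buffer that emits frames strictly in input sequence order.
class OrderedSink {
public:
    // Called when the frame with input sequence number seq leaves the
    // middle stages; frames may arrive here out of order.
    void submit(std::size_t seq, int frame_id) {
        assert(seq >= next_ && pending_.count(seq) == 0);
        pending_[seq] = frame_id;
        // Release every buffered frame that is now next in line.
        for (auto it = pending_.find(next_); it != pending_.end();
             it = pending_.find(next_)) {
            emitted_.push_back(it->second);
            pending_.erase(it);
            ++next_;
        }
    }
    // Frames emitted so far, in output order.
    const std::vector<int>& emitted() const { return emitted_; }
    // Number of frames still held back waiting for an earlier one.
    std::size_t waiting() const { return pending_.size(); }
private:
    std::size_t next_ = 0;                 // next sequence number to emit
    std::map<std::size_t, int> pending_;   // finished but not yet emitted
    std::vector<int> emitted_;
};
```

For example, submitting sequence numbers 2 and 1 first leaves both frames waiting; submitting 0 then releases all three in the order 0, 1, 2. This is also why several frames must be able to exist at once: the buffered ones stay alive until their turn comes.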