There is also the possibility of changing the graph structure dynamically - e.g., you can create an extra reading node when more files appear.
To get started with the graph, take a look at some articles:
In a file-processing situation you have thread stalls on Open, Close, Read, and Write. The TBB architecture is an equal-priority tasking system, meaning that during a thread stall you may be undersubscribed for parallel processing in your process filter. Try adding +2 threads (run with 18 or 34 if you have HT). Also, run with more tokens (I/O buffers) than the number of processing threads. The number of additional buffers will depend on the worst-case latency of the I/O.
On QuickThread (my product) I have two classes of threads: a compute class (like TBB) plus an I/O class. You use the I/O class for I/O or other stalling tasks (e.g., waiting for events).
Using QuickThread for this pipeline, you would typically specify two I/O-class threads and the number of hardware threads for the compute class. Programmed this way, and with 2x the number of compute-class threads for buffers, the simple TBB sample program that upcases words in a file converts at ~3.2 GB/sec on a Dell R710. That speed saturates the I/O and memory bandwidth, and in this case adding an additional pipeline (for, say, a second conversion stream) would be counterproductive. This sample reads one very large input file and writes one large output file.
When processing multiple smaller files, I/O latency will tend to be longer due to the many Open/Close operations. In this situation you "might" find some benefit in running multiple pipelines, but this is something you would have to test on a case-by-case basis. Note that running multiple I/O streams tends to increase average I/O latency, due to the potential for increased seek latency as well as internal disk buffer data eviction. The following may work on TBB, and it definitely works for QuickThread.
For a many-files situation, consider a two-dimensional pipeline:
file-by-file input filter->
parallel second dimension process filter ->
file-by-file output filter
combined with second dimension process filter (pipeline)
intra-file input filter ->
parallel process filter ->
intra-file output filter
In this case you might want +4 threads - experiment.