You will need a new class for each unique loop you wish to convert to parallel execution. Each class defines a function call operator (operator()) containing the body of the corresponding for-loop. In a function containing multiple non-nested loops, construct an instance of each class and pass each one to the appropriate parallel_for call.
Alternatively, you could try a compiler that supports the C++0x standard's lambda constructs, which let you specify the for-loop in each parallel_for call as a lambda, so the loops stay inline in the original function. Intel C++ Compiler v11 supports lambdas.
As was previously mentioned, parallel_while has been deprecated in favor of parallel_do, but neither was ever intended as a general replacement for a while loop. parallel_while, with its stream interface, and parallel_do, using iterators, each seek to enable some parallel computation despite an inherent serialization in the loop: advancing to the next item. Concurrency occurs only insofar as the loop can serially spawn tasks faster than they can be completed in parallel. If the loop body itself can add additional work items, that will improve the scaling.
Well, having seen such miraculous "improvements" in the past while parallelizing code, I'd first caution you to make sure all the work is getting done -- that is, verify that the parallel code gets the correct result. As I've discovered to my chagrin in the past, it's amazing how much work you can get done if you don't do it all ;-).
If that all checks out, about the only thing I can think of that might explain it is the improved cache locality you gain by partitioning the work via the Intel TBB constructs. We have seen cases of super-linear scaling due to improvements in cache use, though I don't have enough information to know whether this applies in your case. Are you using a parallel_for to process the buffered file?