Unfortunately I am not an expert in the modern scientific algorithms, so I may be wrong in some of my assumptions. First I suppose that algorithms of the METIS family are serial, and thus the stage 1 is serial too. Then I assume that the stage 3 actually consists of two sub-stages: 3.1. Process local matrices (actually sub-matrices of the matrix A). This can be done in parallel. 3.2. Perform reduction of the results obtained at the step 3.1. This probably can also be done in parallel.
Depending on the specific algorithm used at the step 3.2, two variants are possible. The first one: you use parallel_reduce algorithm where body::operator() executes steps 2 and 3.1 (paired together for each graph partition), and the body::join() method executes step 3.2.
If the the step 3.2 does not fit the reduction model, then you use parallel_for to execute steps 2 and 3.1, and then what is appropriate for the step 3.2.
But what is really important, is that when working with TBB you should forget about process or thread IDs. The primary difference between MPI and TBB approaches results from the fact that the distributed systems have terrible communication overhead, and thus they have to use coarse-grained work splitting (a chunk per CPU node) to minimize the number of message passing roundtrips. In shared memory parallelizm (which TBB is targeted to), the cost of communication between CPUs/cores is relatively small. This allows to achieve very good load balance by using fine-grained work splitting.
Thus the number of subgraphs you need to partition the source matrix to should be such that useful work at step 3.1 took approximately 10-20 microseconds (on modern CPUs), but was not (much) less than auxiliary operations overhead (step 2 is such an auxiliary operation in this case, as well as partitioning cost). The number of subgraphs should not depend on the number of CPUs/cores in your system, and normally it should be much bigger (an order of thousands would be fine for TBB).
Advanced TBB partitioners working with TBB algorithms (auto_partitioner and affinity_partitioner) dinamically aggregate several subgraphs in one transaction where possible (to decrease the parallel book-keeping overhead), while falling back to the smallest grain size where it is necessary for the sake of good load balancing.