How to use TBB for solving sparse linear systems?

Zhongze_L_ · ‎07-10-2008

Hi,

I used to use MPI based packages to iteratively solve sparse linear systems. The basic procedure is:

1. partition the graph/hypergraph based on the matrix A (metis/hmetis).

2. construct local matrix and vector based on the partitioning results obtained.

3. solve the given linear system in parallel.

There is a process ID which determines the data to deal with. There is no such ID available in TBB. I figured it
out that I could call

parallel_for(blocked_range(0,nthread, 1), ApplySolve)

to get thread ID. But how can I put all three steps above in a BIG class ApplySolve? Should I use pipeline instead? If so, how?

Thanks,
Zhongze

Andrey_Marochko · ‎07-21-2008

Hi, Zhongze

Unfortunately I am not an expert in the modern scientific algorithms, so I may be wrong in some of my assumptions. First I suppose that algorithms of the METIS family are serial, and thus the stage 1 is serial too. Then I assume that the stage 3 actually consists of two sub-stages:
3.1. Process local matrices (actually sub-matrices of the matrix A). This can be done in parallel.
3.2. Perform reduction of the results obtained at the step 3.1. This probably can also be done in parallel.

Depending on the specific algorithm used at the step 3.2, two variants are possible. The first one: you use parallel_reduce algorithm where body::operator() executes steps 2 and 3.1 (paired together for each graph partition), and the body::join() method executes step 3.2.

If the the step 3.2 does not fit the reduction model, then you use parallel_for to execute steps 2 and 3.1, and then what is appropriate for the step 3.2.

But what is really important, is that when working with TBB you should forget about process or thread IDs. The primary difference between MPI and TBB approaches results from the fact that the distributed systems have terrible communication overhead, and thus they have to use coarse-grained work splitting (a chunk per CPU node) to minimize the number of message passing roundtrips. In shared memory parallelizm (which TBB is targeted to), the cost of communication between CPUs/cores is relatively small. This allows to achieve very good load balance by using fine-grained work splitting.

Thus the number of subgraphs you need to partition the source matrix to should be such that useful work at step 3.1 took approximately 10-20 microseconds (on modern CPUs), but was not (much) less than auxiliary operations overhead (step 2 is such an auxiliary operation in this case, as well as partitioning cost). The number of subgraphs should not depend on the number of CPUs/cores in your system, and normally it should be much bigger (an order of thousands would be fine for TBB).

Advanced TBB partitioners working with TBB algorithms (auto_partitioner and affinity_partitioner) dinamically aggregate several subgraphs in one transaction where possible (to decrease the parallel book-keeping overhead), while falling back to the smallest grain size where it is necessary for the sake of good load balancing.