Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Dynamic scheduling and memory locality

mambru37
Beginner
851 Views
Hi,

I'm writing a quite complicated program that has to work over several matrices (all of them with the same number of elements). To enhance memory locality, I do a first touch of the matrices in a parallel do with default static schedule policy.
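For concreteness, the first-touch pattern described above might look like the following C++/OpenMP sketch (the function name and sizes are illustrative; it assumes the OS uses a first-touch NUMA page-placement policy, and deliberately uses new[] rather than std::vector, whose constructor would touch every page serially on the allocating thread):

```cpp
#include <cstddef>

// Allocate an array and first-touch it in parallel with the same
// schedule(static) that the later compute loops will use, so each page
// is first written by (and placed near) the thread that will reuse it.
double* first_touch_init(std::size_t n) {
    double* a = new double[n];   // doubles are left uninitialized: pages untouched
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < static_cast<long>(n); ++i)
        a[i] = 0.0;              // first write commits the page locally
    return a;
}
```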

The problem arises because not all of the elements have the same computational cost, so processing the matrices with the same scheduling policy results in a clear load imbalance. The ideal solution would be to impose a dynamic or guided schedule, but, from what I understand, that would assign chunks to threads in an unpredictable fashion, thus ruining memory locality. Of course, if each thread took chunks from its own stack until it empties and only then started stealing from other processors' stacks, that would be much more efficient.

Is there any way to achieve that behaviour, or any alternative solution?

On a related question, is there any way to force chunks to be a multiple of the system's page size (besides doing it by hand)? I believe that would increase memory throughput.
3 Replies
Alexey-Kukanov
Employee

It might be a bigger alternative than you want, but you might consider Intel Threading Building Blocks (TBB) for C++ programs (and for C programs if you are fine with some use of C++). I think TBB might provide good data locality for you, because:

a) it supports two-dimensional data distribution, i.e. you could process your matrices by blocks, not only by columns/rows;

b) its affinity_partitioner feature allows replaying the chunk distribution between threads close to what it was in a previous run;

c) it uses a work-stealing scheduler, which has the desired property of "each thread taking chunks from its own stack until it empties and then stealing from other processors' stacks".

If you are interested, I will be glad to answer your questions here or at the TBB forum.

mambru37
Beginner
I guess the MEMORYTOUCH directive (ia64 only) might be the solution.
jimdempseyatthecove
Honored Contributor III

Here is a suggestion:

Partition the matrices' work into more zones than you have cores, such that the initial "number of cores" zones contain the majority of the work. Then hand out the remaining zones on a first-come, first-served basis.

!$OMP PARALLEL
! {work on zone = OpenMP team member number}
!$OMP DO SCHEDULE(YourChoice)
! {do the remainder of the work as a parallel do}
!$OMP END DO
!$OMP END PARALLEL
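In C++/OpenMP syntax, the same two-phase idea might be sketched like this (process_zones and the 1.0 "work" are illustrative; the _OPENMP guards let it also compile and run serially):

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Phase 1: each thread works on "its" big zone, preserving locality.
// Phase 2: the smaller leftover zones are handed out dynamically,
// first come, first served, to balance the load.
std::vector<double> process_zones(int nzones) {
    std::vector<double> zone(nzones, 0.0);
    #pragma omp parallel
    {
    #ifdef _OPENMP
        const int me = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();
    #else
        const int me = 0, nthreads = 1;
    #endif
        if (me < nzones)
            zone[me] = 1.0;                 // stand-in for the big-zone work

        #pragma omp for schedule(dynamic)
        for (int z = nthreads; z < nzones; ++z)
            zone[z] = 1.0;                  // stand-in for remainder work
    }
    return zone;
}
```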

You _may_ also find it advantageous to set thread affinity to a given processor.

Jim Dempsey
