ScaLAPACK arrays are distributed in a block-cyclic fashion over the process grid. ScaLAPACK then uses the PBLAS and BLACS to perform BLAS-like operations in a distributed SPMD fashion, which become, more or less, a mix of communication between processes and BLAS operations within each process.
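To make the role of the block size concrete, here is a minimal sketch of the one-dimensional block-cyclic mapping; a 2D distribution applies the same rule to rows and columns independently. The function names and the 0-based indexing are my own choices for illustration, not MKL or ScaLAPACK API; `local_count` is meant to play the same role as ScaLAPACK's `NUMROC` tool routine.

```python
def owner_and_local_index(g, nb, nprocs, src=0):
    """Map a 0-based global index g to (process coordinate, 0-based local index).

    nb is the block size, nprocs the number of processes along this grid
    dimension, and src the process that owns the first block.
    """
    block = g // nb                          # which global block holds g
    proc = (block + src) % nprocs            # blocks are dealt out cyclically
    local = (block // nprocs) * nb + g % nb  # full local blocks before it, plus offset
    return proc, local


def local_count(n, nb, proc, nprocs, src=0):
    """Number of the n global indices that land on process `proc`."""
    dist = (proc - src) % nprocs   # distance from the source process
    nblocks = n // nb              # number of complete blocks
    count = (nblocks // nprocs) * nb
    extra = nblocks % nprocs       # leftover complete blocks
    if dist < extra:
        count += nb                # this process gets one more full block
    elif dist == extra:
        count += n % nb            # this process gets the trailing partial block
    return count


if __name__ == "__main__":
    n, nb, nprocs = 10, 2, 2
    for g in range(n):
        print(g, owner_and_local_index(g, nb, nprocs))
    print([local_count(n, nb, p, nprocs) for p in range(nprocs)])
```

A small `nb` spreads the work more evenly but means more, smaller messages and smaller local BLAS calls; a large `nb` does the opposite. That is the trade-off I am asking about.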
So the block size is going to affect the performance of both the communication and the BLAS calls, but the degree to which it does depends on the implementation. The MKL implementation is a black box to the end user (me), and I don't have an ATLAS-like search tool to point me toward the block size I should be using, especially when the parameters are things like {Gigabit Ethernet vs. 10G InfiniBand vs. ...} and {Westmere vs. Sandy/Ivy Bridge vs. Haswell}, etc.
So, are there any guidelines for choosing the block size when using MKL ScaLAPACK, LAPACK, and BLAS?
Sorry for some sloppiness in my writing. If I knew how to edit my post, I would make these changes: