Is there any way to prevent this from happening? It's wasteful to do this kind of splitting with my custom DGEMM routine.
I am calling pdgemm from a C++ program. Here is one example call:
pdgemm(&transa, &transb, &m, &n, &k, &alpha,
       a.val, &ione, &ione, a.desc_,
       b.val, &ione, &ione, b.desc_,
       &beta, val, &ione, &ione, desc_);
Here, val is the starting address of the local part of matrix C, and desc_ is the array descriptor (MDESC) for C. I believe I have filled in the descriptors correctly, including the block sizes in the MB_ and NB_ entries of each descriptor. As I said before, changing the block sizes does not seem to change how PDGEMM splits up the work. For example, in my code I multiply a 156196 x 256 matrix by a 256 x 156196 matrix, and PDGEMM splits this into 8 calls, breaking the shared dimension of 256 into 8 panels of 32.
First of all, the matrices must already be distributed over the BLACS process grid before PDGEMM is called. Also, PDGEMM derives the M, N, K it passes to the internal gemm calls from the block sizes, but they are not equal to the block sizes. You can work out exactly how by reading the PDGEMM sources (available via Netlib).