AFAIK, in order to invoke MKL's cblas_dgemm I need three different matrices, namely A, B, C:
C = alpha*A*B + beta*C
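For reference, the semantics of that update can be pinned down with a naive column-major triple loop (a sketch only, not MKL's implementation; the `dgemm_ref` name and interface are illustrative, and in real code this is the work a `cblas_dgemm` call performs):

```c
#include <stddef.h>

/* Reference semantics of C = alpha*A*B + beta*C for column-major
 * matrices: A is m-by-k, B is k-by-n, C is m-by-n, with leading
 * dimensions equal to the row counts.  In production this is what
 * a call like
 *   cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
 *               m, n, k, alpha, A, m, B, k, beta, C, m);
 * computes; the loops below only pin down the math. */
static void dgemm_ref(size_t m, size_t n, size_t k, double alpha,
                      const double *A, const double *B,
                      double beta, double *C)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += A[i + p * m] * B[p + j * k];
            C[i + j * m] = alpha * acc + beta * C[i + j * m];
        }
}
```

Note that C appears on both sides of the update, which is why A and B may not alias C: each output element is a fresh accumulation over whole rows and columns of the inputs.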
To compute A = A*B I seem to have to allocate a separate copy of A into C, which is a great inconvenience for the algorithm I'm trying to implement (blocked accumulated Givens). This API constraint forces me to continuously copy patches out of the main matrix, apply the transformation, and copy the patch back into the matrix. So I am afraid that the gains from BLAS3 and blocking (parallelism, locality, vectorization of the matrix-matrix multiply) might not pay off due to all the copying around that MKL's cblas_dgemm forces on me.
Is there a way to circumvent this? When I am done with it, I plan to post the algorithm for possible feedback on how to improve it.
The algorithm takes as input a matrix in upper Hessenberg form, with a set of non-zero elements on the first subdiagonal that need to be annihilated. Instead of invoking dlartg and dlasr in a loop (very bad locality, and it does not scale well w.r.t. the problem size), I try to work on blocks of size NB over the affected diagonal, apply the plane rotations on top of the NB diagonal block, and at the same time accumulate the rotations into an orthogonal matrix Q which is then applied at once to the row band to the right of the main working patch sitting on the diagonal. Hope this is somewhat clear.
Unfortunately, there is no in-place cblas_dgemm in MKL. What are the typical dimensions for the A and B matrices? If you have sufficiently large matrices, the overhead of copying to a temporary buffer will be insignificant compared to the dgemm execution time.
Indeed, spot-on answer. To do this blocked Givens (NB block size) I start by copying a patch of the matrix of size NBx(n - begin), where n is the total number of columns and begin is the first column where a non-zero element occurs and needs to be annihilated. I will need to make this patching smarter because the cost of the copying is simply too high.
My initial goal was to reduce the stride needed to operate on the main matrix (which is column major, with a number of rows much larger than the cache line size) and thereby increase locality, and to move from BLAS1 to BLAS3: I accumulate the Givens transformations into a square block of size NB (a matrix of NB-1 accumulated Givens rotations) and apply it at once to the whole row block to the right of the NB diagonal block.
I could actually copy patches of the same size NB and apply them one by one. Ideally I would have several square patches and apply them over the columns of the matrix, but there is a trade-off here in how many patches (and the respective copying overhead) I can efficiently maintain simultaneously.
That said, I can see that you may be able to make a case for Intel's providing an entirely new routine to do
A <= A B
where A and B are square.
If you can list a number of matrix calculations/algorithms (with references to books and links as available) where the
A <= A B, A and B square
operation is an integral part, that would be quite persuasive.
In my specific case I need to do the fastest possible update of an upper Hessenberg matrix to make it upper triangular again. The matrix ends up in upper Hessenberg form because it was initially upper triangular and then a single column was deleted. For my iterative algorithm this happens in every iteration, and it is _the_ flops hotspot (profiled using VTune).
- Blocked Algorithms for the Reduction to Hessenberg-Triangular Form Revisited: www.netlib.org/lapack/lawnspdf/lawn198.pdf
- The multishift QR algorithm: www.mcs.sdsmt.edu/kbraman/Research/QR-I.pdf
- Using Level 3 BLAS in Rotation-Based Algorithms: http://www.google.com/url?sa=t&rct=j&q=using%20level%203blas%20in%20rotation%20algorithms&source=web...
Using your syntax, TRMM does A := A*B, where B is a triangular matrix. However, in your case, the B matrix is not triangular, right? So you need a TRMM-like subroutine where B can be a full matrix?
In my use cases so far I only have dense matrices, and the operation is in general GC*M, where M is a block matrix and GC is an NBxNB block corresponding to the accumulated Givens rotations, i.e. GC = GN*...*G2*G1. So my A is GC and my B is M. What I have been investigating is how to align the G_i (and how to apply them) so that GC becomes upper triangular; then I could switch from GEMM to TRMM, but I haven't been able to figure out how to apply this correctly yet ... I am researching whether it is possible.
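For what it's worth, the triangularity is exactly what makes an in-place update possible, which is why chasing an upper triangular GC would pay off. A sketch of the semantics with a naive loop (the `trmm_ref` name is illustrative; real code would call `cblas_dtrmm`, which does operate in place on B):

```c
#include <stddef.h>

/* In-place B := A*B with A upper triangular (column-major, n-by-n A,
 * n-by-m B).  Row i of the product only reads rows k >= i of the old
 * B, so overwriting B top-down per column is safe -- this is the
 * property that lets TRMM work without a scratch copy, and that a
 * full (non-triangular) A lacks.  The production equivalent is
 *   cblas_dtrmm(CblasColMajor, CblasLeft, CblasUpper, CblasNoTrans,
 *               CblasNonUnit, n, m, 1.0, A, n, B, n);  */
static void trmm_ref(size_t n, size_t m, const double *A, double *B)
{
    for (size_t j = 0; j < m; ++j)
        for (size_t i = 0; i < n; ++i) {
            double acc = 0.0;
            for (size_t k = i; k < n; ++k)  /* only k >= i: A is upper */
                acc += A[i + k * n] * B[k + j * n];
            B[i + j * n] = acc;  /* rows above i are already final */
        }
}
```

With a full A there is no traversal order in which each B entry is read for the last time before it is overwritten, which is why GEMM cannot offer the same in-place guarantee.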
My current implementation only uses GEMM, and I need to make an extra copy of B each time so I can invoke it, i.e. I actually need:
1) B = 1.0*A*B
and so I have no choice but to do:
1) copy B to C
2) B = 1.0*A*C + 0.0*B
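In code, that two-step workaround looks roughly like this (a sketch; the function name and the caller-provided scratch buffer are illustrative, and the inner loops stand in for the actual cblas_dgemm call):

```c
#include <stddef.h>
#include <string.h>

/* Out-of-place workaround for B := A*B with square n-by-n
 * column-major matrices: copy B into a scratch buffer, then GEMM
 * back into B with beta = 0.  The loops stand in for
 *   cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
 *               n, n, n, 1.0, A, n, scratch, n, 0.0, B, n);
 * scratch must be a caller-provided buffer of n*n doubles. */
static void inplace_gemm_workaround(size_t n, const double *A,
                                    double *B, double *scratch)
{
    /* 1) copy B to scratch -- the extra traffic this thread is about */
    memcpy(scratch, B, n * n * sizeof(double));
    /* 2) B = 1.0*A*scratch + 0.0*B */
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i) {
            double acc = 0.0;
            for (size_t k = 0; k < n; ++k)
                acc += A[i + k * n] * scratch[k + j * n];
            B[i + j * n] = acc;
        }
}
```

Reusing one preallocated scratch buffer across iterations at least keeps the allocation out of the hot loop; the memcpy traffic itself remains.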
Many thanks for your support and following up with this.
Not sure if you are still interested, but I got news from our developers. After investigation, we think the approach you took (blocking and partial overwrite) is still beneficial. We don't expect a performance improvement from implementing the function natively, so please go ahead with your own implementation.