AFAIK, in order to invoke MKL's cblas_dgemm I need three different matrices, namely A, B, C:
C = alpha*A*B + beta*C
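For reference, the semantics of that update can be pinned down with a naive column-major triple loop (a sketch only, not MKL's implementation; the `dgemm_ref` name and interface are illustrative, and in real code this is the work a `cblas_dgemm` call performs):

```c
#include <stddef.h>

/* Reference semantics of C = alpha*A*B + beta*C for column-major
 * matrices: A is m-by-k, B is k-by-n, C is m-by-n, with leading
 * dimensions equal to the row counts.  In production this is what
 * a call like
 *   cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
 *               m, n, k, alpha, A, m, B, k, beta, C, m);
 * computes; the loops below only pin down the math. */
static void dgemm_ref(size_t m, size_t n, size_t k, double alpha,
                      const double *A, const double *B,
                      double beta, double *C)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += A[i + p * m] * B[p + j * k];
            C[i + j * m] = alpha * acc + beta * C[i + j * m];
        }
}
```

Note that C appears on both sides of the update, which is why A and B may not alias C: each output element is a fresh accumulation over whole rows and columns of the inputs.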
To compute A = A*B I seem to have to allocate a separate copy of A into C, which is a great inconvenience for the algorithm I'm trying to implement (blocked accumulated Givens). This API constraint forces me to continuously copy patches out of the main matrix, apply the transformation, and copy the patch back into the matrix. So I am afraid that the gains from BLAS3 and blocking (parallelism, locality, vectorization of the matrix-matrix multiply) might not pay off due to all the copying around that MKL's cblas_dgemm forces on me.
Is there a way to circumvent this? When I am done with it, I plan to post the algorithm for possible feedback on how to improve it.
The algorithm takes as input a matrix in upper Hessenberg form, with a set of non-zero elements on the first subdiagonal that need to be annihilated. Instead of invoking dlartg and dlasr in a loop (very bad locality, and it does not scale well w.r.t. the problem size), I try to work on blocks of size NB over the affected diagonal, apply the plane rotations on top of the NB diagonal block, and at the same time accumulate the rotations into an orthogonal matrix Q which is then applied at once to the row band to the right of the main working patch sitting on the diagonal. Hope this is somewhat clear.
Unfortunately, there is no in-place cblas_dgemm in MKL. What are the typical dimensions for the A and B matrices? If you have sufficiently large matrices, the overhead of copying to a temporary buffer will be insignificant compared to the dgemm execution time.
Indeed, spot-on answer. To do this blocked Givens (NB block size) I start by copying a patch of the matrix of size NBx(n - begin), where n is the total number of columns and begin is the first column where a non-zero element occurs and needs to be annihilated. I will need to make this patching smarter because the cost of the copying is simply too high.
My initial goal was to reduce the stride needed to operate on the main matrix (which is column major, with a number of rows much larger than the cache line size) and thereby increase locality, and to move from BLAS1 to BLAS3: I accumulate the Givens transformations into a square block of size NB (a matrix of NB-1 accumulated Givens rotations) and apply it at once to the whole row block to the right of the NB diagonal block.
I could actually copy patches of the same size NB and apply them one by one. Ideally I would have several square patches and apply them over the columns of the matrix, but there is a trade-off here in how many patches (and the respective copying overhead) I can efficiently maintain simultaneously.
That said, I can see that you may be able to make a case for Intel's providing an entirely new routine to do
A <= A B
where A and B are square.
If you can list a number of matrix calculations/algorithms (with references to books and links as available) where the
A <= A B, A and B square
operation is an integral part, that would be quite persuasive.
In my specific case I need to do the fastest possible update of an upper Hessenberg matrix to make it upper triangular again. The matrix ends up in upper Hessenberg form because it was initially upper triangular and then a single column was deleted. For my iterative algorithm this happens in every iteration, and it is _the_ flops hotspot (profiled using VTune).
- Blocked Algorithms for the Reduction to Hessenberg-Triangular Form Revisited: www.netlib.org/lapack/lawnspdf/lawn198.pdf
- The multishift QR algorithm: www.mcs.sdsmt.edu/kbraman/Research/QR-I.pdf
- Using Level 3 BLAS in Rotation-Based Algorithms: http://www.google.com/url?sa=t&rct=j&q=using%20level%203blas%20in%20rotation%20algorithms&source=web...
Using your syntax, TRMM does A := A*B, where B is a triangular matrix. However, in your case, the B matrix is not triangular, right? So you need a TRMM-like subroutine where B can be a full matrix?
In my use cases so far I only have dense matrices, and the operation is in general GC*M, where M is a block matrix and GC is an NBxNB block corresponding to the accumulated Givens rotations, i.e. GC = GN*...*G2*G1. So my A is GC and my B is M. What I have been investigating is how to align the G_i (and how to apply them) so that GC becomes upper triangular; then I could switch from GEMM to TRMM, but I haven't been able to figure out how to apply this correctly yet ... I am researching whether it is possible.
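For what it's worth, the triangularity is exactly what makes an in-place update possible, which is why chasing an upper triangular GC would pay off. A sketch of the semantics with a naive loop (the `trmm_ref` name is illustrative; real code would call `cblas_dtrmm`, which does operate in place on B):

```c
#include <stddef.h>

/* In-place B := A*B with A upper triangular (column-major, n-by-n A,
 * n-by-m B).  Row i of the product only reads rows k >= i of the old
 * B, so overwriting B top-down per column is safe -- this is the
 * property that lets TRMM work without a scratch copy, and that a
 * full (non-triangular) A lacks.  The production equivalent is
 *   cblas_dtrmm(CblasColMajor, CblasLeft, CblasUpper, CblasNoTrans,
 *               CblasNonUnit, n, m, 1.0, A, n, B, n);  */
static void trmm_ref(size_t n, size_t m, const double *A, double *B)
{
    for (size_t j = 0; j < m; ++j)
        for (size_t i = 0; i < n; ++i) {
            double acc = 0.0;
            for (size_t k = i; k < n; ++k)  /* only k >= i: A is upper */
                acc += A[i + k * n] * B[k + j * n];
            B[i + j * n] = acc;  /* rows above i are already final */
        }
}
```

With a full A there is no traversal order in which each B entry is read for the last time before it is overwritten, which is why GEMM cannot offer the same in-place guarantee.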
My current implementation only uses GEMM, and I need to make an extra copy of B each time so I can invoke it, i.e. I actually need:
1) B = 1.0*A*B
and so I have no choice but to do:
1) copy B to C
2) B = 1.0*A*C + 0.0*B
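In code, that two-step workaround looks roughly like this (a sketch; the function name and the caller-provided scratch buffer are illustrative, and the inner loops stand in for the actual cblas_dgemm call):

```c
#include <stddef.h>
#include <string.h>

/* Out-of-place workaround for B := A*B with square n-by-n
 * column-major matrices: copy B into a scratch buffer, then GEMM
 * back into B with beta = 0.  The loops stand in for
 *   cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
 *               n, n, n, 1.0, A, n, scratch, n, 0.0, B, n);
 * scratch must be a caller-provided buffer of n*n doubles. */
static void inplace_gemm_workaround(size_t n, const double *A,
                                    double *B, double *scratch)
{
    /* 1) copy B to scratch -- the extra traffic this thread is about */
    memcpy(scratch, B, n * n * sizeof(double));
    /* 2) B = 1.0*A*scratch + 0.0*B */
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i) {
            double acc = 0.0;
            for (size_t k = 0; k < n; ++k)
                acc += A[i + k * n] * scratch[k + j * n];
            B[i + j * n] = acc;
        }
}
```

Reusing one preallocated scratch buffer across iterations at least keeps the allocation out of the hot loop; the memcpy traffic itself remains.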
Many thanks for your support and following up with this.
Not sure if you are still interested, but I got news from our developers. After investigation, we think the approach you took (blocking and partial overwrite) is still beneficial. We don't expect a performance improvement from implementing the function natively, so please go ahead with your own implementation.