topic Ok, in the end I went for an in Intel® oneAPI Math Kernel Library

How to optimize A'PA computation for memory use

Stefano_B_ — Fri, 29 Apr 2016 10:12:29 GMT

Hi,

I am setting up several weighted normal matrices as A'PA, where A is the first design matrix of size nObs x nPar and P is the weight matrix of size nObs x nObs . A'PA is then a symmetric matrix of size nPar x nPar. I perform this operation using the DSYMM + DGEMM routines.

! Memory allocation
       IF (ASSOCIATED(AA))    DEALLOCATE(AA)
       ALLOCATE(AA(NOBS,NPAR),stat=ii)

       IF (ASSOCIATED(PA))    DEALLOCATE(PA)
       ALLOCATE(PA(NOBS,NPAR),stat=ii)

! I actually just need the U or L triangular part of it (symmetric)
       IF (ASSOCIATED(ATPA))    DEALLOCATE(ATPA)
       ALLOCATE(ATPA(NPAR,NPAR),stat=ii)

! Set up of PA
       CALL dsymm('L', 'U', nObs, nPar, 1.d0, P_f, nObs, AA, nObs, 0.d0, PA, nObs)

! Setup of A'PA
       CALL dgemm('T','N',nPar,nPar,nObs,1.d0,AA,nObs,PA,nObs,1.d0,ATPA,nPar)

! Deallocation of AA, PA
    IF (ASSOCIATED(AA))    DEALLOCATE(AA)
    IF (ASSOCIATED(PA))    DEALLOCATE(PA)

Now, in my case nPar is as large as 90000 (or more), so that I need to allocate a very large amount of memory for the output (A'PA). In principle, I just need the upper or lower triangular matrix (since it's symmetric) but I cannot find a way to avoid the simultaneous allocation of the 90000x90000 matrix (needed as output by DGEMM) and of the triangular matrix (which is approximately half the size) where I would copy the part I am interested in before deallocating the full one.

Do you have any suggestion or see any option to compute this product using optimized parallel routines without allocating the full A'PA matrix? I checked all routines or packed formats but I cannot find a viable way out of allocating the full matrix at some point.

Thanks!

ps : I have an alternative using the intrinsic MATMUL function and small batches of 5 observations that are then added together in a triangular matrix (allocated and accessed as a vector). Unfortunately, as you can imagine, this is hardly very efficient.

1. Is the weights matrix P

mecej4 — Fri, 29 Apr 2016 16:21:00 GMT

1. Is the weights matrix P a diagonal matrix?

2. What do you intend to do with A^TPA, once you have assembled it?

Hi!

Stefano_B_ — Fri, 29 Apr 2016 16:32:00 GMT

Hi! 1. Yes it is. 2. I will store it in a file (together with the respective r.h.s.) to then combine it with others, exchange it, ... and eventually solve it (but only at the very end). Thanks!

It would be more economical

mecej4 — Fri, 29 Apr 2016 17:33:00 GMT

It would be more economical (memory and running time) and numerically stable to

compute the QR decomposition of A
compute R_s = R.sqrt(P)

Having done this, note that A^TPA = (QR_s)^T(QR_s), and that Q is orthogonal.

Please note that computational routines in MKL/Lapack may return Q in a different representation than as a dense matrix (e.g., as a product of Householder reflections), but Lapack provides helper routines to work with such Q matrices.

Thanks! I will try it but I

Stefano_B_ — Fri, 29 Apr 2016 22:42:18 GMT

Thanks! I will try it but I still have the impression that I will need to allocate the full A'PA matrix to store the result of the final multiplication (unless it exist some specific routine for these kind of factorised matrices...)

Let us say that I will go for

Stefano_B_ — Tue, 10 May 2016 14:27:07 GMT

Let us say that I will go for the factorisation and then use a standard MATMUL to compute only the part of A'PA that I need. How do I proceed?

I think I should :

1. use GEQP3 to perform the factorisation

2. how do I get the R matrix out of the output to compute R.sqrt(P) ? Manually?

3. how do you suggest to recompute QRs ?

Thanks!

Ok, in the end I went for an

Stefano_B_ — Fri, 26 Aug 2016 12:22:00 GMT

Ok, in the end I went for an "easy" even if a bit less efficient solution (I report it here in case it shall be of help for someone). I used DGEMV in a loop to get the elements of my normal matrix : it il indeed less efficient but still I gain a huge time factor by parallelization and I use less than 1/3 of the memory. I'm happy enough with it. ;)