How to optimize A'PA computation for memory use

Stefano_B_ · ‎04-29-2016

Hi,

I am setting up several weighted normal matrices as A'PA, where A is the first design matrix of size nObs x nPar and P is the weight matrix of size nObs x nObs . A'PA is then a symmetric matrix of size nPar x nPar. I perform this operation using the DSYMM + DGEMM routines.

! Memory allocation
       IF (ASSOCIATED(AA))    DEALLOCATE(AA)
       ALLOCATE(AA(NOBS,NPAR),stat=ii)

       IF (ASSOCIATED(PA))    DEALLOCATE(PA)
       ALLOCATE(PA(NOBS,NPAR),stat=ii)

! I actually just need the U or L triangular part of it (symmetric)
       IF (ASSOCIATED(ATPA))    DEALLOCATE(ATPA)
       ALLOCATE(ATPA(NPAR,NPAR),stat=ii)

! Set up of PA
       CALL dsymm('L', 'U', nObs, nPar, 1.d0, P_f, nObs, AA, nObs, 0.d0, PA, nObs)

! Setup of A'PA
       CALL dgemm('T','N',nPar,nPar,nObs,1.d0,AA,nObs,PA,nObs,1.d0,ATPA,nPar)

! Deallocation of AA, PA
    IF (ASSOCIATED(AA))    DEALLOCATE(AA)
    IF (ASSOCIATED(PA))    DEALLOCATE(PA)

Now, in my case nPar is as large as 90000 (or more), so that I need to allocate a very large amount of memory for the output (A'PA). In principle, I just need the upper or lower triangular matrix (since it's symmetric) but I cannot find a way to avoid the simultaneous allocation of the 90000x90000 matrix (needed as output by DGEMM) and of the triangular matrix (which is approximately half the size) where I would copy the part I am interested in before deallocating the full one.

Do you have any suggestion or see any option to compute this product using optimized parallel routines without allocating the full A'PA matrix? I checked all routines or packed formats but I cannot find a viable way out of allocating the full matrix at some point.

Thanks!

ps : I have an alternative using the intrinsic MATMUL function and small batches of 5 observations that are then added together in a triangular matrix (allocated and accessed as a vector). Unfortunately, as you can imagine, this is hardly very efficient.

mecej4 · ‎04-29-2016

1. Is the weights matrix P a diagonal matrix?

2. What do you intend to do with A^TPA, once you have assembled it?

Stefano_B_ · ‎04-29-2016

Hi! 1. Yes it is. 2. I will store it in a file (together with the respective r.h.s.) to then combine it with others, exchange it, ... and eventually solve it (but only at the very end). Thanks!

mecej4 · ‎04-29-2016

It would be more economical (memory and running time) and numerically stable to

compute the QR decomposition of A
compute R_s = R.sqrt(P)

Having done this, note that A^TPA = (QR_s)^T(QR_s), and that Q is orthogonal.

Please note that computational routines in MKL/Lapack may return Q in a different representation than as a dense matrix (e.g., as a product of Householder reflections), but Lapack provides helper routines to work with such Q matrices.

Stefano_B_ · ‎04-29-2016

Thanks! I will try it but I still have the impression that I will need to allocate the full A'PA matrix to store the result of the final multiplication (unless it exist some specific routine for these kind of factorised matrices...)

Stefano_B_ · ‎05-10-2016

Let us say that I will go for the factorisation and then use a standard MATMUL to compute only the part of A'PA that I need. How do I proceed?

I think I should :

1. use GEQP3 to perform the factorisation

2. how do I get the R matrix out of the output to compute R.sqrt(P) ? Manually?

3. how do you suggest to recompute QRs ?

Thanks!

Stefano_B_ · ‎08-26-2016

Ok, in the end I went for an "easy" even if a bit less efficient solution (I report it here in case it shall be of help for someone). I used DGEMV in a loop to get the elements of my normal matrix : it il indeed less efficient but still I gain a huge time factor by parallelization and I use less than 1/3 of the memory. I'm happy enough with it. ;)