Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Stefano_B_
Beginner
101 Views

How to optimize A'PA computation for memory use

Hi,

I am setting up several weighted normal matrices as A'PA, where A is the first design matrix of size nObs x nPar and P is the weight matrix of size nObs x nObs . A'PA is then a symmetric matrix of size nPar x nPar. I perform this operation using the DSYMM + DGEMM routines.

! Memory allocation
       IF (ASSOCIATED(AA))    DEALLOCATE(AA)
       ALLOCATE(AA(NOBS,NPAR),stat=ii)

       IF (ASSOCIATED(PA))    DEALLOCATE(PA)
       ALLOCATE(PA(NOBS,NPAR),stat=ii)

! I actually just need the U or L triangular part of it (symmetric)
       IF (ASSOCIATED(ATPA))    DEALLOCATE(ATPA)
       ALLOCATE(ATPA(NPAR,NPAR),stat=ii)

! Set up of PA
       CALL dsymm('L', 'U', nObs, nPar, 1.d0, P_f, nObs, AA, nObs, 0.d0, PA, nObs)

! Setup of A'PA
       CALL dgemm('T','N',nPar,nPar,nObs,1.d0,AA,nObs,PA,nObs,1.d0,ATPA,nPar)

! Deallocation of AA, PA
    IF (ASSOCIATED(AA))    DEALLOCATE(AA)
    IF (ASSOCIATED(PA))    DEALLOCATE(PA)

Now, in my case nPar is as large as 90000 (or more), so that I need to allocate a very large amount of memory for the output (A'PA). In principle, I just need the upper or lower triangular matrix (since it's symmetric) but I cannot find a way to avoid the simultaneous allocation of the 90000x90000 matrix (needed as output by DGEMM) and of the triangular matrix (which is approximately half the size) where I would copy the part I am interested in before deallocating the full one.

Do you have any suggestion or see any option to compute this product using optimized parallel routines without allocating the full A'PA matrix? I checked all routines or packed formats but I cannot find a viable way out of allocating the full matrix at some point.

Thanks!

ps : I have an alternative using the intrinsic MATMUL function and small batches of 5 observations that are then added together in a triangular matrix (allocated and accessed as a vector). Unfortunately, as you can imagine, this is hardly very efficient.

 

0 Kudos
6 Replies
mecej4
Black Belt
101 Views

1. Is the weights matrix P a diagonal matrix?

2. What do you intend to do with ATPA, once you have assembled it?

Stefano_B_
Beginner
101 Views

Hi! 1. Yes it is. 2. I will store it in a file (together with the respective r.h.s.) to then combine it with others, exchange it, ... and eventually solve it (but only at the very end). Thanks!
mecej4
Black Belt
101 Views

It would be more economical (memory and running time) and numerically stable to

  1. compute the QR decomposition of A
  2. compute Rs = R.sqrt(P)

Having done this, note that ATPA = (QRs)T(QRs), and that Q is orthogonal.

Please note that computational routines in MKL/Lapack may return Q in a different representation than as a dense matrix (e.g., as a product of Householder reflections), but Lapack provides helper routines to work with such Q matrices.

Stefano_B_
Beginner
101 Views

Thanks! I will try it but I still have the impression that I will need to allocate the full A'PA matrix to store the result of the final multiplication (unless it exist some specific routine for these kind of factorised matrices...)
Stefano_B_
Beginner
101 Views

Let us say that I will go for the factorisation and then use a standard MATMUL to compute only the part of A'PA that I need. How do I proceed?

I think I should :

1. use GEQP3 to perform the factorisation

2. how do I get the R matrix out of the output to compute R.sqrt(P) ? Manually?

3. how do you suggest to recompute QRs ?

Thanks!

Stefano_B_
Beginner
101 Views

Ok, in the end I went for an "easy" even if a bit less efficient solution (I report it here in case it shall be of help for someone). I used DGEMV in a loop to get the elements of my normal matrix : it il indeed less efficient but still I gain a huge time factor by parallelization and I use less than 1/3 of the memory. I'm happy enough with it. ;)

 

Reply