I am setting up several weighted normal matrices as A'PA, where A is the first design matrix of size nObs x nPar and P is the weight matrix of size nObs x nObs . A'PA is then a symmetric matrix of size nPar x nPar. I perform this operation using the DSYMM + DGEMM routines.
! Memory allocation IF (ASSOCIATED(AA)) DEALLOCATE(AA) ALLOCATE(AA(NOBS,NPAR),stat=ii) IF (ASSOCIATED(PA)) DEALLOCATE(PA) ALLOCATE(PA(NOBS,NPAR),stat=ii) ! I actually just need the U or L triangular part of it (symmetric) IF (ASSOCIATED(ATPA)) DEALLOCATE(ATPA) ALLOCATE(ATPA(NPAR,NPAR),stat=ii) ! Set up of PA CALL dsymm('L', 'U', nObs, nPar, 1.d0, P_f, nObs, AA, nObs, 0.d0, PA, nObs) ! Setup of A'PA CALL dgemm('T','N',nPar,nPar,nObs,1.d0,AA,nObs,PA,nObs,1.d0,ATPA,nPar) ! Deallocation of AA, PA IF (ASSOCIATED(AA)) DEALLOCATE(AA) IF (ASSOCIATED(PA)) DEALLOCATE(PA)
Now, in my case nPar is as large as 90000 (or more), so that I need to allocate a very large amount of memory for the output (A'PA). In principle, I just need the upper or lower triangular matrix (since it's symmetric) but I cannot find a way to avoid the simultaneous allocation of the 90000x90000 matrix (needed as output by DGEMM) and of the triangular matrix (which is approximately half the size) where I would copy the part I am interested in before deallocating the full one.
Do you have any suggestion or see any option to compute this product using optimized parallel routines without allocating the full A'PA matrix? I checked all routines or packed formats but I cannot find a viable way out of allocating the full matrix at some point.
ps : I have an alternative using the intrinsic MATMUL function and small batches of 5 observations that are then added together in a triangular matrix (allocated and accessed as a vector). Unfortunately, as you can imagine, this is hardly very efficient.
It would be more economical (memory and running time) and numerically stable to
- compute the QR decomposition of A
- compute Rs = R.sqrt(P)
Having done this, note that ATPA = (QRs)T(QRs), and that Q is orthogonal.
Please note that computational routines in MKL/Lapack may return Q in a different representation than as a dense matrix (e.g., as a product of Householder reflections), but Lapack provides helper routines to work with such Q matrices.
Let us say that I will go for the factorisation and then use a standard MATMUL to compute only the part of A'PA that I need. How do I proceed?
I think I should :
1. use GEQP3 to perform the factorisation
2. how do I get the R matrix out of the output to compute R.sqrt(P) ? Manually?
3. how do you suggest to recompute QRs ?
Ok, in the end I went for an "easy" even if a bit less efficient solution (I report it here in case it shall be of help for someone). I used DGEMV in a loop to get the elements of my normal matrix : it il indeed less efficient but still I gain a huge time factor by parallelization and I use less than 1/3 of the memory. I'm happy enough with it. ;)