Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Intel Community
- Software Development SDKs and Libraries
- Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library
- How to optimize A'PA computation for memory use

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Highlighted
##

Stefano_B_

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-29-2016
03:12 AM

9 Views

How to optimize A'PA computation for memory use

Hi,

I am setting up several weighted normal matrices as A'PA, where A is the first design matrix of size nObs x nPar and P is the weight matrix of size nObs x nObs . A'PA is then a symmetric matrix of size nPar x nPar. I perform this operation using the DSYMM + DGEMM routines.

! Memory allocation IF (ASSOCIATED(AA)) DEALLOCATE(AA) ALLOCATE(AA(NOBS,NPAR),stat=ii) IF (ASSOCIATED(PA)) DEALLOCATE(PA) ALLOCATE(PA(NOBS,NPAR),stat=ii) ! I actually just need the U or L triangular part of it (symmetric) IF (ASSOCIATED(ATPA)) DEALLOCATE(ATPA) ALLOCATE(ATPA(NPAR,NPAR),stat=ii) ! Set up of PA CALL dsymm('L', 'U', nObs, nPar, 1.d0, P_f, nObs, AA, nObs, 0.d0, PA, nObs) ! Setup of A'PA CALL dgemm('T','N',nPar,nPar,nObs,1.d0,AA,nObs,PA,nObs,1.d0,ATPA,nPar) ! Deallocation of AA, PA IF (ASSOCIATED(AA)) DEALLOCATE(AA) IF (ASSOCIATED(PA)) DEALLOCATE(PA)

Now, in my case nPar is as large as 90000 (or more), so that I need to allocate a very large amount of memory for the output (A'PA). In principle, I just need the upper or lower triangular matrix (since it's symmetric) but I cannot find a way to avoid the simultaneous allocation of the 90000x90000 matrix (needed as output by DGEMM) and of the triangular matrix (which is approximately half the size) where I would copy the part I am interested in before deallocating the full one.

Do you have any suggestion or see any option to compute this product using optimized parallel routines without allocating the full A'PA matrix? I checked all routines or packed formats but I cannot find a viable way out of allocating the full matrix at some point.

Thanks!

ps : I have an alternative using the intrinsic MATMUL function and small batches of 5 observations that are then added together in a triangular matrix (allocated and accessed as a vector). Unfortunately, as you can imagine, this is hardly very efficient.

6 Replies

Highlighted
##

mecej4

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-29-2016
09:21 AM

9 Views

1. Is the weights matrix P a diagonal matrix?

2. What do you intend to do with A^{T}PA, once you have assembled it?

Highlighted
##

Hi!
1. Yes it is.
2. I will store it in a file (together with the respective r.h.s.) to then combine it with others, exchange it, ... and eventually solve it (but only at the very end).
Thanks!

Stefano_B_

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-29-2016
09:32 AM

9 Views

Highlighted
##

mecej4

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-29-2016
10:33 AM

9 Views

It would be more economical (memory and running time) and numerically stable to

- compute the QR decomposition of A
- compute R
_{s}= R.sqrt(P)

Having done this, note that A^{T}PA = (QR_{s})^{T}(QR_{s}), and that Q is orthogonal.

Please note that computational routines in MKL/Lapack may return Q in a different representation than as a dense matrix (e.g., as a product of Householder reflections), but Lapack provides helper routines to work with such Q matrices.

Highlighted
##

Thanks! I will try it but I still have the impression that I will need to allocate the full A'PA matrix to store the result of the final multiplication (unless it exist some specific routine for these kind of factorised matrices...)

Stefano_B_

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-29-2016
03:42 PM

9 Views

Highlighted
##

Stefano_B_

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-10-2016
07:27 AM

9 Views

Let us say that I will go for the factorisation and then use a standard MATMUL to compute only the part of A'PA that I need. How do I proceed?

I think I should :

1. use **GEQP3 **to perform the factorisation

2. how do I get the R matrix out of the output to compute R.sqrt(P) ? Manually?

3. how do you suggest to recompute QRs ?

Thanks!

Highlighted
##

Ok, in the end I went for an "easy" even if a bit less efficient solution (I report it here in case it shall be of help for someone). I used DGEMV in a loop to get the elements of my normal matrix : it il indeed less efficient but still I gain a huge time factor by parallelization and I use less than 1/3 of the memory. I'm happy enough with it. ;)

Stefano_B_

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

08-26-2016
05:22 AM

9 Views

For more complete information about compiler optimizations, see our Optimization Notice.