Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Intel Community
- Software
- Software Development SDKs and Libraries
- Intel® oneAPI Math Kernel Library
- How to optimize A'PA computation for memory use

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Stefano_B_

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-29-2016
03:12 AM

101 Views

How to optimize A'PA computation for memory use

Hi,

I am setting up several weighted normal matrices as A'PA, where A is the first design matrix of size nObs x nPar and P is the weight matrix of size nObs x nObs . A'PA is then a symmetric matrix of size nPar x nPar. I perform this operation using the DSYMM + DGEMM routines.

! Memory allocation IF (ASSOCIATED(AA)) DEALLOCATE(AA) ALLOCATE(AA(NOBS,NPAR),stat=ii) IF (ASSOCIATED(PA)) DEALLOCATE(PA) ALLOCATE(PA(NOBS,NPAR),stat=ii) ! I actually just need the U or L triangular part of it (symmetric) IF (ASSOCIATED(ATPA)) DEALLOCATE(ATPA) ALLOCATE(ATPA(NPAR,NPAR),stat=ii) ! Set up of PA CALL dsymm('L', 'U', nObs, nPar, 1.d0, P_f, nObs, AA, nObs, 0.d0, PA, nObs) ! Setup of A'PA CALL dgemm('T','N',nPar,nPar,nObs,1.d0,AA,nObs,PA,nObs,1.d0,ATPA,nPar) ! Deallocation of AA, PA IF (ASSOCIATED(AA)) DEALLOCATE(AA) IF (ASSOCIATED(PA)) DEALLOCATE(PA)

Now, in my case nPar is as large as 90000 (or more), so that I need to allocate a very large amount of memory for the output (A'PA). In principle, I just need the upper or lower triangular matrix (since it's symmetric) but I cannot find a way to avoid the simultaneous allocation of the 90000x90000 matrix (needed as output by DGEMM) and of the triangular matrix (which is approximately half the size) where I would copy the part I am interested in before deallocating the full one.

Do you have any suggestion or see any option to compute this product using optimized parallel routines without allocating the full A'PA matrix? I checked all routines or packed formats but I cannot find a viable way out of allocating the full matrix at some point.

Thanks!

ps : I have an alternative using the intrinsic MATMUL function and small batches of 5 observations that are then added together in a triangular matrix (allocated and accessed as a vector). Unfortunately, as you can imagine, this is hardly very efficient.

Link Copied

6 Replies

mecej4

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-29-2016
09:21 AM

101 Views

1. Is the weights matrix P a diagonal matrix?

2. What do you intend to do with A^{T}PA, once you have assembled it?

Stefano_B_

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-29-2016
09:32 AM

101 Views

mecej4

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-29-2016
10:33 AM

101 Views

It would be more economical (memory and running time) and numerically stable to

- compute the QR decomposition of A
- compute R
_{s}= R.sqrt(P)

Having done this, note that A^{T}PA = (QR_{s})^{T}(QR_{s}), and that Q is orthogonal.

Please note that computational routines in MKL/Lapack may return Q in a different representation than as a dense matrix (e.g., as a product of Householder reflections), but Lapack provides helper routines to work with such Q matrices.

Stefano_B_

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-29-2016
03:42 PM

101 Views

Stefano_B_

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-10-2016
07:27 AM

101 Views

Let us say that I will go for the factorisation and then use a standard MATMUL to compute only the part of A'PA that I need. How do I proceed?

I think I should :

1. use **GEQP3 **to perform the factorisation

2. how do I get the R matrix out of the output to compute R.sqrt(P) ? Manually?

3. how do you suggest to recompute QRs ?

Thanks!

Stefano_B_

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

08-26-2016
05:22 AM

101 Views

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

For more complete information about compiler optimizations, see our Optimization Notice.