<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Ok, in the end I went for an  in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110426#M24323</link>
    <description>&lt;P&gt;Ok, in the end I went for an "easy" even if a bit less efficient solution (I report it here in case it shall be of help for someone). I used DGEMV in a loop to get the elements of my normal matrix : it il indeed less efficient but still I gain a huge time factor by parallelization and I use less than 1/3 of the memory. I'm happy enough with it. ;)&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 26 Aug 2016 12:22:00 GMT</pubDate>
    <dc:creator>Stefano_B_</dc:creator>
    <dc:date>2016-08-26T12:22:00Z</dc:date>
    <item>
      <title>How to optimize A'PA computation for memory use</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110420#M24317</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I am setting up several weighted normal matrices as A'PA, where A is the first design matrix of size nObs x nPar and P is the weight matrix of size nObs x nObs . A'PA is then a symmetric matrix of size nPar x nPar. I perform this operation using the DSYMM + DGEMM routines.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;! Memory allocation
       IF (ASSOCIATED(AA))    DEALLOCATE(AA)
       ALLOCATE(AA(NOBS,NPAR),stat=ii)

       IF (ASSOCIATED(PA))    DEALLOCATE(PA)
       ALLOCATE(PA(NOBS,NPAR),stat=ii)

! I actually just need the U or L triangular part of it (symmetric)
       IF (ASSOCIATED(ATPA))    DEALLOCATE(ATPA)
       ALLOCATE(ATPA(NPAR,NPAR),stat=ii)

! Set up of PA
       CALL dsymm('L', 'U', nObs, nPar, 1.d0, P_f, nObs, AA, nObs, 0.d0, PA, nObs)

! Setup of A'PA
       CALL dgemm('T','N',nPar,nPar,nObs,1.d0,AA,nObs,PA,nObs,1.d0,ATPA,nPar)

! Deallocation of AA, PA
    IF (ASSOCIATED(AA))    DEALLOCATE(AA)
    IF (ASSOCIATED(PA))    DEALLOCATE(PA)
&lt;/PRE&gt;

&lt;P&gt;Now, in my case nPar is as large as 90000 (or more), so that I need to allocate a very large amount of memory for the output (A'PA). In principle, I just need the upper or lower triangular matrix (since it's symmetric) but I cannot find a way to avoid the simultaneous allocation of the 90000x90000 matrix (needed as output by DGEMM) and of the triangular matrix (which is approximately half the size) where I would copy the part I am interested in before deallocating the full one.&lt;/P&gt;

&lt;P&gt;Do you have any suggestion or see any option to compute this product using optimized parallel routines without allocating the full A'PA matrix? I checked all routines or packed formats but I cannot find a viable way out of allocating the full matrix at some point.&lt;/P&gt;

&lt;P&gt;Thanks!&lt;/P&gt;

&lt;P&gt;ps : I have an alternative using the intrinsic MATMUL function and small batches of 5 observations that are then added together in a triangular matrix (allocated and accessed as a vector). Unfortunately, as you can imagine, this is hardly very efficient.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 29 Apr 2016 10:12:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110420#M24317</guid>
      <dc:creator>Stefano_B_</dc:creator>
      <dc:date>2016-04-29T10:12:29Z</dc:date>
    </item>
    <item>
      <title>1. Is the weights matrix P</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110421#M24318</link>
      <description>&lt;P&gt;1. Is the weights matrix P a diagonal matrix?&lt;/P&gt;

&lt;P&gt;2. What do you intend to do with A&lt;SUP&gt;T&lt;/SUP&gt;PA, once you have assembled it?&lt;/P&gt;</description>
      <pubDate>Fri, 29 Apr 2016 16:21:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110421#M24318</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2016-04-29T16:21:00Z</dc:date>
    </item>
    <item>
      <title>Hi!</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110422#M24319</link>
      <description>Hi! 
1. Yes it is.
2. I will store it in a file (together with the respective r.h.s.) to then combine it with others, exchange it, ... and eventually solve it (but only at the very end).

Thanks!</description>
      <pubDate>Fri, 29 Apr 2016 16:32:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110422#M24319</guid>
      <dc:creator>Stefano_B_</dc:creator>
      <dc:date>2016-04-29T16:32:00Z</dc:date>
    </item>
    <item>
      <title>It would be more economical</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110423#M24320</link>
      <description>&lt;P&gt;It would be more economical (memory and running time) and numerically stable to&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;compute the QR decomposition of A&lt;/LI&gt;
	&lt;LI&gt;compute R&lt;SUB&gt;s&lt;/SUB&gt; = R.sqrt(P)&lt;/LI&gt;
&lt;/OL&gt;

&lt;P&gt;Having done this, note that A&lt;SUP&gt;T&lt;/SUP&gt;PA = (QR&lt;SUB&gt;s&lt;/SUB&gt;)&lt;SUP&gt;T&lt;/SUP&gt;(QR&lt;SUB&gt;s&lt;/SUB&gt;), and that Q is orthogonal.&lt;/P&gt;

&lt;P&gt;Please note that computational routines in MKL/Lapack may return Q in a different representation than as a dense matrix (e.g., as a product of Householder reflections), but Lapack provides helper routines to work with such Q matrices.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Apr 2016 17:33:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110423#M24320</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2016-04-29T17:33:00Z</dc:date>
    </item>
    <item>
      <title>Thanks! I will try it but I</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110424#M24321</link>
      <description>Thanks! I will try it but I still have the impression that I will need to allocate the full A'PA matrix to store the result of the final multiplication (unless it exist some specific routine for these kind of factorised matrices...)</description>
      <pubDate>Fri, 29 Apr 2016 22:42:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110424#M24321</guid>
      <dc:creator>Stefano_B_</dc:creator>
      <dc:date>2016-04-29T22:42:18Z</dc:date>
    </item>
    <item>
      <title>Let us say that I will go for</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110425#M24322</link>
      <description>&lt;P&gt;Let us say that I will go for the factorisation and then use a standard MATMUL to compute only the part of A'PA that I need. How do I proceed?&lt;/P&gt;

&lt;P&gt;I think I should :&lt;/P&gt;

&lt;P&gt;1. use &lt;B&gt;GEQP3 &lt;/B&gt;to perform the factorisation&lt;/P&gt;

&lt;P&gt;2. how do I get the R matrix out of the output to compute R.sqrt(P) ? Manually?&lt;/P&gt;

&lt;P&gt;3. how do you suggest to recompute QRs ?&lt;/P&gt;

&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 14:27:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110425#M24322</guid>
      <dc:creator>Stefano_B_</dc:creator>
      <dc:date>2016-05-10T14:27:07Z</dc:date>
    </item>
    <item>
      <title>Ok, in the end I went for an</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110426#M24323</link>
      <description>&lt;P&gt;Ok, in the end I went for an "easy" even if a bit less efficient solution (I report it here in case it shall be of help for someone). I used DGEMV in a loop to get the elements of my normal matrix : it il indeed less efficient but still I gain a huge time factor by parallelization and I use less than 1/3 of the memory. I'm happy enough with it. ;)&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 26 Aug 2016 12:22:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-optimize-A-PA-computation-for-memory-use/m-p/1110426#M24323</guid>
      <dc:creator>Stefano_B_</dc:creator>
      <dc:date>2016-08-26T12:22:00Z</dc:date>
    </item>
  </channel>
</rss>

