<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic parallel computing &amp; array multiplication problem, any library? in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796571#M461</link>
    <description>&lt;P&gt;Hi there, I'm a newbie in parallel computing, and I have a project that requires rewriting a Fortran program in C/C++ so that it can eventually run on a multi-core Intel CPU + GPU server node, or even a cluster.&lt;/P&gt;&lt;P&gt;The program involves much matrix/multidimensional array multiplication. In Fortran it is easy to do row/column assignment, matrix addition, subtraction, multiplication etc., but not so simple for C - it needs to be enhanced to have functions like Fortran or MATLAB.&lt;/P&gt;&lt;P&gt;So I read a bit about CUDA and OpenCL for GPUs, and OpenMP and MPI for parallel machines. These are only low-level APIs and don't have the functions above. If I write such libraries for C myself, I'm afraid they won't be reliable or efficient enough, so I wonder if there are any open-source libraries. I saw that NVIDIA CUDA topics recommended a library called ArrayFire, but it is commercial. I have tried GSL and have also heard about MTL and Octave. Which is the best option? Or Intel MKL?&lt;BR /&gt;&lt;BR /&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Tue, 06 Mar 2012 16:33:15 GMT</pubDate>
    <dc:creator>zlzlzlz7</dc:creator>
    <dc:date>2012-03-06T16:33:15Z</dc:date>
    <item>
      <title>parallel computing &amp; array multiplication problem, any library?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796571#M461</link>
      <description>&lt;P&gt;Hi there, I'm a newbie in parallel computing, and I have a project that requires rewriting a Fortran program in C/C++ so that it can eventually run on a multi-core Intel CPU + GPU server node, or even a cluster.&lt;/P&gt;&lt;P&gt;The program involves much matrix/multidimensional array multiplication. In Fortran it is easy to do row/column assignment, matrix addition, subtraction, multiplication etc., but not so simple for C - it needs to be enhanced to have functions like Fortran or MATLAB.&lt;/P&gt;&lt;P&gt;So I read a bit about CUDA and OpenCL for GPUs, and OpenMP and MPI for parallel machines. These are only low-level APIs and don't have the functions above. If I write such libraries for C myself, I'm afraid they won't be reliable or efficient enough, so I wonder if there are any open-source libraries. I saw that NVIDIA CUDA topics recommended a library called ArrayFire, but it is commercial. I have tried GSL and have also heard about MTL and Octave. Which is the best option? Or Intel MKL?&lt;BR /&gt;&lt;BR /&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 06 Mar 2012 16:33:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796571#M461</guid>
      <dc:creator>zlzlzlz7</dc:creator>
      <dc:date>2012-03-06T16:33:15Z</dc:date>
    </item>
    <item>
      <title>parallel computing &amp; array multiplication problem, any library?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796572#M462</link>
      <description>In Intel Fortran, the option /Qopt-matmul should automatically substitute the MKL BLAS ?gemm function where appropriate to implement Fortran MATMUL. In C or C++, you can call ?gemm directly as an F77 function or indirectly through the CBLAS interfaces. There's not much point (at least not from a performance point of view) in packaged functions for matrix addition/subtraction. If the problem is suited to threaded parallelism, either OpenMP or Cilk+ can do the job easily. Unless your problem is so big that it's advantageous to spread it across a cluster, MPI doesn't fit well with basic matrix algebra.&lt;BR /&gt;For the operations you are talking about, MATLAB is simply another user interface for the MKL library (when run on the appropriate platforms).&lt;BR /&gt;For CUDA, a usual tactic for matrix multiply is to call into the cuBLAS library in similar fashion. To some extent, this avoids spending excessive effort on non-portable code.&lt;BR /&gt;Other libraries you mention evidently have devoted users, but not simply on account of basic matrix algebra.</description>
      <pubDate>Tue, 06 Mar 2012 19:59:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796572#M462</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-03-06T19:59:14Z</dc:date>
    </item>
    <item>
      <title>Parallel computing &amp; array multiplication problem, any library?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796573#M463</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1331096659343="58" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=471386" href="https://community.intel.com/en-us/profile/471386/" class="basic"&gt;zlzlzlz7&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;I&gt;...The program involves much &lt;STRONG&gt;matrix/multidimensional array multiplication&lt;/STRONG&gt;. In Fortran it is easy to do row/column assignment, matrix addition, subtraction, multiplication etc., &lt;STRONG&gt;but not so simple for C&lt;/STRONG&gt;...&lt;BR /&gt;&lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;It depends on the algorithm(s) you're going to select. &lt;STRONG&gt;Classic&lt;/STRONG&gt; algorithms for matrix multiplication, addition,&lt;BR /&gt;etc. are very simple when implemented in &lt;STRONG&gt;C&lt;/STRONG&gt; or &lt;STRONG&gt;C++&lt;/STRONG&gt;. Performance of these algorithms goes up as soon&lt;BR /&gt;as loops are unrolled, or &lt;STRONG&gt;SSE&lt;/STRONG&gt; is used, or calculations are done in parallel, or the priority of the process is boosted to&lt;BR /&gt;realtime. Once again, speaking about implementation complexity, they are simple.&lt;BR /&gt;&lt;BR /&gt;Everything changes when the size of a matrix is greater than &lt;STRONG&gt;1024x1024&lt;/STRONG&gt;. In that case &lt;STRONG&gt;Strassen&lt;/STRONG&gt;-like&lt;BR /&gt;algorithms &lt;STRONG&gt;must be used&lt;/STRONG&gt; and &lt;SPAN style="text-decoration: underline;"&gt;performance gains are significant&lt;/SPAN&gt; because the time complexity of the&lt;BR /&gt;&lt;STRONG&gt;Strassen&lt;/STRONG&gt;-like algorithms is better than that of the &lt;STRONG&gt;Classic&lt;/STRONG&gt; one. 
They multiply matrices significantly faster but it&lt;BR /&gt;comes at a price, that is, the implementation complexity of the &lt;STRONG&gt;Strassen&lt;/STRONG&gt;-like algorithms in &lt;STRONG&gt;C&lt;/STRONG&gt; or &lt;STRONG&gt;C++&lt;/STRONG&gt; is higher.&lt;BR /&gt;&lt;BR /&gt;Here is some information regarding &lt;STRONG&gt;Time Complexity&lt;/STRONG&gt; for Matrix Multiplication Algorithms:&lt;BR /&gt;&lt;BR /&gt; Virginia Vassilevska Williams...O( n^&lt;STRONG&gt;2.3727&lt;/STRONG&gt; )&lt;BR /&gt; Coppersmith-Winograd..............O( n^&lt;STRONG&gt;2.3760&lt;/STRONG&gt; )&lt;BR /&gt; Strassen........................................O( n^&lt;STRONG&gt;2.8070&lt;/STRONG&gt; )&lt;BR /&gt; Strassen-Winograd.....................O( n^&lt;STRONG&gt;2.8070&lt;/STRONG&gt; )&lt;BR /&gt; Classic...........................................O( n^&lt;STRONG&gt;3.0000&lt;/STRONG&gt; )&lt;BR /&gt;&lt;BR /&gt;Let me know if you would like to see performance numbers, &lt;STRONG&gt;Classic&lt;/STRONG&gt; vs. &lt;STRONG&gt;Strassen&lt;/STRONG&gt; ( Recursive Heap-Based Complete).&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Sergey&lt;/P&gt;</description>
      <pubDate>Wed, 07 Mar 2012 05:42:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796573#M463</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-07T05:42:14Z</dc:date>
    </item>
    <item>
      <title>parallel computing &amp; array multiplication problem, any library?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796574#M464</link>
      <description>&lt;P&gt;To TimP (Intel),&lt;BR /&gt;&lt;BR /&gt;Thank you very much for the reply!&lt;/P&gt;&lt;P&gt;I'm sure Intel MKL is a powerful tool, and with BLAS as a base, the performance must be very good. But as you said, there are user interfaces. I noticed that many math libraries require the user to define a matrix struct or something. My project is on physics, and I'm not a professional, so I really don't wish to write many for loops to manipulate a matrix.&lt;/P&gt;&lt;P&gt;In case I haven't explained myself well, here is what I was going to do with some packaged functions:&lt;/P&gt;&lt;P&gt;e.g. 1&lt;/P&gt;&lt;P&gt;Fortran&lt;/P&gt;&lt;P&gt;mat(:,1) = 0. !set the 1st column of matrix mat to all 0s (1-index)&lt;/P&gt;&lt;P&gt;C&lt;/P&gt;&lt;P&gt;vector_set_zero(vecttemp);//get a zeros vector&lt;/P&gt;&lt;P&gt;matrix_set_col(mat, 0, vecttemp);//set the 1st column of matrix mat to all 0s (0-index)&lt;/P&gt;&lt;P&gt;e.g. 2&lt;/P&gt;&lt;P&gt;Fortran&lt;/P&gt;&lt;P&gt;vecttemp=matmul(mat,vect); !built-in function of fortran, matrix multiplication&lt;/P&gt;&lt;P&gt;C&lt;/P&gt;&lt;P&gt;blas_dgemv (CblasNoTrans, 1.0, mat, vect, 0.0, vecttemp);//using cblas&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;So is there something for C/C++ to make it as easy to use as Fortran or MATLAB, yet still have high performance? Or some higher-level APIs for MKL?&lt;/P&gt;</description>
      <pubDate>Wed, 07 Mar 2012 12:46:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796574#M464</guid>
      <dc:creator>zlzlzlz7</dc:creator>
      <dc:date>2012-03-07T12:46:00Z</dc:date>
    </item>
    <item>
      <title>Parallel computing &amp; array multiplication problem, any library?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796575#M465</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1331124551620="59" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=353541" href="https://community.intel.com/en-us/profile/353541/" class="basic"&gt;Sergey Kostrov&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;I&gt;&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1331124551620="60" jquery1331096659343="58" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=471386" href="https://community.intel.com/en-us/profile/471386/" class="basic"&gt;zlzlzlz7&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;I&gt;...The program involves much &lt;STRONG&gt;matrix/multidimensional array multiplication&lt;/STRONG&gt;. In Fortran it is easy to do row/column assignment, matrix addition, subtraction, multiplication etc., &lt;STRONG&gt;but not so simple for C&lt;/STRONG&gt;...&lt;BR /&gt;&lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;It depends on the algorithm(s) you're going to select. &lt;STRONG&gt;Classic&lt;/STRONG&gt; algorithms for matrix multiplication, addition,&lt;BR /&gt;...&lt;/P&gt;&lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;Also thanks a lot for the reply!&lt;/P&gt;&lt;P&gt;Optimization at the algorithm level is also our concern. The serial program for the 1-dimensional physical model takes only several seconds, but for 3 dimensions it takes many hours. As the 3D model isn't finished yet, I'm not sure if the matrix size will be very large. The direction on data size and algorithm is very helpful!&lt;/P&gt;</description>
      <pubDate>Wed, 07 Mar 2012 12:50:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796575#M465</guid>
      <dc:creator>zlzlzlz7</dc:creator>
      <dc:date>2012-03-07T12:50:29Z</dc:date>
    </item>
    <item>
      <title>Parallel computing &amp; array multiplication problem, any library?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796576#M466</link>
      <description>&amp;gt;&amp;gt;but for 3 dimensions it takes many hours.&lt;BR /&gt;&lt;BR /&gt;With RAM abundant, consider maintaining the portion of the 3D model that is used in matrix multiplication as three separate copies: one with X as the minor index, one with Y as the minor index, and one with Z as the minor index (yes, three instances of those data).&lt;BR /&gt;&lt;BR /&gt;What you will be trading off is 3 writes/cell against the product of the size of each dimension, e.g. 3 writes against 1000x1000x1000 reads per (output) cell (for a cube with a side of 1000).&lt;BR /&gt;&lt;BR /&gt;With the rotated arrays you can now perform DOT products of sequential data (memory access friendly).&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 07 Mar 2012 13:56:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796576#M466</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2012-03-07T13:56:18Z</dc:date>
    </item>
    <item>
      <title>Parallel computing &amp; array multiplication problem, any library?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796577#M467</link>
      <description>You may say there is plenty of address space if you use a 64-bit OS, but you may find that Fortran MATMUL incurs significantly more cache misses than BLAS dgemv, due to allocating an additional array for the result of MATMUL. gfortran will avoid making a temporary when the MATMUL result is assigned directly to an array; other compilers may not.&lt;BR /&gt;I don't know what you consider easy to do in C or C++, but you can zero out a single column of a matrix stored as BLAS expects with a single memset() or fill(), or with an equivalent for(). It's up to you if you want to hide it under another name so as to avoid the standard library function names. I'm not one to advocate the idea that C or C++ can be made as easy to use as Fortran, when taking performance into account, but I don't favor implicit style rules which make one or the other more difficult.&lt;BR /&gt;</description>
      <pubDate>Wed, 07 Mar 2012 14:34:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796577#M467</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-03-07T14:34:30Z</dc:date>
    </item>
    <item>
      <title>Parallel computing &amp; array multiplication problem, any library?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796578#M468</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1331130039500="60" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=471386" href="https://community.intel.com/en-us/profile/471386/" class="basic"&gt;zlzlzlz7&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;I&gt;&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;...&lt;BR /&gt;As the 3D model isn't finished yet, I'm not sure if &lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;the matrix size will be very large&lt;/SPAN&gt;&lt;/STRONG&gt;.&lt;BR /&gt;The direction on data size and algorithm is very helpful!&lt;BR /&gt;...&lt;/DIV&gt;&lt;/DIV&gt;&lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;What is the matrix size?&lt;BR /&gt;What precision do you need, that is, &lt;STRONG&gt;single&lt;/STRONG&gt; or &lt;STRONG&gt;double&lt;/STRONG&gt;?&lt;BR /&gt;&lt;BR /&gt;Also, take into account the limitations of the &lt;STRONG&gt;IEEE 754 Standard&lt;/STRONG&gt;. In the case of matrix multiplication there are some&lt;BR /&gt;accuracy issues even with small matrices. I could give an example of a rounding problem when two &lt;STRONG&gt;8x8&lt;/STRONG&gt; matrices are multiplied.&lt;/P&gt;</description>
      <pubDate>Wed, 07 Mar 2012 14:48:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796578#M468</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-07T14:48:01Z</dc:date>
    </item>
    <item>
      <title>Parallel computing &amp; array multiplication problem, any library?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796579#M469</link>
      <description>&lt;P&gt;This is not related to the subject of your post. It is simply an example of a rounding problem&lt;BR /&gt;with the single-precision 'float' data type.&lt;/P&gt;&lt;P&gt;// Matrix A - 8x8( 'float' type ):&lt;/P&gt;&lt;P&gt; 101.0 201.0 301.0 401.0 501.0 601.0 701.0 801.0&lt;BR /&gt; 901.0 1001.0 1101.0 1201.0 1301.0 1401.0 1501.0 1601.0&lt;BR /&gt; 1701.0 1801.0 1901.0 2001.0 2101.0 2201.0 2301.0 2401.0&lt;BR /&gt; 2501.0 2601.0 2701.0 2801.0 2901.0 3001.0 3101.0 3201.0&lt;BR /&gt; 3301.0 3401.0 3501.0 3601.0 3701.0 3801.0 3901.0 4001.0&lt;BR /&gt; 4101.0 4201.0 4301.0 4401.0 4501.0 4601.0 4701.0 4801.0&lt;BR /&gt; 4901.0 5001.0 5101.0 5201.0 5301.0 5401.0 5501.0 5601.0&lt;BR /&gt; 5701.0 5801.0 5901.0 6001.0 6101.0 6201.0 6301.0 6401.0&lt;/P&gt;&lt;P&gt;// Matrix B - 8x8( 'float' type ):&lt;/P&gt;&lt;P&gt; 101.0 201.0 301.0 401.0 501.0 601.0 701.0 801.0&lt;BR /&gt; 901.0 1001.0 1101.0 1201.0 1301.0 1401.0 1501.0 1601.0&lt;BR /&gt; 1701.0 1801.0 1901.0 2001.0 2101.0 2201.0 2301.0 2401.0&lt;BR /&gt; 2501.0 2601.0 2701.0 2801.0 2901.0 3001.0 3101.0 3201.0&lt;BR /&gt; 3301.0 3401.0 3501.0 3601.0 3701.0 3801.0 3901.0 4001.0&lt;BR /&gt; 4101.0 4201.0 4301.0 4401.0 4501.0 4601.0 4701.0 4801.0&lt;BR /&gt; 4901.0 5001.0 5101.0 5201.0 5301.0 5401.0 5501.0 5601.0&lt;BR /&gt; 5701.0 5801.0 5901.0 6001.0 6101.0 6201.0 6301.0 6401.0&lt;/P&gt;&lt;P&gt;// Matrix C = Matrix A * Matrix B( 8x8 - 'float' type ):&lt;/P&gt;&lt;P&gt; 13826808.0 14187608.0 14548408.0 14909208.0 15270008.0 15630808.0 15991608.0 16352408.0&lt;BR /&gt; 32393208.0 33394008.0 34394808.0 35395608.0 36396408.0 37397208.0 38398008.0 39398808.0&lt;BR /&gt; &lt;SPAN style="text-decoration: underline;"&gt;50959604.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;52600404.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;54241204.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;55882004.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: 
underline;"&gt;57522804.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;59163604.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;60804404.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;62445204.0&lt;/SPAN&gt;&lt;BR /&gt; 69526008.0 71806808.0 74087608.0 76368408.0 78649208.0 80930008.0 83210808.0 85491608.0&lt;BR /&gt; 88092408.0 91013208.0 93934008.0 96854808.0 99775608.0 102696408.0 105617208.0 108538008.0&lt;BR /&gt;106658808.0 110219608.0 113780408.0 117341208.0 120902008.0 124462808.0 128023608.0 131584408.0&lt;BR /&gt;125225208.0 129426008.0 133626808.0 &lt;SPAN style="text-decoration: underline;"&gt;137827616.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;142028400.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;146229216.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;150430000.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;154630816.0&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="text-decoration: underline;"&gt;143791600.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;148632416.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;153473200.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;158314016.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;163154800.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;167995616.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;172836416.0&lt;/SPAN&gt; &lt;SPAN style="text-decoration: underline;"&gt;177677200.0&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I underlined all incorrect values, like '&lt;SPAN style="text-decoration: underline;"&gt;50959604.0&lt;/SPAN&gt;', which due to rounding is &lt;SPAN style="text-decoration: underline;"&gt;not&lt;/SPAN&gt; '50959608.0'.&lt;BR /&gt;&lt;BR /&gt;In all cases the last two digits of any value in Matrix C must be '...&lt;STRONG&gt;08&lt;/STRONG&gt;'. 
It can't be '...&lt;STRONG&gt;00&lt;/STRONG&gt;', or '...&lt;STRONG&gt;04&lt;/STRONG&gt;', or '...&lt;STRONG&gt;16&lt;/STRONG&gt;'.&lt;/P&gt;&lt;P&gt;The test case is reproducible on many platforms when compiled with different C/C++ compilers and&lt;BR /&gt;default Floating-Point Unit settings.&lt;/P&gt;</description>
      <pubDate>Thu, 08 Mar 2012 14:29:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796579#M469</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-08T14:29:10Z</dc:date>
    </item>
    <item>
      <title>Parallel computing &amp; array multiplication problem, any library?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796580#M470</link>
      <description>Sergey,&lt;BR /&gt;&lt;BR /&gt;float (32-bit) has a 23-bit fraction with 24 bits of precision, resulting in slightly more than 7 digits of precision. The multiplications you are performing require slightly more than 8 digits of precision (IOW the mantissa would require another 4 bits of precision: a 27-bit fraction with implied 1).&lt;BR /&gt;&lt;BR /&gt;The results of the 8x8 multiplication are acceptable assuming that you can accept ~7.1 digits of accuracy. If not, then you must use a floating-point format that uses more bits in the mantissa (the 64-bit format uses a 52-bit mantissa, providing 53 bits of precision).&lt;BR /&gt;&lt;BR /&gt;Exponent overflow/underflow is a separate issue.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Thu, 08 Mar 2012 16:49:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796580#M470</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2012-03-08T16:49:33Z</dc:date>
    </item>
    <item>
      <title>Parallel computing &amp; array multiplication problem, any library?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796581#M471</link>
      <description>Jim,&lt;BR /&gt;&lt;BR /&gt;All these details are not new to me. That test case I provided caused some "&lt;SPAN style="text-decoration: underline;"&gt;chaos&lt;/SPAN&gt;" on a project back in&lt;BR /&gt;September 2011. That is, a complete code review, re-testing, etc., of a matrix multiplication subsystem based&lt;BR /&gt;on a &lt;STRONG&gt;Strassen&lt;/STRONG&gt; algorithm. As soon as the problem was understood, a set of changes / recommendations,&lt;BR /&gt;like using the '&lt;STRONG&gt;double&lt;/STRONG&gt;' type instead of the '&lt;STRONG&gt;float&lt;/STRONG&gt;' type to improve accuracy of computations, was made.&lt;BR /&gt;&lt;BR /&gt;Unfortunately, as soon as the '&lt;STRONG&gt;double&lt;/STRONG&gt;' type is used, twice as much memory is needed for a matrix. The situation&lt;BR /&gt;escalates on &lt;STRONG&gt;32-bit&lt;/STRONG&gt; platforms when matrices of &lt;STRONG&gt;1024x1024&lt;/STRONG&gt; or greater dimensions need to be multiplied.&lt;BR /&gt;&lt;BR /&gt;Thanks for the feedback.&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Sergey&lt;BR /&gt;</description>
      <pubDate>Fri, 09 Mar 2012 00:56:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796581#M471</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-09T00:56:43Z</dc:date>
    </item>
    <item>
      <title>Parallel computing &amp; array multiplication problem, any library?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796582#M472</link>
      <description>Round-off error is almost a rule, except for specialized input data.&lt;BR /&gt;&lt;BR /&gt;1024x1024 requires 1024 additions along each DOT product. Assuming even distribution of (integer) numbers, the average product will expand 1024x. IOW 10 bits (on average) will be required of the result over the average bits of the products going into the DOT product. Should 32 bits be used as unsigned int, then statistically the numbers in the cells prior to the multiplication can be expressed using 11 bits. These are average numbers.&lt;BR /&gt;&lt;BR /&gt;Assuming double with 53 bits of precision (52+1), then (on average), 1024x1024 requires&lt;BR /&gt;&lt;BR /&gt;10 bits for accumulation&lt;BR /&gt;leaving 42 bits for products (on average)&lt;BR /&gt;and limiting terms for product to ~21 bits.&lt;BR /&gt;Or just over 6 digits of accuracy in the input data for multiplications containing worst case (valid) data.&lt;BR /&gt;&lt;BR /&gt;Even doubles may not maintain the accuracy you desire.&lt;BR /&gt;i.e. do not expect 15 digits of precision using doubles for a 1024x1024 mat mul.&lt;BR /&gt;Best cases may approach that precision, worst case may be on the order of 6 digits.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Fri, 09 Mar 2012 13:56:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796582#M472</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2012-03-09T13:56:21Z</dc:date>
    </item>
    <item>
      <title>Parallel computing &amp; array multiplication problem, any library?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796583#M473</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1331357215265="58" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=471386" href="https://community.intel.com/en-us/profile/471386/" class="basic"&gt;zlzlzlz7&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;EM&gt;...I really &lt;SPAN style="text-decoration: underline;"&gt;don't wish to write many for loops&lt;/SPAN&gt; to manipulate a matrix...&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt; [&lt;STRONG&gt;SergeyK&lt;/STRONG&gt;] In the case of a Classic matrix multiplication &lt;SPAN style="text-decoration: underline;"&gt;three&lt;/SPAN&gt; loops are needed.&lt;BR /&gt; In the case of an element-by-element addition or subtraction &lt;SPAN style="text-decoration: underline;"&gt;one&lt;/SPAN&gt; loop is needed.&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;...So is there something for C/C++ to make it as easy to use as Fortran or MATLAB, yet still have high performance? Or some higher-level APIs&lt;BR /&gt;for MKL?...&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt; [&lt;STRONG&gt;SergeyK&lt;/STRONG&gt;] Another option to consider is the &lt;STRONG&gt;IPP&lt;/STRONG&gt; library. The library has &lt;STRONG&gt;Matrix Processing&lt;/STRONG&gt; and &lt;STRONG&gt;Vector&lt;BR /&gt; Math&lt;/STRONG&gt; domains. All functions in &lt;STRONG&gt;IPP&lt;/STRONG&gt; are highly optimized and it is really hard to outperform them&lt;BR /&gt; with regular &lt;STRONG&gt;C/C++&lt;/STRONG&gt; functions.&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Sergey&lt;/P&gt;</description>
      <pubDate>Sat, 10 Mar 2012 05:42:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/parallel-computing-array-multiplication-problem-any-library/m-p/796583#M473</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-10T05:42:34Z</dc:date>
    </item>
  </channel>
</rss>

