topic RAM Performance Question in Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/RAM-Performance-Question/m-p/960174#M15846
<P>Hello All</P>
<P>I'm building a new workstation for solving a lot of banded matrices. I only need a single solution, and I use LU decomposition of the matrix via the function <EM><STRONG>'cgbtrf'</STRONG></EM>. The matrices are around 300,000 x 1024 and upwards.</P>
<P>For this algorithm, is higher frequency with higher CAS latency better than lower frequency with lower CAS latency? I.e., will the bandwidth or the fetch delay be my bottleneck?</P>
<P>I'm considering 1066 MHz CAS 7 vs. 1333 MHz CAS 9 in 8 GB modules, to have room for expanding beyond my initial 64 GB.</P>
<P>Or will this not affect anything as it will all be bottlenecked by the CPU? (2x E5-2640).</P>
<P>Best regards</P>
<P>Henrik Andresen</P>Mon, 04 Mar 2013 10:37:56 GMT (hareson)
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/RAM-Performance-Question/m-p/960175#M15847
I think you need to be more concerned about the amount of available memory ( 64GB looks very good! ) and the performance of your CPU.
Regarding CAS Latency numbers.
>>For this algorithm, is higher frequency higher CAS latency better than lower frequency lower CAS latency?
This is Not always true ( take a look at a similar thread / see below ).
>>...I.e. is it the bandwidth or the fetch delay that will be my bottleneck?
I think Yes.
>>...I'm considereing 1066 MHz CAS 7 vs. 1333 MHz CAS 9...
Please take a look at a similar thread:
Forum Topic: <STRONG>Laptop SODIMM memory with 9-9-9-24 latency vs. 10-10-10-27 latency ( Non-ECC )</STRONG>
Web-link: software.intel.com/en-us/forums/topic/364897
and follow a couple of <STRONG>Wikipedia</STRONG> links, since they provide more technical details on how DIMMs with different CLs and frequencies can be compared.Mon, 04 Mar 2013 14:08:31 GMT (SKost)
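The two kits in question can be compared directly with the standard DDR arithmetic: the advertised "MHz" figure is the transfer rate, the bus clock is half of it, and the absolute CAS latency in nanoseconds is CL divided by the bus clock. A quick sketch using the module speeds from this thread:

```python
# Compare the two DDR3 options from the thread: absolute CAS latency (ns)
# and peak bandwidth per 64-bit channel. Standard DDR arithmetic: the
# advertised "MHz" is the transfer rate; the bus clock is half of that,
# and CAS latency in ns = CL cycles / bus clock.

def cas_latency_ns(transfer_rate_mhz, cl_cycles):
    bus_clock_mhz = transfer_rate_mhz / 2.0
    return cl_cycles / bus_clock_mhz * 1000.0  # cycles / MHz -> ns

def peak_bandwidth_gb_s(transfer_rate_mhz):
    return transfer_rate_mhz * 1e6 * 8 / 1e9   # 8 bytes per transfer (64-bit channel)

for rate, cl in [(1066, 7), (1333, 9)]:
    print(f"DDR3-{rate} CL{cl}: latency ~{cas_latency_ns(rate, cl):.1f} ns, "
          f"peak ~{peak_bandwidth_gb_s(rate):.1f} GB/s per channel")
```

Both kits come out at roughly 13 ns first-word latency, so the 1333 MHz kit delivers about 25% more peak bandwidth at essentially the same true latency.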
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/RAM-Performance-Question/m-p/960176#M15848
>>...The matrices are around 300,000 x 1024 and upwards...
I've done a quick verification: a matrix with dimensions <STRONG>300000x1024</STRONG> ( let's say a double-precision floating-point type, 8 bytes per element ) "equals" ( in terms of how much memory is needed ) a square matrix with dimensions <STRONG>17527x17527</STRONG>, and it will need ~2.46GB of memory. So it is significantly less than the 64GB of memory available on your computer.
A matrix with dimensions <STRONG>17527x17527</STRONG> could be processed on a computer with ~8GB of memory, but as fast a CPU as possible is needed.Mon, 04 Mar 2013 14:22:32 GMT (SKost)
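The footprint arithmetic is worth writing out. Note that cgbtrf works on single-precision complex data (8 bytes per element), the same element size as the double-precision comparison above, so both estimates agree:

```python
# Rough memory footprint of the matrices discussed above. cgbtrf operates on
# single-precision complex elements (8 bytes each), and the square-matrix
# comparison assumed double precision (also 8 bytes), so the totals match.

BYTES_PER_ELEMENT = 8  # complex64 (cgbtrf) or float64

def matrix_gb(rows, cols, bytes_per_element=BYTES_PER_ELEMENT):
    return rows * cols * bytes_per_element / 1e9

print(matrix_gb(300_000, 1024))   # the banded matrix as stated in the thread
print(matrix_gb(17_527, 17_527))  # the "equivalent" square matrix
```

Both come out to ~2.46 GB, far below the planned 64 GB.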
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/RAM-Performance-Question/m-p/960177#M15849
<P>Hi Sergey</P>
<P>Thank you for your replies.</P>
<P>The reason for the selection of memory is how I expect the LU decomposition to work. The wiki pages state that memory controllers set timings according to CAS latency, but for read/write operations it will still matter unless a lot of the data is cached beforehand. For 1k rows, the data usage will be 8KB per row and ~8MB for all data relevant to the operations on a single matrix, assuming the out-of-reach data has to be read anyway.</P>
<P>In the multi-threaded case the data needs to be shared between CPUs, which makes the data size exceed the cache available on any one CPU. So, depending on the algorithm, it will be one or the other, and I thought I'd ask in case anyone has tested this.</P>
<P>Regarding memory size: I need to keep other data in memory and run multiple cases at once for full CPU utilization, which is where the 64GB will be of benefit.</P>
<P>But as you write, I might be overshadowed by other effects.</P>
<P>Thank you</P>Mon, 04 Mar 2013 15:17:31 GMT (hareson)
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/RAM-Performance-Question/m-p/960178#M15850
>>>>...A matrix with dimensions <STRONG>17527x17527</STRONG> could be processed on a computer with ~8GB of memory but as fast a CPU
>>>>as possible is needed...
I'd like to give a short explanation of why I did that verification: I simply wanted to see how the initial size of your matrix compares to a square matrix, and ~<STRONG>16Kx16K</STRONG> is what I use most of the time.
Now, let's get practical. Here are real numbers for multiplication of <STRONG>16Kx16K</STRONG> matrices using different algorithms:
...
<STRONG>[ Algorithm 1 - Single-threaded - 'double' data type ]</STRONG>
Matrix sizes: 16384x16384
Time to calculate: 3379.5433 sec
...
<STRONG>[ Algorithm 2 - Single-threaded - 'double' data type ]</STRONG>
Matrix sizes: 16384x16384
Time to calculate: 78.4685 sec
...
As you can see, the 2nd algorithm is ~43x faster. Of course, multi-threaded implementations of both algorithms will work faster, but they are both CPU-bound (!).
>>...But as you write, I might be overshadowed by other effects...
I wouldn't worry about CAS latency numbers for the DIMMs because even with the fastest memory, matrix multiplication is more CPU-bound than RAM-bound. In order to get results faster you need to use advanced matrix multiplication algorithms, like:
Strassen - O( n^2.8070 )
Strassen-Winograd - O( n^2.8070 )
Kronecker based ( Tensor Product ) - I don't have an exact asymptotic complexity, it is about ~O( n^2.5 ) ( really fast! )
Coppersmith-Winograd - O( n^2.3760 )
Virginia Vassilevska Williams - O( n^2.3727 )
However, since your matrices are Not square you can't use them directly. For example, the Strassen algorithm requires that both matrices are square.Tue, 05 Mar 2013 01:17:00 GMT (SKost)
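For illustration, one level of Strassen's scheme ( which replaces the 8 block multiplications of the naive recursion with 7, giving the O( n^2.807 ) bound listed above ) can be sketched as follows. This is only a sketch for even-sized square matrices; in practice MKL's optimized gemm routines are the right tool:

```python
import numpy as np

def strassen_step(A, B):
    """One level of Strassen's algorithm for even-sized square matrices:
    7 sub-multiplications instead of the naive 8."""
    n = A.shape[0]
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # The seven Strassen products.
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Recombine into the four quadrants of C = A @ B.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(strassen_step(A, B), A @ B)
```

A full implementation would apply this step recursively (padding odd dimensions), which is also how rectangular inputs are usually handled.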
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/RAM-Performance-Question/m-p/960179#M15851
Take a look ( as soon as you have time ) at two recently created threads related to matrix multiplication:
Forum Topic: <STRONG>AVX performance question</STRONG>
Web-link: software.intel.com/en-us/forums/topic/373607
Forum Topic: <STRONG>Matrix Multiplication</STRONG>
Web-link: software.intel.com/en-us/forums/topic/365581Tue, 05 Mar 2013 01:19:35 GMT (SKost)
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/RAM-Performance-Question/m-p/960180#M15852
<P>>>>I wouldn't worry about CAS latency numbers for the DIMMs because even with the fastest memory, matrix multiplication is more CPU-bound than RAM-bound. In order to get results faster you need to use advanced matrix multiplication algorithms, like:>>></P>
<P>Yes, that is true. It is a better option to invest in a powerful CPU than in CAS 7 vs. CAS 9 latency memory. I suppose the programs limited by memory bandwidth could be described as those with a high ratio of load/store operations to arithmetic.</P>
<P>@hareson</P>
<P>I have found a few links about the impact of memory bandwidth on scientific application</P>
<P>Link: <A href="http://stackoverflow.com/questions/2952277/when-is-a-program-limited-by-the-memory-bandwidth">stackoverflow.com/questions/2952277/when-is-a-program-limited-by-the-memory-bandwidth</A></P>Tue, 05 Mar 2013 06:06:32 GMT (Bernard)
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/RAM-Performance-Question/m-p/960181#M15853
<P>@hareson</P>
<P>There is also the STREAM benchmark, which measures memory bandwidth performance.</P>
<P>Link: <A href="http://www.cs.virginia.edu/stream/ref.html">www.cs.virginia.edu/stream/ref.html</A></P>Tue, 05 Mar 2013 06:21:17 GMT (Bernard)
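STREAM's best-known kernel is the "triad" ( a = b + s*c ). A rough NumPy approximation of it is sketched below; the real benchmark is compiled C/Fortran, so this only gives a ballpark figure:

```python
import time
import numpy as np

# A crude NumPy approximation of STREAM's "triad" kernel (a = b + s*c).
# Arrays are sized well past any cache so the loop streams from DRAM.

N = 10_000_000                     # ~80 MB per float64 array
b = np.random.rand(N)
c = np.random.rand(N)
a = np.empty(N)
s = 3.0

t0 = time.perf_counter()
np.add(b, s * c, out=a)            # triad: one scaled add over three arrays
elapsed = time.perf_counter() - t0

# Count only read b, read c, write a; the s*c temporary adds extra traffic,
# so the reported figure understates the true rate somewhat.
bytes_moved = 3 * N * 8
print(f"~{bytes_moved / elapsed / 1e9:.1f} GB/s (ballpark only)")
```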
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/RAM-Performance-Question/m-p/960183#M15855
<P>Thank you all for your replies. I'll put my money into CPU power instead of super-optimized RAM, then.</P>
<P>Again, thank you.</P>Tue, 05 Mar 2013 09:38:33 GMT (hareson)
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/RAM-Performance-Question/m-p/960184#M15856
<P>>>>Thank you all for your replies. I'll pool my money into CPU power instead of super optimized RAM then>>></P>
<P>You are welcome.</P>Tue, 05 Mar 2013 11:17:25 GMT (Bernard)
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/RAM-Performance-Question/m-p/960185#M15857
>>...Better option is to invest in powerful CPU than in CAS 7 or 9 latency memory...
The price for 16GB of memory is ~100USD, while the price for an upgrade from an Intel Core i7-3840QM to an Intel Core Extreme Edition was ~800USD. The expected performance improvement from the CAS 7 vs. CAS 9 choice is unknown ( I don't expect it is greater than 0.5% ), but the performance improvement with an Intel Core Extreme Edition could be greater than 25%.
It is very important that <STRONG>all DIMMs</STRONG> have the same CAS latency; if DIMMs with different CAS latencies are used in a system, then the slowest ( higher number ) CAS latency will be applied to all of them.Tue, 05 Mar 2013 13:25:00 GMT (SKost)
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/RAM-Performance-Question/m-p/960186#M15858
<P>>>>Price for 16GB of memory is ~100USD and price for the upgrade from Intel Core i7-3840QM to Intel Core Extreme Edition was ~800USD>>></P>
<P>So which option would you choose?</P>Wed, 06 Mar 2013 05:18:22 GMT (Bernard)