I have a program (written in C++ and compiled with Intel Parallel Studio XE 2018) that needs a lot of memory and CPU cores (MPI). During execution, the maximum memory usage for a test input is about 110 GB. When I run it on a server with 512 GB or more, the computation time is 124 min. When I remove 10 of the 12 memory modules, so that total memory drops to 128 GB, the computation time drops to 30 min.
The server is a Lenovo with 2 x Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz, running Windows Server 2016 Standard. The memory modules are 64 GB DDR4 ECC 2666 MHz and passed all tests.
I see the same effect on HP, Dell, and SuperMicro servers.
Can anybody explain this?
This may be of use: https://lenovopress.com/lp0697.pdf
Of particular interest may be Page 9, Socket Interleave.
Because your program uses MPI, you will probably get the best performance with the memory configured as NUMA. If the memory is not set to NUMA, then memory allocations will be distributed across sockets and, depending on the physical placement, your program could have the bad luck of having its heavily used RAM located on the other socket. In other words, the luck of the draw.
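To put a rough number on the "other socket" effect, here is a minimal C++ sketch. It assumes pages are interleaved evenly across sockets, which may not match your BIOS setting. The function names are made up for illustration, and the latencies are placeholders in arbitrary units, not measurements from this server.

```cpp
// Fraction of pages that live on a remote socket when pages are
// interleaved evenly across `sockets` sockets (ASSUMPTION: uniform interleave).
constexpr double remote_fraction(int sockets) {
    return sockets > 1 ? static_cast<double>(sockets - 1) / sockets : 0.0;
}

// Average access latency for a given remote fraction.
// local_latency and remote_latency are hypothetical inputs in the same
// (arbitrary) unit; measure the real values on your own hardware.
constexpr double average_latency(double remote_frac,
                                 double local_latency,
                                 double remote_latency) {
    return (1.0 - remote_frac) * local_latency + remote_frac * remote_latency;
}
```

For a 2-socket machine, `remote_fraction(2)` is 0.5, so under this model about half of all accesses would cross the socket interconnect. With NUMA enabled and each MPI rank allocating on its own node, that fraction can in principle approach zero.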
Another potential cause: depending on the page size and total RAM, the number of TLB entries required during execution may be lower in one case and higher in the other. The TLB (translation lookaside buffer) is part of the virtual-to-physical memory address translation. It is not a cache of the data; think of it as a cache of the page tables. A TLB miss requires accessing the page table(s).
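To give a feel for the scale of the TLB argument, here is a small C++ sketch that counts how many pages a working set needs and how much memory a TLB can cover. It treats the 110 GB from the question as 110 GiB. The 1536-entry TLB size is an assumption for illustration only, so check the figure for your CPU. The helper names are hypothetical.

```cpp
#include <cstdint>

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;
constexpr std::uint64_t GiB = 1024 * MiB;

// Number of pages needed to map `bytes` of memory, rounded up.
constexpr std::uint64_t pages_needed(std::uint64_t bytes, std::uint64_t page_size) {
    return (bytes + page_size - 1) / page_size;
}

// Amount of memory a TLB with `entries` entries can translate
// without a miss (its "reach").
constexpr std::uint64_t tlb_reach(std::uint64_t entries, std::uint64_t page_size) {
    return entries * page_size;
}

constexpr std::uint64_t working_set = 110 * GiB;  // peak usage from the question
constexpr std::uint64_t tlb_entries = 1536;       // ASSUMPTION: illustrative size

// 4 KiB pages: roughly 28.8 million pages to map, 6 MiB of reach.
static_assert(pages_needed(working_set, 4 * KiB) == 28835840, "4 KiB page count");
static_assert(tlb_reach(tlb_entries, 4 * KiB) == 6 * MiB, "4 KiB reach");

// 2 MiB large pages: 56,320 pages to map, 3 GiB of reach.
static_assert(pages_needed(working_set, 2 * MiB) == 56320, "2 MiB page count");
static_assert(tlb_reach(tlb_entries, 2 * MiB) == 3 * GiB, "2 MiB reach");
```

In both cases the working set is far larger than the TLB reach, so the access pattern and page size will largely determine how often the page tables have to be walked. This does not by itself prove that TLB misses explain the slowdown; it only shows why the page size is worth checking.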