
Memory per Memory Controller

Florian_R_
Beginner

I understand that there are 8 memory controllers (at least to my knowledge) and some amount of memory. Right now I assume that the memory is not partitioned into 8 (gigantic) chunks, but into many small chunks, which are then handled by the various controllers. If this is wrong, please tell me. I assume this partitioning is done for better load balancing (it would make a lot of sense).

Otherwise, the remaining question is: what is the size of such a chunk? Is there a fixed size? If not, what is the order of magnitude of a typical chunk?

I am asking because I am seeing odd behavior with an array: if I insert unnecessary padding elements into the array, overall performance improves. The only explanation I have right now is that the inserted padding helps split the array across several memory controllers. That would be beneficial when the array is read by multiple cores (I run one thread per core), since more memory controllers can service the requests.
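Roughly what I am doing, as a minimal sketch (not my actual code; N, BLOCK, and PAD are placeholder values chosen only for illustration): the same data is summed once from a contiguous array and once from an array with PAD unused elements inserted after every BLOCK elements, with OpenMP running one thread per core.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N      (16 * 1024 * 1024)   /* logical number of double elements  */
#define BLOCK  4096                 /* elements between padding regions   */
#define PAD    64                   /* hypothetical padding per block     */

/* Sum the array in parallel; the iterations are split across threads. */
static double sum_array(const double *a, size_t stride, size_t n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+ : total)
    for (size_t i = 0; i < n; i++)
        total += a[(i / BLOCK) * stride + (i % BLOCK)];
    return total;
}

int main(void)
{
    size_t nblocks = N / BLOCK;
    double *plain  = malloc((size_t)N * sizeof *plain);
    double *padded = malloc(nblocks * (BLOCK + PAD) * sizeof *padded);

    for (size_t i = 0; i < N; i++) {
        plain[i] = 1.0;
        padded[(i / BLOCK) * (BLOCK + PAD) + (i % BLOCK)] = 1.0;
    }

    double t0 = omp_get_wtime();
    double s1 = sum_array(plain, BLOCK, N);          /* contiguous layout */
    double t1 = omp_get_wtime();
    double s2 = sum_array(padded, BLOCK + PAD, N);   /* padded layout     */
    double t2 = omp_get_wtime();

    printf("plain : %.3f s (sum %.0f)\n", t1 - t0, s1);
    printf("padded: %.3f s (sum %.0f)\n", t2 - t1, s2);

    free(plain);
    free(padded);
    return 0;
}
```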

McCalpinJohn
Honored Contributor III

I don't think that Intel has documented the details of the DRAM interleave on Xeon Phi, but it is pretty clear that memory is interleaved at a fairly fine granularity. The finest granularity possible would be one cache line (since that is the minimum burst length from the DRAMs), and single-cache-line granularity is the most common interleaving used on other Intel processors.
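Purely as an illustration (the real mapping is undocumented, and the ECC holes and address remapping discussed below make it more complicated than this), a round-robin interleave across 8 controllers at one-cache-line granularity would assign physical addresses like this:

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE   64u   /* minimum DRAM burst length             */
#define CONTROLLERS   8u   /* number of memory controllers assumed  */

/* Hypothetical round-robin mapping: which controller owns an address. */
static unsigned controller_of(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / CACHE_LINE) % CONTROLLERS);
}

int main(void)
{
    /* Consecutive cache lines land on consecutive controllers. */
    for (uint64_t addr = 0; addr < 10 * CACHE_LINE; addr += CACHE_LINE)
        printf("addr 0x%05llx -> controller %u\n",
               (unsigned long long)addr, controller_of(addr));
    return 0;
}
```

Under a scheme like that, any array much larger than a few cache lines is already spread across all of the controllers, so padding would not be expected to change the controller spread very much; hence the alternative explanations below.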

The "inline" ECC on Xeon Phi probably works best with an interleave that is bigger than one cache line, but smaller than one 4KiB page.  Trying to figure this out is made more complex by the address remapping used to hide the "holes" in the memory where ECC data is stored.
It might be possible to reverse engineer the interleave using the recently published documentation on how to read the performance counters in the Xeon Phi memory controllers, but it would be a tedious bit of work.
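Short of reading the counters, one crude, purely timing-based probe (my own sketch, not something Intel documents) is to sweep power-of-two strides over a large buffer: if some stride concentrates essentially all of the traffic on a single controller, sustained read throughput should dip around a stride equal to the interleave granularity times the number of controllers. Note that the interleave applies to physical addresses, so strides in virtual addresses are only a rough proxy unless large pages are used.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define BUF_BYTES  (512UL * 1024 * 1024)   /* big enough to defeat the caches */
#define REPS       8

int main(void)
{
    volatile unsigned char *buf = malloc(BUF_BYTES);
    memset((void *)buf, 1, BUF_BYTES);

    /* Sweep strides from one cache line (64 B) up to 64 KiB. */
    for (size_t stride = 64; stride <= 64 * 1024; stride *= 2) {
        unsigned long sum = 0;
        double t0 = omp_get_wtime();
        for (int rep = 0; rep < REPS; rep++)
            for (size_t i = 0; i < BUF_BYTES; i += stride)
                sum += buf[i];              /* one cache-line read per step */
        double t1 = omp_get_wtime();
        double reads = (double)REPS * (BUF_BYTES / stride);
        printf("stride %6zu B: %8.1f Mreads/s (checksum %lu)\n",
               stride, reads / (t1 - t0) / 1e6, sum);
    }
    free((void *)buf);
    return 0;
}
```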

Another factor that may contribute to the observed performance differences is the complex variation in latency due to the physical locations of the core making the memory request, the distributed tag directory handling coherence for that particular cache line, and the memory controller owning that cache line.   The ratio of worst-case to best-case memory latency is rather large (about 3:1), and it is certainly conceivable that padding arrays with unused elements could result in changes to the average memory latency.
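To put rough numbers on that last point (the latencies below are made up for illustration; only the roughly 3:1 ratio comes from the observation above): if the near and far cases were about 150 ns and 450 ns, then shifting even 10% of the accesses from far to near would lower the average latency by roughly 30 ns.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical best-case and worst-case miss latencies (about 3:1). */
    const double near_ns = 150.0, far_ns = 450.0;

    /* Suppose the padding shifts the near/far mix from 50/50 to 60/40. */
    double before = 0.5 * near_ns + 0.5 * far_ns;   /* 300 ns */
    double after  = 0.6 * near_ns + 0.4 * far_ns;   /* 270 ns */

    printf("average miss latency: %.0f ns -> %.0f ns\n", before, after);
    return 0;
}
```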

Florian_R_
Beginner

Hi John,

thanks for your answer. Your information matches ours: the granularity is fairly fine. The two bounds you mention (1 cache line <= granularity <= 4 KiB) fit very well with the range I am interested in, i.e. the range in which I could observe this phenomenon.

Thanks again for your answer - it has been really helpful!
