Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring
Announcements
This community is designed for sharing of public information. Please do not share Intel or third-party confidential information here.

Memory allocation from a specific MCDRAM block

Gheibi__Sanaz
Beginner

Hi, 

We are using KNL with its MCDRAM configured in "flat" mode. Our question is: when allocating memory from MCDRAM, is there any way we could specify which of the 8 MCDRAM blocks to allocate from?

Thank you very much, 

Sanaz 

1 Solution
McCalpinJohn
Black Belt

From my testing, in "Flat All-to-All" mode, contiguous (cache line) addresses are assigned in a round-robin distribution across the set of 8 EDC controllers.  In "Flat Quadrant" mode, the distribution is more complex.  It looks like the addresses are hashed across the CHAs using the same pseudo-random hash, but then each physical cache line address is mapped to one of the EDCs that happens to be in the same quadrant as the CHA that is servicing the address.   Using the performance counters, it is not difficult to figure out which EDC an address is mapped to, but then you have a list of non-contiguous cache line addresses, not a block of contiguous addresses.

The closest you can get to assigning to specific MCDRAM blocks is to use SNC4 mode. In this mode, the first quarter of the physical address space is interleaved between EDC 0 and EDC 1, the second quarter of the physical address space is interleaved between EDC 2 and EDC 3, etc. You can then use "numactl" or the "libnuma" APIs to allocate from the pair of EDCs in a quadrant. (The "memkind" APIs may also be flexible enough for this case, but I have not used that approach.)
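As a rough sketch of the "numactl" approach: in SNC4 mode the MCDRAM pairs typically show up as CPU-less NUMA nodes 4-7 (the exact node numbering can vary with firmware, so verify it with `numactl --hardware` first). Something like the following should work, assuming that layout and a program named `./my_app`:

```shell
# Inspect the NUMA topology first; in SNC4 mode you should see 8 nodes,
# with nodes 4-7 reporting memory but no CPUs (the MCDRAM EDC pairs).
numactl --hardware

# Bind all allocations of a program to the MCDRAM of one quadrant
# (node 4 here, i.e. one EDC pair, assuming the numbering above):
numactl --membind=4 ./my_app

# Or prefer MCDRAM node 4 but fall back to other nodes if it fills up:
numactl --preferred=4 ./my_app
```

For finer control from inside the program, the equivalent "libnuma" calls are `numa_alloc_onnode(size, 4)` or a `numa_set_membind()` with a nodemask containing only the MCDRAM node you want.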

It is fairly inconvenient to deal with the 8 NUMA nodes in SNC4 mode: NUMA nodes 0-3 have cores and DDR memory, while NUMA nodes 4-7 have only MCDRAM memory. To make it even less convenient, NUMA nodes are based on whole tiles (core pairs), so unless your core count is divisible by 8, the NUMA nodes will have different numbers of cores. On our Xeon Phi 7250 (68-core) parts, NUMA nodes 0-1 have 18 cores each (9 tiles), while NUMA nodes 2-3 have 16 cores each (8 tiles). This asymmetry in core count cancels out most of the (small) performance advantage that comes from the decreased mesh traffic in SNC4 mode. We no longer keep any nodes in SNC4 mode -- most are "Cache Quadrant" mode, with about 5% in "Flat Quadrant" mode.
 
