Memory bandwidth per core and aggregate for Nehalem EP vs Westmere EP
I was wondering if anyone has any performance results for latency and memory bandwidth, per core and aggregate, for Westmere EP vs. Nehalem EP.
In theory, each socket's memory subsystem should deliver ~31.992 GiB/s raw (three channels of DDR3-1333). In practice, however, it is more like 10 GiB/s for Nehalem, and I have seen similar results for Westmere EP. Are there any more detailed results published for Westmere EP?
Does anyone know what changes in the memory subsystem and the Global Queue allow more outstanding memory transactions on Westmere vs. Nehalem, and by how much this may impact performance per core, aggregate per socket, and overall?
Is anyone familiar with the points at which queuing of requests takes place, and how long the queues are?
Any information, or a pointer to this specific type of information, would be appreciated.
As no experts on this subject have responded, I'll point out that to approach rated bandwidth you need the BIOS NUMA mode setting enabled, a balanced number of threads running on each CPU, aligned nontemporal stores, and DDR3-1333.

As far as I know, the biggest improvement in WSM-EP compared to NHM-EP is the support for 2 DDR3-1333 sticks per channel, so, unlike NHM, you can get full memory bandwidth with more than 6 sticks installed. To get full bandwidth on WSM-EP 6-core CPUs, you may need to balance the use of the cache channels which serve 2 cores against those which serve 1 core (by not running more threads than you would on NHM-EP, and figuring out optimum placement). On the WSM-EP I've tested, KMP_AFFINITY="proclist=[2-5,8-11],explicit" appears suitable for 8 threads, with HT disabled.

If there is a BIOS NUMA mode setting, it is likely to be shipped disabled, which accounts for a lot of people getting low bandwidth assessments. The idea of non-NUMA appears to be to limit memory performance to a level which can be achieved without controlling thread affinity. As I remember, the total bandwidth for 2 sockets is about 40% more than for one socket, which seems not to agree with your implication.
So it is possible, with the NUMA BIOS setting ON and by properly aligning threads on cores, to attain > 10 GiB/s per IMC? Is there any documentation with details about these BIOS (or other) settings and how they affect application/system performance?
Here is a report from Fujitsu on the memory bandwidth demonstrated with the various models of WSM-EP and various memory stick configurations. They also discuss some effects of memory interleaving, which is one of the primary considerations for performance of applications which depend on memory performance.

If your machine is set in a "non-NUMA" BIOS configuration, that cuts the peak memory performance by about 40% in return for making it nearly independent of which cores your job is bound (or not bound) to. In this Dell paper they apparently refer to the non-NUMA setting as Node Interleave. Each core would access alternately a cache line on local memory and a cache line on the memory local to the other CPU, giving performance intermediate between what you would get with 100% local memory and 100% remote memory. If you set up NUMA mode but don't bind threads to cores and build your application so that memory access is always local, the NUMA mode would likely give you uncontrolled performance variations.

Here is a note about non-temporal stores (at a fairly high level). You could read about them with more technical detail, and less information about relevance, in the instruction reference. As noted, you generally need non-portable compiler options or pragmas to get them compiled in. They are also necessary to approach rated bandwidth. If you specify non-temporal moves directly at low level, by using intrinsics, you must take care of alignment. The icc #pragma vector nontemporal will make the adjustments by splitting off scalar remainder loops so that aligned non-temporal instructions can be used for the body of the loop, as will current glibc and Intel memcpy() library functions. So you could perform your memory bandwidth measurements with memcpy() if you take care how you do it.

The OS support functions (e.g. glibc) are responsible for "first-touch" allocation, so that memory is used local to the CPU which first accesses it.
Thus, in NUMA mode, the thread which first touches the memory (e.g. during initialization) must be planned to run on the same CPU where all the work will be done. So you would need to run in threaded mode with consistent affinity from beginning to end to get the boost in aggregate bandwidth which 2 CPUs are capable of.
Tim, thanks for these nice pointers and the good discussion on how to extract maximum throughput out of the EP platform.
Do you have any information about how the memory access hardware in the 6-core Westmere-EP platform was modified in order to increase throughput compared to the 4-core Nehalem EP? I saw in IDF presentations that deeper h/w buffering was introduced to that effect, but I could not find the specific location: was it in the core-to-L1 path, L1<->L2, L2<->L3, the Global Queue, or the IMC itself?
This doesn't appear to respond directly to your question, but changes in memory access from NHM-EP to WSM-EP include:
- ring organization of the L3 cache, where each core "owns" a segment of L3
- shared paths from L2 to L3, so as to support 6 cores with the same number of parallel data paths as 4 cores
- support for 2 DDR3-1333 DIMMs per channel (for the specific validated products), where only 1 or 2 OEMs had that for NHM-EP

WSM-EP also brought in a process shrink, allowing 6 cores to run within the power consumption of the previous 4 lower-speed cores. I don't think this forum is well adapted to getting answers on this in more depth.