Today I have an interesting question about your practical experience (not only theoretical improvement) with the code performance difference between SSE4/AVX on a dual-channel memory board and AVX2 on a quad-channel memory board.
I will try to formulate my question: we have some old but already highly optimized SSE4 algorithms, and we use cheap dual-channel memory boards. That is no longer sufficient, and we are thinking about using quad-channel boards. Most tests (with the exception of SiSoft Sandra) and results on the internet show that there is practically no difference between quad-channel and dual-channel (as I remember, the speedup from single-channel to dual-channel was about 2x). But I don't know whether such tests are really representative, because I don't know if they are optimized to use quad-channel (256-bit) memory access.
What I am wondering, and this is exactly my question: do AVX2 processors have double the number of "internal"/"external" access lines to L1/L2/L3/DRAM compared to the pre-AVX2 world? If so, does dual-channel memory help speed up algorithms that are ported to AVX2, or is it the same speedup as with SSE4? I ask because normally we go through the cache, and a cache line is 64 bytes (2x the quad-channel width).
What do you think about the following: an AVX algorithm has approx. 20% load instructions (with high probability the data is in cache, at worst in the L3 cache), 2-4% aligned writes with the NT hint, and the rest is calculation (mul, add, sub, min, max; no or very few shifts). How much can such an algorithm speed up if ported to AVX2? 10%, 20%, ... 50%, 100%? What is the load latency (memory is properly aligned!) of a YMM register in comparison to XMM? If they are the same, the speedup should be about 100%, but I don't trust this too much. What do you think?
P.S. The algorithm is parallelized 2x (or later 4x) without dependencies, and each "feed" works on completely separate memory areas.
Many thanks in advance!
Dual-channel and quad-channel systems from Intel use a completely different "uncore" implementation, which leads to a variety of performance differences in different directions. The details depend on the processor generation, but in general a quad-channel system will have slightly higher memory latency and often slightly lower single-core memory bandwidth, in exchange for the capability of higher aggregate bandwidth when using multiple cores (typically 4 or more).
The bandwidth between cache levels was increased with the Haswell core that also introduced the AVX2 extensions. This can help application performance for cases with good L2 or L3 cache hit statistics.
If you have good skills with low-level programming, you might look at the memory controller counters on the dual-channel systems that are described at https://software.intel.com/en-us/articles/monitoring-integrated-memory-controller-requests-in-the-2n... These counters should be able to show how much DRAM bandwidth you are actually using, which can make it much easier to estimate whether a quad-channel system would run faster.
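As a rough illustration of how such counter readings could be turned into a bandwidth estimate, here is a minimal sketch. It assumes that each counted DRAM read or write request moves one 64-byte cache line and that the counters are sampled at the start and end of a timed interval. The function names, the sample counts, and the dual-channel DDR3-1600 configuration are invented for the example and are not taken from the linked article.

```python
CACHE_LINE_BYTES = 64  # assumption: one counted request moves one 64-byte line


def dram_bandwidth_gbs(read_requests, write_requests, seconds):
    """Approximate DRAM traffic in GB/s from counter deltas over an interval."""
    total_bytes = (read_requests + write_requests) * CACHE_LINE_BYTES
    return total_bytes / seconds / 1e9


def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak: channels x 8-byte bus width x transfer rate."""
    return channels * bus_bytes * mega_transfers_per_s * 1e6 / 1e9


# Hypothetical counter deltas measured over one second on a dual-channel
# DDR3-1600 system (numbers invented for illustration).
used = dram_bandwidth_gbs(read_requests=150_000_000,
                          write_requests=50_000_000,
                          seconds=1.0)
peak = peak_bandwidth_gbs(channels=2, mega_transfers_per_s=1600)
print(f"used {used:.1f} GB/s of {peak:.1f} GB/s peak ({used / peak:.0%})")
```

If the measured figure sits far below the dual-channel peak, the code is probably not bandwidth-bound, and extra channels are unlikely to help much.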
Not all quad-DRAM-channel processors are the same. In the Xeon E5 v3 (Haswell) series, some of the low-power, low-core-count processors don't support full-speed (2133) DDR4 memory. Also, the Xeon E5 v3 processors with 8 cores or less only have one "Home Agent", which causes a reduction in sustained bandwidth even if they support full-speed DDR4 memory. The situation is probably similar with Xeon E5 v4 processors. I don't have a good reference handy for the quad-channel Core i7 processors (Haswell E and Broadwell E), but those look very similar to the corresponding Xeon E5 processors with a single "Home Agent", so they might not get the full 2x bandwidth speedup that one would hope for....
Many thanks for the very interesting information!
Putting it all together, I understand that (with the exception of the L2/L3 caches) the performance of a quad-core Xeon E5 v3/v4 with quad-channel DRAM is practically the same as, if not lower than, that of a quad-core dual-channel i7 at the same core frequency? That is actually my observation too. So the question is why Dell sells the Xeon PC for 3x the price of the i7 if the overall performance is still the same.
But what about my question on porting the algorithms from SSE to AVX2: can the new algorithm use quad-channel access to speed up load/store operations with YMM registers, or would the latencies simply be 2x those of XMM register loads/stores? Or would it give an overall speedup of only about 5% and not be worth the time?
I don't have any measurements on a quad-channel Core i7, but I did test two different Xeon E5 v3 (Haswell) models that should be comparable to the Core i7 quad-channel processors of that generation.
The first processor I tested was a Xeon E5-2603 v3 -- 6 core, 1.6 GHz, DDR4/1600, single Home Agent -- no HyperThreading, no Turbo. Memory bandwidth on this node was significantly (~25%) lower than on a Xeon E5-2670 (Sandy Bridge), despite having the same 4 channels of 1600 MHz DRAM.
The second processor I tested was a Xeon E5-2667 v3 -- 8 core, 3.2 GHz, DDR4/2133, single Home Agent, Turbo enabled. Memory bandwidth on this node was 5% to 15% lower than a similar processor having 2 Home Agents (Xeon E5-2660 v3). The differences were especially large when running all reads or when running streaming stores. Results were pretty close (~5%) for traffic with (1 read + 1 write), (2 reads + 1 write), or (3 reads + 1 write).
The quad-channel Core i7 parts probably don't have the severe performance impact that I saw on the low-power Xeon E5-2603, but they probably do show some bandwidth limitations due to the single Home Agent configuration. If this is in the 5%-15% range, then the sustained bandwidth speedup going from 2 channels to 4 channels should be in the range of 1.7x to 1.9x. Of course lots of applications don't care about memory bandwidth, or only use full bandwidth for a small fraction of the runtime, so they may see little or no speedup when going to 4 channels.
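The arithmetic behind that range, and the effect of an application being only partly bandwidth-bound, can be sketched as follows. The 5% and 15% losses are the figures quoted above; the Amdahl-style formula and the example fractions of bandwidth-bound runtime are illustrative assumptions, not measurements.

```python
def channel_speedup(channel_ratio, bandwidth_loss):
    """Sustained bandwidth gain from more channels after a fractional loss."""
    return channel_ratio * (1.0 - bandwidth_loss)


def application_speedup(bandwidth_bound_fraction, bandwidth_speedup):
    """Amdahl-style estimate: only the bandwidth-bound part of the runtime
    gets faster; the rest is unchanged."""
    remaining = ((1.0 - bandwidth_bound_fraction)
                 + bandwidth_bound_fraction / bandwidth_speedup)
    return 1.0 / remaining


low = channel_speedup(2, 0.15)   # 15% loss from the single Home Agent
high = channel_speedup(2, 0.05)  # 5% loss
print(f"bandwidth speedup, 2 -> 4 channels: {low:.2f}x to {high:.2f}x")

# Hypothetical fractions of runtime spent limited by DRAM bandwidth.
for fraction in (0.1, 0.5, 1.0):
    print(f"{fraction:.0%} bandwidth-bound: "
          f"{application_speedup(fraction, low):.2f}x to "
          f"{application_speedup(fraction, high):.2f}x overall")
```

Under these assumptions, a code that is bandwidth-limited for only a tenth of its runtime gains just a few percent from the extra channels.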
On the second question -- SSE codes should get a benefit from the increased bandwidth between cache levels on Haswell or Broadwell processors. Outside of the L1 Data Cache, everything moves in full-cache-line blocks, so it does not really matter what size load generated the corresponding cache miss. In most of my tests of data transfer rates, the AVX/AVX2 code is slightly faster than SSE -- after all, the SSE code has more instructions to retire, and no overlapping of memory access and instruction execution is ever going to be perfect -- but the differences are typically only a few percent.
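To make the cache-line point concrete, the small sketch below counts load instructions and cache-line transfers for reading the same buffer once with 16-byte (XMM) and 32-byte (YMM) loads. It assumes a 64-byte cache line, a buffer that is aligned and a multiple of 64 bytes, and that every line is fetched exactly once; the 1 MiB buffer size is arbitrary.

```python
CACHE_LINE_BYTES = 64  # cache-line size assumed for this sketch


def transfer_counts(buffer_bytes, vector_bytes):
    """Return (load instructions, cache lines moved) for one pass over
    an aligned buffer whose size is a multiple of the cache-line size."""
    loads = buffer_bytes // vector_bytes
    lines = buffer_bytes // CACHE_LINE_BYTES
    return loads, lines


buffer_bytes = 1 << 20  # 1 MiB, arbitrary example size
for name, width in (("SSE (XMM, 16 B)", 16), ("AVX (YMM, 32 B)", 32)):
    loads, lines = transfer_counts(buffer_bytes, width)
    print(f"{name}: {loads} loads, {lines} cache lines")
```

The wider loads halve the instruction count, but the number of cache lines crossing the L2, L3, and DRAM interfaces is identical, which is why the memory channels do not see a difference.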
Many, many thanks for this very interesting and important information, especially regarding the number of HAs and the theoretical speedup with the Xeon E5-2660 v3. I will seriously recommend buying one test system to see how much we can gain. Porting to AVX2 seems to be the wrong way to achieve a real speedup unless we can significantly increase the "real" memory bandwidth (which is currently not the case with our systems). I'm very glad to stop it in time rather than invest too much time for nothing :)