Performance loss migrating from Xeon X5550 to Xeon E5-2650 v2

Alex_P_ · ‎02-29-2016

Hi,

I'm migrating in-house Fortran software from a cluster with Intel Xeon X5550 processors to a cluster with Intel Xeon E5-2650 v2 processors, and I'm seeing a loss of performance. I checked the processor specifications, and the E5-2650 v2 should give me better performance and also supports the AVX instruction extensions. Instead, my runs are up to 50% slower.

The clusters have the same compiler and I'm using the same compilation flags, apart from the -mavx option that I add for the new cluster:

-O3 -shared-intel -free -align all -mavx -xHost -opt-mem-bandwidth2 -finline-functions -inline all -no-inline-min-size -fp-model fast=2 -unroll -unroll-aggressive -warn nointerfaces -nogen-interfaces -fpp -lstdc++ -align array64byte -ipo

I tried compiling with -O2 instead of -O3, but the code is slower on both clusters. I also tried removing the -mavx and -xHost flags, but there is no observable difference.

Can anyone help me understand whether I'm doing something wrong here?

Thank you very much,

Alex

McCalpinJohn · ‎02-29-2016

Not a lot of information to go on here....

Let's start with:

Is this parallel code?
If so, how is the parallelism implemented?
Is the slowdown observed for serial runs?

TimP · ‎02-29-2016

As the newer CPUs have twice as many cores per socket, presumably also twice as many per node, it might help if you would compare runs using the same number of nodes and ranks per node. As you didn't say anything about which MPI you use or how you assigned ranks, one guess might be that you compared with the same number of ranks but half the number of nodes. Did you assign your ranks to separate cores, either by disabling HyperThreads for both clusters, or by similar appropriate pinning for both?

Alex_P_ · ‎03-01-2016

Hi John and Tim,

Thank you both for your replies. The code is a parallel CFD code. I'm using the mpiifort compiler, and the parallelism basically consists of a call to mpi_sendrecv at each iteration, with which the processors exchange the data needed for the next iteration. Each processor performs exactly the same tasks, but of course the load is not exactly the same, since the problem cannot be split into perfectly equal parts.

I tested the code on a single processor, on two processors, and on 256 processors using 16 full nodes (on both clusters each node has 16 processors). On a single processor the performance is 50% worse, while on 256 processors the code runs around 30% slower.

McCalpinJohn · ‎03-01-2016

The Xeon X5550 is a 4-core processor, so that cluster must be running with HyperThreading enabled to have 16 "logical processors" per node. The Xeon E5-2650 v2 is an 8-core processor, so that cluster must be running with HyperThreading disabled to have 16 "logical processors" per node.
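
The arithmetic behind those two statements can be sketched as follows. This assumes dual-socket nodes, which the thread does not state explicitly; the function name is only illustrative.

```python
def logical_processors(sockets, cores_per_socket, threads_per_core):
    """Logical processors the OS sees on one node."""
    return sockets * cores_per_socket * threads_per_core

# Assumption: both clusters use dual-socket nodes.
# Xeon X5550: 4 cores per socket, HyperThreading on (2 threads per core).
x5550_node = logical_processors(sockets=2, cores_per_socket=4, threads_per_core=2)

# Xeon E5-2650 v2: 8 cores per socket, HyperThreading off (1 thread per core).
e5_2650v2_node = logical_processors(sockets=2, cores_per_socket=8, threads_per_core=1)

print(x5550_node, e5_2650v2_node)
```

Under that assumption both configurations give 16 logical processors per node, but only 8 of them are physical cores on the X5550 node.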

It is not easy to make a single-threaded code run half as fast on a Xeon E5-2650 v2 as on a Xeon X5550, but there are probably some ways to do it.... This is particularly odd to see in a CFD code, since they are usually bandwidth-limited and everything in the memory hierarchy of the Xeon E5-2650 v2 is as fast or faster than on the Xeon X5550. One difference is in DRAM channel interleaving. The Xeon X5550 has 3 DRAM channels and interleaves consecutive cache lines around the 3 channels using a full "modulo-3" pattern. The Xeon E5-2650 v2 has 4 DRAM channels and interleaves consecutive cache lines around the 4 channels using a "modulo-4" pattern. One consequence of this is that a stride of 256 Bytes (4 cache lines) will interleave across all three channels on the Xeon X5550, but will direct all of its accesses to only 1 channel on the Xeon E5-2650 v2. This would give a 2x performance advantage in favor of the older system (32 GB/s on 3 channels vs 14.933 GB/s on 1 channel). If the code is built using an "array of structures" approach, it is fairly easy to generate strided memory references of this type.
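
A toy model of the interleaving argument, as a rough sketch only: it assumes cache line i maps to channel i modulo the number of channels, which ignores any address hashing real hardware may apply. The function name and the access count are illustrative.

```python
CACHE_LINE = 64  # bytes per cache line

def channels_touched(stride_bytes, num_channels, accesses=1024):
    """Channels hit by a strided access stream in the simplified
    model where cache line i lives on channel i % num_channels."""
    return {
        (i * stride_bytes // CACHE_LINE) % num_channels
        for i in range(accesses)
    }

stride = 256  # bytes, i.e. 4 cache lines

three_channel = channels_touched(stride, 3)  # modulo-3 interleave
four_channel = channels_touched(stride, 4)   # modulo-4 interleave

# Bandwidth figures quoted in the post above (GB/s).
bw_three_channels = 32.0
bw_one_channel = 14.933
advantage = bw_three_channels / bw_one_channel

print(sorted(three_channel), sorted(four_channel), round(advantage, 2))
```

In this model the 256-byte stride reaches all three channels with modulo-3 interleaving but only one channel with modulo-4 interleaving, and the quoted bandwidths give a ratio slightly above 2.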

Nothing else comes to mind....

Alex_P_ · ‎03-02-2016

Thanks for the very detailed reply.

First of all, I checked both clusters (with 'cat /proc/cpuinfo') and it turned out that the information I got from the admin of the first cluster was wrong. The processors are E5-2670, not X5550. I'm sorry that I gave you wrong info. Then I checked with 'lscpu' to verify the HyperThreading setting, and both systems have it enabled. The main difference is that the CPU clock is set to 2.6 GHz on the first cluster and to 2.0 GHz on the new cluster.

Can this difference account for the loss in performance? From the cpubenchmark website it seems that the two CPUs are not too far apart in performance, with the E5-2650 v2 being a little bit better.

It looks like there's not much I can do at this point, am I right?

McCalpinJohn · ‎03-02-2016

I think we need to review the processor model numbers one more time....

The most recent post for the first cluster is consistent -- the Xeon E5-2670 (v1) is an 8-core processor with a nominal 2.6 GHz frequency. The maximum all-core-active Turbo frequency for this processor is 3.0 GHz.

If the second cluster is an 8-core v2 part running at 2.0 GHz, then the corresponding model number is Xeon E5-2640 v2, not Xeon E5-2650 v2. The maximum all-core-active Turbo frequency for the Xeon E5-2640 v2 is 2.3 GHz.

The Xeon E5-2670 and Xeon E5-2640 v2 both use four memory channels per socket, and both support DDR3/1600 as the fastest memory option. There are numerous ways that memory can be configured to give suboptimal performance, so you might want to run the STREAM benchmark or the Intel Memory Latency Checker on each of the systems (especially the new one, of course), to see if the bandwidths are reasonable. Both chips have 51.2 GB/s peak memory bandwidth per socket, and depending on the ratio of reads, writes, and non-temporal writes you should see something in the 70%-85% of peak range for most of the tests, and 90% or better for the "all reads" test.
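
The 51.2 GB/s figure and the expected measurement ranges follow from simple arithmetic. A minimal sketch, assuming 8 bytes per transfer on each DDR3 channel (64-bit data bus); the function name is illustrative.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Peak DRAM bandwidth per socket in GB/s (1 GB = 1e9 bytes)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0

# Four channels of DDR3/1600, as stated for both chips.
peak = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=1600)

# Ranges quoted in the post: 70%-85% of peak for most tests,
# 90% or better for the "all reads" test.
typical_low = 0.70 * peak
typical_high = 0.85 * peak
all_reads_min = 0.90 * peak

print(f"peak: {peak:.1f} GB/s")
print(f"typical: {typical_low:.1f} to {typical_high:.1f} GB/s")
print(f"all reads: at least {all_reads_min:.1f} GB/s")
```

A per-socket STREAM result well below roughly 36 GB/s on either system would therefore suggest a memory configuration problem.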

The Xeon E5-26xx v1 and v2 implementations limit the uncore frequency to the maximum core frequency. Since the uncore includes the L3 cache and the memory controller, this makes memory bandwidth more sensitive to core frequency than on many other platforms. My measurements on a Xeon E5-2680 (v1) running at all available frequencies suggest that a processor running at 2.0 GHz should have about 12% lower memory bandwidth than a processor running at 3.0 GHz, so even if Turbo was enabled on the older system and disabled on the newer system, this should not be enough to explain the 2x performance difference.
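
To put numbers on that, here is a rough sketch under two simplifying assumptions that are mine, not measurements: a purely bandwidth-bound code has runtime inversely proportional to memory bandwidth, and a purely compute-bound code has runtime inversely proportional to core frequency.

```python
# Figures from the post: 3.0 GHz (old system with Turbo) versus
# 2.0 GHz (new system without Turbo), and about 12% lower bandwidth.
fast_ghz = 3.0
slow_ghz = 2.0
bandwidth_loss = 0.12

# Assumption: runtime scales as 1 / bandwidth for bandwidth-bound code.
bandwidth_bound_slowdown = 1.0 / (1.0 - bandwidth_loss)

# Assumption: runtime scales as 1 / frequency for compute-bound code.
compute_bound_slowdown = fast_ghz / slow_ghz

observed_slowdown = 2.0  # the single-process slowdown under discussion

print(f"bandwidth-bound estimate: {bandwidth_bound_slowdown:.2f}x")
print(f"compute-bound estimate:   {compute_bound_slowdown:.2f}x")
print(f"observed:                 {observed_slowdown:.2f}x")
```

Neither limit reaches 2x: the bandwidth-bound estimate is about 1.14x and even the fully compute-bound estimate is only 1.5x.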

There is another difference between Xeon E5 v1 (Sandy Bridge) and Xeon E5 v2 (Ivy Bridge) that could show up in a CFD code. Agner Fog notes (http://www.agner.org/optimize/instruction_tables.pdf and http://www.agner.org/optimize/microarchitecture.pdf) that Software Prefetch instructions are very very slow on Xeon E5 v2. On Sandy Bridge you can issue 2 Software Prefetch instructions per cycle, while Ivy Bridge is limited to one Software Prefetch instruction every 43 cycles. The Intel compilers will not normally generate software prefetch instructions for mainstream Xeon processors, but a CFD code that uses indirect accesses may have had software prefetch pragmas added to the code. A quick way to check for these on a Linux system would be:

objdump -d my_executable_file | grep -i prefetch
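
If a per-instruction count is more useful than raw grep output, the same check could be scripted. This is a sketch only: it assumes GNU objdump's usual tab-separated layout (address, bytes, mnemonic), and the sample disassembly below is hand-written for illustration, not output from any real binary.

```python
import re
import subprocess
from collections import Counter

# Assumption: the mnemonic follows a tab, as in GNU objdump -d output.
PREFETCH_RE = re.compile(r"\t(prefetch\w*)\b", re.IGNORECASE)

def count_prefetches(disassembly):
    """Count software prefetch instructions by mnemonic."""
    return Counter(m.lower() for m in PREFETCH_RE.findall(disassembly))

def disassemble(path):
    """Run 'objdump -d' on a binary and return its text output."""
    result = subprocess.run(
        ["objdump", "-d", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Illustrative, hand-written sample (hypothetical addresses and symbol).
sample = (
    "00000000004005c0 <prefetch_helper>:\n"
    "  4005c0:\t0f 18 0e             \tprefetcht0 (%rsi)\n"
    "  4005c3:\t48 8b 06             \tmov    (%rsi),%rax\n"
    "  4005c6:\t0f 18 07             \tprefetchnta (%rdi)\n"
)

counts = count_prefetches(sample)
print(dict(counts))
```

For a real binary, count_prefetches(disassemble("my_executable_file")) would give the totals. Unlike the plain grep, the pattern skips symbol names that merely contain the word "prefetch".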

Alex_P_ · ‎03-10-2016

Dear John,

As you suggested, I asked the administrators of the second cluster about the CPUs and you are right: it's not the v2, but the first-version E5-2650 @ 2.0 GHz. These processors have inferior performance compared with the E5-2670 @ 2.6 GHz, and the clock is slower. I guess this settles my question, and it is now clear why the code runs slower. I'm sorry for the confusion, but information about clusters can apparently be quite deceptive on websites and in brochures.

Thank you very much for your help!

Alex

Talisin__Rus · ‎09-14-2018

John, regarding software prefetch on v2: is the slowdown a firmware latency, or clumsy compiler-generated code as run on Ivy Bridge? The way you put it, it sounds like it is entirely firmware on v2.

Also, does this apply to loops containing prefetches that are smaller than 256 bytes? Loops larger than 256 bytes cannot use the Nehalem loop buffer advantage, although Sandy Bridge and Ivy Bridge E5 parts added a µop cache after the instruction decoders.

This seems to contradict the Ivy Bridge penalty. From p. 123: "Code that runs out of the µop cache are not subject to the limitations of the fetch and decode units. It can deliver a throughput of 4 (possibly fused) µops or the equivalent of 32 bytes of code per clock cycle. The µop cache is rarely used to the maximum capacity of 1536 µops"

For instructions that need more than 4 µops on Sandy Bridge / Ivy Bridge, is there a reference listing them (so they can be avoided)? It is not clear whether Ivy Bridge is also affected.

Is the prefetch advantage due to µop fusion? Agner mentions constraints on that on p. 108 (Microarch of Intel / AMD):

'Nehalem has the loop buffer after the decoders. The Nehalem loop buffer can hold 28 (possibly fused) µops. The size of the loop code is limited to 256 bytes of code, or up to 8 blocks of 32 bytes each. A loop containing more than 256 bytes of code cannot use the loop buffer'

This seems contrary to Agner's section 3.7, "Branch prediction in Intel Sandy Bridge and Ivy Bridge" (I welcome less complex branch prediction):

The Sandy Bridge reverses the trend of ever more complicated branch prediction algorithms by not having a separate predictor for loops. The redesign of the branch prediction mechanism has probably been necessary in order to handle the new µop cache (see page 122 below). A further reason for the simplification may be a desire to reduce the pipeline length and thereby the misprediction penalty.

"Sandy Bridge and Ivy Bridge can fuse two instructions into one µop in more cases than previous processors can (see p108) ... decoders will fuse an arithmetic or logic instruction and a subsequent conditional jump instruction into a single compute-and-branch µop in certain cases ... compute-and-branch µop is not split in two at the execution units but executed as a single µop by the branch unit at execution port 5."

The instruction fusion works even if instructions cross a 16-byte boundary on Ivy Bridge. So how is prefetch on v2 (Ivy Bridge) so bad?

Finally, my concern is that I just dumped my Dell 710s for R620s with the 2690 v2, and on a first reading I cannot find the v2 disadvantage you speak of. I'm amazed that no manual (apart from Intel's Vol. 2 register manual) documents the included performance counters (apart from http://oprofile.sourceforge.net/docs/intel-ivybridge-events.php), or shows code to access and use them in practice. Why is this left to profiling tools?

Travis_D_ · ‎09-27-2018

For what it's worth, I've measured a dense software prefetch loop on Ivy Bridge EP, and the prefetches issue at up to 2 per cycle, just like on Sandy Bridge. There is no sign of a restriction to one per 43 cycles as mentioned by Agner.

I have seen that one-per-43 claim repeated several times, but never with more details on how to reproduce it. It seems unlikely to me that it applies broadly to all software prefetch instructions, since that would be an obvious and large regression even for a lot of general-purpose code that uses software prefetch incidentally, e.g., in libc memcpy implementations.