In reading several scientific-computing benchmarks of the E5-2697 v3 vs. the E5-2697 v2, I got the impression the v3's should perform better, even though they are 0.1 GHz slower. Instead I'm getting odd results on a heterogeneous cluster I run, on CentOS 6 (kernel 2.6.32-504.el6.x86_64).
Basically, the E5-2697 v2's are clearly outperforming their v3 counterparts (~15% faster). I'm running a finite element code on them, compiled with the Intel 15.0.2 compiler suite (ifort, icc, icpc, etc.). The timings I get, whether parallel within a node or serial on each node, show the v3's running much slower than I expected. I ran a calculation on each of the four node types in the cluster, all named "tebowXXX":
|Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
|Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
|Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
|Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz, overclocked to 4.2 GHz
The results can be found on sheet 1 of the attached Excel file, but I also want to go through what I have verified. There is no thermal throttling going on (I check core_throttle_count to make sure). I also checked for OS frequency scaling via the kondemand process or anything similar, but the nodes all report running at stock speed (the overclocked node still reports 3.4 GHz, but I know I've boosted it). I compared memory info (seen in the attachment) between tebow135 and tebow123, since these make up the bulk of our cluster and the speed difference between them is enough to preclude using them together effectively; besides, having bought newer nodes, I want them running faster if possible. The code was compiled on the head node, which is identical to tebow117 (E5-2687W 0 @ 3.10 GHz).

I wasn't sure of the best way to compare parallel runs that used more or fewer CPUs than the control, so the raw data is included to look at as well. I tried a percentage comparison across the four node types: parallel with 28 and 24 CPUs on the v3, 24 CPUs on the v2, and 16 CPUs on the 2687W. I also did serial runs on all of them to take MPI (OpenMPI 1.8.4) out of the equation, and there the v3 was again ~15% slower than the v2. The 2687W scored well in serial too, but it was most likely running in Turbo Boost, so its 3.1 GHz base clock is not as meaningful (3.8 GHz, I think). I don't know whether the parallel comparison is meaningful the way I crunched it, on a "seconds/processor/GHz" scale; use it with caution, as the serial comparisons are probably the best.
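For reference, the throttle check I mention can be scripted along these lines. This is a minimal sketch that sums the per-core throttle counters Linux exposes under sysfs; the sysfs root is parameterized only so the snippet is easy to test, and paths assume a kernel with the thermal_throttle interface:

```shell
#!/bin/sh
# Sum core_throttle_count over all CPUs under the given sysfs root
# (default /sys). A nonzero total means at least one core has
# thermally throttled since boot.
sum_throttle() {
    root="${1:-/sys}"
    total=0
    for f in "$root"/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count; do
        # Skip the literal pattern when the glob matches nothing.
        [ -r "$f" ] || continue
        total=$((total + $(cat "$f")))
    done
    echo "$total"
}

echo "total core throttle events: $(sum_throttle)"
```

If this prints 0 on every node, thermal throttling can be ruled out as the source of the slowdown.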
So, after all this, can anyone help me troubleshoot why I'm getting such poor performance, and whether it is expected? As I mentioned, the benchmarks I read didn't show this. Could it be a compiler issue, a Linux configuration issue, an InfiniBand issue (I ran serial as well as parallel jobs dedicated on each node, so I think I minimized communication differences, although the file system is shared as NFS over RDMA), or something else I can't think of? Any thoughts or troubleshooting help is welcome.
A follow-up, in case anyone encounters the same issue. After much searching, it ended up being a BIOS setting from the manufacturer: the "snoop mode". It was set to Early Snoop, and changing it to Cluster on Die sped things up. The finite element comparison went from 15% slower than the older v2 to 25% faster. From what I can tell, Cluster on Die is the way to go performance-wise; I eventually found some benchmarks comparing these settings that reached similar conclusions.
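For anyone wanting to confirm from the OS which snoop mode actually took effect: with Cluster on Die enabled (and NUMA enabled in the BIOS), each Haswell-EP socket is presented as two NUMA nodes, so a dual-socket E5-2697 v3 box should report four nodes instead of two. `numactl --hardware` or `lscpu` will show this directly; the sketch below just counts the node directories in sysfs, with the root parameterized only for testability:

```shell
#!/bin/sh
# Count NUMA nodes the kernel exposes under the given sysfs root
# (default /sys). On a dual-socket Haswell-EP system: 2 nodes
# suggests Early Snoop / Home Snoop, 4 nodes means Cluster on Die.
count_numa_nodes() {
    root="${1:-/sys}"
    n=0
    for d in "$root"/devices/system/node/node[0-9]*; do
        [ -d "$d" ] && n=$((n + 1))
    done
    echo "$n"
}

echo "NUMA nodes: $(count_numa_nodes)"
```

With Cluster on Die you also want the code's MPI ranks pinned so each rank stays on its local NUMA node, otherwise the split can hurt rather than help.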
One side note, though: I also run a different code that isn't matrix-heavy, a Monte Carlo particle transport code, and this change made no difference there. On that code I'm still seeing the v3's nearly 50% slower than the v2's, for the same parts I've been comparing (E5-2697 v3 vs. E5-2697 v2). I'm still trying to track down why; as it stands, I'm better off buying v2's unless I can find additional settings that at least match the v2 performance there. I'm not sure whether dropping to an 8-core part would behave differently.