If you haven't looked into

Gong__Xufei · ‎04-02-2018

I have optimized my finite difference code with SSE. On my workstation, the speed is almost doubled. But when I run the same code on a cluster node, the code with SSE has almost the same performance with the non-optimized code. So what happened to the cluster node case?

I guess there are few possible reasons. (1) I may have missed something during the compilation so the SSE does not work as expected. So I have tried to compile the code on my workstation, then run it on the cluster node, but it does not help. (2) On the cluster node, the non-optimized code has been automatically optimized some way so that it is already fast enough. (3) SSE is not supported by that cluster node.

So, how could I figure out why?

PS: Compiler: icc. The CPU on my workstation: Intel(R) Xeon(R) CPU E7- 4830 @ 2.13GHz (SSE improves speed). The CPU on the cluster node: Intel(R) Xeon(R) CPU E5- 2670 @ 2.60GHz (SSE makes no difference).

TimP · ‎04-04-2018

If you haven't looked into any details we can offer only guesses. 1) if you use icc to target those specific CPUs, optimization of C code for avx wide registers may be as good as SSE code which uses only half registers on the newer cpu. 2) the newer (but now also obsolete) CPU could run up against the 128 bit wide bandwidth limitation between L1 and L2 cache. I remember tearing down and replacing motherboards and CPUs on those westmore E7 boxes. Their prices were hardly justified even with the upgraded hardware.

No speedup with SSE