Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

No speedup with SSE


I have optimized my finite difference code with SSE. On my workstation, the speed almost doubles. But when I run the same code on a cluster node, the SSE version performs almost the same as the non-optimized code. What could be going wrong on the cluster node?

I can think of a few possible reasons. (1) I may have missed something during compilation, so the SSE code does not work as expected. I tried compiling the code on my workstation and then running the binary on the cluster node, but that did not help. (2) On the cluster node, the non-optimized code is being optimized automatically in some way, so it is already as fast as the SSE version. (3) SSE is not supported on that cluster node.

So, how can I figure out which of these it is?

PS: Compiler: icc. Workstation CPU: Intel(R) Xeon(R) E7-4830 @ 2.13 GHz (SSE improves speed). Cluster node CPU: Intel(R) Xeon(R) E5-2670 @ 2.60 GHz (SSE makes no difference).

1 Reply
Black Belt
If you haven't looked into any details, we can only offer guesses. 1) If you use icc to target those specific CPUs, auto-vectorization of the C code for the 256-bit AVX registers may be as good as hand-written SSE code, which uses only half-width registers on the newer CPU. 2) The newer (but now also obsolete) CPU could run up against the 128-bit-wide bandwidth limitation between the L1 and L2 caches. I remember tearing down and replacing motherboards and CPUs on those Westmere E7 boxes; their prices were hardly justified even with the upgraded hardware.