I have optimized my finite difference code with SSE. On my workstation, this almost doubles the speed. But when I run the same code on a cluster node, the SSE version performs almost the same as the non-optimized version. What is going on in the cluster node case?
I can think of a few possible reasons. (1) I may have missed something during compilation, so the SSE code does not work as expected. To test this, I compiled the code on my workstation and ran that binary on the cluster node, but it did not help. (2) On the cluster node, the non-optimized code is somehow optimized automatically (for example, vectorized by the compiler), so it is already fast enough. (3) SSE is not supported by that cluster node.
So, how can I figure out which of these is the actual cause?
PS: The compiler is icc. Workstation CPU: Intel(R) Xeon(R) CPU E7- 4830 @ 2.13GHz (SSE improves speed). Cluster node CPU: Intel(R) Xeon(R) CPU E5- 2670 @ 2.60GHz (SSE makes no difference).