Whatever optimization options (target CPU, O2/O3) I turn on or off, the performance is the same.
Clearly, ICC is faster than VC8.
I have a question: I think CFunction2 is easier for the compiler to vectorize with SSE instructions, but with ICC the performance of CFunction2 and CFunction1 is the same. There is also still a gap of about 152% (0.959 sec vs 0.38 sec) compared with SSEfunction.
How should I modify the C code to bring its performance close to the SSEfunction result?
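
As a minimal sketch of the kind of C-level change being asked about: the function below is hypothetical (it is not the actual CFunction1 or CFunction2, and the name `scale_add` and its parameters are made up for illustration). It assumes the hot loop is a simple element-wise operation over float arrays. The `restrict` qualifiers promise the compiler that the arrays do not overlap, which removes one common obstacle to automatic SSE vectorization. Whether this closes the gap to hand-written SSE depends on the real loop and the compiler.

```c
#include <stddef.h>

/* Hypothetical example, not the original CFunction1/CFunction2.
 * Computes dst[i] = a[i] * k + b[i] for i in [0, n).
 * `restrict` tells the compiler that dst, a, and b never alias,
 * so it is free to load and store several elements at once. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    /* A simple counted loop with unit stride and no branches in the
     * body is the shape auto-vectorizers generally handle best. */
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] * k + b[i];
    }
}
```

Note that `restrict` is a C99 keyword; depending on the compiler it may need C99 mode or a compiler-specific option, and some compilers only accept the `__restrict` spelling.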