I have benchmarked the SIMD instructions on two CPUs: an Atom 330 and an i3-2310M,
testing on Linux 2.6.35 and Linux 2.6.38 (Fedora 14 and Fedora 15) with gcc 4.5.1,
using gcc auto-vectorization (with -msse4.2 / -mssse3 and -mavx).
My results (unit: MFlops):
                  atom 330          i3-2310m
float             133.32            1141.61
float + sse       402.93 (302%)     1825.28 (160%)
float + avx                         2291.78 (200%)
double            132.96            1151.64
double + sse      233.94 (176%)     1208.82 (105%)
double + avx (*)                    2407.44 (209%)
(*) Compiled by icc with -mavx, because gcc slowed down (550 MFlops) with -mavx.
As you see, the SSE speedup on the Atom is much bigger than on the i3-2310M.
I have checked my compile flags many times...
The result on the Atom looks very reasonable, but the one on the i3 does not...
I would like to know why the Atom gains so much more from SSE than the i3-2310M does...
Am I correct in believing that the gcc AVX code is AVX-128, while the icc AVX is AVX-256?
I believe there are more serious bugs in the C version than in the Fortran version of the Livermore Fortran Kernels presented on that site.
As the other responder indicated, your slowest results reflect the Atom's lack of hardware acceleration for simple scalar loops more than what you call a better SSE set.
My original data come from the verification pass of the cloop code, i.e. the first output of the cloop benchmark; it is not the peak nor the average of the top 25%.
I used the compile flags:
  gcc -O3 -mssse3     for the Atom with SSE
  gcc -O3 -msse4.2    for the i3-2310M
  gcc -O3 -mavx       for the i3-2310M
The OS is 32-bit.
You have mentioned 128-bit AVX; I don't know what that means. As far as I know, AVX means SIMD on registers that are 256 bits long. So you mean that gcc -mavx is not 256-bit AVX??
My original question is: turning on SSE on the Atom accelerates the code by about 75%, but on Sandy Bridge by just 3%... that is not reasonable.
Please read the document I referred to above. You're getting these results because the Atom is terrible at x87, not because Sandy Bridge is bad at SSE.
The reason you're only getting a 3% speedup with double-precision SSE on Sandy Bridge is apparently due to the vectorization overhead. Typical code requires a significant number of shuffle, insert and extract instructions to organize the data. Future architectures may alleviate this by adding gather/scatter support (i.e. packed load/store which doesn't occupy port 0/1/5 slots).
You could easily check whether your gcc -mavx is generating AVX-128 code. If you generate a .s file with -S, or run objdump on a .o file (objdump -d your.o), you will see operations on xmm registers (similar to SSE) for AVX-128, and on ymm registers for AVX-256. According to my understanding, gcc support for AVX-256 comes with gcc-4.7. AVX-128 accomplishes a given task with fewer asm instructions (typically 10% fewer, not enough by itself to motivate the introduction of the new instructions), but the difference in terms of micro-ops is probably much smaller. So it's entirely "reasonable" for the performance of SSE and AVX-128 to be within 3%.
With an AVX FMA implementation (none in production yet), it's possible that selected serial sequences might see the sort of gain you imagine. The LFK calibration loop, if that's the one you mean, is easily vectorizable and would not be sped up much by FMA, even though FMA greatly decreases the number of asm instructions. The last CPU I worked with that had an instruction-issue-rate bottleneck which FMA could overcome was a MIPS over a decade ago. When I submitted a gcc patch to greatly increase IA64 FMA usage, similar to what was already implemented for PA-RISC, it took 3 years for gcc to release it, so you can see that the experts don't put much stock in FMA. Anyway, the prospect of a future FMA doesn't justify expectations of current non-FMA CPUs.
Some marketers would like you to think that a decrease in the number of asm instructions for a given task could always be accomplished without an increase in clock cycles per asm instruction, or that AVX would magically decrease the number of cycles for a cache or TLB miss compared with SSE on the same CPU. Your terminology "not reasonable" could rightly be applied to such implications.
I guess you're confirming that the comparison you are making is between x87 scalar and SSE vector code. It's hardly surprising if you find that out-of-order execution speedup on the I3 is greater for the scalar than for vector code. People used to consider that a mark of an excellent CPU. The early Opteron CPUs were praised for their ability to run x87 code nearly as fast as SSE2. Now you're complaining about the extent to which I3 approaches that old "ideal."
GCC also has the -msse2avx/-mno-sse2avx switches, which tell the compiler whether to use the ternary (VEX) or binary instruction encodings for 128-bit instructions. You might have been thinking of that one?
Regarding ILP: The ILP of vector and scalar SSE/AVX instructions is the same, so how much ILP you get just depends on the actual code. If you do vertical vectorization, you're likely trading ILP for vector parallelism and therefore don't gain anything.