SIMD instructions: is Atom better?

Gaiger_Chen · ‎04-22-2011

Dear all:

I have benchmarked SIMD instructions on two CPUs: an Atom 330 @ 1.6 GHz and an i3-2310M @ 2.1 GHz,

using the benchmark from http://www.aip.org/cip/langer.htm

I tested on Linux 2.6.35 and Linux 2.6.38 (Fedora 14 and Fedora 15) with gcc 4.5.1,

relying on gcc auto-vectorization (with -msse4.2 / -mssse3 and -mavx).

My results (unit: MFLOPS):

                   Atom 330          i3-2310M
float              133.32            1141.61
float + SSE        402.93 (302%)     1825.28 (160%)
float + AVX        -                 2291.78 (200%)

double             132.96            1151.64
double + SSE       233.94 (176%)     1208.82 (105%)
double + AVX (*)   -                 2407.44 (209%)

* Compiled with icc -mavx, because the gcc build slowed down (to 550 MFLOPS) with -mavx.
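The percentages in the table are each build's throughput divided by the plain -O3 build on the same CPU. As a minimal sketch of that arithmetic (the dictionary layout and the names `results` and `relative_throughput` are illustrative, not part of the benchmark), the ratios can be recomputed like this:

```python
# Throughput figures (MFLOPS) copied from the table above.
# Keys are (cpu, precision); "base" is the plain -O3 build on that CPU.
results = {
    ("Atom 330", "float"):  {"base": 133.32,  "sse": 402.93},
    ("Atom 330", "double"): {"base": 132.96,  "sse": 233.94},
    ("i3-2310M", "float"):  {"base": 1141.61, "sse": 1825.28, "avx": 2291.78},
    ("i3-2310M", "double"): {"base": 1151.64, "sse": 1208.82, "avx": 2407.44},
}


def relative_throughput(variants):
    """Return each non-base build's MFLOPS as a ratio of the base build."""
    base = variants["base"]
    return {name: value / base
            for name, value in variants.items() if name != "base"}


if __name__ == "__main__":
    for (cpu, precision), variants in results.items():
        for name, ratio in relative_throughput(variants).items():
            print(f"{cpu:9s} {precision:6s} {name:3s} {ratio * 100:6.1f}%")
```

Note that a ratio only compares two builds on the same CPU. In absolute MFLOPS the i3-2310M is several times faster than the Atom 330 in every row of the table.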

As you can see, the SSE speedup on the Atom is much larger than on the i3-2310M.

I have checked my compile flags many times...

The result on the Atom looks very reasonable, but the one on the i3 looks quite unreasonable...

I would like to know why the Atom gains so much more from SSE than the i3-2310M...

Thank you.

capens__nicolas · ‎04-26-2011

I think all this really illustrates is how bad Atom is at x87 instructions. See Agner Fog's microarchitecture document (section 10.5) for details: http://www.agner.org/optimize/

TimP · ‎04-26-2011

There's not much to say without more detail about what you are doing. Are you comparing gcc -m32 X87 code vs. sse/sse2? Are you reporting the peak result of the fastest kernel, or the average of the top 25%, or one of the other reported results? These benchmarks include a wide range of kernels, some vectorizable, some not, a few parallelizable as well.
Am I correct in believing that the gcc AVX code is AVX-128, while the icc AVX is AVX-256?
I believe there are more serious bugs in the C than the Fortran version of Livermore Fortran Kernels presented on that site.
As the other responder indicated, your slowest results seem to indicate more the lack of hardware acceleration of simple scalar loops on Atom than what you call a better sse set.

Gaiger_Chen · ‎04-27-2011

Dear TimP:

My original data comes from the verification step of the cloop code, that is, the first output of the cloop benchmark. It is neither the peak nor the average of the top 25%.

I used these compile flags.

For the Atom (plain build, then the SSE build):

gcc -O3

gcc -O3 -mssse3

For the i3-2310M:

gcc -O3

gcc -O3 -msse4.2

gcc -O3 -mavx

The OS is 32-bit.

You mentioned 128-bit AVX, but I do not know what that means. As far as I know, AVX means SIMD instructions that operate on 256-bit registers. So do you mean that gcc -mavx does not generate 256-bit AVX?

My original question is: turning on SSE on the Atom gives a speedup of about 75%, but on Sandy Bridge it is just 3%... that does not seem reasonable.

capens__nicolas · ‎04-28-2011

Quoting Gaiger Chen

My original question is: turning on SSE on the Atom gives a speedup of about 75%, but on Sandy Bridge it is just 3%... that does not seem reasonable.

Please read the document I referred to above. You're getting these results because Atom is terrible at x87, not because Sandy Bridge is bad at SSE.

The reason you're only getting a 3% speedup with double-precision SSE on Sandy Bridge is apparently due to the vectorization overhead. Typical code requires a significant number of shuffle, insert and extract instructions to organize the data. Future architectures may alleviate this by adding gather/scatter support (i.e. packed load/store which doesn't occupy port 0/1/5 slots).

TimP · ‎04-28-2011

As far as I can recall, -mssse3 hasn't been recommended for Atom, any more than for Core i3, although I don't expect it to lose as much with gcc as with icc. The Atom team has worked to convince us to avoid even the special icc Atom architecture switch.
You could easily check whether your gcc -mavx is generating AVX-128 code. If you generate a .s file by -S or run objdump on a .o file (objdump -S your.o) you would see operation on xmm registers (similar to sse) for AVX-128; ymm registers for AVX-256. According to my understanding, gcc support for AVX-256 comes with gcc-4.7. AVX-128 should accomplish a given task with fewer asm instructions (typically 10% less, not sufficient to motivate the introduction of the new instructions), but the difference in terms of micro-ops probably is a lot less. So it's entirely "reasonable" for the performance of SSE and AVX-128 to be within 3%.
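The register check described above can be automated once you have the assembly text from gcc -S or objdump. The following is only a sketch: the function name `count_simd_widths` and the `SAMPLE` listing are made up for illustration (the sample is hand-written AT&T-style assembly, not real compiler output), and it does nothing more than count lines that mention xmm or ymm registers.

```python
import re
from collections import Counter

# Matches x86 SIMD register names in AT&T or Intel syntax, e.g. %xmm3 or ymm12.
SIMD_REG = re.compile(r"\b(xmm|ymm)\d+\b")

# Hand-written illustrative listing (an assumption, not real gcc output).
SAMPLE = """\
    vmovups (%esi,%eax), %ymm0
    vaddps  (%edi,%eax), %ymm0, %ymm0
    vmovups %ymm0, (%edx,%eax)
    vaddss  %xmm1, %xmm2, %xmm2
"""


def count_simd_widths(asm_text):
    """Count lines that touch 128-bit (xmm) or 256-bit (ymm) registers."""
    counts = Counter()
    for line in asm_text.splitlines():
        kinds = {m.group(1) for m in SIMD_REG.finditer(line)}
        # A line that mentions any ymm register counts as a 256-bit operation.
        if "ymm" in kinds:
            counts["ymm"] += 1
        elif "xmm" in kinds:
            counts["xmm"] += 1
    return counts


if __name__ == "__main__":
    print(dict(count_simd_widths(SAMPLE)))
```

To use it on real code, read the saved output of objdump or the .s file into a string and pass that string to `count_simd_widths`. A listing with only xmm lines can still be either SSE or AVX-128, so you would also need to look for the v-prefixed mnemonics (vaddps rather than addps) to tell those two apart.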
In an AVX FMA implementation (no such yet in production), it's possible that selected serial sequences might see the sort of gain you imagine. The LFK calibration loop, if that's the one you mean, is easily vectorizable, and would not be speeded up so much by FMA even if it makes a large decrease in the number of asm instructions. The last CPU I worked with where there was an instruction issue rate bottleneck which could be overcome by FMA was a MIPS over a decade ago. When I submitted a gcc patch to make a large increase in IA64 FMA usage, similar to what was already implemented for PA-RISC, it took 3 years for gcc to release it, so you can see that the experts don't put so much stock in FMA. Anyway, the prospect of a future FMA doesn't justify expectations of current non-FMA CPUs.
Some marketers would like you to think that a decrease in the number of asm instructions for a given task could always be accomplished without an increase in clock cycles per asm instruction, or that AVX would magically decrease the number of cycles for a cache or TLB miss compared with SSE on the same CPU. Your terminology "not reasonable" could rightly be applied to such implications.
I guess you're confirming that the comparison you are making is between x87 scalar and SSE vector code. It's hardly surprising if you find that out-of-order execution speedup on the I3 is greater for the scalar than for vector code. People used to consider that a mark of an excellent CPU. The early Opteron CPUs were praised for their ability to run x87 code nearly as fast as SSE2. Now you're complaining about the extent to which I3 approaches that old "ideal."

Matthias_Kretz · ‎05-01-2011

GCC 4.5 (I didn't check earlier versions) already creates 256 bit instructions with -mavx. I've never seen/heard different. Also GCC automatically uses FMA instructions when enabled, even if you write _mm256_add_ps(_mm256_mul_ps(a, b), c); (The switch for that is -mfma4 and works on the new AMD CPUs.)
GCC also has the -msse2avx/-mno-sse2avx switches, which tell the compiler whether to use the ternary or binary instruction codings for 128-bit instructions. You might have been thinking of that one?

Regarding ILP: The ILP of vector and scalar SSE/AVX instructions is the same. So it just depends on the actual code, how much ILP you will get. If you do vertical vectorization, you're likely trading ILP for vector parallelism and therefore don't gain anything.

Matthias_Kretz · ‎05-01-2011

Quoting c0d1f1ed

I think all this really illustrates is how bad Atom is at x87 instructions. See Agner Fog's microarchitecture document (section 10.5) for details: http://www.agner.org/optimize/

He said that he uses an Atom 330. That one supports 64-bit instructions. So unless he's using a 32-bit OS he was very likely not using x87 instructions. So his code probably compares scalar SSE vs. vector SSE instructions. But, really, this kind of benchmark, where you rely on auto-vectorization of the compiler, really needs double-checking of the generated asm to be able to interpret the results. You can't know whether you're benchmarking the compiler's or the CPU's abilities.