If you use ADDSD instead of ADDPD, you will process (at most) half as much data per loop iteration. Since loops are usually unrolled to consume one cache line of data per iteration, that means you will end up with longer code and you will have to use more registers.
I am sure ADDPD will work at least a bit faster than ADDSD on any CPU that supports it — if you believe otherwise, then by all means benchmark your own code to see if vectorization is beneficial for your particular case.
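For what it's worth, here is a minimal sketch of the difference in C with SSE2 intrinsics (the function names, the even-n assumption, and the 16-byte alignment assumption are all mine): the scalar loop consumes one double per iteration, the packed loop two.

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar: compiles to ADDSD, one double per iteration. */
void add_scalar(const double *a, const double *b, double *r, int n)
{
    for (int i = 0; i < n; i++) {
        __m128d x = _mm_load_sd(&a[i]);
        __m128d y = _mm_load_sd(&b[i]);
        _mm_store_sd(&r[i], _mm_add_sd(x, y));
    }
}

/* Packed: compiles to ADDPD, two doubles per iteration
   (assumes n is even and a, b, r are 16-byte aligned). */
void add_packed(const double *a, const double *b, double *r, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d x = _mm_load_pd(&a[i]);
        __m128d y = _mm_load_pd(&b[i]);
        _mm_store_pd(&r[i], _mm_add_pd(x, y));
    }
}

To cover a full 64-byte cache line per iteration, the scalar version needs 8 unrolled iterations' worth of temporaries where the packed one needs 4, hence the longer code and higher register pressure mentioned above.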
> somewhat analogous to the way AMD CPUs did at the time
This may explain why I was seeing less than a 2x speedup, rather than the expected 4x, when using SSE on my previous AMD 3500.
I always assumed it was an AMD thing, possibly related to the 3DNow! legacy (which would have made sense).
Now that you mention this, wasn't the real benefit of SSE back then just the fact that you were -not- using the FPU, which was slower (especially on the P4)?
I mean, at that time I was wondering why there were so many scalar SSE1 operations, since they were seriously inferior (32-bit vs 80-bit) to the FPU versions. I just didn't see the point, other than them being there because the old FPU had started to suck on Intel's CPUs (while it was still fast on AMD's).
And then SSE3 added... a new FPU instruction (FISTTP). So I don't think I will ever understand this schizophrenic situation:
-does Intel want us to use SSE or the FPU?
-why scalar SSE operations?
-why so many instructions that do the same thing, but differently?
Was there a plan/story behind all this? And did those plans change because of failures?
Max gave an excellent summary; I have not seen this explained in one place.
Let me try again to follow up, since my original reply was deleted during restore.
Although the peak floating-point arithmetic issue rate (per cycle) doubled from P4 to Core 2, the peak rate for use of data in cache did not increase, while the clock rate (per second) decreased. This was partly offset by improvements in buffering. For example, the number of write-combining buffers went from 6 in P4, to 8 in Prescott, to 10 in Core 2. There has been no indication of a further increase in write-combining buffers per core for AVX, in spite of the restoration of HyperThreading.
AVX did not introduce wider move instructions. 256-bit registers are packed and unpacked by 128-bit moves. Two 128-bit loads are possible per cycle, but only one 128-bit store. Add and multiply instructions can take one 256-bit aligned memory operand. The increased rate of loads (and improved memory-system performance) would help SSE2 code as well, returning to the balance between floating-point and data-load rates of P4. So, in practice, AVX instructions would not double performance.
So far, gcc for AVX doesn't pack two 128-bit operands per register, so the potential increase in peak floating-point rate doesn't apply to gcc.
Hi Tim, thanks for the reply.
Let me correct you though:
> AVX did not introduce wider move instructions ...
Please check the AVX Programming Reference at http://software.intel.com/sites/avx/ - AVX does indeed introduce 256-bit load/store instructions.
VMOVUPS/VMOVUPD/VMOVDQU should be used for 256-bit load/store in AVX (and not the aligned counterparts VMOVAPS/VMOVAPD/VMOVDQA, which still exist if you need alignment exceptions).
AVX also improves the LOAD+OP programming paradigm (e.g. VADDPS ymm0, ymm1, [rsi + rax*8]): for both 128- and 256-bit instructions, memory operands no longer fire unaligned exceptions in AVX; that is the new standard behavior. And taking into account that starting with Nehalem, MOVUPS on actually aligned data has the same performance as MOVAPS (check Ronak Singhal's IDF slides https://intel.wingateweb.com/SHchina/published/NGMS001/SP08_NGMS001_100r_eng.pdf, slide 25: no reason to use the aligned instruction on Nehalem!), this trend is going to continue. So, instructions that raise alignment exceptions are becoming largely unnecessary.
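For illustration, a minimal sketch (C with AVX intrinsics; the function name and loop structure are my own) of the 256-bit unaligned load/op/store pattern described above:

#include <immintrin.h>  /* AVX intrinsics (compile with AVX enabled) */

/* 256-bit loads/stores: VMOVUPD does not fault on unaligned addresses,
   and the memory operand of VADDPD no longer requires alignment in AVX. */
void add_avx(const double *a, const double *b, double *r, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m256d x = _mm256_loadu_pd(&a[i]);            /* VMOVUPD */
        __m256d y = _mm256_loadu_pd(&b[i]);
        _mm256_storeu_pd(&r[i], _mm256_add_pd(x, y));  /* VADDPD  */
    }
    for (; i < n; i++)  /* scalar remainder */
        r[i] = a[i] + b[i];
}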
> Although the peak floating point arithmetic issue rate (per cycle) doubled from P4 to Core 2, the peak rate for use of data in cache did not increase ...
Yes, Core2 did not increase peak L1 read throughput (16 bytes/clock) vs. P4; however, as you mentioned, the Core2 u-arch actually makes this high throughput achievable for a much wider range of code than P4 did (I'd even say P4 hardly needed that high an L1 throughput for the majority of FP codes). Plenty of (and most optimized) FP algorithms do more calculation than data reading, so the doubled FP operations peak is perfectly achievable with 128-bit/clock L1 read throughput on Core2 (LINPACK would probably be the most widely recognized example).
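To make the "more calculation than data" point concrete, a toy sketch of my own (not LINPACK; the coefficients are made up): a Horner polynomial does 8 FLOPS per 8-byte load, so the FP units, not the 16-byte/clock L1 read port, are the bottleneck.

/* 8 FLOPS (4 MUL + 4 ADD) per double loaded: compute-bound, not load-bound. */
void poly(const double *x, double *y, int n)
{
    const double c4 = 1.1, c3 = 2.2, c2 = 3.3, c1 = 4.4, c0 = 5.5;
    for (int i = 0; i < n; i++) {
        double t = x[i];  /* one 8-byte load */
        y[i] = (((c4 * t + c3) * t + c2) * t + c1) * t + c0;
    }
}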
And as was said at past IDFs, Sandy Bridge will also double L1 load throughput to balance the doubled peak FP operation throughput.
I also expect GCC to fully support the 256-bit AVX FP vector width (for loads/stores and compute/shuffle operations), maybe soon, or closer to Sandy Bridge's appearance on the market.
Please let me know if I still failed to clarify something.
Intel compilers continue to produce two versions of vectorized code so as to use more aligned loads, and gcc continues to use scalar loads to avoid unaligned loads, even though these tactics aren't optimal on Nehalem or Barcelona.
The greatly improved performance of unaligned loads also makes it viable to consider loop reversal, which allows effective vectorization when source and destination overlap. Both Intel and gnu compilers still avoid reversed-loop vectorization with parallel loads and stores.
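A sketch of the loop-reversal idea (my own example): here the destination overlaps the source shifted by one, so the forward loop has a dependence that blocks vectorization, while the reversed loop reads each element before any store can clobber it.

/* Intent: a[i+1] = original a[i] + 1.0 (shift every element up one slot,
   adding 1.0). Written forward, iteration i would read a value that
   iteration i-1 just wrote; written in reverse, every load precedes the
   store that would overwrite it, so parallel loads and stores are safe. */
void shift_up(double *a, int n)
{
    for (int i = n - 2; i >= 0; i--)
        a[i + 1] = a[i] + 1.0;
}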
Sorry for asking late in this thread.
I understand ADDSD & ADDPD for SSE2 as below -
__m128d _mm_add_sd(__m128d a, __m128d b)
Adds the lower DP FP (double-precision, floating-point) values of a and b ; the
upper DP FP value is passed through from a.
r0 := a0 + b0
r1 := a1
__m128d _mm_add_pd(__m128d a, __m128d b)
Adds the two DP FP values of a and b.
r0 := a0 + b0
r1 := a1 + b1
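To make my understanding concrete, here is a tiny test sketch (the values are made up) showing what I expect the two intrinsics to produce:

#include <stdio.h>
#include <emmintrin.h>

int main(void)
{
    __m128d a = _mm_set_pd(20.0, 1.0);    /* a1 = 20.0,  a0 = 1.0 */
    __m128d b = _mm_set_pd(300.0, 2.0);   /* b1 = 300.0, b0 = 2.0 */
    double sd[2], pd[2];

    _mm_storeu_pd(sd, _mm_add_sd(a, b));  /* expect { 3.0, 20.0 }:  r1 = a1      */
    _mm_storeu_pd(pd, _mm_add_pd(a, b));  /* expect { 3.0, 320.0 }: r1 = a1 + b1 */

    printf("add_sd: %g %g\n", sd[0], sd[1]);
    printf("add_pd: %g %g\n", pd[0], pd[1]);
    return 0;
}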
(a) You commented: "If you use ADDSD instead of ADDPD, you will process (at most) half as much data per loop iteration." Could you elaborate w.r.t. the above definitions of ADDSD & ADDPD? I am simply asking to understand.
(b) You commented: "I am sure ADDPD will work at least a bit faster than ADDSD on any CPU that supports it". Could you elaborate more w.r.t. the above definitions of ADDSD & ADDPD?
Thanks & BR.
The above information seems very valuable w.r.t. EUs (execution units) and instruction flow.
Could you say something about the SSE2 behaviour within the Clovertown 5300 series processors?
Do you have any idea where one can get information on EUs and instruction flow for SSE2 on Clovertown?
>>> 0.5 64-bit MUL + 1 64-bit ADD (4 SP or 1.5 DP FLOPS) per cycle
Hmmm... how do you get 1.5 DP FLOPS per cycle on the PIII?
0.5*2 + 1*2 = 3. How do you get 4 SP??
Why does this link: http://www.intel.com/support/processors/sb/CS-020868.htm#2
indicate the PIII-1GHz performance as 2 GFLOPS?? That's unreal. The real peak performance is 1 DP GFLOPS! And how much in SP, 4 or 3 GFLOPS?
> Could you give some information about Intel Atom peak FP operations throughput (for single and double precision)?
To see the complete operation-throughput picture, please check the Optimization Reference Manual at http://www.intel.com/products/processor/manuals/ and look at pages 12-19 through 12-26 for Atom.
To summarize SIMD/SSE FP performance: the current Atom can do 1 SP ADD (packed ADDPS or scalar ADDSS) on port 1 and 1 SP MUL (packed MULPS or scalar MULSS) on port 0 per cycle, which gives a quite high 8 SP FLOPS/cycle in _peak_, though practically achievable performance will be less; scalar DP performance is 1 DP ADD + 1 DP MUL per cycle = 2 DP FLOPS/cycle, and packed DP performance is quite slow, at about ~1/5 the throughput of SP or scalar DP.
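Spelling out that peak figure (my own breakdown of the numbers above):
  MULPS: 4 SP multiplies/cycle on port 0
  ADDPS: 4 SP adds/cycle on port 1
  4 + 4 = 8 SP FLOPS/cycle (peak)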
>> 0.5 64-bit MUL + 1 64-bit ADD (4 SP or 1.5 DP FLOPS) per cycle
> Hmmm... how do you get 1.5 DP FLOPS per cycle on the PIII?
I put a note at the beginning that the PIII only supports SP SIMD (SSE), so the DP figures were cited for PentiumM derivatives only.
> 0.5*2 + 1*2 = 3. How do you get 4 SP??
The half-rate throughput for MULs applies only to DP, not to SP.
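Spelling out the arithmetic as I read it (the SSE unit is 64 bits wide, i.e. one op covers 2 SP values or 1 DP value):
  SP: 1 MUL/cycle * 2 values + 1 ADD/cycle * 2 values = 4 SP FLOPS/cycle
  DP: 0.5 MUL/cycle * 1 value + 1 ADD/cycle * 1 value = 1.5 DP FLOPS/cycle
The half-rate factor applies only to the DP multiply, which is why 0.5*2 + 1*2 = 3 is not the SP peak.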
> Why does this link: http://www.intel.com/support/processors/sb/CS-020868.htm#2 indicate the PIII-1GHz performance as 2 GFLOPS?? That's unreal. The real peak performance is 1 DP GFLOPS! And how much in SP, 4 or 3 GFLOPS?
I can't comment on that page for an old, discontinued processor ...