Regarding this, I have the following questions:
a. Is this true?
b. Assuming it is true, won't it mean that there is no speed advantage of instructions like ADDPD over ADDSD (since the ADDPD instruction is split into two anyway)?
For non-Intel brands of CPUs, it was technically possible for a pair of scalar double operations to run as fast as a parallel double operation, provided that bottlenecks in instruction decode, write combine buffering, etc, were avoided. In my experience, there was always at least a 5% advantage in vectorized code for large applications with 64-bit operands; a 20% advantage with 32-bit operands. Granted, you might loosely characterize a 5% gain as no gain, when a competing Intel CPU showed a 30% gain for vectorization in the same situation.
On the other side of the coin, the CPUs which were designed not to depend on vectorization did show an advantage when running vectorizable code which was not vectorized.
Given the wide availability of vectorizing compilers and CPUs which execute parallel instructions efficiently, I don't see that historical facts should influence your future development plans, unless you intend to use Microsoft C (the only remaining major non-vectorizing compiler) to develop for CPUs of the past.
a. Is it also true for all Intel non-mobile CPUs before the Core 2 Duo? In other words, is the Core 2 Duo the only Intel processor which has a 128-bit wide SSE execution unit?
b. When you say that you obtained a gain of 5% with 64-bit operands, are you talking about non-Intel brand CPUs, or Intel CPUs with a 64-bit wide SSE execution unit? Same question for the case where you mention a 30% gain.
Unfortunately, I was hoping to roughly halve the time of some of my numerical functions, so this is very pertinent to me.
I was contrasting the 5% overall gain for vectorizing large double precision applications on a non-Intel CPU with a 30% gain on an Intel CPU with the same application. Such applications are not really feasible to run on Core Solo or Turion. I mentioned it only as an example of the CPUs which were designed to run nearly as well without vectorization as with it.
In any case, discussing how to minimize the gain of current CPUs over those no longer in production seems tangential to the purpose for which this forum branch was started.
If you use ADDSD instead of ADDPD you will process (at most) half as much data per loop iteration. Since loops are usually unrolled to consume one cache line of data per iteration, that means you will end up with longer code and you will have to use more registers.
I am sure ADDPD will work at least a bit faster than ADDSD on any CPU that supports it — if you believe otherwise, then by all means benchmark your own code to see if vectorization is beneficial for your particular case.
somewhat analogous to the way AMD CPUs did at the time
This may explain why I was only seeing less than a 2x speedup, instead of the expected 4x, when using SSE on my old AMD 3500.
I always assumed it was an AMD thing, possibly related to the 3DNow! legacy (which would have made sense).
Now that you mention this, wasn't the real benefit of SSE back then simply the fact that you were -not- using the FPU, which was slower (especially on the P4)?
I mean, at the time I wondered why there were so many scalar SSE1 operations, since they were seriously inferior (32-bit vs 80-bit) to the FPU versions. I just didn't see the point, other than them being there because the old FPU had started to underperform on Intel CPUs (while it was still fast on AMD's).
And then SSE3 added... a new FPU instruction. So I don't think I will ever understand this schizophrenic situation:
- does Intel want us to use SSE or the FPU?
- why scalar SSE operations?
- why so many instructions that do the same thing, but differently?
Was there a plan/story behind all this? And changes in those plans because of failures?
I believe that past AMD models were designed so that, for example, 2 ADDSD could be executed in the same or less time than a single ADDPD (with a number of qualifications). I suspect that AMD intended not to rely as much as Intel on vectorizing compilers.
The Pentium-M was designed likewise so that parallel SSE operations were performed in 2 parts. Also, it was designed so that x87 instructions could be issued at a higher rate than SSE instructions. I doubt that "Intel wants" people to continue writing code for Pentium-M, which came near the end of the line of CPUs without 64-bit mode option.
In fact, the option to generate code skewed toward Pentium-M is deprecated to the extent that it isn't mentioned in the basic documentation of the latest Intel compilers, and compilers issued in the last year have warned against its use. Of course, there never was an equivalent option for 64-bit mode.
If you are so interested in history, you could read up on how Prof. Kahan persuaded vendors to standardize and produce 80-bit format floating point, the ensuing controversies, and how performance came to be preferred over extra precision. Not to mention 64-bit mode.
Probably a late answer, but I decided to clarify it anyway, as I used to get similar questions regularly. Let's leave aside the legacy x87 80-bit floating-point (FP) instructions, as they have other differences, and for clarity speak only about SIMD FP operations. I'll refer to 64-bit FP (a.k.a. double-precision or DP) operations, included in SSE2, and to 32-bit (single-precision or SP) operations, available since SSE in the Pentium III.
Pentium III (only SP supported, via SSE) and Pentium M based CPUs (including Core Duo, not to be confused with Core 2): these have 64-bit FP MUL and FP ADD execution units (EUs) located on two different dispatch ports. 128-bit SSE operations are split into two 64-bit parts in the front-end stages before they go to the OOO engine for scheduling and dispatch. Also, DP MUL executes at half throughput. Thus the peak FP throughput you can get is 0.5 64-bit MUL + 1 64-bit ADD (4 SP or 1.5 DP FLOPS) per cycle; for DP this holds with either packed or scalar operations, while for SP vectorization is required.
P4: a bit more complex. It has 128-bit FP MUL and FP ADD units, each of which can accept either a packed or a scalar operation every other cycle. Both the ADD and MUL EUs are located on the same port, which can dispatch just one ADD or MUL operation (packed or scalar) per cycle. So the peak FP throughput is 1 64-bit FP MUL + 1 64-bit FP ADD per cycle (4 SP or 2 DP FLOPS), but to achieve it _packed_ 128-bit instructions must be used. If the code is not vectorized, then just one scalar FP operation (either ADD or MUL) can be dispatched per clock on the P4; this may explain the relative weakness of the P4 on non-vectorized FP code. On the other hand, for optimized vectorized code the P4 offered very competitive FP performance.
Core 2 (all current products, and Nehalem) doubled peak FP performance by adding 128-bit FP ADD and FP MUL EUs (on different ports, working with 1-cycle throughput), giving a peak FP throughput for vectorized code of 2 64-bit MULs and 2 64-bit ADDs per cycle (8 SP or 4 DP FLOPS).
Upcoming products based on Sandy Bridge microarchitecture will once again double peak FP operations throughput by introducing 256-bit AVX instruction set, supported by microarchitecture capability to start 1 256-bit FP MUL and 1 256-bit FP ADD operations per cycle (16 SP or 8 DP FLOPS).
Hope this helps,
Max gave an excellent summary; I have not seen this explained in one place.
Let me try again to follow up, since my original reply was deleted during restore.
Although the peak floating-point arithmetic issue rate (per cycle) doubled from P4 to Core 2, the peak rate for use of data in cache did not increase, while the clock rate (per second) decreased. This was offset partly by improvements in buffering. For example, the number of write-combine buffers went from 6 in P4, to 8 in Prescott, to 10 in Core 2. No indication has come of a further increase in write-combine buffers per core for AVX, in spite of the restoration of HyperThreading.
AVX did not introduce wider move instructions. 256-bit registers are packed and unpacked by 128-bit moves. 2 128-bit loads are possible per cycle, but only 1 128-bit store. Add and multiply instructions support 1 possible 256-bit aligned memory operand. The increased rate of loads (and improved memory system performance) would help SSE2 code as well, returning to the balance between floating point and data load rate of P4. So, in practice, AVX instructions would not double performance.
gcc for AVX, so far, doesn't pack 2 128-bit operands per register, so the potential increase in peak floating-point rate doesn't apply to gcc.
Hi, Tim, thanks for reply.
Let me correct you though:
> AVX did not introduce wider move instructions ...
Please check AVX Programming Reference at http://software.intel.com/sites/avx/ - AVX indeed introduces 256-bit load/store instructions.
VMOVUPS/VMOVUPD/VMOVDQU should be used for 256-bit load/store in AVX (their aligned counterparts, VMOVAPS etc., still exist if you need alignment exceptions).
AVX also improves the programming paradigm for LOAD+OP type operations (e.g. VADDPS ymm0, ymm1, [rsi + rax*8]): for both 128- and 256-bit instructions, these no longer fire unaligned exceptions in AVX; it is the new standard behavior. And taking into account that, starting with Nehalem, MOVUPS on actually aligned data has the same performance as MOVAPS (check Ronak Singhal's IDF slides https://intel.wingateweb.com/SHchina/published/NGMS001/SP08_NGMS001_100r_eng.pdf, slide 25: no reason to use the aligned instruction on Nehalem!), this trend is going to continue. So, instructions producing alignment exceptions are becoming less and less necessary.
> Although the peak floating point arithmetic issue rate (per cycle) doubled from P4 to Core 2, the peak rate for use of data in cache did not increase ...
Yes, Core 2 did not increase peak L1 read throughput (16 bytes/clock) vs. the P4; however, as you mentioned, the Core 2 microarchitecture actually lets a much wider range of code achieve this high throughput compared to the P4 (I'd even say the P4 hardly needed that high an L1 throughput for the majority of FP codes). Plenty of (and most optimized) FP algorithms perform more calculation than data reading, so the doubled peak FP rate is perfectly achievable with 128-bit/clock L1 read throughput on Core 2 (LINPACK would probably be the most widely recognized example).
And as was said at past IDFs, Sandy Bridge will also double L1 load throughput to balance the doubled peak FP operation throughput.
I also expect GCC to fully support 256-bit AVX FP vector width (for loads/stores and compute/shuffle operations) maybe soon or closer to Sandy Bridge appearance on the market.
Please let me know if I still haven't clarified something,
Intel compilers continue to produce two versions of vectorized code so as to use more aligned loads, and gcc continues to use scalar loads to avoid unaligned loads, even though these tactics aren't optimal on Nehalem or Barcelona.
The greatly improved performance of unaligned loads also makes it viable to consider loop reversal, which allows effective vectorization when source and destination overlap. Both the Intel and gnu compilers still avoid reversed-loop vectorization with parallel loads and stores.
Sorry for asking late for this thread.
I understand ADDSD & ADDPD for SSE2 as below -
__m128d _mm_add_sd(__m128d a, __m128d b)
Adds the lower DP FP (double-precision, floating-point) values of a and b ; the
upper DP FP value is passed through from a.
r0 := a0 + b0
r1 := a1
__m128d _mm_add_pd(__m128d a, __m128d b)
Adds the two DP FP values of a and b.
r0 := a0 + b0
r1 := a1 + b1
(a) You commented, "If you use ADDSD instead of ADDPD you will process (at least) two times less data per loop iteration." Could you elaborate with respect to the above definitions of ADDSD and ADDPD? I am simply asking to understand.
(b) You commented, "I am sure ADDPD will work at least a bit faster than ADDSD on any CPU that supports it." Could you elaborate more with respect to the above definitions of ADDSD and ADDPD?
Thanks & BR.
The above information seems very valuable w.r.t. EUs and instruction flow.
Could you suggest something about the SSE2 behaviour within the Clovertown 5300-series processors?
Do you have any idea where one can find information on EUs and instruction flow for SSE2 on Clovertown?
>>> 0.5 64-bit MUL + 1 64-bit ADD (4 SP or 1.5 DP FLOPS) per cycle
Hmmm... how do you get 1.5 DP FLOPS per cycle on the PIII?
0.5*2 + 1*2 = 3. How do you get 4 SP?
Why, in this link: http://www.intel.com/support/processors/sb/CS-020868.htm#2
is the PIII-1GHz performance indicated as 2 GFLOPS? That's unreal. Real peak performance is 1 DP GFLOPS! How much in SP? 4 or 3 GFLOPS?
> Could you give some information about Intel Atom peak FP operations throughput (for single and double precision)?
To see the complete operation-throughput picture, please check the Optimization Reference Manual at http://www.intel.com/products/processor/manuals/ (pages 12-19 to 12-26 for Atom).
To summarize SIMD/SSE FP performance: the current Atom can do 1 128-bit or 64-bit SP ADD (ADDSS/ADDPS) on port 1 and 1 128-bit or 64-bit SP MUL (MULSS/MULPS) on port 0, which gives a quite high 8 SP FLOPS/cycle in _peak_, though practically achievable performance will be lower. Scalar DP performance is 1 DP ADD + 1 DP MUL per cycle (2 DP FLOPS/cycle), and packed DP is quite slow, at about 1/5 the throughput of SP or scalar DP.
>> 0.5 64-bit MUL + 1 64-bit ADD (4 SP or 1.5 DP FLOPS) per cycle
> Hmmm.. How get 1.5 DP FLOPS per cycle on PIII?
I noted at the beginning that the PIII only supports SP SIMD (SSE), so the DP figures were cited for Pentium M derivatives only.
> 0.5*2+1*2=3. How you get 4 SP??
The half-throughput for MUL applies only to DP, not to SP; for SP it is 1*2 + 1*2 = 4 FLOPS/cycle.
> Why in this link: http://www.intel.com/support/processors/sb/CS-020868.htm#2 PIII-1GHz performance indicated as 2GFLOPS?? It's unreal. Real peak performance - 1 DP GFLOPS! How many in SP? 4 or 3 GFLOPS?
I cannot comment regarding that old, discontinued-processor page...