Background: There are quite a number of SSE instructions that have an SSE2 equivalent with lower latency and better throughput. For instance:
* "pand" (1/0.5) vs "andps" (1/1)
* "pcmpeqd" (1/0.5) vs "cmpps" (3/1)
* "pblendvb" (2/1) vs "blendvps" (2/2)
(of course there are more, but just some basic examples).
Some can always be interchanged and some only in certain cases, but anyway: what is the penalty for executing an "integer" instruction on floating-point data?
Edit: Corrected blendvX timings.
See chapter 18.104.22.168 of the Intel 64 and IA-32 Architectures Optimization Reference Manual, available from http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimizati...
There are also details about bypass delays there, e.g. in table 2-18.
In addition, please note that "ps" at the end of an instruction name commonly means "packed single precision" ("pd" for packed double), i.e. it operates on all packed elements and not just a single element/scalar, which needs to be taken into consideration when checking execution times.
Don't know why I hadn't picked that up; I've been through the document quite a few times.
Thanks for the help!
May I ask a stupid question?
What is the meaning of the numbers inside the brackets?
What is the meaning of the first, and what is the meaning of the second?
Latency? And ...?
The Vector Integer <-> Floating Point stack bypass in fact adds 2 cycles of latency on Nehalem, but only 1 cycle on Sandy Bridge.
PBLENDVB is also 2-cycle latency, not 1.
Unless you are sure you are bound by port #5 (e.g. due to FP shuffles or FP logic operations), it is rarely worth paying the additional latency for the ability to execute on more ports.
> What is the meaning of the first, and what is the meaning of the second?
Latency (in cycles) and maximum throughput (in cycles per instruction - if it is <1 it means more than 1 can be executed per cycle).
So, per your explanation, for a simple piece of code:
(a, b, c are integers on the stack)
MOV EAX, a; (1)
MOV EBX, b; (2)
MOV ECX, c; (3)
ADD EBX, EAX; (4)
INC EBX; (5)
DEC ECX; (6)
by the out-of-order mechanism, (1)(2)(3) are independent, so we use throughput to estimate them,
and (4) is dependent on (1)(2), so we should consider the latency of (1)(2),
(5) is dependent on (4), so we should consider the latency of (4),
(6) is dependent on (3), so we should consider the latency of (3).
Is that right?
I have a rather big section that is basically binary compares and masking, 'or' and 'and' of float (ps) values, so if the switch penalty isn't too big the entire section might as well be done as integers, since the values themselves don't matter.
In general yes, but since you are actually dealing with memory operations, no - you have to take cache latency into the equation, and there is also a limited number of in-flight loads, which might clog up your code.
In my experience, you should only worry about latency/throughput on pure "computation" code - or at least when you know your data is in the L1 cache, at which point the load latency is usually in the area of 4 cycles.
I use this table to get a general overview of the instruction timings when NOT dealing with memory:
It doesn't contain Sandy Bridge, but it doesn't differ that much from Nehalem, and you can find those timings in the doc linked above.
However, as always when optimizing, don't stare too much at them - there are a LOT of things that are much more important than instruction timings - like being very aware of avoiding dependencies in general, because that will always help you. On the Atom, you will find parts of your code where you have to "hide" 4 instructions before you can use your result (floating-point addition, integer multiply) for optimal usage.
@sh0dan: thought I'd elaborate a bit more - almost all "FP codes" are bound either by memory/cache hierarchy bandwidth or by the capacity of the scheduler (RS) to keep as many long-latency dependency chains (with relatively long-latency MULs and ADDs) in flight as possible. So, by adding a stack bypass latency between instructions you are more likely to increase pressure on the scheduler and reduce the level of out-of-order parallelism than to get a benefit from the additional throughput... I'd really be interested to see a real-life example that benefits from replacing FP vector operations with integer equivalents...
@Gaiger Chen: check out this tool: http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/ - it shows the peak (idealistic) throughput and latency of a piece of code through the micro-architecture, including the placement of uops onto the execution ports.
The example is here:
Go to "rgb_tone_sse2". Basically it needs to determine, for 3 xmm registers each containing values for 4 pixels (R, G, B), which of them contains the highest, lowest and medium value.
So two registers are made, one containing the smallest (sm) and one the largest (lg) values.
The part that involves integer ops is the part where masks are created for each case, for instance "is the R value the largest value" (is_r_lg), and so on. This is done for all 9 cases.
Also based on this, the "medium" values are found. "Stuff" is done to these, and the masks are used to re-combine the values and place them in R, G, B respectively.
PS() & DW() are simple macros for _mm_castsi128_ps() and _mm_castps_si128(), for better readability.
A quick calculation - there are 10 compares and 27 and/or/xor, with a total of 10 registers switching one way or the other.