The bypass between the vector integer and floating-point stacks in fact adds 2 cycles of latency on Nehalem, but 1 cycle on Sandy Bridge.
PBLENDVB also has a 2-cycle latency, not 1.
Unless you are sure you are bound by port #5 (e.g. due to FP shuffles or FP logic operations), it is rarely worth paying with additional latency for the ability to execute on more ports.
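To make the trade-off concrete, here is a rough sketch of how the bypass delay adds up along a single dependency chain. It is a deliberately simplified additive model, not a description of the real forwarding network: the `op` struct, the `chain_latency` helper and the per-instruction latencies are made up for illustration. Only the 2-cycle (Nehalem) and 1-cycle (Sandy Bridge) bypass figures come from the comment above.

```c
#include <stddef.h>

/* Hypothetical model: every instruction lives in one execution domain. */
enum domain { DOM_INT, DOM_FP };

struct op {
    enum domain dom;  /* domain the instruction executes in */
    int latency;      /* illustrative latency in cycles, not measured */
};

/* Latency of a serial dependency chain. Assumption: bypass_delay is
 * charged once each time a result crosses from one domain to the other. */
int chain_latency(const struct op *ops, size_t n, int bypass_delay)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += ops[i].latency;
        if (i > 0 && ops[i].dom != ops[i - 1].dom)
            total += bypass_delay;
    }
    return total;
}
```

For an FP -> integer -> FP chain with illustrative latencies of 3, 1 and 3 cycles, this model gives 7 cycles with no bypass penalty, 9 with a 1-cycle penalty and 11 with a 2-cycle penalty, so the two crossings alone account for more than a third of the chain in the last case.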
> what is the meaning of the first, and what is the meaning of the second?
Latency (in cycles) and maximum throughput (the number of cycles it takes to execute 1 instruction; a value <1 means more than 1 can be executed per cycle).
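As a small worked example of how to read the two numbers: the helpers below are hypothetical and only turn the figures into rough lower bounds. They assume nothing else (ports, memory, scheduler) limits execution, which is rarely true in practice.

```c
/* Instructions per cycle implied by a throughput figure given in
 * cycles per instruction (e.g. 0.5 means two per cycle). */
double instructions_per_cycle(double cycles_per_instr)
{
    return 1.0 / cycles_per_instr;
}

/* Lower bound for n independent instructions: limited by throughput. */
double min_cycles_independent(int n, double cycles_per_instr)
{
    return n * cycles_per_instr;
}

/* Lower bound for n instructions that each depend on the previous
 * result: limited by latency, since nothing can overlap. */
double min_cycles_dependent(int n, double latency)
{
    return n * latency;
}
```

With an illustrative figure pair of latency 3 and throughput 0.5, eight independent instructions need at least 4 cycles, while eight chained ones need at least 24.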
@sh0dan: I thought I'd elaborate a bit more. Almost all "FP codes" are bound either by memory or cache hierarchy bandwidth, or by the capacity of the scheduler (RS) to keep as many long-latency dependency chains (with relatively long-latency MULs and ADDs) in flight as possible. So, by adding a stack bypass latency between instructions, you are more likely to increase pressure on the scheduler and reduce the level of out-of-order parallelism than to benefit from the additional throughput. I'd really be interested to see a real-life example that benefits from replacing FP vector operations with integer equivalents.
@Gaiger Chen: check out this tool: http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/ It shows the peak (idealized) throughput and latency of a piece of code through the micro-architecture, including the placement of uops on the execution ports.
The example is here:
Go to "rgb_tone_sse2". Basically, for 3 xmm registers, each containing values for 4 pixels (R, G, B), it needs to determine which of them contains the highest, lowest and medium value.
So two registers are made, one containing the smallest (sm) and one the largest (lg) values.
The part that involves integer operations is where masks are created for each type, for instance "is the R value the largest value" (is_r_lg) and so on. This is done for all 9 cases.
Also based on this, the "medium" values are found. "Stuff" is done to this, and the masks are used to recombine the values and place them in R, G, B respectively.
PS() & DW() are simple macros for _mm_castsi128_ps() and _mm_castps_si128(), for better readability.
A quick calculation: there are 10 compares and 27 and/or/xor operations, with a total of 10 registers switching one way or the other.
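For readers without the source at hand, here is a minimal sketch of one of the nine cases, not the actual rgb_tone_sse2 code. The function name and the `new_lg4` input are made up for illustration. Only the PS()/DW() macros and the is_r_lg mask idea come from the description above, and the integer compare on the raw float bits is an assumption about how such a mask could be built.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

#define PS(x) _mm_castsi128_ps(x)   /* reinterpret __m128i as __m128 */
#define DW(x) _mm_castps_si128(x)   /* reinterpret __m128 as __m128i */

/* Hypothetical helper: out_r4[i] = new_lg4[i] in lanes where R is the
 * largest of (R, G, B), otherwise the original R value. */
void replace_r_where_largest(const float r4[4], const float g4[4],
                             const float b4[4], const float new_lg4[4],
                             float out_r4[4])
{
    __m128 r = _mm_loadu_ps(r4);
    __m128 g = _mm_loadu_ps(g4);
    __m128 b = _mm_loadu_ps(b4);
    __m128 new_lg = _mm_loadu_ps(new_lg4);

    /* FP domain: per-lane largest of the three channels. */
    __m128 lg = _mm_max_ps(_mm_max_ps(r, g), b);

    /* Integer domain: compare raw bit patterns, giving all-ones in lanes
     * where r equals lg bit for bit (ignores +0/-0 and NaN subtleties). */
    __m128i is_r_lg = _mm_cmpeq_epi32(DW(r), DW(lg));

    /* Back in the FP domain: blend with and/andnot/or. */
    __m128 m = PS(is_r_lg);
    __m128 out = _mm_or_ps(_mm_and_ps(m, new_lg), _mm_andnot_ps(m, r));
    _mm_storeu_ps(out_r4, out);
}
```

Each DW()/PS() cast is free by itself, but the result crosses between the FP and integer stacks at that point. An all-FP variant would build the mask with _mm_cmpeq_ps(r, lg) instead, at the price of competing for the FP ports.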