SSE Float <=> SSE2 Integer switch penalty

sh0dan · ‎03-06-2012

I have been looking quite a bit for information on the penalty when switching between SSE2 integer instructions and SSE floating-point instructions. I assume there is one, but I cannot find anything to go by.

Background: There are quite a number of SSE instructions that have an SSE2 equivalent with lower latency and better throughput. For instance:

* "pand" (1/0.5) and "andps" (1/1)
* "pcmpeqd" (1/0.5) and "cmpps" (3/1)
* "pblendvb" (2/1) and "blendvps" (2/2)

(Of course there are more; these are just some basic examples.)

Some can always be interchanged and some only in certain cases, but anyway - what is the penalty for executing an "integer" instruction on floating-point data?
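As a concrete picture of what "interchanged" means here, the following is a minimal sketch (not taken from any code in this thread) that clears the sign bit of four floats twice: once with the float-domain AND and once with the integer-domain AND behind casts. The function names are made up for illustration, and a compiler is free to emit either andps or pand for both versions, so the instruction actually chosen should be checked in the generated assembly.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides the SSE float intrinsics */

/* Hypothetical helper: |v| for four floats using the float-domain AND (andps). */
static __m128 abs_ps_float_domain(__m128 v)
{
    /* The cast only reinterprets bits; it performs no conversion. */
    const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
    return _mm_and_ps(v, mask);
}

/* Hypothetical helper: the same bit operation through the integer-domain AND (pand). */
static __m128 abs_ps_int_domain(__m128 v)
{
    const __m128i mask = _mm_set1_epi32(0x7fffffff);
    return _mm_castsi128_ps(_mm_and_si128(_mm_castps_si128(v), mask));
}
```

Both functions return bit-identical results; only the execution domain of the AND may differ.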

Edit: Corrected blendvX timings.

Maxym_D_Intel · ‎03-06-2012

You may need to be aware of the bypass scenario between execution domains,

as described in chapter 3.5.2.3 of the Intel 64 and IA-32 Architectures Optimization Reference Manual, available from http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

There are also details about bypass delays, for example in table 2-18.

In addition, please note that "ps" at the end of an instruction name commonly means "packed single precision" ("pd" means packed double precision), so the instruction operates on all packed elements and not just on a single scalar element. This needs to be taken into consideration when checking execution time.

sh0dan · ‎03-06-2012

Ah - 3.5.2.3 has what I need: in the area of 1 cycle of latency added per transition. Seems fair - actually less than I expected.

Don't know why I hadn't picked that up; I've been through the document quite a few times.

Thanks for the help!

Gaiger_Chen · ‎03-13-2012

Hi

May I ask a stupid question?

What is the meaning of the numbers inside the brackets,

like "(1/0.5)"?

What is the meaning of the first, and what is the meaning of the second?

Latency? And ...?

Thank you.

Max_L · ‎03-13-2012

The Vector Integer <-> Floating Point stack bypass in fact adds 2 cycles of latency on Nehalem, but 1 cycle on Sandy Bridge.

PBLENDVB also has 2-cycle latency, not 1.

Unless you are sure you are bound by port #5 (e.g. due to FP shuffles or FP logic operations), it is rarely worth paying additional latency for the ability to execute on more ports.
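A rough way to see this trade-off on a dependency chain, using only the figures quoted in this thread: the sketch below assumes that the bypass delay is paid once per domain crossing and sits entirely on the critical path. The function name and this simple additive model are assumptions for illustration, not something taken from the manual.

```c
/* Hypothetical model: latency in cycles that one operation contributes to a
 * dependency chain.
 *   op_latency       - latency of the instruction itself
 *   domain_crossings - float<->integer transitions the data makes (2 for a
 *                      single integer instruction between float instructions)
 *   bypass_delay     - extra cycles per crossing (assumed: 2 on Nehalem,
 *                      1 on Sandy Bridge, as stated in this thread) */
static int chain_latency(int op_latency, int domain_crossings, int bypass_delay)
{
    return op_latency + domain_crossings * bypass_delay;
}
```

Under these assumptions, a cmpps (latency 3) that stays in the float domain costs chain_latency(3, 0, 0) = 3 cycles, while a pcmpeqd (latency 1) placed between float instructions costs chain_latency(1, 2, 1) = 3 cycles on Sandy Bridge and chain_latency(1, 2, 2) = 5 cycles on Nehalem.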

> what is the meaning of the first, and what is the meaning of the second?

Latency (cycles) and max throughput (the number of cycles it takes to execute 1 instruction - if it is <1, it means more than 1 can be executed per cycle)
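To make the two numbers concrete, here is an idealized back-of-the-envelope model. It ignores execution ports shared with other instructions, the front end and memory, and the function names are hypothetical.

```c
/* n operations where each one needs the previous result: latency-bound. */
static double dependent_cycles(int n, double latency)
{
    return n * latency;
}

/* n independent operations: the first result is ready after `latency`,
 * and another one completes every `recip_throughput` cycles after that. */
static double independent_cycles(int n, double latency, double recip_throughput)
{
    return latency + (n - 1) * recip_throughput;
}
```

With the (1/0.5) figures quoted above for pand, 8 dependent operations take about 8 cycles in this model, while 8 independent ones take about 1 + 7 * 0.5 = 4.5 cycles.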

-Max

Gaiger_Chen · ‎03-14-2012

So, following your explanation, take a simple piece of code:

(a, b and c are integers on the stack)

MOV EAX, a; (1)
MOV EBX, b; (2)
MOV ECX, c; (3)

ADD EBX, EAX; (4)
INC EBX; (5)
DEC ECX; (6)

With the out-of-order mechanism, (1), (2) and (3) are independent, so we use throughput for the estimate.
(4) depends on (1) and (2), so we should consider the latency of (1) and (2).
(5) depends on (4), so we should consider the latency of (4).
(6) depends on (3), so we should consider the latency of (3).

Is that right?

Thank you.

sh0dan · ‎03-14-2012

@Max: Oh - sorry - I don't know how I got these wrong. I corrected the top post.

I have a rather big section that is basically binary compares and masking, 'or' and 'and' of float (ps) values, so if the switch penalty isn't too big, the entire section might as well be done as integers, since the values themselves don't matter.

@Gaiger:
In general yes, but since you are actually dealing with memory operations, no - you have to take cache latency into the equation. There is also a limited number of loads that can be in flight at once, which might clog up your code.

In my experience, you should only worry about latency/throughput in pure "computation" code - or at least when you know your data is in L1 cache, in which case the load latency is usually in the area of 4 cycles.

I use this table to get a general overview of the instruction timings when NOT dealing with memory:

http://akuvian.org/src/mubench_results.txt

It doesn't contain Sandy Bridge, but Sandy Bridge doesn't differ that much from Nehalem, and you can find its timings in the doc linked above.

However, as always when optimizing, don't stare too much at the timings - there are a LOT of things that are much more important, like being very aware of avoiding dependencies in general, because that will always help you. On the Atom, you will find parts of your code where you have to "hide" 4 instructions before you can use your result (floating-point addition, integer multiply) for optimal usage.
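The point about avoiding dependencies can be sketched with a plain C reduction. This is a generic illustration with made-up function names, not code from this thread: the first version forms one long chain of additions, the second splits the work into four independent chains that an out-of-order core can overlap. Note that the two versions add in a different order, so floating-point rounding may differ slightly on general input.

```c
#include <stddef.h>

/* One accumulator: every addition has to wait for the previous one. */
static float sum_single_chain(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators: four independent dependency chains. */
static float sum_four_chains(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)  /* leftover elements when n is not a multiple of 4 */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```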

Max_L · ‎03-14-2012

@sh0dan: thought I'd elaborate a bit more - almost all "FP codes" are bound either by memory or cache hierarchy bandwidth, or by the capacity of the scheduler (RS) to keep as many long-latency dependency chains (with relatively long-latency MULs and ADDs) in flight as possible - so, by adding a stack bypass latency between instructions you are more likely to increase pressure on the scheduler and reduce the level of out-of-order parallelism than to get a benefit from additional throughput ... I'd really be interested to see a real-life example that benefits from replacing FP vector operations with integer equivalents ...

@Gaiger Chen: check out this tool: http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/ - it shows the peak (idealized) throughput and latency of a piece of code through the micro-architecture, including the placement of uops on the execution ports

-Max

sh0dan · ‎03-15-2012

The example is here:

http://rawstudio.org/svn/rawstudio/trunk/plugins/dcp/dcp-sse2.c

Go to "rgb_tone_sse2". Basically, for 3 xmm registers (R, G, B), each containing values for 4 pixels, it needs to determine which of them contains the highest, lowest and medium value.

So two registers are made, one containing the smallest (sm) and one the largest (lg) values.

The part that involves integers is where masks are created for each type, for instance "is the R value the largest value" (is_r_lg) and so on. This is done for all 9 cases.

Also, based on this, the "medium" values are found. "Stuff" is done to these values, and the masks are used to re-combine them and place them in R, G and B respectively.

PS() & DW() are simple macros for _mm_castsi128_ps() and _mm_castps_si128() for better readability.
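To give an idea of the pattern described above, here is a heavily simplified sketch. It is not the code from dcp-sse2.c: the PS()/DW() definitions are reconstructed from the description in this post, select_ps() and scale_largest() are hypothetical names, only the "largest" case is handled, and NaN inputs are ignored.

```c
#include <emmintrin.h>

/* Cast helpers as described above: bit reinterpretation only, no conversion. */
#define PS(x) _mm_castsi128_ps(x)
#define DW(x) _mm_castps_si128(x)

/* Per lane: take a where mask is all ones, b where it is all zeros,
 * using integer-domain and/andnot/or on the float bit patterns. */
static __m128 select_ps(__m128i mask, __m128 a, __m128 b)
{
    return PS(_mm_or_si128(_mm_and_si128(mask, DW(a)),
                           _mm_andnot_si128(mask, DW(b))));
}

/* For four pixels, multiply the largest of R, G, B by factor and put it back
 * into the channel it came from. On ties the first channel (R, then G) wins. */
static void scale_largest(__m128 *r, __m128 *g, __m128 *b, float factor)
{
    __m128 lg = _mm_max_ps(_mm_max_ps(*r, *g), *b);
    __m128 new_lg = _mm_mul_ps(lg, _mm_set1_ps(factor));  /* the "stuff" */

    /* Masks: exactly one of the three is set in every lane. */
    __m128i is_r_lg = DW(_mm_cmpeq_ps(*r, lg));
    __m128i is_g_lg = _mm_andnot_si128(is_r_lg, DW(_mm_cmpeq_ps(*g, lg)));
    __m128i is_b_lg = _mm_andnot_si128(_mm_or_si128(is_r_lg, is_g_lg),
                                       DW(_mm_cmpeq_ps(*b, lg)));

    *r = select_ps(is_r_lg, new_lg, *r);
    *g = select_ps(is_g_lg, new_lg, *g);
    *b = select_ps(is_b_lg, new_lg, *b);
}
```

Every DW()/PS() pair in a sketch like this marks a place where data may move between the float and integer domains, which is where a bypass delay could apply.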

A quick calculation - there are 10 compares and 27 and/or/xor, with a total of 10 registers switching one way or the other.