Hi all,
To perform a logical AND operation on SSE vectors I can choose between the float, double, and integer variants of the instruction. But which one is truly preferred?
andps is one byte shorter than pand, and it doesn't require SSE2 support like andpd, so my preference goes to that one. Is there any reason to ever want to use the other variants? It looks like they were simply added to keep the ISA symmetric, which in turn might keep the decoders simple. But if that's the only reason, why were andpd and pand ever documented? Or is there real value in having separate mnemonics and corresponding encodings specific to double and integer processing?
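For reference, here are the three register-to-register forms and their encodings as I read them from the instruction set reference (the extra byte on the SSE2 forms is the 66h operand-size prefix):

andps xmm0, xmm1 ; 0F 54 C1 (SSE, 3 bytes)
andpd xmm0, xmm1 ; 66 0F 54 C1 (SSE2, 4 bytes)
pand xmm0, xmm1 ; 66 0F DB C1 (SSE2, 4 bytes)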
Thanks for any information.
Nicolas
In Section 3.5.2.2, 'Bypass between Execution Domains', of the Intel Optimization Reference Manual I found that transitions between floating-point and integer SIMD cost a 1-cycle penalty. However, I tested this on my Core 2 and found no such penalty. So I'm still inclined to keep using andps everywhere...
Could an Intel engineer give me a definitive answer on this? What would be the behavior of Core i7 and future CPUs? Thanks a lot.
Hi Nicolas,
On Core 2, FP SIMD booleans and register copies are bound to the same ports and execution stack as the integer SIMD versions. There is a one-cycle latency penalty in either direction. For example, the throughput of the following (contrived) loop with a loop-carried dependence is 6 cycles (instead of the 4 you might expect from the execution latencies alone).
loop_1:
    addps xmm0, xmm0
    andps xmm0, xmm1    ; would be the same if you had used pand xmm0, xmm1
    sub   ecx, 1
    jg    loop_1
You can't avoid the 1-cycle bypass penalties on Core 2 in this code snippet. Note that a very similar situation applies to super-shuffler operations on Penryn.

At this point you might be wondering why we make the recommendation at all, considering, as you note, that using 'correctly' typed operations costs code footprint. The short-term answer is that there are a few floating-point operations (such as QW-granularity shuffles) that do not pay these latencies, as they are implemented on the FP stack. The long-term answer is that on Nehalem (and future architectures) we have been much more diligent in separating execution units onto different stacks, so that not only do you avoid the 1-cycle penalty above when using correct types, but you pay additional cycles of latency for moving between stacks when using the wrong type. There are cases where this latency doesn't matter, but as a general guideline we ask for type-correct use, as on future CPUs it usually outweighs the cost of the extra instruction size.
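To make that concrete, here is a sketch of the same kind of dependence chain written both ways for a Nehalem-class core (exact bypass costs differ per product, so I've left cycle counts out):

; wrong type: an integer-stack boolean in the middle of FP arithmetic
mulps xmm0, xmm1
pand  xmm0, xmm2    ; crosses to the integer stack and back - pays the bypass both ways
addps xmm0, xmm3

; type-correct: the boolean stays on the FP stack, no bypass
mulps xmm0, xmm1
andps xmm0, xmm2
addps xmm0, xmm3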
Regards,
Mark Buxton
Hi Mark,
Thanks a lot for the detailed information. I wrongly assumed that the penalty could be avoided by using the other variants of the instruction (on Core 2); now it's clear that the optimization manual just explains why the latency is longer than one might initially expect.
Although it's probably not a big issue in real-world applications, I do wonder how one could avoid or minimize domain bypasses as suggested. Specifically, what algorithm could a compiler use to choose between the variants?
Kind regards,
Nicolas
Hi Nicolas,
Probably the best policy to minimize these bypass penalties in the long run is to use operations typed 'correctly'. That is, use single-precision-typed loads, booleans, and stores when doing single-precision operations, and packed-integer forms for packed-integer ops.
There are a couple of 'safe' ways to cheat. For example, it is OK to use PS-typed operations instead of PD when working with double-precision floating point; the former generally save you a byte per instruction, i.e. you should never need to generate MOVAPD, MOVUPD, ANDPD, etc. Both PS and PD instructions bind to the same resources (the exceptions are the shuffles...).
There are really not very many cases where it's profitable to cheat to save latency. There ARE, however, ways to cheat that modify the port binding of the instructions. For code that is limited by execution throughput rather than latency (i.e. loops), it is occasionally profitable to prefer the operation that goes to the least-used execution port (i.e. when latency can be hidden). I have to admit having done this for hand-coded loops, although it is not a policy I advocate in general: port bindings change from product to product, and the current breed of compilers do not use port-binding decisions to affect instruction selection when optimizing for throughput.
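As an illustrative sketch of that policy (example code, not taken from any particular product manual):

; single-precision data: keep everything PS-typed
movaps xmm0, [esi]
andps  xmm0, xmm7
mulps  xmm0, xmm1
movaps [edi], xmm0

; packed-integer data: keep everything integer-typed
movdqa xmm2, [esi]
pand   xmm2, xmm7
paddd  xmm2, xmm3
movdqa [edi], xmm2

; the 'safe cheat': PS loads/booleans/stores on double-precision data
movaps xmm4, [esi]  ; instead of movapd - saves the 66h prefix byte
andps  xmm4, xmm7   ; instead of andpd
addpd  xmm4, xmm5   ; the arithmetic itself keeps its PD type
movaps [edi], xmm4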
Regards,
Mark
Just to chime in and say thanks for that explanation. Coming from VMX/AltiVec intrinsics, I really wondered why there was a stinkpile of bitwise ANDs, thinking to myself that surely a bitwise AND is a bitwise AND is a bitwise AND. Now I'm a bit wiser... or at least a bit better informed.
