Is any version of ICC capable of software pipelining loops for x86/x64? Currently I'm doing it manually, but this technique has been well known for decades, so I would expect the compiler to support it.
Is there some option to turn it on? I looked in the man page and see it mentioned under IA-64, but nothing under x86.
My loops consist of SIMD intrinsic functions with no branches other than the loop condition, so they should be amenable to software pipelining.
Very interesting optimization. Thanks for posting this.
icc supports software pipelining on the Intel(R) IA-64 architecture but not on the Intel(R) 64 architecture. Implementing "effective" software pipelining and achieving the desired performance on x86/x64 would require additional hardware support similar to that in IA-64. The hardware in the IA-64 architecture, such as 128 general registers, 128 floating-point registers, 64 predicate registers, 8 branch registers, a loop count register, an epilog count register, etc., is what allows the effective software-pipelining implementation that the Intel compilers support on IA-64. icc does support high-level loop optimizations on x86/x64 (except software pipelining) at -O3.
--mark
Hi Mark!
Sorry for the slightly off-topic question. What is the number of physical internal GP registers (unseen to software) on x86/64 architecture?
"What is the number of physical internal GP registers (unseen to software) on x86/64 architecture?"
You'd begin by perusing the data sheets for the model of interest.
I think your efforts would be better used by seeking to improve vectorization of your code than to do what is effectively prefetching.
There is an exception to this. After vectorization, look to using multi-threading (threads within same core). If your code can divide work up properly, the output of one stage is likely to reside in the Last Level Cache for use by the next stage. (This is a benefit when your total working set is larger than LLC).
Jim Dempsey
Mark - thanks for explaining the situation. But I don't understand why the compiler team wouldn't implement software pipelining just because x86 doesn't have 128 registers. x64 has 16 SIMD registers, so there are plenty of loops that can benefit.
My code in question is unpacking 12-bit pixels to 16-bit values. I'm getting a 1.5x speedup with software pipelining, so it's clearly a win. I've looked at the performance counters, and they show 3 instructions/cycle!
Jim, I agree that software pipelining isn't relevant on throughput processors (GPUs), since they have a lot more parallelism to exploit, unlike CPUs, which rely on instruction-level parallelism. Even on CPUs I wouldn't do this unless the function is used a lot. And I don't think the SIMD code can be improved short of using AVX 2.0.
I think your efforts would be better used by seeking to improve vectorization of your code than to do what is effectively prefetching.

No, this isn't "prefetching" in the traditional sense. Conceptually it might seem like prefetching, but the improvement is better. Prefetch instructions reduce the latency from > 12 cycles (L2 latency) to ~3 cycles (L1 latency), since you still need a load to get the data into a register. Software pipelining allows even those last ~3 cycles of latency to be hidden as well!
As the preceding replies hinted, software pipelining has usually been applied to architectures with specific support for it, but without support for out-of-order execution or vectorization. I wouldn't care to pontificate on the reasons, but multi-threading support has been relatively weak on architectures built around software pipelining.
Also, as was touched on above, it's typically useful to add some software loop unrolling, e.g. unroll by 4, so as to engage a limited degree of software pipelining on current architectures. Multiply that unrolling by vector register widths of up to 16 elements (for 32-bit data), and the total effective unrolling rivals what software pipelining needed. Loop stream detection and micro-op caching also help keep the pipeline full across iterations of the unrolled loop body.
TimP (Intel) wrote:
"What is the number of physical internal GP registers (unseen to software) on x86/64 architecture?"
You'd begin by perusing the data sheets for the model of interest.
I was not able to find any information related to the number of internal GP registers.
>>My code in question is unpacking 12-bit pixels to 16-bit values.
How many 12-bit values?
Do the bits need to be reversed?
This problem may be best handled using the GP registers as opposed to SSE/AVX.
If on an x64 platform and using 64-bit registers:
16 12-bit inputs = 24 bytes input (3 regs) = 32 bytes output (4 regs)
32 12-bit inputs = 48 bytes input (6 regs) = 64 bytes output (8 regs)
I would suggest trying to be cache friendly on the output side.
The above would require the code to perform the necessary shifts (and merges for inputs spanning 64-bit words).
An alternative approach that should be tested is to let the memory fetch perform some of the shifting:
Read in 4 12-bit inputs = 6 bytes input = 8 bytes output.
Though you will pay for misaligned reads, you will not need code to manage inputs that span registers, so there may be a net benefit. You will have to try the code to find out. Note that different CPU architectures may affect the outcome.
Jim Dempsey
I checked the second method (read in 4 12-bit inputs = 6 bytes input = 8 bytes output) in C++ and it looks pretty ugly. This would likely have to be done in assembler (C++ with inline assembly).
Jim Dempsey
jimdempseyatthecove wrote:
I checked the second method (read in 4 12-bit inputs = 6 bytes input = 8 bytes output) in C++ and it looks pretty ugly. This would likely have to be done in assembler (C++ with inline assembly).
Jim Dempsey
You can use the following code:
#include <mmintrin.h>
#include <xmmintrin.h>
#define PREFETCH_T0(addr) _mm_prefetch(((char *)(addr)),_MM_HINT_T0)
Do not call the macro as PREFETCH_T0(ptr+10) with indexed addressing. Instead, keep a __restrict pointer that is incremented in the loop and (in this case) runs 10 units ahead, and pass that to the macro. In my observation, a prefetch instruction with an address like 96+rax*8 costs extra clock cycles due to the multiplication and addition.
My input is packed as H M, L H, M L (most significant nibble on the left of each byte), which is somewhat big-endian, so I basically have to move groups of 4 bits by *different* distances; I don't think shifting alone is an efficient solution. I already have heavily optimized code that uses the SSE byte shuffle to first move the bytes to their approximate output locations, followed by a left or right shift by 4 where needed.