Beginner

software pipelining for x86

Is any version of ICC capable of software pipelining loops for x86/x64? Currently I'm doing it manually, but this technique has been well known for decades, so I think it should be in the compiler.

Is there some option to turn it on? I looked in the man page and saw it mentioned under IA-64, but nothing under x86.

My loops consist of SIMD intrinsic functions with no branches other than the loop condition, so they should be amenable to software pipelining.

20 Replies

Valued Contributor II

>> Is any version of ICC capable of software pipelining loops for x86/x64? Currently, I'm doing it manually, but this is a well
>> known method for decades, so I think it should be in the compiler.

Do you mean loop unrolling? If yes, there are two command-line options:

    /Qunroll[:n]            sets the maximum number of times to unroll loops; omit n to use default heuristics, or use n=0 to disable the loop unroller
    /Qunroll-aggressive[-]  enables more aggressive unrolling heuristics

If no, please explain; I would appreciate it if you could provide an example in C/C++.

Best regards, Sergey
Beginner

No, I'm not talking about unrolling; I'm talking about a method to increase the dependence distance between instructions. You can look up software pipelining on Wikipedia. You basically divide your loop into stages and interleave the execution of multiple iterations. It can be done in conjunction with unrolling, but doesn't depend on it. Here's an example.

Before:

    for (i = 0; i < N; ++i) {
        a = Load(&data[i]);
        b = Compute(a);
        Store(&out[i], b);
    }

After pipelining:

    // fill pipeline
    a = Load(&data[1]);
    b = Compute(Load(&data[0]));
    for (i = 2; i < N; ++i) {
        // precondition: a holds the load-stage value from iteration i - 1,
        //               b holds the compute-stage value from iteration i - 2
        Store(&out[i - 2], b);
        b = Compute(a);
        a = Load(&data[i]);
    }
    // postcondition (assuming N is big enough):
    //   b holds the compute-stage value from iteration N - 2,
    //   a holds the load-stage value from iteration N - 1
    // drain pipeline
    Store(&out[N - 2], b);
    Store(&out[N - 1], Compute(a));
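For anyone who wants to run the transformation described above, here is a minimal compilable C sketch; the compute stage (x * 2 + 1) and the int element type are placeholders, not something from the thread.

```c
#include <stddef.h>

/* Hypothetical pure "compute" stage; any function of the loaded value. */
static int compute(int x) { return x * 2 + 1; }

/* Baseline loop: load, compute, store every iteration. */
void plain_loop(const int *data, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = compute(data[i]);
}

/* Software-pipelined version (assumes n >= 2): each trip stores the
 * result from iteration i - 2, computes on iteration i - 1's load, and
 * loads iteration i's value, so the three stages overlap in flight. */
void pipelined_loop(const int *data, int *out, size_t n)
{
    int a = data[1];          /* fill: load stage of iteration 1    */
    int b = compute(data[0]); /* fill: compute stage of iteration 0 */
    for (size_t i = 2; i < n; ++i) {
        out[i - 2] = b;       /* store stage, iteration i - 2   */
        b = compute(a);       /* compute stage, iteration i - 1 */
        a = data[i];          /* load stage, iteration i        */
    }
    out[n - 2] = b;           /* drain pipeline */
    out[n - 1] = compute(a);
}
```

Both functions produce identical output; the pipelined one simply carries two live values across iterations so the dependence distance between a load and its use grows.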
Valued Contributor II

>> No, I'm not talking about inlining, I'm talking about a method to increase the dependence distance between instructions.
>> You can look up software pipelining on Wikipedia. You basically divide your loop into stages and interleave the
>> execution of multiple iterations. It can be done in conjunction with unrolling, but doesn't depend on it.

Thank you, it looks very interesting! I'll try a test to verify whether it improves processing performance.
Black Belt

Very interesting optimization. Thanks for posting this.

Employee

icc supports software pipelining on the Intel(R) IA-64 architecture but not on the Intel(R) 64 architecture. To implement "effective" software pipelining and achieve the desired performance on x86/x64, additional hardware is required, similar to that in the IA-64 architecture. The hardware support in IA-64, such as 128 general registers, 128 floating-point registers, 64 predicate registers, 8 branch registers, a loop count register, an epilog count register, etc., allows the effective implementation of software pipelining that Intel compilers support there. icc supports high-level loop optimizations on x86/x64 (except SW pipelining) at -O3.

--mark

Black Belt

Hi Mark!

Sorry for a slightly off-topic question. What is the number of physical internal GP registers (unseen to software) on x86/64 architecture?

Black Belt

"What is the number of physical internal GP registers (unseen to software) on x86/64 architecture?"

You'd begin by perusing the data sheets for the model of interest.


I think your efforts would be better used by seeking to improve vectorization of your code than to do what is effectively prefetching.

There is an exception to this. After vectorization, look at using multi-threading (threads within the same core). If your code can divide the work up properly, the output of one stage is likely to reside in the Last Level Cache for use by the next stage. (This is a benefit when your total working set is larger than the LLC.)

Jim Dempsey

Valued Contributor II

>> I think your efforts would be better used by seeking to improve vectorization of your code than to do what is
>> effectively prefetching.

This is what I asked myself at some point when reading the 1st post of the thread, and in code it could look like this:

    ...
    MOVZX      ecx, WORD PTR nVals
    SAR        ecx, 2
    PREFETCHT0 [ -96+edx+ecx*8 ]   // Next 6 iterations ahead
    LEA        edx, iA2
    ...
Beginner

Mark, thanks for explaining the situation. But I don't understand why the compiler team wouldn't implement SW pipelining just because x86 doesn't have 128 registers. x64 has 16 SIMD registers, so there are plenty of loops that can benefit.

My code in question is unpacking 12-bit pixels to 16-bit values. I'm getting a 1.5x speedup with software pipelining, so it's clearly a win. I've looked at the performance counters, and they show 3 instructions/cycle!

Jim, I agree that SW pipelining isn't relevant on throughput processors (GPUs), since they have a lot more parallelism to exploit, unlike CPUs, which rely on instruction-level parallelism. Even on CPUs I wouldn't do this unless the function were used a lot. And I don't think the SIMD code can be improved short of using AVX2.

Beginner

>> I think your efforts would be better used by seeking to improve vectorization of your code than to do what is
>> effectively prefetching.

No, this isn't "prefetching" in the traditional sense. Conceptually it might seem like prefetching, but the improvement is better. Prefetch instructions reduce the latency from >12 cycles (L2 latency) to ~3 cycles (L1 latency), since you still need a load to get the data into a register. SW pipelining allows even those last 3 cycles of latency to be hidden as well!

Black Belt

As the preceding replies hinted, software pipelining has usually been applied to architectures with specific support for it, but without support for out-of-order execution or vectorization. I wouldn't care to pontificate on the reasons, but multi-threading support has been relatively weak in conjunction with software pipelining.

Also, as was touched on above, it's typically useful to add some software loop unrolling, e.g. unroll by 4, so as to engage a limited degree of software pipelining on current architectures. If you multiply that unrolling by vector register widths of up to 16 elements (for 32-bit data), the total effective unrolling rivals what software pipelining required. Loop stream detection and micro-op caching also help keep the pipeline full across iterations of the unrolled loop body.
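A sketch of the unrolling idea mentioned above, for a reduction where it matters most: splitting the sum across independent accumulators breaks the single dependence chain, which lets an out-of-order core overlap iterations much as software pipelining would. The function name and unroll factor are illustrative choices, not from the thread.

```c
#include <stddef.h>

/* Manual unroll-by-4 with four independent partial sums. Each sN forms
 * its own dependence chain, so up to four adds can be in flight at once. */
float sum_unrolled(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3); /* combine the partial sums */
    for (; i < n; ++i)               /* remainder loop */
        s += x[i];
    return s;
}
```

Note that reassociating floating-point adds this way can change rounding slightly versus a strictly sequential sum; compilers only do it under flags like -ffast-math for that reason.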

Valued Contributor II

>> ...Prefetch instructions reduce the latency from >12 cycles (L2 latency) to ~3 (L1 latency) since you still need to use
>> a load to get it into a register. SW pipelining allows even the last 3 cycles of latency to be hidden as well!

The most interesting part is that the C/C++ code is absolutely portable. Did you find any references on that subject in the Intel manuals?
Black Belt

TimP (Intel) wrote:

"What is the number of physical internal GP registers (unseen to software) on x86/64 architecture?"

You'd begin by perusing the data sheets for the model of interest.

I was not able to find any information related to the number of internal GP registers.


>>My code in question is unpacking 12bit pixels to 16bit values.

How many 12-bit values?

Do the bits need to be reversed?

This problem may be best handled using the GP registers as opposed to SSE/AVX.

If on x64 platform and using 64-bit registers:

16 12-bit inputs = 24 bytes input (3 regs) = 32 bytes output (4 regs)
32 12-bit inputs = 48 bytes input (6 regs) = 64 bytes output (8 regs)

I would suggest trying to be cache friendly on the output side.

The above would require the code to perform the necessary shifts (and merges for inputs spanning 64-bit words).

An alternative approach that should be tested is to let the memory fetch perform some of the shifting:

Read in 4 12-bit inputs = 6 bytes input = 8 bytes output.

Though you will have misaligned reads as an expense, you will not need code to manage inputs that span registers. There may be a net benefit; you will have to try the code to find out. Note that different CPU architectures may affect the outcome.

Jim Dempsey
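The "4 inputs = 6 bytes in, 8 bytes out" step above could be sketched in C as follows. This assumes straightforward little-endian packing (value k occupies bits 12k..12k+11 of the 48-bit group) and a little-endian host; real sensor layouts, including the one the original poster describes, differ.

```c
#include <stdint.h>
#include <string.h>

/* Unpack four 12-bit values from 6 packed bytes into four uint16_t.
 * Assumes simple little-endian bit packing and a little-endian host
 * (the memcpy fills the low 6 bytes of the 64-bit accumulator). */
void unpack4(const uint8_t *src, uint16_t *dst)
{
    uint64_t bits = 0;
    memcpy(&bits, src, 6);  /* the possibly misaligned 6-byte load */
    for (int k = 0; k < 4; ++k)
        dst[k] = (uint16_t)((bits >> (12 * k)) & 0xFFF);
}
```

The memory fetch here does part of the "shifting" work, as suggested: one 6-byte read replaces the merge code needed when inputs span 64-bit register boundaries.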

Valued Contributor II

>> ...My code in question is unpacking 12bit pixels to 16bit values...

This is a kind of conversion that is commonly used in image or digital signal processing, and it is called a Dynamic Range change. In more generic terms it is called Normalization. In that example, a data set's values are changed from a 0 - 4095 range (12 bits) to a 0 - 65535 range (16 bits). I will investigate whether that method increases processing performance. Thank you to everybody for the comments.

I checked the second method (read in 4 12-bit inputs = 6 bytes input = 8 bytes output) in C++ and it looks pretty ugly. This would likely have to be done in assembler (C++ with inline assembly).

Jim Dempsey

New Contributor II

jimdempseyatthecove wrote:

I checked the second method (Read in 4 12-bit inputs = 6 bytes input = 8 bytes output.) in C++ and it looks pretty uggly. This would likely have to be done in assembler (C++ with inline assembly).

Jim Dempsey

You can use the following code:

    #include <xmmintrin.h>
    #define PREFETCH_T0(addr) _mm_prefetch(((char *)(addr)), _MM_HINT_T0)

Do not use the macro like PREFETCH_T0(ptr + 10); instead, keep a __restrict'ed pointer that is incremented in the loop (in this case, 10 units ahead) and pass that to the macro, since in my observation a prefetch instruction with an address like [96+rax*8] costs extra clock cycles due to the multiplication and addition.
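The suggested pattern might look like this in C. The 10-element prefetch distance and the scale-by-2 loop body are illustrative only; prefetch is purely a hint, so touching addresses slightly past the end of the array does not fault on x86.

```c
#include <xmmintrin.h>
#include <stddef.h>

#define PREFETCH_T0(addr) _mm_prefetch(((char *)(addr)), _MM_HINT_T0)

/* Keep a second pointer running a fixed distance ahead so the prefetch
 * address needs no scaled-index arithmetic, only an increment. */
void scale_with_prefetch(float *restrict dst, const float *restrict src,
                         size_t n)
{
    const float *ahead = src + 10;   /* illustrative prefetch distance */
    for (size_t i = 0; i < n; ++i) {
        PREFETCH_T0(ahead);          /* hint only; never faults */
        ++ahead;
        dst[i] = src[i] * 2.0f;      /* placeholder loop body */
    }
}
```

Whether the addressing-mode cost the poster observed is measurable depends on the microarchitecture; on most modern cores a scaled-index prefetch address is computed in the AGU for free, so this is worth benchmarking rather than assuming.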

Beginner

My input is packed as H M, L H, M L (most significant bit on the left of each byte), which is somewhat big-endian, so I basically have to move groups of 4 bits by *different* distances; I don't think shifting alone is an efficient solution. I already have very optimized code that uses an SSE byte shuffle to first move the bytes to their approximate output locations, followed by a left or right shift by 4 where needed.
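For reference, here is a scalar C model of that layout as I read the description: byte0 holds the high and mid nibbles of pixel 0, byte1 holds the low nibble of pixel 0 and the high nibble of pixel 1, and byte2 holds the mid and low nibbles of pixel 1. This is only a readable reference, not the poster's SSE code, which does the equivalent per 16-byte block with a byte shuffle plus 4-bit shifts.

```c
#include <stdint.h>
#include <stddef.h>

/* Unpack pairs of 12-bit pixels packed as H M | L H | M L per 3 bytes
 * (nibble order assumed from the description above):
 *   p0 = byte0 : high nibble of byte1
 *   p1 = low nibble of byte1 : byte2                                  */
void unpack12_pair(const uint8_t *src, uint16_t *dst, size_t npairs)
{
    for (size_t i = 0; i < npairs; ++i, src += 3, dst += 2) {
        dst[0] = (uint16_t)((src[0] << 4) | (src[1] >> 4));
        dst[1] = (uint16_t)(((src[1] & 0x0F) << 8) | src[2]);
    }
}
```

Note the two pixels need shifts in opposite directions (pixel 0's bytes shift left, pixel 1's stay in place after masking), which is exactly why a single uniform shift cannot finish the job after the shuffle.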
