Solved: Can AVX instructions be executed in parallel?

Manish_K_ · ‎03-30-2015

Hi,

Can two AVX instructions be executed in parallel?

For example,

Version1:

a1 = _mm256_load_ps(Rin + offset);
a2 = _mm256_load_ps(Gin + offset);
a3 = _mm256_load_ps(Bin + offset);

ac0 = _mm256_mul_ps(a1, in2outAvx_11);
ac1 = _mm256_mul_ps(a2, in2outAvx_12);
ac2 = _mm256_mul_ps(a3, in2outAvx_13);

z0 = _mm256_add_ps(ac0, ac1);
z1 = _mm256_add_ps(z0, ac2);

If I change this code to

Version 2:

a1 = _mm256_load_ps(Rin + offset);
a2 = _mm256_load_ps(Gin + offset);
a3 = _mm256_load_ps(Bin + offset);

ac0 = _mm256_mul_ps(a1, in2outAvx_11);
ac1 = _mm256_mul_ps(a2, in2outAvx_12);

/* the two instructions below are data-independent and might run in parallel */

z0 = _mm256_add_ps(ac0, ac1);
ac2 = _mm256_mul_ps(a3, in2outAvx_13);

z1 = _mm256_add_ps(z0, ac2);

Will the Version 2 code run faster, since the add and mul intrinsics can execute together?

Or will Version 1 and Version 2 take the same time if the compiler rearranges the instructions by itself?

Bernard · ‎03-30-2015

In the first version, z0 and z1 cannot be reordered because z1 depends on z0. In the second version, z0 and ac2 can be executed in parallel because, as you already stated, they are independent of each other. If the CPU can exploit ILP, then on Haswell Port 0 can execute the FP MUL instruction while Port 1 executes the FP ADD instruction. Of course, there are many additional factors involved, such as port saturation, which can occur when a second thread running on the same core also contains FP code. Availability of the operands at the time of execution is also important. For example, if the Rin array is accessed with a linear stride, it can be prefetched into the cache, so the operands will be pulled from L1D and routed to the FP execution stack.

Manish_K_ · ‎03-30-2015

Can you point me to any document, or can you give me info about the following questions?

How many _mm256_mul_ps, _mm256_load_ps, _mm256_store_ps and _mm256_add_ps instructions can a Haswell family processor run in parallel?

Can the above instructions be executed in one cycle?

What will be the maximum cycles taken if all above instructions are running in parallel?

McCalpinJohn · ‎03-30-2015

For a general discussion of the microarchitecture and a detailed discussion of performance issues, you should start with the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966, revision 030, September 2014).

An overview of the Haswell processor is provided in Section 2.1, with Table 2-1 giving a nice overview of the execution ports and functional units. From this table it is pretty easy to see where the instructions can go:

_mm256_add_ps: Port 1
_mm256_mul_ps: Port 0 or Port 1
_mm256_load_ps: Port 2 or Port 3
_mm256_store_ps: Port 7 (for the address calculation) and Port 4 (for the data)

If the add and multiply operations use only register operands, then these can all be issued in one cycle. If the arithmetic instructions have a memory input operand, then they will compete with the load for access to ports 2 and 3. Since there are two load ports, these instructions can all be issued in one cycle even if one of the arithmetic instructions includes a memory input operand.
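To make the port argument concrete, here is a minimal sketch of a port-pressure lower bound in C. It assumes only the port assignment listed above (add on Port 1, multiply on Port 0 or 1, loads on Ports 2 and 3, store data on Port 4) and register-operand arithmetic. The struct and function names are hypothetical, and the model deliberately ignores the front end, latencies, and cache misses, so treat it as a rough estimate rather than a prediction.

```c
/* Instruction mix for one loop iteration (register-operand arithmetic). */
struct mix { int adds, muls, loads, stores; };

static int ceil_div(int a, int b) { return (a + b - 1) / b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* Lower bound on cycles per iteration from port pressure alone.
 * Assumed port assignment (from the list above):
 *   add -> Port 1 only, mul -> Port 0 or 1,
 *   load -> Port 2 or 3, store data -> Port 4.
 * Assumption: front end, latencies and cache misses are ignored. */
static int port_bound_cycles(struct mix m)
{
    /* Ports 0 and 1 share the FP work, but every add needs Port 1. */
    int fp = max_int(m.adds, ceil_div(m.adds + m.muls, 2));
    int ld = ceil_div(m.loads, 2);   /* two load ports */
    int st = m.stores;               /* one store-data port */
    return max_int(fp, max_int(ld, st));
}
```

With one instruction of each kind, `port_bound_cycles` returns 1, which matches the statement that they can all be issued in one cycle. For the mix in the original question (2 adds, 3 multiplies, 3 loads) the bound is 3 cycles, set by the five FP operations sharing Ports 0 and 1.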

Appendix C-3 provides latency and reciprocal throughput information for a large subset of the instructions on a large subset of Intel processors. From Table C-1 you can see that Haswell processors have one of four DisplayFamily_DisplayModel designations 06_3CH, 06_45H, 06_46H, and 06_3FH. Fortunately all of these appear to be grouped together in the tables below, so you don't need to know exactly which model you have.

From Table C-8 you can find:

_mm256_add_ps: VADDPS: 3 cycle latency, 1 instruction/cycle issue rate (1/throughput=1)
_mm256_mul_ps: VMULPS: 5 cycle latency, 2 instructions/cycle issue rate (1/throughput=0.5)

This appendix does not include data for operations with memory operands, in part because the latency depends on so many factors. A few basic latencies for the Haswell processor are included in Section 2.1.3 and 2.1.4. The discussion of Sandy Bridge latencies in Section 2.2.5 is much more complete. The details are not going to be the same as for Haswell, but the Sandy Bridge discussion does provide an indication of what sorts of factors influence latencies.

Another valuable resource for x86 microarchitecture and performance information is Agner Fog's web site. His "microarchitecture.pdf" document describes the evolution of the microarchitecture of Intel processors. This allows the reader to start with the earlier/simpler processors and learn about the increasing complexity incrementally. (Because of this structure, I recommend that the document be read in order, rather than just jumping to Chapter 10 to learn about Haswell.) The "instruction_tables.pdf" document provides more comprehensive coverage of instruction latencies and throughputs than the Intel Optimization Reference Manual. One particularly useful feature is that the tables show exactly which execution ports each instruction uses. Sometimes these are obvious, and sometimes they are very much not obvious! These documents (and more useful resources) are available at: http://www.agner.org/optimize/
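As a worked example of how the latencies quoted above from Table C-8 apply to the code in the question, the sketch below computes when z1 becomes available. It assumes the three loaded values are ready at cycle 0 and that each instruction starts as soon as its inputs are ready. The function name and the `ac2_issue` parameter are hypothetical, introduced only to show how much slack the third multiply has.

```c
/* Latencies quoted above (Haswell, register operands). */
enum { LAT_ADD = 3, LAT_MUL = 5 };

static int later(int a, int b) { return a > b ? a : b; }

/* Cycle at which z1 is ready. Dataflow, identical in both versions:
 *   z1 = (a1*c11 + a2*c12) + a3*c13
 * ac2_issue is the cycle at which the third multiply starts;
 * the first two multiplies are assumed to start at cycle 0. */
static int z1_ready_cycle(int ac2_issue)
{
    int ac0 = 0 + LAT_MUL;
    int ac1 = 0 + LAT_MUL;
    int ac2 = ac2_issue + LAT_MUL;
    int z0  = later(ac0, ac1) + LAT_ADD;   /* first add waits for both muls */
    int z1  = later(z0, ac2) + LAT_ADD;    /* second add waits for z0, ac2  */
    return z1;
}
```

Under these assumptions z1 is ready at cycle 11 whether the third multiply starts at cycle 0, 1, 2 or 3; only a start at cycle 4 or later lengthens the chain. The dependency chain through z0 is the critical path in both versions, so in this simplified model the source order of the third multiply and the first add does not change the result.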

Vladimir_Sedach · ‎03-30-2015

Manish,

Your versions 1 and 2 are practically the same because the compiler chooses the fastest (in its opinion) order of generated instructions.
In other words, the generated code for both versions most likely would be the same.

It doesn't mean the order of intrinsics doesn't matter. It actually means it's impossible to predict what the compiler would do.
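One way to check this for a particular compiler is to put the two orderings into separate functions and compare the generated assembly. The sketch below is only an illustration: the function names, the `coef` array (standing in for in2outAvx_11 to in2outAvx_13, broadcast here for simplicity) and the `AVX_FN` macro are hypothetical. It assumes GCC or Clang on x86-64 and a CPU that supports AVX.

```c
#include <immintrin.h>

/* Assumption: GCC or Clang on x86-64. The target attribute enables AVX for
 * these functions only, so no -mavx flag is needed; the CPU must have AVX. */
#define AVX_FN __attribute__((target("avx")))

/* Version 1 ordering: three multiplies, then two adds.
 * r, g, b and out must be 32-byte aligned. */
AVX_FN void mix_v1(const float *r, const float *g, const float *b,
                   const float coef[3], float *out)
{
    __m256 a1 = _mm256_load_ps(r);
    __m256 a2 = _mm256_load_ps(g);
    __m256 a3 = _mm256_load_ps(b);
    __m256 ac0 = _mm256_mul_ps(a1, _mm256_set1_ps(coef[0]));
    __m256 ac1 = _mm256_mul_ps(a2, _mm256_set1_ps(coef[1]));
    __m256 ac2 = _mm256_mul_ps(a3, _mm256_set1_ps(coef[2]));
    __m256 z0 = _mm256_add_ps(ac0, ac1);
    __m256 z1 = _mm256_add_ps(z0, ac2);
    _mm256_store_ps(out, z1);
}

/* Version 2 ordering: the first add is written before the third multiply. */
AVX_FN void mix_v2(const float *r, const float *g, const float *b,
                   const float coef[3], float *out)
{
    __m256 a1 = _mm256_load_ps(r);
    __m256 a2 = _mm256_load_ps(g);
    __m256 a3 = _mm256_load_ps(b);
    __m256 ac0 = _mm256_mul_ps(a1, _mm256_set1_ps(coef[0]));
    __m256 ac1 = _mm256_mul_ps(a2, _mm256_set1_ps(coef[1]));
    __m256 z0 = _mm256_add_ps(ac0, ac1);
    __m256 ac2 = _mm256_mul_ps(a3, _mm256_set1_ps(coef[2]));
    __m256 z1 = _mm256_add_ps(z0, ac2);
    _mm256_store_ps(out, z1);
}
```

Both functions perform the same arithmetic in the same association order, so they produce identical results. Compiling with optimization and `-S` and comparing the two function bodies shows whether a given compiler actually emits the same instruction sequence for both; that outcome depends on the compiler and its version.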

Manish_K_ · ‎03-30-2015

Thanks a lot, John!!

It is really helpful.

Bernard · ‎03-31-2015

I think that the compiler will at least try to exploit ILP in the presented source code. The second version does not have interdependencies there, so ILP can be exploited at least in the case of the following operations:

z0 = _mm256_add_ps(ac0,ac1);
ac2 = _mm256_mul_ps(a3, in2outAvx_13);
