Software Archive
Read-only legacy content
17061 Discussions

Order of Intrinsics

Patrick_S_
New Contributor I
418 Views

Hey all,

I am currently writing a program using Intel Intrinsics for KNC. Due to the in-order execution my question is: Does the ordering of the Intrinsic instructions make a difference? For readability it would be good for my code to gather the load and store instructions for one loop iteration as well as the align, shuffle and arithmetic instructions.

Let's make two very very simplified examples (without the omp pragmas):

[cpp]

__m512 a_, b_, c_ ,d_ ,e_, f_;

__m512 result_ab_, result_cd_, result_ef_;

for ( std::size_t j = 0; j < N; ++j ) {

 

     a_ = _mm512_load_ps( array + 16*6 * j + 16*0 );

     d_ = _mm512_load_ps( array + 16*6 * j + 16*1 );

     b_ = _mm512_load_ps( array + 16*6 * j + 16*2 );

     e_ = _mm512_load_ps( array + 16*6 * j + 16*3 );

     c_ = _mm512_load_ps( array + 16*6 * j + 16*4 );

     f_ = _mm512_load_ps( array + 16*6 * j + 16*5 );

 

     //////////////////////////////////////////

     do all shuffles involving a_, b_, c_ ,d_ ,e_, f_

     //////////////////////////////////////////

 

     //////////////////////////////////////////

     do all arithmetics involving a_, b_, c_ ,d_ ,e_, f_

     //////////////////////////////////////////

 

     _mm512_storenrngo_ps( result + 16*3 * j + 16*0, result_ab_ );

     _mm512_storenrngo_ps( result + 16*3 * j + 16*1, result_cd_ );

     _mm512_storenrngo_ps( result + 16*3 * j + 16*2, result_ef_ );

}

 [/cpp]

[cpp]

__m512 a_, b_, c_ ,d_ ,e_, f_;

__m512 result_ab_, result_cd_, result_ef_;

for ( std::size_t j = 0; j < N; ++j ) {

 

     a_ = _mm512_load_ps( array + 16*6 * j + 16*0 );

     b_ = _mm512_load_ps( array + 16*6 * j + 16*2 );

     //////////////////////////////////////////

     do all shuffles involving a_, b_

     //////////////////////////////////////////

     //////////////////////////////////////////

     do all arithmetics involving a_, b_

     //////////////////////////////////////////

     _mm512_storenrngo_ps( result + 16*3 * j + 16*0, result_ab_ );

 

 

     c_ = _mm512_load_ps( array + 16*6 * j + 16*4 );

     d_ = _mm512_load_ps( array + 16*6 * j + 16*1 );

     //////////////////////////////////////////

     do all shuffles involving c_, d_

     //////////////////////////////////////////

     //////////////////////////////////////////

     do all arithmetics involving c_, d_

     //////////////////////////////////////////

     _mm512_storenrngo_ps( result + 16*3 * j + 16*1, result_cd_ );

 

 

     e_ = _mm512_load_ps( array + 16*6 * j + 16*3 );

     f_ = _mm512_load_ps( array + 16*6 * j + 16*5 );

     //////////////////////////////////////////

     do all shuffles involving e_, f_

     //////////////////////////////////////////

     //////////////////////////////////////////

     do all arithmetics involving e_, f_

     //////////////////////////////////////////

     _mm512_storenrngo_ps( result + 16*3 * j + 16*2, result_ef_ );

 

}

 [/cpp]

Which order of my examples is better?  Note that in the 2nd example the load instruction has to perform a non-sequential memory access.

Again these examples are very simplified. My code has the same structure, but performs a lot of shuffle and arithmetic instructions and needs not more than 32 zmm registers at the same time for one loop iteration, but the code is close to 32 zmm registers. The whole program is a big loop, which loads data from a point in a multi dimensional lattice. The data differs at each point in the lattice. Each iteration in the big loop performs the same instructions.

Is there a general rule for ordering the Intrinsics? How clever is the compiler?

Thanks

Patrick

 

0 Kudos
7 Replies
Loc_N_Intel
Employee
418 Views

Hi Patrick,

Let's me ask the experts around here and get back to you. Thank you.

0 Kudos
TimP
Honored Contributor III
418 Views

You would want to issue all the loads before any stores.  The compiler would require __restrict qualifiers or #ivdep pragmas to do that automatically.  If there's a chance the compiler may not be able to do so, writing loads before stores could be advantageous.  You will be able to see what the compiler has done by looking at generated asm; in view of it being an in-order CPU, this does have relevance.

0 Kudos
Patrick_S_
New Contributor I
418 Views

loc-nguyen (Intel) wrote:

Hi Patrick,

Let's me ask the experts around here and get back to you. Thank you.

ok.

Tim Prince wrote:

writing loads before stores could be advantageous.

What would be the benefit?

0 Kudos
TimP
Honored Contributor III
418 Views

You want to issue loads as far in advance as possible of where the data are needed, to maximize the number of non-dependent instructions executed before a data stall occurs. 

0 Kudos
Loc_N_Intel
Employee
418 Views

Our developers said Tim’s advice was appropriate for this, adding that “The order of intrinsics (thus the order of instructions) does matter on KNC. The compiler performs some reordering of generated instructions to generate faster code, but it cannot do this through the whole program.”

0 Kudos
Patrick_S_
New Contributor I
418 Views

Tim Prince wrote:

You want to issue loads as far in advance as possible of where the data are needed, to maximize the number of non-dependent instructions executed before a data stall occurs. 

even if I need to load far more than 32 zmm registers?

0 Kudos
TimP
Honored Contributor III
418 Views

That's an interesting question, whether the compiler can deal with register pressure as well when using intrinsics as with C or C++ code.   I don't know why there should be any advantage in using intrinsics rather than plain C or CEAN assignments for moving data into mm512 objects, in the simple cases, so you have some variations to try if you run into that situation. In a future out-of-order coprocessor, the number of program-accessible registers shouldn't place a fixed limit on the number of mm512 objects which can be in use simultaneously by a single thread.  I suppose you will run into limits on number of data streams for hardware and software prefetch before you use up 32 simd registers.

I haven't seen any examples of widely used applications which appeared to need more than 24 simd registers when taking advantage of out-of-order and shadow registers.  One of the methods Intel compilers use on host to deal with register, fill buffer, and data stream pressure is automatic loop distribution.  I've heard of situations where splitting a loop manually in source code for MIC coprocessor has been advocated long before reaching the 32-register limit.

0 Kudos
Reply