Intel® ISA Extensions

Array of __m128d values as function argument

Anastopoulos__Nikos
402 Views

Hi, 

I have a large loop which executes the following piece of code many times: 

-------------------------------------------------------------------------

//input: __m128d U_c00, U_c01, U_c02, U_c10, U_c11, U_c12, U_c20, U_c21, U_c22, psi_c0, psi_c1, psi_c2;

//output: __m128d chi_c0, chi_c1, chi_c2;

tmp0 = _mm_add_pd(U_c00, psi_c0);
tmp1 = _mm_add_pd(U_c01, psi_c1);
tmp2 = _mm_add_pd(U_c02, psi_c2);
chi_c0 = _mm_add_pd(tmp0, tmp1);
chi_c0 = _mm_add_pd(chi_c0, tmp2);
tmp0 = _mm_add_pd(U_c10, psi_c0);
tmp1 = _mm_add_pd(U_c11, psi_c1);
tmp2 = _mm_add_pd(U_c12, psi_c2);
chi_c1 = _mm_add_pd(tmp0, tmp1);
chi_c1 = _mm_add_pd(chi_c1, tmp2);
tmp0 = _mm_add_pd(U_c20, psi_c0);
tmp1 = _mm_add_pd(U_c21, psi_c1);
tmp2 = _mm_add_pd(U_c22, psi_c2);
chi_c2 = _mm_add_pd(tmp0, tmp1);
chi_c2 = _mm_add_pd(chi_c2, tmp2);

--------------------------------------------------------------------------

All live-in values in the above loop (U_c00, U_c01, ..., U_c22, psi_c0, psi_c1, psi_c2) are of type __m128d and have been initialized in previous steps from the corresponding elements of complex-valued arrays (i.e., U is a 3x3 complex array and psi is a 1x3 complex vector) using _mm_load functions. Similarly, chi_c0, chi_c1 and chi_c2 are output values that are used in subsequent computations or stored back to memory.
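For readers who want the load step spelled out, here is a minimal sketch of how one complex element might be moved into a __m128d and back. The helper names load_complex and store_complex are hypothetical (they do not appear in the post), and the unaligned _mm_loadu_pd/_mm_storeu_pd variants are an assumption made so the sketch works without 16-byte alignment; the original code may well use the aligned _mm_load_pd instead.

```c
#include <complex.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_loadu_pd, _mm_storeu_pd */

/* Hypothetical helper: load one complex number into a vector.
 * Low lane = real part, high lane = imaginary part. */
static inline __m128d load_complex(const double complex *z)
{
    /* C99 guarantees that double complex has the layout of double[2]. */
    return _mm_loadu_pd((const double *)z);
}

/* Hypothetical helper: store a vector back as one complex number. */
static inline void store_complex(double complex *z, __m128d v)
{
    _mm_storeu_pd((double *)z, v);
}
```

With helpers like these, U_c00 would be loaded from the first element of U, psi_c0 from the first element of psi, and so on.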

To avoid repeating the above piece of code again and again within the loop, I would like to implement it as a separate function. The only elegant solution I see is to use arrays of __m128d values to encode both the input (U, psi) and output (chi) values. The signature of such a function would be as follows:

__m128d* multiply(__m128d **U, __m128d *psi);

My question is:

If I choose to encode e.g. U as an array of type __m128d[3][3], will each element be automatically mapped to a specific vector register by the compiler?

And most importantly, is the compiler clever enough to maintain these register mappings within the "multiply" function, in order to avoid copying the register values to memory locations (upon each "multiply" call) and back to registers (within "multiply")?

8 Replies
SergeyKostrov
Valued Contributor II
>>__m128d* multiply( __m128d **U, __m128d *psi );
>>
>>My question is:
>>
>>If I choose to encode e.g. U as an array of type __m128d[3][3], will each element be automatically mapped to a specific
>>vector register by the compiler?

If U, as an array of type __m128d[3][3], is declared (created and initialized) in memory (RAM), I don't think it will be automatically mapped to any register. Is there a C++ compiler option to do that? I have not seen one. Anyway, you could easily verify this in a debugger: step into the function and open the Registers window to see all registers (FPU, MMX, SSE, etc.).
TimP
Honored Contributor III

I would expect the compilers to figure this out if you use in-lining options to optimize your function.  The compiler should figure out a level of unrolling which fits within the available registers and the mode for which you compile.
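One way to package the values for such an inlined function, sketched here under assumptions, is to wrap the vectors in small structs and return the result by value from a static inline function. The type names vec3_pd and mat3_pd and the function name add_rows are hypothetical; the arithmetic simply mirrors the adds in the first post. Whether everything stays in registers depends on the compiler, the optimization level and the target: x86-64 exposes 16 XMM registers and 32-bit mode only 8, so the 12 input vectors plus temporaries may not all fit and some may spill. Inspecting the generated assembly is the only reliable check.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_add_pd */

/* Hypothetical wrapper types: three vectors and a 3x3 block of vectors. */
typedef struct { __m128d c[3]; }    vec3_pd;
typedef struct { __m128d c[3][3]; } mat3_pd;

/* Same adds as in the first post, one row of U per output element. */
static inline vec3_pd add_rows(const mat3_pd *U, const vec3_pd *psi)
{
    vec3_pd chi;
    for (int i = 0; i < 3; ++i) {
        __m128d t0 = _mm_add_pd(U->c[i][0], psi->c[0]);
        __m128d t1 = _mm_add_pd(U->c[i][1], psi->c[1]);
        __m128d t2 = _mm_add_pd(U->c[i][2], psi->c[2]);
        chi.c[i] = _mm_add_pd(_mm_add_pd(t0, t1), t2);
    }
    return chi;
}
```

The fixed trip count of 3 gives the compiler the chance to unroll the loop fully once the call is inlined, which is the situation described above.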

Anastopoulos__Nikos

Thanks for your answers.

I implemented the piece of code I mentioned in my first post within a separate function, as follows:

#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_add_pd */

static inline void foo(__m128d chi[3], __m128d U[3][3], __m128d psi[3])
{
    __m128d tmp0, tmp1, tmp2;

    /* chi[0] from row 0 of U */
    tmp0 = _mm_add_pd(U[0][0], psi[0]);
    tmp1 = _mm_add_pd(U[0][1], psi[1]);
    tmp2 = _mm_add_pd(U[0][2], psi[2]);
    chi[0] = _mm_add_pd(tmp0, tmp1);
    chi[0] = _mm_add_pd(chi[0], tmp2);

    /* chi[1] from row 1 of U */
    tmp0 = _mm_add_pd(U[1][0], psi[0]);
    tmp1 = _mm_add_pd(U[1][1], psi[1]);
    tmp2 = _mm_add_pd(U[1][2], psi[2]);
    chi[1] = _mm_add_pd(tmp0, tmp1);
    chi[1] = _mm_add_pd(chi[1], tmp2);

    /* chi[2] from row 2 of U */
    tmp0 = _mm_add_pd(U[2][0], psi[0]);
    tmp1 = _mm_add_pd(U[2][1], psi[1]);
    tmp2 = _mm_add_pd(U[2][2], psi[2]);
    chi[2] = _mm_add_pd(tmp0, tmp1);
    chi[2] = _mm_add_pd(chi[2], tmp2);
}

This version performed as well as (or sometimes better than) the original version, in which I manually inlined the operations performed by foo. Judging only from this (and without having inspected the assembly code), I assume that the compiler (gcc 4.6.3) successfully inlines the function at all of its call sites.
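Timing alone cannot confirm inlining. One option is to compile with gcc -O2 -S and search the assembly output for a call to foo; if none is present at a call site, the function was inlined there. Another option, sketched below as an assumption rather than as what was actually done in this thread, is to request inlining explicitly with the always_inline attribute, which is a GCC/Clang extension. The macro name FORCE_INLINE and the function name foo_forced are hypothetical.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_add_pd */

/* GCC/Clang extension: inline this function even where the usual
 * heuristics would decline to. */
#define FORCE_INLINE static inline __attribute__((always_inline))

/* Same computation as foo, written as a loop over the rows of U. */
FORCE_INLINE void foo_forced(__m128d chi[3], __m128d U[3][3], __m128d psi[3])
{
    for (int i = 0; i < 3; ++i) {
        __m128d tmp0 = _mm_add_pd(U[i][0], psi[0]);
        __m128d tmp1 = _mm_add_pd(U[i][1], psi[1]);
        __m128d tmp2 = _mm_add_pd(U[i][2], psi[2]);
        chi[i] = _mm_add_pd(_mm_add_pd(tmp0, tmp1), tmp2);
    }
}
```

GCC documents that failing to inline an always_inline function is diagnosed as an error, so the attribute turns a silent performance assumption into a compile-time check.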

SergeyKostrov
Valued Contributor II
Thanks for the feedback.

>>...If U as an array of type __m128d[3][3] is declared ( created & initialized ) in memory ( RAM ) I don't think it will be
>>automatically mapped to any register...

I wonder whether introducing a new #pragma directive (for example, MapRamToXmm) to do that mapping would make sense? Please consider it as a feature request.
Bernard
Valued Contributor I

>>>Just step into the function and open a Register window to see all registers, like FPU, MMX, SSE, etc.>>>

__m128d values are not loaded into x87 registers.

Bernard
Valued Contributor I

>>>I wonder if an introduction of a new #pragma directive ( for example, MapRamToXmm ) to do that mapping makes sense?>>>

It makes sense only for small __m128d data sets that can be loaded completely into the available XMMn registers.

SergeyKostrov
Valued Contributor II
>>>>Just step into the function and open a Register window to see all registers, like FPU, MMX, SSE, etc.>>>
>>
>>__m128d data type will not be loaded in x87 registers.

Do you have Visual Studio? I was talking about how to see the contents of registers using the Registers window. Nobody assumed that the FPU registers would be loaded.
Bernard
Valued Contributor I

>>>I was talking about how to see the contents of registers using the Registers window. Nobody assumed that the FPU registers would be loaded>>>

Ok! I understand you.
