Hi,
I have a large loop which executes the following piece of code many times:
-------------------------------------------------------------------------
//input: __m128d U_c00, U_c01, U_c02, U_c10, U_c11, U_c12, U_c20, U_c21, U_c22, psi_c0, psi_c1, psi_c2;
//output: __m128d chi_c0, chi_c1, chi_c2;
tmp0 = _mm_add_pd(U_c00, psi_c0);
tmp1 = _mm_add_pd(U_c01, psi_c1);
tmp2 = _mm_add_pd(U_c02, psi_c2);
chi_c0 = _mm_add_pd(tmp0, tmp1);
chi_c0 = _mm_add_pd(chi_c0, tmp2);
tmp0 = _mm_add_pd(U_c10, psi_c0);
tmp1 = _mm_add_pd(U_c11, psi_c1);
tmp2 = _mm_add_pd(U_c12, psi_c2);
chi_c1 = _mm_add_pd(tmp0, tmp1);
chi_c1 = _mm_add_pd(chi_c1, tmp2);
tmp0 = _mm_add_pd(U_c20, psi_c0);
tmp1 = _mm_add_pd(U_c21, psi_c1);
tmp2 = _mm_add_pd(U_c22, psi_c2);
chi_c2 = _mm_add_pd(tmp0, tmp1);
chi_c2 = _mm_add_pd(chi_c2, tmp2);
--------------------------------------------------------------------------
All live-in values in the above loop (U_c00, U_c01, ..., U_c22, psi_c0, psi_c1, psi_c2) are of type __m128d and have been initialized in previous steps from the corresponding elements of complex-valued arrays (i.e., U is a 3x3 complex array and psi is a 1x3 complex vector) using the _mm_load functions. Similarly, chi_c0, chi_c1 and chi_c2 are output values that are either used in subsequent computations or stored back to memory.
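For readers who want to reproduce the setup, here is a minimal sketch of the load step described above. The helper names (load_complex, store_complex, load_inputs) and the use of C99 double complex arrays are assumptions for illustration, not the original code; unaligned loads are used so the sketch does not depend on the arrays being 16-byte aligned.

```c
#include <complex.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_loadu_pd, _mm_storeu_pd */

/* Hypothetical helper: load one complex number (re, im) into one __m128d.
 * C99 guarantees that double complex has the same layout as double[2]. */
static inline __m128d load_complex(const double complex *z)
{
    return _mm_loadu_pd((const double *)z);
}

/* Hypothetical helper: store one __m128d back as a complex number. */
static inline void store_complex(double complex *z, __m128d v)
{
    _mm_storeu_pd((double *)z, v);
}

/* Hypothetical helper: fill the __m128d arrays from the complex arrays,
 * one element of U or psi per __m128d value. */
static inline void load_inputs(__m128d U[3][3], __m128d psi[3],
                               double complex U_mem[3][3],
                               double complex psi_mem[3])
{
    for (int i = 0; i < 3; ++i) {
        psi[i] = load_complex(&psi_mem[i]);
        for (int j = 0; j < 3; ++j)
            U[i][j] = load_complex(&U_mem[i][j]);
    }
}
```

If the arrays are known to be 16-byte aligned, _mm_load_pd and _mm_store_pd can be used in place of the unaligned variants.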
In order to avoid repeating the above piece of code again and again within the loop, I would like to implement it as a separate function. The only elegant solution I see is to use arrays of __m128d values to encode both the input (U, psi) and the output (chi) values. So, the signature of such a function would be as follows:
__m128d* multiply(__m128d **U, __m128d *psi);
My question is:
If I choose to encode e.g. U as an array of type __m128d[3][3], will each element be automatically mapped to a specific vector register by the compiler?
And most importantly, is the compiler clever enough to maintain these register mappings within the "multiply" function, in order to avoid copying the register values to memory locations (upon each "multiply" call) and back to registers (within "multiply")?
I would expect the compiler to figure this out if you enable inlining when optimizing your function. It should also choose a level of unrolling that fits within the registers available in the mode you compile for.
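As a rough illustration of what "in-lining options" can look like at the source level, the sketch below forces inlining with a compiler-specific attribute instead of relying on the optimizer's heuristics. The FORCE_INLINE macro and the row_sum helper are hypothetical names introduced here; always_inline is a GCC-style extension and __forceinline is the MSVC counterpart, while a plain static inline remains only a hint.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_add_pd */

/* Hypothetical macro: request inlining even where the heuristics would not. */
#if defined(__GNUC__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define FORCE_INLINE static __forceinline
#else
#define FORCE_INLINE static inline
#endif

/* Hypothetical helper: one output row, (a0 + p0) + (a1 + p1) + (a2 + p2).
 * All operands are passed by value, so after inlining they are ordinary
 * values that the register allocator is free to keep in XMM registers. */
FORCE_INLINE __m128d row_sum(__m128d a0, __m128d a1, __m128d a2,
                             __m128d p0, __m128d p1, __m128d p2)
{
    __m128d t0 = _mm_add_pd(a0, p0);
    __m128d t1 = _mm_add_pd(a1, p1);
    __m128d t2 = _mm_add_pd(a2, p2);
    return _mm_add_pd(_mm_add_pd(t0, t1), t2);
}
```

Whether every value actually stays in a register still depends on register pressure: the nine elements of U, three of psi, three of chi and the temporaries compete for the XMM registers available in the target mode.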
Thanks for your answers.
I implemented the piece of code I mentioned in my first post within a separate function, as follows:
static inline void foo(__m128d chi[3], __m128d U[3][3], __m128d psi[3])
{
    __m128d tmp0, tmp1, tmp2;

    tmp0 = _mm_add_pd(U[0][0], psi[0]);
    tmp1 = _mm_add_pd(U[0][1], psi[1]);
    tmp2 = _mm_add_pd(U[0][2], psi[2]);
    chi[0] = _mm_add_pd(tmp0, tmp1);
    chi[0] = _mm_add_pd(chi[0], tmp2);

    tmp0 = _mm_add_pd(U[1][0], psi[0]);
    tmp1 = _mm_add_pd(U[1][1], psi[1]);
    tmp2 = _mm_add_pd(U[1][2], psi[2]);
    chi[1] = _mm_add_pd(tmp0, tmp1);
    chi[1] = _mm_add_pd(chi[1], tmp2);

    tmp0 = _mm_add_pd(U[2][0], psi[0]);
    tmp1 = _mm_add_pd(U[2][1], psi[1]);
    tmp2 = _mm_add_pd(U[2][2], psi[2]);
    chi[2] = _mm_add_pd(tmp0, tmp1);
    chi[2] = _mm_add_pd(chi[2], tmp2);
}
This version performed as well as (and sometimes better than) the original version, in which I manually inlined the operations performed by foo. Judging only from this (and without having inspected the assembly code), I assume that the compiler (gcc 4.6.3) successfully inlines the function at every site where it is called.
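Timing alone cannot confirm inlining; compiling with gcc -O2 -S and checking the generated assembly for a call to the function is the more direct test. Separately, a refactoring like this one can be checked for correctness against a scalar reference. The sketch below is an assumption-laden illustration: foo_loop, foo_scalar and foo_matches_scalar are hypothetical names, and the test input is made up. foo_loop performs the same additions as foo above, written as a loop.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_add_pd, _mm_loadu_pd, _mm_storeu_pd */

/* Hypothetical loop form of foo: same additions, in the same order. */
static inline void foo_loop(__m128d chi[3], __m128d U[3][3], __m128d psi[3])
{
    for (int i = 0; i < 3; ++i) {
        __m128d t0 = _mm_add_pd(U[i][0], psi[0]);
        __m128d t1 = _mm_add_pd(U[i][1], psi[1]);
        __m128d t2 = _mm_add_pd(U[i][2], psi[2]);
        chi[i] = _mm_add_pd(_mm_add_pd(t0, t1), t2);
    }
}

/* Scalar reference: the last index k selects the low (0) or high (1) lane. */
static void foo_scalar(double chi[3][2], double U[3][3][2], double psi[3][2])
{
    for (int i = 0; i < 3; ++i) {
        for (int k = 0; k < 2; ++k) {
            double t0 = U[i][0][k] + psi[0][k];
            double t1 = U[i][1][k] + psi[1][k];
            double t2 = U[i][2][k] + psi[2][k];
            chi[i][k] = (t0 + t1) + t2;
        }
    }
}

/* Returns 1 if the SSE2 and scalar versions agree on a made-up input. */
static int foo_matches_scalar(void)
{
    double U_s[3][3][2], psi_s[3][2], chi_s[3][2];
    __m128d U_v[3][3], psi_v[3], chi_v[3];

    for (int i = 0; i < 3; ++i) {
        psi_s[i][0] = 1.0 + i;
        psi_s[i][1] = -1.0 - i;
        psi_v[i] = _mm_loadu_pd(psi_s[i]);
        for (int j = 0; j < 3; ++j) {
            U_s[i][j][0] = 3.0 * i + j;
            U_s[i][j][1] = 0.5 * (3.0 * i + j);
            U_v[i][j] = _mm_loadu_pd(U_s[i][j]);
        }
    }

    foo_loop(chi_v, U_v, psi_v);
    foo_scalar(chi_s, U_s, psi_s);

    for (int i = 0; i < 3; ++i) {
        double out[2];
        _mm_storeu_pd(out, chi_v[i]);
        if (out[0] != chi_s[i][0] || out[1] != chi_s[i][1])
            return 0;
    }
    return 1;
}
```

The exact comparison is safe here only because the test values are small and exactly representable, and both versions add in the same order.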
>>>Just step into the function and open a Register window to see all registers, like FPU, MMX, SSE, etc.>>>
The __m128d data type will not be loaded into x87 registers.
>>>I wonder if an introduction of a new #pragma directive ( for example, MapRamToXmm ) to do that mapping makes sense?>>>
It makes sense only for __m128d data sets small enough to be loaded completely into the available XMMn registers.
>>>I was talking about how to see the contents of registers using the Register Window. Nobody assumed that FPU registers could be loaded>>>
Ok! I understand you.