Hello, the performance of my application depends heavily on summing two vectors (stored as aligned double arrays); namely, I need a fast vecA += vecB. Since SSE has no instruction for +=, the only option is to compute vecA = vecA + vecB. I have two versions of this function:
#include <emmintrin.h>

inline void addToDoubleVectorSSE(const double * what, const double * toWhat, volatile double * dest, const unsigned int len)
{
   __m128d * _what = (__m128d*)what;
   __m128d * _toWhat = (__m128d*)toWhat;
   __m128d * _toWhatBase = (__m128d*)toWhat;
   __m128d _dest1;
   __m128d _dest2;

#ifdef FAST_SSE
   // version 1: both additions use fixed offsets, so they do not depend on each other
   for ( register unsigned int i = 0; i < len; i+= 4, _what += 2, _toWhat += 2, _toWhatBase+=2 )
   {
      _toWhatBase = _toWhat;
      _dest1 = _mm_add_pd( *_what, *_toWhat ); //line A
      _dest2 = _mm_add_pd( *(_what+1), *(_toWhat+1)); //line B
      *_toWhatBase = _dest1;
      *(_toWhatBase+1) = _dest2;
   }
#else
   // version 2: pointers are advanced after every access
   for ( register unsigned int i = 0; i < len; i+= 4 )
   {
      _toWhatBase = _toWhat;
      _dest1 = _mm_add_pd( *_what++, *_toWhat++ );
      _dest2 = _mm_add_pd( *_what++, *_toWhat++ );
      *_toWhatBase++ = _dest1;
      *_toWhatBase++ = _dest2;
   }
#endif
}
The FAST_SSE version (the first loop, the one containing lines A and B) should take advantage of the independence of line A and line B, and hence should provide a performance gain.
Scenario 1: Assume arrays double *a, *b, *c, each 1000 elements long. Calling addToDoubleVectorSSE(a,b,c,1000), say, 10K times indeed shows that the FAST_SSE version runs approx. 25-30 percent faster.
Scenario 2: Assume double **a, **b, **c, where each of a, b, c consists of 1000 arrays and each of those arrays is 1000 elements long. Calling addToDoubleVectorSSE(a[i],b[i],c[i],1000) for i=0...999, say, 10K times makes the performance gain of FAST_SSE disappear.
The question is whether the performance loss can somehow be mitigated. I understand that cache misses are probably the problem. In the first scenario, all arrays a, b, c are small enough to remain in L2, which is not the case in scenario 2. Is there, e.g., a way to tell the compiler something like "In two lines of code, I'm going to need arrays a, b, c, so if you can, prefetch them to L2"? Or is there any other workaround?
Any hint is much appreciated, Daniel.
P.S. The sample benchmark code can be downloaded from http://pastebin.com/Z1pQ6Sdp
There are _mm_prefetch compiler intrinsics for adding software prefetch, and Intel compilers have the -opt-prefetch option to suggest doing it automatically. Neither would require you to expand your data structures explicitly, even though prefetching increases the cache footprint. The most likely advantage is that software prefetch can fetch data from a page that has not been accessed yet, whereas hardware prefetch stops at 4KB page boundaries. Apart from that, you may already be close to the full memory bandwidth of a single CPU in your system.
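To make the software prefetch suggestion concrete, here is a minimal sketch in C. It is not the original poster's function: the names add_with_prefetch and add_rows, and the PREFETCH_DISTANCE value, are made up for illustration. It assumes 16-byte-aligned arrays, a length that is a multiple of 4, and 64-byte cache lines. Whether it helps at all has to be measured, since the loop may already be limited by memory bandwidth.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_load_pd, _mm_add_pd, _mm_store_pd */
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical tuning knob: how many doubles ahead to prefetch.
   64 doubles = 512 bytes = 8 cache lines (assuming 64-byte lines). */
#define PREFETCH_DISTANCE 64

/* vecA += vecB with software prefetch inside the array.
   Assumes 16-byte-aligned pointers and len divisible by 4. */
void add_with_prefetch(double *vecA, const double *vecB, size_t len)
{
    assert(len % 4 == 0);
    assert((uintptr_t)vecA % 16 == 0 && (uintptr_t)vecB % 16 == 0);

    for (size_t i = 0; i < len; i += 4) {
        /* Once per 8 doubles (one assumed cache line), request data
           further ahead, but never past the end of the arrays. */
        if (i % 8 == 0 && i + PREFETCH_DISTANCE < len) {
            _mm_prefetch((const char *)(vecA + i + PREFETCH_DISTANCE), _MM_HINT_T0);
            _mm_prefetch((const char *)(vecB + i + PREFETCH_DISTANCE), _MM_HINT_T0);
        }
        /* Two independent additions, as in the FAST_SSE variant. */
        __m128d s0 = _mm_add_pd(_mm_load_pd(vecA + i),     _mm_load_pd(vecB + i));
        __m128d s1 = _mm_add_pd(_mm_load_pd(vecA + i + 2), _mm_load_pd(vecB + i + 2));
        _mm_store_pd(vecA + i,     s0);
        _mm_store_pd(vecA + i + 2, s1);
    }
}

/* Scenario 2 style: a[r] += b[r] for every row. Before working on a row,
   touch the start of the next row, which usually lies on a different page. */
void add_rows(double **a, double **b, size_t rows, size_t len)
{
    for (size_t r = 0; r < rows; ++r) {
        if (r + 1 < rows) {
            _mm_prefetch((const char *)a[r + 1], _MM_HINT_T0);
            _mm_prefetch((const char *)b[r + 1], _MM_HINT_T0);
        }
        add_with_prefetch(a[r], b[r], len);
    }
}
```

With data laid out as in scenario 2, the call would be add_rows(a, b, 1000, 1000). The prefetch distance and the hint (_MM_HINT_T0 here) are the parameters worth experimenting with. If neither changes the timing, the loop is probably bandwidth-bound already.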