## SSE sum of vectors - how to improve cache performance

Hello, the performance of my application depends heavily on summing two vectors (stored as aligned double arrays); namely, I need a fast vecA += vecB. Since SSE has no add instruction that writes its result directly to memory, the only option is to load, add and store, i.e. vecA = vecA + vecB. I have two versions of this function:

    #include <emmintrin.h>  // SSE2: __m128d, _mm_add_pd

    // Adds `what` to `toWhat`, four doubles per iteration.
    // Note: `dest` is never written; the sums are stored back through `toWhat`.
    // The loops assume 16-byte aligned arrays and len being a multiple of 4.
    inline void addToDoubleVectorSSE(const double * what, const double * toWhat, volatile double * dest, const unsigned int len)
    {
        __m128d * _what = (__m128d*)what;
        __m128d * _toWhat = (__m128d*)toWhat;
        __m128d * _toWhatBase = (__m128d*)toWhat;

        __m128d _dest1;
        __m128d _dest2;

    #ifdef FAST_SSE
        // Pointers advance once per iteration; both adds use fixed offsets.
        for ( unsigned int i = 0; i < len; i += 4, _what += 2, _toWhat += 2 )
        {
            _toWhatBase = _toWhat;
            _dest1 = _mm_add_pd( *_what, *_toWhat );             //line A
            _dest2 = _mm_add_pd( *(_what+1), *(_toWhat+1) );     //line B

            *_toWhatBase = _dest1;
            *(_toWhatBase+1) = _dest2;
        }
    #else
        // Pointers are post-incremented after every load and store.
        for ( unsigned int i = 0; i < len; i += 4 )
        {
            _toWhatBase = _toWhat;
            _dest1 = _mm_add_pd( *_what++, *_toWhat++ );
            _dest2 = _mm_add_pd( *_what++, *_toWhat++ );

            *_toWhatBase++ = _dest1;
            *_toWhatBase++ = _dest2;
        }
    #endif
    }

FAST_SSE should take advantage of the independence of lines A and B, and hence should provide a performance gain.

Scenario 1: Assume arrays double *a, *b, *c, each 1000 elements long. Calling addToDoubleVectorSSE(a, b, c, 1000), say, 10K times indeed shows that the FAST_SSE version is approx. 25-30 percent faster.

Scenario 2: Assume double **a, **b, **c, where each of a, b, c consists of 1000 arrays, each array being 1000 elements long. Calling addToDoubleVectorSSE(a[i], b[i], c[i], 1000) for i = 0...999, say, 10K times makes the performance gain of FAST_SSE disappear.
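For a sense of scale, the amount of data touched per sweep in the two scenarios can be estimated with a small sketch like the one below. The helper name `workingSetBytes` is hypothetical, and the numbers assume 8-byte doubles and count all three arrays a, b, c; since the function as posted never writes `dest`, the data actually touched may be closer to two thirds of these figures.

```cpp
#include <cstddef>

// Hypothetical helper: bytes covered by `arrays` groups of `rows` arrays,
// each array holding `len` doubles.
constexpr std::size_t workingSetBytes(std::size_t arrays, std::size_t rows, std::size_t len)
{
    return arrays * rows * len * sizeof(double);
}

// Scenario 1: a, b, c are single arrays of 1000 doubles each.
constexpr std::size_t scenario1Bytes = workingSetBytes(3, 1, 1000);

// Scenario 2: a, b, c each hold 1000 arrays of 1000 doubles.
constexpr std::size_t scenario2Bytes = workingSetBytes(3, 1000, 1000);
```

With 8-byte doubles this comes to 24,000 bytes for Scenario 1 and 24,000,000 bytes for Scenario 2, i.e. a factor of 1000 between the two.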

The question is whether the performance loss can somehow be mitigated. I understand that cache misses are probably the problem. In the first scenario, all arrays a, b, c are small enough to remain in L2, which is not the case in Scenario 2. Is there, e.g., a way to tell the compiler something like "In two lines of code I'm going to need arrays a, b, c, so if you can, prefetch them to L2"? Or is there any other workaround?
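One way to express such a hint is the `_mm_prefetch` intrinsic from `<xmmintrin.h>`. The following is only a minimal sketch of the idea, not a tuned solution: the function name `addInPlaceWithPrefetch` and the prefetch distance are assumptions made for illustration, the arrays are assumed to be 16-byte aligned with a length that is a multiple of 4, and whether it helps at all would have to be measured, since hardware prefetchers often already handle sequential streams and the loop may simply be limited by memory bandwidth.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2: _mm_load_pd, _mm_add_pd, _mm_store_pd
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

// Hypothetical helper: acc[i] += src[i] for i in [0, len).
// Assumes both arrays are 16-byte aligned and len is a multiple of 4.
inline void addInPlaceWithPrefetch(double* acc, const double* src, std::size_t len)
{
    // Guessed tuning value (in doubles), not a measured one: 64 doubles = 512 bytes.
    const std::size_t PREFETCH_AHEAD = 64;
    assert(len % 4 == 0);

    for (std::size_t i = 0; i < len; i += 4)
    {
        // Ask the CPU to start fetching data that is needed some iterations later.
        // With 64-byte cache lines this requests each line twice, which is
        // redundant but harmless in a sketch.
        if (i + PREFETCH_AHEAD < len)
        {
            _mm_prefetch(reinterpret_cast<const char*>(acc + i + PREFETCH_AHEAD), _MM_HINT_T0);
            _mm_prefetch(reinterpret_cast<const char*>(src + i + PREFETCH_AHEAD), _MM_HINT_T0);
        }

        // Two independent load/add/store chains, as in lines A and B.
        __m128d sum0 = _mm_add_pd(_mm_load_pd(acc + i),     _mm_load_pd(src + i));
        __m128d sum1 = _mm_add_pd(_mm_load_pd(acc + i + 2), _mm_load_pd(src + i + 2));
        _mm_store_pd(acc + i,     sum0);
        _mm_store_pd(acc + i + 2, sum1);
    }
}
```

A call would look like `addInPlaceWithPrefetch(a[i], b[i], 1000)` for each row i. Other hint constants such as `_MM_HINT_T1` exist as well; which hint and which distance work best, if any, depends on the CPU and would need benchmarking.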

Any hint is much appreciated, Daniel.