Intel® C++ Compiler

Slow memcpy with SSE4.2 enabled

friedrich_nord
Beginner
Hi,

I'm using Intel Composer XE 12.1 Build 20110811 under Windows 7 32bit.

In my project I have a simple struct

struct matrix {
__m128d d[9];
};

and several methods which copy their matrix argument to stack.

void doSomething(const matrix& a) {
matrix b = a;

// ......
}

When copying an instance of this struct, __intel_fast_memcpy is invoked. With the compiler option /QxSSE4.2 (together with /O3 or /O2; I haven't tried unoptimized builds), however, code similar to the following is generated for copying the 9 elements:

mov ebx,80h
movups xmm0,xmmword ptr [eax+80h] ; Copy 1 element
movaps xmmword ptr [esp+80h],xmm0
loop:
movups xmm0,xmmword ptr [eax+ebx-10h] ; Copy 4 elements, repeat twice
movaps xmmword ptr [esp+ebx-10h],xmm0
movups xmm1,xmmword ptr [eax+ebx-20h]
movaps xmmword ptr [esp+ebx-20h],xmm1
movups xmm2,xmmword ptr [eax+ebx-30h]
movaps xmmword ptr [esp+ebx-30h],xmm2
movups xmm3,xmmword ptr [eax+ebx-40h]
movaps xmmword ptr [esp+ebx-40h],xmm3
sub ebx,40h
jne loop

There are several things wrong with this.

1) Using a loop for 9 elements is weird, but the loop doesn't seem to incur any performance hit, so that's fine with me.
2) For fetching the data into a register, the compiler fails to infer that the data is aligned, and generates a movups instruction. In all other places, the compiler generates the correct movaps.
3) For the first two copies, xmm0 is used. It seems the second copy operation is stalled until xmm0 can be reused, which incurs a huge performance penalty. In some other places the compiler placed the loop first (that is, copying 2x4 first and then 1 element). I compared the two with VTune, and the latter is about 10 times faster. I also took some steps to make sure the stall was not caused by cache issues.

The third issue in particular is annoying, since the stall amounts to more than 10% of the total running time of the application I'm currently developing.

I also tried replacing the line "matrix b = a" by a manually unrolled elementwise copy:

b.d[0] = a.d[0];
//... snip ...
b.d[8] = a.d[8];

in which case (depending on the surrounding code) either a strange combination of cmov and rep movs instructions is emitted, or nothing at all. And by nothing I don't mean that the copying operation was optimized away; the subsequent code uses the then-uninitialized variable b, and catastrophe ensues.
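
An explicit copy through the SSE2 load/store intrinsics should at least force aligned accesses; the following is only a sketch of that idea (it assumes <emmintrin.h>, and that the 16-byte alignment of the __m128d members holds for both objects):

#include <emmintrin.h>

// Sketch only: copy the matrix element by element through explicit
// SSE2 load/store intrinsics so that aligned moves are guaranteed.
// Assumes both objects are 16-byte aligned (true for __m128d members).
void copyMatrix(matrix& dst, const matrix& src) {
    for (int i = 0; i < 9; ++i) {
        __m128d v = _mm_load_pd(reinterpret_cast<const double*>(&src.d[i]));
        _mm_store_pd(reinterpret_cast<double*>(&dst.d[i]), v);
    }
}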

Greetings,
Christian Voss
TimP
Honored Contributor III
On CPUs which support SSE4.2, there's no performance penalty for use of unaligned instructions on aligned data. As you can see, the compiler uses the unaligned loads, regardless of alignment.
SergeyKostrov
Valued Contributor II

Hi Christian,

1. You're passing 'a' by reference;
2. A new object 'b' of type 'matrix' will be created on the stack;
3. All data of 'a' will be copied to 'b':
you're preserving all values of 'a' (the 'const' specifier is used), and
as soon as the function exits, all data in 'b' will be lost.

Is that correct?

Did you try to pass 'a' by pointer: void doSomething(matrix *a) { ... }?

The compiler will create smaller code, but there is a problem: the values of 'a' will be modified if you don't use 'b'...
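
Just to illustrate the pointer variant ( only a sketch, not tested, names are examples ):

// Sketch: the function works on the caller's matrix directly,
// so no temporary copy 'b' is created on the stack.
// Note: without 'const', the data of 'a' may be modified in place.
void doSomething(matrix *a) {
    // ... operate on a->d[i] directly ...
}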

Best regards,
Sergey

Alexander_W_Intel

Christian,

Because of the out-of-order architecture of modern Intel® CPUs (except the Atom CPUs), reusing the xmm0 register shouldn't be an issue.

I've done some performance tests to prove this. Copying the 9 elements takes about 10 ticks on my Nehalem CPU, averaged over doing it 10 million times. And this is independent of whether I use the sequence xmm0, xmm0, xmm1, xmm2, xmm3, xmm0, xmm1, xmm2, xmm3 or the sequence xmm0, xmm1, xmm2, xmm3, xmm0, xmm1, xmm2, xmm3, xmm0.

So I assume the performance issue is caused by some other part of your code, or that you have a problem with cache alignment.

Measuring such a small amount of time with VTune is difficult. You might try to measure it using the time stamp counter (__rdstc()). To be more exact, you should consider calling cpuid right before reading the time stamp counter to serialize the out-of-order execution at that point.

If you have a small reproducer for your performance issue, I can take a closer look at whether it's caused by the compiler.

Thanks,
Alex


TimP
Honored Contributor III
Make it __rdtsc() (a macro defined in Microsoft and Intel C++ compilers only, which returns a 64-bit integer).
Serializing by cpuid probably takes more time than your own code, so you would have to figure out how to subtract that overhead.
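Roughly like this ( only a sketch, using the <intrin.h> intrinsics of the Microsoft/Intel compilers; the struct is the one from the original post ):

#include <emmintrin.h>
#include <intrin.h>

struct matrix { __m128d d[9]; };   // struct from the original post

// Sketch only: read the time stamp counter with a serializing cpuid
// in front of it, estimate the cpuid/rdtsc overhead, and subtract it
// from the measurement of the copy under test.
static unsigned __int64 serialized_rdtsc() {
    int regs[4];
    __cpuid(regs, 0);              // serialize out-of-order execution
    return __rdtsc();              // read the 64-bit time stamp counter
}

unsigned __int64 measureCopy(const matrix& a, matrix& b) {
    // Estimate the overhead of the timing itself ...
    unsigned __int64 t0 = serialized_rdtsc();
    unsigned __int64 t1 = serialized_rdtsc();
    unsigned __int64 overhead = t1 - t0;

    // ... then time the 9-element copy and subtract that overhead.
    unsigned __int64 start = serialized_rdtsc();
    b = a;                         // code under test
    unsigned __int64 stop = serialized_rdtsc();

    unsigned __int64 elapsed = stop - start;
    return elapsed > overhead ? elapsed - overhead : 0;
}

In practice you would run this many times, as Alexander did, and look at the average or the minimum.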