Intel® ISA Extensions

AVX add slow due to vinsertf128

DLake1
New Contributor I

The CPU is a Core i7-3820, the compiler is ICL 14.0 under VS2013, and the variables are doubles.

auto newvel = velocitiesy[i] + force;

That line is slow because the vinsertf128 instruction in this case has a high CPI of over 3.2. This is the assembly:

vaddsd xmm7, xmm8, qword ptr [r9+rax*8]
vmovupd xmm12, xmmword ptr [r9+r8*8]
vinsertf128 ymm11, ymm12, xmmword ptr [r9+r8*8+0x10], 0x1
vaddpd ymm12, ymm11, ymm10
vaddsd xmm7, xmm8, qword ptr [r9+r15*8]

I found this after running several tests with VTune. It only happens with /QaxAVX; /QaxSSE4.2 and lower are unaffected. Also, why is it using mostly xmm registers rather than ymm registers?

Here's the entire method; I expanded it with temporary variables to assist in troubleshooting:

bool Compute(double* __restrict pointsx, double* __restrict pointsy, double* __restrict velocitiesx, double* __restrict velocitiesy, int pcount, double elapsed){
	for (auto i = 0; i < pcount; ++i){
		// offset the position away from zero, then apply a 1/x^2 force
		auto x = pointsx[i] + (pointsx[i] < 0 ? -0.1 : 0.1);
		double xsq;
		if (x < 0.0L) xsq = -1.0L * x * x;
		else xsq = x * x;
		auto force = 0.0001L / xsq;
		auto newvel = velocitiesx[i] + force; // <--- SLOW
		pointsx[i] -= newvel * elapsed;
		velocitiesx[i] = newvel;
		// clamp the position to [-1, 1] and zero the velocity at the walls
		if (pointsx[i] > 1.0L){
			pointsx[i] = 1.0L;
			velocitiesx[i] = 0.0L;
		}
		else if (pointsx[i] < -1.0L){
			pointsx[i] = -1.0L;
			velocitiesx[i] = 0.0L;
		}
	}
	for (auto i = 0; i < pcount; ++i){
		auto y = pointsy[i] + (pointsy[i] < 0 ? -0.1 : 0.1);
		double ysq;
		if (y < 0.0L) ysq = -1.0L * y * y;
		else ysq = y * y;
		auto force = 0.0001L / ysq;
		auto newvel = velocitiesy[i] + force; // <--- SLOW
		pointsy[i] -= newvel * elapsed;
		velocitiesy[i] = newvel;
		if (pointsy[i] > 1.0L){
			pointsy[i] = 1.0L;
			velocitiesy[i] = 0.0L;
		}
		else if (pointsy[i] < -1.0L){
			pointsy[i] = -1.0L;
			velocitiesy[i] = 0.0L;
		}
	}
	return true;
}

 

TimP
Honored Contributor III

The split load of the AVX operand is chosen, on the compiler's assumption that the operand may not be aligned, to maintain performance on Sandy/Ivy Bridge. If the vinsertf128 incurs stalls, it is probably because it is waiting for its operands (presumably either cache misses or the divide, which may be delayed if your if..else.. was not optimized).

You don't offer enough context to see whether you are suffering from partial vectorization. The compiler will use xmm registers for scalar operations as well as for the cases where the parallel code is cut down to AVX-128 for performance reasons.
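
If the arrays are in fact 32-byte aligned, you can also tell the compiler so, and it can then emit a single aligned 256-bit load instead of the vmovupd/vinsertf128 pair. Below is a minimal sketch of that approach; the helper names are illustrative, and __assume_aligned / #pragma vector aligned are Intel-compiler hints that may only be used if the alignment promise is actually true:

#include <cstddef>
#include <xmmintrin.h>   // _mm_malloc / _mm_free

// Allocate n doubles on a 32-byte boundary so 256-bit AVX loads never
// straddle an alignment boundary (hypothetical helper, not from the thread).
double* AllocAligned(std::size_t n){
	return static_cast<double*>(_mm_malloc(n * sizeof(double), 32));
}
void FreeAligned(double* p){ _mm_free(p); }

// Inside Compute(), before the loops:
//     __assume_aligned(pointsx, 32);
//     __assume_aligned(velocitiesx, 32);
// or, immediately before each loop:
//     #pragma vector aligned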

DLake1
New Contributor I

Actually, I think it's just loading from RAM that's slow, because there's no prefetching going on. How can I make it prefetch the required values in advance?
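
(For reference, explicit software prefetching is available through the _mm_prefetch intrinsic. The sketch below only illustrates the mechanism -- the 16-element prefetch distance is a guess, not a tuned value, and on a simple streaming access pattern like this one the hardware prefetchers normally make it unnecessary:)

#include <xmmintrin.h>   // _mm_prefetch

double PrefetchExample(const double* __restrict data, int count){
	const int ahead = 16;   // prefetch distance in elements (a guess; tune with VTune)
	double sum = 0.0;
	for (int i = 0; i < count; ++i){
		if (i + ahead < count)
			// request the cache line holding data[i + ahead] into L1 (_MM_HINT_T0)
			_mm_prefetch(reinterpret_cast<const char*>(data + i + ahead), _MM_HINT_T0);
		sum += data[i];
	}
	return sum;
}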

TimP
Honored Contributor III

That guess isn't credible unless you have disabled all the default prefetchers and shown that they had no effect. Did you look at your compiler optimization reports or try Intel Parallel Advisor?
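
(For the reports Tim mentions: with ICL 14 a per-loop vectorization report can be requested with something like the following command line -- the exact switch spelling varies between compiler versions, so treat it as an example rather than the definitive option:)

icl /QaxAVX /O2 /Qvec-report2 compute.cpp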

McCalpinJohn
Honored Contributor III

Tim P. is correct -- the Intel compilers will split 256-bit loads into two 128-bit pieces (the second being the VINSERTF128) when compiling for the AVX target because Sandy Bridge (e.g., Core i7-3820) has large penalties for 256-bit loads that cross cache line boundaries. 

There is typically no performance advantage for 256-bit loads on the Sandy Bridge core even if they are aligned, since the two load ports are only 128 bits wide. The core can sustain full L1 bandwidth with either two 128-bit (aligned) loads per cycle or two 256-bit (aligned) loads every two cycles.

The VINSERTF128 instruction is not an expensive one on Sandy Bridge unless you already have lots of instructions on Port 5.  It is the only Port 5 instruction in the inner loop ASM instructions shown above, so it should not be a problem here.

It is possible for VTune to attribute the "hot spot" incorrectly due to instruction skew.

It is probably a good idea to look at the actual pointers being used here. Just looking at all of the possible 8-Byte alignments (a quick way to check which case applies is sketched after the list), I see:

  • If the initial pointer used by the VMOVUPD instruction is aligned on a 32-Byte boundary, then
    • Neither the VMOVUPD nor the VINSERTF128 loads will cross cache-line boundaries, and
    • All of the new cache lines will be brought in by VMOVUPD instructions
  • If the initial pointer used by the VMOVUPD instruction is aligned on a 16-Byte boundary that is not also a 32-Byte boundary, then
    • Neither the VMOVUPD nor the VINSERTF128 loads will cross cache-line boundaries, and
    • All of the new cache lines will be brought in by VINSERTF128 instructions
  • If the initial pointer used by the VMOVUPD instruction is aligned 8 Bytes or 40 Bytes above a 64-Byte boundary, then
    • Half of the VINSERTF128 instructions will load data that crosses a cache line boundary (SLOW), and
    • All of the new cache lines will be brought in by these (slow) VINSERTF128 instructions whose loads cross a cache line boundary.
  • If the initial pointer used by the VMOVUPD instruction is aligned 24 Bytes or 56 Bytes above a 64-Byte boundary, then
    • Half of the VMOVUPD instructions will load data that crosses a cache line boundary (SLOW), and
    • All of the new cache lines will be brought in by these (slow) VMOVUPD instructions whose loads cross a cache line boundary.
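
A simple way to see which of the cases above applies is to print each array's offset within a cache line at the start of Compute(). This diagnostic is not from the thread, just a suggestion:

#include <cstdint>
#include <cstdio>

// Report a pointer's offset within a 64-Byte cache line and within a
// 32-Byte AVX register width (hypothetical diagnostic helper).
void ReportAlignment(const char* name, const double* p){
	std::printf("%s: mod 64 = %u, mod 32 = %u\n", name,
	            unsigned(reinterpret_cast<std::uintptr_t>(p) % 64),
	            unsigned(reinterpret_cast<std::uintptr_t>(p) % 32));
}

// Usage, e.g. at the top of Compute():
//     ReportAlignment("pointsx", pointsx);
//     ReportAlignment("velocitiesx", velocitiesx);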

Ignoring all of the other complexities in your code, just this 5-instruction sequence can have different performance and different hot spots depending on the initial pointer alignment.

At first glance, I would be more worried about the vectorization of all of the IF statements. Sometimes the compiler understands what is going on well enough to generate excellent vector merge code, but I have also seen some failures in this area.
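
One way to take those branches out of the compiler's hands is to rewrite the clamp as min/max operations, which map directly onto vminpd/vmaxpd, leaving only a select that the vectorizer can turn into a blend. This is only a sketch of the idea applied to one element, not code from the thread:

#include <algorithm>   // std::min, std::max

// Branchless version of the position clamp from the original loop:
// limit the position to [-1, 1] and zero the velocity when a wall is hit.
inline void ClampStep(double& point, double& velocity){
	double clamped = std::min(1.0, std::max(-1.0, point));
	velocity = (clamped == point) ? velocity : 0.0;  // wall hit -> stop
	point = clamped;
}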

A CPI of 3.2 sounds high, but I can't tell from your comment whether this applies to the loop as a whole or to some subset. It would not surprise me if there are conditional branches in the generated inner-loop code -- mispredicted branches are very expensive. If everything is vectorized, then I would want to look at how much the floating-point divide adds to the overall minimum loop execution time.

DLake1
New Contributor I

Disregard.
