Haswell and crosslan

Christian_M_2 · ‎04-27-2015

Hello,

I build a code for Integralimage computation with SSE and its quite good. But I have serious problems making use of AVX/AVX2. I run my code on an i5-4460.

What is the basis: For integral image I need rowsum which is not optimal for vector units but can be done by shuffle and add. And I need to broadcast the last element to all elements as a second step. This can be done with a shuffle.

Now with AVX, there is no full shuffle for 32 bit, but I can do it with a normal shuffle and _mm256_permute2f128_ps.

AVX is only 60% speed of SSE so much slower. I have checked intrinsics guide. I think Haswell is a big problem here. All cross lane operations like _mm256_permute2f128_ps have latency of 3. And I need an additional shuffle. This make 4 cycles (as the isntructiosn are dependent), SSE only needs 1. The next thing is, Haswell seems a big step back. Sandy and Ivy had latency 1 for cross lane operations. Makes 2 for a whole AVX shuffle. So Haswell takes twice the time compared to the previous architecture. Is this really so? Can I do something to improve my code?

BTW, the instrinsics guide shows a lot of instructions rising from 1 to latency 3 for Haswell. I use the online version from: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

As you see, for the broadcastLast I have two AVX versions, but both are slow. Are there any AVX2 instructions I have overlooked that are fast? I only found _mm256_permutevar8x32_ps but latency is 3 two, and I need to initialized the indexing register. Or will this still be faster?

Or are there faster options for the horizontal sum and broadcasting the last element.

The next thing: Integralimages only have adds no mul. Might it be that Haswell has only one executions port for vector adds? Can I help here with FMA (setting multiplication factor two 1) and thus having c = c + a * 1.

SSE Code:

template <int N1, int N2, int N3, int N4>
inline __m128 fastshuffle(__m128 v) {
	return _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), _MM_SHUFFLE(N1, N2, N3, N4)));
}

inline __m128 scan_SSE(__m128 x) {
	x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
	x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
	return x;
}

inline __m128 broadcastLast(__m128 v) {
	return fastshuffle<3, 3, 3, 3>(v);
}

AVX Code:

template <int N1, int N2, int N3, int N4>
inline __m256 fastshuffle(__m256 v) {
	return _mm256_shuffle_ps(v, _MM_SHUFFLE(N1, N2, N3, N4));
}

inline __m256 scan_AVX(__m256 x) {
	__m256 t0, t1;
	//shift1_AVX + add
	t0 = fastshuffle<2, 1, 0, 3>(x);
	x = _mm256_add_ps(x, _mm256_blend_ps(t0, t1, 0x11));
	t1 = _mm256_permute2f128_ps(t0, t0, 41);
	//shift2_AVX + add
	t0 = fastshuffle<1, 0, 3, 2>(x);
	t1 = _mm256_permute2f128_ps(t0, t0, 41);
	x = _mm256_add_ps(x, _mm256_blend_ps(t0, t1, 0x33));
	//shift3_AVX + add
	x = _mm256_add_ps(x, _mm256_permute2f128_ps(x, x, 41));
	return x;
}

inline __m256 broadcastLast(__m256 v) {
	/*__m256 t0 = _mm256_permute_ps(v, _MM_SHUFFLE(3, 3, 3, 3));
	__m128 t1 = _mm256_extractf128_ps(t0, 1);
	__m256 t2 = _mm256_castsi256_ps(_mm256_broadcastsi128_si256(_mm_castps_si128(t1)));
	return t2;*/
	__m256 t0 = _mm256_permute2f128_ps(v, v, 0x11);
	__m256 t1 = fastshuffle<3, 3, 3, 3>(t0);
	return t1;
}

Christian_M_2 · ‎04-27-2015

Christian M. wrote:

AVX is only 60% speed of SSE so much slower.

It has been a long day and I had a mistake in my time meassure (basically I did not reset time meassure after SSE version).

AVX is 10% faster than SSE. That's something, but if I think of the Haswell latency I think it could be better. I will test it on a Sandy Bridge CPU.

Maybe there are still parts in my code that can be optimized.

McCalpinJohn · ‎04-27-2015

You are correct that Haswell has only one execution port for vector addition, but two ports can be used for either vector multiplication or vector FMA. As you also noted, you can manually "upgrade" a vector add to a vector FMA to get use of the extra functional unit. The penalty is an increase in the dependent operation latency from 3 cycles to 5 cycles. It should be easy enough to test in your code to see if the extra (potential) throughput provides a bigger gain than the extra (potential) latency.

Vladimir_Sedach · ‎04-28-2015

Christian,

The fastest horizontal add with AVX in my experiments is:

__m128 hadd(__m256 x)
{
__m256 r;
#if 1
  r = _mm256_dp_ps(x, _mm256_set1_ps(1.0f), 0xF1);
#else //or maybe:
   r = _mm256_hadd_ps(x, x);
   r = _mm256_hadd_ps(r, r);
#endif
   return _mm_add_ss(_mm256_castps256_ps128(r), _mm256_extractf128_ps(r, 1));
}
It returns the sum in the first element (It can be broadcasted with one AVX2 instruction).

_mm256_permutevar8x32_ps() is perhaps the best way to broadcast the last element
because the index register would be loaded outside of the loop.

Christian_M_2 · ‎04-28-2015

Hello John and Vladimir,

thanks for your answers! I will try it out and report later.

I read in the microarchiteture manual from http://www.agner.org/optimize/, which quite often is mentioned here in the forum.

Quote from page 140: "The 256-bit data path that is used for all operations on YMM registers is divided into two
lanes of 128 bits each. All instructions that can move data between these two lanes have a
latency of 3 clock cycles, while other move instructions have a latency of only one clock
cycle."

So there is no way to do crosslane data moves fast anymore, like previous generations. Secondly, only port 5 can handly all kind of permutes and has indeed latency 3 if crosslane comes into play. Whereas Sandy and Ivy had two shuffle units (only one 256bit wide, the other 128 bit wide) but all kind of shuffles including crosslane had latency 1. Well, on page 129 of same document you find that Sandy and Ivy had latency 2 for cross lane because of 1 cycle data bypass delay.

Sorry, but I can not understand why the did this step back.

Only shift operations are on a different execution port and have latency 1. So I will try to use shifts and inlane permutes as the could go parallel, I suppose.

My last idea, for a fast broadcast: What about _mm256_set1_ps? I can not find a latency information, so I assume it is not a signle instruction. _mm256_broadcast_ss would be fast, but need a memory address. Do you think a "dummy" write and a broadcast from same address would be fast?

I will test the _mm256_permutevar8x32_ps.

McCalpinJohn · ‎04-28-2015

If the original data is coming from memory it is often faster to reload it with the desired offsets than to wait for port 5 to be available for a shuffle operation. A detailed example is provided in https://software.intel.com/sites/default/files/m/d/4/1/d/8/UsingIntelAVXToImplementIDCT-r1_5.pdf

If the data is coming from registers then going through memory is less likely to help, but it depends on the relative sizes and alignments of the store and the subsequent re-load. For Sandy Bridge this is discussed in the Intel Optimization Manual, but I did not see any updates for Haswell.

Christian_M_2 · ‎04-28-2015

Hello John,

the strided operation might help me, I am checking this out just now, what gets easier and what more complicated. Yes, I agree, for Haswell you find not much optimization information but in my mind changes are far heavy enough, making things different.

@Vladimir, unfortunately, I need all parts of the sum. So x0 must be x0, x1 must be x0+x1, x2: x0+x1+x2. I now have the following solution (not tested yet):

inline __m256 scan_from_left_AVX(__m256 x) {
    __m256 t0 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 4));
    __m256 t1 = _mm256_add_ps(x, t0);
    __m256 t2 = _mm256_shuffle_ps(_mm256_setzero_ps(), t1, 0x40);
    __m256 t3 = _mm256_add_ps(t1, t2);
    // now cross lane work
    __m256 t4 = _mm256_setzero_ps();
    t4 = _mm256_permute2f128_ps(t4, t3, 0x20);
    t4 = _mm256_shuffle_ps(t4, t4, 0xFF);
    __m256 t5 = _mm256_add_ps(t4, t3);
    return t5;
}

There is only one expensive shuffle operation, the permute2f128. So this should be fast for Sandy/Ivy and Haswell. There are less instructions to may initial version. For Haswell I need to check, whether using _mm256_permutevar8x32_ps is faster than my setzero, permute2f128 and shuffle. If the setzero can be optimized away, as there is still a register containing zeros, it makes 4 cycles. mm256_permutevar8x32 would take 3 cycles + initially setting the index register. But as I use a lot of register already (at least, I suppose, this is done by the compiler with intrinsics), I am not sure, whether the index register would be kept. Or could the CPU put it on stack and prefetch in time? Than I would get a further advantage.

// EDIT: Checked out the docs for _mm256_permute2f128_ps and it has a zeroing feature, so I can omit the setzero anyway.

And if I have counted latency correctly (and dependency) should be 2-3 cycles faster than yours. Yours seems great, but _mm256_dp_ps has a very high latency of 14 (and only 5 are hidden by the first hadd). And the second hadd depends on it and has 5 latency, too.

Christian_M_2 · ‎04-29-2015

One more question,

VINSERTI128 and some other instructions can be executed on p015 p23, if one operand is from memory. So this would reduce port 5 pressure too. But intrinsics show only register version. Will the compiler choose a memory version if a write directly a load conmand for one register parameter of the intrinsic? Hase someone experience here? BTW, I use VS 2013 64bit. Intel compiler would be an option (as I could use the free student version for my research).

McCalpinJohn · ‎04-29-2015

The compiler will often generate code that does the same thing as the intrinsics request, but in a very different way. You should definitely get in the habit of looking at the generated assembly code to ensure that it is close to what you expected.

Two examples of what I consider very large changes:

The compiler will often combine AVX load intrinsics with subsequent arithmetic instructions (creating an arithmetic instruction with a memory operand), even if it would be a lot better to do the load and save the value in a register for re-use.
The compiler will even change VFMADD intrinsics into separate VADD and VMUL instructions if it thinks it might be faster.

TimP · ‎04-29-2015

John D. McCalpin wrote:

The compiler will often generate code that does the same thing as the intrinsics request, but in a very different way. You should definitely get in the habit of looking at the generated assembly code to ensure that it is close to what you expected.

Two examples of what I consider very large changes:

The compiler will often combine AVX load intrinsics with subsequent arithmetic instructions (creating an arithmetic instruction with a memory operand), even if it would be a lot better to do the load and save the value in a register for re-use.

The compiler will even change VFMADD intrinsics into separate VADD and VMUL instructions if it thinks it might be faster.

I've been perplexed by the occasional expansion of fma into instructions which can run on early AVX CPUs. It might seem reasonable to do that when the compiler is asked not to generate AVX2 code, but then John produced a case where intrinsics produce apparently better AVX2 code under AVX than under AVX2 setting. This seems to augment the number of reasons why it's necessary to check intrinsics code with all available compilers as well as a range of compile switches.

gcc is more likely to unroll loops including intrinsics (when so suggested), so there is a case where another compiler doesn't take the intrinsics literally.

Christian_M_2 · ‎04-30-2015

Wow, especially exchanging FMA is something the compiler should not do.

I mean FMA or a sequence of mul and add can give different results. I only heard that compiler may never automatically choose FMA for seperate mul and add because just of this. So it's weird it does it the other way round.

I am going to check assembly code. I just realized I can easily do this, as I use the architecture analyser anyway, which gives the assembly anyway.

McCalpinJohn · ‎04-30-2015

My horror story of the compiler completely rewriting my AVX2 intrinsics code is discussed in the forum thread at https://software.intel.com/en-us/forums/topic/545040

I have not had a chance to get back to work on this code, but it is pretty clear that I will have to rewrite this in inline assembly code if I want the compiler pay more attention to the way I have specified the operations. I use inline assembly code all the time for short instruction sequences (e.g., RDPMC, RDTSCP, etc), but have no experience with inline assembly code for large blocks -- so I don't know whether the compiler will get this confused as well. Implementing the entire function in assembly language would be very unpleasant because there are at least 5 nested loops, and keeping track of all of this in assembly code is not fun.

When the compiler generates multiple versions of various loops, I find it convenient to use VTune to profile the code and drill down to the specific assembly-language blocks that are actually chosen for execution at run-time.

Vladimir_Sedach · ‎04-30-2015

Hi Christian,

"#if 1" AVX2 version of sum_all() is 10-15% (15% with VC / ICC, 10% - GCC) faster than the #else one.
Another thought is you can make same calculation in the opposite order so that the last element becomes the low order one
and it's faster to broadcast.
Would you please check this with your program?

#ifdef __GNUC__
volatile
#endif
__m256i idx = _mm256_setr_epi32(0, 0, 0, 0, 3, 3, 3, 3);

__inline __m256 sum_all(__m256 x)
{
__m256 r0, r1;
__m256 zero = _mm256_setzero_ps();

   r0 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 4));
   r0 = _mm256_add_ps(x, r0);
   r1 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(r0), 8));
   r0 = _mm256_add_ps(r0, r1);

#if 1 //fast
   r1 = _mm256_permutevar8x32_ps(r0, idx);
   r1 = _mm256_blend_ps(r1, zero, 0x0F);
   r0 = _mm256_add_ps(r0, r1);
#else
   r1 = _mm256_shuffle_ps(r0, r0, 0xFF);
   zero = _mm256_insertf128_ps(zero, _mm256_castps256_ps128(r1), 1);
   r0 = _mm256_add_ps(r0, zero);
#endif

return r0;
}

Christian_M_2 · ‎05-01-2015

Hello John,

I saw your post is a longer story, but I think I read it completely, seems interesting. I totally agress, inline assembly is only good for short pieces, but nested loop handling, should be handled by compiler. So it is annoying that intrinsics sometimes don't get you what you want.

I experienced this also a little bit now. My integral image code needs row min and max, but min/max go to some port. So I thought I add one min after calculation and max at beginning of iteration plus once after row finish, this way I wanted to remove port conflict. But although there is the whole integral image code inbetweend, compiler just placed min and max side by side and Architecture Code Analyzer tells my a ressource conflict and delay. But I might try reloading the stored value, maybe this helps me.

Hello Vladimir,

I have been doing a lot of tests. Here are the results including your version. I have compile with VS2013, Code Generation set either to /arch:AVX or /arch:AVX2. Then I have analyzed the function for prefixsum alone with Intel Architecture Code Analyzer. First the code, then generated assembly and blockthroughput and latency.

---------------------------------
           V1
---------------------------------
inline __m256 scan_from_left_AVX(__m256 x) {
   __m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(2, 1, 0, 0));
   t0 = _mm256_blend_ps(t0, _mm256_setzero_ps(), 0x11);
   __m256 t1 = _mm256_add_ps(x, t0);
   __m256 t2 = _mm256_shuffle_ps(_mm256_setzero_ps(), t1, 0x40);
   __m256 t3 = _mm256_add_ps(t1, t2);
   // now cross lane work
   // TODO: check whether the flag is correct so zero out works
   __m256 t4 = _mm256_permute2f128_ps(t3, t3, 0x28);
   t4 = _mm256_shuffle_ps(t4, t4, 0xFF);
   __m256 t5 = _mm256_add_ps(t4, t3);
   return t5;
}

vxorps ymm0, ymm0, ymm0
vmovups ymm5, ymm0
vpermilps ymm1, ymm3, 0x90
vblendps ymm2, ymm1, ymm5, 0x11
vaddps ymm3, ymm2, ymm3
vshufps ymm1, ymm0, ymm3, 0x40
vaddps ymm2, ymm1, ymm3
vmovups ymm5, ymm2
vperm2f128 ymm1, ymm2, ymm2, 0x28
vshufps ymm2, ymm1, ymm1, 0xff
vaddps ymm1, ymm2, ymm5

Code Generation: /arch:AVX

Ivy Bridge
Block Throughput: 6.00 Cycles
Latency: 15 Cycles

Haswell
Block Throughput: 7.00 Cycles
Latency: 17 Cycles

---------------------------------
           V2
---------------------------------
inline __m256 scan_from_left_AVX2(__m256 x) {
   __m256 t0 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 4));
   __m256 t1 = _mm256_add_ps(x, t0);
   __m256 t2 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(t1), 8));
   __m256 t3 = _mm256_add_ps(t1, t2);
   // now cross lane work
   // TODO: check whether the flag is correct so zero out works
   __m256 t4 = _mm256_permute2f128_ps(t3, t3, 0x28);
   t4 = _mm256_shuffle_ps(t4, t4, 0xFF);
   __m256 t5 = _mm256_add_ps(t4, t3);
   return t5;
}

vpslldq ymm1, ymm0, 0x4
vaddps ymm1, ymm1, ymm0
vpslldq ymm0, ymm1, 0x8
vaddps ymm0, ymm0, ymm1
vmovups ymm4, ymm0
vperm2f128 ymm1, ymm0, ymm0, 0x28
vshufps ymm0, ymm1, ymm1, 0xff
vaddps ymm1, ymm0, ymm4

Code Generation: /arch:AVX2

Haswell
Block Throughput: 12.00 Cycles
Latency: 15 Cycles

---------------------------------
V3
---------------------------------
__m256i idx = _mm256_setr_epi32(0, 0, 0, 0, 3, 3, 3, 3);

__inline __m256 sum_all(__m256 x)
{
__m256 r0, r1;
__m256 zero = _mm256_setzero_ps();

   r0 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 4));
   r0 = _mm256_add_ps(x, r0);
   r1 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(r0), 8));
   r0 = _mm256_add_ps(r0, r1);

   r1 = _mm256_permutevar8x32_ps(r0, idx);
   r1 = _mm256_blend_ps(r1, zero, 0x0F);
   r0 = _mm256_add_ps(r0, r1);

return r0;
}

vxorps ymm5, ymm0, ymm0
vpslldq ymm1, ymm0, 0x4
vaddps ymm0, ymm1, ymm0
vpslldq ymm2, ymm0, 0x8
vaddps ymm0, ymm2, ymm0
vmovups ymm4, ymm0
vmovdqu ymm1, ymmword ptr [rip+0x2b3126]
vpermps ymm2, ymm1, ymm0
vblendps ymm0, ymm2, ymm5, 0xf
vaddps ymm1, ymm0, ymm4

Code Generation: /arch:AVX2

Haswell
Block Throughput: 12.00 Cycles
Latency: 15 Cycles

OK, that's theory. Second test, was integration of the function in my integral image and doing hundres of calulations of image parts meassuring runtime. Here, things are different. Tests were done in Haswell Refresh CPU as stated in initial post. Tests run fime times, and results always similiar. To my suprise, V3 is the fastest, anyway V3 and V2 are fast then V1. So I suppose latency is here more important then block throughput at least when integrated in whole integral image computation.

AVX Code (V1): 1.795 (sec)

AVX Code (V2): 1.760 (sec)

AVX Code (V3): 1.748 (sec)

Another thought is you can make same calculation in the opposite order so that the last element becomes the low order one
and it's faster to broadcast.

Thats seems very interesting. BTW, I try to replace all broadcasting based on shuffle, vinsert etc by _mm256_broadcast_ss ore _mm256_broadcast_ps. As I always broadcast values that have been written in the last iteration. Hopefully this is faster.

For the sum, any ideas about reducing dependency in the code above? I mean two or three independent adds and then a shuffle or blend and only one or two last steps depending on the independent previous ones, would perform better in terms of latency.

Vladimir_Sedach · ‎05-01-2015

Hi Christian,

_mm256_broadcast_ss() works with memory only.
AVX2 has _mm256_broadcastss_ps() that copies from register and is more suitable for you.

I tested sum_all() function with a loop that reads input and writes the result from/to a single location.
If you're handling a big memory chunk, results are quite different and more dependent on memory access and cache issues.

I'm never analyzing throughput/latency, just CPU cycles (RDTSC instruction) or time in seconds of a code section or function.

I experimented with a sum version that transposes 4 256-bit vars and then adds them vertically and transposes back to same 4 vars.
Unfortunately, this turned out to be way slower than handling them independently as you do (.

Vladimir_Sedach · ‎05-03-2015

Hello Christian,

Christian M. wrote:

For the sum, any ideas about reducing dependency in the code above? I mean two or three independent adds and then a shuffle or blend and only one or two last steps depending on the independent previous ones, would perform better in terms of latency.

You can "reduce dependency" by handling the elements of two or more rows (the data of the previous row is already in registers) and more than one 256-bit element in a single row on each loop step.
This way you would decrease the memory access that takes perhaps most of the time.

Christian_M_2 · ‎05-04-2015

Hello Vladimir,

that sounds like a great idea! I post back the results.

Anyway, is there anything else, I should take care of? I always thought intel CPUs are decoding several instructions per clock, or at least under some circumstances. But, in the Architecture Code Analyzer I never got could, where two different ports start an operation the same time.

BTW, could you post the code being based on transposing 4x4 matrix. Or is it a lot slower? I have an application where I need row wise min and maximum of integral image. Currently, I compare regwise several elements of a row and at the end of each row have to do some horizontal compares, meaning this is not ideal. But in the transposed way, I would have one reg containing the maximums of 4 rows. The same for the min. If it it not that much slower, in combination with the min/max it might be equally or faster speed.

Vladimir_Sedach · ‎05-05-2015

Hi Christian,

The code for a partial transpose:

__v256 v0, v1, v2, v3;
__v256 r0, r1, r2, r3;

r0 = _mm256_unpacklo_ps(v0, v1);
r1 = _mm256_unpacklo_ps(v2, v3);
r2 = _mm256_unpackhi_ps(v0, v1);
r3 = _mm256_unpackhi_ps(v2, v3);

v0 = _mm256_shuffle_ps(r0, r1, _MM_SHUFFLE(1, 0, 1, 0));
v1 = _mm256_shuffle_ps(r0, r1, _MM_SHUFFLE(3, 2, 3, 2));
v2 = _mm256_shuffle_ps(r2, r3, _MM_SHUFFLE(1, 0, 1, 0));
v3 = _mm256_shuffle_ps(r2, r3, _MM_SHUFFLE(3, 2, 3, 2));

Its result:
00000000 00000010 00000020 00000030 00000004 00000014 00000024 00000034
00000001 00000011 00000021 00000031 00000005 00000015 00000025 00000035
00000002 00000012 00000022 00000032 00000006 00000016 00000026 00000036
00000003 00000013 00000023 00000033 00000007 00000017 00000027 00000037

Where the 1st digit is the vector index (0..3), the 2nd (0..7) is the element index.
Same code transposes 128-bit vars if you use _mm_ instead of _mm256_.

I don't believe it would be faster -- you need to do that on each step instead of doing it only once at the end
of each row with horizontal min().