Intel® ISA Extensions

How to extract DWORD from upper half of 256-bit register?

levicki

Congratulations to Intel's CPU instruction set engineers for managing to make YET ANOTHER non-orthogonal instruction set extension -- why were PEXTRD/PINSRD (among many others) not promoted to 256 bits in AVX2?

Any ideas/tricks to work around this engineering "oversight"?
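For reference, the usual two-instruction workaround is to pull the upper 128-bit lane out with VEXTRACTI128 and then use the existing 128-bit PEXTRD on it; a minimal sketch with intrinsics (assuming AVX2 plus SSE4.1, extracting dword 0 of the upper lane):

[cpp]
#include <immintrin.h>
#include <stdint.h>

// Sketch: VEXTRACTI128 to get the upper 128-bit lane, then PEXTRD on it.
// The lane index and the dword index must be compile-time constants.
static inline uint32_t extract_upper_dword0(__m256i v)
{
    __m128i hi = _mm256_extracti128_si256(v, 1);   // upper 128 bits of the YMM register
    return (uint32_t)_mm_extract_epi32(hi, 0);     // dword 0 of that upper lane
}
[/cpp]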

levicki

The action in the if/else block directly depends on the source data, so I don't see how that could be moved out of the loop. You need to do a test for every bit in the source to decide what to write to the destination.

The only other thing I could think of would be to use a lookup table of 256 x 24 bytes (6,144 bytes in size), fetch a byte from the source, and memcpy() the corresponding row from the table. To do that, you would have to precompute the table because the background and foreground colors can differ each time, so the speedup might be noticeable only for large pictures. You could also try a smaller table (16 x 12 bytes) which would be faster to precompute, and split the source byte into two 4-bit nibbles for lookup, but copying from such a table would be less efficient (dwords instead of qwords). On the other hand, it would compete less for L1 cache bandwidth. Without testing it is impossible to say which one would be faster.
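A rough sketch of that table approach (hypothetical helper names; it assumes 24-bit BGR output and that the most significant bit of each source byte is the leftmost pixel):

[cpp]
#include <stdint.h>
#include <string.h>

// 256 rows of 24 bytes: each source byte (8 pixels at 1 bpp) maps to 8 x 3-byte pixels.
static uint8_t row_table[256][24];

// Rebuild the table whenever the foreground/background colors change.
static void build_row_table(const uint8_t fg[3], const uint8_t bg[3])
{
    for (int v = 0; v < 256; ++v)
        for (int bit = 0; bit < 8; ++bit) {
            const uint8_t *c = (v & (0x80 >> bit)) ? fg : bg;   // MSB = leftmost pixel (assumed)
            memcpy(&row_table[v][bit * 3], c, 3);
        }
}

// Expand nbytes of 1-bpp source into 24-bpp output, one 24-byte copy per source byte.
static void expand_scanline(uint8_t *dst, const uint8_t *src, int nbytes)
{
    for (int i = 0; i < nbytes; ++i, dst += 24)
        memcpy(dst, row_table[src[i]], 24);
}
[/cpp]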

For the monochrome bitmap of 4,593 x 6,000 pixels I am getting ~18 ms for the AVX2 code I wrote, which is somewhere around 1460 MB/sec or 11,680 MPixels/sec.

Bernard

Sorry, I misunderstood the code.

bronxzv

Igor Levicki wrote:

When I said 50% I actually meant 50% shorter execution time, which would translate into a 2x speedup. Sorry for the confusion.

Attached is the code with a simple test driver. My results are:

     test_C : 6345.035 ms
test_SSE4.1 : 3944.771 ms
  test_AVX2 : 2190.420 ms

The difference is 1.80x here too, but it gets smaller (1.51x) if you change the pragma for the SSE4.1 function and let the compiler generate 128-bit SSE code.

After a rapid check of your AVX2 code path, it appears that you are effectively using only 75% of the YMM register width (24 bytes out of 32), so 25% of the computations (the ones for the unused upper 8 bytes) are pure waste; this is a classical case of partial vectorization.

for better speedups you'll have to process more pixels in parallel, I'm not sure if it's possible in your case, though

just another remark, the operation below is useless:
blend_mask = _mm256_and_si256(blend_mask, sign_mask);

BLENDVx instructions use only the MSB of the mask elements so clearing the lower bits isn't required
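To illustrate the point (a standalone sketch, not taken from the attached code): _mm256_blendv_epi8 selects each byte based only on the most significant bit of the corresponding mask byte, so pre-masking changes nothing:

[cpp]
#include <immintrin.h>

// VPBLENDVB looks only at bit 7 of each mask byte when choosing between a and b.
static inline __m256i blend_bytes(__m256i a, __m256i b, __m256i blend_mask)
{
    // Equivalent with or without this line:
    // blend_mask = _mm256_and_si256(blend_mask, _mm256_set1_epi8((char)0x80));
    return _mm256_blendv_epi8(a, b, blend_mask);   // per-byte select on the MSB
}
[/cpp]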

Bernard

>>>The only other thing I could think of would be to use a lookup table of 256 x 24 bytes (6,144 bytes in size), fetch a byte from the source, and memcpy() the corresponding row from the table>>>

memcpy() could be a potential bottleneck in this case. IIRC this function at the machine code level uses rep movsb(d). I was thinking about some kind of streaming load/store function (based on inline assembly) which could speed up the data transfer.

levicki

bronxzv wrote:
for better speedups you'll have to process more pixels in parallel, I'm not sure if it's possible in your case, though

I am afraid it is not, at least not efficiently.

bronxzv wrote:
just another remark, the operation below is useless:
blend_mask = _mm256_and_si256(blend_mask, sign_mask);

BLENDVx instructions use only the MSB of the mask elements so clearing the lower bits isn't required

Yes, I am aware of that. I did that while I was visualizing data flow. It can be removed now.

levicki

iliyapolak wrote:
memcpy() could be a potential bottleneck in this case. IIRC this function at the machine code level uses rep movsb(d). I was thinking about some kind of streaming load/store function (based on inline assembly) which could speed up the data transfer.

memcpy() is long past simple rep movsb in every compiler -- it is replaced by an optimal sequence of instructions and inlined for short copies.

Bernard

@Igor 

Do you think that memcpy() could be implemented with the help of streaming store instructions and perhaps loop unrolling? Did you disassemble the memcpy() function?

bronxzv

Igor Levicki wrote:

iliyapolak wrote: memcpy() could be a potential bottleneck in this case. IIRC this function at the machine code level uses rep movsb(d). I was thinking about some kind of streaming load/store function (based on inline assembly) which could speed up the data transfer.

memcpy() is long past simple rep movsb in every compiler -- it is replaced by an optimal sequence of instructions and inlined for short copies.

Actually, REP MOVSB is again a sensible choice for memset/memcpy, at least for some dataset sizes (>= 128 bytes), since it is optimized for best throughput on Ivy Bridge and Haswell, with more compact code than unrolled sequences of 16-byte moves but similar speed. That's what the IA optimization reference manual calls ERMSB; have a look at pages 3-65 to 3-68 of the June 2013 edition.
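For reference, a minimal sketch of getting REP MOVSB from C code (assuming the Microsoft or Intel compiler on Windows, which expose it as the __movsb intrinsic in <intrin.h>):

[cpp]
#include <intrin.h>
#include <stddef.h>

// Copy nbytes with REP MOVSB so that ERMSB-capable CPUs (Ivy Bridge and later)
// can use their fast string-move microcode.
static void copy_rep_movsb(void *dst, const void *src, size_t nbytes)
{
    __movsb((unsigned char *)dst, (const unsigned char *)src, nbytes);
}
[/cpp]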

Bernard

@bronxzv

So rep movsb(d) can be used for specific dataset sizes? Strange, because the unrolled streaming version seems to be faster, but at the cost of more machine code to execute.

bronxzv

iliyapolak wrote:

@bronxzv

So rep movsb(d) can be used for specific dataset sizes? Strange, because the unrolled streaming version seems to be faster, but at the cost of more machine code to execute.

Note that streaming stores are much slower than regular (cached) stores when your working set fits in the LLC, or even worse if you can work with L2 cache blocking: with streaming stores you force slow and power-hungry memory transactions that should not occur with temporal data used for auxiliary intermediate results, as is typical with multi-pass algorithms and cache blocking. Btw, it will be interesting to see how the L4 cache in Iris Pro deals with streaming stores (do streaming stores bypass the L4 as well or not?)

Please refer to the optimization manual for advice about REP MOVSB (for which dataset sizes its usage is sensible, etc.), since I have no practical experience with it on IVB/HSW; exactly like Igor, I thought it was something of the distant past until it was resurrected in IVB and I read about it in the guide only a few weeks ago.

Bernard

@bronxzv

Thanks for the reply. Initially I was confused by the seemingly much larger memory bandwidth that could go through the load/store ports when combined with prefetching and loop unrolling, but I see that this is not the case. Btw, my understanding is that streaming stores can be used for large movements of non-temporal data.

levicki

From my own experience, streaming stores can be used to avoid cache pollution, so they are only useful if the data will not be consumed immediately. Before consuming the data you need a memory fence to make sure all outstanding writes have completed. Also, keep in mind that when using intrinsics the compiler may reorder your reads and writes unless you use a compiler barrier as well. Streaming stores compete for write-combining buffers (important when considering threading on logical cores), and they will not work as intended if you do not write out a full cache line of data at a time, because they rely on write-combining buffers.

Overall, they are useful in a very limited number of scenarios and must be used with great care. That is all from memory (I worked with them a long time ago), so it might not be 100% accurate, or it might no longer be true for the latest CPUs.
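To make the "full cache line at a time" and fencing points concrete, a minimal sketch (not from the attached code; dst is assumed 32-byte aligned):

[cpp]
#include <immintrin.h>
#include <stdint.h>

// Write one full 64-byte cache line with non-temporal stores so the
// write-combining buffer can be flushed as a complete line.
static inline void store_line_nt(uint8_t *dst, __m256i lo, __m256i hi)
{
    _mm256_stream_si256((__m256i *)dst,        lo);   // bytes  0..31
    _mm256_stream_si256((__m256i *)(dst + 32), hi);   // bytes 32..63
}

// Call once after the last line is written and before the data is consumed.
static inline void publish_nt_stores(void)
{
    _mm_sfence();   // order the non-temporal stores ahead of subsequent reads
}
[/cpp]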

Now regarding memcpy() -- I couldn't get the 13.1 update 5 compiler to emit rep movsb/w/d for memcpy() and I was too lazy to write it in assembler (hey, it's Friday :)), but here are the results of using memcpy() to copy 24 bytes from a fixed source to a fixed destination:

// IA32            (mov) = 1990.800 ms
// SSE4.1        (movsd) = 1532.112 ms
// SSE4.2 (movups/movsd) = 1312.965 ms
// AVX2 (vmovdqu/vmovsd) = 1312.950 ms

In all cases the compiler inlined the memcpy() call and replaced it with 12x MOV, 6x MOVSD, 2x MOVUPS + 2x MOVSD, or 2x VMOVUPS + 2x VMOVSD. Code size (not counting the source byte fetch and one shift for the address calculation) was: MOV 53 bytes, MOVSD 38 bytes, MOVUPS+MOVSD 23 bytes, and VMOVUPS+VMOVSD 25 bytes.

From the perspective of code size, (V)MOVUPS+(V)MOVSD seems the most efficient, and it is also the fastest in the test. REP MOVSD might be even shorter, but I am not sure about the speed.
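For illustration, that copy pattern boils down to one 16-byte and one 8-byte unaligned move per direction; a sketch with intrinsics (hypothetical helper name):

[cpp]
#include <immintrin.h>
#include <stdint.h>

// Copy one 24-byte table row as (V)MOVUPS + (V)MOVSD: a 16-byte move plus an 8-byte move.
static inline void copy24(void *dst, const void *src)
{
    __m128i lo = _mm_loadu_si128((const __m128i *)src);                          // bytes 0..15
    __m128i hi = _mm_loadl_epi64((const __m128i *)((const uint8_t *)src + 16));  // bytes 16..23
    _mm_storeu_si128((__m128i *)dst, lo);
    _mm_storel_epi64((__m128i *)((uint8_t *)dst + 16), hi);
}
[/cpp]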

Finally, bear in mind that the quick test I did for this is not realistic: it uses a color table precalculated once in advance (not measured) and it always copies the same data, whereas in real life it would have to build the table every time (because of different foreground/background colors) and it would copy from different table rows depending on the source byte, not always from the same one. To make it realistic I would have to create a test driver with real-life data (i.e. a large monochrome BMP image).

SergeyKostrov
>>// IA32 (mov) = 1990.800 ms
>>// SSE4.1 (movsd) = 1532.112 ms
>>// SSE4.2 (movups/movsd) = 1312.965 ms
>>// AVX2 (vmovdqu/vmovsd) = 1312.950 ms

I tested memcpy some time ago and the overhead of calling the function could take away all of its advantages unless it is inlined, especially when you copy just 24 bytes. Also, my test results show that memcpy is faster than my FastMemCopy128 if the memory block is smaller than 128K ( 131072 bytes ). The 2nd function is based on:

...
for( j = i; j < ( i + iPageSize ); j += 32 )
{
    _mm_stream_ps( ( RTfloat * )( ( RTchar * )pDst + j ), _mm_load_ps( ( RTfloat * )( ( RTchar * )pSrc + j ) ) );
    _mm_stream_ps( ( RTfloat * )( ( RTchar * )pDst + j + 16 ), _mm_load_ps( ( RTfloat * )( ( RTchar * )pSrc + j + 16 ) ) );
}
...
levicki

Sergey,

I am pretty sure you could get more performance with streaming stores if you used MOVAPS to prefetch 2 KB of data into the L1 cache and then MOVNTPS to stream those 2 KB out to memory. That requires three loops: an outer loop going in 2 KB blocks, and two inner loops, one to prefetch and the other to stream. You need to write out a full cache line worth of data at a time in both inner loops. You can also play with the prefetch distance and different block sizes to see if more bandwidth can be squeezed out. It goes without saying that you should use VirtualAlloc() and VirtualLock() for the 4 KB page where you buffer the data for streaming, and that source and destination must be aligned.
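A rough sketch of that scheme with intrinsics (assumptions: hypothetical function name, src/dst 16-byte aligned, nbytes a multiple of 2 KB, and a local staging buffer standing in for the VirtualAlloc()/VirtualLock() page):

[cpp]
#include <xmmintrin.h>
#include <stddef.h>

static void blocked_stream_copy(float *dst, const float *src, size_t nbytes)
{
    const size_t block = 2048;                          // outer loop walks 2 KB blocks
    for (size_t i = 0; i < nbytes; i += block) {
        __m128 buf[128];                                // 2 KB staging buffer (stands in for the locked page)
        // inner loop 1: pull the block in with regular 16-byte loads
        for (size_t j = 0; j < block / 16; ++j)
            buf[j] = _mm_load_ps(src + i / 4 + j * 4);
        // inner loop 2: stream the block out to memory, bypassing the caches
        for (size_t j = 0; j < block / 16; ++j)
            _mm_stream_ps(dst + i / 4 + j * 4, buf[j]);
    }
    _mm_sfence();                                       // complete the non-temporal stores
}
[/cpp]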

McCalpinJohn

Streaming stores can be slower than ordinary stores even when the data is going to/from memory if you are only using one or two threads per chip.  This is because the performance is limited by the number of cache miss buffers that a single core can utilize, and the limiting value is often a small fraction of the peak bandwidth available to the chip.  With normal loads, the store misses can be prefetched into the L2 cache so that the Load Fill Buffers are occupied for a much shorter duration.  With streaming stores, each transaction holds onto a Load Fill Buffer until the line is transferred all the way to the memory controller, so fewer transactions can be performed per unit time.

Once you are using several threads, the extra read traffic associated with the store misses becomes the limiting factor and streaming stores become more efficient. 

On my Xeon E5-2670 systems, a single thread runs the STREAM benchmark quite a bit faster if I disable streaming stores.  (I can't find the numbers right now, but I think it was ~14 GB/s without streaming stores and ~10 GB/s with streaming stores.)    When using all cores the performance ratio tracks the ratio of total traffic, so the case with streaming stores gives ~38 GB/s (per chip), while the case without streaming stores is about 2/3 of that performance for the Copy and Scale kernels and 3/4 of that performance for the Add and Triad kernels.

bronxzv

Igor Levicki wrote:
I am afraid it is not, at least not efficiently.

sure, I've learned the hard way that it's far easier to spot this kind of problem than to fix it

Anyway, your SSE4.1 path has the same issue (75% useful computations), so it doesn't explain the disappointing SSE4.1 (VEX.128 AVX) to AVX2 scaling. It's maybe a problem with your test framework, with a function call overhead at each loop iteration (it's not "fair" to the fastest code path). I'm quite sure you'll have better timings (and a better SSE4.1 to AVX2 speedup) if you introduce a small inner loop calling your function with inlining, for example 100 iterations in the inner loop (workload entirely in the L1D cache, with random input data to be more real-world like) and 10 M iterations for the outer loop (in order to have 1G calls to the profiled function like the current version, and thus comparable timings).

if you test it again this way I'll be interested to hear about your findings
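For concreteness, a minimal sketch of such a harness (the kernel below is a hypothetical stand-in for test_SSE41 / test_AVX2):

[cpp]
#include <stdint.h>
#include <stdio.h>

// Hypothetical stand-in for the profiled kernel.
static inline uint32_t kernel(uint32_t x) { return x * 2654435761u; }

int main(void)
{
    uint32_t acc = 0;
    // 10 M outer iterations x 100 inlined inner calls = 1e9 kernel invocations,
    // comparable to the original harness but without a CALL/RET per invocation.
    for (int outer = 0; outer < 10000000; ++outer)
        for (int inner = 0; inner < 100; ++inner)
            acc += kernel(acc + (uint32_t)inner);
    printf("%u\n", acc);   // keep the result live so the loops are not optimized away
    return 0;
}
[/cpp]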

levicki

Well, I am testing it with no inlining because I wanted to eliminate variations in the generated code between the functions caused by global compiler optimizations -- this version tests exactly the "kernel" of each version, and since all of them pay the same penalty for CALL/RET it can be safely discarded. Furthermore, this code should already be keeping everything in L1D since it always reads the same source and writes the same destination address. Finally, there is no difference in calculation speed due to randomness of the input data since the input is the same all the time (the only time I would want random data is to measure the impact of cache misses on table lookups, which I don't have in this code).

Anyway, feel free to experiment with the code, and by all means let me know if you find an efficient way to pack the data and use the full register width. Just bear in mind that it wouldn't be the first time that the most obvious solution is also the fastest one when it comes to assembler optimization (at least that happens to me often) :)

bronxzv

Igor Levicki wrote:

 optimizations -- this version is exactly testing the "kernel" of each version and since all have the same penalty for CALL/RET it can be safely

That's exactly what I call being "unfair" to the fastest path: the more the tested function is optimized, the more this fixed function call overhead matters and biases the comparison (levelling down the speedup). Your AVX2 path looks very good, and the way you expand the 8-bit mask to a 256-bit mask is very clever and probably already the optimal solution; it would be a shame if its true potential didn't show up in the measurements.

bronxzv

Igor Levicki wrote:
Anyway, feel free to experiment with the code

I have just tested it (plugged your routines into a test framework of mine with a small inner loop as outlined above in this thread) to see the impact of the function call overhead; my findings are below:

NOINLINE [1]:
baseline 3698.9 ms optimized 2854.97 ms speedup = 1.296 x

INLINE [2]:
baseline 3539.13 ms optimized 2241.87 ms speedup = 1.579 x

The studied function is called 1e9 times, as in your example; "baseline" is your function test_SSE41 unchanged besides the INLINE/NOINLINE prefix, and "optimized" is your function test_AVX2 unchanged.

After removing the useless code in test_AVX2, i.e. "blend_mask = _mm256_and_si256(blend_mask, sign_mask);" and "_mm256_zeroupper();", I get a better speedup (INLINE case shown):

baseline 3528.59 ms optimized 2122.16 ms speedup = 1.663 x

 

configuration: Core i7 4770K @ 4 GHz, Intel C++ compiler v13.1.3.198 (64-bit)

[1] #define NOINLINE __declspec(noinline)

[2] #define INLINE __forceinline

levicki

John D. McCalpin wrote:
Streaming stores...

Hello and welcome to my humble thread :)

Regarding streaming stores, this is the fastest variant for me on Haswell:

[cpp]
// Copies nbytes (assumed to be a multiple of 64) with streaming stores.
// 32-bit inline assembly; src and dst are assumed 16-byte aligned.
void memcopy(void *dst, const void *src, size_t nbytes)
{
    __asm    {
        mov        esi, src
        mov        edi, dst
        mov        ecx, nbytes
        shr        ecx, 6                    ; number of 64-byte cache lines to copy
main_loop:
        test       ecx, ecx
        jz         main_loop_end
        prefetcht0 [esi + 64 * 30]            ; prefetch 30 cache lines ahead
        movaps     xmm0, [esi]                ; load one full cache line
        movaps     xmm1, [esi + 16]
        movaps     xmm2, [esi + 32]
        movaps     xmm3, [esi + 48]
        movntps    [edi], xmm0                ; stream it out, bypassing the caches
        movntps    [edi + 16], xmm1
        movntps    [edi + 32], xmm2
        movntps    [edi + 48], xmm3
        add        esi, 64
        add        edi, 64
        sub        ecx, 1
        jmp        main_loop
main_loop_end:
        sfence                                ; make the non-temporal stores globally visible
    }
}
[/cpp]

SergeyKostrov
>>...Regarding streaming stores, this is the fastest variant for me on Haswell...

Thanks Igor, I'll check whether it is also the fastest version for Ivy Bridge. By the way, that code is well known and I think the Intel Optimization Manual has a chapter on it.