Congratulations to Intel's CPU instruction set engineers for managing to make YET ANOTHER non-orthogonal instruction set extension -- why were PEXTRD/PINSRD (among many others) not promoted to 256 bits in AVX2?
Any ideas/tricks to work around this engineering "oversight"?
The action in the if/else block directly depends on the source data, so I don't see how it could be moved out of the loop. You need to test every bit of the source to decide what to write to the destination.
The only other thing I could think of would be to use a lookup table of 256 x 24 bytes (6,144 bytes in size): fetch a byte from the source and memcpy() the corresponding row from the table. You would have to precompute the table each time because the background and foreground colors can differ, so the speedup might be noticeable only for large pictures. You could also try a smaller table (16 x 12 bytes), which would be faster to precompute, and split the source byte into two 4-bit nibbles for lookup, but copying from such a table would be less efficient (dwords instead of qwords). On the other hand, it would compete less for L1 cache bandwidth. Without testing it is impossible to say which one would be faster.
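A minimal sketch of the 256 x 24-byte table idea described above (the names and the scalar expansion loop are illustrative, not code from this thread):

```c
#include <stdint.h>
#include <string.h>

/* 256 rows, one per possible source byte; each row holds 8 pixels
   of 3 bytes (24-bit color), so a row is 24 bytes = 6,144 total. */
static uint8_t g_lut[256][24];

/* rebuild the table whenever foreground/background colors change */
static void build_lut(const uint8_t fg[3], const uint8_t bg[3])
{
    for (int b = 0; b < 256; b++)
        for (int bit = 0; bit < 8; bit++) {
            const uint8_t *c = ((b >> (7 - bit)) & 1) ? fg : bg;
            memcpy(&g_lut[b][bit * 3], c, 3);
        }
}

/* expand nbytes of 1-bpp source into a 24-bpp destination */
static void expand(uint8_t *dst, const uint8_t *src, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++, dst += 24)
        memcpy(dst, g_lut[src[i]], 24);   /* one table row per byte */
}
```

The table rebuild is O(256 x 8) per color pair, which is why the speedup would only be visible on large images.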
For a monochrome bitmap of 4,593 x 6,000 pixels I am getting ~18 ms with the AVX2 code I wrote, which is somewhere around 1460 MB/sec or 11,680 MPixels/sec.
Sorry, I misunderstood the code.
Igor Levicki wrote:
When I said 50% I actually meant 50% shorter execution time, which would translate into a 2x speedup. Sorry for the confusion.
Attached is the code with simple test driver. My results are:
test_C : 6345.035 ms
test_SSE4.1 : 3944.771 ms
test_AVX2 : 2190.420 ms
The difference is 1.80x here too, but it gets smaller (1.51x) if you change the pragma for the SSE4.1 function and let the compiler generate 128-bit SSE code.
After a quick look at your AVX2 code path, it appears you are effectively using only 75% of the YMM register width (24 of 32 bytes), so 25% of the computations (the ones for the unused upper 8 bytes) are pure waste. This is a classical case of partial vectorization.
For better speedups you'll have to process more pixels in parallel; I'm not sure whether that's possible in your case, though.
Just another remark: the operation below is useless:
blend_mask = _mm256_and_si256(blend_mask, sign_mask);
BLENDVx instructions use only the MSB of each mask element, so clearing the lower bits isn't required.
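A scalar model of what (V)PBLENDVB does per byte makes this easy to verify (a sketch of the instruction's selection rule, not the intrinsic itself):

```c
#include <stdint.h>

/* Per-byte model of (V)PBLENDVB: the hardware consults only bit 7
   of each mask byte when choosing between the two sources, so
   ANDing the mask with a sign mask beforehand changes nothing. */
static uint8_t blendv_byte(uint8_t a, uint8_t b, uint8_t mask)
{
    return (mask & 0x80) ? b : a;   /* only the MSB is consulted */
}
```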
>>>The only other thing I could think of would be to use a lookup table of 256 x 24 bytes (6,144 bytes in size), fetch a byte from source, and memcpy() the corresponding row from the table>>>
memcpy() could be a potential bottleneck in this case. IIRC this function at the machine code level uses REP MOVSB(D). I was thinking about some kind of streaming load/store function (based on inline assembly) which could speed up the data transfers.
bronxzv wrote:
for better speedups you'll have to process more pixels in parallel, I'm not sure if it's possible in your case, though
I am afraid it is not, at least not efficiently.
bronxzv wrote:
just another remark, the operation below is useless:
blend_mask = _mm256_and_si256(blend_mask, sign_mask);
BLENDVx instructions use only the MSB of the mask elements so clearing the lower bits isn't required
Yes, I am aware of that. I did that while I was visualizing data flow. It can be removed now.
iliyapolak wrote:
memcpy() could be a potential bottleneck in this case. IIRC this function at the machine code level uses REP MOVSB(D). I was thinking about some kind of streaming load/store function (based on inline assembly) which could speed up the data transfers.
memcpy() is long past a simple REP MOVSB in every compiler -- it is replaced by an optimal sequence of instructions and inlined for short copies.
@Igor
Do you think that memcpy() could be implemented with the help of streaming store instructions and probably loop unrolling? Did you disassemble the memcpy() function?
Igor Levicki wrote:
memcpy() is long past simple rep movsb in every compiler -- it is replaced by optimal sequence of instructions and inlined for short copies.
Actually, REP MOVSB is again a sensible choice for memset/memcpy, at least for some dataset sizes (>= 128 bytes), since it is optimized for best throughput on Ivy Bridge and Haswell, with more compact code than unrolled sequences of 16-byte moves but similar speed. That's what they call ERMSB in the IA optimization reference manual; have a look at pages 3-65 to 3-68 of the June 2013 edition.
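For reference, a REP MOVSB copy can be sketched like this (GCC-style inline asm, x86 assumed; this is my rendering, not code from the thread):

```c
#include <stddef.h>

/* ERMSB-style copy: on IVB/HSW the microcoded REP MOVSB reaches
   throughput comparable to unrolled 16-byte moves for blocks of
   roughly 128 bytes and up, with far smaller code. */
static void rep_movsb_copy(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)  /* RDI, RSI, RCX */
                     :
                     : "memory");
#else
    /* portable fallback for non-x86 targets */
    char *d = (char *)dst;
    const char *s = (const char *)src;
    while (n--) *d++ = *s++;
#endif
}
```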
@bronxzv
So REP MOVSB(D) can be used for specific dataset sizes? Strange, because an unrolled streaming version seems to be faster, but at the cost of more machine code to be executed.
iliyapolak wrote:
@bronxzv
So REP MOVSB(D) can be used for specific dataset sizes? Strange, because an unrolled streaming version seems to be faster, but at the cost of more machine code to be executed.
Note that streaming stores are much slower than regular (cached) stores when your working set fits in the LLC, and even worse if you can work with L2 cache blocking. With streaming stores you force slow and power-hungry memory transactions that should not occur with temporal data used for auxiliary intermediate results, as is typical with multi-pass algorithms and cache blocking. Btw, it will be interesting to see how the L4 cache in Iris Pro deals with streaming stores (do they bypass the L4 or not?).
Please refer to the optimization manual for advice about REP MOVSB (for which dataset sizes its usage is sensible, etc.), since I have no practical experience with it on IVB/HSW. Exactly like Igor, I thought it was something of the distant past until it was resurrected in IVB and I read about it in the guide only a few weeks ago.
@bronxzv
Thanks for the reply. Initially I was confused by the seemingly much larger memory bandwidth that could go through the load/store ports combined with prefetching and loop unrolling, but I see that this is not the case. Btw, my understanding is that streaming stores can be used for moving large blocks of non-temporal data.
From my own experience, streaming stores can be used to avoid cache pollution, so they are only useful if the data will not be consumed immediately. Before consuming the data you need a memory fence to make sure all outstanding writes have completed. Also keep in mind that when using intrinsics the compiler may reorder your reads and writes unless you use a barrier as well. Streaming stores compete for write buffers (important when considering threading on logical cores), and they will not work as intended if you do not write out a full cache line of data at a time, because they use write-combining buffers.
Overall, they are useful in a very limited number of scenarios and must be used with great care. That is all from memory (I worked with them a long time ago), so it might not be 100% accurate or might not even be true for the latest CPUs.
Now regarding memcpy() -- I couldn't get the 13.1 update 5 compiler to emit REP MOVSB/W/D for memcpy() and I was too lazy to write it in assembler (hey, it's Friday :)), but here are the results of using memcpy() to copy 24 bytes from a fixed source to a fixed destination:
// IA32 (mov) = 1990.800 ms
// SSE4.1 (movsd) = 1532.112 ms
// SSE4.2 (movups/movsd) = 1312.965 ms
// AVX2 (vmovdqu/vmovsd) = 1312.950 ms
In all cases the compiler inlined the memcpy() call and replaced it with 12xMOV, 6xMOVSD, 2xMOVUPS+2xMOVSD, or 2xVMOVUPS+2xVMOVSD. Code size (not counting the source byte fetch and one shift for address calculation) was: MOV 53 bytes, MOVSD 38 bytes, MOVUPS+MOVSD 23 bytes, and VMOVUPS+VMOVSD 25 bytes.
From the perspective of code size, (V)MOVUPS+(V)MOVSD seems most efficient and it is also fastest in the test. REP MOVSD might be even shorter, but I am not sure about the speed.
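The 2xMOVUPS+2xMOVSD shape corresponds to a 16-byte plus 8-byte split, one load/store pair each. A hedged intrinsics sketch of the same idea (my rendering, not the compiler's exact output):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <string.h>

/* Copy 24 bytes as one unaligned 16-byte move (MOVUPS load/store)
   followed by one 8-byte move (MOVSD load/store). */
static void copy24(void *dst, const void *src)
{
    _mm_storeu_si128((__m128i *)dst,
                     _mm_loadu_si128((const __m128i *)src));
    uint64_t tail;
    memcpy(&tail, (const uint8_t *)src + 16, 8);  /* movsd load  */
    memcpy((uint8_t *)dst + 16, &tail, 8);        /* movsd store */
}
```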
Finally, bear in mind that this quick test is not realistic: it uses a color table precalculated once in advance (not measured), and it always copies the same data. In real life it would have to build the table every time (because of different foreground/background colors) and would copy from different table rows depending on the source byte, not always from the same one. To make it realistic I would have to create a test driver with real-life data (i.e. a large monochrome BMP image).
Sergey,
I am pretty sure you could get more performance with streaming stores if you use MOVAPS to prefetch 2KB of data into the L1 cache and then MOVNTPS to stream those 2KB out to memory. That requires three loops: an outer loop going in 2KB blocks, and two inner loops, one to prefetch and the other to stream. You need to write out a full cache line worth of data at a time in both inner loops. You can also play with the prefetch distance and different block sizes to see if more bandwidth can be squeezed out. It goes without saying that you should use VirtualAlloc() and VirtualLock() for the 4KB page where you will buffer the data for streaming, and that source and destination must be aligned.
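The three-loop scheme above can be sketched with intrinsics roughly like this (a sketch under the stated assumptions: 16-byte aligned buffers and a size that is a multiple of the 2KB block; the function and macro names are mine):

```c
#include <xmmintrin.h>  /* SSE: _mm_prefetch, _mm_stream_ps, _mm_sfence */
#include <stddef.h>

#define BLOCK 2048      /* 2 KB staging block, as described above */

/* Outer loop walks 2 KB blocks; the first inner loop pulls the
   block toward L1, the second streams it out with non-temporal
   stores, one full 64-byte cache line (4 x 16 bytes) at a time. */
static void block_stream_copy(float *dst, const float *src, size_t n_floats)
{
    const size_t bf = BLOCK / sizeof(float);      /* floats per block */
    for (size_t base = 0; base < n_floats; base += bf) {
        /* pass 1: prefetch the block, one cache line (16 floats) per step */
        for (size_t i = 0; i < bf; i += 16)
            _mm_prefetch((const char *)&src[base + i], _MM_HINT_T0);
        /* pass 2: stream full cache lines to memory */
        for (size_t i = 0; i < bf; i += 16) {
            _mm_stream_ps(&dst[base + i +  0], _mm_load_ps(&src[base + i +  0]));
            _mm_stream_ps(&dst[base + i +  4], _mm_load_ps(&src[base + i +  4]));
            _mm_stream_ps(&dst[base + i +  8], _mm_load_ps(&src[base + i +  8]));
            _mm_stream_ps(&dst[base + i + 12], _mm_load_ps(&src[base + i + 12]));
        }
    }
    _mm_sfence();   /* make the streamed stores globally visible */
}
```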
Streaming stores can be slower than ordinary stores even when the data is going to/from memory if you are only using one or two threads per chip. This is because the performance is limited by the number of cache miss buffers that a single core can utilize, and the limiting value is often a small fraction of the peak bandwidth available to the chip. With normal loads, the store misses can be prefetched into the L2 cache so that the Load Fill Buffers are occupied for a much shorter duration. With streaming stores, each transaction holds onto a Load Fill Buffer until the line is transferred all the way to the memory controller, so fewer transactions can be performed per unit time.
Once you are using several threads, the extra read traffic associated with the store misses becomes the limiting factor and streaming stores become more efficient.
On my Xeon E5-2670 systems, a single thread runs the STREAM benchmark quite a bit faster if I disable streaming stores. (I can't find the numbers right now, but I think it was ~14 GB/s without streaming stores and ~10 GB/s with streaming stores.) When using all cores the performance ratio tracks the ratio of total traffic, so the case with streaming stores gives ~38 GB/s (per chip), while the case without streaming stores is about 2/3 of that performance for the Copy and Scale kernels and 3/4 of that performance for the Add and Triad kernels.
Igor Levicki wrote:
I am afraid it is not, at least not efficiently.
Sure, I've learned the hard way that it's far easier to spot this kind of problem than to fix it.
Anyway, your SSE4.1 path has the same issue (75% useful computations), so it doesn't explain the disappointing SSE4.1 (VEX.128 AVX) to AVX2 scaling. It may be a problem with your test framework: the function call overhead at each loop iteration is not "fair" to the fastest code path. I'm quite sure you'll get better timings (and a better SSE4.1 to AVX2 speedup) if you introduce a small inner loop calling your function with inlining -- for example 100 iterations in the inner loop (workload entirely in the L1D cache, with random input data for a more real-world feel) and 10 M iterations for the outer loop (to keep 1G calls to the profiled function like the current version, thus comparable timings).
If you test it again this way I'll be interested to hear about your findings.
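The harness structure being suggested can be sketched as follows (process_block() is a hypothetical stand-in for the profiled kernel; the counts match the 1G-call total mentioned above):

```c
#include <stdint.h>
#include <stddef.h>

/* A small inner loop over an L1D-resident buffer so the per-call
   overhead amortizes, and an outer loop sized so the total call
   count stays at 1e9 (e.g. outer = 10,000,000 and inner = 100). */
static inline void process_block(uint8_t *dst, const uint8_t *src)
{
    *dst = (uint8_t)(*src ^ 0xFF);   /* placeholder kernel */
}

static void run_benchmark(uint8_t *dst, const uint8_t *src,
                          size_t outer, size_t inner)
{
    for (size_t o = 0; o < outer; o++)
        for (size_t i = 0; i < inner; i++)
            process_block(dst + i, src + i);  /* inlined call */
}
```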
Well, I am testing it with no inlining because I wanted to eliminate variations in the generated code between the various functions caused by global compiler optimizations -- this version tests exactly the "kernel" of each version, and since all of them pay the same CALL/RET penalty, it can safely be disregarded. Furthermore, this code should already keep everything in L1D, since it always reads the same source address and writes the same destination address. Finally, there is no difference in calculation speed due to randomness of the input data, since the input is the same every time (the only reason I would want random data is to measure the impact of cache misses on table lookups, which I don't have in this code).
Anyway, feel free to experiment with the code and by all means let me know if you find an efficient way to pack data and use full register width. Just bear in mind that it wouldn't be the first time that the most obvious solution is also the fastest one when it comes to assembler optimization (at least it happens to me often) :)
Igor Levicki wrote:
this version is exactly testing the "kernel" of each version and since all have the same penalty for CALL/RET it can be safely discarded
That's exactly what I call being "unfair" to the fastest path: the more optimized the tested function is, the more this fixed function call overhead matters and biases the comparison (levelling down the speedup). Your AVX2 path looks very good, and the way you expand the 8-bit mask to a 256-bit mask is very clever and already the optimal solution -- it would be a shame if its true potential didn't show up in the measurements.
Igor Levicki wrote:
Anyway, feel free to experiment with the code
I have just tested it (plugged your routines in a test framework of mine with a small inner loop as outlined above in this thread) to see the impact of the function call overhead, my findings below:
NOINLINE [1]:
baseline 3698.9 ms optimized 2854.97 ms speedup = 1.296 x
INLINE [2]:
baseline 3539.13 ms optimized 2241.87 ms speedup = 1.579 x
The studied function is called 1e9 times as in your example; "baseline" is your function test_SSE41 unchanged besides the INLINE/NOINLINE prefix, "optimized" is your function test_AVX2 unchanged.
After removing the useless code in test_AVX2, i.e. "blend_mask = _mm256_and_si256(blend_mask, sign_mask);" and "_mm256_zeroupper();", I get a better speedup (INLINE case shown):
baseline 3528.59 ms optimized 2122.16 ms speedup = 1.663 x
configuration: Core i7 4770K @ 4 GHz, Intel C++ compiler v13.1.3.198 (64-bit)
[1] #define NOINLINE __declspec(noinline)
[2] #define INLINE __forceinline
John D. McCalpin wrote:
Streaming stores...
Hello and welcome to my humble thread :)
Regarding streaming stores, this is the fastest variant for me on Haswell:
[cpp]
void memcopy(void *dst, const void *src, size_t nbytes)
{
__asm {
mov esi, src
mov edi, dst
mov ecx, nbytes
shr ecx, 6                 ; number of 64-byte cache lines
main_loop:
test ecx, ecx
jz main_loop_end
prefetcht0 [esi + 64 * 30] ; prefetch ~30 lines ahead
movaps xmm0, [esi]         ; load one full cache line
movaps xmm1, [esi + 16]
movaps xmm2, [esi + 32]
movaps xmm3, [esi + 48]
movntps [edi], xmm0        ; stream it out, bypassing the cache
movntps [edi + 16], xmm1
movntps [edi + 32], xmm2
movntps [edi + 48], xmm3
add esi, 64
add edi, 64
sub ecx, 1
jmp main_loop
main_loop_end:
sfence                     ; make streamed stores globally visible
}
}
[/cpp]