- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Congratulations to Intel CPU instruction set engineers for managing to make YET ANOTHER non-orthogonal instruction set extension -- why PEXTRD/PINSRD (among many others) were not promoted to 256 bits in AVX2?
Any ideas/tricks to work around this engineering "oversight"?
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Sergey Kostrov wrote:
Thanks Igor and I'll check if it is the fastest version for Ivy Bridge. By the way, that code is well known and I think Intel Optimization Manual has a chapter with it.
You may need to tweak prefetch distance, this is tuned for Haswell. The code might be known, but I don't think many people know that AVX2 version with 2 YMM registers and vex prefixes is slower on Haswell than this.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
It looks like tweaked single loop version of example from ISDM.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I think that code posted by Igor could execute two unrolled loop cycles load and one unrolled loop store simultaneously occupying 3 execution load/store ports.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Sergey Kostrov wrote:
I've noticed that a magic number is 1920 ( = 64 * 30 ).I simply would like to understand that this is Not 1080p/i related ( Full HD / 1920x1080 resolution ) and this is related to something else?
It is a coincidence -- I determined the number by trial and error (I made a loop which increments prefetch distance by 1 cache line and re-tests).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Sergey Kostrov wrote:
Igor,
I see that FastMemCopy ( based on the example you've provided / I did some modifications to support 32-bit and 64-bit platforms ) doesn't outperform CRT memcpy for smaller memory blocks up to some threshold ( 2MB / 4MB and it depends on a system / see performance data ). For larger memory blocks FastMemCopy is faster but a performance difference drops down as soon as the size of a memory block increases.
I never claimed that it outperforms CRT memcpy() because CRT version is optimized for many more cases and especially for small copies.
The code I posted is just a demo of streaming stores and prefetch. Reason why it has a tiny bit better performance on blocks larger than 2MB/4MB is because streaming bypasses cache (it is using write-combining buffers) so the speed is not affected when your dataset is too big to fit in the last level cache.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Here is the AVX-based variation of memory copying routine.It is based on eight loop unrolling and non-temporal stores through the YMMn registers.Prefetching is software based.Later I will test iteration unrolling and hardware prefetching.
void FastAVX_MemCpy(const void * source, void * dest,const unsigned int length){
const unsigned int Len = length;
if(NULL == source || NULL == dest){
if(!source){
printf("Null pointer has been passed [%p] \n",&source);
exit(1);
}else
if(!dest){
printf("Null pointer has been passed [%p] \n",&dest);
exit(1);
}
}
else
if(Len % 32 != 0){
printf("length argument must be a multiplies of 32 %d \n",Len);
exit(1);
}else{
//(__m256 *)source;
//(__m256 *)dest;
_asm{
mov edi,dest
mov esi,source
mov edx,source
add edx,dword ptr Len
align 32
copy_loop:
prefetcht0 [esi+256 * 32]
vmovaps ymm0, ymmword ptr [esi]
vmovaps ymm1, ymmword ptr [esi+32]
vmovaps ymm2, ymmword ptr [esi+64]
vmovaps ymm3, ymmword ptr [esi+96]
vmovaps ymm4, ymmword ptr [esi+128]
vmovaps ymm5, ymmword ptr [esi+160]
vmovaps ymm6, ymmword ptr [esi+192]
vmovaps ymm7, ymmword ptr [esi+228]
vmovntps ymmword ptr [edi], ymm0
vmovntps ymmword ptr [edi+32], ymm1
vmovntps ymmword ptr [edi+64], ymm2
vmovntps ymmword ptr [edi+96], ymm3
vmovntps ymmword ptr [edi+128],ymm4
vmovntps ymmword ptr [edi+160],ymm5
vmovntps ymmword ptr [edi+190],ymm6
vmovntps ymmword ptr [edi+228],ymm7
add esi,256
add edi,256
cmp esi,edx
jne copy_loop
sfence
}
}
}
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Two errors in previous post code snippet.There should be a zeroing of ECX register in prefetching loop and offset during unrolling should be esi+ecx+224.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
iliyapolak wrote:
copy_loop:
prefetcht0 [esi+256 * 32]
vmovaps ymm0, ymmword ptr [esi]
vmovaps ymm1, ymmword ptr [esi+32]
vmovaps ymm2, ymmword ptr [esi+64]
vmovaps ymm3, ymmword ptr [esi+96]
vmovaps ymm4, ymmword ptr [esi+128]
vmovaps ymm5, ymmword ptr [esi+160]
vmovaps ymm6, ymmword ptr [esi+192]
vmovaps ymm7, ymmword ptr [esi+228]
you read 256 bytes per iteration but prefetch only 64 bytes (one cache line)
this will be probably faster to add 3 more prefetches or to simply unroll 2x, both cases with perfectly matched prefetch and read sizes
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
EDIT clumsy comment of mine removed
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I need still need to tweak that code and test it.I think that prefetching distance is greater than 64 bytes.On every iteration of the loop esi register is incremented by 256*32 bytes.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Keep in mind two things:
1. prefetch instruction is merely a hint. It may or may not do anything depending on various factors.
2. When it does something, prefetch instruction competes for memory bus bandwidth so use it sparingly or you are going to get opposite effect from the one you wanted.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
one prefetch instruction prefetch typically one cache line (64 B) so you are prefetching explicitely only 1/4 of your data in your example
iliyapolak wrote:
I think that prefetching distance is greater than 64 bytes.On every iteration of the loop esi register is incremented by 256*32 bytes.
your prefetch distance is another issue, it looks quite big at 8KB, it can be an issue if you have another thread fighting for the L1D cache, also you'll not prefetch the initial 8KB, but as you say it's a matter of tuning
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for correcting me.It seems that I did not read that sentence in Optimization Manual which states that single cache line is loaded by prefetch instruction.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page