Congratulations to Intel's CPU instruction set engineers for managing to make YET ANOTHER non-orthogonal instruction set extension -- why were PEXTRD/PINSRD (among many others) not promoted to 256 bits in AVX2?
Any ideas/tricks to work around this engineering "oversight"?
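One possible workaround, at the cost of an extra instruction per access, is to grab the relevant 128-bit lane first and then use the existing 128-bit PEXTRD on that lane. A minimal intrinsics sketch of that idea (the helper name and the lane/element split are mine, purely illustrative):

#include <immintrin.h>
#include <stdint.h>

/* Extract 32-bit element i (0..7) from a 256-bit register:
   pick the 128-bit lane with VEXTRACTI128 (or a plain cast for the
   low lane), then PEXTRD from that lane. The element index must be
   a compile-time constant for the intrinsics, hence the switch. */
static inline uint32_t ymm_extract_epi32(__m256i v, int i)
{
    __m128i lane = (i < 4) ? _mm256_castsi256_si128(v)
                           : _mm256_extracti128_si256(v, 1);
    switch (i & 3) {
        case 0:  return (uint32_t)_mm_extract_epi32(lane, 0);
        case 1:  return (uint32_t)_mm_extract_epi32(lane, 1);
        case 2:  return (uint32_t)_mm_extract_epi32(lane, 2);
        default: return (uint32_t)_mm_extract_epi32(lane, 3);
    }
}

Insertion would be analogous: extract the lane, PINSRD into it, then put the lane back with VINSERTI128.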
Sergey Kostrov wrote:
Thanks Igor, I'll check if it is the fastest version for Ivy Bridge. By the way, that code is well known and I think the Intel Optimization Manual has a chapter on it.
You may need to tweak the prefetch distance; this is tuned for Haswell. The code might be known, but I don't think many people know that the AVX2 version with 2 YMM registers and VEX prefixes is slower on Haswell than this one.
Sergey Kostrov wrote:
I've noticed that the magic number is 1920 ( = 64 * 30 ). I simply would like to confirm that this is not 1080p/i related ( Full HD / 1920x1080 resolution ) and is related to something else?
It is a coincidence -- I determined the number by trial and error (I made a loop which increments the prefetch distance by one cache line and re-tests).
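For reference, a minimal sketch of such a tuning loop (the routine name FastMemCopy_PD, the block size, and the rdtsc-based timing are assumptions, not the code actually used):

#include <stdio.h>
#include <string.h>
#include <malloc.h>
#include <intrin.h>

#define CACHE_LINE 64
#define BLOCK_SIZE (16 * 1024 * 1024)

/* Hypothetical copy routine that takes the prefetch distance as a parameter. */
extern void FastMemCopy_PD(void *dst, const void *src, size_t len, size_t prefetch_distance);

int main(void)
{
    char *src = (char *)_aligned_malloc(BLOCK_SIZE, 32);
    char *dst = (char *)_aligned_malloc(BLOCK_SIZE, 32);
    memset(src, 1, BLOCK_SIZE);

    /* increment the prefetch distance by one cache line and re-test */
    for (size_t pd = CACHE_LINE; pd <= 128 * CACHE_LINE; pd += CACHE_LINE) {
        unsigned long long best = ~0ULL;
        for (int run = 0; run < 16; run++) {   /* keep the best of several runs */
            unsigned long long t0 = __rdtsc();
            FastMemCopy_PD(dst, src, BLOCK_SIZE, pd);
            unsigned long long t1 = __rdtsc();
            if (t1 - t0 < best) best = t1 - t0;
        }
        printf("prefetch distance %6u B : %llu cycles\n", (unsigned)pd, best);
    }

    _aligned_free(src);
    _aligned_free(dst);
    return 0;
}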
Sergey Kostrov wrote:
Igor,
I see that FastMemCopy ( based on the example you've provided / I did some modifications to support 32-bit and 64-bit platforms ) doesn't outperform CRT memcpy for smaller memory blocks up to some threshold ( 2MB / 4MB, depending on the system / see performance data ). For larger memory blocks FastMemCopy is faster, but the performance difference shrinks as the size of the memory block increases.
I never claimed that it outperforms CRT memcpy(), because the CRT version is optimized for many more cases, especially for small copies.
The code I posted is just a demo of streaming stores and prefetch. The reason it performs slightly better on blocks larger than 2MB/4MB is that streaming bypasses the cache (it uses the write-combining buffers), so the speed is not affected when your dataset is too big to fit in the last level cache.
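For illustration, a minimal intrinsics sketch of that difference (assuming 32-byte aligned buffers and a length that is a multiple of 8 floats):

#include <immintrin.h>
#include <stddef.h>

/* Regular stores: destination lines are allocated in the caches,
   so a copy larger than the last level cache evicts useful data
   and the stores themselves compete for cache space. */
static void copy_regular(float *dst, const float *src, size_t n_floats)
{
    for (size_t i = 0; i < n_floats; i += 8)
        _mm256_store_ps(dst + i, _mm256_load_ps(src + i));
}

/* Streaming (non-temporal) stores: data goes through the
   write-combining buffers straight to memory and does not
   pollute the cache hierarchy. */
static void copy_streaming(float *dst, const float *src, size_t n_floats)
{
    for (size_t i = 0; i < n_floats; i += 8)
        _mm256_stream_ps(dst + i, _mm256_load_ps(src + i));
    _mm_sfence();   /* order the non-temporal stores before later accesses */
}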
Here is the AVX-based variation of the memory copying routine. It is based on 8x loop unrolling and non-temporal stores through the YMM registers. Prefetching is software based. Later I will test other unroll factors and hardware prefetching.
#include <stdio.h>
#include <stdlib.h>

void FastAVX_MemCpy(const void * source, void * dest, const unsigned int length){
    const unsigned int Len = length;
    if(NULL == source || NULL == dest){
        if(!source){
            printf("Null pointer has been passed [%p] \n", &source);
            exit(1);
        }else if(!dest){
            printf("Null pointer has been passed [%p] \n", &dest);
            exit(1);
        }
    }
    else if(Len % 256 != 0){
        // the loop below copies 256 bytes per iteration
        printf("length argument must be a multiple of 256, got %u \n", Len);
        exit(1);
    }else{
        // source and dest must be 32-byte aligned for vmovaps/vmovntps
        _asm{
            mov edi,dest
            mov esi,source
            mov edx,source
            add edx,dword ptr Len
            align 32
        copy_loop:
            prefetcht0 [esi+256 * 32]        ; prefetch one cache line 8 KB ahead
            vmovaps ymm0, ymmword ptr [esi]
            vmovaps ymm1, ymmword ptr [esi+32]
            vmovaps ymm2, ymmword ptr [esi+64]
            vmovaps ymm3, ymmword ptr [esi+96]
            vmovaps ymm4, ymmword ptr [esi+128]
            vmovaps ymm5, ymmword ptr [esi+160]
            vmovaps ymm6, ymmword ptr [esi+192]
            vmovaps ymm7, ymmword ptr [esi+224]
            vmovntps ymmword ptr [edi], ymm0
            vmovntps ymmword ptr [edi+32], ymm1
            vmovntps ymmword ptr [edi+64], ymm2
            vmovntps ymmword ptr [edi+96], ymm3
            vmovntps ymmword ptr [edi+128],ymm4
            vmovntps ymmword ptr [edi+160],ymm5
            vmovntps ymmword ptr [edi+192],ymm6
            vmovntps ymmword ptr [edi+224],ymm7
            add esi,256
            add edi,256
            cmp esi,edx
            jne copy_loop
            sfence
        }
    }
}
iliyapolak wrote:
copy_loop:
prefetcht0 [esi+256 * 32]
vmovaps ymm0, ymmword ptr [esi]
vmovaps ymm1, ymmword ptr [esi+32]
vmovaps ymm2, ymmword ptr [esi+64]
vmovaps ymm3, ymmword ptr [esi+96]
vmovaps ymm4, ymmword ptr [esi+128]
vmovaps ymm5, ymmword ptr [esi+160]
vmovaps ymm6, ymmword ptr [esi+192]
vmovaps ymm7, ymmword ptr [esi+224]
You read 256 bytes per iteration but prefetch only 64 bytes (one cache line).
It will probably be faster to add 3 more prefetches, or to simply unroll only 2x, so that in either case the prefetched and read sizes match exactly.
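A minimal intrinsics sketch of the first suggestion, i.e. one prefetch per cache line so that 256 bytes are prefetched and 256 bytes are read per iteration (the 8 KB distance is just carried over from the original code as a starting point):

#include <immintrin.h>
#include <stddef.h>

/* Copies 256 bytes per iteration and prefetches the four cache lines
   that will be read PF_DIST bytes later. Assumes 32-byte aligned
   pointers and len being a multiple of 256. Prefetches past the end
   of the source buffer are harmless hints. */
static void copy_256_prefetched(char *dst, const char *src, size_t len)
{
    const size_t PF_DIST = 8192;   /* prefetch distance, to be tuned per CPU */

    for (size_t i = 0; i < len; i += 256) {
        _mm_prefetch(src + i + PF_DIST,       _MM_HINT_T0);
        _mm_prefetch(src + i + PF_DIST + 64,  _MM_HINT_T0);
        _mm_prefetch(src + i + PF_DIST + 128, _MM_HINT_T0);
        _mm_prefetch(src + i + PF_DIST + 192, _MM_HINT_T0);

        for (size_t j = 0; j < 256; j += 32) {
            __m256 v = _mm256_load_ps((const float *)(src + i + j));
            _mm256_stream_ps((float *)(dst + i + j), v);
        }
    }
    _mm_sfence();
}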
Keep in mind two things:
1. A prefetch instruction is merely a hint. It may or may not do anything, depending on various factors.
2. When it does do something, a prefetch instruction competes for memory bus bandwidth, so use it sparingly or you will get the opposite effect from the one you wanted.
One prefetch instruction typically prefetches one cache line (64 B), so in your example you are explicitly prefetching only 1/4 of your data.
iliyapolak wrote:
I think that the prefetch distance is greater than 64 bytes. On every iteration of the loop the prefetch address is 256*32 bytes ahead of the esi register.
Your prefetch distance is another issue: at 8 KB it looks quite big. It can be a problem if you have another thread fighting for the L1D cache, and you also won't prefetch the initial 8 KB, but as you say it's a matter of tuning.
