Congratulations to Intel's CPU instruction set engineers for managing to make YET ANOTHER non-orthogonal instruction set extension -- why were PEXTRD/PINSRD (among many others) not promoted to 256 bits in AVX2?
Any ideas/tricks to work around this engineering "oversight"?
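One possible workaround, at the cost of an extra instruction per access, is to grab the relevant 128-bit lane first and then use the existing 128-bit PEXTRD on that lane. A minimal intrinsics sketch of that idea (the helper name and the lane/element split are mine, purely illustrative):

#include <immintrin.h>
#include <stdint.h>

/* Extract 32-bit element i (0..7) from a 256-bit register:
   pick the 128-bit lane with VEXTRACTI128 (or a plain cast for the
   low lane), then PEXTRD from that lane. The element index must be
   a compile-time constant for the intrinsics, hence the switch. */
static inline uint32_t ymm_extract_epi32(__m256i v, int i)
{
    __m128i lane = (i < 4) ? _mm256_castsi256_si128(v)
                           : _mm256_extracti128_si256(v, 1);
    switch (i & 3) {
        case 0:  return (uint32_t)_mm_extract_epi32(lane, 0);
        case 1:  return (uint32_t)_mm_extract_epi32(lane, 1);
        case 2:  return (uint32_t)_mm_extract_epi32(lane, 2);
        default: return (uint32_t)_mm_extract_epi32(lane, 3);
    }
}

Insertion would be analogous: extract the lane, PINSRD into it, then put the lane back with VINSERTI128.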
Sergey Kostrov wrote:
Thanks Igor, I'll check if it is the fastest version for Ivy Bridge. By the way, that code is well known and I think the Intel Optimization Manual has a chapter on it.
You may need to tweak the prefetch distance; this is tuned for Haswell. The code might be known, but I don't think many people know that the AVX2 version with 2 YMM registers and VEX prefixes is slower on Haswell than this one.
Sergey Kostrov wrote:
I've noticed that the magic number is 1920 ( = 64 * 30 ). I simply would like to confirm that this is not 1080p/i related ( Full HD / 1920x1080 resolution ) and is related to something else?
It is a coincidence -- I determined the number by trial and error (I made a loop which increments the prefetch distance by one cache line and re-tests).
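For reference, a minimal sketch of such a tuning loop (the routine name FastMemCopy_PD, the block size, and the rdtsc-based timing are assumptions, not the code actually used):

#include <stdio.h>
#include <string.h>
#include <malloc.h>
#include <intrin.h>

#define CACHE_LINE 64
#define BLOCK_SIZE (16 * 1024 * 1024)

/* Hypothetical copy routine that takes the prefetch distance as a parameter. */
extern void FastMemCopy_PD(void *dst, const void *src, size_t len, size_t prefetch_distance);

int main(void)
{
    char *src = (char *)_aligned_malloc(BLOCK_SIZE, 32);
    char *dst = (char *)_aligned_malloc(BLOCK_SIZE, 32);
    memset(src, 1, BLOCK_SIZE);

    /* increment the prefetch distance by one cache line and re-test */
    for (size_t pd = CACHE_LINE; pd <= 128 * CACHE_LINE; pd += CACHE_LINE) {
        unsigned long long best = ~0ULL;
        for (int run = 0; run < 16; run++) {   /* keep the best of several runs */
            unsigned long long t0 = __rdtsc();
            FastMemCopy_PD(dst, src, BLOCK_SIZE, pd);
            unsigned long long t1 = __rdtsc();
            if (t1 - t0 < best) best = t1 - t0;
        }
        printf("prefetch distance %6u B : %llu cycles\n", (unsigned)pd, best);
    }

    _aligned_free(src);
    _aligned_free(dst);
    return 0;
}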
Sergey Kostrov wrote:
Igor,
I see that FastMemCopy ( based on the example you've provided / I did some modifications to support 32-bit and 64-bit platforms ) doesn't outperform CRT memcpy for smaller memory blocks up to some threshold ( 2MB / 4MB, depending on the system / see performance data ). For larger memory blocks FastMemCopy is faster, but the performance difference shrinks as the size of the memory block increases.
I never claimed that it outperforms CRT memcpy(), because the CRT version is optimized for many more cases, especially for small copies.
The code I posted is just a demo of streaming stores and prefetch. The reason it performs slightly better on blocks larger than 2MB/4MB is that streaming bypasses the cache (it uses the write-combining buffers), so the speed is not affected when your dataset is too big to fit in the last level cache.
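For illustration, a minimal intrinsics sketch of that difference (assuming 32-byte aligned buffers and a length that is a multiple of 8 floats):

#include <immintrin.h>
#include <stddef.h>

/* Regular stores: destination lines are allocated in the caches,
   so a copy larger than the last level cache evicts useful data
   and the stores themselves compete for cache space. */
static void copy_regular(float *dst, const float *src, size_t n_floats)
{
    for (size_t i = 0; i < n_floats; i += 8)
        _mm256_store_ps(dst + i, _mm256_load_ps(src + i));
}

/* Streaming (non-temporal) stores: data goes through the
   write-combining buffers straight to memory and does not
   pollute the cache hierarchy. */
static void copy_streaming(float *dst, const float *src, size_t n_floats)
{
    for (size_t i = 0; i < n_floats; i += 8)
        _mm256_stream_ps(dst + i, _mm256_load_ps(src + i));
    _mm_sfence();   /* order the non-temporal stores before later accesses */
}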
Here is the AVX-based variation of the memory copying routine. It is based on 8x loop unrolling and non-temporal stores through the YMM registers. Prefetching is software based. Later I will test other unroll factors and hardware prefetching.
#include <stdio.h>
#include <stdlib.h>

void FastAVX_MemCpy(const void * source, void * dest, const unsigned int length){
    const unsigned int Len = length;
    if(NULL == source || NULL == dest){
        if(!source){
            printf("Null pointer has been passed [%p] \n", &source);
            exit(1);
        }else if(!dest){
            printf("Null pointer has been passed [%p] \n", &dest);
            exit(1);
        }
    }
    else if(Len % 256 != 0){
        // the loop below copies 256 bytes per iteration
        printf("length argument must be a multiple of 256, got %u \n", Len);
        exit(1);
    }else{
        // source and dest must be 32-byte aligned for vmovaps/vmovntps
        _asm{
            mov edi,dest
            mov esi,source
            mov edx,source
            add edx,dword ptr Len
            align 32
        copy_loop:
            prefetcht0 [esi+256 * 32]        ; prefetch one cache line 8 KB ahead
            vmovaps ymm0, ymmword ptr [esi]
            vmovaps ymm1, ymmword ptr [esi+32]
            vmovaps ymm2, ymmword ptr [esi+64]
            vmovaps ymm3, ymmword ptr [esi+96]
            vmovaps ymm4, ymmword ptr [esi+128]
            vmovaps ymm5, ymmword ptr [esi+160]
            vmovaps ymm6, ymmword ptr [esi+192]
            vmovaps ymm7, ymmword ptr [esi+224]
            vmovntps ymmword ptr [edi], ymm0
            vmovntps ymmword ptr [edi+32], ymm1
            vmovntps ymmword ptr [edi+64], ymm2
            vmovntps ymmword ptr [edi+96], ymm3
            vmovntps ymmword ptr [edi+128],ymm4
            vmovntps ymmword ptr [edi+160],ymm5
            vmovntps ymmword ptr [edi+192],ymm6
            vmovntps ymmword ptr [edi+224],ymm7
            add esi,256
            add edi,256
            cmp esi,edx
            jne copy_loop
            sfence
        }
    }
}
iliyapolak wrote:
copy_loop:
prefetcht0 [esi+256 * 32]
vmovaps ymm0, ymmword ptr [esi]
vmovaps ymm1, ymmword ptr [esi+32]
vmovaps ymm2, ymmword ptr [esi+64]
vmovaps ymm3, ymmword ptr [esi+96]
vmovaps ymm4, ymmword ptr [esi+128]
vmovaps ymm5, ymmword ptr [esi+160]
vmovaps ymm6, ymmword ptr [esi+192]
vmovaps ymm7, ymmword ptr [esi+224]
You read 256 bytes per iteration but prefetch only 64 bytes (one cache line).
It will probably be faster to add 3 more prefetches, or to simply unroll only 2x, so that in either case the prefetched and read sizes match exactly.
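A minimal intrinsics sketch of the first suggestion, i.e. one prefetch per cache line so that 256 bytes are prefetched and 256 bytes are read per iteration (the 8 KB distance is just carried over from the original code as a starting point):

#include <immintrin.h>
#include <stddef.h>

/* Copies 256 bytes per iteration and prefetches the four cache lines
   that will be read PF_DIST bytes later. Assumes 32-byte aligned
   pointers and len being a multiple of 256. Prefetches past the end
   of the source buffer are harmless hints. */
static void copy_256_prefetched(char *dst, const char *src, size_t len)
{
    const size_t PF_DIST = 8192;   /* prefetch distance, to be tuned per CPU */

    for (size_t i = 0; i < len; i += 256) {
        _mm_prefetch(src + i + PF_DIST,       _MM_HINT_T0);
        _mm_prefetch(src + i + PF_DIST + 64,  _MM_HINT_T0);
        _mm_prefetch(src + i + PF_DIST + 128, _MM_HINT_T0);
        _mm_prefetch(src + i + PF_DIST + 192, _MM_HINT_T0);

        for (size_t j = 0; j < 256; j += 32) {
            __m256 v = _mm256_load_ps((const float *)(src + i + j));
            _mm256_stream_ps((float *)(dst + i + j), v);
        }
    }
    _mm_sfence();
}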
Keep in mind two things:
1. A prefetch instruction is merely a hint. It may or may not do anything, depending on various factors.
2. When it does do something, a prefetch instruction competes for memory bus bandwidth, so use it sparingly or you will get the opposite effect from the one you wanted.
One prefetch instruction typically prefetches one cache line (64 B), so in your example you are explicitly prefetching only 1/4 of your data.
iliyapolak wrote:
I think that the prefetch distance is greater than 64 bytes. On every iteration of the loop the prefetch address is 256*32 bytes ahead of the esi register.
Your prefetch distance is another issue: at 8 KB it looks quite big. It can be a problem if you have another thread fighting for the L1D cache, and you also won't prefetch the initial 8 KB, but as you say it's a matter of tuning.
