I tried to improve the performance of some simple code via SSE and AVX, but I found that the AVX code needs more time than the SSE code:
void testfun()
{
    int dataLen = 4800;
    int N = 10000000;
    float *buf1 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
    float *buf2 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
    float *buf3 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
    for(int i=0; i<dataLen; i++)
    {
        buf1[i] = 1.0f; buf2[i] = 1.0f; buf3[i] = 0.0f;
    }
    clock_t timePassed, t;
    //=========================SSE CODE=====================================
    t = clock();
    __m128 *p1, *p2, *p3;
    for(int j=0; j<N; j++)
    {
        p1 = (__m128*)buf1;
        p2 = (__m128*)buf2;
        p3 = (__m128*)buf3;
        for(int i=0; i<dataLen/4; i++)
        {
            *p3 = _mm_add_ps(_mm_mul_ps(*p1, *p2), *p3);
            p1++; p2++; p3++;
        }
    }
    timePassed = clock() - t;
    std::cout<<"SSE time used: "<<timePassed<<"ms"<<std::endl;
    for(int i=0; i<dataLen; i++) { buf3[i] = 0.0f; }
    t = clock();
    //=========================AVX CODE=====================================
    __m256 *pp1, *pp2, *pp3;
    for(int j=0; j<N; j++)
    {
        pp1 = (__m256*)buf1;
        pp2 = (__m256*)buf2;
        pp3 = (__m256*)buf3;
        for(int i=0; i<dataLen/8; i++)
        {
            *pp3 = _mm256_add_ps(_mm256_mul_ps(*pp1, *pp2), *pp3);
            pp1++; pp2++; pp3++;
        }
    }
    timePassed = clock() - t;
    std::cout<<"AVX time used: "<<timePassed<<"ms"<<std::endl;
    _aligned_free(buf1); _aligned_free(buf2); _aligned_free(buf3);
}
I changed dataLen and got different relative performance:
dataLen = 400    SSE time:   758 ms   AVX time:   483 ms   SSE > AVX
dataLen = 2400   SSE time:  4212 ms   AVX time:  2636 ms   SSE > AVX
dataLen = 2864   SSE time:  6115 ms   AVX time:  6146 ms   SSE ~= AVX
dataLen = 3200   SSE time:  8049 ms   AVX time:  9297 ms   SSE < AVX
dataLen = 4000   SSE time: 10170 ms   AVX time: 11690 ms   SSE < AVX
My L1 cache is 32 KB and my L2 cache is 1 MB. It seems that a 256-bit load is sometimes slower than a 128-bit load. Why? I get the same result if I change the code to use explicit SIMD load intrinsics such as _mm256_load_ps, _mm_load_ps, _mm_add_ps, and so on, as shown below.
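For reference, the explicit-intrinsic variant of the AVX inner loop looks like this (a sketch of what I tried; the timings were the same):

// AVX inner loop with explicit load/store intrinsics instead of pointer dereferencing
for(int i=0; i<dataLen; i+=8)
{
    __m256 a = _mm256_load_ps(buf1 + i);
    __m256 b = _mm256_load_ps(buf2 + i);
    __m256 c = _mm256_load_ps(buf3 + i);
    _mm256_store_ps(buf3 + i, _mm256_add_ps(_mm256_mul_ps(a, b), c));
}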
You haven't indicated your processor. My guess is you are on Sandy Bridge.
The next-generation Haswell will correct for this in your example with a 2x wider data path between the CPU and the L1/L2/L3 caches.
See http://www.realworldtech.com/haswell-cpu/5/ for some insight.
Also, Haswell can perform FMA (Fused Multiply-Add) in one instruction (... = (B*C) + A).
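For example, the inner loop above could become something like this on Haswell (a sketch; assumes compiling with FMA support and the loop from your post):

// One fused multiply-add replaces the separate _mm256_mul_ps + _mm256_add_ps:
*pp3 = _mm256_fmadd_ps(*pp1, *pp2, *pp3);   // computes (*pp1 * *pp2) + *pp3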
And depending on the Haswell CPU, you can have up to 4 memory banks.
Use Sandy Bridge for learning how to use _mm256 (and converting applications).
Use Haswell for production code.
Sandy Bridge gave you a year to adapt your code for Haswell (and later).
Jim Dempsey
Hi Zheng
Do you have a transition penalty between SSE and 256-bit AVX code? Maybe during the execution of your code, SSE and AVX-256 instructions got intermixed in the YMM registers?
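If that were the case, a common remedy is to clear the upper halves of the YMM registers when leaving the AVX section (a sketch, not taken from your code):

// ... end of AVX-256 section ...
_mm256_zeroupper();   // avoids SSE/AVX transition stalls before legacy SSE code runs
// ... start of SSE section ...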
jimdempseyatthecove wrote:
You haven't indicated your processor. My guess is you are on Sandy Bridge.
Use Haswell for production code. Sandy Bridge gave you a year to adapt your code for Haswell (and later).
Jim Dempsey
Hi Jim, thanks for your reply!
Yes, my processor is a Sandy Bridge (Xeon E3 1225 v2). Others can recreate my results on Sandy Bridge processors.
For extra-large arrays and a simple calculation, like
for(int i=0; i<1000000000; i++)
{
    A[i] += B[i]*C[i];
}
this loop is memory bandwidth limited, so it cannot fully show the FMA's advantage.
I've tested my memory read and write speed; it is about 20 GB/s (DDR3, dual channel). I think this speed is fast, but it is still not enough.
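A rough estimate (my own arithmetic): each buf3[i] += buf1[i]*buf2[i] moves 16 bytes (three 4-byte reads plus one 4-byte write) for 2 floating-point operations, so 20 GB/s caps the streaming case at about 20e9 / 16 ≈ 1.25G elements/s, i.e. roughly 2.5 Gflop/s, no matter whether I use SSE, AVX, or FMA.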
Can I get faster?
>>>This loop is memory bandwidth limited, so it cannot fully show the FMA's advantage.>>>
Can you run a VTune analysis on that code to see where the pipeline stalls are? Here I mean front-end pipeline stalls.
iliyapolak wrote:
Hi Zheng
Do you have a transition penalty between SSE and 256-bit AVX code? Maybe during the execution of your code, SSE and AVX-256 instructions got intermixed in the YMM registers?
I have looked at the assembly code: it uses XMM registers in the SSE code and YMM registers in the AVX code!
Also, I can comment out either the SSE code or the AVX code, and the result is the same!
iliyapolak wrote:
>>>This loop is memory bandwidth limited, so it cannot fully show the FMA's advantage.>>>
Can you run a VTune analysis on that code to see where the pipeline stalls are? Here I mean front-end pipeline stalls.
Hi, I don't have this software...
I reached this conclusion just from some tests. For example, I reduce the size of the arrays and increase the iterations of the loop
(keeping the total amount of computation the same), and then I get different performance.
When the arrays can all be stored in the cache, the performance is the best!
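Concretely, I scale the iteration count like this (a sketch of my test setup; the constant is just the product from my original code):

// Hold total work constant while varying the working-set size:
const long long totalOps = 4800LL * 10000000LL;   // dataLen * N from the original test
int N = static_cast<int>(totalOps / dataLen);     // smaller arrays => more passes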
Sergey Kostrov wrote:
>>float *buf1 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
>>
>>float *buf2 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
>>
>>float *buf3 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
Did you verify that these three pointers are really aligned on a 32-byte boundary? Also, you've overcomplicated the memory allocation; why do you need the reinterpret_cast C++ operator?
Hi, I have checked the allocated memory; it really is aligned on a 32-byte boundary.
If I change the alignment argument to 16, then the allocated memory is 16-byte aligned but not 32-byte aligned.
reinterpret_cast<float*> just converts the address to a float pointer. I am not so familiar with this...
Could you please give an example of creating 32-byte aligned memory? Thanks!
>>>Hi, I don't have this software...
I reached this conclusion just from some tests. For example, I reduce the size of the arrays and increase the iterations of the loop>>>
You can download a trial version of Parallel Studio. Without collecting CPU performance-counter data, it is hard to say what exactly the limiting factor is in your case.
On Sandy Bridge the internal data path is still 128 bits. See: http://www.realworldtech.com/sandy-bridge/6/
Similar issue with Ivy Bridge.
Haswell has expanded the internal data path to 256 bits. See: http://www.hardwaresecrets.com/printpage/Inside-the-Intel-Haswell-Microarchitecture/1777
Jim Dempsey
Intel has not made definitive statements on this, but I can think of a few reasons why the 128-bit loads might be more efficient than the 256-bit loads on Sandy Bridge processors (and presumably Ivy Bridge processors as well).
(1) Two 128-bit loads can be issued to the two load ports in a single cycle, while a 256-bit load is issued to one port, which is then occupied for two cycles. It is certainly plausible that the former case allows better low-level scheduling (see the sketch after this list).
(2) When the data is bigger than the L1 Dcache: there is evidence that the L1 Dcache can either provide 32 Bytes/cycle to the core *or* receive 32 Bytes/cycle to reload a line from the L2, but not both at the same time. Again, having independent 128-bit loads that can execute in a single cycle might allow better scheduling with respect to the timing of the L1 Dcache refills from the L2 than having 256-bit loads that occupy the L1 for two cycles.
(3) Intel has not disclosed enough details about the L1 Data Cache banking to really understand what is going on there. It is possible that 256-bit loads hit bank conflicts more often than 128-bit loads, or that the impact of these delays is larger (because the 256-bit loads occupy a port for two cycles instead of one cycle).
(4) For data bigger than L1: The L1 hardware prefetcher is activated by "streams" of load addresses. Using 128-bit loads gets you a "stream" of loads faster than using 256-bit loads. Since the hardware prefetchers have to start all over again for every 4 KiB page (64 cache lines), being able to start prefetching from the L2 even a few cycles earlier might make a noticeable difference.
(5) It is important to check the assembly code carefully when using intrinsics! Although these "look like" inline assembly, they are not, and the compiler may perform high-level optimizations that you don't expect. I think that the differences seen here are real, but some of the details may depend on exactly what code the compiler decides to generate.
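To illustrate point (1), here is a sketch (my own illustration, not code from the original post) of the SSE inner loop unrolled so that independent 128-bit loads can be issued to both load ports in the same cycle:

// Unrolled SSE inner loop: two independent 128-bit load streams per iteration.
for(int i=0; i<dataLen; i+=8)
{
    __m128 a0 = _mm_load_ps(buf1 + i);
    __m128 a1 = _mm_load_ps(buf1 + i + 4);
    __m128 b0 = _mm_load_ps(buf2 + i);
    __m128 b1 = _mm_load_ps(buf2 + i + 4);
    __m128 c0 = _mm_load_ps(buf3 + i);
    __m128 c1 = _mm_load_ps(buf3 + i + 4);
    _mm_store_ps(buf3 + i,     _mm_add_ps(_mm_mul_ps(a0, b0), c0));
    _mm_store_ps(buf3 + i + 4, _mm_add_ps(_mm_mul_ps(a1, b1), c1));
}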
The core load/store performance isn't impacted by cache size or the hardware prefetcher in terms of answering this question. If you measured the load latency of the given instructions, you'd find that 256-bit loads within a cache line have a latency of 9 clocks on SB/IB, 2 clocks more than narrower loads. Now, if your 256-bit load spans a cache-line boundary, you pay a penalty of 21 clocks on SB/IB (it's 13 clocks on Haswell). That's why this is not prudent on SB/IB. If you are using 128-bit vectors and you're not 16-byte aligned, you're crossing a cache-line boundary 1/4 of the time, but with 256-bit loads it happens 1/2 the time. If you're 16-byte aligned, you won't pay this cache-line-spanning penalty in SSE/AVX-128, but you will in AVX-256, since you're not 32-byte aligned. To align, just do it in C code: pad your malloc with some extra bytes, round the address up to the next 32-byte boundary (add 31, then mask off the low 5 bits), and feed that pointer to your code.
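A minimal sketch of that over-allocate-and-round trick (variable names mine; uintptr_t needs <cstdint>):

// Over-allocate by 31 bytes, then round the address up to a 32-byte boundary.
char  *raw = (char*)malloc(sizeof(float)*dataLen + 31);
float *buf = (float*)(((uintptr_t)raw + 31) & ~(uintptr_t)31);
// ... buf can now be used for 32-byte-aligned AVX loads/stores ...
free(raw);   // free the original pointer, not the aligned one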
perfwise
I suppose that in the case of a 256-bit load, two additional cycles are needed to physically transport the additional 16 bytes of data.