<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi Zheng in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924321#M3063</link>
    <description>&lt;P&gt;Hi Zheng&lt;/P&gt;
&lt;P&gt;Do you have a transition penalty between SSE and AVX-256 code? Maybe during the execution of your code SSE and AVX-256 instructions got intermixed in the YMM registers?&lt;/P&gt;</description>
    <pubDate>Tue, 17 Sep 2013 06:12:00 GMT</pubDate>
    <dc:creator>Bernard</dc:creator>
    <dc:date>2013-09-17T06:12:00Z</dc:date>
    <item>
      <title>do _mm256_load_ps slower than _mm_load_ps?</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924319#M3061</link>
      <description>&lt;P&gt;I tried to improve the performance of some simple code using SSE and AVX, but I found that the AVX code needs more time than the SSE code:&lt;/P&gt;
&lt;PRE&gt;void testfun()
{
    int dataLen = 4800;
    int N = 10000000;
    float *buf1 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));
    float *buf2 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));
    float *buf3 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));
    for(int i=0; i&amp;lt;dataLen; i++)
    {
        buf1[i] = 1;  buf2[i] = 1;  buf3[i] = 0;
    }
    int timePassed;
    int t;
    //========================= SSE CODE =========================
    t = clock();
    __m128 *p1, *p2, *p3;
    for(int j=0; j&amp;lt;N; j++)
    {
        p1 = (__m128 *)buf1;
        p2 = (__m128 *)buf2;
        p3 = (__m128 *)buf3;
        for(int i=0; i&amp;lt;dataLen/4; i++)
        {
            *p3 = _mm_add_ps(_mm_mul_ps(*p1, *p2), *p3);
            p1++;  p2++;  p3++;
        }
    }
    timePassed = clock() - t;
    std::cout&amp;lt;&amp;lt;"SSE time used: "&amp;lt;&amp;lt;timePassed&amp;lt;&amp;lt;"ms"&amp;lt;&amp;lt;std::endl;
    for(int i=0; i&amp;lt;dataLen; i++)  { buf3[i] = 0; }
    t = clock();
    //========================= AVX CODE =========================
    __m256 *pp1, *pp2, *pp3;
    for(int j=0; j&amp;lt;N; j++)
    {
        pp1 = (__m256*) buf1;
        pp2 = (__m256*) buf2;
        pp3 = (__m256*) buf3;
        for(int i=0; i&amp;lt;dataLen/8; i++)
        {
            *pp3 = _mm256_add_ps(_mm256_mul_ps(*pp1, *pp2), *pp3);
            pp1++;  pp2++;  pp3++;
        }
    }
    timePassed = clock() - t;
    std::cout&amp;lt;&amp;lt;"AVX time used: "&amp;lt;&amp;lt;timePassed&amp;lt;&amp;lt;"ms"&amp;lt;&amp;lt;std::endl;
    _aligned_free(buf1);
    _aligned_free(buf2);
    _aligned_free(buf3);
}&lt;/PRE&gt;
&lt;P&gt;I changed the "dataLen" and get different efficiency:&lt;/P&gt;
&lt;PRE&gt;dataLen =  400   SSE time:   758 ms   AVX time:   483 ms   SSE &amp;gt; AVX
dataLen = 2400   SSE time:  4212 ms   AVX time:  2636 ms   SSE &amp;gt; AVX
dataLen = 2864   SSE time:  6115 ms   AVX time:  6146 ms   SSE ~= AVX
dataLen = 3200   SSE time:  8049 ms   AVX time:  9297 ms   SSE &amp;lt; AVX
dataLen = 4000   SSE time: 10170 ms   AVX time: 11690 ms   SSE &amp;lt; AVX&lt;/PRE&gt;
&lt;P&gt;My L1 cache is 32 KB and my L2 cache is 1 MB. It seems that sometimes a 256-bit load is slower than a 128-bit load. Why? I get the same result if I rewrite the code with explicit SIMD load intrinsics such as "_mm256_load_ps", "_mm_load_ps", "_mm_add_ps", and so on.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Sep 2013 03:40:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924319#M3061</guid>
      <dc:creator>zhang_h_</dc:creator>
      <dc:date>2013-09-10T03:40:12Z</dc:date>
    </item>
    <item>
      <title>You haven't indicated your</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924320#M3062</link>
      <description>&lt;P&gt;You haven't indicated your processor. My guess is you are on Sandy Bridge.&lt;/P&gt;
&lt;P&gt;The next-gen Haswell will correct for this in your example with a 2x wider data bus between the CPU and the L1/L2/L3.&lt;/P&gt;
&lt;P&gt;See &lt;A href="http://www.realworldtech.com/haswell-cpu/5/"&gt;http://www.realworldtech.com/haswell-cpu/5/&lt;/A&gt; for some insight.&lt;/P&gt;
&lt;P&gt;Also, Haswell can perform FMA (Fused Multiply-Add) in one instruction (... = (B*C) + A).&lt;BR /&gt;And depending on the Haswell CPU, you can have up to 4 memory banks.&lt;/P&gt;
&lt;P&gt;Use Sandy Bridge for learning how to use _mm256 (and converting applications).&lt;BR /&gt;Use Haswell for production code.&lt;/P&gt;
&lt;P&gt;Sandy Bridge gave you a year to adapt your code for Haswell (and later).&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 15 Sep 2013 17:16:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924320#M3062</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2013-09-15T17:16:08Z</dc:date>
    </item>
    <item>
      <title>Hi Zheng</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924321#M3063</link>
      <description>&lt;P&gt;Hi Zheng&lt;/P&gt;
&lt;P&gt;Do you have a transition penalty between SSE and AVX-256 code? Maybe during the execution of your code SSE and AVX-256 instructions got intermixed in the YMM registers?&lt;/P&gt;</description>
      <pubDate>Tue, 17 Sep 2013 06:12:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924321#M3063</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-09-17T06:12:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;float *buf1 = reinterpret</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924322#M3064</link>
      <description>&amp;gt;&amp;gt;float *buf1 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;float *buf2 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;float *buf3 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));

Did you verify that these three pointers are really aligned on a 32-byte boundary? Also, you've overcomplicated the memory allocation; why do you need the reinterpret_cast C++ operator?</description>
      <pubDate>Mon, 23 Sep 2013 05:27:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924322#M3064</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-09-23T05:27:48Z</dc:date>
    </item>
    <item>
      <title>Quote:jimdempseyatthecove</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924323#M3065</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You haven't indicated your processor. My guess is you are Sandybridge.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;Use Haswell for production code.&lt;/P&gt;
&lt;P&gt;Sandybridge gave you a year to adopt your code for Haswell (and later).&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hi Jim, thanks for your reply!&lt;/P&gt;
&lt;P&gt;Yes, my processor is a Sandy Bridge (Xeon E3-1225 v2). Others can reproduce my results on Sandy Bridge processors.&lt;/P&gt;
&lt;P&gt;For extra-large arrays and a simple calculation like&lt;/P&gt;
&lt;PRE&gt;for(int i=0; i&amp;lt;1000000000; i++)
{
    A[i] += B[i]*C[i];
}&lt;/PRE&gt;
&lt;P&gt;this loop is memory-bandwidth limited, so it cannot fully show the FMA's advantage.&lt;/P&gt;
&lt;P&gt;I've tested my memory read and write speed; it is about 20 GB/s (DDR3, dual channel). I think this speed is fast, but still not enough.&lt;/P&gt;
&lt;P&gt;Can I get faster?&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 08:39:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924323#M3065</guid>
      <dc:creator>zhang_h_</dc:creator>
      <dc:date>2013-10-21T08:39:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;This loop it is memory</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924324#M3066</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;This loop it is memory band width limited,so it can not &amp;nbsp;fully display the FMA's advantage.&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Can you run VTune analysis on that code to see where the pipeline stalls are?Here I mean front-end pipeline stalls.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 08:50:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924324#M3066</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-10-21T08:50:43Z</dc:date>
    </item>
    <item>
      <title>Quote:iliyapolak wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924325#M3067</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;iliyapolak wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi Zheng&lt;/P&gt;
&lt;P&gt;Do you have a transition penalty between SSE and AVX-256 code? Maybe during the execution of your code SSE and AVX-256 instructions got intermixed in the YMM registers?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I have looked at the assembly code, and it uses the XMM registers in the SSE code and the YMM registers in the AVX code!&lt;/P&gt;
&lt;P&gt;Also, I can comment out either the SSE code or the AVX code, and the result is the same!&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 09:12:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924325#M3067</guid>
      <dc:creator>zhang_h_</dc:creator>
      <dc:date>2013-10-21T09:12:55Z</dc:date>
    </item>
    <item>
      <title>Quote:iliyapolak wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924326#M3068</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;iliyapolak wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;This loop is memory-bandwidth limited, so it cannot fully show the FMA's advantage.&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Can you run a VTune analysis on that code to see where the pipeline stalls are? Here I mean front-end pipeline stalls.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi, I don't have this software...&lt;/P&gt;
&lt;P&gt;I reached the conclusion just from some tests. For example, I reduce the size of the array and increase the iterations of the loop&lt;/P&gt;
&lt;P&gt;(keeping the total amount of calculation the same), and then I get different performance.&lt;/P&gt;
&lt;P&gt;When the whole array can be stored in the cache, the performance is the best!&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 10:12:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924326#M3068</guid>
      <dc:creator>zhang_h_</dc:creator>
      <dc:date>2013-10-21T10:12:42Z</dc:date>
    </item>
    <item>
      <title>Quote:Sergey Kostrov wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924327#M3069</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;gt;float *buf1 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));&lt;BR /&gt; &amp;gt;&amp;gt;&lt;BR /&gt; &amp;gt;&amp;gt;float *buf2 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));&lt;BR /&gt; &amp;gt;&amp;gt;&lt;BR /&gt; &amp;gt;&amp;gt;float *buf3 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));&lt;/P&gt;
&lt;P&gt;Did you verify that these three pointers are really aligned on a 32-byte boundary? Also, you've overcomplicated the memory allocation; why do you need the reinterpret_cast C++ operator?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hi, I have checked the allocated memory; it really is aligned on a 32-byte boundary.&lt;/P&gt;
&lt;P&gt;If I request 16-byte alignment instead, the allocated memory is 16-byte aligned but not 32-byte aligned.&lt;/P&gt;
&lt;P&gt;"reinterpret_cast&amp;lt;float*&amp;gt;" just converts the address to a float pointer. I am not so familiar with it...&lt;/P&gt;
&lt;P&gt;Could you please give an example of creating 32-byte aligned memory? Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 10:51:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924327#M3069</guid>
      <dc:creator>zhang_h_</dc:creator>
      <dc:date>2013-10-21T10:51:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;Hi,I don't have this</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924328#M3070</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;Hi, I don't have this software...&lt;/P&gt;
&lt;P&gt;I reached the conclusion just from some tests. For example, reduce the size of the array and increase the iterations of the loop&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;You can download a trial version of Parallel Studio. Without collecting CPU performance-counter data, it is hard to say what exactly the limiting factor is in your case.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 11:32:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924328#M3070</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-10-21T11:32:22Z</dc:date>
    </item>
    <item>
      <title>On Sandy Bridge the internal</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924329#M3071</link>
      <description>&lt;P&gt;On Sandy Bridge the internal data path is still 128 bits. See: &lt;A href="http://www.realworldtech.com/sandy-bridge/6/"&gt;http://www.realworldtech.com/sandy-bridge/6/&lt;/A&gt;&amp;nbsp;&lt;BR /&gt;Similar issue with Ivy Bridge.&lt;/P&gt;
&lt;P&gt;Haswell has expanded the internal data path to 256 bits. See: &lt;A href="http://www.hardwaresecrets.com/printpage/Inside-the-Intel-Haswell-Microarchitecture/1777"&gt;http://www.hardwaresecrets.com/printpage/Inside-the-Intel-Haswell-Microarchitecture/1777&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 12:42:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924329#M3071</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2013-10-21T12:42:22Z</dc:date>
    </item>
    <item>
      <title>Intel has not made definitive</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924330#M3072</link>
      <description>&lt;P&gt;Intel has not made definitive statements on this, but I can think of a few reasons why the 128-bit loads might be more efficient than the 256-bit loads on Sandy Bridge processors (and presumably Ivy Bridge processors as well).&lt;/P&gt;
&lt;P&gt;(1) Two 128-bit loads can be issued to the two load ports in a single cycle, while a 256-bit load is issued to one port, which is then occupied for two cycles. It is certainly plausible that the former case allows better low-level scheduling.&lt;/P&gt;
&lt;P&gt;(2) When the data is bigger than the L1 Dcache: There is evidence that the L1 Dcache can either provide 32 bytes/cycle to the core *or* receive 32 bytes/cycle to reload a line from the L2, but not both at the same time. Again, having independent 128-bit loads that can execute in a single cycle might allow better scheduling with respect to the timing of the L1 Dcache refills from the L2 than having 256-bit loads that occupy the L1 for two cycles.&lt;/P&gt;
&lt;P&gt;(3) Intel has not disclosed enough details about the L1 Data Cache banking to really understand what is going on there.&amp;nbsp; It is possible that 256-bit loads hit bank conflicts more often than 128-bit loads, or that the impact of these delays is larger (because the 256-bit loads occupy a port for two cycles instead of one cycle).&lt;/P&gt;
&lt;P&gt;(4) For data bigger than L1:&amp;nbsp; The L1 hardware prefetcher is activated by "streams" of load addresses.&amp;nbsp; Using 128-bit loads gets you a "stream" of loads faster than using 256-bit loads. Since the hardware prefetchers have to start all over again for every 4 KiB page (64 cache lines), being able to start prefetching from the L2 even a few cycles earlier might make a noticeable difference.&lt;/P&gt;
&lt;P&gt;(5) It is important to check the assembly code carefully when using intrinsics!&amp;nbsp; Although these "look like" inline assembly, they are not, and the compiler may perform high-level optimizations that you don't expect.&amp;nbsp; I think that the differences seen here are real, but some of the details may depend on exactly what code the compiler decides to generate.&lt;/P&gt;
</description>
      <pubDate>Mon, 21 Oct 2013 19:34:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924330#M3072</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2013-10-21T19:34:17Z</dc:date>
    </item>
    <item>
      <title>The core LS performance isn't</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924331#M3073</link>
      <description>&lt;P&gt;The core load/store performance isn't impacted by cache size or the hardware prefetcher in terms of answering this question. If you measured the load latency of the given instructions, you'd find that 256-bit loads within a cache line have a latency of 9 clocks on Sandy Bridge/Ivy Bridge, 2 clocks more than non-256-bit loads. If your 256-bit load spans a cache-line boundary, you pay a penalty of 21 clocks on Sandy Bridge/Ivy Bridge (it's 13 clocks on Haswell). That's why this is not prudent on those parts. If you are using 128-bit vectors and you're not 16-byte aligned, a load crosses a cache-line boundary 1/4 of the time, but with 256-bit loads it happens 1/2 the time. If you're 16-byte aligned, you won't pay this line-spanning penalty in SSE/AVX-128, but you will in AVX-256, since you're not 32-byte aligned. To align, just do it in C code: add some padding to your malloc, round the address up to a 32-byte boundary (add 31 and AND with ~31), and feed your code from that pointer.&lt;/P&gt;
&lt;P&gt;perfwise&lt;/P&gt;</description>
      <pubDate>Thu, 24 Oct 2013 13:13:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924331#M3073</guid>
      <dc:creator>perfwise</dc:creator>
      <dc:date>2013-10-24T13:13:25Z</dc:date>
    </item>
    <item>
      <title>I suppose that in case of 256</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924332#M3074</link>
      <description>&lt;P&gt;I suppose that in the case of a 256-bit load, two additional cycles are needed to physically transfer the additional 16 bytes of data.&lt;/P&gt;</description>
      <pubDate>Thu, 24 Oct 2013 14:03:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924332#M3074</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-10-24T14:03:29Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924333#M3075</link>
      <description>&amp;gt;&amp;gt;...
&amp;gt;&amp;gt;could you please give an example of creating 32-byte aligned memory?..

In a Release configuration, try the &lt;STRONG&gt;_mm_malloc&lt;/STRONG&gt; and &lt;STRONG&gt;_mm_free&lt;/STRONG&gt; intrinsic functions. Here is an example:

...
int *piMemoryBlock1 = NULL;
...
piMemoryBlock1 = ( int * )_mm_malloc( 777 * sizeof( int ), 32 );
...
// Some Processing
...
if( piMemoryBlock1 != NULL )
    _mm_free( piMemoryBlock1 );
...

PS: I use these two intrinsic functions in ~95% of all memory allocation / de-allocation cases.

PS2: In the Debug configuration I use Microsoft's &lt;STRONG&gt;_malloc_dbg&lt;/STRONG&gt; and &lt;STRONG&gt;_free_dbg&lt;/STRONG&gt; functions, since they can detect memory leaks and buffer overflows.</description>
      <pubDate>Fri, 01 Nov 2013 21:09:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924333#M3075</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-11-01T21:09:00Z</dc:date>
    </item>
  </channel>
</rss>

