Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

How do I get 128 bits per clock memory access?

DLake1
New Contributor I
I have an i7 3820 at 4.3GHz with 16GB of quad channel 2133MHz CL9 RAM, and ICL v14.0. If 128 bits per clock is possible for memory access, why can I not get more than 64 bits per clock? I use this for copying memory because it's faster than memcpy and what the compiler replaces it with:

bool memcopy(void* restrict dest, const void* restrict source, int size){
    //memcpy(dest, source, size);
    auto b=static_cast<char*>(dest);
    auto a=static_cast<const char*>(source);
#pragma simd
    for (int i=size; 0 < i; --i){
        *b++=*a++;
    }
    return true;
}

But I still can't get more than 64 bits per clock no matter what options I set for the compiler.
20 Replies
DLake1
New Contributor I

My code sample got messed up

bool memcopy(void* restrict dest, const void* restrict source, int size){
    //memcpy(dest, source, size);
    auto b=static_cast<char*>(dest);
    auto a=static_cast<const char*>(source);
#pragma simd
    for (int i=size; 0 < i; --i){
        *b++=*a++;
    }
    return true;
}
TimP
Honored Contributor III

This Ivy Bridge CPU still adheres to the original Intel AVX limitation of one 128-bit load plus one store, or two loads, per clock cycle.  The data would have to be at least 4-byte aligned, and preferably 16-byte aligned, in order to vectorize.  As you aren't using streaming stores, you move both operands to cache, apparently loading 64 bits per loop iteration and storing 64 bits as well.

Your Ivy Bridge CPU is designed to handle 256-bit unaligned loads with less of a penalty than Sandy Bridge, but the compiler still will not voluntarily use 256-bit unaligned moves, and even with alignment the hardware would still split the stores across 2 cycles.  So you aren't likely to get any advantage from AVX here, and SSE nontemporal stores could be good enough.

The fast_memcpy which comes with Intel compilers checks several possibilities for relative alignment and engages simd moves with peeling for alignment if applicable, but will not switch to streaming stores unless the length is a large fraction of cache size, because it doesn't know whether your context would favor use of streaming stores for shorter lengths. If the operands aren't relatively aligned such that 4-byte alignment can be achieved, all that effort is wasted, and your simple code is likely to be better.  On the other hand, if you can assert alignment in your code, that also could show quicker startup than the library memcpy.
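To make the streaming-store point concrete, here is a minimal sketch (my own illustration for this thread, not Intel's fast_memcpy): a 128-bit copy loop using the SSE2 nontemporal store `_mm_stream_si128`, assuming both pointers are 16-byte aligned and the length is a multiple of 16 bytes.

```cpp
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence
#include <cstddef>

// Nontemporal (streaming) copy: the stores bypass the cache, so the
// destination lines are not fetched into cache before being written.
// Sketch only -- assumes dest/source are 16-byte aligned and size is
// a multiple of 16 bytes; no remainder handling.
void stream_copy16(void* dest, const void* source, std::size_t size) {
    auto* d = static_cast<__m128i*>(dest);
    const auto* s = static_cast<const __m128i*>(source);
    for (std::size_t i = 0; i < size / sizeof(__m128i); ++i)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    _mm_sfence();  // order the weakly-ordered streaming stores
}
```

With nontemporal stores the destination is not read into cache first, which roughly halves the memory traffic of a large copy; the `_mm_sfence` makes the streaming stores visible before any subsequent reads of the destination.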

DLake1
New Contributor I

My CPU is a Sandy Bridge-E

DLake1
New Contributor I

This is the assembly code from my read test which goes about 18.4GB/s single threaded:

vpaddq      xmm15,xmm15,xmmword ptr [rax+rdx*8]  
vpaddq      xmm14,xmm14,xmmword ptr [rax+rdx*8+10h] 
vpaddq      xmm1,xmm1,xmmword ptr [rax+rdx*8+20h]  
vpaddq      xmm5,xmm5,xmmword ptr [rax+rdx*8+30h]  
vpaddq      xmm2,xmm2,xmmword ptr [rax+rdx*8+40h]  
vpaddq      xmm4,xmm4,xmmword ptr [rax+rdx*8+50h]  
vpaddq      xmm3,xmm3,xmmword ptr [rax+rdx*8+60h]  
vpaddq      xmm13,xmm13,xmmword ptr [rax+rdx*8+70h] 
vpaddq      xmm6,xmm6,xmmword ptr [rax+rdx*8+80h]  
vpaddq      xmm8,xmm8,xmmword ptr [rax+rdx*8+90h]  
vpaddq      xmm7,xmm7,xmmword ptr [rax+rdx*8+0A0h]  
vpaddq      xmm12,xmm12,xmmword ptr [rax+rdx*8+0B0h]
vpaddq      xmm9,xmm9,xmmword ptr [rax+rdx*8+0C0h]  
vpaddq      xmm11,xmm11,xmmword ptr [rax+rdx*8+0D0h]
vpaddq      xmm10,xmm10,xmmword ptr [rax+rdx*8+0E0h]
vpaddq      xmm0,xmm0,xmmword ptr [rax+rdx*8+0F0h]  

Isn't vpaddq an AVX2 instruction? If so, how is it working on a Sandy Bridge-E?

TimP
Honored Contributor III

You must have done something which you haven't explained in your compilation.  Still, if you didn't achieve streaming stores, you won't get full performance.

Bernard
Valued Contributor I

The vpaddq mnemonic exists in both AVX and AVX2: the VEX-encoded 128-bit form operating on xmm registers is part of the original AVX, and only the 256-bit ymm form requires AVX2. That is why it runs on your CPU.

Bernard
Valued Contributor I

CommanderLake wrote:

This is the assembly code from my read test which goes about 18.4GB/s single threaded:

[assembly listing snipped -- same as the previous post]

Is not vpaddq an AVX2 instruction? If so how is that working on a Sandy Bridge-E?

Looks like 16x unrolling.

 

jimdempseyatthecove
Honored Contributor III

I agree with Tim Prince. I do not think you are showing enough of your program or the compiler options used.

In the disassembly you posted, note that each successive vpaddq advances by +10h (16 bytes/128 bits) and uses the xmm (SSE-width) registers.

AVX 256-bit code would use the ymm registers and advance by +20h (32 bytes/256 bits).

Also, that disassembly is not representative of your copy loop.

Jim Dempsey

 

 

DLake1
New Contributor I
Sorry, yes, that is not the assembly from the copy; it's from the read test of my little benchmark program, and unrolling is set to 16 to take advantage of the 16 128-bit AVX registers (8 256-bit). The instruction is AVX because it starts with a "v", but I searched and could only find mention of an AVX2 instruction with that name. Why does my SB-E have an AVX2 instruction?
DLake1
New Contributor I

I can turn streaming stores on and off with the pragmas vector nontemporal/temporal, and the fastest I can get is 18.5GB/s read, 16GB/s write and 8.5GB/s copy with these methods (the forum ate the [i] subscripts last time):

#include "RAM Speed++.h"
#include <memory.h>
#include <immintrin.h>
unsigned long long* data0;
unsigned long long* data1;
bool Create(){
	//data0=static_cast<unsigned long long*>(_mm_malloc(1073741824,64));
	//data1=static_cast<unsigned long long*>(_mm_malloc(1073741824,64));
	//auto dval=static_cast<unsigned long long*>(_mm_malloc(8, 64));
	data0=new unsigned long long[134217728];
	data1=new unsigned long long[134217728];
	unsigned long long dval=0;
#pragma ivdep
#pragma vector nontemporal
	for(unsigned long long i=0; i<134217728; ++i){
		data1[i]=dval;
	}
	return true;
}
bool Destroy(){
	delete[] data0;
	delete[] data1;
	//_mm_free(data0);
	//_mm_free(data1);
	return true;
}
bool Write(){
	//auto dval=static_cast<unsigned long long*>(_mm_malloc(8, 64));
	//dval[0]=0;
#pragma ivdep
#pragma vector nontemporal
	for(unsigned long long i=0; i<134217728; ++i){
		data0[i]=0;
	}
	return true;
}
bool Read(){
	//auto tmp = static_cast<unsigned long long*>(_mm_malloc(8, 64));
	//tmp[0]=0;
	unsigned long long tmp=0;
	for(unsigned long long i=0; i<134217728; ++i){
		tmp+=data0[i];
	}
	return tmp>0;
}
bool Copy(){
#pragma ivdep
#pragma vector temporal
#pragma simd
	for(unsigned long long i=0; i<134217728; ++i){
		data1[i]=data0[i];
	}
	return true;
}

As you can see from the commented code, I have tried aligning the test data and it makes absolutely ZERO difference; writing data is actually faster when it's temporal.

I build it with /O3, /Oi, /Ot, /Quse-intel-optimized-headers, /QxAVX, /Qstd=c++11, /Qansi-alias and /Qunroll:16 all set in the VS2010 project options with ICL 14.0.

TimP
Honored Contributor III

/Qunroll:4 is more often close to optimum, even for cases which benefit from unrolling (this interacts with the vector remainder handling improved in ICL 15.0).  Your source code seems more C++ than C99 (I tried both).  I didn't try changing restrict to __restrict, which is better supported across MSVC++/ICL/g++.  I don't have a pure AVX CPU.

You've posed a confusing situation as to whether you are testing on Sandy Bridge, Ivy Bridge, or Haswell.  The CPU model you quoted shows up as Ivy Bridge on ark.intel.com.

The new operator ought to produce at least 16-byte alignment by default when compiling for Intel64, and the compiler takes that into account within the scope visible to it.  #pragma simd normally has the effect of disabling the fast_memcpy substitution, so that the nontemporal/streaming-store options can take effect.
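As a sketch of "asserting alignment in your code": the Intel compiler's `__assume_aligned` builtin (GCC/Clang spell it `__builtin_assume_aligned`) tells the vectorizer it may use aligned loads without a runtime peeling loop. The function name and the 64-byte alignment here are my own choices for illustration:

```cpp
#include <cstddef>

// Promising the compiler that 'p' is 64-byte aligned lets it emit
// aligned vector loads with no peel loop. Guarded per compiler so
// the sketch builds anywhere; the promise must actually hold at the
// call site or behavior is undefined.
double sum_aligned(const double* p, std::size_t n) {
#if defined(__INTEL_COMPILER)
    __assume_aligned(p, 64);
#elif defined(__GNUC__)
    p = static_cast<const double*>(__builtin_assume_aligned(p, 64));
#endif
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
}
```

A caller would guarantee the alignment itself, e.g. with `_mm_malloc(bytes, 64)` or an `alignas(64)` array.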

DLake1
New Contributor I

It is C++ and unrolling to 8 or 16 improves read performance slightly over 4.

I know my CPU is an i7 3820 and I know it's a Sandy Bridge-E, as confirmed by CPU-Z.

I made the copy method slightly faster with _mm_prefetch:

bool Copy(){
#pragma ivdep
#pragma vector temporal
#pragma simd
	for(unsigned long long i=0; i<134217728; ++i){
		_mm_prefetch(reinterpret_cast<char*>(data0+i+2), _MM_HINT_T2);
		data1[i]=data0[i];
	}
	return true;
}

The improvement is something like 8.37 to 8.57GB/s; T2 is slightly faster than T0.

I still don't understand what's limiting the bandwidth; that's all I want to know.

Note: this code sample is from my bandwidth testing program; it is unrelated to the code in the original post.

DLake1
New Contributor I

I decided to just use memset, memcmp and memcpy instead, as it's all more reliable and consistent that way. Anyway, with memcmp I get about 20GB/s read bandwidth, so I take it I won't get more than that with a single thread. But what's the limiting factor here, the RAM latency?

Bernard
Valued Contributor I

 >>>take it I wont get more than that with a single thread but what's the limiting factor here the RAM latency?>>>

What are your RAM specifications?

DLake1
New Contributor I
16GB quad channel DDR3 2133MHz 9-11-11-25. I've been using inline assembly for memory benchmarking and I have learnt that:

1. temporal stores are faster for copying memory
2. using 1 xmm register is fastest for copying and writing
3. use prefetcht2 for copying
4. non-temporal stores are faster for just writing
5. use prefetcht0 for reading
6. use all available xmm registers for reading
7. building for 64-bit is slightly faster because there are more SSE registers available
Bernard
Valued Contributor I

Interesting results.

Bernard
Valued Contributor I

>>>16GB quad channel DDR3 2133MHz 9 11 11 25>>>

I suppose that you are fully utilizing only one memory channel. I think that total memory bandwidth will depend directly on the buffer size and, of course, on the bandwidth available in an infinitesimally small (one memory cycle) time period. I do not know how the channel arbiter works; I can only suppose that it utilizes additional channels when one channel is saturated.

DLake1
New Contributor I
If one 64-bit memory channel were able to transfer 64 bits on every cycle of the 2133MHz DDR clock, it would be transferring 15.9GB/s, and hey, that's right where my write bandwidth is at! Now if one core were to store 64 bits per clock cycle, it would be transferring 34.4GB/s, so obviously the RAM is the limit, because the CPU is only transferring 64 bits per cycle. This must be because the load and store ports can only transfer 64 bits per cycle, and there's only 1 store port per core, so that's what's limiting write bandwidth. But there are 2 load ports, so I should get about 31.8GB/s read, yet I'm about 10GB/s short, so where's the limit? Intel, I'm looking at you.
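A quick check of the arithmetic above (note the mixed units: 2133e6 transfers/s x 8 bytes is about 17.1 decimal GB/s, which matches the 15.9 figure only when expressed in binary GiB/s, while 34.4 is decimal GB/s; the function names are mine):

```cpp
// One 64-bit channel delivering 8 bytes per DDR transfer, in binary
// GiB/s: 2133e6 * 8 = 17.064e9 bytes/s decimal = ~15.9 GiB/s.
double channel_gib_per_s(double transfers_per_s) {
    return transfers_per_s * 8.0 / (1024.0 * 1024.0 * 1024.0);
}

// One core storing 8 bytes every core clock, in decimal GB/s.
double core_gb_per_s(double core_hz) {
    return core_hz * 8.0 / 1e9;
}
```

`channel_gib_per_s(2133e6)` comes out near 15.9 and `core_gb_per_s(4.3e9)` is 34.4, reproducing the two figures in the post.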
Bernard
Valued Contributor I

>>>This must be because the load and store ports can only transfer 64 bits per cycle and theres only 1 store port per core so that's what's limiting write bandwidth but there are 2 load ports so I should get about 31.8GB/s read but I'm about 10GB/s short so where's the limit? Intel I'm looking at you>>>

Probably the store ports are being used by other threads' memory writes, or part of the available store bandwidth is reserved by the memory controller.

Try asking your question on the performance and tuning forum, where the well-known expert John McCalpin could be more helpful.

DLake1
New Contributor I
Thanks. About the copy performance: I think temporal stores are faster when copying because writing to cache first makes the write cycles interfere less with the read cycles. Just a thought.