My code sample got messed up
bool memcopy(void* restrict dest, const void* restrict source, int size){
    //memcpy(dest, source, size);
    auto b = static_cast<char*>(dest);
    auto a = static_cast<const char*>(source);
#pragma simd
    for (int i = size; 0 < i; --i){
        *b++ = *a++;
    }
    return true;
}
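A minimal usage sketch (illustrative only; the buffer names and size below are made up): allocate the buffers with _mm_malloc so they are 64-byte aligned before calling the routine above.

#include <immintrin.h>
#include <cstring>

// Hypothetical call site: 64-byte-aligned buffers from _mm_malloc so the
// vectorizer can use aligned moves; n is an arbitrary example size.
void demo(){
    const int n = 1 << 20;
    char* src = static_cast<char*>(_mm_malloc(n, 64));
    char* dst = static_cast<char*>(_mm_malloc(n, 64));
    memset(src, 0, n);      // give the source defined contents
    memcopy(dst, src, n);   // the routine from the post above
    _mm_free(dst);
    _mm_free(src);
}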
This Ivy Bridge CPU still adheres to the original Intel AVX limitation of one 128-bit load and one store, or two loads, per clock cycle. The data would have to be at least 4-byte aligned, and preferably 16-byte aligned, in order to vectorize. As you aren't using streaming stores, you move both operands to cache, apparently 64 bits per loop iteration, and store 64 bits as well.
Your Ivy Bridge CPU is designed to handle 256-bit unaligned loads with less of a penalty than Sandy Bridge, but the compiler still will not voluntarily use 256-bit unaligned moves, and even with alignment the hardware would still split the 256-bit stores across 2 cycles. So you aren't likely to get any advantage from AVX here, and SSE nontemporal could be good enough.
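A minimal sketch of what an explicit SSE nontemporal copy could look like (an illustration, not the compiler's actual code generation; it assumes both pointers are 16-byte aligned and the size is a multiple of 16 bytes):

#include <emmintrin.h>  // SSE2 intrinsics

// Sketch only: dest and source are assumed 16-byte aligned and size a
// multiple of 16. _mm_stream_si128 writes around the cache (streaming
// store); the trailing sfence orders the streamed data.
void copy_nt(char* dest, const char* source, int size){
    for (int i = 0; i < size; i += 16){
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(source + i));
        _mm_stream_si128(reinterpret_cast<__m128i*>(dest + i), v);
    }
    _mm_sfence();
}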
The fast_memcpy which comes with Intel compilers checks several possibilities for relative alignment and engages simd moves with peeling for alignment if applicable, but will not switch to streaming stores unless the length is a large fraction of cache size, because it doesn't know whether your context would favor use of streaming stores for shorter lengths. If the operands aren't relatively aligned such that 4-byte alignment can be achieved, all that effort is wasted, and your simple code is likely to be better. On the other hand, if you can assert alignment in your code, that also could show quicker startup than the library memcpy.
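If alignment can be guaranteed by the caller, a hedged sketch of asserting it to the compiler (using the Intel-specific __assume_aligned; adapt to your own buffers) could look like this:

// Sketch, assuming the caller really does pass 16-byte-aligned pointers;
// the assertion lets the compiler skip the runtime alignment peel.
bool memcopy_aligned(void* dest, const void* source, int size){
    auto b = static_cast<char*>(dest);
    auto a = static_cast<const char*>(source);
    __assume_aligned(a, 16);
    __assume_aligned(b, 16);
#pragma simd
    for (int i = 0; i < size; ++i){
        b[i] = a[i];
    }
    return true;
}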
My CPU is a Sandy Bridge-E
This is the assembly code from my read test which goes about 18.4GB/s single threaded:
vpaddq xmm15,xmm15,xmmword ptr [rax+rdx*8]
vpaddq xmm14,xmm14,xmmword ptr [rax+rdx*8+10h]
vpaddq xmm1,xmm1,xmmword ptr [rax+rdx*8+20h]
vpaddq xmm5,xmm5,xmmword ptr [rax+rdx*8+30h]
vpaddq xmm2,xmm2,xmmword ptr [rax+rdx*8+40h]
vpaddq xmm4,xmm4,xmmword ptr [rax+rdx*8+50h]
vpaddq xmm3,xmm3,xmmword ptr [rax+rdx*8+60h]
vpaddq xmm13,xmm13,xmmword ptr [rax+rdx*8+70h]
vpaddq xmm6,xmm6,xmmword ptr [rax+rdx*8+80h]
vpaddq xmm8,xmm8,xmmword ptr [rax+rdx*8+90h]
vpaddq xmm7,xmm7,xmmword ptr [rax+rdx*8+0A0h]
vpaddq xmm12,xmm12,xmmword ptr [rax+rdx*8+0B0h]
vpaddq xmm9,xmm9,xmmword ptr [rax+rdx*8+0C0h]
vpaddq xmm11,xmm11,xmmword ptr [rax+rdx*8+0D0h]
vpaddq xmm10,xmm10,xmmword ptr [rax+rdx*8+0E0h]
vpaddq xmm0,xmm0,xmmword ptr [rax+rdx*8+0F0h]
Isn't vpaddq an AVX2 instruction? If so, how is it working on a Sandy Bridge-E?
You must have done something which you haven't explained in your compilation. Still, if you didn't achieve streaming stores, you won't get full performance.
This is an AVX2 instruction.
CommanderLake wrote:
This is the assembly code from my read test which goes about 18.4GB/s single threaded:
vpaddq xmm15,xmm15,xmmword ptr [rax+rdx*8]
vpaddq xmm14,xmm14,xmmword ptr [rax+rdx*8+10h]
vpaddq xmm1,xmm1,xmmword ptr [rax+rdx*8+20h]
vpaddq xmm5,xmm5,xmmword ptr [rax+rdx*8+30h]
vpaddq xmm2,xmm2,xmmword ptr [rax+rdx*8+40h]
vpaddq xmm4,xmm4,xmmword ptr [rax+rdx*8+50h]
vpaddq xmm3,xmm3,xmmword ptr [rax+rdx*8+60h]
vpaddq xmm13,xmm13,xmmword ptr [rax+rdx*8+70h]
vpaddq xmm6,xmm6,xmmword ptr [rax+rdx*8+80h]
vpaddq xmm8,xmm8,xmmword ptr [rax+rdx*8+90h]
vpaddq xmm7,xmm7,xmmword ptr [rax+rdx*8+0A0h]
vpaddq xmm12,xmm12,xmmword ptr [rax+rdx*8+0B0h]
vpaddq xmm9,xmm9,xmmword ptr [rax+rdx*8+0C0h]
vpaddq xmm11,xmm11,xmmword ptr [rax+rdx*8+0D0h]
vpaddq xmm10,xmm10,xmmword ptr [rax+rdx*8+0E0h]
vpaddq xmm0,xmm0,xmmword ptr [rax+rdx*8+0F0h]
Isn't vpaddq an AVX2 instruction? If so, how is it working on a Sandy Bridge-E?
Looks like 16x unrolling.
I agree with Tim Prince. I do not think you are showing enough of your program or the compiler options used.
vpaddq xmm15,xmm15,xmmword ptr [rax+rdx*8]
vpaddq xmm14,xmm14,xmmword ptr [rax+rdx*8+10h]
vpaddq xmm1,xmm1,xmmword ptr [rax+rdx*8+20h]
vpaddq xmm5,xmm5,xmmword ptr [rax+rdx*8+30h]
vpaddq xmm2,xmm2,xmmword ptr [rax+rdx*8+40h]
vpaddq xmm4,xmm4,xmmword ptr [rax+rdx*8+50h]
vpaddq xmm3,xmm3,xmmword ptr [rax+rdx*8+60h]
vpaddq xmm13,xmm13,xmmword ptr [rax+rdx*8+70h]
vpaddq xmm6,xmm6,xmmword ptr [rax+rdx*8+80h]
vpaddq xmm8,xmm8,xmmword ptr [rax+rdx*8+90h]
vpaddq xmm7,xmm7,xmmword ptr [rax+rdx*8+0A0h]
vpaddq xmm12,xmm12,xmmword ptr [rax+rdx*8+0B0h]
vpaddq xmm9,xmm9,xmmword ptr [rax+rdx*8+0C0h]
vpaddq xmm11,xmm11,xmmword ptr [rax+rdx*8+0D0h]
vpaddq xmm10,xmm10,xmmword ptr [rax+rdx*8+0E0h]
vpaddq xmm0,xmm0,xmmword ptr [rax+rdx*8+0F0h]
In the above, note that each successive vpaddq advances by +10h (16 bytes/128 bits) and uses the SSE registers xmm..
AVX registers are ymm.., and they should advance by +20h (32 bytes/256 bits).
Plus the disassembly above is not representative of your copy.
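In intrinsics terms, the disassembly corresponds to a 128-bit accumulation roughly like the sketch below (illustrative only, not your actual source); a genuine 256-bit AVX version would use __m256i/ymm and consume 32 bytes per step:

#include <emmintrin.h>  // 128-bit integer intrinsics (VEX-encoded as vpaddq here)
#include <cstddef>

// Illustrative 128-bit read loop: each step consumes 16 bytes (+10h),
// matching the xmm disassembly above; remainder handling is omitted.
unsigned long long sum128(const unsigned long long* p, std::size_t n){
    __m128i acc = _mm_setzero_si128();
    for (std::size_t i = 0; i + 2 <= n; i += 2){
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
        acc = _mm_add_epi64(acc, v);
    }
    __m128i hi = _mm_unpackhi_epi64(acc, acc);   // fold the two 64-bit lanes
    acc = _mm_add_epi64(acc, hi);
    return static_cast<unsigned long long>(_mm_cvtsi128_si64(acc));
}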
Jim Dempsey
I can turn streaming stores on and off with the pragmas vector nontemporal/temporal, and the fastest I can get is 18.5GB/s read, 16GB/s write and 8.5GB/s copy with these methods:
#include "RAM Speed++.h" #include <memory.h> #include <immintrin.h> unsigned long long* data0; unsigned long long* data1; bool Create(){ //data0=static_cast<unsigned long long*>(_mm_malloc(1073741824,64)); //data1=static_cast<unsigned long long*>(_mm_malloc(1073741824,64)); //auto dval=static_cast<unsigned long long*>(_mm_malloc(8, 64)); data0=new unsigned long long[134217728]; data1=new unsigned long long[134217728]; unsigned long long dval=0; #pragma ivdep #pragma vector nontemporal for(unsigned long long i=0; i<134217728; ++i){ data1=dval; } return true; } bool Destroy(){ delete[] data0; delete[] data1; //_mm_free(data0); //_mm_free(data1); return true; } bool Write(){ //auto dval=static_cast<unsigned long long*>(_mm_malloc(8, 64)); //dval[0]=0; #pragma ivdep #pragma vector nontemporal for(unsigned long long i=0; i<134217728; ++i){ data0=0; } return true; } bool Read(){ //auto tmp = static_cast<unsigned long long*>(_mm_malloc(8, 64)); //tmp[0]=0; unsigned long long tmp=0; for(unsigned long long i=0; i<134217728; ++i){ tmp+=data0; } return tmp>0; } bool Copy(){ #pragma ivdep #pragma vector temporal #pragma simd for(unsigned long long i=0; i<134217728; ++i){ data1=data0; } return true; }
As you can see from the commented-out code, I have tried aligning the test data and it makes absolutely ZERO difference; writing data is actually faster when it's temporal.
I build it with /O3, /Oi, /Ot, /Quse-intel-optimized-headers, /QxAVX, /Qstd=c++11, /Qansi-alias and /Qunroll:16 all set in the VS2010 project options with ICL 14.0.
/Qunroll:4 is more often close to optimum, even for cases which benefit from unrolling (and that depends on the vector remainder feature improved in ICL 15.0). Your source code seemed more C++-like than C99 (I tried both). I didn't try editing restrict to __restrict, which seems better supported among MSVC++/ICL/g++. I don't have a pure AVX CPU.
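For instance, the declaration with the more widely supported spelling (a one-line sketch) would be:

// Same signature, using the __restrict spelling accepted by MSVC++/ICL/g++ in C++ mode.
bool memcopy(void* __restrict dest, const void* __restrict source, int size);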
You've posed a confusing situation as to whether you are testing on Sandy Bridge, Ivy Bridge, or Haswell. The CPU model you quoted shows up as Ivy Bridge on ark.intel.com.
The new operators ought to produce at least 16-byte alignment by default when compiling for Intel64, and the compiler would take that into account within the scope visible to it. #pragma simd normally has the effect of disabling fast_memcpy substitution so that nontemporal or streaming-store options can take effect.
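A hedged sketch of putting those pieces together, an explicitly aligned allocation plus the pragmas, might look like the following (the size and function name are placeholders, not from the original program):

#include <immintrin.h>
#include <cstddef>

// Sketch: 64-byte-aligned buffer from _mm_malloc so no peel loop is needed,
// #pragma simd to suppress the fast_memcpy/memset substitution, and
// #pragma vector nontemporal to request streaming stores.
void fill_nontemporal(){
    const std::size_t n = 134217728;
    unsigned long long* buf = static_cast<unsigned long long*>(
        _mm_malloc(n * sizeof(unsigned long long), 64));
#pragma simd
#pragma vector nontemporal
    for (std::size_t i = 0; i < n; ++i){
        buf[i] = 0;
    }
    _mm_free(buf);
}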
It is C++ and unrolling to 8 or 16 improves read performance slightly over 4.
I know my CPU is an i7 3820 and I know it's a Sandy Bridge-E, as confirmed by CPU-Z.
I made the copy method slightly faster with _mm_prefetch:
bool Copy(){
#pragma ivdep
#pragma vector temporal
#pragma simd
    for(unsigned long long i=0; i<134217728; ++i){
        _mm_prefetch(reinterpret_cast<char*>(data0+i+2), _MM_HINT_T2);
        data1[i]=data0[i];
    }
    return true;
}
The improvement is something like 8.37 to 8.57GB/s; T2 is slightly faster than T0.
I still don't understand what's limiting the bandwidth; that's all I want to know.
Note: this code sample is from my bandwidth testing program; it is unrelated to the code in the original post.
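One variation worth trying (a sketch under the assumption that a larger prefetch distance helps; the distance is a guess, not a measured optimum): prefetch a couple of cache lines ahead rather than two elements ahead, since data0+i+2 is only 16 bytes ahead and usually falls inside the cache line already being loaded:

bool Copy(){
#pragma ivdep
#pragma vector temporal
#pragma simd
    for(unsigned long long i=0; i<134217728; ++i){
        // Prefetch ~2 cache lines (16 x 8 bytes) ahead; tune this distance.
        _mm_prefetch(reinterpret_cast<char*>(data0+i+16), _MM_HINT_T2);
        data1[i]=data0[i];
    }
    return true;
}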
I decided to just use memset, memcmp and memcpy instead, as it's all more reliable and consistent that way. Anyway, with memcmp I get about 20GB/s read bandwidth, so I take it I won't get more than that with a single thread, but what's the limiting factor here, the RAM latency?
>>>I take it I won't get more than that with a single thread, but what's the limiting factor here, the RAM latency?>>>
What are your RAM specifications?
Interesting results.
>>>16GB quad channel DDR3 2133MHz 9 11 11 25>>>
I suppose that you are fully utilizing only one memory channel. I think that total memory bandwidth will depend directly on the buffer data size and, of course, on the available bandwidth within an infinitesimally small (one memory cycle) time period. I do not know how the channel arbiter works, and I can only suppose that it can utilize additional channels when one channel is saturated.
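For rough numbers (back-of-the-envelope arithmetic, assuming DDR3-2133 with an 8-byte bus per channel):

    one channel:   2133 MT/s x 8 bytes = ~17.1 GB/s theoretical peak
    four channels: 4 x ~17.1 GB/s      = ~68.3 GB/s theoretical peak

On those figures, ~18-20GB/s of single-threaded reads is already slightly above one channel's peak but well below the four-channel peak.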
>>>This must be because the load and store ports can only transfer 64 bits per cycle and there's only 1 store port per core, so that's what's limiting write bandwidth, but there are 2 load ports so I should get about 31.8GB/s read, but I'm about 10GB/s short, so where's the limit? Intel I'm looking at you>>>
Probably the store ports are being used by other threads' memory write operations, or part of the available store bandwidth is reserved by the memory controller (MC).
Try asking your question on the performance and tuning forum, where the well-known expert John McCalpin could be more helpful.