The answer to the second part: the AVX instruction set reference clearly states that using VZEROALL/VZEROUPPER when transitioning from 256-bit to 128-bit code avoids a performance penalty. I tried both with and without them; removing them does not improve performance either.
The first part: if a single 256-bit load were no faster than two 128-bit loads, that would still be acceptable, but as I mentioned earlier, the AVX version takes about 10 times longer than the SSE version. Something must be wrong!
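(Re the VZEROALL/VZEROUPPER point above: a minimal sketch of where _mm256_zeroupper() fits with intrinsics. The routine names are illustrative, and the legacy SSE routine is assumed to live in a translation unit compiled without AVX, so it uses non-VEX encodings.)
#include <immintrin.h>
// Illustrative stand-in: in practice this would be compiled without AVX,
// so it uses legacy (non-VEX) SSE encodings.
void legacy_sse_routine(float *p)
{
    __m128 w = _mm_loadu_ps(p);
    _mm_storeu_ps(p, _mm_add_ps(w, w));
}
void avx_then_sse(float *p)
{
    __m256 v = _mm256_loadu_ps(p);              // 256-bit work leaves the upper YMM halves "dirty"
    _mm256_storeu_ps(p, _mm256_add_ps(v, v));
    _mm256_zeroupper();                         // clears the upper halves before legacy SSE code runs
    legacy_sse_routine(p);
}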
For example, the Intel compiler expands _mm256_loadu_ps and _mm256_storeu_ps into a series of instructions (128-bit moves / inserts); a sketch of that split is below.
That may be part of the explanation if you work with unaligned data.
For better advice I'd suggest posting the actual code of your inner loop / hotspot.
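(As an illustration of that split, not the compiler's exact output: an unaligned 256-bit load/store written out as two 128-bit halves with intrinsics.)
#include <immintrin.h>
// Roughly what a compiler may emit for _mm256_loadu_ps / _mm256_storeu_ps
static inline __m256 loadu_256_as_two_128(const float *p)
{
    __m128 lo = _mm_loadu_ps(p);                              // low 128 bits
    __m128 hi = _mm_loadu_ps(p + 4);                          // high 128 bits
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}
static inline void storeu_256_as_two_128(float *p, __m256 v)
{
    _mm_storeu_ps(p, _mm256_castps256_ps128(v));              // low half
    _mm_storeu_ps(p + 4, _mm256_extractf128_ps(v, 1));        // high half
}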
The code is more or less a replica of the pseudocode I wrote in the first post: a loop with a series of loads and stores and a few basic operations in between. For simplicity I have removed those operations and am testing just the loads/stores. The question is not about the performance improvement with AVX; it is about why the performance degrades to half or a quarter of the 128-bit SIMD performance. Here it is:
// SSE-128: repeated Loop_count times
{
    //Reg = _mm_loadu_si128(POINTER);          // unaligned variant, commented out
    //_mm_storeu_si128(POINTER, Reg);
    Reg = _mm_load_si128(POINTER);
    _mm_store_si128(POINTER, Reg);
    POINTER = POINTER + 8;                     // advance by 8 shorts = 16 bytes
}
// AVX-256: repeated Loop_count/2 times
{
    //Reg = _mm256_loadu_si256(POINTER);       // unaligned variant, commented out
    //_mm256_storeu_si256(POINTER, Reg);
    Reg = _mm256_load_si256(POINTER);
    _mm256_store_si256(POINTER, Reg);
    POINTER = POINTER + 16;                    // advance by 16 shorts = 32 bytes
}
It would be more useful to have the full (compilable) code for your example. I'd like to see your memory allocation, the exact type of "POINTER" (I suppose the casts are missing in your example code), the typical loop counts (i.e. cache footprint), ...
Anyway, do you work with 32B-aligned addresses? I ask because it's possible that _mm256_load_si256 is generating VMOVDQU instead of VMOVDQA (seen with the Intel compiler), so 32B alignment isn't certain simply from looking at your example.
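(One quick way to rule alignment out, assuming the Windows-style _aligned_malloc used later in this thread; the buffer size is illustrative.)
#include <stdint.h>
#include <stdio.h>
#include <malloc.h>                            // _aligned_malloc / _aligned_free
int main(void)
{
    short *buf = (short *) _aligned_malloc(1024 * sizeof(short), 32);   // request 32B alignment
    printf("32B aligned: %s\n", ((uintptr_t) buf & 31) == 0 ? "yes" : "no");
    _aligned_free(buf);
    return 0;
}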
The code is placed below. Is there anything else required? The AVX performance is worse than the SSE-128 performance.
main()
{
    long int Buf1_Size = 1024, Buf2_Size = 1024;
    short *Buf1_ptr = _aligned_malloc(Buf1_Size, 32);
    short *Buf2_ptr = _aligned_malloc(Buf2_Size, 32);
    int counter;
    //SSE-128
    for (counter = 0; counter < Buf1_Size; counter += 8)
    {
        __m128i A;
        A = _mm_load_si128((__m128i *) (Buf1_ptr + counter));
        _mm_store_si128((__m128i *) (Buf2_ptr + counter), A);
    }
    //AVX-256
    for (counter = 0; counter < Buf1_Size; counter += 16)
    {
        __m256i B;
        B = _mm256_load_si256((__m256i *) (Buf1_ptr + counter));
        _mm256_store_si256((__m256i *) (Buf2_ptr + counter), B);
    }
}
Sorry, there were a lot of errors in the code sent previously. You can use this one.
#include <immintrin.h>
#include <malloc.h>
int main()
{
    long int Buf1_Size = 1024*1024, Buf2_Size = 1024*1024;
    short *Buf1_ptr = (short *) _aligned_malloc(Buf1_Size, 16);
    short *Buf2_ptr = (short *) _aligned_malloc(Buf2_Size, 16);
    int counter;
    //SSE-128
    for (counter = 0; counter < 1024; counter += 8)
    {
        __m128i A;
        A = _mm_load_si128((__m128i *) (Buf1_ptr + counter));
        _mm_store_si128((__m128i *) (Buf2_ptr + counter), A);
    }
    _aligned_free(Buf1_ptr);
    _aligned_free(Buf2_ptr);
    // reallocate with 32B alignment for the 256-bit aligned loads/stores
    Buf1_ptr = (short *) _aligned_malloc(Buf1_Size, 32);
    Buf2_ptr = (short *) _aligned_malloc(Buf2_Size, 32);
    //AVX-256
    for (counter = 0; counter < 1024; counter += 16)
    {
        __m256i B;
        B = _mm256_load_si256((__m256i *) (Buf1_ptr + counter));
        _mm256_store_si256((__m256i *) (Buf2_ptr + counter), B);
    }
    _aligned_free(Buf1_ptr);
    _aligned_free(Buf2_ptr);
    return 0;
}
Just a question: why do you allocate 1 MB buffers and iterate over only 2 KB?
void Copy128(const short *Buf1_ptr, short *Buf2_ptr)
{
for (int counter=0; counter<1024; counter+=8)
{
const __m128i A = _mm_load_si128((const __m128i *)(Buf1_ptr + counter));
_mm_store_si128((__m128i *) (Buf2_ptr + counter), A);
}
}
void Copy256(const short *Buf1_ptr, short *Buf2_ptr)
{
for (int counter=0; counter<1024; counter+=16)
{
const __m256i B = _mm256_load_si256((const __m256i *)(Buf1_ptr + counter));
_mm256_store_si256((__m256i *)(Buf2_ptr + counter), B);
}
}
Here is the ASM generated by the Intel compiler (32-bit mode):
SSE :
.B1.2: ; Preds .B1.2 .B1.1
;;; {
movdqa xmm0, XMMWORD PTR [ecx+eax*2] ;303.56
movdqa XMMWORD PTR [edx+eax*2], xmm0 ;304.34
add eax, 8 ;301.37
cmp eax, 1024 ;301.31
jl .B1.2 ; Prob 99% ;301.31
AVX-128 :
.B1.2: ; Preds .B1.2 .B1.1
;;; for (int counter=0; counter<1024; counter+=8)
;;; {
mov esi, eax ;303.23
inc eax ;301.3
shl esi, 5 ;303.23
cmp eax, 64 ;301.3
vmovdqu xmm0, XMMWORD PTR [ecx+esi] ;299.6
vmovdqu XMMWORD PTR [edx+esi], xmm0 ;299.6
vmovdqu xmm1, XMMWORD PTR [16+ecx+esi] ;299.6
vmovdqu XMMWORD PTR [16+edx+esi], xmm1 ;299.6
jb .B1.2 ; Prob 98% ;301.3
AVX-256 :
.B2.2: ; Preds .B2.2 .B2.1
mov esi, eax ;313.22
inc eax ;311.2
shl esi, 6 ;313.22
cmp eax, 32 ;311.2
vmovdqu ymm0, YMMWORD PTR [ecx+esi] ;313.22
vmovdqu YMMWORD PTR [edx+esi], ymm0 ;314.4
vmovdqu ymm1, YMMWORD PTR [32+ecx+esi] ;313.22
vmovdqu YMMWORD PTR [32+edx+esi], ymm1 ;314.4
jb .B2.2 ; Prob 96% ;311.2
I got these timings on a Core i7 2600K at stock clock with speedstep and turbo disabled:
runtime for 10,000,000 runs :
SSE : 818 ms
AVX-128 : 654 ms
AVX-256: 420 ms
So AVX-256 is nearly twice as fast as SSE in this particular example. I'm pretty surprised, since I expected the difference to be smaller due to the L1D cache bandwidth limitation.
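(The timing harness isn't shown above; a minimal sketch of one way to reproduce such a measurement, assuming the Copy128/Copy256 routines above and std::chrono for timing.)
#include <immintrin.h>
#include <malloc.h>                            // _aligned_malloc / _aligned_free
#include <chrono>
#include <cstring>
#include <cstdio>
void Copy128(const short *Buf1_ptr, short *Buf2_ptr);   // defined above
void Copy256(const short *Buf1_ptr, short *Buf2_ptr);   // defined above
int main()
{
    short *src = (short *) _aligned_malloc(1024 * sizeof(short), 32);
    short *dst = (short *) _aligned_malloc(1024 * sizeof(short), 32);
    std::memset(src, 1, 1024 * sizeof(short));
    auto t0 = std::chrono::steady_clock::now();
    for (int run = 0; run < 10000000; ++run)
        Copy256(src, dst);                     // or Copy128 for the 128-bit case
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("%lld ms\n", (long long) ms);
    _aligned_free(src);
    _aligned_free(dst);
    return 0;
}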
Thanks Again.
the timings I get are very repeatable(note that I run the outer loop 10 million times),so I supposeyou're using a different compiler, it will be interesting to see the ASM for your inner loops and to knowthe absolutetimings you get(same array size, 10,000,000 iterations at 3.4 GHz for an easy comparison)
NB: the throughput I measured for the Copy256 case is very near the theoretical peak L1D$write bandwidth of 16B/clock
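(A rough back-of-envelope check of that claim using the numbers above; the arithmetic is mine, not from the original measurement.)
Copy256 writes 1024 shorts = 2048 B per call
2048 B / (16 B/clock) = 128 clocks per call
128 clocks x 10,000,000 calls / 3.4 GHz ≈ 376 ms
which is indeed close to the 420 ms measured for AVX-256 above.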
Thanks.
D
Oughhh, if you were using an emulation library, that is clearly the explanation for your strange findings. Hint: it's always a good idea to look at the ASM dumps to understand performance issues.
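(For what it's worth, typical ways to get an assembly listing, depending on the compiler; the file name is illustrative.)
icl /Fa test.cpp        // Intel compiler on Windows: writes test.asm
cl /FA test.cpp         // MSVC: writes test.asm
icc -S test.cpp         // Intel compiler on Linux: writes test.s
gcc -S -O2 test.c       // GCC: writes test.s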