Hi
I'm studying and trying to use AVX-256/512 instructions and intrinsics, but I could not find a good demo or example for beginners. A simple example project with C code and AVX-related asm code that I can run would help a lot. Could you send me one such example project?
Thank you
John
I wrote two blogs that fall into that category:
- Processing Arrays of Bits with Intel® Advanced Vector Extensions 2 (Intel® AVX2)
- Processing Arrays of Bits with Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
Does this match what you are looking for?
Hi Thomas,
Thank you for the fast replies and good help. Your blogs help a lot.
One more question, related to asm code: I am also trying to compile and run asm code. I added an asm file (with the .asm extension) to the Source Files of the VS project (I use VS2013 integrated with Intel Parallel Studio XE 2015) and tried to build and run it, but it looks like the project compiles all the C/C++ source files except the asm file. Did I do something wrong? Could you clarify, or send me a related example project if you have one?
Thank you
John
You can use inline assembly instead or, better, use intrinsics.
http://stackoverflow.com/questions/4548763/compiling-assembly-in-visual-studio
Hi iliyapolak,
Thanks a lot for your help. I tried inline assembly, but when I run it in release mode there is a problem. Could you take a look and see what is wrong?
Below is the general C code:
void complexVec_dotProduct_General_C(float *inputPtr1, float *inputPtr2, float *outputPtr, int numData)
{
    int idxData;
    float data1Re, data2Re;
    __assume_aligned(inputPtr1, 64);
    __assume_aligned(inputPtr2, 64);
    __assume_aligned(outputPtr, 64);
    /* #pragma ivdep applies to the loop that directly follows it */
    #pragma ivdep
    for (idxData = 0; idxData < numData; idxData++)
    {
        data1Re = inputPtr1[idxData];
        data2Re = inputPtr2[idxData];
        outputPtr[idxData] = data1Re * data2Re;
    }
    return;
}
Below is the inline assembly version, but it looks like I pass the arguments wrong. Can you show me how to pass them to the inline assembly code?
void complexVec_dotProduct_Asm_AVX(float *inputPtr1, float *inputPtr2, float *outputPtr, int numData)
{
    int idxData;
    #pragma ivdep
    __assume_aligned(inputPtr1, 64);
    __assume_aligned(inputPtr2, 64);
    __assume_aligned(outputPtr, 64);
    for (idxData = 0; idxData < numData; idxData += 8)
    {
        //inData1 = _mm256_load_ps(inputPtr1 + idxData); //loads aligned array inputPtr1 into inData1
        //inData2 = _mm256_load_ps(inputPtr2 + idxData); //loads aligned array inputPtr2 into inData2
        //outData = _mm256_mul_ps(inData1, inData2); //performs multiplication
        __asm vmovaps ymm0, YMMWORD PTR[rcx + rax * 4]
        __asm vmulps ymm1, ymm0, YMMWORD PTR[rdx + rax * 4]
        __asm vmovups YMMWORD PTR[r8 + rax * 4], ymm1
    }
    return;
}
Thank you
John
You need to load a 32-bit or 64-bit general-purpose register with the address of inputPtr1, and do the same for the other float arrays.
    xor esi,esi
    mov esi, inputPtr1
    vmovups ymm0, ymmword ptr [esi]
Moreover, you can put all of the inline assembly code inside a single asm block.
_asm
{
    //asm code
    ........
}
Hi iliyapolak,
Thanks for the suggestion, I found the issue. Actually, inputPtr1, inputPtr2 and outputPtr are passed in rcx, rdx and r8 by default, so there is no need to load them; the issue is how I used rax. Your suggestion pointed me to the possible issue, though.
I should zero rax with xor rax, rax and do add rax, 32 in the loop, so the correct version should look like the one below.
void complexVec_dotProduct_Asm_AVX(float *inputPtr1, float *inputPtr2, float *outputPtr, int numData)
{
    int idxData;
    #pragma ivdep
    __assume_aligned(inputPtr1, 64);
    __assume_aligned(inputPtr2, 64);
    __assume_aligned(outputPtr, 64);
    __asm
    {
        xor rax, rax
    }
    for (idxData = 0; idxData < numData; idxData += 8)
    {
        //inData1 = _mm256_load_ps(inputPtr1 + idxData); //loads aligned array inputPtr1 into inData1
        //inData2 = _mm256_load_ps(inputPtr2 + idxData); //loads aligned array inputPtr2 into inData2
        //outData = _mm256_mul_ps(inData1, inData2); //performs multiplication
        __asm
        {
            vmovaps ymm0, YMMWORD PTR[rcx + rax]
            vmulps ymm1, ymm0, YMMWORD PTR[rdx + rax]
            vmovups YMMWORD PTR[r8 + rax], ymm1
            add rax, 32
        }
    }
    return;
}
Thank you
John
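For comparison, the same element-wise product can be written with the intrinsics that appear in the commented-out lines of the function above, which leaves argument passing and register allocation to the compiler. This is only a minimal sketch, not the code from the thread: the name `vec_mul_avx` and the `AVX_TARGET` macro are hypothetical, the unaligned `_mm256_loadu_ps`/`_mm256_storeu_ps` calls are an assumption made so the sketch does not depend on the buffers being aligned, and the scalar tail loop is added for lengths that are not a multiple of 8.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper macro: lets GCC/Clang compile this one function
   with AVX enabled even without -mavx; other compilers ignore it. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical name. Computes out[i] = a[i] * b[i] for i in [0, n). */
AVX_TARGET
void vec_mul_avx(const float *a, const float *b, float *out, int n)
{
    int i = 0;
    assert(n >= 0);

    /* Main loop: 8 floats (one 256-bit register) per iteration. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   /* unaligned load of a[i..i+7] */
        __m256 vb = _mm256_loadu_ps(b + i);   /* unaligned load of b[i..i+7] */
        _mm256_storeu_ps(out + i, _mm256_mul_ps(va, vb));
    }

    /* Scalar tail for the remaining 0..7 elements. */
    for (; i < n; i++)
        out[i] = a[i] * b[i];
}
```

A call such as `vec_mul_avx(inputPtr1, inputPtr2, outputPtr, numData)` should give the same results as the plain C loop. If the buffers are known to be 32-byte aligned, `_mm256_load_ps` and `_mm256_store_ps` could be used instead.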
Hi
I also have another question about the core loop below:
    for (idxData = 0; idxData < numData; idxData++)
    {
        data1Re = inputPtr1[idxData];
        data2Re = inputPtr2[idxData];
        outputPtr[idxData] = data1Re * data2Re;
    }
With the release configuration, AVX-I, and optimization turned on, I see the compiler generate the asm code below in the asm file:
    xor rax, rax
    movsxd r9, r9d
    test r9, r9
    jl LOOP_END
LOOP_START:
    vmovaps ymm0, YMMWORD PTR[rcx + rax * 4]
    vmulps ymm1, ymm0, YMMWORD PTR[rdx + rax * 4]
    vmovups YMMWORD PTR[r8 + rax * 4], ymm1
    add rax, 8
    cmp rax, r9
    jl LOOP_START
LOOP_END:
    vzeroupper
It looks to me that the data load and store and the loop jump all add overhead here for each iteration. Each iteration seems to take 6 clocks, as if the asm code is executed serially, with no parallelism or pipelining. Can anyone help explain?
Thank you
John
>>> It looks to me that, data load and store, loop jump all needs overhead here, for each iteration>>>
The so-called "loop overhead" will be executed in parallel with the AVX code. This is possible because micro-ops are issued to the execution stacks through different execution ports. For example, the cmp/jl pair is probably fused and will possibly be executed on Port 6 (Haswell arch.); before that, add rax, 8 can be executed on the same Port 6, so the loop bookkeeping occupies only one port. The FP multiplication will be executed on either Port 0 or Port 1. If the FP mul instructions are independent, the compiler can unroll the loop so that two FP multiplications can be scheduled on Port 0 and Port 1 at the same time.
http://www.anandtech.com/show/6355/intels-haswell-architecture/8
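The unrolling described above can also be sketched by hand with intrinsics, so that each iteration contains two multiplications that do not depend on each other. This is an illustration under assumptions rather than the compiler's actual output: the name `vec_mul_avx_unroll2` and the `AVX_TARGET` macro are hypothetical, unaligned loads and stores are used so the sketch does not rely on buffer alignment, and whether it runs faster has to be measured, since an optimizing compiler may already unroll the loop and a loop with this many loads and stores per multiply may be limited by memory traffic rather than by the multiply ports.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper macro: lets GCC/Clang compile this one function
   with AVX enabled even without -mavx; other compilers ignore it. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical name. Computes out[i] = a[i] * b[i], unrolled by two. */
AVX_TARGET
void vec_mul_avx_unroll2(const float *a, const float *b, float *out, int n)
{
    int i = 0;
    assert(n >= 0);

    /* 16 floats per iteration: p0 and p1 are independent of each other,
       so the CPU is free to execute both multiplies at the same time. */
    for (; i + 16 <= n; i += 16) {
        __m256 a0 = _mm256_loadu_ps(a + i);
        __m256 a1 = _mm256_loadu_ps(a + i + 8);
        __m256 b0 = _mm256_loadu_ps(b + i);
        __m256 b1 = _mm256_loadu_ps(b + i + 8);
        __m256 p0 = _mm256_mul_ps(a0, b0);
        __m256 p1 = _mm256_mul_ps(a1, b1);
        _mm256_storeu_ps(out + i, p0);
        _mm256_storeu_ps(out + i + 8, p1);
    }

    /* Scalar tail for the remaining 0..15 elements. */
    for (; i < n; i++)
        out[i] = a[i] * b[i];
}
```

The loop bookkeeping (add, cmp, jl) now runs once per 16 elements instead of once per 8, which is the other usual reason for unrolling.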
Hi iliyapolak,
Thanks a lot for your explanation. The link you attached is a good one.
Thank you
John
Hi WEI Z,
You are welcome.
