Request for a demo project using AVX asm

Wei_Z_Intel · ‎03-01-2015

Hi

I'm studying and trying to use AVX-256/512 instructions and intrinsics, but I could not find a good demo or example for beginners. A simple example project with C code and AVX-related asm code that I can run would help a lot. Could you send me one such example project?

Thank you

John

Thomas_W_Intel · ‎03-02-2015

I wrote two blog posts that fall into that category:

Processing Arrays of Bits with Intel® Advanced Vector Extensions 2 (Intel® AVX2)
Processing Arrays of Bits with Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

Does this match what you are looking for?

Wei_Z_Intel · ‎03-02-2015

Hi Thomas,

Thank you for the fast reply and the good help. Your blog posts help a lot.

One more question, related to asm code: I am also trying to compile and run asm code. I added an asm file (with the .asm extension) to the Source Files of the VS project (I use VS2013 integrated with Intel Parallel Studio XE 2015) and tried to build and run it, but the project seems to compile all the C/C++ source files except the asm file. Did I do something wrong? Could you clarify, or send me a related example project if you have one?

Thank you

John

Bernard · ‎03-02-2015

You can use inline assembly instead or, better, use intrinsics.

http://stackoverflow.com/questions/4548763/compiling-assembly-in-visual-studio
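
To make the intrinsics suggestion concrete, here is a minimal sketch of an element-wise float multiply written with AVX intrinsics. The function names (vec_mul, mul_avx, mul_scalar) are hypothetical and not from this thread. It assumes GCC or Clang on x86-64, where the target("avx") attribute enables AVX for a single function, and it falls back to plain C elsewhere. It uses unaligned loads and stores so it does not depend on buffer alignment.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>   /* AVX intrinsics: __m256, _mm256_* */
#define HAVE_AVX_PATH 1

/* Assumption: GCC or Clang. target("avx") enables AVX code generation for
   this function only, so the file builds without a global -mavx flag. */
__attribute__((target("avx")))
static void mul_avx(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    /* Main loop: 8 floats (one 256-bit register) per iteration. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   /* load, no alignment needed */
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_mul_ps(va, vb));
    }
    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++)
        out[i] = a[i] * b[i];
}
#endif

/* Plain C fallback, also the reference behaviour. */
static void mul_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}

/* Hypothetical entry point: out[i] = a[i] * b[i] for i in [0, n). */
void vec_mul(const float *a, const float *b, float *out, size_t n)
{
    assert(n == 0 || (a && b && out));
#ifdef HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) {   /* runtime CPU check */
        mul_avx(a, b, out, n);
        return;
    }
#endif
    mul_scalar(a, b, out, n);
}
```

Because the compiler handles register allocation and the calling convention, this form avoids having to know which registers hold the pointer arguments.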

Wei_Z_Intel · ‎03-04-2015

Hi iliyapolak,

Thanks a lot for your help. I tried inline assembly, but it has a problem when I run it in release mode. Could you take a look and see what is wrong?

Below is the general C code:

void complexVec_dotProduct_General_C(float *inputPtr1, float *inputPtr2, float *outputPtr, int numData)
{
    int idxData;
    float data1Re, data2Re;

    __assume_aligned(inputPtr1, 64);
    __assume_aligned(inputPtr2, 64);
    __assume_aligned(outputPtr, 64);

    #pragma ivdep  /* placed directly before the loop it applies to */
    for (idxData = 0; idxData < numData; idxData++)
    {
        data1Re = inputPtr1[idxData];
        data2Re = inputPtr2[idxData];

        outputPtr[idxData] = data1Re * data2Re;
    }

    return;
}

Below is the inline assembly version. It looks like I am passing the arguments incorrectly. Can you show me how to pass them to inline assembly code?

void complexVec_dotProduct_Asm_AVX(float *inputPtr1, float *inputPtr2, float *outputPtr, int numData)
{
    int idxData;

    #pragma ivdep
    __assume_aligned(inputPtr1, 64);
    __assume_aligned(inputPtr2, 64);
    __assume_aligned(outputPtr, 64);

    for (idxData = 0; idxData < numData; idxData += 8)
    {
        //inData1 = _mm256_load_ps(inputPtr1 + idxData); //loads aligned array inputPtr1 into inData1
        //inData2 = _mm256_load_ps(inputPtr2 + idxData); //loads aligned array inputPtr2 into inData2
        //outData = _mm256_mul_ps(inData1, inData2); //performs multiplication
        __asm vmovaps ymm0, YMMWORD PTR[rcx + rax * 4]
        __asm vmulps ymm1, ymm0, YMMWORD PTR[rdx + rax * 4]
        __asm vmovups YMMWORD PTR[r8 + rax * 4], ymm1
    }

    return;
}

Thank you

John
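
Both versions in the post above declare the buffers 64-byte aligned (__assume_aligned(..., 64)), and vmovaps faults on an address that is not 32-byte aligned, so the caller has to allocate the arrays accordingly. The following is a minimal sketch of one way to do that with C11 aligned_alloc. The helper names are hypothetical, and the availability of aligned_alloc is an assumption (MSVC, for example, provides _aligned_malloc instead).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for `count` floats on a 64-byte
   boundary. C11 aligned_alloc expects the size to be a multiple of the
   alignment, so the byte count is rounded up to a multiple of 64. */
float *alloc_floats_aligned64(size_t count)
{
    assert(count <= (SIZE_MAX - 63u) / sizeof(float));  /* no overflow */
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + 63u) & ~(size_t)63u;
    if (rounded == 0)
        rounded = 64;                     /* avoid a zero-size request */
    return (float *)aligned_alloc(64, rounded);
}

/* Returns 1 if p lies on a 64-byte boundary, 0 otherwise. */
int is_aligned64(const void *p)
{
    return ((uintptr_t)p & 63u) == 0;
}
```

Memory obtained this way is released with free(). Rounding the size up also means a loop that processes 8 floats at a time stays inside the allocation when the element count is not a multiple of 8.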

Bernard · ‎03-05-2015

You need to load a GP register (32-bit or 64-bit, matching the build) with the address of inputPtr1. Do the same for the rest of the float arrays.

xor esi, esi
mov esi, inputPtr1
vmovups ymm0, ymmword ptr [esi]

Bernard · ‎03-05-2015

Moreover, you can put all the inline assembly code inside a single asm block.

_asm
{
    //asm code
    ........
}

Wei_Z_Intel · ‎03-05-2015

Hi iliyapolak,

Thanks for the suggestion, I found the issue. inputPtr1, inputPtr2 and outputPtr are already passed in rcx, rdx and r8 by default (the Windows x64 calling convention), so they do not need to be loaded. The issue was how I used rax. But your suggestion pointed me to the likely problem.

I should zero rax with xor rax, rax before the loop and advance it with add rax, 32 (8 floats of 4 bytes each) in every iteration, so the corrected version should look like this:

void complexVec_dotProduct_Asm_AVX(float *inputPtr1, float *inputPtr2, float *outputPtr, int numData)
{
    int idxData;

    #pragma ivdep
    __assume_aligned(inputPtr1, 64);
    __assume_aligned(inputPtr2, 64);
    __assume_aligned(outputPtr, 64);

    __asm
    {
        xor rax, rax
    }
    for (idxData = 0; idxData < numData; idxData += 8)
    {
        //inData1 = _mm256_load_ps(inputPtr1 + idxData); //loads aligned array inputPtr1 into inData1
        //inData2 = _mm256_load_ps(inputPtr2 + idxData); //loads aligned array inputPtr2 into inData2
        //outData = _mm256_mul_ps(inData1, inData2); //performs multiplication
        __asm
        {
            vmovaps ymm0, YMMWORD PTR[rcx + rax]
            vmulps ymm1, ymm0, YMMWORD PTR[rdx + rax]
            vmovups YMMWORD PTR[r8 + rax], ymm1
            add rax, 32
        }
    }

    return;
}

Thank you

John

Wei_Z_Intel · ‎03-05-2015

Hi

I also have another question, about the core loop below:

for (idxData = 0; idxData < numData; idxData++)
{
    data1Re = inputPtr1[idxData];
    data2Re = inputPtr2[idxData];

    outputPtr[idxData] = data1Re * data2Re;
}

With a release build, AVX-I enabled and optimization turned on, the compiler generates the following asm code in the asm file:

        xor rax, rax
        movsxd r9, r9d
        test r9, r9
        jl LOOP_END
    LOOP_START:
        vmovaps ymm0, YMMWORD PTR[rcx + rax * 4]
        vmulps ymm1, ymm0, YMMWORD PTR[rdx + rax * 4]
        vmovups YMMWORD PTR[r8 + rax * 4], ymm1
        add rax, 8
        cmp rax, r9
        jl LOOP_START
    LOOP_END:
        vzeroupper

It looks to me that the data loads and stores and the loop jump all add overhead here. Each iteration seems to take 6 clocks, and the asm code appears to execute serially, with no parallelism or pipelining. Can anyone help explain?

Thank you

John

Bernard · ‎03-05-2015

>>> It looks to me that the data loads and stores and the loop jump all add overhead here, for each iteration >>>

The so-called "loop overhead" is executed in parallel with the AVX code. This is possible because the scheduler dispatches micro-ops to different execution ports. For example, cmp/jl are probably macro-fused and will likely be executed by Port 6 (Haswell architecture); before that, add rax, 8 can be executed by the same Port 6, so only one port is kept busy by the loop control. The FP multiplication will be executed by either Port 0 or Port 1. If the FP mul instructions are independent, the compiler can unroll the loop so that two FP multiplications can be scheduled on Port 0 and Port 1.

http://www.anandtech.com/show/6355/intels-haswell-architecture/8
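
As an illustration of the unrolling mentioned in the reply above, here is a minimal sketch that processes 16 floats per iteration as two independent 8-float multiplies, which gives the hardware two multiplies with no dependency between them. The function names are hypothetical, GCC or Clang on x86-64 is assumed, and the sketch assumes the output does not overlap the inputs. Whether this is actually faster depends on the CPU and on memory bandwidth, so it should be measured rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>   /* AVX intrinsics */
#define HAVE_AVX_X2_PATH 1

/* Assumption: GCC or Clang; target("avx") enables AVX for this function. */
__attribute__((target("avx")))
static void mul_avx_x2(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    /* Unrolled by 2: p0 and p1 do not depend on each other, so both
       multiplies are free to execute in the same cycle on different ports. */
    for (; i + 16 <= n; i += 16) {
        __m256 p0 = _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                  _mm256_loadu_ps(b + i));
        __m256 p1 = _mm256_mul_ps(_mm256_loadu_ps(a + i + 8),
                                  _mm256_loadu_ps(b + i + 8));
        _mm256_storeu_ps(out + i, p0);
        _mm256_storeu_ps(out + i + 8, p1);
    }
    /* Scalar tail for the remaining n % 16 elements. */
    for (; i < n; i++)
        out[i] = a[i] * b[i];
}
#endif

/* Hypothetical entry point: out[i] = a[i] * b[i], out must not overlap a or b. */
void vec_mul_unrolled(const float *a, const float *b, float *out, size_t n)
{
    assert(n == 0 || (a && b && out));
#ifdef HAVE_AVX_X2_PATH
    if (__builtin_cpu_supports("avx")) {   /* runtime CPU check */
        mul_avx_x2(a, b, out, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; i++)         /* portable fallback */
        out[i] = a[i] * b[i];
}
```

An optimizing compiler will often perform this unrolling on its own for the plain C loop, so writing it by hand is mainly useful for experimenting with the effect.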

Wei_Z_Intel · ‎03-05-2015

Hi iliyapolak,

Thanks a lot for your explanation. The link you attached is a good one.

Thank you

John

Bernard · ‎03-05-2015

Hi WEI Z,

You are welcome.