Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

request for a demo project of using AVX asm

Wei_Z_Intel
Employee

Hi

             I'm studying and trying to use AVX-256/512 instructions/intrinsics, but I could not find a good demo/example for beginners. A simple example project with C code and AVX-related asm code to run would help a lot. Could you send me such an example project?

 

Thank you

John

Thomas_W_Intel
Employee
Wei_Z_Intel
Employee

Hi Thomas,

            Thank you for the fast replies and the good help. Your blogs help a lot.

            One more question, related to asm code: I am also trying to compile and run asm code. I added an asm file (.asm extension) to the Source Files of the VS project (I use VS2013 integrated with Intel Parallel Studio XE 2015) and tried to compile and run, but it looks like the project compiles all C/C++ source files except the asm file. Did I do something wrong? Can you help clarify, or could you send me a related example project if you have one?

 

Thank you

John

Bernard
Valued Contributor I

You can use inline assembly instead, or better, use intrinsics.

http://stackoverflow.com/questions/4548763/compiling-assembly-in-visual-studio
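
For reference, a minimal sketch of the intrinsics route (my own example, not from your project; it assumes 32-byte-aligned buffers and that numData is a multiple of 8):

#include <immintrin.h>

/* Element-wise multiply of two float arrays with AVX intrinsics.
   Assumes in1, in2 and out are 32-byte aligned and numData is a multiple of 8. */
void vec_mul_avx(const float *in1, const float *in2, float *out, int numData)
{
    for (int i = 0; i < numData; i += 8)
    {
        __m256 a = _mm256_load_ps(in1 + i);    /* aligned load of 8 floats */
        __m256 b = _mm256_load_ps(in2 + i);
        __m256 c = _mm256_mul_ps(a, b);        /* 8 multiplies in one instruction */
        _mm256_store_ps(out + i, c);           /* aligned store of 8 results */
    }
}

The compiler turns each intrinsic into the corresponding AVX instruction (vmovaps/vmulps), so you get the vector code without writing assembly or fighting the build system over .asm files.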

Wei_Z_Intel
Employee

Hi iliyapolak,

      Thanks a lot for your help. I tried inline assembly, but when I run it in release mode there is an issue. Can you help take a look and see what is wrong?

Below is general C code:

void complexVec_dotProduct_General_C(float *inputPtr1, float *inputPtr2, float *outputPtr, int numData)
{
    int idxData;
    float data1Re, data2Re;

    #pragma ivdep
    __assume_aligned(inputPtr1, 64);
    __assume_aligned(inputPtr2, 64);
    __assume_aligned(outputPtr, 64);
     
    for (idxData = 0; idxData < numData; idxData++)
    {
        data1Re = inputPtr1[idxData];
        data2Re = inputPtr2[idxData];
        
        outputPtr[idxData] = data1Re * data2Re;
    }

    return;
}

Below is the inline assembly version, but it looks like I pass the arguments wrong. Can you show me how to pass them to the inline assembly code?

void complexVec_dotProduct_Asm_AVX(float *inputPtr1, float *inputPtr2, float *outputPtr, int numData)
{
    int idxData;

#pragma ivdep
    __assume_aligned(inputPtr1, 64);
    __assume_aligned(inputPtr2, 64);
    __assume_aligned(outputPtr, 64);

    for (idxData = 0; idxData < numData; idxData += 8)
    {

        //inData1 = _mm256_load_ps(inputPtr1 + idxData);   //loads aligned array inputPtr1 into inData1  
        //inData2 = _mm256_load_ps(inputPtr2 + idxData);   //loads aligned array inputPtr2 into inData2  
        //outData = _mm256_mul_ps(inData1, inData2); //performs multiplication  
        __asm vmovaps ymm0, YMMWORD PTR[rcx + rax * 4]
        __asm vmulps ymm1, ymm0, YMMWORD PTR[rdx + rax * 4]
        __asm vmovups YMMWORD PTR[r8 + rax * 4], ymm1
    }

    return;
}

Thank you

John

Bernard
Valued Contributor I

You need to load either a 32-bit or a 64-bit GP register with the address of inputPtr1. Do the same for the rest of the float arrays.

xor esi,esi

mov esi, inputPtr1

vmovups ymm0, ymmword ptr [esi]

Bernard
Valued Contributor I

Moreover, you can put all the inline assembly code inside an asm block.

_asm

{

 //asm code

 ........

}

Wei_Z_Intel
Employee

Hi iliyapolak,

       Thanks for the suggestion, I found the issue. Actually, inputPtr1, inputPtr2 and outputPtr are already passed in rcx, rdx and r8 by default (the Windows x64 calling convention), so they don't need to be loaded; the issue is how I used rax. Your suggestion pointed me to it.

       I should zero rax with xor rax, rax before the loop and add 32 to rax inside the loop, so the corrected version should look like the code below.

 

void complexVec_dotProduct_Asm_AVX(float *inputPtr1, float *inputPtr2, float *outputPtr, int numData)
{
    int idxData;

#pragma ivdep
    __assume_aligned(inputPtr1, 64);
    __assume_aligned(inputPtr2, 64);
    __assume_aligned(outputPtr, 64);

    __asm
    {
        xor rax, rax
    }
    for (idxData = 0; idxData < numData; idxData += 8)
    {
        //inData1 = _mm256_load_ps(inputPtr1 + idxData);   //loads aligned array inputPtr1 into inData1  
        //inData2 = _mm256_load_ps(inputPtr2 + idxData);   //loads aligned array inputPtr2 into inData2  
        //outData = _mm256_mul_ps(inData1, inData2);       //performs multiplication  
        __asm
        {
            vmovaps ymm0, YMMWORD PTR[rcx + rax]
            vmulps ymm1, ymm0, YMMWORD PTR[rdx + rax]
            vmovups YMMWORD PTR[r8 + rax], ymm1
            add       rax, 32
        }
    }

    return;
}
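
(Side note: the AVX versions above process 8 floats per iteration, so they assume numData is a multiple of 8. If that is not guaranteed, a short scalar tail loop is needed after the vector loop; a sketch, with tailStart being an illustrative name:

    int tailStart = numData & ~7;          /* largest multiple of 8 <= numData */
    for (int i = tailStart; i < numData; i++)
        outputPtr[i] = inputPtr1[i] * inputPtr2[i];
)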

 

Thank you

John

 

Wei_Z_Intel
Employee

Hi 

        I also have another question, about the core loop below:

    for (idxData = 0; idxData < numData; idxData++)
    {
        data1Re = inputPtr1[idxData];
        data2Re = inputPtr2[idxData];
        
        outputPtr[idxData] = data1Re * data2Re;
    }

         I see that the compiler, with a release build, AVX-I code generation, and optimization turned on, generates the asm code below in the .asm file:

        xor       rax, rax
        movsxd    r9, r9d
        test      r9, r9
        jl        LOOP_END
    LOOP_START:
        vmovaps   ymm0, YMMWORD PTR[rcx + rax * 4]
        vmulps    ymm1, ymm0, YMMWORD PTR[rdx + rax * 4]
        vmovups   YMMWORD PTR[r8 + rax * 4], ymm1
        add       rax, 8
        cmp       rax, r9
        jl        LOOP_START
    LOOP_END:
        vzeroupper

         It looks to me that the data load and store and the loop jump all add overhead here: each iteration seems to take 6 clocks (one per instruction), as if the asm code executed serially, with no parallelism or pipelining. Can anyone help explain?

Thank you

John

Bernard
Valued Contributor I

>>> It looks to me that the data load and store and the loop jump all add overhead here, for each iteration >>>

The so-called "loop overhead" will be executed in parallel with the AVX code. This is possible because different execution ports issue micro-ops to the execution stacks. For example, cmp/jl are probably macro-fused and will likely be executed by Port 6 (Haswell architecture); before that, add rax,8 can be executed by the same Port 6, so only one port is saturated. The FP multiplication will be executed by either Port 0 or Port 1. If the FP mul instructions are independent, the compiler can unroll the loop so that two FP multiplications can be scheduled on Port 0 and Port 1.

http://www.anandtech.com/show/6355/intels-haswell-architecture/8
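
To illustrate, here is a sketch (my own, not actual compiler output) of what a 2x unrolled version of that loop looks like with intrinsics; the two _mm256_mul_ps calls in one iteration are independent, so the scheduler can issue one on Port 0 and one on Port 1 in the same cycle:

#include <immintrin.h>

/* 2x unrolled element-wise multiply. Assumes 32-byte-aligned pointers and
   numData a multiple of 16. */
void vec_mul_avx_unroll2(const float *in1, const float *in2, float *out, int numData)
{
    for (int i = 0; i < numData; i += 16)
    {
        __m256 a0 = _mm256_load_ps(in1 + i);
        __m256 a1 = _mm256_load_ps(in1 + i + 8);
        __m256 b0 = _mm256_load_ps(in2 + i);
        __m256 b1 = _mm256_load_ps(in2 + i + 8);
        _mm256_store_ps(out + i,     _mm256_mul_ps(a0, b0));   /* independent multiply #1 */
        _mm256_store_ps(out + i + 8, _mm256_mul_ps(a1, b1));   /* independent multiply #2 */
    }
}

The loads, multiplies and stores of the two halves have no data dependencies on each other, which is what lets them overlap in the pipeline.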

Wei_Z_Intel
Employee

Hi iliyapolak,

       Thanks a lot for your explanation. The link you attached is a good one.

 

Thank you

John

Bernard
Valued Contributor I

Hi WEI Z,

You are welcome.
