Software Archive
Read-only legacy content

pass address to asm block

berthou
Beginner
596 Views

Hello,

I would like to implement some AVX code manually using an asm block inside a function/program.

My question is the following: how do I pass an address from the standard C code to the asm block? An example follows. I want to load the arrays 'a' and 'b' into registers, do an AVX add, and then store the result back to the array 'c'. I know I could use AVX intrinsics, and it works that way, but I would like to get it working using asm blocks.

#include <stdio.h>

int main()
{
float a[8] = {1., 2., 3., 4., 5., 6., 7., 8.};
float b[8] = {12., 23., 31., 4.1, 5.3, 6.3, 71., 8.1};
float c[8];

asm
{
// load 'a' and 'b' to registers 'ymm0' and 'ymm1'
vmovups ymm0, ???;
vmovups ymm1, ???;
vaddps  ymm3, ymm0, ymm1;
// store 'ymm3' to the array 'c'
vmovups ???, ymm3;
}

	printf("%f", c[0]);

return 0;
}

Is there an equivalent of asm blocks in Fortran/ifort?

Thanks for helping out!

Vincent

10 Replies
Bernard
Valued Contributor I

Try this code. You can also read this article:

http://masm32.com/board/index.php?topic=2960.0

 

align 32
xor eax, eax
xor ebx, ebx
xor edx, edx
mov eax, a
mov ebx, b
mov edx, c
vmovups ymm0, [eax]
vmovups ymm1, [ebx]
vaddps ymm1, ymm1, ymm0 // destructive operation: ymm1 is overwritten
vmovups [edx], ymm1

 

berthou
Beginner

Thank you. I managed to get it working with your help.

But is there an equivalent to -fasm-blocks in ifort?

I had to use 64-bit registers and so switch from mov to lea, because mov yielded an operand-size error with 64-bit registers.

#include <stdio.h>

int main()
{
        float a[16] __attribute__((aligned(32)));
        float b[16] __attribute__((aligned(32)));
        float c[16] __attribute__((aligned(32)));
        int i;
        for (i = 0; i < 16; i++)
        {
                a[i] = (float) i;
                b[i] = (float) (i + 1);
        }

        asm
        {
                lea rax, a;
                lea rbx, b;
                lea rdi, c;
                mov rdx, 0x0; // address index
                mov rcx, 0x2; // execute loop1 twice
loop1:
                // load 'a' and 'b' to registers 'ymm0' and 'ymm1'
                vmovaps ymm0, [rax+(0x4*rdx)];
                vmovaps ymm1, [rbx+(0x4*rdx)];
                vsubps  ymm3, ymm0, ymm1;
                // store 'ymm3' to the array 'c'
                vmovaps [rdi+(0x4*rdx)], ymm3;
                add rdx, 0x8;
                loop loop1;
        }

        for (i = 0; i < 16; i++)
        {
                printf("%f\n", c[i]);
        }

        return 0;
}
Bernard
Valued Contributor I

@berthou

I am glad that I was able to help you.

Regarding ifort I do not know; I am not a Fortran programmer.

Btw, you can ask that question on the Fortran forum, where there are a lot of experts.

Bernard
Valued Contributor I

 

I suppose that a complex address calculation like [rax+(0x4*rdx)] could probably load the AGU more than simple addressing like [rax + 16], [rax + 32], ..., [rax + 16*n], where n = 1, 2, 3, 4, ...

zalia64
New Contributor I

I believe that [rax+123456] uses fewer micro-ops than [rax+4*rdx].

I don't think the execution will take longer: with parallel operation, reordering of micro-ops, speculative execution, etc., I don't think you will see any difference.

I personally prefer using the same register for both indexing and counting: a single RDX, not both RDX and RCX. That way RCX is free for other uses.

        asm
        {
                lea rax, a;
                lea rbx, b;
                lea rdi, c;
                mov rdx, 0x8;          // N minus 8: address index
                                       // old: mov rcx, 0x2;  // not used
loop1:
                // load 'a' and 'b' to registers 'ymm0' and 'ymm1'
                vmovaps ymm0, [rax+(0x4*rdx)];
                vmovaps ymm1, [rbx+(0x4*rdx)];
                vsubps  ymm3, ymm0, ymm1;
                // store 'ymm3' to the array 'c'
                vmovaps [rdi+(0x4*rdx)], ymm3;
                sub rdx, 8;            // old: add rdx, 0x8;
                jge loop1;             // old: loop loop1;
        }

 

Bernard
Valued Contributor I

>>>I don't think that the execution will take longer - with parallel operations, re-order of micro-ops, speculative executions, etc, I don't think you will see any difference>>>

I would try to keep the addressing simpler. I suppose that the AGU is decoupled from the other integer execution units, and the address calculation probably cannot be sent internally to other ports. When issuing a complex address like [rax+(0x4*rdx)], it can probably be decoded into two uops (not sure about this), or it could be handled differently at the AGU level without generating arithmetic uops.

Bernard
Valued Contributor I

Anyway, you still have two different integer operations issued per cycle, which will occupy the adder and multiplier units.

Bernard
Valued Contributor I

I am thinking about how to test the overhead of complex vs. simple addressing.

zalia64
New Contributor I

First, the effects would depend on the hardware: two ports for address calculation, or a single port? This means a 'Haswell' CPU may perform differently from 'Ivy Bridge', and a 'Core i3' differently from a 'Core i7'.

Then the question arises: where is the bottleneck? If it isn't the address calculation, the difference will be hidden.

You could try timing the simplest loop, just a read and a store, with two different buffers. You should repeatedly use the same memory blocks, small enough for both to fit inside the cache. I would suggest 'vmovaps ymm0, [rax+(0x4*rdx)]; vmovaps [rbx+(0x4*rdx)], ymm0;'

Code explicitly at least K consecutive stages of the loop body (read / store). Instead of the 'loop' instruction, use 'jnz', which is usually faster (hardware dependent). To do 20000 repetitions over the same block of memory, simply use 'AND RDX, 0x3FF' or similar.

To minimize the overhead of the loop branch (e.g. 'jnz' or 'loop'), unroll many stages of the loop. There is a catch here: your actual loop will be N micro-ops long. If it is short, the decoded form (mnemonics translated into micro-ops) fits inside the decoded-uop cache and the decoder just sits idle, giving a quicker time. If your actual loop is longer, everything must be decoded on each pass. I believe you will find, with unrolling 1, 2, ..., M stages, a magic number K:

   T(1 stage) > T(2 stages)/2 > T(3 stages)/3 > ... > T(K stages)/K

   T(K stages)/K << T(K+1 stages)/(K+1) > T(K+2 stages)/(K+2) ...

On the other hand, if the bottleneck is not the decoder, this 'magic K' will never show up....

Good Luck.

 

Bernard
Valued Contributor I

Thanks for the reply, very interesting post.

I suppose that the LSD (loop stream detector) will probably be in use because of the small loop size (<= 56 uops). To keep it below the aforementioned size, I would initially not use aggressive unrolling.

As far as my understanding goes, simple and complex addressing will both be decoded into one uop, but encoded differently. Next, that uop will be sent to the AGU, where the integer address calculation is performed. Probably the AGU is able to execute one addition and one multiplication per cycle.

 
