Software Archive
Read-only legacy content

pass address to asm block

berthou
Beginner
596 Views

Hello,

I would like to implement some AVX code manually using an asm block inside a function/program.

My question is the following: how do I pass an address from the standard C code to the asm block? An example follows. I want to load the arrays 'a' and 'b' into registers, do an AVX add, and then store the result back to the array 'c'. I know I could use AVX intrinsics, and it works that way, but I would like to get it working using asm blocks.

#include <stdio.h>

int main()
{
float a[8] = {1., 2., 3., 4., 5., 6., 7., 8.};
float b[8] = {12., 23., 31., 4.1, 5.3, 6.3, 71., 8.1};
float c[8];

asm
{
// load 'a' and 'b' to registers 'ymm0' and 'ymm1'
vmovups ymm0, ???;
vmovups ymm1, ???;
vaddps  ymm3, ymm0, ymm1;
// store 'ymm3' to the array 'c'
vmovups ???, ymm3;
}

	printf("%f", c[0]);

return 0;
}

Is there an equivalent of asm blocks in Fortran/ifort?

Thanks for helping out!

Vincent

10 Replies
Bernard
Valued Contributor I

Try this code. You can also read this article:

http://masm32.com/board/index.php?topic=2960.0

 

align 32
xor eax, eax
xor ebx, ebx
xor edx, edx
mov eax, a
mov ebx, b
mov edx, c
vmovups ymm0, [eax]
vmovups ymm1, [ebx]
vaddps ymm1, ymm1, ymm0 // destructive operation: ymm1 is overwritten
vmovups [edx], ymm1

 

berthou
Beginner

Thank you. I managed to get it working with your help.

But is there an equivalent to -fasm-blocks in ifort?

I had to use 64-bit registers and so switch from mov to lea, because mov yielded an operand-size error with 64-bit registers.

#include <stdio.h>

int main()
{
        float a[16] __attribute__((aligned(32)));
        float b[16] __attribute__((aligned(32)));
        float c[16] __attribute__((aligned(32)));
        int i;
        for (i = 0; i < 16; i++)
        {
                a[i] = (float) i;
                b[i] = (float) (i + 1);
        }

        asm
        {
                lea rax, a;
                lea rbx, b;
                lea rdi, c;
                mov rdx, 0x0; // address index
                mov rcx, 0x2; // execute loop1 twice
loop1:
                // load 'a' and 'b' to registers 'ymm0' and 'ymm1'
                vmovaps ymm0, [rax+(0x4*rdx)];
                vmovaps ymm1, [rbx+(0x4*rdx)];
                vsubps  ymm3, ymm0, ymm1;
                // store 'ymm3' to the array 'c'
                vmovaps [rdi+(0x4*rdx)], ymm3;
                add rdx, 0x8;
                loop loop1;
        }

        for (i = 0; i < 16; i++)
        {
                printf("%f\n", c[i]);
        }

        return 0;
}
Bernard
Valued Contributor I

@berthou

I am glad that I was able to help you.

Regarding ifort I do not know; I am not a Fortran programmer.

Btw, you can ask that question on the Fortran forum, where there are a lot of experts.

Bernard
Valued Contributor I

 

I suppose that a complex address calculation like [rax+(0x4*rdx)] could probably load the AGU more than simple addressing like [rax + 16], [rax + 32], ..., [rax + 16*n], where n = 1, 2, 3, 4, ...

zalia64
New Contributor I

I believe that [rax+123456] uses fewer micro-ops than [rax+4*rdx].

I don't think the execution will take longer: with parallel operation, reordering of micro-ops, speculative execution, etc., I don't think you will see any difference.

I personally prefer using the same register for both indexing and counting: a single RDX, not both RDX and RCX. That way RCX is free for other uses.

        asm
        {
                lea rax, a;
                lea rbx, b;
                lea rdi, c;
                mov rdx, 0x8;          // N minus 8: address index
                                       // old: mov rcx, 0x2;  // not used
loop1:
                // load 'a' and 'b' to registers 'ymm0' and 'ymm1'
                vmovaps ymm0, [rax+(0x4*rdx)];
                vmovaps ymm1, [rbx+(0x4*rdx)];
                vsubps  ymm3, ymm0, ymm1;
                // store 'ymm3' to the array 'c'
                vmovaps [rdi+(0x4*rdx)], ymm3;
                sub rdx, 8;            // old: add rdx, 0x8;
                jge loop1;             // old: loop loop1;
        }

 

Bernard
Valued Contributor I

>>>I don't think that the execution will take longer - with parallel operations, re-order of micro-ops, speculative executions, etc, I don't think you will see any difference>>>

I would try to keep the addressing simpler. I suppose that the AGU is decoupled from the other integer execution units, and the address calculation probably cannot be sent internally to other ports. When issuing a complex address like [rax+(0x4*rdx)], it can probably be decoded into two uops (not sure about this), or it could be handled differently at the AGU level without generating arithmetic uops.

Bernard
Valued Contributor I

Anyway, you still have two different integer operations issued per cycle, which will occupy the adder and multiplier units.

Bernard
Valued Contributor I

I am thinking about how to test the overhead of complex vs. simple addressing.

zalia64
New Contributor I

First, the effects would depend on the hardware: two ports for address calculation, or a single port? This means a 'Haswell' CPU may perform differently from 'Ivy Bridge', and a 'Core i3' differently from a 'Core i7'.

Then the question arises: where is the bottleneck? If it isn't the address calculation, the difference will be hidden.

You could try timing the simplest loop, just a read and a store, with two different buffers. You should repeatedly use the same memory blocks, small enough for both to fit inside the cache. I would suggest 'vmovaps ymm0, [rax+(0x4*rdx)]; vmovaps [rbx+(0x4*rdx)], ymm0;'

Code explicitly at least K consecutive stages of the loop body (read / store). Instead of the 'loop' instruction, use 'jnz', which is usually faster (hardware dependent). To do 20000 repetitions over the same block of memory, simply use 'AND RDX, 0x3FF' or similar.

To minimize the overhead of the loop branch (e.g. 'jnz' or 'loop'), unroll many stages of the loop. There is a catch here: your actual loop will be N micro-ops long. If it is short, the decoded form (mnemonics translated into micro-ops) fits inside the decoded-uop cache and the decoder just sits idle, giving a quicker time. If your actual loop is longer, everything must be decoded on each pass. I believe you will find, with unrolling 1, 2, ..., M stages, a magic number K:

   T(1 stage) > T(2 stages)/2 > T(3 stages)/3 > ... > T(K stages)/K

   T(K stages)/K << T(K+1 stages)/(K+1) > T(K+2 stages)/(K+2) ...

On the other hand, if the bottleneck is not the decoder, this 'magic K' will never show up....

Good Luck.

 

Bernard
Valued Contributor I

Thanks for the reply, very interesting post.

I suppose that the LSD (loop stream detector) will probably be in use because of the small loop size (<= 56 uops). To keep it below the aforementioned size, I would initially not use aggressive unrolling.

As far as my understanding goes, simple and complex addressing will both be decoded into one uop, but encoded differently. Next, that uop will be sent to the AGU, where the integer address calculation is performed. Probably the AGU is able to execute one addition and one multiplication per cycle.

 
