Hello,
I would like to implement some AVX code manually using an asm block inside a function/program.
My question is the following: how do I pass an address from the standard C code to the asm block? There is an example below. I want to load the arrays 'a' and 'b' into registers, do an AVX add, and then store the result back into the array 'c'. I know I could use AVX intrinsics, and it works that way, but I would like to get it working using asm blocks.
#include <stdio.h>

int main() {
    float a[8] = {1., 2., 3., 4., 5., 6., 7., 8.};
    float b[8] = {12., 23., 31., 4.1, 5.3, 6.3, 71., 8.1};
    float c[8];
    asm {
        // load 'a' and 'b' into registers 'ymm0' and 'ymm1'
        vmovups ymm0, ???;
        vmovups ymm1, ???;
        vaddps ymm3, ymm0, ymm1;
        // store 'ymm3' to the array 'c'
        vmovups ???, ymm3;
    }
    printf("%f", c[0]);
    return 0;
}
Is there an equivalent of asm blocks in Fortran/ifort?
Thanks for helping out!
Vincent
Try this code. You can also read this article:
http://masm32.com/board/index.php?topic=2960.0
align 32
xor eax, eax
xor ebx, ebx
xor edx, edx
mov eax, a
mov ebx, b
mov edx, c
vmovups ymm0, [eax]
vmovups ymm1, [ebx]
vaddps ymm1, ymm1, ymm0 // destructive: the result overwrites ymm1
vmovups [edx], ymm1
Thank you. I managed to get it working with your help.
But is there an equivalent of -fasm-blocks in ifort?
I had to use 64-bit registers, and so switch from mov to lea, because mov yielded an operand-size error with 64-bit registers.
#include <stdio.h>

int main() {
    float a[16] __attribute__((aligned(32)));
    float b[16] __attribute__((aligned(32)));
    float c[16] __attribute__((aligned(32)));
    int i;
    for (i = 0; i < 16; i++) {
        a[i] = (float) i;
        b[i] = (float) i + 1;
    }
    asm {
        lea rax, a;
        lea rbx, b;
        lea rdi, c;
        mov rdx, 0x0; // address index
        mov rcx, 0x2; // execute loop1 twice
    loop1:
        // load 'a' and 'b' into registers 'ymm0' and 'ymm1'
        vmovaps ymm0, [rax+(0x4*rdx)];
        vmovaps ymm1, [rbx+(0x4*rdx)];
        vsubps ymm3, ymm0, ymm1;
        // store 'ymm3' to the array 'c'
        vmovaps [rdi+(0x4*rdx)], ymm3;
        add rdx, 0x8;
        loop loop1;
    }
    for (i = 0; i < 16; i++) {
        printf("%f\n", c[i]);
    }
    return 0;
}
@berthou
I am glad that I was able to help you.
Regarding ifort, I do not know; I am not a Fortran programmer.
Btw, you can ask that question on the Fortran forum, because there are a lot of experts there.
I suppose that a complex address calculation like this one, [rax+(0x4*rdx)], could probably load the AGU unit more than simple addressing like [rax + 16], [rax + 32], [rax + 16*n],
where n = 1, 2, 3, 4, ...
I believe that [rax+123456] uses fewer micro-ops than [rax+4*rdx].
I don't think that the execution will take longer - with parallel operations, re-ordering of micro-ops, speculative execution, etc., I don't think you will see any difference.
I personally prefer using the same register both for indexing and counting: a single RDX, not both RDX and RCX. This way RCX is free for other uses.
asm {
    lea rax, a;
    lea rbx, b;
    lea rdi, c;
    mov rdx, 0x8; // N minus 8: address index
    // old: mov rcx, 0x2; // not used
loop1:
    // load 'a' and 'b' into registers 'ymm0' and 'ymm1'
    vmovaps ymm0, [rax+(0x4*rdx)];
    vmovaps ymm1, [rbx+(0x4*rdx)];
    vsubps ymm3, ymm0, ymm1;
    // store 'ymm3' to the array 'c'
    vmovaps [rdi+(0x4*rdx)], ymm3;
    sub rdx, 0x8; // old: add rdx, 0x8;
    jge loop1;    // old: loop loop1;
}
>>>I don't think that the execution will take longer - with parallel operations, re-order of micro-ops, speculative executions, etc, I don't think you will see any difference>>>
I would try to keep the addressing simpler. I suppose that the AGU is decoupled from the other integer execution units, and that address calculations probably cannot be sent internally to other ports. Now, when issuing a complex address like [rax+(0x4*rdx)], it can probably be decoded into two uops (not sure about this), or it could be handled differently at the AGU level without generating arithmetic uops.
Anyway, you still have two different integer operations issued per cycle, which will occupy the adder and multiplier units.
I am thinking about how to test the overhead of complex vs. simple addressing.
First, the effects would depend on the hardware: two ports for address calculation, or a single port? This means a 'Haswell' CPU may perform differently from 'Ivy Bridge', and a 'Core i3' differently from a 'Core i7'.
Then the question arises: where is the bottleneck? If it ain't the address calculation, it will be hidden.
You could try timing the simplest loop - just a read and a store, with two different buffers. You should repeatedly use the same memory blocks, small enough for both to fit inside the cache. I would suggest 'vmovaps ymm0, [rax+(0x4*rdx)]; vmovaps [rbx+(0x4*rdx)], ymm0;'.
Code explicitly at least K consecutive stages of the loop internals 'read / store'. Instead of the 'loop' command, use 'jnz', which is usually faster (hardware dependent). To do 20000 repetitions over the same block of memory, simply use 'and rdx, 0x3FF' or similar.
To minimize the overhead of the loop (e.g. 'jnz' or 'loop'), unroll many stages of the loop. There is a catch here: your actual loop will be N micro-ops long. If it is short, the decoded form (from mnemonics into micro-ops) fits inside the decoder cache, and the decoder just sits idle - a quicker time. If your actual loop is longer, everything must be decoded on each pass. I believe you will find, with unrolling 1, 2, ..., M stages, a magic number K:
T(1 stage)/1 > T(2 stages)/2 > T(3 stages)/3 > ... > T(K stages)/K
T(K stages)/K << T(K+1 stages)/(K+1) > T(K+2 stages)/(K+2) > ...
On the other hand, if the bottleneck is not the decoder, this 'magic K' will never show up...
Good Luck.
Thanks for the reply, very interesting post.
I suppose that the LSD will probably be in use because of the small loop size (<= 56 uops). In order to keep it below the aforementioned size, I would initially not use aggressive unrolling.
As far as my understanding goes, I think that simple addressing and complex addressing will both be decoded into one uop, but encoded differently. Next, that uop will be sent to the AGU unit, where the integer address calculation is performed. Probably the AGU is able to execute one addition and one multiplication per cycle.