int
main(){
Vec4f v1, v2, v3;
v1.Set(1,1,1,1);
v2.Set(2,2,2,2);
Vec4fAdd(&v3, &v1, &v2);
__asm{
movaps xmm0, v1
movaps xmm1, v2
addps xmm0, xmm1
movaps v3, xmm0
}
return 0;}
void
Vec4fAdd(void* _o, const void* _a, const void* _b){
//_o.x = _a.x + _b.x;
//_o.y = _a.y + _b.y;
//_o.z = _a.z + _b.z;
//_o.w = _a.w + _b.w;
__asm{
movaps xmm0, _a
movaps xmm1, _b
addps xmm0, xmm1
movaps _o, xmm0
}
}
The asm inlined in main() works, but the Vec4fAdd() call does not: it raises an exception reading 0xffffffff. If I also inline some assembly to put the parameters into general registers, it works. What I do not understand is why it doesn't work even though the values that I am seeing in the debugger are the same memory locations that are used in the inlined part in main(). It seems that no matter what function I send the Vec4f to, it crashes, even though it looks 16-byte aligned.
Message Edited by Rheikon on 09-30-2005 02:45 PM
Dear Rheikon,
This is the same problem as discussed in the thread http://softwareforums.intel.com/ids/board/message?board.id=16&message.id=2949 of this forum. When a variable var is a direct SSE object, one can either use
lea eax, var
movaps xmm0, [eax]
or directly
movaps xmm0, var
to load the value. When a variable ptr is a pointer to an SSE object, one has to use
mov eax, ptr
movaps xmm0, [eax]
instead. Hope this helps.
Aart Bik
http://www.aartbik.com/
In main, the variable denotes a memory location that directly contains a 128-bit object. In the function, the variable (a pointer) denotes a memory location that contains the address of another memory location that contains a 128-bit object (please note that I used movaps xmm0, [eax], not just eax). Also note that the o.x notation in the comment really should be o->x, maybe that makes things more clear?
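For comparison, here is a minimal sketch of the same Vec4fAdd() written with SSE intrinsics instead of inline assembly. The Vec4f layout is an assumption (the original type is not shown in the thread), and the asserts are added only to make the alignment requirement visible. The point it illustrates is that _mm_load_ps() takes the pointer value and loads from the address it holds, which corresponds to the mov eax, ptr / movaps xmm0, [eax] sequence above.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical stand-in for the poster's Vec4f: four floats, 16-byte aligned. */
typedef struct {
    _Alignas(16) float x;  /* aligning the first member aligns the whole struct */
    float y, z, w;
} Vec4f;

/* Same signature as in the thread: each argument is a POINTER to a 128-bit object. */
void Vec4fAdd(void *_o, const void *_a, const void *_b)
{
    /* The aligned load/store forms require 16-byte-aligned addresses. */
    assert(((uintptr_t)_o & 15u) == 0);
    assert(((uintptr_t)_a & 15u) == 0);
    assert(((uintptr_t)_b & 15u) == 0);

    /* Load from the address held in the pointer, not from the pointer variable itself. */
    __m128 a = _mm_load_ps((const float *)_a);
    __m128 b = _mm_load_ps((const float *)_b);

    /* Add the four lanes and store the result through the output pointer. */
    _mm_store_ps((float *)_o, _mm_add_ps(a, b));
}
```

Called as Vec4fAdd(&v3, &v1, &v2) with v1 = (1,1,1,1) and v2 = (2,2,2,2), this leaves 3 in every component of v3. The compiler picks the registers and addressing modes, so the extra level of indirection never has to be spelled out by hand.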
Message Edited by Rheikon on 10-02-2005 12:03 AM
To answer your first question, the addressing mode you allude to (two indirections) is not supported in a single instruction on the Intel Architecture (and, probably confusing matters further, putting the [] around a variable does not have the effect you assumed at all). The inline assembler simplifies coding by converting C-style variables into appropriate addressing modes, but you should still adhere to the 1:1 mapping with actual machine code. May I suggest you try intrinsics (which are much closer to the C language) or, even better, automatic vectorization?
The code below
struct f4 {
float x,y,z,w;
};
void doit(struct f4 *o, struct f4 *a, struct f4 *b) {
#pragma ivdep
#pragma vector aligned
o->x = a->x + b->x;
o->y = a->y + b->y;
o->z = a->z + b->z;
o->w = a->w + b->w;
}
is simply vectorized by the Intel compiler
joho.c(8) : (col. 3) remark: BLOCK WAS VECTORIZED.
into
movaps xmm0, XMMWORD PTR [eax]
addps xmm0, XMMWORD PTR [edx]
movaps XMMWORD PTR [ecx], xmm0
Hope this helps.
Aart Bik
http://www.aartbik.com/
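One caveat worth spelling out for the doit() example: #pragma vector aligned asks the Intel compiler to use aligned moves (movaps), so the caller has to guarantee that all three pointers are 16-byte aligned. The sketch below shows one way a caller might do that in C11. The helper is_aligned16() and the driver add_aligned_example() are hypothetical names introduced only for illustration, and the pragmas are left as a comment so the sketch also builds with compilers that do not know them.

```c
#include <assert.h>
#include <stdint.h>

struct f4 {
    float x, y, z, w;
};

/* Body as in the thread. When building with the Intel compiler, the
 * "#pragma ivdep" and "#pragma vector aligned" lines go at the top of the body. */
void doit(struct f4 *o, struct f4 *a, struct f4 *b)
{
    o->x = a->x + b->x;
    o->y = a->y + b->y;
    o->z = a->z + b->z;
    o->w = a->w + b->w;
}

/* Hypothetical helper: nonzero when p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical driver: builds aligned operands, adds them, returns the sum. */
struct f4 add_aligned_example(void)
{
    /* struct f4 alone is only float-aligned, so request 16 bytes explicitly. */
    _Alignas(16) struct f4 a = {1.0f, 1.0f, 1.0f, 1.0f};
    _Alignas(16) struct f4 b = {2.0f, 2.0f, 2.0f, 2.0f};
    _Alignas(16) struct f4 o;

    /* This is the promise that "#pragma vector aligned" relies on. */
    assert(is_aligned16(&a) && is_aligned16(&b) && is_aligned16(&o));

    doit(&o, &a, &b);
    return o;
}
```

If the operands come from the heap instead, an aligned allocator would be needed for the same reason; which one is available depends on the compiler and platform.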
Compiler version, switches, OS?
Here is a Windows example (for Linux, omit the "Q"):
Intel C++ Compiler for 32-bit applications, Version 9.0
....
joho.c(8) : (col. 3) remark: BLOCK WAS VECTORIZED.
Message Edited by Rheikon on 10-03-2005 10:46 AM
Hopefully you will find this on-line article useful:
http://www.intel.com/cd/ids/developer/asmo-na/eng/65774.htm
Or, if you really want to know all the details, the book:
http://www.intel.com/intelpress/sum_vmmx.htm
Message Edited by Rheikon on 10-03-2005 12:22 PM
Why don't you contact me directly (aart.bik@intel.com) with the vectorization issues you encounter? It sounds like with just a little more effort (and patience) you could get the performance you want.
Message Edited by abik on 10-03-2005 02:15 PM