int
main(){
    Vec4f v1, v2, v3;
    v1.Set(1,1,1,1);
    v2.Set(2,2,2,2);
    Vec4fAdd(&v3, &v1, &v2);
    __asm{
        movaps xmm0, v1
        movaps xmm1, v2
        addps xmm0, xmm1
        movaps v3, xmm0
    }
    return 0;
}
void
Vec4fAdd(void* _o, const void* _a, const void* _b){
    //_o.x = _a.x + _b.x;
    //_o.y = _a.y + _b.y;
    //_o.z = _a.z + _b.z;
    //_o.w = _a.w + _b.w;
    __asm{
        movaps xmm0, _a
        movaps xmm1, _b
        addps xmm0, xmm1
        movaps _o, xmm0
    }
}
The asm inlined in main() works, but the Vec4fAdd() call does not: it raises an exception reading 0xffffffff. If I also inline some assembly to put the parameters into general registers, it works. What I do not understand is why it doesn't work even though the values that I am seeing in the debugger are the same memory locations that are used in the inlined part in main(). It seems that no matter what function I send the Vec4f to, it crashes even though it looks 16-byte aligned.
Message Edited by Rheikon on 09-30-2005 02:45 PM
Dear Rheikon,
This is the same problem as discussed in the thread http://softwareforums.intel.com/ids/board/message?board.id=16&message.id=2949 of this forum. When a variable var is a direct SSE object, one can either use
lea eax, var
movaps xmm0, [eax]
or directly
movaps xmm0, var
to load the value. When a variable ptr is a pointer to an SSE object, one has to use
mov eax, ptr
movaps xmm0, [eax]
instead. Hope this helps.
Aart Bik
http://www.aartbik.com/
mov eax, ptr
movaps xmm0, eax
Why is that different than just passing directly like:
movaps xmm0, ptr
It seems to me that they would be the same values but apparently not. What happens when you load it into the register?
In main, the variable denotes a memory location that directly contains a 128-bit object. In the function, the variable (a pointer) denotes a memory location that contains the address of another memory location that contains a 128-bit object (please note that I used movaps xmm0, [eax], not just eax). Also note that the o.x notation in the comment really should be o->x, maybe that makes things more clear?
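The distinction can also be seen in plain C, without any assembly. The sketch below is only an illustration: the `Vec4f` layout and both helper names are assumptions, not code from this thread. The first helper returns the bytes stored in the pointer variable itself (which is what a memory operand naming the pointer refers to), and the second performs the single explicit dereference that `mov eax, ptr` followed by `movaps xmm0, [eax]` performs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the thread's Vec4f; the real layout is not shown. */
typedef struct { float x, y, z, w; } Vec4f;

/* Returns the raw contents of a pointer variable, viewed as an integer.
   These bytes are an address, not the four floats of the vector. */
static uintptr_t bytes_stored_in_pointer(const Vec4f *const *ptr_var)
{
    uintptr_t raw;
    assert(sizeof raw == sizeof *ptr_var); /* holds on common flat-memory targets */
    memcpy(&raw, ptr_var, sizeof raw);
    return raw;
}

/* One explicit dereference: first fetch the address, then read through it. */
static float first_lane_via_pointer(const Vec4f *ptr)
{
    return ptr->x;
}
```

Given `Vec4f v` and `const Vec4f *p = &v`, `bytes_stored_in_pointer(&p)` equals the address of `v`, so treating the storage of `p` as a 128-bit vector reads an address plus whatever follows it on the stack, rather than the data in `v`.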
movaps xmm0, [ptr]
I have always wondered about that since I have no good documentation on assembly. Also, I never really thought about the fact that it was a pointer and not the data; it is probably because I am a bit sick and have not been thinking very well, hah.
The commented-out code was from the first version of the function, which took references, so that's why it is written like that.
Message Edited by Rheikon on 10-02-2005 12:03 AM
void Vec4fAdd(Vec4f &_o, const Vec4f &_a, const Vec4f &_b)
{
//_o.x = _a.x + _b.x;
//_o.y = _a.y + _b.y;
//_o.z = _a.z + _b.z;
//_o.w = _a.w + _b.w;
__asm
{
mov edx, _a
movaps xmm0, [edx]
mov eax, _b
movaps xmm1, [eax]
addps xmm0, xmm1
mov ecx, _o
movaps [ecx], xmm0
}
}
But I looked at the assembly for it and it looks somewhat inefficient. It loads the variables into registers and pushes them for the function call, but then I load them into registers again. It seems like a waste. Is there any way to make it optimize across the function call? From what I read, that is what Whole Program Optimization should do, but either it is not being applied or it doesn't do what I think it does.
To answer your first question, the addressing mode you allude to (two indirections) is not supported in a single instruction on the Intel Architecture (and, probably confusing matters further, putting the [] around a variable does not have the effect you assumed at all). The inline assembler simplifies coding by converting C-style variables into appropriate addressing modes, but you should still adhere to the 1:1 mapping with actual machine code. May I suggest you try intrinsics (which are much closer to the C language) or, even better, automatic vectorization?
The code below
struct f4 {
float x,y,z,w;
};
void doit(struct f4 *o, struct f4 *a, struct f4 *b) {
#pragma ivdep
#pragma vector aligned
o->x = a->x + b->x;
o->y = a->y + b->y;
o->z = a->z + b->z;
o->w = a->w + b->w;
}
is simply vectorized by the Intel compiler
joho.c(8) : (col. 3) remark: BLOCK WAS VECTORIZED.
into
movaps xmm0, XMMWORD PTR [eax]
addps xmm0, XMMWORD PTR [edx]
movaps XMMWORD PTR [ecx], xmm0
Hope this helps.
Aart Bik
http://www.aartbik.com/
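For the intrinsics route mentioned above, a minimal sketch might look like the following. It is an illustration under assumptions, not code from this thread: the union-based `Vec4f` (chosen here because the `__m128` member gives the type 16-byte alignment) and the name `vec4f_add` are hypothetical. `_mm_load_ps`, `_mm_add_ps`, and `_mm_store_ps` are the SSE intrinsics from `<xmmintrin.h>` that correspond to `movaps` and `addps`.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical Vec4f: the __m128 member forces 16-byte alignment. */
typedef union {
    __m128 m;
    float  f[4];
} Vec4f;

/* Adds *a and *b lane by lane and writes the result to *o.
   The aligned load/store intrinsics require 16-byte aligned addresses,
   just like movaps, so the alignment is checked in debug builds. */
static void vec4f_add(Vec4f *o, const Vec4f *a, const Vec4f *b)
{
    assert(((uintptr_t)o & 15) == 0);
    assert(((uintptr_t)a & 15) == 0);
    assert(((uintptr_t)b & 15) == 0);

    __m128 va = _mm_load_ps(a->f);           /* dereferences the pointer */
    __m128 vb = _mm_load_ps(b->f);
    _mm_store_ps(o->f, _mm_add_ps(va, vb));  /* four additions at once */
}
```

Because the arguments are ordinary C pointers, the compiler handles the dereference and register allocation, and it remains free to inline the function at the call site.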
And I tried the automatic vectorization but it just turned it into multiple movss and addss and did not vectorize the block.
Compiler version, switches, OS?
Here is a Windows example (for Linux, omit the "Q"):
Intel C++ Compiler for 32-bit applications, Version 9.0
....
joho.c(8) : (col. 3) remark: BLOCK WAS VECTORIZED.
.main.cpp(30) : (col. 11) remark: loop was not vectorized: unsupported loop structure.
.main.cpp(30) : (col. 11) remark: loop was not vectorized: unsupported loop structure.
.main.cpp(30) : (col. 11) remark: loop was not vectorized: unsupported loop structure.
.main.cpp(30) : (col. 11) remark: loop was not vectorized: unsupported loop structure.
.main.cpp(35) : (col. 2) remark: loop was not vectorized: contains unvectorizable statement at line 35.
.main.cpp(42) : (col. 2) remark: loop was not vectorized: existence of vector dependence.
I will try to fix my loop.
Message Edited by Rheikon on 10-03-2005 10:46 AM
Hopefully you will find this on-line article useful:
http://www.intel.com/cd/ids/developer/asmo-na/eng/65774.htm
Or, if you really want to know all the details, the book:
http://www.intel.com/intelpress/sum_vmmx.htm
Message Edited by Rheikon on 10-03-2005 12:22 PM
I have a program that adds a bunch of vectors in an array, and it says it cannot vectorize the loop, which I do not want it to do in the first place. I want the function I wrote to be vectorized, which it seems it will not do unless the loop containing the function call can be vectorized. It is too much of a hassle to get the compiler to understand what you want. I might as well go back to writing it by hand at the sacrifice of a little performance.
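For reference, a hand-written version of such a loop does not need inline assembly. The sketch below assumes the loop in question adds two arrays of 4-float vectors element by element, which is a guess, since the actual loop is not shown in this thread. The function name and the flat `float` array layout are hypothetical. It uses the unaligned `_mm_loadu_ps` and `_mm_storeu_ps` intrinsics so that it does not depend on the arrays being 16-byte aligned; with guaranteed alignment, `_mm_load_ps` and `_mm_store_ps` could be substituted.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Adds n_vec 4-float vectors: out[i] = a[i] + b[i] for each vector i.
   Each array holds 4 * n_vec floats stored contiguously. */
static void add_vec4_arrays(float *out, const float *a, const float *b,
                            size_t n_vec)
{
    size_t i;
    assert(out != NULL && a != NULL && b != NULL);
    for (i = 0; i < n_vec; ++i) {
        __m128 va = _mm_loadu_ps(a + 4 * i);   /* unaligned 128-bit load */
        __m128 vb = _mm_loadu_ps(b + 4 * i);
        _mm_storeu_ps(out + 4 * i, _mm_add_ps(va, vb));
    }
}
```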
Why don't you contact me directly (aart.bik@intel.com) with the vectorization issues you encounter? It sounds like with just a little more effort (and patience) you could get the performance you want.
Message Edited by abik on 10-03-2005 02:15 PM