SSE2 help needed

rheikon · ‎09-30-2005

I was experimenting with some SSE2 code to get ready for writing a vector and matrix lib, and I have run into a weird issue that I do not understand.

Vec4f is a class with 4 floats and is defined with __declspec(align(16)).

int main()
{
Vec4f v1, v2, v3;
v1.Set(1,1,1,1);
v2.Set(2,2,2,2);

Vec4fAdd(&v3, &v1, &v2);

__asm
{
movaps xmm0, v1
movaps xmm1, v2
addps xmm0, xmm1
movaps v3, xmm0
}

return 0;
}

void Vec4fAdd(void* _o, const void* _a, const void* _b)
{
//_o.x = _a.x + _b.x;
//_o.y = _a.y + _b.y;
//_o.z = _a.z + _b.z;
//_o.w = _a.w + _b.w;

__asm
{
movaps xmm0, _a
movaps xmm1, _b
addps xmm0, xmm1
movaps _o, xmm0
}
}

The asm inlined in main() works but the Vec4fAdd() call does not. It gets an exception reading 0xffffffff. If I also inline some assembly to put the parameters into general registers, it works. What I do not understand is why it doesn't work even though the values that I am seeing in the debugger are the same memory locations that are used in the inlined part in main(). It seems that no matter what function I send the Vec4f into, it crashes even though it looks 16-byte aligned.

Message Edited by Rheikon on 09-30-2005 02:45 PM

Intel_C_Intel · ‎09-30-2005

Dear Rheikon,

This is the same problem as discussed in the thread http://softwareforums.intel.com/ids/board/message?board.id=16&message.id=2949 of this forum. When a variable var is a direct SSE object, one can either use
lea eax, var
movaps xmm0, [eax]
or directly
movaps xmm0, var
to load the value. When a variable ptr is a pointer to an SSE object, one has to use
mov eax, ptr
movaps xmm0, [eax]
instead. Hope this helps.
Aart Bik
http://www.aartbik.com/

rheikon · ‎10-01-2005

See, this is the part I am not clear on, since I am somewhat new to assembly. When you do:

mov eax, ptr
movaps xmm0, eax

Why is that different than just passing directly like:

movaps xmm0, ptr

It seems to me that they would be the same values but apparently not. What happens when you load it into the register?

Intel_C_Intel · ‎10-02-2005

In main, the variable denotes a memory location that directly contains a 128-bit object. In the function, the variable (a pointer) denotes a memory location that contains the address of another memory location that contains a 128-bit object (please note that I used movaps xmm0, [eax], not just eax). Also note that the o.x notation in the comment really should be o->x, maybe that makes things more clear?
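To make the two cases concrete, here is a minimal C++ sketch. It swaps the inline assembly for SSE intrinsics so that it compiles outside MSVC; the `Vec4f` definition and the helper names `load_through_pointer` and `pointer_storage_differs` are illustrative assumptions, not code from this thread.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the poster's class: four floats, 16-byte aligned.
struct alignas(16) Vec4f {
    float x, y, z, w;
};

// The pointer case: first read the pointer's value ("mov eax, ptr"),
// then load 16 bytes from the address it holds ("movaps xmm0, [eax]").
inline __m128 load_through_pointer(const void* ptr) {
    const float* address = static_cast<const float*>(ptr);
    // An aligned load faults on a misaligned address, so check it first.
    assert((reinterpret_cast<std::uintptr_t>(address) & 15u) == 0);
    return _mm_load_ps(address);
}

// Shows why "movaps xmm0, ptr" reads the wrong bytes: the pointer variable
// has storage of its own, at a different address from the vector it points to.
inline bool pointer_storage_differs(const Vec4f& v) {
    const Vec4f* ptr = &v;
    const void* where_pointer_lives = &ptr;  // what "movaps xmm0, ptr" would use
    const void* where_vector_lives = ptr;    // what "movaps xmm0, [eax]" uses
    return where_pointer_lives != where_vector_lives;
}
```

In the failing function, the operand `_a` names the pointer's own storage on the stack, so the load reads the pointer bytes plus whatever follows them instead of the vector.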

rheikon · ‎10-02-2005

Ok so when you put [] around a register or a variable it does the same thing as lea in a way? If so does that mean I could do:

movaps xmm0, [ptr]

I have always wondered about that since I have no good documentation on assembly. Also I never really thought about the fact that it was a pointer and not the data; it is probably because I am a bit sick and have not been thinking very well, hah.

The commented-out code was from the first version of the function, which took references, so that's why it is written like that.

Message Edited by Rheikon on 10-02-2005 12:03 AM

rheikon · ‎10-03-2005

I decided to go with this:

void Vec4fAdd(Vec4f &_o, const Vec4f &_a, const Vec4f &_b)
{
//_o.x = _a.x + _b.x;
//_o.y = _a.y + _b.y;
//_o.z = _a.z + _b.z;
//_o.w = _a.w + _b.w;

__asm
{
mov edx, _a
movaps xmm0, [edx]
mov eax, _b
movaps xmm1, [eax]
addps xmm0, xmm1
mov ecx, _o
movaps [ecx], xmm0
}
}

But I looked at the assembly for it and it looks somewhat inefficient. It loads the vars into registers and pushes them for the call, but then I load them into registers again inside the func. It seems like a waste. Is there any way to make it optimize across the function call? From what I read, it would seem that's what Whole Program Optimization does, but either it is not doing it or it doesn't do what I think it does.

Intel_C_Intel · ‎10-03-2005

To answer your first question, the addressing mode you allude to (two indirections) is not supported in a single instruction on the Intel Architecture (and, probably confusing matters further, putting the [] around a variable has not the effect you assumed at all). The inline assembler simplifies coding by converting C-style variables into appropriate addressing modes, but you should still adhere to the 1:1 mapping with actual machine code. May I suggest you try intrinsics (which is much closer to the C language) or, even better, automatic vectorization?

The code below

struct f4 {
float x,y,z,w;
};

void doit(struct f4 *o, struct f4 *a, struct f4 *b) {
#pragma ivdep
#pragma vector aligned
o->x = a->x + b->x;
o->y = a->y + b->y;
o->z = a->z + b->z;
o->w = a->w + b->w;
}

is simply vectorized by the Intel compiler

joho.c(8) : (col. 3) remark: BLOCK WAS VECTORIZED.

into

movaps xmm0, XMMWORD PTR [eax]
addps xmm0, XMMWORD PTR [edx]
movaps XMMWORD PTR [ecx], xmm0

Hope this helps.
Aart Bik
http://www.aartbik.com/
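For the intrinsics route mentioned in this post, a minimal sketch could look like the following. The `Vec4f` layout is assumed from the description at the top of the thread, and the instruction names in the comments are what a compiler would typically emit, not a guarantee.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_add_ps, _mm_store_ps
#include <cassert>
#include <cstdint>

// Assumed layout of the poster's class: four floats, 16-byte aligned.
struct alignas(16) Vec4f {
    float x, y, z, w;
};

// Adds a and b lane by lane and writes the result to o.
// All three objects must be 16-byte aligned because _mm_load_ps and
// _mm_store_ps are the aligned variants.
inline void Vec4fAdd(Vec4f& o, const Vec4f& a, const Vec4f& b) {
    assert((reinterpret_cast<std::uintptr_t>(&o) & 15u) == 0);
    assert((reinterpret_cast<std::uintptr_t>(&a) & 15u) == 0);
    assert((reinterpret_cast<std::uintptr_t>(&b) & 15u) == 0);
    const __m128 va = _mm_load_ps(&a.x);      // typically a movaps load
    const __m128 vb = _mm_load_ps(&b.x);      // typically a movaps load
    _mm_store_ps(&o.x, _mm_add_ps(va, vb));   // typically addps, then a movaps store
}
```

Because this is ordinary C++ to the compiler, it is generally freer to inline the call and pick registers itself than it is with an opaque `__asm` block, which may help with the redundant loads mentioned earlier in the thread.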

rheikon · ‎10-03-2005

Well thanks for clearing up the [] thing.

And I tried the automatic vectorization but it just turned it into multiple movss and addss and did not vectorize the block.

Intel_C_Intel · ‎10-03-2005

Compiler version, switches, OS?
Here is a Windows example (for Linux, omit the "Q"):

icl -QxP -Qvec_report2 joho.c
Intel C++ Compiler for 32-bit applications, Version 9.0
....
joho.c(8) : (col. 3) remark: BLOCK WAS VECTORIZED.

rheikon · ‎10-03-2005

I have WinXp pro sp2 and intel compiler 9... I will look into the switches. -QxP was on and then I put -Qvec_report2 and it is telling me it can not vectorize.

.main.cpp(30) : (col. 11) remark: loop was not vectorized: unsupported loop structure.
.main.cpp(30) : (col. 11) remark: loop was not vectorized: unsupported loop structure.
.main.cpp(30) : (col. 11) remark: loop was not vectorized: unsupported loop structure.
.main.cpp(30) : (col. 11) remark: loop was not vectorized: unsupported loop structure.
.main.cpp(35) : (col. 2) remark: loop was not vectorized: contains unvectorizable statement at line 35.
.main.cpp(42) : (col. 2) remark: loop was not vectorized: existence of vector dependence.

I will try to fix my loop.

Message Edited by Rheikon on 10-03-2005 10:46 AM

Intel_C_Intel · ‎10-03-2005

Hopefully you will find this on-line article useful:
http://www.intel.com/cd/ids/developer/asmo-na/eng/65774.htm

Or, if you really want to know all the details, the book:
http://www.intel.com/intelpress/sum_vmmx.htm

rheikon · ‎10-03-2005

Ok I got it to vectorize. Now I wonder if there is a way to tell the compiler not to get rid of things in code, so that I can write simple speed tests without it getting rid of the entire thing.
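One common way to keep a compiler from discarding a timing loop is to make its result observable, for example by writing it to a `volatile` variable. The sketch below is an assumption about how such a test might be structured; `g_sink` and `run_add_benchmark` are hypothetical names, and the timing code itself is left out.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Hypothetical sink: a write to a volatile object is observable behaviour,
// so the value that feeds it cannot simply be thrown away.
static volatile float g_sink;

// Runs `iterations` dependent vector additions and publishes one lane
// of the final accumulator through the sink.
inline float run_add_benchmark(std::size_t iterations) {
    __m128 acc = _mm_setzero_ps();
    const __m128 step = _mm_set1_ps(1.0f);
    for (std::size_t i = 0; i < iterations; ++i) {
        acc = _mm_add_ps(acc, step);  // each iteration depends on the previous one
    }
    const float result = _mm_cvtss_f32(acc);  // extract the lowest lane
    g_sink = result;                          // keep the result observable
    return result;
}
```

A compiler may still simplify the loop body itself, so it remains worth checking the generated assembly before trusting the timings.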

Message Edited by Rheikon on 10-03-2005 12:22 PM

rheikon · ‎10-03-2005

Hmmm, I am considering dumping automatic vectorization...

I have a program that adds a bunch of vectors in an array, and it says it can not vectorize the loop, which I do not want it to do in the first place. I want the function I wrote to be vectorized, which it seems it will not do unless the loop containing the function call can be vectorized too. It is too much of a hassle to get the compiler to understand what you want. I might as well go back to writing it by hand at the sacrifice of a little performance.
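A middle ground between auto-vectorization and inline assembly is to write the per-vector add by hand with intrinsics and leave the surrounding loop scalar. This is only a sketch under assumptions: `Vec4f`, `AddOne` and `AddArray` are illustrative names, and every element is assumed to be 16-byte aligned.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Assumed layout: four floats, 16-byte aligned, so arrays of Vec4f stay aligned.
struct alignas(16) Vec4f {
    float x, y, z, w;
};

// One explicit 4-wide add; no help from the vectorizer is needed here.
inline void AddOne(Vec4f& o, const Vec4f& a, const Vec4f& b) {
    _mm_store_ps(&o.x, _mm_add_ps(_mm_load_ps(&a.x), _mm_load_ps(&b.x)));
}

// Plain loop over the arrays: out[i] = a[i] + b[i] for i in [0, count).
// The loop does not need to vectorize, since each call already
// handles four floats at once.
inline void AddArray(Vec4f* out, const Vec4f* a, const Vec4f* b, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        AddOne(out[i], a[i], b[i]);
    }
}
```

With this shape, a "loop was not vectorized" remark for the outer loop is harmless, because the four-wide work is spelled out inside the function.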

Intel_C_Intel · ‎10-03-2005

Why don't you contact me directly (aart.bik@intel.com) with the vectorization issues you encounter? It sounds like with just a little more effort (and patience) you could get the performance you want.

Message Edited by abik on 10-03-2005 02:15 PM