Re: Saving xmm registers w/ inlined assembly function?

jamesqf · ‎04-19-2004

I have a couple of functions written in inline assembly code. (They're am_exp_ps and am_exp_eps from the approximate math library.) When I call them in a test program, they give correct results. However, when I use them in real code, it seems as though they are destroying whatever was in the xmm registers before they were called.

That is, the code flow is like this:

F32vec4 x, y, z;

z = F32vec4 (known values); // which leaves z in an xmm reg

...do some computations to get x;

y = am_exp_ps (x);

...and find that z has been changed!

So, how do I preserve the values? I'd have thought the compiler would take care of it, since I made am_exp_ps an inline function, but apparently not.

Also, is there any documentation around on the inline assembly format? Or examples? What's in the compiler manual is pretty scant. I've got this far mostly on trial & error.

Thanks,

James

Maximillia_D_Intel · ‎04-20-2004

James,

The inline assembly syntax is really a property of the native compiler and not of Intel architecture. You should consult the Microsoft docs for inline assembly details on Windows, and the GCC docs (AT&T assembly syntax) for gcc assembly.

The Intel Binary Compatibility Specification defines what registers must be saved by a called function - I don't believe the xmm registers are covered here. With GCC assembly it is possible to specify constraints and it may be possible to communicate to the compiler that particular xmm registers are clobbered (I have never tried this so that's why I say may).

Another option is to use intrinsics, see xmmintrin.h and emmintrin.h in our include directory (/opt/intel_cc_80/include). Using intrinsics, you get the benefits of inline asm, but allow the compiler to be aware of the registers in use.
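To illustrate the intrinsics route, here is a minimal sketch. The function name scale_and_offset is hypothetical and has nothing to do with the Approximate Math Library; it only shows that, with the intrinsics from xmmintrin.h, the compiler chooses the xmm registers itself and so knows which live values it has to keep.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128 and the _mm_* functions */

/* Hypothetical helper (not from the Approximate Math Library):
   computes a*x + b on four packed floats. No register is named
   explicitly, so the compiler does the register allocation. */
static __m128 scale_and_offset(__m128 x, float a, float b)
{
    __m128 va = _mm_set1_ps(a);   /* broadcast a to all four lanes */
    __m128 vb = _mm_set1_ps(b);   /* broadcast b to all four lanes */
    return _mm_add_ps(_mm_mul_ps(va, x), vb);
}
```

A caller would load four floats with _mm_loadu_ps, pass the result in, and store the return value with _mm_storeu_ps.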

Max

jamesqf · ‎04-20-2004

Max,

Thanks for the reply. Unfortunately, I don't have access to current Microsoft assembler documentation (I work on Linux clusters these days, and am of course using the Intel compiler), and while I have written a good bit of assembly, that stopped about the time the 486 came out. As for Gnu assembly... well, it's like anything else those people do when not constrained by the necessity of matching an existing interface: totally incomprehensible. In any case, my question really isn't so much about the syntax - I can figure that out with a bit of work - but about the best way to do it.

I wouldn't mind translating the code into intrinsics, IF I understood what it's doing. I agree with you on the benefits, but I don't see how to do it without some understanding of the code. I've had a message posted here for several weeks, requesting background or source for the algorithm, but have gotten no useful response.

In any case, my question really isn't so much about syntax - I can figure that out with a bit of work - but about the most efficient way to save the xmm regs. I don't mind tweaking the existing assembly to save the xmm regs, but how should I do it? Can they be PUSHed & POPped, or moved to stack or temp space somehow? Is there some documentation that covers this sort of thing?
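On the PUSH/POP question: those instructions do not take xmm operands, so the usual approach is to store the register to a 16-byte memory slot and load it back afterwards. Below is a minimal sketch in gcc-style inline assembly (not the MS-style syntax used in the library source). The function name spill_and_reload and the choice of xmm7 are assumptions made purely for illustration.

```c
#include <xmmintrin.h>  /* __m128 */

/* Hypothetical demonstration: spill xmm7 to memory, destroy it,
   then reload it. The returned value should equal the argument. */
static __m128 spill_and_reload(__m128 value)
{
    __m128 slot;     /* 16-byte stack slot allocated by the compiler */
    __m128 result;
    __asm__ volatile(
        "movaps %2, %%xmm7\n\t"       /* pretend xmm7 holds a live value */
        "movups %%xmm7, %1\n\t"       /* save: store xmm7 to the slot    */
        "xorps  %%xmm7, %%xmm7\n\t"   /* scratch work that zeroes xmm7   */
        "movups %1, %%xmm7\n\t"       /* restore: reload from the slot   */
        "movaps %%xmm7, %0\n\t"       /* hand the restored value back    */
        : "=x"(result), "=m"(slot)
        : "x"(value)
        : "xmm7");                    /* tell the compiler xmm7 is used  */
    return result;
}
```

movups is used for the memory accesses because it does not require 16-byte alignment; movaps is faster on some processors but faults on an unaligned address.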

Thanks,

James

TimP · ‎04-20-2004

The Microsoft ABI for AMD64 (I assume Intel is working to change that designation) specifies that xmm registers are included in those which are automatically saved and restored, while the x87 fp registers are not.

Now that you say that you are working on linux, it could make a difference which kernel and glibc you are using. I think that current 2.4.xx kernels should accomplish this, and certainly all x86-64 kernels will. The kernels would have to be built with a gcc which supports the xmm registers correctly, which rules out 2.9x and some early 3.0 versions, I think. Evidently, if you go back far enough, you will find kernels and gcc versions which were totally unaware of xmm.

jamesqf · ‎04-20-2004

Wouldn't that automatic saving only apply to task switching? This is just for a simple function call - I don't really see why (or how) the kernel would be involved.

That's all the more so because the code in the function is only about 50 instructions, hence should execute in only several tens of nanoseconds per call, where task switching granularity would, IIRC, be on the order of milliseconds.

Maximillia_D_Intel · ‎04-22-2004

James,

To confirm: you are using inline assembly with our Linux compiler. Correct?

With gcc-style inline assembly there is a way of specifying input and output registers and thus communicate to the compiler which registers need to be saved before entry into the inline asm block (which I believe is the core of what you want to know).
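As a rough sketch of what the input, output and clobber sections look like, assuming a gcc-compatible compiler: the function name add_via_scratch and the use of xmm6 as a scratch register are hypothetical choices for illustration only.

```c
#include <xmmintrin.h>  /* __m128 */

/* Hypothetical example: add two packed-float vectors using xmm6 as
   a scratch register. The three colon-separated sections are, in
   order, outputs, inputs, and clobbered registers. */
static __m128 add_via_scratch(__m128 a, __m128 b)
{
    __m128 sum;
    __asm__(
        "movaps %1, %%xmm6\n\t"   /* xmm6 = a                        */
        "addps  %2, %%xmm6\n\t"   /* xmm6 = a + b                    */
        "movaps %%xmm6, %0\n\t"   /* copy the result to the output   */
        : "=x"(sum)               /* output: any xmm register        */
        : "x"(a), "x"(b)          /* inputs: any xmm registers       */
        : "xmm6");                /* clobber: xmm6 is overwritten    */
    return sum;
}
```

Because xmm6 is listed as clobbered, the compiler will not keep a live value in it across the asm block, and will not assign it to any of the operands.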

Can you post a snippet of your code, preferably code that compiles and provides some indication that it either worked or didn't work? If you do that, I can get back to you with syntax for the input-output section using xmm registers.

Thanks,

Max

Maximillia_D_Intel · ‎04-22-2004

James,

I played around with this for fun and have a program that shows the idea in regard to gcc style assembly.

Compile the program below and run it.
icc foo4.cpp; a.out
icc -DWORKS foo4.cpp; a.out

You will see different behavior because the xmm register gets overwritten. Take a look at the program and let me know if you have any questions. For more details on gnu style assembly see www.ibiblio.org/ldp/GCC-Inline-Assembly-HOWTO.html#s6

Max

#include <stdio.h>
#include <fvec.h>

inline int foobar(void)
{
    int i;
    int r1 = 0;
    __m128 y = _mm_set_ps(1.0, 1.0, 1.0, 1.0);
    /* Overwrite xmm0..xmm6 with y. */
    __asm ("movups %1, %%xmm0\n"
           "movups %1, %%xmm1\n"
           "movups %1, %%xmm2\n"
           "movups %1, %%xmm3\n"
           "movups %1, %%xmm4\n"
           "movups %1, %%xmm5\n"
           "movups %1, %%xmm6\n"
           : "=r" (r1)
           : "x" (y)
#ifdef WORKS
           /* Full clobber list: the compiler knows all seven registers change. */
           : "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6"
#else
           /* Incomplete clobber list: xmm1..xmm6 are overwritten behind the compiler's back. */
           : "%xmm0"
#endif
           );
    return r1;
}

int main()
{
    int i;
    float z = 0.0;
    float value = 0.0;
    F32vec4 y;
    y = F32vec4(1.0,2.0,3.0,4.0);
    foobar();
    value = add_horizontal(y);
    printf ("%f\n", value);
    return z;
}

Message Edited by mjdomeik on 04-22-2004 11:25 AM

jamesqf · ‎04-23-2004

Max,

Yes, that's correct. Inline assembly - standard format, not Gnu - with the Linux compiler. Version 8.

The code is on my home machine, so I can't post it now. (And I can't access this site from home. Because of Intel's Internet Explorer only block on access, I have to come to the lab and borrow the secretary's Windoze machine.) I'll work up something over the weekend, and post it Monday.

However, it's not really MY assembly code: it's the exponential functions from the Intel Approximate Math Library. All I did to the source was to delete everything but those functions, and add a #define of one of the MS-only function decorations (_stdcall?) to an empty string. (It gave error messages otherwise.)

Meanwhile, I'll have a look at your example code and see what I can figure out. I did try simply moving the used regs to a static memory area, but I'm still getting errors. Not sure if I'm doing it right, though.

Thanks,

James

jamesqf · ‎04-26-2004

Max,

Attached is, or should be if IE does its job, a bit of code (rather more than a snippet, unfortunately, but as short as I can get) that demonstrates the problem I'm having with assembly code using SSE instructions.

The actual assembly function was simply extracted from the Approximate Math Library, though I've removed irrelevant constants and turned some macros into literals. Made it as simple as possible, IOW.

If I link the assembly function with a very simple test program that just calls am_exp_ps with an __m128 argument, it will return correct values. The test program would look like this:

int main ()

{

However, when I use it in a context that's something like a real program, it returns bad values.

jamesqf · ‎04-26-2004

Well, that was interesting :-) Right in the middle of typing my message, IE thinks I've written enough, and decides to post it. Yet another reason I hate Windoze.

Anyway, the test program looks like

int main ()
{
    __m128 X, Y;
    float *xp, *yp;

    xp = (float *) &X;
    yp = (float *) &Y;

    /* Set all four lanes of X to 1.0. */
    *(xp) = *(xp+1) = *(xp+2) = *(xp+3) = 1.0;

    Y = am_exp_ps (X);

    printf ("%f, %f, %f, %f\n", *(yp), *(yp+1), *(yp+2), *(yp+3));
}

When linked with AMaths.c, it returns approximately correct values.

Maximillia_D_Intel · ‎04-28-2004

James,

O.k. To recap what you are doing and your options:

You have taken the Approximate Math library and have attempted to port it from Windows to Linux.

I believe you have one of two options:

1. Keep the function call non-inline. Have you confirmed whether or not that works?

2. Define input and output registers in your gcc asm statement. I have provided an example of how to do this.

Please let me know if these two options make sense and which you decide to try.

I will also see if I can ping the author of the library and see if he has any comments.

Regards,

Max

jamesqf · ‎05-07-2004

Got sidetracked there for a while, but I think I have found the problem and a fix. The problem is the undocumented (as far as I can find, anyway) "_declspec (naked)" directive in the functions. It seems to cause the code to be compiled without a return, so the function just keeps on going until it hits some other return instruction, by which point the xmm0 register has been set to some other value.

The asm code had a "ret 16" instruction, but that didn't work for some reason.

Anyway, I think I can get it working right from here.

Thanks,

James

Maximillia_D_Intel · ‎05-07-2004

James,

Great to hear you have a route to success!

Max