Building a low-level project that doesn't and can't use FPU/MMX/SSE instructions (like the Linux kernel) with Intel Compiler 11.0

andrey_mirkin · ‎02-06-2009

Hello,

I have found a white paper about the Intel compilers' compatibility
with the GNU compilers and about compiling the Linux kernel with them.
I would appreciate it if Intel developers could comment on an issue from similar
compilation research.

I am trying to build a low-level project which doesn't, and actually can't, use
FPU/MMX/SSE instructions (exactly like the Linux kernel) using Intel Compiler
10/11 for Linux, but the resulting code contains MMX/SSE instruction
sequences which were generated by ICL itself.
Specifying the options "-mia32", "-mcpu=pentium" or "-m32" doesn't help.

Can someone please tell me how this problem was solved for Linux kernel compilation?
Which option was used, or what tricks did you use, to avoid MMX/SSE code?

With best regards,
Andrey Mirkin

TimP · ‎02-06-2009

I wasn't aware that avoidance of SSE was a goal of projects which use icc to build the Linux kernel, or that anyone considered instruction set choices "tricks."
I can't guess how you would get MMX code generation.
ICL is the name of the Windows version of the compiler.
-mia32 should be recognized by icc 11.0 32-bit only, which would have defaulted to -msse2. icc 10.1 32-bit would default to the equivalent of -mia32.
10.1 32-bit had a partly supported, partly deprecated, undocumented, unreliable backward compatibility option to generate sse but not sse2; it's not clear if that's the code generation you are discussing. 11.0 takes that option as a synonym for -mia32.
You may be able to "trick" the 64-bit icc into generating x87 code by -O0, or by long double data types, or partially by -mp -vec-.
-m32 is a gcc option, not supported by icc, which causes a switch to the 32-bit compiler, but doesn't itself choose instruction set.

As you can see, no one is guessing well the details of your concern. Perhaps you could show an example.

jimdempseyatthecove · ‎02-06-2009

Tim,

I think the problem is some of the data movement and/or data initialization (with optimizations) is using the xmm registers. Not for floating point purposes. Because the standard application level xmm register usage rules declare a subset of xmm registers as free, the compiler assumes it is building for an application and not a kernel routine, and as a result stomps on a register.
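A minimal C sketch of the kind of code this describes, under stated assumptions: `struct packet`, `copy_packet` and `copy_packet_bytewise` are hypothetical names made up for illustration, and whether a given icc version actually emits xmm moves for the first function depends on version and options, so the generated assembly has to be inspected. The second function is only one possible workaround, not something confirmed in this thread.

```c
#include <stddef.h>

/* Hypothetical 64-byte structure, used only for illustration. */
struct packet {
    unsigned char payload[64];
};

/* Plain struct assignment. No floating point is involved, but an
 * optimizing compiler is free to implement the copy with wide SIMD
 * loads and stores through xmm registers. */
void copy_packet(struct packet *dst, const struct packet *src)
{
    *dst = *src;
}

/* Byte-wise copy through volatile pointers. Each volatile access has
 * to be performed as written, so the loop should not be merged into
 * wider moves. It is slower than the plain assignment. */
void copy_packet_bytewise(struct packet *dst, const struct packet *src)
{
    volatile unsigned char *d = dst->payload;
    const volatile unsigned char *s = src->payload;
    size_t i;

    for (i = 0; i < sizeof dst->payload; ++i)
        d[i] = s[i];
}
```

Both functions produce the same result; only the instructions the compiler may choose differ. Comparing the assembly output (for example with the -S option) for the two functions would show whether xmm registers are touched.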

The old Borland Turbo C had an interrupt keyword that you could place on a function (returning void) that would disable the rule of "free registers", but it had other limitations.

The user is asking for an option or #pragma or __declspec or something that declares these xxx registers are not free so preserve or don't use.

Jim Dempsey

TimP · ‎02-06-2009

The original MMX implementations used the same register space as was allocated to x87 stack, with special macros required for switching between x87 and MMX modes. As that was so long ago, on CPU architectures which no longer get much testing, if the kernel source is using MMX explicitly, I could imagine bugs being exposed. If that's the question, current icc is not the ideal compiler for supporting CPUs prior to P4. I do remember when P-II was the greatest and latest, and almost no one attempted to use Intel compilers.

levicki · ‎02-06-2009

tim, jim:

I think that the problem is that we cannot make ICC output only x86 code without touching the SIMD registers. Part of the issue is also the inability to use memset() from the CRT without bringing in the CPU dispatcher and the whole printf() with it.

That kind of limits what ICC can be used for and I believe that we should have a greater control over the code output.

TimP · ‎02-06-2009

Quoting - Igor Levicki

tim, jim:

I think that the problem is that we cannot make ICC output only x86 code without touching the SIMD registers. Part of the issue is also the inability to use memset() from the CRT without bringing in the CPU dispatcher and the whole printf() with it.

That kind of limits what ICC can be used for and I believe that we should have a greater control over the code output.

In my limited experience with it, -fno-builtin prevents the substitution of _intel_fast_mem... run-time functions. As we've seen on this forum, it is possible to abuse that option.
I think the number of customers who would prefer run-time library paths to be limited to architectures specified in the compile switch (avoiding CPU dispatch) may be underestimated. Such ideas have been raised but haven't gone far. It's hard to foresee all the implications in run-time library maintenance, but the QA implications of not knowing which run-time will be in use also are serious.
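As a rough sketch of the alternative to relying on the run-time library: a project that cannot link the dispatched routines could carry its own fill routine and build with -fno-builtin. The name `k_memset` is hypothetical (it is not a function from any Intel or GNU library), and this is an assumption about how one might work around the issue, not a method confirmed in this thread.

```c
#include <stddef.h>

/* Hypothetical freestanding replacement for memset(). It uses only
 * byte stores, so the source itself asks for nothing beyond plain
 * integer instructions. */
void *k_memset(void *dst, int value, size_t count)
{
    unsigned char *p = (unsigned char *)dst;

    while (count-- > 0)
        *p++ = (unsigned char)value;

    return dst;
}
```

An optimizing compiler may still recognize such a loop and turn it back into a library call or a vectorized sequence, so the generated assembly should be checked with the options actually in use.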

levicki · ‎02-06-2009

Quoting - tim18

In my limited experience with it, -fno-builtin prevents the substitution of _intel_fast_mem... run-time functions. As we've seen on this forum, it is possible to abuse that option.
I think the number of customers who would prefer run-time library paths to be limited to architectures specified in the compile switch (avoiding CPU dispatch) may be underestimated. Such ideas have been raised but haven't gone far. It's hard to foresee all the implications in run-time library maintenance, but the QA implications of not knowing which run-time will be in use also are serious.

Yes, but:

1. There is no such thing as -fno-builtin on Windows. Why can't we have that switch?

2. If I specify that I want code for a Penryn CPU (-QxS), why can't the compiler pull in just the dispatched functions for the Penryn code path instead of the dispatcher and the CPU checking code? Why can't I override its decision if I know the target system is going to have the right CPU?

3. Why can't I force the compiler to fall back to x86/FPU code generation if I want to avoid SIMD state and register usage issues in embedded projects?
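To make point 2 concrete, here is a simplified C sketch of the difference between a dispatched call and a direct one. Everything in it is an assumption for illustration: `fill_generic`, `fill_tuned`, `select_fill` and the `cpu_has_feature` flag are hypothetical, the "tuned" path is an ordinary C stand-in, and the real Intel run-time dispatcher is not described in this thread and may work differently.

```c
#include <stddef.h>

typedef void *(*fill_fn)(void *dst, int value, size_t count);

/* Baseline path that would run on any x86 CPU. */
void *fill_generic(void *dst, int value, size_t count)
{
    unsigned char *p = (unsigned char *)dst;
    size_t i;

    for (i = 0; i < count; ++i)
        p[i] = (unsigned char)value;
    return dst;
}

/* Stand-in for a CPU-specific path. A real one would use newer
 * instructions; this one only mimics the interface. */
void *fill_tuned(void *dst, int value, size_t count)
{
    unsigned char *p = (unsigned char *)dst;

    while (count > 0)
        p[--count] = (unsigned char)value;
    return dst;
}

/* What a dispatcher does in essence: pick a path from a run-time CPU
 * check. The check is reduced to a flag passed in by the caller. */
fill_fn select_fill(int cpu_has_feature)
{
    return cpu_has_feature ? fill_tuned : fill_generic;
}
```

With dispatch, the binary carries both paths plus the selection code; calling `fill_tuned` directly is the "I know the target CPU" case, where neither the check nor the baseline path would need to be linked.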

TimP · ‎02-06-2009

Quoting - Igor Levicki

2. If I specify that I want code for a Penryn CPU (-QxS), why can't the compiler pull in just the dispatched functions for the Penryn code path instead of the dispatcher and the CPU checking code? Why can't I override its decision if I know the target system is going to have the right CPU?

I think we're in agreement on this principle.

andrey_mirkin · ‎02-09-2009

Quoting - Igor Levicki

tim, jim:

I think that the problem is that we cannot make ICC output only x86 code without touching the SIMD registers. Part of the issue is also the inability to use memset() from the CRT without bringing in the CPU dispatcher and the whole printf() with it.

That kind of limits what ICC can be used for and I believe that we should have a greater control over the code output.

Igor,

You are right, my question was how to make ICC output only x86 code without SSE/MMX instructions and without touching xmm registers.
As I understand from the discussion, there are no such options in ICC 10.1 and 11.0. Correct me if I'm wrong.

jimdempseyatthecove · ‎02-09-2009

Quoting - Igor Levicki

tim, jim:

I think that the problem is that we cannot make ICC output only x86 code without touching the SIMD registers. Part of the issue is also the inability to use memset() from the CRT without bringing in the CPU dispatcher and the whole printf() with it.

That kind of limits what ICC can be used for and I believe that we should have a greater control over the code output.

Igor,

That was the point I was trying to make. The programmer should have the capability to "exclude" processor features, processor detection code and other fluff routines such as printf.

Jim

Feilong_H_Intel · ‎02-10-2009

Quoting - Igor Levicki

Yes, but:

1. There is no such thing as -fno-builtin on Windows. Why can't we have that switch?

2. If I specify that I want code for a Penryn CPU (-QxS), why can't the compiler pull in just the dispatched functions for the Penryn code path instead of the dispatcher and the CPU checking code? Why can't I override its decision if I know the target system is going to have the right CPU?

3. Why can't I force the compiler to fall back to x86/FPU code generation if I want to avoid SIMD state and register usage issues in embedded projects?

The equivalent option on Windows for -fno-builtin is /Oi-. It inhibits the compiler from doing some transformations, like transforming memcpy() to _intel_fast_memcpy().

levicki · ‎02-11-2009

Quoting - Feilong H (Intel)

The equivalent option on Windows for -fno-builtin is /Oi-. It inhibits the compiler from doing some transformations, like transforming memcpy() to _intel_fast_memcpy().

I am aware of that. What I am still not sure of is in exactly which situations the CPU dispatcher code and printf get included. Is there a list of CRT library functions that rely on a CPU check?