ICC 11.1 ignores "-m" when "-ax" is present?

styc · ‎06-29-2009

Hi,

When I compiled this loop

[cpp]    double s = 0., t = 0.;
    #pragma vector aligned
    for (size_t i = 0; i < n; ++i) {
        s += x[i] * x[i];
        t += (x[i] - y[i]) * (x[i] - y[i]);
    }
[/cpp]

under "-msse3", ICC 11.1 had no problem generating two haddpd instructions; but when I switched to "-msse3 -xSSE4.2", the compiler simply used unpckhpd+addsd. Does this imply that ICC ignores "-m" when "-ax" is present and always uses SSE2 as the baseline?
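For readers following along, the two instruction sequences differ only in the final horizontal sum that follows the vectorized loop. Below is a minimal scalar sketch of that structure, not actual ICC output. It assumes a two-lane accumulator (one 128-bit register holding two doubles) and an even n; the names `Sums` and `reduce_two_lane` are hypothetical.

```cpp
#include <cstddef>

// Hypothetical result type holding the two reductions from the loop.
struct Sums {
    double s;
    double t;
};

// Scalar model of a 2-lane SIMD reduction. Assumes n is even.
Sums reduce_two_lane(const double* x, const double* y, std::size_t n) {
    // Each pair models the low and high halves of one vector accumulator.
    double s_lo = 0.0, s_hi = 0.0;
    double t_lo = 0.0, t_hi = 0.0;

    for (std::size_t i = 0; i < n; i += 2) {
        s_lo += x[i] * x[i];
        s_hi += x[i + 1] * x[i + 1];

        const double d0 = x[i] - y[i];
        const double d1 = x[i + 1] - y[i + 1];
        t_lo += d0 * d0;
        t_hi += d1 * d1;
    }

    // Epilogue: one horizontal add per accumulator. This is the step that
    // can be done with haddpd (SSE3) or with unpckhpd + addsd (SSE2).
    return Sums{s_lo + s_hi, t_lo + t_hi};
}
```

With two accumulators there are two horizontal adds, which matches the two haddpd instructions seen under "-msse3".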

Lingfeng_C_Intel · ‎06-30-2009

Maybe the information below will be helpful to you.

Processor support for the baseline code path is determined by the processor family or instruction set specified in the -m or -x (Linux and Mac OS X) or /arch or /Qx (Windows) option, which has default values for each architecture.

This allows you to impose a more strict processor or instruction set requirement for the baseline code path; however, such generic baseline code will not operate correctly on processors that are not compatible with the minimum processor or instruction set requirement. For the IA-32 architecture, you can specify a baseline code path that will work on all IA-32 compatible processors using the -mia32 (Linux) or /arch:IA32 (Windows) options. You should always specify the processor or instruction set requirements explicitly for the baseline code path, rather than depend on the defaults for the architecture.

Optimizations in the specialized code paths can include generating and using Intel Streaming SIMD Extensions 4 (SSE4), Supplemental Streaming SIMD Extensions 3 (SSSE3), Streaming SIMD Extensions 3 (SSE3), or Streaming SIMD Extensions 2 (SSE2) instructions for supported Intel processors; however, such specialized code paths are executed only after checking verifies that the code is supported by the run-time host processor.
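The baseline/specialized split described above can be pictured as a run-time choice between two versions of the same function. The following is only a rough, portable sketch of that idea, not the dispatch code the compiler actually generates. The `Isa` enum, the kernel names, and `select_kernel` are all hypothetical, and the host capability is passed in as a parameter instead of being queried from the processor.

```cpp
#include <cstddef>

// Hypothetical instruction-set levels, ordered from least to most capable.
enum class Isa { SSE2, SSE3, SSE4_2 };

// Stand-in for the baseline code path (what -m would control).
double sum_sq_baseline(const double* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i] * x[i];
    return s;
}

// Stand-in for the specialized code path (what -ax would add).
// In this sketch it computes the same result as the baseline.
double sum_sq_specialized(const double* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i] * x[i];
    return s;
}

using Kernel = double (*)(const double*, std::size_t);

// Pick the specialized path only if the host meets its requirement;
// otherwise fall back to the baseline path.
Kernel select_kernel(Isa host, Isa specialized_requires) {
    return host >= specialized_requires ? &sum_sq_specialized
                                        : &sum_sq_baseline;
}
```

In this model, a host that only reaches the SSE3 level with a specialized path requiring SSE4.2 ends up on the baseline kernel, which is why the instruction set chosen for the baseline matters for the question in this thread.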

Thanks,
Wise

TimP · ‎06-30-2009

Did you mean to use "-axSSE4.2 -msse3"? That combination would appear to mean "use Intel-specific code only when running on Core i7 or Xeon 5500, otherwise use code partially optimized for AMD" (since you are looking for haddpd, which isn't efficient on Intel). I haven't seen any reason for using SSE4.2 rather than SSE4.1, unless you use the SSE4.2 intrinsics explicitly.
The option string you quoted doesn't appear to correspond with the result you reported.

styc · ‎06-30-2009

Yes, I meant "-axSSE4.2 -msse3". Sorry for the typo. I know that SSE4 has nothing to do with the loop. I just find it unexpected that the compiler does not honor "-m" when "-ax" is specified.

TimP · ‎06-30-2009

I thought you meant that you didn't get expected results on account of your mistyping of the options. I agree that the compiler would be expected to use haddpd in the loop epilogue with correctly specified dual path options for SSE3 and SSE4.
A slight correction to what I said before about the SSE4.1 vs SSE4.2 comparison: the 11.1 compiler does distinguish between them by not generating palignr instructions for SSE4.2, thereby improving performance of some vectorized loops.