Intel® oneAPI DPC++/C++ Compiler

Performance regression icl -> icx, with code snippets

hansvz
Beginner

We are trying to move from the Intel icl compiler, which has always optimized our code very well, to the "new" icx. We're seeing a large overall degradation in performance (~30%). I have a few extremely short and simple code snippets that show large differences in performance.

 

Example 1: Using a loop counter in a calculation.

    __attribute__((aligned(64))) float a[10000];

    #pragma vector aligned
    for (int i = 0; i < 10000; i++)
    {
        a[i] = (float)(i * 10);
    }

I would expect this to be vectorized, and in icl it is. This is the inner loop, which will be executed 2500 times:

..B1.6:                         # Preds ..B1.6 ..B1.5
        cvtdq2ps  xmm2, xmm0                                    #14.28
        movups    XMMWORD PTR [a+rax*4], xmm2                   #14.9
        add       rax, 4                                        #12.5
        paddd     xmm0, xmm1                                    #14.28
        cmp       rax, 10000                                    #12.5
        jb        ..B1.6        # Prob 99%     

icx does not vectorize it, so this loop is executed 10,000 times:

.LBB0_1:
        xorps   xmm0, xmm0
        cvtsi2ss        xmm0, ecx
        movss   dword ptr [rax], xmm0
        add     rcx, 10
        add     rax, 4
        cmp     rcx, 100000
        jne     .LBB0_1

If I force it to vectorize with #pragma vector always, it does, but the code looks pretty bad - it's executed 2500 times, just like the icl code, but it's more than twice as long:

.LBB0_1:
        movd    xmm4, ecx
        pshufd  xmm4, xmm4, 0
        paddd   xmm4, xmm0
        movdqa  xmm5, xmm4
        pblendw xmm5, xmm1, 170
        psrld   xmm4, 16
        pblendw xmm4, xmm2, 170
        subps   xmm4, xmm3
        addps   xmm4, xmm5
        movaps  xmmword ptr [4*rax + a+16], xmm4
        add     rax, 4
        add     ecx, 40
        cmp     rax, 9996
        jb      .LBB0_1
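
For comparison, this is roughly what the icl pattern looks like written out as SSE intrinsics - just an illustrative sketch of the codegen I'd expect, not our production code; the function name is a placeholder and I'm assuming a is 16-byte aligned:

    #include <emmintrin.h>  /* SSE2 is enough for these intrinsics */

    /* Hypothetical hand-vectorized version of Example 1, mirroring the icl
       loop: keep i*10 in an integer vector, convert 4 lanes to float, store,
       and bump the counter by 40 per iteration. */
    void fill_scaled(float *a)  /* assumed 16-byte aligned, 10000 elements */
    {
        __m128i idx  = _mm_setr_epi32(0, 10, 20, 30);
        __m128i step = _mm_set1_epi32(40);

        for (int i = 0; i < 10000; i += 4)
        {
            __m128 v = _mm_cvtepi32_ps(idx);   /* cvtdq2ps */
            _mm_store_ps(&a[i], v);            /* aligned store */
            idx = _mm_add_epi32(idx, step);    /* paddd */
        }
    }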

 

 

Example 2: Using a calculation with a starting value:

    #pragma vector aligned
    for (int i = 0; i < 10000; i++)
    {
        a[i] = f_val;
        f_val += f_step;
    }

icl output is vectorized:

..B1.6:                         # Preds ..B1.6 ..B1.5
        movups    XMMWORD PTR [a+rax*4], xmm0                   #14.9
        add       rax, 4                                        #12.5
        addps     xmm0, xmm1                                    #14.9
        cmp       rax, 10000                                    #12.5
        jb        ..B1.6        # Prob 99%     

icx output is not:

.LBB0_1:
        movss   dword ptr [rax + a+40000], xmm0
        addss   xmm0, xmm1
        add     rax, 4
        jne     .LBB0_1

With #pragma vector always, it is, and this time the code looks good:

.LBB0_1:
        movaps  xmmword ptr [4*rax + a+16], xmm0
        addps   xmm0, xmm1
        add     rax, 4
        cmp     rax, 9996
        jb      .LBB0_1

The performance of the icx code with #pragma vector always is almost exactly a factor of 4 better, as one would expect from the generated code (four floats per 128-bit SSE register).
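
The equivalent hand-written SSE version of Example 2 is just as short - again only an illustrative sketch (the function name is a placeholder and I'm assuming a is 16-byte aligned), matching the movaps/addps pattern icl emits. Note that any vectorization of this loop slightly changes the floating-point accumulation order compared to the scalar version, which may be part of why icx is conservative without the pragma:

    #include <emmintrin.h>

    /* Hypothetical hand-vectorized version of Example 2: precompute the
       first four ramp values, then advance all four lanes by 4*f_step. */
    void fill_ramp(float *a, float f_val, float f_step)
    {
        __m128 v    = _mm_setr_ps(f_val,
                                  f_val + f_step,
                                  f_val + 2 * f_step,
                                  f_val + 3 * f_step);
        __m128 step = _mm_set1_ps(4 * f_step);

        for (int i = 0; i < 10000; i += 4)
        {
            _mm_store_ps(&a[i], v);   /* movaps */
            v = _mm_add_ps(v, step);  /* addps */
        }
    }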

 

If I look at the vectorization reports, icx seems to think that the vectorized code is slower than the non-vectorized code for both code snippets. With the pretty bad code for the 1st snippet that would make sense; for this 2nd snippet, not so much.

 

Some notes:
I targeted a specific CPU type with the compiler flags (SSE4.1); if I target a newer one, icx does choose to vectorize. But overall (for our entire code base), icl targeting SSE2 outperforms icx targeting AVX2.

I added the loop unroll pragmas to suppress unrolling, which otherwise makes the code very hard to compare since different compilers select different unroll factors - it does not affect vectorization though.
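
Concretely, the snippets were compiled roughly like this (a sketch from memory - the exact pragma spellings may differ, so treat this as an approximation of what I used):

    #pragma vector always   /* only present for the "forced" measurements */
    #pragma nounroll        /* keep the generated code comparable */
    for (int i = 0; i < 10000; i++)
    {
        a[i] = (float)(i * 10);
    }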

 

Am I missing something obvious here? A lot of loops in our code are like the ones in these examples, though typically a lot more complex. If it were only a few, I would manually vectorize them, but that isn't really an option. If the compiler does not properly optimize such loops, we will need to stay with icl, but since that's no longer maintained, I don't really like that idea.

4 Replies
Ethan_F_Intel
Moderator

This issue has been reported to engineering and is being actively worked on.

mochongli
Novice

@Ethan_F_Intel @hansvz Out of curiosity, I tested the Linpack benchmark.

[Screenshots: icc.png, icx.png]

The test platform is Windows.

The icx performance of 2025.1.1 is the same as that of 2023.2.0.

I also tested clang.
[Screenshot: clang.png]

hvz_
Beginner

Based on your screenshots, unless I'm missing something, it looks like icl and clang perform the same, but icx is 35% slower. This is visible in both the KFLOPS and Time fields: the 1024 case takes 19.17 seconds with icx, versus 12.87 with Clang and 12.81 with icl.

I have not yet tested Clang on our end, because I assumed it would be no better than icx.

 

mochongli
Novice

In this test, GCC is the fastest, and MSVC, Clang, and icl are all faster than icx.
