Intel® oneAPI DPC++/C++ Compiler

Performance regression icl -> icx, with code snippets

hvz_
Beginner

We are trying to move from the Intel icl compiler, which has always optimized our code very well, to the "new" icx, and we're seeing a large overall performance degradation (~30%). I have a few extremely short and simple code snippets that show large differences in performance.

 

Example 1: Using a loop counter in a calculation.

    __attribute__((aligned(64))) float a[10000];

    #pragma vector aligned
    for (int i = 0; i < 10000; i++)
    {
        a[i] = (float)(i * 10);
    }

I would expect this to be vectorized, and in icl it is. This is the inner loop, which will be executed 2500 times:

..B1.6:                         # Preds ..B1.6 ..B1.5
        cvtdq2ps  xmm2, xmm0                                    #14.28
        movups    XMMWORD PTR [a+rax*4], xmm2                   #14.9
        add       rax, 4                                        #12.5
        paddd     xmm0, xmm1                                    #14.28
        cmp       rax, 10000                                    #12.5
        jb        ..B1.6        # Prob 99%     

icx does not vectorize it, so this loop is executed 10,000 times:

.LBB0_1:
        xorps   xmm0, xmm0
        cvtsi2ss        xmm0, ecx
        movss   dword ptr [rax], xmm0
        add     rcx, 10
        add     rax, 4
        cmp     rcx, 100000
        jne     .LBB0_1

If I force it to vectorize with #pragma vector always, it does, but the code looks pretty bad - it's executed 2500 times, just like the icl code, but it's more than twice as long:

.LBB0_1:
        movd    xmm4, ecx
        pshufd  xmm4, xmm4, 0
        paddd   xmm4, xmm0
        movdqa  xmm5, xmm4
        pblendw xmm5, xmm1, 170
        psrld   xmm4, 16
        pblendw xmm4, xmm2, 170
        subps   xmm4, xmm3
        addps   xmm4, xmm5
        movaps  xmmword ptr [4*rax + a+16], xmm4
        add     rax, 4
        add     ecx, 40
        cmp     rax, 9996
        jb      .LBB0_1

Godbolt link: https://godbolt.org/z/5Ed1qrsnT
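One possible workaround for this first snippet (not from the original post; the function names here are mine) is to carry a float accumulator instead of converting the counter every iteration, which puts the loop in the same shape as Example 2 below. In this particular case the rewrite is bit-exact, because every value i * 10 is an integer below 2^24 and so is exactly representable in float:

```c
/* Reference version: per-element int-to-float conversion, as in Example 1. */
static void fill_by_cast(float *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = (float)(i * 10);
}

/* Workaround sketch: a float accumulator replaces the conversion.
 * Exact here because all partial sums (0, 10, ..., 99990) are
 * integers below 2^24, so each += 10.0f is exact. */
static void fill_by_accum(float *a, int n)
{
    float f = 0.0f;
    for (int i = 0; i < n; i++) {
        a[i] = f;
        f += 10.0f;
    }
}
```

Note that this trades the conversion for a loop-carried floating-point dependency, and it is only bit-identical to the cast version while the values stay exactly representable.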

 

Example 2: Using a calculation with a starting value:

    #pragma vector aligned
    for (int i = 0; i < 10000; i++)
    {
        a[i] = f_val;
        f_val += f_step;
    }

icl output is vectorized:

..B1.6:                         # Preds ..B1.6 ..B1.5
        movups    XMMWORD PTR [a+rax*4], xmm0                   #14.9
        add       rax, 4                                        #12.5
        addps     xmm0, xmm1                                    #14.9
        cmp       rax, 10000                                    #12.5
        jb        ..B1.6        # Prob 99%     

icx output is not:

.LBB0_1:
        movss   dword ptr [rax + a+40000], xmm0
        addss   xmm0, xmm1
        add     rax, 4
        jne     .LBB0_1

With #pragma vector always, it is, and this time the code looks good:

.LBB0_1:
        movaps  xmmword ptr [4*rax + a+16], xmm0
        addps   xmm0, xmm1
        add     rax, 4
        cmp     rax, 9996
        jb      .LBB0_1

Godbolt link: https://godbolt.org/z/GWEaGqj3o
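For completeness, here is a self-contained version of the second snippet; the declarations of a, f_val, and f_step are not shown in the original post, so the parameter form and the values used below are assumptions:

```c
#define N 10000

__attribute__((aligned(64))) static float a[N];

/* f_val and f_step are taken as parameters here because their
 * declarations are not shown in the original post. */
static void fill_ramp(float f_val, float f_step)
{
    #pragma vector aligned
    for (int i = 0; i < N; i++)
    {
        a[i] = f_val;
        f_val += f_step;
    }
}
```

With a step like 0.5f, every partial sum is exactly representable, so the scalar and vectorized versions of this loop should produce identical results.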

 

If I look at the vectorization reports, the compiler seems to estimate that the vectorized code is slower than the non-vectorized code for both snippets. With the pretty bad vector code for the 1st snippet that would make sense; for the 2nd snippet, not so much.

 

Some notes:
I targeted a specific CPU type with the compiler flags (SSE4.1); if I target a newer one, icx does choose to vectorize. Even so, icl targeting SSE2 still outperforms icx targeting AVX2.

I added the loop-unrolling pragmas to prevent unrolling, which otherwise makes the code very hard to compare, since different compilers pick different unroll factors; it does not affect vectorization, though.
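The exact pragma spelling used in the original code is not shown; a minimal sketch of one spelling that both icc/icl and the clang-based icx are documented to accept is below (other compilers such as gcc simply warn and ignore the unknown pragma):

```c
__attribute__((aligned(64))) static float b[10000];

static void fill_no_unroll(void)
{
    /* Disable unrolling so the generated loops from different
     * compilers stay directly comparable (spelling assumed;
     * compilers that don't know the pragma warn and ignore it). */
    #pragma nounroll
    for (int i = 0; i < 10000; i++)
    {
        b[i] = (float)(i * 10);
    }
}
```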

 

Am I missing something obvious here? A lot of loops in our code are like the ones in these examples, though typically much more complex. If it were only a few, I would vectorize them manually, but that isn't really an option. If the compiler does not properly optimize such loops, we will need to stay with icl, but since that's no longer being maintained, I don't really like that idea.

Sravani_K_Intel
Moderator

Thanks for reporting this gap between icc and icx. This issue is being escalated for further investigation by the engineering team.
