We are trying to move from the Intel icl compiler, which has always optimized our code very well, to the "new" icx. We are seeing a large overall performance degradation (~30%). I have a few extremely short and simple code snippets that show large differences in performance.
Example 1: Using a loop counter in a calculation.
__attribute__((aligned(64))) float a[10000];
#pragma vector aligned
for (int i = 0; i < 10000; i++)
{
a[i] = (float)(i * 10);
}
I would expect this to be vectorized, and in icl it is. This is the inner loop, which will be executed 2500 times:
..B1.6: # Preds ..B1.6 ..B1.5
cvtdq2ps xmm2, xmm0 #14.28
movups XMMWORD PTR [a+rax*4], xmm2 #14.9
add rax, 4 #12.5
paddd xmm0, xmm1 #14.28
cmp rax, 10000 #12.5
jb ..B1.6 # Prob 99%
icx does not vectorize it, so this loop is executed 10,000 times:
.LBB0_1:
xorps xmm0, xmm0
cvtsi2ss xmm0, ecx
movss dword ptr [rax], xmm0
add rcx, 10
add rax, 4
cmp rcx, 100000
jne .LBB0_1
If I force it to vectorize with #pragma vector always, it does, but the code looks pretty bad - it's executed 2500 times, just like the icl code, but it's more than twice as long:
.LBB0_1:
movd xmm4, ecx
pshufd xmm4, xmm4, 0
paddd xmm4, xmm0
movdqa xmm5, xmm4
pblendw xmm5, xmm1, 170
psrld xmm4, 16
pblendw xmm4, xmm2, 170
subps xmm4, xmm3
addps xmm4, xmm5
movaps xmmword ptr [4*rax + a+16], xmm4
add rax, 4
add ecx, 40
cmp rax, 9996
jb .LBB0_1
Godbolt link: https://godbolt.org/z/5Ed1qrsnT
Example 2: Using a calculation with a starting value:
#pragma vector aligned
for (int i = 0; i < 10000; i++)
{
a[i] = f_val;
f_val += f_step;
}
icl output is vectorized:
..B1.6: # Preds ..B1.6 ..B1.5
movups XMMWORD PTR [a+rax*4], xmm0 #14.9
add rax, 4 #12.5
addps xmm0, xmm1 #14.9
cmp rax, 10000 #12.5
jb ..B1.6 # Prob 99%
icx output is not:
.LBB0_1:
movss dword ptr [rax + a+40000], xmm0
addss xmm0, xmm1
add rax, 4
jne .LBB0_1
With #pragma vector always, it is, and this time the code looks good:
.LBB0_1:
movaps xmmword ptr [4*rax + a+16], xmm0
addps xmm0, xmm1
add rax, 4
cmp rax, 9996
jb .LBB0_1
Godbolt link: https://godbolt.org/z/GWEaGqj3o
If I look at the vectorization reports, the compiler estimates that the vectorized code is slower than the non-vectorized code for both snippets. With the pretty bad code for the first snippet that would make sense; for this second snippet, not so much.
Some notes:
- I targeted a specific CPU type with the compiler flags (SSE4.1); if I target a newer one, icx does choose to vectorize. However, icl targeting SSE2 still outperforms icx targeting AVX2.
- I added loop-unroll pragmas to disable unrolling, which otherwise makes the code very hard to compare since different compilers select different unroll factors; it does not affect vectorization.
Am I missing something obvious here? A lot of loops in our code are like the ones in these examples, typically much more complex. If it were only a few, I would manually vectorize them, but that isn't really an option. If the compiler does not properly optimize such loops, we will need to stay with icl, but since that's no longer being maintained, I don't really like that idea.
Thanks for reporting this gap between icc and icx. This issue is being escalated for further investigation by the engineering team.
