We are trying to move from the Intel icl compiler, which has always optimized our code very well, to the "new" icx compiler, and we are seeing a large overall performance degradation (~30%). I have a few extremely short and simple code snippets that show large differences in performance.
Example 1: Using a loop counter in a calculation.
    __attribute__((aligned(64))) float a[10000];

    #pragma vector aligned
    for (int i = 0; i < 10000; i++) {
        a[i] = (float)(i * 10);
    }
I would expect this to be vectorized, and in icl it is. This is the inner loop, which will be executed 2500 times:
    ..B1.6:                        # Preds ..B1.6 ..B1.5
        cvtdq2ps xmm2, xmm0                      #14.28
        movups   XMMWORD PTR [a+rax*4], xmm2     #14.9
        add      rax, 4                          #12.5
        paddd    xmm0, xmm1                      #14.28
        cmp      rax, 10000                      #12.5
        jb       ..B1.6                          # Prob 99%
icx does not vectorize it, so this loop is executed 10,000 times:
    .LBB0_1:
        xorps    xmm0, xmm0
        cvtsi2ss xmm0, ecx
        movss    dword ptr [rax], xmm0
        add      rcx, 10
        add      rax, 4
        cmp      rcx, 100000
        jne      .LBB0_1
If I force it to vectorize with #pragma vector always, it does, but the code looks pretty bad - it's executed 2500 times, just like the icl code, but it's more than twice as long:
    .LBB0_1:
        movd    xmm4, ecx
        pshufd  xmm4, xmm4, 0
        paddd   xmm4, xmm0
        movdqa  xmm5, xmm4
        pblendw xmm5, xmm1, 170
        psrld   xmm4, 16
        pblendw xmm4, xmm2, 170
        subps   xmm4, xmm3
        addps   xmm4, xmm5
        movaps  xmmword ptr [4*rax + a+16], xmm4
        add     rax, 4
        add     ecx, 40
        cmp     rax, 9996
        jb      .LBB0_1
Example 2: Using a calculation with a starting value:
    #pragma vector aligned
    for (int i = 0; i < 10000; i++) {
        a[i] = f_val;
        f_val += f_step;
    }
icl output is vectorized:
    ..B1.6:                        # Preds ..B1.6 ..B1.5
        movups XMMWORD PTR [a+rax*4], xmm0       #14.9
        add    rax, 4                            #12.5
        addps  xmm0, xmm1                        #14.9
        cmp    rax, 10000                        #12.5
        jb     ..B1.6                            # Prob 99%
icx output is not:
    .LBB0_1:
        movss dword ptr [rax + a+40000], xmm0
        addss xmm0, xmm1
        add   rax, 4
        jne   .LBB0_1
With #pragma vector always, it is, and this time the code looks good:
    .LBB0_1:
        movaps xmmword ptr [4*rax + a+16], xmm0
        addps  xmm0, xmm1
        add    rax, 4
        cmp    rax, 9996
        jb     .LBB0_1
The performance of the icx code with #pragma vector always is almost exactly a factor of 4 better, as one would expect from the generated code.
If I look at the vectorization reports, icx seems to think that the vectorized code is slower than the non-vectorized code for both snippets. For the first snippet, given the rather bad vectorized code, that would make sense; for the second, not so much.
Some notes:
I targeted a specific CPU type with the compiler flags (SSE4.1); if I target a newer one, icx does choose to vectorize. But overall (for our entire code base), icl targeting SSE2 outperforms icx targeting AVX2.
I added #pragma unroll directives to disable loop unrolling, because unrolling makes the generated code very hard to compare (different compilers select different unroll factors); disabling it does not affect vectorization, though.
Am I missing something obvious here? A lot of the loops in our code are like the ones in these examples, though typically much more complex. If it were only a few, I would vectorize them manually, but that isn't really an option. If the compiler does not properly optimize such loops, we will need to stay with icl, but since icl is no longer being maintained, I don't really like that idea.
This issue has been reported to engineering and is being handled actively.
@Ethan_F_Intel @hansvz Out of curiosity, I tested the Linpack benchmark.
The test platform is Windows.
The icx performance of 2025.1.1 is the same as that of 2023.2.0.
I also tested clang.
Based on your screenshots, unless I'm missing something, it looks like icl and clang perform the same, but icx is about 35% slower; this is visible in both the KFLOPS and Time fields. The 1024 case takes 19.17 seconds with icx, vs. 12.87 with clang and 12.81 with icl.
I had not yet tested clang on our end because I assumed it would be, if anything, worse than icx.
In this test, gcc is the fastest, and msvc, clang, and icl are all faster than icx.
