Intel icx does not scale the code well on Windows

newcfd · ‎04-10-2026

The same code is compiled on Linux and Windows. The run times for different thread counts are as follows.

on Windows
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 73.31324 (min); OpenMP timer: 73.31224 (min); CPU time: 73.31223 (min) 50 threads
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 75.19106 (min); OpenMP timer: 75.18994 (min); CPU time: 75.18993 (min) 50 threads
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 79.41946 (min); OpenMP timer: 79.41827 (min); CPU time: 79.41827 (min) 56 threads
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 96.83948 (min); OpenMP timer: 96.83786 (min); CPU time: 96.83787 (min) 70 threads
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 127.79664 (min); OpenMP timer: 127.79473 (min); CPU time: 127.79473 (min) 100 threads

on Linux
Successful completion Step 1867 11.8242 years 14059 iterations; real duration: 51.32146 (min); OpenMP timer: 51.32146 (min); CPU time: 2833.64239 (min) 56 threads
Successful completion Step 1867 11.8242 years 14059 iterations; real duration: 32.59633 (min); OpenMP timer: 32.59633 (min); CPU time: 2993.59505 (min) 96 threads

OpenMP settings

    _putenv_s("GOMP_CPU_AFFINITY", ""); 
    _putenv_s("OMP_DYNAMIC", "false");  
    _putenv_s("OMP_MAX_ACTIVE_LEVELS", "1");

    _putenv_s("OMP_WAIT_POLICY", "ACTIVE");
    _putenv_s("OMP_PROC_BIND", "false");

/MP /GS /Qiopenmp /GA /W3 /Gy /Zc:wchar_t /Qipo /Zc:forScope /std:c17 /Oi /MD /std:c++20 /Qxhost /Qftz

The CPU is the same on both Linux and Windows: Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz 2.59 GHz (2 processors).

Why does the code built on Windows scale so poorly and run so much slower than on Linux?

The icx on Windows is the latest.

The icx on Linux: Intel(R) oneAPI DPC++/C++ Compiler 2025.0.0 (2025.0.0.20241008)

Sravani_K_Intel · ‎04-13-2026

On Linux, CPU time ≈ 55× wall time (expected for 56 threads doing real work). On Windows, CPU time ≈ wall time, meaning the process is effectively running on ~1 thread's worth of work, regardless of how many threads are spawned.
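The ratio quoted above can be reproduced directly from the logged numbers. The following is a minimal sketch, not part of the original application: the `Run` struct and helper names are hypothetical, and the figures are copied from the Windows and Linux logs earlier in the thread.

```cpp
#include <cassert>
#include <cstdio>

// One logged run: requested thread count plus wall-clock and CPU time in minutes.
// (Hypothetical struct for illustration; values come from the logs in this thread.)
struct Run {
    const char* platform;
    int threads;
    double wall_min;
    double cpu_min;
};

// Average number of busy cores: total CPU time divided by wall-clock time.
inline double busy_cores(const Run& r) {
    assert(r.wall_min > 0.0);
    return r.cpu_min / r.wall_min;
}

// Fraction of the requested threads that were kept busy on average.
inline double thread_utilization(const Run& r) {
    assert(r.threads > 0);
    return busy_cores(r) / r.threads;
}

// Print the ratio for a few of the logged runs.
inline void print_report() {
    const Run runs[] = {
        {"Windows", 50, 73.31324, 73.31223},
        {"Windows", 100, 127.79664, 127.79473},
        {"Linux", 56, 51.32146, 2833.64239},
        {"Linux", 96, 32.59633, 2993.59505},
    };
    for (const Run& r : runs) {
        std::printf("%-8s %3d threads: %6.2f busy cores, %5.1f%% utilization\n",
                    r.platform, r.threads, busy_cores(r),
                    100.0 * thread_utilization(r));
    }
}
```

Calling `print_report()` gives roughly 55 busy cores for the 56-thread Linux run and roughly 1 for every Windows run. This only restates what the logs report; it does not show how the application measures CPU time on each platform.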

GOMP_* variables are for GCC's libgomp. Intel's runtime (libiomp5) uses KMP_* variables. GOMP_CPU_AFFINITY is silently ignored, leaving thread placement to the OS scheduler. Try setting KMP_AFFINITY as described at https://www.intel.com/content/www/us/en/docs/dpcpp-cpp-compiler/developer-guide-reference/2025-2/thread-affinity-interface.html to see if that helps.
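A minimal sketch of what switching to KMP_AFFINITY could look like in code, keeping the in-process style of the original settings. The helper names are hypothetical, and the value "verbose,granularity=fine,compact" is only one example setting, not a recommendation for this workload. It is also an assumption that the runtime has not yet initialized when these calls run; setting the variable in the shell before launching the program avoids that question.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Portable wrapper: _putenv_s on Windows (as in the original post), setenv elsewhere.
inline bool set_env(const char* name, const char* value) {
    assert(name != nullptr && value != nullptr);
#ifdef _WIN32
    return _putenv_s(name, value) == 0;
#else
    return setenv(name, value, 1) == 0;
#endif
}

// Read a variable back; returns an empty string if it is unset.
inline std::string get_env(const char* name) {
    const char* v = std::getenv(name);
    return v ? std::string(v) : std::string();
}

// Hypothetical helper: call at the very top of main(), before the first
// OpenMP call or parallel region, so the runtime has a chance to see the values.
inline bool configure_intel_openmp() {
    bool ok = true;
    // Example value only: "verbose" asks the runtime to print the binding it chose,
    // which makes it easy to check whether the setting was picked up.
    ok = set_env("KMP_AFFINITY", "verbose,granularity=fine,compact") && ok;
    // Remaining settings carried over from the original post.
    ok = set_env("OMP_DYNAMIC", "false") && ok;
    ok = set_env("OMP_MAX_ACTIVE_LEVELS", "1") && ok;
    ok = set_env("OMP_WAIT_POLICY", "ACTIVE") && ok;
    // OMP_PROC_BIND is deliberately left unset here so that only one affinity
    // mechanism is in play (an assumption made for a clean test, not a rule).
    return ok;
}
```

With "verbose" in the value, the runtime reports its thread binding at startup, so the output shows whether the setting took effect.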

newcfd · ‎04-30-2026

It is a big project, and I cannot share test code.

CPU: Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz 2.59 GHz

Can you please tell me which flags are appropriate for solving large sparse linear systems? You must have tested some cases. Does Intel have any benchmark cases?

newcfd · ‎05-25-2026

The icx build uses -ffp-contract=off -fp-model=precise -ffp-exception-behavior=strict -ftz -mdaz-ftz. I have tried other flags as well. Any specific suggestions for choosing flags for numerical computing on this CPU?

Sravani_K_Intel · ‎05-27-2026

Thanks for sharing additional details about your CPU and the current flags being used. Based on this info, here are a few suggestions, some of which you might have tried:

1. Target the architecture explicitly with -xicelake-server (enables the most optimizations for the target), or with -march=icelake-server, or with -xCORE-AVX512 -mtune=icelake-server (for more portability).

2. Vectorization flags

-O3 # more aggressive loop optimizations (auto-vectorization is already on at -O2)
-xCORE-AVX512 # unlock AVX-512 on this CPU
-qopt-zmm-usage=high # encourage 512-bit ZMM register use
-qopt-report=3 # see what got vectorized and why

3. Sparse matrix traversal is memory-bound, so prefetch tuning often matters more than compute flags. You can try:

-qopt-prefetch=4 # aggressive prefetch
-qopt-prefetch-distance=64 # tune to cache line / problem size

Adjust the values for your code.

4. Interprocedural Optimization via -ipo

Since sparse systems are almost always memory-bandwidth bound on this CPU, run Intel VTune or perf stat to check:

Cache miss rate - if L3 miss rate is high, prefetch flags matter most
Vector intensity - if AVX-512 utilization is low, check -qopt-report output
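To make the memory-bound point concrete, here is a minimal sketch of a sparse matrix-vector product in CSR (compressed sparse row) format, the kind of loop whose entries in the -qopt-report output are worth reading. The `CsrMatrix` struct and function names are hypothetical and are not taken from the poster's code; the bytes-per-nonzero estimate assumes double values and 32-bit column indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR matrix layout, for illustration only.
struct CsrMatrix {
    std::size_t rows = 0;
    std::vector<std::size_t> row_ptr;  // size rows + 1; row i spans [row_ptr[i], row_ptr[i+1])
    std::vector<int> col_idx;          // column index of each nonzero
    std::vector<double> values;        // value of each nonzero
};

// y = A * x. Rows are independent, so the outer loop can run in parallel when
// the file is compiled with an OpenMP flag; otherwise the pragma is ignored.
inline void spmv(const CsrMatrix& a, const std::vector<double>& x,
                 std::vector<double>& y) {
    assert(a.row_ptr.size() == a.rows + 1);
    assert(y.size() == a.rows);
    const long long n = static_cast<long long>(a.rows);
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i) {
        const std::size_t row = static_cast<std::size_t>(i);
        double sum = 0.0;
        for (std::size_t k = a.row_ptr[row]; k < a.row_ptr[row + 1]; ++k) {
            // x is read through col_idx: an indirect (gather) access that
            // limits vectorization and cache reuse.
            sum += a.values[k] * x[static_cast<std::size_t>(a.col_idx[k])];
        }
        y[row] = sum;
    }
}

// Rough arithmetic intensity per nonzero: 2 flops (multiply + add) against
// 8 bytes of value, 4 bytes of column index, and 8 more bytes if the x entry
// has to come from memory instead of cache.
inline double flops_per_byte(bool x_from_memory) {
    const double bytes = 8.0 + 4.0 + (x_from_memory ? 8.0 : 0.0);
    return 2.0 / bytes;
}
```

Under these assumptions the kernel performs roughly 0.1 to 0.17 flops per byte moved, which is why memory traffic tends to dominate over instruction selection in loops like this.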

Are you using any sparse solver library like MKL?

newcfd · ‎04-21-2026

Thanks for your reply. Good to know. I tried the settings you suggested, but they did not help.

Another case on Linux

ICX
Successful completion Step 3434 10.0000 years 21630 iterations; real duration: 267.52013 (min); OpenMP timer: 267.52013 (min); CPU time: 25280.19389 (min) threads 96

GCC
Successful completion Step 3312 10.0000 years 20857 iterations; real duration: 72.17944 (min); OpenMP timer: 72.17944 (min); CPU time: 6761.04930 (min) Threads 96

The GCC build is more than three times faster. Which settings or flags would make the icx build run roughly as fast as the GCC build? We are not even asking for it to be faster.
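Because the two builds take different numbers of iterations, a per-iteration comparison is a fairer measure than total wall time. The sketch below uses only the figures from the two Linux logs above; the `SolverRun` struct and helper names are hypothetical.

```cpp
#include <cassert>

// One logged solver run (hypothetical struct; values come from the logs above).
struct SolverRun {
    const char* compiler;
    int iterations;
    double wall_min;  // real duration in minutes
    double cpu_min;   // CPU time in minutes
};

// Wall-clock seconds spent per solver iteration.
inline double seconds_per_iteration(const SolverRun& r) {
    assert(r.iterations > 0);
    return r.wall_min * 60.0 / r.iterations;
}

// How many times slower `slow` is than `fast`, per iteration.
inline double slowdown(const SolverRun& slow, const SolverRun& fast) {
    return seconds_per_iteration(slow) / seconds_per_iteration(fast);
}

// Average number of busy cores: CPU time divided by wall-clock time.
inline double busy_cores(const SolverRun& r) {
    assert(r.wall_min > 0.0);
    return r.cpu_min / r.wall_min;
}

// The two 96-thread Linux runs reported in this post.
inline SolverRun icx_run() { return {"icx", 21630, 267.52013, 25280.19389}; }
inline SolverRun gcc_run() { return {"gcc", 20857, 72.17944, 6761.04930}; }
```

With these numbers, `slowdown(icx_run(), gcc_run())` is about 3.6, and `busy_cores` is about 94 for both builds. Taken at face value, both builds keep nearly all 96 threads busy, so the gap appears to be in per-thread throughput and not in scaling, although the logs alone cannot say where that time goes.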

newcfd · ‎04-21-2026

I cannot believe the Intel compiler is so inferior to gcc on an Intel CPU. I do not even need any special settings for gcc. The Intel compiler has so many settings, yet none of them makes the code faster.

Intel guys: what are the secrets in Intel compiler?

Sravani_K_Intel · ‎04-28-2026

Could you please share a sample of code that demonstrates this issue so we can help troubleshoot?