Intel® oneAPI DPC++/C++ Compiler

Intel icx does not scale the code well on Windows

newcfd
Beginner

The same code is compiled on Linux and Windows. The run times for different thread counts are as follows.

On Windows
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 73.31324 (min); OpenMP timer: 73.31224 (min); CPU time: 73.31223 (min) 50 threads
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 75.19106 (min); OpenMP timer: 75.18994 (min); CPU time: 75.18993 (min) 50 threads
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 79.41946 (min); OpenMP timer: 79.41827 (min); CPU time: 79.41827 (min) 56 threads
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 96.83948 (min); OpenMP timer: 96.83786 (min); CPU time: 96.83787 (min) 70 threads
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 127.79664 (min); OpenMP timer: 127.79473 (min); CPU time: 127.79473 (min) 100 threads


On Linux
Successful completion Step 1867 11.8242 years 14059 iterations; real duration: 51.32146 (min); OpenMP timer: 51.32146 (min); CPU time: 2833.64239 (min) 56 threads
Successful completion Step 1867 11.8242 years 14059 iterations; real duration: 32.59633 (min); OpenMP timer: 32.59633 (min); CPU time: 2993.59505 (min) 96 threads

OpenMP settings

    _putenv_s("GOMP_CPU_AFFINITY", "");
    _putenv_s("OMP_DYNAMIC", "false");
    _putenv_s("OMP_MAX_ACTIVE_LEVELS", "1");
    _putenv_s("OMP_WAIT_POLICY", "ACTIVE");
    _putenv_s("OMP_PROC_BIND", "false");

Compiler flags (Windows): /MP /GS /Qiopenmp /GA /W3 /Gy /Zc:wchar_t /Qipo /Zc:forScope /std:c17 /Oi /MD /std:c++20 /Qxhost /Qftz

 

The CPU is the same on both Linux and Windows: Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz 2.59 GHz (2 processors).

 

Why does the Windows build scale poorly and run much slower than the Linux build?

The icx on Windows is the latest.

The icx on Linux:  Intel(R) oneAPI DPC++/C++ Compiler 2025.0.0 (2025.0.0.20241008)

7 Replies
Sravani_K_Intel
Moderator

On Linux, CPU time is about 55× wall time, which is expected for 56 threads doing real work. On Windows, CPU time ≈ wall time, meaning the process accumulates only about one thread's worth of CPU time, regardless of how many threads are spawned.
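
The comparison above can be reproduced directly from the log lines. The sketch below is only illustrative: the function names are made up here, and it assumes the "CPU time" field is process CPU time summed over all threads. How the application measures that field is not shown in this thread; if it comes from the C clock() function, note that Microsoft's C runtime reports elapsed wall-clock time from clock() rather than CPU time, in which case the Windows ratio would say nothing about thread utilization.

```cpp
#include <cassert>

// Effective parallelism: CPU time summed over all threads divided by wall time.
// A value near the thread count means every thread was busy; a value near 1
// means only about one core's worth of CPU time was accounted for.
double effective_parallelism(double cpu_minutes, double wall_minutes) {
    assert(wall_minutes > 0.0);
    return cpu_minutes / wall_minutes;
}

// Fraction of the requested threads that were kept busy (1.0 is ideal).
double parallel_efficiency(double cpu_minutes, double wall_minutes, int threads) {
    assert(threads > 0);
    return effective_parallelism(cpu_minutes, wall_minutes) / threads;
}
```

With the 56-thread Linux run (2833.64239 / 51.32146) the ratio is about 55.2; with the 56-thread Windows run (79.41827 / 79.41946) it is about 1.0.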

GOMP_* variables are for GCC's libgomp. Intel's runtime (libiomp5) uses KMP_* variables. GOMP_CPU_AFFINITY is silently ignored, leaving thread placement to the OS scheduler. Try setting KMP_AFFINITY as described at https://www.intel.com/content/www/us/en/docs/dpcpp-cpp-compiler/developer-guide-reference/2025-2/thread-affinity-interface.html to see if that helps.
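
A minimal sketch of how the settings code might switch to the KMP_* variable suggested above. The helper names are hypothetical, and the affinity string granularity=fine,compact,1,0 is only an example of the general form documented on the linked page, not a recommendation for this workload. It assumes the variables are set before the first OpenMP construct or omp_* call, since the runtime is expected to read them when it initializes; setting them in the shell before launching the program avoids that ordering question.

```cpp
#include <cstdlib>
#include <string>

// Set an environment variable for the current process.
// Returns true on success. Uses _putenv_s on Windows and setenv elsewhere.
bool set_env(const char* name, const char* value) {
#ifdef _WIN32
    return _putenv_s(name, value) == 0;
#else
    return setenv(name, value, /*overwrite=*/1) == 0;
#endif
}

// Read an environment variable; returns an empty string when it is unset.
std::string get_env(const char* name) {
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string();
}

// Hypothetical helper: apply affinity settings for Intel's OpenMP runtime.
// Assumption: this runs before any OpenMP construct or omp_* call.
bool configure_intel_openmp_affinity() {
    bool ok = true;
    // Example value only; adjust or replace it for the actual machine.
    ok = set_env("KMP_AFFINITY", "granularity=fine,compact,1,0") && ok;
    ok = set_env("OMP_DYNAMIC", "false") && ok;
    return ok;
}
```

Adding the verbose modifier to KMP_AFFINITY (for example verbose,granularity=fine,compact,1,0) makes the runtime print the thread bindings it chose, which is a quick way to confirm the setting was picked up.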

 

newcfd
Beginner

It is a big project, so I cannot share test code.

CPU: Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz 2.59 GHz

Can you please tell me which flags are appropriate for solving large sparse linear systems? You must have tested some cases. Does Intel have any benchmark cases?

newcfd
Beginner

-ffp-contract=off -fp-model=precise -ffp-exception-behavior=strict -ftz -mdaz-ftz are used in the icx build. I have tried other flags as well. Any specific suggestions for choosing flags for numerical computing on this CPU?

Sravani_K_Intel
Moderator

Thanks for sharing additional details about your CPU and the current flags being used. Based on this info, here are a few suggestions, some of which you might have tried:

1. Target the architecture explicitly via -xicelake-server (most optimizations enabled for the target), -march=icelake-server, or -xCORE-AVX512 -mtune=icelake-server (for more portability)

2. Vectorization flags

-O3 # enables auto-vectorization
-xCORE-AVX512 # unlock AVX-512 on this CPU
-qopt-zmm-usage=high # encourage 512-bit ZMM register use
-qopt-report=3 # see what got vectorized and why

3. Sparse matrix traversal is memory-bound, so prefetch tuning often matters more than compute flags. You can try:

-qopt-prefetch=4 # aggressive prefetch
-qopt-prefetch-distance=64 # tune to cache line / problem size

Adjust the values for your code. 

4. Interprocedural Optimization via -ipo 
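
To make point 3 concrete, here is a minimal sketch of a sparse matrix-vector product in compressed sparse row (CSR) form, a kernel that typically dominates iterative sparse solvers. The struct and function names are illustrative, and nothing here is taken from the poster's project. Each nonzero costs one multiply-add but needs a matrix value, a column index, and an indirect load from x, which is why such loops tend to be limited by memory traffic rather than arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed sparse row (CSR) matrix. Names are illustrative.
struct CsrMatrix {
    std::size_t rows = 0;
    std::vector<std::size_t> row_ptr;  // size rows + 1; row i spans [row_ptr[i], row_ptr[i+1])
    std::vector<std::size_t> col_idx;  // column index of each nonzero
    std::vector<double> values;        // value of each nonzero
};

// Compute y = A * x for a CSR matrix A.
std::vector<double> spmv(const CsrMatrix& a, const std::vector<double>& x) {
    assert(a.row_ptr.size() == a.rows + 1);
    assert(a.col_idx.size() == a.values.size());
    std::vector<double> y(a.rows, 0.0);
    for (std::size_t i = 0; i < a.rows; ++i) {
        double sum = 0.0;
        for (std::size_t k = a.row_ptr[i]; k < a.row_ptr[i + 1]; ++k) {
            // Indirect load: the address of x depends on col_idx[k].
            sum += a.values[k] * x[a.col_idx[k]];
        }
        y[i] = sum;
    }
    return y;
}
```

Rows are independent, so the outer loop is the usual place for an OpenMP parallel for. The indirect x[a.col_idx[k]] access is the one worth looking for in the vectorization report and in cache-miss profiles.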

 

Since sparse systems are almost always memory-bandwidth bound on this CPU, run Intel VTune or perf stat to check:

  • Cache miss rate - if L3 miss rate is high, prefetch flags matter most
  • Vector intensity - if AVX-512 utilization is low, check -qopt-report output

Are you using any sparse solver library like MKL?

 

newcfd
Beginner

Thanks for your reply. Good to know. I tried the settings you suggested, but they did not help.

 

Here is another case on Linux:

ICX
Successful completion Step 3434 10.0000 years 21630 iterations; real duration: 267.52013 (min); OpenMP timer: 267.52013 (min); CPU time: 25280.19389 (min) threads 96

GCC
Successful completion Step 3312 10.0000 years 20857 iterations; real duration: 72.17944 (min); OpenMP timer: 72.17944 (min); CPU time: 6761.04930 (min) Threads 96

 

The GCC build is more than three times faster. Which settings or flags would make a numerical code built with icx run about as fast as the gcc build? I am not even asking for it to be faster.
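
The two runs above do not take the same number of iterations (21630 vs. 20857), so a per-iteration comparison is slightly fairer than comparing wall time alone. A small sketch of that arithmetic, with hypothetical function names:

```cpp
#include <cassert>

// Wall-clock minutes spent per solver iteration.
double minutes_per_iteration(double wall_minutes, long iterations) {
    assert(iterations > 0);
    return wall_minutes / static_cast<double>(iterations);
}

// How many times longer run A takes per iteration than run B.
double per_iteration_slowdown(double wall_a, long iters_a,
                              double wall_b, long iters_b) {
    return minutes_per_iteration(wall_a, iters_a) /
           minutes_per_iteration(wall_b, iters_b);
}
```

With the numbers above, the icx build takes about 3.7 times as long in wall time (267.52013 / 72.17944) and about 3.6 times as long per iteration.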

newcfd
Beginner

I cannot believe the Intel compiler is so inferior to gcc on an Intel CPU. I do not even need any special settings for gcc. The Intel compiler has so many settings, yet none of them makes the code faster.

Intel guys: what is the secret to getting performance out of the Intel compiler?

Sravani_K_Intel
Moderator

Could you please share a sample of code that demonstrates this issue so we can help troubleshoot?
