The same code is compiled on Linux and Windows. The run times for different thread counts are as follows.
On Windows:
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 73.31324 (min); OpenMP timer: 73.31224 (min); CPU time: 73.31223 (min) 50 threads
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 75.19106 (min); OpenMP timer: 75.18994 (min); CPU time: 75.18993 (min) 50 threads
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 79.41946 (min); OpenMP timer: 79.41827 (min); CPU time: 79.41827 (min) 56 threads
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 96.83948 (min); OpenMP timer: 96.83786 (min); CPU time: 96.83787 (min) 70 threads
Successful completion Step 1892 11.8242 years 14414 iterations; real duration: 127.79664 (min); OpenMP timer: 127.79473 (min); CPU time: 127.79473 (min) 100 threads
On Linux:
Successful completion Step 1867 11.8242 years 14059 iterations; real duration: 51.32146 (min); OpenMP timer: 51.32146 (min); CPU time: 2833.64239 (min) 56 threads
Successful completion Step 1867 11.8242 years 14059 iterations; real duration: 32.59633 (min); OpenMP timer: 32.59633 (min); CPU time: 2993.59505 (min) 96 threads
OpenMP settings:
_putenv_s("GOMP_CPU_AFFINITY", "");
_putenv_s("OMP_DYNAMIC", "false");
_putenv_s("OMP_MAX_ACTIVE_LEVELS", "1");
_putenv_s("OMP_WAIT_POLICY", "ACTIVE");
_putenv_s("OMP_PROC_BIND", "false");
/MP /GS /Qiopenmp /GA /W3 /Gy /Zc:wchar_t /Qipo /Zc:forScope /std:c17 /Oi /MD /std:c++20 /Qxhost /Qftz
The CPU is the same on both Linux and Windows: Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz 2.59 GHz (2 processors).
Why does the Windows build scale poorly and run much slower than the Linux build?
The icx on Windows is the latest.
The icx on Linux: Intel(R) oneAPI DPC++/C++ Compiler 2025.0.0 (2025.0.0.20241008)
On Linux, CPU time ≈ 55× wall time (expected for 56 threads doing real work). On Windows, CPU time ≈ wall time, meaning the process is effectively running on ~1 thread's worth of work, regardless of how many threads are spawned.
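The ratios quoted above can be reproduced from the log lines with a few lines of C++. This is only a sketch: the function and struct names are made up for illustration, and it assumes the "CPU time" field really is process CPU time summed over all threads. The thread does not show how that field is measured; if it comes from the C `clock()` function, note that Microsoft's C runtime returns elapsed wall-clock time from `clock()`, in which case the Windows ratio would be close to 1 regardless of how busy the threads are.

```cpp
#include <cassert>
#include <cstdio>

// Average number of busy threads: total CPU time divided by wall-clock time.
// Both arguments must use the same unit (minutes in the logs above).
double cpu_to_wall_ratio(double cpu_time, double wall_time) {
    assert(wall_time > 0.0);
    return cpu_time / wall_time;
}

// Fraction of the requested threads that were busy on average (1.0 = all of them).
double thread_utilization(double cpu_time, double wall_time, int threads) {
    assert(threads > 0);
    return cpu_to_wall_ratio(cpu_time, wall_time) / threads;
}

// Prints the ratio for three of the runs posted in this thread.
void print_report() {
    struct Run { const char* label; double cpu_min; double wall_min; int threads; };
    const Run runs[] = {
        {"Windows", 73.31223,   73.31324, 50},
        {"Linux",   2833.64239, 51.32146, 56},
        {"Linux",   2993.59505, 32.59633, 96},
    };
    for (const Run& r : runs) {
        std::printf("%-8s %3d threads: CPU/wall = %6.2f, utilization = %5.1f%%\n",
                    r.label, r.threads,
                    cpu_to_wall_ratio(r.cpu_min, r.wall_min),
                    100.0 * thread_utilization(r.cpu_min, r.wall_min, r.threads));
    }
}
```

Calling `print_report()` from `main` gives a ratio of about 55.2 for the 56-thread Linux run and about 1.0 for the 50-thread Windows run.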
GOMP_* variables are for GCC's libgomp. Intel's runtime (libiomp5) uses KMP_* variables. GOMP_CPU_AFFINITY is silently ignored, leaving thread placement to the OS scheduler. Try setting KMP_AFFINITY as described at https://www.intel.com/content/www/us/en/docs/dpcpp-cpp-compiler/developer-guide-reference/2025-2/thread-affinity-interface.html to see if that helps.
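As a rough illustration of that suggestion, the sketch below sets `KMP_AFFINITY` from code in the same way the original post sets its OpenMP variables. The helper names are hypothetical, and the value `verbose,granularity=fine,scatter` is only a starting point to experiment with, not a recommendation confirmed for this workload. The call has to happen before the first OpenMP parallel region, since the runtime reads its environment when it initializes. The original settings also set `OMP_PROC_BIND` to `false`, which may interact with `KMP_AFFINITY`; the linked documentation should be checked for how the two combine.

```cpp
#include <cstdlib>
#include <string>

// Portable wrapper: _putenv_s on Windows (as in the original post), setenv elsewhere.
// Returns true on success.
bool set_env(const char* name, const char* value) {
#ifdef _WIN32
    return _putenv_s(name, value) == 0;
#else
    return setenv(name, value, 1) == 0;
#endif
}

// Reads an environment variable, returning an empty string if it is not set.
std::string get_env_or_empty(const char* name) {
    const char* v = std::getenv(name);
    return v ? std::string(v) : std::string();
}

// Call this before the first OpenMP parallel region.
// "verbose" asks the Intel runtime to print the thread-to-core binding it chose,
// which makes it possible to confirm that the setting was picked up.
bool configure_intel_openmp_affinity() {
    return set_env("KMP_AFFINITY", "verbose,granularity=fine,scatter");
}
```

With `verbose` enabled, the binding messages printed at startup show whether the runtime honored the setting and how many logical processors it sees.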
It is a big project, so I cannot share test code.
CPU: Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz 2.59 GHz
Can you please tell me which flags are appropriate for solving large sparse linear systems? You must have tested some cases. Does Intel have any benchmark cases?
The icx build uses -ffp-contract=off -fp-model=precise -ffp-exception-behavior=strict -ftz -mdaz-ftz. I have tried other flags as well. Any specific suggestions for choosing flags for numerical computing on this CPU?
Thanks for sharing additional details about your CPU and the current flags being used. Based on this info, here are a few suggestions, some of which you might have tried:
1. Target the architecture explicitly via -xicelake-server (most optimizations enabled for the target), -march=icelake-server, or -xCORE-AVX512 -mtune=icelake-server (for more portability)
2. Vectorization flags
-O3 #enables auto-vectorization
-xCORE-AVX512 #unlock AVX-512 on this CPU
-qopt-zmm-usage=high #encourage 512-bit ZMM register use
-qopt-report=3 #see what got vectorized and why
3. Sparse matrix traversal is memory-bound, so prefetch tuning often matters more than compute flags. You can try
-qopt-prefetch=4 # aggressive prefetch
-qopt-prefetch-distance=64 # tune to cache line / problem size
Adjust the values for your code.
4. Interprocedural Optimization via -ipo
Since sparse systems are almost always memory-bandwidth bound on this CPU, run Intel VTune or perf stat to check:
- Cache miss rate - if L3 miss rate is high, prefetch flags matter most
- Vector intensity - if AVX-512 utilization is low, check the -qopt-report output
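To illustrate why this kind of code tends to be memory-bound, here is a minimal sketch of a sparse matrix-vector product in CSR format with an OpenMP loop over rows. It is not taken from the project in question (no code was shared), and the `CsrMatrix` struct and `spmv` function are hypothetical names. On a 64-bit build, each nonzero costs two floating-point operations but requires loading at least 16 bytes (one value and one column index) plus an indirect read of `x`, so the loop is usually limited by memory traffic rather than by vector width.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CSR container; the project's real data structures are not shown in the thread.
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_ptr;  // size rows + 1, start of each row in col_idx/values
    std::vector<std::size_t> col_idx;  // size nnz, column of each stored entry
    std::vector<double> values;        // size nnz, value of each stored entry
};

// Computes y = A * x. Rows are independent, so the outer loop is parallelized.
// Without OpenMP enabled, the pragma is ignored and the loop runs serially.
std::vector<double> spmv(const CsrMatrix& a, const std::vector<double>& x) {
    std::vector<double> y(a.rows, 0.0);
    const long long n = static_cast<long long>(a.rows);  // signed index for older OpenMP versions
#pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i) {
        double sum = 0.0;
        for (std::size_t k = a.row_ptr[i]; k < a.row_ptr[i + 1]; ++k) {
            // Indirect load of x through col_idx: hard to vectorize and cache-unfriendly.
            sum += a.values[k] * x[a.col_idx[k]];
        }
        y[i] = sum;
    }
    return y;
}
```

A kernel of this shape is a reasonable small test case to build with both compilers and compare, since it isolates the memory access pattern from the rest of a solver.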
Are you using any sparse solver library like MKL?
Thanks for your reply. Good to know. I tried the settings you suggested, but they did not help.
Here is another case on Linux:
ICX
Successful completion Step 3434 10.0000 years 21630 iterations; real duration: 267.52013 (min); OpenMP timer: 267.52013 (min); CPU time: 25280.19389 (min) threads 96
GCC
Successful completion Step 3312 10.0000 years 20857 iterations; real duration: 72.17944 (min); OpenMP timer: 72.17944 (min); CPU time: 6761.04930 (min) Threads 96
The GCC build is more than three times faster (about 3.7× by wall time). Which settings or flags can make numerical code built with icx run about as fast as the GCC build? We are not even asking for it to be faster.
I cannot believe the Intel compiler is so inferior to gcc on an Intel CPU. I do not even need any special settings for gcc. The Intel compiler has so many settings, yet none of them makes the code faster.
Intel guys: what are the secrets of the Intel compiler?
Could you please share a sample of code that demonstrates this issue so we can help troubleshoot?