Intel® Fortran Compiler

Unexpected timing result with optimisation -O3

Arjen_Markus
Honored Contributor II
7,342 Views

I am experimenting with a simple algorithm: various implementations and problem sizes, to see what would give the best performance. To my surprise, the timings I got with -O3 are four times as large as with -O2, whereas I expected either the same or lower timings. As the worse results are consistent across four different implementations and 16 model sizes (so a total of 64 cases), I conclude that -O3 is, for my case, definitely slower than -O2.

I attach one version of the program, an input file and the timing results for both -O2 and -O3.

The compile options were: -O3 -heap-arrays (the latter is required for the larger sizes)

For the size 880x880 the timings are (last column in the output files):

debug (-Od): 120 seconds

-O2:  18 seconds

-O3: 77 seconds 

As the timing routine prints the compiler version and compile options, I am certain that I did not make a mistake there.

My question: is such a deterioration in performance to be expected, especially as the program is very straightforward? The same deterioration can be seen in all versions of the program.
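To give an idea of the kind of loop involved without opening the attachment, here is a simplified sketch. The names, sizes and details are illustrative only and not the actual code of poisson_island_naive.f90; the point is the IF inside the innermost loop, which guards the cells belonging to the island:

program stencil_sketch
    implicit none
    integer, parameter   :: n = 880
    real, allocatable    :: v(:,:), vnew(:,:)
    logical, allocatable :: island(:,:)
    integer              :: i, j, iter

    allocate( v(n,n), vnew(n,n), island(n,n) )

    v = 0.0
    v(1,:) = 1.0                                   ! some arbitrary boundary value
    island = .false.
    island(n/2-10:n/2+10, n/2-10:n/2+10) = .true.  ! an arbitrary square "island"
    vnew = v

    do iter = 1, 100
        do j = 2, n-1
            do i = 2, n-1
                ! Skip the island cells - this IF is the conditional in the hot loop
                if ( .not. island(i,j) ) then
                    vnew(i,j) = 0.25 * ( v(i-1,j) + v(i+1,j) &
                                       + v(i,j-1) + v(i,j+1) )
                end if
            end do
        end do
        v = vnew
    end do

    print *, 'checksum: ', sum(v)
end program stencil_sketch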

Ron_Green
Moderator
7,225 Views

Odd. I do not see much if any difference between O2 and O3. Both are rather hand-tied by the fact that you are not targeting your processor, so the compiler defaults to SSE2. Notice that the loop has conditionals (IFs). SSE2 did not have masked vector operations; those came with AVX and were improved in AVX2 and AVX512. By adding -xhost you can see significant improvements, if you are on a Genuine Intel processor. If you are not on Genuine Intel, you can try the -march options.
What processor are you using?

In any event, the default SSE2 code generation is terrible for performance. The compiler defaults to the least common denominator for processor targeting - the oldest imaginable crusty old Xeon from 15 years ago. Make sure to use the -x or -ax options on Intel, and -march for that other vendor.

I should note that just to be sane, I set 
export OMP_NUM_THREADS=1

Also, I edited the .f90 and removed all prints/writes except the one in stop_timer; all I care about in this case is the timing.
Much better performance comes from using AVX512 with its masked vector instructions. That is what I did with -xhost, running on an Intel processor that supports AVX512.

 

~/quad/triage/tuesday$ ifx -O2 -heap-arrays  poisson_island_naive.f90 timing.o ; ./a.out ; cat poisson_island_naive.out
Report of simulation
--------------------
Compiler version: 
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.2.0 Build 20250605
Compiler options: 
-O2 -c 
Wall clock (s):  28.6359    
CPU time (s):    28.6349    
~/quad/triage/tuesday$ 
~/quad/triage/tuesday$ ifx -O3 -heap-arrays  poisson_island_naive.f90 timing.o ; ./a.out ; cat poisson_island_naive.out
Report of simulation
--------------------
Compiler version: 
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.2.0 Build 20250605
Compiler options: 
-O2 -c 
Wall clock (s):  27.8884    
CPU time (s):    27.8874    
~/quad/triage/tuesday$ 
~/quad/triage/tuesday$ 
~/quad/triage/tuesday$ ifx -O2 -heap-arrays -xhost  poisson_island_naive.f90 timing.o ; ./a.out ; cat poisson_island_naive.out
Report of simulation
--------------------
Compiler version: 
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.2.0 Build 20250605
Compiler options: 
-O2 -c 
Wall clock (s):  11.3130    
CPU time (s):    11.3124    
~/quad/triage/tuesday$ 
~/quad/triage/tuesday$ 
~/quad/triage/tuesday$ ifx -O3 -heap-arrays -xhost  poisson_island_naive.f90 timing.o ; ./a.out ; cat poisson_island_naive.out
Report of simulation
--------------------
Compiler version: 
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.2.0 Build 20250605
Compiler options: 
-O2 -c 
Wall clock (s):  11.1859    
CPU time (s):    11.1852    

 

 

Arjen_Markus
Honored Contributor II
7,047 Views

I created a double-precision version of the programs, and that showed a slowdown of some ten to twenty percent, so not as dramatic as the one I showed :). I did not include the other versions of the program (they involve hiding the if-statements in an extra multiplication, sketched below, using pointers to offset the indices together with array operations, and using ASSOCIATE to do the same), but they showed the same deterioration. If it is of any use, I can post those results as well.
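For completeness, this is roughly what the "extra multiplication" variant looks like; again the names are illustrative, not the actual code. A weight array w holds 1.0 on ordinary cells and 0.0 on island cells, so the island cells are left untouched without any branch in the inner loop:

! Illustrative only; w would be built once per run, e.g. w = merge(0.0, 1.0, island)
subroutine update_weighted( n, w, v, vnew )
    implicit none
    integer, intent(in) :: n
    real, intent(in)    :: w(n,n), v(n,n)
    real, intent(inout) :: vnew(n,n)
    integer             :: i, j

    do j = 2, n-1
        do i = 2, n-1
            ! On island cells w = 0, so vnew simply keeps the old value of v
            vnew(i,j) = (1.0 - w(i,j)) * v(i,j)               &
                      + w(i,j) * 0.25 * ( v(i-1,j) + v(i+1,j) &
                                        + v(i,j-1) + v(i,j+1) )
        end do
    end do
end subroutine update_weighted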

I should perhaps mention that I used Intel Fortran oneAPI 2025.0 for this.

Thanks for the pointers on how to take advantage of the actual machinery.

Arjen_Markus
Honored Contributor II
7,030 Views

Well, one aspect of the double-precision results with -O3 is that I used the same grid sizes, which of course means twice as much memory. Since the single-precision results indicate the working set is in the region of the L1 cache, the double-precision runs fall outside that region. That may make a difference. Anyway, more compile options and computational options to explore :).
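To put a number on that (plain arithmetic, not a measurement), at the largest size of 880x880:

880 x 880 x 4 bytes = 3,097,600 bytes, roughly 3.1 MB per single-precision array
880 x 880 x 8 bytes = 6,195,200 bytes, roughly 6.2 MB per double-precision array

So every grid size in the sweep needs twice the memory in double precision, which can push a working set that just fitted in a given cache level into the next one.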

Arjen_Markus
Honored Contributor II
6,722 Views

I have tried some machine-specific compile options, namely -Qm64 and -QxAVX, as I am using Windows and the machine I use does not support AVX512. I get results that are consistently slower, by a factor of 3 to 5, than when I simply use -O2. But I have seen some other bizarrely slow calculations as well, so I will have to make sure that what I see is indeed correct.

Hopefully I can make some sense out of the further experiments and report back the results. 

Ron_Green
Moderator
6,718 Views

I have heard from users with newer Intel processors that have a mix of core types: efficiency cores (E-cores) and performance cores (P-cores). Without controlling process core placement, a run can get scheduled on the E-cores, which have lower performance.

Arjen_Markus
Honored Contributor II
6,713 Views

Hm, how do I find out whether this can happen on my laptop? And if so, how does one control the placement of processes on the various cores? It does sound like a likely scenario, as I would not otherwise expect such a large factor in performance.

Ron_Green
Moderator
6,692 Views

Start here. You may have to add the option /Qopenmp; I am not sure, I am still researching it.

https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2024-2/managing-performance-with-heterogeneous-cores.html

 

Also set these environment variables:

KMP_AFFINITY=verbose
OMP_NUM_THREADS=1

 

This should give a binding map when you run the program.
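Once the verbose map shows which logical processors are the P-cores, you can also force the run onto them to test the theory. For example, on Windows (the mask, processor numbers and program name below are placeholders; take the real values from cpuinfo or from the verbose affinity map):

rem restrict the whole process (works even without OpenMP); FF is a hex affinity mask
start /affinity FF my_program.exe

rem or, if the program is built with /Qopenmp, pin the OpenMP runtime explicitly
set OMP_NUM_THREADS=1
set KMP_AFFINITY=verbose,granularity=core,proclist=[0],explicit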

Arjen_Markus
Honored Contributor II
6,653 Views

Well, that seems to fit my laptop indeed: 12th Gen Intel(R) Core(TM) i7-12800HX, Rev. 38658.

I will try to take advantage of this information. Thanks.

Ron_Green
Moderator
6,648 Views

If you have Intel MPI installed, there is a command-line tool 'cpuinfo' that prints out a list of the processors, their numbering, and other interesting information about your CPU(s).

Arjen_Markus
Honored Contributor II
5,457 Views

That command indeed gives a lot of information. I also have the Intel Processor Information Utility installed, which basically tells the same things, but in a graphical way. Now I need to experiment with the OpenMP settings to see whether I can control it all :).
