OpenMP under ifx does not scale (at all) with the number of threads, works fine with ifort

dmpogo · ‎05-22-2025

I have a computational code which, for testing purposes, just calls a subroutine with 3 nested loops:

!$OMP PARALLEL DO DEFAULT(FIRSTPRIVATE) &
!$OMP             SHARED(prm,spec_x,spec_y,spec_z,speq_x,speq_y,speq_z)

     do j3=1,n3
        do j2=1,n2
           do j1=1,nyq1_1

            ........

           enddo
        enddo
     enddo

!$OMP END PARALLEL DO

where the innermost loop has some calls to MKL routines.

I have 3 platforms: a 2011 desktop with an Intel Core i7 (4 cores/8 threads) and ifort 2021.2; a 2017 laptop with an Intel Core i7 (2 cores/4 threads) and ifx 2025.0.4; and a Supermicro server with an Intel Xeon (48 cores/96 threads) and ifx 2025.1.1.

The code is compiled with FFLAGS= -w -extend_source -O -qopenmp

Here are the strange results

ifort platform

$   time OMP_NUM_THREADS=1 ./mag_field < input_text


real    4m45.787s
user    4m45.206s
sys     0m0.567s



$  time OMP_NUM_THREADS=4 ./mag_field < input_text

real    1m15.504s
user    4m36.408s
sys     0m0.700s

As expected, the user time remains about the same, while the wall time decreases by almost a factor of 4.

But the same code and compiler options on the ifx platforms give:

laptop

$  time OMP_NUM_THREADS=1 ./mag_field < input_text



real    1m9.073s
user    1m8.716s
sys     0m0.284s



$ time OMP_NUM_THREADS=4 ./mag_field < input_text

real    0m35.847s
user    2m19.619s
sys     0m0.428s

The user time went up by a factor of 2, and the real time improved only by a factor of 2, not 4. This platform, however, has only 2 physical cores.

But even more dramatic is the result on the server with the ifx compiler:

$  time OMP_NUM_THREADS=1 ./mag_field < input_text

real    0m44.983s
user    0m44.457s
sys     0m0.517s



$  time OMP_NUM_THREADS=4 ./mag_field < input_text

real    0m37.040s
user    2m24.184s
sys     0m1.092s

There is almost no improvement in real time (about 45 s down to 37 s), while the user time more than tripled.

I could see 4 CPUs working at 100%, but it looks like each of them spent as much time as a single CPU did.
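To put numbers on the comparison, the timings above can be turned into wall-clock speedups and user-time growth factors with a small shell sketch. The helper names to_seconds and ratio are hypothetical, introduced only for this illustration, and the sketch assumes durations in the NmS.SSSs form printed by the shell's time builtin (no hours field).

```shell
#!/bin/sh
# Convert a `time`-style duration such as 4m45.787s into seconds.
# Hypothetical helper; assumes the NmS.SSSs format with no hours field.
to_seconds() {
    printf '%s\n' "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f", $1 * 60 + $2 }'
}

# Print (first duration / second duration) with two decimals.
ratio() {
    LC_ALL=C awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f", a / b }'
}

# Wall-clock speedup: real(1 thread) / real(4 threads)
echo "ifort desktop speedup: $(ratio 4m45.787s 1m15.504s)"
echo "ifx laptop speedup:    $(ratio 1m9.073s 0m35.847s)"
echo "ifx server speedup:    $(ratio 0m44.983s 0m37.040s)"

# CPU-time growth: user(4 threads) / user(1 thread)
echo "ifort desktop user growth: $(ratio 4m36.408s 4m45.206s)"
echo "ifx laptop user growth:    $(ratio 2m19.619s 1m8.716s)"
echo "ifx server user growth:    $(ratio 2m24.184s 0m44.457s)"
```

With the figures quoted above this gives speedups of roughly 3.8, 1.9 and 1.2, and user-time growth of roughly 1.0, 2.0 and 3.2 for the desktop, laptop and server respectively. Ideal scaling on 4 threads would be a speedup near 4 with user-time growth near 1.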

I am probably missing something important about how to use OpenMP with ifx.

.........