Intel® Fortran Compiler

OpenMP under ifx does not scale (at all) with number of threads, works fine with ifort

dmpogo

I have a computational code which, for testing purposes, just calls a subroutine with three nested loops:

 

!$OMP PARALLEL DO DEFAULT(FIRSTPRIVATE) &
!$OMP             SHARED(prm,spec_x,spec_y,spec_z,speq_x,speq_y,speq_z)

     do j3=1,n3
        do j2=1,n2
           do j1=1,nyq1_1

            ........

           enddo
        enddo
     enddo

!$OMP END PARALLEL DO


where the innermost loop makes some calls to MKL routines.

I have 3 platforms: a 2011 desktop with an Intel Core i7 (4 cores/8 threads) and ifort 2021.2; a 2017 laptop with an Intel Core i7 (2 cores/4 threads) and ifx 2025.0.4; and a Supermicro server with Intel Xeon (48 cores/96 threads) and ifx 2025.1.1.

The code is compiled with FFLAGS= -w -extend_source -O -qopenmp

 

Here are the strange results

ifort platform

$   time OMP_NUM_THREADS=1 ./mag_field < input_text


real    4m45.787s
user    4m45.206s
sys     0m0.567s



$  time OMP_NUM_THREADS=4 ./mag_field < input_text

real    1m15.504s
user    4m36.408s
sys     0m0.700s

 

As expected, the user time remains about the same, while the wall time decreases by almost a factor of 4.

 

But the same code and compiler options on the ifx platforms give:

laptop

$  time OMP_NUM_THREADS=1 ./mag_field < input_text



real    1m9.073s
user    1m8.716s
sys     0m0.284s



$ time OMP_NUM_THREADS=4 ./mag_field < input_text

real    0m35.847s
user    2m19.619s
sys     0m0.428s


User time went up by a factor of 2, and the real time improved only by a factor of 2, not 4. This platform, however, has only 2 physical cores.

 

But even more dramatic is the result on the server with the ifx compiler:

$  time OMP_NUM_THREADS=1 ./mag_field < input_text

real    0m44.983s
user    0m44.457s
sys     0m0.517s



$  time OMP_NUM_THREADS=4 ./mag_field < input_text

real    0m37.040s
user    2m24.184s
sys     0m1.092s

There is hardly any improvement in real time (about 45 s down to 37 s), while the user time more than tripled (about 44 s up to 2m24s).

I could see 4 CPUs working at 100%, but it looks like each of them spent nearly as much time as the single CPU did.
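To put the three sets of timings on the same footing, here is a small sketch that computes the wall-clock speedup, the parallel efficiency (speedup divided by thread count), and the ratio of user time at 4 threads to user time at 1 thread. The helper names `to_sec` and `report` are made up for this post; the numbers are just the `time` outputs quoted above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the code being timed).

# Convert a `time`-style value such as 4m45.787s into seconds.
to_sec() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f", $1 * 60 + $2 }'
}

# Usage: report LABEL NTHREADS REAL_1 USER_1 REAL_N USER_N
# Prints speedup = REAL_1/REAL_N, efficiency = speedup/NTHREADS,
# and the user-time ratio USER_N/USER_1 (1.0 means no extra CPU work).
report() {
    awk -v label="$1" -v n="$2" \
        -v r1="$(to_sec "$3")" -v u1="$(to_sec "$4")" \
        -v rn="$(to_sec "$5")" -v un="$(to_sec "$6")" 'BEGIN {
        speedup = r1 / rn
        printf "%-16s speedup %.2fx  efficiency %3.0f%%  user-time ratio %.2fx\n",
               label, speedup, 100 * speedup / n, un / u1
    }'
}

# Timings copied from the runs above (1 thread vs. 4 threads).
report "desktop (ifort)" 4 4m45.787s 4m45.206s 1m15.504s 4m36.408s
report "laptop (ifx)"    4 1m9.073s  1m8.716s  0m35.847s 2m19.619s
report "server (ifx)"    4 0m44.983s 0m44.457s 0m37.040s 2m24.184s
```

With these inputs the speedup comes out at roughly 3.79x on the ifort desktop, 1.93x on the laptop, and 1.21x on the server, while the user-time ratio is about 0.97, 2.03, and 3.24 respectively. In other words, on the ifx builds the total CPU work grows with the thread count instead of being divided among the threads.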

 

I am probably missing something important about how to use OpenMP with ifx.

 

 

 

 

 

 


 
