
OpenMP under ifx does not scale (at all) with the number of threads; works fine with ifort

dmpogo

I have a computational code which, for testing purposes, just calls a subroutine with three nested loops:

 

!$OMP PARALLEL DO DEFAULT(FIRSTPRIVATE) &
!$OMP             SHARED(prm,spec_x,spec_y,spec_z,speq_x,speq_y,speq_z)
     do j3=1,n3
        do j2=1,n2
           do j1=1,nyq1_1
            ........
           enddo
        enddo
     enddo
!$OMP END PARALLEL DO


where the innermost loop makes some calls to MKL routines.

I have 3 platforms: a 2011 desktop with an Intel Core i7 (4 cores/8 threads) and ifort 2021.2; a 2017 laptop with an Intel Core i7 (2 cores/4 threads) and ifx 2025.0.4; and a Supermicro server with an Intel Xeon (48 cores/96 threads) and ifx 2025.1.1.

The code is compiled with FFLAGS= -w -extend_source -O -qopenmp

 

Here are the strange results.

ifort platform

$   time OMP_NUM_THREADS=1 ./mag_field < input_text


real    4m45.787s
user    4m45.206s
sys     0m0.567s



$  time OMP_NUM_THREADS=4 ./mag_field < input_text

real    1m15.504s
user    4m36.408s
sys     0m0.700s

 

As expected, the user time remains about the same, while the wall time decreases by almost a factor of 4.
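The scaling figure can be checked directly from the `time` output. Below is a minimal sketch, assuming the real times above are converted to seconds by hand (4m45.787s = 285.787 s, 1m15.504s = 75.504 s). The helper name `speedup` is hypothetical and not part of any tool mentioned in this post.

```shell
#!/bin/sh
# speedup T1 TN N: print the speedup T1/TN and the parallel
# efficiency (speedup divided by thread count N).
# Hypothetical helper; T1 and TN are wall-clock ("real") seconds.
speedup() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN {
        s = t1 / tn
        printf "speedup=%.2f efficiency=%.0f%%\n", s, 100 * s / n
    }'
}

# ifort desktop: 1 thread vs 4 threads
speedup 285.787 75.504 4
```

The same call with the real times from the other platforms gives a like-for-like comparison of how each build scales.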

 

But the same code and compiler options on the ifx platforms give:

laptop

$  time OMP_NUM_THREADS=1 ./mag_field < input_text



real    1m9.073s
user    1m8.716s
sys     0m0.284s



$ time OMP_NUM_THREADS=4 ./mag_field < input_text

real    0m35.847s
user    2m19.619s
sys     0m0.428s


User time went up by a factor of 2, and real time improved only by a factor of 2, not 4. This platform, however, has only 2 physical cores.

 

Even more dramatic is the result on the server with the ifx compiler:

$  time OMP_NUM_THREADS=1 ./mag_field < input_text

real    0m44.983s
user    0m44.457s
sys     0m0.517s



$  time OMP_NUM_THREADS=4 ./mag_field < input_text

real    0m37.040s
user    2m24.184s
sys     0m1.092s

There is almost no improvement in real time (about 18%), while the user time increased more than 3 times.

I could see 4 CPUs working at 100%, but it looks like each of them spent almost as much time as the single CPU did.
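This observation can be put in numbers by dividing the 4-thread user time by the thread count and comparing it with the single-thread user time. Below is a rough sketch using the ifx timings from this post, converted to seconds by hand. The helper name `per_thread` is hypothetical, and it assumes user time is spread evenly across the threads.

```shell
#!/bin/sh
# per_thread U1 UN N: average CPU seconds per thread in the N-thread run
# (UN / N), and the ratio of that average to the single-thread user time U1.
# Hypothetical helper; assumes user time is split evenly over N threads.
per_thread() {
    awk -v u1="$1" -v un="$2" -v n="$3" 'BEGIN {
        avg = un / n
        printf "per-thread=%.1fs ratio=%.2f\n", avg, avg / u1
    }'
}

# ifx server: user 44.457 s with 1 thread, 2m24.184s = 144.184 s with 4
per_thread 44.457 144.184 4

# ifx laptop: user 1m8.716s = 68.716 s with 1 thread, 2m19.619s = 139.619 s with 4
per_thread 68.716 139.619 4
```

With ideal scaling the ratio would be close to 1/N, that is 0.25 for 4 threads. A ratio approaching 1 means each thread is busy for about as long as the entire single-thread run.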

 

I am probably missing something important about how to use OpenMP with ifx.

 

 

 

 

 

 


 
