Solved: OpenMP do loop efficiency

AlphaF20 · ‎02-23-2021

Hi, I have the following code modified from an answer in stackoverflow

I tried to compare the efficiency of nested loops without and with openmp

Program reshape_for_blas

  Use, Intrinsic :: iso_fortran_env, Only :  wp => real64, li => int64
  Use :: omp_lib
  
  Implicit None

  Real( wp ), Dimension( :, :    ), Allocatable :: a
  Real( wp ), Dimension( :, :, : ), Allocatable :: b
  Real( wp ), Dimension( :, :, : ), Allocatable :: c1, c2, c3, c4, c5
  Real( wp ), Dimension( :, :    ), Allocatable :: d
  Real( wp ), Dimension( :, :    ), Allocatable :: e

  Integer :: na, nb, nc, nd, ne
  Integer :: la, lb, lc, ld  
  Integer( li ) :: start, finish, rate
  Integer :: numthreads
  
  
  numthreads = 6

  call omp_set_num_threads(numthreads)  
  
  
  Write( *, * ) 'na, nb, nc, nd ?'
  Read( *, * ) na, nb, nc, nd
  ne = nc * nd
  Allocate( a ( 1:na, 1:nb ) ) 
  Allocate( b ( 1:nb, 1:nc, 1:nd ) ) 
  Allocate( c3( 1:na, 1:nc, 1:nd ) )
  Allocate( c5( 1:na, 1:nc, 1:nd ) )  
  Allocate( d ( 1:nb, 1:ne ) ) 
  Allocate( e ( 1:na, 1:ne ) ) 

  ! Set up some data
  Call Random_number( a )
  Call Random_number( b )
  c3(:,:,:) = 0.0_wp
  c5(:,:,:) = 0.0_wp

  !naive
  Call System_clock( start, rate )

  
  do la = 1, na 
    do lb = 1, nb
      do lc = 1, nc
        do ld = 1, nd
          c3(la,lc,ld) = c3(la,lc,ld)  + a(la,lb) * b(lb, lc, ld)
        enddo  
      enddo
    enddo
  enddo  
 
  
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for naive loop', Real( finish - start, wp ) / rate  
 
   

  Call System_clock( start, rate )  
  !$omp parallel
  !$omp do private(la, lb, lc, ld) schedule(dynamic) reduction(+: c5)    
  do la = 1, na 
    do lb = 1, nb
      do lc = 1, nc
        do ld = 1, nd
          c5(la,lc,ld) = c5(la,lc,ld)  + a(la,lb) * b(lb, lc, ld)
        enddo  
      enddo
    enddo
  enddo  
  !$omp end do  
  !$omp end parallel
 
  
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for naive loop omp', Real( finish - start, wp ) / rate  
 

  
  do la = 1, na 
    do lc = 1, nc
      do ld = 1, nd
         if ( dabs(c3(la,lc,ld) - c5(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c5', la,lc,ld, c3(la,lc,ld) - c5(la,lc,ld)
         endif         
      enddo
    enddo
  enddo  
  
End Program reshape_for_blas

By using ifort (IFORT) 19.0.3.199 20190206, with the command, ifort reshape-5.f90 -qopenmp,
(the filename is reshape-5.f90), and with 60 60 60 60 input dimension, I got the output:

Time for naive loop 3.116000000000000E-003
Time for naive loop omp 1.263800000000000E-002

I can remove array d and e, which have been never used. But the results are similar. My question is, why there is no speedup in my openmp calculation?

P.S. I am not sure if this question fits oneAPI HPC Toolkit, guessed from this question .

jimdempseyatthecove · ‎02-26-2021

>>Time for naive loop 3.116000000000000E-003
>>Time for naive loop omp 1.263800000000000E-002

>>...My question is, why there is no speedup in my openmp calculation?

The total runtime is less than the OpenMP parallel region and OpenMP loop setup.

Also, your loop index order is non-optimal. Use:

  do ld = 1, nd 
    do lc = 1, nc
      do lb = 1, nb
        do la = 1, na
          c3(la,lc,ld) = c3(la,lc,ld)  + a(la,lb) * b(lb, lc, ld)
        enddo  
      enddo
    enddo
  enddo

In Fortran, a(la, lb) and a(la+1,lb) are in adjacent memory locations.

IOW the left most index has stride 1, the second index has stride size(first index), ...

This is backwards from C/C++ notation where arrays do not have descriptors, but rather expressed as a nested list of pointers.

Use of proper index order permits the compiler to vector-ize the code. While compiler optimization can reorder the code, you are not necessarily assured that it will.

Jim Dempsey

View solution in original post

AbhishekD_Intel · ‎02-26-2021

Hi,

Thanks for reaching out to us.

As you are using OpenMP offload with the Fortran compiler, we will move this thread to the Intel Fortran Compiler Forum.

From the OpenMP side, there are a lot of factors that will impact performance. The main factor is complexity if you will use a small-size computation, then the serial version of code will definitely show good performance over the parallel version.

Please expect in-depth details from the Fortran experts. We will move this thread to Fortran Forum, refer to the below link:

https://community.intel.com/t5/Intel-Fortran-Compiler/bd-p/fortran-compiler

Warm Regards,

Abhishek

AlphaF20 · ‎02-26-2021

Thank you so much for redirecting. I did not find this Fortran forum before.

andrew_4619 · ‎02-26-2021

I would note that before think about openmp that your arrays are formulated the wrong way for efficiency in Fortran. In your loops the last index in c3 and d is the one that is changing most rapidly but if ld was the first index you would be working with adjacent memory locations and would be seeing better basic performance. In general you want loops with the first index changing rapidly and the last index slowly to stop manic large jumps in memory location all the time . Make a simple test case.

jimdempseyatthecove · ‎02-26-2021

>>Time for naive loop 3.116000000000000E-003
>>Time for naive loop omp 1.263800000000000E-002

>>...My question is, why there is no speedup in my openmp calculation?

The total runtime is less than the OpenMP parallel region and OpenMP loop setup.

Also, your loop index order is non-optimal. Use:

  do ld = 1, nd 
    do lc = 1, nc
      do lb = 1, nb
        do la = 1, na
          c3(la,lc,ld) = c3(la,lc,ld)  + a(la,lb) * b(lb, lc, ld)
        enddo  
      enddo
    enddo
  enddo

In Fortran, a(la, lb) and a(la+1,lb) are in adjacent memory locations.

IOW the left most index has stride 1, the second index has stride size(first index), ...

This is backwards from C/C++ notation where arrays do not have descriptors, but rather expressed as a nested list of pointers.

Use of proper index order permits the compiler to vector-ize the code. While compiler optimization can reorder the code, you are not necessarily assured that it will.

Jim Dempsey

AlphaF20 · ‎02-26-2021

Thank you so much for your answer! I appreciate your help.

By adjusting the orders of loops and use 300 300 300 300 as the input dimension, I got

 na, nb, nc, nd ?
300 300 300 300
 Time for naive loop abcd    55.8493150000000
 Time for naive loop dcba   2.20551200000000
 Time for naive loop omp abcd   4.44474100000000
 Time for naive loop omp dcba  0.572829000000000

where abcd and dcba are the order of loops from outer to inner. Looks quite good

Ron_Green · ‎03-03-2021

Jim did a great job answering this. From your version info it looks like you are on Linux.

For further tuning you may want to examine OMP affinity.

export OMP_DISPLAY_AFFINITY=true

and run. this will show how your threads bind to processors. If you have hyperthreading (HT) enabled you will have 2x the number of processors as physical cores. If you have Intel MPI installed, run

cpuinfo

Here is an 8-core system with HT, partial output to show proc numbering:

===== Processor composition =====

Processor name : Intel(R) Core(TM) i9-9900K

Packages(sockets) : 1

Cores : 8

Processors(CPUs) : 16

Cores per package : 8

Threads per core : 2

===== Processor identification =====

Processor Thread Id. Core Id. Package Id.

0 0 0 0

1 0 1 0

2 0 2 0

3 0 3 0

4 0 4 0

5 0 5 0

6 0 6 0

7 0 7 0

8 1 0 0

9 1 1 0

10 1 2 0

11 1 3 0

12 1 4 0

13 1 5 0

14 1 6 0

15 1 7 0

===== Placement on packages =====

Package Id. Core Id. Processors

0 0,1,2,3,4,5,6,7 (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)

===== Cache sharing =====

Cache Size Processors

L1 32 KB (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)

L2 256 KB (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)

L3 16 MB (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)

so 1 socket, 8 physical cores, 16 processors. Numbering (starting at 0, darn C programmers!) is 0-7 across the cores, then 8-15 are the hyper-contexts, or 2nd thread context, on cores 0-7. So if an OMP thread is bound to Processor 11 that is physical core 3, 2nd thread context.

You use OMP_PLACES and OMP_PROC_BIND to control pinning threads to cores. IT's complicated, read up on it. But try this if you want just 1 thread per physical core for this example:

export OMP_NUM_THREADS=8

export OMP_DISPLAY_ENV=true

export OMP_PLACES=cores

./a.out

this will put the 8 threads on processors 0-7 along with their hyper-contexts:

OMP: pid 3780544 tid 3780544 thread 0 bound to OS proc set {0,8}

OMP: pid 3780544 tid 3780547 thread 3 bound to OS proc set {3,11}

OMP: pid 3780544 tid 3780548 thread 4 bound to OS proc set {4,12}

OMP: pid 3780544 tid 3780546 thread 2 bound to OS proc set {2,10}

OMP: pid 3780544 tid 3780545 thread 1 bound to OS proc set {1,9}

OMP: pid 3780544 tid 3780549 thread 5 bound to OS proc set {5,13}

OMP: pid 3780544 tid 3780550 thread 6 bound to OS proc set {6,14}

OMP: pid 3780544 tid 3780551 thread 7 bound to OS proc set {7,15}

so for example thread 3 is on core 3 AND it's 2nd thread context 11. but since you only have 8 threads and not more, essentially each thread is bound to and "owns" that cpu. Which is great for DGEMM.

One other note: Look up -qopt-matmul compiler option. This is ON at O2 and above. and O2 is default so if you dont explicitly use -O option you will get O2. This will try to recognize hand-coded matmul and replace with a matmul library call. You can check for this with

-qopt-report=5 and look for the <file>.optrpt optimization report.

That can really mess with an analysis like this. You can turn it off with

-qno-opt-matmul

JUST TO MAKE SURE you are timing a loop and not a replacement library call. Of course if you WANT the replacement that is OK too. Just know what you are timing.

ron