Hi, I have the following code, modified from an answer on Stack Overflow.
I am trying to compare the efficiency of nested loops without and with OpenMP.
Program reshape_for_blas

  Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64
  Use :: omp_lib

  Implicit None

  Real( wp ), Dimension( :, : ),    Allocatable :: a
  Real( wp ), Dimension( :, :, : ), Allocatable :: b
  Real( wp ), Dimension( :, :, : ), Allocatable :: c1, c2, c3, c4, c5
  Real( wp ), Dimension( :, : ),    Allocatable :: d
  Real( wp ), Dimension( :, : ),    Allocatable :: e
  Integer :: na, nb, nc, nd, ne
  Integer :: la, lb, lc, ld
  Integer( li ) :: start, finish, rate
  Integer :: numthreads

  numthreads = 6
  Call omp_set_num_threads( numthreads )

  Write( *, * ) 'na, nb, nc, nd ?'
  Read( *, * ) na, nb, nc, nd
  ne = nc * nd

  Allocate( a ( 1:na, 1:nb ) )
  Allocate( b ( 1:nb, 1:nc, 1:nd ) )
  Allocate( c3( 1:na, 1:nc, 1:nd ) )
  Allocate( c5( 1:na, 1:nc, 1:nd ) )
  Allocate( d ( 1:nb, 1:ne ) )
  Allocate( e ( 1:na, 1:ne ) )

  ! Set up some data
  Call Random_number( a )
  Call Random_number( b )
  c3(:,:,:) = 0.0_wp
  c5(:,:,:) = 0.0_wp

  ! Naive loop order, serial
  Call System_clock( start, rate )
  do la = 1, na
    do lb = 1, nb
      do lc = 1, nc
        do ld = 1, nd
          c3(la,lc,ld) = c3(la,lc,ld) + a(la,lb) * b(lb, lc, ld)
        enddo
      enddo
    enddo
  enddo
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for naive loop', Real( finish - start, wp ) / rate

  ! Same loop order, with OpenMP
  Call System_clock( start, rate )
  !$omp parallel
  !$omp do private(la, lb, lc, ld) schedule(dynamic) reduction(+: c5)
  do la = 1, na
    do lb = 1, nb
      do lc = 1, nc
        do ld = 1, nd
          c5(la,lc,ld) = c5(la,lc,ld) + a(la,lb) * b(lb, lc, ld)
        enddo
      enddo
    enddo
  enddo
  !$omp end do
  !$omp end parallel
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for naive loop omp', Real( finish - start, wp ) / rate

  ! Compare the OpenMP result with the serial result
  do la = 1, na
    do lc = 1, nc
      do ld = 1, nd
        if ( dabs(c3(la,lc,ld) - c5(la,lc,ld)) > 1.e-6 ) then
          write (*,*) '!!! c5', la,lc,ld, c3(la,lc,ld) - c5(la,lc,ld)
        endif
      enddo
    enddo
  enddo

End Program reshape_for_blas
Using ifort (IFORT) 19.0.3.199 20190206, compiling with the command ifort reshape-5.f90 -qopenmp
(the file name is reshape-5.f90), and giving 60 60 60 60 as the input dimensions, I get the output:
Time for naive loop 3.116000000000000E-003
Time for naive loop omp 1.263800000000000E-002
I can remove the arrays d and e, which are never used, but the results are similar. My question is: why is there no speedup in my OpenMP calculation?
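(For what it is worth, d and e, together with ne = nc * nd, look like leftovers from a reshape + matmul version of the same contraction; a rough sketch of that variant, reusing the declarations above, would be something like:

  Allocate( c1( 1:na, 1:nc, 1:nd ) )     ! c1 is declared above but never allocated there
  d  = Reshape( b, [ nb, ne ] )          ! flatten the (nc, nd) dimensions of b into one
  e  = Matmul( a, d )                    ! (na x nb) * (nb x ne) -> (na x ne)
  c1 = Reshape( e, [ na, nc, nd ] )      ! unflatten back to (na, nc, nd)

but that is not what is being timed here.)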
P.S. I am not sure whether this question fits the oneAPI HPC Toolkit forum; I guessed based on this question.
Hi,
Thanks for reaching out to us.
As you are using OpenMP with the Intel Fortran Compiler, we will move this thread to the Intel Fortran Compiler Forum.
From the OpenMP side, there are many factors that affect performance. The main one here is the size of the computation: for a small computation, the serial version of the code will often outperform the parallel version because the threading overhead dominates.
Please expect in-depth details from the Fortran experts. We are moving this thread to the Fortran forum; see the link below:
https://community.intel.com/t5/Intel-Fortran-Compiler/bd-p/fortran-compiler
Warm Regards,
Abhishek
Thank you so much for redirecting. I did not find this Fortran forum before.
I would note that, before thinking about OpenMP, your arrays are indexed the wrong way round for efficiency in Fortran. In your loops the last index of c3 and b is the one that changes most rapidly, but if ld were the first index you would be working with adjacent memory locations and would see better basic performance. In general you want loops where the first index changes rapidly and the last index changes slowly, to avoid large jumps in memory location all the time. Make a simple test case.
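For example, something along these lines (just an illustrative sketch; the array size n = 4000 is arbitrary):

Program index_order_test
  ! Compare traversing a 2-D array with the last index innermost (large strides)
  ! versus the first index innermost (stride 1, adjacent memory locations).
  Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64
  Implicit None
  Integer, Parameter :: n = 4000
  Real( wp ), Dimension( :, : ), Allocatable :: x
  Real( wp ) :: s
  Integer( li ) :: start, finish, rate
  Integer :: i, j

  Allocate( x( 1:n, 1:n ) )
  Call Random_number( x )

  ! Last index innermost: consecutive accesses jump by n elements
  s = 0.0_wp
  Call System_clock( start, rate )
  do i = 1, n
    do j = 1, n
      s = s + x( i, j )
    enddo
  enddo
  Call System_clock( finish, rate )
  Write( *, * ) 'last index innermost :', Real( finish - start, wp ) / rate, s

  ! First index innermost: consecutive accesses are adjacent in memory
  s = 0.0_wp
  Call System_clock( start, rate )
  do j = 1, n
    do i = 1, n
      s = s + x( i, j )
    enddo
  enddo
  Call System_clock( finish, rate )
  Write( *, * ) 'first index innermost:', Real( finish - start, wp ) / rate, s

End Program index_order_test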
>>Time for naive loop 3.116000000000000E-003
>>Time for naive loop omp 1.263800000000000E-002
>>...My question is, why there is no speedup in my openmp calculation?
The total runtime is less than the overhead of setting up the OpenMP parallel region and the OpenMP worksharing loop.
Also, your loop index order is non-optimal. Use:
do ld = 1, nd
  do lc = 1, nc
    do lb = 1, nb
      do la = 1, na
        c3(la,lc,ld) = c3(la,lc,ld) + a(la,lb) * b(lb, lc, ld)
      enddo
    enddo
  enddo
enddo
In Fortran, a(la, lb) and a(la+1, lb) are in adjacent memory locations.
In other words, the leftmost index has stride 1, the second index has a stride of size(first dimension), and so on.
This is backwards from C/C++ notation, where arrays do not have descriptors but rather are expressed as a nested list of pointers.
Using the proper index order permits the compiler to vectorize the code. While compiler optimization can reorder the code, you are not necessarily assured that it will.
Jim Dempsey
Thank you so much for your answer! I appreciate your help.
By adjusting the order of the loops and using 300 300 300 300 as the input dimensions, I got:
na, nb, nc, nd ?
300 300 300 300
Time for naive loop abcd 55.8493150000000
Time for naive loop dcba 2.20551200000000
Time for naive loop omp abcd 4.44474100000000
Time for naive loop omp dcba 0.572829000000000
where abcd and dcba denote the order of the loop indices from outermost to innermost. Looks quite good.
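The dcba OpenMP version is essentially just the reordered loop nest; a sketch of what it looks like (not necessarily character for character what I timed) is:

  ! dcba order with OpenMP: ld is outermost and is the parallel loop index.
  ! Each thread then updates a disjoint slice of c5, so no reduction clause is needed.
  !$omp parallel do private(la, lb, lc)
  do ld = 1, nd
    do lc = 1, nc
      do lb = 1, nb
        do la = 1, na
          c5(la,lc,ld) = c5(la,lc,ld) + a(la,lb) * b(lb, lc, ld)
        enddo
      enddo
    enddo
  enddo
  !$omp end parallel do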
Jim did a great job answering this. From your version info it looks like you are on Linux.
For further tuning you may want to examine OMP affinity.
export OMP_DISPLAY_AFFINITY=true
and run. This will show how your threads bind to processors. If you have hyper-threading (HT) enabled you will have twice as many logical processors as physical cores. If you have Intel MPI installed, run
cpuinfo
Here is an 8-core system with HT, partial output to show proc numbering:
===== Processor composition =====
Processor name : Intel(R) Core(TM) i9-9900K
Packages(sockets) : 1
Cores : 8
Processors(CPUs) : 16
Cores per package : 8
Threads per core : 2
===== Processor identification =====
Processor Thread Id. Core Id. Package Id.
0 0 0 0
1 0 1 0
2 0 2 0
3 0 3 0
4 0 4 0
5 0 5 0
6 0 6 0
7 0 7 0
8 1 0 0
9 1 1 0
10 1 2 0
11 1 3 0
12 1 4 0
13 1 5 0
14 1 6 0
15 1 7 0
===== Placement on packages =====
Package Id. Core Id. Processors
0 0,1,2,3,4,5,6,7 (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)
===== Cache sharing =====
Cache Size Processors
L1 32 KB (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)
L2 256 KB (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)
L3 16 MB (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
So: 1 socket, 8 physical cores, 16 logical processors. The numbering (starting at 0, darn C programmers!) runs 0-7 across the physical cores, and 8-15 are the hyper-thread contexts, or 2nd thread contexts, on cores 0-7. So if an OMP thread is bound to processor 11, that is physical core 3, 2nd thread context.
You use OMP_PLACES and OMP_PROC_BIND to control pinning threads to cores. It's complicated, so read up on it. But try this if you want just 1 thread per physical core for this example:
export OMP_NUM_THREADS=8
export OMP_DISPLAY_ENV=true
export OMP_PLACES=cores
./a.out
This will put the 8 threads on processors 0-7 along with their hyper-thread contexts:
OMP: pid 3780544 tid 3780544 thread 0 bound to OS proc set {0,8}
OMP: pid 3780544 tid 3780547 thread 3 bound to OS proc set {3,11}
OMP: pid 3780544 tid 3780548 thread 4 bound to OS proc set {4,12}
OMP: pid 3780544 tid 3780546 thread 2 bound to OS proc set {2,10}
OMP: pid 3780544 tid 3780545 thread 1 bound to OS proc set {1,9}
OMP: pid 3780544 tid 3780549 thread 5 bound to OS proc set {5,13}
OMP: pid 3780544 tid 3780550 thread 6 bound to OS proc set {6,14}
OMP: pid 3780544 tid 3780551 thread 7 bound to OS proc set {7,15}
So, for example, thread 3 is bound to core 3 AND its 2nd thread context, processor 11. But since you only have 8 threads and not more, each thread is essentially bound to and "owns" that core, which is great for DGEMM.
One other note: look up the -qopt-matmul compiler option. It is ON at -O2 and above, and -O2 is the default, so if you don't explicitly use a -O option you will get -O2. This option tries to recognize hand-coded matmul loops and replace them with a call to a matmul library routine. You can check for this with
-qopt-report=5 and look in the <file>.optrpt optimization report.
That can really mess with an analysis like this. You can turn it off with
-qno-opt-matmul
JUST TO MAKE SURE you are timing a loop and not a replacement library call. Of course, if you WANT the replacement that is OK too. Just know what you are timing.
ron