Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Parallel processing much slower?

dajum
Novice
4,000 Views

I have a code that I set up as follows

       SUBROUTINE OPI

!$OMP PARALLEL SECTIONS NUM_THREADS(2)
!$OMP SECTION
       CALL OPER            ! does all of the real computation
!$OMP SECTION
       CALL SUB             ! polls in a DO WHILE loop until OPER says to quit
!$OMP END PARALLEL SECTIONS

       RETURN
       END SUBROUTINE OPI

This is the basic structure.  I use a module with volatile variables to communicate between the two threads.  SUB has a DO WHILE loop that runs until OPER tells it to quit.  To test it, I don't have SUB doing anything other than looping, so none of the flags change except the one that tells it to quit.  All of the real computation is done in OPER.  This takes about 100 seconds to run.  If I run it without the parallel sections, it takes 81 seconds.  Where do I look for all this overhead?  Once SUB is actually doing some real work I expect it to run in parallel with OPER, but the overhead is wiping out any improvement I can expect.
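
For reference, the communication looks roughly like this (just a sketch; the names are placeholders, not my real variables):

       MODULE COMM
         LOGICAL, VOLATILE :: QUIT  = .FALSE.   ! set by OPER when it is finished
         LOGICAL, VOLATILE :: READY = .FALSE.   ! set by OPER when data is available
       END MODULE COMM

       SUBROUTINE SUB
       USE COMM
       DO WHILE (.NOT. QUIT)                    ! spin until OPER says to quit
         IF (READY) THEN
           ! ... the real work would go here ...
           READY = .FALSE.
         END IF
       END DO
       RETURN
       END SUBROUTINE SUB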

Thanks!

Dave

0 Kudos
37 Replies
jimdempseyatthecove
Honored Contributor III
984 Views

IanH,

I took the liberty to modify your program. I haven't tried all permutations of compile options.

Change summary:

Changed name of "sub" to "IOsub" to reflect purpose of subroutine.

Made the master thread call IOsub; (all) other thread(s) call OPER.

Outer level runs with all threads (you can set this to 2 threads if you want to enable nested parallelism).

Each OPER worker thread sets a lock around the region of work it will perform in OPER.

Moved RANDOM_NUMBER out of inner loop (both in IOSUB and OPER)

Added nThreads = omp_get_num_threads(); iThread = omp_get_thread_num()

The DO ib loop in OPER now iterates as:

DO ib = iThread, batches, nThreads-1

When more than 2 threads are in use, each non-master thread takes interleaved steps over the batches (see the sketch below).

Note: the modifications assume all computation results are maintained by the application; IOsub outputs each batch as it is completed. The real application may not wish to hold all results (batches); in that event, double buffering or n-buffering could be considered.
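
In outline, the structure now looks roughly like this (a sketch only; see the attached source for the working version):

!$OMP PARALLEL PRIVATE(iThread, nThreads, ib)
       nThreads = omp_get_num_threads()         ! requires USE OMP_LIB
       iThread  = omp_get_thread_num()
       IF (iThread == 0) THEN
         CALL IOsub                              ! master thread performs the output
       ELSE
         DO ib = iThread, batches, nThreads-1    ! workers interleave over the batches
           ! ... OPER's work for batch ib, guarded by its lock ...
         END DO
       END IF
!$OMP END PARALLEL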

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
984 Views

I forgot to mention.

Should you want to use nested parallelism, set the number of threads on the outer level to 2. For test purposes, modify the DO i=1,count loop in OPER to use !$OMP PARALLEL DO SHARED(ib,count,r,rnd).
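
Something along these lines (a sketch only; nested parallelism must be enabled, e.g. OMP_NESTED=true or CALL OMP_SET_NESTED(.TRUE.)):

!$OMP PARALLEL SECTIONS NUM_THREADS(2)   ! outer level: one I/O thread, one compute thread
!$OMP SECTION
       CALL IOsub
!$OMP SECTION
       CALL OPER
!$OMP END PARALLEL SECTIONS

       ! ... and inside OPER, for test purposes:
!$OMP PARALLEL DO SHARED(ib,count,r,rnd)
       DO i = 1, count
         ! ... work on element i ...
       END DO
!$OMP END PARALLEL DO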

Jim Dempsey

0 Kudos
dajum
Novice
984 Views

I took the program and made it work more like mine.  There are still some differences, but I don't think they are critical to understanding why this behaves the way it does.  This version runs in 30 seconds for me when it is serial, but in parallel it takes 48 seconds.  I adjusted the counts to make the serial timing interesting.  It basically fills the array in OPER and writes it out in SUB.  Can anyone explain why the parallel implementation is so slow?

Jim I'll take a look at your code next. Thanks!

0 Kudos
jimdempseyatthecove
Honored Contributor III
984 Views

On my system (Core i7, 4 cores with HT), with IOsub at count/4 and no thread limitation on the outer level (in other words, 8 threads):

IOsub at count/4
OpenMP stubs (sequential code): 1.29 s
OpenMP parallel code: 0.44 s

IOsub at count/100
OpenMP stubs (sequential code): 1.085 s
OpenMP parallel code: 0.285 s

Jim Dempsey

0 Kudos
Steven_L_Intel1
Employee
984 Views

Have you tried running this under Intel VTune Amplifier XE to look for thread and lock contention, etc.?

0 Kudos
dajum
Novice
984 Views

I actually demo'd it a few months ago to look at this same code.  It wasn't useful then for figuring out what was happening, and I spent a few days trying to get it to help me.  I asked for support a couple of times and got suggestions, but in the end I switched to GlowCode and found the bottleneck in a few minutes.  I'm not a fan of the mechanics of how VTune works, so I don't own a copy of it.

0 Kudos
dajum
Novice
984 Views

Jim,

I tried your code.  It just crashes if I set OMP_NUM_THREADS=1.  Does it work that way for you?

Compiled with /Qopenmp it runs in 0.099-0.192 seconds.

Without /Qopenmp: 0.048-0.064 seconds.

So I see it as much worse running in parallel.

0 Kudos
jimdempseyatthecove
Honored Contributor III
984 Views

dajum,

Programming error. You do not want me to do all your work for you...
See fix attached.

Jim Dempsey

0 Kudos
dajum
Novice
984 Views

Jim,

I wasn't really worried about the crash.  I'm still at the same point I was at the very beginning: I don't understand why it runs slower in parallel.  I tested this one too, and basically I think the serial version is faster, though this one was a little closer.  So why doesn't the same code you run do the same thing for me?  You seem to think it is faster in parallel; it definitely is not for me, yet my machine appears to be much faster than yours.  Any idea why that might be?  Did you compile with arguments other than /Qopenmp?  I used /Qopenmp /O3.  Every test case I run has the same basic characteristic: parallel is slower.  Something other than the code must be causing that.

Anyone?

0 Kudos
IanH
Honored Contributor III
984 Views

bh2 has got the "issue" with data races that I was talking about before - you have writes to `index` and `done` potentially going on in parallel with reads of `index` and `data` (the if statement).  For your program to be formally well defined you need to synchronise those.  Similarly, there's a formal requirement to ensure that the "view" of the shared variables is consistent between threads. 

I qualify with "formally" because the compiler's generated code and your hardware may "happen" to achieve both of those aspects practically (I don't know - I don't play at that level - whenever the debugger throws up disassembly I have to go and have a good lie down).  Even if those requirements are being met practically, that will come at some cost in terms of execution speed.

My attempt at fixing those aspects (and perhaps breaking other things) is attached.  Measured over four runs on my two-core, no-HT system, the OMP case is faster by about 15% in a release build; a four-core HT system is similar.  Disabling HT on the four-core machine, the OMP run is 40% faster than serial (the serial runtime is about the same with HT as without).  (I'm seeing occasional significant variability in both the serial and OMP cases on the four-core machine, which may be due to background system activity, disk buffering, etc.; it's also possible it's due to a programming error.)
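
Very roughly, the sort of synchronisation I mean (a sketch only, with made-up names; ATOMIC READ/WRITE need an OpenMP 3.1 compiler, and the attachment is my actual attempt):

  ! producer side (OPER): publish the data, then the flag
  data(index) = new_value
  !$omp flush
  !$omp atomic write
  done = .true.

  ! consumer side (SUB): read the flag atomically before touching the data
  do
    !$omp atomic read
    local_done = done
    if (local_done) exit
  end do
  !$omp flush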

0 Kudos
jimdempseyatthecove
Honored Contributor III
984 Views

dajum,

I will be out of my office today; I will look further, including at bh3.f90.

In my tests I ran in parallel and with the OpenMP stubs; I did not run a build compiled without OpenMP.

My system has one socket. I do not know if that is affecting the issue; it shouldn't.

Jim Dempsey

0 Kudos
dajum
Novice
984 Views

Ian,

Isn't this a non-critical race situation?  Only one thread modifies the data.  As I understand it, there is no possibility that it can ever end up at a non-deterministic value, so there can't be an impact on the final results from this.  It may result in the thread reading the data having to make another loop, but I don't see that as an issue in the big picture.

But obviously your code and Jim's code are formally correct, yet I see the same behavior: the parallel versions run slower than the serial case.  What exactly causes that to occur?  That is the big picture in my mind: why should a parallel version ever take longer than the serial version?  I don't understand this point.

0 Kudos
dajum
Novice
984 Views

Ian,

I ran your bh3.f90 code.  For me the parallel version takes twice as long to run as the serial version.  On a co-worker's machine the parallel version took 2.5 times longer.  However, both of those machines are laptops (Windows 7).  I then ran it on a desktop machine, also a quad-core but only a Xeon E5405 (running Vista 64).  That machine had the most consistent run times of all, with almost no spread in the elapsed times (0.1 seconds), and was consistently 6% faster for the parallel version.  So something appears to be happening with parallel code on a laptop.  All are Dell computers.  Do you know anything about that?

0 Kudos
jimdempseyatthecove
Honored Contributor III
984 Views

Put a breakpoint (prior to the loop) in both OPER and SUB (IOsub). Run the parallel build and verify that (a) you are running in two threads and (b) the two threads are on separate cores/HW threads. Point (b) should be observable by starting the performance monitor and watching the CPU utilization during the run.
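
Something as simple as this inside the parallel region will confirm the thread count (a sketch; needs USE OMP_LIB):

!$OMP PARALLEL
       PRINT *, 'thread ', omp_get_thread_num(), ' of ', omp_get_num_threads()
!$OMP END PARALLEL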

On the notebook and the single E5405, all threads will share the last-level cache (L3); on a multi-socket machine you may be running on different sockets, in which case a memory-intensive run could take longer (however, this should be the case for the first iteration of your inner loop (DO i)).

*** This assumes the RANDOM_NUMBER call has been moved outside the DO i loop. If you have not moved it outside the loop, then both (all) threads pass through a critical section on each iteration of their respective DO i loops (as opposed to once per batch).
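
That is, something like this (a sketch only; rnd here is assumed to be an array sized for the batch):

       CALL RANDOM_NUMBER(rnd)        ! once per batch, outside the hot loop
       DO i = 1, count
         ! ... inner-loop work uses rnd(i); no RANDOM_NUMBER call in here ...
       END DO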

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
984 Views

Revised program.

Edited results

[plain]

KMP_AFFINITY = verbose
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 2 threads/core (4 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7}

Number of threads            1
Elapsed time: 18.56 s

Number of threads            2
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 19.93 s

Number of threads            3
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 10.55 s

Number of threads            4
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 8.439 s

Number of threads            5
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 6.533 s

Number of threads            6
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 4.765 s

Number of threads            7
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 4.720 s

Number of threads            8
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 4.720 s
Done

KMP_AFFINITY=verbose,compact
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 2 threads/core (4 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}

Number of threads            1
Elapsed time: 18.60 s

Number of threads            2
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1}
Elapsed time: 20.33 s

Number of threads            3
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2,3}
Elapsed time: 10.14 s

Number of threads            4
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {2,3}
Elapsed time: 12.07 s
Number of threads            5

OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {4,5}
Elapsed time: 7.059 s

Number of threads            6
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {4,5}
Elapsed time: 5.014 s

Number of threads            7
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {6,7}
Elapsed time: 5.022 s

Number of threads            8
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {6,7}
Elapsed time: 4.995 s
Done

KMP_AFFINITY=verbose,scatter
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 2 threads/core (4 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}
Number of threads            1
Elapsed time: 18.66 s

Number of threads            2
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2,3}
Elapsed time: 19.79 s

Number of threads            3
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4,5}
Elapsed time: 9.924 s

Number of threads            4
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {6,7}
Elapsed time: 8.047 s

Number of threads            5
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1}
Elapsed time: 6.172 s
Number of threads            6

OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {2,3}
Elapsed time: 5.061 s

OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {4,5}
Number of threads            7
Elapsed time: 4.853 s

Number of threads            8
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {6,7}
Elapsed time: 4.704 s
Done

KMP_AFFINITY=verbose,compact,1,0
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 2 threads/core (4 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}

Number of threads            1
Elapsed time: 18.62 s

Number of threads            2
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2,3}
Elapsed time: 19.93 s

OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4,5}
Number of threads            3
Elapsed time: 9.921 s

OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {6,7}
Number of threads            4
Elapsed time: 8.193 s

Number of threads            5
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1}
Elapsed time: 6.046 s

Number of threads            6
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {2,3}
Elapsed time: 5.065 s
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {4,5}
Number of threads            7
Elapsed time: 4.944 s

Number of threads            8
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {6,7}
Elapsed time: 4.654 s
Done

[/plain]

Jim Dempsey

0 Kudos
dajum
Novice
984 Views

Jim,

I see similar results to yours with the latest code.  I had previously verified that the code is indeed running on multiple processors.

Ian,

I implemented the ATOMIC statements, which are fine for 12.1 and 13, but not for version 11.1: no clauses are allowed on ATOMIC in 11.1, and it won't take just an assignment statement.  It appears ATOMIC is only meant for write statements in 11.1.  Is that the case?  Is there some way this is meant to be implemented when using 11.1?
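
For reference, the kind of construct I mean (illustrative names only):

!$OMP ATOMIC WRITE
      done = .true.            ! accepted by 12.1 and 13, rejected by 11.1

!$OMP ATOMIC
      counter = counter + 1    ! the update form that 11.1 does accept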

0 Kudos
jimdempseyatthecove
Honored Contributor III
984 Views

dajum,

The 2-thread issue (slower than 1 thread) may be due to the code used to emulate the load in IOsub. I suggest you modify it to perform a formatted internal write (to a character variable), then call SLEEPQQ to emulate write latency. This may be more representative of your actual overhead for IOsub.
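
For example, something like this in IOsub (a sketch; the 10 ms figure is just a placeholder for your real write latency):

       SUBROUTINE IOsub_emulated(batch)
       USE IFPORT                        ! for SLEEPQQ
       REAL, INTENT(IN) :: batch(:)
       CHARACTER(LEN=64) :: line
       INTEGER :: i
       DO i = 1, SIZE(batch)
         WRITE(line,'(ES16.8)') batch(i) ! formatted internal write (no disk I/O)
       END DO
       CALL SLEEPQQ(10)                  ! emulate ~10 ms of write latency
       END SUBROUTINE IOsub_emulated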

Jim Dempsey

0 Kudos