Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Parallel processing much slower?

dajum
Novice
3,955 Views

I have a code that I set up as follows

       SUBROUTINE OPI

!$OMP PARALLEL SECTIONS NUM_THREADS(2)

       CALL OPER

!$OMP SECTION

       CALL SUB

!$OMP END PARALLEL SECTIONS

       RETURN

This is the basic structure.  I use a module with volatile variables to communicate between the two threads.  SUB has a DO WHILE loop that goes until OPER tells it to quit.  To test it I don't have SUB doing anything other than looping.  So none of the flags change except the one to tell it to quit. All of the real computations are done in OPER.  This takes about 100 seconds to run.  If I run this without the parallel sections, it takes 81 seconds.  Where do I look for all this overhead.  Once I actually have SUB doing some real work I expect it to happen in parallel to OPER, but the overhead is wiping out any improvements I can expect.
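To make the structure concrete, here is a minimal, compilable sketch of the pattern described above. The module name, flag name, and the stub body of OPER are placeholders I've invented; the real OPER is hundreds of lines.

```fortran
      MODULE COMM
        ! flags used to communicate between the two threads
        LOGICAL, VOLATILE :: QUIT = .FALSE.
      END MODULE COMM

      SUBROUTINE OPER
      USE COMM
      ! stand-in for the real computation; the last thing the
      ! real OPER does is tell SUB to stop
      QUIT = .TRUE.
      END SUBROUTINE OPER

      SUBROUTINE SUB
      USE COMM
      DO WHILE (.NOT. QUIT)
        ! two IF tests that currently always evaluate to .FALSE.
      END DO
      END SUBROUTINE SUB

      SUBROUTINE OPI
!$OMP PARALLEL SECTIONS NUM_THREADS(2)
!$OMP SECTION
      CALL OPER
!$OMP SECTION
      CALL SUB
!$OMP END PARALLEL SECTIONS
      RETURN
      END SUBROUTINE OPI

      PROGRAM MAIN
      CALL OPI
      END PROGRAM MAIN
```

Compiled without OpenMP this runs serially (OPER sets QUIT, then SUB exits immediately); with OpenMP the two sections run on two threads.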

Thanks!

Dave

0 Kudos
37 Replies
jimdempseyatthecove
Honored Contributor III

Can you show the code?

dajum
Novice

What part would be relevant? OPER is hundreds of lines calling thousands of routines.  SUB is a DO WHILE loop with two IF tests that always evaluate to false right now, until the last line of OPER sets the DO WHILE condition false and SUB ends.

dajum
Novice

Jim,

The code is almost the same pattern as you suggested in this thread: http://software.intel.com/en-us/forums/topic/299766

But my goal is to buffer data in OPER that is written out in SUB.  In my testing right now, though, SUB neither writes to the buffer nor does any output; hence the two IF tests evaluate to false in my code.  The flags in use are in a module and all have the VOLATILE attribute.  But why this threading makes it run 25% longer is puzzling me.

Dave

dajum
Novice

Interesting.  I put a CALL SLEEP(1) inside the DO WHILE.  Execution time 84 seconds.  Why does it matter so much if the thread goes to sleep?

IanH
Honored Contributor III

From your description, the amount of work that SUB has to do depends on how long it takes OPER to finish ("SUB has a DO WHILE loop that goes until OPER tells it to quit").  In the serial case, doesn't that mean SUB does nothing - OPER has already finished?  If so, that means the two cases are far from equivalent. 

A spinning DO WHILE loop will tie up a core (in the absence of the compiler working out that the loop does nothing and eliminating it).  If you put a sleep inside the loop then the core becomes available for work by other threads in the system - in the case where you have fewer cores than active system threads (how many cores do you have?) that could affect the number of timeslices given to the thread running OPER.

Perhaps I've misunderstood, but if not consider giving SUB a fixed unit of work.

Making a shared variable volatile is not on its own enough to avoid data race conditions and/or guarantee a consistent view of the variable between threads.  Based on your description I expect you would need explicit synchronisation and flush operations.  How you have those arranged can also make a significant difference to execution time.
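As an illustration only (the flag name is invented), explicit publication of a flag with OpenMP FLUSH might look like this:

```fortran
      PROGRAM FLAG_DEMO
      IMPLICIT NONE
      LOGICAL :: DATA_READY
      DATA_READY = .FALSE.
!$OMP PARALLEL SECTIONS NUM_THREADS(2)
!$OMP SECTION
      ! producer: publish the flag, then flush so the update
      ! becomes visible to the other thread
      DATA_READY = .TRUE.
!$OMP FLUSH(DATA_READY)
!$OMP SECTION
      ! consumer: flush before each read so the poll can see
      ! the producer's update
      DO
!$OMP FLUSH(DATA_READY)
        IF (DATA_READY) EXIT
      END DO
!$OMP END PARALLEL SECTIONS
      PRINT *, 'flag observed'
      END PROGRAM FLAG_DEMO
```

OpenMP 3.1 also offers !$OMP ATOMIC READ / ATOMIC WRITE for single-variable flags like this, which avoids races on the flag itself.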

dajum
Novice

Yes, in the serial case SUB does nothing.  But I didn't expect the two threads to compete for execution time.  Isn't that the point of having separate threads doing separate work on a multicore machine? Does it matter if the work is just spinning in a loop or actually doing useful processing and output? I have an i7 Q720 processor (4 cores, 8 threads), so I expected the two threads to run on different cores. Do I have to do something special to make that happen?

In a real case I expect the SUB thread to actually get data to process, such that the SUB thread will have some variable fraction of the work. In a serial case it is as much as 20-40% of the total time. I expected to be able to reduce the total execution time by putting most of that effort in a second thread. Sitting and spinning seemed to be the best solution.  Is there some mechanism to wake up the second thread when the first thread has data ready that would be a better idea?

I have built-in flags to handle the synchronisation; that seems to work fine, but the process is slower than doing the work in serial code. So this was a test to try to determine why. But it just seems to raise more questions for me.

Any pointers to a reference(s) that explains the details of the overhead would be appreciated.  

IanH
Honored Contributor III

The sleep response could be explained by your threads sharing the same physical core.  I think default thread affinity depends on what the operating system had for breakfast.  Out of my domain, others will know better. 

In the meantime open a command prompt, then:

[plain]set KMP_AFFINITY=verbose,scatter[/plain]

then run your program in that command prompt and see what happens.  This should force threads to different physical cores and give you some diagnostics as well.

This may be completely unrelated to your problems, but if by "built-in flags" you mean ordinary Fortran variables in conjunction with ordinary Fortran statements, then it is probable that your synchronisation is not formally well defined.

dajum
Novice

I changed the affinity, but it didn't really make any difference.  Run time was 102 seconds without any SLEEP.

My flags are all variables in a module, but they are configured so that only one thread writes any given variable and the other thread only reads it.  I think that makes it well defined, if that is your meaning.  Otherwise could you clarify what you mean by "synchronisation is not formally well defined"?

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 2 threads/core (4 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2,3}

jimdempseyatthecove
Honored Contributor III

>>But my goal is to buffer data in OPER that is written out in SUB.


Do you intend to run your code with OpenMP Nested enabled?
(i.e. does OPER contain !$OMP PARALLEL...?)

Is OPI called from within a parallel region?
Is OPI called only once or many times?
Is the results data produced by OPER large or small?
What will the ratio be of time spent by SUB versus time spent by OPER?

The concept in the mentioned link is a technique to overlap writes (or reads) with work,
and this is recommended only when the writes (or reads) are a significant portion of the work.

The construction of your code in this forum thread is non-overlapped (no advantage to parallelization).
Not knowing about your code, it is difficult to recommend a technique.

Presumably OPER has a loop. If so, can the partial results be written (by SUB) as they are accumulated (in other words, on/after each iteration)?
If OPER has a loop, then you have at least two different strategies:
a) single buffer (one results buffer, one intermediary copy for writing)
b) double buffer (two results buffers, no intermediary copy for writing)

Method a) is often easier to implement but has the overhead of copying data in memory.
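A rough sketch of method a) follows; all names (the buffers, flags, and the stand-in work) are invented for illustration, and it assumes OpenMP is enabled with two threads (compiled serially, the wait loops would spin forever):

```fortran
      PROGRAM METHOD_A
      ! Sketch of method a): one results buffer filled by the worker
      ! plus one intermediary copy drained by the writer.
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1000
      REAL :: RESULTS(N), WRITE_BUFFER(N)
      LOGICAL :: BUFFER_FULL, ALL_DONE
      INTEGER :: STEP
      BUFFER_FULL = .FALSE.
      ALL_DONE = .FALSE.
!$OMP PARALLEL SECTIONS NUM_THREADS(2)
!$OMP SECTION
      DO STEP = 1, 10                 ! the OPER role
        CALL RANDOM_NUMBER(RESULTS)   ! stand-in for the real work
        DO                            ! wait for the writer to drain the copy
!$OMP FLUSH
          IF (.NOT. BUFFER_FULL) EXIT
        END DO
        WRITE_BUFFER = RESULTS        ! in-memory copy: method a)'s overhead
!$OMP FLUSH
        BUFFER_FULL = .TRUE.
!$OMP FLUSH
      END DO
      ALL_DONE = .TRUE.
!$OMP FLUSH
!$OMP SECTION
      DO                              ! the SUB role
!$OMP FLUSH
        IF (BUFFER_FULL) THEN
          ! ... write WRITE_BUFFER out here ...
          BUFFER_FULL = .FALSE.
!$OMP FLUSH
        ELSE IF (ALL_DONE) THEN
          EXIT
        END IF
      END DO
!$OMP END PARALLEL SECTIONS
      END PROGRAM METHOD_A
```

The bare !$OMP FLUSH (no variable list) flushes all shared data, so the buffer contents as well as the flags are made visible across threads.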

Additional information will be required before we can make recommendations.

Jim Dempsey

dajum
Novice

Yes OPER also has parallel regions. OPI is called only once. The ratio between SUB and OPER isn't known in advance as it can vary widely.  At times SUB will be greater, and at times OPER will be greater.  But the targeted cases that I'm really looking to reduce overall duty cycle will have SUB using about 25% of the processing time for a serial run. The amount of data can exceed 1 GB, and most of the work in SUB is writing data.

I'm not sure why you think it has no advantage to being done in parallel. It is intended to do the same overlap of writing data with work. I have used your method a) as my code isn't structured to use b).  

IanH
Honored Contributor III

The OpenMP spec says "...if at least one thread reads from a memory unit and at least one thread writes without synchronization to that same memory unit ... then a data race occurs.  If a data race occurs then the result of the program is unspecified."  Beyond that, there's the need to ensure that your threads view of any shared variables is consistent.

I still think that the simple explanation is that your spinning DO WHILE loop is not equivalent to your serial case.  Again, out of my domain in terms of what happens at a hardware level, but I could imagine that the continuous access of a shared flag variable in the loop introduces appreciable overhead.

Discussion of this sort of topic is difficult without specific code examples (e.g. depending on the number of threads you start inside OPER you can be back in the situation of having fewer physical cores than threads wanting to run).  Based on what I think you are trying to do I've attached an example that uses omp locks. I think this is formally correct, but note that I really only dabble in OpenMP - I can usually get myself into enough trouble when running things serially that I only contemplate running in parallel when I wish for almost certain disaster.  For this specific example (noting that there are ten batches, sub and oper take about the same time per batch, and the output from the batches is somewhat distant from each other in memory) the parallel case leaves the serial case choking in its dust.  Vary things and you can eliminate or even slightly reverse that relative performance.
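Ian's attachment is not reproduced in the thread, but a batched producer/consumer in this spirit can be sketched with an atomic progress counter (OpenMP 3.1 ATOMIC READ/WRITE) rather than locks; the batch count and the work itself are invented here:

```fortran
      PROGRAM BATCHES
      IMPLICIT NONE
      INTEGER, PARAMETER :: NBATCH = 10
      INTEGER :: NDONE          ! shared: highest batch number produced so far
      INTEGER :: I, LAST
      NDONE = 0
!$OMP PARALLEL SECTIONS NUM_THREADS(2) PRIVATE(I, LAST)
!$OMP SECTION
      DO I = 1, NBATCH                  ! producer ("oper")
        ! ... compute batch I into its slot of a shared buffer ...
!$OMP ATOMIC WRITE
        NDONE = I                       ! publish: batches 1..I are ready
      END DO
!$OMP SECTION
      DO I = 1, NBATCH                  ! consumer ("sub")
        DO                              ! wait until batch I has been produced
!$OMP ATOMIC READ
          LAST = NDONE
          IF (LAST >= I) EXIT
        END DO
        ! ... write batch I out ...
      END DO
!$OMP END PARALLEL SECTIONS
      END PROGRAM BATCHES
```

Compiled without OpenMP the two sections simply run one after the other and the wait loop exits immediately, so the same source works in both modes.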

OMP tasks might also be suitable for this - I've used them successfully in an application which also has producers and consumers of data, but with more complicated dependencies between actors.

dajum
Novice

Ian,

Thanks for the code. To execute the parallel version am I supposed to edit all the !$ to be !$OMP, or is there some other way to do this?

BTW in my test case there are no other parallel constructs, so I expected what happens in SUB to not really matter.  Why it does is what I don't understand.  I sort of expect the two threads to operate independently, with the elapsed time depending only on how long OPER takes to run.  But it appears the SUB thread makes the OPER thread not execute continuously. If it did, it should run in the same time as the serial case.  But it must be not executing for some reason, and I'd like to understand that reason.

Dave

IanH
Honored Contributor III

!$ is the OpenMP free form source conditional compilation sentinel - see 2.2.2 of the OpenMP 3.1 spec.  Those lines are comments if OpenMP is not enabled; they are normal statements if OpenMP is enabled.  You shouldn't need to edit the source - it should (hopefully) compile and behave similarly regardless of OpenMP.
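A tiny example of the sentinel (program and variable names are of course just for illustration):

```fortran
      PROGRAM SENTINEL
      IMPLICIT NONE
      LOGICAL :: OMP_ON
      OMP_ON = .FALSE.
!$    OMP_ON = .TRUE.     ! this line is compiled only when OpenMP is enabled
      PRINT *, 'Built with OpenMP enabled? ', OMP_ON
      END PROGRAM SENTINEL
```

With /Qopenmp (or -openmp) the `!$` line becomes an ordinary assignment; without it, the line stays a comment.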

dajum
Novice

Ian,

I don't think I'm getting the same results as you.  Running each case a number of times, I saw elapsed times of .076-.093 seconds for the serial cases.

For the parallel cases, .056-.155 seconds.  The parallel case has much wider variability, and most of the time it was much slower than the serial case; once it was faster.  Which just makes no sense to me. In one trial it even did all the oper cases before the sub cases, and that ran in .09 seconds.  Is this what you saw?

IanH
Honored Contributor III

No, but my hardware is nowhere near as capable as yours.  Make the number of iterations (count) bigger, perhaps by a factor of 100, and then perhaps also set the affinity to scattered and see what happens.

TimP
Honored Contributor III

If only one thread is doing work, as appears to be implied, OpenMP could be expected to slow it down.  Reducing the value of KMP_BLOCKTIME might make a difference, as you are discussing times of that order of magnitude.

It might help if you would state what you are trying to learn from this thread; but you've already declined to answer questions leading in that direction.

dajum
Novice

My code behaves like the run times I get for many of the runs of Ian's code (.155 seconds parallel versus .09 for serial).  Doing work in parallel makes it take longer than doing it serially.  I don't understand that, and what I'm trying to learn is why it happens.  When I start two threads, OPER that does a bunch of work and SUB that just loops and should simply stop when OPER stops, there are no waits or stops or any flags changing between the two.  Why does that take 25% longer than just calling OPER?  If the threads are on different cores, why isn't the difference just the time it takes to get the two threads running?  What makes the second thread slow the first one down so much?  Yet making it sleep lets the other thread run faster.  What is interrupting the OPER thread? If the SUB thread didn't slow down the OPER thread, I think it should take just a small fraction of time longer than OPER takes running alone.

I've tried to answer every question posed. I'm sorry if I missed something, but I don't see what it is other than showing the code.  I thought the snippet I posted was the relevant part.

TimP
Honored Contributor III

In the example you posted, doesn't OPER get executed by both threads (watch out for races), followed by SUB being executed by one thread, followed by a wait for KMP_BLOCKTIME timeout?

dajum
Novice

I don't think so. I read the SECTION documentation as "Specifies one or more blocks of code that must be divided among threads in the team. Each section is executed once by a thread in the team."  So I think OPER is in one thread and SUB is in another, each getting executed once, concurrently.  Since I don't do anything in either thread that makes the other wait at this point, I don't understand why OPER takes 25% longer just because another thread is running. And when I add a SLEEP(1) per internal loop in the SUB thread, OPER only takes 5% longer.  If they are on different processors, and not causing any waits between the threads, I don't understand what is happening that causes this difference. The same thing goes for Ian's code: can you explain what makes it take .155 seconds when it runs in parallel mode?  I understand the cases where it gets .056, but not .155.  I don't see any explanation for that behavior.  It's like the code decides it just needs to wait on something else.  What is the something else? Is it just Windows letting other processes push it out? That seems strange, since the variability of the serial case is much tighter.

IanH
Honored Contributor III

Your understanding (that there is an implicit section directive after the parallel sections directive) is the same as mine.

When I tried my example code (with larger iteration counts) on a Windows 7 machine with four physical cores and hyperthreading (8 logical cores), I got similarly variable/unexpected results to yours.  On my ancient two-physical-core, no-hyperthreading Vista machine I see a consistent speedup with the OMP case.

The call to random_number might be (i.e. I don't know) invoking additional synchronisation in my example that I didn't count on.  More tomorrow.
