
!DEC$ PARALLEL

davidspurr
Beginner
Language Ref states that !DEC$ PARALLEL "enables auto-parallelization for an immediately following DO loop".

Does this apply to an outer loop that has many other loops & subroutine calls etc. within it? i.e. each cycle of the outer loop processed in a separate thread, even if there is a substantial amount of code within the loop.

That would seem a very simple means of parallel execution & in my case should speed execution significantly (quad core, x64), since I have many thousands of sites of independent activity. However, when I tried it I saw no increase in speed, with CPU usage rarely exceeding 25% - 27%.

David


davidspurr
Beginner
I did a little more testing, looking at CPU usage per core.

  • With no PARALLEL directives set, CPU usage is ~90% (mean) in one core and maybe 2% - 5% in each of the other 3 cores. Overall total usage ranges 25% - 27%.
  • With /Qopenmp & !DEC$ PARALLEL, CPU usage is 33% - 38% (each, mean) in cores 1 & 2 (slightly higher in 1), and ~10% & ~20% in the other two cores. Overall total usage still ranges 25% - 27%.
So the PARALLEL directive does appear to affect the distribution of activity between the cores, but with no net effect on overall execution speed.

Quite likely I am missing something here (likely a few statements!), but it seems the PARALLEL directive is having some effect, just not a beneficial one.

David

BTW - is it possible to add images (~40kb) to these posts (eg Task Manager window image)?


[EDIT]

I realise(?) "!DEC$ PARALLEL" is not OpenMP, but setting /Qopenmp did seem to result in more balancing between the cores - it's just not beneficial.

One other thought - is "!DEC$ PARALLEL" restricted to just a single DO loop; i.e. must it have no inner loops? There are many inner loops in my case - plus calls to subroutines etc. that also contain their own loops.



onkelhotte
New Contributor II

It's hard to say why your program doesn't use your CPUs at 100 percent.

Is the process priority set high enough? Which version of Fortran are you using?

davidspurr
Beginner
Using version 10.1.013. Not certain about the process priority - it will be the default, whatever that is, as I have not set anything.

David
onkelhotte
New Contributor II

Try increasing the priority. But keep in mind that other processes on your system may run slower and may not seem to respond any more at the higher priority classes. The example below is for normal behaviour:

use dfwin

integer*4 hProcess
integer*4 l

! Get a handle to the current process, then set its priority class
! (substitute e.g. HIGH_PRIORITY_CLASS to raise it)
hProcess = GetCurrentProcess()
l = SetPriorityClass(hProcess, NORMAL_PRIORITY_CLASS)

Steven_L_Intel1
Employee
!DEC$ PARALLEL is ignored unless you also use /parallel. It is not a magic wand for parallelism and is rather conservative. Large complex loops and loops containing routine calls may inhibit parallelism.

The compiler offers detailed optimization reports telling you what loops did and did not parallelize and why. Read about them in the documentation.
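For example, with the 10.x compilers the parallelizer's diagnostics can be requested on the command line; something like the following (check the documentation for the exact option spelling in your version), where higher report levels also explain why loops were NOT parallelized:

ifort /Qparallel /Qpar-report:3 myprog.f90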
jimdempseyatthecove
Honored Contributor III

David,

Consider using OpenMP as opposed to auto-parallelization. OpenMP uses !$OMP directives and will give you better control over your parallelization endeavors.

Jim Dempsey

davidspurr
Beginner
What I had set was /Qparallel, which seems to be the only option offered in my case. Is there a difference?

FWIW full command line is

/nologo /O3 /Og /QaxS /Qunroll:3 /Qparallel /assume:buffered_io /Qopenmp /module:"x64Release/" /object:"x64Release/" /libs:static /threads /c

Did try with & without /Og but it seemed to make no difference.
davidspurr
Beginner
Thanks Jim

I did try the OpenMP PARALLEL. It compiled OK but crashed at runtime (it seemed to crash before it got to the // code parts, but maybe it did get there?).

Basically I am analysing a series of EQ scenarios. In theory I could run each one as a separate analysis, except that there are tens of thousands of cases modelled. Each EQ is "entirely separate" & there is a considerable amount of analysis for each one, so it seems an ideal candidate for // computing.

A couple of issues that may be preventing // analysis?

1. After each "scenario" a message is written to the screen with the time it was completed (used to indicate progress of the run); i.e. a "WRITE (*, fmt) loc, ..." statement is used, where "loc" is the "scenario location" number (1 to ~30,000, say). "loc" is the DO loop variable, so if // is working I would expect non-sequential values of "loc" printed to screen. I assume that is not a problem?


2. Though each EQ is a totally separate event, the analysis of each event does access common data values; i.e. generically:

eq_res1(loc) = fn ( loc, a, b, c, ...), where some data "a, b, c, ..." is not a function of loc.

That means that if multiple loc's are analysed concurrently, multiple threads may try to access ("read") data "a" (or, say, an array value) at the same time.

I assumed the implementation of // computing in the compiler is designed to handle that type of situation, but maybe not?

David
TimP
Honored Contributor III
Minor nit-pick: /QaxS generates a special code path, for Penryn CPUs only, wherever there is an opportunity to use SSE3 or later. Otherwise, SSE/SSE2 are used. This seems fairly unlikely to show an advantage.
davidspurr
Beginner
Thanks

What alternative should I be using?

I'm running on a QX9650 CPU machine.
TimP
Honored Contributor III
If you have vectorizable complex math, -xP or -xT would have an advantage, otherwise you can take the ifort 10 default (-xW).
jimdempseyatthecove
Honored Contributor III

David,

!$OMP PARALLEL DO
DO loc = 1, NumberOf_loc
   eq_res1(loc) = fn(loc, a, b, c, d)
   ! Report progress every 100 iterations
   IF (MOD(loc, 100) .EQ. 0) WRITE (*, fmt) loc, ...
END DO
!$OMP END PARALLEL DO

In the above "loc" is private per thread however the loop progresses in a manner such that now two threads execute the same values within the do loop. The values a, b, c, d, ... are assumed to be shared (default for default is default=shared). If say "a" were to be computed to be unique for a given loc then "a" should be declared as private to the thread.

!$OMP PARALLEL DO PRIVATE(a)
DO loc = 1, NumberOf_loc
   a = SomeExpression
   eq_res1(loc) = fn(loc, a, b, c, d)
   WRITE (*, fmt) loc, ...
END DO
!$OMP END PARALLEL DO

The DO loops in OpenMP can be scheduled to run in various ways. Look at SCHEDULE in the OpenMP section of the documentation.

For the above example:

Case 1: NumberOf_loc very large, fn(loc,...) very small compute time.

For this case you would want a scheduling method that distributes large chunks of the loop iterations to each thread (reducing thread-maintenance overhead):

iCHUNK = NumberOf_loc / OMP_GET_MAX_THREADS()   ! threads available outside a parallel region
!$OMP PARALLEL DO PRIVATE(a) SCHEDULE(STATIC,iCHUNK)

Note, the above is the default for parallel DO loops, so that coding would not be necessary. However, consider:

iCHUNK = (NumberOf_loc / OMP_GET_MAX_THREADS()) / 2
!$OMP PARALLEL DO PRIVATE(a) SCHEDULE(STATIC,iCHUNK)

As to why you would want to perform more thread distributions, consider what happens if something else runs on your system (browsing, e-mail, writing a report) while your application is running. That something else will steal processor time from your compute-intensive application, which skews the relative completion times; i.e. each thread of your application will not perform the same amount of work in the same time.

Case 2: NumberOf_loc moderate, fn(loc,...) very large compute time, and computation time varies as a function of loc.

For this case you would want a scheduling method that parcels out iterations one at a time:

iCHUNK = 1
!$OMP PARALLEL DO PRIVATE(a) SCHEDULE(STATIC,iCHUNK)

There are other forms of scheduling, each with differing characteristics; see the sketch below for one of them. Get your program running first using the defaults for scheduling. Shake out any problems where you may be sharing a temporary variable that should be private. Once that is working, then consider tweaking the performance by modifying the scheduling and chunk size.
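For example, DYNAMIC scheduling hands out chunks to threads as each finishes its previous one, which suits iterations of widely varying cost (a minimal sketch, reusing the names from above; it trades some per-chunk overhead for better load balance):

!$OMP PARALLEL DO PRIVATE(a) SCHEDULE(DYNAMIC,1)
DO loc = 1, NumberOf_loc
   a = SomeExpression
   eq_res1(loc) = fn(loc, a, b, c, d)
END DO
!$OMP END PARALLEL DO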

Jim Dempsey

davidspurr
Beginner
Jim

Many, many thanks for the detailed reply.

My situation is as per your Case 2 - both the large compute time for "fn" and that it varies (considerably) with loc. ("fn" = several inner levels of loops in several subroutines, working with many variables & multi-MB of data. Actual CPU time is in the range <0.01 sec to several seconds per loc. NumberOf_loc moderate - typically ~30,000.)

Have just got on deck (8am here in NZ). Will digest then have another shot at it.

Thanks
David
jimdempseyatthecove
Honored Contributor III

David,

A suggestion for use later:

If there is a low-computational-overhead way to determine ahead of time the amount of time that will be required within a given fn(loc,...), then I would suggest you consider performing the task like a sieve: perform the long runtimes first, then the shorter runtimes last. This way you avoid the chance of the longest iteration running last (i.e. all but one core idle during a last lengthy iteration).

iRTmin = 1.0 ! minimum runtime threshold
! First pass: run only the iterations expected to be long
!$OMP PARALLEL DO SCHEDULE(STATIC,1) PRIVATE(iRT)
DO loc = 1, N_loc
   iRT = EstimateRunTime(loc)
   IF (iRT .GT. iRTmin) eq_res1(loc) = fn(loc, ...)
END DO
!$OMP END PARALLEL DO
! Second pass: run the remaining short iterations
!$OMP PARALLEL DO SCHEDULE(STATIC,1) PRIVATE(iRT)
DO loc = 1, N_loc
   iRT = EstimateRunTime(loc)
   IF (iRT .LE. iRTmin) eq_res1(loc) = fn(loc, ...)
END DO
!$OMP END PARALLEL DO

Generally one or more of the arguments to fn(loc,...) can be used to compute a weight as opposed to a time; pick an appropriate weight rather than an actual run time.

Jim Dempsey

davidspurr
Beginner
Had a closer look and I now see that the situation is a bit more complex.

The situation in one instance where I was looking to use parallel analysis is more like:

!$OMP PARALLEL DO
DO loc = 1, NumberOf_loc
   ~200 lines of code, incl. several loops and several calls to subroutines
   (which in turn call other subroutines)
END DO
!$OMP END PARALLEL DO

The ~200 lines of code (& the contents of the subroutines) contain large numbers of intermediate variables ("int_vars") that are functions of "loc" (& probably many that aren't).

i.e. in effect the analysis stream is:
Large amount of raw data & parameters (indep of loc)
--> compute int_vars = fn( loc, raw data & parameters)
--> compute results(loc) = fn(loc, int_vars)

The results are stored in arrays (one element per loc) or are aggregated over all loc, so they are not a problem. But trying to catch all the int_vars that are functions of loc & declaring them all PRIVATE could be a bit messy.

Is there any way to declare a subroutine "PRIVATE" so that all variables calculated within it (including those calculated within secondary subroutines called by the "PRIVATE" subroutine) are PRIVATE? (Clutching at straws!!)

e.g. if I rolled the ~200 lines of code (or most of them) into a subroutine, so that I now had:

!$OMP PARALLEL DO PRIVATE(new_sub)
DO loc = 1, NumberOf_loc
   call new_sub(loc, .....)
   + a few lines of code that do not matter (not fn of loc / admin etc)
END DO
!$OMP END PARALLEL DO

But looking through the Language Ref that seems unlikely.

Possibly simpler to implement manually? e.g. create, say, three copies of the new sub (new_sub1, new_sub2, new_sub3), though this may also be difficult to implement. Will need to think it over a bit more.

I can currently achieve a form of parallel operation in some cases by running say 3 analyses concurrently when needed (sometimes multiple runs of the program are required). Implementing OMP would allow // analysis for the more common case of single runs, which would be helpful but is not critical.



TimP
Honored Contributor III
The local "automatic" (no SAVE, no external reference) variables and arrays in a subroutine called inside a threaded region are automatically private, when the subroutine is compiled with OpenMP or other options which imply thread safety (default automatic).
As to the automatic load balancing Jim referred to, that is usually done by schedule dynamic and possibly adjustment of chunk size (the default chunk size 1 may be OK for you).
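For instance (a sketch with made-up names), nothing special is needed for locals like these when the caller's loop is parallel:

! Hypothetical worker routine: tmp and s are automatic locals
! (no SAVE, not in a module or COMMON), so each thread calling
! new_sub gets its own independent copies.
SUBROUTINE new_sub(loc, res)
   INTEGER, INTENT(IN)  :: loc
   REAL,    INTENT(OUT) :: res
   REAL :: tmp, s
   tmp = REAL(loc)
   s   = tmp * tmp
   res = s
END SUBROUTINE new_sub

It is module variables, SAVEd locals and COMMON blocks that remain shared between threads.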
davidspurr
Beginner
Hmm, perhaps I overlooked the DEFAULT ( PRIVATE ) option. Seems like the following should be feasible:

USE OMP_LIB

..... multiple lines of code

!$OMP PARALLEL DO DEFAULT(PRIVATE) SCHEDULE(DYNAMIC)
DO loc = 1, NumberOf_loc
   ~200 lines of code, incl. several loops and several calls to subroutines (which in turn call other subroutines)
END DO
!$OMP END PARALLEL DO

I'm attempting to use this in two subroutines. In the first (a small part of the analysis) it seems to cause no problems(?), but I have not yet worked out whether the code is actually running in // there or not. In the second subroutine (the bulk of the analysis) the program crashes on reaching the //-coded part.

Have I misinterpreted DEFAULT (PRIVATE)?
e.g. does it apply to variables in subroutines called from within the "lexical extent of a parallel region"?

Possibly I should be declaring the large arrays of "raw data" as SHARED?
They are mostly accessed within the nested subroutines, so would a SHARED clause preceding the DO loop even be effective?

David


[EDIT] I had not seen Tim's response before posting the above. Many of the variables I use are declared in modules, rather than locally within each subroutine.

Hence I will still need DEFAULT (PRIVATE)?
TimP
Honored Contributor III
I've seen people bitten often enough by defaults that they have chosen DEFAULT(NONE) so as to force everything to be specified. Yes, if a module variable is visible and writable by multiple threads, it will need PRIVATE, FIRSTPRIVATE, LASTPRIVATE, etc., so that each thread gets a local copy and knows when it inherits or passes back a global value. SHARED of course is fine for variables which aren't modified within the threads. It becomes practically impossible to verify without a tool like Intel Thread Checker.
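A minimal sketch of the DEFAULT(NONE) style, reusing names from Jim's earlier example (every variable referenced in the region must then be listed, so nothing is shared by accident):

!$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(loc, a) SHARED(NumberOf_loc, eq_res1, b, c, d)
DO loc = 1, NumberOf_loc
   a = SomeExpression   ! per-thread scratch value
   eq_res1(loc) = fn(loc, a, b, c, d)
END DO
!$OMP END PARALLEL DO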
davidspurr
Beginner
Thanks Tim

In light of the complexity of the included code & the number of variables to be made private etc., I decided I had better learn to crawl before trying to run.

To check whether I was on track, I decided to first implement the // coding on one of the inner loops, with only 11 variables needing to be declared private. I expected this to have a relatively small impact on the analysis time, as the loop accounted for perhaps only 20% - 30% of the work of the outer loop [on closer inspection, likely << 20% of the total]. Plus likely high overheads, as the inner loop itself is run ~200,000 times per analysis. Still, the inner loop contained ~25 lines of code, including two subroutine calls, so I expected to see some benefit? (Number of cycles executed in the inner DO loop typically in the 10's of thousands [EDIT - mean cycles for the loop is actually 2,800, range 1 - 17,000].)

It looked good for a moment or two: CPU usage jumped to 90 - 100% (all four cores at close to 100% for the first time on this PC). That was surprising though, since there is a lot of work outside the // section.

After about 5 minutes it became apparent that despite the CPU working 3 - 4 times harder, the analysis was running MUCH slower than without the // code!

The first attempt retained the SCHEDULE(DYNAMIC) spec. I retried without it, but the analysis still ran SIX times slower than without the // coding (& CPU still at 90 - 100%).

Will need to do a lot more digging to confirm that the implementation is correct, but it does not look hopeful at this point :-(

D

davidspurr
Beginner
Follow up to above post:

In the above attempt it seems some variables may not have been correctly 'typed' (private vs shared, etc.), as several intermediate 'debug' values output were incorrect (the analysis was terminated long before completion).

###

Tried another inner loop with fewer statements & only one "simple" function reference (the function contains only local variables & the dummy arguments); i.e. no complication with module variables in called subroutines.

This second loop is also executed ~200,000 times but has more than 20x as many cycles (mean >77,000) as the previous case, and has its own inner loop of 15 cycles & a third-level loop within the function (I suspect this section of code accounts for >30% of the total analysis time). SCHEDULE not specified, hence STATIC?

Based on CPU_TIME running times written to screen for each "loc", it appeared this case was ~3x slower than the non-parallel case. But it turns out CPU_TIME is not 'valid' for 4 cores running in parallel, since it accumulates time across all threads (likely not news to many). DATE_AND_TIME recorded at the start & end of the program shows that the second parallel case was actually 16.5% faster than the non-parallel case (divide CPU_TIME by 4?). However, it is not worth committing 4 cores at close to 100% for a 17% gain.
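(For the record, OMP_GET_WTIME from OMP_LIB appears to be the intended way to time this, since it returns wall-clock seconds directly - a minimal sketch:

DOUBLE PRECISION :: t0, t1
t0 = OMP_GET_WTIME()
! ... run the analysis loop under test ...
t1 = OMP_GET_WTIME()
WRITE (*,*) 'elapsed wall-clock (s):', t1 - t0

which avoids the CPU_TIME-summed-over-threads confusion.)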

Re-ran the above with NUM_THREADS set to 3 to check whether a memory-access bottleneck was an issue. CPU usage was typically ~75% as expected, with total analysis time (via DATE_AND_TIME) 17.3% less than the non-parallel case. Marginally better (less CPU demand), but still not a worthwhile gain :-(

###

Despite the relatively 'straightforward' structure of this second loop, it appears something must still be wrong, as the final results differ from the non-parallel case by ~10% (much less than the error in the intermediate values for the first inner-loop attempt). The error for the 3-thread case was slightly smaller.

The likely source of the error would seem to be the final summation at the end of the loop? Simplified, this case is of the form:
!$OMP PARALLEL DO PRIVATE(....)
DO i = 1, LargeNum
   k = kv(i)
   ...
   DO j = 1, 15
      ....
      x = xFUNCTION(....) ! contains only local variables
      p = ... (depends on x, k & j)
      Res(j,iloc(k)) = Res(j,iloc(k)) + p
   END DO
END DO
!$OMP END PARALLEL DO
x, p, j & k are declared PRIVATE, but not Res(:,:).
iloc(:) & kv(:) are SHARED (by default), which should be OK (not changed in loop).

The result summation is basically the same as R = R + p, where all threads aggregate into the same R. I assumed R should be SHARED in this situation, but given the error in the results...?
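Would guarding the shared update be the right fix? e.g. (just a sketch of what I have in mind, if ATOMIC is the correct tool here), since two threads with different i can hit the same iloc(k):

!$OMP ATOMIC
Res(j,iloc(k)) = Res(j,iloc(k)) + p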

OR ... does "xFUNCTION" somehow need to be declared PRIVATE also?

I attempted to, but got a compile-time syntax error ("name has not been declared as an array or a function"). It has been; i.e. INTEGER xFUNCTION is declared locally.

