Intel® Fortran Compiler

OMP_NUM_THREADS, -openmp and -parallel switch conflicts?

ulsterjam
Beginner

Hi all,

I would be grateful if someone could briefly describe the transformation performed by the -parallel switch on the F95 code described below.

The code in particular is a conventional implementation of a sparse matrix-vector multiply (using the CSR representation of the sparse matrix); effectively the routine is just a nested do loop. The matrix and vectors are proportional to 2^20 double complex elements. The code is being tested on an SMP machine with 256 GB RAM and 32 processors. I am currently using Version 9.1, Build 20060928, of the ifort compiler.

I obtained the following wall-clock data timed over 150 iterations of the M-V-M routine:

  1. Compiling with -O3 only gave a time of ~110 secs.
  2. Compiling with -O3 -parallel gave a time of ~45 secs.
  3. Compiling with -O3 -parallel and setting OMP_NUM_THREADS=2 gave a time of ~84 secs (further increases in OMP_NUM_THREADS caused further degradation in wall-clock time).

I then added an OPENMP PARALLEL DO around the outer do loop:

  1. Compiling with -O3 -openmp and OMP_NUM_THREADS=1 gave ~110 secs.
  2. Compiling with -O3 -openmp and OMP_NUM_THREADS=4 gave ~43 secs (increases in OMP_NUM_THREADS gave decent speedup).
  3. Compiling with -O3 -openmp -parallel and OMP_NUM_THREADS=4 gave a time of ~240 secs.

While I am not advocating a compiler problem, I would just like to know why

  1. the -parallel switch conflicts with OMP_NUM_THREADS > 1
  2. the -openmp and -parallel switches together cause a major slowdown

A brief explanation of the switches' function (particularly -parallel) in this case would be gratefully received.

Apologies in advance if this topic has been tackled before,

Tim.

jimdempseyatthecove
Honored Contributor III

Tim,

In general (as you've discovered) it is wise to use only one form of parallelization.

There are many considerations involved in parallelizing code. The parallel version of the code has to produce results numerically equivalent to the non-parallel code (i.e. within round-off precision, considering the potentially different order of execution).

The code within the loops must be examined to see if there are any temporal dependencies e.g. A(n) must be calculated before A(n+1). When coding with both OpenMP parallelization and auto-parallelization you have a situation of competing authorities as to which is in control of the temporal issues.

1) If indifference is given to both (i.e. no consideration), then you may very well experience temporal issues in the execution.

2) If preference is given to OpenMP (i.e. auto-parallelization is temporarily off for that loop), then whenever the loop is serialized by OpenMP it executes serially, with no auto-parallelization to fall back on.

3) If preference is given to auto-parallelization (i.e. OpenMP compilation is temporarily off for that loop), then the loop, although executing in parallel, may not be doing so with the broader-scope information available to OpenMP.

I haven't looked at your particular case as to what code is generated, but it wouldn't surprise me if there is a

1b) If indifference is given to both (i.e. no consideration), then the temporal issues in the execution are resolved by a critical section.
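A minimal sketch of the kind of temporal dependency the compiler has to rule out before parallelizing (hypothetical code, not from this thread):

```fortran
program temporal
  implicit none
  integer, parameter :: n = 8
  double precision :: a(n), b(n)
  integer :: i
  a = 1.0d0
  ! Independent iterations: safe for -parallel or an OpenMP PARALLEL DO.
!$OMP PARALLEL DO
  do i = 1, n
     b(i) = a(i) * 2.0d0
  end do
!$OMP END PARALLEL DO
  ! Temporal dependency: a(i) needs a(i-1), so this loop must run serially;
  ! parallelizing it would produce wrong results.
  do i = 2, n
     a(i) = a(i-1) + b(i)
  end do
  print *, a(n)
end program temporal
```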

I went to the Doc and said "It hurts when I walk this way." Doc says "Don't walk that way."

Jim Dempsey

ulsterjam
Beginner

Thanks Jim for that...very informative indeed.

If you don't mind I've pasted the code below...just would be interested to know if this provides any further insight

subroutine zamux (n, x, y, a, ja, ia)
  double complex x(*), y(*), a(*)
  integer n, ja(*), ia(*)
  double complex t
  integer i, k

!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(t,k)
  do 100 i = 1, n
     t = 0.0
     do 99 k = ia(i), ia(i+1)-1
        t = t + a(k)*x(ja(k))
 99  continue
     y(i) = t
100 continue
!$OMP END PARALLEL DO

  return
end

jimdempseyatthecove
Honored Contributor III

Without knowledge of n and the contents of the range tables ia and ja, it is difficult for me to offer precise suggestions. I can offer some observations.

I must assume array ia is dimensioned at least n+1 (one larger than n).

Circumstance 1:

The computation of loop 99 is approximately equivalent for each iteration of loop 100, and n is significantly larger than the number of processors available.

Your coding is acceptable, except for a potential gain from unrolling loop 99 (requires some knowledge of the order in ja(k)).

Circumstance 2:

The computation of loop 99 is approximately equivalent for each iteration of loop 100, n is on the order of the number of processors available, and MOD(n, numberOfProcessors) is not 0. And the inner-loop computation requirement is relatively large.

This would cause under-utilized processors on the last (first and only) iterations of the outer loop. Consider parallelizing the inner loop instead of the outer loop.

Circumstance 3:

The computation of loop 99 is approximately equivalent for each iteration of loop 100, n is on the order of the number of processors available, and MOD(n, numberOfProcessors) is not 0. And the inner-loop computation requirement is relatively small.

This would cause under-utilized processors on the last (first and only) iterations of the outer loop. Consider unrolling the inner loop somewhat.

Circumstance 4:

The computation of the inner loop is small but the number of iterations is large.

Use the OpenMP SCHEDULE(STATIC) clause on the outer loop,
or SCHEDULE(STATIC, max(1, n/NumberOfProcessors/2)),
etc.

Circumstance 5:

The computation of loop 99 differs widely for each iteration of loop 100, and n is relatively large.

SCHEDULE(DYNAMIC, max(1,n/NumberOfProcessors/16))

Experimentation might yield better insight than speculation on my part.
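A sketch of how the schedule suggestions above could be attached to the posted loop. nProcs here is a placeholder for the available processor count (an assumption, not a value from the thread):

```fortran
! Circumstance 4 (uniform, light inner loop): large static chunks.
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(t,k) SCHEDULE(STATIC, max(1, n/nProcs/2))
  do i = 1, n
     t = (0.0d0, 0.0d0)
     do k = ia(i), ia(i+1)-1
        t = t + a(k)*x(ja(k))
     end do
     y(i) = t
  end do
!$OMP END PARALLEL DO

! Circumstance 5 (widely varying inner loop): same loop body, but with
! SCHEDULE(DYNAMIC, max(1, n/nProcs/16)) instead, so idle threads can
! grab small chunks as they finish.
```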

The inner loop can be unrolled somewhat with knowledge of system cache.

Assume cache line size is 64 bytes and assume the memory subsystem reads ahead one cache line. Then think of the cache as reading 128 bytes. (don't assume I know what I am talking about). Each double complex is 16 bytes or 4 per cache line, 8 per two cache lines.

Although your x(ja(k)) might be non-sequential order the a(k) will be sequential in order.

I would suggest using a local array of 4 or 8 double complex numbers and copy ja(k:k+3) or ja(k:k+7) to the local array (taking care of last iteration issues) then use the local array together with x(ja(k)). In that manner you can take advantage of the processor cache for at least half of the processing.

Jim Dempsey

ulsterjam
Beginner
Thanks again for the detailed analysis.

With respect to the various cases you outlined, n is very large and the inner loop repeats only a small number of times. To give you an example of a typical case:

i = 1...2**20
k = 1..60 (fixed for all iterations of i)
I have at most 16 processors/threads to work with.

So from your discussion, I think circumstances 1 and 4 suit best. I tried the dynamic schedule (even though the work is load-balanced per iteration) and the performance was horrible compared with the static or guided schedules.

I will take on board your comments and try to optimise the routine further... many thanks.

Do you know if Intel plans to introduce a threaded sparse matrix-vector multiply routine ( using double complex values ) as part of the MKL library? I just can't find any SMP implementation of this operation, which I would have expected to be pretty common in scientific applications....hence I am left to hand-craft my own routine with the help of Y.SAAD's sparsekit and comments from experts like yourself.

Thanks again.
Steven_L_Intel1
Employee
This would be a good question to ask in the MKL forum.
jimdempseyatthecove
Honored Contributor III

If the outer loop is very large and the inner loop is relatively lightweight, then I would suggest using

NumberOfAvailableProcessors = min(OMP_GET_NUM_PROCS(), OMP_GET_MAX_THREADS())
iChunkSize = max(1, (n / NumberOfAvailableProcessors) / NumberOfAvailableProcessors)

!$OMP PARALLEL DO SCHEDULE(STATIC,iChunkSize), ...

GUIDED may work well too.

What works best will depend on what else is running on your system at the time and in conjunction with the specified ChunkSize.

If the system is dedicated and the inner-loop time is consistent for all runs, then try

iChunkSize = max(1, n / NumberOfAvailableProcessors)

or

iChunkSize = max(1, n / NumberOfAvailableProcessors / 2)

Experiment

Don't forget about unrolling the loop as discussed. The purpose of this unrolling is not to reduce loop iteration overhead, but instead to better utilize the system cache. This could be a 2x or more improvement gain. YMMV
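The chunk-size idea above, wrapped around the posted routine as one complete sketch. The subroutine name zamux_chunked is hypothetical, and the chunk expression is the illustrative starting point from the discussion, not a tuned value:

```fortran
subroutine zamux_chunked(n, x, y, a, ja, ia)
  use omp_lib        ! for omp_get_num_procs / omp_get_max_threads
  implicit none
  integer n, ja(*), ia(*)
  double complex x(*), y(*), a(*)
  double complex t
  integer i, k, nProcs, iChunkSize

  ! Compute a chunk size at run time from the available thread count.
  nProcs = min(omp_get_num_procs(), omp_get_max_threads())
  iChunkSize = max(1, (n / nProcs) / nProcs)

!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(t,k) SCHEDULE(STATIC,iChunkSize)
  do i = 1, n
     t = (0.0d0, 0.0d0)
     do k = ia(i), ia(i+1)-1
        t = t + a(k)*x(ja(k))
     end do
     y(i) = t
  end do
!$OMP END PARALLEL DO
end subroutine zamux_chunked
```

On a dedicated machine the larger chunk sizes discussed above can be substituted for iChunkSize without touching the directive.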

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

An untested sample of the inner-loop unrolling (doubling up):

 double complex tCopy(2), aCopy(2), xCopy(2)
 integer iBegin, iEnd
!DEC$ ATTRIBUTES ALIGN: 16::tCopy
!DEC$ ATTRIBUTES ALIGN: 16::aCopy
!DEC$ ATTRIBUTES ALIGN: 16::xCopy
...
iBegin = ia(i)
iEnd = ia(i+1)-1
if (IAND(iEnd-iBegin, 1) .eq. 1) then
   ! even number of elements - no peel needed
   tCopy(1) = (0.0d0, 0.0d0)
   tCopy(2) = (0.0d0, 0.0d0)
else
   ! odd number of elements - peel off the last one
   tCopy(1) = a(iEnd)*x(ja(iEnd))
   tCopy(2) = (0.0d0, 0.0d0)
   iEnd = iEnd - 1
endif
do 99 k = iBegin, iEnd, 2
   aCopy(1:2) = a(k:k+1)
   xCopy(1) = x(ja(k))
   xCopy(2) = x(ja(k+1))
   tCopy = tCopy + aCopy*xCopy
99 continue
y(i) = tCopy(1) + tCopy(2)

If you know that there is a strong possibility that ja(k+1) .eq. ja(k)+1, then the 99 loop might be

 integer jaK, jaK1
 do 99 k = iBegin, iEnd, 2
    jaK = ja(k)
    jaK1 = ja(k+1)
    if (jaK1 .eq. jaK+1) then
       ! consecutive indices - use a contiguous slice of x
       tCopy = tCopy + a(k:k+1)*x(jaK:jaK1)
    else
       aCopy(1:2) = a(k:k+1)
       xCopy(1) = x(jaK)
       xCopy(2) = x(jaK1)
       tCopy = tCopy + aCopy*xCopy
    endif
99 continue
 y(i) = tCopy(1) + tCopy(2)
Jim Dempsey
