Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Parallel programming: Paralleled do loops

aafshani
Beginner
2,546 Views

Hi everyone;

I had a problem with parallelizing a Fortran program that I'd like to share here. My program has three big nested do loops which were taking too much time to run. So, I decided to parallelize the outer do loop (n=1,nszf) and then run it on some supercomputers at my university. My current Dell XPS 8500, Windows 7 Ultimate PC has 8 threads. Here is that part of my program:

!$omp parallel private(i,j,k,n,l,c) shared (nszf, nband) reduction (- : sk, r1)
!$omp do
      do 300 n = 1,nszf          ! nszf=139020
      i = n
      do 290 l = 2,nband         ! nband=7000
      i = i+1
      if (sk(n,l)) 240,290,240
240   c = sk(n,l)/sk(n,1)
      j = 0
      do 270 k = l,nband
      j = j+1
      if (sk(n,k)) 260,270,260
260   sk(i,j) = sk(i,j) - c*sk(n,k)
270   continue
280   sk(n,l) = c
      r1(i) = r1(i) - c*r1(n)
290   continue
300   r1(n) = r1(n)/sk(n,1)
!$omp end do
!$omp end parallel

Without the omp clauses the program runs well, but using parallel gives some errors. I think the program has a problem in the private, shared, and reduction clause parts. sk is a stiffness matrix and r1 is the force array. It should be noted that this program is the same famous Gaussian elimination method for solving the unknowns of a matrix. Is it correct to use reduction for a matrix and arrays this way here? Please let me know where the problem is.

Thanks  

0 Kudos
24 Replies
jimdempseyatthecove
Honored Contributor III
2,158 Views

Experiment with having the !$omp do enclose the 270 loop (and remove the reduction clause on the !$omp parallel)

Have only the master thread update the variables outside the !$omp do (and inside the !$omp parallel).

Once this is working, then see how to partition the iteration spaces such that you can move the parallelization to outer levels. Note, keeping the !$omp parallel at the outer level will reduce the number of enter/exit parallel region, but comes at the expense of delegating the work outside the !$omp do to one thread.
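
Something along these lines is what I mean (a rough, untested sketch only; I've used !$omp single rather than master so the implicit barriers keep the sequential ordering, and the zero tests mirror your arithmetic IFs):

!$omp parallel private(n,l,i,j,k,c) shared(sk,r1,nszf,nband)
do n = 1,nszf                      ! every thread runs the outer loops redundantly
   do l = 2,nband
      i = n+l-1
      if (sk(n,l) == 0.) cycle     ! all threads test the same shared sk, so all cycle together
      c = sk(n,l)/sk(n,1)
!$omp do                           ! only the 270 loop is work-shared
      do k = l,nband
         j = k-l+1
         if (sk(n,k) /= 0.) sk(i,j) = sk(i,j) - c*sk(n,k)
      end do
!$omp end do                       ! implicit barrier
!$omp single                       ! one thread updates the shared values
      sk(n,l) = c
      r1(i) = r1(i) - c*r1(n)
!$omp end single
   end do
!$omp single
   r1(n) = r1(n)/sk(n,1)
!$omp end single
end do
!$omp end parallel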

Jim Dempsey

0 Kudos
John_Campbell
New Contributor II
2,158 Views
This is a fixed-band direct solver, circa 1972. My experience with a direct solver has been that it is not easy to apply OpenMP to this solution approach, although I assume it can be done. Nband is typically too small to justify multi-threading the inner loop, while the varying sk(i,j) prohibits the outer loops being parallelised. With this solver, it might be possible to parallelise loop 290, as this applies row N to all lower rows. Would taking a copy of row N, say as vector Row_N, make it easier for the compiler to parallelise loop 290? I have had more (easier) success with vectorising the calculation, with AVX instructions now offering more in this area. With /O2 (default?) vectorisation is enabled.

Two areas I would check:
1) How often does the test "if(sk(n,k)) 260,270,260" skip to 270? It might be better to ignore this test and simplify the inner loop.
2) How uniformly banded is your matrix? There are many examples of skyline solvers, which identify the bandwidth of each equation rather than the maximum band. COLSOL by R L Taylor is very well published, including in Zienkiewicz: The Finite Element Method. For most problems, changing from a fixed band row storage solver to a variable band column storage solver should provide a significant speed improvement.

John

PS: I changed the code to F90 and reviewed your IF tests, now hopefully without errors!! Lower case l and upper case I are not good variable names with most fonts!

!$omp parallel private(i,j,k,n,l,c) shared (nszf, nband) reduction (- : sk, r1)
!$omp do
do n = 1,nszf                              ! nszf=139020
   if (sk(n,1) < tiny_value) then          ! test for singularity, could be abs test if eigenvalue extraction
      sk(n,1) = 1
      cycle
   end if
!
   iband = 1
   Row_n(1) = sk(n,1)
   do b = 2,nband
      Row_n(b) = sk(n,b)
      if (Row_n(b) == 0) cycle
      iband = b
      sk(n,b) = Row_n(b)/Row_n(1)
   end do
!
!  This loop could be parallelised, with Row_n and iband shared ?
   i = n
   do b = 2,iband                          ! nband=7000
      i = i+1
      if (Row_n(b)==0) cycle               ! main zero test
      c = Row_n(b)/Row_n(1)
      j = 0
      do k = b,iband
         j = j+1
         ! omit if (Row_n(k)==0) cycle     ! low probability test ?
         sk(i,j) = sk(i,j) - c*Row_n(k)
      end do
      ! sk(n,b) = c                        ! already done
      r1(i) = r1(i) - c*r1(n)
   end do
   r1(n) = r1(n) / Row_n(1)
!
end do
!$omp end do
!$omp end parallel
0 Kudos
John_Campbell
New Contributor II
2,158 Views
The existing 290/270 loop structure should be more suitable for $omp with the introduction of Row_n. Some other ideas that might help run time are:
1) separate r1 into its own forward reduction (an order of magnitude faster than the sk reduction)
2) swap loops 270 and 290, so that sk(i,j) is addressed sequentially, which might reduce memory access times

! Separate loop for r1 (after sk(n,.) has been modified)
do n = 1,nszf                              ! nszf=139020
   if (sk(n,1) < tiny_value) cycle         ! test for singularity, could be abs test if eigenvalue extraction
   i = n
   do b = 2,nband
      i = i+1
      if (sk(n,b)==0) cycle                ! main zero test
      r1(i) = r1(i) - sk(n,b)*r1(n)
   end do
   r1(n) = r1(n) / sk(n,1)
end do

! re-ordered i and j might be better (coding needs to be checked)
! if loops b and k were swapped, sk(i,j) would then be addressed sequentially,
!   but it involves two ops (* and /), which can be overcome by using sk(n,b)
! no easy zero test in loop j
! index relations: b = i-n+1 ; i = b+n-1 ; k = j+b-1 = j+i-n ; j = k-b+1
do j = 1,iband
   b = 1
   k = j
   do i = n+1,n+1+iband-j
      b = b+1                              ! b = i-n+1
      k = k+1                              ! k = i-n+j
      ! sk(i,j) = sk(i,j) - Row_n(b)*Row_n(k)/Row_n(1)
      sk(i,j) = sk(i,j) - sk(n,b)*Row_n(k)
   end do
end do
0 Kudos
John_Campbell
New Contributor II
2,158 Views
After further review, there are two areas where you could improve performance:

1) Storage: you have three possible storage methods to use, being
   store by diagonal :: sk(nszf,nband), which you are using
   store by row :: sk(nband,nszf), which should be better
   store by column :: sk(nband,nszf), which suits the COLSOL approach
Row storage is more suited to Gaussian elimination.

2) Zero values: if there is not total non-zero filling of the reduced sk array, then monitoring of non-zero values might be a significant improvement. It can be done in the calculation of the Row_f vector, reducing the inner loops to only the non-zero coefficients in row n.

!
! Find active values in this row
iband = 1
Row_i(1) = 1
Row_f(1) = skt(1,n)
do b = 2,nband
   if (skt(b,n) == 0) cycle
   iband = iband + 1
   Row_i(iband) = b
   Row_f(iband) = skt(b,n)
   skt(b,n) = Row_f(iband)/Row_f(1)
end do
!
! This key loop could be parallelised, with Row_f, Row_i and iband shared ?
do b = 2,iband
   i = n+row_i(b)-1
   c = Row_f(b)/Row_f(1)
   do k = b,iband
      j = row_i(k)-row_i(b)+1
      skt(j,i) = skt(j,i) - c*Row_f(k)
   end do
end do

(You would need to check this coding, especially for i and j.)

I would expect that the use of Row_f will help loops 290/270 to be parallelised. Parallelising is not easily achieved in COLSOL codings. If your model is a rectangular mesh with uniform banding, then the identification of zeros might not offer a significant improvement. However, both improvement areas should offer some benefit.

John
0 Kudos
aafshani
Beginner
2,158 Views

Thank you Jim, and especially you John, for the comments. Really helpful.

John, the mesh I use for this program is a uniform rectangular one of length 120 m, width 135 m, and height 55 m; all sides of the mesh have general boundaries except the top part.

John Campbell wrote:

1) How often does the test "if(sk(n,k)) 260,270,260 " skip to 270. It might be better to ignore this test and simplify the inner loop. 

In the sk(nszf, nband) matrix, I would say that around 80% of each row is zero.

John Campbell wrote:

2) How uniformly banded is your matrix ? There are many examples of skyline solvers, which identify the band-width of each equation, rather than the maximum band.

I think the actual bandwidth is pretty close to my nband=7000; maybe the actual bandwidth is around 6800. So this one is okay, I think.

Your last comment seems to be a good alternative, but it looks like transposing the sk(nszf, nband) matrix to sk(nband, nszf) makes it a bit confusing, as we are dealing with the bandwidth too. I was wondering, couldn't we use sk(nband, nszf) with Row_f?

I'm new to OpenMP too, so is the variable declaration for OpenMP in my program correct, especially reduction (- : sk, r1)?

The whole program should be as follows, right? But what about the r1 array? Based on this program, I should transpose sk to skt, and then transpose it back again after the do loops, as I also use sk in the following parts for stress and strain analysis, right?

!$omp parallel private (i,j,k,n,l,c) shared (nszf, nband,Row_f, Row_i,iband) reduction (- : sk, r1)


do n = 1,nszf ! nszf=139020
if (sk(n,1) < tiny_value) then ! test for singularity , could be abs test if eigenvalue extraction
sk(n,1) = 1
cycle
end if

.

.

iband = 1
Row_i(1) = 1
Row_f(1) = skt(1,n)
do b = 2,nband
if (skt(b,n) == 0) cycle
iband = iband + 1
Row_i(iband) = b
Row_f(iband) = skt(b,n)
skt(b,n) = Row_f(iband)/Row_f(1)
end do

!$omp do
do b = 2,iband
i = n+row_i(b)-1
c = Row_f(b)/Row_f(1)
do k = b,iband
j = row_i(k)-row_i(b)+1
skt(j,i) = skt(j,i) - c*Row_f(k)
end do
end do

!$omp end do

!$omp end parallel

end do 

thanks :)

0 Kudos
John_Campbell
New Contributor II
2,158 Views

Alex,

If there is only one r1 array, i.e. only 1 load case in your problem solution, I would put the r1 reduction back in the "290" loop, although it can be a neater solution to have the r1 forward reduction and backward substitution in a separate sk_solve routine.

You stated that "In the sk(nszf, nband) matrix, I would say that around 80% of each row is zero."  I think you should generate some statistics on this, as if you have a regular 3D mesh as described, then there could be as little as 1% zeros. The values of sk are being tested after previous equations have trickled down their influence. If there is significant non-zero infill then the Row_f indexed version of Row_n might not be very effective. For each row, count the number of non-zero values and report the average. You might be surprised.
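For example, something like this (a sketch only, using the sk(nszf,nband) storage you have now) would give the average count per row:

integer :: n, b, nz_row
real*8  :: sum_nz
sum_nz = 0
do n = 1,nszf
   nz_row = 0
   do b = 1,nband
      if (sk(n,b) /= 0) nz_row = nz_row + 1
   end do
   sum_nz = sum_nz + nz_row
end do
write (*,*) ' average non-zero values per row =', sum_nz/dble(nszf)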
I'd recommend trying the row_n approach first, as it should help the $omp performance.

As the sk matrix is so large, I think there would be benefit in transposing. This would localise the memory usage. You must have the value of nband, before you start generating the sk matrix. Isn't sk an allocatable array ? There should be limited reference to this array in the matrix formation and reduction codes. The back-substitution code for r1 might only need the sk transpose change, as it is much faster than the sk reduction.

The combination of the sk transpose and the Row_n changes should produce some benefits. (Row_n would work better with sk transposed.)

I hope this helps,

John

0 Kudos
John_Campbell
New Contributor II
2,158 Views

Alex,

I have made some minimal changes to your code, to:
 introduce row_f : the active row of values to remove sk(..) from the RHS in 290 loop
 introduce iband : the bandwidth limit for equation n
 summary statistics for bandwidth, zero values and elapsed time
 changed l to b for easier reading

I assumed that sk was declared as:
 real*8, allocatable, dimension(:,:) :: sk
 allocate ( sk(nszf,nband) )

I have made some changes to the !$omp syntax, but I am not sure of the syntax.

Could you run it with /O2 and with or without /Qparallel. I would be interested to find out that:
 the results remain the same
 the value of summary statistics, and
 statistics of elapsed time (using system_clock, for example as sketched below) compared with the previous version.
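
For the timing, something like this (a sketch only) wrapped around the reduction would do:

integer :: c0, c1, clock_rate
call system_clock (c0, clock_rate)
! ... SK reduction loops ...
call system_clock (c1)
write (*,*) ' elapsed reduction time =', dble(c1-c0)/dble(clock_rate), ' seconds'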

These could identify how the changes I identified might benefit the run time.

The next step would be to transpose sk in your program and see the changes.

John

0 Kudos
aafshani
Beginner
2,158 Views

Thanks John for the response,

I was trying to run the program with OpenMP; before running, I noticed a minor issue in your attached file:

In loop 290 you have used iband, but the row_f(b) index varies from 1 to nband; I mean the array row_f(b) (b=1,nband) in the previous parts also includes zero values, and its size is nband, right?

i = n
do b = 2,nband           ! nband=7000
i = i+1
if (row_f(b)==0) cycle
c = row_f(b)/row_f(1)
j = 0
do k = b, nband
j = j+1
if (row_f(k)==0) cycle
sk(i,j) = sk(i,j) - c*row_f(k)

The next main issue, as you said, is that sk is an allocatable matrix. I think there is no way to avoid using an allocatable array because of the large size of this array.

allocate (sk(nszf,nband)). When running with OpenMP and using the allocatable matrix sk, I get the following error:

A deferred array shape array is not permitted in an openMP

Not using an allocatable array, during running (building the solution), I face: compilation aborted (code 3), exactly at the first line of the entire program.

so, what else can I do?

thanks.

0 Kudos
John_Campbell
New Contributor II
2,158 Views

Alex,

Loop 290 should now use iband. In past posts I have identified three bandwidth measures:
1) nband is the maximum bandwidth of any row
2) iband in gaussean.f90 is the length of the n row to the last non-zero value
3) iband in my post of Wed, 01/30/2013 - 15:22, where I included row_i, is the count of the number of non-zero values that were indexed in row_i. Depending on the statistics this could be a good change.
You should use the iband, as I provided in gaussean.f90.

I provided some statistics in this code. You might want to include the following, before the start of the reduction:

!     initial statistics
         real*8 sum_zero
         sum_zero = 0
         do n = 1,nszf                   ! nszf=139020
           do b = 1,nband
             if (sk(n,b)==0) sum_zero = sum_zero+1
           end do
         end do
         write (*,*) ' % zero before      =', sum_zero/dble(nszf*nband) * 100.

Your comment about sk not being able to be used for OpenMP is a surprise to me. I would have a main module where sk, nszf, nband and other important system variables are used. You should allocate sk prior to assembly of the global stiffness matrix, once you know nszf and nband. I have given an example of this module definition in gaussean.f90.
Not being able to use an allocatable array in openMP should not be the case.

Is sk declared as real*8 or real*4 ?

Transposing the use of sk from sk(i,j) to sk(j,i) should help the run time.

Could you adapt gaussean.f90 and run it to get the statistics? These would show if including row_i is worthwhile, and certainly which adaptation of loop 290 should work best. Once we understand this, I think the next step is to determine why loop 290 cannot work with !$omp.

I am tempted to make loop 290 a subroutine; however, I have found this approach does not work with /Qparallel. I had hoped that !$omp would overcome this problem.

John

0 Kudos
John_Campbell
New Contributor II
2,158 Views

Alex,

I have reviewed the example I gave you and included extra code to generate a smaller dummy stiffness matrix so that I could run it (neq=130000, nband=1700). The results of these runs, especially the degree of non-zero infill, may be different from your case, but they show:
  gaussean.f90 is the run where iband (the band envelope) is used for each equation, (306 seconds)
  gaussean_tran.f90 is the run where SK is transposed (106 seconds)
  gaussean_t2.f90 is using transposed and row_i to remove zeros from the active row ( 60 seconds)
The run time improvement is heavily dependent on the percentage of zeros after reduction, which here was 55%. If your example gives a much lower figure, say < 10% (which I expect), then the savings will be less, but probably worthwhile.
I also separated out the reduction of R1, as this is much faster than reducing SK. You could also add the loop for back-substitution. I'd recommend this.
You should check gaussean_tran.f90 to see how easy it is to transpose SK. (use FC utility)

All this has been run with /O2 but not /Qparallel or /Qopenmp, which brings us to where you first began with your question about !$OMP directives in the code. This needs some more review.
I tried !$OMP CRITICAL for the initial setup of Row_f, but I don't think that did what I intended, which was that it ran only once and not for each thread.
The !$OMP DO did parallelise this code segment, which is what is required, but I could not guarantee it is working correctly.
I ran gaussean_t2.f90 with /Qopenmp, but it ran in only 6 seconds with 4 processes, which is not right. It should be at least 15 seconds. There is a two-stage process of 1) generation of row_f/row_i then 2) loop 290. This needs to be enforced in the !$OMP directives.
I think it is close to working, but it needs changes to the !$OMP directives to handle the loops correctly and to run the row_f generation only once. Hopefully someone else might be able to review the !$OMP commands to get this right.

I am attaching the 3 versions of the test program plus a run trace of each and the batch file I have been using to test with ifort Ver 12.1.
What I have been describing should work and hopefully I have not introduced other coding errors.

I hope this gives you some ideas. I'd be interested in how you proceed,

John

 PS: I have now attached an updated version of option 3, which is a more complete test: it also provides load case back substitution and then a check calculation of force - [sk].disp, to check the maximum calculation error. This should be useful to test the parallel operation.
The changes I have now made to the $OMP commands appear to be working, although there might be a better implementation.
!$OMP appears to reduce the run time by about 50% with 4 processes.

0 Kudos
aafshani
Beginner
2,158 Views

Thanks, John, for the attached files and your comments.

In my program, sk was allocated in some other subroutines (sk is declared real*4; all of my variables are real*4) and then used in another subroutine, something like:

! Main program

real, allocatable:: sk(:,:)

call formk (sk,nszf,...)

call solve (sk,r1,nszf)

! end of main program

Subroutine formk (sk,nszf,...)

real, allocatable:: sk(:,:)
.
.
allocate (sk(nszf, nband))
! sk matrix is formed here, boundary condition is applied, and reduced to nszf x nband. then, it is sent to be solved in solve subroutine. 
end subroutine formk

subroutine solve (sk,r1,nszf)
real, allocatable:: sk(:,:)
allocate (sk(nszf, nband))
nband=7000
call gaussian_reduction (sk, nszf, nband, r1)   ! (your subroutine)

end subroutine solve
.
subroutine gaussian_reduction (sk, nszf, nband, r1)
 real:: sk(nszf,nband)
.. and then rest of program you provided....
This one gives an unhandled exception at .... stack overflow error at the reduction (- : sk, r1) line. I should mention that I've set a large number for the stack reserve size (1000000000).

Changing real:: sk(nszf,nband) in the gaussian_reduction subroutine into real, allocatable:: sk(:,:) and then running gives:
A deferred array shape array is not permitted in an openMP FIRSTPRIVATE, .... [SK]

And in the first case, the program executed the statistics part and gives % zero before = 95%.
You know, you could run gaussean.f90 probably because your dummy stiffness matrix size is small.
So, in this case, what can I do?

thank you

Alex

0 Kudos
John_Campbell
New Contributor II
2,158 Views

Alex,

You should only allocate SK once. I would place SK in a module and USE the module in both FORMK and SOLVE. (SOLVE is a very common name, I'd change it). You should try to understand why this is the case.
It is wrong to allocate SK more than once, or define nband=7000 after the allocate statement in SOLVE. This might be the cause of the error you reported earlier with $OMP and SK.

Look at the way I call gaussean_reduction (sk, nszf, nband) and then declare the arguments. This will overcome the "A deferred array shape array is not permitted in an openMP FIRSTPRIVATE, .... [SK]" error.
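The structure I have in mind is something like this (a skeleton only, with illustrative names; the actual reduction loops are the ones in gaussean.f90):

module stiffness_mod
   real*8, allocatable :: sk(:,:)
   real*8, allocatable :: r1(:)
   integer :: nszf, nband
end module stiffness_mod

subroutine formk
   use stiffness_mod
   allocate ( sk(nszf,nband), r1(nszf) )   ! allocate once, here only
   sk = 0 ; r1 = 0
   ! ... assemble the global stiffness matrix and load vector ...
end subroutine formk

subroutine solve_eqns
   use stiffness_mod
   call gaussean_reduction (sk, nszf, nband)
end subroutine solve_eqns

subroutine gaussean_reduction (sk, nszf, nband)
   integer nszf, nband
   real*8  sk(nszf,nband)     ! explicit-shape dummy: the !$omp clauses
                              ! never see a deferred-shape array
   ! ... the reduction loops with the !$omp directives go here ...
end subroutine gaussean_reduction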

If you don't have any knowledge of the precision required for SK, I'd recommend real*8. You haven't described the system you are modelling, but I'd expect real*8 would be required.
My testing also shows a significant gain by transposing SK. You should try to identify the changes that would be required for this. It should not be extensive, unless you are using SK for other purposes, which would not be advisable for such a large array.

Try the latest version I attached, as I think the $OMP commands are working. It would be good to see results on your i7 which supports 8 processes. I think this version is fairly close.

John

0 Kudos
John_Campbell
New Contributor II
2,158 Views

Alex,

I have tidied up the code to include some more statistics and confirm that it appears to be working for !$OMP.
The attached code is this latest version.
It could be a good example of a basic openMP implementation.

The compilation options I have selected are: /Tf %1.f90 /free /O2 /Qopenmp /QxHost

I have some questions about !$OMP which someone might be able to comment on:
Is it possible to move the following command outside the outer DO loop to improve performance ?
!$OMP PARALLEL PRIVATE (c,i,j,k,b), SHARED (SK, N, Row_f, Row_i, iband)
Are the !$OMP commands I have selected the most appropriate ?
What would be the minimal install on a PC, which does not have ifort installed, so the program could run on another PC ?

John

0 Kudos
John_Campbell
New Contributor II
2,158 Views

Alex,

Could you please run the latest version I attached on your i7, with and without the /Qopenmp compile option. I'd like to see the difference with 8 possible processes. You should run task manager to confirm you are getting 100% CPU, rather than 12.5% without /Qopenmp. The R1 solver has not been optimised and does not use !$OMP. It runs much quicker than the SK reduction. It could be improved if it runs many times.

The !$OMP commands could be reviewed; however, the outer "n" loop must run sequentially in all cases. That is why I placed the "do n" outside the !$OMP construct. I'm not sure if this means that the thread processes are set up nszf times, or only once. Once would be better!
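
If it helps, the alternative I was wondering about would look something like this (not tested; the row_f generation is done by one thread each pass, and the implicit barriers at the end of the SINGLE and DO constructs keep the "n" loop effectively sequential):

!$omp parallel private(n,b,c,i,j,k) shared(sk,r1,row_f,iband,nszf,nband)
do n = 1,nszf                    ! replicated on every thread, kept in step by the barriers
!$omp single
   iband = 1
   row_f(1) = sk(n,1)
   do b = 2,nband
      row_f(b) = sk(n,b)
      if (row_f(b) /= 0) iband = b
   end do
!$omp end single                 ! implicit barrier: row_f and iband now visible to all threads
!$omp do
   do b = 2,iband
      if (row_f(b) == 0) cycle
      i = n+b-1
      c = row_f(b)/row_f(1)
      do k = b,iband
         j = k-b+1
         sk(i,j) = sk(i,j) - c*row_f(k)
      end do
      r1(i) = r1(i) - c*r1(n)
      sk(n,b) = c
   end do
!$omp end do                     ! implicit barrier before the next n
!$omp single
   r1(n) = r1(n)/row_f(1)
!$omp end single
end do
!$omp end parallel

(No singularity test on sk(n,1) here, for brevity.)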

The fortran ALLOCATE statement, provides a memory address space for the SK array, once it is allocated it is fixed for the full run.

Let me know how you go.

John

0 Kudos
aafshani
Beginner
2,158 Views

John;
Thanks for your comments and your files, and sorry for the late response; I was a bit busy.
I ran your latest attached file on my PC, a Dell XPS 8500, assuming sk(nband=3700, nszf=130000). With !$omp it took about 197 sec, and without it about 202 sec. I was checking the task manager as well: with !$omp, CPU usage was 98-100%, and without it 12-17%. Same for sk(nband=3700, nszf=70000): with !$omp 98.5 sec, and without it 102 sec.
It seems that using !$omp doesn't greatly affect the calculation time!? I was expecting a large difference between the times.
However, I want to try your latest attachment in my program; I'll post about it here soon.

Thanks
Alex
 

0 Kudos
aafshani
Beginner
2,158 Views

I just came across another question about parallel programming:
In the following code, I'm trying to transpose a matrix (the sk matrix from my previous post in this thread); the a=sk and b=skt arrays are large and have already been allocated in the main program. As I was monitoring the task manager during this part of the run (task manager > performance > physical memory), the free value of physical memory was 0. And because of using parallel programming, I was expecting to see around 100% CPU usage, while it was around 1-29%. Since the whole point of using parallel programming was to use all the CPU capacity, is the low CPU usage here probably due to the lack of free physical memory? Please correct me if I'm wrong. If so, how can I overcome this issue?

subroutine trans (a, b, row, col)
   integer :: i, j, row, col
   real*8 :: a(row,col), b(col,row)
!$omp parallel private(i,j) shared(a,b,row,col)
!$omp do
   do i = 1, row
      do j = 1, col
         b(j,i) = a(i,j)
      end do
   end do
!$omp end do
!$omp end parallel
end subroutine trans

Thanks
Alex 

0 Kudos
TimP
Honored Contributor III
2,158 Views

Parallel optimization of transpose seems to be a frequently set homework assignment, part of the point being to set an incentive to browse prior posts on this.

You might start by checking that your single thread performance is as good as transpose intrinsic, including checking the effect of vector nontemporal and architecture settings.
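
For instance (a sketch only, reusing the a, b, row, col and trans from the post above, and system_clock for elapsed time):

integer :: c0, c1, clock_rate
call system_clock (c0, clock_rate)
b = transpose(a)                    ! the standard intrinsic, single threaded
call system_clock (c1)
write (*,*) ' transpose intrinsic :', dble(c1-c0)/dble(clock_rate), ' sec'
call system_clock (c0)
call trans (a, b, row, col)         ! the explicit OpenMP loop version
call system_clock (c1)
write (*,*) ' explicit loops      :', dble(c1-c0)/dble(clock_rate), ' sec'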

0 Kudos
John_Campbell
New Contributor II
2,158 Views

Alex,
I would not transpose SK, but generate it in the transposed form, with ALLOCATE ( sk(nband,nszf) )
There should not be too many occurrences in your code of SK(j,i) =, which you convert to SK(i,j) =.
There might also be some DO loops whose order you could change.

Using real*4 is low precision, unless the problem is very well behaved. You should make sure you back-calculate the force, given the displacement field result:
Force = sum (K_element x Disp)   (using all the element stiffness matrices is usually much quicker)
Error = Force - Applied_Force
checking the maximum error for each degree of freedom. It would be good to declare FORCE and ERROR as real*8.
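
A sketch of that check (names such as k_elem, elem_dofs, disp, applied_force, num_elements and ndof_per_elem are only illustrative here; they would come from the element routines already in your program):

integer, parameter :: ndof_per_elem = 24        ! e.g. 8-node brick, 3 dof per node (illustrative)
real*8, allocatable :: force(:), error(:)
real*8  :: fe(ndof_per_elem)
integer :: e, idof
allocate ( force(nszf), error(nszf) )
force = 0
do e = 1, num_elements
   ! element force contribution: K_element x element displacements
   fe = matmul ( k_elem(:,:,e), disp(elem_dofs(1:ndof_per_elem,e)) )
   do idof = 1, ndof_per_elem
      force(elem_dofs(idof,e)) = force(elem_dofs(idof,e)) + fe(idof)
   end do
end do
error = force - applied_force
write (*,*) ' max equilibrium error =', maxval ( abs(error) )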

John

0 Kudos
John_Campbell
New Contributor II
2,158 Views

Tim,

I am investigating some other matrix processing problems, related to the use of "row_i" in the above example for indexing the non-zero coefficients in the active row.
You made the comment about "checking the effect of vector nontemporal and architecture settings."
Are you able to provide more details on these settings and how they might be checked ?
How might these settings be altered ?

John

0 Kudos
TimP
Honored Contributor III
1,977 Views

TRANSPOSE is a standard Fortran intrinsic, since 20 years ago.   Normally it could be expected to be implemented in an effective way for the target platform, although it may not work effectively with auto-parallel or OpenMP.  You could use it to check that your explicit do loops aren't costing you performance at 1 thread, and that you aren't losing performance by adding threads.

Intel Fortran supports

!dir$ vector nontemporal

so as to request stores to bypass cache.  In some situations, the compiler may do it automatically, if the stored array is expected to be larger than say 50% of cache.  In a case where the stored data aren't accessed again while remaining in cache on the same CPU, this directive can avoid reading the data prior to writing, thus potentially cutting memory traffic by 50%.
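
For example, applied to the store side of the transpose loop discussed earlier (a sketch; whether it actually pays off depends on the array sizes and whether the data are reused soon):

do i = 1, row
!dir$ vector nontemporal
   do j = 1, col
      b(j,i) = a(i,j)     ! contiguous stores to b, requested as streaming (non-temporal) stores
   end do
end do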

If you haven't noticed any advertising about /arch:AVX for corei7-2 CPUs, you still might want to try varying the settings.  Depending on the situation, /Qxsse4.1 or the default /Qxsse2 may work best on corei7[-1] in spite of the advertising about /Qxsse4.2 or /QxHost (implied by /fast).

0 Kudos