Hi everyone,
I ran into a problem parallelizing a Fortran program, which I'd like to share here. My program has three big nested DO loops that were taking too much time to run, so I decided to parallelize the outermost DO loop (n=1,nszf) and then run it on some supercomputers at my university. My current Dell XPS 8500 PC (Windows 7 Ultimate) has 8 threads. Here is that part of my program:
!$omp parallel private(i,j,k,n,l,c) shared(nszf,nband) reduction(- : sk,r1)
!$omp do
do 300 n = 1,nszf                   ! nszf = 139020
  i = n
  do 290 l = 2,nband                ! nband = 7000
    i = i + 1
    if (sk(n,l)) 240,290,240
240 c = sk(n,l)/sk(n,1)
    j = 0
    do 270 k = l,nband
      j = j + 1
      if (sk(n,k)) 260,270,260
260   sk(i,j) = sk(i,j) - c*sk(n,k)
270 continue
280 sk(n,l) = c
    r1(i) = r1(i) - c*r1(n)
290 continue
300 r1(n) = r1(n)/sk(n,1)
!$omp end do
!$omp end parallel
Without the OMP directives the program runs well, but the parallel version gives some errors. I think the problem is in the private, shared, and reduction clauses. sk is the stiffness matrix and r1 is the force array. Note that this is the well-known Gaussian elimination method for solving the unknowns of a matrix equation. Is it correct to use reduction for a matrix and an array this way here? Please let me know where the problem is.
Thanks
Experiment with having the !$omp do enclose the 270 loop (and remove the reduction clause on the !$omp parallel).
Have only the master thread update the variables outside the !$omp do (and inside the !$omp parallel).
Once this is working, see how to partition the iteration spaces so that you can move the parallelization to the outer levels. Note that keeping the !$omp parallel at the outer level reduces the number of parallel-region entries and exits, but comes at the expense of delegating the work outside the !$omp do to one thread. A sketch of this first experiment follows.
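A minimal sketch of this arrangement, assuming the original loop bounds and replacing the arithmetic IFs with explicit zero tests (an untested illustration, not Jim's exact code): every thread redundantly executes the serial n and l loops, only the k loop is work-shared, and !$omp single (with its implied barrier) serialises the scalar updates.

!$omp parallel private(n,i,l,c,k)
do n = 1, nszf
  i = n
  do l = 2, nband
    i = i + 1
    if (sk(n,l) == 0.0) cycle    ! every thread takes the same branch
    c = sk(n,l)/sk(n,1)
!$omp do
    do k = l, nband              ! only this loop is work-shared
      if (sk(n,k) /= 0.0) sk(i,k-l+1) = sk(i,k-l+1) - c*sk(n,k)
    end do
!$omp end do                     ! implied barrier: row i is complete
!$omp single
    sk(n,l) = c                  ! serial updates done by one thread
    r1(i) = r1(i) - c*r1(n)
!$omp end single                 ! implied barrier before the next l
  end do
!$omp single
  r1(n) = r1(n)/sk(n,1)
!$omp end single
end do
!$omp end parallel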
Jim Dempsey
Thank you Jim, and especially you John, for the comments. Really helpful.
John, the mesh I use for this program is a uniform rectangular grid, 120 m long, 135 m wide, and 55 m high; all sides of the mesh have general boundaries except its top.
John Campbell wrote:
1) How often does the test "if(sk(n,k)) 260,270,260 " skip to 270. It might be better to ignore this test and simplify the inner loop.
In the sk(nszf,nband) matrix, I would say that around 80% of each row is zero.
John Campbell wrote:
2) How uniformly banded is your matrix ? There are many examples of skyline solvers, which identify the band-width of each equation, rather than the maximum band.
I think the actual bandwidth is pretty close to my nband=7000; it may be around 6800. So this one is okay, I think.
Your last comment seems to be a good alternative, but transposing the sk(nszf,nband) matrix to sk(nband,nszf) makes it a bit confusing, as we are dealing with the bandwidth too. Couldn't we use sk(nband,nszf) with Row_f?
I'm new to OpenMP too, so is the variable declaration in my program's OpenMP directives correct, especially reduction (- : sk, r1)?
The whole program should then be as follows, right? But what about the r1 array? Based on this program, I should transpose sk to skt, and then re-transpose it after the do loops, since I also use sk in later parts for stress and strain analysis, right?
!$omp parallel private(i,j,k,n,l,c) shared(nszf,nband,Row_f,Row_i,iband) reduction(- : sk,r1)
do n = 1, nszf                      ! nszf = 139020
  if (sk(n,1) < tiny_value) then    ! test for singularity; could be an abs() test for eigenvalue extraction
    sk(n,1) = 1
    cycle
  end if
  .
  .
  iband = 1
  Row_i(1) = 1
  Row_f(1) = skt(1,n)
  do b = 2, nband
    if (skt(b,n) == 0) cycle
    iband = iband + 1
    Row_i(iband) = b
    Row_f(iband) = skt(b,n)
    skt(b,n) = Row_f(iband)/Row_f(1)
  end do
!$omp do
  do b = 2, iband
    i = n + row_i(b) - 1
    c = Row_f(b)/Row_f(1)
    do k = b, iband
      j = row_i(k) - row_i(b) + 1
      skt(j,i) = skt(j,i) - c*Row_f(k)
    end do
  end do
!$omp end do
end do
!$omp end parallel
thanks :)
Alex,
If there is only one r1 array, i.e. only one load case in your problem solution, I would put the r1 reduction back in the "290" loop, although it can be neater to have the r1 forward reduction and backward substitution in a separate sk_solve routine.
You stated that "In the sk(nszf,nband) matrix, I would say that around 80% of each row is zero." I think you should generate some statistics on this: if you have a regular 3D mesh as described, there could be as little as 1% zeros. The values of sk are being tested after previous equations have trickled down their influence, and if there is significant non-zero infill then the Row_f indexed version of Row_n might not be very effective. For each row, count the number of non-zero values and report the average; you might be surprised. A sketch of such a count follows.
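For instance, a count along these lines would give the average (a sketch with invented variable names, assuming sk(nszf,nband) is in scope):

integer :: n, nz_total
nz_total = 0
do n = 1, nszf
  nz_total = nz_total + count(sk(n,1:nband) /= 0)   ! non-zeros in row n
end do
write (*,*) 'average non-zeros per row =', dble(nz_total)/dble(nszf)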
I'd recommend trying the row_n approach first, as it should help the $omp performance.
As the sk matrix is so large, I think there would be a benefit in transposing it; this would localise the memory usage. You must have the value of nband before you start generating the sk matrix. Isn't sk an allocatable array? There should be limited reference to this array in the matrix formation and reduction code. The back-substitution code for r1 might only need the sk transpose change, as it is much faster than the sk reduction.
The combination of the sk transpose and Row_n changes should produce some benefits (Row_n would work better with sk transposed).
I hope this helps,
John
Alex,
I have made some minimal changes to your code, to:
- introduce row_f: the active row of values used to remove sk(..) from the RHS in the 290 loop
- introduce iband: the bandwidth limit for equation n
- add summary statistics for bandwidth, zero values and elapsed time
- change l to b for easier reading
I assumed that sk was declared as:
real*8, allocatable, dimension(:,:) :: sk
allocate ( sk(nszf,nband) )
I have made some changes to the !$omp directives, but I am not sure of the syntax.
Could you run it with /O2, both with and without /Qparallel? I would be interested to find out:
- whether the results remain the same,
- the values of the summary statistics, and
- the elapsed-time statistics (using system_clock) compared with the previous version.
These could identify how much the changes I described might benefit the run time. A sketch of the timing call follows.
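A minimal sketch of elapsed-time measurement with system_clock (the subroutine call shown is hypothetical; only the clock calls matter):

integer(8) :: t0, t1, rate
call system_clock (t0, rate)
call gaussean_reduction (sk, nszf, nband, r1)   ! the code being timed
call system_clock (t1)
write (*,*) 'elapsed time (s) =', dble(t1-t0)/dble(rate)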
The next step would be to transpose sk in your program and see the changes.
John
Thanks, John, for the response.
I was trying to run the program with OpenMP, and before running it I noticed a minor issue in your attached file:
In loop 290 you used iband, but the row_f(b) index varies from 1 to nband; I mean that the array row_f(b) (b=1,nband) in the earlier parts also includes zero values, and its size is nband, right?
i = n
do b = 2, nband                ! nband = 7000
  i = i + 1
  if (row_f(b) == 0) cycle
  c = row_f(b)/row_f(1)
  j = 0
  do k = b, nband
    j = j + 1
    if (row_f(k) == 0) cycle
    sk(i,j) = sk(i,j) - c*row_f(k)
  end do
end do
The next main issue, as you noted, is that sk is an allocatable matrix. I think there is no way to avoid an allocatable array, because of its large size:
allocate (sk(nszf,nband)). When running with OpenMP and the allocatable matrix sk, I get the following error:
A deferred array shape array is not permitted in an openMP
Without the allocatable array, the build fails with: compilation aborted (code 3), exactly at the first line of the entire program.
So what else can I do?
thanks.
Alex,
Loop 290 should now use iband. In past posts I have identified three bandwidth measures:
1) nband is the maximum bandwidth of any row
2) iband in gaussean.f90 is the length of the n row to the last non-zero value
3) iband in my post of Wed, 01/30/2013 - 15:22, where I included row_i, is the count of the non-zero values that were indexed in row_i. Depending on the statistics, this could be a good change.
You should use iband as I provided it in gaussean.f90.
I provided some statistics in this code. You might want to include the following, before the start of the reduction:
! initial statistics
real*8 sum_zero
sum_zero = 0
do n = 1, nszf                 ! nszf = 139020
  do b = 1, nband
    if (sk(n,b) == 0) sum_zero = sum_zero + 1
  end do
end do
write (*,*) ' % zero before =', sum_zero/dble(nszf*nband) * 100.
Your comment that sk cannot be used with OpenMP is a surprise to me. I would have a main module where sk, nszf, nband and the other important system variables are declared. You should allocate sk prior to assembly of the global stiffness matrix, once you know nszf and nband. I have given an example of this module definition in gaussean.f90; a sketch of the idea follows.
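A minimal sketch of that kind of shared-data module (the module name is invented; the real definition is in gaussean.f90):

module system_data
  implicit none
  integer :: nszf, nband
  real*8, allocatable, dimension(:,:) :: sk
end module system_data

Both the assembly and solution routines would then start with "use system_data" and never re-allocate sk.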
Not being able to use an allocatable array in OpenMP should not be the case.
Is sk declared as real*8 or real*4?
Transposing the use of sk from sk(i,j) to sk(j,i) should help the run time.
Could you adapt gaussean.f90 and run it to get the statistics? These would show whether including row_i is worthwhile, and certainly which adaptation of loop 290 should work best. Once we understand this, I think the next step is to determine why loop 290 cannot work with !$omp.
I am tempted to make loop 290 a subroutine; however, I have found that this approach does not work with /Qparallel. I had hoped that !$omp would overcome this problem.
John
Alex,
I have reviewed the example I gave you and included extra code in it to generate a smaller dummy stiffness matrix so that I could run it (neq=130000, nband=1700). The results of these runs, especially the degree of non-zero infill, may be different from your case, but they show:
- gaussean.f90 is the run where iband (the band envelope) is used for each equation (306 seconds)
- gaussean_tran.f90 is the run where SK is transposed (106 seconds)
- gaussean_t2.f90 uses the transpose plus row_i to remove zeros from the active row (60 seconds)
The run-time improvement is heavily dependent on the % zero when reduced, which was 55% here. If your example gives a much lower figure, say < 10% (which I expect), then the savings will be smaller, but probably still worthwhile.
I also separated out the reduction of R1, as this is much faster than reducing SK. You could also add the loop for back-substitution; I'd recommend this.
You should check gaussean_tran.f90 to see how easy it is to transpose SK (use the FC utility).
All this has been run with /O2 but not /Qparallel or /Qopenmp, which brings us back to where you first began, with your question about the !$OMP directives in the code. This needs some more review.
I tried !$OMP CRITICAL for the initial setup of Row_f, but I don't think that did what I intended, which was for it to run only once and not once per thread.
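For what it's worth, !$OMP SINGLE may be the construct intended here: it executes a block on exactly one thread and ends with an implied barrier, whereas !$OMP CRITICAL runs the block on every thread, just one at a time. A hypothetical sketch, reusing the names from the earlier post:

!$omp single                    ! one thread builds the active row
iband = 1
row_i(1) = 1
row_f(1) = skt(1,n)
do b = 2, nband
  if (skt(b,n) == 0) cycle
  iband = iband + 1
  row_i(iband) = b
  row_f(iband) = skt(b,n)
end do
!$omp end single                ! implied barrier: row_f/row_i are ready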
The !$OMP DO did parallelise this code segment, which is what is required, but I could not guarantee it is working correctly.
I ran gaussean_t2.f90 with /Qopenmp, but it ran in only 6 seconds with 4 processes, which is not right; it should be at least 15 seconds. There is a two-stage process: 1) generation of row_f/row_i, then 2) loop 290. This ordering needs to be enforced in the !$OMP directives.
I think it is close to working, but it needs changes to the !$OMP directives to handle the loops correctly while running the row_f generation only once. Hopefully someone else can review the !$OMP commands to get this right.
I am attaching the 3 versions of the test program plus a run trace of each and the batch file I have been using to test with ifort Ver 12.1.
What I have been describing should work and hopefully I have not introduced other coding errors.
I hope this gives you some ideas. I'd be interested in how you proceed,
John
PS: I have now attached an updated version of option 3, which is a more complete test: it also provides load-case back-substitution and then a check calculation of force - [sk].disp, to find the maximum calculation error. This should be useful for testing the parallel operation.
The changes I have now made to the $OMP commands appear to be working, although there might be a better implementation.
!$OMP appears to reduce the run time by about 50% with 4 processes.
Thanks, John, for the attached files and your comments.
In my program, sk is allocated in another subroutine (sk is declared real*4; all of my variables are real*4) and then used in other subroutines, something like:
! Main program
real, allocatable :: sk(:,:)
call formk (sk, nszf, ...)
call solve (sk, r1, nszf)
! end of main program

subroutine formk (sk, nszf, ...)
  real, allocatable :: sk(:,:)
  .
  .
  allocate (sk(nszf,nband))
  ! The sk matrix is formed here, the boundary conditions are applied,
  ! and it is reduced to nszf x nband; then it is sent to the solve subroutine.
end subroutine formk

subroutine solve (sk, r1, nszf)
  real, allocatable :: sk(:,:)
  allocate (sk(nszf,nband))
  nband = 7000
  call gaussian_reduction (sk, nszf, nband, r1)   ! your subroutine
  .
end subroutine solve
.
subroutine gaussian_reduction (sk, nszf, nband, r1)
  real :: sk(nszf,nband)
  ! ... and then the rest of the program you provided ...
This gives an "unhandled exception at .... stack overflow" error at the reduction (- : sk, r1) line. I should note that I've set a large stack reserve size (1000000000).
Changing real :: sk(nszf,nband) in the gaussian_reduction subroutine to real, allocatable :: sk(:,:) and then running gives:
A deferred array shape array is not permitted in an openMP FIRSTPRIVATE, .... [SK]
In the first case, the program executed the statistics part and gives % zero before = 95%.
Note that I could run your gaussean.f90, probably because your dummy stiffness matrix is small.
So, in this case, what can I do?
thank you
Alex
Alex,
You should only allocate SK once. I would place SK in a module and USE the module in both FORMK and SOLVE (SOLVE is a very common name; I'd change it). You should try to understand why this matters.
It is wrong to allocate SK more than once, or to define nband=7000 after the ALLOCATE statement in SOLVE. This might be the cause of the error you reported earlier with $OMP and SK.
Look at the way I call gaussean_reduction (sk, nszf, nband) and then declare the arguments. This will overcome the "A deferred array shape array is not permitted in an openMP FIRSTPRIVATE, .... [SK]" error. A sketch of the pattern follows.
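A minimal sketch of that explicit-shape argument pattern, assuming real*8 data (the loop body is only a placeholder to show that the OpenMP clauses accept the explicit-shape dummy argument):

subroutine gaussean_reduction (sk, nszf, nband, r1)
  implicit none
  integer :: nszf, nband
  real*8  :: sk(nszf,nband)     ! explicit shape, not sk(:,:)
  real*8  :: r1(nszf)
  integer :: n
!$omp parallel do private(n) shared(sk,r1)
  do n = 1, nszf
    r1(n) = r1(n)/sk(n,1)       ! placeholder body
  end do
!$omp end parallel do
end subroutine gaussean_reduction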
If you don't know the precision required for SK, I'd recommend real*8. You haven't described the system you are modelling, but I'd expect real*8 to be required.
My testing also shows a significant gain from transposing SK. You should try to identify the changes this would require; they should not be extensive, unless you are using SK for other purposes, which would not be advisable for such a large array.
Try the latest version I attached, as I think the $OMP commands are working. It would be good to see results on your i7, which supports 8 processes. I think this version is fairly close.
John
Alex,
I have tidied up the code to include some more statistics and to confirm that it appears to be working with !$OMP.
The attached code is this latest version.
It could be a good example of a basic openMP implementation.
The compilation options I have selected are: /Tf %1.f90 /free /O2 /Qopenmp /QxHost
I have some questions about !$OMP which someone might be able to comment on:
- Is it possible to move the following directive outside the outer DO loop to improve performance?
!$OMP PARALLEL PRIVATE (c,i,j,k,b), SHARED (SK, N, Row_f, Row_i, iband)
- Are the !$OMP commands I have selected the most appropriate?
- What would be the minimal install on a PC that does not have ifort installed, so the program could run on another PC?
John
Alex,
Could you please run the latest version I attached on your i7, with and without the /Qopenmp compile option? I'd like to see the difference with 8 possible processes. You should watch Task Manager to confirm you are getting 100% CPU, rather than the 12.5% you get without /Qopenmp. The R1 solver has not been optimised and does not use !$OMP; it runs much quicker than the SK reduction, but it could be improved if it runs many times.
The !$OMP commands could be reviewed; however, the outer "n" loops must run sequentially in all cases. That is why I placed the "do n" outside the !$OMP construct. I'm not sure whether this means the threads are set up nszf times, or only once. Once would be better!
The Fortran ALLOCATE statement provides a memory address space for the SK array; once it is allocated, it is fixed for the full run.
Let me know how you go.
John
John,
Thanks for your comments and your files, and sorry for the late response; I was a bit busy.
I ran your latest attached file on my PC, a Dell XPS 8500, assuming sk(nband=3700,nszf=130000): with !$omp it took about 197 s, and without it about 202 s. I was watching Task Manager as well; with !$omp, CPU usage was 98-100%, and without it, 12-17%. The same holds for sk(nband=3700,nszf=70000): 98.5 s with !$omp and 102 s without.
It seems that using !$omp doesn't greatly affect the calculation time!? I was expecting a large difference between the times.
However, I want to try your latest attachment in my program; I'll post about it here soon.
Thanks
Alex
I just came across another question about parallel programming.
In the following code, I'm trying to transpose a matrix (the sk matrix from my previous post in this thread). The a=sk and b=skt arrays are large and have already been allocated in the main program. As I monitored Task Manager while the program ran through this part (Task Manager > Performance > Physical Memory), the free physical memory was 0. Because I'm using parallel programming, I was expecting to see around 100% CPU usage, while it was around 1-29%. Since the whole point of parallel programming is to use all the CPU capacity, is the low CPU usage here probably due to the lack of free physical memory? Please correct me if I'm wrong. If so, how can I overcome this issue?
subroutine trans (a, b, row, col)
  integer :: i, j, row, col
  real*8 :: a(row,col), b(col,row)
!$omp parallel private(i,j) shared(a,b,row,col)
!$omp do
  do i = 1, row
    do j = 1, col
      b(j,i) = a(i,j)
    end do
  end do
!$omp end do
!$omp end parallel
end subroutine trans
Thanks
Alex
Parallel optimization of transpose seems to be a frequently set homework assignment; part of the point is to create an incentive to browse prior posts on the subject.
You might start by checking that your single-thread performance is as good as the TRANSPOSE intrinsic, including checking the effect of vector nontemporal and architecture settings.
Alex,
I would not transpose SK, but generate it in the transposed form, with ALLOCATE ( sk(nband,nszf) ).
There should not be too many occurrences in your code of "SK(j,i) =" that you would convert to "SK(i,j) =".
There might also be some DO loops whose order you could change.
Using real*4 gives low precision, unless the problem is very well behaved. You should make sure you back-calculate the force, given the displacement field result:
Force = sum (K_element x Disp) (using all the element stiffness matrices is usually much quicker)
Error = Force - Applied_Force
and check the maximum error for each degree of freedom. It would be good to declare FORCE and ERROR as real*8.
John
Tim,
I am investigating some other matrix-processing problems related to the use of "row_i" in the above example for indexing the non-zero coefficients in the active row.
You made a comment about "checking the effect of vector nontemporal and architecture settings."
Are you able to provide more details on these settings and how they might be checked?
How might these settings be altered?
John
TRANSPOSE is a standard Fortran intrinsic, and has been for 20 years. Normally it can be expected to be implemented effectively for the target platform, although it may not work effectively with auto-parallel or OpenMP. You could use it to check that your explicit DO loops aren't costing you performance at one thread, and that you aren't losing performance by adding threads. A sketch of such a check follows.
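A hypothetical single-thread check of the explicit loops against the intrinsic, reusing trans() from the earlier post (the timing scaffolding is invented):

integer(8) :: t0, t1, rate
call system_clock (t0, rate)
call trans (a, b, row, col)     ! explicit-loop version
call system_clock (t1)
write (*,*) 'loop transpose (s) =', dble(t1-t0)/dble(rate)
call system_clock (t0)
b = transpose(a)                ! intrinsic version
call system_clock (t1)
write (*,*) 'intrinsic (s)      =', dble(t1-t0)/dble(rate)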
Intel Fortran supports
!dir$ vector nontemporal
so as to request that stores bypass the cache. In some situations the compiler may do this automatically, if the stored array is expected to be larger than, say, 50% of cache. In a case where the stored data aren't accessed again while they remain in cache on the same CPU, this directive can avoid reading the data prior to writing, potentially cutting memory traffic by 50%. A sketch of its placement follows.
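For illustration, the directive is placed immediately before the loop whose stores should stream past the cache; a sketch on the transpose example from earlier in the thread (untested):

subroutine trans_nt (a, b, row, col)
  integer :: i, j, row, col
  real*8 :: a(row,col), b(col,row)
  do i = 1, row
!dir$ vector nontemporal
    do j = 1, col
      b(j,i) = a(i,j)           ! stores to b bypass the cache
    end do
  end do
end subroutine trans_nt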
Even if you've seen the advertising about /arch:AVX for corei7-2 CPUs, you still might want to try varying the settings. Depending on the situation, /Qxsse4.1 or the default /Qxsse2 may work best on corei7[-1], in spite of the advertising about /Qxsse4.2 or /QxHost (implied by /fast).
