Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Parallel programming: Parallelized do loops

aafshani
Beginner
2,699 Views

Hi everyone;

I have a problem with parallelizing a Fortran program that I want to share here. My program has three big nested do loops which were taking too much time to run, so I decided to parallelize the outer do loop (n=1,nszf) and then run it on some supercomputers at my university. My current Dell XPS 8500 PC (Windows 7 Ultimate) has 8 threads. Here is that part of my program:

      !$omp parallel private(i,j,k,n,l,c) shared(nszf,nband) reduction(- : sk,r1)
      !$omp do
      do 300 n = 1,nszf             ! nszf = 139020
      i = n
      do 290 l = 2,nband            ! nband = 7000
      i = i+1
      if (sk(n,l)) 240,290,240
240   c = sk(n,l)/sk(n,1)
      j = 0
      do 270 k = l,nband
      j = j+1
      if (sk(n,k)) 260,270,260
260   sk(i,j) = sk(i,j) - c*sk(n,k)
270   continue
280   sk(n,l) = c
      r1(i) = r1(i) - c*r1(n)
290   continue
300   r1(n) = r1(n)/sk(n,1)
      !$omp end do
      !$omp end parallel

Without the OMP directives the program runs well, but with them it gives some errors. I think the problem is in the private, shared, and reduction clauses. sk is a stiffness matrix and r1 is the force array. It should be noted that this is the well-known Gaussian elimination method for solving the unknowns of a matrix system. Is it correct to use reduction for a matrix and an array this way? Please let me know where the problem is.

Thanks  

John_Campbell
New Contributor II
394 Views

Alex,

I have not been successful with the application of !$OMP commands to this calculation, but have had further success with improving the sequential procedure.

The latest version includes:

  • Improved timers using QueryPerformanceCounter and rdtsc (suitable for a single thread)
  • Modified the length of each row of SK to be a multiple of 256 bytes, to suit AVX
  • Replaced the inner loop with an array section calculation, which should still suit OMP
  • Duplicated the middle loop; I was trying to force 2 threads
  • Other tidy-ups and improved comments
  • Reduced nszf to 13,000 to speed up tests

The key improvements for performance have been transposing SK (allowing row-section calculations) and setting the row length to a multiple of 256 bytes. (There may still be potential for improving the run time by improving the alignment of row_f in the two reduction DO loops; perhaps two versions of row_f.)
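To make the 256-byte idea concrete, here is a minimal sketch of the padded, transposed allocation, assuming real(8) storage (nband and nszf are from the thread; nband_pad and this rounding helper are mine, not the attached code):

      integer, parameter :: dp = kind(1.0d0)
      integer :: nband_pad
      real(dp), allocatable :: sk(:,:)

      nband_pad = ((nband + 31)/32)*32   ! round up to a multiple of 32 real(8) = 256 bytes
      allocate ( sk(nband_pad, nszf) )   ! transposed storage: SK(band, equation)
      sk = 0.0_dp                        ! zero infill beyond each row's true length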

Duplicating the middle loop was an idea to improve the multi-threaded option, as I had not been able to get better than a 10% improvement with OMP; I was hoping for 50% with this approach. I could not get it to work with OMP. However, it appeared to improve the run time of the single-thread version, probably by enforcing memory availability for the second loop.

For ifort compiler options, I tried /QxAVX, /Qxsse4.1, and /Qxsse4.2, but /QxHost appears to work well.

This latest attached version appears to run well as a single thread.

There remains the problem that I have not succeeded in providing you with an appropriate OMP version. To fix this (a sketch follows the list):

  • It would be good to move "do n = 1,nszf" inside !$OMP PARALLEL, but ensure that N is processed sequentially;
  • the determine-IBAND stage should be single-threaded, as iband, row_f and row_i are shared; while
  • the SK reduction stage is ideally suited to multi-threading.
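An untested sketch of that structure, consistent with the bullets above (loop bodies elided; names from the thread):

      !$omp parallel private (n, l, c)
      do n = 1,nszf                  ! every thread runs the n loop in lockstep
      !$omp single
         ! determine iband and copy row n of SK into row_f / row_i
         ! (one thread only, since iband, row_f and row_i are shared)
      !$omp end single               ! implicit barrier releases the team
      !$omp do
         do l = 2,iband
            ! reduce the affected rows below n using c and row_f
            ! (this is the work worth sharing between threads)
         end do
      !$omp end do                   ! implicit barrier, so N advances sequentially
      end do
      !$omp end parallel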

I hope this helps you understand the Gaussian solution process.

John

PS: unfortunately, my change to nband to be a multiple of 32 changed the resulting zero infill for the changed SK matrix, so the 256-byte idea has not worked as well as initially thought. Your results will depend on the infill in your real SK matrix.

John_Campbell
New Contributor II
394 Views

Alex, 

I have progressed my understanding of the options available for improving run time for the equation solution.

I have tested:
a)  The benefit of OpenMP (none)
b)  The use of /QxAVX on i5 (73% run time vs Xeon)
c)  Gaussian elimination vs skyline solver (skyline solver is 66% vs Gaussian elimination)
d)  Transposing SK to suit Gaussian elimination (I adopted this as a base case! 82% vs original)
e)  5 different versions of the inner loop (no significant change)

I have performed the tests with the Intel ifort Ver 11.1 and ifort Ver 2011 compilers, on Xeon and i5 processors.

OpenMP: I have attempted to use OpenMP directives to improve performance, but have not been able to achieve any significant gains. ifort Ver 11.1 failed, with run times 3 to 5 times the serial version; ifort Ver 2011 produced improvements of at best < 10%, and typically 0%. I have asked others in the forum, but was not able to identify a way of improving this performance. Tests were on a 4-core Xeon or a 4-core i5. Jim's conclusion was that the process is hitting a memory bandwidth limit of the available hardware. I don't know how to address this problem.

AVX: Making use of AVX instructions on the i5 processor appears to reduce the run time to about 73% of the Xeon performance. The use of these vector instructions is probably a sensible way to improve performance.

Skyline solver vs Gaussian elimination: A skyline solution improves performance significantly, reducing the run time to about 66% of the Gaussian solver. The reason is that there are fewer changes to the coefficients of the matrix. For the inner loop, Gaussian has SK(j,i) = SK(j,i) - const*SK(j+ii,n), while skyline's inner loop is a dot_product, so each coefficient of SK is updated once versus nband times for Gaussian.
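In code terms the contrast might look like this (a sketch with illustrative names, not the attached source):

      ! Gaussian elimination: each coefficient of SK is rewritten many times,
      ! once per eliminated equation whose band reaches it
      do k = 1,m
         sk(j+k,i) = sk(j+k,i) - c*row_f(k)
      end do

      ! Skyline: each coefficient is written once, as a dot product over the
      ! overlapping segments of the two active columns
      sk(j,i) = sk(j,i) - dot_product (col_n(1:m), col_i(1:m))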

Transposing SK: The Gaussian solver works best if SK is stored as SK(row,column), while skyline storage is a list of active columns. Both are better than storing SK(equations,band).

Different inner loops: I tried different inner loops for both the Gaussian elimination and skyline solvers. The optimisation in ifort resulted in no significant difference between any of the versions. My preference is for case 4.
For the Gaussian solver the alternative inner loops were:

             case (1)    ! standard DO loop
                   do ik = ib,iband ; k = row_i(ik) ; sk(k-ii,i) = sk(k-ii,i) - c*row_f(k) ; end do

             case (2)    ! DO loop indexed from SK(1,i)
                   do k = 1,ienvl-ii ; sk(k,i) = sk(k,i) - c*row_f(k+ii) ; end do

             case (3)    ! array sections
                   sk(1:k,i) = sk(1:k,i) - c*row_f(b:ienvl)

             case (4)    ! Vec_Sub using array syntax
                   call Vec_Sub (sk(1,i), row_f(b), k, c)

             case (5,6)  ! Vec_Sub using DO syntax
                   call Vec_Sub_do (sk(1,i), row_f(b), k, c)
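
Vec_Sub itself is not shown here; a minimal version consistent with the call sites above might be (the real routine may differ):

             subroutine Vec_Sub (a, b, n, c)
             !  sketch only: subtract c times vector b from vector a
                integer, intent(in)    :: n
                real(8), intent(inout) :: a(n)
                real(8), intent(in)    :: b(n)
                real(8), intent(in)    :: c
                a = a - c*b
             end subroutine Vec_Sub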

The combination of transpose, AVX and Skyline solver reduces the run time from 330 seconds to 127 seconds.

SK(eqn,band) > SK(band,eqn) : 330 seconds to 270 seconds
SK(band,eqn) > AVX instructions : 270 seconds to 192 seconds
SK_AVX > Profile_AVX : 192 seconds to 127 seconds
SK(band,eqn) > Profile : 270 seconds to 173 seconds

This is for 130,000 equations with a maximum bandwidth of 1,700 and an average bandwidth of 1,686, which is a very evenly banded matrix.

I would look forward to being proved wrong on my estimation of OpenMP performance.

I hope this helps.

John

ps: I cannot reproduce the SK(eqn,band) case at 330 seconds; the best I am now getting is 2,600 seconds!! Using the matrix real(8) SK(130000,1700) is very slow. I think my earlier test results were for a SK(13000,1700) size, which makes the non-transposed case look even worse for large problems.

TimP
Honored Contributor III
394 Views

To test OpenMP for a skyline solver, you would likely set schedule(runtime) so as to use the environment variable OMP_SCHEDULE, and try options such as dynamic,2 or guided. You would also try the KMP environment variables to set 1 thread per core and to localize adjacent threads on a single CPU if you have multiple CPUs. You're mighty sketchy on details.
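A sketch of what that looks like in practice (the loop body is illustrative; the environment settings are the standard Intel OpenMP controls):

      !$omp do schedule(runtime)     ! schedule chosen at run time, not compile time
      do j = 1,neq
         ! ... reduce column j ...
      end do
      !$omp end do

      ! then, before each run (Windows command line), experiment with e.g.
      !   set OMP_SCHEDULE=dynamic,2
      !   set KMP_AFFINITY=granularity=fine,compact,1,0   ! one thread per core,
      !                                                   ! adjacent threads together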

If your problem is big enough, on a multiple-CPU platform it may be worthwhile to make a first pass to estimate the operation counts, then divide the work among the threads, with each thread working on a contiguous group of columns.
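A minimal sketch of that pre-pass, assuming the per-column cost can be estimated from the skyline heights (all names here are illustrative):

      subroutine partition_columns (neq, nthreads, height, first, last)
      !  sketch: cut the columns into nthreads contiguous blocks of roughly
      !  equal estimated cost; reducing column j costs about height(j)**2 ops
         integer, intent(in)  :: neq, nthreads
         integer, intent(in)  :: height(neq)
         integer, intent(out) :: first(nthreads), last(nthreads)
         integer              :: j, t
         integer(8)           :: total, goal, acc

         total = 0
         do j = 1,neq
            total = total + int(height(j),8)**2
         end do
         goal = total/nthreads

         acc = 0 ; t = 1 ; first(1) = 1
         do j = 1,neq
            acc = acc + int(height(j),8)**2
            if (acc >= goal .and. t < nthreads) then
               last(t) = j ; t = t + 1 ; first(t) = j + 1 ; acc = 0
            end if
         end do
         last(nthreads) = neq        ! thread t then works columns first(t)..last(t)
      end subroutine partition_columns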

John_Campbell
New Contributor II
394 Views

Tim,

I have tried not to be sketchy about the details of the coding I am testing. I am hoping that the examples I have attached could help other ifort users like myself, who are attempting to use OpenMP with ifort, by providing some basic practical examples.

I am not yet familiar with the scheduling approach you have identified. I was hoping to understand a simpler implementation of OpenMP, similar to the examples I have found in the ifort documentation. I expect that the options you have highlighted (which I have yet to learn about) are a further level of sophistication.

I have provided two examples of my !$OMP attempts, on 4-Feb and 12-Feb, which do not achieve any improvement when compiled with /O2 /Qopenmp /QxHost. Is it possible to achieve improved parallel performance using the !$OMP directives?
I am using Intel Composer XE 2011 (version 12.1.5.344, 6-Jun-12). I also tried Ver 11, but its performance was much worse for the coding I attempted.

I have assumed that the Gaussian reduction is better suited to OpenMP, as the SK matrix is not referenced in the inner two DO loops, only redefined; whereas for a skyline solver the modified SK matrix is updated in the middle loop and referenced in the inner loop, limiting OpenMP to the inner dot_product loop. My limited understanding of OpenMP tells me that Gaussian elimination should be both better suited to OpenMP and easier to implement.
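For what it is worth, if OpenMP is confined to the skyline inner loop, all that can be shared is the dot product itself, which is a small amount of work per parallel region (sketch, illustrative names):

      !  the only parallelism left in the skyline middle loop: a reduction
      !  over the dot product of the two overlapping column segments
      s = 0.0d0
      !$omp parallel do reduction (+ : s)
      do k = 1,m
         s = s + col_n(k)*col_i(k)
      end do
      !$omp end parallel do
      sk(j,i) = sk(j,i) - s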

I would welcome your recommendations on alternative directives in either code example I have attached, so that we might see how to proceed.

John
