Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

## OpenMP: slow-down for matrix-vector product

670 Views

**Beginner:**

Hello,

I am an OpenMP beginner, and I am experimenting with simple examples on a quad-core CPU. When I parallelized a matrix-vector product, I got a small speed-up with 2 threads, but a slow-down with 4 threads. I would be very grateful if someone could explain why the slow-down occurs.

When the code below is executed with a matrix dimension of 18000, the elapsed time is:

Here is the code:

```
      PROGRAM MVP

      INTEGER NRA, NCA, TID, NTHREADS, I, J, K, CHUNK
      INTEGER OMP_GET_THREAD_NUM
      integer start, finish, rate
      REAL*8 A(:,:), B(:,:), C(:,:)
      allocatable a, b, c

c     Input the number of columns/rows of the matrix
      write(*,*) 'matrix dimension?'
      read(*,*) NRA
      NCA = NRA

      allocate(A(NRA,NCA), stat=istat)
      allocate(B(NCA,1), stat=istat)
      allocate(C(NRA,1), stat=istat)

C     Set loop iteration chunk size
      CHUNK = 10

c     Input the number of threads
      write(*,*) 'number of threads?'
      read(*,*) NTHREADS
      call omp_set_num_threads(NTHREADS)

C     Initialize matrices
      DO 30 I=1, NRA
      DO 30 J=1, NCA
      A(I,J) = (I-1)+(J-1)
   30 CONTINUE

      DO 40 I=1, NCA
      B(I,1) = (I-1)
   40 CONTINUE

      DO 50 I=1, NRA
      C(I,1) = 0
   50 CONTINUE

      call system_clock (COUNT_RATE = rate)
      call system_clock (COUNT = start)

C     Do matrix-vector multiply sharing iterations on outer loop
!$OMP PARALLEL SHARED(A,B,C,CHUNK) PRIVATE(TID,I,K)
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Thread', TID, 'starting matrix multiply...'
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 60 I=1, NRA
      DO 60 K=1, NCA
      C(I,1) = C(I,1) + A(I,K) * B(K,1)
   60 CONTINUE
!$OMP END DO
C     End of parallel region
!$OMP END PARALLEL

      call system_clock (COUNT = finish)
      seconds = float(finish - start) / float(rate)
      write(*,*) 'time elapsed:', seconds
      END
```

12 Replies

---

**Black Belt:**

Generally, you should try to conform to the old slogan "Concurrent Outer, Vector Inner"

Most OpenMP compilers will follow your code literally and will not attempt loop-nest optimizations that differ from your choice of loop index for parallelization, even when you choose compiler flags that would permit such transformations with parallelization turned off.

It might be interesting if you would tell us the effect of thread placement (KMP_AFFINITY or GOMP_CPU_AFFINITY, for my 2 favorite compilers) as well as the effect of switching the loops. It looks as if you are mainly testing how well threading can deal with DTLB misses, and you are nearly at full bus-bandwidth-limited performance with one thread. As you have not blocked for cache locality, you gain performance only when you spread your job across all caches.

I believe that MKL doesn't attempt to thread ?GEMV, since the "vector inner" part of the slogan actually should come first. You should at least compare your performance with what you get from a standard BLAS SGEMV, vectorized, with and without threading. Some people like to brag about maximizing OpenMP scaling even when it is gained at the expense of total performance; for that purpose you can sometimes leave out the vector-inner part, as it is easiest to get good scaling by making the single-thread performance as bad as possible.

You would need to find an effective way to block the operation so as to enable vectorization and preserve cache locality in each thread. Assuming you are using an Intel quad core, you might be able to take advantage of the cache sharing between cores 0,2 and 1,3, and you certainly need to block so that the 2 threads on the same cache don't fight each other.

So, in spite of the apparent simplicity of this operation, it already requires you to go beyond the "OpenMP beginner" stage.

---

**Beginner:**
Thank you for a detailed answer. I now see that the problem is much more complex than I thought.

Do you maybe know where to find examples of OpenMP parallelization of linear algebra operations like this one?

I know that MKL has the matrix-matrix product parallelized, but they didn't do *GEMV. I will compare my example with SGEMV and see if it also slows down with 4 threads. In any case, I know it doesn't offer any speed-up; I'm sure of that.

Regards,
Drazen

---

**Black Belt:**

I'd guess you might want to block it something like:

```
Ichunk = (NRA+3)/4
!$OMP PARALLEL DO
DO J=0,3
   DO K=1, NCA
      DO I=J*Ichunk+1, (J+1)*Ichunk
         C(I,1) = C(I,1) + A(I,K) * B(K,1)
      END DO
   END DO
END DO
```

and try running with KMP_AFFINITY settings. Note that I didn't take care of remainders, in case NRA is not divisible by 4. This would work somewhat better if the arrays are 16-byte aligned (and the compiler knows it), and Ichunk is divisible by 8.

You could replace the loops on I and K with SGEMV calls on the same array sections. I agree with you; SGEMV probably isn't built with such threading.

---

**Beginner:**
Thank you for your help. The code you posted doesn't experience any slow-down when the number of threads is increased from 2 to 4. It doesn't speed up either, but it's much better than the original one.

Regards,
Drazen

---

**Black Belt:**

And on this example you might want to specify the OMP schedule as static with a chunk size of 1 (not to be confused with Ichunk).

Jim

---

**Black Belt:**

Drazen,

I copied your program and compiled it for testing on my system. This system has 2 Opteron 270 dual-core processors, for a total of 4 cores, and 2GB of RAM. Using 18000 for the array size would require 7.78GB, so I could not run the test at that size. Using 10000 requires 2.4GB of RAM, but since array C is only referenced as C(I,1), a size of 10000 fit on my system without excessive page swapping.

Runtimes for my system:

| Cores | Time (s) | % of 1 core | % of 2 cores |
|-------|----------|-------------|--------------|
| 1     | 6.86     | 100%        | 60.64%       |
| 2     | 4.16     | 164.9%      | 100%         |
| 3     | 2.969    | 231.1%      | 140.11%      |
| 4     | 2.656    | 258.3%      | 156.63%      |

I also noticed that your loop indexing was not optimal. In Fortran, adjacent cells of the left-most index are adjacent in memory:

```
DO 30 I=1, NRA
   DO 30 J=1, NCA
      A(I,J) = (I-1)+(J-1)
30 CONTINUE
```

jumps NRA elements in memory on each store. Changing the loop order to

```
DO 30 J=1, NCA
   DO 30 I=1, NRA
      A(I,J) = (I-1)+(J-1)
30 CONTINUE
```

references adjacent memory on each store and is much faster.
Additionally, the loop can now be vectorized (two REAL(8) values written per iteration).

A similar change applies to your timing-test loop. Original code:

```
DO 60 I=1, NRA
   DO 60 K=1, NCA
      C(I,1) = C(I,1) + A(I,K) * B(K,1)
60 CONTINUE
```

Suggested change

```
DO 60 K=1, NCA
   DO 60 I=1, NRA
      C(I,1) = C(I,1) + A(I,K) * B(K,1)
60 CONTINUE
```

With the changes in place, my run times are too short to produce meaningful results using the timing routines of your choice.

Jim Dempsey

---

**Black Belt:**

Yurii,

The index priority for Fortran is the reverse of C++ for multiply-indexed arrays such as C(I,J).

The sample Fortran program from DRASKO was written as if indexed for C++; therefore the loops could not be vectorized, and the memory references tended to require more cache lines (i.e., the cache would flush before accessing a variable co-resident with a prior variable). Loop nesting order depends on the language's array-indexing precedence order.

Jim Dempsey

---

**Beginner:**

Jim,

I did not have a concrete example in mind.
I was speaking of general principles.

Yurii

---

**Black Belt:**

Yurii,

RE: I did not have a concrete example in mind.

I took the liberty of modifying Drasko's original code. Please compile it and make your own test runs. The modified code uses a higher-precision timer and runs the timed loops indexed in both manners (a concrete example).

Note, each timed loop is run twice and the second value is used. This is done so that the cache is preconditioned in a manner that does not advantage one run over the other.

Loops 31 and 61 are the alternate forms of loops 30 and 60, with the changes suggested in my earlier post.

Reordering loop 61 (the major function to be optimized) results in over a 5x performance improvement on my system. Reordering the loop had more impact than throwing more cores at the problem (~3x improvement from 1 core to 4 cores). Using both more cores and reordering gives the best results. Additional optimizations can be made that yield a bit more performance.

Jim Dempsey

On my 4-core system:

```
matrix dimension? 10000
number of threads? 4
Run time loop 30    8.49819253990427
Run time loop 31    0.485419438919052
Difference          17.5069061074858
Run time loop 60    2.46630702423863
Run time loop 61    0.466605992522091
Difference          5.28563084007514
```

Program with modifications follows:

```
PROGRAM MVP
INTEGER :: NRA, NCA, TID, NTHREADS, I, J, K, CHUNK
INTEGER :: OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
real(8) :: OMP_GET_WTIME, OMP_GET_WTICK
integer :: start, finish, rate
REAL(8), allocatable :: A(:,:), B(:,:), C(:,:)
real(8) :: Start_WTIME, End_WTIME
real(8) :: RT30, RT31, RT60, RT61
integer :: cacheFlush

! Input the number of columns/rows of the matrix
write(*,*) 'matrix dimension?'
read(*,*) NRA
NCA = NRA
allocate(A(NRA,NCA), stat=istat)
allocate(B(NCA,1), stat=istat)
allocate(C(NRA,1), stat=istat)

! Set loop iteration chunk size
CHUNK = 10

! Input the number of threads
write(*,*) 'number of threads?'
read(*,*) kp
call omp_set_num_threads(kp)

! Initialize matrices
! Perform test twice
! First iteration used to normalize cache
do cacheFlush=0,1
   Start_WTIME = OMP_GET_WTIME()
   DO 30 I=1, NRA
   DO 30 J=1, NCA
   A(I,J) = (I-1)+(J-1)
30 CONTINUE
   End_WTIME = OMP_GET_WTIME()
   RT30 = End_WTIME - Start_WTIME
end do

! Perform test twice
! First iteration used to normalize cache
do cacheFlush=0,1
   Start_WTIME = OMP_GET_WTIME()
   DO 31 J=1, NCA
   DO 31 I=1, NRA
   A(I,J) = (I-1)+(J-1)
31 CONTINUE
   End_WTIME = OMP_GET_WTIME()
   RT31 = End_WTIME - Start_WTIME
end do

write(*,*) 'Run time loop 30', RT30
write(*,*) 'Run time loop 31', RT31
write(*,*) 'Difference', RT30/RT31

DO 40 I=1, NCA
B(I,1) = (I-1)
40 CONTINUE

DO 50 I=1, NRA
C(I,1) = 0
50 CONTINUE

! Perform test twice
! First iteration used to normalize cache
do cacheFlush=0,1
   Start_WTIME = OMP_GET_WTIME()
! Do matrix-vector multiply sharing iterations on outer loop
!$OMP PARALLEL DO SCHEDULE(STATIC, CHUNK) SHARED(A,B,C) PRIVATE(I,K)
   DO 60 I=1, NRA
   DO 60 K=1, NCA
   C(I,1) = C(I,1) + A(I,K) * B(K,1)
60 CONTINUE
!$OMP END PARALLEL DO
   End_WTIME = OMP_GET_WTIME()
   RT60 = End_WTIME - Start_WTIME
end do

! Perform test twice
! First iteration used to normalize cache
do cacheFlush=0,1
   Start_WTIME = OMP_GET_WTIME()
!$OMP PARALLEL DO SCHEDULE(STATIC, CHUNK) SHARED(A,B,C) PRIVATE(I,K)
   DO 61 K=1, NCA
   DO 61 I=1, NRA
   C(I,1) = C(I,1) + A(I,K) * B(K,1)
61 CONTINUE
!$OMP END PARALLEL DO
   End_WTIME = OMP_GET_WTIME()
   RT61 = End_WTIME - Start_WTIME
end do

write(*,*) 'Run time loop 60', RT60
write(*,*) 'Run time loop 61', RT61
write(*,*) 'Difference', RT60/RT61
END
```

---

**Beginner:**

Jim,

Probably I have badly explained the ideas.
I'll try to explain once again.
Tridiagonalization of matrices consists of BLAS2 and BLAS3 operations.
The BLAS2 part is handled by the function dsymv.
Try executing it on several cores.
The original code is in the LAPACK package.

Yurii

---

**Beginner:**
My first message has been removed by someone, so I will repeat the idea. For BLAS2, the most important thing is competent programming. Expecting BLAS2 to scale across several cores on a shared-memory model is a blatant misconception.
Yurii

---

**Beginner:**
In addition to my previous message, I note that in all algorithms that use BLAS2, Intel MKL performs the calculations on one core.
Yurii 