
OpenMP: slow-down for matrix-vector product

Hello,

I am an OpenMP beginner, and I am experimenting with simple examples on a quad-core CPU. When I parallelized a matrix-vector product, I got a small speed-up with 2 threads but a slow-down with 4 threads. I would be very grateful if someone could explain why the slow-down occurs.

When the code below is executed with a matrix dimension of 18000, the elapsed time is:

4.01 s for 1 thread, 3.19 s for 2 threads, and 7.75 s for 4 threads.

Here is the code:

      PROGRAM MVP

      INTEGER NRA, NCA, TID, NTHREADS, I, J, K, CHUNK,
     &        OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
      integer start, finish, rate
      REAL*8 A(:,:), B(:,:), C(:,:)
      allocatable a, b, c

c     Input the number of columns/rows of the matrix
      write(*,*) 'matrix dimension?'
      read(*,*) NRA
      NCA = NRA
      allocate(A(NRA,NCA), stat=istat)
      allocate(B(NCA,1), stat=istat)
      allocate(C(NRA,1), stat=istat)

C     Set loop iteration chunk size
      CHUNK = 10

c     Input the number of threads
      write(*,*) 'number of threads?'
      read(*,*) kp
      call omp_set_num_threads(kp)

C     Initialize matrices
      DO 30 I = 1, NRA
         DO 30 J = 1, NCA
            A(I,J) = (I-1) + (J-1)
   30 CONTINUE
      DO 40 I = 1, NCA
         B(I,1) = (I-1)
   40 CONTINUE
      DO 50 I = 1, NRA
         C(I,1) = 0
   50 CONTINUE

      call system_clock (COUNT_RATE = rate)
      call system_clock (COUNT = start)

!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(TID,I,J,K)
C     Do matrix-vector multiply sharing iterations on outer loop
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Thread', TID, 'starting matrix multiply...'
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 60 I = 1, NRA
         DO 60 K = 1, NCA
            C(I,1) = C(I,1) + A(I,K) * B(K,1)
   60 CONTINUE
!$OMP END DO
C     End of parallel region
!$OMP END PARALLEL

      call system_clock (COUNT = finish)
      seconds = float(finish - start) / float(rate)
      write(*,*) 'time elapsed:', seconds
      END

12 Replies

Black Belt

Generally, you should try to conform to the old slogan "Concurrent Outer, Vector Inner."

Most OpenMP compilers follow your code literally: they will not attempt loop-nest optimizations that differ from your specified choice of parallel loop index, even under compiler flags that would perform such transformations with parallelization turned off.

It might be interesting if you would tell us the effect of thread placement (KMP_AFFINITY or GOMP_CPU_AFFINITY, for my 2 favorite compilers) as well as the effect of switching the loops. It looks as if you are testing mainly how well threading can deal with DTLB misses, and you are nearly up to full bus-bandwidth-limited performance with one thread. As you have not blocked for cache locality, you gain performance only when you spread your job across all caches.

I believe that MKL doesn't attempt to thread ?GEMV, since the "vector inner" part of the slogan actually should come first. You should at least compare your performance with what you get with a standard BLAS SGEMV, vectorized, with and without threading. Some people like to brag about maximizing OpenMP scaling even when it is gained at the expense of total performance; for that purpose, you can sometimes leave out the vector-inner part, as it is easiest to get good scaling by making the single-thread performance as bad as possible.

You would need to find an effective way to block the operation so as to enable vectorization and preserve cache locality in each thread. Assuming you are using an Intel quad core, you might be able to take advantage of the cache sharing between cores 0,2 and 1,3, and you certainly need to block so that the 2 threads on the same cache don't fight each other.

So, in spite of the apparent simplicity of this operation, it already requires you to go beyond the "OpenMP beginner" stage.

Beginner

Thank you for a detailed answer. I now see that the problem is much more complex than I thought.

Do you maybe know where to find examples of OpenMP parallelization of linear algebra operations like this one?

I know that MKL has the matrix-matrix product parallelized, but they didn't do *GEMV. I will compare my example with SGEMV and see if it also slows down when I run it with 4 threads. Either way, I know it doesn't offer any speed-up; I'm sure of that.

Regards,
Drazen
Black Belt

I'd guess you might want to block it something like

      Ichunk = (NRA+3)/4
!$OMP PARALLEL DO
      DO J = 0, 3
         DO K = 1, NCA
            DO I = J*Ichunk + 1, (J+1)*Ichunk
               C(I,1) = C(I,1) + A(I,K) * B(K,1)
            END DO
         END DO
      END DO

and try running with KMP_AFFINITY settings. Note that I didn't take care of remainders, in case NRA is not divisible by 4. This would work somewhat better if the arrays are 16-byte aligned (and the compiler knows it), and Ichunk is divisible by 8.

You could replace the loops on I and K with SGEMV calls on the same array sections. I agree with you; SGEMV probably isn't built with such threading.

Beginner

Thank you for your help. The code you posted doesn't experience any slow-down when the number of threads is increased from 2 to 4. It doesn't speed up either, but it's much better than the original one.

Regards,
Drazen

And on this example you might want to specify the OMP schedule as static with a chunk size of 1 (not to be confused with Ichunk).

Jim


Drazen,

I copied your program and compiled it for testing on my system. This system has 2 Opteron 270 dual-core processors, for a total of 4 cores, and 2GB of RAM. Using 18000 for the array size would require 7.78GB, so I could not run the test with arrays of that size. Using 10000 requires 2.4GB of RAM, but since array C is only referenced as C(I,1), a size of 10000 seemed to fit on my system without excessive page swapping.

Runtimes for my system:

Cores   Time    % 1 core   % 2 cores
1       6.86    100%       60.64%
2       4.16    164.9%     100%
3       2.969   231.1%     140.11%
4       2.656   258.3%     156.63%

I also noticed that your loop indexing was not optimal. In Fortran, adjacent cells of the leftmost index are adjacent in memory.

Your code

DO 30 I=1, NRA
DO 30 J=1, NCA
A(I,J) = (I-1)+(J-1)
30 CONTINUE

This jumps NRA elements through memory on each store.
Changing the loop order to

DO 30 J=1, NCA
DO 30 I=1, NRA
A(I,J) = (I-1)+(J-1)
30 CONTINUE

references adjacent memory on each store and is much faster.
Additionally, the loop can now be vectorized (2 real(8) values written per iteration).

A similar thing can be done with your timing test loop. Original code:

DO 60 I=1, NRA
DO 60 K=1, NCA
C(I,1) = C(I,1) + A(I,K) * B(K,1)
60 CONTINUE

Suggested change

DO 60 K=1, NCA
DO 60 I=1, NRA
C(I,1) = C(I,1) + A(I,K) * B(K,1)
60 CONTINUE

With the changes in place, my run times are too short to produce meaningful results using the timing routines of your choice.

Jim Dempsey


Yurii,

The index priority for Fortran is backwards from C++ for multiply-indexed arrays. For C(I,J):

Fortran - the leftmost index indexes adjacent memory: C(I,J), C(I+1,J) are adjacent

C++ - the rightmost index indexes adjacent memory: C(I,J), C(I,J+1) are adjacent

The sample Fortran program from DRASKO was written as if indexed for C++, therefore the loops could not be vectorized, and the memory references tended to require more cache lines (i.e., the cache would flush before accessing a variable co-resident with a prior variable). Loop nesting order depends on the language's array-indexing precedence order.

Jim Dempsey

Beginner

Jim,

I did not have a concrete example in view.
I had the general principles in view.

Yurii


Yurii,

RE: I did not have a concrete example in view.

I took the liberty of modifying Drasko's original code. Please compile it and make your own test runs. The modified code uses a higher-precision timer and runs the timed loops indexed in both manners (a concrete example).

Note that each timed loop is run twice and the second value is used. This is done so that the cache is preconditioned in a manner that should not adversely affect (or advantage) one run over the other.

Loops 31 and 61 are the alternate forms of loops 30 and 60, with the suggestions made in my earlier post.

Reordering loop 61 (the main function to be optimized) results in an over-5x performance improvement on my system. The reorder of the loop had more impact than throwing more cores at the problem (~3x performance increase from 1 core to 4 cores). Using both more cores and the reorder gives the best results. Additional optimizations can yield a bit more performance.

Jim Dempsey

On my 4-Core system

matrix dimension?
10000
number of threads?
4
Run time loop 30 8.49819253990427
Run time loop 31 0.485419438919052
Difference 17.5069061074858
Run time loop 60 2.46630702423863
Run time loop 61 0.466605992522091
Difference 5.28563084007514
Program with modifications follows

      PROGRAM MVP

      INTEGER :: NRA, NCA, TID, NTHREADS, I, J, K, CHUNK
      INTEGER :: OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
      real(8) :: OMP_GET_WTIME, OMP_GET_WTICK
      integer :: start, finish, rate
      REAL(8), allocatable :: A(:,:), B(:,:), C(:,:)
      real(8) :: Start_WTIME, End_WTIME
      real(8) :: RT30, RT31, RT60, RT61
      integer :: cacheFlush

! Input the number of columns/rows of the matrix
      write(*,*) 'matrix dimension?'
      read(*,*) NRA
      NCA = NRA
      allocate(A(NRA,NCA), stat=istat)
      allocate(B(NCA,1), stat=istat)
      allocate(C(NRA,1), stat=istat)

! Set loop iteration chunk size
      CHUNK = 10

! Input the number of threads
      write(*,*) 'number of threads?'
      read(*,*) kp
      call omp_set_num_threads(kp)

! Initialize matrices
! Perform test twice - first iteration used to normalize cache
      do cacheFlush = 0, 1
         Start_WTIME = OMP_GET_WTIME()
         DO 30 I = 1, NRA
            DO 30 J = 1, NCA
               A(I,J) = (I-1) + (J-1)
   30    CONTINUE
         End_WTIME = OMP_GET_WTIME()
         RT30 = End_WTIME - Start_WTIME
      end do

! Perform test twice - first iteration used to normalize cache
      do cacheFlush = 0, 1
         Start_WTIME = OMP_GET_WTIME()
         DO 31 J = 1, NCA
            DO 31 I = 1, NRA
               A(I,J) = (I-1) + (J-1)
   31    CONTINUE
         End_WTIME = OMP_GET_WTIME()
         RT31 = End_WTIME - Start_WTIME
      end do

      write(*,*) 'Run time loop 30', RT30
      write(*,*) 'Run time loop 31', RT31
      write(*,*) 'Difference', RT30/RT31

      DO 40 I = 1, NCA
         B(I,1) = (I-1)
   40 CONTINUE
      DO 50 I = 1, NRA
         C(I,1) = 0
   50 CONTINUE

! Perform test twice - first iteration used to normalize cache
      do cacheFlush = 0, 1
         Start_WTIME = OMP_GET_WTIME()
! Do matrix-vector multiply sharing iterations on outer loop
!$OMP PARALLEL DO SCHEDULE(STATIC, CHUNK) SHARED(A,B,C) PRIVATE(I,K)
         DO 60 I = 1, NRA
            DO 60 K = 1, NCA
               C(I,1) = C(I,1) + A(I,K) * B(K,1)
   60    CONTINUE
!$OMP END PARALLEL DO
         End_WTIME = OMP_GET_WTIME()
         RT60 = End_WTIME - Start_WTIME
      end do

! Perform test twice - first iteration used to normalize cache
      do cacheFlush = 0, 1
         Start_WTIME = OMP_GET_WTIME()
!$OMP PARALLEL DO SCHEDULE(STATIC, CHUNK) SHARED(A,B,C) PRIVATE(I,K)
         DO 61 K = 1, NCA
            DO 61 I = 1, NRA
               C(I,1) = C(I,1) + A(I,K) * B(K,1)
   61    CONTINUE
!$OMP END PARALLEL DO
         End_WTIME = OMP_GET_WTIME()
         RT61 = End_WTIME - Start_WTIME
      end do

      write(*,*) 'Run time loop 60', RT60
      write(*,*) 'Run time loop 61', RT61
      write(*,*) 'Difference', RT60/RT61

      END

Beginner

Jim,

Probably I explained the idea badly.
I'll try to explain it once again.
Tridiagonalization of matrices consists of BLAS2 and BLAS3 operations.
The BLAS2 part is handled by the function dsymv.
Try executing it on several cores.
The reference code is in the LAPACK package.

Yurii

Beginner

My first message has been removed by someone, therefore I repeat the idea. For BLAS2, what matters most is competent programming. Expecting BLAS2 to gain performance on several cores under the shared-memory model is blatant ignorance.
Yurii
Beginner

In addition to my previous message, I shall note that in all algorithms where BLAS2 is used, Intel MKL performs the calculations on one core.
Yurii