Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

## OpenMP: slow-down for matrix-vector product

670 Views

**Beginner:**

Hello,

I am an OpenMP beginner, and I am experimenting with simple examples on a quad-core CPU. When I parallelized a matrix-vector product, I got a small speed-up with 2 threads, but a slow-down with 4 threads. I would be very grateful if someone could explain why the slow-down occurs.

When the code below is executed with a matrix dimension of 18000, the elapsed time is:

Here is the code:

```
      PROGRAM MVP

      INTEGER NRA, NCA, TID, NTHREADS, I, J, K, CHUNK
      INTEGER OMP_GET_THREAD_NUM
      integer start, finish, rate
      REAL*8 A(:,:), B(:,:), C(:,:)
      allocatable a, b, c

c     Input the number of columns/rows of the matrix
      write(*,*) 'matrix dimension?'
      read(*,*) NRA
      NCA = NRA

      allocate(A(NRA,NCA), stat=istat)
      allocate(B(NCA,1), stat=istat)
      allocate(C(NRA,1), stat=istat)

C     Set loop iteration chunk size
      CHUNK = 10

c     Input the number of threads
      write(*,*) 'number of threads?'
      read(*,*) NTHREADS
      call omp_set_num_threads(NTHREADS)

C     Initialize matrices
      DO 30 I=1, NRA
      DO 30 J=1, NCA
      A(I,J) = (I-1)+(J-1)
   30 CONTINUE

      DO 40 I=1, NCA
      B(I,1) = (I-1)
   40 CONTINUE

      DO 50 I=1, NRA
      C(I,1) = 0
   50 CONTINUE

      call system_clock (COUNT_RATE = rate)
      call system_clock (COUNT = start)

C     Do matrix-vector multiply sharing iterations on outer loop
!$OMP PARALLEL SHARED(A,B,C,CHUNK) PRIVATE(TID,I,K)
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Thread', TID, 'starting matrix multiply...'
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 60 I=1, NRA
      DO 60 K=1, NCA
      C(I,1) = C(I,1) + A(I,K) * B(K,1)
   60 CONTINUE
!$OMP END DO
C     End of parallel region
!$OMP END PARALLEL

      call system_clock (COUNT = finish)
      seconds = float(finish - start) / float(rate)
      write(*,*) 'time elapsed:', seconds
      END
```

12 Replies

---

**Black Belt:**

Generally, you should try to conform to the old slogan "Concurrent Outer, Vector Inner"

Most OpenMP compilers will follow your code literally and will not attempt loop-nest optimizations that differ from your choice of loop index for parallelization, even when you choose compiler flags that would permit such transformations with parallelization turned off.

It might be interesting if you would tell us the effect of thread placement (KMP_AFFINITY or GOMP_CPU_AFFINITY, for my 2 favorite compilers) as well as the effect of switching the loops. It looks as if you are mainly testing how well threading can deal with DTLB misses, and you are nearly at full bus-bandwidth-limited performance with one thread. As you have not blocked for cache locality, you gain performance only when you spread your job across all caches.

I believe that MKL doesn't attempt to thread ?GEMV, since the "vector inner" part of the slogan actually should come first. You should at least compare your performance with what you get from a standard BLAS SGEMV, vectorized, with and without threading. Some people like to brag about maximizing OpenMP scaling even when it is gained at the expense of total performance; for that purpose you can sometimes leave out the vector-inner part, as it is easiest to get good scaling by making the single-thread performance as bad as possible.

You would need to find an effective way to block the operation so as to enable vectorization and preserve cache locality in each thread. Assuming you are using an Intel quad core, you might be able to take advantage of the cache sharing between cores 0,2 and 1,3, and you certainly need to block so that the 2 threads on the same cache don't fight each other.

So, in spite of the apparent simplicity of this operation, it already requires you to go beyond the "OpenMP beginner" stage.

---

**Beginner:**
Thank you for a detailed answer. I now see that the problem is much more complex than I thought.

Do you maybe know where to find examples of OpenMP parallelization of linear algebra operations like this one?

I know that MKL has the matrix-matrix product parallelized, but they didn't do *GEMV. I will compare my example with SGEMV and see if it also slows down with 4 threads. In any case, I know it doesn't offer any speed-up; I'm sure of that.

Regards,
Drazen

---

**Black Belt:**

I'd guess you might want to block it something like:

```
Ichunk = (NRA+3)/4
!$OMP PARALLEL DO
DO J=0,3
   DO K=1, NCA
      DO I=J*Ichunk+1, (J+1)*Ichunk
         C(I,1) = C(I,1) + A(I,K) * B(K,1)
      END DO
   END DO
END DO
```

and try running with KMP_AFFINITY settings. Note that I didn't take care of remainders, in case NRA is not divisible by 4. This would work somewhat better if the arrays are 16-byte aligned (and the compiler knows it), and Ichunk is divisible by 8.

You could replace the loops on I and K with SGEMV calls on the same array sections. I agree with you; SGEMV probably isn't built with such threading.

---

**Beginner:**
Thank you for your help. The code you posted doesn't experience any slow-down when the number of threads is increased from 2 to 4. It doesn't speed up either, but it's much better than the original one.

Regards,
Drazen

---

**Black Belt:**

And on this example you might want to specify the OMP schedule as static with a chunk size of 1 (not to be confused with Ichunk).

Jim

---

**Black Belt:**

Drazen,

I copied your program and compiled it for testing on my system. This system has 2 Opteron 270 dual-core processors, for a total of 4 cores, and 2GB of RAM. Using 18000 for the array size would require 7.78GB, so I could not run the test at that size. Using 10000 requires 2.4GB of RAM, but since array C is only referenced as C(I,1), a size of 10000 fit on my system without excessive page swapping.

Runtimes for my system:

| Cores | Time (s) | % of 1 core | % of 2 cores |
|-------|----------|-------------|--------------|
| 1     | 6.86     | 100%        | 60.64%       |
| 2     | 4.16     | 164.9%      | 100%         |
| 3     | 2.969    | 231.1%      | 140.11%      |
| 4     | 2.656    | 258.3%      | 156.63%      |

I also noticed that your loop indexing was not optimal. In Fortran, adjacent cells of the left-most index are adjacent in memory:

```
DO 30 I=1, NRA
   DO 30 J=1, NCA
      A(I,J) = (I-1)+(J-1)
30 CONTINUE
```

jumps NRA elements in memory on each store. Changing the loop order to

```
DO 30 J=1, NCA
   DO 30 I=1, NRA
      A(I,J) = (I-1)+(J-1)
30 CONTINUE
```

references adjacent memory on each store and is much faster.
Additionally, the loop can now be vectorized (two REAL(8) values written per iteration).

A similar change applies to your timing-test loop. Original code:

```
DO 60 I=1, NRA
   DO 60 K=1, NCA
      C(I,1) = C(I,1) + A(I,K) * B(K,1)
60 CONTINUE
```

Suggested change

```
DO 60 K=1, NCA
   DO 60 I=1, NRA
      C(I,1) = C(I,1) + A(I,K) * B(K,1)
60 CONTINUE
```

With the changes in place, my run times are too short to produce meaningful results using the timing routines of your choice.

Jim Dempsey

---

**Black Belt:**

Yurii,

The index priority for Fortran is the reverse of C++ for multiply-indexed arrays such as C(I,J).

The sample Fortran program from DRASKO was written as if indexed for C++; therefore the loops could not be vectorized, and the memory references tended to require more cache lines (i.e., the cache would flush before accessing a variable co-resident with a prior variable). Loop nesting order depends on the language's array-indexing precedence order.

Jim Dempsey

---

**Beginner:**

Jim,

I did not have a concrete example in mind.
I was speaking of general principles.

Yurii

---

**Black Belt:**

Yurii,

RE: I did not have a concrete example in mind.

I took the liberty of modifying Drasko's original code. Please compile it and make your own test runs. The modified code uses a higher-precision timer and runs the timed loops indexed in both manners (a concrete example).

Note, each timed loop is run twice and the second value is used. This is done so that the cache is preconditioned in a manner that does not advantage one run over the other.

Loops 31 and 61 are the alternate forms of loops 30 and 60, with the changes suggested in my earlier post.

Reordering loop 61 (the major function to be optimized) results in over a 5x performance improvement on my system. Reordering the loop had more impact than throwing more cores at the problem (~3x improvement from 1 core to 4 cores). Using both more cores and reordering gives the best results. Additional optimizations can be made that yield a bit more performance.

Jim Dempsey

On my 4-core system:

```
matrix dimension? 10000
number of threads? 4
Run time loop 30    8.49819253990427
Run time loop 31    0.485419438919052
Difference          17.5069061074858
Run time loop 60    2.46630702423863
Run time loop 61    0.466605992522091
Difference          5.28563084007514
```

Program with modifications follows:

```
PROGRAM MVP
INTEGER :: NRA, NCA, TID, NTHREADS, I, J, K, CHUNK
INTEGER :: OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
real(8) :: OMP_GET_WTIME, OMP_GET_WTICK
integer :: start, finish, rate
REAL(8), allocatable :: A(:,:), B(:,:), C(:,:)
real(8) :: Start_WTIME, End_WTIME
real(8) :: RT30, RT31, RT60, RT61
integer :: cacheFlush

! Input the number of columns/rows of the matrix
write(*,*) 'matrix dimension?'
read(*,*) NRA
NCA = NRA
allocate(A(NRA,NCA), stat=istat)
allocate(B(NCA,1), stat=istat)
allocate(C(NRA,1), stat=istat)

! Set loop iteration chunk size
CHUNK = 10

! Input the number of threads
write(*,*) 'number of threads?'
read(*,*) kp
call omp_set_num_threads(kp)

! Initialize matrices
! Perform test twice
! First iteration used to normalize cache
do cacheFlush=0,1
   Start_WTIME = OMP_GET_WTIME()
   DO 30 I=1, NRA
   DO 30 J=1, NCA
   A(I,J) = (I-1)+(J-1)
30 CONTINUE
   End_WTIME = OMP_GET_WTIME()
   RT30 = End_WTIME - Start_WTIME
end do

! Perform test twice
! First iteration used to normalize cache
do cacheFlush=0,1
   Start_WTIME = OMP_GET_WTIME()
   DO 31 J=1, NCA
   DO 31 I=1, NRA
   A(I,J) = (I-1)+(J-1)
31 CONTINUE
   End_WTIME = OMP_GET_WTIME()
   RT31 = End_WTIME - Start_WTIME
end do

write(*,*) 'Run time loop 30', RT30
write(*,*) 'Run time loop 31', RT31
write(*,*) 'Difference', RT30/RT31

DO 40 I=1, NCA
B(I,1) = (I-1)
40 CONTINUE

DO 50 I=1, NRA
C(I,1) = 0
50 CONTINUE

! Perform test twice
! First iteration used to normalize cache
do cacheFlush=0,1
   Start_WTIME = OMP_GET_WTIME()
! Do matrix-vector multiply sharing iterations on outer loop
!$OMP PARALLEL DO SCHEDULE(STATIC, CHUNK) SHARED(A,B,C) PRIVATE(I,K)
   DO 60 I=1, NRA
   DO 60 K=1, NCA
   C(I,1) = C(I,1) + A(I,K) * B(K,1)
60 CONTINUE
!$OMP END PARALLEL DO
   End_WTIME = OMP_GET_WTIME()
   RT60 = End_WTIME - Start_WTIME
end do

! Perform test twice
! First iteration used to normalize cache
do cacheFlush=0,1
   Start_WTIME = OMP_GET_WTIME()
!$OMP PARALLEL DO SCHEDULE(STATIC, CHUNK) SHARED(A,B,C) PRIVATE(I,K)
   DO 61 K=1, NCA
   DO 61 I=1, NRA
   C(I,1) = C(I,1) + A(I,K) * B(K,1)
61 CONTINUE
!$OMP END PARALLEL DO
   End_WTIME = OMP_GET_WTIME()
   RT61 = End_WTIME - Start_WTIME
end do

write(*,*) 'Run time loop 60', RT60
write(*,*) 'Run time loop 61', RT61
write(*,*) 'Difference', RT60/RT61
END
```

---

**Beginner:**

Jim,

Probably I have badly explained the ideas.
I'll try to explain once again.
Tridiagonalization of matrices consists of BLAS2 and BLAS3 operations.
The BLAS2 part is handled by the function dsymv.
Try executing it on several cores.
The original code is in the LAPACK package.

Yurii

---

**Beginner:**
My first message has been removed by someone, so I will repeat the idea. For BLAS2, the most important thing is competent programming. Expecting BLAS2 to scale across several cores on a shared-memory model is a blatant misconception.
Yurii

---

**Beginner:**
In addition to my previous message, I note that in all algorithms that use BLAS2, Intel MKL performs the calculations on one core.
Yurii 