Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

## Performance decrease of parallel code for large set of data

Beginner
852 Views
I've written a simple parallel matrix-vector multiplication using OpenMP.

```fortran
!Parallel matrix-vector multiplication, row packed matrix

!$OMP PARALLEL DEFAULT(SHARED) NUM_THREADS(4)
!$OMP DO SCHEDULE(DYNAMIC) PRIVATE(i,j,itn,Sum)
do i=1,n
   Sum = 0.
   itn = (i-1)*n
   do j=1,n
      Sum = Sum + A(itn+j)*x(j)
   end do
   y(i) = Sum
end do
!$OMP END DO NOWAIT
!$OMP END PARALLEL
```

With an Intel Q6600 I get a serial-to-parallel time ratio of nearly 4 with a matrix size of 1000x1000.
When I increase the matrix size to 2000x2000, this ratio drops to ~1.4.
What might be causing this dramatic decrease in parallel performance compared to the serial performance?

Gregor Seitlinger
1 Solution
11 Replies
Honored Contributor III
852 Views
You're probably seeing a loss of cache locality. If you aren't setting affinity so as to keep adjacent threads together on a shared cache, doing so may raise the point at which cache misses become significant.
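One hedged way to run that experiment from the shell (the binary name `matvec` is purely illustrative; `compact` and `scatter` are documented KMP_AFFINITY values):

```shell
# Q6600 = two dual-core pairs, each pair sharing a 4 MB L2.
# "compact" places consecutive OpenMP threads on neighbouring cores,
# so threads working on adjacent rows share an L2 where possible.
export OMP_NUM_THREADS=4
export KMP_AFFINITY=granularity=fine,compact
echo "KMP_AFFINITY=$KMP_AFFINITY"
# ./matvec                       # <- your benchmark binary (illustrative name)
# export KMP_AFFINITY=scatter    # the opposite placement, for comparison
```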
Beginner
852 Views
Thanks for your reply.
I've now tried the KMP_AFFINITY environment variable with a range of options, from compact to scatter and even specific proclists, but none of them improved the computation time.
I'd be very grateful for any further suggestions as to what might be causing the slowdown of my parallel code.
Honored Contributor III
853 Views
The Q6600 has 2x 4MB L2 cache.

When your data set fits within the cache, the cores do not contend for RAM access.
You did not indicate whether your array A (and the vectors x, y) is REAL(4) or REAL(8).

REAL(4) A(1000,1000) is 4MB (sizes of x and y unknown; probably 1000)
REAL(4) A(2000,2000) is 16MB (sizes of x and y unknown; probably 2000)

REAL(8) A(1000,1000) is 8MB (most of A fits)

When you exceed the cache capacity, all threads get in line for access to RAM.

Jim Dempsey

Beginner
852 Views
Thank you!
This explains why the code behaves the way it does.
So I guess there is not much I can do to improve performance for large data sets other than getting a CPU with more cache.
New Contributor I
852 Views
Have you tried changing your scheduling to static? Maybe dynamic scheduling bounces data around too much. In any case, dynamic scheduling brings no benefit for this algorithm, since there is no potential load imbalance between what the threads execute.
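A minimal sketch of that change, assuming the same declarations as in the original snippet (only the schedule clause differs):

```fortran
! Static schedule: each thread gets one contiguous block of rows,
! so it streams through a contiguous stretch of the row-packed A.
!$OMP PARALLEL DO DEFAULT(SHARED) NUM_THREADS(4) SCHEDULE(STATIC) PRIVATE(i,j,itn,Sum)
do i=1,n
   Sum = 0.
   itn = (i-1)*n
   do j=1,n
      Sum = Sum + A(itn+j)*x(j)
   end do
   y(i) = Sum
end do
!$OMP END PARALLEL DO
```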
Honored Contributor III
852 Views
>>So I guess there is not much I can do to improve performance for large data sets other than getting a CPU with more cache.

See if you can tile the operation.

Example: Perform the A(2000,2000) in 4 tiles:

outputs for:

A(1:1000,1:1000)
A(1001:2000,1:1000)
A(1:1000,1001:2000)
A(1001:2000,1001:2000)

(remember not to overwrite any elements in output A that are to be used as input for a different tile)

Then see how an 8-tile partition performs.

On the Q6600 the last-level cache is L2, at 4MB.
But on other processors L2 may be 1MB (or smaller) while all cores share a larger L3. Depending on what you do, smaller tiles can fit in L2... but there is no free lunch: you will incur additional thread synchronization. Experiment. What works best on one processor may not work best on a different design.
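For the row-packed layout, the column-tiling idea might be sketched like this (nb, jt, and jend are illustrative names; y is accumulated across tiles rather than overwritten, per the caveat above):

```fortran
! Process A in column blocks of width nb so x(jt:jend) stays cache-resident.
y(1:n) = 0.
do jt = 1, n, nb
   jend = min(jt+nb-1, n)
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j,itn,Sum) SCHEDULE(STATIC)
   do i = 1, n
      itn = (i-1)*n
      Sum = 0.
      do j = jt, jend
         Sum = Sum + A(itn+j)*x(j)
      end do
      y(i) = y(i) + Sum       ! accumulate; do not overwrite
   end do
!$OMP END PARALLEL DO
end do
```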

Jim Dempsey
Beginner
852 Views
Thanks for the suggestions.

@ #5
You are right: static scheduling is quite a bit faster than dynamic scheduling for this type of operation, but with large matrices the performance of the parallel code still suffers, so having to access slower memory is most likely the cause of the loss in parallel performance.

@ #6
I've tried several different tiling approaches and tile sizes, but the performance is pretty much the same for all the tiling variations as with no tiling. I guess that because my matrix is row packed (one-dimensional), and because each matrix element is used only once (matrix-vector product), I don't see an improvement from manual tiling.
On a related note, does the Intel Fortran compiler do automatic tiling when optimizing code?

Thanks again for your suggestions, if you have any further ideas how I could speed up my code I'd be grateful and happy to try them all out.

Gregor Seitlinger
Honored Contributor III
852 Views
As far as I know, the compiler won't tile for you.

However....

If what you are doing is common practice, one of the optional libraries (e.g. MKL) should be considered. Bear in mind that the API to the library functions will seem overly complex for some simple requirements (generalized input parameters). It's what's inside the function call that counts. As a convincer, try your best for large-ish matrix multiply, then run a test using the MKL.
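If the row-packed layout is kept, one way to hand it to a BLAS routine without an explicit transpose is the TRANS='T' form of ?GEMV: viewed column-major, a flat row-packed n*n array holds the transpose of the mathematical matrix. A sketch, assuming REAL(4) data:

```fortran
! y := A*x, where the flat array A holds the matrix row by row.
! Column-major BLAS sees that flat array as the transpose, hence 'T'.
call sgemv('T', n, n, 1.0, A, n, x, 1, 0.0, y, 1)
```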
Beginner
852 Views
My initial approach was to use the MKL, but in the tests I've run on my Q6600 system, my code, using a row-packed one-dimensional matrix with static scheduling, outperformed the MKL by about a factor of two.
Here the output from my test program:

```text
     Parallel Algebra Performance Test

1400x1400 Matrix multiplied with a vector 100000 times

Serial row packed matrix-vector multiplication took 163.55469 seconds to complete.

Parallel row packed matrix-vector multiplication took  24.86914 seconds to complete.

IMKL parallel matrix-vector multiplication took  59.45898 seconds to complete.
```

I'm guessing that the difference in performance has something to do with the fact that Fortran (and the BLAS interface) stores arrays in column-major form, while my matrix is packed row by row. So for efficient memory access the matrix has to be transposed, or the BLAS routine called with its transpose option, before using it.
Or maybe the MKL is just not optimized for my processor ^^.

Gregor Seitlinger
New Contributor I
852 Views
Ok, I'm just gonna throw a random idea out here. Is it possible to memory-align arrays in Fortran? If so, you could try to 64-byte align each row. You might be suffering cache misses.
Beginner
852 Views

I think the Intel Fortran compiler automatically aligns arrays whenever possible. But since I'm not entirely sure, I tried manually aligning the array with the directive

!DEC$ ATTRIBUTES ALIGN

but after having tried all possible alignment boundaries without any difference in performance, I'm pretty sure it's not an alignment issue.

Thanks for the suggestion anyway.
I'm nearly certain it's a cache issue by now, as I see a dramatic drop in performance at the point where the array no longer fits into the L2 cache.

Gregor Seitlinger