Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Performance decrease of parallel code for large data sets

gregor_seitlinger
I've written a simple parallel matrix-vector multiplication using OpenMP.

[fortran]!Parallel matrix-vector multiplication, row-packed matrix
!A is a 1-D array of length n*n holding the matrix row by row; x and y have length n

    !$OMP PARALLEL DEFAULT(SHARED) NUM_THREADS(4)
    !$OMP DO SCHEDULE(DYNAMIC) PRIVATE(i,j,itn,Sum)
    do i=1,n
        Sum = 0.
        itn = (i-1)*n          ! offset of row i in the packed array
        do j=1,n
            Sum = Sum + A(itn+j)*x(j)
        end do
        y(i) = Sum
    end do
    !$OMP END DO NOWAIT
    !$OMP END PARALLEL[/fortran]

On an Intel Q6600 I get a serial-to-parallel time ratio of nearly 4 with a matrix size of 1000x1000.
When I increase the matrix size to 2000x2000, this ratio drops to ~1.4.
What might be causing this dramatic decrease in parallel performance relative to the serial version?

Gregor Seitlinger
11 Replies
TimP
Honored Contributor III
You're probably seeing a loss of cache locality. If you aren't setting affinity so as to keep adjacent threads together on a shared cache, doing so may raise the point at which cache misses become significant.
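With the Intel OpenMP runtime, thread placement can be set from the shell before launching the program; a minimal sketch (the executable name ./matvec is just a placeholder):

[bash]# Sketch: keep consecutive OpenMP threads on adjacent logical processors
# (so neighbouring threads share a cache) and print the binding actually used.
export KMP_AFFINITY=verbose,compact
./matvec[/bash]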
gregor_seitlinger
Thanks for your reply.
I've now tried using the environment variable KMP_AFFINITY with a range of options, from compact to scatter and even specific proclists, but none of them improved the computation time.
I'd be very grateful for any further suggestions as to what might be causing the slowdown of my parallel code.
jimdempseyatthecove
Honored Contributor III
The Q6600 has 2x 4MB of L2 cache.

When your data set fits within the cache, the cores do not contend for RAM access.
You did not indicate whether your array A (and/or x, y) is REAL(4) or REAL(8).

REAL(4) A(1000,1000) is 4MB (I don't know the sizes of arrays x and y, probably 1000)
REAL(4) A(2000,2000) is 16MB (I don't know the sizes of arrays x and y, probably 2000)

REAL(8) A(1000,1000) is 8MB (most of A fits)

Once you exceed the cache capacity (2 x 4MB = 8MB in total), all threads get in line for memory access to RAM.

Jim Dempsey


gregor_seitlinger
Thank you!
This explains why the code behaves the way it does.
So I guess there is not much I can do to improve performance for large data sets other than getting a CPU with more cache.
Tudor
New Contributor I
Have you tried changing your scheduling to static? Maybe dynamic scheduling bounces data around too much. In any case, dynamic scheduling brings no benefit in this algorithm, since there is no potential load imbalance between what the threads execute.
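For the loop in the original post that would just mean changing the schedule clause on the worksharing directive, for example:

[fortran]    !Only the worksharing directive changes; the loop body stays as posted.
    !With STATIC each thread gets one contiguous block of rows, so it streams
    !a contiguous slice of the row-packed A instead of jumping between chunks.
    !$OMP DO SCHEDULE(STATIC) PRIVATE(i,j,itn,Sum)[/fortran]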
jimdempseyatthecove
Honored Contributor III
>>So I guess there is not much I can do to improve performance for large data sets other than getting a CPU with more cache.

See if you can tile the operation.

Example: perform the A(2000,2000) operation in 4 tiles:

outputs for:

A(1:1000,1:1000)
A(1001:2000,1:1000)
A(1:1000,1001:2000)
A(1001:2000,1001:2000)

(remember not to overwrite any partial results in the output that a different tile still has to add to)

Then see how an 8-tile partitioning performs.

On the Q6600 the last-level cache is the L2, at 4MB per pair of cores.
But on other processors the L2 may be 1MB (or smaller) and all cores share a larger L3. Depending on what you do, smaller tiles can fit in L2... but there is no free lunch: you will incur additional thread synchronization. Experiment. What works best on one processor may not work best on a different design.
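A tiled version of the posted loop could look something like this (just a sketch; nb is a hypothetical tile width, and the partial sums from each column tile are accumulated into y):

[fortran]!Sketch only: y = A*x with the row-packed A processed in column tiles of width nb.
    y(1:n) = 0.
    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(jt) NUM_THREADS(4)
    do jt = 1, n, nb                        ! loop over tiles of columns / of x
        !$OMP DO SCHEDULE(STATIC) PRIVATE(i,j,itn,Sum)
        do i = 1, n                         ! rows are still split among the threads
            Sum = 0.
            itn = (i-1)*n
            do j = jt, min(jt+nb-1, n)      ! only this tile's columns of row i
                Sum = Sum + A(itn+j)*x(j)
            end do
            y(i) = y(i) + Sum               ! accumulate the partial dot product
        end do
        !$OMP END DO                        ! implicit barrier before the next tile touches y
    end do
    !$OMP END PARALLEL[/fortran]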

Jim Dempsey
gregor_seitlinger
Thanks for the suggestions.

@ #5
You are right, for this type of operation static scheduling is quite a bit faster than dynamic scheduling, but with large matrices the performance of the parallel code still suffers, so having to access slower memory is most likely the cause of the loss in parallel performance.

@ #6
I've tried several different tiling approaches and tile sizes, but the performance is pretty much the same for all the tiling variations as for no tiling at all. I guess that because the matrix I'm using is row packed (one-dimensional), and because each matrix element is only used once (matrix-vector product), I don't see an improvement from manual tiling.
On a related note, does the Intel Fortran compiler do automatic tiling when optimizing code?

Thanks again for your suggestions; if you have any further ideas on how I could speed up my code, I'd be grateful and happy to try them all out.

Gregor Seitlinger
jimdempseyatthecove
Honored Contributor III
As far as I know, the compiler won't tile for you.

However....

If what you are doing is common practice, one of the optional libraries (e.g. MKL) should be considered. Bear in mind that the API of the library functions may seem overly complex for simple requirements (the input parameters are generalized). It's what's inside the function call that counts. As a convincer, write your best version of a large-ish matrix multiply, then run the same test using MKL.
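For the matrix-vector case, a bare-bones MKL call might look like the sketch below (a sketch only, not a tuned benchmark; it assumes a standard column-major REAL(4) matrix and that the program is linked against MKL):

[fortran]!Sketch only: y = A*x through MKL's BLAS routine SGEMV.
program mkl_matvec_sketch
    implicit none
    integer, parameter :: n = 2000
    real, allocatable :: A(:,:), x(:), y(:)
    allocate (A(n,n), x(n), y(n))
    call random_number(A)
    call random_number(x)
    ! SGEMV computes y = alpha*A*x + beta*y
    ! 'N' = no transpose, alpha = 1.0, beta = 0.0, unit strides for x and y
    call sgemv('N', n, n, 1.0, A, n, x, 1, 0.0, y, 1)
    print *, 'y(1) =', y(1)
end program mkl_matvec_sketch[/fortran]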
gregor_seitlinger
My initial approach was to use the MKL, but in the tests I've run on my system with a Q6600, my code, using a row-packed one-dimensional matrix with static scheduling, outperformed the MKL by about a factor of two.
Here is the output from my test program:

[bash]     Parallel Algebra Performance Test

     1400x1400 Matrix multiplied with a vector 100000 times


     Serial row packed matrix-vector multiplication took 163.55469 seconds to complete.


     Parallel row packed matrix-vector multiplication took  24.86914 seconds to complete.


     IMKL parallel matrix-vector multiplication took  59.45898 seconds to complete.
[/bash]

I'm guessing that the difference in performance has something to do with storage order: Fortran and the BLAS interface store matrices in column-major form, while my matrix is row packed (row-major), so for efficient memory access the matrix effectively has to be transposed before it can be used with MKL.
Or maybe the MKL is just not optimized for my processor ^^.
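(As an aside: because BLAS reads the buffer as column-major, a row-packed array can in principle be handed to ?GEMV as the transpose instead of being copied; a sketch, with A here being the row-packed 1-D array of length n*n:)

[fortran]!Sketch only: column-major BLAS sees the row-packed buffer as the transpose,
!so requesting the transposed product gives back y = A*x without an explicit copy.
    call sgemv('T', n, n, 1.0, A, n, x, 1, 0.0, y, 1)[/fortran]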

Gregor Seitlinger
Tudor
New Contributor I
Ok, I'm just gonna throw a random idea here. Is it possible to memory align arrays in Fortran? If so, you could try to 64-byte align each row. You might be suffering cache misses.
gregor_seitlinger

I think the Intel Fortran compiler automatically aligns arrays whenever possible. But since I'm not entirely sure, I tried manually aligning the array with the directive

!DEC$ ATTRIBUTES ALIGN

but after trying all possible alignment boundaries without any difference in performance, I'm pretty sure it's not an alignment issue.
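(For reference, the full form of the directive is along these lines; the 64 here is just one of the boundaries I tried, applied to the matrix array:)

[fortran]!Sketch of the alignment directive applied to a fixed-size version of the array
    integer, parameter :: n = 2000
    real :: A(n*n)
    !DEC$ ATTRIBUTES ALIGN : 64 :: A[/fortran]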

Thanks for the suggestion anyway.
I'm nearly certain it's a cache issue by now, as I see a dramatic drop in performance at the point where the array no longer fits into the L2 cache.

Gregor Seitlinger
