- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

[fortran]!Parallel matrix-vector multiplication, row packed matrix !$OMP PARALLEL DEFAULT(SHARED) NUM_THREADS(4) !$OMP DO SCHEDULE(DYNAMIC) PRIVATE(i,j,itn,Sum) do i=1,n Sum = 0. itn = (i-1)*n do j=1,n Sum = Sum + A(itn+j)*x(j) end do y(i) = Sum end do !$OMP END DO NOWAIT !$OMP END PARALLEL[/fortran]

With an Intel Q6600 I get a serial to parallel time ratio of nearly 4, with a matrix size of 1000x1000.

When I increase the matrix size to 2000x2000 this ratio goes down to ~1.4.

What might be causing this dramatic decrease of parallel performance compared to the serial performance?

Gregor Seitlinger

1 Solution

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

When your data set fits within the cache then the cores do not contend with RAM access.

You did not indicate if your arry A (and/or a, y) are REAL(4) or REAL(8).

REAL(4) A(1000,1000) is 4MB (don't know the sizes of arrays x and y (probably 1000))

REAL(4) A(2000,2000) is 16MB (don't know the sizes of arrays x and y (probably 2000))

REAL(8) A(1000,1000) is 8MB (most of A fits)

When you exceed the cache capacity, then all threads get in line for memory access to RAM.

Jim Dempsey

Link Copied

11 Replies

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I've now tried using the environment variable KMP_AFFINITY with a range of options, from compact, to scatter and even to specific proclists, but none of them improved the computation time.

I'd be very grateful for any further suggestions what might be causing the slow down of my parallel code.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

When your data set fits within the cache then the cores do not contend with RAM access.

You did not indicate if your arry A (and/or a, y) are REAL(4) or REAL(8).

REAL(4) A(1000,1000) is 4MB (don't know the sizes of arrays x and y (probably 1000))

REAL(4) A(2000,2000) is 16MB (don't know the sizes of arrays x and y (probably 2000))

REAL(8) A(1000,1000) is 8MB (most of A fits)

When you exceed the cache capacity, then all threads get in line for memory access to RAM.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

This explains why the code behaves the way it does.

So I guess there is not much I can do the improve performance for large data sets other then getting a CPU with more cache.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

See if you can tile the operation

Example: Perform the A(2000,2000) in 4 tiles:

outputs for:

A(1:1000,1:1000)

A(1001:2000,1:1000)

A(1:1000,1001:2000)

A(1001:2000,1001:2000)

(remember not to overwrite any elements in output A that are to be used as input for a different tile)

Then see what 8 tile partitions perform like;

On Q6600 the last level cache is L2 at 4MB

But on other processors L2 may be 1MB (or smaller) and all cores share larger L3. Depending on what you do smaller tiles can fit in L2... but there is no free lunch, you will incur additional thread synchronization. Experiment. What works best on one processor may not work best on a different design.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

@ #5

You are right static scheduling is for this type of operation quite a bit faster than dynamic scheduling, but with large matrices the performance of the parallel code still suffers, so having to access slower memory is mostly likely the case for the loss in parallel performance.

@ #6

I've tried several different tiling approaches and different tile sizes but the performance is pretty much the same for all the tiling variations or no tiling. I guess because the matrix I'm using is row packed (1 dimension), and because I only have to use each matrix element once (matrix-vector product) I don't see an improvement from manual tiling.

On a related note, does the Intel Fortran compiler do automatic tiling when optimizing code?

Thanks again for your suggestions, if you have any further ideas how I could speed up my code I'd be grateful and happy to try them all out.

Gregor Seitlinger

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

However....

If what you are doing is common practice, one of the optional libraries (e.g. MKL) should be considered. Bear in mind that the API to the library functions will seem overly complex for some simple requirements (generalized input parameters). It's what's inside the function call that counts. As a convincer, try your best for large-ish matrix multiply, then run a test using the MKL.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Here the output from my test program:

[bash] Parallel Algebra Performance Test 1400x1400 Matrix multiplied with a vector 100000 times Serial row packed matrix-vector multiplication took 163.55469 seconds to complete. Parallel row packed matrix-vector multiplication took 24.86914 seconds to complete. IMKL parallel matrix-vector multiplication took 59.45898 seconds to complete. [/bash]

I'm guessing that the reason for the difference in performance has something to do with the fact that Fortran stores it arrays in column-major form, but the BLAS interface needs a matrix in row-major form. So for efficient memory access in Fortran the matrix has to be transposed before using it.

Or maybe the MKL is just not optimized for my processor ^^.

Gregor Seitlinger

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

!DEC$ ATTRIBUTES ALIGN

but after having tried all
possible alignment boundaries without any difference in performance
I'm pretty sure it's not an alignment issue.

Thanks for the
suggestion anyway.

I'm nearly certain it's a cache issue by now as
I see a dramatic drop in performance at the point when the array
doesn't fit into the L2 cache anymore.

Gregor Seitlinger

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page