I've written a simple parallel matrix-vector multiplication using OpenMP.
[fortran]!Parallel matrix-vector multiplication, row packed matrix
!$OMP PARALLEL DEFAULT(SHARED) NUM_THREADS(4)
!$OMP DO SCHEDULE(DYNAMIC) PRIVATE(i,j,itn,Sum)
do i=1,n
   Sum = 0.
   itn = (i-1)*n
   do j=1,n
      Sum = Sum + A(itn+j)*x(j)
   end do
   y(i) = Sum
end do
!$OMP END DO NOWAIT
!$OMP END PARALLEL[/fortran]
With an Intel Q6600 I get a serial-to-parallel time ratio of nearly 4 with a matrix size of 1000x1000.
When I increase the matrix size to 2000x2000, this ratio drops to ~1.4.
What might be causing this dramatic decrease in parallel performance relative to the serial performance?
Gregor Seitlinger
11 Replies
You're probably seeing a loss of cache locality. If you aren't setting affinity so as to keep adjacent threads together on a shared cache, doing so may raise the point at which cache misses become significant.
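For example, with the Intel OpenMP runtime (a sketch; the explicit core list is illustrative):
[bash]# Keep adjacent OpenMP threads together so pairs share an L2 on the Q6600
export KMP_AFFINITY=compact
# Or pin threads explicitly
export KMP_AFFINITY="explicit,proclist=[0,1,2,3]"[/bash]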
Thanks for your reply.
I've now tried using the environment variable KMP_AFFINITY with a range of options, from compact to scatter and even specific proclists, but none of them improved the computation time.
I'd be very grateful for any further suggestions as to what might be causing the slowdown of my parallel code.
The Q6600 has 2x 4MB L2 cache.
When your data set fits within the cache, the cores do not contend for RAM access.
You did not indicate whether your array A (and x, y) is REAL(4) or REAL(8).
REAL(4) A(1000,1000) is 4 MB (1000*1000*4 bytes; I don't know the sizes of x and y, probably 1000 elements each)
REAL(4) A(2000,2000) is 16 MB (x and y probably 2000 elements each)
REAL(8) A(1000,1000) is 8 MB (most of A fits)
When you exceed the cache capacity, all threads get in line for memory access to RAM.
Jim Dempsey
Thank you!
This explains why the code behaves the way it does.
So I guess there is not much I can do to improve performance for large data sets other than getting a CPU with more cache.
Have you tried changing your scheduling to static? Maybe dynamic scheduling bounces data around too much. In any case, dynamic scheduling brings no benefit in this algorithm, since there is no potential load imbalance between what each thread executes.
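In your loop that would be a one-line change (a sketch, keeping your variable names):
[fortran]!$OMP DO SCHEDULE(STATIC) PRIVATE(i,j,itn,Sum)[/fortran]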
>>So I guess there is not much I can do to improve performance for large data sets other than getting a CPU with more cache.
See if you can tile the operation.
Example: perform the A(2000,2000) case in 4 tiles, producing outputs for:
A(1:1000,1:1000)
A(1001:2000,1:1000)
A(1:1000,1001:2000)
A(1001:2000,1001:2000)
(remember not to overwrite any elements in the output that are to be used as input for a different tile)
Then see what 8-tile partitions perform like.
On the Q6600 the last-level cache is L2, at 4 MB. But on other processors L2 may be 1 MB (or smaller) and all cores share a larger L3. Depending on what you do, smaller tiles can fit in L2... but there is no free lunch: you will incur additional thread synchronization. Experiment. What works best on one processor may not work best on a different design.
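A rough sketch of a column-tiled variant of the original loop (untested; the tile width nb is illustrative and n is assumed divisible by nb):
[fortran]! y = A*x with row-packed A, processed one column block at a time
! so the matching block of x stays cache-resident
nb = n/2                      ! tile width; tune so the x block fits in L2
y(1:n) = 0.
do jb = 1, n, nb              ! loop over column tiles
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j,itn,Sum) NUM_THREADS(4)
   do i = 1, n                ! rows are split among the threads
      Sum = 0.
      itn = (i-1)*n
      do j = jb, jb+nb-1
         Sum = Sum + A(itn+j)*x(j)
      end do
      y(i) = y(i) + Sum       ! accumulate the partial dot product
   end do
!$OMP END PARALLEL DO
end do[/fortran]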
Jim Dempsey
Thanks for the suggestions.
@ #5
You are right, static scheduling is quite a bit faster than dynamic scheduling for this type of operation, but with large matrices the performance of the parallel code still suffers, so having to access slower memory is most likely the cause of the loss in parallel performance.
@ #6
I've tried several different tiling approaches and different tile sizes, but the performance is pretty much the same for all the tiling variations as with no tiling. I guess because my matrix is row packed (one-dimensional), and because each matrix element is used only once (matrix-vector product), I don't see an improvement from manual tiling.
On a related note, does the Intel Fortran compiler do automatic tiling when optimizing code?
Thanks again for your suggestions; if you have any further ideas on how I could speed up my code, I'd be grateful and happy to try them all out.
Gregor Seitlinger
As far as I know, the compiler won't tile for you.
However....
If what you are doing is common practice, one of the optional libraries (e.g. MKL) should be considered. Bear in mind that the API to the library functions will seem overly complex for some simple requirements (generalized input parameters). It's what's inside the function call that counts. As a convincer, try your best code on a large-ish matrix multiply, then run a test using MKL.
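For matrix-vector multiply the relevant BLAS-2 routine is ?GEMV. A minimal sketch of the call, assuming REAL(8) data and an n-by-n matrix:
[fortran]! y := 1.0*A*x + 0.0*y with column-major A(n,n), as Fortran BLAS expects
call dgemv('N', n, n, 1.0d0, A, n, x, 1, 0.0d0, y, 1)
! A row-packed buffer read column-major is the transpose, so pass 'T'
! instead of transposing A explicitly:
call dgemv('T', n, n, 1.0d0, A, n, x, 1, 0.0d0, y, 1)[/fortran]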
My initial approach was to use the MKL, but in the tests I've run on my system with a Q6600, my code, using a row-packed one-dimensional matrix with static scheduling, outperformed the MKL by about a factor of two.
Here is the output from my test program:
[bash] Parallel Algebra Performance Test
 1400x1400 Matrix multiplied with a vector 100000 times
 Serial row packed matrix-vector multiplication took 163.55469 seconds to complete.
 Parallel row packed matrix-vector multiplication took 24.86914 seconds to complete.
 IMKL parallel matrix-vector multiplication took 59.45898 seconds to complete.[/bash]
I'm guessing that the difference in performance has something to do with the fact that Fortran (and the BLAS interface) stores arrays in column-major form, while my matrix is row packed, i.e. effectively row-major. So for efficient memory access the matrix would have to be transposed before passing it to BLAS.
Or maybe the MKL is just not optimized for my processor ^^.
Gregor Seitlinger
Ok, I'm just gonna throw out a random idea here. Is it possible to memory-align arrays in Fortran? If so, you could try to 64-byte align each row. You might be suffering cache misses.
I think the Intel Fortran compiler automatically aligns arrays whenever possible. But since I'm not entirely sure, I tried manually aligning the array with the !DEC$ ATTRIBUTES ALIGN directive, but after having tried all possible alignment boundaries without any difference in performance I'm pretty sure it's not an alignment issue.
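For reference, a minimal sketch of the directive form (the 64-byte boundary, one cache line, and the fixed-size declaration are illustrative):
[fortran]integer, parameter :: n = 2000
! Request that the row-packed matrix start on a 64-byte (cache-line) boundary
!DEC$ ATTRIBUTES ALIGN : 64 :: A
real(4) :: A(n*n)[/fortran]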
Thanks for the suggestion anyway.
I'm nearly certain it's a cache issue by now, as I see a dramatic drop in performance at the point when the array doesn't fit into the L2 cache anymore.
Gregor Seitlinger