Intel® oneAPI Math Kernel Library

## Performance decrease of BLAS function for large matrices

Beginner
147 Views

Hello everyone!

Has anyone got any ideas what might cause the drop in performance of the BLAS function GEMV() when compared to a simple serial computation of the same problem?

Let me explain my question more clearly.
I've written a program that compares the performance of GEMV() to a simple serial matrix-vector multiplication routine. Each routine (the serial one and GEMV()) is called 100,000 times and the total time needed for the computations is recorded in a text file. This simulates a program that uses an iterative method to find voltages and currents in an inductive network.
With a matrix size of 1000×1000, GEMV() performs approximately 3.3 times as fast (using 4 cores) as the serial version.

But as the matrix size grows, this advantage shrinks considerably.
For a 1500×1500 matrix, GEMV() performs only ~1.7 times as fast as the serial computation,
and for a 2000×2000 matrix, GEMV() using 4 cores takes about the same amount of time as the serial computation.

What is causing this behavior? Does it have something to do with the cache, memory access patterns, or something else entirely? Any ideas about the cause, and any suggestions on how to keep performance up for large matrices, would be greatly appreciated.

Gregor Seitlinger

6 Replies
Black Belt
You left out some significant details. For example, what are the cache sizes? Do you have two cache levels or three, and if three, is the L3 cache shared by all the cores? Do you compute M·v or Mᵀ·v with the call to GEMV?

A 2000 × 2000 dense matrix occupies 16 MB in single precision or 32 MB in double precision, which is probably more than you have in L2 or L3 cache.

Are you surprised by the timing results as a result of expecting linear speed-up according to the number of threads?

Amdahl's "law" has something to say about how much speed-up to expect, not just by using parallel programming, but by dedicating more resources in general.

There is an excellent review of the issues in A minicourse on multithreaded programming.
Beginner
GEMV is a *level-2* BLAS routine.
Black Belt
At issue is cache level, not BLAS level.
Beginner

For big matrices the speed of RAM is what matters (for level-2 BLAS).
