Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

super linear speedup on matrix multiply, why/how?

robackja
Beginner
643 Views
Hello.

Using Intel Compiler 11.1 on Ubuntu 9.10 on an 8-core machine (dual-socket quad-core Xeon). A simple matrix multiplication: multiplying two n x n matrices where n = 1024. The non-OpenMP version runs in 52 seconds.

OpenMP runs

OMP_NUM_THREADS=1 runs in 56 seconds
OMP_NUM_THREADS=2 runs in 12 seconds
OMP_NUM_THREADS=4 runs in 6 seconds
OMP_NUM_THREADS=8 runs in 3.8 seconds

From 1 to 2 threads that is super-linear speedup, but from 2 to 4 and from 4 to 8 it is much closer to linear. Could someone explain to me how this is possible? What is the compiler doing?

Compiling with CFLAGS = -O2 -openmp -no-vec -static
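
For reference, the kernel is just the textbook triple loop over double matrices -- a minimal sketch, not my exact code (allocation and timing are stripped out):

#define N 1024

/* C = A * B for square N x N matrices stored row-major */
void matmul(const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;                        /* one output element */
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}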

Sometimes super-linear speedup can happen because splitting up the problem improves cache behavior, but going from 56 to 12 seconds seems like a giant leap.

My workstation specs.

Ubuntu 9.10 / 64-bit (went from 8.04 to 8.10 to 9.04 to 9.10)
Linux azrael 2.6.31-16-generic #53-Ubuntu SMP Tue Dec 8 04:02:15 UTC 2009 x86_64 GNU/Linux
Intel Xeon CPU E5405 @ 2.00GHz, LGA771, dual socket, 2x quad-cores
8GB of DDR2, 2x WD Raptors in Software RAID-0

0 Kudos
4 Replies
levicki
Valued Contributor I
644 Views
Quoting - robackja
Hello.

Using Intel Compiler 11.1 on Ubuntu 9.10 on an 8-core machine (dual-socket quad-core Xeon). A simple matrix multiplication: multiplying two n x n matrices where n = 1024. The non-OpenMP version runs in 52 seconds.

OpenMP runs

OMP_NUM_THREADS=1 runs in 56 seconds
OMP_NUM_THREADS=2 runs in 12 seconds
OMP_NUM_THREADS=4 runs in 6 seconds
OMP_NUM_THREADS=8 runs in 3.8 seconds

From 1 to 2 threads that is super-linear speedup, but from 2 to 4 and from 4 to 8 it is much closer to linear. Could someone explain to me how this is possible? What is the compiler doing?

Compiling with CFLAGS = -O2 -openmp -no-vec -static

Sometimes super-linear speedup can happen because splitting up the problem improves cache behavior, but going from 56 to 12 seconds seems like a giant leap.

My workstation specs.

Ubuntu 9.10 / 64-bit (went from 8.04 to 8.10 to 9.04 to 9.10)
Linux azrael 2.6.31-16-generic #53-Ubuntu SMP Tue Dec 8 04:02:15 UTC 2009 x86_64 GNU/Linux
Intel Xeon CPU E5405 @ 2.00GHz, LGA771, dual socket, 2x quad-cores
8GB of DDR2, 2x WD Raptors in Software RAID-0


This might sound silly but I will try...

Assuming your two matrices are 1024 x 1024 x 8 bytes (i.e. datatype double), you have 16 MB of data and your CPU's L2 cache is only 12 MB. Splitting this across two threads may schedule them onto separate physical CPUs (you have two quad-cores), so that each thread has only 8 MB of data to process, which fits in the 12 MB L2, thus avoiding trips to the roughly 10x slower main memory.

Of course, I might be wrong regarding the data size -- perhaps you are using float?

At any rate, it is easy to test this theory by halving the data size when running the single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.
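
Something like this would do (a hypothetical harness on my part -- it assumes the plain triple-loop double kernel, and uses n = 724 because 724 x 724 is roughly half the elements of 1024 x 1024):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* single-threaded C = A * B for square n x n matrices */
static void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

int main(void)
{
    const int sizes[2] = { 1024, 724 };      /* full vs. ~half data size */
    for (int s = 0; s < 2; s++) {
        const int n = sizes[s];
        double *A = calloc((size_t)n * n, sizeof(double));
        double *B = calloc((size_t)n * n, sizeof(double));
        double *C = calloc((size_t)n * n, sizeof(double));
        const double t0 = omp_get_wtime();   /* needs -openmp for linking */
        matmul(n, A, B, C);
        printf("n = %4d: %.2f seconds\n", n, omp_get_wtime() - t0);
        free(A); free(B); free(C);
    }
    return 0;
}

Build it the same way you built your benchmark (e.g. icc -std=c99 -O2 -openmp -no-vec -static) and compare the two times.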

0 Kudos
jimdempseyatthecove
Honored Contributor III
643 Views

>>At any rate, it is easy to test this theory by halving the data size when running the single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.

Igor, the computational load of a matmul is not proportional to the data size. For n x n matrices the operation count grows as n**3, so halving the data only reduces n by a factor of sqrt(2) and cuts the work by roughly 2.8x, not 2x.

The easiest way to compute the factor is to compare the number of operations (I*J*K) in your inner loop of the matmul.

Jim Dempsey
0 Kudos
robackja
Beginner
643 Views
Quoting - Igor Levicki

This might sound silly but I will try...

Assuming your two matrices are 1024 x 1024 x 8 bytes (i.e. datatype double), you have 16 MB of data and your CPU's L2 cache is only 12 MB. Splitting this across two threads may schedule them onto separate physical CPUs (you have two quad-cores), so that each thread has only 8 MB of data to process, which fits in the 12 MB L2, thus avoiding trips to the roughly 10x slower main memory.

Of course, I might be wrong regarding the data size -- perhaps you are using float?

At any rate, it is easy to test this theory by halving the data size when running the single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.



LOL, you are absolutely correct. 1024 x 1024 x 3 (3 matrices) x 8 bytes (double) is 24 MB. So splitting it between two processors, each with 12 MB of cache... I was so focused on the super-linear speedup that I forgot to do the simple thing and "do the math", as some people say.

thanks.
0 Kudos
levicki
Valued Contributor I
643 Views

>>At any rate, it is easy to test this theory by halving the data size when running the single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.

Igor, the computational load of a matmul is not proportional to the data size. For n x n matrices the operation count grows as n**3, so halving the data only reduces n by a factor of sqrt(2) and cuts the work by roughly 2.8x, not 2x.

Jim Dempsey

Jim, I wasn't talking about the computational load, but about the cache load, and it seems I was right. :)

@robackja:
You are welcome!

0 Kudos