robackja
Beginner

super linear speedup on matrix multiply, why/how?

Hello.

Using Intel Compiler 11.1 on Ubuntu 9.10 on an 8-core machine (dual-socket quad-core Xeon). A simple matrix multiplication: multiplying two n x n matrices where n = 1024. The non-OpenMP version runs in 52 seconds.

OpenMP runs

OMP_NUM_THREADS=1 runs in 56 seconds
OMP_NUM_THREADS=2 runs in 12 seconds
OMP_NUM_THREADS=4 runs in 6 seconds
OMP_NUM_THREADS=8 runs in 3.8 seconds

From 1 to 2 threads, super-linear speedup; but from 2 to 4 and 4 to 8, something closer to linear. Could someone explain to me how this is possible? What is the compiler doing?

Compiling with CFLAGS=-O2 -openmp -no-vec -static

Sometimes super-linear speedup can happen when splitting up the problem improves cache behavior, but going from 56 to 12 seconds seems like a giant leap.

My workstation specs.

Ubuntu 9.10 / 64-bit (went from 8.04 to 8.10 to 9.04 to 9.10)
Linux azrael 2.6.31-16-generic #53-Ubuntu SMP Tue Dec 8 04:02:15 UTC 2009 x86_64 GNU/Linux
Intel Xeon CPU E5405 @ 2.00GHz, LGA771, dual socket, 2x quad-cores
8GB of DDR2, 2x WD Raptors in Software RAID-0

4 Replies
ILevi1
Valued Contributor I
Quoting - robackja

This might sound silly, but I will try...

Assuming your two matrices are 1024 x 1024 x 8 bytes (i.e. datatype double), you have 16 MB of data, and each of your CPUs has only 12 MB of L2 cache. Splitting the work across two threads may schedule them onto separate physical CPUs (you have two quad-cores), so that each thread has only 8 MB of data to process, which fits in the 12 MB L2 and avoids going out to the roughly 10x slower main memory.

Of course, I might be wrong about the data size; perhaps you are using float?

At any rate, it is easy to test this theory by halving the data size when running the single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.


jimdempseyatthecove
Black Belt

>>At any rate, it is easy to test this theory by halving the data size when running the single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.

Igor, the computational load for a matrix multiply is not proportional to the matrix size. For n x n matrices it grows as n**3, so halving the data does not halve the run time.

The easiest way to compute the factor is to compare the number of operations (I*J*K) performed by the inner loops of the MAT MUL.

Jim Dempsey
robackja
Beginner
Quoting - Igor Levicki

LOL, you are absolutely correct. 1024 x 1024 x 3 (3 matrices) x 8 (double) is 24 MB. So splitting among two processors, each with 12 MB of cache... I was so focused on the super-linear speedup that I forgot to do the simple thing and "do the math," as some people say.

thanks.
ILevi1
Valued Contributor I

>>At any rate, it is easy to test this theory by halving the data size when running the single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.

Igor, the computational load for a matrix multiply is not proportional to the matrix size. For n x n matrices it grows as n**3, so halving the data does not halve the run time.

Jim Dempsey

Jim, I wasn't talking about computational load, but about the cache load, and it seems I was right. :)

@robackja:
You are welcome!
