Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

super linear speedup on matrix multiply, why/how?

robackja
Beginner
643 Views
Hello.

Using Intel Compiler 11.1 on Ubuntu 9.10 on an 8-core machine (dual-socket quad-core Xeon). A simple matrix multiplication: multiplying two n x n matrices where n = 1024. The non-OpenMP version runs in 52 seconds.

OpenMP runs

OMP_NUM_THREADS=1 runs in 56 seconds
OMP_NUM_THREADS=2 runs in 12 seconds
OMP_NUM_THREADS=4 runs in 6 seconds
OMP_NUM_THREADS=8 runs in 3.8 seconds

From 1 to 2 threads that is super-linear speedup, but from 2 to 4 and from 4 to 8 it is much closer to linear. Could someone explain to me how this is possible? What is the compiler doing?

Compiling with CFLAGS = -O2 -openmp -no-vec -static
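
For reference, the kernel is just the textbook triple loop over double matrices -- a minimal sketch, not my exact code (allocation and timing are stripped out):

#define N 1024

/* C = A * B for square N x N matrices stored row-major */
void matmul(const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;                        /* one output element */
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}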

Sometimes super-linear speedup can happen because splitting up the problem improves cache behavior, but going from 56 to 12 seconds seems like a giant leap.

My workstation specs.

Ubuntu 9.10 / 64-bit (went from 8.04 to 8.10 to 9.04 to 9.10)
Linux azrael 2.6.31-16-generic #53-Ubuntu SMP Tue Dec 8 04:02:15 UTC 2009 x86_64 GNU/Linux
Intel Xeon CPU E5405 @ 2.00GHz, LGA771, dual socket, 2x quad-cores
8GB of DDR2, 2x WD Raptors in Software RAID-0

0 Kudos
4 Replies
levicki
Valued Contributor I
644 Views
Quoting - robackja
Hello.

Using Intel Compiler 11.1 on Ubuntu 9.10 on an 8-core machine (dual-socket quad-core Xeon). A simple matrix multiplication: multiplying two n x n matrices where n = 1024. The non-OpenMP version runs in 52 seconds.

OpenMP runs

OMP_NUM_THREADS=1 runs in 56 seconds
OMP_NUM_THREADS=2 runs in 12 seconds
OMP_NUM_THREADS=4 runs in 6 seconds
OMP_NUM_THREADS=8 runs in 3.8 seconds

From 1 to 2 threads that is super-linear speedup, but from 2 to 4 and from 4 to 8 it is much closer to linear. Could someone explain to me how this is possible? What is the compiler doing?

Compiling with CFLAGS = -O2 -openmp -no-vec -static

Sometimes super-linear speedup can happen because splitting up the problem improves cache behavior, but going from 56 to 12 seconds seems like a giant leap.

My workstation specs.

Ubuntu 9.10 / 64-bit (went from 8.04 to 8.10 to 9.04 to 9.10)
Linux azrael 2.6.31-16-generic #53-Ubuntu SMP Tue Dec 8 04:02:15 UTC 2009 x86_64 GNU/Linux
Intel Xeon CPU E5405 @ 2.00GHz, LGA771, dual socket, 2x quad-cores
8GB of DDR2, 2x WD Raptors in Software RAID-0


This might sound silly but I will try...

Assuming your two matrices are 1024 x 1024 x 8 bytes (i.e. datatype double), you have 16 MB of data and your CPU's L2 cache is only 12 MB. Splitting this across two threads may schedule them onto separate physical CPUs (you have two quad-cores), so that each thread has only 8 MB of data to process, which fits in the 12 MB L2, thus avoiding trips to the roughly 10x slower main memory.

Of course, I might be wrong regarding the data size -- perhaps you are using float?

At any rate, it is easy to test this theory by halving the data size when running the single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.
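
Something like this would do (a hypothetical harness on my part -- it assumes the plain triple-loop double kernel, and uses n = 724 because 724 x 724 is roughly half the elements of 1024 x 1024):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* single-threaded C = A * B for square n x n matrices */
static void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

int main(void)
{
    const int sizes[2] = { 1024, 724 };      /* full vs. ~half data size */
    for (int s = 0; s < 2; s++) {
        const int n = sizes[s];
        double *A = calloc((size_t)n * n, sizeof(double));
        double *B = calloc((size_t)n * n, sizeof(double));
        double *C = calloc((size_t)n * n, sizeof(double));
        const double t0 = omp_get_wtime();   /* needs -openmp for linking */
        matmul(n, A, B, C);
        printf("n = %4d: %.2f seconds\n", n, omp_get_wtime() - t0);
        free(A); free(B); free(C);
    }
    return 0;
}

Build it the same way you built your benchmark (e.g. icc -std=c99 -O2 -openmp -no-vec -static) and compare the two times.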

0 Kudos
jimdempseyatthecove
Honored Contributor III
643 Views

>>At any rate, it is easy to test this theory by halving the data size when running the single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.

Igor, the computational load of a matmul is not proportional to the data size. For n x n matrices the operation count grows as n**3, so halving the data only reduces n by a factor of sqrt(2) and cuts the work by roughly 2.8x, not 2x.

The easiest way to compute the factor is to compare the number of operations (I*J*K) in your inner loop of the matmul.

Jim Dempsey
0 Kudos
robackja
Beginner
643 Views
Quoting - Igor Levicki

This might sound silly but I will try...

Assuming your two matrices are 1024 x 1024 x 8 bytes (i.e. datatype double), you have 16 MB of data and your CPU's L2 cache is only 12 MB. Splitting this across two threads may schedule them onto separate physical CPUs (you have two quad-cores), so that each thread has only 8 MB of data to process, which fits in the 12 MB L2, thus avoiding trips to the roughly 10x slower main memory.

Of course, I might be wrong regarding the data size -- perhaps you are using float?

At any rate, it is easy to test this theory by halving the data size when running the single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.



LOL, you are absolutely correct. 1024 x 1024 x 3 (3 matrices) x 8 bytes (double) is 24 MB. So splitting it between two processors, each with 12 MB of cache... I was so focused on the super-linear speedup that I forgot to do the simple thing and "do the math", as some people say.

thanks.
0 Kudos
levicki
Valued Contributor I
643 Views

>>At any rate, it is easy to test this theory by halving the data size when running the single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.

Igor, the computational load of a matmul is not proportional to the data size. For n x n matrices the operation count grows as n**3, so halving the data only reduces n by a factor of sqrt(2) and cuts the work by roughly 2.8x, not 2x.

Jim Dempsey

Jim, I wasn't talking about the computational load, but about the cache load, and it seems I was right. :)

@robackja:
You are welcome!

0 Kudos