

robackja

Beginner


12-17-2009
01:57 PM


super linear speedup on matrix multiply, why/how?

Using Intel Compiler 11.1 on Ubuntu 9.10 on an 8-core machine (dual-socket quad-core Xeon). A simple matrix multiplication of two n x n matrices where n = 1024. The non-OpenMP version runs in 52 seconds.

OpenMP runs

OMP_NUM_THREADS=1 runs in 56 seconds

OMP_NUM_THREADS=2 runs in 12 seconds

OMP_NUM_THREADS=4 runs in 6 seconds

OMP_NUM_THREADS=8 runs in 3.8 seconds

From 1 to 2 threads, super linear speedup; but from 2 to 4 and 4 to 8, something closer to linear. Could someone explain to me how this is possible? What is the compiler doing?

Compiling with CFLAGS =-O2 -openmp -no-vec -static

Sometimes super linear speedup can happen because cache behavior improves when the problem is split up, but 56 to 12 seconds seems like a giant leap.

My workstation specs.

Ubuntu 9.10 / 64-bit (went from 8.04 to 8.10 to 9.04 to 9.10)

Linux azrael 2.6.31-16-generic #53-Ubuntu SMP Tue Dec 8 04:02:15 UTC 2009 x86_64 GNU/Linux

Intel Xeon CPU E5405 @ 2.00GHz, LGA771, dual socket, 2x quad-cores

8GB of DDR2, 2x WD Raptors in Software RAID-0

1 Solution

4 Replies

ILevi1

Valued Contributor I


12-17-2009
05:44 PM


Quoting - robackja


This might sound silly but I will try...

Assuming your two matrices are 1024 x 1024 x 8 (i.e. datatype double), you have 16 MB of data and your CPU's L2 cache is only 12 MB. Splitting this across two threads may schedule them on separate physical CPUs (you have two quad-cores), so that each thread has only 8 MB of data to process, which fits in the 12 MB L2 and thus avoids going to the roughly 10x slower main memory.

Of course, I might be wrong about the data size; perhaps you are using float?

At any rate, it is easy to test this theory by halving the data size when running the single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.

jimdempseyatthecove

Black Belt


12-18-2009
09:28 AM


Igor, the computational load for a mat mul is not proportional to the matrix size. It is more like size**2.

The easiest way to compute the factor is to compute the difference in the number of operations (I*J*K) in the inner loop of your MAT MUL.

Jim Dempsey

robackja

Beginner


12-18-2009
10:25 AM


Quoting - Igor Levicki


LOL, you are absolutely correct. 1024 x 1024 x 3 (3 matrices) x 8 (double) is 24 MB. So it splits among 2 processors, each with 12 MB of cache... I was so focused on the super linear speedup that I forgot to do the simple thing and "do the math", as some people say.

thanks.

ILevi1

Valued Contributor I


12-20-2009
05:37 AM


Quoting - jimdempseyatthecove

Igor, the computational load for a mat mul is not proportional to the matrix size. It is more like size**2.

Jim Dempsey

Jim, I wasn't talking about computational load, but about the data size and whether it fits in cache.

@robackja:

You are welcome!


For more complete information about compiler optimizations, see our Optimization Notice.