Hello.
I'm using Intel Compiler 11.1 on Ubuntu 9.10 on an 8-core machine (dual-socket, quad-core Xeons), running a simple matrix multiplication: multiplying two n x n matrices where n = 1024. The non-OpenMP version runs in 52 seconds.
OpenMP runs:
OMP_NUM_THREADS=1 runs in 56 seconds
OMP_NUM_THREADS=2 runs in 12 seconds
OMP_NUM_THREADS=4 runs in 6 seconds
OMP_NUM_THREADS=8 runs in 3.8 seconds
From 1 to 2 threads the speedup is super-linear, but from 2 to 4 and 4 to 8 it is closer to linear. Could someone explain to me how this is possible? What is the compiler doing?
Compiling with CFLAGS = -O2 -openmp -no-vec -static
Super-linear speedup can sometimes happen because cache behavior improves when the problem is split up, but 56 to 12 seconds seems like a giant leap.
My workstation specs:
Ubuntu 9.10 / 64-bit (upgraded from 8.04 to 8.10 to 9.04 to 9.10)
Linux azrael 2.6.31-16-generic #53-Ubuntu SMP Tue Dec 8 04:02:15 UTC 2009 x86_64 GNU/Linux
Intel Xeon CPU E5405 @ 2.00GHz, LGA771, dual socket, 2x quad-core
8 GB of DDR2, 2x WD Raptors in software RAID-0
1 Solution
Quoting - robackja
This might sound silly, but I will try...
Assuming your two matrices are 1024 x 1024 x 8 bytes (i.e., datatype double), you have 16 MB of data, and your CPU's L2 cache is only 12 MB. Splitting this across two threads may schedule them onto separate physical CPUs (you have two quad-cores), so each thread has only 8 MB of data to process, which fits in the 12 MB L2, thus avoiding trips to main memory, which is roughly 10x slower.
Of course, I might be wrong about the data size; perhaps you are using float?
At any rate, it is easy to test this theory by halving the data size in the single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.
4 Replies
>>At any rate, it is easy to test this theory by halving the data size when running single-threaded version -- if my guess is right, it should finish in considerably less than 28 seconds.
Igor, the computational load of a matrix multiply is not proportional to the matrix size. It is more like size**2.
The easiest way to compute the factor is to compare the number of operations (I*J*K) in the inner loop of the matrix multiply.
Jim Dempsey
Quoting - Igor Levicki
LOL, you are absolutely correct. 1024 x 1024 x 3 (three matrices) x 8 bytes (double) is 24 MB. So it splits among two processors, each with 12 MB of cache... I was so focused on the super-linear speedup that I forgot to do the simple thing and "do the math," as some people say.
Thanks.
Quoting - jimdempseyatthecove
Jim, I wasn't talking about computational load, but about the cache load, and it seems I was right. :)
@robackja:
You are welcome!
