Community
cancel
Showing results for 
Search instead for 
Did you mean: 
xman_hawkeye
Beginner
64 Views

AVX performance question

I posted my question in Fortran forum and then I realized probably I should post it here.  Any inputs are welcome.  Thanks.

http://software.intel.com/en-us/forums/topic/373604

0 Kudos
7 Replies
SergeyKostrov
Valued Contributor II
64 Views

There are some test results for sqrt calculations in: Forum Topic: Performance of sqrt Web-link: software.intel.com/en-us/forums/topic/364460 and take a look at SqrtTestApp project. Regarding performance of matrix multiplication: It can not be improved significantly by applying some instruction set, like SSE, SSE2, AVX, AVX2, etc, or with a sophisticated cache management. A high performance matrix multiplication should rely on: - Advanced algorithm, like Strassen - O( n^2.8070 ) Strassen-Winograd - O( n^2.8070 ) Kronecker based ( Tensor Product ) - I don't have an exact asymptotic complexity, it is about ~O( n^2.5 ) ( really fast! ) Coppersmith-Winograd - O( n^2.3760 ) Virginia Vassilevska Williams - O( n^2.3727 ) - Fastest instruction set available on a given CPU - Multi-threading
xman_hawkeye
Beginner
64 Views

Thanks.

SergeyKostrov
Valued Contributor II
64 Views

>>...A high performance matrix multiplication should rely on: >> >>- Advanced algorithm, like >> >>Strassen - O( n^2.8070 ) This is a follow up and I'd like to provide results of some tests that demonstrate how performance of the algorithm is improving ( it is still CPU-bound! ) when one parameter ( Matrix Size Threshold ) is changed: [ Strassen HBC ( Heap Based Complete ) - All Optimizations disabled ] Matrix Size : 4096 x 4096 Matrix Size Threshold: 2048 x 2048 Matrix Partitions : 8 Strassen HBC - Pass 1.1 - Completed: 1057.00098 secs Strassen HBC - Pass 1.2 - Completed: 1056.40698 secs Strassen HBC - Pass 1.3 - Completed: 1056.48596 secs Matrix Size : 4096 x 4096 Matrix Size Threshold: 1024 x 1024 Matrix Partitions : 57 Strassen HBC - Pass 2.1 - Completed: 641.61700 secs Strassen HBC - Pass 2.2 - Completed: 641.00800 secs Strassen HBC - Pass 2.3 - Completed: 641.16400 secs Matrix Size : 4096 x 4096 Matrix Size Threshold: 512 x 512 Matrix Partitions : 400 Strassen HBC - Pass 3.1 - Completed: 267.86899 secs Strassen HBC - Pass 3.2 - Completed: 266.73001 secs Strassen HBC - Pass 3.3 - Completed: 266.74701 secs Matrix Size : 4096 x 4096 Matrix Size Threshold: 256 x 256 Matrix Partitions : 2801 Strassen HBC - Pass 4.1 - Completed: 165.18900 secs Strassen HBC - Pass 4.2 - Completed: 162.69299 secs Strassen HBC - Pass 4.3 - Completed: 162.69400 secs Matrix Size : 4096 x 4096 Matrix Size Threshold: 128 x 128 Matrix Partitions : 19608 Strassen HBC - Pass 5.1 - Completed: 147.88901 secs Strassen HBC - Pass 5.2 - Completed: 142.99001 secs Strassen HBC - Pass 5.3 - Completed: 142.99100 secs Matrix Size : 4096 x 4096 Matrix Size Threshold: 64 x 64 Matrix Partitions : 137257 Strassen HBC - Pass 6.1 - Completed: 129.74600 secs Strassen HBC - Pass 6.2 - Completed: 119.79310 secs Strassen HBC - Pass 6.3 - Completed: 119.79300 secs Note: Best time A ( BtA ) Matrix Size : 4096 x 4096 Matrix Size Threshold: 32 x 32 Matrix Partitions : 960800 Strassen HBC - Pass 7.1 - Completed: 797.85199 secs Note: Virtual Memory bound / Performance decreased Strassen HBC - Pass 7.2 - Completed: 127.11000 secs Strassen HBC - Pass 7.3 - Completed: 122.94400 secs
SergeyKostrov
Valued Contributor II
64 Views

A C++ compiler optimizations could significantly improve performance and here is an example: [ Strassen HBC ( Heap Based Complete ) - All Optimizations disabled - /Od ] Matrix Size : 4096 x 4096 Matrix Size Threshold: 64 x 64 Matrix Partitions : 137257 Strassen HBC - Pass 6.1 - Completed: 129.74600 secs Strassen HBC - Pass 6.2 - Completed: 119.79310 secs Strassen HBC - Pass 6.3 - Completed: 119.79300 secs Note: Best time A ( BtA ) [ Strassen HBC ( Heap Based Complete ) - Optimization Maximize Speed enabled - /O2 ] Matrix Size : 4096 x 4096 Matrix Size Threshold: 64 x 64 Matrix Partitions : 137257 Strassen HBC - Pass 1.1 - Completed: 45.75500 secs Strassen HBC - Pass 1.2 - Completed: 35.78600 secs Note: Best time B ( BtB ) and it is ~3.4x faster than BtA! Strassen HBC - Pass 1.3 - Completed: 35.78700 secs
Bernard
Black Belt
64 Views

@Sergey

Is it possible to post the source code   for Strassen multiplication algorithm?

Thanks in advance.

SergeyKostrov
Valued Contributor II
64 Views

No because Strassen Heap Based Complete ( HBC ) is a proprietary algorithm. If you search the Internet you could find sources in C, Java, PHP languages for Strassen classic ( Stack Based ) algorithm.
Bernard
Black Belt
64 Views

>>>If you search the Internet you could find sources in C>>>

I have found already a plenty of code examples.

Reply