[ Abstract ]
In 1969 the German mathematician Volker Strassen has invented a recursive algorithm for
multiplication of two square matrices with asymptotic complexity O( n^2.807 ). That was the first matrix multiplication algorithm
with asymptotic complexity less than O( n^3 ).
Note: 2.807 = log2( 7 ).
Two Classic well known Compute Schemas ( CS ) of the algorithm are as follows:
Strassen..........................A4B4C4M7T2 ( or ABC4M7T2 )
Strassen-Winograd......A4B4C4S4T4P7U7 ( or ABCST4PU7 )
These algorithms use temporary matrices to store intermediate results and, for example, classic Strassen algorithm
uses two T-matrices ( denoted as T2 ).
Classic versions of Strassen matrix multiplication algorithm could be classified as Recursive Stack Based ( RSB ) and
at every recursion memory for subdivided, or partitioned, temporary matrices is allocated from the stack.
Performance of two Heap Based versions of Strassen matrix multiplication algorithm is evaluated. That is,
Strassen Heap Based Incomplete ( SHBI ) and Strassen Heap Based Complete ( SHBC ) algorithms
with a highly optimized memory usage and Compute Schema A2B2M7 ( or AB2M7 ) without any temporary matrices to store
These two versions of Strassen matrix multiplication algorithm could be classified as Recursive Heap Based ( RHB ).
Unofficially, on the ScaLib for BDP project, A2B2M7 ( or AB2M7 ) Compute Schema is called as Strassen-Kostrov.
As a reference, performance CBLAS function SGEMM from of Intel MKL, one of the fastest implementations for
matrix multiplication, is evaluated for every test case.
Additional high level technical details about SHBI and SHBC algorithms will be provided.
To learn more about Volker Strassen follow a web-link:
[ Computer Systems used for performance evaluations ]
** Dell Precision Mobile M4700 **
Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit SP1
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Display resolution: 1366 x 768
[ C++ compilers used for performance evaluations ]
MinGW C++ compiler v5.1.0 64-bit
MinGW C++ compiler v6.1.0 64-bit
Microsoft C++ compiler ( VS2008 PE ) 64-bit
Intel C++ compiler v13.1.0 ( u149 ) 64-bit
Command Line Options of C++ compilers used in these performance evaluations will be provided.