Performance Evaluation of Strassen Matrix Multiplication algorithms

SergeyKostrov · ‎10-10-2016

*** Performance Evaluation of Strassen Matrix Multiplication algorithms ***

SergeyKostrov · ‎10-10-2016

[ Abstract ] In 1969 the German mathematician Volker Strassen has invented a recursive algorithm for multiplication of two square matrices with asymptotic complexity O( n^2.807 ). That was the first matrix multiplication algorithm with asymptotic complexity less than O( n^3 ). Note: 2.807 = log2( 7 ). Two Classic well known Compute Schemas ( CS ) of the algorithm are as follows: Strassen..........................A4B4C4M7T2 ( or ABC4M7T2 ) Strassen-Winograd......A4B4C4S4T4P7U7 ( or ABCST4PU7 ) These algorithms use temporary matrices to store intermediate results and, for example, classic Strassen algorithm uses two T-matrices ( denoted as T2 ). Classic versions of Strassen matrix multiplication algorithm could be classified as Recursive Stack Based ( RSB ) and at every recursion memory for subdivided, or partitioned, temporary matrices is allocated from the stack. Performance of two Heap Based versions of Strassen matrix multiplication algorithm is evaluated. That is, Strassen Heap Based Incomplete ( SHBI ) and Strassen Heap Based Complete ( SHBC ) algorithms with a highly optimized memory usage and Compute Schema A2B2M7 ( or AB2M7 ) without any temporary matrices to store intermediate results. These two versions of Strassen matrix multiplication algorithm could be classified as Recursive Heap Based ( RHB ). Unofficially, on the ScaLib for BDP project, A2B2M7 ( or AB2M7 ) Compute Schema is called as Strassen-Kostrov. As a reference, performance CBLAS function SGEMM from of Intel MKL, one of the fastest implementations for matrix multiplication, is evaluated for every test case. Additional high level technical details about SHBI and SHBC algorithms will be provided. To learn more about Volker Strassen follow a web-link: . http://www.math.uni-konstanz.de/~strassen

SergeyKostrov · ‎10-10-2016

[ List of Abbreviations ] MM - Matrix Multiplication MMA - Matrix Multiplication Algorithm SMM - Strassen Matrix Multiplication SMMA - Strassen Matrix Multiplication Algorithm CS - Compute Schema RSB - Recursive Stack Based RHB - Recursive Heap Based HBI - Heap Based Incomplete HBC - Heap Based Complete SCSB - Strassen Classic Stack Based WCSB - Winograd Classic Stack Based SHBI - Strassen Heap Based Incomplete SHBC - Strassen Heap Based Complete DRP - Degree of Recursive Processing Compute Schemas: A4B4C4M7T2 ( or ABC4M7T2 ) SCS A4B4C4S4T4P7U7 ( or ABCST4PU7 ) SWCS A2B2M7 ( or AB2M7 ) SKCS

SergeyKostrov · ‎10-10-2016

[ Performance Evaluations of SHBI and SHBC versions of SMMA are completed for these matrix ( A, B and C ) sizes ] 4096 x 4096 8192 x 8192 16384 x 16384 32768 x 32768 40960 x 40960

SergeyKostrov · ‎10-10-2016

[ Computer Systems used for performance evaluations ] ** Dell Precision Mobile M4700 ** Intel Core i7-3840QM ( 2.80 GHz ) Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846 32GB RAM 320GB HDD NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory ) Windows 7 Professional 64-bit SP1 Size of L3 Cache = 8MB ( shared between all cores for data & instructions ) Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions ) Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions ) Display resolution: 1366 x 768

SergeyKostrov · ‎10-10-2016

[ C++ compilers used for performance evaluations ] MinGW C++ compiler v5.1.0 64-bit MinGW C++ compiler v6.1.0 64-bit Microsoft C++ compiler ( VS2008 PE ) 64-bit Intel C++ compiler v13.1.0 ( u149 ) 64-bit Command Line Options of C++ compilers used in these performance evaluations will be provided.

SergeyKostrov · ‎10-10-2016

[ MinGW C++ compiler v5.1.0 64-bit ] MgwTestApp.cpp -DNDEBUG -O3 -mavx -mprfchw -ffast-math -fpeel-loops -ftree-vectorizer-verbose=0 -ftree-vectorize -fvect-cost-model -fomit-frame-pointer -fwhole-program -fopenmp -w -I "C:/WorkLib/ICC2013/Composer XE/Mkl/Include" -B "../../AppsSca" "C:/WorkLib/ICC2013/Composer XE/Mkl/Lib/Intel64/mkl_rt.lib" -Xlinker --stack=1073741824

SergeyKostrov · ‎10-10-2016

[ MinGW C++ compiler v6.1.0 64-bit ] MgwTestApp.cpp -DNDEBUG -O3 -mavx -mprfchw -ffast-math -fpeel-loops -ftree-vectorizer-verbose=0 -ftree-vectorize -fvect-cost-model -fomit-frame-pointer -fwhole-program -fopenmp -w -I "C:/WorkLib/ICC2013/Composer XE/Mkl/Include" -B "../../AppsSca" "C:/WorkLib/ICC2013/Composer XE/Mkl/Lib/Intel64/mkl_rt.lib" -Xlinker --stack=1073741824

SergeyKostrov · ‎10-10-2016

[ Microsoft C++ compiler ( VS2008 PE ) 64-bit ] [ Compiler ] O2 /Ob1 /Oi /Ot /Oy /GL /I "..\..\Include" /D "WIN32" /D "_CONSOLE" /D "NDEBUG" /D "_WIN32_MSC" /D "_UNICODE" /D "UNICODE" /GF /Gm /MT /GS- /fp:fast /GR- /openmp /Yu"Stdphf.h" /Fp"x64 Release\ScaLibTestApp64.pch" /Fo"x64/Release/" /Fd"x64/Release/" /W4 /nologo /c /Zi /TP /wd4005 /U "_WINCE_MSC" /U "WIN32_PLATFORM_PSPC" /U "WIN32_PLATFORM_WFSP" /U "WIN32_PLATFORM_WM50" /U "_WIN32_MGW" /U "_WIN32_BCC" /U "_COS16_TCC" /U "_WIN32_ICC" /U "_WIN32_WCC" /errorReport:prompt [ Linker ] /OUT:"x64\Release/ScaLibTestApp64.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"x64\Release\ScaLibTestApp64.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /STACK:1073741824 /LTCG /DYNAMICBASE:NO /MACHINE:X64 /ERRORREPORT:PROMPT kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib "..\..\bin\release\scalib64.lib"

SergeyKostrov · ‎10-10-2016

[ Intel C++ compiler v13.1.0 ( u149 ) 64-bit ] [ Compiler ] /c /O3 /Ob1 /Oi /Ot /Qipo /I "..\..\Include" /I "C:\WorkLib\ICC2013\Composer XE 2013\ipp\include" /D "WIN32" /D "_CONSOLE" /D "NDEBUG" /D "_WIN32_ICC" /D "INTEL_SUITE_VERSION=PE130_149" /D "_IPP_PARALLEL_DYNAMIC" /D "IPP_USE_CUSTOM" /D "_VC80_UPGRADE=0x0710" /D "_UNICODE" /D "UNICODE" /GF /MT /GS- /arch:AVX /fp:fast=2 /GR- /Yu"Stdphf.h" /Fp"x64\Release\IccTestApp64.pch" /Fo"x64/Release/" /Fd"x64/Release/" /W5 /nologo /Wp64 /Zi /TP /U "_WIN32_MSC" /U "_WINCE_MSC" /U "WIN32_PLATFORM_PSPC" /U "WIN32_PLATFORM_WFSP" /U "WIN32_PLATFORM_WM50" /U "_WIN32_MGW" /U "_WIN32_BCC" /U "_COS16_TCC" /U "_WIN32_WCC" /Qopenmp /Qfp-speculation:fast /Qopt-matmul /Qstd=c++0x /Qrestrict /Qansi-alias /Qdiag-disable:111,673,2012,2015,2960,10121 /Wport /Qeffc++ /QxAVX /Qansi-alias /Qvec-report=0 /Qfma /Qunroll /Qunroll-aggressive /Qopt-streaming-stores:always /Qipp /Qipp-link:dynamic /Qmkl [ Linker ] kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /OUT:"x64\Release/IccTestApp64.exe" /INCREMENTAL:NO /nologo /LIBPATH:"C:\WorkLib\ICC2013\Composer XE 2013\ipp\lib\intel64" /LIBPATH:"C:\WorkLib\ICC2013\Composer XE 2013\compiler\lib\intel64" /MANIFEST /MANIFESTFILE:"x64\Release\IccTestApp64.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /NODEFAULTLIB:"../../Bin/Release/ScaLib64.lib" /TLBID:1 /SUBSYSTEM:CONSOLE /STACK:1000000000 /LARGEADDRESSAWARE /DYNAMICBASE /NXCOMPAT /MACHINE:X64 /qdiag-disable:111,673,2012,2015,2960,10121 /qdiag-sc-dir:"My Inspector XE Results - IccTestApp"

SergeyKostrov · ‎10-11-2016

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // [ 4096x4096 / OpenMP Threads=4 / Release ] //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

SergeyKostrov · ‎10-11-2016

[ *** Microsoft C++ compiler ( VS2008 PE ) 64-bit *** ] [ SHBI 4096x4096 DRP=1st ] Strassen HBI Matrix Size : 4096 x 4096 Matrix Size Threshold : 2048 x 2048 Matrix Partitions : 8 Degree of Recursion : 1 Result Sets Reflection: N/A Calculating... Strassen HBI - Pass 01 - Completed: 0.90500 secs Strassen HBI - Pass 02 - Completed: 0.87300 secs Strassen HBI - Pass 03 - Completed: 0.87400 secs Strassen HBI - Pass 04 - Completed: 0.87300 secs Strassen HBI - Pass 05 - Completed: 0.87400 secs

SergeyKostrov · ‎10-11-2016

[ *** Microsoft C++ compiler ( VS2008 PE ) 64-bit *** ] [ SHBC 4096x4096 DRP=1st ] Strassen HBC Matrix Size : 4096 x 4096 Matrix Size Threshold : 2048 x 2048 Matrix Partitions : 8 Degree of Recursion : 1 Result Sets Reflection: Disabled Calculating... Strassen HBC - Pass 01 - Completed: 0.88900 secs Strassen HBC - Pass 02 - Completed: 0.87400 secs Strassen HBC - Pass 03 - Completed: 0.87300 secs Strassen HBC - Pass 04 - Completed: 0.87400 secs Strassen HBC - Pass 05 - Completed: 0.87400 secs

SergeyKostrov · ‎10-11-2016

[ *** Microsoft C++ compiler ( VS2008 PE ) 64-bit *** ] [ SHBC 4096x4096 DRP=2nd ] Strassen HBC Matrix Size : 4096 x 4096 Matrix Size Threshold : 1024 x 1024 Matrix Partitions : 57 Degree of Recursion : 2 Result Sets Reflection: Enabled Calculating... Strassen HBC - Pass 01 - Completed: 1.02900 secs Strassen HBC - Pass 02 - Completed: 0.96800 secs Strassen HBC - Pass 03 - Completed: 0.98200 secs Strassen HBC - Pass 04 - Completed: 0.96800 secs Strassen HBC - Pass 05 - Completed: 0.98200 secs

SergeyKostrov · ‎10-11-2016

[ *** Microsoft C++ compiler ( VS2008 PE ) 64-bit *** ] [ SHBC 4096x4096 DRP=3rd ] Strassen HBC Matrix Size : 4096 x 4096 Matrix Size Threshold : 512 x 512 Matrix Partitions : 400 Degree of Recursion : 3 Result Sets Reflection: Enabled Calculating... Strassen HBC - Pass 01 - Completed: 1.23300 secs Strassen HBC - Pass 02 - Completed: 1.13900 secs Strassen HBC - Pass 03 - Completed: 1.13800 secs Strassen HBC - Pass 04 - Completed: 1.15500 secs Strassen HBC - Pass 05 - Completed: 1.13900 secs

SergeyKostrov · ‎10-11-2016

[ *** Microsoft C++ compiler ( VS2008 PE ) 64-bit *** ] [ SHBC 4096x4096 DRP=4th ] Strassen HBC Matrix Size : 4096 x 4096 Matrix Size Threshold : 256 x 256 Matrix Partitions : 2801 Degree of Recursion : 4 Result Sets Reflection: Enabled Calculating... Strassen HBC - Pass 01 - Completed: 1.79400 secs Strassen HBC - Pass 02 - Completed: 1.60700 secs Strassen HBC - Pass 03 - Completed: 1.60700 secs Strassen HBC - Pass 04 - Completed: 1.60600 secs Strassen HBC - Pass 05 - Completed: 1.60700 secs

SergeyKostrov · ‎10-11-2016

[ *** Microsoft C++ compiler ( VS2008 PE ) 64-bit *** ] [ SHBC 4096x4096 DRP=5th ] Strassen HBC Matrix Size : 4096 x 4096 Matrix Size Threshold : 128 x 128 Matrix Partitions : 19608 Degree of Recursion : 5 Result Sets Reflection: Enabled Calculating... Strassen HBC - Pass 01 - Completed: 3.91600 secs Strassen HBC - Pass 02 - Completed: 3.41600 secs Strassen HBC - Pass 03 - Completed: 3.40100 secs Strassen HBC - Pass 04 - Completed: 3.41600 secs Strassen HBC - Pass 05 - Completed: 3.41700 secs

SergeyKostrov · ‎10-11-2016

[ *** Microsoft C++ compiler ( VS2008 PE ) 64-bit *** ] [ MKL 4096x4096 ] Cblas SGEMM Matrix Size : 4096 x 4096 Matrix Size Threshold : N/A Matrix Partitions : N/A Degree of Recursion : N/A Result Sets Reflection: N/A Calculating... Cblas SGEMM - Pass 01 - Completed: 0.85800 secs Cblas SGEMM - Pass 02 - Completed: 0.85800 secs Cblas SGEMM - Pass 03 - Completed: 0.85800 secs Cblas SGEMM - Pass 04 - Completed: 0.85800 secs Cblas SGEMM - Pass 05 - Completed: 0.85800 secs

SergeyKostrov · ‎10-11-2016

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // [ 4096x4096 / OpenMP Threads=4 / Release ] //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

SergeyKostrov · ‎10-11-2016

[ *** MinGW C++ compiler v5.1.0 64-bit *** ] [ SHBI 4096x4096 DRP=1st ] Strassen HBI Matrix Size : 4096 x 4096 Matrix Size Threshold : 2048 x 2048 Matrix Partitions : 8 Degree of Recursion : 1 Result Sets Reflection: N/A Calculating... Strassen HBI - Pass 01 - Completed: 0.90500 secs Strassen HBI - Pass 02 - Completed: 0.87400 secs Strassen HBI - Pass 03 - Completed: 0.87400 secs Strassen HBI - Pass 04 - Completed: 0.87300 secs Strassen HBI - Pass 05 - Completed: 0.87400 secs

SergeyKostrov · ‎10-11-2016

[ *** MinGW C++ compiler v5.1.0 64-bit *** ] [ SHBC 4096x4096 DRP=1st ] Strassen HBC Matrix Size : 4096 x 4096 Matrix Size Threshold : 2048 x 2048 Matrix Partitions : 8 Degree of Recursion : 1 Result Sets Reflection: Disabled Calculating... Strassen HBC - Pass 01 - Completed: 0.88900 secs Strassen HBC - Pass 02 - Completed: 0.87300 secs Strassen HBC - Pass 03 - Completed: 0.87400 secs Strassen HBC - Pass 04 - Completed: 0.87400 secs Strassen HBC - Pass 05 - Completed: 0.87300 secs