Performance Evaluation of Classic Matrix Multiplication algorithms

SergeyKostrov — Thu, 04 Aug 2016 16:25:52 GMT

*** Performance Evaluation of Classic Matrix Multiplication algorithms ***

[ Abstract ]

This is one of the most detailed analyses of the performance of the Classic Matrix Multiplication algorithm on different software and hardware platforms.

SergeyKostrov — Thu, 04 Aug 2016 16:29:04 GMT

This is one of the most detailed analyses of the performance of the Classic Matrix Multiplication algorithm.

The list of different versions of the algorithm is as follows:

Classic 2D
Classic 2D LBOT
Classic 2D Fused
Classic 2D Fused LBOT
Classic 2D Transposed
Classic 2D Transposed LBOT
Classic 2D Fused Transposed
Classic 2D Fused Transposed LBOT
Classic 2D SSE2 Transposed v1
Classic 2D SSE2 Transposed v1 LBOT
Classic 2D SSE2 Transposed v2
Classic 2D SSE2 Transposed v2 LBOT
Classic 1D
Classic 1D LBOT

Two sub-versions of each version of the algorithm are evaluated with:

- Loop Processing Schema IJK
- Loop Processing Schema IKJ ( aka Loop Interchange technique )

Performance evaluations are done:

(1) On four computer systems:

Dell Precision Mobile M4700
Dell Dimension 4400
Dell Latitude CPi D300XT
Acer Aspire One ( netbook )

(2) On four Operating Systems:

Windows 95 Pan European 32-bit
Windows 2000 Professional 32-bit SP4
Windows XP Professional 32-bit SP3
Windows 7 Professional 64-bit SP1

(3) With four IDEs:

Visual Studio 98 Professional Edition
Visual Studio 2005 Professional Edition
Visual Studio 2008 Professional Edition
Visual Studio 2008 Express Edition

(4) With twenty-two C++ compilers:

Borland C++ compiler v5.5.1 32-bit
MinGW C++ compiler v3.4.2 32-bit
MinGW C++ compiler v4.8.1 32-bit
MinGW C++ compiler v4.9.2 32-bit
MinGW C++ compiler v4.9.2 64-bit
MinGW C++ compiler v5.1.0 32-bit
MinGW C++ compiler v5.1.0 64-bit
MinGW C++ compiler v6.1.0 32-bit
MinGW C++ compiler v6.1.0 64-bit
Microsoft C++ compiler ( VS98 PE ) 32-bit
Microsoft C++ compiler ( VS2005 PE ) 32-bit
Microsoft C++ compiler ( VS2008 PE ) 32-bit
Microsoft C++ compiler ( VS2008 PE ) 64-bit
Microsoft C++ compiler ( VS2008 EE ) 32-bit
Intel C++ compiler v7.1.0 ( u029 ) 32-bit
Intel C++ compiler v8.1.0 ( u038 ) 32-bit
Intel C++ compiler v12.1.7 ( u371 ) 32-bit
Intel C++ compiler v13.1.0 ( u149 ) 32-bit
Intel C++ compiler v13.1.0 ( u149 ) 64-bit
Watcom C++ compiler v1.9.0 32-bit
Watcom C++ compiler v2.0.0 32-bit
Watcom C++ compiler v2.0.0 64-bit
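
To make the difference between the two Loop Processing Schemas concrete, here is a minimal sketch of a classic multiplication in IJK and IKJ order. It is not the code that was benchmarked: the names mulIJK and mulIKJ, the Matrix alias and the row-major single-buffer layout are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major n x n storage (assumed layout)

// IJK: the innermost loop walks B down a column, i.e. with a stride of n elements.
void mulIJK(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// IKJ ( loop interchange ): the innermost loop walks a row of B and a row of C
// with unit stride, which is usually friendlier to the cache.
void mulIKJ(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    std::fill(c.begin(), c.end(), 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}
```

Both functions compute the same product; only the order in which memory is touched changes.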

SergeyKostrov — Thu, 04 Aug 2016 16:31:07 GMT

[ Watcom C++ compiler v2.0.0 64-bit ]

Even though the compiler and linker have been ported to 64-bit platforms, the generated binaries are still 32-bit!
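
One simple way to check what a toolchain actually targets is to look at the pointer width inside the compiled program. The sketch below is only an illustration: the helper names targetPointerBits and isTarget64Bit are hypothetical, and constexpr needs a C++11 compiler ( on older compilers an ordinary inline function would do the same job ).

```cpp
#include <climits>
#include <cstddef>

// Width of a data pointer, in bits, in the binary being built.
// This reflects the compilation target, not the bitness of the compiler executable.
constexpr std::size_t targetPointerBits() {
    return sizeof(void*) * CHAR_BIT;
}

// True when the generated code uses 64-bit pointers.
constexpr bool isTarget64Bit() {
    return targetPointerBits() == 64;
}
```

A 64-bit hosted compiler that still generates 32-bit code would report 32 here.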

SergeyKostrov — Thu, 04 Aug 2016 16:32:28 GMT

[ List of Abbreviations ]

MM - Matrix Multiplication
C - Classic
LPS - Loop Processing Schema
1D - One Dimensional Input Matrices
2D - Two Dimensional Input Matrices
LB - Loop Blocking ( OT )
LBOT - Loop Blocking Optimization Technique
F - Fused ( OT )
T - Transposed ( OT )
SSE2 - Streaming SIMD Extensions v2
OT - Optimization Technique
PE - Professional Edition ( of Visual Studio )
EE - Express Edition ( of Visual Studio )
P2 - Intel Pentium II
P4 - Intel Pentium 4
IB - Intel Ivy Bridge
AN - Intel Atom N270
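
For readers not familiar with Loop Blocking ( LB ), here is a rough sketch of the general technique: the three loops are split into tiles so that each tile of A, B and C is reused while it is still in the cache. This is a generic textbook form, not the LBOT implementation that was tested; the function name mulBlocked, the parameter block ( tile edge in elements ) and the row-major layout are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major n x n storage (assumed layout)

// Loop-blocked C = A * B. `block` is a hypothetical tile edge length in elements.
void mulBlocked(const Matrix& a, const Matrix& b, Matrix& c,
                std::size_t n, std::size_t block) {
    assert(block > 0);
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    std::fill(c.begin(), c.end(), 0.0f);
    for (std::size_t ii = 0; ii < n; ii += block) {
        for (std::size_t kk = 0; kk < n; kk += block) {
            for (std::size_t jj = 0; jj < n; jj += block) {
                // Clamp the tile so sizes that are not multiples of `block` still work.
                const std::size_t iEnd = std::min(ii + block, n);
                const std::size_t kEnd = std::min(kk + block, n);
                const std::size_t jEnd = std::min(jj + block, n);
                for (std::size_t i = ii; i < iEnd; ++i) {
                    for (std::size_t k = kk; k < kEnd; ++k) {
                        const float aik = a[i * n + k];
                        for (std::size_t j = jj; j < jEnd; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

When block equals n there is a single tile and the code degenerates to a plain IKJ multiplication.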

SergeyKostrov — Thu, 04 Aug 2016 16:34:23 GMT

[ Computer Systems used for performance evaluations ]

** Dell Precision Mobile M4700 **

Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit SP1
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Display resolution: 1366 x 768

** Dell Dimension 4400 **

Intel Pentium 4 ( 1.60 GHz / 1 core )
1GB RAM
Seagate 20GB HDD ( * )
Seagate 3TB HDD ( ** )
EVGA GeForce 6200 512MB DDR2 AGP 8x Video Card
Windows XP Professional 32-bit SP3
Size of L2 Cache = 256KB
Size of L1 Cache = 8KB
Display resolution: 1440 x 990

( * ) Seagate Barracuda 20GB IDE Hard Disk Drive
ST320011A 3.5" 7200 Rpm 2MB Cache IDE Ultra ATA100 / ATA-iV/6
Average Rotational Latency : 4.17 ms
Average Seek Times Read : 9.0ms
Average Seek Times Write : 10.0ms
Maximum Internal Transfer Rate : 69.4MB/sec
Average External Transfer Rate : 100MB/sec ( Read and Write )
Maximum External Transfer Rate : 150MB/sec ( Read )
Note: Barracuda ATA IV Family

( ** ) Seagate Barracuda 3TB IDE Hard Disk Drive
ST3000DM001 3.5" 7200 Rpm 64MB Cache SATA III ( 6Gb/sec )
Average Rotational Latency : 4.16 ms
Average Seek Times Read : 8.5ms
Average Seek Times Write : 9.5ms
Maximum Internal Transfer Rate : 268MB/sec
Average External Transfer Rate : 156MB/sec ( Read and Write )
Maximum External Transfer Rate : 210MB/sec ( Read )

** Dell Latitude CPi D300XT **

Intel Pentium II ( 300 MHz / 1 core )
128MB RAM ( 2x64MB / MT8LDT864HG-6X 144-pin EDO SODIMM 60ns )
6GB HDD
Windows 2000 Professional 32-bit SP4
Size of L2 Cache = 512KB
Size of L1 Cache = 16KB
Display resolution: 1024 x 768

** Acer Aspire One **

Intel Atom N270 ( 1.60 GHz / 2 cores )
1.5GB RAM
CF to ZIF 1.8" HDD SSD IDE Adapter
2GB Compact Flash ( CF ) Card
Windows 95 Pan European 32-bit
Size of L2 Cache = 512KB
Size of L1 Cache = 24KB
Display resolution: 800 x 600

// Memory Settings in System.ini
...
[386Enh]
;
; MaxPhysPage value
; Amount of physical RAM Windows 95 can access
;
MaxPhysPage=32768 ; 823336 KB = 804 MB ( Currently Used )
...

SergeyKostrov — Thu, 04 Aug 2016 16:35:27 GMT

[ OSs used for performance evaluations ]

Windows 95 Pan European 32-bit
Windows 2000 Professional 32-bit SP4
Windows XP Professional 32-bit SP3
Windows 7 Professional 64-bit SP1

SergeyKostrov — Thu, 04 Aug 2016 16:36:38 GMT

[ IDEs used for performance evaluations ]

Visual Studio 98 Professional Edition
Visual Studio 2005 Professional Edition
Visual Studio 2008 Professional Edition
Visual Studio 2008 Express Edition

SergeyKostrov — Thu, 04 Aug 2016 16:37:31 GMT

[ C++ compilers used for performance evaluations ]

Borland C++ compiler v5.5.1 32-bit
MinGW C++ compiler v3.4.2 32-bit
MinGW C++ compiler v4.8.1 32-bit
MinGW C++ compiler v4.9.2 32-bit
MinGW C++ compiler v4.9.2 64-bit
MinGW C++ compiler v5.1.0 32-bit
MinGW C++ compiler v5.1.0 64-bit
MinGW C++ compiler v6.1.0 32-bit
MinGW C++ compiler v6.1.0 64-bit
Microsoft C++ compiler ( VS98 PE ) 32-bit
Microsoft C++ compiler ( VS2005 PE ) 32-bit
Microsoft C++ compiler ( VS2008 PE ) 32-bit
Microsoft C++ compiler ( VS2008 PE ) 64-bit
Microsoft C++ compiler ( VS2008 EE ) 32-bit
Intel C++ compiler v7.1.0 ( u029 ) 32-bit
Intel C++ compiler v8.1.0 ( u038 ) 32-bit
Intel C++ compiler v12.1.7 ( u371 ) 32-bit
Intel C++ compiler v13.1.0 ( u149 ) 32-bit
Intel C++ compiler v13.1.0 ( u149 ) 64-bit
Watcom C++ compiler v1.9.0 32-bit
Watcom C++ compiler v2.0.0 32-bit
Watcom C++ compiler v2.0.0 64-bit

SergeyKostrov — Thu, 04 Aug 2016 16:38:52 GMT

[ Base Performance Evaluations with MKL SGEMM function - CPU AN 32-bit Windows 95 ]

Not completed because an MKL library installation for this platform is no longer available.

SergeyKostrov — Thu, 04 Aug 2016 16:39:29 GMT

[ Base Performance Evaluations with MKL SGEMM function - CPU P2 32-bit Windows 2000 ]

Not completed because an MKL library installation for this platform is no longer available.

SergeyKostrov — Thu, 04 Aug 2016 16:40:37 GMT

[ Base Performance Evaluations with MKL SGEMM function - CPU P4 32-bit Windows XP ]

Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Tests: Start
> Test1153 Start <
Sub-Test 1.1 - Runtime Binding of MKL functions
Dynamic Library mkl_rt.dll Loaded
Initialization Done
Sub-Test 3.2 - MKL Matrix Multiplication
Matrix Multiplication C[ 1024x1024 ] = A[ 1024x1024 ] * B[ 1024x1024 ]
Allocating Memory for Matrices ( 16-byte alignment )
Intializing Matrix Data - Started
Intializing Matrix Data - Completed
Cblas xGEMM
Matrix Size : 1024 x 1024
Matrix Size Threshold : N/A
Matrix Partitions : N/A
Degree of Recursion : N/A
Result Sets Reflection: N/A
Calculating...
Cblas SGEMM - Pass 01 - Completed: 0.53100 secs
Cblas SGEMM - Pass 02 - Completed: 0.51600 secs
Cblas SGEMM - Pass 03 - Completed: 0.51600 secs
Cblas SGEMM - Pass 04 - Completed: 0.51600 secs
Cblas SGEMM - Pass 05 - Completed: 0.51500 secs
Cblas SGEMM - Passed
Deallocating Memory
Dynamic Library mkl_rt.dll Unloaded
> Test1153 End <
Tests: Completed

Application - BccTestApp - WIN32_BCC ( 32-bit ) - Release
Tests: Start
> Test1153 Start <
Sub-Test 1.1 - Runtime Binding of MKL functions
Dynamic Library mkl_rt.dll Loaded
Initialization Done
Sub-Test 3.2 - MKL Matrix Multiplication
Matrix Multiplication C[ 1024x1024 ] = A[ 1024x1024 ] * B[ 1024x1024 ]
Allocating Memory for Matrices ( 16-byte alignment )
Intializing Matrix Data - Started
Intializing Matrix Data - Completed
Cblas xGEMM
Matrix Size : 1024 x 1024
Matrix Size Threshold : N/A
Matrix Partitions : N/A
Degree of Recursion : N/A
Result Sets Reflection: N/A
Calculating...
Cblas SGEMM - Pass 01 - Completed: 0.51500 secs
Cblas SGEMM - Pass 02 - Completed: 0.51500 secs
Cblas SGEMM - Pass 03 - Completed: 0.51600 secs
Cblas SGEMM - Pass 04 - Completed: 0.51600 secs
Cblas SGEMM - Pass 05 - Completed: 0.51600 secs
Cblas SGEMM - Passed
Deallocating Memory
Dynamic Library mkl_rt.dll Unloaded
> Test1153 End <
Tests: Completed

Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1153 Start <
Sub-Test 1.1 - Runtime Binding of MKL functions
Dynamic Library mkl_rt.dll Loaded
Initialization Done
Sub-Test 3.2 - MKL Matrix Multiplication
Matrix Multiplication C[ 1024x1024 ] = A[ 1024x1024 ] * B[ 1024x1024 ]
Allocating Memory for Matrices ( 16-byte alignment )
Intializing Matrix Data - Started
Intializing Matrix Data - Completed
Cblas xGEMM
Matrix Size : 1024 x 1024
Matrix Size Threshold : N/A
Matrix Partitions : N/A
Degree of Recursion : N/A
Result Sets Reflection: N/A
Calculating...
Cblas SGEMM - Pass 01 - Completed: 0.53200 secs
Cblas SGEMM - Pass 02 - Completed: 0.51500 secs
Cblas SGEMM - Pass 03 - Completed: 0.51600 secs
Cblas SGEMM - Pass 04 - Completed: 0.51600 secs
Cblas SGEMM - Pass 05 - Completed: 0.51500 secs
Cblas SGEMM - Passed
Deallocating Memory
Dynamic Library mkl_rt.dll Unloaded
> Test1153 End <
Tests: Completed

Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Tests: Start
> Test1153 Start <
Sub-Test 1.1 - Runtime Binding of MKL functions
Dynamic Library mkl_rt.dll Loaded
Initialization Done
Sub-Test 3.2 - MKL Matrix Multiplication
Matrix Multiplication C[ 1024x1024 ] = A[ 1024x1024 ] * B[ 1024x1024 ]
Allocating Memory for Matrices ( 16-byte alignment )
Intializing Matrix Data - Started
Intializing Matrix Data - Completed
Cblas xGEMM
Matrix Size : 1024 x 1024
Matrix Size Threshold : N/A
Matrix Partitions : N/A
Degree of Recursion : N/A
Result Sets Reflection: N/A
Calculating...
Cblas SGEMM - Pass 01 - Completed: 0.54700 secs
Cblas SGEMM - Pass 02 - Completed: 0.51500 secs
Cblas SGEMM - Pass 03 - Completed: 0.51600 secs
Cblas SGEMM - Pass 04 - Completed: 0.51500 secs
Cblas SGEMM - Pass 05 - Completed: 0.51500 secs
Cblas SGEMM - Passed
Deallocating Memory
Dynamic Library mkl_rt.dll Unloaded
> Test1153 End <
Tests: Completed

Application - WccTestApp - WIN32_WCC ( 32-bit ) - Release
Tests: Start
> Test1153 Start <
Sub-Test 1.1 - Runtime Binding of MKL functions
Dynamic Library mkl_rt.dll Loaded
Initialization Done
Sub-Test 3.2 - MKL Matrix Multiplication
Matrix Multiplication C[ 1024x1024 ] = A[ 1024x1024 ] * B[ 1024x1024 ]
Allocating Memory for Matrices ( 16-byte alignment )
Intializing Matrix Data - Started
Intializing Matrix Data - Completed
Cblas xGEMM
Matrix Size : 1024 x 1024
Matrix Size Threshold : N/A
Matrix Partitions : N/A
Degree of Recursion : N/A
Result Sets Reflection: N/A
Calculating...
Cblas SGEMM - Pass 01 - Completed: 0.54900 secs
Cblas SGEMM - Pass 02 - Completed: 0.51600 secs
Cblas SGEMM - Pass 03 - Completed: 0.51500 secs
Cblas SGEMM - Pass 04 - Completed: 0.51500 secs
Cblas SGEMM - Pass 05 - Completed: 0.51600 secs
Cblas SGEMM - Passed
Deallocating Memory
Dynamic Library mkl_rt.dll Unloaded
> Test1153 End <
Tests: Completed
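
To put these timings in perspective they can be converted to a floating-point rate. The sketch below uses the conventional count of 2 * n^3 operations for an n x n product ( n^3 multiplications plus n^3 additions ); the helper names gemmFlops and gflops are hypothetical and are not part of the test application.

```cpp
#include <cassert>

// Conventional operation count for C = A * B with n x n matrices:
// n^3 multiplications plus n^3 additions.
double gemmFlops(double n) {
    return 2.0 * n * n * n;
}

// Achieved rate in GFLOPS for one pass that took `seconds`.
double gflops(double n, double seconds) {
    assert(seconds > 0.0);
    return gemmFlops(n) / seconds / 1e9;
}
```

For example, gflops(1024, 0.516) is roughly 4.16, so a typical 0.516 secs pass on the Pentium 4 corresponds to about 4.16 GFLOPS under this operation count.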

SergeyKostrov — Thu, 04 Aug 2016 16:41:35 GMT

[ Base Performance Evaluations with MKL SGEMM function - CPU IB 64-bit Windows 7 ]

Application - ScaLibTestApp - WIN64_MSC ( 64-bit ) - Release
Tests: Start
> Test1153 Start <
Sub-Test 1.1 - Runtime Binding of MKL functions
Dynamic Library mkl_rt.dll Loaded
Initialization Done
Sub-Test 3.2 - MKL Matrix Multiplication
Matrix Multiplication C[ 1024x1024 ] = A[ 1024x1024 ] * B[ 1024x1024 ]
Allocating Memory for Matrices ( 16-byte alignment )
Intializing Matrix Data - Started
Intializing Matrix Data - Completed
Cblas xGEMM
Matrix Size : 1024 x 1024
Matrix Size Threshold : N/A
Matrix Partitions : N/A
Degree of Recursion : N/A
Result Sets Reflection: N/A
Calculating...
Cblas SGEMM - Pass 01 - Completed: 0.06100 secs
Cblas SGEMM - Pass 02 - Completed: 0.06600 secs
Cblas SGEMM - Pass 03 - Completed: 0.06600 secs
Cblas SGEMM - Pass 04 - Completed: 0.06600 secs
Cblas SGEMM - Pass 05 - Completed: 0.06500 secs
Cblas SGEMM - Passed
Deallocating Memory
Dynamic Library mkl_rt.dll Unloaded
> Test1153 End <
Tests: Completed

Application - BccTestApp - WIN32_BCC ( 32-bit ) - Release
Tests: Start
> Test1153 Start <
Sub-Test 1.1 - Runtime Binding of MKL functions
Dynamic Library mkl_rt.dll Loaded
Initialization Done
Sub-Test 3.2 - MKL Matrix Multiplication
Matrix Multiplication C[ 1024x1024 ] = A[ 1024x1024 ] * B[ 1024x1024 ]
Allocating Memory for Matrices ( 16-byte alignment )
Intializing Matrix Data - Started
Intializing Matrix Data - Completed
Cblas xGEMM
Matrix Size : 1024 x 1024
Matrix Size Threshold : N/A
Matrix Partitions : N/A
Degree of Recursion : N/A
Result Sets Reflection: N/A
Calculating...
Cblas SGEMM - Pass 01 - Completed: 0.06500 secs
Cblas SGEMM - Pass 02 - Completed: 0.06500 secs
Cblas SGEMM - Pass 03 - Completed: 0.06600 secs
Cblas SGEMM - Pass 04 - Completed: 0.06600 secs
Cblas SGEMM - Pass 05 - Completed: 0.06600 secs
Cblas SGEMM - Passed
Deallocating Memory
Dynamic Library mkl_rt.dll Unloaded
> Test1153 End <
Tests: Completed

Application - IccTestApp - WIN64_ICC ( 64-bit ) - Release
Tests: Start
> Test1153 Start <
Sub-Test 1.1 - Runtime Binding of MKL functions
Dynamic Library mkl_rt.dll Loaded
Initialization Done
Sub-Test 3.2 - MKL Matrix Multiplication
Matrix Multiplication C[ 1024x1024 ] = A[ 1024x1024 ] * B[ 1024x1024 ]
Allocating Memory for Matrices ( 16-byte alignment )
Intializing Matrix Data - Started
Intializing Matrix Data - Completed
Cblas xGEMM
Matrix Size : 1024 x 1024
Matrix Size Threshold : N/A
Matrix Partitions : N/A
Degree of Recursion : N/A
Result Sets Reflection: N/A
Calculating...
Cblas SGEMM - Pass 01 - Completed: 0.06200 secs
Cblas SGEMM - Pass 02 - Completed: 0.06500 secs
Cblas SGEMM - Pass 03 - Completed: 0.06600 secs
Cblas SGEMM - Pass 04 - Completed: 0.06600 secs
Cblas SGEMM - Pass 05 - Completed: 0.06500 secs
Cblas SGEMM - Passed
Deallocating Memory
Dynamic Library mkl_rt.dll Unloaded
> Test1153 End <
Tests: Completed

Application - MgwTestApp - WIN64_MGW ( 64-bit ) - Release
Tests: Start
> Test1153 Start <
Sub-Test 1.1 - Runtime Binding of MKL functions
Dynamic Library mkl_rt.dll Loaded
Initialization Done
Sub-Test 3.2 - MKL Matrix Multiplication
Matrix Multiplication C[ 1024x1024 ] = A[ 1024x1024 ] * B[ 1024x1024 ]
Allocating Memory for Matrices ( 16-byte alignment )
Intializing Matrix Data - Started
Intializing Matrix Data - Completed
Cblas xGEMM
Matrix Size : 1024 x 1024
Matrix Size Threshold : N/A
Matrix Partitions : N/A
Degree of Recursion : N/A
Result Sets Reflection: N/A
Calculating...
Cblas SGEMM - Pass 01 - Completed: 0.06700 secs
Cblas SGEMM - Pass 02 - Completed: 0.06500 secs
Cblas SGEMM - Pass 03 - Completed: 0.06600 secs
Cblas SGEMM - Pass 04 - Completed: 0.06500 secs
Cblas SGEMM - Pass 05 - Completed: 0.06500 secs
Cblas SGEMM - Passed
Deallocating Memory
Dynamic Library mkl_rt.dll Unloaded
> Test1153 End <
Tests: Completed

Application - WccTestApp - WIN32_WCC ( 32-bit ) - Release
Tests: Start
> Test1153 Start <
Sub-Test 1.1 - Runtime Binding of MKL functions
Dynamic Library mkl_rt.dll Loaded
Initialization Done
Sub-Test 3.2 - MKL Matrix Multiplication
Matrix Multiplication C[ 1024x1024 ] = A[ 1024x1024 ] * B[ 1024x1024 ]
Allocating Memory for Matrices ( 16-byte alignment )
Intializing Matrix Data - Started
Intializing Matrix Data - Completed
Cblas xGEMM
Matrix Size : 1024 x 1024
Matrix Size Threshold : N/A
Matrix Partitions : N/A
Degree of Recursion : N/A
Result Sets Reflection: N/A
Calculating...
Cblas SGEMM - Pass 01 - Completed: 0.06900 secs
Cblas SGEMM - Pass 02 - Completed: 0.06600 secs
Cblas SGEMM - Pass 03 - Completed: 0.06500 secs
Cblas SGEMM - Pass 04 - Completed: 0.06500 secs
Cblas SGEMM - Pass 05 - Completed: 0.06600 secs
Cblas SGEMM - Passed
Deallocating Memory
Dynamic Library mkl_rt.dll Unloaded
> Test1153 End <
Tests: Completed
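
The logs report five passes per run, and the first pass sometimes differs from the later ones ( for example 0.06100 secs against 0.06600 secs ). How the test application measures time is not shown in this thread; the sketch below is just one possible way to time several passes separately, using std::chrono from C++11. The name timePasses is hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs `work` `passes` times and returns the wall-clock seconds of each pass.
template <typename Work>
std::vector<double> timePasses(Work work, int passes) {
    assert(passes > 0);
    std::vector<double> seconds;
    seconds.reserve(static_cast<std::size_t>(passes));
    for (int p = 0; p < passes; ++p) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}
```

Keeping every pass, instead of only an average, makes warm-up effects in the first pass visible.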

SergeyKostrov — Thu, 04 Aug 2016 16:42:39 GMT

[ Microsoft C++ compiler ( VS98 PE ) - Release - 32-bit ( LPS: IJK ) - CPU AN 32-bit Windows 95 ]

Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 1024 x 1024
Loop Processing Schema ( LPS ): IJK
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D
LBOT size: N/A
Completed: 140.56801 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT
LBOT size: 1024x1024 elements
Completed: 136.45601 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused
LBOT size: N/A
Completed: 145.31301 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT
LBOT size: 1024x1024 elements
Completed: 142.82801 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed
LBOT size: N/A
Completed: 5.08100 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 5.31400 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed
LBOT size: N/A
Completed: 5.61700 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 5.94600 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D
LBOT size: N/A
Completed: 136.55101 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT
LBOT size: 1024x1024 elements
Completed: 136.57901 secs
> Test1099 End <
Tests: Completed
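
With the IJK schema the Transposed versions are far faster than the plain ones here ( 5.08100 secs against 140.56801 secs ). A likely reason is that once B is transposed, the innermost loop reads both operands with unit stride. The sketch below shows the idea only; it is not the MxMultB1 code, the names transpose and mulTransposedIJK and the row-major layout are assumptions, and whether the measured times include the transposition itself is not stated in the logs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major n x n storage (assumed layout)

// Returns the transpose of an n x n matrix.
Matrix transpose(const Matrix& m, std::size_t n) {
    assert(m.size() == n * n);
    Matrix t(n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            t[j * n + i] = m[i * n + j];
        }
    }
    return t;
}

// IJK multiplication where `bt` already holds B transposed:
// row i of A and row j of bt are both read sequentially in the inner loop.
void mulTransposedIJK(const Matrix& a, const Matrix& bt, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && bt.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * bt[j * n + k];
            }
            c[i * n + j] = sum;
        }
    }
}
```

Typical use is mulTransposedIJK(a, transpose(b, n), c, n); the transposition costs O( n^2 ) against O( n^3 ) for the product.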

SergeyKostrov — Thu, 04 Aug 2016 16:43:26 GMT

[ Microsoft C++ compiler ( VS98 PE ) - Release - 32-bit ( LPS: IKJ ) - CPU AN 32-bit Windows 95 ]

Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 1024 x 1024
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D
LBOT size: N/A
Completed: 9.87500 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT
LBOT size: 1024x1024 elements
Completed: 9.44900 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused
LBOT size: N/A
Completed: 9.73700 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT
LBOT size: 1024x1024 elements
Completed: 9.75100 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed
LBOT size: N/A
Completed: 147.64801 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 147.68901 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed
LBOT size: N/A
Completed: 146.48101 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 154.74801 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D
LBOT size: N/A
Completed: 9.44800 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT
LBOT size: 1024x1024 elements
Completed: 9.46300 secs
> Test1099 End <
Tests: Completed

SergeyKostrov — Thu, 04 Aug 2016 16:54:39 GMT

[ Microsoft C++ compiler ( VS98 PE ) - Release - 32-bit ( LPS: IJK ) - CPU P2 32-bit Windows 2000 ]

Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 1024 x 1024
Loop Processing Schema ( LPS ): IJK
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D
LBOT size: N/A
Completed: 253.86501 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT
LBOT size: 1024x1024 elements
Completed: 253.85501 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused
LBOT size: N/A
Completed: 256.85901 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT
LBOT size: 1024x1024 elements
Completed: 257.74001 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed
LBOT size: N/A
Completed: 48.61000 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 59.95600 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed
LBOT size: N/A
Completed: 72.07300 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 72.43400 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D
LBOT size: N/A
Completed: 258.42101 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT
LBOT size: 1024x1024 elements
Completed: 258.35201 secs
> Test1099 End <
Tests: Completed

SergeyKostrov — Thu, 04 Aug 2016 16:55:00 GMT

[ Intel C++ compiler v7.1.0 ( u029 ) - Release - 32-bit ( LPS: IJK ) - CPU P2 32-bit Windows 2000 ]

Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 1024 x 1024
Loop Processing Schema ( LPS ): IJK
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D
LBOT size: N/A
Completed: 254.23501 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT
LBOT size: 1024x1024 elements
Completed: 281.93501 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused
LBOT size: N/A
Completed: 254.79601 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT
LBOT size: 1024x1024 elements
Completed: 255.33701 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed
LBOT size: N/A
Completed: 47.97900 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 60.25600 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed
LBOT size: N/A
Completed: 72.31400 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 72.74500 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D
LBOT size: N/A
Completed: 272.31201 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT
LBOT size: 1024x1024 elements
Completed: 273.65301 secs
> Test1099 End <
Tests: Completed

SergeyKostrov — Thu, 04 Aug 2016 16:55:21 GMT

[ Microsoft C++ compiler ( VS98 PE ) - Release - 32-bit ( LPS: IKJ ) - CPU P2 32-bit Windows 2000 ]

Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 1024 x 1024
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D
LBOT size: N/A
Completed: 59.51500 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT
LBOT size: 1024x1024 elements
Completed: 59.54500 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused
LBOT size: N/A
Completed: 98.13100 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT
LBOT size: 1024x1024 elements
Completed: 98.14100 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed
LBOT size: N/A
Completed: 254.30601 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 254.62601 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed
LBOT size: N/A
Completed: 256.21801 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 255.96901 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D
LBOT size: N/A
Completed: 59.69600 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT
LBOT size: 1024x1024 elements
Completed: 59.68600 secs
> Test1099 End <
Tests: Completed

SergeyKostrov — Thu, 04 Aug 2016 16:56:00 GMT

[ Intel C++ compiler v7.1.0 ( u029 ) - Release - 32-bit ( LPS: IKJ ) - CPU P2 32-bit Windows 2000 ]

Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 1024 x 1024
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D
LBOT size: N/A
Completed: 60.21600 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT
LBOT size: 1024x1024 elements
Completed: 59.84600 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused
LBOT size: N/A
Completed: 72.53500 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT
LBOT size: 1024x1024 elements
Completed: 72.52500 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed
LBOT size: N/A
Completed: 254.90701 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 254.93701 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed
LBOT size: N/A
Completed: 256.24901 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 256.48901 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D
LBOT size: N/A
Completed: 59.45600 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT
LBOT size: 1024x1024 elements
Completed: 59.48500 secs
> Test1099 End <
Tests: Completed

SergeyKostrov — Thu, 04 Aug 2016 16:57:00 GMT

[ Intel C++ compiler v8.1.0 ( u038 ) - Release - 32-bit ( LPS: IJK ) - CPU P2 32-bit Windows 2000 ]

Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 1024 x 1024
Loop Processing Schema ( LPS ): IJK
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D
LBOT size: N/A
Completed: 253.37400 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT
LBOT size: 1024x1024 elements
Completed: 253.12400 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused
LBOT size: N/A
Completed: 254.65600 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT
LBOT size: 1024x1024 elements
Completed: 255.29700 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed
LBOT size: N/A
Completed: 47.44800 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 48.89000 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed
LBOT size: N/A
Completed: 72.33400 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 72.35400 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D
LBOT size: N/A
Completed: 249.90900 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT
LBOT size: 1024x1024 elements
Completed: 249.90900 secs
> Test1099 End <
Tests: Completed

SergeyKostrov — Thu, 04 Aug 2016 17:03:04 GMT

[ Intel C++ compiler v8.1.0 ( u038 ) - Release - 32-bit ( LPS: IKJ ) - CPU P2 32-bit Windows 2000 ]

Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 1024 x 1024
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D
LBOT size: N/A
Completed: 60.24600 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT
LBOT size: 1024x1024 elements
Completed: 59.78600 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused
LBOT size: N/A
Completed: 79.53400 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT
LBOT size: 1024x1024 elements
Completed: 79.54400 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed
LBOT size: N/A
Completed: 253.84500 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 254.01600 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed
LBOT size: N/A
Completed: 255.91800 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT
LBOT size: 1024x1024 elements
Completed: 255.87800 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D
LBOT size: N/A
Completed: 59.30500 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT
LBOT size: 1024x1024 elements
Completed: 59.29500 secs
> Test1099 End <
Tests: Completed