
Performance of Portable Algorithms

SergeyKostrov
Valued Contributor II
512 Views
*** Performance of Portable Algorithms ***

[ Abstract ]

There is a great deal of attention to the performance of Portable Algorithms. Since a Portable Algorithm can work on many platforms, that is, Operating Systems running on different Hardware, its performance is very important because it can vary from platform to platform. Here are two examples:

Example A: If the difference in performance for an Algorithm A on Platforms PA and PB running on the same hardware (!) is less than 5 percent, then Algorithm A could be qualified as a 'Performance Portable Algorithm'.

Example B: If the difference in performance for an Algorithm B on Platforms PA and PB running on the same hardware (!) is greater than 5 percent, then Algorithm B could Not be qualified as a 'Performance Portable Algorithm'.

When speaking about Performance Portable Algorithms, it is very important to evaluate performance on the same hardware, since even a small difference in hardware could skew performance numbers.
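The 5 percent rule from Examples A and B can be sketched as a small check. This is only an illustration under stated assumptions: the post does not say how the "difference in performance" is computed, so the sketch assumes a relative difference of wall-clock times measured against the faster platform, and it treats a difference of exactly 5 percent as not qualifying. The function names and the timings in the usage lines are hypothetical.

```python
def relative_difference(time_a, time_b):
    """Relative difference of two timings, measured against the faster one.

    Assumption: the post does not define the metric; this is one plausible choice.
    """
    fast, slow = sorted((time_a, time_b))
    return (slow - fast) / fast


def is_performance_portable(time_pa, time_pb, threshold=0.05):
    """True if the timings on Platforms PA and PB differ by less than the threshold.

    Assumption: a difference of exactly 5 percent does not qualify.
    """
    return relative_difference(time_pa, time_pb) < threshold


# Hypothetical timings in seconds, same hardware, two operating systems.
print(is_performance_portable(1.00, 1.03))  # 3 percent apart
print(is_performance_portable(1.00, 1.20))  # 20 percent apart
```

Both timings must come from the same hardware for the check to mean anything, as the abstract stresses.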
11 Replies
SergeyKostrov
Valued Contributor II
513 Views
Note: In "...If a difference in performance for an Algorithm A on Platforms PA and PB running...", Platforms PA and PB use the same Hardware P but different Operating Systems A and B.
SergeyKostrov
Valued Contributor II
513 Views
[ List of Abbreviations ]

MM - Matrix Multiplication
MMOp - Matrix Multiplication Operation
SAMM - Strassen Algorithm of Matrix Multiplication
SHBI - Strassen Heap Based Incomplete
SHBC - Strassen Heap Based Complete
DRP - Degree of Recursive Processing
BMMO - Basic Matrix Multiplication Operation ( in context of SAMM )
APT - Accelerated Processing Technique
FPU - Floating Point Unit
SP - Single Precision
DP - Double Precision

ABC4M7T2 - Strassen Schema ( SS ) of MM ( Classic / Stack Based )
ABCST4PU7 - Winograd Schema ( WS ) of MM ( Classic / Stack Based / Reduced number of MMOps )
AB2M7 - Kostrov Schema ( KS ) of MM ( Classic Modified / Heap Based / Memory Efficient )
SergeyKostrov
Valued Contributor II
513 Views
[ Computer System used for performance evaluations ]

** Dell Precision Mobile M4700 ( DPMM4700 ) **

Intel Core i7-3840QM ( 2.80 GHz ) Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit SP1
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Display resolution: 1366 x 768

and

** Dell Dimension 4400 ( DD4400 ) **

Intel Pentium 4 ( 1.60 GHz / 1 core )
1GB RAM
Seagate 20GB HDD ( * )
Seagate 3TB HDD ( ** )
EVGA GeForce 6200 Video Card ( 512MB DDR2 / AGP 8x )
Windows XP Professional 32-bit SP3
Size of L2 Cache = 256KB
Size of L1 Cache = 8KB
Display resolution: 1440 x 990

( * ) Seagate Barracuda 20GB IDE Hard Disk Drive
ST320011A 3.5" 7200 Rpm 2MB Cache IDE Ultra ATA100 / ATA-iV/6
Average Rotational Latency : 4.17 ms
Average Seek Times Read : 9.0ms
Average Seek Times Write : 10.0ms
Maximum Internal Transfer Rate : 69.4MB/sec
Average External Transfer Rate : 100MB/sec ( Read and Write )
Maximum External Transfer Rate : 150MB/sec ( Read )
Note: Barracuda ATA IV Family

( ** ) Seagate Barracuda 3TB IDE Hard Disk Drive
ST3000DM001 3.5" 7200 Rpm 64MB Cache SATA III ( 6Gb/sec )
Average Rotational Latency : 4.16 ms
Average Seek Times Read : 8.5ms
Average Seek Times Write : 9.5ms
Maximum Internal Transfer Rate : 268MB/sec
Average External Transfer Rate : 156MB/sec ( Read and Write )
Maximum External Transfer Rate : 210MB/sec ( Read )
SergeyKostrov
Valued Contributor II
513 Views
[ Test-Case 1 - SHBC - DRP is 1 ( Non-APT / DD4400 ) ]

Strassen HBC
Matrix Size : 1024 x 1024
Matrix Size Threshold : 512 x 512
Matrix Partitions : 8
Degree of Recursion : 1
Result Sets Reflection: Disabled
Calculating...
Strassen HBC - Pass 01 - Completed: 2.32800 secs
Strassen HBC - Pass 02 - Completed: 2.31300 secs
Strassen HBC - Pass 03 - Completed: 2.31200 secs
Strassen HBC - Pass 04 - Completed: 2.31300 secs
Strassen HBC - Pass 05 - Completed: 2.31200 secs
SergeyKostrov
Valued Contributor II
513 Views
[ Test-Case 2 - SHBC - DRP is 2 ( Non-APT / DD4400 ) ]

Strassen HBC
Matrix Size : 1024 x 1024
Matrix Size Threshold : 256 x 256
Matrix Partitions : 57
Degree of Recursion : 2
Result Sets Reflection: Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 1.35900 secs
Strassen HBC - Pass 02 - Completed: 1.26600 secs
Strassen HBC - Pass 03 - Completed: 1.28100 secs
Strassen HBC - Pass 04 - Completed: 1.26600 secs
Strassen HBC - Pass 05 - Completed: 1.28100 secs
SergeyKostrov
Valued Contributor II
513 Views
[ Test-Case 3 - SHBC - DRP is 3 ( Non-APT / DD4400 ) ]

Strassen HBC
Matrix Size : 1024 x 1024
Matrix Size Threshold : 128 x 128
Matrix Partitions : 400
Degree of Recursion : 3
Result Sets Reflection: Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 0.98500 secs
Strassen HBC - Pass 02 - Completed: 0.98400 secs
Strassen HBC - Pass 03 - Completed: 0.96900 secs
Strassen HBC - Pass 04 - Completed: 0.98500 secs
Strassen HBC - Pass 05 - Completed: 0.96900 secs
SergeyKostrov
Valued Contributor II
513 Views
[ Test-Case 4 - SHBC - DRP is 4 ( Non-APT / DD4400 ) ]

Strassen HBC
Matrix Size : 1024 x 1024
Matrix Size Threshold : 64 x 64
Matrix Partitions : 2801
Degree of Recursion : 4
Result Sets Reflection: Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 1.40600 secs
Strassen HBC - Pass 02 - Completed: 1.21900 secs
Strassen HBC - Pass 03 - Completed: 1.20300 secs
Strassen HBC - Pass 04 - Completed: 1.21900 secs
Strassen HBC - Pass 05 - Completed: 1.21900 secs
SergeyKostrov
Valued Contributor II
513 Views
[ Test-Case 5 - SHBC - DRP is 5 ( Non-APT / DD4400 ) ]

Strassen HBC
Matrix Size : 1024 x 1024
Matrix Size Threshold : 32 x 32
Matrix Partitions : 19608
Degree of Recursion : 5
Result Sets Reflection: Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 2.31300 secs
Strassen HBC - Pass 02 - Completed: 1.73400 secs
Strassen HBC - Pass 03 - Completed: 1.71900 secs
Strassen HBC - Pass 04 - Completed: 1.71800 secs
Strassen HBC - Pass 05 - Completed: 1.71900 secs
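To compare Test-Cases 1 to 5 side by side, the pass timings can be reduced to one number per DRP. The sketch below copies the timings from the outputs above and takes the fastest pass as the summary. Using the minimum rather than the mean is an assumption of this sketch, not something the post specifies, and the variable names are hypothetical.

```python
# Pass timings in seconds, copied from Test-Cases 1-5 ( SHBC / Non-APT / DD4400 ).
timings_by_drp = {
    1: [2.328, 2.313, 2.312, 2.313, 2.312],
    2: [1.359, 1.266, 1.281, 1.266, 1.281],
    3: [0.985, 0.984, 0.969, 0.985, 0.969],
    4: [1.406, 1.219, 1.203, 1.219, 1.219],
    5: [2.313, 1.734, 1.719, 1.718, 1.719],
}

# Summarize each DRP by its fastest pass (an assumption of this sketch).
best_by_drp = {drp: min(passes) for drp, passes in timings_by_drp.items()}

# The DRP whose fastest pass is the smallest overall.
best_drp = min(best_by_drp, key=best_by_drp.get)

for drp, secs in sorted(best_by_drp.items()):
    print(f"DRP {drp}: best pass {secs:.3f} secs")
print(f"Fastest DRP: {best_drp}")
```

On these numbers the time falls from DRP 1 to DRP 3 and rises again at DRP 4 and 5, so deeper recursion does not keep paying off on the DD4400.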
SergeyKostrov
Valued Contributor II
513 Views
[ Test-Case 6 - CBLAS SGEMM ( MKL / SP / DD4400 ) ]

Cblas xGEMM
Matrix Size : 1024 x 1024
Matrix Size Threshold : N/A
Matrix Partitions : N/A
Degree of Recursion : N/A
Result Sets Reflection: N/A
Calculating...
Cblas xGEMM - Pass 01 - Completed: 0.51600 secs
Cblas xGEMM - Pass 02 - Completed: 0.51500 secs
Cblas xGEMM - Pass 03 - Completed: 0.51600 secs
Cblas xGEMM - Pass 04 - Completed: 0.51500 secs
Cblas xGEMM - Pass 05 - Completed: 0.51600 secs

Note: It takes ~47% less time than the best result for SAMM with 3rd DRP ( 0.515 secs versus 0.969 secs ).
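The comparison in the note can be checked with a few lines of arithmetic. The sketch below uses the best pass of Test-Case 3 and the best pass of Test-Case 6 and reports the result two ways, as a reduction in elapsed time and as a speedup factor, since "faster" can be read either way. The helper names are hypothetical.

```python
best_samm_secs = 0.969   # fastest pass of Test-Case 3 ( SHBC / DRP 3 )
best_sgemm_secs = 0.515  # fastest pass of Test-Case 6 ( CBLAS SGEMM )


def time_reduction(baseline, candidate):
    """Fraction of the baseline time saved by the candidate."""
    return (baseline - candidate) / baseline


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


reduction = time_reduction(best_samm_secs, best_sgemm_secs)
factor = speedup(best_samm_secs, best_sgemm_secs)

print(f"Time reduction: {reduction:.1%}")
print(f"Speedup factor: {factor:.2f}x")
```

The reduction comes out at roughly 47 percent, which corresponds to a speedup factor of roughly 1.88.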
SergeyKostrov
Valued Contributor II
513 Views
[ Relations between Degree of Recursive Processing ( DRP ) and number of BMMOps ]

Polynomial representations of these Relations are as follows:

1st Degree: 7 BMMOps = 7^1 = 7
2nd Degree: 56 BMMOps = 7^2 + 7^1 = 49 + 7
3rd Degree: 399 BMMOps = 7^3 + 7^2 + 7^1 = 343 + 49 + 7
4th Degree: 2800 BMMOps = 7^4 + 7^3 + 7^2 + 7^1 = 2401 + 343 + 49 + 7
5th Degree: 19607 BMMOps = 7^5 + 7^4 + 7^3 + 7^2 + 7^1 = 16807 + 2401 + 343 + 49 + 7

That is, the Number of BMMOps for the Nth DRP is the sum of 7 to the power of K, where K goes from 1 to N.

Take into account that a BMMO is Not a basic multiplication operation on the FPU, SP or DP; it is a multiplication of two partitioned ( sub-divided ) matrices. The products of all these matrix multiplications are combined ( see Wikipedia on how SAMM works ) to get the final product of the two source matrices.
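The relation above can be reproduced with a short sketch. It is not the SHBC implementation discussed in this thread: it is a plain textbook Strassen recursion in pure Python, written only to count sub-matrix multiplications. The names `bmmop_count`, `strassen`, and the `depth` and `counter` parameters are hypothetical, and counting one BMMO per sub-matrix product at every recursion level is an assumption chosen because it matches the figures above.

```python
def bmmop_count(drp):
    """Sum of 7^k for k = 1..drp, equal to (7^(drp+1) - 7) / 6."""
    return (7 ** (drp + 1) - 7) // 6


def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mat_sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def classic_mul(a, b):
    """Classic row-by-column multiplication, used once recursion stops."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def split(m):
    """Partition a square matrix of even size into four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b, depth, counter):
    """Multiply a by b using `depth` levels of Strassen recursion.

    counter[0] is incremented once per sub-matrix product (one BMMO).
    """
    if depth == 0:
        return classic_mul(a, b)
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)

    def mul(x, y):
        counter[0] += 1
        return strassen(x, y, depth - 1, counter)

    # The seven Strassen products.
    m1 = mul(mat_add(a11, a22), mat_add(b11, b22))
    m2 = mul(mat_add(a21, a22), b11)
    m3 = mul(a11, mat_sub(b12, b22))
    m4 = mul(a22, mat_sub(b21, b11))
    m5 = mul(mat_add(a11, a12), b22)
    m6 = mul(mat_sub(a21, a11), mat_add(b11, b12))
    m7 = mul(mat_sub(a12, a22), mat_add(b21, b22))

    # Combine the products into the four quadrants of the result.
    c11 = mat_add(mat_sub(mat_add(m1, m4), m5), m7)
    c12 = mat_add(m3, m5)
    c21 = mat_add(m2, m4)
    c22 = mat_add(mat_add(mat_sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


# Small 8 x 8 integer matrices so that 3 levels of recursion reach 1 x 1 blocks.
a = [[(3 * i + j) % 7 for j in range(8)] for i in range(8)]
b = [[(i + 2 * j) % 5 for j in range(8)] for i in range(8)]
counter = [0]
product = strassen(a, b, 3, counter)
print(counter[0], bmmop_count(3), product == classic_mul(a, b))
```

The Matrix Partitions values reported in Test-Cases 1 to 5 ( 8, 57, 400, 2801, 19608 ) are each one more than the corresponding BMMOps count; whether the extra one stands for the top-level multiplication is a guess, not something the test output states.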
SergeyKostrov
Valued Contributor II
513 Views
[ Hardware Issues ]

** Seagate Barracuda 20GB IDE Hard Disk Drive **

ST320011A 3.5" 7200 Rpm 2MB Cache IDE Ultra ATA100 / ATA-iV/6
Average Latency : x.xx ms
Average Seek Times Read : 18.0ms
Average Seek Times Write : 24.9ms
Maximum Internal Transfer Rate : 56.9MB/sec
Average External Transfer Rate : 100MB/sec ( Read and Write )
Maximum External Transfer Rate : 150MB/sec ( Read )

** Seagate Barracuda 3TB IDE Hard Disk Drive **

ST3000DM001 3.5" 7200 Rpm 64MB Cache SATA III ( 6Gb/sec )
Average Latency : 4.16 ms
Average Seek Times Read : 8.5ms
Average Seek Times Write : 9.5ms
Maximum Internal Transfer Rate : 268MB/sec
Average External Transfer Rate : 156MB/sec ( Read and Write )
Maximum External Transfer Rate : 210MB/sec ( Read )