Performance of Portable Algorithms

SergeyKostrov
*** Performance of Portable Algorithms ***

[ Abstract ]

The performance of Portable Algorithms receives a great deal of attention. Since a Portable Algorithm can run on many platforms, that is, Operating Systems running on different Hardware, its performance matters because it can vary from platform to platform. Here are two examples:

Example A: If the difference in performance of an Algorithm A on Platforms PA and PB running on the same hardware (!) is less than 5 percent, then Algorithm A can be qualified as a 'Performance Portable Algorithm'.

Example B: If the difference in performance of an Algorithm B on Platforms PA and PB running on the same hardware (!) is greater than 5 percent, then Algorithm B can Not be qualified as a 'Performance Portable Algorithm'.

When speaking about Performance Portable Algorithms, it is very important to evaluate performance on the same hardware, since even a small difference in hardware could skew the performance numbers.
Note: In the examples above, Platforms PA and PB use the same Hardware P but different Operating Systems A and B.
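The 5 percent criterion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the thread does not say how the difference is computed, so the sketch measures it relative to the faster of the two timings, and the names `relative_difference` and `is_performance_portable` are hypothetical. The thread also leaves the case of exactly 5 percent undefined; here it is treated as not portable.

```python
def relative_difference(time_a: float, time_b: float) -> float:
    """Relative gap between two timings, measured against the faster one (an assumption)."""
    if time_a <= 0 or time_b <= 0:
        raise ValueError("timings must be positive")
    fastest = min(time_a, time_b)
    return abs(time_a - time_b) / fastest


def is_performance_portable(time_a: float, time_b: float, threshold: float = 0.05) -> bool:
    """True when timings on OS A and OS B (same hardware) differ by less than threshold."""
    return relative_difference(time_a, time_b) < threshold


# Hypothetical timings in seconds, not measurements from this thread.
print(is_performance_portable(1.00, 1.03))  # gap of about 3 percent
print(is_performance_portable(1.00, 1.20))  # gap of about 20 percent
```

Both timings are assumed to come from the same machine, for example a best-of-five run under each Operating System.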
[ List of Abbreviations ]

MM - Matrix Multiplication
MMOp - Matrix Multiplication Operation
SAMM - Strassen Algorithm of Matrix Multiplication
SHBI - Strassen Heap Based Incomplete
SHBC - Strassen Heap Based Complete
DRP - Degree of Recursive Processing
BMMO - Basic Matrix Multiplication Operation ( in context of SAMM )
APT - Accelerated Processing Technique
FPU - Floating Point Unit
SP - Single Precision
DP - Double Precision
ABC4M7T2 - Strassen Schema ( SS ) of MM ( Classic / Stack Based )
ABCST4PU7 - Winograd Schema ( WS ) of MM ( Classic / Stack Based / Reduced number of MMOps )
AB2M7 - Kostrov Schema ( KS ) of MM ( Classic Modified / Heap Based / Memory Efficient )
[ Computer Systems used for performance evaluations ]

** Dell Precision Mobile M4700 ( DPMM4700 ) **
Intel Core i7-3840QM ( 2.80 GHz ) Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit SP1
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Display resolution: 1366 x 768

and

** Dell Dimension 4400 ( DD4400 ) **
Intel Pentium 4 ( 1.60 GHz / 1 core )
1GB RAM
Seagate 20GB HDD ( * )
Seagate 3TB HDD ( ** )
EVGA GeForce 6200 Video Card 512MB DDR2 AGP 8x
Windows XP Professional 32-bit SP3
Size of L2 Cache = 256KB
Size of L1 Cache = 8KB
Display resolution: 1440 x 990

( * ) Seagate Barracuda 20GB IDE Hard Disk Drive
ST320011A 3.5" 7200 Rpm 2MB Cache IDE Ultra ATA100 / ATA-iV/6
Average Rotational Latency : 4.17 ms
Average Seek Times Read : 9.0ms
Average Seek Times Write : 10.0ms
Maximum Internal Transfer Rate : 69.4MB/sec
Average External Transfer Rate : 100MB/sec ( Read and Write )
Maximum External Transfer Rate : 150MB/sec ( Read )
Note: Barracuda ATA IV Family

( ** ) Seagate Barracuda 3TB Hard Disk Drive
ST3000DM001 3.5" 7200 Rpm 64MB Cache SATA III ( 6Gb/sec )
Average Rotational Latency : 4.16 ms
Average Seek Times Read : 8.5ms
Average Seek Times Write : 9.5ms
Maximum Internal Transfer Rate : 268MB/sec
Average External Transfer Rate : 156MB/sec ( Read and Write )
Maximum External Transfer Rate : 210MB/sec ( Read )
[ Test-Case 1 - SHBC - DRP is 1 ( Non-APT / DD4400 ) ]

Strassen HBC
Matrix Size            : 1024 x 1024
Matrix Size Threshold  : 512 x 512
Matrix Partitions      : 8
Degree of Recursion    : 1
Result Sets Reflection : Disabled
Calculating...
Strassen HBC - Pass 01 - Completed: 2.32800 secs
Strassen HBC - Pass 02 - Completed: 2.31300 secs
Strassen HBC - Pass 03 - Completed: 2.31200 secs
Strassen HBC - Pass 04 - Completed: 2.31300 secs
Strassen HBC - Pass 05 - Completed: 2.31200 secs
[ Test-Case 2 - SHBC - DRP is 2 ( Non-APT / DD4400 ) ]

Strassen HBC
Matrix Size            : 1024 x 1024
Matrix Size Threshold  : 256 x 256
Matrix Partitions      : 57
Degree of Recursion    : 2
Result Sets Reflection : Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 1.35900 secs
Strassen HBC - Pass 02 - Completed: 1.26600 secs
Strassen HBC - Pass 03 - Completed: 1.28100 secs
Strassen HBC - Pass 04 - Completed: 1.26600 secs
Strassen HBC - Pass 05 - Completed: 1.28100 secs
[ Test-Case 3 - SHBC - DRP is 3 ( Non-APT / DD4400 ) ]

Strassen HBC
Matrix Size            : 1024 x 1024
Matrix Size Threshold  : 128 x 128
Matrix Partitions      : 400
Degree of Recursion    : 3
Result Sets Reflection : Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 0.98500 secs
Strassen HBC - Pass 02 - Completed: 0.98400 secs
Strassen HBC - Pass 03 - Completed: 0.96900 secs
Strassen HBC - Pass 04 - Completed: 0.98500 secs
Strassen HBC - Pass 05 - Completed: 0.96900 secs
[ Test-Case 4 - SHBC - DRP is 4 ( Non-APT / DD4400 ) ]

Strassen HBC
Matrix Size            : 1024 x 1024
Matrix Size Threshold  : 64 x 64
Matrix Partitions      : 2801
Degree of Recursion    : 4
Result Sets Reflection : Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 1.40600 secs
Strassen HBC - Pass 02 - Completed: 1.21900 secs
Strassen HBC - Pass 03 - Completed: 1.20300 secs
Strassen HBC - Pass 04 - Completed: 1.21900 secs
Strassen HBC - Pass 05 - Completed: 1.21900 secs
[ Test-Case 5 - SHBC - DRP is 5 ( Non-APT / DD4400 ) ]

Strassen HBC
Matrix Size            : 1024 x 1024
Matrix Size Threshold  : 32 x 32
Matrix Partitions      : 19608
Degree of Recursion    : 5
Result Sets Reflection : Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 2.31300 secs
Strassen HBC - Pass 02 - Completed: 1.73400 secs
Strassen HBC - Pass 03 - Completed: 1.71900 secs
Strassen HBC - Pass 04 - Completed: 1.71800 secs
Strassen HBC - Pass 05 - Completed: 1.71900 secs
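For readers unfamiliar with how the Matrix Size Threshold controls the Degree of Recursion in these test cases, here is a minimal pure-Python sketch of the textbook Strassen recursion. It is not the SHBC implementation benchmarked above (that code is not shown in this thread); the function names, the list-of-lists representation, and the `counter` argument are assumptions made for illustration. Recursion stops once a sub-matrix is no larger than the threshold, so an 8 x 8 input with a threshold of 2 recurses to degree 2, analogous to 1024 x 1024 with a 256 x 256 threshold.

```python
def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def naive_mul(a, b):
    """Classic row-by-column product, used once the threshold is reached."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def split(m):
    """Split a square matrix of even size into four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b, threshold, counter):
    """Multiply square matrices whose size is a power of two.

    counter[0] is incremented for every sub-matrix product below the top level.
    """
    if len(a) <= threshold:
        return naive_mul(a, b)
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)

    def mul(x, y):
        counter[0] += 1
        return strassen(x, y, threshold, counter)

    # The seven Strassen products.
    m1 = mul(add(a11, a22), add(b11, b22))
    m2 = mul(add(a21, a22), b11)
    m3 = mul(a11, sub(b12, b22))
    m4 = mul(a22, sub(b21, b11))
    m5 = mul(add(a11, a12), b22)
    m6 = mul(sub(a21, a11), add(b11, b12))
    m7 = mul(sub(a12, a22), add(b21, b22))

    # Combine the products into the four quadrants of the result.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


# Small integer matrices so the result can be compared exactly.
A = [[(i * 8 + j) % 7 for j in range(8)] for i in range(8)]
B = [[(i + 3 * j) % 5 for j in range(8)] for i in range(8)]
calls = [0]
C = strassen(A, B, threshold=2, counter=calls)
print(C == naive_mul(A, B), calls[0])
```

A lower threshold means deeper recursion and more, smaller sub-matrix products; the timings above suggest that on the DD4400 this trade-off was best at the 3rd degree.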
[ Test-Case 6 - CBLAS SGEMM ( MKL / SP / DD4400 ) ]

Cblas xGEMM
Matrix Size            : 1024 x 1024
Matrix Size Threshold  : N/A
Matrix Partitions      : N/A
Degree of Recursion    : N/A
Result Sets Reflection : N/A
Calculating...
Cblas xGEMM - Pass 01 - Completed: 0.51600 secs
Cblas xGEMM - Pass 02 - Completed: 0.51500 secs
Cblas xGEMM - Pass 03 - Completed: 0.51600 secs
Cblas xGEMM - Pass 04 - Completed: 0.51500 secs
Cblas xGEMM - Pass 05 - Completed: 0.51600 secs

Note: It is ~47% faster when compared to the best results for SAMM with 3rd DRP.
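The ~47% figure in the note can be reproduced from the pass timings reported in Test-Cases 1 to 6. The short Python sketch below copies those timings and takes the best (minimum) pass per test case; reading "faster" as the relative reduction in elapsed time is an assumption, and the dictionary keys are labels chosen here for illustration.

```python
# Pass timings in seconds, copied from Test-Cases 1 to 6 ( DD4400 ).
passes = {
    "SHBC DRP 1": [2.328, 2.313, 2.312, 2.313, 2.312],
    "SHBC DRP 2": [1.359, 1.266, 1.281, 1.266, 1.281],
    "SHBC DRP 3": [0.985, 0.984, 0.969, 0.985, 0.969],
    "SHBC DRP 4": [1.406, 1.219, 1.203, 1.219, 1.219],
    "SHBC DRP 5": [2.313, 1.734, 1.719, 1.718, 1.719],
    "CBLAS SGEMM": [0.516, 0.515, 0.516, 0.515, 0.516],
}

# Best pass per test case.
best = {name: min(times) for name, times in passes.items()}

# Fastest Strassen configuration versus CBLAS SGEMM.
best_samm_name = min((n for n in best if n.startswith("SHBC")), key=best.get)
best_samm = best[best_samm_name]
sgemm = best["CBLAS SGEMM"]

reduction_pct = 100.0 * (best_samm - sgemm) / best_samm  # less elapsed time, in percent
time_ratio = best_samm / sgemm                           # how many times longer SAMM takes

print(best_samm_name, best_samm, sgemm)
print(f"time reduction: {reduction_pct:.1f}%  ratio: {time_ratio:.2f}x")
```

Expressed as a ratio rather than a percentage, the best SAMM pass takes a little under twice as long as the best SGEMM pass.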
[ Relations between Degree of Recursive Processing ( DRP ) and number of BMMOps ]

Polynomial representations of these Relations are as follows:

1st Degree: 7 BMMOps = 7^1 = 7
2nd Degree: 56 BMMOps = 7^2 + 7^1 = 49 + 7
3rd Degree: 399 BMMOps = 7^3 + 7^2 + 7^1 = 343 + 49 + 7
4th Degree: 2800 BMMOps = 7^4 + 7^3 + 7^2 + 7^1 = 2401 + 343 + 49 + 7
5th Degree: 19607 BMMOps = 7^5 + 7^4 + 7^3 + 7^2 + 7^1 = 16807 + 2401 + 343 + 49 + 7

That is, the Number of BMMOps for the Nth DRP is the sum of 7^k for k from 1 to N.

Take into account that a BMMO is Not a basic multiplication operation on the FPU, SP or DP; it is a multiplication of two partitioned ( sub-divided ) matrices. The products of all these matrix multiplications are combined ( see Wikipedia on how SAMM works ) to get the final product of the two source matrices.
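The relation above is a geometric series, so it also has a closed form, ( 7^(N+1) - 7 ) / 6. The Python sketch below checks both forms against the listed values; the function names are hypothetical, and `threshold_size` assumes the matrix dimension is halved once per degree, which matches the Matrix Size Threshold values in the test-case logs for a 1024 x 1024 matrix.

```python
def bmmops(degree: int) -> int:
    """Number of BMMOps for a given DRP: 7^1 + 7^2 + ... + 7^degree."""
    return sum(7 ** k for k in range(1, degree + 1))


def bmmops_closed_form(degree: int) -> int:
    """Same count via the geometric-series closed form."""
    return (7 ** (degree + 1) - 7) // 6


def threshold_size(matrix_size: int, degree: int) -> int:
    """Sub-matrix dimension after halving the matrix 'degree' times."""
    return matrix_size // 2 ** degree


for degree in range(1, 6):
    size = threshold_size(1024, degree)
    print(f"DRP {degree}: {bmmops(degree):>5} BMMOps, threshold {size} x {size}")
```

The Matrix Partitions values reported in the test-case logs ( 8, 57, 400, 2801, 19608 ) are each one more than the corresponding BMMOps count, which may reflect the top-level matrix being counted as well; the thread does not state this explicitly.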
[ Hardware Issues ]

** Seagate Barracuda 20GB IDE Hard Disk Drive **
ST320011A 3.5" 7200 Rpm 2MB Cache IDE Ultra ATA100 / ATA-iV/6
Average Latency : x.xx ms
Average Seek Times Read : 18.0ms
Average Seek Times Write : 24.9ms
Maximum Internal Transfer Rate : 56.9MB/sec
Average External Transfer Rate : 100MB/sec ( Read and Write )
Maximum External Transfer Rate : 150MB/sec ( Read )

** Seagate Barracuda 3TB Hard Disk Drive **
ST3000DM001 3.5" 7200 Rpm 64MB Cache SATA III ( 6Gb/sec )
Average Latency : 4.16 ms
Average Seek Times Read : 8.5ms
Average Seek Times Write : 9.5ms
Maximum Internal Transfer Rate : 268MB/sec
Average External Transfer Rate : 156MB/sec ( Read and Write )
Maximum External Transfer Rate : 210MB/sec ( Read )