- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
*** Performance of Portable Algorithms ***
[ Abstract ]
There is a great deal of attention to performance of Portable Algorithms. Since a Portrable Algorithm
could work on many platforms, that is Operating Systems working on different Hardware, the performance of
these algorithms is a very important since it could vary to some degree.
Here are two examples:
Example A:
If a difference in performance for an Algorithm A on Platforms PA and PB running on the same hardware (!) is
less than 5 percent then the Algorithm A could be qualified as a 'Performance Portable Algorithm'.
Example B:
If a difference in performance for an Algorithm B on Platforms PA and PB running on the same hardware (!) is
greater than 5 percent then the Algorithm A could Not be qualified as a 'Performance Portable Algorithm'.
When speaking about Performance Portable Algorithms it is very important to evaluate performance on
the same hardware platforms since even a small difference in hardware could skew performance numbers.
Link Copied
11 Replies
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Note: ...If a difference in performance for an Algorithm A on Platforms PA and PB running...
It means, Platforms PA and PB are using the same Hardware P but different Operating Systems A and B.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
[ List of Abbreviations ]
MM - Matrix Multiplication
MMOp - Matrix Multiplication Operation
SAMM - Strassen Algorithm of Matrix Multiplication
SHBI - Strassen Heap Based Incomplete
SHBC - Strassen Heap Based Complete
DRP - Degree of Recursive Processing
BMMO - Basic Matrix Multiplication Operation ( in context of SAMM )
APT - Accelerated Processing Technique
FPU - Floating Point Unit
SP - Single Precision
DP - Double Precision
ABC4M7T2 - Strassen Schema ( SS ) of MM ( Classic / Stack Based )
ABCST4PU7 - Winograd Schema ( WS ) of MM ( Classic / Stack Based / Reduced number of MMOps )
AB2M7 - Kostrov Schema ( KS ) of MM ( Classic Modified / Heap Based / Memory Efficient )
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
[ Computer System used for performance evaluations ]
** Dell Precision Mobile M4700 ( DPMM4700 )**
Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit SP1
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Display resolution: 1366 x 768
and
** Dell Dimension 4400 ( DD4400 ) **
Intel Pentium 4 ( 1.60 GHz / 1 core )
1GB RAM
Seagate 20GB HDD ( * )
Seagate 3TB HDD ( ** )
EVGA GeForce 6200 Video Card 512MB DDR2 AGP 8x Video Card
Windows XP Professional 32-bit SP3
Size of L2 Cache = 256KB
Size of L1 Cache = 8KB
Display resolution: 1440 x 990
( * ) Seagate Barracuda 20GB IDE Hard Disk Drive
ST320011A
3.5" 7200 Rpm 2MB Cache IDE Ultra ATA100 / ATA-iV/6
Average Rotational Latency : 4.17 ms
Average Seek Times Read : 9.0ms
Average Seek Times Write : 10.0ms
Maximum Internal Transfer Rate : 69.4MB/sec
Average External Transfer Rate : 100MB/sec ( Read and Write )
Maximum External Transfer Rate : 150MB/sec ( Read )
Note: Barracuda ATA IV Family
( ** ) Seagate Barracuda 3TB IDE Hard Disk Drive
ST3000DM001
3.5" 7200 Rpm 64MB Cache SATA III ( 6GB/sec )
Average Rotational Latency : 4.16 ms
Average Seek Times Read : 8.5ms
Average Seek Times Write : 9.5ms
Maximum Internal Transfer Rate : 268MB/sec
Average External Transfer Rate : 156MB/sec ( Read and Write )
Maximum External Transfer Rate : 210MB/sec ( Read )
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
[ Test-Case 1 - SHBC - DRP is 1 ( Non-APT / DD4400 ) ]
Strassen HBC
Matrix Size : 1024 x 1024
Matrix Size Threshold : 512 x 512
Matrix Partitions : 8
Degree of Recursion : 1
Result Sets Reflection: Disabled
Calculating...
Strassen HBC - Pass 01 - Completed: 2.32800 secs
Strassen HBC - Pass 02 - Completed: 2.31300 secs
Strassen HBC - Pass 03 - Completed: 2.31200 secs
Strassen HBC - Pass 04 - Completed: 2.31300 secs
Strassen HBC - Pass 05 - Completed: 2.31200 secs
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
[ Test-Case 2 - SHBC - DRP is 2 ( Non-APT / DD4400 ) ]
Strassen HBC
Matrix Size : 1024 x 1024
Matrix Size Threshold : 256 x 256
Matrix Partitions : 57
Degree of Recursion : 2
Result Sets Reflection: Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 1.35900 secs
Strassen HBC - Pass 02 - Completed: 1.26600 secs
Strassen HBC - Pass 03 - Completed: 1.28100 secs
Strassen HBC - Pass 04 - Completed: 1.26600 secs
Strassen HBC - Pass 05 - Completed: 1.28100 secs
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
[ Test-Case 3 - SHBC - DRP is 3 ( Non-APT / DD4400 ) ]
Strassen HBC
Matrix Size : 1024 x 1024
Matrix Size Threshold : 128 x 128
Matrix Partitions : 400
Degree of Recursion : 3
Result Sets Reflection: Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 0.98500 secs
Strassen HBC - Pass 02 - Completed: 0.98400 secs
Strassen HBC - Pass 03 - Completed: 0.96900 secs
Strassen HBC - Pass 04 - Completed: 0.98500 secs
Strassen HBC - Pass 05 - Completed: 0.96900 secs
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
[ Test-Case 4 - SHBC - DRP is 4 ( Non-APT / DD4400 ) ]
Strassen HBC
Matrix Size : 1024 x 1024
Matrix Size Threshold : 64 x 64
Matrix Partitions : 2801
Degree of Recursion : 4
Result Sets Reflection: Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 1.40600 secs
Strassen HBC - Pass 02 - Completed: 1.21900 secs
Strassen HBC - Pass 03 - Completed: 1.20300 secs
Strassen HBC - Pass 04 - Completed: 1.21900 secs
Strassen HBC - Pass 05 - Completed: 1.21900 secs
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
[ Test-Case 5 - SHBC - DRP is 5 ( Non-APT ) / DD4400 ]
Strassen HBC
Matrix Size : 1024 x 1024
Matrix Size Threshold : 32 x 32
Matrix Partitions : 19608
Degree of Recursion : 5
Result Sets Reflection: Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 2.31300 secs
Strassen HBC - Pass 02 - Completed: 1.73400 secs
Strassen HBC - Pass 03 - Completed: 1.71900 secs
Strassen HBC - Pass 04 - Completed: 1.71800 secs
Strassen HBC - Pass 05 - Completed: 1.71900 secs
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
[ Test-Case 6 - CBLAS SGEMM ( MKL / SP / DD4400) ]
Cblas xGEMM
Matrix Size : 1024 x 1024
Matrix Size Threshold : N/A
Matrix Partitions : N/A
Degree of Recursion : N/A
Result Sets Reflection: N/A
Calculating...
Cblas xGEMM - Pass 01 - Completed: 0.51600 secs
Cblas xGEMM - Pass 02 - Completed: 0.51500 secs
Cblas xGEMM - Pass 03 - Completed: 0.51600 secs
Cblas xGEMM - Pass 04 - Completed: 0.51500 secs
Cblas xGEMM - Pass 05 - Completed: 0.51600 secs
Note: It is ~47% faster when compared to the best results for SAMM with 3rd DRP.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
[ Relations between Degree of Recursive Processing ( DRP ) and number of BMMOps ]
Polynomial representations of these Relations are as follows:
1st Degree: 7 BMMOps = 7^1 = 7
2nd Degree: 56 BMMOps = 7^2 + 7^1 = 49 + 7
3rd Degree: 399 BMMOps = 7^3 + 7^2 + 7^1 = 343 + 49 + 7
4th Degree: 2800 BMMOps = 7^4 + 7^3 + 7^2 + 7^1 = 2401 + 343 + 49 + 7
5th Degree: 19607 BMMOps = 7^5 + 7^4 + 7^3 + 7^2 + 7^1 = 16807 + 2401 + 343 + 49 + 7
That is, the Number of BMMOps for Nth DRP is the sum of all 7 to the power of N where
N is from 1 to N.
Take into account that BMMO is Not a basic multiplication operation on FPU, SP or DP,
and it is a multiplication of two partitioned ( sub-divided ) matrices. Products of
all these matrix multiplications will be combined ( see Wikipedia on how SAMM works )
to get a final product of two source matrices.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
[ Hardware Issues ]
** Seagate Barracuda 20GB IDE Hard Disk Drive **
ST320011A
3.5" 7200 Rpm 2MB Cache IDE Ultra ATA100 / ATA-iV/6
Average Latency : x.xx ms
Average Seek Times Read : 18.0ms
Average Seek Times Write : 24.9ms
Maximum Internal Transfer Rate : 56.9MB/sec
Average External Transfer Rate : 100MB/sec ( Read and Write )
Maximum External Transfer Rate : 150MB/sec ( Read )
** Seagate Barracuda 3TB IDE Hard Disk Drive **
ST3000DM001
3.5" 7200 Rpm 64MB Cache SATA III ( 6GB/sec )
Average Latency : 4.16 ms
Average Seek Times Read : 8.5ms
Average Seek Times Write : 9.5ms
Maximum Internal Transfer Rate : 268MB/sec
Average External Transfer Rate : 156MB/sec ( Read and Write )
Maximum External Transfer Rate : 210MB/sec ( Read )
Reply
Topic Options
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page