- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
			
				
					
					
						*** Performance of Portable Algorithms ***
[ Abstract ]
There is a great deal of attention to performance of Portable Algorithms. Since a Portrable Algorithm
could work on many platforms, that is Operating Systems working on different Hardware, the performance of
these algorithms is a very important since it could vary to some degree.
Here are two examples:
Example A:
If a difference in performance for an Algorithm A on Platforms PA and PB running on the same hardware (!) is
less than 5 percent then the Algorithm A could be qualified as a 'Performance Portable Algorithm'.
Example B:
If a difference in performance for an Algorithm B on Platforms PA and PB running on the same hardware (!) is
greater than 5 percent then the Algorithm A could Not be qualified as a 'Performance Portable Algorithm'.
When speaking about Performance Portable Algorithms it is very important to evaluate performance on
the same hardware platforms since even a small difference in hardware could skew performance numbers.
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
Link Copied
		11 Replies
	
		
		
			
			
			
					
	
			- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
			
				
					
					
						Note: ...If a difference in performance for an Algorithm A on Platforms PA and PB running...
It means, Platforms PA and PB are using the same Hardware P but different Operating Systems A and B.
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
			
				
					
					
						[ List of Abbreviations ]
		MM   - Matrix Multiplication
		MMOp - Matrix Multiplication Operation
		SAMM - Strassen Algorithm of Matrix Multiplication
		SHBI - Strassen Heap Based Incomplete
		SHBC - Strassen Heap Based Complete
		DRP  - Degree of Recursive Processing
		BMMO - Basic Matrix Multiplication Operation ( in context of SAMM )
		APT  - Accelerated Processing Technique
		FPU  - Floating Point Unit
		SP   - Single Precision
		DP   - Double Precision
		ABC4M7T2  - Strassen Schema ( SS ) of MM ( Classic / Stack Based )
		ABCST4PU7 - Winograd Schema ( WS ) of MM ( Classic / Stack Based / Reduced number of MMOps )
		AB2M7     - Kostrov  Schema ( KS ) of MM ( Classic Modified / Heap Based / Memory Efficient )
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
			
				
					
					
						[ Computer System used for performance evaluations ]
	** Dell Precision Mobile M4700 ( DPMM4700 )**
		Intel Core i7-3840QM ( 2.80 GHz )
		Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
		32GB RAM
		320GB HDD
		NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
		Windows 7 Professional 64-bit SP1
		Size of L3 Cache =   8MB ( shared between all cores for data & instructions )
		Size of L2 Cache =   1MB ( 256KB per core / shared for data & instructions )
		Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
		Display resolution: 1366 x 768
	and
	** Dell Dimension 4400 ( DD4400 ) **
		Intel Pentium 4 ( 1.60 GHz / 1 core )
		1GB RAM
		Seagate 20GB HDD						( *  )
		Seagate  3TB HDD						( ** )
		EVGA GeForce 6200 Video Card 512MB DDR2 AGP 8x Video Card
		Windows XP Professional 32-bit SP3
		Size of L2 Cache = 256KB
		Size of L1 Cache =   8KB
		Display resolution: 1440 x 990
		( *  )	Seagate Barracuda 20GB IDE Hard Disk Drive
		ST320011A
		3.5" 7200 Rpm  2MB Cache IDE Ultra ATA100 / ATA-iV/6
		Average Rotational Latency	: 4.17 ms
		Average Seek Times Read		: 9.0ms
		Average Seek Times Write	: 10.0ms
		Maximum Internal Transfer Rate	: 69.4MB/sec
		Average External Transfer Rate	: 100MB/sec ( Read and Write )
		Maximum External Transfer Rate	: 150MB/sec ( Read           )
		Note: Barracuda ATA IV Family
		( ** )	Seagate Barracuda  3TB IDE Hard Disk Drive
		ST3000DM001
		3.5" 7200 Rpm 64MB Cache SATA III ( 6GB/sec )
		Average Rotational Latency	: 4.16 ms
		Average Seek Times Read		: 8.5ms
		Average Seek Times Write	: 9.5ms
		Maximum Internal Transfer Rate	: 268MB/sec
		Average External Transfer Rate	: 156MB/sec ( Read and Write )
		Maximum External Transfer Rate	: 210MB/sec ( Read           )
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
			
				
					
					
						[ Test-Case 1 - SHBC - DRP is 1 ( Non-APT / DD4400 ) ]
		Strassen HBC
		Matrix Size           :  1024 x  1024
		Matrix Size Threshold :   512 x   512
		Matrix Partitions     :     8
		Degree of Recursion   :     1
		Result Sets Reflection: Disabled
		Calculating...
		Strassen HBC - Pass 01 - Completed:     2.32800 secs
		Strassen HBC - Pass 02 - Completed:     2.31300 secs
		Strassen HBC - Pass 03 - Completed:     2.31200 secs
		Strassen HBC - Pass 04 - Completed:     2.31300 secs
		Strassen HBC - Pass 05 - Completed:     2.31200 secs
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
			
				
					
					
						[ Test-Case 2 - SHBC - DRP is 2 ( Non-APT / DD4400 ) ]
		Strassen HBC
		Matrix Size           :  1024 x  1024
		Matrix Size Threshold :   256 x   256
		Matrix Partitions     :    57
		Degree of Recursion   :     2
		Result Sets Reflection: Enabled
		Calculating...
		Strassen HBC - Pass 01 - Completed:     1.35900 secs
		Strassen HBC - Pass 02 - Completed:     1.26600 secs
		Strassen HBC - Pass 03 - Completed:     1.28100 secs
		Strassen HBC - Pass 04 - Completed:     1.26600 secs
		Strassen HBC - Pass 05 - Completed:     1.28100 secs
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
			
				
					
					
						[ Test-Case 3 - SHBC - DRP is 3 ( Non-APT / DD4400 ) ]
		Strassen HBC
		Matrix Size           :  1024 x  1024
		Matrix Size Threshold :   128 x   128
		Matrix Partitions     :   400
		Degree of Recursion   :     3
		Result Sets Reflection: Enabled
		Calculating...
		Strassen HBC - Pass 01 - Completed:     0.98500 secs
		Strassen HBC - Pass 02 - Completed:     0.98400 secs
		Strassen HBC - Pass 03 - Completed:     0.96900 secs
		Strassen HBC - Pass 04 - Completed:     0.98500 secs
		Strassen HBC - Pass 05 - Completed:     0.96900 secs
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
			
				
					
					
						[ Test-Case 4 - SHBC - DRP is 4 ( Non-APT / DD4400 ) ]
		Strassen HBC
		Matrix Size           :  1024 x  1024
		Matrix Size Threshold :    64 x    64
		Matrix Partitions     :  2801
		Degree of Recursion   :     4
		Result Sets Reflection: Enabled
		Calculating...
		Strassen HBC - Pass 01 - Completed:     1.40600 secs
		Strassen HBC - Pass 02 - Completed:     1.21900 secs
		Strassen HBC - Pass 03 - Completed:     1.20300 secs
		Strassen HBC - Pass 04 - Completed:     1.21900 secs
		Strassen HBC - Pass 05 - Completed:     1.21900 secs
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
			
				
					
					
						[ Test-Case 5 - SHBC - DRP is 5 ( Non-APT ) / DD4400 ]
		Strassen HBC
		Matrix Size           :  1024 x  1024
		Matrix Size Threshold :    32 x    32
		Matrix Partitions     : 19608
		Degree of Recursion   :     5
		Result Sets Reflection: Enabled
		Calculating...
		Strassen HBC - Pass 01 - Completed:     2.31300 secs
		Strassen HBC - Pass 02 - Completed:     1.73400 secs
		Strassen HBC - Pass 03 - Completed:     1.71900 secs
		Strassen HBC - Pass 04 - Completed:     1.71800 secs
		Strassen HBC - Pass 05 - Completed:     1.71900 secs
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
			
				
					
					
						[ Test-Case 6 - CBLAS SGEMM ( MKL / SP  / DD4400) ]
		Cblas xGEMM
		Matrix Size           :  1024 x  1024
		Matrix Size Threshold : N/A
		Matrix Partitions     : N/A
		Degree of Recursion   : N/A
		Result Sets Reflection: N/A
		Calculating...
		Cblas xGEMM  - Pass 01 - Completed:     0.51600 secs
		Cblas xGEMM  - Pass 02 - Completed:     0.51500 secs
		Cblas xGEMM  - Pass 03 - Completed:     0.51600 secs
		Cblas xGEMM  - Pass 04 - Completed:     0.51500 secs
		Cblas xGEMM  - Pass 05 - Completed:     0.51600 secs
		Note: It is ~47% faster when compared to the best results for SAMM with 3rd DRP.
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
			
				
					
					
						[ Relations between Degree of Recursive Processing ( DRP ) and number of BMMOps ]
		Polynomial representations of these Relations are as follows:
		1st Degree: 7 BMMOps = 7^1 = 7
		2nd Degree: 56 BMMOps = 7^2 + 7^1 = 49 + 7
		3rd Degree: 399 BMMOps = 7^3 + 7^2 + 7^1 = 343 + 49 + 7
		4th Degree: 2800 BMMOps = 7^4 + 7^3 + 7^2 + 7^1 = 2401 + 343 + 49 + 7
		5th Degree: 19607 BMMOps = 7^5 + 7^4 + 7^3 + 7^2 + 7^1 = 16807 + 2401 + 343 + 49 + 7
		That is, the Number of BMMOps for Nth DRP is the sum of all 7 to the power of N where
		N is from 1 to N.
		Take into account that BMMO is Not a basic multiplication operation on FPU, SP or DP,
		and it is a multiplication of two partitioned ( sub-divided ) matrices. Products of
		all these matrix multiplications will be combined ( see Wikipedia on how SAMM works )
		to get a final product of two source matrices.
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
			
				
					
					
						[ Hardware Issues ]
		** Seagate Barracuda 20GB IDE Hard Disk Drive **
		ST320011A
		3.5" 7200 Rpm  2MB Cache IDE Ultra ATA100 / ATA-iV/6
		Average Latency			: x.xx ms
		Average Seek Times Read		: 18.0ms
		Average Seek Times Write	: 24.9ms
		Maximum Internal Transfer Rate	: 56.9MB/sec
		Average External Transfer Rate	: 100MB/sec ( Read and Write )
		Maximum External Transfer Rate	: 150MB/sec ( Read           )
		** Seagate Barracuda  3TB IDE Hard Disk Drive **
		ST3000DM001
		3.5" 7200 Rpm 64MB Cache SATA III ( 6GB/sec )
		Average Latency			: 4.16 ms
		Average Seek Times Read		: 8.5ms
		Average Seek Times Write	: 9.5ms
		Maximum Internal Transfer Rate	: 268MB/sec
		Average External Transfer Rate	: 156MB/sec ( Read and Write )
		Maximum External Transfer Rate	: 210MB/sec ( Read           )
					
				
			
			
				
			
			
			
			
			
			
			
		
		
		
	
	
	
 
					
				
				
			
		
					
					Reply
					
						
	
		
				
				
				
					
						
					
				
					
				
				
				
				
			
			Topic Options
			
				
					
	
			
		
	- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page