topic Appropriate Matrix Sub in Intel® Integrated Performance Primitives

Appropriate Matrix Sub

hpc_prog — Thu, 20 May 2010 01:59:53 GMT

Hi there,

I'm a very recent convert to IPP and am still feeling my way around.

I have a list of points (XYZ - 2x10^7 for a small test case) and I need to subtract a vector [tx, ty, tz] from each 3D point.

I can do the operation with three (3) separate calls, one per coordinate:

[bash]status = ippmSub_vc_32f(X, stride0, tx, t_x, stride0, numberOfPoints);
status = ippmSub_vc_32f(Y, stride0, ty, t_y, stride0, numberOfPoints);
status = ippmSub_vc_32f(Z, stride0, tz, t_z, stride0, numberOfPoints);[/bash]

And this works fine and fast. But I was wondering if there is a way to do this as a single ipp call using the matrix formats.
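For reference, here is a minimal plain-C sketch of what the three calls compute, assuming X, Y and Z are tightly packed float arrays (stride of sizeof(float) bytes). The names sub_const_32f and subtract_translation are hypothetical and are not part of IPP; the sketch only pins down the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C reference (not IPP): subtract the constant c from each of the
   n values in src and write the results to dst. */
static void sub_const_32f(const float *src, float c, float *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] - c;
}

/* Subtract (tx, ty, tz) from every point, with the points stored as
   three separate coordinate arrays of length n. */
static void subtract_translation(const float *x, const float *y, const float *z,
                                 float tx, float ty, float tz,
                                 float *out_x, float *out_y, float *out_z,
                                 size_t n)
{
    sub_const_32f(x, tx, out_x, n);
    sub_const_32f(y, ty, out_y, n);
    sub_const_32f(z, tz, out_z, n);
}
```

Each helper call streams through one coordinate array, which is the same access pattern as the per-coordinate calls above.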

hpc_prog — Thu, 20 May 2010 02:36:24 GMT

Sorry, I figured it out. Why do I try harder after posting something?

[bash]status = ippmSub_vav_32f_P((const Ipp32f**) XYZ, 0, stride0, (const Ipp32f**) vec, 0, result, 0, stride0, 3, numberOfPoints);[/bash]

Vladimir_Dudnik — Thu, 20 May 2010 20:37:59 GMT

That is ok. What is your feeling on IPP small matrix functionality? Does it meet your expectations in functionality and performance?

Regards,
Vladimir

hpc_prog — Fri, 21 May 2010 02:32:52 GMT

Not bad. Other work is done for each iteration outside the function under scrutiny.

Running the code as separate calls for each coordinate on a single core gives (test is run 100x with camera zoom each iteration):
-------------------------------------------------------
numberOfPoints | Time (seconds)
1x10^7         | 3.35
2.5x10^7       | 7.411
5x10^7         | 14.2

Now with ippSetNumThreads(4) - there are serial portions
numberOfPoints | Time (seconds)
1x10^7         | 3.4
2.5x10^7       | 7.7
5x10^7         | 14.11
--------------------------------------------------------
With the small matrix, a single core

numberOfPoints | Time (seconds)
1x10^7         | 9.98
2.5x10^7       | 24.4
5x10^7         | 49.94

Now with ippSetNumThreads(4) - there are serial portions
numberOfPoints | Time (seconds)
1x10^7         | 3.58
2.5x10^7       | 7.60
5x10^7         | 14.52
--------------------------------------------------------
I had the code as:

[bash]// Storage class for points
	Ipp32f *XYZ[3] = {X, Y, Z};
	IppStatus status;
	int stride0 = sizeof(Ipp32f); // Stride between columns
	Ipp32f tx, ty, tz;
	tx = camera.from.x;
	ty = camera.from.y;
	tz = camera.from.z;
	Ipp32f camera_from[3] = {tx, ty, tz};
	Ipp32f *vec[3] = {camera_from, camera_from+1, camera_from+2};
	
	Ipp32f *result[3] = {t_x, t_y, t_z};
	
	// Matrix subtraction
	status = ippmSub_vav_32f_P((const Ipp32f**) XYZ, 0, stride0, (const Ipp32f**) vec, 0, result, 0, stride0, 3, numberOfPoints);[/bash]

This is not ideal, particularly with XYZ and the other pointer arrays being rebuilt every iteration. It is difficult to know which path to take since there is more compute still to be done. I will still pursue both methods, including some microbenchmarking, as some of the remaining work may benefit greatly.
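To make the pointer ("_P") layout used in that call concrete, here is a plain-C reference sketch. It assumes that element i of vector n lives at ppSrc[i] + roiShift + n*stride0, with the shift and stride in bytes, which is consistent with the XYZ, vec and result setup in the code above but should be checked against the ippm manual. The function name sub_vav_32f_p_ref is hypothetical and is not an IPP function.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C reference (not IPP) for "vector array minus vector" in a
   pointer layout, under the ASSUMED addressing rule
       element i of vector n = *(pp[i] + shift + n * stride0)   (bytes).
   Computes dst[n][i] = src1[n][i] - src2[i] for n < count, i < len. */
static void sub_vav_32f_p_ref(const float *const *pp_src1, int src1_shift, int src1_stride0,
                              const float *const *pp_src2, int src2_shift,
                              float *const *pp_dst, int dst_shift, int dst_stride0,
                              int len, int count)
{
    assert(len >= 0 && count >= 0);
    for (int n = 0; n < count; ++n) {
        for (int i = 0; i < len; ++i) {
            /* Byte offsets from the per-element base pointers. */
            const float *a = (const float *)((const char *)pp_src1[i]
                             + src1_shift + (ptrdiff_t)n * src1_stride0);
            const float *b = (const float *)((const char *)pp_src2[i] + src2_shift);
            float *d = (float *)((char *)pp_dst[i]
                       + dst_shift + (ptrdiff_t)n * dst_stride0);
            *d = *a - *b;
        }
    }
}
```

With len = 3 and a stride of sizeof(float), the three base pointers play the role of X, Y and Z, and n walks along the points. Written this way, the inner loop touches three separate arrays for every point, whereas the per-coordinate calls stream through one array at a time.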

The existing code I have (structs, no IPP, etc.) can do every calculation needed (including if statements for culling, for 5x10^7 points)
in 109.14 seconds on a single core,
or in 0.48 seconds for just the operations above on a single core.

Processor: E5530, 6GB DDR3
OS: 2.6.31.12-174.2.22.fc12.x86_64
Compiler: icpc 11.1.064
IPP: v6.1 update 3

50 million points drawn per machine @ 1920x4800 with five machines in total. New points stream in as camera moves.

Obviously any pointers welcome.

Vladimir_Dudnik — Fri, 21 May 2010 07:10:09 GMT

So you have 109 sec without IPP. How does it improve with IPP?

Also note that there is not enough computation in small matrix operations to warrant internal threading, so you do not need to set the number of threads to 4 (it simply makes no sense for IPP matrix functions). You should see much more benefit if you can call IPP functions in parallel.
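One way to read "call IPP functions in parallel" is to split the point range into chunks in the application and issue one single-threaded call per chunk from different threads. The sketch below is only an illustration under assumptions: OpenMP is my choice of threading model (the thread does not name one), sub_const_chunk is a hypothetical plain-C stand-in for a call such as ippmSub_vc_32f on a slice, and the chunk size is a free parameter to tune.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one single-threaded library call on a slice
   of n tightly packed floats (for example ippmSub_vc_32f). */
static void sub_const_chunk(const float *src, float c, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] - c;
}

/* Split n values into chunks of at most `chunk` elements and issue one
   call per chunk. With OpenMP enabled the chunks are distributed across
   threads; without it the pragma is ignored and the loop runs serially. */
static void sub_const_parallel(const float *src, float c, float *dst,
                               size_t n, size_t chunk)
{
    assert(chunk > 0);
    long num_chunks = (long)((n + chunk - 1) / chunk);
    #pragma omp parallel for schedule(static)
    for (long k = 0; k < num_chunks; ++k) {
        size_t begin = (size_t)k * chunk;
        size_t len = (n - begin < chunk) ? (n - begin) : chunk;
        sub_const_chunk(src + begin, c, dst + begin, len);
    }
}
```

The chunks write to disjoint ranges of dst, so no synchronisation is needed beyond the implicit barrier at the end of the parallel loop. The same pattern would apply per coordinate array.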

Just a note, the latest version is IPP 6.1.5

Regards,
Vladimir

kramulous — Sun, 23 May 2010 22:19:54 GMT

Just using Ipp32f, Ipp32u and Ipp16u instead of float, unsigned int and unsigned short, with no other IPP calls apart from the malloc, I get 39.09 seconds, which is a pretty good outcome in itself.

Created a new profile for myself rather than using work's group one.