Hi there,
I'm a very recent convert to IPP and am still feeling my way around.
I have a list of points (XYZ - 2x10^7 for a small test case) and I need to subtract a vector [tx, ty, tz] from each 3D point.
I can do the operation with three separate calls:
[bash]status = ippmSub_vc_32f(X, stride0, tx, t_x, stride0, numberOfPoints);
status = ippmSub_vc_32f(Y, stride0, ty, t_y, stride0, numberOfPoints);
status = ippmSub_vc_32f(Z, stride0, tz, t_z, stride0, numberOfPoints);[/bash]
This works fine and fast, but I was wondering whether there is a way to do it as a single IPP call using the matrix formats.
Sorry, I figured it out. Why do I try harder after posting something?
[bash]status = ippmSub_vav_32f_P((const Ipp32f**) XYZ, 0, stride0,
                           (const Ipp32f**) vec, 0,
                           result, 0, stride0,
                           3, numberOfPoints);[/bash]
That is OK. What is your impression of the IPP small-matrix functionality? Does it meet your expectations in functionality and performance?
Regards,
Vladimir
Not bad. Other work is done for each iteration outside the function under scrutiny.
Running the code as separate calls for each coordinate on a single core gives (test is run 100x with camera zoom each iteration):
-------------------------------------------------------
numberOfPoints | Time (seconds)
1x10^7         | 3.35
2.5x10^7       | 7.411
5x10^7         | 14.2

now with ippSetNumThreads(4) - there are serial portions:
1x10^7         | 3.4
2.5x10^7       | 7.7
5x10^7         | 14.11
-------------------------------------------------------
With the small-matrix call on a single core:
-------------------------------------------------------
numberOfPoints | Time (seconds)
1x10^7         | 9.98
2.5x10^7       | 24.4
5x10^7         | 49.94

now with ippSetNumThreads(4) - there are serial portions:
1x10^7         | 3.58
2.5x10^7       | 7.60
5x10^7         | 14.52
-------------------------------------------------------
I had the code as:
[bash]// Storage class for points
Ipp32f *XYZ[3] = {X, Y, Z};
IppStatus status;
int stride0 = sizeof(Ipp32f);  // Stride between columns
Ipp32f tx, ty, tz;
tx = camera.from.x;
ty = camera.from.y;
tz = camera.from.z;
Ipp32f camera_from[3] = {tx, ty, tz};
Ipp32f *vec[3] = {camera_from, camera_from + 1, camera_from + 2};
Ipp32f *result[3] = {t_x, t_y, t_z};

// Matrix subtraction
status = ippmSub_vav_32f_P((const Ipp32f**) XYZ, 0, stride0,
                           (const Ipp32f**) vec, 0,
                           result, 0, stride0,
                           3, numberOfPoints);[/bash]
This is not ideal, particularly since XYZ and the other pointer arrays are rebuilt every iteration. It is hard to know which path to take, since there is more computation still to come; I will pursue both methods, including some microbenchmarking, as some of the remaining work may benefit greatly.
The existing code I have (structs, no IPP, etc.) can do every calculation needed (including the if statements for culling) for 5x10^7 points in 109.14 seconds on a single core, or 0.48 seconds for just the operations above on a single core.
Processor: E5530, 6GB DDR3
OS: 2.6.31.12-174.2.22.fc12.x86_64
Compiler: icpc 11.1.064
IPP: v6.1 update 3
50 million points are drawn per machine at 1920x4800, with five machines in total. New points stream in as the camera moves.
Obviously any pointers welcome.
So you have 109 seconds without IPP. How much does it improve with IPP?
And note that there is not enough computation in the small-matrix operations to engage internal threading, so you may not need to set the number of threads to 4 (it simply makes no sense for the IPP matrix functions). You should see much more benefit if you can call the IPP functions in parallel yourself.
Just a note: the latest version is IPP 6.1.5.
Regards,
Vladimir
Just using Ipp32f, Ipp32u, and Ipp16u instead of float, unsigned int, and unsigned short, with no other IPP calls apart from the malloc, I get 39.09 seconds, which is a pretty good outcome in itself.
Created a new profile for myself rather than using work's group one.