<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Appropriate Matrix Sub in Intel® Integrated Performance Primitives</title>
    <link>https://community.intel.com/t5/Intel-Integrated-Performance/Appropriate-Matrix-Sub/m-p/816011#M4307</link>
    <description></description>
    <pubDate>Fri, 21 May 2010 02:32:52 GMT</pubDate>
    <dc:creator>hpc_prog</dc:creator>
    <dc:date>2010-05-21T02:32:52Z</dc:date>
    <item>
      <title>Appropriate Matrix Sub</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Appropriate-Matrix-Sub/m-p/816008#M4304</link>
      <description>Hi there,&lt;BR /&gt;&lt;BR /&gt;I'm a very recent convert to IPP and am still feeling my way around.&lt;BR /&gt;&lt;BR /&gt;I have a list of points (XYZ, 2x10^7 of them for a small test case) and I need to subtract a vector [tx, ty, tz] from each 3D point.&lt;BR /&gt;&lt;BR /&gt;I can do the operation one coordinate at a time with three separate calls:&lt;BR /&gt;&lt;BR /&gt;&lt;PRE&gt;[bash]status = ippmSub_vc_32f(X, stride0, tx, t_x, stride0, numberOfPoints);
status = ippmSub_vc_32f(Y, stride0, ty, t_y, stride0, numberOfPoints);
status = ippmSub_vc_32f(Z, stride0, tz, t_z, stride0, numberOfPoints);[/bash]&lt;/PRE&gt; &lt;BR /&gt;&lt;BR /&gt;This works fine and is fast, but I was wondering if there is a way to do this as a single IPP call using the matrix formats.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 20 May 2010 01:59:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Appropriate-Matrix-Sub/m-p/816008#M4304</guid>
      <dc:creator>hpc_prog</dc:creator>
      <dc:date>2010-05-20T01:59:53Z</dc:date>
    </item>
    <item>
      <title>Appropriate Matrix Sub</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Appropriate-Matrix-Sub/m-p/816009#M4305</link>
      <description>Sorry, I figured it out. Why do I try harder after posting something?&lt;BR /&gt;&lt;BR /&gt;&lt;PRE&gt;[bash]status = ippmSub_vav_32f_P((const Ipp32f**) XYZ, 0, stride0, (const Ipp32f**) vec, 0, result, 0, stride0, 3, numberOfPoints);[/bash]&lt;/PRE&gt;</description>
      <pubDate>Thu, 20 May 2010 02:36:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Appropriate-Matrix-Sub/m-p/816009#M4305</guid>
      <dc:creator>hpc_prog</dc:creator>
      <dc:date>2010-05-20T02:36:24Z</dc:date>
    </item>
    <item>
      <title>Appropriate Matrix Sub</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Appropriate-Matrix-Sub/m-p/816010#M4306</link>
      <description>That is OK. What is your impression of the IPP small matrix functionality? Does it meet your expectations for functionality and performance?&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt; Vladimir</description>
      <pubDate>Thu, 20 May 2010 20:37:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Appropriate-Matrix-Sub/m-p/816010#M4306</guid>
      <dc:creator>Vladimir_Dudnik</dc:creator>
      <dc:date>2010-05-20T20:37:59Z</dc:date>
    </item>
    <item>
      <title>Appropriate Matrix Sub</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Appropriate-Matrix-Sub/m-p/816011#M4307</link>
      <description>Not bad. Other work is done in each iteration outside the function under scrutiny.&lt;BR /&gt;&lt;BR /&gt;
Running the code as separate calls for each coordinate (the test is run 100x, with a camera zoom each iteration):&lt;BR /&gt;
-------------------------------------------------------&lt;BR /&gt;
Single core:&lt;BR /&gt;
numberOfPoints | Time (seconds)&lt;BR /&gt;
1x10^7 | 3.35&lt;BR /&gt;
2.5x10^7 | 7.411&lt;BR /&gt;
5x10^7 | 14.2&lt;BR /&gt;
&lt;BR /&gt;
With ippSetNumThreads(4) (there are serial portions):&lt;BR /&gt;
numberOfPoints | Time (seconds)&lt;BR /&gt;
1x10^7 | 3.4&lt;BR /&gt;
2.5x10^7 | 7.7&lt;BR /&gt;
5x10^7 | 14.11&lt;BR /&gt;
--------------------------------------------------------&lt;BR /&gt;
With the small matrix function:&lt;BR /&gt;
&lt;BR /&gt;
Single core:&lt;BR /&gt;
numberOfPoints | Time (seconds)&lt;BR /&gt;
1x10^7 | 9.98&lt;BR /&gt;
2.5x10^7 | 24.4&lt;BR /&gt;
5x10^7 | 49.94&lt;BR /&gt;
&lt;BR /&gt;
With ippSetNumThreads(4) (there are serial portions):&lt;BR /&gt;
numberOfPoints | Time (seconds)&lt;BR /&gt;
1x10^7 | 3.58&lt;BR /&gt;
2.5x10^7 | 7.60&lt;BR /&gt;
5x10^7 | 14.52&lt;BR /&gt;
--------------------------------------------------------&lt;BR /&gt;
The code I had was:&lt;BR /&gt;&lt;BR /&gt;&lt;PRE&gt;[bash]// Storage class for points
	Ipp32f *XYZ[3] = {X, Y, Z};
	IppStatus status;
	int stride0 = sizeof(Ipp32f); // Stride in bytes between consecutive points within each coordinate array
	Ipp32f tx, ty, tz;
	tx = camera.from.x;
	ty = camera.from.y;
	tz = camera.from.z;
	Ipp32f camera_from[3] = {tx, ty, tz};
	Ipp32f *vec[3] = {camera_from, camera_from+1, camera_from+2};
	
	Ipp32f *result[3] = {t_x, t_y, t_z};
	
	// Matrix subtraction
	status = ippmSub_vav_32f_P((const Ipp32f**) XYZ, 0, stride0, (const Ipp32f**) vec, 0, result, 0, stride0, 3, numberOfPoints);[/bash]&lt;/PRE&gt; &lt;BR /&gt;This is not ideal, particularly because XYZ and the other arrays are rebuilt every iteration. It is difficult to know which path to take, since there is more compute still to be done. I will pursue both methods, including some microbenchmarking, because some of the remaining work may benefit greatly.&lt;BR /&gt;&lt;BR /&gt;The existing code I have (structs, no IPP, etc.) can do every calculation needed, including the if statements for culling, for 5x10^7 points&lt;BR /&gt;in 109.14 seconds on a single core,&lt;BR /&gt;or in 0.48 seconds for just the operations above on a single core.&lt;BR /&gt;&lt;BR /&gt;Processor: E5530, 6GB DDR3&lt;BR /&gt;OS: 2.6.31.12-174.2.22.fc12.x86_64&lt;BR /&gt;Compiler: icpc 11.1.064&lt;BR /&gt;IPP: v6.1 update 3&lt;BR /&gt;&lt;BR /&gt;50 million points are drawn per machine @ 1920x4800, with five machines in total. New points stream in as the camera moves.&lt;BR /&gt;&lt;BR /&gt;Any pointers are welcome.&lt;BR /&gt;</description>
      <pubDate>Fri, 21 May 2010 02:32:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Appropriate-Matrix-Sub/m-p/816011#M4307</guid>
      <dc:creator>hpc_prog</dc:creator>
      <dc:date>2010-05-21T02:32:52Z</dc:date>
    </item>
    <item>
      <title>Appropriate Matrix Sub</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Appropriate-Matrix-Sub/m-p/816012#M4308</link>
      <description>So you have 109 sec without IPP. How does it improve with IPP?&lt;BR /&gt;&lt;BR /&gt;Also note that there is not enough computation in small matrix operations to make use of internal threading, so you may not need to set the number of threads to 4 (it simply makes no sense for the IPP matrix functions). You should see much more benefit if you can call the IPP functions in parallel.&lt;BR /&gt;&lt;BR /&gt;Just a note: the latest version is IPP 6.1.5.&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt; Vladimir</description>
      <pubDate>Fri, 21 May 2010 07:10:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Appropriate-Matrix-Sub/m-p/816012#M4308</guid>
      <dc:creator>Vladimir_Dudnik</dc:creator>
      <dc:date>2010-05-21T07:10:09Z</dc:date>
    </item>
    <item>
      <title>Appropriate Matrix Sub</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Appropriate-Matrix-Sub/m-p/816013#M4309</link>
      <description>Just using Ipp32f, Ipp32u and Ipp16u instead of float, unsigned int and unsigned short, with no other IPP calls apart from the malloc, I get 39.09 seconds, which is a pretty good outcome in itself.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;I created a new profile for myself rather than using my work's group one.</description>
      <pubDate>Sun, 23 May 2010 22:19:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Appropriate-Matrix-Sub/m-p/816013#M4309</guid>
      <dc:creator>kramulous</dc:creator>
      <dc:date>2010-05-23T22:19:54Z</dc:date>
    </item>
  </channel>
</rss>

