<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Format Conversion in Intel® Integrated Performance Primitives</title>
    <link>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819429#M4574</link>
    <description>Format Conversion</description>
    <pubDate>Tue, 24 May 2011 13:20:29 GMT</pubDate>
    <dc:creator>BatterseaSteve</dc:creator>
    <dc:date>2011-05-24T13:20:29Z</dc:date>
    <item>
      <title>Format Conversion</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819422#M4567</link>
      <description>Hi&lt;BR /&gt;We are evaluating IPP for MPEG encode/decode and are new to IPP.&lt;BR /&gt;Our platform is a dual six-core Intel Xeon X5650 (24 logical CPUs) under RedHat 5.5, using gcc 4.1.2.&lt;BR /&gt;&lt;BR /&gt;We currently use ffmpeg to decode movies, but want to conform all decoded output to packed UYVY.&lt;BR /&gt;We are using IPP to convert from YUV420p, YUV422p, etc. into packed CbYCr.&lt;BR /&gt;We are also resizing video, using ippiResizeYUV422_8u_C2R.&lt;BR /&gt;However, this outputs YUYV, so we use ippiYCbCr422ToCbYCr422_8u_C2R to transform the result into UYVY. My questions are:&lt;BR /&gt;&lt;BR /&gt;1) Is there a Resize that resizes directly into CbYCr?&lt;BR /&gt;2) The Resize is slow even for Nearest Neighbour (4-5 msec for HD) and unusable for Cubic. I notice that even though we are linking in the threaded libraries, no threading takes place. Is this expected for these functions? I also see no difference whether or not I call ippInit (or ippStaticInit), which makes me suspicious that we are (for some reason) not even calling the optimised functions.&lt;BR /&gt;3) Can anyone suggest what strategy we should use when resizing interlaced video? Is it normal to de-interlace before resizing, or is there another way of dealing with this?&lt;BR /&gt;&lt;BR /&gt;Any help, comments or abuse are welcome.</description>
      <pubDate>Fri, 20 May 2011 13:46:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819422#M4567</guid>
      <dc:creator>BatterseaSteve</dc:creator>
      <dc:date>2011-05-20T13:46:21Z</dc:date>
    </item>
    <item>
      <title>Format Conversion</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819423#M4568</link>
      <description>Apologies - I seem to have submitted this thread twice (Firefox hung). Does anyone know how to delete one of them?&lt;BR /&gt;Cheers</description>
      <pubDate>Fri, 20 May 2011 16:55:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819423#M4568</guid>
      <dc:creator>BatterseaSteve</dc:creator>
      <dc:date>2011-05-20T16:55:37Z</dc:date>
    </item>
    <item>
      <title>Format Conversion</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819424#M4569</link>
      <description>Hi, I deleted the duplicate post for you.</description>
      <pubDate>Sat, 21 May 2011 02:30:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819424#M4569</guid>
      <dc:creator>Joseph_S_Intel</dc:creator>
      <dc:date>2011-05-21T02:30:41Z</dc:date>
    </item>
    <item>
      <title>Format Conversion</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819425#M4570</link>
      <description>Hi,&lt;BR /&gt;&lt;P&gt;&lt;EM&gt;1) Is there a Resize that we can use that resizes direct into CbYCr?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;No, we don't have a function that resizes directly into CbYCr.&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;2) The Resize is slow even for Nearest Neighbour (4-5msec for HD) and unusable for Cubic. I notice that even though we are linking in threaded libraries no threading takes place - is this expected for these functions. I also see no difference in calling ippInit (or ippStaticInit) or not - which makes me suspicious that we are (for some reason) not even calling optimised functions.&lt;/EM&gt;&lt;BR /&gt;Are you linking statically? ippInit is only necessary if you are linking statically; if you are linking dynamically, you do not need to call any initialization function. Only about twenty percent of the functions in the Intel IPP shared and static threaded libraries are actually threaded. The file ThreadedFunctionsList.txt in the Intel IPP documentation lists the functions that are threaded, and the functions you mentioned above are not in that list. In some cases you can thread primitive functions yourself with a TBB, Cilk Plus or OpenMP wrapper, so that might be an option for those functions.&lt;/P&gt;</description>
      <pubDate>Sat, 21 May 2011 03:12:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819425#M4570</guid>
      <dc:creator>Joseph_S_Intel</dc:creator>
      <dc:date>2011-05-21T03:12:23Z</dc:date>
    </item>
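    <!--
The OpenMP-wrapper suggestion in the reply above can be sketched roughly as follows. This is a minimal illustration, not Intel's recommended implementation: run_in_chunks, kernel_fn and halve_kernel are hypothetical names, and halve_kernel is only a placeholder for a non-threaded, element-wise primitive. It applies directly only to element-wise operations; a resize would additionally need matching source and destination regions for each strip.

```c
#include <assert.h>
#include <stddef.h>

/* Signature of a single-threaded, element-wise primitive (hypothetical typedef). */
typedef void (*kernel_fn)(const float *src, float *dst, int len);

/* Placeholder kernel standing in for a non-threaded library primitive. */
void halve_kernel(const float *src, float *dst, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] = src[i] * 0.5f;
}

/* Split [0, len) into `chunks` contiguous ranges and run the kernel on each.
   With -fopenmp the chunks run in parallel; without it the pragma is ignored
   and the loop runs serially with the same result. */
void run_in_chunks(kernel_fn kernel, const float *src, float *dst,
                   int len, int chunks)
{
    assert(kernel != NULL && len >= 0 && chunks > 0);
    #pragma omp parallel for
    for (int c = 0; c < chunks; c++) {
        /* 64-bit arithmetic so len * c cannot overflow int. */
        int begin = (int)((long long)len * c / chunks);
        int end   = (int)((long long)len * (c + 1) / chunks);
        if (end > begin)
            kernel(src + begin, dst + begin, end - begin);
    }
}
```

Each chunk writes a disjoint part of dst, so no locking is needed. Choosing one chunk per physical core is a reasonable starting point, but that is a tuning assumption rather than something stated in the thread.
    -->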
    <item>
      <title>Format Conversion</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819426#M4571</link>
      <description>Hi&lt;DIV&gt;Thanks for the reply. I am still finding my way around IPP and had not seen the threaded function list. I am linking statically. So I guess my question is why I see no difference in timing regardless of whether I call ippInit, which, as I say, makes me suspicious.&lt;/DIV&gt;&lt;DIV&gt;My understanding was that without ippInit the call would fall back to optimised C. Is there any way I can find out which CPU variant I am actually calling?&lt;/DIV&gt;&lt;DIV&gt;Are there any performance figures I can compare against to see if mine are in the ballpark?&lt;/DIV&gt;&lt;DIV&gt;Cheers&lt;/DIV&gt;&lt;DIV&gt;Steve&lt;/DIV&gt;</description>
      <pubDate>Sun, 22 May 2011 13:25:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819426#M4571</guid>
      <dc:creator>BatterseaSteve</dc:creator>
      <dc:date>2011-05-22T13:25:49Z</dc:date>
    </item>
    <item>
      <title>Format Conversion</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819427#M4572</link>
      <description>&lt;P&gt;Steve, &lt;/P&gt;&lt;P&gt;You can use the following function to check which optimized version you are using: &lt;BR /&gt;ippiGetLibVersion()&lt;/P&gt;&lt;P&gt;The function returns the version of the library in use, which tells you which optimized code path was selected. &lt;BR /&gt;Also, you may consider using the following two functions to resize the image: &lt;BR /&gt; ippiResizeSqrPixel() //Resize first. If there are three YUV planes, it may need to be called three times&lt;BR /&gt; ippiYCrCb420ToCbYCr422_8u_P3C2R //color conversion. &lt;/P&gt;&lt;P&gt;ippiResizeSqrPixel is a more optimized function for performance. &lt;/P&gt;&lt;P&gt;Thanks,&lt;BR /&gt;Chao&lt;/P&gt;</description>
      <pubDate>Tue, 24 May 2011 04:00:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819427#M4572</guid>
      <dc:creator>Chao_Y_Intel</dc:creator>
      <dc:date>2011-05-24T04:00:30Z</dc:date>
    </item>
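    <!--
Once the library reports a two-letter dispatch code (the thread later mentions y8), a small lookup can make it readable. The sketch below is an illustration under assumptions: describe_dispatch_code is a hypothetical helper, the y8 entry follows the description given in this thread, and the other Intel64 entries are recalled from the IPP 7.0 dispatching article and should be checked against it. How the code is read out of the structure returned by ippiGetLibVersion() is left out here, since the field names should be taken from the IPP headers.

```c
#include <assert.h>
#include <string.h>

/* Map a two-letter Intel64 dispatch code to the instruction set it targets.
   Returns "unknown" for any code not in this short, non-exhaustive table. */
const char *describe_dispatch_code(const char *code)
{
    assert(code != NULL);
    if (strcmp(code, "m7") == 0) return "SSE3";
    if (strcmp(code, "u8") == 0) return "SSSE3";
    if (strcmp(code, "y8") == 0) return "SSE4.1/SSE4.2/AES-NI";
    if (strcmp(code, "e9") == 0) return "AVX";
    return "unknown";
}
```

Logging this string once at start-up is usually enough to confirm that the expected optimized layer is being dispatched.
    -->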
    <item>
      <title>Format Conversion</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819428#M4573</link>
      <description>To be more clear, ippiResizeSqrPixel can utilize multiple cores (when threading is enabled in IPP), ippiResize cannot.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 24 May 2011 11:25:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819428#M4573</guid>
      <dc:creator>Thomas_Jensen1</dc:creator>
      <dc:date>2011-05-24T11:25:18Z</dc:date>
    </item>
    <item>
      <title>Format Conversion</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819429#M4574</link>
      <description>Thanks guys,&lt;BR /&gt;That's very useful - I will take a look at ResizeSqrPixel.&lt;BR /&gt;I did get to see which version I was calling - in a crash!&lt;BR /&gt;It seems to be using y8, which is correct for the CPUs I am running.&lt;BR /&gt;&lt;BR /&gt;So back to my original question - I see no difference in speed &lt;BR /&gt;regardless of whether I call ippInit or not.&lt;BR /&gt;&lt;BR /&gt;If I run the t2.cpp example:&lt;BR /&gt;&lt;BR /&gt;/* static non-threaded lib&lt;BR /&gt;g++ t2.cpp -I /opt/intel/composerxe-2011.4.191/ipp/include -o t2 \&lt;BR /&gt;/opt/intel/composerxe-2011.4.191/ipp/lib/intel64/libipps_t.a \&lt;BR /&gt;/opt/intel/composerxe-2011.4.191/ipp/lib/intel64/libippcore_t.a \&lt;BR /&gt;/opt/intel/composerxe-2011.4.191/compiler/lib/intel64/libiomp5.a -lpthread&lt;BR /&gt;&lt;BR /&gt;static threaded&lt;BR /&gt;g++ t2.cpp -I /opt/intel/composerxe-2011.4.191/ipp/include -o t2 \&lt;BR /&gt;/opt/intel/composerxe-2011.4.191/ipp/lib/intel64/libipps_l.a \&lt;BR /&gt;/opt/intel/composerxe-2011.4.191/ipp/lib/intel64/libippcore_l.a&lt;BR /&gt;&lt;BR /&gt;dynamic&lt;BR /&gt;g++ t2.cpp -I /opt/intel/composerxe-2011.4.191/ipp/include -o t2 -L /opt/intel/composerxe-2011.4.191/ipp/lib/intel64 -lippcore -lipps -lpthread&lt;BR /&gt;*/&lt;BR /&gt;&lt;BR /&gt;#include &amp;lt;stdio.h&amp;gt;&lt;BR /&gt;#include &amp;lt;ipp.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;int main()&lt;BR /&gt;{&lt;BR /&gt; const int N = 20000, loops = 100;&lt;BR /&gt; Ipp32f src[N], dst[N];&lt;BR /&gt; unsigned int seed = 12345678, i;&lt;BR /&gt; Ipp64s t1,t2;&lt;BR /&gt;&lt;BR /&gt; /// no StaticInit call, means PX code, not optimized&lt;BR /&gt; ippsRandUniform_Direct_32f(src,N,0.0,1.0,&amp;amp;seed);&lt;BR /&gt; t1=ippGetCpuClocks();&lt;BR /&gt; for(i=0; i&amp;lt;loops; i++) ippsSqrt_32f(src,dst,N);&lt;BR /&gt; t2=ippGetCpuClocks();&lt;BR /&gt; printf("without StaticInit: %.1f clocks/element\n",(float)(t2-t1)/loops/N);&lt;BR /&gt; ippInit();&lt;BR /&gt; 
t1=ippGetCpuClocks();&lt;BR /&gt; for(i=0; i&amp;lt;loops; i++) ippsSqrt_32f(src,dst,N);&lt;BR /&gt; t2=ippGetCpuClocks();&lt;BR /&gt; printf("with StaticInit: %.1f clocks/element\n",(float)(t2-t1)/loops/N);&lt;BR /&gt; return 0;&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;a) static non-threaded, I get&lt;BR /&gt;without StaticInit: 1.4 clocks/element&lt;BR /&gt;
with StaticInit: 2.8 clocks/element&lt;BR /&gt;&lt;BR /&gt;b) static threaded&lt;BR /&gt;without StaticInit: 1.4 clocks/element&lt;BR /&gt;with StaticInit: 2.3 clocks/element&lt;BR /&gt;&lt;BR /&gt;c) dynamic&lt;BR /&gt;without StaticInit: 2.5 clocks/element&lt;BR /&gt;with StaticInit: 1.0 clocks/element&lt;BR /&gt;&lt;BR /&gt;The only one that is faster with static init is the dynamically loaded one!&lt;BR /&gt;Am I missing something here?&lt;BR /&gt;Steve</description>
      <pubDate>Tue, 24 May 2011 13:20:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819429#M4574</guid>
      <dc:creator>BatterseaSteve</dc:creator>
      <dc:date>2011-05-24T13:20:29Z</dc:date>
    </item>
    <item>
      <title>Format Conversion</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819430#M4575</link>
      <description>&lt;P&gt;Hi Steve,&lt;BR /&gt;The minimum instruction set supported in IPP changed with Intel IPP 7.0; for instance, it is now SSE3 on the 64-bit version of Intel IPP, so the ippsSqrt function will use at least SSE3 code, and its performance is probably not that different from the y8 (SSE4.1, SSE4.2, AES-NI) version. &lt;BR /&gt;&lt;BR /&gt;See this article: &lt;BR /&gt;&lt;A href="http://software.intel.com/en-us/articles/understanding-simd-optimization-layers-and-dispatching-in-the-intel-ipp-70-library/" target="_blank"&gt;http://software.intel.com/en-us/articles/understanding-simd-optimization-layers-and-dispatching-in-the-intel-ipp-70-library/&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;In addition, I built your code and see the decrease in clocks per element after calling ippInit in the dynamically linked version, but I see the same speed-up with the ippInit call commented out, so the difference is due to something else. &lt;BR /&gt;&lt;BR /&gt;For the static version I did not see an increase in the number of clocks per element after calling ippInit. You should probably separate the experiments into different program runs to eliminate any cache warming or other effects. &lt;/P&gt;</description>
      <pubDate>Wed, 25 May 2011 22:46:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819430#M4575</guid>
      <dc:creator>Joseph_S_Intel</dc:creator>
      <dc:date>2011-05-25T22:46:20Z</dc:date>
    </item>
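    <!--
The advice above about cache warming can be illustrated with a small timing harness that runs one untimed warm-up call and then keeps the best of several timed batches. This is a sketch under assumptions, not the benchmark used in the thread: best_seconds_per_element, kernel_fn and square_kernel are hypothetical names, square_kernel is only a placeholder for the function under test, and the standard C clock() is used in place of ippGetCpuClocks so the example does not depend on IPP.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature of the function under test (hypothetical typedef). */
typedef void (*kernel_fn)(const float *src, float *dst, int len);

/* Placeholder kernel standing in for the primitive being measured. */
void square_kernel(const float *src, float *dst, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] = src[i] * src[i];
}

/* Returns the best (smallest) CPU time in seconds per element over `repeats`
   timed batches of `loops` kernel calls, after one untimed warm-up call. */
double best_seconds_per_element(kernel_fn kernel, const float *src, float *dst,
                                int len, int loops, int repeats)
{
    assert(kernel != NULL && len > 0 && loops > 0 && repeats > 0);
    kernel(src, dst, len);  /* warm-up: touches the buffers and code once */
    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        clock_t start = clock();
        for (int i = 0; i < loops; i++)
            kernel(src, dst, len);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best / loops / len;
}
```

Running each configuration (with and without the initialization call) as a separate process, each using a harness like this, avoids the first measurement paying start-up costs that the second one does not.
    -->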
    <item>
      <title>Format Conversion</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819431#M4576</link>
      <description>&lt;P&gt;Steve,&lt;/P&gt;&lt;P&gt;Another note on the benchmark code concerns data alignment. Since each element needs only 1 or 2 CPU clock ticks, memory access becomes an important factor in performance. For the src/dst data, you can use an ippsMalloc_ function (for example ippsMalloc_32f) to allocate aligned buffers, so that the test code behaves similarly from run to run. &lt;/P&gt;&lt;P&gt;Thanks,&lt;BR /&gt;Chao&lt;/P&gt;</description>
      <pubDate>Fri, 27 May 2011 09:08:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Format-Conversion/m-p/819431#M4576</guid>
      <dc:creator>Chao_Y_Intel</dc:creator>
      <dc:date>2011-05-27T09:08:22Z</dc:date>
    </item>
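    <!--
The alignment point above can be illustrated without IPP by using the C11 aligned_alloc function as a stand-in for ippsMalloc_32f. This is a sketch under assumptions: alloc_aligned_floats and is_aligned are hypothetical helpers, and the 64-byte alignment is an assumed value chosen for illustration, not the alignment that any particular IPP version guarantees.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BUFFER_ALIGNMENT 64  /* assumed alignment in bytes */

/* Allocate `count` floats on a BUFFER_ALIGNMENT boundary; release with free(). */
float *alloc_aligned_floats(size_t count)
{
    assert(count > 0);
    size_t bytes = count * sizeof(float);
    /* aligned_alloc requires the size to be a multiple of the alignment. */
    size_t padded = (bytes + BUFFER_ALIGNMENT - 1) / BUFFER_ALIGNMENT * BUFFER_ALIGNMENT;
    return (float *)aligned_alloc(BUFFER_ALIGNMENT, padded);
}

/* Returns 1 if ptr is a multiple of `alignment` bytes, otherwise 0. */
int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}
```

In the benchmark itself, the stack arrays would be replaced by buffers from ippsMalloc_32f and released with ippsFree, so that src and dst start on the same kind of boundary in every run.
    -->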
  </channel>
</rss>

