<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic What's the behaviour if I in Intel® Integrated Performance Primitives</title>
    <link>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995122#M22830</link>
    <description>What is the behaviour if I call ippsEncodeLZOInit_8u with IppLZO1XMT? The documentation says that if the thread mode is IppLZO1XMT, compression and decompression are performed in parallel. Does that mean the input buffer is split evenly among the threads?
For example, if the input buffer is 24MB and ippGetNumThreads reports 24 threads, will each thread process 1MB?</description>
    <pubDate>Fri, 28 Sep 2012 03:43:56 GMT</pubDate>
    <dc:creator>haixiao_j_</dc:creator>
    <dc:date>2012-09-28T03:43:56Z</dc:date>
    <item>
      <title>Low performance issue about ipp function ippsEncodeLZO_8u</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995121#M22829</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;I am testing the performance of both Intel IPP LZO and LZO (version 2.0.6). I found that IPP performance is much lower than that of LZO 2.0.6.&lt;/P&gt;
&lt;P&gt;My test bed:&lt;/P&gt;
&lt;P&gt;Hardware&lt;/P&gt;
&lt;P&gt;•DELL R720&lt;/P&gt;
&lt;P&gt;&amp;nbsp; Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (Sandy Bridge Arch)&lt;/P&gt;
&lt;P&gt;&amp;nbsp; 24 GB RAM, BIOS Version: 1.2.6&lt;/P&gt;
&lt;P&gt;Software&lt;/P&gt;
&lt;P&gt;•OS: RHEL 6.0, kernel 2.6.32-71.el6.x86_64&lt;/P&gt;
&lt;P&gt;•Intel IPP main package: &lt;STRONG&gt;parallel_studio_xe_2011_sp1_update3_intel64 &lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;•LZO version 2.06&lt;/P&gt;
&lt;P&gt;•Compiler: &lt;STRONG&gt;gcc&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Test Method :&lt;/P&gt;
&lt;P&gt;1. First, I configure the number of threads and the number of rounds for the compression. (The IPP internal thread mode is IppLZO1XST, but the benchmark program itself is multithreaded.)&lt;/P&gt;
&lt;P&gt;2. Then the benchmark program reads the whole file into memory and compresses it entirely in memory.&lt;/P&gt;
&lt;P&gt;3. Finally, the program reports the throughput and the compression ratio.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="title"&gt;The main procedure&lt;/SPAN&gt;&lt;SPAN class="additional"&gt; of the&amp;nbsp;Intel IPP LZO&amp;nbsp;test program, in pseudocode:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;*&lt;STRONG&gt;The source file to be compressed is 16MB and the compression ratio is 1.5:1&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;#define BUFSIZE (16*1024*1024)&amp;nbsp; /* 16MB */&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;void compress_per_thread(const char* pInFileName, int opt_round_num) // this is the thread function&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;{&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; int fd_in, i;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; IppLZOState_8u *pLZOState;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Ipp8u *p_in_buffer = NULL, *p_out_buffer = NULL;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Ipp32u src_len, dst_len, lzoSize;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; struct timeval timeStart, timeEnd;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; fd_in = open(pInFileName, O_RDONLY, 0);&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ippsEncodeLZOGetSize(IppLZO1XST, BUFSIZE, &amp;amp;lzoSize);&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; pLZOState = (IppLZOState_8u*)ippsMalloc_8u(lzoSize);&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ippsEncodeLZOInit_8u(IppLZO1XST, BUFSIZE, pLZOState);&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; p_in_buffer = ippsMalloc_8u(BUFSIZE);&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; p_out_buffer = ippsMalloc_8u(BUFSIZE + BUFSIZE / 10);&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; src_len = (Ipp32u)read(fd_in, p_in_buffer, BUFSIZE); // the source file is exactly BUFSIZE bytes, so the whole file is read into memory&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; gettimeofday(&amp;amp;timeStart, NULL);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for(i = 0; i &amp;lt; opt_round_num; i++) // each thread runs opt_round_num rounds&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ippsEncodeLZO_8u(p_in_buffer, src_len, p_out_buffer, &amp;amp;dst_len, pLZOState);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; gettimeofday(&amp;amp;timeEnd, NULL);&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ippsFree(p_out_buffer);&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ippsFree(p_in_buffer);&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ippsFree(pLZOState);&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; close(fd_in);&lt;BR /&gt;&lt;SPAN class="additional"&gt;}&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN class="title"&gt;The main procedure&lt;/SPAN&gt;&lt;SPAN class="additional"&gt; of the&amp;nbsp;LZO (v2.0.3) test program is the same as for IPP LZO, except that it calls lzo1x_1_compress to compress.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;Performance peaks at 24 threads, where IPP LZO reaches&amp;nbsp;&lt;STRONG&gt;10.3 Gbps&lt;/STRONG&gt; and LZO v2.0.6 reaches&lt;STRONG&gt; 31.18 Gbps&lt;/STRONG&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp;&lt;STRONG&gt;Why is IPP LZO so much slower than LZO v2.0.6 on the 16MB test data?&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&lt;/SPAN&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp; Notes: &lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&amp;nbsp; If I set the IPP thread mode to IppLZO1XMT (by default, the number of IPP threads equals the number of processors in the system) while my benchmark program also runs as many threads as there are processors, I think the extra thread context switches will degrade performance.&lt;/SPAN&gt;&lt;/P&gt;
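&lt;P&gt;The Gbps figures above are presumably aggregated over all benchmark threads. A minimal C sketch of how such a figure can be derived, assuming (the post does not say so) that every thread compresses the whole BUFSIZE buffer opt_round_num times between the two gettimeofday() calls and that 1 Gbps means 10^9 bits per second; elapsed_seconds and aggregate_gbps are hypothetical helpers:&lt;/P&gt;
&lt;PRE&gt;
```c
#include "assert.h"
#include "stddef.h"
#include "sys/time.h"

/* Wall-clock seconds between two gettimeofday() samples. */
static double elapsed_seconds(struct timeval start, struct timeval end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_usec - start.tv_usec) / 1e6;
}

/* Aggregate throughput in Gbps (10^9 bits per second), assuming each of
 * `threads` threads compressed `buf_bytes` bytes `rounds` times within
 * `seconds` of wall-clock time. */
static double aggregate_gbps(size_t buf_bytes, int rounds, int threads,
                             double seconds)
{
    assert(seconds > 0.0);
    double bits = 8.0 * (double)buf_bytes * rounds * threads;
    return bits / seconds / 1e9;
}
```
&lt;/PRE&gt;
&lt;P&gt;With the reported figures, LZO v2.0.6 comes out roughly three times faster than IPP LZO (31.18 / 10.3 ≈ 3.0).&lt;/P&gt;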
&lt;P&gt;&lt;SPAN class="additional"&gt;&lt;/SPAN&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="additional"&gt;&lt;/SPAN&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 28 Sep 2012 03:31:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995121#M22829</guid>
      <dc:creator>haixiao_j_</dc:creator>
      <dc:date>2012-09-28T03:31:32Z</dc:date>
    </item>
    <item>
      <title>What's the behaviour if I</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995122#M22830</link>
      <description>What is the behaviour if I call ippsEncodeLZOInit_8u with IppLZO1XMT? The documentation says that if the thread mode is IppLZO1XMT, compression and decompression are performed in parallel. Does that mean the input buffer is split evenly among the threads?
For example, if the input buffer is 24MB and ippGetNumThreads reports 24 threads, will each thread process 1MB?</description>
      <pubDate>Fri, 28 Sep 2012 03:43:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995122#M22830</guid>
      <dc:creator>haixiao_j_</dc:creator>
      <dc:date>2012-09-28T03:43:56Z</dc:date>
    </item>
    <item>
      <title>My system has 24 logical core</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995123#M22831</link>
      <description>My system has 24 logical cores.</description>
      <pubDate>Fri, 28 Sep 2012 06:04:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995123#M22831</guid>
      <dc:creator>haixiao_j_</dc:creator>
      <dc:date>2012-09-28T06:04:51Z</dc:date>
    </item>
    <item>
      <title>haixiao, it might be the</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995124#M22832</link>
      <description>haixiao, it might be a problem with the threading in IPP's implementation; it should be checked on our side. Did you compare the performance in single-thread mode?</description>
      <pubDate>Sat, 29 Sep 2012 17:43:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995124#M22832</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2012-09-29T17:43:08Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995125#M22833</link>
      <description>Hi,
In terms of compression ratio, IPP LZO functions fall somewhere between LZO1B-2 and LZO1B-1. Look at the table obtained with the "lzotest" benchmark on the Calgary corpus for LZO 2.0.6:

Summary of total values

 Algorithm      Length    ComLen  Ratio%  Bits   Com MB/s  Dec MB/s
 memcpy()      3141622   3141622   100.0  8.00  14647.128  9467.011
 LZO1X-1(11)   3141622   1712066    54.5  4.36    315.300   759.555
 LZO1B-1       3141622   1534268    48.8  3.91    155.613   507.660
 ipplzost      3141622   1533165    48.8  3.90    165.994   777.953
 LZO1B-2       3141622   1487293    47.3  3.79    155.750   504.164
 LZO1B-3       3141622   1461534    46.5  3.72    152.539   499.473

Compared with the equivalent LZO methods, IPP LZO does not look bad. LZO1X-1 is faster, but at the cost of weaker compression.
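The Ratio% and Bits columns follow from Length and ComLen: Ratio% = 100 * ComLen / Length, and Bits = 8 * ComLen / Length (output bits per input byte). A small C sketch that reproduces, for example, the ipplzost row; ratio_percent and bits_per_byte are illustrative names, not part of lzotest:

```c
#include "assert.h"

/* Compressed size as a percentage of the original size (Ratio% column). */
static double ratio_percent(double length, double comlen)
{
    assert(length > 0.0);
    return 100.0 * comlen / length;
}

/* Output bits per input byte (Bits column); 8.00 means no compression. */
static double bits_per_byte(double length, double comlen)
{
    assert(length > 0.0);
    return 8.0 * comlen / length;
}
```

Smaller values mean better compression, which is why ipplzost (48.8%, 3.90 bits) sits next to LZO1B-1 rather than LZO1X-1 (54.5%, 4.36 bits).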
Regards,
Sergey</description>
      <pubDate>Mon, 01 Oct 2012 12:17:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995125#M22833</guid>
      <dc:creator>Sergey_K_Intel</dc:creator>
      <dc:date>2012-10-01T12:17:03Z</dc:date>
    </item>
    <item>
      <title>Quote:Gennady Fedorov (Intel)</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995126#M22834</link>
      <description>&lt;BLOCKQUOTE&gt;Gennady Fedorov (Intel) wrote:&lt;BR /&gt;&lt;P&gt;haixiao, it might be a problem with the threading in IPP's implementation; it should be checked on our side. Did you compare the performance in single-thread mode?&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
Hi, thanks for your reply. I got the performance results above in single-thread mode. I also measured multi-thread mode, and it is slower than single-thread mode. I think this is caused by the design of my benchmark program: the benchmark itself runs 24 threads, and the IPP multi-thread mode adds another 24, so 24*2 &amp;gt; 24 logical cores. The threads compete for the cores, and the resulting context switches degrade performance.

Some questions:
1. What is the main difference between LZO and IPP LZO?
2. LZO has different compression levels (1-9); does Intel provide IPP functions for all levels?</description>
      <pubDate>Thu, 04 Oct 2012 13:18:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995126#M22834</guid>
      <dc:creator>haixiao_j_</dc:creator>
      <dc:date>2012-10-04T13:18:26Z</dc:date>
    </item>
    <item>
      <title>Quote:Sergey Khlystov (Intel)</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995127#M22835</link>
      <description>&lt;BLOCKQUOTE&gt;Sergey Khlystov (Intel) wrote:&lt;BR /&gt;&lt;P&gt;Hi,&lt;BR /&gt;
In terms of compression ratio, IPP LZO functions fall somewhere between LZO1B-2 and LZO1B-1. Look at the table obtained with the "lzotest" benchmark on the Calgary corpus for LZO 2.0.6:&lt;/P&gt;
&lt;P&gt;Summary of total values&lt;/P&gt;
&lt;PRE&gt; Algorithm      Length    ComLen  Ratio%  Bits   Com MB/s  Dec MB/s
 memcpy()      3141622   3141622   100.0  8.00  14647.128  9467.011
 LZO1X-1(11)   3141622   1712066    54.5  4.36    315.300   759.555
 LZO1B-1       3141622   1534268    48.8  3.91    155.613   507.660
 ipplzost      3141622   1533165    48.8  3.90    165.994   777.953
 LZO1B-2       3141622   1487293    47.3  3.79    155.750   504.164
 LZO1B-3       3141622   1461534    46.5  3.72    152.539   499.473&lt;/PRE&gt;
&lt;P&gt;Compared with the equivalent LZO methods, IPP LZO does not look bad. LZO1X-1 is faster, but at the cost of weaker compression.&lt;BR /&gt;
Regards,&lt;BR /&gt;
Sergey&lt;/P&gt;&lt;/BLOCKQUOTE&gt;

Hi Sergey,
Thanks for your reply. Which LZO version is your IPP LZO test based on? My IPP LZO test is based on LZO 2.0.3, and I compared it with LZO 2.0.6.

Could the optimizations made between LZO 2.0.3 and LZO 2.0.6 account for such a large performance difference?

Some questions:
1. What is the main difference between LZO and IPP LZO?
2. LZO has different compression levels (1-9); does Intel provide IPP functions for all levels?</description>
      <pubDate>Thu, 04 Oct 2012 13:34:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995127#M22835</guid>
      <dc:creator>haixiao_j_</dc:creator>
      <dc:date>2012-10-04T13:34:07Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995128#M22836</link>
      <description>Hi,
IPP LZO functions are based on the general LZO idea and, mostly, on the LZO packed format. When these functions were developed, LZO 2.0.3 was the current version. Apart from a similar function interface and an identical output format, there is no similarity between IPP and non-IPP LZO.
The difference between LZO levels comes mostly from how much effort goes into compression dictionary lookups: the more time the function spends on substring searches, the better the compression ratio it can obtain, though at a lower compression speed.
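To illustrate that trade-off, here is a toy LZ77-style match finder in C. It is not the IPP or LZO implementation; the function longest_match and its max_candidates search budget are hypothetical. It scans earlier positions from nearest to farthest and stops after max_candidates attempts, so a larger budget can only find an equal or longer match, at the cost of more work:

```c
#include "assert.h"
#include "stddef.h"

/* Toy match finder (illustration only). Returns the length of the longest
 * match for the string starting at pos, trying earlier start positions
 * from nearest to farthest and giving up after max_candidates of them. */
static size_t longest_match(const char *data, size_t len, size_t pos,
                            size_t max_candidates)
{
    size_t best = 0, tried = 0;

    assert(len >= pos);
    for (size_t start = pos; start != 0; start--) {
        size_t cand = start - 1;          /* candidate match position */
        size_t n = 0;

        if (tried == max_candidates)
            break;                        /* search budget exhausted */
        tried++;
        /* Extend the match; it may overlap pos, as LZ77 allows. */
        while (pos + n != len) {
            if (data[cand + n] != data[pos + n])
                break;
            n++;
        }
        if (n > best)
            best = n;
    }
    return best;
}
```

For the string "abcdzababcd" and pos = 7, a budget of 2 candidates finds only the nearby 2-byte match "ab", while a budget of 7 reaches position 0 and finds the 4-byte match "abcd". Production compressors locate candidates through hash tables rather than a linear scan, but the speed versus ratio trade-off has the same shape.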
Regarding your last question: no, IPP does not provide different levels of LZO compression. It was an experimental development, and until we see demand for LZO functionality, we are not planning to do anything more in this area.
By the way, using multi-threaded IPP functions in multi-threaded applications can cause a significant performance drawback. In your case (IPP LZO in a 24-thread application), I am afraid the real situation is dramatic: each function call can start up to 24 threads, and 24*24 threads is far too many. IPP functions with internal multi-threading were designed for single-threaded applications, to save IPP customers the overhead of developing multi-threaded solutions. The situation is different now: more and more applications are multi-threaded, which is why we are deprecating internal multi-threading.
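The thread counts in this discussion can be compared with a few lines of C. This sketch assumes, as described above, that each application thread calls an internally threaded IPP function that may start up to threads_per_call threads of its own; worst_case_threads and oversubscription are hypothetical helpers:

```c
#include "assert.h"

/* Worst-case number of runnable threads when every application thread
 * calls a function that starts up to threads_per_call threads. */
static long worst_case_threads(long app_threads, long threads_per_call)
{
    assert(app_threads > 0);
    assert(threads_per_call > 0);
    return app_threads * threads_per_call;
}

/* Runnable threads per logical core. Values above 1.0 mean threads must
 * share cores, so the OS time-slices them and context switches increase. */
static double oversubscription(long threads, long logical_cores)
{
    assert(logical_cores > 0);
    return (double)threads / (double)logical_cores;
}
```

With 24 application threads on the 24 logical cores reported earlier in this thread, the single-threaded IppLZO1XST mode gives a ratio of 1.0, while IppLZO1XMT could reach 576 threads, a ratio of 24. Keeping the IPP calls single-threaded and parallelizing at the application level, as the original benchmark does, avoids that oversubscription.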
Regards,
Sergey</description>
      <pubDate>Mon, 08 Oct 2012 11:50:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995128#M22836</guid>
      <dc:creator>Sergey_K_Intel</dc:creator>
      <dc:date>2012-10-08T11:50:38Z</dc:date>
    </item>
    <item>
      <title>Quote:Sergey Khlystov (Intel)</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995129#M22837</link>
      <description>&lt;BLOCKQUOTE&gt;Sergey Khlystov (Intel) wrote:&lt;BR /&gt;&lt;P&gt;Hi,&lt;BR /&gt;
IPP LZO functions are based on the general LZO idea and, mostly, on the LZO packed format. When these functions were developed, LZO 2.0.3 was the current version. Apart from a similar function interface and an identical output format, there is no similarity between IPP and non-IPP LZO.&lt;BR /&gt;
The difference between LZO levels comes mostly from how much effort goes into compression dictionary lookups: the more time the function spends on substring searches, the better the compression ratio it can obtain, though at a lower compression speed.&lt;BR /&gt;
Regarding your last question: no, IPP does not provide different levels of LZO compression. It was an experimental development, and until we see demand for LZO functionality, we are not planning to do anything more in this area.&lt;BR /&gt;
By the way, using multi-threaded IPP functions in multi-threaded applications can cause a significant performance drawback. In your case (IPP LZO in a 24-thread application), I am afraid the real situation is dramatic: each function call can start up to 24 threads, and 24*24 threads is far too many. IPP functions with internal multi-threading were designed for single-threaded applications, to save IPP customers the overhead of developing multi-threaded solutions. The situation is different now: more and more applications are multi-threaded, which is why we are deprecating internal multi-threading.&lt;BR /&gt;
Regards,&lt;BR /&gt;
Sergey&lt;/P&gt;&lt;/BLOCKQUOTE&gt;

Sergey,

Thanks for your detailed explanation! Now I understand.

Regards,
Haixiao</description>
      <pubDate>Mon, 15 Oct 2012 14:12:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Low-performance-issue-about-ipp-function-ippsEncodeLZO-8u/m-p/995129#M22837</guid>
      <dc:creator>haixiao_j_</dc:creator>
      <dc:date>2012-10-15T14:12:29Z</dc:date>
    </item>
  </channel>
</rss>

