Low performance issue about ipp function ippsEncodeLZO_8u

haixiao_j_ · ‎09-27-2012

Hi,

I am testing the performance of both Intel IPP LZO and LZO 2.06, and I found that IPP LZO is much slower than LZO 2.06.

My test bed:

Hardware

•DELL R720

Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (Sandy Bridge Arch)

24 GB RAM, BIOS Version: 1.2.6

Software

•OS: RH6.0, kernel 2.6.32-71.el6.x86_64

•Intel IPP main package: parallel_studio_xe_2011_sp1_update3_intel64

•LZO version 2.06

•Compile Option: gcc

Test Method :

1. The thread count and the number of rounds per thread are configurable. (The IPP internal threading mode is IppLZO1XST, but the benchmark program itself is multithreaded.)

2. The benchmark program then reads the whole file into memory and compresses it entirely in memory.

3. Finally, it reports the throughput and the compression ratio.

Pseudocode for the main procedure of the Intel IPP LZO test program:

*The source file to be compressed is 16MB and the compression ratio is 1.5:1

#include <fcntl.h>
#include <sys/time.h>
#include <unistd.h>
#include <ipp.h>

#define BUFSIZE (16 * 1024 * 1024) /* 16MB */

/* Thread function. Error checking is omitted for brevity. */
void compress_per_thread(const char* pInFileName, int opt_round_num)
{
    int fd_in, i;
    IppLZOState_8u *pLZOState;
    Ipp8u *p_in_buffer = NULL, *p_out_buffer = NULL;
    Ipp32u src_len, dst_len, lzoSize;
    struct timeval timeStart, timeEnd;

    fd_in = open(pInFileName, O_RDONLY, 0);

    /* Allocate and initialize the single-threaded LZO encoder state. */
    ippsEncodeLZOGetSize(IppLZO1XST, BUFSIZE, &lzoSize);
    pLZOState = (IppLZOState_8u*)ippsMalloc_8u(lzoSize);
    ippsEncodeLZOInit_8u(IppLZO1XST, BUFSIZE, pLZOState);

    p_in_buffer = ippsMalloc_8u(BUFSIZE);
    p_out_buffer = ippsMalloc_8u(BUFSIZE + BUFSIZE / 10);

    /* The source file is exactly BUFSIZE bytes, so this reads the whole file into memory. */
    src_len = (Ipp32u)read(fd_in, p_in_buffer, BUFSIZE);

    gettimeofday(&timeStart, NULL);
    for (i = 0; i < opt_round_num; i++) /* opt_round_num is set per thread to tune the run */
    {
        ippsEncodeLZO_8u(p_in_buffer, src_len, p_out_buffer, &dst_len, pLZOState);
    }
    gettimeofday(&timeEnd, NULL);

    ippsFree(pLZOState);
    ippsFree(p_out_buffer);
    ippsFree(p_in_buffer);
    close(fd_in);
}

The main procedure of the LZO (v.2.0.3) test program is the same as for IPP LZO, except that it calls lzo1x_1_compress to compress.

Performance peaked at 24 threads. However, IPP LZO reached only 10.3 Gbps, while LZO v2.0.6 reached 31.18 Gbps.

Why is IPP LZO so much slower than LZO v2.0.6 on the 16MB test data?

Notes:

If I set the IPP threading mode to IppLZO1XMT (where the thread count equals the number of processors in the system by default) and my benchmark program also runs as many threads as there are processors, I think the thread context switching will degrade performance.
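For readers who want to reproduce the Gbps figures quoted above, here is a minimal sketch of how the two gettimeofday() samples could be turned into an aggregate throughput number. The helper names elapsed_seconds and throughput_gbps are hypothetical (they are not from the original benchmark), and the sketch assumes that 1 Gbit means 1e9 bits and that throughput is measured on the uncompressed input.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Elapsed wall-clock time in seconds between two gettimeofday() samples. */
double elapsed_seconds(const struct timeval *start, const struct timeval *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_usec - start->tv_usec) / 1e6;
}

/* Aggregate input throughput in Gbit/s.
   Assumption: 1 Gbit = 1e9 bits, and every thread compresses
   bytes_per_round input bytes in each round. */
double throughput_gbps(size_t bytes_per_round, int rounds, int threads,
                       double seconds)
{
    assert(seconds > 0.0);
    double total_bits = 8.0 * (double)bytes_per_round * rounds * threads;
    return total_bits / seconds / 1e9;
}
```

With the setup described in the post, bytes_per_round would be BUFSIZE, rounds would be opt_round_num, and threads would be 24. Whether the original benchmark counted bits in exactly this way is not stated in the thread.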

haixiao_j_ · ‎09-27-2012

What is the behaviour if I call ippsEncodeLZOInit_8u with IppLZO1XMT? The documentation says that if the thread mode is IppLZO1XMT, compression and decompression are performed in parallel. Does that mean the input buffer is split evenly among the threads? For example, if the input buffer is 24MB and ippGetNumThreads returns 24, will each thread process 1MB?

haixiao_j_ · ‎09-27-2012

My system has 24 logical cores.

Gennady_F_Intel · ‎09-29-2012

haixiao, it might be a problem with the threading in IPP's implementation. It should be checked on our side. Did you compare the performance in single-thread mode?

Sergey_K_Intel · ‎10-01-2012

Hi,

In terms of compression ratio, the IPP LZO functions fall somewhere between LZO1B-2 and LZO1B-1. Look at the table obtained with the "lzotest" benchmark on the Calgary corpus for LZO 2.0.6:

Summary of total values

Algorithm     Length   ComLen  Ratio%  Bits   Com MB/s  Dec MB/s
memcpy()     3141622  3141622   100.0  8.00  14647.128  9467.011
LZO1X-1(11)  3141622  1712066    54.5  4.36    315.300   759.555
LZO1B-1      3141622  1534268    48.8  3.91    155.613   507.660
ipplzost     3141622  1533165    48.8  3.90    165.994   777.953
LZO1B-2      3141622  1487293    47.3  3.79    155.750   504.164
LZO1B-3      3141622  1461534    46.5  3.72    152.539   499.473

Compared with the equivalent LZO methods, IPP LZO does not look bad. LZO1X-1 is faster, but at the cost of less compression.

Regards,
Sergey
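As a reading aid for the table above, the Ratio% and Bits columns appear to follow directly from the Length and ComLen columns. The sketch below shows that relationship. The function names are hypothetical, and the formulas are an assumption inferred from the table values (for example, memcpy() shows 100.0 and 8.00), not taken from the lzotest source.

```c
#include <assert.h>

/* Compressed size as a percentage of the original size ("Ratio%" column).
   Assumed formula: 100 * ComLen / Length. */
double ratio_percent(unsigned long length, unsigned long com_len)
{
    assert(length > 0);
    return 100.0 * (double)com_len / (double)length;
}

/* Average number of output bits per input byte ("Bits" column).
   Assumed formula: 8 * ComLen / Length. */
double bits_per_byte(unsigned long length, unsigned long com_len)
{
    assert(length > 0);
    return 8.0 * (double)com_len / (double)length;
}
```

Under these assumed formulas, a lower value in either column means stronger compression, which is why the ipplzost row sits between LZO1B-1 and LZO1B-2.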

haixiao_j_ · ‎10-04-2012

Gennady Fedorov (Intel) wrote:
haixiao, it might be a problem with the threading in IPP's implementation. It should be checked on our side. Did you compare the performance in single-thread mode?

Hi, thanks for your reply.

The performance results above were obtained in single-thread mode. I also checked multi-thread mode, and it is slower than single-thread mode. I think this is caused by how my benchmark program is implemented: the benchmark is multithreaded (24 threads) and the IPP mode is also multithreaded (24 threads), so 24*2 > the number of logical cores (24). The threads then compete with each other, and the thread context switching degrades performance.

Some questions:

1. What is the main difference between LZO and IPP LZOP?

2. LZO has different compression levels (1 - 9); does Intel provide an IPP library for all levels?

haixiao_j_ · ‎10-04-2012

Sergey Khlystov (Intel) wrote:
Hi,
In terms of compression ratio, the IPP LZO functions fall somewhere between LZO1B-2 and LZO1B-1. Look at the table obtained with the "lzotest" benchmark on the Calgary corpus for LZO 2.0.6:

Summary of total values

Algorithm     Length   ComLen  Ratio%  Bits   Com MB/s  Dec MB/s
memcpy()     3141622  3141622   100.0  8.00  14647.128  9467.011
LZO1X-1(11)  3141622  1712066    54.5  4.36    315.300   759.555
LZO1B-1      3141622  1534268    48.8  3.91    155.613   507.660
ipplzost     3141622  1533165    48.8  3.90    165.994   777.953
LZO1B-2      3141622  1487293    47.3  3.79    155.750   504.164
LZO1B-3      3141622  1461534    46.5  3.72    152.539   499.473

Compared with the equivalent LZO methods, IPP LZO does not look bad. LZO1X-1 is faster, but at the cost of less compression.
Regards,
Sergey

Hi Sergey,

Thanks for your reply. Which LZO version is your IPP LZO test based on? My IPP LZO test is based on LZO 2.0.3, and I compared it with LZO 2.0.6. Do the performance optimizations between LZO 2.0.3 and LZO 2.0.6 account for such a large performance difference?

Some questions:

1. What is the main difference between LZO and IPP LZOP?

2. LZO has different compression levels (1 - 9); does Intel provide an IPP library for all levels?

Sergey_K_Intel · ‎10-08-2012

Hi,

The IPP LZO functions are based on the general idea of LZO and, mostly, on the LZO packed format. At the time these functions were developed, LZO 2.0.3 was current. Apart from a similar function interface and an identical output format, there is no similarity between IPP and non-IPP LZO.

The difference between LZO levels comes mostly from differences in the compression dictionary lookups. The more time a function spends in substring searches, the better the compression ratio it can achieve (at lower compression speed, though).

Regarding your last question: no, IPP does not provide different levels of LZO compression. It was an experimental development, and until we see demand for LZO functionality we are not planning to do anything extra in this area.

By the way, using multithreaded IPP functions in multithreaded applications can bring a significant performance penalty. In your case (IPP LZO in a 24-thread application), I am afraid the real situation is dramatic: each function can submit up to 24 threads. 24*24 is too much ))). IPP functions with internal multithreading were designed for single-threaded applications, to save IPP customers the overhead of developing multithreaded solutions. Now the situation is different: more and more applications are becoming multithreaded. This is why we are deprecating internal multithreading.

Regards,
Sergey
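The 24*24 arithmetic in the reply above can be written out as a small sketch. The function names are hypothetical, and the model is a deliberately simple worst case: it assumes every application thread is inside a library call at the same moment and that each call spawns its full set of internal threads.

```c
#include <assert.h>

/* Worst-case number of runnable threads when every application thread
   calls a routine that itself uses internal worker threads. */
unsigned total_threads(unsigned app_threads, unsigned internal_threads_per_call)
{
    return app_threads * internal_threads_per_call;
}

/* Runnable threads per logical core. Values above 1.0 mean the cores
   are oversubscribed and the scheduler must context-switch. */
double oversubscription_factor(unsigned app_threads,
                               unsigned internal_threads_per_call,
                               unsigned logical_cores)
{
    assert(logical_cores > 0);
    return (double)total_threads(app_threads, internal_threads_per_call)
         / (double)logical_cores;
}
```

For the machine in this thread, total_threads(24, 24) gives 576 threads on 24 logical cores, while the single-threaded mode corresponds to total_threads(24, 1), one thread per core.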

haixiao_j_ · ‎10-15-2012

Sergey Khlystov (Intel) wrote:
Hi,
The IPP LZO functions are based on the general idea of LZO and, mostly, on the LZO packed format. At the time these functions were developed, LZO 2.0.3 was current. Apart from a similar function interface and an identical output format, there is no similarity between IPP and non-IPP LZO.
The difference between LZO levels comes mostly from differences in the compression dictionary lookups. The more time a function spends in substring searches, the better the compression ratio it can achieve (at lower compression speed, though).
Regarding your last question: no, IPP does not provide different levels of LZO compression. It was an experimental development, and until we see demand for LZO functionality we are not planning to do anything extra in this area.
By the way, using multithreaded IPP functions in multithreaded applications can bring a significant performance penalty. In your case (IPP LZO in a 24-thread application), I am afraid the real situation is dramatic: each function can submit up to 24 threads. 24*24 is too much ))). IPP functions with internal multithreading were designed for single-threaded applications, to save IPP customers the overhead of developing multithreaded solutions. Now the situation is different: more and more applications are becoming multithreaded. This is why we are deprecating internal multithreading.
Regards,
Sergey

Sergey,

Thanks for your detailed explanation! I understand now.

Regards,
Haixiao