I am testing the performance of both Intel IPP LZO and LZO (v2.06). I found that the IPP performance is much lower than that of LZO 2.06.
My test bed:
- CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (Sandy Bridge architecture)
- 24 GB RAM, BIOS version 1.2.6
- OS: RH6.0, kernel 2.6.32-71.el6.x86_64
- Intel IPP main package: parallel_studio_xe_2011_sp1_update3_intel64
- LZO version: 2.06
- Compiler: gcc
Test method:
1. First, I configure the number of threads and the number of rounds for the compression run. (The IPP internal threading mode is IppLZO1XST, i.e. single-threaded, but the benchmark program itself is multithreaded.)
2. Then, the benchmark program reads the full file into memory and compresses it entirely in memory.
3. Finally, the program reports the throughput and the compression ratio.
The main procedure of the Intel IPP LZO test program, as pseudocode:
*The source file to be compressed is 16 MB and its compression ratio is about 1.5:1.

#define BUFSIZE (16 * 1024 * 1024) /* 16 MB */

/* Thread function: each benchmark thread runs this. */
void compress_per_thread(const char *pInFileName, int opt_round_num)
{
    Ipp8u *p_in_buffer, *p_out_buffer;
    Ipp32u src_len, dst_len, lzoSize;
    IppLZOState_8u *pLZOState;
    int fd_in, i;

    fd_in = open(pInFileName, O_RDONLY, 0);

    /* Query the state size and initialize the single-threaded LZO encoder. */
    ippsEncodeLZOGetSize(IppLZO1XST, BUFSIZE, &lzoSize);
    pLZOState = (IppLZOState_8u*)ippsMalloc_8u(lzoSize);
    ippsEncodeLZOInit_8u(IppLZO1XST, BUFSIZE, pLZOState);

    p_in_buffer  = ippsMalloc_8u(BUFSIZE);
    p_out_buffer = ippsMalloc_8u(BUFSIZE + BUFSIZE / 10); /* room for worst-case growth */

    /* The source file is exactly BUFSIZE, so one read loads the whole file. */
    src_len = read(fd_in, p_in_buffer, BUFSIZE);

    /* opt_round_num repetitions per thread, specified to tune the measurement. */
    for (i = 0; i < opt_round_num; i++)
        ippsEncodeLZO_8u(p_in_buffer, src_len, p_out_buffer, &dst_len, pLZOState);
}
The main procedure of the LZO (v2.06) test program is the same as the IPP LZO one; it calls the function lzo1x_1_compress to do the compression.
Performance reached its optimum at 24 threads: IPP LZO achieved 10.3 Gbps, while LZO v2.06 achieved 31.18 Gbps.
Why is IPP LZO so much slower than LZO v2.06 on the 16 MB test data?
If I configure the IPP threading mode to IppLZO1XMT (by default its thread count equals the number of processors in the system), and my benchmark program's thread count also equals the number of processors, I think thread context switching will degrade performance.
Gennady Fedorov (Intel) wrote:
Hi, thanks for your reply. I got the performance results above in single-thread mode. I also checked the results in multi-thread mode, and they are lower than in single-thread mode. I think this is caused by the implementation of my benchmark program: the benchmark program is multi-threaded (24 threads) and the IPP mode is also multi-threaded (24 threads), so 24 * 2 > 24 logical cores. This causes thread contention, and the resulting context switching degrades performance.
Some questions:
1. What is the main difference between LZO and IPP LZO?
2. LZO has different compression levels (1-9); does Intel provide the IPP library for all levels?
haixiao, it might be a problem with the threading in IPP's implementation; it should be checked on our side. Did you compare the performance in single-thread mode?
Sergey Khlystov (Intel) wrote:
Hi Sergey, thanks for your reply. Which LZO version is your IPP LZO test based on? My IPP LZO test is based on LZO 2.0.3, and I compared it with LZO 2.0.6. Does the performance optimization from LZO 2.0.3 to 2.0.6 account for such a large performance difference?
Some questions:
1. What is the main difference between LZO and IPP LZO?
2. LZO has different compression levels (1-9); does Intel provide the IPP library for all levels?
In terms of compression ratio, the IPP LZO functions fall somewhere between LZO1B-2 and LZO1B-1. Look at the table obtained with the "lzotest" benchmark on the Calgary corpus for LZO 2.0.6:
Summary of total values:

Algorithm   Length   ComLen   Ratio%  Bits  Com MB/s   Dec MB/s
memcpy()    3141622  3141622  100.0   8.00  14647.128  9467.011
            3141622  1712066   54.5   4.36    315.300   759.555
LZO1B-1     3141622  1534268   48.8   3.91    155.613   507.660
ipplzost    3141622  1533165   48.8   3.90    165.994   777.953
LZO1B-2     3141622  1487293   47.3   3.79    155.750   504.164
LZO1B-3     3141622  1461534   46.5   3.72    152.539   499.473
Compared to the equivalent LZO methods, IPP LZO does not look bad. LZO1X-1 is faster, but at the cost of less compression.
Sergey Khlystov (Intel) wrote:
Sergey, thanks for your detailed explanation! I understand now. Regards, Haixiao
The IPP LZO functions are based on the general idea of LZO and, mostly, on the LZO packed format. At the time these functions were developed, LZO 2.0.3 was current. Apart from a similar function interface and an identical output format, there is no similarity between the IPP and non-IPP LZO implementations.
The difference between LZO levels mostly comes from differences in the compression dictionary lookups: the more time the function spends in substring searches, the better the compression ratio it can obtain (at the cost of compression speed).
Regarding your last question: no, IPP does not provide different levels of LZO compression. It was an experimental development, and until we see demand for LZO functionality, we are not planning to do anything extra in this area.
By the way, using multi-threaded IPP functions in multi-threaded applications can bring a significant performance penalty. In your case (IPP LZO in a 24-thread application), I am afraid the real situation is dramatic: each function call can spawn up to 24 threads, and 24 * 24 is far too many. IPP functions with internal multi-threading were designed for single-threaded applications, to spare IPP customers the overhead of developing multi-threaded solutions. Now the situation is different: more and more applications are multi-threaded. This is why we are deprecating internal multi-threading.