From my understanding, the IPP libraries use SIMD instructions, which is why they outperform other libraries. So I used gdb's 'disass' command to check whether any SIMD instructions are present in the assembly code of the function ippsDecodeLZO_8u().
0x00000000004010a0 <ippsDecodeLZO_8u+0>: mov 0x207ab1(%rip),%rax
0x00000000004010a7 <ippsDecodeLZO_8u+7>: mov (%rax),%eax
0x00000000004010a9 <ippsDecodeLZO_8u+9>: mov 0x207a98(%rip),%r11
0x00000000004010b0 <ippsDecodeLZO_8u+16>: jmpq *(%r11,%rax,8)
The above 4 lines show the assembly-level code of ippsDecodeLZO_8u(). I don't see any SIMD instructions in it. I am confused: if this is the case, how can IPP's decompression be faster than other libraries?
Correct me if my understanding is wrong.
Thanks in advance!
IPP does not rely only on SIMD instructions to increase performance; careful organization of data reads/writes and, say, branch prediction sometimes give additional benefit. Regarding SIMD: in data compression SIMD is rarely used, because the algorithms - especially decompression - are bit-driven. There is simply no room for parallel data operations like SIMD. As for the disassembly you provided, that is actually CPU dispatcher code; the last instruction is a table-driven jump. So you haven't reached the LZO decoding yet )).
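To make the dispatcher idea concrete, here is a minimal sketch (not IPP's actual code; all names are hypothetical) of the pattern behind that `jmpq *(%r11,%rax,8)`: one implementation per CPU generation, a table of function pointers, and an index chosen once at startup.

```c
/* Hypothetical sketch of a CPU-dispatching entry point. The real IPP
 * dispatcher works in assembly, but the idea is the same: pick the best
 * implementation for the running CPU via an indexed indirect jump. */

typedef int (*decode_fn)(const unsigned char *src, int len);

/* Stand-in implementations, one per CPU generation. */
static int decode_generic(const unsigned char *src, int len) { (void)src; return len; }
static int decode_sse2(const unsigned char *src, int len)    { (void)src; return len; }
static int decode_avx2(const unsigned char *src, int len)    { (void)src; return len; }

/* Dispatch table; cpu_index would be set once by a CPUID probe. */
static decode_fn dispatch_table[] = { decode_generic, decode_sse2, decode_avx2 };
static int cpu_index = 0;

int DecodeLZO(const unsigned char *src, int len) {
    /* C equivalent of: jmpq *(%r11,%rax,8) */
    return dispatch_table[cpu_index](src, len);
}
```

So the four instructions you disassembled are just this table lookup; the real decoding routine lives behind the jump.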
Thanks a lot Sergey :)
IPP gives better results, and I am happy with it.
But the question that keeps nagging me is what makes the decompression faster.
So is there a way I could get at the assembly code, so that I could figure out which optimizations are going on?
Or is there any document that would help me understand the optimizations carried out in decompression?
Currently, there is no way to get the assembly code for these functions. But I would say there's nothing extra-special about it (besides the "magic", of course :)). Use of aligned data, short jumps, in some cases quad-word reads/writes, and that's it. The trick is to try not to miss the L1 cache (for both data and instructions).
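The "quad-word read/write" point can be sketched as follows (an illustration, not IPP's code): copy a run 8 bytes at a time instead of byte-by-byte. A `memcpy` of a fixed 8 bytes typically compiles down to a single 64-bit load and store.

```c
#include <stdint.h>
#include <string.h>

/* Copy n bytes using 64-bit (quad-word) loads/stores where possible.
 * Each fixed-size memcpy below usually becomes one unaligned 8-byte
 * load + store, so the loop touches memory far fewer times than a
 * byte-by-byte copy. */
void copy_run(unsigned char *dst, const unsigned char *src, size_t n) {
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, src, 8);   /* one 64-bit load  */
        memcpy(dst, &w, 8);   /* one 64-bit store */
        src += 8; dst += 8; n -= 8;
    }
    while (n--) *dst++ = *src++;  /* byte tail */
}
```

Note that in a real LZ77-family decoder the match source and destination can overlap, so a wide copy needs extra care there; this sketch assumes non-overlapping buffers.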
By the way, if you take LZO from the official site, there are optimized assembly variants of the encode/decode functions. They are very good.
And there are also Intel architecture optimization guides (like http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html ) that may help.
What I mean is that decompression is almost always bit-field reading, shifting, using bit masks as table indexes, and so on. So decompression is parsing a stream of bits. There are no packets or structures that can be read at once, with each field filled in with the required data, and there is no bit-padding to a byte boundary (except at some block boundaries, which are logical structures belonging to upper-level data formats, like a compressed archive or a network stream).

Take LZ77 (or zlib) as an example. The compressed data consists of several blocks, each starting with a 3-bit header that defines the block's content: 1 bit (BFINAL) marking whether this is the last block, then 2 bits (BTYPE) selecting stored data, static Huffman-compressed data, or dynamic Huffman-compressed data. Only a stored block is padded to the next byte boundary. The length of a block varies and depends on the block content.

By the way, returning to your initial question, none of this has anything to do with SIMD. There are no operations where SIMD can be applied.
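To show what "parsing a stream of bits" looks like in practice, here is a minimal LSB-first bit reader of the kind deflate-style decoders use (an illustration only; the field names BFINAL/BTYPE follow RFC 1951, not IPP's internal code):

```c
#include <stddef.h>

/* Minimal LSB-first bit reader over a byte buffer. */
typedef struct {
    const unsigned char *p;  /* input buffer       */
    size_t pos;              /* absolute bit index */
} bitreader;

/* Read n bits, least-significant bit first, as deflate does. */
static unsigned get_bits(bitreader *br, unsigned n) {
    unsigned v = 0;
    for (unsigned i = 0; i < n; i++, br->pos++)
        v |= ((br->p[br->pos >> 3] >> (br->pos & 7)) & 1u) << i;
    return v;
}

/* Convenience: pull the 3-bit deflate block header out of one byte. */
static unsigned block_final(unsigned char b) {  /* BFINAL: last block? */
    bitreader br = { &b, 0 };
    return get_bits(&br, 1);
}
static unsigned block_type(unsigned char b) {   /* BTYPE: 0 stored, 1 static, 2 dynamic */
    bitreader br = { &b, 0 };
    (void)get_bits(&br, 1);  /* skip BFINAL */
    return get_bits(&br, 2);
}
```

For example, a first byte of 0x05 (binary 00000101) yields BFINAL = 1 and BTYPE = 2 (dynamic Huffman). Every step depends on the bits just read, which is exactly why there is so little for SIMD to do here.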