IPP zlib slower than open-source zlib when used by HBase

Donald_W_ · ‎02-16-2015

Hi everyone,

Recently I tried to replace the stock zlib with IPP zlib 7.0.6 on 64-bit Linux to boost the performance of a project that uses HBase 0.99.2. However, I observed a slowdown of about 30% in compression performance. I measured the time the "deflate" function takes inside the Hadoop native library, and it is indeed slower than with stock zlib; almost all of the slowdown occurs inside the "deflate" calls.

I also wrote a couple of separate test programs that invoke zlib directly. In those cases, IPP shows a good improvement over the stock library. The slowdown seems to happen only when IPP zlib is used with HBase.

I don't know what could cause IPP zlib to be slower than the stock one. Does anyone have any ideas? Thanks.

Sergey_K_Intel · ‎02-16-2015

Hi Donald,

It's an interesting observation. Could you tell me what CPU is in the HBase machine?

Two other things would be helpful to know:

Are you sure that you call ippInit() (or ippStaticInit() in IPP 7.x) in your HBase version of IPP zlib?
What is the average size of the buffers that are deflated in HBase?

Donald_W_ · ‎02-16-2015

Hi Sergey,

Thanks for your reply.

The CPU my project is running on is a Xeon E5-2670 v2.

Yes, I modified the Hadoop native library slightly so that it calls ippInit when it is loaded, and I verified that it is calling the e9 functions, which are for AVX.

I think in Hadoop/HBase the maximum buffer that can be passed to deflate in each round is 64K (I didn't change that). I logged the input data size for all deflate calls. Most inputs are between 64K and 128K, so they are compressed in two rounds.
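
To make the pattern concrete, here is a minimal sketch of feeding data to deflate in rounds of at most 64K while logging the input size and time of each call. It uses Python's standard zlib module as a stand-in, not the Hadoop native library or IPP; the names CHUNK and deflate_in_rounds are hypothetical and only for illustration.

```python
import time
import zlib

CHUNK = 64 * 1024  # assumed per-round limit, matching the 64K figure above


def deflate_in_rounds(data: bytes, level: int = 6):
    """Compress data, passing at most CHUNK bytes to each compress() call.

    Returns (compressed_bytes, log), where log holds one
    (input_size, seconds) entry per round.
    """
    comp = zlib.compressobj(level)
    pieces = []
    log = []
    for offset in range(0, len(data), CHUNK):
        block = data[offset:offset + CHUNK]
        start = time.perf_counter()
        pieces.append(comp.compress(block))
        log.append((len(block), time.perf_counter() - start))
    pieces.append(comp.flush())  # emit whatever is still buffered
    return b"".join(pieces), log


if __name__ == "__main__":
    payload = b"hbase-cell-value " * 6000  # about 100K, so two rounds
    compressed, log = deflate_in_rounds(payload)
    for size, seconds in log:
        print(f"round input {size} bytes, {seconds * 1e6:.1f} us")
    print(f"{len(payload)} -> {len(compressed)} bytes")
```

Summing the per-round times gives the total spent inside the compress calls, which is roughly the quantity being compared between the two libraries.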

Thanks,

Donald

Sergey_K_Intel · ‎02-16-2015

Donald,

Which version of Hadoop and, more importantly, which version of open-source ZLIB are you talking about? I am asking because both Intel and CloudFlare have recently invested in open-source ZLIB.

Donald_W_ · ‎02-17-2015

Hi, Sergey

I'm testing with zlib 1.2.3 and vanilla HBase 0.99.2, which should be using Hadoop 2.5.1.

Donald

Donald_W_ · ‎02-18-2015

Hi, Sergey

I think I have found the cause. IPP zlib doesn't handle the data pattern produced by the random data generator in the HBase benchmark well, and it runs slower than stock zlib on that data. I tried some other datasets and found that IPP does give an improvement, to varying degrees. My separate tests used a different random data generator, which is why they gave different results. I expected that different datasets would affect the absolute performance of the two libraries, but I didn't expect the relative performance to be affected as well.
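
For anyone who wants to check how much the input pattern matters, here is a rough sketch that times a single compression call on differently structured data. It assumes Python's standard zlib module as a stand-in for either library, and the datasets are made up for illustration; they are not the HBase benchmark generator. To compare two zlib builds, the same inputs would have to be run against each of them.

```python
import os
import time
import zlib


def measure(data: bytes, level: int = 6):
    """Return (seconds, compressed_size / original_size) for one compress call."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return elapsed, len(compressed) / len(data)


if __name__ == "__main__":
    size = 1 << 20  # 1 MiB per dataset
    pattern = b"row-key:value;"
    datasets = {
        # Hypothetical inputs chosen only to differ in structure.
        "random": os.urandom(size),
        "repetitive": pattern * (size // len(pattern)),
        "half-random": os.urandom(size // 2) + bytes(size // 2),
    }
    for name, data in datasets.items():
        seconds, ratio = measure(data)
        print(f"{name:12s} {seconds * 1e3:8.2f} ms  ratio {ratio:.3f}")
```

Both the time and the ratio change with the input, so a benchmark result obtained on one generator's output does not necessarily carry over to another.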

Thanks,

Donald