Khan__Amsal
Beginner
251 Views

Intel IPP Zlib Build giving 3x slower speed than normal zlib dll in compression and decompression

I have built zlibwinapi.dll and zlib1.dll following the Intel IPP documentation and resources, with the latest IPP library. I then used the DLLs with my program to improve its speed, but it gave the same results and speed as the normal public zlib. The program is a compressor, and it supports a zlibwinapi.dll placed next to it, so it takes its functions and calls from there. Can someone tell me what I am doing wrong, even after following the Intel IPP documentation for building the DLL? Or, if possible, could somebody send me a zlibwinapi.dll and zlib1.dll built with Intel IPP?

Best Regards

10 Replies
Sergey_K_Intel
Employee

Hi Amsal,

Could you tell me how you built zlibwinapi.dll and zlib1.dll? Where is the "how-to" procedure for the zlibwinapi build described?

Regards,
Sergey

Khan__Amsal
Beginner

I built zlib1.dll exactly as described in the readme.html of the IPP components. The WINAPI variant can be built in a similar way, just by adding the ZLIB_WINAPI preprocessor definition.

In case that did not work, I also tried the Visual Studio project files: I patched the zlib source and then used the Visual Studio 14 project available in contrib\vstudio, with the WITH_IPP preprocessor definition added as well. So I have tried both routes.

All I want is a zlib that uses Windows API calls instead of cdecl, because of a limitation of my application. Of course, I am not the owner of the application and I do not have its source. It supports Gildor's fast zlib very well, so why not this one?

There is a project made by Shelwien in 2010 here: http://nishi.dreamhosters.com/u/ipp70beta_dc.rar. I tried its zlib DLL with my program, and it worked very well and was faster than the normal public zlib DLL.

I am pretty sure that I built zlib1.dll properly. That is why I was asking if somebody can share a DLL with Windows calls, and also a zlib1.dll, so that I can confirm whether Intel IPP is really slower than the public zlib in compression and decompression.

Thank you, Sergey, for helping with my topic :) I bought Intel Parallel Studio 2018 hoping that it would improve on my normal zlib with faster speed and probably a better ratio.

Best Regards
Sergey_K_Intel
Employee

OK, let's do an experiment together)). I will use Zlib 1.2.11 for that. First, let's create a compression benchmark. Unfortunately, original Zlib has no benchmark, so let's create a simple program like this:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <assert.h>

#include "zlib.h"
#define BUFSIZE 1000000
int main(int argc, const char* argv[])
{
    FILE*   fin;
    Bytef   *in_buf, *out_buf;
    size_t  in_size;
    z_stream    deflate_stream;
    clock_t t;
    unsigned int    it;
    double secs, mbs;
    int rc;

    /* Calls are kept outside assert() so they still run if NDEBUG is defined */
    assert(argc > 1);
    fin = fopen(argv[1], "rb");
    assert(fin != NULL);
    in_buf = (Bytef*)malloc(BUFSIZE);
    assert(in_buf != NULL);
    out_buf = (Bytef*)malloc(BUFSIZE);
    assert(out_buf != NULL);
    /* Only the first BUFSIZE bytes of the file are benchmarked */
    in_size = fread(in_buf, 1, BUFSIZE, fin);
    assert(in_size > 0);
    fclose(fin);

    deflate_stream.zalloc = Z_NULL;
    deflate_stream.zfree = Z_NULL;
    deflate_stream.opaque = Z_NULL;
    rc = deflateInit(&deflate_stream, Z_DEFAULT_COMPRESSION);
    assert(rc == Z_OK);

    t = clock();
    /* Do deflate for 5 secs */
    for(it = 0; ; it++) {
        deflateReset(&deflate_stream);
        deflate_stream.next_in = in_buf;
        deflate_stream.avail_in = (uInt)in_size;
        deflate_stream.next_out = out_buf;
        deflate_stream.avail_out = (uInt)BUFSIZE;
        /* Z_STREAM_END requires the compressed data to fit in BUFSIZE bytes */
        rc = deflate(&deflate_stream, Z_FINISH);
        assert(rc == Z_STREAM_END);
        if((clock() - t) > CLOCKS_PER_SEC * 5)
            break;
    }
    t = clock() - t;
    secs = (double)t/CLOCKS_PER_SEC;
    mbs = (double)(in_size * it)/(1024*1024);
    printf("Deflated in %f seconds, %u iterations done, %f MB/s\n", secs, it, mbs/secs);
    rc = deflateEnd(&deflate_stream);
    assert(rc == Z_OK);
    free(in_buf);
    free(out_buf);
    return 0;
}

Sorry, I did the error handling with asserts. Next, let's build zlib1.dll without IPP:

    nmake -f win32\Makefile.msc LOC="-DZLIB_WINAPI"

I got zdll.lib (16K) and zlib1.dll (86K). Then build zlib_bench against the DLL:

    cl -DZLIB_WINAPI zlib_bench.c zdll.lib

Run zlib_bench.exe against the bib file from the Calgary corpus:

    zlib_bench.exe c:\calgary\bib
    Deflated in 5.001000 seconds, 648 iterations done, 13.748686 MB/s

Clean Zlib and rebuild it with IPP:

    nmake -f win32\Makefile.msc clean
    nmake -f win32\Makefile.msc LOC="-DWITH_IPP -DZLIB_WINAPI -I\"%IPPROOT%\include\"" LDFLAGS="-nologo -incremental:no -opt:ref /LIBPATH:\"%IPPROOT%\lib\intel64\" ippdcmt.lib ippsmt.lib ippcoremt.lib"

We get a 16K zdll.lib and a 1088K zlib1.dll (larger because IPP is linked statically). Build zlib_bench.exe again, because "nmake clean" deletes *.exe, and run it:

    cl -DZLIB_WINAPI zlib_bench.c zdll.lib
    zlib_bench.exe c:\calgary\bib
    Deflated in 5.001000 seconds, 1224 iterations done, 25.969740 MB/s
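
To compare the two runs above, the reported figures can be reproduced with a couple of small helpers. This is only a sketch: the function names throughput_mbs and speedup are made up for illustration, and the formula is the one zlib_bench.c prints (1 MB = 1024*1024 bytes).

```c
#include <stddef.h>

/* Throughput in MB/s, using the same formula as zlib_bench.c:
   bytes processed / (1024*1024) / elapsed seconds. */
static double throughput_mbs(size_t bytes_per_iter, unsigned iterations, double seconds)
{
    return ((double)bytes_per_iter * iterations) / (1024.0 * 1024.0) / seconds;
}

/* Speedup of run B over run A when both runs processed the same input
   for the same elapsed time: simply the ratio of iteration counts. */
static double speedup(unsigned iterations_a, unsigned iterations_b)
{
    return (double)iterations_b / (double)iterations_a;
}
```

With the iteration counts reported above, speedup(648, 1224) is about 1.89, so the IPP build deflates this file almost twice as fast. Assuming bib is the standard 111,261-byte Calgary file (the size is not stated in the post), throughput_mbs(111261, 648, 5.001) gives about 13.75 MB/s, which agrees with the first run.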

Instead of rebuilding zlib_bench.exe, you can rename it to copy.tmp before the clean, rebuild zlib1.dll with IPP, and then rename copy.tmp back to zlib_bench.exe. You may also want to build zlib1.dll with dynamic linking to IPP:

    nmake -f win32\Makefile.msc LOC="-DWITH_IPP -DZLIB_WINAPI -I\"%IPPROOT%\include\"" LDFLAGS="-nologo -incremental:no -opt:ref /LIBPATH:\"%IPPROOT%\lib\intel64\" ippdc.lib ipps.lib ippcore.lib"

Zlib1.dll in this case is 86K bytes long.

By the way, what CPU (model) do you use? In my case CPU is Skylake i5-6300U @ 2.5 GHz.
Sorry for the long post.

Regards,
Sergey

Khan__Amsal
Beginner

Thank you for building it on your own. I got the same sizes for the zlib DLLs and files. But what could be the reason that, with compressor.exe, it runs 2-3 times slower than normal zlib? I am pretty sure the tool should support the IPP zlib, since it supports many other modified zlib DLLs. Of course, I am not the author of the tool. If you want, I can attach the compressor and explain in detail how to test with it.

What I am mainly focusing on is making decompression faster. For now, with this zlib DLL we get speed that is 2-3x slower, which is disappointing to me.

By the way, I am using these CPUs: Xeon E3 1230 v3, Xeon E3 1245 v6 and Core 2 Quad Q9500.

Also, if possible, please share your DLL builds as well. I will test them with the compressor, and if they work, then the problem is surely in my compilation only. But as I said, I compiled the same way you described here.

Best regards
Sergey_K_Intel
Employee

Please share your compression tool. Do you use any specific input data for testing?

Regards,
Sergey

Khan__Amsal
Beginner

Sure, I have attached a zip with all the files required for the tests. There is also a readme that explains how to use it, and a sample is included in the test zip. Let me know whether your Intel IPP zlib performs any faster.

BTW, I just found that my zlib1.dll, which was built with the WITH_IPP preprocessor definition, is 1064 KB and zlib.lib is 17 KB, which differs from your sizes. If possible, can you share your DLLs, in private or in public?
 
Best Regards
Sergey_K_Intel
Employee

I have some results. It is of course a "black box" study, but the following can be seen (I also used VTune Amplifier to look at the profiles):

  1. ZTool.exe doesn't call Zlib's "inflate()". Simply add a "printf" at the beginning of the inflate function and you will see that it is never called. The "deflate()" function is fine: it is called from ZTool. VTune shows that in both cases (original zlibwapi.dll or IPP) 97-98% of CPU time is spent in functions from HIF2RAW_DLL.dll. You can remove zlibwapi.dll from the working directory and decompression (or whatever it is) still works.
    IPPs zlibwapi
        Top Hotspots
        Function            Module             CPU Time
        func@0x180005d57    HIF2RAW_DLL.DLL    6.477s
        func@0x180001121    HIF2RAW_DLL.DLL    1.384s
        func@0x1800038c0    HIF2RAW_DLL.DLL    0.079s
        func@0x180005cb0    HIF2RAW_DLL.DLL    0.050s
        func@0x180005cd0    HIF2RAW_DLL.DLL    0.039s
        [Others]                               0.064s
    Original zlibwapi
        Top Hotspots
        Function            Module             CPU Time
        func@0x180005cd0    HIF2RAW_DLL.DLL    6.444s
        func@0x180001121    HIF2RAW_DLL.DLL    1.520s
        func@0x1800038c0    HIF2RAW_DLL.DLL    0.082s
        func@0x180006510    HIF2RAW_DLL.DLL    0.053s
        func@0x180005cb0    HIF2RAW_DLL.DLL    0.030s
        [Others]                               0.085s
    Without zlibwapi in current directory
        Top Hotspots
        Function            Module             CPU Time
        func@0x180005d91    HIF2RAW_DLL.DLL    5.986s
        func@0x180001121    HIF2RAW_DLL.DLL    1.534s
        func@0x180005d81    HIF2RAW_DLL.DLL    0.198s
        func@0x1800038c0    HIF2RAW_DLL.DLL    0.091s
        WriteFile           KERNELBASE.dll     0.041s
        [Others]                               0.100s
  2. In compression, timer.exe shows that zlibwapi+IPP takes more time than the original zlibwapi. But if in VTune you exclude everything except zlibwapi (that is, ZTool, MSVCRT and RAW2HIF_DLL) from the time calculation, you can see that the original zlibwapi takes about 9 seconds of the task run and zlibwapi+IPP takes about 6 seconds. The results are below.
    IPPs zlibwapi
        Top Hotspots
        Function            Module             CPU Time
        func@0x180007120    RAW2HIF_DLL.DLL    5.892s
        func@0x180051260    zlibwapi.dll       5.586s
        func@0x180002340    RAW2HIF_DLL.DLL    1.617s
        func@0x57583e       ZTool.exe          0.563s
        func@0x180042020    zlibwapi.dll       0.562s
        [Others]                               2.520s
    Original zlibwapi
        Top Hotspots
        Function            Module             CPU Time
        func@0x180011000    zlibwapi.dll       9.077s
        func@0x180003d80    zlibwapi.dll       0.974s
        func@0x575760       ZTool.exe          0.540s
        func@0x1800019b0    zlibwapi.dll       0.340s
        func@0x584e40       ZTool.exe          0.221s
        [Others]                               1.371s
     
    It looks like the dependency between ZTool and the Zlib DLL is more complex than we thought.
    I have no more guesses about the nature of ZTool and how it uses Zlib. It looks like a C# process using a .NET Zlib library.

    If you have more details, you may have more findings.
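
For anyone who wants to redo the per-module arithmetic from the compression hotspot lists, a minimal sketch follows. The struct and function names are made up for illustration, the rows are copied from the two compression lists above, and the "[Others]" bucket is left out because it is not attributed to any module, so the sums are lower bounds.

```c
#include <string.h>

/* One row of a VTune "Top Hotspots" list (hypothetical helper type). */
struct hotspot {
    const char *module;
    double cpu_seconds;
};

/* Sum the CPU time of all rows that belong to the given module. */
static double module_time(const struct hotspot *rows, int count, const char *module)
{
    double total = 0.0;
    int i;
    for (i = 0; i < count; i++)
        if (strcmp(rows[i].module, module) == 0)
            total += rows[i].cpu_seconds;
    return total;
}

/* Compression run with the IPP build of zlibwapi.dll. */
static const struct hotspot ipp_rows[] = {
    {"RAW2HIF_DLL.DLL", 5.892}, {"zlibwapi.dll", 5.586},
    {"RAW2HIF_DLL.DLL", 1.617}, {"ZTool.exe", 0.563},
    {"zlibwapi.dll", 0.562}
};

/* Compression run with the original zlibwapi.dll. */
static const struct hotspot orig_rows[] = {
    {"zlibwapi.dll", 9.077}, {"zlibwapi.dll", 0.974},
    {"ZTool.exe", 0.540}, {"zlibwapi.dll", 0.340},
    {"ZTool.exe", 0.221}
};
```

On these rows, module_time(ipp_rows, 5, "zlibwapi.dll") comes to about 6.1 s and module_time(orig_rows, 5, "zlibwapi.dll") to about 10.4 s. Either way, the listed zlibwapi.dll functions take less CPU time in the IPP build, even though the whole task takes longer.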
     
    Regards,
    Sergey
Khan__Amsal
Beginner

Thank you, Sergey, for your detailed benchmarks. I am dropping support for this program for now because I don't have its source code. But there is another precompressor on GitHub which is well known, has public source and uses Zlib. I want to enable IPP in it and compile it. Here is the source: https://github.com/schnaader/precomp-cpp

But the build I made with the Intel IPP libraries and the modified zlib source did not work like the official release. This program uses the zlib source, which is located inside contrib/zlib/.

If possible, can you help me out by making a working build with the Intel C++ compiler and IPP support?

Basically, I downloaded the zlib-1.2.11 source and patched it so that it can be used with Intel IPP. Then I copied the modified zlib source files into contrib/zlib, opened the Visual Studio project file in vs2015/Precomp, defined the WITH_IPP preprocessor symbol, and added the three libraries ippdcmt.lib, ippsmt.lib and ippcoremt.lib to the linker inputs. I also enabled single-threaded Intel IPP in the Visual Studio settings and compiled for 64-bit with the Intel C++ 18.0 compiler. precomp.exe compiled successfully, but in the end the program never finishes precompression when zlib is used. To test zlib usage, use this command:

 

precomp.exe -cn -intense <input-filename>

 

I gave you a sample in my previous post, so with that sample the command will be:

precomp.exe -cn -intense sample

From my many tests and different kinds of builds, I am pretty sure that the whole problem is in the zlib source. I tried default linking, dynamic linking and different modes and settings, but they all end up in the same place with the same hang. What happens is that precomp.exe hangs during precompression at some point, sometimes at 0% and sometimes at 10% or so, while the original exe, built with the unmodified zlib source, works very well. So I ask you or the Intel team to look into what the problem is and how it could be fixed; that would be very nice of you. I hope I will get good support with this tool as well. I bought Parallel Studio with exactly these hopes and promises :)

 

Best Regards

Sergey_K_Intel
Employee

We got the precomp project and see the problems. The first is that the inflate() function sometimes hangs. The precomp application reads the input file and tries to find compressed parts in it, based on predefined headers which exist in standard compressed streams. After a candidate header is found, precomp tries to decompress the following bytes of the input file. When it goes to decompression, original Zlib in most cases returns status (-3), which means Z_DATA_ERROR, and precomp continues searching for other candidates. Our Zlib hangs. We will fix this ASAP.

The other problem is that this fix will not help precomp, because its logic is as follows:

- Precomp finds compressed part of file (or, something it thinks to be compressed)
- Precomp decompresses input data
- It compresses the decompressed data into another buffer and compares it with the source data bit-to-bit. If they match, precomp leaves the uncompressed data in the .pcf file, on the assumption that further general compression of the .pcf file will give a better compression ratio.
- If comparison fails, precomp ignores this chunk and moves to the next compressor (MP3, lzma, ...)

If something is recompressed (decompress->compress) with IPP Zlib, the compressed data will not be the same as previously compressed data (compressed with original Zlib), because the core algorithm of compression is slightly different from original Zlib.

So, my feeling is that precomp will fail to find better compression candidates with IPP Zlib just because it will not be able to get bit-to-bit coinciding compressed blocks.
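
The effect can be illustrated with a toy example. The sketch below is not zlib and not precomp code: a tiny run-length codec stands in for deflate/inflate, and all names (rle_encode, rle_decode, roundtrip_matches) are hypothetical. Two encoders that differ only in their maximum run length both produce valid streams that decode to the same data, but a decode, re-encode, compare check only passes when the re-encoder is the one that produced the stream.

```c
#include <stddef.h>
#include <string.h>

/* Toy stream format: pairs of (count, byte), with count in 1..255. */

/* Encode with runs capped at max_run (1..255).
   Returns bytes written, or 0 if the output buffer is too small. */
static size_t rle_encode(const unsigned char *in, size_t n,
                         unsigned char *out, size_t cap, unsigned max_run)
{
    size_t i = 0, o = 0;
    while (i < n) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i] && run < max_run)
            run++;
        if (o + 2 > cap)
            return 0;
        out[o++] = (unsigned char)run;
        out[o++] = in[i];
        i += run;
    }
    return o;
}

/* Decode any valid (count, byte) stream.
   Returns bytes written, or 0 on a malformed stream or overflow. */
static size_t rle_decode(const unsigned char *in, size_t n,
                         unsigned char *out, size_t cap)
{
    size_t i, o = 0;
    if (n % 2 != 0)
        return 0;
    for (i = 0; i < n; i += 2) {
        size_t run = in[i];
        if (run == 0 || o + run > cap)
            return 0;
        memset(out + o, in[i + 1], run);
        o += run;
    }
    return o;
}

/* Precomp-style check: decode the stream, re-encode it with the given
   encoder setting, and compare the result with the original stream. */
static int roundtrip_matches(const unsigned char *stream, size_t n, unsigned max_run)
{
    unsigned char raw[1024], again[1024];
    size_t raw_n = rle_decode(stream, n, raw, sizeof raw);
    size_t again_n;
    if (raw_n == 0)
        return 0;
    again_n = rle_encode(raw, raw_n, again, sizeof again, max_run);
    return again_n == n && memcmp(stream, again, n) == 0;
}
```

For 300 identical bytes encoded with max_run 255, the stream is 4 bytes long and roundtrip_matches with 255 succeeds. Re-encoding with max_run 100 produces a 6-byte stream that decodes to the same data but fails the byte comparison, which is the situation described above for IPP Zlib versus original Zlib.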

Nevertheless, thank you for pointing us to a problem in inflate().

Regards,
Sergey

 

Khan__Amsal
Beginner
 

Thank you, Sergey, for looking into it. I am glad that I was helpful to the Intel development team, and I hope it will be fixed in the next releases. I understand the precomp logic, but I will still try again on my own after this bug is fixed. I also think it would be good if the Intel team added more components (examples) to the IPP library. There are many compressors that could get an additional version with IPP-modified source, and it would indirectly show a wider audience how much better the Intel compilers are, not only on Intel CPUs but also on AMD. Anyway, I will wait for the update.

Best Regards