
IJG Performance Issue

I'm comparing the standard Independent JPEG Group's code to the speedups provided by IPP. All I'm doing is loading JPEGs into arrays, along the lines of the decompress sample in ipp-samples\image-codecs\ijg\samples\ijgtiming. On the images I'm looking at, I find a 25% speedup if I load one image 100 times from my network. But if I loop through a directory of 450 images on the network and load each one at a time, I find no speedup.

Does anyone have any tips? Is it known that IPP won't speed up some workloads? Or is it likely that I'm doing something wrong? I've already read a few knowledge base articles, but none offered obviously helpful suggestions: my memory is aligned, I don't want to split the work into multiple threads, and nothing else looked applicable.

If no one has any suggestions, I guess I'll stick with the Independent JPEG Group's code.


Hello "mbrierst":

The IPP implementation will definitely compress and decompress JPEG images faster. Your network access and file loads might be swamping the measurements made by ijg_timing. There can be a fair amount of file I/O during the compression/decompression process.

You might try altering the ijg_timing.c file to increase the size of the file I/O buffers; although I haven't looked closely enough at the code and how the measurements are made to guarantee that this will fix the problem, it might help.

You can increase the size of the file I/O buffer by using the setvbuf() function; see its documentation for additional information on this function and how to use it.

You might add a call to setvbuf(), with a request for a large file I/O buffer, after each of the fopen() calls.

Note: I do not know the maximum buffer size that can be specified with setvbuf(); it probably depends on the specific C library implementation and OS being used. The function might "max out" without telling you that you requested something "too big," so some experimentation may be called for.

Hope this helps,

Paul

Quoting - Paul F (Intel)
You might try altering the ijg_timing.c file to increase the size of the file I/O buffers; although I haven't looked closely enough at the code and how the measurements are made to guarantee that this will fix the problem, it might help.

You can increase the size of the file I/O buffer by using the setvbuf() function.

That's some great advice. The maximum size is 32768, as referenced in the documentation you linked to, though that may be out of date, as it's for VC++ 6.0. It looks like it has gone up since then; see: http://msdn.microsoft.com/en-us/library/86cebhfs.aspx

Using a buffer size of 32768 sped up both builds by 25%, but there was still no advantage for IPP. Moving the images to the local drive didn't make much difference either. Removing the array-filling code and leaving just the calls to jpeg_read_scanlines sped everything up, of course, but there was still no performance difference between the two.

It's definitely possible that I've linked something incorrectly, but the time I can spend on this is up for now. I may come back to it later.


Hello,

If you link with the IPP static libraries, please make sure you call the ippStaticInit function near the beginning of your application. Otherwise IPP will dispatch generic code to run instead of code optimized for your processor.

Regards,
Vladimir

Quoting - Vladimir
If you link with the IPP static libraries, please make sure you call the ippStaticInit function near the beginning of your application.
I'm aware of that, but thanks for the advice. I've been linking to a DLL (a custom DLL based on the linking example), but I'm trying other types of linking to see if I can get a performance improvement. I'll report back. Does using the Visual Studio 2008 compiler make a big difference compared to using the Intel compiler?

I did not measure the performance impact of moving from the Intel compiler to a third-party one in the IJG timing application, but in my experience the Intel compiler can additionally provide from 10% up to 2-3x (for vectorized code) in performance.

Regards,
Vladimir

I've done a number of tests now. In order to minimize the chance of error, I started with the ijg_timing example and made the smallest modifications I could.

1. First, in the ijg_decode function I added the suggested setvbuf call (size 32768) and removed the code that copies the decoded JPEG into a large bitmap memory buffer, so that what remains is just decoding the JPEG.

2. To do the comparison I have two build targets: one that links against the appropriate IPP libraries (leaving the sample code set up as-is), and another that links instead against the Independent JPEG Group library.

I have found that on some image files IPP is faster, by 25% or more, but on others the Independent JPEG Group version is up to 25% faster. So when I run across a full directory of my sample images, the times are equal. When I previously saw mysterious results where IPP was faster on one image but not when going through the whole directory, that was just luck: the first image happened to be a fast one for IPP.

These results are consistent whether the images are on my local machine or across my network (which seems to be quite fast). I don't think I'm doing anything wrong at this point. It's definitely possible that using the Intel compiler instead of the Microsoft compiler I'm using now would change these results, but I'm not ready to make that time investment now. I think I've done everything I'm going to try on my own for now.

If anyone has other suggestions, I'm still open to trying them.


Thanks for the observations. I'll make sure our engineering group sees this information, as well.

My guess is that in a benchmark like yours (traversing all files in a folder and decoding them one by one), the most important performance factor is minimizing hard-drive head seeks, system memory heap thrashing, and similar system-wide effects, which by themselves take far longer than decoding a single image with either IPP or the original IJG.

To achieve the best performance at the application level, it is not enough just to turn on SIMD instruction support in your favourite compiler or link with optimized libraries like Intel IPP. It is also important that the application itself is designed to address the system-wide factors affecting performance. And this requires some level of expertise, of course.

Regards,
Vladimir


Quoting - Vladimir
My guess is that the most important performance factor in a benchmark like yours is minimizing hard-drive head seeks, memory heap thrashing, and similar system-wide effects.

You're right, it's entirely possible that hard drive and memory issues are swamping image decoding time. I guess the peculiarities of how IJG and IPP read the files could affect total load time quite a bit. I may try to look into that further later on. Thanks for the advice.


Quoting - mbrierst

You're right, it's entirely possible that hard drive and memory issues are swamping image decoding time. I guess the peculiarities of how IJG and IPP read the files could affect total load time quite a bit. I may try to look into that further later on. Thanks for the advice.


If the JPEG files are encoded in progressive mode, then there will be no speedup at all, because the IPP-enabled IJG code only modifies the baseline sequential mode paths.

Well, this is not exactly right. We should still see some benefit from the optimized DCT, sub-sampling, and color conversion routines; for progressive mode we did not substitute the Huffman routines.

But generally, progressive mode requires several passes through the data before it gets to the final image, so the benefit might not be that big.

Regards,
Vladimir

Quoting - Vladimir
Well, this is not exactly right. We should still see some benefit from the optimized DCT, sub-sampling, and color conversion routines.

The inverse DCT speedup only occurred when choosing the accurate JDCT_ISLOW method. Generally, we use JDCT_IFAST for decoding. And there is no significant improvement in sub-sampling and color conversion. So in my experience, we don't get a speedup on progressive-mode JPEG decoding.

Sincerely,

Bell

Hi Bell,

If you choose the JDCT_ISLOW IDCT method (which is powered by IPP), your final performance might be better than in the case of the JDCT_IFAST method. Did you try that?

Regards,
Vladimir