Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

JPEG 2000 performance

ksubox
Beginner
2,032 Views
Hello,
Our company is starting a project related to decompressing JPEG 2000. Most images are 4096x4096 in JP2 format. The most important requirement is decompression speed. I tested sample images with Intel IPP 7.0 and Kakadu on my Q6600 and got a huge difference in speed:
Intel IPP 7.0.6 uic_transcoder_con - 0.770 sec/image
Kakadu 6.3 - 0.170 sec/image
Also, if I specify -n (number of threads), it increases the decompression time: -n 2 gives 1.280 sec/image, -n 4 gives 1.580 sec/image.
I suspect I did something wrong.
What is wrong with my tests? Can Intel IPP speed be brought on par with Kakadu? If yes, it would be preferable for our company.
Thank you, Sergey.
PS: Actually, I found the reason: the ippiDecodeCBProgrSetPassCounter_JPEG2K function is not multithreaded. If it were, decoding time could probably be improved to 0.220-0.230 sec/image, which is on par with Kakadu. So it would be good to reimplement this function in IPP...
0 Kudos
10 Replies
Jeffrey_M_Intel1
Employee
Hello,

First, please let me apologize for the delay in this response.

If you're still interested, we may be able to help improve your performance with multiple threads. However, the bottom line is that we're not going to be able to push our free sample to this level of performance with different parameters, compile options, or simple changes to the code.

Thank you very much for your feedback on this issue, including your analysis of ippiDecodeCBProgrSetPassCounter_JPEG2K. This looks like a great place to start for improving decode performance. I have fed this recommendation to product planning so it can be considered for future releases.


Best Regards,

Jeff
fvipp
Beginner
Do you have a document with official benchmarks for JPEG and JPEG2000 codecs?
What maximum throughput could be achieved with IPP-7.0 JPEG/JPEG2000 codecs?
This is an important question and we can't find the answer. Please advise.
As far as we know, the latest white paper concerning IPP performance benchmarks is dated 2003.
Jeffrey_M_Intel1
Employee
We appreciate the importance of this question. Unfortunately, we don't currently have the kind of performance comparisons you're looking for. What we can show, if you look at our performance details summary

http://software.intel.com/en-us/articles/intel-ipp/#details

is that JPEG 2000 performance improved between the 6.1 and 7.0 versions of IPP.

Your requirements are always important. Since we don't have this data ready for publication, perhaps the best way to assist with your decision would be to let us know where your expectations have been set by alternative solutions. We may be able to provide assistance with setting up some quick tests to determine if UIC is close to those requirements. In any case, we can pass this data on to project planning to be considered for future releases.

Would this help?

Best regards,

Jeff




fvipp
Beginner

Thanks for your reply. Unfortunately, your link doesn't help. We don't need to know the relative speed-up; we need to know real performance under standard conditions. Here is an example for JPEG: baseline JPEG, standard images from the Kodak set (grayscale or color), quality setting 50%, static tables for quantization and Huffman coding, a good PC with a Core i7, and your recommended parameters for multithreading. We need results for compression ratio, compression time, and decompression time (excluding the time for loading the image from the HDD).

It should be like this: uic_transcoder_con.exe -o test.jpg -i lenna.bmp -j b -q 50 -t 1 -n 8

with details how to get maximum performance at coding and decoding. And we need a table with achieved results. Please have a look at these examples:

http://www.accusoft.com/picphotofeatures.htm#comparison_table

http://visionexperts.co.uk/news/?id=25

Jeffrey_M_Intel1
Employee

Hi Sergey,

We don't have this data now. However, your suggestion makes a lot of sense, and I've escalated your feedback to our developers and project planners. I can't promise if or when we will be able to produce this kind of report, though: a lot of effort and review needs to go into official performance comparisons, and this needs to be prioritized against other tasks.

Until then, my ability to provide this data is limited. In the interest of being helpful, since you mentioned that 4096x4096 decode performance is your most important requirement, here is a snapshot of what I'm seeing on my machine. Please note that these results have not been reviewed and are not authoritative benchmark results. Many factors affect performance, so your results may differ. The intent is simply to give you some indication of the results we're getting here.

Test platform: Intel Core i7-2600K CPU @ 3.40 GHz, Windows 7, IPP + samples 7.0.6. (Using pre-compiled executables for easy reproducibility; if you have not already, you may want to give these a try.)

library: ippje9-7.0.dll

Input image: a 4096x4096 resize of the standard Lenna image, sRGB colorspace, 50% quality, JP2 format.

./uic_transcoder_con.exe -i c:/videos/lenna4096x4096_2.jp2 -o test.bmp -n 1 -t 1 -m 20 : decode time: 548.71 msec

./uic_transcoder_con.exe -i c:/videos/lenna4096x4096_2.jp2 -o test.bmp -n 2 -t 1 -m 20 : decode time: 436.84 msec

./uic_transcoder_con.exe -i c:/videos/lenna4096x4096_2.jp2 -o test.bmp -n 4 -t 1 -m 20 : decode time: 381.77 msec

./uic_transcoder_con.exe -i c:/videos/lenna4096x4096_2.jp2 -o test.bmp -n 8 -t 1 -m 20 : decode time: 381.97 msec
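For a quick read on scaling, the four timings above can be turned into speedup and parallel-efficiency figures. A minimal sketch (the times are copied from the runs above; nothing else is assumed):

```python
# Decode times (msec) by thread count, copied from the runs above.
decode_ms = {1: 548.71, 2: 436.84, 4: 381.77, 8: 381.97}

base = decode_ms[1]  # single-threaded reference
for n in sorted(decode_ms):
    speedup = base / decode_ms[n]
    efficiency = speedup / n
    print(f"-n {n}: {decode_ms[n]:7.2f} ms  "
          f"speedup {speedup:.2f}x  efficiency {efficiency:.0%}")
```

The curve flattening between -n 4 and -n 8 is consistent with a serial bottleneck such as the one identified in the original post.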

Sorry we can't get everything today, but hopefully this is at least getting closer to the data you need.

Best regards,

Jeff

fvipp
Beginner

Thanks a lot for your reply. That's exactly what we are looking for. Actually, we are more interested in baseline JPEG compression for greyscale images, so let's try to get a better understanding of the matter.

We can get greyscale test image "big_building.bmp" from here:

http://www.imagecompression.info/test_images/

Then we try to compress it to JPEG on a Core i7 920 with the following command lines:

1) uic_transcoder_con.exe -i big_building.bmp -o test.jpg -n 8 -q 50 -t 1 -m 1

Intel Integrated Performance Primitives

version: 7.0 build 205.85, [7.0.1058.205]

name: ippjy8-7.0.dll+

date: Nov 27 2011

image: big_building.bmp, 7216x5408x1, 8-bits unsigned, color: Grayscale, sampling: 444

decode time: 19.25 msec

encode time: 112.50 msec

2) uic_transcoder_con.exe -i big_building.bmp -o test.jpg -n 8 -q 50 -t 1 -m 20

We do BMP-to-JPEG compression with the parameter -m 20. It's almost the same; we just ask for the compression to be repeated 20 times.

Intel Integrated Performance Primitives

version: 7.0 build 205.85, [7.0.1058.205]

name: ippjy8-7.0.dll+

date: Nov 27 2011

image: big_building.bmp, 7216x5408x1, 8-bits unsigned, color: Grayscale, sampling: 444

decode time: 8.78 msec

encode time: 50.33 msec

We have the same image and the same settings, with the only difference being the repeat count. What is the main reason for such a variation in performance (a factor of 2)? Can your software decode a 37 MB image in 8.78 msec? Does your software do decoding at the same time as encoding?

Now we can do the same thing for decoding (we decode the image obtained in the previous test):

3) uic_transcoder_con.exe -i test.jpg -o test.bmp -n 8 -t 1 -m 1

Intel Integrated Performance Primitives

version: 7.0 build 205.85, [7.0.1058.205]

name: ippjy8-7.0.dll+

date: Nov 27 2011

image: test.jpg, 7216x5408x1, 8-bits unsigned, color: Grayscale, sampling: 444

decode time: 66.13 msec

encode time: 14.28 msec

Then we try to see what we get with -m 20:

4) uic_transcoder_con.exe -i test.jpg -o test.bmp -n 8 -t 1 -m 20

Intel Integrated Performance Primitives

version: 7.0 build 205.85, [7.0.1058.205]

name: ippjy8-7.0.dll+

date: Nov 27 2011

image: test.jpg, 7216x5408x1, 8-bits unsigned, color: Grayscale, sampling: 444

decode time: 37.63 msec

encode time: 9.00 msec

Decoding time varies by a factor of 2. The meaning of encode time here is not clear either.

What is the accuracy of time measurements? Could you explain the above results?

Sergey_K_Intel
Employee
Hi,
The -m option defines the number of loops (repetitions of the same encode and decode operations) to perform.
In performance measurements it is usually helpful to run the same operation several times and divide the overall time by the number of iterations; you'll get better accuracy.
But there's a side effect related to cache "temperature". When you repeat execution of the same algorithm, the data you try to read may already be in the CPU cache line(s), so you spend less time waiting for instructions to complete than if the required data had to be loaded from main memory (RAM) into the cache first and only then consumed by a CPU instruction.
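This warm-cache/warm-memory effect is easy to see outside IPP as well. A toy sketch (not IPP code; the buffer size is arbitrary and the exact ratio is machine-dependent): the first pass over a freshly allocated buffer pays for page faults and cache fills that later passes largely avoid.

```python
import time

def time_pass(buf):
    """Time one full read pass over the buffer, in milliseconds."""
    t0 = time.perf_counter()
    sum(buf)  # forces every byte to be read
    return (time.perf_counter() - t0) * 1000.0

# 16 MB zero-filled buffer, freshly allocated (pages not yet touched).
buf = bytearray(16 * 1024 * 1024)

cold_ms = time_pass(buf)                         # first ("cold") pass
warm_ms = min(time_pass(buf) for _ in range(5))  # best repeated pass

print(f"cold pass: {cold_ms:.2f} ms, warm pass: {warm_ms:.2f} ms")
```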
Regarding JPEG decoding (your last example): the encode time is just the time to repack the JPEG-decoded image into a BMP container.
I also see the -n option, which enables multi-threaded execution. It speeds up the execution as well.
Regards,
Sergey
fvipp
Beginner
The -m option defines the number of loops (repetitions of the same encode and decode operations) to perform.
In performance measurements it is usually helpful to run the same operation several times and divide the overall time by the number of iterations; you'll get better accuracy.

No, we haven't got better accuracy. Actually, we've got a significant "speed-up" (from 112 ms to 50 ms for JPEG encoding) just from the -m option. We think that this is not better accuracy; this is a lack of accuracy.

Let's talk about real accuracy rather than "better" accuracy. It's not clear how you measure execution time, and this is very important. If you show two digits after the decimal point, do they mean anything? If you do the same encoding several times (without the -m option), what error, in terms of MSE, will you get? How can we find out the average time for encoding or decoding?
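One way to put numbers on this concern is to time the same operation repeatedly and report the mean together with the spread, instead of a single figure with two decimal places. A sketch with a placeholder workload (the workload is hypothetical, standing in for one encode call; it is not an IPP call):

```python
import statistics
import time

def measure_ms(fn, repeats=20):
    """Run fn() `repeats` times; return (mean, stdev) of the times in ms."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(times), statistics.stdev(times)

# Placeholder workload standing in for a single encode call.
workload = lambda: sum(i * i for i in range(200_000))

mean_ms, stdev_ms = measure_ms(workload)
print(f"mean {mean_ms:.2f} ms, stdev {stdev_ms:.2f} ms over 20 runs")
```

If the standard deviation turns out to be comparable to the mean, two decimal places in a single measurement carry no real information.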

But there's a side effect related to cache "temperature". When you repeat execution of the same algorithm, the data you try to read may already be in the CPU cache line(s), so you spend less time waiting for instructions to complete than if the required data had to be loaded from main memory (RAM) into the cache first and only then consumed by a CPU instruction.
Thanks, we know about the hot cache. As far as your approach with the -m option is concerned, it looks really strange precisely because of the hot cache: it is far from real-world time measurement, and one could say that it is not fair. We want to estimate encoding performance and we see something strange. Please advise. As we understand it, the -m option is not worth using because of the hot cache. We think that we'd better use big images to increase the accuracy of time measurements.

Regarding JPEG decoding (your last example): the encode time is just the time to repack the JPEG-decoded image into a BMP container.

This is also very strange. We see the phrase "encode time", though it means the time to repack the JPEG-decoded image into a BMP container.

I also see the -n option, which enables multi-threaded execution. It speeds up the execution as well.

Thanks, we know that. We are trying to find out how Intel recommends doing JPEG encoding and decoding in the fastest way. What would you recommend?

Sergey_K_Intel
Employee
No, we haven't got better accuracy. Actually, we've got a significant "speed-up" (from 112 ms to 50 ms for JPEG encoding) just from the -m option. We think that this is not better accuracy; this is a lack of accuracy.
OK. Let's consider this option as a way to get more stable results. If you run "-m 1" several times, you will probably see quite a big fluctuation in the results. "-m" makes the timing more predictable.
It's not clear how you measure execution time, and this is very important. If you show two digits after the decimal point, do they mean anything?
This is why we distribute the samples in source code form. Two digits might be helpful if the overall time is several milliseconds.
How can we find out the average time for encoding or decoding?
Take several input images, encode them with "-m 1", and find the average.
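That procedure can be scripted against the tool's printed output. A sketch that assumes only the "encode time: ... msec" line format quoted earlier in this thread; the sample values below are hypothetical:

```python
import re
import statistics

# Hypothetical captured output lines from several "-m 1" runs;
# the line format follows the tool output quoted in this thread.
runs = [
    "encode time: 112.50 msec",
    "encode time: 108.91 msec",
    "encode time: 115.02 msec",
]

times = [float(re.search(r"encode time:\s*([\d.]+)\s*msec", line).group(1))
         for line in runs]
print(f"average encode time: {statistics.mean(times):.2f} msec "
      f"(min {min(times):.2f}, max {max(times):.2f} over {len(times)} runs)")
```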
This is also very strange. We see the phrase "encode time", though it means the time to repack the JPEG-decoded image into a BMP container.
In terms of UIC, BMP is also a "codec", so it has an Encode method, which in reality repacks the raw image array into BMP format (no actual encoding is performed). Nevertheless, since the time marks are around the EncodeImage function, the results go into the "Encode time" section. The situation is the same with other container codecs such as TIFF and PNM: there is no encoding there either.
As we understand it, the -m option is not worth using because of the hot cache. We think that we'd better use big images to increase the accuracy of time measurements.
That is a good way too.
We are trying to find out how Intel recommends doing JPEG encoding and decoding in the fastest way. What would you recommend?
The general recommendation is to use the appropriate library (with binary code matching the CPU in use). Multi-threading can also help if the image is not too small and the number of hardware cores is not too large (in that case the overhead of multi-thread support becomes significant).
Regards,
Sergey
fvipp
Beginner
No, we haven't got better accuracy. Actually, we've got a significant "speed-up" (from 112 ms to 50 ms for JPEG encoding) just from the -m option. We think that this is not better accuracy; this is a lack of accuracy.

OK. Let's consider this option as a way to get more stable results. If you run "-m 1" several times, you will probably see quite a big fluctuation in the results. "-m" makes the timing more predictable.
Unfortunately, we can't explain the difference between 112 ms and 50 ms the way you do, so we have to consider the "-m" option an unfair trick with the cache.

It's not clear how you measure execution time, and this is very important. If you show two digits after the decimal point, do they mean anything?
This is why we distribute the samples in source code form. Two digits might be helpful if the overall time is several milliseconds.

It could be a good idea to round the output and show only the meaningful digits. If the two digits after the decimal point are correct, then your accuracy in time measurements is 0.01 ms, which is difficult to believe.

This is also very strange. We see the phrase "encode time", though it means the time to repack the JPEG-decoded image into a BMP container.
In terms of UIC, BMP is also a "codec", so it has an Encode method, which in reality repacks the raw image array into BMP format (no actual encoding is performed). Nevertheless, since the time marks are around the EncodeImage function, the results go into the "Encode time" section. The situation is the same with other container codecs such as TIFF and PNM: there is no encoding there either.

If you specify a BMP file with the -i (input) option and a JPEG file with the -o (output) option on the command line, it means there will be a conversion from BMP to JPEG. This is sufficient indication that the process should be called encoding; there is no decoding here. In this case it might be better not to show timing for decoding at all.

We are trying to find out how Intel recommends doing JPEG encoding and decoding in the fastest way. What would you recommend?
The general recommendation is to use the appropriate library (with binary code matching the CPU in use). Multi-threading can also help if the image is not too small and the number of hardware cores is not too large (in that case the overhead of multi-thread support becomes significant).

What would you recommend for a Core i7 920 and Windows 7 (32/64)?
We do the following for JPEG encoding with 50% quality: uic_transcoder_con.exe -i lenna.bmp -o test.jpg -j b -q 50 -t 1 -n 8
