Intel® Integrated Performance Primitives

UIC jpeg codec performance not scaling to multi core

Simon_D_3
Beginner

Hi

I have built the sample code for uic_transcoder_con on a Q9650 (quad core) 64-bit Ubuntu 10.04 system with the Intel IPP libraries V7.0.1 installed.

Using the sample image (uic_test_image.jpg), I do not see any change in the time taken to encode the image to JPEG with different numbers of threads. I am running from a RAM disk.

Command issued for a single thread:
./uic_transcoder_con -otest.jpg -t1 -q95 -n1

Command issued for 4 threads:

./uic_transcoder_con -otest.jpg -t1 -q95 -n4

Restart intervals are turned on by the '-t1' option. Changing the JPEG quality does result in different encoding times. I have also tried adding '-s0' and '-jb', but as I think these are the defaults, this has no effect.

Vladimir_Dudnik
Employee

Note that the JPEG decoder is threaded in two different ways. The first option processes MCU rows in parallel; it has limited threading scalability but works with any baseline JPEG file. The second option processes restart intervals in parallel; it provides good threading scalability, but it obviously depends on the presence of restart intervals in the JPEG file. If the JPEG file in your performance test was compressed with a restart interval at every image row, then you can expect to see a threading advantage.
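To check whether a particular file can benefit from the restart-interval option, one rough approach is to look for the JPEG DRI (Define Restart Interval) marker, bytes FF DD, in the file. The following is a minimal sketch and not part of the UIC samples: the function name has_restart_intervals is made up for illustration, and the check is only a heuristic.

```shell
#!/bin/sh
# Heuristic sketch: does a JPEG file declare a restart interval?
# Searches a byte-aligned hex dump for the DRI marker (bytes FF DD).
# Caveats: the same byte pair may also occur inside an application segment
# payload (for example an embedded thumbnail), and a DRI segment with an
# interval of zero disables restarts, so treat a hit as a hint only.
has_restart_intervals() {
    od -An -v -tx1 "$1" | tr '\n' ' ' | tr -s ' ' | grep -q 'ff dd'
}
```

For example, running has_restart_intervals test.jpg && echo "DRI marker found" on the file written by the transcoder gives a quick indication of whether '-t1' actually produced restart intervals.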

Please also check whether you compiled the sample with the Intel Compiler, because threading is implemented with OpenMP.

Regards,
Vladimir

Ying_H_Intel
Employee
Hello,

For the Intel Compiler, you may get an evaluation version from http://software.intel.com/en-us/articles/intel-composer-xe/

FYI, here is a KB article about JPEG threading in 7.0:

http://software.intel.com/en-us/articles/jpeg-new-threading-model-in-ipp/


Regards,
Ying

Simon_D_3
Beginner
Hi Vladimir,

I agree that the JPEG decode speed would depend on whether the sample JPEG file had restart intervals; if it did not, the decode speed would not change with the number of threads.

I was more concerned with the JPEG encode speed. When I ran uic_transcoder_con with the '-t1' setting, it should have used restart intervals, so shouldn't I have seen a difference in the encoding time when I varied the number of threads on my multi-core machine?

We use the gcc compiler and not the Intel compiler. The OpenMP & TBB shared libraries are installed as required.

Regards,
Simon.

Ying_H_Intel
Employee
Hi Simon,

You need to use the Intel Compiler to rebuild uic_transcoder_con; then OpenMP threading will be enabled and you will see the speedup of the JPEG encoder on your multi-core machine.

(The latest gcc compiler versions should support OpenMP, but the sample's makefile has a tiny defect, so it requires the Intel Compiler.)

Regards,
Ying

Simon_D_3
Beginner
Hi Ying,

Thanks for your reply.

I also ran the test using the supplied pre-built binaries on a Windows 7 quad core machine and also didn't see any difference in the encode time with different numbers of threads. When viewing the loading on the cores in Task Manager, it seemed that only 1 core was loaded regardless of the number of threads.

Is it possible to give me an example command line for uic_transcoder_con that I can try out? Perhaps I am specifying the wrong parameters.

If I want to use the uic jpeg encoder in my system then I need to build it with gcc (4.4.3). Can you tell me what the defect is in the makefile?

Regards,
Simon.

Ying_H_Intel
Employee
Hi Simon,

The parameters and command line look fine. I just tried the pre-built binaries on a quad-core machine. The encode time differs with different numbers of threads.

C:\Documents and Settings\zsha>cd C:\Documents and Settings\zsha\desktop\7.0\w_ipp-uic_p_7.0.2.048\ipp-samples\image-codecs\uic\bin\intel64

C:\Documents and Settings\zsha\desktop\7.0\w_ipp-uic_p_7.0.2.048\ipp-samples\image-codecs\uic\bin\intel64>uic_transcoder_con.exe -otest.jpg -t1 -q95 -n1
Intel Integrated Performance Primitives
version: 7.0 build 205.23, [7.0.998.205]
name: ippjy8-7.0.dll+
date: Sep 2 2010
image: uic_test_image.jpg, 1280x960x3, 8-bits unsigned, color: RGB, sampling: 444
decode time: 144.20 msec
encode time: 28.22 msec

C:\Documents and Settings\zsha\desktop\7.0\w_ipp-uic_p_7.0.2.048\ipp-samples\image-codecs\uic\bin\intel64>uic_transcoder_con.exe -otest.jpg -t1 -q95 -n4
Intel Integrated Performance Primitives
version: 7.0 build 205.23, [7.0.998.205]
name: ippjy8-7.0.dll+
date: Sep 2 2010
image: uic_test_image.jpg, 1280x960x3, 8-bits unsigned, color: RGB, sampling: 444
decode time: 14.53 msec
encode time: 11.20 msec

C:\Documents and Settings\zsha\desktop\7.0\w_ipp-uic_p_7.0.2.048\ipp-samples\image-codecs\uic\bin\intel64>uic_transcoder_con.exe -otest.jpg -t1 -q95 -n8
Intel Integrated Performance Primitives
version: 7.0 build 205.23, [7.0.998.205]
name: ippjy8-7.0.dll+
date: Sep 2 2010
image: uic_test_image.jpg, 1280x960x3, 8-bits unsigned, color: RGB, sampling: 444
decode time: 17.96 msec
encode time: 7.90 msec

Did you run the command line in a Command window (started by cmd), change to the exact directory of the exe, and then run the command? Or did you try with Administrator rights? (Windows 7 seems to have strict execute and access control.)

Regards,
Ying

Ying_H_Intel
Employee

Hi Simon,

The editor seems to have removed all the line terminators.

The pre-built binaries should show different times for different thread counts, so please try a bit more. For example, run the command line in a Command window (started by cmd), change to the exact directory of the exe, and then run the command, or try with Administrator rights (Windows 7 seems to have strict execute and access control).

Regarding the Linux build:
If you have to use GCC OpenMP, you may change Makefile lines 67-70 as below.

66 ifeq ($(OPENMP_SUPPORT), YES)
67 CFLAGS += -fopenmp -DMULTITHREADING_OMP
68 #LDFLAGS += $(LIBPTHREAD)
69 LDFLAGS += -fopenmp
70 #-liomp5
71 endif

Additionally, if GCC OpenMP is enabled, it is recommended to use the serial IPP libraries.
You may add the line LINKAGE="static" in build_intel64(ia32).sh:

SAMPLE_STATUS="-FAILED"
LINKAGE="static"

TARG1=$2

Regards,
Ying H.

One result with gcc4:
[yhu5@NHM02 bin]$ export LD_LIBRARY_PATH=./:$LD_LIBRARY_PATH
[yhu5@NHM02 bin]$ ./uic_transcoder_con -otest.jpg -t1 -q95 -n1
Intel Integrated Performance Primitives
version: 7.0 build 205.40, [7.0.1015.205]
name: ippjy8_l.a+
date: Jan 5 2011
image: uic_test_image.jpg, 1280x960x3, 8-bits unsigned, color: RGB, sampling: 444
decode time: 11.43 msec
encode time: 18.97 msec
[yhu5@NHM02 bin]$ ./uic_transcoder_con -otest.jpg -t1 -q95 -n4
Intel Integrated Performance Primitives
version: 7.0 build 205.40, [7.0.1015.205]
name: ippjy8_l.a+
date: Jan 5 2011
image: uic_test_image.jpg, 1280x960x3, 8-bits unsigned, color: RGB, sampling: 444
decode time: 12.66 msec
encode time: 11.93 msec

another result with icc120

[yhu5@NHM02 bin]$ export LD_LIBRARY_PATH=./:$LD_LIBRARY_PATH
[yhu5@NHM02 bin]$ ./uic_transcoder_con -otest.jpg -t1 -q95 -n1
Intel Integrated Performance Primitives
version: 7.0 build 205.40, [7.0.1015.205]
name: libippjy8.so.7.0+
date: Jan 6 2011
image: uic_test_image.jpg, 1280x960x3, 8-bits unsigned, color: RGB, sampling: 444
decode time: 31.34 msec
encode time: 33.88 msec
[yhu5@NHM02 bin]$ ./uic_transcoder_con -otest.jpg -t1 -q95 -n4
Intel Integrated Performance Primitives
version: 7.0 build 205.40, [7.0.1015.205]
name: libippjy8.so.7.0+
date: Jan 6 2011
image: uic_test_image.jpg, 1280x960x3, 8-bits unsigned, color: RGB, sampling: 444
decode time: 14.38 msec
encode time: 11.35 msec
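
The figures above can be compared more easily as a speedup ratio. The following is a small sketch that pulls the "encode time" line out of the transcoder output and divides the single-thread time by the multi-thread time. The helper names encode_msec and speedup are made up for illustration, and the sketch assumes the output format shown in the listings above.

```shell
#!/bin/sh
# Extract the encode time (in msec) from uic_transcoder_con output on stdin.
# Assumes a line of the form "encode time: 18.97 msec".
encode_msec() {
    awk '/^encode time:/ { print $3 }'
}

# Print the speedup of a run taking $2 msec relative to one taking $1 msec.
speedup() {
    awk -v t1="$1" -v tn="$2" 'BEGIN { printf "%.2f\n", t1 / tn }'
}

# Example use (assumes the binary is in the current directory):
#   t1=$(./uic_transcoder_con -otest.jpg -t1 -q95 -n1 | encode_msec)
#   t4=$(./uic_transcoder_con -otest.jpg -t1 -q95 -n4 | encode_msec)
#   speedup "$t1" "$t4"
```

With the gcc4 encode times listed above (18.97 and 11.93 msec), speedup 18.97 11.93 prints 1.59. Single runs of a few milliseconds are noisy, so averaging several runs would give a more trustworthy ratio.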

Simon_D_3
Beginner
Hi Ying,

Thank you very much for the time you have spent on this.

I have upgraded the Windows build from IPP 7.0.1 to 7.0.2 & compiled the sample code. I was then able to get similar results to you on my quad core machine. Maybe there was some issue in 7.0.1 giving me a problem.

However I still have problems with the Linux build.

If I use the pre-built Linux uic_transcoder_con sample & shared libraries, then I get similar results to those on the Windows PC. The transcoder executable is around 500K in size.

If I build the Linux code with -fopenmp in both the CFLAGS & LDFLAGS, then when I run the executable I get an error saying something like 'error while loading shared libraries: libgomp.so.1: cannot open shared object file: No such file or directory'. The shared library in question (GNU omp) is in the system folder '/usr/lib'. The executable file size is around 500K.

I also tried building the Linux code with static libs as you suggested, but I still got the error about libgomp. The executable file size in this case was around 11MB.

I thought it was a bit odd that the pre-built binary did not match the size of the statically/dynamically linked code that I had built.

Do you have any idea what might be causing this run time error?

Regards,
Simon.

Ying_H_Intel
Employee
Hi Simon,

There may be something wrong with your build environment. But generally, an error like 'error while loading shared libraries: libgomp.so.1: cannot open shared object file: No such file or directory' can be resolved by explicitly pointing the program to the library path, like
export LD_LIBRARY_PATH=/usr/lib:./:$LD_LIBRARY_PATH
You may try this and see if the error is solved.
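
To see which libraries the loader fails to resolve, before and after changing LD_LIBRARY_PATH, the output of ldd can be filtered for "not found" entries. This is a minimal sketch; the function name missing_libs is made up for illustration, and it assumes the usual ldd output format of "name => path (address)".

```shell
#!/bin/sh
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd output on stdin.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example use (run it on the executable and on each shared library it loads,
# since every one of them is resolved separately):
#   ldd ./uic_transcoder_con | missing_libs
#   for f in ./libuic_*.so; do ldd "$f" | missing_libs; done
```

An empty result means every dependency of that file was resolved with the current search path.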

It is true that when linked with the static libraries the executable file is bigger, because uic_transcoder_con supports all the mentioned formats and adds -lippch_l -lippdc_l -lippcc_l -lippcv_l -lippj_l -lippi_l -lipps_l -lippcore_l. If only JPEG is needed, then only -lippj_l -lippi_l -lippcore is required, and the binary will be smaller.

The pre-built binary is based on dynamic linking, so the executable file size is small.

Here is a KB article about the IPP linkage models:
http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-intel-ipp-linkage-models-quick-reference-guide/#5

Regards,
Ying

Simon_D_3
Beginner
Hi Ying,

I did try various things around the library load path yesterday, but not exactly as you have suggested. The ld.so.conf file contains the path to the folder containing the sample shared libs, tbb, ipp libs etc. After updating the file I ran ldconfig to update the system. I also tried adding the /usr/lib path explicitly; I wasn't expecting this to make any difference, but it was worth a try. I also tried copying the 2 shared files relating to gomp to the folder containing the sample shared libs and again running ldconfig.

When I looked at the uic_transcoder_con build log file, I noticed that the iomp5 lib was always in the linker line no matter what I did to the Makefile. I didn't see the gomp lib in the list as I was expecting. I had a look through the various files to see where iomp5 was being pulled in: it is in tools.sh. Do you think there may be something else that may have been missed in the build environment?

I am going to have another look at the build later on but any further suggestions that you have would be very helpful.

It would be good to have the Linux build fixed up for the 7.0.3 update, but I would ideally like to resolve the issue in the short term.

Regards,
Simon.

Simon_D_3
Beginner
Hi Ying,

In order to resolve the issue with the gomp shared lib not being found, I had to hack the uic_transcoder_con.lin file within the application folder.

I had to modify the SYS_LIBS line: SYS_LIBS = -ltbb -lgomp -lpthread. Additionally, I had to put the tbb lib path in the LD_FLAGS (~line 46). For some reason these were not being picked up from the top-level Makefile.

However, when I run the app as before with 1 thread, the timing is similar to before, but when I specify 4 threads the timing returns silly numbers (decode 224 ms, encode 696 ms). If I also look at the System Monitor at the same time, I can see the overall timing looks to be the same & the loading is still only on 1 core.

I can't really think of anything else that I can try in order to move forward with this. I think there is some problem between the ipp libraries wanting to use iomp5 & the app wanting to use gomp.

Do you have any other suggestions or should I submit this as an issue through Intel Premier Support?

Regards,
Simon.

Ying_H_Intel
Employee
Hi Simon,

Just a comment on "there is some problem between the ipp libraries wanting to use iomp5 & the app wanting to use gomp."

Right. iomp5 is the OpenMP run-time library from the Intel Compiler, and gomp is the one from GNU GCC. They may be different versions and work through different mechanisms, so we generally don't mix them.

That is also why I suggest either using the Intel Compiler or, if you have to use GNU OpenMP, using the serial IPP libraries (ipp*_l.lib).

The sample binary is tested with the Intel Compiler and the iomp5 library. For gomp usage, it is right to modify SYS_LIBS as you did in order to remove iomp5. But the threading model still looks complex: TBB threads, the Intel OpenMP library used by IPP, and GNU OpenMP. What is your run result if you try gomp + IPP static libraries + the modified SYS_LIBS in uic_transcoder_con.lin?

Yes, you can submit it to Intel Premier Support.

Regards,
Ying

P.S. My run result with gomp + IPP static libraries + the modified SYS_LIBS in uic_transcoder_con.lin:
[yhu5@NHM02 bin]$ ./uic_transcoder_con -otest1.jpg -t1 -q95 -n1
Intel Integrated Performance Primitives
version: 7.0 build 205.40, [7.0.1015.205]
name: ippjy8_l.a+
date: Jan 5 2011
image: uic_test_image.jpg, 1280x960x3, 8-bits unsigned, color: RGB, sampling: 444
decode time: 56.23 msec
encode time: 19.44 msec
[yhu5@NHM02 bin]$ ./uic_transcoder_con -otest1.jpg -t1 -q95 -n4
Intel Integrated Performance Primitives
version: 7.0 build 205.40, [7.0.1015.205]
name: ippjy8_l.a+
date: Jan 5 2011
image: uic_test_image.jpg, 1280x960x3, 8-bits unsigned, color: RGB, sampling: 444
decode time: 11.67 msec
encode time: 10.69 msec

[yhu5@NHM02 bin]$ ldd uic_transcoder_con
libuic_core.so => ./libuic_core.so (0x00002b5be6511000)
libuic_io.so => ./libuic_io.so (0x00002b5be6716000)
libuic_bmp.so => ./libuic_bmp.so (0x00002b5be691e000)
libuic_pnm.so => ./libuic_pnm.so (0x00002b5be6b24000)
libuic_jpeg.so => ./libuic_jpeg.so (0x00002b5be6d35000)
libuic_jpeg2000.so => ./libuic_jpeg2000.so (0x00002b5be70a7000)
libuic_dds.so => ./libuic_dds.so (0x00002b5be732d000)
libuic_png.so => ./libuic_png.so (0x00002b5be753f000)
libuic_tiff.so => ./libuic_tiff.so (0x00002b5be7777000)
libuic_jpegxr.so => ./libuic_jpegxr.so (0x00002b5be797e000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000031c5400000)
libm.so.6 => /lib64/libm.so.6 (0x0000003d48e00000)
libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00002b5be7bec000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003d4de00000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003d49600000)
libc.so.6 => /lib64/libc.so.6 (0x0000003d48a00000)
libtbb.so.2 => /opt/intel/composerxe-2011.2.137/tbb/lib/intel64//cc4.1.0_libc2.4_kernel2.6.16.21/libtbb.so.2 (0x00002b5be7dfa000)
librt.so.1 => /lib64/librt.so.1 (0x0000003d4d600000)
/lib64/ld-linux-x86-64.so.2 (0x0000003d48600000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003d49200000)
[yhu5@NHM02 bin]$ ldd libuic_jpeg.so
libtbb.so.2 => /opt/intel/composerxe-2011.2.137/tbb/lib/intel64//cc4.1.0_libc2.4_kernel2.6.16.21/libtbb.so.2 (0x00002ab8fd75c000)
libgomp.so.1 => not found
libdl.so.2 => /lib64/libdl.so.2 (0x00002ab8fd8c3000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ab8fdac8000)
librt.so.1 => /lib64/librt.so.1 (0x00002ab8fdce3000)
libm.so.6 => /lib64/libm.so.6 (0x00002ab8fdeec000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002ab8fe170000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ab8fe360000)
libc.so.6 => /lib64/libc.so.6 (0x00002ab8fe56e000)
/lib64/ld-linux-x86-64.so.2 (0x0000003d48600000)
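
Listings like the two above can also be scanned automatically for the situation discussed in this thread, where both the Intel and the GNU OpenMP run-times end up in one process. The following is a rough sketch only; the function names openmp_runtimes and mixes_openmp_runtimes are made up for illustration. It inspects dynamic dependencies only, so an OpenMP run-time that was linked statically will not show up.

```shell
#!/bin/sh
# List the OpenMP run-time libraries (libiomp5, libgomp) named in ldd
# output on stdin, one per line, without duplicates.
openmp_runtimes() {
    awk '$1 ~ /^lib(iomp5|gomp)\.so/ { print $1 }' | sort -u
}

# Succeed (exit status 0) only if more than one distinct run-time is listed.
mixes_openmp_runtimes() {
    n=$(openmp_runtimes | wc -l | tr -d ' ')
    [ "$n" -gt 1 ]
}

# Example use: combine the dependencies of the executable and its libraries.
#   { ldd ./uic_transcoder_con; ldd ./libuic_jpeg.so; } | mixes_openmp_runtimes \
#       && echo "warning: both iomp5 and gomp are referenced"
```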

Simon_D_3
Beginner
Hi Ying,

I have tried to post some data here but keep getting an error from the server; I can paste it in OK, but when I hit submit I get an error. I could email it to you directly. It contains my latest results, the edits made to the build & the output from ldd.

I have managed to reproduce your results using static libraries. However, when I look in System Monitor (cf. Task Manager) I only see 1 core out of the 4 loaded at 100%. This does not change with the number of threads specified. Can you look at System Monitor while running uic_transcoder_con with 4 threads on your system to see if it gives the same result?

One other strange thing is that the time it takes to actually run the app does not seem to tie up with the times reported, i.e. it takes several seconds to run the app with only 10/20 ms reported for the decode/encode. Do you experience this on your system?

Regards,
Simon.

Ying_H_Intel
Employee
Hi Simon,

The performance looks fine.

About the performance: actually, all 4 cores are used when you specify n4, if you watch the CPU usage (open a new window, run top, and press 1), but there is one dominant core out of the 4. So it looks as if only 1 core out of the 4 is loaded at 100%.

Right, I have the same observation about the time. I will look into this and get back to you soon.
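
As an alternative to watching top interactively, per-core load over a whole run can be estimated on Linux from two snapshots of /proc/stat. This is a minimal sketch under that assumption; the function name core_busy_pct and the snapshot file names are made up for illustration.

```shell
#!/bin/sh
# Print the busy percentage of each core between two /proc/stat snapshots.
# $1 = snapshot taken before the run, $2 = snapshot taken after it.
core_busy_pct() {
    awk '
        /^cpu[0-9]+ / {
            total = 0
            # Sum the user..steal columns (fields 2 to 9, fewer on old kernels).
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            idle = $5 + $6                  # idle + iowait
            if (FNR == NR) {                # still reading the first snapshot
                t0[$1] = total; i0[$1] = idle
            } else if (total - t0[$1] > 0) {
                busy = (total - t0[$1]) - (idle - i0[$1])
                printf "%s %.0f\n", $1, 100 * busy / (total - t0[$1])
            }
        }
    ' "$1" "$2"
}

# Example use:
#   cat /proc/stat > before.txt
#   ./uic_transcoder_con -otest.jpg -t1 -q95 -n4
#   cat /proc/stat > after.txt
#   core_busy_pct before.txt after.txt
```

Because the reported decode and encode times are only tens of milliseconds while the whole run takes seconds, a whole-run average like this may mostly reflect work outside the timed codec calls, which could be one reason a single core appears dominant.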

Regards,
Ying

Simon_D_3
Beginner
Hi Ying,

I have now also tried running the pre-built Linux binaries with n4 specified. I have also run top as you did & I can see 1 core running at 100% with the other 3 showing mostly 98% idle. I was expecting to see higher loading on the other cores.

The timings I now have for the pre-built and the static-library-built sample app look similar.

Regards,
Simon.

jacobh
Beginner

Hello Ying,

image: uic_test_image.jpg, 1280x960x3, 8-bits unsigned, color: RGB, sampling: 444
decode time: 11.43 msec
encode time: 18.97 msec

I'm a bit surprised by your results. I'm currently assessing the performance of the JPEG2000 components, and when I run the UIC example I get the following performance results:

Intel Integrated Performance Primitives
version: 6.1 build 137.46, [6.1.813.137]
name: libippjy8.so.6.1+
date: Nov 27 2009
image: ../../../src/application/uic_transcoder_con/uic_test_image.jpg, 1280x960x3, 8-bits, color: RGB, sampling: 444
decode time: 1180.00 msec
encode time: 1600.00 msec

The processor I use is an Intel Xeon W3250 @ 2.67GHz.

For the code to be usable for us, I must stay under 40 msec, and from the results I get at this time I'm far from that.

Jacob

Vladimir_Dudnik
Employee
Hello Jacob,

Something is wrong; the JPEG decoder and encoder should not take that much time at this resolution.

Regards,
Vladimir

jacobh
Beginner
I couldn't agree more.

But what could be wrong? I compiled the example and tried it with the supplied JPG.

Something is causing a bottleneck.
Measurements for preprocessing and the wavelet transform are 30 and 190 msec.

Best regards,

Jacob

Vladimir_Dudnik
Employee
Well, I meant the JPEG codec (judging by the .JPG file you use). If your concern is the JPEG2000 codec, then you are right: JPEG2000 is about 10X slower than JPEG due to its complexity (mostly related to the arithmetic entropy coding stage). That is expected.

Vladimir