Solved: IPP dynamic linking with dispatching disabled?

Tam_N_1 · ‎04-26-2016

Hi staff,

I'm using the old Intel Parallel Studio XE (PXE) 2011. I have a question about dynamic linking and dispatching.

I know that with dynamic linking, dispatching is enabled by default. Is there any option or function call to disable it?

I ask because I want to compare the results of static linking without dispatching against dynamic linking with dispatching disabled.

I hope someone can help.

Best regards,

Tam.

Sergey_K_Intel · ‎04-26-2016

Hi Tam,

With the existing IPP environment, I think it is impossible. CPU dispatching is an inherent part of the dynamic library.

However, from a technical point of view, there is almost no difference between the static and dynamic libraries. Calling a dispatched function costs only a few CPU clocks more than calling a statically linked function. Only the first call of an IPP function starts the CPU dispatcher, which initializes function pointers according to the CPU's characteristics, or according to selected CPU features if you want to run a particular CPU optimization on a newer CPU (for example, SSE4.2 code on an AVX CPU). This first step may take (and actually does take) longer, also because with dynamic linking it causes a bunch of DLLs to be loaded.

After IPP initialization (CPU dispatching initialization) is done, dynamic function calls are as fast as static function calls. Or is this not about performance at all?
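To illustrate the mechanism described above, here is a minimal sketch in C of lazy dispatch through a function pointer. It is not IPP's actual dispatcher: every name in it (sum_generic, sum_unrolled, g_cpu_has_fast_path, dispatched_sum) is hypothetical, and the "CPU feature probe" is just a flag. It only shows why the first call pays a one-time cost and later calls are a plain indirect call.

```c
#include <stddef.h>

typedef long (*sum_fn)(const int *v, size_t n);

/* Plain reference loop, standing in for generic code. */
static long sum_generic(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

/* Unrolled loop, standing in for a CPU-optimized variant. */
static long sum_unrolled(const int *v, size_t n)
{
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) { s0 += v[i]; s1 += v[i + 1]; }
    if (i < n) s0 += v[i];
    return s0 + s1;
}

int g_cpu_has_fast_path = 1; /* hypothetical stand-in for a CPU feature probe */
int g_resolve_count = 0;     /* how many times the resolver has run */

static long sum_resolve(const int *v, size_t n);

/* The pointer starts at the resolver, so the first call triggers dispatch. */
static sum_fn g_sum = sum_resolve;

/* Runs once: picks an implementation, rebinds the pointer, forwards the call. */
static long sum_resolve(const int *v, size_t n)
{
    ++g_resolve_count;
    g_sum = g_cpu_has_fast_path ? sum_unrolled : sum_generic;
    return g_sum(v, n);
}

/* Public entry point: after the first call this is one indirect call. */
long dispatched_sum(const int *v, size_t n)
{
    return g_sum(v, n);
}
```

Calling dispatched_sum twice on the same array runs the resolver only once; the second call goes straight to the selected variant.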

Please tell us which particular experiments you want to run, and maybe we can find another way to implement this.

Tam_N_1 · ‎04-26-2016

Hi Sergey,

I'm wondering whether it is possible to use dynamic library linking without dispatching.

My situation is this: the old project linked the static libraries without calling ippStaticInit/ippInit (static linking without dispatching).

Now I want to use dynamic library linking. The problem is that dynamic linking produces different output from the old project because it uses dispatching.

I really don't care about performance. My assumption is that the program produces the same output on any CPU type when dispatching is disabled. Is this true?

Can I reproduce the old project's results with dynamic library linking, for example by turning off dispatching? Is that possible?

Regards,

Tam.

Sergey_K_Intel · ‎04-26-2016

Different output? That is a problem. It should be the same.

Could you please give more info: the IPP function you suspect of producing the wrong result, the IPP library version with correct output (static linking), and the IPP library version with incorrect output (dynamic linking)? Or are you talking about the same library? Please specify the version, and your current CPU model.

Here we should be talking not about the difference between static and dynamic linking, but about a possible discrepancy between the results of different CPU-optimized implementations. It is possible to try different optimizations in both the static and the dynamic case.

Tam_N_1 · ‎04-27-2016

Hi Sergey,

Here is the information:

IPP function: ippsSqrt_32fc_I.

IPP library version: IPP PXE 2011.

My CPU model is Intel Core i3-3240.

1/ Static libraries ippm_l.lib ipps_l.lib ippi_l.lib ippvm_l.lib ippcore_l.lib, with no call to ippInit() in the source code

2/ Dynamic libraries ippm.lib ipps.lib ippi.lib ippvm.lib ippcore.lib

(1) and (2) do not give identical results (epsilon 10^-6).

When I use (1) + ippInit(), (1) and (2) give the same result.

What I want is for (2) to give a result identical to (1). How can I do that?
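For readers who want to reproduce this kind of comparison, the following is a minimal sketch of how two output buffers could be checked, both against a tolerance such as 10^-6 and for bit-for-bit identity. The function names are hypothetical and no IPP calls are involved; a complex 32-bit buffer of n elements can be passed as 2n floats, assuming the usual layout of interleaved real and imaginary parts.

```c
#include <stddef.h>
#include <string.h>

/* Largest absolute element-wise difference between two float buffers. */
float max_abs_diff(const float *a, const float *b, size_t n)
{
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}

/* Returns 1 if the two buffers are bit-for-bit identical, 0 otherwise. */
int bitwise_identical(const float *a, const float *b, size_t n)
{
    return memcmp(a, b, n * sizeof *a) == 0;
}
```

Two runs "agree within epsilon" when max_abs_diff stays below the chosen tolerance, while "identical" in the strict sense requires bitwise_identical to return 1.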

Regards,

Tam

Sergey_K_Intel · ‎04-27-2016

OK. Now it's clear. The fact is that in your cases (1) and (2) different CPU-specific code is running.

In case (1), when there is no initialization with ippInit(), the computations are done on the scalar FPU (floating-point unit). It is an 80-bit precision device, relatively slow but precise. This is the so-called "px/mx" code. See: https://software.intel.com/en-us/articles/ipp-dispatcher-control-functions-ippinit-functions

When you call ippInit, which does not take a particular CPU as ippInitCpu(cpucode) does, the most appropriate optimized function variant is chosen. In the IPP dynamic libraries this is done automatically. With your CPU this code is some SSEx (SIMD) variant. If I'm not wrong, SSE does FP calculations with 64-bit precision. SSE is fast, but less precise.
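The effect of intermediate precision can be illustrated without IPP. The sketch below is only an analogy, not the algorithm IPP uses: it computes a square root by Newton iteration twice, once entirely in float and once with wider (double) intermediates that are rounded to float at the end. Both function names are hypothetical. The two results agree closely but are not guaranteed to match in the last bit, which is the kind of discrepancy discussed here.

```c
/* Square root by Newton iteration carried out entirely in float. */
float sqrt_newton_float(float x)
{
    if (x <= 0.0f) return 0.0f;
    float y = x > 1.0f ? x : 1.0f;      /* initial guess at or above sqrt(x) */
    for (int i = 0; i < 128; ++i)
        y = 0.5f * (y + x / y);
    return y;
}

/* Same iteration with double intermediates, rounded to float once at the end. */
float sqrt_newton_wide(float x)
{
    if (x <= 0.0f) return 0.0f;
    double y = x > 1.0f ? (double)x : 1.0;
    for (int i = 0; i < 128; ++i)
        y = 0.5 * (y + (double)x / y);
    return (float)y;
}
```

Comparing the two functions over a range of inputs shows how often and by how much the narrow and wide paths differ on a given compiler and CPU.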

By the way, you can get the same results as in (1) in your case (2) if, at the beginning, you call ippInitCpu with a CPU-type argument for Intel(R) Pentium/II/III processors (ippCpuUnknown, ippCpuPP, ippCpuPII, or ippCpuPIII).

The question is whether 10^-6 is enough for you, and whether you might want better-performing functionality. An additional problem is that the chosen CPU optimization applies throughout the whole library. If you limit your optimization to px/mx, the limitation also affects all the other IPP functions, which may not suffer from the FPU/SSE difference. And you may lose performance where it is really necessary.

As far as I know, Sqrt functionality was redeveloped for better precision in IPP 9.0.

Tam_N_1 · ‎04-27-2016

Hi Sergey,

Thank you a lot. The information was almost all there, but I had not read it carefully. :(

Best regards,

Tam

Tam_N_1 · ‎05-01-2016

Hi Sergey,

"By the way, you can the same results as in (1) in your (2) case, if, at the beginning, you call ippInitCpu with CPU-type argument for Intel(R) Pentium/II/III processors (ippCpuUnknown, or ippCpuPP, or ippCpuPII, or ippCpuPIII)."

I tried (2) with ippCpuUnknown, but it seems not to affect the output at all. I also tried ippCpuPP, but it didn't give the same output as (1).

I don't know why. What do you think? Please note that I'm using the IPP PXE 2011 (7.0) version, while the link you gave me is for 6.1.

I hope you can help me solve this.

Regards,

Tam.

Tam_N_1 · ‎05-02-2016

Hi,

I have tried all the CPU types, using (2) plus a call to ippInitCpu(). Here are the results:

CPU TYPE	Result
ippCpuUnknown = 0x00	Dynamic
ippCpuPP = 0x01, /* Intel(R) Pentium(R) processor */	X
ippCpuPMX = 0x02, /* Pentium(R) processor with MMX(TM) technology */	X
ippCpuPPR = 0x03, /* Pentium(R) Pro processor */	X
ippCpuPII = 0x04, /* Pentium(R) II processor */	X
ippCpuPIII = 0x05, /* Pentium(R) III processor and Pentium(R) III Xeon(R) processor */	X
ippCpuP4 = 0x06, /* Pentium(R) 4 processor and Intel(R) Xeon(R) processor */	X
ippCpuP4HT = 0x07, /* Pentium(R) 4 Processor with HT Technology */	X
ippCpuP4HT2 = 0x08, /* Pentium(R) 4 processor with Streaming SIMD Extensions 3 */	X
ippCpuCentrino = 0x09, /* Intel(R) Centrino(TM) mobile technology */	X
ippCpuCoreSolo = 0x0a, /* Intel(R) Core(TM) Solo processor */	X
ippCpuCoreDuo = 0x0b, /* Intel(R) Core(TM) Duo processor */	X
ippCpuITP = 0x10, /* Intel(R) Itanium(R) processor */	X
ippCpuITP2 = 0x11, /* Intel(R) Itanium(R) 2 processor */	X
ippCpuEM64T = 0x20, /* Intel(R) 64 Instruction Set Architecture (ISA) */	X
ippCpuC2D = 0x21, /* Intel(R) Core(TM) 2 Duo processor */	O
ippCpuC2Q = 0x22, /* Intel(R) Core(TM) 2 Quad processor */	O
ippCpuPenryn = 0x23, /* Intel(R) Core(TM) 2 processor with Intel(R) SSE4.1 */	O
ippCpuBonnell = 0x24, /* Intel(R) Atom(TM) processor */	O
ippCpuNehalem = 0x25, /* Intel(R) Core(TM) i7 processor */	O
ippCpuNext = 0x26,	O
ippCpuSSE = 0x40, /* Processor supports Streaming SIMD Extensions instruction set */	X
ippCpuSSE2 = 0x41, /* Processor supports Streaming SIMD Extensions 2 instruction set */	X
ippCpuSSE3 = 0x42, /* Processor supports Streaming SIMD Extensions 3 instruction set */	Static
ippCpuSSSE3 = 0x43, /* Processor supports Supplemental Streaming SIMD Extension 3 instruction set */	O
ippCpuSSE41 = 0x44, /* Processor supports Streaming SIMD Extensions 4.1 instruction set */	O
ippCpuSSE42 = 0x45, /* Processor supports Streaming SIMD Extensions 4.2 instruction set */	O
ippCpuAVX = 0x46, /* Processor supports Advanced Vector Extensions instruction set */	Dynamic
ippCpuAES = 0x47, /* Processor supports AES New Instructions */	O
ippCpuX8664 = 0x60 /* Processor supports 64 bit extension */	X

X: my program can't run to completion (please ignore this).

O: my program runs to completion, but the output is not the same as (1) or (2).

Static: the output is the same as (1).

Dynamic: the output is the same as (2).

As you can see, only the CPU type ippCpuSSE3 gives the same output as the static library.

I wonder whether the static library always uses ippCpuSSE3 by default in IPP PXE 2011?

Let us know what you think about this.

Best regards,

Tam.

Ying_H_Intel · ‎05-03-2016

Hi Tam,

Thanks for sharing the test result.

I checked the version information at https://software.intel.com/en-us/articles/which-version-of-the-intel-ipp-intel-mkl-and-intel-tbb-libraries-are-included-in-the-intel. It seems PXE 2011 includes an IPP 7.0.x version.

No, I don't recall the static library always using ippCpuSSE3 by default in IPP PXE 2011.

Anyway, could you please add the printf below to the test case for 1) static and 2) dynamic + call to ippInitCpu()?

ippCpuSSE3 = 0x42, /* Processor supports Streaming SIMD Extensions 3 instruction set */	Static

const IppLibraryVersion* lib = ippsGetLibVersion();
printf("%s %s %d.%d.%d.%d\n",
       lib->Name, lib->Version,
       lib->major, lib->minor, lib->majorBuild, lib->build);

Thanks

Ying

Tam_N_1 · ‎05-03-2016

Hi Ying,

Here are printed information for (1) and (2):

(1) : ippsm7_l.lib 7.0 build 205.23 7.0.205.1024

(2) : ippsm7-7.0.dll 7.0 build 205.23 7.0.205.1024

Regards,

Tam.

Ying_H_Intel · ‎05-04-2016

Hi Tam,

Thank you very much for the test. It is clear now.

IPP dispatches the optimized code according to the CPU type.

For example, see the table in https://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-understanding-cpu-optimized-code-used-in-intel-ipp

m7: optimized for 64-bit CPUs with SSE3

The static link was using m7 code, so when dynamic linking plus ippInitCpu(ippCpuSSE3) also uses m7, both run the same CPU-optimized code and therefore give the same result.
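As a small illustration of reading the CPU code out of the library name reported by ippsGetLibVersion(), here is a sketch in C. Both functions are hypothetical helpers, not part of IPP. The parser assumes the name looks like the ones posted above (for example ippsm7-7.0.dll or ippsm7_l.lib), with the two-letter code directly before the first '-', '_' or '.'. The lookup only covers the codes mentioned in this thread; the full list is in the Intel article linked above.

```c
#include <string.h>

/* Copies the two-letter CPU code that precedes the first '-', '_' or '.'
   in lib_name into out. Returns 1 on success, 0 if the name is too short. */
int ipp_cpu_code(const char *lib_name, char out[3])
{
    size_t cut = strcspn(lib_name, "-_.");
    if (cut < 2) return 0;
    out[0] = lib_name[cut - 2];
    out[1] = lib_name[cut - 1];
    out[2] = '\0';
    return 1;
}

/* Meaning of the codes named in this thread only. */
const char *ipp_cpu_code_meaning(const char *code)
{
    if (strcmp(code, "m7") == 0) return "SSE3, 64-bit";
    if (strcmp(code, "px") == 0 || strcmp(code, "mx") == 0)
        return "generic px/mx code";
    return "unknown (not covered in this thread)";
}
```

Applied to the two names posted above, both the static and the dynamic case yield m7, which matches the conclusion that they run the same CPU-optimized code.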

IPP 9.0 was updated this week. You can try it at https://software.intel.com/en-us/intel-ipp/ => try.

Best Regards,

Ying

Tam_N_1 · ‎05-04-2016

Hi Ying,

Thank you so much. It's clear for me, too.

I'm going to migrate my work from PXE 2011 to PXE 2016. I may have more questions in the future.

Best Regards,

Tam.