Performance of Copy Function

Shankar1 · ‎02-16-2009

I have been using the evaluation version of IPP 5.3, and I am having trouble reaching the best performance figures that should be achievable with quite a few IPP functions. For example, I ran the performance tool that comes with the IPP installation for the ippsCopy_32fc function alone (using ipps.exe found in the tools/prefsys/ folder) on my machine and have attached the results in IppsCopy.csv.

I see from the results that a copy of 2048 Ipp32fc (complex float) elements should ideally take 0.866 microseconds.
But I'm not able to achieve the same in my samples. Here is the sample code.

int size = 2048;
Ipp32fc* pSrc = ippsMalloc_32fc( size);
Ipp32fc* pDst = ippsMalloc_32fc( size);

tbb::tick_count t0 = tbb::tick_count::now();
ippsCopy_32fc( pSrc, pDst, size);
tbb::tick_count t1 = tbb::tick_count::now();

std::cout << (t1 - t0).seconds() * 1000000 << std::endl;

The above copy takes 23 microseconds in my sample, which is approximately 30 times slower
than what the performance tool shows.

I also tried bringing all the elements of the src and dst arrays into the cache, but ippsCopy_32fc alone still
takes 7.8 microseconds, which is approximately 10 times slower.

I did the allocation through ippsMalloc_32fc, so I would assume memory alignment is also taken care of.

Can somebody tell me why I see this performance difference? Is there something I'm missing here?

Regards,
Sankar

Vladimir_Dudnik · ‎02-16-2009

Hi Sankar,

The IPP performance system tries to provide the best possible performance estimate by using cache warm-up loops and averaging the results while skipping the very first ones. It keeps looping until the difference between averages does not exceed 5%. Note that the actual number of loops used for the measurement is specified in the CSV file.

By the way, how did you link with IPP when you developed your own test? Please do not forget to call the ippStaticInit function if you link with the IPP static libraries.
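A minimal sketch of that kind of measurement follows. It is not the performance tool's actual code. It runs a few untimed warm-up calls and then averages over many timed calls, so the timer overhead and the first cold-cache call are not charged to a single copy. The names `average_microseconds`, `copy_32fc`, `time_copy_us` and `Cplx32` are hypothetical; `copy_32fc` is a `std::memcpy` stand-in so the sketch builds without IPP, and `Cplx32` mirrors the re/im layout of `Ipp32fc`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for Ipp32fc (a pair of 32-bit floats: re, im).
struct Cplx32 { float re; float im; };

// Hypothetical helper, not part of IPP: runs `op` a few times untimed,
// then returns the average time per call in microseconds over `timed_calls`.
template <typename Op>
double average_microseconds(Op op, int warmup_calls, int timed_calls) {
    assert(timed_calls > 0);
    for (int i = 0; i < warmup_calls; ++i) op();  // warm caches, not measured

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < timed_calls; ++i) op();
    const auto t1 = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::micro> total = t1 - t0;
    return total.count() / timed_calls;
}

// Assumption: replace the body with ippsCopy_32fc(src, dst, len) when IPP is
// available. Note that an optimizer may elide repeated identical memcpy calls,
// which would not happen with a call into the IPP library.
inline void copy_32fc(const Cplx32* src, Cplx32* dst, std::size_t len) {
    std::memcpy(dst, src, len * sizeof(Cplx32));
}

// Times a copy of `len` elements. Both buffers are written once up front,
// so page faults on first touch are not part of the measurement.
inline double time_copy_us(std::size_t len) {
    std::vector<Cplx32> src(len, Cplx32{1.0f, -1.0f});
    std::vector<Cplx32> dst(len, Cplx32{0.0f, 0.0f});
    return average_microseconds(
        [&] { copy_32fc(src.data(), dst.data(), len); }, 10, 10000);
}
```

Calling `time_copy_us(2048)` from a test program gives a per-call average for the same element count as the sample in the question. The warm-up and loop counts (10 and 10000) are arbitrary choices here, not values taken from the CSV file.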

Regards,
Vladimir

Shankar1 · ‎02-17-2009

Hi,
Thanks for your reply. I use dynamic linking in my samples. Do I have to call something like ippStaticInit in that case?

I tried running the copy in a loop and taking the best results, as you suggested. Now the copy takes 3 microseconds on average, but that is still slower than the 0.866 microseconds that the IPP performance system shows.

Can I find a sample that demonstrates how to achieve the 0.866 microseconds that the IPP performance system is able to achieve?

Further, when I run the performance test for copy, I see a few options such as MutualVectorShift in bytes (divisible by 64), which is set to 512, and VectorAlign by element, which is set to 0. What do these mean for the case I have tried, and how do they affect the performance results? Am I missing any of these in my sample?

Thanks
Sankar

Vladimir_Dudnik · ‎02-17-2009

You do not have to call ippStaticInit with dynamic linkage (although it is safe to call; it is simply a no-op in the DLL case).

Yes, the IPP performance system can place the input and output arrays at different alignment boundaries; the most effective one is the virtual page (4K) boundary.
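To try the 4K placement in a stand-alone test, one option is to over-allocate and pick a page-aligned address inside the buffer. The sketch below is only an illustration under that assumption: `AlignedBuffer`, `is_aligned` and `Cplx32` are hypothetical names rather than IPP API, and the sketch does not reproduce the tool's MutualVectorShift or VectorAlign options.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Stand-in for Ipp32fc (a pair of 32-bit floats: re, im).
struct Cplx32 { float re; float im; };

// True if `p` is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: owns an over-allocated byte buffer and exposes a
// pointer inside it that is aligned to `alignment` (a power of two).
class AlignedBuffer {
public:
    AlignedBuffer(std::size_t count, std::size_t alignment)
        : storage_(count * sizeof(Cplx32) + alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        void* p = storage_.data();
        std::size_t space = storage_.size();
        // std::align advances p to the next aligned address that still
        // leaves room for count elements, or returns nullptr.
        void* aligned = std::align(alignment, count * sizeof(Cplx32), p, space);
        assert(aligned != nullptr);
        data_ = static_cast<Cplx32*>(aligned);
    }
    // Copying would leave data_ pointing into the other object's storage.
    AlignedBuffer(const AlignedBuffer&) = delete;
    AlignedBuffer& operator=(const AlignedBuffer&) = delete;

    Cplx32* data() const { return data_; }

private:
    std::vector<unsigned char> storage_;
    Cplx32* data_ = nullptr;
};
```

With `AlignedBuffer src(2048, 4096);` and `AlignedBuffer dst(2048, 4096);`, the pointers from `src.data()` and `dst.data()` can be passed to the copy routine under test in place of the `ippsMalloc_32fc` buffers. Whether this closes the gap to the tool's figure on a given machine would need to be measured.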

Regards,
Vladimir

Ying_S_Intel · ‎02-17-2009

As you may know, the latest Intel IPP is version 6.0, which we launched last November. You can also check for the latest version at http://www.intel.com/software/products/ipp

Thanks,
Ying