Performance loss between IPP3.0 and IPP6.1

Hi,

We had to switch to 64-bit Windows XP because our software needs more memory than a 32-bit Windows XP OS can provide (huge images for computer vision). On 32-bit Windows we were using IPP 3.0. Because IPP 3.0 is not available for 64-bit, we had to upgrade and now use IPP 6.1. We were also hoping for a small performance boost as a side effect of the new optimizations in the new version (e.g. multithreading, since our system is dual core), but on the contrary: we encountered cases in which the old IPP 3.0 was faster than the new version, and the effect got even worse when we used ippSetNumThreads(1) to limit computation to a single thread, as IPP 3.0 does. In that case IPP 6.1 was more than two times slower than the old IPP 3.0.
Example:
ippiFilterMedian_16s_C1R(...) with the following parameters
srcStep=4090
dstStep=4050
dstRoiSize={ width=2025 height=6337 }
maskSize={ width=21 height=1 }
anchor={ x=0 y=0 }
Size of source image={ width=2045 height=6337 }
Image Content=Random noise created by a randomizer

Our test computer:
Intel Core 2 Duo CPU (E6750 @ 2.66 GHz) with 8 GB of RAM on 64-bit Windows XP

Does anyone have an idea why the newer version is slower, and is there a solution for our problem?

Thank you,

Rohwedder AG

Did you verify that IPP is initialized correctly and is actually using optimized code for your processor?

Emmanuel

Yes, those are odd results. Could you please check which optimized code was dispatched by IPP 6.1? The difference in performance is similar to the usual difference between generic C code (PX libraries, or MX libraries on the EM64T architecture) and SSE-optimized libraries (W7, T7, V8, P8 for 32-bit and M7, N8, U8 for the EM64T architecture).

Could this be a case where you link with the static libraries but do not call the ippStaticInit function?

Regards,
Vladimir

We link against the files from the stublib directory, which should be the dynamic libraries. We haven't used the ippStaticInit function so far, but I assume it should only be used with static libraries anyway, right? Of course I would like to find out which DLLs are being used and whether they are the optimized versions, but I don't know how. Any suggestions?

Thanks

Run your application in debug mode from the IDE you use (VS?) and look at which DLLs are actually being loaded; all common IDEs show that information.

The following DLLs are loaded by the 64-bit version of our application using IPP 6.1:
ntdll.dll, mscoree.dll, KERNEL32.dll,
advapi32.dll, RPCRT4.dll, Secur32.dll,
MSVCR80D.dll, msvcrt.dll,
ippiem64t-6.1.dll, ippcoreem64t-6.1.dll,
libiomp5md.dll, USER32.dll, GDI32.dll,
msvcm80d.dll, ole32.dll, ippiu8-6.1.dll,
SHLWAPI.dll, mscorwks.dll, MSVCR80.dll,
shell32.dll, comctl32.dll, mscorlib.ni.dll,
mscorjit.dll, diasymreader.dll, rsaenh.dll,
PSAPI.DLL, System.ni.dll

This means that the u8 processor code (new optimizations for 64-bit applications on Intel Core 2 and Intel Xeon 5100 processors) seems to be used, which should be the right one, shouldn't it?
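The check against the DLL list above can also be written down as a small helper. This is only a sketch: `ClassifyIppDll` and `DispatchKind` are hypothetical names, and the assumption that the two characters before the first `-` in the file name carry the CPU code is inferred from names such as `ippiu8-6.1.dll` in this thread, not from IPP documentation. The grouping of codes (PX/MX generic; W7, T7, V8, P8, M7, N8, U8 optimized) is the one Vladimir gave earlier.

```cpp
#include <cctype>
#include <string>

// Hypothetical classification of a loaded IPP DLL by its CPU code.
enum DispatchKind { kGeneric, kOptimized, kOther };

// Assumes names of the form "<domain><code>-<version>.dll", e.g. "ippiu8-6.1.dll".
// Dispatcher DLLs such as "ippiem64t-6.1.dll" carry no CPU code and map to kOther.
DispatchKind ClassifyIppDll(const std::string& dllName)
{
    const std::string::size_type dash = dllName.find('-');
    if (dash == std::string::npos || dash < 2)
        return kOther;

    // Take the two characters in front of the dash and lower-case them.
    std::string code = dllName.substr(dash - 2, 2);
    for (std::string::size_type i = 0; i < code.size(); ++i)
        code[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(code[i])));

    // Generic C code: PX (32-bit) or MX (EM64T).
    if (code == "px" || code == "mx")
        return kGeneric;

    // SSE-optimized codes named earlier in this thread.
    const char* optimized[] = { "w7", "t7", "v8", "p8", "m7", "n8", "u8" };
    for (int i = 0; i < 7; ++i)
        if (code == optimized[i])
            return kOptimized;

    return kOther;
}
```

With the list above, `ClassifyIppDll("ippiu8-6.1.dll")` gives `kOptimized`, which matches the conclusion that generic code is not the cause here.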

So again: Why is our 64Bit IPP6.1 (running on 2 cores) slower than the old 32Bit IPP3.0 (running on 1 core)???

Would it help if I provide you the source code? If yes, how do you want to have it?

Thanks

Hello,

We would certainly like to get a test case for this issue. From what you described above, everything seems to be done the right way, so the next step is for us to look into the problem.

Regards,
Vladimir

The observed timings are:

IPP 3.0:
32Bit 1 Thread: 606 ms
32Bit 2 Threads: not supported
64Bit 1 Thread: not supported
64Bit 2 Threads: not supported

IPP 6.1:
32Bit 1 Thread: 1182 ms
32Bit 2 Threads: 596 ms
64Bit 1 Thread: 1297 ms
64Bit 2 Threads: 655 ms

As can be seen, 64-bit IPP 6.1 is slower than 32-bit IPP 3.0. Only 32-bit IPP 6.1 with 2 threads is slightly faster than 32-bit IPP 3.0, but even then I would expect a more significant performance boost from using 2 threads.

These times were produced by the following code (C++/CLI; MS VS2005 SP1; Release) and the CPU and computer mentioned earlier in this discussion:

[cpp]#include "memory.h"
#include "ippi.h"
#include "ipps.h"

#ifdef IPP61
#include "ippcore.h"
#endif

#ifdef IPP61
#define THREADTESTS 2
#else
#define THREADTESTS 1
#endif

using namespace System;
using namespace System::Diagnostics;

int main(array<System::String ^> ^args)
{
    int iSrcWidth	 = 2045;
    int iSrcHeight	 = 6337;
    int iFilterHalfSizeX = 10;
    int iFilterHalfSizeY = 0;
    int iDstWidth	 = iSrcWidth - iFilterHalfSizeX * 2;
    int iDstHeight	 = iSrcHeight - iFilterHalfSizeY * 2;
    int iFilterSizeX	 = iFilterHalfSizeX * 2 + 1;
    int iFilterSizeY	 = iFilterHalfSizeY * 2 + 1;

    short *pSrcData = 0;
    short *pDstData = 0;
    try
    {
        pSrcData = new short[iSrcWidth * iSrcHeight];
        memset(	pSrcData,
                0,
                iSrcWidth *
                iSrcHeight *
                sizeof(unsigned short));

        Random random(0);
        for(int iY = 0; iY < iSrcHeight; iY++)
        {
            for(int iX = 0; iX < iSrcWidth; iX++)
            {
                pSrcData[iX + iY * iSrcWidth] = (short)(random.Next());
            }
        }


        pDstData = new short[iDstWidth * iDstHeight];

        for(int i = 1; i <= THREADTESTS; i++)
        {
#ifdef IPP61
            ippSetNumThreads(i);
#endif
            Console::WriteLine("Threads: " + i.ToString());

            for(int j = 0; j < 10; j++)
            {
                memset(	pDstData,
                        0,
                        iDstWidth *
                        iDstHeight *
                        sizeof(short));

                
                IppiSize ippiMask = { iFilterSizeX,
                                      iFilterSizeY};

                IppiPoint ippiPoint = { 0,
                                        0};

                IppiSize ippiSize = {   iDstWidth,
                                        iDstHeight};

                Stopwatch ^tStopwatch = gcnew Stopwatch();
                tStopwatch->Start();
                
                ippiFilterMedian_16s_C1R( (Ipp16s*)pSrcData,
                                          iSrcWidth * sizeof(short),
                                          (Ipp16s*)pDstData,
                                          iDstWidth * sizeof(short),
                                          ippiSize,
                                          ippiMask,
                                          ippiPoint);

                tStopwatch->Stop();

                Console::WriteLine(  tStopwatch->ElapsedMilliseconds +
                                    " ms");
            }
        }
    }
    finally
    {
        if(0 != pSrcData)
        {
            delete [] pSrcData;
            pSrcData = 0;
        }

        if(0 != pDstData)
        {
            delete [] pDstData;
            pDstData = 0;
        }
    }

    Console::WriteLine("Press any key to continue . . .");
    Console::ReadKey();

    return 0;
}
[/cpp]



Thanks for your help.
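For anyone who wants to reproduce these numbers outside C++/CLI, the measurement loop can be sketched in standard C++. This is not the harness used above: `MinElapsedMs` is a hypothetical helper, it assumes a C++11 compiler (so it will not build with VS2005), and reporting the minimum over several runs after one warm-up call is a choice made here, whereas the code above prints every run.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Calls `work` once as a warm-up, then `runs` more times, and returns the
// shortest elapsed wall-clock time of the timed calls in milliseconds.
// Returns the largest double if `runs` is zero or negative.
template <typename Work>
double MinElapsedMs(Work work, int runs)
{
    work();  // warm-up call, not timed (page faults, DLL loading, caches)

    double best = std::numeric_limits<double>::max();
    for (int i = 0; i < runs; ++i)
    {
        const std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
        work();
        const std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();

        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        best = std::min(best, ms);
    }
    return best;
}
```

The `work` argument would wrap the single `ippiFilterMedian_16s_C1R` call, so that buffer allocation and the `memset` of the destination stay outside the timed region, as they do in the code above.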

Hi,

I didn't try to run your sample, so I might be off, but you should try to allocate memory using the ippiMalloc functions, or ensure that the memory is aligned on 16-byte boundaries. This sometimes has a huge impact on performance.

Emmanuel

I tried your hint but it didn't change anything. All timings are equal to the ones I mentioned earlier in this discussion. The updated code is the following:

[cpp]#include "memory.h"
#include "ippi.h"
#include "ipps.h"

#ifdef IPP61
#include "ippcore.h"
#endif

#ifdef IPP61
#define THREADTESTS 2
#else
#define THREADTESTS 1
#endif

using namespace System;
using namespace System::Diagnostics;

int main(array<System::String ^> ^args)
{
	unsigned char *pSrcData = 0;
	unsigned char *pDstData = 0;
	try
	{
		int iSrcWidth			= 2045;
		int iSrcHeight			= 6337;
		int	iSrcPitch			= 0;
		int iFilterHalfSizeX	= 10;
		int iFilterHalfSizeY	= 0;
		int iDstWidth			= iSrcWidth - iFilterHalfSizeX * 2;
		int iDstHeight			= iSrcHeight - iFilterHalfSizeY * 2;
		int	iDstPitch			= 0;
		int iFilterSizeX		= iFilterHalfSizeX * 2 + 1;
		int iFilterSizeY		= iFilterHalfSizeY * 2 + 1;

		pSrcData = (unsigned char*)ippiMalloc_16s_C1(iSrcWidth, iSrcHeight, &iSrcPitch);

		memset(	pSrcData,
				0,
				iSrcPitch *
				iSrcHeight);

		Random random(0);
		for(int iY = 0; iY < iSrcHeight; iY++)
		{
			for(int iX = 0; iX < iSrcWidth; iX++)
			{
				short *pData = (short*)(pSrcData + iX * sizeof(short) + iY * iSrcPitch);
				*pData = (short)(random.Next());
			}
		}

		pDstData = (unsigned char*)ippiMalloc_16s_C1(iDstWidth, iDstHeight, &iDstPitch);

		for(int i = 1; i <= THREADTESTS; i++)
		{
#ifdef IPP61
			ippSetNumThreads(i);
#endif
			Console::WriteLine("Threads: " + i.ToString());

			for(int j = 0; j < 10; j++)
			{
				memset(	pDstData,
						0,
						iDstPitch *
						iDstHeight);
				
				IppiSize ippiMask = {	iFilterSizeX,
										iFilterSizeY};
					
				IppiPoint ippiPoint = {	0,
										0};

				IppiSize ippiSize = {	iDstWidth,
										iDstHeight};

				Stopwatch ^tStopwatch = gcnew Stopwatch();
				tStopwatch->Start();

				ippiFilterMedian_16s_C1R(	(Ipp16s*)pSrcData,
											iSrcPitch,
											(Ipp16s*)pDstData,
											iDstPitch,
											ippiSize,
											ippiMask,
											ippiPoint);

				tStopwatch->Stop();

				Console::WriteLine(	tStopwatch->ElapsedMilliseconds +
									" ms");
			}
		}
	}
	finally
	{
		if(0 != pSrcData)
		{
			ippiFree(pSrcData);
			pSrcData = 0;
		}

		if(0 != pDstData)
		{
			ippiFree(pDstData);
			pDstData = 0;
		}
	}

	Console::WriteLine("Press any key to continue . . .");
	Console::ReadKey();

    return 0;
}[/cpp]

Thank you anyway
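The pitch-based addressing in the updated code can be illustrated with plain arithmetic. A minimal sketch, assuming the row pitch is the row size rounded up to a multiple of some alignment: the alignment value is a parameter here because the thread does not state what `ippiMalloc_16s_C1` actually uses, and `AlignedPitchBytes` and `PixelByteOffset` are hypothetical helper names.

```cpp
// Row size in bytes, rounded up to the next multiple of `alignment`.
// `alignment` is an assumption; the real pitch is whatever ippiMalloc returns.
int AlignedPitchBytes(int widthPixels, int bytesPerPixel, int alignment)
{
    const int rowBytes = widthPixels * bytesPerPixel;
    return (rowBytes + alignment - 1) / alignment * alignment;
}

// Byte offset of pixel (x, y) from the start of the buffer, mirroring the
// expression pSrcData + iX * sizeof(short) + iY * iSrcPitch used above.
int PixelByteOffset(int x, int y, int pitchBytes, int bytesPerPixel)
{
    return y * pitchBytes + x * bytesPerPixel;
}
```

For the 2045-pixel-wide 16-bit source, the unpadded step is 4090 bytes (the srcStep in the original post), while a 16-byte or 32-byte alignment would both give 4096. Either way the step passed to the filter must be the value actually returned by the allocator, as the updated code does with `iSrcPitch` and `iDstPitch`.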

Hi,

This code does not appear to account for the image border. For filter functions, IPP assumes that the adjacent border pixels also exist. See here for more information:

http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-processin...


As a result, the test will read some uninitialized memory, which can cause errors. See whether it works after fixing this problem.

Thanks,
Chao

I don't see how my code fails to consider the image borders. The destination image is smaller than the source image, which should be sufficient.

The relevant code snippet:

[cpp]int iSrcWidth    = 2045;  
int iSrcHeight   = 6337;  
int iFilterHalfSizeX = 10;  
int iFilterHalfSizeY = 0;  
int iDstWidth    = iSrcWidth - iFilterHalfSizeX * 2;  
int iDstHeight   = iSrcHeight - iFilterHalfSizeY * 2;  
int iFilterSizeX     = iFilterHalfSizeX * 2 + 1;  
int iFilterSizeY     = iFilterHalfSizeY * 2 + 1;  [/cpp]

If you still think that it is wrong, please let me know how to do it correctly. Thanks
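The border arithmetic behind this can be checked with a short sketch. It assumes that the anchor names the mask cell aligned with the pixel being computed, so that for destination column x the filter reads source columns x - anchorX through x + maskWidth - 1 - anchorX. That is my reading of the anchor parameter, not something confirmed in this thread, and `SourceColumnsRead` and `FitsInSourceRow` are hypothetical helpers.

```cpp
// Range of source columns (relative to the pSrc pointer passed to the filter)
// that are read while producing destination columns 0 .. dstWidth - 1.
struct ColumnRange
{
    int first;
    int last;
};

// Assumes the mask cell `anchorX` is aligned with the pixel being computed.
ColumnRange SourceColumnsRead(int dstWidth, int maskWidth, int anchorX)
{
    ColumnRange range;
    range.first = -anchorX;
    range.last = (dstWidth - 1) + (maskWidth - 1 - anchorX);
    return range;
}

// True if every column read lies inside a source row of `srcWidth` pixels
// that starts exactly at pSrc.
bool FitsInSourceRow(int srcWidth, int dstWidth, int maskWidth, int anchorX)
{
    const ColumnRange range = SourceColumnsRead(dstWidth, maskWidth, anchorX);
    return range.first >= 0 && range.last <= srcWidth - 1;
}
```

Under that assumption, the values from the snippet (source width 2045, destination width 2025, mask width 21, anchor 0) read columns 0 through 2044, which is exactly the source row. A centered anchor of 10 with the same pointer would start reading 10 pixels before the row.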

Hi rohwedder,

I overlooked that part of the code. It looks fine. We will run some tests on the performance.

Thanks,
Chao

Are you doing the performance tests you mentioned or what is happening? I still have no idea where this performance problem is coming from or rather how to solve it.

Thanks

Hello,

We can reproduce this problem here. The engineer who owns this function checked its performance. The algorithm of the ippiFilterMedian_16s_C1R function was changed between v3.0 and v6.1 to support OpenMP threading. For small and low masks, the algorithm in IPP 3.0 is a bit better. For other masks, IPP 6.1 is better.

In this test case, it looks like you are using a 1D mask, which this function is not optimized for. For a 1D mask, you can try the corresponding function in the IPPS domain.

Thanks,
Chao


I'd like to point out that your IPP 3.0 was running on a 32-bit OS while your IPP 6.1 is on a 64-bit OS. In my tests, IPP 6.1 is much slower on a 64-bit OS than on a 32-bit OS.

Is it for ippiFilterMedian_16s_C1R function or others?

thanks,
Chao

We used the old IPP3.0 (32Bit edition) and the new IPP6.1 (32Bit and 64Bit editions) on a 64Bit Windows XP operating system and there were no big performance differences between IPP6.1 32Bit and IPP6.1 64Bit (see timings earlier in the discussion). Anyway, it is not an option for us to use 32Bit because of the amount of memory we need.
We only tried ippiFilterMedian_16s_C1R because this is the function we need right now. Actually, we need ippiFilterMedian_16u_C1R, but this function doesn't exist in IPP 3.0, so in the old implementation we replaced it with ippiFilterMedian_16s_C1R plus a few 16u<->16s conversion functions. With IPP 6.1 we will of course use ippiFilterMedian_16u_C1R, but our tests showed that it is not faster than ippiFilterMedian_16s_C1R. As you can imagine, posting our code with ippiFilterMedian_16u_C1R in this forum would have been a bad example because the results would not have been comparable.

Thanks

I tried calling the ippsFilterMedian_16s for each line of my image but unfortunately it seems to produce wrong results. Have I found another bug? Anyway, my images will be 16u in the end and there is no ippsFilterMedian_16u. By the way, generally we also use ippiFilterMin_8/16u_C1R, ippiFilterMax_8/16u_C1R, ippiFilterBox_8/16u_C1R but for our current project we only need ippiFilterMedian_8/16u_C1R.
Can this performance issue be fixed?
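One way to tell whether the per-line results are really wrong is to compare them against a plain reference implementation. The following is a minimal sketch, not an IPP function: `MedianRowReference` is a hypothetical helper that uses the same convention as the image call above (anchor 0, output shorter than the input by maskLen - 1). If ippsFilterMedian_16s uses a different anchor or border convention, which this thread does not confirm either way, differences near the row ends could come from that rather than from a bug.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference 1D median: output element x is the median of src[x .. x + maskLen - 1].
// Returns an empty vector if the row is shorter than the mask.
// Straightforward but slow; intended only for checking results on small inputs.
std::vector<short> MedianRowReference(const std::vector<short>& src, int maskLen)
{
    assert(maskLen > 0 && maskLen % 2 == 1);  // odd mask length expected

    std::vector<short> dst;
    if (src.size() < static_cast<std::size_t>(maskLen))
        return dst;

    const std::size_t dstLen = src.size() - maskLen + 1;
    dst.reserve(dstLen);

    std::vector<short> window(maskLen);
    for (std::size_t x = 0; x < dstLen; ++x)
    {
        // Copy the current window and partially sort it around its middle element.
        std::copy(src.begin() + x, src.begin() + x + maskLen, window.begin());
        std::nth_element(window.begin(), window.begin() + maskLen / 2, window.end());
        dst.push_back(window[maskLen / 2]);
    }
    return dst;
}
```

Running this on one image row with maskLen 21 and comparing element by element with the IPP output would show whether the mismatch is confined to the borders or affects the whole row.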

Hello,

It looks like there are two issues here:

1> The ippsFilterMedian_16s error: could you attach your code here, so we can check whether there is any problem with this function?

2> Fixing the performance issue with ippiFilterMin_8/16u_C1R:
This looks a little different from the performance problem in ippiFilterMedian_16s. The original issue compared the performance of IPP 3.0 and IPP 6.1 for ippiFilterMedian_16s.
Here you are comparing ippiFilterMin_8/16u_C1R performance with ippiFilterMedian_16s in IPP 6.1, and you find that the 16u functions are slower than the 16s ones.

Do I understand it correctly?

Thanks,
Chao