Irregular result of inverse FFT!

genoray · ‎09-17-2010

Hi,

I am trying to implement an FFT filter operation using ippi.

The procedure I implemented is summarized below.

1. Allocate ippi variables and initialize the FFT context.

=> ippiFFTInitAlloc_R_32f(&specR, 11, 0, IPP_FFT_DIV_INV_BY_N, ippAlgHintAccurate);

2. Read a file (2000 x 1 (width x height)), pad it to 2048 x 1, and apply the forward FFT.

=> ippiFFTFwd_RToPack_32f_C1R

3. Multiply by the filter kernel using ippiMulPack_32f_C1IR.

4. Do the inverse FFT.

=> ippiFFTInv_PackToR_32f_C1R

5. Repeat the above (1000 x 250)/(number of CPU cores) times.

6. Free all variables and FFT context.
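As a rough illustration of step 2, the padding and the FFT order can be sketched in plain C without IPP. The order 11 passed to ippiFFTInitAlloc_R_32f in step 1 corresponds to 2^11 = 2048 samples, the next power of two at or above 2000. The helper names `fft_order_for` and `zero_pad` are hypothetical and are not part of IPP.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: smallest order k such that 2^k >= n.
   For n = 2000 this gives 11, since 2^11 = 2048. */
int fft_order_for(size_t n)
{
    int order = 0;
    while (((size_t)1 << order) < n) {
        ++order;
    }
    return order;
}

/* Hypothetical helper: copy src_len samples into dst and set the
   remaining (dst_len - src_len) samples to zero. */
void zero_pad(const float *src, size_t src_len, float *dst, size_t dst_len)
{
    size_t i;
    assert(src_len <= dst_len);
    memcpy(dst, src, src_len * sizeof *src);
    for (i = src_len; i < dst_len; ++i) {
        dst[i] = 0.0f;
    }
}
```

Writing the zeros explicitly on every iteration means the tail of the padded buffer never depends on what a previous iteration or the allocator left there.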

The above procedure runs on multiple threads, one per CPU core.

It works well, but sometimes (I cannot predict when it occurs) the result of the inverse FFT is bad, containing garbage values.

(Once the result of the inverse FFT is bad, all the repeated (1000 x 250 times) results are bad.)

My development system: Windows7 ultimate x86, Visual Studio 2005.

Can you advise me on how to fix the program?

Is it OK to use ippi on multiple threads?

Thanks,

Changyoon Lee.

PaulF_IntelCorp · ‎09-17-2010

Hello Changyoon,

Can you be more specific about what sequence you are repeating? Are you repeating steps 1-6 or 2-5 or ??? Are you creating multiple threads of your own around the repeated sequence, or are you relying on internal threading within IPP to generate multiple threads of execution within the IPP functions?

If you are using a single memory allocation for multiple threads of your own, I would expect to see some problems. If you are relying solely on threading within the IPP library, there should not be issues with multi-threading, since the sequence will essentially be serial and will use multiple cores to implement each IPP call.

BTW, what do you mean by (1000 x 250)/(number of cores)? Do you mean you are repeating this 250,000 times on each core or do you mean 250,000 repetitions are happening but divided over multiple cores?

Regards,

Paul

genoray · ‎09-17-2010

Hello Paul, first, thanks for your kind reply.
I repeated steps 2-5, and I created my own threads to run that sequence. (1000 x 250)/(number of cores) means that the sequence is repeated 250,000 times in total, divided over multiple cores. Furthermore, each thread runs independently.

Am I allocating a single memory block for multiple threads of my own?

For specifics, see below.

- Create as many threads as the number of CPU cores (N).

ThreadWork1()
{
Sequence();
}

ThreadWork2()
{
Sequence();
}

...

ThreadWorkN()
{
Sequence();
}

Sequence()
{
Allocate ippi variables and initialize FFT context
for((1000 x 250)/(number of cores) times)
{
Read file (2000 x 1 (width x height)) and pad to 2048 x 1.
Do forward FFT
Multiply filter kernel using ippiMulPack_32f_C1IR.
Do inverse FFT
}
Free all variables and FFT context.
}
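The division of the 250,000 iterations over N threads described above could be sketched as follows. This is a minimal sketch in plain C with no IPP calls; the type `IterRange` and the function `iter_range_for` are hypothetical names, and the remainder handling is an assumption, since the thread does not say what happens when 250,000 is not divisible by the number of cores.

```c
#include <assert.h>

/* Half-open range [begin, end) of iteration indices owned by one worker. */
typedef struct {
    long begin;
    long end;
} IterRange;

/* Hypothetical helper: split `total` iterations over `workers` threads.
   The first (total % workers) workers each take one extra iteration,
   so every index is covered exactly once. */
IterRange iter_range_for(long total, int workers, int worker_id)
{
    long base, extra;
    IterRange r;
    assert(workers > 0 && worker_id >= 0 && worker_id < workers);
    base = total / workers;
    extra = total % workers;
    r.begin = worker_id * base + (worker_id < extra ? worker_id : extra);
    r.end = r.begin + base + (worker_id < extra ? 1 : 0);
    return r;
}
```

Each ThreadWork function would then run Sequence() over its own range only, with its own buffers and its own FFT context, so that no thread touches another thread's memory.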

This may be hard to understand clearly because of my English skills.

Thanks,
Changyoon Lee.

PaulF_IntelCorp · ‎09-20-2010

Hello Changyoon,

No need to apologize regarding your English. It is quite good!

Are you linking with the multi-threaded version of the library or the single-threaded version? Given the job you are describing it would make most sense to link against the static single-threaded version of the library (there are two static libraries: one is single-threaded and one is multi-threaded). This will eliminate any conflicts between your threading and internal threading within the IPP library.

The basic idea that you present in your pseudo-code (above) seems valid. The only part I question is having the FFT init outside of your (1000x250) for loop. I would suggest a new FFT init each time through the loop. Note that there is a separate ippsFFTInit function that does not require that you also allocate memory, so you can implement the allocate once, before the loop, and then do the init inside the for loop (the first time might result in an extra init, but that's okay).

Regards,

Paul

genoray · ‎09-26-2010

I solved the problem. There were some mistakes in my code.

When I use the 'Ipp32fc' data type, I allocate the variable as below, and it is initialized to zero successfully.

Ipp32fc* p32fc;
int StepBytes_p32fc = 0;
p32fc = ippiMalloc_32fc_C1(2048, 1, &StepBytes_p32fc);

However, this works only the first time. After that, the memory is not initialized to zero. So I now set the variable to zero explicitly, and that solved my problem.

I saw a reference to this in the user guide, page 7-1.

Ipp<datatype>* ippiMalloc_<mod>(int widthPixels, int heightPixels, int* pStepBytes)
32-byte aligned memory allocation for images where every line of the image is padded with zeros. Memory can be freed only with the function ippiFree.

The above explanation seems to be wrong.
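The fix described above, clearing the buffer explicitly instead of relying on the allocator, can be sketched in plain C. This is only an illustration under stated assumptions: `Complex32` is a hypothetical stand-in for Ipp32fc, `alloc_zeroed_image` is a hypothetical helper, and `malloc` stands in for ippiMalloc_32fc_C1 (it does not reproduce IPP's alignment or row padding).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for Ipp32fc: one complex sample as two floats. */
typedef struct {
    float re;
    float im;
} Complex32;

/* Hypothetical helper: allocate a width x height complex image and zero
   every row explicitly. step_bytes receives the distance in bytes
   between consecutive rows. Returns NULL if the allocation fails. */
Complex32 *alloc_zeroed_image(int width, int height, int *step_bytes)
{
    size_t step;
    unsigned char *base;
    int x, y;
    assert(width > 0 && height > 0 && step_bytes != NULL);
    step = (size_t)width * sizeof(Complex32);
    base = malloc(step * (size_t)height); /* contents are indeterminate here */
    if (base == NULL) {
        return NULL;
    }
    for (y = 0; y < height; ++y) {
        Complex32 *row = (Complex32 *)(base + (size_t)y * step);
        for (x = 0; x < width; ++x) {
            row[x].re = 0.0f;
            row[x].im = 0.0f;
        }
    }
    *step_bytes = (int)step;
    return (Complex32 *)base;
}
```

A freshly allocated block often happens to contain zeros the first time because the operating system hands out cleared pages, while a block reused after a free usually keeps its old contents, which would be consistent with the behavior described here. With IPP itself, the buffer could be cleared after allocation with a zeroing call such as ippsZero_32fc, or with a plain loop like the one above.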

Furthermore, I have some questions about your reply.
1. You mentioned the linking method. Is ipp*merged_t.lib the multi-threaded version and ipp*merged.lib the single-threaded version? What is the difference between them? Is there a performance difference?
If I use dynamic linking, can I not choose between the single-threaded and multi-threaded versions?
I'm using dynamic linking now.

2. Is there a performance difference in the FFT functions between ippi and ipps? And did you suggest doing the FFT init inside the loop only because of my problem, or for another reason, for example performance or correct usage?

Thanks a lot,
Changyoon Lee.

PaulF_IntelCorp · ‎09-28-2010

Hello Changyoon,

Glad to hear you found the problem! Regarding your questions:

1) Yes, *_t.lib are the multi-threaded static libraries and the others are single-threaded static. There is no such option for the dynamic libraries; they are provided as multi-threaded only. (You could build your own custom dynamic libs from the single-threaded static libraries if you wanted a single-threaded dynamic library.)

There can be a performance difference between multi-threaded and single-threaded. Note that only about 15-20% of the functions are multi-threaded in the MT version of the library, and the effectiveness of MT on performance depends on your data set and other parameters. So it's impossible to make a blanket statement regarding MT and performance; it depends on the functions you use, how often you use them, the nature of your data sets, etc. You need to run and compare to see. It also depends heavily on whether or not you thread above the functions (within your own app). If you are managing the threading, you may get better results using the single-threaded version of the library, since you won't get competition between threading within the library and your threading. BTW, most of the FFT functions are threaded.

2) Again, performance differences between different functions depend heavily on your data set and application. Difficult to make a blanket statement. The ippi FFT functions are designed for use with image processing tasks, the ipps FFT functions are designed for use with general-purpose signal processing tasks. My suggestion for using the FFT init in your loop was only to see if that might be the source of your problem. As it turns out, you did have an init problem, just solved using a different method!

Regards,

Paul