Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Thomas_Jensen1
Beginner
185 Views

Missing ippiMulPack_32f for real unpacked input

I'm working on a DFT filter for images using a Real filter.

I forward transform the image using ippiDFTFwd_RToPack_32f_C1R(). The result is packed complex.

I then want to multiply my filter into the transform, but there is a problem: I don't know how to create a packed filter, and there is no variant of ippiMulPack_32f that takes an non-packed filter as source.

Of course, I could create a complex filter, then convert that to packed complex, but that extra step is costly.

Is there an ippiMulPack that takes unpacked real input?

Does anybody know how to write a packed complex image?

(w*h 32fc image array)

0 Kudos
22 Replies
renegr
New Contributor I
169 Views

the packed format is well described in the according pdf documentation. You can write your own function which transforms your filter into packed format.
Thomas_Jensen1
Beginner
169 Views

I would not say "well described" for Intells explanation/documentation of their RCPack2D format.

The PDF indicates shaded colum only for Even, but there is no shaded column...

Also, I was looking for a quick solution, on how to take an Ipp32f image (data is Real, Im=0), and shuffle that into a packed Ipp32f.

Normally, I'd look elsewhere, but since this format is from Intel, I'm hoping they have a quick solution, and of cource, all in the light of speed/high performance.

renegr
New Contributor I
169 Views

So you mean ippiCplxExtendToPack_ ?
Thomas_Jensen1
Beginner
169 Views

No, I mean, instead of Create Real Filter Cplx, followed by ippiCplxExtendToPack, followed by ippiMulPack_32f, I'm looking to see if there is a way to skip one step, doing Create Real Filter 32f, followed by some unknown function ippiMulRToPack_32f.

This way, the filter uses less memory, ippiCplxExtendToPack does not have to be called == more performance.

Alternatively, I there is a function to pack a Real image, or if I could write my own function (I can't, because documentation is too low), then I'd just use the packed functions directly, also with high performance.

renegr
New Contributor I
169 Views

We have our own function which converts real into IPP packed data - but sorry, that's copyrighted. Of course it is no function with 2 or 3 lines (our C optimized function needs about 90 lines of code). I don't see any code piece here which could be optimized by SSE, so I think therefore it isn't part of IPP
Thomas_Jensen1
Beginner
169 Views

Your 90 lines of code are what prevents me to start recoding that myself!

I still feel its a pity Intel does not supply this for us. Looking at UIC, Intel hands out tons of code to implement image compression. For Fourier Transforms, Intel does not really have a good sample. Of course there is some sample code, but that is more for the beginning, on how to prepare the buffers and call the functions.

I really miss a good DFT/FFT sample, one that does some optimized image filtering, for instance, a Butterworth sharpening filter. One that includes all border management (still a problem for many), and a useful real to packed function.

Anyway, you've told me about your copyrighted code, so what now? Can you give a hint with some pseudo code? :))

renegr
New Contributor I
169 Views

We've developed (a collegue) the transformation based on the information of RCPack2D format described in ippiman.pdf (March 2009) on page 718-719. There is a table on how the data is located in memory. I don't think that there is a better way to describe the format than with a table which is good to imagine.

The 90 lines of code include also

  • moves imaginary data into RCPack2D
  • handles even and also odd # of rows
  • optimized by removing jumps for odd/even column/row index within loops

Maybe you need none of these points.

Thomas_Jensen1
Beginner
169 Views

I only need the first point.

In the PDF, "Table 10-5, RCPack2D Storage for Even Number of Rows", I see that the first row is boxed, why is that?

If I ignore the box, I see an incomplete pattern:

First row (odd rows):
Re A(0,0) Re A(0,1) Im A(0,1) ... Re A(0,(N-1)/2) Im A(0,(N-1)/2) Re A(0,N/2)

How am I supposed to know what is in the "..." ?
If it is A(I,J), I can see that I is 0 for the row.
I can see that J goes 0,1,1, but what comes after that? I don't like guesswork in my programming!
Could it be (j+1)/2, where j=0..N/2 ?

Then the selection of Re/Im components. I seen Re, Re, Im, ..., how am I supposed to known what is in "..."?
Is the next Im or Re?
Is the full pattern Re, Re, Im, Im, Re, Re, Im, Im etc, or is it Re, Re, Im, Re, Re, Im, etc?

Any help highly appreciated!

renegr
New Contributor I
169 Views

You think completely in the wrong direction.

Let's look at first row: Re Re Im .... Re Im Re -> the "..." here means pairs of Re/Im for each row index, only the first and the last element are real parts.

Now look at the columns from row 2 on, the first and last column are pairs of Re/Im while the columns between are completely real or imaginary parts.

And now the last point for the even # of rows the last row is symmetric to the 1. one.

Thomas_Jensen1
Beginner
169 Views

I guess you mean "means pairs of Re/Im for each column index...", no?

Lets for arguments sake say that N=32, therefore there are 16 columns in the packed array (even width). Is the first row then this (according to you, not according to the PDF)? :

Re(0,0), Re(0,1), Im(0,1), Re(0,2), Im(0,2), Re(0,3), Im(0,3), Re(0,4), Im(0,4), Re(0,5), Im(0,5), Re(0,6), Im(0,6), Re(0,7), Im(0,7), Re(0,8)

The PDF does not say this.

renegr
New Contributor I
169 Views

>> "means pairs of Re/Im for each column index..." yes of course


>> Lets for arguments sake say that N=32, therefore there are 16 columns in the packed array (even width).

No, if N==32, then 30 columns are in the packed array, just Im(0,0) and Im(0,N/2) are removed

Thomas_Jensen1
Beginner
169 Views

Ahh, yes, source items are Re+Im.

So, for N=32, there are 32 source items (with Re+Im) in a row, and the packed array will have 32 float items (packed array has same number of floats as source complex array).

I guess I'll have to decide speed from between writing my filter directly to a packed array versus writing a simple float array, and then calling ippiCplxExtendToPack. Since I only do one single image when filtering, the computation always includes generating the filter.

(A)
- Generate Real filter
- ippiCplxExtendToPack
- ippiDFTFwd_RToPack
- ippiMulPack_32f
- ippiDFTInv_PackToR

(B)
- Generate packed filter
- ippiDFTFwd_RToPack
- ippiMulPack_32f
- ippiDFTInv_PackToR

(C) future, whished for, IPP function
- Generate Real filter
- ippiDFTFwd_RToPack
- ippiMulRealToPack_32f32fc
- ippiDFTInv_PackToR

Which one would be fastest?

renegr
New Contributor I
169 Views

>>Which one would be fastest?

Of course it will be (A) or (B)

  1. in most cases you will create the packed filter once and do ippiMulPack multiple times
  2. ippiMulPack is much faster than ippiMulRealToPack because source1 and source2 have the same format and 4 floating point multiplications can be done at once in SSE
Therefore (B) only has an advantage over (A) if you really often generate new packed filters
Thomas_Jensen1
Beginner
169 Views

I need to generate the filter for every image since each image has another (unknown) dimension.

I guess I should decide if

(A)
- Generate Real filter Cplx
- ippiCplxExtendToPack

is slower/faster than

(B)
- Generate packed filter

Since ippiDFTFwd_RToPack+ippiMulPack_32f+ippiDFTInv_PackToR is rather time consuming anyway, I guess I might as well use (A), and drop my ideas of generating a packed filter.

I do not immediately agree that 4 multiplications can be done at once in SSE because complex mulitiplication is not just using multiply, but also add.

In fact, since multiplying a real number to a complex number only involves multiplying the real number into each of Re, Im, that would be very fast if it was supported by IPP.

renegr
New Contributor I
169 Views

Of course complex multiplication can be fast (see entry "Complex multiply by conjugate" within this forum where I posted some SSE3 code doing this.

>>In fact, since multiplying a real number to a complex number only involves multiplying the real number into each of Re, Im, that would be very fast if it was supported by IPP.

No because your real data buffer has a completely different data aligning compared with the packed data. So you need a loop with jumps or an aligning correction which will be slower than just streaming data through the SSE unit.

The slowest step in your code is the DFT. What you need is a FFT for arbitrary sizes, which isn't supported in IPP. Maybe you should take a look at FFTW.

Chao_Y_Intel
Employee
169 Views

>What you need is a FFT for arbitrary sizes, which isn't supported in IPP. Maybe you should take a look at FFTW.

Maybe you can let us know some details about the limitation of DFT functions:

ippiFFTFwd_/ippiFFTInv_ functions needs the data lengths must be integer powers of 2.

But for,

ippiDFTFwd_/ippiDFTInv_ functions, the N and M can take on arbitrary integer non-negative values.

Thanks,

Chao

renegr
New Contributor I
169 Views

The limitation of the DFT is its complexity O(N^2), while FFT has a complexity of O(N log N). For more details about arbitrary length FFT's, you can take a look at wikipedia (prime-factor fft algorithm)

Thomas_Jensen1
Beginner
169 Views

Chaos,

Since I know that IPP convolution filters will internally use Fourier transforms if it determines it would be faster, I'd like to know if IPP DFT internally uses FFT for the same reasons?

Those internal IPP decisions are also a bit less documented. Is there a specific location for more docs?

renegr
New Contributor I
169 Views

The name "Convolution" does not reason to the internal algorithm which is used. The name "DFT" of course. I wouldn't name it DFT if it is a FFT in some cases :)

But will be easy to find out, just take an image with a power of 2 for width/height and test the DFT/FFT

Thomas_Jensen1
Beginner
72 Views

With convolution I mean ippiFilter32f. If I'm informed well, it uses DFT/FFT internally, if its size is larger than X.

I did take a look at FFTW, but it has a price tag for non-GPL applications.

Reply