topic Complex multiply by conjugate in Intel® Integrated Performance Primitives

Complex multiply by conjugate

Eric2 — Sat, 06 Feb 2010 18:55:19 GMT

It would save memory bandwidth if there was a function that would multiply two complex vectorswith the result being as if one of thesource vectors had been conjugated. This is currently a two step process of conjugating one vector and then performing the complex vector multiply.

Please consider this for a future release.

Complex multiply by conjugate

renegr — Mon, 08 Feb 2010 16:03:42 GMT

couldn't you use ippsMulPackConj_%

Complex multiply by conjugate

Eric2 — Mon, 08 Feb 2010 21:10:34 GMT

If I understand MulPackConj correctly, I would have to convert my complex data to packed format, multiply, and then convert it back. My data is stored as Ipp32fc.

My main objective with this request is to reduce the memory footprint and/or memory accesses required to perform the multiplication. The conj() and mul() sequence that I currently use requires 3 reads and 2 writes of complex data to get the job done when it is possible (when combined) to use 2 reads and 1 write.

I would hazard a guess that conjugate multiply is already used internally by some of the other algorithms provided. It is a relatively common operation in singal processing involving complex data.

Complex multiply by conjugate

Thomas_Jensen1 — Mon, 08 Feb 2010 21:29:38 GMT

I have been there also, trying to get IPP DFT effective.

My problem was that I couldn't find any good FFT/DFT samples from IPP.

I'd like to have a good sample that loads a grayscale image, and then perform some frequency domain filtering, and then saving the result. The frequency domain filter could be a butterworth filter for example.

If this was available, we'd be able to continue with other filters. Google only indicates general frequency domain theory. Here, we need specific Intel IPP information, to get top performance.

Maybe Intel could expand Picnic to include a simple FFT/DFT workbench function.

Complex multiply by conjugate

renegr — Tue, 09 Feb 2010 11:28:39 GMT

Ok, it would be a nice to have,

but if you're sure your pointers are 16 byte aligned and your processor supports sse3, it's really easy to implement in SSE intrinsics (_mm_mul_ps/_mm_hadd_ps for single precision)

Complex multiply by conjugate

Eric2 — Tue, 09 Feb 2010 14:34:43 GMT

Thanks for the suggestions.

I haven't written any sse intrinsic code in Visual Studio. I did some monkey see monkey do code changes in Unix/gcc in a previous life that used them.

What would that look like in Visual Studio C/C++? Or do you have a good reference link?

Complex multiply by conjugate

renegr — Tue, 09 Feb 2010 16:46:57 GMT

it will look like

[bash]void MulConjCompC( float* pSrc1, float* pSrc2, float* pDst, int count)
{
  for( int i=0; i

I did not compare the outputs of SSE implementation and C, maybe the imaginary part of SSE has the wrong sign.

Complex multiply by conjugate

Eric2 — Wed, 10 Feb 2010 00:51:59 GMT

Thanks a bunch. I will try this out and see what kind of change it has on the timing of my algorithm.

I would still like Intel to supply the functionality so it will work on all platforms and be supported/optimized by them on future processors.

Complex multiply by conjugate

renegr — Wed, 10 Feb 2010 11:24:37 GMT

sorry, yesterday I just had 10min to "hack" some buggy code

here's the correction

[bash]//the C version
void MulConjCompC( float* pSrc1, float* pSrc2, float* pDst, int count)
{
  for( int i=0; i  {
    pDst[0] = pSrc1[0]*pSrc2[0] - pSrc1[1]*pSrc2[1];
    pDst[1] = pSrc1[1]*pSrc2[0] - pSrc1[0]*pSrc2[1];
  }
}

//2 complex values with 1 loop
void MulConjCompSSE( float* pSrc1, float* pSrc2, float* pDst, int count)
{
  assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) & 0xF));  //check for 16byte align
  __m128* src1 = (__m128*)pSrc1;
  __m128* src2 = (__m128*)pSrc2;
  __m128* dst = (__m128*)pDst;
  for( ; count>0; count-=4, src1++, src2++, dst++)
  {
    __m128 d1 = _mm_mul_ps( src1[0], src2[0]);
    __m128 ds = _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1));
    __m128 d2 = _mm_mul_ps( ds, src2[0]);
    ds = _mm_hsub_ps( d1, d2);  //horizontally add 2 values
    *dst = _mm_shuffle_ps( ds, ds, _MM_SHUFFLE(3, 1, 2, 0));
  }
  MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count);  //rest
}

//4 complex values with 1 loop
void MulConjCompSSE3_2( float* pSrc1, float* pSrc2, float* pDst, int count)
{
  assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) & 0xF)); //check for 16byte align
  __m128* src1 = (__m128*)pSrc1;
  __m128* src2 = (__m128*)pSrc2;
  __m128* dst = (__m128*)pDst;
  for( ; count>0; count-=8, src1+=2, src2+=2, dst+=2)
  {
   __m128 d1 = _mm_mul_ps( src1[0], src2[0]);
   __m128 d2 = _mm_mul_ps( _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1)), src2[0]);
   __m128 e1 = _mm_mul_ps( src1[1], src2[1]); 
   __m128 e2 = _mm_mul_ps( _mm_shuffle_ps( src1[1], src1[1], _MM_SHUFFLE(2, 3, 0, 1)), src2[1]);
   __m128 f1 = _mm_hsub_ps( d1, e1);
   __m128 f2 = _mm_hsub_ps( d2, e2);
   dst[0] = _mm_unpacklo_ps( f1, f2);
   dst[1] = _mm_unpackhi_ps( f1, f2);
  }
  MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count); //rest
}


[/bash]

Complex multiply by conjugate

Eric2 — Sun, 14 Feb 2010 17:29:49 GMT

renegr - I didn't expect someone to take the time to actually code it. Many thanks. Your help is very much appreciated.

Any one else that is interested,

Since what I actually wanted was dst = src1 * conj(src2) the real calculation inMulConjCompC() should be:

pDst[0] = pSrc1[0]*pSrc2[0]+ pSrc1[1]*pSrc2[1]; // notice the + instead of -

and therefor I think the first mm_hsub_ps() in MulConjCompSSE3_2() should be:

__m128 f1 = _mm_hadd_ps( d1, e1); // noitice the hadd instead of hsub

Complex multiply by conjugate

renegr — Tue, 16 Feb 2010 10:03:29 GMT

Of course you're right (it was correct in the c function)

I'm currently gaining my SSE experiences so it was a nice practice. Would be nice if you could provide some measures on how your performance did increase by this functions.

Complex multiply by conjugate

rohitspandey — Mon, 21 Nov 2011 10:28:52 GMT

Hi,

I have a similar need to have a Complex multiply by conjugate for arrayfunction. Currently the function is a C implementation in my application. The function takes a quite a good amount of time in my application.
I have to optimize the implementation using IPP. I triedtwo versions (given below )but both are slower then the C version.

//IPP implementation

IppStatus ippStats;

Ipp32fc *pSrcA, *pSrcB, *pDst; // Typecasting input and outpointers in IPP Complex format

/*//Implement-1

// p + qj = (aR+aI*j)*(bR-bI*j)=(aR*bR+aI*bI) + (aI*bR-aR*bI)j

ippsMul_32f(pAreal,pBreal,pCreal,Np);

ippsAddProduct_32f(pAimag,pBimag,pCreal,Np);

//Calculating q = (aI*bR-aR*bI)

ippsMul_32f(pAimag,pBreal,pCimag,Np);

//Make -bI vector from bI vector

ippsSubCRev_32f_I(0,(Ipp32f *)pBimag,Np);

ippsAddProduct_32f(pAreal,pBimag,pCreal,Np);

//Implement-2

//Allocate memory for pSrcA, pSrcB and pSrcC buffer

pSrcA = ippsMalloc_32fc(3*Np);

pSrcB = &pSrcA[Np];//ippsMalloc_32fc(Np);

pDst = &pSrcA[2*Np];//ippsMalloc_32fc(Np);

if((pSrcA==0)||(pDst==0)||(pSrcB==0))

{

IPP_ERR_STATUS_var = ippStsMemAllocErr;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//First Convert 2 separte IQ buffers to single complex number buffer

//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);

ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);

ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);

//Using IPP API:-

//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);

ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, Np);

if(ippStats!=ippStsNoErr)

{

IPP_ERR_STATUS_var = ippStats;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//Now Convert output complex buffer to 2 separte IQ buffers

//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);

ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

Then I implemented the SSE version of the function as

void MulConjCompSSE( float* pSrcRe1,float* pSrcIm1, float* pSrcRe2,float* pSrcIm2, float* pDstRe,float* pDstIm, int count)

{

/*16-byte alignment check*/

{

__m128 m1, m2, m3, m4;

//__m128 sr1,sr2,si1,si2,dr1,dr2;

// __declspec(align(16)) float *pSrc1re2;

__m128 *srcR1 = (__m128*)pSrcRe1; //a1re

__m128 *srcI1 = (__m128*)pSrcIm1; //a1Im

__m128 *srcR2 = (__m128*)pSrcRe2; //b1re

__m128 *srcI2 = (__m128*)pSrcIm2; //b1Im

__m128 *destR = (__m128*)pDstRe; //ResRe

__m128 *destI = (__m128*)pDstIm; //ResIm

for(int i = 0 ;i< count; count-=4, srcR1+=1,srcI1+=1, srcR2+=1,srcI2+=1, destR+=1,destI+=1)

{

m1 = _mm_mul_ps( *srcR1, *srcR2); //(a1Re*b1Re)

m2 = _mm_mul_ps(*srcI1,*srcI2); //(a1Im*b1Im)

m3 = _mm_mul_ps(*srcR1,*srcI2); //(a1Re*b1Im)

m4 = _mm_mul_ps(*srcI1,*srcR2); //(a1Im*b1Re)

*destR = _mm_add_ps(m1,m2); // (a1Re*b1Re)+(a1Im*b1Im)

*destI = _mm_sub_ps(m4,m3); //(a1Im*b1Re)-(a1Re*b1Im)

}

mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real

const Float_t *pAimag, // source of array 'A' imag

const Float_t *pBreal, // source of array 'B' real

const Float_t *pBimag, // source of array 'B' imag

Uword32 Np, // number of points

Float_t *pCreal, // Dest for real part

Float_t *pCimag)

C version

int s;

float ar, ai, br, bi;

for (s = 0; s < count; s++) // for each array element

{

ar = *pSrcRe1++; // get the real and imaginary parts

ai = *pSrcIm1++;

br = *pSrcRe2++;

bi = *pSrcIm2++;

*pDstRe++ = ar * br + ai * bi; // do the complex multiply

*pDstIm++ = ai * br - ar * bi; // B conjugate of B

}

This function is very fast compare to C as long all the input vectors are 16-byte aligned. I have limitation not to change the current framework of application. And I am not sure how to force a vector to be 16-byte aligned. Kindly suggest what are the options I can try
1. algin vectors to 16-byte.
2. or using IPP to have faster routine.
Hoping to get a speedy reply from experts.

Note: I have just started using IPP/SSE. Its hardly a week's experience.

Regards
Rohit

Complex multiply by conjugate

rohitspandey — Mon, 21 Nov 2011 11:28:33 GMT

//////// Posted Twice/////////

Hi,

I have similar requirementto have Complex multiply by conjugate function for array . Currently the function is a C implementation. The function takes quite a good amount of time in my application. I have to optimise wit either IPP or SSE.

Details on the function
--------------------------

Description:Array A x Comp Conj of B; Result in C.

mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real

const Float_t *pAimag, // source of array 'A' imag

const Float_t *pBreal, // source of array 'B' real

const Float_t *pBimag, // source of array 'B' imag

Uword32 Np, // number of points

Float_t *pCreal, // Dest for real part

Float_t *pCimag) // Dest for imag part

I have tried two version of IPP implementation. But both version are slower then C.

//IPP implementation

IppStatus ippStats;

Ipp32fc *pSrcA, *pSrcB, *pDst; // Typecasting input and outpointers in IPP Complex format

/*//Implement-1

// p + qj = (aR+aI*j)*(bR-bI*j)=(aR*bR+aI*bI) + (aI*bR-aR*bI)j

ippsMul_32f(pAreal,pBreal,pCreal,Np);

ippsAddProduct_32f(pAimag,pBimag,pCreal,Np);

//Calculating q = (aI*bR-aR*bI)

ippsMul_32f(pAimag,pBreal,pCimag,Np);

//Make -bI vector from bI vector

ippsSubCRev_32f_I(0,(Ipp32f *)pBimag,Np);

ippsAddProduct_32f(pAreal,pBimag,pCreal,Np);

//Implement-2

//Allocate memory for pSrcA, pSrcB and pSrcC buffer

pSrcA = ippsMalloc_32fc(3*Np);

pSrcB = &pSrcA[Np];//ippsMalloc_32fc(Np);

pDst = &pSrcA[2*Np];//ippsMalloc_32fc(Np);

if((pSrcA==0)||(pDst==0)||(pSrcB==0))

{

IPP_ERR_STATUS_var = ippStsMemAllocErr;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//First Convert 2 separte IQ buffers to single complex number buffer

//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);

ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);

ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);

//Using IPP API:-

//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);

ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, Np);

if(ippStats!=ippStsNoErr)

{

IPP_ERR_STATUS_var = ippStats;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//Now Convert output complex buffer to 2 separte IQ buffers

//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);

ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

Then I implemented the function using SSE. Which is quite fast and works perfect in my application as long as the vectors are 16-byte aligned.

void MulConjCompSSE( float* pSrcRe1,float* pSrcIm1, float* pSrcRe2,float* pSrcIm2, float* pDstRe,float* pDstIm, int count)

{

//Check for 16-byte alignment

{

__m128 m1, m2, m3, m4;

//__m128 sr1,sr2,si1,si2,dr1,dr2;

// __declspec(align(16)) float *pSrc1re2;

__m128 *srcR1 = (__m128*)pSrcRe1; //a1re

__m128 *srcI1 = (__m128*)pSrcIm1; //a1Im

__m128 *srcR2 = (__m128*)pSrcRe2; //b1re

__m128 *srcI2 = (__m128*)pSrcIm2; //b1Im

__m128 *destR = (__m128*)pDstRe; //ResRe

__m128 *destI = (__m128*)pDstIm; //ResIm

for(int i = 0 ;i< count; count-=4, srcR1+=1,srcI1+=1, srcR2+=1,srcI2+=1, destR+=1,destI+=1)

{

m1 = _mm_mul_ps( *srcR1, *srcR2); //(a1Re*b1Re)

m2 = _mm_mul_ps(*srcI1,*srcI2); //(a1Im*b1Im)

m3 = _mm_mul_ps(*srcR1,*srcI2); //(a1Re*b1Im)

m4 = _mm_mul_ps(*srcI1,*srcR2); //(a1Im*b1Re)

*destR = _mm_add_ps(m1,m2); // (a1Re*b1Re)+(a1Im*b1Im)

*destI = _mm_sub_ps(m4,m3); //(a1Im*b1Re)-(a1Re*b1Im)

}

}

I have limitation not to disturb the application framework i.e changing the interface. I have to use vectors. And I don't know how to force the vector to have 16-byte aligned allocation. I have to optimised the speed for this function. Kindly suggest the options for the following
1. Either a method to allocate vector 16-byte aligned
2. Or a fast IPP based method to implement this function
3. SSE implementation to handle non-aligned vectors.

Hoping to get a speedy reply from the experts.

Note: I have just started working on IPP/SSE. its just a week into my first IPP/SSE routine.

Regards
Rohit

Complex multiply by conjugate

SergeyKostrov — Tue, 22 Nov 2011 14:41:17 GMT

I see that you tried already to use__declspec( align(16) ) declaration but it is commented out. Did you have any problems?

You canalsouse_mm_malloc function:

...
__m128 *pVec1 = ( __m128 * )_mm_malloc( _RTVECTOR_SIZE * sizeof( __m128 ), 16 );
...

Best regards,
Sergey

Complex multiply by conjugate

rohitspandey — Tue, 22 Nov 2011 15:37:52 GMT

I have tried both options. It will work when the vectors are defined with in the function. I asked the case how to handle the vectors passed to function are not 16-byte aligned. i.e if the function are passed with vectors which are not aligned to 16-byte boundary.

Regards

Rohit

Complex multiply by conjugate

igorastakhov — Fri, 25 Nov 2011 07:43:22 GMT

Hi Rohit,

I think I've already provided an answer at http://software.intel.com/en-us/forums/showthread.php?t=101316 thread.

Regards,
Igor