Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

## Complex multiply by conjugate function for array

Beginner
820 Views

Hi,

I have a similar requirement for a complex multiply-by-conjugate function for arrays. Currently the function is a plain C implementation, and it takes a significant amount of time in my application. I have to optimise it with either IPP or SSE.

Details on the function:

Description: Array A x complex conjugate of B; result in C.

```c
mult_fc_fcConj_arrays(const Float_t *pAreal, // source array 'A', real part
                      const Float_t *pAimag, // source array 'A', imag part
                      const Float_t *pBreal, // source array 'B', real part
                      const Float_t *pBimag, // source array 'B', imag part
                      Uword32        Np,     // number of points
                      Float_t       *pCreal, // destination, real part
                      Float_t       *pCimag) // destination, imag part
```
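Elementwise the routine computes C[i] = A[i] * conj(B[i]), i.e. cRe = aRe*bRe + aIm*bIm and cIm = aIm*bRe - aRe*bIm. A plain-C sketch of the reference behaviour (types simplified here to `float`/`unsigned`; the helper name is illustrative, not the framework's):

```cpp
/* Sketch of the reference computation: C[i] = A[i] * conj(B[i])
   in split real/imag layout. */
static void mult_fc_fcConj_arrays_ref(const float *pAreal, const float *pAimag,
                                      const float *pBreal, const float *pBimag,
                                      unsigned Np,
                                      float *pCreal, float *pCimag)
{
    for (unsigned i = 0; i < Np; i++) {
        /* (aR + aI*j) * (bR - bI*j) = (aR*bR + aI*bI) + (aI*bR - aR*bI)*j */
        pCreal[i] = pAreal[i] * pBreal[i] + pAimag[i] * pBimag[i];
        pCimag[i] = pAimag[i] * pBreal[i] - pAreal[i] * pBimag[i];
    }
}
```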

I have tried two versions of an IPP implementation, but both are slower than the C version.

```c
// IPP implementation
IppStatus ippStats;
Ipp32fc *pSrcA, *pSrcB, *pDst; // input and output pointers in IPP complex format

/* // Implement-1 (incomplete, commented out)
// p + q*j = (aR + aI*j)*(bR - bI*j) = (aR*bR + aI*bI) + (aI*bR - aR*bI)*j
ippsMul_32f(pAreal, pBreal, pCreal, Np);
// Calculating q = (aI*bR - aR*bI)
ippsMul_32f(pAimag, pBreal, pCimag, Np);
// Make a -bI vector from the bI vector
ippsSubCRev_32f_I(0, (Ipp32f *)pBimag, Np);
*/

// Implement-2
// Allocate one buffer shared by pSrcA, pSrcB and pDst
pSrcA = ippsMalloc_32fc(3 * Np);
pSrcB = &pSrcA[Np];
pDst  = &pSrcA[2 * Np];
if ((pSrcA == 0) || (pSrcB == 0) || (pDst == 0))
{
    IPP_ERR_STATUS_var = ippStsMemAllocErr;
    myIppsFree(pSrcA);
    return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;
}

// First convert the two separate I/Q buffers into a single complex buffer
// IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);
ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);
ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);

// Multiply by conjugate using the IPP API:
// IppStatus ippsMulByConj_32fc_A21(const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);
ippStats = ippsMulByConj_32fc_A21(pSrcA, pSrcB, pDst, Np);
if (ippStats != ippStsNoErr)
{
    IPP_ERR_STATUS_var = ippStats;
    myIppsFree(pSrcA);
    return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;
}

// Now convert the output complex buffer back to two separate I/Q buffers
// IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);
ippsCplxToReal_32fc(pDst, pCreal, pCimag, Np);
myIppsFree(pSrcA);
```

Then I implemented the function using SSE, which is quite fast and works perfectly in my application as long as the vectors are 16-byte aligned.

```c
void MulConjCompSSE(float* pSrcRe1, float* pSrcIm1,
                    float* pSrcRe2, float* pSrcIm2,
                    float* pDstRe, float* pDstIm, int count)
{
    // Note: all pointers must be 16-byte aligned
    __m128 m1, m2, m3, m4;

    __m128 *srcR1 = (__m128*)pSrcRe1; // aRe
    __m128 *srcI1 = (__m128*)pSrcIm1; // aIm
    __m128 *srcR2 = (__m128*)pSrcRe2; // bRe
    __m128 *srcI2 = (__m128*)pSrcIm2; // bIm
    __m128 *destR = (__m128*)pDstRe;  // cRe
    __m128 *destI = (__m128*)pDstIm;  // cIm

    for (int i = 0; i + 4 <= count; i += 4,
         srcR1++, srcI1++, srcR2++, srcI2++, destR++, destI++)
    {
        m1 = _mm_mul_ps(*srcR1, *srcR2); // aRe*bRe
        m2 = _mm_mul_ps(*srcI1, *srcI2); // aIm*bIm
        m3 = _mm_mul_ps(*srcR1, *srcI2); // aRe*bIm
        m4 = _mm_mul_ps(*srcI1, *srcR2); // aIm*bRe
        *destR = _mm_add_ps(m1, m2);     // aRe*bRe + aIm*bIm
        *destI = _mm_sub_ps(m4, m3);     // aIm*bRe - aRe*bIm
    }
}
```

I have a constraint not to disturb the application framework, i.e. I cannot change the interface. I have to use vectors, and I don't know how to force a vector to get a 16-byte aligned allocation. I have to optimise this function for speed. Kindly suggest options for the following:

1. a method to allocate a 16-byte aligned vector, or

2. a fast IPP-based method to implement this function, or

3. an SSE implementation that handles non-aligned vectors.

Hoping to get a speedy reply from the experts.

Note: I have just started working on IPP/SSE; this is only my first week writing an IPP/SSE routine.

Regards

Rohit

4 Replies
Employee
Hi Rohit,

Seemingly it is the same issue as

Right?

As I understand it, the SSE implementation has an obvious advantage with your array layout, while the IPP functions have to take extra time to convert the data layout to a suitable format, with multiple extra passes over memory.

So only your question 3 remains: an SSE implementation that handles non-aligned vectors. You may also ask the question on the Intel Compiler Forum. As I recall, non-aligned loads are not a big issue on the latest processors with the Intel C/C++ compiler, and you may use a non-aligned load, for example _mm_loadu_ps, instead of the "compulsive conversion" like `__m128 *srcR1 = (__m128*)pSrcRe1`:

```c
__m128 _mm_loadu_ps(float * p);
```

Loads four SP FP values. The address need not be 16-byte aligned.
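For instance, the aligned casts in your kernel could be replaced by a 4-wide step built on _mm_loadu_ps/_mm_storeu_ps. A sketch (the helper name is illustrative):

```cpp
#include <xmmintrin.h>

/* Sketch: one 4-wide step of the multiply-by-conjugate kernel using
   unaligned loads and stores, so the pointers need not be 16-byte aligned. */
static void mulconj_step4(const float *aRe, const float *aIm,
                          const float *bRe, const float *bIm,
                          float *cRe, float *cIm)
{
    __m128 ar = _mm_loadu_ps(aRe); // unaligned load of 4 floats
    __m128 ai = _mm_loadu_ps(aIm);
    __m128 br = _mm_loadu_ps(bRe);
    __m128 bi = _mm_loadu_ps(bIm);
    // cRe = aRe*bRe + aIm*bIm,  cIm = aIm*bRe - aRe*bIm
    _mm_storeu_ps(cRe, _mm_add_ps(_mm_mul_ps(ar, br), _mm_mul_ps(ai, bi)));
    _mm_storeu_ps(cIm, _mm_sub_ps(_mm_mul_ps(ai, br), _mm_mul_ps(ar, bi)));
}
```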

Regards,
Ying
Employee
New Contributor II
Rohit,

1) Allocating 16-byte aligned vectors: use any ippMalloc function (for example ippsMalloc_32fc() in your case). They all guarantee an appropriate alignment for the particular architecture: 16-byte for SSE, 32-byte for AVX, etc. If you need aligned vectors on the stack, see example 3 below.
2) There is no suitable IPP functionality for a fast implementation of what you need.
3) Here is an example of MulConj for the non-aligned case and merged re/im format. I guess it will be easy for you to translate it to separate re & im vectors: just split the src vectors and reverse the src shuffles, and the same for dst:

```c
#define M_SIGN 0x80000000

static const __declspec(align(16)) int SIGN_IM[4] = { 0, M_SIGN, 0, M_SIGN };

void sMulConj_32fc( const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32u len )
{
    Ipp32fc *px1, *px2, *py;
    __m128 x1, x2, re1, im1, mr, mi, y1, y2;
    Ipp32u i;

    px1 = (Ipp32fc*)pSrc1;
    px2 = (Ipp32fc*)pSrc2;
    py  = (Ipp32fc*)pDst;

    for (i = 0; i + 4 <= len; i += 4) {
        x1  = _mm_loadu_ps( (float*)( px1 + i ) );
        x2  = _mm_loadu_ps( (float*)( px2 + i ) );
        re1 = _mm_shuffle_ps( x1, x1, _MM_SHUFFLE(2,2,0,0) );
        im1 = _mm_shuffle_ps( x1, x1, _MM_SHUFFLE(3,3,1,1) );
        mr  = _mm_mul_ps( x2, re1 );
        mi  = _mm_mul_ps( x2, im1 );
        mr  = _mm_xor_ps( mr, *(__m128*)&SIGN_IM );
        mi  = _mm_shuffle_ps( mi, mi, _MM_SHUFFLE(2,3,0,1) );
        y1  = _mm_add_ps( mr, mi );

        x1  = _mm_loadu_ps( (float*)( px1 + i + 2 ) );
        x2  = _mm_loadu_ps( (float*)( px2 + i + 2 ) );
        re1 = _mm_shuffle_ps( x1, x1, _MM_SHUFFLE(2,2,0,0) );
        im1 = _mm_shuffle_ps( x1, x1, _MM_SHUFFLE(3,3,1,1) );
        mr  = _mm_mul_ps( x2, re1 );
        mi  = _mm_mul_ps( x2, im1 );
        mr  = _mm_xor_ps( mr, *(__m128*)&SIGN_IM );
        mi  = _mm_shuffle_ps( mi, mi, _MM_SHUFFLE(2,3,0,1) );
        y2  = _mm_add_ps( mr, mi );

        _mm_storeu_ps( (float*)( py + i ), y1 );
        _mm_storeu_ps( (float*)( py + i + 2 ), y2 );
    }
}
```
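Translated to the split re/im layout your function uses, the same unaligned approach might look like this (a sketch with a scalar tail for the leftover elements; the name is illustrative):

```cpp
#include <xmmintrin.h>

/* Sketch: C = A * conj(B) for split real/imag arrays, no alignment required. */
static void sMulConj_split_32f(const float *aRe, const float *aIm,
                               const float *bRe, const float *bIm,
                               float *cRe, float *cIm, unsigned len)
{
    unsigned i = 0;
    for (; i + 4 <= len; i += 4) {           // 4 complex values per iteration
        __m128 ar = _mm_loadu_ps(aRe + i);
        __m128 ai = _mm_loadu_ps(aIm + i);
        __m128 br = _mm_loadu_ps(bRe + i);
        __m128 bi = _mm_loadu_ps(bIm + i);
        _mm_storeu_ps(cRe + i, _mm_add_ps(_mm_mul_ps(ar, br), _mm_mul_ps(ai, bi)));
        _mm_storeu_ps(cIm + i, _mm_sub_ps(_mm_mul_ps(ai, br), _mm_mul_ps(ar, bi)));
    }
    for (; i < len; i++) {                   // scalar tail for len % 4 leftovers
        cRe[i] = aRe[i] * bRe[i] + aIm[i] * bIm[i];
        cIm[i] = aIm[i] * bRe[i] - aRe[i] * bIm[i];
    }
}
```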

Regards,
Igor

Beginner
Hi,

I have used the unaligned ("u") intrinsics for now, which is slower. Something like this:

```c
Uword32 s;
Float_t ar, ai, br, bi;
__m128 m1, m2, m3, m4;

for (s = 0; s < (Np / 4); s++)
{
    __m128 vAr = _mm_loadu_ps(pAreal); // unaligned loads of 4 elements each
    __m128 vAi = _mm_loadu_ps(pAimag);
    __m128 vBr = _mm_loadu_ps(pBreal);
    __m128 vBi = _mm_loadu_ps(pBimag);
    m1 = _mm_mul_ps(vAr, vBr); // aRe*bRe
    m2 = _mm_mul_ps(vAi, vBi); // aIm*bIm
    m3 = _mm_mul_ps(vAr, vBi); // aRe*bIm
    m4 = _mm_mul_ps(vAi, vBr); // aIm*bRe
    _mm_storeu_ps(pCreal, _mm_add_ps(m1, m2)); // aRe*bRe + aIm*bIm
    _mm_storeu_ps(pCimag, _mm_sub_ps(m4, m3)); // aIm*bRe - aRe*bIm
    pAreal += 4; pAimag += 4;
    pBreal += 4; pBimag += 4;
    pCreal += 4; pCimag += 4;
}

for (s = 0; s < (Np % 4); s++) // scalar loop for the remaining elements
{
    ar = *pAreal++; // get the real and imaginary parts
    ai = *pAimag++;
    br = *pBreal++;
    bi = *pBimag++;
    *pCreal++ = ar * br + ai * bi; // complex multiply...
    *pCimag++ = ai * br - ar * bi; // ...by the conjugate of B
}
```

In the meantime I am working on a memory manager to handle aligned vectors.
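A minimal sketch of what such an aligned allocator for std::vector could look like (C++17, using over-aligned operator new; the naming is my own, not from the framework):

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Sketch: a C++17 allocator that over-aligns every allocation, so
// std::vector<float, AlignedAlloc<float, 16>>::data() is safe for _mm_load_ps.
template <typename T, std::size_t Align>
struct AlignedAlloc {
    using value_type = T;
    template <typename U> struct rebind { using other = AlignedAlloc<U, Align>; };

    AlignedAlloc() = default;
    template <typename U> AlignedAlloc(const AlignedAlloc<U, Align>&) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t(Align)));
    }
    void deallocate(T* p, std::size_t) {
        ::operator delete(p, std::align_val_t(Align));
    }
    template <typename U> bool operator==(const AlignedAlloc<U, Align>&) const { return true; }
    template <typename U> bool operator!=(const AlignedAlloc<U, Align>&) const { return false; }
};
```

Usage would be `std::vector<float, AlignedAlloc<float, 16>> v(Np);`, after which `v.data()` is 16-byte aligned.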
Thanks for the help.
Regards
Rohit