Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.
Announcements
The Intel sign-in experience has changed to support enhanced security controls. If you sign in, click here for more information.
6671 Discussions

Complex multiply by conjugate function for array

rohitspandey
Beginner
235 Views

Hi,

I have similar requirement to have Complex multiply by conjugate function for array . Currently the function is a C implementation. The function takes quite a good amount of time in my application. I have to optimise wit either IPP or SSE.

Details on the function

--------------------------

Description: Array A x Comp Conj of B; Result in C.

mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real

const Float_t *pAimag, // source of array 'A' imag

const Float_t *pBreal, // source of array 'B' real

const Float_t *pBimag, // source of array 'B' imag

Uword32 Np, // number of points

Float_t *pCreal, // Dest for real part

Float_t *pCimag) // Dest for imag part

I have tried two version of IPP implementation. But both version are slower then C.

//IPP implementation

IppStatus ippStats;

Ipp32fc *pSrcA, *pSrcB, *pDst; // Typecasting input and outpointers in IPP Complex format

/*//Implement-1

// p + qj = (aR+aI*j)*(bR-bI*j)=(aR*bR+aI*bI) + (aI*bR-aR*bI)j

ippsMul_32f(pAreal,pBreal,pCreal,Np);

ippsAddProduct_32f(pAimag,pBimag,pCreal,Np);

//Calculating q = (aI*bR-aR*bI)

ippsMul_32f(pAimag,pBreal,pCimag,Np);

//Make -bI vector from bI vector

ippsSubCRev_32f_I(0,(Ipp32f *)pBimag,Np);

ippsAddProduct_32f(pAreal,pBimag,pCreal,Np);

*/

//Implement-2

//Allocate memory for pSrcA, pSrcB and pSrcC buffer

pSrcA = ippsMalloc_32fc(3*Np);

pSrcB = &pSrcA[Np];//ippsMalloc_32fc(Np);

pDst = &pSrcA[2*Np];//ippsMalloc_32fc(Np);

if((pSrcA==0)||(pDst==0)||(pSrcB==0))

{

IPP_ERR_STATUS_var = ippStsMemAllocErr;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//First Convert 2 separte IQ buffers to single complex number buffer

//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);

ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);

ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);

//Using IPP API:-

//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);

ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, Np);

if(ippStats!=ippStsNoErr)

{

IPP_ERR_STATUS_var = ippStats;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//Now Convert output complex buffer to 2 separte IQ buffers

//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);

ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

Then I implemented the function using SSE. Which is quite fast and works perfect in my application as long as the vectors are 16-byte aligned.

void MulConjCompSSE( float* pSrcRe1,float* pSrcIm1, float* pSrcRe2,float* pSrcIm2, float* pDstRe,float* pDstIm, int count)

{

//Check for 16-byte alignment

{

__m128 m1, m2, m3, m4;

//__m128 sr1,sr2,si1,si2,dr1,dr2;

// __declspec(align(16)) float *pSrc1re2;

__m128 *srcR1 = (__m128*)pSrcRe1; //a1re

__m128 *srcI1 = (__m128*)pSrcIm1; //a1Im

__m128 *srcR2 = (__m128*)pSrcRe2; //b1re

__m128 *srcI2 = (__m128*)pSrcIm2; //b1Im

__m128 *destR = (__m128*)pDstRe; //ResRe

__m128 *destI = (__m128*)pDstIm; //ResIm

for(int i = 0 ;i< count; count-=4, srcR1+=1,srcI1+=1, srcR2+=1,srcI2+=1, destR+=1,destI+=1)

{

m1 = _mm_mul_ps( *srcR1, *srcR2); //(a1Re*b1Re)

m2 = _mm_mul_ps(*srcI1,*srcI2); //(a1Im*b1Im)

m3 = _mm_mul_ps(*srcR1,*srcI2); //(a1Re*b1Im)

m4 = _mm_mul_ps(*srcI1,*srcR2); //(a1Im*b1Re)

*destR = _mm_add_ps(m1,m2); // (a1Re*b1Re)+(a1Im*b1Im)

*destI = _mm_sub_ps(m4,m3); //(a1Im*b1Re)-(a1Re*b1Im)

}

}

I have limitation not to disturb the application framework i.e changing the interface. I have to use vectors. And I don't know how to force the vector to have 16-byte aligned allocation. I have to optimised the speed for this function. Kindly suggest the options for the following

1. Either a method to allocate vector 16-byte aligned

2. Or a fast IPP based method to implement this function

3. SSE implementation to handle non-aligned vectors.

Hoping to get a speedy reply from the experts.

Note: I have just started working on IPP/SSE. its just a week into my first IPP/SSE routine.

Regards

Rohit

0 Kudos
4 Replies
Ying_H_Intel
Employee
235 Views
Hi Rohit,

Seemingly it is same issue as

http://software.intel.com/en-us/forums/showthread.php?t=71784&o=a&s=lr

Rights?

As i understand, the SSE implementation have an obvious advantage with your array layout, while IPP functions have totake many time to convert data layoutto suitable dataformat and multiples I/O.

So your question actually left3. SSE implementation to handle non-aligned vectors. You may ask the question to Intel Compiler Forum also. As i recalled, the non-alignedload is not big issueonlatestProcessor with Intel C/C++ compiler and you may usenon-alignedloadfor example _mm_loadu_psto insteadthe "compulsive conversion", like__m128 *srcR1 = (__m128*)pSrcRe1,

__m128 _mm_loadu_ps(float * p)

Loads four SP FP values. The address need not be 16-byte-aligned.


Regards,
Ying
Ying_H_Intel
Employee
235 Views
igorastakhov
New Contributor II
235 Views
Rohit,

regarding your questions:
1) allocating of 16-byte aligned vectors - use any ippMalloc function (for example ippsMalloc_32fc() in your case) - they all guarantee an appropriate alignment for particular architecture - so 16-byte for SSE, 32-byte for AVX, etc. If you need aligned vectors at stack - see example 3 below.
2) there is noany suitable ipp functionality for fast implementation of what you need
3) an example of MulConj for non-aligned case and merged re,im format - guess it will be easy for you to translate it for separate re & im vectors - just split src vectors and reverse src shuffles, the same - for dst:

#define M_SIGN 0x80000000

static const __declspec(align(16))int SIGN_IM[4] = { 0, M_SIGN, 0, M_SIGN };

sMulConj_32fc( const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32u len ){

Ipp32fc *px1, *px2, *py;
__m128x1, x2, re1, im1, mr, mi, y1, y2;
Ipp32f re, im;
Ipp32u i;

px1 = (Ipp32fc*)pSrc1;
px2 = (Ipp32fc*)pSrc2;
py = (Ipp32fc*)pDst;

for (i=0; i+4<=len; i+=4) {
x1 = _mm_loadu_ps( (float*)( px1+i ));
x2 = _mm_loadu_ps( (float*)( px2+i ));
re1 = _mm_shuffle_ps(x1, x1,_MM_SHUFFLE(2,2,0,0));
im1 = _mm_shuffle_ps(x1, x1, _MM_SHUFFLE(3,3,1,1));
mr = _mm_mul_ps(x2, re1 );
mi = _mm_mul_ps(x2, im1 );
mr = _mm_xor_ps( mr, ( *(__m128*)(&(SIGN_IM))) );
mi = _mm_shuffle_ps(mi, mi, _MM_SHUFFLE(2,3,0,1));
y1 = _mm_add_ps(mr, mi );

x1 = _mm_loadu_ps( (float*)( px1+i+2 ));
x2 = _mm_loadu_ps( (float*)( px2+i+2 ));
re1 = _mm_shuffle_ps(x1, x1,_MM_SHUFFLE(2,2,0,0) );
im1 = _mm_shuffle_ps(x1, x1,_MM_SHUFFLE(3,3,1,1) );
mr = _mm_mul_ps(x2, re1 );
mi = _mm_mul_ps(x2, im1 );
mr =_mm_xor_ps( mr, ( *(__m128*)(&(SIGN_IM))) );
mi = _mm_shuffle_ps( mi, mi,_MM_SHUFFLE(2,3,0,1));
y2 = _mm_add_ps( mr, mi );
_mm_storeu_ps( (float*)( py+i ), y1 );
_mm_storeu_ps( (float*)( py+i+2), y2 );
}
}

Regards,
Igor

rohitspandey
Beginner
235 Views
Hi,

I have usedXYZuas off now, which isslower. something like this


Uword32 s;

Float_t ar, ai, br, bi;

__m128 m1, m2, m3, m4;

for(s = 0 ;s< (Np/4); s++)

{

m1 = _mm_mul_ps( _mm_loadu_ps(pAreal), _mm_loadu_ps(pBreal)); //(a1Re*b1Re)

m2 = _mm_mul_ps(_mm_loadu_ps(pAimag),_mm_loadu_ps(pBimag)); //(a1Im*b1Im)

m3 = _mm_mul_ps(_mm_loadu_ps(pAreal),_mm_loadu_ps(pBimag)); //(a1Re*b1Im)

m4 = _mm_mul_ps(_mm_loadu_ps(pAimag),_mm_loadu_ps(pBreal)); //(a1Im*b1Re)

_mm_storeu_ps(pCreal, _mm_add_ps(m1,m2)); // (a1Re*b1Re)+(a1Im*b1Im)

_mm_storeu_ps(pCimag,_mm_sub_ps(m4,m3)); //(a1Im*b1Re)-(a1Re*b1Im)

pAreal=pAreal+4;pAimag=pAimag+4;

pBreal=pBreal+4;pBimag=pBimag+4;

pCreal=pCreal+4;pCimag=pCimag+4;

}

for (s = 0; s < (Np%4); s++) // for each array element

{

ar = *pAreal++; // get the real and imaginary parts

ai = *pAimag++;

br = *pBreal++;

bi = *pBimag++;

*pCreal++ = ar * br + ai * bi; // do the complex multiply

*pCimag++ = ai * br - ar * bi; // NB conjugate of B

}

In the mean time working to create a memory manager to handle aligned vectors.
Thanks for help.
Regards
Rohit

Reply