Hi,
I have a similar requirement: a complex multiply-by-conjugate function for arrays. Currently the function is a plain C implementation, and it takes a significant amount of time in my application. I have to optimise it with either IPP or SSE.
Details on the function
--------------------------
Description: Array A x Comp Conj of B; Result in C.
mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real
const Float_t *pAimag, // source of array 'A' imag
const Float_t *pBreal, // source of array 'B' real
const Float_t *pBimag, // source of array 'B' imag
Uword32 Np, // number of points
Float_t *pCreal, // Dest for real part
Float_t *pCimag) // Dest for imag part
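For reference, the scalar computation this routine performs, C = A * conj(B) in split re/im layout, can be written as follows (a sketch, assuming Float_t is single-precision float and Uword32 is a 32-bit unsigned type):

```cpp
#include <cstdint>

typedef float Float_t;            // assumption: Float_t is single-precision float
typedef std::uint32_t Uword32;    // assumption: Uword32 is a 32-bit unsigned type

// C = A * conj(B), element-wise, split re/im layout:
// (aR + aI*j) * (bR - bI*j) = (aR*bR + aI*bI) + (aI*bR - aR*bI)*j
void mult_fc_fcConj_arrays(const Float_t* pAreal, const Float_t* pAimag,
                           const Float_t* pBreal, const Float_t* pBimag,
                           Uword32 Np, Float_t* pCreal, Float_t* pCimag)
{
    for (Uword32 i = 0; i < Np; ++i) {
        pCreal[i] = pAreal[i] * pBreal[i] + pAimag[i] * pBimag[i];
        pCimag[i] = pAimag[i] * pBreal[i] - pAreal[i] * pBimag[i];
    }
}
```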
I have tried two versions of an IPP implementation, but both are slower than the C code.
//IPP implementation
IppStatus ippStats;
Ipp32fc *pSrcA, *pSrcB, *pDst; // Typecasting input and outpointers in IPP Complex format
/*//Implement-1
// p + q*j = (aR + aI*j)*(bR - bI*j) = (aR*bR + aI*bI) + (aI*bR - aR*bI)*j
//Calculating p = (aR*bR + aI*bI)
ippsMul_32f(pAreal, pBreal, pCreal, Np);
ippsAddProduct_32f(pAimag, pBimag, pCreal, Np);
//Calculating q = (aI*bR - aR*bI)
ippsMul_32f(pAimag, pBreal, pCimag, Np);
//Make a -bI vector from the bI vector (note: this modifies the const input B in place)
ippsSubCRev_32f_I(0, (Ipp32f *)pBimag, Np);
ippsAddProduct_32f(pAreal, pBimag, pCimag, Np); //accumulate into pCimag, not pCreal
*/
//Implement-2
//Allocate one buffer and carve pSrcA, pSrcB and pDst out of it
pSrcA = ippsMalloc_32fc(3*Np);
pSrcB = &pSrcA[Np];
pDst = &pSrcA[2*Np];
if (pSrcA == 0) //pSrcB and pDst are offsets into the same allocation
{
IPP_ERR_STATUS_var = ippStsMemAllocErr;
return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;
}
//First, convert the two separate I/Q buffers into a single complex buffer
//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);
ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);
ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);
//Using IPP API:-
//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);
ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, Np);
if(ippStats!=ippStsNoErr)
{
IPP_ERR_STATUS_var = ippStats;
myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);
return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;
}
//Now convert the output complex buffer back into two separate I/Q buffers
//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);
ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);
myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);
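The three passes above (interleave, multiply by conjugate, de-interleave) can be sketched in plain C++ (not IPP) to make the extra memory traffic visible; std::complex is used here purely for illustration:

```cpp
#include <complex>
#include <vector>

// Same three passes as the IPP version above, in portable C++ (not IPP):
// 1) interleave split re/im into complex, 2) multiply A by conj(B),
// 3) de-interleave back. Note the two full extra copies of the data,
// which is the overhead the format conversions introduce.
void mulConjViaInterleave(const float* pAreal, const float* pAimag,
                          const float* pBreal, const float* pBimag,
                          unsigned Np, float* pCreal, float* pCimag)
{
    std::vector<std::complex<float> > a(Np), b(Np), c(Np);
    for (unsigned i = 0; i < Np; ++i) {      // pass 1: interleave
        a[i] = std::complex<float>(pAreal[i], pAimag[i]);
        b[i] = std::complex<float>(pBreal[i], pBimag[i]);
    }
    for (unsigned i = 0; i < Np; ++i)        // pass 2: A * conj(B)
        c[i] = a[i] * std::conj(b[i]);
    for (unsigned i = 0; i < Np; ++i) {      // pass 3: de-interleave
        pCreal[i] = c[i].real();
        pCimag[i] = c[i].imag();
    }
}
```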
Then I implemented the function using SSE, which is quite fast and works perfectly in my application as long as the vectors are 16-byte aligned.
#include <xmmintrin.h> // SSE intrinsics
// Note: all six pointers must be 16-byte aligned and count a multiple of 4
void MulConjCompSSE( float* pSrcRe1, float* pSrcIm1, float* pSrcRe2, float* pSrcIm2, float* pDstRe, float* pDstIm, int count)
{
__m128 m1, m2, m3, m4;
__m128 *srcR1 = (__m128*)pSrcRe1; //A real
__m128 *srcI1 = (__m128*)pSrcIm1; //A imag
__m128 *srcR2 = (__m128*)pSrcRe2; //B real
__m128 *srcI2 = (__m128*)pSrcIm2; //B imag
__m128 *destR = (__m128*)pDstRe; //result real
__m128 *destI = (__m128*)pDstIm; //result imag
for(int i = 0; i < count; i += 4, srcR1++, srcI1++, srcR2++, srcI2++, destR++, destI++)
{
m1 = _mm_mul_ps(*srcR1, *srcR2); //(aRe*bRe)
m2 = _mm_mul_ps(*srcI1, *srcI2); //(aIm*bIm)
m3 = _mm_mul_ps(*srcR1, *srcI2); //(aRe*bIm)
m4 = _mm_mul_ps(*srcI1, *srcR2); //(aIm*bRe)
*destR = _mm_add_ps(m1, m2); //(aRe*bRe)+(aIm*bIm)
*destI = _mm_sub_ps(m4, m3); //(aIm*bRe)-(aRe*bIm)
}
}
I have a constraint not to disturb the application framework, i.e. I cannot change the interface. I have to use vectors, and I don't know how to force a vector to have a 16-byte-aligned allocation. I need to optimise this function for speed. Kindly suggest options for the following:
1. A method to allocate a vector 16-byte aligned, or
2. A fast IPP-based method to implement this function, or
3. An SSE implementation that handles non-aligned vectors.
Hoping to get a speedy reply from the experts.
Note: I have just started working with IPP/SSE; this is only my first week writing IPP/SSE routines.
Regards
Rohit
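For question 1, one common approach (a sketch, assuming the framework's vectors are std::vector<float> and a C++11 compiler; the names AlignedAlloc and AlignedFloatVec are illustrative) is a custom allocator that over-allocates with malloc and aligns the returned pointer:

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Minimal C++11 allocator returning 16-byte-aligned storage (sketch).
// It over-allocates, rounds the address up to the alignment, and stashes
// the original malloc pointer just before the aligned block for deallocate().
template <class T, std::size_t Align = 16>
struct AlignedAlloc {
    typedef T value_type;
    AlignedAlloc() {}
    template <class U> AlignedAlloc(const AlignedAlloc<U, Align>&) {}
    template <class U> struct rebind { typedef AlignedAlloc<U, Align> other; };

    T* allocate(std::size_t n) {
        void* raw = std::malloc(n * sizeof(T) + Align + sizeof(void*));
        if (!raw) throw std::bad_alloc();
        std::size_t addr = reinterpret_cast<std::size_t>(raw) + sizeof(void*);
        std::size_t aligned = (addr + Align - 1) & ~(Align - 1);
        reinterpret_cast<void**>(aligned)[-1] = raw; // remember the original block
        return reinterpret_cast<T*>(aligned);
    }
    void deallocate(T* p, std::size_t) {
        std::free(reinterpret_cast<void**>(p)[-1]);
    }
    template <class U> bool operator==(const AlignedAlloc<U, Align>&) const { return true; }
    template <class U> bool operator!=(const AlignedAlloc<U, Align>&) const { return false; }
};

typedef std::vector<float, AlignedAlloc<float> > AlignedFloatVec;
```

With this, AlignedFloatVec v(Np); gives a &v[0] pointer that is safe for _mm_load_ps. If C++11 is not available, _mm_malloc/_mm_free or ippsMalloc_32f provide the same alignment for raw buffers.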
Seemingly it is the same issue as
http://software.intel.com/en-us/forums/showthread.php?t=71784&o=a&s=lr
Right?
As I understand it, the SSE implementation has an obvious advantage with your array layout, while the IPP functions have to spend extra time converting the data layout to a suitable format, with multiple passes over memory.
So only your question 3 actually remains: an SSE implementation that handles non-aligned vectors. You may also ask on the Intel Compiler forum. As I recall, non-aligned loads are not a big issue on the latest processors with the Intel C/C++ compiler, and you can use an unaligned load, for example _mm_loadu_ps, instead of the "compulsive conversion" like __m128 *srcR1 = (__m128*)pSrcRe1:
__m128 _mm_loadu_ps(float * p)
Loads four SP FP values. The address need not be 16-byte-aligned.
Regards,
Ying
http://software.intel.com/en-us/articles/benefits-of-intel-avx-for-small-matrices/
Hope it helps,
Ying
Regarding your questions:
1) Allocating 16-byte-aligned vectors: use any ippMalloc function (for example ippsMalloc_32fc() in your case); they all guarantee the appropriate alignment for the particular architecture, i.e. 16-byte for SSE, 32-byte for AVX, etc. If you need aligned vectors on the stack, see example 3 below.
2) There is no suitable IPP functionality for a fast implementation of what you need.
3) Here is an example of MulConj for the non-aligned case and interleaved re/im format. I guess it will be easy for you to translate it to separate re and im vectors: just split the src vectors and drop the src shuffles, and do the same for dst:
#define M_SIGN 0x80000000
static const __declspec(align(16)) int SIGN_IM[4] = { 0, M_SIGN, 0, M_SIGN };
static void sMulConj_32fc( const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32u len )
{
Ipp32fc *px1, *px2, *py;
__m128 x1, x2, re1, im1, mr, mi, y1, y2;
Ipp32f re, im;
Ipp32u i;
px1 = (Ipp32fc*)pSrc1;
px2 = (Ipp32fc*)pSrc2;
py = (Ipp32fc*)pDst;
for (i=0; i+4<=len; i+=4) {
x1 = _mm_loadu_ps( (float*)( px1+i ));
x2 = _mm_loadu_ps( (float*)( px2+i ));
re1 = _mm_shuffle_ps(x1, x1,_MM_SHUFFLE(2,2,0,0));
im1 = _mm_shuffle_ps(x1, x1, _MM_SHUFFLE(3,3,1,1));
mr = _mm_mul_ps(x2, re1 );
mi = _mm_mul_ps(x2, im1 );
mr = _mm_xor_ps( mr, ( *(__m128*)(&(SIGN_IM))) );
mi = _mm_shuffle_ps(mi, mi, _MM_SHUFFLE(2,3,0,1));
y1 = _mm_add_ps(mr, mi );
x1 = _mm_loadu_ps( (float*)( px1+i+2 ));
x2 = _mm_loadu_ps( (float*)( px2+i+2 ));
re1 = _mm_shuffle_ps(x1, x1,_MM_SHUFFLE(2,2,0,0) );
im1 = _mm_shuffle_ps(x1, x1,_MM_SHUFFLE(3,3,1,1) );
mr = _mm_mul_ps(x2, re1 );
mi = _mm_mul_ps(x2, im1 );
mr =_mm_xor_ps( mr, ( *(__m128*)(&(SIGN_IM))) );
mi = _mm_shuffle_ps( mi, mi,_MM_SHUFFLE(2,3,0,1));
y2 = _mm_add_ps( mr, mi );
_mm_storeu_ps( (float*)( py+i ), y1 );
_mm_storeu_ps( (float*)( py+i+2), y2 );
}
for ( ; i < len; i++ ) { // scalar tail for the len%4 leftover elements
re = px1[i].re * px2[i].re + px1[i].im * px2[i].im;
im = px1[i].im * px2[i].re - px1[i].re * px2[i].im;
py[i].re = re;
py[i].im = im;
}
}
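One possible translation to the split re/im layout (a sketch with unaligned loads and a scalar tail; in split format the shuffles and the sign mask are not needed, and the name mulConjSplitUnaligned is illustrative):

```cpp
#include <immintrin.h>

// C = A * conj(B) for split re/im arrays; no alignment requirement (SSE sketch).
void mulConjSplitUnaligned(const float* pAre, const float* pAim,
                           const float* pBre, const float* pBim,
                           int len, float* pCre, float* pCim)
{
    int i = 0;
    for (; i + 4 <= len; i += 4) {
        __m128 ar = _mm_loadu_ps(pAre + i);
        __m128 ai = _mm_loadu_ps(pAim + i);
        __m128 br = _mm_loadu_ps(pBre + i);
        __m128 bi = _mm_loadu_ps(pBim + i);
        // re = ar*br + ai*bi, im = ai*br - ar*bi
        _mm_storeu_ps(pCre + i, _mm_add_ps(_mm_mul_ps(ar, br), _mm_mul_ps(ai, bi)));
        _mm_storeu_ps(pCim + i, _mm_sub_ps(_mm_mul_ps(ai, br), _mm_mul_ps(ar, bi)));
    }
    for (; i < len; ++i) {   // scalar tail for the len%4 leftover elements
        pCre[i] = pAre[i] * pBre[i] + pAim[i] * pBim[i];
        pCim[i] = pAim[i] * pBre[i] - pAre[i] * pBim[i];
    }
}
```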
Regards,
Igor
I have used unaligned loads (_mm_loadu_ps) for now, which is slower. Something like this:
Uword32 s;
Float_t ar, ai, br, bi;
__m128 m1, m2, m3, m4;
for (s = 0; s < (Np/4); s++)
{
m1 = _mm_mul_ps( _mm_loadu_ps(pAreal), _mm_loadu_ps(pBreal)); //(a1Re*b1Re)
m2 = _mm_mul_ps(_mm_loadu_ps(pAimag),_mm_loadu_ps(pBimag)); //(a1Im*b1Im)
m3 = _mm_mul_ps(_mm_loadu_ps(pAreal),_mm_loadu_ps(pBimag)); //(a1Re*b1Im)
m4 = _mm_mul_ps(_mm_loadu_ps(pAimag),_mm_loadu_ps(pBreal)); //(a1Im*b1Re)
_mm_storeu_ps(pCreal, _mm_add_ps(m1,m2)); // (a1Re*b1Re)+(a1Im*b1Im)
_mm_storeu_ps(pCimag,_mm_sub_ps(m4,m3)); //(a1Im*b1Re)-(a1Re*b1Im)
pAreal=pAreal+4;pAimag=pAimag+4;
pBreal=pBreal+4;pBimag=pBimag+4;
pCreal=pCreal+4;pCimag=pCimag+4;
}
for (s = 0; s < (Np%4); s++) // for each array element
{
ar = *pAreal++; // get the real and imaginary parts
ai = *pAimag++;
br = *pBreal++;
bi = *pBimag++;
*pCreal++ = ar * br + ai * bi; // do the complex multiply
*pCimag++ = ai * br - ar * bi; // NB conjugate of B
}
In the meantime, I am working on a memory manager to handle aligned vectors.
Thanks for the help.
Regards
Rohit
