- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
It would save memory bandwidth if there was a function that would multiply two complex vectorswith the result being as if one of thesource vectors had been conjugated. This is currently a two step process of conjugating one vector and then performing the complex vector multiply.
Please consider this for a future release.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
sorry, yesterday I just had 10min to "hack" some buggy code
here's the correction
[bash]//the C version
void MulConjCompC( float* pSrc1, float* pSrc2, float* pDst, int count)
{
for( int i=0; i{
pDst[0] = pSrc1[0]*pSrc2[0] - pSrc1[1]*pSrc2[1];
pDst[1] = pSrc1[1]*pSrc2[0] - pSrc1[0]*pSrc2[1];
}
}
//2 complex values with 1 loop
void MulConjCompSSE( float* pSrc1, float* pSrc2, float* pDst, int count)
{
assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) & 0xF)); //check for 16byte align
__m128* src1 = (__m128*)pSrc1;
__m128* src2 = (__m128*)pSrc2;
__m128* dst = (__m128*)pDst;
for( ; count>0; count-=4, src1++, src2++, dst++)
{
__m128 d1 = _mm_mul_ps( src1[0], src2[0]);
__m128 ds = _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1));
__m128 d2 = _mm_mul_ps( ds, src2[0]);
ds = _mm_hsub_ps( d1, d2); //horizontally add 2 values
*dst = _mm_shuffle_ps( ds, ds, _MM_SHUFFLE(3, 1, 2, 0));
}
MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count); //rest
}
//4 complex values with 1 loop
void MulConjCompSSE3_2( float* pSrc1, float* pSrc2, float* pDst, int count)
{
assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) & 0xF)); //check for 16byte align
__m128* src1 = (__m128*)pSrc1;
__m128* src2 = (__m128*)pSrc2;
__m128* dst = (__m128*)pDst;
for( ; count>0; count-=8, src1+=2, src2+=2, dst+=2)
{
__m128 d1 = _mm_mul_ps( src1[0], src2[0]);
__m128 d2 = _mm_mul_ps( _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1)), src2[0]);
__m128 e1 = _mm_mul_ps( src1[1], src2[1]);
__m128 e2 = _mm_mul_ps( _mm_shuffle_ps( src1[1], src1[1], _MM_SHUFFLE(2, 3, 0, 1)), src2[1]);
__m128 f1 = _mm_hsub_ps( d1, e1);
__m128 f2 = _mm_hsub_ps( d2, e2);
dst[0] = _mm_unpacklo_ps( f1, f2);
dst[1] = _mm_unpackhi_ps( f1, f2);
}
MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count); //rest
}
[/bash]
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If I understand MulPackConj correctly, I would have to convert my complex data to packed format, multiply, and then convert it back. My data is stored as Ipp32fc.
My main objective with this request is to reduce the memory footprint and/or memory accesses required to perform the multiplication. The conj() and mul() sequence that I currently use requires 3 reads and 2 writes of complex data to get the job done when it is possible (when combined) to use 2 reads and 1 write.
I would hazard a guess that conjugate multiply is already used internally by some of the other algorithms provided. It is a relatively common operation in singal processing involving complex data.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have been there also, trying to get IPP DFT effective.
My problem was that I couldn't find any good FFT/DFT samples from IPP.
I'd like to have a good sample that loads a grayscale image, and then perform some frequency domain filtering, and then saving the result. The frequency domain filter could be a butterworth filter for example.
If this was available, we'd be able to continue with other filters. Google only indicates general frequency domain theory. Here, we need specific Intel IPP information, to get top performance.
Maybe Intel could expand Picnic to include a simple FFT/DFT workbench function.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Ok, it would be a nice to have,
but if you're sure your pointers are 16 byte aligned and your processor supports sse3, it's really easy to implement in SSE intrinsics (_mm_mul_ps/_mm_hadd_ps for single precision)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the suggestions.
I haven't written any sse intrinsic code in Visual Studio. I did some monkey see monkey do code changes in Unix/gcc in a previous life that used them.
What would that look like in Visual Studio C/C++? Or do you have a good reference link?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
it will look like
[bash]void MulConjCompC( float* pSrc1, float* pSrc2, float* pDst, int count) { for( int i=0; i
I did not compare the outputs of SSE implementation and C, maybe the imaginary part of SSE has the wrong sign.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks a bunch. I will try this out and see what kind of change it has on the timing of my algorithm.
I would still like Intel to supply the functionality so it will work on all platforms and be supported/optimized by them on future processors.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
sorry, yesterday I just had 10min to "hack" some buggy code
here's the correction
[bash]//the C version
void MulConjCompC( float* pSrc1, float* pSrc2, float* pDst, int count)
{
for( int i=0; i{
pDst[0] = pSrc1[0]*pSrc2[0] - pSrc1[1]*pSrc2[1];
pDst[1] = pSrc1[1]*pSrc2[0] - pSrc1[0]*pSrc2[1];
}
}
//2 complex values with 1 loop
void MulConjCompSSE( float* pSrc1, float* pSrc2, float* pDst, int count)
{
assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) & 0xF)); //check for 16byte align
__m128* src1 = (__m128*)pSrc1;
__m128* src2 = (__m128*)pSrc2;
__m128* dst = (__m128*)pDst;
for( ; count>0; count-=4, src1++, src2++, dst++)
{
__m128 d1 = _mm_mul_ps( src1[0], src2[0]);
__m128 ds = _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1));
__m128 d2 = _mm_mul_ps( ds, src2[0]);
ds = _mm_hsub_ps( d1, d2); //horizontally add 2 values
*dst = _mm_shuffle_ps( ds, ds, _MM_SHUFFLE(3, 1, 2, 0));
}
MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count); //rest
}
//4 complex values with 1 loop
void MulConjCompSSE3_2( float* pSrc1, float* pSrc2, float* pDst, int count)
{
assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) & 0xF)); //check for 16byte align
__m128* src1 = (__m128*)pSrc1;
__m128* src2 = (__m128*)pSrc2;
__m128* dst = (__m128*)pDst;
for( ; count>0; count-=8, src1+=2, src2+=2, dst+=2)
{
__m128 d1 = _mm_mul_ps( src1[0], src2[0]);
__m128 d2 = _mm_mul_ps( _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1)), src2[0]);
__m128 e1 = _mm_mul_ps( src1[1], src2[1]);
__m128 e2 = _mm_mul_ps( _mm_shuffle_ps( src1[1], src1[1], _MM_SHUFFLE(2, 3, 0, 1)), src2[1]);
__m128 f1 = _mm_hsub_ps( d1, e1);
__m128 f2 = _mm_hsub_ps( d2, e2);
dst[0] = _mm_unpacklo_ps( f1, f2);
dst[1] = _mm_unpackhi_ps( f1, f2);
}
MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count); //rest
}
[/bash]
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
renegr - I didn't expect someone to take the time to actually code it. Many thanks. Your help is very much appreciated.
Any one else that is interested,
Since what I actually wanted was dst = src1 * conj(src2) the real calculation inMulConjCompC() should be:
pDst[0] = pSrc1[0]*pSrc2[0]+ pSrc1[1]*pSrc2[1]; // notice the + instead of -
and therefor I think the first mm_hsub_ps() in MulConjCompSSE3_2() should be:
__m128 f1 = _mm_hadd_ps( d1, e1); // noitice the hadd instead of hsub
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Of course you're right (it was correct in the c function)
I'm currently gaining my SSE experiences so it was a nice practice. Would be nice if you could provide some measures on how your performance did increase by this functions.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have a similar need to have a Complex multiply by conjugate for arrayfunction. Currently the function is a C implementation in my application. The function takes a quite a good amount of time in my application.
I have to optimize the implementation using IPP. I triedtwo versions (given below )but both are slower then the C version.
//IPP implementation
IppStatus ippStats;
Ipp32fc *pSrcA, *pSrcB, *pDst; // Typecasting input and outpointers in IPP Complex format
/*//Implement-1
// p + qj = (aR+aI*j)*(bR-bI*j)=(aR*bR+aI*bI) + (aI*bR-aR*bI)j
ippsMul_32f(pAreal,pBreal,pCreal,Np);
ippsAddProduct_32f(pAimag,pBimag,pCreal,Np);
//Calculating q = (aI*bR-aR*bI)
ippsMul_32f(pAimag,pBreal,pCimag,Np);
//Make -bI vector from bI vector
ippsSubCRev_32f_I(0,(Ipp32f *)pBimag,Np);
ippsAddProduct_32f(pAreal,pBimag,pCreal,Np);
*/
//Implement-2
//Allocate memory for pSrcA, pSrcB and pSrcC buffer
pSrcA = ippsMalloc_32fc(3*Np);
pSrcB = &pSrcA[Np];//ippsMalloc_32fc(Np);
pDst = &pSrcA[2*Np];//ippsMalloc_32fc(Np);
if((pSrcA==0)||(pDst==0)||(pSrcB==0))
{
IPP_ERR_STATUS_var = ippStsMemAllocErr;
myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);
return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;
}
//First Convert 2 separte IQ buffers to single complex number buffer
//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);
ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);
ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);
//Using IPP API:-
//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);
ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, Np);
if(ippStats!=ippStsNoErr)
{
IPP_ERR_STATUS_var = ippStats;
myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);
return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;
}
//Now Convert output complex buffer to 2 separte IQ buffers
//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);
ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);
myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);
Then I implemented the SSE version of the function as
void MulConjCompSSE( float* pSrcRe1,float* pSrcIm1, float* pSrcRe2,float* pSrcIm2, float* pDstRe,float* pDstIm, int count)
{
/*16-byte alignment check*/
{
__m128 m1, m2, m3, m4;
//__m128 sr1,sr2,si1,si2,dr1,dr2;
// __declspec(align(16)) float *pSrc1re2;
__m128 *srcR1 = (__m128*)pSrcRe1; //a1re
__m128 *srcI1 = (__m128*)pSrcIm1; //a1Im
__m128 *srcR2 = (__m128*)pSrcRe2; //b1re
__m128 *srcI2 = (__m128*)pSrcIm2; //b1Im
__m128 *destR = (__m128*)pDstRe; //ResRe
__m128 *destI = (__m128*)pDstIm; //ResIm
for(int i = 0 ;i< count; count-=4, srcR1+=1,srcI1+=1, srcR2+=1,srcI2+=1, destR+=1,destI+=1)
{
m1 = _mm_mul_ps( *srcR1, *srcR2); //(a1Re*b1Re)
m2 = _mm_mul_ps(*srcI1,*srcI2); //(a1Im*b1Im)
m3 = _mm_mul_ps(*srcR1,*srcI2); //(a1Re*b1Im)
m4 = _mm_mul_ps(*srcI1,*srcR2); //(a1Im*b1Re)
*destR = _mm_add_ps(m1,m2); // (a1Re*b1Re)+(a1Im*b1Im)
*destI = _mm_sub_ps(m4,m3); //(a1Im*b1Re)-(a1Re*b1Im)
}
mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real
const Float_t *pAimag, // source of array 'A' imag
const Float_t *pBreal, // source of array 'B' real
const Float_t *pBimag, // source of array 'B' imag
Uword32 Np, // number of points
Float_t *pCreal, // Dest for real part
Float_t *pCimag)
C version
int s;
float ar, ai, br, bi;
for (s = 0; s < count; s++) // for each array element
{
ar = *pSrcRe1++; // get the real and imaginary parts
ai = *pSrcIm1++;
br = *pSrcRe2++;
bi = *pSrcIm2++;
*pDstRe++ = ar * br + ai * bi; // do the complex multiply
*pDstIm++ = ai * br - ar * bi; // B conjugate of B
}
This function is very fast compare to C as long all the input vectors are 16-byte aligned. I have limitation not to change the current framework of application. And I am not sure how to force a vector to be 16-byte aligned. Kindly suggest what are the options I can try
1. algin vectors to 16-byte.
2. or using IPP to have faster routine.
Hoping to get a speedy reply from experts.
Note: I have just started using IPP/SSE. Its hardly a week's experience.
Regards
Rohit
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have similar requirementto have Complex multiply by conjugate function for array . Currently the function is a C implementation. The function takes quite a good amount of time in my application. I have to optimise wit either IPP or SSE.
Details on the function
--------------------------
Description:Array A x Comp Conj of B; Result in C.
mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real
const Float_t *pAimag, // source of array 'A' imag
const Float_t *pBreal, // source of array 'B' real
const Float_t *pBimag, // source of array 'B' imag
Uword32 Np, // number of points
Float_t *pCreal, // Dest for real part
Float_t *pCimag) // Dest for imag part
I have tried two version of IPP implementation. But both version are slower then C.
//IPP implementation
IppStatus ippStats;
Ipp32fc *pSrcA, *pSrcB, *pDst; // Typecasting input and outpointers in IPP Complex format
/*//Implement-1
// p + qj = (aR+aI*j)*(bR-bI*j)=(aR*bR+aI*bI) + (aI*bR-aR*bI)j
ippsMul_32f(pAreal,pBreal,pCreal,Np);
ippsAddProduct_32f(pAimag,pBimag,pCreal,Np);
//Calculating q = (aI*bR-aR*bI)
ippsMul_32f(pAimag,pBreal,pCimag,Np);
//Make -bI vector from bI vector
ippsSubCRev_32f_I(0,(Ipp32f *)pBimag,Np);
ippsAddProduct_32f(pAreal,pBimag,pCreal,Np);
*/
//Implement-2
//Allocate memory for pSrcA, pSrcB and pSrcC buffer
pSrcA = ippsMalloc_32fc(3*Np);
pSrcB = &pSrcA[Np];//ippsMalloc_32fc(Np);
pDst = &pSrcA[2*Np];//ippsMalloc_32fc(Np);
if((pSrcA==0)||(pDst==0)||(pSrcB==0))
{
IPP_ERR_STATUS_var = ippStsMemAllocErr;
myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);
return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;
}
//First Convert 2 separte IQ buffers to single complex number buffer
//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);
ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);
ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);
//Using IPP API:-
//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);
ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, Np);
if(ippStats!=ippStsNoErr)
{
IPP_ERR_STATUS_var = ippStats;
myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);
return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;
}
//Now Convert output complex buffer to 2 separte IQ buffers
//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);
ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);
myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);
Then I implemented the function using SSE. Which is quite fast and works perfect in my application as long as the vectors are 16-byte aligned.
void MulConjCompSSE( float* pSrcRe1,float* pSrcIm1, float* pSrcRe2,float* pSrcIm2, float* pDstRe,float* pDstIm, int count)
{
//Check for 16-byte alignment
{
__m128 m1, m2, m3, m4;
//__m128 sr1,sr2,si1,si2,dr1,dr2;
// __declspec(align(16)) float *pSrc1re2;
__m128 *srcR1 = (__m128*)pSrcRe1; //a1re
__m128 *srcI1 = (__m128*)pSrcIm1; //a1Im
__m128 *srcR2 = (__m128*)pSrcRe2; //b1re
__m128 *srcI2 = (__m128*)pSrcIm2; //b1Im
__m128 *destR = (__m128*)pDstRe; //ResRe
__m128 *destI = (__m128*)pDstIm; //ResIm
for(int i = 0 ;i< count; count-=4, srcR1+=1,srcI1+=1, srcR2+=1,srcI2+=1, destR+=1,destI+=1)
{
m1 = _mm_mul_ps( *srcR1, *srcR2); //(a1Re*b1Re)
m2 = _mm_mul_ps(*srcI1,*srcI2); //(a1Im*b1Im)
m3 = _mm_mul_ps(*srcR1,*srcI2); //(a1Re*b1Im)
m4 = _mm_mul_ps(*srcI1,*srcR2); //(a1Im*b1Re)
*destR = _mm_add_ps(m1,m2); // (a1Re*b1Re)+(a1Im*b1Im)
*destI = _mm_sub_ps(m4,m3); //(a1Im*b1Re)-(a1Re*b1Im)
}
}
I have limitation not to disturb the application framework i.e changing the interface. I have to use vectors. And I don't know how to force the vector to have 16-byte aligned allocation. I have to optimised the speed for this function. Kindly suggest the options for the following
1. Either a method to allocate vector 16-byte aligned
2. Or a fast IPP based method to implement this function
3. SSE implementation to handle non-aligned vectors.
Hoping to get a speedy reply from the experts.
Note: I have just started working on IPP/SSE. its just a week into my first IPP/SSE routine.
Regards
Rohit
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You canalsouse_mm_malloc function:
...
__m128 *pVec1 = ( __m128 * )_mm_malloc( _RTVECTOR_SIZE * sizeof( __m128 ), 16 );
...
Best regards,
Sergey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I think I've already provided an answer at http://software.intel.com/en-us/forums/showthread.php?t=101316 thread.
Regards,
Igor

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page