- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

It would save memory bandwidth if there was a function that would multiply two complex vectorswith the result being as if one of thesource vectors had been conjugated. This is currently a two step process of conjugating one vector and then performing the complex vector multiply.

Please consider this for a future release.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

sorry, yesterday I just had 10min to "hack" some buggy code

here's the correction

[bash]//the C version

void MulConjCompC( float* pSrc1, float* pSrc2, float* pDst, int count)

{

for( int i=0; i{

pDst[0] = pSrc1[0]*pSrc2[0] - pSrc1[1]*pSrc2[1];

pDst[1] = pSrc1[1]*pSrc2[0] - pSrc1[0]*pSrc2[1];

}

}

//2 complex values with 1 loop

void MulConjCompSSE( float* pSrc1, float* pSrc2, float* pDst, int count)

{

assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) & 0xF)); //check for 16byte align

__m128* src1 = (__m128*)pSrc1;

__m128* src2 = (__m128*)pSrc2;

__m128* dst = (__m128*)pDst;

for( ; count>0; count-=4, src1++, src2++, dst++)

{

__m128 d1 = _mm_mul_ps( src1[0], src2[0]);

__m128 ds = _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1));

__m128 d2 = _mm_mul_ps( ds, src2[0]);

ds = _mm_hsub_ps( d1, d2); //horizontally add 2 values

*dst = _mm_shuffle_ps( ds, ds, _MM_SHUFFLE(3, 1, 2, 0));

}

MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count); //rest

}

//4 complex values with 1 loop

void MulConjCompSSE3_2( float* pSrc1, float* pSrc2, float* pDst, int count)

{

assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) & 0xF)); //check for 16byte align

__m128* src1 = (__m128*)pSrc1;

__m128* src2 = (__m128*)pSrc2;

__m128* dst = (__m128*)pDst;

for( ; count>0; count-=8, src1+=2, src2+=2, dst+=2)

{

__m128 d1 = _mm_mul_ps( src1[0], src2[0]);

__m128 d2 = _mm_mul_ps( _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1)), src2[0]);

__m128 e1 = _mm_mul_ps( src1[1], src2[1]);

__m128 e2 = _mm_mul_ps( _mm_shuffle_ps( src1[1], src1[1], _MM_SHUFFLE(2, 3, 0, 1)), src2[1]);

__m128 f1 = _mm_hsub_ps( d1, e1);

__m128 f2 = _mm_hsub_ps( d2, e2);

dst[0] = _mm_unpacklo_ps( f1, f2);

dst[1] = _mm_unpackhi_ps( f1, f2);

}

MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count); //rest

}

[/bash]

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

If I understand MulPackConj correctly, I would have to convert my complex data to packed format, multiply, and then convert it back. My data is stored as Ipp32fc.

My main objective with this request is to reduce the memory footprint and/or memory accesses required to perform the multiplication. The conj() and mul() sequence that I currently use requires 3 reads and 2 writes of complex data to get the job done when it is possible (when combined) to use 2 reads and 1 write.

I would hazard a guess that conjugate multiply is already used internally by some of the other algorithms provided. It is a relatively common operation in singal processing involving complex data.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I have been there also, trying to get IPP DFT effective.

My problem was that I couldn't find any good FFT/DFT samples from IPP.

I'd like to have a good sample that loads a grayscale image, and then perform some frequency domain filtering, and then saving the result. The frequency domain filter could be a butterworth filter for example.

If this was available, we'd be able to continue with other filters. Google only indicates general frequency domain theory. Here, we need specific Intel IPP information, to get top performance.

Maybe Intel could expand Picnic to include a simple FFT/DFT workbench function.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Ok, it would be a nice to have,

but if you're sure your pointers are 16 byte aligned and your processor supports sse3, it's really easy to implement in SSE intrinsics (_mm_mul_ps/_mm_hadd_ps for single precision)

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Thanks for the suggestions.

I haven't written any sse intrinsic code in Visual Studio. I did some monkey see monkey do code changes in Unix/gcc in a previous life that used them.

What would that look like in Visual Studio C/C++? Or do you have a good reference link?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

it will look like

[bash]void MulConjCompC( float* pSrc1, float* pSrc2, float* pDst, int count) { for( int i=0; i

I did not compare the outputs of SSE implementation and C, maybe the imaginary part of SSE has the wrong sign.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Thanks a bunch. I will try this out and see what kind of change it has on the timing of my algorithm.

I would still like Intel to supply the functionality so it will work on all platforms and be supported/optimized by them on future processors.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

sorry, yesterday I just had 10min to "hack" some buggy code

here's the correction

[bash]//the C version

void MulConjCompC( float* pSrc1, float* pSrc2, float* pDst, int count)

{

for( int i=0; i{

pDst[0] = pSrc1[0]*pSrc2[0] - pSrc1[1]*pSrc2[1];

pDst[1] = pSrc1[1]*pSrc2[0] - pSrc1[0]*pSrc2[1];

}

}

//2 complex values with 1 loop

void MulConjCompSSE( float* pSrc1, float* pSrc2, float* pDst, int count)

{

assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) & 0xF)); //check for 16byte align

__m128* src1 = (__m128*)pSrc1;

__m128* src2 = (__m128*)pSrc2;

__m128* dst = (__m128*)pDst;

for( ; count>0; count-=4, src1++, src2++, dst++)

{

__m128 d1 = _mm_mul_ps( src1[0], src2[0]);

__m128 ds = _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1));

__m128 d2 = _mm_mul_ps( ds, src2[0]);

ds = _mm_hsub_ps( d1, d2); //horizontally add 2 values

*dst = _mm_shuffle_ps( ds, ds, _MM_SHUFFLE(3, 1, 2, 0));

}

MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count); //rest

}

//4 complex values with 1 loop

void MulConjCompSSE3_2( float* pSrc1, float* pSrc2, float* pDst, int count)

{

assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) & 0xF)); //check for 16byte align

__m128* src1 = (__m128*)pSrc1;

__m128* src2 = (__m128*)pSrc2;

__m128* dst = (__m128*)pDst;

for( ; count>0; count-=8, src1+=2, src2+=2, dst+=2)

{

__m128 d1 = _mm_mul_ps( src1[0], src2[0]);

__m128 d2 = _mm_mul_ps( _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1)), src2[0]);

__m128 e1 = _mm_mul_ps( src1[1], src2[1]);

__m128 e2 = _mm_mul_ps( _mm_shuffle_ps( src1[1], src1[1], _MM_SHUFFLE(2, 3, 0, 1)), src2[1]);

__m128 f1 = _mm_hsub_ps( d1, e1);

__m128 f2 = _mm_hsub_ps( d2, e2);

dst[0] = _mm_unpacklo_ps( f1, f2);

dst[1] = _mm_unpackhi_ps( f1, f2);

}

MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count); //rest

}

[/bash]

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

renegr - I didn't expect someone to take the time to actually code it. Many thanks. Your help is very much appreciated.

Any one else that is interested,

Since what I actually wanted was dst = src1 * conj(src2) the real calculation inMulConjCompC() should be:

pDst[0] = pSrc1[0]*pSrc2[0]+ pSrc1[1]*pSrc2[1]; // notice the + instead of -

and therefor I think the first mm_hsub_ps() in MulConjCompSSE3_2() should be:

__m128 f1 = _mm_hadd_ps( d1, e1); // noitice the hadd instead of hsub

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Of course you're right (it was correct in the c function)

I'm currently gaining my SSE experiences so it was a nice practice. Would be nice if you could provide some measures on how your performance did increase by this functions.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi,

I have a similar need to have a Complex multiply by conjugate for arrayfunction. Currently the function is a C implementation in my application. The function takes a quite a good amount of time in my application.

I have to optimize the implementation using IPP. I triedtwo versions (given below )but both are slower then the C version.

//IPP implementation

IppStatus ippStats;

Ipp32fc *pSrcA, *pSrcB, *pDst; // Typecasting input and outpointers in IPP Complex format

/*//Implement-1

// p + qj = (aR+aI*j)*(bR-bI*j)=(aR*bR+aI*bI) + (aI*bR-aR*bI)j

ippsMul_32f(pAreal,pBreal,pCreal,Np);

ippsAddProduct_32f(pAimag,pBimag,pCreal,Np);

//Calculating q = (aI*bR-aR*bI)

ippsMul_32f(pAimag,pBreal,pCimag,Np);

//Make -bI vector from bI vector

ippsSubCRev_32f_I(0,(Ipp32f *)pBimag,Np);

ippsAddProduct_32f(pAreal,pBimag,pCreal,Np);

*/

//Implement-2

//Allocate memory for pSrcA, pSrcB and pSrcC buffer

pSrcA = ippsMalloc_32fc(3*Np);

pSrcB = &pSrcA[Np];//ippsMalloc_32fc(Np);

pDst = &pSrcA[2*Np];//ippsMalloc_32fc(Np);

if((pSrcA==0)||(pDst==0)||(pSrcB==0))

{

IPP_ERR_STATUS_var = ippStsMemAllocErr;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//First Convert 2 separte IQ buffers to single complex number buffer

//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);

ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);

ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);

//Using IPP API:-

//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);

ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, Np);

if(ippStats!=ippStsNoErr)

{

IPP_ERR_STATUS_var = ippStats;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//Now Convert output complex buffer to 2 separte IQ buffers

//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);

ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

Then I implemented the SSE version of the function as

void MulConjCompSSE( float* pSrcRe1,float* pSrcIm1, float* pSrcRe2,float* pSrcIm2, float* pDstRe,float* pDstIm, int count)

{

/*16-byte alignment check*/

{

__m128 m1, m2, m3, m4;

//__m128 sr1,sr2,si1,si2,dr1,dr2;

// __declspec(align(16)) float *pSrc1re2;

__m128 *srcR1 = (__m128*)pSrcRe1; //a1re

__m128 *srcI1 = (__m128*)pSrcIm1; //a1Im

__m128 *srcR2 = (__m128*)pSrcRe2; //b1re

__m128 *srcI2 = (__m128*)pSrcIm2; //b1Im

__m128 *destR = (__m128*)pDstRe; //ResRe

__m128 *destI = (__m128*)pDstIm; //ResIm

for(int i = 0 ;i< count; count-=4, srcR1+=1,srcI1+=1, srcR2+=1,srcI2+=1, destR+=1,destI+=1)

{

m1 = _mm_mul_ps( *srcR1, *srcR2); //(a1Re*b1Re)

m2 = _mm_mul_ps(*srcI1,*srcI2); //(a1Im*b1Im)

m3 = _mm_mul_ps(*srcR1,*srcI2); //(a1Re*b1Im)

m4 = _mm_mul_ps(*srcI1,*srcR2); //(a1Im*b1Re)

*destR = _mm_add_ps(m1,m2); // (a1Re*b1Re)+(a1Im*b1Im)

*destI = _mm_sub_ps(m4,m3); //(a1Im*b1Re)-(a1Re*b1Im)

}

mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real

const Float_t *pAimag, // source of array 'A' imag

const Float_t *pBreal, // source of array 'B' real

const Float_t *pBimag, // source of array 'B' imag

Uword32 Np, // number of points

Float_t *pCreal, // Dest for real part

Float_t *pCimag)

C version

int s;

float ar, ai, br, bi;

for (s = 0; s < count; s++) // for each array element

{

ar = *pSrcRe1++; // get the real and imaginary parts

ai = *pSrcIm1++;

br = *pSrcRe2++;

bi = *pSrcIm2++;

*pDstRe++ = ar * br + ai * bi; // do the complex multiply

*pDstIm++ = ai * br - ar * bi; // B conjugate of B

}

This function is very fast compare to C as long all the input vectors are 16-byte aligned. I have limitation not to change the current framework of application. And I am not sure how to force a vector to be 16-byte aligned. Kindly suggest what are the options I can try

1. algin vectors to 16-byte.

2. or using IPP to have faster routine.

Hoping to get a speedy reply from experts.

Note: I have just started using IPP/SSE. Its hardly a week's experience.

Regards

Rohit

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

**//////// Posted Twice/////////**

Hi,

I have similar requirementto have Complex multiply by conjugate function for array . Currently the function is a C implementation. The function takes quite a good amount of time in my application. I have to optimise wit either IPP or SSE.

Details on the function

Hi,

I have similar requirementto have Complex multiply by conjugate function for array . Currently the function is a C implementation. The function takes quite a good amount of time in my application. I have to optimise wit either IPP or SSE.

Details on the function

--------------------------

Description:Array A x Comp Conj of B; Result in C.

mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real

const Float_t *pAimag, // source of array 'A' imag

const Float_t *pBreal, // source of array 'B' real

const Float_t *pBimag, // source of array 'B' imag

Uword32 Np, // number of points

Float_t *pCreal, // Dest for real part

Float_t *pCimag) // Dest for imag part

I have tried two version of IPP implementation. But both version are slower then C.

//IPP implementation

IppStatus ippStats;

Ipp32fc *pSrcA, *pSrcB, *pDst; // Typecasting input and outpointers in IPP Complex format

/*//Implement-1

// p + qj = (aR+aI*j)*(bR-bI*j)=(aR*bR+aI*bI) + (aI*bR-aR*bI)j

ippsMul_32f(pAreal,pBreal,pCreal,Np);

ippsAddProduct_32f(pAimag,pBimag,pCreal,Np);

//Calculating q = (aI*bR-aR*bI)

ippsMul_32f(pAimag,pBreal,pCimag,Np);

//Make -bI vector from bI vector

ippsSubCRev_32f_I(0,(Ipp32f *)pBimag,Np);

ippsAddProduct_32f(pAreal,pBimag,pCreal,Np);

*/

//Implement-2

//Allocate memory for pSrcA, pSrcB and pSrcC buffer

pSrcA = ippsMalloc_32fc(3*Np);

pSrcB = &pSrcA[Np];//ippsMalloc_32fc(Np);

pDst = &pSrcA[2*Np];//ippsMalloc_32fc(Np);

if((pSrcA==0)||(pDst==0)||(pSrcB==0))

{

IPP_ERR_STATUS_var = ippStsMemAllocErr;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//First Convert 2 separte IQ buffers to single complex number buffer

//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);

ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);

ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);

//Using IPP API:-

//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);

ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, Np);

if(ippStats!=ippStsNoErr)

{

IPP_ERR_STATUS_var = ippStats;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//Now Convert output complex buffer to 2 separte IQ buffers

//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);

ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);**Then I implemented the function using SSE. Which is quite fast and works perfect in my application as long as the vectors are 16-byte aligned. **

void MulConjCompSSE( float* pSrcRe1,float* pSrcIm1, float* pSrcRe2,float* pSrcIm2, float* pDstRe,float* pDstIm, int count)

{

//Check for 16-byte alignment

{

__m128 m1, m2, m3, m4;

//__m128 sr1,sr2,si1,si2,dr1,dr2;

// __declspec(align(16)) float *pSrc1re2;

__m128 *srcR1 = (__m128*)pSrcRe1; //a1re

__m128 *srcI1 = (__m128*)pSrcIm1; //a1Im

__m128 *srcR2 = (__m128*)pSrcRe2; //b1re

__m128 *srcI2 = (__m128*)pSrcIm2; //b1Im

__m128 *destR = (__m128*)pDstRe; //ResRe

__m128 *destI = (__m128*)pDstIm; //ResIm

for(int i = 0 ;i< count; count-=4, srcR1+=1,srcI1+=1, srcR2+=1,srcI2+=1, destR+=1,destI+=1)

{

m1 = _mm_mul_ps( *srcR1, *srcR2); //(a1Re*b1Re)

m2 = _mm_mul_ps(*srcI1,*srcI2); //(a1Im*b1Im)

m3 = _mm_mul_ps(*srcR1,*srcI2); //(a1Re*b1Im)

m4 = _mm_mul_ps(*srcI1,*srcR2); //(a1Im*b1Re)

*destR = _mm_add_ps(m1,m2); // (a1Re*b1Re)+(a1Im*b1Im)

*destI = _mm_sub_ps(m4,m3); //(a1Im*b1Re)-(a1Re*b1Im)

}

}**I have limitation not to disturb the application framework i.e changing the interface. I have to use vectors. And I don't know how to force the vector to have 16-byte aligned allocation. I have to optimised the speed for this function. Kindly suggest the options for the following1. Either a method to allocate vector 16-byte aligned2. Or a fast IPP based method to implement this function3. SSE implementation to handle non-aligned vectors.Hoping to get a speedy reply from the experts.Note: I have just started working on IPP/SSE. its just a week into my first IPP/SSE routine.RegardsRohit**

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

**__declspec( align(16) )**declaration but it is commented out. Did you have any problems?

You canalsouse

**_mm_malloc**function:

...

__m128 *pVec1 = ( __m128 * )

**_mm_malloc**( _RTVECTOR_SIZE * sizeof( __m128 ), 16 );

...

Best regards,

Sergey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I think I've already provided an answer at http://software.intel.com/en-us/forums/showthread.php?t=101316 thread.

Regards,

Igor

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page