<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Complex multiply by conjugate in Intel® Integrated Performance Primitives</title>
    <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899296#M12574</link>
    <pubDate>Mon, 21 Nov 2011 11:28:33 GMT</pubDate>
    <dc:creator>rohitspandey</dc:creator>
    <dc:date>2011-11-21T11:28:33Z</dc:date>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899284#M12562</link>
      <description>&lt;P&gt;It would save memory bandwidth if there were a function that multiplied two complex vectors with the result being as if one of the source vectors had been conjugated. This is currently a two-step process: conjugating one vector and then performing the complex vector multiply.&lt;/P&gt;
&lt;P&gt;Please consider this for a future release.&lt;/P&gt;
</description>
      <pubDate>Sat, 06 Feb 2010 18:55:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899284#M12562</guid>
      <dc:creator>Eric2</dc:creator>
      <dc:date>2010-02-06T18:55:19Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899285#M12563</link>
      <description>Couldn't you use ippsMulPackConj_%?</description>
      <pubDate>Mon, 08 Feb 2010 16:03:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899285#M12563</guid>
      <dc:creator>renegr</dc:creator>
      <dc:date>2010-02-08T16:03:42Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899286#M12564</link>
      <description>&lt;P&gt;If I understand MulPackConj correctly, I would have to convert my complex data to packed format, multiply, and then convert it back. My data is stored as Ipp32fc.&lt;/P&gt;
&lt;P&gt;My main objective with this request is to reduce the memory footprint and/or memory accesses required to perform the multiplication. The conj() and mul() sequence that I currently use requires 3 reads and 2 writes of complex data to get the job done when it is possible (when combined) to use 2 reads and 1 write.&lt;/P&gt;
&lt;P&gt;I would hazard a guess that conjugate multiply is already used internally by some of the other algorithms provided. It is a relatively common operation in signal processing involving complex data.&lt;/P&gt;
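&lt;P&gt;A minimal sketch of the fused operation described above, assuming the usual interleaved re/im layout of Ipp32fc data viewed as plain floats. The function name `mul_by_conj_fused` is hypothetical and not part of IPP; each source element is read once and each result is written once.&lt;/P&gt;
&lt;PRE&gt;
```c
#include "assert.h"
#include "stddef.h"

/* Hypothetical helper, not an IPP function: dst[k] = a[k] * conj(b[k]).
 * Each array holds n complex values stored interleaved as re, im, re, im, ...
 * One pass: two reads and one write per complex element. */
void mul_by_conj_fused(const float *a, const float *b, float *dst, size_t n)
{
    size_t k;
    assert(a != NULL);
    assert(b != NULL);
    assert(dst != NULL);
    for (k = 0; k != n; ++k) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        dst[2 * k]     = ar * br + ai * bi;  /* real part */
        dst[2 * k + 1] = ai * br - ar * bi;  /* imaginary part */
    }
}
```
&lt;/PRE&gt;
&lt;P&gt;Compared with a separate conjugate pass followed by a multiply, this avoids writing and re-reading the conjugated temporary.&lt;/P&gt;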
</description>
      <pubDate>Mon, 08 Feb 2010 21:10:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899286#M12564</guid>
      <dc:creator>Eric2</dc:creator>
      <dc:date>2010-02-08T21:10:34Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899287#M12565</link>
      <description>&lt;P&gt;I have been there too, trying to use the IPP DFT effectively.&lt;/P&gt;
&lt;P&gt;My problem was that I couldn't find any good FFT/DFT samples for IPP.&lt;/P&gt;
&lt;P&gt;I'd like to have a good sample that loads a grayscale image, performs some frequency-domain filtering, and then saves the result. The frequency-domain filter could be a Butterworth filter, for example.&lt;/P&gt;
&lt;P&gt;If this was available, we'd be able to continue with other filters. Google only indicates general frequency domain theory. Here, we need specific Intel IPP information, to get top performance.&lt;/P&gt;
&lt;P&gt;Maybe Intel could expand Picnic to include a simple FFT/DFT workbench function.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2010 21:29:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899287#M12565</guid>
      <dc:creator>Thomas_Jensen1</dc:creator>
      <dc:date>2010-02-08T21:29:38Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899288#M12566</link>
      <description>&lt;P&gt;OK, it would be nice to have,&lt;/P&gt;
&lt;P&gt;but if you're sure your pointers are 16-byte aligned and your processor supports SSE3, it's really easy to implement with SSE intrinsics (_mm_mul_ps/_mm_hadd_ps for single precision).&lt;/P&gt;</description>
      <pubDate>Tue, 09 Feb 2010 11:28:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899288#M12566</guid>
      <dc:creator>renegr</dc:creator>
      <dc:date>2010-02-09T11:28:39Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899289#M12567</link>
      <description>&lt;P&gt;Thanks for the suggestions.&lt;/P&gt;
&lt;P&gt;I haven't written any SSE intrinsic code in Visual Studio. In a previous life I made some monkey-see, monkey-do changes to Unix/gcc code that used them.&lt;/P&gt;
&lt;P&gt;What would that look like in Visual Studio C/C++? Or do you have a good reference link?&lt;/P&gt;</description>
      <pubDate>Tue, 09 Feb 2010 14:34:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899289#M12567</guid>
      <dc:creator>Eric2</dc:creator>
      <dc:date>2010-02-09T14:34:43Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899290#M12568</link>
      <description>&lt;P&gt;It will look like this:&lt;/P&gt;
&lt;PRE&gt;void MulConjCompC( float* pSrc1, float* pSrc2, float* pDst, int count)
{
  for( int i=0; i&amp;lt;count
&lt;/PRE&gt;
&lt;P&gt;I did not compare the outputs of the SSE implementation and the C version; the imaginary part of the SSE result may have the wrong sign.&lt;/P&gt;</description>
      <pubDate>Tue, 09 Feb 2010 16:46:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899290#M12568</guid>
      <dc:creator>renegr</dc:creator>
      <dc:date>2010-02-09T16:46:57Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899291#M12569</link>
      <description>&lt;P&gt;Thanks a bunch. I will try this out and see what kind of change it has on the timing of my algorithm.&lt;/P&gt;
&lt;P&gt;I would still like Intel to supply the functionality so it will work on all platforms and be supported/optimized by them on future processors.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Feb 2010 00:51:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899291#M12569</guid>
      <dc:creator>Eric2</dc:creator>
      <dc:date>2010-02-10T00:51:59Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899292#M12570</link>
      <description>&lt;P&gt;Sorry, yesterday I had just 10 minutes to "hack" some buggy code.&lt;/P&gt;
&lt;P&gt;Here's the correction:&lt;/P&gt;
&lt;BR /&gt;
&lt;PRE&gt;#include &amp;lt;assert.h&amp;gt;
#include &amp;lt;pmmintrin.h&amp;gt;  //SSE3: _mm_hsub_ps
#include &amp;lt;windows.h&amp;gt;    //INT_PTR

//the C version; count is the number of floats (2 per complex value)
void MulConjCompC( float* pSrc1, float* pSrc2, float* pDst, int count)
{
  for( int i=0; i&amp;lt;count; i+=2, pSrc1+=2, pSrc2+=2, pDst+=2)
  {
    pDst[0] = pSrc1[0]*pSrc2[0] - pSrc1[1]*pSrc2[1];
    pDst[1] = pSrc1[1]*pSrc2[0] - pSrc1[0]*pSrc2[1];
  }
}

//2 complex values with 1 loop
void MulConjCompSSE( float* pSrc1, float* pSrc2, float* pDst, int count)
{
  assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) &amp;amp; 0xF));  //check for 16-byte alignment
  __m128* src1 = (__m128*)pSrc1;
  __m128* src2 = (__m128*)pSrc2;
  __m128* dst = (__m128*)pDst;
  for( ; count&amp;gt;=4; count-=4, src1++, src2++, dst++)  //stop while a full vector remains
  {
    __m128 d1 = _mm_mul_ps( src1[0], src2[0]);
    __m128 ds = _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1));  //swap re and im
    __m128 d2 = _mm_mul_ps( ds, src2[0]);
    ds = _mm_hsub_ps( d1, d2);  //horizontally subtract adjacent pairs
    *dst = _mm_shuffle_ps( ds, ds, _MM_SHUFFLE(3, 1, 2, 0));  //re-interleave re and im
  }
  MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count);  //rest
}

//4 complex values with 1 loop
void MulConjCompSSE3_2( float* pSrc1, float* pSrc2, float* pDst, int count)
{
  assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) &amp;amp; 0xF)); //check for 16-byte alignment
  __m128* src1 = (__m128*)pSrc1;
  __m128* src2 = (__m128*)pSrc2;
  __m128* dst = (__m128*)pDst;
  for( ; count&amp;gt;=8; count-=8, src1+=2, src2+=2, dst+=2)  //stop while two full vectors remain
  {
    __m128 d1 = _mm_mul_ps( src1[0], src2[0]);
    __m128 d2 = _mm_mul_ps( _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1)), src2[0]);
    __m128 e1 = _mm_mul_ps( src1[1], src2[1]);
    __m128 e2 = _mm_mul_ps( _mm_shuffle_ps( src1[1], src1[1], _MM_SHUFFLE(2, 3, 0, 1)), src2[1]);
    __m128 f1 = _mm_hsub_ps( d1, e1);
    __m128 f2 = _mm_hsub_ps( d2, e2);
    dst[0] = _mm_unpacklo_ps( f1, f2);
    dst[1] = _mm_unpackhi_ps( f1, f2);
  }
  MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count); //rest
}
&lt;/PRE&gt;
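&lt;P&gt;The loops above assert 16-byte alignment. As a hedged sketch for callers that cannot guarantee it, the variant below uses unaligned loads and stores (`_mm_loadu_ps`, `_mm_storeu_ps`) and only SSE1 instructions, and computes dst = src1 * conj(src2) with a + in the real part. The name `mul_by_conj_unaligned`, the `MULCONJ_HAVE_SSE` macro and the convention that n counts complex values are assumptions made for illustration; on targets without SSE it falls back to the scalar loop.&lt;/P&gt;
&lt;PRE&gt;
```c
#include "assert.h"
#include "stddef.h"
#if defined(__SSE__) || defined(_M_X64)
#include "xmmintrin.h"
#define MULCONJ_HAVE_SSE 1
#else
#define MULCONJ_HAVE_SSE 0
#endif

/* Hypothetical helper: dst[k] = a[k] * conj(b[k]) for n interleaved complex
 * values (re, im, re, im, ...). No alignment requirement on any pointer. */
void mul_by_conj_unaligned(const float *a, const float *b, float *dst, size_t n)
{
    size_t k = 0;
    assert(a != NULL);
    assert(b != NULL);
    assert(dst != NULL);
#if MULCONJ_HAVE_SSE
    {
        /* Lane signs (+1, -1, +1, -1): add in real lanes, subtract in imaginary lanes. */
        const __m128 sign = _mm_set_ps(-1.0f, 1.0f, -1.0f, 1.0f);
        /* Two complex values (four floats) per iteration. */
        for (; n - k >= 2; k += 2) {
            __m128 va = _mm_loadu_ps(a + 2 * k);                         /* ar0 ai0 ar1 ai1 */
            __m128 vb = _mm_loadu_ps(b + 2 * k);                         /* br0 bi0 br1 bi1 */
            __m128 br = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(2, 2, 0, 0)); /* br0 br0 br1 br1 */
            __m128 bi = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 3, 1, 1)); /* bi0 bi0 bi1 bi1 */
            __m128 sw = _mm_shuffle_ps(va, va, _MM_SHUFFLE(2, 3, 0, 1)); /* ai0 ar0 ai1 ar1 */
            __m128 t1 = _mm_mul_ps(va, br);                              /* ar*br, ai*br */
            __m128 t2 = _mm_mul_ps(_mm_mul_ps(sw, bi), sign);            /* +ai*bi, -ar*bi */
            _mm_storeu_ps(dst + 2 * k, _mm_add_ps(t1, t2));
        }
    }
#endif
    /* Scalar tail, or the whole array when SSE is unavailable. */
    for (; k != n; ++k) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        dst[2 * k]     = ar * br + ai * bi;
        dst[2 * k + 1] = ai * br - ar * bi;
    }
}
```
&lt;/PRE&gt;
&lt;P&gt;Unaligned loads can be slower than aligned ones on older processors, so it is worth measuring before relying on this variant.&lt;/P&gt;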
</description>
      <pubDate>Wed, 10 Feb 2010 11:24:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899292#M12570</guid>
      <dc:creator>renegr</dc:creator>
      <dc:date>2010-02-10T11:24:37Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899293#M12571</link>
      <description>&lt;P&gt;renegr - I didn't expect someone to take the time to actually code it. Many thanks. Your help is very much appreciated.&lt;/P&gt;
&lt;P&gt;For anyone else who is interested:&lt;/P&gt;
&lt;P&gt;Since what I actually wanted was dst = src1 * conj(src2), the real-part calculation in MulConjCompC() should be:&lt;/P&gt;
&lt;P&gt;pDst[0] = pSrc1[0]*pSrc2[0] + pSrc1[1]*pSrc2[1]; // notice the + instead of -&lt;/P&gt;
&lt;P&gt;and therefore I think the first _mm_hsub_ps() in MulConjCompSSE3_2() should be:&lt;/P&gt;
&lt;P&gt;__m128 f1 = _mm_hadd_ps( d1, e1); // notice the hadd instead of hsub&lt;/P&gt;
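&lt;P&gt;Putting both sign changes together, a sketch of the four-values-per-iteration kernel might look as follows. It assumes SSE3 is enabled at compile time (the `__SSE3__` macro, as set by gcc or clang with -msse3; MSVC does not define it), uses unaligned loads so that no alignment is required, and falls back to scalar code otherwise. The name `mul_by_conj_sse3` and the use of n as a count of complex values are hypothetical.&lt;/P&gt;
&lt;PRE&gt;
```c
#include "assert.h"
#include "stddef.h"
#if defined(__SSE3__)
#include "pmmintrin.h"   /* SSE3: _mm_hadd_ps, _mm_hsub_ps */
#endif

/* Hypothetical kernel: dst[k] = a[k] * conj(b[k]) for n interleaved complex values. */
void mul_by_conj_sse3(const float *a, const float *b, float *dst, size_t n)
{
    size_t k = 0;
    assert(a != NULL);
    assert(b != NULL);
    assert(dst != NULL);
#if defined(__SSE3__)
    /* Four complex values (eight floats) per iteration. */
    for (; n - k >= 4; k += 4) {
        __m128 a0 = _mm_loadu_ps(a + 2 * k);
        __m128 a1 = _mm_loadu_ps(a + 2 * k + 4);
        __m128 b0 = _mm_loadu_ps(b + 2 * k);
        __m128 b1 = _mm_loadu_ps(b + 2 * k + 4);
        __m128 d1 = _mm_mul_ps(a0, b0);   /* ar*br, ai*bi pairs */
        __m128 d2 = _mm_mul_ps(_mm_shuffle_ps(a0, a0, _MM_SHUFFLE(2, 3, 0, 1)), b0); /* ai*br, ar*bi */
        __m128 e1 = _mm_mul_ps(a1, b1);
        __m128 e2 = _mm_mul_ps(_mm_shuffle_ps(a1, a1, _MM_SHUFFLE(2, 3, 0, 1)), b1);
        __m128 f1 = _mm_hadd_ps(d1, e1);  /* four real parts: ar*br + ai*bi */
        __m128 f2 = _mm_hsub_ps(d2, e2);  /* four imaginary parts: ai*br - ar*bi */
        _mm_storeu_ps(dst + 2 * k,     _mm_unpacklo_ps(f1, f2));  /* re0 im0 re1 im1 */
        _mm_storeu_ps(dst + 2 * k + 4, _mm_unpackhi_ps(f1, f2));  /* re2 im2 re3 im3 */
    }
#endif
    /* Scalar tail, or the whole array when SSE3 is not enabled. */
    for (; k != n; ++k) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        dst[2 * k]     = ar * br + ai * bi;
        dst[2 * k + 1] = ai * br - ar * bi;
    }
}
```
&lt;/PRE&gt;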
</description>
      <pubDate>Sun, 14 Feb 2010 17:29:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899293#M12571</guid>
      <dc:creator>Eric2</dc:creator>
      <dc:date>2010-02-14T17:29:49Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899294#M12572</link>
      <description>&lt;P&gt;Of course you're right (it was correct in the C function).&lt;/P&gt;
&lt;P&gt;I'm currently gaining SSE experience, so it was nice practice. It would be nice if you could share some measurements of how much these functions improved your performance.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Feb 2010 10:03:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899294#M12572</guid>
      <dc:creator>renegr</dc:creator>
      <dc:date>2010-02-16T10:03:29Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899295#M12573</link>
      <description>&lt;P style="text-align: left;"&gt;Hi,&lt;BR /&gt;&lt;BR /&gt;I have a similar need for a complex multiply-by-conjugate function for arrays. The function is currently implemented in C and takes a significant amount of time in my application.&lt;BR /&gt;I have to optimize the implementation using IPP. I tried two versions (given below), but both are slower than the C version.&lt;/P&gt;
&lt;PRE&gt;//IPP implementation
IppStatus ippStats;
Ipp32fc *pSrcA, *pSrcB, *pDst; // input and output pointers in IPP complex format
/*//Implement-1
// p + qj = (aR+aI*j)*(bR-bI*j)=(aR*bR+aI*bI) + (aI*bR-aR*bI)j
ippsMul_32f(pAreal,pBreal,pCreal,Np);
ippsAddProduct_32f(pAimag,pBimag,pCreal,Np);
//Calculating q = (aI*bR-aR*bI)
ippsMul_32f(pAimag,pBreal,pCimag,Np);
//Make -bI vector from bI vector
ippsSubCRev_32f_I(0,(Ipp32f *)pBimag,Np);
ippsAddProduct_32f(pAreal,pBimag,pCimag,Np); // accumulate into the imaginary result
*/
//Implement-2
//Allocate one buffer holding pSrcA, pSrcB and pDst
pSrcA = ippsMalloc_32fc(3*Np);
pSrcB = &amp;amp;pSrcA[Np];//ippsMalloc_32fc(Np);
pDst = &amp;amp;pSrcA[2*Np];//ippsMalloc_32fc(Np);
if((pSrcA==0)||(pDst==0)||(pSrcB==0))
{
  IPP_ERR_STATUS_var = ippStsMemAllocErr;
  myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);
  return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;
}
//First convert the 2 separate IQ buffers to a single complex buffer
//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);
ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);
ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);
//Using IPP API:-
//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);
ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, Np);
if(ippStats!=ippStsNoErr)
{
  IPP_ERR_STATUS_var = ippStats;
  myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);
  return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;
}
//Now convert the output complex buffer to 2 separate IQ buffers
//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);
ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);
myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);&lt;/PRE&gt;
&lt;P style="text-align: left;"&gt;Then I implemented an SSE version of the function:&lt;/P&gt;
&lt;PRE&gt;void MulConjCompSSE( float* pSrcRe1,float* pSrcIm1, float* pSrcRe2,float* pSrcIm2, float* pDstRe,float* pDstIm, int count)
{
  /*16-byte alignment check*/
  {
    __m128 m1, m2, m3, m4;
    //__m128 sr1,sr2,si1,si2,dr1,dr2;
    // __declspec(align(16)) float *pSrc1re2;
    __m128 *srcR1 = (__m128*)pSrcRe1; //a1re
    __m128 *srcI1 = (__m128*)pSrcIm1; //a1Im
    __m128 *srcR2 = (__m128*)pSrcRe2; //b1re
    __m128 *srcI2 = (__m128*)pSrcIm2; //b1Im
    __m128 *destR = (__m128*)pDstRe; //ResRe
    __m128 *destI = (__m128*)pDstIm; //ResIm

    for(int i = 0 ;i&amp;lt; count; count-=4, srcR1+=1,srcI1+=1, srcR2+=1,srcI2+=1, destR+=1,destI+=1)
    {
      m1 = _mm_mul_ps( *srcR1, *srcR2); //(a1Re*b1Re)
      m2 = _mm_mul_ps(*srcI1,*srcI2); //(a1Im*b1Im)
      m3 = _mm_mul_ps(*srcR1,*srcI2); //(a1Re*b1Im)
      m4 = _mm_mul_ps(*srcI1,*srcR2); //(a1Im*b1Re)
      *destR = _mm_add_ps(m1,m2); // (a1Re*b1Re)+(a1Im*b1Im)
      *destI = _mm_sub_ps(m4,m3); //(a1Im*b1Re)-(a1Re*b1Im)
    }
  }
}&lt;/PRE&gt;
&lt;P style="text-align: left;"&gt;Signature of the existing function and its C version:&lt;/P&gt;
&lt;PRE&gt;mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real
const Float_t *pAimag, // source of array 'A' imag
const Float_t *pBreal, // source of array 'B' real
const Float_t *pBimag, // source of array 'B' imag
Uword32 Np, // number of points
Float_t *pCreal, // Dest for real part
Float_t *pCimag)

//C version
int s;
float ar, ai, br, bi;
for (s = 0; s &amp;lt; count; s++) // for each array element
{
  ar = *pSrcRe1++; // get the real and imaginary parts
  ai = *pSrcIm1++;
  br = *pSrcRe2++;
  bi = *pSrcIm2++;
  *pDstRe++ = ar * br + ai * bi; // do the complex multiply
  *pDstIm++ = ai * br - ar * bi; // using the conjugate of B
}&lt;/PRE&gt;
&lt;P style="text-align: left;"&gt;The SSE function is very fast compared to the C version as long as all the input vectors are 16-byte aligned. I am not allowed to change the current framework of the application, and I am not sure how to force a vector to be 16-byte aligned. Kindly suggest which options I can try:&lt;BR /&gt;1. Align the vectors to 16 bytes.&lt;BR /&gt;2. Or use IPP to get a faster routine.&lt;BR /&gt;Hoping to get a speedy reply from the experts.&lt;BR /&gt;&lt;BR /&gt;Note: I have just started using IPP/SSE; I have hardly a week's experience.&lt;BR /&gt;&lt;BR /&gt;Regards&lt;BR /&gt;Rohit&lt;/P&gt;</description>
      <pubDate>Mon, 21 Nov 2011 10:28:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899295#M12573</guid>
      <dc:creator>rohitspandey</dc:creator>
      <dc:date>2011-11-21T10:28:52Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899296#M12574</link>
      <description>&lt;STRONG&gt;//////// Posted Twice/////////&lt;BR /&gt;&lt;BR /&gt;Hi,&lt;BR /&gt;&lt;BR /&gt;I have similar requirementto have Complex multiply by conjugate function for array . Currently the function is a C implementation. The function takes quite a good amount of time in my application. I have to optimise wit either IPP or SSE. &lt;BR /&gt;&lt;BR /&gt;Details on the function&lt;/STRONG&gt;&lt;BR /&gt;--------------------------&lt;BR /&gt;&lt;P&gt;Description:Array A x Comp Conj of B; Result in C.&lt;/P&gt;&lt;P&gt;mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real&lt;/P&gt;&lt;P&gt;const Float_t *pAimag, // source of array 'A' imag&lt;/P&gt;&lt;P&gt;const Float_t *pBreal, // source of array 'B' real&lt;/P&gt;&lt;P&gt;const Float_t *pBimag, // source of array 'B' imag&lt;/P&gt;&lt;P&gt;Uword32 Np, // number of points&lt;/P&gt;&lt;P&gt;Float_t *pCreal, // Dest for real part&lt;/P&gt;&lt;P&gt;Float_t *pCimag) // Dest for imag part&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;I have tried two version of IPP implementation. 
But both version are slower then C.&lt;/P&gt;&lt;P&gt;//IPP implementation&lt;/P&gt;&lt;P&gt;IppStatus ippStats;&lt;/P&gt;&lt;P&gt;Ipp32fc *pSrcA, *pSrcB, *pDst; // Typecasting input and outpointers in IPP Complex format&lt;/P&gt;&lt;P&gt;/*//Implement-1&lt;/P&gt;&lt;P&gt;// p + qj = (aR+aI*j)*(bR-bI*j)=(aR*bR+aI*bI) + (aI*bR-aR*bI)j&lt;/P&gt;&lt;P&gt;ippsMul_32f(pAreal,pBreal,pCreal,Np);&lt;/P&gt;&lt;P&gt;ippsAddProduct_32f(pAimag,pBimag,pCreal,Np);&lt;/P&gt;&lt;P&gt;//Calculating q = (aI*bR-aR*bI)&lt;/P&gt;&lt;P&gt;ippsMul_32f(pAimag,pBreal,pCimag,Np);&lt;/P&gt;&lt;P&gt;//Make -bI vector from bI vector&lt;/P&gt;&lt;P&gt;ippsSubCRev_32f_I(0,(Ipp32f *)pBimag,Np);&lt;/P&gt;&lt;P&gt;ippsAddProduct_32f(pAreal,pBimag,pCreal,Np);&lt;/P&gt;&lt;P&gt;*/&lt;/P&gt;&lt;P&gt;//Implement-2&lt;/P&gt;&lt;P&gt;//Allocate memory for pSrcA, pSrcB and pSrcC buffer&lt;/P&gt;&lt;P&gt;pSrcA = ippsMalloc_32fc(3*Np);&lt;/P&gt;&lt;P&gt;pSrcB = &amp;amp;pSrcA[Np];//ippsMalloc_32fc(Np);&lt;/P&gt;&lt;P&gt;pDst = &amp;amp;pSrcA[2*Np];//ippsMalloc_32fc(Np);&lt;/P&gt;&lt;P&gt;if((pSrcA==0)||(pDst==0)||(pSrcB==0))&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;IPP_ERR_STATUS_var = ippStsMemAllocErr;&lt;/P&gt;&lt;P&gt;myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);&lt;/P&gt;&lt;P&gt;return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;//First Convert 2 separte IQ buffers to single complex number buffer&lt;/P&gt;&lt;P&gt;//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);&lt;/P&gt;&lt;P&gt;ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);&lt;/P&gt;&lt;P&gt;ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);&lt;/P&gt;&lt;P&gt;//Using IPP API:-&lt;/P&gt;&lt;P&gt;//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);&lt;/P&gt;&lt;P&gt;ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, 
Np);&lt;/P&gt;&lt;P&gt;if(ippStats!=ippStsNoErr)&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;IPP_ERR_STATUS_var = ippStats;&lt;/P&gt;&lt;P&gt;myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);&lt;/P&gt;&lt;P&gt;return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;//Now Convert output complex buffer to 2 separte IQ buffers&lt;/P&gt;&lt;P&gt;//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);&lt;/P&gt;&lt;P&gt;ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);&lt;/P&gt;&lt;P&gt;myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Then I implemented the function using SSE. Which is quite fast and works perfect in my application as long as the vectors are 16-byte aligned. &lt;BR /&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;void MulConjCompSSE( float* pSrcRe1,float* pSrcIm1, float* pSrcRe2,float* pSrcIm2, float* pDstRe,float* pDstIm, int count)&lt;/P&gt;&lt;P&gt;{ &lt;/P&gt;&lt;P&gt;//Check for 16-byte alignment&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;__m128 m1, m2, m3, m4;&lt;/P&gt;&lt;P&gt;//__m128 sr1,sr2,si1,si2,dr1,dr2;&lt;/P&gt;&lt;P&gt;// __declspec(align(16)) float *pSrc1re2;&lt;/P&gt;&lt;P&gt;__m128 *srcR1 = (__m128*)pSrcRe1; //a1re&lt;/P&gt;&lt;P&gt;__m128 *srcI1 = (__m128*)pSrcIm1; //a1Im&lt;/P&gt;&lt;P&gt;__m128 *srcR2 = (__m128*)pSrcRe2; //b1re&lt;/P&gt;&lt;P&gt;__m128 *srcI2 = (__m128*)pSrcIm2; //b1Im&lt;/P&gt;&lt;P&gt;__m128 *destR = (__m128*)pDstRe; //ResRe&lt;/P&gt;&lt;P&gt;__m128 *destI = (__m128*)pDstIm; //ResIm&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;for(int i = 0 ;i&amp;lt; count; count-=4, srcR1+=1,srcI1+=1, srcR2+=1,srcI2+=1, destR+=1,destI+=1) &lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;m1 = _mm_mul_ps( *srcR1, *srcR2); //(a1Re*b1Re)&lt;/P&gt;&lt;P&gt;m2 = _mm_mul_ps(*srcI1,*srcI2); //(a1Im*b1Im)&lt;/P&gt;&lt;P&gt;m3 = _mm_mul_ps(*srcR1,*srcI2); //(a1Re*b1Im)&lt;/P&gt;&lt;P&gt;m4 = _mm_mul_ps(*srcI1,*srcR2); 
//(a1Im*b1Re)&lt;/P&gt;&lt;P&gt;*destR = _mm_add_ps(m1,m2); // (a1Re*b1Re)+(a1Im*b1Im)&lt;/P&gt;&lt;P&gt;*destI = _mm_sub_ps(m4,m3); //(a1Im*b1Re)-(a1Re*b1Im)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;} &lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;}&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;I am not allowed to disturb the application framework, i.e. I cannot change the interface. I have to use the vectors as they are passed in, and I don't know how to force them to be allocated on a 16-byte boundary. I have to optimise the speed of this function. Kindly suggest options for any of the following:&lt;BR /&gt;1. A method to allocate the vectors 16-byte aligned&lt;BR /&gt;2. A fast IPP-based method to implement this function&lt;BR /&gt;3. An SSE implementation that handles non-aligned vectors.&lt;BR /&gt;&lt;BR /&gt;Hoping to get a speedy reply from the experts.&lt;BR /&gt;&lt;BR /&gt;Note: I have just started working with IPP/SSE; I am only a week into my first IPP/SSE routine.&lt;BR /&gt;&lt;BR /&gt;Regards&lt;BR /&gt;Rohit&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 21 Nov 2011 11:28:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899296#M12574</guid>
      <dc:creator>rohitspandey</dc:creator>
      <dc:date>2011-11-21T11:28:33Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899297#M12575</link>
      <description>I see that you already tried to use the &lt;STRONG&gt;__declspec( align(16) )&lt;/STRONG&gt; declaration, but it is commented out. Did you have any problems with it?&lt;BR /&gt;&lt;BR /&gt;You can also use the &lt;STRONG&gt;_mm_malloc&lt;/STRONG&gt; function:&lt;BR /&gt;&lt;BR /&gt;...&lt;BR /&gt;__m128 *pVec1 = ( __m128 * )&lt;STRONG&gt;_mm_malloc&lt;/STRONG&gt;( _RTVECTOR_SIZE * sizeof( __m128 ), 16 );&lt;BR /&gt;...&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Sergey</description>
      <pubDate>Tue, 22 Nov 2011 14:41:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899297#M12575</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2011-11-22T14:41:17Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899298#M12576</link>
      <description>I have tried both options. They work when the vectors are defined within the function. My question is how to handle the case where the vectors passed to the function are not 16-byte aligned, i.e. when the caller supplies vectors that do not start on a 16-byte boundary.&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Regards&lt;/DIV&gt;&lt;DIV&gt;Rohit&lt;/DIV&gt;</description>
      <pubDate>Tue, 22 Nov 2011 15:37:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899298#M12576</guid>
      <dc:creator>rohitspandey</dc:creator>
      <dc:date>2011-11-22T15:37:52Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899299#M12577</link>
      <description>Hi Rohit,&lt;BR /&gt;&lt;BR /&gt;I think I've already provided an answer in the thread at &lt;A href="http://software.intel.com/en-us/forums/showthread.php?t=101316"&gt;http://software.intel.com/en-us/forums/showthread.php?t=101316&lt;/A&gt;.&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;Igor</description>
      <pubDate>Fri, 25 Nov 2011 07:43:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate/m-p/899299#M12577</guid>
      <dc:creator>igorastakhov</dc:creator>
      <dc:date>2011-11-25T07:43:22Z</dc:date>
    </item>
  </channel>
</rss>

