<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Complex multiply by conjugate function for array in Intel® Integrated Performance Primitives</title>
    <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate-function-for-array/m-p/788955#M2226</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a similar requirement: a complex multiply-by-conjugate function for arrays. The function is currently implemented in C and takes a considerable amount of time in my application, so I have to optimise it with either IPP or SSE.&lt;/P&gt;&lt;P&gt;Details on the function&lt;/P&gt;&lt;P&gt;--------------------------&lt;/P&gt;&lt;P&gt;Description: array A x complex conjugate of B; result in C.&lt;/P&gt;&lt;P&gt;mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real&lt;/P&gt;&lt;P&gt;const Float_t *pAimag, // source of array 'A' imag&lt;/P&gt;&lt;P&gt;const Float_t *pBreal, // source of array 'B' real&lt;/P&gt;&lt;P&gt;const Float_t *pBimag, // source of array 'B' imag&lt;/P&gt;&lt;P&gt;Uword32 Np, // number of points&lt;/P&gt;&lt;P&gt;Float_t *pCreal, // Dest for real part&lt;/P&gt;&lt;P&gt;Float_t *pCimag) // Dest for imag part&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have tried two versions of an IPP implementation, but both are slower than the C code.&lt;/P&gt;&lt;P&gt;//IPP implementation&lt;/P&gt;&lt;P&gt;IppStatus ippStats;&lt;/P&gt;&lt;P&gt;Ipp32fc *pSrcA, *pSrcB, *pDst; // input and output buffers in IPP complex format&lt;/P&gt;&lt;P&gt;/*//Implement-1&lt;/P&gt;&lt;P&gt;// p + qj = (aR+aI*j)*(bR-bI*j)=(aR*bR+aI*bI) + (aI*bR-aR*bI)j&lt;/P&gt;&lt;P&gt;//Calculating p = (aR*bR+aI*bI)&lt;/P&gt;&lt;P&gt;ippsMul_32f(pAreal,pBreal,pCreal,Np);&lt;/P&gt;&lt;P&gt;ippsAddProduct_32f(pAimag,pBimag,pCreal,Np);&lt;/P&gt;&lt;P&gt;//Calculating q = (aI*bR-aR*bI)&lt;/P&gt;&lt;P&gt;ippsMul_32f(pAimag,pBreal,pCimag,Np);&lt;/P&gt;&lt;P&gt;//Make -bI vector from bI vector (this negates the input array in place)&lt;/P&gt;&lt;P&gt;ippsSubCRev_32f_I(0,(Ipp32f *)pBimag,Np);&lt;/P&gt;&lt;P&gt;//Accumulate aR*(-bI) into the imaginary result&lt;/P&gt;&lt;P&gt;ippsAddProduct_32f(pAreal,pBimag,pCimag,Np);&lt;/P&gt;&lt;P&gt;*/&lt;/P&gt;
&lt;P&gt;//Implement-2&lt;/P&gt;&lt;P&gt;//Allocate memory for pSrcA, pSrcB and pDst buffers&lt;/P&gt;&lt;P&gt;pSrcA = ippsMalloc_32fc(3*Np);&lt;/P&gt;&lt;P&gt;pSrcB = &amp;amp;pSrcA[Np];//ippsMalloc_32fc(Np);&lt;/P&gt;&lt;P&gt;pDst = &amp;amp;pSrcA[2*Np];//ippsMalloc_32fc(Np);&lt;/P&gt;&lt;P&gt;if((pSrcA==0)||(pDst==0)||(pSrcB==0))&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;IPP_ERR_STATUS_var = ippStsMemAllocErr;&lt;/P&gt;&lt;P&gt;myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);&lt;/P&gt;&lt;P&gt;return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;//First convert the 2 separate IQ buffers to a single complex number buffer&lt;/P&gt;&lt;P&gt;//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);&lt;/P&gt;&lt;P&gt;ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);&lt;/P&gt;&lt;P&gt;ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);&lt;/P&gt;&lt;P&gt;//Using IPP API:-&lt;/P&gt;&lt;P&gt;//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);&lt;/P&gt;&lt;P&gt;ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, Np);&lt;/P&gt;&lt;P&gt;if(ippStats!=ippStsNoErr)&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;IPP_ERR_STATUS_var = ippStats;&lt;/P&gt;&lt;P&gt;myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);&lt;/P&gt;&lt;P&gt;return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;//Now convert the output complex buffer to 2 separate IQ buffers&lt;/P&gt;&lt;P&gt;//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);&lt;/P&gt;&lt;P&gt;ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);&lt;/P&gt;&lt;P&gt;myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);&lt;/P&gt;&lt;P&gt;Then I implemented the function using SSE, which is quite fast and works perfectly in my application as long as the vectors are 16-byte aligned.&lt;/P&gt;
&lt;P&gt;void MulConjCompSSE( float* pSrcRe1,float* pSrcIm1, float* pSrcRe2,float* pSrcIm2, float* pDstRe,float* pDstIm, int count)&lt;/P&gt;&lt;P&gt;{ &lt;/P&gt;&lt;P&gt;//Check for 16-byte alignment&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;__m128 m1, m2, m3, m4;&lt;/P&gt;&lt;P&gt;//__m128 sr1,sr2,si1,si2,dr1,dr2;&lt;/P&gt;&lt;P&gt;// __declspec(align(16)) float *pSrc1re2;&lt;/P&gt;&lt;P&gt;__m128 *srcR1 = (__m128*)pSrcRe1; //a1re&lt;/P&gt;&lt;P&gt;__m128 *srcI1 = (__m128*)pSrcIm1; //a1Im&lt;/P&gt;&lt;P&gt;__m128 *srcR2 = (__m128*)pSrcRe2; //b1re&lt;/P&gt;&lt;P&gt;__m128 *srcI2 = (__m128*)pSrcIm2; //b1Im&lt;/P&gt;&lt;P&gt;__m128 *destR = (__m128*)pDstRe; //ResRe&lt;/P&gt;&lt;P&gt;__m128 *destI = (__m128*)pDstIm; //ResIm&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;for(int i = 0 ;i&amp;lt; count; count-=4, srcR1+=1,srcI1+=1, srcR2+=1,srcI2+=1, destR+=1,destI+=1) &lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;m1 = _mm_mul_ps( *srcR1, *srcR2); //(a1Re*b1Re)&lt;/P&gt;&lt;P&gt;m2 = _mm_mul_ps(*srcI1,*srcI2); //(a1Im*b1Im)&lt;/P&gt;&lt;P&gt;m3 = _mm_mul_ps(*srcR1,*srcI2); //(a1Re*b1Im)&lt;/P&gt;&lt;P&gt;m4 = _mm_mul_ps(*srcI1,*srcR2); //(a1Im*b1Re)&lt;/P&gt;&lt;P&gt;*destR = _mm_add_ps(m1,m2); // (a1Re*b1Re)+(a1Im*b1Im)&lt;/P&gt;&lt;P&gt;*destI = _mm_sub_ps(m4,m3); //(a1Im*b1Re)-(a1Re*b1Im)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;} &lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;I am not allowed to disturb the application framework, i.e. change the interface. I have to use vectors, and I don't know how to force a vector to have a 16-byte-aligned allocation. I have to optimise the speed of this function. Kindly suggest options for any of the following:&lt;/P&gt;&lt;P&gt;1. A method to allocate 16-byte-aligned vectors&lt;/P&gt;&lt;P&gt;2. A fast IPP-based method to implement this function&lt;/P&gt;&lt;P&gt;3. An SSE implementation that handles non-aligned vectors.&lt;/P&gt;&lt;P&gt;Hoping to get a speedy reply from the experts.&lt;/P&gt;&lt;P&gt;Note: I have just started working with IPP/SSE; I am only a week into my first IPP/SSE routine.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Rohit&lt;/P&gt;</description>
    <pubDate>Mon, 21 Nov 2011 11:30:22 GMT</pubDate>
    <dc:creator>rohitspandey</dc:creator>
    <dc:date>2011-11-21T11:30:22Z</dc:date>
    <item>
      <title>Complex multiply by conjugate function for array</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate-function-for-array/m-p/788955#M2226</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a similar requirement: a complex multiply-by-conjugate function for arrays. The function is currently implemented in C and takes a considerable amount of time in my application, so I have to optimise it with either IPP or SSE.&lt;/P&gt;&lt;P&gt;Details on the function&lt;/P&gt;&lt;P&gt;--------------------------&lt;/P&gt;&lt;P&gt;Description: array A x complex conjugate of B; result in C.&lt;/P&gt;&lt;P&gt;mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real&lt;/P&gt;&lt;P&gt;const Float_t *pAimag, // source of array 'A' imag&lt;/P&gt;&lt;P&gt;const Float_t *pBreal, // source of array 'B' real&lt;/P&gt;&lt;P&gt;const Float_t *pBimag, // source of array 'B' imag&lt;/P&gt;&lt;P&gt;Uword32 Np, // number of points&lt;/P&gt;&lt;P&gt;Float_t *pCreal, // Dest for real part&lt;/P&gt;&lt;P&gt;Float_t *pCimag) // Dest for imag part&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have tried two versions of an IPP implementation, but both are slower than the C code.&lt;/P&gt;&lt;P&gt;//IPP implementation&lt;/P&gt;&lt;P&gt;IppStatus ippStats;&lt;/P&gt;&lt;P&gt;Ipp32fc *pSrcA, *pSrcB, *pDst; // input and output buffers in IPP complex format&lt;/P&gt;&lt;P&gt;/*//Implement-1&lt;/P&gt;&lt;P&gt;// p + qj = (aR+aI*j)*(bR-bI*j)=(aR*bR+aI*bI) + (aI*bR-aR*bI)j&lt;/P&gt;&lt;P&gt;//Calculating p = (aR*bR+aI*bI)&lt;/P&gt;&lt;P&gt;ippsMul_32f(pAreal,pBreal,pCreal,Np);&lt;/P&gt;&lt;P&gt;ippsAddProduct_32f(pAimag,pBimag,pCreal,Np);&lt;/P&gt;&lt;P&gt;//Calculating q = (aI*bR-aR*bI)&lt;/P&gt;&lt;P&gt;ippsMul_32f(pAimag,pBreal,pCimag,Np);&lt;/P&gt;&lt;P&gt;//Make -bI vector from bI vector (this negates the input array in place)&lt;/P&gt;&lt;P&gt;ippsSubCRev_32f_I(0,(Ipp32f *)pBimag,Np);&lt;/P&gt;&lt;P&gt;//Accumulate aR*(-bI) into the imaginary result&lt;/P&gt;&lt;P&gt;ippsAddProduct_32f(pAreal,pBimag,pCimag,Np);&lt;/P&gt;&lt;P&gt;*/&lt;/P&gt;
&lt;P&gt;//Implement-2&lt;/P&gt;&lt;P&gt;//Allocate memory for pSrcA, pSrcB and pDst buffers&lt;/P&gt;&lt;P&gt;pSrcA = ippsMalloc_32fc(3*Np);&lt;/P&gt;&lt;P&gt;pSrcB = &amp;amp;pSrcA[Np];//ippsMalloc_32fc(Np);&lt;/P&gt;&lt;P&gt;pDst = &amp;amp;pSrcA[2*Np];//ippsMalloc_32fc(Np);&lt;/P&gt;&lt;P&gt;if((pSrcA==0)||(pDst==0)||(pSrcB==0))&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;IPP_ERR_STATUS_var = ippStsMemAllocErr;&lt;/P&gt;&lt;P&gt;myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);&lt;/P&gt;&lt;P&gt;return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;//First convert the 2 separate IQ buffers to a single complex number buffer&lt;/P&gt;&lt;P&gt;//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);&lt;/P&gt;&lt;P&gt;ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);&lt;/P&gt;&lt;P&gt;ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);&lt;/P&gt;&lt;P&gt;//Using IPP API:-&lt;/P&gt;&lt;P&gt;//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);&lt;/P&gt;&lt;P&gt;ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, Np);&lt;/P&gt;&lt;P&gt;if(ippStats!=ippStsNoErr)&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;IPP_ERR_STATUS_var = ippStats;&lt;/P&gt;&lt;P&gt;myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);&lt;/P&gt;&lt;P&gt;return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;//Now convert the output complex buffer to 2 separate IQ buffers&lt;/P&gt;&lt;P&gt;//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);&lt;/P&gt;&lt;P&gt;ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);&lt;/P&gt;&lt;P&gt;myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);&lt;/P&gt;&lt;P&gt;Then I implemented the function using SSE, which is quite fast and works perfectly in my application as long as the vectors are 16-byte aligned.&lt;/P&gt;
&lt;P&gt;void MulConjCompSSE( float* pSrcRe1,float* pSrcIm1, float* pSrcRe2,float* pSrcIm2, float* pDstRe,float* pDstIm, int count)&lt;/P&gt;&lt;P&gt;{ &lt;/P&gt;&lt;P&gt;//Check for 16-byte alignment&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;__m128 m1, m2, m3, m4;&lt;/P&gt;&lt;P&gt;//__m128 sr1,sr2,si1,si2,dr1,dr2;&lt;/P&gt;&lt;P&gt;// __declspec(align(16)) float *pSrc1re2;&lt;/P&gt;&lt;P&gt;__m128 *srcR1 = (__m128*)pSrcRe1; //a1re&lt;/P&gt;&lt;P&gt;__m128 *srcI1 = (__m128*)pSrcIm1; //a1Im&lt;/P&gt;&lt;P&gt;__m128 *srcR2 = (__m128*)pSrcRe2; //b1re&lt;/P&gt;&lt;P&gt;__m128 *srcI2 = (__m128*)pSrcIm2; //b1Im&lt;/P&gt;&lt;P&gt;__m128 *destR = (__m128*)pDstRe; //ResRe&lt;/P&gt;&lt;P&gt;__m128 *destI = (__m128*)pDstIm; //ResIm&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;for(int i = 0 ;i&amp;lt; count; count-=4, srcR1+=1,srcI1+=1, srcR2+=1,srcI2+=1, destR+=1,destI+=1) &lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;m1 = _mm_mul_ps( *srcR1, *srcR2); //(a1Re*b1Re)&lt;/P&gt;&lt;P&gt;m2 = _mm_mul_ps(*srcI1,*srcI2); //(a1Im*b1Im)&lt;/P&gt;&lt;P&gt;m3 = _mm_mul_ps(*srcR1,*srcI2); //(a1Re*b1Im)&lt;/P&gt;&lt;P&gt;m4 = _mm_mul_ps(*srcI1,*srcR2); //(a1Im*b1Re)&lt;/P&gt;&lt;P&gt;*destR = _mm_add_ps(m1,m2); // (a1Re*b1Re)+(a1Im*b1Im)&lt;/P&gt;&lt;P&gt;*destI = _mm_sub_ps(m4,m3); //(a1Im*b1Re)-(a1Re*b1Im)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;} &lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;I am not allowed to disturb the application framework, i.e. change the interface. I have to use vectors, and I don't know how to force a vector to have a 16-byte-aligned allocation. I have to optimise the speed of this function. Kindly suggest options for any of the following:&lt;/P&gt;&lt;P&gt;1. A method to allocate 16-byte-aligned vectors&lt;/P&gt;&lt;P&gt;2. A fast IPP-based method to implement this function&lt;/P&gt;&lt;P&gt;3. An SSE implementation that handles non-aligned vectors.&lt;/P&gt;&lt;P&gt;Hoping to get a speedy reply from the experts.&lt;/P&gt;&lt;P&gt;Note: I have just started working with IPP/SSE; I am only a week into my first IPP/SSE routine.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Rohit&lt;/P&gt;</description>
      <pubDate>Mon, 21 Nov 2011 11:30:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate-function-for-array/m-p/788955#M2226</guid>
      <dc:creator>rohitspandey</dc:creator>
      <dc:date>2011-11-21T11:30:22Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate function for array</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate-function-for-array/m-p/788956#M2227</link>
      <description>Hi Rohit,&lt;BR /&gt;&lt;BR /&gt;This seems to be the same issue as&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://software.intel.com/en-us/forums/showthread.php?t=71784&amp;amp;o=a&amp;amp;s=lr"&gt;http://software.intel.com/en-us/forums/showthread.php?t=71784&amp;amp;o=a&amp;amp;s=lr&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;Right?&lt;BR /&gt;&lt;BR /&gt;As I understand it, the SSE implementation has an obvious advantage with your array layout, while the IPP functions have to spend a lot of time converting the data layout to a suitable format and performing multiple I/O passes.&lt;BR /&gt;&lt;BR /&gt;So your question really comes down to option 3, an SSE implementation that handles non-aligned vectors. You may also ask this question on the Intel Compiler forum. As I recall, non-aligned loads are not a big issue on the latest processors with the Intel C/C++ compiler, and you may use a non-aligned load, for example _mm_loadu_ps, instead of the forced cast such as __m128 *srcR1 = (__m128*)pSrcRe1:&lt;BR /&gt;&lt;BR /&gt;&lt;P&gt;__m128 _mm_loadu_ps(float * p)&lt;/P&gt;&lt;P&gt;Loads four SP FP values. The address need not be 16-byte-aligned.&lt;/P&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;Ying</description>
      <pubDate>Thu, 24 Nov 2011 07:04:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate-function-for-array/m-p/788956#M2227</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2011-11-24T07:04:51Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate function for array</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate-function-for-array/m-p/788957#M2228</link>
      <description>For further AVX optimization, see&lt;BR /&gt;&lt;A href="http://software.intel.com/en-us/articles/benefits-of-intel-avx-for-small-matrices/"&gt;http://software.intel.com/en-us/articles/benefits-of-intel-avx-for-small-matrices/&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;Hope it helps,&lt;BR /&gt;Ying</description>
      <pubDate>Thu, 24 Nov 2011 07:09:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate-function-for-array/m-p/788957#M2228</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2011-11-24T07:09:33Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate function for array</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate-function-for-array/m-p/788958#M2229</link>
      <description>Rohit,&lt;BR /&gt;&lt;BR /&gt;Regarding your questions:&lt;BR /&gt;1) Allocating 16-byte-aligned vectors: use any ippMalloc function (for example ippsMalloc_32fc() in your case). They all guarantee an appropriate alignment for the particular architecture, so 16-byte for SSE, 32-byte for AVX, etc. If you need aligned vectors on the stack, see example 3 below.&lt;BR /&gt;2) There is no suitable IPP functionality for a fast implementation of what you need.&lt;BR /&gt;3) Here is an example of MulConj for the non-aligned case and the merged re,im format. I guess it will be easy for you to translate it to separate re &amp;amp; im vectors: just split the src vectors and reverse the src shuffles, and do the same for dst:&lt;BR /&gt;&lt;BR /&gt;&lt;P&gt;#define M_SIGN 0x80000000&lt;/P&gt;&lt;P&gt;static const __declspec(align(16)) int SIGN_IM[4] = { 0, M_SIGN, 0, M_SIGN };&lt;/P&gt;&lt;P&gt;void sMulConj_32fc( const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32u len ){&lt;/P&gt;&lt;P&gt;Ipp32fc *px1, *px2, *py;&lt;BR /&gt;__m128 x1, x2, re1, im1, mr, mi, y1, y2;&lt;BR /&gt;Ipp32f re, im;&lt;BR /&gt;Ipp32u i;&lt;/P&gt;&lt;P&gt;px1 = (Ipp32fc*)pSrc1;&lt;BR /&gt;px2 = (Ipp32fc*)pSrc2;&lt;BR /&gt;py = (Ipp32fc*)pDst;&lt;BR /&gt;&lt;BR /&gt;// Processes 4 complex values per iteration; a remainder of len%4 values is not handled here.&lt;BR /&gt;for (i=0; i+4&amp;lt;=len; i+=4) {&lt;BR /&gt; x1 = _mm_loadu_ps( (float*)( px1+i ));&lt;BR /&gt; x2 = _mm_loadu_ps( (float*)( px2+i ));&lt;BR /&gt; re1 = _mm_shuffle_ps(x1, x1, _MM_SHUFFLE(2,2,0,0));&lt;BR /&gt; im1 = _mm_shuffle_ps(x1, x1, _MM_SHUFFLE(3,3,1,1));&lt;BR /&gt; mr = _mm_mul_ps(x2, re1 );&lt;BR /&gt; mi = _mm_mul_ps(x2, im1 );&lt;BR /&gt; mr = _mm_xor_ps( mr, ( *(__m128*)(&amp;amp;(SIGN_IM))) );&lt;BR /&gt; mi = _mm_shuffle_ps(mi, mi, _MM_SHUFFLE(2,3,0,1));&lt;BR /&gt; y1 = _mm_add_ps(mr, mi );&lt;/P&gt;&lt;P&gt; x1 = _mm_loadu_ps( (float*)( px1+i+2 ));&lt;BR /&gt; x2 = _mm_loadu_ps( (float*)( px2+i+2 ));&lt;BR /&gt; re1 = _mm_shuffle_ps(x1, x1, _MM_SHUFFLE(2,2,0,0) );&lt;BR /&gt; im1 = _mm_shuffle_ps(x1, x1, _MM_SHUFFLE(3,3,1,1) );&lt;BR /&gt; mr = _mm_mul_ps(x2, re1 );&lt;BR /&gt; mi = _mm_mul_ps(x2, im1 );&lt;BR /&gt; mr = _mm_xor_ps( mr, ( *(__m128*)(&amp;amp;(SIGN_IM))) );&lt;BR /&gt; mi = _mm_shuffle_ps( mi, mi, _MM_SHUFFLE(2,3,0,1));&lt;BR /&gt; y2 = _mm_add_ps( mr, mi );&lt;BR /&gt; _mm_storeu_ps( (float*)( py+i ), y1 );&lt;BR /&gt; _mm_storeu_ps( (float*)( py+i+2), y2 );&lt;BR /&gt;}&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;Igor&lt;/P&gt;</description>
      <pubDate>Thu, 24 Nov 2011 09:07:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate-function-for-array/m-p/788958#M2229</guid>
      <dc:creator>igorastakhov</dc:creator>
      <dc:date>2011-11-24T09:07:19Z</dc:date>
    </item>
    <item>
      <title>Complex multiply by conjugate function for array</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate-function-for-array/m-p/788959#M2230</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;I have used XYZu as of now, which is slower. Something like this:&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;P&gt;Uword32 s;&lt;/P&gt;&lt;P&gt;Float_t ar, ai, br, bi;&lt;/P&gt;&lt;P&gt;__m128 m1, m2, m3, m4;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;for(s = 0 ;s&amp;lt; (Np/4); s++) &lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;m1 = _mm_mul_ps( _mm_loadu_ps(pAreal), _mm_loadu_ps(pBreal)); //(a1Re*b1Re)&lt;/P&gt;&lt;P&gt;m2 = _mm_mul_ps(_mm_loadu_ps(pAimag),_mm_loadu_ps(pBimag)); //(a1Im*b1Im)&lt;/P&gt;&lt;P&gt;m3 = _mm_mul_ps(_mm_loadu_ps(pAreal),_mm_loadu_ps(pBimag)); //(a1Re*b1Im)&lt;/P&gt;&lt;P&gt;m4 = _mm_mul_ps(_mm_loadu_ps(pAimag),_mm_loadu_ps(pBreal)); //(a1Im*b1Re)&lt;/P&gt;&lt;P&gt;_mm_storeu_ps(pCreal, _mm_add_ps(m1,m2)); // (a1Re*b1Re)+(a1Im*b1Im)&lt;/P&gt;&lt;P&gt;_mm_storeu_ps(pCimag,_mm_sub_ps(m4,m3)); //(a1Im*b1Re)-(a1Re*b1Im) &lt;/P&gt;&lt;P&gt;pAreal=pAreal+4;pAimag=pAimag+4; &lt;/P&gt;&lt;P&gt;pBreal=pBreal+4;pBimag=pBimag+4; &lt;/P&gt;&lt;P&gt;pCreal=pCreal+4;pCimag=pCimag+4;&lt;/P&gt;&lt;P&gt;} &lt;/P&gt;&lt;P&gt;for (s = 0; s &amp;lt; (Np%4); s++) // for each remaining array element&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;ar = *pAreal++; // get the real and imaginary parts&lt;/P&gt;&lt;P&gt;ai = *pAimag++;&lt;/P&gt;&lt;P&gt;br = *pBreal++;&lt;/P&gt;&lt;P&gt;bi = *pBimag++;&lt;/P&gt;&lt;P&gt;*pCreal++ = ar * br + ai * bi; // do the complex multiply&lt;/P&gt;&lt;P&gt;*pCimag++ = ai * br - ar * bi; // NB conjugate of B&lt;/P&gt;&lt;P&gt;}&lt;BR /&gt;&lt;BR /&gt;In the meantime I am working on a memory manager to handle aligned vectors.&lt;BR /&gt;Thanks for the help.&lt;BR /&gt;Regards&lt;BR /&gt;Rohit&lt;/P&gt;</description>
      <pubDate>Wed, 04 Jan 2012 12:10:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Complex-multiply-by-conjugate-function-for-array/m-p/788959#M2230</guid>
      <dc:creator>rohitspandey</dc:creator>
      <dc:date>2012-01-04T12:10:57Z</dc:date>
    </item>
  </channel>
</rss>

