MGRAV
Novice

Complex multiply–accumulate

Hi everyone,

I need to perform multiply–accumulate operations over complex arrays.

More exactly, I need a <-- a + alpha * c * d, where a, c, and d are complex values and alpha is a real.

It's a key point of my algorithm, and I have run many tests trying to find the best solution. However, it still looks very slow to me. (Computing the FFT is faster!)

Currently, I use code based on intrinsics. Attached is the AVX version. I suppose any solution can be extended easily to SSE and/or AVX-512.

 

	__m256 alphaVect = _mm256_set1_ps(Alpha);

	for (int i = 0; i < 2 * size / 8; i++)  // size complex elements = 2*size floats
	{
		__m256 C_vec = _mm256_loadu_ps(C + i * 8);
		__m256 D_vec = _mm256_loadu_ps(D + i * 8);

		// Complex multiply C*D with interleaved (re, im) layout:
		// CD  = re(D) * C            -> (cr*dr, ci*dr, ...)
		// CCD = im(D) * C, then swap -> (ci*di, cr*di, ...)
		__m256 CD  = _mm256_mul_ps(_mm256_moveldup_ps(D_vec), C_vec);
		__m256 CCD = _mm256_mul_ps(_mm256_movehdup_ps(D_vec), C_vec);
		CCD = _mm256_shuffle_ps(CCD, CCD, _MM_SHUFFLE(2, 3, 0, 1));

		// addsub: real lanes subtract, imaginary lanes add
		__m256 valCD = _mm256_addsub_ps(CD, CCD);
	#ifdef __FMA__
		__m256 val = _mm256_fmadd_ps(valCD, alphaVect, _mm256_loadu_ps(dst + i * 8)); // a*b + c
	#else
		__m256 val = _mm256_add_ps(_mm256_mul_ps(valCD, alphaVect), _mm256_loadu_ps(dst + i * 8));
	#endif
		_mm256_storeu_ps(dst + i * 8, val);
	}

 

Does anyone have a better idea or solution?

If it is possible to do this operation with IPP (like the multiplication ippsMulPack_32f, ...), MKL, ..., I didn't find it.

4 Replies
McCalpinJohn
Black Belt

I have not tested complex arithmetic with the newest (2018) compilers, but with the Intel 2016 compiler I got substantial speedups by simply splitting the complex data into separate real and imaginary parts and implementing the complex arithmetic manually.   I did not need intrinsics to get excellent code once I split the data apart. 
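For reference, the split (planar) layout John describes might look like this: with separate real and imaginary arrays, the inner loop is straight-line arithmetic that the compiler can vectorize without any shuffle instructions. This is a sketch under that assumption; the function name and array names are illustrative, not from the thread:

```c
/* Same update, a <- a + alpha * c * d, but with real and imaginary
   parts stored in separate (planar) arrays, so the compiler can
   auto-vectorize without shuffles. n is the number of complex elements. */
static void cmla_split(float *a_re, float *a_im,
                       const float *c_re, const float *c_im,
                       const float *d_re, const float *d_im,
                       float alpha, int n)
{
    for (int i = 0; i < n; i++) {
        float re = c_re[i] * d_re[i] - c_im[i] * d_im[i];
        float im = c_re[i] * d_im[i] + c_im[i] * d_re[i];
        a_re[i] += alpha * re;
        a_im[i] += alpha * im;
    }
}
```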
 

TimP
Black Belt

SSE3 (included in SSE4.2) has specific instructions which the compiler will use to vectorize complex arithmetic without requiring intrinsics or shuffles. To use AVX2 or AVX-512 fully, as John said, it will be necessary to split the data. Compilers will not attempt to evaluate whether a mixture of SSE3 and AVX2 would prove faster. The statistics produced by opt-report may help to evaluate this, and might lead you to use some SSE3 intrinsics in case the overhead of splitting the data would be significant.

MGRAV
Novice

John, I can certainly ask the MKL FFT to split the output into real and imaginary parts. However, it seems logical to me that the FFT would be slower with this approach. I will do more tests and then switch to this solution if it is faster.

 

Tim, are you saying that there are special instructions in SSE* that are not available through intrinsics?

 

I did some more research and found in the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", July 2017 (Order No. 248966-037), section 12.11.3, that they use the same approach. So I assume it is the best, or one of the best, approaches.