I'm trying to take the greatest advantage of the AVX-512 SIMD instructions on my Xeon Phi machine. I'm not sure how the compiler deals with complex floating-point numbers when optimizing.
Is it recommended to use the std::complex<float> class in the innermost loop of a high-performance application? In that inner loop I'm multiplying two complex numbers and accumulating the result in a third.
Or is there a better type, say MKL's MKL_Complex8, that would run faster?
I'll answer my own question as best as I can.
What I was looking for is a library of routines, cleverly written and possibly using Intel Intrinsics, that allows me to perform vector operations like multiply on pairs of arrays of std::complex<float>. I was surprised to learn that no such library exists! I never imagined a computational platform like Xeon Phi would not support complex numbers at a deep level. And no contributor has written such a library, either.
The closest thing I found was Agner Fog's vector class library, which provides tools from which one could build a vectorized complex number library. I also found two very illuminating posts, by Matt Scarpino and Peter Cordes, describing how one would vectorize complex multiplication. These two posts outline two methods that rely on Intel intrinsic functions. Even if you don't want to program with intrinsics, I encourage you to study those examples, because they give good hints on why it is hard to vectorize complex multiplication and how to overcome that difficulty.
For now, I am hoping that after studying the intrinsics solutions I can write some regular C++ code that gives the compiler enough hints to autovectorize my computation.
I don't use C++ classes, but in plain C code I found that the compiler's performance for complex numbers in the standard interleaved format was often quite disappointing. For almost all of my signal-processing codes, it was faster to "de-interleave" the input data into separate real and imaginary arrays, perform the complex arithmetic "manually", and then interleave the separate output arrays back into the standard interleaved format. (Obviously, if you don't have to put the data back into interleaved format between steps, you can save even more time.)
With SSE3's support for complex arithmetic (the add-subtract and duplicate instructions), there is no need to split the data. Whether AVX could gain anything for complex multiply is less clear-cut. The 512-bit formats may well benefit from the split.
Thanks for several good comments.
When I look at the intrinsic functions for AVX-512, I see that they're perfectly happy working with a stride of 2. Indeed, this is what Matt's algorithm does, albeit with 256-bit AVX. So it shouldn't be very difficult -- let me rephrase that -- it should be possible to write fast complex functions without having to de-interleave.
I think it would be helpful to the physics community if Intel were to provide vector support for complex numbers. It certainly would make my code more readable.
Meanwhile, I guess I have to de-interleave temporarily. It is kind of a hassle because both the input and output of my routine are expected to be interleaved complex.