complex number class for optimum performance

Gerald_H_ · ‎03-07-2017

I'm trying to take full advantage of AVX-512 SIMD instructions on my Xeon Phi computer. I am not sure how the compiler handles floating-point complex numbers when optimizing.

Is it recommended to use the std::complex<float> class in the innermost loop of a high-performance application? In this inner loop, I'm multiplying two complex numbers and accumulating the product into a third.
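For concreteness, the inner loop described above might look something like the following. This is only a minimal sketch: the function name dot_product and the use of std::vector are assumptions for illustration, not the actual application code.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: accumulates the element-wise products a[i] * b[i].
std::complex<float> dot_product(const std::vector<std::complex<float>>& a,
                                const std::vector<std::complex<float>>& b) {
    std::complex<float> acc(0.0f, 0.0f);
    // Only iterate over the range both inputs cover.
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];  // complex multiply, then accumulate
    }
    return acc;
}
```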

Or is there a better class, say the MKL_complex class, that would run faster?

Gerald_H_ · ‎03-15-2017

I'll answer my own question as best as I can.

What I was looking for is a library of routines, cleverly written and possibly using Intel Intrinsics, that allows me to perform vector operations like multiply on pairs of arrays of std::complex<float>. I was surprised to learn that no such library exists! I never imagined a computational platform like Xeon Phi would not support complex numbers at a deep level. And no contributor has written such a library, either.

The closest thing I found was Agner Fog's vector class library, which provides tools from which one could build a vectorized complex number library. I also found two very illuminating posts, by Matt Scarpino and Peter Cordes, describing how one would vectorize complex multiplication. These two posts outline two methods that rely on Intel intrinsic functions. Even if you don't want to program with intrinsics, I encourage you to study those examples because they give good hints about why it is hard to vectorize complex multiply and how to overcome that difficulty.

For now, I am hoping that after study of the intrinsics solutions I can write some regular C++ code that gives the compiler enough hints to autovectorize my computation.
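As a rough idea of what such "regular C++" could look like, here is a sketch that works directly on the interleaved float pairs and spells out the real and imaginary arithmetic. The function name cmul_interleaved is made up for illustration, and __restrict is a non-standard compiler extension; whether a given compiler actually vectorizes this loop is not guaranteed and would need to be checked in its optimization report.

```cpp
#include <cstddef>

// Hypothetical helper: out[k] = a[k] * b[k] for n complex values stored as
// interleaved floats [re0, im0, re1, im1, ...].
// __restrict is a non-standard extension (GCC, Clang, MSVC, Intel) that
// promises the compiler the three arrays do not overlap.
void cmul_interleaved(const float* __restrict a, const float* __restrict b,
                      float* __restrict out, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k) {
        const float ar = a[2 * k], ai = a[2 * k + 1];
        const float br = b[2 * k], bi = b[2 * k + 1];
        out[2 * k]     = ar * br - ai * bi;  // real part
        out[2 * k + 1] = ar * bi + ai * br;  // imaginary part
    }
}
```

Since C++11, an array of std::complex<float> may be accessed as an array of float pairs through reinterpret_cast<float*>, so a routine like this can be called on std::complex<float> data without copying.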

Gerald_H_ · ‎03-15-2017

Also there is nothing magical about the MKL complex number class. The complex class in the standard template library is the best one available.

McCalpinJohn · ‎03-16-2017

I don't use C++ classes, but in plain C code I found that the compiler's performance for complex numbers in the standard interleaved format was often quite disappointing. For almost all of my signal-processor codes, it was faster to "de-interleave" the input data into separate real and imaginary arrays, perform the complex arithmetic "manually", and then interleave the separate output arrays into the standard interleaved format. (Obviously, if you don't have to put the data back in interleaved format between steps, you can save even more time.)
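The de-interleave / compute / re-interleave pattern described above can be sketched as follows. The function names are hypothetical and this is written as plain C-style C++ for illustration; it is not the code from the signal-processing applications mentioned.

```cpp
#include <cstddef>

// Hypothetical helper: split interleaved [re0, im0, re1, im1, ...] into
// separate real and imaginary arrays of length n.
void deinterleave(const float* in, std::size_t n, float* re, float* im) {
    for (std::size_t i = 0; i < n; ++i) {
        re[i] = in[2 * i];
        im[i] = in[2 * i + 1];
    }
}

// Hypothetical helper: c[i] += a[i] * b[i] on split arrays. Every access is
// unit-stride, which is the form compilers tend to vectorize most easily.
void cmul_accumulate_split(const float* ar, const float* ai,
                           const float* br, const float* bi,
                           float* cr, float* ci, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        cr[i] += ar[i] * br[i] - ai[i] * bi[i];
        ci[i] += ar[i] * bi[i] + ai[i] * br[i];
    }
}

// Hypothetical helper: merge split arrays back into interleaved format.
void interleave(const float* re, const float* im, std::size_t n, float* out) {
    for (std::size_t i = 0; i < n; ++i) {
        out[2 * i]     = re[i];
        out[2 * i + 1] = im[i];
    }
}
```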

TimP · ‎03-16-2017

With the SSE3/SSE4 support for complex arithmetic, there is no need to split the data. Whether AVX could gain anything for complex multiply became a dilemma. The 512-bit formats may well benefit from the split.
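One plausible reading of the SSE3 support mentioned above is the ADDSUBPS, MOVSLDUP, and MOVSHDUP instructions, which let a complex multiply work on interleaved data directly. The sketch below is an assumption about what is meant, not code from this thread; the function name cmul_sse3 is hypothetical, and the intrinsic path is guarded so that a scalar loop is used when SSE3 is not enabled at compile time.

```cpp
#include <cstddef>
#if defined(__SSE3__)
#include <pmmintrin.h>  // SSE3: _mm_addsub_ps, _mm_moveldup_ps, _mm_movehdup_ps
#endif

// Hypothetical helper: out[k] = a[k] * b[k] for n interleaved complex floats.
void cmul_sse3(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t k = 0;
#if defined(__SSE3__)
    // Two complex values (four floats) per iteration.
    for (; k + 2 <= n; k += 2) {
        const __m128 va = _mm_loadu_ps(a + 2 * k);  // [ar0, ai0, ar1, ai1]
        const __m128 vb = _mm_loadu_ps(b + 2 * k);  // [br0, bi0, br1, bi1]
        const __m128 b_re = _mm_moveldup_ps(vb);    // [br0, br0, br1, br1]
        const __m128 b_im = _mm_movehdup_ps(vb);    // [bi0, bi0, bi1, bi1]
        // Swap real and imaginary parts of a: [ai0, ar0, ai1, ar1]
        const __m128 a_sw = _mm_shuffle_ps(va, va, _MM_SHUFFLE(2, 3, 0, 1));
        // addsub subtracts in even lanes and adds in odd lanes:
        //   even: ar*br - ai*bi   odd: ai*br + ar*bi
        const __m128 prod = _mm_addsub_ps(_mm_mul_ps(va, b_re),
                                          _mm_mul_ps(a_sw, b_im));
        _mm_storeu_ps(out + 2 * k, prod);
    }
#endif
    // Scalar tail (and full fallback when SSE3 is not enabled).
    for (; k < n; ++k) {
        const float ar = a[2 * k], ai = a[2 * k + 1];
        const float br = b[2 * k], bi = b[2 * k + 1];
        out[2 * k]     = ar * br - ai * bi;
        out[2 * k + 1] = ai * br + ar * bi;
    }
}
```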

SergeyKostrov · ‎03-17-2017

1. Intel C++ compiler option to consider: -[no-]complex-limited-range, which enables or disables (the default) the use of the basic algebraic expansions of some complex arithmetic operations. This can allow some performance improvement in programs that use a lot of complex arithmetic, at the loss of some exponent range.

2. Take a look at the LIBIMF library (Intel's complex.h and mathimf.h headers).

3. The SSE3 ISA didn't introduce any intrinsic functions related to complex numbers. Take a look at the pmmintrin.h header file: SSE3 introduced just 13 new intrinsic functions, 3 defines, and 2 run-time macros.
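To illustrate the "loss of some exponent range" trade-off in item 1, here is a sketch comparing the textbook algebraic expansion of complex division with a scaled variant (Smith's method). Both function names are hypothetical, and this only illustrates the general trade-off; it makes no claim about the code the Intel compiler actually generates with or without the option.

```cpp
#include <cmath>
#include <complex>

// Hypothetical helper: textbook expansion of (a + bi) / (c + di).
// The intermediate c*c + d*d overflows float once |c| or |d| exceeds
// roughly 1.8e19, even when the true quotient is perfectly representable.
std::complex<float> div_basic(std::complex<float> x, std::complex<float> y) {
    const float a = x.real(), b = x.imag(), c = y.real(), d = y.imag();
    const float den = c * c + d * d;
    return {(a * c + b * d) / den, (b * c - a * d) / den};
}

// Hypothetical helper: Smith's method, which scales by the ratio of the
// divisor's components to avoid squaring them. Slower, but keeps more range.
std::complex<float> div_scaled(std::complex<float> x, std::complex<float> y) {
    const float a = x.real(), b = x.imag(), c = y.real(), d = y.imag();
    if (std::fabs(c) >= std::fabs(d)) {
        const float r = d / c;
        const float den = c + d * r;
        return {(a + b * r) / den, (b - a * r) / den};
    }
    const float r = c / d;
    const float den = c * r + d;
    return {(a * r + b) / den, (b * r - a) / den};
}
```

With x = y = (1e30, 1e30), the scaled version returns a quotient of one, while the basic expansion overflows in its intermediate products under ordinary IEEE single-precision evaluation.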

SergeyKostrov · ‎03-17-2017

>>...Or is there a better class, say the MKL_complex class, that would run faster?

Take a look at the Intel IPP library and compare the performance of IPP's complex number functions vs. MKL's complex number functions.

Gerald_H_ · ‎03-17-2017

Thanks for several good comments.

When I look at the intrinsic functions for AVX-512, I see that they're perfectly happy doing a stride of 2. Indeed, this is what Matt's algorithm does, albeit for 256-bit AVX. So it shouldn't be very difficult -- let me rephrase that -- it should be possible to write fast complex functions without having to de-interleave.

I think it would be helpful to the physics community if Intel were to provide vector support for complex numbers. It certainly would make my code more readable.

Meanwhile, I guess I have to de-interleave temporarily. It is kind of a hassle because both the input and the output of my routine are expected to be interleaved complex.