Complex Division Performance Issue

DubitoCogito · ‎05-08-2013

I have noticed a performance issue with complex division on the MIC. Dividing two complex numbers with the division operator is about 22x slower than explicitly coding the operation using the complex conjugate (see attached source file). I passed the -fcode-asm flag to the ifort compiler to dump the assembly code and noticed an unexpected difference. In the former case a call is made to an SVML subroutine named __svml_cdiv8, but in the latter the code is inlined. For the CPU, inlined code is always used (meaning no calls to the external SVML library). Using the -fimf-precision=low option generates faster code by calling the enhanced performance (EP) version of the aforementioned SVML function, but it is still about 3x slower than the explicit version. I used the following command to compile the source code.

ifort -O3 -mmic -align array64byte complex.f90

For both the CPU and the MIC, using the division operator is slower than using the complex conjugate to compute the value. What is the Intel compiler doing? Why is it using a function call on the MIC?
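For readers without the attached source file, here is a minimal Python sketch of the two approaches being compared. It is only an illustration of the arithmetic, not the original Fortran code; the function name div_conjugate is made up for this example.

```python
def div_conjugate(a: complex, b: complex) -> complex:
    """Divide a by b by multiplying numerator and denominator by conj(b).

    Hypothetical helper for illustration; not taken from the attached file.
    """
    denom = b.real * b.real + b.imag * b.imag          # |b|^2, a real number
    real = (a.real * b.real + a.imag * b.imag) / denom
    imag = (a.imag * b.real - a.real * b.imag) / denom
    return complex(real, imag)


a = complex(1.0, 2.0)
b = complex(3.0, -4.0)

quotient_operator = a / b                  # language-level division operator
quotient_conjugate = div_conjugate(a, b)   # explicit conjugate formula

print(quotient_operator)
print(quotient_conjugate)
```

For operands of moderate magnitude the two results agree to rounding error. The explicit form is just a handful of multiplies, adds, and two real divides, which is why a compiler can inline and vectorize it easily.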

TimP · ‎05-08-2013

The default method with function call is implemented to support full exponent range. As you indicated, this is a non-vectorized step, even though vectorization may be reported. The option -complex-limited-range would support full vectorization but works only over roughly half the exponent range. I don't know for which compiler versions this has been tested fully. If you want both the in-line division without support for special cases and the complex-limited-range, both are included in -fp-model fast=2.

DubitoCogito · ‎05-08-2013

Since the CPU does appear to vectorize the complex division, does that mean it does not support the full exponent range? Or is this due to a hardware limitation of the MIC?

TimP · ‎05-08-2013

When you write a fast version of complex divide, even an algebraically correct one, you multiply operands, and the intermediate steps require double the exponent range of the individual operands. The only hardware support for this is in the scalar x87 instructions, by using the implicit long double 80-bit format to protect the range of expressions involving doubles. All known architectures share this limitation of not having extended-range parallel instructions. Even in a vectorized loop, those operations have to be done individually, but the vec-report doesn't distinguish the fast limited-range mode, with all instructions parallel, from the full-range version with a library function call. On this architecture we have the further limitation of not having full precision parallel divide hardware.
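The exponent-range point can be seen in a small Python sketch using ordinary IEEE doubles. This is only an illustration under that assumption; naive_div is a hypothetical name, and the sample values are chosen arbitrarily to sit on either side of the square root of the largest double.

```python
import math
import sys

# Operands larger than this overflow when squared: roughly half the exponent range.
half_range_limit = math.sqrt(sys.float_info.max)


def naive_div(a: complex, b: complex) -> complex:
    """Conjugate-formula division with no protection against overflow."""
    denom = b.real * b.real + b.imag * b.imag
    return complex((a.real * b.real + a.imag * b.imag) / denom,
                   (a.imag * b.real - a.real * b.imag) / denom)


small = complex(1e150, 1e150)   # below the limit: squares stay finite
large = complex(1e200, 1e200)   # above the limit: squares overflow to inf

ok = naive_div(small, small)    # mathematically 1
bad = naive_div(large, large)   # also mathematically 1, but inf/inf gives nan

print(half_range_limit)
print(ok)
print(bad)
```

Both quotients are exactly 1 in exact arithmetic, yet the second comes out as NaN because the intermediate products exceed the double range even though the operands and the true result do not.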

DubitoCogito · ‎05-08-2013

I have a few more questions:

(a) Could you please explain what you meant by "intermediate steps require double the exponent range of the individual operands"?

(c) If the CPU can inline the complex division code, then why does the MIC code make a function call?

TimP · ‎05-08-2013

a) In your own example of fast alternative code you multiply operands, potentially incurring overflow if their exponents exceed half the range.

b) Read any reference on how complex division is implemented with protection against overflow, e.g. http://www.netlib.org/slatec/src/cdiv.f. Even that one doesn't cover the full exponent range, although it's much better than half. It involves too much code for in-lining, and is not likely to vectorize even if it were in-lined.
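As a rough idea of what overflow protection looks like, here is a Python sketch that normalizes both operands by a scale factor before applying the conjugate formula. Treat it as an assumption about the general approach, not a transcription of the linked routine; div_scaled is a hypothetical name.

```python
def div_scaled(a: complex, b: complex) -> complex:
    """Conjugate-formula division after scaling both operands by |Re b| + |Im b|."""
    s = abs(b.real) + abs(b.imag)     # scale factor of the same order as |b|
    ar, ai = a.real / s, a.imag / s   # scaled numerator
    br, bi = b.real / s, b.imag / s   # scaled denominator, components in [-1, 1]
    denom = br * br + bi * bi         # lies between 0.5 and 1, cannot overflow
    return complex((ar * br + ai * bi) / denom,
                   (ai * br - ar * bi) / denom)


large = complex(1e200, 1e200)
q = div_scaled(large, large)   # the unscaled conjugate formula overflows here

print(q)
```

The extra divides and the absolute values are the price of the wider range. In this sketch the range is still not complete: the scale factor itself overflows once the components of the divisor approach the largest representable double.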

DubitoCogito · ‎05-08-2013

Thank you for your helpful answers, but I have another question regarding an earlier remark. You mentioned the MIC does not have full precision parallel divide hardware. Could you please elaborate?

TimP · ‎05-09-2013

http://software.intel.com/sites/default/files/article/382773/differences-in-floating-point-arithmetic-between-intel-xeon-processors-and-the-intel-xeon-phi.pdf

appears to be the authoritative paper, aside from the architecture reference manuals. As you can see, there is a possibility to use the x87 instruction for scalar divides under certain compile options (not those you appear to be interested in, although it may be hidden in the library function for complex division). Otherwise, divide and sqrt are carried out by reciprocal approximation and iterative improvement. In many cases there is an advantage to in-line expansion of that divide method, so it is included in the default compiler options.
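"Reciprocal approximation and iterative improvement" can be illustrated with a generic Newton-Raphson sketch in Python. The actual instruction sequence, starting accuracy, and iteration count used by the compiler are not given in this thread, so the function name, the starting guess, and the step count below are all assumptions made for the example.

```python
def reciprocal_newton(d: float, x0: float, steps: int = 4) -> float:
    """Refine an initial estimate x0 of 1/d with Newton-Raphson iterations."""
    x = x0
    for _ in range(steps):
        # Each step uses only multiplies and a subtract; the relative
        # error is roughly squared per iteration.
        x = x * (2.0 - d * x)
    return x


d = 3.0
x0 = 0.3                         # crude guess standing in for a hardware estimate
recip = reciprocal_newton(d, x0)
quotient = 7.0 * recip           # a divide becomes a multiply by the reciprocal

print(recip)
print(quotient)
```

Because every step is a multiply or a subtract, the sequence maps well onto vector units, which is the advantage of expanding it in-line. Getting a correctly rounded result needs additional care beyond this sketch.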

DubitoCogito · ‎05-28-2013

Neither -complex-limited-range nor -fp-model fast=2 does anything if compiling for the MIC. I checked by dumping the assembly code and it is identical. It still calls the incredibly slow __svml_cdiv8 function.

-bash-4.1$ icc --version
icc (ICC) 13.0.1 20121010