On a 32-bit system, a program loop using double-precision floating point arithmetic takes the same time to process as one using single precision. The double-precision calculations are done in hardware, as opposed to using some sort of software emulation, as is done on most GPUs. The GPU takes more than twice as long to process a loop of doubles as it does a loop of singles.
Please exclude all thought of SSE or AVX registers or calculations for the moment.
I understand how calculations on single-precision (32-bit) floating point values are performed. How is it that using double-precision (64-bit) values takes no more time on the same hardware? Must the processor's ALU be based on a 64-bit architecture to achieve this, despite running a 32-bit operating system?
What hardware mechanism is used to achieve this? Does anyone have a good description?
On Intel processors there are the following floating point instruction sets: FPU (the legacy x87/8087 instruction set), SSE, and AVX. All three have access to a very fast internal floating point engine. The FPU supports 4-byte, 8-byte, and 10-byte floating point formats as single elements (scalars). SSE and AVX support 4-byte and 8-byte floating point formats as scalars (single variables) or as small vectors (two or more elements). Ignoring the multiple-element formats of SSE and AVX, the latency of a floating point multiply is on the order of 4 clock cycles (longer when memory references are involved). Throughput can be as low as 1-2 clock cycles.
When the problem involves a large degree of RAM reads and writes, the program is waiting for the memory as opposed to waiting for the floating point operations.
Note that when small vectors can be used, computation time can be significantly reduced (to 1/2, 1/4, or 1/8) and memory subsystem overhead per floating point operation can be reduced, but the total demand on the memory subsystem may increase.
>>>The double float calculations are done in hardware, as opposed to using some sort of software emulation, as is done on most GPUs. The GPU takes more than twice as long to process a loop of double floats than it does a single float loop.>>>
IIRC the Nvidia Kepler architecture has support for double precision calculations. Not sure about the Fermi design.
>>>How is it that the use of double precision values (64 bit) does not use more time on the same hardware. Must the processor ALU be based on 64 bit architecture to achieve this>>>
I suppose that recent Intel processors use one or two execution ports for integer (scalar) ALU operations and vector ALU operations, and that this data can be vectorized and sent to the SIMD execution engine. In the case of the vector ALU, up to four 32-bit integer scalar components are processed at once.
When using vectorized SIMD instructions, single precision throughput is roughly double that of double precision, just as on your GPU. This is because twice as many operations, using the same total number of bytes of data, may be performed per cycle.
When considering a single scalar operation, the performance of single and double precision may be similar. This may be true of the GPU as well. Some of the marketing material compares vector-parallel operation on a GPU against serial host CPU operation. This is in line with your idea that SIMD parallelism should not be considered on the host, even though you are discussing its equivalent on the GPU.
I think there cannot be a direct comparison between CPU peak floating point performance and GPU peak performance. I suppose that over a short interval (more than one CPU cycle) the peak performance will be a function of the floating point code scheduled for execution, the interdependencies in that code, and the execution units available to that code per core. The GPU has far more available resources, albeit operating at a lower clock speed.
Thanks for all of those neat comments. Attached is single and double sample code with a VS2010 build for Sergey.
This question originated because someone asked me why the single and double computational performance of a program on an i5 processor was the same, whereas on the GTX 480 GPU this is not the case. I glibly answered that the double and single times were the same because the i5 does the double scalar arithmetic in hardware. I thought about this afterwards and realised that I did not really understand how the processor hardware did this so efficiently. Thanks for the answer Jim.
This question is not about SSE or AVX. I get very good performance with most of my code using these instruction sets: SSE typically x2.5, AVX typically x5, all single precision implementations of course.
The focus of the question is how contemporary CISC processors handle double precision computation. The answer to this is that the FP engine circuitry does the computation.
>>>This question originated because someone asked me why the single and double computational performance of a program on an i5 processor was the same, whereas on the GTX 480 GPU this is not the case>>>
Probably because of either a lack of double precision support, or driver-locked double precision support on non-Tesla cards.
>>>I am satisfied with answer that the maths is done by the fp engine>>>
Do you mean integer math?
I do not know if the same engine is processing integer math.
Further to this discussion: I found that single precision floating point performance was the same as double precision performance for computationally intense programs running on the i5-4440. No SIMD, no AVX or SSE, etc.; a plain x87 implementation as far as I know.
Some time ago I tried the same on a variety of other processors and found that this was indeed the case, i.e. SPARC and other Intel desktop CPUs all process single and double precision at the same speed.
I am now building for the Xeon E5-2640 and X7350 and find that identical programs using double precision are now taking twice as long to complete. I repeat: no AVX, SSE, or optimization flags used with gcc and icc. What is causing this behaviour? With the FPU functionality, shouldn't the processor be using the same number of instructions to complete? I would have thought that the choice of float or double would make very little difference in performance.
Also, how can one use VTune to determine where the difference in float/double performance arises? Is this possible?
>>>Further to this discussion where I found that the performance of the single floating point was the same as the double floating point performance for computationally intense programs running on the i5 4440. No SIMD, no AVX, SSE, etc. Plain x87 implementation as far as I know>>>
Which x87 instruction are you talking about? Can you provide more details?
To use VTune to troubleshoot the lack of performance, you can build two versions of the same program, one with single precision FP and one with double precision FP, and start the analysis by looking at pipeline front-end and back-end stalls.
From what I understand, if you're compiling for 32-bit, then you are probably using x87 floating point instructions, in which case all floating point numbers (floats and doubles) are expanded to 80 bits internally anyway. That's why they take the same amount of time to process.
The only time you will see a performance difference is when you're processing large data sets and become limited by RAM bandwidth. At that point you will stream through the smaller floats at twice the speed of doubles.
There is no connection between whether you are compiling for 32-bit or 64-bit and whether you're using the original x87 floating-point instructions or newer instructions like SSE, SSE2, or AVX. Whether you're using SSE2 or not is determined by compiler flags. To use SSE2, you need to tell the compiler that you are OK with 64-bit double precision (which is what SSE2 uses) as opposed to 80-bit (which is what x87 uses).
As to the speed of single precision vs. double precision multiplies and adds, they are the same, and have been for a long time (see Agner Fog's tables of instruction latencies and throughput). The FPU hardware is natively double precision. As mentioned earlier, however, you can get double the throughput with single precision because you can operate on twice as many of them at the same time using the SSE instruction set. Of course, the compiler has to recognize that vectorization is possible to achieve this speedup.
You might want to actually determine whether the compiler is producing x87 or SSE object code by looking at the disassembly. This will give you a much better idea of what is going on. It's possible that the compiler is vectorizing the code, in which case it would make sense that the throughput for floats was higher than for doubles.
tomk, I don't believe that is true, at least not for MSVC. When you compile for 64-bit, x87 instructions are no longer generated; the floating point model uses SSE in scalar mode instead (as all 64-bit CPUs support at least SSE2).
"The x64 Application Binary Interface " ... "All floating point operations are done using the 16 XMM registers."