Agner Fog's instruction

Porter__Andrew · ‎12-02-2016

Hello,

I am attempting, in a simplistic way, to understand/predict the performance of a relatively simple Fortran kernel which contains a mixture of addition, subtraction, multiplication and division operations on 64-bit floating point values. I've been using Agner Fog's documents to construct a model of the CPU core (in this case Ivy Bridge, E5-1620 v.2) since it's easier to find the information I need from them. In all that follows I am considering the performance of a single thread and am ignoring SIMD and have explicitly turned it off (-no-vec) when compiling code. I've looked at the generated assembly and verified that it only contains e.g. MULSD and DIVSD operations.

My problem boils down to the fact that the kernel is dominated by the multiplication and division operations: it contains 10 multiplications and one division, all of which (if I understand correctly) must be dispatched to execution port 0. If I use Agner's figures for 1/throughput (which I presume he took from the official Intel documents although I haven't succeeded in finding the right table) then a MULSD takes 1 cycle and a DIVSD takes between 8 and 14 cycles. I therefore conclude that the bare minimum required for kernel execution is 10x1 cycles + 1x8 cycles = 18 cycles. This ignores any latency penalty that may have to be paid. However, when I compile and run the kernel I measure an execution time that corresponds to 15 cycles*.

Given that port 0 should be the bottleneck (irrespective of out-of-order execution or any other magic) I am at a loss to understand why the code appears to perform *better* than the prediction and would welcome any pointers. I should note that for other kernels that do not involve division my predictions seem more reliable.

Many thanks,

Andy.

*The kernel is the body of a loop which has a trip count of 256^2 times and is itself repeated many thousands of times for timing purposes. I have hyper-threading, stepping and turbo disabled and used likwid-perfctr to check what clock speed I get in order to convert the measured duration into a number of cycles.

anton_r_1 · ‎12-02-2016

You’re assuming that the reciprocal throughput is how long the execution port is tied up for. That’s not necessarily true. For instance, it may be that the division operation shares the multiplication hardware for 5 cycles, while using dedicated division hardware for the entire 8 cycles. In that case, an independent multiplication operation could potentially be issued for 3 out of the 8 cycles of the division.

I’m not sure if that’s what’s happening here, but it seems reasonable — most dividers do require multiplication during certain steps & sharing that hardware would make sense given that division is rare.

Thomas_W_Intel · ‎12-02-2016

Andi,

Did you give the Intel Architecture Code Analyzer (IACA) a try? This is a tool that automates what you have been trying by hand: estimating the throughput by analyzing instruction latency, data dependencies, and port assignments. While it is not the perfect performance predicator, IAA might you some hint on what is going on in the core.

Kind regards

Thomas

McCalpinJohn · ‎12-05-2016

Agner Fog's instruction latency and throughput data is from his own tests. The results are generally, but not always, in agreement with Intel's published values (when they are available, which is "sometimes").

Anton's comment probably points to the correct explanation for your observations. I don't think that Intel has ever published a sufficiently detailed description of the implementation of their floating-point divide unit to document this, but there is an excellent write-up on the design of the floating-point division and square root algorithms on the AMD K7 that includes some details on how the divider can be augmented to allow independent multiply operations to occur in the "idle" cycles. This paper is dated April 1999, so it is not current, but it does contain a very nice level of detail:
https://pdfs.semanticscholar.org/c85c/30159f9af9b814b20bbac175326a52e4f272.pdf

Porter__Andrew · ‎12-06-2016

Many thanks for all the suggestions. I did previously consider using the IACA but was put off by the fact it is no longer supported and appears to be C/C++ only (although I could probably have my Fortran call a C routine for mark-up purposes). If I'm honest, I also want to develop my own tool :-) However, I guess IACA might shed some light on the sharing of hardware between the division and multiplication operations so I'll see whether I can get it to work. I'll have a read of that paper too John.

Porter__Andrew · ‎01-23-2017

So, I had a go at getting IACA to work but did not succeed. I then discovered the micro-benchmarking facility of likwid and have used this to generate a series of benchmarks (in assembler) containing from 1 to 8 MULSD operations (within the body of a loop). I've then generated a variant of each of these that has an additional (and independent) DIVSD operation. The number of UOPS sent to port 0 increases appropriately when I add this operation to a benchmark which is encouraging. Running these benchmarks gives the measurements in the attached plot:

535569

These show that for a loop body containing between 1 and 6 MULSD operations, the inclusion of a DIVSD operation takes the cost up to 8 cycles (which is the cost I measure for a loop body containing a single DIVSD), independent of the actual number of MULSDs. For 7 and 8 MULSDs we hit some sort of threshold and the addition of a single DIVSD operation increases the time to 13 cycles.

All of which backs-up Anton's suggestion so thank you for your help.

Travis_D_ · ‎01-31-2017

Thomas Willhalm (Intel) wrote:

Andi,

Did you give the Intel Architecture Code Analyzer (IACA) a try? This is a tool that automates what you have been trying by hand: estimating the throughput by analyzing instruction latency, data dependencies, and port assignments. While it is not the perfect performance predicator, IAA might you some hint on what is going on in the core.

While IACA is a great tool when it works, it doesn't support *any* DIV operations last time I checked, so it wouldn't be very useful here.

Travis_D_ · ‎01-31-2017

Porter, Andrew wrote:

Many thanks for all the suggestions. I did previously consider using the IACA but was put off by the fact it is no longer supported and appears to be C/C++ only (although I could probably have my Fortran call a C routine for mark-up purposes). If I'm honest, I also want to develop my own tool :-) However, I guess IACA might shed some light on the sharing of hardware between the division and multiplication operations so I'll see whether I can get it to work. I'll have a read of that paper too John.

IACA doesn't care what language your program is written in. It just looks for a specific sequence of ASM instructions (the IACA_START and IACA_END) sequences. When you download it, you get an include file that helps you inject those sequences into your C/C++ program via inline asm, but you can copy them into whatever program you have *if* it supports inline ASM. For example, I had no trouble using them in an assembly program.

Actually *calling* a C-routine from Fortran wouldn't help at all, because IACA is looking for a block of assembly *delimited* by those markers, and putting a function call will just result in a <pre>call</pre> instruction, and not the markers.

Really, IACA isn't running your program at all, it just analyzes it via dissassembly. So you can just modify the output file directly with a hex editor or search/replace or whatever and insert the marker that way. Sure it will break your program, but they don't run with the tags in anyway.

Performance prediction for floating-point heavy kernel