Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)
4995 Discussions

How to measure FLOPS using VTune?

Toby
Beginner
2,432 Views

I have a problem understanding some of the counters reported by VTune Amplifier XE 2013 when calculating FLOPS according to this Intel article. Using the VTune counters I get values approximately three times higher than the manually counted value (and I don't think it is due to speculation). Here are the details:

I believe there should be two floating-point operations (MULTIPLY and ADD) here, disregarding the float comparison on the return row. (Is this assumption correct?) Counting the number of iterations in the loop, multiplying it by 2, and dividing by elapsed time gives a value of 0.33 GFLOPS on this 12-core machine (running 12 threads). However, when using the excellent article describing how to collect the metrics using VTune, I get a much higher value: 1.06 GFLOPS. There are also branches in another method containing the loop, but the test data is set up so that the branches always render the same decision and the HasRemaingBudget function is always run. I.e., I think branch prediction should be extremely accurate and not add any overhead.

Scrutinizing the VTune-profiled assembly code for the above C# code gives the following metric:

For row 43 (Order order = ad.Order):

According to the article I referenced above, this counter should be included in the FLOPS calculation. But I don't understand what the floating-point operation is here. It seems like a straightforward move of data from memory to register, not involving any floating-point calculation at all. Or?

So which is the most accurate way to measure FLOPS in an application? Doing it the simple way and counting floating-point operations in high-level code (C# in this case), or relying on metrics from VTune to capture the cases where the floating-point operations seem to be hidden (from non-experts at least)?

16 Replies
Bernard
Valued Contributor I
Toby
Beginner

iliyapolak: Please note that I am referencing exactly that blog post in the question and that the question relates to the information stated in that blog. 

Bernard
Valued Contributor I

Toby wrote:

iliyapolak: Please note that I am referencing exactly that blog post in the question and that the question relates to the information stated in that blog. 

I answered your post from my corporate PC, where screenshots and hyperlinks were blocked, so I was not aware that you had read that blog.

Bernard
Valued Contributor I

>>>According to the article I referenced to above, this counter should be included in the FLOPS calculation. But I don’t understand what the floating point operation is here. It seems like a straight forward move of data from memory to register not involving any floating point calculation at all. Or?>>>

It looks like VTune erroneously attributed the event to the wrong line of assembly code. Can you post the full disassembly?

McCalpinJohn
Honored Contributor III

Unfortunately the hardware performance counters that are used to count floating-point arithmetic operations are known to overcount badly on recent Intel processors.  See my comments in another forum thread at https://software.intel.com/en-us/forums/topic/499193.

The short answer is that the event counter increments every time the processor dispatches the corresponding floating-point instruction to the execution units.  If the input data is not available (due to a cache miss), then the instruction will be rejected and retried a few cycles later.

Since I initially responded to that earlier forum thread I have done more experiments and have found cases with up to 10x overcounting ratios.

So if you have low cache miss rates the counts can be close (sometimes within a few percent), while cases with high cache miss rates or high cache miss rates plus heavy contention at the DRAMs will show overcounting by ratios of 2-4x or sometimes worse.

I recommend incorporating manual counting in the code.   This will still not be exact if the compiler does things you don't expect (such as common sub-expression elimination), but it has the wonderful advantage of giving the same result no matter where the data is actually located.

Toby
Beginner

iliyapolak: I see. At the end of this comment I have given some more code, associated with follow-up question 3 below.
John D. McCalpin: Thanks, it is indeed a logical and good explanation.

I have a few remaining follow-up questions, though, that I wonder if you can help me with.

1. One question I had was why a mov instruction shows a high counter value for FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE. It is clearly wrong. Could the reason be that, since these values are collected at issue time of the instruction rather than at retirement, the collected values from some other instruction show up in the wrong place? In that case it is VTune's mapping of the counters to specific instructions that is wrong, I guess (a bug?). Is this a fair assumption, or is there another possible explanation?

2. Are there many other counters in VTune besides these that are collected before instruction retirement? If so, how can this information be found for any particular counter? It seems necessary in order to know whether you can trust the counters or not.

3. This means I would have to count FP operations "manually" instead, as suggested. But then I have to know for certain which high-level-language operations are translated into FP operations in assembly. In the code I posted at the top there is a multiply and a subtraction, which clearly count as two FP operations. But I am unsure whether the comparison of the float value with 0 _could_ (ever?) be translated into an FP operation in assembly. Looking at the assembly code and how VTune maps the C# code to it, it looks like the code below. I understand rows 1-5 and 9, but can someone please explain or give some hints for rows 6-8?

line 1: mov rax, qword ptr [rcx+0x8]        // Order order = container.Order
line 2: movss xmm1, dword ptr [rcx+0x14]    // float totalCostSpent = container.TotalSpentCost*order.CommissionFactor;
line 3: mulss xmm1, dword ptr [rax+0xc]     // float totalCostSpent = container.TotalSpentCost*order.CommissionFactor;
line 4: movss xmm0, dword ptr [rax+0x8]     // return order.LastRemainingBudget - totalCostSpent > 0
line 5: subss xmm0, xmm1                    // return order.LastRemainingBudget - totalCostSpent > 0
line 6: xor eax, eax                        // return order.LastRemainingBudget - totalCostSpent > 0
line 7: ucomiss xmm0, dword ptr [rip+0x8]   // return order.LastRemainingBudget - totalCostSpent > 0
line 8: setnbe                              // return order.LastRemainingBudget - totalCostSpent > 0
line 9: ret                                 // Order order = container.Order (seems like an incorrect VTune C#-to-assembly mapping though)

4. One final question: how does VTune get hold of the assembly code? Is it merely a decompilation of the high-level code, or is it based on some kind of recording of the instructions that are actually retired (or aborted)? (This question is somewhat associated with question #1, I think.) If the latter, I guess the actual correct execution order (due to out-of-order execution) would show correctly in VTune?

Bernard
Valued Contributor I

>>>but can someone please explain or give some hints for rows 6-8? >>>

line 6: xor eax, eax                                       //return order.LastRemainingBudget - totalCostSpent > 0
Zeroing the eax register. I suppose the return value will be loaded into eax.

line 7: ucomiss xmm0, dword ptr [rip+0x8]   //return order.LastRemainingBudget - totalCostSpent > 0
This line performs an unordered floating-point comparison of the subtraction result in xmm0 against the constant 0.0, which is stored in memory and addressed relative to the RIP register (RIP-relative addressing).

line 8: setnbe                                                //return order.LastRemainingBudget - totalCostSpent > 0
This instruction checks the EFLAGS register and sets the byte if the result is "not below or equal", i.e. greater.

McCalpinJohn
Honored Contributor III

VTune uses a sampling technique that is based on interrupts generated by the overflow of performance counters.  There is almost always a "skew" between the instruction that caused the interrupt and the instruction whose program counter gets captured by the interrupt handler.  VTune knows about this and tries to correct for it, but it is not always possible to do so accurately.   In your case I suspect that the actual floating point instructions are 1-2 instructions before the MOV instruction that VTune is pointing at.

To reduce the skew between the instruction that causes the counter overflow (and interrupt) and the instruction whose program counter is captured, Intel has introduced "precise" performance counter events.  Not all events are "precise", and different processor models support different "precise" events.   On at least some systems, to get "precise" identification of the instruction causing the overflow, you must use only one performance counter and clear the "enabled" bit on all the other performance counter event select registers.

On the issue of over-counting "executed" instructions, the only events that are sure to avoid this problem are those with the word "retired" in the name or description.  (They might have other bugs, but they won't over-count due to the reject & retry mechanism.)   Unfortunately Intel has not documented the use of a reject & retry mechanism for instruction "execution" since the Pentium 4 processor (where it is referred to as "replay", e.g.,  http://www.xbitlabs.com/articles/cpu/display/replay.html), so explanations that depend on understanding this feature have not been forthcoming.

For the Sandy Bridge core, I see only 8 performance counter events that are limited to counting "retired" events, and of these only the first four look particularly reliable (but I have not tested them extensively).

  • Events 0xC0, 0xC2, 0xC4, 0xC5 count retired instructions, uops, branches, and mispredicted branches, respectively.
  • Event 0xCD is the "load latency facility", which is labeled as "unreliable" for locally homed data on the Xeon E5 (Sandy Bridge) processors.
  • Event 0xD0 counts memory uops.  There are published errata for this (and the next two) event(s) when operating with HyperThreading enabled.
  • Event 0xD1 counts memory load uops.   In addition to the errata relating to HyperThreading,  there are published errata for most of the sub-events (with partial workarounds), and there is documentation of unexpected behavior for 256-bit AVX loads.
  • Event 0xD2 counts memory load uops that hit in the LLC. In addition to the errata relating to HyperThreading, this event is subject to the same under-counting errata as Event 0xD1 (for which there is a partial workaround).
David_A_Intel1
Employee

The assembly code displayed is simply the disassembly of your binary file(s).  As "Dr. Bandwidth" mentioned, VTune Amplifier records execution point information (e.g., EIP, TID, PID) at interrupt time and uses that to associate the event counter overflow with your code (or whatever code was executing at the time).  The assembly code is shown in the order that it appears in your binary file(s).

Toby
Beginner

Thanks for all your answers. I believe they have cleared up most of my questions, except for the one described in the next paragraph. But I can live with that one being unanswered, I guess, since there may not be a good answer to it...

Since I need to count the floating-point operations manually to get an accurate FLOPs/sec figure, it is of course important to figure out which instructions to include (at the assembly level). The multiply and subtract operations are a given, but I feel dubious about the float comparison ucomiss. On one hand, the definition of FLOPs/sec, as far as I have found, clearly states that all floating-point operations should be included. On the other hand, they can be weighted differently based on the amount of work they perform. (For instance, the NVIDIA NSight CUDA profiler allows different weightings of different types of instructions when calculating a FLOPs/sec figure for you. NSight does not seem to include floating-point comparison operations in the FLOPS metric. Besides, the value seems to be wrong there too, being way too high...) So basically, I am leaning towards not including the floating-point comparison in the FLOPs/sec metric since it is not ... comme il faut? But please holler if anyone has an opinion about this; it would be interesting to read. (And I know, FLOPs/sec is perhaps not the best of metrics since it is a bit ambiguous. Yet it is used a lot in various performance benchmarks as _the_ computing metric.)

Bernard
Valued Contributor I

>>>Since I need to count the floating point operations manually to get an accurate FLOPs/sec count, it is important of course to figure out what instructions to include (at assembly level).>>>

I suppose that you should take into account only arithmetic floating-point instructions. On the other hand, when you are benchmarking your code you should also take the floating-point comparison instruction into account.

Toby
Beginner

Sorry, I don't understand the difference you are pointing at between the two approaches. Why would a "benchmark" be different in including both logical and arithmetic floating-point operations? Is that standard procedure for benchmarks? Could you elaborate, please?

McCalpinJohn
Honored Contributor III

There are problems with any approach to counting "FLOPS" -- sometimes you get to decide which problems you will live with and sometimes a particular set of problems gets forced on you.

I prefer to count "nominal FP operations" at an algorithmic level.   This gives me a single count that is independent of the system architecture, and which can therefore be used as the numerator in "FP Operations / Elapsed Time" in a way that allows direct comparison between systems.   I try to remember to refer to it as the "nominal operation count" to remind me that it is approximate, and that there are several ways that the hardware and software may do things differently.

  • Count all add/subtract and multiply operations that are plainly specified by the algorithm. 
    • If it is obvious that there are common sub-expressions that should be eliminated, I *might* decide not to count them, but I have also found that it is harder to be confident about this for SIMD instruction sets.
  • Divide and Square root operations are counted separately.
    • It is not at all obvious how they should be weighted, but whatever weighting is used, it must be fixed if you want to compare across platforms.
      • Some large supercomputer acquisitions have specified that these should each be counted as 4 floating-point operations.  While this is completely arbitrary (and much lower than the actual cost in either latency or throughput), at least it is clear.
  • Transcendental functions are even worse.  They should also be counted separately and given a fixed weight, but whatever value you use it is not going to represent the actual "cost" across platforms.
  • Don't count operations that might be executed due to hardware speculation.
  • Most FP comparison operations use the same functional unit as the adder, so those should be counted.
    • It is harder to have a strong opinion about how to count trivial compare operations, such as checking the sign or checking if a value is zero, since these might have alternate implementations that don't require use of the FP hardware.

 

The comments above provide some insight into the kinds of trouble you can get into by looking at the generated machine instructions instead of the algorithmic requirements.  For example, a compiler might replace a floating-point divide with a software approximation --- executing slightly faster than the hardware divide, but getting a slightly less accurate result, and performing many more visible FP add and multiply instructions.   If you change the operation count in the numerator of "FP Operations / Elapsed Time" accordingly, you will not be able to compare the resulting rates.

Toby
Beginner

John D. McCalpin: Thanks for your input. I actually reached the same conclusion: count FLOPs/sec at the algorithmic level instead of trying to assess all the different versions of the assembly code (I am doing a performance comparison of the same algorithm across many different implementations and platforms). Though I must confess my conclusion was based more on gut feeling than raw knowledge, so it was very nice to read what you wrote above.  I also concluded not to count the FP logical comparison (greater than 0), since I figured some implementations would not use an FP operation for it. As long as all test versions on all platforms use the same obvious FP operations for counting FLOPS, it should be a relatively accurate measurement of floating-point processing, I figured. And a pessimistic measurement is probably better than an overly optimistic one when comparing with FLOPs/sec figures from other tests.

So once, again. Thanks for your thoughts on this! 

I can't stop myself, so a small PS (it's just a story, since enough info is not provided for an answer, I would guess): I saw something weird using the CUDA NSight profiler that kind of fits with what you are writing. In some versions of the SASS assembly code, a C++ CUDA FP add operation was completely removed by the compiler, although it clearly was dependent on run-time information (memory) and nothing else (AFAIK) could cause it to be optimised away. I could only see two possible reasons for it: 1) the profiler is not showing the SASS code correctly, or 2) the FP add operation was optimised into something else. But this sounds weird to me. Oh well. As you say, I did run into trouble... :) Anyway, that question belongs in another forum...

 

Bernard
Valued Contributor I

Toby wrote:

Sorry. I dont understand the difference you are pointing at between the two different ways. Why is a "benchmark" different to include both logical and arithmetic floating point operations? Is it standard procedure for benchmarks? Could you elaborate please?

By writing "benchmark" my intention was to describe a measurement process where you need to take various types of instructions and their corresponding metrics into account in order to calculate, for example, the average CPI.

Bernard
Valued Contributor I

>>>Transcendental functions are even worse.  They should also be counted separately and given a fixed weight, but whatever value you use it is not going to represent the actual "cost" across platforms>>>

For Intel SIMD code, transcendental functions (sin, cos, exp, etc.) will probably be implemented as a stream of add/mul or FMA instructions.
