counting flops (SIMD)

José_Luis_G_1
Beginner

Hi,

I'm trying to determine the number of FLOPs of a program from the processor's performance counters.

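For instance, I have this subroutine:

      subroutine vec_mul(a, b, c)
        implicit none
        integer, parameter :: N = 1024*1024
        double precision, dimension(N) :: a, b, c
        integer :: i

        ! Only the first 1,000,000 of the N = 1,048,576 elements are multiplied.
        do i = 1, 1000000
          c(i) = a(i) * b(i)
        end do
      end subroutine vec_mul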

When I run it, I get ~950,000 SIMD_FP_256.PACKED_DOUBLE events (using perf).

I assume each event corresponds to 4 operations, so it is reporting ~3.8 million operations instead of 3 million.

Why is there such a difference?

Bernard
Valued Contributor I

It is quite possible that, while your code was running, the same core also executed an unrelated stream of FP instructions, and the counter was incremented for each of those FP multiply instructions as well.

Bernard
Valued Contributor I

Regarding the total number of flops: yes, it is ~950,000 packed multiply instructions times 4 double-precision values each.

My first post was about the variation in the counter results.

José_Luis_G_1
Beginner

Thank you for your comments.

Just to clarify, I meant that 1 million (not 3 million) is my expected count.

Regarding your suggestion, I'm pretty sure mine is the only code running on that core (and on the entire server, for that matter).

Bernard
Valued Contributor I

So you are actually seeing 1,000,000 flops? That would mean that probably 250,000 fmul instructions were executed.

McCalpinJohn
Honored Contributor III

The performance counter event named SIMD_FP_256.PACKED_DOUBLE is only supported on Sandy Bridge and Ivy Bridge processors.  On these systems this event is known to overcount.

It appears that the counter is incremented every time the instruction is issued, not when the instruction is executed or retired. If not all of the inputs are ready, the instruction will issue, get rejected, and then re-issue at a later time.

So the number of times the counter will increment depends (in part) on how much latency there is between the time the instruction is first issued and the time that all of the input data items are ready (i.e., in the L1 Data Cache).  This, in turn, depends on how busy the system is and how effectively the hardware prefetchers are operating.

The degree of over-counting can be relatively small for cache-friendly applications (as low as a few percent), while I have seen overcounting by a factor of 4 when running the STREAM benchmark on one core and up to a factor of 6 when running the STREAM benchmark using all cores.
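
As a rough illustration of the size of the effect in this thread, the sketch below compares the observed event count with the count that would be expected if every 256-bit instruction were counted exactly once. It assumes the loop was fully vectorized with 4 doubles per AVX instruction and ignores any remainder or scalar code path; the function name is made up for illustration.

```python
def overcount_factor(observed_events, iterations, lanes=4):
    """Ratio of observed packed-FP events to the count expected from the loop.

    Assumes (hypothetically) that every iteration is covered by a full-width
    vector instruction: `lanes` doubles per 256-bit AVX operation.
    """
    expected_events = iterations / lanes
    return observed_events / expected_events


# Numbers from the original post: 1,000,000 iterations, ~950,000 events.
factor = overcount_factor(observed_events=950_000, iterations=1_000_000)
print(f"expected events: {1_000_000 // 4}, overcount factor: {factor:.2f}")
```

With the numbers posted above this comes out to roughly 3.8, so nearly all of the reported count would be re-issued instructions rather than extra work.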

José_Luis_G_1
Beginner

iliyapolak,

I'm actually seeing ~3,500,000 flops (~950,000 fmul counted), but I expected 1,000,000 (the size of the loop).

John,

Thank you for your comments. I did reduce the loop to 100,000 iterations, and the SIMD_FP_256.PACKED_DOUBLE event count was ~57,800.

I'm trying to assess the performance of several apps, and was testing whether "perf" is a good enough tool. But with this over-counting, I would have only very rough estimates.

Any ideas on how to get, perhaps indirectly, a better approximation of retired SIMD_FP events?

Bernard
Valued Contributor I

@Jose

John gave a good explanation of why the counter is not showing exactly 1M instructions.

McCalpinJohn
Honored Contributor III

I am not aware of any way to get more accurate counts using the existing Sandy Bridge or Ivy Bridge hardware.   It might be possible to do an approximate correction based on other counters -- stall cycles, for example -- but without knowing a whole lot more about the implementation than any of us outside of the design team are ever likely to know, I don't imagine that any such approaches would be very accurate or generalizable across codes.

Floating-point performance counters can be used for a variety of purposes.  These counts are good enough for some purposes, but not for others.  Off the top of my head, I can think of three common usage scenarios:

  • Counters are used to get a numerator to use in a performance ratio.   FP OPs/second is the most common, but FP OPs per cache miss or FP OPs per word of memory traffic are also used.  These performance counters are not useful for computing such ratios, since the counts change from run to run, and change depending on the load on the system.  You are better off figuring out how to build a formula that can be used inside the program to accumulate the amount of "work" done as a function of problem size and/or iteration count, etc, then using that value as the numerator in the performance ratio calculations.
  • The counters can be used to see how the FP operations are split across scalar/vector, single/double, SSE/x87/AVX.  The existing counters are generally good enough for this purpose.  This is a very important use case, since the generated assembly code usually contains multiple code paths -- e.g., a vectorized code path and a non-vectorized code path -- and you can't tell by inspection which of the code paths is going to be executed.  The counters are definitely helpful here, though you need to be aware that the overcounting may not be consistent across the various categories.
  • The FP operation counters can also be used in overflow-based sampling to look for "hot spots" in the code.  The current floating-point counters are generally good enough for this.  Counts will be biased toward the instructions that have long delays before their data is ready, but that is not necessarily a bad thing -- these instructions are being retried due to cache misses and performance will likely be improved if those cache miss delays can be reduced.
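
For the second use case, the raw counts only need to be weighted by the number of operations each instruction performs. The sketch below does that for double precision. The category labels and the example counts are hypothetical placeholders, not measurements from this thread, and the result inherits whatever overcounting the counters have.

```python
# Double-precision operations performed by one instruction in each category.
DP_OPS_PER_INSTRUCTION = {
    "scalar": 1,       # scalar instruction: one double
    "packed_128": 2,   # 128-bit SSE: two doubles
    "packed_256": 4,   # 256-bit AVX: four doubles
}


def flop_shares(event_counts):
    """Return each category's fraction of the estimated flops."""
    flops = {
        name: count * DP_OPS_PER_INSTRUCTION[name]
        for name, count in event_counts.items()
    }
    total = sum(flops.values())
    return {name: value / total for name, value in flops.items()}


# Hypothetical counts, chosen only to show the arithmetic.
example_counts = {"scalar": 100_000, "packed_128": 50_000, "packed_256": 200_000}
for name, share in flop_shares(example_counts).items():
    print(f"{name}: {share:.1%}")
```

A large scalar share in a loop that was expected to vectorize would suggest that the non-vectorized code path is the one being executed.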

So the counters are generally useful for two of these three use cases, which is not too bad.   The use case that works poorly is one that is not of particular concern to most hardware designers.  If FP instructions are being retried because their data is not ready, then the problem is in the cache miss latency, not in the FP operation count.  

But I think that the lamentations of the users have been heard by Intel management, so I expect this to be improved significantly at some point in the future.  (I don't expect performance counters to ever be without problems, but this is one that can be architected to give repeatable answers that make reasonable sense for use in the numerator of performance ratios.) 
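
In the meantime, the numerator for a performance ratio can be computed analytically, as suggested in the first use case above. A minimal sketch of that idea, written in Python for brevity (the kernel in this thread is Fortran): the flop count is derived from the problem size, and only the elapsed time is measured. The function names and the one-flop-per-element formula for the element-wise multiply are assumptions for illustration.

```python
import time


def vec_mul_flops(n):
    """Analytic work count: one multiply per element of c = a * b."""
    return n


def vec_mul(a, b):
    """Element-wise product, standing in for the Fortran kernel."""
    return [x * y for x, y in zip(a, b)]


def measure_flop_rate(n):
    """Return (flops, seconds, flops per second) for one kernel call."""
    a = [1.5] * n
    b = [2.0] * n
    start = time.perf_counter()
    c = vec_mul(a, b)
    elapsed = time.perf_counter() - start
    assert len(c) == n  # sanity check on the result
    flops = vec_mul_flops(n)
    return flops, elapsed, flops / elapsed


flops, seconds, rate = measure_flop_rate(1_000_000)
print(f"{flops} flops in {seconds:.4f} s -> {rate / 1e6:.1f} MFLOP/s")
```

Because the flop count comes from the formula and not from the hardware, it stays the same from run to run regardless of system load; only the timing varies.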
