I'm trying to determine the number of FLOPs of a program from processor counters.
For instance, I have this subroutine:
subroutine vec_mul(a, b, c)
  implicit none
  integer, parameter :: N = 1024*1024
  double precision, dimension(N) :: a, b, c
  integer :: i
  do i = 1, 1000000
    c(i) = a(i) * b(i)
  end do
end subroutine vec_mul
When I run it, I get ~950,000 SIMD_FP_256.PACKED_DOUBLE events (using perf).
I assume each event actually corresponds to 4 operations, so it is reporting ~3.8 million operations instead of 3 million.
Why is there such a difference?
It is quite possible that, while your code was running, the same core executed another, unrelated stream of FP instructions, and the counter was simply incremented on each occurrence of an FP multiply instruction.
Thank you for your comments.
Just to clarify, I meant that 1 million (not 3 million) is my expected count.
Regarding your suggestion, I'm pretty sure mine is the only code running on that core (or on the entire server, for that matter).
The performance counter event named SIMD_FP_256.PACKED_DOUBLE is only supported on Sandy Bridge and Ivy Bridge processors. On these systems this event is known to overcount.
It appears that the counter is incremented every time the instruction is issued, not when the instruction is executed or retired. If not all of the inputs are ready, the instruction will issue, get rejected, and then re-issue at a later time.
So the number of times the counter will increment depends (in part) on how much latency there is between the time the instruction is first issued and the time that all of the input data items are ready (i.e., in the L1 Data Cache). This, in turn, depends on how busy the system is and how effectively the hardware prefetchers are operating.
The degree of over-counting can be relatively small for cache-friendly applications (as low as a few percent), while I have seen overcounting by a factor of 4 when running the STREAM benchmark on one core and up to a factor of 6 when running the STREAM benchmark using all cores.
I'm actually seeing ~3,500,000 flops (~950,000 fmul counted) but I expected 1,000,000 (the size of the loop).
Thank you for your comments. I did reduce the loop to 100,000 iterations, and the SIMD_FP_256.PACKED_DOUBLE event count was ~57,800.
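To put the two measurements quoted in this thread side by side, here is a rough sketch of the arithmetic in Python. It assumes that each SIMD_FP_256.PACKED_DOUBLE event is one 256-bit instruction operating on 4 doubles, and that the loop compiles to packed multiplies only. Neither assumption is confirmed by the thread, and the function name `overcount_factor` is only illustrative.

```python
# Assumption: one SIMD_FP_256.PACKED_DOUBLE event = one 256-bit instruction
# operating on 4 doubles, and the loop is fully vectorized.
DOUBLES_PER_256BIT_VECTOR = 4


def overcount_factor(observed_events, loop_iterations):
    """Ratio of observed packed-double events to the number expected
    if each group of 4 iterations needed exactly one packed multiply."""
    expected_events = loop_iterations / DOUBLES_PER_256BIT_VECTOR
    return observed_events / expected_events


# Approximate values quoted in the thread: (label, observed events, iterations).
runs = [
    ("1,000,000 iterations", 950_000, 1_000_000),
    ("100,000 iterations", 57_800, 100_000),
]

for label, events, iterations in runs:
    factor = overcount_factor(events, iterations)
    print(f"{label}: overcount factor ~{factor:.2f}")
```

Under these assumptions the larger loop overcounts by a factor of about 3.8 and the smaller one by about 2.3, so the error is not a fixed multiplier that could simply be divided out.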
I'm trying to assess the performance of several apps, and was testing if "perf" is a good enough tool. But with this over-counting, I would have only very rough estimates.
Any ideas for getting, perhaps indirectly, a better approximation of retired SIMD_FP events?
I am not aware of any way to get more accurate counts using the existing Sandy Bridge or Ivy Bridge hardware. It might be possible to do an approximate correction based on other counters -- stall cycles, for example -- but without knowing a whole lot more about the implementation than any of us outside of the design team are ever likely to know, I don't imagine that any such approaches would be very accurate or generalizable across codes.
Floating-point performance counters can be used for a variety of purposes. These counts are good enough for some purposes, but not for others. Off the top of my head, I can think of three common usage scenarios:
So the counters are generally useful for two of these three use cases, which is not too bad. The use case that works poorly is one that is not of particular concern to most hardware designers. If FP instructions are being retried because their data is not ready, then the problem is in the cache miss latency, not in the FP operation count.
But I think that the lamentations of the users have been heard by Intel management, so I expect this to be improved significantly at some point in the future. (I don't expect performance counters to ever be without problems, but this is one that can be architected to give repeatable answers that make reasonable sense for use in the numerator of performance ratios.)