Review the count number to see whether the results are within expectation.
Event Name Extension
Counter (HT off)
This event counts the number of AVX-256 Computational FP single precision uops issued during the cycle. Note: Packed AVX-256 can be counted as one, and will count for SIMD FP 128.
This event counts the number of AVX-256 Computational FP double precision uops issued during the cycle. Note: Packed AVX-256 can be counted as one, and will count for SIMD FP 128.
It appears that all of the floating-point performance counters (with the exception of Event 0xCA, "Floating Point Assists") have been removed from the Haswell-based products.
These counters are known to systematically overcount in Sandy Bridge and Ivy Bridge processors whenever the input registers are not ready (e.g., due to cache misses). I have seen overcounting by anywhere from ~3% to 10x, depending on the average latency for loads feeding into the FP instructions.
We still use these counters on our 6400-node Sandy Bridge system to monitor whether codes are using SSE or AVX, how well the codes vectorize, and whether they are running with 32-bit or 64-bit floating-point arithmetic. The accuracy is good enough for this classification process, and if we deploy a large Haswell-based system we will have to employ a different approach to get this information.
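As a rough sketch of the kind of classification described above (not our actual monitoring code), the snippet below turns raw uop counts into approximate FLOP totals and SSE/AVX, vectorization, and precision fractions. The dictionary keys and the `classify` function are hypothetical names chosen for illustration; mapping them to real event encodings is left to whatever tool reads the counters. The weights assume one FLOP per element, with 128-bit SSE and 256-bit AVX registers. Because the counters overcount, the totals should be read as upper bounds, while the fractions are only useful for coarse classification.

```python
# FLOPs represented by one uop of each class.
# Assumption: one FLOP per vector element; 128-bit SSE, 256-bit AVX.
FLOPS_PER_UOP = {
    "scalar_single": 1,
    "scalar_double": 1,
    "sse_packed_single": 4,   # 128 bits / 32 bits
    "sse_packed_double": 2,   # 128 bits / 64 bits
    "avx_packed_single": 8,   # 256 bits / 32 bits
    "avx_packed_double": 4,   # 256 bits / 64 bits
}


def classify(counts):
    """Summarize raw uop counts (hypothetical key names) as FLOP fractions.

    counts: dict mapping a key of FLOPS_PER_UOP to a raw counter value.
    Missing keys are treated as zero.
    """
    flops = {k: counts.get(k, 0) * w for k, w in FLOPS_PER_UOP.items()}
    total = sum(flops.values())
    if total == 0:
        return {
            "total_flops": 0,
            "vector_fraction": 0.0,
            "avx_fraction": 0.0,
            "double_fraction": 0.0,
        }
    vector = sum(v for k, v in flops.items() if "packed" in k)
    avx = sum(v for k, v in flops.items() if k.startswith("avx"))
    double = sum(v for k, v in flops.items() if k.endswith("double"))
    return {
        "total_flops": total,
        "vector_fraction": vector / total,   # how well the code vectorizes
        "avx_fraction": avx / total,         # AVX vs. SSE/scalar
        "double_fraction": double / total,   # 64-bit vs. 32-bit arithmetic
    }


if __name__ == "__main__":
    # Made-up counter values, for illustration only.
    print(classify({"scalar_double": 100, "avx_packed_double": 225}))
```

Weighting by FLOPs rather than by uop count keeps a single 256-bit uop from looking equivalent to a scalar one when judging how well a code vectorizes.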
Intel is certainly aware of the accuracy issues with these counters and is likely to fix the existing problems in some future products. Section 19.2 of Volume 3 of the SW Developer's Guide (document 324384-053, January 2015) shows that Broadwell gets a few FP events back:
- Event 0x14, Umask 0x01: ARITH.FPU_DIV_ACTIVE -- cycles that the divide unit is active
- Event 0xC0, Umask 0x02: INST_RETIRED.X87 -- x87 Floating-Point operations that are retired without generating exceptions.
I have not heard any definitive statements on when improved support for floating-point counts will make it into shipping products.
>>>As far as I understand, during execution of packed AVX instructions the vector can be only partly filled. Is there a way to determine whether a vector was completely filled or not?>>>
I presume that you are referring to the XMMx/YMMx registers. In this case you can use a debugger to see whether a specific register is filled with 4 or 8 scalars.
>>>and if we deploy a large Haswell-based system we will have to employ a different approach to get this information.
Do you have any idea how to get FLOPS on the Haswell architecture?