GHui · ‎10-19-2014

I want to measure Gflops and AVX performance. The PCM tools do not seem to support this. What else can I do to get Gflops and AVX figures?

Any help will be appreciated.

Bernard · ‎10-19-2014

Do you want to measure program performance?

McCalpinJohn · ‎10-20-2014

If you want to measure actual floating point arithmetic execution rate you are mostly out of luck. The performance counters that measure floating-point arithmetic instructions (scalar, 128-bit vector, and 256-bit vector) on Sandy Bridge, Ivy Bridge, and Haswell are known to "over-count".

The degree of over-counting depends primarily on average latency between issuing the instruction and the availability of the data that the instruction uses (either register arguments or memory arguments). If all the data is in the L1 cache, then there is almost no over-counting. If the data is in the L2 cache then you can get slight over-counting (10%-20%, but variable), and if all the data is in memory the counts can be as much as 6x to 10x higher than the actual number of completed floating-point arithmetic instructions.
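One way to put a number on the over-counting described above, offered only as a sketch and not as something taken from the posts: run a kernel whose FLOP count is known analytically and divide the counter-derived count by it. The triad kernel, the function names, and the numbers below are hypothetical and chosen purely for illustration.

```python
def triad_flops(n):
    """Analytic FLOP count for a[i] = b[i] + s * c[i] over n elements:
    one multiply and one add per element."""
    return 2 * n


def overcount_factor(counter_flops, analytic_flops):
    """Ratio of counter-reported FLOPs to the analytically known FLOPs.
    1.0 means no over-counting; 6.0 means the counter reports 6x too many."""
    if analytic_flops <= 0:
        raise ValueError("analytic_flops must be positive")
    return counter_flops / analytic_flops


# Made-up numbers for illustration only, not measurements.
n = 1_000_000
reported = 12_000_000  # hypothetical counter-derived FLOP total
factor = overcount_factor(reported, triad_flops(n))
print(f"over-count factor: {factor:.1f}x")
```

Running the same kernel with working sets sized for L1, L2, and memory would then show how the factor changes with data location.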

See more discussion at https://software.intel.com/en-us/forums/topic/499193 and https://software.intel.com/en-us/forums/topic/531796
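For reference, turning the three kinds of counts mentioned above (scalar, 128-bit vector, 256-bit vector) into a Gflops figure is usually a weighted sum, where each instruction is weighted by the number of elements it operates on. The sketch below assumes one arithmetic operation per element and uses hypothetical labels as dictionary keys, not real event names; the actual events differ by microarchitecture and have to be looked up in the vendor documentation. Given the over-counting, the result is best read as an upper bound.

```python
# Elements processed per instruction, by width and precision.
# Keys are hypothetical labels, NOT real performance-event names.
FLOPS_PER_INSTRUCTION = {
    "scalar_single": 1,
    "scalar_double": 1,
    "packed128_single": 4,  # 128 bits / 32-bit floats
    "packed128_double": 2,  # 128 bits / 64-bit doubles
    "packed256_single": 8,  # 256 bits / 32-bit floats
    "packed256_double": 4,  # 256 bits / 64-bit doubles
}


def estimate_gflops(counts, elapsed_seconds):
    """Weighted sum of instruction counts divided by elapsed time, in Gflops.
    Because the counters can over-count, treat this as an upper bound."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_flops = sum(FLOPS_PER_INSTRUCTION[name] * n for name, n in counts.items())
    return total_flops / elapsed_seconds / 1e9


# Made-up counter readings for illustration only.
counts = {"scalar_double": 1_000_000, "packed256_double": 500_000_000}
gflops = estimate_gflops(counts, elapsed_seconds=2.0)
print(f"estimated upper bound: {gflops:.4f} Gflops")
```

The share of the total that comes from the 256-bit entries gives a rough idea of how much of the work is done with AVX.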

GHui · ‎10-20-2014

iliyapolak wrote:

Do you want to measure program performance?

Yes, is there some way to do that?

GHui · ‎10-21-2014

John D. McCalpin wrote:

If you want to measure actual floating point arithmetic execution rate you are mostly out of luck. The performance counters that measure floating-point arithmetic instructions (scalar, 128-bit vector, and 256-bit vector) on Sandy Bridge, Ivy Bridge, and Haswell are known to "over-count".

The degree of over-counting depends primarily on average latency between issuing the instruction and the availability of the data that the instruction uses (either register arguments or memory arguments). If all the data is in the L1 cache, then there is almost no over-counting. If the data is in the L2 cache then you can get slight over-counting (10%-20%, but variable), and if all the data is in memory the counts can be as much as 6x to 10x higher than the actual number of completed floating-point arithmetic instructions.

See more discussion at https://software.intel.com/en-us/forums/topic/499193 and https://software.intel.com/en-us/forums/topic/531796

I have measured it on Sandy Bridge and Ivy Bridge, but not on Haswell. I can accept slight over-counting. I have checked the documents at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html, but they do not list any events for flops or vector instructions on Haswell.

GHui · ‎10-21-2014

Sorry, because of a slow internet connection, I clicked the submit button more than once.

Bernard · ‎10-21-2014

GHui wrote:

iliyapolak wrote:

Do you want to measure program performance?

Yes, is there some way to do that?

You can use VTune to do that. Start by choosing the Lightweight Hotspots analysis, then move deeper with the more advanced analysis types.

Bernard · ‎10-21-2014

>>>I have measured it on Sandy Bridge and Ivy Bridge, but not on Haswell. I can accept slight over-counting. I have checked the documents at http://www.intel.com/content/www/us/en/processors/architectures-software.... But they do not list any events for flops or vector instructions on Haswell.>>>

Check the following paper about FP performance analysis: https://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs

Question about get Gflops and AVX performance