Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Question about measuring GFLOPS and AVX performance

GHui
Beginner
2,174 Views

I want to measure GFLOPS and AVX performance. The PCM tools do not seem to support this. What else can I do to get GFLOPS and AVX numbers?

Any help will be appreciated.

0 Kudos
9 Replies
Bernard
Valued Contributor I
2,174 Views

Do you want to measure program performance?

0 Kudos
McCalpinJohn
Honored Contributor III
2,174 Views

If you want to measure the actual floating-point arithmetic execution rate, you are mostly out of luck. The performance counters that measure floating-point arithmetic instructions (scalar, 128-bit vector, and 256-bit vector) on Sandy Bridge, Ivy Bridge, and Haswell are known to "over-count".

The degree of over-counting depends primarily on the average latency between issuing the instruction and the availability of the data that the instruction uses (either register arguments or memory arguments). If all the data is in the L1 cache, there is almost no over-counting. If the data is in the L2 cache, you can get slight over-counting (10%-20%, but variable), and if all the data is in memory, the counts can be as much as 6x to 10x higher than the actual number of completed floating-point arithmetic instructions.

See more discussion at https://software.intel.com/en-us/forums/topic/499193 and https://software.intel.com/en-us/forums/topic/531796
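
To make the over-counting concrete, here is a minimal sketch of how one might compare a counter reading against the number of floating-point instructions a kernel should execute according to its source. The function names and the counter reading are hypothetical, and the sketch assumes a simple loop over doubles compiled to 256-bit AVX without FMA, so every four elements cost one vector multiply and one vector add.

```python
def expected_vector_fp_instructions(n, lanes=4, ops_per_element=2):
    """Vector FP instructions a kernel should execute, derived from its source.

    Assumption: n is a multiple of `lanes` and there is no FMA, so a
    multiply and an add on each element are two separate vector
    instructions per group of `lanes` elements.
    """
    if n % lanes != 0:
        raise ValueError("n must be a multiple of lanes in this sketch")
    return (n // lanes) * ops_per_element


def overcount_factor(counted, expected):
    """Ratio of the counter reading to the analytically expected count."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return counted / expected


# Hypothetical numbers for illustration only, not measurements.
n = 1_000_000                                   # doubles processed
expected = expected_vector_fp_instructions(n)   # 256-bit multiplies + adds
counted = 3_200_000                             # made-up counter reading
print(f"over-count factor: {overcount_factor(counted, expected):.1f}x")
```

A factor close to 1 would suggest the data was mostly in the L1 cache; a much larger factor would be in line with the memory-bound case described above.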

 

0 Kudos
GHui
Beginner
2,174 Views

iliyapolak wrote:

Do you want to measure program performance?

Yes, is there some way to do that?

0 Kudos
GHui
Beginner
2,174 Views

John D. McCalpin wrote:

If you want to measure the actual floating-point arithmetic execution rate, you are mostly out of luck. The performance counters that measure floating-point arithmetic instructions (scalar, 128-bit vector, and 256-bit vector) on Sandy Bridge, Ivy Bridge, and Haswell are known to "over-count".

The degree of over-counting depends primarily on the average latency between issuing the instruction and the availability of the data that the instruction uses (either register arguments or memory arguments). If all the data is in the L1 cache, there is almost no over-counting. If the data is in the L2 cache, you can get slight over-counting (10%-20%, but variable), and if all the data is in memory, the counts can be as much as 6x to 10x higher than the actual number of completed floating-point arithmetic instructions.

See more discussion at https://software.intel.com/en-us/forums/topic/499193 and https://software.intel.com/en-us/forums/topic/531796

I have measured it on Sandy Bridge and Ivy Bridge, but not on Haswell. I can accept slight over-counting. I have checked the documents at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html, but they do not list events for FLOPS and vector instructions on Haswell.

0 Kudos
Bernard
Valued Contributor I
2,174 Views

GHui wrote:

iliyapolak wrote:

Do you want to measure program performance?

Yes, is there some way to do that?

You can use VTune to do that. Start measuring with the Lightweight Hotspots analysis, then move deeper by choosing more advanced analysis types.

0 Kudos
Bernard
Valued Contributor I
2,174 Views

>>>I have measured it on Sandy Bridge and Ivy Bridge, but not on Haswell. I can accept slight over-counting. I have checked the documents at http://www.intel.com/content/www/us/en/processors/architectures-software.... But they do not list events for FLOPS and vector instructions on Haswell.>>>

Check the following article about FP performance analysis: https://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs
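
As a rough illustration of how per-width counter values can be turned into a GFLOPS estimate, here is a minimal sketch. It weights each instruction count by the number of elements that instruction class operates on and divides by the elapsed time. The dictionary keys are hypothetical labels, not real PMU event names, and the counts are made up; the result is only as good as the counters, so any over-counting carries straight through. Whether this matches the article's exact method is not something the sketch claims.

```python
# Elements processed per instruction for each (hypothetical) counter label.
# Assumes one floating-point operation per element, i.e. no FMA.
ELEMENTS_PER_INSTRUCTION = {
    "scalar_single": 1,
    "scalar_double": 1,
    "packed_128_single": 4,   # 128 bits / 32-bit floats
    "packed_128_double": 2,   # 128 bits / 64-bit doubles
    "packed_256_single": 8,   # 256 bits / 32-bit floats
    "packed_256_double": 4,   # 256 bits / 64-bit doubles
}


def estimate_gflops(counts, elapsed_seconds):
    """Estimate GFLOPS from per-class instruction counts and wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    flops = sum(ELEMENTS_PER_INSTRUCTION[name] * value
                for name, value in counts.items())
    return flops / elapsed_seconds / 1e9


def avx_fraction(counts):
    """Share of the estimated FLOPs that came from 256-bit instructions."""
    total = sum(ELEMENTS_PER_INSTRUCTION[k] * v for k, v in counts.items())
    if total == 0:
        return 0.0
    avx = sum(ELEMENTS_PER_INSTRUCTION[k] * v
              for k, v in counts.items() if k.startswith("packed_256"))
    return avx / total


# Made-up counter readings for a 1-second run.
sample = {"scalar_double": 1_000_000_000, "packed_256_double": 1_000_000_000}
print(f"{estimate_gflops(sample, 1.0):.2f} GFLOPS, "
      f"{avx_fraction(sample):.0%} from 256-bit AVX")
```

Splitting the estimate by vector width also gives a simple view of how much of the work actually runs as AVX, which is the second half of the original question.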

0 Kudos