Interpreting the AVX counter results

Pavel_Mezentsev
Beginner
594 Views
As far as I understand, during execution of packed AVX instructions the vector may be only partly filled. Is there a way to determine whether a vector was completely filled or not?
Or is it correct to assume that the compiler does its job well and that cases where the vector is not filled occur rarely (e.g. when we run out of data at the end of a loop)?
12 Replies
Peter_W_Intel
Employee
Can you use VTune Amplifier XE 2011 to do event-based sampling with the PMU events named SIMD_FP_256?

Review the counts to see whether the results match your expectations.

| Event Name Extension | Mask | Description | Counter | Counter (HT off) |
|---|---|---|---|---|
| PACKED_SINGLE | 0x01 | This event counts the number of AVX-256 computational FP single-precision uops issued during the cycle. Note: Packed AVX-256 can be counted as one, and will count for SIMD FP 128. | 0,1,2,3 | 0,1,2,3,4,5,6,7 |
| PACKED_DOUBLE | 0x02 | This event counts the number of AVX-256 computational FP double-precision uops issued during the cycle. Note: Packed AVX-256 can be counted as one, and will count for SIMD FP 128. | 0,1,2,3 | 0,1,2,3,4,5,6,7 |

Pavel_Mezentsev
Beginner
Yes, I've done the profiling using VTune.
I'm analyzing the performance of a huge application and trying to understand whether the code uses many FP operations and whether it has been vectorized successfully.
I got the following result for one of the runs:
CPU_CLK_UNHALTED.REF_TSC 4,983,560,000,000
CPU_CLK_UNHALTED.THREAD 5,670,360,000,000
FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE 358,000,000,000
FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE 1,164,920,000,000
FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE 21,200,000,000
FP_COMP_OPS_EXE.X87 223,200,000,000
INST_RETIRED.ANY 7,926,840,000,000
The SIMD_FP_256 counters are all zero.
I've also measured the HPL code and got the following results:
CPU_CLK_UNHALTED.REF_TSC 2,675,264,000,000
CPU_CLK_UNHALTED.THREAD 2,922,426,000,000
FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE 2,816,000,000
FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE 18,080,000,000
FP_COMP_OPS_EXE.X87 460,000,000
INST_RETIRED.ANY 7,581,522,000,000
SIMD_FP_256.PACKED_DOUBLE 4,582,812,000,000
What I don't understand is how to interpret the results. What is the difference between FP_COMP_OPS_EXE and SIMD_FP_256? Is it justified to say that each increment of the counter means that 4 FLOPs were actually executed (for DP)? And can 2 increments occur during one processor cycle (one for an add and one for a mul)?
Any clarification on the subject would be appreciated!
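As a rough illustration of the "4 FLOPs per increment" reading, here is a minimal Python sketch that turns the HPL counts above into an upper-bound FLOP estimate. It assumes that every counted uop operates on a completely filled vector (1 lane for scalar, 2 for 128-bit packed double, 4 for 256-bit packed double) and that the counts are exact; neither assumption is confirmed in this thread. The names `LANES_PER_UOP` and `estimate_dp_flops` are hypothetical helpers, not part of VTune.

```python
# Double-precision elements per uop, ASSUMING every lane holds useful data.
LANES_PER_UOP = {
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 1,
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2,  # 128 bits / 64 bits
    "SIMD_FP_256.PACKED_DOUBLE": 4,          # 256 bits / 64 bits
}


def estimate_dp_flops(counts):
    """Upper-bound DP FLOP estimate; events without a known lane width are ignored."""
    return sum(
        LANES_PER_UOP[name] * value
        for name, value in counts.items()
        if name in LANES_PER_UOP
    )


# Counter values copied from the HPL run in this post.
hpl_counts = {
    "CPU_CLK_UNHALTED.THREAD": 2_922_426_000_000,
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2_816_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 18_080_000_000,
    "FP_COMP_OPS_EXE.X87": 460_000_000,
    "SIMD_FP_256.PACKED_DOUBLE": 4_582_812_000_000,
}

total_flops = estimate_dp_flops(hpl_counts)
flops_per_cycle = total_flops / hpl_counts["CPU_CLK_UNHALTED.THREAD"]
print(f"{total_flops:,} estimated DP FLOPs, {flops_per_cycle:.2f} per unhalted cycle")
```

The result is a ceiling rather than a measurement: partially filled vectors or overcounted uops would both push the true figure lower.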
Peter_W_Intel
Employee
SIMD_FP_256.PACKED_DOUBLE 4,582,812,000,000: counts SSE, AVX-128 FP and AVX-256 FP computational double-precision uops issued.
FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE 358,000,000,000: counts only SSE and AVX-128 FP computational double-precision uops issued.
Pavel_Mezentsev
Beginner
Is it correct that the operations counted by FP_COMP_OPS_EXE are a subset of the operations counted by SIMD_FP_256? And that by subtracting the former from the latter I get the number of 256-bit operations only?
Peter_W_Intel
Employee
I think the answer is "Yes": the result is for AVX-256 only :-)
mrabet_ahmed_amine

What is the equivalent of SIMD_FP_256.PACKED_DOUBLE on Haswell?

McCalpinJohn
Honored Contributor III

It appears that all of the floating-point performance counters (with the exception of Event 0xCA, "Floating Point Assists") have been removed from the Haswell-based products.

These counters are known to systematically overcount in Sandy Bridge and Ivy Bridge processors whenever the input registers are not ready (e.g., due to cache misses).   I have seen overcounting by anywhere from ~3% to 10x, depending on the average latency for loads feeding into the FP instructions.

We still use these counters on our 6400-node Sandy Bridge system to monitor whether codes are using SSE or AVX, how well the codes vectorize, and whether they are running with 32-bit or 64-bit floating-point arithmetic.  The accuracy is good enough for this classification process, and if we deploy a large Haswell-based system we will have to employ a different approach to get this information.
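The kind of classification described above can be sketched in a few lines of Python. This is only an illustration, not the monitoring tool used on that system: the function name `classify_fp_uops` and the bucket labels are hypothetical, events are grouped purely by the name patterns that appear in this thread, and the shares inherit whatever overcounting the raw counters have.

```python
def classify_fp_uops(counts):
    """Return the share of FP uops per category, grouped by event-name pattern."""
    buckets = {"x87": 0, "scalar": 0, "sse_packed": 0, "avx256": 0}
    for name, value in counts.items():
        if name.startswith("SIMD_FP_256."):
            buckets["avx256"] += value
        elif name.startswith("FP_COMP_OPS_EXE."):
            if name.endswith("X87"):
                buckets["x87"] += value
            elif "SCALAR" in name:
                buckets["scalar"] += value
            elif "PACKED" in name:
                # Per the reply earlier in the thread, this covers SSE and AVX-128.
                buckets["sse_packed"] += value
    total = sum(buckets.values())
    return {key: (value / total if total else 0.0) for key, value in buckets.items()}


# Counter values copied from the first (large application) run in this thread.
app_counts = {
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 358_000_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 1_164_920_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE": 21_200_000_000,
    "FP_COMP_OPS_EXE.X87": 223_200_000_000,
    "INST_RETIRED.ANY": 7_926_840_000_000,  # not an FP event, so it is ignored
}

shares = classify_fp_uops(app_counts)
for category, share in shares.items():
    print(f"{category:>10}: {share:6.1%}")
```

For that run the scalar bucket dominates and the AVX-256 share is zero, which matches the observation that the SIMD_FP_256 counters stayed at zero. The same pattern matching on `SINGLE` and `DOUBLE` in the event names would give the precision split.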

Intel is certainly aware of the accuracy issues with these counters and is likely to fix the existing problems in some future products.  Section 19.2 of Volume 3 of the SW Developer's Guide (document 324384-053, January 2015) shows that Broadwell gets a few FP events back:

  • Event 0x14, Umask 0x01: ARITH.FPU_DIV_ACTIVE -- cycles that the divide unit is active
  • Event 0xC0, Umask 0x02: INST_RETIRED.X87 -- x87 Floating-Point operations that are retired without generating exceptions.

I have not heard any definitive statements on when improved support for floating-point counts will make it into shipping products.

Bernard
Valued Contributor I

>>>As far as I understand during execution of packed AVX instructions the vector can be filled just partly. Is there a way to determine whether a vector was completely filled or not>>>

I presume that you are referring to the XMMx/YMMx registers. In this case you can check with a debugger whether a specific register is filled with 4 or 8 scalars.

mrabet_ahmed_amine

 

>>>and if we deploy a large Haswell-based system we will have to employ a different approach to get this information.

Do you have any idea how to get FLOPS on the Haswell architecture?

 
Bernard
Valued Contributor I

>>>Do you have any idea to get flops on haswell architecture ?>>>

Do you mean to count how many GFLOPS were executed?

mrabet_ahmed_amine

>>Do you mean to count how many GFLOPS were executed?

Yes, to count the GFLOPS of the application, and the number of single-precision and double-precision FLOPs that were executed.

Bernard
Valued Contributor I

I think that John answered your question.
