SIMD_FP_256

Pavel_Mezentsev · ‎06-07-2012

As far as I understand, during execution of packed AVX instructions the vector can be only partly filled. Is there a way to determine whether a vector was completely filled or not?

Or is it correct to assume that the compiler does its job well and that cases where the vector is not filled occur rarely (e.g. when we run out of data at the end of a loop)?
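To make the end-of-loop case concrete, here is a minimal sketch that splits a loop trip count into full vectors and a leftover tail. The function name and the element counts are hypothetical; the only hardware fact assumed is that a 256-bit register holds 4 doubles or 8 floats.

```python
from typing import Tuple


def split_into_vectors(n_elements: int, lanes: int) -> Tuple[int, int]:
    """Return (full_vectors, leftover_elements) for a loop over n_elements.

    lanes is the number of elements per vector register:
    4 for double precision and 8 for single precision with 256-bit AVX.
    """
    return divmod(n_elements, lanes)


# Hypothetical loop of 1003 doubles: only the last iteration is partial.
full, tail = split_into_vectors(1003, 4)
print(f"{full} full vectors, {tail} leftover elements")
```

Under this simple model, partial vectors appear at most once per loop, so their share shrinks as the trip count grows. It says nothing about loops the compiler failed to vectorize at all.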

Peter_W_Intel · ‎06-08-2012

Can you use VTune Amplifier XE 2011 to do event-based sampling with the PMU events named SIMD_FP_256?

Review the counts to see whether the results are what you expect.

Event Name Extension	Mask	Definition	Description	Counter	Counter (HT off)
PACKED_SINGLE	0x01		This event counts the number of AVX-256 computational FP single-precision uops issued during the cycle. Note: Packed AVX-256 can be counted as one, and will count for SIMD FP 128.	0,1,2,3	0,1,2,3,4,5,6,7
PACKED_DOUBLE	0x02		This event counts the number of AVX-256 computational FP double-precision uops issued during the cycle. Note: Packed AVX-256 can be counted as one, and will count for SIMD FP 128.	0,1,2,3	0,1,2,3,4,5,6,7

Pavel_Mezentsev · ‎06-08-2012

Yes, I've done the profiling using VTune.

The thing is that I'm analyzing the performance of a huge application. In particular I'm trying to understand if the code uses many FP operations and if it has been vectorized successfully.

I got the following result for one of the runs:

CPU_CLK_UNHALTED.REF_TSC 4,983,560,000,000

CPU_CLK_UNHALTED.THREAD 5,670,360,000,000

FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE 358,000,000,000

FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE 1,164,920,000,000

FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE 21,200,000,000

FP_COMP_OPS_EXE.X87 223,200,000,000

INST_RETIRED.ANY 7,926,840,000,000

The SIMD_FP_256 counters are all zero.

I've also measured the HPL code and got the following results:

CPU_CLK_UNHALTED.REF_TSC 2,675,264,000,000

CPU_CLK_UNHALTED.THREAD 2,922,426,000,000

FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE 2,816,000,000

FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE 18,080,000,000

FP_COMP_OPS_EXE.X87 460,000,000

INST_RETIRED.ANY 7,581,522,000,000

SIMD_FP_256.PACKED_DOUBLE 4,582,812,000,000

What I don't understand is how to interpret the results. What is the difference between FP_COMP_OPS_EXE and SIMD_FP_256? Is it justified to say that each increment of the counter means that 4 flops were actually executed (for DP)? And can 2 increments occur during one processor cycle (one for an add and one for a mul)?

So any clarifications on the subject would be appreciated!
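The "4 flops per increment" reading can be written down as a small sketch. It assumes, without confirmation from the documentation, that the events are disjoint and that each counted uop performs one operation per lane (1 for scalar and x87, 2 for 128-bit packed double, 4 and 8 for 256-bit packed double and single). The weights table and function name are hypothetical; the counts are the ones posted above.

```python
# Assumed flops per counter increment: one operation per vector lane.
FLOPS_PER_UOP = {
    "FP_COMP_OPS_EXE.X87": 1,
    "FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE": 1,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 1,
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2,  # 128 bits / 64-bit elements
    "SIMD_FP_256.PACKED_SINGLE": 8,          # 256 bits / 32-bit elements
    "SIMD_FP_256.PACKED_DOUBLE": 4,          # 256 bits / 64-bit elements
}


def estimate_flops(counts):
    """Weighted sum of FP uop counts; events without a weight are ignored."""
    return sum(FLOPS_PER_UOP[name] * value
               for name, value in counts.items()
               if name in FLOPS_PER_UOP)


first_run = {
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 358_000_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 1_164_920_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE": 21_200_000_000,
    "FP_COMP_OPS_EXE.X87": 223_200_000_000,
}
hpl = {
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2_816_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 18_080_000_000,
    "FP_COMP_OPS_EXE.X87": 460_000_000,
    "SIMD_FP_256.PACKED_DOUBLE": 4_582_812_000_000,
}

print("first run:", estimate_flops(first_run))
print("HPL:      ", estimate_flops(hpl))
```

If the events overlap, or if uops are counted more than once, this figure is an upper bound rather than an exact flop count.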

Peter_W_Intel · ‎06-09-2012

SIMD_FP_256.PACKED_DOUBLE 4,582,812,000,000: counts SSE, AVX-128 FP and AVX-256 FP computational double-precision uops issued.

FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE 358,000,000,000: counts only SSE and AVX-128 FP computational double-precision uops issued.

Pavel_Mezentsev · ‎06-09-2012

Is it correct that the operations counted by FP_COMP_OPS_EXE are a subset of those counted by SIMD_FP_256? And that by subtracting the former from the latter I get the number of 256-bit operations only?

Peter_W_Intel · ‎06-09-2012

I think the answer is "Yes": the result is for AVX-256 only :-)
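The subtraction discussed here can be sketched as follows. It simply applies the subset reading from this exchange to the counts posted earlier; the function name is hypothetical, and a negative result is treated as a sign that the subset assumption does not hold for that run.

```python
def avx256_only(simd_fp_256_packed_double: int,
                fp_comp_ops_exe_packed_double: int) -> int:
    """Difference of the two packed-double counts under the subset reading.

    A negative value means FP_COMP_OPS_EXE counted more than SIMD_FP_256,
    which contradicts the subset assumption for that run.
    """
    return simd_fp_256_packed_double - fp_comp_ops_exe_packed_double


# HPL run: SIMD_FP_256.PACKED_DOUBLE and FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE.
print(avx256_only(4_582_812_000_000, 2_816_000_000))

# First run: SIMD_FP_256 was reported as zero, SSE_PACKED_DOUBLE was not.
print(avx256_only(0, 358_000_000_000))
```

For the HPL numbers the correction is well under 1% of the total. For the first run the difference is negative, so the subset reading should be checked against the event documentation before relying on it.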

mrabet_ahmed_amine · ‎02-27-2015

What is the equivalent of SIMD_FP_256.PACKED_DOUBLE on Haswell?

McCalpinJohn · ‎02-27-2015

It appears that all of the floating-point performance counters (with the exception of Event 0xCA, "Floating Point Assists") have been removed from the Haswell-based products.

These counters are known to systematically overcount in Sandy Bridge and Ivy Bridge processors whenever the input registers are not ready (e.g., due to cache misses). I have seen overcounting by anywhere from ~3% to 10x, depending on the average latency for loads feeding into the FP instructions.

We still use these counters on our 6400-node Sandy Bridge system to monitor whether codes are using SSE or AVX, how well the codes vectorize, and whether they are running with 32-bit or 64-bit floating-point arithmetic. The accuracy is good enough for this classification process, and if we deploy a large Haswell-based system we will have to employ a different approach to get this information.
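A classification of this kind can be sketched as shares of FP uops per instruction class. This is not the monitoring code used on that system, only an illustration under the assumption that overcounting affects all classes roughly equally, so that ratios stay meaningful. The class table and function name are hypothetical; the event names and counts are the ones posted earlier in the thread.

```python
# Hypothetical mapping from event name to instruction class.
CLASS_OF_EVENT = {
    "FP_COMP_OPS_EXE.X87": "x87",
    "FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE": "scalar",
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": "scalar",
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": "sse_packed",
    "SIMD_FP_256.PACKED_SINGLE": "avx256",
    "SIMD_FP_256.PACKED_DOUBLE": "avx256",
}


def uop_shares(counts):
    """Return the fraction of counted FP uops that falls into each class."""
    totals = {"x87": 0, "scalar": 0, "sse_packed": 0, "avx256": 0}
    for name, value in counts.items():
        if name in CLASS_OF_EVENT:
            totals[CLASS_OF_EVENT[name]] += value
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {cls: 0.0 for cls in totals}
    return {cls: value / grand_total for cls, value in totals.items()}


first_run = {
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 358_000_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 1_164_920_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE": 21_200_000_000,
    "FP_COMP_OPS_EXE.X87": 223_200_000_000,
}
hpl = {
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2_816_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 18_080_000_000,
    "FP_COMP_OPS_EXE.X87": 460_000_000,
    "SIMD_FP_256.PACKED_DOUBLE": 4_582_812_000_000,
}

for label, run in (("first run", first_run), ("HPL", hpl)):
    shares = uop_shares(run)
    print(label, {cls: round(share, 3) for cls, share in shares.items()})
```

With the posted numbers, the first run comes out as mostly scalar with no 256-bit uops, while HPL is dominated by 256-bit packed uops.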

Intel is certainly aware of the accuracy issues with these counters and is likely to fix the existing problems in some future products. Section 19.2 of Volume 3 of the SW Developer's Guide (document 324384-053, January 2015) shows that Broadwell gets a few FP events back:

Event 0x14, Umask 0x01: ARITH.FPU_DIV_ACTIVE -- cycles that the divide unit is active
Event 0xC0, Umask 0x02: INST_RETIRED.X87 -- x87 Floating-Point operations that are retired without generating exceptions.

I have not heard any definitive statements on when improved support for floating-point counts will make it into shipping products.

Bernard · ‎03-06-2015

>>>As far as I understand during execution of packed AVX instructions the vector can be filled just partly. Is there a way to determine whether a vector was completely filled or not>>>

I presume that you are referring to the XMMx/YMMx registers. In this case you can check with a debugger whether a specific register is filled with 4 or 8 scalars.

mrabet_ahmed_amine · ‎03-06-2015

Thank you for your answer

>>>and if we deploy a large Haswell-based system we will have to employ a different approach to get this information.

Do you have any idea how to get flops on the Haswell architecture?

Bernard · ‎03-07-2015

>>>Do you have any idea to get flops on haswell architecture ?>>>

Do you mean to count how many GFLOPS were executed?

mrabet_ahmed_amine · ‎03-12-2015

>>Do you mean to count how many GFLOPS were executed?

Yes, to count the GFLOPS of an application, and the number of single-precision and double-precision flops that were executed.

Bernard · ‎03-12-2015

I think that John answered your question.

Interpreting the AVX counter results
