Analyzers
Support for Analyzers (Intel VTune™ Profiler, Intel Advisor, Intel Inspector)

SP GFLOPS of vectorized and scalar uops - vtune

psing51
New Contributor I


I have obtained the following data on an Intel 8280 processor:

SP GFLOPS: 344.756
DP GFLOPS: 0.902
x87 GFLOPS: 0

......

Vectorization: 77.1 % of Packed FP Operations
    Instruction Mix
        SP FLOPs: 21.7 % of uOps
            Packed: 78.4 % from SP FP
                128-bit: 1.9 % from SP FP
                256-bit: 76.5 % from SP FP
                512-bit: 0 % from SP FP
            Scalar: 21.6 % from SP FP
        DP FLOPs: 0.4 % of uOps
            Packed: 0.4 % from DP FP
                128-bit: 0.4 % from DP FP
                256-bit: 0 % from DP FP
                512-bit: 0 % from DP FP
            Scalar: 99.6 % from DP FP
        x87 FLOPs: 0 % of uOps
        Non-FP: 77.9 % of uOps
FP Arith/Mem Rd Instr. Ratio: 0.619
FP Arith/Mem Wr Instr. Ratio: 1.942

 


I have the following queries:

a) Is the following the correct way to find out which type of SP uOps has contributed more towards the total SP GFLOPS? Example:
Total SP GFLOPS = total SP vector (packed) uOp GFLOPS + total SP scalar uOp GFLOPS
344.756 = (344.756 x 0.784) + (344.756 x 0.216)
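As a sanity check, this split can be computed directly from the reported percentages. A minimal Python sketch (values copied from the report above; note the uop percentages are only a rough proxy for FLOP percentages, since packed uops of different widths contribute different numbers of operations):

```python
# Naive split of the total SP GFLOPS by the packed/scalar uop
# percentages reported by VTune (values from the report above).
sp_gflops = 344.756
packed_frac = 0.784   # "Packed: 78.4 % from SP FP"
scalar_frac = 0.216   # "Scalar: 21.6 % from SP FP"

packed_gflops = sp_gflops * packed_frac
scalar_gflops = sp_gflops * scalar_frac

# Prints: packed: 270.3, scalar: 74.5 GFLOPS
print(f"packed: {packed_gflops:.1f}, scalar: {scalar_gflops:.1f} GFLOPS")
```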




b.1) By non-FP, do we mean instructions which operate on integer-like data types (byte/char/INTEGER*2/INTEGER*4/INTEGER*8)?

VTune shows operations per cycle (as FLOPS) for SP and DP, and each of these shows the breakup of scalar and vector uOps.

This code appears to be mostly vectorized, as 77.1 % of the code (/instructions) is vectorized and the remaining 22.9 % is scalar. In this report, the non-FP micro-operations form the larger share (77.9 %), and this non-FP uOps metric is not broken down into scalar and packed. This gives me the idea that non-FP code somehow contributes largely to the vectorization, but I am unsure what the difference would be between scalar non-FP and vector non-FP uOps. So I have the following queries:

b.2) Is it possible to get the breakup of non-FP uOps into scalar and packed categories, like we have for SP FLOPs and DP FLOPs?
b.3) Is it possible to get/derive the operations per cycle for these non-FP instruction types using VTune?

 

c) Is it correct to say that, since the number of non-FP operations is very large compared to SP+DP uOps, "FLOPS" is not a very good measure of this code's performance?


6 Replies
ArunJ_Intel
Moderator

Hi psing51,

 

Please find below my responses to a few of your queries. For the remaining queries, I will check internally and let you know the findings.

For question (a), your conclusion is right: you can use the percentage values of packed and scalar instructions to determine which type of SP uOps has contributed more towards the total SP GFLOPS.

Response for b(1): Yes, non-FP refers to operations on the remaining data types, which are not represented as floating point.

Response for b(3)

The link below (a public site, not official) suggests calculating FLOPS by multiplying the number of FLOPs per cycle by the number of arithmetic pipelines per core, then by the number of cores, then by the frequency.

https://boardgamestips.com/wow/how-do-you-calculate-flops/#How_do_you_calculate_flops

Inferring from this statement, we could derive:

number of floating point operations per cycle = FLOPS / (sockets * (cores per socket) * (avg CPU frequency))

However, the calculation of something like "floating point operations per cycle" may actually be more complicated and vary with architecture, so this formula might not hold, as per my understanding.
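As a rough illustration of that rearranged formula, a minimal sketch (the socket count, core count, and frequency here are assumed values for illustration, not measured ones, and this gives only a crude average, subject to the caveats above):

```python
# Rough derivation of average FP operations per cycle per core from a
# measured GFLOPS figure, per the formula above. Machine parameters
# (2 sockets, 28 cores/socket, 2.7 GHz average) are assumptions.
measured_sp_gflops = 344.756   # from the VTune report above
sockets = 2
cores_per_socket = 28
avg_freq_ghz = 2.7

ops_per_cycle = measured_sp_gflops / (sockets * cores_per_socket * avg_freq_ghz)

# Prints: ~2.28 SP FP operations per cycle per core
print(f"~{ops_per_cycle:.2f} SP FP operations per cycle per core")
```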

 

However, if you are interested in instructions per cycle / cycles per instruction, VTune displays this metric (CPI Rate) in the Summary pane.

 

Thanks

Arun Jose

 

psing51
New Contributor I

Thank you for the response.
I hope you will be checking on the b.2 and c questions.

For b.3 - it seems you have referred to a method to calculate "FLOPS" for "non-FP" instructions/micro-operations. Since there is no floating point operation involved in "non-FP", and non-FP is a very generic term which can encompass int, char, and byte data types, I think it can be categorized as OPS in general (or IOPS/BOPS etc., in case we want to be specific).
Example for a dual-socket server with 8280 processors:
theoretical DP FLOPS = 2 sockets x 2.70 x 28 cores per socket x 8 (512-bit vector units / 64 bit) x 2 FMA/core = 2419

theoretical SP FLOPS = 2 sockets x 2.70 x 28 cores per socket x 16 (512-bit vector units / 32 bit) x 2 FMA/core = 4838

theoretical byte ops/BOPS = 2 sockets x 2.70 x 28 cores per socket x 64 (512-bit vector units / 8 bit) x 2 FMA/core = 19354
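These peak figures can be sketched in one small function; carrying the x2 FMA factor through uniformly, the byte case works out to roughly 19354 GOPS, i.e. 8x the DP figure. The machine parameters (2 sockets, 2.70 GHz, 28 cores per socket) are the assumed values from above:

```python
# Theoretical peak throughput following the formula above:
# sockets x GHz x cores/socket x (512 / element_bits) lanes x 2 (FMA).
# Machine parameters are the assumed dual-socket 8280 values.
def peak_gops(element_bits, sockets=2, ghz=2.70, cores=28, fma=2):
    lanes = 512 // element_bits   # vector lanes in a 512-bit register
    return sockets * ghz * cores * lanes * fma

print(round(peak_gops(64)))   # DP:       2419 GFLOPS
print(round(peak_gops(32)))   # SP:       4838 GFLOPS
print(round(peak_gops(8)))    # byte ops: 19354 GOPS
```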


Since this code issues a large number of instructions in the non-FP category, one might be interested in knowing the "OPS" for the relevant category, given that peak BOPS can be 8x larger than peak DP FLOPS.
Since the VTune interface does not provide this value, is there a way in which I can capture/derive these values using VTune or the VTune CLI?

ArunJ_Intel
Moderator

Hi psing51,


We are checking on this internally and will get back to you soon with an update.


Thanks

Arun


Vladimir_T_Intel
Moderator

Hi,

b.1) It seems there was a small misunderstanding of the data. This one:


@psing51 wrote:

 

This code appears to be mostly vectorized, as 77.1 % of the code (/instructions) is vectorized and the remaining 22.9 % is scalar. In this report, the non-FP micro-operations form the larger share (77.9 %), and this non-FP uOps metric is not broken down into scalar and packed.


is not 100% correct. The result says that 77.1 % of all FP instructions are packed (vectorized). The FP instructions are 21.7 % + 0.4 % of all instructions, and 77.9 % of instructions are non-FP.
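This reading can be checked numerically against the report: the 77.1 % figure is the packed share of FP uOps, combining the SP and DP breakdowns weighted by each type's share of total uOps. A small sketch using the reported percentages:

```python
# Check: "77.1 % of Packed FP Operations" is the packed share of all FP
# uops, i.e. the SP and DP packed fractions weighted by each type's
# share of total uops (all values from the report above).
sp_share, sp_packed = 0.217, 0.784   # SP: 21.7 % of uops, 78.4 % packed
dp_share, dp_packed = 0.004, 0.004   # DP:  0.4 % of uops,  0.4 % packed

packed_fp = (sp_share * sp_packed + dp_share * dp_packed) / (sp_share + dp_share)

# Prints: 77.0% -- matching the reported 77.1 % up to rounding
print(f"{packed_fp:.1%}")
```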


b.2) It is not possible to break non-FP instructions into these categories with VTune. (You might want to try Intel Advisor for that.)

b.3) No, IPC/CPI is calculated against all instructions. But you can select an entity (module, source file, function, source line, or instruction) where FP is prevailing and see the CPI metric for it.

c) That would not be correct to say, for several reasons:

- FP instructions are "heavier" than integer ones (usually higher latency and lower throughput).

- FP instructions are most likely the ones that work on your algorithm's data, and they induce latency for data access in memory (cache misses, etc.).

- Many non-FP instructions are spent arranging calculations (loop indexes, address calculations, padding, comparisons, jumps, etc.). Also, your program has a ramp-up stage where only non-FP instructions are used. So the total number or portion of non-FP uOps doesn't say much on its own.

 

Dmitry_P_Intel1
Employee

Hello,

The situation with (a) might be a bit more complex, since the uOps here are not floating point operations but vector instructions, with an FMA counted as two. So in terms of floating point operations, the contributions will be 128-bit x 4, 256-bit x 8, 512-bit x 16, and scalar x 1.
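A minimal sketch of that conversion (the uOp counts below are hypothetical, for illustration only; the per-width multipliers are the ones from the reply above):

```python
# Converting SP uop counts to floating point operations using the
# multipliers above: scalar x1, 128-bit x4, 256-bit x8, 512-bit x16.
# FMA instructions are already counted as two uops by the counters,
# so no extra factor is applied here.
FLOPS_PER_UOP = {"scalar": 1, "128-bit": 4, "256-bit": 8, "512-bit": 16}

def sp_flops_from_uops(uop_counts):
    """uop_counts: dict mapping uop width to number of uops observed."""
    return sum(FLOPS_PER_UOP[width] * n for width, n in uop_counts.items())

# Hypothetical counts: 1000 scalar + 500 256-bit uops.
print(sp_flops_from_uops({"scalar": 1000, "256-bit": 500}))  # 1000 + 4000 = 5000
```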

On b): VTune reports the FLOPs metrics based on HW counters, and they are available only for floating point operations. If you need a breakdown for non-FP, you can use the Intel Advisor product. It uses instruction-level binary instrumentation, so it has the full picture.

Thanks & Regards, Dmitry

Mariya_P_Intel
Moderator

Hi @psing51,

Did the replies given by Arun, Vladimir and Dmitry help you? May we consider this forum thread solved?


Reply