I obtained the following data on an Intel 8280 processor:

SP GFLOPS: 344.756

DP GFLOPS: 0.902

x87 GFLOPS: 0

......

Vectorization: 77.1 % of Packed FP Operations

Instruction Mix

- SP FLOPs: 21.7 % of uOps
  - Packed: 78.4 % from SP FP
    - 128-bit: 1.9 % from SP FP
    - 256-bit: 76.5 % from SP FP
    - 512-bit: 0 % from SP FP
  - Scalar: 21.6 % from SP FP
- DP FLOPs: 0.4 % of uOps
  - Packed: 0.4 % from DP FP
    - 128-bit: 0.4 % from DP FP
    - 256-bit: 0 % from DP FP
    - 512-bit: 0 % from DP FP
  - Scalar: 99.6 % from DP FP
- x87 FLOPs: 0 % of uOps
- Non-FP: 77.9 % of uOps

FP Arith/Mem Rd Instr. Ratio: 0.619

FP Arith/Mem Wr Instr. Ratio: 1.942

I have the following questions:

a) Is the following a correct way to determine which type of SP uOps contributed more to the total SP GFLOPS? For example:

Total SP GFLOPS = SP packed (vector) uOps GFLOPS + SP scalar uOps GFLOPS

344.756 = (344.756 x 0.784) + (344.756 x 0.216)

b.1) Does non-FP mean the instructions that operate on data types such as byte/int/char/INTEGER*2, INTEGER*4, INTEGER*8?

VTune shows operations per cycle (as FLOPs) for SP and DP, and breaks each of these down into scalar and vector uOps.

This code appears to be mostly vectorized, as 77.1 % of the code (instructions) is vectorized and the remaining 22.9 % is scalar. In this report, the non-FP packed micro-operations seem to have a larger share (77.9 %), and this non-FP uOps metric is not broken down into scalar and packed. This suggests to me that non-FP code contributes substantially to the vectorization, but I am unsure what the difference between scalar non-FP and vector non-FP uOps would be. So I have the following questions:

b.2) Is it possible to get the breakdown of non-FP uOps into scalar and packed categories, as we have for SP FLOPs and DP FLOPs?

b.3) Is it possible to get or derive the operations per cycle for these non-FP instruction types using VTune?

c) Is it correct to say that, since the number of non-FP operations is very large compared to SP + DP uOps, "FLOPS" is not a very good measure of this code's performance?




Hi psing51,

Please find below my responses to a few of your questions. For the remaining ones, I will check internally and let you know the findings.

Response for (a): Your conclusion is right. You can use the percentage values of packed and scalar instructions to determine which type of SP uOps contributed more to the total SP GFLOPS.

Response for b(1): Yes, non-FP refers to operations on the remaining data types, which are not represented as floating point.

Response for b(3):

The link below (a public site, not an official one) suggests calculating FLOPS by multiplying the number of floating-point operations per cycle by the number of arithmetic pipelines per core, then by the number of cores, then by the frequency.

https://boardgamestips.com/wow/how-do-you-calculate-flops/#How_do_you_calculate_flops

Inferring from this statement, we could derive:

**number of floating-point operations per cycle = FLOPS / (sockets * (cores per socket) * (average CPU frequency))**

However, calculating something like "floating-point operations per cycle" may actually be more complicated and may vary across architectures, so, as far as I understand, this formula might not always hold.
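As a rough illustration of the formula above, the sketch below plugs in the SP GFLOPS value from the original post. The socket count, core count, and frequency are assumptions (a dual-socket 8280 system running at its nominal 2.70 GHz), and the function name is hypothetical. The actual average frequency under vector load will usually differ, so the result is only indicative.

```python
def flops_per_cycle_per_core(gflops, sockets, cores_per_socket, avg_freq_ghz):
    """Average floating-point operations per cycle per core.

    gflops and avg_freq_ghz are both expressed in units of 1e9,
    so the scale factor cancels out in the division.
    """
    total_core_ghz = sockets * cores_per_socket * avg_freq_ghz
    return gflops / total_core_ghz


# Assumed system: 2 sockets x 28 cores at 2.70 GHz (not confirmed by the report).
sp_ops_per_cycle = flops_per_cycle_per_core(344.756, 2, 28, 2.70)
print(f"SP FLOP per cycle per core: {sp_ops_per_cycle:.2f}")
```

The same function could be applied to the DP GFLOPS value, or restricted to the cores that were actually busy if that number is known.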

If you are interested in **instructions per cycle** / **cycles per instruction** instead, VTune displays the CPI Rate metric in the Summary pane.

Thanks

Arun Jose


Thank you for the response.

I hope you will be checking on questions b.2 and c.

For b.3: it seems you have referred to a method for calculating "FLOPS" for non-FP instructions/micro-operations. Since no floating-point operation is involved in non-FP, and non-FP is a very generic term that can encompass int, char, and byte data types, I think it can be categorized as OPS in general (or IOPS/BOPS etc. if we want to be specific).

Example for a dual-socket server with 8280 processors:

theoretical DP FLOPS = 2 sockets x 2.70 GHz x 28 cores per socket x 8 (512-bit vector units / 64 bit) x 2 FMA/core = 2419 GFLOPS

theoretical SP FLOPS = 2 sockets x 2.70 GHz x 28 cores per socket x 16 (512-bit vector units / 32 bit) x 2 FMA/core = 4838 GFLOPS

theoretical byte ops (BOPS) = 2 sockets x 2.70 GHz x 28 cores per socket x 64 (512-bit vector units / 8 bit) x 2 FMA/core = 19354 GOPS
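The three expressions above share one shape, which the following sketch evaluates. It simply reproduces the arithmetic as written in this post; the function and parameter names are hypothetical, and the factor of 2 is taken from the "2 FMA/core" term without checking it against the processor's documented peak. Evaluating the byte expression as written gives roughly 19354, which is eight times the DP figure.

```python
def peak_gops(sockets, freq_ghz, cores_per_socket, element_bits,
              vector_bits=512, ops_factor=2):
    """Theoretical peak in giga-operations per second.

    lanes is the number of elements of the given width that fit
    in one vector register; ops_factor is the "2 FMA/core" term.
    """
    lanes = vector_bits // element_bits
    return sockets * freq_ghz * cores_per_socket * lanes * ops_factor


# Dual-socket system at 2.70 GHz with 28 cores per socket, as in the post.
for name, bits in [("DP", 64), ("SP", 32), ("Byte", 8)]:
    print(f"{name:>4}: {peak_gops(2, 2.70, 28, bits):.1f} Gops/s")
```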

Since this code issues a large number of instructions in the non-FP category, one might be interested in knowing the "OPS" for the relevant category; as you can see, BOPS can be 8x larger than DP FLOPS.

The VTune interface does not provide this OPS value, so is there a way I can capture or derive these values using VTune or the VTune CLI?


Hi psing51,

We are checking on this internally and will get back to you soon with an update.

Thanks

Arun


Hi,

b.1) It seems there was a small misunderstanding of the data. This one:

@psing51 wrote:

This code appears to be mostly vectorized, as 77.1 % of the code (instructions) is vectorized and the remaining 22.9 % is scalar. In this report, the non-FP packed micro-operations seem to have a larger share (77.9 %), and this non-FP uOps metric is not broken down into scalar and packed.

is not 100 % correct. The result says that 77.1 % of all **FP** instructions are packed (vectorized). The FP instructions make up 21.7 % + 0.4 % of **all** instructions, and 77.9 % of instructions are not FP.
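The relationship between these percentages can be checked with a few lines of arithmetic. The sketch below assumes that the overall packed share is the uOp-weighted average of the SP and DP packed shares; that is an assumption about how the figures combine, not a documented VTune formula, and the variable names are hypothetical. The result comes out close to the reported 77.1 %, with the small gap plausibly due to rounding in the report.

```python
# Values copied from the report in the original post (all in percent).
sp_share = 21.7      # SP FLOPs, % of all uOps
dp_share = 0.4       # DP FLOPs, % of all uOps
non_fp_share = 77.9  # Non-FP, % of all uOps
sp_packed = 78.4     # packed, % of SP FP
dp_packed = 0.4      # packed, % of DP FP

# FP uOps as a share of all uOps.
fp_share = sp_share + dp_share

# Packed share among FP uOps (assumed: weighted average of SP and DP).
packed_of_fp = (sp_share * sp_packed + dp_share * dp_packed) / fp_share

# Packed FP uOps as a share of all uOps.
packed_of_all = fp_share * packed_of_fp / 100

print(f"FP uOps:            {fp_share:.1f} % of all uOps")
print(f"Packed among FP:    {packed_of_fp:.1f} % of FP uOps")
print(f"Packed FP overall:  {packed_of_all:.1f} % of all uOps")
```

Under this assumption, packed FP uOps account for roughly 17 % of all uOps, which is why the 77.1 % figure should not be read as "77.1 % of the code is vectorized".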

b.2) It is not possible to break non-FP instructions down into categories with VTune. (You might want to try Intel Advisor for that.)

b.3) No, IPC/CPI is calculated against all instructions. However, you can select an entity (module, source file, function, source line, or instruction) where FP prevails and view the CPI metric for it.

c) That would not be correct to say, for several reasons:

- FP instructions are "heavier" than integer ones (usually higher latency and lower throughput).

- FP instructions are most likely the ones that work on your algorithm's data and incur latency for data access in memory (cache misses, etc.).

- Many non-FP instructions are spent on arranging calculations (loop indexes, address calculations, padding, comparisons, jumps, etc.). Your program also has a ramp-up stage where only non-FP instructions are used. So the total number or share of non-FP uOps, taken alone, doesn't say anything.


Hello,

The situation with a) might be a bit more complex, since the uOps here are not floating-point operations but vector instructions, with FMA counted as two. So in terms of floating-point operations, the contributions are weighted as 128-bit x 4, 256-bit x 8, 512-bit x 16, and scalar x 1.
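The weighting described above can be sketched as follows, using the SP figures from the original post. This is only an illustration: it assumes the per-width percentages can be treated as relative instruction counts, it uses the SP lane counts (4, 8, 16), and it leaves out the extra factor for FMA because the report does not say which instructions are FMAs. All names are hypothetical.

```python
# SP instruction mix from the report (% of SP FP uOps).
SP_MIX = {"scalar": 21.6, "128-bit": 1.9, "256-bit": 76.5, "512-bit": 0.0}

# Floating-point operations per instruction for single precision.
SP_LANES = {"scalar": 1, "128-bit": 4, "256-bit": 8, "512-bit": 16}


def flop_shares(mix, lanes):
    """Convert an instruction mix into shares of floating-point operations."""
    weighted = {kind: mix[kind] * lanes[kind] for kind in mix}
    total = sum(weighted.values())
    return {kind: value / total for kind, value in weighted.items()}


shares = flop_shares(SP_MIX, SP_LANES)
sp_gflops = 344.756  # SP GFLOPS from the report
for kind, share in shares.items():
    print(f"{kind:>8}: {share:6.1%} of SP FLOPs, ~{share * sp_gflops:.1f} GFLOPS")
```

Under these assumptions, scalar instructions make up 21.6 % of the SP uOps but only about 3 % of the SP floating-point operations, so multiplying the GFLOPS total directly by the uOp percentages overstates the scalar contribution.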

On b): VTune shows the FLOPs metrics based on HW counters, which are available only for floating-point operations. If you need a breakdown for non-FP, you can use Intel Advisor. It uses instruction-level binary instrumentation, so it has the full picture.

Thanks & Regards, Dmitry


Hi @psing51,

Did the replies given by Arun, Vladimir and Dmitry help you? May we consider this forum thread solved?
