Analyzers
Support for Analyzers (Intel VTune™ Profiler, Intel Advisor, Intel Inspector)

SP GFLOPS of vectorized and scalar uops - vtune

psing51
New Contributor I


I have obtained the following data on an Intel 8280 processor:

SP GFLOPS: 344.756
DP GFLOPS: 0.902
x87 GFLOPS: 0

......

Vectorization: 77.1 % of Packed FP Operations
    Instruction Mix
        SP FLOPs: 21.7 % of uOps
            Packed: 78.4 % from SP FP
                128-bit: 1.9 % from SP FP
                256-bit: 76.5 % from SP FP
                512-bit: 0 % from SP FP
            Scalar: 21.6 % from SP FP
        DP FLOPs: 0.4 % of uOps
            Packed: 0.4 % from DP FP
                128-bit: 0.4 % from DP FP
                256-bit: 0 % from DP FP
                512-bit: 0 % from DP FP
            Scalar: 99.6 % from DP FP
        x87 FLOPs: 0 % of uOps
        Non-FP: 77.9 % of uOps
FP Arith/Mem Rd Instr. Ratio: 0.619
FP Arith/Mem Wr Instr. Ratio: 1.942

 


I have the following queries:

a) Is the following the correct way to find out which type of SP uOps has contributed more towards the total SP GFLOPS? Example:
Total SP GFLOPS = total SP vector (packed) uOp GFLOPS + total SP scalar uOp GFLOPS
344.756 = (344.756 x 0.784) + (344.756 x 0.216)
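As a sanity check, this split can be computed directly from the reported percentages. A minimal Python sketch (values copied from the report above; note the uop percentages are only a rough proxy for FLOP percentages, since packed uops of different widths contribute different numbers of operations):

```python
# Naive split of the total SP GFLOPS by the packed/scalar uop
# percentages reported by VTune (values from the report above).
sp_gflops = 344.756
packed_frac = 0.784   # "Packed: 78.4 % from SP FP"
scalar_frac = 0.216   # "Scalar: 21.6 % from SP FP"

packed_gflops = sp_gflops * packed_frac
scalar_gflops = sp_gflops * scalar_frac

# Prints: packed: 270.3, scalar: 74.5 GFLOPS
print(f"packed: {packed_gflops:.1f}, scalar: {scalar_gflops:.1f} GFLOPS")
```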




b.1) By non-FP, do we mean instructions which operate on integer-like data types (byte/char/INTEGER*2/INTEGER*4/INTEGER*8)?

VTune shows operations per cycle (as FLOPS) for SP and DP, and each of these shows the breakup of scalar and vector uOps.

This code appears to be mostly vectorized, as 77.1 % of the code (/instructions) is vectorized and the remaining 22.9 % is scalar. In this report, the non-FP micro-operations form the larger share (77.9 %), and this non-FP uOps metric is not broken down into scalar and packed. This gives me the idea that non-FP code somehow contributes largely to the vectorization, but I am unsure what the difference would be between scalar non-FP and vector non-FP uOps. So I have the following queries:

b.2) Is it possible to get the breakup of non-FP uOps into scalar and packed categories, like we have for SP FLOPs and DP FLOPs?
b.3) Is it possible to get/derive the operations per cycle for these non-FP instruction types using VTune?

 

c) Is it correct to say that, since the number of non-FP operations is very large compared to SP+DP uOps, "FLOPS" is not a very good measure of this code's performance?


6 Replies
ArunJ_Intel
Moderator

Hi psing51,

 

Please find below my responses to a few of your queries. For the remaining queries, I will check internally and let you know the findings.

For question (a), your conclusion is right: you can use the percentage values of packed and scalar instructions to determine which type of SP uOps has contributed more towards the total SP GFLOPS.

Response for b(1): Yes, non-FP refers to operations on the remaining data types, which are not represented as floating point.

Response for b(3)

The link below (a public site, not official) suggests calculating FLOPS by multiplying the number of FLOPs per cycle by the number of arithmetic pipelines per core, then by the number of cores, then by the frequency.

https://boardgamestips.com/wow/how-do-you-calculate-flops/#How_do_you_calculate_flops

Inferring from this statement, we could derive:

number of floating point operations per cycle = FLOPS / (sockets * (cores per socket) * (avg CPU frequency))

However, the calculation of something like "floating point operations per cycle" may actually be more complicated and vary with architecture, so this formula might not hold, as per my understanding.
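As a rough illustration of that rearranged formula, a minimal sketch (the socket count, core count, and frequency here are assumed values for illustration, not measured ones, and this gives only a crude average, subject to the caveats above):

```python
# Rough derivation of average FP operations per cycle per core from a
# measured GFLOPS figure, per the formula above. Machine parameters
# (2 sockets, 28 cores/socket, 2.7 GHz average) are assumptions.
measured_sp_gflops = 344.756   # from the VTune report above
sockets = 2
cores_per_socket = 28
avg_freq_ghz = 2.7

ops_per_cycle = measured_sp_gflops / (sockets * cores_per_socket * avg_freq_ghz)

# Prints: ~2.28 SP FP operations per cycle per core
print(f"~{ops_per_cycle:.2f} SP FP operations per cycle per core")
```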

 

However, if you are interested in instructions per cycle / cycles per instruction, VTune displays this metric (CPI Rate) in the Summary pane.

 

Thanks

Arun Jose

 

psing51
New Contributor I

Thank you for the response.
I hope you will be checking on the b.2 and c questions.

For b.3 - it seems you have referred to a method to calculate "FLOPS" for "non-FP" instructions/micro-operations. Since there is no floating point operation involved in "non-FP", and non-FP is a very generic term which can encompass int, char, and byte data types, I think it can be categorized as OPS in general (or IOPS/BOPS etc., in case we want to be specific).
Example for a dual-socket server with 8280 processors:
theoretical DP FLOPS = 2 sockets x 2.70 x 28 cores per socket x 8 (512-bit vector units / 64 bit) x 2 FMA/core = 2419

theoretical SP FLOPS = 2 sockets x 2.70 x 28 cores per socket x 16 (512-bit vector units / 32 bit) x 2 FMA/core = 4838

theoretical byte ops/BOPS = 2 sockets x 2.70 x 28 cores per socket x 64 (512-bit vector units / 8 bit) x 2 FMA/core = 19354
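These peak figures can be sketched in one small function; carrying the x2 FMA factor through uniformly, the byte case works out to roughly 19354 GOPS, i.e. 8x the DP figure. The machine parameters (2 sockets, 2.70 GHz, 28 cores per socket) are the assumed values from above:

```python
# Theoretical peak throughput following the formula above:
# sockets x GHz x cores/socket x (512 / element_bits) lanes x 2 (FMA).
# Machine parameters are the assumed dual-socket 8280 values.
def peak_gops(element_bits, sockets=2, ghz=2.70, cores=28, fma=2):
    lanes = 512 // element_bits   # vector lanes in a 512-bit register
    return sockets * ghz * cores * lanes * fma

print(round(peak_gops(64)))   # DP:       2419 GFLOPS
print(round(peak_gops(32)))   # SP:       4838 GFLOPS
print(round(peak_gops(8)))    # byte ops: 19354 GOPS
```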


Since this code issues a large number of instructions in the non-FP category, one might be interested in knowing the "OPS" for the relevant category, given that peak BOPS can be 8x larger than peak DP FLOPS.
Since the VTune interface does not provide this value, is there a way in which I can capture/derive these values using VTune or the VTune CLI?

ArunJ_Intel
Moderator

Hi psing51,


We are checking on this internally and will get back to you soon with an update.


Thanks

Arun


Vladimir_T_Intel
Moderator

Hi,

b.1) It seems there was a small misunderstanding of the data. This one:


@psing51 wrote:

 

This code appears to be mostly vectorized, as 77.1 % of the code (/instructions) is vectorized and the remaining 22.9 % is scalar. In this report, the non-FP micro-operations form the larger share (77.9 %), and this non-FP uOps metric is not broken down into scalar and packed.


is not 100% correct. The result says that 77.1 % of all FP instructions are packed (vectorized). The FP instructions are 21.7 % + 0.4 % of all instructions, and 77.9 % of instructions are non-FP.
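This reading can be checked numerically against the report: the 77.1 % figure is the packed share of FP uOps, combining the SP and DP breakdowns weighted by each type's share of total uOps. A small sketch using the reported percentages:

```python
# Check: "77.1 % of Packed FP Operations" is the packed share of all FP
# uops, i.e. the SP and DP packed fractions weighted by each type's
# share of total uops (all values from the report above).
sp_share, sp_packed = 0.217, 0.784   # SP: 21.7 % of uops, 78.4 % packed
dp_share, dp_packed = 0.004, 0.004   # DP:  0.4 % of uops,  0.4 % packed

packed_fp = (sp_share * sp_packed + dp_share * dp_packed) / (sp_share + dp_share)

# Prints: 77.0% -- matching the reported 77.1 % up to rounding
print(f"{packed_fp:.1%}")
```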


b.2) It is not possible to break non-FP instructions into these categories with VTune. (You might want to try Intel Advisor for that.)

b.3) No, IPC/CPI is calculated against all instructions. But you can select an entity (module, source file, function, source line, or instruction) where FP is prevailing and see the CPI metric for it.

c) That would not be correct to say, for several reasons:

- FP instructions are "heavier" than integer ones (usually higher latency and lower throughput).

- FP instructions are most likely the ones that work on your algorithm's data, and they induce latency for data access in memory (cache misses, etc.).

- Many non-FP instructions are spent arranging calculations (loop indexes, address calculations, padding, comparisons, jumps, etc.). Also, your program has a ramp-up stage where only non-FP instructions are used. So the total number or portion of non-FP uOps doesn't say much on its own.

 

Dmitry_P_Intel1
Employee

Hello,

The situation with (a) might be a bit more complex, since the uOps here are not floating point operations but vector instructions, with an FMA counted as two. So in terms of floating point operations, the contributions will be 128-bit x 4, 256-bit x 8, 512-bit x 16, and scalar x 1.
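A minimal sketch of that conversion (the uOp counts below are hypothetical, for illustration only; the per-width multipliers are the ones from the reply above):

```python
# Converting SP uop counts to floating point operations using the
# multipliers above: scalar x1, 128-bit x4, 256-bit x8, 512-bit x16.
# FMA instructions are already counted as two uops by the counters,
# so no extra factor is applied here.
FLOPS_PER_UOP = {"scalar": 1, "128-bit": 4, "256-bit": 8, "512-bit": 16}

def sp_flops_from_uops(uop_counts):
    """uop_counts: dict mapping uop width to number of uops observed."""
    return sum(FLOPS_PER_UOP[width] * n for width, n in uop_counts.items())

# Hypothetical counts: 1000 scalar + 500 256-bit uops.
print(sp_flops_from_uops({"scalar": 1000, "256-bit": 500}))  # 1000 + 4000 = 5000
```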

On b): VTune reports the FLOPs metrics based on HW counters, and they are available only for floating point operations. If you need a breakdown for non-FP, you can use the Intel Advisor product. It uses instruction-level binary instrumentation, so it has the full picture.

Thanks & Regards, Dmitry

Mariya_P_Intel
Moderator

Hi @psing51,

Did the replies given by Arun, Vladimir and Dmitry help you? May we consider this forum thread solved?


Reply