lhartzman
Beginner
142 Views

FLOPS count

I want to count the FLOPS in an application (on a non-Itanium system). From what I see in the documentation it appears that I have to collect x87 retired instructions as well as all SIMD retired instructions and divide the sum by the total number of retired instructions. Is this correct or is there a more direct way of doing this?

BTW, this is an application running on Linux.

Thanks.

Les
6 Replies
Vladimir_T_Intel
Moderator

Sounds correct. But by dividing by the total number of retired instructions you get the fraction of floating-point instructions rather than FLOPS.

lhartzman
Beginner

Quoting - Vladimir_T_Intel
Sounds correct. But by dividing by the total number of retired instructions you get the fraction of floating-point instructions rather than FLOPS.

Right. I don't even remember why that was going through my head! To be more explicit about the system: it is a dual-processor Xeon.

I ran a collection session on a small piece of code that was using explicit single-precision SSE2 instructions, yet the counts for double-precision SSE instructions were about as high as those for the single-precision instructions. Is there some basic concept of VTune's operation that I'm missing when looking at the counts?
robert-reed
Valued Contributor II

Quoting - lhartzman
I ran a collection session on a small piece of code that was using explicit single-precision SSE2 instructions, yet the counts for double-precision SSE instructions were about as high as those for the single-precision instructions. Is there some basic concept of VTune's operation that I'm missing when looking at the counts?
It may be a trick of the compiler. The Intel compiler uses scalar SSE instructions for its floating-point computations instead of the x87 instruction set. So even though you may have explicit scalar single-precision instructions in your code, if the glue around it is using floating point, you may be seeing scalar double-precision instructions that the compiler emitted.
lhartzman
Beginner

Quoting - robert-reed
It may be a trick of the compiler. The Intel compiler uses scalar SSE instructions for its floating-point computations instead of the x87 instruction set. So even though you may have explicit scalar single-precision instructions in your code, if the glue around it is using floating point, you may be seeing scalar double-precision instructions that the compiler emitted.

I assume then the natural thing to do is to ignore that sampling event count? For a simple piece of code that you're familiar with, that isn't an issue. But if you're looking at code that is fairly large, how do you get a reliable count?

Right now we're using version 9.0 of the compiler, but will shortly be going to 10.1 (I know there is a newer one; don't ask!). Is there a set of options that can be used that will be less 'tricky'? The main options I'm using are -O3 -xW -ip.
TimP
Black Belt

Quoting - lhartzman

I ran a collection session on a small piece of code that was using explicit single-precision SSE2 instructions, yet the counts for double-precision SSE instructions were about as high as those for the single-precision instructions. Is there some basic concept of VTune's operation that I'm missing when looking at the counts?
Among the ways people have unintentionally instructed their compiler to promote expression evaluation to double:
- C or C++ double constants, such as 0. and 1. rather than 0.f and 1.f
- C double math functions, like fabs() and sqrt(), instead of the C99 single-precision variants fabsf() and sqrtf()
- compiler options like Intel/Microsoft -fp-model precise or -fp-model double

In C, C++, or Fortran, a single double operand can promote an entire expression to double, and the type-conversion instructions count as much as the actual arithmetic operations.

In vectorized code, suppose you have 1/3 double-precision and 2/3 single-precision operations: you will see the same number of double- and single-precision instructions, because the single-precision instructions are 4 wide and the double-precision ones 2 wide.

Technically, or historically, SSE2 instructions are double precision; plain SSE are single.
srimks
New Contributor II

Quoting - lhartzman
I want to count the FLOPS in an application (on a non-Itanium system). From what I see in the documentation it appears that I have to collect x87 retired instructions as well as all SIMD retired instructions and divide the sum by the total number of retired instructions. Is this correct or is there a more direct way of doing this?

BTW, this is an application running on Linux.

Thanks.

Les

To calculate FLOPS (floating-point operations per second) for a section of code, preferably try using RDTSC (Read Time Stamp Counter) calls. The references below should give you an idea of how to measure FLOPS; RDTSC is only available for IA-32 and IA-64. Be careful when using RDTSC on a multicore system: each core has its own TSC, and a context switch to another core can give a different readout.

Many people worry about context switches that may occur during the measurement. A context switch on Linux x86_64 can take several thousand clock cycles and might bias your results. The best way to avoid this problem is to arrange a small test case so that your thread is rarely interrupted, and to set the thread affinity. If your function takes 60,000 clock cycles on a 1 GHz processor, there is a 0.3% probability that a context switch will happen during it; on the other hand, if your function takes 100 times longer, it will be interrupted with 30% probability.

Alternatively, one can also try the system call gettimeofday(), but I don't think it will give much better results than RDTSC.

Refer: http://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf
http://en.wikipedia.org/wiki/Time_Stamp_Counter

~BR
Mukkaysh Srivastav