Trying to make sense out of VTune results (FLOPS-related)

lhartzman · 05-13-2009

To understand the results VTune returns, I wrote a simple program on Linux that just assigns values to a 1000-element floating point array. Each value is the sum of the loop variable (cast to float) and a single precision floating point constant. I then created an activity to collect x87 retired instructions and all of the packed single/double precision retired instructions.

When compiled with -O3 -xW -g, I got the following results regarding the floating point operations:

x87 instructions retired samples: 105
x87 instructions retired events: 8820
packed single precision floating point SSE retired samples: 14
packed single precision floating point SSE retired events: 14

When compiled with -O0 -g, the numbers came back as follows:

x87 instructions retired samples: 131
x87 instructions retired events: 10873
packed single precision floating point SSE retired samples: 14
packed single precision floating point SSE retired events: 14

As I said, the whole program was assigning floating point values to an array of 1000 elements. So there was one explicit floating point add that was executed 1000 times. How do I correlate this with the numbers returned above if I'm trying to count FLOPS? Code is below.

Thank you.

Les

for (i = 0; i < 1000; i++)
{
    arr[i] = (float)i + 10.3f;
}

robert-reed · 05-13-2009

If these were collected using the default EBS events (hot spot analysis events), these numbers suggest that the x87 instructions retired events were collected with a SAV (Sample After Value) of 84 on the first set and 83 on the second set (84/83 events required to trigger one sample). For both cases, the SSE retired collections look like they're using an SAV of 1 (sample every event). Collecting only 14 events suggests that your code isn't doing what you think it should. Have you tried looking at a disassembly to see what kinds of instructions are actually being used? You might also try drilling down to the source of the x87 events and see where they're coming from.
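The SAV figures above can be reproduced from the posted numbers by dividing events by samples. A minimal sketch in C, assuming (as the reply does) that the reported event count is the sample count multiplied by the SAV; the function name `events_per_sample` is made up for illustration:

```c
#include <assert.h>
#include <stdio.h>

/* Events recorded per sample taken (integer division).
   Hypothetical helper, not part of any VTune API. */
unsigned long events_per_sample(unsigned long events, unsigned long samples)
{
    assert(samples > 0);
    return events / samples;
}

/* Prints the ratio for each of the measurements posted in this thread. */
void print_posted_ratios(void)
{
    printf("-O3 x87: %lu\n", events_per_sample(8820, 105));   /* 84 */
    printf("-O0 x87: %lu\n", events_per_sample(10873, 131));  /* 83 */
    printf("SSE packed single: %lu\n", events_per_sample(14, 14)); /* 1 */
}
```

Both x87 ratios divide evenly (105 x 84 = 8820 and 131 x 83 = 10873), which is consistent with the event totals being derived from the sample counts rather than counted directly.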

Peter_W_Intel · 05-13-2009

Robert is right: the disassembly view will show you why "-O3" does better than "-O0".
An additional suggestion: if you use the Intel C++ compiler (sorry, I haven't tested this with gcc), change the code to:

float arr[1000], const_f1[1000], const_f2[1000]; // pseudo code: you have to initialize const_f1 (all 10.3f) and const_f2 (0, 1, 2, ...)

for (i = 0; i < 1000; i++)
{
    //arr[i] = (float)i + 10.3f;
    arr[i] = const_f1[i] + const_f2[i]; // this statement will be vectorized if you build with the "-O3" option!
}

Regards, Peter

lhartzman · 05-14-2009

Actually the existing code already gets vectorized. Because the code does get vectorized I would expect that the number of x87 ops would be lower.

I think my problem right now is getting a better understanding of the factors that VTune uses in its calculations to come up with the numbers it reports. Between samples, events, sample-after-value, etc... I'm trying to decode all of this to correlate it to what I would expect to be reported.

Les

robert-reed · 05-14-2009

Quoting - lhartzman

Actually the existing code already gets vectorized. Because the code does get vectorized I would expect that the number of x87 ops would be lower.

Do you know that this code gets vectorized because you've seen a vectorization report in the compilation log? The VTune analyzer data you published, albeit flawed, suggests otherwise.

I think my problem right now is getting a better understanding of the factors that VTune uses in its calculations to come up with the numbers it reports. Between samples, events, sample-after-value, etc... I'm trying to decode all of this to correlate it to what I would expect to be reported.

Have you looked at any of the documentation that comes with the VTune analyzer help file? There are whole pages talking about the sampling method, which uses a feature of the Intel processor performance monitoring registers to limit the frequency of sample collection in a manner that still reflects the frequency of the event, in order to limit the impact of the collection on the program(s) under test. For hot spot sampling, the SAV is nominally picked to cause interrupts around once per millisecond.

There is a dialog in the collector configuration that lets you set SAVs for each event. You can also look there to see what is the current setting.
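The relationship between events, samples, and SAV described here can be modeled roughly as a counter that raises one sample every SAV events. The sketch below is a simplified model under that assumption, not VTune's actual implementation; the function names are hypothetical, and the SAV of 84 used in the discussion is only the value inferred earlier in the thread:

```c
#include <assert.h>

/* Simplified model (an assumption): one sample is taken each time
   `sav` events have occurred; leftover events produce no sample. */
unsigned long samples_taken(unsigned long true_events, unsigned long sav)
{
    assert(sav > 0);
    return true_events / sav;
}

/* The event total is then estimated by scaling the samples back up. */
unsigned long estimated_events(unsigned long samples, unsigned long sav)
{
    return samples * sav;
}
```

Under this model, 1000 events with an SAV of 84 yield only 11 samples and an estimate of 924 events, while 100,000,000 events yield 1,190,476 samples and an estimate of 99,999,984. The shorter the run, the larger the share of events that never contributes to a sample.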

lhartzman · 05-14-2009

Yes, I have looked at the documentation. But what would really be nice would be some detailed examples - more than just setting the different collector values. As an example, it would be nice to see a FLOPS counting example!

Now for the best part of this post: I solved the problem. In my little code sample the problem was that it just had too short an execution time. So I changed the code to malloc 100,000,000 floats and then go about assigning a value to each element. My first attempt at this gave me an x87 retired events count of nearly 200,000,000 - 2x what I was expecting. So I figured that the casting of the loop index also generated a floating point operation that was countable. I declared an identifier outside of the loop and assigned it a single precision float value and replaced the loop index in the addition with the new float identifier. I reran VTune and got back an x87 count of nearly 100,000,000 - just what I was expecting. Then to press on, I put in another term that was multiplied against the float constant, and VTune returned a value of nearly 200M; again what I was expecting.

The code was compiled with -O0 to ensure that no optimizations were turned on, so I could get a "true" count. As it turned out, I didn't really have to do anything special within VTune as far as tweaking any values. I just had to give it something that executed long enough for it to do its job.

Les
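The revised test described in this post was not included, so the following is only a sketch of what it might look like. The function names, the `base` and `scale` parameters, and the exact form of the multiply variant are assumptions made for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One single precision add per element. `base` is a float set outside
   the loop, so the loop body contains no int-to-float cast. */
void fill_add(float *arr, size_t n, float base)
{
    size_t i;
    assert(arr != NULL);
    for (i = 0; i < n; i++)
        arr[i] = base + 10.3f;
}

/* Variant with an extra term multiplied against the constant:
   one multiply and one add per element. */
void fill_add_mul(float *arr, size_t n, float base, float scale)
{
    size_t i;
    assert(arr != NULL);
    for (i = 0; i < n; i++)
        arr[i] = base + scale * 10.3f;
}

/* Allocates n floats and fills them with fill_add.
   Returns NULL if the allocation fails; the caller frees the buffer. */
float *run_fill(size_t n, float base)
{
    float *arr = malloc(n * sizeof *arr);
    if (arr == NULL)
        return NULL;
    fill_add(arr, n, base);
    return arr;
}
```

To mirror the experiment, a `main` would call `run_fill` with n = 100000000 and the program would be built with -O0, as described above. Which instructions the compiler actually emits for these loops depends on the compiler and target, so checking the disassembly remains worthwhile.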