When compiled with -O3 -xW -g, I got the following results regarding the floating point operations:
x87 instructions retired samples: 105
x87 instructions retired events: 8820
packed single precision floating point SSE retired samples: 14
packed single precision floating point SSE retired events: 14
When compiled with -O0 -g, the numbers came back as follows:
x87 instructions retired samples: 131
x87 instructions retired events: 10873
packed single precision floating point SSE retired samples: 14
packed single precision floating point SSE retired events: 14
As I said, the whole program was assigning floating point values to an array of 1000 elements. So there was one explicit floating point add that was executed 1000 times. How do I correlate this with the numbers returned above if I'm trying to count FLOPS? Code is below.
Thank you.
Les
for (i = 0; i < 1000; i++)
{
    arr[i] = (float)i + 10.3f;
}
If these were collected using the default EBS events (hot spot analysis events), these numbers suggest that the x87 instructions retired events were collected with a SAV (Sample After Value) of 84 on the first set and 83 on the second set (84/83 events required to trigger one sample). For both cases, the SSE retired collections look like they're using an SAV of 1 (sample every event). Collecting only 14 events suggests that your code isn't doing what you think it should. Have you tried looking at a disassembly to see what kinds of instructions are actually being used? You might also try drilling down to the source of the x87 events and see where they're coming from.
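The SAV figures above follow from the relation events = samples x SAV. As a minimal sketch of that arithmetic (the helper name `estimate_sav` is an assumption made for illustration, not part of VTune), the reported totals can be divided out like this:

```c
#include <stdio.h>

/* Hypothetical helper: recover the Sample After Value from the totals
 * a sampling run reports, using events = samples * SAV.
 * Returns 0 when no samples were taken. */
static long estimate_sav(long events, long samples)
{
    if (samples <= 0)
        return 0;
    return events / samples;
}

/* Print the implied SAV for one reported event, labeled for readability. */
static void print_sav(const char *label, long events, long samples)
{
    printf("%s: SAV = %ld\n", label, estimate_sav(events, samples));
}
```

Calling `print_sav("x87, -O3", 8820, 105)` and `print_sav("x87, -O0", 10873, 131)` reproduces the 84 and 83 quoted above, and the SSE totals (14 events, 14 samples) give an SAV of 1.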
Robert is right: the disassembly view will show you why "-O3" does better than "-O0".
Additional suggestion: if you use the Intel C++ compiler (sorry, I haven't tested this on gcc), change the code to:
float arr[1000], const_f1[1000], const_f2[1000]; // pseudo code: you have to initialize const_f1[i] = 10.3f and const_f2[i] = i first
for (i = 0; i < 1000; i++)
{
    //arr[i] = (float)i + 10.3f;
    arr[i] = const_f1[i] + const_f2[i]; // this statement will be vectorized if you build with "-O3" option!
}
Regards, Peter
I think my problem right now is getting a better understanding of the factors that VTune uses in its calculations to come up with the numbers it reports. Between samples, events, sample-after-value, etc... I'm trying to decode all of this to correlate it to what I would expect to be reported.
Les
Do you know that this code gets vectorized because you've seen a vectorization report in the compilation log? The VTune analyzer data you published, albeit flawed, suggests otherwise.
Have you looked at any of the documentation that comes with the VTune analyzer help file? There are whole pages talking about the sampling method, which uses a feature of the Intel processor performance monitoring registers to limit the frequency of sample collection in a manner that still reflects the frequency of the event, in order to limit the impact of the collection on the program(s) under test. For hot spot sampling, the SAV is nominally picked to cause interrupts around once per millisecond.
There is a dialog in the collector configuration that lets you set the SAV for each event. You can also look there to see what the current setting is.
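To make the sampling mechanism concrete, here is a rough sketch of a counter that raises one sample every SAV events. It is a simplified model for illustration only, not how the analyzer is implemented, and the function names are assumptions. It shows why a run with few events yields few or no samples, and why the event total reconstructed from samples is only an approximation.

```c
#include <stdint.h>

/* Simplified model: the counter is armed with `sav`, counts down once per
 * event, and raises a sample each time it reaches zero.
 * Returns the number of samples raised over `events` occurrences. */
static uint64_t simulate_samples(uint64_t events, uint64_t sav)
{
    uint64_t remaining = sav;
    uint64_t samples = 0;

    if (sav == 0)
        return 0; /* a zero SAV is meaningless in this model */

    for (uint64_t e = 0; e < events; e++) {
        if (--remaining == 0) {
            samples++;
            remaining = sav; /* re-arm the counter */
        }
    }
    return samples;
}

/* The event total a tool could report from samples alone. */
static uint64_t estimated_events(uint64_t samples, uint64_t sav)
{
    return samples * sav;
}
```

In this model, 1000 events with an SAV of 84 produce 11 samples, which reconstruct to 924 events, while the same 1000 events with an SAV of 100000 produce no samples at all.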
Now for the best part of this post: I solved the problem. In my little code sample the problem was that it just had too short an execution time. So I changed the code to malloc 100,000,000 floats and then go about assigning a value to each element. My first attempt at this gave me an x87 retired events count of nearly 200,000,000 - 2x what I was expecting. So I figured that the casting of the loop index also generated a floating point operation that was countable. I declared an identifier outside of the loop and assigned it a single precision float value and replaced the loop index in the addition with the new float identifier. I reran VTune and got back an x87 count of nearly 100,000,000 - just what I was expecting. Then to press on, I put in another term that was multiplied against the float constant, and VTune returned a value of nearly 200M; again what I was expecting.
The code was compiled with -O0 to ensure that no optimizations were turned on, so I could get a "true" count. As it turned out, I didn't really have to do anything special within VTune as far as tweaking any values. I just had to give it something that executed long enough for it to do its job.
Les
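The final test described above was not posted, so the following is only a sketch of what the two loop variants might look like. The function and parameter names are assumptions. Each function returns the number of floating point operations one would expect an unoptimized (-O0) build to execute, for comparison against the reported event count.

```c
#include <stddef.h>

/* Variant 1: one floating point add per element. Both operands are already
 * floats, so there is no int-to-float conversion of the loop index. */
static size_t fill_add(float *arr, size_t n, float base, float term)
{
    for (size_t i = 0; i < n; i++)
        arr[i] = base + term;
    return n; /* expected adds */
}

/* Variant 2: one multiply and one add per element. */
static size_t fill_mul_add(float *arr, size_t n, float base, float term,
                           float scale)
{
    for (size_t i = 0; i < n; i++)
        arr[i] = base + term * scale;
    return 2 * n; /* expected multiplies plus adds */
}
```

To approximate the run described above, a caller would allocate the buffer with `malloc(n * sizeof(float))` for `n` equal to 100,000,000 and pass it to one of these functions, so the loop runs long enough for sampling to give a usable count.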