http://www.dsllc.com/download/vtune.gif
Am I interpreting this correctly? Are the numbers in the right column under "Clock" the number of clock cycles required to perform the instructions on the left? If so, this is a nightmare! Elsewhere in the code it only takes a few cycles to perform typical instructions, but in this loop something goes terribly wrong: n>0 takes 359 cycles?! z=zz>>14 takes 363 cycles?! zz>pz[0] takes 991 cycles?! And the killer, pp[0]=p.b[1], takes 1603 cycles?!
The last line of code above assembles to:
mov al,[ebp-19]    ; load the byte p.b[1] into al      (63 cycles)
mov esi,[ebp-144]  ; load the pointer pp into esi      (1414 cycles)
mov [esi],al       ; store al to pp[0]                 (129 cycles)
What's the excuse for this? Is the pre-processor screwing things up so badly that the processor is stumbling all over itself? If simple instructions take this long, or if the time required for instructions is this unpredictable, how can any code be optimized?
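For reference, the loop might look something like this in C. Only the four quoted statements come from the VTune listing above; every type, parameter name, and the loop structure itself are assumptions made purely for illustration.

/* Hypothetical reconstruction of the loop under discussion; only the four
   statements marked with reported cycle counts appear in the VTune listing.
   Types, names, and the surrounding structure are guesses. */
union pix { int w; unsigned char b[4]; };   /* assumed layout of p */

void hot_loop(int n, int zz, const int *pz, unsigned char *pp, union pix p)
{
    int z;
    while (n > 0) {            /* "n>0"          : 359 cycles reported  */
        z = zz >> 14;          /* "z=zz>>14"     : 363 cycles reported  */
        if (zz > pz[0])        /* "zz>pz[0]"     : 991 cycles reported  */
            pp[0] = p.b[1];    /* "pp[0]=p.b[1]" : 1603 cycles reported */
        /* ...the rest of the loop body is not shown in the post... */
        n--;                   /* assumed decrement */
        (void)z;               /* z's real use is not shown */
    }
}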
I guess no one has been able to figure out what your question about pre-processing is. If your pre-processing is complex, you might try the effect of saving the pre-processed source, then compiling and analyzing that, probably with interprocedural optimizations shut off at first.
The preprocessing I'm talking about is the function of the hardware preprocessor. If this code were running on an early Intel processor with zero-wait-state memory, each instruction would take a fixed amount of time to perform, which I could look up in a chart. With the current Intel processors, the same instruction takes a variable amount of time to perform depending on what came before it.
I'm trying to figure out whether or not VTune is revealing the approximate actual time required to perform an instruction, including the help, or in my case it seems the hindrance, of the hardware preprocessor. You see, I've placed calls to the precision timer in the code and have come to the conclusion that VTune is indeed showing me what I think it is: certain loops can take far longer than they should if you just add up the time the instructions should take multiplied by the number of trips through the loop. Of course you can't use the precision timer to time a single instruction, but the VTune sampling process can sort of tease this out. That's why I tried VTune in the first place.
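For anyone wanting to reproduce this kind of whole-loop measurement, here is a minimal sketch of the precision-timer approach described above, assuming a compiler that provides the __rdtsc() intrinsic (MSVC's <intrin.h> or GCC/ICC's <x86intrin.h>). The loop body is a placeholder, not the code from the post.

/* Minimal sketch: time a whole loop with the timestamp counter. */
#include <stdio.h>
#ifdef _MSC_VER
#include <intrin.h>          /* __rdtsc() on MSVC        */
#else
#include <x86intrin.h>       /* __rdtsc() on GCC/ICC      */
#endif

volatile int sink;           /* keeps the compiler from removing the loop */

int main(void)
{
    unsigned long long start, stop;
    long i, n = 1000000;

    start = __rdtsc();
    for (i = 0; i < n; i++)
        sink = (int)(i >> 14);           /* placeholder work */
    stop = __rdtsc();

    /* Average cycles per iteration; a single instruction cannot be timed
       this way because rdtsc itself costs tens of cycles. */
    printf("%.2f cycles per iteration on average\n",
           (double)(stop - start) / (double)n);
    return 0;
}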
Thanks for responding!
I can't help but conclude that the current Intel hardware preprocessors work very well for things like mov eax,[esi+8*ebx+1234] but not when it comes to predicting upcoming events and pipelining. They're probably designed to work well for games and multimedia, which I couldn't give a rat's rear end about but which Intel cares a great deal about.
I guess I should be glad that people buy powerful computers to play on, so that the price comes down and I can afford to buy a powerful computer to work on and make a living with.
If your application has stalls like a game does, it may be embarrassing, but you may have found where to work on it.
I'm just curious. How long have you been at this? I've been writing assembler since it had to be done in zeroes and ones on punch cards. I know multiple languages. I've written hundreds of thousands of lines of code, several compilers, and a linker. I really don't think you're going to embarrass the Old Man.
Besides, can't you tell? I already know the answer to these questions. The precision timer has told me all I need to know. This is the only avenue I have to vent my frustrations about the way the chip design has gone. I'm just poking a stick at the elephant, because there's nothing else I can do about it.
Here's some additional clarifying information the Intel Software Network Support team received about this from our engineering contacts:
Some clarification might help. For the sake of argument, let's assume this is being done on an Intel Core 2 processor. The Core 2 processor executes instructions out of order (unlike an Intel486 processor), dispatching them to the execution units as their inputs become available rather than in the programmed order. They sit in the Reorder Buffer (ROB) and are retired in the programmed sequence. Up to 4 instructions can be retired per clock cycle. This OOO execution can result in bursts of retirement.
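As a rough illustration of why that out-of-order dispatch matters (this example is not from the original discussion), a loop whose additions form a single dependence chain cannot be overlapped, while splitting the work across independent accumulators lets the core keep several operations in flight at once:

/* Illustration only: the second loop exposes independent work that an
   out-of-order core can overlap; the first is one long dependence chain. */
double sum_serial(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];                   /* each add waits on the previous one */
    return s;
}

double sum_parallel(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) { /* four independent chains in flight */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)               /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}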
When the Performance Monitoring Unit (PMU) is used to sample on the occurrence of a performance event, the counter is programmed to count the desired event and is initialized to the Sample After Value (SAV). With each event's occurrence the counter is decremented.
When the counter underflows, an interrupt is raised by the hardware, and the processor branches to the address of the interrupt handler specified by the interrupt vector, the VTune Analyzer driver in this case. The driver may not actually start executing for some number of cycles. For example, if the processor is executing a ring 0, OS-critical piece of code like a page fault handler, that activity will not be interrupted by the performance monitoring interrupt. The point is that the driver acquires the IP of the last retired instruction before it took over control.
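A conceptual model of that sampling mechanism follows, written as plain C rather than real PMU programming; the actual hardware is driven through model-specific registers, and the names sav, counter, record_sample, and on_event are invented for this sketch.

#include <stdio.h>

/* Conceptual model of event-based sampling; not real PMU programming. */
static long long sav = 100000;      /* Sample After Value              */
static long long counter = 100000;  /* decremented once per event      */

static void record_sample(unsigned long ip)
{
    printf("sample attributed to IP 0x%lx\n", ip);
}

/* Called once per occurrence of the monitored event.  The argument is the
   IP of the last instruction retired when the interrupt is finally serviced,
   which may be well past the instruction that caused the event. */
static void on_event(unsigned long last_retired_ip)
{
    if (--counter <= 0) {
        record_sample(last_retired_ip);
        counter = sav;              /* re-arm for the next sample */
    }
}

int main(void)
{
    /* Drive the model with a fake event stream. */
    for (unsigned long ip = 0x401000; ip < 0x401000 + 300000; ip++)
        on_event(ip);
    return 0;
}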
Long-latency instructions like loads from memory, sqrt, and divide will have larger windows during which they are the oldest instruction.
The net effect of these three points (the OOO execution, the larger windows for long-latency instructions, and the variable interrupt response time) is called skid. One particular point is that the combination of the OOO execution and the possibility of multiple instructions being retired per cycle can result in certain instructions never being assigned a single sample during this interrupt behavior; thus the ratio of samples on successive instructions in the VTune disassembly display (or any other sampling tool, for that matter) can be infinite.
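A toy simulation of that attribution effect, invented for illustration and not how VTune works internally: if up to four instructions retire per cycle and each sample lands on the last one retired, the earlier instructions in each retirement group never collect a sample at all.

#include <stdio.h>

/* Toy simulation of skid.  Two groups of four "instructions" retire per
   iteration, and every sample is attributed to the last instruction retired
   in the group that was completing when the interrupt landed.  Instructions
   0-2 and 4-6 therefore never receive a sample, even though they execute
   just as often as instructions 3 and 7. */
int main(void)
{
    int samples[8] = {0};
    int groups[2][4] = { {0, 1, 2, 3},     /* cycle A retires four */
                         {4, 5, 6, 7} };   /* cycle B retires four */

    for (int iter = 0; iter < 1000; iter++)
        for (int g = 0; g < 2; g++)
            samples[groups[g][3]]++;       /* sample lands on the youngest */

    for (int i = 0; i < 8; i++)
        printf("instruction %d: %d samples\n", i, samples[i]);
    return 0;
}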
Unless you are using precise events, where the HW captures the IP of the instigating instruction (e.g., mem_load_retired.l2_line_miss for L2 cache misses caused by loads), the exact IP value associated with the samples should only be viewed as an estimate of the region. In the case of a loop, the event probably occurred in the loop, but even that might not be the case if you rig the test case carefully.
==
Lexi S.
Intel Software Network Support
