The Old Man is not easily embarrassed.

djbenton · ‎06-16-2006

Please check this out. I put this link here because I don't know of any other way to format text in this forum so that you can see an approximate VTune screen.

http://www.dsllc.com/download/vtune.gif

Am I interpreting this correctly? Are the numbers in the right column under "Clock" the number of clock cycles required to perform the instructions on the left? If so, this is a nightmare! At other places in the code it takes only a few cycles to perform typical instructions, but here in this loop something goes terribly wrong! n>0 takes 359 cycles?! z=zz>>14 takes 363 cycles?! zz>pz[0] takes 991 cycles?! And the killer, pp[0]=p.b[1], takes 1603 cycles?!

The last line of code above assembles to:

mov al,[ebp-19] (63 cycles)
mov esi,[ebp-144] (1414 cycles)
mov [esi],al (129 cycles)

What's the excuse for this? Is the pre-processor screwing things up so badly that the processor is stumbling all over itself? If simple instructions take this long, or the time required for instructions is this unpredictable, how can any code be optimized?

TimP · ‎06-16-2006

Those are total numbers of clock ticks during your sampling run, associated with instructions around the one designated. If you read the docs, you should see that there is no exact association with a particular instruction. Some instructions will show no ticks, and most ticks will be associated with an instruction which comes later than the one which is actually responsible. Informally, that's called "skid." Maybe you are looping through these instructions more often than you intended?
I guess no one has been able to figure out what your question about pre-processing is. If your pre-processing is complex, you might try saving the pre-processed source, then compiling and analyzing that, probably with interprocedural optimizations shut off at first.

djbenton · ‎06-16-2006

Actually, the "total numbers of clock ticks during [my] sampling run" was in the billions, and about 20% of that was in this loop alone. Yes, I know that the number of clock cycles doesn't exactly correspond to the instructions, because sampling isn't a line-by-line timing like function profiling would be; but it still means something, doesn't it? Otherwise, what's the point of VTune?

The preprocessing I'm talking about is the function of the hardware preprocessor. If this code were running on an early Intel processor with zero wait state memory, each instruction would take a fixed amount of time to perform, which I could look up in a chart. With the current Intel processors, the same instruction takes a variable time to perform depending on what came before it.

I'm trying to figure out whether or not VTune is revealing the approximate actual time required to perform an instruction, including the help (or, it seems in my case, hindrance) of the hardware preprocessor. You see, I've placed calls to the precision timer in the code and have come to the conclusion that VTune is indeed showing me what I think it is: certain loops can take far longer than they should if you just add up the time the instructions should take and multiply by the number of times through the loop. Of course you can't use the precision timer to time a single instruction, but the VTune sampling process can sort of tease this out. That's why I tried VTune in the first place.

Thanks for responding!
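
The cross-check described in this post, timing a whole loop with a high-resolution timer and dividing by the iteration count, can be sketched roughly as follows. This is a minimal illustration in Python, not the poster's code: `time_loop`, `cycles_per_iteration`, and the `clock_ghz` parameter are hypothetical names, a fixed core frequency is assumed, and `time.perf_counter_ns` stands in for whatever precision timer was actually used.

```python
import time


def cycles_per_iteration(elapsed_ns, iterations, clock_ghz):
    """Convert a measured elapsed time into average clock cycles per iteration.

    clock_ghz is an assumed, fixed core frequency (cycles per nanosecond).
    """
    return elapsed_ns * clock_ghz / iterations


def time_loop(body, iterations):
    """Run body() `iterations` times and return the elapsed time in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        body()
    return time.perf_counter_ns() - start


if __name__ == "__main__":
    n = 100_000
    elapsed = time_loop(lambda: None, n)  # placeholder loop body
    print(cycles_per_iteration(elapsed, n, clock_ghz=2.0))
```

The result is only an average over the whole loop, which is why a timer alone cannot single out one instruction.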

djbenton · ‎06-20-2006

Ah... I see now what Tim was suggesting. If you right click on the "Clock" column and select "View as" and then "total events", you do get the total number of clock cycles for that instruction rather than the average. Yeah, when I do that I get 6 billion! It is as I suspected. Pentiums take 1000 clock cycles to perform a mov eax,[esi] in one place and 2 cycles to perform the same instruction elsewhere.

I can't help but conclude that the current Intel hardware preprocessors work very well for things like mov eax,[esi+8*ebx+1234] but not when it comes to predicting upcoming events and pipelining. They're probably designed to work well for games and multimedia, which I couldn't give a rat's rear end about but Intel cares a great deal about.

I guess I should be glad that people buy powerful computers to play on, so that the price comes down and I can afford to buy a powerful computer to work on and make a living with.

djbenton · ‎06-22-2006

The silence is deafening.

TimP · ‎06-23-2006

One of the reasons for using VTune would be to detect store forwarding stalls. If your code is storing data to a pointer target, the instruction which loads data from that address has to wait for stalls to be resolved. Did you read about the circumstances where the worst stalls will be incurred? The obvious way to resolve such a problem is to registerize the data.
If your application, like a game, has such stalls, it may be embarrassing, but you may have found where to work on it.

djbenton · ‎06-23-2006

Au contraire, Tim, you are confused. I have no reason to be embarrassed. It's easy to see that I'm reading from and writing to the same memory location in succession. No doubt there's a stall. Of course, when the same pair of instructions occurs later there should be a similar stall; but the same instructions don't take anywhere near as long.

I'm just curious. How long have you been at this? I've been writing assembler since it had to be done in zeroes and ones on punch cards. I know multiple languages. I've written hundreds of thousands of lines of code, several compilers, and a linker. I really don't think you're going to embarrass the Old Man.

Besides, can't you tell? I already know the answer to these questions. The precision timer has told me all I need to know. This is the only avenue I have to vent my frustrations about the way the chip design has gone. I'm just poking a stick at the elephant because there's nothing else I can do about it.

Intel_Software_Netw1 · ‎04-09-2007

Here's some additional clarifying information the Intel Software Network Support team received about this from our engineering contacts:

Some clarification might help. For the sake of argument, let's assume this is being done on an Intel Core2 processor. The Core2 processor executes instructions out of order (unlike an Intel486 processor), dispatching them to the execution units as their inputs become available rather than in the programmed order. They sit in the Reorder Buffer (ROB) and are retired in programmed sequence. Up to 4 instructions can be retired per clock cycle. This out-of-order (OOO) execution can result in bursts of retirement.

When the Performance Monitoring Unit (PMU) is used to sample on the occurrence of a performance event, the counter is programmed to count the desired event and is initialized to the Sample After Value (SAV). With each occurrence of the event, the counter is decremented.

When the counter underflows, an interrupt is raised by the hardware, and the processor branches to the address of the interrupt handler specified by the interrupt vector, which in this case is the VTune Analyzer driver. The driver may not actually start executing for some number of cycles. For example, if the processor is executing a ring 0 OS critical piece of code like a page fault handler, this activity will not be interrupted by the performance monitoring interrupt. The point is that the driver acquires the IP of the last instruction retired before it took over control.
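
To make the counting mechanism concrete, here is a minimal Python sketch of sampling with a Sample After Value. It is a simplified model, not the PMU's or the VTune driver's actual interface: `sample_ips` and `estimate_total_events` are hypothetical names, the counter is modeled as counting down from the SAV to zero, and the interrupt is assumed to be serviced immediately (no delay).

```python
def sample_ips(event_ips, sample_after_value):
    """Return the IP recorded each time the modeled counter runs out.

    event_ips: the IP associated with each occurrence of the event, in order.
    """
    samples = []
    counter = sample_after_value
    for ip in event_ips:
        counter -= 1                      # one event occurred
        if counter == 0:                  # modeled "underflow": take a sample
            samples.append(ip)
            counter = sample_after_value  # re-arm for the next sample
    return samples


def estimate_total_events(num_samples, sample_after_value):
    """Each sample stands for roughly `sample_after_value` events."""
    return num_samples * sample_after_value


if __name__ == "__main__":
    ips = ["a", "b", "c"] * 4             # 12 events spread over three IPs
    taken = sample_ips(ips, 5)
    print(taken, estimate_total_events(len(taken), 5))
```

Under this model, a total event count is just the sample count multiplied by the SAV, so it says how many events were seen near an IP overall, not how long one execution of that instruction took.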

Long latency instructions like loads from memory and sqrt and divide will have larger windows during which they are the oldest instruction.

The net effect of these three points (the OOO execution, the larger windows for long latency instructions, and the variable interrupt response time) is the effect called skid. One particular point is that the combination of OOO execution and the possibility of multiple instructions being retired per cycle can result in certain instructions never being assigned a single sample under this interrupt behavior, so the ratio of samples on successive instructions in the VTune disassembly display (or in any other sampling tool, for that matter) can be infinite.
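
A toy simulation can illustrate how skid moves samples onto the wrong instruction. The sketch below is a deliberately crude model in Python, not a description of real hardware: `simulate_samples` is a hypothetical name, each instruction is simply the oldest unretired one for a fixed number of cycles, a sample is taken every `sample_after_value` clock ticks, and the recorded IP always lands a fixed `skid` instructions later (real skid varies).

```python
from collections import Counter


def simulate_samples(latencies, sample_after_value, skid, iterations):
    """Attribute clock-tick samples to the instructions of a loop body.

    latencies: cycles during which each instruction is the oldest one.
    skid: how many instructions past the responsible one the sample lands.
    Returns a list of sample counts, one per instruction.
    """
    n = len(latencies)
    counts = Counter()
    counter = sample_after_value
    for _ in range(iterations):
        for idx, cycles in enumerate(latencies):
            for _ in range(cycles):
                counter -= 1
                if counter == 0:
                    # The sample is charged to a later instruction.
                    counts[(idx + skid) % n] += 1
                    counter = sample_after_value
    return [counts[i] for i in range(n)]


if __name__ == "__main__":
    # Instruction 1 stands in for a long-latency load from memory.
    body = [1, 200, 1, 1]
    print(simulate_samples(body, 10, skid=0, iterations=1000))
    print(simulate_samples(body, 10, skid=1, iterations=1000))
```

In this model the load at index 1 is responsible for nearly all of the cycles, yet with `skid=1` the samples pile up on the instruction after it, which is the pattern of a cheap instruction appearing to cost hundreds of clocks.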

Unless you are using precise events, where the HW captures the IP of the instigating instruction (e.g. mem_load_retired.l2_line_miss for L2 cache misses caused by loads), the exact IP value associated with the samples should only be viewed as an estimate of the region. In the case of a loop, the event probably occurred in the loop, but even that might not be the case if you rig the test case carefully.

==

Lexi S.

Intel Software Network Support

http://www.intel.com/software

How long does it take? or Am I reading this right?