Why does _spin_lock have such a high CPI in the VTune report?

mfcking · ‎07-11-2005

Hello,

I used VTune 3.0 to sample the spin lock activity invoked by the e1000 Gigabit driver and the Linux kernel 2.6.12. I found that the CPI of _spin_lock is almost 27, while _spin_lock has a 100% L2 cache hit rate.

I checked the assembly code of _spin_lock in Linux, and it uses the LOCK prefix. According to the IA-32 optimization manual, the LOCK prefix does not lock the FSB when the referenced data is found in the L2 cache of the local CPU. However, it goes on to say that locked instructions are inherently slow, whether or not the data to be locked is found in the L2 cache.

I still do not understand what causes the CPI of _spin_lock to be so high.

Thanks a lot,

L.Y.

_spin_lock code in Linux:

1:  lock; decb slp    # atomically decrement
    jns 3f            # if the sign bit is clear, jump forward to 3
2:  cmpb $0,slp       # spin: compare to 0
    pause             # spin: wait
    jle 2b            # spin: go back to 2 if <= 0 (locked)
    jmp 1b            # unlocked; go back to 1 to try to lock again
3:                    # we have acquired the lock
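
For anyone who finds the assembly hard to follow, here is a rough C11 sketch of the same control flow. It is only an illustration: the type `sketch_lock_t` and the functions `sketch_spin_lock` and `sketch_spin_unlock` are made-up names, the unlock path (storing 1 back) is my assumption about the matching release, and the kernel itself does this in inline assembly, not with `<stdatomic.h>`.

```c
#include <stdatomic.h>

/* Hypothetical lock byte: 1 = unlocked, <= 0 = locked (same convention as slp). */
typedef struct {
    atomic_schar slp;
} sketch_lock_t;

static void sketch_spin_lock(sketch_lock_t *l)
{
    for (;;) {
        /* Label 1: "lock; decb". atomic_fetch_sub returns the OLD value,
         * so subtract 1 again to get the result of the decrement. */
        if (atomic_fetch_sub(&l->slp, 1) - 1 >= 0)
            return; /* "jns 3f": result is not negative, lock acquired */

        /* Label 2: spin on plain loads (no locked instruction) until the
         * byte becomes positive again. The assembly executes "pause" here. */
        while (atomic_load_explicit(&l->slp, memory_order_relaxed) <= 0)
            ;
        /* "jmp 1b": the lock looks free, retry the atomic decrement. */
    }
}

static void sketch_spin_unlock(sketch_lock_t *l)
{
    /* Assumed release: store 1 so that a waiter's decrement yields 0. */
    atomic_store(&l->slp, 1);
}
```

Note that on the uncontended path only the atomic decrement and one branch execute, so the locked instruction dominates the cycle count of the function.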

Message Edited by mfcking@yahoo.com on 07-11-2005 03:09 PM

jeffrey-gallagher · ‎07-12-2005

Just curious here, L.Y. Do you have calibration enabled or disabled in your sampling session? If you aren't sure, it's hard to guess because calibration is off by default for some events, and on by default for others.

If it is disabled, enable it and report back here on what difference, if any, you see.

cheers

jdg

For more:

$ man sampling

But in case this rings a bell, use "-cal yes" to turn it on, "-cal no" to turn it off in the syntax.

Boaz_T_Intel · ‎07-12-2005

One more interesting question is whether you are running it on a multi-CPU machine. What about HT?

If there is any parallelism, two threads accessing the same variable, or even different variables on the same cache line, can cause a large number of L2 cache misses.
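
To illustrate the "different variables on the same cache line" case, here is a small C11 sketch. The struct names, the helper `same_line`, and the 64-byte `CACHE_LINE` value are all assumptions for the example; the real line size depends on the processor, so check it for your CPU.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed line size in bytes, not taken from the thread */

/* Two lock bytes packed together: both land on the same cache line, so a
 * write to one by CPU 0 invalidates the line holding the other on CPU 1. */
struct packed_locks {
    char lock_a;
    char lock_b;
};

/* Each lock byte aligned to its own cache line to avoid that sharing. */
struct padded_locks {
    alignas(CACHE_LINE) char lock_a;
    alignas(CACHE_LINE) char lock_b;
};

/* Returns 1 if two byte offsets (relative to a line-aligned base)
 * fall on the same cache line, 0 otherwise. */
static int same_line(size_t off_a, size_t off_b)
{
    return off_a / CACHE_LINE == off_b / CACHE_LINE;
}
```

Comparing `offsetof(struct packed_locks, lock_b)` with `offsetof(struct padded_locks, lock_b)` shows the effect of the padding, at the cost of a larger structure.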

Boaz.

mfcking · ‎07-12-2005

Yes, I ran my tests on an SMP machine (2 Xeons) with HT disabled.

Message Edited by mfcking@yahoo.com on 07-12-2005 09:19 AM

mfcking · ‎07-12-2005

Hi JDG,

I enabled calibration for all the events, and the result is even worse (the CPI is now 29, versus 27 without calibration):

Function        Clockticks per Instructions Retired (CPI) (261)    2nd-Level Cache Load Hit Rate (261)
"_spin_lock"    "29.153"                                           "100.000"

Thanks,

L.Y.

Message Edited by mfcking@yahoo.com on 07-12-2005 01:01 PM

TimP · ‎07-12-2005

I'm trying to understand whether you think that a high CPI in a spin lock loop is good or bad. The usual goal would be to have the spin lock spend its time as efficiently as possible (issuing as few instructions as possible), which clearly means a high CPI. This would be particularly true if the spin lock loop could be competing for resources with another thread, which would then be able to do useful work at a lower CPI than if it were competing against the spin lock.

mfcking · ‎07-12-2005

Hi Tim,

Yeah, it seems a single locked instruction in _spin_lock can cost 70 clock cycles. If we add up the clock cycles of the other instructions in _spin_lock, a CPI of 29 is a reasonable result.
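
As a rough sanity check of that estimate, the average can be written out. The 70 cycles is the figure mentioned above. The instruction count N for one uncontended pass through _spin_lock and the combined cost c_rest of the non-locked instructions are assumptions, not measurements.

```latex
\mathrm{CPI} = \frac{c_{\mathrm{lock}} + c_{\mathrm{rest}}}{N}
\quad\Longrightarrow\quad
c_{\mathrm{rest}} = N \cdot \mathrm{CPI} - c_{\mathrm{lock}}

% With CPI = 29 and c_lock = 70:
%   N = 3:  c_rest = 3 * 29 - 70 = 17 cycles
%   N = 4:  c_rest = 4 * 29 - 70 = 46 cycles
```

So if only a handful of instructions retire per call, one 70-cycle locked instruction is enough to push the average close to 29.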

Thanks,

Liang

Message Edited by mfcking@yahoo.com on 07-13-2005 10:44 AM