Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Hyperthreading and 'lock' prefix

skeil1
Hi,

if the following two instructions

inc [eax]
inc [eax]

are executed in a single thread on a single processor, the processor takes care of the atomicity of 'inc', so it is guaranteed that after the two instructions the value at [eax] has been incremented by two, even though internally each inc instruction is broken down into multiple micro-ops and executed in multiple pipeline stages. The same holds if the 'inc's are executed in separate threads on a single processor (the processor guarantees the atomicity of each inc).

If the 'inc's are executed on separate _physical_ processors, I need to prefix the 'inc' with the 'lock' prefix,

lock inc [eax]

to guarantee the desired behaviour (because shared memory is involved).

But what if the 'inc's are executed on separate _logical_ processors of the same physical processor (Hyper-Threading)? Is the lock prefix necessary here?
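
(For concreteness, a rough sketch of the two variants in compilable form, assuming 32-bit MSVC inline assembly; the shared variable 'counter' is purely illustrative:)

    volatile long counter = 0;        // shared between the threads

    void unlocked_increment()
    {
        __asm {
            mov  eax, offset counter
            inc  dword ptr [eax]      ; plain inc [eax], no lock prefix
        }
    }

    void locked_increment()
    {
        __asm {
            mov  eax, offset counter
            lock inc dword ptr [eax]  ; inc [eax] with the lock prefix
        }
    }

(The locked variant gives the same read-modify-write guarantee that the Win32 InterlockedIncrement API provides.)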

Regards,

Stephan Keil

skeil1
Hm,

if not Intel, who else should I ask :-? Do any of the Intel folks here know where I could get an answer?

Thx and best regards,

Stephan

ClayB
Stephan -

I'm no expert in this field (so anyone feel free to correct my errors), but I think you would need the lock prefix for the case where the inc [eax] is executed on logical processors within the same physical processor. The basis of Hyper-Threading Technology is the duplication of the architectural state: there are two sets of registers and such, while many of the other processor resources are divided in half. Thus, just as in the case of two separate processors, each logical processor can hold the same address in a register and attempt to increment the same memory location simultaneously.

That all depends on whether there are enough of the correct functional units to perform the increment operation simultaneously, of course. If not, the lock may not be needed since the two increments can't be done within the same processor at the same time. Still, given any two different operations that can be executed simultaneously on Hyper-Threading enabled processors, you will need to protect memory references to the same locations.

-- clay

skeil1
Thanx for your answer, Clay.

IMHO your register example is not evidence that the 'lock' is necessary. Otherwise the statements

mov eax, ebx
inc [eax]
inc [ebx]

would produce undefined results even on a single processor. AFAIK the processor keeps track of inter-instruction dependencies and delays the execution of successive instructions if it detects a conflict. So if there is only one such dependency checker inside a HT processor, I wouldn't need the 'lock'.

In the meantime I've written a short program which uses multithreaded un-locked 'inc's and 'dec's to determine empirically whether it runs on a multiprocessor machine. It works as expected on single-processor systems and on real multiprocessor systems. But as I have no access to a HT system, I cannot check the program against a HT processor. Anyway, this would only be a weak check: if the program says it runs on an MP system, I definitely have to use 'lock'; but if it says it runs on a single-processor system, that behaviour might change with future versions of HT processors. So I am interested in an official statement on the topic.

In a week or two I will be able to check the program on a HT system. I will report the results.
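
(The program itself isn't reproduced in the thread; a rough sketch of the idea, assuming Win32 threads and 32-bit MSVC inline assembly, might look like the following. The names and the iteration count are purely illustrative.)

    #include <windows.h>
    #include <stdio.h>

    volatile long counter = 0;                // shared, deliberately un-locked
    const long ITERATIONS = 10000000;         // illustrative count

    DWORD WINAPI IncThread(LPVOID)
    {
        for (long i = 0; i < ITERATIONS; ++i) {
            __asm { inc counter }             // plain inc, no lock prefix
        }
        return 0;
    }

    DWORD WINAPI DecThread(LPVOID)
    {
        for (long i = 0; i < ITERATIONS; ++i) {
            __asm { dec counter }             // plain dec, no lock prefix
        }
        return 0;
    }

    int main()
    {
        DWORD tid;
        HANDLE t[2];
        t[0] = CreateThread(NULL, 0, IncThread, NULL, 0, &tid);
        t[1] = CreateThread(NULL, 0, DecThread, NULL, 0, &tid);
        WaitForMultipleObjects(2, t, TRUE, INFINITE);

        // On a single processor the un-locked inc/dec are still atomic with
        // respect to each other (interrupts only hit instruction boundaries),
        // so counter ends up at 0.  On a real multiprocessor, lost updates
        // usually leave a non-zero value.  This is a heuristic, not a proof.
        printf("counter = %ld (%s)\n", counter,
               counter == 0 ? "looks like a single processor"
                            : "definitely more than one processor");
        return 0;
    }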

Stephan

ClayB
> In the meantime I've written a short program which
> uses multithreaded un-locked 'inc's and 'dec's to
> empirically determine if it runs on a multiprocessor
> machine. It works as expected on single processor
> systems and real multiprocessor systems. But, as
> I have no access to a HT system, I cannot check the
> program against a HT processor.

Is this a very large program? You could post it to the Forum here and maybe some kind soul (or several) with current HT hardware would be willing to run it.

(I see the "Attach Files" button and have always wondered how this feature works. If the code is too complex to put into a message, maybe you could try attaching it as a file?)

-- clay

davids
The lock prefix is necessary. Nothing stops the memory from 'ping ponging' between the two sibling processors and resulting in only one increment.

Note that testing is a very poor way to determine if this is required. Suppose current processors only have one execution unit capable of performing a read-modify-write, so it tests okay. What happens when the next Intel processor has two such units?

DS

ClayB
I've got a more authoritative response to your original
question. This is from Phil Kerly, one of our
Hyper-Threading experts here at Intel.

Phil says (emphasis is his):

You need to prefix with lock. LD and ST are guaranteed to be memory ordered (when necessary) but that is based on the uop [micro operation] and not on the generating instructions. Both inc instructions could issue the LD at the same time, or at least both before the 2 STs are issued. The machine will guarantee that the LD and ST from each instruction are dependent and in order, but not between the two different instructions on different logical processors. Each logical processor can retire uops independently of the retirement of the other logical processor, thus committing to architectural state. In-order retirement is only on a per logical processor basis. Note that the lock (if data is cached locally) will cause a cache line lock. Lock does not always cause a processor/memory lock - only when needed.
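
To illustrate the point, here is one possible interleaving of the micro-ops of two un-locked 'inc [eax]' instructions, one per logical processor. The decomposition into load/add/store uops and the starting value 5 are assumptions for illustration only:

    logical processor 0              logical processor 1
    LD  [eax] -> tmp0   ; reads 5
                                     LD  [eax] -> tmp1   ; reads 5
    ADD tmp0, 1         ; tmp0 = 6
                                     ADD tmp1, 1         ; tmp1 = 6
    ST  tmp0 -> [eax]   ; stores 6
                                     ST  tmp1 -> [eax]   ; stores 6

The final value at [eax] is 6 instead of 7, so one increment is lost. With 'lock inc [eax]' the load/add/store of each instruction execute as one indivisible read-modify-write, and this interleaving cannot occur.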

Phil also echoes the comments from DS, that is, put in the lock now, even if it may not be functionally required, because future processors will change bus timings, configurations, architecture features, etc. etc. What may not be required today may be essential tomorrow.

-- clay

skeil1
Thanx to you and Phil for the clarifying words.

> Phil also echoes the comments from DS, that is, put in
> the lock now, even if it may not be functionally
> required, because future processors will change bus
> timings, configurations, architecture features, etc.
> etc. What may not be required today may be essential
> tomorrow.

They are right, of course. In this special case the software will be bundled with the hardware, so if current HT processors did not need the LOCK I could have postponed the solution of the problem ;-) As it is, I am forced to find a solution right now (which is probably a salutary stroke of fate ;-))

> Note that the lock (if data is cached locally) will
> cause a cache line lock. Lock does not always cause a
> processor/memory lock - only when needed.

I don't understand that: the LOCK must guarantee
atomicity even if two processors have the affected
value stored in their respective caches. So IMHO
the processor has to lock the processor/memory in any
case?!

I have tested the affected procedures (which increment/decrement a reference count) on a single processor (P4). The result is that the LOCK version is approx. 16 times slower than the non-LOCK version (since only some instructions of the tested procedures need the LOCK, the performance loss for a single LOCKed vs. non-LOCKed instruction is probably even higher).

Stephan

davids
>> Note that the lock (if data is cached locally) will
>> cause a cache line lock. Lock does not always cause a
>> processor/memory lock - only when needed.

>I don't understand that: the LOCK must guarantee
>atomicity even if two processors have the affected
>value stored in their respective caches. So IMHO
>the processor has to lock the processor/memory in any
>case?!

You are a tiny bit confused. Any operation that involves writing to memory through a cache requires that the processor doing the write have exclusive ownership of the cache line containing that area of memory. This is true whether the operation is locked or not.

So if two processors have the value stored in their respective caches and it's shareable in both caches, a bus operation will be required so that the processor doing the write has exclusive ownership of the memory area. This is not particularly expensive and not nearly as expensive as locking the entire front-side bus for the duration of the read-modify-write operation.

The Pentium and earlier processors did just this: they locked the entire bus for the duration of the locked operation; in fact, that is what the lock prefix originally meant.

Later processors were much smarter. They don't actually lock out the entire front-side bus. They simply acquire the necessary cache state using normal transactions and lock the element in the cache. If there's no attempt by another processor to steal the cache line, the bus impact of a locked operation is no different than the corresponding unlocked operation.

Unfortunately, the impact on the processor pipelines is not the same.

DS

skeil1
> Later processors were much smarter. They don't actually
> lock out the entire front-side bus. They simply acquire
> the necessary cache state using normal transactions and
> lock the element in the cache. If there's no attempt by
> another processor to steal the cache line, the bus
> impact of a locked operation is no different than the
> corresponding unlocked operation.

Thank you for your explanation!

> Unfortunately, the impact on the processor pipelines is
> not the same.

So the performance penalty (factor 16, see my previous
posting) is (just) due to the pipeline?

ClayB
One other point to be made for Hyper-Threading is that the cache is shared between the two logical processors. So, in this case, there is no way the same cache line could be found in two different caches. Thus, locking the cache line for one thread keeps the thread on the other logical processor from accessing it.

Of course, if you have dual physical processors, the same cache line can reside in two different caches, and extra synchronization between them will be needed. There's nothing for the user to do, though.

skeil: How and what are you measuring when you say you're seeing a 16X slowdown? Is this the entire application or just the parts that are doing locked computations? I know added synchronization can increase overall execution time, but I've never heard of anything of this magnitude.

-- clay

skeil1
> skeil: How and what are you measuring when you say
> you're seeing a 16X slowdown? Is this the entire
> application or just the parts that are doing locked
> computations? I know added synchronization can increase
> overall execution time, but I've never heard of
> anything of this magnitude

I've used the attached test program (WindowsXP,
MS DeveloperStudio 7). Result:
100000000 non locked INCs took 328.98 ms
100000000 locked INCs took 5161.91 ms
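
(The attached program itself is not visible in the thread; a rough sketch of such a timing loop, assuming 32-bit MSVC inline assembly and the Win32 QueryPerformanceCounter API, could look like the following. The iteration count matches the figures above; everything else is illustrative.)

    #include <windows.h>
    #include <stdio.h>

    volatile long counter = 0;
    const int LOOPS = 100000000;             // illustrative iteration count

    // Time LOOPS un-locked INCs, then LOOPS locked INCs, in milliseconds.
    // Note: __asm requires a 32-bit MSVC build.
    int main()
    {
        LARGE_INTEGER freq, t0, t1, t2;
        QueryPerformanceFrequency(&freq);

        QueryPerformanceCounter(&t0);
        for (int i = 0; i < LOOPS; ++i) {
            __asm { inc counter }            // no lock prefix
        }
        QueryPerformanceCounter(&t1);
        for (int i = 0; i < LOOPS; ++i) {
            __asm { lock inc counter }       // with lock prefix
        }
        QueryPerformanceCounter(&t2);

        printf("%d non-locked INCs took %.2f ms\n", LOOPS,
               (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart);
        printf("%d locked INCs took %.2f ms\n", LOOPS,
               (t2.QuadPart - t1.QuadPart) * 1000.0 / freq.QuadPart);
        return 0;
    }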

skeil1
> I've used the attached test program (WindowsXP,
> MS DeveloperStudio 7). Result:
> 100000000 non locked INCs took 328.98 ms
> 100000000 locked INCs took 5161.91 ms

I forgot to mention: times were measured on a
single(!) non-HT(!) P4 2.4 GHz.

davids
Yes, the performance penalty is due to pipeline issues. There's a significant cost associated with an operation that requires an empty pipeline both before and after it executes.

skeil1
> Yes, the performance penalty is due to pipeline issues.
> There's a significant cost associated with an operation
> that requires an empty pipeline both before and after it
> executes.

Hm, maybe the answer is obvious, but I cannot see the trick: why is there a need for an empty pipeline for LOCKed operations on a single non-HT processor? To my understanding there should be no difference between LOCKed and non-LOCKed operations on such systems.