If the following two instructions

inc [eax]
inc [eax]

are executed in a single thread on a single processor, the processor takes care of the atomicity of 'inc', so it is guaranteed that after the two instructions the value at [eax] has been incremented by two, even though internally each inc instruction is broken down into multiple micro-ops and executed in multiple pipeline stages. The same holds if the 'inc's are executed in separate threads on a single processor (the processor guarantees the atomicity of inc). If the 'inc's are executed on separate _physical_ processors, I need to prefix the 'inc' with the 'lock' keyword,

lock inc [eax]

to guarantee the desired behaviour (because shared memory is involved).
But what if the 'inc's are executed on separate _logical_ processors on the same physical processor (Hyper-Threading)? Is the lock prefix necessary here?
I'm no expert in this field (so anyone feel free to correct my errors), but I think you would need the lock prefix for the case where inc [eax] is executed on logical processors within the same physical processor. The basis of Hyper-Threading Technology is the duplication of the architectural state: there are two sets of registers and such, while many of the processor's execution resources are shared or divided in half. Thus, as in the two-separate-processors case, the same address can be held in a register on each logical processor, and both can attempt to increment the same memory location simultaneously.
That all depends on whether there are enough of the right functional units to perform two increment operations simultaneously, of course. If not, the lock may not be needed, since the two increments can't execute within the same processor at the same time. Still, for any two operations that can execute simultaneously on a Hyper-Threading enabled processor, you will need to protect references to the same memory locations.
IMHO your register example is not evidence that the 'lock' is necessary. Otherwise the statements

mov eax, ebx
inc [eax]
inc [ebx]

would produce undefined results even on a single processor. AFAIK the processor keeps track of inter-instruction dependencies and delays the execution of successive instructions if it detects a conflict. So if there is only one such dependency checker inside an HT processor, I wouldn't need the 'lock'.
In the meantime I've written a short program which uses multithreaded un-locked 'inc's and 'dec's to empirically determine whether it is running on a multiprocessor machine. It works as expected on single-processor systems and real multiprocessor systems. But as I have no access to an HT system, I cannot check the program against an HT processor. Anyway, this would only be a one-sided check: if the program says it is running on an MP system I definitely have to use 'lock', but if it says it is running on a single-processor system, that behaviour might change with future versions of HT processors. So I am interested in an official statement on the topic. In a week or two I will be able to check the program on an HT system; I will report the results.
> In the meantime I've written a short program which
> uses multithreaded un-locked 'inc's and 'dec's to
> empirically determine if it runs on a multiprocessor
> machine. It works as expected on single processor
> systems and real multiprocessor systems. But, as
> I have no access to a HT system, I cannot check the
> program against a HT processor.
Is this a very large program? You could post it to the Forum here and maybe some kind soul (or several) with current HT hardware would be willing to run it.
(I see the "Attach Files" button and have always wondered how this feature works. If the code is too complex to put into a message, maybe you could try attaching it as a file?)
The lock prefix is necessary. Nothing stops the memory from 'ping ponging' between the two sibling processors and resulting in only one increment.
Note that testing is a very poor way to determine if this is required. Suppose current processors only have one execution unit capable of performing a read-modify-write, so it tests okay. What happens when the next Intel processor has two such units?
I've got a more authoritative response to your original question. This is from Phil Kerly, one of our Hyper-Threading experts here at Intel.
Phil says (emphasis is his):
You need to prefix with lock. LD and ST are guaranteed to be memory ordered (when necessary), but that is based on the uop [micro-operation] and not on the generating instructions. Both inc instructions could issue their LD at the same time, or at least both before the two STs are issued. The machine will guarantee that the LD and ST from each instruction are dependent and in order, but not between the two different instructions on different logical processors. Each logical processor can retire uops independently of the retirement of the other logical processor, thus committing to architectural state. In-order retirement is only on a per-logical-processor basis. Note that the lock (if the data is cached locally) will cause a cache line lock. Lock does not always cause a processor/memory lock - only when needed.
Phil also echoes the comments from DS, that is, put in the lock now, even if it may not be functionally required, because future processors will change bus timings, configurations, architecture features, etc. etc. What may not be required today may be essential tomorrow.
> Phil also echoes the comments from DS, that is, put in
> the lock now, even if it may not be functionally
> required, because future processors will change bus
> timings, configurations, architecture features, etc.
> etc. What may not be required today may be essential
> tomorrow.
They are right, of course. In this special case the software will be bundled with the hardware, so if current HT processors did not need the LOCK I could have postponed solving the problem ;-) Now I am forced to find a solution right away (which is probably a welcome stroke of fate ;-))
> Note that the lock (if data is cached locally) will
> cause a cache line lock. Lock does not always cause a
> processor/memory lock - only when needed.
I don't understand that: the LOCK must guarantee atomicity even if two processors have the affected value stored in their respective caches. So IMHO the processor has to lock the processor/memory in any case?!
I have tested the affected procedures (which increment/decrement a reference count) on a single processor (P4). The result is that the LOCK version is approx. 16 times slower than the non-LOCK version (and since only some instructions of the tested procedures need the LOCK, the performance loss for a single LOCKed vs. non-LOCKed instruction is probably much higher).
>> Note that the lock (if data is cached locally) will
>> cause a cache line lock. Lock does not always cause a
>> processor/memory lock - only when needed.
> I don't understand that: the LOCK must guarantee
> atomicity even if two processors have the affected
> value stored in their respective caches. So IMHO
> the processor has to lock the processor/memory in any
> case?!
You are a tiny bit confused. Any operation that involves writing to memory through a cache requires that the processor doing the write have exclusive ownership of the cache line including that area of memory. This is true whether an operation is locked or not.
So if two processors have the value stored in their respective caches and it's shareable in both caches, a bus operation will be required so that the processor doing the write has exclusive ownership of the memory area. This is not particularly expensive and not nearly as expensive as locking the entire front-side bus for the duration of the read-modify-write operation.
The Pentium Pro and prior processors did just this. They locked the entire bus for the duration of the locked operation, in fact, this was what the lock prefix meant.
Later processors were much smarter. They don't actually lock out the entire front-side bus. They simply acquire the necessary cache state using normal transactions and lock the element in the cache. If there's no attempt by another processor to steal the cache line, the bus impact of a locked operation is no different than the corresponding unlocked operation.
Unfortunately, the impact on the processor pipelines is not the same.
> Later processors were much smarter. They don't actually
> lock out the entire front-side bus. They simply acquire
> the necessary cache state using normal transactions and
> lock the element in the cache. If there's no attempt by
> another processor to steal the cache line, the bus
> impact of a locked operation is no different than the
> corresponding unlocked operation.
Thank you for your explanation!
> Unfortunately, the impact on the processor pipelines is
> not the same.
So the performance penalty (factor 16, see my previous posting) is (just) due to the pipeline?
One other point to be made for Hyper-Threading is that the cache is shared between the two logical processors. So in this case there is no way the same cache line would be found in two different caches. Thus, locking the cache line for one thread keeps the thread on the other logical processor from accessing it.
Of course, if you have dual physical processors, the same cache line can reside in two different caches and extra synchronization between these will be needed. Nothing for the user to do, though.
skeil: How and what are you measuring when you say you're seeing a 16X slowdown? Is this the entire application or just the parts that are doing locked computations? I know added synchronization can increase overall execution time, but I've never heard of anything of this magnitude.
> skeil: How and what are you measuring when you say
> you're seeing a 16X slowdown? Is this the entire
> application or just the parts that are doing locked
> computations? I know added synchronization can increase
> overall execution time, but I've never heard of
> anything of this magnitude.
I've used the attached test program (Windows XP, MS Developer Studio 7). Result:

100000000 non-locked INCs took 328.98 ms
100000000 locked INCs took 5161.91 ms
> Yes, the performance penalty is due to pipeline issues.
> There's a significant cost associated with an operation
> that requires an empty pipeline both before and after it
> executes.
Hmm, maybe the answer is obvious, but I cannot see the trick: why is there a need for an empty pipeline for LOCKed operations on a single non-HT processor? To my understanding there should be no difference between LOCKed and non-LOCKed operations on such systems.