Why unsynchronized read/write behaviour is undefined in multithreading

Speigel · ‎05-08-2009

I have read in many books and article aboutprotecting a critical section when two threads try to access the critical section simultaneously. Here atleast one threadmust bewriting to critical section (say a volatile integer variable int_X).We knowthat a non-atomiccritical section is not protected then behaviour is undefined. I want to know why the behaviour is undefined?

Iam curious to know what happens at the instruction (processor, cache)level when a 'Thread B' tries toread un-protected int_X and at the same time 'Thread A' isincrementingint_X? What happens when 'Thread A' isis halted by the processor when itsin mid of executing 'load','inc' and 'store'instruction to icrement int_X and then 'Thread B' is run which tries to read the value of int_X?

Thank you in advance.

Dmitry_Vyukov · ‎05-08-2009

Quoting - sameer_87

I have read in many books and article aboutprotecting a critical section when two threads try to access the critical section simultaneously. Here atleast one threadmust bewriting to critical section (say a volatile integer variable int_X).We knowthat a non-atomiccritical section is not protected then behaviour is undefined. I want to know why the behaviour is undefined?

Iam curious to know what happens at the instruction (processor, cache)level when a 'Thread B' tries toread un-protected int_X and at the same time 'Thread A' isincrementingint_X? What happens when 'Thread A' isis halted by the processor when itsin mid of executing 'load','inc' and 'store'instruction to icrement int_X and then 'Thread B' is run which tries to read the value of int_X?

First of all, un-synchronized access is undefined behavior only on the level of abstract specifications (POSIX, Win32, C++09, etc).
Then, some abstract specifications do actually define behavior in the presence of un-synchronized accesses (mostly specifications which just have to - Java, CLI, etc).
When we're on hardware level no such thing as undefined behavior. Behavior of un-synchronized accesses is defined by most architectures, you may get exception, corrupted value, new or old value, etc, but this is indeed defined. For example you may see IA-32/Intel-64 Software Developer Manuals.
However you may take advantage of the defined behavior iff you are on assembly level. Assume you are writing on C++, and use objects with virtual functions, if object's vptr will be corrupted and you are making virtual function call, well, it's undefined behavior, big a life, down to deletion of critical files on the HDD (if vptr end up pointing to DeleteFile).

Regarding your question, it's 200% architecture-dependent. However, most architectures, provide at least word-sized atomic reads and writes, if you are reading/writing more than a word, then you usually get just some combination of old and new data (byte/word-wise). Though future hardware may, for example, raise exceptions on un-synchronized accesses, just like current hardware raises exceptions on unaligned accesses, integer overflows, etc.

anthony_williams · ‎05-08-2009

Quoting - sameer_87

I have read in many books and article aboutprotecting a critical section when two threads try to access the critical section simultaneously. Here atleast one threadmust bewriting to critical section (say a volatile integer variable int_X).We knowthat a non-atomiccritical section is not protected then behaviour is undefined. I want to know why the behaviour is undefined?

The behaviour is undefined because it's not something you should be doing, and it gives implementors (of the compiler, library, operating system and processor) the biggest amount of leeway as to what actually happens. This allows implementors to do whatever is most efficient for the case where the access is correctly synchronized, without having to worry about what the unsynchronized cases.

Quoting - sameer_87

Iam curious to know what happens at the instruction (processor, cache)level when a 'Thread B' tries toread un-protected int_X and at the same time 'Thread A' isincrementingint_X? What happens when 'Thread A' isis halted by the processor when itsin mid of executing 'load','inc' and 'store'instruction to icrement int_X and then 'Thread B' is run which tries to read the value of int_X?

For Intel x86 and x64 CPUs, all aligned reads and writes are atomic; unaligned reads and writes are done as a series of aligned reads and writes. A LOCKed instruction on a suitably-aligned location performs a full read-modify-write cycle as a single atomic operation. Every atomic read returns a value that was previously stored to the location by some processor.

If thread A is performing an unLOCKed INC instruction then it does it as three steps, as you indicate: load/inc/store. The value stored is therefore one more than the value read. If another processor modifies the memory location in between the load and store then the value written by the other processor will be overwritten by thread A when it does its store. If thread B is running on the same processor core then it cannot interrupt an INC instruction (even a non-LOCKed one), but it might interrupt an increment operation from a high-level language that is split into separate assembly instructions, in which case the result is typically the same: reads will see either the value before or the value after, and changes from one thread might overwrite changes from another. If the accesses are not aligned then each part of the access may be before or after any modification by another thread, so you might see half an update.

On other platforms the behaviour might be different --- the processor might trap on unsynchronized accesses that conflict, for example.

Speigel · ‎05-08-2009

Quoting - anthony_williams

The behaviour is undefined because it's not something you should be doing, and it gives implementors (of the compiler, library, operating system and processor) the biggest amount of leeway as to what actually happens. This allows implementors to do whatever is most efficient for the case where the access is correctly synchronized, without having to worry about what the unsynchronized cases.

Quoting - sameer_87

Iam curious to know what happens at the instruction (processor, cache)level when a 'Thread B' tries toread un-protected int_X and at the same time 'Thread A' isincrementingint_X? What happens when 'Thread A' isis halted by the processor when itsin mid of executing 'load','inc' and 'store'instruction to icrement int_X and then 'Thread B' is run which tries to read the value of int_X?

For Intel x86 and x64 CPUs, all aligned reads and writes are atomic; unaligned reads and writes are done as a series of aligned reads and writes. A LOCKed instruction on a suitably-aligned location performs a full read-modify-write cycle as a single atomic operation. Every atomic read returns a value that was previously stored to the location by some processor.

If thread A is performing an unLOCKed INC instruction then it does it as three steps, as you indicate: load/inc/store. The value stored is therefore one more than the value read. If another processor modifies the memory location in between the load and store then the value written by the other processor will be overwritten by thread A when it does its store. If thread B is running on the same processor core then it cannot interrupt an INC instruction (even a non-LOCKed one), but it might interrupt an increment operation from a high-level language that is split into separate assembly instructions, in which case the result is typically the same: reads will see either the value before or the value after, and changes from one thread might overwrite changes from another. If the accesses are not aligned then each part of the access may be before or after any modification by another thread, so you might see half an update.

On other platforms the behaviour might be different --- the processor might trap on unsynchronized accesses that conflict, for example.

Thank youDmitriy and Anthony for your answers:

I am working on x86 dual core system running Win2003 in C++.int_X is a member variable. Supposing thatboth the threads in the above scenario runs on the same processor. Now also suppose that 'Thread A'is put to sleep due to some hardware interrupt just afterthe 'load' instruction (or after 'inc' instruction)when its incrementing int_X and then 'Thread B' is made to run by the processor which reads the value of int_X. What happens in this scenario, Willthread B read an old value of int_X here, assume that memory is aligned. I have also read somewhere that when the thread is put to sleep (pause) in such manner andresumed by the processor then the full set of instructions will be executed from the start, which means it willstart executingfrom the 'load' instruction.What happens here in terms of cache validity? Will the cache not havetwo records showing different valuesat the same memory location?

Can you also please point me tothe article/manual which you mentioned --"Every atomic read returns a value that was previously stored to the location by some processor."Whatever books I followed, always mentioned this behaviour as undefined, none of them mentioned aboutreading an old value or an half update. I'll try to get you the article which sounds like this behaviour may cause crash.

anthony_williams · ‎05-08-2009

Quoting - sameer_87

Thank youDmitriy and Anthony for your answers:

I am working on x86 dual core system running Win2003 in C++.int_X is a member variable. Supposing thatboth the threads in the above scenario runs on the same processor. Now also suppose that 'Thread A'is put to sleep due to some hardware interrupt just afterthe 'load' instruction (or after 'inc' instruction)when its incrementing int_X and then 'Thread B' is made to run by the processor which reads the value of int_X. What happens in this scenario, Willthread B read an old value of int_X here, assume that memory is aligned. I have also read somewhere that when the thread is put to sleep (pause) in such manner andresumed by the processor then the full set of instructions will be executed from the start, which means it willstart executingfrom the 'load' instruction.What happens here in terms of cache validity? Will the cache not havetwo records showing different valuesat the same memory location?

Can you also please point me tothe article/manual which you mentioned --"Every atomic read returns a value that was previously stored to the location by some processor." Whatever books I followed, always mentioned this behaviour as undefined, none of them mentioned aboutreading an old value or an half update. I'll try to get you the article which sounds like this behaviour may cause crash.

The reference for the Intel x86 architecture memory ordering is the IA32 Software Developer's Manual Volume 3A (Intel doc number 253668), section 7.2.

There are two parts to the memory ordering: the compiler's ordering of the instructions from the written code, and the processor's ordering of the memory accesses from the assembly instructions.

On a single core, if the store or INC instruction from thread A occurs before the thread is interrupted (execution cannot be interrupted mid instruction) and the load instruction in thread B running on the same core occurs after the interruption then the load will see the value stored by thread A. If the load in thread B occured before the store by thread A on the same processor then it will see the old value. This is nothing special --- from the processor's point of view there is no such thing as threads, just instruction streams. Interrupts cause the processor to change the stream of instructions to an OS interrupt handler, which may then change the resume address, but it's just a stream of instructions. On a single CPU core instructions happen in a definite order.

Note that this refers to the ordering of the actual CPU instructions. This ordering may be different to the ordering of source code operations due to the compiler's optimizer. In particular, the compiler may make assumptions about what may or may not have changed which might be incorrect if there are unsynchronized accesses from multiple threads.

The CPU does nothing special when it resumes normal execution after an interrupt. It just restores the register states and carries on. In particular it does NOT re-run earlier instructions.

From the point of view of the memory model, a dual-core CPU is two distinct processors. However, they may share cache lines, in which case data is better kept in sync between the two cores, and they are less likely to read stale values. If threads A and B are on separate cores you can make no assumptions in general about whether B will read the new value of int_X or the old one.

With respect to cache validity, the core/processor that did the write will have a "modified" version of the cache line holding the value. All other copies of that cache line will be marked "stale". Any other core/processor that reads that cache line will now need to resync it with the CPU that did the write and/or main memory. Two processors cannot both have a "modified" version of the same cache line.

TimP · ‎05-08-2009

Quoting - sameer_87

"Every atomic read returns a value that was previously stored to the location by some processor." Whatever books I followed, always mentioned this behaviour as undefined, none of them mentioned aboutreading an old value or an half update. I'll try to get you the article which sounds like this behaviour may cause crash.

Atomic reads and writes guarantee against half updates, and against losing an update. They don't guarantee the order of events, in case there is a race condition among threads. All threading systems provide means to specify atomic updates or critical regions, so as to prevent lost updates. There frequently is value in atomic updates with unspecified order. It would be a programmer responsibility not to depend on the order for sufficient reliability.
For example, OpenMP sum reduction guarantees that no updates to the global sum are lost, except that the result may be subject to changes in roundoff due to order changing from run to run or with changing number of threads. Some applications even implement a user option to choose higher performance with OpenMP reduction or repeatable results without it.
Compiler optimization may produce locally atomic updates, preventing a threading error from causing a crash during QA runs. There may or may not be a performance effect for threads being stalled briefly by objects implicitly locked by other threads. Analysis by a tool such as Intel Thread Checker is the only practical way to catch some such errors. Unfortunately, Thread Checker appears to be incompatible with the recommended practice of setting OpenMP default(none) so as to require all variables to be designated as shared or private. It's also incompatible with code size optimizations which frequently are required to get an application linked on Itanium (a problem which is not nearly as serious for Intel64 platforms).
Sun compilers include some compile time analysis for catching possible race conditions, and may silently disable parallelization, unless you ask for the diagnoses to be displayed, even when there is nothing serious enough for Intel Thread Checker to display.

Speigel · ‎05-18-2009

Quoting - anthony_williams

The reference for the Intel x86 architecture memory ordering is the IA32 Software Developer's Manual Volume 3A (Intel doc number 253668), section 7.2.

There are two parts to the memory ordering: the compiler's ordering of the instructions from the written code, and the processor's ordering of the memory accesses from the assembly instructions.

On a single core, if the store or INC instruction from thread A occurs before the thread is interrupted (execution cannot be interrupted mid instruction) and the load instruction in thread B running on the same core occurs after the interruption then the load will see the value stored by thread A. If the load in thread B occured before the store by thread A on the same processor then it will see the old value. This is nothing special --- from the processor's point of view there is no such thing as threads, just instruction streams. Interrupts cause the processor to change the stream of instructions to an OS interrupt handler, which may then change the resume address, but it's just a stream of instructions. On a single CPU core instructions happen in a definite order.

Note that this refers to the ordering of the actual CPU instructions. This ordering may be different to the ordering of source code operations due to the compiler's optimizer. In particular, the compiler may make assumptions about what may or may not have changed which might be incorrect if there are unsynchronized accesses from multiple threads.

The CPU does nothing special when it resumes normal execution after an interrupt. It just restores the register states and carries on. In particular it does NOT re-run earlier instructions.

From the point of view of the memory model, a dual-core CPU is two distinct processors. However, they may share cache lines, in which case data is better kept in sync between the two cores, and they are less likely to read stale values. If threads A and B are on separate cores you can make no assumptions in general about whether B will read the new value of int_X or the old one.

With respect to cache validity, the core/processor that did the write will have a "modified" version of the cache line holding the value. All other copies of that cache line will be marked "stale". Any other core/processor that reads that cache line will now need to resync it with the CPU that did the write and/or main memory. Two processors cannot both have a "modified" version of the same cache line.

Thanks.

Are you suggesting that declaring a variable 'Volatile' is a performance hit on the processors (dual core) which share the same cache because data is kept in sync between the two cores. Also, this means that declaring a variable volatile is important only when each CPU has it own cache. Also, does it mean that reader/writer threading issues with integral data types is not a problem in shared cache architectures and its thread safe as long as variable's memory is aligned.

Can you suggest a case where data will be stale on a shared cache systems.

How does it work on the architectures where each CPU has its own cache memory.

Here is the post I mentioned earlier where they introduce something called "Window of Death".http://rentzsch.com/papers/atomicity
This article is quite interesting and it referes to different PC architectures.

robert-reed · ‎05-18-2009

Quoting - Speigel

Are you suggesting that declaring a variable 'Volatile' is a performance hit on the processors (dual core) which share the same cache because data is kept in sync between the two cores. Also, this means that declaring a variable volatile is important only when each CPU has it own cache. Also, does it mean that reader/writer threading issues with integral data types is not a problem in shared cache architectures and its thread safe as long as variable's memory is aligned.

Well, there's atomicity and there's volatility, which is not the same thing-- this blog from Arch Robison tries to make the differenceclear. Volatility affects the optimizations available to the compiler for improving performance of the code but has little meaning when referring to the interactions of a pair of HW threads within a multi-processor system. The questions raised above suggest that you might benefit from a little background reading on cache coherence.