In section 7.2.2 of the IA-32 System Programming Guide (Volume 3) it says that reads can be carried out speculatively and in any order, and reads can pass buffered writes.
It then says (7.2.4) that when you need to avoid this you must use the LOCK prefix (or lfence/sfence/mfence, or a few other options). But there seems to be much debate about this, and the descriptions are sometimes unclear. For instance, it says that the LOCK prefix will "force stronger ordering on the processor" and "Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory". Words like "stronger" and "typically" leave a lot of ambiguity.
So, what I'd like to know is what is the real deal? If I am doing lockless programming on x86 (IA-32) do I need to use hardware memory barriers to prevent reordering of reads?
Is there a test program that I can run on my 4-way machine (two hyperthreaded Xeon processors) that demonstrates read reordering? It seems like a concrete example of when the LOCK prefix is needed, preferably a test that will occasionally fail without it, would be the best way to get clarification on this extremely subtle topic.
It's not entirely clear and, for various reasons, probably can't be clarified. From a programming point of view, you should probably assume the more relaxed memory model interpretation to be on the safe side. You might be able to assume ordering of dependent loads, as the Linux kernel does, but then you should have code that runs different versions based on CPU model, as the Linux kernel also does, in case there is a processor model that doesn't order dependent loads. Note that there's compiler reordering of memory accesses to be considered as well. Search comp.arch and comp.programming.threads for "memory model" to find discussions of this topic. You won't get a definitive answer, but at least you'll know why.
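As for a test program, here is a rough sketch of the classic "store buffer" litmus test, assuming a C++11 compiler with std::atomic and std::thread. It checks one specific case: a write followed by a read from a different location, which is the case "reads can pass buffered writes" describes. It says nothing about read-read ordering. Each thread writes 1 to its own variable and then reads the other's; if both reads return 0, a read passed the other thread's buffered write. The name count_both_zero and the round-based handshake are my own invention for illustration, not taken from the Intel manual or any library.

```cpp
#include <atomic>
#include <functional>
#include <thread>

// Runs the store-buffer litmus test `iterations` times and returns how many
// rounds ended with BOTH threads reading 0. `Order` is the memory order used
// for the test's own stores and loads (hypothetical helper, for illustration).
template <std::memory_order Order>
int count_both_zero(int iterations) {
    std::atomic<int> x{0}, y{0};          // the two shared locations
    std::atomic<int> start{0}, done{0};   // handshake with the main thread
    int r1 = 0, r2 = 0;                   // what each worker read

    auto worker = [&](std::atomic<int>& mine, std::atomic<int>& other,
                      int& result) {
        for (int i = 1; i <= iterations; ++i) {
            // Wait until the main thread opens round i.
            while (start.load(std::memory_order_acquire) != i)
                std::this_thread::yield();
            mine.store(1, Order);         // write my flag ...
            result = other.load(Order);   // ... then read the other flag
            done.fetch_add(1, std::memory_order_release);
        }
    };

    std::thread t1(worker, std::ref(x), std::ref(y), std::ref(r1));
    std::thread t2(worker, std::ref(y), std::ref(x), std::ref(r2));

    int hits = 0;
    for (int i = 1; i <= iterations; ++i) {
        // Reset the round's state, then release both workers.
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);
        done.store(0, std::memory_order_relaxed);
        start.store(i, std::memory_order_release);
        // Wait for both workers to finish the round.
        while (done.load(std::memory_order_acquire) != 2)
            std::this_thread::yield();
        if (r1 == 0 && r2 == 0)
            ++hits;                       // each read passed the other's write
    }
    t1.join();
    t2.join();
    return hits;
}
```

Call count_both_zero<std::memory_order_relaxed>(n) and count_both_zero<std::memory_order_seq_cst>(n) from main with a large n and compare the two counts. With relaxed ordering the stores and loads normally compile to plain moves on x86, so a nonzero count is possible, though how often it shows up depends on the machine and on timing, and a zero count proves nothing. With seq_cst the C++ memory model guarantees the count is 0; on x86, compilers usually get that by emitting a locked instruction (xchg) or an mfence for the store, which is exactly the kind of barrier you are asking about. Using atomics also stops the compiler from reordering the two accesses, which takes care of the compiler reordering issue mentioned above.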