In section 7.2.2 of the IA-32 System Programming Guide (Volume 3) it says that reads can be carried out speculatively and in any order, and reads can pass buffered writes.
It then says (7.2.4) that to avoid this you need to use the LOCK prefix (or lfence/sfence/mfence, or a few other options). But there seems to be much debate about this, and the descriptions are sometimes unclear. For instance, it says that the LOCK prefix will "force stronger ordering on the processor" and that "Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory". That leaves a lot of ambiguity.
So, what I'd like to know is what is the real deal? If I am doing lockless programming on x86 (IA-32) do I need to use hardware memory barriers to prevent reordering of reads?
Is there a test program that I can run on my 4-way machine (two hyperthreaded Xeon processors) that demonstrates read reordering? It seems like a concrete example of when the lock prefix is needed, preferably a test that will occasionally fail without it, would be the best way to get clarification on this extremely subtle topic.
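To make the question concrete, here is a rough sketch of the kind of test I have in mind, based on the "reads can pass buffered writes" statement. It is not taken from the manual. The function name `count_both_zero` and all variable names are made up for illustration, and it uses C++11 `std::atomic` rather than hand-written assembly. My assumption is that on x86 the relaxed store and load typically compile to plain `mov` instructions, while the `seq_cst` store typically compiles to an `xchg` (implicitly locked) or a `mov` followed by `mfence`. Each of two threads writes 1 to its own flag and then reads the other thread's flag. If neither read could pass the earlier write, at least one thread would have to see a 1.

```cpp
#include <atomic>
#include <functional>
#include <thread>

// Runs the two-thread "store, then load the other variable" test
// `iterations` times and returns how often BOTH loads returned 0.
// Order is the memory order used for the store and the load under test.
template <std::memory_order Order>
long count_both_zero(int iterations) {
    std::atomic<int> x{0}, y{0};
    std::atomic<int> round{0};     // main thread publishes the round number
    std::atomic<int> finished{0};  // each worker increments when it is done
    int r1 = 0, r2 = 0;            // values read by worker 1 and worker 2

    auto worker = [&](std::atomic<int>& mine, std::atomic<int>& other,
                      int& result) {
        for (int i = 1; i <= iterations; ++i) {
            // Spin until the main thread starts round i.
            while (round.load(std::memory_order_acquire) != i) { }
            mine.store(1, Order);        // write my flag ...
            result = other.load(Order);  // ... then read the other flag
            finished.fetch_add(1, std::memory_order_release);
        }
    };

    std::thread t1(worker, std::ref(x), std::ref(y), std::ref(r1));
    std::thread t2(worker, std::ref(y), std::ref(x), std::ref(r2));

    long both_zero = 0;
    for (int i = 1; i <= iterations; ++i) {
        // Reset state, then release both workers for this round.
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);
        finished.store(0, std::memory_order_relaxed);
        round.store(i, std::memory_order_release);

        // Wait for both workers; the acquire makes r1 and r2 visible here.
        while (finished.load(std::memory_order_acquire) != 2) {
            std::this_thread::yield();
        }
        if (r1 == 0 && r2 == 0) ++both_zero;
    }

    t1.join();
    t2.join();
    return both_zero;
}
```

Calling `count_both_zero<std::memory_order_seq_cst>(n)` from `main` must return 0 under the C++ memory model. My expectation is that `count_both_zero<std::memory_order_relaxed>(n)` with a large `n` (say a million) would occasionally return a nonzero count on a multiprocessor machine. I have not confirmed that, and a nonzero count could in principle also come from the compiler reordering the relaxed operations rather than from the processor. It needs to be built with thread support (for example `-pthread` with GCC).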