Re: memory ordering model

tadayuki · ‎08-27-2008

I'd like to know if memory reads are done in-order or not.

According to the Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3A, Nov 2007, subsection 7.2.2:
"Reads can be carried out speculatively and in any order."

But the newer revision, July 2008, says "Reads are not reordered with other reads".

I would assume the newer revision is correct, but I'd like to have confirmation.

Also, does this mean that any instruction that reads from memory can be considered a load memory barrier (i.e. can be used instead of lfence)?

Intel_C_Intel · ‎08-28-2008

tadayuki:

I'd like to know if memory reads are done in-order or not.

Mostly in order, but not always.
More precisely, loads are done (retired) in order, but the perceived order (from the point of view of other threads) can differ from program order.

tadayuki:

According to the Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3A, Nov 2007, subsection 7.2.2:
"Reads can be carried out speculatively and in any order."

But the newer revision, July 2008, says "Reads are not reordered with other reads".

I would assume the newer revision is correct, but I'd like to have confirmation.

Loads can be *executed* out of order, but they are always *retired* in order. And until a load is retired, if there is some cache-coherence traffic affecting the loaded value, the value is "patched".
So basically both definitions are correct :)

tadayuki:

Also, does this mean that any instruction that reads from memory can be considered a load memory barrier (i.e. can be used instead of lfence)?

Well, I think yes. But this fact is of little practical use, because lfence is useless on x86 for ordinary loads.
In some situations the perceived order of load execution can differ from program order, and neither lfence nor another load in between will help here. Only mfence will help.
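
For illustration, the situation where only mfence helps is presumably the store-followed-by-load case: a store can still be sitting in the store buffer when a later load from a different location is performed. Below is a minimal sketch using C11 atomics instead of raw assembly. The variables x and y and all sb_* names are hypothetical, and the seq_cst fence is what compilers typically turn into mfence or a locked instruction on x86.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared variables for a store-buffering (Dekker-style) test. */
static atomic_int x, y;

void sb_reset(void)
{
    atomic_store(&x, 0);
    atomic_store(&y, 0);
}

/* Thread 0: store x, then load y. Without the fence the load may be
   performed while the store is still in the store buffer. */
int sb_thread0(void)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst); /* full barrier */
    return atomic_load_explicit(&y, memory_order_relaxed);
}

/* Thread 1: the mirror image, store y and then load x. */
int sb_thread1(void)
{
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst); /* full barrier */
    return atomic_load_explicit(&x, memory_order_relaxed);
}

/* Single-threaded sanity check only: it runs the two bodies back to back
   and does not exercise the race itself. */
void sb_sequential_check(void)
{
    sb_reset();
    int r0 = sb_thread0();
    int r1 = sb_thread1();
    assert(r0 == 0 && r1 == 1);
}
```

When sb_thread0 and sb_thread1 run on two different threads, the fences rule out the outcome where both functions return 0. With the fences removed that outcome becomes possible, and putting an lfence in their place would not prevent it.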

If you provide some example code, then I will be able to say exactly whether reorderings are possible in that code.

Dmitriy V'jukov

Intel_C_Intel · ‎08-28-2008

I've changed my Login ID from 'dvyukov' back to 'randomizer'. It seems that ISN has serious problems with Login ID changes. When you change your Login ID, you basically lose all your previous posts. And if someone then changes his Login ID to your old Login ID, he will effectively steal all your previous posts. LOL!
So it's better not to play with Login ID changes on ISN :)

Dmitriy V'jukov

Dmitry_Vyukov · ‎08-28-2008

I've already changed my Login ID to 'randomizer', and already explicitly signed in as 'randomizer', and my previous post still came from the 'dvyukov' account. LOL
I hope this time the post will be from 'randomizer'. Let's see :)

Dmitriy V'jukov

Dmitry_Vyukov · ‎08-28-2008

WOW! A miracle happened! I got all my posts back! Incredible!

tadayuki · ‎08-28-2008

Thanks for the explanation.

dvyukov:
If you provide some example code, then I will be able to say exactly whether reorderings are possible in that code.

I'm working on a ticket-based spinlock implementation. (BTW, I'm new to IA.) Basically, it has a ticket counter and a service counter, and if the ticket you hold matches the service counter, you get the lock.

spin_lock() atomically increments the ticket counter and keeps the original value as my ticket. Then it checks whether the service counter matches my ticket:

    cmpl svc_ctr, %eax /* %eax == my_tkt */
    je critical_section

In the critical section, you don't want to use values that were speculatively loaded before you got the lock. If reads were not done in order, I would think an lfence is needed after the cmpl, but it looks like there's no need for it.

Also, spin_unlock() increments the service counter:

addl $1, svc_ctr

An mfence would be required before the 'add' if reads and writes were not each done in order. (On PowerPC, we use the 'sync' instruction.) But 'add' seems to work as a full memory barrier if I'm reading the manual correctly.

Would you agree with my assessment?

Thanks.

Dmitry_Vyukov · ‎08-28-2008

This algorithm will work w/o any additional fences. On x86 every load is load-acquire. And every store is store-release. So nothing can hoist above load, and nothing can sink below store.
Final load of svc_ctr will provide all necessary synchronization wrt lock acquisition. And increment (store) of svc_ctr will provide all necessary synchronization wrt lock release.

Btw, don't forget to insert 'pause' instruction into spin-loop, if spin-loop is active (i.e. not sched_yield()).
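
To make the above concrete, here is a minimal sketch of such a ticket lock in C11 atomics, assuming ordinary write-back memory. The struct, the function names and the cpu_relax macro are hypothetical; tkt_ctr is an assumed name for the ticket counter, and svc_ctr follows the code posted earlier. The acquire load and the release store are plain mov instructions on x86, so no lfence or mfence appears.

```c
#include <assert.h>
#include <stdatomic.h>

/* 'pause' on x86 (GCC/Clang syntax); a no-op elsewhere. */
#if defined(__i386__) || defined(__x86_64__)
#define cpu_relax() __asm__ __volatile__("pause")
#else
#define cpu_relax() ((void)0)
#endif

typedef struct {
    atomic_uint tkt_ctr; /* next ticket to hand out */
    atomic_uint svc_ctr; /* ticket currently being served */
} ticket_lock;

void ticket_lock_init(ticket_lock *l)
{
    atomic_init(&l->tkt_ctr, 0u);
    atomic_init(&l->svc_ctr, 0u);
}

void ticket_lock_acquire(ticket_lock *l)
{
    /* Atomically take a ticket (typically a lock xadd on x86). */
    unsigned my_tkt =
        atomic_fetch_add_explicit(&l->tkt_ctr, 1u, memory_order_relaxed);

    /* Load-acquire: nothing in the critical section can hoist above it. */
    while (atomic_load_explicit(&l->svc_ctr, memory_order_acquire) != my_tkt)
        cpu_relax();
}

void ticket_lock_release(ticket_lock *l)
{
    /* Only the lock holder writes svc_ctr, so a relaxed read is enough. */
    unsigned cur = atomic_load_explicit(&l->svc_ctr, memory_order_relaxed);

    /* Debug check: some ticket must be outstanding, i.e. the lock is held. */
    assert(cur != atomic_load_explicit(&l->tkt_ctr, memory_order_relaxed));

    /* Store-release: nothing in the critical section can sink below it. */
    atomic_store_explicit(&l->svc_ctr, cur + 1u, memory_order_release);
}
```

Because only the holder updates svc_ctr, the release can be a load followed by a store rather than an atomic read-modify-write.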

tadayuki · ‎08-28-2008

randomizer:
This algorithm will work w/o any additional fences. On x86 every load is load-acquire. And every store is store-release. So nothing can hoist above load, and nothing can sink below store.
Final load of svc_ctr will provide all necessary synchronization wrt lock acquisition. And increment (store) of svc_ctr will provide all necessary synchronization wrt lock release.

Thanks. That's good to know.

randomizer:

Btw, don't forget to insert 'pause' instruction into spin-loop, if spin-loop is active (i.e. not sched_yield()).

Thanks for the reminder. I did add 'pause' in the loop.
(Though, I'm not so sure about what it does.)

Regards,

Dmitry_Vyukov · ‎08-29-2008

tadayuki:

randomizer:

Btw, don't forget to insert 'pause' instruction into spin-loop, if spin-loop is active (i.e. not sched_yield()).

Thanks for the reminder. I did add 'pause' in the loop.
(Though, I'm not so sure about what it does.)

If you have a tight active spin-loop:

    while (some_var)
        pause;

Without the pause instruction, a superscalar processor can issue a lot of load requests, i.e. there will be more than one load request in flight at the same time. That is absolutely unnecessary and degrades performance. With the pause instruction there will be at most one active load request at any given time.
Also, on Hyper-Threaded processors the pause instruction informs the core that the current thread is just spinning and not doing useful work, so the core can give priority to the other thread(s).
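
As a rough compilable version of that loop, here is a sketch in C. The names cpu_pause and spin_while_set are hypothetical, and the max_spins bound is an addition of this sketch (so that a caller could fall back to something like sched_yield()), not part of the loop above. The inline assembly uses GCC/Clang syntax and compiles to nothing on non-x86 targets.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Emit the x86 'pause' hint; a no-op on other architectures. */
static inline void cpu_pause(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("pause");
#endif
}

/* Spin while *some_var is non-zero, giving up after max_spins polls.
   Returns true if the variable was observed to be zero. */
bool spin_while_set(atomic_int *some_var, unsigned long max_spins)
{
    assert(some_var != NULL);
    for (unsigned long i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(some_var, memory_order_acquire) == 0)
            return true;
        cpu_pause(); /* one hint per poll of the variable */
    }
    return false; /* still set: the caller may yield or block instead */
}
```

A caller would typically loop on spin_while_set and yield the CPU whenever it returns false.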

Dmitry_Vyukov · ‎08-29-2008

tadayuki:
randomizer:
This algorithm will work w/o any additional fences. On x86 every load is load-acquire. And every store is store-release. So nothing can hoist above load, and nothing can sink below store.
Final load of svc_ctr will provide all necessary synchronization wrt lock acquisition. And increment (store) of svc_ctr will provide all necessary synchronization wrt lock release.

Thanks. That's good to know.

Hmmm.... I'm thinking about incorporating x86 memory model into Relacy Race Detector:
http://software.intel.com/en-us/forums//topic/60050

Initially I was thinking that if I provide the extremely relaxed C++0x memory model, then the user can map any other memory model onto the C++0x memory model, and thus test an algorithm against, for example, the x86 memory model.

The problem here is that the user has to understand precisely both the C++0x MM and the target MM (x86) in order to create a correct mapping. So if I provide direct support for the x86 MM, this can be of help.
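
To show what such a hand-written mapping might look like, here is a sketch using C11 atomics, which use the same memory orders as C++0x. All x86_* names are hypothetical wrappers invented for this example and have nothing to do with Relacy's API. The mapping is an approximation: acquire/release is somewhat weaker than what x86 actually guarantees, so a checker using it may report reorderings that real x86 hardware would never show.

```c
#include <assert.h>
#include <stdatomic.h>

/* Ordinary x86 load (mov from memory): modeled as load-acquire. */
static inline int x86_load(atomic_int *p)
{
    return atomic_load_explicit(p, memory_order_acquire);
}

/* Ordinary x86 store (mov to memory): modeled as store-release. */
static inline void x86_store(atomic_int *p, int v)
{
    atomic_store_explicit(p, v, memory_order_release);
}

/* lock-prefixed read-modify-write (e.g. lock xadd): modeled as seq_cst. */
static inline int x86_lock_xadd(atomic_int *p, int v)
{
    return atomic_fetch_add_explicit(p, v, memory_order_seq_cst);
}

/* mfence: modeled as a seq_cst fence. */
static inline void x86_mfence(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Single-threaded smoke test of the wrappers; no concurrency is exercised. */
void x86_mapping_smoke_test(void)
{
    atomic_int v;
    atomic_init(&v, 0);
    x86_store(&v, 5);
    x86_mfence();
    assert(x86_lock_xadd(&v, 2) == 5); /* returns the previous value */
    assert(x86_load(&v) == 7);
}
```

Getting even these four lines of mapping right requires knowing both models, which is the difficulty described above.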

tadayuki · ‎08-29-2008

randomizer:

If you have a tight active spin-loop:

    while (some_var)
        pause;

Without the pause instruction, a superscalar processor can issue a lot of load requests, i.e. there will be more than one load request in flight at the same time. That is absolutely unnecessary and degrades performance. With the pause instruction there will be at most one active load request at any given time.
Also, on Hyper-Threaded processors the pause instruction informs the core that the current thread is just spinning and not doing useful work, so the core can give priority to the other thread(s).

I see. I did know it helps Hyper-Threading but we are not using it. So I wasn't sure if 'pause' is really necessary. But it sounds like it's also good for non-Hyper-Threading core.

Thanks.

Dmitry_Vyukov · ‎08-29-2008

tadayuki:

I see. I did know it helps Hyper-Threading but we are not using it. So I wasn't sure if 'pause' is really necessary. But it sounds like it's also good for non-Hyper-Threading core.

I think that at least it won't worsen the situation. And the official Intel guidelines say "you must use 'pause' in spin-loops". Maybe in future Intel processors 'pause' will have some additional useful effects for non-HT cores.