Is xend treated as a full memory barrier?

william_l_ · ‎01-13-2017

I've started attempting to learn RTM extensions. The most common examples I can find online are using them to implement a mutex or concurrent lock. Often they are similar to:

#include <immintrin.h>
#include <stdint.h>
__attribute__((target("rtm")))
int64_t attempt_lock( uint64_t *lock)
{
    
    uint64_t status, lock_status;
    int64_t exit;

    if ((status = _xbegin()) == 0)
    {
             lock_status = *lock;
             if(lock_status == 0)
             {
                    *lock = 1;
                     exit = 1;
              }
      } else {
              exit = -1;
      }
      _xend();
      return exit;
}

Now this has little different from a

lock compxchg

But as I understand it the lock prefix carries the same semantics as the mfence instruction (please correct me if I'm wrong). Which is what allows for single atomic instruction to provide concurrent locking as well as memory fencing. This is why we don't see atomic CAS operations in C/C++ compilers issue

lock compxchg
mfence

So the question I have is:

Will the RTM lock provide the same mfence-esque guarantees to cache lines read/written AFTER xend() has been issued AND NOT within the previous operated on RTM region? Or should an RTM based locks be coupled with an mfence instruction when used for locking?

I do understand the best pattern would be to do any modifications WITHIN the RTM code region avoiding this question completely. But I see the lock pattern used very pervasively and I'm wonder if it is the proper solution.

McCalpinJohn · ‎01-13-2017

The first part is easy. Section 8.2.2 of Volume 3 of the Intel Architectures Software Developer's Manual says (emphasis mine):

Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.

[...]

Reads cannot pass earlier LFENCE and MFENCE instructions.

Writes and executions of CLFLUSH and CLFLUSHOPT cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.

MFENCE instructions cannot pass earlier reads, writes, or executions of CLFLUSH and CLFLUSHOPT.

The first line defines the impact of a locked instruction on read and write ordering, while the last 3 lines define what the MFENCE instruction does. Comparing these supports your interpretation of a locked instruction implying a full MFENCE --- with one possible exception. Note that the wording in this section does not say that CLFLUSH and CLFLUSHOPT are fenced by locked instructions.

Fortunately this is clarified in the description of the CLFLUSH and CLFLUSHOPT instructions in Volume 2 of the Intel Architecture Software Developer's Manual (emphasis mine)

CLFLUSH: Executions of the CLFLUSH instruction are ordered with respect to each other and with respect to writes, locked read-modify-write instructions, fence instructions, and executions of CLFLUSHOPT to the same line. (Note 1)

(Note 1: Earlier versions of this manual specified that executions of the CLFLUSH instruction were ordered only by the MFENCE instruction. All processors implementing the CLFLUSH instruction also order it relative to the other operations enumerated above.)

CLFLUSHOPT: Executions of the CLFLUSHOPT instruction are ordered with respect to fence instructions and to locked read-modify-write instructions; [...]

The second part is outside of my area of expertise, but I did find one clear statement at https://software.intel.com/en-us/node/524025 (emphasis mine):

[...] A successfully committed RTM region consisting of an XBEGIN followed by an XEND, even with no memory operations in the RTM region, has the same ordering semantics as a LOCK prefixed instruction. However, if an RTM execution aborts, all memory updates from within the RTM region are discarded and never made visible to any other logical processor.

In the usual case, one will either repeat the RTM region until it succeeds (in which case the memory fence semantics apply) or one will execute the fallback code (which almost certainly has at least one traditional locked instruction which will also force memory semantics).

It may be possible to create a code that gets unexpected results if the fallback code does not contain any instructions that have the effect of creating a memory fence, but it seems unlikely that such a program would be semantically correct. But perhaps not impossible...

The only case I can think of that might fit into this category is one with fallback code that exploits one of the few atomicity guarantees described in section 8.1.1 of Volume 3 of the Intel Architectures Software Developer's Manual. In this case the fallback code might be guaranteed to be correct, but not include the effect of a memory fence. Of course if one could write the fallback code using a guaranteed atomic operation, then one probably did not need the RTM code in the first place, but I can imagine such a situation arising while one is writing test codes to learn about RTM.

Perhaps someone more knowledgeable & experienced can comment on this....