Solved:

Anil_M_ · ‎03-29-2016

(Sorry I posted this in Support section by mistake previously)

I am writing a program. It has two threads. One thread both reads and writes and the other thread reads only. The thread that is reading is okay with reading the shared variables either before or after they are modified by the thread that writes. It will be a problem if partially modified data is read by the thread that reads only.

I am using 128 bit variables. Below is some sample code for reading and writing.

Function to write :

void setVar(float a, float b, float c, float d) {
    // sharedArray[4] will not be optimized away. I can guarantee that. It is an array of floats.
    sharedArray[0] = a;
    sharedArray[1] = b;
    sharedArray[2] = c;
    sharedArray[3] = d;

    shared128bitFloat = _mm_loadu_ps((float*)sharedArray);
}

Function to read :

__m128 getVar() {
    return shared128bitFloat;
}

I looked into the assembly and found that the read and write to the shared128bitFloat happens using only one assembly instruction.

The code for writing to the 128-bit float, i.e.:

shared128bitFloat = _mm_loadu_ps((float*)sharedArray);

became:

vmovups xmm0, XMMWORD PTR [rax]
vmovups XMMWORD PTR [rdi+560], xmm0 // (ONLY ONE INSTRUCTION USED FOR MODIFYING THE DATA)

The code to read got inlined and became :

vmovups xmm0, XMMWORD PTR [rdx+560]

Given that one instruction is used for both reading and writing the 128-bit float, is it okay to assume that I do not need any lock to read the data? The reader is fine with reading either the unmodified data or the fully modified data, just not partially modified data.

In essence, I think what I am asking is whether reads and writes of 128-bit variables in memory can be interleaved. I assume the bus width is 64 bits, so if two instructions are being executed on two different CPU cores, one core could be issuing two reads (128/64 = 2) while the other issues two writes, and they could interleave.

Related questions.

Am I right in assuming that a context switch cannot happen while an assembly instruction is only partially executed? A question was raised about the case where both the reader and the writer are executing on the same CPU core and a context switch happens halfway through a read or write. As far as I know, a context switch cannot happen during the execution of micro-instructions. It would be helpful if someone could confirm that.

Can hyper-threading cause an issue here? If the instructions from both the reader and the writer are in the pipeline, could there be issues?

If the above methodology cannot work, is there any way I could make the reads and writes atomic by using some kind of compiler intrinsic to lock the bus? I do not want to use any lock from any library.

Thanks,
Anil Mahmud.

jimdempseyatthecove · ‎03-29-2016

As long as the shared data (128 bits, 256 bits, or 512 bits) does not cross a cache line, and each access is performed by one instruction, you should be OK. You may be required to insert fencing operations to ensure data is flushed when you assume it to be flushed. IOW, to avoid the written data being placed into a temporary register and never being passed to RAM.

Jim Dempsey

McCalpinJohn · ‎03-29-2016

Section 8.1 of Volume 3 of the Intel Architectures SW Developers Manual (document 325384, revision 057) discusses locked and atomic operations. There is no mention of support for atomicity or for explicit locking for any memory operations larger than 64 bits on any platform.

It seems likely that Haswell and newer processors will execute 128-bit-aligned 128-bit loads and stores atomically, but Intel's documentation is very careful to say that this is not guaranteed -- the last paragraph of Section 8.1.1 addresses this topic directly.

Because of the two-sided, asynchronous nature of this interaction, I think you might be forced to use a software lock to protect (serialize) reads and writes to the shared variable.

On to the second question: you are correct that a processor cannot be context-switched in the middle of an instruction (unless that instruction raises an exception, such as a page fault). This does not mean that the results of the instruction are guaranteed to become visible atomically. This is relatively easy to understand with unaligned stores that cross cache-line or page boundaries, but there is no guarantee that the implementation of the store buffers in the processor will result in atomic visibility of 128-bit stores even if the stores are 128-bit aligned.

As discussed in section 8.2, there are strong rules on the order in which the results of separate store instructions become visible, but this is only part of what you need to ensure correctness in your use case. You need to both (1) ensure that the results of a prior write are fully visible before performing the read, and (2) ensure that no new write is started while the read is in progress. I believe that such cases are typically handled with explicit software locks.

I don't know if Intel's transactional memory extensions are applicable to this case. Maybe?
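
One lock-free pattern sometimes used for the two requirements listed above (a read must not overlap a write in progress) is a sequence lock. This is a swapped-in technique, not something proposed in this thread. The sketch below assumes a single writer; the names `Vec4`, `SeqLockVec4`, and `count_torn_reads` are hypothetical. The reader retries whenever the sequence counter is odd or changed during its copy, so it never returns a mix of old and new values.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

struct Vec4 { float v[4]; };

// Hypothetical single-writer sequence lock around four floats.
class SeqLockVec4 {
    std::atomic<std::uint32_t> seq_{0};
    std::atomic<float> data_[4];

public:
    SeqLockVec4() {
        for (auto& d : data_) d.store(0.0f, std::memory_order_relaxed);
    }

    // Assumption: called from exactly ONE writer thread.
    void write(const Vec4& in) {
        const std::uint32_t s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);   // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        for (int i = 0; i < 4; ++i)
            data_[i].store(in.v[i], std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);   // even: write complete
    }

    // Retries until a copy was taken with no write overlapping it.
    Vec4 read() const {
        Vec4 out;
        std::uint32_t before, after;
        do {
            before = seq_.load(std::memory_order_acquire);
            for (int i = 0; i < 4; ++i)
                out.v[i] = data_[i].load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            after = seq_.load(std::memory_order_relaxed);
        } while ((before & 1u) || before != after);
        return out;
    }
};

// Demo: the writer always stores four identical floats, so any read
// whose elements differ would be a torn (partially updated) read.
int count_torn_reads(int iterations) {
    SeqLockVec4 shared;
    std::atomic<bool> done{false};
    int torn = 0;
    std::thread writer([&] {
        for (int i = 1; i <= iterations; ++i) {
            const float f = static_cast<float>(i);
            shared.write(Vec4{{f, f, f, f}});
        }
        done.store(true, std::memory_order_release);
    });
    while (!done.load(std::memory_order_acquire)) {
        const Vec4 r = shared.read();
        if (r.v[0] != r.v[1] || r.v[1] != r.v[2] || r.v[2] != r.v[3]) ++torn;
    }
    writer.join();
    return torn;
}
```

The trade-off is that the reader may spin while the writer is busy, and the scheme does not extend to multiple writers without an additional lock.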

Anil_M_ · ‎03-29-2016

Thank you Jim and John.

If I look at the documentation of the compiler intrinsic _mm_mfence() in the Intel compiler intrinsics guide:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5640,3501,3389&cats=General%20Support

It says :

"Perform a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to this instruction. Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction which follows the fence in program order."

The documentation of MFENCE in Section 8.2.5, "Strengthening or Weakening the Memory-Ordering Model", of Volume 3 of the Intel Architectures SW Developers Manual says:

"Serializes all store and load operations that occurred prior to the MFENCE instruction in the program instruction stream."

Do you know if this will work for all the threads on all the cores, or only for the threads inside one CPU core? (I mean a case where two instructions try to access the same memory address but are being executed on two cores at the same time.)

I am going to rewrite my code like this :

void setVar(float a, float b, float c, float d) {
    // sharedArray[4] will not be optimized away. I can guarantee that. It is an array of floats.
    sharedArray[0] = a;
    sharedArray[1] = b;
    sharedArray[2] = c;
    sharedArray[3] = d;

    shared128bitFloat = _mm_loadu_ps((float*)sharedArray);

    _mm_mfence();
}

I will be glad to know your comments about it (currently I am working with a multicore processor with hyperthreading enabled).

Thank you very much. :)

Anil Mahmud.

jimdempseyatthecove · ‎03-30-2016

>>//sharedArray[4] will not be optimized away. I can guarantee that. It is an array of floats.

You must also make sure it is aligned on a 16-byte boundary (to avoid having the lowest byte of a in a different cache line than the highest byte of d).

John, cmpxchg16b is assured to be atomic with cache line contained data (one example of .gt. 64-bit).

The cache coherency system (inter-processor) has cache-line granularity. Therefore, anything that participates in and obeys the cache coherence rules should see a consistent multi-byte write within a cache line. Note that it is not assured (stated) that devices interacting with memory play by the coherency rules. IOW, for I/O, or if you happen to have remote memory via, say, a PCIe interface, the coherency rules do not apply. SMP systems have a CPU-to-CPU coherency-safe path.

Jim Dempsey
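
The 16-byte alignment point above can be checked at run time. The following is a minimal sketch, assuming a 64-byte cache line (typical for current x86 parts, but an assumption here); `kCacheLine`, `is_aligned`, `crosses_cache_line`, and `Shared` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// True if the address is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// True if the byte range [p, p + size) spans more than one cache line.
inline bool crosses_cache_line(const void* p, std::size_t size) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return (addr / kCacheLine) != ((addr + size - 1) / kCacheLine);
}

// Requesting the alignment in the declaration instead of hoping for it.
struct Shared {
    alignas(16) float sharedArray[4];
};
```

A 16-byte object on a 16-byte boundary can never straddle a 64-byte line, so `alignas(16)` on the declaration is enough to satisfy both checks.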

McCalpinJohn · ‎03-30-2016

It looks like Intel should probably update the discussion in Section 8.1 of Volume 3 of the SWDM to include the CMPXCHG16B case.

If the writer uses the CMPXCHG16B instruction to update the 128-bit value in memory, then the reader should be able to use "ordinary" 128-bit aligned loads to read the data. (Unless I am missing something?)

While it is certainly true that the coherence protocol guarantees that cache lines are moved around the system atomically, this is not the same as guaranteeing that the data within the cache line includes complete updates of stores that are larger than 64 bits. Otherwise Intel would not need to include any of the disclaimers in the final paragraph of Section 8.1.1.

Anil_M_ · ‎03-30-2016

Jim,

I understand why you suggest making sure that sharedArray is aligned. But according to the assembly code, the data from sharedArray is copied to a register and then written to shared128bitFloat, so why should the alignment of sharedArray matter, given that shared128bitFloat is automatically aligned?

I looked at the assembly; the compiler did not assume shared128bitFloat is aligned, so I changed the code to:

void setVar(float a, float b, float c, float d) {
    sharedArray[0] = a;
    sharedArray[1] = b;
    sharedArray[2] = c;
    sharedArray[3] = d;

    // Store through the address of the __m128 variable, not the value cast to a pointer.
    _mm_store_ps((float*)&shared128bitFloat, _mm_loadu_ps((float*)sharedArray));
}

This produced assembly that uses an aligned store instruction:

vmovups xmm0, XMMWORD PTR [rax]
vmovaps XMMWORD PTR [rdi+560], xmm0

whereas previously I got:

vmovups xmm0, XMMWORD PTR [rax]
vmovups XMMWORD PTR [rdi+560], xmm0

I know there is some confusion regarding my question, but as far as alignment is concerned, shouldn't this be enough? It is a bit hard for me to guarantee that sharedArray is aligned.

Thank you,
Anil Mahmud.

andysem · ‎03-31-2016

@Anil M.

_mm_store_ps will crash if the memory is not actually aligned at run time. Just using the instruction is not enough.

Regarding atomicity of reads and writes, the only way I know to do atomic 128-bit reads and writes is by using "lock cmpxchg16b". This instruction still requires the memory to be 16-byte aligned. Others have mentioned that regular 128-bit reads might be atomic on the latest Intel architectures, but I'm not finding confirmation of that in the documentation. Also, there are non-Intel implementations.

Besides the atomicity of the operation, you might want to consider whether memory ordering is important in your case. x86 has a strong memory model, which means that stores won't be reordered with each other and loads won't be reordered with each other. But a load can, I believe, be reordered with an earlier store to a different location, and there are also non-temporal stores (and loads on the latest architectures). Then there is the possibility that the compiler reorders instructions, which is allowed when you use non-atomic intrinsics without compiler fences.
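
To illustrate the ordering concern (as opposed to atomicity), here is a minimal sketch of publishing plain data through a flag with release/acquire semantics, which constrains both compiler and hardware reordering. It covers only a one-shot hand-off, not repeated updates of the same variable, and the names `Payload`, `g_payload`, `g_ready`, `producer`, `consumer`, and `publish_and_sum` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Payload { float v[4]; };

Payload g_payload;                   // plain, non-atomic data
std::atomic<bool> g_ready{false};    // publication flag

void producer(float a, float b, float c, float d) {
    g_payload = Payload{{a, b, c, d}};
    // Release: the payload write cannot be moved after this store,
    // neither by the compiler nor by the CPU.
    g_ready.store(true, std::memory_order_release);
}

Payload consumer() {
    // Acquire: the payload read cannot be moved before this load.
    while (!g_ready.load(std::memory_order_acquire)) {}
    return g_payload;
}

// Runs one producer thread and consumes its payload on the calling thread.
float publish_and_sum() {
    g_ready.store(false);
    std::thread t(producer, 1.0f, 2.0f, 3.0f, 4.0f);
    const Payload p = consumer();
    t.join();
    return p.v[0] + p.v[1] + p.v[2] + p.v[3];
}
```

If the writer keeps modifying the payload after setting the flag, this pattern no longer protects the reader, and a lock or a similar scheme would be needed.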

Anil_M_ · ‎03-31-2016

@Andysem.

Thank you for your answer.

shared128bitFloat is a __m128 variable which means it will automatically be aligned and so will not crash :)

McCalpinJohn · ‎03-31-2016

A variable of type __m128 may or may not have an associated memory address (on the stack), depending on what you are doing with it. If it does have a memory address, then that address will be 128-bit aligned. But that is not the alignment that matters -- you are not storing the __m128 variable to its own address, you are storing it to an address related to the pointer to the sharedArray -- that is the address whose alignment matters.

Using the MOVAPS instruction (generated by _mm_store_ps) will not guarantee that the store is aligned -- it guarantees that the code will trigger an exception if the store is not aligned (rather than executing the store using an unaligned address). This is probably a good thing to do as a safety check if you believe that the code might not operate correctly with unaligned stores (and it probably won't).

I don't think ordering matters here -- the original post said that it is OK if the load gets either the old value or the new value -- it is only a problem if the load gets a partially updated value. So atomicity is the issue, rather than ordering.

The safest approach would be to protect both reads and writes to the shared variable by a lock. It is probably safe to write the data with a locked CMPXCHG16B (to a 16B-aligned address) and to read the data with a 128-bit-aligned 128-bit SSE or AVX load. It is possible that the code will operate correctly using ordinary 128-bit-aligned 128-bit stores and 128-bit-aligned 128-bit loads.
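
The "protect both reads and writes with a lock" option does not require a library mutex. A minimal sketch, assuming a simple spinlock built on `std::atomic_flag` is acceptable for the workload; `Float4` and `LockedFloat4` are hypothetical names, and this is an illustration rather than the code the original poster ended up using.

```cpp
#include <atomic>
#include <cassert>

struct Float4 { float v[4]; };

// Hypothetical wrapper: every access to value_ happens under the spinlock,
// so a reader sees either the old or the new four floats, never a mix.
class LockedFloat4 {
    mutable std::atomic_flag busy_ = ATOMIC_FLAG_INIT;
    Float4 value_{};

    void lock() const {
        while (busy_.test_and_set(std::memory_order_acquire)) {
            // spin until the current holder clears the flag
        }
    }
    void unlock() const { busy_.clear(std::memory_order_release); }

public:
    void set(float a, float b, float c, float d) {
        lock();
        value_ = Float4{{a, b, c, d}};
        unlock();
    }

    Float4 get() const {
        lock();
        const Float4 copy = value_;
        unlock();
        return copy;
    }
};
```

The critical sections are only a 16-byte copy, so contention should be short, but a spinning thread still burns CPU while the other thread holds the flag.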

Anil_M_ · ‎03-31-2016

It worked !!!. Thank all of you very very much :).

Can I read/write shared 128 bit floats without locks ?