Generated code for std::atomic load/store operations

Poeter__Manuel · ‎11-08-2012

Hi,

I wrote a small test program to analyze the code generated for load/store operations on the new std::atomic type with different memory orders.

[cpp]
#include <atomic>
#include <cstddef>

std::atomic<size_t> v(42);

__declspec(noinline) size_t load_relaxed() { return v.load(std::memory_order_relaxed); }
__declspec(noinline) size_t load_acquire() { return v.load(std::memory_order_acquire); }
__declspec(noinline) size_t load_consume() { return v.load(std::memory_order_consume); }
__declspec(noinline) size_t load_seq_cst() { return v.load(std::memory_order_seq_cst); }

__declspec(noinline) void store_relaxed(size_t arg) { v.store(arg, std::memory_order_relaxed); }
__declspec(noinline) void store_release(size_t arg) { v.store(arg, std::memory_order_release); }
__declspec(noinline) void store_seq_cst(size_t arg) { v.store(arg, std::memory_order_seq_cst); }

int main(int argc, char* argv[])
{
   size_t x = 0;

   x += load_relaxed();
   x += load_acquire();
   x += load_consume();
   x += load_seq_cst();

   store_relaxed(x);
   store_release(x);
   store_seq_cst(x);

   return (int)x;
}
[/cpp]

The result with the Intel Composer XE 2013 looks as follows:

with Intel atomic header (__USE_INTEL_ATOMICS defined)

[plain]v.load(std::memory_order_relaxed);
lea rax,[v (013FE33020h)]
mov rax,qword ptr [rax][/plain]

[plain]v.load(std::memory_order_acquire);
lea rax,[v (013FE33020h)]
mov rax,qword ptr [rax]
lfence[/plain]

[plain]v.load(std::memory_order_seq_cst);
lea rax,[v (013FE33020h)]
mfence
mov rax,qword ptr [rax]
mfence[/plain]

[plain]v.store(arg, std::memory_order_relaxed);
lea rdx,[v (013FE33020h)]
mov qword ptr [rdx],rax[/plain]

[plain]v.store(arg, std::memory_order_release);
lea rdx,[v (013FE33020h)]
sfence
mov qword ptr [rdx],rax[/plain]

[plain]v.store(arg, std::memory_order_seq_cst);
lea rdx,[v (013FE33020h)]
xchg rax,qword ptr [rdx][/plain]

with Microsoft atomic header

[plain]v.load(std::memory_order_relaxed);
v.load(std::memory_order_acquire);
v.load(std::memory_order_seq_cst);
lea rdi,[v (013FA93020h)]
mov rax,qword ptr [rdi]
retry:
mov rdx,rax
or rdx,rcx
lock cmpxchg qword ptr [rdi],rdx
jne retry (013FA91081h)[/plain]

[plain]v.store(arg, std::memory_order_relaxed);
v.store(arg, std::memory_order_release);
mov qword ptr [v (013FA93020h)],rcx[/plain]

[plain]v.store(arg, std::memory_order_seq_cst);
lea rcx,[v (013FA93020h)]
xchg rax,qword ptr [rcx][/plain]

The generated code for the atomic loads with the Microsoft header is something I have to report to Microsoft (this implementation is a catastrophe from a performance point of view).
What I don't understand is why the generated code with the Intel header contains all kinds of lfence/sfence instructions.
In particular: why does v.store(arg, std::memory_order_release) require an sfence before the write operation? Write operations are guaranteed to be executed in program order anyway, right?
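
To make the question concrete, here is a minimal sketch of the pattern a release store is meant to support: a writer fills in plain data and then publishes it with a release store, and a reader that sees the flag through an acquire load must also see the data. The names payload, ready, producer, consumer and run_message_passing are made up for this illustration and are not part of the test program above. As far as I understand the x86 memory model, ordinary stores are not reordered with other ordinary stores, so I would expect a plain mov to be sufficient for the release store here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical names for illustration only.
static std::size_t payload = 0;         // plain, non-atomic data
static std::atomic<bool> ready(false);  // flag that publishes the payload

// Writer: fill in the payload, then publish it with a release store.
static void producer(std::size_t value)
{
    payload = value;
    ready.store(true, std::memory_order_release);
}

// Reader: spin on acquire loads until the flag is set, then read the payload.
static std::size_t consumer()
{
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Runs one producer/consumer round and returns what the consumer observed.
std::size_t run_message_passing(std::size_t value)
{
    ready.store(false, std::memory_order_relaxed);
    std::size_t seen = 0;
    std::thread reader([&seen] { seen = consumer(); });
    std::thread writer([value] { producer(value); });
    writer.join();
    reader.join();
    assert(ready.load(std::memory_order_relaxed));  // flag is set once both threads are done
    return seen;
}
```

The release/acquire pair is what guarantees that the reader observes the payload written before the flag; the question is only which instructions the compiler needs to emit on x86 to provide that guarantee.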

Thanks,
Manuel

Melanie_B_Intel · ‎11-12-2012

Hi,

Here's a partial response to your post. I don't have an answer about the use of the fence instructions; I hope to have more information about that soon.

You should not define the __USE_INTEL_ATOMICS symbol. The correct value of this symbol is determined according to the version of Visual Studio you are using. For vs2012, this symbol is defined to be 0, which forces use of Microsoft's atomic header.

Microsoft introduced the atomic header in Visual Studio 2012. Intel has been shipping its own support for atomic operations and an Intel-supplied header file. If you are using atomics with Visual Studio 2012, then you need to use the Microsoft definition of atomic.

In my experiments, the Microsoft-generated code uses a different locking mechanism than what you quoted above. (I'm using Microsoft (R) C/C++ Optimizing Compiler Version 17.00.40825.2 for x64. cl -c /FAs) Here's what I see:

[plain]; 25 : size_t x = 0;
mov QWORD PTR x$[rsp], 0
; 26 :
; 27 : x += load_relaxed();
call ?load_relaxed@@YA_KXZ ; load_relaxed
mov rcx, QWORD PTR x$[rsp]
add rcx, rax
mov rax, rcx
mov QWORD PTR x$[rsp], rax
; 28 :[/plain]

...Later on there's a call to do the atomic store.

As I understand it, Microsoft support uses a lock object. If there's an atomic object which is accessed by both Intel- and Microsoft-generated code, then the locking mechanism needs to be the same to ensure correct access. If you are using vs2012, you're going to see calls to e.g. load_relaxed, and calls to do atomic store, even in the Intel-compiled code.
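
Whether a particular header really falls back to a lock object for a given type can be checked at run time with the standard is_lock_free() member of std::atomic. The following is only a sketch: the helper names atomic_size_t_is_lock_free and require_lock_free_size_t are invented for this example, and the result depends on the toolchain and target.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical helper: reports whether std::atomic<size_t> is implemented
// without a lock on the current toolchain and target.
bool atomic_size_t_is_lock_free()
{
    std::atomic<std::size_t> probe(0);
    return probe.is_lock_free();
}

// Hypothetical startup check: aborts in debug builds if a lock would be needed.
void require_lock_free_size_t()
{
    assert(atomic_size_t_is_lock_free());
}
```

If this returns true, the implementation uses hardware atomic instructions for that type rather than a separate lock object, which matters when code built with different headers touches the same atomic variable.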

Poeter__Manuel · ‎11-13-2012

Hi,

I'm using Visual Studio 2012 (v11.0.50727.1) with Microsoft (R) C/C++ Optimizing Compiler Version 17.00.50727.1 for x64. I defined __USE_INTEL_ATOMICS to force the compiler to use the Intel atomic header, because I wanted to compare the generated code for both implementations.

Apparently my description of what I was doing and what I wanted to analyze wasn't clear enough. I compiled the sample program from my first posting with full optimization, so the compiler inlined the .store/.load calls. I declared my own load_* and store_* functions with noinline to prevent them from being inlined. This makes it easier for me to analyze the generated code, since there are no reorderings/interleavings or other optimizations between two calls to load_*/store_*.

The generated assembler code I posted was from the actual load/store calls inside my small helper functions. In the main function there are of course the calls to load_*/store_*, but inside these functions everything is inlined, resulting in the code posted above.
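
For anyone who wants to repeat this with another compiler, here is a sketch of the same noinline wrapper technique in a slightly more portable form. NOINLINE, counter and the *_probe function names are hypothetical and chosen only for this example; the non-MSVC branch assumes a compiler that understands the GCC-style attribute.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// NOINLINE is a hypothetical macro name, not something from either atomic header.
#if defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#else
#  define NOINLINE __attribute__((noinline))  // assumes GCC-compatible attribute syntax
#endif

static std::atomic<std::size_t> counter(42);

// Each wrapper contains exactly one atomic operation, so its disassembly
// shows only the code generated for that operation and memory order.
NOINLINE std::size_t load_acquire_probe()
{
    return counter.load(std::memory_order_acquire);
}

NOINLINE void store_release_probe(std::size_t arg)
{
    counter.store(arg, std::memory_order_release);
}

// Round-trips a value through both wrappers.
std::size_t round_trip(std::size_t value)
{
    store_release_probe(value);
    return load_acquire_probe();
}

// Simple sanity check that the wrappers behave like a plain store and load.
void self_check()
{
    assert(round_trip(7) == 7);
}
```

Compiling this with full optimization and looking at the disassembly of the two probe functions should show the instructions for each operation in isolation.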

Melanie_B_Intel · ‎11-13-2012

Thanks for the further information. The main point I wanted to make is that users shouldn't define the __USE_INTEL_ATOMICS symbol.

Melanie_B_Intel · ‎11-26-2012

We're still investigating your question about fence instructions; DPD200238776 is tracking the issue. Thanks for bringing it up.