Extra overhead in atomic operations generated by Intel ICPC 2021.2

fatvlad1744 · ‎05-24-2021

Greetings,

I was experimenting with thread synchronization and tried several approaches. I expected volatile int to perform like std::atomic<int> accessed with relaxed memory ordering, but std::atomic<int> appears to be slower. The cause is a redundant mov/lea instruction that the optimizer does not elide.

For example, here's the code snippet:

#include <atomic>

void load_atomic(std::atomic<int>& v, int& dest) {
    dest = v.load(std::memory_order_relaxed);
}

void load_intrin(int& v, int& dest) {
    dest = __atomic_load_n(&v, __ATOMIC_RELAXED);
}

void load_volatile(volatile int& v, int& dest) {
    dest = v;
}
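One way to sanity-check the expectation that all three functions can compile to a plain load is to ask the library whether std::atomic<int> is lock-free and has the same size as int. The following is a minimal sketch, assuming a C++17 or later standard library on x86-64; the helper name atomic_int_looks_like_plain_int is hypothetical and not part of the original test case.

```cpp
#include <atomic>
#include <cassert>

// If this holds, a relaxed load never needs a lock or a library call.
static_assert(std::atomic<int>::is_always_lock_free,
              "std::atomic<int> is expected to be lock-free on this target");

// Hypothetical helper (not from the original post): reports whether
// std::atomic<int> has the same size as int and is at least as aligned,
// which is what lets a relaxed load be a single movl on x86-64.
bool atomic_int_looks_like_plain_int() {
    std::atomic<int> v{0};
    assert(v.is_lock_free());  // runtime counterpart of the static_assert
    return sizeof(std::atomic<int>) == sizeof(int) &&
           alignof(std::atomic<int>) >= alignof(int);
}
```

If the helper returns true, any remaining difference between the three functions comes from code generation rather than from the atomic type itself.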

I generate the assembly with the following command line:
$CXX -std=c++20 -O3 -S main2.cpp

g++ (GCC) 10.2.0 (stripped)

_Z11load_atomicRSt6atomicIiERi:
	movl	(%rdi), %eax
	movl	%eax, (%rsi)
	ret
_Z11load_intrinRiS_:
	movl	(%rdi), %eax
	movl	%eax, (%rsi)
	ret
_Z13load_volatileRViRi:
	movl	(%rdi), %eax
	movl	%eax, (%rsi)
	ret

ICX 2021.2 (stripped)

_Z11load_atomicRSt6atomicIiERi:
	movl	(%rdi), %eax
	movl	%eax, (%rsi)
	retq
_Z11load_intrinRiS_:
	movl	(%rdi), %eax
	movl	%eax, (%rsi)
	retq
_Z13load_volatileRViRi:
	movl	(%rdi), %eax
	movl	%eax, (%rsi)
	retq

ICPC 2021.2 (stripped)

_Z11load_atomicRSt6atomicIiERi:
        movq      %rdi, %rax                                    #4.14
        movl      (%rax), %eax                                  #4.14
        movl      %eax, (%rsi)                                  #4.5
        ret                                                     #5.1
_Z11load_intrinRiS_:
        movq      %rdi, %rax                                    #8.12
        movl      (%rax), %eax                                  #8.12
        movl      %eax, (%rsi)                                  #8.5
        ret                                                     #9.1
_Z13load_volatileRViRi:
        movl      (%rdi), %eax                                  #12.12
        movl      %eax, (%rsi)                                  #12.5
        ret

As you can see, only ICPC emits an extra mov (movq %rdi, %rax) that could be folded into the subsequent load.

Why does this happen, and can we expect it to be fixed in a new version?
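To check whether the extra instruction is measurable at all, the three functions can be timed in a loop. This is a rough sketch, not a rigorous benchmark: the LoadTiming struct and the time_loads helper are hypothetical names introduced here, and the noinline attribute (a GNU extension, assumed to be accepted by all three compilers on Linux) is added so that each call executes the out-of-line code shown in the listings.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// The three loads from the post, kept out of line so the loop below
// really calls the functions whose assembly is listed above.
__attribute__((noinline)) void load_atomic(std::atomic<int>& v, int& dest) {
    dest = v.load(std::memory_order_relaxed);
}
__attribute__((noinline)) void load_intrin(int& v, int& dest) {
    dest = __atomic_load_n(&v, __ATOMIC_RELAXED);
}
__attribute__((noinline)) void load_volatile(volatile int& v, int& dest) {
    dest = v;
}

// Hypothetical result type: the sum of all loaded values (so the loads
// cannot be discarded) and the average time per call in nanoseconds.
struct LoadTiming {
    std::int64_t checksum;
    double ns_per_call;
};

// Hypothetical helper: calls `load(src, dest)` `iters` times and times it.
template <class T, class F>
LoadTiming time_loads(F load, T& src, long iters) {
    assert(iters > 0);
    std::int64_t sum = 0;
    int dest = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) {
        load(src, dest);
        sum += dest;
    }
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return {sum, ns / static_cast<double>(iters)};
}
```

A caller would declare a std::atomic<int>, an int, and a volatile int, pass each to time_loads with the matching function and a large iteration count, and compare ns_per_call across compilers. Call overhead will dominate such a short function, so small differences should be read with caution.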
I've attached full IR files to the post.

Thanks in advance.

VidyalathaB_Intel · ‎05-25-2021

Hi,

Thanks for reaching out to us.

We are looking into this issue internally and will get back to you soon.

Regards,

Vidya.

Viet_H_Intel · ‎05-25-2021

I've reported this issue to our compiler developers.

Thanks,

Viet_H_Intel · ‎02-28-2022

You may already know this, but the Intel Classic Compiler will enter "Legacy Product Support" mode, signaling the end of regular updates. Please refer to the article below for more details.

https://www.intel.com/content/www/us/en/developer/articles/technical/adoption-of-llvm-complete-icx.h...

For that reason, the developers do not plan to fix this in the Classic compiler. Can you migrate to icx/icpx and let us know whether we can close this case?

Thanks,

Viet

Viet_H_Intel · ‎03-07-2022

Please migrate to icx. We are going to close this as "won't fix" in the C++ Classic compiler. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.

Thanks,