2 lock instructions in sequence, behavior

Hello Intel community,

I have code where I need to atomically increment a value, but before doing so I need to check whether that value has reached a certain limit. In C with -O2, this function

int func(int num) {
    __sync_val_compare_and_swap(&num,10,0);
    return __sync_add_and_fetch(&num,1);
}
compiles as follows:
func(int):
1:       mov     eax, 10
2:       xor     edx, edx
3:       mov     DWORD PTR [rsp-4], edi
4:       lock cmpxchg    DWORD PTR [rsp-4], edx
5:       mov     eax, 1
6:       lock xadd       DWORD PTR [rsp-4], eax
7:       add     eax, 1
8:       ret
Question number 1:
Is it possible that under high load another thread (thread B) might execute its
   lock xadd       DWORD PTR [rsp-4], eax
(line 6) while thread A is still on line 5, executing the mov?
Question number 2:
Suppose I rewrite this code as follows:
1:       mov     eax, 10
2:       xor     edx, edx
3:       mov     DWORD PTR [rsp-4], edi
4:       mov     ebx, 1
5:       lock cmpxchg    DWORD PTR [rsp-4], edx
6:       lock xadd       DWORD PTR [rsp-4], ebx
7:       add     ebx, 1
8:       ret
Does the CPU now guarantee that no other thread can execute a locked instruction between lines 5 and 6, or is it simply a matter of whoever gets there first?
 
Food for thought:
Let's say I am working on a NUMA system: thread A executes on CPU 0, while the memory region addressed in the function resides on the node of CPU 4. Thread B executes the same function. Is it possible that, after thread A performs its locked operation on line 5 and releases the lock, thread B performs its line-6 locked operation before thread A can, considering B is closer to the memory region than A?
 
Thanks in advance,
David

3 Replies
New Contributor III (Accepted Solution)

Having back-to-back locked instructions doesn't create a critical section containing all of these instructions. So it's possible that one or more threads modify the shared memory location between the first locked instruction and the second locked instruction. Whether that shared location is mapped to a remote NUMA node is irrelevant.

If I understand your concern correctly, the problem you're worried about is that one thread may execute __sync_val_compare_and_swap(&num,10,0); then num gets incremented back up to the maximum value (10, I take it) by other threads; and then the first thread executes __sync_add_and_fetch(&num,1), leaving the variable with a value larger than the maximum.

One way to avoid needing a critical section is to use a maximum value that is a power of two, if possible, such as 16 instead of 10. Increment num by 2^32/2^4 = 2^28 instead of 1. After 16 such increments, the value wraps back to 0, so you don't have to explicitly check and reset the variable.


Beginner

Your assumption about my concern is absolutely correct, and the solution is elegant as well. I was so focused on the technicality of it that I completely forgot I could have solved it with simple math.

While we are on the topic, though: is it possible to create a critical section containing multiple instructions that locks out memory access for the entire section?

I found _xbegin and _xend for transactions. My understanding is that everything between the begin and end instructions is committed atomically, but the commit can fail, so you also need some sort of fallback code. Is it possible to do it in a queued style? I guess I am approaching full-blown mutex semantics here.

Edit: I accidentally hit Post instead of Reply.

New Contributor III

I think using transactional memory is a good idea in this case, but it only works on the processors that support it and on which it's enabled, taking into consideration the TAA microcode update mitigation, which applies to, among other processors, the 2nd and 3rd generations of Intel Xeon Scalable Processors. Also, as you pointed out, it's still necessary to have a nontransactional fallback path in which a normal software lock has to be acquired.

I don't understand what you mean by "queued style."
