Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

REP CMPSB, SCASB operation can no longer be used for lock-free algorithms

mugi

Hello

Since 4th Gen Xeon processors added support for Fast Short REP CMPSB and SCASB, lock-free algorithms that rely on REP CMPSB or REP SCASB no longer work correctly. For example, the following code will not work correctly:

/* Example of barrier synchronization */

extern bool _scasb(const char*, char, unsigned int);

#define NR_THREADS  8
char   sem[NR_THREADS] = { 0 };

/* This function posts the end of computation for thread #i */
void post_one(unsigned int i) {

	sem[i] = 1;
}

/* This function waits until all threads have completed their calculations. */
void wait_all() {

	while (!_scasb(sem, 1, NR_THREADS)) { yield(); };
}

; _scasb for windows
;
_scasb	PROC
    push   rdi
    mov	   rdi, rcx
    mov    al, dl
    mov    ecx, r8d
    repz scasb
    setz   al
    pop    rdi
    ret
_scasb	ENDP

If Fast Short REP SCASB is enabled, wait_all() may return before all calculations are complete.
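
For comparison, here is a minimal sketch of the same barrier written without any string instruction, assuming a C11 compiler with <stdatomic.h>. The name all_posted() and the acquire/release orderings are my own choices for illustration, not something taken from Intel documentation, and I have not tested this on the affected hardware. The point is only that each flag is read exactly once per scan, by a separate atomic load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_THREADS 8

/* One flag per thread; zero-initialized because of static storage. */
static atomic_uchar sem[NR_THREADS];

/* Thread #i announces that its computation is finished. */
void post_one(unsigned int i) {
    assert(i < NR_THREADS);
    atomic_store_explicit(&sem[i], 1, memory_order_release);
}

/* Hypothetical replacement for _scasb(sem, 1, NR_THREADS):
   returns true only if every flag was observed as 1.
   Each flag is read exactly once per call, by its own atomic load. */
bool all_posted(void) {
    for (unsigned int i = 0; i < NR_THREADS; i++) {
        if (atomic_load_explicit(&sem[i], memory_order_acquire) != 1)
            return false;
    }
    return true;
}
```

With this, wait_all() would become while (!all_posted()) { yield(); }.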

Here's an example that makes it a little clearer what's going on.

; 2 bytes of data  FF-FF
;
data  DB    0FFH
      DB    0FFH

; Loop until data becomes 00-00
;
Thread1 PROC
    push  rdi
$L1:
    lea   rdi, data
    xor   al, al
    mov   ecx, 2
    repz scasb
    jnz   $L1
    pop   rdi
    ret
Thread1 ENDP

; Repeat changing the first 1 byte of data
;
Thread2 PROC
$L2:
    mov BYTE PTR data, 0
    mov BYTE PTR data, 0FFh
    jmp $L2
Thread2 ENDP

Thread2 only ever writes the first byte, so the value of data can only be FF-FF or 00-FF. Since the second byte is never 00, Thread1 should loop forever. However, when Thread1 and Thread2 run at the same time, Thread1 exits the loop.

Here are the results of the actual execution.

0:000> u Thread1
test!Thread1:
00007ff6`e19e1100 57              push    rdi
00007ff6`e19e1101 488d3df81e0000  lea     rdi,[test!data (00007ff6`e19e3000)]
00007ff6`e19e1108 32c0            xor     al,al
00007ff6`e19e110a b902000000      mov     ecx,2
00007ff6`e19e110f f3ae            repe scas byte ptr [rdi]
00007ff6`e19e1111 75ee            jne     test!Thread1+0x1 (00007ff6`e19e1101)
00007ff6`e19e1113 5f              pop     rdi
00007ff6`e19e1114 c3              ret
0:000> bp 00007ff6`e19e1113
0:000> g
Breakpoint 1 hit
test!Thread1+0x13:
00007ff6`e19e1113 5f              pop     rdi
0:000> r zf
zf=1
0:000> r rcx
rcx=0000000000000001

According to the Intel 64 and IA-32 Architectures Software Developer's Manual, REPZ (REPE) SCASB terminates only when RCX = 0 or ZF = 0. However, as the register dump shows, it actually completed with RCX = 1 and ZF = 1.
Fast Short REP SCASB probably loads the same byte twice.
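
To make the expected behavior concrete, here is a small software model of REPE SCASB (DF = 0) in C, as I understand the manual's description: compare, decrement the count, then stop if the count reached 0 or the compare cleared ZF. The struct and function names are made up for this sketch. When the count is 0 on entry, real hardware leaves ZF unchanged; the model simply reports zf = false in that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State after the modeled instruction: remaining count (RCX) and ZF. */
struct scas_result {
    size_t rcx;
    bool   zf;
};

/* Hypothetical reference model of REPE SCASB with DF = 0.
   Each byte is read exactly once, in ascending address order. */
struct scas_result model_repe_scasb(const unsigned char *p,
                                    unsigned char al, size_t count) {
    struct scas_result r = { count, false };
    assert(p != NULL || count == 0);
    while (r.rcx != 0) {
        r.zf = (*p++ == al);   /* SCASB: compare AL with the byte, set ZF */
        r.rcx--;               /* REP: decrement the count */
        if (!r.zf)             /* REPE: stop on the first mismatch */
            break;
    }
    return r;
}
```

Under this model, scanning FF-FF for 00 ends with RCX = 1, ZF = 0, and scanning 00-FF ends with RCX = 0, ZF = 0. No snapshot of data that Thread2 can produce gives RCX = 1, ZF = 1, which is the state the debugger shows.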

This looks like a bug in Sapphire Rapids to me. What do you think?

Regards,
