Hello
Since 4th Gen Xeon (Sapphire Rapids) added support for Fast Short REP CMPSB and SCASB, lock-free algorithms that rely on REP CMPSB or REP SCASB no longer work correctly. For example, the following code fails:
/* Example of barrier synchronization */
#include <stdbool.h>

extern bool _scasb(const char*, char, unsigned int);
extern void yield(void); /* assumed: platform-specific yield, defined elsewhere */

#define NR_THREADS 8
char sem[NR_THREADS] = { 0 };

/* This function posts the end of computation for thread #i */
void post_one(unsigned int i) {
    sem[i] = 1;
}

/* This function waits until all threads have completed their calculations. */
void wait_all(void) {
    while (!_scasb(sem, 1, NR_THREADS)) { yield(); }
}
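; _scasb for windows
;
_scasb PROC
        push rdi
        mov rdi, rcx        ; rdi = buffer (1st argument)
        mov al, dl          ; al = byte to compare (2nd argument)
        mov ecx, r8d        ; rcx = length (3rd argument)
        repz scasb          ; scan while each byte equals al
        setz al             ; return ZF: 1 if every byte matched
        pop rdi
        ret
_scasb ENDP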
If Fast Short REP SCASB is enabled, wait_all() may return before all calculations are complete.
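For reference, the result that wait_all() expects from _scasb can be written in plain C. This is a minimal sketch, not the original code: the name scan_all_equal is hypothetical, and it is only meant to pin down the expected return value for a non-zero length, not to serve as a tested workaround on affected hardware.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for _scasb (name is an assumption).
 * Returns true only if all n bytes at p equal c.
 * Each byte is read exactly once through a volatile pointer,
 * so the compiler may not merge or repeat the loads. */
bool scan_all_equal(const volatile char *p, char c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (p[i] != c) {
            return false;   /* corresponds to REPE SCASB stopping with ZF = 0 */
        }
    }
    return true;            /* count exhausted and every byte matched */
}
```

With this contract, wait_all() should only leave its loop once every sem[i] has been set to 1 by post_one().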
Here's an example that makes it a little clearer what's going on.
; 2 bytes of data FF-FF
;
data    DB 0FFH
        DB 0FFH

; Loop until data becomes 00-00
;
Thread1 PROC
        push rdi
$L1:
        lea rdi, data
        xor al, al
        mov ecx, 2
        repz scasb
        jnz $L1
        pop rdi
        ret
Thread1 ENDP

; Repeat changing the first 1 byte of data
;
Thread2 PROC
$L2:
        mov BYTE PTR data, 0
        mov BYTE PTR data, 0FFh
        jmp $L2
Thread2 ENDP
The value of data can only ever be FF-FF or 00-FF, never 00-00, so Thread1 should loop forever. However, when Thread1 and Thread2 run at the same time, Thread1 exits the loop.
Here are the results of the actual execution.
0:000> u Thread1
test!Thread1:
00007ff6`e19e1100 57 push rdi
00007ff6`e19e1101 488d3df81e0000 lea rdi,[test!data (00007ff6`e19e3000)]
00007ff6`e19e1108 32c0 xor al,al
00007ff6`e19e110a b902000000 mov ecx,2
00007ff6`e19e110f f3ae repe scas byte ptr [rdi]
00007ff6`e19e1111 75ee jne test!Thread1+0x1 (00007ff6`e19e1101)
00007ff6`e19e1113 5f pop rdi
00007ff6`e19e1114 c3 ret
0:000> bp 00007ff6`e19e1113
0:000> g
Breakpoint 1 hit
test!Thread1+0x13:
00007ff6`e19e1113 5f pop rdi
0:000> r zf
zf=1
0:000> r rcx
rcx=0000000000000001
According to the Intel 64 and IA-32 Architectures Software Developer's Manual, REPZ SCASB terminates when either RCX = 0 or ZF = 0. Here, however, it terminated with RCX > 0 and ZF = 1.
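The documented termination rule for REPE/REPZ SCASB (stop when RCX reaches 0 or a comparison clears ZF) can be sketched as a small C model. This is only an illustration: the struct and function names are hypothetical, the direction flag is assumed to be clear, and the caller is assumed to pass a non-zero count.

```c
#include <stdbool.h>
#include <stdint.h>

/* Final register state after the scan (illustrative model only). */
struct scas_result {
    uint64_t rcx;   /* remaining count */
    bool zf;        /* ZF from the last comparison */
};

/* Model of REPE SCASB with DF = 0. Assumes rcx > 0 on entry; with
 * rcx == 0 the real instruction leaves ZF unchanged, which this
 * model does not track. */
struct scas_result repe_scasb_model(const unsigned char *p,
                                    unsigned char al, uint64_t rcx) {
    struct scas_result r = { rcx, false };
    while (r.rcx != 0) {
        unsigned char b = *p++;   /* exactly one load per iteration */
        r.zf = (b == al);
        r.rcx--;
        if (!r.zf) {
            break;                /* mismatch ends the scan early */
        }
    }
    return r;
}
```

For AL = 0 and a count of 2, this model ends with RCX = 1, ZF = 0 on FF-FF and with RCX = 0, ZF = 0 on 00-FF. It never ends with RCX > 0 and ZF = 1, which is the state shown in the debugger output.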
Fast Short REP SCASB probably loads the same byte twice, so the two loads can see different values if another thread writes to that byte in between.
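To check that this guess is at least consistent with the debugger output, here is a toy C model of a scan that reads the stopping byte twice. It is purely hypothetical: nothing here describes Intel's actual implementation, and the names (scan_state, double_load_scan, the first/second snapshots) are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

struct scan_state {
    uint64_t rcx;   /* remaining count */
    bool zf;        /* resulting ZF */
};

/* Toy model of the guessed double load (hypothetical behavior).
 * `first` is memory as seen by the load that decides where the scan
 * stops; `second` is memory as seen by a later load that produces ZF.
 * Assumes count > 0. */
struct scan_state double_load_scan(const unsigned char *first,
                                   const unsigned char *second,
                                   unsigned char al, uint64_t count) {
    uint64_t i = 0;
    /* Pass 1: stop at the first mismatch, or at the last byte. */
    while (i < count - 1 && first[i] == al) {
        i++;
    }
    struct scan_state s;
    s.rcx = count - (i + 1);        /* iterations left after stopping */
    /* Pass 2: ZF comes from re-reading the byte at the stopping index. */
    s.zf = (second[i] == al);
    return s;
}
```

If the first load sees FF-FF and the second sees 00-FF (Thread2 wrote 0 in between), the model returns RCX = 1 and ZF = 1 for AL = 0 and a count of 2, matching the observed registers. When both snapshots are identical, it returns the same result as a single-load scan.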
This looks like a bug in Sapphire Rapids to me. What do you think?
Regards,