Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

How to solve a bug caused by simultaneous misaligned memory accesses

gangti
Beginner
  • First question

Recently, I encountered a bug that happens only rarely.

**Environment:

1. The address of a pointer (called pMemory) is misaligned.

2. Two threads simultaneously access pMemory.

3. Our program runs on a server with 8 CPUs.

4. The original value of pMemory is 0xFFFF FFFF.

**Operation Sequence:

1. One thread read the value of pMemory while the other thread modified it.

The read and write instructions are both MOV.

2. The first thread read the lower part of pMemory first, that is, 0xFFFF.

3. The second thread modified pMemory from 0xFFFF FFFF to 0x02de 2c68.

4. The first thread then read the higher part of pMemory, that is, 0x02de,

and so the first thread ended up reading pMemory as 0x02de ffff, which is an invalid pointer.

Currently we are discussing the way to solve the problem.

Do you have any suggestion?

I don't have much time, so would you please reply as soon as possible?

BTW, our program is a network program, so its structures are packed to one-byte alignment with compiler options such as /Zp1.
It's impossible for us to change /Zp1 to natural alignment, given the risks and workload involved.
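For readers unfamiliar with packing, here is a minimal, hypothetical illustration (not code from our program) of how one-byte packing such as /Zp1 leaves a pointer member misaligned; the struct and field names are invented.

[cpp]#include <stdio.h>
#include <stddef.h>

#pragma pack(push, 1)            /* same effect as /Zp1, but limited to this struct */
struct NetRecord {
    unsigned short msgType;      /* 2 bytes */
    void          *pMemory;      /* lands at offset 2, i.e. not naturally aligned */
};
#pragma pack(pop)

int main(void)
{
    printf("offsetof(pMemory) = %u\n", (unsigned)offsetof(struct NetRecord, pMemory));
    return 0;
}[/cpp]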


  • Second question

Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1, Section 8.1.1, "Guaranteed Atomic Operations":

"The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided."


Would you please explain in detail the mechanism behind:

"provide bus control signals that permit external memory subsystems to make split accesses atomic"?

Dmitry_Vyukov
Valued Contributor I
There are 2 ways to solve the problem. The first is to make the data aligned. The second is to use LOCKed instructions (that's what "provide bus control signals that permit external memory subsystems to make split accesses atomic" is about). I think it's enough to modify the data with a LOCK XCHG instruction, and on the reader side you can leave a plain MOV. But keep in mind that on QPI-based systems (Core i7), LOCKed instructions on unaligned data can be *very* slow (on the order of 5000 cycles).
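To make the write side concrete, here is a minimal sketch of a LOCKed store, assuming a 32-bit MSVC build; the function and variable names are mine, not from the poster's program. Note that XCHG with a memory operand asserts the bus lock even without an explicit LOCK prefix; the prefix only documents the intent.

[cpp]/* Sketch of the LOCKed-write idea, 32-bit MSVC inline assembly assumed;
   names are hypothetical. */
void PublishPointer(void **slot, void *newValue)
{
    __asm {
        mov edx, slot         ; edx = address of the (possibly unaligned) pointer
        mov eax, newValue     ; eax = value to publish
        lock xchg [edx], eax  ; atomic even if [edx] is split across a cache line
    }
}[/cpp]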
jimdempseyatthecove
Honored Contributor III
In your pointer-fetch code, I will guess that you have a test for pointer == 0xFFFFFFFF as an indication that the pointer is not yet set.

You can change this test to check whether either the high word or the low word equals 0xFFFF.
Any valid pointer you insert into this DWORD will likely not have 0xFFFF in either word. Allocations are generally aligned to 8 bytes (or more), so the low word should never be 0xFFFF. Also, 0xFFFF in the high word points to the last few pages of virtual memory (OS space in Windows, or potentially stack addresses in *ux). You should look at what you place into this pointer to confirm my assumption.
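A small C sketch of that test (mine, not from the thread), assuming a 32-bit build and that the caller has already read the 32-bit pointer slot into raw:

[cpp]/* Returns nonzero if the value read from the shared pointer slot should be
   treated as "not yet set". A valid pointer never has 0xFFFF in either
   16-bit half, so a torn read of a half-written slot is also rejected. */
int PointerLooksUnset(unsigned long raw)
{
    return (raw & 0xFFFFu) == 0xFFFFu || (raw >> 16) == 0xFFFFu;
}[/cpp]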

Aligning the pointer to a DWORD address (32-bit system) or QWORD address (64-bit system), as Dmitriy suggests, would assure that writes occur in one operation (except possibly when the pointer's address resides in I/O space).

Jim Dempsey


gangti
Beginner

I really appreciate your suggestions.

Following your suggestions, I have drafted a solution to tackle the problem. Is it correct or not?

**On the writer's side

[bash]push ax
mov ax, XXXX        ; XXXX means a certain value that I want to write to the unaligned memory
lock xchg YYYY, ax  ; write the value in ax to memory; YYYY means the unaligned address of the memory
pop ax[/bash]

**On the reader's side

[bash]mov ax, YYYY[/bash]

Still I have some questions:

1. "On the reader side you can leave a plain MOV..." In my case:

a  Writer's side                 | Reader's side
b                                | read lower part
c  lock the bus                  |
d  write lower part              |
e  write higher part             |
f  automatically unlock the bus  |
g                                | read higher part

What will happen in step c?

Will step c wait until step g on the reader's side finishes?

Or, if step c succeeds immediately, will the reader's side run into the bug? I think it will.


2. Though the LOCK prefix has no effect on the MOV instruction, I think the MOV instruction also needs to lock the bus.

In my view, all instructions that read or write the same unaligned memory should lock the bus, in order to make them atomic.

Is that right or not?

Dmitry_Vyukov
Valued Contributor I
Quoting gangti

2. Though the LOCK prefix has no effect on the MOV instruction, I think the MOV instruction also needs to lock the bus.

In my view, all instructions that read or write the same unaligned memory should lock the bus, in order to make them atomic.

Is that right or not?


I think you are right; it was a premature optimization on my side.
If correctness is the only concern, then use LOCK XCHG for both reader and writer. It should work 100% of the time.

gangti
Beginner

Dear Jim Dempsey
Thank you very much; that sounds like a good idea. I will take it as a possible solution and try to find the best one.

jimdempseyatthecove
Honored Contributor III
LOCK will affect MOV when the stored/retrieved data spans a cache line and/or natural boundary.

consider:

[bash]; coded for 32-bit pointers where the shared pointer may be unaligned
; the shared pointer holds 0xFFFFFFFF prior to the store of a new pointer
; any address of the form 0xFFxxxxxx is invalid (IOW a high byte of 0xFF
; indicates an invalid pointer)
; only one producer and one consumer of the pointer

mov edx, addressOfSharedPointer
mov eax, newPointer
mov [edx], ax   ; store lsw
rcr ax,16
mov [edx+2], al ; store 3rd byte
mov [edx+3], ah ; store 4th byte (overwrite 0xFF of 0xFFxxxxxx)

-----------------------------------------------------------

; read
mov edx, addressOfSharedPointer
loop:
mov eax, [edx]	; collect all 4 bytes
cmp eax, 0xFF000000
jae loop

===============================
; or

; read
mov edx, addressOfSharedPointer
loop:
mov eax, [edx]	; collect all 4 bytes
cmp eax, 0xFF000000
jb toReturn
pause
jmp loop
toReturn:
ret[/bash]

The LOCK, although functionally correct, is an expensive operation.
If this pointer manipulation is infrequent, then use the LOCK.
If the pointer manipulation is heavily used, then experiment with code similar to the above.
You can also modify the write of the pointer to test whether the pointer is DWORD or WORD aligned:
if on an odd byte address, use the code listed first above;
if on a DWORD boundary, simply write the data as a DWORD;
if on a WORD boundary (the only case left), write it as WORDs (or as the low WORD followed by a DWORD).

Jim Dempsey
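To make the alignment test above concrete, here is a C sketch (mine, not Jim's code) of an alignment-dependent store, assuming a 32-bit build, a single writer, and Jim's convention that an unset slot holds 0xFFFFFFFF and that readers reject and retry any value whose high byte is 0xFF. The names are hypothetical, and volatile is used only to keep the compiler from reordering or merging the stores.

[cpp]void StorePointer(void **slot, void *newValue)
{
    volatile unsigned char *dst = (volatile unsigned char *)slot;
    unsigned long v = (unsigned long)newValue;

    switch ((unsigned long)slot & 3)
    {
    case 0:   /* DWORD aligned: a single store suffices */
        *(volatile unsigned long *)dst = v;
        break;
    case 2:   /* WORD aligned: low word first, then the high word holding the 0xFF marker byte */
        *(volatile unsigned short *)dst = (unsigned short)v;
        *(volatile unsigned short *)(dst + 2) = (unsigned short)(v >> 16);
        break;
    default:  /* odd address: low word, then byte 3, then byte 4 (the 0xFF marker) last */
        *(volatile unsigned short *)dst = (unsigned short)v;
        dst[2] = (unsigned char)(v >> 16);
        dst[3] = (unsigned char)(v >> 24);
        break;
    }
}[/cpp]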
gangti
Beginner
I think you are right; it was a premature optimization on my side.
If correctness is the only concern, then use LOCK XCHG for both reader and writer. It should work 100% of the time.

Well, that is just what I wanted to ask you: HOW DO I USE LOCK XCHG FOR THE READER?
Thanks for your help.
gangti
Beginner
Dear Jim Dempsey,
Your code will definitely solve the problem. Thank you very much; it really helps.
jimdempseyatthecove
Honored Contributor III
** error in code sample

rcr ax, 16

should read

ror eax, 16

I assume you caught this typing error.

Jim

jimdempseyatthecove
Honored Contributor III
Gangti,

For others following this thread, would you be so kind as to run a performance test of your application using both the LOCK method and the method outlined in my sketch? The readers may find your report useful in determining whether they should go to a little extra effort to produce faster code.

Jim Dempsey
gangti
Beginner
If possible, I will post the performance test results in this thread.
:-( It's midnight now in China...
Dmitry_Vyukov
Valued Contributor I
Quoting gangti
Well, that is just what I wanted to ask you: HOW DO I USE LOCK XCHG FOR THE READER?
Thanks for your help.

Ah, sorry. For the reader you must use LOCK CMPXCHG. Try to change the variable from 0 to 0. In either case the variable is left physically unchanged, and you get the current value.
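A minimal sketch of that read, assuming a 32-bit MSVC build; the names are mine. CMPXCHG compares EAX with the memory operand, and afterwards EAX holds the value that was in memory whether or not the comparison matched; comparing 0 against 0 and offering 0 as the replacement guarantees the memory is never actually changed.

[cpp]/* Sketch of a reader using LOCK CMPXCHG; 32-bit MSVC inline assembly assumed,
   names hypothetical. */
void *ReadPointer(void **slot)
{
    void *result;
    __asm {
        mov edx, slot            ; edx = address of the (possibly unaligned) pointer
        xor eax, eax             ; comparand = 0
        xor ecx, ecx             ; replacement value = 0, so memory is never modified
        lock cmpxchg [edx], ecx  ; afterwards eax holds the current value of [edx]
        mov result, eax
    }
    return result;
}[/cpp]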

Dmitry_Vyukov
Valued Contributor I
Btw, can't it be that the variable is 1- or 3-byte aligned (or otherwise misaligned), and not just 2-byte aligned? If so, and if you decide to use plain MOVs (as Jim suggested), you must handle those cases too.
gangti
Beginner

Dmitriy Vyukov,
Thanks, that really works.
gangti
Beginner
Dmitriy Vyukov,
In my case, the pointer is a member of a big struct that is packed to one-byte alignment.
Currently, the address of the pointer is 2-byte aligned.
Someday, when we add some other members before the pointer in the struct,
the address of the pointer may become 1-byte aligned.

Well, the current condition may not be the worst case.
Grant_H_Intel
Employee

Then, to be truly portable and not have nightmares maintaining your code, I strongly recommend always using "lock cmpxchg" (for read) with "lock xchg" (for write) instead of using the 0xFFFF or 0xFF tricks. (BTW, I'm not sure the 0xFFFF trick will work with all memory allocation schemes anyway, especially since members of structures are not naturally aligned in your application.)

- Grant

Dmitry_Vyukov
Valued Contributor I

Then, to be truly portable and not have nightmares maintaining your code, I strongly recommend always using "lock cmpxchg" (for read) with "lock xchg" (for write) instead of using the 0xFFFF or 0xFF tricks.

I would recommend either staying single-threaded and not bearing all the complexities of concurrent software, or at least getting some performance benefit from the concurrent software. Plunging into concurrent software and then finding yourself slower than the single-threaded version looks quite strange. A load via LOCK CMPXCHG can be 10000 times slower than a MOV.

jimdempseyatthecove
Honored Contributor III
Grant,

The problem arises when a cache line boundary splits the DWORD. This may occur at the comma in

0xFFFFFF,FF (least significant byte in the lower cache-aligned address)
0xFFFF,FFFF (least significant 2 bytes in the lower cache-aligned address)
0xFF,FFFFFF (least significant 3 bytes in the lower cache-aligned address)
0xFFFFFFFF (DWORD not split between cache lines)

The LOCK prefix, depending on the processor model, costs you 100x to 500x the overhead of a write to naturally aligned variables (also contained within 1 cache line).

If performance matters, then consider safe alternatives that bypass LOCK.

Note, the triple write:

write the word containing the 2 lowest bytes
write the byte containing byte 3 of the DWORD
write the byte containing byte 4 of the DWORD

may (depending on the processor model), I said may, occur as a single write to memory, due to processor write combining, when the DWORD is fully contained within a single cache line, and as 2 writes when split across cache lines. In either of the two circumstances (split or not split), the high byte will be written in the last write (which may be the only write). Without testing the code, my estimate is 50x to 500x the performance of the LOCK-prefixed code.

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I

The LOCK prefix, depending on the processor model, costs you 100x to 500x the overhead of a write to naturally aligned variables (also contained within 1 cache line).


Jim, the problem is not with writes; they are not scalable in either case.

The problem is with reads. A program can perform 1 read per 0.5 cycles per *thread* if implemented with MOV, or 1 read per 100-1000 cycles per *system* if implemented with CMPXCHG. The worst thing one can do in a concurrent program is to turn a perfectly scalable read operation into a completely non-scalable write operation (a rebuke of rw mutexes).

jimdempseyatthecove
Honored Contributor III
Jim, the problem is not with writes; they are not scalable in either case.

Depending on the cache architecture, the writes are somewhat scalable.

With the same 64-byte memory node (a cache-line-sized memory node) resident in multiple caches of different cores, some cache designs will insert naturally aligned 1-, 2-, 4-, or 8-byte objects into each core's cache line without causing an eviction in the other cache(s). As long as another processor/core doesn't cause a cache line eviction on your core, your writes will scale.

I think it is time to stop the speculative discussion and write some code. We can run this on various processor models and see what comes out.

Jim