- First question
Recently I ran into a bug that occurs only rarely.
**Environment:
1. The address of a pointer (called pMemory) is misaligned.
2. Two threads access pMemory simultaneously.
3. Our program runs on a server with 8 CPUs.
4. The original value of pMemory is 0xFFFF FFFF.
**Operation sequence:
1. One thread reads the value of pMemory while the other thread modifies pMemory.
Both the read and the modification are MOV instructions.
2. The first thread first reads the lower part of pMemory, which is 0xFFFF.
3. The second thread modifies pMemory from 0xFFFF FFFF to 0x02de 2c68.
4. The first thread then reads the higher part of pMemory, which is 0x02de,
so the first thread ends up reading pMemory as 0x02de ffff, which is an invalid pointer.
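To make the sequence above concrete, here is a small C sketch that only does the arithmetic of the torn read: it combines the low half of the old value with the high half of the new value. The function name `torn_read` is made up for illustration, and nothing here is concurrent; it simply reproduces the value the reader ends up with.

```c
#include <stdint.h>

/* Combine the low 16 bits seen before the write with the high 16 bits
   seen after it, as in steps 2 and 4 of the sequence above. */
static uint32_t torn_read(uint32_t old_value, uint32_t new_value)
{
    uint32_t low  = old_value & 0x0000FFFFu;  /* step 2: low half, still old   */
    uint32_t high = new_value & 0xFFFF0000u;  /* step 4: high half, already new */
    return high | low;
}
```

Calling `torn_read(0xFFFFFFFFu, 0x02DE2C68u)` yields `0x02DEFFFF`, the invalid pointer described in step 4.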
We are currently discussing how to solve the problem.
Do you have any suggestions?
I don't have much time, so would you please reply as soon as possible.
BTW, our program is a network program, so memory is designed to be aligned on one byte with compiler options such as /Zp1.
Given the risk and the workload, it is impossible for us to change /Zp1 to natural alignment.
- Second question
Intel 64 and IA-32 Architectures
Software Developers Manual
Volume 3A:
System Programming Guide, Part 1
8.1.1 Guaranteed Atomic Operations
The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon,
and P6 family processors provide bus control signals that permit external memory
subsystems to make split accesses atomic;
however, nonaligned data accesses will seriously impact the performance of the
processor and should be avoided.
Would you please explain in detail how the processors
"provide bus control signals that permit external memory subsystems to make split accesses atomic"?
You can change this test to check for either the high word or the low word being equal to 0xFFFF.
Any valid pointer you insert into this DWORD will likely not have 0xFFFF in either word. Allocations are generally aligned to 8 bytes (or more), so the low word of the pointer should never be 0xFFFF. Also, 0xFFFF in the high word points to the last few pages of virtual memory (OS space in Windows, or potentially stack addresses in *ux). You should look at what you place into this pointer to verify my assumption.
Aligning the pointer to a DWORD address (32-bit system) or QWORD address (64-bit system), as Dmitriy suggests, would ensure that writes occur in one operation (except possibly when the pointer itself resides in I/O space).
Jim Dempsey
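The word test described above could be sketched in C roughly as follows. The function name `looks_torn` is hypothetical, and the check is only meaningful under the assumptions stated in the post: the slot starts out as 0xFFFFFFFF and no valid pointer ever has 0xFFFF in either 16-bit half.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when either 16-bit half still holds the 0xFFFF marker,
   i.e. the value is the initial sentinel or a partially updated pointer.
   Assumes no valid pointer has 0xFFFF in its low or high word. */
static bool looks_torn(uint32_t value)
{
    return (value & 0x0000FFFFu) == 0x0000FFFFu ||
           (value >> 16) == 0xFFFFu;
}
```

A reader would retry its load while `looks_torn` returns true, so the value 0x02DEFFFF from the original report would be rejected rather than dereferenced.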
I really appreciate your suggestions.
Following your advice, I came up with a solution to the problem. Is it correct?
;---------------------------------------
**On Writer's side
push ax
mov ax, XXXX ; XXXX is the value that I want to write to the unaligned memory
lock xchg YYYY, ax ; write the value in ax to memory; YYYY is the unaligned address of the memory
pop ax
**On Reader's side
mov ax, YYYY
;---------------------------------------
I still have some questions:
1. "on reader side you can leave plain MOV...". In my case:
;---------------------------------------
a  Writer's side                 | Reader's side
b                                | read lower part
c  lock the bus                  |
d  write lower part              |
e  write higher part             |
f  automatically unlock the bus  |
g                                | read higher part
;---------------------------------------
What will happen in step c?
Will step c wait until step g on the reader's side finishes?
Or, if step c succeeds immediately, will the reader's side run into the bug? I think so.
2. Though the "LOCK" prefix has no effect on the "MOV" instruction, I think the "MOV" instruction also needs to lock the bus.
From my point of view, all instructions that read or write the same unaligned memory should lock the bus in order to make them atomic.
Is that right?
2. Though the "LOCK" prefix has no effect on the "MOV" instruction, I think the "MOV" instruction also needs to lock the bus.
From my point of view, all instructions that read or write the same unaligned memory should lock the bus in order to make them atomic.
Is that right?
I think you are right, it was premature optimization on my side.
If correctness is the only concern, then use LOCK XCHG for both reader and writer. It should 100% work.
Dear Jim Dempsey
Thank you very much, that sounds like a good idea. I will take it as a possible solution and try to find the best one.
Consider:
[bash]
; coded for 32-bit pointers where shared pointer may be unaligned
; shared pointer has 0xFFFFFFFF prior to store of new pointer
; address 0xFFxxxxxx is invalid (IOW
; indicating invalid pointer
; only one producer and one consumer of pointer
    mov edx, addressOfSharedPointer
    mov eax, newPointer
    mov [edx], ax        ; store lsw
    rcr ax,16
    mov [edx+2], al      ; store 3rd byte
    mov [edx+3], ah      ; store 4th byte (overwrite 0xFF of 0xFFxxxxxx)
-----------------------------------------------------------
; read
    mov edx, addressOfSharedPointer
loop:
    mov eax, [edx]       ; collect all 4 bytes
    cmp eax, 0xFF000000
    jae loop
===============================
or
; read
    mov edx, addressOfSharedPointer
loop:
    mov eax, [edx]       ; collect all 4 bytes
    cmp eax, 0xFF000000
    jb toReturn
    pause
    jmp loop
toReturn:
    ret
[/bash]
The LOCK, although functionally correct, is an expensive operation.
If this pointer manipulation is infrequent, then use the LOCK.
If the pointer manipulation is heavily used, then experiment with code similar to the above.
You can also modify the write of the pointer to test whether the pointer is DWORD or WORD aligned:
If on an odd byte address, use the code first listed above,
If on DWORD, simply write the data as a DWORD,
If on WORD (the only thing left), write as WORDs (or as the low WORD followed by a DWORD)
Jim Dempsey
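For readers who prefer C, the same idea might be sketched as below. This is a variation rather than a translation: the writer stores the high byte last, as in the assembly, but the reader checks the high byte first and only then reads the remaining bytes, instead of using a single `mov eax, [edx]`. The names `publish_pointer`, `try_read_pointer`, and `INVALID_HIGH_BYTE` are invented for illustration. The sketch assumes one producer, one consumer, a slot that starts as 0xFFFFFFFF, and x86 ordering of stores and loads; `volatile` only keeps the compiler from reordering the byte accesses and is not a portable synchronization mechanism.

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_HIGH_BYTE 0xFFu  /* 0xFFxxxxxx marks "no valid pointer yet" */

/* Writer: store the three low bytes first and the high byte last, so the
   0xFF marker in the high byte is the last thing to be overwritten. */
static void publish_pointer(volatile uint8_t *slot, uint32_t new_pointer)
{
    slot[0] = (uint8_t)(new_pointer);
    slot[1] = (uint8_t)(new_pointer >> 8);
    slot[2] = (uint8_t)(new_pointer >> 16);
    slot[3] = (uint8_t)(new_pointer >> 24);  /* overwrites the 0xFF marker */
}

/* Reader: a single attempt. Returns false while the marker is still there;
   the caller is expected to retry (for example with a pause in between). */
static bool try_read_pointer(const volatile uint8_t *slot, uint32_t *out)
{
    uint8_t high = slot[3];                  /* read the marker byte first */
    if (high == INVALID_HIGH_BYTE)
        return false;
    *out = (uint32_t)slot[0]
         | (uint32_t)slot[1] << 8
         | (uint32_t)slot[2] << 16
         | (uint32_t)high << 24;
    return true;
}
```

Because every access is a single byte, the slot may sit at any address, including one produced by /Zp1 packing. Whether this is actually faster than the LOCK variants would still have to be measured.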
If correctness is the only concern, then use LOCK XCHG for both reader and writer. It should 100% work.
Well, that is just what I wanted to ask you: how do I use LOCK XCHG for the reader?
thanks for your help
Your code will definitely solve the problem, thank you very much. It really helps.
rcr ax, 16
should read
ror eax, 16
I assume you caught this typing error.
Jim
For others following this thread, would you be so kind as to run a performance test of your application using both the LOCK method and the method outlined in my sketch? Readers may find your report useful in deciding whether to go to a little extra effort to produce faster code.
Jim Dempsey
:-( it's midnight now in China...
thanks for your help
Ah, sorry. For the reader you must use LOCK CMPXCHG. Try to change the variable from 0 to 0. Whether or not the comparison succeeds, the variable is left physically unchanged, and you get the current value.
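The "change from 0 to 0" idea can be sketched with C11 atomics as follows. The function names `load_via_cas` and `store_via_xchg` are made up for illustration. One important caveat: an `_Atomic uint32_t` is naturally aligned, so this only demonstrates the compare-exchange trick itself; applying it to a misaligned /Zp1 member would need compiler intrinsics or inline assembly, which are not shown here.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Read by attempting to replace 0 with 0. If the variable holds 0, the
   exchange succeeds and stores 0 again; otherwise it fails and copies the
   current value into 'expected'. Either way the variable is unchanged and
   'expected' ends up holding its current value. */
static uint32_t load_via_cas(_Atomic uint32_t *shared)
{
    uint32_t expected = 0;
    atomic_compare_exchange_strong(shared, &expected, 0);
    return expected;
}

/* Write with an atomic exchange, discarding the previous value. */
static void store_via_xchg(_Atomic uint32_t *shared, uint32_t value)
{
    (void)atomic_exchange(shared, value);
}
```

On x86 compilers, these calls are typically lowered to `lock cmpxchg` and `xchg`, which matches the instructions recommended in the thread.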
Dmitriy Vyukov
thanks, that really works
In my case, the pointer is a member of a big struct that is one-byte aligned.
Currently, the address of the pointer is 2-byte aligned.
Someday, when we add other members before the pointer in the struct,
the address of the pointer may become only 1-byte aligned.
So the current condition may not be the worst case.
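The effect described above can be illustrated with a pair of packed structs. The struct and member names below are invented for illustration, and the pointer is modeled as a `uint32_t` to stand in for a 32-bit pointer. `#pragma pack(push, 1)` is used here as the in-source equivalent of compiling with /Zp1.

```c
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)            /* same effect as /Zp1 for these structs */

/* Today: the pointer lands on a 2-byte boundary. */
struct message_v1 {
    uint16_t length;
    uint32_t pMemory;            /* offset 2 */
};

/* After adding a one-byte member in front of it (hypothetical change):
   the pointer now lands on an odd offset. */
struct message_v2 {
    uint16_t length;
    uint8_t  flags;              /* hypothetical new member */
    uint32_t pMemory;            /* offset 3 */
};

#pragma pack(pop)
```

With a 4-byte-aligned struct base, `offsetof(struct message_v1, pMemory)` is 2 and `offsetof(struct message_v2, pMemory)` is 3, so any scheme that relies on the pointer being WORD aligned would stop working after such a change.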
Then, to be truly portable and not have nightmares maintaining your code, I strongly recommend always using "lock cmpxchg" (for read) with "lock xchg" (for write) instead of the 0xFFFF or 0xFF tricks. (BTW, I'm not sure the 0xFFFF trick will work with all memory allocation schemes anyway, especially since members of structures are not naturally aligned in your application.)
- Grant
Then, to be truly portable and not have nightmares maintaining your code, I strongly recommend always using "lock cmpxchg" (for read) with "lock xchg" (for write) instead of the 0xFFFF or 0xFF tricks.
I would recommend either staying single-threaded and avoiding the complexities of concurrent software altogether, or at least getting some performance benefit from the concurrency. Plunging into concurrent software only to end up slower than the single-threaded version would be quite strange. A load via LOCK CMPXCHG can be 10000 times slower than a MOV.
The problem arises when a cache line boundary splits the DWORD. This may occur at the comma in
0xFFFFFF,FF (least significant byte in the lower cache line)
0xFFFF,FFFF (least significant 2 bytes in the lower cache line)
0xFF,FFFFFF (least significant 3 bytes in the lower cache line)
0xFFFFFFFF (DWORD not split between cache lines)
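Whether a given address falls into one of the split cases listed above can be checked with a few lines of C. The function name `crosses_cache_line` is hypothetical, and the 64-byte line size is an assumption that should be confirmed for the target processor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumed line size; verify for your CPU */

/* Returns true when an object of 'size' bytes starting at 'address'
   extends past the end of the cache line it starts in. */
static bool crosses_cache_line(uintptr_t address, size_t size)
{
    return (address % CACHE_LINE_SIZE) + size > CACHE_LINE_SIZE;
}
```

For a DWORD, offsets 61, 62, and 63 within a line correspond to the three split cases in the list, while offsets 0 through 60 keep the value within a single line.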
Depending on the processor model, the LOCK prefix costs you 100x to 500x the overhead of a write to a naturally aligned variable (one that is also contained within a single cache line).
If performance matters, then consider safe alternatives that bypass LOCK.
Note, the triple write:
write word containing the 2 lowest bytes
write byte containing byte 3 of the DWORD
write byte containing byte 4 of the DWORD
Depending on the processor model, write combining may (and I stress may) turn these into a single write to memory when the DWORD is fully contained within one cache line, and into two writes when it is split across cache lines. In either circumstance (split or not split), the high byte will be written in the last write (which may be the only write). Without having tested the code, my estimate is 50x to 500x the performance of the LOCK-prefixed code.
Jim Dempsey
Depending on the processor model, the LOCK prefix costs you 100x to 500x the overhead of a write to a naturally aligned variable (one that is also contained within a single cache line).
Jim, the problem is not with writes; they are not scalable in either case.
The problem is with reads. A program can perform 1 read per 0.5 cycles per *thread* if implemented with MOV, or 1 read per 100-1000 cycles per *system* if implemented with CMPXCHG. The worst thing one can do in a concurrent program is to turn a perfectly scalable read operation into a completely non-scalable write operation (a rebuke of rw mutexes).
Depending on the cache architecture, the writes are somewhat scalable.
With the same 64-byte memory node (cache-line-sized memory node) resident in multiple caches of different cores, some cache designs will insert naturally aligned 1-, 2-, 4-, or 8-byte objects into each core's cache line without causing an eviction in the other cache(s). As long as another processor/core doesn't cause a cache line eviction on your core, your writes will scale.
I think it is time to stop the speculative discussion and write some code. We can run this on various processor models and see what comes out.
Jim