gangti
Beginner

How to fix a bug caused by simultaneous misaligned memory accesses

  • First Question

Recently, I encountered a bug that happens only rarely.

**Environment:**

1. The address of a pointer (call it pMemory) is misaligned.

2. Two threads simultaneously access pMemory.

3. Our program runs on a server with 8 CPUs.

4. The original value of pMemory is 0xFFFFFFFF.

**Operation Sequence:**

1. One thread reads the value of pMemory while the other thread modifies it.

Both the read and the write are plain MOV instructions.

2. The first thread reads the lower half of pMemory first, i.e. 0xFFFF.

3. The second thread modifies pMemory from 0xFFFFFFFF to 0x02de2c68.

4. The first thread then reads the higher half of pMemory, i.e. 0x02de,

and so the first thread ends up reading pMemory as 0x02deffff, which is an invalid pointer.

Currently we are discussing the way to solve the problem.

Do you have any suggestion?

I don't have much time, so would you please reply as soon as possible.

BTW, our program is a network program, so the memory is deliberately packed on one-byte boundaries with compiler options such as /Zp1.
Given the risks and workload involved, it is impossible for us to change /Zp1 to natural alignment.


  • Second Question

Intel 64 and IA-32 Architectures

Software Developers Manual

Volume 3A:

System Programming Guide, Part 1

8.1.1 Guaranteed Atomic Operations

The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon,

and P6 family processors provide bus control signals that permit external memory

subsystems to make split accesses atomic;

however, nonaligned data accesses will seriously impact the performance of the

processor and should be avoided.


Would you please explain in detail what is meant by:

"provide bus control signals that permit external memory subsystems to make split accesses atomic"

22 Replies
jimdempseyatthecove
Black Belt

>>Jim, the problem is not with writes, they are not scalable in either case.

Depending on the cache architecture, the writes are somewhat scalable.

With the same 64-byte memory node (cache-line-sized memory node) resident in multiple caches of different cores, some cache designs will insert naturally aligned 1-, 2-, 4-, or 8-byte objects into each core's cache line without causing an eviction in the other cache(s). As long as another processor/core doesn't cause a cache-line eviction on your core, your writes will scale.

I think it is time to stop the speculative discussion and write some code. We can run this on various processor models and see what comes out.

Jim
Grant_H_Intel
Employee

Jim,

We can all write microbenchmarks, but ultimately that won't help the author of this post one bit. What matters is the actual code being run. If the ratio of synchronized to non-synchronized memory accesses is low enough, then locked instructions won't affect scalability that much. If the ratio is high, then tricks may need to be played to avoid the locked instructions and regain scalability.

Each programmer has to make the portability/usability vs. performance tradeoff themselves, given their particular program. That is the bottom line for the author who initiated this post. Without the results of the author's experiments on their own code, there is nothing we can tell the author that is guaranteed to be the right decision, because we cannot evaluate the tradeoff without much more data.

- Grant
jimdempseyatthecove
Black Belt

>>If the ratio of synchronized memory access to non-synchronized memory accesses is low enough, then locked instructions won't affect scalability that much. If the ratio is high, then tricks may need to be played to avoid the locked instructions to get scalability.

That is what I said in my earlier post.

FWIW, I ran a test on a Q6600.

Tests were run in both 32-bit and 64-bit mode, with 32-bit and 64-bit pointers being written respectively.
A character buffer was aligned to a 4096-byte boundary, with size 8192 + sizeof(void*).
Timings were made for storing a void* at character offsets ranging from 0 to 8192-1.
For each of the 8192 offsets, 100000 stores were performed and an rdtsc count was taken for the loop.
This count was converted to double and divided by the number of stores (to get the average time in ticks per store).
5 test runs were made, and the lowest value for each offset was kept.
Then the sum of the best samples across all 8192 offsets was reported.

Q6600 32-bit pointers

Test1 total = 16400.1 (using simple store of pointer)

Test2 (w/LOCK) total = 374715 (22.8484x)

Test3 (multi-write) total = 28346.4 (1.72844x)


Q6600 64-bit pointers

Test1 total = 16927.5

Test2 (w/LOCK) total = 674978 (39.8746x)

Test3 (multi-write) total = 40752.6 (2.40748x)

Test3 on the 32-bit system performs:

store of short (low 2 bytes)
store of char (3rd byte)
store of char (4th byte) - the byte containing 0xFF, overwritten last

Test3 on the 64-bit system performs:

store of long (low 4 bytes)
store of short (bytes 5 and 6)
store of char (byte 7)
store of char (8th byte) - the byte containing 0xFF, overwritten last

32-bit shows 13.22x improvement using multi-store technique over LOCKed technique
64-bit shows 16.56x improvement using multi-store technique over LOCKed technique

Note, the test code was all C++ (no ASM), so it is completely portable.

Jim Dempsey